1 Introduction

A wealth of public personal information about celebrities is available on the Internet, e.g. resumes, personal profiles and personal homepages, and it may imply social relationships among them. For example, two people may be schoolmates if they studied at the same university during overlapping time periods. This information can be organized as a social network to support community discovery, identification of the most influential nodes, and other research. Compared with traditional social networks, such a network has several distinguishing characteristics. First, its links may represent various types of relationships (e.g. schoolmates and colleagues) rather than a single homogeneous relationship. Second, the network is more realistic, because its links are deduced from people's factual experiences instead of from user interaction data in a social application. Third, the formation of a link is sensitive to time, as in the schoolmate example above.

The construction of such a social network can be viewed as a two-step process. First, we build a relational network by extracting events from personal documents; its nodes represent the main entities in the documents, including the person, the organizations he or she belongs to, the locations of these organizations, etc. Then the social network can be regarded as a view of the relational network obtained by predicting links between arbitrary person-person pairs. The challenges in implementing such a system can be summarized as follows: (1) the information unit in a personal document is an event rather than a binary relation, which is more complicated to extract; (2) knowledge must be inferred properly on a network with heterogeneous nodes and links. KEIPD employs a tree-similarity based approach to extract events and applies unsupervised predictors for link prediction. The system is built on a mature graph database.

2 System Overview

Figure 1 shows the overview of KEIPD. We will introduce the details of the information extraction module and knowledge inference module in this section.

Fig. 1. System overview

Table 1. The template of membership events

2.1 Information Extraction

According to the classical entity types proposed in the Message Understanding Conference (MUC) and the characteristics of personal documents, we consider three entity types here: Person, Organization and Location.

Event Template. Take resume documents as an example. A resume lists several fixed classes of events in chronological order, of which the most typical is membership, as shown in Table 1. Events in the same class share a common predefined template.

Tree-similarity based method. We adapt the method from [3] to perform event extraction. It assumes that sentences describing the same type of event have similar parse tree structures. First, we use an integrated natural language processing tool, LTP [1], to preprocess the text with Named Entity Recognition (NER) and Dependency Parsing. The results of these two tasks are merged into an NE-tagged parse tree, in which the key attributes of each node are the word, its part-of-speech tag, its NER label and its dependency relation.
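The merging step can be sketched as follows. This is a minimal illustration, not the system's actual code: the `Node` class, the `build_tree` helper, the per-token input lists and the tag values in the example are all assumptions made here, and no LTP call is shown. The sketch only assumes that the preprocessing yields, for each token, a word, a part-of-speech tag, an NE label, a head index (1-based, 0 for the root) and a dependency relation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of an NE-tagged parse tree (hypothetical structure)."""
    word: str
    pos: str      # part-of-speech tag
    ne: str       # named-entity label, "O" for non-entities
    relate: str   # dependency relation to the head
    children: List["Node"] = field(default_factory=list)


def build_tree(words, pos_tags, ne_tags, heads, relations) -> Node:
    """Merge per-token NER and dependency output into a single tree.

    heads[i] is the 1-based index of token i's head; 0 marks the root.
    """
    nodes = [Node(w, p, n, r)
             for w, p, n, r in zip(words, pos_tags, ne_tags, relations)]
    root: Optional[Node] = None
    for node, head in zip(nodes, heads):
        if head == 0:
            root = node
        else:
            # Tokens are visited in order, so children keep sentence order.
            nodes[head - 1].children.append(node)
    if root is None:
        raise ValueError("no root token found")
    return root


# Illustrative input; the tag values are placeholders, not LTP output.
tree = build_tree(
    words=["Li", "joined", "University X"],
    pos_tags=["nh", "v", "ni"],
    ne_tags=["PER", "O", "ORG"],
    heads=[2, 0, 2],
    relations=["SBV", "HED", "VOB"],
)
```

Keeping all four attributes on each node means that the match and node-similarity functions introduced later can read them directly.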

Then, the parse trees are clustered with the tree-similarity function:

$$\begin{aligned} K(T_1,T_2)=m(r_1,r_2)\cdot \bigl(s(r_1,r_2)+K_c(r_1[\mathbf{c}],r_2[\mathbf{c}])\bigr) \end{aligned}$$
(1)

where

$$\begin{aligned} K_c(p_1[\mathbf{c}],p_2[\mathbf{c}])=\max _{\mathbf{a},\mathbf{b}} K(p_1[\mathbf{a}],p_2[\mathbf{b}]) \end{aligned}$$
(2)
$$\begin{aligned} K(p_1[\mathbf{a}],p_2[\mathbf{b}])=\sum _{i=1}^{l} K(p_1[\mathbf{a}_i],p_2[\mathbf{b}_i]) \end{aligned}$$
(3)

Here, \(T_1\) and \(T_2\) are two trees whose root nodes are \(r_1\) and \(r_2\), and \(r[\mathbf{c}]\) denotes the child nodes of \(r\). Equation (3) is the similarity function over two arbitrary child-node sequences \(p_1[\mathbf{a}]\) and \(p_2[\mathbf{b}]\) of equal length \(l\). Due to space limitations, we refer the reader to [3] for more details.

We adjust the match function \(m(r_1,r_2)\) and the node-similarity function \(s(r_1,r_2)\) to suit our data:

$$\begin{aligned} m(p_i,p_j)= {\left\{ \begin{array}{ll} 1 &{} p_i.relate=p_j.relate \\ 0 &{} \text{otherwise} \end{array}\right. } \end{aligned}$$
(4)
$$\begin{aligned} s(p_i,p_j)= {\left\{ \begin{array}{ll} 0.2 &{} p_i.ne \ne p_j.ne\\ 0.5 &{} p_i.ne=p_j.ne=O,\ p_i.pos \ne p_j.pos\\ 0.8 &{} p_i.ne=p_j.ne=O,\ p_i.pos=p_j.pos\\ 1.0 &{} p_i.ne=p_j.ne \ne O \end{array}\right. } \end{aligned}$$
(5)

The weights in Eq. (5) are assigned empirically according to the discriminative ability of the feature types.

The similarity is computed bottom-up, from the leaf nodes to the root, with a dynamic programming algorithm. We then manually summarize syntactic rules for each cluster to fill the corresponding event template.
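One way to read Eqs. (1)-(5) in code is sketched below. It rests on several assumptions that are not confirmed by the text: the match function returns 1 when two nodes share a dependency relation, so that mismatched roots yield zero similarity; Eq. (2) takes the maximum value over order-preserving child subsequences of equal length; and the names `Node`, `match`, `node_sim`, `children_sim` and `tree_sim` are hypothetical. Under these assumptions, the maximization over child subsequences reduces to a weighted alignment that can be solved with a small dynamic programming table.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Node:
    ne: str       # named-entity label, "O" for non-entities
    pos: str      # part-of-speech tag
    relate: str   # dependency relation to the head
    children: List["Node"] = field(default_factory=list)


def match(a: Node, b: Node) -> int:
    """Eq. (4), assumed to be 1 when the dependency relations agree."""
    return 1 if a.relate == b.relate else 0


def node_sim(a: Node, b: Node) -> float:
    """Eq. (5): empirical weights on NE label and part-of-speech agreement."""
    if a.ne != b.ne:
        return 0.2
    if a.ne == "O":
        return 0.8 if a.pos == b.pos else 0.5
    return 1.0


def children_sim(c1: List[Node], c2: List[Node]) -> float:
    """Eqs. (2)-(3): best total similarity over aligned child subsequences."""
    n, m = len(c1), len(c2)
    # best[i][j] is the best alignment score of c1[:i] against c2[:j].
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],                       # skip c1[i-1]
                best[i][j - 1],                       # skip c2[j-1]
                best[i - 1][j - 1] + tree_sim(c1[i - 1], c2[j - 1]),
            )
    return best[n][m]


def tree_sim(t1: Node, t2: Node) -> float:
    """Eq. (1): zero if the roots do not match, else node plus children score."""
    if match(t1, t2) == 0:
        return 0.0
    return node_sim(t1, t2) + children_sim(t1.children, t2.children)


# Two small illustrative trees that differ only in one entity label.
t1 = Node("O", "v", "HED", [Node("PER", "nh", "SBV"), Node("ORG", "ni", "VOB")])
t2 = Node("O", "v", "HED", [Node("PER", "nh", "SBV"), Node("LOC", "ns", "VOB")])
score_same = tree_sim(t1, t1)
score_diff = tree_sim(t1, t2)
```

The recursion reaches the leaves first, so the scores are effectively accumulated from the leaves up to the root. Each pair of nodes is compared at most once, because every node has a single parent.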

2.2 Knowledge Inference

Online Knowledge Bases. Considering the complexity of natural language, we process the relational network with the assistance of several online knowledge bases. The hierarchical structure of Location and Organization entities is a key factor for link prediction. Therefore, we crawl an external knowledge base of fine-grained administrative regions in China, which contains more than 700K locations. For organizations, we refer to an online encyclopedia (footnote 1) to normalize their names, and we design a simple algorithm that infers the hierarchy by analyzing prefix relations between names.
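The prefix-based hierarchy inference is not specified further, so the following is only a sketch of one plausible reading: after normalization, an organization's parent is taken to be the longest other organization name that is a proper prefix of its own name. The function name `infer_hierarchy` and the example names are hypothetical.

```python
from typing import Dict, List, Optional


def infer_hierarchy(names: List[str]) -> Dict[str, Optional[str]]:
    """Map each name to its assumed parent: the longest proper-prefix name."""
    unique = sorted(set(names), key=len)
    parent: Dict[str, Optional[str]] = {}
    for name in unique:
        # Candidate ancestors are all other names that this name starts with.
        candidates = [o for o in unique if o != name and name.startswith(o)]
        parent[name] = max(candidates, key=len) if candidates else None
    return parent


# Illustrative normalized names, not taken from the system's data.
hierarchy = infer_hierarchy([
    "Ministry A",
    "Ministry A Department B",
    "Ministry A Department B Office C",
    "University X",
])
```

Choosing the longest prefix attaches each organization to its closest ancestor rather than to the top of the hierarchy.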

Link Prediction. Besides the predictors shown in Table 2, we also experiment with rooted PageRank and PropFlow; see [2] for more details. As noted in Sect. 1, the formation of a link depends strongly on time attributes, so we use the time constraint to prune the candidate entity pairs before prediction.
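The pruning step can be illustrated with a small sketch. The exact time constraint is not spelled out, so the code assumes that a person pair is kept only if some membership period of one person overlaps some membership period of the other. The scoring function uses a common-neighbors count purely as a stand-in for an unsupervised predictor; it is not claimed to be one of the predictors in Table 2. The record layout and all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# (organization, start_year, end_year); a hypothetical record layout.
Membership = Tuple[str, int, int]


def overlaps(a: Membership, b: Membership) -> bool:
    """True if the two closed year intervals share at least one year."""
    return a[1] <= b[2] and b[1] <= a[2]


def candidate_pairs(records: Dict[str, List[Membership]]) -> List[Tuple[str, str]]:
    """Keep only person pairs with at least one overlapping membership period."""
    pairs = []
    for p, q in combinations(sorted(records), 2):
        if any(overlaps(a, b) for a in records[p] for b in records[q]):
            pairs.append((p, q))
    return pairs


def common_neighbors(records: Dict[str, List[Membership]], p: str, q: str) -> int:
    """Stand-in predictor: number of organizations both people belong to."""
    return len({m[0] for m in records[p]} & {m[0] for m in records[q]})


records = {
    "A": [("University X", 1980, 1984)],
    "B": [("University X", 1982, 1986)],
    "C": [("University X", 1990, 1994)],
}
pairs = candidate_pairs(records)
scores = {pair: common_neighbors(records, *pair) for pair in pairs}
```

In this example only the pair A and B survives pruning, since C attended the same organization during a disjoint period.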

Table 2. Unsupervised predictors for link prediction

3 Demonstration Scenarios

The source data consist of about 15K personal documents of politicians crawled from People (footnote 2). The system will be demonstrated via two types of query operations:

(1) Point query. Given a specific person as the query condition, the system returns the related people together with the corresponding relationships.

(2) Path query. Given two specific people as the query conditions, the system returns either all paths or the shortest path between them.
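The two query types can be sketched on a plain in-memory graph. This is an illustration only: the actual system runs on a graph database whose query interface is not described here, and the adjacency-dictionary layout, the function names `point_query` and `shortest_path`, and the sample relationships are assumptions.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# person -> {related person: relationship label}; a hypothetical layout.
Graph = Dict[str, Dict[str, str]]


def point_query(graph: Graph, person: str) -> List[Tuple[str, str]]:
    """Return the related people with their relationship labels."""
    return sorted(graph.get(person, {}).items())


def shortest_path(graph: Graph, source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for one shortest path, or None if unreachable."""
    if source == target:
        return [source]
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, {}):
            if neighbor in previous:
                continue
            previous[neighbor] = current
            if neighbor == target:
                # Walk the predecessor links back to the source.
                path = [target]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


graph: Graph = {
    "A": {"B": "schoolmate"},
    "B": {"A": "schoolmate", "C": "colleague"},
    "C": {"B": "colleague"},
    "D": {},
}
related = point_query(graph, "B")
path = shortest_path(graph, "A", "C")
```

Breadth-first search suffices here because the links are unweighted. Enumerating all paths would require a different traversal and is omitted.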