KEIPD: Knowledge Extraction and Inference System for Personal Documents

Lv, Zhaoyang; Liu, Yuanyuan; Yu, Xiaohui

doi:10.1007/978-3-319-45817-5_72

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9932)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

Public personal documents on the Internet, such as resumes and personal homepages, may imply social relationships among people, which are of great value in various applications. This paper presents KEIPD, a system to extract and infer knowledge from personal documents. KEIPD employs a tree-similarity-based approach to extract information from personal documents and obtain a relational network of entities. The inference of social relationships can then be transformed into a link prediction problem. KEIPD implements several popular unsupervised predictors for link prediction and prunes the candidate entity pairs based on a domain-dependent constraint.

1 Introduction

There is plentiful public personal information about celebrities on the Internet, e.g. resumes, personal profiles and personal homepages, which may imply social relationships among them. For example, two people may be schoolmates if they studied at the same university during overlapping time periods. This information can be organized as a social network to support community discovery, discovery of the most influential nodes, and other research. Compared with traditional social networks, it has some distinguishing characteristics. First, links in this network may represent various types of relationships (e.g. schoolmates and colleagues) rather than homogeneous relationships. Second, the network is more realistic, since the links are deduced from people's factual experiences rather than from the interaction data of users of a social application. Third, the formation of a link is sensitive to time, as in the schoolmate example above.

The construction of such a social network can be viewed as a two-step process. First, we build a relational network by extracting events from personal documents, where nodes represent the main entities in the documents, including the person, the organizations he or she belongs to, the locations of these organizations, etc. The social network can then be regarded as a view of the relational network obtained by predicting the links between arbitrary person-person pairs. The challenges in implementing such a system can be summarized as follows: (i) the information unit in a personal document is an event rather than a binary relation, which is more complicated to extract; (ii) knowledge must be inferred properly on a network with heterogeneous nodes and links. KEIPD employs a tree-similarity-based approach to extract events. For link prediction, it applies unsupervised predictors. The system is built on a mature graph database.

2 System Overview

Figure 1 shows the overview of KEIPD. We will introduce the details of the information extraction module and knowledge inference module in this section.

Table 1. The template of membership events

2.1 Information Extraction

According to the classical entity types proposed in the Message Understanding Conference (MUC) and the characteristics of personal documents, we consider three entity types here: Person, Organization and Location.

Event Template. Take resume documents as an example. A resume lists some fixed classes of events in chronological order, among which the most typical is the membership event, as shown in Table 1. Events in the same class correspond to a common predefined template.

Tree-similarity based method. We adapt the method from [3] to perform event extraction. It is assumed that sentences describing the same type of event have similar parse tree structures. First, we use an integrated natural language processing tool, LTP [1], to preprocess the text through Named Entity Recognition (NER) and Dependency Parsing. The results of these two tasks are merged into an NE-tagged parse tree, where the key attributes of each node are the word, its part-of-speech tag, its NER label and its dependency relation.
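The merging step can be illustrated with a minimal Python sketch. It does not call LTP; it assumes the per-token outputs (words, part-of-speech tags, NER labels, head indices and dependency relations) are already available as plain lists. The `Node` class, the `build_tree` helper, the tag names and the example sentence are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One token of an NE-tagged parse tree (hypothetical structure)."""
    word: str
    pos: str      # part-of-speech tag
    ne: str       # named-entity label, "O" for non-entities
    relate: str   # dependency relation to the head token
    children: List["Node"] = field(default_factory=list)


def build_tree(words, pos_tags, ne_tags, heads, relations) -> Node:
    """Merge per-token annotations into a tree.

    heads[i] is the 1-based index of token i's head; 0 marks the root.
    """
    nodes = [Node(w, p, n, r)
             for w, p, n, r in zip(words, pos_tags, ne_tags, relations)]
    root = None
    for i, head in enumerate(heads):
        if head == 0:
            root = nodes[i]
        else:
            nodes[head - 1].children.append(nodes[i])
    if root is None:
        raise ValueError("no root token found")
    return root


# Hand-written, illustrative annotations (not real tool output).
words = ["Alice", "joined", "Acme", "in", "2001"]
pos_tags = ["NOUN", "VERB", "NOUN", "PREP", "NUM"]
ne_tags = ["PER", "O", "ORG", "O", "O"]
heads = [2, 0, 2, 5, 2]
relations = ["subject", "root", "object", "case", "time"]

tree = build_tree(words, pos_tags, ne_tags, heads, relations)
```

Keeping `pos`, `ne` and `relate` on each node means the match and node-similarity functions defined later can read them directly.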

Then, the parse trees are clustered with the tree-similarity function:

$$\begin{aligned} K(T_1,T_2)=m(r_1,r_2)\cdot \bigl (s(r_1,r_2)+K_c(r_1[\mathbf{c}],r_2[\mathbf{c}])\bigr ) \end{aligned}$$

(1)

where

$$\begin{aligned} K_c(p_1[\mathbf{c}],p_2[\mathbf{c}])=\arg \max _{\mathbf{a},\mathbf{b}} K(p_1[\mathbf{a}],p_2[\mathbf{b}]) \end{aligned}$$

(2)

$$\begin{aligned} K(p_1[\mathbf a ],p_2[\mathbf b ])=\sum _{i=1}^l K(p_1[\mathbf a _i],p_2[\mathbf b _i]) \end{aligned}$$

(3)

Here, $T_1$ and $T_2$ are two trees, $r_1$ and $r_2$ are their root nodes, and $r_1[\mathbf{c}]$ and $r_2[\mathbf{c}]$ are the child-node sequences of the roots. Equation (3) is the similarity function over two arbitrary child-node subsequences $p_1[\mathbf{a}]$ and $p_2[\mathbf{b}]$ of equal length $l$. Due to space limitations, see [3] for more details.

We adjust the match function $m(r_1,r_2)$ and the node-similarity function $s(r_1,r_2)$ to suit our data:

$$\begin{aligned} m(p_i,p_j)= {\left\{ \begin{array}{ll} 1 & p_i.relate=p_j.relate \\ 0 & \text{otherwise} \end{array}\right. } \end{aligned}$$

(4)

$$\begin{aligned} s(p_i,p_j)= {\left\{ \begin{array}{ll} 0.2 & p_i.ne \ne p_j.ne\\ 0.5 & p_i.ne=p_j.ne=O,\ p_i.pos \ne p_j.pos\\ 0.8 & p_i.ne=p_j.ne=O,\ p_i.pos=p_j.pos\\ 1.0 & p_i.ne=p_j.ne \ne O \end{array}\right. } \end{aligned}$$

(5)

The weights in Eq. (5) are assigned empirically according to the discriminative ability of the feature types.

The calculation of similarity starts from leaf nodes and goes up to the root employing a dynamic programming algorithm. We summarize syntactic rules manually for different clusters to fill the corresponding event template.
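A minimal Python sketch of this computation follows. It rests on several assumptions that are not confirmed by the text: the match function returns 1 when the dependency relations of two nodes agree and 0 otherwise; the arg max in Eq. (2) is read as the maximum value of the sequence similarity in Eq. (3); and the child subsequences are order-preserving but not necessarily contiguous, so the maximum can be found with an alignment-style dynamic program. The names `Node`, `tree_sim` and `children_sim` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    pos: str      # part-of-speech tag
    ne: str       # named-entity label, "O" for non-entities
    relate: str   # dependency relation to the head
    children: List["Node"] = field(default_factory=list)


def match(p: Node, q: Node) -> float:
    """Assumed reading of Eq. (4): 1 if the dependency relations agree."""
    return 1.0 if p.relate == q.relate else 0.0


def node_sim(p: Node, q: Node) -> float:
    """Node similarity with the weights of Eq. (5)."""
    if p.ne != q.ne:
        return 0.2
    if p.ne == "O":
        return 0.8 if p.pos == q.pos else 0.5
    return 1.0


def children_sim(c1: List[Node], c2: List[Node]) -> float:
    """Best total similarity over order-preserving alignments (assumption)."""
    best = [[0.0] * (len(c2) + 1) for _ in range(len(c1) + 1)]
    for i in range(1, len(c1) + 1):
        for j in range(1, len(c2) + 1):
            best[i][j] = max(
                best[i - 1][j],                       # skip a child of c1
                best[i][j - 1],                       # skip a child of c2
                best[i - 1][j - 1] + tree_sim(c1[i - 1], c2[j - 1]),
            )
    return best[-1][-1]


def tree_sim(t1: Node, t2: Node) -> float:
    """Eq. (1): match gate times (node similarity + child similarity)."""
    m = match(t1, t2)
    if m == 0.0:
        return 0.0
    return m * (node_sim(t1, t2) + children_sim(t1.children, t2.children))


# Two illustrative trees that differ only in the entity type of one child.
t1 = Node("VERB", "O", "root", [Node("NOUN", "PER", "subject"),
                                Node("NOUN", "ORG", "object")])
t2 = Node("VERB", "O", "root", [Node("NOUN", "PER", "subject"),
                                Node("NOUN", "LOC", "object")])
self_score = tree_sim(t1, t1)
cross_score = tree_sim(t1, t2)
```

The recursion bottoms out at leaf nodes, whose child similarity is zero, so scores are effectively accumulated from the leaves up to the root. A tree compared with itself scores higher than the pair that disagrees on one entity type, which is the behavior the clustering step relies on.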

2.2 Knowledge Inference

Online Knowledge Bases. Considering the complexity of natural language, we process the relational network with the assistance of some online knowledge bases. The hierarchical characteristics of Location and Organization entities are key factors for link prediction. Therefore, we crawl an external knowledge base of fine-grained administrative regions in China, which contains more than 700K locations. For organizations, we refer to an online encyclopedia (see Note 1) to normalize their names and design a simple algorithm to infer the hierarchy by analyzing prefix relations.
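The paper does not spell out the prefix-based algorithm. One plausible reading, sketched below in Python, treats the longest other normalized name that is a proper prefix of an organization's name as its parent. The function `infer_parents` and the example names are hypothetical.

```python
from typing import Dict, Iterable, Optional


def infer_parents(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each organization name to its longest proper-prefix name, if any."""
    unique = set(names)
    parents: Dict[str, Optional[str]] = {}
    for name in unique:
        candidates = [other for other in unique
                      if other != name and name.startswith(other)]
        parents[name] = max(candidates, key=len) if candidates else None
    return parents


# Illustrative normalized names (assumed, not taken from the system's data).
orgs = [
    "Shandong University",
    "Shandong University School of Computer Science",
    "Shandong University School of Computer Science Database Lab",
    "York University",
]
parents = infer_parents(orgs)
```

Choosing the longest prefix yields the immediate parent rather than a more distant ancestor. A plain string prefix can also match unrelated names that merely share leading characters, so a real implementation would likely need token-aware matching.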

Link Prediction. Besides the predictors shown in Table 2, we also experiment with rooted PageRank and PropFlow; see [2] for more details. As noted in Sect. 1, the formation of a link depends strongly on the time attributes, so we prune the candidate entity pairs before prediction using the time constraint.
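As a rough illustration of the pruning step, the Python sketch below keeps a person-person pair only if the two people were members of the same organization during overlapping periods, following the schoolmate example in Sect. 1. The `Membership` record, its year-granularity fields and the same-organization requirement are assumptions; the paper does not state the exact form of the constraint.

```python
from itertools import combinations
from typing import Iterable, NamedTuple, Set, Tuple


class Membership(NamedTuple):
    person: str
    organization: str
    start: int  # first year of the membership
    end: int    # last year of the membership, inclusive


def overlaps(a: Membership, b: Membership) -> bool:
    """True if the two closed year intervals share at least one year."""
    return a.start <= b.end and b.start <= a.end


def candidate_pairs(memberships: Iterable[Membership]) -> Set[Tuple[str, str]]:
    """Person pairs that satisfy the (assumed) time constraint."""
    pairs: Set[Tuple[str, str]] = set()
    for a, b in combinations(list(memberships), 2):
        if (a.person != b.person
                and a.organization == b.organization
                and overlaps(a, b)):
            pairs.add(tuple(sorted((a.person, b.person))))
    return pairs


records = [
    Membership("A", "University X", 2000, 2004),
    Membership("B", "University X", 2003, 2006),
    Membership("C", "University X", 2005, 2008),
    Membership("D", "University Y", 2000, 2008),
]
pairs = candidate_pairs(records)
```

Only the surviving pairs would then be scored by a link predictor, which shrinks the candidate set and rules out relationships that are impossible on temporal grounds.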

Table 2. Unsupervised predictors for link prediction

3 Demonstration Scenarios

The source data consists of about 15K personal documents of politicians crawled from People (see Note 2). The system will be demonstrated via two types of query operations:

(1) Point query. Given a specific person as the query condition, the system returns related people together with the corresponding relationships.
(2) Path query. Given two specific people as query conditions, the system returns either all paths or the shortest path between them.
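The two query types can be sketched in Python over a small in-memory graph. This stands in for the graph database that KEIPD actually uses; the adjacency-dictionary layout, the function names and the example people are assumptions made for illustration only.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Undirected social network: person -> {neighbor: relationship label}.
Graph = Dict[str, Dict[str, str]]

graph: Graph = {
    "A": {"B": "schoolmate", "D": "colleague"},
    "B": {"A": "schoolmate", "C": "colleague"},
    "C": {"B": "colleague"},
    "D": {"A": "colleague"},
}


def point_query(g: Graph, person: str) -> List[Tuple[str, str]]:
    """Related people and their relationship labels, sorted by name."""
    return sorted(g.get(person, {}).items())


def shortest_path(g: Graph, source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search; returns one shortest path or None."""
    parent: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in g.get(node, {}):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


related = point_query(graph, "A")
path = shortest_path(graph, "A", "C")
```

Breadth-first search suffices here because the edges are unweighted. Returning all paths, as the path query also allows, would instead require an exhaustive traversal such as depth-first search with backtracking.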

Notes

1. http://baike.baidu.com.
2. http://www.people.com.cn.

References

1. Che, W., Li, Z., Liu, T.: LTP: a Chinese language technology platform. In: Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, pp. 13–16. Association for Computational Linguistics (2010)
2. Davis, D., Lichtenwalter, R., Chawla, N.V.: Multi-relational link prediction in heterogeneous information networks. In: 2011 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 281–288. IEEE (2011)
3. Zhang, M., Su, J., Wang, D., Zhou, G., Tan, C.-L.: Discovering relations between named entities from a large raw corpus using tree similarity-based clustering. In: Dale, R., Wong, K.-F., Su, J., Kwong, O.Y. (eds.) IJCNLP 2005. LNCS (LNAI), vol. 3651, pp. 378–389. Springer, Heidelberg (2005)

Author information

Authors and Affiliations

School of Computer Science and Technology, Shandong University, Jinan, China
Zhaoyang Lv, Yuanyuan Liu & Xiaohui Yu
School of Information Technology, York University, Toronto, ON, Canada
Xiaohui Yu

Corresponding author

Correspondence to Xiaohui Yu.

Editor information

Editors and Affiliations

School of Computing, University of Utah, Salt Lake City, Utah, USA
Feifei Li
School of Electrical Engineering, Seoul National University, Seoul, Korea (Republic of)
Kyuseok Shim
Soochow University, Suzhou, China
Kai Zheng
Soochow University, Suzhou, China
Guanfeng Liu

Cite this paper

Lv, Z., Liu, Y., Yu, X. (2016). KEIPD: Knowledge Extraction and Inference System for Personal Documents. In: Li, F., Shim, K., Zheng, K., Liu, G. (eds) Web Technologies and Applications. APWeb 2016. Lecture Notes in Computer Science, vol 9932. Springer, Cham. https://doi.org/10.1007/978-3-319-45817-5_72

DOI: https://doi.org/10.1007/978-3-319-45817-5_72
Published: 18 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45816-8
Online ISBN: 978-3-319-45817-5
eBook Packages: Computer Science, Computer Science (R0)
