Abstract
Information on the Internet is fragmented and scattered across different data sources, which makes automatic knowledge harvesting and understanding formidable for machines, and even for humans. In recent years, knowledge graphs have become prevalent in both industry and academia as one of the most efficient and effective approaches to knowledge integration. Techniques for knowledge graph construction can mine information from structured, semi-structured, or even unstructured data sources, and integrate that information into knowledge represented as a graph. Furthermore, a knowledge graph organizes information in a manner that is easy to maintain, easy to understand, and easy to use.
In this paper, we summarize techniques for constructing knowledge graphs. We review existing knowledge graph systems developed by both academia and industry. We discuss in detail the process of building knowledge graphs, and survey state-of-the-art techniques for automatic knowledge graph checking and expansion via logical inference and reasoning. We also review issues of graph data management by introducing knowledge data models and graph databases, especially from a NoSQL point of view. Finally, we give an overview of current knowledge graph systems and discuss future research directions.
Acknowledgements
This work was supported by the National Key Research and Development Program of China (2016YFB1000905), the National Natural Science Foundation of China (Grant Nos. U1401256, 61402177, 61402180), and the Natural Science Foundation of Shanghai (14ZR1412600). This work was also supported by the CCF-Tencent Research Program of China (AGR20150114). The author would also like to thank the Key Disciplines of Software Engineering of Shanghai Second Polytechnic University (XXKZD1301).
Additional information
Jihong Yan is a PhD candidate at the Institute for Data Science and Engineering, East China Normal University, China. She is an associate professor at the Institute for Computer and Information Engineering, Shanghai Second Polytechnic University, China. Her research interests include Web data management and mining.
Chengyu Wang is a PhD candidate at the Institute for Data Science and Engineering, East China Normal University, China. His research interests include Web data mining and information extraction. He is currently working on constructing Chinese knowledge graphs from Web data sources.
Wenliang Cheng is a graduate student at the Institute for Data Science and Engineering, East China Normal University, China. His research interests include knowledge graphs and natural language processing.
Ming Gao is an associate professor at the Institute for Data Science and Engineering, East China Normal University, China. He received his PhD in computer science from Fudan University, China. Prior to joining ECNU, Ming Gao worked at the Living Analytics Research Centre (LARC), Singapore Management University, Singapore. His current research interests include distributed data management, user profiling and social mining, data stream management and mining, and uncertain data management.
Aoying Zhou is a professor of computer science at East China Normal University, China, where he heads the Institute of Massive Computing. He is a recipient of the National Science Fund for Distinguished Young Scholars supported by NSFC and holds a professorship under the Changjiang Scholars Program of the Ministry of Education.
Cite this article
Yan, J., Wang, C., Cheng, W. et al. A retrospective of knowledge graphs. Front. Comput. Sci. 12, 55–74 (2018). https://doi.org/10.1007/s11704-016-5228-9