
1 Introduction

In this section, we briefly introduce knowledge graphs and open data.

1.1 Open Data

According to the Open Definition [7], open data is data that “anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness).” Under this definition, open data includes any kind of data that can be freely accessed, modified and shared on the web. Open data exists in different formats, including text documents, spreadsheets, structured documents in RDF or JSON, pictures, geographic file formats, etc. Popular examples of open datasets include those published in government portals such as data.gov.* (e.g. uk, ie and es). Examples of open data portals in Africa include http://data.edostate.gov.ng of the Edo State Government in Nigeria, http://www.opendata.go.ke/ of the Kenyan Government and http://dataportal.opendataforafrica.org/ maintained by the African Development Bank. See Fig. 1 for an example of an open data portal. Related to open data are also public data and resources such as DBpedia [14], YAGO [3], Geonames, Wikipedia, WordNet, dbtune.org, the New York Times dataset, the opendatacommunities.org datasets, etc. Open data covers a wide range of domains, is heterogeneous in nature and noisy, and therefore shows a large variation in quality. Applications consuming this data need to engage in additional processing steps to deal with inconsistencies and misleading information. The main issues with open data are accuracy, representation, integration and linking. One way to address these problems is to integrate islands of inconsistent open datasets into a more consistent global dataset in the form of a knowledge graph.
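As a concrete illustration of how such portal data can be accessed programmatically, the minimal sketch below queries a CKAN-style open data portal API for datasets matching a keyword. The choice of portal, the endpoint path and the search term are assumptions made purely for illustration; they are not part of the proposed system.

```python
# Minimal sketch: searching a CKAN-style open data portal for datasets.
# Assumes the portal exposes the standard CKAN Action API under
# /api/3/action/; the portal URL and query below are illustrative only.
import requests

PORTAL = "https://data.gov.uk"  # assumed CKAN-compatible portal


def search_datasets(query, rows=5):
    """Return (title, list of resource formats) pairs for matching datasets."""
    resp = requests.get(
        f"{PORTAL}/api/3/action/package_search",
        params={"q": query, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    packages = resp.json()["result"]["results"]
    return [
        (pkg["title"], [res.get("format", "") for res in pkg.get("resources", [])])
        for pkg in packages
    ]


if __name__ == "__main__":
    for title, formats in search_datasets("health"):
        print(title, formats)
```

The variety of formats returned by such a query (CSV, XLS, JSON, RDF, etc.) is exactly the heterogeneity that motivates the integration step described above.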

Fig. 1. Example of an open data portal (http://dataportal.opendataforafrica.org/)

1.2 Knowledge Graph

There is no generally agreed definition of what a knowledge graph is. The term was popularized by Google when it introduced its Knowledge Graph [5] in 2012. Ever since, researchers have often used the term to refer to semantic web repositories such as DBpedia [14] and YAGO [3]. [4] defines a knowledge graph by giving its characteristics: “A knowledge graph

  1. Mainly describes real world entities and their interrelations, organized in a graph

  2. Defines possible classes and relations of entities in a schema

  3. Allows for potentially interrelating arbitrary entities with each other

  4. Covers various topical domains.”

Another study [6], titled “Towards a Definition of Knowledge Graphs”, surveyed how the term knowledge graph is used and defines a knowledge graph as:

“A knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge.”

Knowledge graphs are often differentiated based on their architecture, operational purpose, data sources, coverage and the technologies used in building them. Knowledge graphs are a key driving force for the future of artificial intelligence systems and for many other applications that consume and reason over structured data, including search engines, enterprise and business systems, and recommender systems.
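To make the definition in [6] concrete, the following minimal sketch (our own illustration, not the implementation of any system cited in this paper) stores facts as subject-predicate-object triples and applies one simple inference rule, type propagation along a subclass hierarchy, to derive new knowledge. The entities, classes and relation names are invented for the example.

```python
# Minimal sketch of a triple store with one inference rule
# (subClassOf-style type propagation). All names are illustrative.
triples = {
    ("Nairobi", "isA", "City"),
    ("City", "subClassOf", "PopulatedPlace"),
    ("Nairobi", "capitalOf", "Kenya"),
}


def infer_types(kg):
    """Derive (x, isA, C2) whenever (x, isA, C1) and (C1, subClassOf, C2)."""
    derived = set()
    for (s, p, o) in kg:
        if p == "isA":
            for (c1, q, c2) in kg:
                if q == "subClassOf" and c1 == o:
                    derived.add((s, "isA", c2))
    return derived - kg  # only facts not already in the graph


print(infer_types(triples))  # {('Nairobi', 'isA', 'PopulatedPlace')}
```

A real reasoner would iterate such rules to a fixed point and support a much richer rule set; the sketch only shows the basic idea of deriving knowledge that is not explicitly stated.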

2 Related Work

Building a knowledge graph is a difficult task due to the heterogeneity of data sources on the internet, the volume of the data and the veracity, or noise, in the data [1]. Knowledge graphs, or knowledge base systems, have been in use for some time; in [8], the authors show that the theory and practice of knowledge graphs date back to 1982. Recent years have witnessed the emergence of several knowledge graphs, including Wikidata [9], YAGO [3], Freebase [13], NELL [2], PROSPERA [10], Knowledge Vault (KV) [11], the Google Knowledge Graph [5] and Microsoft Bing Satori [17]. These knowledge graphs can be classified based on their information source, scope and operational purpose. Regarding information source, some systems crawl the internet to extract information from unstructured sources; examples include KV, NELL and PROSPERA. Others rely on human annotation and structured sources, such as Freebase, or combine the two approaches, e.g. YAGO2 [12]. Regarding scope or coverage, some focus on gathering information about a specific domain (domain-specific knowledge graphs), for example [1, 11, 15], while others gather facts across a wide range of domains (domain-independent knowledge graphs), for example [5, 13, 14, 17]. Regarding purpose, some knowledge graphs were built to be used independently, such as [3, 14], while others are used as components of other systems to enhance their productivity and efficiency, as in the case of the Google Knowledge Graph and Microsoft Bing Satori.

Knowledge graphs have also been built and used in other research projects. For example, in [1] a generic approach for building domain-specific knowledge graphs was proposed and employed to build a knowledge graph for combating human trafficking.

Another study [18] complements the traditional approach of building knowledge graphs, such as Google’s Knowledge Graph, by focusing on event-centric knowledge graphs. The authors try to capture the dynamic state of the world by extracting information about events reported in the news using state-of-the-art natural language processing and semantic web techniques. Their study also provides a method and tools to automatically build knowledge graphs from news articles.

While our approach may intersect with previous methods in terms of information source, scope and purpose, the previous methods did not use refinement methods that improve both the coverage and the accuracy of the knowledge graph. In addition, our proposed method will compute a correctness score for every relationship in the graph and, based on a judiciously set accuracy threshold, determine whether to store the newly generated knowledge.
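The acceptance test sketched below illustrates this idea. The scoring function is a placeholder for whatever correctness model the refinement work will provide, and the threshold value is an illustrative assumption, not a tuned parameter.

```python
# Sketch of threshold-based storage of newly extracted relations.
# `score` stands in for the correctness model of the refinement module;
# THRESHOLD is an illustrative value only.
THRESHOLD = 0.8


def accept_candidates(candidates, score, threshold=THRESHOLD):
    """Split (subject, predicate, object) triples into those whose
    correctness score meets the threshold and those that are rejected."""
    accepted, rejected = [], []
    for triple in candidates:
        (accepted if score(triple) >= threshold else rejected).append(triple)
    return accepted, rejected
```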

3 Proposed Architecture of the Knowledge Graph System

The architecture of the proposed system for building the knowledge graph is shown in Fig. 2. The stages for building the knowledge graph are briefly explained below.

Fig. 2. Proposed architecture for the knowledge graph

  • Data Extraction Module: This sub-system is responsible for gathering information from the different sources available in open data portals through the underlying platforms’ application programming interfaces (APIs).

  • Data Analysis Module: In this stage, the information is interpreted using NLP techniques. Specifically, attempts are made to discover entities of interest in the open datasets (see the first sketch after this list).

  • Identity/Entity Resolution: In this module, we employ entity resolution methods such as the Silk Link Discovery Framework [16] to resolve references to common entities (a simplified sketch follows this list).

  • Refinement Module: In this module, we improve both the quality and the coverage of the knowledge graph.

  • Performance Evaluation: This module evaluates the overall performance of the system against well-known gold-standard graph evaluation resources.
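As indicated for the Data Analysis Module, entity discovery can be prototyped with an off-the-shelf NLP pipeline. The sketch below uses spaCy’s small pretrained English model purely as an example of such a pipeline; it is not a commitment to a particular toolkit, and the input sentence is invented.

```python
# Sketch of the entity-discovery step of the Data Analysis Module using
# spaCy's pretrained named-entity recognizer (an illustrative toolkit choice).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")


def extract_entities(text):
    """Return (surface form, entity type) pairs found in the text."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


print(extract_entities(
    "The Edo State Government in Nigeria publishes budget data on its portal."
))
```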
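For the Identity/Entity Resolution module, the following sketch shows the basic idea of similarity-based link discovery between two datasets. It is a deliberately simplified stand-in for what the Silk Link Discovery Framework provides, and the similarity measure and threshold are assumptions for illustration.

```python
# Simplified entity-resolution sketch: link records from two datasets whose
# labels are sufficiently similar. This only illustrates the idea behind
# link discovery; it is not the Silk Link Discovery Framework itself.
from difflib import SequenceMatcher


def label_similarity(a, b):
    """Crude string similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(entities_a, entities_b, threshold=0.9):
    """Return (id_a, id_b) pairs judged to denote the same real-world entity."""
    links = []
    for id_a, label_a in entities_a.items():
        for id_b, label_b in entities_b.items():
            if label_similarity(label_a, label_b) >= threshold:
                links.append((id_a, id_b))
    return links


print(resolve({"a1": "Republic of Kenya"}, {"b7": "republic of kenya"}))
```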

4 Conclusions

In this work, we have considered the problem of building a knowledge graph using open data. Our research agenda has the potential to open up access to open data that is currently accessible only to a very few technical users of open data portals. Opening up access to open data as knowledge graphs will make the contents of open datasets searchable using keywords or natural language phrases on existing search engines such as Google. So far, only large multinational search engine providers such as Google and Microsoft provide knowledge graphs (on entities that are core to their interests) to support more intelligent search on the web. In addition, our work will also significantly impact the continuous efforts of the W3C in publishing more Linked Open Data (semantically rich, open and machine readable data) on the web. Our knowledge graph approach will exploit state-of-the-art techniques, with a focus on the accuracy of graph relations, on reasoning to discover more relations, and on increasing the confidence scores of relationships in the knowledge graph over time.