1 Introduction

Knowledge is the crystallization of human understanding of the objective world in practice. As a form of structured knowledge, knowledge graphs (KGs) can be traced back to semantic nets proposed by Richens in 1956 [1, 2]. Later, expert systems such as MYCIN were proposed and became a research hotspot [3]. Expert systems were also considered the precursor of KG. As the builder of a search engine, Google is committed to understanding the words that users use. In other words, when users search, the words they enter refer to things that actually exist in the world, rather than just their surface meanings. Based on this idea, Google attempted to establish relationships between real-world entities by building the KG. In 2012, it was integrated into the Google search engine, making it easier for users to access knowledge related to their search queries.

KG is a collection of information about real-world entities, including people, books, movies and many other types of things. For example, for a celebrity, relevant data such as their birthday and height are collected, and the person is linked to other closely related entities in the KG. More specifically, if a user wants to learn about astronomers, they may search for Galileo, as shown in Fig. 1. Based on the knowledge graph, the search result will directly display relevant information and show Galileo’s scientific contributions. It can also help users discover other famous astronomers, such as Copernicus and Kepler. The goal of the KG is to move from an information engine to a knowledge engine.

The proposal of knowledge graphs has attracted widespread attention from academia and industry. As knowledge graphs continue to develop, they are becoming larger in scale and richer in content. An increasing number of knowledge graphs are being created to support downstream applications such as knowledge management, search engines, intelligent question answering and recommendation systems. Application domains include medicine, archaeology, e-commerce, catering, and economics [4,5,6,7,8,9].

Fig. 1. Google search for Galileo

The concept of modality, like the concept of neural networks, was originally biological. Humans have visual, auditory, tactile, and olfactory senses, and each form of sensory information can be referred to as a modality. In machine learning, the term generally refers to different media of information, such as text, images, speech and video. Multimodal refers to the combination of several such types of data. With the rapid development of the Internet, information in different modalities has grown explosively, and how to use these diverse types of information efficiently has become a critical and challenging problem [24]. At the same time, to overcome the limitations of a single modality in practical applications, the demand for machines to learn multimodal knowledge has also been increasing. For example, image captioning is one of the first tasks to combine the image and text modalities. Machines need to automatically generate natural language descriptions of images, which requires a deeper level of image understanding than typical image recognition and object detection methods provide [12, 13]. Visual question answering is often seen as a visual Turing test, in which the system needs to understand a natural language question of any form (usually related to visual information in the image) and answer it in a natural way [17, 18].

However, as shown in Fig. 2(a), KGs mostly use pure symbolic text as objects and construct a semantic network from triples. This approach limits machines’ capabilities for understanding and expression [11, 12]. If we give the machine only a textual description of “dogs”, it is difficult for the machine to understand the concept of “dogs”, which makes KGs difficult to apply. However, if we combine information about dogs in different modalities, such as pictures of dogs and the sound of barking, the concept of “dogs” becomes vivid. In other words, if we want machines to truly gain intelligence, single-modal information alone is far from sufficient. Therefore, the multimodal knowledge graph (MMKG), as shown in Fig. 2(b), is of great help in achieving artificial intelligence, and KGs are urgently in need of multimodality.

Fig. 2. (a) An example of a unimodal KG (b) An example of an MMKG

In this context, the construction and application of MMKGs have become a research hotspot. However, one thorny issue remains unresolved: the definition of MMKG. Starting from the KG itself, this paper summarizes the definition of KG, explores the definition of MMKG, and provides an example MMKG in the medical field. The rest of this paper is organized as follows: Sect. 2 summarizes the definition of KG, and Sect. 3 explores the concept of MMKG. To illustrate the concept, Sect. 4 constructs an example MMKG in the medical field, and Sect. 5 summarizes the paper.

Fig. 3. (a) The introduction of Franklin (b) The introduction of Benjamin

2 Definition of KG

2.1 Description of the Problem

The definition of KG has been a longstanding topic of discussion among experts and scholars, but a consensus has not yet been reached [10,11,12, 19,20,21,22,23]. The root cause is that Google’s blog post introducing its KG did not give a definition or discuss the related technical issues, which has led to conflicting definitions and descriptions of KG as the field developed [1, 20]. For example, Paulheim et al. defined KG as a graph-based organization used to describe entities and their relationships in the real world [21]. This definition is too abstract and not specific enough to KG. Ehrlinger et al. defined KG as acquiring knowledge and integrating it into an ontology, with a reasoning engine applied to derive new knowledge [20]. The implication is that a KG consists of two parts, knowledge and a reasoning engine, which is also biased. Zheng et al. simply defined KG as representing entities with nodes and relationships with edges [4]. Most other papers give a representation of KG rather than a definition. For example, Ji et al. defined a knowledge graph as \(\mathcal {G}\) = {\(\mathcal {E}\), \(\mathcal {R}\), \(\mathcal {F}\)}, where \(\mathcal {E}\), \(\mathcal {R}\) and \(\mathcal {F}\) are sets of entities, relationships, and facts, respectively. Facts are represented as triples (h, r, t) \(\in \) \(\mathcal {F}\) [23]. However, this representation makes no distinction between relationships and attributes and does not discuss the directionality of triples. As can be seen, even the representation of KG lacks a unified standard [11], which hinders research in this field. Therefore, a unified and standard definition of KG is needed.

Fig. 4. (a) Introduction of Galileo’s birthplace, Pisa. (b) Introduction of Isaac Newton, a figure related to Galileo.

2.2 Inquiry into the Problem

To solve this thorny problem, we must go back to the source and start with Google’s blog on KG. The blog provides a case of how KG is used for search, as shown in Fig. 1. We can see that the search result for Galileo consists of the following parts:

  • The first part is Galileo’s name and classification: Galileo belongs to the category of physicists. We can view this classification as a part of the framework of ontology, and Galileo is an instance under this class.

  • Next are his images, which come from different sources such as BaiduPedia, StarWalk, and Wikipedia [1].

  • After the images, there is a section on Galileo’s personal information, including his life events and contributions. Users can click the blue text to jump to another page; the black text cannot be clicked.

  • At the bottom, there are related figures such as Copernicus and Newton. The text and images are integrated as a whole, and clicking on them leads to their corresponding pages.

Fig. 5. (a) Inductive method for ontology construction (b) Ontology diagram (c) Taxonomic method for ontology construction

Therefore, we can see that the data in a KG should include two types: resources and literals. Resources refer to resource links from different data sources. Literals can be understood as strings in programming languages, which carry no meaning in themselves. In terms of the “tree” concept in data structures, literals are similar to “leaves”, whose degree is 0. Combining this with another instance from the introduction, as shown in Fig. 3, we see that resources exist in the form of entity nodes, and different entity nodes are connected through relationships, represented by white lines in the graph. Literals are usually regarded as internal information of entity nodes and are not connected to other nodes; such a literal is called a property value. Both types of information can be described by triples, for example, “Galileo - birth place - Pisa, Italy” and “Galileo - died in - January 8, 1642”. The first triple has the form entity-relationship-entity, and the second has the form entity-property-property value. Together, these two forms are the basic components of a KG. One important point to note is the directionality of the entities in a KG. Some literature mentions this issue, suggesting that a KG should be defined as a directed graph structure [10, 11]. However, these studies have not clearly explained whether the directionality concerns the relationship between entities or the link between entities and property values, or whether it is one-way, two-way, or multidirectional. This paper addresses the issue as follows. Using the previous example, “Galileo - died in - January 8, 1642” is a reasonable expression, whereas “January 8, 1642 - died in - Galileo” is not. That is, in an entity-property-property value triple, the node points to the property value, and the property is connected only to that node, so the triple is unidirectional. This matches the representation in Fig. 3.

Now consider an entity-relationship-entity triple, such as “Galileo - birth place - Pisa, Italy”. An entity can be connected to many different entities; “Galileo”, for example, is related to many other figures, such as “Isaac Newton” and “Aristotle”. That is, the entity has multidirectionality. Furthermore, consider what happens when we click the birthplace “Pisa, Italy” displayed in Fig. 1 and the recommended figure Isaac Newton below it. As shown in Fig. 4, the recommended content displayed under the Pisa node is not related to Galileo, whereas Galileo appears in the recommended content under Newton. This indicates that the triple “Galileo - birth place - Pisa, Italy” is a one-way structure, while the triple “Galileo - related person - Isaac Newton” is a two-way structure. That is, the relationship between nodes can be either one-way or two-way.
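The one-way versus two-way distinction can be made concrete with a small Python sketch. This is only an illustration under stated assumptions, not the paper's implementation: the `Triple` type and the `is_two_way` helper are hypothetical names, and the reverse Newton-to-Galileo triple is entered by hand to mirror the observation from Fig. 4.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    label: str  # relationship name or property name
    tail: str   # another entity, or a literal property value


# Hypothetical toy data mirroring the Galileo example.
triples = {
    Triple("Galileo", "birth place", "Pisa, Italy"),
    Triple("Galileo", "related person", "Isaac Newton"),
    Triple("Isaac Newton", "related person", "Galileo"),
    Triple("Galileo", "died in", "January 8, 1642"),  # entity-property-property value
}


def is_two_way(t: Triple, graph: set) -> bool:
    """Treat a triple as two-way if the reversed triple is also stored."""
    return Triple(t.tail, t.label, t.head) in graph


for t in sorted(triples):
    kind = "two-way" if is_two_way(t, triples) else "one-way"
    print(f"{t.head} -[{t.label}]-> {t.tail}: {kind}")
```

Under this representation a two-way relationship is simply a pair of directed triples, so the graph as a whole stays a directed graph.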

Fig. 6. (a) Search result of Pisa (b) Search result of Rome (c) Search result of Europe

2.3 Knowledge Base, Ontology and RDF

Another point is the relationship and difference between knowledge graphs and knowledge bases. Many recent papers do not distinguish between these two concepts and treat KGs and knowledge bases as equivalent [10, 11]. They regard semantic networks, graph databases, and knowledge bases such as WordNet (1995), BabelNet (2010), Freebase (2008), DBpedia (2007), YAGO (2007), and WikiData (2014) as KGs without explanation, which is unreasonable [28,29,30,31,32,33]. One piece of evidence is that Johanna Wright, the product management director, mentioned in her introduction of KG that Google uses its search engine to understand what users search for and adds some of this content to the knowledge base. This indicates that KG is a kind of knowledge base. However, other descriptions from Google employees suggest that the two concepts are not identical [1]. To this end, this paper explains that a knowledge base is a special database used for knowledge management. It is a collection of heterogeneous knowledge from multiple sources in a required field, including basic facts, rules, and other related information. A KG is a processed knowledge base that has a graph structure and contains structured and semistructured data.

In addition to KG, two other frequently mentioned concepts are ontology and the Resource Description Framework (RDF) [25, 26]. RDF is a data model developed by the W3C that provides a unified standard for describing things and their relationships. An RDF graph is composed of nodes and edges, where nodes represent specific entity resources or property values and edges represent relationships between entities or between entities and property values. RDF constrains each part of the subject-predicate-object (SPO) triple: “s” must be an Internationalized Resource Identifier (IRI) or a blank node, “p” must be an IRI, and “o” can be an IRI, a blank node, or a literal. However, RDF by itself has a serious limitation in that it cannot distinguish between classes and objects. It also cannot define and describe class relationships and properties. In other words, RDF is mainly used to describe concrete things and lacks the ability to abstractly categorize and define groups of similar things. This clearly limits the expressive power of the model. Therefore, the assistance of an ontology is needed.
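The positional constraints on an RDF triple can be checked mechanically. The sketch below uses only the Python standard library and follows RDF 1.1, in which the subject is an IRI or blank node, the predicate is an IRI, and the object is an IRI, blank node, or literal. The classes `IRI`, `BlankNode` and `Literal`, the function `is_valid_rdf_triple`, and the `http://example.org/` namespace are illustrative assumptions, not part of any RDF library or of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


def is_valid_rdf_triple(s, p, o) -> bool:
    """Check which kinds of term RDF allows in each position of (s, p, o)."""
    return (
        isinstance(s, (IRI, BlankNode))
        and isinstance(p, IRI)
        and isinstance(o, (IRI, BlankNode, Literal))
    )


EX = "http://example.org/"  # placeholder namespace, not from the paper
galileo = IRI(EX + "Galileo")
birth_place = IRI(EX + "birthPlace")
died = IRI(EX + "died")
pisa = IRI(EX + "Pisa")
date = Literal("1642-01-08")

print(is_valid_rdf_triple(galileo, birth_place, pisa))  # entity-relationship-entity
print(is_valid_rdf_triple(galileo, died, date))         # entity-property-property value
print(is_valid_rdf_triple(date, died, galileo))         # a literal cannot be a subject
```

The last call is rejected because a literal cannot appear in the subject position, which agrees with the earlier observation that property values behave like leaves.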

Fig. 7. Part of the data of COMMKG-19

Ontology is a philosophical concept that involves dividing entities into basic categories and hierarchies. An ontology has a classification system and basic reasoning principles. The classification system defines the relationships between categories, providing the basis for reasoning. Some ontologies are widely used in the medical field, such as CIDO, GO, Uberon and DOID [34,35,36,37]. There are two main ways to build ontologies: the bottom-up inductive approach, as shown in Fig. 5(a), and the top-down classification approach, as shown in Fig. 5(c). Generally, open-domain KGs are built with the inductive method, which derives categories from the underlying data, because of the large amount of data involved. In contrast, domain-specific KGs often define the categories before filling in the data.

Ontology categories are visible throughout Google’s KG. As previously mentioned, Galileo belongs to the category of physicists. As shown in Fig. 6(a), there is a comment line, “Pisa: The City of Italy”, which can be viewed as a category label. In the recommended content below, we can see four recommended places: Florence, Lucca, Livorno and Tuscany. The category label of Florence, Lucca and Livorno is “City in Italy”, and the category label of Tuscany is “Administrative districts of Italy”. All of this recommended content belongs to places (cities or administrative regions) in Italy. In the additional recommendations, two items are worth noting: “Rome, the capital of Italy” and “Italy, Countries in Europe”. When searching for Rome, as shown in Fig. 6(b), the related content includes Italy, Milan, Venice, and Florence, whose category labels are “Countries in Europe” and “Cities in Italy”. The additional recommendations also include Madrid (the capital of Spain) and London (the capital of the United Kingdom). It can be speculated that the recommended entities in the KG come from three groups: entities with the same category label, entities with subcategory labels, and entities with parent category labels. To verify this assumption, we searched for “Europe”. As shown in Fig. 6(c), we obtained nodes with the same label (Asia and Africa) and subcategory nodes (Germany and Italy). No parent category entity appears, possibly because “Continent” is a top-level concept. The ontology based on this situation is shown in Fig. 5(b).
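The speculated recommendation rule (same label, subcategory label, parent category label) can be sketched as a lookup over category labels. The labels and hierarchy below are hypothetical simplifications loosely following Fig. 5(b), and `recommend` is an illustrative helper. It matches on labels only, so it is much coarser than whatever ranking the search engine actually applies.

```python
# Hypothetical category labels and category hierarchy, loosely following Fig. 5(b).
category_of = {
    "Europe": "Continent",
    "Asia": "Continent",
    "Africa": "Continent",
    "Italy": "Country in Europe",
    "Germany": "Country in Europe",
    "Rome": "City in Italy",
    "Pisa": "City in Italy",
    "Florence": "City in Italy",
}
parent_of = {
    "City in Italy": "Country in Europe",
    "Country in Europe": "Continent",
}


def recommend(entity):
    """Group the other entities by how their category label relates to the query's label."""
    label = category_of[entity]
    parent = parent_of.get(label)
    children = {child for child, par in parent_of.items() if par == label}
    groups = {"same": [], "sub": [], "parent": []}
    for other, other_label in category_of.items():
        if other == entity:
            continue
        if other_label == label:
            groups["same"].append(other)
        elif other_label in children:
            groups["sub"].append(other)
        elif parent is not None and other_label == parent:
            groups["parent"].append(other)
    return groups


print(recommend("Europe"))
print(recommend("Pisa"))
```

For “Europe” the parent group is empty because “Continent” has no parent in this toy hierarchy, matching the explanation given for Fig. 6(c).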

Fig. 8. (a) Some top-level concepts in the ontology (b) Some relationships in the ontology (c) Some properties in the ontology

2.4 Conclusion

In summary, this paper provides the following definition of KG: a KG is a kind of knowledge base that is composed of an ontology and the Resource Description Framework and that can serve downstream applications. Its symbolic form is \(\mathcal{G} = \{\mathcal{E}, \mathcal{R}, \mathcal{P}, \mathcal{V}, \mathcal{T}_{R}, \mathcal{T}_{P}\}\), a set of elements and knowledge, where \(\mathcal{E}\), \(\mathcal{R}\), \(\mathcal{P}\) and \(\mathcal{V}\) are the sets of entities, relationships, properties and property values, respectively. The knowledge \(\mathcal{T}_{R}\) is a set of entity-relationship-entity triples, and \(\mathcal{T}_{P}\) is a set of entity-property-property value triples, that is, \(\mathcal{T}_{R} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}\) and \(\mathcal{T}_{P} \subseteq \mathcal{E} \times \mathcal{P} \times \mathcal{V}\). Elements of \(\mathcal{R}\) and \(\mathcal{P}\) are directional, pointing from the head entity to the tail entity or to the property value. For example, “Zhengzhou belongs to Henan Province” can be expressed as the triple (Zhengzhou, belongs to, Henan Province) \(\in \mathcal{T}_{R}\), and “Biden is 81 years old this year” can be expressed as (Biden, age, 81) \(\in \mathcal{T}_{P}\).
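The six components of this definition map naturally onto a small container type. The following Python sketch is one possible reading, not an implementation from the paper: the class name `KnowledgeGraph` and its methods are assumptions, and storing each triple as a tuple reflects the interpretation that the two kinds of knowledge are sets of (head, relationship, tail) and (head, property, value) triples.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    entities: set = field(default_factory=set)          # E
    relationships: set = field(default_factory=set)     # R
    properties: set = field(default_factory=set)        # P
    values: set = field(default_factory=set)            # V
    relation_triples: set = field(default_factory=set)  # T_R
    property_triples: set = field(default_factory=set)  # T_P

    def add_relation(self, head, relationship, tail):
        """Add an entity-relationship-entity triple, directed head -> tail."""
        self.entities.update({head, tail})
        self.relationships.add(relationship)
        self.relation_triples.add((head, relationship, tail))

    def add_property(self, head, prop, value):
        """Add an entity-property-property value triple, directed head -> value."""
        self.entities.add(head)
        self.properties.add(prop)
        self.values.add(value)
        self.property_triples.add((head, prop, value))


kg = KnowledgeGraph()
kg.add_relation("Zhengzhou", "belongs to", "Henan Province")
kg.add_property("Biden", "age", 81)

print(sorted(kg.entities))
print(kg.relation_triples)
print(kg.property_triples)
```

Keeping property values in their own set, separate from entities, preserves the distinction between relationships and properties that the definition insists on.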

3 Exploring the Concept of MMKG

Most of the literature on MMKG does not give a definition, and the definitions in some works are too abstract [41, 44]. Wang et al. directly introduced the RDF model into Richpedia and regarded it as a finite set of RDF triples [43]. Zhu et al. describe MMKG as a multimodal representation of part of the knowledge in a KG [11]. In view of this situation, it is necessary to formulate a unified and complete definition of MMKG.

Fig. 9. COMMKG-19 visualization by Neo4j

3.1 Multimodality of Knowledge Graphs

Some literature uses “CKG” to refer to knowledge graphs based solely on the text modality [10]. This usage is unreasonable, because the researchers concerned have overlooked a question: has the KG been multimodal since its inception? The answer is yes. The question has been overlooked because Google did not mention the multimodality of the KG in its introduction. Although the concept of “multimodal” was proposed long ago, it did not receive widespread attention in the field of artificial intelligence until approximately 2015, and most of the research was based on text and images [46]. As the key to Google’s search engine, the KG improves search in three ways: finding the right thing, getting the best summary, and going deeper and broader. As the blog puts it, “Language can be ambiguous-do you mean Taj Mahal the monument, or Taj Mahal the musician?”. To provide better recommendations, the KG adds image elements to the relevant recommendations, as in Fig. 1 and Fig. 4. However, research based on KG, including construction and application, initially focused on the text modality [47,48,49,50,51,52]. For example, Sören Auer built a KG for the exchange of academic information [49]. In the field of natural language processing (NLP), named entity recognition (NER) and relationship extraction (RE) based on the text modality have been greatly developed [50, 51]. With the proposal of the TransE model, knowledge representation learning (KRL) based on text information has become a major research hotspot [52, 53]. Only in recent years has research on the multimodality of KG progressed. Examples include Liu et al.’s proposal of MMKG in 2019 and Wang et al.’s proposal of Richpedia in 2020 [43, 44].

It can be seen that knowledge graphs have developed from multimodal to unimodal and then back to multimodal. One of the main reasons is the lack of a unified understanding of the knowledge graph in academia and industry, which is why this paper explores the definition of a knowledge graph.

3.2 Comparison of KG and MMKG

Contrary to existing beliefs, based on the foregoing, this paper argues that MMKG should not be regarded as a generalization of KG; rather, KG is a special case of MMKG. In other words, a KG is an MMKG that contains only text-modality information. Therefore, in terms of definition, MMKG and KG should conform to the same definition framework, with the definition of MMKG being the broader one. This paper defines MMKG as follows: an MMKG is a kind of knowledge base that contains data in at least two different modalities (text, voice, images, videos, etc.), follows the ontology and the Resource Description Framework, and can serve downstream applications. The symbolic form of MMKG is \(\mathcal{G} = \{\mathcal{E}^{M}, \mathcal{R}^{M}, \mathcal{P}^{M}, \mathcal{V}^{M}, \mathcal{T}^{M}_{R}, \mathcal{T}^{M}_{P}\}\), where \(\mathcal{E}^{M}\), \(\mathcal{R}^{M}\), \(\mathcal{P}^{M}\) and \(\mathcal{V}^{M}\) are the sets of entities, relationships, properties, and property values, whose elements may be in different modalities. The knowledge \(\mathcal{T}^{M}_{R}\) is a set of entity-relationship-entity triples over different modalities, and \(\mathcal{T}^{M}_{P}\) is a set of entity-property-property value triples over different modalities, that is, \(\mathcal{T}^{M}_{R} \subseteq \mathcal{E}^{M} \times \mathcal{R}^{M} \times \mathcal{E}^{M}\) and \(\mathcal{T}^{M}_{P} \subseteq \mathcal{E}^{M} \times \mathcal{P}^{M} \times \mathcal{V}^{M}\). As in the definition of KG, elements of \(\mathcal{R}^{M}\) and \(\mathcal{P}^{M}\) are directional, pointing from the head entity to the tail entity or to the property value.
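The "at least two modalities" condition in this definition can be expressed as a simple check. The sketch below is a simplification under stated assumptions: `Node`, `modalities` and `is_multimodal` are hypothetical names, the URL is a placeholder, and only head and tail nodes carry a modality tag, whereas the definition also allows relationships and properties to be in different modalities.

```python
from typing import NamedTuple


class Node(NamedTuple):
    content: str   # the text itself, or a URL/path for non-text data
    modality: str  # e.g. "text", "image", "voice", "video"


def modalities(triples):
    """Collect every modality that appears at the head or tail of a triple."""
    return {node.modality for head, _, tail in triples for node in (head, tail)}


def is_multimodal(triples) -> bool:
    """The definition requires data in at least two different modalities."""
    return len(modalities(triples)) >= 2


covid = Node("COVID-19", "text")
fever = Node("fever", "text")
fever_img = Node("https://example.org/img/fever.jpg", "image")  # placeholder URL

text_only = [(covid, "symptom", fever)]
multimodal = text_only + [(fever, "image", fever_img)]

print(is_multimodal(text_only))
print(is_multimodal(multimodal))
```

Both lists use the same triple structure; only the modality tags differ, which is the sense in which KG and MMKG share one definition framework.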

The difference between KG and MMKG is mainly reflected at the application level, which is also the core issue of extensive research in academia. A major difficulty in MMKG research is how to fuse the features of data in different modalities in a reasonable way to support downstream applications. Whereas a KG only involves interactions within the text modality, an MMKG needs to consider the features of data in different modalities. Current research focuses on supplementing text information with image information to improve the accuracy of downstream tasks. Sun et al. designed a recommendation system based on MMKG, which effectively alleviated the cold start and data sparseness problems of recommendation systems [27]. Zheng et al. used doctor-patient dialogue and related examination images (CT, X-ray and ultrasound) to improve the accuracy of a diagnostic system for COVID-19 [39]. For KRL, the semantic information available in a single modality limits the performance of the model, and introducing data from other modalities leaves more room for improvement. Wang et al. fused text and image features through a multihead self-attention mechanism to improve the accuracy of link prediction [55]. One thing to note is that image information should be treated as being as important as text information.

Table 1. Comparison of data between COKG-19 and COMMKG-19.

4 Construction of MMKG

4.1 Two Different Ways to Build MMKG

At present, academia and industry generally use two different ways to construct MMKGs. One is to build the MMKG using images as entity nodes. The node information is then enriched through the properties of the node, such as the size and content of the image. This paper refers to this approach as E-MMKG for short. Wang et al. built Richpedia following RDF. The text entities come from Wikidata’s IRIs. For image entities, images were collected from Wikipedia and corresponding IRIs were created in Richpedia. The result was a collection of 30,638 entities about cities, attractions, and celebrities, with an average of 99.2 images retained per entity. However, in Richpedia, the relationships between images are relatively few and the ontology is relatively simple [43].

The other way is to build the MMKG using the image as a property of the node, which this paper refers to as P-MMKG for short. Daniel et al. created ImageGraph, which contains 14,860 entities and 829,931 images. Its relationship structure is based on FB15K. More than 462 GB of image data was downloaded from different search engines, and corrupted, duplicate, and low-quality images were removed. In addition, triples whose head or tail entities could not be linked to image data were filtered out [54].

Compared with E-MMKG, the construction of P-MMKG is simpler: property values in the knowledge graph are not connected to other nodes, so there is no need to consider the relationships between images. The construction of E-MMKG often needs to consider relationships between images, such as whether they are similar or different. Although this enriches the data in the MMKG, it also increases the complexity of construction. In fields where relationships between images are not in high demand, the P-MMKG construction method is recommended. However, in some specific fields, E-MMKG must be considered; an example is an encyclopedia knowledge graph that illustrates the relationships between different species of animals through images (tigers and lions share a common ancestor). Both MMKG construction methods follow the ontology and RDF structure, which is consistent with the definition of MMKG in this paper.
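The structural difference between the two construction methods shows up directly in the triples. The following Python sketch uses made-up identifiers and placeholder URLs, and `image_to_image_edges` is a hypothetical helper. It is meant only to show that image-to-image relationships are expressible in E-MMKG and absent in P-MMKG, not to reproduce Richpedia or ImageGraph.

```python
# P-MMKG: each image is a property value attached to an entity node.
p_mmkg = [
    ("Tiger", "image", "https://example.org/img/tiger.jpg"),
    ("Lion", "image", "https://example.org/img/lion.jpg"),
]

# E-MMKG: each image is an entity node of its own, so images can be related to each other.
e_mmkg = [
    ("Tiger", "depicted by", "img:tiger_01"),
    ("Lion", "depicted by", "img:lion_01"),
    ("img:tiger_01", "similar to", "img:lion_01"),
    ("img:tiger_01", "width", 640),  # image nodes carry their own properties
]

image_nodes = {"img:tiger_01", "img:lion_01"}


def image_to_image_edges(triples, images):
    """Return the triples whose head and tail are both image nodes."""
    return [t for t in triples if t[0] in images and t[2] in images]


print(image_to_image_edges(e_mmkg, image_nodes))
print(image_to_image_edges(p_mmkg, image_nodes))
```

In the P-MMKG list the image URLs are leaves, so no triple can start from an image, which is why that construction is simpler.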

4.2 Building Sample MMKG in the Medical Field

The potential of KG in the medical field is enormous, and KG is considered a cornerstone of smart healthcare. Some KG-based work in the medical field has made good progress [4, 39]. The outbreak of COVID-19 in 2019 has had a profound impact on human life, and research on COVID-19 has been a hot topic in recent years. However, the shortcomings of existing medical KGs are also obvious: most are based on textual data, and the few MMKGs contain limited types of image data and do not consider speech data [39]. To provide a better illustration and to facilitate research by experts and scholars, this paper constructs a sample MMKG on COVID-19 that includes textual, image and speech data.

Since there is no need to consider the relationships between images, this paper uses the P-MMKG approach to construct the sample MMKG. The ontology and some of the textual data were taken from COKG-19 (footnote 1). COKG-19 is an open-source KG on COVID-19, primarily based on textual information, jointly released by the AMiner team of the Department of Computer Science and Technology at Tsinghua University and the ZhikuAI team. The KG collected data from 8 COVID-19-related KGs that are open-source on OPENKG (footnote 2). Through algorithms such as entity recognition, semantic matching and disambiguation, and knowledge fusion, the KG merged concepts with the same meaning, differentiated polysemous concepts, and was supplemented and corrected based on the opinions of relevant experts. In recent years, new variants of the COVID-19 virus have emerged. Therefore, this paper added some concepts, attributes, relationships, and instances to COKG-19.

For image data, a web crawler system was built to retrieve entity-related images by collecting the URLs of the top-ranking images from different search engines. Taking into account the cost of manual construction, the sample size was set to 10% of the number of entities. To ensure image quality, we manually adjusted the size of some pictures and deleted low-quality pictures, considering factors such as image size, clarity, and reliability. The most representative pictures were then selected and stored as properties of the nodes. It is worth mentioning that figurative pictures were chosen to convey non-visual concepts such as delirium. In the end, a total of 2700 pictures passed the screening, and some important nodes were assigned multiple images.

For speech data, the content of the dataset is mainly the clinical manifestations of some symptoms. Using Text-To-Speech (TTS) technology, 268 audio files were generated with the open-source API of iFlytek (footnote 3). We refer to the sample MMKG as COMMKG-19. In addition, COMMKG-19 includes extracted English triples. A comparison of the data in COKG-19 and COMMKG-19 is shown in Table 1.

To store the above information and provide URL links, as shown in Fig. 7, this paper has established an open-source website (footnote 4). Protégé is an ontology editor developed by Stanford University that is used to create and maintain ontologies and knowledge graphs. In this paper, Protégé was used to add, modify and supplement the ontology data of COKG-19, such as concepts, relationships and properties, as shown in Fig. 8.

The MMKG was visualized by generating Turtle files and importing the data into a Neo4j graph database, as shown in Fig. 9. In addition, to facilitate the extraction and use of MMKG data, a user interface was designed to retrieve node information, as shown in Fig. 10.
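As a rough illustration of the Turtle generation step, the sketch below serializes a few P-MMKG style triples to Turtle text using only the Python standard library. Everything specific in it is an assumption: the `ex:` namespace, the entity and property names, the URLs, and the helpers `literal` and `to_turtle` are hypothetical and are not taken from COMMKG-19. How the resulting file was loaded into Neo4j is not described in the paper, so the import step is omitted.

```python
EX = "http://example.org/commkg19/"  # placeholder namespace, not the paper's


def literal(text: str) -> str:
    """Format a plain string as a Turtle literal, escaping special characters."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def to_turtle(triples) -> str:
    """Serialize (subject, predicate, object_term) triples, one statement per line."""
    lines = [f"@prefix ex: <{EX}> .", ""]
    for subject, predicate, object_term in triples:
        lines.append(f"ex:{subject} ex:{predicate} {object_term} .")
    return "\n".join(lines) + "\n"


# Object terms are already formatted: a prefixed name, an <IRI>, or a literal.
triples = [
    ("COVID19", "hasSymptom", "ex:Fever"),
    ("Fever", "label", literal("fever")),
    ("Fever", "hasImage", "<https://example.org/img/fever.jpg>"),
    ("Fever", "hasAudio", "<https://example.org/audio/fever.mp3>"),
]

print(to_turtle(triples))
```

Here image and audio data enter the graph as IRIs in the object position, which is one way to store them as node properties while keeping the files themselves on a separate website.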

Fig. 10. User interface

5 Summary

In recent years, KG has made significant progress, and many MMKG-based studies have achieved remarkable advances. To promote a unified understanding of KG in the academic and industrial communities and to use the term “KG” more rigorously, this paper starts from the KG itself, conducts investigations and research, summarizes previous work, proposes a definition of KG, explores the concept of MMKG, and provides a sample MMKG in the medical field.

The work presented in this paper has some limitations. First, the proposed definition of KG needs to be widely recognized and further refined by relevant researchers. Second, the MMKG sample constructed in the medical field has relatively few image and speech data, partly due to the high cost of manual work. Therefore, in future work, we will consider automated processing of speech and image data. In addition, video data have not been considered because research on the video modality is currently limited; however, video data often contain more information and are therefore an important aspect to consider. Due to space constraints, some of the content cannot be described in detail. In the future, we will focus on outlining the MMKG technical system to establish connections between different research fields and promote the development of the KG field.