1 Introduction

The increasing amount of big data has introduced many new challenges in data mining. Traditional data mining performed at the data level may not be highly effective in discovering knowledge for two reasons. Firstly, each attribute at the data level has a unique label under a closed-world assumption. Secondly, it is difficult to infer implicit information among entities. In contrast, at the knowledge level, each attribute may have more than one label, and information is presented by its semantic meaning rather than as raw data. Therefore, mining at the knowledge level can help to infer implicit details and thereby achieve a higher level of knowledge discovery.

Regarding knowledge bases, Wang et al. [11] demonstrated that a medical knowledge base could improve the performance of medical knowledge discovery if it was integrated into the medical domain for relevance assessment. Goh et al. [5] argued that a knowledge base is useful in clinical decision support systems. In the medical area, MEDLINE is a vital source because it contains a significant number of medical articles and is updated every week. However, most researchers focus on using MEDLINE only to identify information. Xu et al. [13] introduced a model to identify drug-disease associations by extracting documents from MEDLINE. Other researchers have suggested new methods for achieving high-quality knowledge discovery. Banuqitah et al. [1] suggested an approach that used more than one level of learning from documents extracted from MEDLINE to improve the discovery of previously hidden knowledge. Therefore, MEDLINE would become more useful if it was processed so that its content could be integrated, as instances, into the concepts of MeSH and then applied in decision support applications.

This study focuses on building a knowledge base, based on Medical Subject Headings (MeSH) and MEDLINE, that facilitates the use of MEDLINE and improves the performance of decision support systems in the medical domain. MeSH comprises a list of concepts that link to the documents in MEDLINE, and each concept is associated with a number of articles in the medical domain. MEDLINE uses MeSH to index articles. MEDLINE holds approximately 27 million articles in the fields of biomedicine and health [3] and is updated frequently. However, researchers cannot directly apply MeSH to data mining because MeSH is only a hierarchy of concepts. This study created feature vectors from documents in MEDLINE and then mapped these feature vectors to concepts in MeSH to build a knowledge base, namely MeKG. This makes it possible to measure the distance between concepts and makes these concepts easier to use. The contributions of this work are the following:

  • Introducing a framework to build a knowledge base that can help to manage more meaningful information.

  • Providing a knowledge base in the medical domain that can improve the result of searching based on semantic relationships among entities.

  • Enabling other work to use this knowledge base to help build medical decision support systems.

The remainder of this paper is organised as follows. In Sect. 2, the study reviews the existing work on mining the knowledge base. Then, the study suggests a framework for building a knowledge base in Sect. 3. Section 4 follows, where the study discusses the advantages of the knowledge base. Finally, the conclusions are presented in Sect. 5.

2 Related Work

Knowledge bases have become a notable topic in the last decade. They play a vital role in data mining because they can help to discover hidden patterns between entities. Consequently, knowledge bases are increasingly being built and applied in data mining. Xu et al. [12] suggested a knowledge-powered method that incorporates knowledge graphs into the learning process to encode the relationships between entities, attributes or properties of objects. This approach has helped to improve the quality of word representations. Bordes et al. [2] suggested a method to learn distributed embeddings of knowledge bases. This approach has helped to generate new, reasonable relations by linking raw text, represented as entity vectors, to knowledge extraction. Similarly, Nguyen et al. [7] investigated a method that leverages semantics from raw text and knowledge resources to achieve high-level representations of documents based on both text embedding and concept-based embedding.

To use conceptual graphs effectively, Shi et al. [9] proposed a model to organise and integrate textual medical knowledge into conceptual graphs. This approach provided semantic mappings between textual medical expertise and medical knowledge, which made it possible to explore complex semantics among entities through chain inferences. This proposal helped to detect and access valuable information in the medical domain. Moreover, Voskarides [10] aimed to explain the relationships among entities of a knowledge graph using sentences taken from documents. Sentences that referred to an entity pair were extracted and enriched through ranking.

Knowledge bases have thus been used successfully in data mining [2, 7, 9, 10, 12]. However, researchers have not previously considered developing a framework for building a knowledge graph that helps to improve the quality of decision support systems. Moreover, MEDLINE, a useful source that contains a large number of articles in the medical domain, has not been fully explored. This source would be significant for healthcare if it could be processed for use in data mining. Therefore, this study presents a framework for building a knowledge graph based on MEDLINE to improve knowledge exploration in the medical domain. In addition, this approach aims to help increase the accuracy of decision support systems.

3 Building Knowledge Base

3.1 The Framework

The study aims to connect specific factors to concepts in order to generate instances, which help to identify the distance between concepts. For example, it is impossible to calculate the semantic distance between Google and Java directly because Google and Java are not concepts. However, by applying a training model, researchers are able to recognise that Google is an instance of a search engine and Java is an instance of a programming language. As a result, the researchers can measure the distance between Java and Google by calculating the paths that exist between the corresponding concepts within the graph. Therefore, this study proposes an approach to improve the quality of healthcare by linking feature vectors (which act as instances) to concepts, which can help to find more relationships between concepts. To deal with this challenge, the research grouped related concepts into subgraphs through Medical Subject Headings. Each subgraph corresponds to a specific disease or subject. Then, the study populated the knowledge base for each subgraph with instances learned from a large number of MEDLINE articles. These instances were mapped to concepts in MeSH to create the MeKG knowledge graph base. MeKG can make a significant contribution to decision-making support by helping practitioners to find other options in terms of possible medication and diagnoses.
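The Google and Java example can be illustrated with a minimal sketch. It assumes a toy, hand-written concept graph and a hypothetical instance-to-concept table (neither is taken from MeSH or MeKG), and it assumes that distance is measured as the number of edges on a shortest path between the two concepts.

```python
from collections import deque

# Hypothetical toy concept graph (undirected adjacency lists); not taken from MeSH.
CONCEPT_EDGES = {
    "technology": ["search engine", "programming language"],
    "search engine": ["technology"],
    "programming language": ["technology"],
}

# Hypothetical instance-to-concept links, as a training model might produce them.
INSTANCE_OF = {"Google": "search engine", "Java": "programming language"}


def concept_distance(start, goal, edges):
    """Return the number of edges on a shortest path between two concepts, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in edges.get(node, []):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two concepts are not connected


def instance_distance(a, b, instance_of, edges):
    """Distance between two instances, measured between the concepts they belong to."""
    return concept_distance(instance_of[a], instance_of[b], edges)


google_java = instance_distance("Google", "Java", INSTANCE_OF, CONCEPT_EDGES)
```

Breadth-first search is used here only because the toy graph is unweighted; a weighted graph would call for a different shortest-path method.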

However, MeSH is only a hierarchy of concepts, whereas MEDLINE is an aggregation of documents in the medical domain that collects published papers. MEDLINE is a metadata repository of biomedical abstracts and one of the most significant data sources for scientific literature. MEDLINE uses MeSH, from the National Library of Medicine, to manually index publications. Currently, MEDLINE holds approximately 27 million records in the fields of biomedicine and health [3]. This study did not consider citations, including the links of articles, or book reviews. The study used only journal articles indexed with MeSH to perform experiments. Six to fifteen MeSH subject headings are assigned to each article in the MEDLINE database. Using MeSH, which stores the concepts of the documents in MEDLINE, is therefore an advantage in building a knowledge base. In this case, MeSH plays an important role as the backbone of the graph for building the MeKG knowledge graph base. This study applied instances learned from MEDLINE to populate a knowledge base constructed according to the following formal definitions.

Definition 1

[Medical Subject Headings]

The Medical Subject Headings are \(\mathbb {C}=\{ c_1, c_2, \dots , c_i \}\), where each c is a concept belonging to \(\mathbb {C}\) and i is the number of concepts.

Definition 2

[Knowledge Graph Base]

The Knowledge Graph Base is a 3-tuple \(\mathbb {KG} = \langle \mathbb {I}, \mathcal {R}, \mathcal {G} \rangle \), where

  • \(\mathbb {I} = \{ \mathcal {I}_{c} \}\), where \(\mathcal {I}_{c} \subset \mathcal {I}, c \in \mathbb {C}\). \(\mathcal {I}\) is the universal set of instances.

  • \(\mathcal {R} = \{r_1, r_2, \dots , r_q\}\) is the set of all relation types in a knowledge graph, where q is the number of relations.

  • \(\mathcal {G}\) is the graph that is generated by \(\mathcal {R}\) and \(\mathbb {I}\).

Definition 3

[MEDLINE]

MEDLINE is \(\mathbb {D}=\{ d_1, d_2, \dots , d_j \}\), where j is the number of documents in MEDLINE. Each document is \(d := \langle \mathcal {T}, map(d) \rangle \), where \(\mathcal {T} = \{t_{1}, t_{2},\ldots ,t_{z} \}\) is the set of terms from d, and \(map(d) \longrightarrow \mathbb {C}_{d} \subset \mathbb {C}\) gives the concepts assigned to d.

Definition 4

[Research Problem]

The research task is to learn instances from \(\mathbb {D}\) based on \(\mathcal {T}\) and map(d). These instances are then associated with concepts \(\mathbb {C}_{d} \subset \mathbb {C}\) to build a knowledge graph \(\mathbb {KG}\).

The main challenge of this study was learning instances from \(\mathbb {D}\), which stores a large number of articles on different subjects in the medical domain. Therefore, this study divided the documents in \(\mathbb {D}\) into different subjects based on the topics in \(\mathbb {C}\). Each concept c in MeSH belongs to a specific subject. Assume that there are n subjects, each containing k concepts. In this case, MeSH is represented as \(\mathbb {C} =\{ c_{11}, c_{22},\ldots ,c_{kn} \}\) and MEDLINE as \(\mathbb {D} =\{ d_{j1}, d_{j2},\ldots ,d_{jn} \}\). This study took advantage of the fact that \(\mathbb {C}\) and \(\mathbb {D}\) are linked through the descriptors of MeSH. Each article in \(\mathbb {D}\) has three to six concepts of \(\mathbb {C}\) that belong to a specific subject. Based on those concepts, a large number of articles can be extracted. The extracted documents were then used to learn instances for those concepts.
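The division of documents into subjects can be sketched as follows. This is a minimal illustration, assuming a hypothetical concept-to-subject table and placeholder documents; the identifiers `c11`, `s1`, `d1` and so on are invented and do not come from MeSH or MEDLINE.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A document d = <T, map(d)>: its terms and its assigned MeSH concepts."""
    doc_id: str
    terms: tuple
    concepts: tuple


def partition_by_subject(documents, subject_of):
    """Group document ids by the subject of each concept assigned to them."""
    groups = defaultdict(list)
    for doc in documents:
        # A document may carry concepts from several subjects.
        subjects = {subject_of[c] for c in doc.concepts if c in subject_of}
        for subject in sorted(subjects):
            groups[subject].append(doc.doc_id)
    return dict(groups)


# Hypothetical concept-to-subject assignment and documents (placeholders only).
SUBJECT_OF = {"c11": "s1", "c21": "s1", "c12": "s2"}
DOCUMENTS = [
    Document("d1", ("heart", "valve"), ("c11",)),
    Document("d2", ("heart", "lung"), ("c21", "c12")),
    Document("d3", ("lung",), ("c12",)),
]

by_subject = partition_by_subject(DOCUMENTS, SUBJECT_OF)
```

In this sketch a document whose concepts span two subjects appears in both groups, which is one possible reading of the text rather than a confirmed design decision.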

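Assume that we want to learn instances from a subject s, where \(1 \le s \le n\). We have \(\mathbb {C}_{s} \subset \mathbb {C}\) and \(\mathbb {D}_{s} \subset \mathbb {D}\), where \(\mathbb {C}_{s} =\{ c_{1s}, c_{2s},\ldots ,c_{ks} \}\) and \(\mathbb {D}_{s} =\{ d_{1s}, d_{2s},\ldots ,d_{js} \}\).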

\(\mathcal {I}_{s}\) is learned by mapping between \(\mathbb {C}_{s}\) and \(\mathbb {D}_{s}\).

$$\begin{aligned} f(\mathcal {I}_{s}) = \sum ^l_{k=1} (c_{ks}\longmapsto t_{zs}) \times \alpha \end{aligned} \tag{1}$$

where \(\alpha \) is a threshold to determine the mapping.
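One possible reading of the thresholded mapping between a concept and terms is sketched below. The text does not state how the mapping weight is computed, so the sketch assumes cosine similarity between a concept vector and term vectors and treats \(\alpha \) as a cut-off; the vectors and term names are hypothetical toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_concept_to_terms(concept_vec, term_vecs, alpha):
    """Return (term, weight) pairs with weight >= alpha, strongest first."""
    scored = ((term, cosine(concept_vec, vec)) for term, vec in term_vecs.items())
    kept = [(term, weight) for term, weight in scored if weight >= alpha]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical two-dimensional vectors, chosen only to make the arithmetic visible.
TERM_VECTORS = {
    "cardiac": [1.0, 0.0],
    "valve": [1.0, 1.0],
    "lung": [0.0, 1.0],
}

mapped = map_concept_to_terms([1.0, 0.0], TERM_VECTORS, alpha=0.5)
```

With these toy vectors, the term orthogonal to the concept vector falls below the cut-off and is not mapped.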

3.2 Building a Knowledge Graph Base

Before MeSH and MEDLINE can be used for building a knowledge graph, their XML format needs to be restructured. Figure 1 presents some important elements of the structure of MEDLINE and MeSH, whereas Figs. 2 and 3 show the new format of MeSH as tables stored in MySQL. The new format provides efficient access to the concepts and relationships stored in MeSH. The study used the Java programming language to write an XML parser for converting the data. The new structure offers a more convenient way to extract information from MEDLINE, which uses MeSH for information retrieval.

Fig. 1. Attributes of MEDLINE and MeSH

Fig. 2. A table extracted from MeSH linking Descriptor to Qualifier

Fig. 3. The relationship among tables extracted from MeSH
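The conversion from XML to relational tables can be sketched as follows. The study used a Java parser and MySQL; this sketch swaps in Python's `xml.etree.ElementTree` and an in-memory SQLite database so that it is self-contained. The XML fragment is a simplified, hand-written approximation of a descriptor record, and the table names, qualifier identifiers and names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment; qualifier ids and all names are placeholders.
SAMPLE_XML = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D006321</DescriptorUI>
    <DescriptorName><String>placeholder descriptor name</String></DescriptorName>
    <AllowableQualifiersList>
      <AllowableQualifier>
        <QualifierReferredTo>
          <QualifierUI>QX1</QualifierUI>
          <QualifierName><String>placeholder qualifier A</String></QualifierName>
        </QualifierReferredTo>
      </AllowableQualifier>
      <AllowableQualifier>
        <QualifierReferredTo>
          <QualifierUI>QX2</QualifierUI>
          <QualifierName><String>placeholder qualifier B</String></QualifierName>
        </QualifierReferredTo>
      </AllowableQualifier>
    </AllowableQualifiersList>
  </DescriptorRecord>
</DescriptorRecordSet>
"""


def load_descriptors(xml_text, conn):
    """Parse descriptor records and store them in three relational tables."""
    conn.executescript("""
        CREATE TABLE descriptor (ui TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE qualifier (ui TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE descriptor_qualifier (descriptor_ui TEXT, qualifier_ui TEXT);
    """)
    root = ET.fromstring(xml_text)
    for record in root.iter("DescriptorRecord"):
        d_ui = record.findtext("DescriptorUI")
        d_name = record.findtext("DescriptorName/String")
        conn.execute("INSERT INTO descriptor VALUES (?, ?)", (d_ui, d_name))
        for ref in record.iter("QualifierReferredTo"):
            q_ui = ref.findtext("QualifierUI")
            q_name = ref.findtext("QualifierName/String")
            # A qualifier can be shared by many descriptors, so insert it only once.
            conn.execute("INSERT OR IGNORE INTO qualifier VALUES (?, ?)", (q_ui, q_name))
            conn.execute("INSERT INTO descriptor_qualifier VALUES (?, ?)", (d_ui, q_ui))
    conn.commit()


conn = sqlite3.connect(":memory:")
load_descriptors(SAMPLE_XML, conn)
links = conn.execute(
    "SELECT qualifier_ui FROM descriptor_qualifier "
    "WHERE descriptor_ui = ? ORDER BY qualifier_ui",
    ("D006321",),
).fetchall()
```

Keeping the descriptor-to-qualifier link in its own table means that all qualifiers of a descriptor can be retrieved with a single indexed query instead of a scan of the XML file.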

After the structure of MeSH and MEDLINE has been rebuilt, both can be accessed easily to extract all the articles related to specific fields. The field depends on the purpose of the data extraction. MeSH contains many types of topics. Each subject may have several different objects, and objects of the same type may have a list of concepts and terms. However, MeSH is still a hierarchy of concepts and terms, and it cannot be used directly for knowledge discovery and data mining. Therefore, this approach creates instances that link to the concepts from MeSH, which help to find the semantic distance between concepts. The study uses a word vector space to create feature vectors that correspond to instances. The word vector space, also called word embedding, was used to calculate the weight of each concept and term. Word2Vec, introduced by Mikolov et al. [6], is a successful word vector space algorithm for generating feature vectors. This technique was used by Ganguly et al. [4] and Zheng et al. [14] to measure the semantic similarity among documents. In this study, Word2Vec is also used for learning instances.

Assume we want to find the relationships among subjects concerning heart disease. We select all the articles related to the heart from MEDLINE based on the descriptors of MeSH. In this case, the descriptor that represents heart disease in MeSH has the identifier D006321. Figure 4 shows the associations between MeSH and MEDLINE. Based on the identifier of the descriptor, the study extracts all articles related to heart disease. These documents contain a list of terms regarding heart disease. To ensure enough data for calculating the weight of words, the research cleaned the data by removing stop words and applying stemming. Then, the study used the word2vec algorithm [6] to convert the extracted data into a vector space. In this experiment, the study trained the model with both the continuous bag of words and the skip-gram methods. The skip-gram was selected for the final generation of feature vectors because it obtained a better result than the continuous bag of words. The parameters of the skip-gram method that achieved the highest performance in this experiment were set_Min_Vocabulary_Frequency (5), use_Number_Threads (20), set_Window_Size (10), set_Layer_Size (200), use_Negative_Samples (10) and set_Number_Iterations (5). This process helped to calculate the semantic relationship between terms related to heart disease and to identify similar neighbours for a given word. The weight between terms was determined by a coefficient \(\alpha \) ranging from 1 to 0. Finally, the coefficients \(\alpha \) learned from this training, which are called instances, play an essential role in populating the knowledge for the graph. Instances can be linked to concepts from MeSH through a mapping between the selected features and the objects for heart disease. The mapping between features and concepts helps to create a knowledge graph that assists in finding hidden meaning between concepts. This knowledge graph is presented in Fig. 5.
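The preprocessing and the skip-gram setting described above can be illustrated with a small sketch of how (target, context) training pairs are formed. It is not the training procedure used in the study: the stop-word list is a tiny hypothetical one, stemming is omitted, the window is fixed rather than randomly shortened as in Word2Vec implementations, and no neural model is trained. Only the default window size (10) and minimum vocabulary frequency (5) are taken from the parameter list.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "of", "and", "in", "is", "a"}


def tokenize(text):
    """Lower-case the text, keep alphabetic tokens and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def skipgram_pairs(sentences, window_size=10, min_frequency=5):
    """Build (target, context) pairs from tokenised sentences.

    Words seen fewer than min_frequency times in the corpus are discarded
    before the context window is applied.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    pairs = []
    for sentence in sentences:
        kept = [word for word in sentence if counts[word] >= min_frequency]
        for i, target in enumerate(kept):
            start = max(0, i - window_size)
            stop = min(len(kept), i + window_size + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.append((target, kept[j]))
    return pairs


tokens = tokenize("The heart is a muscle.")
pairs = skipgram_pairs(
    [["heart", "failure", "risk"], ["heart", "failure"]],
    window_size=1,
    min_frequency=2,
)
```

In the small example, "risk" occurs only once and is removed by the frequency filter, so every remaining pair links "heart" and "failure".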

Fig. 4. Mapping between MeSH and MEDLINE

Fig. 5. Knowledge Graph with instances learned from MEDLINE

4 Discussions

An increasing amount of healthcare research mines data at the data level. With these methods, hidden relationships among objects may not be fully explored because the meaning of objects is ambiguous. To improve the performance of a decision support system and raise it to the knowledge level, the study suggested a method to create a knowledge base that focuses on finding the precise meaning of each concept as well as the hidden relationships between concepts. For example, if the data level is used to find relationships between smoking and lung cancer, the researcher may not be able to work out how smoking causes cancer. However, using instances that are learned from the knowledge graph may help researchers understand how smoking causes lung cancer. Based on the knowledge graph, the similarities and differences among objects can be expressed as weights. This advantage helps to achieve higher performance in mining medical and healthcare data.

Specifically, a knowledge base has strong capabilities to improve the performance of classification models. For example, if a classification model is built only at the data level, it may not be useful because it ignores the semantic relationships among objects. At the knowledge level, these issues can be resolved. A knowledge base may help to discover more unknown objects, which can help the classification model to generate an effective result. At the data level, a classification model tends to use all attributes of a dataset to predict results [8]. At the knowledge level, however, the classification model can reject all attributes that are not related to the topic, based on the semantic relationships among the attributes of a dataset. In addition, eliminating noise variables or noise properties helps to improve the accuracy of the result and plays a significant role in classification models. Therefore, applying a knowledge base in the development of classification models promises to bring positive results.
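The idea of rejecting attributes that are not related to the topic can be sketched as a simple filter. This is an assumption about how such a step might look, not a method evaluated in this study: the attribute names and relatedness weights are invented for illustration, and in practice the weights would come from the knowledge graph.

```python
def select_attributes(relatedness, alpha):
    """Keep the attributes whose relatedness to the target topic is at least alpha."""
    return [name for name, weight in relatedness.items() if weight >= alpha]


def project(rows, attributes):
    """Reduce each record to the selected attributes before classification."""
    return [{name: row[name] for name in attributes} for row in rows]


# Hypothetical relatedness weights between dataset attributes and a target topic.
RELATEDNESS = {"cholesterol": 0.82, "blood_pressure": 0.74, "favourite_colour": 0.05}

# Hypothetical records; the values are placeholders, not experimental data.
ROWS = [
    {"cholesterol": 240, "blood_pressure": 150, "favourite_colour": "blue"},
    {"cholesterol": 180, "blood_pressure": 118, "favourite_colour": "red"},
]

selected = select_attributes(RELATEDNESS, alpha=0.5)
reduced = project(ROWS, selected)
```

The reduced records could then be passed to any classifier; the choice of cut-off is left open here.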

5 Conclusions

Mining data at the data level is challenging because of ambiguous meaning and the increasing volume of big data. In contrast, mining data at the knowledge level may help to discover relationships that are hidden between entities. Therefore, this study introduced a method to build the MeKG knowledge graph base to help improve the performance of decision support systems. A vector space model was used to generate feature vectors for linking to concepts of MeSH. The model was trained on all documents extracted from MEDLINE for a specific topic. These feature vectors helped to create instances. Finally, the instances were mapped to concepts and terms of MeSH to build the MeKG knowledge graph base. This study contributes a knowledge graph base for healthcare. Additionally, it can help medical researchers and practitioners to achieve high search performance at the knowledge base level. The MeKG knowledge graph base can also play an essential role in search systems because it helps to resolve the ambiguous meaning of each object.

In future research, the study aims to apply this knowledge base to classification models to improve their predictive capability.