1 Introduction

Cardiovascular disease is a serious threat to human health and is considered one of the major causes of death worldwide. In China, its mortality rate remains the highest among all diseases. With the aging of society, the acceleration of urbanization, and the prevalence of unhealthy lifestyles among Chinese residents, exposure to the risk factors of cardiovascular disease is widespread. At the same time, the national burden of cardiovascular disease is growing increasingly heavy.

Since 2004, the average annual growth rate of hospitalization expenses for cardiovascular disease has been much higher than the growth rate of gross domestic product (GDP) [1]. Being able to predict hospitalization expenses in advance is therefore of great significance to both patients and hospitals [2], and selecting features according to the sample data and doctors’ needs is crucial. Feature selection that improves prediction accuracy and, combined with doctors’ prior knowledge, effectively reduces the prediction error rate is a major research topic in machine learning [3]. Many researchers have developed algorithms to predict the hospitalization costs of cardiovascular diseases. However, these systems suffer from unsatisfactory accuracy on real-world data sets [4] and do not accommodate the differing requirements of practicing doctors in feature selection.

The concept of the knowledge graph was proposed by Google in May 2012 as the basis for building a next-generation intelligent search engine. In essence, a knowledge graph is a semantic network that reveals the relationships between entities and provides a formal description of real-world things and their relationships. With the proliferation of Semantic Web resources and the publication and sharing of vast amounts of RDF data, researchers in academia and industry have devoted great effort to building a variety of structured knowledge bases. These knowledge bases can be roughly divided into two categories: open linked knowledge bases and vertical industry knowledge bases. Typical examples of open linked knowledge bases are Freebase, Wikidata, OpenKG, and YAGO; typical examples of vertical industry knowledge bases are IMDB (movie data), MusicBrainz (music data), and ConceptNet (semantic knowledge network).

We apply the knowledge graph to the medical field [5]: we use the knowledge graph together with interaction with doctors for feature selection, and use the selected data to predict the hospitalization cost of cardiovascular diseases.

The main contributions of this paper include:

  a) We create the medical health concept knowledge graph (MCKG) using open source knowledge graphs such as Wikidata and OpenKG and open source knowledge bases such as the language specification defined by UMLS.

  b) Based on MCKG, we build the medical instance knowledge graph (MIKG) with real data from cooperative hospitals.

  c) We use the constructed knowledge graph to conduct feature selection and obtain a feature alternative scheme. Doctors then apply their rules and requirements to this scheme to obtain the final feature selection scheme.

  d) We use the selected feature data to predict the hospitalization cost of cardiovascular disease; in our experiments, this reduces the average prediction error rate.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 introduces the methodology for constructing MCKG and MIKG. Section 4 presents the experiments and prediction results. Finally, Sect. 5 concludes the paper.

2 Related Work

Knowledge graphs are an important part of artificial intelligence technology [6]. Research that builds on core knowledge graph technologies such as knowledge extraction and knowledge representation [7] has become a hot trend in the field. Knowledge graphs have broad application prospects in medical services: the technology can address the highly specialized data and complex structures of the medical field, improve medical and health services [8], and play an important role in clinical decision support systems [9].

At present, most studies related to cardiovascular diseases use the UCI Cleveland data set [10]. For feature selection, Senthilkumar Mohan et al. proposed a hybrid random forest with a linear model, which uses an artificial neural network with feedback for feature selection [11]. Fajri et al. used a discrete minimum wavelet method for feature selection [12]. Ali et al. used exhaustive search to find the best network configuration for selecting relevant features from the feature space [13], and Fatih et al. selected features based on a simplified rule library [14]. Jesmin et al. combined medical knowledge with computational intelligence to remove clinical features [15]. Prakash et al. used an optimality criterion feature selection method [16]. Chandra Babu Gokulnath et al. combined a genetic algorithm with a support vector machine to select features in the feature space [17]. Ting-Ting Zhao et al. used discriminant minimum class locality preserving canonical correlation analysis to extract features from two data sets based on the gain and entropy of motion vectors [18]. Sarah P et al. used a convolutional neural network for feature selection [19]. Ashir Javeed et al. employed a random search algorithm to select relevant features [20].

None of these feature selection methods is combined with a knowledge graph. In this paper, we use a different approach. We first construct MCKG from doctors’ prior knowledge, open source knowledge bases, and open source knowledge graphs, and then integrate the structured data of the hospital database with case data to obtain MIKG. We use MIKG for feature selection to obtain a feature alternative scheme, and then screen this scheme further according to the rules defined by doctors and their actual needs to obtain the final feature selection scheme.

3 Methodology

Most existing medical knowledge graphs are constructed from medical literature published on the Internet, public data sets, and electronic medical records. Although such data are easy to obtain, they suffer from limited knowledge sources, low data purity, and data redundancy. In addition, existing feature selection methods are rarely combined with knowledge graphs. Screening the hospitalization features of cardiovascular diseases using the more efficient data storage of a medical knowledge graph together with doctors’ more authoritative medical knowledge can effectively reduce the average error of cost prediction.

To address these problems, this paper proposes the following method. Open source knowledge graphs such as Wikidata and OpenKG, open source knowledge bases such as the medical language specifications defined by UMLS, and doctors’ prior medical knowledge are used to construct the medical concept knowledge graph (MCKG). The medical instance knowledge graph (MIKG) is then completed using cardiovascular disease case data from cooperative hospitals. Based on the constructed MIKG, we obtain a feature alternative scheme, and then apply doctors’ actual needs and rules to it to generate the final feature selection scheme.

We compare three groups of data: the features selected by the knowledge graph combined with doctors’ interaction, all features, and the features selected by the random search algorithm [20]. Each group is evaluated with three machine learning algorithms from different families: random forest [21], support vector machine [22], and linear regression [23]. The training set and the test set are split at a ratio of 70%:30%, and the evaluation criterion is the average hospitalization cost prediction error. With the features selected by our method, the random forest algorithm reduces the average error rate to 11.86%, a significant improvement over the feature data selected by the other methods. Figure 1 shows the core process of this paper:

Fig. 1. Hospital cost prediction flow chart

MCKG’s data sources mainly include public medical knowledge bases, medical knowledge graphs, and unified medical standards and specifications; MCKG in turn guides the construction of MIKG. The data sources of MIKG are mainly the structured data of the cooperative hospitals and the unstructured data entities annotated by doctors. The construction of MCKG and MIKG is described in detail in Sect. 3.1 and Sect. 3.2.

3.1 The Construction of MCKG

Natural language is characterized by polysemy and synonymy, so traditional medical knowledge graphs suffer from concept confusion. In this paper, open source knowledge graphs published on the Internet, such as Wikidata and OpenKG, are combined with the prior medical knowledge of doctors in cooperative hospitals. The doctors’ dictionary knowledge, expressed in the unified standardized language provided by UMLS, is imported into the concept knowledge graph. The knowledge graph is defined with entities as nodes and relationships and attributes as edges. In ontology notation, a triple (entity-relationship-entity) represents two associated nodes.
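As a rough illustration of the triple representation described above, the following sketch stores a few concept-level triples and indexes them by subject. The class names and the entity and relation labels (`Patient`, `Surgery`, `has_a`, `attribute_of`) are illustrative assumptions, not the actual MCKG vocabulary or implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One (subject, predicate, object) statement in the graph."""
    subject: str
    predicate: str
    obj: str


class TripleGraph:
    """Minimal in-memory triple store: entities are nodes, predicates are edges."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def add(self, subject, predicate, obj):
        """Record one triple under its subject."""
        self._by_subject[subject].append(Triple(subject, predicate, obj))

    def outgoing(self, subject):
        """Return all triples whose subject is the given entity."""
        return list(self._by_subject.get(subject, []))

    def __len__(self):
        return sum(len(ts) for ts in self._by_subject.values())


# Hypothetical concept-level triples (labels are assumptions for illustration).
mckg = TripleGraph()
mckg.add("Patient", "has_a", "Surgery")
mckg.add("Patient", "has_a", "Diagnosis")
mckg.add("SurgeryType", "attribute_of", "Surgery")

# Entities directly linked from the patient concept.
neighbours = [t.obj for t in mckg.outgoing("Patient")]
```

Indexing by subject keeps the lookup of an entity's outgoing edges cheap; a production system would more likely rely on a dedicated RDF store.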

The constructed MCKG includes medical knowledge in both Chinese and English as well as the medical specifications defined by UMLS; its main data sources are medical knowledge bases, medical knowledge graphs, and doctors. MCKG includes 8,298,580 medical concepts from 116 word lists and 51 entity words from the cooperative hospital. Part of the MCKG constructed in this paper is shown in Fig. 2 (only part of the concept graph is shown). The graph strictly follows the UMLS definition specification, which helps in accurately understanding the concepts and the relationships between entities.

Fig. 2. Medical concept knowledge graph

As the figure shows, the entity part comprises a surgery part, a patient part, and a diagnosis part. The attribute values are filled in by the MIKG described in the next section. MCKG has two main roles: first, it clarifies the relationships between the various parts of the graph; second, it guides the construction of MIKG.

3.2 The Construction of MIKG

MIKG is constructed under the guidance of the MCKG described in Sect. 3.1. The cardiovascular disease data sets used in this paper all come from the cooperative hospital. The data are divided into structured and unstructured data.

Instantiating the knowledge graph is mainly a process of knowledge extraction. We first generate a medical dictionary based on the latest cardiovascular disease diagnosis rules defined by experts. For unstructured data (the medical records of some patients), we apply entity annotation: we define relevant rules, extract the features related to hospitalization costs, and import them into MIKG. Table 1 gives an example of entity annotation of unstructured data (only part of one patient’s case data is shown):

Table 1. Entity annotation sample

The table shows that we divide the types of entity annotation into four categories: symptom entity, examination entity, disease entity and medication entity. The entity tags serve as the entity node of MIKG and they are imported into MIKG in the form of RDF triples.

The structured data on cardiovascular disease come from the hospital database and include basic patient information, surgical information, diagnosis information, and other information. The structured data are mapped from relational (ER) data to RDF data according to mapping rules. For example, if a table contains “cardiovascular diseases” and related hospital information, we can map it to an RDF triple. The goal of instantiating MCKG is to extract the entities and relationships of cardiovascular diseases from textual and structured data, then to visualize MIKG and select features through interaction with doctors.
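The relational-to-RDF mapping can be sketched as follows, assuming a simple rule in which the row key becomes the subject, each remaining column becomes a predicate, and each non-null cell becomes a literal object. The table name `PAT_MASTER_INDEX`, the base IRI, and the row values are hypothetical; the paper does not specify its mapping rules at this level of detail.

```python
def row_to_triples(table, key_column, row, base="http://example.org/mikg/"):
    """Map one relational row to (subject, predicate, object) triples.

    The row's key becomes the subject IRI, each remaining column becomes a
    predicate, and each non-null cell becomes a quoted literal object.
    """
    subject = f"<{base}{table}/{row[key_column]}>"
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # skip the key itself and missing values
        predicate = f"<{base}{table}#{column}>"
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append((subject, predicate, f'"{escaped}"'))
    return triples


def to_ntriples(triples):
    """Serialize triples in an N-Triples-like line format."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical desensitized row; the values are made up for illustration.
row = {"PATIENT_ID": "P001", "SEX": "F", "CHARGE_TYPE": "medical insurance", "AGE": None}
triples = row_to_triples("PAT_MASTER_INDEX", "PATIENT_ID", row)
print(to_ntriples(triples))
```

Each output line is one statement about the patient, so rows from several tables that share the same key would accumulate under the same subject node.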

First, under the guidance of doctors, seven tables related to the prediction of hospitalization costs for cardiovascular diseases were extracted, as shown in Table 2:

Table 2. Related ER data

We extract the entities of all patient records in these tables, namely the patient entity, surgery entity, diagnosis entity, diagnosis result entity, diagnosis type entity, main index entity, and medical order entity (all information related to patients’ privacy has been desensitized):

  1. Patient entity extraction: the patient ID (PATIENT_ID) and the admission ID (VISIT_ID) of each patient with cardiovascular disease are extracted as the attribute values of the patient entity.

  2. Surgery entity extraction: because patients’ conditions and the operations performed differ, the different types of operations, such as vascular exploration, coronary angiography, and coronary artery bypass grafting, are extracted from each patient’s surgery entities as subclass entities of the surgery entity.

  3. Diagnosis entity extraction: each patient’s examination number and examination date, together with basic information such as gender and age, are extracted as subclass entities of the diagnosis entity.

  4. Diagnosis result entity extraction: diagnosis results differ from patient to patient; indicators such as WBC, NEUT%, and RBC, as well as the diagnosis result time, are extracted as subclass entities of the diagnosis result entity.

  5. Diagnosis type entity extraction: patients receive different types of diagnosis depending on the type of cardiovascular disease. Diagnosis types such as vascular headache, carotid atherosclerosis, and coronary atherosclerotic heart disease are extracted from the diagnosis entities as diagnosis type entities.

  6. Main index entity extraction: each patient’s payment type (such as out-of-pocket, public expense, or medical insurance), as well as entities such as place of birth and date of birth, are extracted as subclass entities of the main index entity.

  7. Medical order entity extraction: entities such as the medical examinations performed on each patient, the cardiovascular drugs used, the corresponding doses, and the start and end times of the medication are extracted as subclass entities of the medical order entity.

To sum up, the relationships between different entities are extracted by applying MCKG to MIKG. For example, the relationship between the patient entity and the patient entity is has_a, the relationship between the type of surgery and the surgery entity is attribute_of, and the relationship between the type of diagnosis and the diagnosis entity is attribute_of. After the cardiovascular disease data from the cooperative hospital are converted into RDF data, the resulting medical instance knowledge graph is as shown in Fig. 3:

Fig. 3. Medical instance knowledge graph

The constructed medical instance knowledge graph is 83.6 GB in size and contains 698,946,023 triples. With the patient entity at the center, the different entity nodes (rectangular nodes in the figure) are connected to it, and each entity node is connected to its corresponding instance nodes (elliptical nodes in the figure).

3.3 KG Feature Selection

Based on the generated MCKG and MIKG, we reduced the as many as 189 features provided by the original database to 47 features, as shown in Fig. 3. Combining the selected knowledge graph with doctors’ needs and rules, and considering the nodes within 1–2 steps of the patient’s hospitalization information, we finally extract the nine features most highly correlated with hospitalization costs, as shown in Fig. 4:

Fig. 4. Knowledge graph feature selection

The dotted frame in the figure encloses the alternative scheme submitted to the doctor, who selects features based on his or her prior medical knowledge and clinical needs. ORDER_CODE is the doctor’s advice code, AD_TIME is the patient’s admission time, DISCHARGE_TIME is the patient’s discharge time, CHARGE_TYPE is the patient’s payment type, REPORT_ITEM_NAME is the patient’s examination item, SEX is the patient’s sex, Age is the patient’s age, OPERATION_DESC is the patient’s type of operation, and TOTAL_COSTS is the total cost of the patient’s hospitalization.
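One way to read the "1–2 steps" criterion is as a hop limit in the graph: candidate features are the nodes reachable from the patient node within two edges. The sketch below implements that reading with a breadth-first search. The edge layout, the intermediate node names (`Surgery`, `Order`, `DrugBatch`), and the function name are assumptions for illustration, and the subsequent screening by doctors is not modeled.

```python
from collections import deque


def features_within_hops(edges, start, max_hops):
    """Breadth-first search returning nodes within max_hops of start.

    edges is an iterable of (node_a, node_b) pairs treated as undirected.
    Returns a dict mapping each reached node (excluding start) to its hop count.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]
    return distance


# Hypothetical fragment of the instance graph; the edge layout is an assumption.
edges = [
    ("Patient", "SEX"),
    ("Patient", "Surgery"),
    ("Surgery", "OPERATION_DESC"),
    ("Patient", "Order"),
    ("Order", "ORDER_CODE"),
    ("ORDER_CODE", "DrugBatch"),  # three hops from Patient, so it is excluded
]
candidates = features_within_hops(edges, "Patient", max_hops=2)
```

The returned hop counts could then be shown to doctors alongside each candidate, so that the final choice remains a clinical decision rather than a purely structural one.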

4 Experiments

To verify the effectiveness of feature selection using MIKG combined with doctors’ interaction, we divide the experimental data into three groups:

The first group is the data of all features (DAF), the second group is the data filtered by the random search algorithm (RSA), and the third group is the data filtered by the MIKG described in Sect. 3 combined with doctors’ knowledge (KG-D).

The data set used in the experiment is an RDF triple data set, from which 10,000 triples are randomly selected. The training set and the test set are split at a ratio of 70%:30%.
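A 70%:30% split of the sampled records can be sketched as below. The function name, the fixed seed, and the use of integers as stand-ins for records are assumptions; the paper does not state how the split was implemented.

```python
import random


def split_70_30(records, seed=0):
    """Shuffle records reproducibly and split them 70% train / 30% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 10,000 sampled records; integers replace real triples here.
train, test = split_70_30(range(10_000))
```

Fixing the seed makes the split repeatable, which keeps the three feature groups comparable across runs.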

Experimental environment: processor, Intel® Core™ i5-8265U CPU @ 1.8 GHz; RAM, 8.0 GB; operating system, Windows 10. The experiment is implemented in Python 3.7.

The experiment uses three machine learning algorithms: SVM, RF, and LR. The average prediction error of hospitalization cost (Averr) is used as the evaluation index.

$$ \mathrm{Averr} = \frac{1}{n}\sum_{i=1}^{n} \left| \frac{\hat{y}_{i} - y_{i}}{y_{i}} \right| \times 100\% $$
(1)

Here \( \hat{y}_{i} \) is the predicted hospitalization cost, \( y_{i} \) is the actual hospitalization cost, and n is the total number of samples. The feature selection process of RSA-RF [20] is shown in Fig. 5:

Fig. 5. RSA feature selection
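For reference, the evaluation index in Eq. (1) can be computed as in the following sketch. The function name and the sample costs are made up for illustration and are not values from the experiment.

```python
def average_error_rate(predicted, actual):
    """Mean absolute relative error of Eq. (1), expressed in percent."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y_hat, y in zip(predicted, actual):
        if y == 0:
            raise ValueError("actual cost must be non-zero")
        total += abs((y_hat - y) / y)
    return total / len(actual) * 100.0


# Made-up costs for illustration only; these are not values from the paper.
actual = [10000.0, 20000.0, 40000.0]
predicted = [11000.0, 18000.0, 40000.0]
averr = average_error_rate(predicted, actual)
```

Because each error is divided by the actual cost, a fixed absolute error weighs more heavily on inexpensive stays than on expensive ones.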

The average prediction error rate of this experiment is shown in Table 3:

Table 3. Average prediction error for different feature selection methods

The table shows that when all the cardiovascular disease features of patients are used to predict hospitalization cost, the prediction error is high regardless of which model is used (SVM, LR, or RF). When the features selected by the random search algorithm are used, the prediction error is reduced. When the features selected by MIKG combined with doctors’ interaction are used, the prediction error is reduced significantly; RF achieves the best result, with the cost prediction error reduced to 11.86%. This experiment demonstrates the effectiveness of the proposed method.

5 Conclusion and Future Work

We build MCKG using open source knowledge graphs such as Wikidata and OpenKG, open source knowledge bases such as the medical language specifications defined by UMLS, and doctors’ prior medical knowledge. We then integrate the structured data in the database of the cooperative hospital with the unstructured data processed by doctors through entity annotation. We select features through this knowledge graph to obtain a feature alternative scheme, and then apply doctors’ clinical needs and defined rules to obtain the final feature selection. We compare the RDF data corresponding to the features selected by our method with the data for the features selected by the random search algorithm and with the data for all features related to hospitalization costs, using three machine learning algorithms from different families (SVM, RF, and LR) to predict hospitalization costs. The experimental results show that our feature selection method combined with the random forest algorithm effectively reduces the prediction error of cardiovascular disease hospitalization costs.

Future work will focus on applying MIKG to other data sets, exploring different deep learning models, and extending MIKG to broader areas of artificial intelligence.