1 Introduction

Diabetes is a chronic metabolic disease characterized by high blood glucose levels. Untreated or uncontrolled diabetes can cause a range of complications, including acute ones such as diabetic ketoacidosis and chronic ones such as cardiovascular disease and diabetic nephropathy. With rapid economic development and changes in lifestyle, China has become the country with the most diabetes patients in the world: the prevalence of diabetes in Chinese adults is about \(11.2\%\) and still increasing [1]. The medical expenses from diabetes without complications already account for \(8.5\%\) of national health expenditure in China [2]. As a result, diabetes is a serious public health problem for the realization of “Healthy China 2030”, and solving it requires interdisciplinary innovation.

Knowledge graphs (KGs) have proven effective in modeling structured information and conceptual knowledge, especially in the medical domain [3]. Medical knowledge graphs are attracting attention from both academia and the healthcare industry because of their value in intelligent healthcare applications, such as clinical decision support systems (CDSSs) for diagnosis and treatment [4, 5] and self-diagnosis utilities that help patients evaluate their health conditions based on symptoms [6, 7]. A high-quality entity and relation corpus is crucial for constructing a knowledge base; however, no dataset dedicated to diabetes is currently available. To address this issue, we introduce DiaKG, a high-quality Chinese dataset for diabetes knowledge graph construction.

The contributions of this work are as follows:

  1. To the best of our knowledge, this is the first diabetes dataset for medical knowledge graph construction at home and abroad.

  2. In addition to the medical experts, we also introduce AI experts to participate in the annotation process to provide data insight, which improves the usability of DiaKG and finally benefits the end-to-end model performance.

We hope the release of this corpus can help researchers develop knowledge bases for clinical diagnosis, drug recommendation, and auxiliary diagnostics, and support further research on diabetes. The dataset is publicly available at https://tianchi.aliyun.com/dataset/dataDetail?dataId=88836

2 DiaKG Construction

2.1 Data Resource

The dataset is derived from 41 diabetes guidelines and consensus statements published in authoritative Chinese journals. They cover a broad range of recent research topics and hotspots, including clinical research, drug usage, clinical cases, and diagnosis and treatment methods. Hence they form a quality-assured resource for constructing a diabetes knowledge base.

2.2 Annotation Guide

Two seasoned endocrinologists designed the annotation guide. The guide focuses on entities and relations since these two types are the fundamental elements of a knowledge graph.

Entity. 18 types of entities are defined (Table 1). Nested entities are allowed; for example, a span annotated as a ‘Disease’ entity may contain a shorter span annotated as a ‘Class’ entity. Entities in DiaKG have two characteristics that stand out: 1. An entity may belong to different types depending on its context; for example, the same mention can be a ‘Disease’ type in one sentence and serve as a ‘Reason’ type in another; 2. Some entity types have long spans; for example, a ‘Pathogenesis’ entity usually consists of a whole sentence.
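The nested-entity setting described above can be illustrated with a minimal Python sketch. The `Entity` class, the `nested_pairs` helper, and the character offsets are hypothetical and chosen only for illustration; they are not the DiaKG file format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    start: int   # inclusive character offset (hypothetical convention)
    end: int     # exclusive character offset
    label: str   # one of the 18 entity types, e.g. "Disease"


def nested_pairs(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Return (outer, inner) pairs where inner lies entirely inside outer."""
    pairs = []
    for outer in entities:
        for inner in entities:
            if inner is outer:
                continue
            contained = outer.start <= inner.start and inner.end <= outer.end
            shorter = (inner.end - inner.start) < (outer.end - outer.start)
            if contained and shorter:
                pairs.append((outer, inner))
    return pairs


# Made-up offsets: a 'Class' span nested inside a 'Disease' span,
# plus an unrelated 'Drug' span elsewhere in the text.
entities = [
    Entity(0, 5, "Disease"),
    Entity(0, 2, "Class"),
    Entity(8, 12, "Drug"),
]
pairs = nested_pairs(entities)
```

A flat sequence-labeling scheme assigns one tag per character and so cannot represent both spans of such a pair at once, which is why a model that supports nested spans is needed for this dataset.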

Table 1. List of entities

Relation. Relations are centered on the ‘Disease’ and ‘Drug’ types, and a total of 15 relations are defined (Table 2). Relations are annotated at the paragraph level, so entities from different sentences may form a relation, which makes the relation extraction task more difficult. In only \(43.4\%\) of the relations in DiaKG do the head entity and the tail entity appear in the same sentence.
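A statistic such as the same-sentence share can be computed roughly as follows. This is a sketch under stated assumptions: each relation is reduced to the character offsets of its head and tail, and sentence boundaries are given as a sorted list of start offsets. The function names and the toy numbers are illustrative and not part of the DiaKG release.

```python
from bisect import bisect_right
from typing import List, Tuple


def sentence_index(offset: int, sentence_starts: List[int]) -> int:
    """Map a character offset to the index of the sentence containing it."""
    return bisect_right(sentence_starts, offset) - 1


def same_sentence_ratio(
    relations: List[Tuple[int, int]], sentence_starts: List[int]
) -> float:
    """Fraction of (head_offset, tail_offset) pairs inside one sentence."""
    if not relations:
        return 0.0
    same = sum(
        sentence_index(head, sentence_starts) == sentence_index(tail, sentence_starts)
        for head, tail in relations
    )
    return same / len(relations)


# Toy paragraph with three sentences starting at offsets 0, 20, and 45.
starts = [0, 20, 45]
toy_relations = [(3, 10), (5, 30), (22, 40), (21, 50)]
ratio = same_sentence_ratio(toy_relations, starts)  # 2 of 4 pairs -> 0.5
```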

Table 2. List of relations

2.3 The Annotation Process

The annotation process is shown in Fig. 1. The process can be divided into two steps:

OCR Process. The PDF files were converted to plain text with an OCR tool (footnote 1), and non-text data such as figures and tables were removed manually. In addition, 2 annotators manually checked the OCR results character by character to catch misrecognized characters.

Annotation Process. 6 M.D. candidates were employed and trained thoroughly by our medical experts so that they had a comprehensive understanding of the annotation task. During the trial annotation step, we also invited 2 AI experts to label the data at the same time, on the assumption that AI experts could provide data insight from the model’s perspective. For example, medical experts are inclined to label a disease name together with its English name and abbreviation as one whole entity, while AI experts consider it more model-friendly to treat the name, ‘maturity-onset diabetes of the young’, and ‘MODY’ as three separate entities. Feedback from the AI experts and the annotators was sent back to the medical experts to refine the annotation guideline iteratively. The formal annotation step was then carried out by the 6 M.D. candidates, with 1 medical expert giving timely help when needed. The Quality Control (QC) step was conducted by the medical experts to guarantee data quality, and common annotation problems were corrected in batch mode. The final quality was evaluated by the other medical expert via random sampling of 300 records. The accuracy rates for entities and relations are \(90.4\%\) and \(96.5\%\), respectively, demonstrating the high quality of DiaKG. The examined dataset contains 22,050 entities and 6,890 relations, which is empirically adequate for a single disease.
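The sampling-based quality check can be sketched as below. This is only an illustration of estimating accuracy from a random sample: the `sampled_accuracy` function, the `is_correct` callback (standing in for the expert's judgment), and the fixed seed are assumptions, not the procedure actually used for DiaKG.

```python
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def sampled_accuracy(
    records: Sequence[T],
    is_correct: Callable[[T], bool],
    sample_size: int = 300,
    seed: int = 0,
) -> float:
    """Estimate annotation accuracy on a random sample drawn without replacement."""
    rng = random.Random(seed)            # fixed seed keeps the sample reproducible
    k = min(sample_size, len(records))   # never ask for more than we have
    sample = rng.sample(list(records), k)
    if not sample:
        return 0.0
    return sum(1 for record in sample if is_correct(record)) / len(sample)


# Toy usage: 1,000 dummy records that a reviewer would all accept.
all_good = sampled_accuracy(list(range(1000)), lambda record: True)
```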

2.4 Data Statistics

Detailed statistical information for DiaKG is shown in Table 1 and Table 2.

Fig. 1. The annotation process of the diabetes dataset.

3 Experiments

We conduct Named Entity Recognition (NER) and Relation Extraction (RE) experiments to evaluate DiaKG. The codebase is public on GitHub (footnote 2), and the implementation details are also described in the GitHub repository.

3.1 Named Entity Recognition (NER)

We only report results from the model of Li et al. (2019) [8], since it is the SOTA model for NER with nested entities at the time of writing.

3.2 Relation Extraction (RE)

The RE task is defined as follows: given the head entity and the tail entity, classify the relation type. Because of this simplified setting, we report results from a bi-directional GRU with attention [9] in this paper.
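To make the simplified RE setting concrete, the sketch below builds one classification example from a text and two given entity spans. The marker tokens `<h>`, `</h>`, `<t>`, `</t>`, the `mark_entities` helper, and the English toy sentence are assumptions for illustration; the source does not describe the preprocessing used with the GRU-attention model, and DiaKG itself is in Chinese.

```python
from typing import Tuple

Span = Tuple[int, int]  # [start, end) character offsets


def mark_entities(text: str, head: Span, tail: Span) -> str:
    """Wrap the head and tail spans in marker tokens.

    Assumes the two spans do not overlap. Spans are processed from right
    to left so that inserting markers never shifts a pending offset.
    """
    spans = sorted(
        [(head, "<h>", "</h>"), (tail, "<t>", "</t>")],
        key=lambda item: item[0][0],
        reverse=True,
    )
    for (start, end), open_tok, close_tok in spans:
        text = text[:start] + open_tok + text[start:end] + close_tok + text[end:]
    return text


# Toy example: the classifier input is the marked text, the target is the label.
text = "metformin treats type 2 diabetes"
marked = mark_entities(text, head=(0, 9), tail=(17, 32))
example = {"input": marked, "label": "Drug_Disease"}
# marked == "<h>metformin</h> treats <t>type 2 diabetes</t>"
```

Because both entities are supplied, the model only has to choose one relation label per pair, which is what makes this setting simpler than joint entity and relation extraction.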

4 Analysis

The experimental results are shown in Table 3 and Table 4. To analyze DiaKG, we report the overall result for each task together with the results for the 2 best-performing and the 3 worst-performing types.

The overall macro-average scores for the two tasks are \(83.3\%\) and \(83.6\%\), respectively, which is satisfactory given the many types we define and further reflects DiaKG’s quality. For the NER task, the results for the ‘Disease’ and ‘Drug’ types are as expected: these two types occur frequently in the documents, which leads to higher scores. The average entity length for the ‘Pathogenesis’ type is 10.3, showing that the SOTA MRC-BERT model still cannot handle long spans perfectly. We analyzed the errors for the ‘Symptom’ and ‘Reason’ types and found that the model is prone to classifying these entities as other types, which is mainly attributable to the fact that an entity may belong to different types depending on its context. For the RE task, the case study shows that entity pairs separated by a long distance are difficult to classify. For example, entities in a ‘Drug_Disease’ relation usually appear in the same sub-sentence, whereas those in a ‘Reason_Disease’ relation are usually located in different sub-sentences, sometimes even in different sentences. These results demonstrate that DiaKG is challenging for most current models, and we encourage the use of more powerful models on this dataset.
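For reference, a macro-average gives every type equal weight regardless of how often it occurs, so rare or long-span types pull the overall score down as much as frequent ones lift it. The sketch below assumes the per-type metric is F1 computed from true-positive, false-positive, and false-negative counts (the text does not name the metric explicitly); the counts are made up and are not DiaKG results.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(per_type: Dict[str, Counts]) -> float:
    """Unweighted mean of the per-type F1 scores."""
    if not per_type:
        return 0.0
    return sum(f1_score(*counts) for counts in per_type.values()) / len(per_type)


# Made-up counts: a frequent, easy type and a rare, hard type.
toy_counts = {
    "Disease": (90, 10, 10),       # F1 = 0.9
    "Pathogenesis": (5, 5, 5),     # F1 = 0.5
}
score = macro_f1(toy_counts)       # (0.9 + 0.5) / 2
```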

Table 3. Selected NER results
Table 4. Selected RE results

5 Conclusion and Future Work

In this paper, we introduce DiaKG, a dataset dedicated to diabetes. Through a carefully designed annotation process, we have obtained a high-quality dataset. The experimental results demonstrate the practicability of DiaKG as well as the challenges it poses for recent typical methods. We hope the release of this dataset can advance the construction of diabetes knowledge graphs and facilitate AI-based applications. We will further explore the potential of this corpus and provide more challenging tasks such as question answering (QA).