DiaKG: An Annotated Diabetes Dataset for Medical Knowledge Graph Construction

Chang, Dejie; Chen, Mosha; Liu, Chaozhen; Liu, Liping; Li, Dongdong; Li, Wei; Kong, Fei; Liu, Bangchang; Luo, Xiaobin; Qi, Ji; Jin, Qiao; Xu, Bin

doi:10.1007/978-981-16-6471-7_26

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1466)

Included in the following conference series:

China Conference on Knowledge Graph and Semantic Computing

Abstract

Knowledge graphs have proven effective for modeling structured information and conceptual knowledge, especially in the medical domain. However, the lack of high-quality annotated corpora remains a crucial obstacle to research and applications in this area. To accelerate research on domain-specific knowledge graphs in the medical domain, we introduce DiaKG, a high-quality Chinese dataset for diabetes knowledge graph construction, which contains 22,050 entities and 6,890 relations in total. We implement recent typical methods for Named Entity Recognition and Relation Extraction as a benchmark to evaluate the proposed dataset thoroughly. Empirical results show that DiaKG is challenging for most existing methods, and further analysis is conducted to discuss future research directions for improvement. We hope the release of this dataset can assist the construction of diabetes knowledge graphs and facilitate AI-based applications.

1 Introduction

Diabetes is a chronic metabolic disease characterized by high blood glucose levels. Untreated or uncontrolled diabetes can cause a range of complications, including acute ones like diabetic ketoacidosis and chronic ones such as cardiovascular diseases and diabetic nephropathy. With rapid economic development and changes in lifestyle, China has become the country with the most diabetes patients in the world: the prevalence of diabetes in Chinese adults is about \(11.2\%\) and still increasing [1]. The medical expenses from diabetes without complications already account for \(8.5\%\) of national health expenditure in China [2]. As a result, diabetes is a serious public health problem for the realization of “Healthy China 2030”, one that requires interdisciplinary innovation to solve.

Knowledge Graph (KG) has been proven effective in modeling structured information and conceptual knowledge, especially in the medical domain [3]. Medical knowledge graphs are attracting attention from both academia and the healthcare industry because of their power in intelligent healthcare applications, such as clinical decision support systems (CDSSs) for diagnosis and treatment [4, 5] and self-diagnosis utilities that help patients evaluate their health conditions based on symptoms [6, 7]. A high-quality entity and relation corpus is crucial for constructing a knowledge base. However, there is currently no dataset dedicated to diabetes. To address this issue, we introduce DiaKG, a high-quality Chinese dataset for diabetes knowledge graph construction.

The contributions of this work are as follows:

1. To the best of our knowledge, this is the first diabetes dataset for medical knowledge graph construction, in China or internationally.
2. In addition to medical experts, we also involve AI experts in the annotation process to provide data insight, which improves the usability of DiaKG and ultimately benefits end-to-end model performance.

We hope the release of this corpus can help researchers develop knowledge bases for clinical diagnosis, drug recommendation, and auxiliary diagnostics to further explore the mysteries of diabetes. The datasets are publicly available at https://tianchi.aliyun.com/dataset/dataDetail?dataId=88836

2 DiaKG Construction

2.1 Data Resource

The dataset is derived from 41 diabetes guidelines and consensus statements published in authoritative Chinese journals. They cover the most extensive research fields and hotspots of recent years, including clinical research, drug usage, clinical cases, and diagnosis and treatment methods. Hence it is a quality-assured resource for constructing a diabetes knowledge base.

2.2 Annotation Guide

Two seasoned endocrinologists designed the annotation guide. The guide focuses on entities and relations since these two types are the fundamental elements of a knowledge graph.

Entity. Eighteen entity types are defined (Table 1). Nested entities are allowed; for example, is a ‘Disease’ entity, and is a ‘Class’ one. Entities in DiaKG have two characteristics that stand out: 1. An entity may belong to different types depending on its context; for example, in sentence is a ‘Disease’ type, while in the sentence serves as a ‘Reason’ type; 2. Some entity types have long spans; for example, a ‘Pathogenesis’ entity usually consists of a whole sentence.
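To make the nested-entity setting concrete, the following is a minimal sketch of how nested spans could be represented and detected. The `Entity` class, the character-offset convention, and the English example string are illustrative assumptions, not the DiaKG release format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int   # inclusive character offset (assumed convention)
    end: int     # exclusive character offset (assumed convention)
    label: str   # entity type, e.g. 'Disease' or 'Class'


def nested_pairs(entities):
    """Return (outer, inner) pairs where inner lies inside a longer outer span."""
    pairs = []
    for outer in entities:
        for inner in entities:
            if outer is inner:
                continue
            contained = outer.start <= inner.start and inner.end <= outer.end
            shorter = (inner.end - inner.start) < (outer.end - outer.start)
            if contained and shorter:
                pairs.append((outer, inner))
    return pairs


# Hypothetical English stand-in for a nested annotation.
text = "type 2 diabetes"
entities = [
    Entity(0, 15, "Disease"),  # the whole phrase
    Entity(0, 6, "Class"),     # "type 2", nested inside the phrase
]
nested = nested_pairs(entities)
```

A flat sequence-labelling scheme assigns one tag per token, so it cannot represent both spans above at once; this is why a model that supports nested settings is needed.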

Table 1. List of entities

Relation. Relations are centered on the ‘Disease’ and ‘Drug’ types, and a total of 15 relations are defined (Table 2). Relations are annotated at the paragraph level, so entities from different sentences may form a relation, which makes the relation extraction task more difficult. Only \(43.4\%\) of the relations in DiaKG have their head entity and tail entity in the same sentence.
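A statistic of this kind could be computed roughly as follows. This is a sketch under assumptions: relations are given as pairs of character offsets into one paragraph, and sentences are split only on the full-width and ASCII marks `。！？!?` (the ASCII period is deliberately left out so that decimals are not split). The toy paragraph is invented for illustration.

```python
import re

SENTENCE_END = re.compile(r"[。！？!?]")


def sentence_index(paragraph, offset):
    """Number of sentence-ending marks before the offset, i.e. the sentence index."""
    return len(SENTENCE_END.findall(paragraph[:offset]))


def same_sentence_ratio(paragraph, relations):
    """Share of (head_offset, tail_offset) pairs that fall in the same sentence."""
    same = sum(
        1
        for head, tail in relations
        if sentence_index(paragraph, head) == sentence_index(paragraph, tail)
    )
    return same / len(relations)


# Toy paragraph with two sentences (not taken from DiaKG).
paragraph = "Metformin treats diabetes。Obesity is a risk factor。"
drug = paragraph.index("Metformin")
disease = paragraph.index("diabetes")
reason = paragraph.index("Obesity")

relations = [
    (drug, disease),    # head and tail in the same sentence
    (reason, disease),  # head and tail in different sentences
]
ratio = same_sentence_ratio(paragraph, relations)
```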

Table 2. List of relations

2.3 The Annotation Process

The annotation process is shown in Fig. 1. It can be divided into two steps:

OCR Process. The PDF files are converted to plain text with the OCR tool (Footnote 1), and non-text data such as figures and tables are manually removed. Additionally, two annotators manually check the OCR results character by character to catch misrecognitions; for example, may be recognized as .

Annotation Process. Six M.D. candidates were employed and trained thoroughly by our medical experts to gain a comprehensive understanding of the annotation task. During the trial annotation step, we also invited two AI experts to label the data simultaneously, on the assumption that AI experts could provide data insight from the model’s perspective. For example, medical experts are inclined to label as a whole entity, while AI experts consider it more model-friendly to treat , ‘maturity-onset diabetes of the young’ and ‘MODY’ as three separate entities. Feedback from the AI experts and the annotators was sent back to the medical expert to refine the annotation guideline iteratively. The formal annotation step was carried out by the six M.D. candidates, with one medical expert giving timely help when needed. The Quality Control (QC) step was conducted by the medical experts to guarantee data quality, and common annotation problems were corrected in batch mode. The final quality was evaluated by the other medical expert via random sampling of 300 records. The accuracy rates for entities and relations are \(90.4\%\) and \(96.5\%\), respectively, demonstrating the high quality of DiaKG. The examined dataset contains 22,050 entities and 6,890 relations, which is empirically adequate for a single disease.
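The final sampling-based check can be illustrated with a short sketch. The record identifiers, the fixed seed, and the boolean judgments below are placeholders; only the sample size of 300 comes from the text, and the actual review procedure may differ.

```python
import random


def sample_for_review(records, k=300, seed=0):
    """Draw k records uniformly at random, without replacement, for expert review."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    return rng.sample(records, k)


def accuracy(judgments):
    """Fraction of reviewed annotations that the expert judged correct."""
    return sum(judgments) / len(judgments)


# Placeholder corpus; these are not DiaKG records.
records = [f"record-{i}" for i in range(1000)]
sample = sample_for_review(records)

# Placeholder expert judgments (True = annotation judged correct).
judgments = [True, True, False, True]
toy_accuracy = accuracy(judgments)
```

Entity and relation accuracy would be computed separately, each over the annotations of its own kind found in the sampled records.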

2.4 Data Statistics

Detailed statistical information for DiaKG is shown in Table 1 and Table 2.

3 Experiments

We conduct Named Entity Recognition (NER) and Relation Extraction (RE) experiments to evaluate DiaKG. The codebase is public on GitHub (Footnote 2), and the implementation details are documented in the repository.

3.1 Named Entity Recognition (NER)

We report results only from Li et al. (2019) [8], since it was the SOTA model for NER in the nested setting at the time of writing.
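The MRC formulation of [8] casts NER as question answering: each entity type is turned into a natural-language query, and the spans answering that query are extracted from the context. The sketch below shows only the input construction. The query wording, the type subset, and the dictionary layout are hypothetical and are not taken from the benchmark code.

```python
# Hypothetical queries; the wording used in the actual benchmark is not given here.
TYPE_QUERIES = {
    "Disease": "Find all diseases mentioned in the text.",
    "Drug": "Find all drugs mentioned in the text.",
    "Pathogenesis": "Find all descriptions of how a disease develops.",
}


def build_mrc_examples(context, type_queries):
    """Create one (query, context) example per entity type."""
    return [
        {"label": label, "query": query, "context": context}
        for label, query in type_queries.items()
    ]


examples = build_mrc_examples("Metformin treats diabetes.", TYPE_QUERIES)
```

Because each type is queried independently, a span returned for one type can overlap or contain a span returned for another, which is how this formulation accommodates nested entities.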

3.2 Relation Extraction (RE)

The RE task is defined as follows: given the head entity and the tail entity, classify the relation type. Because of this simplified setting, we report results from a bi-directional GRU-attention model [9] in this paper.
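One common way to feed a given head and tail entity to a relation classifier is to wrap both spans in marker tokens before encoding. The sketch below assumes this marker scheme and character-offset spans. Whether the released baseline uses the same input format is not stated here.

```python
def mark_entities(text, head, tail):
    """Wrap the head span in <e1>...</e1> and the tail span in <e2>...</e2>.

    head and tail are (start, end) character offsets, end exclusive,
    and are assumed not to overlap.
    """
    # Process spans left to right so each slice uses original offsets.
    (s1, e1), tag1 = min((head, "e1"), (tail, "e2"))
    (s2, e2), tag2 = max((head, "e1"), (tail, "e2"))
    return (
        text[:s1]
        + f"<{tag1}>" + text[s1:e1] + f"</{tag1}>"
        + text[e1:s2]
        + f"<{tag2}>" + text[s2:e2] + f"</{tag2}>"
        + text[e2:]
    )


# Toy sentence (not from DiaKG): head is the drug, tail is the disease.
sentence = "Metformin treats diabetes"
marked = mark_entities(sentence, head=(0, 9), tail=(17, 25))
```

The marked string would then be tokenized and passed to a classifier over the 15 relation types.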

4 Analysis

The experimental results are shown in Table 3 and Table 4. We report the total result, plus the top 2 and last 3 types’ results for each task to analyze DiaKG.

The overall macro-average scores for the two tasks are \(83.3\%\) and \(83.6\%\), respectively, which is satisfactory considering the many types we define and also reflects DiaKG’s high quality. For the NER task, the results for the ‘Disease’ and ‘Drug’ types are as expected: these two types occur frequently in the documents, which leads to higher scores. The average entity length for the ‘Pathogenesis’ type is 10.3, showing that the SOTA MRC-Bert model still cannot handle long spans perfectly. We analyzed errors for the ‘Symptom’ and ‘Reason’ types and found that the model is prone to classifying these entities as other types, mainly because an entity may take different types depending on its context. For the RE task, the case study shows that entity pairs separated by a long distance are difficult to classify. For example, entities in a ‘Drug_Disease’ relation usually occur in the same sub-sentence, whereas those in a ‘Reason_Disease’ relation are usually located in different sub-sentences, sometimes even in different sentences. These experimental results demonstrate that DiaKG is challenging for most current models, and we encourage the use of more powerful models on this dataset.
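A macro-average gives every type equal weight, regardless of how often it occurs, so rare or confusable types pull the score down as much as frequent ones lift it. The sketch below assumes the averaged metric is F1 over exact-match `(start, end, label)` triples and averages over every label seen in either gold or predictions. The text does not state the exact metric or label set, and the toy data is invented.

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over labels; gold and pred are sets of (start, end, label)."""
    labels = {label for _, _, label in gold | pred}
    per_label = {}
    for label in labels:
        g = {item for item in gold if item[2] == label}
        p = {item for item in pred if item[2] == label}
        tp = len(g & p)  # exact span and label match
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        if precision + recall:
            per_label[label] = 2 * precision * recall / (precision + recall)
        else:
            per_label[label] = 0.0
    return sum(per_label.values()) / len(per_label), per_label


# Toy data: the third span is found, but typed 'Reason' instead of 'Symptom'.
gold = {(0, 9, "Drug"), (17, 25, "Disease"), (30, 40, "Symptom")}
pred = {(0, 9, "Drug"), (17, 25, "Disease"), (30, 40, "Reason")}
score, per_label = macro_f1(gold, pred)
```

In this toy case a single type confusion lowers two per-type scores at once, which mirrors the ‘Symptom’ and ‘Reason’ errors described above.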

Table 3. Selected NER results

Table 4. Selected RE results

5 Conclusion and Future Work

In this paper, we introduce DiaKG, a dataset dedicated to diabetes. Through a carefully designed annotation process, we have obtained a high-quality dataset. The experimental results demonstrate the practicability of DiaKG as well as the challenges it poses for recent typical methods. We hope the release of this dataset can advance the construction of diabetes knowledge graphs and facilitate AI-based applications. We will further explore the potential of this corpus and provide more challenging tasks, such as QA tasks.

Notes

References

1. Li, Y., Teng, D., Shi, X., et al.: Prevalence of diabetes recorded in mainland China using 2018 diagnostic criteria from the American Diabetes Association: national cross sectional study. BMJ 369 (2020)
2. Luo, Z., Fabre, G., Rodwin, V.G.: Meeting the Challenge of Diabetes in China. Int. J. Health Policy Manage. 9(2) (2020)
3. Nickel, M., et al.: A review of relational machine learning for knowledge graphs. Proc. IEEE 104(1), 11–33 (2015)
4. Bisson, L.J., Komm, J.T., Bernas, G.A., et al.: Accuracy of a computer-based diagnostic program for ambulatory patients with knee pain. Am. J. Sports Med. 42(10), 2371–6 (2014)
5. Wang, M., Liu, M., Liu, J., et al.: Safe medicine recommendation via medical knowledge graph embedding. arXiv preprint arXiv:1710.05980 (2017)
6. Tang, H., Ng, J.H.K.: Googling for a diagnosis–use of Google as a diagnostic aid: internet based study. BMJ 333 (2006)
7. Gann, B.: Giving patients choice and control: health informatics on the patient journey. Yearb Med. Inform. 21(01), 70–73 (2012)
8. Li, X., Feng, J., Meng, Y., et al.: A unified MRC framework for named entity recognition (2019)
9. Peng, Z., Wei, S., Tian, J., et al.: Attention-based bidirectional long short-term memory networks for relation classification. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (2016)

Acknowledgments

We want to express our gratitude to the anonymous reviewers for their hard work and kind comments. We also thank the Tianchi Platform for hosting DiaKG.

Author information

Authors and Affiliations

Miao Health, Singapore, Singapore
Dejie Chang, Chaozhen Liu, Liping Liu, Dongdong Li, Wei Li, Fei Kong, Bangchang Liu & Xiaobin Luo
Alibaba Group, Hangzhou, China
Mosha Chen
Tsinghua University, Beijing, China
Ji Qi, Qiao Jin & Bin Xu

Corresponding author

Correspondence to Dejie Chang.

Editor information

Editors and Affiliations

Harbin Institute of Technology, Harbin, China
Bing Qin
Peking University, Beijing, China
Zhi Jin
Tongji University, Shanghai, China
Haofen Wang
University of Edinburgh, Edinburgh, UK
Jeff Pan
University of South China, Hengyang, China
Yongbin Liu
Chinese Academy of Sciences, Beijing, China
Bo An

Cite this paper

Chang, D. et al. (2021). DiaKG: An Annotated Diabetes Dataset for Medical Knowledge Graph Construction. In: Qin, B., Jin, Z., Wang, H., Pan, J., Liu, Y., An, B. (eds) Knowledge Graph and Semantic Computing: Knowledge Graph Empowers New Infrastructure Construction. CCKS 2021. Communications in Computer and Information Science, vol 1466. Springer, Singapore. https://doi.org/10.1007/978-981-16-6471-7_26

DOI: https://doi.org/10.1007/978-981-16-6471-7_26
Published: 28 October 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-6470-0
Online ISBN: 978-981-16-6471-7
eBook Packages: Computer Science, Computer Science (R0)
