1 Introduction

At the beginning of 2020, COVID-19 swept the world as a sudden epidemic, disrupting the peace of families in every country. The virus spread faster and proved more infectious than anyone had imagined, with catastrophic consequences for the world's population, economy, environment, and education. The severity of the epidemic quickly drew a response from researchers in most countries, and academic work on vaccine development, drug research, and prediction of disease transmission trends was rapidly launched. Papers on COVID-19 from a wide range of fields and perspectives have been indexed in PubMed.

PubMed is an abstract database developed by the National Center for Biotechnology Information (NCBI) under the National Library of Medicine (NLM). As one of the most influential databases in the biomedical field, PubMed offers timely updates, free access, and high coverage. We therefore choose PubMed and LitCovid (a COVID-19 literature collection within PubMed [1, 2]) as our data sources.

In this context, a complete and efficient retrieval approach is particularly important. It must meet two requirements: on the one hand, it should enable researchers to quickly track research progress in a specific field; on the other hand, it should help researchers find collaborators working in the same direction. The powerful information-extraction capabilities and intuitive visualization functions of a knowledge graph meet both needs, so we chose to construct a COVID-19 literature knowledge graph to summarize existing research.

In the field of bio-entity recognition and knowledge graphs, many scholars have produced fruitful results. Song HJ used Word2Vec for biomedical NER (Bio-NER) and obtained an F1 score of 72.82% [3]. Ling Luo added an attention mechanism to the BiLSTM-CRF model to enforce tagging consistency on the CHEMDNER and CDR corpora [4]. Roderic mapped local identifiers to shared global identifiers and constructed a knowledge graph on that basis [5]. Xu trained a Bio-BERT model to build a PubMed knowledge graph, achieving an F1 score of 86.04% [6]. The goal of our study is to build a knowledge graph about COVID-19 by extracting valuable information from the literature and integrating multi-source data.

2 Building Methods

2.1 Named Entity Recognition

NER is an important problem in natural language processing and plays a fundamental role in building a knowledge graph: if NER cannot be solved reasonably well, our follow-up work is not possible. We use the BERT-BiLSTM-CRF model to extract biological entities from COVID-19-related literature; the model pipeline is shown in Fig. 1.

Fig. 1.

BERT-BiLSTM-CRF model

Bidirectional Encoder Representations from Transformers (BERT) is an encoder based on the bidirectional Transformer. The Transformer can be viewed as a sequence architecture built on the self-attention mechanism. With it, we can not only model contextual relationships more directly and compute in parallel, but also predict sequences without a fixed length limit, which lets us better capture the semantic features of the context. The multi-layer bidirectional Transformer in BERT constrains each token by its left and right context simultaneously. Compared with the ELMo model proposed by Matthew E. Peters et al. in 2018 [7], BERT captures contextual semantic information better.

Our first step is therefore to use the BERT pre-trained language model to obtain a semantic representation of each token. However, the basic BERT is trained on a general corpus and cannot be applied directly to the medical domain, so its existing parameters must be fine-tuned. We use WordPiece embeddings to handle out-of-vocabulary words: the algorithm decomposes a word into several subword units and represents each unit separately. The results show that this method improves the extraction of semantic features for uncommon words.
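The greedy longest-match-first idea behind WordPiece can be sketched as follows. This is a simplified illustration with a toy vocabulary, not BERT's actual vocabulary or tokenizer implementation:

```python
# Simplified greedy WordPiece tokenization: split an out-of-vocabulary
# word into the longest subword units found in the vocabulary.
# The toy vocabulary below is illustrative, not BERT's real one.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no matching piece: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

vocab = {"corona", "##vir", "##us", "spike", "##s"}
print(wordpiece_tokenize("coronavirus", vocab))  # ['corona', '##vir', '##us']
```

In this way a rare biomedical term absent from the vocabulary is still represented by meaningful subword units rather than a single unknown token.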

After obtaining the vector representation of each token, we feed the vectors into the BiLSTM model. A basic LSTM can be formalized as follows:

$$ i_{t} = \sigma \left( {x_{t} W_{x}^{i} + h_{t - 1} W_{h}^{i} + b_{i} } \right) $$
(1)
$$ f_{t} = \sigma \left( {x_{t} W_{x}^{f} + h_{t - 1} W_{h}^{f} + b_{f} } \right) $$
(2)
$$ o_{t} = \sigma \left( {x_{t} W_{x}^{o} + h_{t - 1} W_{h}^{o} + b_{o} } \right) $$
(3)
$$ \widetilde{{c_{t} }} = tanh\left( {x_{t} W_{x}^{c} + h_{t - 1} W_{h}^{c} + b_{c} } \right) $$
(4)
$$ h_{t} = o_{t} *{\text{tanh}}\left( {f_{t} *c_{t - 1} + i_{t} *\widetilde{{c_{t} }}} \right) $$
(5)

In these formulas, \(\sigma\) is the sigmoid activation function, \(x_{t}\) is the input word at the current time step, \(h_{t - 1}\) is the hidden state at the previous time step, and \(i_{t}, f_{t}, o_{t}\) are the values of the input gate, forget gate, and output gate at time \(t\), respectively. \(W\) and \(b\) denote weight matrices and bias vectors, \(\widetilde{{c_{t} }}\) is the candidate cell state, and \(h_{t}\) is the output at time \(t\).
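As a sanity check on formulas (1)–(5), a single LSTM step can be written directly in NumPy. The dimensions and random weights below are arbitrary, chosen only for illustration:

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM step following formulas (1)-(5).

    W_x: input weights (d_in x d_h), W_h: hidden weights (d_h x d_h),
    b: biases, each a dict keyed by gate name ('i', 'f', 'o', 'c').
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sigmoid(x_t @ W_x["i"] + h_prev @ W_h["i"] + b["i"])        # input gate, (1)
    f = sigmoid(x_t @ W_x["f"] + h_prev @ W_h["f"] + b["f"])        # forget gate, (2)
    o = sigmoid(x_t @ W_x["o"] + h_prev @ W_h["o"] + b["o"])        # output gate, (3)
    c_tilde = np.tanh(x_t @ W_x["c"] + h_prev @ W_h["c"] + b["c"])  # candidate state, (4)
    c = f * c_prev + i * c_tilde   # new cell state
    h = o * np.tanh(c)             # hidden output, (5)
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 8, 4
W_x = {k: rng.standard_normal((d_in, d_h)) for k in "ifoc"}
W_h = {k: rng.standard_normal((d_h, d_h)) for k in "ifoc"}
b = {k: np.zeros(d_h) for k in "ifoc"}
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W_x, W_h, b)
print(h.shape)
```

Because the output gate multiplies a tanh of the cell state, every component of \(h_t\) stays within \((-1, 1)\).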

BiLSTM runs a forward pass and a backward pass on top of the LSTM to obtain two different sets of hidden representations, then concatenates the vectors to form the final hidden representation. This improvement over the LSTM captures bidirectional semantic dependencies and contextual co-occurrence information more effectively, thereby improving named entity recognition performance.

We also set up different tags to predict the type of each token: BIO (beginning, inside, outside), X (WordPiece subtoken), [CLS] (leading token of a sequence), [SEP] (sentence delimiter), and PAD (sequence padding). The BIO annotation is further subdivided into six entity categories: Gene, Disease, Chemical, Mutation, Species, and CellLine. Feeding the word vectors produced by BERT into the BiLSTM and applying a softmax classification yields the probability distribution of each token over the labels.
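Once per-token labels are predicted, contiguous BIO spans are collapsed into entities. A minimal decoder, ignoring the X/[CLS]/[SEP]/PAD bookkeeping tags and using hypothetical example tokens, might look like:

```python
def bio_to_entities(tokens, labels):
    """Collapse BIO labels (e.g. 'B-Disease', 'I-Disease', 'O')
    into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:  # close the previous entity before opening a new one
                entities.append((" ".join(current), current_type))
            current, current_type = [tok], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == current_type:
            current.append(tok)  # continue the open entity
        else:
            if current:  # 'O' or an inconsistent 'I-' ends the open entity
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = ["remdesivir", "inhibits", "SARS", "-", "CoV", "-", "2"]
labels = ["B-Chemical", "O", "B-Species", "I-Species", "I-Species", "I-Species", "I-Species"]
print(bio_to_entities(tokens, labels))
# [('remdesivir', 'Chemical'), ('SARS - CoV - 2', 'Species')]
```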

In order to solve the problem that BiLSTM does not consider the relationship between labeled entity sequences, we introduce Conditional Random Field (CRF) to obtain the globally optimal labeled sequence.

We define the matrix \(P\) as the output of the BiLSTM layer; its size is \(n \times m\), where \(n\) is the number of words and \(m\) is the number of label categories. \(P_{i,j}\) is the probability of word \(i\) in the sentence belonging to label \(j\). The score of an entire predicted sequence \(y = \left\{ {y_{1} ,y_{2} , \cdots ,y_{n} } \right\}\) can be expressed as follows:

$$ K\left( {X,y} \right) = \sum\nolimits_{i = 0}^{n} {A_{{y_{i} ,y_{i + 1} }} } + \sum\nolimits_{i = 1}^{n} {P_{{i,y_{i} }} } $$
(6)

The matrix \(A\) is the transition matrix; \(A_{ij}\) is the score of transferring from tag \(i\) to tag \(j\).

$$ y^{*} = \mathop {{\text{argmax}}}\limits_{{\tilde{y} \in Y_{X} }} K\left( {X,\tilde{y}} \right) $$
(7)

\(\tilde{y}\) ranges over candidate tag sequences, and \(Y_{X}\) is the set of all possible tag sequences. The sequence \(y^{*}\) with the highest overall score, output by formula (7), is the best labeling result obtained after model training.
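Formula (7) is solved exactly by Viterbi decoding over the emission matrix \(P\) and transition matrix \(A\). A compact sketch, omitting the special start/stop transitions implied by the \(i = 0\) term of formula (6):

```python
import numpy as np

def viterbi_decode(P, A):
    """Find the tag sequence maximizing formula (6): the sum of
    transition scores A[y_i, y_{i+1}] and emission scores P[i, y_i].
    P: (n, m) emissions, A: (m, m) transitions. Returns the best tag list."""
    n, m = P.shape
    score = P[0].copy()                  # best score ending in each tag at step 0
    back = np.zeros((n, m), dtype=int)   # backpointers for path recovery
    for i in range(1, n):
        # total[k, j]: score of reaching tag j at step i from tag k at step i-1
        total = score[:, None] + A + P[i][None, :]
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    # trace the argmax path backwards
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```

For example, with strongly negative off-diagonal transition scores, the decoder keeps the tag of the first step even when a later emission weakly favors another tag, which is exactly the label-consistency effect the CRF layer adds on top of the BiLSTM.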

2.2 Validation of BERT-BiLSTM-CRF

For the NER model, we need to perform a validity test. All our data come from PubMed, a website covering almost all papers in the medical field. Articles already published on the site carry entity labels, but the most recently published and indexed articles are not yet labeled. We therefore split the labeled PubMed articles into a 70% training set, a 20% test set, and a 10% validation set. Model quality is evaluated with recall, precision, and F1 score. To verify the effect of the model, we compared it against the unfine-tuned BERT model, Word2Vec, and Att-BiLSTM-CRF on the same data set. The results are shown in Table 1.

Table 1. Performance of different models
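For reference, the metrics reported in Table 1 are the standard entity-level precision, recall, and F1; a generic sketch (not the exact evaluation script used in this work, and with hypothetical entity tuples) is:

```python
def prf1(true_entities, pred_entities):
    """Entity-level precision, recall, and F1 over sets of
    (document_id, start, end, type) tuples."""
    tp = len(true_entities & pred_entities)  # exact span-and-type matches
    precision = tp / len(pred_entities) if pred_entities else 0.0
    recall = tp / len(true_entities) if true_entities else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

true = {(1, 0, 2, "Disease"), (1, 5, 6, "Gene"), (2, 3, 4, "Chemical")}
pred = {(1, 0, 2, "Disease"), (1, 5, 6, "Chemical"), (2, 3, 4, "Chemical")}
print(prf1(true, pred))  # precision = recall = F1 = 2/3
```

Note that an entity counts as correct only if both its span and its type match, so a mistyped entity (Gene vs. Chemical above) hurts both precision and recall.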

2.3 Author Name Disambiguation

It is common for researchers to share a name or surname, while an individual's name and affiliation may change over time. Therefore, when constructing a knowledge graph, it is important to disambiguate authors. The commonly used methods fall into three categories. The first is manual disambiguation: searching for the author's information and comparing records to make a judgment. Its advantage is high accuracy, but it is so time- and labor-intensive that it cannot be applied to huge data sets. The second is querying public scholar registration platforms such as ORCID, Google Scholar, and Semantic Scholar for author information. This quickly and easily yields high-precision identity information, but coverage of some research fields is limited. The third is to evaluate the similarity of two same-name authors algorithmically to determine whether they are the same person. The author features usually come from affiliation information, the titles and keywords of published articles, information about collaborators, the type of journal, and so on. In recent years, with the rapid development of machine learning, the accuracy of such methods has reached a high level.

In our research, we integrate data from Semantic Scholar and Google Scholar to disambiguate and mark authors. First, we use a binary classifier trained on the Semantic Scholar database to disambiguate each group of same-name authors and add the processed authors incrementally to the author dataset. Then we use the corresponding author information obtained from Google Scholar as a supplementary source. Finally, we manually correct false disambiguation results and supply affiliation information for authors not yet covered.
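The third category of methods described above reduces to a pairwise similarity decision. A simplified illustration follows; the features, weights, and threshold here are all hypothetical, not the trained classifier used in this work:

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def same_author(rec1, rec2, threshold=0.3):
    """Decide whether two same-name author records refer to one person,
    using a weighted mix of coauthor, affiliation, and keyword overlap.
    Weights and threshold are illustrative, not tuned values."""
    score = (0.5 * jaccard(rec1["coauthors"], rec2["coauthors"])
             + 0.3 * jaccard(rec1["affiliations"], rec2["affiliations"])
             + 0.2 * jaccard(rec1["keywords"], rec2["keywords"]))
    return score >= threshold

# hypothetical records for two publications under the same name
r1 = {"coauthors": {"A. Smith", "B. Chen"}, "affiliations": {"NIH"},
      "keywords": {"covid-19", "vaccine"}}
r2 = {"coauthors": {"B. Chen"}, "affiliations": {"NIH"},
      "keywords": {"vaccine", "immunology"}}
print(same_author(r1, r2))  # True
```

A trained classifier replaces the hand-set weights with learned ones, but the underlying features are the same kinds of overlap.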

3 CLKG Construction Process

CLKG is built with Python 3.7 and networkx, and the output is stored as a gpickle file. Anyone can obtain CLKG at https://github.com/spicycock/CLKG. The construction process is shown in Fig. 2. As of this writing, we have obtained 82,365 COVID-19-related articles from PubMed. First, we apply the BERT-BiLSTM-CRF model to the abstract of each article to extract entities and their types, yielding 26,458 entities in total (15,437 Disease, 3,783 Gene, 4,832 Chemical, 316 Mutation, 1,975 Species, and 115 CellLine tags). Second, we use the method of Sect. 2.3 to extract and disambiguate scholar names, obtaining 294,655 disambiguated author names. Third, we construct the knowledge graph from three types of relationships: entity-entity, author-author, and entity-author. More concretely, each entity or author is a node, and an undirected edge connects any two associated nodes. In this way, the basic architecture of CLKG is constructed.
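The node-and-edge construction above can be sketched with networkx as follows. The nodes, attribute names, and file name here are toy examples, not the actual CLKG data; a gpickle file is simply a pickled graph object, so the standard pickle module is used for portability across networkx versions:

```python
import pickle
import networkx as nx

# Toy sketch of the CLKG construction: entities and authors are nodes,
# co-occurrence / authorship links are undirected edges with attributes.
G = nx.Graph()

# entity nodes carry their NER type; author nodes carry an affiliation
G.add_node("sinefungin", kind="entity", etype="Chemical")
G.add_node("SARS-CoV-2", kind="entity", etype="Species")
G.add_node("J. Doe", kind="author", affiliation="Example University")  # hypothetical author

# entity-entity: the two entities are co-mentioned in the same abstract
G.add_edge("sinefungin", "SARS-CoV-2", relation="entity-entity")
# entity-author: the author's paper mentions the entity; edge carries
# publication metadata as described below
G.add_edge("J. Doe", "sinefungin", relation="entity-author",
           journal="Example Journal", issue="2020-12")

# store the graph as a gpickle-style file
with open("clkg_toy.gpickle", "wb") as fh:
    pickle.dump(G, fh)

print(G.number_of_nodes(), G.number_of_edges())  # 3 2
```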

After establishing the basic graph, we integrate the authors' affiliation information from Google Scholar into the node attributes. At the same time, for each entity-author edge, we add the publication information of the related articles obtained from PubMed, including journal name, issue date, and issue number. In this way, the information in the knowledge graph is expanded and CLKG is constructed completely.

Fig. 2.

Construction process of CLKG

4 CLKG Visualization

Since CLKG is built from all 82,365 COVID-19-related documents, it contains a huge amount of information and a complicated web of relationships between nodes, making it difficult to visualize with general-purpose methods. CLKG therefore provides a convenient search interface, allowing us to extract only the fields of interest and reduce the information to an amount a two-dimensional image can carry. An example visualization is shown in Fig. 3, where the size of an author node is determined by the number of articles the author published, and the size of an entity node by the number of times the entity is mentioned.

We can not only survey the overall research status of a certain academic area but also conduct more refined searches. On the one hand, we can locate a specific well-known scholar and see what he or she has published about COVID-19; on the other hand, we can focus on a specific research entity, survey other researchers' findings on the issue, or look for potential future partners. We give two examples of how this works. Anthony S. Fauci is a well-known infectious disease expert in the United States who has made great contributions to the control, treatment, and study of the epidemic. We can quickly extract his related research content, collaborators, and other information from CLKG; the resulting subgraph is shown in Fig. 4.
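Extracting a scholar-centred subgraph like the one in Fig. 4 amounts to taking the ego network of the corresponding node. A sketch with networkx, using a toy graph with hypothetical node names:

```python
import networkx as nx

# toy stand-in for CLKG: author and entity nodes with undirected edges
G = nx.Graph()
G.add_edges_from([
    ("A. Fauci", "remdesivir"), ("A. Fauci", "SARS-CoV-2"),
    ("A. Fauci", "collaborator_1"), ("collaborator_1", "sinefungin"),
    ("other_author", "sinefungin"),
])

# radius-1 ego network: the centre node plus its direct neighbours
sub = nx.ego_graph(G, "A. Fauci", radius=1)
print(sorted(sub.nodes()))
# ['A. Fauci', 'SARS-CoV-2', 'collaborator_1', 'remdesivir']
```

An entity-centred view like Fig. 5 is the same operation with an entity node as the centre; increasing `radius` pulls in collaborators-of-collaborators.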

Sinefungin is a medicinal compound often mentioned in research on the treatment of COVID-19. A scholar in the medical field may want to know whether it has been proven effective. Figure 5 takes sinefungin as an example and extracts the relevant information from CLKG, clearly showing the current status of published research and the clustering relationships among the nodes.

Fig. 3.

Selected nodes from the overall graph: pink nodes are entities, and blue nodes represent authors.

Fig. 4.

The subgraph extracted with Dr. Fauci at the center: pink nodes are related entities, and blue nodes are Dr. Fauci's collaborators

Fig. 5.

The subgraph extracted with sinefungin at the center: pink nodes are related entities, and blue nodes are authors who studied sinefungin in their papers

5 Conclusion

As stated at the beginning of this article, COVID-19 is a severe test for every family in every country, and we should be of one mind to overcome this disaster together. Using all COVID-19-related papers in PubMed as the basis, our article applies the BERT-BiLSTM-CRF model to solve the key NER problem, disambiguates researchers with the same name, and finally establishes a comprehensive and complete CLKG based on the relationships among authors and entities. As a knowledge graph, CLKG collects and summarizes the COVID-19 research results of top scientists, pharmacists, and other experts from all over the world, and visualizes the output. Through CLKG, we can not only quickly query the research status, frontier hotspots, and research progress on COVID-19, but also help researchers find academic partners in specific subject areas more quickly and efficiently. Such timely, accurate information sharing and sincere cooperation among top scholars will undoubtedly play a key role in overcoming the epidemic, reducing unemployment, restarting the economy, and restoring education.

At the same time, CLKG has excellent scalability in both the vertical and horizontal directions. Vertically, when new literature appears, CLKG can quickly extract its biological entities and add the new information to the knowledge graph without complex and time-consuming reconstruction. Horizontally, the construction method in this article can easily be applied to any field of the same type (such as cancer or heart disease). It is not even limited to the medical field: research on global issues such as global warming and environmental pollution can be covered just as well.