1 Introduction

Diabetes and its complications are now major causes of death in most countries. According to the International Diabetes Federation [1], 415 million adults aged 20–79 (8.8%) had diabetes in 2015, and the number is projected to reach 642 million by 2040. Diabetes not only poses a great threat to health worldwide but also places a substantial economic burden on countries and national health systems. China has more people with diabetes than any other country. According to the Chinese Diabetes Society [2], nearly 110 million people in China had diabetes in 2014, with the overall prevalence having doubled to 9.7% in about 10 years. A further 150 million people were at high risk, with the prevalence of pre-diabetes reaching 15.5%, higher than the world average. The threat posed by diabetes is therefore more severe in China than worldwide.

Diabetes research has long been of broad concern in academia. Doctors and researchers have published a wealth of diabetes literature recording their clinical and research findings. This literature reflects, to some extent, the development of diabetes prevention and treatment. However, given its large volume, wide distribution, and fast growth, it is difficult to track research trends accurately through traditional manual retrieval. An automatic, efficient, and accurate method is therefore needed to continuously discover the recent research status of diabetes.

Bibliometrics applies statistical and mathematical approaches to the study of information materials and has become a well-established part of information research for the quantitative description of documents [3]. Traditional bibliometrics is well suited to ranking and extreme-value problems but is weak at revealing the structure of the literature, e.g., the cooperation among authors and affiliations or the structure and evolution of research hotspots. Knowledge mapping analysis, on the other hand, is a widely applied graphical and visual technique for revealing structural relationships and the development of scientific knowledge [4]. It combines techniques from mathematics, information science, and computer science to help analyze and understand the development process and frontier issues of a field [5]. To a certain extent, knowledge mapping analysis and bibliometrics complement each other.

This paper presents a data-driven method based on bibliometrics and knowledge mapping analysis, aiming at discovering the recent research status of diabetes in China. Using descriptive statistics, clustering and multidimensional scaling, and network analysis, it analyzes literature distribution characteristics, research hotspots, and cooperation relationships. We believe that the work can assist clinical professionals and medical researchers in identifying hot research topics and in keeping abreast of the research status of diabetes when developing research strategies.

2 Literature Review

Bibliometrics and knowledge mapping analysis have been applied in a considerable number of studies, for example to evaluate the speed of publication of ophthalmology journals [6], to map the literature of a research field such as cancer research [7, 8] or health literacy research [9], and to recognize new topics in the literature [10]. They have also played fundamental roles in examining trends in medical research output. Ramos [11] conducted a bibliometric analysis of tuberculosis research indexed in PubMed between 1997 and 2006 and found that research output in countries with more estimated cases of tuberculosis was lower than in industrialized countries. Boudry et al. [12] reviewed the scientific production in the field of eye disease between 2010 and 2014 to identify the major topics and the predominant actors, including journals, countries, and continents. Similar work has been conducted for other medical fields, such as obesity [13], leishmaniasis [14], methaemoglobinaemia [15], giardiasis [16], dermatology [17], neurogenic bladder [18], rheumatology [19], and surgery [20].

Few studies have applied bibliometrics and knowledge mapping analysis to diabetes. Harande [21] used a bibliometric approach to examine the growing diabetes-related literature in Nigeria and found that it followed the Bradford–Zipf distribution. Harande and Alhaji [22] examined the growth of the published literature on the disease in three countries (Nigeria, Argentina, and Thailand), showing that the diabetes literature grew and spread very widely. Similar research has been done for the diabetes literature in Middle East countries [23]. Geaney et al. [24] provided a detailed evaluation of type 2 diabetes mellitus research output during 1951–2012 using large-scale data analysis, bibliometric indicators, and density-equalizing mapping. Zhang et al. [25] investigated the relationship between antipsychotics and type 2 diabetes research using bibliometric methods. Existing related studies seldom focus on the diabetes literature in China, let alone combine bibliometrics with knowledge mapping analysis.

Therefore, the aim of this study is to provide a detailed evaluation of diabetes research output in China from 2007 to 2016. We use a data-driven method based on bibliometrics and knowledge mapping analysis to quantitatively analyze data from the CNKI database in terms of: (1) literature distribution characteristics, using descriptive statistics; (2) research hotspots, using clustering and multidimensional scaling analysis; and (3) cooperation relationships among authors and affiliations, using network analysis.

3 The Data-Driven Approach

Combining bibliometrics and knowledge mapping analysis, we design a data-driven approach for discovering the recent research status of diabetes in mainland China from the medical literature. The processing procedure is shown in Fig. 1. Diabetes-related publications are retrieved from CNKI as raw data and, after preprocessing, serve as the dataset for bibliometrics and knowledge mapping analysis. The analysis applies descriptive statistical methods to obtain literature distribution characteristics, multidimensional scaling analysis and data mining methods such as k-means clustering to discover research hotspots, and network analysis to reveal cooperation relationships among authors and affiliations. Specifically, the procedure can be divided into the following stages:

Fig. 1. The procedure for acquiring research status of diabetes in China

Literature retrieval: The relevant publications were retrieved directly from CNKI in April 2017 using the following search criteria: (1) “diabetes” as the topic, using an extended search keyword list; (2) publication years from “2007” to “2016”; (3) SCI, EI, CSSCI, and core journals as literature sources, to ensure high literature quality. The retrieval returned 35,059 documents as raw data for preprocessing.

Data preprocessing: To ensure high relevance to diabetes research, further data cleaning and processing were applied: (1) excluding non-research informational items such as yearbooks, newspapers, and non-academic book publications; (2) excluding articles whose titles and keywords did not contain highly relevant words such as “diabetes” and “hypoglycemic”; (3) keeping all articles whose source journal names contained the keyword “diabetes”. After preprocessing, the final dataset contained 24,561 articles.

Distribution characteristics analysis: Based on the 24,561 documents, literature distribution characteristics, including publication date, journal distribution, and productive affiliations, were obtained using descriptive statistical methods. For example, affiliations were ranked in decreasing order of productivity, starting with the affiliation with the highest number of diabetes-related publications. We used R, a publicly available software environment, for the statistical analysis in this paper.
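The ranking step can be illustrated with a minimal sketch. The paper performed this analysis in R; the Python below is only an illustration, and the record layout, the affiliation names, and the function `rank_affiliations` are assumptions rather than the original implementation.

```python
from collections import Counter

# Hypothetical records: each article lists its author affiliations in order,
# so the first entry is taken to be the first-author affiliation.
articles = [
    ["Hospital A", "University B"],
    ["Hospital A"],
    ["University B", "Hospital A", "Institute C"],
    ["Institute C", "University B"],
]

def rank_affiliations(records, first_author_only=False):
    """Return (affiliation, count) pairs in decreasing order of productivity."""
    counts = Counter()
    for affiliations in records:
        if not affiliations:
            continue  # skip articles without affiliation data
        if first_author_only:
            counts[affiliations[0]] += 1
        else:
            # Count each affiliation at most once per article.
            counts.update(set(affiliations))
    # Sort by count (descending), then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

first_author_ranking = rank_affiliations(articles, first_author_only=True)
total_ranking = rank_affiliations(articles)
```

Keeping the first-author count and the total count as two separate rankings mirrors the two lists reported for affiliations in the results.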

Research hotspots analysis: The recent research hotspots of diabetes in China were identified from a keyword co-occurrence matrix in the following steps. (1) Standardization of keywords: nonstandard and inconsistent keyword representations were unified, e.g., “type 2 diabetes”, “diabetes, type 2”, “diabetes type 2”, and “Type 2 diabetes (T2DM)” as “type 2 diabetes”; “type 1 diabetes”, “diabetes, type 1”, “diabetes type 1”, and “Type 1 diabetes (T1DM)” as “type 1 diabetes”; “non-insulin-dependent diabetes mellitus” and “diabetes, non-insulin-dependent” as “non-insulin-dependent diabetes mellitus”; “Type 1” and “TypeI” as “Type 1”; “Type 2” and “TypeII” as “Type 2”; and “gestational diabetes”, “diabetes, pregnancy”, and “diabetes, gestational” as “gestational diabetes”. (2) After standardization, the keywords were ranked by frequency. (3) A co-occurrence matrix was constructed from the most frequent keywords. (4) A keyword correlation matrix was computed from the co-occurrence matrix using the Ochiai correlation coefficient. (5) A keyword dissimilarity matrix was generated from the keyword correlation matrix. (6) Hierarchical clustering and k-means clustering were applied based on the keyword correlation matrix. (7) Multidimensional scaling analysis was performed on the keyword dissimilarity matrix.
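Steps (1) to (3) above can be sketched as follows. This is an illustrative Python sketch, not the authors' R code: the synonym table contains only a few of the variants listed above, matching on lower-cased text is an assumption, and the sample keyword lists are made up.

```python
from itertools import combinations

# A few variant spellings mapped to a standard form (lower-cased for matching).
# A real mapping would cover many more variants.
SYNONYMS = {
    "diabetes, type 2": "type 2 diabetes",
    "diabetes type 2": "type 2 diabetes",
    "type 2 diabetes (t2dm)": "type 2 diabetes",
    "diabetes, pregnancy": "gestational diabetes",
    "diabetes, gestational": "gestational diabetes",
}

def standardize(keyword):
    """Step (1): map a raw keyword to its standard representation."""
    key = keyword.strip().lower()
    return SYNONYMS.get(key, key)

def cooccurrence_matrix(keyword_lists, top_n):
    """Steps (2) and (3): rank keywords and count pairwise co-occurrence."""
    docs = [sorted({standardize(k) for k in kws}) for kws in keyword_lists]
    freq = {}
    for doc in docs:
        for k in doc:
            freq[k] = freq.get(k, 0) + 1
    # Keep the top_n keywords by frequency (ties broken alphabetically).
    top = sorted(freq, key=lambda k: (-freq[k], k))[:top_n]
    index = {k: i for i, k in enumerate(top)}
    matrix = [[0] * len(top) for _ in top]
    for doc in docs:
        kept = [k for k in doc if k in index]
        for k in kept:
            matrix[index[k]][index[k]] += 1  # diagonal: keyword frequency
        for a, b in combinations(kept, 2):
            matrix[index[a]][index[b]] += 1  # off-diagonal: co-occurrence
            matrix[index[b]][index[a]] += 1  # keep the matrix symmetric
    return top, matrix

# Made-up keyword lists for four documents.
sample = [
    ["Diabetes, type 2", "Blood Glucose"],
    ["Type 2 diabetes (T2DM)", "Blood Glucose", "Insulin"],
    ["type 2 diabetes", "Insulin"],
    ["Diabetes, gestational", "Blood Glucose"],
]
keywords, matrix = cooccurrence_matrix(sample, top_n=3)
```

In this sketch the diagonal holds each keyword's document frequency and the matrix is symmetric, which matches the layout of the co-occurrence matrix used in the remaining steps.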

Cooperation network analysis: For the most productive authors and affiliations identified by descriptive statistics, cooperation networks of authors and of affiliations were constructed separately. The author and affiliation data were first converted into one-to-one cooperation pairs, and the cooperation networks were then generated with the R package networkD3, whose simpleNetwork function visualizes a network as a force-directed graph.
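The conversion into one-to-one cooperation pairs can be sketched as below. The sketch is in Python for illustration only (the paper used R), and the function `cooperation_edges`, the threshold parameter, and the author names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooperation_edges(author_lists, min_publications=1):
    """Convert per-article author lists into weighted one-to-one pairs.

    Only authors with at least min_publications articles are kept.
    Returns a Counter mapping (author_a, author_b) to the number of
    articles the two authors share.
    """
    publications = Counter()
    for authors in author_lists:
        publications.update(set(authors))
    productive = {a for a, n in publications.items() if n >= min_publications}

    edges = Counter()
    for authors in author_lists:
        kept = sorted(set(authors) & productive)
        # Every unordered pair of co-authors on one article is one cooperation.
        for a, b in combinations(kept, 2):
            edges[(a, b)] += 1
    return edges

# Made-up author lists for four articles.
papers = [
    ["Li", "Wang", "Zhang"],
    ["Li", "Wang"],
    ["Zhang", "Chen"],
    ["Li", "Zhang"],
]
edges = cooperation_edges(papers, min_publications=2)
```

Each key of `edges` is a source/target pair, so the result can be written out as a two-column table, which, as far as we are aware, is the shape of input that a function such as simpleNetwork expects. The same routine applies to affiliations by passing affiliation lists instead of author lists.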

4 Results and Discussion

For the literature characteristics analysis, we considered publication date, journal distribution, and productive affiliations. Table 1 shows the number and growth rate of diabetes publications by year during 2007–2016. The total number of publications increased rapidly year by year up to 2010 but declined afterwards. The growth rate of publications on diabetes from 2009 to 2010 reached 31.9%, reflecting a surge in diabetes research in 2010.

Table 1. The number and growth rate of publications on diabetes by year during 2007–2016
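The growth rate is not defined explicitly in the text; a minimal sketch, assuming it is the year-over-year percentage change in the number of publications, is given below. The yearly counts are made up and do not reproduce Table 1.

```python
def growth_rates(counts_by_year):
    """Return the year-over-year growth rate in percent for each year.

    Assumes growth rate = (current - previous) / previous * 100.
    The first year has no previous year and is therefore omitted.
    """
    years = sorted(counts_by_year)
    rates = {}
    for previous, current in zip(years, years[1:]):
        change = counts_by_year[current] - counts_by_year[previous]
        rates[current] = 100 * change / counts_by_year[previous]
    return rates

# Made-up yearly publication counts, for illustration only.
counts = {2008: 2000, 2009: 2500, 2010: 3000}
rates = growth_rates(counts)
```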

The regression line of the cumulative number of publications against time was fitted as \( y = -5.075 \times 10^{6} + 2.529 \times 10^{3} x \), where x is the year and y is the cumulative number of publications, with a goodness of fit \( R^{2} \) of 0.9991. This indicates that the regression line fits the actual cumulative number of publications well, so future research output on diabetes in China can be inferred from it. For example, the predicted number of publications on diabetes research in China in 2017 is \( (-5.075 \times 10^{6} + 2.529 \times 10^{3} \times 2017) - (-5.075 \times 10^{6} + 2.529 \times 10^{3} \times 2016) = 2529 \). Table 2 shows the top 10 most productive journals, which together accounted for 37.52% of the total publications.
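Returning to the regression fit, the 2017 prediction is the difference between two consecutive cumulative values, so the intercept cancels and the result equals the slope. A short Python check using the fitted coefficients reported above is given here; the function names are illustrative.

```python
# Fitted coefficients of the cumulative-publication regression line.
INTERCEPT = -5.075e6
SLOPE = 2.529e3

def cumulative_publications(year):
    """Fitted cumulative number of publications up to the given year."""
    return INTERCEPT + SLOPE * year

def predicted_annual_publications(year):
    """Predicted output in one year: difference of consecutive cumulative values."""
    return cumulative_publications(year) - cumulative_publications(year - 1)

prediction_2017 = predicted_annual_publications(2017)
```

Because the fitted model is linear in the year, the same annual prediction is obtained for any future year.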

Table 2. Top 10 most productive journals

Table 3 shows the top 10 most productive first-author affiliations and total author affiliations. Anhui Provincial Hospital ranked in the top two of both lists, indicating that it contributed substantially to diabetes research in China.

Table 3. Top 10 most productive first author affiliations and total affiliations

For the research hotspots analysis, we used author-defined keywords, which generally represent the main points of a publication. Table 4 lists the 34 keywords with a frequency greater than or equal to 200. The top three were “Diabetes” (frequency 7,510; 7.078%), “Type 2 Diabetes” (6,978; 6.577%), and “Diabetic Nephropathy” (2,873; 2.708%). From these keywords, a co-occurrence matrix with 34 rows and 34 columns was generated; the part covering the top 10 keywords is shown in Table 5. The co-occurrence matrix is symmetric. The entries on the main diagonal give the frequency of each keyword, and the off-diagonal entries give the co-occurrence frequency of two different keywords. For instance, the co-occurrence frequency of “Diabetes” and “Blood Glucose” is 280, indicating that these two keywords appeared together in 280 documents.

Table 4. The ranked keywords with frequency greater than or equal to 200
Table 5. The top 10 keywords in the co-occurrence matrix

We then generated a keyword correlation matrix from the co-occurrence matrix using the Ochiai correlation coefficient, \( O_{ij} = A_{ij} / \sqrt{A_{i} A_{j}} \). Here \( O_{ij} \) takes values in [0, 1] and measures how strongly keywords \( W_{i} \) and \( W_{j} \) co-occur, \( A_{ij} \) is the co-occurrence frequency of \( W_{i} \) and \( W_{j} \), and \( A_{i} \) and \( A_{j} \) are the frequencies of \( W_{i} \) and \( W_{j} \), respectively. Values in the correlation matrix reflect how close two keywords are: the larger the correlation between two keywords, the smaller the distance between them. The keyword dissimilarity matrix was then obtained by subtracting each value in the correlation matrix from 1.
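As a minimal sketch of this computation, the Python below derives the Ochiai correlation matrix and the dissimilarity matrix from a co-occurrence matrix whose diagonal holds the keyword frequencies. The paper's analysis was carried out in R; the function names and the 3 × 3 sample matrix here are made up for illustration.

```python
import math

def ochiai_matrix(cooc):
    """Ochiai coefficient O_ij = A_ij / sqrt(A_i * A_j).

    cooc is a square co-occurrence matrix whose diagonal entries are the
    keyword frequencies A_i.
    """
    n = len(cooc)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denominator = math.sqrt(cooc[i][i] * cooc[j][j])
            # Guard against keywords with zero frequency.
            result[i][j] = cooc[i][j] / denominator if denominator else 0.0
    return result

def dissimilarity_matrix(correlation):
    """Dissimilarity = 1 - correlation, applied element by element."""
    return [[1.0 - value for value in row] for row in correlation]

# Made-up co-occurrence matrix for three keywords.
cooc = [
    [4, 2, 0],
    [2, 9, 3],
    [0, 3, 4],
]
correlation = ochiai_matrix(cooc)
dissimilarity = dissimilarity_matrix(correlation)
```

With the frequencies on the diagonal, each keyword has correlation 1 with itself and therefore dissimilarity 0, which is the form a distance-like matrix needs for multidimensional scaling.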

The 34 keywords were then divided into 7 research hotspot categories by hierarchical clustering, k-means clustering, and multidimensional scaling analysis, each applied separately. We set the number of clusters k to 7 based on empirical experience. Figure 2 shows the result of hierarchical clustering using complete linkage with the distance function \( D_{HK} = \max (d_{uv}),\; u \in H\;\text{and}\;v \in K \), where \( d_{uv} \) is the distance between keywords u and v, with u belonging to cluster H and v belonging to cluster K.

Fig. 2. The visualized result of hierarchical clustering
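The complete-linkage rule can be illustrated with a small agglomerative sketch in plain Python. It assumes the input is a symmetric dissimilarity matrix and stops merging once k clusters remain. The paper used R and k = 7 on 34 keywords, whereas the 4 × 4 matrix and k = 2 below are made up for illustration.

```python
def complete_linkage(dist, k):
    """Agglomerative clustering with complete linkage.

    dist is a symmetric dissimilarity matrix. The distance between clusters
    H and K is D_HK = max(d_uv) over u in H and v in K. The two closest
    clusters are merged repeatedly until k clusters remain.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[u][v] for u in clusters[a] for v in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a and drop b.
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Made-up dissimilarity matrix for four keywords.
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.95],
    [0.9, 0.7, 0.0, 0.3],
    [0.8, 0.95, 0.3, 0.0],
]
clusters = complete_linkage(dist, k=2)
```

Complete linkage measures two clusters by their farthest pair of members, so it tends to produce compact groups; other linkage rules would generally yield different categories.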

Among the research hotspot categories generated by the three methods, three categories were identical: (1) Glycosylated Hemoglobin and Type 2 Diabetes; (2) Blood Fat and Blood Glucose; and (3) Diabetic Retinopathy, Vascular Endothelial Growth Factor, Diabetic Nephropathy, and Transforming Growth Factor-β1. Table 6 summarizes the results of hierarchical clustering, k-means clustering, and multidimensional scaling analysis.

Table 6. The summary of the results of hierarchical clustering, k-means clustering, and multidimensional scaling analysis

For the cooperation network analysis, we first observed that the number of publications credited to an author or affiliation as first author only was much smaller than the number that included cooperation with other authors or affiliations. For example, Anhui Provincial Hospital was the first-author affiliation of 298 publications, but the number rose to 618 when publications in cooperation with other affiliations were included, indicating extensive cooperation among authors and affiliations in diabetes research. Therefore, considering all authors of each publication, we analyzed the cooperation relationships among authors with at least 100, 150, and 200 publications, respectively, and among affiliations with at least 50 publications.

Figure 3 shows the cooperation network of authors with at least 100 publications (accessible via the link in Footnote 1), and Figure 4 shows the cooperation network of affiliations with at least 50 publications (accessible via the link in Footnote 2). In each network, the black nodes represent authors or affiliations, and the lines represent cooperation relationships. The more lines connected to a node, the closer the cooperation of that author or affiliation with others. Moreover, nodes can be dragged and dropped to inspect the cooperation relationships of a specific author or affiliation.

Fig. 3. The cooperation network of authors with the number of publications greater than or equal to 100

Fig. 4. The cooperation network of affiliations with the number of publications greater than or equal to 50

5 Summary

This paper presented a data-driven method based on bibliometrics and knowledge mapping analysis, aiming at discovering the recent research status of diabetes in China during 2007–2016. Using descriptive statistics, clustering and multidimensional scaling, and network analysis, we obtained literature distribution characteristics, research hotspots, and cooperation relationships among authors and affiliations, and presented the resulting findings. Our work can provide clinical researchers and funding agencies with an up-to-date view of the research status, potentially assisting in the choice of research topics and in the development of clinical research strategies.