1 Introduction

Text mining is the discovery of new, previously unknown information through the automatic extraction of information from different text resources by computer (Hearst 2003). Text mining methods can be regarded as an extension of data mining to text data (Romero and Ventura 2007), and data mining techniques are also widely applied in image processing, e.g., clustering (Zhang et al. 2017), classification (Tan et al. 2018; Tan and Gao 2017), discriminant analysis (Li et al. 2017), and information retrieval (Luo et al. 2017). An important aim of text mining is to sift through large volumes of text to extract patterns and models that can be incorporated into intelligent applications (Apte et al. 1998). Text mining typically involves structuring the input text, deriving patterns from the structured data, and evaluating and interpreting the output. In addition, it allows researchers to identify needed information more efficiently, uncover relations hidden in the sheer volume of available information, and generally shift the burden of information overload from the researcher to the computer by applying algorithmic, statistical, and data management methods to the vast amount of knowledge contained in unstructured texts. Medicine, in turn, is a large and complex domain with abundant synonymy and many semantically similar and related concepts (Batet et al. 2011). Most clinical information resources, such as electronic medical records and medical knowledge sources, contain a considerable amount of information, much of which is free text (Meystre and Haug 2005). Therefore, text mining has great potential to improve health care and advance medicine through the processing of large amounts of medical text data.

Text mining in the medical research field has drawn increasing attention from academia. In recent years especially, researchers have begun to explore how text mining techniques can be applied to processing medical information. Some examples are as follows. Building on dual-process theory and the knowledge adoption model, Jin et al. (2016) introduced a healthcare information adoption model to explore patients’ healthcare information-seeking behavior in online communities. Savova et al. (2010) developed and evaluated an open-source natural language processing system to extract information from the clinical free text of electronic medical records. Lucini et al. (2017) applied text mining methods to data from early emergency department patient records. In addition, a remarkable growth of interest in systems optimization problems has led to the wide application of optimization techniques (e.g., Wang et al. 2018; He et al. 2016, 2017; Lin et al. 2017). For example, Saraswathi and Tamilarasi (2016) proposed an ant colony optimization-based feature selection method for opinion mining classification. Other research interests in medical information processing with text mining techniques include obesity event mining (Chou et al. 2014), sexual event mining (Knight et al. 2012), and smoking event mining (Hoek et al. 2014). Consequently, the number of academic publications in this interdisciplinary research field is increasing.

Bibliometric analysis is an effective and widely applied strategy for analyzing existing publications. The term bibliometrics was defined in 1969 as “the application of mathematical and statistical methods to books and other media of communication” (Glanzel 2003). Used initially in the field of library and information science, bibliometrics has since been widely applied to other areas and has proven effective through long-term practice. In the era of big data, bibliometrics has become a quantitative and qualitative tool for analyzing the distribution, research hotspots, and trends of a given research field (Chen et al. 2017a, b; Li and Zhao 2015), as well as a widely accepted tool for identifying future research directions to guide younger researchers (Fu et al. 2010). The benefits of bibliometric analysis include organizing information in a specific field (Merigó et al. 2015), evaluating scientific developments in a specific subject (Bouyssou and Marchant 2011), comparing research performance across different countries and institutes, and identifying emerging research hotspots (Mazloumian 2012). It has also been applied in interdisciplinary research fields, e.g., natural language processing in mobile computing (Chen et al. 2018a), natural resource accounting (Zhong et al. 2016), and fuzzy theory research (Yu et al. 2018).

To the best of our knowledge, no bibliometric analysis of the research field of text mining in medical research has yet been conducted. Therefore, this study conducts a bibliometric analysis of scientific publications from 2008 to 2017 retrieved from Web of Science and PubMed to explore the status and development of the field. The main objectives of this study are to: (1) identify the statistical characteristics of the publications, (2) explore their geographical distribution, (3) calculate collaboration degrees, (4) visualize scientific collaboration relations, and (5) discover current research hotspots and their evolution.

The remainder of this paper is organized as follows. Section 2 introduces the materials and methods. The results of the overall characteristics analysis, collaboration analysis, and topic modeling analysis are presented in Sect. 3. Section 4 discusses the findings. The study finishes with conclusions in Sect. 5.

Table 1 Statistical characteristics of the publications

2 Materials and methods

2.1 Materials

Web of Science (WoS) and PubMed are among the most commonly used databases in academia. WoS is an authoritative citation database of high-quality publications, while PubMed is the largest data source on life sciences and biomedical topics. To make full use of their complementary advantages, we use all the relevant publications from both databases.

First, a list of keywords (Table 8 in “Appendix”) related to “text mining” was determined by domain experts. In the WoS Core Collection database, Topic was used as the retrieval field. “Science Citation Index Expanded (SCI-E)” and “Social Sciences Citation Index (SSCI)” were selected as the citation indexes to ensure publication quality. A total of 2284 publications between 2008 and 2017 were identified with “Article” or “Proceedings paper” as the document type and a WoS category containing the terms “Health”, “Medicine”, “Medical”, “Clinical”, or “Nursing”. After manually removing 62 irrelevant publications with “image” or “imaging” in the title, 2222 publications remained.

For the PubMed database, Title/Abstract was used as the search field. A total of 6331 publications between 2008 and 2017 were retrieved, of which 3346 were of the “Journal Article” type with “humans” as species and “MEDLINE” as journal category. Similarly, after removing 165 irrelevant publications with “image” or “imaging” in the title, 3283 publications were identified. Finally, 1967 publications were retained for analysis after removing 1316 publications already contained in the WoS set, as determined by manual review of publication title, author, publication year, and publication source.

The raw data of the total of 4189 publications from WoS and PubMed were downloaded in both plain text and XML formats. Key elements including title, publication source, publication year, abstract, author address, author keywords, and Keywords Plus/PubMed MeSH were extracted, and missing information was supplemented manually. Finally, the corresponding institutes and countries were identified from the author address information. The statistical characteristics of the publications are shown in Table 1.

2.2 Collaboration degree analysis

The collaboration degree measures the extent of collaboration in scientific research at the level of authors, institutes, and countries (Zhang et al. 2016). The author, institute, and country collaboration degrees are calculated, in that order, as in Eq. (1) (Wei et al. 2013).

$$\begin{aligned} C_{Ai} = \frac{\sum _{j=1}^{N}\alpha _{j}}{N},\quad C_{Ii} = \frac{\sum _{j=1}^{N}\beta _{j}}{N},\quad C_{Ci} = \frac{\sum _{j=1}^{N}\gamma _{j}}{N} \end{aligned}$$
(1)

In Eq. (1), \(C_{Ai}\), \(C_{Ii}\), and \(C_{Ci}\) represent the author, institute, and country collaboration degrees in year i. \(\alpha _{j}\), \(\beta _{j}\), and \(\gamma _{j}\) indicate the numbers of authors, institutes, and countries of the j-th publication in that year, and N denotes the annual total number of publications in the research field.
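
As an illustration of Eq. (1), the following minimal Python sketch computes the three collaboration degrees per year. The record layout (dictionaries with `year`, `authors`, `institutes`, and `countries` keys) and the toy records are assumptions made for this example; they are not the data format used in the study.

```python
from collections import defaultdict


def collaboration_degrees(publications):
    """Return {year: (C_A, C_I, C_C)} following Eq. (1).

    Each publication is a dict with hypothetical keys 'year', 'authors',
    'institutes', and 'countries' (the last three are lists of names).
    """
    # Per year: [sum of alpha_j, sum of beta_j, sum of gamma_j, N]
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for pub in publications:
        t = totals[pub["year"]]
        t[0] += len(set(pub["authors"]))
        t[1] += len(set(pub["institutes"]))
        t[2] += len(set(pub["countries"]))
        t[3] += 1
    return {
        year: (a / n, i / n, c / n)
        for year, (a, i, c, n) in sorted(totals.items())
    }


# Hypothetical toy records, for illustration only.
toy_publications = [
    {"year": 2008, "authors": ["A", "B", "C"], "institutes": ["X", "Y"], "countries": ["US"]},
    {"year": 2008, "authors": ["D"], "institutes": ["X"], "countries": ["US"]},
    {"year": 2009, "authors": ["A", "E"], "institutes": ["X", "Z"], "countries": ["US", "UK"]},
]
degrees = collaboration_degrees(toy_publications)
print(degrees)
```

For the toy records, the two 2008 publications have 3 and 1 authors, so the author collaboration degree for 2008 is (3 + 1) / 2 = 2.0.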

2.3 Social network analysis (SNA)

Complex social systems are usually formed by the interaction of social actors with each other at multiple physical or social interfaces and across layers. A complex social system can be expressed as a social network with “actors” as nodes and “interactions” as links. A social network is thus a collection of social actors and their interaction relations. The relations between nodes represent similarities, interactions, social relations, and flows (Borgatti et al. 2009). Scholars and managers are interested in how complex social systems change and evolve to produce dynamic patterns. By studying the social network, the emergence of interaction patterns and their evolution over time can be explored. Social network analysis (SNA) is grounded in mathematical methods and graph theory and thus provides a quantitative assessment of the relations between social actors.

In this paper, we apply SNA to explore the collaboration relations among countries/regions, institutes, and authors in the research field. Collaboration relations are derived by counting the number of times two entities (e.g., two countries/regions) appear together in the same publication. In the network, each country/region, institute, or author is presented as a node whose size represents its proportion of publications. The node color denotes the continent or country. The thickness of each line indicates the collaboration strength between two countries/regions, institutes, or authors. One can explore the collaboration relations of a specific country/region, institute, or author by clicking on its node.
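
The co-occurrence counting behind such a network can be sketched as follows. This is a minimal illustration, assuming each publication is reduced to a list of country/region names; the function name `collaboration_edges` and the toy data are hypothetical and do not reproduce the tool used to draw the figures.

```python
from collections import Counter
from itertools import combinations


def collaboration_edges(publications):
    """Count node occurrences and pairwise co-occurrences.

    `publications` is a list of lists of entity names (e.g., countries).
    Returns (node_counts, edge_weights); edge keys are sorted name pairs.
    """
    node_counts = Counter()
    edge_weights = Counter()
    for entities in publications:
        unique = sorted(set(entities))  # count each entity once per publication
        node_counts.update(unique)
        edge_weights.update(combinations(unique, 2))
    return node_counts, edge_weights


# Hypothetical toy data, for illustration only.
toy = [["USA", "England"], ["USA", "China", "England"], ["USA"], ["China", "USA"]]
nodes, edges = collaboration_edges(toy)

# Node size: proportion of publications; line thickness: co-occurrence count.
node_share = {name: count / len(toy) for name, count in nodes.items()}
print(node_share)
print(edges.most_common())
```

Sorting the names within each publication keeps the edge keys order-independent, so the pairs (USA, England) and (England, USA) are counted as the same link.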

2.4 Latent Dirichlet allocation (LDA)

As an emerging quantitative method for assessing large amounts of textual data, topic modeling extracts semantic information from a collection of texts using statistical algorithms. The first topic model, probabilistic latent semantic indexing (pLSI), was proposed by Hofmann (1999). It models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions, where the mixture components can be viewed as representations of topics. An improved three-layer Bayesian model, latent Dirichlet allocation (LDA), was developed by Blei et al. (2003); it takes the Dirichlet distribution as the prior distribution and reduces the number of parameters to only one. In LDA, documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words, and topics are assumed to be uncorrelated. To reduce computing time and required memory (Blei and Lafferty 2007; Teh et al. 2005), several extensions of the original LDA model, such as Correlated Topic Models and the Hierarchical Dirichlet Process, have been proposed in recent years. LDA and its extensions have been widely applied in scientometric research to discover semantic structures and latent topics in a discipline or to measure the relations among multiple disciplines (Lu and Wolfram 2012; Nichols 2014; Yau et al. 2014). LDA defines the following terms:

(1) A word is an item from a vocabulary indexed by \(\{1,\ldots , V\}\);

(2) A document is a sequence of N words denoted by \(d =(w_{1},\ldots ,w_{N})\);

(3) A corpus is a collection of M documents denoted by \(D=\{d_{1},\ldots , d_{M}\}\).

LDA assumes the following generation process:

(1) The term distribution \(\beta \), containing the probability of a word occurring in a given topic, is determined by \(\beta \sim \) Dirichlet(\(\delta \));

(2) The proportions \(\theta \) of the topic distribution for a document d are determined by \(\theta \sim \) Dirichlet(\(\alpha \));

(3) For each word \(w_{i}\) in the document d, a topic is chosen by the distribution \(z_{i}\sim \) Multinomial(\(\theta \)), and a word is chosen from a multinomial probability distribution conditioned on the topic \(z_{i}: p(w_{i}|z_{i},\beta )\).
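
The generative process above can be simulated with a short Python sketch. It is a toy illustration only: the vocabulary size, number of topics, concentration values, and function names are assumptions for the example, and Dirichlet draws are obtained by normalizing Gamma variates from the standard library rather than with the R package used in this study.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, beta, alpha, n_words):
    """Generate one document given topic-word distributions `beta`.

    beta[k][v] is the probability of word v under topic k.
    Returns (theta, words), where words are vocabulary indices.
    """
    n_topics = len(beta)
    theta = sample_dirichlet(rng, alpha, n_topics)            # step (2)
    words = []
    for _ in range(n_words):                                   # step (3)
        z = rng.choices(range(n_topics), weights=theta)[0]     # topic for this word
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]
        words.append(w)
    return theta, words


rng = random.Random(0)          # fixed seed so the run is repeatable
V, K = 6, 2                     # hypothetical vocabulary size and topic count
beta = [sample_dirichlet(rng, 0.5, V) for _ in range(K)]      # step (1)
theta, doc = generate_document(rng, beta, alpha=0.5, n_words=8)
print(theta, doc)
```

Fitting LDA reverses this process: given only the observed words, the estimation methods described next recover plausible values of \(\beta \) and \(\theta \).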

The log-likelihood for one document \({d\,\in \,D}\) in variational expectation–maximization (VEM) estimation is given by Eq. (2).

$$\begin{aligned} \begin{aligned} l(\alpha , \beta )&=\! \log (p(d|\alpha , \beta )) \\&=\!\log \int \left\{ \sum _{z}\left[ \prod _{i=1}^{N}p(w_{i}|z_{i}, \beta )p(z_{i}|\theta )\right] \right\} p(\theta |\alpha )\hbox {d}\theta \end{aligned} \end{aligned}$$
(2)

Gibbs sampling is a Markov chain Monte Carlo method (Finkel et al. 2005) that constructs a Markov chain converging to the target probability distribution in a high-dimensional model and then extracts the sample distribution closest to that target distribution. The log-likelihood for Gibbs sampling is given by Eq. (3).

$$\begin{aligned} \begin{aligned} \log (p(d|z)) =&\ k \ \log \left( \frac{{\varGamma }(V\delta )}{{\varGamma }(\delta )^{V}}\right) \\&+ \sum _{K=1}^{k}\left\{ \left[ \sum _{j=1}^{V}\log \left( {\varGamma }(n_{K}^{(j)} + \delta )\right) \right] \right. \\&\left. - \log \left( {\varGamma }(n_{K}^{(.)} + V\delta )\right) \right\} \end{aligned} \end{aligned}$$
(3)

The topic modeling analysis in this study proceeds as follows:

(1) Weights of 0.4, 0.4, and 0.2, determined in our former experiment (Chen et al. 2018b), are assigned to segmented author keywords, Keywords Plus, and PubMed MeSH terms; publication titles; and abstracts, respectively.

(2) Frequent terms usually provide only limited information: most of the frequent terms in Table 1 are either trivial for text mining in the medical sector, such as “health”, “mining”, “clinical”, and “patient”, or trivial in general scientific publications, such as “analysis”, “using”, “study”, and “method”. We therefore transform the corpus using Term Frequency-Inverse Document Frequency (TF-IDF) to penalize frequent terms occurring in many publications (Salton et al. 1975; Robertson 2004). We calculate the TF-IDF values of all terms and sort the terms by these values. A threshold of 0.1 is determined empirically by manually examining the ranked terms. Terms with a TF-IDF value no greater than the threshold are removed.

(3) Seventeen candidate topic numbers are sampled: 2 to 10, 15, 20, 30, 40, 50, 80, 150, and 250. For each topic number, tenfold cross-validation is used to evaluate model performance. Perplexity is used as the criterion for selecting the optimal topic number (Blei et al. 2003). \(\alpha \) for Gibbs sampling is initialized to the mean of the \(\alpha \) values of the models fitted using VEM with the optimal topic number.

(4) We then use Gibbs sampling and the VEM method to estimate the LDA model with the optimal topic number and the initialized \(\alpha \).

(5) The topics detected by VEM and Gibbs sampling are matched based on the Hellinger distance in Eq. (4), so that the best matches with the smallest distances can be identified. In Eq. (4), P and Q denote two probability measures.

    $$\begin{aligned} H^{2}(P, Q) = \frac{1}{2}\int \left( \sqrt{\hbox {d}P} - \sqrt{\hbox {d}Q}\right) ^{2} \end{aligned}$$
    (4)
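
For discrete topic-word distributions, Eq. (4) reduces to a sum over the vocabulary, and the matching in step (5) can be illustrated with the Python sketch below. It pairs each VEM topic with its nearest Gibbs topic, which is a simplification assumed for this example (it does not enforce a one-to-one assignment); the function names and toy distributions are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    squared = 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared)


def match_topics(vem_topics, gibbs_topics):
    """Pair each VEM topic with its closest Gibbs topic.

    Returns (vem_index, gibbs_index, distance) tuples, best match first.
    """
    matches = []
    for i, p in enumerate(vem_topics):
        j, dist = min(
            ((j, hellinger(p, q)) for j, q in enumerate(gibbs_topics)),
            key=lambda item: item[1],
        )
        matches.append((i, j, dist))
    return sorted(matches, key=lambda m: m[2])


# Hypothetical topic-word distributions over a three-word vocabulary.
vem = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
gibbs = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
for vem_idx, gibbs_idx, dist in match_topics(vem, gibbs):
    print(vem_idx, gibbs_idx, round(dist, 4))
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support, so sorting by distance yields a ranking of best matching topics.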

Fig. 1 Publication number distribution by year

With the development and availability of accessible software, topic modeling and other text mining approaches have recently become more approachable. Open-access options include several R, Python, and Java packages. In this study, topic modeling is conducted with the R package topicmodels by Grün and Hornik (2011). The package requires a text mining front end, such as the R package tm (Feinerer et al. 2008).

3 Results

3.1 General publication statistics

3.1.1 Publication with year

The publication number distribution by year is shown in Fig. 1. The number of publications increases from 251 in 2008 to 577 in 2016 but declines to 437 in 2017. The decline may be caused by a time lag in the indexing of some 2017 publications in the databases. The annual growth rate averages 7.31% and peaks at 25.32% from 2012 to 2013, indicating a surge of research in 2013.

Table 2 Top 20 productive publication sources in the research

3.1.2 Productive publication sources

The top 20 productive publication sources in the research field are presented in Table 2. Together, these sources contribute 38.46% of the total publications. All 20 publication sources are journals except AMIA Annual Symposium Proceedings and Studies in Health Technology and Informatics, which are two top conference proceedings series in medical informatics. The most productive journal is Journal of Biomedical Informatics with 297 publications, followed by Journal of the American Medical Informatics Association with 212 publications and PLOS One with 146 publications. All 18 journals on the list have an impact factor (IF) of over 1.00. Nucleic Acids Research has the highest IF, 11.561, reflecting the high quality of its publications. Interestingly, no clear relation between a journal’s total number of publications and its IF is found in this field. This may be due to the fact that most journals with a higher reputation cover many research fields, of which text mining in medical research is just one.

3.1.3 Geographical distribution

The analysis of the world geographical distribution is based on author institute addresses. All the authors participating in each publication are considered. Also, since an author may be affiliated with more than one institute, all the countries/regions and institutes of the authors are used in the geographical distribution analysis.

The 4189 publications come from 88 countries/regions. Figure 2 illustrates the geographical distribution of the publications. The top 4 countries are the USA (1680 publications), the UK (546 publications), Canada (314 publications), and China (285 publications). The USA has about 3 times as many publications as the second most productive country, indicating its dominant position in the research field. Among the top 20 countries/regions, most are developed countries; the exceptions are China (ranked 4th) and Brazil (ranked 7th), which reflects their strong enthusiasm for the research field.

Since the publications are mainly distributed among the top 5 countries, we further explore the annual publication distributions for these countries, as shown in Fig. 3. The number of publications from the USA shows an overall upward trend with fluctuations, from 82 in 2008 to 241 in 2015, but has declined since 2015. For the UK, the publication number grows slowly before 2013 and declines slightly from 2013 to 2015, after which a sharp growth is noticeable in 2016. For the other three countries, the publication numbers show overall upward trends with fluctuations over the years. In short, the research field has received increasing attention from these countries.

Fig. 2 Geomap of publications by countries

Table 3 The most productive authors in the research
Table 4 The most productive first authors and last authors in the research

3.1.4 Productive authors and institutes

The top 20 productive authors are listed in Table 3. All of them are from the USA except Darmoni, Stefan J. from France, which again demonstrates the USA’s high productivity in the research field. The top 5, all from the USA, are Denny, Joshua C. (52 publications), Xu, Hua (52 publications), Savova, Guergana K. (38 publications), Liu, Hongfang (33 publications), and Lu, Zhiyong (32 publications). Most of the 20 authors serve more often as last authors than as first authors, and all but Denny, Joshua C. and Chute, Christopher G. collaborate with other authors in all of their publications.

Table 4 lists the most productive first authors and last authors. All 9 productive first authors are from the USA. All 9 productive last authors are from the USA except Darmoni, Stefan J. from France. The top 3 first authors are Pakhomov, Serguei V. S. (11 publications), Denny, Joshua C. (10 publications), and Meystre, Stephane M. (9 publications). The top 3 last authors are Xu, Hua (26 publications), Lu, Zhiyong (23 publications), and Denny, Joshua C. (17 publications). It is worth noting that Denny, Joshua C. and Xu, Hua appear in both lists, which to a certain degree demonstrates their influence in the research field.

Fig. 3 Publication distributions by year for the top 5 countries

A total of 3208 institutes from 88 countries have conducted research in the field. Table 5 shows the most productive institutes. Most of the 19 institutes are from the USA, except University of Manchester from the UK, University of Toronto from Canada, and University of Sao Paulo from Brazil. The top 5 are all from the USA: National Institutes of Health (120 publications), University of Utah (110 publications), Vanderbilt University (93 publications), Harvard University (86 publications), and Mayo Clinic (84 publications). The first institute percentage is above 50% for most of the institutes, except University of California San Diego (38.78%) and University of Texas Health Science Center Houston (40.91%), indicating the leading position of the top productive institutes. Most institutes collaborate extensively with other institutes, with an average collaboration percentage of 78.98%; Salt Lake City VA Health Care System has the highest (96.08%).

Table 5 The most productive institutes in the research

3.2 Collaboration analysis

3.2.1 Collaboration degree

Figure 4 presents the annual collaboration degrees from three perspectives. The author collaboration degree increases noticeably, reaching 5.29. In contrast, the institutional and international collaboration degrees are steady and relatively low, especially the international collaboration degree. This suggests that authors tend to collaborate more with those within the same country or institute. The three average degrees are 4.51, 2.26, and 1.30, respectively; that is, on average 4.51 authors, 2.26 institutes, and 1.30 countries participate in one publication.

3.2.2 Collaboration visualization

We further visualize the collaborations from three perspectives using SNA. A collaboration network for the 88 countries/regions, with 88 nodes and 516 edges, is shown in Fig. 5. The USA (the largest node, in brown) has the most collaborations with other countries/regions. The USA–England collaboration (the thickest line) ranks first, followed by the USA–China and USA–Canada collaborations. The collaboration network among the 67 institutes with at least 20 publications is shown in Fig. 6, with 67 nodes and 564 edges. Forty of the 67 institutes are from the USA, and the collaboration network among them (the nodes in blue) is very dense. The collaboration network of the 81 authors with at least 10 publications is shown in Fig. 7, with 81 nodes and 291 edges. Among the nodes, 8 are sparse nodes, namely “Xu, Rong”, “Botsis, Taxiarchis”, “Khorasani, Ramin”, “Stewart, Robert”, “Darmoni, Stefan J”, “Nenadic, Goran”, “Dai, Hong-Jie”, and “Zweigenbaum, Pierre”, owing to their lack of collaborations with other author nodes. Most of the authors (74.07%) are from the USA, and the collaboration network among them (the nodes in blue) is very dense.

Fig. 4 Annual collaboration degree distributions

Fig. 5 Collaboration network of 88 countries/regions (the orange nodes represent countries/regions from South America, blue for Africa, green for Oceania, red for Europe, purple for Asia, and brown for North America)

Fig. 6 Collaboration network of 67 institutes (different colors of nodes represent different countries/regions, e.g., the blue nodes represent institutes from the USA, orange for England, purple for Australia)

Fig. 7 Collaboration network of 81 authors (different colors of nodes represent different countries/regions, e.g., the blue nodes represent authors from the USA, green for England, light blue for China)

3.3 Topic modeling analysis

Terms with TF-IDF values above the threshold of 0.1 are used in the topic modeling analysis. Table 6 lists the top 20 most frequent of these terms. The terms listed in the table are clearly more specific to text mining in medical research. Several nursing-related terms have high occurrence numbers, such as “Nursing” (1002) and “Nurse” (871), suggesting the significance of nursing research using text mining techniques. Terms such as “Breast”, “Depression”, “Sexual”, and “Obesity” reflect specific medical issues in the research. “Chinese” (422) is the only country-related term in the table, indicating that China has been focusing on text mining in medical research during these years.
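
The TF-IDF filtering can be illustrated with the following Python sketch. The exact TF-IDF variant used in the study is not spelled out in the text, so the scoring here is an assumption: the mean length-normalized term frequency over the documents containing the term, multiplied by the base-2 logarithm of the inverse document frequency. The function names and toy documents are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def mean_tfidf(documents):
    """Score each term by mean normalized TF times log2 inverse document frequency."""
    n_docs = len(documents)
    tf_sums = defaultdict(float)   # sum of normalized term frequencies per term
    doc_freq = Counter()           # number of documents containing each term
    for tokens in documents:
        for term, count in Counter(tokens).items():
            tf_sums[term] += count / len(tokens)
            doc_freq[term] += 1
    return {
        term: (tf_sums[term] / doc_freq[term]) * math.log2(n_docs / doc_freq[term])
        for term in tf_sums
    }


def filter_terms(documents, threshold):
    """Keep only terms whose score is strictly greater than the threshold."""
    scores = mean_tfidf(documents)
    keep = {term for term, score in scores.items() if score > threshold}
    return [[t for t in tokens if t in keep] for tokens in documents]


# Hypothetical tokenized documents, for illustration only.
docs = [
    ["patient", "speech", "aphasia"],
    ["patient", "nursing", "nurse"],
    ["patient", "dementia", "memory", "memory"],
]
scores = mean_tfidf(docs)
filtered = filter_terms(docs, threshold=0.1)
print(filtered)
```

In the toy corpus, “patient” occurs in every document, so its inverse document frequency term is log2(1) = 0 and it is removed, mirroring how trivial frequent terms are penalized.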

3.3.1 Topic generation

We employ the LDA model to reveal the latent intellectual topics in the literature corpus based on the terms selected by TF-IDF. To fit the model, we need to determine its parameters, including the number of topics and \(\alpha \). Hence, we compute the perplexities of a set of models with different numbers of topics to find the minimum in the tenfold cross-validation. Figure 8 presents the perplexities of models with different numbers of topics. The result indicates that the data are best accounted for by a model with 40 topics. \(\alpha \) is set to 0.01857649, the mean value over the cross-validation models fitted using VEM. Using these parameters, we estimate the LDA model using Gibbs sampling.
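
The selection rule can be sketched in Python as follows. Perplexity is computed here with the usual definition, the exponential of the negative held-out log-likelihood per token; the per-fold perplexity values in the example are hypothetical placeholders, not results from this study.

```python
import math


def perplexity(log_likelihoods, token_counts):
    """Perplexity of held-out documents.

    log_likelihoods[i] is the log-likelihood of document i under the model,
    token_counts[i] is its number of tokens.
    """
    return math.exp(-sum(log_likelihoods) / sum(token_counts))


def select_topic_number(fold_perplexities):
    """Pick the topic number with the lowest mean perplexity across folds.

    fold_perplexities maps a topic number to a list of per-fold perplexities.
    """
    means = {k: sum(v) / len(v) for k, v in fold_perplexities.items()}
    best = min(means, key=means.get)
    return best, means


# A document of 4 tokens, each predicted with probability 0.25,
# has perplexity 4 (the model is as uncertain as a uniform 4-way choice).
print(perplexity([4 * math.log(0.25)], [4]))

# Hypothetical per-fold perplexities for three candidate topic numbers.
candidates = {20: [310.0, 305.0], 40: [290.0, 300.0], 80: [298.0, 302.0]}
best_k, mean_perplexities = select_topic_number(candidates)
print(best_k, mean_perplexities)
```

Lower perplexity means the model assigns higher probability to unseen documents, which is why the minimum across candidate topic numbers is taken.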

We assign a potential theme to each topic through semantic analysis of its representative terms and by reviewing the content of the corresponding publications. The order of topics is determined by Hellinger distance. Specifically, Topic 31 is the best matching topic, and Topic 22 ranks second. Owing to space limitations, Table 7 displays only the top 10 best matching topics with their most frequent terms. Each publication is assigned to its most likely topic based on posterior probability. We then obtain a topic distribution by integrating the topic proportions of all the publications. The 4 most frequent research topics are Topic 16 (3.91%), Topic 24 (3.31%), Topic 9 (3.29%), and Topic 31 (3.14%), while the 4 least frequent research topics are Topic 25 (1.91%), Topic 14 (1.88%), Topic 33 (1.88%), and Topic 37 (1.88%).

3.3.2 Topic cluster analysis and trend analysis

We use hierarchical cluster analysis to cluster the 40 topics. One way of measuring topic similarity is term-level similarity, meaning that topics may contain some of the same terms. Another is document-level similarity, meaning that topics may appear in some of the same documents. The clustering results based on cosine similarity for the two measures are shown in Figs. 9 and 10. In the figures, a lower connecting line indicates that the topics are more similar.
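
The similarity measure underlying both dendrograms can be sketched in Python as follows. The sketch computes cosine similarity between topic vectors and finds the most similar pair, which corresponds to the first merge of an agglomerative clustering; the full clustering procedure and linkage used for Figs. 9 and 10 are not reproduced, and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_pair(vectors):
    """Return the index pair (i, j), i < j, with the highest cosine similarity."""
    return max(
        combinations(range(len(vectors)), 2),
        key=lambda pair: cosine_similarity(vectors[pair[0]], vectors[pair[1]]),
    )


# Hypothetical topic vectors: term probabilities for term-level similarity;
# per-document topic proportions would be used for document-level similarity.
topics = [
    [0.6, 0.3, 0.1, 0.0],
    [0.5, 0.4, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]
print(most_similar_pair(topics))
```

Using each topic's row of term probabilities gives term-level similarity, whereas using each topic's column of per-document proportions gives document-level similarity; the same function serves both.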

Identifying emerging research topics can provide valuable insights into the development of the research field (Jiang et al. 2016). Therefore, we explore the annual publication proportions of the 40 research topics, as shown in Fig. 11. We use a nonparametric trend test, the Mann–Kendall test (Mann 1945), to examine whether increasing or decreasing trends exist in the 40 topics. The test results show that fourteen topics, namely Topic 2, Topic 4, Topic 11, Topic 13, Topic 15, Topic 20, Topic 22, Topic 25, Topic 26, Topic 27, Topic 32, Topic 33, Topic 36, and Topic 40, present a statistically significant increasing trend at the two-sided \({p}=0.05\) level.
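
A minimal Python sketch of the Mann–Kendall test is given below. It uses the normal approximation with a continuity correction and, as a simplifying assumption, omits the variance correction for tied values; the example series of annual proportions is hypothetical.

```python
import math


def mann_kendall(values):
    """Mann-Kendall trend test (normal approximation, no tie correction).

    Returns (s, z, p): the S statistic, the standardized statistic,
    and the two-sided p-value.
    """
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)      # sign of the pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided tail probability
    return s, z, p


# Hypothetical annual proportions (%) of one topic over ten years.
proportions = [1.8, 1.9, 2.1, 2.2, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2]
s, z, p = mann_kendall(proportions)
print(s, round(z, 3), p < 0.05)
```

A positive z with p below 0.05 indicates a significant increasing trend and a negative z a decreasing one. With only ten annual observations the normal approximation is coarse, so borderline results should be read with caution.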

Fig. 8 Left: estimated \(\alpha \) value for the models fitted using VEM. Right: perplexities of the test data for the models fitted by using Gibbs sampling. Each line corresponds to one of the folds in the tenfold cross-validation

Table 6 Top 20 most frequent terms
Table 7 Top 20 most frequent terms for the top 10 best matching topics

Fig. 9 Dendrogram of the term-level similarity clustering

Fig. 10 Dendrogram of the document-level similarity clustering

Fig. 11 Trends of the 40 research topics during 2008–2017 (x-coordinate as year, y-coordinate as proportion %)

4 Discussion

The scientific literature on text mining in medical research is an abundant and reliable data pool from which we can understand the major academic concerns of the research field and hence devise a proper development strategy. Based on the 4189 publications collected from the WoS and PubMed databases, the analysis focuses on literature characteristics, geographical publication distribution, collaboration relations, and research topics. The results present a comprehensive overview and an intellectual structure of the research, especially its research topics, from 2008 to 2017.

The rapid growth of relevant research publications reveals the vigorous development of text mining in medical research in recent years. The top 20 productive publication sources contribute 38.46% of the total publications, with Journal of Biomedical Informatics as the most productive one. The USA dominates the field, with far more publications than any other country. The majority of productive institutes and authors are from the USA. Collaboration degree analysis reveals that authors tend to collaborate more with those within the same institute or country.

A topic modeling-based bibliometric exploration of the global research trends in text mining in medical research is also conducted. The 40-topic model is applied to discover the latent thematic patterns in the corpus. In light of our prior knowledge of text mining in medical research, most topics identified using the LDA method are recognizable and easy to understand, as they relate to major issues in the research field. This exploration directly contributes to our understanding of the academic concerns of text mining in medical research over the past decade. We provide interpretations of the top 5 best matching topics as follows.

Topic 31 pertains to speech-related event mining, with “Speech” as its most frequent term. Terms like “Prosodic”, “Prosody”, “Listener”, “Consonant”, “Vowel”, “Sound”, “Phonologicaland”, and “Rhythm” are also included. Some researchers are concerned with the study of aphasia, e.g., speech segmentation in aphasia (Peñaloza et al. 2015); thus, terms such as “Aphasia” and “Aphasic” are also contained in Topic 31. Other study focuses include semantic processing in connected speech (Ahmed et al. 2013) and the development of automatic speech-recognition systems for spoken clinical questions (Liu et al. 2011).

Topic 22 contains terms like “Men”, “Sexual”, “Hiv”, “Sexuality”, “Hiv/aids”, “Gay”, “Lesbian”, and “Condom”, and thus clearly refers to sexual-related event mining. Although improvements in the medical management of HIV have reduced the rate of perinatal transmission from mothers to their children, youth continue to acquire HIV through risky behaviors such as unprotected sex and injection-drug use (Leonard et al. 2010). This has attracted widespread attention across society. Researchers in academia are also much concerned with sexual risk reduction through strengthened prevention efforts and clinical behavioral interventions.

Topic 5 centers on Alzheimer-related event mining and thus contains terms like “Memory”, “Schizophrenia”, “Dementia”, “Alzheimer”, “Short-term”, and “Spiritual”. As one of the leading causes of death and one of the most financially costly diseases, Alzheimer’s disease has long been a worldwide concern. It is estimated that by 2050 one new case of Alzheimer’s will develop every 33 seconds, resulting in nearly 1 million new cases per year (Alzheimer’s 2015). Many researchers study Alzheimer’s disease using text mining techniques (e.g., Pistono et al. 2016; Oscar et al. 2017).

Topic 21 contains words like “Child”, “Parent”, “Mother”, “Caregiver”, “Parental”, “Neonatal”, “Infant”, and “Parenting” and thus concerns Parenting for child and infant. Parenting, or child rearing, is the process of promoting and supporting the physical, emotional, social, and intellectual development of a child from infancy to adulthood (Footnote 5). Relevant research focuses on events such as parenting stress (Kantrowitz-Gordon et al. 2016) and parenting and disability (Fraser and Llewellyn 2015).

Topic 16 focuses on Nursing event mining, with terms like “Nursing”, “Nurse”, “Intimate”, “Delirium”, “Nurse-patient”, “Abuse”, and “Caregiver”. Relevant studies address nursing education (Shin et al. 2015), nursing practices (Fey and Jenkins 2015), professionalism and ethical dilemmas for nursing students (Rees et al. 2015; Kim et al. 2015), and mental health nursing (Mårtensson et al. 2014), among others.

The 40 identified topics are further clustered by term-level similarity and by document-level similarity to reveal latent relations among the topics and emerging interdisciplinary fields. As can be seen from Fig. 9, Topic 18 and Topic 33, as well as Topic 21 and Topic 38, have high term-level similarity and are far from the other topics. According to the topic interpretations, Topic 18 and Topic 33 are both concerned with reading and hearing issues, and both Topic 21 and Topic 38 contain “Child” as a high-frequency term. Other topics with lower term-level similarity are mapped in the middle of the dendrogram. For instance, both Topic 5 and Topic 7 discuss psychosis issues, and both Topic 1 and Topic 31 focus on speech issues. In summary, the dendrogram in Fig. 9 presents the term-usage similarity structure of the research topics.

Unlike term-level similarity clustering, document-level similarity clustering aims to describe the interaction structure of the research topics. As shown in Fig. 10, Topic 9 and Topic 16 have a high document-level similarity, meaning that publications with a high proportion of Topic 9 often also have a high proportion of Topic 16. The document-level measure of topic similarity serves the same purpose as interdisciplinary analysis (Lu and Wolfram 2012): if two topics frequently appear in the same publications, there is great potential to foster a novel interdisciplinary research field. Almost all topic pairs have either high term-level similarity but low document-level similarity, such as Topic 18 and Topic 33, or high document-level similarity but low term-level similarity, such as Topic 9 and Topic 16. The differences between term-level and document-level similarity also reflect the intellectual structure of the research field in the past decade.
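The contrast between the two measures can be illustrated with a minimal sketch on toy data. It assumes that term-level similarity is the cosine between two topics' rows of the topic-term matrix, and that document-level similarity is the cosine between two topics' columns of the document-topic proportion matrix. Cosine similarity is the measure used for clustering in this study, but the matrices, their values, and the function names below are hypothetical and serve only as an illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)

# Hypothetical topic-term matrix: rows are topics, columns are terms.
topic_term = [
    [0.70, 0.20, 0.05, 0.05],  # topic A
    [0.65, 0.25, 0.05, 0.05],  # topic B: vocabulary close to topic A
    [0.05, 0.05, 0.50, 0.40],  # topic C: different vocabulary
]

# Hypothetical document-topic matrix: rows are documents, columns are topics.
doc_topic = [
    [0.80, 0.05, 0.15],
    [0.70, 0.00, 0.30],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
]

def term_level_similarity(i, j):
    """Compare two topics by the terms they use (rows of topic_term)."""
    return cosine(topic_term[i], topic_term[j])

def document_level_similarity(i, j):
    """Compare two topics by the documents they co-occur in (columns of doc_topic)."""
    col_i = [row[i] for row in doc_topic]
    col_j = [row[j] for row in doc_topic]
    return cosine(col_i, col_j)

if __name__ == "__main__":
    for i, j, name in [(0, 1, "A-B"), (0, 2, "A-C")]:
        print(name,
              round(term_level_similarity(i, j), 3),
              round(document_level_similarity(i, j), 3))
```

In this toy data, topics A and B share vocabulary but rarely appear in the same documents, whereas topics A and C use different vocabulary but co-occur, mirroring the two patterns described for Topic 18 and Topic 33 and for Topic 9 and Topic 16.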

Topics with increasing and decreasing trends are also identified through a statistical test. We provide brief explanations for some of the emerging topics. Topic 2 discusses Organ transplantation; Topic 13 focuses on Labour-related events; Topic 15 addresses Drug regulatory-related events; Topic 22 focuses on Sexual event mining; Topic 25 addresses Smoking events; Topic 26 is about Aging events; Topic 27 centers on Heart disease; and Topic 40 relates to Depression events. As can be seen from Fig. 11, some topics, such as Topic 1, Topic 5, and Topic 9, fluctuate relatively sharply. Topic 8 and Topic 14 show an increasing trend before 2014 and a decreasing trend after 2014.
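The statistical test is not named in this section. As one common choice for detecting monotonic trends in short yearly series, the following minimal sketch assumes the Mann-Kendall test, without tie correction. The function name, the significance level, and the yearly topic proportions are hypothetical and are not taken from the corpus.

```python
import math

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def mann_kendall(series, alpha=0.05):
    """Mann-Kendall trend test (assumed here; no correction for ties).

    Returns (S, z, p, label), where label is 'increasing', 'decreasing',
    or 'no significant trend' at significance level alpha.
    """
    n = len(series)
    if n < 3:
        raise ValueError("need at least 3 observations")
    # S counts later-minus-earlier increases minus decreases over all pairs.
    s = sum(sign(series[j] - series[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity-corrected normal approximation.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    if p < alpha and s > 0:
        label = "increasing"
    elif p < alpha and s < 0:
        label = "decreasing"
    else:
        label = "no significant trend"
    return s, z, p, label

if __name__ == "__main__":
    # Hypothetical yearly proportions of one topic, 2008 to 2017.
    rising = [0.010, 0.012, 0.015, 0.017, 0.020,
              0.024, 0.027, 0.031, 0.034, 0.040]
    print(mann_kendall(rising))
```

A topic whose proportion rises every year is labeled increasing, while a sharply fluctuating series with no net direction yields a statistic near zero and is labeled as having no significant trend.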

We highlight the improvements of this study over existing bibliometric works. Our investigation found the following deficiencies in those works. First, most relevant studies used either WoS or PubMed as the publication retrieval database for medical-related topics (e.g., Khan et al. 2017; Nafade et al. 2018; Baker et al. 2018). However, because the two databases differ in coverage, using only one of them may yield incomplete results. Second, existing bibliometric studies on theme discovery seldom included terms from the title and abstract fields as analysis elements, which might lead to insufficient analysis. Last but not least, a few studies such as Yeung et al. (2017) did include key terms from the title and abstract fields, but treated them as equally important, whereas it is more reasonable to weight terms from different fields differently. Given these deficiencies, this study uses both WoS and PubMed as publication sources. We not only include key terms extracted from free text by a self-developed NLP module, but also assign experimentally determined weights to terms from different fields. We also employ various analysis techniques, such as geographic visualization, collaboration degree, social network analysis, and topic modeling, for a more comprehensive analysis.
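The idea of weighting terms by the field they come from can be sketched as follows. This is only an illustration: the field names, the weight values, and the function name are hypothetical, and the weights actually used in this study were determined by experiment and are not reproduced here.

```python
from collections import Counter

# Hypothetical field weights, chosen only for illustration.
FIELD_WEIGHTS = {"keywords": 1.0, "title": 0.6, "abstract": 0.4}

def weighted_term_counts(record, weights=FIELD_WEIGHTS):
    """Sum field-weighted occurrences of each term in one publication.

    record maps a field name to the list of key terms extracted from it.
    Terms from fields without a weight contribute nothing.
    """
    totals = Counter()
    for field, terms in record.items():
        w = weights.get(field, 0.0)
        for term in terms:
            totals[term] += w
    return totals

if __name__ == "__main__":
    # Hypothetical publication record.
    record = {
        "keywords": ["text mining"],
        "title": ["text mining", "aphasia"],
        "abstract": ["aphasia", "aphasia", "speech"],
    }
    for term, score in weighted_term_counts(record).most_common():
        print(term, round(score, 2))
```

With equal weights this reduces to plain term counting, which is the treatment in earlier studies that the weighting is meant to improve on.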

This study has some limitations. First, we treat journal and conference publications as equally important in the analysis, although the quality of a journal publication is generally higher than that of a conference publication. In the future, we will therefore seek a persuasive way to weight publications of different types. Second, the citation data available from WoS are not used in the analysis, since PubMed does not provide citation data as WoS does. Citation data are valuable for describing relations between scientific publications, so further investigation is required to take them into consideration, with an in-depth understanding of the citing rationale. Last but not least, the topic cluster analysis is based on cosine similarity, and the clustering results might be sensitive to the choice of similarity measure. In future work, we will therefore compare different similarity measures.

Notwithstanding its limitations, this study is the first to thoroughly assess the research output of text mining in the medical research field from a statistical perspective. The findings can potentially benefit relevant researchers, especially newcomers, in understanding the research performance and recent development of the field, optimizing research topic decisions, and monitoring new scientific or technological activities.

5 Conclusions

This study presents a bibliometric analysis of text mining in the medical research area during 2008–2017. Our work is the first in-depth study to track current advances in this research area from a quantitative perspective. The results show that the developed methods are general and can help researchers comprehensively understand the knowledge of a field hidden in a large amount of scientific literature. The rapid growth of scientific literature reflects the vigorous development of text mining in medical research in recent years. Collaboration degree analysis and social network analysis reveal the characteristics of scientific collaboration. Latent Dirichlet allocation provides a comprehensive overview of the intellectual structure of the research, especially its research topics. Clustering analysis and trend analysis further process the derived topics to give a more detailed structural overview of the field.

In future studies, we will employ the author-topic model, a probabilistic model that links authors to the observed words in the scientific literature of the research field. This will provide a general framework for exploration, discovery, and query-answering in the context of the relations between authors and topics.