1 Introduction

Emerging research topics are the new areas of science and technology (S&T) that attract strong attention from scientists. Mining these topics from the S&T literature is of great significance for scientific research and policy making [1]. With the rapid growth of the S&T literature, discovering implicit knowledge from it efficiently, accurately, and credibly has become a major challenge. As a result, Knowledge Discovery in Literature (KDiL) has become an important research area, and combining text mining with scientometrics methods is a promising direction for KDiL.

A subject-predicate-object (SPO) predication represents a semantic relationship among knowledge units. It consists of a subject argument (noun phrase), an object argument (noun phrase), and the relation that binds them (verb phrase) [2]. A set of SPO predications can be regarded as a kind of semantic network, which is widely used in KDiL because it reflects the research topics of the literature with semantic information and represents S&T information in more detail.

In this paper, we propose a percolation approach to discovering emerging research topics based on SPO predications, combining text mining and scientometrics methods. First, SPO predications are obtained from biomedical text using SemRep [2], a Unified Medical Language System (UMLS)-based information extraction tool, and Semantic MEDLINE [3], an SPO database generated by SemRep from PubMed. The subject and object arguments of each SPO predication are concepts from the UMLS Metathesaurus, and the predicate is a relation from the UMLS Semantic Network [2, 3]. For example, from the sentence “We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia”, SemRep extracts four predications: “Hemofiltration-TREATS-Patients, Digoxin overdose-PROCESS_OF-Patients, hyperkalemia-COMPLICATES-Digoxin overdose, Hemofiltration-TREATS (INFER)-Digoxin overdose” [2, 3].
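The four predications in the example above can be held in a small data structure. The following is a minimal sketch and not SemRep's output format; the class name `Predication` and its field names are assumptions introduced only for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Predication(NamedTuple):
    """One subject-predicate-object triple (hypothetical container)."""
    subject: str
    predicate: str
    obj: str
    inferred: bool = False  # True for predications marked (INFER)


# The four predications from the hemofiltration example sentence.
predications = [
    Predication("Hemofiltration", "TREATS", "Patients"),
    Predication("Digoxin overdose", "PROCESS_OF", "Patients"),
    Predication("hyperkalemia", "COMPLICATES", "Digoxin overdose"),
    Predication("Hemofiltration", "TREATS", "Digoxin overdose", inferred=True),
]

# Count how often each predicate occurs in this sentence.
predicate_counts = Counter(p.predicate for p in predications)
```

Keeping the inferred flag separate from the predicate name lets the direct and inferred TREATS relations be counted together or apart as needed.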

Then, community detection is conducted on the SPO semantic network, and each community of SPO predications is regarded as a research topic. Afterwards, two scientometrics indicators, RTA and RTAN, are combined by the hypervolume-based selection (HBS) algorithm to find potential emerging research topics among the communities. Finally, the S&T literature on stem cells is selected as a case study. The results indicate that the approach can discover emerging research topics effectively and accurately.

The rest of this paper is organized as follows. Section 2 briefly reviews previous work related to the discovery of emerging research topics. In Sect. 3, we present a percolation approach to discovering research topics based on SPO predications. We then conduct a case study in Sect. 4. The conclusion and a discussion of further research are given in the last section.

2 Literature Review

In this section, we review the literature on discovering emerging research topics. In general, the existing approaches fall into two groups: scientometrics methods and text mining methods.

The scientometrics methods usually rely on indicator analysis to discover emerging research topics based on citation or co-occurrence relationships. In [4], the authors proposed a multi-level structural variation approach, which is motivated by an explanatory and computational theory of transformative discovery. With the novel structural variation metrics derived from the theory, they integrated the theoretical framework with a visual analytic process, which enables an analyst to study the literature of a scientific field across multiple levels of aggregation and decomposition, including the field as a whole, specialties, topics and predicates.

In [5], based on the co-citation networks of the regenerative medicine literature built from a combined dataset of 71,393 relevant papers published between 2000 and 2014, the authors presented a snapshot of the fast-growing fields and identified emerging trends and new developments. The structural and temporal dynamics are identified in terms of the most active research topics and cited references. New developments are identified in terms of newly emerged clusters and research areas, while disciplinary-level patterns are visualized in dual-map overlays.

In [6], the authors proposed a method of discovering research fronts, which compares the structures of citation networks of scientific publications with those of patents by citation analysis and measures the similarity between sets of academic papers and sets of patents by natural language processing. In order to discover research fronts that do not correspond to any patents, they performed a comparative study to measure the semantic similarity between academic papers and patents. As a result, the cosine similarity of term frequency-inverse document frequency (TF-IDF) vectors was found to be a preferable way of discovering corresponding relationships.

The text mining methods usually conduct content analysis using domain ontologies, semantic networks, community detection, and similar techniques to mine emerging research topics from the contents of the literature. In [7], based on the Human Phenotype Ontology (HPO), the authors presented a method named RelativeBestPair to measure the similarity between query terms and hereditary diseases and to rank the candidate diseases. To evaluate its performance, they carried out experiments on a set of patients based on 44 complex diseases, adding noise and imprecision to approximate real clinical conditions. RelativeBestPair significantly outperformed all seven existing semantic similarity measures it was compared with on the simulated dataset with both noise and imprecision, which might be of great help in clinical settings.

In [8], the authors proposed a multi-phase gold standard annotation approach, which was used to annotate 500 sentences randomly selected from MEDLINE abstracts on a wide range of biomedical topics with 1371 semantic predications. According to the UMLS Metathesaurus for concepts and the UMLS Semantic Network for relations, they measured inter-annotator agreement and analyzed the annotations, so as to identify some of the challenges in annotating biomedical text with relations based on ontology or terminology.

In [9], based on a semi-supervised learning method named Positive-Unlabeled Learning (PU-Learning), the authors proposed a novel method to predict disease candidate genes from the human genome, which is an important part of current biomedical research. Since diseases with the same phenotype have similar biological characteristics and genes associated with the same diseases tend to share common functional properties, the proposed method detects disease candidate genes through gene expression profiles by learning hidden Markov models. The experiments were carried out on a mixed set of 398 disease genes from 3 disease types and 12001 unlabeled genes, and the results indicated a significant improvement in comparison with the other methods in the literature.

In [10], based on Formal Concept Analysis (FCA), the authors proposed a method named FCA-Map to incrementally generate five types of formal contexts and extract mappings from the derived lattices, which is used to identify and validate mappings across ontologies, including one-to-one mappings, complex mappings and correspondences between object properties. Compared with other FCA-based systems, their proposed method is more comprehensive as an attempt to push the envelope of the FCA formalism in ontology matching tasks. The experiments on large, real-world domain ontologies show promising results and reveal the power of FCA.

Both groups of methods face specific challenges. Scientometrics methods are mature, but they require complete citation networks or high co-occurrence. If the citation networks are incomplete or unavailable (for example, the citation networks between papers and patents are usually very weak), scientometrics methods cannot produce reasonable results. Text mining methods, which analyze fine-grained knowledge units such as keywords, SPO predications, and topics, do not need complete citation networks. However, the number of knowledge units is usually large, and it is hard to clean them and select the right ones without scientometrics methods.

3 Methodology

In our research, we propose a percolation approach to discovering emerging research topics based on SPO predications, which constructs a three-level SPO-based semantic network. First, we introduce the construction of the SPO-based semantic network. Then, we describe a percolation approach to detecting communities in the network. Afterwards, we apply the HBS algorithm to two indicators, RTA and RTAN, to identify potential emerging research topics among the communities.

3.1 SPO-Based Semantic Network Construction

After the required set of literature is obtained, SPO predications are extracted from its content by SemRep. These SPO predications then need to be cleaned by Term Clumping, which includes general cleaning and pruning processes [11]. General cleaning removes common academic or scientific subjects and objects such as “cells” and “organ.” Predicates such as “LOCATION_OF” and “PART_OF,” which reflect hierarchy or position relationships and carry no meaning for mining emerging research topics, are also removed. The pruning process continues the cleaning by discarding very low-frequency and meaningless subjects, predicates, objects, or SPO predications. After that, each document is represented as an exchangeable bag-of-SPO.
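The two cleaning stages can be sketched as simple filters over a list of triples. This is an illustrative sketch only: the stop lists, the `min_count` threshold, and the toy triples are assumptions, whereas the actual lists in the paper are curated by experts.

```python
from collections import Counter

# Hypothetical stop lists; the real ones are expert-curated.
GENERIC_ARGUMENTS = {"cells", "organ"}
HIERARCHY_PREDICATES = {"LOCATION_OF", "PART_OF"}


def general_cleaning(triples):
    """Drop triples with a generic argument or a hierarchy/position predicate."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if s.lower() not in GENERIC_ARGUMENTS
        and o.lower() not in GENERIC_ARGUMENTS
        and p not in HIERARCHY_PREDICATES
    ]


def prune(triples, min_count=2):
    """Keep only triples that occur at least min_count times."""
    counts = Counter(triples)
    return [t for t in triples if counts[t] >= min_count]


# Toy input, invented for illustration.
raw = [
    ("Stem cells", "AFFECTS", "Regeneration"),
    ("Stem cells", "AFFECTS", "Regeneration"),
    ("Cells", "PART_OF", "Organ"),
    ("Bone marrow", "LOCATION_OF", "Stem cells"),
    ("Gene X", "STIMULATES", "Protein Y"),
]
cleaned = prune(general_cleaning(raw), min_count=2)
bag_of_spo = Counter(cleaned)  # bag-of-SPO: triple -> frequency
```

A `Counter` is a natural fit for the bag-of-SPO representation because the order of triples is irrelevant and only their frequencies matter.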

Based on the four basic principles proposed by M. Fiszman et al., namely relevancy, connectivity, novelty and saliency, an SPO-based semantic network is constructed for community detection [12]. An example of an SPO-based semantic network, composed of thousands of nodes and edges, is illustrated in Fig. 1. In Fig. 1, vertices with different colors denote different SPO predications, and the size of a vertex denotes the frequency of the SPO predication. Many vertices act as both subjects and objects, which makes the whole network very complicated [13]. Therefore, it is very difficult for experts to recognize valuable topics directly from an SPO-based semantic network.

Fig. 1. An example of SPO-based semantic network

3.2 Community Detection

To find the communities in an SPO-based semantic network effectively, we propose a percolation approach to community detection, which employs the widely used modularity function defined as follows [14]:

$$ Q = \frac{1}{2m}\sum\nolimits_{vw} {\left[ {A_{vw} - \frac{{k_{v} k_{w} }}{2m}} \right]\delta (C_{v} ,C_{w} )} $$
(1)

Suppose that the vertices are divided into communities such that vertex \( v \) belongs to the community denoted by \( C_{v} \). In Formula 1, \( A \) is the adjacency matrix of the network \( G \): \( A_{vw} = 1 \) if vertex \( v \) is connected to vertex \( w \), and \( A_{vw} = 0 \) otherwise. The function \( \delta (i,j) \) is equal to 1 if \( i = j \) and 0 otherwise. The degree \( k_{v} \) of a vertex \( v \) is defined as \( k_{v} = \sum\nolimits_{w} {A_{vw} } \), and the number of edges in the network is \( m = \sum\nolimits_{vw} {A_{vw} /2} \).

Furthermore, the modularity function can be presented in a simple way, which is formulated below [14]:

$$ Q = \,\,\sum\nolimits_{i} {(e_{ii} - a_{i}^{2} )} $$
(2)

where \( i \) runs over all communities in the network, and \( e_{ij} \) and \( a_{i} \) are defined as follows [14]:

$$ e_{ij} = \frac{1}{2m}\sum\nolimits_{vw} {A_{vw} \delta \left( {C_{v} ,i} \right)\delta (C_{w} ,j)} $$
(3)

which is the fraction of edges that join vertices in community \( i \) to vertices in community \( j \), and

$$ a_{i} = \frac{1}{2m}\sum\nolimits_{v} {k_{v} \delta (C_{v} ,i)} $$
(4)

which is the fraction of the ends of edges that are attached to vertices in community \( i \).
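As a quick check that Formulas 1 and 2 agree, both can be evaluated on a small undirected, unweighted graph. The sketch below uses plain dictionaries; the function names and the toy graph (two triangles joined by one edge) are assumptions made for illustration.

```python
from itertools import product


def modularity_pairwise(adj, community):
    """Formula (1): sum over all ordered vertex pairs (v, w)."""
    nodes = list(adj)
    degree = {v: sum(adj[v].values()) for v in nodes}
    two_m = sum(degree.values())
    total = 0.0
    for v, w in product(nodes, repeat=2):
        if community[v] == community[w]:  # delta(C_v, C_w)
            total += adj[v].get(w, 0) - degree[v] * degree[w] / two_m
    return total / two_m


def modularity_by_community(adj, community):
    """Formula (2): sum over communities of e_ii - a_i^2."""
    degree = {v: sum(adj[v].values()) for v in adj}
    two_m = sum(degree.values())
    q = 0.0
    for c in set(community.values()):
        members = [v for v in adj if community[v] == c]
        e_cc = sum(adj[v].get(w, 0) for v in members for w in members) / two_m
        a_c = sum(degree[v] for v in members) / two_m
        q += e_cc - a_c ** 2
    return q


def undirected(edges):
    """Build a symmetric adjacency dictionary from an edge list."""
    adj = {}
    for v, w in edges:
        adj.setdefault(v, {})[w] = 1
        adj.setdefault(w, {})[v] = 1
    return adj


# Toy graph: two triangles joined by a single edge.
adj = undirected([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q1 = modularity_pairwise(adj, community)
q2 = modularity_by_community(adj, community)
```

For this graph, \( m = 7 \), each triangle has \( e_{ii} = 6/14 \) and \( a_{i} = 1/2 \), so both functions give \( Q = 5/14 \).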

Based on the modularity function optimization [15], the percolation approach is a heuristic method to extract the community structure of large networks, which is presented in Algorithm 1.

In this algorithm, we initialize the network as a directed weighted graph according to the SPO-based semantic relations. Then, we calculate the average weighted degree \( d \) of this graph. Afterwards, the weight of each edge is multiplied by a random number with probability \( 1 - 1/d \). Guided by the value of the modularity function, the local search procedure is executed until the modularity no longer improves. We then obtain the communities of the considered network. An example of the communities detected by the algorithm in a semantic network is illustrated in Fig. 2, in which the communities are shown in different colors.
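The weight perturbation step described above can be sketched as follows. This is not a reproduction of Algorithm 1: it covers only the perturbation, omits the modularity-driven local search, and rests on two assumptions that the text does not state, namely that \( d \) is the total edge weight divided by the number of vertices and that the random factor is drawn uniformly from \( [0, 1) \).

```python
import random


def average_weighted_degree(edges, n_vertices):
    """Assumed definition of d: total edge weight per vertex."""
    return sum(edges.values()) / n_vertices


def perturb_weights(edges, n_vertices, rng):
    """With probability 1 - 1/d, multiply an edge weight by a random factor.

    Assumption: the factor is uniform on [0, 1); the source text does not
    specify its distribution.
    """
    d = average_weighted_degree(edges, n_vertices)
    if d <= 0:
        return dict(edges)  # nothing to perturb in an empty graph
    p = max(0.0, 1.0 - 1.0 / d)
    perturbed = {}
    for edge, weight in edges.items():
        if rng.random() < p:
            perturbed[edge] = weight * rng.random()
        else:
            perturbed[edge] = weight
    return perturbed


# Toy directed weighted graph: (source, target) -> weight.
edges = {("A", "B"): 4.0, ("B", "C"): 2.0, ("C", "A"): 3.0, ("A", "D"): 3.0}
rng = random.Random(0)  # seeded for reproducibility
new_edges = perturb_weights(edges, n_vertices=4, rng=rng)
```

In this toy graph \( d = 3 \), so each edge is perturbed with probability \( 2/3 \). Passing the random generator in explicitly keeps runs reproducible, which helps when comparing the communities found in different runs.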

Fig. 2. An example of the communities in SPO-based semantic network

3.3 Hypervolume-Based Selection

After finding communities in the SPO-based semantic network, we select emerging research topics from these communities using two scientometrics indicators, RTA and RTAN, proposed in [16]. RTA refers to the time span of a research topic: the larger the RTA value, the wider the time span over which the topic is distributed. RTAN refers to academic attentiveness: the larger the RTAN value, the hotter the topic. Therefore, we prefer topics with smaller RTA values and larger RTAN values as candidates for emerging research topics. RTA and RTAN are defined by the formulas below:

$$ f_{1} = RTA\left( {topic_{i} } \right) = \sum\nolimits_{i = 1}^{n} {Y_{kw} \frac{{n_{i} }}{N}} $$
(5)

where \( n_{i} \) refers to the number of terms in \( topic_{i} \) within the time span, \( N \) refers to the total number of terms in all topics within the time span, and \( Y_{kw} \) refers to the age of each term.

$$ f_{2} = RTAN\left( {topic_{i} } \right) = \frac{{n_{i} }}{N} \times 100\% $$
(6)

where \( n_{i} \) refers to the number of authors in \( topic_{i} \) within the time span, and \( N \) refers to the total number of authors in all topics within the time span.

$$ Y_{kw} = \sum\nolimits_{i = 1}^{n} {\left( {Year_{cur} - Year_{i} } \right) \times tfidf_{i} /\left( {\sum\nolimits_{j = 1}^{n} {tfidf_{j} } } \right)} $$
(7)

where \( Year_{cur} \) refers to the last year of the time span, \( Year_{i} \) refers to the year within the time span for the \( i^{th} \) term, and \( tfidf_{i} \) refers to the TF-IDF value of the \( i^{th} \) term.
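A small numeric sketch may help in reading Formulas 6 and 7. It covers these two formulas only, takes each term as a (year, TF-IDF) pair, and uses invented input values; the function names are assumptions introduced for illustration.

```python
def term_age(terms, year_cur):
    """Formula (7): TF-IDF-weighted mean of (year_cur - year) over terms.

    terms: list of (year, tfidf) pairs, one pair per term.
    """
    total_tfidf = sum(tfidf for _, tfidf in terms)
    if total_tfidf == 0:
        raise ValueError("TF-IDF weights must not sum to zero")
    weighted = sum((year_cur - year) * tfidf for year, tfidf in terms)
    return weighted / total_tfidf


def rtan(authors_in_topic, authors_in_all_topics):
    """Formula (6): share of authors attached to the topic, in percent."""
    return authors_in_topic / authors_in_all_topics * 100.0


# Toy input, invented for illustration: three terms, time span ending in 2017.
terms = [(2015, 0.5), (2017, 0.25), (2013, 0.25)]
y_kw = term_age(terms, year_cur=2017)
attention = rtan(authors_in_topic=30, authors_in_all_topics=1200)
```

Here the ages are 2, 0 and 4 years with weights 0.5, 0.25 and 0.25, so \( Y_{kw} = 2 \); the topic holds 30 of 1200 authors, so RTAN is 2.5%.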

According to the two objective values of \( f_{1} \) and \( f_{2} \), we select topics among the communities with the HBS algorithm, which is presented in Algorithm 2 below [17].

In Algorithm 2, \( topic_{i} \) denotes the \( i^{th} \) topic in the semantic relation network. First, we calculate the two objective values of \( topic_{i} \). Then, we calculate the fitness value of \( topic_{i} \) with the \( HC \) indicator defined as follows:

$$ HC\left( x \right) = \left( {f_{1} \left( {y_{1} } \right) - f_{1} \left( x \right)} \right) \times (f_{2} \left( {y_{0} } \right) - f_{2} (x)) $$
(8)

As shown in Fig. 3, the fitness value of \( topic_{i} \), denoted by \( x \), corresponds to the size of the red area, where \( y_{0} \) and \( y_{1} \) are the topics that neighbour \( topic_{i} \) in the objective space. Thus, we can select a designated number of research topics with high fitness values.
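Formula 8 can be illustrated on a small two-objective front. The sketch below is not Algorithm 2 itself. It assumes that both objectives are in minimisation form (for example \( f_{2} = -RTAN \)), that the points are mutually nondominated, and that a reference point stands in for the missing neighbour at each end of the front; the reference point and the one-shot ranking are assumptions of this sketch.

```python
def hc_contributions(points, reference):
    """Formula (8) for each point of a two-objective nondominated front.

    points: list of (f1, f2), both minimised and mutually nondominated.
    reference: (r1, r2), worse than every point in both objectives; it
    replaces the missing neighbour at either end (assumption).
    """
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    contributions = {}
    for rank, i in enumerate(order):
        f1_x, f2_x = points[i]
        # y1 is the neighbour with the next larger f1 value.
        has_next = rank + 1 < len(order)
        f1_y1 = points[order[rank + 1]][0] if has_next else reference[0]
        # y0 is the neighbour with the next larger f2 value.
        f2_y0 = points[order[rank - 1]][1] if rank > 0 else reference[1]
        contributions[i] = (f1_y1 - f1_x) * (f2_y0 - f2_x)
    return contributions


def select_topics(points, reference, k):
    """Return the indices of the k points with the largest HC value."""
    hc = hc_contributions(points, reference)
    return sorted(hc, key=hc.get, reverse=True)[:k]


# Toy front: f1 = RTA, f2 = -RTAN (negated so that smaller is better).
front = [(1.0, -2.0), (2.0, -5.0), (4.0, -6.0)]
hc = hc_contributions(front, reference=(5.0, 0.0))
best = select_topics(front, reference=(5.0, 0.0), k=2)
```

Each HC value is the area of the rectangle dominated by that point alone, so the middle point, which is far from both neighbours, receives the largest value here (6, against 2 and 1).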

Fig. 3. An example of fitness value calculation

4 Case Study

Stem cells, a group of cells capable of self-renewal and multidirectional differentiation, are important research objects in the biomedical area. Owing to their value and development prospects in the treatment of diseases and in regenerative medicine, stem cells have drawn worldwide attention and become a hot spot of life science and medical research [18]. In this section, scientific papers on stem cells are selected as a case study to demonstrate the approach.

4.1 Data Information

First, we selected PubMed as the data source and used the following retrieval strategy: “(stem cells [MeSH Major Topic]) AND (“2008-01-01”[Date - Publication]: “2017-12-31”[Date - Publication])”, which returned 86,452 records. After excluding some non-technical papers such as “Clinical Trial” and “Dataset”, SPO predications were extracted from the title and abstract fields of each paper. Then, the general cleaning and pruning processes were applied to clean these SPO predications. After that, the SPO-based semantic network of stem cells was constructed according to the method described above.

4.2 Experimental Results

In this subsection, we present the experimental results. Some communities from the semantic network are illustrated in Fig. 4. In the figure, different communities are shown in different colors; each is composed of subjects, objects, and the corresponding predicates, and the size of a vertex is proportional to the frequency of the SPO predication. From Fig. 4, we can observe that predicates such as “AFFECT”, “STIMULATE”, and “COEXISTS_WITH” occur with higher frequency. Some communities with these three predicates are summarized in Table 3. The table does not list all communities found in the network; it shows parts of three communities selected by the HBS process, which experts consider to be emerging research topics in stem cell research.

Fig. 4. Some communities from SPO-based semantic network of stem cell

Table 1. The percolation algorithm to extract community structure
Table 2. The hypervolume-based selection algorithm
Table 3. Some emerging research topics in stem cell

5 Conclusion

In this work, we investigated a percolation approach to discovering emerging research topics from the SPO-based semantic network. We then performed experiments in the research area of stem cells. The results indicate that the approach can help discover emerging research topics in the considered area.

However, some challenges remain. First, the subjects, objects, predicates, and SPO predications extracted from the papers contain considerable noise, and the general cleaning and pruning processes depend heavily on expert opinion. Second, the SPO predications are extracted only from the title and abstract fields, which may not be sufficient. Third, the topics should be described in a more understandable way.

In the future, we intend to define more specific general cleaning and pruning rules so that data cleaning becomes more objective. In addition, we will use SemRep to extract SPO predications directly from the full texts of papers in order to obtain more SPO predications. Moreover, further research is ongoing on attaching understandable labels to topics, mining the linkages between topics, and discovering topic evolution.