Introduction

Exploring trends in scientific research is an area of interest for many researchers. Recently, many papers have been published in electronic form; thus, methods using text mining techniques have been increasingly applied to bibliometrics (Hawkins 2001). One of these text mining techniques is topic modeling, which can reduce the dimensionality of vast text data in an interpretable way and produce semantically coherent topics. There have been several attempts to utilize topic modeling to extract important research topics over time (Blei et al. 2003; Blei and Lafferty 2006; Wang and McCallum 2006; Hall et al. 2008; Mimno and McCallum 2012).

Topic modeling captures the probability of each topic’s occurrence. In time-variant topic modeling, looking at frequent topics by period is one of the best ways to see how a research field has evolved. However, the question is whether the topics with high probabilities of appearance are the most important topics. Some topics may appear frequently in papers with low citations, while others may appear rarely but in highly cited papers. Therefore, to identify important research topics, it is necessary to consider not only the probability of occurrence but also the number of citations of the papers in which topics appear. While existing time-variant topic models have not taken citations into consideration, topic modeling has been extended to bibliometric measures based on citation counts with topical information (Mann et al. 2006; Kim et al. 2018; Wang et al. 2013). However, these approaches are limited for the following two reasons. First, most previous studies are feasible for trend analysis of scientific publications that are cited reasonably often, but they do not consider time-variant factors (Kim et al. 2018; Mann et al. 2006; Wang et al. 2013). Second, the few prior studies that considered time-variant factors, such as Dirichlet multinomial regression (DMR) (Mimno and McCallum 2012) and the dynamic topic model (DTM) (Blei and Lafferty 2006), require extra steps to discretize continuous values like publication year and citation count to make the results of topic modeling meaningful. The results section confirms these problems.

To tackle the aforementioned issues, we propose a new generalized Dirichlet Multinomial Regression (g-DMR) topic model that replaces the linear inner product in topic priors with a general form based on a Topic Distribution Function (TDF). The proposed model enables continuous and multidimensional analysis by incorporating metadata into the topic model non-linearly as well as dynamically. We suggest the multidimensional Legendre polynomial as the TDF for research trend analysis. Using g-DMR with the Legendre polynomial, the publication year and the number of citations are cohesively combined into one topic model. To evaluate the utility of the proposed model, we applied it to two fields: Library and Information Science (LIS) and Text Mining.

To verify the soundness of the proposed approach, we evaluated the quality of the proposed approach in terms of topic coherence and distribution inference. The results of the experiments demonstrated the utility of the proposed approach through a 2 × 2 matrix of high vs. low impact and hot vs. cold topics. For those who are interested in applying the proposed approach to other research fields, we made the source code and the results of experiments publicly available at https://github.com/bab2min/g-dmr.

In the subsequent sections, we examine related work on models capturing topic trends and discuss their limitations. We then review the DMR topic model, generalize it by introducing the TDF, and suggest diverse versions of it. We verify that the proposed model works well through experimental results. Finally, we show the results of trend analysis on the fields of LIS and Text Mining using the proposed model.

Related works

Research trend analysis

Most previous studies on research trend analysis have adopted bibliometric analysis methods. Jabeen et al. (2015) analyzed evolution rates and trends in the field of LIS by applying bibliometric techniques to about 18,000 articles published between 2003 and 2012. Sethi and Panda (2012) gauged the growth of the field of LIS by a citation analysis of 1000 articles published from 2000 to 2010. Li et al. (2009) traced the global trend of stem cell research between 1990 and 2007 to understand the characteristics of stem cell research activities by bibliometric techniques. Tran et al. (2018) conducted trend analysis in child maltreatment on Web of Science data with the query term “child maltreatment.” By using VOSviewer for network analysis, they identified an increasing trend in child maltreatment research, mostly conducted in the United States. Lv et al. (2011) investigated the global scientific production trend of graphene research using bibliometric analysis and visualization analysis with data collected between 1991 and 2010 from the Web of Science.

Another major direction in research trend analysis relies on content analysis, such as topic modeling and keyword analysis. Song et al. (2014) used the PubMed Central full-text database to retrieve articles related to the field of bioinformatics from 2000 to 2011. By applying Latent Dirichlet Allocation (LDA) to three periods (2000–2003, 2004–2007, and 2008–2011), they analyzed topical trends in the field of bioinformatics. Timakum et al. (2018) traced changes in LIS knowledge trends over the past 20 years to grasp developments in LIS research at a fine-grained level. By employing network analysis and topic modeling, they analyzed full-text journal articles from six top-ranked library science journals. Zou (2018) applied LDA for trend analysis to explore the temporal popularity of certain drug safety topics over time using 4347 related articles from 2007 to 2016. In addition, several papers incorporated citation relations into topic models (Wang et al. 2013; Jo et al. 2011). Jo et al. (2011) proposed a hybrid topic modeling algorithm that relied on the temporal order of papers and the document language model to detect the formation of new topics. In doing so, they computed the strength of the relationship between two topics through the total number of shared citations. However, since they ignored the direct relationship of topic dependency, they could not distinguish differences in topic importance. To overcome this problem, Wang et al. (2013) proposed a probabilistic topic model for literature citation analysis that directly incorporated citation relations.

One recent study applied a deep learning algorithm to research trend prediction. Chen et al. (2018) proposed a new algorithm, the Correlated Neural Influence Model, to predict trending research topics at mutually influenced conferences by sequentially tracking the research topics of each individual conference and embedding all the research topics into a hidden space. In a different paper, the performance of journals over time in a given field was measured by incorporating journal ranking and publication year into the topic model (Song et al. 2017). These authors proposed a journal-time-topic model, an extension of Dirichlet multinomial regression, which they applied to the field of bioinformatics to understand journal contributions to topics in a field and shifts in topic trends.

However, these studies, including Wang et al. (2013), that apply topic modeling to research trend analysis have not yet incorporated both the number of citations and the publication year into the Dirichlet distribution.

Topic modeling

Blei and Lafferty (2006) suggested the Dynamic Topic Model (DTM), which generates the documents of each time-slice from a normal distribution across topics, with each time-slice’s distribution centroid drawn from the previous one. One of the key features of this model is that it not only generates the topic distribution per document for each period, but the word distribution per topic can also change for each time-slice. However, since the document set is divided into N subsets by publication date to form a Markov chain, the time stamps of documents are bound to be discretized. To solve this problem, the continuous DTM was introduced (Wang et al. 2012), which uses Brownian motion to model the dynamics of topics. This allows the model to accept continuous time points for documents and to handle many time points without the limitation of fixed granularity.

There are previous studies analyzing scholarly impact using topic models. Gerrish and Blei (2010) suggested the Document Influence Model (DIM) based on the DTM. They introduced an influence score for each article, which controls how much the words in the article affect the overall topic drift. Using these scores, they could measure the impact of articles on scholarly trends. Gerow et al. (2018) suggested binding authorship, affiliation, and publication venue to the DTM and developed an evaluation method assessing their scholarly influence. This research pursued a similar objective to the present study in that it links the topic model with scholarly impact, but it differs greatly from the proposed method in its algorithmic approach. While their work generated topics first and then measured scholarly impact based on those topics, in the present paper we attempt to discover topics from both time and scholarly impact. As shown in Sects. 4 and 5, generating topics from these two dimensions of metadata can yield better quality than using only one of them and can reliably visualize research trends.

The Topics over Time (ToT) model (Wang and McCallum 2006) assumes that a beta distribution related to each topic generates the time stamp of each document. Since the beta distribution is continuous, this model is characterized by its ability to deal with continuous time stamps without discretization. However, one of the limitations of ToT is that each topic’s distribution over time must follow a beta distribution. Some previous studies have analyzed research trends based on the ToT model. One of them, the Trend Detection Model (Kawamae and Higashinaka 2010), adopted a latent trend class variable for each document. The trend class is defined by topic co-occurrence patterns, and it affects the topic distribution and the time stamp of each document. They proved that introducing the trend class lowers the perplexity of the model and improves the accuracy of timestamp prediction.

LDA (Blei et al. 2003) can also be used to observe the change of topics over time. Hall et al. (2008) suggested a post hoc calculation of topic probabilities for each year. In general, the LDA model has no device for viewing the distribution of topics over time; however, they defined the empirical probability that a paper in a given year was about a certain topic, and the distributions of topics by year were calculated after applying LDA. Alternatively, the Dirichlet-Multinomial Regression model (DMR) (Mimno and McCallum 2012), an extension of LDA that includes metadata attached to documents, can be used to analyze research trends. In the DMR model, the Dirichlet parameter of each document depends on its author, date, or other metadata; that is, the model can reflect the metadata of documents in topic generation.

Some studies exploited citations rather than temporal information to analyze research trends. One of them, the citation influence model (CIM) (Dietz et al. 2007), introduced a Bernoulli distribution determining whether words are drawn from the topic distribution of the current document or from that of a cited document. This allows the topic distribution of a cited document to affect the citing document. They showed that the proposed model can generate a citation graph that includes the significance of influence between documents. There was also an attempt to detect emerging topics based on the CIM and DIM: Xu et al. (2019) introduced a framework for emerging topic detection, which extracts growth, coherence, and influence indicators using DIM and a novelty indicator using CIM. These indicators are fed to a multi-task least-squares support vector machine to predict emerging topics. They showed that emerging topics can be detected by combining these indicators.

Meanwhile, the Structural Topic Model (STM) (Roberts et al. 2014), which incorporates document metadata into topic prevalence and topic content, has been suggested. Since topic prevalence affects document-topic distributions and topic content affects topic-word distributions, STM provides a general method to analyze the impact of each metadata item on topics from the viewpoint of topic structure. STM can also accept continuous, binary, or categorical variables as metadata by exploiting a generalized linear model. Despite its flexibility, it has the limitation that the topic distributions are determined monotonically by a linear combination of the metadata variables and latent variables. The topic dynamics we aim to inspect may show more complicated patterns over time or scholarly influence than linearly increasing or decreasing patterns. Thus, we needed to develop a new topic model, one able to capture more dynamic, non-linear topic patterns over multi-variable metadata, to analyze research trends from the perspectives of both time and number of citations.

Methodology

DMR topic model

Before discussing the g-DMR, we first review the underlying statistical assumptions of the DMR model. Mimno and McCallum (2012) proposed a feature vector \({\lambda }_{t}\) for each topic t to generate topics that are sensitive to the metadata of documents. They expanded the generative process of a document as follows:

1. For each topic t:

   (a) Draw \({{\varvec{\lambda}}}_{t}\sim N\left(0, {\sigma }^{2}I\right)\).

   (b) Draw \({\boldsymbol{\Phi }}_{t}\sim Dir\left({\varvec{\beta}}\right)\).

2. For each document d:

   (a) For each topic t, let \({\boldsymbol{\alpha }}_{d,t}={{\varvec{x}}}_{d}\cdot {{\varvec{\lambda}}}_{{\varvec{t}}}\).

   (b) Draw \({\boldsymbol{\Theta }}_{d}\sim \mathrm{D}\mathrm{i}\mathrm{r}\left({\boldsymbol{\alpha }}_{d}\right)\).

   (c) For each word i:

      (i) Draw \({\mathrm{z}}_{i}\sim \mathrm{M}\mathrm{u}\mathrm{l}\mathrm{t}\left({\boldsymbol{\Theta }}_{d}\right)\),

      (ii) Draw \({\mathrm{w}}_{i}\sim \mathrm{M}\mathrm{u}\mathrm{l}\mathrm{t}\left({{\varvec{\phi}}}_{{z}_{i}}\right)\),

where \(\upsigma\) is the standard deviation of the prior on \({{\varvec{\lambda}}}_{t}\); \({\varvec{\beta}}\) is a concentration parameter in the form of a vector, which is generally considered symmetric, for the topic-word distribution, and \({\boldsymbol{\Phi }}_{t}\) is a topic-word distribution vector for topic t; \({{\varvec{x}}}_{d}\) is the metadata of document d, represented as a |F|-dimensional vector; F is the set of features; \({\boldsymbol{\alpha }}_{d}\) is a parameter vector, the collection of \({\boldsymbol{\alpha }}_{d,t}\), for the topic distribution derived from \({{\varvec{\lambda}}}_{t}\) and \({{\varvec{x}}}_{d}\); \({\boldsymbol{\Theta }}_{d}\) is a document-topic distribution vector for document d; and the other parameters are the same as denoted in the LDA model (Blei et al. 2003). Thus, this model can accept the metadata of documents and infer topic probabilities from the metadata.

$$\begin{array}{c}P\left(\mathbf{z}\right)={\left(\frac{\Gamma \left(T\alpha \right)}{{\Gamma \left(\alpha \right)}^{T}}\right)}^{D}\frac{\prod_{d,t}\Gamma \left({n}_{d,t}+\alpha \right)}{\prod_{d}\Gamma \left(\sum_{t}{n}_{d,t}+T\alpha \right)}.\end{array}$$

This likelihood of the topic assignments in the LDA expands as follows in the DMR:

$$\mathrm{P}\left(\mathbf{z}\right)=\int \mathrm{P}\left({\varvec{z}}|{\varvec{\lambda}}\right)\mathrm{P}\left({\varvec{\lambda}}\right) \mathrm{d}{\varvec{\lambda}},$$
$$\mathrm{P}\left({\varvec{\uplambda}}\right)=\prod_{t,k}\frac{1}{\sqrt{2\pi {\sigma }^{2}}}\mathrm{exp}\left(-\frac{{{\varvec{\lambda}}}_{t,k}^{2}}{2{\sigma }^{2}}\right),$$
$$\begin{array}{c}P\left(\mathbf{z}|{\varvec{\uplambda}}\right)=\prod_{d}\frac{\Gamma \left(\sum_{t}{\boldsymbol{\alpha }}_{d,t}\right)}{\Gamma \left(\sum_{t}{(n}_{d,t}+{\boldsymbol{\alpha }}_{d,t})\right)}\prod_{d,t}\frac{\Gamma \left({n}_{d,t}+{\boldsymbol{\alpha }}_{d,t}\right)}{\Gamma \left({\boldsymbol{\alpha }}_{d,t}\right)} .\end{array}$$

As this equation is intractable, Mimno and McCallum (2012) trained the model using collapsed Gibbs sampling and adjusted the parameter \({\varvec{\lambda}}\) using an L-BFGS optimizer.
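To make the collapsed likelihood concrete, the DMR equation for \(P(\mathbf{z}|{\varvec{\lambda}})\) above can be evaluated directly with log-gamma functions. The following is a minimal Python sketch, not the authors' C++ implementation; the function name `dmr_log_likelihood` and the list-of-lists layout for counts and parameters are our own illustrative choices:

```python
import math

def dmr_log_likelihood(n, alpha):
    """Collapsed log P(z | lambda) for DMR, following the equation above.

    n[d][t]     : count of words assigned to topic t in document d
    alpha[d][t] : Dirichlet parameter alpha_{d,t} derived from metadata
    """
    ll = 0.0
    for n_d, a_d in zip(n, alpha):
        # per-document normalizer: Gamma(sum_t alpha) / Gamma(sum_t (n + alpha))
        ll += math.lgamma(sum(a_d)) - math.lgamma(sum(a_d) + sum(n_d))
        # per-topic terms: Gamma(n + alpha) / Gamma(alpha)
        for n_dt, a_dt in zip(n_d, a_d):
            ll += math.lgamma(n_dt + a_dt) - math.lgamma(a_dt)
    return ll
```

As a sanity check, a single one-word document with two topics and symmetric \(\alpha = 1\) has likelihood 1/2 under this formula, i.e. a log-likelihood of \(-\ln 2\).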

This model provides a method to incorporate metadata into the topic model. However, it has the limitation that topic distributions for documents vary only monotonically with the metadata, because \({\boldsymbol{\alpha }}_{d,t}\) comes from a linear combination of \({{\varvec{x}}}_{d}\) and \({{\varvec{\lambda}}}_{t}\). This means that the probability of a topic consistently increases or decreases as each feature in the vector \({{\varvec{x}}}_{d}\) increases or decreases. Therefore, this model is not suitable for capturing dynamic changes in topic distributions.

Generalized DMR (g-DMR)

In the DMR topic model, \({\boldsymbol{\alpha }}_{d,t}\) is defined as \({{\varvec{x}}}_{d}\cdot {{\varvec{\lambda}}}_{t}\), which causes the limitation of monotonicity with respect to metadata. We propose the g-DMR topic model to overcome this limitation, replacing \({\boldsymbol{\alpha }}_{d,t}\) with a more general function:

$${\boldsymbol{\alpha }}_{d,t}=\mathrm{exp}\left({f}_{t}\left({{\varvec{x}}}_{d}\right)\right)+\upvarepsilon ,$$
(1)

Here, \({f}_{t}\) receives the metadata vector of a document and returns the weight of topic t for the document. Theoretically, the value of the exp function cannot be zero, but it often becomes nearly zero for some \({f}_{t}\) functions due to the limits of computational precision. Thus, we add a non-zero \(\upvarepsilon\), a smoothing parameter that prevents \({\boldsymbol{\alpha }}_{d,t}\) from getting too close to zero. We call the various \({f}_{t}\) that can be used, depending on the purpose of the analysis, Topic Distribution Functions (TDFs). Some useful TDFs are shown in Table 1, and the DMR model can be seen as the special case of g-DMR with a linear TDF.

Table 1 Examples of topic distribution functions

The constant TDF is the simplest form of TDF, with the topic weight not affected by the metadata of the document at all; in this case, the model is identical to LDA. The linear TDF has the same form as in the DMR model, where the topic distribution of a document can vary with the metadata only monotonically.

There are many methods, such as simple polynomial approximation, Fourier approximation, Hermite, Legendre polynomial approximation, and so on, used to approximate unknown continuous functions. Since a TDF is unknown until the end of training, these methods can be used as the basis of TDFs.

We initially tried simple polynomials as the basis of the topic distribution function, but training did not converge properly because of their instability. In more detail, the higher-order terms approach zero inside the domain (− 1, 1) and diverge outside it. Consequently, when simple polynomials were set as the basis, the coefficients of the higher-order terms tended to become much larger or smaller than those of the lower-order terms during the L-BFGS process, and optimization of the model failed.

On the other hand, in the case of the Legendre polynomials, their values are bounded by [− 1, 1] within their domain, and all of their zeros lie inside it. These features help the L-BFGS optimization converge better, even when the Legendre polynomials have the same order as the simple polynomials. The shifted Legendre polynomial has a further advantage: since it has only integer coefficients and needs no weight function, it has a lower computational cost than other polynomial approximation methods such as Hermite polynomial approximation. Due to these strengths, the Legendre polynomial and its variation, the shifted Legendre polynomial (SLP), are widely used to approximate complex equations in physics and engineering (Andrews and Andrews 1992).

Thus, we adopt the SLP as the basis for the TDF. The SLP of order i is defined as follows:

$${L}_{i}\left(x\right)=\sum_{k=0}^{i}{\left(-1\right)}^{i+k}\left(\genfrac{}{}{0pt}{}{i}{k}\right)\left(\genfrac{}{}{0pt}{}{i+k}{k}\right){x}^{k}.$$
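As an illustration, this closed form can be evaluated with integer binomial coefficients alone. The following is a small Python sketch under our own naming (`shifted_legendre` is hypothetical, not a function from the authors' code):

```python
from math import comb

def shifted_legendre(i, x):
    """Shifted Legendre polynomial L_i(x) on [0, 1], via the closed form
    L_i(x) = sum_k (-1)^(i+k) * C(i, k) * C(i+k, k) * x^k."""
    return sum((-1) ** (i + k) * comb(i, k) * comb(i + k, k) * x ** k
               for k in range(i + 1))
```

For example, \(L_0(x)=1\), \(L_1(x)=2x-1\), and \(L_2(x)=6x^2-6x+1\), so \(L_1(0.5)=0\) and \(L_2(0.5)=-0.5\); every order stays within [− 1, 1] on [0, 1].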

The Legendre TDF in Table 1, which has only one variable, is the simplest form of TDF using the SLP approximation. It can be extended to the 2d Legendre TDF, which takes two variables, and to more variables, although this is not shown in the table. The orders of the SLP (I, J in the table) are critical to the performance of the TDF. A higher order can produce a more complex TDF and approximate the data accurately, but at a higher computational cost. A lower-order TDF approximation can be computed quickly, but its results may be less accurate. In addition, for the 2d Legendre TDF, we draw \({\lambda }_{t}\) from different distributions for i = 0, j = 0 and for the other terms, as follows:

$$\begin{aligned}&{{\varvec{\lambda}}}_{t,\mathrm{0,0}}\sim N\left(0, {\sigma }_{0}^{2}I\right)\\&{{\varvec{\lambda}}}_{t,i,j}\sim N\left(0, {\sigma }^{2}I\right)\quad \mathrm{for}\;\left(i,j\right)\ne \left(\mathrm{0,0}\right),\end{aligned}$$

where \({\sigma }_{0}> \upsigma\). Therefore, the intercept term can have a larger variance, and the coefficients of higher-order terms may have values closer to zero. Usually, the absolute value of the intercept term is larger than the others, and this helps the TDF converge easily.

Since the SLP has the domain [0, 1], the metadata of all documents should be normalized into [0, 1] as follows:

$${x}_{d}^{^{\prime}}=\frac{{x}_{d}-\underset{{d}^{^{\prime}}}{\mathrm{min}}{x}_{{d}^{^{\prime}}}}{\underset{{d}^{^{\prime}}}{\mathrm{max}}{x}_{{d}^{^{\prime}}}-\underset{{d}^{^{\prime}}}{\mathrm{min}}{x}_{{d}^{^{\prime}}}}.$$
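This min-max normalization can be sketched in a few lines of Python (the helper name `normalize_metadata` is our own illustrative choice):

```python
def normalize_metadata(values):
    """Min-max normalize document metadata (e.g. publication years)
    into [0, 1], as required by the SLP domain."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```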

Trend analysis using g-DMR

Once \(\lambda\) is inferred through the model fitting process, the topic probability distribution at a specific point can be calculated by the following equation:

$$P\left(t|{\varvec{x}}\right)=\frac{\mathrm{exp}\left({f}_{t}\left({\varvec{x}}\right)\right)+\varepsilon }{{\sum }_{{t}^{\prime}}\left(\mathrm{exp}\left({f}_{{t}^{\prime}}\left({\varvec{x}}\right)\right)+\varepsilon \right)},$$
(2)

where t is a topic and \({\varvec{x}}\) is a metadata value. If \({\varvec{x}}\) is constructed from the axes of publication year and number of citations, we can infer the probability of each topic at a specific point. Based on this, the research trend of the topic can be examined.
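Equation (2) amounts to a softmax-like normalization of the TDF values over topics. The following is a minimal Python sketch; `topic_distribution` and its input layout are our own illustrative choices:

```python
import math

def topic_distribution(f_values, eps=1e-10):
    """P(t | x) following Eq. (2): each topic weight is exp(f_t(x)) + eps,
    normalized over all topics. f_values[t] holds the TDF value f_t(x)
    at one metadata point x (e.g. a year/citation-rank pair)."""
    weights = [math.exp(f) + eps for f in f_values]
    total = sum(weights)
    return [w / total for w in weights]
```

Equal TDF values give a uniform distribution; a larger \(f_t(\mathbf{x})\) raises that topic's probability at \(\mathbf{x}\).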

Figure 1 shows the process of trend analysis proposed in this paper. The first step is collecting all the academic papers in the fields of ‘Text Mining’ and ‘LIS’. For LIS, we downloaded all available articles in the field of Library and Information Science provided in Scopus. For Text Mining, since there was no such explicit field in Scopus, we collected all articles that contain the keyword ‘text mining’. The collected datasets include the abstract, year of publication, and number of citations. The year of publication and the number of citations can be used to analyze which topics were hot or cold in the field by time-series analysis.

Fig. 1
figure 1

Process of trend analysis using g-DMR

Table 2 shows the details of the data collection for experiments. We downloaded two sets of scholarly publications from Scopus. One dataset consists of publications in the field of Library and Information Science (LIS) and the other dataset consists of publications in the field of Text Mining.

Table 2 Description of datasets

The following pre-processing steps were applied: (1) stemming the documents, (2) removing stopwords, and (3) refining features. We selected the Porter stemmer rather than lemmatization because the text we used is the abstract section of publications, which is typically short; lemmatizing such text yields a sparser representation than stemming does. In addition, stemming maps similar words with different POS tags (e.g. lemmatization, lemmatize, lemmatizing) to the same stem, so they belong to the same topics. The distribution of the number of citations is a typical long-tail distribution, consisting of a small number of highly cited papers and numerous rarely cited ones. Since citation counts were highly skewed, we used each paper's relative rank within its publication year rather than its raw citation count. The relative rank is a value between 0 and 1: the most cited articles in a year are given a value of 1, the least cited articles a value of 0, and a paper with the median number of citations a value of 0.5.
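The relative citation rank described above can be sketched as follows. This is our own minimal Python version; the tie-breaking by list order is an assumption, since the text does not specify how ties are handled:

```python
def relative_rank(citations):
    """Rank-normalize citation counts within one publication year:
    the least cited paper maps to 0, the most cited to 1, and the
    median to 0.5 (assumption: ties are broken by list order)."""
    n = len(citations)
    order = sorted(range(n), key=lambda i: citations[i])
    ranks = [0.0] * n
    for pos, i in enumerate(order):
        ranks[i] = pos / (n - 1) if n > 1 else 0.5
    return ranks
```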

These preprocessed documents go through the g-DMR topic model process, resulting in three types of output: the topic–word distribution, the document–topic distribution, and the \(\lambda\) parameters of each topic. The topic–word distribution is used to represent the semantic meaning of each topic. To calculate the topic distribution for a specific year and number of citations, the \({\varvec{\lambda}}\) parameters can be used; this allows us to draw a topic distribution map. The document–topic distribution is not used for trend analysis, but it can be used to analyze documents semantically.

Evaluation

We performed experiments on two datasets, academic papers in the fields of LIS and Text Mining, to evaluate the new model. For the experiments, we implemented the g-DMR code in the C++ programming language based on PCGS (Yan et al. 2009), with multithreading to handle a large collection of text efficiently. In addition, OpenCL was used to accelerate the Gibbs sampling process and the L-BFGS optimization on a GPU. L-BFGS was used to estimate the parameters.

To run the g-DMR topic model with the publication year x and the relative citation rank y as metadata, the following 2d Legendre TDF was used:

$${f}_{t}\left(x,y\right)={\sum }_{i=0}^{I}{\sum }_{j=0}^{J}{\lambda }_{t,i,j}{L}_{i}\left(x\right){L}_{j}\left(y\right).$$
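This 2d TDF can be evaluated by combining the SLP closed form with the \(\lambda_{t,i,j}\) coefficients. Below is a self-contained Python sketch under our own naming, where `lam[i][j]` holds \(\lambda_{t,i,j}\) for a single topic t:

```python
from math import comb

def shifted_legendre(i, x):
    # closed-form SLP: L_i(x) = sum_k (-1)^(i+k) C(i,k) C(i+k,k) x^k
    return sum((-1) ** (i + k) * comb(i, k) * comb(i + k, k) * x ** k
               for k in range(i + 1))

def tdf_2d(lam, x, y):
    """2d Legendre TDF f_t(x, y) = sum_{i,j} lambda_{t,i,j} L_i(x) L_j(y).
    x is the normalized publication year and y the relative citation
    rank, both in [0, 1]."""
    return sum(lam[i][j] * shifted_legendre(i, x) * shifted_legendre(j, y)
               for i in range(len(lam)) for j in range(len(lam[0])))
```

A coefficient matrix with only \(\lambda_{t,0,0}\) non-zero gives a constant TDF, recovering the LDA special case mentioned earlier.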

It is important to select appropriate hyper-parameters for good model performance. Normally, \(\upbeta\) is set to a value less than 1, which makes the topic-word distribution sparse. It is possible to infer an optimal \(\upbeta\) (Blei et al. 2003), but \(\upbeta\) is usually set between 0.01 and 0.1. We set \(\upbeta =0.01\) for both models, the same as in DMR (Mimno and McCallum 2012). The parameters \(\upsigma\) and \({\sigma }_{0}\) control how much the topic distribution varies with the metadata. Since smaller \(\upsigma\) and \({\sigma }_{0}\) make the overall distribution more uniform across the metadata, values that are too small make it hard to differentiate topics. On the other hand, values that are too large disturb the convergence of the entire model, because they maximize the distribution differences and make the Gibbs sampling process unstable. When setting \(\upsigma >3\) for DMR and \({\sigma }_{0}>3\) for g-DMR, both models often failed to converge on our data. Thus, we set \(\upsigma =3\) for DMR and \({\sigma }_{0}=3\) for g-DMR. In g-DMR, \(\upsigma\) determines the distribution of the coefficients of the higher-order terms, so its value should be less than \({\sigma }_{0}\) to help the entire model converge. We tried various values and found 0.25 to be appropriate for \(\upsigma\). Finally, the smoothing parameter \(\upvarepsilon\) does not play an important role in the convergence of the model; it only prevents the exponential term in Eq. (1) from getting too close to zero, and we used a very small value of \({10}^{-10}\) for g-DMR only.

Topics of LIS and text mining extracted by g-DMR

We trained each model for 800 iterations, with the initial 200 iterations as the burn-in period, during which optimization of the parameters \({\varvec{\lambda}}\) was disabled. After the burn-in, we optimized the parameters every 20 iterations. For each optimization, all \({\varvec{\lambda}}\) parameters were drawn from their prior distribution and the optimal values were found using L-BFGS; this was repeated five times independently, and the best \({\varvec{\lambda}}\) parameters were selected. The evaluation was performed with 10-fold cross-validation.

Since the values of I and J can change the overall performance of g-DMR, we conducted a preliminary perplexity test for each combination of I and J at K = 40. Table 3 shows the perplexities for I = 0 to 5 and J = 0 to 4, with the top five values shown in bold. For both datasets, I = 4 and J = 3 gave the best performance, so we chose these values for the following experiments. Interestingly, contrary to our expectation, higher I and J values do not always produce better perplexity. This seems to be because high I and J produce a more complicated parameter space, which makes it difficult for L-BFGS to find the global minimum.

Table 3 Perplexities by I and J for each dataset in K = 40

Tables 4 and 5 show the results of g-DMR with the number of topics set to 30 for Text Mining and 40 for LIS. In particular, as shown in Table 4, the g-DMR results for the field of text mining present various core subjects of text mining, including text classification, document clustering, and semantics and ontology, as important topics. In addition, the results show that the most dominant application areas are related to the biomedical domain, including bioinformatics, health informatics, and drug discovery (“Appendix 1”).

Table 4 Topic–word results for Top 5 topics from text mining dataset at K = 30
Table 5 Topic–word results for Top 5 topics from LIS dataset at K = 40

As shown in Table 5, the salient topics of LIS include bibliometrics, information management and information theory. In addition, research domains that are highly interactive with LIS are information system, cheminformatics, health communication, bioinformatics, and business informatics (“Appendix 2”).

Evaluation of the quality of citation impact-sensitive topic trends

In this sub-section, we report on the quality of citation impact-sensitive topic trends. To this end, we measure the inference performance of g-DMR in grasping citation-sensitive topic trends in LIS as well as Text Mining by comparing the inferred trends with the actual research trends. We also compared the inference performance of g-DMR with DMR’s. However, for the DMR model to perform topic modeling in consideration of year and citation count, both the publication year and the number of citations have to be discretized. This is a major limitation of DMR with regard to citation impact-sensitive topic trend analysis. For the reported results, we discretized year and citation count into three, six, and nine bins. For example, three partitions consist of the bottom third, the top third, and the rest. For DMR, these discretized years and citation counts are combined by a Cartesian product and finally used as metadata.
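The discretization and Cartesian-product step for the DMR baseline can be sketched as follows. This is our own minimal Python version; the equal-frequency binning and the label format are assumptions consistent with the bottom-third/top-third example above:

```python
def discretize(values, n_bins):
    """Equal-frequency binning into labels 0 .. n_bins-1 (assumption:
    bins by rank, e.g. bottom third / middle / top third for n_bins=3)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for pos, i in enumerate(order):
        labels[i] = min(pos * n_bins // len(values), n_bins - 1)
    return labels

def dmr_metadata(years, citations, n_bins):
    """Cartesian product of discretized year and citation bins,
    yielding one categorical DMR feature per document."""
    yb, cb = discretize(years, n_bins), discretize(citations, n_bins)
    return [f"y{a}_c{b}" for a, b in zip(yb, cb)]
```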

As will be described in the subsequent section, higher K values show better perplexity, but the results are more difficult to interpret. Therefore, choosing an appropriate K value is critical to the overall interpretation of results. In this evaluation, K was set to 30 for Text Mining and 40 for LIS, based on the elbow point where the decrease in perplexity with increasing K begins to level off.

We chose four topics that are not too broad or general, since a broad or general topic tends to appear in all areas, which makes the estimated topic distributions indistinguishable among topics. On the other hand, topics that are narrow or specific to a certain area have a complicated distribution, which is suitable for judging the quality of the estimated topic distribution. Thus, we chose “Topic Models” and “Sentiment Analysis” from Text Mining and “Bibliometrics” and “Cheminformatics” from LIS to evaluate the performance of citation impact-sensitive topic trend analysis. As shown in Fig. 2, these topics are not common and are concentrated at certain times, making them suitable for measuring the quality of an estimated topic distribution. To obtain the real distribution of each topic, we selected relevant documents using abstracts and keywords. For the DMR model, we chose the topics equivalent to those selected from g-DMR. The major filtering keywords of each topic, displayed in Fig. 2, were used to obtain the real distributions.

Fig. 2 Result comparison of text mining and LIS between g-DMR and DMR

Since each distribution is not a probability distribution, we cannot compare these distributions using statistical distances such as the Kullback–Leibler divergence. Instead, we define the root of the integrated squared error between the log likelihoods of the two distributions as a distance function:

$$\mathrm{dist}\left(P,Q\right)= \sqrt{\int {\left(\log P\left(x\right)- \log Q(x)\right)}^{2}dx} ,$$

where \(P\) and \(Q\) are distributions with all values greater than 0, and \(\mathrm{dist}\left(P,Q\right)=0\) if and only if \(P=Q\).

Table 6 shows the distances of the g-DMR and DMR distributions from the real distribution for the chosen topics; the smallest distance is shown in bold. Since P(x) can be 0 in the distribution of real documents, in which case dist(P, Q) is undefined, the real-document distribution was smoothed by adding a small value (1e-2) over the whole range. As a baseline, we set a uniform distribution with a constant likelihood regardless of year or the number of citations. Since inference with a constant TDF reduces to the LDA model, we can expect LDA to show results similar to this baseline.
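Numerically, this distance can be approximated on a grid; a minimal sketch, assuming the domain is normalized to [0, 1] and using the 1e-2 smoothing described above (the function name is ours):

```python
import numpy as np

def log_dist(p, q, eps=1e-2):
    """Approximate dist(P, Q) = sqrt(integral of (log P - log Q)^2 dx)
    on an evenly spaced grid. eps mirrors the smoothing constant added
    so that the logarithm is defined where a distribution is zero."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    dx = 1.0 / len(p)  # grid spacing on a domain normalized to [0, 1]
    return float(np.sqrt(np.sum((np.log(p) - np.log(q)) ** 2) * dx))

# dist(P, P) = 0; any non-identical distribution has positive distance
p = np.array([0.0, 0.2, 0.5, 0.3])
uniform = np.full(4, 0.25)
```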

Table 6 Distribution distances of g-DMR and DMR from the real distribution

As shown in Table 6, the inference performance of DMR varies depending on the number of steps. This implies that DMR is inadequate for citation impact-sensitive topic trend analysis, in that the number of steps had to be determined empirically.

Topic distributions can be calculated following Eq. (2) from the λ parameters generated by the g-DMR model. These topic distributions capture the locations where documents of a specific topic are concentrated.
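The Legendre TDF underlying these distributions can be evaluated directly with NumPy's Legendre utilities; a sketch, assuming λ is stored as an (I × J) coefficient matrix per topic and metadata are rescaled to [−1, 1] (the function names are ours, and details of Eq. (2) may differ in the actual implementation):

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_tdf(lmbda, x, y):
    """Evaluate f(x, y) = sum_ij lambda_ij * P_i(x) * P_j(y), the 2-D
    Legendre topic distribution function for one topic. x is the rescaled
    publication year and y the rescaled citation weight, both in [-1, 1]."""
    return legendre.legval2d(x, y, lmbda)

def topic_priors(lmbdas, x, y):
    """Per-topic Dirichlet parameters alpha_k proportional to exp(f_k(x, y))."""
    return np.exp([legendre_tdf(l, x, y) for l in lmbdas])

# P_0 = 1, so a single coefficient gives a constant function
const = legendre_tdf(np.array([[2.0]]), 0.3, -0.7)
# A coefficient on P_1(x) * P_0(y) gives f(x, y) = x
linear = legendre_tdf(np.array([[0.0, 0.0], [1.0, 0.0]]), 0.5, -0.2)
```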

The distribution of documents is discrete by nature and cannot be compared with a continuous distribution directly. Thus, to compare the inferred distributions with the real ones, we measure the proportion of related documents in bins divided by a grid. The X axis, publication year, was divided by year; the Y axis, the normalized number of citations, was divided into 3, 6, or 9 bins. For example, the Text Mining collection was divided into 162 bins and the LIS collection into 189 bins.
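The binning step can be sketched with a 2-D histogram; a minimal sketch, assuming citation weights are already normalized to [0, 1] (the helper name is ours):

```python
import numpy as np

def binned_proportions(years, citations, year_edges, n_cit_bins):
    """Place documents on a (year x citation) grid and return the
    proportion of documents falling into each bin, so that the discrete
    document distribution can be compared with a continuous inferred one."""
    cit_edges = np.linspace(0.0, 1.0, n_cit_bins + 1)
    counts, _, _ = np.histogram2d(years, citations, bins=[year_edges, cit_edges])
    return counts / counts.sum()

# Three documents on a 2-year x 3-citation-bin grid (6 bins in total)
props = binned_proportions(
    years=[2000, 2000, 2001],
    citations=[0.1, 0.9, 0.5],
    year_edges=[2000, 2001, 2002],
    n_cit_bins=3,
)
```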

Figure 2 shows that the distributions inferred from the λ of g-DMR approximate the real ones quite well. The red parts of the distributions represent the areas where many documents were published, and these areas are well matched between the left and right panels. Also, in all cases, the distance of g-DMR was smaller than that of DMR.

On the other hand, the DMR model in the 3-step, 6-step, and 9-step settings showed higher distances than the baseline and did not learn the real distributions well. Moreover, the finer the grid, the larger the distance. In DMR models, a finer grid means more parameters to learn but fewer documents per bin, which is why performance seems to worsen in the finer models.

Minor differences between the g-DMR and real distributions can be caused by the following factors: (1) a lack of documents at some time points and (2) a gap between the author’s intention and the topic estimated from the word distribution. If these factors are minimized effectively, the distribution inferred from λ can be used to observe the trend of each topic.

Evaluation of topic quality by g-DMR

Generally, the results of topic models are evaluated based on the likelihood of held-out documents or perplexity. A higher likelihood or lower perplexity means that the model predicts unseen documents more accurately. Thus, likelihood and perplexity can be used to compare the overall performance of models. However, as Chang et al. asserted, lower perplexity does not always mean better results for human interpretation and sometimes yields results that are difficult to interpret (Chang et al. 2009). Therefore, it is necessary to evaluate topic models using other measures in addition to perplexity. Thus, we also applied a topic coherence evaluation based on Wikipedia (Newman et al. 2010) alongside the perplexity evaluation.
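For reference, perplexity is derived from the held-out log likelihood; a minimal sketch (the function name is ours):

```python
import math

def perplexity(total_log_likelihood, n_tokens):
    """Perplexity = exp(-LL / N): the exponentiated negative average
    per-token log likelihood of held-out documents. Lower is better."""
    return math.exp(-total_log_likelihood / n_tokens)

# A model assigning each of 1,000 held-out tokens probability 1/50
# has perplexity 50.
ll = 1000 * math.log(1 / 50)
```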

Figure 3 shows the log perplexities of g-DMR and DMR for the number of topics ranging from K = 10 to 100 on each dataset. At every point, g-DMR shows better perplexity than DMR. This means that g-DMR achieves a better fit to the given data than DMR does, as g-DMR can capture topic distributions beyond monotonic ones. As mentioned above, we also conducted the following topic coherence evaluation using external data.

Fig. 3 Log perplexity of g-DMR and DMR from K = 10 to 100 (lower is better)

Higher topic coherence is a signal that a topic is easier for humans to interpret (Newman et al. 2010). If a topic model simply produces low perplexity but its results are difficult for humans to interpret, it would not be a suitable model for analysis. Thus, we should consider both perplexity and topic coherence when comparing the models’ performance. Average topic coherence scores for the top 5 words in each topic from g-DMR and DMR are shown in Table 7, with the higher score of the two shown in bold. Since Gibbs sampling is a probabilistic process, different results can be produced at run time, so one model may outperform the other by luck. To minimize this kind of error, Welch’s t-test was performed on 8 of the 10 samples obtained for each model, excluding the 2 extremes. The result of the statistical test is provided in “Appendix 3”. Overall, there was no statistically significant difference between g-DMR and DMR. This means that g-DMR is equivalent to DMR from the perspective of generating coherent topics.
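This kind of comparison can be reproduced with SciPy's implementation of Welch's t-test; a sketch with made-up coherence scores (the sample values are illustrative, not the paper's):

```python
import numpy as np
from scipy import stats

# Hypothetical coherence scores from repeated runs of each model
# (8 samples each, after trimming the 2 extremes from 10 runs)
gdmr = np.array([0.41, 0.43, 0.40, 0.42, 0.44, 0.41, 0.43, 0.42])
dmr = np.array([0.42, 0.40, 0.43, 0.41, 0.42, 0.44, 0.40, 0.42])

# equal_var=False selects Welch's t-test, which does not assume
# equal variances between the two samples
t_stat, p_value = stats.ttest_ind(gdmr, dmr, equal_var=False)
significant = p_value < 0.05
```

With samples this close, the test fails to reject the null hypothesis, matching the paper's finding of no significant difference.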

Table 7 Average topic coherence of g-DMR and DMR (higher is better)

Taking all the above into consideration, it can be concluded that g-DMR generates topics of the same quality as DMR while fitting the given data better. In addition, another strength of the g-DMR model is that it can estimate the distribution of topics according to the year of publication or the number of citations. This is evaluated in the next section.

Citation impact-sensitive topic trend

As explained above, the experimental results on Dataset I and Dataset II can be used to analyze the topic trends in Text Mining and LIS. The trend of each topic was drawn as a topographical map using the λ parameters. Like a topographical map showing peaks and valleys, these maps help identify when a topic was studied intensively and how often it was cited.

Figure 4 shows the distributions of 30 topics from the Text Mining field. In each map, the X axis represents the year of publication and the Y axis the relative citation weight. Blue and dark colors indicate a low proportion of documents, while red and bright colors indicate a high proportion. For example, the maps of the general topics show red or white colors over their entire range, meaning that these topics appear in almost every document of the corpus. As another example, the map of Social Network Analysis is green overall but reddish from the 2010s onward. One can interpret from this that Social Network Analysis was introduced into the field of Text Mining in the early 2000s but initially had little impact; recently, however, the topic has been drawing attention.

Fig. 4 Trends of 30 topics in the text mining field

We also categorized the topics into four groups based on citations and recent trends, as shown in Table 8. The “Cold Recently” group includes topics that had high weights in the 2000s but have had low weights recently. Conversely, the “Hot Recently” group contains topics that had low weights in the 2000s but have obtained high weights recently. Topics in the “Lowly Cited” group have fewer citations than expected, and topics in the “Highly Cited” group have more recent citations than expected.

Table 8 Brief categorization of topics from text mining

There are some topics whose weights are too small to observe research trends. For these topics, it is difficult to analyze trends because the value of the overall distribution falls within the error bounds. Therefore, we omitted them from the table.

The combination of Hot or Cold Recently with Highly or Lowly Cited yields four categories in total. Topics in the Hot and Lowly Cited category have been widely studied in recent years, but relatively few papers have cited them. This category includes Health Informatics, Natural Language Processing, Recommendation System, etc. On the other hand, topics in the Cold and Highly Cited category have relatively many citations, but few papers have been published on them recently. These include Biomedical Corpus Annotation, Bioinformatics (Genomic Database & Interaction Database), etc. Topics in the Cold and Lowly Cited category, such as System Development and Data Mining, have received less attention from scholars than before. Finally, topics in the Hot and Highly Cited category, such as Social Network Analysis, Sentiment Analysis, Bibliometrics, etc., are fields in which the number of citations is relatively high and research interest has recently been increasing.

Figure 5 shows the distributions of 40 topics from the LIS field. It provides a much broader spectrum of the field than the Text Mining trends do. Although the compositions of the figure and table are the same as those of Fig. 4 and Table 8, the overall topography of these results is more complicated because of the larger corpus. The corpus contains three general topics related to research methods rather than subjects, such as Survey Research and Group Interview. We omitted these general topics along with minor topics whose distribution weights are too small.

Fig. 5 Trends of 40 topics in the LIS field

In the LIS field, there are some Cold and Lowly Cited topics that were widely studied in the past but now receive relatively little attention and few citations, as shown in Table 9. These include Information Management and Collection Management. Topics in the Hot and Lowly Cited category range from Information Literacy to Information Ethics and are currently being studied extensively in LIS. In the Cold and Highly Cited category, interdisciplinary topic areas appear, such as Information Theories and Business Informatics. Lastly, the Hot and Highly Cited category also contains many interdisciplinary topics, similar to the Cold and Highly Cited group, such as Bibliometrics, Social Media, and Cheminformatics. This shows the increasing trend of collaboration between LIS and other disciplines, and these topics attract more attention than traditional LIS topics.

Table 9 Brief categorization of topics from LIS

We further examined the recent hot topic “Scientometrics”. A total of 50 articles in which the topic Scientometrics accounts for more than 50% were published in 2016, 37 of them in the journal Scientometrics. In terms of subject matter, most dealt mainly with collaboration networks (Zhang et al. 2016; Bouabid et al. 2016; Zhao and Zhao 2016). They covered various types of collaboration among countries, disciplines, and authors, and those papers were relatively highly cited. Another major subject was trend analysis, which included discipline-specific case studies (Cavacini 2016; Milanez et al. 2016; Stein et al. 2016). There were studies using VOSViewer, a visualization software (Moed 2016), and a few proposed visualization techniques themselves (Liu and Mei 2016). There were also a few studies on patent analysis (Fukugawa 2016; Dou and Kister 2016; Kang and Sohn 2016); on average, these received fewer citations than the collaboration research did.

Conclusion and future work

We proposed the g-DMR topic model, a general method for multi-dimensional featured topic modeling that incorporates multiple arbitrary features into the topic model. In the present paper, we applied the proposed model to analyzing citation impact-sensitive research trends. To evaluate the performance of g-DMR, we downloaded 10,907 articles on text mining and 54,703 articles on LIS from Scopus. We compared the topic trends inferred in consideration of citation count with the actual topic trends for LIS as well as text mining. As reported in Sect. 4.2, DMR is not suitable for citation impact-sensitive topic trend analysis because the datasets have to be arbitrarily partitioned by both citation count and year. In contrast, g-DMR overcomes this problem by introducing the Legendre TDF. The comparison between the inferred and actual topic trends shows that g-DMR predicts the topic trends accurately, whereas DMR fails to do so under all of the partitioning options tried. Moreover, the results also show that g-DMR enables citation impact-sensitive topic trend analysis. Due to its generality, the proposed approach is highly applicable to other domains. As stated in the introduction section, g-DMR is publicly available. Thus, if similar studies are conducted in other domains in the future, it could shed new light on the research trends of the applied domain by harmoniously incorporating citation counts into the topic trend model.

In addition to the TDF presented in this paper, different types of TDF can be proposed to handle several continuous and categorical metadata variables together, and it is also possible to model two or more mutually independent metadata categories of the documents. g-DMR provides a more general generative model than DMR does and can perform topic inference with more complicated metadata. It can also be used for social text or other analyses where more than one metadata variable is attached. For example, given a set of social texts scored on multiple sentiment aspects, we can find the word distributions related to each sentiment through g-DMR analysis. Furthermore, it is expected that the best-matching sentiment-score distribution for a given word set can also be predicted.

However, g-DMR has a couple of limitations. First, the overall performance of the model depends heavily on the parameters I and J (the orders of the Legendre polynomials) and K (the number of topics). Thus, appropriate values for these parameters have to be found experimentally. Another limitation is that g-DMR is flat; in other words, it cannot grasp the hierarchy of, or correlations between, topics. Since topics in scholarly documents are not independent but have hierarchies and relations with each other, g-DMR may miss this information.

Our model can be extended in several potentially useful ways. First, non-parametric inference, as used in HDP (Teh et al. 2005) or hLDA (Griffiths et al. 2004), can be applied to g-DMR. This would solve the problem of selecting an appropriate value for K by fitting the number of topics to the given data.

In addition, integrating a hierarchical model into g-DMR is another useful extension that would address the problem of topic independence. In this way, research trends could be viewed at different levels of a hierarchy as well as over time and citation impact; we leave this as future work.