Introduction

The statistical literature has changed remarkably in recent years. Many authors have studied this evolution, with different approaches. For example, Genest (1997, 1999) analyzed, respectively, sixteen international journals publishing statistical theory during the period 1985–1995, and eighteen international journals, half specialized in probability theory and half in statistics, during the period 1986–1995. Papers, authors, and adjusted page counts yield measures of productivity for the institutions and countries that contributed to fundamental research in statistics and probability during those periods. Genest (2002) updated these works on world research output in probability and statistics by collecting data up to 2000. The data provide valuable information on the evolution of publication habits in terms of the volume of research, the length of papers, and co-authorship practices.

Schell (2010) argued that the dissemination of ideas from theory to practice is a significant challenge in statistics; quick identification of articles useful to practitioners would therefore greatly assist this dissemination and thereby improve science. For this purpose, he used the citation count history of articles to identify key papers for applied biostatisticians published between 1985 and 1992 in 12 statistics journals. Stigler (1994) studied the use of citation data to investigate the role that statistics journals play in communication within the field and between statistics and other fields, treating citations as import-export statistics reflecting intellectual influence. Ryan and Woodall (2005) attempted to identify the 25 most-cited statistical papers, providing brief commentary on each paper on their list. The list consists, to a great extent, of papers that concern non-parametric methods, have applications in the life sciences, or deal with the multiple comparisons problem. They also briefly discussed some of the issues involved in the use of citation counts.

In this paper we investigate a decade of research in statistics with a topic model approach. By applying topic model algorithms to our data, after a suitable preliminary cleaning, we identify the most relevant topics in the statistical literature between 2000 and 2010, restricted of course to the three journals considered, and we describe them by associating keywords, publication venue, authors' countries of affiliation, and year with each topic. We also study the citation distribution by topic. Lastly, we show the evolution of topics over time, an innovative approach to this issue, and the mutual relations among them.

Specifically, we have implemented two different approaches: (1) in order to study current topics, we identify topics by working on the whole corpus of papers; (2) in order to study topic evolution, we analyze the topics that can be found by taking into account only the papers produced year by year; in the latter case we deal with 11 different corpora and generate a set of topics for each corpus independently.

In addition, we analyze a posteriori the characteristics of the topics, assigning to them bibliometric indices based on citations as well as the typical descriptors of the papers that compose them. Thus, within the same subject category, it is possible to identify topics with a different impact.

The paper is organized as follows: in “The corpus of publications data” section, we describe the collected data. “Probabilistic topic models as bibliographic descriptors” section is devoted to describing probabilistic topic models. In “Current topics” and “Topic evolution” sections we present the results obtained with the current-topics and topic-evolution approaches, respectively. In “Conclusions” section we give our concluding remarks.

The corpus of publications data

In this work, we consider papers published between 2000 and 2010 in one of the following journals:

  • The Annals of Statistics (Ann. Stat.)

  • Journal of the American Statistical Association (JASA)

  • Journal of the Royal Statistical Society. Series B (JRSS(B))

Data were collected in June 2013 and stored in our bibliometric database, which is designed according to the model presented in Ferrara and Salini (2012). Since our goal is to identify methodological innovations in statistics that have marked the history of statistical research in recent years, we have chosen three journals that are universally known for publishing contributions of methodological and foundational innovation in statistics. The three journals all rank in the top decile of the Statistics & Probability category according to various bibliometric indices (IF, 5-year IF, Eigenfactor score, etc.). Moreover, they have a long tradition of publishing works at the leading edge of methodological development, with a strong emphasis on relevance to statistical practice. For these three journals, we create a reference corpus which is used for all the analysis activities described in the rest of the paper. The corpus is created by collecting, for each paper in the journals, the metadata concerning authors and their countries of affiliation, title and abstract, year of publication, and the number of citations received by the paper. The paper titles and abstracts, in particular, are used to extract relevant keywords and linguistic information from the corpus papers.

In Table 1, we show the distribution of papers and citations per year for the three selected journals, as resulting from Web of Science. JASA is the journal with the highest number of papers published. The number of papers published per year increases for The Annals of Statistics and remains roughly constant for the other two journals. Note that in the table the citations refer to papers and not to authors. For example, 3536 is the number of citations recorded in Web of Science in June 2013 for papers published in The Annals of Statistics in the year 2000. As expected, the average number of citations decreases with time, since more recent papers have had less time to accumulate citations than older ones.

Table 1 Distribution of papers per year and journal

In Table 2, we show the distribution of papers by country of affiliation and journal. In this case, if two authors of the same paper come from two different countries, the citations are counted twice; therefore the total number of citations in the previous table is lower than the total in this table. An author from the United States appears in at least one paper out of two for JASA and The Annals of Statistics. For JRSS(B), although United States authors are still the most numerous, British authors have a significant presence compared with the other countries.

Table 2 Distribution of papers per country and journal

Table 3 shows the papers that are cited more than 500 times. The most attractive concepts and/or techniques, according to the three considered journals and their Editors’ decisions, seem to be boosting, false discovery rate, clustering, microarray and gene expression data.

Table 3 Papers with more than 500 citations

A typical intuition when looking at a collection of papers is that papers can be grouped into “topics”. These topics can then be described and connected to each other. Moreover, a topic may generate other topics over the years. In the next sections we develop these ideas.

Probabilistic topic models as bibliographic descriptors

Topic models are based on the idea that documents are a combination of topics, where a topic is defined as a probability distribution over words. Documents are observed, while topics (and their distributions) are considered as hidden structures or latent variables. Topic modeling algorithms are statistical methods that analyze the words of the original documents to discover the topics that run through them, how these topics are connected to each other, and how they change over time (Blei 2012). The simplest and most commonly used probabilistic topic approach to document modeling is the generative model Latent Dirichlet Allocation (LDA) (Blei et al. 2003). The idea behind LDA is that documents blend multiple topics.

A topic is defined to be a distribution over a fixed vocabulary. For example, a statistics topic assigns high probability to words about statistics. The model assumes that the topics are generated before the documents. For each document, the words are generated in a two-stage process: (1) randomly choose a distribution over topics (drawn from a Dirichlet distribution); (2) for each word, first randomly choose a topic from the distribution over topics and then randomly choose a word from the corresponding distribution over the vocabulary. The central problem of topic modeling is the use of the observed documents to infer the latent variables. Topic models are probabilistic models in which data are treated as arising from a generative process that includes hidden (or latent) variables. This process defines a joint probability distribution over both the observed and hidden random variables.
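To make the two-stage generative process concrete, the following is a minimal sketch in Python; the toy vocabulary, the number of topics, the document length, and the Dirichlet hyperparameters are our own illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative vocabulary and settings (assumptions for this sketch).
vocab = ["regression", "bayesian", "bootstrap", "microarray", "wavelet", "testing"]
n_topics, n_words_per_doc, alpha = 3, 50, 0.1

# Each topic is a distribution over the fixed vocabulary.
topics = rng.dirichlet(np.ones(len(vocab)) * 0.5, size=n_topics)

def generate_document():
    # (1) draw a document-specific distribution over topics
    theta = rng.dirichlet(np.ones(n_topics) * alpha)
    words = []
    for _ in range(n_words_per_doc):
        # (2) draw a topic, then a word from that topic's distribution
        z = rng.choice(n_topics, p=theta)
        words.append(rng.choice(vocab, p=topics[z]))
    return words

print(" ".join(generate_document()))
```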

The conditional distribution of the hidden variables given the observed variables, also called the posterior distribution, is computed. The numerator of the conditional distribution is the joint distribution of all the random variables, which can be easily computed; the denominator is the marginal probability of the observations, that is, the probability of seeing the observed corpus under any topic model. In principle, it can be computed by summing the joint distribution over every possible instantiation of the hidden topic structure; in practice, because the number of possible topic structures is exponentially large, this sum is intractable.

Topic modeling algorithms fall into two categories, which propose alternative distributions to approximate the true posterior: sampling-based algorithms, such as Gibbs sampling, and variational algorithms. The first group constructs a Markov chain, a sequence of random variables, each dependent on the previous one, whose limiting distribution is the posterior (Steyvers and Griffiths 2007); the second group represents a deterministic alternative to sampling-based algorithms (e.g., the variational expectation-maximization, VEM, algorithm). Rather than approximating the posterior with samples, variational methods posit a parameterized family of distributions over the hidden structure and find the member of the family that is closest to the posterior; in this way, they transform the inference problem into an optimization problem. Blei and Lafferty (2007) proposed the correlated topic model (CTM), which explicitly models the correlation between the latent topics in the documents. In this paper, we rely on VEM algorithms instead of CTM because VEM algorithms can provide more than one explaining topic for each paper. This overlapping between topics is important for finding topic correlations, which is one of the main goals of our work.
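As a hedged illustration of variational fitting, the sketch below uses scikit-learn's LatentDirichletAllocation, whose estimator is a variational Bayes method comparable in spirit to VEM; this is not the software used in the paper (which relies on the VEM algorithm, e.g., as in the R topicmodels package), and the toy documents stand in for the real corpus.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # toy titles standing in for the real titles/abstracts
    "false discovery rate control in multiple testing",
    "bayesian wavelet shrinkage for nonparametric regression",
    "robust inference for generalized linear models",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)          # document-term matrix

lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                random_state=0)
doc_topic = lda.fit_transform(dtm)            # per-document topic proportions
topic_word = lda.components_                  # unnormalized topic-word weights
print(doc_topic.round(2))
```

The per-document topic proportions returned by the variational fit are exactly the kind of multi-topic explanation of each paper that the analysis exploits.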

Choosing the number of topics

As discussed before, topic models are latent variable models of documents that exploit the correlations among the words and latent semantic themes in a collection of papers (Blei and Lafferty 2007). An important consequence of this definition is that the expected number of topics (i.e., the latent variables) must be set before the computation of the model itself. Thus, since the number of topics has to be set a priori, choosing the best number of topics for a given collection of papers is not trivial. In the literature (Hall et al. 2008; Blei 2012) this problem has been addressed in different ways, but always looking for a compromise between the need for a high number of topics to cover all the themes in the document collection and the need for a limited number of topics, which can be more easily understood and verified by experts in the domain of the collected data.

In order to help in choosing the number of topics of interest, a measure of perplexity has been introduced (Grün and Hornik 2011). The idea is that model selection with respect to the number of topics is possible by splitting the data into training and test datasets. The likelihood for the test data is then approximated using the lower bound for VEM estimation. In particular, perplexity is a measure of the ability of a model to generalize to unseen data. It is defined as the reciprocal of the geometric mean per-word likelihood of a test corpus given the model, as follows:

$$\begin{aligned} {\rm Perplexity}(w) = \exp \left\{ -\frac{\log (p(w))}{\sum _{d=1}^D\sum _{j=1}^V n^{(jd)}}\right\} \end{aligned}$$

where \(n^{(jd)}\) denotes how often the \(j\hbox {th}\) term occurred in the \(d\hbox {th}\) document.

The common method to evaluate perplexity in topic models is to hold out test data from the corpus and then evaluate the estimated model on the held-out data. Higher values of perplexity indicate a greater misrepresentation of the words of the test documents by the trained topics. Perplexity measures the quality of the model learned by LDA in predicting future data from the same distribution as the training data. In doing so, it captures an interesting characteristic of an inference algorithm: given that the model is the same, the best algorithm (in terms of quality of the learned result) will achieve better perplexity than the others. Perplexity is usually the first or second metric used to judge statistical model quality (other popular choices being test-set likelihood or the marginal probability of the data given the model), but it is rather coarse, and the topic modeling community has recently been moving towards more refined metrics (Mimno and Blei 2011). Even though these metrics convey richer information, test-set perplexity is likely correlated with all of them. In this paper, we ran a set of pre-tests on the whole collection of papers at hand, by executing the VEM algorithm with a number of topics ranging from 10 to 400 and collecting the perplexity values of each execution. The resulting perplexity plot is shown in Fig. 1.
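The sketch below shows the shape of such a pre-test: split the document-term matrix, fit models with varying numbers of topics, and record held-out perplexity. It uses scikit-learn's variational estimator and a toy corpus, so the documents and the small grid of topic counts are assumptions; the paper sweeps 10 to 400 topics over the full corpus.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

docs = [  # toy documents for illustration only
    "false discovery rate control in multiple testing",
    "empirical bayes estimation of false discovery rates",
    "robust inference for generalized linear models",
    "bayesian wavelet shrinkage for nonparametric regression",
    "functional data analysis for longitudinal studies",
    "variable selection in high dimensional regression",
    "markov chain monte carlo methods for mixture models",
    "clustering gene expression data from microarrays",
]
dtm = CountVectorizer(stop_words="english").fit_transform(docs)
train, test = train_test_split(dtm, test_size=0.25, random_state=0)

for k in (2, 4, 8):  # the paper's grid is 10-400
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(train)
    # scikit-learn computes perplexity from a variational lower bound,
    # in the spirit of the formula above
    print(k, round(lda.perplexity(test), 1))
```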

Fig. 1 Perplexity plot for a number of topics ranging from 10 to 400

Looking at the perplexity plot, we can see that the perplexity drops sharply from about 400 to about 200 when moving from 10 to 30 topics. The decrease then continues, reaching a roughly stable value of 100–120 with 120 or more topics. Since we are interested in low levels of perplexity while keeping the number of topics as low as possible, we decided to limit ourselves to 120 topics.

We briefly recall that each topic is associated, as an explanation, with each paper together with a measure of relevance (average quality) of that paper for the topic. Thus, a first measure of the ability of topics to explain papers is the average level of paper relevance per topic. Moreover, we are also interested in counting the number of papers explained by each topic. The idea is that a solution where topics explain more papers is better than one where topics explain fewer papers, in that the former provides a better synthesis of the paper corpus. Of course, the two situations are comparable only if the average level of relevance per topic is the same or similar. A relevant issue when working with topic models is to determine when a topic should be considered a good explanation for a given paper. In particular, given a topic T, a paper p, and the relevance \(\rho (p,T)\) of T with respect to p, we say that p is explained by T if \(\rho (p,T) \ge th_e\).Footnote 1 We call \(th_e\) the explanation threshold. In order to determine the explanation threshold \(th_e\), we start from two main requirements: (1) corpus coverage: we are interested in finding a threshold value such that the fraction of papers in the corpus that are explained by at least one topic is high; (2) explanation quality: we are interested in finding a threshold value such that the average quality (i.e., relevance) of topic explanations is high. More formally, the corpus coverage \(C_{P}^{th_e}\) for a corpus of papers P and an explanation threshold \(th_e\) is defined as:

$$\begin{aligned} C_{P}^{th_e} = \frac{\mid \{p_i \mid p_i \in P \wedge \rho (p_i,T) \ge th_e\} \mid }{\mid P \mid } \end{aligned}$$

where \(\mid \{p_i \mid p_i \in P \wedge \rho (p_i,T) \ge th_e\} \mid \) is the cardinality of the set of corpus papers which are explained by at least one topic with a value of relevance higher than, or equal to, a given explanation threshold \(th_e\). The explanation quality \(Q_{P}^{th_e}\) is defined as:

$$\begin{aligned} Q_{P}^{th_e} = \frac{\sum \limits _{p_i \in P,\ \rho (p_i,T) \ge th_e} \rho (p_i,T)}{K} \end{aligned}$$

where \(K = \mid \{p_i \mid p_i \in P \wedge \rho (p_i,T) \ge th_e\} \mid \) is the number of papers having at least one explanation topic T such that \(\rho (p_i,T) \ge th_e\). We can observe that, as expected, when the corpus coverage increases, the explanation quality decreases. In fact, topics with high-quality explanations are more focused on a limited number of papers. Conversely, when we set a low level of the explanation threshold, the number of papers that are explained by at least one topic is higher, but the average quality of the accepted explanations is lower. This situation is illustrated in Fig. 2, where we report the values of corpus coverage and explanation quality for different levels of the explanation threshold \(th_e\) when we consider 120 topics.
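A minimal sketch of the two quantities, assuming a document-topic relevance matrix as input and taking each paper's best-explaining topic for the quality average (one possible reading of the definition above); the matrix values and the threshold grid are invented for illustration.

```python
import numpy as np

# rho[i, j] = relevance of paper i for topic j (illustrative stand-in values)
rho = np.array([[0.70, 0.10, 0.20],
                [0.35, 0.45, 0.20],
                [0.33, 0.33, 0.34],
                [0.05, 0.90, 0.05]])

def coverage_and_quality(rho, th_e):
    best = rho.max(axis=1)             # best explaining topic per paper
    explained = best >= th_e           # papers explained by at least one topic
    coverage = explained.mean()
    quality = best[explained].mean() if explained.any() else 0.0
    return coverage, quality

for th_e in (0.3, 0.4, 0.5):
    c, q = coverage_and_quality(rho, th_e)
    print(f"th_e={th_e}: coverage={c:.2f}, quality={q:.2f}")

# The threshold can then be chosen where |coverage - quality| is smallest,
# as done in the text (th_e = 0.4 for the real corpus).
grid = np.arange(0.1, 0.9, 0.05)
best_th = min(grid, key=lambda t: abs(np.subtract(*coverage_and_quality(rho, t))))
print("chosen threshold:", round(float(best_th), 2))
```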

Fig. 2 Corpus coverage and explanation quality at different values of explanation threshold

By taking into account the results shown in Fig. 2, we use as the explanation threshold the point where the distance between corpus coverage and explanation quality is minimal, that is \(th_e = 0.4\). Thus, working with 120 topics and an explanation threshold of 0.4, the corpus coverage is 1974 papers out of 3048 (64.76 %).Footnote 2 The average explanation quality is 0.59.

Current topics

As reported in the previous section, in our first experiment with topic models we compute 120 topics by working on the whole corpus of papers, considering for each paper its title and abstract. Then, we select the top 30 prominent topics (i.e., the topics explaining the highest number of papers), in order to have a limited and easily understandable collection of topics that explain the paper corpus with a good level of relevance. The complete list of the 30 topicsFootnote 3 presented in this paper is reported in Table 4, where we anticipate a selection of keywords extracted from each topic in order to help the reader understand its contents. The keyword-selection procedure, which is also capable of extracting compound keywords, is detailed in the "Finding the most relevant topic keywords" subsection.

Table 4 Top-30 topics per number of papers

Topic description

One of the most interesting uses of topics is to provide a synthetic map of the corpus of collected papers, giving a high-level view of the most relevant concepts, keywords, and themes addressed in a given time period for a specific field of research. In our case, limited to the three journals considered, we identify the relevant themes of the last 10 years in statistics. In order to obtain this result, we first need to associate each topic with a descriptor providing information about the most relevant keywords describing the topic, together with some useful statistical information concerning the publication venue distribution of the papers per topic (useful to understand whether some journals are more specialized in certain topics) and the distribution of the topic's papers per year, country, and number of citations.

More formally, we can define a descriptor \({\mathcal {D}}_T\) of a topic \(T\) as a five-tuple of the form \({\mathcal {D}}_T = \langle K_T, J_T, Y_T, C_T, R_T\rangle \), where \(K_T = \langle (k_1, r_1), (k_2, r_2), \dots , (k_j, r_j)\rangle \) denotes a list of the j most relevant keywords describing T together with the relevance \(r_i\) of the keyword \(k_i\) for T; \(J_T = \langle (j_1, n_1), (j_2, n_2), \dots , (j_k, n_k)\rangle \) denotes the list of the k most relevant journals for T together with the fraction \(n_i\) of papers explained by T that have been published in the journal \(j_i\); \(Y_T = \langle (y_1, n_1), (y_2, n_2), \dots , (y_l, n_l)\rangle \) denotes the list of the l most relevant years for T together with the fraction \(n_i\) of papers explained by T that have been published in \(y_i\); \(C_T = \langle (c_1, n_1), (c_2, n_2), \dots , (c_m, n_m)\rangle \) denotes the list of the m most relevant countries for T together with the fraction \(n_i\) of authors of papers explained by T who are affiliated with an institution located in the country \(c_i\); finally \(R_T\) denotes the distribution of citations over the papers explained by T.
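One possible in-memory representation of such a descriptor is sketched below; the field names, types, and the sample values are our own illustrative choices, not the authors' implementation or real data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TopicDescriptor:
    keywords: List[Tuple[str, float]]   # K_T: (keyword, relevance)
    journals: List[Tuple[str, float]]   # J_T: (journal, fraction of explained papers)
    years: List[Tuple[int, float]]      # Y_T: (year, fraction of explained papers)
    countries: List[Tuple[str, float]]  # C_T: (country, fraction of authors)
    citations: List[int] = field(default_factory=list)  # R_T: citations per explained paper

# Made-up values, for illustration only
example = TopicDescriptor(
    keywords=[("false discovery", 0.5), ("multiple testing", 0.2)],
    journals=[("Ann. Stat.", 0.45), ("JASA", 0.40), ("JRSS(B)", 0.15)],
    years=[(2009, 0.20), (2008, 0.15)],
    countries=[("USA", 0.6), ("UK", 0.1)],
    citations=[1200, 300, 20, 5],
)
```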

Finding the most relevant topic keywords

In order to determine the keyword descriptor \(K_T\) for the topic T, we first select the set \(P_T = \{p_i \mid \rho (p_i,T) \ge th_e\}\) of papers explained by the topic T, that is, the set of papers with relevance \(\rho (p_i,T)\) higher than, or equal to, the explanation threshold with respect to T. In such a way, we create a textual corpus that is then pre-processed with standard natural language processing (NLP) techniques in order to create a vector of terms for each paper. The NLP techniques used are elision removal (log-likelihood \(\rightarrow \) log likelihood), lower-case normalization (New York \(\rightarrow \) new york), and stop-word removal (the next step \(\rightarrow \) next step). During the pre-processing step, we decided not to use some other common techniques such as stemming, in order to provide a more human-readable descriptor. Instead, we introduced bi-gram detection. In our approach, a bi-gram is a pair of terms that often occur together in the corpus and can thus be interpreted as a single compound term (e.g., false discovery). In order to determine relevant bi-grams, we associate a measure of mutual information \(I(t_i,t_j)\) with any pair of adjacent terms \(t_i\) and \(t_j\). \(I(t_i,t_j)\) is defined asFootnote 4:

$$\begin{aligned} I(t_i,t_j) = \log \left( \frac{p(t_i, t_j)}{p(t_i) \cdot p(t_j)}\right) \end{aligned}$$

where \(p(t_i, t_j)\) is the ratio of the number of occurrences of the pair \((t_i, t_j)\) to the number of occurrences of all pairs of adjacent terms in the corpus, and \(p(t_i)\) is the ratio of the number of occurrences of \(t_i\) to the total number of occurrences of any term in the corpus. \(I(t_i,t_j)\) measures the relevance of the occurrences of the pair \((t_i,t_j)\) with respect to the relevance of the occurrences of the terms \(t_i\) and \(t_j\) separately. We select as relevant those pairs that appear more than twice in the corpus and have a value of mutual information higher than a fixed threshold equal to 1. We then substitute the relevant pairs for the corresponding single terms in the vectors. Once all the papers explained by the topic T (i.e., the papers in \(P_T\)) are associated with a vector of terms, we calculate the most relevant keywords describing T as the list of vector terms with the highest relevance according to a TF-IDF-like measure. The term frequency \(\text {TF}(t_i)\) of a term \(t_i\) is the number of occurrences of \(t_i\) in all the term vectors associated with papers in \(P_T\). The inverse document frequency \(\text {IDF}(t_i)\) of a term \(t_i\) is calculated as follows:

$$\begin{aligned} \text {IDF}(t_i) = \log \left( \frac{\mid P \mid }{1 + \mid \{p \in P : t_i \in p\} \mid } \right) \end{aligned}$$

where \(\mid P \mid \) is the total number of papers in the corpus (not only those explained by the topic T), while \(\mid \{p \in P : t_i \in p\} \mid \) is the number of paper vectors containing \(t_i\). Finally, the relevance \(r_i\) of a keyword \(k_i\), given the corresponding term \(t_i\), is calculated as \(r_i = \text {TF}(t_i) \cdot \text {IDF}(t_i)\). In order to select the most relevant keywords, we experimentally observed that we can take the terms whose cumulative relevance exceeds 10 % of the sum of all the terms' relevances.
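A minimal sketch of the bi-gram and keyword-scoring steps, under simplifying assumptions (whitespace tokenization, no elision removal); the toy corpus is so small that the relevant-pair list may come out empty, whereas the real corpus contains thousands of titles and abstracts.

```python
import math
from collections import Counter

corpus = [  # toy tokenized documents, for illustration only
    "false discovery rate control in multiple testing".split(),
    "estimating the false discovery rate in large scale testing".split(),
    "robust inference for generalized linear models".split(),
]

unigrams = Counter(t for doc in corpus for t in doc)
bigrams = Counter(pair for doc in corpus for pair in zip(doc, doc[1:]))
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def mutual_information(pair):
    # I(t_i, t_j) = log( p(t_i, t_j) / (p(t_i) * p(t_j)) )
    p_pair = bigrams[pair] / n_bi
    return math.log(p_pair / ((unigrams[pair[0]] / n_uni) * (unigrams[pair[1]] / n_uni)))

# pairs occurring more than twice with mutual information above 1, as in the text
relevant_pairs = [p for p, c in bigrams.items() if c > 2 and mutual_information(p) > 1]

def tf_idf(term, topic_docs, all_docs):
    # TF counted over the papers explained by the topic, IDF over the whole corpus
    tf = sum(doc.count(term) for doc in topic_docs)
    idf = math.log(len(all_docs) / (1 + sum(term in doc for doc in all_docs)))
    return tf * idf

print(relevant_pairs)
print(round(tf_idf("robust", corpus[2:], corpus), 3))
```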

Finding publication venue, year, country, and citation descriptors

The approach used for determining the other descriptors \(J_T\), \(Y_T\), and \(C_T\) is basically the same. We simply count the number of papers in \(P_T\) aggregated by the dimension of interest, which can be the journal, the year, or the country. As for countries, we associate a paper with each country of affiliation of its authors. Finally, we take the whole list of journals, years, and countries together with the fraction of papers associated with each journal, year, or country over the total number of papers (or authors, in the case of countries) in \(P_T\).

Another descriptor that can be associated with a topic is its impact, based on citations. Since citations are associated with every paper, it is possible to aggregate them to obtain bibliometric measures related to the topic. In addition to the mean and/or median of the citations and the classical h-index (Hirsch 2005), a graph of the citation distribution within the topic can be produced to highlight whether all the papers have a similar number of citations. In the first case this would suggest that the topic itself, rather than the individual paper or author, is able to generate a certain level of citations.
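As a small worked example of the topic-level citation indices, the following sketch computes the mean, median, and h-index from a vector of citation counts; the counts are invented for illustration.

```python
import numpy as np

# citation counts of the papers explained by one topic (invented values)
citations = np.array([512, 130, 95, 40, 22, 20, 8, 3, 1, 0])

def h_index(cites):
    # h = largest h such that h papers have at least h citations each
    cites = np.sort(cites)[::-1]
    return int((cites >= np.arange(1, len(cites) + 1)).sum())

print("mean:", citations.mean(),
      "median:", np.median(citations),
      "h-index:", h_index(citations))
```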

Example

In order to provide an example of topic description, in Table 5 we take into account a sample of paper titles that are explained by the topic T929 related to false discovery rate.

Table 5 Example of papers explained by topic T929

According to the approach described in the previous sections, we calculate the descriptors of the topic T929 by determining the most relevant keywords and the publication source, year, and country distribution, which are reported in Table 6.

Table 6 Descriptors for topic T929

The descriptors can also be used to provide a graphical representation of a topic, according to the following approach. A topic is represented as a circle whose diameter is proportional to the number of papers explained by the topic at hand (this is useful when more than one topic is depicted in the same map). The topic circle contains the keywords, printed in a tag-cloud fashion, where the size of each keyword is proportional to its relevance for the topic. The other descriptors are then associated with the circle as plots reporting the relevance of each element of the descriptor. An example of such a graphical representation is shown in Fig. 3.

Fig. 3 Graphical representation of topic T929

Topic T929 is characterized by the keywords false discovery, discovery rate, and multiple testing. All three journals addressed this topic, most notably The Annals of Statistics and JASA. Its presence has fluctuated over the years but tended to increase over time, with a peak in 2009. The most represented country is the United States. The citations are not distributed evenly among the papers; as can be observed in Table 5, there are papers with a very high number of citations. The median, equal to 20, is much lower than the mean, which is equal to 128.6. This shape is in accordance with well-known empirical laws governing the distribution of citations: in practice, the number of citations received by scientific papers appears to follow a power-law distribution (Newman 2006), and the distribution of citations is a rapidly decreasing function of the citation count. A Zipf plot is well suited for examining the large-x tail of the citation distribution (Gupta et al. 2005). For other topics, different patterns may be observed. In the "Citation distribution per topics" subsection a more detailed analysis of citations within topics is presented.
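For readers unfamiliar with the Zipf plot mentioned above, the sketch below plots citation counts sorted in decreasing order against rank on log-log axes, which makes a heavy, power-law-like tail visible; the citation counts are invented, not the T929 data.

```python
import numpy as np
import matplotlib.pyplot as plt

# invented citation counts for one topic, sorted in decreasing order
citations = np.sort(np.array([1400, 300, 128, 90, 40, 25, 20, 12, 5, 2, 1]))[::-1]
ranks = np.arange(1, len(citations) + 1)

plt.loglog(ranks, citations, marker="o", linestyle="none")
plt.xlabel("rank")
plt.ylabel("citations")
plt.title("Zipf plot of citations within a topic")
plt.show()
```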

Topic mutual relations

An interesting feature of topics is their mutual relations. In particular, we propose a similarity relation \(\sigma (T_i,T_j)\) between two topics, which is based on their terminology, as follows. Let the dictionary \({\mathcal {D}}\) be the set of all the terms used in the corpus, i.e., all the terms appearing in the title or abstract of at least one paper of the entire collection. The topic similarity \(\sigma (T_i,T_j)\) between two topics \(T_i\) and \(T_j\) is based on a weight \(w_l\) associated with each term \(t_l \in {\mathcal {D}}\) with respect to \(T_i\) and \(T_j\), respectively. In particular, given the topics \(T_i\) and \(T_j\), we define two vectors of terms \(V_i\) and \(V_j\), which have the following form:

$$\begin{aligned} V_i = \langle w_1, w_2, \dots , w_n \rangle ,\ V_j = \langle w_1, w_2, \dots , w_m \rangle \end{aligned}$$

where n and m are the number of terms extracted from \(T_i\) and \(T_j\), respectively. In particular, given a term \(t_k \in {\mathcal {D}}\), its corresponding weight \(w_{ik}\) with respect to \(T_i\) is equal to 0 if \(t_k\) does not appear in \(T_i\) (i.e., it is not used either in a title or in an abstract of any paper explained by \(T_i\)); otherwise, \(w_{ik}\) is calculated using the TF-IDF method discussed in the "Topic description" subsection. Analogously, we calculate the weight \(w_{jk}\) for the topic \(T_j\). On this basis, we evaluate the similarity \(\sigma (T_i,T_j)\) between \(T_i\) and \(T_j\) as the correlation between their corresponding vectors of terms \(V_i\) and \(V_j\), as follows:

$$\begin{aligned} \sigma (T_i,T_j) = \frac{\sum _k w_{ik}\,w_{jk}}{\sqrt{\sum _k w_{ik}^2}\ \sqrt{\sum _k w_{jk}^2}} \end{aligned}$$

where \(w_{ik}\) denotes the weight attributed to the term \(t_k\) for topic \(T_i\) and \(w_{jk}\) the weight attributed to \(t_k\) for topic \(T_j\).
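This is the cosine similarity between the two weight vectors, and a minimal sketch is given below; the keyword weights used in the usage example are a truncated subset of the descriptors reported in the next example, so the resulting value differs from the full computation.

```python
import math

def topic_similarity(w_i, w_j):
    # w_i, w_j map terms to TF-IDF weights; terms missing from a topic have weight 0
    common = set(w_i) & set(w_j)
    dot = sum(w_i[t] * w_j[t] for t in common)
    norm_i = math.sqrt(sum(v * v for v in w_i.values()))
    norm_j = math.sqrt(sum(v * v for v in w_j.values()))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0

# truncated keyword weights, for illustration only
T929 = {"false": 0.49, "discovery": 0.48, "testing": 0.19, "multiple": 0.19}
T878 = {"models": 0.20, "testing": 0.11, "multiple": 0.10, "discovery": 0.09}
print(round(topic_similarity(T929, T878), 4))
```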

Example

In order to clarify the evaluation of term-based topic similarity, we introduce a very simple example, by taking into account the topics T929 (false, discovery, false discoveries, discovery rates, rates) and T878 (models, endpoints, partitioning, decision, procedures), that have the following keyword descriptors:

  • T929: false (0.49), discovery (0.48), false discoveries (0.45), discovery rates (0.36), rates (0.29), procedure (0.2), multiple (0.19), testing (0.19), control (0.19), multiple testing (0.15), values (0.11), large scale (0.09), controlling false (0.08), simultaneous (0.07), hypothesis (0.07)

  • T878: models (0.2), endpoints (0.15), partitioning (0.14), decision (0.12), procedures (0.11), testing (0.11), lq (0.11), bayes (0.11), primary (0.1), multiple (0.1), decision theory (0.1), principle (0.09), discovery (0.09), equivalence (0.08), best (0.08)

By considering the two term vectors corresponding to T929 and T878, we find several terms in common (e.g., procedure, testing, discovery); the inner product over the common terms equals 0.1051. The norm of the weight vector (the square root of the sum of the squared weights) is 1.0458 for T929 and 0.4515 for T878. This leads to a similarity equal to 0.2225, calculated as follows:

$$\begin{aligned} \sigma (T929,T878) = \frac{0.1051}{1.0458 \cdot 0.4515} = 0.2225 \end{aligned}$$

In Table 7 we report the list of topics, with their keywords, ordered by descending similarity with T929.

Table 7 List of topics ordered by descending similarity with T929

Topic map

Since a topic is associated with a set of papers, it can be seen as the collection of papers that are explained by the topic at hand according to the explanation threshold \(th_e\). As we have seen, a topic can have a variable degree of similarity to other topics and can be described by the most frequent terms in the papers it contains. Moreover, each paper is associated with its contributors (usually the authors), with the venue of publication, with the publication year, and with the number of citations received. In order to provide a synthetic and comprehensive view of the topics addressed by a corpus of papers, we introduce the notion of topic map. A topic map is a graph where nodes represent topics and edges represent similarity relations between topics. Each node of the topic map can be graphically represented by a circle containing the k most relevant terms extracted from the papers explained by the corresponding topic. A term's font size is proportional to the number of occurrences of the term in the topic, as in conventional tag clouds. The circle area is proportional to the number of papers explained by the topic. The graphical arrangement of nodes is also relevant, since it is determined so as to display similar topics as close to each other as possible. The edge width is proportional to the strength of the similarity relation. Finally, a node/topic can also be labelled with country, publication source, and year descriptors. As an example of a very simple topic map, we show in Fig. 4 a portion of the topic map for the topics extracted around T929 presented in the previous example. For each topic we show only the topic ID and the three most relevant keywords.
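One way to assemble such a map as a graph is sketched below with networkx, which is our own tooling choice; topic sizes, similarities, and the edge threshold are invented placeholders. The force-directed layout step mirrors the idea of placing similar topics close to each other.

```python
import networkx as nx

topics = {"T929": 40, "T878": 25, "T640": 30}   # papers per topic (illustrative)
similarities = {("T929", "T878"): 0.22, ("T929", "T640"): 0.10,
                ("T878", "T640"): 0.05}
threshold = 0.08                                 # illustrative edge threshold

G = nx.Graph()
for t, n_papers in topics.items():
    G.add_node(t, size=n_papers)                 # node size ~ explained papers
for (a, b), s in similarities.items():
    if s >= threshold:
        G.add_edge(a, b, weight=s)               # edge width ~ similarity

pos = nx.spring_layout(G, weight="weight", seed=0)  # similar topics end up closer
print(G.edges(data=True))
```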

Fig. 4 Topic map representing topics extracted around T929

Citation distribution per topics

As mentioned before, it is not obvious that, within the same subject category, each topic has the same level of citations and hence the same benchmark values for bibliometric indicators. Moreover, the citation distribution itself may differ from topic to topic. In order to check this hypothesis, we select five topics with different citation patterns. In Table 8, for each topic we show the number of papers, the h-indexFootnote 5, the number of papers with more than 500 citations, and the number of papers with more than 100 citations. Moreover, for the citations, the mean, the standard deviation, the median, the interquartile range (IR), and the Gini coefficientFootnote 6 are reported. We can also represent the citation distribution through the Lorenz curve, the graphical representation developed by the American economist Max Lorenz in 1905 for the distribution of wealth. In our application (see Fig. 5), the horizontal axis is the proportion of papers and the vertical axis is the proportion of citations. A straight diagonal line represents a perfectly equal distribution of citations per paper; the Lorenz curve lies beneath it, showing the real distribution of the citations. The area between the straight line and the curve measures the amount of concentration and determines the Gini coefficient.
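A minimal sketch of the Lorenz curve and of a Gini coefficient computed from it is given below; the citation counts are invented, and the Gini value is approximated by the trapezoidal rule.

```python
import numpy as np
import matplotlib.pyplot as plt

# invented citation counts for one topic, sorted in increasing order
citations = np.sort(np.array([1400, 300, 128, 90, 40, 25, 20, 12, 5, 2, 1]))
cum_papers = np.arange(1, len(citations) + 1) / len(citations)
cum_cites = np.cumsum(citations) / citations.sum()

x = np.concatenate(([0.0], cum_papers))
y = np.concatenate(([0.0], cum_cites))

# Gini coefficient as 1 minus twice the area under the Lorenz curve
gini = 1 - 2 * np.trapz(y, x)

plt.plot([0, 1], [0, 1], linestyle="--", label="perfect equality")
plt.plot(x, y, label=f"Lorenz curve (Gini = {gini:.2f})")
plt.xlabel("proportion of papers")
plt.ylabel("proportion of citations")
plt.legend()
plt.show()
```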

Table 8 Citation pattern per topics

For example, T929, related to the false discovery rate and previously discussed in this section, has a lower citation mean than T878, related to multiple comparisons, but its citation median and its h-index are much higher. This means that in T929 the citations are less concentrated than in T878; indeed, its Gini coefficient is smaller.

Fig. 5 Lorenz curve of citations per topic

This is also evident from the standard deviation and the IR, which are higher for topic T878 than for T929.

Topic evolution

In order to study the evolution of a scientific field over time, focusing on the years associated with topics is the most natural approach, but it is also affected by a structural problem: topics are statistically discovered over the whole corpus, which includes papers published in different years. This means that the number of papers published in a given year affects the whole composition of topics and, potentially, may lead to a situation where topics that were popular in years with a limited number of publications are not discovered at all. Our idea is that, instead of focusing on the whole corpus of papers, we are now interested in studying the topics that can be found by taking into account only the papers produced year by year. According to this approach, we deal with 11 different corpora (i.e., one per year) and we generate a set of topics for each corpus independently. Then, we study the similarity relations existing between the topics associated with one year and the topics associated with the subsequent year. Our hypothesis is that a similarity relation between a topic \(T_{i(y)}\), derived from the corpus of papers published in the year y, and a topic \(T_{j(y+1)}\), derived from the year \(y + 1\), is a useful index of a possible evolution of the topic \(T_i\) into the topic \(T_j\).

As an example, we discuss the case of two papers:

  a. Storey J.D. (2002), A direct approach to false discovery rates, JRSS(B)

  b. Cantoni E.; Ronchetti E. (2001), Robust Inference for Generalized Linear Models, JASA

In Fig. 6, we report a topic evolution map in which the topics that contain papers (a) and (b) are highlighted. The topics here are extracted year by year. In the evolution map, topics, represented as circles, are ordered by year and are linked to each other by arrows representing similarity relations among topics. The similarity has been calculated as discussed in the "Current topics" section. Three topics include the false discovery rate (i.e., T567, T661, and T685), and another one includes "multiple testing" (i.e., T702), which is a related, broader topic. These are reported as gray, shadowed circles in the map. The topic chain is created as follows: given a topic \(T_{i(y)}\) derived from year y, we calculate the similarity between \(T_{i(y)}\) and all the topics \(T_{j(y+1)}\) derived from the subsequent year \(y+1\); then, we set a similarity threshold \(th_s\) and we create a link between \(T_{i(y)}\) and all the topics \(T_{j(y+1)}\) such that \(\sigma (T_{i(y)}, T_{j(y+1)}) \ge th_s\).

In such a way, the evolution map suggests possible evolution paths connecting topics extracted from papers published in the early period (i.e., 2000–2003) and topics extracted from papers of the late period (i.e., 2008–2010).
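A minimal sketch of the chaining step is given below; the yearly topic dictionaries and the threshold are illustrative placeholders, and the cosine function replicates the term-based similarity sketched earlier.

```python
import math

def cosine(w_i, w_j):
    # term-based topic similarity: cosine of the TF-IDF weight vectors
    dot = sum(w_i[t] * w_j.get(t, 0.0) for t in w_i)
    ni = math.sqrt(sum(v * v for v in w_i.values()))
    nj = math.sqrt(sum(v * v for v in w_j.values()))
    return dot / (ni * nj) if ni and nj else 0.0

def build_evolution_links(topics_by_year, th_s):
    # link each topic of year y to every topic of year y+1 whose similarity >= th_s
    links = []
    years = sorted(topics_by_year)
    for y, y_next in zip(years, years[1:]):
        for name_i, terms_i in topics_by_year[y].items():
            for name_j, terms_j in topics_by_year[y_next].items():
                s = cosine(terms_i, terms_j)
                if s >= th_s:
                    links.append((y, name_i, y_next, name_j, round(s, 3)))
    return links

# invented yearly topics with a few term weights each
topics_by_year = {
    2002: {"T567": {"false": 0.5, "discovery": 0.4}},
    2003: {"T661": {"discovery": 0.45, "rate": 0.3},
           "T600": {"wavelet": 0.6, "shrinkage": 0.3}},
}
print(build_evolution_links(topics_by_year, th_s=0.2))
```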

Fig. 6 Topic evolution map

Looking at the map, it is possible to understand which topics have chained over the years, leading to the current research topics. For example, T640 (microarray, not parametric, semi-parametric regression, functional data) is connected to both of the preceding routes, which include topics such as false discovery rate (T567), robust methods (T546), functional data analysis (T628), and robustness issues in multivariate data analysis (T622). T663 (risk factors, multicenter survival studies, hazard function) is connected with the topic T685 (randomized experiment, clinical trials, false discovery rate), preceding T700 (high-dimensional regression, large-scale prediction problems) and T727 (weighted distance estimation, missing at random, variable selection).

The list of most relevant keywords for topics appearing in Fig. 6 is reported in Table 9.

Looking at Table 9, it is possible to see that some arguments (the biggest circles in Fig. 6), identified by keywords such as comment, letters editor, and rejoinder, have generated a debate in the literature in addition to many papers. The evolution map highlights the new challenges, according to the three considered journals and their Editors' decisions, that big data has generated in statistics, due both to high dimensionality and to large sample sizes. It should however be noted that the map in Fig. 6, as already said, includes only the evolution over the years generated by T525 and T523. It is an excerpt of the larger map, which is difficult to represent graphically in full and which contains the 'hot' topics for each year. This larger map certainly needs to be explored in future work.

Table 9 Most relevant keywords for topics appearing in Fig. 6

Conclusions

Working on a selection of publications from the international statistical literature, we have applied the topic model approach and post-processed the results with the aim of describing in depth the corresponding segment of the field over the years. The aim of our analysis was to verify the existence of predominant topics, explained by different descriptors, and to determine whether these topics generate distinct patterns of citations. Our results seem to confirm these expectations. Accordingly, the common evaluation approaches, based on normalization with respect to a field, lose significance; a normalization with respect to the topic would seem more appropriate. Our approach also raises a critical issue: with highly heterogeneous data, the identified topics exhibit a problem of robustness, and other methods for clustering textual data exist in the literature. Moreover, our contribution does not claim to be exhaustive; it presents a case study that provides material for debate. Taking into account our recommendations, comparisons between the topic model approach and other clustering methods would be possible, with the purpose of normalizing the bibliometric data with respect to the topics.