Introduction

In recent years, the evaluation of academic research productivity in different fields of knowledge has been related to the impact of the results of scientific production (Abramo et al. 2008; Sabharwal 2013; Campanario 2015). The motivation for studying productivity lies in the wish to promote academic excellence and render the research from each country as competitive as possible on the global stage (Frandsen 2005; Kocher et al. 2006; Abramo and D’Angelo 2014).

The quality of a research study is determined by a great number of variables, from the personal characteristics of the researcher to national and international policies and trends (Genest 1997; Dundar and Lewis 1998; Williamson and Cable 2003; Seggie and Griffith 2009; Duch et al. 2012; Kaur et al. 2015). However, the criteria for evaluating research productivity are combined mainly in two ways. First, the peer review process is assumed as the principal evaluation method, but this in turn is the object of a certain subjectivity level (Abramo et al. 2008, Bornmann 2011; Bertocchi et al. 2015; Day 2015).

Alternatively, another way of evaluating scientific activity in terms of productivity is based on bibliometric analysis. This method consists mainly of quantifying the number of documents published by a country, institution, research group or individual, as well as the citations received by such documents (Broadus 1987; Borokhovich et al. 1995; Abramo et al. 2008; Heberger et al. 2010; Finardi 2013; Kaur et al. 2015; Bertocchi et al. 2015). The most common bibliometric measurements are those based on publications and citations, and this information comes from different databases such as Web of Science (WoS), Scopus, and Google Scholar, among others. However, the heterogeneity in publication and citation policies between the different fields of knowledge (Kaur et al. 2013; Ruiz-Castillo and Costas 2014; Mingers and Leydesdorff 2015) make the direct comparison in terms of the number of published articles and cites ‘unfair’ (Crespo et al. 2012) and raise the need for the search of more appropriate methods of comparison.

The majority of research productivity studies are focused on a single field of knowledge. For example, the literature focused on research productivity in economics is abundant (Hodgson and Rothman 1999; Coupé 2003; Kocher et al. 2006; Ellison 2013). As a result, and taking into account the existing scientific advancements in each field of knowledge, it becomes relevant to study research productivity not only from the standpoint of measuring scientific production results, but also for the purpose of analyzing differences between the fields of knowledge in question (Sabharwal 2013; Abramo and D’Angelo 2014; Ruiz-Castillo and Costas 2014; Bertocchi et al. 2015).

In addition, studies on research productivity have taken into account different probability distribution functions in order to identify patterns in quantitative relationships between authors and their contributions over a period of time. These studies have determined that bibliometric indicators such as the number of articles published or the number of citations received by an author are characterized by distributions with heavy tails (Lotka 1926; Price 1976; Redner 1998; Chung and Cox 1990; Albarrán et al. 2011; Eom and Fortunato 2011; Da Silva et al. 2012; Ruiz-Castillo and Costas 2014; Campanario 2015).

As a result, the probability distribution models that have been applied the most in the literature on research productivity are those that obey the following laws: Lotka’s law (Lotka 1926; Nicholls 1986; Chung and Cox 1990; Kretschmer and Kretschmer 2007), the power law (Price 1976; Egghe 2005; Albarrán et al. 2011; Aguinis et al. 2015) and Bradford’s Law (Garfield 1980; Rousseau 1994; Nicolaisen and Hjørland 2007; Campanario 2015). These laws, mainly based on distribution functions such as the exponential or Pareto distributions, have been controversial and have generated a strong debate during more than a century. For instance, Newman (2005) asserted that few real-world processes follow a power law over their entire range, and in particular not for smaller values of the variable being measured. Martínez-Mekler et al. (2009) argued that, when real data are used, power laws hold only for an intermediate range of values, whereas the tails of the distributions tend to deviate from the values expected according to the power law. Therefore, the authors suggested that the two-parameter law incorporates the product of two power laws defined over the complete data set: One of these power laws measured from left to right, and the other from right to left.

Other studies such as those by Kumar et al. (1998), Radicchi et al. (2008), Perc (2010), Eom and Fortunato (2011) and Birkmaier and Wohlrabe (2014) have proposed the application of the lognormal distribution to study research activity. Nevertheless, the evidence on the true distribution of scientific production and citation is still inconclusive (Albarrán et al. 2011), which might be a consequence of the use of only one- or two-parameter distributions.

In fact, all of the proposed distributions have the disadvantage that they depend on very few parameters to capture the entire shape of the productivity distribution, particularly the right tail of the distribution. This fact might result in more imprecise productivity measurements and unreliable comparisons of productivity between different fields of knowledge. To obtain reliable research productivity estimates, we propose the use of semi-nonparametric (SNP) approximations of productivity distributions based on the Edgeworth and Gram–Charlier expansions. These distributions have been applied in very diverse fields, where the precision of capturing the tails of distributions is important for the correct measurement of the frequency of extreme values (see Blinnikov and Moessner 1998, or Mauleón and Perote 2000, as examples of applications to astronomy or finance, respectively). In this article, we propose their use for the first time to measure research productivity and to determine with a higher degree of accuracy the quantiles that sort the most productive researchers in each field of knowledge as a proxy of the level of difficulty involved in being a top researcher in each field.

For the purpose of holding the parameter flexibility of Gram–Charlier distributions but restricting the domain to positive values, we propose logarithmic transformations of a SNP distribution (which we refer to as log-SNP), which are extensions of a lognormal distribution that allow for approximating any empirical distribution through the introduction of additional parameters. Given that bibliometric indicators usually exhibit relatively long tails and multimodality (Guerrero-Bote et al. 2007; Lancho-Barrantes et al. 2010; Sabharwal 2013), we show that, compared to the lognormal distribution, the log-SNP distribution provides a better fit when characterizing research productivity in top journals.

The productivity distribution

The characterization of a random variable through its probability density function (pdf) and its fit to the empirical distribution of a series can be achieved using different approaches, from a parametric perspective based on a frequency distribution with a known functional shape to a purely nonparametric approach. An intermediate possibility is the use of SNP approximations in which the functional shape is only partly parametrized, with the rest being an unknown function (Chen 2007). In this study, we consider an SNP approach in which the unknown function is modelled based on an orthogonal polynomial series expansion. In particular, we will analyze Edgeworth and Gram–Charlier expansions that have been shown to be valid asymptotic approximations of any empirical distribution under relatively weak regularity conditions (Sargan 1975; Phillips 1977). Next, we define the SNP distribution based on the Gram–Charlier series, as well as its logarithmic transformation, and analyze its basic properties.

The SNP distribution

Let {P s (x)}, x ∈ ℝ and \(s \in \mathbb{N}\) be a family of orthogonal polynomials with respect to a density function w(x) that satisfies the following relationshipFootnote 1

$$\int\limits_{ - \infty }^{\infty } {P_{s} \left( x \right)P_{j} \left( x \right)w\left( x \right)dx = 0,\quad \forall s \ne j, s,\;j = 0,1,2, \ldots }$$
(1)

Within this family, Hermite polynomials (HPs) are those that use a standard normal density distribution, with weight \(\phi \left( x \right) = \frac{1}{{\sqrt {2\pi } }}e^{{ - \frac{1}{2}x^{2} }}\). In particular, the HP of order s, H s (x), can be obtained in terms of the derivative of order s of the density function of the standard normal distribution, as expressed in Eq. (2):

$$H_{s} \left( x \right) = \frac{{\left( { - 1} \right)^{s} }}{\phi \left( x \right)}\frac{{d^{s} \phi \left( x \right)}}{{dx^{s} }}$$
(2)

Next, we show the first eight HPs:

$$H_{0} \left( x \right) = 1$$
(3)
$$H_{1} \left( x \right) = x$$
(4)
$$H_{2} \left( x \right) = x^{2} - 1$$
(5)
$$H_{3} \left( x \right) = x^{3} - 3x$$
(6)
$$H_{4} \left( x \right) = x^{4} - 6x^{2} + 3$$
(7)
$$H_{5} \left( x \right) = x^{5} - 10x^{3} + 15x$$
(8)
$$H_{6} \left( x \right) = x^{6} - 15x^{4} + 45x^{2} - 15$$
(9)
$$H_{7} \left( x \right) = x^{7} - 21x^{5} + 105x^{3} - 105x$$
(10)
$$H_{8} \left( x \right) = x^{8} - 28x^{6} + 210x^{4} - 420x^{2} + 105$$
(11)

It is easy to proof that these polynomials satisfy the mentioned orthogonality property given that \(\forall s,\quad j = 0, 1, 2, \ldots\)

$$\int\limits_{ - \infty }^{\infty } {H_{s} \left( x \right)H_{j} \left( x \right)\phi \left( x \right)dx = \left\{ {\begin{array}{*{20}l} {0, \quad s \ne j} \hfill \\ {s!, \quad s = j} \hfill \\ \end{array} } \right.}$$
(12)

The HPs also constitute the basis of the Edgeworth and Gram–Charlier (Type A) series, which allow, under certain regularity conditions (Cramér 1925), the expression of any pdf, f(x), in terms of an infinite series (Wallace, 1958) as followsFootnote 2

$$f\left( x \right) = \sum\limits_{s = 0}^{\infty } {\delta_{s} H_{s} \left( x \right)\phi \left( x \right)} ,\;{\text{where}}\quad \delta_{s} = \frac{1}{s!}\int\limits_{ - \infty }^{\infty } {H_{s} \left( x \right)f\left( x \right)dx}$$
(13)

Moreover, thanks to the orthogonality of the HPs, truncating the series to a specific order n of the expansion allows for defining a family of SNP distributions, g(x; d), where \(\varvec{d} = \left( {d_{1} , \ldots , d_{n} } \right)^{{\prime }} \in {\mathbb{R}}^{n}\) denotes the vector of the parameters.Footnote 3

$$g\left( {x;\varvec{d}} \right) = \left[ {1 + \mathop \sum \limits_{s = 1}^{n} d_{s} H_{s} \left( x \right)} \right]\phi \left( x \right){\mathop{\underset{n \to \infty }\longrightarrow}} f\left( x \right)$$
(14)

However, the SNP distribution defined in Eq. (14) is only a density function for a subset of values of d that guarantee g(x; d) ≥ 0. To solve this problem, different types of restrictions or positivity transformations have been proposed (Gallant and Nychka 1987), even though they involve the introduction of unnecessary complexity for empirical applications that implement maximum likelihood (ML) algorithms (given that in the optimum ML leads to estimations that guarantee positivity).

The advantages of SNP distributions when fitting frequency functions lies in their flexible parametric structure that permits to adjust location and scale with different parameters than those used for skewness, leptokurtosis and even higher order moments. Figure 1 illustrates the allowable shape of the SNP (depicted with 1000 simulated observations) compared with a normal distribution. For the sake of comparison, in both cases we consider the same location and scale parameters, μ = 0 and σ = 1, but we introduce additional (even) parameters in the SNP. Particularly, Panels (a1) and (a2) incorporate d 2 = 0.1 and d 4 = 0.1 and Panels (b1) and (b2) d 2 = 0.1, d 4 = 0.01, d 6 = 0.001 and d 8 = 0.005. Note also that Panels (a1) and (b1) represent the whole domain but Panels (a2) and (b2) just a detail of the right distribution tails. It is clear from these pictures that the SNP not only captures leptokurtosis but also presents wavy and heavy tails that may adapt the probability pattern of any data generating process.

Fig. 1
figure 1

Pdf of normal versus SNP distribution. Figures compare the shape of both Normal (dashed line) and SNP (solid line) distributions with location and scale parameters, µ = 0 and σ = 1, and additional parameters for the latter. Particularly, Panels (a1) and (a2) incorporate parameters d 2 = 0.1 and d 4 = 0.1 and Panels (b1) and (b2) consider d 2 = 0.1, d 4 = 0.1, d 6 = 0.001 and d 8 = 0.005. Panels (a1) and (b1) represent the whole domain, whereas Panels (a2) and (b2) a detail of the right tails of the distributions. Data are simulated through 1000 replications

In addition, the resulting higher number of parameters does not involve more complexity in theoretical or empirical terms. For example, the central moments can be easily obtained as linear functions of the distribution parameters (see “Appendix 1” section). Note that the even (odd) moment of order n depends only on the n first even (odd) parameters. This fact allows for the search of initial values for the optimization logarithms through the direct application of the method of moments (MM). A closed expression can also be obtained for the cumulative distribution function (cdf) of the SNP distribution as a function of the normal distribution cdf, as shown in Eq. (15) (see the proof in “Appendix 2” section). This allows for a simple calculation of the probabilities and quantiles of the SNP distribution.

$$\begin{aligned} G_{x} \left( a \right) = & \int\limits_{ - \infty }^{a} {g\left( {x;\varvec{d}} \right)dx} \\ = & \int\limits_{ - \infty }^{a} {\phi \left( x \right)dx - \phi \left( a \right)\sum\limits_{s = 1}^{n} {d_{s} H_{s - 1} \left( a \right)} } \\ \end{aligned}$$
(15)

The log-SNP distribution

Ñíguez et al. (2012) define a variable z > 0 as (standard) log-SNP if the variable x = log(z) is SNP distributed and its pdf defined as in Eq. (14). The resulting distribution inherits all the good properties of the SNP distribution, including its flexibility in capturing the extreme values of the distribution, but the density is defined on ℝ+, which is required to fit productivity data. We will go a step further and similarly define a log-SNP distribution, but rather over a linear transformation y = σx + μ.

Definition

We will say that the variable z > 0 is log-SNP distributed with location parameter μ ∈ ℝ, scale σ 2 ∈ ℝ and shape parameters \(\varvec{d} = \left( {d_{1} , \ldots , d_{n} } \right)^{{\prime }} \in {\mathbb{R}}^{n}\) if its pdf can be expressed as

$$h\left( {z; \mu ,\sigma^{2} ,\varvec{d}} \right) = \left[ {1 + \sum\limits_{s = 1}^{n} {d_{s} H_{s} \left( {\frac{\log \left( z \right) - \mu }{\sigma }} \right)} } \right]\left( {\frac{1}{{z\sigma \sqrt {2\pi } }}e^{{\frac{{ - \left( {\log \left( z \right) - \mu } \right)^{2} }}{{2\sigma^{2} }}}} } \right)$$
(16)

Defined in this manner, the lognormal distribution is a particular case of the log-SNP (for d s  = 0, ∀s), which allows for a comparison of the improvements in the fit of the latter to those obtained with the lognormal by using linear restrictions tests such as the likelihood ratio (LR). This article shows that, as a matter of fact, the parametric flexibility of the log-SNP allows for significant fit improvements to productivity distributions, as the log-SNP is capable of representing different shapes (including jumps in the probability mass function and heavy tails) through the incorporation of parameters in addition to those of a standard lognormal distribution. These parameters are directly related to the distribution momentsFootnote 4 and constitute additional degrees of freedom for the estimation procedures. For example, if only d s parameters are considered for s even skewness depends only on parameter σ, and the larger the expansion the heavier (and possibly wavier) the distribution tail is.

Figure 2 presents an illustration (1000 simulated replications) of the log-SNP allowable shape in comparison with that of the lognormal, both with the same location and scale parameters, i.e. μ = 0 and σ = 1. Panels (a1) and (a2) depict a log-SNP with additional parameters d 2 = 0.12 and d 4 = 0.11 and Panels (b1) and (b2) incorporate parameters d 2 = 0.28, d 4 = 0.44, d 6 = 0.07 and d 8 = 0.009. In order to emphasize the behavior for the extreme (positive) values Panels (a2) and (b2) display a zoom on the right tails of the distribution. For this case, it is clear that the log-SNP allows more flexibility to capture thick (and wavy) tails. Even more important, biased estimations and misleading results may be obtained when using a single parameter distribution to fit distribution shape and heavy tails.

Fig. 2
figure 2

Pdf of lognormal versus log-SNP distribution. Figures compare the shape of both Normal (dashed line) and SNP (solid line) distributions with location and scale parameters, µ = 0 and σ = 1, and additional parameters for the latter. Particularly, Panels (a1) and (a2) incorporate parameters d 2 = 0.12 and d 4 = 0.11 and Panels (b1) and (b2) consider d 2 = 0.28, d 4 = 0.44, d 6 = 0.07 and d 8 = 0.009. Panels (a1) and (b1) represent the whole domain, whereas Panels (a2) and (b2) a detail of the right tails of the distributions. Data are simulated through 1000 replications

Data and methodology

Data

To test whether a lognormal or a log-SNP distribution fits the best to the performance distribution of 140,971 researchers who have produced 253,634 publications in 18 fields of knowledge, we used the data from O’Boyle and Aguinis (2012). These authors classified the fields of knowledge based on the Journal Citation Reports (JCR), which provide impact factors (IFs) in different fields of knowledge labeled within the categories of “sciences” and “social sciences”. As it is well-known, there are multiple subfields included within one JCR category, but they identified authors across all subfields so that authors publishing in more than one area would have all their publications included.

The authors used impact factors from JCR in 2007 to identify the top five journals within each field.Footnote 5 They selected field-specific journals to avoid having the search contaminated by authors from other sciences. Additionally, the authors used the “Publish or Perish” program (Harzing 2008), which relies on Google Scholar, to identify all authors who had published at least one article in one of these journals between January 2000 and June 2009.Footnote 6 Their procedure is not absent of the common problems in bibliometric analysis produced by the authors’ names ambiguity, e.g. the treatment of synonyms, homonyms, misspellings, and processing errors. Further techniques might have been used to refine the database (Momeni and Mayr 2016; Van den Besselaar and Sandström 2016) although they are not substantial in our methodological framework. Furthermore, focusing on just five top journals and the combined use of Web of Science and Google Scholar databases may help to mitigate such problems (Yang and Meho 2006; Harzing and Van der Wal 2008; Harzing 2014; Harzing and Alakangas 2016). With this information, the productivity of the researchers is measured as the number of articles published by an author in each of the fields of knowledge during the observation period of 9.5 years.

A limitation of using this measure of productivity is the multiple authorship of manuscripts, since multiple authorship may bias the results in favor some authors and, comparatively, in areas where a high number of authors per paper is commonly accepted by the research community (Ruiz-Castillo and Costas 2014; Mingers and Leydesdorff 2015). The literature on scientific production, however, does not discriminate by multiple authorship. This procedure is known as “complete count”, i.e. the fact that an article is equally valuable for all its authors regardless the number of authors and their marginal contribution to the manuscript (Nicholls 1989; Ruiz-Castillo and Costas 2014). On the other hand, another important issue when analysing research productivity is the relation between quantity and quality (Kaur et al. 2015). In our study, as well as in O’Boyle and Aguinis (2012), quality of top researchers is imposed by the fact that we only consider publications in the five main journals in every field of knowledge. Then comparisons are only done in terms of quantity of manuscripts for a given probability in the “true” distribution (i.e. a quantile), which proxies a level of difficulty taking into account the different practices between the areas.

Table 1 shows the descriptive statistics for the publications of the top researchers included in our sample. In this table we also record the Median Impact Factor (MIF) of the top five journals in each of the selected fields of knowledge, based on the JCR of the year 2007Footnote 7 for each of the analyzed categories classified in sciences and social sciences. We provide this citation index to obtain a broader view of each of the selected fields and, particularly, its correlation with scientific production.

Table 1 Descriptive statistics for scientific production

It can be observed that throughout the 18 fields of knowledge analyzed, the minimum number of researchers is 1073 for the field of Ethics and the maximum is 30,531 for Dermatology. For each field, we compute two Mean values on scientific production: Mean (1) is the mean of publications for each field and for the entire sample of researchers; Mean (2) is the mean of publications for each field but only for the researchers with a number of articles above Mean (1). Mean (1) varies from 1.42 to 2.26 and Mean (2) from 2.56 to 5.92. Furthermore, we compute the percentage of authors with less/more publications than (or equal) to Mean (1)/Mean (2). We find that, on average, 71.38 % of all researchers have productivity below Mean (1), whilst researchers with productivity above Mean (2) represent the 7.76 %. These results support Ruiz-Castillo and Costas’ (2014) findings about the skewness of field productivity distributions, since a large proportion of researchers have below mean productivity and only a small percentage of them account for most of the publications.

Regarding the other statistics in Table 1, the standard deviation of publications has a range of 0.97–3.38 publications and skewness and excess kurtosis reveals that the productivity distributions exhibit positive skewness and leptokurtosis, with the field of Genetics being the most skewed and leptokurtic of the sample. The maximum number of articles per researcher varies from 13 (Law) to 120 (Genetics), depending on the field considered. In addition, we find large differences when considering the MIF indicator (of the top five journals in each area), which varies from 0.85 (History) to 18.30 (Genetics). Furthermore, it is clear that the MIF is positively correlated to the maximum number of articles per researcher. As a result, Genetics has the highest MIF and the maximum number of publications per researcher, while the MIF of History places 18th and 17th in number of publications per researcher. In general, fields that belong to the Sciences JCR category have a larger number of researchers and, then, a larger MIF.

All in all, the results confirm the existence of wide differences in scientific production in terms of number of articles per author between the different fields of knowledge, which is consistent with other studies, e.g. Abramo and D’Angelo (2014) or Mingers and Leydesdorff 2015). Next we propose a new methodology based on the log-SNP to study how these differences affect the productivity distribution, especially when measuring the productivity of the top researchers.

Methodology

This section presents the methodology applied to characterize the research productivity in each field of knowledge based on the log-SNP distribution. Details are provided on the ML estimation methodology and its related goodness of fit measures used to choose between the different pdfs nested on the family of log-SNP distributions (including the lognormal). The pdf of the log-SNP distribution is sequentially estimated up to a truncating order of n = 8.

Let z i be the number of articles published by an author in one of the selected fields of knowledge; the log-likelihood functionFootnote 8 for a log-SNP(μ, σ 2, d) distributed observation truncated to the eighth moment is given by:

$$\log L\left( {\left. {\mu ,\sigma^{2} ,\varvec{d}} \right|z_{i} } \right) = - \frac{1}{2}\log \left( {2\pi \sigma^{2} z_{i}^{2} } \right) - \frac{1}{2}\left( {\frac{{\log \left( {z_{i} } \right) - \mu }}{\sigma }} \right)^{2} + \log \left[ {1 + \sum\limits_{s = 1}^{8} {d_{s} H_{s} \left( {\frac{{\log \left( {z_{i} } \right) - \mu }}{\sigma }} \right)} } \right]$$
(17)

The sequential estimation begins with the simplest nested density, the lognormal, and the d s parameters are recursively added, the initial values of which are selected consistently with their sample moments counterparts. The inclusion of new parameters in the productivity distribution is performed according to accuracy criteria, i.e. the log-likelihood (logL) and the Akaike Information Criterion (AIC), and linear restrictions tests provided by the LR statistic. Based on these criteria, n = 8 was selected as the optimum truncating order, and only the even parameters, d 2, d 4, d 6 and d 8, were selected.

Results

Table 2 presents the ML estimates of the parameters of the performance distributions for each of the fields selected. Panel A shows the estimated parameters for a lognormal distribution, and Panel B shows the estimated parameters for the log-SNP distribution. Panel C displays the LR statistic for comparing the log-SNP and the lognormal distributions.

Table 2 Estimates for the productivity distribution under lognormal and log-SNP

The results of the estimations reveal that all the models adequately capture the mean and standard deviation of each of the fields, denoted as parameters μ and σ, respectively. The P values clearly indicate that these parameters are highly significant for both distributions. It is noteworthy that the parameter σ, which also capture skewness of the lognormal and the log-SNP provided that odd parameters are not included, remains very stable for all productivity distributions. This evidence is consistent with Ruiz-Castillo and Costas (2014) who found that “in spite of wide differences in production and citation practices across fields, the shape of field productivity distributions is very similar across fields”. However, as shown in Panel B, for the log-SNP distribution, the d s parameters are also highly significant for the majority of fields of knowledge. When analyzing the AIC (which penalizes log-likelihood value with the inclusion of additional parameters) for the two distributions, we found that this criterion is consistently lower for the log-SNP distribution, which suggests that the modeling based on this distribution is clearly superior. In addition, from the LR statistics included in Panel C, we conclude that for all the selected fields, incorporating the d s parameters improves the accuracy of the model.Footnote 9

An example of the fit quality obtained for two (randomly selected) fields, Finance and Dentistry, is captured in Fig. 3. This figure depicts the empirical histogram and pdf values estimated under a lognormal specification and under the log-SNP. In both cases, the log-SNP distributions more adequately capture not just the values around the mean but also the extreme values. Figure 4 shows in detail the right tails of the distribution, which capture the frequency of the researchers with higher productivity. From these figures, it is clear that the log-SNP specification allows the better characterization of the research activity.

Fig. 3
figure 3

Pdf of research productivity in finance and dentistry. The figure shows the distribution of the empirical frequencies (histogram) of the productivity of the researchers who published in the five top journals (in JCR-2007 terms) in finance and dentistry during the period 2000–2009. The estimated pdfs under the lognormal and log-SNP specifications are depicted in dashed line and solid line, respectively

Fig. 4
figure 4

Pdf of research productivity in finance and dentistry (right tail). The figure shows the right tail of the distribution of empirical frequencies (histogram) of productivity of the researchers who published in the five top journals (JCR-2007 ranking) in finance and dentistry during the year 2000–2009. Fitted lognormal and log-SNP pdfs are depicted in dashed line and solid line, respectively

Figure 5 shows the comparison between the fitted densities for Finance and Dentistry in terms of the empirical and theoretical cdfs for both specifications, the log-SNP and the lognormal. The latter appears to underestimate the cumulative probability (especially for Dentistry) when compared to the log-SNP.

Fig. 5
figure 5

Cdf of research productivity in finance and dentistry. The figure shows the empirical cumulative distribution function of the productivity of the researchers who published in the five top journals (JCR-2007 ranking) in finance and dentistry during the period 2000–2009. Fitted lognormal and log-SNP cdfs are depicted in dashed line and solid line, respectively

Figure 4 illustrates how the lognormal distribution underestimates research productivity, especially for the more extreme values (under the lognormal distribution, a researcher must publish less articles to be included in the top quantiles of the performance distribution). Table 3 quantifies these effects for the different fields of knowledge by computing the empirical and estimated quantiles under the lognormal and log-SNP for confidence levels of 5, 1, 0.1 and 0.05 %.Footnote 10 Note that, once the productivity distributions are properly estimated, the definition of a top researcher in every field requires the computation of the corresponding quantile for a given probability. These quantiles represent bounds of performance in terms of number of articles (regardless the number of authors), provided that quality is guaranteed by considering only publications on the top 5 reviews in every field. Furthermore, these quantiles are fairly comparable among different areas.

Table 3 Number of articles empirically observed versus those theoretically expected under lognormal and log-SNP

The values in the table clearly indicate the higher accuracy of the log-SNP distribution fits, particularly in the tails, and the underestimation of the productivity of top researchers obtained from the traditional parametric distributions such as the lognormal. For example, for the field of Agronomy, it can be seen that to belong to the top 0.05 % of researchers who publish the highest number of articles in the best journals, 15 publications are empirically required. This limit is much less strict if we assume that the distribution is lognormal (6 publications) as compared to log-SNP (12 publications). These results are consistent with the research by Kumar et al. (1998), Perc (2010) and Eom and Fortunato (2011), who found that the use of the lognormal distribution for modeling bibliometric indicators underestimates the heavy tails of the distributions.

Further results

This article proposes a new methodology to compute research productivity for top researchers through the quantiles of a new and general distribution called log-SNP. Our main application compares these measures with those of the lognormal with a sample of scientific production in 18 (arbitrarily chosen) fields, finding the outperformance of the log-SNP. Nevertheless, the focus of the paper is done on the technique more than on the particular results. In order to justify that our result is general we replicated the study with the productivity data provided in Borokhovich et al. (1995), which refer to academic institutions (in the field of finance) instead of individual researchers. Particularly, the data accounts for the number of articles published from 1989 through 1993 in a set of 16 finance journals by authors affiliated to different institutions at the time of publication. The journals in finance (excluding real estate and insurance) were selected from those listed in Heck’s Finance Literature Index for 1993. Only articles and notes were included in sample. The number of publications attributed to each academic institution was adjusted for the number of authors. For example, for publications with two authors affiliated to different institutions every institution received a credit for 0.5 article. Any proportion of an article that was not attributable to an author affiliated with an academic institution located in the United States or Canada was deleted from the study. A total of 330 institutions were included in this sample.

Table 4 reports the results of this new estimation. Panel A displays the ML estimates for the research productivity on top finance journals for academic institutions. The results are consistent with those previously obtained for researchers in different fields of knowledge, i.e. the log-SNP outperforms the lognormal according to the LR test and thus the log-SNP parameters are highly significant. Panel B compares the number of articles empirically observed with those theoretically expected under both specifications revealing the outperformance of log-SNP. In this case it seems that the lognormal overestimates the distribution tails, particularly for low confidence levels. This result corroborates the evidence that the use of rigid distributions involve misleading results because are unable to fit different characteristics of the distribution (particularly extreme values) with a single (or two) parameter(s).

Table 4 Institutional research productivity under lognormal and log-SNP

Figure 6 illustrates the assessment above showing the best fit of the lof-SNP in terms of the right tails of the pdf (Fig. 6a) and the cdf (Fig. 6b) compared with the empirical distributions.

Fig. 6
figure 6

Pdf and cdf of institutional research productivity. The figure shows: a the right tail of the distribution of empirical frequencies (histogram) of research productivity for academic institutions that have published a set of 16 finance journals between the years 1989 and 1993. The fitted lognormal and log-SNP pdfs are depicted in dashed line and solid line, respectively. b The figure shows the empirical cumulative distribution function of the same sample. The fitted lognormal and log-SNP cdfs are depicted in dashed line and solid line, respectively

Discussion and conclusions

Bibliometric analysis has been shown to be a valuable method for evaluating scientific production and has experienced a growing impact in the academia. However, the literature indicates that in most cases, the distributions commonly used for measuring productivity have been shown to underestimate the behavior of the top researchers, given that their productivity seems to be generated by a distribution with very heavy tails. This fact calls for the search of more appropriate distributions and methodologies.

This study analyzes the research productivity in 18 fields of knowledge belonging to the JCR categories of sciences and social sciences between the years 2000 and 2009. The results show that the level of productivity, as measured by the number of publications per author, depends on the field of knowledge being studied, which is consistent with previous evidence. In particular, the fields that belong to the category of sciences have a higher number of publications per author. In addition, we observe that the MIF indicator is highly correlated to the maximum number of articles per researcher; that is, the greater the number of articles published in top journals by each researcher (usually the most cited), the greater the MIF by field of knowledge.

This study proposes a novel methodology based on the computation of the quantiles of a flexible log-SNP distribution for measuring the scientific productivity distribution of top researchers in different fields of knowledge. Such a distribution nests the lognormal and includes new parameters for accurately capturing the heavy tail of the research productivity distribution. Our study shows, for both researchers and institutions productivity, that the log-SNP provides a better fit of research productivity distribution than the lognormal and quantifies the differences in the measures of the top researchers’ productivity attached to the distributional hypothesis. We argue that the log-SNP is an accurate data generating process for the top researchers’ productivity, and thus this process is more reliable than that of the lognormal (that is nested in our model) since the log-SNP is more flexible when data are highly skewed and there are possible jumps in the tail due to extreme observations.

Therefore we provide an interesting methodology to measure scientific productivity that can be used when the performance of authors, institutions or fields have to be compared or aggregated so as to implement policies based on academic performance.