Introduction

In the past few decades, the growing popularity of bibliometric indices has led to a thorough study of the citation process. The type of document (Annalingam et al. 2014; Bornmann 2013; Bornmann and Williams 2013; Ingwersen and Larsen 2014), its subject (Antoniou et al. 2015; Bornmann et al. 2012; Dorta-González et al. 2014; van Eck et al. 2013), its publishing venue (Didegah and Thelwall 2013; Falagas et al. 2013; Garner et al. 2014; Jiang et al. 2013; Van Der Pol et al. 2015), authors (Biscaro and Giupponi 2014; Collet et al. 2014; Farshad et al. 2013; Pagel and Hudetz 2011) and other characteristics (Tahamtan et al. 2016) all influence its citation impact in statistically predictable ways. However, bibliometric indices derived from academic publications can only measure the impact of research within research itself. Alternative metrics (also known as altmetrics) are regarded as an attractive possibility because they may enable impact measurement that is not only fast but also broad (Priem and Hemminger 2010; Priem et al. 2012). Altmetrics “focuses on the creation, evaluation and use of scholarly metrics derived from the social web” (Haustein et al. 2014). The extent to which altmetrics actually permit a broad measurement of research impact is currently a subject of scientometric research.

Bibliometricians see some value in altmetrics, especially download metrics (Haustein et al. 2013), and there is already evidence that a range of altmetrics are associated with traditional citation counts, with Mendeley (Haustein et al. 2013; Li et al. 2012; Zahedi et al. 2013) and Twitter (Eysenbach 2011; Thelwall et al. 2013) seeming to be the most promising sources. Moreover, the weak positive correlations between social media mentions and future citations (Peoples et al. 2016; Ringelhan et al. 2015) suggest that online activity may anticipate or drive the traditional measure of scholarly ‘impact’. Online activity also promotes engagement with academic research, scholarly or otherwise, increasing article views and PDF downloads of PLoS ONE articles (de Winter 2015; Wang et al. 2014). Thus, altmetrics, and the online activity they represent, have the potential to complement, pre-empt and boost future citation rates, and are increasingly used by institutions and funders to measure the attention garnered by the research they support (Ravenscroft et al. 2017).

The use of altmetrics in information retrieval and research evaluation raises two questions: are the data that Altmetric collects a leading indicator of later success, and do social media mentions predict or correlate with subsequent citation rates for a given article? The lack of a clear answer motivates this paper’s contribution to the discussion. We hope that combining traditional bibliometric indices with alternative metrics will provide more complete article profiles, as the combination captures more dimensions of scientific practice.

A machine learning framework was established to test the usefulness of this combination in predicting the later success of articles. Because highly-cited status is widely accepted as an indicator of success or higher citation impact, this work examines whether the combination of traditional bibliometric indices and alternative metrics is beneficial for predicting future highly-cited papers, and which of the two kinds of indicators is better for this prediction. Figure 1 shows a sketch of the prediction framework.

Fig. 1 The sketch for the framework on predicting future highly-cited papers

The traditional bibliometric indices and the alternative metrics were combined to form the feature space for the prediction task. To overcome the “dimensionality curse” (Korn et al. 2001; Pagel et al. 2000) that may exist in the feature space, a feature selection process was performed to choose a subset of features that maintains the essential characteristics of the data set. Because each feature selection technique may be biased toward some features owing to its underlying mechanism, three different techniques, Relief-F, principal component analysis (PCA) and the entropy weighted method (EWM), were used to rank the features according to the weights each technique calculates. To discover how many features are significant for characterizing the original data set, the fractal dimension of the data set was calculated. Combining the fractal dimension with the feature-ranking results, a kernel feature subset was identified for each selection technique. Finally, three classification methods, Naïve Bayes, KNN and random forest, were used to test the robustness of the three feature subsets. If reasonable classification performance could be obtained, the typical indicators for predicting future highly-cited papers could then be identified.

Related work

Given the important role of citations in measuring the quality of research and researchers, it is reasonable to investigate why some papers achieve more citations than others. Various studies have explored the factors influencing citations, and some have attempted to estimate and predict future citations.

Tahamtan et al. (2016) made a comprehensive review of the factors predicting the frequency of citations. They identified 198 relevant papers and concluded that three categories of factors, ‘paper’ related factors, ‘journal’ related factors and ‘author’ related factors, are related to the number of citations. Fourteen ‘paper’ related factors were widely discussed with respect to their influence on a paper’s future citation counts, e.g. the quality of the paper (Buela-Casal and Zych 2010; Patterson and Harris 2009; Stremersch et al. 2007), characteristics of fields/subfields of a discipline and study subjects/topics (Glänzel and Schubert 2003; Glänzel et al. 2014; Dorta-González et al. 2014; Gonzalez-Alcaide et al. 2016; Wang et al. 2015a), the characteristics of references (Antoniou et al. 2015; Biscaro and Giupponi 2014; Chen 2012; Didegah and Thelwall 2013; Onodera and Yoshikane 2015; Yu and Yu 2014), the length of the paper (Falagas et al. 2013; Stremersch et al. 2015; van Wesel et al. 2014), early citation and the speed of citation (Garner et al. 2014; Glänzel et al. 2012; Hilmer and Lusk 2009a, b), and the accessibility and visibility of papers (Ebrahim et al. 2014; Rees et al. 2012; Yue and Wilson 2004). Four ‘journal’ related factors, including the journal impact factor (Haslam and Koval 2010; Jiang et al. 2013; Royle et al. 2013; Van Der Pol et al. 2015), the language of the journal (Borsuk et al. 2009; Leimu and Koricheva 2005; Lira et al. 2013), the scope of the journal (Bjarnason and Sigfusdottir 2002; Huang et al. 2012; Vanclay 2013), and the form of publication (Ingwersen et al. 2014; Ibáñez et al. 2013; Ke et al. 2014; Sangwal 2012), were investigated to verify their performance in predicting the number of citations. Ten ‘author’ related factors were also examined to test whether they are related to the frequency of citations, such as the number of authors (Amara et al. 2015; Glänzel and Thijs 2004; Puuska et al. 2014; Sin 2011; Vieira and Gomes 2010), the author’s reputation and previous citations (Bornmann et al. 2012; Frandsen and Nicolaisen 2013; Hurley et al. 2013), the international and national collaboration of authors (Chi and Glänzel 2018; Collet et al. 2014; Glänzel and Heeffer 2014; Nomaler et al. 2013; Onyancha and Maluleka 2011; Wang et al. 2015b), the authors’ country (Lee et al. 2010; Miettunen and Nieminen 2003; Padial et al. 2010; Willis et al. 2011), and the author’s productivity (Bosquet and Combes 2013; Stremersch et al. 2015).

However, some other factors related to the future frequency of citations are not classified under the above three categories, for example, factors representing the knowledge diffusion activities of articles in the scientific environment. Our previous studies showed that knowledge diffusion activities, represented as the citation distribution of articles in the scientific environment in the early stage after publication, can be good predictors of an article’s future citation frequency (Wang et al. 2012a, b). Such a citation distribution reflects the scope of the article’s knowledge diffusion, indicating to some extent the range of its visibility and its contribution to scientific entities. This visibility and contribution lay an important foundation for the article’s future citation frequency. Thus, in the present study, the citation distribution properties of articles in the scientific environment were incorporated in the feature space to test their capacity for predicting the future success of articles. Furthermore, although a large number of bibliometric factors have been identified as valuable indicators for predicting articles’ future citation frequencies, they relate only to the scientific impact of articles and do not reflect the influence of research beyond academia.

As scholarly communication migrated to the Web, so did citations. Altmetrics, short for alternative metrics, has been considered an interesting option for describing Web-based metrics that measure the societal impact of research (Priem and Hemminger 2010; Piwowar and Priem 2013). Most comments on the benefits of altmetrics relate to their potential for measuring the broader impact of research, that is, impact beyond science (Priem et al. 2012; Bornmann 2014). It is hoped that altmetrics can deliver more transparent descriptions of the interest, usage and reach of scholarly products (Fausto et al. 2012; Taylor 2013) and also more diverse and nuanced forms of impact analysis than traditional bibliometric metrics (Waltman and Costas 2014).

There is evidence that higher altmetric scores are associated with higher citation counts for many different indicators, for instance, for articles that were bookmarked on Mendeley (Eysenbach 2011; Herrmannova et al. 2018; Thelwall and Wilson 2016; Thelwall 2018), mentioned in Wikipedia (Didegah et al. 2018; Kousha and Thelwall 2017; Marashi et al. 2013), and tweeted on Twitter (Eysenbach 2011; Thelwall et al. 2013; Ortega 2016; Shu et al. 2018). Thelwall et al. (2013) studied associations between journal citations and different altmetric indicators. They found that six altmetric sources (tweets, Facebook wall posts, research highlights, blog mentions, mainstream media mentions and forum posts) showed significant associations between higher altmetric scores and higher numbers of citations, suggesting that multiple types of altmetrics may be valid and useful. Chi and Glänzel (2017) analyzed the relation between usage and citation impact and found that citations and usage counts in Web of Science correlate significantly, especially in the social sciences. Syamili and Rekha (2017) examined the correlation between altmetric scores (saves, views, Mendeley and Twitter) and citations and found that all the altmetric scores except Twitter correlate well with traditional bibliometric citations. Peoples et al. (2016) estimated the relative effects of Twitter activity on Web of Science citation rates and found that Twitter activity was a more important predictor of citation rates than the 5-year journal impact factor. They stated that altmetrics and traditional metrics are closely related, but not identical, and suggested that both altmetrics and traditional citation rates can be useful metrics of research impact.

These previous investigations attempted to compare various altmetrics with traditional citation counts. However, they almost always used contemporary rather than future citation counts, so it remains unknown whether early altmetric scores can predict later citations for articles. When predicting the future citation status of an article, are altmetrics a useful complement to traditional bibliometric indicators, or simply an alternative? And which of the two kinds of indicators provides better predictive performance?

The present study mainly aims to answer these questions. By collecting early web usage data together with the articles’ traditional bibliometric indicators, we explored the role of the two kinds of factors in predicting the future citation trend of articles, in the hope of giving a clearer understanding of their roles in the prediction.

Data set

Our analysis is based on a corpus of 617 scientific articles published in seven journals of the Public Library of Science (PLOS) between January 1, 2010 and January 31, 2010. In each journal, the collected articles were ranked according to their total citation counts and then divided into highly-cited papers (HCPs), medium-cited papers (MCPs) and low-cited papers (LCPs) by the following scheme:

  1. (1)

    HCPs: Articles whose accumulated citation ratio falls within the first 20% of the total citation count of all articles collected in the journal were regarded as highly-cited.

  2. (2)

    LCPs: Articles whose accumulated citation ratio falls within the last 20% of the total citation count of all articles collected in the journal were regarded as low-cited.

  3. (3)

    MCPs: Articles whose accumulated citation ratio lies between those of the HCPs and the LCPs in the journal were regarded as medium-cited.

Table 1 gives the distribution of HCPs, MCPs and LCPs in the seven PLOS journals. The purpose of dividing the articles into three categories is to define their different growth statuses, with the HCPs taken as the successful ones. Based on this division, we can examine which indicators are better at predicting the future success of articles.
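The three-way division can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_by_citation_share` is hypothetical, and the handling of the article that straddles a 20% boundary (decided here by the cumulative share accumulated before that article) is our reading of the scheme, not a rule stated in the text.

```python
def classify_by_citation_share(citations, top=0.20, bottom=0.20):
    """Label the articles of one journal as 'HCP', 'MCP' or 'LCP'.

    citations: total citation count of each article in the journal.
    Assumption: an article is classed by the cumulative citation share
    accumulated *before* it in the descending ranking.
    """
    total = sum(citations)
    if total == 0:
        # No citations at all: nothing can be called highly cited.
        return ["LCP"] * len(citations)

    # Rank article indices from most to least cited.
    order = sorted(range(len(citations)), key=lambda i: citations[i], reverse=True)

    labels = [None] * len(citations)
    cumulative = 0
    for i in order:
        share_before = cumulative / total
        if share_before < top:
            labels[i] = "HCP"      # still inside the first 20% of citations
        elif share_before >= 1.0 - bottom:
            labels[i] = "LCP"      # inside the last 20% of citations
        else:
            labels[i] = "MCP"
        cumulative += citations[i]
    return labels
```

Applied journal by journal, such a function would yield the class labels used as prediction targets; a different tie or boundary rule would shift a few articles between adjacent classes.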

Table 1 The distribution of HCPs, MCPs and LCPs in each journal in PLOS

Indices for predicting future HCPs

Bibliometric indices

In previous research, bibliometricians discussed the contribution of the characteristics of authors, journals and articles to HCP prediction. Therefore, the indices associated with these aspects were collected as members of the feature space. In addition, our previous work showed that the citation distribution of articles in the scientific environment in the early stage after publication also benefits their later success (Wang et al. 2012a, b). According to Glänzel (2008), the use of a 3-year citation window is “a good compromise between the fast reception of life science and technology literature and that of the slowly ageing theoretical and mathematical subjects”. Thus, the articles’ citation distribution data in the first 3-year citation window were collected for the feature space. Table 2 shows the bibliometric indices collected on the basis of these considerations.

Table 2 The bibliometric indices

Indices x1–x6 describe the characteristics of the authors, in which the h-index of the authors is used to represent their reputation. x7 is the Impact Factor of the journal publishing the article. x8 is the number of references. Indices x9–x11 show the basic properties of the article. Indices x12–x13 are the first-cited properties of the article. The initial citations that a paper receives are the early feedback of the scientific community about that paper. Van Dalen and Henkens (2005) stated that these two indices could help to determine the rate and the strength of the first citation. The index x12, the first-cited-age of the article, is calculated from the interval between the article’s first-cited year and its publication year. For example, if an article P published in 2000 received its first citations in 2002, the first-cited-age of P is calculated as 2002 − 2000 + 1 = 3. Indices x14–x20 give the citation distribution characteristics of the article in the scientific environment in the first 3 years after publication, in other words, the scope of its influence in the scientific environment in the early stage after publication.
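The first-cited-age arithmetic can be written as a one-line helper. The function name `first_cited_age` is hypothetical; the formula, including the + 1, follows the worked example in the text.

```python
def first_cited_age(publication_year, first_cited_year):
    """First-cited-age (index x12): both end years are counted, hence the + 1."""
    if first_cited_year < publication_year:
        raise ValueError("first citation cannot precede publication")
    return first_cited_year - publication_year + 1
```

With this convention an article first cited in its publication year has a first-cited-age of 1 rather than 0.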

All these indices can be obtained through Web of Science. The h-index of authors can be obtained by using “create citation report” in Web of Science. The indices associated with the first-cited properties and the citation distribution characteristics in the first 3-year citation window can be obtained by using “analyze the indexing results” followed by buttons such as “publication year” and “Web of Science categories”.

Alternative metrics

In this paper, all the alternative metrics were collected from PLOS. In 2009, PLOS inaugurated a program to provide Article-level Metrics (ALMs) on every article across all its journals. ALMs capture the manifold ways in which research is disseminated and can help users determine the value of an article to them and to their scientific community. PLOS ALMs split the alternative metrics into five categories: viewed, saved, discussed, recommended, and cited. Table 3 shows the detailed information about these five categories of alternative metrics in PLOS.

Table 3 The detailed information about the five categories of alternative metrics in PLOS

A more detailed explanation of these five categories can be found at http://www.lagotto.io/plos/#relativeMetrics. The classification mirrors the whole process of user engagement, from the first look at a paper to its citation in the (scholarly) literature, and thus relates to the various dimensions of research impact (Neylon and Wu 2009). Because our discussion of alternative metrics is limited to detecting whether metrics other than citation activities potentially benefit articles’ future citations, the category “cited” is not considered here.

Altmetrics focuses on the exchange and dissemination of articles in the social web environment and can therefore reflect the value of an article in a timely manner. However, because the Web environment updates quickly and emphasizes new information, altmetrics for a single article are not sustained for long. To determine a reasonable time for collecting these alternative metrics, we observed how the metrics changed for each of the 617 articles in PLOS ALMs. For each article, its publication date is marked as T0, the date when a metric first changes is marked as T1, and the date after which the metric no longer changes is marked as T2. By recording the changing values of the four categories of alternative metrics for each article (in this paper, changes in these indices were counted every month), two time intervals were calculated, as shown in Table 4.

Table 4 The changing status of the alternative metrics

The time span between the publication date and the first changing date, labeled ave.(T1 − T0) in Table 4, expresses how quickly a metric first changes. The time span between the first changing date and the date after which the metric no longer changes, labeled ave.(T2 − T1) in Table 4, represents the duration of each metric. The metrics clearly differ in their changing status. Articles tend to be viewed and discussed sooner than they are saved, as shown by the shorter ave.(T1 − T0) for “Viewed” and “Discussed”. “Viewed” has a longer ave.(T2 − T1) than “Saved” and “Discussed”, which indicates that articles continue to be viewed for a relatively long time. It should be mentioned that no status is given for “Recommended”, because almost none of the collected papers were recommended in F1000Prime, so any calculation on this metric would lack significance. Accordingly, only the three alternative metrics “Viewed”, “Saved” and “Discussed” remain in the feature space, labeled as indices x21–x23.
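As an illustration of how the two intervals could be derived from monthly records, the sketch below takes one metric’s month-by-month values for a single article and returns T1 − T0 and T2 − T1 in months. The function name `change_intervals` and the input layout (index 0 is the publication month T0) are assumptions made for this example; averaging the results over all articles would give the ave.(·) quantities.

```python
def change_intervals(monthly_values):
    """Return (T1 - T0, T2 - T1) in months for one metric of one article.

    monthly_values[0] is the value at the publication month (T0);
    each later entry is the value recorded one month after the previous one.
    Returns None when the metric never changes during the observed period.
    """
    # Months in which the recorded value differs from the previous month.
    change_months = [
        month
        for month in range(1, len(monthly_values))
        if monthly_values[month] != monthly_values[month - 1]
    ]
    if not change_months:
        return None
    t1 = change_months[0]    # first change
    t2 = change_months[-1]   # last observed change
    return t1, t2 - t1
```

Because the records are monthly, both intervals are only accurate to within one month, and T2 is the last change seen in the observation window rather than a guaranteed final change.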

Because this study records changes in the values of these indices on a monthly basis, it cannot locate the specific time point of a change exactly. Furthermore, the analysis of the sustainability of these indices is somewhat rough: it cannot rule out that, after a period of silence, some indices would attract new attention and change to a new value. However, this study only aims to determine a point in time after which the values of these indices no longer vary densely, so that the data before that point can be used for the forecasting task.

According to the duration analysis, all three metrics tend to be stable within the first 2 years after the publication of the articles. We therefore took February 1, 2012, just after the end of the first 2 years following the articles’ publication, as the date on which to collect the values of “Viewed”, “Saved” and “Discussed” for each article.

Feature selection process

Feature selection defines the least number of features that could be used to best represent the interrelationships amongst those features (Naraei and Sadeghian 2017). In this paper, three different feature selection techniques, Relief-F, principal component analysis (PCA) and the entropy weighted method (EWM), were used to calculate the significance of each feature in the initial data set. However, these techniques can only rank the features by their importance to the original data set; they cannot determine how many key features best represent the data set. Therefore, the fractal dimension of the data set was introduced to discover how many attributes are significant for characterizing the data set.

Fractal dimension of data set

The initial idea for calculating the fractal dimension of one data set aims at detecting the correlations between attributes in a data set, spotting the attributes that can be obtained by some function of others, and defining the reduced data set with only the relevant attributes (Berchtold et al. 1998; Pagel et al. 2000). These considerations lead to the definition of the embedding and intrinsic dimensions, as well as the correlation fractal dimension (Traina et al. 2000).

Definition 1

The embedding dimension E of a data set is the dimension of its address space. In other words, it is the number of attributes in the data set.

Definition 2

The intrinsic dimension D of a data set is the dimension of the spatial object represented by the data set, regardless of the space where it is embedded.

Note that if a data set has all of its features independent of the others, its intrinsic dimension is the embedding dimension (D = E). However, whenever there is a correlation between two or more features, the intrinsic dimensionality of the data set is reduced accordingly. Through the intrinsic dimension of a data set it is possible to decide how many attributes are actually required to characterize it.

Definition 3

Correlation Fractal dimension: Given a dataset presenting self-similarity in the range of scales [r1,r2], its Correlation Fractal dimension D2 for this range is measured as:

$$D_{2} = \frac{{\partial \log \mathop \sum \nolimits_{i} C_{r,i}^{2} }}{{\partial \log \left( r \right)}} = \frac{{\partial \log \left( {S\left( r \right)} \right)}}{{\partial \log \left( r \right)}}, \quad r \in \left[ {r_{1} ,r_{2} } \right]$$
(1)

As shown by Belussi and Faloutsos (1995), the correlation fractal dimension corresponds to the intrinsic dimension of the data set. Thus, in the present work we use D2 as the intrinsic dimension D, as Belussi and Faloutsos (1995) did. The intrinsic dimensionality gives a lower bound on the number of attributes needed to keep the essential characteristics of the data set. The detailed calculation of the correlation fractal dimension can be found in Belussi and Faloutsos (1995) and Traina et al. (2000).
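A minimal box-counting sketch of Eq. (1) is given below. It is not the procedure of the cited papers, only an illustration under assumptions: every attribute is taken to be rescaled into [0, 1), \(C_{r,i}\) is taken to be the number of points falling in the i-th grid cell of side r, and the slope is obtained by an ordinary least-squares fit over the supplied radii. The function names are hypothetical.

```python
import math
from collections import Counter


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


def correlation_fractal_dimension(points, radii):
    """Estimate D2 as the slope of log S(r) against log r.

    points: sequences of attribute values, assumed rescaled to [0, 1).
    radii:  grid cell sides r spanning the range [r1, r2] of interest.
    """
    log_r, log_s = [], []
    for r in radii:
        # Count the points in each occupied grid cell of side r.
        cells = Counter(tuple(math.floor(v / r) for v in p) for p in points)
        # S(r) is the sum of squared cell occupancies.
        s_r = sum(count * count for count in cells.values())
        log_r.append(math.log(r))
        log_s.append(math.log(s_r))
    return least_squares_slope(log_r, log_s)
```

For points lying on a straight line inside a two-attribute space the estimate is close to 1, while for points filling the unit square it is close to 2, which is the behaviour that lets D2 reveal redundant attributes.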

Feature selection technique of Relief-F

The basic idea of Relief-F is to draw instances at random, compute their nearest neighbors, and adjust a feature weighting vector to give larger weight to features that discriminate the instances from neighbors of different classes (Kononenko 1994). Specifically, for a randomly selected sample \(x_{i}\), it finds the k nearest neighbors \(\{h_{j}\}\) in the same class C as \(x_{i}\), as well as the k nearest neighbors \(\{m_{j}\}\) in each of the other classes S besides C. Relief-F then tries to find a good estimate of the weight for feature A through the following update:

$$\begin{aligned} W(A) & = W(A) - \sum\limits_{j = 1}^{k} {{\text{diff}}(A,x_{i} ,h_{j} )/({\text{mk}})} \\ & \quad + \sum\limits_{S \ne C} {\left[ {\frac{p(C)}{1 - p(S)}\sum\limits_{j = 1}^{k} {{\text{diff}}(A,x_{i} ,m_{j} )} } \right]/({\text{mk}})} \\ {\text{diff}}(A,x_{i} ,x_{j} ) & = \left\{ {\begin{array}{*{20}l} {\frac{{\left| {x_{i} [A] - x_{j} [A]} \right|}}{\max (A) - \min (A)}} \hfill & {A\;{\text{is}}\;{\text{continuous}}} \hfill \\ 0 \hfill & {A\;{\text{is}}\;{\text{discrete}}\;{\text{and}}\;x_{i} [A] = x_{j} [A]} \hfill \\ 1 \hfill & {A\;{\text{is}}\;{\text{discrete}}\;{\text{and}}\;x_{i} [A] \ne x_{j} [A]} \hfill \\ \end{array} } \right. \\ \end{aligned}$$
(2)

where \(p\left( C \right)\) is the probability of class C, \(p\left( S \right)\) is the probability of another class S besides C, and m is the number of randomly drawn instances.

By calculating the weight for every feature in the original data set by Eq. (2), we can get the features ranked by their weights.
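The update can be sketched for continuous features as follows. Several choices here are assumptions made for illustration rather than details given in the text: every instance is visited once instead of being drawn at random (so m equals the number of samples), neighbours are found with the sum of per-feature diff values as the distance, and each miss class S is weighted by p(S)/(1 − p(C)), where C is the class of the current instance, which is the normalisation used in the standard description of Relief-F and should be compared with Eq. (2) before reuse. The function name `relief_f_weights` is hypothetical.

```python
def relief_f_weights(samples, labels, k=1):
    """Return one Relief-F weight per continuous feature."""
    n_samples, n_features = len(samples), len(samples[0])

    # max(A) - min(A) for every feature A, used to scale diff into [0, 1].
    spans = []
    for a in range(n_features):
        column = [row[a] for row in samples]
        spans.append(max(column) - min(column))

    # Empirical class probabilities p(.).
    priors = {c: labels.count(c) / n_samples for c in set(labels)}

    def diff(a, x, y):
        # Continuous branch of diff(A, x_i, x_j); constant features give 0.
        return abs(x[a] - y[a]) / spans[a] if spans[a] else 0.0

    def distance(x, y):
        return sum(diff(a, x, y) for a in range(n_features))

    weights = [0.0] * n_features
    m = n_samples  # assumption: each instance is used exactly once
    for i, (x, own_class) in enumerate(zip(samples, labels)):
        for cls, prior in priors.items():
            pool = [samples[j] for j in range(n_samples)
                    if j != i and labels[j] == cls]
            nearest = sorted(pool, key=lambda y: distance(x, y))[:k]
            # Hits lower the weight; misses raise it, scaled by the class prior.
            if cls == own_class:
                factor = -1.0
            else:
                factor = prior / (1.0 - priors[own_class])
            for y in nearest:
                for a in range(n_features):
                    weights[a] += factor * diff(a, x, y) / (m * k)
    return weights
```

Sorting the returned weights in descending order gives the feature ranking; a feature whose values separate the classes ends up with a clearly larger weight than one that varies mostly within classes.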

Feature selection technique of PCA

Principal component analysis is a statistical approach that maps multiple characteristic parameters to a few comprehensive features. These PCA-based comprehensive features are uncorrelated with each other and can represent the original features effectively (Abdi et al. 2013). Based on these comprehensive features, the absolute weight of each original feature can be calculated.

For a given feature vector set \(x = \left\{ {x_{1} ,x_{2} , \ldots , x_{m} } \right\}\), \(x_{i} \in R^{n}\), which consists of m feature vectors (m samples), each n-dimensional, the algorithm to extract sensitive features and calculate the weights of the original features is as follows (Xu et al. 2008):

  1. (1)

    Calculate the average value:

    $${\varvec{\upmu}} = \frac{1}{m}\mathop \sum \limits_{i = 1}^{m} x_{i}$$
    (3)
  2. (2)

    Compute the covariance matrix C of the feature vectors:

    $${\text{C}} = \frac{1}{m}\mathop \sum \limits_{i = 1}^{m} \left( {x_{i} - {\varvec{\upmu}}} \right)\left( {x_{i} - {\varvec{\upmu}}} \right)^{\varvec{T}}$$
    (4)
  3. (3)

    Compute the eigenvalues \(\lambda_{i}\) and eigenvectors \(\varvec{\nu}_{\varvec{i}} \left( {i = 1,2, \ldots ,n} \right)\) of C:

    $${\text{C}}\varvec{\nu}_{\varvec{i}} = \lambda_{i}\varvec{\nu}_{\varvec{i}}$$
    (5)
  4. (4)

    Arrange the eigenvalues in descending order \(\lambda_{1} \ge \lambda_{2} \ge \ldots \ge \lambda_{n}\), and collect the first k eigenvalues \(\Delta = \left( {\lambda_{1} , \lambda_{2} , \ldots ,\lambda_{k} } \right)\) and the corresponding eigenvectors \(\varvec{W} = \left[ {\varvec{w}_{1} ,\varvec{w}_{2} , \ldots ,\varvec{w}_{\varvec{k}} } \right]\). The cumulative contribution rate is then defined as:

    $$R_{k} = \frac{{\mathop \sum \nolimits_{i = 1}^{k} \lambda_{i} }}{{\mathop \sum \nolimits_{i = 1}^{n} \lambda_{i} }}$$
    (6)

    \(R_{k}\) indicates the percentage of the total variance explained by the first k principal components (PCs). The first principal component is oriented in the direction of the largest variance, and each subsequent component is oriented in the direction of the next largest remaining variance. In this paper, the k most representative PCs were selected by requiring \(R_{k} \ge 0.8\).

  5. (5)

    The linear combination can be expressed for the kth dimension of the projected feature vector as follows:

    $$PC_{k} = \mathop \sum \limits_{j} w_{j}^{k} x_{j}$$
    (7)

    The weighting reflects the contribution of the original features to the linear combination, and thus is related to the original variance of the data samples.

  6. (6)

    Calculate the absolute weight of the original features. Following a process similar to that proposed by Xu et al. (2008), the absolute weight \(w_{j}^{'}\) of feature \(x_{j}\) in the original data set is calculated from the selected components:

    $$w_{j}^{'} = \frac{{\mathop \sum \nolimits_{i = 1}^{k} w_{j}^{i} \times \lambda_{i} }}{{\mathop \sum \nolimits_{i = 1}^{k} \lambda_{i} }}$$
    (8)

By this method, a large \(w_{j}^{'}\) represents a large contribution of the feature \(x_{j}\) to the original data set. Following the same computational procedure, the weights of all features were computed. Sorting the weights in descending order then gives the ranking of all features.
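Steps (4) to (6) can be sketched as below, assuming the eigenvalues (already sorted in descending order) and the matching eigenvectors have been obtained from any eigen-decomposition routine. The function names are hypothetical, and taking the absolute value of each loading, so that loadings of opposite sign do not cancel, is an assumption about what “absolute weight” means rather than something the text states.

```python
def components_to_keep(eigenvalues, threshold=0.8):
    """Smallest k whose cumulative contribution rate R_k reaches the threshold."""
    total = sum(eigenvalues)
    running = 0.0
    for k, value in enumerate(eigenvalues, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(eigenvalues)


def absolute_weights(eigenvalues, eigenvectors, k):
    """Eigenvalue-weighted loading of each original feature over the first k PCs.

    eigenvectors[i][j] is the loading of feature j on the i-th component.
    Assumption: loadings enter through their absolute value.
    """
    kept_variance = sum(eigenvalues[:k])
    n_features = len(eigenvectors[0])
    return [
        sum(abs(eigenvectors[i][j]) * eigenvalues[i] for i in range(k)) / kept_variance
        for j in range(n_features)
    ]
```

For example, with eigenvalues 6, 3, 0.5 and 0.5 the first two components already explain 90% of the variance, so only their loadings contribute to the weights.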

Feature selection technique of entropy weight method (EWM)

The entropy weight method (EWM) determines weights by quantifying the extent of disorder of a particular system (Huang et al. 2015). Because the weighting factors depend purely on the values of the indices rather than on subjective human assessment, EWM is recognized as an objective method for weight calculation. For a given feature vector set \({\mathbf{x}} = \left\{ {x_{1} ,x_{2} , \ldots , x_{m} } \right\}\), \(x_{i} \in R^{n}\), which consists of m samples, each n-dimensional, the main steps for weighting the features with the entropy weight method are as follows:

Step 1 Normalization of the features. In order to ensure the uniformity of indices’ units or value range, the normalization of all features is performed as:

$$\begin{aligned} & y_{ij} = \frac{{x_{ij} - \min_{j} \left( {x_{ij} } \right)}}{{\max_{j} \left( {x_{ij} } \right) - \min_{j} \left( {x_{ij} } \right)}}\left( {\max_{\text{new}} - \min_{\text{new}} } \right) + \min_{\text{new}} \\ & \quad \quad i = 1,2, \ldots ,m;\quad j = 1,2, \ldots ,n \\ \end{aligned}$$
(9)

where \(\left[ {\min_{\text{new}} , \max_{\text{new}} } \right]\) is the new value range for all the features, which is usually set as \(\left[ {\min_{\text{new}} , \max_{\text{new}} } \right] = [0,1].\)

Step 2 Calculation of weighting coefficients. The information entropy of each feature is calculated by:

$$E_{j} = - \left( {\ln m} \right)^{ - 1} \mathop \sum \limits_{i = 1}^{m} p_{ij} \ln p_{ij}$$
(10)

where \(E_{j}\) is the information entropy of the jth feature and \(p_{ij} = y_{ij} / \sum\nolimits_{i = 1}^{m} y_{ij}\), with i running over the m samples.

Based on the value of information entropy \(E_{j}\), the weighting factor of each feature is calculated by:

$$\omega_{j} = \frac{{1 - E_{j} }}{{\mathop \sum \nolimits_{j = 1}^{n} (1 - E_{j} )}} = \frac{{1 - E_{j} }}{{n - \mathop \sum \nolimits_{j = 1}^{n} E_{j} }}$$
(11)

where \(\sum\nolimits_{j = 1}^{n} {\omega_{j} } = 1\) and \(0 \le \omega_{j} \le 1\). \(1 - E_{j}\) indicates the degree of inconsistency among the samples under the jth feature, in the sense of information entropy. A feature that creates a larger degree of inconsistency among samples, in other words, one that has a larger capacity to discriminate samples, therefore receives a larger weighting coefficient.

The features can then be ranked easily according to their weights.
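The two EWM steps can be sketched as follows. The sketch assumes min-max scaling to [0, 1], normalises the entropy by the logarithm of the number of samples, uses the convention 0 · ln 0 = 0, and treats a constant feature as carrying no discriminating information; the function name `entropy_weights` and these edge-case choices are assumptions made for illustration.

```python
import math


def entropy_weights(samples):
    """Return one entropy weight per feature; the weights sum to 1.

    samples: list of equally long rows, one row per sample (at least two rows).
    """
    n_samples, n_features = len(samples), len(samples[0])
    diversities = []  # 1 - E_j for every feature j
    for j in range(n_features):
        column = [row[j] for row in samples]
        low, high = min(column), max(column)
        if high == low:
            # Constant feature: assumed to have maximal entropy (E_j = 1).
            diversities.append(0.0)
            continue
        # Step 1: min-max normalisation to [0, 1].
        scaled = [(value - low) / (high - low) for value in column]
        total = sum(scaled)
        # Step 2: information entropy, skipping zero terms (0 * ln 0 = 0).
        entropy = -sum(
            (y / total) * math.log(y / total) for y in scaled if y > 0
        ) / math.log(n_samples)
        diversities.append(1.0 - entropy)

    norm = sum(diversities)
    if norm == 0:
        # Every feature is constant: fall back to equal weights.
        return [1.0 / n_features] * n_features
    return [d / norm for d in diversities]
```

A feature whose mass is concentrated in a few samples has low entropy and therefore a high weight, whereas a feature spread evenly over the samples receives a low weight.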

Finally, by combining the fractal dimension of the data set, which determines the number of key attributes that characterize the data set well, with the ranked features from each feature selection technique, the key feature subset under each technique can be obtained.
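The combination step amounts to keeping the top-ranked features, with the count set by the intrinsic dimension. In the sketch below the function name `kernel_feature_subset` and the dictionary input are hypothetical, and rounding the intrinsic dimension up to the next integer is an assumption that follows from treating it as a lower bound on the number of attributes.

```python
import math


def kernel_feature_subset(feature_weights, intrinsic_dimension):
    """Return the names of the top-weighted features.

    feature_weights: mapping from feature name to the weight assigned by one
    selection technique (Relief-F, PCA or EWM).
    """
    # A fractional lower bound is rounded up to a whole number of features.
    count = math.ceil(intrinsic_dimension)
    ranked = sorted(feature_weights, key=feature_weights.get, reverse=True)
    return ranked[:count]
```

Running the function once per selection technique, with the same intrinsic dimension, gives one key feature subset per technique.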

The classification process to verify the robustness of the feature subsets

To verify the robustness of the feature subsets selected under the three feature selection techniques, three machine learning techniques, a Naïve-Bayes classifier, a K-nearest-neighbor (KNN) classifier and a random forest classifier based on decision trees, were applied to the selected features.

The Naïve-Bayes classifier (Langley et al. 1992) predicts the probability that a given sample belongs to a particular class. Given a sample X, the classifier predicts that X belongs to the class with the highest posterior probability conditioned on X.
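In symbols, the decision rule can be written as below. The notation is introduced here for illustration only: \(c\) ranges over the classes, \(v_{k}\) is the value of the kth feature of X, and \(d\) is the number of features in the subset. The second equality uses Bayes' rule (the evidence \(P(X)\) is the same for every class and can be dropped) together with the naïve assumption that features are conditionally independent given the class.

$$\hat{c} = \arg\max_{c} P(c \mid X) = \arg\max_{c} P(c) \prod_{k = 1}^{d} P(v_{k} \mid c)$$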

The K-nearest-neighbor (KNN) classifier simply retains the entire training set during learning and assigns to each query the majority label of its k nearest neighbors in the training set. In the present study, we used the distance-weighted KNN proposed by Dudani (1976) to perform the classification.
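A minimal sketch of the distance-weighted vote is given below. In Dudani's scheme the nearest of the k neighbors receives weight 1, the farthest receives weight 0, and the others are weighted linearly in between. The function name, the use of Euclidean distance, and the toy data are assumptions made for the example, not details taken from the study.

```python
import math
from collections import defaultdict

def dudani_knn_predict(train_points, train_labels, query, k):
    """Classify `query` with distance-weighted KNN (Dudani-style weights)."""
    # Assumption: Euclidean distance between feature vectors.
    neighbours = sorted(
        ((math.dist(point, query), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )[:k]
    d_nearest = neighbours[0][0]
    d_farthest = neighbours[-1][0]
    votes = defaultdict(float)
    for distance, label in neighbours:
        if d_farthest == d_nearest:
            weight = 1.0  # all k neighbours are equally distant
        else:
            weight = (d_farthest - distance) / (d_farthest - d_nearest)
        votes[label] += weight
    # The class with the largest total weight wins.
    return max(votes, key=votes.get)

# Toy data (made up): one close "HCP" point and two more distant "non-HCP" points.
toy_points = [(1.0, 0.0), (3.0, 0.0), (3.1, 0.0), (9.0, 9.0)]
toy_labels = ["HCP", "non-HCP", "non-HCP", "non-HCP"]
toy_prediction = dudani_knn_predict(toy_points, toy_labels, (0.0, 0.0), k=3)
```

In the toy example an unweighted majority vote over the three nearest neighbors would return "non-HCP", whereas the distance-weighted vote favors the single much closer "HCP" neighbor.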

Random forest combines several randomized decision trees and aggregates their predictions by averaging (Breiman 2001). In this study, the random forest classifier consists of seven trees, each grown with the classification and regression tree (CART) algorithm (Breiman et al. 1984). To classify a new dataset, each case is passed down each of the seven trees, and the forest assigns the class that receives the most of the seven votes as the final class label of the case.
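The voting step can be sketched as follows. Only the majority vote is shown; growing the trees with CART is not reproduced here. The function name and the seven threshold "stumps" that stand in for trained trees are hypothetical, as are the class labels.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Return the majority label and its vote count over all trees."""
    votes = [tree(sample) for tree in trees]
    label, count = Counter(votes).most_common(1)[0]
    return label, count

# Hypothetical stand-ins for seven trained trees: each one thresholds the
# first feature at a different value (for illustration only).
toy_trees = [
    (lambda sample, t=t: "HCP" if sample[0] > t else "non-HCP")
    for t in (1, 2, 3, 4, 5, 6, 7)
]
toy_label, toy_votes = forest_predict(toy_trees, (4.5,))
```

With an odd number of trees and two classes, as assumed in this example, the vote cannot end in a tie.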

The detailed calculation process for these three classifiers is described in the “Supplementary Material”.

Experimental results and discussion

According to the fractal dimension analysis in Sect. 5.1, the slope of the (log(r), log(S(r))) plot in Eq. (1) was 7.98, which is the lower bound on the number of attributes needed to keep the essential characteristics of the dataset. That is, of the twenty-three indices collected for the prediction task, at least eight are needed to represent the original data set. Accordingly, the eight top-ranked features under each feature selection technique were selected as the most important representative features of the original data set. Table 5 shows the selected feature subset under each feature selection technique.

Table 5 Key feature subsets under the three feature selection techniques

Not all of the twenty-three collected indices appear in Table 5. In fact, only eleven indices appear in the three feature subsets, which suggests that these eleven indices have larger predictive capacities than those that do not appear in Table 5.

Among the eleven features in Table 5, the indices {x12, x14, x17, x18, x22} appear in all three feature subsets, which indicates that these five features are the core characteristics of the initial data set. That is, these five features play the dominant role in determining which articles grow into highly cited ones. Index x12 is the first-cited age of an article, which represents the speed with which research results are disseminated in the scientific community. Indices {x14, x17, x18} show the scope of an article's knowledge diffusion in the scientific community. Index x22 represents the number of times the article was saved on various social media platforms in the first 2-year window after publication. This suggests that alternative metrics, especially saving activity, also have a central influence on articles’ future success. Besides these five core characteristics, the h-index of the first author before publishing the article (x3), the citations received in the first-cited age (x13), the number of citing institutions in the first 3 years after publication (x16), the total citations received in the first 3 years (x20), the number of views (x21) and the number of discussions (x23) all contribute to the future growth of articles.

The three classifiers, Naïve-Bayes, K-nearest-neighbor (KNN), and random forest, were then applied to test the performance of these feature subsets in predicting future HCPs. Table 6 shows the classification performance of each feature subset under each of the three classifiers.

Table 6 The classification performance of the feature subsets

All the feature subsets achieve strong classification performance under each of the classifiers; even the lowest accuracy reaches 0.882. The feature subset selected by PCA attains the highest value, 0.947, when trained with the random forest, and it gives the best classification performance regardless of the classifier. The last row in Table 6 gives the average classification accuracy of each feature subset across the classifiers, and the last column gives the average classification accuracy of each classifier across the feature subsets. All the average accuracies are above 0.9, for classifiers and feature subsets alike. The results indicate that the feature subsets extracted by the three feature selection techniques are stable and valuable for classifying and predicting future HCPs, although their accuracy values differ slightly. Based on these classification results, this study combines all the features appearing in the three feature subsets as the typical indicators of future HCPs.

Table 7 shows the final typical indicators of future HCPs. Eleven features are verified as typical indicators, eight from traditional bibliometric indices and three from alternative metrics. This indicates that both bibliometric indices and alternative metrics contribute to the future success of articles, and that altmetrics offer new ways to measure the impact of publications which may complement rather than replace traditional indicators for research evaluation and prediction. Table 7 also shows the frequency of occurrence of each index. Features {x12, x14, x17, x18, x22} appear in all three feature subsets, showing their leading role in predicting future HCPs. Indices {x16, x20, x21} appear twice and are also informative features. Indices {x3, x13, x23} each appear once. Although these eleven features differ in their frequency of occurrence and in their predictive capacities, we believe that the strong classification performance results from the combined effects of these indices. Thus, all eleven indices in Table 7 are presented as the typical indicators of future HCPs. A detailed analysis of these features follows.
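As a quick consistency check on these occurrence counts (the arithmetic is added here and is not part of the original analysis), five indices appearing three times, three appearing twice and three appearing once account for exactly the 24 slots available in three subsets of eight features each:

$$5 \times 3 + 3 \times 2 + 3 \times 1 = 24 = 3 \times 8$$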

Table 7 Typical indices for the future HCPs
  1. (1)

    The first-cited performance of articles: The two first-cited indices {x12, x13} are both selected as valid predictors, where x12 reflects how quickly an article receives its first citation and x13 is the citation strength in the first-cited year. Van Dalen and Henkens (2005) stated that the uncitedness of a paper becomes a stigma: the longer a paper remains uncited, the lower its perceived quality and the less inclined researchers will be to cite it. This stigma of uncitedness indicates the important role of a paper’s first-cited performance in its later citation life. Our results show that faster and stronger initial acceptance of an article in the scientific community is important for its future success. This result is consistent with the conclusions of previous works (Adams 2005; Bornmann and Daniel 2010; Chakraborty et al. 2014; Garner et al. 2014; Guerrero-Bote and Moya-Anegón 2014; Hilmer and Lusk 2009a, b), which stated that future citations can be predicted from the speed with which research results are disseminated and cited in the scientific community.

  2. (2)

    The early-stage citation diffusion performance of articles: Five indices {x14, x17, x18, x16, x20} associated with articles’ citation diffusion performance in their early stage, defined in this study as the first 3 years after publication, are also important indicators of future HCPs. These five indices show the scope of an article's influence in its citing environment. If an article has received citations from more countries, institutions, subjects, journals and so on, the knowledge it carries has diffused into more diverse fields through these citation activities. This wider visibility, in turn, brings more opportunities for the article to gain new citations. Aksnes (2003) stated that the increasing visibility of a paper leads to further citations to it. Our experimental results confirm this viewpoint.

  3. (3)

    The early-stage web usage statistics of articles: All three indices {x21, x22, x23} related to articles’ web usage are extracted as typical predictors. A number of studies have used statistical techniques to verify the positive relationship between citation counts and various alternative metrics (McCabe and Snyder 2015; Yuan and Hua 2011; Eysenbach 2011; Neylon and Wu 2009; Thelwall et al. 2013; Li et al. 2012; Haustein et al. 2013; Zahedi et al. 2013). However, fewer have focused on how well these metrics predict future citation counts. Our study makes a preliminary attempt at this question and shows that early web usage metrics can also be valuable predictors of later citation growth.

  4. (4)

    The prestige of authors: The h-index of the first author before publishing the article, labelled as index {x3}, is also a valuable predictor, which shows the importance of the author’s prestige. A larger h-index indicates that the author has gained considerable research capability or reputation in science. Hurley et al. (2013) reported that the h-index of the author group influences citation frequency. Schilling and Green (2011) stated that prior experience and publishing success may act as a signaling and legitimization device that increases the likelihood of others reading and citing the work. Researchers have also shown that the number of citations to an author’s previous papers can be considered a good predictor of citations to further papers (Tang et al. 2014; Walters 2006; Yu et al. 2014). Our study shows that, compared with the authors’ previous citations, the h-index of the first author is a more predictive indicator.

Conclusions

Is it possible to identify the importance of an article earlier in the read-cite cycle? This paper aims to answer that question by identifying early-stage predictors from traditional bibliometric indices and alternative metrics. A feature space of twenty-three indices was established, covering characteristics of authors, journals and articles, early citation distribution performance, and early web usage statistics, and these were tested to find the key predictors of articles’ future success. In combination with the fractal dimension of the data set, three feature selection techniques, Relief-F, principal component analysis (PCA) and the entropy weighted method (EWM), were used to extract the key feature subsets that best represent the original data set. Three classifiers, Naïve Bayes, KNN and random forest, were then used to verify the classification performance of the feature subsets. Experimental results on articles published in seven PLOS journals showed that both traditional bibliometric indices and alternative metrics are valuable predictors of future HCPs. These predictors come mainly from four aspects: the first-cited performance of articles, the early-stage citation diffusion characteristics of articles, the early-stage web usage statistics of articles, and the prestige of authors. This indicates that the data collected by altmetrics are also a useful leading indicator of later success, and that web-based access to the research literature offers a potential measure of the future impact of articles. Combining traditional and alternative metrics for research prediction could provide more complete article profiles, as the combination captures more dimensions of scientific practice.

A limitation of the present analysis is that it covers only seven journals published by PLOS, in which all the collected articles belong to the life sciences and medical literature. The results from this limited corpus may not generalize to articles in other fields. In addition, the study relies solely on the alternative indicators provided by PLOS; other alternative indicators available on other sites are not covered. Nevertheless, PLOS ALM is an important altmetrics platform with high authority and availability, and the results from PLOS can serve as a reference for subsequent evaluation and prediction research.