Introduction

With the rapid development of the internet and information technology, the modes of obtaining and transmitting human knowledge have undergone major changes. The mode of obtaining knowledge has shifted from traditional paper documents to electronic resources, which are more diverse, convenient and timely. In the context of the digitization of academic resources, big scholarly data, i.e., data related to different academic entities (e.g., scholars, institutions, publications and disciplines) and their relationships (e.g., collaboration and citation), have emerged (Xia et al., 2017). On the one hand, big scholarly data opens the door to the palace of knowledge for researchers. On the other hand, it also brings unprecedented challenges. For example, it has become increasingly challenging for researchers to quickly find influential papers among a very large number of resources, and more information is needed to support scientific research evaluation so that research funds are allocated fairly (Bai et al., 2020).

Scientific impact plays an important role in the evaluation of publications, scholars, departments and institutions. The evaluation of scientific impact is usually based on past performance; however, it is more meaningful to grasp the future influence of academic entities (Cheang et al., 2014a; Fortunato et al., 2018). Therefore, predicting scientific impact is of great significance, specifically in the following aspects. Resource recommendation can help researchers quickly find the papers they need and improve the efficiency of scientific research. Reliable impact prediction for researchers can help identify rising stars, facilitate expert recommendation and promote fruitful collaboration. Scientific impact prediction also benefits the management of researchers and research-based institutions. Precise predictions can provide decision makers with strong evidence in many situations, such as hiring and promoting researchers, funding projects and applying for awards (Cheang et al., 2014b; Ma & Uzzi, 2018). At the same time, quantitative methods can be seen as a supplement to peer review and help allocate resources effectively. For personal development, understanding the factors affecting future academic achievements can help scholars better plan their research careers (Cheang et al., 2015; Van Dijk et al., 2014). Currently, data-driven approaches make it possible to predict scientific impact and have attracted the attention of researchers from various disciplines (Wang et al., 2021a; Weis & Jacobson, 2021; Xiao et al., 2021). These approaches draw on data from digital libraries or web crawling, extract relevant influence indicators, analyze the laws of scientific development and predict the future impact of different academic entities. Related studies have been published in some prestigious journals, such as Science (Sinatra et al., 2016; Wang et al., 2013), Nature (Acuna et al., 2012) and PNAS (Ma & Uzzi, 2018; Way et al., 2017), indicating that this is a topic worthy of in-depth research.

Our survey mainly focused on four common academic entities, i.e., papers, scholars, venues and institutions. We retrieved data in a predefined manner from the Web of Science (WoS) Core Collection database for the period 2000–2021. The retrieval strategy was as follows: Title = (article or paper or citation or scientific or scientist or h index or journal or institution or universit*) and (predict or forecast or long term). The document types were confined to article or review. The limitation is that papers whose title keywords are not in the retrieval strategy may be overlooked. To cover the literature more comprehensively, we also selected closely related references as supplements. After filtering articles by reading their abstracts, we finally obtained 168 articles for further analysis. Among them, we identified the top 10 authors and top 5 journals that published the most studies, and they are displayed in Figs. 1 and 2, respectively. Overall, the top five journals represent 50% of all the publications. We also calculated statistics on the prediction tasks presented in each article. The results show that the number of studies predicting paper impact is the largest, followed by those forecasting scholars’ influence, while the prediction of institutions and venues is more complex, so the number of these articles is relatively small compared to that of papers and authors. Meanwhile, a small number of papers have attempted to predict multiple academic entities at the same time. The proportion of articles predicting the impact of different academic entities is shown in Fig. 3.

Fig. 1 The top 10 productive authors

Fig. 2 The top five productive journals

Fig. 3 Schematic diagram of the proportion of articles for scientific impact prediction of different academic entities

Among these papers, there have been several review studies. Hou et al. (2019) reviewed methods and applications in the prediction of paper impact, scholar impact and author collaboration. Zhang et al. (2019) summarized the author impact predictive models and the common evaluation metrics. Bai et al. (2017a) briefly introduced the methods of predicting scholarly article impact. However, these studies do not include the latest research methods, such as deep learning and graph neural networks (GNNs), and some new features, e.g., content features and network embedding features, are not covered. In this paper, we conduct an updated survey that comprehensively summarizes the prediction tasks for four types of entities (i.e., papers, scholars, venues and institutions) and the common input features, and we propose a taxonomy of approaches for scientific impact prediction that covers algorithms popular in recent years, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs) and GNNs. Figure 4 shows the framework of this survey.

Fig. 4 The framework for this survey

The rest of our paper is arranged as follows. In Sect. “Scientific impact prediction tasks”, we summarize the scientific impact prediction tasks of four academic entities and present the commonly used features. In Sect. “Taxonomy of prediction methods”, different prediction methods are elaborated in detail, and datasets and evaluation metrics are also discussed. Then, the challenges and potential research directions are pointed out in Sect. “Open challenges and future research directions”. Finally, we conclude the paper in the last section.

Scientific impact prediction tasks

Paper impact prediction

With the large number of scientific papers that are published every year, researchers need to recognize the more influential papers in advance (Abrishami & Aliakbary, 2019). However, the rapid increase in the number of papers has also brought about information overload, preventing researchers from effectively retrieving papers and making evaluations (Zhou et al., 2021). It takes time for newly published papers to be cited, so it is valuable to predict the citations of papers shortly after they have been published (Ruan et al., 2020). In the 2003 KDD Cup, one of the tasks was to predict the citation counts of papers. Since then, researchers have made many efforts in this field.

Prediction tasks

The goal is mainly to predict the values of the paper impact evaluation indicators. Citation count is widely used in the evaluation of paper impact; it is simple, standard and objective, and it is also the basis of many other evaluation metrics, such as the h-index and journal impact factor (JIF). In addition, a directed graph can be constructed from the paper citation relationship, and the evaluation of paper impact is often transformed into the importance ranking of nodes in the citation network. Therefore, there are two categories of paper impact prediction, citation-based prediction and ranking-based prediction, and they mainly include the following tasks:

  1.

    Cumulative citation prediction under a given time window. Formally, given a set of scientific publications D, the citation count of a publication \(d \in D\) at time t is defined as \(Cit(d,t) = \left| \left\{ d^{\prime} \in D : d \text{ is cited by } d^{\prime} \text{ at time } t \right\} \right|\), and the goal is to estimate \(Cit(d, t + \Delta t)\) (Pobiedina & Ichise, 2016; Yan et al., 2011). The forecast time window \(\Delta t\) is roughly divided into short-term (e.g., \(\Delta t < 5\) years) and long-term windows (e.g., \(\Delta t > 10\) years), which have no definite boundaries. The prediction task can be subdivided according to whether the number of citations of a paper is estimated over a fixed \(\Delta t\), over different time intervals or over several consecutive years after publication. Yu et al. (2014) predicted paper citations 5 years after publication in the area of Information Science & Library Science, and they believed that the citation impact of the 5-year time window was an important manifestation of the quality of the paper. Yan et al. (2011) modeled the process of citation count prediction for 1 year, 5 years, and 10 years. They considered different feature combinations, and the prediction with the longest time window (\(\Delta t = 10\)) achieved the best accuracy. Chakraborty et al. (2014) proposed a two-stage prediction model with a dataset of more than 1.5 million papers in the computer science domain, with \(\Delta t\) ranging from 1 to 5 years. Although these studies have claimed that their models can achieve high accuracy, there are still many challenges. There is no standard for the selection of a prediction time window, which usually depends on experience (Onodera & Yoshikane, 2015). The cumulative rate of citations varies greatly across disciplines, so the choice of prediction time window should differ accordingly.

  2.

    Long-term citation sequence prediction. Citation time series can reflect the influence of academic publications over time (Jiang et al., 2021). The long-term citation sequence prediction task is defined as a multioutput problem, i.e., predicting future citation sequences \(C_{k + 1} ,C_{k + 2} ,...,C_{n}\) of the papers according to early citations \(C_{0} ,C_{1} ,...,C_{k}\) and other features. Abrishami and Aliakbary (2019) collected 175,432 papers from five prestigious journals, i.e., Nature, Science, NEJM, Cell and PNAS, and obtained their 14-year citation data. They used citation counts of a paper from the 0th to the kth year after publication as input and predicted citations of the paper from the (k + 1)th to the nth year (\(k < 7,n = 14\)).

  3.

    Citation distribution or trend prediction. Because of the high uncertainty of citation prediction, it is believed that it is more useful to know the probability of citations that a publication would receive in the future. For example, Stegehuis et al. (2015) attempted to predict a future citation distribution by linking a quantile estimation technique from extreme value theory using only the JIF and first-year citations. Instead of predicting the citation count for each paper, several studies clustered papers into several citation trends and trained a prediction model for papers with each trend, considering that papers with similar early citation dynamics may be similar in the future (Cao et al., 2016; Li et al., 2015).

  4.

    Highly cited paper prediction. Highly cited papers represent authority in the research field and have been widely used to evaluate researchers and institutions. Predicting highly cited papers in advance can help researchers track research trends and plan research directions (Wang et al., 2019a). Therefore, identifying highly cited papers is the core task of paper impact prediction. The definition of highly cited papers usually includes absolute and relative thresholds (Wang et al., 2012). For the absolute threshold, if the citation count of a paper exceeds a certain fixed value, it is considered a highly cited paper (Wang et al., 2011). However, the relative threshold is more often used, e.g., papers are ranked according to the total number of citations in a certain time window, and the top x% (e.g., x = 1, 10, 20) of papers are considered highly cited. Hu et al. (2020) defined the prediction task as whether the paper was in the top 25%, 33%, or 50% of total citations in the six years since publication. They extracted journal, author and keyword features in the field of marketing and management information systems, and the model showed good performance for the prediction of the top 25% highly cited papers. However, their dataset covered a 4-year span, which may favor papers published earlier. Therefore, some researchers have restricted the prediction to papers published in the same period, which is a more reasonable comparison. Wang et al. (2019c) collected 23 features combining traditional bibliometric and alternative indicators from papers published in the same month to predict highly cited papers whose accumulated citation count reached 20% of the total citations within the dataset in a five-year window.

  5.

    Early identification of sleeping beauties. The citation patterns of scientific papers vary greatly. Normally, the citation counts of a paper will gradually reach the peak within a few years after publication and then decrease. Academia focuses more on highly cited papers. However, the value of low-cited or zero-cited papers is also worth exploring (Van Noorden, 2017). Sometimes, the importance and value of some major scientific discoveries and innovations are not recognized when they are initially published; instead, they only begin to gain attention many years later, which is referred to as delayed recognition or “sleeping beauties (SBs)” in science (Van Raan, 2004). The term “sleeping beauty” represents a special phenomenon in the scientific community that reminds us that we must have a strong sensitivity and sufficient tolerance to new ideas or discoveries. The SBs are mostly identified from retrospective research based on the time dimension (Ke et al., 2015). However, with the development of data mining, it is possible to identify SBs in advance to shorten the cycle of technological innovation and reduce the possibility of important scientific discoveries being ignored. Dey et al. (2017) developed a machine learning model to predict whether a paper is likely to become a sleeping beauty in computer science, and the results showed that SBs can be immediately identified after publication with a relatively high accuracy. To date, there have been few studies on the prediction of SBs, and further research is needed.

  6.

    Paper impact ranking prediction. Most of the abovementioned citation-based predictions require short-term historical data and are not suitable for newly published papers that have not yet begun to accumulate citations. Therefore, automatically ranking papers according to their potential impact has drawn much interest and can help researchers retrieve relevant and important information effectively (Bento et al., 2013; Zhang et al., 2018c). Different from focusing on predicting the future citations of a paper, this task is to predict the most influential TOP K papers. Using the arXiv(hep-th) dataset, Sayyadi and Getoor (2009) combined information about citations, authors and publication time to calculate the future ranking score of a paper based on the citation network and paper-author network. They took the top 50 papers sorted by future PageRank as the ground truth, and there was a high correlation between the predicted score and the future PageRank score. Later, Zhou et al. (2021) proposed an age-based diffusion model with a random walk process across citation networks, which can improve the ranking of newly published papers that have zero or few citations but will be popular in the future.

  7.

    Link prediction. Citation prediction is often regarded as a kind of link prediction problem in citation networks. The goal is usually to predict the citation relationship between papers (Hou et al., 2019). Link prediction methods often use node degree to evaluate the importance of nodes. Zhou et al. (2018) used the h-type index and considered that nodes with high in-degree neighbors were more important to compute node similarity, which significantly improved the accuracy of link prediction in the citation network. In addition, the citation count of a paper is equal to its in-degree in the citation network; when a new link is generated, the in-degree increases, that is, the citation count increases. However, link prediction cannot capture the pattern of citation count changes over time (Liu et al., 2020). Therefore, Pobiedina and Ichise (2016) attempted to estimate the number of new links for a specific paper node by introducing a new feature named GERscore, which is based on the graph evolution rules, and the result showed that the GERscore significantly improved prediction accuracy.
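To make the citation-based tasks above more concrete, the following Python snippet gives a minimal sketch of two of them: counting the cumulative citations \(Cit(d,t)\) of each paper and labeling the top x% of papers as highly cited with a relative threshold. The record layout (citing paper, cited paper, year), the reading of "at time t" as "up to and including year t", the tie-breaking rule and the function names `cumulative_citations` and `label_highly_cited` are our illustrative assumptions and are not taken from any of the cited studies.

```python
import math


def cumulative_citations(citations, papers, t):
    """Return Cit(d, t) for every paper d: the number of distinct papers
    that cite d in records dated on or before year t (assumed reading)."""
    citing_sets = {d: set() for d in papers}
    for citing, cited, year in citations:
        if year <= t and cited in citing_sets:
            citing_sets[cited].add(citing)
    return {d: len(citers) for d, citers in citing_sets.items()}


def label_highly_cited(counts, top_percent):
    """Relative threshold: mark the top `top_percent` % of papers by count.
    Ties are broken by paper identifier (an arbitrary, assumed choice)."""
    k = max(1, math.ceil(len(counts) * top_percent / 100))
    ranked = sorted(counts, key=lambda d: (-counts[d], d))
    top = set(ranked[:k])
    return {d: d in top for d in counts}


# Hypothetical toy data: (citing paper, cited paper, year of citation).
papers = ["p1", "p2", "p3", "p4"]
citations = [
    ("p2", "p1", 2001),
    ("p3", "p1", 2002),
    ("p3", "p2", 2002),
    ("p4", "p1", 2005),
    ("p4", "p3", 2005),
]

early = cumulative_citations(citations, papers, t=2002)   # Cit(d, t)
later = cumulative_citations(citations, papers, t=2005)   # Cit(d, t + Δt), Δt = 3
labels = label_highly_cited(later, top_percent=25)
```

In a prediction setting, `early` (together with other features) would serve as model input and `later` or `labels` as the target; the sketch only constructs these quantities and does not implement any of the prediction models discussed above.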

Features

Researchers often pay attention to the factors that may increase the impact of their work. The citation process is complicated and is affected not only by pure scientific content but also by other factors, such as the journal in which the paper is published and the author’s reputation and social influence (Tahamtan et al., 2016). The features used in paper impact forecasting include basic metadata information, content-related, network-related and altmetrics-related information.

Metadata-related information

Most of the features used in paper impact prediction come from metadata extracted from a digital library, e.g., title, abstract, keywords, references, authors and publication journal, which are relatively easy to obtain, and these features are also considered to be related to the future impact of the paper (Onodera & Yoshikane, 2015). For example, papers with more early citations are expected to have more in the future, and papers written by high-impact authors or published in high-impact journals may be more influential. In addition, researchers often perform statistical analysis on metadata information (e.g., citations or h-index) to obtain the maximum, minimum and average values, which are also chosen as the features of the model (Liu et al., 2020). Therefore, the basic metadata features are the most commonly used features.
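As a small illustration of the statistical aggregation mentioned above, the following Python sketch turns the per-author h-indices of one paper into maximum, minimum and average features. The function name `aggregate_author_features` and the feature names are hypothetical and chosen only for this example.

```python
from statistics import mean


def aggregate_author_features(author_h_indices):
    """Summarize the h-indices of a paper's authors as max/min/mean features."""
    if not author_h_indices:
        raise ValueError("at least one author value is required")
    return {
        "h_max": max(author_h_indices),
        "h_min": min(author_h_indices),
        "h_mean": mean(author_h_indices),
    }


# Hypothetical paper with three authors whose h-indices are 3, 10 and 5.
features = aggregate_author_features([3, 10, 5])
```

The same pattern applies to other per-author or per-reference quantities, such as citation counts.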

Content-related information

The quality of a paper is the core factor that affects its readability and number of citations, but it is often ignored due to the lack of a quantitative mechanism. Singh et al. (2015) extracted two simple content-related features from the citation contexts, i.e., the number of times a paper was cited within the same article and the number of words within the citation context, and the results showed that these two additional features increased the prediction accuracy by 8–10%. With the development of text mining technology in recent years, some studies have begun to extract content features from metadata text to mine deeper semantic information. The topic of a paper has long been regarded as a significant feature of its content (Yan et al., 2011), e.g., hot topics or mainstream topics tend to receive more citations. Natural language processing (NLP) methods are often used to model topic diversity or keyword popularity (Chakraborty et al., 2014; Mahalakshmi et al., 2020). Hu et al. (2020) defined five keyword popularity features depending on data from Google Scholar, Google Trends and ResearchGate via a probabilistic topic model, which improved the effectiveness of highly cited paper identification. To obtain richer semantic information, word vector methods, e.g., the word2vec and doc2vec algorithms, have been used for feature extraction from title, abstract or peer review text (Li et al., 2019a; Ma et al., 2021). Moreover, some studies have paid attention to the sentiment of paper text. Fronzetti Colladon et al. (2020) considered semantic features of the abstract, using a lexicon- and rule-based sentiment analysis tool to calculate the sentiment value of each word in the abstract, and the result showed that it was better to write abstracts with more positive words. The above studies indicate that deep analysis of these content-based features can lead to further improvements in the prediction of paper impact (Ma et al., 2021; Singh et al., 2015).

Network-related information

The structural characteristics of the academic network are closely related to node impact, and centrality is a widely used evaluation indicator. It has been found that highly cited papers have a higher betweenness centrality even at early stages after publication (Bertsimas et al., 2013). Changes in the topological position in the network can reflect the dynamics of node impact to a certain extent. Researchers have extracted various topological features to improve the prediction accuracy. Davletov et al. (2014) calculated betweenness centrality, closeness centrality, PageRank and eigenvector centrality in the citation network and built a feature vector from these metrics to predict high-impact papers. Chen (2012) provided three metrics of structural variation, i.e., modularity change rate, cluster linkage and centrality divergence, to measure the new boundary-spanning connections introduced by a new paper, which can predict future citations. In addition, inspired by word embedding technology in the field of NLP, automatically learning the vector representation of nodes in the network has become a research hotspot in recent years. Several studies adopted network representation learning approaches (e.g., node2vec and struc2vec) to capture the characteristics of citation networks, which brought additional features to the prediction of the scientific impact of papers (Luo et al., 2020).
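As a minimal sketch of how such topological features can be derived, the following Python snippet computes in-degree and PageRank for each paper in a small citation graph. It uses a plain power-iteration PageRank written for this example rather than the implementation of any cited study; the damping factor 0.85, the fixed number of iterations, the even redistribution of rank from papers without references and the function names are our assumptions.

```python
from collections import Counter


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on unique (citing, cited) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by papers with no outgoing citations is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


def citation_network_features(edges):
    """Build a small per-paper feature dictionary from the citation graph."""
    in_degree = Counter(cited for _, cited in edges)
    ranks = pagerank(edges)
    return {
        node: {"in_degree": in_degree[node], "pagerank": ranks[node]}
        for node in ranks
    }


# Hypothetical toy graph: an edge (x, y) means paper x cites paper y.
edges = [("b", "a"), ("c", "a"), ("c", "b"), ("d", "a")]
features = citation_network_features(edges)
```

Betweenness, closeness and eigenvector centrality would be added to the same dictionary in practice, typically with a graph library rather than hand-written code.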

Altmetrics-related information

New forms of academic publishing have emerged in the open scientific community, and an increasing number of papers are first published online. A series of novel measurement indicators, called altmetrics, have emerged (Thelwall & Nevill, 2018). Altmetrics refer to alternative metrics and are the creation and study of web-based metrics for analysis. They are closely related to open research activities. Data sources include article downloads and page views as well as data from social networks, news magazines, online literature management tools and public policy archives. Many publishers, e.g., Nature, Science, Cell and PNAS, use altmetrics to measure the attention of social networks to their published papers. Altmetrics are supplements to traditional bibliometrics, such as citation count and h-index, and they are considered a new way to measure the societal impact of research. The behavior of altmetrics occurs earlier than a citation, e.g., the download number of an article can be recorded and counted immediately, so altmetrics may be related to the future citation count of an article. Many studies used altmetrics as features for scientific impact prediction, and it has been found that these indicators are correlated with future citations to a certain extent (Akella et al., 2021; Drongstrup et al., 2020; Zoller et al., 2016).

Until now, an increasing number of features have been extracted and calculated to construct prediction models, and most studies have used multidimensional features to make predictions. As shown in Table 1, we summarize the commonly used features into six categories. However, not all of these features contribute to the final result. Ruan et al. (2020) obtained thirty features and found that only five features have significant effects on the prediction performance of the model. Therefore, correlation analysis, regression analysis and rough sets are often used for feature selection. Overall, combining two or more feature categories would result in a better prediction than using only one feature category (Hu et al., 2020).
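The correlation analysis mentioned above can be sketched as a simple filter that keeps only features whose Pearson correlation with the target exceeds a threshold. This is an illustrative simplification and not the selection procedure of Ruan et al. (2020) or any other cited study; the threshold of 0.3, the treatment of constant features and the function names are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def select_features(features, target, threshold=0.3):
    """Keep the names of features with |r| >= threshold against the target."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical feature table for four papers and their future citation counts.
features = {
    "early_citations": [1, 2, 3, 4],
    "title_length": [5, 5, 5, 5],
    "page_parity": [1, -1, -1, 1],
}
future_citations = [2, 4, 6, 8]
selected = select_features(features, future_citations)
```

A univariate filter of this kind ignores interactions between features, which is one reason regression analysis and rough sets are also used.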

Table 1 The commonly used features for paper scientific impact prediction

Scholar impact prediction

The evaluation of scholars’ academic performance is often based on their published papers, e.g., citations and h-index, which are based on past data. Therefore, if the impact of scholars can be predicted in advance, more meaningful information for decision-making, e.g., personal career development, financial support, promotion, and job offers, can be provided. Recent advances, such as data mining techniques, have made it possible to forecast the influence of scholars (Li & Tong, 2015).

Prediction tasks

Scholars’ publications that are recognized or cited by their peers can reflect their academic influence. It is of great significance to understand scholars’ potential or academic achievements in advance. Scholar impact prediction mainly focuses on the citation count or the number of papers published, and the prediction tasks can be divided into the following categories:

  1.

    Scholar’s h-index or citation count prediction. Citation counts and the h-index are commonly used quantitative metrics for the evaluation of academic performance. Therefore, they have become the targets of scholars’ impact prediction. Mazloumian (2012) used multilevel regression models with random effects to predict future citations of a scientist’s published papers, but their predictive power decayed over time. Acuna et al. (2012) considered the number of articles published, current h-index, years since publishing the first article, number of distinct journals and number of articles in high-profile journals as features to predict the h-index of more than 3000 neuroscientists 5 and 10 years ahead. However, there were some restrictions, such as a career length of 5–12 years and an h-index greater than 4. The validity of Acuna’s equations was limited when using different datasets or considering different academic career lengths (García-Pérez, 2013). Subsequently, some improvements, e.g., removing these constraints, were made. Ayaz et al. (2018) predicted the h-index of 15,000 scientists in computer science with different combinations of parameters, but the forecast for scholars with 1 year of work experience was inaccurate. Most of the above studies failed to distinguish scholars at different career stages, e.g., it is unfair to compare junior scholars and senior scholars together. Moreover, the h-index and total citation counts have cumulative advantages and are more biased toward older papers, so they cannot reflect scholars’ future potential. To solve these problems, Zuo and Zhao (2021) predicted future citations of future work for scholars at different career stages, which were distinguished by the number of years between the first and last publication, and they obtained more reasonable results. In addition, Dong et al. (2016) first predicted authors’ h-indices in the next 5 years based on their previous publication records and then determined whether previously or newly published papers would contribute to the h-index. They also found that the prediction task was more difficult for authors with high h-indices.

  2.

    Detecting rising stars. Rising stars often refer to scholars who currently have relatively low profiles but may emerge as prominent contributors in their field in the future (Li et al., 2009). Detecting academic rising stars can not only help scientific research institutions recruit talent but also provide candidates for reviewers, funds or award applications, which is an important task in predicting the impact of scholars. This task can be realized by ranking authors based on a potential score, classifying them into rising and nonrising stars, or clustering rising stars with similar characteristics (Panagopoulos et al., 2017). Some studies considered the degree of mutual influence of nodes in the coauthor network, which was modeled by calculating its out-link and combined with author contribution and journal ranking for iterative calculation to obtain the final score of the node (Daud et al., 2013). In addition, as the early features of high-impact scholars can provide a reference for the recognition of academic rising stars, some researchers extracted the features of scholars based on big scholarly data, and it is often formalized as a classification task to predict whether a given young scholar will be a rising star in the future. The scholar’s number of publications, citation count, network indicators and journal level are chosen as features, and the citation counts of the scholar are often regarded as the classification label (Daud et al., 2015).

  3.

    Scientific prize winner or promotion prediction. Prizes and promotions are related to the recognition of scholars’ academic achievements and research abilities, which also guide the direction of future scientific investments. Predicting the prizes or promotions of scholars can help discover the growth characteristics and general laws of award-winning groups in social and academic activities and can provide guidance for academic evaluation and talent training. Jensen et al. (2009) studied several bibliometric indicators (e.g., h-index and number of papers published) to predict promotions to senior positions for CNRS researchers, but the prediction accuracy was limited. Moreover, every year, before the Nobel Prizes are announced, institutions or individuals attempt to predict the winners, which is also an interesting activity in the scientific community. Citation counts have been proven to be useful in predicting Nobel laureates (Garfield, 1977), and subsequently, Ashton and Oppenheim (1978) made improvements using non-first-author papers, which showed better results. However, the above methods ignored the dynamic evolution of scientific development over time, and the power of simple bibliometric indicators to predict Nobel Prizes has become limited (Gingras & Wallace, 2010). Zhou et al. (2020c) considered the prediction of Nobel Prize laureates in physics as a special binary classification task and introduced a competition mechanism considering the number of authors in the same period to normalize the citations, and the result showed that their method was effective for identifying prize-winning scientists. Apart from the Nobel Prize, Rokach et al. (2011) predicted the next AAAI fellowship winners using data on 292 researchers in the field of artificial intelligence. Ma and Uzzi (2018) collected more than 3000 scientific prizes, including 10,455 prize winners over more than 100 years, to predict the probability that a scientist was a multiple prizewinner, and they found that prizes were concentrated within a small group.

  4.

    Scholar’s publication productivity prediction. Along with citation count and h-index, publication productivity is also an important indicator of scholars’ academic abilities. However, scholars publish random and diverse papers; thus, it is a challenging task to predict their publication productivity (Way et al., 2017). Xie (2020) found that the number of publications within a short time interval followed a Poisson distribution and he proposed a piecewise Poisson model to predict publication productivity for researchers. However, this model is only applicable to a group of scholars and cannot be used for individual scholars. Later, he improved this method by integrating long short-term memory (LSTM) with a piecewise Poisson model, which can provide short-term prediction for individuals and long-term prediction for groups of scholars (Du et al., 2021).
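Since the h-index is a prediction target in several of the tasks above, the following Python sketch shows how it is computed from a scholar's per-paper citation counts under its standard definition: the largest h such that at least h papers have at least h citations each. The function name `h_index` and the example counts are hypothetical.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for position, citations in enumerate(ranked, start=1):
        if citations >= position:
            h = position
        else:
            break
    return h


# Hypothetical scholar with five papers.
print(h_index([10, 8, 5, 4, 3]))  # prints 4
```

Because the value can only stay the same or grow as citations accumulate, a forecast such as the 5-year and 10-year predictions described above amounts to estimating how far this quantity will rise.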

Features

The prediction of scholars’ influence is mainly based on their academic papers using the h-index and citation count. Publication-related indicators are easily accessible and quantifiable and are often used as features. In addition, scholars’ personal attributes, such as age and educational background, are also related to their academic achievements. Therefore, the features of scholars’ influence are summarized in our study as publication features, scholar features and social features (Table 2).

Table 2 The commonly used features for author scientific impact prediction

Publication features

There is a high correlation between the number of citations in the following years and the h-index in the previous years, and the h-index is better than other indicators, such as the total citation count and total paper count, in predicting future scientific achievement (Hirsch, 2007). However, several studies found different results. Schreiber (2013) found that an increase in the h-index was more likely to result from previous, often rather old, publications. Penner et al. (2013) argued that the h-index is a cumulative, non-decreasing indicator that contains intrinsic autocorrelation, which leads to an overestimation of its predictive power. Sinatra et al. (2016) found that the highest-impact work in a scientist’s career was randomly distributed. They defined the Q parameter corresponding to the logarithm of the number of citations in a period of time, which can predict the evolution of scientific excellence. Because cumulative indicators are not good for practical prediction (Põder, 2017), some incremental indicators that show the dynamic changes in scholars’ academic influence, e.g., the citation increment, h-index increment and incremental number of papers, have been extracted instead.
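As a minimal sketch of the incremental indicators mentioned above, the following Python snippet computes the h-index from a list of per-paper citation counts and the h-index increment between two snapshots. The function names and the sample citation counts are illustrative assumptions, not taken from any of the cited studies.

```python
from typing import Sequence


def h_index(citations: Sequence[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def h_index_increment(earlier: Sequence[int], later: Sequence[int]) -> int:
    """Change in h-index between two snapshots of the same scholar."""
    return h_index(later) - h_index(earlier)


# Hypothetical per-paper citation counts of one scholar at two points in time.
snapshot_earlier = [10, 8, 5, 4, 3]
snapshot_later = [15, 12, 9, 6, 5, 2]
print(h_index(snapshot_earlier), h_index(snapshot_later),
      h_index_increment(snapshot_earlier, snapshot_later))
```

Unlike the cumulative h-index, the increment can be zero or small for a scholar whose earlier papers no longer attract citations, which is why it is better suited as a dynamic feature.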

Scholar features

Individual scholars are different from each other and are affected by many factors, such as research area, age, sex, mobility and institution prestige (Yu et al., 2021). It has been found that the length of a scholar’s academic career is correlated with their number of publications and citation counts and has a significant role in predicting the scientific impact of scholars (Kong et al., 2020). Researchers from prestigious institutions tend to be more productive (Van Dijk et al., 2014), and the scientific contributions of early career scholars are greatly influenced by their working institutions (Way et al., 2019). In addition, males are more likely to achieve academic success than females under equal conditions (Lindahl et al., 2020). All of these findings suggest that the nonpublication features play an important role in academic success and can be utilized for scholars’ impact prediction.

Social features

Scientists’ collaboration in research has become the main mode of scientific activities, which can not only promote scientific research but also help expand the influence of scholars. Junior researchers who have coauthored with top scientists can achieve a persistent competitive advantage (Li et al., 2019b). The prestige of coauthors is often seen as an important feature. Similarly, the structural characteristics of scholars in the academic network are also considered significant factors for author impact prediction, e.g., degree centrality and the average relation strength have positive effects on scholars’ scientific performance (Abbasi et al., 2011).
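The two structural features named above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: degree centrality is taken as the number of distinct coauthors divided by the number of other authors in the network, and the average relation strength as the mean number of joint papers per coauthor. The author labels and paper lists are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_weights(papers):
    """Map each author to {coauthor: number of joint papers}."""
    weights = defaultdict(lambda: defaultdict(int))
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            weights[a][b] += 1
            weights[b][a] += 1
    return weights


def degree_centrality(weights, author, n_authors):
    """Share of the other authors in the network that `author` has worked with."""
    return len(weights[author]) / (n_authors - 1)


def average_tie_strength(weights, author):
    """Mean number of joint papers per coauthor (0.0 for an isolated author)."""
    ties = weights[author]
    return sum(ties.values()) / len(ties) if ties else 0.0


# Hypothetical author lists of three papers.
papers = [["A", "B"], ["A", "B", "C"], ["A", "D"]]
w = coauthor_weights(papers)
print(degree_centrality(w, "A", n_authors=4), average_tie_strength(w, "A"))
```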

Publication venue impact prediction

Publication venues are the carriers of academic exchange and play an important role in disseminating scientific knowledge and enlightening research ideas. Publishing articles in high-level journals is often used as an important indicator for scholars’ evaluation and paper quality evaluation. Commonly used evaluation metrics, such as JIF and CiteScore, are calculated based on the publication data of the past few years, which will cause a time lag. Therefore, it is more meaningful to predict the influence of journals or to evaluate their long-term impact, which can be helpful for journal recommendations.
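To make the time lag concrete, the standard two-year JIF for year Y divides the citations received in Y by items published in Y-1 and Y-2 by the number of citable items published in those two years. The snippet below is a small sketch of that ratio; the function name and the numbers are hypothetical.

```python
def journal_impact_factor(citations_to_prev_two_years: int,
                          citable_items_prev_two_years: int) -> float:
    """Two-year JIF: citations in year Y to items from Y-1 and Y-2,
    divided by the number of citable items published in Y-1 and Y-2."""
    if citable_items_prev_two_years <= 0:
        raise ValueError("the journal must have published citable items")
    return citations_to_prev_two_years / citable_items_prev_two_years


# Hypothetical journal: 300 citations to 120 citable items.
print(journal_impact_factor(300, 120))
```

Because both inputs are only complete after year Y has ended, the official value describes past performance, which is what motivates forecasting it.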

The journal impact forecasting task is mainly based on JIF or total citations. Wu et al. (2008) used the citations of papers to predict the JIF based on data from journals in different fields and predictions were made four months ahead of the official data. Valderrama et al. (2018) took the annual change in JIF (e.g. slope and intercept), degree of adaptation of publication guides and percentage of review articles as the independent variables to predict JIF in the field of dentistry, and the results showed a high determination coefficient. Wang et al. (2019b) considered four age characteristics of the active articles (average age, weighted average age, largest age and age of articles with largest citations) for 36 journals in the field of library and information science and found these indicators had a high correlation with the journal’s total citations, which can quantify the long-term impact of journals.

At present, the prediction of journal impact is mostly based on constructing models by extracting relevant features of published papers. However, the journal impact is affected not only by paper-related factors but also by indicators such as the review cycle, publication cycle, publication volume, publishing fees and altmetrics-related factors. To build a more accurate forecasting model, it is necessary to comprehensively consider various features in the future.

Institution impact prediction

It has become a tradition for many academic institutions, newspapers and magazines to publish rankings of research institutions or universities every year (Wilson et al., 2016), but it is still difficult to quantify the long-term impact of an institution due to the diversity of various subjective and objective factors, e.g., reputation, international collaboration, industry income and publication-related indicators. Predicting the influence of academic institutions is of great significance and can guide government agencies in making decisions, recruiting new members, guiding awards and helping students choose universities.

Considering the easy accessibility of publication data, current predictions about the impact of institutions usually shift to prediction of the institutions’ publications. Related research originated from the 2016 KDD Cup competition, whose goal was to predict paper acceptance in eight top conferences in the next year, and a snapshot of the Microsoft Academic Graph (MAG) was provided for this challenge. It is believed that the prediction of paper acceptance will be helpful for the evaluation of the development potential of an institution (Bai et al., 2017b; Sandulescu & Chiru, 2016). Recently, Wang et al. (2021b) combined individual and network features to improve the ranking of the paper acceptance rate. The commonly used features are shown in Table 3. Due to the limitation of effective indicators and the complexity of the prediction task, research on institutional impact prediction is in the relatively initial stage of development.

Table 3 The commonly used features for institution scientific impact prediction

Co-prediction of multiple academic entities

The tasks above each predict the influence of a single type of entity. However, big scholarly data contain multiple entities and different relationships (Fig. 5). Based on these relationships between different entities, it is possible to rank the future impact of multiple types of objects in the network simultaneously. Some coranking prediction tasks are based on mutual reinforcement rules, e.g., potentially important papers are published in high-quality venues, and venues with good prestige attract submissions from influential researchers. MRCoRank (Wang et al., 2016) integrates papers, authors, journals and text features into a unified framework, which can be used to predict the future influence of new publications and young researchers. In MRCoRank, a coauthor graph, a paper citation graph, a venue-paper graph, a venue-author graph, an author-paper graph, a paper-text feature graph and an author-text feature graph were built, and recent citations, recent coauthors and recently published papers were given more weight. WMR-Rank (Zhang & Wu, 2020) extracts seven types of relations to predict the future impact of papers, authors and venues through an iterative process with mutual reinforcement. This model considers not only time awareness but also the different contributions of multiple coauthors, which allows the influence of multiple entities to be predicted more precisely. Moreover, Zhou et al. (2020a) proposed a model based on a heterogeneous dynamical graph neural network that predicts the cumulative impact of papers and authors by capturing the dynamic processes of impact evolution and complex node interactions.

Fig. 5 Entities and their relationships in the academic network

Taxonomy of prediction methods

The evolution of scientific impact is highly dynamic and complex. With the development of bibliometrics, network science and computer science, researchers have proposed many methods to solve this difficulty. Current research can be classified as mathematical statistics-based, traditional machine learning-based, deep learning-based and graph-based. The number of various methods used in annual publications is shown in Fig. 6. Early prediction methods are mainly mathematical statistics-based. In recent years, due to the rapid development of various data-driven models, machine learning, deep learning and graph-based methods have been used widely.

Fig. 6 The number of different methods used in annual publications

Mathematical statistics-based methods

Statistical learning is the most widely used method in scientific impact prediction. It analyzes various features of academic entities and establishes mathematical models of the relationship between relevant features and scientific impact to fulfill prediction tasks such as citation count prediction and JIF prediction (Fig. 7). The influential features are identified from a set of candidate variables to build the prediction model. Commonly used algorithms include stepwise regression (Yu et al., 2014), negative binomial regression (Onodera & Yoshikane, 2015), ordinary least squares regression (Abramo et al., 2019a), quantile regression (Danell, 2011), hierarchical regression (Ha et al., 2016) and semi-continuous regression (Klimek et al., 2016). In these studies, the citation counts are used to fit the regression equation, but this is not considered prediction in the strict sense because the system may change between years (Thelwall & Nevill, 2018). Statistical learning selects important independent variables through feature selection methods, but there is no consensus on the choice of independent variables. The main reason is that the existing models assume that multiple factors are independent of each other and do not consider the interaction between them. In addition, the sample of a multiple regression model is usually limited to a specific field, so the generality of the conclusions may be limited.
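The simplest member of this family, ordinary least squares with a single predictor, can be sketched as follows. The example regresses log-transformed later citations on log-transformed early citations; the choice of predictor, the log transform and the toy numbers are assumptions made for illustration, and the cited studies use richer feature sets.

```python
import math


def fit_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical papers: citations after 1 year and after 5 years.
early = [1, 3, 4, 8, 10]
later = [4, 9, 15, 30, 41]

# log1p keeps zero-citation papers defined and tames the heavy tail.
slope, intercept = fit_ols([math.log1p(v) for v in early],
                           [math.log1p(v) for v in later])


def predict_citations(early_citations):
    """Back-transform the fitted log-scale prediction to a citation count."""
    return math.expm1(intercept + slope * math.log1p(early_citations))


print(round(predict_citations(5), 1))
```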

Fig. 7 Schematic diagram of mathematical statistics-based methods

In addition, a mathematical statistics model can describe the process of citation accumulation. Wang et al. (2013) derived a model to forecast the long-term citations of a paper based on three mechanisms: preferential attachment, aging and fitness. The citation dynamics of paper i at time t can be described as:

$$c_{i}^{t} = m\left( {e^{{\lambda_{i} \Phi \left( {\frac{{\ln t - \mu_{i} }}{{\sigma_{i} }}} \right)}} - 1} \right)$$
(1)

where \(\Phi\) is the cumulative standard normal distribution function and the parameter set \((\lambda_{i} ,\mu_{i} ,\sigma_{i} )\) can be calculated from the paper’s citation history. When \(t \to \infty\), \(\Phi \to 1\), so the ultimate impact, which represents the total citations a paper acquires during its lifetime, can be obtained by \(c_{i}^{\infty } = m(e^{{\lambda_{i} }} - 1)\). This means that the ultimate impact of a paper is related only to its relative fitness \(\lambda_{i}\). Although this method can model citation dynamics, it performs poorly in some disciplines and is prone to overfitting (Cao et al., 2016); these problems have been addressed by later studies (Bai et al., 2019; Shen et al., 2014). In addition, Sinatra et al. (2016) formulated a stochastic model that can accurately predict the evolution of a scholar’s impact. They defined a unique parameter Q for each scientist, which was formulated as:

$$Q_{i} = e^{{\log c_{10,i} - \mu_{p} }}$$
(2)

where \(c_{10,i}\) is the average citation count of the papers published by scholar i in the recent 10 years and p is the potential impact of the research topic. The Q value is highly stable and allows scientists of different ages and career stages to be compared, but it requires a long period of observation and is not suitable for new scholars.
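Both formulas can be evaluated numerically with the Python standard library. The sketch below assumes the form of the citation model given by Wang et al. (2013), in which \(\Phi\) multiplies \(\lambda_{i}\) inside the exponent, and it assumes, following Sinatra et al. (2016), that the average in Eq. (2) is taken over the logarithms of per-paper citation counts and that \(\mu_{p}\) is a given constant. The default m = 30 and all parameter values in the example are illustrative assumptions.

```python
import math


def normal_cdf(x: float) -> float:
    """Cumulative standard normal distribution function (Phi)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cumulative_citations(t, lam, mu, sigma, m=30.0):
    """Citations accumulated by time t > 0 for fitness lam, immediacy mu
    and longevity sigma; m is a global constant (assumed value)."""
    return m * (math.exp(lam * normal_cdf((math.log(t) - mu) / sigma)) - 1.0)


def ultimate_impact(lam, m=30.0):
    """Limit of cumulative_citations as t grows without bound."""
    return m * (math.exp(lam) - 1.0)


def q_parameter(c10_per_paper, mu_p):
    """Q = exp(mean(log c10) - mu_p); every citation count must be positive."""
    logs = [math.log(c) for c in c10_per_paper]
    return math.exp(sum(logs) / len(logs) - mu_p)


# Hypothetical paper parameters and a hypothetical scholar.
for t in (10, 100, 1000, 10000):
    print(t, round(cumulative_citations(t, lam=2.0, mu=7.0, sigma=1.0), 2))
print("ultimate impact:", round(ultimate_impact(2.0), 2))
print("Q:", round(q_parameter([12, 30, 7, 55], mu_p=0.9), 2))
```

The loop shows the saturation that the closed-form limit describes: the trajectory flattens toward the ultimate impact, which depends on \(\lambda_{i}\) alone.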

Traditional machine learning-based methods

With the development of big data and large-scale computing in the past ten years, machine learning has performed well in many prediction tasks. In the context of big scholarly data, traditional machine learning methods are widely used in scientific impact prediction and obtain high accuracy.

Supervised learning

Supervised learning is a category of machine learning methods in which models are trained using labeled data. The training set includes input features and output variables, and the goal is to learn the mapping from input to output to make predictions about unknown data. Some studies on the prediction of paper citations and scholars’ h-indices have used supervised learning methods, and paper-related and author-related indicators are often chosen as input features (Fig. 8). Support vector regression (SVR), random forest (RF), K-nearest neighbor (KNN), linear regression (LR), back-propagation (BP) neural network and gradient boosting algorithms are often used to train the model (Livne et al., 2013; Ruan et al., 2020; Singh et al., 2015). Ruan et al. (2020) collected literature published from 2000 to 2013 and citation data before 2018 and applied a four-layer BP neural network to predict the 5-year citations of nearly 50,000 papers. They chose the Adam optimizer and ReLU activation function to train the model, and L2 regularization was used to prevent overfitting. The mean squared error (MSE) and \(R^{2}\) for the test datasets showed that the BP neural network outperformed other algorithms, such as SVR, KNN, RF and XGBoost. Weihs and Etzioni (2017) used RF and gradient boosted regression trees (GBRT) with more than 20 features to predict the author h-index with a dataset of four million computer science papers written by approximately 800,000 authors. Compared with Acuna’s model (Acuna et al., 2012), the best prediction accuracy rates for 5 and 10 years increased by 24.8% and 50.6%, respectively. Mistele et al. (2019) trained a feedforward neural network to predict authors’ citations and h-indices based on arXiv publications in the field of physics, and they claimed that the \(R^{2}\) of the 10-year prediction was higher than that of Weihs and Etzioni’s method, although the two studies used different datasets.
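The supervised setting described above can be illustrated with one of the listed algorithms, KNN regression, written in plain Python. This is a minimal sketch and not the setup of any cited study: the two features (first-year citations and first author h-index), the training values and the choice k = 2 are hypothetical.

```python
import math


def knn_predict(train_features, train_targets, query, k=3):
    """Predict the target of `query` as the mean target of its k nearest
    training samples under Euclidean distance."""
    ranked = sorted(zip(train_features, train_targets),
                    key=lambda pair: math.dist(pair[0], query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical training set: (first-year citations, first author h-index)
# mapped to the citation count after five years.
features = [(1, 5), (2, 8), (10, 20), (12, 25)]
five_year_citations = [4, 6, 40, 50]

print(knn_predict(features, five_year_citations, query=(11, 22), k=2))
```

In practice the features would be scaled before computing distances, since indicators such as citation counts and h-indices live on different ranges.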

Fig. 8 The process of machine learning methods in scientific impact prediction

The citation counts of papers and the h-indices of authors follow a heavy-tailed distribution, which may skew the prediction (Dong et al., 2016). Error metrics for continuous loss functions are difficult to interpret (Fu & Aliferis, 2010). Instead of predicting the actual future citations, many researchers therefore regard scientific impact prediction as a classification problem. Recognizing highly cited papers is often defined as binary classification, and the classification labels are set in advance. Fu and Aliferis (2010) used an SVM with a heterogeneous polynomial kernel to develop a binary classification model to predict whether an article would exceed T citations (T = 20, 50, 100 and 500) within 10 years, and the AUC ranged from 0.86 to 0.92. Wang et al. (2019c) collected more than twenty features and utilized three feature selection techniques to reduce redundant features. They defined three classes, i.e., whether a paper was highly cited, medium cited or low cited after 3 years, and applied naive Bayes (NB), KNN and RF to the selected features. The experiment showed that all the average classification accuracies were above 0.9, but their dataset was relatively small, with only 617 articles. Bhat et al. (2015) defined a 2-class task (whether the paper was zero cited or not) and a 3-class task (0, 33rd and 66th percentiles of the citation distribution) and applied NB, SVM, RF and boosted trees to a large dataset with over 800,000 papers; the best classifier (RF) yielded an accuracy of 0.87. In addition, Nie et al. (2019) formalized a binary classification task with five different classifiers, i.e., KNN, RF, SVM, GBDT and XGBoost, to predict whether a given young scholar would be a rising star in the future. The label information was based on the increment of an impact score calculated from the quality of citing papers and the influence of coauthors, and the best performance achieved an F1 score close to 0.8.
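Turning citation counts into class labels and scoring a classifier can be sketched as below. The threshold rule mirrors the "exceeds T citations" labeling described above, and the F1 score is computed from its standard definition; the function names, sample counts and predictions are hypothetical.

```python
def label_highly_cited(citation_counts, threshold):
    """Binary labels: 1 if a paper exceeds `threshold` citations, else 0."""
    return [1 if count > threshold else 0 for count in citation_counts]


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 10-year citation counts and a classifier's predictions.
truth = label_highly_cited([35, 4, 120, 22, 9], threshold=20)
predicted = [1, 0, 0, 1, 1]
print(truth, round(f1_score(truth, predicted), 3))
```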

Unsupervised learning

Unsupervised learning is a kind of machine learning where the data are unlabeled, and the dataset can be classified according to the similarity between samples. Paper citations are highly random and the evolution and dynamics of a scholar throughout the scholar’s whole academic career are also different. Therefore, it is difficult to characterize the dynamics of scientific impact. Inspired by the above problems, some studies divided paper citations or scholars into different types and then predicted the impact of different groups. Cao et al. (2016) found L previously published papers that matched the citation dynamics of the test paper with the smallest matching error by calculating the Euclidean distance and then clustered them into K clusters by fitting a Gaussian mixture model, which can obtain K possible trends of the paper’s future citations and probabilities. Panagopoulos et al. (2017) used the evolution of author features (i.e., productivity, impact and collaborative indicators) over time as the input to K-means, which clustered the authors into seven categories and the “rising stars” cluster can be detected through the biggest improvement over time across all of the key performance indicators.
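The clustering step can be illustrated with plain K-means (Lloyd's algorithm). The sketch below is not the pipeline of Panagopoulos et al. (2017): it uses two hypothetical features per author (yearly growth in papers and in citations), two clusters instead of seven, and hand-picked initial centroids so that the result is deterministic.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then
    move each centroid to the mean of its members, until nothing changes."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(dim) / len(members)
                                     for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical authors: (yearly growth in papers, yearly growth in citations).
authors = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
labels, centroids = kmeans(authors, initial_centroids=[(1.0, 1.0), (9.0, 9.5)])
print(labels, centroids)
```

The cluster whose centroid shows the largest growth across indicators would correspond to the "rising stars" group in the approach described above.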

Deep learning-based methods

Apart from traditional machine learning methods, in recent years, deep learning has shown outstanding performance in various research fields, such as computer vision and NLP. Deep learning is not only an effective method for the prediction but also a powerful tool for feature extraction. Studies have shown that feature design based on deep learning effectively improves the feature extraction of academic entities in terms of metadata text and network structure (Ma et al., 2021).

RNN and its variants

The citations of papers have time characteristics, and it is more accurate to preserve this information as a time series. Researchers have attempted to learn predictive models based on citation sequence patterns with early information through deep learning structures such as RNNs, LSTMs or gated recurrent units (GRUs) (Abrishami & Aliakbary, 2019; Wen et al., 2020; Yuan et al., 2018) that are mainly suitable for long-term citation prediction. Yuan et al. (2018) proposed a many-to-one model with two-layer LSTM units and integrated four major phenomena: the intrinsic quality, represented by the citation count serving as the input of the model; the aging effect and the Matthew effect, which can be modeled by the forget gate and update gate, respectively; and the recency effect, which can represent current working memory. They used the citation data from the five years after a paper was published as training data to predict the citations in the next five years, and their model outperformed traditional machine learning algorithms such as SVR and LR. Abrishami and Aliakbary (2019) designed a many-to-many RNN architecture with only the early citations as the input sequence. They conducted experiments with different input sequence lengths (0 < k < 7), and the results showed that the model outperformed the baselines when the input sequence length k ≥ 3, but they ignored many other time series features. Ma et al. (2021) used the doc2vec algorithm to encode metadata text and developed a Bi-LSTM model with an attention mechanism to extract paragraph-level semantic information, which was combined with an early citation vector as input; predictions for the next 8 years were then produced through two fully connected layers. The prediction accuracy was higher than that of GBRT and XGBoost. 
The above studies also found that the prediction performance improves as the length of the early citation sequence increases, which may be due to the fact that a longer citation history contains more information. However, requiring a longer history reduces the timeliness of prediction in some research fields (e.g., computer science), especially for newly published papers.
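The trade-off between input length and timeliness is visible already in the data preparation step. The sketch below does not implement an RNN; it only shows, under hypothetical names and toy data, how yearly citation histories are split into an input sequence of the first k years and a target sequence of the following years, and how papers that are too young drop out as k grows.

```python
def make_sequence_pairs(yearly_citations, k, horizon):
    """Split each citation history into (first k years, next `horizon` years).
    Histories shorter than k + horizon years are skipped."""
    pairs = []
    for series in yearly_citations:
        if len(series) >= k + horizon:
            pairs.append((series[:k], series[k:k + horizon]))
    return pairs


# Hypothetical yearly citation counts since publication.
histories = [
    [1, 3, 5, 4, 2, 1],  # six years of history
    [0, 2, 6, 9],        # four years of history
    [0, 2],              # recently published paper
]
for k in (1, 3):
    print(k, make_sequence_pairs(histories, k=k, horizon=3))
```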

CNN-related methods

CNNs have been used to capture complex temporal patterns of citations, and they can automatically transform the features from the initial representation to a higher-level representation and learn the mapping from input to output. Xu et al. (2019) designed a CNN model with three convolution layers to capture the complex nonlinear relationships between the early network features and the final citation count to predict the long-term citation count of papers in the field of Markov chains, and the results showed that the \(R^{2}\) of the 5-year prediction can reach 0.9134, exceeding the comparison model by 5%. Wang et al. (2020) constructed an attention CNN model to predict paper citations. They used the doc2vec and word2vec algorithms to vectorize the paper text and merged the resulting vectors with journal and altmetric features to build a feature matrix. An attention layer was then added to focus on key features, and the model had a higher accuracy than LR and classification and regression tree (CART).

Graph-based methods

Graph-based methods are likely more effective because they consider information from the network structure. In an academic network, the papers, authors and venues are treated as nodes, and citation relationships and coauthor relationships are treated as edges. The network structure can be divided into homogeneous and heterogeneous structures, and different network structures are chosen according to different prediction tasks. The commonly used network types include citation networks, coauthor networks, author-paper networks, and paper-journal networks. Therefore, the first step is to construct a suitable academic network and then to realize the scientific impact prediction by mining the hidden relationships in the graph. Figure 9 shows the tasks and the corresponding network structures.

Fig. 9 Application of the graph-based methods

PageRank-like algorithms

Inspired by the web page ranking problem, PageRank-like algorithms are widely used in networks to rank academic entities. The more citations a paper has, the higher its PageRank value and the greater its influence. If a paper with a higher PageRank value cites other papers, then the PageRank of the cited papers will be higher. However, the citation network is special in that it is a temporal network, and PageRank is biased toward older papers, making it difficult to predict the impact of newly published papers without many citations. For the prediction of paper influence ranking, researchers have considered the time decay mechanism (e.g., exponential decay) and integrated it into the diffusive random walk process, which assigns more weight to recent papers, and the node scores can predict the future popularity of papers (Walker et al., 2007). Zhou et al. (2020b) modeled the real process on citation networks by assuming that the probability that a researcher will follow a paper’s references decays with each diffusion step, and the results showed that their model can effectively identify newly published papers that may become popular in the future. In addition, the ranking of a scholar’s future impact, especially the discovery of rising stars, is often modeled by the mutual influence of scholars in academic networks. In PubRank (Li et al., 2009), a weighted, undirected coauthor network was constructed, in which the edge weights represented the mutual influence calculated from the number of coauthored publications and the node weights were assigned according to the quality of a researcher’s publications. The score was formalized as:

$$PubRank(p_{i} ) = \frac{1 - d}{N} + d \cdot \sum\limits_{j = 1}^{\left| V \right|} {\frac{{\omega (p_{i} ,p_{j} ) \cdot \lambda (p_{i} ) \cdot PubRank(p_{j} )}}{{\sum\limits_{k = 1}^{\left| V \right|} {\omega (p_{k} ,p_{j} ) \cdot \lambda (p_{k} )} }}}$$
(3)

where \(\omega (p_{i} ,p_{j} )\) is the mutual influence, defined as the number of papers coauthored by \(p_{i}\) and \(p_{j}\) divided by the number of papers of \(p_{j}\), and \(\lambda (p_{k} )\) is the publication quality score. A series of PubRank scores over several years was calculated for each author, and scholars with larger PubRank score gradients were identified as rising stars.
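Eq. (3) can be turned into a short power-iteration sketch. The code follows the formula term by term, with two assumptions that the text does not specify: N is taken to be the number of authors in the network, and an author's own node is skipped in the sums because the mutual influence of an author on themselves is undefined. The authors, paper counts and quality scores are hypothetical.

```python
def pubrank(coauthored, n_papers, quality, d=0.85, iterations=200):
    """Iterate Eq. (3) from a uniform start.

    coauthored: {(author_a, author_b): number of joint papers}
    n_papers:   {author: total number of papers}
    quality:    {author: publication quality score (lambda)}
    """
    authors = sorted(n_papers)
    n = len(authors)

    def omega(i, j):
        # Joint papers of i and j relative to the output of j.
        shared = coauthored.get((i, j), coauthored.get((j, i), 0))
        return shared / n_papers[j]

    scores = {a: 1.0 / n for a in authors}
    for _ in range(iterations):
        updated = {}
        for i in authors:
            total = 0.0
            for j in authors:
                if i == j:
                    continue
                norm = sum(omega(k, j) * quality[k] for k in authors if k != j)
                if norm > 0:
                    total += omega(i, j) * quality[i] * scores[j] / norm
            updated[i] = (1 - d) / n + d * total
        scores = updated
    return scores


# Hypothetical network: A and B share two papers, B and C share one.
scores = pubrank(
    coauthored={("A", "B"): 2, ("B", "C"): 1},
    n_papers={"A": 2, "B": 3, "C": 1},
    quality={"A": 1.0, "B": 1.0, "C": 2.0},
)
print({a: round(s, 4) for a, s in scores.items()})
```

Running the function on yearly snapshots of the network and comparing each author's scores over time gives the score gradient used to flag rising stars.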

Based on PubRank, a variety of improved methods that considered the author order, coauthor citation and dynamic ranking of venues were proposed for calculating node and edge weight (Daud et al., 2013, 2017). PageRank-like algorithms are suitable for homogeneous networks that contain only one type of node and link relationship, e.g., paper citation networks or coauthor networks. However, the structure of a homogeneous network is simple, and it is possible to ignore important information.

PageRank + HITS algorithms

The academic network structure is highly complex and heterogeneous, and researchers have divided the heterogeneous academic network into multiple subnetworks based on the relationships between authors, papers, venues and institutions. For subnetworks of a single node type, PageRank is used to calculate the importance of the node, while for bipartite graphs, HITS is often used for calculation, and each subnetwork is calculated independently. Finally, the scores of each node are weighted and fused. FutureRank (Sayyadi & Getoor, 2009) was the first algorithm used to rank the future impact of papers; PageRank and HITS were used to integrate the paper’s PageRank value, author’s authority value and time weight in the iterative process. The formulation is shown as:

$$R^{p} = \alpha \cdot M^{C} \cdot R^{C} + \beta \cdot M^{{A^{T} }} \cdot R^{A} + \gamma \cdot R^{Time} + (1 - \alpha - \beta - \gamma ) \cdot \frac{1}{n}$$
(4)

where \(M^{C}\) is the citation matrix, \(M^{A}\) is the authorship matrix, \(M^{C} \cdot R^{C}\) is the PageRank score in the citation network, and \(M^{{A^{T} }} \cdot R^{A}\) is the authority score in the authorship network. \(R^{Time} = e^{ - \rho t}\) gives more favor to recently published papers.
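A single update of Eq. (4) can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: a citing paper spreads its score evenly over its references, the author term and the time term are each normalized to sum to one, the author scores are taken as given rather than updated, and the parameter values are arbitrary.

```python
import math


def normalize(values):
    """Scale a {key: value} mapping so that the values sum to one."""
    total = sum(values.values())
    return {k: v / total for k, v in values.items()} if total > 0 else dict(values)


def futurerank_update(paper_scores, author_scores, references, authors_of,
                      age_years, alpha=0.4, beta=0.1, gamma=0.4, rho=0.62):
    """One application of Eq. (4) to every paper."""
    papers = list(paper_scores)
    n = len(papers)

    # Citation term: each citing paper passes its score to its references.
    citation_term = {p: 0.0 for p in papers}
    for citing, cited_list in references.items():
        for cited in cited_list:
            citation_term[cited] += paper_scores[citing] / len(cited_list)

    # Author term: a paper collects the authority of its authors.
    author_term = normalize(
        {p: sum(author_scores[a] for a in authors_of[p]) for p in papers})

    # Time term: exponential decay favors recently published papers.
    time_term = normalize({p: math.exp(-rho * age_years[p]) for p in papers})

    return {p: alpha * citation_term[p] + beta * author_term[p]
               + gamma * time_term[p] + (1 - alpha - beta - gamma) / n
            for p in papers}


# Hypothetical network of three papers; p3 is new and not yet cited.
scores = futurerank_update(
    paper_scores={"p1": 1 / 3, "p2": 1 / 3, "p3": 1 / 3},
    author_scores={"a": 0.5, "b": 0.5},
    references={"p2": ["p1"], "p3": ["p1", "p2"]},
    authors_of={"p1": ["a"], "p2": ["a"], "p3": ["b"]},
    age_years={"p1": 5, "p2": 1, "p3": 0},
)
print({p: round(s, 4) for p, s in scores.items()})
```

In this toy example the uncited new paper p3 still receives the highest score because of the time term, which illustrates how the time weight offsets the bias of citation-based ranking toward older papers.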

MRFRank (Wang et al., 2014) improved the time weight design by taking the cited time into account and added text features into the network, which significantly improved the performance. However, these methods manually design time-aware weights that cannot model the dynamics of academic networks well. To solve this problem, a heterogeneous scientific hyper network framework (HSHMRR), consisting of seven subnetworks, was defined and combined with the learning-to-rank algorithm multiple additive regression tree (MART), which can capture the dynamic nature of academic networks (Zhang et al., 2018c). The experiments based on MAG showed that HSHMRR-MART outperformed FutureRank by 24%-29%. For author impact prediction, CocaRank (Zhang et al., 2016b) and ScholarRank (Zhang et al., 2016a) divided heterogeneous networks into a paper citation network, a paper-author network and a paper-journal network, which can contain more information and make the prediction of rising stars more reasonable.

The above graph-based methods directly used the weights of edges to construct the relationship matrix between entities. This retains the global structure information of the network but ignores the local structure information, which is also important for evaluating the influence of nodes in the academic network and whose omission restricts the prediction accuracy. In addition, different networks were simply merged into the random walk framework, so these methods cannot effectively learn heterogeneous network structure information and other information.

Network embedding and GNN

PageRank-based methods require a large number of calculations, which take considerable time when the number of nodes is large. In recent years, network embedding, which aims at learning low-dimensional latent representations of the nodes in a network, has been used to preserve local and global information and to improve efficiency. Xiao et al. (2019) provided a network embedding model that can take the global structure, local structure and text information into account simultaneously. They constructed a paper citation network with text information, a coauthor network and a paper-author network. They used KL divergence to describe the difference between the probability distribution of each node in the latent vector space and that of the node in the network. The subnetworks were combined to minimize the objective function, realizing the representation learning of the nodes, and the future impact of papers and authors was mutually ranked by integrating the learned embedding representations into a multivariate random-walk process. This model made full use of text information to help learn the potential similarities between papers and can learn better vector representations of newly published papers that lack citations. In addition, the academic network is dynamically evolving, with new papers, new authors and new links generated every year; therefore, it is important to rank their future influence in a dynamic graph. Generally, it is believed that there is a relationship between a paper’s citations and those of its neighbors (Holm et al., 2020). GNNs are suitable for structured prediction problems due to the neighborhood changes caused by the graph topology (Cummings & Nassar, 2020). Jiang et al. (2021) extended a relational GCN with a simple temporal alignment technique to learn the embeddings of metadata nodes in a dynamic heterogeneous information network, and this model can predict a new paper’s citation time series without leading citations.
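The KL divergence used as the embedding objective above compares two probability distributions over a node's neighbors. The sketch below only computes the divergence itself for toy distributions; how the two distributions are derived from edge weights and from embedding vectors is model specific and is not reproduced here. The distributions shown are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes.
    Terms with p_i = 0 contribute nothing; q_i must be positive where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over three neighbors of one node:
# observed from edge weights vs. implied by the current embeddings.
observed = [0.6, 0.3, 0.1]
from_embedding = [0.5, 0.3, 0.2]
print(round(kl_divergence(observed, from_embedding), 4))
```

Training an embedding model of this kind amounts to adjusting the node vectors so that this divergence, summed over nodes and subnetworks, becomes as small as possible.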

Model comparison and analysis

Comparison of different methods

Table 4 shows the advantages and disadvantages of different methods. Mathematical statistics models construct rigorous mathematical formulas, explain causality with the support of mathematical theories and discover the changing laws of scientific influence. Machine learning methods make full use of high-dimensional features and nonlinear relationships in scholarly big data to build models and obtain high prediction accuracy, while graph-based methods can utilize available structural information, such as the citation network and the author network, to enrich features and model the dynamic changes of academic entities. Different methods have their own advantages and application scenarios, but with the opening and accumulation of data, machine learning-based methods, especially deep learning-based methods, and graph-based methods will play a greater role in academic data mining and entity relationship mining.

Table 4 Advantages and disadvantages of different methods

Dataset

The datasets used for scientific impact prediction usually come from the digital libraries or academic search engines. Figure 10 shows the sources of datasets commonly used in the surveyed articles.

Fig. 10 The top 10 commonly used datasets for scientific impact prediction

These datasets are mainly divided into three types: (i) Subscription access. Digital libraries play an important role in the storage and acquisition of global academic information. To complete the prediction task, researchers often collect data for a certain subject area or for different journals from the WoS or Scopus database. These databases are multidisciplinary and can meet the needs of researchers in multiple fields, but they require a subscription, and the resulting datasets are often not publicly available. (ii) Free access. With the large-scale digitization and explosive growth of academic resources, free-to-use academic search engines have emerged, which can help researchers access online resources more easily. Therefore, researchers often extract data from academic search engines such as Google Scholar, MAG, Semantic Scholar and DBLP. However, they still need to retrieve data and build datasets manually, and the extraction of complex features, e.g., citation relationships, is often time-consuming. (iii) Public datasets. To better promote scientific progress, researchers and some data mining competitions have released the datasets they used, and the common public datasets are shown below.

AMiner Footnote 1

It is an academic data analysis and mining platform developed by Tsinghua University, containing more than 200 million articles. Several of its datasets have been published to support research tasks such as citation network analysis, expert discovery and name disambiguation.

APS Footnote 2

It comprises over 450,000 articles published in American Physical Society journals since 1893. It contains two data files: article metadata and citing article pairs.
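As a rough illustration of how a citing-pairs file can be turned into a prediction target, the sketch below counts citations per article from a list of pairs. The two-column (citing, cited) layout and the identifiers are assumptions made for the example, not a description of the actual APS file format.

```python
from collections import Counter


def citation_counts(citing_pairs):
    """Count how often each article is cited.

    citing_pairs: iterable of (citing_id, cited_id) tuples. This layout is
    an assumption about how a citing-pairs file might be organized.
    """
    # Deduplicate first so that a repeated pair is counted only once.
    unique_pairs = set(citing_pairs)
    return Counter(cited for _citing, cited in unique_pairs)


# Hypothetical identifiers, not real APS records.
pairs = [("A", "C"), ("B", "C"), ("B", "A"), ("B", "C")]
counts = citation_counts(pairs)
print(counts.most_common())  # [('C', 2), ('A', 1)]
```

Articles that are never cited simply do not appear in the counter, so looking them up returns 0.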

arXiv (hep-th) Footnote 3

This dataset was released for the KDD Cup 2003. It contains over 27,000 articles with 350,000 references on high energy physics from arXiv.

Semantic Scholar (L.weihs) Footnote 4

Semantic Scholar was established by the Allen Institute for Artificial Intelligence in 2015. From an initial collection of 3 million articles in the field of computer science, it has grown to include more than 200 million documents, covering 19 fields such as economics and management. Weihs and Etzioni (2017) compiled a dataset covering 1975 to 2015 for detailed analysis and extracted additional features, e.g., the h-index of various scholars.

Evaluation metrics

Due to the lack of a unified gold standard, it is challenging to evaluate the prediction results. We summarize the common evaluation metrics in Table 5.

Table 5 The common evaluation metrics for scientific impact prediction
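The contents of Table 5 are not reproduced here. As a hedged illustration only, the sketch below computes three error measures that are widely used when impact prediction is framed as regression on citation counts: mean absolute error (MAE), root mean squared error (RMSE) and the coefficient of determination (R²). The choice of these three metrics and the sample values are assumptions for the example.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical citation counts for four papers.
observed = [10, 0, 3, 25]
predicted = [8, 1, 3, 20]
print(mae(observed, predicted))   # 2.0
print(rmse(observed, predicted))
print(r2(observed, predicted))
```

Because citation counts are heavily skewed, a single highly cited paper can dominate RMSE, which is one reason several metrics are usually reported side by side.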

Model performance

Table 6 summarizes the performance of common prediction tasks and shows the best prediction results. For the model performance analysis, we select models that report specific dataset sizes and clear prediction results. Note, however, that because the studies use different datasets, the prediction results are not comparable even for the same prediction task.

Table 6 The performance of common prediction tasks

Open challenges and future research directions

Multisource data fusion

The continued growth of scientific corpora and the increasing importance of nontraditional literature have steadily enriched data sources. How to integrate additional data sources, such as preprints and commercialization data, is a current challenge (Weis & Jacobson, 2021). In addition, currently used datasets suffer from missing citation information (Zhang & Wu, 2021), and the same paper may have different citation counts in different literature databases. It may be beneficial to use external resources, such as Google Scholar, to collect citation data (Du et al., 2021). Furthermore, a range of multidimensional complex features can be used for scientific impact prediction. However, most current studies choose features that can be obtained in a relatively simple and fast way, which may result in the omission of some features (Yu et al., 2014). More indicators should therefore be considered to improve the performance of prediction models (Klemiński et al., 2021; Kong et al., 2020; Liu et al., 2020; Ma et al., 2021). For example, altmetrics have been explored for scientific impact prediction (Thelwall & Nevill, 2018), but their application is limited by data coverage (Drongstrup et al., 2020). It may be better to combine altmetrics with other features, e.g., citation data, and to draw altmetrics from several sources to prevent missing data (Du et al., 2021). Consequently, multisource data fusion methods (Zheng, 2015) are needed to process big scholarly data, address the problem of missing information and provide a reliable basis for scientific impact prediction. Deep learning can be used to learn hierarchical features of the data through unsupervised training (Zhang et al., 2018b), or it can be combined with broad learning (Zhang & Yu, 2018) to integrate different data sources and mine more valuable information. Fusing multisource data can help prevent researchers from focusing on optimizing citations and promote more accurate and objective predictions.
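The simplest form of the problem described above, one paper with different or missing citation counts across databases, can be sketched as follows. This is a deliberately naive merge and not one of the fusion methods cited in the text. The rule of keeping the largest available count (on the assumption that databases under-count because of coverage gaps) and the source names are illustrative assumptions.

```python
def fuse_citation_counts(sources):
    """Merge per-source citation counts into one record per paper.

    sources: {source_name: {paper_id: count or None}}, where None marks a
    paper that the source does not cover.
    Fusion rule (an assumption for this sketch): keep the largest available
    count and remember which sources contributed a value.
    """
    fused = {}
    for source_name, counts in sources.items():
        for paper_id, count in counts.items():
            if count is None:
                continue  # missing in this source; rely on the others
            entry = fused.setdefault(paper_id, {"count": 0, "sources": []})
            entry["count"] = max(entry["count"], count)
            entry["sources"].append(source_name)
    return fused


# Hypothetical sources and counts.
sources = {
    "db_a": {"p1": 12, "p2": None},
    "db_b": {"p1": 15, "p2": 4, "p3": 0},
}
fused = fuse_citation_counts(sources)
print(fused["p1"])  # {'count': 15, 'sources': ['db_a', 'db_b']}
```

Keeping the list of contributing sources makes it possible to flag papers whose value rests on a single database, which matters when coverage differs by field.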

Interpretability and stability

Machine learning methods have shown their effectiveness in this field and can obtain high accuracy. However, the prediction of scientific impact serves not only “prediction” purposes but also the exploration of the laws of development behind science and a deeper understanding and promotion of scientific research. Machine learning algorithms such as BP neural networks have a ‘black box’ nature and cannot be used to interpret the relationship between selected features and the number of citations (Ruan et al., 2020). Although statistical learning models can explain the relationship between input variables and scientific impact through correlation, they treat the variables as independent, an assumption that does not always hold (Thelwall & Nevill, 2018). Moreover, correlation itself is unstable: sample selection bias generates a large number of spurious associations, which leads to unexplainable and unstable models (Zhang et al., 2017). In addition, the datasets used by researchers are often limited to a certain field (Hu et al., 2020; Levitt & Thelwall, 2011; Yu et al., 2014), and the stability of models is unclear when they are applied to different disciplines or interdisciplinary fields (Bornmann & Daniel, 2010; Zhou et al., 2020b). In the future, causal inference (Cui et al., 2020) can be added to machine learning to remove the influence of confounding factors and select meaningful features, which would improve the generalization ability of prediction models and better explain the driving factors behind academic influence. To test generality and stability, further studies in various disciplines are needed (Kong et al., 2020; Wen et al., 2020).

Citation content analysis

Because citation motivations differ, citation behavior is highly heterogeneous. Some citations are deep, some are superficial, some are positive and some are negative, and there are even coercive citations and padded citations (Fong & Wilhite, 2017). Most current citation-based prediction studies treat all citations equally; thus, they capture neither the true intention of the citing author nor the differing influence of the citing paper (Giuffrida et al., 2019; Zhang et al., 2019). One possible solution is to divide citations into different levels according to the relevance between the cited and the citing paper (F. Zhang et al., 2019). Another approach is to design metrics that measure differences between citations. For example, citation strength, which is usually measured by the number of times a paper is cited within the same article, can be used to evaluate paper influence and author influence (Wan & Liu, 2014), and machine learning methods can be further applied to content-based citation strength estimation (Zhang & Wu, 2021). Moreover, content-based citation analysis has received widespread attention (Ding et al., 2014). With the development of NLP, researchers can obtain the sentiment value of a citation through semantic analysis of the citation text and achieve a deeper and more accurate understanding of the content of the paper (Porwal & Devare, 2020). In future research, transformer models such as BERT and RoBERTa can be applied to enhance prediction effectiveness (Ma et al., 2021; Wang et al., 2021a).
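The count-based notion of citation strength mentioned above (how many times a reference is mentioned within one citing article) can be sketched in a few lines. The sketch assumes numeric bracketed markers such as [2] or [1, 3]; author-year styles and ranges such as [1-3] are not handled, and the sample text is invented.

```python
import re
from collections import Counter


def citation_strength(full_text):
    """Count in-text mentions of each numbered reference.

    Assumes numeric bracketed markers such as [2] or [1, 3]. Ranges like
    [1-3] and author-year citations are outside this sketch.
    """
    strength = Counter()
    # Each match is the comma-separated content of one bracketed marker.
    for group in re.findall(r"\[(\d+(?:\s*,\s*\d+)*)\]", full_text):
        for ref in group.split(","):
            strength[int(ref)] += 1
    return strength


# Invented example text.
text = "Prior work [1, 2] used regression. We extend [2] and compare with [2, 3]."
strength = citation_strength(text)
print(strength[2])  # 3
```

Here reference 2 is mentioned three times and references 1 and 3 once each, so a strength-weighted count would treat the citation to reference 2 as carrying more weight than the others.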

Dynamic evolution of academic networks

The dynamic changes of scientific impact can be reflected in the academic network. Although current studies take time evolution into account, for example, by giving new papers more weight than old papers, they cannot sufficiently reveal the dynamics of academic influence. It would be better to model the process of network evolution (Kanellos et al., 2021). Link evolution and topological structure change in temporal complex networks provide insights for future work (Bütün & Kaya, 2019). In addition, network-based approaches must face the “cold start” problem (Zhou et al., 2021): newly published papers and new scholars have few links in the academic network. To predict the influence of new papers or scholars, it is possible to combine network-based methods with metadata such as topics, authors and institutions (Zhou et al., 2021). However, how to integrate additional information into academic networks and ranking frameworks is still a challenge (Zhang & Wu, 2021). Graph-based machine learning methods such as GNNs simultaneously consider the features of each node in the graph and the features of its neighbors, which can be used to address the cold start problem and predict the impact of new papers or new authors (Weis & Jacobson, 2021). Future research can pay more attention to embedding methods designed specifically for academic networks (Klemiński et al., 2021) and consider dynamic GNN methods (Skarding et al., 2021), which can represent both the structure and the timing information of the academic network and capture the dynamic evolution of academic influence.
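The idea that a GNN combines each node's own features with those of its neighbors can be illustrated with a single round of mean aggregation. This is a simplified sketch with no learned weights, not a trained GNN or any of the cited models. The node names, the two-dimensional features and the equal weighting of self and neighbors are assumptions for the example.

```python
def aggregate_once(features, neighbors):
    """One round of mean neighbor aggregation (no learned weights).

    features:  {node: [float, ...]} initial feature vector per node
    neighbors: {node: [node, ...]} adjacency lists
    Each node's new vector is the average of its own vector and the mean
    of its neighbors' vectors; isolated nodes keep their own vector.
    """
    updated = {}
    for node, own in features.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            nbr_mean = [
                sum(features[n][i] for n in nbrs) / len(nbrs)
                for i in range(len(own))
            ]
        else:
            nbr_mean = own
        updated[node] = [(o + m) / 2 for o, m in zip(own, nbr_mean)]
    return updated


# Hypothetical graph: a new paper with no citations yet, linked to the two
# papers it references. Features: [citation signal, topic score].
features = {"new_paper": [0.0, 1.0], "ref_1": [4.0, 1.0], "ref_2": [2.0, 0.0]}
neighbors = {
    "new_paper": ["ref_1", "ref_2"],
    "ref_1": ["new_paper"],
    "ref_2": ["new_paper"],
}
updated = aggregate_once(features, neighbors)
print(updated["new_paper"])  # [1.5, 0.75]
```

Even though the new paper starts with a citation signal of zero, it inherits information from the papers it is linked to, which is the intuition behind using neighbor features for cold start cases.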

Benchmarks and evaluations

Constructing a unified baseline helps new research build on previous findings. Machine learning methods require standards to assess the effectiveness of models. Most current studies only compare different ML algorithms to obtain the best model and lack comparison with other papers (Akella et al., 2021; Brizan et al., 2016). Very few papers compare their results with other scholars’ models, and because those models are based on different datasets, the advantages of the proposed method cannot be reflected well (Mistele et al., 2019). Without a standardized dataset, it is difficult to form a unified evaluation standard and to compare different methods. Many forecasting methods claim to have high accuracy, but they are not effective when applied to different datasets (García-Pérez, 2013). Building benchmarks for scientific impact prediction is an important problem that urgently needs to be solved (Bai et al., 2020). In the future, more evaluation experiments are needed, and they should be reproducible on datasets of different sizes (Liu et al., 2020). We should actively promote data sharing, strengthen collaborations between researchers and publishing institutions, standardize data and policies for open sharing, increase researchers’ enthusiasm for data sharing, and work together to build benchmarks that promote the rapid development of this research field.

Conclusion

For years, researchers have been attempting to find models that can accurately predict the future impact of academic papers, scholars, publication venues and institutions. In this paper, we conducted a comprehensive review of the literature on predicting future scientific impact, focusing on prediction tasks, features and methods. There have always been many controversies and prejudices about how to measure academic influence. It has been argued that these predictive models will only serve to perpetuate existing academic biases (Chawla, 2021). Therefore, scientific impact prediction should first focus on providing more choices for scientific research and helping researchers discover, in advance, directions that will have greater influence in the future. Second, it can be used to identify the driving forces of science and develop predictive models that capture the evolution of technological innovation and better accelerate it. Third, analyzing the mechanisms behind successful scientific impact prediction, rather than focusing only on the predicted results, can help design policies that improve the scientific enterprise (Fortunato et al., 2018).