Introduction

Conducting a literature review, or survey, is a critical part of research. As the literature continues to grow and as scholars continue to move across disciplines, synthesizing and highlighting existing findings becomes increasingly important. At the same time, it has become increasingly difficult to identify even a slice of the relevant papers for a given topic (Tsafnat et al. 2013). The problem is that this curatorial process does not scale well. It is expensive in both time and human effort. The advent of Big Scholarly Data—the availability of data around published research and the techniques and resources to process it—has led to a flurry of activity in finding automated ways to help with this problem (Ammar et al. 2018; Beel et al. 2016; Kanakia et al. 2019; Ronzano and Saggion 2015; Williams et al. 2015; Zitt et al. 2015).

Many methods have been developed to recommend relevant papers, using features related to textual similarity, keywords, and structural information such as relatedness in a citation network (Beel et al. 2016). However, a common problem in developing and evaluating these methods is a lack of ground truth. We don’t know whether our methods are actually selecting relevant papers or topics. This is a general problem in recommender research, but especially so for scholarly papers, given the specialized knowledge needed to evaluate quality and relatedness.

In this paper, we present an approach to this problem that leverages the references in existing review papers as an approximation to ground truth. We assume that references in a review represent at least a subset of papers relevant to a given topic. Using this abundant labeled data within the thousands of reviews in the literature, we are able to frame the collection of a literature survey as a supervised learning problem. Within this supervised framework, we are able to evaluate, at least to some degree, the quality of methods aimed at automatically synthesizing scientific knowledge.

With this framework in place, we demonstrate how supervised learning models can be used to identify relevant papers for a review, deriving features from the metadata associated with an article. These features include citations and the groups of papers obtained by clustering citation networks (Fortunato 2010). They also include text features derived from the similarity of paper titles. However, any set of related features (authors, disciplines, etc.) could be incorporated.

Using the reference list from a single review article as a benchmark, we develop methods for recapturing those references automatically using the features noted above (“Application to a single review article” section). We then extend this method beyond one review article and apply the methods to a large group of review articles (“Large-scale study on multiple review papers” and “Extended analysis” sections). Finally, we apply the methods to identify relevant papers in the fields of science communication and misinformation studies. We invite domain experts to validate our results (“Exploring scientific fields using automated literature review” section). We make code and sample data for this project freely available at https://github.com/h1-the-swan/autoreview.

The main contribution of this work is a novel framework for constructing and evaluating automated methods for generating references for literature surveys at a large scale. This work builds off of a BIRNDL workshop paper presented at SIGIR 2019 (Portenoy and West 2019). We have extended this work in several ways: running thousands of experiments to assess how the methods perform using various review articles, sets of features, and data splits; expanding the background literature review; and reporting results from expert feedback on our exploration of new fields.

Background

There have been several previous attempts at automated or semi-automated literature surveys (Belter 2016; Chen 2012; Janssens and Gwinn 2015; Jha et al. 2013; Silva et al. 2016). These approaches have tended to be smaller in scale and to rely on more qualitative means of evaluation, which are difficult to replicate and compare across studies. For example, Chen (2012) developed a system to aid in writing literature reviews, which was evaluated by having first-year graduate students use it to help them write and submit papers for publication. These student-submitted papers had a high acceptance rate, and one student won a best paper award. This evaluation approach, while creative and compelling, does not scale well. Another study, by Silva et al. (2016), applied community detection to citation networks to map papers in two different topics and then applied text analytics to generate taxonomies of terms. This approach allowed for detailed analysis of how subtopics are related within a field, but it relied on keyword searches, which can be an insufficient method of identifying all relevant articles (Greenhalgh and Peacock 2005; Horsley et al. 2011; Larsen et al. 2019).

Recent work has explored the use of review articles as a way of testing automated literature review systems. Belter (2016) used a semi-automated technique to retrieve documents for systematic reviews using citations. Janssens and Gwinn (2015) used co-citation and direct citation networks to identify eligible studies for existing biomedical systematic reviews, starting from one or two known articles. These methods have begun to be used in helping to create new systematic reviews (e.g., Albarqouni et al. 2017). Other studies have used active learning approaches to reduce the workload associated with selecting relevant articles for systematic reviews in the domains of medicine and public health (Miwa et al. 2014; Wallace et al. 2010), law (Cormack and Grossman 2014), and software engineering (Yu et al. 2018; Yu and Menzies 2019).

Automatically identifying papers for surveys is closely related to the more general problem of recommending papers, a topic that has been studied extensively both within and outside big scholarly data. A recent survey of research paper recommender systems (Beel et al. 2016) identified more than 200 articles on the topic published since 1998. That survey notes that the majority of approaches use keywords, text snippets, or a single article as input. Our approach, in contrast, starts with a set of seed papers, which is then expanded upon. Our approach also has the distinction of being able to make use of any combination of features, enabling us to use both textual and network-based signals. Some previous work has built recommender systems that combine text and citation information (Gupta and Varma 2017; Kong et al. 2018); these take a different approach, using embeddings to characterize similarity between articles.

Recent research on automated methods for literature review applies newly available data and computational power to a perennial and worsening problem: the need to organize large bodies of research and the difficulty of doing so. The need for efficient literature review, and especially systematic review, is felt most strongly in medicine, but it extends to all areas of science (Bastian et al. 2010; O’Mara-Eves et al. 2015; Tsafnat et al. 2013). In our work, we aim to provide a framework that supports this research by offering a way to develop and test literature review generation and recommendation at a large scale.

Data and methods

Data

The network data used in our analysis come from a recent snapshot of the Web of Science (WoS) citation index consisting of 1,269,262,278 directed citation links between 163,830,918 papers. The data set contains paper-level metadata, such as titles, abstracts, publication dates and venues, and authors. We used WoS because it is one of the most comprehensive bibliographic datasets, covering a large number of articles across most scientific fields. WoS also identifies certain articles as review papers, which was convenient for this project.

We removed some papers from the full data set. In order to reduce the network to a size that we could cluster (see “Features” section), we removed all papers that had no outgoing citations, and any paper that was only cited once (many of these actually appeared to be placeholder data, for which WoS could not fully identify the cited paper). We also removed papers which were missing all metadata, such as publication year and title. This cleaned data set had 55,271,946 papers, and 1,020,164,414 directed citation links.

Identifying candidate papers and setting up the supervised learning problem

Our procedure is presented in Fig. 1.

Fig. 1

Schematic of the framework used to collect data for development and testing of a supervised literature review classifier. a Start with an initial set of articles (i.e., the bibliography of an existing review article). b Split this set into seed papers (S) and target papers (T). c Collect a large set of candidate papers (C) from the seed papers by collecting in- and out-citations, two degrees out. Label these papers as positive or negative based on whether they are among the target papers (T). d Split the candidate papers into a training set and a test set to build a supervised classifier, with features based on similarity to the seed papers (S)

The first step is to randomly split the papers into a set of “seed” papers and a set of “target” papers. We are imagining a researcher who is starting with a set of papers relating to a topic (the seed papers). This researcher wants to expand this set to find the other relevant and important papers in the topic. The target papers can be thought of as the set of papers the researcher has not yet included. Ideally, we would like to search for these target papers within the total set of papers in our data set. However, it is infeasible to generate features and train models using the total set of 55 million papers. To narrow the total set to a more reasonable number of candidate papers, we collect all of the papers that have either cited or been cited by the seed papers. We then go one more degree out, taking all of the papers that have cited or been cited by all of those. We follow a second degree of citations because following direct citations is often not sufficient to identify all relevant literature (Janssens and Gwinn 2015; Robinson et al. 2014). This process of following in- and out-citations imitates the recommended practice for a researcher looking for papers to include in a survey, but at a larger scale (Webster and Watson 2002). The resulting set of papers, while large (generally around 500K to 2M), is manageable enough to work with. We have found that this method, using different samples for the seed papers, reliably generates sets of papers that contain all or nearly all of the target papers (see “Results” section and Table 2). We label each candidate paper positive or negative depending on whether it is one of the target papers. The goal is to identify the positive (target) papers among the many candidate papers. At this point, we split the candidate papers into training and test sets in order to build classifiers.
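
To make this procedure concrete, the following is a minimal sketch of the seed/target split and the two-degree candidate collection, assuming the citation network is held in a networkx DiGraph with an edge from each citing paper to each cited paper. The function and variable names are illustrative and are not the API of the released autoreview package.

```python
import random
import networkx as nx

def split_seed_target(references, seed_size=50, random_seed=0):
    """Randomly split a review's references into seed and target sets."""
    rng = random.Random(random_seed)
    refs = list(references)
    rng.shuffle(refs)
    return set(refs[:seed_size]), set(refs[seed_size:])

def collect_candidates(citation_graph: nx.DiGraph, seed_papers, degrees=2):
    """Follow in- and out-citations `degrees` steps out from the seed papers."""
    frontier = set(seed_papers)
    candidates = set(seed_papers)
    for _ in range(degrees):
        next_frontier = set()
        for paper in frontier:
            if paper in citation_graph:
                next_frontier.update(citation_graph.successors(paper))    # out-citations
                next_frontier.update(citation_graph.predecessors(paper))  # in-citations
        frontier = next_frontier - candidates
        candidates |= next_frontier
    return candidates - set(seed_papers)

# Each candidate is then labeled positive if it is a target paper, negative otherwise:
# labels = {p: (p in target_papers) for p in candidates}
```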

Features

Our next step is to generate features to use in a classification model. One of our features is based on citation communities. The structure of the citation network, in which nodes represent papers and edges are directed citations between them, encodes information about the relatedness of papers. Citation communities are groups of papers that tend to have more connections within the community than outside it. To extract these communities, we used Infomap to cluster the citation network (Rosvall and Bergstrom 2008). Infomap is a well-known unsupervised community detection algorithm based on principles of information compression. Identifying clusters in a network of tens of millions of documents is computationally expensive, so we developed a two-step approach to cluster the full network.Footnote 1 In the first clustering step, we identified a non-hierarchical clustering of the full network using a parallelized version of Infomap (Bae et al. 2013). This process took 5.3 hours on a machine with 32 cores and identified 5,513,812 clusters. In the second step, we further processed these clusters to identify hierarchical structure, which the parallelized version of Infomap cannot do. We wanted to identify this hierarchy because the structure of science tends to be hierarchical, with smaller communities nested within broader ones. To do this, we used Infomap combined with Apache Spark to further cluster all of the top-level clusters with at least 100 nodes into multi-level, non-overlapping clusters. This second step took about 30 minutes on the same machine. The final clustering had 9,390,263 bottom-level clusters, with a maximum depth of 11 and an average depth of 2.9 (std 0.77).Footnote 2
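
As a simplified illustration of the clustering step, the sketch below runs the infomap Python package on a toy edge list in a single pass. The actual pipeline described above uses a parallelized Infomap run followed by hierarchical re-clustering with Apache Spark, so this is only a small-scale approximation of that process.

```python
from infomap import Infomap  # pip install infomap

# Toy directed citation edges (citing_id, cited_id); integer node ids assumed.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

im = Infomap("--directed")  # multi-level clustering by default
for citing, cited in edges:
    im.add_link(citing, cited)
im.run()

# Hierarchical cluster path for each paper, e.g. (2, 1) = submodule 1 of module 2.
cluster_paths = {node.node_id: node.path for node in im.tree if node.is_leaf}
print(cluster_paths)
```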

To incorporate the citation clustering information into classification models, we calculate the average cluster distance between a paper and the seed papers. Distance for two papers i and j is defined as

$$\begin{aligned} \frac{D_i + D_j - 2D_{LCA}}{D_i + D_j} \end{aligned}$$
(1)

where \(D_i\) and \(D_j\) represent the depth in the clustering tree hierarchy of i and j, and \(D_{LCA}\) represents the depth of the lowest common ancestor of the two papers’ clusters (Djidjev et al. 1991).Footnote 3 The feature for paper i is the average distance to each of the seed papers. We also use PageRank as a measure of citation-based importance (Page et al. 1999).
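
A small sketch of how the distance in Eq. (1) can be computed from the hierarchical cluster assignments, assuming each paper's cluster is represented as a path of cluster ids from the root of the tree (this representation is an assumption; the released code may store clusters differently):

```python
def cluster_distance(path_i, path_j):
    """Tree distance of Eq. (1). Paths are tuples of cluster ids from the root
    to the paper's bottom-level cluster, e.g. (3, 1, 4) has depth 3."""
    d_i, d_j = len(path_i), len(path_j)
    # Depth of the lowest common ancestor: length of the shared prefix.
    d_lca = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        d_lca += 1
    return (d_i + d_j - 2 * d_lca) / (d_i + d_j)

def avg_cluster_distance(paper_path, seed_paths):
    """Feature value for one candidate paper: mean distance to the seed papers."""
    return sum(cluster_distance(paper_path, p) for p in seed_paths) / len(seed_paths)

# Example: two papers in sibling sub-clusters of the same top-level cluster.
print(cluster_distance((3, 1, 4), (3, 1, 7)))  # (3 + 3 - 2*2) / 6 = 0.333...
```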

In addition to these network-based features, we add a simple text-based feature: the average cosine similarity between the TF-IDF vector of a paper's title and those of the seed paper titles. We include this feature to demonstrate how structural (network) and content (text) features can be combined in one model and compared against models using only one or the other. In the “Extended analysis” section we extend our analysis to include GloVe word embeddings and add publication year as an additional metadata feature. Many other features could be used, including those related to citation or coauthorship patterns, paper text, venue of publication, or any other paper attribute that could help identify similarity to the seed papers. Future work will continue this approach, exploring these features and how they affect the models’ ability to reconstruct the review papers’ reference lists. Code and sample data for specifying the features used here are available at https://github.com/h1-the-swan/autoreview.
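
A minimal sketch of the title-similarity feature using scikit-learn; the vectorizer settings here (e.g., English stop words) are illustrative choices rather than the exact configuration used in the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def title_tfidf_similarity(candidate_titles, seed_titles):
    """Average cosine similarity of each candidate title's TF-IDF vector
    to the TF-IDF vectors of the seed paper titles."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on all titles so candidates and seeds share one vocabulary.
    tfidf = vectorizer.fit_transform(list(seed_titles) + list(candidate_titles))
    seed_vecs = tfidf[: len(seed_titles)]
    cand_vecs = tfidf[len(seed_titles):]
    # Shape (n_candidates, n_seeds); the mean over seeds gives one feature per candidate.
    return cosine_similarity(cand_vecs, seed_vecs).mean(axis=1)

sims = title_tfidf_similarity(
    ["community detection in networks", "random graphs and clustering"],
    ["community structure in graphs", "modularity and community detection"],
)
print(np.round(sims, 3))
```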

Results

Application to a single review article

To illustrate how the autoreview process works on a single review article, we use a review article on community detection in graphs (Fortunato 2010). We chose this paper because we are familiar with the topic and could therefore inspect the plausibility of the results. The paper represents a comprehensive review of the topic up to the year of publication (2010). This paper has 262 linked references in our data. We apply the autoreview method using a seed set of 50 papers, randomly sampled from the references. This set of “seed papers” can be thought of as the small set of papers that our imagined researcher above starts with. The remaining 212 papers are “target” papers that we would like to identify.

Table 1 Results for autoreview performed on a single review paper, for five different initial random splits of the references into 50 seed papers and 212 target papers

Table 1 shows the results from five splits, each using a different random seed. The “random seed” is an integer that the sampler uses as a starting point; each random seed leads to a different split into seed and target sets. Running the process multiple times allowed us to see how the whole system varied when the initial seed papers changed but the review article remained the same. We stopped after five runs because generating candidate sets and training models is computationally expensive on such large candidate sets, and because we wanted to focus our efforts on learning how the system would perform with other review articles (see “Large-scale study on multiple review papers” section).

For each run, we split the 262 papers into a set of 50 seed papers and 212 target papers. After collecting candidate papers, we cleaned the data by removing the seed papers, papers for which we did not have titles, and papers published after the year the review paper was published (2010). Each seed (i.e., each row of Table 1) represents one instance of the process in Fig. 1. We report the number of candidate papers in the final set for each run. These sets of candidate papers range in size from 500K to 1.4M papers. In each case, only (at most) 212 of these papers are in the positive class. This parallels the experience of a researcher trying to do an effective survey of a topic: the goal is to find the right papers in a large body of literature that can feel overwhelming. With respect to these candidate sets, this method achieves very high recall: 98.3% on average (standard deviation 0.258 percentage points).

After identifying seed papers, target papers, and candidate papers, we generated features for each candidate paper and labeled them according to whether or not they were among the targets. We then split the candidate papers into a training and test set, and trained classifiers to try to identify the targets. We experimented with a variety of classifiers: logistic regression, SVC (support vector classifier), SGD classifier (stochastic gradient descent), Gaussian naive Bayes, random forest, and AdaBoost (Murphy 2008). Many of these gave poor performance or ran too slowly, so we proceeded with only random forest, logistic regression, and AdaBoost, selecting the best-performing model for each data set.Footnote 4
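
The sketch below illustrates this model-selection step with scikit-learn on synthetic data standing in for the candidate feature matrix; the hyperparameters, and the use of class_weight to handle the extreme class imbalance, are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the candidate papers' feature matrix and labels
# (features: cluster distance, PageRank difference, title similarity, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
y = (rng.random(5000) < 0.01).astype(int)  # ~1% positives, mimicking the imbalance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, class_weight="balanced"),
    "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "adaboost": AdaBoostClassifier(),
}

ranked = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Rank test-set candidates by predicted probability of being a target paper.
    scores = model.predict_proba(X_test)[:, 1]
    ranked[name] = np.argsort(-scores)
```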

Table 1 reports evaluation measures for each of these five models, as well as their averages. These include the Precision at 10, 100, and 1000, the R-Precision, and the Average Precision. The R-Precision is the fraction of target papers found in the top N papers, where N is the total number of target papers—212 in this case (Manning et al. 2008). The Average Precision is the sum of the precision at k for each rank k of a correctly identified target paper, divided by the total number of target papers. Since the fundamental task is to reconstruct the reference list from the original review paper, we focus our analysis mostly on the R-Precision scores, which characterize exactly how well the models were able to do this (see “Large-scale study on multiple review papers” section for more discussion on the evaluation measures).
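
For concreteness, both measures can be computed directly from a ranked list of binary labels (1 for a target paper, 0 otherwise); this is a minimal sketch, independent of the paper's released evaluation code.

```python
import numpy as np

def r_precision(ranked_labels, n_targets):
    """Fraction of target papers found in the top N ranked candidates,
    where N is the total number of target papers."""
    return np.sum(ranked_labels[:n_targets]) / n_targets

def average_precision(ranked_labels, n_targets):
    """Sum of precision@k at each rank k of a correctly identified target,
    divided by the total number of target papers."""
    ranked_labels = np.asarray(ranked_labels)
    hit_ranks = np.flatnonzero(ranked_labels) + 1              # 1-based ranks of hits
    precisions_at_hits = np.cumsum(ranked_labels)[hit_ranks - 1] / hit_ranks
    return precisions_at_hits.sum() / n_targets

# Candidates sorted by classifier score; 1 = target paper, 0 = not a target.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(r_precision(labels, n_targets=4), average_precision(labels, n_targets=4))
```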

Using two network-based features—the average distance between a paper’s cluster and those of the seed papers, and the absolute difference of a paper’s PageRank score to that of the average seed paperFootnote 5—a random forest classifier identified, on average, 86 of the target papers (40.8%). We also ran the same experiments using the text-based feature of average paper title TF-IDF similarity to the seed paper titles in addition to the network-based features (see “Features” section). This improved the results: the random forest classifiers then identified, on average, 106 of the target papers (50.1%). In the “Appendix”, we include some examples of papers ranked by the classifier.

Large-scale study on multiple review papers

We now apply these same methods to multiple review papers. The Web of Science, as part of its citation index data, identifies a subset of articles as review papers (\(\sim 1.4\) million papers). We run autoreview on a sample of these reviews to explore how the method performs on a variety of review articles, under varying conditions. We first take a random sample of 500 review articles with between 200 and 250 references. We chose this sample as a starting point in order to hold the number of references relatively constant. We believe that review papers of this size represent the type of review for which this method would be useful—a comprehensive review of a well-defined topic. We also explore results on subsets of larger and smaller review articles in the “Extended analysis” section.

For each of these 500 review articles, we took the references, split them into seed and target sets, and tried to use features of the seeds to predict the targets. Table 2 shows summary statistics and results for these papers using the same procedure outlined in the previous section (“Application to a single review article” section). The “number of candidates” refers to the number of papers generated by following two degrees of citations in and out from the seed papers (5 random splits of seed/target for each review paper; 2500 candidate sets in total). These candidate sets are highly imbalanced, with the 150-200 target papers hidden among hundreds of thousands or millions of candidates; again, this is meant to mimic the predicament of the researcher searching for relevant papers in a large pool of candidates. The candidate sets have very high recall, generally including all of the target papers or missing just one or two (with a few outliers in which the citation-based method failed to capture many of the target papers).

Table 2 Summary statistics for the 500 review articles, including the number of references per review (i.e., the seed papers + the target papers to predict), the publication year, the number of candidates generated per initial split of the data, the overall recall for the candidate sets, and precision measures for two sets of features—the network features only (cluster distance and PageRank), and network features + TF-IDF similarity of titles

For each review article, we gathered the cited papers, and trained models for 5 different random seeds, corresponding to 5 different splits of the data into seed and target papers. We fixed the size of the seed set at 50 papers (in the following section, we explore the effect of varying the size of this seed set). We chose the best-performing model for each split—in most cases, this was a random forest classifier; however, in some instances, a logistic regression or AdaBoost classifier outperformed the random forest.Footnote 6

Again, we report the performance of the classifiers as the Precision (at 10, 100, 1000, and 10,000), the R-Precision, and the Average Precision (Manning et al. 2008). The overall goal is to reconstruct the list of held-out target papers from the reference set of the original review articles. Within this task, the relative rank of the different predictions is not especially important. Because of this, we focus the rest of our analysis on the R-Precision scores: the number of correctly predicted target papers among the top N papers, where N is the number of target papers, divided by N. However, looking briefly at the precision at k scores (the ratio of target papers identified at different ranks), we can see that the models do tend to rank relevant papers relatively high. For example, the models given network and text features had, on average, eight papers correctly predicted among the top ten, and 50 among the top 100. We also report the average precision over all target papers as an alternate measure of precision for all relevant documents. This measure is highly correlated with R-Precision (Pearson’s \(r=0.97\) across all models), so we focus on R-Precision from this point on for simplicity’s sake.

Figure 2 shows the distribution of R-Precision scores for 2500 classifiers (five classifiers for each of the 500 review articles, each one trained and tested on a different split of the article’s references). The figure shows the classifiers that were given both network (cluster and PageRank) and text (TF-IDF-based similarity of titles) features. The average score was 0.385 (standard deviation 0.105); the highest score was 0.792.

Fig. 2

Violin plot showing the distribution of R-Precision scores (number of correctly predicted target papers divided by total number of target papers) for 2500 classifiers, each trained on one of 500 different review articles. The violin plot shows a box plot in the center, surrounded by a mirrored probability distribution of the scores. The distribution is annotated with the titles of three review articles. The review article in the lower tail is one for which the classifiers did most poorly at predicting references (mean score: 0.14). The one in the upper tail is an example of a review paper whose classifiers performed best (0.65). The one in the middle, at the widest part of the distribution, is more or less typical of the review articles in our set (0.39)

Some of the worst performing review articles tended to be year-specific reviews, e.g., “Germanium : Annual survey covering the year 1972”. These reviews have temporal constraints that the classifiers did not learn well; publication date was not among the features available to them. Adding publication year to the feature set did improve performance somewhat. However, in future analysis, these year-specific reviews should be excluded if possible, as they represent a less typical case with a hard constraint on the potential references. Nevertheless, this type of review article accounted for only a portion of those in the lower tail, so it is only a partial explanation for the poor performance on these papers. The models tended to perform better with smaller candidate sets (Pearson’s \(r=-0.17\) for the relationship between candidate set size and score). This is likely because smaller candidate sets simply contain less noise. However, since the candidates are collected based on random splits of the data, it is not possible to exploit this in order to improve performance (i.e., by limiting the size of the candidate sets).

The analysis to this point has aggregated all review articles together; however, it could be the case that different types of review articles perform differently using these methods. One way to explore this is to look at the discipline of the review articles. Fig. 3 shows the same distribution of R-Precision scores as above, broken down by subject. We used the Web of Science subject labels for the review papers (taking the first one if there were multiple), and aggregated them into broad categories. Most of the reviews analyzed were in Medicine (202), Biology (122), and Natural Sciences (101).

Fig. 3

Box plots of the R-Precision scores for the 500 review articles by subject. 50 seed papers, network and TF-IDF title features. See text for discussion

Most of the subject groups did not perform significantly differently from each other, suggesting that it is no more difficult to predict the references of reviews in, for example, medicine than of those in the natural sciences. Some of the groups at either extreme did show statistically significant differences (e.g., engineering had higher scores than psychology/social sciences), but in general the differences between groups were modest at most (pairwise independent t-tests, Bonferroni corrected \(\alpha\) of 0.0024). It is interesting that we did not find any major differences between fields, given that findings in bibliometric research often do not generalize across fields.
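
For reference, this comparison can be reproduced with pairwise independent t-tests and a Bonferroni-corrected threshold, sketched below with SciPy; the subject labels and scores are placeholders, not the study's data.

```python
from itertools import combinations
from scipy import stats

def compare_subjects(scores_by_subject, alpha=0.05):
    """Pairwise independent t-tests between subject groups with a Bonferroni
    correction (alpha divided by the number of pairwise comparisons;
    e.g., 0.05 / 21 pairs is roughly 0.0024)."""
    pairs = list(combinations(sorted(scores_by_subject), 2))
    corrected_alpha = alpha / len(pairs)
    results = {}
    for a, b in pairs:
        t_stat, p_value = stats.ttest_ind(scores_by_subject[a], scores_by_subject[b])
        results[(a, b)] = (p_value, p_value < corrected_alpha)
    return corrected_alpha, results

# Placeholder scores; in the study, each group would hold the R-Precision
# scores of the classifiers trained on reviews in that subject.
example = {
    "Medicine": [0.38, 0.41, 0.35, 0.40],
    "Engineering": [0.45, 0.48, 0.44, 0.47],
    "Psychology/Social Sciences": [0.33, 0.36, 0.31, 0.35],
}
print(compare_subjects(example))
```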

Extended analysis

We now extend our analysis to explore how the methods perform under various conditions. The three categories of conditions we experiment with are the number of seed/target papers in the initial split of the review references, the features used by the models, and the number of references in the review papers.

Using the same sample of review papers as in the previous section (“Large-scale study on multiple review papers” section), we begin by varying the first two of these: the number of seed papers and the sets of features. We limit our analysis here to a subset of 100 of the previously used 500 articles. This was more computationally tractable, as each combination of seed size and feature set involves training models for five seed/target splits. Fig. 4 shows the R-Precision scores for 8 different sets of features and 5 different numbers of seed papers.

Varying features Each line in Fig. 4 represents the performance of classifiers using different sets of features to rank and identify target papers, with better performing feature sets on top. Using only the TF-IDF information for paper titles gave the worst performance (\(\sim 0.1\)). Using network features alone—either cluster information, or cluster information combined with the paper’s PageRank scores—resulted in somewhat higher scores than TF-IDF features (\(\sim 0.15\), a 50% improvement over TF-IDF). Combining network and text features, as we saw in the previous section, gave a large boost in performance, with scores around 0.4. Adding another feature from the paper metadata—the publication year—gave another boost, with scores around 0.6.Footnote 7 We believe that this improvement is because topics in science tend to be situated in a given period in time. By giving the model information about the publication years of papers, it is better able to identify the important papers in the field.

In order to test more sophisticated text features, we also explored models using title embeddings. For each paper title, we found the average word vector from 300-dimensional GloVe embeddings.Footnote 8 We used as a feature the cosine similarity between this vector and the mean of the title vectors for the seed papers. These features tended to perform very well; in fact, using embeddings alone absent any other features tended to give scores higher than most other sets of features that did not include embeddings. The best performing models we tested were ones that combined all types of features—word embeddings, network features, and publication year. These models had R-Precision scores around 0.81.Footnote 9
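
A rough sketch of this embedding feature using gensim's downloadable GloVe vectors follows; the specific pretrained model, the tokenization, and the handling of out-of-vocabulary words here are all assumptions rather than the paper's exact setup.

```python
import numpy as np
import gensim.downloader

# A 300-dimensional pretrained GloVe model; the particular release used in the
# paper is not specified here, so this choice is an assumption.
glove = gensim.downloader.load("glove-wiki-gigaword-300")

def title_vector(title):
    """Average GloVe vector of the in-vocabulary words in a title."""
    words = [w for w in title.lower().split() if w in glove]
    if not words:
        return np.zeros(glove.vector_size)
    return np.mean([glove[w] for w in words], axis=0)

def embedding_similarity(candidate_title, seed_titles):
    """Cosine similarity between a candidate title vector and the mean seed title vector."""
    cand = title_vector(candidate_title)
    seed_mean = np.mean([title_vector(t) for t in seed_titles], axis=0)
    denom = np.linalg.norm(cand) * np.linalg.norm(seed_mean)
    return float(cand @ seed_mean / denom) if denom else 0.0

print(embedding_similarity(
    "community detection in citation networks",
    ["community structure in graphs", "modularity and clustering"],
))
```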

Varying seed size Each point along the x-axis of Fig. 4 represents results from starting with different sizes of seed/target splits. For example, for each leftmost point, the autoreview process began for each of the 100 review papers by randomly splitting the 200-250 references into a set of 15 seed papers and 185-235 target papers, with the seed papers then used to generate the candidate sets. Again, this procedure was done with five different random seeds for each review, for each seed size (15, 25, 50, 100, and 150).Footnote 10

Intuitively, we might expect performance to increase along with the size of the seed set, since with more seeds the classifiers have more knowledge of how similar papers should look, and fewer target papers to predict. We do see this trend for some of the feature sets, for example network + TF-IDF and network + TF-IDF + publication year. Notably, for each of these, the scores for the smaller seed sets are only modestly lower than for the largest seed sets, which suggests that this method can perform fairly well even with only a handful of seed papers. On the other hand, some of the feature sets do not improve with more seed papers. The classifiers using title embeddings alone are the most extreme of these: these models actually perform best with the fewest seed papers, and performance decreases as the number of seed papers increases. While the reason for this is not entirely clear, it may be due to a tradeoff between having more seed papers, which gives the classifier more information to use, and having fewer target papers, which requires the classifier to rank the targets higher in order to achieve a high score.

Fig. 4

R-precision scores for autoreview, varying the number of seed/target papers, and the sets of features used. Each point represents the mean of the R-Precision scores for 500 models—5 each for different seed/target splits of the references of 100 review papers. The error bars represent 95% confidence intervals

Varying size of review papers Figure 5 shows the average R-Precision scores when starting with review articles whose reference lists vary in length. The medium-size articles are the same 100 as above, with a seed size of 50. The small review articles are a different set of reviews that have an average of 50 references, with 15 of these references used as the seed papers. The large reviews are another set that have on average 945 references, with a seed size of 50. These results are largely consistent with those above. Models with only network or only TF-IDF features all perform about the same, regardless of review paper size. For other feature sets, small review papers tend to perform better than larger ones, but this may be a function of the ratio of seed papers to target papers (as seen in many of the models in Fig. 4), rather than any inherent difference between these groups of review papers.

Fig. 5

Average R-Precision scores for different size review articles. The middle (red) bar for each feature set represents the average score for the same 100 review articles using the same procedure as in Fig. 4 (seed size 50). The other two bars in each group represent a different set of review articles, the left a set of 100 smaller reviews (50 references on average), the right a set of 100 larger reviews (945 references on average). Error bars represent 95% confidence intervals. (Color figure online)

Exploring scientific fields using automated literature review

The method we introduce can be adapted as a tool for exploring key papers in an emerging field. In this use case, it is the papers the classifier “misses” that we are interested in. The classifier, attempting to predict the target papers, assigns a confidence score to each of the candidate papers. We are interested in those candidate papers which received a high score, yet were not actually target papers. In the classic classification task, these would be considered misidentified, but in this task we consider the possibility that their similarity to the seed papers may make them relevant papers for this field. This is consistent with Belter’s suggestion of “supplement[ing] the traditional method by identifying relevant publications not retrieved through traditional search techniques” (Belter 2016). As a case study, we applied this method to papers in the emerging field of misinformation studies, which pulls research from psychology, risk assessment, science communication, computer science, and others.
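
In code, surfacing these papers amounts to ranking the candidates by the classifier's confidence and keeping the high-scoring ones that are not already among the known (seed or target) papers. This is a schematic sketch with illustrative names, not the released implementation.

```python
import numpy as np

def suggest_new_papers(model, X_candidates, candidate_ids, known_ids, top_n=100):
    """Return the highest-scoring candidate papers that are not already in the
    known set (seed + target papers); these are the potentially relevant
    'misses' discussed above."""
    scores = model.predict_proba(X_candidates)[:, 1]
    order = np.argsort(-scores)
    suggestions = []
    for idx in order:
        if candidate_ids[idx] not in known_ids:
            suggestions.append((candidate_ids[idx], float(scores[idx])))
        if len(suggestions) >= top_n:
            break
    return suggestions
```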

As part of this case study and in collaboration with the National Academy of Sciences (NAS), we curated a collection of important papers in this fieldFootnote 11 and used this collection as a seed set to identify other related papers that might have been missed by our more manual methods. Evaluating these results brings us back to shaky territory where we do not have ground truth. However, conversations with domain experts interested in formally characterizing these fields have been encouraging, suggesting the utility of these methods in identifying relevant papers. The original seed papers and the extended bibliography of machine-identified and ranked papers can be found at http://www.misinformationresearch.org.

Leveraging the expertise of the NAS scientists, we are also studying how well these methods can identify papers in a somewhat more established field. We used a seed set of curated papers in the field of Science Communication to identify and rank additional papers. The seed set consisted of 274 papers collected from a 2017 National Academies report on science communication (National Academies of Sciences 2017). We performed five different splits of these papers into seed and target sets (see Fig. 1). For each of these, we generated large sets of candidate papers from Web of Science, and then trained random forest models to rank candidates based on the citation- and title-based features described in “Application to a single review article” section. For each candidate paper, we aggregated the results of the five classifiers by taking the sum of the models’ predicted probabilities. We then provided the evaluators with a list of the top 1,000 papers for evaluation that were not in the original seed set. Three domain experts have evaluated this data set, one independently, the other two working together. They made binary relevance judgments for each of the 1,000 papers, with the instructions: “identify any references that the algorithm picked up that don’t belong in the field of science communication.” The first rater judged 947 (95%) of the references to be relevant, while the other two judged 872 (87%) to be relevant (moderate inter-rater reliability between the two ratings: Cohen’s \(\kappa = 0.37\)). We plan to make use of expert evaluations to assess how useful this approach could be in other fields, including misinformation studies.
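
The aggregation across the five classifiers and the inter-rater agreement statistic can be sketched as follows; the variable names and the use of scikit-learn's cohen_kappa_score are assumptions about the implementation rather than the code actually used.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def aggregate_rankings(fitted_models, X_candidates, candidate_ids, top_n=1000):
    """Sum the predicted probabilities from several classifiers (one per
    seed/target split) and return the top_n candidates by combined score."""
    combined = np.zeros(len(candidate_ids))
    for model in fitted_models:
        combined += model.predict_proba(X_candidates)[:, 1]
    order = np.argsort(-combined)
    return [(candidate_ids[i], float(combined[i])) for i in order[:top_n]]

# Inter-rater agreement on the binary relevance judgments for the top 1,000 papers:
# rater_a and rater_b are lists of 0/1 judgments over the same ranked list.
# kappa = cohen_kappa_score(rater_a, rater_b)
```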

Discussion

Our results suggest that it is possible to use automated methods to identify many of the most relevant papers for a literature review, starting from a large set of candidate papers. We believe that, by trying new features and tuning model parameters, we can increase performance and learn more about what distinguishes these papers. We have also seen promise in using these methods to build novel surveys of topics from a set of seed papers. An important area of future work will be collecting more expert-labeled evaluations to validate and improve this approach.

Running these experiments on our samples of review articles required thousands of hours of computation on a supercomputing cluster. However, applying the methods to a single set of references (as in the “Exploring scientific fields using automated literature review” section) is much less intensive, and does not necessarily require such resources, which may not be broadly accessible to a general audience.

Previous work on automated methods for literature review has tended to use a small number of hand-selected systematic review articles (Belter 2016) or a small number of scientific fields (Silva et al. 2016). The small scale and close qualitative approach of these studies can provide a great deal of insight, but they make it hard to establish benchmarks for generalizing and comparing different methods. Our experimental approach, on the other hand, gathers many review papers and applies general techniques, allowing for a much larger pool of labeled data.

We found that we were able to identify many of the references of review articles in a variety of research areas. Our methods also missed many references, ranking other articles more highly than the ones in the original reference list. However, these “incorrect” articles may actually have value: they may be relevant articles that were missed by the review papers’ authors. We found some support for this with the help of domain experts, who found that many of the “misclassified” articles were in fact relevant to the given field. While improving the precision attainable by these methods remains a worthwhile goal, it is worth noting that in many cases the failures of the classifiers may actually indicate valuable papers that have been overlooked.

Furthermore, we see potential in using this framework to develop and evaluate methods for literature survey generation and related problems such as scholarly recommendation and field identification. The objective we propose for our modeling task, accurately finding all of the remaining references from a review paper given a held-out sample of seed papers from those references, is not a perfect one. We assume that the references in a review paper represent domain experts’ best attempt to collect the relevant literature on a single research topic; however, there exist several different types of review article (systematic review, meta-analysis, broad literature survey, etc.), and our current method ignores potential nuance among them. Additionally, we assume that every article in a review paper’s bibliography is relevant to a survey of the field; in practice, an article can be cited for many different reasons, even within a review article. Despite these limitations, the large amount of available data allows our framework to provide a means of experimenting with and developing methods for automated literature surveys. Many review articles similar to the ones we used have their bibliographies available, so it will be possible to do this development and analysis at a large scale across many domains. Using this framework, it will be possible to empirically evaluate novel features for their usefulness in identifying papers relevant to a survey of a given topic.