Introduction

The traditional pipeline for academic publication is highly time-consuming (Björk and Solomon 2013). The whole publishing process, from conducting research, writing a paper, submitting it for peer review, and revising or rewriting it if rejected, all the way to final publication, can be a weary march that takes several months or even more than a year. Because of this, some researchers who cannot afford to wait turn to conferences, where the process can be reduced to only a few months. Still, many other researchers are eager to share their results as soon as possible, and to them even a few months is too long. This demand gave rise to the popular preprint servers. “A preprint is a complete scientific manuscript (often one also being submitted to a peer-reviewed journal) that is uploaded by the authors to a public server without formal review” (Berg et al. 2016). Users of preprint servers can post their manuscripts without rigorous peer review, subject only to a brief screening. Although “preprint” suggests submission before publication, a large number of post-prints are also submitted to preprint servers after publication. Therefore, in this paper, “preprint” is defined as an “e-print” (Kling 2004; Brody et al. 2006) that covers both “pre-prints” and “post-prints” on preprint servers. The word “unpublished” describes the state of a preprint that has not yet been accepted for any type of publication; the words “published” and “printed” describe the state of a preprint that has been peer-reviewed and formally published in a journal, conference, book, report or other type of publication.

ArXivFootnote 1 (Ginsparg 2011), founded in 1991, is a preprint server for the fields of science and engineering. From its inception to 2014, arXiv accumulated a total of 1 million manuscripts over 23 years of development (Van Noorden 2014), and in 2019 it received an average of about 13,000 submissions per month.Footnote 2 The Computing Research Repository (CoRR) (Halpern 2000) is an important component of arXiv. It covers the various categories of computer science (CS) and enjoys a rapid increase in submissions. CoRR now functions as the most important preprint server in the CS field.

In addition to arXiv, there are a considerable number of other preprint servers for different fields. BioRxivFootnote 3 is a platform for unpublished preprints, especially those in the life sciences. Unlike arXiv, bioRxiv assigns DOIs to its preprints for citation and cooperates with journals so that authors can submit their manuscripts directly to a journal’s submission system through it. The Social Science Research Network (SSRN)Footnote 4 is a repository originally developed for the social sciences and humanities. It was later extended to cover other fields in science and engineering, such as biology, chemistry and computer science. The platform allows users to upload unpublished preprints directly and also accepts published papers. Humanities CommonsFootnote 5 is a platform created by the Modern Language Association for the field of humanities. It serves as a network for humanities scholars to post new publications and disseminate research results. PreprintsFootnote 6 is a multidisciplinary preprint server supported by the open access publisher MDPIFootnote 7 that provides immediate access to scientific manuscripts in all fields of research. The server shows the number of views and downloads each preprint has received, and users can also comment on preprints.

The prosperity of preprint servers is driven by several forces, because submitting manuscripts to them brings many benefits. Firstly, preprints provide a record of priority. Researchers often conduct studies on similar topics with similar methods, and this similarity can lead to fierce disputes over priority, as in the controversy between Isaac Newton and Gottfried Wilhelm Leibniz over the nearly simultaneous invention of calculus. It is therefore vital for researchers to publish their original ideas and results in time. Secondly, preprints enable more feedback. Good research and papers imply rounds of refinement. Within the traditional peer review system, authors can only get limited rounds of feedback from a handful of reviewers and editors, whereas publishing a manuscript at an early stage can elicit discussion and feedback from the whole community. Thirdly, a preprint can function as an attention grabber. Most preprint servers provide a daily notification service that sends subscribers lists of the latest and recently updated submissions. Studies (Davis and Fromerth 2007; Feldman et al. 2018) reveal that published papers whose preprints were submitted before publication gain more citations than those without preprints. Some researchers, in order to present their work to more people, submit to preprint servers for public access even after their work has been accepted through peer review.

However, preprint servers have also led to widespread controversy (Vale 2015; Annesley et al. 2017). For one thing, there is no guarantee of the quality of non-peer-reviewed preprints. Even though some preprint servers, such as arXiv and bioRxiv, do inspect submissions, this inspection only targets non-scientific content, plagiarism and offensive language, and cannot ensure internal academic quality. Unfinished or even fraudulent preprints, which might be submitted just to stake a priority claim, cast a detrimental influence on refereed publication. Whether such preprints can be considered a claim of priority remains open to question. For another, a simplified process might lead to a surge in the number of academic submissions, making it a burden for researchers to distinguish the good from the bad.

Despite the enormous quantity of preprints, it remains unclear how many of them have actually been printed and why. This paper sets out to answer these two questions by conducting a case study on CS related preprints on arXiv from 2008 to 2017. Our main contributions lie in:

1. A BERT-based method and a related dataset are introduced to map preprints to their published versions under different titles and with other modifications. Our method achieves an improvement of 56% in accuracy over the compared method.

2. Mapping was conducted from 141,961 sampled preprints to their published versions one by one. Statistical analyses are performed on different aspects including published type, subject category, publication venue, submission stage and citation count.

3. Common features of published preprints are identified by in-depth comparisons between the published and the unpublished. Practical suggestions for future academic writing are provided based on the findings and analysis.

Related work

Former studies on preprints mainly cover citation, publication, impact, preprint servers and peer review.

ArXiv provides its users with usage statistics,Footnote 8 but the information is limited to submissions, accesses and downloads. Davis and Fromerth (2007) is an early work that analyzed the correlation between the submission of a preprint and the citation and official download counts of its final publication in the field of mathematics. Their study identified 511 (18.5%) published preprints out of 2,765 sampled journal papers on arXiv. However, the authors did not mention which method they employed to map the published papers to their preprints on arXiv.

Larivière et al. (2014) analyzed arXiv preprints in all subjects (computer science, mathematics, physics, etc.) and their corresponding published versions in Web of Science (WoS). However, most conference papers in CS were excluded from their study because they are not indexed in WoS. In the domain of CS, conference papers, especially those in top conferences, play a more vital role than journal papers (Vrettas and Sanderson 2015). Such exclusion leaves the study incomplete for the CS world.

In contrast, Sutton and Gong (2017) analyzed papers published in top CS conferences and found that in 2017, 23% of these conference papers had been submitted to arXiv. The study also shows that 56% of these arXiv-deposited papers were submitted before or during the review process. Despite its interesting findings, this study only deals with papers from top conferences. The authors checked the published papers listed in the conference proceedings one by one to identify whether they had been submitted to arXiv. Such a method leaves out arXiv-deposited papers that are published in other academic venues. Although the CS community generally attaches greater importance to top conferences, journals and other types of publication remain an indispensable and significant part that should not be ignored.

Feldman et al. (2018) explored whether arXiv-deposited papers gain more citations in top CS conferences. They adopted a mapping method similar to that of Sutton and Gong (2017), matching paper titles on the conference paper lists against their metadata on arXiv to map the preprints to their corresponding accepted papers.

To the best of our knowledge, no preprint study covers all basic types of academic publications in the field of CS, including conferences, journals, book chapters and others. Moreover, using exact correspondence matching or traditional fuzzy matching to link preprints with their publications may lead to imprecision. This paper fills this gap by introducing a BERT-based matching method that captures the semantic information of titles. Unlike most previous studies that only checked a limited list of published papers against arXiv records, we take the opposite route and check the preprints on arXiv against other databases for matching.

Data sources

In this section, we describe the data sources used in our research. This research samples preprints that were first submitted to arXiv between 2008 and 2017 and fall under at least one category starting with the prefix “cs.” (the indicator for the field of computer science). A total of 141,961 preprints were identified according to these two criteria. We do not extend the time range of the sample data to the time of writing because it may take a long time for a preprint to be reviewed, revised and published; we need to leave enough turnaround time for formal publication. Multiple data sources, namely arXiv, Crossref, DBLP, Google Scholar and Papers With Code, are used to support our research.

1. ArXiv

   Apart from web page access, arXiv also opens its metadata to public access via an Application Programming Interface (API).Footnote 9 These metadata include article ID, version number, title, authors, categories, abstract, created date and updated date. Some preprints also provide optional data such as the Digital Object Identifier (DOI), journal reference, comments, etc. If a preprint has been updated, its version history is also presented. Users can use Amazon S3 for bulk download of packed PDF files.Footnote 10 We harvested both the metadata and the PDF files of the sampled preprints from arXiv in July 2019.

2. Crossref

   Crossref,Footnote 11 launched in 1999, aims to establish cross-publisher citation linking for academic publications (Lammey 2014). As an official DOI registration agency of the International DOI Foundation, Crossref links a vast number of publications of different content types, including journals, conference proceedings, books, data sets, etc. It works with thousands of publishers to provide authorized access to their metadata, including DOI, publication date and other basic information. Via its APIs,Footnote 12 these metadata are freely accessible by publication title or DOI. We invoked the Crossref APIs and stored the related data in August 2019.

3. DBLP

   The Digital Bibliography and Library Project (DBLP), now titled “The DBLP Computer Science Bibliography”Footnote 13 (Ley 2002), is a well-known bibliography website centered on CS. Its database has been shown to index the largest number of CS papers (Cavacini 2015). The website only stores basic publication information without abstracts. In addition to publications in peer-reviewed venues, DBLP also indexes preprints in CoRR. Its data are provided as an Extensible Markup Language (XML) file.Footnote 14 Our research is based on the version “dblp-2019-10-01”.

4. Google Scholar

   Thanks to the powerful search and analysis technologies of Google, Google ScholarFootnote 15 plays a leading role among academic literature analysis and retrieval platforms. Unlike citation analysis platforms that only index high-impact journal papers, usually written in English, Google Scholar indexes a wide range of academic documents (journal papers, conference papers, books, theses, etc.) written in various languages (Kousha and Thelwall 2008; Martín-Martín et al. 2018). Google Scholar citations can thus be used to reflect the overall citation of a paper (Martin-Martin et al. 2017). Since Google Scholar offers no API, these data were crawled in August and September 2019.

5. Papers With Code

   Papers With CodeFootnote 16 links source code to arXiv papers. On the one hand, it labels data automatically, using Natural Language Processing technology to analyze paper contents and extract evaluation metrics; on the other hand, it also labels data by hand. The website provides daily-updated metadata in JavaScript Object Notation (JSON) format. We downloaded the file on October 14th, 2019.

Since crawling, downloading and processing the data is intensely time-consuming, our data collection lasted several months. Nevertheless, all sampled preprints were first submitted at least 18 months before data collection, so these data are subject to only minor variation. Therefore, the slightly prolonged duration of data collection had little effect on our research results.
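As a concrete illustration of the two sampling criteria described above, the following minimal sketch filters harvested metadata records by first-submission year and by the “cs.” category prefix. The record structure used here is a simplified assumption, not the exact arXiv metadata schema.

```python
from datetime import datetime

def is_sampled(record):
    """Check the two sampling criteria: first submitted 2008-2017 and
    at least one category starting with the 'cs.' prefix.
    `record` is assumed to be a dict with a 'created' ISO date string
    (date of the first version) and a list of 'categories'."""
    first_submitted = datetime.strptime(record["created"], "%Y-%m-%d")
    in_time_range = 2008 <= first_submitted.year <= 2017
    has_cs_category = any(cat.startswith("cs.") for cat in record["categories"])
    return in_time_range and has_cs_category

# Example with a hypothetical metadata record
example = {"created": "2016-03-14", "categories": ["cs.CV", "stat.ML"]}
print(is_sampled(example))  # True
```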

Methods

This section presents the methods used to identify how many of the sampled preprints were accepted for publication in peer-reviewed venues. There are three cases of published preprints: (1) preprints published under the same titles, with DOIs or the names of the specific publication venues provided by arXiv; (2) preprints published under the same titles, without DOIs or publication venues provided; (3) preprints published under changed titles, without DOIs or publication venues provided.

Case one

ArXiv cooperates with Inspire (formerly SPIRES) to automatically update the DOI information and journal reference when a preprint is published.Footnote 17 In addition, it also encourages authors to update this information themselves once their manuscripts are accepted.Footnote 18 A total of 28.7% of our sampled data were confirmed to be published in peer-reviewed venues, with 22.1% offering DOI information in their metadata and 6.6% providing specific publication venues but no DOI.

Case two

For preprints not covered by the first case, we searched Crossref and DBLP to examine how many sampled preprints were published under the same titles with the original first author appearing in the authorship. If a search result meets both conditions, it is considered the published version of its arXiv preprint. Through this process, 37.0% of the sampled preprints were identified.
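The matching rule for this case can be summarized in a few lines. The sketch below is a simplification of our pipeline, assuming each search result has already been reduced to a title string and an author list; the normalization step is our own shorthand for ignoring trivial formatting differences.

```python
def normalize(title):
    """Lowercase and collapse whitespace so trivial formatting
    differences do not break exact title matching."""
    return " ".join(title.lower().split())

def is_published_version(preprint_title, preprint_first_author,
                         result_title, result_authors):
    """Case-two rule: identical (normalized) title and the preprint's
    original first author appears among the result's authors."""
    same_title = normalize(preprint_title) == normalize(result_title)
    first_author_present = preprint_first_author.lower() in (
        a.lower() for a in result_authors)
    return same_title and first_author_present
```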

Case three

The remaining sampled preprints fall into the following three situations:

1. Preprints that were never submitted to or accepted by a peer-reviewed venue.

2. Preprints that are published in venues not indexed by Crossref or DBLP.

3. Preprints that are published with changed titles and content after peer review but without a timely version update on arXiv.

Situation 3 is the difficult one: these preprints, with revisions to the title, content and even authors, are hard to identify with simple string matching. To address this issue, we designed a classification model and constructed a dedicated dataset to conduct pair matching between the sampled preprints and their revised published versions.

Dataset for the classification model

A special dataset is constructed for our binary classification model. Both our positive and negative samples are composed of the following three fields: title pairs (preprint, candidate), author pairs (preprint, candidate) and a True or False label that indicates whether the candidate is the modified version of the preprint.Footnote 19

To avoid overlap among the training, development and test data, the training set only includes data not present in the development or test sets. Specifically, it contains preprints under CS categories submitted from 1991 through 2007 or from 2018 through July 2019, as well as preprints from 2008 through 2017 that fall under no CS category.

The arXiv API presents the version history of each preprint by attaching a version number as a suffix to the file name, such as v1 and v2. These data directly indicate the version sequence and were thus gathered to create positive samples.

For each preprint, there may be several submitted versions, with or without title changes. Every two distinct titles of the same preprint form a positive sample pair, meaning the two titles differ but belong to the same preprint. An example is given below. The sampled preprintFootnote 20 changed its title in each of its four submissions:

  • v1: Fully Convolutional Network-based Multi-Task Learning for Rectum and Rectal Cancer Segmentation

  • v2: Multi-Task Learning with a Fully Convolutional Network for Rectum and Rectal Cancer Segmentation

  • v3: A Fully Convolutional Network for Rectal Cancer Segmentation

  • v4: Reducing the Model Variance of Rectal Cancer Segmentation Network

The titles of the four versions all differ from one another, so altogether six positive sample pairs can be drawn from them. The corresponding authors of each version were also extracted to compose the input pairs.
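The pairing of distinct titles amounts to taking all two-element combinations over a preprint’s version history. The sketch below reproduces the count of six pairs for a four-title history; the placeholder strings stand in for the actual titles listed above.

```python
from itertools import combinations

def positive_pairs(version_titles):
    """Build positive title pairs: every two distinct titles that
    belong to the same preprint form one positive sample."""
    distinct = list(dict.fromkeys(version_titles))  # keep order, drop duplicates
    return list(combinations(distinct, 2))

# The four-version example above yields C(4, 2) = 6 positive pairs.
titles = ["v1 title", "v2 title", "v3 title", "v4 title"]
print(len(positive_pairs(titles)))  # 6
```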

To compose negative samples, every preprint title was submitted as a search query to Crossref and the first ten results, with their title and author information, were retrieved. Any result sharing one or more authors with the query preprint was removed; each of the remaining, unmatched results was paired with the query title. The authors of the papers in the query results were also gathered to form the input author pairs.
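A sketch of this negative-sample construction is given below, using the public Crossref REST API. The endpoint, query parameters and surname-based author comparison are our simplifying assumptions about the process, not the exact pipeline.

```python
import requests

def negative_pairs(preprint_title, preprint_authors, rows=10):
    """Query Crossref with the preprint title and pair it with the
    returned records that share no author with the preprint."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.title": preprint_title, "rows": rows},
        timeout=30,
    )
    preprint_surnames = {a.split()[-1].lower() for a in preprint_authors}
    pairs = []
    for item in resp.json()["message"]["items"]:
        titles = item.get("title") or [""]
        title = titles[0]
        surnames = {a.get("family", "").lower() for a in item.get("author", [])}
        if not title or surnames & preprint_surnames:
            continue  # drop results that share an author with the preprint
        pairs.append((preprint_title, title))
    return pairs
```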

There are a total of 40k samples in the training set, 5k in the development set and 5k in the test set.

Classification model

Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2019) is a state-of-the-art framework for word encoding. Its architecture enables BERT to learn contextual word embeddings from both left to right and right to left. Our model is built on SciBERT (Beltagy et al. 2019), a domain-specific BERT variant trained on 1.14 million papers in the CS and biomedical domains exclusively. Experiments show that SciBERT achieves better performance than BERT on scientific text.

SciBERT is fed a pair of titles from two different papers (a preprint and a candidate to be checked) and outputs a probability value indicating the similarity between the two papers. If the output value is higher than 0.5, the label of the title pair is set to True, otherwise to False. For the author pair of the same two papers, if the first author of the preprint matches one of the authors of the candidate, the label of the author pair is set to True, otherwise to False. An “and” operation is then performed on these two boolean values to produce the final label indicating whether the two papers are likely versions of the same work. See Fig. 1 for the detailed structure.
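A minimal sketch of the inference step is shown below, using the Hugging Face transformers interface to the public SciBERT checkpoint with a sequence-classification head. The fine-tuned weights, thresholding and author normalization are simplified assumptions rather than our exact implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Public SciBERT checkpoint; in practice the classification head is
# fine-tuned on the title-pair dataset described above.
MODEL_NAME = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def title_pair_score(preprint_title, candidate_title):
    """Encode the two titles as one sequence pair and return the
    probability that the candidate is a modified version of the preprint."""
    inputs = tokenizer(preprint_title, candidate_title,
                       truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def final_label(preprint_title, candidate_title,
                preprint_first_author, candidate_authors):
    """AND of the title-pair decision (threshold 0.5) and the author check."""
    title_match = title_pair_score(preprint_title, candidate_title) > 0.5
    author_match = preprint_first_author.lower() in (
        a.lower() for a in candidate_authors)
    return title_match and author_match
```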

Fig. 1

Structure of the proposed classification model

Model results

The proposed classification model yields an accuracy of 0.78 and an overall F1-score of 0.72. The method of Larivière et al. (2014), which maps preprints to their published versions with several kinds of fuzzy matching, was chosen for comparison. Our method outperforms the compared method in both accuracy and F1-score on the test set described in the “Dataset for the classification model” section. See Table 1 for detailed information. With our model, we can better identify preprints published under changed titles. In the end, 11.4% of the sampled preprints are mapped to published versions with changed titles.

Table 1 Comparisons of accuracy and F1-score with the compared method

Statistics and analysis

Published type

According to the above data, 65.7% of the CS related preprints submitted to arXiv from 2008 through 2017 have been published in peer-reviewed venues under the same titles, and 11.4% have been published under changed titles and with other modifications. The whole sample is categorized into four types. See Fig. 2 for detailed information.

Fig. 2

Distribution of preprints by published type

We estimated that nearly a quarter of the sampled preprints on arXiv have not been published, and we performed an analysis to figure out the reasons behind this, which are as follows:

Firstly, some unpublished preprints are closely related to arXiv itself (Warner 2001; Rieger et al. 2016). These preprints were written by the founders or administrators of arXiv to introduce its history, status quo and development, and their authors only ever intended to post them on arXiv. Secondly, there are also preprints that were indeed submitted for peer review but failed to be accepted for publication.Footnote 21

Fig. 3

The rise of total, published and unpublished preprints, 2008–2017

Fig. 4

Publication rate of preprints, 2008–2017

The statistics of the total, published and unpublished preprints are also presented by year (see Fig. 3). The total number of preprints increased substantially over this ten-year period, soaring from below 5,000 in 2008 to over 30,000 in 2017, a more than fivefold increase, and the growth in submissions has accelerated since 2015. The number of published preprints follows the same growing trend, whereas the number of unpublished preprints grows only mildly. As shown in Fig. 4, although the number of preprints has greatly increased in the past decade, the overall publication rate of preprints has declined. To some extent, this reveals that preprints have become increasingly popular among researchers: the number of all preprints grows much faster than the number of eventually published ones, so the denominator of the publication rate increases significantly, which eventually leads to a decline in the publication rate.

Subject categories

Fig. 5

Distribution of published and unpublished preprints by subject category

We classified the published and unpublished preprints according to the first category label in their arXiv metadata. Preprints in the CS field were further divided into sub-categories following arXiv’s scheme. See Fig. 5 for detailed information.Footnote 22

From Fig. 5, we can see that Information Theory (CS.IT) is the most productive category, followed by Computer Vision and Pattern Recognition (CS.CV) and Machine Learning (CS.LG). The number of preprints in Information Theory is over twice that in Machine Learning. Altogether, preprints in these top three categories account for about one-fourth of the total, whereas General Literature (CS.GL) and Operating Systems (CS.OS) account for only a small fraction. Published preprints outnumber their unpublished counterparts in almost all categories. Among non-CS categories, Mathematics, Physics and Statistics rank as the top three in terms of the number of CS related preprints, which reveals that cross-disciplinary research is flourishing in these three domains.

Publication venue

As shown in Fig. 6, nearly half of the published preprints are journal papers and about one-third are conference papers. Book chapters account for one-tenth of the total. Preprints with unknown publication venues (information not provided by our data sources) and other types only make up a small share. Vrettas and Sanderson (2015) found that, on average, top conference papers have a higher citation rate than top journal papers in the domain of CS. However, our statistics show that more preprints are finally published in journals than in conferences. We suspect the main reason is that most journals have a longer publication period than conference proceedings, so researchers who want their papers published in journals first submit preprints to arXiv to share their work in advance. Conversely, because the publication period of conference proceedings is relatively short, there is less need to submit preprints.

Fig. 6

Distribution of preprints by publication venue

Submission stage

The submission stage, i.e., the time at which a preprint is submitted to arXiv, is also an important question. When are preprints submitted to arXiv? Are they submitted before or after formal publication? Do authors commonly upload the formally published version to arXiv, or do they consider arXiv merely a platform for quick dissemination of their work to the public?

We can directly obtain the publication dates of the peer-reviewed papers from our data sources. However, other information, such as the received or revised date, is only contained in the PDF files of the formally published versions, and collecting these data is a difficult task that raises copyright and cost problems. We are seeking a solution to obtain these data and use them for a deeper analysis in the future. At present, we simply classified the preprints into two categories according to their created date on arXiv: submitted before publication and submitted after publication. See Fig. 7 for detailed information.
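The classification itself reduces to a date comparison. The sketch below assumes both dates are available as ISO-formatted strings, which is a simplification of the metadata handling.

```python
from datetime import date

def submission_stage(arxiv_created, published):
    """Label a published preprint by comparing its arXiv created date
    with the publication date of its peer-reviewed version."""
    created = date.fromisoformat(arxiv_created)
    pub = date.fromisoformat(published)
    return "before publication" if created < pub else "after publication"

print(submission_stage("2016-05-02", "2017-01-15"))  # before publication
```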

Fig. 7

Distribution of published preprints by submission stage

Figure 7 shows that the majority of journal papers and conference papers are submitted before publication, which is reasonable for a preprint server. By comparison, the proportion of journal papers submitted before publication is larger than that of conference papers by 16%. According to our analysis, this is because peer review normally takes longer for journal publication than for conference publication, so researchers submitting to journals are more inclined to post their manuscripts to preprint servers as a claim of priority, for fear of being scooped.

Citation count

Citation count is a vital indicator of the quality of scientific papers. In this section, we compare the citation counts received by the published and unpublished preprints. Citation data were crawled from Google Scholar. If a paper has its arXiv-deposited and published versions indexed separately by Google Scholar, the citation counts of the two versions are summed. See Table 2 for detailed information. We used the D'Agostino-Pearson test (D'Agostino 1971; Pearson et al. 1977) to test the normality of the citation counts of published preprints, unpublished preprints, journal papers and conference papers. With the significance level α predefined as 0.005 (Benjamin et al. 2018), all null hypotheses are rejected because the P values are smaller than α, which means these data do not follow a Gaussian distribution at the 0.5% significance level. We then used the Mann-Whitney U test (Mann and Whitney 1947) to compare published preprints with unpublished preprints, and journal papers with conference papers, using the same α as before. The P values are all smaller than α, indicating that all null hypotheses are rejected and that the groups follow different distributions at the 0.5% significance level. Consequently, the median is chosen as the measure to compare these groups.Footnote 23
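Both tests are available in SciPy. The sketch below applies them to two synthetic citation-count arrays at the same α = 0.005; the generated data are placeholders, not our Google Scholar counts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder citation counts; the real data come from Google Scholar.
published = rng.negative_binomial(1, 0.05, size=1000)
unpublished = rng.negative_binomial(1, 0.15, size=1000)

alpha = 0.005

# D'Agostino-Pearson normality test.
for name, counts in [("published", published), ("unpublished", unpublished)]:
    _, p = stats.normaltest(counts)
    print(name, "normal" if p >= alpha else "not normal at the 0.5% level")

# Mann-Whitney U test comparing the two groups.
_, p = stats.mannwhitneyu(published, unpublished, alternative="two-sided")
print("distributions differ" if p < alpha else "no significant difference")

print("medians:", np.median(published), np.median(unpublished))
```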

Table 2 Citations of published and unpublished preprints

It is obvious from Table 2 that published preprints enjoy higher visibility than unpublished ones. The median citation count of journal papers is the same as that of conference papers. Over one-third of the unpublished preprints have never been cited, while only about one-tenth of the published preprints have zero citations. The percentage of journal papers with zero citations is larger than that of conference papers. There are also unpublished yet highly cited preprints on arXiv; for example, ADADELTA: An Adaptive Learning Rate Method (Zeiler 2012), which introduces an effective gradient descent method, has gained over 3,000 citations.

What preprints can be printed

In the “Statistics and analysis” section, we analyzed the publication status of preprints on arXiv. In this section, the published and unpublished preprints are compared in terms of version history, number of authors and article length, number of references and their citations, number of figures and tables, and the proportion of open source code, in order to identify which features enable a preprint to be printed eventually. Based on this comparison and analysis, we provide practical suggestions for academic writers in CS.

Science-parseFootnote 24 is used to parse the PDF files from arXiv. The PDF files are transformed into structured XML files containing the title, authors, abstract, introduction, conclusion and references. For an in-depth comparison, the published preprints in some subsections are further split into two categories, conference papers and journal papers, and the comparison is performed among published preprints, journal papers, conference papers and unpublished preprints. Book chapters and other publication types are excluded from these comparisons: book chapters follow a writing style greatly different from that of journal and conference papers, and other types account for only a tiny share of the total and are thus less representative. Papers published under different titles are also excluded, because their versions deposited on arXiv might not be the final ones. In addition, these comparisons exclude papers with no updated version submitted to arXiv after publication, so as to ensure that the comparison is conducted only among the formally published versions of published preprints.
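To give a sense of how the parsed output feeds the comparisons that follow, the sketch below counts words in selected sections of one parsed file. The XML element names are illustrative assumptions only; the actual tags produced by the parser differ across versions and should be substituted.

```python
import xml.etree.ElementTree as ET

def word_counts(xml_path):
    """Count words in the abstract and introduction of a parsed paper.
    The element names ('abstract', 'introduction') are hypothetical and
    must be replaced with the tags actually emitted by the parser."""
    root = ET.parse(xml_path).getroot()
    counts = {}
    for section in ("abstract", "introduction"):
        node = root.find(section)
        counts[section] = len(node.text.split()) if node is not None and node.text else 0
    return counts
```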

Version history

ArXiv allows users to modify a preprint’s content and metadata with no restriction on time. This freedom is a distinct advantage of preprint servers: authors can update their work without going through a complicated review process. We compared the number of updates between the published and unpublished preprints.

Table 3 Proportion of preprints with different version numbers

Table 3 shows that preprints with only one version make up the largest proportion of both published and unpublished preprints. To some extent, this indicates that arXiv is mainly used by researchers as a platform to share their work with others. Published preprints have a lower share of single-version submissions than unpublished preprints, while for every version count above one, the proportion of published preprints exceeds that of unpublished ones. This result can be explained by two reasons: (1) repeated revision normally leads to higher quality, so repeatedly revised preprints have a greater chance of being accepted; (2) after their preprints are accepted for publication, most authors upload the accepted version to arXiv to ensure the completeness and consistency of their work. Besides, few preprints on arXiv have more than five versions, because revisions after the fifth version are no longer listed in the daily mailing.Footnote 25

Number of authors and article length

The number of authors and the article length strongly shape the first impression of a paper, so we compared these two factors. Preprints lacking the relevant sections were excluded from the comparisons. See Table 4 for detailed information.

Table 4 Medians of number of authors and word counts

From Table 4, we can see that the published preprints have a higher median number of authors than the unpublished ones, suggesting that multi-authorship is a feature of accepted papers. For article length, the published preprints have larger median values than the unpublished ones on every measure, which suggests that article length serves as a quality signal for reviewers. In particular, the published preprints have considerably longer abstracts and introductions, 9% and 23% longer respectively than those of the unpublished preprints, demonstrating that a detailed abstract and introduction are marked features of published preprints. Comparing journal and conference papers, journal papers have larger values on all measures except the number of authors. We attribute this to the stricter length restrictions of conference papers (mostly 8 or 12 pages), which force a more concise style.

Number of references and their citation counts

References are indispensable to scientific papers, and to some extent referencing behavior is highly correlated with academic quality. For this reason, we compared the number of references as well as the citation counts received by these references. To keep the collection of citation counts practical, we only targeted a subset of preprints labeled with Artificial Intelligence from 2016 to 2017, yielding 4,743 preprints. See Table 5 for detailed information. Please note that official reference data are not included in the arXiv APIs; the number of references and their citation counts might thus be slightly lower than the actual values due to possibly erroneous parsing of PDF files.

Table 5 Medians of number of references and citation counts of references

Table 5 shows clearly that the published preprints have more references than the unpublished ones, indicating that the number of references is positively related to acceptance. In terms of the median, the published preprints cite 30% more references than the unpublished ones, and the median citation count of the published preprints’ references is 45% higher than that of the unpublished ones. Judging from the medians, journal papers have more references than conference papers, while conference papers have more highly cited references.

The citation counts of references are rather high because they are pushed up by a few extremely cited references. For example, R: A Language and Environment for Statistical Computing (R Core Team 2013) has received more than 140,000 citations.

Number of figures and tables

Figures and tables are two essential components of academic writing. They highlight and reinforce key information in a straightforward way, making a paper more reader friendly. Figures and tables were parsed and counted, and their median numbers were calculated separately. See Table 6 for detailed information.

Table 6 Medians of number of figures and number of tables

The results in Table 6 differ from what we expected. The published and unpublished preprints have the same median number of figures, and the journal papers and the unpublished preprints both, surprisingly, have a median of zero tables. Suspecting that these values were caused by errors in the automatic parsing step, we manually counted the tables in the PDF files of 100 randomly selected unpublished preprints, and the median remained zero.Footnote 26 It is also worth noting that journal papers use more figures and fewer tables than conference papers. Overall, published papers do not necessarily contain more figures and tables. However, these results do show that CS papers as a whole normally make use of figures, indicating that researchers are well aware of the effectiveness of figures as a form of illustration.

Open source code

The reproducibility of CS research largely depends on the availability of its source code, so whether the source code is provided can be considered an indicator of the reliability and credibility of the research. Releasing source code is also solid proof of researchers’ confidence in their work, since others can reproduce the results. In this section, statistical analysis was performed to determine whether releasing source code influences the acceptance rate. We counted the respective percentages of open source papers among the published and unpublished preprints.

We mapped the sampled preprints to their corresponding code repositories using Papers With Code. Altogether 5,319 preprints were identified as providing open source code, only 3.7% of the total sample. One explanation is that papers in some domains of CS are purely theoretical and involve no code. We therefore only considered preprints labeled with at least one of the following categories: Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Information Retrieval, Machine Learning, and Neural and Evolutionary Computing. Of the 46,937 preprints thus identified, 11.3% provide open source code. This percentage is still relatively low, probably because Papers With Code prefers to index up-to-date research, so some of our sampled preprints from 2008 to 2017 might not be covered.
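A sketch of this mapping step is shown below, assuming the Papers With Code links file is a JSON array whose entries carry an arXiv identifier and a repository URL; the field names used here are our assumptions about that file, not guaranteed to match the downloaded schema.

```python
import json

def map_to_repos(links_path, sampled_arxiv_ids):
    """Map sampled preprints to code repositories via the Papers With Code
    links file. Field names ('paper_arxiv_id', 'repo_url') are assumed."""
    with open(links_path, encoding="utf-8") as f:
        links = json.load(f)
    repos = {}
    for entry in links:
        arxiv_id = entry.get("paper_arxiv_id")
        if arxiv_id in sampled_arxiv_ids:
            repos.setdefault(arxiv_id, []).append(entry.get("repo_url"))
    return repos

# Example: share of sampled preprints with at least one linked repository
# open_source_rate = len(map_to_repos("links.json", sampled_ids)) / len(sampled_ids)
```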

Among the preprints with open source code, 79.7% have been accepted by peer-reviewed venues. This is strong evidence that releasing source code correlates tightly with acceptance. We therefore suggest that researchers provide open source code with their papers.

Future work

For future study, we plan to continue our work in the following directions. First and foremost, we hope to extend our research from CS to other domains and from arXiv to other preprint servers. Next, we are exploring a more efficient solution for quantitative analysis of citations; the current method is relatively time-consuming and costly, which is why we only analyzed Artificial Intelligence preprints from 2016 to 2017. With a new solution, we can extend our research to preprints in more fields over a longer time range. Last but not least, we would like to include other factors, chiefly the influence of funding, structure and even content, in our comparison between published and unpublished preprints.

Conclusion

In this paper, we introduce a deep learning-based method to map arXiv-deposited preprints to their corresponding published versions with different titles in peer-reviewed venues. With the help of this method and our data sources, we found that 66% of the CS preprints submitted to arXiv between 2008 and 2017 have been published under the same title and 11% have been published under different titles and with other modifications. These results show that posting manuscripts to preprint servers contributes to the acceptance of papers in peer-reviewed venues. Among the published preprints, nearly half are published in journals and around one-third are accepted in conference proceedings. We further analyzed the differences between published and unpublished preprints. The results demonstrate that, compared with unpublished preprints, most published preprints in the CS domain share common features such as adequate revision, multiple authors, a detailed abstract and introduction, extensive and authoritative references, and available source code.