
1 Introduction

Text classification is a fundamental problem in natural language processing (NLP) and has been applied in various fields such as translation, dialogue response, sentiment analysis, and summarization. In recent years, machine learning models have been widely used for text classification [20]. In these approaches, the text is fed to a machine learning model that is trained to classify it. Recurrent neural networks (RNNs) with recursive structures, such as long short-term memory (LSTM), have often been used for NLP with machine learning. However, these models require sequential processing from the beginning to the end of a sentence, which prevents parallel computation. This is a critical drawback for training networks, which generally requires a large amount of time. To address this problem, the Transformer was proposed [22]. By introducing the self-attention structure, the Transformer achieves performance equal to or better than that of RNNs without a recursive structure. The Transformer processes the inputs simultaneously and computes the attention weights among them, which allows the network to be trained on large datasets using parallel processing. BERT (Bidirectional Encoder Representations from Transformers) [13], which builds on the Transformer, is one of the most successful models currently available, and it has greatly improved the performance of machine learning approaches to NLP.

The main contribution of this paper is a method for predicting the quality of scholarly papers using machine learning. Recently, many scientific papers have become available on the Internet through PubMed [4], Web of Science [7], Google Scholar [2], and others. While these services have made it easier to find superior papers, they have also increased the chances of encountering inferior ones. Generally, the number of citations is used as an indicator of the quality of a paper. However, it is not easy to judge quality by the citation count alone, since it depends strongly on the time of publication, and moreover, it cannot be used to assess a paper before submission. Therefore, in this paper, we regard papers published in superior journals as superior papers and papers published in less superior journals as less superior papers. Furthermore, we assume that the abstracts of papers published in journals of similar rank share common characteristics, and we therefore predict the quality of papers from their abstracts. Based on this idea, the proposed approach formulates quality prediction as a classification problem on the abstracts of papers. Specifically, the proposed method uses a BERT-based model to classify, from its abstract, whether an article appears in an upper-ranked or a lower-ranked journal according to the Average Journal Impact Factor Percentile [1]. In this paper, we show that the choice of pre-training dataset for the proposed BERT-based model affects classification accuracy. The results of training the models show that they can classify whether an input abstract comes from a superior or a less superior journal with a test accuracy of 95.1% and 89.6% in the fields of medicine and computer science, respectively.

As related work, studies predicting the quality of academic papers have been conducted [8, 16]. These studies take two main approaches to prediction. One is to predict the quality of a paper based only on its content, using the title, abstract, text, figures, tables, references, and appearance of the paper [10, 14, 19, 21, 24]. The other is to estimate the quality based on the contents of the paper together with additional information from outside the paper, such as the reputation of the authors, the impact factor of the journal in which the paper was published, the cited papers, and the citation network composed of them [9, 11, 12, 17, 25]. Unlike the above methods, the proposed method classifies whether or not the content of an abstract is that of a superior journal. This classification is inspired by the idea that papers in a good journal have well-written abstracts. The proposed approach does not aim to judge the excellence of the research itself, but focuses on the quality of the writing of the papers.

The rest of this paper is organized as follows. We briefly introduce BERT in Sect. 2. In Sect. 3, we show our proposed machine learning approach using the BERT-based model, and experimental results are presented in Sect. 4. Finally, we conclude the paper in Sect. 5.

2 BERT

Bidirectional Encoder Representations from Transformers (BERT) [13] is a state-of-the-art machine learning technique for NLP based on the Transformer [22]. The technique comprises both a model structure and learning approaches. In this section, we briefly introduce it.

Figure 1 illustrates the structure of the BERT model. Given an input sequence of N words, each word is converted to a token, and each token is mapped to a vector of size k by word embedding. After that, L Transformer encoder blocks transform the token representations so that they carry richer contextual information. Each Transformer encoder block, based on the encoder of the Transformer, has A attention heads and a hidden size of H. BERT models come in configurations of various sizes; typical models and their numbers of parameters are shown in Table 1. Users can choose whether to use all of the outputs of the last Transformer encoder block or only part of them for the final classification. Usually, it is sufficient to use only the first output for classification tasks, and the final classification result is obtained by a classifier applied to the first output, as shown in the figure.

Fig. 1. Structure of the BERT model

Table 1. Model configuration of BERT models

The training of BERT models consists of two phases: a pre-training phase and a fine-tuning phase. In general, the pre-training phase trains the model on a large corpus such as Wikipedia. In the fine-tuning phase, the weight parameters obtained in pre-training are used as initial values for the model, and training is performed on the target task. This phase can often be completed with much less computation than the pre-training phase. We explain these two phases in the following.

The network training in the pre-training phase is unsupervised and consists of two tasks: the masked language model (MLM) and next sentence prediction (NSP). In this training, only the word embedding and the Transformer encoder blocks are trained. In MLM, sentences of tokens with some of them masked are input, and the model is trained to predict the original tokens. In NSP, two concatenated sentences are input, and the model is trained to predict whether the two sentences are consecutive or not. When selecting the two sentences, 50% of the inputs are actually consecutive sentences, and in the other 50% the second sentence is randomly selected from the dataset. By training on a large number of sentences for these two tasks, we obtain a model that captures the features of the sentences. Since the computational cost of pre-training is enormous, BERT models already trained on massive corpora such as Wikipedia, BookCorpus, and MEDLINE/PubMed [6, 23] are available, and these models are often employed for the following fine-tuning phase.
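As an illustration of how MLM inputs can be constructed, the following minimal sketch masks roughly 15% of the tokens in a sequence; the 15% ratio and the 80/10/10 replacement strategy follow the original BERT recipe, while the token IDs, the label convention, and the function itself are our own simplified assumptions rather than the authors' implementation:

import random

MASK_ID = 103      # assumed [MASK] token id (BERT-base uncased WordPiece vocabulary)
VOCAB_SIZE = 30522

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (masked input, labels) for the MLM task.

    Labels are -100 for positions that are not selected, so the loss can
    ignore them; selected positions keep the original token id as the label.
    """
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok
            r = random.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                    # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return inputs, labels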

In the fine-tuning phase, the whole network is trained on the target task. The model obtained in the above pre-training provides the initial values, and training is performed in a supervised manner. In general, this training requires fewer iterations than the pre-training. In the proposed method, the pre-trained model is fine-tuned on a classification problem.

3 Proposed Quality Prediction of Scientific Papers

This section presents the proposed method: the dataset used, the formulation of paper quality prediction as a classification problem, and the BERT-based model with its classifier.

3.1 Dataset of Scientific Papers

In this work, we use the Semantic Scholar Open Research Corpus (S2ORC) [5, 18], version 20200705v1, as the dataset of scientific papers. S2ORC is a corpus for natural language processing and text mining research. It contains metadata for 136M papers, of which 12M include full text, covering various fields of research. In this study, we use the abstracts of papers in the fields of medicine and computer science from this dataset.
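As a rough sketch of how such a subset can be extracted, the snippet below reads S2ORC metadata shards (gzipped JSONL) and keeps abstracts tagged with a given field of study. The field names 'abstract', 'mag_field_of_study', and 'journal' follow the S2ORC metadata schema as we understand it, and the file paths and shard naming are placeholders, not details given in the paper:

import gzip
import json
from pathlib import Path

def collect_abstracts(metadata_dir, field="Medicine"):
    """Yield (journal, abstract) pairs for papers tagged with `field`."""
    for shard in Path(metadata_dir).glob("metadata_*.jsonl.gz"):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            for line in f:
                paper = json.loads(line)
                fields = paper.get("mag_field_of_study") or []
                abstract = paper.get("abstract")
                journal = paper.get("journal")
                if abstract and journal and field in fields:
                    yield journal, abstract

# Example: count medicine abstracts in a local copy of the corpus
# n = sum(1 for _ in collect_abstracts("s2orc/20200705v1/metadata"))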

3.2 Quality Classification of Papers

In order to predict the quality of papers, we introduce the Average Journal Impact Factor Percentile (Average JIF Percentile) provided by Journal Citation Reports [1] as the metric of article quality. The JIF Percentile indicates the percentile rank of a journal within its field when the journals of that field are ordered by impact factor. The JIF Percentile is obtained by the following formula [3]:

$$\begin{aligned} \frac{N-R+0.5}{N}, \end{aligned}$$
(1)

where N is the number of journals in the category and R is the descending rank of the journal. The Average JIF Percentile is the average of the JIF Percentile values over the fields to which the journal is assigned, which accounts for journals that cover multiple fields. In this study, we formulate predicting the quality of papers as a classification problem. Let \(J_U\) be the set of journals whose Average JIF Percentile is 0.8 or higher, and let \(J_L\) denote the set of journals whose Average JIF Percentile is 0.2 or lower. In the classification problem, given the abstract of a paper, we classify whether the paper belongs to \(J_U\) or \(J_L\). One might suspect that if the journals in a particular field were biased toward either \(J_U\) or \(J_L\), this classification problem would degenerate into classifying whether or not a paper belongs to that field. Tables 2 and 3 list the 10 journals with the most papers in \(J_U\) and \(J_L\) for medicine and computer science, respectively. According to the tables, there is no significant imbalance in the fields of the papers included in \(J_U\) and \(J_L\). This means that the classification problem cannot be solved simply by identifying papers from a specific field.

Table 2. Journals with the most papers for medicine
Table 3. Journals with the most papers for computer science
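For concreteness, a minimal sketch of Eq. (1) and the \(J_U\)/\(J_L\) assignment described above is given below; the thresholds 0.8 and 0.2 come from this section, while the journal in the example is entirely hypothetical:

def jif_percentile(rank, num_journals):
    """JIF Percentile of a journal with descending rank `rank` among
    `num_journals` journals in one category (Eq. (1))."""
    return (num_journals - rank + 0.5) / num_journals

def average_jif_percentile(ranks_by_category):
    """Average the per-category percentiles of a journal assigned to several
    categories; `ranks_by_category` maps category -> (rank, N)."""
    values = [jif_percentile(r, n) for r, n in ranks_by_category.values()]
    return sum(values) / len(values)

def quality_label(avg_percentile):
    """Return 1 for J_U (>= 0.8), 0 for J_L (<= 0.2), None otherwise."""
    if avg_percentile >= 0.8:
        return 1
    if avg_percentile <= 0.2:
        return 0
    return None

# Hypothetical journal ranked 3rd of 120 in one category and 10th of 80 in another
print(average_jif_percentile({"A": (3, 120), "B": (10, 80)}))  # ~0.930 -> label 1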

3.3 BERT-Based Model of Quality Prediction of Scientific Papers

The proposed BERT-based model for quality prediction of scientific papers has the structure shown in Fig. 1. The classifier takes as input the output corresponding to the first token of the last Transformer encoder block and outputs the classification result. The model, including the classifier, is trained in the fine-tuning phase to output 1 if the input abstract belongs to \(J_U\) and 0 if it belongs to \(J_L\). The classifier is a fully connected layer with one output channel and a sigmoid activation function. We note that since the problem targeted by the proposed model is to classify peer-reviewed papers, it is more difficult than the classification problem of predicting acceptance or rejection for publication [10, 24].
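A minimal Keras sketch of this classification head is shown below, assuming a BERT encoder that returns a sequence output from which the first-token vector is taken. The encoder object, the input names, and the sequence length are assumptions based on the TensorFlow Hub BERT interface, not the authors' code:

import tensorflow as tf

def build_quality_classifier(bert_encoder, seq_len=512):
    """Wrap a BERT encoder with the single-output sigmoid classifier."""
    input_ids = tf.keras.Input(shape=(seq_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.Input(shape=(seq_len,), dtype=tf.int32, name="input_mask")
    segment_ids = tf.keras.Input(shape=(seq_len,), dtype=tf.int32, name="input_type_ids")

    outputs = bert_encoder({
        "input_word_ids": input_ids,
        "input_mask": input_mask,
        "input_type_ids": segment_ids,
    })
    first_token = outputs["sequence_output"][:, 0, :]   # output for the first token

    # One output channel with a sigmoid: 1 -> J_U, 0 -> J_L
    score = tf.keras.layers.Dense(1, activation="sigmoid", name="quality")(first_token)
    return tf.keras.Model([input_ids, input_mask, segment_ids], score)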

4 Experimental Results

In this section, we show the methodology for training models to predict paper quality as the target task, and we evaluate the resulting models. In this study, we train models to predict the quality from abstracts for two research fields, medicine and computer science. In the proposed approach, three types of pre-trained BERT models have been employed. Two of them are pre-trained models from TensorFlow Hub [6], one trained on the Wikipedia and BookCorpus datasets and the other trained on the MEDLINE/PubMed dataset; please refer to [6] for these models. Since the pre-trained model sizes available on TensorFlow Hub vary by dataset, we experiment with the BERT-tiny, BERT-mini, BERT-small, and BERT-base models for the Wikipedia and BookCorpus datasets, and with the BERT-base model for the MEDLINE/PubMed dataset. The remaining one is a model that we pre-trained ourselves on the abstracts of papers in S2ORC [5]. In the fine-tuning phase, all of the models are fine-tuned on abstracts of papers in S2ORC. In the following, we describe the details of training in the pre-training phase and the fine-tuning phase.
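As a sketch of how such pre-trained encoders can be loaded, the handles below point to a Wikipedia/BookCorpus BERT-base model and a PubMed BERT model on TensorFlow Hub. These handles and versions are examples we believe to exist on the Hub, not references taken from the paper, so they should be checked before use:

import tensorflow_hub as hub

# Wikipedia + BookCorpus BERT-base (uncased) and its matching preprocessor
WIKIBOOKS_BERT = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
PUBMED_BERT = "https://tfhub.dev/google/experts/bert/pubmed/2"
PREPROCESSOR = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"

bert_encoder = hub.KerasLayer(WIKIBOOKS_BERT, trainable=True)   # fine-tuned end to end
preprocessor = hub.KerasLayer(PREPROCESSOR)

# model = build_quality_classifier(bert_encoder)   # head from the Sect. 3.3 sketch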

4.1 Training in the Pre-training Phase on Abstracts from S2ORC

Here, we describe the training of the BERT models from scratch on all abstracts in the S2ORC dataset. The models were trained on the MLM and NSP tasks described in Sect. 2. Each model was trained for 3,000,000 steps with a batch size of 8 and a maximum input length of 512. The training is optimized by Adam with a learning rate of 0.0001, \(\beta _1=0.9\), \(\beta _2=0.999\), \(L_2\) weight decay of 0.01, learning rate warm-up over the first 10,000 steps, and linear decay of the learning rate. We use GELU [15] as the activation function and the sum of the MLM and NSP likelihoods as the training loss.
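The warm-up and linear decay described above can be expressed as a Keras learning-rate schedule. The sketch below uses the step counts and peak rate quoted in this subsection; the schedule class itself is our illustration, not the authors' code, and the weight decay would be handled separately (e.g., by a decoupled AdamW optimizer):

import tensorflow as tf

class WarmupLinearDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warm-up to `peak_lr`, then linear decay to zero."""

    def __init__(self, peak_lr=1e-4, warmup_steps=10_000, total_steps=3_000_000):
        super().__init__()
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.peak_lr * step / self.warmup_steps
        decay = self.peak_lr * (self.total_steps - step) / (self.total_steps - self.warmup_steps)
        return tf.where(step < self.warmup_steps, warmup, tf.maximum(decay, 0.0))

optimizer = tf.keras.optimizers.Adam(
    learning_rate=WarmupLinearDecay(), beta_1=0.9, beta_2=0.999)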

4.2 Training in the Fine-Tuning Phase

The training in the fine-tuning phase was performed using each model trained in the pre-training phase as the initial weights. In the experiment, models were trained on the S2ORC dataset for two fields, medicine and computer science. The numbers of papers in the training and test data for each field are shown in Table 4. Each model was trained for 50 epochs with a batch size of 64 and a maximum input length of 512. The training is optimized by AdamW with a learning rate of 0.00003, \(L_2\) weight decay of 0.01, learning rate warm-up over the first 10,000 steps, and linear decay of the learning rate. We use GELU as the activation function and the binary cross-entropy loss as the training loss.

Table 4. The number of abstracts from S2ORC dataset in the fine-tuning phase
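Putting the pieces together, a minimal sketch of this fine-tuning step might look as follows, assuming the classifier from the Sect. 3.3 sketch and tokenized abstracts with 0/1 labels are already available as tf.data datasets. Names such as train_ds and test_ds are placeholders, and AdamW here is the Keras implementation available in recent TensorFlow versions:

import tensorflow as tf

# model = build_quality_classifier(bert_encoder)   # see the Sect. 3.3 sketch

def compile_for_finetuning(model, lr=3e-5, weight_decay=0.01):
    model.compile(
        optimizer=tf.keras.optimizers.AdamW(learning_rate=lr, weight_decay=weight_decay),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=[tf.keras.metrics.BinaryAccuracy(name="accuracy")],
    )
    return model

# compile_for_finetuning(model)
# model.fit(train_ds, validation_data=test_ds, epochs=50)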

4.3 The Test Accuracy of Prediction of the Trained Model

Table 5 shows the test accuracy of the trained models for each research field. The table shows that, for both fields, the larger models are more accurate, and the model pre-trained on the MEDLINE/PubMed dataset is the most accurate. We attribute the lower accuracy when the S2ORC dataset is used for pre-training, even though it consists of scholarly abstracts like the MEDLINE/PubMed dataset, to its smaller size compared to the other two datasets. As a result of training the models, we achieved a test accuracy of 95.1% in the medical field and 89.6% in the computer science field. As shown in Tables 2 and 3, since there is little field imbalance between the \(J_U\) and \(J_L\) journals, this result implies that the model does not classify papers by identifying specific research fields from their abstracts, but performs the classification based on how the abstracts are written.

Table 5. The test accuracy of prediction

4.4 Detailed Analysis of the Prediction

To clarify the classification behavior of the proposed model, we performed the classification on subsets of the sentences in the abstract. Specifically, Tables 6 and 7 show the results of the classification on the i-th through j-th sentences of two abstracts sampled from \(J_U\) and \(J_L\), respectively. We note that these two abstracts were simply sampled from each dataset and are not meant to be judged for their quality. The classification results for the full abstracts are the top-right values of the tables, 0.9736 and 0.0429, respectively, indicating that the model correctly classifies both. The diagonal elements are the output values for each sentence evaluated on its own. Although both abstracts contain single sentences with high and low output values, the final result appears to be produced by considering multiple sentences together. In addition, the proposed model may be useful as a supporting tool when writing papers, since the output shown in Tables 6 and 7 provides at-a-glance information on the well- and poorly-written parts of an abstract.

Table 6. Detailed analysis for an abstract in \(J_U\) sampled from computer science papers in S2ORC [18]
Table 7. Detailed analysis for an abstract in \(J_L\) sampled from computer science papers in S2ORC [18]
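A sketch of how the sentence-range outputs in Tables 6 and 7 can be produced is shown below; predict_score stands for the fine-tuned model applied to a piece of text, and the period-based sentence splitting is deliberately naive, so both are placeholders for illustration:

def sentence_range_scores(abstract, predict_score):
    """Score every contiguous sentence range [i, j] of an abstract.

    Returns a dict mapping (i, j) (1-indexed, inclusive) to the model output,
    i.e. the entries of Tables 6 and 7; diagonal entries are single sentences.
    """
    sentences = [s.strip() + "." for s in abstract.split(".") if s.strip()]
    scores = {}
    for i in range(len(sentences)):
        for j in range(i, len(sentences)):
            text = " ".join(sentences[i:j + 1])
            scores[(i + 1, j + 1)] = predict_score(text)
    return scores

# table = sentence_range_scores(abstract_text, predict_score)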

5 Conclusions

In this paper, we proposed a method for predicting the quality of scholarly papers using machine learning. We formulated quality prediction as a classification problem on the abstracts of papers: whether a paper appears in a superior or a less superior journal. We used BERT-based models and trained them on several datasets. Our experiments showed that the choice of pre-training dataset for the proposed BERT-based model affects classification accuracy. The results of training the models also showed that they could classify whether an input abstract comes from a superior or a less superior journal with a test accuracy of 95.1% and 89.6% in the fields of medicine and computer science, respectively. Furthermore, by evaluating combinations of sentences in the abstracts, we clarified the details of the classification results and visualized them.