Introduction

Academic literature records the research process with a standardized structure and provides clues for tracking progress in a scientific field (Lindsay 1995; Hofmann 2016). Generally, the main components of academic literature include an abstract, introduction, related work, method, experiment and result, and conclusion (Day 1989; Peat et al. 2002; Sollaci and Pereira 2004). In recent years, academic text mining using content from different components has received increasing attention from researchers. However, most existing research focuses on key phrase extraction (Park and Caragea 2020), citation content analysis (Fisas et al. 2016), rhetorical structure analysis (Sateli and Witte 2015), and essential sentence extraction (Mehta et al. 2018), with little attention paid to the research contributions stated in the full text (Swales 1990). Research contributions, which indicate how a paper contributes new knowledge or new understanding in contrast to prior research on the topic, are the most valuable type of information for researchers seeking to understand the main content of a paper. A research contribution relates to the research problem addressed by the contribution, the research method, and (at least one) research result (Oelen et al. 2019). For example, “we build a transfer learning framework employing a diverse range of intermediate tasks covering sequence tagging with semantic and syntactic aspects, and natural language inference” and “we achieve competitive performance over both strong baselines and previous works” are two contribution statements (Park and Caragea 2020). Research contributions can help researchers understand the core content of a paper and the growth of innovation in science.

We can easily identify sentences about research contributions in the introduction section by locating statements such as “Our contributions are summarized as follows,” “The major contributions of this paper are,” or similar phrasings. Contributions come in different types, for example, creating datasets, building new models, and performing evaluations. If these contributions could be classified appropriately and automatically, it would be helpful for knowledge recommendation, structured abstract generation, and scientific evolution analysis. However, research contribution classification first requires an annotation scheme, that is, a codebook that defines the annotation categories and the annotation guidelines (Hovy and Lavid 2010), for research contributions.

We studied the existing annotation schemes for academic literature and noted that most of them focus on context types (the various roles a citation context plays in different components of an article) (Angrosh et al. 2012), citation functions (the author’s intention in choosing a certain citation) (Teufel et al. 2006), and future work types (the categories of future work sentences, such as method, resources, evaluation, application, problem, and others) (Hao et al. 2020). There is no annotation scheme for research contributions. To bridge this gap, we first propose a fine-grained annotation scheme with six categories for research contributions in academic literature. A human annotation experiment (in which humans are asked to identify and annotate the data) conducted on 5,024 sentences collected from the Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL Anthology) and the academic journal Information Processing & Management (IP&M) demonstrates the reliability of our scheme. Based on the resulting high-quality dataset, we built automated research contribution classifiers using classic machine learning (ML) models and transformer-based deep learning (DL) models. The contributions of our paper are as follows:

  1. We propose a fine-grained annotation scheme for research contributions in academic literature. The annotation scheme includes six types of research contributions: dataset/resources creation, theory proposal, model construction or optimization, algorithms/methods construction or optimization, performance evaluation, and applications.

  2. We conduct a human annotation experiment to evaluate the reliability of our scheme. We reach an inter-annotator agreement of Cohen’s kappa = 0.91 and Fleiss’ kappa = 0.91.

  3. We develop a high-quality research contribution dataset containing 5,024 annotated sentences across the six categories. The dataset is publicly available.

  4. We apply classic ML and transformer-based DL models for automated research contribution classification. The experimental results show that the SCI-BERT model achieves the best performance among all the models, with an F1 score of 0.58.

Related work

This paper proposes an annotation scheme for research contributions in academic literature and then builds a range of classifiers for automated research contribution identification. Therefore, we review the literature on the following sub-topics: annotation schemes for academic literature, research contribution analysis, and ML models for text classification.

Annotation schemes for academic literature

Several annotation schemes have been proposed for academic literature, designed either for full texts or for citation sentences only. At the full-text level, Swales (1990, 2011) produced one of the earliest models (i.e., CARS) for analyzing research papers. The CARS model consists of three “moves” (components) with “steps” (sub-components) that most research paper introductions cover. Hao et al. (2020) proposed an annotation scheme with six main categories and 17 sub-categories for future work sentences. D’Souza and Auer (2020) described ten core information units for organizing academic contributions in a knowledge graph (KG): ResearchProblem, Approach, Objective, ExperimentalSetup, Results, Tasks, Experiments, AblationAnalysis, Baselines, and Code. At the citation sentence level, Teufel et al. (2006) proposed an annotation scheme with four categories and 12 fine-grained categories for citation function. The annotation experiment was performed on 320 conference articles, and kappa agreement was used to measure reliability. Alternatively, Angrosh et al. (2012) presented a citation-centric annotation scheme for academic literature. It included six categories for citation sentences (i.e., sentences that include citation marks) and five categories for non-citation sentences (i.e., sentences surrounding the citation sentences and providing further descriptions of them). A pilot study was carried out with 11 annotators and nine articles, and reliability was measured with Krippendorff’s alpha. The above research provides us with insights into how to construct a fine-grained annotation scheme for research contribution sentences and how to evaluate its reliability.

Research contribution analysis and identification

Research contribution analysis is a relatively new topic that has recently garnered attention. Auer et al. (2018) constructed the Open Research Knowledge Graph (ORKG), in which each paper is summarized with its fundamental contribution properties and values. In the ORKG, contributions are interconnected via the graph, even across papers, which helps users compare research contributions between different papers while writing a literature review (Oelen et al. 2019). Similarly, Vogt et al. (2020) represented research contributions in scholarly knowledge graphs using knowledge graph cells. Compared to the ORKG (Auer et al. 2018), their Research Contribution Model (RCM) can generate a KG whose content is more easily maintained and easier to understand (Vogt et al. 2020). The ontology built by Vogt et al. (2020) provided a reference for defining the annotation categories in our study; however, it identified contributions from abstracts only rather than full texts. D’Souza and Auer (2020) developed an annotation scheme to identify research contributions from natural language processing (NLP) literature with the structure \(\langle subject, predicate, object\rangle\). In 2021, a scientific information extraction task called NLPContributionGraph was organized at SemEval-2021, aiming to build a comprehensive knowledge graph that publishes the research contributions of scholarly publications per paper, and even across papers, where the contributions are connected via the graph (Jaradeh et al. 2019). More than ten teams participated in and contributed to this task.

Instead of focusing on research contribution identification, some researchers have targeted a similar task: research highlight extraction. Wang et al. (2018) compared the differences between extracting highlights and abstracts from journal articles; however, they relied on classic features such as word frequency, term frequency-inverse document frequency (tf-idf), and sentence length. Rehman et al. (2021) conducted a preliminary experimental study using DL models to generate research highlights from scientific abstracts, but the performance still has much room for improvement.

Research contributions have also been applied to evaluate the value and impact of academic literature. Le et al. (2019) used research contributions identified from citing papers to evaluate the academic value of cited papers. In addition, research contributions have potential for assessing the innovation level of an article. Kok and Schuit (2012) designed a novel approach to map contributions in research articles in the health field to help stakeholders better utilize the research. Morton (2015) proposed an empirically grounded framework for assessing the impact of research based on research contributions. If research contributions can be automatically and accurately extracted from scientific literature, both of the above applications can be easily extended to other fields.

Machine learning and deep learning for text classification

Automated research contribution identification is a text classification task, so models suited to short text classification can also be used in this research. Kowsari et al. (2019) conducted a comprehensive review of ML algorithms for text classification, covering text feature extraction, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Li et al. (2020) compared multiple ML and DL models for text classification. Among these, pre-trained language models (i.e., ELMo, GPT, BERT), which apply unsupervised methods to mine semantic knowledge automatically and then construct pre-training targets to support semantic understanding, have been widely used and proven effective. Quantitative evaluation showed that BERT-based models obtain better results on most datasets (Li et al. 2020). Therefore, as suggested by Li et al. (2020), we opted to try BERT-based models first when implementing a text classification task. Recently, SCI-BERT, a pre-trained language model for scientific text, was developed to improve performance on downstream scientific NLP tasks (Beltagy et al. 2019).

High-quality research contribution dataset annotation

Data acquisition and preparation

The initial data in this research come from the ACL Anthology (nd 2022) and IP&M (nd 2022b). We select these two sources because their research contributions are clearly claimed and can be easily extracted. We manually identify the sentences that indicate research contributions. Specifically, we conduct a pre-investigation of the original corpus to summarize the patterns of research contribution sentences and then formulate the labeling specifications. For IP&M, we directly take the research highlights as the contributions. For ACL articles, we use two strategies to identify contribution sentences: (1) For explicit research contribution sentences, we locate them by identifying contribution block indicators, for example, “Our contributions are summarized as follows,” “The major contributions of this paper are,” or similar statements. (2) As suggested by Swales (1990), we locate implicit contribution sentences as findings in the last or second-to-last paragraph of the introduction section, teasing out several verbs or verb phrases that indicate research contributions, such as “present”, “introduce”, “compare”, “design”, “apply”, “develop”, etc. Following the above strategy, we collect a total of 3,374 research contribution sentences from ACL and 1,650 from IP&M.
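To illustrate these two strategies, the following minimal sketch shows how explicit and implicit contribution sentences could be located; the indicator phrases, verb list, and helper functions are illustrative assumptions rather than our full labeling specification.

```python
import re

# Illustrative subset of indicator phrases for explicit contribution blocks.
BLOCK_INDICATORS = [
    r"our contributions are summarized as follows",
    r"the (major|main) contributions of this (paper|work) are",
]

# Illustrative subset of verbs that often introduce implicit contribution statements.
CONTRIBUTION_VERBS = {"present", "introduce", "compare", "design", "apply", "develop", "propose"}


def explicit_contributions(sentences):
    """Return the sentences that follow an explicit contribution-block indicator."""
    hits, in_block = [], False
    for sent in sentences:
        if any(re.search(p, sent.lower()) for p in BLOCK_INDICATORS):
            in_block = True  # the indicator sentence itself is not a contribution sentence
            continue
        if in_block:
            hits.append(sent)
    return hits


def implicit_contributions(closing_paragraph_sentences):
    """Return sentences from the last two introduction paragraphs whose verbs signal a contribution."""
    return [s for s in closing_paragraph_sentences
            if CONTRIBUTION_VERBS & set(s.lower().split())]
```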

An annotation scheme for research contributions

We create an annotation scheme with six types of research contributions, adapted from the annotation scheme in our International Conference on Scientometrics and Informetrics (ISSI) 2021 paper (Chen and Kanuboddu 2021). The initial annotation scheme included nine categories: dataset creation, theory proposal, model construction, model optimization, new algorithm/method/technology, algorithm/method/technology optimization, performance evaluation, resources, and applications. Our pre-experimental study shows that ML models have difficulty distinguishing model construction from model optimization, algorithm construction from algorithm optimization, and dataset creation from resources, even though the sentences indeed belong to these different categories. Therefore, we merge these categories into model construction or optimization, algorithms/methods construction or optimization, and dataset/resources creation, respectively. We keep method and model as separate categories since they are different concepts, especially in computational linguistics, according to QasemiZadeh and Handschuh (2014). A more detailed explanation of each category, with definitions and examples, can be found in Table 1.

Table 1 Our annotation scheme for research contributions

Human annotation experiment

Annotation procedure

To evaluate the reliability of the research contribution annotation scheme discussed above and to create a dataset for automatic research contribution classification, we conduct the annotation experiment with six annotators who are designers of the scheme and very familiar with the annotation guidelines. They also have backgrounds in NLP and ML, which helps ensure annotation quality.

The six annotators are divided into two groups, so each sentence is annotated by three annotators. During the annotation, the annotators independently annotate the same number of sentences following the proposed scheme. A majority vote decides the final label for a sentence; if the label cannot be confirmed from the three annotators, a fourth annotator labels the sentence.
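As a minimal sketch of the majority-vote resolution described above (the escalation to a fourth annotator is represented only by a None return value, and the helper name is hypothetical):

```python
from collections import Counter

def resolve_label(labels):
    """Majority vote over three annotators; None means no majority,
    so a fourth annotator must adjudicate."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

print(resolve_label(["model", "model", "method"]))    # -> "model"
print(resolve_label(["model", "method", "dataset"]))  # -> None (escalate)
```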

Annotation results

We obtain 5,024 annotated sentences in total. We combine Cohen’s kappa (Carletta 1996) and Fleiss’ kappa (Falotico and Quatto 2015) to measure agreement. Cohen’s kappa is a statistic used to measure inter-rater agreement between two annotators (Carletta 1996); its value ranges between −1 and 1, and a kappa of 0.8 or higher is generally considered reliable. Fleiss’ kappa is a statistical extension of Cohen’s kappa used for determining agreement among more than two annotators (Falotico and Quatto 2015). Overall, we reached an inter-annotator agreement of Cohen’s kappa = 0.91 (averaged over the three annotator pairs) and Fleiss’ kappa = 0.91. The agreement is quite good considering the number of categories. To evaluate the annotation quality for each category, we also calculate the Fleiss’ kappa of the three annotators per category. The results are shown in Table 2, further demonstrating the high quality of the annotated research contribution dataset.
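For reference, both agreement measures can be computed with standard libraries; the following sketch assumes the three annotators’ labels are aligned per sentence, and the label arrays shown are hypothetical examples.

```python
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical labels from three annotators for the same four sentences.
a1 = ["theory", "model", "dataset", "evaluation"]
a2 = ["theory", "model", "dataset", "model"]
a3 = ["theory", "model", "dataset", "evaluation"]

# Average pairwise Cohen's kappa over the three annotator pairs.
pairs = [(a1, a2), (a1, a3), (a2, a3)]
avg_cohen = sum(cohen_kappa_score(x, y) for x, y in pairs) / len(pairs)

# Fleiss' kappa: rows are sentences, cells are per-category counts across annotators.
counts, _ = aggregate_raters(list(zip(a1, a2, a3)))
print(avg_cohen, fleiss_kappa(counts))
```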

Table 2 Fleiss’ kappa of the three annotators for each research contribution category

Statistical analysis of the annotated dataset

The relative frequency of each category observed in the annotation results is shown in Fig. 1. As can be seen from the figure, the top three categories are theory proposal, model construction or optimization, and algorithms/methods construction or optimization, with 1340, 1246, and 1041 contribution sentences, respectively, together accounting for 72.5% of the dataset. The distribution is aligned with the scopes of ACL and IP&M: since ACL is a top conference and IP&M a top journal in computer and information science, both require strong contributions, especially technical ones, before a paper can be accepted.

Fig. 1 The distribution of research contributions in each category

Key terms analysis in each category

We extract the top 20 key terms based on term frequency (listed in Fig. 2) for the contribution sentences in each category; the key terms are limited to uni-grams.

As shown in Fig. 2, most of the key terms are nouns and verbs. The noun terms reflect the most important contributions, and the verb terms indicate how the research contributions are presented in each category. For example, in the dataset/resources creation category, the key terms “data”, “dataset”, “corpus”, “xxx dataset”, and “annotate” are at the top of the list, and verbs such as “present” and “introduce” are frequently used to introduce a new dataset. Similarly, in the model construction or optimization category, key terms such as “model”, “neural network model”, and “language model” are prevalent. In the performance evaluation category, “performance”, “evaluation”, “evaluate”, “test”, and “outperform” are among the top key terms. The analysis of the key terms can provide effective features for automatic research contribution classification in the future.
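As a minimal sketch of how such frequency-based key terms could be extracted per category (the helper function and example sentences are illustrative assumptions; our actual preprocessing is not reproduced here):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_terms(sentences, k=20):
    """Return the k most frequent terms (English stop words removed)."""
    vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 1))
    counts = vectorizer.fit_transform(sentences).sum(axis=0).A1
    terms = vectorizer.get_feature_names_out()
    return sorted(zip(terms, counts), key=lambda t: t[1], reverse=True)[:k]

# Hypothetical bucket of sentences annotated as dataset/resources creation.
dataset_sentences = [
    "We present a new annotated corpus for relation extraction.",
    "We introduce a large-scale dataset of scientific abstracts.",
]
print(top_terms(dataset_sentences, k=10))
```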

Fig. 2 The top 20 key terms in each category

Automated research contribution classification

Text representation

In this research, we use two methods for text representation: manual features and pre-trained word embeddings. The feature-based method is not only labor-intensive but also sometimes less effective due to highly sparse vectors. In contrast, word embedding-based methods learn feature representations from a large corpus and generate shorter, dense vectors that better capture contextual information. High-performing pre-trained models include Word2Vec, GloVe, and BERT. However, these pre-trained models suffer from domain-specific issues and require fine-tuning on domain datasets. Therefore, this study investigates different word representation and feature extraction methods with classic ML and DL models.

For manual features, we incorporate the most frequent nouns (1298 features) and verbs (1344 features) along with tf-idf (1000 features). In addition, Word2Vec (Mikolov et al. 2013), a pre-trained word embedding released by Google, is applied to encode research contribution sentences for training the classic classifiers. Word2Vec represents each instance with a 300-dimensional dense vector, requiring our classifiers to learn fewer weights than the manual feature-based representation, which may help with generalization and avoiding overfitting. Moreover, SCI-BERT (Beltagy et al. 2019), a pre-trained language model trained on 1.14M scientific papers from Semantic Scholar, is used to encode text in the DL model. As mentioned, pre-trained embeddings are less effective on some domain-specific datasets, so using SCI-BERT can possibly overcome this limitation.
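The following sketch illustrates the two representations under simplified assumptions: `train_sents` is a hypothetical list of contribution sentences, the frequent-noun and frequent-verb indicator features are omitted, and the publicly available Google News Word2Vec vectors stand in for the embedding we used.

```python
import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical contribution sentences; in practice these are the annotated dataset.
train_sents = ["We propose a fine-grained annotation scheme for research contributions.",
               "We achieve competitive performance over strong baselines."]

# Manual representation: 1000 tf-idf features (noun/verb indicator features omitted here).
tfidf = TfidfVectorizer(max_features=1000)
X_tfidf = tfidf.fit_transform(train_sents)

# Word2Vec representation: average the 300-d pre-trained vectors of in-vocabulary tokens.
w2v = api.load("word2vec-google-news-300")

def embed(sentence):
    vectors = [w2v[token] for token in sentence.lower().split() if token in w2v]
    return np.mean(vectors, axis=0) if vectors else np.zeros(300)

X_w2v = np.vstack([embed(s) for s in train_sents])
```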

Classification algorithms

We implement multiple ML and DL classification models. Even though DL has proven to outperform other approaches in most NLP tasks, it requires more training data and computational resources; if effective features can be extracted and selected, ML models can also achieve good performance. Therefore, we compare several manual feature-based ML algorithms with the transformer-based DL model. The classification algorithms used in our study are summarized as follows:

  • Logistic Regression (LR) is a probabilistic classifier based on a logistic function to learn conditional probability. It assumes independence among features. We use L2 regularization and the lbfgs solver with a maximum of 700 iterations to optimize the model and avoid overfitting.

  • Random Forest (RF) is an ensemble model that fits several decision tree classifiers on a variety of subsets and averages their predictions for a more reliable predictive score and to avoid overfitting. Parameters in this model are kept at their default values.

  • K-nearest neighbors (KNN) is a non-generalizing learning model that stores the training data points. Classification is based on the majority vote of the k nearest neighbors’ labels for the data point being predicted. We set k = 5 as the number of neighbors.

  • Decision Trees (DT) make predictions by learning decision rules inferred from the training data. The model is highly interpretable but less generalizable and unstable since it prioritizes locally optimal decisions.

  • Naive Bayes (NB) is a probabilistic model based on Bayes’ theorem, assuming conditional independence between features. NB performs very well compared to more complex models in many cases, especially when the training data are small.

  • Support Vector Machines (SVM) map training instances into a high-dimensional space to maximize the distance between categories (the margin). Therefore, the model can perform well on high-dimensional datasets, even when the number of dimensions is greater than the number of instances. We set ’ovo’ (one-versus-one) as the decision function of the model.

  • BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based pre-trained language model. We implement two BERT-based models for comparison: BERT and SCI-BERT. We encode sentences with BERT and SCI-BERT embeddings trained on the BERT architecture and further fine-tune them on our dataset. The models are trained for eight epochs with a batch size of 32 and an Adam optimizer with a learning rate of 2e-5; a minimal fine-tuning sketch is given after this list.
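The fine-tuning sketch below uses the public SCI-BERT checkpoint with the hyperparameters reported above; `train_sents` and `train_labels` are hypothetical placeholders, and the sketch is a simplified illustration rather than our exact training script.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical training data; integer labels 0-5 correspond to the six categories.
train_sents = ["We propose a fine-grained annotation scheme ...",
               "We build a new dataset of contribution sentences ..."]
train_labels = [1, 4]

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=6)

enc = tokenizer(train_sents, padding=True, truncation=True, return_tensors="pt")
dataset = torch.utils.data.TensorDataset(
    enc["input_ids"], enc["attention_mask"], torch.tensor(train_labels))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# AdamW, the Adam variant commonly used for BERT fine-tuning, with lr = 2e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(8):  # eight epochs, as reported above
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        out.loss.backward()  # cross-entropy loss over the six classes
        optimizer.step()
```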

Notice from the exploratory data analysis (Fig. 1) that the “applications” class is strongly imbalanced in comparison to the other classes. Class imbalance significantly degrades model performance (Weng et al. 2020). Therefore, we apply the SMOTE oversampling algorithm (Chawla et al. 2002) to augment this class with enough data to train the models. All classic ML models are trained and validated with ten-fold cross-validation.
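The following sketch illustrates how SMOTE oversampling can be combined with ten-fold cross-validation so that oversampling is applied inside each training fold only; the synthetic feature matrix is a stand-in for our tf-idf features, not our actual data.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced stand-in for the real feature matrix and labels.
X, y = make_classification(n_samples=300, n_classes=6, n_informative=8,
                           weights=[0.3, 0.25, 0.2, 0.1, 0.1, 0.05], random_state=0)

# SMOTE inside the pipeline oversamples each training fold only,
# so the validation folds keep their original class distribution.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier()),
])
scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
print(scores.mean())
```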

Evaluation metrics

We use recall, precision, and F1 score to evaluate the performance on each category, since they are the most frequently used evaluation metrics for text classification (Li et al. 2020). For the overall performance, we use weighted-average precision, recall, and F1 score, where each class’s contribution to the average is weighted by its size.
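A minimal sketch of these metrics computed with scikit-learn, using hypothetical gold and predicted labels:

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Hypothetical gold and predicted category labels.
y_true = ["theory", "model", "dataset", "model", "evaluation", "applications"]
y_pred = ["theory", "model", "dataset", "theory", "evaluation", "model"]

# Per-category precision, recall, and F1.
print(classification_report(y_true, y_pred, digits=2, zero_division=0))

# Overall scores: each class's contribution weighted by its support (size).
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(p, r, f1)
```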

Results

The overall results are presented in Table 3. Overall, the BERT-based models show the best performance among all models, demonstrating the effectiveness of BERT for research contribution identification. In addition, SCI-BERT performs better than the general BERT model, with an F1 score of 0.58 versus 0.56. The results also indicate that the RF model using tf-idf, the most frequent nouns, and verbs as features is comparable to the general BERT model and performs better than the Word2Vec-based ML models. This indicates that research contribution identification requires a more domain-specific text representation and that the manually-engineered features capture adequate information for the task.

Table 3 The overall results of research contribution classification

Figure 3 compares the accuracy of the classic ML models with the Word2Vec embedding (left) and handcrafted feature (right) representations over ten-fold cross-validation. With the Word2Vec embedding, SVM achieves the highest performance, followed by LR. With the handcrafted features, however, RF outperforms the other classic ML models, followed by SVM. This strengthens our assumption about the importance of feature engineering for the performance of classic ML models. It is also consistent with the conclusion of Fernández-Delgado et al. (2014) that RF and SVM achieve the highest performance among 179 classifiers on 121 datasets.

The confusion matrix in Fig. 4 describes the performance of SCI-BERT, our best model, on the test set and gives a better idea of how the model performs across the six categories. It indicates a correlation between data size and the number of true positives: the model performs better in the classes with more data and worst in the “applications” class, with an F1 score of 0.36. That class accounts for only 4.2% of the dataset, which is insufficient for the model to learn its patterns. Even though SMOTE oversampling is applied to address class imbalance in the classic ML models, the performance of most classic ML models on this class is even worse, below 0.19 on the F1 score, except for RF with handcrafted features. Meanwhile, we do not use any method to handle class imbalance in the DL models, indicating that BERT-based models can overcome this issue by themselves.

Fig. 3 Cross-validation performance (accuracy) of classic ML models with Word2Vec embeddings (left) and manual features (right)

Fig. 4 Confusion matrix of SCI-BERT. From 0 to 5, the six categories are theory proposal, algorithms/methods construction or optimization, model construction or optimization, performance evaluation, dataset/resources creation, and applications, respectively

Discussion

According to Li et al. (2020), word embedding-based models such as BERT achieve better performance on most text classification datasets, which suggests that DL models can usually be implemented first to obtain state-of-the-art results. However, this conclusion does not always hold for domain-specific text classification tasks, as demonstrated by Chen et al. (2022) in a legal text classification task. This research further confirms that finding, as can be seen from the results of the Word2Vec and manual feature representations with ML models such as RF and SVM.

Many factors can affect model selection for domain-specific text classification, such as data, performance, computation, and interpretation (Chen et al. 2022). In this research, we aim to build a strong baseline for research contribution classification and therefore pay more attention to performance. For word embedding-based classification models, in addition to the data quality, the quality of the word embedding also affects model performance (Chen et al. 2021). The research contribution dataset used for classification is of high quality according to the annotation results reported above. As for the quality of the word embeddings, the pre-training data of the BERT-based models cover many more scientific concepts than the data used to train the Word2Vec embedding. Therefore, the BERT-based models achieve better results than the Word2Vec-based models (Shen and Liu 2021).

SCI-BERT, trained on large-scale academic publications, achieves the best performance on scientific NLP tasks, indicating the effectiveness of fine-tuning general language models with unlabeled domain texts for domain text classification (Beltagy et al. 2019; Chakravarthi 2021). Although the manual feature-based machine learning models do not perform as well as SCI-BERT, they reach performance similar to the general BERT model, indicating the effectiveness of our manually selected features. Note that the feature-based models are more efficient and require fewer computational resources than BERT (Chen et al. 2022). Moreover, the results of the feature-based models are easier to interpret.

Even for the simpler task of separating contribution sentences from non-contribution sentences, the reported performance is quite low (\(<50\%\)) (Wang et al. 2018). Classifying research contributions into different types is an even more challenging task. We believe our research has built a strong baseline for further work. The high-quality dataset and the baseline constructed in this study are intended to be the foundation of research contribution classification and the automated creation of summaries of fundamental contributions.

Conclusion and future work

In this paper, we propose a fine-grained annotation scheme with six categories of research contributions. Based on the proposed annotation scheme, we create a high-quality dataset for research contribution identification with 5,024 contribution sentences taken from the ACL Anthology and IP&M. The quality and consistency of the dataset are validated with very high kappa values, 0.91 for both Cohen’s kappa and Fleiss’ kappa. Furthermore, we provide several benchmark models on the created dataset: classic ML and DL models using handcrafted features and contextual word embeddings. Our experiments show that the SCI-BERT model performs best, followed by random forest with the manual feature extraction method. In the future, we plan to expand the dataset, especially for heavily imbalanced classes such as “applications”. We will also increase the comprehensiveness of the dataset by including contribution sentences from other journals. To improve model performance, we will explore both transfer learning, by fine-tuning with more related data, and generating more effective features for research contribution representation.