Introduction

Citations are regarded as a proxy of scientific knowledge flow in the literature, and they are therefore widely used for academic evaluation purposes, such as ranking researchers (Hirsch, 2005), journals (Garfield, 2006), and organizations (Lazaridis, 2010). However, most studies treat all references as equally important to the citing publication, which clearly does not reflect actual citing behavior. Identifying important citations therefore plays a vital role in scientific evaluation and has promising potential for the fair distribution of academic resources and the fair evaluation of researchers. In recent years, researchers have argued that citations are not equally important and have presented various techniques to identify important citations (An et al., 2021a; Hassan et al., 2017, 2018a, 2018b; Qayyum & Afzal, 2019; Valenzuela et al., 2015; Wang et al., 2020; Zhu et al., 2015).

Given a scholarly article, its important citations are the references that contribute substantially to this article. Citation importance is thus closely related to citation function (i.e., the reason for citing a paper) (Teufel et al., 2006; Valenzuela et al., 2015). Although various classification schemes for citation function have been constructed in the literature (Abu-Jbara et al., 2013; Dong & Schafer, 2011; Li et al., 2013; Radoulov, 2008; Teufel et al., 2006), these schemes were greatly simplified after 2015 to facilitate annotation and to build machine-learning models with satisfactory performance (An et al., 2021a). For example, Zhu et al. (2015) distinguished influential references from incidental ones according to the role a reference plays in the core idea or method of a given citing paper. Valenzuela et al. (2015) classified citations into related work, comparison, using the work, and extending the work, and then folded these categories into two: incidental citations (related work and comparison) and important citations (using the work and extending the work). Many follow-up studies adopted these simplified classification schemes (An et al., 2021a; Hassan et al., 2017, 2018a, 2018b; Qayyum & Afzal, 2019; Wang et al., 2020).

To identify important citations, supervised learning methods are commonly used: they learn a discriminant pattern from a labeled dataset to form a classification model. However, most supervised learning methods require a large number of labeled instances to ensure the performance of the resulting models (Xu et al., 2011). To the best of our knowledge, owing to the time-consuming and labor-intensive annotation, only the annotated datasets of Valenzuela et al. (2015) and Zhu et al. (2015) are publicly available, containing 456 and 2,685 pairs of citing-cited articles respectively (cf. Section Datasets). Consequently, the overwhelming majority of classification models for identifying important citations in previous studies were built on only hundreds of labeled instances (Hassan et al., 2017, 2018a, 2018b; Qayyum & Afzal, 2019; Valenzuela et al., 2015; Wang et al., 2020).

As a matter of fact, a large amount of relatively inexpensive unlabeled instances is available, but it has not been exploited for identifying important citations. One branch of machine learning, semi-supervised learning, is able to leverage a large amount of unannotated instances along with a small amount of annotated instances. The last two decades have witnessed significant progress in semi-supervised learning, and many learning strategies and methods have been proposed in the literature, such as self-training (Yarowsky, 1995), co-training (Blum & Mitchell, 1998), the transductive support vector machine (TSVM) (Bennett & Demiriz, 1999; Joachims, 1999), and graph-based methods (Zhu et al., 2005). Among these approaches, the self-training strategy offers more choices of base classifiers and great flexibility in threshold setting.

However, identifying important citations with semi-supervised models remains largely under-studied. To make full use of unlabeled instances and improve model performance, a semi-supervised self-training strategy is deployed here to identify important citations, with SVM and Random Forest (RF) models as base classifiers. In this study, we investigate whether and to what extent unlabeled instances can benefit a supervised model. From the perspective of practical significance, we also hope that the proposed strategy for identifying important citations can contribute to the fair evaluation of scientific research and academic achievements.

The rest of the article is structured as follows. Section Related work briefly reviews important citations identification and semi-supervised learning, and the framework of semi-supervised self-training for important citations identification is introduced in Section Methodology. Section Datasets reports the statistics of two different types of datasets from Valenzuela et al. (2015) and Zhu et al. (2015). In Section Experimental results and discussion, two experiments with SVM and RF models armed with the semi-supervised self-training strategy are conducted, and Section Conclusions concludes this work.

Related work

Important citations identification

In recent years, the classification of citations has shifted from manual classification (Garfield, 1965) to automatic identification, and from multiple categories (Abu-Jbara et al., 2013; Dong & Schafer, 2011; Li et al., 2013; Radoulov, 2008; Teufel et al., 2006) to only two categories (important vs. incidental) (Iqbal et al., 2021). Various approaches have been developed in the literature to identify important citations automatically.

Zhu et al. (2015) collected about 100 scholarly articles from 40 researchers, together with the authors' opinions on the references most essential to their works, which generated 3,143 labeled pairs of citing-cited papers. They then used an SVM model as the supervised learning algorithm to classify citations into influential and non-influential categories. Valenzuela et al. (2015) annotated 465 citations from the Association for Computational Linguistics (ACL) anthology into important and incidental categories, and two supervised learning models (SVM and RF) were used to classify important citations. Since then, a plethora of studies have been conducted with different supervised learning models on these annotated datasets.

Hassan et al. (2017) employed five classification techniques (SVM, RF, Naïve Bayes, K-Nearest Neighbors and Decision Tree) on the dataset of Valenzuela et al. (2015) with 14 features, including context-based features, cue-word-based features and textual features. They found that the RF model performed best in terms of the area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR), followed by the SVM model. Hassan et al. (2018b) further exploited the potential of a deep learning model, the Long Short-Term Memory (LSTM) model, for this task on the same dataset. They observed that the LSTM model outperformed the traditional counterparts, but its performance was limited by the unavailability of large-scale annotated instances.

Compared to Valenzuela et al. (2015), Qayyum and Afzal (2019) improved the precision of the SVM and RF models by relying on freely available metadata, using the dataset of Valenzuela et al. (2015) and their self-collected dataset with 488 labeled citation pairs. Wang et al. (2020) distinguished important and non-important citations by engineering syntactic and contextual features on the dataset of Valenzuela et al. (2015) and their self-annotated dataset with 458 citation pairs. Zeng et al. (2020) detected citation worthiness by using a Bidirectional Long Short-Term Memory (Bi-LSTM) network with an attention mechanism and contextual information. An et al. (2021a) combined generative and discriminative models for identifying important citations on the datasets of Valenzuela et al. (2015) and Zhu et al. (2015); they found that the RF model outperformed the SVM model, while the Convolutional Neural Network (CNN) model did not achieve the desired performance due to the small volume of annotated instances. Aljuaid et al. (2021) improved performance by using sentiment analysis of in-text citations to identify important citations with SVM, Kernel Logistic Regression (KLR) and RF models on the dataset of Valenzuela et al. (2015) and the dataset of Qayyum and Afzal (2019).

It can be seen that supervised learning is the mainstream technique for this task. Among the supervised models, SVM and RF have been the most commonly used and have outperformed the other counterparts. However, supervised learning relies heavily on a large amount of labeled instances to maintain performance, which contrasts with the reality that labeled instances are costly to obtain. In fact, only the two small-scale labeled datasets of Valenzuela et al. (2015) and Zhu et al. (2015) are publicly available, and the large amount of unlabeled instances has so far remained unexploited.

Semi-supervised learning

In practice, to overcome the limitation of scarce labeled instances and make full use of unlabeled instances, semi-supervised learning has been receiving increasing attention. It attempts to harness unlabeled instances to exceed the performance of supervised learning models. Over the past two decades, many semi-supervised classification methods have been proposed on the basis of different assumptions, such as the smoothness, low-density and manifold assumptions (van Engelen & Hoos, 2020).

According to their optimization procedures, semi-supervised classification algorithms can be divided into two groups, namely inductive algorithms and transductive algorithms (van Engelen & Hoos, 2020). Inductive algorithms aim to form a classification model that predicts over the whole input space. Among them, generative mixture models combined with expectation-maximization (EM) are considered the earliest semi-supervised learning methods (Zhu, 2008), which require identifiability and model correctness to maintain performance. Wrapper methods are the most commonly used: they train supervised base classifiers on labeled instances and utilize pseudo-labeled instances to augment performance, including self-training (Yarowsky, 1995), co-training (Blum & Mitchell, 1998), etc. In theory, any supervised classifier can be used as a base learner in this group of methods, which is deemed one of their most significant advantages.

The other group of semi-supervised methods comprises transductive algorithms, which only predict labels for the given set of unlabeled instances. The semi-supervised SVM (S3VM) was proposed as an extension of the SVM to semi-supervised learning; the transductive SVM (TSVM) (Joachims, 1999; Vapnik, 1998) aims to find the maximum margin over both labeled and unlabeled instances, but its optimization is NP-hard. In addition, graph-based methods define a graph over labeled and unlabeled instances whose edges reflect pairwise similarity (Zhu et al., 2005): the larger the edge weight, the more likely two instances share the same label. Graph-based methods include Mincut (Blum & Chawla, 2001), Gaussian random fields and harmonic functions (Zhu et al., 2003), etc. However, graph construction relies on domain knowledge and has high time complexity.

In general, the self-training method expands the training set with predictions on unlabeled instances. It is easy to operationalize, offers great flexibility in threshold setting, and allows more choices of base classifier. It has been applied in many domains, such as word sense disambiguation (Yarowsky, 1995), object detection (Rosenberg et al., 2005), sentence subjectivity classification (Wang et al., 2008), and sentiment classification (He et al., 2011). Furthermore, it has been shown to be effective in improving the predictive performance of base classifiers (Li et al., 2008; Tanha et al., 2017; Zhang et al., 2021). Therefore, to make full use of unlabeled instances, the semi-supervised self-training method is adopted to identify important citations in this paper.

Methodology

Figure 1 depicts the sketch of our research framework for important citations identification, which is based on full-text articles. After the preprocessing steps, six groups of features are extracted in the feature engineering module. The whole dataset is then divided into a labeled dataset and an unlabeled dataset, which are fed to the SVM and RF models with the self-training strategy to identify important citations. The self-training strategy, preprocessing, feature engineering, and statistics of the datasets are described in more detail in the following sections.

Fig. 1 Research framework on identifying important citations

Self-training strategy

The main idea of the self-training strategy is to train a base classifier on a small volume of labeled instances and make predictions on a large amount of unlabeled instances. Pseudo-labeled instances with a high level of confidence are then selected to expand the labeled dataset, and the model is retrained on the newly combined labeled dataset. This process is iterated until no new instance meets the condition. A significant advantage of this method is that, in theory, any supervised model can be used as the base classifier (van Engelen & Hoos, 2020).

Figure 2 depicts the framework of important citations identification on the basis of the semi-supervised self-training strategy. First, a supervised learning model (SVM or RF) is trained on the labeled dataset with a fivefold cross-validation procedure. After learning from the training set of each fold, the labels of the unlabeled dataset are predicted. We select the samples reaching the 95%, 90%, 85%, 80%, 75%, and 70% confidence levels as the pseudo-labeled dataset and add them back to the training set. For each fold, the model is retrained on the newly combined dataset and evaluated on the test set, with the involved parameters optimized accordingly. The areas under the PR and ROC curves are used as performance indicators. Please refer to the pseudocode in Algorithm 1 for more details on our methodology for identifying important citations (in our case, \(V=5\)).

Fig. 2 Framework of semi-supervised self-training strategy for identifying important citations

Algorithm 1 Semi-supervised self-training for identifying important citations
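As a point of reference, the loop below is a minimal sketch of the self-training procedure summarized above and in Algorithm 1, assuming scikit-learn-style base classifiers that expose predict_proba; the function name, the single fixed confidence threshold, and the random placeholder data are illustrative rather than the exact implementation used in this study. In our experiments the same idea is applied inside each fold of the fivefold cross-validation, with the threshold varied from 70% to 95%.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train(base_clf, X_labeled, y_labeled, X_unlabeled, confidence=0.95):
    """Iteratively augment the labeled set with high-confidence pseudo-labels,
    stopping when no remaining unlabeled sample reaches the threshold."""
    X_l, y_l, X_u = X_labeled.copy(), y_labeled.copy(), X_unlabeled.copy()
    while len(X_u) > 0:
        base_clf.fit(X_l, y_l)
        proba = base_clf.predict_proba(X_u)        # class probabilities on the unlabeled pool
        conf = proba.max(axis=1)                   # confidence of the predicted label
        keep = conf >= confidence
        if not keep.any():                         # no new instance meets the condition
            break
        y_pseudo = base_clf.classes_[proba[keep].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[keep]])
        y_l = np.concatenate([y_l, y_pseudo])
        X_u = X_u[~keep]
    return base_clf.fit(X_l, y_l)                  # final model on the expanded training set

# Illustrative call with random placeholder features (real features come from Table 1)
rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(456, 10)), rng.integers(0, 2, 456)
X_unlab = rng.normal(size=(8085, 10))
model = self_train(RandomForestClassifier(random_state=0), X_lab, y_lab, X_unlab)
```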

Preprocessing

The preprocessing includes the following steps: (1) The citing papers are collected in PDF format and converted to plain text with the Xpdf toolkit (http://xpdfreader.com). (2) The textual data is parsed with the ParsCit software (Councill et al., 2008) to extract the title, author list, abstract, main body, and references of each citing paper. It is worth noting that ParsCit normalizes each section of a citing publication into a generic section header (introduction, related work, method, experiment, discussion, and conclusion). To avoid parsing mistakes, each parsed document is checked carefully and corrected manually. (3) The citation contexts are extracted on the basis of regular expressions. (4) All textual information, including citation contexts and abstracts, is cleaned with the Natural Language Toolkit (NLTK).
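The exact regular expressions are not reported in this paper; the snippet below is only a hedged illustration of step (3), assuming numeric markers such as [14] and author-year markers such as (Zhu et al., 2015), a naive sentence splitter, and a one-sentence window on each side of the citing sentence.

```python
import re

# Hypothetical patterns for numeric ([14]) and parenthetical author-year markers.
NUMERIC = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")
AUTHOR_YEAR = re.compile(r"\(([A-Z][A-Za-z-]+(?: et al\.)?,\s*\d{4}[a-z]?)\)")

def citation_contexts(body_text, window=1):
    """Return each citation marker with the sentence containing it plus
    `window` neighbouring sentences on each side as its citation context."""
    sentences = re.split(r"(?<=[.!?])\s+", body_text)   # naive sentence splitter
    contexts = []
    for i, sent in enumerate(sentences):
        for pattern in (NUMERIC, AUTHOR_YEAR):
            for match in pattern.finditer(sent):
                lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
                contexts.append((match.group(1), " ".join(sentences[lo:hi])))
    return contexts

text = ("We build on the topic model in [2]. It scales well. "
        "Influence was modelled earlier (Zhu et al., 2015).")
print(citation_contexts(text))
```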

Feature engineering

For feature engineering, the following six groups of features from our previous study (An et al., 2021a) are used, as shown in Table 1. The effectiveness of these features for identifying important citations has been verified. G1 contains two generative features extracted from the Citation Influence Model (CIM) (Dietz et al., 2007; Xu et al., 2019), which captures topical innovation and topical inheritance via citations on the basis of a first-order Markovian assumption. One is a multinomial distribution over references, which reflects the degree of importance of a cited publication to a citing publication. The other is the symmetrized Kullback–Leibler divergence between the multinomial topic distributions of a pair of citing and cited publications, which represents their similarity in the topic space.

Table 1 Features utilized for important citation identification
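For reference, one common form of the symmetrized divergence used in G1 is \(D_{sKL}(p,q)=\tfrac{1}{2}\left[D_{KL}(p\parallel q)+D_{KL}(q\parallel p)\right]\); the sketch below computes it for two topic distributions, although the exact convention followed by the CIM implementation may differ.

```python
import numpy as np

def symmetrized_kl(p, q, eps=1e-12):
    """Symmetrized Kullback-Leibler divergence between two topic distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()          # renormalize after smoothing
    kl_pq = np.sum(p * np.log(p / q))
    kl_qp = np.sum(q * np.log(q / p))
    return 0.5 * (kl_pq + kl_qp)

# Topic mixtures of a citing and a cited paper (illustrative values only)
print(symmetrized_kl([0.6, 0.3, 0.1], [0.5, 0.4, 0.1]))
```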

Following the structure of a paper, G2 counts the number of times each cited paper is mentioned in each section (e.g., introduction, related work and so on) of a citing paper. In practice, a cited publication may be mentioned separately in the text of a citing publication or together with other cited publications. Abu-Jbara et al. (2013) argued that a separately cited publication is usually more important to a citing publication than one cited in a group. Therefore, G3 calculates the proportion of mentions in which a cited publication appears separately in a citing paper.

Valenzuela et al. (2015) observed that if a pair of citing-cited articles shares one or more authors, this may indicate that the citing article extends the cited one. Hence, the Jaccard similarity coefficient between the citing authors and the cited authors is used as G4. This feature can reflect the development of one's own work. Indeed, it is rare for a pair of citing-cited articles to share common authors, so in most cases this feature is zero. According to our observation (An et al., 2021a), however, a non-zero value of this feature provides a signal for distinguishing important citations.

G5 counts the number of important/incidental cue words appearing in citation contexts, which may reveal the citing intention of authors. Specific words can hint at different citing intentions and thereby reveal different degrees of importance. For example, "according to" or "use" may indicate that a method, technique or theory from the cited paper is used in the citing paper. From over 80 papers on citation behavior, Hassan et al. (2017) compiled 81 important cue words and 51 incidental ones, and the calculation of G5 is based on these cue words. Finally, G6 is the cosine similarity between the citation contexts and the abstract of the focal cited paper; the similarities are averaged if the cited paper is mentioned several times in a citing article. Please refer to An et al. (2021a) for more details.
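As a rough illustration of how G4 and G6 might be computed, the sketch below uses a set-based Jaccard coefficient over author names and a TF-IDF representation for the cosine similarity; the actual text representation and author-name normalization in An et al. (2021a) may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def author_jaccard(citing_authors, cited_authors):
    """G4: Jaccard similarity between citing and cited author sets."""
    a, b = set(citing_authors), set(cited_authors)
    return len(a & b) / len(a | b) if a | b else 0.0

def context_abstract_similarity(citation_contexts, cited_abstract):
    """G6: mean cosine similarity between each citation context and the cited abstract."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(citation_contexts + [cited_abstract])
    sims = cosine_similarity(matrix[:-1], matrix[-1])
    return float(sims.mean())

print(author_jaccard(["A. Smith", "B. Lee"], ["B. Lee", "C. Kim"]))      # 1/3
print(context_abstract_similarity(
    ["We adopt the influence model proposed by the cited work."],
    "We propose a citation influence model for topical inheritance."))
```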

Datasets

Two types of datasets are used to evaluate the performance of the proposed methodology in this paper.

Dataset I: This dataset was collected from a collection of 20,527 papers in the ACL anthology with 106,509 citations, from which 465 citations were randomly selected (Valenzuela et al., 2015). The citations were manually annotated by one expert with the labels 0 (related work), 1 (comparison), 2 (using the work), and 3 (extending the work) according to the citation contexts in the citing papers. To reduce the bias introduced by human annotation, inter-annotator agreement was verified between two experts for a subset of the dataset and reached 93.9% (Valenzuela et al., 2015). In this study, we combine the "related work" and "comparison" categories into the incidental class with the label 0, and the "using the work" and "extending the work" categories into the important class with the label 1.
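The mapping from the four original categories to the binary labels can be expressed as follows (a minimal sketch with assumed column names, not the authors' code):

```python
import pandas as pd

# 0 = related work, 1 = comparison, 2 = using the work, 3 = extending the work
df = pd.DataFrame({"pair_id": ["P1", "P2", "P3"], "category": [0, 2, 3]})  # toy rows
category_to_binary = {0: 0, 1: 0, 2: 1, 3: 1}        # incidental -> 0, important -> 1
df["label"] = df["category"].map(category_to_binary)
print(df)
```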

Table 2 lists the statistics of this dataset. Note that during the preprocessing steps, several citing-cited paper pairs with missing information, a non-English language or a non-article document type are removed. From the 465 labeled pairs, 434 citing papers and 4,589 unique cited papers are collected after preprocessing. In the end, this yields 8,541 pairs of citing-cited papers in total, with 456 annotated pairs and 8,085 unannotated ones. The summary of the labeled dataset is listed in Table 3, of which 14.7% are important citations. As with the labeled dataset, the preprocessing and feature engineering are also conducted on the unlabeled dataset.

Table 2 Statistics of Dataset I and Dataset II
Table 3 Summary of labeled Dataset I and Dataset II

Dataset II: This dataset is very different from Dataset I. It is an author-labeled dataset collected by Zhu et al. (2015). Through a questionnaire survey, about 40 researchers were asked to indicate the references most essential to each paper they provided. Following the previous classification approach (An et al., 2021a), the references marked as "essential" are viewed as important citations and the others as incidental citations. After the same preprocessing steps, 112 citing papers and 2,579 unique cited papers are collected, which yields 2,685 labeled citing-cited paper pairs in total, of which only 11.6% are important citations. Table 2 shows the statistics of Dataset II, and the summary of the labeled dataset is shown in Table 3. It is worth mentioning that Dataset II involves 10 different disciplines, as illustrated in Table 4, with about 70% of the citing papers coming from Computer Science.

Table 4 Statistics of disciplines in Dataset II

Experimental results and discussion

As shown in Fig. 3, two experiments based on two different types of datasets are conducted. Experiment I is based on Dataset I, an expert-labeled dataset from a single domain (Valenzuela et al., 2015). Experiment II is based on Dataset II, an author-labeled dataset from multiple domains (Zhu et al., 2015). In Dataset I, only a part of the citations was actually annotated, so a labeled dataset and an unlabeled dataset are readily obtained. In Dataset II, however, all citations were labeled by the papers' own authors. Therefore, to reserve a data subset as an unlabeled one, several division ratios are applied to Dataset II in Experiment II.

Fig. 3 Experimental framework on identifying important citations

Experiment I

The first experiment is based on Dataset I, which contains 456 labeled citing-cited paper pairs and 8,085 unlabeled ones. As two state-of-the-art discriminative models, SVM and RF are used as our base classifiers. First, these two models are trained on the labeled dataset; grid search with fivefold cross-validation (Xu et al., 2007) is used to tune their parameters. Figure 4 shows the PR curves and ROC curves of the SVM and RF models. The areas under the ROC curve (AUC-ROC) of the SVM and RF models are 0.9287 and 0.9798 respectively, and the areas under the PR curve (AUC-PR) are 0.7628 and 0.9056 respectively. The RF model outperforms the SVM model, which is in accordance with most previous studies (An et al., 2021a; Hassan et al., 2017, 2018a, 2018b; Qayyum & Afzal, 2019; Valenzuela et al., 2015; Wang et al., 2020; Zhu et al., 2015).

Fig. 4 The PR curves (a) and ROC curves (b) of SVM and RF models on the labeled dataset with supervised learning strategy
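A minimal sketch of this supervised baseline step is given below, assuming scikit-learn's GridSearchCV with fivefold cross-validation and AUC-PR/AUC-ROC scoring; the parameter grids and the random placeholder features are illustrative only, not the grids tuned in this study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_predict, StratifiedKFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(456, 10)), rng.integers(0, 2, 456)   # placeholder features/labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grids = {
    "SVM": (SVC(probability=True), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}),
    "RF": (RandomForestClassifier(random_state=0), {"n_estimators": [100, 300]}),
}
for name, (clf, grid) in grids.items():
    search = GridSearchCV(clf, grid, cv=cv, scoring="average_precision").fit(X, y)
    proba = cross_val_predict(search.best_estimator_, X, y, cv=cv,
                              method="predict_proba")[:, 1]
    print(name, "AUC-PR:", average_precision_score(y, proba),
          "AUC-ROC:", roc_auc_score(y, proba))
```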

Then, semi-supervised self-training on the unlabeled dataset is conducted. After learning from the training set of each fold, the labels of the unlabeled dataset are predicted, and the samples reaching the 95%, 90%, 85%, 80%, 75%, and 70% confidence levels are added back to the training set. Table 5 lists the number of new samples for each fold at each confidence level. For each fold, the resulting model is retrained on the newly combined dataset and evaluated on the test set, with grid search again used to tune the involved parameters. Table 6 reports the mean AUC-ROC and AUC-PR over the 5 folds under different confidence levels. The AUC-PR and AUC-ROC of the SVM model reach their maximum at the 75% confidence level, at 0.8102 and 0.9622 respectively, while the RF model has the highest AUC-PR and AUC-ROC at the 95% confidence level (0.9248 and 0.9841). These two models outperform their supervised counterparts as well as several benchmark methods in Table 7, including Valenzuela et al. (2015), Hassan et al. (2017), Hassan et al. (2018a) and Wang et al. (2020). That is, our semi-supervised methodology with SVM and RF as base classifiers performs best in terms of AUC-PR and AUC-ROC.

Table 5 Number of new samples under different confidence levels
Table 6 Performance of SVM and RF models with semi-supervised strategy under different confidence levels
Table 7 Performance of benchmark methods on the dataset in Valenzuela et al. (2015)
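For a single fold, the counts reported in Table 5 can be reproduced in spirit by thresholding the predicted class probabilities at each confidence level, as in the hedged sketch below (random placeholder data, not our actual feature matrices).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(365, 10)), rng.integers(0, 2, 365)   # one fold's training set
X_unlab = rng.normal(size=(8085, 10))                                # unlabeled pool

proba = RandomForestClassifier(random_state=0).fit(X_lab, y_lab).predict_proba(X_unlab)
conf = proba.max(axis=1)                              # confidence of the predicted label
for level in (0.95, 0.90, 0.85, 0.80, 0.75, 0.70):
    print(f"{level:.0%}: {(conf >= level).sum()} pseudo-labeled samples added")
```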

Further, to determine the contribution of each group of features, we perform an additional experiment and observe the changes in mean AUC-PR and mean AUC-ROC. Controlling for the structure features (G2), Table 8 shows the mean AUC-PR and AUC-ROC scores of the SVM model at the 75% confidence level and the RF model at the 95% confidence level, their rankings (in parentheses), and the average rank when different groups of features are used under fivefold cross-validation. For each combination, the parameters are optimized separately. The baseline model based on the structure features achieves a mean AUC-PR of about 0.7600 and 0.7903, and an AUC-ROC of about 0.8906 and 0.4743. The author-overlap based features (G4) rank first, raising the AUC-PR to 0.9462 and 0.8145, and the AUC-ROC to 0.8145 and 0.4798, respectively. The CIM (Citation Influence Model) (Xu et al., 2019) based features (G1) rank second, which demonstrates that the features generated from the generative model can improve the performance of important citations identification. This observation is in accordance with previous work (An et al., 2021a).

Table 8 The performance of semi-supervised SVM and RF models with different groups of features in terms of mean AUC-PR, AUC-ROC, and their ranks

Experiment II

To check the generalizability of the proposed method, the second experiment is conducted on Dataset II, which consists of 2,685 author-labeled citing-cited paper pairs. First, similar to Experiment I, the SVM and RF models with the supervised learning strategy are trained on all 2,685 labeled instances with the 6 groups of features under a fivefold cross-validation procedure. Figure 5 shows the PR curves (a) and ROC curves (b) of the SVM and RF models on Dataset II with the supervised learning strategy. The SVM and RF models obtain 0.7458 and 0.7480 in terms of AUC-ROC, and 0.3131 and 0.3366 in terms of AUC-PR, respectively. Clearly, both models perform worse on Dataset II than on Dataset I (cf. Subsection Experiment I). The main reasons can be attributed to two points: (1) the different disciplines in Dataset II follow different citation patterns; (2) it is very likely that the authors in Dataset II only annotated the most essential references and ignored the less essential ones. The annotation guideline for Dataset II differs substantially from that for Dataset I, so the patterns of important citations may vary between Dataset II and Dataset I. Further verification is discussed in the subsection "Discussion" below.

Fig. 5 The PR curves (a) and ROC curves (b) of SVM and RF models on Dataset II with supervised learning strategy

To conduct the semi-supervised learning task, 10%, 15%, 20%, 25%, or 30% of Dataset II is randomly selected as the labeled dataset and the rest serves as the pseudo-unlabeled dataset. To ensure the consistency of the data characteristics, the labeled and unlabeled datasets follow approximately the same class distribution. Table 9 lists the statistics of the labeled and pseudo-unlabeled datasets under the different division ratios.

Table 9 Statistics of labeled data and pseudo-unlabeled data under different division ratios in Dataset II
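One straightforward way to obtain such class-distribution-preserving splits is stratified sampling, sketched below with scikit-learn's train_test_split; the placeholder features and the 11.6% positive rate are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2685, 10))                      # placeholder feature matrix
y = (rng.random(2685) < 0.116).astype(int)           # ~11.6% important citations

for ratio in (0.10, 0.15, 0.20, 0.25, 0.30):
    X_lab, X_unlab, y_lab, _ = train_test_split(
        X, y, train_size=ratio, stratify=y, random_state=0)
    print(f"{ratio:.0%}: {len(X_lab)} labeled / {len(X_unlab)} pseudo-unlabeled, "
          f"{y_lab.mean():.3f} important")
```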

According to Experiment I, the SVM model reaches its maximum at the 75% confidence level and the RF model performs best at the 95% confidence level. Therefore, the SVM model at the 75% confidence level and the RF model at the 95% confidence level are deployed in this experiment. That is, after learning from the labeled training set of each fold and predicting the unlabeled instances, samples above the 75% confidence level for SVM and above the 95% confidence level for RF are added back to the training set. It is noteworthy that we also tried other confidence levels in our experiments, but no performance improvement was observed. Figure 6 shows the trends of AUC-PR and AUC-ROC of SVM and RF with the semi-supervised strategy under different division ratios; the red lines denote the supervised learning counterparts.

Fig. 6 The PR and ROC of SVM under 75% confidence level and RF under 95% confidence level with semi-supervised strategy under different division ratios

From the perspective of the PR curves, similar trends can be observed for the SVM and RF models. In more detail, the overall trend rises first, reaches a peak at the 15% division ratio, and then falls below the corresponding curve of the supervised learning counterpart at the 30% division ratio. In other words, the models with the semi-supervised strategy are slightly better than their supervised counterparts in terms of AUC-PR when the division ratio is less than 20%. However, as the proportion of labeled instances increases further, the performance of semi-supervised learning shows a downward trend. We ascribe this trend to the mixed disciplines in Dataset II. From Section "Datasets", it is clear that Dataset II covers about 10 different disciplines, and the distribution of the number of documents per discipline is extremely skewed. When the proportion of labeled data is expanded, more and more citations from other disciplines are added to the training set, meaning that more outlier instances appear in the training set. Furthermore, too few instances in a given discipline may cause a machine learning model to underfit. In this situation, the predicted labels of pseudo-labeled instances may differ greatly from the ground-truth labels, and consequently the performance of the classifiers tends to worsen. Therefore, for this type of mixed dataset, it is better to ensure that there is a sufficient number of instances in each discipline.

In addition, the ROC curves of both the SVM and RF models fluctuate around the corresponding curves of the supervised learning counterparts. Since a large change in the number of false positives leads to only a small change in the false positive rate when the dataset is highly skewed, ROC analysis is not sensitive to highly skewed datasets (Davis & Goadrich, 2006). Therefore, the ROC curve may not be suitable for this highly skewed dataset; the PR curve provides a more informative picture than the ROC curve for Dataset II.
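A small synthetic example of this effect is sketched below: on a heavily imbalanced toy set, pushing a block of negatives into the high-score region changes AUC-ROC only modestly while AUC-PR drops sharply (illustrative data only, not results from our datasets).

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = np.r_[np.ones(50), np.zeros(1000)]               # roughly 5% positives, similar skew

# Classifier A: positives score high, few negatives overlap with them
scores_a = np.r_[rng.normal(2.0, 1.0, 50), rng.normal(0.0, 1.0, 1000)]
# Classifier B: same positives, but 100 negatives pushed into the high-score region
scores_b = scores_a.copy()
scores_b[50:150] += 2.0

for name, s in [("A", scores_a), ("B", scores_b)]:
    print(name, "AUC-ROC:", round(roc_auc_score(y, s), 3),
          "AUC-PR:", round(average_precision_score(y, s), 3))
```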

Then, the semi-supervised learning strategy is applied separately to the Computer Science discipline. When the RF model serves as the base classifier, too few new samples are added to the training dataset, so the experiment is only implemented with the SVM model as the base classifier. Table 10 shows the performance of the SVM model at the 75% confidence level. The performance is not satisfactory when the division ratio is lower than 25%. As the proportion of training instances increases, the performance tends to improve and exceeds the supervised learning counterpart at the 25% division ratio. Overall, however, the performance falls short of our expectation; the different annotation guideline for important citations in Dataset II likely accounts for the unsatisfactory performance on this dataset.

Table 10 Performance of SVM model with semi-supervised strategy on the discipline of Computer Science

Finally, similar to Experiment I, the contribution of each group of features is further analyzed. Controlling for the structure features (G2), Table 11 shows the performance of the SVM and RF models with different groups of features in terms of mean AUC-PR and AUC-ROC, their rankings (in parentheses), and the average rank. The baseline model based on the structure features (G2) achieves a mean AUC-PR of about 0.2294 and 0.3010, and an AUC-ROC of about 0.7136 and 0.3653. The CIM model based features (G1) rank first, which again confirms their effectiveness for identifying important citations. The separate-citation based feature (G3) and the author-overlap based features (G4) rank second. Nevertheless, the semantic relevancy based feature (G6) performs even worse than the baseline model. Compared with Dataset I, the features generated from the generative model contribute more in Dataset II, while the contribution of the author-overlap based features drops slightly. As can be seen, the feature groups contribute differently to the classification of important citations in the two datasets, which again may be caused by the differences in the criteria that determine important citations.

Table 11 The performance of SVM and RF models with different groups of features in Dataset II in terms of mean AUC-PR, AUC-ROC, and their ranks

Discussion

To validate our speculation about different citation patterns across disciplines, the top four disciplines by data volume are chosen (cf. Table 4): Computer Science, Genetics, Biophysics, and Ecology. In each of these disciplines, the number of citing papers is over 5 and the number of citing-cited paper pairs is more than 200. Table 12 lists the distribution of important and incidental citations in these four disciplines, and Table 13 shows the AUC-PR and AUC-ROC values of SVM and RF in each discipline. Two interesting phenomena can be observed. (1) Performance varies between disciplines, and the performance on Computer Science and Biophysics exceeds that on the other disciplines as well as on the whole of Dataset II. (2) The larger the ratio of important citations, the better the performance (cf. Tables 12, 13). In more detail, the four disciplines can be ranked by performance as follows: Biophysics > Computer Science > Genetics > Ecology. In fact, compared to Computer Science, the ratios of important citations in the other three disciplines are very close to one another, yet their performance varies greatly in terms of AUC-PR. Hence, the different citation patterns across disciplines appear closely related to the differences in performance.

Table 12 Statistics of citations for disciplines of Computer Science, Genetics, Biophysics and Ecology in Dataset II
Table 13 The performance of supervised learning strategy on the disciplines of Computer Science, Genetics, Biophysics and Ecology in Dataset II

However, the performance on the Computer Science discipline is still far from that on Dataset I. We therefore further examine the different annotation guidelines of the two datasets. More specifically, we want to determine whether there exist references that were non-essential to the authors themselves but are actually very important for knowledge diffusion. For this purpose, an article with id = Z002 is randomly selected from Dataset II. For convenience, its title and some of its references are reported in Table 14. In total, this article cites 18 documents, among which [14] is annotated as the most essential one by its author. Hence, [14] is assigned to the important class under our framework, leaving the other references in the incidental class. However, after analyzing the citation contexts and citation features of the other references, [2] and [8] also turn out to be very important to the citing paper Z002. Table 15 shows the citation features of several references in Z002: [2] and [8] are mentioned several times and appear in more sections, and from the citation contexts it is clear that the citing paper Z002 actually uses or refers to the methods proposed in [2] and [8]. Therefore, [2] and [8] are also very meaningful to the citing paper Z002. To summarize, because the annotation guideline of Dataset II differs from that of Dataset I, the non-essential references in Dataset II are actually a mixture of important and non-important citations.

Table 14 The citing paper Z002 and its partial references in Dataset II
Table 15 Citation features of several references in the citing paper Z002

Conclusions

Effectively identifying important citations has great practical significance in bibliometrics, as it can contribute to the fair distribution of academic resources and the fair evaluation of talents. Supervised learning methods are the most commonly used for this task and rely on a large amount of labeled instances to maintain performance. However, only a small volume of instances has been manually annotated in real-world applications, which greatly limits the generalizability of supervised learning techniques for this task.

In this paper, a semi-supervised self-training strategy is proposed to identify important citations by leveraging labeled and unlabeled instances to improve the performance and generalization ability of supervised learning models. To demonstrate the effectiveness of the proposed strategy, two different types of datasets are used: (1) an expert-labeled dataset from one domain (Valenzuela et al., 2015), and (2) an author-labeled dataset from multiple disciplines (Zhu et al., 2015). Through semi-supervised self-training on the unlabeled portion of the expert-labeled dataset, the performance of the SVM model improves from 0.9287 to 0.9622 in terms of mean AUC-ROC and from 0.7628 to 0.8102 in terms of mean AUC-PR, and that of the RF model from 0.9798 to 0.9841 and from 0.9056 to 0.9248, respectively. This outperforms the benchmark methods of Valenzuela et al. (2015), Hassan et al. (2017), Hassan et al. (2018a) and Wang et al. (2020), demonstrating the effectiveness of our semi-supervised self-training strategy for important citations identification. Additionally, the CIM model based features, the structure based features and the author-overlap based features contribute greatly to important citations identification.

As for the author-labeled dataset from multiple disciplines, the semi-supervised learning models perform better than their supervised counterparts in terms of AUC-PR when the proportion of labeled instances is less than 20%, while their performance may decline as the proportion of labeled instances increases further. The main reason may be ascribed to the varied citation patterns across disciplines and the different annotation guideline followed by this dataset. These speculations have been verified by our further experiments on four main disciplines and an in-depth analysis of the citation contexts of non-essential references. Furthermore, the difference in the criteria that determine important citations causes the feature groups to contribute differently to classification performance in the two datasets. In addition, we argue that AUC-PR should be preferred when a multi-disciplinary dataset has an extremely skewed distribution.

In future work, the potential of deep learning models with the semi-supervised strategy should be exploited for identifying important citations. Additionally, our semi-supervised framework still needs to be further verified on datasets from multiple disciplines in the near future.