Introduction

The main goal of the CL-SciSumm 2017 Shared Task (Jaidka et al. 2017) was automated summarization of scientific articles from the computational linguistics domain. We are given a reference document D to be summarized along with a set of citances \(C_D\)—each sentence \(c \in C_D\) cites document D. The goal is to create a summary of D that is driven by the citances in \(C_D\). While there has been considerable research in single-document summarization techniques (Barrera and Verma 2012; Gambhir and Gupta 2017; Verma and Lee 2017), this task evaluates the role of citances in generating informative summaries of a paper. This is interesting since a citance can give considerable insight into the purpose and content of the scientific article being summarized, from the viewpoint of the person(s) citing the paper.

For CL-SciSumm 2017, the entire shared task has been split into three subtasks. Given a reference document D and a set \(C_D\) of citances for D:

  • Task 1A: For each citance \(c \in C_D\), extract the span of reference text,Footnote 1 SR(c), that provides the most information about the citation.Footnote 2

  • Task 1B: Classify each SR(c) according to a predefined set of facets: Aim, Method, Hypothesis, Implication, and Results.

  • Task 2: Generate a summary of at most 250 words for D based on all the SR(c)’s (\(c \in C_D\)).

We participated in the shared task and our initial results are presented in Karimi et al. (2017). In this paper, we expand upon our methods and experiments and study the problems and datasets in more detail. We present several methods for Tasks 1A and 1B and a simple approach for Task 2. We evaluate the proposed methods on a dataset of scientific articles from the Computational Linguistics domain. This dataset contained 30 documents for training and 10 documents for testing. A detailed description of this dataset is given in “Datasets” section.

For Task 1A, to identify SR(c), we experimented with a number of approaches: structural correspondence learning (SCL), positional language model (PLM), and textual entailment (TE) with two entailment systems.

Each approach returns a score for each sentence in D. The sentences in D are then ranked in non-ascending order by their scores and the top three sentences are selected as SR(c). It is challenging to extract just three sentences relevant to a citance from an entire document consisting of hundreds of sentences. Therefore, we also used combinations of the basic approaches and a learning-to-rank approach to retrieve the best set of sentences to construct SR(c).

We employ the SCL technique to learn a joint representation of the domains represented by \(C_D\) and the sentences of D. The second method is a positional language model that leverages proximity information in D to modulate relevance, given a citance \(c \in C_D\). We also studied the degree of textual entailment between citance c and each sentence s from D—a positive entailment between c and s may imply that \(s \in {\hbox {SR}}(c)\). Finally, we re-ranked the top sentences extracted by these systems to obtain the most relevant ones; LambdaRank (Burges 2010) proved to be one of the best ranking algorithms.

For the facet classification task (Task 1B), we present two methods: a rule-based method augmented by WordNet expansion, and a machine learning method using five classifiers: SVMs, random forests, decision trees, multi-layer perceptrons, and AdaBoost. TFIDF features are used to train the classifiers.

Our approach to Task 2 is simply to sort all the sentences in all the SR(c)’s (for all \(c \in C_D\)) in the order in which they appear in the document, and then truncate to 250 words.

On Task 1A, the performance of our methods differed considerably on the training and test sets. This provided yet another motivation for us to conduct an analysis of the training and test sets. We believe that this analysis is of independent interest as well.

The rest of the paper is organized as follows. “Preliminaries” section presents the definitions and background for the paper. In “Related work” section we present the relevant related work. “Task 1A: Reference span detection”, “Task 1B: Facet detection” and “Task 2: Summary generation” sections present our methods for Tasks 1A, 1B, and 2 respectively. “Datasets” section describes the dataset for CL-SciSumm 2017, our analysis of its characteristics, and our results. “Discussion” section gives our perspectives on the results and “Conclusion” section concludes the paper.

Preliminaries

This section gives a brief description of the terms that have been used throughout the paper.

Cosine similarity A similarity measure between two non-zero vectors A and B is given by the cosine of the angle between them, say \(\theta\). Equation 1 gives the formula for calculating cosine similarity.

$$\begin{aligned} { Similarity}(A,B) = \cos \,\theta = \frac{A \cdot B}{\Vert A\Vert {\cdot } \Vert B\Vert } \end{aligned}$$
(1)

TFIDF Term frequency-inverse document frequency (Manning et al. 2008) is a popular term weighting method used for selecting the important words across a corpus of textual documents. It rewards words that are frequent within a document and penalizes words that appear across all documents. The method was originally designed for retrieving documents from a corpus rather than extracting sentences from a document.

For the purpose of our task, we adjust the metric to operate at sentence-level granularity within a document instead of document-level granularity within a corpus. In other words, we use the frequencies of words in a sentence, and the “corpus” in our scenario is the entire document. The term frequency (\(tf_{w_{i}}\)) is the frequency of the word (\(w_{i}\)) in the sentence. The inverse “document” frequency (\({ idf}_{w_{i}}\)) is calculated using the number of sentences that contain the word (\(w_{i}\)) in the same document. A sentence S is represented as a vector of TFIDF weights, as given by Eq. 2, where N is the number of sentences in the document containing S and \(df_{w_i}\) denotes the number of sentences containing \(w_i\) in that document.

$$\begin{aligned} S&= \langle s_{1}, s_{2}, s_{3}, \ldots , s_{n}\rangle \quad {\hbox {where}}\quad s_{i} = tf_{w_{i}} \cdot { idf}_{w_{i}} \end{aligned}$$
(2)
$$\begin{aligned} { idf}_{w_{i}}&= \log (N/df_{w_{i}}) \end{aligned}$$
(3)
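To make the sentence-level weighting concrete, the following is a minimal Python sketch of TFIDF vectors built per Eqs. 2 and 3 and ranked by cosine similarity (Eq. 1). The whitespace tokenization, the toy sentences, and the citance are illustrative only; they are not the preprocessing or data used in our experiments.

```python
import math
from collections import Counter

def sentence_tfidf_vectors(sentences):
    """Sentence-level TFIDF: tf is the count of a word in the sentence,
    idf = log(N / df), where N is the number of sentences in the document
    and df is the number of sentences containing the word (Eqs. 2-3)."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    idf = {w: math.log(n / df[w]) for w in df}
    vectors = [{w: tf * idf[w] for w, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity (Eq. 1) between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Rank the sentences of a reference document against a citance (toy data).
doc_sentences = ["we propose a new parsing model",
                 "the model is evaluated on treebank data",
                 "results improve over the baseline"]
citance = "their parsing model improves over the baseline"
vectors, idf = sentence_tfidf_vectors(doc_sentences)
citance_vec = {w: tf * idf.get(w, 0.0)
               for w, tf in Counter(citance.lower().split()).items()}
top3 = sorted(range(len(doc_sentences)),
              key=lambda i: cosine(citance_vec, vectors[i]), reverse=True)[:3]
print(top3)  # indices of the top-3 candidate reference sentences
```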

LDA Latent Dirichlet allocation (LDA) (Blei et al. 2003) is a generative model used for topic detection in a document. The topic proportions of a document are drawn from a symmetric Dirichlet distribution, while each topic is a multinomial distribution over the terms of the corpus. The parameters learned from the corpus can therefore be used to infer the topic distribution of new text—thus generating the topics of the document. We refer to Eq. 4, where each component \(s_i\) is the membership of a sentence S in \({\hbox {topic}}_{i}\). The topic membership vectors are compared using cosine similarity.

$$\begin{aligned} S = \langle s_{1}, s_{2}, s_{3}, \ldots , s_{n}\rangle \quad {\hbox {where}}\quad s_{i} = P(S \, \in \, {\hbox {topic}}_{i}) \end{aligned}$$
(4)
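The sketch below shows how such topic-membership vectors (Eq. 4) can be obtained with Scikit-learn's LDA implementation. The toy corpus, the number of topics, and the example sentences are placeholders; in our experiments the LDA models are trained on an ACL Anthology corpus, as described later.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in our experiments the LDA model is trained on ACL Anthology text.
corpus = ["statistical machine translation with phrase based models",
          "dependency parsing with neural network models",
          "word alignment methods for machine translation"]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)

# Topic-membership vectors (Eq. 4) for a citance and a candidate sentence,
# compared with cosine similarity as for the TFIDF vectors.
citance = ["they use phrase based translation models"]
candidate = ["our phrase based system improves translation quality"]
v_c = lda.transform(vectorizer.transform(citance))
v_s = lda.transform(vectorizer.transform(candidate))
print(cosine_similarity(v_c, v_s)[0, 0])
```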

\(F_{1}\)-score \(F_{1}\)-score is the harmonic mean of precision and recall.

Precision is the proportion of returned results that are correct, and recall is the proportion of all correct results that were returned. Our system outputs the top 3 sentences and we compute recall, precision, and \(F_{1}\)-score using these sentences. If a relevant sentence appears in the top 3, then it counts towards recall, precision, and \(F_{1}\)-score. Whenever we present the \(F_{1}\)-score on a set of documents, we calculate it through micro-averaging, i.e. averaging over all instances, instead of averaging the \(F_{1}\)-scores obtained for each document.
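As an illustration, the following sketch computes micro-averaged precision, recall, and \(F_1\) for top-3 predictions; the sentence identifiers and example counts are hypothetical.

```python
def micro_prf(predicted_top3, gold):
    """Micro-averaged precision, recall and F1 over all citances.
    predicted_top3: dict citance_id -> set of predicted sentence ids (top 3)
    gold:           dict citance_id -> set of annotated reference sentence ids"""
    tp = sum(len(predicted_top3[c] & gold[c]) for c in gold)
    n_pred = sum(len(predicted_top3[c]) for c in gold)
    n_gold = sum(len(gold[c]) for c in gold)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One citance, three predicted sentences, two gold sentences, one overlap.
print(micro_prf({"c1": {3, 7, 12}}, {"c1": {7, 25}}))  # ~(0.33, 0.50, 0.40)
```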

SVM Support vector machines (SVMs) (Cortes and Vapnik 1995) are discriminative, supervised classifiers that separate data instances linearly, even in high-dimensional spaces. We use support vector machines for our machine learning based approach in facet detection (“Machine learning approach” section). The SVM model was trained on the training set of reference documents and tested on the given test set. We use the Scikit-learn Python library for the implementation.

RandomForest RandomForest (Breiman 2001) constructs a multitude of decision trees and uses majority voting across the outputs of the individual trees to make the final class decision. The decision trees are usually constructed using a random subset of features from the entire feature list. We use the Python Scikit-learn library to build our RandomForest classifier for the supervised facet classification in “Machine learning approach” section. In the machine learning experiments for Task 1B, the default parameter values are used: the number of trees in the forest is 10, Gini impurity is used to measure the quality of a split, and nodes are expanded until all leaves are pure or contain fewer than two samples.

Decision trees Decision tree (Quinlan 1986) is used as a non-parametric supervised classifier to create a robust model that predicts the class of a test instance by learning simple decision rules inferred from the given set of attributes. Similar to the previous machine learners, we use decision trees for facet classification in “Machine learning approach” section. The Gini impurity is used to measure the quality of a split.

MLP Multi-layer perceptron (MLP) (Bishop 1995) is a supervised neural learning algorithm. It differs from a simple perceptron in that, between the input and the output layer, there can be one or more non-linear layers, called hidden layers. The system learns a pattern using a feedforward network of neurons and a supervised technique called backpropagation, for calculating weights of the connections. The MLP has been used as a classifier for identification of the facet of the reference text span from the given list (“Machine learning approach” section). We use the default architecture of a single hidden layer with 100 neurons.

AdaBoost Adaptive boosting (AdaBoost) (Freund and Schapire 1997) is a supervised boosting machine learner, which combines several ‘weak learners’ to improve their performance. The final predicted class is given by a weighted sum of the outputs of the learners used, and the resulting boosted learner often proves to be a strong classifier. AdaBoost is implemented using the Scikit-learn Python library in our facet classification system. The decision tree classifier is used as the weak learner in our experiments for facet detection, and the maximum number of estimators at which boosting is terminated is set to 50.

The next section describes the related work, gives an overview of the shared task and the performance of the participating teams, and ranks the documents based on a measure of their difficulty.

Related work

In Moraes et al. (2017), we have provided an extensive review of the literature on citance-based reference span identification. Citations are considered an important source of information in many text mining areas (Elkiss et al. 2008). For example, citations can be used in summarization to improve a summary (Nanba et al. 2000). It is thought that citations embody the community’s perspective on the content of the cited paper (Nakov et al. 2004).

In Qazvinian et al. (2013), the authors illustrate the importance of citations for summarization purposes. They generated summaries from three sources of information: only the reference article, only the abstract, and only the citations. They showed that citations produced the best results. In another study, Mohammad et al. (2009) also showed that the information from citations is different from that which can be gleaned from just the abstract or reference article. However, there is one caveat, viz., citations often focus on very specific aspects of a paper (Elkiss et al. 2008).

Facet identification is another task tackled by the participating teams in the CL-SciSumm shared tasks. In the CL-SciSumm 2016 shared task, a feature engineering approach was proposed by one of the participating teams (Lu et al. 2016) to solve the problem. They define a set of features including lexical features such as tfidf, the similarity between the topic distributions of citation and candidate reference spans, the concept similarity between citation and candidate reference spans using WordNet, and sentence importance. They then apply three different classifiers, namely Naïve Bayes, decision tree and support vector machine, to identify the facet. A decision tree is also employed by another team in CL-SciSumm 2016 (Cao et al. 2016), which uses tfidf vectors as features. We have also used tfidf vectors as features in our classification methods for this task. In Pramanick et al. (2017), the authors propose a new method based on the cosine similarity between each candidate sentence vector and each facet’s bag of words. A different approach to Task 1B is proposed in Ma et al. (2017), which builds a dictionary for each facet and assigns the reference span to the facet whose dictionary contains any of the reference span words. Neural networks (Prasad 2017), majority voting (Felber and Kern 2017) and convolutional neural networks (Lauscher et al. 2017) are some other approaches proposed by CL-SciSumm 2017 participants for Task 1B. In addition to the classification methods, we have also proposed three variants of a rule-based method, which employs WordNet expansion to identify the facets.

CL-SciSumm 2017

We briefly describe the variety of techniques used by the participating teams in the CL-SciSumm 2017 Shared Task (Jaidka et al. 2017). A total of nine teams participated in Task 1, and a subset of five teams also submitted runs for Task 2. Based on the CL-SciSumm 2017 overview (Jaidka et al. 2017), the top three best-performing teams for Task 1A under the sentence-overlap \(F_1\) metric were NJUST (Ma et al. 2017), TUGRAZ (Felber and Kern 2017) and CIST (Li et al. 2017). Based on ROUGE \(F_1\) scores, the top three teams for Task 1A were NJUST, TUGRAZ and UHouston.Footnote 3 For Task 1B the top three performing teams were CIST, PKU (Zhang and Li 2017), and NJUST. We summarize below the techniques used by the teams CIST, NJUST, TUGRAZ, PKU, and UPF (AbuRaed et al. 2017), which did well on Task 2.

The CIST system proposed in Li et al. (2017) calculates similarity values, including Jaccard similarity, context similarity, and idf similarity, between reference text and citances. The final results for Task 1A are based on a combination of similarity scores using methods such as fusion, majority voting, Jaccard Cascade and Jaccard Focused. For Task 1B, they explored better features and tried three methods: rule-based, SVM and fusion. For Task 2 they used Determinantal Point Processes with a linear combination of five types of features that had been previously used. The UPF system (AbuRaed et al. 2017) uses majority voting across multiple distance-based metrics to select the most relevant citance and reference text pairs, and a multi-class classification system for the facet identification task.

The majority voting result from an ensemble of classifiers (linear SVM, SVM with a radial basis kernel function, logistic regression, decision tree) is used for identification of reference and citance spans in the NJUST system (Ma et al. 2017). The authors maintained a dictionary of related words for each discourse facet. For Task 1B, a reference sentence is assigned a facet if it contains any word from the dictionary of that particular facet. The proposed system uses bisecting K-means clustering to generate the summaries for a particular reference document.

For Task 1A, the PKU system (Zhang and Li 2017) uses a combination of sentence-level and character-level tfidf scores as well as Word2Vec based similarity values as features to a logistic regression classifier. TUGRAZ (Felber and Kern 2017) proposed a query-based retrieval system where the reference spans are treated as an index and citance acts as the query. For a given citance query, the relevant reference text is chosen depending on the results of a ranking algorithm.

Participating systems’ performance

In this section, the performance of all participating systems in CL-SciSumm 2017 is reported with the \(F_1\) score. The plots in this section are based on the workshop proceedings reports of the systems’ performance (Jaidka et al. 2017). However, note that the papers were reviewed and revised after the workshop for the proceedings, so the reported results in these papers might not match the results obtained at the competition stage. Note also that the performance we report for Task 1B in this section follows the convention of the shared task: for a correct facet classification to count, the system must have retrieved the correct reference span during Task 1A. Thus, Task 1A performance acts as an upper bound on Task 1B performance.

Fig. 1 Systems’ performance on Task 1A with sentence overlap \(F_1\) metric

Fig. 2 Systems’ performance on Task 1B with \(F_1\) metric

Fig. 3 Systems’ performance on Task 2 (summaries vs. abstracts) with ROUGE-2 \(F_1\) metric

We examine the top performing systems on each of the subtasks to analyze the current best-performing techniques for that subtask. In Fig. 1, NJUST is the winner followed by TUGRAZ and CIST. NJUST (Ma et al. 2017) used ensemble learning for identification of reference text based on similarity-based, rule-based and position-based features extracted from the reference text as well as the citance. CIST (Li et al. 2017) also makes use of similarity scores for Task 1A.

For Task 1B (Fig. 2), CIST (Li et al. 2017) performs the best; they also had a fusion method for this task. A closer examination of their methods is needed to confirm whether the fusion method indeed had the best score for this task; our reading of their workshop paper was inconclusive on this point. NJUST (Ma et al. 2017) and TUGRAZ (Felber and Kern 2017) perform almost identically on this task. If we look at the summaries of their approaches for this task in Jaidka et al. (2017), the methods do look similar. Both of them use an index (called a dictionary in NJUST) of reference text along with the facets; then, based on the citance words and which facet(s) in the index contain those words, they identify the citance’s facet. A deeper examination of their papers is needed to confirm this.

For Task 2 (Fig. 3), CIST (Li et al. 2017) used a combination of pre-processing techniques that included: document merging, sentence filtering, etc., followed by feature extraction using topic modeling (hLDA) and title similarity. In the final step, the system uses Jaccard similarity for redundancy elimination across chosen reference sentences and Determinantal Point Processes for diverse yet structured summary generation.

Reference documents difficulty for Task 1A

Since all teams participated in Task 1A, we now compare the 10 reference documents in the Test Set based on the teams’ performance (\(F_1\) scores) on each reference document on this task.

For this purpose, all systems’ runs for each reference document are sorted based on their sentence overlap \(F_1\) scores, and the top-ranked run of each system is selected to represent the system’s performance on that reference document. The variance of the systems’ best \(F_1\) scores for each reference document is shown in parentheses in Fig. 4 and also in Table 1. According to Fig. 4, based on the median of each box plot, ‘W09-0621’ is the easiest reference document in the test set and ‘P07-1040’ is the most difficult one for the participating systems.

Fig. 4 Participating systems’ best runs’ sentence overlap \(F_1\) scores for Task 1A by reference document. The variance is reported in parentheses

Table 1 The variance of the systems’ best \(F_1\) scores for each reference document

Task 1A: Reference span detection

Our methods for Task 1A include: positional language model, structural correspondence learning, textual entailment and refinements of methods we presented in Moraes et al. (2017) and Moraes et al. (2016). The methods that we refined include TFIDF (Salton and Buckley 1988).

Text preprocessing We employed some pre-processing steps for cleaning the text in the provided datasets. We remove the contents of parentheses, such as author names, along with special charactersFootnote 4 (like @, #, etc.), which provide little information in this context. We also skip reference sentences with little or no textual content, e.g., sentences comprising a single character or word, which result from the sentence segmentation in the provided datasets.
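The snippet below is a minimal sketch of this kind of cleaning; the exact rules and the minimum-length threshold used in our experiments may differ.

```python
import re

def preprocess(sentence, min_words=2):
    """Drop parenthesized content (e.g. author names), strip special
    characters, and discard sentences left with almost no textual content.
    min_words is an illustrative threshold, not the exact cutoff we used."""
    s = re.sub(r"\([^)]*\)", " ", sentence)    # remove contents of parentheses
    s = re.sub(r"[^A-Za-z0-9\s]", " ", s)      # keep alphanumerics only
    s = re.sub(r"\s+", " ", s).strip()
    return s if len(s.split()) >= min_words else None

print(preprocess("Smith et al. (2005) report 85% accuracy @ test time."))
# -> "Smith et al report 85 accuracy test time"
```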

Positional language model approach

Positional language models (PLMs) employ proximity information in documents during the retrieval process (Lv and Zhai 2009). We used them with the goal of retrieving better results in response to a query, with citances considered as queries and the reference spans as the results of the queries. In this approach, a separate language model is defined for each (word) position in the document. The PLM of document d at position i is estimated as follows:

$$\begin{aligned} p(w|d,i)=\frac{c^{\prime } (w,i)}{\sum _{w^{\prime } \epsilon V} c^{\prime } \left( w^{\prime },i\right) }, \end{aligned}$$

wherein V denotes the vocabulary and \(c^{\prime }(w,i)\) is the propagated count of word w at position i from all of its occurrences in the document.

As shown in the formula above, the weight of each term in a PLM is estimated based on two factors: its frequency in the document and its distance from the position for which the positional language model is built. In other words, the weight of each term in a PLM is the propagated count of that term to that position using a propagation function (also called a proximity-based kernel) (Lv and Zhai 2009). The Gaussian kernel, Triangle kernel, Cosine kernel and Circle kernel are four types of propagation functions that can be used to estimate the term weights in PLMs. As an example, the following is a Gaussian kernel.

$$\begin{aligned} k(i,j)=\exp \left[ \frac{-(i-j)^{2}}{2\sigma ^{2}}\right] \end{aligned}$$

In other words, in the PLM built for position i, the count of each word at position j is weighted by \(k(i,j)\). The parameter \(\sigma\) specifies the propagation scope of each word. We used the default value for this parameter.

The total propagated count of word w at position i from the occurrences of w in all the positions is computed according to the following formula:

$$\begin{aligned} c^{\prime }(w,i) = \sum _{j=1}^{N} c(w,j)k(i,j) \end{aligned}$$

Here N is the length of the document and \(c(w,j)\) is the count of term w at position j in the document; in other words, if w occurs at position j, \(c(w,j)\) is 1, otherwise 0. After building the PLMs for all of the positions in the document, a position-specific retrieval score can be computed for each position in response to the query by computing the similarity between the language model of the query and the PLM of that position using the KL-divergence formula (Kullback and Leibler 1951), a standard way of measuring the distance between probability distributions. These position-specific retrieval scores can be combined into an overall retrieval score for the document through different strategies. For instance, using the best position strategy, the final retrieval score of the document is the score of its best matching position.

We now explain the application of PLM to Task 1A. In Task 1A, each reference sentence is considered as a document and each citance is treated as a query. Based on these assumptions, we can use any retrieval method to find the most relevant documents (reference sentences) for the query (citance) to solve Task 1A. When using the PLM approach as the retrieval method, a separate language model is constructed for each position of the reference sentence, and then, based on the similarity between the positional language models and the citance’s language model, the similarity score of the reference sentence with the citance can be computed. As mentioned above, the elements of a positional language model (PLM) are the propagated counts of all words within the reference sentence, which are estimated using a propagation function. With this scheme, the closer a word is to the position, the higher its weight in the PLM. Therefore, according to the formula above, the PLM of reference sentence rs at position k is estimated as follows:

$$\begin{aligned} p(w|rs, k)=\frac{c^{\prime } (w,k)}{\sum _{w^{\prime } \epsilon V} c^{\prime } (w^{\prime },k)} \end{aligned}$$

wherein V denotes the vocabulary of our collection and \(c^{\prime }(w,k)\) is the propagated count of word w at position k from all of its occurrences in the reference sentence. Ultimately, the PLM of each position in the reference sentence is compared with the language model of the citance using KL-divergence to acquire a position-specific similarity score as follows:

$$\begin{aligned} S(q,d,i) = -\sum _{w\epsilon V}p(w|q)\log {\frac{p(w|q)}{p(w|d,i)}} \end{aligned}$$

where p(w|q) is the language model of the citance q, \(p(w|d,i)\) denotes the positional language model of reference sentence d at position i, and \(S(q,d,i)\) is the similarity score between position i in the reference sentence and the citance. These scores are then used to find the final similarity score of the reference sentence (as a document) in response to the citance (as a query). Therefore, we can apply the PLM approach as a retrieval process which aims at finding the most relevant reference sentences in response to each citance. The motivation for using PLM for this task is that reference sentences in which the words occurring in the citance appear close to each other are more likely to be relevant to the citance.

In this paper, the PLM implementation released by the authors of Lv and Zhai (2009) is used. In this experiment, the best position strategy is employed for finding the sentence’s score based on position-specific scores and a Gaussian kernel is used as the propagation function. Furthermore, the positional language models are smoothed using the Dirichlet prior smoothing method. The pre-processing steps employed in this experiment are: (1) stopwords are removed from both citances and reference sentences; (2) all special characters are removed, so only alphanumeric characters remain, and parentheses/braces/etc. are handled by removing their internal contents; (3) sentences longer than 70 terms or shorter than eight terms are removed.
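For concreteness, the following is a self-contained sketch of PLM scoring with a Gaussian kernel, Dirichlet-style smoothing, and the best-position strategy. It is not the Lv and Zhai (2009) implementation we actually used; the kernel width sigma, the smoothing parameter mu, and the background model are simplifications made for illustration.

```python
import math
from collections import Counter

def plm_score(citance_tokens, sentence_tokens, sigma=25.0, mu=10.0):
    """Best-position PLM score of a reference sentence for a citance:
    Gaussian-propagated counts c'(w, i), Dirichlet-style smoothing, and
    negative KL-divergence between the citance model and each PLM."""
    vocab = set(citance_tokens) | set(sentence_tokens)
    # Background model, approximated here by the union of both texts so that
    # every citance term receives some smoothing mass.
    bg = Counter(sentence_tokens) + Counter(citance_tokens)
    bg_total = sum(bg.values())
    q = Counter(citance_tokens)                     # citance language model
    q_total = sum(q.values())

    best = -float("inf")
    for i in range(len(sentence_tokens)):
        # Propagated counts c'(w, i) = sum_j c(w, j) * k(i, j).
        prop = dict.fromkeys(vocab, 0.0)
        for j, w in enumerate(sentence_tokens):
            prop[w] += math.exp(-((i - j) ** 2) / (2 * sigma ** 2))
        z = sum(prop.values())
        # Smoothed positional language model p(w | d, i).
        plm = {w: (prop[w] + mu * bg[w] / bg_total) / (z + mu) for w in vocab}
        # S(q, d, i): negative KL-divergence, higher means more similar.
        score = -sum((q[w] / q_total) * math.log((q[w] / q_total) / plm[w])
                     for w in q)
        best = max(best, score)
    return best
```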

Textual entailment approach

Entailment between two pieces of text can be identified as a directional implicational relationship that holds between them. The goal is to measure the degree to which a text fragment can be inferred from another. The pair of text fragments consists of (a) Text: The piece of source information used for drawing the inference and (b) Hypothesis: The second fragment, which is to be inferred from the ‘text.’ The task of deriving inference from pairs of text is called Recognizing Textual Entailment (RTE).Footnote 5

The entailment relationship between a pair of text fragments, i.e., the text (t) and the hypothesis (h) can belong to one of the following:

  • Positive: When the Text can “prove,” i.e., provide strong evidence for, the Hypothesis being True. Thus, the Hypothesis is entailed from the Text. For example, the following pair of text fragments demonstrates positive entailment:

Text (t): The cat ate the fat rat.

Hypothesis (h): The cat is likely not hungry.

  • Negative: When the Hypothesis can be disproved by using the Text. This is the inverse of positive entailment. For example, the following pair demonstrates negative entailment:

Text (t): The cat ate the rat.

Hypothesis (h): The rat ate the cat.

  • Neutral: When no relation exists between the two text fragments—the pair is unrelated. For example, the following pair demonstrates neutral entailment:

Text (t): The cat ate the rat.

Hypothesis (h): The cat and dog are enemies.

Thus, the property of textual entailment between two pieces of text is True when the information contained in one text fragment is directly or indirectly derived from the other text fragment.

One of our approaches for Task 1A uses a measure of entailment to extract reference sentences SR(c) relevant to a given citance c.

In a textual pair used for measuring textual entailment, we use the given citance c as the text and a reference sentence s from document D as the hypothesis to find SR(c). For the calculation of textual entailment, we use two state-of-the-art RTE systems: TIFMO (Dong et al. 2014; Tian et al. 2014) and a deep learning model (Zhao et al. 2016).

Textual entailment system A: TIFMO

We use the Textual Inference Forward-chaining Module (TIFMO) (Dong et al. 2014; Tian et al. 2014) to measure textual entailment (TE) between a citance and a sentence from the reference document. TIFMO was chosen as a baseline for our TE approach because it is one of the few systems that is state-of-the-art, publicly available with good documentation, and easy to set up.Footnote 6 TIFMO uses Dependency-based Compositional Semantics (DCS) (Tian et al. 2014) trees to represent a text body. The system derives an inference for entailment prediction by considering logic-based relations between ‘abstract denotations’, or relational expressions, generated from the queries in the DCS trees. A further improvement to the system was proposed in Dong et al. (2014), where Generalized Quantifiers (GQs) present in the text are taken into account to evaluate lexical and/or syntactical relations between pairs of sentences (text and hypothesis) to predict the presence of entailment and also the type of entailment.

Input generation TIFMO reads its inputs in the form of XML-formatted files of text and hypothesis pairs. During evaluation we found TIFMO to be relatively slow (an average of 5 h for processing a set of 100 citance-reference text pairs) compared to our other methods. Therefore, the input to TIFMO (Dong et al. 2014; Tian et al. 2014) was restricted as follows:

  (a) We select the top 50 relevant reference sentences (SR’(c), an over-approximation of SR(c)) ranked by our TFIDF system per citance c per document. We used TFIDF since it had the best recall among our systems for Task 1A.

  (b) For each given citance c (\(c \in C_{D}\)), we generate an XML input file containing 50 pairs of text fragments (c, s), where c is the text and \(s \in {\hbox {SR}}'(c)\) is the hypothesis.

TIFMO evaluation TIFMO does not do well as a reference span detector, as seen in Table 5. On the training set of documents, it has recall, precision and \(F_{1}\) scores of 3.22, 1.68 and 2.21%, respectively. On the test set, it has an \(F_{1}\) score of 1.41%. We compare the TIFMO results with another deep learning entailment system to check whether the results are specific to TIFMO, or whether the issue is with TE and our problem.

Textual entailment system B: TE using deep learning

A recent trend is for textual entailment systems to make use of deep learning. Systems that use deep learning are usually more robust since they make use of soft alignment schemes. Hence, we also test the performance of a deep learning textual entailment system on our task. The details of the system we employ can be found in Zhao et al. (2016). The authors only provide the code, so we had to train our own model. We used the same hyperparameter configuration as the authors, except that we adjusted the learning rate to 0.01 (from 0.001) and the batch size to 128 (from 32). The model was trained on the SNLI corpus (Bowman et al. 2015)—a collection of 570k sentence pairs that were manually labeled for the textual entailment problem. Although this model is better at the textual entailment task, its performance in retrieving sentences relevant to a citance is only marginally better than TIFMO, with sentence-overlap \(F_{1}\) scores of 2.87 and 1.69% on the training and test sets, respectively.

Our results from using TE systems on Task 1A suggest that recognizing textual entailment has little overlap with our task. This could mean annotators rarely take TE into account when selecting sentences. In order to determine if that was the case, we each manually annotated a sample of the dataset (15 citances from the training set and 5 from the test set). Each annotator was given the list of citances and their corresponding reference spans (SR(c)) in order to determine whether entailment occurred. The number of entailments found by each annotator is presented in Table 2. On average, annotators found 5.5 sentences with entailment among the sample of 20, which corresponds to 27.5% of our sample. Annotators A, B, and C submitted detailed annotations, so we report their inter-annotator agreement by way of Cohen’s \(\kappa\) (Pontius and Millones 2011) and Fleiss’ \(\kappa\) (Fleiss 1971) in Table 3. The moderate to low agreement between the annotators highlights the subjective nature of the task. We believe the strict definition of entailment is in large part responsible.

We also asked the annotators to mark the sentences which had negative entailment, i.e. entailment in the opposite direction. Once we integrate these negative entailments into the calculations, nearly all inter-annotator agreement scores improve as can be seen in Table 4. Thus, it might be worthwhile to relax the definition of entailment.

Table 2 Count of entailments occurring in the sample of 20 citances
Table 3 Inter-annotator agreement between A, B, and C
Table 4 Inter-annotator agreement using a relaxed definition of entailment

Structural correspondence learning approach

Structural correspondence learning is a transfer learning method introduced in Blitzer et al. (2006). Our goal with SCL is to learn how to recognize citations and later transfer this expertise towards recognizing reference spans instead. In order to do so, we must select pivot features—these are crucial for the method.

A pivot feature is a feature that is frequent in both domains of interest, such as citances and chosen reference spans. We consider the vocabulary of the union of citances and chosen reference spans that belong to the training set. Words that are frequent in both sets of text are chosen as pivot features. The key to SCL is to predict the occurrence of pivot features from the non-pivot features of an example—we predict the occurrence of a frequent word from the occurrences of the infrequent words in a sentence. For each pivot feature chosen, we learn a different SVM model that predicts whether the pivot feature is present or not, returning a positive or negative label accordingly.

Internally, an SVM has coefficients that determine the importance of a feature for classification. If we collect the coefficients for all the SVM models learned in this manner we can construct a matrix, which can be used to predict all pivot features simultaneously; a convenient linear algebra trick to run each SVM concurrently.

The next step is to reduce the dimensionality of these predictors—this forces generalization. We apply truncated Singular Value Decomposition (SVD) to the coefficient matrix we constructed previously. The joint representation consists of the predicted pivot features (non-pivot features are thrown away after being used for prediction). For our purposes, these new feature vectors are used to rank candidate sentences through the calculation of cosine similarity scores between them and the citance.
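For concreteness, the following is a sketch of this pipeline using Scikit-learn; the numbers of pivots and SVD components, the binary bag-of-words features, and the SVM regularization constant are illustrative choices rather than our exact configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.decomposition import TruncatedSVD

def train_scl(texts, n_pivots=50, n_components=25):
    """Structural correspondence learning sketch: the most frequent terms are
    pivots; one linear SVM per pivot predicts its presence from the non-pivot
    features; the stacked SVM coefficients are reduced with truncated SVD."""
    vec = CountVectorizer(binary=True)
    X = vec.fit_transform(texts).toarray()
    freq = np.asarray(X.sum(axis=0)).ravel()
    pivots = np.argsort(freq)[::-1][:n_pivots]
    nonpivots = np.setdiff1d(np.arange(X.shape[1]), pivots)

    rows = []
    for p in pivots:
        y = X[:, p]                                # pivot present / absent
        if y.min() == y.max():                     # skip degenerate pivots
            continue
        rows.append(LinearSVC(C=0.1).fit(X[:, nonpivots], y).coef_.ravel())
    W = np.vstack(rows)                            # pivot-predictor matrix
    k = min(n_components, max(1, W.shape[0] - 1))
    theta = TruncatedSVD(n_components=k).fit(W).components_
    return vec, nonpivots, theta

def scl_vector(text, vec, nonpivots, theta):
    """Joint-representation vector used for cosine comparison with a citance."""
    x = vec.transform([text]).toarray().ravel()[nonpivots]
    return theta @ x
```

In use, the model is fit on the union of the training-set citances and chosen reference spans, and candidate sentences are then ranked by the cosine similarity between their joint-representation vectors and that of the citance.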

Previous methods

In our previous work, we examined the performance of three different methods: TFIDF, Latent Dirichlet Allocation (Hoffman et al. 2010), and Word Embeddings (Mikolov et al. 2013). Our usage of these methods in this work is changed in one significant respect, which is described below. For a detailed analysis refer to Moraes et al. (2017).

For every citance, we construct a vector where each dimension corresponds to the TFIDF value of a term in the vocabulary of the reference document. Every sentence within the reference document also has its own vector. We compare the vector of every sentence with the citance’s vector to determine the sentences with highest similarity. This comparison is performed by calculating the cosine similarity between vectors as explained in the "Preliminaries" section.

One change from Moraes et al. (2017) is that, in addition to unigrams of words, we also considered bigrams and trigrams as part of the terms of a document or sentence. Whenever we refer to a TFIDF system we will also refer to the range of ngrams that the system uses (for instance, 2:3 for bigrams and trigrams excluding unigrams).

For Latent Dirichlet Allocation (LDA), we trained models on a corpus of documents from the ACL Anthology.Footnote 7 LDA is a method for topic modeling, so it recognizes a number of topics from the corpus. An LDA model is then used to convert sentences to vectors of topic membership. These are compared with cosine similarity as well.

Finally, we learn word embeddings using the same corpus of documents from the ACL Anthology. However, word embeddings are not as straightforward to use for similarity comparisons. A word embedding will give us a vector for each word in the sentence. One option is to calculate an “average” vector. Instead, we decided to use the Word Mover’s distance (Kusner et al. 2015). In essence, given two collections of vectors, we try to align these vectors while moving them as little as possible.
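The sketch below shows the Word Mover's Distance computation with gensim (version 4 API). The toy corpus and hyperparameters stand in for the ACL Anthology corpus and the settings we actually used, and gensim's wmdistance additionally requires its optional earth-mover's-distance dependency.

```python
from gensim.models import Word2Vec

# Toy corpus standing in for the ACL Anthology documents.
corpus = [["statistical", "machine", "translation", "models"],
          ["dependency", "parsing", "with", "neural", "networks"],
          ["phrase", "based", "translation", "systems"]]
model = Word2Vec(corpus, vector_size=50, min_count=1, epochs=50, seed=1)

citance = ["their", "translation", "models"]
candidate = ["phrase", "based", "translation", "systems"]

# Word Mover's Distance between the token lists; lower distance means more
# similar, so candidate sentences are ranked in increasing order of distance.
print(model.wv.wmdistance(citance, candidate))
```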

Method combinations

In this section, we explore the potential of method combinations and how best to combine them.

Linear combination

This method was also employed in Moraes et al. (2017). We take the scores from two different systems and generate new scores using the simple formula:

$$\begin{aligned} \lambda \cdot sys1 + (1 - \lambda ) \cdot sys2 \end{aligned}$$
(5)

To determine the best value for the \(\lambda\) parameter we test different values uniformly along an interval. Whenever we refer to the scores for linear combinations, we shall report only the best-performing system we observed.
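A minimal sketch of this combination (Eq. 5) and the grid search over \(\lambda\) follows; the evaluation callback (e.g. sentence-overlap \(F_1\) on the training set) and the grid resolution are placeholders. The two score lists are assumed to refer to the same candidate sentences and to be on comparable scales.

```python
import numpy as np

def combine(scores1, scores2, lam):
    """Eq. 5: linear combination of two systems' scores for the same sentences."""
    return lam * np.asarray(scores1) + (1 - lam) * np.asarray(scores2)

def tune_lambda(scores1, scores2, evaluate, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the lambda whose combined scores do best under `evaluate`, a
    callback mapping combined scores to a quality number such as F1."""
    return max(grid, key=lambda lam: evaluate(combine(scores1, scores2, lam)))
```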

Filtering

Another simple way to combine the results from two methods is to rank all the sentences according to one method first. Then, we keep only the top N results and re-rank them according to the second method. We must tune N to achieve the best performance. It is interesting to note that low values of N favor the first system’s ranking since the second system is given very little freedom in reordering the top results. All filtering methods had TFIDF as the first system since it was the best individual method on the training dataset. The best filtering system would re-rank the top 5 results from tfidf-1:2 with a word embedding system.

Learning-to-rank

Since we had multiple systems, we opted to combine several through the use of learning-to-rank algorithms. This is a better alternative than trying to tune the previous combination methods for multiple systems. We used a library of learning-to-rank algorithms, RankLib,Footnote 8 to combine the scores generated by the other methods. We construct a modified dataset for use with RankLib. For each citance, we construct three different queries by subsampling the irrelevant sentences in the reference document. Therefore, each query consists of all of the relevant sentences chosen by the annotator and 10 irrelevant sentences chosen at random. This helps emphasize learning the ranking of the relevant sentences.

The scores of the following systems were used in conjunction: tfidf-1:1, tfidf-1:2, tfidf-1:3, tfidf-2:3, word2vec (ACL), word2vec (pretrained GoogleNews), variations on LDA, SCL, TIFMO, and deeplearning TE. These systems were chosen in an ad-hoc manner to provide a diverse set of competing rankings. Even though some of these systems underperform in general, they can occasionally provide better rankings for specific citances. No attempt was made to tune the hyperparameters for the algorithms. The learning-to-rank algorithms will attempt to combine the different scorings given by the different systems into a better ranking.
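The following sketch writes one such training query group in the LETOR-style text format that RankLib reads; the integer query ids, the feature ordering, and the sampling parameters are illustrative assumptions rather than our exact setup.

```python
import random

def write_ranklib_queries(citance_index, sentence_scores, relevant_ids, out,
                          queries_per_citance=3, negatives=10):
    """Emit LETOR-style lines ("<label> qid:<q> 1:<v1> 2:<v2> ...") for one
    citance.  Each query contains every annotator-chosen sentence plus a
    random sample of irrelevant ones; a sentence's feature vector is the list
    of scores assigned to it by the base systems.
    sentence_scores: dict sentence_id -> list of base-system scores."""
    irrelevant = [s for s in sentence_scores if s not in relevant_ids]
    for q in range(queries_per_citance):
        qid = citance_index * queries_per_citance + q
        sample = random.sample(irrelevant, min(negatives, len(irrelevant)))
        for sid in list(relevant_ids) + sample:
            label = 1 if sid in relevant_ids else 0
            feats = " ".join(f"{i + 1}:{v:.4f}"
                             for i, v in enumerate(sentence_scores[sid]))
            out.write(f"{label} qid:{qid} {feats} # sentence {sid}\n")
```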

Among the different algorithms implemented by RankLib, many try to minimize an objective function by gradient descent. Sometimes this function is a list-wise cost such as NDCG,Footnote 9 in the case of LambdaRank. Other times, the algorithm tries to minimize a simpler function such as pairwise errors, in the case of RankNet. Both MART and LambdaMART are methods based on boosted regression trees.

Since learning-to-rank methods had a considerable jump in performance, we had to test whether overfitting was occurring. We perform tenfold cross-validation on the training set and report the results for a variety of learning-to-rank algorithms. The performance gains measured were much more modest in the cross-validation scenario. In addition, the best learning-to-rank algorithm changed from MART (Burges 2010) to Random Forests.

Results

We now report the results on the training and test sets for the methods employed in Task 1A which can be found in Table 5. Precision, Recall, and F1 are three measures used to evaluate and compare the methods. In general, TFIDF still performed well among our systems on the training set. However, once we move to the test set its performance degrades severely. In fact, across the board the performance is worse on the test set.

Looking closer at the discrepancy between the training and test set we can see some interesting behavior: the system based on word embeddings was our most robust system. Although previous work in Moraes et al. (2017) downplayed the importance of word embeddings (because combining them with TFIDF did not improve our performance by much) it seems they may have other advantages. In particular, we suspect the use of the Word Mover’s Distance is what led to this robustness.

Two information retrieval methods, KL-divergence and Okapi, are employed for comparison with PLM. In all three methods, reference sentences are treated as documents and citances as queries. Okapi is a ranking function based on the probabilistic retrieval framework, and KL-divergence is a language modeling retrieval approach, which ranks documents based on the similarity between their language models and the query’s language model using the KL-divergence formula. According to Table 5, PLM and Okapi perform better than KL-div on the training set. However, their performance on the test set is the same. The drop in performance on the test set resulted in several ties, which we scrutinized carefully to eliminate the possibility that errors in evaluation or implementation created the ties.

Table 5 Task 1A scores for individual systems on the 2017 dataset

Task 1B: Facet detection

In Task 1B, for each SR(c), the facet to which it belongs is picked from a predefined set of five facets. We use two approaches for this task: a rule-based approach and a machine learning approach. In both of our approaches, we employ the citance instead of the reference text to identify the facet, based on the assumption that the relation between citances and their reference texts can help the facet identification method find the correct facet.

Rule-based approach

The rule-based approach comprises three sequential steps, each of which is designed to find the right facet through specific comparisons and is invoked only if no match was found in the previous steps. In the first step, citance words are compared with all five facet labels: Method, Implication, Result, Hypothesis and Aim. If none of the words in the citance match a facet label, then we proceed to the second step, in which an expanded form of the citance is compared with the facet labels. The citance is expanded by adding all WordNet synsets (Miller 1995) of each word found in the citance. In the last step, if no matching facet label was found in the previous steps, the facet labels are expanded with their synsets and once again compared with the words in the citance.
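A sketch of these three steps using NLTK's WordNet interface is given below. The lower-casing, the whitespace tokenization, and the fallback to the most frequent facet when no step matches are our illustrative assumptions; the manual exclusion of non-relevant synsets used in variations V1 and V2 (described in the Evaluation subsection) is not shown.

```python
from nltk.corpus import wordnet as wn

FACETS = ["Method", "Implication", "Result", "Hypothesis", "Aim"]

def synonyms(word):
    """Lower-cased lemma names from all WordNet synsets of a word."""
    return {lemma.name().lower().replace("_", " ")
            for syn in wn.synsets(word) for lemma in syn.lemmas()}

def rule_based_facet(citance, fallback="Method"):
    words = {w.lower().strip(".,;:()") for w in citance.split()}
    # Step 1: a citance word matches a facet label directly.
    for facet in FACETS:
        if facet.lower() in words:
            return facet
    # Step 2: expand the citance words with their WordNet synonyms.
    expanded = words.union(*(synonyms(w) for w in words))
    for facet in FACETS:
        if facet.lower() in expanded:
            return facet
    # Step 3: expand the facet labels and compare with the citance words.
    for facet in FACETS:
        if synonyms(facet.lower()) & words:
            return facet
    return fallback  # assumption: default facet when no step matches
```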

Machine learning approach

In this approach, each citance is represented by a feature vector containing the TFIDF values of its terms, i.e., the number of times a term occurs in the citance multiplied by the idf component, which is computed according to Eq. 3. The total number of features used in these experiments is 4663.

After a classification model is learned using the training set, the trained model is used to classify the citances of the test set. The machine learning methods used in this approach are support vector machines (SVMs) (Cortes and Vapnik 1995), random forests (Breiman 2001), decision trees (Quinlan 1986), MLP (Bishop 1995), and AdaBoost (Freund and Schapire 1997).
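A sketch of this setup with Scikit-learn is shown below. The classifiers use the default hyperparameters noted in the Preliminaries; the TFIDF vectorizer here uses Scikit-learn's default idf formula, which differs slightly from Eq. 3, and the citance texts and facet labels are assumed to have been read from the annotation files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

CLASSIFIERS = {
    "SVM": SVC(),
    "RandomForest": RandomForestClassifier(n_estimators=10),
    "DecisionTree": DecisionTreeClassifier(),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,)),
    "AdaBoost": AdaBoostClassifier(n_estimators=50),
}

def evaluate_facet_classifiers(citances, facets, cv=10):
    """Tenfold cross-validation of each classifier on TFIDF features of the
    citances; `citances` is a list of strings, `facets` their facet labels."""
    results = {}
    for name, clf in CLASSIFIERS.items():
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        results[name] = cross_val_score(pipeline, citances, facets, cv=cv).mean()
    return results
```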

Evaluation

The rule-based approach has three variations: (1) Rule_based-V1: all three sets of comparisons (comparing citance words with facet labels, comparing the expanded form of citances with facet labels, and comparing the expanded form of facets with citance words) are performed while the non-relevant synsets of all facets are excluded. To find the non-relevant synsets, we manually investigated all synsets of each facet label in WordNet and excluded those that seemed irrelevant. (2) Rule_based-V2: all three sets of comparisons are performed while only the non-relevant synsets of the “Method” facet are excluded. (3) Rule_based-V3: only the first and second comparisons are performed. Table 6 presents the results of this approach to Task 1B on the 2017 training and test sets. A “Method_only” approach, which assigns “Method” to all of the citances, is also employed for comparison with the rule-based approach.

Table 6 Recall, precision, and \(F_1\) score of rule-based method variations (Task 1B)

As Table 6 shows, the third variation of the rule-based approach outperforms the other variations on both the training and test sets. This means that expanding the facet labels does not help in finding the correct facet label of reference spans when citances are used for the computation.

Furthermore, the higher performance of Rule_based-V2 over Rule_based-V1 shows that excluding non-relevant synsets of the “Method” facet leads to better results. It might be due to the fact that “Method” is the most frequent facet label in both the training and test set for 2017. The results of the Method-only approach also verify this fact.

Table 7 shows the results of Task 1B for the machine learning methods on the training and test sets. For the classification experiments on the training set, two sets of results are reported in Table 7: (1) the results obtained by training the classifier on all 30 documents of the training set and testing on the same 30 documents (similar to the rule-based method results), and (2) the tenfold cross-validation results, in parentheses. For the classification experiments on the test set, the whole training set is used for the learning phase.

Table 7 Recall, precision, and \(F_1\) score of classification methods (Task 1B)

As Table 7 shows, SVM has the best performance among the classification methods on Task 1B, while the decision tree has the lowest. Furthermore, a comparison between the results of Tables 6 and 7 shows that Rule_based-V3 is our best-performing method on Task 1B among all rule-based and classification methods.

Task 2: Summary generation

For the summary generation experiments, the reference span detection results of the five best-performing methods on the training set and the five best-performing methods on the test set are used to evaluate the summarization task. Table 8 reports the union of these two sets, which contains only five methods because of duplicates between the two sets. For each reference document and each method, the summary is built from the reference spans detected by that method and is cut off at the summary length limit of 250 words.
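A minimal sketch of this summary construction is given below; it assumes that sentence indices reflect document order and that the cutoff may truncate the final sentence at the 250-word limit.

```python
def build_summary(selected_indices, document_sentences, limit=250):
    """Concatenate the detected reference spans in document order and truncate
    at the word limit.  `selected_indices` are the indices (into
    `document_sentences`) of all sentences chosen as SR(c) for any citance."""
    words = []
    for idx in sorted(set(selected_indices)):      # document order
        for token in document_sentences[idx].split():
            if len(words) >= limit:
                return " ".join(words)
            words.append(token)
    return " ".join(words)
```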

Table 8 ROUGE-2 scores for summarization methods (Task 2) on the 2017 test set

In Table 8, average precision, recall, and F1 scores are reported for the five best-performing methods on the 2017 test set using the ROUGE toolkit (Lin 2004), specifically ROUGE-2 which counts the overlap of bigrams between the system generated summary and the gold standard. According to Table 8, word embeddings outperform other methods in summarization.

Datasets

In this section, we investigate the differences between the training set and the test set that could possibly account for the loss of performance when going from one to the other. We believe the lower performance can be explained by a larger percentage of challenging instances. First, we review the quantitative characteristics of each set. In addition to statistics such as word counts and facet distributions, we also compare various metrics that try to capture qualitative assessments, such as reading difficulty. Overall, our goal is to determine whether there are metrics that can recognize challenging instances.

Dataset statistics

The dataset for CL-SciSumm 2017 (Jaidka et al. 2017) contains 30 training documents and 10 testing documents, each with multiple citances. We use Scikit-learn to tokenize the sentences. Some statistics, without any preprocessing, about the training dataset (30 documents) are reported below.

  • The total number of sentences is 6700 across all documents and the average is 223.33 per document.

  • The total number of citances is 594 across all documents and the average is 19.8 per document.

  • There are 529 unique sentences chosen as SR(c) across all documents and the average is 17.63 per document.

  • The set contains 139,842 words, of which 121,291 are unique, among the reference documents.

  • The average number of words per document is 4661.4 and the standard deviation is 2546.7 words.

Some statistics about the test dataset (10 documents), also without any preprocessing, are as follows:

  • The total number of sentences is 2012 across all documents and the average is 201.2 per document.

  • The total number of citances is 159 across all documents and the average is 15.9 per document.

  • There are 152 unique sentences chosen as SR(c) across all documents and the average is 15.2 per document.

  • The set contains 40,558 words, of which 10,313 are unique, among the reference documents.

  • The average number of words per document is 4055.8 and the standard deviation is 1101.3 words.

The results of the reference span detection and facet detection methods are quite different on the test set from those on the training set. We analyze this difference by comparing the datasets’ characteristics, including their linguistic and statistical characteristics.

Facet distribution

We evaluate the gold standard annotation files of all citances, c, and reference spans, SR(c), to observe their distribution across the set of predefined facets: Aim, Method, Hypothesis, Implication, and Results. Tables 9 and 10 give details of the distribution across facets for training and test documents.

Table 9 Distribution of reference span facets by citances in training documents of CL-SciSumm’17
Table 10 Distribution of reference spans facets by citances in test documents of CL-SciSumm’17

We observe that there is a significant variation in the distribution of citances (c) across the facets in the training data compared to the test data.Footnote 10 While approximately 96% of the citances \(C_{D,{\mathrm{test}}}\) extracted from the test documents belong to the Method facet, around 69% of the training set citances \(C_{D,{\mathrm{train}}}\) are in the Method facet. Also, while the training set has citances that belong to the Implication and/or Hypothesis facets, the test set has no citances belonging to the Hypothesis or Implication facets. Thus the facet distribution across all \(C_{D,{\mathrm{train}}}\) is clearly unbalanced with respect to the corresponding test set citances \(C_{D,{\mathrm{test}}}\).

Text difficulty level

In this section, we study and characterize the “difficulty” of the test documents versus that of the training set. For this purpose, the Flesch–Kincaid grade level (Kincaid et al. 1975), the SMOG readability index (McLaughlin 1969) and Gunning’s FOG index (Dubay 2004) are employed as three text difficulty measures to compare the test set and training set documents. All three measures are calculated using textstat 0.4.1, a Python package for calculating statistical features from text (Shivam Bansal 2017).

In this experiment, the Flesch–Kincaid grade level, SMOG readability index and Gunning’s FOG index are computed for the citances and their corresponding reference texts of each reference document separately. Tables 11 and 12 show the results of this experiment on the training set and test set, respectively.

Table 11 Flesch–Kincaid grade level, SMOG readability index and Gunning’s FOG index for citances and reference texts of Training documents of CL-SciSumm’17
Table 12 Flesch–Kincaid grade level, SMOG readability index and Gunning’s FOG index for citances and reference texts of test documents of CL-SciSumm’17

In the next experiment, the three measures are computed for unsolved-by-us citances.Footnote 11 Since citances in this problem act as queries in an information retrieval setting, a higher difficulty level of the test set citances compared to the training set citances could explain the lower performance of our methods on the test set. After computing the difficulty measures on the unsolved-by-us citances of the test and training sets, a Mann–Whitney test (Mann and Whitney 1947) is used as a non-parametric statistic to test whether the difference between the difficulty measure values on unsolved-by-us citances of the test set and the training set is statistically significant. For our implementation, we use the SciPy Python library, scipy.stats (Jones et al. 2001–), for statistical evaluations. Using the Mann–Whitney test, we can check whether there is a statistically significant difference between two data groups without requiring the data to be normally distributed. The test works by ranking the data in each group and computing the mean rank for each of them; if the distributions of the groups are identical, the mean ranks will be the same for both groups (Mann and Whitney 1947). The null hypothesis for the Mann–Whitney test in this experiment is that the distributions of the difficulty measure values on unsolved-by-us citances of the test set and the training set are equal.
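The sketch below reproduces the flavor of this test with textstat and SciPy. Current textstat exposes module-level functions (the paper used version 0.4.1, whose import path differs slightly), and the one-sided alternative shown assumes we are testing whether the training-set citances have stochastically lower SMOG scores.

```python
import textstat
from scipy.stats import mannwhitneyu

def smog_scores(texts):
    """SMOG readability index for each citance (or reference text)."""
    return [textstat.smog_index(t) for t in texts]

def training_easier_than_test(train_citances, test_citances, alpha=0.05):
    """One-sided Mann-Whitney test: reject the null hypothesis of identical
    distributions if the training-set SMOG values are stochastically lower."""
    _, p_value = mannwhitneyu(smog_scores(train_citances),
                              smog_scores(test_citances),
                              alternative="less")
    return p_value, p_value < alpha
```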

Table 13 shows the P values of Mann–Whitney test for each of the difficulty measures.

Table 13 P values of the Mann–Whitney test on Flesch–Kincaid grade level, SMOG readability index and Gunning’s FOG index of unsolved-by-us citances of the test set and the training set

The results of the one-sided Mann–Whitney test at the 95% confidence level on the difficulty measures in Table 13 show that the null hypothesis is rejected: the SMOG index of unsolved-by-us citances in the training set is statistically significantly lower than that of the test set. This implies more difficult textual content in the test set, which can lead to lower performance of the methods on the test set than on the training set. To reach a more robust conclusion, the SMOG index is investigated separately in the next experiment. Here, the SMOG index is computed for all citances and their associated reference texts of each set, and the Mann–Whitney test is applied to both sets of values to determine whether the test set has more difficult text than the training set. In this experiment, the null hypothesis is that the distribution of the SMOG index over all citances (and over all reference texts) of the training set and the test set is the same.

Table 14 P values of the Mann–Whitney test on SMOG readability index of all citances and reference texts of the test set and the training set

As shown in Table 14, the Mann–Whitney test at the 95% confidence level on the SMOG values of both citances and reference texts shows that the null hypothesis is rejected: the SMOG values on the training set are statistically significantly lower than those on the test set.

In the next experiment, we compute the absolute differences in SMOG values between citances (c) and their reference spans (SR(c)), i.e., \(|{\mathrm{SMOG}}(c) - {\mathrm{SMOG}}({\mathrm{SR}}(c))|\), for the training set and for the test set. We then apply the Mann–Whitney test to both sets of values to determine whether the differences in the training set are significantly smaller than those in the test set. The null hypothesis in this experiment is that the distribution of the differences in SMOG values is the same for the training and test sets.

Table 15 P value for One-tailed MW test on \(|{\mathrm{SMOG}}(c) - {\mathrm{SMOG}}(SR(c))|\) of training and test sets

As shown in Table 15, the null hypothesis is rejected, which means that the differences between the SMOG values of citances and their reference spans in the training set are significantly lower than the corresponding differences in the test set. This comparison of the relative difference between citances (as queries) and reference spans (as documents) in the test set and the training set helps explain the lower results of the reference span identification methods on the test set.

Misclassifications

We looked at the misclassifications across each dataset for the various systems we tested. For certain citances, most systems found at least one of the reference sentences chosen by the annotators; for others, none were found. We used this information as a proxy for the difficulty of a citance. We show the distribution of “difficulty” across the training and test datasets in Fig. 5.

To generate this data, we calculated how many systems found a correct reference span for each citance. In the test set, 46% of the citances have at least one system that correctly identifies a chosen reference span; in the training set, the equivalent set of citances represents 74% of the total.

Using this misclassification information, we can define new metrics. For instance, the metric \(C_{D,{\mathrm{impos}}}\) is the count of citances from document D that were not solved by any of our systems. Another metric we use is \(C_{D,{\mathrm{easy}}}\), which is the count of citances from document D that were solved by at least half of our systems.
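
Both counts can be derived directly from a per-citance tally of correct systems, as in this minimal sketch (the input mapping is hypothetical; the total of 66 systems is taken from the definition of easy citances given below).

# Sketch: per-document difficulty metrics from a per-citance tally of correct systems.
from collections import defaultdict

def difficulty_metrics(solved_counts, n_systems):
    # solved_counts[doc][citance_id] = number of systems that found a correct
    # reference sentence for that citance (hypothetical input format).
    c_impos = defaultdict(int)  # citances unsolved by every system
    c_easy = defaultdict(int)   # citances solved by at least half of the systems
    for doc, per_citance in solved_counts.items():
        for citance_id, n_solved in per_citance.items():
            if n_solved == 0:
                c_impos[doc] += 1
            if n_solved >= n_systems / 2:
                c_easy[doc] += 1
    return dict(c_impos), dict(c_easy)

# Hypothetical example: one document with three citances.
counts = {"C00-2123": {"c1": 0, "c2": 40, "c3": 12}}
print(difficulty_metrics(counts, n_systems=66))  # ({'C00-2123': 1}, {'C00-2123': 1})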

Fig. 5 The distribution of correct predictions among the citances of the training and test set

Correlations with unsolved-by-us citances

We calculate two-tailed and one-tailed Mann–Whitney (MW) tests between the number of unsolved-by-us citances, \(C_{D, {\mathrm{impos}}}\), in the test set and in the training set of reference documents. The tests are also calculated for the normalized ratio of unsolved-by-us citances to the total number of citances in the two sets. We report the P values of the tests in Table 16.

Table 16 P value for two-tailed and one-tailed MW test on \(C_{D, {\mathrm{impos}}}\) count and ratio between training and test sets

We observe from the Mann–Whitney test results in Table 16 that the difference in the ratio of unsolved-by-us citances, \(C_{D, {\mathrm{impos}}}\), to the total number of citances in D is more statistically significant than the difference in the raw counts of unsolved-by-us citances. We then examine Spearman's correlations between the unsolved-by-us citances, \(C_{D, {\mathrm{impos}}}\), and several parameters of the reference documents. The correlation values are calculated using online software (Wessa 2017). We compare several parameters of the reference documents in both the training and the test sets with the raw frequency count of \(C_{D, {\mathrm{impos}}}\) and with its ratio to the total number of citances. In Table 17, we report the Spearman correlation values of \(C_{D, {\mathrm{impos}}}\) with the VocabularyFootnote 12 and the Vocabulary ratioFootnote 13 of the document, the number of Non-ASCII words and characters in the reference document, and the SMOG indexes of the citances and reference texts, for the test and training data.

Table 17 Spearman correlations across all reference documents in both training and test sets
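
Such correlations can also be computed with scipy.stats.spearmanr; since the paper used online software (Wessa 2017), the sketch below is only an equivalent computation on hypothetical per-document values.

# Sketch: Spearman correlation between the per-document count of unsolved-by-us
# citances and one document parameter (hypothetical, document-aligned values).
from scipy.stats import spearmanr

c_impos_counts = [5, 0, 3, 7, 1]      # unsolved-by-us citances per document
non_ascii_chars = [12, 0, 4, 30, 2]   # non-ASCII character count per document

rho, p_value = spearmanr(c_impos_counts, non_ascii_chars)
print(rho, p_value)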

Unsolved-by-us citances tend to be positively correlated with the frequency of Non-ASCII characters as well as Non-ASCII words in the text, which indicates that encoding errors are a significant obstacle to proper retrieval. In addition, we see some small inverse correlations for vocabulary ratio; the more difficult documents have a less diverse vocabulary for their size. We expect the effect to be larger than reported, since we only compared against the vocabulary of an entire document; we expect the influence to be localized around citances and reference spans. Furthermore, the number of unsolved-by-us citances is negatively correlated with the SMOG index values, especially for the citances. This suggests that higher text difficulty can lead to a lower number and ratio of unsolved-by-us citances (i.e., higher performance). The reason could be that higher text difficulty means more idiosyncratic word choices, which are easier to match, whereas lower text difficulty means the vocabulary is simple and somewhat similar throughout, which makes matching difficult. A more detailed study of correlations between citances and reference sentences would help provide a better explanation.

Similarity with unsolved-by-us citances

We study the Jaccard similarity (JS) of the unsolved-by-usFootnote 14 citances, \(C_{D, {\mathrm{impos}}}\), for a reference document D, with the reference sentences (RSs) in the document. The experiments are repeated for the documents in the training and test sets separately. We use the following notations:

  • Unsolved-by-us Citances \((C_{D, {\mathrm{impos}}})\): Citances of document D that defeated all our proposed systems.

  • Easy Citances (\(C_{D, {\mathrm{easy}}}\)): Citances of document D for which the reference spans (SR(c)) were correctly recognized by at least half of our proposed systems (i.e., the number of correct predictions is \(\ge\) 33, since we have a total of 66 systems).

  • Irrelevant Reference texts (\({\textit{RS}}_{\textit{c,irrel}}\)): This consists of all the reference sentences of D that are not in SR(c), i.e. \(D - SR(c)\).

We make the following comparisons:

  • Calculation of JS between: (a) \(C_{D, {\mathrm{impos}}}\) with all sentences of D and (b) \(C_{D, {\mathrm{easy}}}\) with all sentences of D.

  • Calculation of JS between: (a) \(C_{D, {\mathrm{impos}}}\) and all the sentences in SR(c), and (b) \(C_{D, {\mathrm{impos}}}\) with all the irrelevant reference sentences, \({\textit{RS}}_{\textit{c,irrel}}\).

As explained earlier, the Mann–Whitney non-parametric test (Mann and Whitney 1947) is used to check for a statistically significant difference between two distributions without requiring the data to be normally distributed. In this section, we calculate the two-sided Mann–Whitney test with a 95% confidence interval on the distributions of Jaccard similarity (JS) values.

Table 18 gives the P values for the comparison between the similarity distributions of the unsolved-by-us and easy citances in the training and test data with respect to the RSs in the documents. Here the null hypothesis is that the distributions of JS values of the unsolved-by-us and easy citances with the corresponding reference sentences in the training and test documents are equal. In Table 19, we provide the same for the similarity values of \(C_{D, {\mathrm{impos}}}\) with the sentences in SR(c) and with the irrelevant sentences of a document. In this experiment, the null hypothesis is that there is no statistically significant difference between the similarity distributions with respect to the reference spans and the irrelevant sentences in the training and test data. In both cases, the P values show that there is a statistically significant difference between the corresponding distributions for both the test and training data; thus the null hypothesis is rejected.

Table 18 P values of the two-tailed Mann–Whitney test on JS values of \(C_{D, {\mathrm{impos}}}\) and all RSs versus \(C_{D, {\mathrm{easy}}}\) and all RSs for each set separately
Table 19 P values of the two-tailed Mann–Whitney test on JS values of \(C_{D, {\mathrm{impos}}}\) and SR(c) versus \(C_{D, {\mathrm{impos}}}\) and \({\textit{RS}}_{\textit{c,irrel}}\) for each set separately

For our similarity measurements, we perform two necessary pre-processing steps on the data: removal of English stop words (Python NLTK library) and stemming (Python NLTK PorterStemmer).
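
A minimal sketch of the JS computation with this pre-processing is given below (NLTK stop-word removal and Porter stemming as stated above; the use of NLTK's word tokenizer and lowercasing are assumptions).

# Sketch: Jaccard similarity between a citance and a reference sentence after
# stop-word removal and Porter stemming (requires the NLTK stopwords/punkt data).
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def token_set(text):
    # Lowercase, tokenize, drop stop words and punctuation, and stem the rest.
    return {STEMMER.stem(t) for t in word_tokenize(text.lower())
            if t.isalnum() and t not in STOP}

def jaccard(a, b):
    # Jaccard similarity between the processed token sets of two sentences.
    sa, sb = token_set(a), token_set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

print(jaccard("We use a positional language model for ranking.",
              "A positional language model ranks the sentences."))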

Unsolved-by-us versus easy citances

We calculate the Jaccard similarity (JS) between the set of \(C_{D,{\mathrm{impos}}}\) and all the sentences of the reference document D (RSs). We also measure the JS values between \(C_{D,{\mathrm{easy}}}\) and the RSs of the corresponding document. Figures 6 and 7 show the variation in the JS values of, respectively, the unsolved-by-us citances and the easy citances with the sentences of each reference document in the training set, and Figs. 8 and 9 show the same for the test set. In all the figures, we plot the Jaccard similarity values on the y-axis and the document names on the x-axis. For ease of presentation in Figs. 6 and 7, we number the training documents from 1 to 30 in lexicographic order and refer the reader to Table 20 for the filenames.

We observe that there are very few easy citances in the test set (present in only 4 out of 10 documents), as shown in Fig. 9. In stark contrast, there are numerous citances in the test set that were unsolved by our systems (Fig. 8). The training set is more balanced: we observe from Fig. 7 that nearly all documents have some easy citances, and the situation for unsolved citances is similar.

Table 20 Labels for the training set documents
Fig. 6 JS values of unsolved-by-us citances with all the reference sentences in training set reference documents

Fig. 7 JS values of easy citances with all reference sentences in training set reference documents

Fig. 8 JS values of unsolved-by-us citances with all reference sentences in test set reference documents

Fig. 9 JS values of easy citances with all reference sentences in test set reference documents

Ground truth versus irrelevant reference sentences

We calculate the Jaccard similarity (JS) values between \(C_{D,{\mathrm{impos}}}\) and two different types of text from document D: the reference span sentences in SR(c) for each c, and the irrelevant sentences of D, \(\textit{RS}_{\textit{c,irrel}}\). Figures 10 and 11 use box plots to show the variation in JS values of \(C_{D,{\mathrm{impos}}}\) with the ground truth reference sentences and with the irrelevant reference sentences, respectively, for the training data. Figures 12 and 13 show the same for the test dataset.

Fig. 10 JS values of unsolved-by-us citances with the reference span sentences (SR(c)) in training set reference documents

Fig. 11 JS values of unsolved-by-us citances with all irrelevant sentences in training set reference documents

Fig. 12 JS values of unsolved-by-us citances with the reference span sentences (SR(c)s) in test set reference documents

Fig. 13 JS values of unsolved-by-us citances with all irrelevant sentences in test set reference documents

We observe that Jaccard similarity (JS) may not be a good measure for choosing the best set of reference sentences for a given citance. Comparing Fig. 10 with Fig. 11, we see higher JS values between \(C_{D,{\mathrm{impos}}}\) and the ‘irrelevant’ reference sentences than between \(C_{D,{\mathrm{impos}}}\) and the reference span sentences for a large number of documents [for example, C00-2123 (Doc 1) and D10-1083 (Doc 8)]. Better measures of “similarity” are needed to understand the relations between reference spans and citances.

Discussion

The distributions of discourse facets for the training and test sets are clearly quite different. At this time, we are unable to determine whether this is because of a difference in annotators, drift in annotator choices, or because the discourse facets really are different for the two sets.Footnote 15 Of course, it is also possible that this difference is due to some combination of these reasons.

Our investigation into the differences between the test and training sets also revealed a significant difference in the number of unsolved citances. Once we track the correlations of other variables with these “difficulty indicators”, we observe that difficult documents tend to have a smaller vocabulary and tend to be easier to read. We initially expected difficult citances to be more complex, yet the opposite is true. This makes sense once we realize that a more varied vocabulary makes it easier to distinguish between sentences. Furthermore, a lower reading difficulty implies the use of more common words. A common word tends to be more general, which means it requires greater contextual awareness since it can be used in many contexts. These results explain the effectiveness of TFIDF in the past and its degradation on the test set.

Conclusion

In this paper, we have presented several approaches and their performance on the three tasks of the CL-SciSumm 2017 shared challenge. We have also analyzed several interesting parameters of the 2017 training and test sets. For example, we found a significant difference in the facet distribution, as well as differences in the readability levels. Our research suggests a tantalizing pattern: people tend to use “easier” (less technical) language when they refer to other papers, and this makes it harder to identify reference spans. This was more frequently observed in the test set, and we do see the performance of all our Task 1A methods decline on the test set.