1 Introduction

With the development of the Internet and web media, we can learn about events all around the world. Extracting useful information efficiently from this mass of data has become a critical problem for readers, and much work has been done toward this goal. News topic detection is one of the most widely used technologies on news websites and in search engines: it groups articles into event clusters and presents them with short abstracts. Related news recommendation has also become popular recently; it suggests similar or related topics based on the article currently being browsed. These techniques create a great opportunity to mine useful knowledge from news documents. We can discover trends among related topics, for example, the trend of market prices over past periods. We can learn lessons by comparing similar events, for example, what caused the revolutions in Tunisia, Egypt and Libya, and what led to the different endings of the mining accidents in Chile and China. We can judge the pros and cons of competitors according to related words, for example, which of Putin and Medvedev would be the better next president of Russia. However, making good comparisons requires thorough knowledge of all the topics involved, so manual analysis is usually very time-consuming and labor-intensive. A summary generated automatically to highlight the comparative information among news topics would obviously help people analyze those topics much more easily and efficiently.

Literally, a comparison identifies the similarities or differences among two or more objects. It basically consists of three components: the compared objects, the scale (i.e., the aspect on which the objects are measured and the comparison is made), and the result (i.e., the predicate that describes the positions of the objects on the comparative scale). Comparisons can be represented using comparative sentences, which employ particular syntactic forms and/or words and describe a simple relation between the compared objects, for example, equative, greater or less, or non-gradable difference. Comparisons can also be formed by describing each object in a separate text section (e.g., a sentence or a paragraph). This kind of expression is not as explicit as a comparative sentence, but it can describe the features of the compared objects in more detail. For example,

  (i) Chile is richer than Haiti.

  (ii) Haiti is an extremely poor country.

  (iii) Chile is a rich country.

Sentence (i) is a typical comparative sentence: the objects are Chile and Haiti; the comparative scale is wealth, implied by “richer”; and the result is that Chile is superior to Haiti. Sentence (ii) does not describe any comparison when regarded individually, and neither does sentence (iii). However, when we take them together, the “poor-rich” pair implies a comparison on the wealth of the two countries and conveys much the same information as sentence (i).

A comparison should meet some semantic conditions. First, the objects must be comparable, that is, they must have some common aspects and usually belong to the same concept category. For example, the sentence “Then the righteous will shine like the sun in the kingdom of their father.” is an allegory but not a comparison, because “righteous” and “sun” are different concepts and usually not comparable. Second, the comparative scale must be shared by all the objects. For example, “A pigeon can fly faster than a man” is semantically improper because men cannot fly. Third, the comparative result must describe the relation among all objects clearly. Take “Haiti is a poor country.” and “India’s wealth dropped 32 %.” for example. Although both sentences talk about the wealth of countries, they do not compose a good comparison because we cannot infer the relation between Haiti and India, that is, whether India is richer than Haiti, as poor as Haiti, or even poorer than Haiti.

A news topic consists of stories about events and activities that are directly connected to a central event. The comparative news summarization task aims to extract salient comparisons among comparable news topics and convey this information in human-readable sentences. Recently, this task has drawn much attention, and a few algorithms have been proposed. However, most previous studies have focused on comparing review opinions of products, because the aspects in reviews are easy to extract and the comparisons in reviews follow simple patterns, for example, positive vs. negative. In contrast, the aspects in news documents are much more diverse: they can be the time of the events, the persons involved, the attitudes of participants, etc., and they can be expressed explicitly or implicitly in various ways. These issues raise great challenges for comparative summarization in the news domain.

In this study, we propose a novel approach to comparative news summarization. A good comparative summary should contain sentences that convey both the comparative information among topics and the representative information about each individual topic. A set of sentences is considered comparative if the sentences share comparative concepts, and a sentence is considered representative if it contains important concepts of the topic. We incorporate these two criteria into an objective function and solve the optimization problem using linear programming. Experimental results demonstrate the effectiveness of our model, which outperforms the baseline systems in the quality of both comparison identification and summarization.

The rest of this article is organized as follows: Sect. 2 briefly reviews related work. Sect. 3 defines the comparative news summarization task. Sect. 4 describes our proposed approach in detail. Sect. 5 evaluates the performance of our system and discusses the effects of key parameters. Finally, Sect. 6 concludes our work and discusses possible future directions.

2 Related work

2.1 Comparative text analysis

Comparisons have long been studied in the fields of linguistics and literature. Many works have examined the connotation, extension, forms and usages of comparisons [20, 24, 32, 45]. Comparative analysis has been applied in many domains, and several related academic subjects have been founded, such as comparative linguistics [2], comparative literature [55], comparative history [6] and comparative politics [26].

Comparative analysis is also widely used in web applications. Many electronic commerce systems, for example Amazon and Newegg, provide comparisons of commodities on prices and functionality based on underlying structured data. More recently, mining comparative information from unstructured data has drawn much attention. Several researchers propose to find comparable objects by using linguistic patterns [4, 25] or distributional similarities [14, 30]. Some studies try to identify explicit linguistic comparative sentences and extract the components of comparisons from them [16, 17]. Other studies make contrasts by extracting features of individual objects and then matching them up [22, 42, 59]. While most studies concentrate on comparing the common aspects of objects, some research focuses on detecting the unique points of topics [52] or the novelty of documents [49], which can be considered a special kind of comparison, “with vs. without”.

Researchers have developed various forms of presentation for discovered comparisons. Liu et al. [29] simplify an opinion into a real value indicating the polarity and strength of sentiment and present comparisons using histograms. Zhai et al. [59] propose a topic model to discover comparative themes and present them using word distributions. More recently, comparative summaries have become popular because of their rich informativeness and high readability [22, 23, 42, 56].

Witte and Bergler [56] introduce a contrastive summarization task to indicate the common themes across all documents and document-specific contrastive themes. They use a clustering algorithm based on fuzzy set theory to generate topic clusters. The topics that span a high percentage (e.g., \(>\)90 %) of all documents are extracted as common topics, while the topics covered only in a small subset (e.g., \(<\)5 %) of documents are extracted as distinguishing topics. The main drawback of their approach is that the distinguishing topics are not aligned, that is, they may describe different aspects of different documents rather than contrast the divergences of documents on the same aspects. Lerman and McDonald [23] propose a model for generating pairs of summaries that highlight differences between two products. The basic idea is to reward sentences that are similar to their own product’s reviews (i.e., have a low KL-divergence with respect to them) and different from the other product’s reviews (i.e., have a high KL-divergence with respect to them). This model also prefers different aspects between products rather than divergences on the same aspects. Kim and Zhai [22] propose a method to extract comparable sentences from two sets of positive and negative opinions and generate a comparative summary containing a set of contrastive sentence pairs. The task is formalized as an optimization problem of maximizing the content similarity and the contrastive similarity, and is solved using a greedy algorithm. Paul et al. [42] propose a random walk formulation called Comparative LexRank to score sentences and pairs of sentences from opposite viewpoints. So far, the study of comparative summarization has mostly focused on product reviews, where the aspects and sentiments are relatively easy to extract. Wan et al. [50] propose a system that summarizes the differences among news reports of the same topic in different languages using a constrained co-ranking method. In contrast, our task focuses on summarizing the commonalities and differences of two comparable topics.

2.2 Multiple document summarization

Automatic summarization aims to generate a short description in natural language that conveys the important information in the original texts [33]. Several specific subtasks have been proposed. The traditional summarization task places equal emphasis on different pieces of information and provides balanced coverage. Guided summarization (also called topic-focused summarization) produces summaries that focus on a user’s current context, which is usually provided as keywords and/or short narratives [56]. In comparison, comparative summarization focuses on a particular kind of information rather than any specific context. Update summarization emphasizes detecting the novelty of new articles over earlier articles [8], and can be regarded as producing a special kind of comparative summary.

A variety of summarization methods have been proposed recently. Generally speaking, the summarization task can be performed by extraction, which identifies important sections of the text and reproduces them verbatim, or by abstraction, which involves generating novel sentences from the important material. Extraction-based methods usually assign a saliency score to each sentence and then rank the sentences in the document. The scores are usually computed from statistical and linguistic features, including term frequency, sentence position, cue words, topic signatures, etc. [13, 35, 44]. Machine learning methods have also been employed to extract sentences, including unsupervised methods [40] and supervised methods [47, 58]. More recently, graph-based methods have been proposed for document summarization [12, 36, 46, 51, 54]. These methods build a graph based on the similarity between sentences and calculate the importance of a sentence with regard to the global information of the graph.

So far, most summarization models consider a sentence as the information unit. Sentences are selected under the “maximal marginal relevance” (MMR) criterion [7] to maximize the information involved while minimizing redundancy. McDonald [34] adapts the MMR framework and gives an Integer Linear Programming (ILP) formulation with explicit relevance and redundancy terms. Gillick et al. [11] propose a linear programming model over sub-sentence units without an explicit redundancy term; instead, redundancy is limited by the fact that each kind of unit is counted only once, which, combined with a length constraint, makes the solution prefer diverse information. In this study, we extend the concept-based ILP model in [11] to comparative news summarization.

2.3 News article analysis

News article analysis is a hot area in both academia and industry for several reasons. First, there is strong demand for news analysis techniques to alleviate the burden of getting information from massive news resources. Second, news documents are good corpora for data mining research because of their large amount and easy accessibility. Third, the high linguistic quality of news articles and the abundant linguistic phenomena they contain make the news domain suitable for natural language processing.

So far, many text mining and natural language processing tasks have been applied to the news domain, including classification [38], sentiment analysis [3], summarization [5], named entity recognition [41], relation extraction [48], etc. Some tasks specific to news analysis have also been proposed. Topic Detection and Tracking (TDT) aims to find topically related material (i.e., an event) in streams of news data [1, 53]. News trend analysis studies the developing behavior of society’s interests, that is, determining whether they change or remain considerably stable from one period to another [37]. News personalization generates personalized recommendations for each reader based on their preferences [10, 19].

News analysis techniques have been successfully applied in real applications. Google News is a well-known online news service, providing news categorization, topic aggregation and summarization, related news recommendation, news personalization, news retrieval, etc. There are also many similar systems, such as Bing News and Yahoo! News. There is a significant difference between the news summarization provided by these systems and our proposal: their summarization is based on the articles of a single topic, while we propose to generate a comparative summary of two comparable topics. To the best of our knowledge, there is not yet a publicly accessible news service that provides comparative analysis of two news topics.

3 Problem definition

3.1 News topic comparison

A news topic is “a seminal event or activity, along with all directly related events and activities” [39]. It contains a collection of stories that discuss events and activities directly connected to the seminal event. For example, a story about the search for survivors of an earthquake is considered to belong to that earthquake’s topic. However, two stories about different earthquakes are usually not considered to belong to the same news topic.

Similar events happen from time to time, and it is interesting to compare them to discover latent knowledge. Unlike comparisons of product reviews, which focus on a few features of products, comparisons of news topics involve many different aspects. For example, a news topic may involve the causes and consequences of the event, the attitudes and actions of the persons involved, and many details of the event. Any of these aspects can serve as a comparative scale if it occurs in both comparable topics. For example, when comparing the earthquake in Haiti with the one in Chile, we can compare the intensity of the temblors, the damage in the disaster areas, the rescue efforts of the local governments, the international assistance, etc.

Note that the number of news topics in a comparison is not limited to two. We can compare three or even more topics, for example, the “Color Revolutions” in Georgia, Ukraine, Kyrgyzstan and Tunisia. A strict comparison of several topics ought to contain information about all topics; thus, the more topics involved, the fewer comparisons can be extracted. A looser definition allows the absence of some topics, and thus actually degenerates into a combination of several comparisons between two topics. In this study, we focus on the comparison of two news topics and leave the comparison of more topics as future work.

In addition to comparing different events, we can compare different periods of a continuous event. For example, by comparing Libya’s turmoil before and after March, we can find completely different situations caused by NATO’s airstrikes. It is also possible to compare articles about the same topic written by different news agencies and analyze their different views and attitudes. Comparisons across periods or versions of a topic emphasize the differences and contradictions, while comparisons of different topics place equal emphasis on commonalities and differences. In this study, we focus on the comparison of different topics.

3.2 Comparative news summarization

A comparative news summary of two comparable news topics highlights the commonalities and differences between them. It consists of two blocks of text. Each block concentrates on a single topic, while both blocks refer to the comparable aspects of the two topics. For example, Table 1 illustrates a comparative summary about two mine accidents. The left column describes the Chilean copper mine accident, while the right column describes the New Zealand coal mine accident. Both blocks mention the names of the mines (San José copper–gold mine vs. Pike River Coal mine), the numbers of victims (33 miners vs. 29 workers), the rescue efforts (drill a new hole ... to extract the miners vs. wait for tests ... before entering the coal mine), and the endings of the accidents (brought safely to the surface vs. died after ... blast).

Table 1 Comparative summary about “Chilean mine accident versus New Zealand mine accident”

Formally, let \(topic_{1}, topic_{2}\) be two comparable news topics, where each topic is described by a document collection \(D_{i }(i = 1, 2)\). The task of comparative summarization is to extract two blocks of texts \(B_{1 }=S_{11 }\cup {\cdots }\cup S_{1n},\,B_{2 }=S_{21 }\cup {\cdots } \cup S_{2n}\), where \(S_{ij} \subset D_{i} (i = 1, 2)\) is a set of representative sentences about \(topic_{i}\) on the aspect \(aspect_{j}\). In other words, \(S_{1j}\) and \(S_{2j}\) describe a comparison of the two topics on \(aspect_{j}\). Meanwhile, the length (i.e., the number of words) of the summary should not exceed a limit \(L\).

Generally speaking, a good comparative summary should meet several criteria. First, the summary ought to focus on the comparison between the news topics. Second, the information in the summary should be salient and representative of the news topics. Third, the summary should convey as much information as possible within the length limit. Finally, the summary must have good linguistic quality, that is, it should be fluent and easily understood by humans.

4 Proposed approach

A naive idea for comparative summarization is to extract comparative sentences from the original articles. Unfortunately, this is not practical because comparative sentences do not appear frequently in news articles, and those relevant to the desired competitors are even fewer. Instead, we extract suitable non-comparative sentences that discuss similar aspects of the two topics and organize them appropriately to form comparisons.

The processing procedure of our comparative summarization system is illustrated in Fig. 1. The main steps are pre-processing, sentence selection and sentence ordering. The pre-processing step cleans the texts and segments them into information units. The sentence selection step extracts proper sentences from the original documents by considering both information representativeness and comparativeness. The sentence ordering step reorganizes the selected sentences to improve the readability of the summary. The details of each step are described in the following subsections.

Fig. 1 The processing procedure of the proposed system

4.1 Pre-processing

The pre-processing step aims to clean the texts and extract information units from them. In theory, the semantic information of a text is related to the meaning of its words, the syntactic structure of its sentences, the relations among the sentences, and even the contexts around the text. To simplify the model, we only take into account the meaning of words in the text. Similar to the bag-of-words model, we represent the text with a bag-of-concepts, where each concept is an information unit. Obviously, the more accurate the extracted concepts are, the better we can represent the meaning of a text. However, it is not easy to extract semantic concepts accurately. In this study, we simply use words/stems, named entities and bigrams to represent concepts and leave more complex concept extraction for future work.

In the pre-processing step, we first break each document into sentences, then tokenize each sentence into words and recognize the named entities. (We use the Stanford CoreNLP toolkit for NER in English and an in-house CRF-based NER tool for NER in Chinese; other NER toolkits, for example GATE and LingPipe, would also work.) Common stop words are discarded. For English documents, stems are used instead of the original words. We filter out sentences that contain no more than three non-stop words. The unigrams and bigrams of tokens are extracted, and their occurrence frequencies are counted.
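The pre-processing described above can be sketched as follows. This is a minimal illustration in which a regex tokenizer and a toy stop list stand in for the CoreNLP pipeline, stemming and NER; all function and variable names here are our own:

```python
import re
from collections import Counter

# Toy stop list; a real system would use a standard one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "is", "are", "to", "and"}

def preprocess(document):
    """Split a document into sentences; collect unigram and bigram concepts."""
    concepts = Counter()
    sentences = []
    for raw in re.split(r"(?<=[.!?])\s+", document.strip()):
        tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9']+", raw)]
        content = [t for t in tokens if t not in STOP_WORDS]
        if len(content) <= 3:                       # drop very short sentences
            continue
        sentences.append(content)
        concepts.update(content)                    # unigram concepts
        concepts.update(zip(content, content[1:]))  # bigram concepts
    return sentences, concepts
```

For the sentence “All 33 miners were brought safely to the surface of the mine.”, this keeps the content tokens and counts bigrams such as ("surface", "mine").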

4.2 Sentence selection

The sentence selection step aims to extract proper sentences from the original documents to compose a comparative summary. The comparative summary ought to be representative of the information in each topic and emphasize the comparisons between the two topics. We formalize sentence selection as an optimization problem: selecting sentences to maximize the representativeness and the comparativeness of the summary. The estimation of a summary’s representativeness and comparativeness, and the optimization model, are described in turn below.

4.2.1 Representativeness estimation

As the essential requirement for text summarization, the comparative summary should represent the salient aspects of each individual news topic, that is, convey the important information in the documents. The representativeness of the summary can be estimated by the amount and saliency of information contained in the summary. Note that a concept is a unit of information. Therefore, the amount and saliency of information can be estimated by the aggregation of weights of the concepts.

Intuitively, as the keynotes of the topic, important information will be mentioned many times across the articles. Thus, the more frequently a concept occurs in the documents, the more likely it represents a salient piece of information. Based on this assumption, we can estimate a concept’s saliency by its frequency of occurrence. Formally, the weight \(w_{ij}\) of a concept \(c_{ij}\) is calculated as follows:

$$\begin{aligned} w_{ij} =freq\left( {c_{ij} ,D_i } \right)\cdot idf\left( {c_{ij} } \right) \end{aligned}$$
(1)

where freq(\(c_{ij},\,D_{i})\) is the frequency of occurrences of \(c_{ij}\) in \(D_{i}\), and idf(\(c_{ij})\) is the inverse document frequency of \(c_{ij}\), which penalizes topic-independent common words, as defined in Eq. 2:

$$\begin{aligned} idf\left( {c_{ij} } \right)=\log \frac{|D_B |}{1+\sum _{d\in D_B } {I\left( {c_{ij} ,d} \right)} } \end{aligned}$$
(2)

where \(D_{B}\) is a large background corpus, \({\vert }D_{B}{\vert }\) is the number of documents in \(D_{B}\), and \(I(c_{ij},\,d) \in \) {0, 1} is an indicator function of whether \(c_{ij}\) occurs in document \(d\):

$$\begin{aligned} I\left( {c_{ij} ,d} \right)=\left\{ {\begin{array}{l} 1,\quad \text{ if}\,freq\left( {c_{ij} ,d} \right)>0 \\ 0,\quad \text{ otherwise} \\ \end{array}} \right. \end{aligned}$$
(3)
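Equations (1)-(3) can be implemented directly once concept frequencies and a background corpus are available. A small sketch, with the corpus represented simply as a list of document word sets (the data structures are illustrative, not from the paper):

```python
import math

def idf(concept, background_docs):
    """Eq. (2): idf(c) = log(|D_B| / (1 + df)), with df counted via the
    Eq. (3) indicator (whether the concept occurs in a document)."""
    df = sum(1 for doc in background_docs if concept in doc)
    return math.log(len(background_docs) / (1 + df))

def concept_weight(concept, topic_freq, background_docs):
    """Eq. (1): w_ij = freq(c_ij, D_i) * idf(c_ij)."""
    return topic_freq.get(concept, 0) * idf(concept, background_docs)

# Background corpus as word sets (illustrative data).
background = [{"earthquake", "rescue"}, {"election"}, {"earthquake"}, {"football"}]
```

With these four background documents, idf("rescue") = log(4/2): a concept that appears in many background documents is down-weighted as topic-independent.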

The representativeness of a comparative summary sum is defined as the aggregation of the concepts’ weights as follows:

$$\begin{aligned} Rep\left( {sum} \right)=\sum _{i=1}^2 {\sum _{j=1}^{|C_i |} {w_{ij}\, f\left( {c_{ij} ,sum} \right)} } \end{aligned}$$
(4)

where \(C_{i}\) is the collection of concepts in \(D_{i}\), and \(f(c_{ij}\), sum) is a function that calculates \(c_{ij}\)’s information contribution to the summary.

A simple form of \(f\) is the frequency function freq(\(c_{ij}\), sum). The basic idea is that the more frequently a salient concept occurs in the summary, the more sentences focus on the salient aspects of the topics, and thus the more representative the summary is. The disadvantage of the frequency function is that it does not penalize redundancy among the sentences in the summary, and thus the coverage of information will be low.

Another form of \(f\) is the indicator function \(I(c_{ij}\), sum), as used in [11]. The advantage of the indicator function is that it counts each concept only once, so that the optimized summary tends to contain sentences without overlapping concepts. However, because the summary should focus on certain topics, a few overlaps of topic words are necessary. For example, many sentences in a summary about an earthquake will contain the word “earthquake”, because it is the topic’s centroid concept and is inescapable.

To balance information saliency and redundancy, \(f\) should increase as the frequency grows, but its growth rate should decrease gradually, as illustrated in Fig. 2.

Fig. 2 Graph of function \(f(x)\)

In this study, we define \(f\) as follows:

$$\begin{aligned} f\left( {c_{ij} ,sum} \right)=\left\{ {\begin{array}{l@{\quad }l} 0,&freq\left( {c_{ij} ,sum} \right)=0 \\ \sum _{k=0}^{freq\left( {c_{ij} ,sum} \right)-1} {\alpha ^{k}},&0<freq\left( {c_{ij} ,sum} \right)\le N \\ \sum _{k=0}^N {\alpha ^{k}},&freq\left( {c_{ij} ,sum} \right)>N \\ \end{array}} \right. \end{aligned}$$
(5)

where \(\alpha \in \) [0, 1] (we define \(0^{0} = 1\) here). The lower \(\alpha \) is, the less a redundant concept instance contributes, and thus the summary tends to contain less redundancy. When \(\alpha = 0\), \(f\) degenerates to the indicator function; when \(\alpha = 1\), \(f\) is the frequency function, which does not avoid any redundancy. \(N\) controls the upper bound of a concept’s contribution. In this study, we set \(N = 3\).
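As a sketch, Eq. (5) as written above (the \(k\)-th occurrence of a concept contributing \(\alpha^{k-1}\), with the contribution capped once the frequency exceeds \(N\)) can be computed as:

```python
def f_contribution(freq_in_sum, alpha=0.5, N=3):
    """Eq. (5): diminishing contribution of repeated concept occurrences.
    Occurrence k contributes alpha**(k-1); at most N + 1 terms are summed
    (0**0 is taken as 1, so alpha = 0 yields the indicator function)."""
    capped = min(freq_in_sum, N + 1)
    return sum(alpha ** k for k in range(capped))
```

With \(\alpha = 0\) this reduces to the indicator function and with \(\alpha = 1\) to the plain frequency function, matching the two special cases discussed above.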

4.2.2 Comparativeness estimation

According to the definition of comparison, a set of texts can form a comparison only if they discuss the common aspects of objects. For example,

Lionel Messi named FIFA World Player of the Year 2010.

Cristiano Ronaldo crowned FIFA World Player of the Year 2009.

The above two sentences compare on the aspect “FIFA World Player”, which is contained in both sentences. Furthermore, semantically related concepts can also indicate comparisons. For example, “snow” and “sunny” can indicate a comparison on weather; “alive” and “death” can imply a comparison on rescue results. If two sentences share a pair of semantically related concepts, we consider them a potential comparison. In other words, semantically related concept pairs are evidences of comparisons.

Formally, let \(C_{i}\) = {\(c_{ij}\)} (\(i\) =1, 2) be the set of concepts in the document set \(D_{i}\). A sentence \(s_{ij} \in D_{i}\) is represented with a subset of \(C_{i}\). Then a pair of sentences \(\langle s_{1a},\,s_{2b}\rangle \) is a potential comparison iff

$$\begin{aligned} \exists c_{1k_1 } \exists c_{2k_2 } \left( {c_{1k_1 } \in s_{1a} \wedge c_{2k_2 } \in s_{2b} \wedge rel\left( {c_{1k_1 } ,c_{2k_2 } } \right)\ge \tau } \right) \end{aligned}$$

where \(rel\left( {c_{1k_1 } ,c_{2k_2 } } \right)\) is the semantic relevance between \(c_{1k_1 } \) and \(c_{2k_2 } \), and \(\tau \) is a minimal threshold. The concepts \(c_{1k_1 } \) and \(c_{2k_2 } \) make up an evidence of the comparativeness between \(s_{1a}\) and \(s_{2b}\). In the remainder of this paper, we denote a comparative evidence, that is, a semantically related concept pair, as \(ce_k =\langle c_{1k_1 } ,c_{2k_2 } \rangle \), where \(c_{1k_1 } \in D_1\) and \(c_{2k_2 } \in D_2 \).

To extract these comparative evidences, we collect the concepts from the documents and then calculate the semantic relevance between each pair of concepts. The semantic relevance between two English words is calculated using WordNet-based algorithms [43]. The relevance between two Chinese words is calculated based on HowNet [31]. The relevance between two out-of-vocabulary bigrams \(wd_{11} wd_{12}\) and \(wd_{21} wd_{22}\) is calculated as \(\left[ {rel\left( {wd_{11} ,wd_{21} } \right)+rel\left( {wd_{12} ,wd_{22} } \right)} \right]/2\), where each \(wd_{ij} (i,\,j = 1, 2\)) is a word. The collection of obtained comparative evidences (i.e., semantically related concept pairs) is denoted as \(CE = \{ce_{k}\}.\)
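The evidence extraction step can be illustrated with a toy stand-in for the WordNet/HowNet relevance measure; the hand-set scores in `REL` are hypothetical and only serve to show the thresholding and the bigram back-off formula:

```python
# Hand-set relevance scores standing in for a WordNet/HowNet-based
# measure (hypothetical values, for illustration only).
REL = {
    frozenset(["alive", "death"]): 0.85,
    frozenset(["snow", "sunny"]): 0.80,
}

def rel(w1, w2):
    """Word-level semantic relevance; identical words are maximally related."""
    if w1 == w2:
        return 1.0
    return REL.get(frozenset([w1, w2]), 0.0)

def bigram_rel(bg1, bg2):
    """Back-off for out-of-vocabulary bigrams: average the word-wise relevances."""
    return (rel(bg1[0], bg2[0]) + rel(bg1[1], bg2[1])) / 2

def comparative_evidences(words1, words2, tau=0.7):
    """All cross-topic concept pairs whose relevance reaches the threshold tau."""
    return [(c1, c2) for c1 in words1 for c2 in words2 if rel(c1, c2) >= tau]
```

For instance, `comparative_evidences(["alive", "rescue"], ["death", "election"])` yields only the pair ("alive", "death").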

The weight \(u_{k}\) of a comparative evidence \(ce_k =\langle c_{1k_1 } ,c_{2k_2 }\rangle \) is calculated as:

$$\begin{aligned} u_k =\frac{w_{1k_1 } +w_{2k_2 } }{2} \end{aligned}$$
(6)

where \(w_{1k_1}\) and \(w_{2k_2}\) are the weights of \(c_{1k_1 } \) and \(c_{2k_2}\), calculated using Eq. 1, respectively.

The comparativeness of the summary sum is calculated as:

$$\begin{aligned} Cmp\left( {sum} \right)=\sum _{k=1}^{|CE|} {u_k g\left( {ce_k ,sum} \right)} \end{aligned}$$
(7)

where \(g\)(ce \(_{k}\), sum) is a function that calculates ce \(_{k}\)’s information contribution. Similar to \(f(c_{ij}\), sum) discussed in Sect. 4.2.1, \(g\)(ce \(_{k}\), sum) is defined as follows:

$$\begin{aligned} g\left( {ce_k ,sum} \right)=\left\{ {\begin{array}{l@{\quad }l} 0,&freq\left( {ce_k ,sum} \right)=0 \\ \sum _{i=0}^{freq\left( {ce_k ,sum} \right)-1} {\alpha ^{i}},&0<freq\left( {ce_k ,sum} \right)\le N \\ \sum _{i=0}^N {\alpha ^{i}} ,&freq\left( {ce_k ,sum} \right)>N \\ \end{array}} \right. \end{aligned}$$
(8)

where freq(ce \(_{k}\), sum) is the frequency of ce \(_{k}\)’s occurrences in the summary, that is,

$$\begin{aligned} freq\left( {ce_k ,sum} \right)=Min\left( {freq(c_{1k_1 } ,sum),freq(c_{2k_2 } ,sum)} \right) \end{aligned}$$
(9)
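A sketch of Eqs. (8)-(9), reusing the same diminishing-returns shape as \(f\) and taking an evidence pair's frequency as the minimum of its members' frequencies (helper names are ours):

```python
def evidence_freq(ce, concept_freq):
    """Eq. (9): an evidence pair occurs as often as its rarer member concept."""
    c1, c2 = ce
    return min(concept_freq.get(c1, 0), concept_freq.get(c2, 0))

def g_contribution(ce, concept_freq, alpha=0.5, N=3):
    """Eq. (8): same diminishing-returns scheme as f, applied to evidence pairs."""
    capped = min(evidence_freq(ce, concept_freq), N + 1)
    return sum(alpha ** k for k in range(capped))
```

If either concept of the pair is absent from the summary, the minimum in Eq. (9) is zero and the evidence contributes nothing, which is exactly the intended behavior.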

4.2.3 Concept-based comparative summary model

To highlight the comparison among news topics, a comparative summary should contain as many salient comparative evidences as possible. Besides, it should represent the topics well, that is, convey the important information of each individual news topic. Thus, the score of a comparative summary can be calculated as the sum of the comparativeness score and the representativeness score:

$$\begin{aligned} Score\left( {sum} \right)&= \left( {1-\lambda } \right)\cdot Rep\left( {sum} \right)+\lambda \cdot Cmp\left( {sum} \right) \nonumber \\&= \left( {1-\lambda } \right)\cdot \sum _{i=1}^2 {\sum _{j=1}^{|C_i |} {w_{ij} f\left( {c_{ij} ,sum} \right)} } +\lambda \cdot \sum _{k=1}^{|CE|} {u_k g\left( {ce_k ,sum} \right)} \end{aligned}$$
(10)

where \(\lambda \) \(\in \) [0, 1] is a coefficient that balances the comparativeness and representativeness.

The summary consists of several sentences extracted from the original documents. The aim of sentence selection is to maximize the score of the summary by selecting proper sentences, that is,

$$\begin{aligned} ComparativeSummary\left( {D_1 ,D_2 } \right)=\mathop {\arg \max }\limits _{sum*\subset D_1 \cup D_2 } Score\left( {sum*} \right) \end{aligned}$$
(11)
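Assuming the concept frequencies in a candidate summary have already been counted, Eq. 10 can be sketched directly as a weighted, diminishing-returns sum over single concepts and comparative pairs. The names below (`gain`, `summary_score`) are our own:

```python
def gain(freq, alpha=0.5, N=3):
    # Diminishing-returns credit: the i-th occurrence of a concept
    # contributes alpha**i, capped after N occurrences (Eqs. 5 and 8).
    return sum(alpha ** i for i in range(min(freq, N)))

def summary_score(rep_concepts, cmp_pairs, lam=0.3, alpha=0.5, N=3):
    """Eq. 10: (1 - lambda) * Rep(sum) + lambda * Cmp(sum).

    rep_concepts: list of (w_ij, frequency of c_ij in the summary).
    cmp_pairs:    list of (u_k, (frequency of c_1k1, frequency of c_2k2)).
    """
    rep = sum(w * gain(n, alpha, N) for w, n in rep_concepts)
    cmp_score = sum(u * gain(min(n1, n2), alpha, N)
                    for u, (n1, n2) in cmp_pairs)
    return (1 - lam) * rep + lam * cmp_score

# Toy example: two representative concepts, one comparative pair.
toy = summary_score([(1.0, 1), (0.5, 2)], [(1.0, (1, 1))], lam=0.5)
```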

The objective function can be optimized using integer linear programming. Let \(f_{ij}\) be a numeric variable that represents the value of \(f(c_{ij}, sum)\), and \(g_{k}\) be a numeric variable that represents the value of \(g(ce_k, sum)\). Then the optimization objective can be defined as:

$$\begin{aligned} \text{ Max}\;\left( {1-\lambda } \right)\sum _{i=1}^2 {\sum _{j=1}^{|C_i |} {w_{ij} f_{ij} } } +\lambda \sum _{k=1}^{|CE|} {u_k g_k } \end{aligned}$$
(S.1)

Let \(n_{ij}\) be an integer variable denoting the value of \(freq(c_{ij}, sum)\). The piecewise function in Eq. 5 is equivalent to the following constraints (under \(N = 3\)):

$$\begin{aligned} f_{ij} \le n_{ij} \end{aligned}$$
(C.1)
$$\begin{aligned} f_{ij} \le \alpha \cdot n_{ij} +1-\alpha \end{aligned}$$
(C.2)
$$\begin{aligned} f_{ij} \le \alpha ^{2}\cdot n_{ij} +1+\alpha -2\alpha ^{2} \end{aligned}$$
(C.3)
$$\begin{aligned} f_{ij} \le \alpha ^{2}+\alpha +1 \end{aligned}$$
(C.4)
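As a sanity check, the four linear upper bounds C.1–C.4 touch the piecewise values exactly at the integer frequencies, so a maximizing solver drives \(f_{ij}\) up to the intended capped geometric sum. A short numerical verification (our own sketch, under an assumed \(\alpha = 0.5\)):

```python
ALPHA = 0.5  # assumed damping factor for this check

def piecewise(n, N=3):
    # Target value: capped geometric sum over the first min(n, N) occurrences.
    return sum(ALPHA ** i for i in range(min(n, N)))

def tightest_bound(n):
    # Constraints C.1-C.4 (N = 3): under maximization, f_ij rises to the
    # smallest of these linear upper bounds.
    return min(n,
               ALPHA * n + 1 - ALPHA,
               ALPHA ** 2 * n + 1 + ALPHA - 2 * ALPHA ** 2,
               ALPHA ** 2 + ALPHA + 1)

# The smallest bound equals the piecewise value at every integer frequency.
for n in range(7):
    assert abs(tightest_bound(n) - piecewise(n)) < 1e-12
```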

Similarly, let \(m_{k}\) be an integer variable denoting the value of \(freq(ce_k, sum)\). According to Eq. 8, each \(g_{k}\) should satisfy the following constraints (under \(N = 3\)):

$$\begin{aligned} g_k \le m_k \end{aligned}$$
(C.5)
$$\begin{aligned} g_k \le \alpha \cdot m_k +1-\alpha \end{aligned}$$
(C.6)
$$\begin{aligned} g_k \le \alpha ^{2}\cdot m_k +1+\alpha -2\alpha ^{2} \end{aligned}$$
(C.7)
$$\begin{aligned} g_k \le \alpha ^{2}+\alpha +1 \end{aligned}$$
(C.8)

According to Eq. 9, \(m_{k}\) should satisfy the following constraints:

$$\begin{aligned} m_k \le n_{1k_1 } \end{aligned}$$
(C.9)
$$\begin{aligned} m_k \le n_{2k_2 } \end{aligned}$$
(C.10)

Furthermore, let \(o_{iq}\) be a binary variable indicating whether the sentence \(s_{iq}\) is present in the summary, and \(I(c_{ij}, s_{iq})\) be a binary constant indicating whether the concept \(c_{ij}\) occurs in the sentence \(s_{iq}\). Then the frequencies of concepts should meet the following constraints:

$$\begin{aligned} n_{ij} =\sum _{s_{iq} \in D_i } {I\left( {c_{ij} ,s_{iq} } \right)\cdot o_{iq} } \end{aligned}$$
(C.11)

Finally, the summary should satisfy a length constraint:

$$\begin{aligned} \sum _{i=1}^2 {\sum _{s_{iq} \in D_i } {l_{iq} \cdot o_{iq} } } \le L \end{aligned}$$
(C.12)

where \(l_{iq}\) is the length of sentence \(s_{iq}\), and \(L\) is the maximal summary length.

Note that the variables \(f_{ij},\,g_{k},\,n_{ij},\,m_{k}\) and \(o_{iq}\) all appear linearly in S.1 and C.1–C.12, and their coefficients are all constants. Thus, the optimization problem is a mixed integer programming (MIP) problem. Though MIP problems are generally NP-hard, considerable work has been done on them [9, 18, 21, 57], and several software packages have been released that solve them efficiently. In this study, we use the IBM ILOG CPLEX Optimizer to solve the problem. Other MIP optimizers, for example, the Gurobi Optimizer and GLPK, will also work.
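For intuition about what the MIP computes, a toy instance of Eq. 11 can be solved by brute-force enumeration over sentence subsets; a MIP solver finds the same optimum without enumerating. The sentences, uniform weights, and word-level "concepts" below are hypothetical, and comparative pairs are simplified to identical words appearing on both sides:

```python
from itertools import combinations

ALPHA, LAM, N, L = 0.5, 0.3, 3, 12  # assumed toy parameters; L in words

def gain(freq):
    # Capped, diminishing-returns credit for repeated occurrences.
    return sum(ALPHA ** i for i in range(min(freq, N)))

# Hypothetical toy corpus: (topic id, tokens); every token is a concept
# with weight 1, and the comparative pairs pair identical words.
sentences = [
    (1, "quake hit city".split()),
    (1, "rescue teams arrived fast".split()),
    (2, "quake hit town".split()),
    (2, "rescue was slow".split()),
]
pairs = ["quake", "hit", "rescue"]

def score(subset):
    # Eq. 10 evaluated on a candidate subset of sentence indices.
    freq1, freq2 = {}, {}
    for i in subset:
        topic, words = sentences[i]
        d = freq1 if topic == 1 else freq2
        for w in words:
            d[w] = d.get(w, 0) + 1
    rep = sum(gain(n) for d in (freq1, freq2) for n in d.values())
    cmp_score = sum(gain(min(freq1.get(p, 0), freq2.get(p, 0)))
                    for p in pairs)
    return (1 - LAM) * rep + LAM * cmp_score

# Eq. 11 by brute force, respecting the length constraint C.12.
feasible = (s for r in range(len(sentences) + 1)
            for s in combinations(range(len(sentences)), r)
            if sum(len(sentences[i][1]) for i in s) <= L)
best = max(feasible, key=score)
```

On this instance the optimum keeps both "quake hit ..." sentences (they form two comparative pairs) plus the longest topic-1 sentence, illustrating how the objective trades representativeness against comparativeness under the length budget.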

4.3 Sentence ordering

For better intelligibility, it is necessary to organize the summary according to the comparative aspects. In this study, we use an agglomerative clustering algorithm [15] to group the sentences into clusters based on their similarity. A sentence in the summary is represented as a weighted vector of the concepts contained in the sentence and the comparative concept pairs of which one concept is contained in the sentence. The similarity of two sentences is calculated as the cosine of their vectors. After the clusters are obtained, we order them by the number of sentences they contain. The final summary contains two blocks, each consisting of the sentences selected from the same document set. Within each block, we arrange the sentences according to the order of the clusters they belong to, that is,

$$\begin{aligned} \forall s_i \in cluster_a \forall s_j \in cluster_b :cluster_a \prec cluster_b \rightarrow s_i \prec s_j \end{aligned}$$

where \(x\prec y\) means \(x\) is arranged before \(y\).
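The ordering step can be sketched as follows, assuming the sentence vectors have already been built. This is a plain single-link agglomerative procedure with a stopping threshold; the threshold value and the largest-cluster-first ordering are our assumptions, not specified by the paper:

```python
import math

def cosine(u, v):
    # u, v: sparse vectors as {term: weight}.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def agglomerative(vectors, threshold=0.2):
    # Single-link agglomerative clustering: repeatedly merge the most
    # similar pair of clusters until no pair exceeds the threshold.
    clusters = [[i] for i in range(len(vectors))]
    while True:
        best, pair = threshold, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = max(cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if pair is None:
            return clusters
        a, b = pair
        clusters[a] += clusters.pop(b)

def cluster_rank(clusters):
    # Order clusters by size, largest first; each sentence inherits the
    # rank of its cluster for the final arrangement within a block.
    order = sorted(range(len(clusters)), key=lambda c: -len(clusters[c]))
    return {i: r for r, c in enumerate(order) for i in clusters[c]}

# Toy vectors: sentences 0 and 1 share a concept, sentence 2 stands alone.
vecs = [{"a": 1.0}, {"a": 1.0, "b": 1.0}, {"c": 1.0}]
clusters = agglomerative(vecs)
ranks = cluster_rank(clusters)
```

Sorting each block by `ranks` realizes the precedence relation above: all sentences of an earlier cluster come before those of a later one.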

5 Experiment

5.1 Dataset

Because the comparative news summarization task is novel, there is no existing dataset for evaluation, and thus we create our own. We first choose twenty pairs of comparable topics (ten in English and ten in Chinese, as shown in Tables 2 and 3) and then retrieve ten related news articles for each topic using the Google News search engine.

Table 2 Comparable topic pairs in the English dataset
Table 3 Comparable topic pairs in the Chinese dataset

Finally, we manually write a comparative summary for each topic pair. Note that every reference summary also contains two blocks, each of which concentrates on a single topic in the pair.

5.2 Evaluation metrics

Comparative Aspect Recall (CAR): It is defined as the number of human-agreed comparative aspects covered in the summary. This metric evaluates the performance of comparison extraction.

Overall Responsiveness (OR): The assessors give an overall responsiveness score to each summary based on both content and readability/fluency. It is judged on a 5-point scale whose points indicate very poor, poor, barely acceptable, good, and very good, respectively.

ROUGE: ROUGE is a widely used metric in summarization evaluation. It measures summary quality by counting overlapping units between the candidate summary and the reference summary [27, 28]. The n-gram-based ROUGE value is calculated as follows:

$$\begin{aligned} ROUGE\text{-}N=\frac{\sum \nolimits _{S\in \{RefSum\}} {\sum \nolimits _{n-gram\in S} {Count_{match} \left( {n-gram} \right)} } }{\sum \nolimits _{S\in \{RefSum\}} {\sum \nolimits _{n-gram\in S} {Count\left( {n-gram} \right)} } } \end{aligned}$$
(12)

where \(n\) stands for the length of the n-gram, \(Count_{match}(n\text{-}gram)\) is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries, and \(Count(n\text{-}gram)\) is the number of n-grams in the reference summaries. ROUGE-S is similar to ROUGE-N (\(N = 2\)) but is based on skip-bigrams (i.e., pairs of words in their sentence order, allowing for a certain gap) instead of regular bigrams (i.e., pairs of contiguous words). ROUGE-SU evaluates the co-occurrence of both skip-bigrams and unigrams.
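A minimal implementation of Eq. 12 for ROUGE-N may clarify the clipped counting; this is our own sketch, whereas the actual evaluation uses the ROUGE toolkit:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, references, n=2):
    """Eq. 12: clipped n-gram overlap of the candidate with the references,
    divided by the total n-gram count of the references (a recall)."""
    cand = Counter(ngrams(candidate, n))
    match = total = 0
    for ref in references:
        ref_counts = Counter(ngrams(ref, n))
        total += sum(ref_counts.values())
        # Count_match: each reference n-gram is credited at most as many
        # times as it appears in the candidate summary.
        match += sum(min(c, cand[g]) for g, c in ref_counts.items())
    return match / total if total else 0.0

# 2 of the reference's 3 bigrams appear in the candidate: ROUGE-2 = 2/3.
r2 = rouge_n("the cat sat on the mat".split(),
             ["the cat sat down".split()], n=2)
```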

In our experiment, we use the ROUGE-2 (i.e., ROUGE-N with \(N = 2\)) and ROUGE-SU4 (i.e., ROUGE-SU allowing gaps of up to 4 terms) values to evaluate the systems' performance. In addition, we evaluate each block in the summary separately and report the mean of the two blocks' ROUGE values (denoted as M-ROUGE-2 and M-ROUGE-SU4) to evaluate how well the comparative summary relates to each topic.

For the English dataset, we use the ROUGE toolkit to calculate the values. For the Chinese dataset, we first tokenize both the reference summary and the automatically generated summary and then use a modified version of the ROUGE toolkit to calculate the values.

5.3 Baseline systems

Baseline 1: Non-Comparative Model (NCM): The non-comparative model treats the comparative summarization task as a traditional summarization problem. It merges all the documents into a single document and selects the most salient and representative sentences as the summary. In this study, we use the model proposed in [11] because of its simplicity and good performance.

Baseline 2: Co-Ranking Model (CRM): This model is adapted from [51]. It makes use of the relations within each topic and the relations across the topics to reinforce scores of the comparison-related sentences. More specifically, a sentence’s score consists of two parts, that is, the contribution of related in-topic sentences (representativeness), and the contribution of related cross-topic sentences (comparativeness).

Formally, let \(s_{1i} \in D_{1}\) and \(s_{2j} \in D_{2}\) denote sentences in \(D_{1}\) and \(D_{2}\), respectively. \(us_{i}\) denotes the score of \(s_{1i}\), and \(vs_{j}\) denotes the score of \(s_{2j}\). \(sim(s, s^{\prime})\) denotes the similarity of two sentences \(s\) and \(s^{\prime}\). Then the scores of sentences are computed as follows:

$$\begin{aligned} us_i =\theta \cdot \sum _{s_{1k} \in D_1 } {\frac{sim\left( {s_{1i} ,s_{1k} } \right)\cdot us_k }{\sum _{s_{1p} \in D_1 } {sim\left( {s_{1p} ,s_{1k} } \right)} }} +\left( {1-\theta } \right)\cdot \sum _{s_{2j} \in D_2 } {\frac{sim\left( {s_{1i} ,s_{2j} } \right)\cdot vs_j }{\sum _{s_{1p} \in D_1 } {sim\left( {s_{1p} ,s_{2j} } \right)} }} \end{aligned}$$
(13)
$$\begin{aligned} vs_j =\theta \cdot \sum _{s_{2k} \in D_2 } {\frac{sim\left( {s_{2j} ,s_{2k} } \right)\cdot vs_k }{\sum _{s_{2q} \in D_2 } {sim\left( {s_{2q} ,s_{2k} } \right)} }} +\left( {1-\theta } \right)\cdot \sum _{s_{1i} \in D_1 } {\frac{sim\left( {s_{1i} ,s_{2j} } \right)\cdot us_i }{\sum _{s_{2q} \in D_2 } {sim\left( {s_{1i} ,s_{2q} } \right)} }} \end{aligned}$$
(14)

where \(\theta \in [0, 1]\) is a parameter that balances the influence of representativeness and comparativeness. In this study, we set \(\theta = 0.5\). The values of \(us_{i}\) and \(vs_{j}\) can be computed iteratively. The algorithm first assigns random initial values to \(us_{i}\) and \(vs_{j}\) and then recursively computes the new estimates \(us_{i}^{(n+1)}\) and \(vs_{j}^{(n+1)}\) using Eqs. 13 and 14 until the values converge.
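The iteration of Eqs. 13 and 14 can be sketched as a power iteration over normalized similarity matrices. This is our own implementation sketch (we use uniform rather than random initialization, which does not change the fixed point on this symmetric toy example):

```python
def co_rank(sim11, sim22, sim12, theta=0.5, iters=200, tol=1e-9):
    """Iterate Eqs. 13-14: a sentence's score mixes the scores of similar
    in-topic sentences (weight theta) and similar cross-topic sentences
    (weight 1 - theta), with normalized similarities."""
    n1, n2 = len(sim11), len(sim22)
    us = [1.0 / n1] * n1
    vs = [1.0 / n2] * n2
    for _ in range(iters):
        new_us = [
            theta * sum(sim11[i][k] * us[k]
                        / (sum(sim11[p][k] for p in range(n1)) or 1.0)
                        for k in range(n1))
            + (1 - theta) * sum(sim12[i][j] * vs[j]
                                / (sum(sim12[p][j] for p in range(n1)) or 1.0)
                                for j in range(n2))
            for i in range(n1)
        ]
        new_vs = [
            theta * sum(sim22[j][k] * vs[k]
                        / (sum(sim22[q][k] for q in range(n2)) or 1.0)
                        for k in range(n2))
            + (1 - theta) * sum(sim12[i][j] * us[i]
                                / (sum(sim12[i][q] for q in range(n2)) or 1.0)
                                for i in range(n1))
            for j in range(n2)
        ]
        delta = max(abs(a - b) for a, b in zip(new_us + new_vs, us + vs))
        us, vs = new_us, new_vs
        if delta < tol:
            break
    return us, vs

# Symmetric toy instance: the scores stay uniform at the fixed point.
us, vs = co_rank([[1, 0.5], [0.5, 1]], [[1, 0.5], [0.5, 1]],
                 [[1, 0], [0, 1]])
```

Because each update redistributes score mass through normalized columns, the total score is conserved across iterations, which keeps the recursion stable.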

After that, we estimate the score of a cross-topic sentence pair \(sp_k =\langle s_{1k_1 } ,s_{2k_2 }\rangle \) as follows:

$$\begin{aligned} score(sp_k )=\eta \cdot sim(s_{1k_1 } ,s_{2k_2 } )+(1-\eta )(us_{k_1 } +vs_{k_2 } ) \end{aligned}$$
(15)

where \(\eta \in [0, 1]\) is a factor that balances the comparativeness and the saliency of sentences (\(\eta = 0.5\) in this study). The most salient sentence pairs are selected iteratively, and the scores of the remaining sentence pairs are updated using the MMR algorithm [7].
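The MMR update can be sketched generically as a greedy loop that trades relevance against redundancy with already-selected items. This is a standard MMR sketch, not the baseline's exact implementation; the trade-off value `beta` and the toy data are our assumptions:

```python
def mmr_select(items, sim, score, budget, beta=0.7):
    """Greedy MMR: at each step pick the candidate with the best trade-off
    between its relevance score and its maximum similarity to items
    already selected (the redundancy penalty)."""
    selected, candidates = [], list(range(len(items)))
    while candidates and len(selected) < budget:
        def mmr(i):
            redundancy = max((sim(items[i], items[j]) for j in selected),
                             default=0.0)
            return beta * score(items[i]) - (1 - beta) * redundancy
        pick = max(candidates, key=mmr)
        selected.append(pick)
        candidates.remove(pick)
    return selected

# Hypothetical items (id, text): s1 and s2 are duplicates, so after
# taking s1, MMR prefers the less relevant but novel s3 over s2.
items = [("s1", "a"), ("s2", "a"), ("s3", "b")]
rel = {"s1": 1.0, "s2": 0.9, "s3": 0.8}
picked = mmr_select(items,
                    sim=lambda x, y: 1.0 if x[1] == y[1] else 0.0,
                    score=lambda x: rel[x[0]],
                    budget=2)
```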

5.4 Experimental results

We apply all the systems to generate comparative summaries with a length limit of 400 words. The evaluation results are shown in Tables 4 and 5. The NCM and CRM models are described in Sect. 5.3, and the concept-based comparative model (CCM) is our proposed model. In this experiment, the \(\lambda\) and \(\alpha\) in CCM are set as follows: \(\lambda = 0.3\), \(\alpha = 0.5\) for the English dataset (Table 4), and \(\lambda = 0.5\), \(\alpha = 0.5\) for the Chinese dataset (Table 5).

Table 4 The evaluation results on the English dataset
Table 5 The evaluation results on the Chinese dataset

Compared with the baseline systems, our proposed model achieves the best scores on all metrics. It is not surprising that the NCM model does not perform well in this task, because it does not focus on the comparisons. The CRM model utilizes the similarity between two topics to enhance the scores of comparison-related sentences. However, the sentences in a comparison usually share only a few words, and thus the similarities among them are low. In such cases, the co-ranking algorithm can barely benefit from the cross-topic relations. The CCM model calculates the score of a summary explicitly from the scores of comparative concepts and representative concepts. It balances these two factors and is not affected by the low similarities among sentences. Besides, it uses a mixed integer programming model to find a globally optimal solution. Thus, it achieves good performance on both comparison extraction and summarization.

5.5 Parameter effect

In our model, there are two important parameters, \(\lambda\) and \(\alpha\). \(\lambda\) balances the importance of comparativeness against that of representativeness, while \(\alpha\) balances information saliency against redundancy. Intuitively, both \(\lambda\) and \(\alpha\) should be medium, neither too big nor too small. To verify this assumption, we run the system with different settings of \(\lambda\) and \(\alpha\) and evaluate them using the ROUGE values.

First, we set \(\alpha = 0.5\) and vary \(\lambda\) from 0 to 1 in steps of 0.1. The results on the English and Chinese corpora are shown in Figs. 3 and 4, respectively. On both corpora, the performance first increases as \(\lambda\) grows, reaches its peak at a medium value of \(\lambda\), and then decreases as \(\lambda\) grows further.

Fig. 3

The system performances on the English dataset with different \(\lambda \) settings

Fig. 4

The system performances on the Chinese dataset with different \(\lambda \) settings

It is interesting that considering representativeness only (\(\lambda = 0\)) is superior to considering comparativeness only (\(\lambda = 1\)) on the English dataset. A possible reason is that the English news articles contain a wider range of information, for example, interviews and quotations. This less important information leads to some inessential comparisons and thus harms the performance. On the other hand, the focal points of related news topics are much the same, and thus the representative summaries of each topic are likely to be comparative. The Chinese news articles mostly consist of objective descriptions of the news event, and thus the comparisons are more likely to focus on the salient aspects. Generally speaking, \(\lambda\) should be small (i.e., emphasize the representativeness) on news articles with divergent themes, and medium on news articles with compact themes. In practice, the optimal parameters can be learned using an evaluation dataset.

To investigate the effect of \(\alpha\), we set \(\lambda\) to the best value found in the previous experiment and vary \(\alpha\) from 0 to 1. The results are illustrated in Figs. 5 and 6. Similar to the effect of \(\lambda\), the system also performs best at a medium value of \(\alpha\). Notice that the performance fluctuates little on the English corpus when \(\alpha \in [0.2, 0.5]\), so \(\alpha = 0.5\) is an acceptable setting for both corpora.

Fig. 5

The system performances on the English dataset with different \(\alpha \) settings

Fig. 6

The system performances on the Chinese dataset with different \(\alpha \) settings

5.6 Case study

Tables 6 and 7 show the reference summary and the system-generated summary for World Cup 2006 versus World Cup 2010, respectively. The generated summary introduces the champions, the Golden Ball winners, and the best young players of the two World Cups, the effects on the economy, etc. However, it also makes an inappropriate comparison between the Golden Shoe winner and the Golden Glove winner. The primary cause is the word "Golden", which is not an appropriate unit of concept. To overcome such defects, more precise concept extraction and relation extraction should be applied.

Table 6 The reference comparative summary about two World Cup matches
Table 7 The system-generated comparative summary about two World Cup matches

Tables 8 and 9 show the reference summary and the system-generated summary for Wenchuan Earthquake vs. Yushu Earthquake. The generated comparative summary presents the times, locations, rescue efforts and effects, and responses of foreign countries for the two earthquakes in China. Overall, it contains a good number of comparisons. However, sentences (1.8) and (1.9) are not quite appropriate to compare with sentence (2.7), because the former two sentences describe the responses of foreign governments, while the latter describes the response of a foreign media outlet. The representativeness of some aspects is also a bit low. For example, sentence (1.2) mentions the property damage but does not report its exact amount. The features used for representativeness need to be further studied.

Table 8 The reference comparative summary about two earthquakes
Table 9 The system-generated comparative summary about two earthquakes

6 Conclusion & future work

The comparative analysis of related news topics is useful in many applications. In this study, we propose a novel approach to summing up the commonalities and differences between comparable news topics. We formalize the task as a problem of selecting sentences to maximize both comparativeness and representativeness, propose an estimation function based on concept-level evidence, and solve it using mixed integer linear programming. The experimental results show that our model outperforms the baseline systems in comparison extraction and summarization.

Due to the complexity of news topics and the semantic requirements of comparisons, the comparative summaries are still far from satisfactory. In future work, we plan to utilize more semantic information to enhance the quality of summarization. First, the comparative evidences need to be analyzed further. Intuitively, adverbs will not lead to comparisons unless they modify similar actions. We will introduce syntactic structures into comparative evidence extraction. We also plan to use more resources, such as Wikipedia, to calculate the semantic relatedness among concepts. Second, the representativeness score can also take the relatedness among concepts into account. When a concept is present in the summary, its similar concepts need not be selected. The weights of concepts can be tuned using machine learning techniques. Finally, we plan to formalize the sentence ordering step in a more compact way, considering both comparativeness and coherence.

In the future, we also plan to extend the task of comparative summarization. First, comparisons of more than two topics can be studied. Second, it would be better to separate the commonalities and differences of news topics and to extract the key phrases indicating the compared aspects. Third, a comparison of the snapshots along the timeline of an ongoing event can be studied. This particular kind of comparison mainly focuses on the differences and evolution among snapshots. Finally, it is an interesting problem to compare cross-lingual news topics, where contradictions are the most important information.

The evaluation methods for comparative summarization can also be further studied. In this study, we use the automatically calculated ROUGE values and the manually annotated Comparative Aspect Recall and Overall Responsiveness to evaluate the systems' performance. Additional independent human evaluation should be involved to strengthen the reliability of the evaluation. We also plan to investigate the correlation between the automatically calculated metrics and manual ratings to verify the reliability of those metrics.