1 Introduction

Most questions posted on Community-based Question Answering (CQA) websites, such as Yahoo Answers, Answers.com and StackExchange, do not target simple facts such as “what is Brad Pitt’s height?” or “how far is the moon from earth?”. Instead, askers expect some human touch in the answers to their questions. In particular, many questions look for recommendations, suggestions and opinions, e.g., “what are some good horror movies for Halloween?”, “should you wear a jockstrap under swimsuit?” or “how can I start to learn web development?”. According to our analysis, based on editorial judgments of 12,000 Yahoo Answers questions, 70% of all questions are advice- or opinion-seeking questions.

Examining answers for such advice-seeking questions, we found that quite often answerers do not provide supportive evidence for their recommendation, and that answers usually represent diverse perspectives of the different answerers for the question at hand. For example, answerers may recommend different horror movies. Still, the asker would like to choose only one or two movies to watch, and without additional supportive evidence her decision may be non-trivial.

In this paper we assume that askers would be happy to receive additional information that will help them in choosing the best fit for their need from the various suggestions or opinions provided in the CQA answers. More formally, we propose the novel task of retrieving sentences from the Web that provide support to a given recommendation or opinion that is part of an answer in a CQA site.

We refer to the part of the answer (e.g., a sentence) that contains a recommendation as a subjective claim about the need expressed in the question (e.g., a call for advice). For a sentence to be considered as supporting the claim, it should be relevant to the content of the claim and provide some supporting information; e.g., examples, statistics, or testimony [1]. More specifically, a supporting sentence is one whose acceptance is likely to raise the confidence in the claim.

While supporting sentences may be part of the same answer containing the claim, or found in other answers given for the same question, in this paper we are interested in retrieving sentences from other sources which may provide different perspectives on the claim compared to content on CQA sites. For example, for the question “what are some good horror movies?”, a typical CQA answer could be “The Shining is a great movie; I love watching it every year”. On the other hand, a supporting sentence from external sites may contain information such as “...in 2006, the Shining made it into Ebert’s series of “Great Movie” reviews...”. Specifically, we focus on retrieving supporting sentences from Wikipedia, although our methods can be largely applied to other Web sites.

We present a general scheme of Learning to Rank for Support, in which the retrieval algorithm is directly optimized for ranking sentences by presumed support. Our feature set includes both relevance-oriented features, such as textual similarity, and support-oriented features, such as sentiment matching and similarity with language-model-based support priors.

We experimented with a new dataset containing 40 subjective claims from the Movies category of Yahoo Answers. For each claim, sentences retrieved from Wikipedia using relevance estimates were manually evaluated for relevance and support. The resulting annotated benchmark was then used to train and test our model. The results demonstrate the merits of integrating relevance-based and support-based features for the support ranking task. Furthermore, our model substantially outperforms a state-of-the-art Textual Entailment system used for support ranking. This result emphasizes the difference between prior work on supporting objective claims and our task of supporting subjective recommendations.

2 Ranking Sentences by Support

Our goal is to devise a sentence retrieval method that ranks sentences by the level of support they provide to a given subjective claim. For example, the sentence “movie X received the Oscar academy award for the best film” would be considered as providing strong support to the claim “X is a good movie”.

We confine our treatment of the sentence retrieval task to claims about a single entity \(c_{e}\) (e.g., the movie X in the example above), since advice-seeking CQA questions are often about entities such as restaurants, movies, singers and products. For a sentence \(s\) to provide support for a given claim \(c\), \(s\) must be relevant to \(c\), and especially to the entity \(c_{e}\) that \(c\) is about. Hence, our approach for support ranking is based on an initial relevance ranking of sentences (Sect. 2.1). Then, a set of features is used in a learning-to-rank method for re-ranking the top-retrieved sentences by their (presumed) support for \(c\) (Sect. 2.2).

2.1 Initial Relevance Ranking

Our first step is to rank sentences by their presumed relevance to claim \(c\). Since these sentences are part of documents in a corpus \(D\), we follow common practice in work on sentence retrieval [2] and first apply document retrieval with respect to \(c\). Then, the sentences in the top ranked documents are ranked for relevance.

We assume that each document \(d\in D\) is composed of a title, \(d_{t}\), and a body, \(d_{b}\). This is the case for Wikipedia, which is used in our experiment, as well as for most Web pages. The initial document retrieval, henceforth InitDoc, is based on the document score \(S_{SDM}(c;d_{b})\). This score is assigned to the body of document \(d\) with respect to the claim \(c\) by the state-of-the-art sequential dependence model (SDM) from the Markov Random Field framework [3]. For texts x and y,

$$\begin{aligned} S_{SDM}(x;y) \mathop {=}\limits ^{def}\lambda _{T} S_{T}(x;y) + \lambda _{O} S_{O}(x;y) + \lambda _{U} S_{U}(x;y); \end{aligned}$$
(1)

\(S_{T}(x;y)\), \(S_{O}(x;y)\) and \(S_{U}(x;y)\) are the (smoothed) log likelihood values of the appearances of unigrams, ordered bigrams and unordered bigrams, respectively, of tokens from x in y; \(\lambda _{T}\), \(\lambda _{U}\), and \(\lambda _{O}\) are free parameters whose values sum to 1. We further bias the initial document ranking in favor of documents whose titles contain \(c_{e}\) — the entity the claim is about. Specifically, \(d\) is ranked by:

$$\begin{aligned} S_{InitDoc}(c;d) \mathop {=}\limits ^{def}\alpha S(c_{e};d_{t}) + (1-\alpha ) S_{SDM}(c;d_{b}); \end{aligned}$$
(2)

\(S(c_{e};d_{t})\) is the log of the Dirichlet smoothed maximum likelihood estimate, with respect to \(d\)’s title, of the n-gram which constitutes the entity \(c_{e}\) [4]; smoothing is based on n-gram counts in the corpusFootnote 1; \(\alpha \) is a free parameter.
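To make the document scoring concrete, the following minimal Python sketch implements Eq. 2 under simplifying assumptions: the SDM body score \(S_{SDM}(c;d_{b})\) is assumed to be computed elsewhere (e.g., by a retrieval toolkit such as Indri), and the entity's corpus n-gram probability is assumed to be precomputed. All function and parameter names are illustrative rather than taken from the authors' implementation.

```python
import math

def entity_title_score(entity, title, corpus_ngram_prob, mu=1000):
    """Log of the Dirichlet-smoothed ML estimate of the entity n-gram in the
    document title, i.e., S(c_e; d_t).  `corpus_ngram_prob` is assumed to be the
    entity n-gram's relative frequency in the corpus (precomputed)."""
    title_tokens = title.lower().split()
    entity_tokens = entity.lower().split()
    n = len(entity_tokens)
    # count occurrences of the entity n-gram among the title's n-gram positions
    count = sum(1 for i in range(len(title_tokens) - n + 1)
                if title_tokens[i:i + n] == entity_tokens)
    positions = max(len(title_tokens) - n + 1, 1)
    return math.log((count + mu * max(corpus_ngram_prob, 1e-12)) / (positions + mu))

def init_doc_score(entity, title, sdm_body_score, corpus_ngram_prob, alpha=0.66):
    """Eq. 2: alpha * S(c_e; d_t) + (1 - alpha) * S_SDM(c; d_b)."""
    return (alpha * entity_title_score(entity, title, corpus_ngram_prob)
            + (1 - alpha) * sdm_body_score)

# Documents are ranked by init_doc_score and the top N are kept (N=1000 in Sect. 3.2).
```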

To estimate the relevance of sentence \(s\) to the claim \(c\), we can measure their similarity using, again, the SDM model. We follow common practice in work on passage retrieval [2], and interpolate, using a parameter \(\beta \), the claim-sentence similarity score with the retrieval score of document \(d\) which \(s\) is part of:

$$\begin{aligned} S_{InitSent}(c;s) \mathop {=}\limits ^{def}\beta S_{SDM}(c;s) + (1-\beta ) S_{InitDoc}(c;d). \end{aligned}$$
(3)

Equation 3 is used to rank the sentences in the top-\(N\) retrieved documents; \(N\) is a free parameter. The \(k\) most highly ranked sentences constitute \(\mathcal{S}_{\mathrm{init}}^{[k]}\), the initial set of sentences to be ranked for support. Herein, InitSent denotes the sentence score assigned in Eq. 3, which is used to induce the initial sentence ranking.
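The sketch below, continuing the previous one, illustrates Eq. 3 and the construction of \(\mathcal{S}_{\mathrm{init}}^{[k]}\); it assumes the per-document InitDoc scores and a claim-sentence SDM scorer are available, and its names are illustrative.

```python
def init_sent_score(claim_sent_sdm, doc_score, beta=0.5):
    """Eq. 3: beta * S_SDM(c; s) + (1 - beta) * S_InitDoc(c; d)."""
    return beta * claim_sent_sdm + (1 - beta) * doc_score

def initial_sentence_set(top_docs, claim_sent_sdm, k=100, beta=0.5):
    """top_docs: iterable of (sentences, doc_score) pairs for the top-N documents;
    claim_sent_sdm: callable returning S_SDM(claim; sentence) for a sentence."""
    scored = [(init_sent_score(claim_sent_sdm(sent), doc_score, beta), sent)
              for sentences, doc_score in top_docs
              for sent in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:k]]  # the initial set S_init^[k]
```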

2.2 Learning to Rank for Support

Next, we rank the sentences in \(\mathcal{S}_{\mathrm{init}}^{[k]}\) by the support they provide to the claim. To this end, we apply a learning-to-rank (LTR) approach [5] to construct a ranking function designed to optimize support. Specifically, we use a training set of claims, their respective sentences, and labels of the support level the sentences provide for the claims. Each pair of a claim and a sentence, (\(c\), \(s\)), is represented as a feature vector. Below, we detail our feature set. In Sect. 3 we report the performance of three LTR methods applied with these features.

Language-Model Similarities. We use the initial retrieval scores, InitDoc (Eq. 2) and InitSent (Eq. 3), as relevance-estimate features. Additionally, we use several language-model-based similarity estimates. Let \(p_{JM}^{[\psi ]}(w|x)\) be the probability assigned to term \(w\) by a Jelinek-Mercer smoothed unigram language model induced from text x using the smoothing parameter \(\psi \) [4];Footnote 2 setting \(\psi =0\) amounts to the maximum likelihood estimate of \(w\) with respect to x. The similarity between texts x and y is estimated using the cross entropy, \(CE\), between their induced language models: \(sim_{LM}(x,y) \mathop {=}\limits ^{def}-CE\left( p_{JM}^{[0]}(\cdot |x) \; \Big \vert \Big \vert \,\, p_{JM}^{[\psi ]}(\cdot |y)\right) \); higher values of \(CE\) correspond to reduced similarity.
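A minimal sketch of this cross-entropy similarity follows, assuming collection (corpus) term probabilities are precomputed and that \(\psi\) is the interpolation weight on the collection model; the names below are illustrative.

```python
import math
from collections import Counter

def mle(tokens):
    """Maximum-likelihood unigram model of a (non-empty) token sequence."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def p_jm(w, text_probs, collection_probs, psi=0.1):
    """Jelinek-Mercer-smoothed probability of w given a text; psi=0 reduces
    to the maximum likelihood estimate."""
    return (1 - psi) * text_probs.get(w, 0.0) + psi * collection_probs.get(w, 1e-9)

def sim_lm(x_tokens, y_tokens, collection_probs, psi=0.1):
    """-CE(p_ML(.|x) || p_JM^[psi](.|y)); larger values mean higher similarity."""
    p_x, p_y = mle(x_tokens), mle(y_tokens)
    return sum(px * math.log(p_jm(w, p_y, collection_probs, psi))
               for w, px in p_x.items())
```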

We use the following similarity features: (i) ClaimTitle: between the claim and the document title (\(sim_{LM}(c,d_{t})\)); (ii) EntTitle: between the entity and the document title (\(sim_{LM}(c_{e},d_{t})\)); (iii) ClaimBody: between the claim and the document body (\(sim_{LM}(c,d_{b})\)); (iv) EntBody: between the entity and the document body (\(sim_{LM}(c_{e},d_{b})\)); (v) ClaimSent: between the claim and the sentence (\(sim_{LM}(c,s)\)); and, (vi) EntSent: between the entity and the sentence (\(sim_{LM}(c_{e},s)\)). The entity is treated here as a bag of terms. These relevance-based similarity estimates, some of which are components of Eqs. 2 and 3, are weighted by the learning-to-rank method with respect to support ranking rather than relevance ranking, which helps to avoid metric divergence issues [3].

Semantic Similarities. Both the claim \(c\) and the candidate support sentence \(s\) can be short. Thus, to address potential vocabulary mismatch issues in textual similarity estimation, we also use semantic similarity measures that utilize word embeddings [6]. Specifically, we use 300-dimensional word vectors trained on a Google News dataset with Word2VecFootnote 3. Let \(\mathbf {w}\) denote the embedding vector representing term \(w\). We measure the extent to which the terms in \(s\) “cover” the terms in \(c\) by MaxSemSim: \(\sum _{w\in c} \max _{w' \in s} \cos (\mathbf {w},\mathbf {w'})\). Additionally, we measure the similarity between the centroids of the claim and the sentence (cf. [7]), CentSemSim: \(\cos (\frac{1}{|c|} \sum _{w\in c} \mathbf {w},\frac{1}{|s|} \sum _{w' \in s} \mathbf {w'})\); \(|c|\) and \(|s|\) are the numbers of terms in the claim and the sentence, respectively.
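The two embedding features can be sketched as follows, assuming the pre-trained vectors have been loaded into a plain mapping from terms to 300-dimensional numpy arrays; out-of-vocabulary terms are simply skipped in this sketch.

```python
import numpy as np

def _vectors(tokens, wv):
    """Embedding vectors of in-vocabulary tokens; wv maps term -> np.ndarray."""
    return [wv[w] for w in tokens if w in wv]

def _cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def max_sem_sim(claim_tokens, sent_tokens, wv):
    """MaxSemSim: sum over claim terms of the max cosine similarity to any sentence term."""
    sent_vecs = _vectors(sent_tokens, wv)
    if not sent_vecs:
        return 0.0
    return sum(max(_cos(wv[w], sv) for sv in sent_vecs)
               for w in claim_tokens if w in wv)

def cent_sem_sim(claim_tokens, sent_tokens, wv):
    """CentSemSim: cosine similarity between claim and sentence embedding centroids."""
    cv, sv = _vectors(claim_tokens, wv), _vectors(sent_tokens, wv)
    if not cv or not sv:
        return 0.0
    return _cos(np.mean(cv, axis=0), np.mean(sv, axis=0))
```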

Sentiment Features. As the claim \(c\) is assumed to be subjective, our premise is that a relevant sentence \(s\) is likely to also support \(c\) if the same sentiment is expressed in \(c\) and \(s\). We use the Stanford sentiment analyzerFootnote 4 [8], pre-trained on the Rotten Tomatoes movie reviews dataset [9]. This tool produces, for a given text, a probability distribution over a 1–5 sentiment scale; 1 stands for “very negative” and 5 stands for “very positive”. As a sentiment similarity feature, SentimentSim, we use the Jensen-Shannon (JS) divergence between the sentiment distributions of the claim and the sentence. Higher JS values correspond to lower similarity. Additionally, we compute SentimentEnt: the entropy of the sentiment distribution induced for \(s\). This feature attests to the focus (or lack thereof) of the sentiment distribution induced from the sentence.
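A sketch of the two sentiment features, assuming the 5-way sentiment distributions for the claim and the sentence have already been obtained from the sentiment analyzer (here they are plain probability vectors):

```python
import numpy as np
from scipy.stats import entropy  # entropy(p) = Shannon entropy; entropy(p, q) = KL(p || q)

def js_divergence(p, q):
    """Jensen-Shannon divergence (SentimentSim); higher values = lower sentiment similarity."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)

def sentiment_features(claim_dist, sent_dist):
    return {
        "SentimentSim": float(js_divergence(claim_dist, sent_dist)),
        "SentimentEnt": float(entropy(np.asarray(sent_dist, dtype=float))),
    }

# Example: a strongly positive sentence paired with a mildly positive claim.
# sentiment_features([0.05, 0.05, 0.1, 0.5, 0.3], [0.0, 0.0, 0.05, 0.15, 0.8])
```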

Quality-Oriented Language Models. In general, we expect differences between the language used to describe entities of “high quality” and that used for entities of “low quality”. However, constructing language models for such classes of entities requires labeled examples, and such labels are typically missing from most Web sources. Yet, for many domains there are sites that provide ratings for entities, e.g., user feedback for local businesses on yelp.com. We propose to transfer such ratings to other sites as noisy quality labels. Specifically, our test claims are about movies, and the sentences ranked for support are extracted from Wikipedia, which does not provide explicit ratings. Therefore, we automatically labeled each Wikipedia page about a movie with the 1–5 star grade review posted for this movie in IMDBFootnote 5 (if one exists). Using this knowledge transfer, five unigram language models were induced, one per rating grade \(l\). Specifically, all Wikipedia pages of movies with an IMDB review of grade \(l\) were concatenated to yield the text \(Text_{l}\).Footnote 6 Then, for sentence \(s\), the claim-independent features, denoted Prior-\(l\), that correspond to quality levels \(l\in \{1,\ldots ,5\}\) are: \(\frac{sim_{LM}(s,Text_{l})}{\sum _{l'=1}^{5} sim_{LM}(s,Text_{l'})}\).
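A sketch of the Prior-\(l\) computation, assuming the per-grade texts \(Text_{l}\) have been assembled and that a language-model similarity function (such as the sim_lm sketch above) is supplied; names are illustrative.

```python
def quality_priors(sent_tokens, grade_texts, sim_fn):
    """Prior-l features for l in {1,...,5}.
    grade_texts: dict mapping grade l to the tokenized concatenation Text_l of
    Wikipedia pages of movies with an l-star IMDB review;
    sim_fn: a language-model similarity such as the sim_lm sketch above."""
    sims = {l: sim_fn(sent_tokens, text) for l, text in grade_texts.items()}
    total = sum(sims.values())
    # normalize across the five grades, as in the Prior-l definition
    return {f"Prior-{l}": sims[l] / total for l in sorted(sims)}
```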

Sentence Style. The StopWords feature is the fraction of terms in the sentence that are stop words. High occurrence of stop words potentially attests to rich use of language [10], and consequently, to sentence quality. Stop words are determined using the Stanford parserFootnote 7. We also use the sentence length, SentLength, as a prior signal for sentence quality.
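These two style features amount to a few lines of code; the stop-word set below is a small illustrative list (the paper determines stop words with the Stanford parser).

```python
# Illustrative stop-word set; in the paper, stop words come from the Stanford parser.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it", "that", "for", "on", "with"}

def style_features(sent_tokens):
    """StopWords: fraction of stop words in the sentence; SentLength: number of terms."""
    n = len(sent_tokens)
    stop_fraction = (sum(1 for w in sent_tokens if w.lower() in STOPWORDS) / n) if n else 0.0
    return {"StopWords": stop_fraction, "SentLength": n}
```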

Table 1. Examples of claims and supporting and non-supporting sentences.

3 Empirical Evaluation

3.1 Dataset

There is no publicly available dataset for evaluating the ranking of sentences by support for subjective claims that originate from advice-seeking questions and their answers. Hence, we created a novel datasetFootnote 8 as follows. Fifty subjective claims about movies, where the movies serve as the entities \(c_{e}\), were collected from Yahoo AnswersFootnote 9 by scanning its movies category. We looked for advice-seeking questions, which are common in the movies category, and selected answers that contain at least one movie title. Each pair of a question and a movie title appearing in an answer to the question was transformed into a claim by manually reformulating the question into an affirmative form and inserting the entity (movie title) as the subject. For example, the question “any good science fiction movies?” and the movie title “Tron” were transformed into the claim “Tron is a good science fiction movie”.

The corpus used for sentence retrieval is a dump of the movies category of Wikipedia from March 2015, which contains 111,164 Wikipedia pages. For each claim, 100 sentences were retrieved using the initial sentence retrieval approach, InitSent (Sect. 2.1). Each of these 100 sentences was categorized by five annotators from CrowdFlowerFootnote 10 into: (1) not relevant to the claim, (2) strong non-support, (3) medium non-support, (4) neutral, (5) medium support, (6) strong support. The final label was determined by a majority vote.

We used the following induced scales: (a) binary relevance: not relevant (category 1) vs. relevant (categories 2–6); (b) binary support: non-support (categories 1–4) vs. support (categories 5–6); (c) graded support: non-support (categories 1–4), weak support (category 5) and strong support (category 6). The Fleiss’ Kappa inter-annotator agreement rates are 0.68 (substantial) for binary relevance, 0.592 (moderate) for binary support and 0.457 (moderate) for graded support. Table 1 provides examples of claims and relevant sentences (binary relevance scale) that either support or do not support the claim (binary support scale).
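A direct encoding of this category-to-scale mapping (a sketch; the numeric 0/1/2 graded-support values are an assumption used only for illustration):

```python
def induced_labels(category):
    """Map a majority-vote category in {1,...,6} (Sect. 3.1) to the three induced scales."""
    return {
        "binary_relevance": int(category >= 2),                  # relevant vs. not relevant
        "binary_support": int(category >= 5),                    # support vs. non-support
        "graded_support": 0 if category <= 4 else category - 4,  # 0=non, 1=weak, 2=strong support
    }
```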

Ten out of the fifty claims had no support sentences and were not used for evaluation. For the forty remaining claims, on average, half of the support sentences provide weak support and the other half strong support. On average, \(23.5\,\%\) of the relevant sentences are supportive (binary scale). Per claim, the median, average and standard deviation of the number of relevant sentences are 29, 40.5 and 29.3, respectively; for support sentences (binary scale) they are 5.5, 7.4 and 6.7.

3.2 Methods

As learning-to-rank (LTR) methods we used linear SVMrank (LinearSVM) [11], SVMrank with a second-degree polynomial kernel (PolySVM) [11], and LambdaMART [12], a state-of-the-art learning-to-rank method [5]. We used the LTR methodsFootnote 11 with all the features described in Sect. 2.2 for ranking sentences by support and by relevance; that is, performance was optimized either for support or for relevance. Leave-one-out cross validation, performed over queries (claims), was used for training and testing; a sketch of this protocol is given below.
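The following sketch illustrates the leave-one-out protocol over claims with a simple pairwise ranker; it uses scikit-learn's LinearSVC on pairwise feature differences as a stand-in for SVMrank (an assumption for illustration only; the experiments use SVMrank and LambdaMART), and all names are illustrative.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_transform(X, y):
    """Turn one claim's (feature vectors, support labels) into pairwise preferences."""
    Xp, yp = [], []
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue
        Xp.append(X[i] - X[j])
        yp.append(1 if y[i] > y[j] else -1)
    return Xp, yp

def leave_one_claim_out(per_claim_data):
    """per_claim_data: list with one (X, y) pair per claim, where X holds the feature
    vectors of its candidate sentences and y their support labels."""
    rankings = []
    for held_out in range(len(per_claim_data)):
        Xp, yp = [], []
        for idx, (X, y) in enumerate(per_claim_data):
            if idx == held_out:
                continue
            a, b = pairwise_transform(np.asarray(X, dtype=float), np.asarray(y))
            Xp.extend(a)
            yp.extend(b)
        model = LinearSVC().fit(np.asarray(Xp), np.asarray(yp))
        X_test, _ = per_claim_data[held_out]
        scores = np.asarray(X_test, dtype=float) @ model.coef_.ravel()
        rankings.append(np.argsort(-scores))  # sentence indices of the held-out claim, best first
    return rankings
```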

The Indri toolkitFootnote 12 was used for experiments. Krovetz stemming was applied to claims and sentences only for inducing the initial document and sentence ranking and for computing the language-model-based similarity features described in Sect. 2.2. For these features, stopwords on the INQUERY list were removed only from claims. The number of documents (Wikipedia pages) initially retrieved using InitDoc (Eq. 2) for each claim was \(N=1000\); \(\alpha \) was set to 0.66 to boost the ranking of the Wikipedia page about the target movie in the claim. Then, \(k=100\) sentences from these 1000 documents were retrieved using InitSent (Eq. 3) with \(\beta =0.5\). These 100 sentences constitute the set \(\mathcal{S}_{\mathrm{init}}^{[100]}\) which is re-ranked by the LTR methods. The SDM free parameters, \(\lambda _{T}\), \(\lambda _{O}\) and \(\lambda _{U}\) were automatically set, in both InitDoc and InitSent, using the approach proposed in [13]. For language models, the Dirichlet smoothing parameter, \(\mu \), and the Jelinek-Mercer smoothing parameter, \(\psi \), were set to the standard values of 1000 and 0.1, respectively [4]. We note that the free parameters of the initial document and sentence ranking could not be set using training data, as such data is only available for the initially retrieved sentence set, \(\mathcal{S}_{\mathrm{init}}^{[100]}\), as described above.

We view support ranking as a high-precision-oriented task in which users are interested in seeing a few sentences that strongly support the claims at hand. Hence, as evaluation measures we use NDCG@1, NDCG@3, NDCG@10 and the precision of the top-5 sentences (p@5). The NDCG performance numbers for support ranking are based on the graded support scale, and those for p@5 are based on the binary support scale. All performance numbers for relevance ranking are based on the binary relevance scale. LambdaMART was trained for NDCG@10, as this yielded, in general, better support-ranking performance across the evaluation measures than training for NDCG@1 or NDCG@3. Statistically significant differences in performance are determined using the two-tailed paired t-test with \(p=0.05\).
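A sketch of the evaluation measures as used here: NDCG@k over the graded support labels with a linear gain (the gain formulation is an assumption) and p@5 over the binary support labels.

```python
import numpy as np

def dcg_at_k(gains, k):
    g = np.asarray(gains, dtype=float)[:k]
    return float(np.sum(g / np.log2(np.arange(2, g.size + 2))))

def ndcg_at_k(ranked_graded_labels, k):
    """ranked_graded_labels: graded support labels (e.g., 0/1/2) in ranked order."""
    idcg = dcg_at_k(sorted(ranked_graded_labels, reverse=True), k)
    return dcg_at_k(ranked_graded_labels, k) / idcg if idcg > 0 else 0.0

def precision_at_k(ranked_binary_labels, k=5):
    """p@k over binary support labels (1 = supportive) in ranked order."""
    return float(np.mean(ranked_binary_labels[:k]))

# Example: ndcg_at_k([2, 0, 1, 2, 0], 3), precision_at_k([1, 0, 1, 1, 0], 5)
```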

Table 2. Main result table. Comparing the relevance-ranking and support-ranking performance of the three LTR methods with that of the initial sentence ranking (InitSent). Boldface: the best result in a column; ‘\(i\)’, ‘\(l\)’ and ‘\(p\)’ mark statistically significant differences with InitSent, LinearSVM and PolySVM, respectively.

3.3 Results

Table 2 presents our main results. We see that all three LTR methods outperform the initial sentence ranking, InitSent, in terms of relevance ranking. Although few of these improvements are statistically significant, they attest to the potential merits of using the additional relevance-based features described in Sect. 2.2. More importantly, all LTR methods substantially, and statistically significantly, outperform the initial (relevance-based) sentence ranking in terms of support. This result emphasizes the difference between relevance and support and shows that our proposed features for support ranking are quite effective, especially when used in a non-linear ranker such as LambdaMART.

In Sects. 1 and 4 we discuss the difference between subjective and factoid claims. To further explore this difference, we compare our best-performing method, LambdaMART, with P1EDAFootnote 13, a state-of-the-art textual entailment algorithm [14], when both are used for the support-ranking task we address here. P1EDA was designed for factual claimsFootnote 14. Specifically, given a claim and a candidate sentence, P1EDA produces a classification decision of whether the sentence entails the claim, accompanied by a confidence level. The confidence level was used for support (and relevance) ranking. We also tested the inclusion of P1EDA’s output (confidence level) as an additional feature in LambdaMART, yielding the LMart+P1EDA method. Table 3 depicts the performance numbers.

We can see in Table 3 that P1EDA is (substantially) outperformed by both InitSent and LambdaMART, for both relevance and support ranking. Since the claims in our setting are simple, this finding implies that approaches for identifying texts that support (or “prove”) a factoid claim may not be effective for the task of supporting subjective claims. The integration of P1EDA as a feature in LambdaMART improves performance (although not to a statistically significant degree) for some of the evaluation measures, including NDCG@10 for which the ranker was trained, and hurts performance for others — statistically significantly so in only a single caseFootnote 15.

Table 3. Comparison and integration with a state-of-the-art textual entailment algorithm (P1EDA). LMart stands for “LambdaMart”. Boldface: the best result in a column. Statistically significant differences with InitSent, P1EDA and LMart are marked with ‘\(i\)’, ‘\(p\)’, and ‘\(m\)’, respectively.

Integrating P1EDA with only our semantic-similarity features using LambdaMART (an approach conceptually similar to a classification method employed in some work on argument mining [16]) resulted in support-ranking performance that is substantially worse than that of using all our proposed features in LambdaMART. The actual numbers are omitted due to space considerations and because they convey no additional insight.

Feature Analysis. To analyze the contribution of individual features to overall performance, Table 4 compares LambdaMART, used with all features, to using individual features alone for ranking. As LambdaMART was trained for NDCG@10 for support ranking, we explore the 10 features that yielded the highest NDCG@10 support-ranking performance.

Table 4. Using features alone (specifically, the 10 that yield the highest NDCG@10 support ranking) to rank the initial sentence list vs. integrating all features in LambdaMART. Boldface: the best result in a column; ‘\(m\)’: statistically significant difference with LambdaMART.

Table 4 clearly shows that while a few individual features yield support-ranking performance exceeding that of the initial sentence ranking (InitSent), LambdaMART, which integrates all features, yields substantially, and statistically significantly, better support-ranking performance. This finding attests to the importance of integrating various features for support ranking. LambdaMART is also superior to almost all ten features for relevance rankingFootnote 16.

We see that quite a few of the top-10 features are (lexical) similarities between the claim and/or the entity it is about, and the sentence and/or the document containing it. This shows that (direct) estimates of claim-sentence relevance can be quite important for support ranking, as expected. Yet, integrating these estimates with support-oriented estimates is important for attaining highly effective support-ranking performance, as is evident from LambdaMART’s performance.

SentimentSim, the sentiment similarity between the claim and the sentence, is among the most effective features when used alone for support ranking. Additional ablation testsFootnote 17 reveal that removing SentimentSim from the set of all features results in the most severe performance degradation for all three learning-to-rank methods. Indeed, sentiment is an important aspect of subjective claims, and therefore, of inferring support for these claims.

We also found that ranking sentences by decreasing entropy of sentiment (SentimentEnt) is superior to ranking by increasing entropy for NDCG@1 and NDCG@3, while for NDCG@10 the reverse holds. The former finding is conceptually reminiscent of findings about using the entropy of a document’s term distribution as a document prior for Web search [10]: the higher the entropy, the “broader” the textual unit is (in our case, in terms of expressed sentiment), which presumably implies a higher prior.

Finally, Table 4 also shows that Prior-5 is the most effective claim-independent featureFootnote 18. It is the similarity between a language model of the sentence and the one induced from Wikipedia pages of movies that received high-grade (5-star) reviews in IMDB. This shows that although Wikipedia authors aim to be objective in their writing, the style and information for highly rated movies still differ considerably from those for lower-rated ones, and these differences can potentially be modeled via the automatic knowledge transfer and labeling method proposed in Sect. 2.2.

4 Related Work

A few lines of research are related to our work. The Textual Entailment task is to infer the truthfulness of a textual statement (hypothesis) from a given text [17]. A more specific incarnation of Textual Inference is automatic Question Answering (QA). Work on these tasks has focused on factoid claims for which a clear correct/incorrect labeling should be inferred from supportive evidence. Thus, typical textual inference approaches are designed to find the claim (e.g., a candidate answer in QA) embedded in the supporting text, although it may be rephrased. In contrast, in this paper, claims originate from CQA users who provide subjective recommendations rather than state facts. Our model, designed for ranking sentences by support for a subjective claim, significantly outperforms a state-of-the-art textual entailment method on this task, as shown in Sect. 3.3.

Blanco and Zaragoza [18] introduce methods for retrieving sentences that explain the relationship between a Web query and a related named entity, as part of the entity ranking task. In contrast, we rank sentences by support for a subjective claim. Kim et al. [19] present methods for retrieving sentences that explain reasons for sentiment expressed about an aspect of a topic. In contrast to these sentence ranking methods [18, 19], ours utilizes a learning-to-rank method that integrates various relevance and support features not used in [18, 19].

The task most related to ours is argument mining (e.g., [16, 20–24]). Specifically, arguments supporting or contradicting a claim about a given debatable (often controversial) topic are sought. Some of the types of features we use for support ranking have also been used for argument mining; namely, semantic [16, 24] and sentiment [24] similarities between the claim and a candidate argument. Yet, the actual estimates and techniques used here to induce these features are different from those in work on argument mining [16, 24]. Furthermore, the knowledge-transfer-based features we utilize, and whose effectiveness was demonstrated in Sect. 3.3, are novel to this study.

Interestingly, while textual entailment features were found to be effective for argument mining [16, 20], this is not the case for support ranking (see Sect. 3.3). This finding could be attributed to the fundamentally different nature of the claims used in our work and those used in argument mining. That is, our claims originate from answers to advice-seeking questions of a subjective nature, rather than being about a given debatable/controversial topic. Also, additional information about the debatable topic, which was utilized in work on argument mining [24], is not available in our setting.

Often, work on argument mining [24], similarly to work on question answering (e.g., [25]), focuses on finding supporting or contradicting evidence in the same document in which the claim appears. In contrast, we retrieve supporting sentences from the Web for claims originating from CQA sites. In fact, there has been very little work on using sentence retrieval for argument mining [22]. In that work, in contrast to ours, a Boolean retrieval method was used, different features were utilized, and relevance-based estimates were not integrated with support-based estimates via a learning-to-rank approach.

5 Conclusions and Future Work

We addressed a novel task: ranking sentences from the Web by the support they provide to a subjective claim. The claim originates from an answer provided in a community question answering (CQA) site to an advice-seeking question.

Our support-ranking model utilizes various features in a learning-to-rank method; some are relevance oriented while others are support oriented. Empirical evaluation performed using a new dataset of claims created from Yahoo Answers attested to the merits of our proposed approach.

For future work we intend to extend the set of features, explore additional data domains, and study the utilization of supportive sentences in answers posted for subjective questions in CQA sites.