Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Definitional question answering (QA), as an important form of complex QA, has attracted more and more attention recently. Definitional QA looks for extended answers that are composed of pieces of relevant information spread over many documents in a corpus, such as a biography of a person (e.g. “Who is George Bush?”), and the definition of a generic term (e.g. “What is naproxen?”) [9].

The development of definitional QA has been boosted by the Text Retrieval Conference (TREC). For a definitional question such as “Who is X” or “What is X”, we call “X” target. Most definitional QA systems have the following pipeline structure:

  1. 1.

    Use target as query to retrieve the related sentences;

  2. 2.

    Rank the returned candidate sentences;

  3. 3.

    Remove redundant sentences and return top k sentences as answers.

In definitional QA, most works focus on the second step, such as pattern based methods [3, 5, 10] and centroid vector based methods [1, 7]. Xu et al. [10] ranked candidate sentences by RankSVM [6]. Han et al. [4] ranked the candidate sentences from the two points of view: topic and definition.

However, most of these rank methods have two weaknesses: firstly, it is difficult to sample bad answers because the number of bad answers is usually far larger than good answers; secondly, it is hard to judge whether an answer is good or not in an objective way, so the training data is often noisy.

In this paper, we propose an online learning algorithm, which dynamically construct the supervisor on each iteration and assure the quality of the top k returned answers, instead of optimizing rank of the whole candidate list. Our learning algorithm is based on Passive-Aggressive algorithm [2], which passively accepts a solution whose loss is zero, while it aggressively forces the new prototype vector to stay as close as possible to the one previously learned.

The rest of the paper is organized as following. We present our algorithm to rank the candidate sentences in Sect. 2 and describe the features in Sect. 3. Then we give our experiments in Sect. 4. Section 5 concludes the paper.

2 Rank Answers with Variant Passive-Aggressive Algorithm

In this section, we propose an online learning algorithm to rank the answers. Given a target x and the set of its associated candidate answers C, we find a subset of \(\hat{Y} \subset C\) with size k as the returned answers by

$$\begin{aligned} \hat{Y} = \arg \max _{Y \subset C} {\mathbf {w}^T \varPhi (x, Y)}, \end{aligned}$$
(1)

where \(\varPhi (x, Y)\) is a feature vector.

We define the distance between two sets A and B by inverse Jaccard Similarity,

$$\begin{aligned} \varDelta (A,B)= |A \cup B|/|A \cap B|. \end{aligned}$$
(2)

Assuming that the subset \(Y^* \subset C\) concludes all good answers. We wish to learn \(\mathbf {w}\) so that \(\varDelta (\hat{Y},Y^*)\) is as small as possible.

We use Passive Aggressive (PA) algorithm [2] to find the new weight vector \(w_{t+1}\) to be the solution to the following constrained optimization problem in round t.

$$\begin{aligned}&\mathbf {w}_{t+1} = \arg \min _{\mathbf {w}} {\frac{1}{2}||\mathbf {w}-\mathbf {w}_t||^2 + \mathcal {C}\xi }\end{aligned}$$
(3)
$$\begin{aligned}&\mathbf s.t. \ell (\mathbf {w};x_t)<=\xi \,\mathbf{and }\,\xi >=0. \end{aligned}$$
(4)

where \(\ell (\mathbf {w};x_t)\) is the hinge-loss function, \(\xi \) is a slack variable, and \(\mathcal {C}\) is a positive parameter which controls the influence of the slack term on the objective function.

Different from standard PA algorithm, we define the loss as,

$$\begin{aligned} \ell (\mathbf {w};x) = \left\{ \begin{array}{l} 0, \quad \quad \quad \quad \quad \gamma (\mathbf {w};x) > \varDelta (Y^*,\hat{Y}) \\ \varDelta (Y^*,\hat{Y}) - \gamma (\mathbf {w};x), \quad \quad \text {otherwise} \end{array}\right. \end{aligned}$$
(5)

where

$$\begin{aligned} \gamma (\mathbf {w};x)=\mathbf {w}^T \varPhi (x,Y^*)- \mathbf {w}^T \varPhi (x,\hat{Y}), \end{aligned}$$
(6)

We abbreviate \(\ell (\mathbf {w};x)\) to \(\ell \). If \(\ell = 0\) then \(\mathbf {w}_t\) itself satisfies the constraint in Eq. (3) and is clearly the optimal solution. We therefore concentrate on the case where \(\ell >0\).

Since it is hard to judge whether an answer is good or not in an objective way, we do not use directly the manual answer set \(Y^*\) in our learning process. We dynamically construct the supervisor on each iteration.

We define \(\theta = \min \{|\hat{Y} - Y^*|, |(C - \hat{Y}) \cap Y^*|\}\) is the minimal number of bad answers in top-k and good answers out of top-k. In each iteration, we build \(Y^{**}\) by inserting \(\theta \) good answers, denoted as P, from out of top-k into top-k, and excluding the same number of bad answers, denoted as Q.

$$\begin{aligned} Y^{**}=(\hat{Y} - Q) \cup P. \end{aligned}$$
(7)

Now, we will show how we decide P and Q.

Assuming in the round i, the rank of the \(s^t\) is \(r_i(s^t)\). After the update of w, the rank of the \(s^t\) is \(r_{i+1}(s^t)\). We defined the distance of two rankings, \(d(r_i,r_{i+1})\) as the sum of un-concordant pairs.

Theorem 1

Given two rankings \(r_1\), \(r_2\), over \(s_1, s_2, \ldots , s_m\), w.l.g. we let \(r_1(s_i) = i\), if \(r_2(s_i) > r_2(s_j), i < j\), then \(d(r_1, r_2) \ge j - i\)

Proof

Consider the rank \(r_2^0: s_1 \ldots s_{i-1}, s_j\)\(s_i, s_{i+1} \ldots s_{j-1}\)\( s_{j+1} \ldots s_m\). Obliviously, \(d(r_1, r_2^0) = j - i\) and the set of the un-concordant pairs of \(r_2^0\) is \(NC^0:\{(s_j, s_i), (s_j, s_{i+1}),\ldots , (s_j, s_{j-1})\}\). Assuming that there is a rank \(r_2^1: r_2^1(s_i) > r_2^1(s_j)\), set of the un-concordant pairs of which is denoted as \(NC^1\) and \(d(r_1,r_2^1) < j - i\). Then there must be some \(K \subseteq [i+1, j-1]: \forall k \in K, (s_j, s_k) \in NC^0 \vee (s_j, s_k) \notin NC^1\). However, \(\forall k \in K, (s_i, s_k) \in NC^1\), so \(|NC^1| \le |NC^0|\), which is contrast with the assumption.

Theorem 1 can easily be extended to the following case.

Theorem 2

Given two rankings \(r_1\), \(r_2\), over \(s_1, s_2, \ldots , s_m\), w.l.g. we let \(r_1(s_i) = i\), if \(I,J \subseteq [1,m], \forall i \in I, \forall \in J, i < j\) and \(r_2(s_i) > r_2(s_j)\), then \(d(r_1, r_2) \ge \sum _{j\in J}j - \sum _{i\in I}i\)

Let the positions of elements of P in \(r_i(s^t)\) are \(r_i(P_1) < r_i(P_2) < \cdots < r_i(P_\theta )\), and positions of elements of Q are \(r_i(Q_1) < r_i(Q_2) < \cdots < r_i(Q_\theta )\), then according to our constraints and the Theorem 2

$$\begin{aligned} d(r_i(s^t), r_{i+1}(s^t)) \ge \sum _{j=1}^\theta {(r_i(P_j) - r_i(Q_j))} \end{aligned}$$
(8)

The lower bound of the distance of two rankings is \(\sum _{j=1}^\theta {(r_i(P_j) - r_i(Q_j))}\). We assume \(\Vert w_{i+1}-w_i\Vert \) will be increased with the increase of the lower bound. The intuition interpretation is if it is asked to change the positions of two adjacent sentences in \(r_i(s^t)\), the change of w will be small, and instead, if it is asked to change the positions of two sentences which are the begin and the end of the ranked list, w will change more largely. In order to minimize \(\Vert w_{i+1}-w_i\Vert \), \(\sum _{j=1}^\theta {(r_i(P_j) - r_i(Q_j))} \) should be minimized by minimizing each \(r_i(P_j)\) and maximizing each \(r_i(Q_j)\). So we get:

  • P :  the top-\(\theta \) good answers in \(C - Y^*\);

  • Q :  the bottom-\(\theta \) non-answers in \(\hat{Y}\).

Similar to [2], we get the update step,

$$\begin{aligned} \alpha _t = \min \,\left( \mathcal {C},\frac{\varDelta (Y^{**},\hat{Y}) - \mathbf {w}_t^T\left( \varPhi (\mathbf {x},Y^{**})- \varPhi (\mathbf {x},\hat{Y})\right) }{||\varPhi (\mathbf {x},Y^{**})- \varPhi (\mathbf {x},\hat{Y})||^2}\right) . \end{aligned}$$
(9)

Our final algorithm is shown in Algorithm (1).

figure a

3 Features

For simplicity, we assume that each sentence is independent with others and remove the redundance later, the feature vector can be decomposed by \(\varPhi (x, Y) = \sum _{s \in Y} \phi (x, s)\).

3.1 Features Based on Language Models

For a candidate sentence s, we can calculate log P(s|Corpus) using language models trained on different corpora. Here we use four corpora: AQUAINT, modified AQUAINT (AQUAINT*), target corpus (TC) and definition corpus (DC).

  • Aquaint Aquaint consists of newswire text data in English, drawn from three sources and contains roughly 375 million words correlating to about 3GB of data. P(s|Aquaint) is used to estimate the complexity of the sentence.

  • Aquaint* Aquaint* is the modified version of Aquaint, which replace the named entities and number word with their entity types(PERSON, LOCATION, ORGANIZATION, BASICNAME) and POS tag (CD). \(P(s|Aquaint*)\) is used to measure the complexity of sentence after eliminating the effect of different number words and named entity of same type.

  • Target Corpus(TC) To each target, we build a TC correspondingly. We get top ranked 100 snippets by Google with the target as the query. Parameters are smoothed on Aquaint by Dirichlet smoothing [11].

  • Definition Corpus(DC) We build a corpus composed of definitional sentence by collection Wikipedia article on the train targets. Because some words like named entity phrase and number word may be high related with the specific target, we rewrite them in the same way as Aquaint*.

We use unigram model to get the four features of the sentence s, \(\log P(s|TC)\), \(\log P(s|DC)\), \(\log P(s|Aquaint)\), \(\log P(s|Aquaint*)\), where the models train by TC and DC are Dirichlet smoothed with the general corpus of Aquaint and Aquaint*, respectively.

3.2 Features Based on Dependency Relation Patterns

The syntax of a sentence is also important. For example, appositive structure often appears in the definition sentences. We use Minipar [8] to get a set of dependency relations patterns. First, to each sentence s as training sample, we get a set of triples: \(\langle \)word, relation, word\(\rangle \). Then, we use two wildcard IN_TARGET, OUT_TARGET to indicate whether the content word is the target. Thus, we can get 20 most frequent patterns in the form of \(<\)IN_TARGET, relation, OUT_TARGET\(>\).

Redundancy Features. Redundancy features are in the form of \(\psi (x^t, s^t_{i_j}, s^t_{i_1\ldots i_{j-1}})\) to the sentence \(s^t_{i_j}\) for the target \(x^t\). For each content word tw in target, we test a range [ab] centered by tw, denoted as \(r_{tw}\), in \(s^t_{i_j}\) for it. The word w is in the range \(r_{tw}\) if and only if there are at least \(a-1\) and at most \(b-1\) words between w and tw. We calculate how many content words have been appeared in the previous sentences as following.

$$\begin{aligned} \frac{|\{w: \exists tw, w \in r_{tw} \vee \exists s^t_{i_{j'}}, 1 \le j' \le j-1, w \in s^t_{i_{j'}}\}|}{|\{w: \exists s^t_{i_{j'}}, 1 \le j' \le j-1, w \in s^t_{i_{j'}}\}|} \end{aligned}$$

We used 3 different range \([1,5],[6,10],[10,+\infty ]\) to catch features (Tables 2 and 3).

Table 1. Results on TREC 2005 (\(k=12\))
Table 2. F-3 score on the TREC 2006
Table 3. Recall on the TREC 2006

4 Experiments

We use three datasets from TREC 2004 \(\sim \) 2006, which include 65, 75, 75 definitional questions respectively.

We firstly use LuceneFootnote 1 to get at most 200 sentences from Aquaint related to the target, and the query is just the target.

We adopt F-3 score, which used in the TREC definitional question answering task [9].

$$\begin{aligned} NR= & {} \frac{r}{R} \end{aligned}$$
(10)
$$\begin{aligned} NP= & {} \left\{ \begin{array}{l} 1, \qquad l < 100\times (r + a) \\ 1 - \frac{l-100\times (r+a)}{1}, \text{ otherwise } \end{array} \right. \end{aligned}$$
(11)
$$\begin{aligned} F\texttt {-}3= & {} \frac{10 \times NR\times NP}{9 \times NP+NR} \end{aligned}$$
(12)

where r is number of vital nuggets in the system response, R is number of vital nuggets in the gold standard, a is number of okay nuggets in the system response and l is length of the system response. Vital nuggets represent the most important facts about the target and should be included. Okay nuggets contribute to relevant information but are not essential.

4.1 Comparison with Other Systems

Kor and Chua [7] gave the results of Soft Pattern model (SP) and Human Interests Model (HIM), which both used questions in TREC 2004 as training data and questions in TREC 2005 as test data. Table 1 shows our method clearly outperforms SP and HIM.

4.2 Comparison with Other Ranking Algorithms

To demonstrate the effectiveness of our ranking method, we also compare it with RankSVM [10] and Han [4]. We use the questions in TREC 2005 as training data and TREC 2006 as test data. The targets include PERSON, ORGANIZATION, THING, EVENT.

5 Conclusion

In this paper, we propose an online learning algorithm, which dynamically construct the supervisor on each iteration and assure the quality of the top k returned answers, instead of optimizing rank of the whole candidate list. In the future, we will seek the applications of our method on the ranking problems in other tasks such as summarization.