
1 Background

Question Answering (QA) is a fundamental task in Artificial Intelligence, whose goal is to build a system that can automatically answer natural language questions. In the last decade, the development of QA techniques has been greatly promoted by both academia and industry.

In academia, with the rise of large-scale curated knowledge bases such as Yago, Satori, and Freebase, more and more researchers have turned their attention to the knowledge-based QA (KBQA) task, studying both semantic parsing-based approaches [1–7] and information retrieval-based approaches [8–16]. Besides KBQA, researchers are also interested in document-based QA (DBQA), whose goal is to select answers from a set of given documents and use them as responses to natural language questions. Information retrieval-based approaches [18–22] are usually used for the DBQA task.

In industry, many influential QA-related products have been built, such as IBM Watson, Apple Siri, Google Now, Facebook Graph Search, Microsoft Cortana, and XiaoIce. These systems are becoming part of the daily life of mobile device users.

Under these circumstances, in this year’s NLPCC-ICCPOL shared task, we organize an open domain QA task that covers both the KBQA and DBQA tasks. Our motivations are two-fold:

  1. We expect this activity to advance QA research, especially for Chinese;

  2. We encourage more QA researchers to share their experiences, techniques, and progress.

The remainder of this paper is organized as follows. Section 2 describes the two open domain Chinese QA tasks and the benchmark datasets constructed for them. Section 3 describes the evaluation metrics, and Sect. 4 presents the evaluation results of the different submissions. We conclude the paper in Sect. 5 and point out our plan for future QA evaluation activities.

2 Task Description

The NLPCC-ICCPOL 2016 open domain QA shared task includes two QA tasks for the Chinese language: a knowledge-based QA (KBQA) task and a document-based QA (DBQA) task.

2.1 KBQA Task

Given a question, a KBQA system built by each participating team should select one or more entities as answers from a given knowledge base (KB). The datasets for this task include:

  • A Chinese KB. It includes knowledge triples crawled from the web. Each knowledge triple has the form <Subject, Predicate, Object>, where ‘Subject’ denotes a subject entity, ‘Predicate’ denotes a relation, and ‘Object’ denotes an object entity. A sample of knowledge triples is given in Fig. 1, and the statistics of the Chinese KB are given in Table 1.

    Fig. 1. An example of the Chinese KB.

    Table 1. Statistics of the Chinese KB.
  • A training set and a testing set. We assign a set of knowledge triples sampled from the Chinese KB to human annotators. For each knowledge triple, a human annotator writes down a natural language question whose answer is the object entity of that triple. The statistics of the labeled QA pairs and an annotation example are given in Table 2:

    Table 2. Statistics of the KBQA datasets.

In the KBQA task, any data resource can be used to train the necessary models (e.g., entity linking and semantic parsing models), but the answer entities must come from the provided KB only.
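As an illustration only, the following sketch shows one way the released triples could be loaded into a simple (subject, predicate) → objects index. The function name and the tab delimiter are our own assumptions (the actual file format is the one shown in Fig. 1), and this is not part of the official tooling.

```python
from collections import defaultdict

def load_kb(path, sep="\t"):
    """Index KB triples as (subject, predicate) -> set of object entities.

    The tab delimiter is an assumption for illustration; in practice, use the
    delimiter of the released KB file (see Fig. 1).
    """
    kb = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(sep)
            if len(parts) != 3:
                continue  # skip lines that are not well-formed triples
            subject, predicate, obj = parts
            kb[(subject, predicate)].add(obj)
    return kb
```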

2.2 DBQA Task

Given a question and its corresponding document, a DBQA system built by each participating team should select one or more sentences as answers from the document. The datasets for this task include:

  • A training set and a testing set. We assign a set of documents to human annotators. For each document, a human annotator (1) first selects a sentence from the document, and (2) then writes down a natural language question whose answer is the selected sentence. The statistics of the labeled QA pairs and an annotation example are given in Table 3:

    Table 3. Statistics of the DBQA datasets.

As shown in the example in Table 3, each line provides a question (the 1st column), one of the question’s corresponding document sentences (the 2nd column), and its answer annotation (the 3rd column). If the document sentence is a correct answer to the question, its annotation is 1; otherwise, its annotation is 0. The three columns are separated by the tab character ‘\t’.
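For illustration, a labeled DBQA file in this format could be loaded as follows; the function name and the in-memory representation are our own choices, not part of the released data or tooling.

```python
def read_dbqa_file(path):
    """Read tab-separated (question, document sentence, 0/1 label) lines."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            question, sentence, label = line.split("\t")
            examples.append((question, sentence, int(label)))
    return examples
```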

In the DBQA task, any data resource can be used to train the necessary models (e.g., paraphrasing and sentence matching models), but the answer sentences must come from the provided documents only.

3 Evaluation Metrics

The quality of a KBQA system is evaluated by Averaged F1, and the quality of a DBQA system is evaluated by MRR, MAP, and ACC@1.

  • Averaged F1

$$ Averaged\ F1 = \frac{1}{|Q|}\sum_{i=1}^{|Q|} F_{i} $$

\( F_{i} \) denotes the F1 score for question \( Q_{i} \), computed based on the system-generated answer set \( C_{i} \) and the golden answer set \( A_{i} \). \( F_{i} \) is set to 0 if \( C_{i} \) is empty or does not overlap with \( A_{i} \). Otherwise, \( F_{i} \) is computed as follows:

$$ F_{i} = \frac{2 \cdot \frac{\#(C_{i}, A_{i})}{|C_{i}|} \cdot \frac{\#(C_{i}, A_{i})}{|A_{i}|}}{\frac{\#(C_{i}, A_{i})}{|C_{i}|} + \frac{\#(C_{i}, A_{i})}{|A_{i}|}} $$

where \( \#(C_{i}, A_{i}) \) denotes the number of answers that occur in both \( C_{i} \) and \( A_{i} \), and \( |C_{i}| \) and \( |A_{i}| \) denote the number of answers in \( C_{i} \) and \( A_{i} \), respectively.
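As a minimal sketch (not the official evaluation script), Averaged F1 could be computed from per-question answer sets as follows; the function and variable names are illustrative.

```python
def averaged_f1(system_answers, gold_answers):
    """Averaged F1: system_answers[i] is C_i, gold_answers[i] is A_i."""
    total = 0.0
    for c, a in zip(system_answers, gold_answers):
        overlap = len(set(c) & set(a))      # #(C_i, A_i)
        if len(c) == 0 or overlap == 0:
            continue                        # F_i = 0 in these cases
        precision = overlap / len(c)        # #(C_i, A_i) / |C_i|
        recall = overlap / len(a)           # #(C_i, A_i) / |A_i|
        total += 2 * precision * recall / (precision + recall)
    return total / len(gold_answers)
```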

  • MRR

$$ MRR = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \frac{1}{rank_{i}} $$

\( |Q| \) denotes the total number of questions in the evaluation set, and \( rank_{i} \) denotes the position of the first correct answer in the generated answer set \( C_{i} \) for the \( i^{th} \) question \( Q_{i} \). If \( C_{i} \) does not overlap with the golden answer set \( A_{i} \) for \( Q_{i} \), \( \frac{1}{rank_{i}} \) is set to 0.
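A corresponding sketch for MRR, under the same illustrative conventions:

```python
def mean_reciprocal_rank(ranked_answers, gold_answers):
    """MRR: ranked_answers[i] is the ranked list C_i, gold_answers[i] is A_i."""
    total = 0.0
    for c, a in zip(ranked_answers, gold_answers):
        gold = set(a)
        for rank, candidate in enumerate(c, start=1):
            if candidate in gold:
                total += 1.0 / rank   # 1 / rank_i of the first correct answer
                break
        # questions whose C_i does not overlap with A_i contribute 0
    return total / len(gold_answers)
```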

  • MAP

$$ MAP = \frac{1}{|Q|}\sum_{i=1}^{|Q|} AveP(C_{i}, A_{i}) $$

\( AveP(C, A) = \frac{\sum_{k=1}^{n} P(k) \cdot rel(k)}{\min(m, n)} \) denotes the average precision, where \( k \) is the rank in the sequence of retrieved answer sentences, \( m \) is the number of correct answer sentences, and \( n \) is the number of retrieved answer sentences. If \( \min(m, n) \) is 0, \( AveP(C, A) \) is set to 0. \( P(k) \) is the precision at cut-off \( k \) in the list, and \( rel(k) \) is an indicator function equaling 1 if the item at rank \( k \) is an answer sentence, and 0 otherwise.
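A sketch of MAP with AveP normalized by min(m, n), as defined above:

```python
def mean_average_precision(ranked_answers, gold_answers):
    """MAP: ranked_answers[i] is the ranked list C_i, gold_answers[i] is A_i."""
    total = 0.0
    for c, a in zip(ranked_answers, gold_answers):
        gold = set(a)
        m, n = len(gold), len(c)
        if min(m, n) == 0:
            continue                  # AveP is defined as 0 in this case
        hits, avep = 0, 0.0
        for k, candidate in enumerate(c, start=1):
            if candidate in gold:     # rel(k) = 1
                hits += 1
                avep += hits / k      # P(k), precision at cut-off k
        total += avep / min(m, n)
    return total / len(gold_answers)
```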

  • ACC@N

$$ Accuracy@N = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \delta(C_{i}, A_{i}) $$

\( \delta(C_{i}, A_{i}) \) equals 1 when at least one answer in \( C_{i} \) occurs in \( A_{i} \), and 0 otherwise.
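And a sketch of ACC@N (ACC@1 is the value reported for the DBQA task):

```python
def accuracy_at_n(ranked_answers, gold_answers, n=1):
    """ACC@N: fraction of questions whose top-n candidates contain a golden answer."""
    hits = 0
    for c, a in zip(ranked_answers, gold_answers):
        if set(c[:n]) & set(a):       # delta(C_i, A_i) = 1
            hits += 1
    return hits / len(gold_answers)
```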

4 Evaluation Results

In total, 99 teams registered for the above two Chinese QA tasks, and 39 teams submitted their results. Tables 4 and 5 list the evaluation results of the KBQA and DBQA tasks, respectively.

Table 4. Evaluation results of the KBQA task.
Table 5. Evaluation results of the DBQA task.

5 Conclusion

This paper gives a brief overview of this year’s two open domain Chinese QA shared tasks. Compared to last year (19 teams registered and only 3 teams submitted final results), this year 99 teams registered and 39 teams submitted final results, which represents great progress for the Chinese QA community. In the future, we plan to provide more QA datasets and call for new QA tasks for Chinese. Besides, we plan to extend the QA tasks from Chinese to English as well.