
1 Introduction

Question answering on knowledge bases (\({\mathsf {KBQA}} \)) has been an active research area since it was first proposed. Its main task is to answer natural language questions over a structured knowledge base. \({\mathsf {KBQA}} \) systems have been deployed as QA engines in many fields. At the early stage, investigators proposed techniques for answering simple questions.

Driven by practical requirements, techniques for answering complex questions, which contain multiple constraints, are in urgent demand. However, prior techniques for answering simple questions cannot be easily adapted to complex questions. Taking the question “who plays claire in lost?” from \(\mathsf {WebQuestion}\)  [4] as an example, if one simply parses it into a triple query \(\langle \)claire, is_played_by, ?\(\rangle \), the answer will be anyone who played claire in any series. However, the correct answer should be the actor in “lost”. Therefore, answering complex questions is far more complicated than answering simple ones.

To answer complex questions, most semantic-parsing-based approaches parse a question into a query graph and use that query graph to find answers from a \({\mathsf {KB}} \). Typically, a query graph is defined as a formal graph whose nodes correspond to the topic entity, the answer node, variables and constraints, and whose edges correspond to relations [27]. Figure 1 shows two questions along with their query graphs. In each figure, the rectangle in bold font denotes the topic entity, rectangles in normal font denote constraints, and the rounded rectangle with a dotted border encloses the core path. The construction of query graphs can be split into three main tasks, i.e., entity linking, core path generation and constraint selection [3, 8, 10, 15, 17, 27, 28, 30]. Due to semantic diversity, query graphs are often unable to precisely express the meaning of a given question.

The low quality of query graphs mainly stems from two issues. The first issue is core path generation. For example, question \({\textbf {Q}}_1\) in Fig. 1 has two candidate core paths, i.e., \(\langle \)fiction_character\(\rangle \)Footnote 1 and \(\langle \)TV_character\(\rangle \)Footnote 2, and it is difficult to determine which one is correct. However, if the type information of the topic entity, e.g., “TV Character”, can be used, the candidate selection becomes much easier. The other issue lies in constraint selection. Most existing works [3, 17, 27, 30] simply add all identified constraints to a query graph without careful justification. Hence, a query graph may carry inappropriate constraints.

Fig. 1. Questions and their query graphs

Contributions. The contributions of this paper are summarized as follows.

  1. We devise a novel metric based on local similarity and global similarity for measuring the semantic similarity between a mention and a candidate entity. This new metric is superior to its counterparts.

  2. We introduce an attention-based model for core path generation. In contrast to prior works, this model incorporates richer implicit information for path identification and shows excellent performance.

  3. We propose a memory-based model for constraint selection. The model enables us to choose suitable constraints adaptively.

  4. We compare our approach with state-of-the-art methods on typical benchmark datasets and show promising results.

The remainder of the paper is organized as follows. We introduce related work in Sect. 2. We illustrate our method in Sect. 3. We conduct experiments to show the performance of our approach in Sect. 4, followed by the conclusion in Sect. 5.

2 Related Work

The problem of answering complex questions on \(\mathsf {KBs}\) has been investigated for years. Investigators have developed several techniques, which can be categorized into information retrieval (abbr. \(\mathsf {IR}\)) and semantic parsing (abbr. \(\mathsf {SP}\)).

Information Retrieval. Methods in this line mainly work as follows: (a) identify the topic entity in the underlying \(\mathsf {KB}\) for the given question; (b) extract a subgraph around the topic entity from the \(\mathsf {KB}\); (c) extract features of candidate answers from the subgraph; and (d) match the features of the question against those of the candidate answers to choose answers.

One bottleneck of \(\mathsf {IR}\)-based methods is feature extraction. Hence, existing approaches adopt different feature extraction methods to capture important information from the question and the answer, respectively. Early approaches [7, 26] first collect candidate answers from the \(\mathsf {KB}\) according to the dependency tree, and then use feature maps to extract features from the question and the candidate answers, respectively. With the development of neural networks, [11, 14] introduced techniques that incorporate CNNs and attention mechanisms into feature extraction. Moreover, some approaches [9, 16] adopted memory networks for the task.

Since \(\mathsf {IR}\)-based approaches retrieve answers from a \(\mathsf {KB}\), the completeness of the \(\mathsf {KB}\) is imperative. [13, 20] focus on reasoning over incomplete \(\mathsf {KBs}\), while [22, 29] proposed techniques to dynamically update reasoning instructions.

Semantic Parsing. Most semantic parsing approaches show better performance on complex \(\mathsf {KBQA}\)  [8, 9, 15]. Methods in this line mainly work as follows: (a) question understanding: understand a given question via semantic and syntactic analysis; (b) logical parsing: translate the question into logical forms; (c) \(\mathsf {KB}\) grounding: instantiate logical forms by conducting semantic alignment on underlying \(\mathsf {KB}\); (d) \(\mathsf {KB}\) execution: execute the logical forms to obtain answers.

In the early stage, traditional semantic parsing methods [4, 12] used templates to conduct logical parsing. Recently, for question understanding, syntax dependencies [1, 2, 17] were introduced to accomplish the task, while [21] treats a complex question as the combination of several simple questions.

Logical form generation is crucial for \(\mathsf {SP}\)-based methods, and approaches such as [8, 27] focus on this task. A staged query graph generation mechanism was introduced by [27], where each stage adds semantic components according to rules. Following [27], [3] further investigated staged query graph generation and summarized the types of constraints. A hierarchical representation based on word, question and relation levels was proposed by [28]. [15] developed a state-transition model to generate query graphs, while [17] and [18] improved techniques for ranking query graphs. [10] analyzed historical datasets and put forward a set of structured templates for generating complex query graphs. Following [10], [8] proposed a generative model to generate abstract structures adaptively and match the abstract structures against a \(\mathsf {KB}\) to obtain formal query graphs.

3 Approach

Problem Formulation. The task of complex question answering on \(\mathsf {KBs}\) can be formulated as follows. Given a natural language question \({\textbf {Q}}\) represented as a sequence of words \(\mathcal{L}_Q = \{w_{1}, ... , w_{T}\}\) and a background knowledge base \({\mathsf {KB}} \) as input, the task is to find a set of entities in \({\mathsf {KB}} \) as answers to \({\textbf {Q}}\).

Typically, \(\mathsf {SP}\)-based techniques, which achieve better performance, involve question patterns and query graphs (a.k.a. semantic graphs). We next formally introduce these notions (rephrased) as follows.

Question Pattern. Given a question \({\textbf {Q}}\), one can generate a new question by replacing an entity mention in \({\textbf {Q}}\) with a generic symbol \(\langle \)e\(\rangle \). This new question is referred to as a question pattern (denoted by \({\textbf {P}}\)) of \({\textbf {Q}}\). For example, a question pattern of “who plays claire in lost?” is “who plays \(\langle \)e\(\rangle \) in lost?”, where the entity mention “claire” in \({\textbf {Q}}\) is replaced by \(\langle \)e\(\rangle \).

Query Graph. Typically, a query graph \(Q_g\) is defined as a logical form in \(\lambda \)-calculus [27] and consists of a core path and various constraints. Specifically, a core path includes a topic entity, a middle node (optional), an answer node and the predicates among these nodes. Constraints are categorized into four types according to their semantics, i.e., entity constraints, type constraints, time constraints and ordinary constraints. Two query graphs are shown in Fig. 1.

Overview. Our approach consists of three modules, i.e., Topic Entity Recognition, Core Path Generation and Constraints Selection. In a nutshell, given a question \({\textbf {Q}}\) issued on a \(\mathsf {KB}\), the Topic Entity Recognition module first retrieves mentions from it and identifies a set of candidate topic entities from the underlying \(\mathsf {KB}\). On the set of candidate entities, the Core Path Generation module identifies the core path with an attention-based model. Finally, the Constraints Selection module employs a memory-based model to select the most suitable constraints, which are then used to produce the query graph. Below, we illustrate the details of these modules in Sects. 3.1, 3.2 and 3.3, respectively.

3.1 Topic Entity Recognition

The Topic Entity Recognition module works as follows. (1) It employs, e.g., BiLSTM + CRF to recognize entity mentions in a given question \({\textbf {Q}}\). (2) It performs entity linking with the state-of-the-art entity linking tool S-MART [24] to link a mention to entities in a \({\mathsf {KB}} \). Note that a mention may be linked to multiple entities in a \({\mathsf {KB}} \). Hence, after the above process, a collection of triples \(\langle m,e,s_e\rangle \) is returned, where m, e and \(s_e\) represent a mention, an entity linked from m and a score measuring the semantic similarity between m and e, respectively. (3) It determines the best topic entity according to a novel metric for measuring the semantic similarity between a mention and an entity in a \({\mathsf {KB}} \).

Similarity Metric. Given a question \({\textbf {Q}}\), a list of n mentions \([m_{1},m_{2}, ...,m_{n}]\) can be extracted from it. For a mention \(m_{i}\) (\(i\in [1,n]\)), it is linked to k different entities \([e_i^{1}, e_i^{2},\cdots , e_i^{k}]\) in a \({\mathsf {KB}} \) and each entity \(e_i^j\) is assigned with a value \(s_i^j\) indicating the matching degree between \(m_i\) and \(e_i^j\) (\(j\in [1,k]\)). We define the largest matching degree \(\hat{s}_i\) of \(m_i\) as \(\hat{s}_i={\mathsf {max}} \{s_i^{1}, s_i^{2},\cdots , s_i^{k}\}\), then the entity with \(\hat{s}_i\) is considered the best entity of \(m_i\). The local similarity \(l_i^j\) and global similarity \(g_i^j\) between a mention \(m_i\) and its j-th entity \(e_i^j\) are hence defined as \(l_i^j = s_i^j /\hat{s}_i\) and \(g_i^j = \frac{s_i^j}{{\mathsf {max}} \{\hat{s}_1, \hat{s}_2,\cdots , \hat{s}_n\}}\), respectively. The final semantic similarity \(\tau \) is defined as below.

$$\begin{aligned} \tau = \beta \cdot l_i^j + (1-\beta )\cdot g_i^j, \end{aligned}$$
(1)

where \(\beta \) is an adjustment coefficient tuned by users.

Based on the metric \(\tau \), the best topic entity e is determined and returned to the core path generation module for further processing.
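To make the metric concrete, the following is a minimal sketch in plain Python of how the triples returned by the entity linker could be ranked with Eq. (1); the function name, data layout and the value of \(\beta \) are illustrative assumptions, not part of the original system.

```python
def rank_topic_entities(mention_candidates, beta=0.5):
    """mention_candidates: one list per mention, each containing (entity, score) pairs.
    Returns (entity, tau) pairs ranked by the combined similarity of Eq. (1).
    beta is the adjustment coefficient; 0.5 is an arbitrary illustrative value."""
    # best matching degree per mention: s_hat_i = max_j s_i^j
    best_per_mention = [max(s for _, s in cands) for cands in mention_candidates]
    global_best = max(best_per_mention)              # max_i s_hat_i

    scored = []
    for cands, s_hat in zip(mention_candidates, best_per_mention):
        for entity, s in cands:
            local_sim = s / s_hat                    # l_i^j = s_i^j / s_hat_i
            global_sim = s / global_best             # g_i^j = s_i^j / max_i s_hat_i
            tau = beta * local_sim + (1 - beta) * global_sim   # Eq. (1)
            scored.append((entity, tau))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

The top-ranked entity is then passed to the core path generation module as the topic entity.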

3.2 Core Path Generation

Given the topic entity e of a question \({\textbf {Q}}\), one can generate a set \({\textbf {R}}=\{R_{1},\cdots , R_{T}\}\) of candidate core paths, where each \(R_i\) takes e as the initial node and consists of the nodes and edges within h hops of e in the \({\mathsf {KG}} \). One then only needs to determine the best one among the candidates. Here, the quality of a core path \(R_i\) w.r.t. \({\textbf {Q}}\) indicates the semantic distance between \(R_i\) and the question pattern of \({\textbf {Q}}\).

Fig. 2. Attention-based core path generation model & Deep semantic network

To tackle this issue, a host of techniques, e.g., [3, 17, 27], have been developed. These techniques mostly learn the correlation between a question pattern and the corresponding candidate core paths to determine the best core path. However, the correlation derived from explicit information alone is insufficient. For example, given the question “who plays ken barlow in coronation street?”, it is easy to obtain a topic entity as well as a question pattern “who plays \(\langle \)e\(\rangle \) in coronation street?”. Note that the pattern corresponds to multiple candidate paths (a.k.a. relations) in a \(\mathsf {KG}\), e.g., \(\langle \)fiction_character\(\rangle \) and \(\langle \)TV_character\(\rangle \), while only one “golden” path exists. If the “type” information of the topic entity, i.e., ‘TV Character’, can be incorporated, one can easily determine the golden path: \(\langle \)TV_character\(\rangle \). This example shows that implicit information, e.g., the type of the topic entity, plays a crucial role in determining the golden path.

Motivated by this, we develop an Attention-based Core Path Generation Model, denoted as ACPGM, to generate a core path for a given question.

Model Details. As shown in Fig. 2(a), ACPGM learns the correlation between a question and a core path by using the explicit correlation \(v_1\), the implicit correlation \(v_2\) and the semantic similarity \(\tau \) of the topic entity. In particular, the degree of explicit correlation is calculated with a deep semantic network, which is shown in Fig. 2(b). We next illustrate more details.

Explicit & Implicit Correlation. Intuitively, the semantic correlation between a question pattern \({\textbf {P}}\) and a core path \(R_i\) is referred to as the explicit correlation. In contrast, the correlation inferred from certain implicit information is referred to as the implicit correlation. Taking the question pattern “who plays \(\langle \)e\(\rangle \) in lost?” of \({\textbf {Q}}_1\) as an example, it corresponds to multiple candidate paths, among which \(\langle \)fictional_character\(\rangle \) and \(\langle \)TV_character\(\rangle \) are typical ones. Observe that the “type” (one kind of implicit information) of the topic entity is “TV Character”, which already appears in the second candidate. If the “type” information is used to guide the selection, the correct answer can be obtained since the second candidate is preferred.

Structure of the Model. The entire model consists of two parts, for learning the explicit correlation and for core path selection, respectively.

(I) A deep semantic network (abbr. DSN) is developed to capture the explicit correlation between a question pattern \({\textbf {P}}\) and a candidate path \(R_i\). The structure of DSN is shown in Fig. 2 (b). It encodes \({\textbf {P}}\) and \(R_i\) as letter-trigram vectors, and applies two convolutional neural networks with the same structure to learn their semantic embeddings, respectively. Then DSN computes semantic similarity (denoted by \(v_{1}\)) between \({\textbf {P}}\) and \(R_i\) by calculating the cosine distance of two embeddings.
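As a rough illustration only, the snippet below sketches such a twin-CNN similarity model with the Keras functional API (the framework used for our implementation); the trigram vocabulary size, sequence length, filter count and embedding size are placeholder assumptions, and only the overall structure, i.e., two CNNs with the same architecture followed by a cosine similarity, follows the description above.

```python
from keras.layers import Input, Conv1D, GlobalMaxPooling1D, Dense, Dot
from keras.models import Model

TRIGRAM_DIM = 30000   # size of the letter-trigram vocabulary (illustrative placeholder)
MAX_LEN = 20          # maximum sequence length (illustrative placeholder)

def semantic_encoder():
    """One CNN tower mapping a letter-trigram sequence to a semantic embedding."""
    inp = Input(shape=(MAX_LEN, TRIGRAM_DIM))
    x = Conv1D(300, 3, activation='tanh')(inp)   # convolve over trigram vectors
    x = GlobalMaxPooling1D()(x)                  # pool over positions
    out = Dense(128, activation='tanh')(x)       # final semantic embedding
    return Model(inp, out)

pattern_in = Input(shape=(MAX_LEN, TRIGRAM_DIM))
path_in = Input(shape=(MAX_LEN, TRIGRAM_DIM))
# two CNNs with the same structure (weights not shared), as described above
pattern_emb = semantic_encoder()(pattern_in)
path_emb = semantic_encoder()(path_in)
v1 = Dot(axes=1, normalize=True)([pattern_emb, path_emb])   # cosine similarity v_1
dsn = Model([pattern_in, path_in], v1)
```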

(II) ACPGM takes two lists \(L_1\) and \(L_2\) as input and leverages an attention scheme to improve performance. The first list \(L_1\) consists of a question pattern \({\textbf {P}}\) and the “type” information of the topic entity in \({\textbf {P}}\). The second list \(L_2\) essentially corresponds to the implicit information, e.g., the first two entries and the last entry of a candidate path, along with the “type” information of the answer entity. Each word \(x_i\) (resp. \(c_j\)) in the first (resp. second) list is encoded as \(h_{i}\) (resp. \(e_{j}\)) via certain encoding techniques, e.g., Word2Vec.

For each \(h_i\) in \(L_1\) and each \(e_j\) in \(L_2\), we calculate a weight \(w_{ij}\) of the pair \(\langle h_i, e_j\rangle \), as attention toward the answer, with the function below.

$$\begin{aligned} w_{ij} = W^{T}(h_{i} \cdot e_{j}) + b, \end{aligned}$$
(2)

where \(\cdot \) denotes the dot product, and \(W^{T}\in \mathbb {R}\) and b are an intermediate weight and an offset value, respectively; they can be randomly initialized and updated during training. Note that there are in total \(l=|L_1|\times |L_2|\) weights for each input. Accordingly, the attention weight \(a_{ij}\) of the pair \(\langle h_i,e_j\rangle \), in terms of the implicit information, can be computed via the function below.

$$\begin{aligned} a_{ij} = \frac{exp(w_{ij})}{\sum _{k=1}^{l}{exp(w_{ik})}} \end{aligned}$$
(3)

The attention weights are then employed to compute a weighted sum for each word, leading to a semantic vector that represents the question pattern. The similarity score \(v_2\) between a question \({\textbf {Q}}\) and a core path \(R_i\) is defined as below.

$$\begin{aligned} q_{i} = \sum _{j=1}^{n+1}{a_{ij}h_{j}} \end{aligned}$$
(4)
$$\begin{aligned} v_{2} = \sum _{i=1}^{m+1}{q_{i}e_{i}} \end{aligned}$$
(5)
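The attention computation of Eqs. (2)–(5) can be summarized with a small NumPy sketch. Since the indexing in the equations is terse, the matrix shapes below and the treatment of W as a scalar weight are one plausible reading rather than the exact implementation.

```python
import numpy as np

def implicit_correlation(H, E, W, b):
    """Compute the implicit-correlation score v2 of Eqs. (2)-(5).
    H: (n+1, d) encoded entries of L1 (question pattern words + topic-entity type)
    E: (m+1, d) encoded entries of L2 (path entries + answer-entity type)
    W, b: trainable weight and offset (randomly initialised in practice)."""
    # Eq. (2): raw weight for every pair <h_i, e_j>
    w = W * (H @ E.T) + b                              # shape (n+1, m+1)
    # Eq. (3): softmax-normalised attention weights
    a = np.exp(w) / np.exp(w).sum(axis=1, keepdims=True)
    # Eq. (4): attention-weighted sums of the L1 embeddings
    q = a.T @ H                                        # shape (m+1, d)
    # Eq. (5): accumulate the correlation with the L2 embeddings
    v2 = float((q * E).sum())
    return v2
```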

By now, the explicit and implicit correlations between a question and a candidate core path, i.e., \(v_{1}\) and \(v_{2}\), are obtained. Together with the score \(\tau \) of the topic entity, a simple multi-layer perceptron, denoted by \(f(\cdot )\), is incorporated to calculate the final score used for ranking core paths.

$$\begin{aligned} v = f(v_{1}, v_{2}, \tau ) \end{aligned}$$
(6)

Here, the loss function of ACPGM is defined as follows.

$$\begin{aligned} Loss({\textbf {Q}}, R) = max(\gamma + v({\textbf {Q}}, a_{0}) - v({\textbf {Q}}, a),0), \end{aligned}$$
(7)

where \(\gamma \) is the margin parameter, and a and \(a_0\) refer to the positive and negative samples, respectively. The intuition of the training strategy is to ensure that the score of a positive question-answer pair is higher than that of a negative one by a margin.
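The margin-based objective of Eq. (7) is straightforward to express; the sketch below assumes the scores \(v({\textbf {Q}}, a)\) and \(v({\textbf {Q}}, a_0)\) have already been produced by the perceptron above.

```python
def ranking_loss(score_pos, score_neg, gamma=0.1):
    """Pairwise hinge loss of Eq. (7): the positive pair should outscore the
    negative pair by at least the margin gamma (set to 0.1 in this work)."""
    return max(gamma + score_neg - score_pos, 0.0)
```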

3.3 Constraints Selection

Given a core path, it is still necessary to enrich it with the constraints imposed by the given question and produce a query graph for seeking the final answer. To this end, we first categorize constraints into four types, i.e., entity constraint, type constraint, time constraint and ordinary constraint, along the same line as [17], and then apply a two-stage approach for constraint selection.

Candidates Identification. As the first stage, this task aims to collect valid constraints as candidates. Given a question \({\textbf {Q}}\) and its corresponding core path \(p_c\) (identified by ACPGM), we first identify 1-hop entities \(e_{c}\) connected to nodes (excluding the topic entity and answer node) of \(p_c\) in the underlying \(\mathsf {KB}\). For each identified entity \(e_{c}\), if part of its name appears in the original question \({\textbf {Q}}\), \(e_{c}\) is considered a possible entity constraint. Moreover, we treat the type information associated with the answer node as a type constraint. A time constraint is recognized via certain regular expressions, and an ordinary constraint is distinguished via typical hand-crafted rules. All the identified constraints are then included in a set of size n for further processing, as sketched below.
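The sketch below illustrates one way this stage could look in code; the helper `one_hop_entities` and the trigger-word list are hypothetical placeholders standing in for the KB lookup and the hand-crafted rules mentioned above, and type constraints from the answer node are omitted for brevity.

```python
import re

def identify_candidate_constraints(question, core_path_nodes, one_hop_entities):
    """Collect candidate constraints for a core path (hypothetical sketch).
    `one_hop_entities(node)` is an assumed helper returning (entity, name) pairs
    adjacent to a node in the KB; it is not part of the paper."""
    candidates = []
    question_lc = question.lower()
    # entity constraints: 1-hop neighbours whose name partially appears in the question
    for node in core_path_nodes:      # topic entity and answer node excluded by the caller
        for entity, name in one_hop_entities(node):
            if any(tok in question_lc for tok in name.lower().split()):
                candidates.append(('entity', entity))
    # time constraints: recognised via a simple year regular expression
    for year in re.findall(r'\b(1[0-9]{3}|20[0-9]{2})\b', question):
        candidates.append(('time', year))
    # ordinary constraints (e.g., superlatives) via hand-crafted trigger words
    if any(w in question_lc.split() for w in ('first', 'last', 'oldest', 'newest')):
        candidates.append(('ordinary', 'superlative'))
    return candidates
```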

Constraints Selection. Given the set of candidate constraints, a Memory-based Constraint Selection Model, denoted as MCSM, is developed to choose constraints. The idea of MCSM is to measure the relevance between a question and its constraints and then select the most suitable ones to update the core path.

Fig. 3. Constraint selection model

It works as follows. Each candidate constraint is converted into a vector \(c_i\) (\(i\in [1, n]\)) of dimension d with an embedding matrix \(A\in \mathbb {R}^{d\times n}\), which is initialized randomly and learned during training. The set of embeddings \({\textbf {C}} = \{c_{1}, c_{2},\cdots , c_{n}\}\) is stored in memory for querying. The input question is also encoded as a vector \({\textbf {e}}_q\) of dimension d with another matrix \(B\in \mathbb {R}^{d\times n}\). After encoding, the relevance \(r_i\) between a candidate constraint and the question can be measured with Eq. 8.

$$\begin{aligned} r_i=\sigma ({\textbf {e}}_q \cdot c_{i}) \end{aligned}$$
(8)

Here \(\sigma \) indicates a nonlinear transformation with the sigmoid function. In this way, \(r_i\) can be deemed the relevance degree between the i-th constraint and the question. Based on the relevance scores, a new embedding \({\textbf {e}}_o\), as the representation of the output memory, can be generated as follows:

$$\begin{aligned} {\textbf {e}}_o = \sum _{i=1}^{n}{c_{i} \times r_{i}} \end{aligned}$$
(9)

Intuitively, \({\textbf {e}}_o\) captures the total relevance between a question and all its constraints. Finally, the output embedding \({\textbf {e}}_o\) is transformed via Eq. 10 into a numerical value val for judgment.

$$\begin{aligned} val=\sigma (H^T\cdot {\textbf {e}}_o)-\theta \end{aligned}$$
(10)

Here, matrix \(H\in \mathbb {R}^{n\times d}\) is randomly initialized and optimized via training, and \(\theta \) is a predefined threshold for constraint selection, i.e., a constraint with val above zero can be accepted.
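A forward pass of MCSM following Eqs. (8)–(10) can be sketched in NumPy as below; reading H so that it yields one value per stored constraint is our interpretation of the notation, and the embedding dimensions are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_constraints(q_emb, constraint_embs, H, theta=0.8):
    """Forward pass of the memory-based selection, following Eqs. (8)-(10).
    q_emb: (d,) question embedding; constraint_embs: (n, d) constraint memory;
    H: (n, d) trainable projection; theta: acceptance threshold (0.8 in this work)."""
    # Eq. (8): relevance between the question and each stored constraint
    r = sigmoid(constraint_embs @ q_emb)                 # shape (n,)
    # Eq. (9): output memory as the relevance-weighted sum of constraint embeddings
    e_o = (constraint_embs * r[:, None]).sum(axis=0)     # shape (d,)
    # Eq. (10): judgement values; constraints with val > 0 are accepted
    val = sigmoid(H @ e_o) - theta                       # shape (n,)
    return val, r
```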

After the constraints are determined, the core path \(p_c\) is updated as follows. An entity constraint is attached to a node (excluding the topic entity and answer node) in \(p_c\). A type constraint is connected to the answer node directly. If the constraint is a time constraint or an ordinary constraint, \(p_c\) is extended by connecting the constraint to the middle node or the answer node. After the above steps, the query graph is generated, and the final answer can be easily obtained.

4 Experimental Studies

4.1 Settings

We introduce details of the model implementation, datasets and baseline methods.

Model Implementation. Our models were implemented in Keras v2.2.5 with CUDA 9.0 running on an NVIDIA Tesla P100 GPU.

Knowledge Base. In this work, we use \(\mathsf {Freebase}\)  [6], which contains 5,323 predicates, 46 million entities and 2.6 billion facts, as our knowledge base. We host \(\mathsf {Freebase}\) with the Virtuoso engine to search for entities and relations.

Questions. We adopted the following question sets for performance evaluation.

  • \(\mathsf {CompQuestion}\)  [3] contains 2,100 questions collected from Bing search query log. It is split into 1,300 and 800 for training and testing, respectively.

  • \(\mathsf {WebQuestion}\)  [4] contains 5,810 questions collected from Google Suggest API and is split into 3,778 training and 2,032 testing QA pairs.

Baseline Methods. We compared our approach with the following baseline methods: (1) Yih et al. 2015 [27], a technique for staged query graph generation; (2) Bao et al. 2016 [3], extended from Yih’s work with techniques for constraint detection and binding; (3) Luo et al. 2018 [17], which treats a query graph as semantic components and introduces a method to improve the ranking model; (4) Hu et al. 2018 [15], which gains the best performance on \({\mathsf {WebQuestion}} \) with a state-transition mechanism; and (5) Chen et al. 2020 [8], which achieves the best performance on \({\mathsf {CompQuestion}} \).

4.2 Model Training

We next introduce training strategies used by our models.

Training for DSN. The DSN is fed with a collection of question pattern and core path pairs for training. To improve performance, a careful strategy is applied to ensure the correctness of positive samples as much as possible. Specifically, for each question \({\textbf {Q}}\) in the original training set, a core path that takes the topic entity e in \({\textbf {Q}}\) as the starting node, has length no more than two in the underlying \(\mathsf {KB}\), and has an F1 score no less than 0.5 is chosen as training data. Here F1 is defined as \(\frac{2 * {\mathsf {precision}} * {\mathsf {recall}}}{{\mathsf {precision}} + {\mathsf {recall}}}\), where \({\mathsf {precision}} =\frac{\#{\mathsf {true}} \_{\mathsf {paths}} \_{\mathsf {predicted}}}{\#{\mathsf {predicted}} \_{\mathsf {paths}}}\) and \({\mathsf {recall}} =\frac{\#{\mathsf {true}} \_{\mathsf {paths}} \_{\mathsf {predicted}}}{\#{\mathsf {true}} \_{\mathsf {paths}}}\).
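The F1 filter used to pick positive core paths can be written as a small helper; the set-based comparison of predicted and true paths below is an illustrative assumption about how the counts in the definition above are obtained.

```python
def path_f1(predicted_paths, true_paths):
    """F1 between predicted and true core paths, per the definition above."""
    hits = len(set(predicted_paths) & set(true_paths))   # true paths predicted
    if hits == 0:
        return 0.0
    precision = hits / len(set(predicted_paths))
    recall = hits / len(set(true_paths))
    return 2 * precision * recall / (precision + recall)

# a candidate core path is kept as a positive training sample only if path_f1(...) >= 0.5
```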

Training for ACPGM. To train ACPGM, a collection of question pattern and core path pairs \(\langle {\textbf {P}},R\rangle \) needs to be prepared. Besides the set of paths identified for training the DSN, n incorrect paths are chosen randomly as negative samples for each question pattern \({\textbf {P}}\). In this work, the margin \(\gamma \) is set to 0.1, n is set to 5, and h, the number of hops, is set to 3.

Training for MCSM. The constraint with the largest F1 score is deemed the golden constraint. As a multi-label classification model, MCSM calculates the binary cross-entropy loss for each class label and sums all the label losses as the final loss. The probability threshold \(\theta \) is set to 0.8.

During training, we use minibatch stochastic gradient descent to minimize the pairwise training loss. The minibatch size is set to 256. The initial learning rate is set to 0.01 and gradually decays. We use wiki answer vectors as the initial word-level embeddings.

4.3 Results and Analysis

We use the F1 score over all questions as the evaluation metric. Here, the F1 score for the final answer is defined similarly as before, except that \(\mathsf {precision}\) and \(\mathsf {recall}\) are defined on answer entities rather than core paths.

Table 1. Performance evaluation (F1)

Overall Performance. Table 1 shows the results on the two benchmark datasets. As can be seen, our method achieves 53.2% and 42.6% F1 on \({\mathsf {WebQuestion}} \) and \({\mathsf {CompQuestion}} \), respectively, indicating that our method outperforms most of the state-of-the-art works w.r.t. F1 score.

We notice that Chen et al. 2020 [8] claimed higher performance on both datasets. However, that work introduced additional information, proposed in [17], as ground truth during training, while we just use the original \({\mathsf {WebQuestion}} \). Moreover, Hu et al. [15] introduced external knowledge to resolve implicit relations and achieve higher performance on \({\mathsf {WebQuestion}} \).

Performance of Sub-modules. As our approach works in a pipelined manner, the overall performance is influenced by the sub-modules. It is hence necessary to show the performance of each sub-module. To this end, we introduce a metric, which is defined as \({\mathsf {Acc}} _u=\frac{|{\mathsf {TP}} |}{|\text {test}|}\), where \({\mathsf {TP}} \) is a set consisting of intermediate results returned by a sub-module, such that each of them can imply a correct answer, and \({\mathsf {test}} \) refers to the test set. Intuitively, the metric \({\mathsf {Acc}} _u\) is used to show the upper bound of the prediction capability of each sub-module. Besides \({\mathsf {Acc}} _u\), the metric F1 is also applied for performance evaluation. Table 2 shows \({\mathsf {Acc}} _u\) and F1 scores of sub-modules on \({\mathsf {WebQuestion}} \) and \({\mathsf {CompQuestion}} \), respectively.

Table 2. Performance evaluation w.r.t. sub-modules

Topic Entity Recognition. An identified entity is treated as correct if it can lead to the “golden” answer; otherwise, it is regarded as an incorrect prediction. Based on this criterion, the \({\mathsf {Acc}} _u\) and F1 scores w.r.t. topic entity recognition can be defined. As shown, the \({\mathsf {Acc}} _u\) of the sub-module reaches 93.5% on \(\mathsf {WebQuestion}\) and 93.4% on \(\mathsf {CompQuestion}\). This shows that our sub-module for topic entity recognition performs quite well. However, the F1 scores of the sub-module only reach 77.4% and 64.4% on \(\mathsf {WebQuestion}\) and \(\mathsf {CompQuestion}\), respectively. The gap between the \({\mathsf {Acc}} _u\) and F1 scores partly lies in the incompleteness of the answer sets of questions in the test set. Taking \(\mathsf {WebQuestion}\) as an example, it was constructed manually via Amazon Mechanical Turk (AMT), and each question corresponds to no more than ten answers.

Core Path Generation. As shown in Table 2, the \({\mathsf {Acc}} _u\) of our core path generation sub-module is 65% on \(\mathsf {WebQuestion}\) and 60.3% on \(\mathsf {CompQuestion}\), while the F1 scores on \(\mathsf {WebQuestion}\) and \(\mathsf {CompQuestion}\) are 54.7% and 44.3%, respectively. The figures tell us the following. (1) It is more difficult to determine a correct core path on \(\mathsf {CompQuestion}\) than on \(\mathsf {WebQuestion}\), as both the \({\mathsf {Acc}} _u\) and F1 scores on \(\mathsf {CompQuestion}\) are lower than those on \(\mathsf {WebQuestion}\), which is consistent with the observation on both datasets. (2) Both metrics of this sub-module are significantly lower than those of the sub-module for topic entity recognition.

There still exists a big gap between the \({\mathsf {Acc}} _u\) and \(F_1\) scores on the two datasets; we make further analysis in the error analysis part.

Constraints Selection. After constraints selection, the final answers are obtained. Thus the F1 score of the sub-module is the same as that of the entire system.

We also compared MCSM with a variant (denoted as MemN2N) of the model introduced by [19] to show its advantages w.r.t. training cost. Compared with MCSM, MemN2N leverages two matrices to produce different embeddings of constraints, which complicates model training. As shown in Table 3, MCSM consumes less time for model training while achieving almost the same F1 score, no matter how the training and validation sets are split. This shows that MCSM is able to achieve high performance with reduced training cost.

Table 3. MCSM vs. MemN2N

5 Conclusion

In this paper, we proposed a comprehensive approach to answering complex questions on knowledge bases. A novel metric for measuring entity similarity was introduced and incorporated into our topic entity recognition. As another component, our attention-based core path generation model leveraged an attention scheme to determine the best core paths, based on both explicit and implicit information. A memory network with a compact structure was developed for constraint selection. Extensive experiments on typical benchmark datasets show that (1) our approach outperformed most existing methods w.r.t. F1 score; (2) our sub-modules performed well, i.e., with high F1 values; (3) our sub-modules, e.g., the constraint selection model, were easy to train and consumed less training time; and (4) implicit information was verified to be effective for determining core paths.