Keywords

1 Introduction

Subsequent to the Japanese Todai project [5], China has launched an ambitious initiative to have AI pass the National Higher Education Entrance Examination, commonly known as Gaokao. To meet this grand challenge, advances in AI technologies can in the meantime benefit future-generation computer-aided education, e.g., enabling applications that can interact with learners and act as mentors, or provide assistance to teachers. This long-term goal drives our research of an approach that is capable of not only correctly answering complex questions but also solving problems like human intelligence, i.e., to have an interpretable problem solving process.

Previous research focused on history exams in Gaokao [2, 14]. In this paper, we present the first study on geography exams. An example of geography questions is illustrated in Fig. 1, which appears more difficult than history questions for both students and machines due to not only the understanding and exploitation of graphical presentations (e.g., maps and charts) but also the complexity and diversity in problem solving. These distinguishing features have turned previous solutions to history questions and existing methods for question answering (QA) ineffective. Specifically, our investigation of more than one thousand real geography questions finds that: quantitative methods (e.g., math calculation) and qualitative methods (e.g., text retrieval) are almost equally important; most questions require mixing them together in a hybrid reasoning process.

Fig. 1.
figure 1

An example of geography question in Gaokao based on a map and a chart.

Therefore, we propose a hybrid approach having a pipeline design to enable interpretable problem solving. Our contributions are threefold.

  • We collect, investigate, and make available 1,863 geography questions used in Gaokao or mock exams in recent years. In Sect. 2, we survey the difficulties in answering these questions and summarize the new technical challenges they bring to AI, which can hardly be met by existing methods.

  • Our proposed approach mixes information retrieval techniques with knowledge engineering. Section 3 overviews its pipeline. Section 4 describes our current implementation consisting of question parsing, knowledge graph construction, semantic matching, spreading activation and answer generation.

  • We carry out extensive experiments on the collected questions, and create a gold standard for the output of each component of the proposed approach. In Sect. 5 we evaluate the approach and its components, show their effectiveness, and analyze their failures.

2 Challenges and Related Work

Differences Between Geography and Other Subjects. Research attention has been given to complex questions appearing in entrance exams at different levels of education including Gaokao. The OntoNova system [1] represents chemistry knowledge in F-Logic, to answer formal queries regarding complex chemical concepts and reactions via rule reasoning. Similar approaches are used to answer physics and biology questions [6]. Different from these natural or formal sciences which require a fully formal method with an emphasis on deep reasoning, history [2] and elementary science questions [3] are solved primarily by using information retrieval techniques supplemented by shallow reasoning.

Geography is distinguished from the above-mentioned subjects in problem understanding and solving. As illustrated in Fig. 1, geography questions are unique because they always involve maps, charts, or other graphical presentations from which information is to be extracted as it is usually essential to problem solving. The understanding of the natural language expressing those questions (e.g., Q1–Q3 in Fig. 1) is distinctively challenged by: the lack of a vocabulary of geographical concepts (e.g., terrain and climate in Q2), the extensive use of unbounded commonsense knowledge (e.g., state-level new area and big data industry in Q3), and many new question types that are specific to geography. To solve those questions which often exhibit a hybrid nature, the techniques we need include: (i) all kinds of knowledge representations (e.g., ontological, causal, spatial, and temporal relations) and reasoning capabilities (e.g., logical reasoning, math calculation) which have only been partially considered in answering physics and chemistry questions, (ii) qualitative methods (e.g., text retrieval and summarization) which have only been preliminarily used in answering history questions, and more importantly, (iii) an effective mixture of these methods within a single problem solving process.

Challenges to Traditional Methods for QA. Geography questions in Gaokao are even further beyond the capabilities of traditional methods for QA. None of the above-mentioned challenges can be met by early information retrieval solutions [9] or recent semantic parsing techniques [10] and QA systems [4]. In fact, those methods primarily focus on retrieving answers to short factoid questions, thereby not fitting geography questions in Gaokao. As illustrated by Q3 in Fig. 1, geography questions in Gaokao can be too long to be processed by existing QA solutions, and are often non-factoid expecting answers that are several generated sentences. Other end-to-end methods [12] suffer from lacking interpretability and hence diverge from the long-term goal of our research.

Therefore, not only novel techniques for QA are needed for effective and interpretable solving of geography questions, but also an effort of domain-specific knowledge engineering seems necessary.

3 Overview of the Approach

Towards meeting the identified challenges, we propose a hybrid approach which features a pipeline design and hence an interpretable problem solving process to facilitate future applications. The major design principle is to mix information retrieval techniques (e.g., ranking) with knowledge engineering, considering the sole use of information retrieval is not powerful enough and a fully formal method is unfeasible at the moment due to the complexity in knowledge representation and the difficulty in natural language understanding.

Fig. 2.
figure 2

Overview of the proposed approach.

As outlined in Fig. 2, we construct a knowledge graph (KG) for the geography domain. We choose this single but flexible representation as it can help us manage the complexity of knowledge engineering. To answer a question, our approach initially activates a set of nodes in the KG that can find semantic matches in the question text, and then iteratively spreads that activation to neighboring nodes for ranking. Question parsing returns the type and structure of the question, which are used to filter out irrelevant nodes. Finally, the remaining top-ranked nodes are selected, and an answer is generated by concatenating their text.

We also plan to annotate graphical presentations and produce facts which in turn will be enriched by reasoning over a knowledge base. The resulting facts will also be considered in semantic matching. However, this branch has not been incorporated into our implementation reported in this paper.

4 Detailed Implementation

We collect a representative sample of 1,863 geography questions from Gaokao or mock exams in recent years. The current implementation of our approach is guided by an investigation of this dataset. In this section, we detail our implementation using Q1–Q3 in Fig. 1 as a running example. We also describe the gold standard we create for the output of each component of our approach on that dataset, for characterizing real questions and for evaluation.

The hybrid nature of the approach is highlighted in Sect. 4.4, which is founded on the techniques presented in Sects. 4.1, 4.2 and 4.3.

4.1 Question Parsing

Question Decomposition. When applicable, we decompose a question into simpler atomic sub-questions to be separately solved, whose answers will be concatenated in the end. An atomic question in Gaokao is normally identified by an imperative sentence starting with a particular verb. We define 44 such Chinese verbs, e.g., describe, explain, summarize, and identify.

Table 1. Types of questions
Table 2. Relations in the KG (manual + automated)

Example 1

Q3 is decomposed into two sub-questions.

Q3a: Summarize the positive circumstances for developing the Gui-An base for the big data industry.

Q3b: Identify some measures for this industrial base to attract more companies.

We also use common Chinese coordinating conjunctions (e.g., and) and punctuation marks representing coordination to further decompose questions.

Example 2

Q2 is decomposed into two sub-questions.

Q2a: In terms of terrain, explain the negative effects of natural circumstances on the economic development of Guizhou.

Q2b: In terms of climate, explain the negative effects of natural circumstances on the economic development of Guizhou.

Gold Standard 1 . We manually analyze all the 1,863 questions, find 1,155 atomic questions, and decompose the other 708 questions (38%) into 1,556 atomic questions. There are a total of 2,711 atomic questions after decomposition.

Question Classification and Structuring. To capture the intent of a question, we detect its type and transform it into a type-specific semi-structured tuple representation of its arguments.

We identify 13 question types, plus MISC representing all the others, as shown in Table 1. We detect the type of a question using rules over 84 particular keywords in Chinese. Example rules are as follows.

  • A question ending with circumstances is a Circumstances question.

  • A question mentioning hazards, threats, or risks is a Troubles question.

A question will be in MISC if none of the rules are matched.

Fig. 3.
figure 3

An example of knowledge graph.

For each type of question, we define one or several question structures to represent the arguments of a question. Example question structures are as follows:

$$\begin{aligned} \text {Effects}~(\text {cause}, \text {recipient}, \text {aspect}, \text {polarity}), \end{aligned}$$
(1)
$$\begin{aligned} \text {Circumstances}~(\text {subject}, \text {aspect}, \text {polarity}). \end{aligned}$$
(2)

The first structure represents a question asking the effects of a particular polarity (i.e., positive or negative effects, or unspecified) that the cause has on the recipient, in terms of some aspect which is optional. The second one asks the circumstances which the subject is under, of an optional polarity, and in terms of an optional aspect.

As questions in Gaokao are mainly expressed in a handful of syntactic forms, we transform a question into a question structure based on template matching. An example template is as follows, written as a regular expression.

$$\begin{aligned}&((\text {In terms of})|(\text {Based on})~\text {aspect ,})? \nonumber \\&\text {explain}|\text {summarize}~\text {the polarity }?~\text {effects of cause on recipient }. \end{aligned}$$
(3)

Example 3

Q2a is transformed into the question structure in Eq. (1) by Eq. (3):

$$\begin{aligned} \text {Effects}~( \text {natural circumstances}, \text {the economic development of Guizhou}, \text {terrain}, \text {negative}) \,. \end{aligned}$$
(4)

Q3a is transformed into the question structure in Eq. (2):

$$\begin{aligned} \text {Circumstances}~( \text {the Gui-An base for the big data industry}, \text {developing}, \text {positive}). \end{aligned}$$
(5)

Gold Standard 2 . We manually classify all the 2,711 atomic questions. Their distribution is shown in Table 1. We also transform each of the 685 atomic Circumstances, Effects, and Measures questions into a question structure.

4.2 Knowledge Graph Construction

Our proposed implementation focuses on answering Circumstances, Effects, and Measures questions as they are less dependent on graphical presentations which are not addressed in this paper. These questions constitute 25% of all.

The remaining components of our approach center around a geography knowledge graph (KG) illustrated in Fig. 3, which is semi-structured as each node is a free text that can represent a concept, a fact, or any other thing, whereas each edge represents one of a set of predefined relations. We construct the KG in a semi-automatic way. A domain expert manually constructs the first version to cover Circumstances, Effects, and Measures knowledge in textbooks. The KG is then automatically completed with more edges and nodes to improve its quality.

Manual Construction. The domain expert constructs the KG in four steps. First, geography concepts extracted from textbooks and the official Gaokao guide are added to the KG as nodes. Then, relations extracted from textbooks and other relevant materials are added as edges. These relations, namely circumstance, effect, and measure, provide candidate answers to the corresponding types of questions. Next, more types of relations, e.g., is-a, aspect, and state, are added to connect nodes. Finally, the KG is refined by merging nearly duplicate nodes and substituting some relations (e.g., effect) with their sub-relations having polarity information (e.g., negative-effect). The resulting KG consists of 1,243 nodes and 1,979 edges. Table 2 shows its distribution of relations.

Gold Standard 3 . We manually map the correct answer of each Circumstances, Effects, and Measures question to a set of nodes in the KG, for evaluating the output of our approach.

Automated Completion. To complement human efforts, we devise an algorithm for automatically completing the KG. Concepts mentioned in existing nodes are recognized. If a concept has no corresponding node in the KG, it will be added as a new node. The concept is connected to the nodes mentioning it via proper relations.

Specifically, concepts are recognized using predefined POS patterns tagged by the Stanford Parser. For example, given a node text (in Chinese) tagged with a sequence of NNs followed by VA, these NNs will identify a (new) concept node, which will be connected to the original node via state. In this way, 241 nodes (+19%) and 367 edges (+19%) are automatically added to the KG, as detailed in Table 2.

Example 4

From the node Poorly-established infrastructure which (in Chinese) is tagged with NN NN VA, Infrastructure (tagged with NN NN) is recognized as a (new) node, and is connected to Poorly-established infrastructure via state.

4.3 Semantic Matching

We initially activate all the nodes in the KG that are matched with the question text. A node will be activated if its content words/phrases are all matched by the question text. We realize semantic matching by expanding each content word/phrase with its synonyms and hyponyms. Synonyms are imported from an online Chinese thesaurus. Hyponyms are obtained from a hierarchy of 960 domain terms defined by a domain expert.

Example 5

The node Terrain in Fig. 3 is matched with the word terrain in Q2a. The node High-tech industry in Fig. 3 is matched with its hyponym big data industry in Q3a.

Gold Standard 4 . We manually match the text of all the Circumstances, Effects, and Measures questions with nodes in the KG, and label a total of 3,823 matches.

4.4 Spreading Activation and Answer Generation

We concatenate the text of top-ranked nodes in the KG as an answer. Nodes are ranked by closeness to the initial activation found by semantic matching. For ranking, we use a spreading activation method (SA) called random walk with restart (RWR) [13]. SA iteratively spreads the amount of activation received by each node to its neighbors until convergence. During spreading, RWR allows to jump back to an initially activated node, thereby lifting the ranking of the nodes that are closer to the initial activation. This is the key step of our approach as it mixes information retrieval techniques (i.e., SA and RWR ranking) with knowledge engineering (i.e., KG) to discover implicit answers.

Example 6

As Q2a matches the node Terrain in Fig. 3, nodes that are closer to it will be ranked higher, e.g., Geological hazard compared with Poorly-established infrastructure.

Further, we leverage the structure of a question to filter out irrelevant nodes. For each structure we define one or several relation path expressions for traversing the KG, which are written as regular expressions. For example, we define the following relation path expression for the question structure in Eq. (2) when polarity is positive:

$$\begin{aligned} \text {subject }~\text {is-a}^*~\text {positive-circumstance}, \end{aligned}$$
(6)

which starts searching the KG from each node matched by the subject argument, passes through any number of is-a edges, and finally passes through a positive-circumstance edge and reaches a set of nodes which will be labeled as relevant; if subject does not match any node, search will start from every node. Irrelevant nodes will be removed from the answer. This step realizes answer type coercion with reasoning capabilities, e.g., is-a reasoning in this example.

Example 7

As the subject argument of Q3a’s question structure in Eq. (5) matches the node High-tech industry in Fig. 3, using the relation path expression in Eq. (6) we identify three relevant nodes: Government policies, Close to skilled workers, and Well-established infrastructure.

Table 3. Percentage of correctly classified questions
Table 4. Percentage of correctly structured questions

The proposed problem solving process is interpretable. We can extract from the KG those paths connecting returned nodes with initially activated nodes (which are matched with the question text), and present as answer explanations.

5 Evaluation

We evaluated our approach based on the collected questions and the gold standard we created. All the relevant resources have been made available at:

http://ws.nju.edu.cn/gaokao/ccks-18.zip.

5.1 Evaluation of the Entire Approach

Evaluation Design. In Gold Standard 3 we labeled a set of nodes in the KG as the correct answer to a question. Based on that, we configured both our approach and a baseline method to output k top-ranked nodes for each of the 685 questions, and calculated their average F1 scores on all the questions.

Baseline. We compared with BM25, a baseline widely used in information retrieval. To fit our problem, we adapted it to take nodes from circumstance, effect, and measure relations in the KG as candidate answers to Circumstances, Effects, and Measures questions, respectively, and rank nodes by the BM25 relevance between node and question text. On the other hand, semantic parsing and other knowledge-based methods for QA were not compared because they could not process our questions due to the unfitness discussed in Sect. 2.

Results. As shown in Fig. 4, the F1 score of the baseline was consistently below 0.1 under any setting of k. The overall F1 score of our approach reached 0.26 when \(k=5\), and exceeded 0.3 on Circumstances and Measures questions, significantly outperforming the baseline.

5.2 Evaluation of Components

Although largely exceeding the baseline, there was considerable room for improving our approach. We evaluated each component to identify their limitations.

Fig. 4.
figure 4

F1 score of k top-ranked nodes as an answer.

Fig. 5.
figure 5

Precision, recall, and F1 score of matching.

Question Decomposition. Among the 1,155 indecomposable atomic questions in Gold Standard 1, our method mistakenly decomposed 40 (3%) as they contained triggering verbs or coordinating conjunctions. Among the other 708 non-atomic questions, the decomposition outputted by our method was exactly the same as the gold standard on 595 (84%); the generated sub-questions contained insignificant errors on 105 (15%), mainly due to incorrect phrase structure trees about coordinating conjunctions; the decomposition contained significant errors only on 8 (1%). To conclude, out of the 1,863 questions, the automated decomposition of 1,815 (97%) was (nearly) correct.

Question Classification and Structuring. Among the 2,711 atomic questions in Gold Standard 2, our method correctly classified 2,589 (96%). As shown in Table 3, the accuracy exceeded 90% on most question types, though relatively low on Opinions and Transport questions because they were given low priority when a question could match multiple rules leading to different types.

Among the 685 Circumstances, Effects, and Measures questions in Gold Standard 2, our method transformed 583 (85%) into correct question structures, as shown in Table 4. Incorrect transformation was mainly due to incorrect word segmentation, an open problem for Chinese processing.

Semantic Matching. In Gold Standard 4 we labeled 3,823 matches. Based on that, we calculated the average precision, recall, and F1 score of the matches generated by our method on all the questions. As shown in Fig. 5, the precision of string matching was high (0.94), but the recall was unsatisfactory (0.43). Expanding content words/phrases with synonyms and hyponyms, recall notably increased by 0.03 and 0.16, respectively, whereas the decreases in precision were negligible, so F1 scores rose. Using both synonyms and hyponyms, F1 score reached 0.75. Our method mainly failed to find partial matches and deep semantic matches. We did not allow partial matches because otherwise precision would drop largely. To identify deep semantic matches representing implicit relatedness, e.g., the match between the text state-level new area and the node Government policies in the KG, more sophisticated semantic matching techniques would be needed, or the KG should be substantially extended to cover more concepts and relations such that implicit matches would more likely be identified during spreading activation.

Spreading Activation and Answer Generation. In addition to the evaluation of the entire approach in Sect. 5.1, we also evaluated the k top-ranked nodes outputted by our spreading activation method using manually labeled matches in Gold Standard 4 as initial activation. F1 score was raised to 0.31 (from 0.26) when \(k=5\), still not very high mainly due to the incompleteness of the KG. Many concepts mentioned in question text were not covered by the KG, thereby incapable of effectively guiding the spreading activation.

Concluding Remarks. Our question parsing, though based on a few simple rules and templates, generated high-quality results. Semantic matching was satisfactory but could be improved. The incompleteness of the KG became a bottleneck of our current implementation.

6 Future Directions

We present the first research effort to answer geography questions in Gaokao. We have identified the difficulties in problem understanding and solving, and we invite the community to continue contributing solutions to this new AI challenge, as it will benefit future-generation computer-aided education and push the boundaries of AI. To provide supporting resources, we have made all our data publicly available.

Our major technical contribution is a pipeline approach which mixes information retrieval techniques with knowledge engineering and realizes interpretable problem solving. Effectively integrating question parsing, semantic matching, spreading activation and answer generation over a constructed knowledge graph, our approach exhibits a non-trivial capacity to answer complex questions and significantly outperforms a baseline in an extensive experiment. The result is promising as graphical presentations in questions and more powerful reasoning capabilities are still yet to be exploited.

Several open problems have surfaced to be addressed in future work. (i) To extend the coverage of the KG, we are seeking practically effective methods for automated construction of KG. Recent studies have shown it is possible to automatically extract specific relations of reasonably good quality from text to answer the corresponding type of questions, e.g., causality [7]. We will adapt those methods to build a large-scale KG automatically and evaluate their effectiveness. (ii) Although KG as a single flexible representation has successfully helped us manage the complexity of knowledge engineering, it is unlikely to fit all question types. We will consider and integrate other knowledge representations which are possibly associated with different types of reasoning. (iii) Despite the effectiveness of rule- and template-based question parsing in our application, more powerful and robust techniques are still needed to process user-generated questions in potential future applications which are less formal than those in Gaokao. In particular, to better understand and use more information (e.g., newly defined concepts and laws) provided in question text, we will experiment with learning [8] and semantic parsing methods [11] which have been used to solve verbally expressed number problems.