
1 Introduction

Interrogative sentences play a very important role in communication. Questions carry cognitive meaning without forming an ordinary attributive proposition, since they usually contain neither affirmation nor negation. They serve primarily to acquire knowledge and are thus key tools of cognition (Nevolnikova 2004).

From the natural language processing perspective, interrogative sentences are often key language material when one works on such tasks as question answering, building dialogue agents, and discourse modeling. In this paper, we focus on automatic question typology for the domain of question answering. According to Burger et al. (2001), identifying question classes is the first step in building a QA system.

In contrast to information retrieval (IR) systems, which usually return a set of documents relevant to some keywords, question-answering (QA) systems aim at yielding an exact answer to the question (Monz 2003a, b). Question answering is considered a more difficult task due to the constraints on input (natural language questions vs keywords) and output (focused answers vs entire documents) representation (Bunescu and Huang 2010). Yet, the benefit of QA systems is that they do not overwhelm the user with excess information (Galea 2003).

Various systems have been developed for answering questions in English (e.g. TEQUESTA (Monz 2003a, b), START (Katz et al. 2006), OpenEphyra (van Zaanen 2008), IBM Watson (Ferrucci et al. 2010), EAGLi (Gobeill et al. 2012)). An excellent survey of state-of-the-art learning-based methods for English-language question classification was conducted by Loni (2011).

As far as Russian-language QA systems are concerned, they have received very limited coverage in the literature. Although a monograph by Sosnin (2007) and a few research papers, such as (Suleymanov 2001; Tikhomirov 2006; Mozgovoy 2006; Solov’ev and Peskova 2010), have been published, they mostly contribute to the theory of the problem rather than proposing and/or evaluating efficient practical solutions.

Our work differs from previously published papers in that we attempt to solve the practical task of automatically classifying Russian questions using rich class sets (23 question types in the fine-grained set and 14 in the coarse-grained set), whereas other papers have taken different approaches. For example, in (Sosnin 2007), only three general question classes are distinguished: (1) “problem”, (2) “task”, and (3) “inquiry” (p. 118). Alternatively, in (Solov’ev and Peskova 2010), a paper devoted to question analysis for a Russian-language question answering system, a detailed question taxonomy was used. It was borrowed from (Ittycheriah 2008) with some modifications, but we consider it too fine-grained to be practical: it has 39 classes, some of which are very specific and rare, e.g. Organ, Salutation, Plant. Also, the differences between some of the classes do not seem well-defined, e.g. Areas vs Geological objects vs Location vs Country, or Company-roles vs Occupation. Furthermore, the question analysis technique used was trivial: a limited number of keywords were simply searched for in the question. Expectedly, the performance of the module was quite low, with a 67% error rate (Solov’ev and Peskova 2010, p. 48). In contrast, we report question tagging accuracy of up to 68.7%, although the results are obviously not directly comparable because different classifications are used.

In the next section, we describe the question typology that the developed classifier is based on. Sections 3 and 4 are devoted to the regular expression baseline and the machine learning classification methods used, respectively. Section 5 discusses the results attained. Finally, in Sect. 6 we draw conclusions and outline some directions for future work.

2 Interrogative Sentences in the Russian Language

In our research, we focus on the functional aspect of interrogative sentences. According to Shvedova (1980), the information sought by an interrogative sentence can be of various kinds: about the subject of some action (Кто это сделал?/Who did it?), the object (Что было сделано?/What was done?), the goal (Для чего ей это?/Why does she need this?), etc. The main means of question formation are intonation, interrogative particles (ли, так, верно, как, что ли, etc.), interrogative pronominal words (где, куда, когда, откуда, почему, зачем, как, etc.), and word order. Some of these means can be used as features in classification.
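
As a simple illustration of how such surface cues could be turned into classification features, consider the sketch below. It is a hypothetical example: the cue lists and feature names are ours for illustration only and are not the feature set used in the experiments reported later.

```python
import re

# Illustrative cue lists (a subset of the particles and pronominal words mentioned above).
PARTICLES = {"ли", "верно"}
PRONOMINALS = {"кто", "что", "где", "куда", "когда", "откуда", "почему", "зачем", "как"}

def surface_features(question: str) -> dict:
    """Extract simple surface cues that can hint at the question type."""
    tokens = re.findall(r"\w+", question.lower())
    return {
        "has_particle": bool(PARTICLES & set(tokens)),
        "pronominal": next((t for t in tokens if t in PRONOMINALS), None),
        "first_token": tokens[0] if tokens else "",
    }

print(surface_features("Зачем они пьют чай?"))
# {'has_particle': False, 'pronominal': 'зачем', 'first_token': 'зачем'}
```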

The first question typology that we considered was developed by Shvedova (1980). She divides interrogative sentences as follows:

  1. According to the type and volume of required information:

     a. General: acquiring information about the situation as a whole. Что происходит в Китае?/What is happening in China?

     b. Special: acquiring information about a particular aspect of the matter. Какие песни сделали её известной?/Which songs made her famous?

  2. According to what answer is expected:

     a. Requiring a confirmation (yes/no or true/false): Лондон – столица Англии?/Is London the capital of England?

     b. Requiring information: Зачем они пьют чай?/What do they drink tea for?

It can be seen that this typology is too general for the purposes of automatic question answering. Another functional classification we used for building our own typology is Graesser’s Taxonomy of Inquiries (Lauer et al. 2013). The taxonomy is presented in Table 1.

Table 1. Arthur Graesser’s Taxonomy of Inquiries

Based on these classifications and drawing upon the functional aspect of interrogative sentences in Russian, we created a new question typology for the purposes of our research (Table 2).

Table 2. Russian question typology

As can be seen from the table, there are 23 distinct question classes in our typology. Since some classes are further divided into subclasses, the typology is in fact a taxonomy of question types. If we collapse the subclasses into their parent classes, we obtain 14 general classes (the coarse-grained class set). We use both the fine-grained and coarse-grained class sets in our experiments (Sect. 4).

Some comments need to be made on our classification. Firstly, the proposed classification is by no means a complete listing of all possible question types in Russian. When developing it, we aimed for a balance between completeness and practicality. Thus, certain question types that are not so commonly used in practice were omitted, e.g. как часто? (how often?) or до какого времени? (up to what time?).

Another important comment is that not all wordings for each question type are listed in the table; for the sake of brevity, we list only the most common ones, although other wordings are possible. For example, apart from почему? (why?), a question of the “Reason” type may be worded as: по какой причине? (for what reason?), отчего? (why?), etc.

The “General” question type may need some explanation. In questions of this type, a general summary of a situation is requested, e.g. Что происходит в Китае?/What is happening in China? In contrast, “Verification” questions require a short yes/no answer, e.g. Верно ли, что кошки не летают?/Is it true that cats don’t fly?

In the following two sections, we will describe our attempt at solving the Russian question classification task using the developed typology of questions.

3 Baseline Question Tagging Method: Regular Expressions

Most QA systems first classify questions based on the type (related to such question words as “What”, “Why”, “Who”, “How”, “Where”), which is followed by the identification of the answer type (Damljanovic et al. 2010). Manually designed regular expressions are the most obvious tool for identifying the question type, and they are successfully used in many QA systems, e.g. TEQUESTA (Monz 2003a, b).

We implemented our own regular-expression-based question classifier in Python 3. With a pattern of varying complexity for each question type, this classifier served as the baseline method in our work. Some examples of the patterns are given in Table 3.

Table 3. Sample regular expression patterns for Russian interrogative sentences

In our implementation, identifying the type of a question with regular expressions is incremental. The question string is matched, via the re.match(pattern, string) method, against all patterns in turn, starting with the simplest (1, 2, 71, 72, 73) and ending with the most complex (8, 10, 11). The last matching pattern is chosen as the hypothesized answer.
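
A minimal sketch of this incremental matching scheme is given below. The actual patterns from Table 3 are not reproduced here, so the pattern strings and class names are illustrative placeholders rather than the baseline's real rules.

```python
import re

# Placeholder patterns, ordered from simplest to most complex;
# the real patterns are listed in Table 3.
PATTERNS = [
    ("Verification", r".*\bли\b"),
    ("Reason",       r"(почему|отчего|по какой причине)\b"),
    ("Goal",         r"(зачем|для чего|с какой целью)\b"),
]

def classify(question: str) -> str:
    """Match the question against every pattern; keep the last one that matches."""
    label = "Unknown"
    for name, pattern in PATTERNS:
        if re.match(pattern, question.lower()):
            label = name  # later (more complex) patterns override earlier matches
    return label

print(classify("Почему небо голубое?"))  # -> Reason
```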

We evaluated the baseline classifier (as well as other classifiers described in the next section) on a held-out test set of 150 questions manually annotated by an external expert in accordance with our classification. The regular expression classifier correctly matched 79 questions (52.7% accuracy).

4 Applying Machine Learning to Question Tagging in Russian

Many early approaches to determining question types employ manually designed rules or regular expressions and are non-probabilistic in that a pattern or set of conditions is either matched or not. Pinchak and Lin (2006) point out, however, that such an approach has two major drawbacks:

  1. There will always be questions whose types do not match the patterns;

  2. The predetermined granularity of the categories leads to a trade-off between how well they match the actual question types and how easy it is to build taggers and classifiers for them.

Thus, a probabilistic answer type model that directly computes the degree of match between a potential answer and the question context is much more effective. Such algorithms are used in a number of works on QA for English, e.g. (Li and Roth 2002; Zhang and Lee 2003; Pereira et al. 2009).

To train our question classifiers, we needed a set of questions in Russian annotated with question types. Since no suitable collection of questions was available, we semi-automatically extracted a set of 2008 questions from the Russian Internet Corpus (Sharoff et al. 2006) and manually tagged each question in accordance with our typology. We used two different annotation sets: one with the simplified “Concept Completion” and “Quantity” class groups (14 classes total), and one with the original detailed classification (23 classes).
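
The automatic part of this extraction step can be sketched roughly as follows. This is a simplified illustration of our own, not the exact procedure used: the actual extraction was semi-automatic and also involved manual filtering and verification.

```python
import re

# A rough heuristic: keep sentences that end in '?' and contain an
# interrogative cue word or the particle "ли".
INTERROGATIVES = {"кто", "что", "где", "куда", "когда", "откуда",
                  "почему", "зачем", "как", "какой", "сколько", "ли"}

def extract_candidate_questions(text: str) -> list:
    """Return sentences from raw corpus text that look like questions."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    candidates = []
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence.endswith("?"):
            tokens = set(re.findall(r"\w+", sentence.lower()))
            if tokens & INTERROGATIVES:
                candidates.append(sentence)
    return candidates
```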

The questions were each converted into five different bag-of-words representations (character bigrams, character trigrams, word unigrams, word bigrams, word trigrams) and then used to train several traditional machine learning algorithms. This was done in RapidMiner, a cross-platform software framework developed on an open-core model and providing multiple solutions for machine learning, data and text mining, predictive analytics, etc. (Klinkenberg 2013).
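
The five representations can be reproduced, for instance, with scikit-learn's CountVectorizer; this is only an illustrative equivalent, since the experiments themselves were run in RapidMiner.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The five bag-of-n-gram representations used in the experiments.
representations = {
    "char_bigrams":  CountVectorizer(analyzer="char", ngram_range=(2, 2)),
    "char_trigrams": CountVectorizer(analyzer="char", ngram_range=(3, 3)),
    "word_unigrams": CountVectorizer(analyzer="word", ngram_range=(1, 1)),
    "word_bigrams":  CountVectorizer(analyzer="word", ngram_range=(2, 2)),
    "word_trigrams": CountVectorizer(analyzer="word", ngram_range=(3, 3)),
}

questions = ["Зачем они пьют чай?", "Верно ли, что кошки не летают?"]
X = representations["word_unigrams"].fit_transform(questions)
print(X.shape)  # (2, number of distinct word unigrams)
```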

For the automatic classification of questions, we trained (using the one-vs-all classification strategy) three different machine learning algorithms: naïve Bayes, support vector machine, and logistic regression. Ten different datasets, depending on the class set (fine-grained or coarse-grained) and the type of representation (word unigrams/bigrams/trigrams and character bigrams/trigrams), were used with each algorithm. During evaluation, we ensured that the system relied only on the n-grams observed in training: all n-grams not seen in the training data were ignored, i.e. no strategy for dealing with unknown (previously unseen) n-grams was implemented.
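
The overall training and evaluation setup can be approximated as in the following scikit-learn sketch with toy data; it stands in for the RapidMiner experiments and is not the exact pipeline used. Note that a vectorizer fitted on the training data alone silently drops unseen n-grams at prediction time, which matches the "ignore unknown n-grams" policy described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Toy stand-ins for the 2008 annotated training questions and the
# 150-question held-out test set.
train_questions = ["Зачем они пьют чай?", "Почему небо голубое?",
                   "Верно ли, что кошки не летают?", "Зачем нужна ложка?"]
train_labels = ["Goal", "Reason", "Verification", "Goal"]
test_questions = ["Зачем они пьют чай по утрам?"]
test_labels = ["Goal"]

# Word-trigram bag-of-words, L1 ("proportion") row normalization, and a
# one-vs-all linear SVM, approximating the best configuration reported below.
model = make_pipeline(
    CountVectorizer(analyzer="word", ngram_range=(3, 3)),
    Normalizer(norm="l1"),
    OneVsRestClassifier(LinearSVC()),
)
model.fit(train_questions, train_labels)
predicted = model.predict(test_questions)
print(predicted, accuracy_score(test_labels, predicted))
```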

The trained models were tested on a held-out test set of 150 questions manually annotated by an external expert in accordance with our classification. The training and test sets of questions, as well as the models used in this study, are made freely available to the community (Footnote 1). The distribution of coarse-grained question types in the training and test sets is illustrated in Table 4. Table 5 shows the evaluation results for both the fine-grained and coarse-grained class sets.

Table 4. Question type distribution in the training and test sets.
Table 5. Prediction accuracy for question tagging.

We also studied how prediction quality changed when the training data set was normalized and when different SVM kernels were used. This is illustrated in Table 6:

Table 6. Normalization, linear SVM kernel and system performance

Thus, the best classification result (65.3% acc. fine-grained and 68.7% acc. coarse-grained) was achieved by a support vector machine with a linear kernel, using the word trigram text representation. This is consistent with the state-of-the-art results for English question classification reported in (Silva et al. 2011), also obtained with a linear SVM classifier.

5 Discussion

After evaluating all classification algorithms trained on the different datasets, we can make the following observations. Firstly, as expected, the coarse-grained class set proved an easier task and resulted in higher accuracy than the fine-grained set. The accuracy difference ranged from 0.6% for naïve Bayes to 8.7% for logistic regression (considering the best text representation model for each ML algorithm). Using normalization and the linear kernel for the SVM allowed us to increase classification accuracy to its highest values of 65.3% and 68.7% for the fine-grained and the coarse-grained class sets, respectively. For comparison, the regular expression baseline showed a stable 52.7% accuracy, which proved considerably better than naïve Bayes but less effective than the other two algorithms.

The accuracy is quite low in comparison with the results obtained for English-language datasets. Indeed, the state-of-the-art result reported by Silva et al. (2011) is 95% and 90.8% for 6 coarse-grained and 50 fine-grained classes, respectively. However, this is to be expected for a number of reasons. Firstly, Silva et al. used a much larger dataset, published by Li and Roth (2002), consisting of 5500 training and 500 test questions. Secondly, there are many freely available NLP tools for English, which allow various features useful for question classification to be extracted; in particular, apart from word unigrams, the classifier by Silva et al. used such features as headwords, hypernyms, and indirect hypernyms. Lastly, a great deal of research has been published on question classification for English, as shown in (Loni 2011), which cannot be said of Russian-language question tagging. Since the Russian language is very different from English in both morphology and syntax, the methods used for classifying English questions are not always equally effective or even applicable for Russian.

The confusion matrix (not presented here due to size constraints, but available upon request) for the top-accuracy algorithm was also analyzed. It was found that the “Instrument”, “Example”, “Action”, and “Definition” questions were predicted with less than 50% accuracy. These categories occurred in the training set with low relative frequencies: 0.004, 0.004, 0.029, and 0.031, respectively. The most accurately predicted question types were “Consequence”, “Reason”, and “Goal”.

6 Conclusion and Future Work

The main theoretical contribution of our paper is a classification of Russian questions consisting of 14 coarse-grained and 23 fine-grained classes. Using this typology, we have applied machine learning methods to the task of automatic classification of Russian questions. We tested a regular expression baseline and three different classifiers using five different sets of features (character bi-/trigrams, word uni-/bi-/trigrams). The best classifier, an SVM with a linear kernel trained on word trigrams pre-normalized via a proportion transformation, achieved classification accuracies of 65.3% and 68.7% for the fine-grained and the coarse-grained class sets, respectively, compared with the 52.7% baseline result (regular expression model).

Presented in this paper is one of the very few attempts to solve the Russian question tagging task using large class sets. The methods employed in our work showed relatively good results (given the class set sizes) in comparison with what has been reported for Russian-language question classification in the literature, although there is still a lot of room for improvement. Indeed, much work needs to be done to approach the above-90% accuracy achieved for English by the research community.

Our work can be used for building a complete Russian QA pipeline or as a standalone question tagging solution. We believe that it should be possible to improve our results by further expanding the training data set, as well as using more complex machine learning algorithms (neural networks) and features (word2vec).