
1 Introduction

A question answering (QA) system answers users' questions in accurate and concise natural language. It generally includes three modules: question understanding, information retrieval and answer extraction. Question understanding is the first step in a QA system and can be divided into three parts: question classification, keyword extraction and keyword expansion [1]. Question classification is an important part among them. Its goal is to assign a category to each question, and this category represents the type of answer expected for the question [2]. Question classification serves two main functions in a QA system. On the one hand, it can effectively reduce the search space and time for answers. On the other hand, different answer extraction strategies can be formulated according to the type of question. Therefore, question classification performance affects the performance of a QA system to a large extent [3].

Traditional question classification methods include rule-based approaches and statistical machine learning. Rule-based approaches construct question classification rules manually or semi-automatically, which is time-consuming and makes it difficult to guarantee the completeness of the rules. Methods based on statistical machine learning need to define question features manually and then select a machine learning model for classification. Common features include POS, named entities, central words, syntactic structures and semantic relations [4,5,6,7,8]. To obtain these features, it is generally necessary to run Natural Language Processing (NLP) tasks such as POS tagging, syntactic analysis and semantic analysis on the questions, and the accuracy of these NLP tasks has a great impact on the accuracy of question classification. In addition, hand-crafted rules and feature selection are somewhat subjective, and it is difficult to ensure the completeness of the rules and features, which may lead to a failure to fully understand the semantics of questions.

In recent years, deep learning has been widely used in the field of NLP, and some researchers have applied it to question classification with promising results [2, 9]. Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNN) are two common and flexible deep learning frameworks. They can extract the latent syntactic and semantic features of questions through self-learning, which is more conducive to the representation and understanding of questions and largely removes the manual feature engineering required by traditional machine learning. Zhou et al. [2] integrated words, POS and word weights into the embedded representation of words and used a Bi-directional LSTM (Bi-LSTM) to classify English questions. Li et al. [9] proposed an improved model that combined the advantages of LSTM and CNN to enhance the learning of word senses and deep features. Although these methods make up for some shortcomings of traditional machine learning, they still have deficiencies. The word weight feature used by Zhou et al. [2] performs well on English questions, but its influence on Chinese questions is not obvious. Although it combines the advantages of LSTM and CNN, the model proposed by Li et al. [9] has three shortcomings. First, using two models gives the algorithm a high time complexity. Second, the unidirectional LSTM can only incorporate information from preceding words, which is sometimes insufficient. For example, in the question "what are the main religions in the world?", it is difficult to judge the question category from "the world" alone without the later word "religion". Finally, the AdaDelta method used for gradient updating accelerates training well in the early and middle phases, but in the later stage it oscillates repeatedly around a local minimum once it enters its neighborhood.

To overcome the deficiencies mentioned above, this paper proposes a question classification model based on Bi-LSTM. It integrates word vectors, POS and word positions into the embedded representation of words, uses Bi-LSTM to automatically learn the semantic representation of questions, and uses the Adam algorithm for gradient updating. Adam converges quickly, finds the correct target direction early in parameter updating, and minimizes the loss function to the maximum extent. Finally, questions are classified by a softmax function.

2 Related Work

Early question classification mainly used rule-based approaches, such as the DIOGENE [10] and NUS [11] systems. These methods extract special question words (such as "why", "where"), common question words (such as "what") and the nouns closest to the question words as features, and judge the type of a question according to combination rules over these feature words. However, such methods consume a lot of human effort and lack flexibility.

Statistical machine learning is another long-standing approach to question classification. It requires manually defining question features and then selecting a machine learning model for classification. Common machine learning models for question classification are the Bayesian model [12,13,14], the maximum entropy (ME) model [15] and the support vector machine (SVM) [6, 7]. Li et al. [6] used the chi-square statistic to select upper concepts in WordNet to selectively expand the vocabulary of questions; the coarse-grained classification accuracy reached 91.60% on the UIUC dataset [7]. Zhang et al. [7] used a tree kernel function to enable SVM [16] to exploit the syntactic structure of questions, reaching a coarse-grained accuracy of 90.0% on the TREC QA track dataset. Zhang et al. [12] simplified question classification by exploiting the irrelevance of words, and the coarse-grained accuracy on HIT's Chinese question dataset reached 72.4%. Tian et al. [13] improved question classification with self-learned rules and a Bayesian model, achieving a coarse-grained accuracy of 84% on HIT's Chinese question sets. Wen et al. [14] extracted the main stems, interrogative words and subsidiary components of questions as features, and the coarse-grained accuracy reached 86.62% on the Chinese question sets provided by HIT and the Chinese Academy of Sciences (CAS). Sun et al. [15] presented a new feature extraction method that uses HowNet as a semantic resource, reaching a coarse-grained accuracy of 92.18% on the Chinese question sets provided by HIT and CAS. To obtain question features, these methods generally need to run NLP tasks such as POS tagging, syntactic analysis and semantic analysis, and the accuracy of these NLP tasks has a great impact on the accuracy of question classification.

In recent years, deep learning models have achieved significant progress in areas such as computer vision [17], speech recognition [18] and NLP. Kim et al. [19] used a CNN to extract features and applied it to classification tasks in several NLP fields, achieving good results on the MR, SST-1, SST-2, Subj, TREC, CR and MPQA datasets. Li et al. [9] presented an autonomous learning framework that hybridizes LSTM and CNN to learn question features; the coarse-grained classification accuracy reached 93.08% on the Chinese question sets provided by HIT, NLPCC 2015 QA and Fudan University. Zhou et al. [2] presented a Bi-LSTM classification model based on words, POS and word weights, whose coarse-grained classification accuracy reached 94.0% on the TREC QA dataset.

3 Question Classification Based on Bi-LSTM

3.1 LSTM and Bi-LSTM Models

LSTM is a variant of the recurrent neural network (RNN) that can remember long-term information, thereby avoiding the long-term dependency problem that a plain RNN cannot solve. The key component of LSTM is the memory cell, which controls the reading and writing of information through the input gate \( i_{t} \), the forget gate \( f_{t} \) and the output gate \( o_{t} \). The input gate controls how much new information enters the memory cell, the forget gate controls how much of the previous state is retained in the memory cell, and the output gate controls the output information. The LSTM neuronal structure is shown in Fig. 1, and the computations are given by Eqs. (1)–(6):

Fig. 1. LSTM neuronal structure

$$ i_{t} = \sigma \left( W_{i} x_{t} + U_{i} h_{t - 1} + b_{i} \right) $$
(1)
$$ f_{t} = \sigma \left( W_{f} x_{t} + U_{f} h_{t - 1} + b_{f} \right) $$
(2)
$$ o_{t} = \sigma \left( W_{o} x_{t} + U_{o} h_{t - 1} + b_{o} \right) $$
(3)
$$ g_{t} = \tanh \left( W_{g} x_{t} + U_{g} h_{t - 1} + b_{g} \right) $$
(4)
$$ c_{t} = i_{t} * g_{t} + f_{t} * c_{t - 1} $$
(5)
$$ h_{t} = o_{t} * \tanh \left( c_{t} \right) $$
(6)

where \( h_{t} \) is the output of the LSTM unit, \( \sigma \) is the sigmoid activation function, * denotes element-wise multiplication between vectors, and t is the time step. The input gate \( i_{t} \), the forget gate \( f_{t} \) and the output gate \( o_{t} \) depend on the previous state \( h_{t - 1} \) and the current input \( x_{t} \); the extracted feature \( g_{t} \) serves as the candidate memory content, and \( c_{t} \) is the current memory cell state.
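For concreteness, the following is a minimal NumPy sketch of one LSTM time step following Eqs. (1)–(6). The helper name lstm_step and the parameter dictionaries W, U and b are illustrative assumptions, not part of the original model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following Eqs. (1)-(6).

    W, U, b are dicts holding the parameters for the input gate (i),
    forget gate (f), output gate (o) and candidate features (g).
    """
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # Eq. (1)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # Eq. (2)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # Eq. (3)
    g_t = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # Eq. (4)
    c_t = i_t * g_t + f_t * c_prev                           # Eq. (5)
    h_t = o_t * np.tanh(c_t)                                 # Eq. (6)
    return h_t, c_t
```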

Although a unidirectional LSTM can avoid the long-term dependency problem, it only captures information from preceding words. To fully utilize the contextual information of words, Bi-LSTM exploits both the past and the future context by processing the sequence in two directions, generating two independent sequences of LSTM output vectors. The element-wise sum of the two output sequences is fed into the Max Pooling layer to generate the sentence representation.

3.2 Question Classification Model Based on Bi-LSTM

The question classification model based on Bi-LSTM is shown in Fig. 2. It mainly consists of three modules: corpus preprocessing, word embedding and classification.

Fig. 2. Bi-LSTM question classification model

The corpus preprocessing module completes the preprocessing of the raw corpus, including word segmentation and POS tagging. For the i-th question it generates the word sequence \( S_{i} = \left( {w_{0} ,w_{1} , \ldots ,w_{k - 1} } \right) \), the POS sequence \( POS_{i} = \left( {pos_{0} ,pos_{1} , \ldots ,pos_{k - 1} } \right) \) and the word position sequence \( LOC_{i} = \left( {loc_{0} ,loc_{1} , \ldots ,loc_{k - 1} } \right) \), where k is the number of words in the question, \( w_{j} \) is the j-th word, \( pos_{j} \) is the j-th POS tag and \( loc_{j} \) is the j-th word position. The word embedding module performs feature vectorization and feature vector concatenation: \( w_{j} \), \( pos_{j} \) and \( loc_{j} \) are vectorized separately, and their vectors are concatenated to form the word embedding. The classification module consists of the Bi-LSTM hidden layer, the Max Pooling layer and the softmax layer. The Bi-LSTM hidden layer computes the forward and backward states of the question through the three gate functions, obtains the forward output \( h_{j}^{\sim} \) and the backward output \( h_{j}^{'} \), and adds them to obtain the output \( h_{j} \). The Max Pooling layer takes the maximum over the hidden layer outputs to generate the question representation, and the softmax layer uses this representation for classification.
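The PyTorch sketch below illustrates one possible realization of the architecture in Fig. 2. The class name, the use of nn.Embedding and nn.LSTM, and the default sizes (150/100/100 embedding dimensions, 100 hidden units, seven coarse classes, dropout 0.4, taken from Sect. 4.3) are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class BiLSTMQuestionClassifier(nn.Module):
    """Sketch of the model in Fig. 2: word/POS/position embeddings are
    concatenated, fed to a Bi-LSTM, max-pooled and classified by softmax."""

    def __init__(self, vocab_size, pos_size, loc_size=4,
                 word_dim=150, pos_dim=100, loc_dim=100,
                 hidden_size=100, num_classes=7, dropout=0.4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.pos_emb = nn.Embedding(pos_size, pos_dim, padding_idx=0)
        self.loc_emb = nn.Embedding(loc_size, loc_dim, padding_idx=0)
        self.bilstm = nn.LSTM(word_dim + pos_dim + loc_dim, hidden_size,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, words, pos, locs):
        # (batch, seq_len, d) with d = word_dim + pos_dim + loc_dim, cf. Eq. (7)
        x = torch.cat([self.word_emb(words),
                       self.pos_emb(pos),
                       self.loc_emb(locs)], dim=-1)
        out, _ = self.bilstm(x)            # (batch, seq_len, 2 * hidden_size)
        fwd, bwd = out.chunk(2, dim=-1)
        h = fwd + bwd                      # Eq. (8): sum the two directions
        f = h.max(dim=1).values            # Eq. (9): max pooling over time
        return self.fc(self.dropout(f))    # logits; softmax is applied in the loss
```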

3.2.1 Preprocessing Module

In a classification model, the quality of the data directly affects the classification performance [20]. As the basic units of questions, words carry the main features of questions; in addition, word position and POS information are also important for question classification. In the preprocessing module, this paper considers three kinds of position information, "beginning", "middle" and "end", represented by 1, 2 and 3 respectively. The preprocessing algorithm is described as follows:

[Algorithm: question preprocessing (figure omitted)]
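As an illustration only, the sketch below mimics the preprocessing steps described above (segmentation, POS tagging and position labeling). It assumes jieba's posseg module as the segmenter and POS tagger, since the paper does not name a specific tool; any tool with similar output would work.

```python
import jieba.posseg as pseg  # assumed segmenter/POS tagger

def preprocess(question):
    """Produce S_i, POS_i and LOC_i for one question: the word sequence,
    the POS sequence and position labels 1/2/3 (beginning/middle/end)."""
    pairs = [(w, p) for w, p in pseg.cut(question)]
    words = [w for w, _ in pairs]
    pos_tags = [p for _, p in pairs]
    k = len(words)
    locs = []
    for j in range(k):
        if j == 0:
            locs.append(1)        # beginning
        elif j == k - 1:
            locs.append(3)        # end
        else:
            locs.append(2)        # middle
    return words, pos_tags, locs
```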

3.2.2 Word Embedding Module

Word embeddings, as dense low-dimensional continuous representations of words, can effectively capture the semantic and grammatical information of words [21]. The word embedding module consists of feature vectorization and feature vector concatenation. First, the elements \( w_{j} ,\,pos_{j} ,\,loc_{j} \) of the output sequences \( S_{i} \), \( POS_{i} \), \( LOC_{i} \) of the preprocessing module are vectorized to obtain the word vector \( V_{j}^{w} = \left( {\gamma_{1} ,\gamma_{2} , \ldots ,\gamma_{m} } \right) \in R^{m} \), the POS vector \( V_{j}^{p} = \left( {\eta_{1} ,\eta_{2} , \ldots ,\eta_{n} } \right) \in R^{n} \) and the location vector \( V_{j}^{l} = \left( {\lambda_{1} ,\,\lambda_{2} ,\, \ldots ,\,\lambda_{l} } \right) \in R^{l} \). Then the word embedding vector \( X_{j} = \left( {x_{1} ,\,x_{2} ,\, \ldots ,\,x_{d} } \right) \in R^{d} \) is obtained by concatenating \( V_{j}^{w} \), \( V_{j}^{p} \) and \( V_{j}^{l} \), where \( d = m\, + \,n\, + \,l \). The concatenation is shown in Eq. (7):

$$ X_{j} = \left( x_{1} ,x_{2} , \ldots ,x_{d} \right) = \left[ V_{j}^{w} ,V_{j}^{p} ,V_{j}^{l} \right] = \left( \gamma_{1} , \ldots ,\gamma_{m} ,\eta_{1} , \ldots ,\eta_{n} ,\lambda_{1} , \ldots ,\lambda_{l} \right) $$
(7)

The word embedding generation algorithm for the i-th question is described as follows, where k is the number of words in the i-th question.

[Algorithm: word embedding generation (figure omitted)]
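A possible sketch of the embedding generation step for one question is given below. The lookup tables word_vecs, pos_vecs and loc_vecs are hypothetical stand-ins for the randomly initialized vector tables described in Sect. 4.3.

```python
import numpy as np

def build_word_embeddings(words, pos_tags, locs, word_vecs, pos_vecs, loc_vecs):
    """For each position j in one question, concatenate the word, POS and
    location vectors into X_j (Eq. (7)) and stack them into a matrix."""
    X = []
    for w, p, l in zip(words, pos_tags, locs):
        v_w = word_vecs[w]    # word vector, dimension m (150 in the experiments)
        v_p = pos_vecs[p]     # POS vector, dimension n (100)
        v_l = loc_vecs[l]     # location vector, dimension l (100)
        X.append(np.concatenate([v_w, v_p, v_l]))  # dimension d = m + n + l
    return np.stack(X)
```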

3.2.3 Classification Module

The classification module is composed of three parts: the Bi-LSTM hidden layer, the Max Pooling layer and the softmax layer. First, \( X_{j} \), generated by the word embedding module, is used as the input of the Bi-LSTM hidden layer and is processed forward and backward through the three gates. The forward output is \( h_{j}^{\sim} \) and the backward output is \( h_{j}^{'} \). Then \( h_{j}^{\sim} \) and \( h_{j}^{'} \) are summed to obtain the output of the Bi-LSTM hidden layer, as shown in Eq. (8):

$$ h_{j} = h_{j}^{ \sim } + h_{j}^{ '} $$
(8)

The i-th question yields an output matrix \( H_{i} = \left( {h_{i0} ,h_{i1} , \ldots ,h_{i,k - 1} } \right) \) from the Bi-LSTM hidden layer, where \( h_{ij} \) is the hidden layer output corresponding to \( X_{j} \left( {0 \le j < k} \right) \) and k is the length of the i-th question. \( H_{i} \) is used as the input of the Max Pooling layer to generate the representation of the i-th question, as shown in Eq. (9):

$$ f_{i} = \max_{j} \left( h_{ij} \right) $$
(9)

Questions are classified by the softmax layer as shown in Eq. (10):

$$ h_{\theta } \left( {f_{i} } \right) = \begin{bmatrix} p\left( y^{\left( i \right)} = 1 \mid f_{i} ;\theta \right) \\ p\left( y^{\left( i \right)} = 2 \mid f_{i} ;\theta \right) \\ \vdots \\ p\left( y^{\left( i \right)} = m \mid f_{i} ;\theta \right) \end{bmatrix} = \frac{1}{\sum\nolimits_{j = 1}^{m} e^{\theta_{j}^{T} f_{i} } }\begin{bmatrix} e^{\theta_{1}^{T} f_{i} } \\ e^{\theta_{2}^{T} f_{i} } \\ \vdots \\ e^{\theta_{m}^{T} f_{i} } \end{bmatrix} $$
(10)

where m is the number of categories and the model parameters are \( \theta_{1} ,\,\theta_{2} , \ldots ,\,\theta_{m} \in R^{n + 1} \).
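For reference, a small NumPy sketch of the softmax computation in Eq. (10) might look as follows; the max-shift is a standard numerical-stability trick and is not part of the equation itself.

```python
import numpy as np

def softmax_probs(theta, f_i):
    """Class probabilities h_theta(f_i) per Eq. (10).

    theta: parameter matrix of shape (m, len(f_i)), one row theta_j per category.
    f_i:   question representation from the Max Pooling layer.
    """
    scores = theta @ f_i            # theta_j^T f_i for every category j
    scores = scores - scores.max()  # shift for numerical stability (result unchanged)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```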

The classification algorithm is described as follows:

[Algorithm: question classification (figure omitted)]

4 Experiment and Result Analysis

4.1 Question Classification Scheme

To classify questions, we first need to know what types of questions exist; these types are determined by the classification scheme. The Information Retrieval and Social Computing Center of HIT defined a Chinese question classification scheme based on existing QA systems, the way things are categorized in the real world and the characteristics of Chinese. This scheme is widely adopted by researchers [9], and this paper adopts it to label questions with coarse-grained categories. Table 1 shows HIT's hierarchical question classification scheme, which includes seven coarse categories: description, human, location, number, time, entity and unknown. Each coarse category defines several sub-categories according to the actual situation, for a total of 84 fine-grained categories.

Table 1. HIT’s question classification scheme

4.2 Dataset

This paper uses the HIT Question Classification Dataset, which contains 6,296 questions: 4,981 in the training set and 1,315 in the test set. The question distribution of the dataset is shown in Table 2.

Table 2. Distribution of training questions and test questions

4.3 Experimental Parameters Setting

Questions are padded to a fixed length, set to the maximum question length; missing positions are filled with "0". The dimensions of the word vectors, POS vectors and word position vectors are 150, 100 and 100 respectively, and all are randomly initialized. The Bi-LSTM hidden layer consists of 100 LSTM units. The adaptive learning rate optimization algorithm Adam is used to update the parameters. Batch training is adopted with a batch size of 128, the maximum number of training rounds is set to 2000, and a dropout rate of 0.4 is used during training.
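A hypothetical training setup reflecting these hyperparameters is sketched below. It reuses the BiLSTMQuestionClassifier sketch from Sect. 3.2; the vocabulary sizes, padded length and synthetic batch are placeholders, and nn.CrossEntropyLoss supplies the softmax of Eq. (10).

```python
import torch
import torch.nn as nn

# Placeholder vocabulary/POS-tag sizes; the real values depend on the corpus.
model = BiLSTMQuestionClassifier(vocab_size=20000, pos_size=60)
criterion = nn.CrossEntropyLoss()                  # includes the softmax of Eq. (10)
optimizer = torch.optim.Adam(model.parameters())   # Adam replaces AdaDelta for gradient updates

# One synthetic batch standing in for a real data loader (batch size 128, padded length 30).
words = torch.randint(1, 20000, (128, 30))
pos = torch.randint(1, 60, (128, 30))
locs = torch.randint(1, 4, (128, 30))              # position labels 1/2/3
labels = torch.randint(0, 7, (128,))               # seven coarse-grained classes

for epoch in range(2000):                          # maximum number of training rounds
    optimizer.zero_grad()
    loss = criterion(model(words, pos, locs), labels)
    loss.backward()
    optimizer.step()
```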

4.4 Experimental Result

This paper uses three evaluation criteria, accuracy (A), recall (R) and F1 value (F1) [22], to evaluate the effect of the word location feature on each category. The results are shown in Table 3. For convenience, the model "Bi-LSTM + Word + POS" is denoted "model1" and "Bi-LSTM + Word + POS + Word Location" is denoted "model2".

Table 3. The effect of word location feature on each category

According to Table 3, the micro-average accuracy of coarse-grained classification without the word location feature is 90.27%; after the word location feature is integrated, the micro-average accuracy rises to 92.38%. This shows that adding the word location feature to the word embedding generation process improves classification performance. Moreover, as can be seen from Table 3, the word location feature significantly improves the results for the "human" and "entity" classes, whose accuracy increases by 9.5 and 5.9 percentage points respectively. This is because the positions of the keywords that determine whether a question belongs to the "human" class or the "entity" class are more distinctive. For example, in the question "which company is HP's abbreviation?", the keywords "company" and "abbreviation" have the same POS but different positions. The keyword "abbreviation" is one of the features of the "description" class; if its position is not considered, it is not easy to determine the category of this question. Therefore, the keyword position becomes the key to correctly determining the question category. Table 3 also shows that the effect of word position on the "number" class is not obvious. This is because questions in the "number" class contain entities with distinctive characteristics such as "area", "zip code", "number" and "area code", so adding location information has little effect on numeric question classification.

Table 4. Comparison of coarse-grained Chinese question classification methods

The results of coarse-grained question classification based on different methods are shown in Table 4.

Traditional methods mostly rely on manually formulated feature extraction strategies, which have limitations and lack flexibility. The method in this paper makes up for these shortcomings, and its coarse-grained classification accuracy is 3.6 percentage points higher than that of the traditional methods. According to Table 4, however, the question classification model proposed in this paper does not yet reach the best reported result. A possible reason is that the amount of experimental data used in this paper is small while the Bi-LSTM model has many parameters, and it is difficult to train a better classifier on limited data. Li et al. [9] combined the advantages of unidirectional LSTM and CNN and used more training data to autonomously learn the deep syntactic and semantic features of questions, which allows a classifier with better performance to be trained.

5 Conclusion

For question classification, this paper used a Bi-LSTM-based model divided into three modules. The preprocessing module preprocesses the raw corpus and generates word sequences, POS sequences and word position sequences. The word embedding module combines the word vector, POS vector and position vector to produce the embedded representation of words. The classification module first generates the distributed representation of words through the Bi-LSTM hidden layer, then generates the question representation via the Max Pooling layer, and finally classifies questions in the softmax layer. Experiments on the HIT question classification dataset showed that the accuracy reached 92.38%. This demonstrates that the Bi-LSTM-based method can improve question classification performance without the need to craft complex feature rules. However, the classification accuracy of the "description" and "entity" classes needs further improvement. This is because there is less training data for these two classes, whereas the other classes have relatively more data with which to train the model. To address this problem, in addition to collecting and labeling more data, the features used in this paper can be applied to a variety of deep learning methods to more fully capture the semantics of questions. This will be the focus of our future work.