
1 Introduction

In recent years, many models have been developed that can perform reading comprehension. Most recent models use deep learning techniques augmented with attention and memory mechanisms [1, 2] to decode answers from the encoded questions and documents [1, 3]. These models typically rely on either information retrieval or inference techniques to find answers. However, current state-of-the-art models still suffer a decline in accuracy when processing larger and more complex text documents [3, 4, 5, 6]. In this paper, we introduce a new approach to text comprehension that combines predictive word embeddings, matched attention and external memory. Our proposed model can handle documents of any size due to its expandable architecture. Additionally, we predict the semantic relationship of out-of-vocabulary (OOV) and rare words, allowing us to generate significantly more accurate vector-based word embeddings. End-to-end training is also possible because the entire model is differentiable. The general architecture can be extended to applications beyond reading comprehension, such as sentiment classification, image captioning and machine translation.

2 Related Work

Most natural language processing models use recurrent neural networks (RNNs) for encoding and decoding because RNNs handle sequential data naturally. More recently, LSTM and GRU [7] variants have been adopted because they can encode larger contexts [6, 7]. The addition of attention mechanisms over gated RNNs has enabled a wide range of applications including, but not limited to, sentiment analysis, neural machine translation [1, 3], image caption generation [8] and question answering [2]. External memory manipulation also relies on attention, as seen in [4]. In this paper, we use a similar approach to understand the context of a document. However, unlike previous approaches [6, 9], our model combines the inference capability of memory-augmented models with the scalability of sequential gated attention models. This allows our model to perform a variety of tasks that were previously limited to only certain types of models.

In recent years, vector embedding models such as Word2Vec [10] and GloVe [11] have become a common method for encoding words because they capture some of the semantic relationships between words. A major drawback of these models, however, is that they cannot generate representations for words or phrases that were not part of their training set. Rare or out-of-vocabulary (OOV) words therefore either have poor representations or no representation at all. This reduces accuracy because attention-based models cannot effectively determine how a poorly represented word relates to its context. Methods to mitigate this issue are described in [12]. In this paper we propose a new method for embedding words that addresses this limitation of vector-based word embedding models and significantly improves their accuracy.

To evaluate our model we use common datasets: Facebook’s bAbI dataset [13], Stanford’s SQuAD dataset [14] and Microsoft’s MS MARCO dataset [15]. The bAbI dataset tests inference tasks over simple sentences, whereas the SQuAD and MS MARCO datasets test more complex inference and information retrieval aspects of reading comprehension.

3 Model and Methods

In this section we provide a brief overview of the proposed model. Subsequently, we describe each component of the model in detail and give the intuition behind its design. We begin the task of reading comprehension by sequentially encoding the question and document using two bidirectional GRUs [7, 16]. A paired matching matrix then associates each word in the question with words in the given document [17]. The relevance of each sentence is determined using a soft attention mechanism over the matching matrix. Subsequently, a temporal controller writes the weighted encoding of each word into memory using a method similar to the one described in [18]. The encoded question and the word-pair matching matrix are then passed as input to the read controller, which selects the weighted encodings of the words (memory vectors) that are relevant to the question. Each selection is fed back to the read controller along with the question to find more evidence supporting an answer (Fig. 1).

Fig. 1. A high-level view of ALICE’s architecture, showing one time step

The weighted results are given to a bidirectional GRU which decodes the answer [1].

3.1 Embedding and Encoding

We propose a new method to handle rare or OOV words in vector-based word embeddings [10, 11] by predicting the context of such words. A bidirectional LSTM (BiLSTM) is trained separately to predict the next word in a sentence. For each rare or OOV word, the BiLSTM generates ‘n’ candidate words to replace it. We average the ‘n’ candidate vectors and insert the resulting vector into our word embedding model. The BiLSTM is trained on the same texts used to train the word embedding model, ensuring that it does not itself encounter any OOV words. To encode the question Q and the document D, we generate their corresponding word embeddings \( \left[ {w_{i}^{Q} } \right]_{i = 1}^{m} \) and \( \left[ {w_{i}^{D} } \right]_{i = 1}^{k} \) along with their character embeddings \( \left[ {c_{i}^{Q} } \right]_{i = 1}^{m} \) and \( \left[ {c_{i}^{D} } \right]_{i = 1}^{k} \), where ‘m’ is the number of words in the question Q and ‘k’ is the number of words in the document D. For word embedding we use pre-trained GloVe embeddings [11] and for character embedding we use an LSTM-charCNN [19]. We then combine the word and character embeddings using bidirectional GRUs [16] to form encodings \( \left[ {e_{i}^{Q} } \right]_{i = 1}^{m} \) and \( \left[ {e_{i}^{D} } \right]_{i = 1}^{k} \) for all words in the question and document respectively.

$$ e_{i}^{Q} = BiGRU\left( {e_{i - 1}^{Q} ,\left[ {w_{i}^{Q} ,c_{i}^{Q} } \right]} \right) $$
(1)
$$ e_{i}^{D} = BiGRU\left( {e_{i - 1}^{D} ,\left[ {w_{i}^{D} ,c_{i}^{D} } \right]} \right) $$
(2)

By combining the enhanced word vector representations with character embeddings we get an encoding that effectively handles rare and OOV words.
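
To make this stage concrete, the sketch below shows one way the OOV averaging step and the BiGRU encoder of Eqs. (1)–(2) could be implemented in PyTorch. It is a minimal illustration under our own assumptions, not the original implementation; all class names, argument names and the choice of five candidates are hypothetical.

```python
# A minimal sketch of the embedding/encoding stage, assuming PyTorch.
# All names and hyper-parameters here are illustrative assumptions.
import torch
import torch.nn as nn

class OOVAverager(nn.Module):
    """Predicts n candidate in-vocabulary replacements for an OOV token with a
    BiLSTM language model and averages their pre-trained (e.g. GloVe) vectors."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, glove_matrix, n_candidates=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, vocab_size)    # scores over the known vocabulary
        self.register_buffer("glove", glove_matrix)          # (vocab_size, embed_dim)
        self.n = n_candidates

    def forward(self, context_ids, oov_position):
        # context_ids: (1, seq_len) token ids of the sentence containing the OOV word
        h, _ = self.bilstm(self.embed(context_ids))           # (1, seq_len, 2*hidden_dim)
        logits = self.proj(h[:, oov_position])                # candidate scores at the OOV slot
        top = logits.topk(self.n, dim=-1).indices.squeeze(0)  # n candidate word ids
        return self.glove[top].mean(dim=0)                    # averaged vector for the OOV word

class Encoder(nn.Module):
    """BiGRU over the concatenated word and character embeddings, Eqs. (1)-(2)."""
    def __init__(self, word_dim, char_dim, hidden_dim):
        super().__init__()
        self.bigru = nn.GRU(word_dim + char_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, word_emb, char_emb):
        # word_emb: (batch, seq_len, word_dim); char_emb: (batch, seq_len, char_dim)
        encodings, _ = self.bigru(torch.cat([word_emb, char_emb], dim=-1))
        return encodings                                       # one encoding e_i per word
```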

3.2 Word to Word Relevance

To determine the importance of a word in answering the given question, we follow the suggestions outlined in [17] to generate word pair representations. For a given question encoding \( \left[ {e_{i}^{Q} } \right]_{i = 1}^{m} \) and a document encoding \( \left[ {e_{i}^{D} } \right]_{i = 1}^{k} \), the word pair representation \( \left[ {p_{i}^{D} } \right]_{i = 1}^{k} \) is calculated using soft-alignment of words:

$$ p_{i}^{D} = RNN\left( {p_{i - 1}^{D} ,c_{i} } \right) $$
(3)

where \( c_{i} \) is a context vector formed by attending over the question encodings \( \left[ {e_{j}^{Q} } \right]_{j = 1}^{m} \) with word-pair attention weights:

$$ c_{i} = \sum\limits_{j = 1}^{m} {s_{j} e_{j}^{Q} } $$
(4)
$$ s_{j} = \frac{{\exp \left( {a_{j} } \right)}}{{\sum\nolimits_{j' = 1}^{m} {\exp \left( {a_{j'} } \right)} }} $$
(5)
$$ a_{j} = w^{T} tanh\left( {W^{Q} e_{j}^{Q} + W^{D} e_{i}^{D} + W^{p} p_{i - 1}^{D} } \right) $$
(6)

Here, \( a_{j} \) is the attention score between question word j and document word i, and \( s_{j} \) is the softmax over these scores. w is a learned weight vector whose transpose is wT. In [20], \( e_{i}^{D} \) is added as another input to the recurrent network that produces \( p_{i}^{D} \) (Fig. 2):

Fig. 2. A visual depiction of the attention mechanism used

$$ p_{i}^{D} = RNN\left( {p_{i - 1}^{D} ,\left[ {e_{i}^{D} ,c_{i} } \right]} \right) $$
(7)

Since the document may be very large, we introduce a gate gi over the input \( \left[ {e_{i}^{D} , c_{i} } \right] \). This allows us to find parts of the document that are relevant to the question.

$$ g_{i} = sigmoid\left( {W_{g} \left[ {e_{i}^{D} ,c_{i} } \right]} \right) $$
(8)
$$ \left[ {e_{i}^{D} ,c_{i} } \right]^{{\prime }} = g_{i} \odot \left[ {e_{i}^{D} ,c_{i} } \right] $$
(9)

The gate gi varies over time. It filters out irrelevant words when its value is close to zero and emphasizes important words when its value is close to one. Because gi is recomputed from the learned weights Wg at every time step, it provides a mechanism for effectively selecting the parts of the document relevant to the question. In our model we use a BiGRU in place of a plain RNN.
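
The following NumPy sketch illustrates the gated attention computation of Eqs. (4)–(9) for a single document position. The dimensions and random parameters are assumptions chosen only for illustration; in the model these weights are learned and the gated input feeds a BiGRU.

```python
# A minimal NumPy sketch of the gated attention step for one document
# position i (Eqs. (4)-(9)). Sizes and random parameters are assumed.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 64                        # encoding size (assumed)
m = 10                        # number of question words (assumed)
e_Q = np.random.randn(m, d)   # question encodings e_j^Q
e_D_i = np.random.randn(d)    # document encoding e_i^D at position i
p_prev = np.random.randn(d)   # previous matching state p_{i-1}^D

W_Q, W_D, W_p = (np.random.randn(d, d) for _ in range(3))
w = np.random.randn(d)
W_g = np.random.randn(2 * d, 2 * d)

# Attention scores over question words (Eq. 6) and their softmax (Eq. 5)
a = np.array([w @ np.tanh(W_Q @ e_Q[j] + W_D @ e_D_i + W_p @ p_prev) for j in range(m)])
s = softmax(a)

# Context vector: attention-weighted sum of question encodings (Eq. 4)
c_i = (s[:, None] * e_Q).sum(axis=0)

# Gate over the concatenated input [e_i^D, c_i] (Eqs. 8-9)
x = np.concatenate([e_D_i, c_i])
g_i = 1.0 / (1.0 + np.exp(-(W_g @ x)))
x_gated = g_i * x             # fed to the BiGRU that produces p_i^D (Eq. 7)
```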

3.3 Memory Controller

The memory architecture is similar to, but less complex than, the one described in [18]. Memory is stored in an N × M matrix, where N is the number of memory locations and M is the vector size at each location. Our first step is to write the weighted word vectors into memory. This is achieved by using an LSTM for the controller network, defined by:

$$ i_{t}^{l} = sigmoid\left( {W_{i}^{l} \left[ {x_{t} ,h_{t - 1}^{l} ,h_{t}^{l - 1} } \right] + b_{i}^{l} } \right) $$
(10)
$$ f_{t}^{l} = sigmoid\left( {W_{f}^{l} \left[ {x_{t} ,h_{t - 1}^{l} ,h_{t}^{l - 1} } \right] + b_{f}^{l} } \right) $$
(11)
$$ s_{t}^{l} = f_{t}^{l} s_{t - 1}^{l} + i_{t}^{l} tanh\left( {W_{s}^{l} \left[ {x_{t} ,h_{t - 1}^{l} ,h_{t}^{l - 1} } \right] + b_{s}^{l} } \right) $$
(12)
$$ o_{t}^{l} = sigmoid\left( {W_{o}^{l} \left[ {x_{t} ,h_{t - 1}^{l} ,h_{t}^{l - 1} } \right] + b_{o}^{l} } \right) $$
(13)
$$ h_{t}^{l} = o_{t}^{l} tanh\left( {s_{t}^{l} } \right) $$
(14)

where \( l \) denotes the layer of the LSTM and sigmoid is the logistic sigmoid function defined as:

$$ sigmoid\left( x \right) = \frac{1}{{1 + e^{ - x} }} $$
(15)

\( i_{t}^{l} , f_{t}^{l} , \) and \( o_{t}^{l} \) are the input, forget and output gates, and \( s_{t}^{l} \) and \( h_{t}^{l} \) are the cell state and hidden state, respectively, at layer l and time t. The input vector xt is supplied to the controller at each time-step t. Since we want to maintain the order of sentences occurring in the document, we concatenate an increasing time index i to \( p_{i}^{D} \) from the word matching step:

$$ p_{i}^{D} = RNN\left( {p_{i - 1}^{D} ,\left[ {e_{i}^{D} ,c_{i} } \right]} \right) $$
(16)
$$ x_{t} = \left[ {p_{i}^{D} ,i} \right] $$
(17)
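
As an illustration, the NumPy sketch below performs one step of a single-layer LSTM controller following Eqs. (10)–(14), with the controller input xt built by appending the time index to \( p_{i}^{D} \) as in Eq. (17). All sizes and parameters are our own assumptions, and the \( h_{t}^{l - 1} \) term is dropped because only one layer is used.

```python
# A minimal NumPy sketch of one step of a single-layer LSTM controller
# (Eqs. (10)-(14)), with x_t = [p_i^D, i] as in Eq. (17). Sizes are assumed.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_p, d_h = 64, 128                 # matching-vector and hidden sizes (assumed)
p_i = np.random.randn(d_p)         # p_i^D from the word-matching step
t_index = np.array([3.0])          # scalar time index i, kept as a 1-d array
x_t = np.concatenate([p_i, t_index])           # Eq. (17)

d_in = x_t.size + d_h                          # each gate sees [x_t, h_{t-1}]
W_i, W_f, W_s, W_o = (np.random.randn(d_h, d_in) for _ in range(4))
b_i, b_f, b_s, b_o = (np.zeros(d_h) for _ in range(4))
h_prev, s_prev = np.zeros(d_h), np.zeros(d_h)  # h_{t-1}, s_{t-1}

z = np.concatenate([x_t, h_prev])              # single layer, so no h_t^{l-1} term
i_t = sigmoid(W_i @ z + b_i)                   # input gate,   Eq. (10)
f_t = sigmoid(W_f @ z + b_f)                   # forget gate,  Eq. (11)
s_t = f_t * s_prev + i_t * np.tanh(W_s @ z + b_s)   # cell state, Eq. (12)
o_t = sigmoid(W_o @ z + b_o)                   # output gate,  Eq. (13)
h_t = o_t * np.tanh(s_t)                       # hidden state, Eq. (14)
```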

To read the memory vectors, we use the weighted averages across all locations:

$$ r_{t} = M_{t}^{T} w_{t}^{r} $$
(18)

Here, Mt denotes the memory matrix at time t. The read vectors are appended to the controller input after each time-step. For the first time step, the question’s encoding \( e_{i}^{Q} \) is supplied as input. The read weighting \( w_{t}^{r} \) determines the importance of a vector for answering the question. It is defined as:

$$ w_{t}^{r} = f_{t} \left[ 1 \right]\left( {i - 1} \right) + f_{t} \left[ 2 \right]C\left( {M,k} \right) + f_{t} \left[ 3 \right]\left( {i + 1} \right) $$
(19)

Here, ft is a read-mode distribution obtained by applying a softmax over three states: move backward, find similar vectors, and move forward. The read weighting \( w_{t}^{r} \) is applied over all memory locations, allowing the model to select facts relevant to the given question. To find evidence supporting an answer, we use content-based addressing [18] on the read head to perform lookups over the memory:

$$ C\left( {M,k} \right)\left[ i \right] = \frac{{\exp \left( {D\left( {k,M\left[ {i, \cdot } \right]} \right)} \right)}}{{\mathop \sum \nolimits_{j} \exp \left( {D\left( {k,M\left[ {j, \cdot } \right]} \right)} \right)}} $$
(20)

Here, \( k \in {\mathbb{R}}^{M} \) is the key used to address a memory location and D is the cosine similarity function. In our case, the temporal index is also included in the key.
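
The NumPy sketch below illustrates one read step: content-based addressing over the memory rows (Eq. 20), mixing of the three read modes (Eq. 19) and the resulting read vector (Eq. 18). The memory sizes, the key, and the interpretation of the backward and forward modes as one-hot weightings over neighbouring temporal positions are our own assumptions.

```python
# A minimal NumPy sketch of one memory read step (Eqs. (18)-(20)).
# Sizes, the key, and the read-mode interpretation are assumed.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

N, M = 32, 64                          # memory locations, vector size (assumed)
memory = np.random.randn(N, M)         # memory matrix M_t
key = np.random.randn(M)               # lookup key k (assumed to include the temporal index)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# Content-based addressing: softmax over cosine similarities (Eq. 20)
content_w = softmax(np.array([cosine(key, memory[j]) for j in range(N)]))

# Read modes: move backward, content lookup, move forward (Eq. 19)
f_t = softmax(np.random.randn(3))      # in the model this comes from the controller
i = 10                                  # current temporal position (assumed)
backward = np.eye(N)[i - 1]            # interpreted as a one-hot weighting on location i-1
forward = np.eye(N)[i + 1]             # interpreted as a one-hot weighting on location i+1
w_r = f_t[0] * backward + f_t[1] * content_w + f_t[2] * forward

# Read vector: weighted average over all memory rows (Eq. 18)
r_t = memory.T @ w_r                   # shape (M,)
```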

3.4 Output

The output vectors are fed into a BiGRU which decodes the answer. Once the output is decoded, we calculate the loss using the standard cross-entropy loss function, minimizing the negative log-probability of the reference answer under the model’s predicted distribution:

$$ L\left( {y,z} \right) = - \mathop \sum \limits_{i = 1}^{m} z_{i} \,\log \left( {y_{i} } \right) $$
(21)

Here, m is the vocabulary size, z is the one-hot encoding of the ground-truth word and \( y_{i} \) is the probability the model assigns to the i-th vocabulary word. The loss is back-propagated through the entire model, which is trained until the loss converges.
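
As a small worked example under assumed sizes, the loss of Eq. (21) for a single decoded token can be computed as follows:

```python
# A minimal NumPy sketch of the cross-entropy loss of Eq. (21) for one token:
# z is a one-hot target over the vocabulary, y the predicted distribution.
import numpy as np

m = 1000                               # vocabulary size (assumed)
z = np.zeros(m); z[42] = 1.0           # one-hot ground-truth word (index assumed)
logits = np.random.randn(m)
y = np.exp(logits - logits.max()); y /= y.sum()   # predicted distribution

loss = -np.sum(z * np.log(y + 1e-12))  # Eq. (21)
```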

4 Results

The bAbI dataset [13] consists of 20 tasks for testing a model’s ability to reason over text. From the results listed in Table 1, we observe that ALICE performs significantly better than the DNC on the basic induction task (task 16), which contributes to ALICE’s higher mean accuracy. We also observe that the DMN [4] performs better than ALICE on basic induction, yet ALICE performs better on all other tasks. The accuracy of ALICE converges to 96.8% over 10 training sessions; we report only the best result obtained from one of the 10 sessions.

Table 1. Comparison of results obtained on the bAbI dataset

We further test our model on the SQuAD [14] and MS MARCO [15] datasets. The SQuAD dataset [14] contains question-answer pairs derived from 536 Wikipedia articles. SQuAD uses exact match (EM) and F1 score metrics to measure the performance of a given model. We report the best results obtained from one of 10 training sessions in Table 2.

Table 2. A comparison of results obtained on the SQuAD dataset

While ALICE outperforms competitive models like ReasoNet [23] and the Reinforced Mnemonic Reader [22], the Stochastic Answer Network (SAN ensemble model) [21] still beats ALICE by a small margin. We attribute this to SAN’s ensemble nature. To test our model (ALICE) on larger texts we use the MS MARCO dataset [15]. It contains multiple passages extracted from anonymized Bing search engine queries, and the answers may not be worded exactly as in those passages. The metrics used for evaluating a model on the MS MARCO dataset are BLEU and ROUGE-L scores.

From the results in Table 3, we see that ReasoNet [23] marginally outperforms ALICE on the MS MARCO dataset. Note that the ReasoNet results were obtained by the Microsoft AI and Research group after the ReasoNet paper was published. From our tests, we can see that ALICE performs similarly to, or better than, several competitive models on the bAbI [13], SQuAD [14] and MS MARCO [15] datasets.

Table 3. A comparison of results obtained on the MS MARCO dataset

5 Conclusion

In this paper, we propose a novel model, ALICE, aimed at the task of reading comprehension and question answering. We simplify an existing state-of-the-art memory architecture and combine it with a matching layer to attend over the question and document. We also provide a method to improve the accuracy of similar models by using a bidirectional LSTM to generate contextual word embeddings for out-of-vocabulary words. Our results show that the model scales with document size and task complexity, achieving accuracy close to the state of the art and similar to, or better than, several competitive models. Future work includes simplifying the current model and applying it to generate captions for images.