1 Introduction

Machine Reading Comprehension (MRC) is a classic task in textual question answering (QA), where models are required to answer a natural language question given relevant (and possibly irrelevant) passages. Thanks to the release of large-scale datasets [17, 22, 25, 43], end-to-end neural networks have achieved promising results in various scenarios [1, 7, 27, 36, 47]. MRC-based QA is usually divided into three types of tasks: extractive QA, generative QA, and multi-choice QA. Compared to extractive MRC, which is limited to text spans, multi-choice MRC allows a more flexible design of question types such as summarization, commonsense, logical reasoning, arithmetic, and sentiment analysis. Hence, most commonsense-based QA datasets are designed in a multi-choice form. For example, as shown in Fig. 1, the well-known fact that “McDonalds” is a restaurant is useful for finding the correct option.

Fig. 1 An example of the DREAM dataset (⋆: the correct answer)

Existing multi-choice QA datasets are small in size, leading previous methods to focus on transfer learning with out-of-domain datasets and tasks [14, 33] or on designing complicated matching networks [35, 47]. Nevertheless, more data and more parameters mean more computing resources are consumed. Besides, out-of-domain data and accumulated model capacity can neither solve fact-based QA well nor explain commonsense reasoning explicitly.

On the other hand, although pre-trained language models (LMs) such as BERT [5] have recently achieved strong results on downstream tasks, including MRC, their pre-training methods ignore the role of factual knowledge. Existing work injects knowledge into LMs through auxiliary knowledge-driven objectives, updating parameters in a multi-task learning manner [24, 48], which requires pre-computing knowledge representations and even pre-training from scratch. Another solution is to use the language model as an encoder whose outputs are fed into a knowledge-text interaction layer for specific downstream tasks [41], increasing model complexity and computational cost.

To alleviate these problems, we take BERT as the base pre-trained model and incorporate off-the-shelf commonsense representations for multi-choice MRC. Intuitively, it is easier to get the correct answer by fusing the commonsense relationships between the passage and the options into the model for inference. Instead of stacking interaction layers downstream, we introduce three simple yet effective methods plugged into the BERT structure, respectively named additive feature-based gating, multi-level linear transformation, and multi-head attentional fusion, to integrate token-level knowledge representations into BERT. Thus, text can be encoded in BERT while taking commonsense information into account. Unlike previous work that trains knowledge embeddings before or after retrieving relevant entities, we directly leverage pre-computed ConceptNet embeddings [28] as the external knowledge representation. Moreover, since not all commonsense concepts are relevant to a token and much external knowledge is only implicit in conversations, a mask mechanism is introduced for token-level multi-hop relationship searching. Our goal is to enable the self-attention (SA) in BERT to identify knowledge-aware tokens without additional knowledge-driven objectives or pre-training from scratch.

The remainder of this paper is organized as follows: Section 2 summarizes the main contributions. Section 3 describes the task and related notations, followed by a concise introduction to the baseline BERT. In Section 4, we propose our incremental language models with three variants of injection methods. In Section 5, we present our token-level multi-hop relationship filtering mechanism. Section 6 shows the experimental details and results. Section 7 gives further analysis to verify the effectiveness of our methods. Section 8 introduces related work. Section 9 concludes.

2 Contributions

The main contributions of this paper can be summarized as follows:

  1. We have proposed three simple yet effective injection methods plugged into BERT to incorporate off-the-shelf commonsense representations for multi-choice MRC;

  2. We have introduced a token-level multi-hop mask mechanism to adaptively select relevant external knowledge, emphasizing knowledge-aware tokens through the self-attention (SA) scores;

  3. We have evaluated the incremental BERT on three prevalent multi-choice datasets: DREAM, CosmosQA, and RACE. DREAM and CosmosQA contain a higher proportion of commonsense questions, while RACE has few commonsense questions. The incremental BERT obtains considerable improvements on the two knowledge-driven datasets and comparable results on RACE compared with the vanilla system. Further experimental analysis shows the robustness of the incremental model in the case of an incomplete training set.

3 Background

3.1 Task description

Given a passage C = {c1, c2, ..., cs}, a question Q = {q1, q2, ..., qm} about this passage, and a set of answer options A = {A1, A2, ..., Ak}, the target of multi-choice MRC is to choose the correct answer from the candidate set A.

3.2 Baseline

BERT is built on the Transformer backbone. In this paper, we directly use BERT as the baseline, which consists of a multi-layer bidirectional Transformer encoder and a linear classifier. Following [23], we concatenate the context C, the question Q, and each answer option Ai as the input sequence:

$$ \textrm{[CLS]}c_{1..s}\textrm{[SEP]}q_{1..m}\textrm{[SEP]}a^{i}_{1..n}\textrm{[SEP]} $$

where [SEP] is the separating token, and [CLS] is the token for classification. For each token, the input representation is constructed as:

$$ \boldsymbol{BE}_{i}=\boldsymbol{e}^{tok}_{i}+\boldsymbol{e}^{pos}_{i}+\boldsymbol{e}^{seg}_{i}, i=1..T $$

where \(\boldsymbol {e}^{tok}_{i}\), \(\boldsymbol {e}^{pos}_{i}\), \(\boldsymbol {e}^{seg}_{i}\), and T are the token embedding, position embedding, segment embedding, and maximum sequence length, respectively. Tokens in C share the same segment embedding pseg, and tokens in Q and Ai share the same segment embedding qaseg.
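As a concrete illustration of this input construction, the following is a minimal PyTorch sketch; the class name, vocabulary size, and dimensions are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the BERT-style input construction described above.
# Sizes (vocab_size, d1, max_len) and the two-segment scheme are illustrative.
class InputRepresentation(nn.Module):
    def __init__(self, vocab_size=30522, d1=768, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d1)  # e^tok
        self.pos = nn.Embedding(max_len, d1)     # e^pos
        self.seg = nn.Embedding(2, d1)           # e^seg: 0 for passage C, 1 for question+option

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, T) ids of the [CLS] c [SEP] q [SEP] a [SEP] sequence
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)  # BE
```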

Such input representations are then fed into a stack of Transformer encoder blocks, which contains two sub-layers. The first sub-layer is a multi-head self-attention MHA. Given a matrix of T query vectors \(\boldsymbol {Q} \in \mathbb {R}^{T \times d_{1}}\), keys \(\boldsymbol {K} \in \mathbb {R}^{T \times d_{1}} \) and values \(\boldsymbol {V} \in \mathbb {R}^{T \times d_{1}} \), MHA(Q, K, V) is computed as:

$$ \begin{array}{@{}rcl@{}} & \texttt{Attention}(\boldsymbol{Q}, \boldsymbol{K},\boldsymbol{V})=softmax(\frac{\boldsymbol{Q}\boldsymbol{K}^{T}}{\sqrt{d_{1}}})\boldsymbol{V} & \end{array} $$
(1)
$$ \begin{array}{@{}rcl@{}} & b_{j}= \texttt{Attention}(\boldsymbol{Q}\boldsymbol{W}_{j}^{Q},\boldsymbol{K}\boldsymbol{W}_{j}^{K},\boldsymbol{V}\boldsymbol{W}_{j}^{V})& \end{array} $$
(2)
$$ \begin{array}{@{}rcl@{}} & B=Concat(b_{1}, ... ,b_{H}) & \end{array} $$
(3)

where d1 is the number of hidden units, H denotes the number of heads used to attend to different channels of the value vectors, and \(\boldsymbol {W}_{j}^{Q} \in \mathbb {R}^{d_{1} \times d_{1}/H}\), \(\boldsymbol {W}_{j}^{K} \in \mathbb {R}^{d_{1} \times d_{1}/H}\), and \(\boldsymbol {W}_{j}^{V} \in \mathbb {R}^{d_{1} \times d_{1}/H}\) are the parameters of the linear mapping layers for the j-th head. The second sub-layer is a position-wise fully connected feed-forward network (FFN), which consists of two dense linear layers with a GELU activation in between.

$$ \begin{array}{@{}rcl@{}} & \boldsymbol{u}^{l}=\texttt{MHA}(\boldsymbol{h}^{l}, \boldsymbol{h}^{l}, \boldsymbol{h}^{l})& \end{array} $$
(4)
$$ \begin{array}{@{}rcl@{}} & \boldsymbol{h}^{l+1}=\texttt{FFN}(\boldsymbol{u}^{l})& \end{array} $$
(5)
$$ \begin{array}{@{}rcl@{}} & \texttt{FFN}(\boldsymbol{x})=\boldsymbol{W}_{2}\textrm{GELU}(\boldsymbol{W}_{1}\boldsymbol{x}+\boldsymbol{b}_{1})+\boldsymbol{b}_{2} & \end{array} $$
(6)

where \(\boldsymbol {h}^{l}\in \mathbb {R}^{T \times d_{1}}\) denotes the hidden state at the l-th layer. We utilize the input representations BE as the initial state h0. Note that we omit residual connection and layer normalization used in each sub-layer for simplicity, and refer readers to [31] and [5] for more details.
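To make the per-block computation in Eqs. (1)-(6) concrete, here is a minimal PyTorch sketch of one encoder block; it omits residual connections and layer normalization as the text does, uses the conventional per-head scaling by the head dimension d1/H, and its dimensions are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of one Transformer encoder block following Eqs. (1)-(6).
class EncoderBlock(nn.Module):
    def __init__(self, d1=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.h, self.d_head = n_heads, d1 // n_heads
        self.wq = nn.Linear(d1, d1)  # stacks W_j^Q for all heads
        self.wk = nn.Linear(d1, d1)  # stacks W_j^K
        self.wv = nn.Linear(d1, d1)  # stacks W_j^V
        self.w1 = nn.Linear(d1, d_ff)  # FFN first layer
        self.w2 = nn.Linear(d_ff, d1)  # FFN second layer

    def mha(self, q, k, v):
        B, T, _ = q.shape
        split = lambda x: x.view(B, T, self.h, self.d_head).transpose(1, 2)  # (B, H, T, d_head)
        q, k, v = split(self.wq(q)), split(self.wk(k)), split(self.wv(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # Eq. (1), per head
        b = F.softmax(scores, dim=-1) @ v                          # Eq. (2)
        return b.transpose(1, 2).reshape(B, T, -1)                 # Eq. (3): concat heads

    def forward(self, h):
        u = self.mha(h, h, h)                                      # Eq. (4)
        return self.w2(F.gelu(self.w1(u)))                         # Eqs. (5)-(6)
```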

The final hidden state of the [CLS] token, \(\boldsymbol {h}^{L}_{[CLS]}\), is then projected into a score \(p_{i} \in \mathbb {R}^{1}\) via a linear layer. For each question, we obtain the logit vector p = [p1, p2, ... , pk] over all options and choose the option with the highest score as the answer.
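The option-scoring step can likewise be sketched as follows; the hidden size and helper name are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of the multi-choice classification head: the final [CLS] state of each
# (passage, question, option_i) sequence is projected to a scalar score p_i and the
# option with the highest score is chosen. The hidden size 768 is illustrative.
classifier = nn.Linear(768, 1)

def predict(h_cls_per_option):
    # h_cls_per_option: (k, d1) tensor of final [CLS] states, one row per option
    p = classifier(h_cls_per_option).squeeze(-1)  # logits p = [p_1, ..., p_k]
    return torch.argmax(p).item()                 # index of the predicted option
```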

4 Incremental BERT with commonsense

4.1 Knowledge integration mechanism

Many studies have shown that large-scale pre-trained language models based on the Transformer, such as BERT, have a strong ability to represent text. However, they ignore the effective integration of external commonsense knowledge, which plays an important role in conversation comprehension. To this end, we explore three token-level injection methods that extend BERT to flexibly incorporate external knowledge. Specifically, we integrate the commonsense embeddings CE, selected with a multi-hop co-occurrence mask (the knowledge representations and their selection are described in Section 5), into BERT in three ways: additive feature-based gating, multi-level linear transformation, and multi-head attentional fusion. We denote the three methods as “gate”, “linear”, and “attention”, respectively.

Additive Feature-based Gating

As depicted in the upper left part of Fig. 2, the “gate” method adds a gated ConceptNet representation of each selected commonsense-associated token to the corresponding input representation. Specifically, for each token ti, we integrate the input representation BEi with the external knowledge embedding \(\boldsymbol {CE}_{i}\in \mathbb {R}^{d_{2}}\) as:

$$ \boldsymbol{In}_{i}=\boldsymbol{BE}_{i}+\sigma (\boldsymbol{W}_{g}\boldsymbol{CE}_{i}+\boldsymbol{b}_{g}) $$
(7)

where σ denotes the sigmoid activation function serving as a gating mechanism and \(\boldsymbol {W}_{g} \in \mathbb {R}^{d_{1}\times d_{2}}\) is a trainable weight parameter. This gating mechanism generates a mask vector from each CEi with values between 0 and 1, incorporating information into the salient dimensions of BEi.
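A minimal PyTorch sketch of this gating step (Eq. 7), with illustrative dimensions d1 and d2, might look as follows.

```python
import torch
import torch.nn as nn

# Sketch of additive feature-based gating (Eq. 7): a sigmoid-gated projection of the
# commonsense embedding CE_i is added to the input representation BE_i.
class AdditiveGate(nn.Module):
    def __init__(self, d1=768, d2=300):
        super().__init__()
        self.wg = nn.Linear(d2, d1)  # W_g and b_g

    def forward(self, be, ce):
        # be: (batch, T, d1) input representations; ce: (batch, T, d2) filtered commonsense
        return be + torch.sigmoid(self.wg(ce))
```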

Fig. 2 Overview of the incremental language model. The three proposed fusion methods are abbreviated as “gate”, “linear”, and “attention”, respectively

Multi-level Linear Transformation

The middle part of Fig. 2 shows the second method, “linear”, which integrates the external knowledge at each intermediate FFN layer of BERT. For each Transformer encoder block, we replace the second sub-layer with a knowledge fusion layer that combines the token representations with their corresponding commonsense embeddings, computed as:

$$ \begin{array}{@{}rcl@{}} & \tilde{\boldsymbol{u}}^{l}_{i}=\textrm{GELU}(\boldsymbol{W}^{l}_{1}\boldsymbol{u}^{l}_{i}+\tilde{\boldsymbol{W}}^{l}_{1}\boldsymbol{CE}_{i}+\boldsymbol{b}^{l}) & \end{array} $$
(8)
$$ \begin{array}{@{}rcl@{}} & \boldsymbol{h}^{l+1}_{i} = \boldsymbol{W}_{2}\tilde{\boldsymbol{u}}^{l}_{i}+\boldsymbol{b}_{2} & \end{array} $$
(9)

where \(\tilde {\boldsymbol {W}}^{l}_{1} \in \mathbb {R}^{d_{1}\times d_{2}}\) is a trainable weight parameter. Note that this method is similar in spirit to the work of [48]. However, since our method focuses on the role of commonsense invariance between related tokens in text-based comprehension while their approach targets knowledge-driven tasks, we do not apply multi-head self-attention or mutual projection to the knowledge embedding encoding. Instead, the knowledge embeddings are kept fixed across the Transformer encoder blocks, which is simpler and does not require a pre-training objective.
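The following PyTorch sketch illustrates the knowledge fusion layer of Eqs. (8)-(9); the inner dimension and the mapping of CE into it are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the knowledge fusion layer replacing the FFN sub-layer (Eqs. 8-9): the fixed
# commonsense embedding CE_i is linearly mapped and added inside the GELU projection.
class KnowledgeFusionFFN(nn.Module):
    def __init__(self, d1=768, d2=300, d_ff=3072):
        super().__init__()
        self.w1 = nn.Linear(d1, d_ff)                 # W_1^l, b^l
        self.w1_ce = nn.Linear(d2, d_ff, bias=False)  # tilde{W}_1^l
        self.w2 = nn.Linear(d_ff, d1)                 # W_2, b_2

    def forward(self, u, ce):
        # u: (batch, T, d1) self-attention output; ce: (batch, T, d2) commonsense embeddings
        u_tilde = F.gelu(self.w1(u) + self.w1_ce(ce))  # Eq. (8)
        return self.w2(u_tilde)                        # Eq. (9)
```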

Multi-head Attentional Fusion

The third method, as depicted in the “attention” part of Fig. 2, is inspired by the work of [18] and applies attention-based integration to the final hidden states hL. Specifically, we add another Transformer encoder block with two multi-head attention sub-layers to the output of the BERT encoder. The first sub-layer is a multi-head knowledge attention (KA) computed as:

$$ \boldsymbol{v}^{L}=\texttt{MHA}(\boldsymbol{h}^{L}, \tilde{\boldsymbol{CE}}, \tilde{\boldsymbol{CE}}) $$
(10)

where \(\tilde {\boldsymbol {CE}}\) is the concatenation of CE and a knowledge sentinel \(\boldsymbol {s} \in \mathbb {R}^{d_{2}}\). Considering that not all tokens are relevant to the background knowledge, we follow [42] and employ the sentinel vector to control the trade-off between background knowledge and information from the passage text. We thus obtain the knowledge-aware context representations vL and feed them into the second sub-layer, which consists of a multi-head self-attention and an FFN:

$$ \begin{array}{@{}rcl@{}} & \tilde{\boldsymbol{v}}^{L}=\texttt{MHA}(\boldsymbol{v}^{L},\boldsymbol{v}^{L}, \boldsymbol{v}^{L}) & \end{array} $$
(11)
$$ \begin{array}{@{}rcl@{}} & \boldsymbol{y}^{L}=\texttt{FFN}(\tilde{\boldsymbol{v}}^{L})& \end{array} $$
(12)

Note that we also employ residual connections and layer normalization around each attention layer. We replace hL with yL to predict the correct answer.
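A minimal PyTorch sketch of this fusion block (Eqs. 10-12), built with torch.nn.MultiheadAttention and omitting the residual connections and layer normalization mentioned above, could look like the following; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of multi-head attentional fusion: the final BERT states attend over the
# commonsense embeddings concatenated with a learnable knowledge sentinel (Eq. 10),
# followed by self-attention (Eq. 11) and an FFN (Eq. 12).
class AttentionalFusion(nn.Module):
    def __init__(self, d1=768, d2=300, n_heads=12):
        super().__init__()
        self.sentinel = nn.Parameter(torch.zeros(1, 1, d2))  # knowledge sentinel s
        self.ka = nn.MultiheadAttention(d1, n_heads, kdim=d2, vdim=d2, batch_first=True)
        self.sa = nn.MultiheadAttention(d1, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d1, 4 * d1), nn.GELU(), nn.Linear(4 * d1, d1))

    def forward(self, h_L, ce):
        # h_L: (batch, T, d1) final BERT states; ce: (batch, T, d2) commonsense embeddings
        ce_tilde = torch.cat([ce, self.sentinel.expand(h_L.size(0), -1, -1)], dim=1)
        v, _ = self.ka(h_L, ce_tilde, ce_tilde)  # Eq. (10): knowledge attention
        v_tilde, _ = self.sa(v, v, v)            # Eq. (11): self-attention
        return self.ffn(v_tilde)                 # Eq. (12)
```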

5 Commonsense representation and filtering

Existing commonsense resources are usually provided as structured data. Considering the diversity of commonsense knowledge and the availability of ready-made vector representations, we use ConceptNet 5.5, a knowledge graph (KG) that includes linguistic and world knowledge from many different sources such as WordNet [21] and DBPedia. Commonsense in ConceptNet is represented in the form of a triple (subject, relation, object). For example, “a dog has a tail” can be represented as (dog, HasA, tail). Additionally, daily lexical knowledge and even emojis can be found in ConceptNet (e.g., (lol, DerivedFrom, laugh)). We believe that such graph-structured knowledge can be useful for multi-choice MRC that involves further reasoning with commonsense. Below we first introduce the commonsense knowledge representations, and then present a token-level multi-hop knowledge filtering method.

5.1 Knowledge graph embedding

Unlike previous work that trains knowledge embeddings before or after retrieving relevant entities, we directly leverage off-the-shelf ConceptNet embeddings as the external knowledge representation, capturing global commonsense relationships. Specifically, we retrieve the tokens in the common vocabulary of BERT and ConceptNet and extract the corresponding KG embeddings; for BERT tokens not found in ConceptNet, we set their embeddings to zero vectors (a sketch of this lookup follows the three descriptions below). We use three types of representation for the common tokens: ConceptNet-PPMI, ConceptNet Numberbatch, and Randomly Initialized Embedding.

ConceptNet-PPMI

A matrix of word embeddings trained on a sparse, symmetric term-term matrix in which each cell contains the sum of the weights of all edges connecting the two corresponding terms. For each term in the ConceptNet graph, its ConceptNet-PPMI representation therefore reflects the information of the other nodes to which it is connected.

ConceptNet Numberbatch

A set of semantic vectors built with an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting. Word embeddings in ConceptNet Numberbatch can represent both text-based context and structured knowledge.

Randomly Initialized Embedding

Since relations are not explicitly scored or represented, we also use randomly initialized embeddings for tokens, which allows us to analyze the indirect commonsense relations between words in the passage and the effect of the KG embeddings.
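As referenced above, a minimal sketch of the vocabulary alignment and lookup might look as follows; the helper name and data structures (bert_vocab, conceptnet_vectors) are assumptions about how the resources are loaded.

```python
import numpy as np

# Sketch of aligning off-the-shelf ConceptNet embeddings with the BERT vocabulary:
# tokens found in both vocabularies receive their ConceptNet vector, all others zeros.
def build_commonsense_table(bert_vocab, conceptnet_vectors, d2=300):
    # bert_vocab: dict token -> id; conceptnet_vectors: dict term -> np.ndarray of shape (d2,)
    table = np.zeros((len(bert_vocab), d2), dtype=np.float32)
    for token, idx in bert_vocab.items():
        term = token.lstrip("#")  # drop WordPiece '##' markers before the lookup
        if term in conceptnet_vectors:
            table[idx] = conceptnet_vectors[term]
    return table  # row i is CE for BERT token id i (zero vector if not in ConceptNet)
```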

5.2 Token-level multi-hop knowledge filtering

Although vectors computed from the knowledge graph can represent commonsense relationships, fusing these embeddings into all tokens of the question-oriented passage is often unhelpful or even noisy. Moreover, the model may require commonsense relations not directly stated in the context to reach the correct option. For example, Fig. 3 shows that the model may need multi-hop commonsense to reason about where the conversation takes place. Therefore, to improve the precision of the injected information, we design a mask vector M to filter the commonsense representations. Specifically, the length of M equals the input sequence length, and we initialize the mask values of all tokens to 1. For each token t1 ∈ Ai that is not a stop word or a padding token, we set \(\boldsymbol {M}_{index(t_{1})} = 0\) and use it as a subject concept to search for object concepts t2 ∈ C ∪ Q connected to t1 in ConceptNet; we then set \(\boldsymbol {M}_{index(t_{2})} = 0\) and continue searching for t3 ∈ C starting from t2. For concepts consisting of multiple tokens (e.g., sign_contract), we mask the subtokens in the passage and repeat the above operation. We present the overall procedure in Fig. 4.

Fig. 3 An example of multi-hop relation searching. In ConceptNet, “bank” is connected to “money”, “cash” and “dollars” through the RelatedTo relation. Further, “sign contract” and “exchange” can be found

Fig. 4 Procedure of the token-level multi-hop knowledge filtering mechanism
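A simplified sketch of this mask construction, treating the passage and question jointly as the context and relying on a hypothetical conceptnet_neighbors helper, is shown below.

```python
# Sketch of the token-level multi-hop mask construction (mask value 0 keeps the
# commonsense embedding, 1 filters it out). `conceptnet_neighbors(term)` is a
# hypothetical helper returning the object concepts connected to `term` in ConceptNet.
def build_mask(option_tokens, context_tokens, conceptnet_neighbors, stopwords, max_hops=2):
    tokens = option_tokens + context_tokens
    mask = [1] * len(tokens)
    frontier = set()
    for i, t in enumerate(option_tokens):
        if t not in stopwords and t != "[PAD]":
            mask[i] = 0               # option token t1 is knowledge-aware
            frontier.add(t)
    for _ in range(max_hops):
        next_frontier = set()
        for t in frontier:
            neighbors = set(conceptnet_neighbors(t))
            for j, c in enumerate(context_tokens, start=len(option_tokens)):
                if c in neighbors:    # object concept found in the passage/question
                    mask[j] = 0
                    next_frontier.add(c)
        frontier = next_frontier
    return mask
```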

Thus, we obtain the mask vector M, which contains only binary 0/1 values. We further define the mask operation as follows:

$$ {\varPhi}_{mask}(\boldsymbol{CE}_{i})=\begin{cases} \boldsymbol{CE}_{i}, & \boldsymbol{M}_{i} = 0 \\ \boldsymbol{0}, & \boldsymbol{M}_{i} = 1 \end{cases} $$
(13)

For tokens corresponding to multiple concepts in multi-hop alignment, we use a single-layer feedforward network for weighted integration:

$$ \begin{array}{@{}rcl@{}} & \boldsymbol{CE_{i}}= {\sum}_{k=1}^{K}\alpha_{k} * c_{i,k}& \end{array} $$
(14)
$$ \begin{array}{@{}rcl@{}} & \alpha_{k}=\frac{e^{\boldsymbol{w}c_{i,k}}}{{\sum}_{k=1}^{K}e^{\boldsymbol{w}c_{i,k}}} \end{array} $$
(15)

where \(\boldsymbol {w} \in \mathbb {R}^{d_{2}}\) is a trainable weight parameter and K is the number of concepts containing the token in multi-hop alignment.
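A PyTorch sketch of the filtering and weighted integration in Eqs. (13)-(15), with an illustrative embedding size, is given below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the mask operation (Eq. 13) and the softmax-weighted integration of the K
# concept vectors aligned to one token (Eqs. 14-15).
class ConceptAggregator(nn.Module):
    def __init__(self, d2=300):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(d2))  # weight vector w

    def forward(self, concept_vecs, mask_i):
        # concept_vecs: (K, d2) embeddings c_{i,1..K} for token i; mask_i: 0 keeps, 1 filters
        alpha = F.softmax(concept_vecs @ self.w, dim=0)         # Eq. (15)
        ce_i = (alpha.unsqueeze(-1) * concept_vecs).sum(dim=0)  # Eq. (14)
        return ce_i if mask_i == 0 else torch.zeros_like(ce_i)  # Eq. (13)
```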

The filtered commonsense embeddings CE are taken as input to the three fusion methods depicted in Fig. 2. In essence, the commonsense filtering mechanism improves the prediction of commonsense questions by integrating effective representations that change the token-level attention weights within the language model.

6 Experiments

6.1 Dataset and evaluation metric

We report results on three well-known multi-choice datasets, CosmosQA [13], DREAM [29], and RACE [17], which are summarized in Table 1. The datasets are described below:

CosmosQA:

is a large-scale dataset that requires commonsense-based reading comprehension, formulated as multiple-choice questions. In contrast to most existing MRC datasets where the questions focus on a factual and literal understanding of the context paragraph, CosmosQA focuses on reading between the lines over a diverse collection of people’s everyday narratives.

DREAM:

is collected from text material of listening comprehension examinations designed for evaluating the dialog understanding level of Chinese learners of English. DREAM contains 34% questions with unspoken commonsense, which requires the model to answer these questions not only by advanced reading skills but also with rich background knowledge.

RACE:

consists of two subsets, RACE-M and RACE-H, corresponding to English exams for middle- and high-school Chinese students, and is recognized as one of the largest and most difficult multi-choice reading comprehension datasets.

Table 1 Statistics of multi-choice machine reading comprehension datasets. ∗ denotes the numbers are based on 500 samples

For all datasets, we use the official train/dev/test splits. For the multi-choice MRC task, the evaluation metric is accuracy, calculated as acc = N+/N, where N+ denotes the number of examples for which the model selects the correct answer and N denotes the total number of evaluation examples.

6.2 Implementation details

We implement our experiments with the Huggingface library. We use BERT-base and BERT-large as baseline systems. To keep the orders of magnitude comparable, we preprocess ConceptNet-PPMI with L2 normalization. We experiment with commonsense relation searching of up to three hops and set K = 3. The commonsense embeddings are fixed during fine-tuning, while the BERT parameters are trainable and initialized from the Huggingface checkpoint. For all fine-tuning experiments, we use BertAdam as the optimizer. We employ early stopping and predict on the test set using the best model on the development set.
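The L2 normalization of the ConceptNet-PPMI table mentioned above can be sketched as follows; the zero-row guard is our assumption for tokens with no ConceptNet entry.

```python
import numpy as np

# Sketch of L2-normalizing the (vocab_size, d2) ConceptNet-PPMI table row-wise so its
# magnitude is comparable to BERT's hidden states; zero rows are left unchanged.
def l2_normalize(table, eps=1e-12):
    norms = np.linalg.norm(table, axis=1, keepdims=True)
    return table / np.maximum(norms, eps)
```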

For training, we run all experiments on two 16 GB Quadro P5000 GPUs. For CosmosQA, we set the max sequence length T to 256 and select the hyperparameters from batch sizes {16, 32, 64} and learning rates {5e-5, 2e-5, 1e-5, 8e-6}; it takes about 8 hours to obtain the best result. For DREAM, we run experiments for 8 epochs, set the max sequence length to 512, and select the hyperparameters from batch sizes {8, 12, 24, 36} and learning rates {2e-5, 1e-5, 8e-6}; it takes about 4 hours to obtain the best result. For RACE, we run experiments for 3 epochs, set the max sequence length to 512, and select the hyperparameters from batch sizes {8, 16, 32} and learning rates {3e-5, 2e-5, 1e-5}; it takes about 12 hours to obtain the best result. Table 2 lists the best hyperparameters on the development sets, which we use for evaluation on the test sets.

Table 2 The best hyperparameters on different datasets (BERT-base/BERT-large). T denotes the max sequence length

6.3 Results

We compare the performance of the three proposed fusion methods with the two baselines in Table 3, where models on the leaderboards and publications are also shown.

  (1) BERT+WAE: To mimic the human exclusion strategy, the authors train their model with both a wrong-answer loss and a correct-answer loss to improve generalization and to exclude plausible but wrong options.

  (2) MMM: It involves two sequential stages: a coarse-tuning stage using out-of-domain datasets and a multi-task learning stage using a larger in-domain dataset, helping the model generalize better with limited data. Furthermore, the authors propose a multi-step attention network (MAN) as the top-level classifier for this task.

  (3) DUMA: It proposes a going-back-to-the-basics solution that directly models the MRC relationship as an attention mechanism inside the network.

  (4) DCMN: It proposes a dual co-matching network (DCMN) that models the relationship among passage, question, and answer options bidirectionally. Besides, it integrates two reading strategies: passage sentence selection and answer option interaction.

  (5) Multiway: It performs multiway attention over the BERT encoding output. Specifically, the mutual attention among the passage, question, and option is calculated separately and pooled into the final representation.

ConceptNet Numberbatch is used as the commonsense representation (the role of the knowledge embedding is discussed in Section 7), and two-hop commonsense relation searching is applied to filter knowledge.

Table 3 Accuracy (%) on the multi-choice datasets including CosmosQA, DREAM and RACE. ConceptNet Numberbatch is used as commonsense representation and two-hop relation searching is applied. “-B” means the base model and “-L” means the large model. Due to the submission limit of CosmosQA, we only evaluate the incremental BERT-large model and publish the best result

From the results, we observe that our plug-in methods of incorporating commonsense improve performance over the vanilla BERT on DREAM and CosmosQA. Specifically, multi-level linear transformation achieves the best results on CosmosQA (69.2% vs. 66.8% with BERT-large) and DREAM (65.3% vs. 62.8% with BERT-base and 69.3% vs. 66.6% with BERT-large). Compared with the other two methods, multi-head attentional fusion improves less on CosmosQA and DREAM and decreases performance on RACE. On the knowledge-driven multi-choice tasks, the incremental model variants obtain considerable improvements of 0.7%-2.7% in average accuracy over the directly fine-tuned BERT baseline. In contrast, our incremental models achieve comparable results on RACE. On the one hand, this suggests that RACE requires little external knowledge for reading comprehension; on the other hand, it illustrates that our methods do not lose textual information after heterogeneous knowledge fusion. Compared to the published models, although the performance is slightly worse on DREAM and RACE, the proposed methods have two advantages: 1) unlike DUMA and DCMN+, which are designed as complex interactive matching networks, only a few mapping parameters and a single layer of parallel attention calculation are added to fuse commonsense into BERT; 2) unlike MMM, which uses data from out-of-domain tasks for transfer learning, the incremental BERT significantly improves performance by direct fine-tuning. In addition, existing methods have difficulty clearly explaining predictions on commonsense questions. In contrast, we directly incorporate off-the-shelf commonsense representations into BERT’s internal structure through token-level pre-matching, making the use of external knowledge explicit and yielding interpretable performance improvements.

7 Discussion

7.1 Knowledge embedding

Table 4 shows the results of our incremental BERT-base (linear) model with different commonsense representations. From this table, we see that adding ConceptNet-PPMI globally has a negative impact on the performance of BERT, while fusing it according to multi-hop commonsense relations improves the results. A possible reason is that ConceptNet-PPMI only contains structured information from the knowledge graph, introducing considerable noise when integrated indiscriminately. Hence, the multi-hop commonsense filtering algorithm helps BERT utilize the structured information effectively, which is also demonstrated in the experiment with random initialization. Moreover, the incremental model using randomly initialized commonsense embeddings performs better than the one using ConceptNet-PPMI under global fusion, which suggests that heterogeneous information is difficult to integrate directly without prior filtering, since the pre-training procedure for language representation is quite different from that for knowledge representation.

Table 4 Performance in accuracy (%) with different knowledge representations. We use BERT-base (linear) and the DREAM development set for analysis

7.2 Multi-hop commonsense selection

Table 5 illustrates the role of commonsense filtering, where we also integrate commonsense representations for every token in C and Ai for the multi-hop analysis (“global” in Table 5). We can see that: (1) all three methods achieve their best results with two-hop commonsense relation searching, which means that indirect commonsense concepts do not always help; (2) multi-head attentional fusion performs better only with no more than two-hop commonsense relations, probably because the knowledge-context attention mechanism is not sensitive to excessive noise from fusion. Interestingly, additive feature-based gating with global commonsense performs better than with one-hop commonsense on DREAM and CosmosQA. We hypothesize that ConceptNet Numberbatch contains text-based lexical information, since it is obtained by jointly retrofitting word2vec and GloVe.

Table 5 Accuracy (%) on the CosmosQA, DREAM and RACE development sets with different numbers of hops in commonsense relation searching, where “global” means commonsense representations are integrated into all tokens

7.3 Self-attention

To verify our goal of enabling the self-attention in BERT to identify knowledge-aware tokens, we consider the case depicted in Fig. 3. In this case, BERT chooses the wrong candidate option (A) while our models make the right choice (B). We visualize the token correlations captured by BERT and by two-hop BERT-base (linear) in Fig. 5a and b, obtained from the penultimate self-attention layer of each model. For BERT, the token “bank” has a low degree of similarity to all tokens ti ∈ C except “traveler” and “cheques”, and the focus of almost all tokens in the dialog is quite scattered. Moreover, some tokens have a relatively high degree of similarity to “conversation” and the segment token, which is not enough to support the model in choosing the correct conversation place. By contrast, our incremental model learns more accurate representations to understand the commonsense relation between the passage and the candidate option and infers the correct answer. From Fig. 5b, we observe that “bank” has a high degree of relevance to “cash”, “sign”, “money”, “exchange” and “dollars”, which closely reflects their commonsense relationships shown in Fig. 3. In addition, the original similarity between “bank” and “cheques” is retained or even strengthened. This illustrates that the commonsense fusion method preserves textual information while effectively utilizing heterogeneous knowledge.

Fig. 5 Case study. In this case, BERT (a) chooses the wrong candidate option and our models make the right choice. Two-hop BERT-base (linear) (b) is used for comparison. Heat maps present similarities between correct-answer (row) and dialog (column) tokens

7.4 Incomplete training set

BERT pre-trained on large-scale texts is still deficient in explicitly representing the relationships between commonsense concepts. The smaller the text training set in a downstream knowledge-driven task, the higher the requirement on the model's commonsense understanding ability. We show the results under different incomplete training set settings in Fig. 6, using BERT-base as the baseline. The performance of all models shows a similar trend as the training set size decreases. Compared to the vanilla BERT, our incremental models remain more robust. It is worth mentioning that the performance of the three-hop models decreases more slowly than that of the one-hop models when the training set size drops to 60% and 40%. The three methods behave differently for different numbers of hops. We argue that commonsense is needed more when the scale of the text training set decreases beyond a certain extent. Augmenting BERT with external knowledge thus yields significant improvements in the settings with incomplete training sets.

Fig. 6 Accuracy on the DREAM development set as the training set size decreases. BERT-base is used for comparison

7.5 Computational costs

We report the computing resources used in our experiments. Each component's parameters and the running time for each variant (1-hop/2-hop/3-hop) are summarized in Table 6. Since the proposed methods add few parameters, each variant takes roughly the same training time as the BERT-large baseline; the computational bottleneck comes mainly from BERT itself and the multi-hop token alignment. Considering the performance improvement of the two-hop relation search, the increase in overall running time is acceptable. However, the huge number of parameters and the long running time mean that much work remains before the model can be deployed as a practical question answering system. Interestingly, we found that the running time on RACE did not increase significantly as the number of search hops increased, which further reflects that RACE contains few commonsense questions.

Table 6 Computational costs for each variant of proposed methods. BERT-large is taken as the baseline. Running time is the sum of relation search time (1-hop/2-hop/3-hop) and model training time

7.6 Error analysis

We conduct the following error analysis to investigate the problems that our model lacks the ability to address. We randomly extract 200 samples from the development set of DREAM and classify them into several question types according to the annotation criteria of [29]. We compare two-hop BERT-base (linear) with BERT-base on these categories, as shown in Table 7. Both models perform worse than random guessing (33.3%) on math problems, since ConceptNet does not contain commonsense about mathematical computation, especially time and currency; addressing this is left to future work. Although superior to BERT on implicit questions (e.g., under the categories logic and commonsense) which require external knowledge, our incremental model is less capable of answering questions under the category summary. We hypothesize that integrating token-level commonsense may interfere with reasoning that requires aggregating information from multiple sentences.

Table 7 Error analysis on DREAM. The column of “Proportion” reports the percentage of question types among 200 samples that are from the development set of DREAM dataset

8 Related work

Machine Reading Comprehension

In recent years, many MRC datasets have been released to cover different task scenarios, e.g., cloze-style [8, 9], extractive/abstractive answers [6, 15, 16, 22, 25], multi-choice [17], conversational QA [3, 26], multi-hop [38, 43], and settings that require external knowledge [4, 13, 19, 30, 46]. Most MRC datasets that require external knowledge, such as ARC, DREAM, OpenBookQA, CommonsenseQA and CosmosQA, are designed in a multi-choice form. In this paper, we focus on the multi-choice MRC task and therefore choose CosmosQA, DREAM and RACE for our experiments. For multi-choice MRC, existing methods include designing interactions among the passage, question and options [35, 47, 50] or transfer learning through data augmentation [14]. Nevertheless, these methods do not rely on commonsense knowledge for logical reasoning.

Integrating External Knowledge for MRC

Existing work has utilized structured knowledge from KBs/KGs to improve performance on MRC and QA. Yang et al. [42] incorporate retrieved knowledge into an LSTM by employing an attention mechanism with a sentinel. Bauer et al. [1] select grounded multi-hop relational commonsense information from ConceptNet via pointwise mutual information and a term-frequency-based scoring function, and use a selectively gated attention mechanism to fuse the knowledge. Mihaylov et al. [20] introduce a mixed attention over external knowledge for cloze-style reading comprehension. Chen et al. [2], Wang et al. [33] and Zhong et al. [49] explore the effect of semantic relations from KGs such as ConceptNet on MRC. Wang et al. [32] propose a data enrichment method that uses WordNet to extract inter-word semantic connections as general knowledge from each passage-question pair. Xiong et al. [40] retrieve the corresponding entities and relations from text to aggregate answer evidence from an incomplete KB. Yang et al. [41] take BERT as the encoder and employ an attention mechanism similar to Yang et al. [42] to fuse globally pre-trained knowledge downstream. Compared to these methods, we focus on plug-in fusion methods and explore token-level multi-hop commonsense representation integration instead of relation embeddings.

Injecting knowledge into LMs

Neural networks and deep learning have been widely used in many fields such as computer vision and image processing [10,11,12, 44, 45]. Recently, pre-trained deep language models such as BERT have achieved strong results on downstream NLP tasks including MRC. The injection of external knowledge into LMs can generally be divided into two groups. Methods in the first group design auxiliary knowledge-driven objectives and update parameters in a multi-task learning manner [24, 37, 39, 48], which requires pre-computing knowledge representations and even pre-training BERT from scratch. The second group pre-trains external modules to assist LMs [34, 49]. In contrast, our fusion methods directly fine-tune on the target MRC datasets.

9 Conclusion

This paper introduces an incremental BERT with three plug-in fusion methods, which enhances the vanilla BERT with commonsense representations from ConceptNet. We use pre-computed ConceptNet embeddings as the external knowledge representation and introduce a mask mechanism for token-level multi-hop relationship searching to filter external knowledge, enabling the self-attention in BERT to identify knowledge-aware tokens effectively. The proposed variants achieve significant improvements over the baseline on two knowledge-driven multi-choice datasets. Experiments on RACE, a dataset with few commonsense questions, show that introducing external knowledge does not harm the understanding of the original text. Future work includes integrating external knowledge at a more granular relational level and designing an effective yet efficient model architecture for practical deployment.