
1 Introduction

Natural language sentence matching is the task of comparing two sentences and identifying the relationship between them. It is a fundamental technique for a variety of tasks. For example, in paraphrase recognition it is used to determine whether two sentences are paraphrases of each other, and in recognizing textual entailment it is used to determine whether a hypothesis sentence can be inferred from a premise sentence.

Recognizing Textual Entailment (RTE), proposed by Dagan [6], studies the relationship between a premise and a hypothesis, which is mainly classified as entailment, contradiction, or neutral. The main methods for recognizing textual entailment include similarity-based methods [15], rule-based methods [11], alignment-feature-based machine learning methods [18], etc. However, these methods do not perform well because they fail to capture the semantic information of the sentences. In recent years, deep learning-based methods have proven effective for semantic modeling, achieving good results on many NLP tasks [12, 13, 23]. Therefore, on the task of recognizing textual entailment, deep learning-based methods have outperformed earlier approaches and become dominant. For example, Bowman et al. used recurrent neural networks to model premises and hypotheses, which has the advantage of making full use of syntactic information [2]. They were also the first to apply LSTM sentence models to the RTE domain, encoding premises and hypotheses with an LSTM to obtain sentence vectors [3]. Wang et al. proposed the mLSTM model on this basis, which concatenates attention weights into the hidden states of the LSTM and thereby focuses on the parts where the premise and the hypothesis match semantically. Experimental results showed that the method achieved good results on the SNLI dataset [20].

Paraphrase recognition is also called paraphrase detection. The task is to determine whether two texts have the same meaning; if they do, they are called a paraphrase pair. Traditional paraphrase recognition methods focus on hand-crafted text features but suffer from problems such as low accuracy. Therefore, deep learning-based paraphrase recognition methods have become a hot research topic. They fall into two main types: 1) computing word vectors with neural networks and then measuring vector distances to decide whether two texts form a paraphrase pair. For example, Huang et al. used an improved EMD method to calculate the semantic distance between vectors and obtain the paraphrase relationship [7]. 2) Directly deciding whether a text pair is a paraphrase pair with a neural network model, which is essentially a binary classification approach. Wang et al. proposed the BIMPM model, which first encodes sentence pairs with a bidirectional LSTM and then matches the encoded results from multiple perspectives in both directions [21]. Chen et al. proposed the ESIM model, which uses a two-layer bidirectional LSTM and a self-attention mechanism for encoding, extracts features through average pooling and max pooling layers, and finally performs classification [5].

The models mentioned above have achieved good results on specific tasks, but most of them have difficulty extracting deep semantic information and effectively fusing the extracted semantic information. In this paper, we propose a sentence matching model based on deep interaction and fusion. We use bi-directional attention and self-attention to obtain high-level semantic information. Then, we use a heuristic fusion function to fuse the low-level and high-level semantic information into the final semantic representation. We conducted experiments on the SNLI dataset for the recognizing textual entailment task and on the Quora dataset for the paraphrase recognition task. The results show that the proposed model reaches an accuracy of 87.1% on the SNLI test set and 86.8% on the Quora test set. Our contributions can be summarized as follows:

  • We propose a sentence matching model based on deep interaction and fusion. To the best of our knowledge, it is the first to introduce the bidirectional attention mechanism into the sentence matching task.

  • We propose a heuristic fusion function. It learns the fusion weights with a neural network to achieve deep fusion.

  • We evaluate our model on two different tasks and validate its effectiveness.

2 BIDAF Model Based on Bi-directional Attention Flow

In the task of extractive machine reading comprehension, Seo et al. first proposed BIDAF (Bi-Directional Attention Flow), a model that computes both question-to-article and article-to-question attention [16]. Its structure is shown in Fig. 1.

Fig. 1. Bi-directional attention flow model

The model mainly consists of an embedding layer, a contextual encoder layer, an attention flow layer, a modeling layer, and an output layer. After character-level word embedding and pre-trained GloVe word embedding, the contextual representations X and Y of the article and the question are obtained by a bidirectional LSTM, respectively. The bi-directional attention flow between them is then computed as follows (a short code sketch of these steps is given after the list):

  a) The similarity matrix between the question and the article is calculated. The calculation formula is shown in Eq. 1.

    $$\begin{aligned} \begin{aligned}&K_{tj}=W^T\left[ X_{:t};Y_{:j};X_{:t}\odot Y_{:j} \right] \end{aligned} \end{aligned}$$
    (1)

    where \(K_{tj}\) is the similarity of the t-th article word to the j-th question word, \(X_{:t}\) is the t-th column vector of X, \(Y_{:j}\) is the j-th column vector of Y, and W is a trainable weight vector.

  b) Calculating the article-to-question attention. Firstly, each row of the above similarity matrix is normalized, and then the weighted sum of the question vectors is computed to obtain the article-to-question attention, as shown in Eq. 2.

    $$\begin{aligned} \begin{aligned}&x_t=soft\max \left( K_{t:} \right) \\&\hat{Y}_{:t}=\sum _j{x_{tj}Y_{:j}} \end{aligned} \end{aligned}$$
    (2)
  c) Question-to-article attention (Q2C, i.e., query-to-context in the original BIDAF notation) signifies which article words have the closest similarity to one of the question words and are hence critical for answering the question. We obtain the attention weights on the article words by \(y=softmax\!\,(max_{col}\!\,(K))\in R^T\), where the maximum function \(\max _{col}\) is performed across the column. Then the attended article vector is \(\hat{x}=\sum _t{y_tX_{:t}}\). This vector indicates the weighted sum of the most important words in the article with respect to the question. \(\hat{x}\) is tiled T times across the column, thus giving \(\hat{X}\in R^{2d*T}\).

  d) Fusion of the bidirectional attention flows. The attention in the two directions obtained above is spliced to obtain the new representation, as shown in Eq. 3.

    $$\begin{aligned} \begin{aligned}&L_{:t}=\left[ X_{:t};\hat{Y}_{:t};X_{:t}\odot \hat{Y}_{:t};X_{:t}\odot \hat{X}_{:t} \right] \end{aligned} \end{aligned}$$
    (3)
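For illustration, the following is a minimal PyTorch sketch of Eqs. 1-3 (a sketch under stated assumptions, not the original BIDAF implementation): batching and masking are omitted, the weight vector W is assumed to be a trainable parameter of size 6d, and variable names are illustrative.

```python
# Minimal sketch of the bi-directional attention flow (Eqs. 1-3).
import torch
import torch.nn as nn
import torch.nn.functional as F

def bidaf_attention(X: torch.Tensor, Y: torch.Tensor, W: nn.Parameter):
    """X: article representation (2d, T); Y: question representation (2d, J); W: (6d,)."""
    two_d, T = X.shape
    _, J = Y.shape
    # Eq. 1: K[t, j] = W^T [X_:t ; Y_:j ; X_:t * Y_:j]
    Xe = X.unsqueeze(2).expand(two_d, T, J)               # (2d, T, J)
    Ye = Y.unsqueeze(1).expand(two_d, T, J)               # (2d, T, J)
    K = torch.einsum('c,ctj->tj', W, torch.cat([Xe, Ye, Xe * Ye], dim=0))  # (T, J)
    # Eq. 2: article-to-question attention (row-wise softmax over question words)
    x = F.softmax(K, dim=1)                               # (T, J)
    Y_hat = Y @ x.t()                                     # (2d, T)
    # Q2C attention: softmax over the per-row maximum, attended vector tiled T times
    y = F.softmax(K.max(dim=1).values, dim=0)             # (T,)
    x_hat = X @ y                                         # (2d,)
    X_hat = x_hat.unsqueeze(1).expand(two_d, T)           # (2d, T)
    # Eq. 3: fused representation
    L = torch.cat([X, Y_hat, X * Y_hat, X * X_hat], dim=0)  # (8d, T)
    return L
```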

We build on this work by treating the two sentences of a natural language sentence matching task as the article and the question in reading comprehension. We use bi-directional attention and self-attention to obtain high-level semantic information. Then, we use a heuristic fusion function to fuse the low-level and high-level semantic information into the final semantic representation.

3 Method

In this section, we describe our model in detail. As shown in Fig. 2, our model mainly consists of an embedding layer, a contextual encoder layer, an interaction layer, a fusion layer, and an output layer.

Fig. 2. Overview of the architecture of our proposed DIFM model. It consists of an embedding layer, a contextual encoder layer, an interaction layer, a fusion layer, and an output layer.

3.1 Embedding Layer

The purpose of the embedding layer is to map the input sentences A and B into word vectors. The traditional mapping method is one-hot encoding; however, it is spatially expensive and inefficient, so we use pre-trained word vectors for word embedding. These word vectors are kept fixed during training.

Since the text contains out-of-vocabulary words, we also use character-level word embedding. Each word can be seen as a sequence of characters, and we run an LSTM over the characters to obtain a character-level word vector, which handles out-of-vocabulary words effectively.

We assume that the pre-trained word vector of word h is \({h_w}\) and its character-level word vector is \({h_c}\). We concatenate the two vectors and use a two-layer highway network [25] to obtain the word representation \(h = [{h_w};{h_c}] \in {R^{{d_1} + {d_2}}}\), where \({d_1}\) is the dimension of the GloVe word embedding and \({d_2}\) is the dimension of the character-level word embedding. Finally, we obtain the word embedding matrix \(X \in {R^{n\mathrm{{*}}({d_1} + {d_2})}}\) for sentence A and the word embedding matrix \(Y \in {R^{m*({d_1} + {d_2})}}\) for sentence B, where n and m are the numbers of words in sentence A and sentence B.
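A minimal sketch of this embedding layer is given below; it assumes a character LSTM of hidden size \(d_2\) and frozen pre-trained GloVe vectors of size \(d_1\), and all module and variable names are illustrative rather than taken from our implementation.

```python
# Sketch of the embedding layer: frozen GloVe vectors + character LSTM + 2-layer highway.
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Two-layer highway network: z = g * ReLU(W x) + (1 - g) * x."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.gates = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x):
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))
            x = g * torch.relu(transform(x)) + (1 - g) * x
        return x

class EmbeddingLayer(nn.Module):
    def __init__(self, glove_weights, num_chars, d_char=100):
        super().__init__()
        d_word = glove_weights.size(1)
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.char_emb = nn.Embedding(num_chars, d_char, padding_idx=0)
        self.char_lstm = nn.LSTM(d_char, d_char, batch_first=True)
        self.highway = Highway(d_word + d_char)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, n); char_ids: (batch, n, word_len)
        b, n, w = char_ids.shape
        h_w = self.word_emb(word_ids)                        # (b, n, d1)
        chars = self.char_emb(char_ids).view(b * n, w, -1)   # (b*n, w, d_char)
        _, (h_last, _) = self.char_lstm(chars)               # (1, b*n, d_char)
        h_c = h_last.squeeze(0).view(b, n, -1)               # (b, n, d2)
        return self.highway(torch.cat([h_w, h_c], dim=-1))   # (b, n, d1 + d2)
```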

3.2 Contextual Encoder Layer

The purpose of the contextual encoder layer is to fully exploit the contextual relationship features of the sentences. We use a bidirectional LSTM for encoding, which can mine the contextual features of the sentences. We then obtain the representations \(H \in {R^{2d*n}}\) and \(P \in {R^{2d*m}}\), where d is the hidden layer dimension.
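A brief sketch of this encoder is shown below, assuming the embedding layer above; the hidden size d is a hyperparameter, and the output is transposed to match the (2d, n) column-vector convention used in the equations.

```python
# Sketch of the contextual encoder layer: a single bidirectional LSTM.
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, input_dim, d):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, d, batch_first=True, bidirectional=True)

    def forward(self, x):               # x: (batch, n, d1 + d2)
        out, _ = self.bilstm(x)         # (batch, n, 2d)
        return out.transpose(1, 2)      # (batch, 2d, n) -> columns are word vectors
```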

3.3 Interaction Layer

The purpose of the interaction layer is to extract the effective features between sentences. In this module, we can obtain low-level semantic information and high-level semantic information.

Low-Level Semantic Information. The purpose of this module is to initially fuse the two sentences and obtain the low-level semantic information. We first calculate the similarity matrix S of the context-encoded representations H and P, as shown in Eq. 4.

$$\begin{aligned} \begin{aligned} {S}_{ij}={{W}_s}^T[h;p;h\odot p] \end{aligned} \end{aligned}$$
(4)

where \({{S}_{ij}}\) denotes the similarity between the i-th word of H and the j-th word of P, \({{W}_{s}}\) is a trainable weight vector, h is the i-th column of H, and p is the j-th column of P. Then, we calculate the low-level semantic information V of A and B, as shown in Eq. 5.

$$\begin{aligned} \begin{aligned} {V}={P}\cdot softmax\!\,({S}^T) \end{aligned} \end{aligned}$$
(5)
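The following sketch illustrates Eqs. 4-5 with the unbatched (2d, n) and (2d, m) shapes used in the text; \(W_s\) is assumed to be a trainable parameter of size 6d, masking is omitted, and since Eq. 5 does not spell out the softmax axis, the sketch assumes normalization over the words of P for each word of H.

```python
# Sketch of the low-level semantic information (Eqs. 4-5).
import torch
import torch.nn.functional as F

def low_level_semantics(H, P, W_s):
    """H: (2d, n), P: (2d, m), W_s: (6d,)."""
    two_d, n = H.shape
    _, m = P.shape
    # Eq. 4: S[i, j] = W_s^T [h_i ; p_j ; h_i * p_j]
    He = H.unsqueeze(2).expand(two_d, n, m)
    Pe = P.unsqueeze(1).expand(two_d, n, m)
    S = torch.einsum('c,cnm->nm', W_s, torch.cat([He, Pe, He * Pe], dim=0))  # (n, m)
    # Eq. 5: V = P . softmax(S^T); each column of V is a P-weighted summary
    # for one word of H (normalization over P's words assumed).
    V = P @ F.softmax(S, dim=1).t()    # (2d, n)
    return V, S
```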

High-Level Semantic Information. The purpose of this module is to mine the deep semantics of the text and generate high-level semantic information. In this module, we first calculate the bidirectional attention between H and P, that is, the attention of \(H\rightarrow P\) and \(P\rightarrow H\). It is calculated as follows.

\(H\rightarrow P\): This attention describes which words in sentence P are most relevant to H. The calculation proceeds as follows: firstly, each row of the similarity matrix is normalized to obtain the attention weights, and then the new text representation \(Q\in {{R}^{2d*n}}\) is obtained by a weighted sum over the columns of P, as shown in Eq. 6.

$$\begin{aligned} \begin{aligned}&{{\alpha }_{t}}=softmax ({{S}_{t:}})\in {{R}^{m}} \\&{{q}_{:t}}=\sum \limits _{j}{{{\alpha }_{tj}}{{P}_{:j}}} \end{aligned} \end{aligned}$$
(6)

where \({q_{:t}}\) is the t-th column of Q.

\(P\rightarrow H\): This attention indicates which words in H are most similar to P. The calculation proceeds as follows: firstly, the maximum of each row of the similarity matrix \(\boldsymbol{S}\) is taken and normalized to obtain the attention weights, then the weighted sum of the columns of H is tiled across n time steps to obtain \(C\in {{R}^{2d*n}}\), as shown in Eq. 7.

$$\begin{aligned} \begin{aligned}&b=softmax (\underset{col}{\mathop {\max }}\, (S))\in {{R}^{n}} \\&c=\sum \limits _{t}{{{b}_{t}}{{H}_{:t}}}\in {{R}^{2d}} \\ \end{aligned} \end{aligned}$$
(7)

After obtaining the attention matrix Q of \(H\rightarrow P\) and the attention matrix C of \(P\rightarrow H\), we splice the attention in these two directions via the function \(\beta\). Finally, we obtain the spliced contextual representation G, as shown in Eq. 8.

$$\begin{aligned} \begin{aligned}&{{G}_{:t}}=\beta ({{C}_{:t}}, {{H}_{:t}}, {{Q}_{:t}}) \\&\beta (c, h, q)=[h;q;h\odot q;h\odot c]\in {{R}^{8d}} \end{aligned} \end{aligned}$$
(8)

Then, we calculate its self-attention [19], as shown in Eq. 9.

$$\begin{aligned} \begin{aligned}&E=G^TG\\&Z=G\cdot softmax\!\,(E) \end{aligned} \end{aligned}$$
(9)

Finally, we pass the above semantic information Z through a bi-directional LSTM to obtain high-level semantic information U.
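A sketch of this high-level semantic path (Eqs. 6-9 plus the final Bi-LSTM) is given below, unbatched for clarity; the softmax axis in Eq. 9 and the Bi-LSTM sizes are assumptions of the sketch, not details fixed by the text.

```python
# Sketch of the high-level semantic information (Eqs. 6-9).
import torch
import torch.nn as nn
import torch.nn.functional as F

def high_level_semantics(H, P, S, bilstm: nn.LSTM):
    """H: (2d, n), P: (2d, m), S: (n, m) from Eq. 4."""
    two_d, n = H.shape
    # Eq. 6 (H -> P): row-wise softmax over S, weighted sum of P's columns
    alpha = F.softmax(S, dim=1)                 # (n, m)
    Q = P @ alpha.t()                           # (2d, n)
    # Eq. 7 (P -> H): max over each row of S, softmax, weighted sum of H's columns
    b = F.softmax(S.max(dim=1).values, dim=0)   # (n,)
    c = H @ b                                   # (2d,)
    C = c.unsqueeze(1).expand(two_d, n)         # tiled n times -> (2d, n)
    # Eq. 8: G_{:t} = [h; q; h*q; h*c]
    G = torch.cat([H, Q, H * Q, H * C], dim=0)  # (8d, n)
    # Eq. 9: E = G^T G, Z = G . softmax(E) (column-wise normalization assumed)
    E = G.t() @ G                               # (n, n)
    Z = G @ F.softmax(E, dim=0)                 # (8d, n)
    # Bi-LSTM over Z gives the high-level semantic information U
    out, _ = bilstm(Z.t().contiguous().unsqueeze(0))  # (1, n, 2d) if hidden size is d
    return out.squeeze(0).t()                   # U: (2d, n)

# Example: bilstm = nn.LSTM(input_size=8 * d, hidden_size=d,
#                           batch_first=True, bidirectional=True)
```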

3.4 Fusion Layer

The purpose of the fusion layer is to fuse the low-level semantic information V and the high-level semantic information U. We propose a heuristic fusion function that learns the fusion weights with a neural network to achieve deep fusion. We fuse V and U to obtain the text representation \(L=fusion(U, V)\in {{R}^{n*2d}}\), where the fusion function is defined as shown in Eq. 10:

$$\begin{aligned} \begin{aligned}&\widetilde{x}=\tanh ({{W}_{1}}[x;y;x\odot y;x-y]) \\&g=sigmoid({{W}_{2}}[x;y;x\odot y;x-y]) \\&z=g\odot \widetilde{x}+(1-g)\odot x \\ \end{aligned} \end{aligned}$$
(10)

where \({{W}_{1}}\) and \({{W}_{2}}\) are weight matrices, and g is a gating mechanism to control the weight of the intermediate vectors in the output vector. In this paper, x refers to U and y refers to V.
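A minimal sketch of the heuristic fusion function in Eq. 10 follows; x and y are the high-level (U) and low-level (V) vectors at each position, the two linear layers are assumed to project from 8d back to 2d, and bias terms are omitted since Eq. 10 shows none.

```python
# Sketch of the heuristic fusion function (Eq. 10).
import torch
import torch.nn as nn

class HeuristicFusion(nn.Module):
    def __init__(self, dim):                      # dim = 2d
        super().__init__()
        self.w1 = nn.Linear(4 * dim, dim, bias=False)
        self.w2 = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x, y):                      # x, y: (..., 2d)
        feats = torch.cat([x, y, x * y, x - y], dim=-1)
        x_tilde = torch.tanh(self.w1(feats))      # candidate fused vector
        g = torch.sigmoid(self.w2(feats))         # gate controlling the mix
        return g * x_tilde + (1 - g) * x          # z in Eq. 10
```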

3.5 Output Layer

The purpose of the output layer is to produce the final matching result. In this paper, we use a linear layer to obtain the result of sentence matching, as shown in Eq. 11.

$$\begin{aligned} \begin{aligned} y = softmax(\tanh (ZW + b)) \end{aligned} \end{aligned}$$
(11)

where both W and b are trainable parameters, and Z is the vector obtained by splicing the first and last vectors of the fused representation.
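A small sketch of Eq. 11 is given below; it assumes Z is built by concatenating the first and last column vectors of the fused representation L (our reading of the text), with illustrative module names.

```python
# Sketch of the output layer (Eq. 11).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputLayer(nn.Module):
    def __init__(self, dim, num_classes):              # dim = 4d for [first; last]
        super().__init__()
        self.linear = nn.Linear(dim, num_classes)

    def forward(self, L):                               # L: (batch, 2d, n)
        Z = torch.cat([L[:, :, 0], L[:, :, -1]], dim=-1)           # (batch, 4d)
        return F.softmax(torch.tanh(self.linear(Z)), dim=-1)       # Eq. 11
```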

4 Experimental Results and Analysis

In this section, we validate our model on two datasets from two tasks. We first present some details of the model implementation, and secondly, we show the experimental results on the dataset. Finally, we analyze the experimental results.

4.1 Experimental Details

Loss Function. In this paper, we choose the cross-entropy loss function, as shown in Eq. 12.

$$\begin{aligned} \begin{aligned} loss=-\sum \limits _{i=1}^{N}{\sum \limits _{k=1}^{K}{{{y}^{(i, k)}}\log {{{\hat{y}}}^{(i, k)}}}} \end{aligned} \end{aligned}$$
(12)

where N is the number of samples, K is the total number of categories, \({{y}^{(i, k)}}\) is the true label of the i-th sample, and \({{\hat{y}}^{(i, k)}}\) is the predicted probability.
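For illustration, the summation form of Eq. 12 maps directly to the short function below, where y_hat holds the softmax outputs of Eq. 11 and y_true holds one-hot labels; in practice, PyTorch's nn.CrossEntropyLoss applied to pre-softmax scores is a numerically safer equivalent.

```python
# Direct transcription of the cross-entropy loss in Eq. 12.
import torch

def cross_entropy(y_hat: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    # y_hat, y_true: (N, K); sum over samples and classes as in Eq. 12
    return -(y_true * torch.log(y_hat + 1e-12)).sum()
```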

Dataset. In this paper, we use the natural language inference dataset SNLI and the paraphrase recognition dataset Quora to validate our model. The SNLI dataset contains 570K manually labeled and categorically balanced sentence pairs. The Quora question pair dataset contains over 400K pairs, each with a binary annotation: 1 for duplicate and 0 for non-duplicate. The statistical descriptions of the SNLI and Quora data are shown in Table 1.

Table 1. The statistical descriptions of SNLI and Quora
Table 2. Values of hyperparameters

Parameter Settings. The experiments are conducted on an RTX 5000 graphics card with 16 GB of video memory. The system is Ubuntu 20.04, the development language is Python 3.7, and the deep learning framework is PyTorch 1.8.

During model training, 300-dimensional GloVe word vectors are used for word embedding, and the maximum sentence length is set to 300 and 50 words on the SNLI and Quora datasets, respectively. The specific hyperparameter settings are shown in Table 2.

4.2 Experimental Results and Analysis

We compare the results of our deep interaction and fusion model on the SNLI dataset with other published models, using accuracy as the evaluation metric. The results are shown in Table 3. As can be seen from Table 3, our model achieves an accuracy of 0.871 on the SNLI dataset, the best among the listed models. Compared with the LSTM, it is improved by 0.065; compared with the Star-Transformer model, it is improved by 0.004; and it also outperforms the other listed models. We further conduct experiments on the Quora dataset, again using accuracy as the metric; the results are shown in Table 4. As can be seen from Table 4, the accuracy of our method on the test set is 0.868, an improvement of 0.054 over the traditional LSTM model and 0.004 over the enhanced sequential inference model ESIM, and it compares favorably with other popular deep learning methods. Our model achieves relatively good results on both tasks, which illustrates its effectiveness.

Table 3. The accuracy (\(\%\)) of the model on the SNLI test set. Results marked with \(^a\) are reported by Bowman et al. [4], \(^b\) are reported by Han et al. [9], \(^c\) are reported by Shen et al. [17], \(^d\) are reported by Borges et al. [1], \(^e\) are reported by Guo et al. [8], \(^f\) are reported by Mu et al. [14].
Table 4. The accuracy (\(\%\)) of the model on the Quora test set. Results marked with \(^g\) are reported by Yang et al. [22], \(^h\) are reported by He et al. [10], \(^i\) are reported by Zhao et al. [24], \(^j\) are reported by Chen et al. [5].

4.3 Ablation Experiments

To explore the role played by each module, we conduct ablation experiments on the SNLI dataset. When the fusion function is not used, the low-level semantic information is directly concatenated with the high-level semantic information. The experimental results are shown in Table 5.

Table 5. Ablation study on the SNLI validation dataset

We first verify the effectiveness of character embedding. Specifically, we remove the character embedding, and the accuracy drops by 1.5 percentage points, showing that character embedding plays an important role in improving the performance of the model.

In addition, we verify the effectiveness of the semantic information and fusion modules. When we remove the low-level and high-level semantic information from the original model, the accuracy drops by 1.2 and 7.6 percentage points, respectively. When we remove the fusion function, the accuracy drops by about 1.0 percentage point. This shows that the different kinds of semantic information and the fusion function all help improve the accuracy of the model, with the high-level semantic information being the most significant.

Finally, we verify the effect of each attention mechanism on the model. We remove the attention from P to H, the attention from H to P, and the self-attention module, respectively, and the accuracy decreases by 2.5, 0.9, and 1.3 percentage points. This shows that all of the attention mechanisms improve the performance of the model, with the P-to-H attention being the most significant.

The ablation experiments show that each component of our model plays an important role, especially the high-level semantic information module and the P-to-H attention module, which have the greatest impact on the performance of the model. Meanwhile, the character embedding and the fusion function also play important roles in our model.

5 Conclusion

In this paper, we investigate natural language sentence matching methods and propose an effective deep interaction and fusion model for sentence matching. Our model first uses the bi-directional attention from the machine reading comprehension model and self-attention to obtain high-level semantic information. Then, we use a heuristic fusion function to fuse the semantic information we obtain. Finally, we use a linear layer to produce the sentence matching result. We conducted experiments on the SNLI and Quora datasets, and the results show that the proposed model achieves good results on both tasks. In this work, we find that the proposed interaction and fusion modules play a dominant role and have a great impact on our model. However, our model is not as powerful as pre-trained models in terms of feature extraction and lacks external knowledge. Our future work will focus on the following two points: 1) using more powerful feature extractors, such as the pre-trained BERT model, as the text feature extractor; 2) introducing external knowledge. For example, the external knowledge base WordNet contains many sets of synonyms; for each input word, its synonyms could be retrieved from WordNet and embedded into the word vector representation to further improve the performance of the model.