1 Introduction

Spoken language understanding (SLU) is a critical component of human-machine spoken dialogue systems. It aims to identify the user's domain and intent in a natural language utterance and to extract the relevant parameters, or slots, needed to achieve the goal. SLU typically consists of two tasks: intent detection and slot filling (Tur and De Mori) [1]. Intent detection focuses on predicting the user's intent, while slot filling extracts semantic concepts. For instance, as shown in Table 1, given an air-ticket utterance with the query "first class fares from Boston to Denver", each token in the sentence is assigned a slot label, and the whole sentence is assigned a single intent.

Table 1 An example of an air ticket enquiry from the ATIS dataset, with slots in BIO format

Previously, traditional pipeline methods implemented slot filling and intent detection separately, without taking the dependency between the two tasks into account. Qin et al. [2] observed that slots and intents are closely linked. In fact, the two are correlated, and the two tasks can mutually reinforce each other. For example, if the intent of a sentence is an air ticket query, its corresponding slots are more likely to include departure and arrival cities than the name of a movie. Hence, it is necessary to combine slot filling and intent detection.

Since pipeline methods can lead to error propagation and the two tasks are strongly connected, joint modeling of intent detection and slot filling has become the dominant trend and has made great progress; many joint models based on the multi-task learning framework have been proposed (Zhang and Wang; Liu and Lane; Hakkani-Tur et al.) [3,4,5]. However, these earlier models only join the two tasks by sharing parameters, using a joint loss function to implicitly model the relationship between them. Recently, Goo et al. [6] introduced a slot-gated mechanism and Li et al. [7] introduced self-attention with a gated mechanism, both of which apply intent information to the slot filling task; however, slot information is not applied to the intent task, so a bi-directional interaction between slot filling and intent detection is not established. In fact, a joint model should treat the two as parallel tasks that take full advantage of the relationship between slots and intents.

To solve the above-mentioned problems, we propose a bidirectional interaction model based on a gate mechanism. First, we use a dilated convolutional neural network (DCNN) as the sentence encoder, which captures information at a larger distance by increasing the dilation width of the convolution kernel without increasing the number of parameters. A major advantage of self-attention is that it can build direct connections between any two tokens in a sentence. Given the strong relationship between the two tasks, we propose a bidirectional interaction model based on a gating mechanism to improve the existing joint models. By exploiting the connection between intent detection and slot filling, we establish a bidirectional interaction mechanism that makes up for the shortcomings of existing joint models.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 gives a detailed description of our model, Section 4 presents experimental results and analysis, and Section 5 summarizes this work and outlines future directions.

2 Related work

Spoken language understanding has a long research history, originating from early call classification systems (Gorin et al.) [8] and the ATIS project in the 1990s. SLU typically consists of two tasks: intent detection and slot filling. Intent detection can be regarded as a classification task that predicts the speaker's intent, for which general methods such as support vector machines (SVM) (Haffner et al.) [9] and recurrent neural networks (RNN) (Lai et al.) [10] can be used. Slot filling can be regarded as a sequence labeling task that extracts semantic components, for which popular methods such as conditional random fields (CRF) (Raymond and Riccardi) [11] and long short-term memory (LSTM) networks (Yao et al. [12]; Sun et al. [13]) can be used.

To overcome the error propagation problem that arises in traditional pipeline methods, several joint models have recently been proposed. Xu et al. [14] proposed a convolutional neural network (CNN) based model for joint slot filling and intent detection. Guo et al. [15] proposed using RNNs for joint training of slot filling and intent detection. Zhang and Wang [3] proposed using RNNs for joint slot filling and intent detection. Hakkani-Tur et al. [5] proposed a single recurrent neural network for joint slot filling and intent detection. Liu and Lane [4] proposed an attention-based encoder-decoder neural network model for joint slot filling and intent detection, with an encoder for the input and a decoder for the output. However, these models only join the two tasks by sharing parameters, using a joint loss function to implicitly model the relationship between them.

Recently, some joint models have started to explicitly model the relationship between slot filling and intent detection. Wang et al. [16] proposed Bi-model RNN structures, which take the cross-impact between slot filling and intent detection into account. Goo et al. [6] proposed a slot-gated mechanism, which applies intent information to slot filling. Zhang et al. [17] proposed a hierarchical capsule neural network to model the hierarchical relationship among words, slots, and intents in an utterance. E et al. [18] proposed an SF-ID network to build a direct connection between slot filling and intent detection, and introduced an interrelated mechanism to strengthen their bi-directional interaction. Qin et al. [2] proposed token-level intent detection with a Stack-Propagation framework, which uses the intent detection result directly as input for slot filling. Liu et al. [19] proposed a collaborative memory block to implicitly consider the mutual interaction between slot filling and intent detection.

Although much progress has been made by these approaches, joint intent detection and slot filling remains a challenging and open task. We therefore establish a bidirectional interaction mechanism that makes up for the shortcomings of existing joint models and improves the performance of the SLU system.

3 Proposed approaches

This section describes the details of our framework, which consists of an encoder, a self-attention layer, a bidirectional interaction gate layer that establishes the connection between slot filling and intent detection, and decoders for the two tasks. As shown in Fig. 1, we use a DCNN encoder instead of an RNN encoder to extract the features shared by intent detection and slot filling. We then feed the shared representations into two self-attention structures to obtain intent and slot representations. Finally, the intent representation and slot representation are fed into the interaction gate structure to obtain interactive information. The purpose is to integrate slot information into the intent detection task and intent information into the slot filling task.

Fig. 1 Illustration of the dual interaction model based on the gate mechanism for joint intent detection and slot filling

3.1 DCNN encoder

Compared with an ordinary CNN, a DCNN has great advantages in processing long-distance sequence information. As shown in Fig. 2, in a three-layer convolutional neural network, an ordinary CNN at the third layer can only capture 3 inputs on each side of the current position, whereas a 2-dilated CNN can capture 7 on each side, while the number of parameters and the computation cost remain unchanged. Therefore, with the same stacking depth, dilated convolution captures the global information of the input better, and we select the DCNN as the encoder of the input sentence in our model. Since the encoder stacks multiple dilated convolution layers, residual connections are added between the dilated convolution blocks to prevent model degradation and make training more stable. The specific structure is as follows:

Fig. 2 Comparison of CNN and DCNN

The DCNN encoder is composed of 5 dilated convolutional blocks. As shown in Fig. 3, relationship information between two distant words can be obtained through layer stacking. The input is the word embedding of the user's utterance X = {X1, X2, ..., Xn} (n is the number of tokens in the input utterance). The specific calculation of a DCNN block is as follows:

$$ \begin{array}{@{}rcl@{}} X_{L}&=&g_{L}\otimes DConv1D_{1}(X_{L-1}) + (1 - g_{L})\otimes X_{L-1} \end{array} $$
(1)
$$ \begin{array}{@{}rcl@{}} g_{L}&=&\sigma(DConv1D_{2}(X_{L-1})) \end{array} $$
(2)
Fig. 3 The structure of the DCNN

Where X_L is the output of the DCNN block, X_{L-1} is the output of the previous convolutional layer, ⊗ denotes element-wise multiplication, σ is the sigmoid activation function, and DConv1D_1 and DConv1D_2 are 1-dimensional dilated convolutions with a wider receptive field. g_L is the gate of the L-th layer, which controls what proportion of the layer's output comes from its transformation versus its input, thereby controlling the transfer of information from each layer to the next and allowing information to flow across multiple channels [20]. In addition, this gating alleviates the vanishing-gradient problem that can arise when the gradient is back-propagated along a single path, making gradient propagation smoother.

The encoder is composed of 5 DCNN blocks; the convolution window size of each block is 3 and the dilation rates are (1, 2, 4, 1, 1). With dilation rates of 1, 2, and 4, the first three layers capture window information, fragment information, and global information, respectively, while the last two layers use a dilation rate of 1. The 5-layer dilated CNN encoder can therefore extract the semantic features required for intent detection as well as the window and context features required for the sequence labeling task of slot filling. Finally, the convolution result is fed into the self-attention layer to better capture the semantic relationship between tokens.
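To make the encoder concrete, the following PyTorch sketch implements one gated dilated-convolution block corresponding to Eqs. (1) and (2) and stacks five of them with the dilation rates given above. It is a minimal illustration under our own naming assumptions; class and variable names are not from the original implementation.

```python
import torch
import torch.nn as nn

class GatedDilatedConvBlock(nn.Module):
    """One DCNN block: a gated, residual, dilated 1-D convolution (Eqs. 1-2)."""
    def __init__(self, hidden_dim: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # keeps the sequence length unchanged
        # Two parallel dilated convolutions: one produces the candidate output,
        # the other produces the gate g_L that mixes it with the block input.
        self.conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size,
                              padding=padding, dilation=dilation)
        self.gate_conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size,
                                   padding=padding, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, hidden_dim, seq_len)
        g = torch.sigmoid(self.gate_conv(x))        # Eq. (2)
        return g * self.conv(x) + (1.0 - g) * x     # Eq. (1): gated residual mixture

# Five stacked blocks with dilation rates (1, 2, 4, 1, 1), as described above.
encoder = nn.Sequential(*[GatedDilatedConvBlock(128, 3, d) for d in (1, 2, 4, 1, 1)])
x5 = encoder(torch.randn(2, 128, 20))               # shared representation X_5
```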

3.2 Self-attention

Self-attention is a special case of the attention mechanism and has been successfully applied to a variety of tasks. It refines the representation by matching a sequence against itself, thereby capturing the internal structure of the sentence; in other words, given a sentence, each word computes attention over all the words in the sentence. In our model, we combine the DCNN with a self-attention mechanism to exploit both temporal features and contextual information, which are useful for intent detection and sequence labeling tasks (Zhong et al.; Yin et al.) [21, 22].

The self-attention module is a very effective method for natural language processing tasks and can be understood as a mapping. It learns the dependency between any two tokens in a sentence and captures the internal structural information of the sentence. When establishing connections between each token and the other tokens, it assigns greater weight to the important tokens in the sentence. We use scaled dot-product attention, following Vaswani et al. [23], and map the input vectors \(X \in {\mathbb {R}}^{n \times d}\) (d is the mapped dimension) to queries (Q), keys (K), and values (V). The calculation is as follows:

$$ \begin{array}{@{}rcl@{}} Q&=&W_{Q}X_{5}+b_{Q} \end{array} $$
(3)
$$ \begin{array}{@{}rcl@{}} K&=&W_{K}X_{5}+b_{K} \end{array} $$
(4)
$$ \begin{array}{@{}rcl@{}} V&=&W_{V}X_{5}+b_{V} \end{array} $$
(5)
$$ \begin{array}{@{}rcl@{}} C&=&softmax\left( \frac{QK^{T}}{\sqrt{d_{k}}}\right)\otimes V \end{array} $$
(6)

Where W_Q, W_K, W_V, b_Q, b_K, b_V are the weight matrices and biases of Q, K, and V, respectively. Since d_k is usually set to a large value (e.g., 100), the scaling factor reduces the effect of extremely small gradients. A masking operation (which masks the diagonal of the affinity matrix) is applied before the softmax to avoid high matching scores between identical query and key vectors.

The model introduces two self-attention layers, the intent self-attention layer and the slot self-attention layer, which are independent of each other and do not share weights. Both take the output X5 of the dilated convolutional layer as input and produce C^I and C^slot, respectively.
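For reference, the following PyTorch sketch shows one possible implementation of the scaled dot-product self-attention layer of Eqs. (3)-(6), including the diagonal mask described above. Dimensions, class names, and the use of two independent instances are illustrative assumptions, not the authors' code.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention over a single sequence (Eqs. 3-6)."""
    def __init__(self, input_dim: int, attn_dim: int):
        super().__init__()
        # Linear layers play the roles of (W_Q, b_Q), (W_K, b_K), (W_V, b_V).
        self.q_proj = nn.Linear(input_dim, attn_dim)
        self.k_proj = nn.Linear(input_dim, attn_dim)
        self.v_proj = nn.Linear(input_dim, attn_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim), e.g. X_5 transposed to (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # QK^T / sqrt(d_k)
        # Mask the diagonal so that a token does not simply attend to itself.
        diag = torch.eye(x.size(1), dtype=torch.bool, device=x.device)
        scores = scores.masked_fill(diag, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v                  # context matrix C

# Two independent layers (no shared weights) for the intent and slot branches.
intent_attn, slot_attn = SelfAttention(128, 128), SelfAttention(128, 128)
c_intent = intent_attn(torch.randn(2, 20, 128))    # C^I
c_slot = slot_attn(torch.randn(2, 20, 128))        # C^slot
```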

3.3 Bidirectional interaction gating mechanism

To increase the interaction and connection between intent and slot, we design an interaction gate structure to obtain the final representations of the slot and the intent. Our model can therefore add intent information to the slot filling task and slot information to the intent detection task, so that each task promotes the performance of the other.

In the intent gate module, the slot context vector C^slot and the intent context vector C^I are combined to compute an intent gate g^I, as follows:

$$ \begin{array}{@{}rcl@{}} g^{I}&=&tanh(W^{I}C^{I}+C^{slot}) \end{array} $$
(7)
$$ \begin{array}{@{}rcl@{}} r^{I}&=&maxpooling(g^{I}C^{I}) \end{array} $$
(8)
$$ \begin{array}{@{}rcl@{}} \widehat{y}^{I}&=&softmax({W_{y}^{I}}r^{I}+{b_{y}^{I}}) \end{array} $$
(9)

where tanh is the activation function and \( {W_{y}^{I}}\), \({b_{y}^{I}}\), and \(W^{I}\) are trainable parameters. Each element of the gate g^I lies in (−1, 1) and can be seen as a feature weight on the intent context vector C^I that controls the information passing rate; from it we obtain the final intent representation r^I and the intent prediction \(\widehat{y}^{I}\).

Unlike other methods that use only a slot gate, we believe that slot and intent information can each improve the other task, given the strong correlation between the two tasks. Therefore, similar to the intent gate, we also design a slot gate to control the passing rate of slot information.

In the slot gate module, as described above, intent information is useful for the slot filling task. The slot gate g^S is computed in the same way as g^I, except that C^ave, the average of C^I, is used in place of C^slot. The slot gate and the slot prediction are modeled as follows:

$$ \begin{array}{@{}rcl@{}} C^{ave}&=&meanpooling(C^{I}) \end{array} $$
(10)
$$ \begin{array}{@{}rcl@{}} g^{S}&=&tanh(W^{S}C^{slot}+C^{ave}) \end{array} $$
(11)
$$ \begin{array}{@{}rcl@{}} r^{S}&=&C^{slot}g^{S} \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} O^{S}&=&{W_{o}^{S}}r^{S}+{b_{o}^{S}} \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} P(\widehat{y}|O^{S})&=&\frac{\exp\left({\sum}_{i=1}^{n} f(y_{i-1},y_{i},O^{S})\right)}{{\sum}_{y^{\prime}}\exp\left({\sum}_{i=1}^{n} f(y_{i-1}^{\prime},y_{i}^{\prime},O^{S})\right)} \end{array} $$
(14)

Where W^S is the weight matrix, r^S is the final slot representation, f(y_{i−1}, y_i, O^S) computes the transition score from y_{i−1} to y_i, and \(\widehat{y}\) denotes the predicted label sequence.
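A hedged PyTorch sketch of the bidirectional interaction gates of Eqs. (7)-(13) is given below; the CRF of Eq. (14) would be applied on top of the slot logits and is omitted here. All module and variable names are our own assumptions for illustration.

```python
import torch
import torch.nn as nn

class InteractionGates(nn.Module):
    """Bidirectional intent-slot interaction gates (Eqs. 7-13)."""
    def __init__(self, dim: int, num_intents: int, num_slots: int):
        super().__init__()
        self.w_i = nn.Linear(dim, dim, bias=False)     # W^I in Eq. (7)
        self.w_s = nn.Linear(dim, dim, bias=False)     # W^S in Eq. (11)
        self.intent_out = nn.Linear(dim, num_intents)  # W_y^I, b_y^I in Eq. (9)
        self.slot_out = nn.Linear(dim, num_slots)      # W_o^S, b_o^S in Eq. (13)

    def forward(self, c_intent: torch.Tensor, c_slot: torch.Tensor):
        # c_intent, c_slot: (batch, seq_len, dim) from the two self-attention layers
        g_i = torch.tanh(self.w_i(c_intent) + c_slot)        # Eq. (7): intent gate
        r_i = (g_i * c_intent).max(dim=1).values             # Eq. (8): max-pool over time
        intent_logits = self.intent_out(r_i)                 # Eq. (9) before the softmax

        c_ave = c_intent.mean(dim=1, keepdim=True)           # Eq. (10): mean-pool C^I
        g_s = torch.tanh(self.w_s(c_slot) + c_ave)           # Eq. (11): slot gate
        r_s = c_slot * g_s                                   # Eq. (12)
        slot_logits = self.slot_out(r_s)                     # Eq. (13): O^S, fed to the CRF
        return intent_logits, slot_logits

gates = InteractionGates(dim=128, num_intents=21, num_slots=120)
intent_logits, slot_logits = gates(torch.randn(2, 20, 128), torch.randn(2, 20, 128))
```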

3.4 Loss function

The loss function for intent detection is the cross-entropy loss, calculated as follows:

$$ L_{1}=-\sum\limits_{i=1}^{c} y_{i}^{I\_label}\log p({y_{i}^{I}}) $$
(15)

The slot filling objective is formulated as:

$$ L_{2}=-\log P(\widehat{y}|O^{S}) $$
(16)

Where \(y_{i}^{I\_label}\) is the intent label and \(y_{i}^{S\_label}\) is the slot label. To learn slot filling and intent detection jointly, the final objective is formulated as:

$$ L=\alpha L_{1}+(1-\alpha)L_{2} $$
(17)

Where α is a hyper-parameter that balances the two losses.
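For clarity, the joint objective of Eqs. (15)-(17) can be sketched as follows; here the slot loss is approximated by token-level cross-entropy instead of the CRF likelihood of Eq. (16), an assumption made only for brevity.

```python
import torch
import torch.nn.functional as F

def joint_loss(intent_logits, intent_labels, slot_logits, slot_labels, alpha=0.5):
    # intent_logits: (batch, num_intents); slot_logits: (batch, seq_len, num_slots)
    l1 = F.cross_entropy(intent_logits, intent_labels)                       # Eq. (15)
    l2 = F.cross_entropy(slot_logits.flatten(0, 1), slot_labels.flatten())   # stand-in for Eq. (16)
    return alpha * l1 + (1.0 - alpha) * l2                                   # Eq. (17)

loss = joint_loss(torch.randn(2, 21), torch.randint(0, 21, (2,)),
                  torch.randn(2, 20, 120), torch.randint(0, 120, (2, 20)))
```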

4 Experiments

4.1 Data set

In this section, in order to evaluate the proposed model, we use two public datasets. The first is the Airline Travel Information System (ATIS) dataset (Hemphill et al.), which contains audio recordings of people making flight reservations and is widely used in SLU research. It provides 4478 utterances for training, 893 for testing, and another 500 as the development set, with 120 slot labels and 21 intent types in total. The second is the custom-intent-engines dataset (SNIPS) (Coucke et al.), collected from the SNIPS personal voice assistant; it provides 13084 utterances for training, 700 for testing, and another 700 as the development set, with 72 slot labels and 7 intent types in total. Unlike ATIS, SNIPS is more complex because of its intent diversity and large vocabulary.

4.2 Training setup and evaluation criteria

In our model, the dataset format and partition follow previous related work (Qin et al. [2]). The dimensionality of the word embeddings is 300, and the hidden units of the DCNN encoder are set to 128. The DCNN has 5 layers with dilation rates (1, 2, 4, 1, 1). The L2 regularization coefficient is 1 × 10−5. We use Adam (Kingma and Ba, 2014) to optimize the parameters of our model. For all experiments, we select the model that performs best on the development set and then evaluate it on the test set.
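As an illustration, the hyper-parameters stated above can be collected into a configuration like the following; the placeholder model and any optimizer settings beyond those stated are assumptions.

```python
import torch

# Hyper-parameters taken from the description above; anything else is a default.
config = {
    "embedding_dim": 300,
    "hidden_units": 128,
    "dcnn_layers": 5,
    "dilation_rates": (1, 2, 4, 1, 1),
    "l2_weight": 1e-5,
}

model = torch.nn.Linear(config["hidden_units"], 21)   # placeholder for the full joint model
optimizer = torch.optim.Adam(model.parameters(), weight_decay=config["l2_weight"])
```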

Intent detection and slot filling can be defined as a classification task and a sequence labeling task, respectively, so the common metrics for these task types apply. We evaluate SLU performance using the F1 score for slot filling and accuracy for intent prediction, calculated as follows:

$$ \begin{array}{@{}rcl@{}} &&Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} &&P=\frac{TP}{TP+FP} \end{array} $$
(19)
$$ \begin{array}{@{}rcl@{}} &&R=\frac{TP}{TP+FN} \end{array} $$
(20)
$$ \begin{array}{@{}rcl@{}} &&F1=\frac{2\times P\times R}{P+R} \end{array} $$
(21)

Where P and R are the precision and recall of the slot filling task.
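As a small worked example of Eqs. (18)-(21), the following function computes the metrics from raw counts; the counts in the usage line are made up purely for illustration.

```python
def slu_metrics(tp: int, tn: int, fp: int, fn: int):
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Eq. (18), intent detection
    precision = tp / (tp + fp)                            # Eq. (19)
    recall = tp / (tp + fn)                               # Eq. (20)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (21), slot filling
    return accuracy, precision, recall, f1

# Example with made-up counts: accuracy ~ 0.90, F1 ~ 0.92.
print(slu_metrics(tp=90, tn=50, fp=10, fn=5))
```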

4.3 Overall results

Following Qin et al. [2], we evaluate SLU performance using the F1 score for slot filling and accuracy for intent prediction. The experimental results of the proposed model on the SNIPS and ATIS datasets are shown in Table 2.

Table 2 Joint model results on two datasets

From the table, we can see that our model achieves a significant improvement over the existing models. On the ATIS dataset, compared with the recent joint model with the Stack-Propagation framework, the improvement is around 0.4% in slot F1 and 0.6% in intent accuracy. On the SNIPS dataset, the improvement is around 0.5% in slot F1 and 0.3% in intent accuracy. This indicates the effectiveness of the dual interaction model based on the gate mechanism, which can significantly benefit SLU performance.

4.4 Analysis of gated mechanism

Table 2 shows the superior performance of our model. To better guide follow-up work, we need to understand the role of each module, so we performed ablation experiments to evaluate the impact of each module on the final result. First, we tested the encoding ability of the DCNN encoder. Then, we investigated the effect of the self-attention module. Finally, we focused on the effectiveness of the interaction gate structure.

4.4.1 Effect of DCNN encoder

To verify the effect of the number of DCNN layers, the dilation rate, and the selection gate on the results, we set up the following ablation experiments.

We set up a comparison test with different numbers of DCNN layers and different dilation rates, with and without selection gates controlling the flow of information between the CNN layers.

As can be seen from Fig. 4, the model performs best with 5 DCNN layers, dilation rates of (1, 2, 4, 1, 1), and the selection gate enabled. However, with the same number of DCNN layers and the same dilation rates, the model without the selection gate performs very poorly. This reflects a common problem in deep CNN networks: model degradation. As the CNN network becomes deeper, the model becomes difficult to train and performance deteriorates. After the selection gate is added, the final output of each DCNN block is a weighted sum of its transformation and its input (as in (1) and (2)), which alleviates model degradation and controls the information flow during training.

Fig. 4 The analysis of the independent gate and interaction results

When the dilation rate of each DCNN block is (1, 1, 1, 1, 1) (equivalent to an ordinary 1-dimensional CNN), the performance of the model on the two datasets decreases significantly. Slot filling F1 and intent detection accuracy on the SNIPS dataset decrease by 0.6% and 1.5%, respectively; similarly, slot filling F1 and intent detection accuracy on the ATIS dataset decrease by 0.7% and 0.8%, respectively. This is because the receptive field of dilated convolution is larger, making it possible to obtain the global information of a short sentence after stacking multiple DCNN blocks. The experimental results also verify the effectiveness of multi-layer dilated convolution.

Finally, the experimental results show the importance of setting the dilation rate of the last two DCNN blocks to 1 for the slot filling task, whereas intent detection performance with 3, 4, or 5 layers is hardly affected. The most likely reason is that intent detection requires more global information, while slot filling requires more fine-grained token information.

4.4.2 Effect of self-attention and interactive gate structure

To test the effect of self-attention and the specially designed interaction gate mechanism, we designed the following ablation experiments.

First, we removed the slot self-attention and the intent self-attention separately to observe the effect of self-attention on intent detection and slot filling performance. It can be seen from Table 3 that after removing the intent self-attention and directly using the output of the DCNN encoder as the intent representation, intent accuracy on the SNIPS and ATIS datasets decreases by 1% and 1.1%, respectively, while slot F1 remains almost constant. However, when we remove the slot self-attention in the same way, intent accuracy does not decrease, and the slot F1 value drops only slightly. We attribute this to the fact that the self-attention mechanism does not encode relative position information into the final slot representation [24], and relative position information is important for slot filling, so self-attention has less impact on slot filling than on intent detection.

Table 3 The analysis of separate training and interaction results

Then, we explored the effect of the interaction gate mechanism. As shown in Table 3, when only the intent gate is removed, intent accuracy decreases by about 1% and slot F1 by about 0.5%; conversely, when only the slot gate is removed, both the slot F1 value and the intent accuracy also drop significantly. The experimental results show that exchanging intent and slot information through the interaction gate structure benefits both intent detection and slot filling. When we remove the intent gate and the slot gate at the same time, the model's performance drops significantly. This further proves the effectiveness of our intent-slot interaction gate structure for the two tasks.

5 Conclusions

In this paper, we designed a dual interaction model based on the gate mechanism for spoken language understanding and further proposed a DCNN for joint learning of intent detection and slot filling. In our model, DCNN blocks are combined with self-attention to better capture the semantics of the utterance, and a gate mechanism is introduced to control the information passing rate and make full use of the semantic relevance between slot filling and intent detection. Experimental results on the ATIS dataset show the efficiency of our model, which outperforms the state-of-the-art approaches on both tasks. Besides, our model also shows consistent performance gains over independently trained models. In future work, we plan to improve our model by introducing external knowledge.