1 Introduction

Large language models are increasingly being applied in various fields, including urban informatics, as demonstrated by CityGPT. The task-oriented (also referred to as goal-oriented) dialogue system, as part of the urban large language model, has become a topic of interest in both the research community and industry (Zhang et al., 2020a, 2020b). Unlike chatbots (Wang et al., 2019, 2020), a task-oriented dialogue system is a closed-domain dialogue system (Gao et al., 2021; Mi et al., 2021) that can perform specific tasks for users, such as querying information, ordering products online, and playing music. Representative products include Siri and Cortana. A task-oriented dialogue system includes modules such as natural language understanding (Liu et al., 2021), dialogue management (Takanobu et al., 2019, 2020; Zhang et al., 2019a, 2019b), and natural language generation (Mi et al., 2019, 2020). The natural language understanding module is crucial because it determines whether a task-oriented dialogue system provides correct services for users.

The natural language understanding module of a task-oriented dialogue system performs two tasks (Wang et al., 2023): intent detection and slot filling. The intent detection task can be regarded as a text classification task; the classification model is trained to predict the intention of the user from the user's input. Table 1 shows an example of a user asking about weather conditions (Intent: GetWeather) in the SNIPS (Coucke et al., 2018) corpus. The slot filling task can be regarded as a sequence analysis task; the sequence analysis model is trained to predict the details of the user's intention. Table 1 shows an example of a user asking about the weather conditions at their location (Slot: B-current_location) in the SNIPS (Coucke et al., 2018) corpus. The natural language understanding module performs these two tasks to obtain the specific needs of users for a task-oriented dialogue system. Wang et al. (2018) found that for models based on deep learning, if these two tasks cooperate with each other, the accuracy with which the task-oriented dialogue system captures user requirements can be improved.

Table 1 An example of the SNIPS corpus
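To make the two tasks concrete, a hypothetical SNIPS-style example (the utterance is illustrative, not the exact row of Table 1) pairs an utterance-level intent label with token-level BIO slot tags:

```python
# Hypothetical SNIPS-style example: intent detection assigns one label to the
# whole utterance; slot filling assigns one BIO tag to every token.
utterance = ["what", "is", "the", "weather", "here"]
intent = "GetWeather"                                   # utterance-level label
slots = ["O", "O", "O", "O", "B-current_location"]      # token-level BIO tags
```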

Recently, Transformer research has flourished in the field of natural language processing, and some Transformer-based models for intent detection and slot filling have been proposed (Qin et al., 2021; Wang et al., 2021). Although the vanilla Transformer can handle text classification tasks or sequence analysis tasks, it has difficulty accomplishing these two tasks at the same time. There are therefore two solutions to this problem: one is to modify the vanilla Transformer so that it better handles the text classification task and the sequence analysis task simultaneously (Wang et al., 2021), and the other is to combine the vanilla Transformer with other methods to build a model that better handles both tasks simultaneously (Qin et al., 2021). We choose the second solution to build our model. Inspired by the TRANS-BLSTM method (Huang et al., 2020), we integrate a vanilla Transformer with a bidirectional LSTM (BiLSTM) as the encoder, use a linear classification decoder for intent detection and a conditional random field (CRF) decoder for slot filling, and propose the TLC method. TLC stands for Transformer, LSTM and CRF, which are all indispensable in our method; we explain this further in the model ablation analysis. In addition, we add a residual learning module to our model. The experimental results show that the residual learning module is effective in improving the slot filling performance of our model. However, unlike ResNet (He et al., 2016) in the computer vision research field, TLC cannot improve the performance of intent detection and slot filling through large-scale stacking of neural network layers. Qin et al. (2021) found the same problem in their research. We discuss this problem in the parameter tuning analysis section.

Our contributions in this paper are (1) the proposal of a new transformer-based model for intent detection and slot filling and (2) empirical verification of the effectiveness of our proposed model on two public datasets.

2 Related work

In the past, when statistical learning methods dominated natural language processing research, intent detection and slot filling were regarded as two independent tasks. The support vector machine (SVM) and AdaBoost algorithms had good results for the intent detection task, while the conditional random field (CRF) dominated the slot filling task (Mesnil et al., 2013). With the advent of the deep learning era, methods of intent detection and slot filling based on deep learning have become mainstream, such as the Joint Seq model (Hakkani-Tür et al., 2016) based on BiLSTM. Liu and Lane (2016) added an attention mechanism to BiLSTM and proposed the attention BiRNN model. Zhu and Yu (2017) presented the focus mechanism and applied it to the encoder-decoder structure. Goo et al. (2018) added a slot gating mechanism to the attention BiRNN and proposed the slot-gated attention model. Li et al. (2018) added a gating mechanism to their model and proposed the self-attentive model. Wang et al. (2018) recognized the interaction between the intent detection task and slot filling task and proposed the Bi-Model. Zhang et al. (2019a, 2019b) applied the capsule network to intent detection and slot filling and proposed the CAPSULE-NLU model. The SF-ID Network (E et al., 2019) and the CM-Net (Liu et al., 2019) have contributed to improving the interaction and mutual promotion between the intent detection task and the slot filling task. Qin et al. (2019) proposed the stack propagation model, which can effectively improve the performance of intent detection and further alleviate error propagation by adding word-level intent detection, thereby better combining intent information for slot filling.

With the development of pretraining technology, pretraining language models have begun to be used for intent detection and slot filling. Siddhant et al. (2019) used the ELMo (Peters et al., 2018) model as a representation learning method, combined BiLSTM with CRF, and improved the baseline method of intent detection and slot filling. Chen et al. (2019) applied the BERT (Devlin et al., 2019) model to intent detection and slot filling and proposed the JointBERT model. Furthermore, with the rise of graph neural network research, methods of intent detection and slot filling based on graph neural networks have been proposed, such as the graph LSTM model (Zhang et al., 2020a, 2020b). Recently, Transformer-based methods for intent detection and slot filling have sparked interest in the research community. The Co-Interactive Transformer (Qin et al., 2021) and the SyntacticTF (Wang et al., 2021) are the latest methods developed for intent detection and slot filling based on the Transformer. In addition, Gunaratna et al. (2022) proposed a joint NLU model based on BERT that can improve the slot explanation ability while improving the effect of intent detection and slot filling.

3 Proposed model

We follow current mainstream approaches and regard intent detection and slot filling as interrelated tasks.

3.1 Problem Formalization

The tasks of intent detection and slot filling can be formalized as Eqs. (1), (2) and (3):

$${y}^{intent}=\sigma \left({W}^{intent}{h}_{1}+{b}^{intent}\right)$$
(1)
$${y}_{n}^{slot}=\sigma \left({W}^{slot}{h}_{n}+{b}^{slot}\right)$$
(2)
$$P\left(y^{intent},y_n^{slot}\vert x\right)=P\left(y^{intent}\vert x\right){\textstyle\prod_{n=1}^N}P\left(y_n^{slot}\vert x\right)$$
(3)

where \({y}^{intent}\) represents the user's intention; \({y}_{n}^{slot}\) represents the slot value of the \(n\)-th token of the user's input, with \(n\in \left[1,N\right]\); \({h}_{1}\) and \({h}_{n}\) represent the hidden vectors of the user input in the neural network; \({W}^{intent}\), \({b}^{intent}\), \({W}^{slot}\) and \({b}^{slot}\) are the neural network parameters; and \(\sigma\) represents the activation function. The goal of this task is to train the neural network model to predict the correct user intention \({y}^{intent}\) and slot values \({y}_{n}^{slot}\) from the user input \(x\).

3.2 Model Overview

Our proposed model follows the encoder-decoder structure. The encoder of the TLC model includes two parts: the first part is a vanilla Transformer encoder, and the second part is a bidirectional LSTM (BiLSTM) encoder. We add a residual connection between the Transformer encoder and the BiLSTM encoder. This residual connection plays a key role in promoting the slot filling effect of our model. The decoder of the TLC model includes two parts: a linear classification decoder for intent detection and a CRF decoder for slot filling. The architecture of the proposed TLC model is shown in Fig. 1.

Fig. 1 Architecture of the TLC model

3.3 Encoder

As shown in Fig. 1, the first step of TLC model training requires representation learning, which includes word embedding and positional embedding. The word embedding of the TLC model uses the GloVe (Pennington et al., 2014) method. The positional embedding of the TLC model is a method proposed by Vaswani et al. (2017). We will analyse two kinds of positional embedding methods (Vaswani et al., 2017) in the parameter tuning analysis section. After word embedding and positional embedding, our proposed model performs Add and Dropout (Hinton et al., 2012) operations on the outputs of word embedding and positional embedding, which can be defined as Eq. (4):

$$S=Dropout\left(P+E\right)$$
(4)

where \(P\) represents the output of positional embedding, and \(E\) represents the output of word embedding. Then, \(S\) is input into the Transformer encoder of the TLC model.
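A minimal PyTorch sketch of this representation-learning step, assuming a learned positional embedding and pretrained GloVe vectors already loaded into a tensor `glove_weights`; all module and variable names are illustrative rather than the authors' code:

```python
import torch
import torch.nn as nn

class TLCEmbedding(nn.Module):
    """Word embedding + positional embedding + Add and Dropout, cf. Eq. (4)."""
    def __init__(self, glove_weights, max_len=128, dropout=0.1):
        super().__init__()
        # E: word embedding initialized from pretrained GloVe vectors
        # (the Kazuma character-level embedding is omitted here for brevity)
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        # P: learned positional embedding (the variant chosen in the parameter tuning analysis)
        self.pos_emb = nn.Embedding(max_len, glove_weights.size(1))
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids):                           # (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        E = self.word_emb(token_ids)                        # (batch, seq_len, d_emb)
        P = self.pos_emb(positions)                         # (seq_len, d_emb), broadcast over batch
        return self.dropout(E + P)                          # S = Dropout(P + E)
```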

The Transformer encoder is the first encoder of the TLC model. Although a complete Transformer model includes an encoder and a decoder, we use only a Transformer encoder in our proposed model. The first step is to map \(S\) to Query, Key and Value and then process it through the multihead attention mechanism. This step can be defined as Eqs. (5), (6) and (7):

$$Attention\left(Q,K,V\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$
(5)
$${head}_{i}=Attention(Q{W}_{i}^{Q},K{W}_{i}^{K},V{W}_{i}^{V})$$
(6)
$${ H}^{O}=Concat\left({head}_{1},\dots ,{head}_{h}\right){W}^{O}$$
(7)

where \(Q\) represents Query; \(K\) represents Key; \(V\) represents Value; \(1/\sqrt{{d}_{k}}\) is the scaling factor; and \({W}_{i}^{Q}\in {\mathbb{R}}^{{d}_{model}\times {d}_{k}}\), \({W}_{i}^{K}\in {\mathbb{R}}^{{d}_{model}\times {d}_{k}}\), \({W}_{i}^{V}\in {\mathbb{R}}^{{d}_{model}\times {d}_{v}}\) and \({W}^{O}\in {\mathbb{R}}^{h{d}_{v}\times {d}_{model}}\) are the parameter matrices of the linear mappings. Then, \({H}^{O}\) is subjected to the Add and LayerNorm (LN) (Ba et al., 2016) operations and input into the feed-forward network (FFN), which can be defined as Eqs. (8) and (9):

$${H}^{L1}=LN\left(S+{H}^{O}\right)$$
(8)
$$FFN\left({H}^{L1}\right)=max\left(0,{H}^{L1}{W}_{1}+{b}_{1}\right){W}_{2}+{b}_{2}$$
(9)

where \({W}_{1}\), \({b}_{1}\), \({W}_{2}\) and \({b}_{2}\) are parameters of the FFN. Before completing the Transformer encoding, another Add and LayerNorm (LN) (Ba et al., 2016) operation is performed, which can be defined as Eq. (10):

$${H}^{L2}=LN\left({H}^{L1}+FFN\left({H}^{L1}\right)\right)$$
(10)

Furthermore, before the output of the Transformer encoder is input into the BiLSTM encoder, an Add and Dropout (Hinton et al., 2012) operation is performed between \(S\) and \({H}^{L2}\). Although this operation is simple, it is necessary for improving the slot filling performance of the TLC model. This process can be defined as Eq. (11):

$$X=Dropout\left(S+{H}^{L2}\right)$$
(11)
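The following sketch reconstructs this first encoding stage with PyTorch's built-in `nn.TransformerEncoder` standing in for Eqs. (5)-(10) and the extra residual Add and Dropout of Eq. (11) applied on top; the hyperparameters mirror those reported in the experimental settings, but the wiring is an illustrative reconstruction, not the authors' exact implementation:

```python
import torch.nn as nn

class TransformerStage(nn.Module):
    """Vanilla Transformer encoder followed by the extra residual connection of Eq. (11)."""
    def __init__(self, d_model=400, n_heads=10, d_ff=2048, n_layers=2, dropout=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.dropout = nn.Dropout(dropout)

    def forward(self, S):                        # S: (batch, seq_len, d_model), Eq. (4)
        H_L2 = self.encoder(S)                   # Eqs. (5)-(10): multi-head attention + FFN
        return self.dropout(S + H_L2)            # Eq. (11): X = Dropout(S + H^L2)
```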

Then, \(X\) is expressed as \(X=\left({x}_{1},{x}_{2},\dots ,{x}_{t}\right)\) and input into the BiLSTM for the second encoding, which can be defined as Eqs. (12), (13) and (14):

$$\overleftarrow{{h}_{t}}={LSTM}_{bw}\left(\overleftarrow{{h}_{t+1}}, { x}_{t}\right)$$
(12)
$$\overrightarrow{{h}_{t}}={LSTM}_{fw}\left(\overrightarrow{{h}_{t-1}}, { x}_{t}\right)$$
(13)
$$H=\left[\overleftarrow{{h}_{t}}, \overrightarrow{{h}_{t}} \right]$$
(14)

where \({h}_{t}\) represents the hidden state of the LSTM at time step \(t\), \(\overleftarrow{{h}_{t}}\) represents the hidden state of the LSTM calculated from back to front at time step \(t\), \({LSTM}_{bw}\) represents the backward LSTM function, \(\overrightarrow{{h}_{t}}\) represents the hidden state of the LSTM calculated from front to back at time step \(t\), \({LSTM}_{fw}\) represents the forward LSTM function, and \({x}_{t}\) represents the \(t\)-th token in \(X\). Finally, after the BiLSTM encoding, the result \(H\) is input into the decoder of the TLC model.
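A sketch of the second encoding stage; `nn.LSTM` with `bidirectional=True` computes the forward and backward hidden states of Eqs. (12) and (13) and concatenates them as in Eq. (14) (names and defaults are illustrative):

```python
import torch.nn as nn

class BiLSTMStage(nn.Module):
    """Bidirectional LSTM encoder, cf. Eqs. (12)-(14)."""
    def __init__(self, d_model=400, hidden_size=200, n_layers=2, dropout=0.1):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=d_model, hidden_size=hidden_size,
                              num_layers=n_layers, batch_first=True,
                              bidirectional=True, dropout=dropout)

    def forward(self, X):                 # X: (batch, seq_len, d_model), output of Eq. (11)
        H, _ = self.bilstm(X)             # H: (batch, seq_len, 2*hidden_size)
        return H                          # forward and backward states concatenated, Eq. (14)
```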

3.4 Decoder

The decoder of the TLC model includes two parts: a linear classifier for intent detection and a CRF for slot filling. Since intent detection and slot filling are different types of tasks, the output result \(H\) from the BiLSTM encoder needs to be extracted according to the nature of the different tasks. We extract the output of the hidden unit at the last state from \(H\), record it as \({H}^{intent}\) for intent detection, and input it into the neural network for linear classification, which can be defined as Eq. (15):

$${y}^{intent}=\sigma \left({W}^{intent}{H}^{intent}+{b}^{intent}\right)$$
(15)

where \(\sigma\) is the LogSoftmax activation function, and \({W}^{intent}\) and \({b}^{intent}\) are neural network parameters.
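A sketch of the intent decoder of Eq. (15), assuming that the hidden output of the last time step is taken from \(H\); the `LogSoftmax` output pairs with the NLLLoss used later in Eq. (18):

```python
import torch.nn as nn

class IntentDecoder(nn.Module):
    """Linear classification over the last BiLSTM state, cf. Eq. (15)."""
    def __init__(self, num_intents, hidden_size=400):   # 400 = 2 x 200 from the BiLSTM
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_intents)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, H):                                # H: (batch, seq_len, hidden_size)
        H_intent = H[:, -1, :]                           # hidden output at the last time step
        return self.log_softmax(self.linear(H_intent))   # log-probabilities of y^intent
```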

\(H\) is the output of the hidden state of the BiLSTM encoder at all time steps, so it can be directly used for slot filling. We followed the method of Qin et al. (2021) to apply a CRF for the slot filling task in the decoder of our method, which can be defined as Eqs. (16) and (17):

$${C}^{slot}={W}^{slot}H+{b}^{slot}$$
(16)
$$P\left({y}^{slot}|{C}^{slot}\right)=\frac{\sum_{i=1}\exp f\left({y}_{i-1},{y}_{i},{C}^{slot}\right)}{\sum_{{y}^{\prime}}\sum_{i=1}\exp f\left({y}_{i-1}^{\prime},{y}_{i}^{\prime},{C}^{slot}\right)}$$
(17)

where \({W}^{slot}\) and \({b}^{slot}\) are training parameters, \({y}_{i}^{\prime}\) is a candidate slot label, and \(f\left({y}_{i-1},{y}_{i},{C}^{slot}\right)\) calculates the label score of \({y}_{i}\) and the score of the transition from \({y}_{i-1}\) to \({y}_{i}\).
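A sketch of the slot decoder of Eqs. (16) and (17), using the third-party pytorch-crf package as the CRF implementation (an assumption; the paper does not name one). The linear layer produces the emission scores \({C}^{slot}\), and the CRF layer handles the label and transition scoring:

```python
import torch.nn as nn
from torchcrf import CRF     # pip install pytorch-crf (assumed implementation choice)

class SlotDecoder(nn.Module):
    """Linear emission scores followed by a CRF, cf. Eqs. (16)-(17)."""
    def __init__(self, num_slots, hidden_size=400):
        super().__init__()
        self.emission = nn.Linear(hidden_size, num_slots)   # C^slot = W^slot H + b^slot
        self.crf = CRF(num_slots, batch_first=True)

    def forward(self, H, slot_labels=None, mask=None):
        C_slot = self.emission(H)                            # (batch, seq_len, num_slots)
        if slot_labels is not None:                          # training: CRF negative log-likelihood
            return -self.crf(C_slot, slot_labels, mask=mask, reduction="mean")
        return self.crf.decode(C_slot, mask=mask)            # inference: best-scoring slot sequence
```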

Finally, both the intent detection task and the slot filling task use the negative log-likelihood loss (NLLLoss) function as the loss function for training the TLC model. The loss function for the joint training can be defined as Eq. (18):

$${L}^{joint}=\alpha {L}^{intent}+\left(1-\alpha \right){L}^{slot}$$
(18)

where \({L}^{joint}\) is the loss function of the entire TLC model, \({L}^{intent}\) is the loss function for the intent detection task, \({L}^{slot}\) is the loss function for the slot filling task, and the hyperparameter \(\alpha\) is used to control the balance between these two tasks during training.
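A minimal sketch of the joint objective of Eq. (18), assuming the intent decoder returns log-probabilities (so `nn.NLLLoss` applies directly) and the slot decoder returns its CRF negative log-likelihood; \(\alpha = 0.5\) follows the experimental settings:

```python
import torch.nn as nn

def joint_loss(intent_log_probs, intent_labels, slot_nll, alpha=0.5):
    """Joint training objective of Eq. (18).

    intent_log_probs: output of the intent decoder (log-probabilities, Eq. (15))
    intent_labels:    gold intent indices, shape (batch,)
    slot_nll:         CRF negative log-likelihood returned by the slot decoder
    """
    loss_intent = nn.NLLLoss()(intent_log_probs, intent_labels)   # L^intent
    loss_slot = slot_nll                                          # L^slot
    return alpha * loss_intent + (1 - alpha) * loss_slot          # L^joint
```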

4 Experiments

To test our proposed TLC model for intent detection and slot filling, we choose the SNIPS (Coucke et al., 2018) corpus and the ATIS (Hemphill et al., 1990; Tur et al., 2010) corpus for experiments.

4.1 Datasets and evaluation metrics

The SNIPS corpus is a task-oriented dialogue system corpus collected by the French company SNIPS. It is mainly used to design voice assistants for dialogue systems. The full name of the ATIS corpus is the Air Travel Information System corpus. It was collected through the Official Airline Guide (OAG, 1990) and contains professional information such as airline bookings, travel, and consultations. It is a commonly used dataset for the evaluation of intent detection and slot filling in task-oriented dialogue systems. Although the SNIPS corpus has fewer types of intents and slots than the ATIS corpus, it has more training data. The statistics of the SNIPS corpus and the ATIS corpus are shown in Table 2.

Table 2 Statistics of the SNIPS and ATIS datasets

In the evaluation of the experimental results of intent detection and slot filling, we choose accuracy as the evaluation metric for the intent detection task and the F1 value as the evaluation metric for the slot filling task.

4.2 Experimental Settings

We use the PyTorch (Paszke et al., 2019) deep learning framework to build the TLC model, and all experiments are performed on a single GeForce GTX 1080 Ti GPU. Following the method of Wang et al. (2021) in the word embedding part of our model, the GloVe (Pennington et al., 2014) method is used for word-level embedding, and the Kazuma (Hashimoto et al., 2017) character-level embedding method is used as a supplement; the embedding dimensions of these methods are 300 and 100, respectively. In the positional embedding part of our model, we choose the learned positional embedding method. The TLC model uses 2 Transformer encoder layers, a Transformer encoder dimension of 400, 10 heads in the multihead attention mechanism, a feedforward network dimension of 2048, and the GELU activation function. The BiLSTM encoder has 2 layers, and the hidden size of each LSTM is 200. The batch size for training is 32, and the maximum number of epochs is 200. The learning rate is 0.0001, and the dropout rate is 0.1. The Adam (Kingma and Ba, 2015) method is used as the training optimizer; \({\beta }_{1}\) and \({\beta }_{2}\) are set to 0.9 and 0.999, respectively; \(\epsilon ={10}^{-8}\); and the weight decay is 0. Gradient clipping is applied to prevent exploding gradients during training, with the maximum gradient norm set to 1 and the L2 norm used. The hyperparameter \(\alpha\) is 0.5.
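These optimizer and gradient-clipping settings map directly onto standard PyTorch calls; the snippet below is a sketch of that configuration, where `model` and `train_loader` are assumed to be the assembled TLC network and its data loader:

```python
import torch

# Adam with the reported betas, epsilon, weight decay, and learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0)

for batch in train_loader:                         # batch size 32, up to 200 epochs
    optimizer.zero_grad()
    loss = model(batch)                            # joint loss of Eq. (18), alpha = 0.5
    loss.backward()
    # clip gradients to an L2 norm of at most 1, as described above
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2)
    optimizer.step()
```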

4.3 Experimental Results

We use 12 models as the baseline methods for the intent detection and slot filling experiments. These 12 models are Joint Seq (Hakkani-Tür et al., 2016), Slot-Gated Atten (Goo et al., 2018), Self-Attentive Model (Li et al., 2018), Bi-Model (Wang et al., 2018), CAPSULE-NLU (Zhang et al., 2019a, 2019b), SF-ID Network (E et al., 2019), CM-Net (Liu et al., 2019), Stack-Propagation (Qin et al., 2019), JointBERT (Chen et al., 2019), Graph LSTM (Zhang et al., 2020a, 2020b), Co-Interactive Transformer (Qin et al., 2021) and SyntacticTF (Wang et al., 2021). The characteristics of these 12 models are described in the related work section. It should be noted that the Co-Interactive Transformer (Qin et al., 2021), SyntacticTF (Wang et al., 2021) and our proposed TLC are models based on the Transformer encoder. In addition, the experimental results of these 12 models are taken from the published papers (Hakkani-Tür et al., 2016; Goo et al., 2018; Li et al., 2018; Wang et al., 2018; Zhang et al., 2019a, 2019b; E et al., 2019; Liu et al., 2019; Qin et al., 2019; Chen et al., 2019; Zhang et al., 2020a, 2020b; Qin et al., 2021; Wang et al., 2021). The experimental results are shown in Table 3. The results on the two datasets show that our proposed TLC model is effective for intent detection and slot filling. On the SNIPS corpus, the slot filling F1 value of our model is 0.36% higher than that of SyntacticTF (Wang et al., 2021), and the intent detection accuracy of our model is 0.15% higher than that of SyntacticTF (Wang et al., 2021). On the ATIS corpus, the slot filling F1 value of our model is 0.09% higher than that of SyntacticTF (Wang et al., 2021), and the intent detection accuracy of our model is 0.47% higher than that of the Co-Interactive Transformer (Qin et al., 2021). These results show that our proposed TLC model outperforms the previously proposed models based on the Transformer encoder.

Table 3 Results of intent detection and slot filling on the SNIPS and ATIS datasets

5 Discussion

We conduct model ablation analysis and parameter tuning analysis of the TLC model. In addition, we combine the TLC model with BERT.

5.1 Ablation Study

We conduct a model ablation analysis of our proposed model, and the experimental results are shown in Table 4.

Table 4 Results of the ablation study on the SNIPS and ATIS datasets

Table 4 shows that when we remove the residual learning module, the BiLSTM or the CRF from the TLC model, the performance of the TLC model decreases. It is worth noting that the residual learning module removed here is the newly added residual connection of our proposed model, shown as the red line in Fig. 1, rather than the residual connections inside the layers of the vanilla Transformer. When the residual learning module of the TLC model is removed, the slot filling performance of the model decreases on both datasets. When only the BiLSTM is removed from the TLC model, the slot filling performance of the model decreases on both datasets. When only the CRF is removed from the TLC model, only the slot filling performance on the SNIPS corpus decreases, while the intent detection performance increases. However, considering that it is more difficult to achieve a better slot filling result on the SNIPS corpus, we choose the CRF as the slot filling decoder for our proposed model. When residual learning, the BiLSTM and the CRF are removed at the same time, the performance of our proposed model decreases significantly. This means that it is difficult for the vanilla Transformer alone to complete the intent detection task and the slot filling task at the same time. This further illustrates that the combination of the vanilla Transformer with the BiLSTM and the CRF in our proposed model is indispensable for the intent detection and slot filling tasks.

5.2 Parameter Tuning Analysis

The Transformer encoder is the core of the TLC model. Therefore, we analyse the parameters of the Transformer encoder in the TLC model. First, because positional embedding plays an important role in the Transformer model, we adjust and analyse the positional embedding method of our proposed model. The experimental results are shown in Table 5. When the TLC model does not use positional embedding, the effect of the TLC model decreases. Therefore, choosing a suitable positional embedding method is vital for our proposed model. Vaswani et al. (2017) proposed sinusoidal positional encoding and learned positional embedding for the Transformer model and found that the effects of these two methods were basically the same in machine translation experiments. However, as shown in Table 5, in the intent detection and slot filling experiments, the sinusoidal positional encoding method has a better intent detection effect, while the learned positional embedding method has a better slot filling effect. Since it is more difficult to achieve a better slot filling effect compared with the baseline method, we choose the learned positional embedding method as the positional embedding method of the Transformer encoder in our proposed model.
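For reference, the two positional embedding variants compared here can be sketched as follows: the sinusoidal version follows Vaswani et al. (2017), and the learned version is a trainable `nn.Embedding` over positions (shapes and names are illustrative):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sin/cos positional encoding of Vaswani et al. (2017)."""
    position = torch.arange(max_len).unsqueeze(1)                       # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                        # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                        # odd dimensions
    return pe                                                           # (max_len, d_model)

# Learned positional embedding: a trainable lookup table over positions
learned_pos_emb = nn.Embedding(num_embeddings=128, embedding_dim=400)
```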

Table 5 Results of the positional embedding adjustment experiment

Second, it is common practice to use a multilayer Transformer in natural language processing tasks. Therefore, whether the performance of the TLC model can be improved by increasing the number of Transformer layers is worthy of further study. We adjust the number of Transformer encoder layers in our proposed model, and the experimental results are shown in Table 6. When a 2-layer Transformer encoder is used, our proposed model achieves the best results. In addition, the Co-Interactive Transformer (Qin et al., 2021) and SyntacticTF (Wang et al., 2021) both use a 2-layer Transformer encoder. Therefore, we set the number of Transformer encoder layers in our proposed model to 2.

Table 6 Results of the transformer encoder layer tuning experiment

5.3 Combination with BERT

The BERT (Devlin et al., 2019) model, a landmark in the field of natural language processing, excels at handling various natural language processing tasks and can be combined with other methods to achieve better results. The joint NLU (Gunaratna et al., 2022) is a model based on BERT. Qin et al. (2021) combined the Co-Interactive Transformer with BERT to achieve better intent detection and slot filling results. Therefore, we combine the TLC model with BERT for intent detection and slot filling. In these experiments, we remove the word embedding and positional embedding of the TLC model and combine the rest of the TLC model with BERT. The number of heads of the multihead attention mechanism in the Transformer encoder is adjusted from 10 to 16, and the dimension of the feedforward network is adjusted from 2048 to 1024. The Transformers (Wolf et al., 2020) library is used to load BERT for combination with the TLC model. Since the SNIPS corpus is a cased corpus, the base-cased version of BERT is selected; the ATIS corpus is an uncased corpus, so the base-uncased version of BERT is selected. The learning rate is changed from 0.0001 to 0.00005, the maximum number of epochs is changed from 200 to 100, and BERTAdam, an Adam (Kingma and Ba, 2015) optimization method adapted for BERT, is used as the training optimizer. The hyperparameter \(\alpha\) of the experiment on the SNIPS corpus remains 0.5, while the hyperparameter \(\alpha\) of the experiment on the ATIS corpus is changed to 0.7. Other model parameters and training settings remain unchanged. We follow the method of Qin et al. (2021) and choose Stack-Propagation (Qin et al., 2019) and the Co-Interactive Transformer (Qin et al., 2021) as the comparison methods. In addition, Wang et al. (2021) considered BERT and the Transformer encoder to belong to different technical routes and did not combine SyntacticTF with BERT for intent detection and slot filling; therefore, we do not choose SyntacticTF as a comparison method. The experimental results of the combination of the TLC model with BERT are shown in Table 7.
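A sketch of how the BERT backbone can be attached with the Transformers library; the checkpoint identifiers are the standard Hugging Face names, and passing the BERT outputs to the remaining TLC modules is an illustrative reconstruction of the setup described above:

```python
from transformers import BertModel, BertTokenizer

# base-cased for SNIPS (a cased corpus); base-uncased would be used for ATIS
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")

inputs = tokenizer("what is the weather here", return_tensors="pt")
outputs = bert(**inputs)
H_bert = outputs.last_hidden_state   # (batch, seq_len, 768): replaces GloVe + positional embedding
# H_bert is then fed into the remaining TLC encoder and decoder modules
```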

Table 7 Results of TLC combined with BERT on the SNIPS and ATIS datasets

As shown in Table 7, BERT can enhance the performance of the TLC model for intent detection and slot filling. When the TLC model is combined with BERT, our model outperforms all the comparison methods on the SNIPS corpus. The slot filling F1 value of our proposed TLC model is 0.12% higher than that of the joint NLU (Gunaratna et al., 2022) on the ATIS corpus. Although the intent detection accuracy of our proposed TLC model is lower than that of the joint NLU (Gunaratna et al., 2022) model on the ATIS corpus, our model performs better than the joint NLU model overall across the SNIPS and ATIS corpora.

6 Conclusion

In this paper, we propose a novel Transformer-based model for intent detection and slot filling. The experimental results show that the proposed method achieves higher intent detection accuracy and slot filling F1 values than existing Transformer-based methods. In addition, our proposed model can be combined with BERT to achieve better intent detection and slot filling results. In the future, we will verify our model on other datasets.