1 Introduction

In task-oriented spoken dialog systems, dialog state tracking (DST) is used to update a dialog state (i.e., the state of the user’s goal). DST is an essential function for a dialog system because the state directly affects the system’s response to the user. A dialog state is defined as “a data structure that summarizes the dialog history up to the time when the next system action is chosen” [1]. In practice, a dialog state is a probability distribution over a set of slots and their values as defined in a domain ontology. In the example shown in Table 1, “area, food, pricerange” are slots and “north, japanese, dontcare, none” are slot values. In practical applications, slot values may change during the operation of a dialog system. For example, in the restaurant information domain, the domain ontology changes when new restaurants are added. Therefore, DST should be able to handle a dynamic ontology and unseen slot values.

Traditional DST approaches use handcrafted rules [2, 3] because rule-based approaches are simple and intuitive. However, crafting rules is costly and applying them to a new domain is difficult. Recent DST approaches have been based on deep learning models such as recurrent neural networks (RNNs) [4,5,6], which need to be trained for predefined slots and values using the domain ontology. Consequently, to deal with new or unseen slot values, training data must be prepared and a new model must be trained.

To overcome this drawback, DST models have been proposed that can handle unseen slot values without retraining [7,8,9]. The RNN models in [7] use input features after delexicalization. Delexicalization replaces the words relevant to slots and their values with generic symbols. However, delexicalization requires handcrafted rules that compare input words with a large list of synonyms for slots and their values. Another approach uses spoken language understanding (SLU) based on a concept tagger architecture [8]. This approach utilizes slot names or slot descriptions to detect unseen values without model retraining. A neural belief tracker [9] estimates a dialog state by comparing representations of the user utterance, the system response, and slot values. Although these methods generalize a DST model to unseen slot values, this comes at the cost of crafting rules, synonym lists, slot descriptions, or a semantic dictionary.

Table 1 Examples of dialogs and dialog states

A problem that is not adequately addressed in the literature is how to deal with unseen slot values without any handcrafted rules. Pointer network-based DST approaches [10, 11] can detect unseen values by utilizing context information. These models use a pointer mechanism to extract words that are relevant to values. Although the model in [10] showed comparatively good performance on the second Dialog State Tracking Challenge (DSTC2) dataset, its accuracies for unseen values were low. Another DST model [11] showed better accuracies for unseen values on the third Dialog State Tracking Challenge (DSTC3) dataset. However, the results showed a tradeoff between the accuracies for seen values and those for unseen values. BERT-DST [12] extracts the span (start and end positions) of the specified slot value from the user utterance and the system response. BERT-DST showed high accuracies for tracking unseen slot values. However, its effectiveness was evaluated only on a restaurant name slot and a movie name slot; therefore, its effectiveness on other slots is unknown.

This paper proposes a new attention mechanism for a fully data-driven DST approach that can handle unseen slot values without handcrafted rules and without model retraining. This approach is based on the pointer-based DST [11]. Unlike conventional methods, we use encoded user utterances and a hypothesis for the slot values (the target values) to calculate attention. This enables the DST model to handle an unseen value by directly incorporating it into the attention weights. Attention weights are used to calculate context vectors, which are the weighted sums of word vectors. By comparing the context vectors and the word vectors of slot values, the model estimates the dialog state. We evaluate the DST performance of the proposed approach using the DSTC2 and DSTC3 datasets.

The remainder of this paper is organized as follows: Sect. 2 presents the proposed approach, Sect. 3 shows the experimental results and discusses their meaning and importance, and Sect. 4 concludes the paper.

2 Dialog State Tracker

Our proposal is an extension of the DST model in [11], but differs from that approach in that target values are used to calculate attention. The new attention mechanism enables the model to focus on words that are relevant to the target values even if the target values were unseen in training.

Fig. 1 Schematic diagram of our dialog state tracker

Figure 1 illustrates an overview of our DST model. The model consists of encoding and decoding layers. The encoding layer extracts one score from system actions (\({\textit{\textbf{s}}}^\mathrm{s}\)) and another score from user utterances (\({\textit{\textbf{s}}}^\mathrm{u}\)) separately. These two scores are integrated with the previous dialog state (\({\textit{\textbf{s}}}^\mathrm{p}\)) using weight parameters (\(\varvec{\beta } =[\beta ^\mathrm{{s}}, \beta ^\mathrm{{u}}, \beta ^\mathrm{{p}}]\)) in the decoding layer. The weighted sum (\({\textit{\textbf{y}}}\)) is regarded as a probability distribution over the slot values after applying the softmax function.

We will describe DST models that use the conventional attention mechanism and the new target value attention mechanism in Sects. 2.1 and 2.2, respectively.

In the sections that follow, we explain a process for a particular slot that includes K values (\(v_1, \ldots , v_K\)), in which the kth value consists of \(M_k\) words. We use n and N for the index of a word in a user utterance and the number of words in the user utterance, respectively.

2.1 DST Model with Conventional Attention

This section describes a DST model with a conventional attention mechanism based on the model proposed in [11].

2.1.1 Encoding Layer

The utterance-encoding and the action-encoding modules calculate two kinds of features. One is possibility scores (\({\textit{\textbf{s}}^\mathrm{{u}}}, {\textit{\textbf{s}}^\mathrm{{s}}}\)) that represent whether a slot value is the user’s goal. The other is feature vectors (\({\textit{\textbf{h}}^\mathrm{{r}}}, {\textit{\textbf{h}}_L}\)) for calculating weight parameters in the decoding layer.

Action Encoding

Action encoding extracts two kinds of features from a system action. One is a feature vector used for calculating weight parameters. The other is a score vector that represents how the system refers to a slot value.

The system action is represented by three features: a system action tag (\({\textit{\textbf{r}}^\mathrm{act}}\)), a target slot feature (\(r^\mathrm{s}\)), and a target value feature (\(r^\mathrm{v}\)). The system action tag is a one-hot vector whose dimension equals the number of action tags. The target slot feature is 1 if the previous system action includes the target slot, and the target value feature is 1 if it includes the target value; both are 0 otherwise. The three features are concatenated and encoded using a neural network as:

$$\begin{aligned} \textit{\textbf{h}}^\mathrm{r}= & {} \hbox {NN}_\mathrm{sys}\left( \textit{\textbf{r}}^\mathrm{act} \oplus r^\mathrm{s} \oplus r^\mathrm{v}\right) , \end{aligned}$$
(1)

where \({\textit{\textbf{h}}^\mathrm{r}} \in \mathbb {R}^{d_{\mathrm m}}\) is an output vector, \(d_{\mathrm m}\) is a model parameter, \(\hbox {NN}_\mathrm{sys}(\cdot )\) is a fully-connected neural network and \(\oplus \) is the vector concatenation. The output is used for the weight calculation.

The score vector of the system response (\({\textit{\textbf{s}}^\mathrm{s}}\)) is a \((K+2)\)-dimensional binary vector. If the system response includes the kth value, the kth component is 1; otherwise it is 0. The last two components correspond to the special values “none” and “dontcare”.
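As a concrete illustration, the following is a minimal NumPy sketch of the action-encoding step under these definitions. The tanh activation, the single-layer weights W and b, and the helper names are assumptions made for illustration, not the authors’ implementation.

```python
# Minimal sketch of action encoding (Eq. 1) and the score vector s^s.
import numpy as np

def encode_system_action(action_tag_id, num_tags, slot_in_action, value_in_action, W, b):
    """Build the concatenated action feature and apply a one-layer NN_sys (assumed tanh)."""
    r_act = np.zeros(num_tags)             # one-hot system action tag
    r_act[action_tag_id] = 1.0
    r_s = 1.0 if slot_in_action else 0.0   # target slot feature
    r_v = 1.0 if value_in_action else 0.0  # target value feature
    features = np.concatenate([r_act, [r_s, r_v]])
    return np.tanh(W @ features + b)       # h^r

def action_score_vector(system_values, ontology_values):
    """Binary flags over the K values plus the two special values (none, dontcare)."""
    K = len(ontology_values)
    s_s = np.zeros(K + 2)
    for k, v in enumerate(ontology_values):
        if v in system_values:
            s_s[k] = 1.0
    return s_s
```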

Utterance Encoding

Fig. 2 Utterance encoding using attention mechanisms

Figure 2(a) shows a block diagram of utterance encoding, in which the module receives a user utterance and encodes it using an attention mechanism.

The user utterance of N word vectors is encoded using a bidirectional LSTM as follows:

$$\begin{aligned} \textit{\textbf{h}}_n^\mathrm{f}= & {} \hbox {LSTM}_\mathrm{fwd}(\textit{\textbf{h}}_{n-1}^\mathrm{f}, \textit{\textbf{w}}_n), \end{aligned}$$
(2)
$$\begin{aligned} \textit{\textbf{h}}_n^\mathrm{b}= & {} \hbox {LSTM}_\mathrm{bwd}(\textit{\textbf{h}}_{n+1}^\mathrm{b}, \textit{\textbf{w}}_n), \end{aligned}$$
(3)
$$\begin{aligned} \textit{\textbf{h}}_n= & {} \textit{\textbf{h}}_{n}^\mathrm{f} \oplus \textit{\textbf{h}}_{n}^\mathrm{b}, \end{aligned}$$
(4)

where \(\textit{\textbf{w}}_n \in \mathbb {R}^{d_{\mathrm w}}, n=1, \ldots ,N\) is a word vector whose dimension is \(d_{\mathrm w}\), \(\textit{\textbf{h}}_n^\mathrm{f} \in \mathbb {R}^{d_{\mathrm m}/2}, \textit{\textbf{h}}_n^\mathrm{b} \in \mathbb {R}^{d_{\mathrm m}/2}\), and \(\textit{\textbf{h}}_n \in \mathbb {R}^{d_{\mathrm m}}\) are hidden states, \(\hbox {LSTM}_\mathrm{fwd}(\cdot ,\cdot )\) and \(\hbox {LSTM}_\mathrm{bwd}(\cdot ,\cdot )\) are forward and backward LSTM RNNs. Next, attention weights (\(\alpha _n\)) are calculated from the hidden states (\(\textit{\textbf{h}}_n\)) as follows:

$$\begin{aligned} z_n= & {} \hbox {NN}_\mathrm{att}(\textit{\textbf{h}}_n), \end{aligned}$$
(5)
$$\begin{aligned} \left[ \alpha _1, \ldots , \alpha _N\right]= & {} \hbox {softmax}([z_1, \ldots , z_N]), \end{aligned}$$
(6)

where \(\hbox {NN}_\mathrm{att} (\cdot )\) is a fully-connected neural network.

Then, a context vector (\(\textit{\textbf{c}} \in \mathbb {R}^{d_{\mathrm w}}\)) of the user utterance is calculated as a weighted sum of the word vectors as follows:

$$\begin{aligned} \textit{\textbf{c}}= & {} \sum _{n=1}^N \alpha _n \textit{\textbf{w}}_n. \end{aligned}$$
(7)

The score of the user utterance is calculated using cosine similarity between the context vector (\(\textit{\textbf{c}}\)) and the word vector of the kth value (\(\textit{\textbf{h}}^{v_k} \in \mathbb {R}^{d_{\mathrm w}}\)) as follows:

$$\begin{aligned} s_k= & {} \frac{\textit{\textbf{c}} \cdot \textit{\textbf{h}}^{v_k}}{\Vert \textit{\textbf{c}}\Vert \Vert \textit{\textbf{h}}^{v_k}\Vert }. \end{aligned}$$
(8)

Note that to handle values consisting of multiple words such as “eastern european”, we use the sum of the word vectors as \(\textit{\textbf{h}}^{v_k}\), that is, \(\textit{\textbf{h}}^{v_k} = \sum _{m=1}^{M_k} \textit{\textbf{v}}_{k,m}\), where \(\textit{\textbf{v}}_{k,m} \in \mathbb {R}^{d_\mathrm{w}}\) is the m-th word vector of the k-th value.
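Under these definitions, the conventional attention and scoring steps (Eqs. (5)–(8)) can be sketched as follows; `nn_att` is a placeholder for the fully-connected network \(\hbox {NN}_\mathrm{att}\), and all shapes are illustrative assumptions.

```python
# Minimal sketch of Eqs. (5)-(8): attention over hidden states, a context
# vector over the utterance word vectors, and a cosine score against a
# (possibly multi-word) value vector.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def conventional_attention_score(H, W_words, value_word_vecs, nn_att):
    # H: (N, d_m) hidden states, W_words: (N, d_w) utterance word vectors
    z = np.array([nn_att(h) for h in H])          # Eq. (5), one scalar per word
    alpha = softmax(z)                            # Eq. (6)
    c = (alpha[:, None] * W_words).sum(axis=0)    # Eq. (7), context vector
    h_v = np.sum(value_word_vecs, axis=0)         # sum of word vectors for a multi-word value
    s_k = c @ h_v / (np.linalg.norm(c) * np.linalg.norm(h_v))  # Eq. (8)
    return s_k
```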

To estimate the scores (\(\varvec{\tilde{s}}=[s_{\text {none}}, s_{\text {dc}}]\)) for the special values “none” and “dontcare”, we use a separate neural network \(\hbox {NN}_\mathrm{val}(\cdot )\) as follows:

$$\begin{aligned} \textit{\textbf{x}}= & {} \textit{\textbf{h}}^\mathrm{f}_{N} \oplus \textit{\textbf{h}}^\mathrm{b}_{1} \oplus \textit{\textbf{h}}^\mathrm{r} \oplus \mathrm{max} (\textit{\textbf{s}}^\mathrm{u}), \end{aligned}$$
(9)
$$\begin{aligned} \varvec{\tilde{s}}= & {} \hbox {NN}_\mathrm{val} \left( \textit{\textbf{x}}\right) , \end{aligned}$$
(10)

where \(\textit{\textbf{x}} \in \mathbb {R}^{2d_\mathrm{m} + 1}\) is the concatenation of the last states of the forward and backward LSTMs (\(\textit{\textbf{h}}_{N}^\mathrm{f}, \textit{\textbf{h}}_{1}^\mathrm{b}\)), the system action feature (\(\textit{\textbf{h}}^\mathrm{r}\)), and the maximum cosine similarity (\(\mathrm{max} (\textit{\textbf{s}}^\mathrm{u})\)). Finally, the scores \(s_k\) and \(\varvec{\tilde{s}}\) are concatenated as \(\textit{\textbf{s}}^\mathrm{u} = [s_1, \ldots , s_K] \oplus \varvec{\tilde{s}}\). Note that we omit this part from Fig. 2(a) for simplicity. The utterance encoding sends the concatenated score (\(\textit{\textbf{s}}^\mathrm{u}\)) and the feature vector (\(\textit{\textbf{x}}\)) to the decoding layer.
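A minimal sketch of the special-value scoring (Eqs. (9)–(10)), assuming the LSTM states and \(\textit{\textbf{h}}^\mathrm{r}\) are already computed; `nn_val` is a placeholder for \(\hbox {NN}_\mathrm{val}\), and the argument `s_u_values` holds the K cosine scores.

```python
# Minimal sketch of Eqs. (9)-(10): feature concatenation and scoring of the
# special values "none" and "dontcare".
import numpy as np

def special_value_scores(h_f_last, h_b_first, h_r, s_u_values, nn_val):
    x = np.concatenate([h_f_last, h_b_first, h_r, [np.max(s_u_values)]])  # Eq. (9)
    s_none, s_dontcare = nn_val(x)                                        # Eq. (10)
    return x, np.array([s_none, s_dontcare])
```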

2.1.2 Decoding Layer

The decoding layer integrates the score calculated from the user utterance (\(\textit{\textbf{s}}^\mathrm{u}\)), the score from the system response (\(\textit{\textbf{s}}^\mathrm{s}\)), and the dialog state of the previous turn (\(\textit{\textbf{s}}^\mathrm{p}\)) using weight parameters (\(\varvec{\beta } = \left[ \beta ^\mathrm{u}, \beta ^\mathrm{s}, \beta ^\mathrm{p} \right] \)) obtained from a neural network \(\hbox {NN}_\mathrm{weight} (\cdot )\) as follows:

$$\begin{aligned} \varvec{\beta }= & {} \hbox {NN}_\mathrm{weight} \left( \textit{\textbf{x}} \right) , \end{aligned}$$
(11)
$$\begin{aligned} \textit{\textbf{y}}= & {} \beta ^\mathrm{u} \textit{\textbf{s}}^\mathrm{u} + \beta ^\mathrm{s} \textit{\textbf{s}}^\mathrm{s} + \beta ^\mathrm{p} \textit{\textbf{s}}^\mathrm{p}, \end{aligned}$$
(12)
$$\begin{aligned} \textit{\textbf{p}}= & {} \hbox {softmax} (\textit{\textbf{y}}). \end{aligned}$$
(13)

After applying the softmax function, the model outputs the probability distribution (\(\textit{\textbf{p}} \in \mathbb {R}^{K+2} \)) over the K slot values, “none”, and “dontcare”.
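The decoding step can be sketched as follows; `nn_weight` is a placeholder for \(\hbox {NN}_\mathrm{weight}\) and is assumed to return three scalars.

```python
# Minimal sketch of the decoding layer, Eqs. (11)-(13): the three (K+2)-dim
# score vectors are mixed with predicted weights and normalized by softmax.
import numpy as np

def decode(x, s_u, s_s, s_p, nn_weight):
    beta_u, beta_s, beta_p = nn_weight(x)           # Eq. (11)
    y = beta_u * s_u + beta_s * s_s + beta_p * s_p  # Eq. (12)
    e = np.exp(y - y.max())
    return e / e.sum()                              # Eq. (13), probability distribution p
```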

2.2 DST Model with Target Value Attention

Figure 2(b) shows a block diagram of utterance encoding based on our attention mechanism. This model calculates attention weights for each target value using the word vectors of the corresponding target value. The attention mechanism is designed to focus the decoder on words in the user’s utterance that are relevant to the target value. If the utterance includes a word relevant to the target value, the attention weight for the word will be greater than that for the other words. As a result, the context vector (\(\textit{\textbf{c}}_{k}\)) is similar to the word vector of the target value. Therefore, by comparing the context vector and the target value, the system can detect unseen values.

Instead of Eqs. (5) and (6), attention weights (\(\alpha _{n,v_k}\)) are calculated using the word vector of the k-th value (\(\textit{\textbf{h}}^{v_k}\)) and the hidden states (\(\textit{\textbf{h}}_n\)) as follows:

$$\begin{aligned} z_{n, v_k}= & {} \hbox {NN}_\mathrm{att}(\textit{\textbf{h}}^{v_k} \oplus \textit{\textbf{h}}_n), \end{aligned}$$
(14)
$$\begin{aligned} \left[ \alpha _{1, v_k}, \ldots , \alpha _{N, v_k}\right]= & {} \hbox {softmax}(\left[ z_{1, v_k}, \ldots , z_{N, v_k}\right] ), \end{aligned}$$
(15)

where \(\hbox {NN}_\mathrm{att} (\cdot )\) is a fully-connected neural network.

Then, a context vector (\(\textit{\textbf{c}}_{k} \in \mathbb {R}^{d_{\mathrm w}}\)) of the user utterance is calculated as a weighted sum of the word vectors as follows:

$$\begin{aligned} \textit{\textbf{c}}_k= & {} \sum _{n=1}^{N} \alpha _{n,v_k} \textit{\textbf{w}}_n. \end{aligned}$$
(16)

Note that the context vector is calculated for each slot value.

The score of the user utterance is calculated using cosine similarity between the context vector (\(\textit{\textbf{c}}_k\)) and the value vectors (\(\textit{\textbf{h}}^{v_k}\)):

$$\begin{aligned} s_k= & {} \frac{\textit{\textbf{c}}_k \cdot \textit{\textbf{h}}^{v_k}}{\Vert \textit{\textbf{c}}_k\Vert \Vert \textit{\textbf{h}}^{v_k}\Vert }. \end{aligned}$$
(17)

The remaining parts are the same as the ones described in Sect. 2.1.
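For comparison with the conventional mechanism above, the following is a minimal sketch of the target value attention (Eqs. (14)–(17)); as before, `nn_att` is a placeholder for the fully-connected network and the shapes are assumptions.

```python
# Minimal sketch of Eqs. (14)-(17): attention weights are computed per target
# value by feeding the value vector together with each hidden state into NN_att.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def target_value_attention_score(H, W_words, h_v, nn_att):
    # H: (N, d_m) hidden states, W_words: (N, d_w) word vectors, h_v: value vector
    z = np.array([nn_att(np.concatenate([h_v, h])) for h in H])   # Eq. (14)
    alpha = softmax(z)                                            # Eq. (15)
    c_k = (alpha[:, None] * W_words).sum(axis=0)                  # Eq. (16)
    s_k = c_k @ h_v / (np.linalg.norm(c_k) * np.linalg.norm(h_v)) # Eq. (17)
    return s_k, c_k
```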

2.2.1 Model Training

When training the model, we minimize a loss function that consists of two terms. One is the cross-entropy loss (\(L_\mathrm{ce}\)) between the output probabilities (\(\textit{\textbf{p}}\)) and the ground truth label (\(\textit{\textbf{d}}\)). The other is the triplet loss [13] (\(L_\mathrm{tri}\)) between the normalized word vector of the ground truth value (\(\textit{\textbf{h}}^\mathrm{v_{\kappa }}\)) and the context vectors (\(\textit{\textbf{c}} = [\textit{\textbf{c}}_1, \ldots , \textit{\textbf{c}}_K]\)).

$$\begin{aligned} L= & {} L_\mathrm{ce}(\textit{\textbf{d}}, \textit{\textbf{p}}) + L_\mathrm{{tri}}(\textit{\textbf{h}}^\mathrm{{v_{\kappa }}}, \textit{\textbf{c}}), \end{aligned}$$
(18)
$$\begin{aligned} L_\mathrm{ce}(\textit{\textbf{d}}, \textit{\textbf{p}})= & {} -\sum \textit{\textbf{d}} \log \textit{\textbf{p}}, \end{aligned}$$
(19)
$$\begin{aligned} L_\mathrm{tri}(\textit{\textbf{h}}^\mathrm{{v_{\kappa }}}, \textit{\textbf{c}})= & {} \frac{1}{K} \left( \sum _{j \ne \kappa } \max \left\{ 0, \Vert \textit{\textbf{h}}^{v_{\kappa }} - \textit{\textbf{c}}_{\kappa } \Vert - \Vert \textit{\textbf{h}}^{v_{\kappa }} - \textit{\textbf{c}}_{j} \Vert + \varepsilon \right\} \right) , \end{aligned}$$
(20)

where \(\varepsilon \) and \(\kappa \) are a margin parameter and the index of the ground truth value, respectively. The triplet loss helps the model learn a context vector calculation that assigns a smaller distance to \((\textit{\textbf{h}}^\mathrm{{v_{\kappa }}}, \textit{\textbf{c}}_{\kappa })\) and a larger distance to \((\textit{\textbf{h}}^\mathrm{{v_{\kappa }}}, \textit{\textbf{c}}_{j \ne {\kappa }})\). Note that we add the triplet loss only when the corresponding user utterance includes a slot value.
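A sketch of the combined loss (Eqs. (18)–(20)) follows; the margin value used here is an assumption, since the paper does not report it.

```python
# Minimal sketch of the training loss: cross entropy over the output
# distribution plus a triplet loss that pulls the context vector of the
# ground truth value toward its word vector.
import numpy as np

def dst_loss(d, p, value_vecs, context_vecs, kappa, eps=0.1):
    ce = -np.sum(d * np.log(p + 1e-12))                  # Eq. (19)
    h_k = value_vecs[kappa]
    pos = np.linalg.norm(h_k - context_vecs[kappa])
    tri = 0.0
    K = len(context_vecs)
    for j in range(K):
        if j == kappa:
            continue
        neg = np.linalg.norm(h_k - context_vecs[j])
        tri += max(0.0, pos - neg + eps)                 # Eq. (20)
    return ce + tri / K                                  # Eq. (18)
```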

We introduce the “Sampling with Decay” technique [18] to feed the previous state (\(\textit{\textbf{s}}^\mathrm{{p}}\)). During model training, we randomly sample the previous state from the ground truth previous state (\(\textit{\textbf{d}}\)) with probability q or from the estimated state (\(\textit{\textbf{p}}\)) with probability \(1-q\). We define q with a decay function of the training epoch index (e) as \(q = \frac{\mu }{\mu + \exp (e/\mu )}\), where \(\mu \) is a parameter. As training proceeds, the probability q of feeding the ground truth gradually decreases [18].
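The sampling schedule can be sketched as follows; \(\mu = 12\) follows the setting reported in Sect. 3.1, and the helper name is illustrative.

```python
# Minimal sketch of "Sampling with Decay": with probability q the ground truth
# previous state is fed, otherwise the model's own estimate.
import numpy as np

def previous_state(d_prev, p_prev, epoch, mu=12.0, rng=np.random):
    q = mu / (mu + np.exp(epoch / mu))   # decays toward 0 as training proceeds
    return d_prev if rng.random() < q else p_prev
```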

3 Experiments

Table 2 Ontology of the train and test datasets

We evaluated our model using the DSTC2 and DSTC3 datasets [14, 15]. The datasets include human-computer dialogs in which users interacted with dialog systems to search for restaurants by specifying constraints. Among the slots included in the DSTC3 dataset, we used “area”, “food”, and “pricerange”. We excluded the “childrenallowed”, “type”, “hasinternet”, “hastv”, and “near” slots because these slots are not included in the DSTC2 dataset. We also excluded the “name” slot because word vectors for several of its values were not available. A summary of the slots and slot values is shown in Table 2. In the DSTC3 test dataset, 36.0%, 17.3%, and 3.5% of the dataset refer to unseen values in the area, food, and pricerange slots, respectively.

3.1 Experimental Condition

The evaluation metric is the accuracy of value estimation. The accuracy is calculated as the fraction of turns in which the top dialog state hypothesis is correct [14]. The ground truth labels follow “Scheme A”, which defines the label as the most recently asserted value, and the evaluation follows “Schedule 2” [14].
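A sketch of the metric, assuming the per-turn top hypotheses and ground truth labels are available as lists:

```python
# Minimal sketch of the accuracy metric: the fraction of turns whose top
# dialog state hypothesis equals the ground truth label.
def accuracy(predicted_labels, gold_labels):
    correct = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)
```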

We implemented a prototype DST system based on the proposed method using Chainer [16]. One-best ASR results were used as inputs to the encoding layer described in Sect. 2. Contractions were converted to their expanded forms (e.g., “i’m” to “i am”). Then, each word was converted to a 300-dimensional word vector using GloVe [17]. We used the GloVe model available on the GloVe website.Footnote 1 During training, the parameters of the GloVe model were fixed.
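The preprocessing can be sketched as follows, assuming the standard GloVe text format (one token followed by 300 floats per line); the contraction table and the zero-vector fallback for out-of-vocabulary words are illustrative assumptions.

```python
# Minimal sketch of the input preprocessing: contraction expansion and
# word-vector lookup with a fixed GloVe model.
import numpy as np

CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}  # illustrative subset

def expand_contractions(utterance):
    return " ".join(CONTRACTIONS.get(w, w) for w in utterance.lower().split())

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed(utterance, vectors, dim=300):
    words = expand_contractions(utterance).split()
    return np.stack([vectors.get(w, np.zeros(dim, dtype=np.float32)) for w in words])
```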

As a baseline, we implemented the DST model described in Sect. 2.1, which does not use the target value for the attention weight calculation. The RNN-based DST models with the proposed attention mechanism and the conventional attention mechanism are referred to as “Prop” and “Comp”, respectively.

We also evaluated two conventional methods, BERT-based DST [12] and pointer-based DST [11]. We implemented BERT-based DST using the publicly available BERT-DST source code.Footnote 2 We used the default parameters in the source code except for the slot value dropout ratio: we trained models with slot value dropout ratios of \([0, 0.1, \ldots , 0.4]\) and selected the best one. Note that the accuracies of BERT-based DST were calculated using only pointable samples. The accuracies of pointer-based DST are the ones reported in [11].

For \(\mathrm{LSTM}_\mathrm{fwd}\) and \(\mathrm{LSTM}_\mathrm{bwd}\), we used 1-layer LSTMs with 32 nodes. For \(\mathrm{NN}_\mathrm{sys}\), \(\mathrm{NN}_\mathrm{att}\), \(\mathrm{NN}_\mathrm{val}\), and \(\mathrm{NN}_\mathrm{weight}\), we used 1-layer, 3-layer, 4-layer, and 4-layer fully connected NNs, respectively.

Hyperparameters are as follows: the Adam optimizer (Chainer implementation); learning rate, 0.001; gradient clipping, 1.0; mini-batch size, 32; sampling parameter \(\mu \), 12; and maximum number of epochs, 200. We also applied word dropout, which randomly replaces the word vectors of user utterances with zero vectors. These hyperparameters were identical for the Comp and Prop models.
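The word dropout can be sketched as follows; the dropout ratio shown is an assumption, since its value is not reported here.

```python
# Minimal sketch of word dropout: each word vector of the user utterance is
# independently replaced with a zero vector at a given ratio.
import numpy as np

def word_dropout(word_vectors, ratio=0.1, rng=np.random):
    mask = rng.random(len(word_vectors)) >= ratio
    return word_vectors * mask[:, None]
```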

3.2 Results and Discussion

Table 3 Test accuracies on the DSTC2 test dataset

Table 3 shows DST accuracies on the DSTC2 test dataset. The upper part shows the results reported in DSTC2 [14] and the lower part shows the results of fully data-driven DST methods. For all slots, the RNN with rules shows the best performance, from 0.2 to 0.7 points higher than Comp and Prop. Prop and Comp achieve almost the same performance as the Focus baseline without using any handcrafted rules. The difference between Comp and Prop is less than 0.4 points. This is reasonable because Comp can extract words relevant to seen values without using the target value attention mechanism.

Table 4 Test accuracies on the DSTC3 test dataset

Table 4 shows DST accuracies on the DSTC3 test dataset. This table reveals the gap between rule-based and fully data-driven DST models. Among the fully data-driven models, Comp shows better performance than BERT-based DST under all conditions, and Prop improves the accuracies further. Pointer-based DST shows high accuracies on the area slot, but its accuracies are lower on the other slots.

The accuracy of Prop on the area slot is lower than that on the food and pricerange slots. The lower score might be caused by values consisting of multiple words, such as “new chesterton” and “kings hedges”. In the training dataset, the area slot contains single-word values such as “north” and “south”. Therefore, the DST models learned to extract only single words from user utterances. We observed that BERT-based DST also suffers from such word-length mismatches.

Table 5 Ablation study on the DSTC3 test dataset

We performed ablation experiments on the DSTC3 test dataset to analyze the effectiveness of the different components. The results are shown in Table 5. The accuracies of “without TVA” and “with TVA” (target value attention) are almost the same. However, integrating the three components achieves comparable or higher accuracies under most conditions. The effect of our method is more pronounced on unseen values than on all values. Averaged over the three slots, the score of Prop (69.6) is 10.3 points higher than that of Comp + SVD (59.3).

One drawback is that our method requires a word vector for each target value; it cannot track values whose word vectors are unavailable. This is why we excluded the name slot from the experiments. One promising remedy is to use subword units.

Another drawback is that the proposed model tends to fail when two values share the same word. In the test dataset, “pub” is included in both the food slot (“pub food”) and the type slot (“pub”). When a user says “I’m looking for a pub food restaurant,” the ground truth dialog state is “food = pub food, type = pub.” On the other hand, if a user says “I’m looking for a pub,” the ground truth dialog state is “food = none, type = pub.” The DST model with target value attention tends to estimate the latter utterance as “food = pub food.”

4 Summary

This paper proposed a fully data-driven approach to DST based on a target value attention mechanism. Unlike conventional attention mechanisms, the proposed attention mechanism utilizes the hypothesis for slot values in order to focus on unseen values without model retraining. We used the DSTC2 and DSTC3 datasets to evaluate the DST model based on the proposed approach. For unseen values, the results showed that using the proposed attention mechanism led to a 10.3-point improvement over the conventional attention mechanism.

Future research will aim to improve the accuracy of the model for both seen and unseen values as well as extend the proposed approach to handle unseen slots.