1 Introduction

Goal-driven Spoken Dialog Systems (SDSs) provide a natural language interface for human–machine interaction, helping users achieve goals such as finding restaurants or booking flights. SDSs have been widely deployed in daily life and industry for various purposes, and they have attracted a large number of researchers to their study.

Dialog state tracking (DST) is a key component of goal-driven SDSs. It is responsible for maintaining and updating the dialog state at each time step as the dialog progresses, and it forms the basis on which the system decides how to respond to the user [1]. A predefined set of slots needs to be filled as the goal-driven conversation progresses, and the dialog state is a distribution over the possible values of these slots.

The Dialog State Tracking Challenge (DSTC) [2, 3] provides a common testbed and evaluation standard for DST, and it has strongly promoted DST research. So far, six sessions of DSTC have been held, each with its own features and tasks. The first DSTC used human–computer dialogs in the bus timetable domain without changing user goals and was intended to explore different types of mismatch between the training and test data. DSTC2 released a large number of dialogs related to restaurant search and introduced the problem of changing user goals. DSTC3 was released to address adaptation to a new domain, involving the task of recognizing unknown slot values while also distinguishing which specific unknown slot value is expressed in the user utterance. The fourth challenge focused on human–human dialogs, which are much more unstructured and noisy than human–machine dialogs. The fifth challenge introduced a cross-language dialog state tracking task to address adaptation to a new language. The sixth challenge introduced three tasks: end-to-end goal-oriented dialog learning, end-to-end conversation modeling, and dialog breakdown detection.

Many belief tracking approaches have been proposed in the literature. Early DST used hand-crafted rules. Generative DST models employed Bayesian networks, in which the distribution over possible dialog states is updated by Bayesian inference [4, 5]. Bohus and Rudnicky [6] proposed the first discriminative state tracker, and subsequent work explored numerous variations of the discriminative approach [7,8,9,10]. Henderson et al. [11] proposed the first word-based DST model: unlike traditional DST models that take the output of Natural Language Understanding (NLU) as input, word-based DST maps directly from the utterances to an updated belief state without an explicit NLU component. Subsequently, many joint NLU and DST models were proposed, most of which achieved good results [12,13,14,15,16]. There has also been work on combining generative models or rule-based methods with discriminative models to update the dialog state, and these ideas achieved good performance [17, 18].

Most of the above models rely on either classification over a fixed set of slot values or scoring each candidate slot-value pair separately; the slot values are predefined, so a value may be unseen but never unknown. An unseen value is a different expression of a predefined slot value that is not observed during training, while an unknown value lies outside the predefined slot values. For example, the predefined value set of the slot food type might be \(\left\{ Spanish, French, Chinese\right\} \) for a restaurant dialog system; as new users from different places join, new slot values such as \(\left\{ Italian, Vietnamese\right\} \) will occur in dialogs. They are not different expressions of existing slot values and cannot be classified as any of them: they are unknown values to the system. In practical applications, the continuous emergence of unknown slot values in dialogs is inevitable. First, a slot may acquire more and more values as the number of users grows. Second, it is impossible to collect training data for all slot values; the training corpus is limited, especially the initial corpus for a new domain. If unknown slot values cannot be accurately recognized and distinguished, the dialog state cannot be correctly updated in real time, so solving the problem of unknown slot values plays an important role in improving the performance of DST.

Henderson et al. [11, 19] first introduced a de-lexicalization module into a dialog state tracking model to identify mentions of unseen slot values. The de-lexicalization strategy replaces slot values mentioned in the dialog text with generic tags; such a conversion allows the model to generalize much better to infrequent or unseen values. Mrkšić et al. [12] then learned vector representations of the user utterance and candidate slot-value pairs to decide whether the user expresses each candidate slot-value pair, using pre-trained word embeddings to handle unseen values in the utterance. These models are only designed to handle unseen values.

Kadlec et al. [23] derived several heuristics based on existing NLU results to track the dialog state; they modified the original NLU hypotheses by identifying several obvious shortcomings of them, and finally achieved the best results on the DSTC3 dataset, which suffers from the problem of unknown slot values. In fact, however, they did no work on unknown slot values themselves.

Related work includes zero-shot or few-shot learning and novel class or anomaly detection. The general idea behind zero-shot or few-shot learning is to map the input and the class labels into a semantic space in which similar classes are represented by nearby points [20, 21]. However, these methods have so far only been tried on simple datasets; complex datasets require complex mapping functions to ensure that the input and the class labels are mapped into the same semantic space. Traditional novel class or anomaly detection places the detected novel or abnormal samples in a cache to form a training corpus once enough samples have been found; this not only requires retraining but also introduces time constraints on distinguishing novel or abnormal classes from known classes [22].

Unknown slot values can be considered new or anomalous categories of dialog state, so the detection of unknown slot values can be treated as novel class or anomaly detection. But after an unknown slot value is detected, it needs to be included in the state updating process to facilitate action generation.

Therefore, to deal with unknown slot values, we divide DST into several steps: first, detecting whether an unknown slot value is present; then updating the distributions for the unknown and known slot values; and finally integrating the distributions for unknown slot values with those for known slot values. On the one hand, unknown and known slot values are updated at the same time, and the parameters at each of the three levels are learned to minimize their loss functions. On the other hand, the detection and update subtasks are tightly related, so jointly modeling the three-step process lets them help each other.

Based on this, we propose a hierarchical dialog state tracking framework to model dialog state tracking with unknown slot values. The framework consists of three levels of models. The first-level model, a cascaded neural network, detects whether an unknown slot value occurs. The second-level model has two parts: an update scheme for known slot values and an update scheme for unknown slot values. The third-level model updates the dialog state on the basis of the first- and second-level models and finally obtains the dialog state. Experimental results on the DSTC and WOZ2.0 datasets show that the proposed hierarchical framework achieves better results than those of [19, 23] and comparable performance to the model of Mrkšić et al. [12]. In particular, the detection and distinction of unknown slot values greatly improve the final dialog state tracking performance.

The remainder of this paper is organized as follows. Section 2 describes our hierarchical dialog state tracking framework: the framework is introduced first, and then each level is presented in turn. Section 3 introduces the experimental settings. Section 4 details the experiments and the analysis of the results. Section 5 draws conclusions.

2 Method

Following Henderson et al. [19] and Mrkšić et al. [12], our model operates slot by slot; we therefore describe it for a single slot S. The vocabulary of values for slot S is \(\left\{ V_1,V_2,\ldots ,V_m,V_{m+1},\ldots ,V_N\right\} \). One part, \(V^{known} = \left\{ V_1,V_2,\ldots ,V_m\right\} \), is the set of known slot values. Any value may have different expressions; for example, \(V_j\) has \(n_j\) possible expressions \(\left\{ V_{j1},V_{j2},\ldots ,V_{jn_j}\right\} \). A slot value is known if at least one of its expressions occurs in training. The other part, \(V^{unknown}=\left\{ V_{m+1},V_{m+2},\ldots ,V_N\right\} \), is the set of unknown slot values: an unknown slot value is one for which no expression occurs in training.

Since there is no training data for unknown slot values, classification-based models are unable to detect or distinguish them. We design a hierarchical dialog state tracking model (HDSTM) that can detect whether an unknown slot value occurs in the user utterance and distinguish which specific unknown slot value is expressed while updating the dialog state.

The basic framework of the HDSTM is shown in Fig. 1. There are three levels. The first level, comprising a detection model D, detects whether an unknown slot value occurs in a user utterance. The second level consists of two parts: the update scheme O for known slot values and the update scheme N for unknown slot values. On the basis of the lower-level results, the third-level model R updates the dialog state and finally obtains the state label \(STATE_t\) of the t-th dialog turn \(I_t\). In Fig. 1, the abbreviation KB stands for knowledge base, detailed in Sect. 2.2.2.

Fig. 1 Basic framework of HDSTM

Before detailing the models at each level, we first describe the inputs of the model. A belief tracker needs the user utterance as input, and the dialog acts of the system leading up to the user utterance are also helpful. Examples are given below.

(1) System Request: the system asks the user for the value of a specific slot. For example, if the user responds to the system action request (slot = area) with “any”, then the user utterance “any” refers to the slot “area”, not to other slots such as “pricerange” and “food”.

(2) System Confirm: the system asks the user to confirm whether a specific slot-value pair is part of their desired constraints. For instance, if the system action is confirm (food = indian) and the user answers “yes”, then “yes” refers to the slot-value pair “food:indian”, not to others.

Let \(u_t=u_{t1},u_{t2},\ldots ,u_{tk}\) be the user utterance at turn t and \(m_{t-1}=m_{(t-1)1},m_{(t-1)2},\ldots ,m_{(t-1)z}\) be the preceding system acts, where k is the number of words in the user utterance and z is the number of words in the system acts. For simplicity, we concatenate them into \(f_t=w_{t1},\ldots ,w_{tk},w_{t(k+1)},\ldots ,w_{tL}=u_{t1},\ldots ,u_{tk}\#m_{(t-1)1},\ldots ,m_{(t-1)z}\), with the user utterance in front and the symbol “#” denoting concatenation. Here \(w_{ti}\) is the i-th word in \(f_t\): \(w_{ti}\) belongs to \(u_t\) for \(1 \le i \le k\) and to \(m_{t-1}\) for \(k+1 \le i \le L\), and L is the maximum number of words in \(f_t\). Let \(f_{tv}=d_{t1},d_{t2},\ldots ,d_{tk},d_{t(k+1)},\ldots ,d_{tM}\) be the tagged features of \(f_t\), created from \(f_t\) by appending common tags after occurrences of specific values, where M is the maximum number of words in \(f_{tv}\). There are two generic symbols: \(\left\langle match\_known \right\rangle \) and \(\left\langle match\_unknown \right\rangle \). Every known slot value occurring in \(f_t\) is followed by the symbol \(\left\langle match\_known \right\rangle \), and every unknown slot value is followed by \(\left\langle match\_unknown \right\rangle \), determined by querying the knowledge base. For example, suppose “Chinese” is a known slot value and “English” an unknown slot value of slot food type (the ontology of the dataset provides the candidate values for each slot; the slot values occurring in the training data are known, and the rest are unknown). The user utterance might be “yes not Chinese but English”, with preceding dialog act confirm (food = English). Feature extraction for this dialog turn t is outlined in Fig. 2, and a minimal code sketch follows the figure. The following subsections detail the model at each level.

Fig. 2 Example of feature extraction for one turn, giving \(f_t\) and \(f_{tv}\)
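To make the construction of \(f_t\) and \(f_{tv}\) concrete, the following minimal Python sketch tags one turn. The whitespace tokenization, whole-word matching, and toy known/unknown value sets are assumptions for illustration; the paper's actual matching rules and tag inventory are as described above.

```python
# A minimal sketch of the turn-level feature construction described above.
# The tag tokens, lowercasing, and whole-word matching are assumptions.

KNOWN_VALUES = {"chinese"}     # hypothetical known values for slot "food"
UNKNOWN_VALUES = {"english"}   # hypothetical unknown values (from the KB)

MATCH_KNOWN = "<match_known>"
MATCH_UNKNOWN = "<match_unknown>"

def build_features(user_utterance: str, system_acts: str):
    """Return (f_t, f_tv) for one dialog turn.

    f_t  : user utterance words, then '#', then system act words.
    f_tv : f_t with a generic tag appended after every matched slot value.
    """
    f_t = user_utterance.lower().split() + ["#"] + system_acts.lower().split()
    f_tv = []
    for word in f_t:
        f_tv.append(word)
        if word in KNOWN_VALUES:
            f_tv.append(MATCH_KNOWN)
        elif word in UNKNOWN_VALUES:   # membership decided by querying the KB
            f_tv.append(MATCH_UNKNOWN)
    return f_t, f_tv

f_t, f_tv = build_features("yes not Chinese but English", "confirm food = English")
print(f_t)   # ['yes', 'not', 'chinese', 'but', 'english', '#', 'confirm', ...]
print(f_tv)  # tags inserted after 'chinese' and after each 'english'
```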

2.1 Detection Model

The detection model, a binary classifier, detects whether an unknown slot value occurs. For slot S, the classifier has two values, \(\left\{ V_{unk}, V_{notunk}\right\} \). If an unknown slot value occurs at the t-th turn, the classifier should output \(V_{unk}\); otherwise \(V_{notunk}\). In the detection model, the belief state \(b_t\) at the t-th turn for slot S is the distribution over these two values.

The belief state \(b_t\) at the t-th turn is given by the previous belief state \(b_{t-1}\), the last system actions, and a new observation [5]. Directly using the word sequence \(T_t\), the concatenation of the user utterance and its preceding dialog acts, results in serious data sparseness. In this paper, \(T_t\) is first mapped to its representation \(\varphi \left( T_t \right) \), and states are then tracked based on \(\varphi \left( T_t \right) \) and the previous belief state \(b_{t-1}\). Tracking the dialog state of slot S at turn t thus amounts to constructing the maps in (1) and (2).

$$\begin{aligned} \displaystyle&\varphi :T_i \rightarrow \varphi \left( T_i \right) \end{aligned}$$
(1)
$$\begin{aligned} \displaystyle&p:\varphi \left( T_1 \right) ,b_0,\ldots ,\varphi \left( T_t \right) ,b_{t-1} \rightarrow S=\hat{a}_t \end{aligned}$$
(2)

where \(b_0\) is the initial belief state distribution, \(\hat{a}_t \in \left\{ V_{unk}, V_{notunk}\right\} \), and for the detection model \(T_t = f_{tv}\).

Unlike previous pipeline models, which model the two maps in (1) and (2) separately, we propose a cascaded structure that models and trains the two maps jointly; NLU and DST are therefore integrated in a single model.

There are two layers in the joint model. The upper layer employs a Long Short-Term Memory network (LSTM) [24], which receives the vector representation of the word sequence and the belief state of previous turns and encodes them into the latest hidden state. The bottom layer adopts a Convolutional Neural Network (CNN) [25, 26] that receives the tagged features \(f_{tv}\) and constructs the turn representation. The detailed structure of the detection model is shown in Fig. 3.

Fig. 3 Structure of the first-level detection model

Here, \(x_{ti}\) is the m-dimensional word vector of \(d_{ti}\), initialized randomly, and the \(x_{ti}\) are input to the underlying CNN in order. The CNN involves a filter w applied to a window of h words to produce a new feature. For example, applying the convolution to \(x_{ti:t(i+h-1)}\) produces a feature \(c_{ti}=f\left( w \cdot x_{ti:t(i+h-1)}+b\right) \), where b is a bias term and f is a non-linear function. The filter is applied to each possible window of words \(\left\{ x_{t1:th},x_{t2:t(h+1)},\ldots ,x_{t(L-h+1):tL} \right\} \) to generate a feature map \(c_t=\left[ c_{t1},c_{t2},\ldots ,c_{t(L-h+1)} \right] \). A max-pooling layer then selects the maximum value of the feature map as the feature corresponding to this particular filter. Multiple filters and multiple feature maps are adopted in this paper, so the fixed-dimensional vector representation \(\varphi \left( f_{tv} \right) \) of turn t is the concatenation of the features selected from the feature maps generated by the different filters.
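The following PyTorch sketch shows one plausible implementation of this bottom CNN layer. The filter widths (3, 5, 7, 9), 100 feature maps per width, and the 50-dimensional random embeddings follow Sect. 3.3; the padding scheme and ReLU non-linearity are assumptions.

```python
import torch
import torch.nn as nn

class TurnEncoderCNN(nn.Module):
    """Bottom-layer sketch: word vectors -> fixed-size turn representation."""

    def __init__(self, vocab_size, emb_dim=50, widths=(3, 5, 7, 9), n_maps=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # randomly initialized
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_maps, kernel_size=w, padding=w - 1)
            for w in widths
        )

    def forward(self, token_ids):                        # (batch, L)
        x = self.embed(token_ids).transpose(1, 2)        # (batch, emb_dim, L)
        # c = f(w . x + b) over every window, then max over time per filter
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)                  # (batch, 4 * n_maps)

enc = TurnEncoderCNN(vocab_size=1000)
phi = enc(torch.randint(0, 1000, (2, 20)))               # phi(f_tv), shape (2, 400)
```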

\(\varphi \left( f_{tv} \right) \) is passed to the upper LSTM as the representation of the tagged features \(f_{tv}\). The current belief state \(b_t\) is updated through the upper LSTM as follows:

$$\begin{aligned} o_t= & {} \varphi \left( f_{tv} \right) \oplus b_{t-1} \end{aligned}$$
(3)
$$\begin{aligned} h_t= & {} LSTM\left( h_{t-1}, o_t \right) \end{aligned}$$
(4)
$$\begin{aligned} b_t= & {} softmax \left( linear \left( h_t \right) \right) \end{aligned}$$
(5)

The input \(o_t\) of the upper LSTM is the concatenation of the current utterance representation \(\varphi \left( f_{tv} \right) \) and the previous belief state \(b_{t-1}\). The hidden state \(h_t\) of the upper LSTM produces the belief state \(b_t\) via a softmax over a linear layer.

For slot S, the value with the maximum confidence score is the state label of the t-th turn.

$$\begin{aligned} \hat{a}_t = argmax \left( b_t \right) \end{aligned}$$
(6)

Let \(\theta \) be the network parameters, randomly initialized as \(\theta _0\). The detection model is trained with supervised learning using the cross-entropy loss. For a dialog segment, the ground-truth slot values are \(A=a_1,a_2,\ldots ,a_{DL}\), and the predicted confidence scores for A (the probabilities that the belief states assign to the ground-truth values) are \(\hat{A} =\hat{a}_1,\hat{a}_2,\ldots ,\hat{a}_{DL} \), where DL is the maximum dialog length. The state tracking loss for this dialog segment is computed as follows:

$$\begin{aligned} J_D \left( \theta \right) = - \sum _{i=1}^{DL} log\hat{a}_i \end{aligned}$$
(7)
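A compact sketch of the upper layer follows, implementing Eqs. (3)–(7) for a single dialog: each turn representation is concatenated with the previous belief, passed through one LSTM step, and mapped to a new belief by a softmax over a linear layer; the loss accumulates the negative log of the probability assigned to the ground-truth value. The uniform initial belief \(b_0\) and the single-dialog batching are assumptions.

```python
import torch
import torch.nn as nn

class BeliefUpdater(nn.Module):
    """Upper-layer sketch for Eqs. (3)-(7); hidden size 64 follows Sect. 3.3."""

    def __init__(self, repr_dim=400, n_values=2, hidden=64):
        super().__init__()
        self.cell = nn.LSTMCell(repr_dim + n_values, hidden)
        self.out = nn.Linear(hidden, n_values)
        self.n_values = n_values

    def forward(self, turn_reprs, labels):
        # turn_reprs: (DL, repr_dim) for one dialog; labels: (DL,) gold indices
        h = c = torch.zeros(1, self.out.in_features)
        b = torch.full((1, self.n_values), 1.0 / self.n_values)  # b_0: uniform
        loss, beliefs = 0.0, []
        for t in range(turn_reprs.size(0)):
            o_t = torch.cat([turn_reprs[t:t + 1], b], dim=1)     # Eq. (3)
            h, c = self.cell(o_t, (h, c))                        # Eq. (4)
            b = torch.softmax(self.out(h), dim=1)                # Eq. (5)
            loss = loss - torch.log(b[0, labels[t]])             # Eq. (7)
            beliefs.append(b)
        return torch.stack(beliefs).squeeze(1), loss

tracker = BeliefUpdater()
beliefs, loss = tracker(torch.randn(5, 400), torch.tensor([0, 1, 1, 0, 1]))
pred = beliefs.argmax(dim=1)                                     # Eq. (6)
```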

The main obstacle to recognizing unknown slot values is the absence of training data for them, so the most straightforward idea is to construct such training data. It is much easier to construct negative examples for the detection model from known slot values than to construct training data for genuinely unknown slot values. Without changing the dataset, known slot values with relatively low frequency in the training set are used as negative examples, that is, they are simulated as unknown slot values to give the model the ability to recognize unknown values. The procedure is as follows: first, the known slot values are sorted by the amount of their training data in descending order; then the last several known slot values are simulated as unknown slot values, and the number of simulated negative samples is increased gradually to investigate the effect of this proportion on the experimental results.
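A sketch of this negative-example construction, assuming the training data for a slot is given as a flat list of gold value labels (the data format and function names are illustrative):

```python
from collections import Counter

def simulate_unknowns(training_labels, n_simulated):
    """Sort known slot values by training frequency (descending) and relabel
    the rarest n_simulated values as 'unknown' for training the detector."""
    freq = Counter(training_labels)
    by_freq = [v for v, _ in freq.most_common()]       # descending frequency
    simulated = set(by_freq[len(by_freq) - n_simulated:])
    # detection-model targets: V_unk for simulated values, V_notunk otherwise
    return ["V_unk" if v in simulated else "V_notunk" for v in training_labels]

labels = ["chinese"] * 50 + ["french"] * 20 + ["spanish"] * 3 + ["korean"] * 1
targets = simulate_unknowns(labels, n_simulated=2)     # 'spanish', 'korean' -> V_unk
```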

2.2 Second-Level Model

2.2.1 Update Scheme for Known Slot Values

When the first-level detection model outputs \(V_{notunk}\), no unknown slot value is detected at the t-th turn, and the dialog state is updated through the update scheme for known slot values. This update scheme is a multi-class classifier: the belief state \(b'_t\) at the t-th turn for slot S is the distribution over the known slot values \(V^{known} = \left\{ V_1,V_2,\ldots ,V_m\right\} \).

The update scheme for known slot values adopts the same architecture as the first-level detection model: \(f_t\) is first mapped to its representation \(\varphi \left( f_t \right) \) through the bottom CNN layer; the belief state \(b'_t\) of the current turn is then tracked from \(\varphi \left( f_t \right) \) and the previous belief state \(b'_{t-1}\) through the upper LSTM layer; finally, the value with the maximum confidence score, \(\hat{a}'_t\), is the state label of the t-th turn. In this scheme, \(\hat{a}'_t \in \left\{ V_1,V_2,\ldots ,V_m\right\} \). The network computation is as follows:

$$\begin{aligned}&\varphi :f_t \rightarrow \varphi \left( f_t \right) \end{aligned}$$
(8)
$$\begin{aligned}&o'_t = \varphi \left( f_{t} \right) \oplus b'_{t-1} \end{aligned}$$
(9)
$$\begin{aligned}&h'_t = LSTM\left( h'_{t-1}, o'_t \right) \end{aligned}$$
(10)
$$\begin{aligned}&b'_t = softmax \left( linear \left( h'_t \right) \right) \end{aligned}$$
(11)
$$\begin{aligned}&\hat{a}'_t = argmax \left( b'_t \right) \end{aligned}$$
(12)

As with the first-level detection model, the update scheme O for known slot values is trained with supervised learning. Given the ground-truth state labels \(A'=a'_1,a'_2,\ldots ,a'_{DL}\) of a dialog segment and the corresponding predicted confidence scores \(\hat{A}'=\hat{a}'_1,\hat{a}'_2,\ldots ,\hat{a}'_{DL}\), the state tracking cross-entropy loss is computed as follows:

$$\begin{aligned} J_O \left( \theta \right) = - \sum _{i=1}^{DL} log\hat{a}'_i \end{aligned}$$
(13)

2.2.2 Update Scheme for Unknown Slot Values

When the first-level detection model outputs \(V_{unk}\), an unknown slot value is detected at the t-th turn, and the dialog state is updated through the update scheme N for unknown slot values. A trained model cannot distinguish between different unknown slot values because there is no training data for them. Although unknown slot values do not appear in the training corpus, it can be assumed that they appear in external corpora, which is reasonable in most cases. A knowledge base can therefore be constructed to distinguish between different unknown slot values: when the detection model outputs \(V_{unk}\), the knowledge base is queried to determine which specific unknown slot value \(\hat{a}''_t \in V^{unknown}=\left\{ V_{m+1},V_{m+2},\ldots ,V_N\right\} \) is expressed in the utterance.

We constructed the knowledge base by fuzzy string matching. Fuzzy string matching compares each possible window of words in the utterance with a slot value, where the window size is the number of words in the slot value; when the match ratio exceeds a certain threshold, the matching segment of the utterance is added to a raw semantic dictionary, and we then manually check whether the resulting item is a possible expression of the slot value. We further integrate the contents of the semantic dictionaries produced by Henderson et al. [19] and Mrkšić et al. [12], so the knowledge base used in this paper is a slot value dictionary extended from their de-lexicalization dictionaries. Example entries for two slot-value pairs are PRICERANGE=CHEAP: [cheaper, budget, affordable, inexpensive, economic, ...] and PRICERANGE=MODERATE: [moderately, medium, reasonably priced, ...].
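The sketch below illustrates both steps under simplifying assumptions: a fuzzy matcher based on Python's difflib (the paper does not specify the similarity measure or threshold) that proposes candidate expressions for the raw semantic dictionary, and a simple substring lookup that stands in for querying the finished knowledge base.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8   # assumed match-ratio threshold; the paper does not fix one

def fuzzy_candidates(utterance_words, slot_value):
    """Slide a window of len(slot_value) words over the utterance and collect
    segments whose fuzzy match ratio with the slot value exceeds THRESHOLD.
    (In the paper, collected segments are then checked manually.)"""
    value_words = slot_value.split()
    w = len(value_words)
    candidates = []
    for i in range(len(utterance_words) - w + 1):
        segment = " ".join(utterance_words[i:i + w])
        if SequenceMatcher(None, segment, slot_value).ratio() > THRESHOLD:
            candidates.append(segment)
    return candidates

def query_kb(kb, utterance):
    """Return the unknown slot value whose KB expressions appear in the
    utterance; kb maps slot values to lists of surface expressions."""
    for value, expressions in kb.items():
        if any(expr in utterance for expr in expressions):
            return value
    return None

kb = {"moderate": ["moderately", "medium", "reasonably priced"]}
print(fuzzy_candidates("reasonably priced food please".split(), "reasonably priced"))
print(query_kb(kb, "i want something reasonably priced"))   # -> 'moderate'
```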

2.3 Third-Level Model

The third-level model R updates the dialog state on the basis of the first-level and second-level models; the update of the dialog state at the t-th turn for slot S is given in formula (14).

$$\begin{aligned} STATE_t={\left\{ \begin{array}{ll} \hat{a}''_t, &{} \quad \text{ if } D_t=V_{unk}\\ \hat{a}'_t, &{} \quad \text{ if } D_t=V_{notunk} \end{array}\right. } \end{aligned}$$
(14)

When the output \(D_t\) of the detection model at the t-th turn is \(V_{unk}\), an unknown slot value has been detected, and the dialog state of the t-th turn is the result \(\hat{a}''_t\) of the update scheme for unknown slot values; otherwise it is the result \(\hat{a}'_t\) of the update scheme for known slot values. The state label \({STATE}_t\) of the t-th turn is thus obtained.
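The third-level rule in formula (14) amounts to a two-way switch; a minimal sketch (value names are illustrative):

```python
def hdstm_state(detector_out, known_label, kb_label):
    """Eq. (14): route the turn-level state through the unknown-value scheme N
    when the detector fires, otherwise through the known-value scheme O."""
    return kb_label if detector_out == "V_unk" else known_label

state_t = hdstm_state("V_unk", known_label="chinese", kb_label="italian")  # 'italian'
```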

3 Experimental Settings

3.1 Dataset

While there is a large body of empirical work on DST, evaluation protocols for DST with unknown slot values vary. We present comparisons with several state-of-the-art studies under the same data settings. Three datasets are used to verify the performance of the proposed model; these corpora share almost the same domain ontology.

(1) DSTC3 is a human–machine dialog dataset in the domain of tourist information search, collected using Amazon Mechanical Turk. A corpus of 2264 dialogs was collected for evaluation, and a further 3235 dialogs were divided at a ratio of about 8:2 into a training set and a validation set. Thus, the training, validation, and test sets of DSTC3 contain 2597, 638, and 2264 dialogs, respectively, and each turn in a dialog may contain one or more slots. Since this paper studies the ability of trackers to discover and distinguish unknown slot values, the tracker is evaluated on three slots: “pricerange”, “food”, and “area”. Slot “name” is not analyzed here, since almost all of its values are Null in the dataset. The known and unknown value types for each slot in DSTC3 and the proportion of unknown slot values in the test set are shown in Table 1.

(2) WOZ2.0 was collected with users assuming the role of either the system or the user of a task-oriented dialog system. The users were asked to type natural language sentences, each contributing just a single turn to a dialog. Before contributing their turns, users had to review all previous turns in that dialog to ensure coherence and consistency, and they were encouraged to learn from and correct each other based on previous turns. This turn-level data collection strategy yields a total of 1200 dialogs: 600 training, 200 validation, and 400 test dialogs. The known and unknown value types for each slot in WOZ2.0 and the proportion of unknown slot value types in the test set are shown in Table 2.

Table 1 Statistics of DSTC3 dataset
Table 2 Statistics of WOZ2.0 dataset
(3) DSTC2 is a human–machine dialog dataset in the restaurant search domain, also collected using Amazon Mechanical Turk. According to the official split, the training, validation, and test sets of DSTC2 contain 1612, 506, and 1117 dialogs, respectively. Although DSTC2 does not suffer from the problem of unknown slot values, we also run our model on it for comparison with the model of Mrkšić et al. [12].

3.2 Evaluation

Accuracy is used as the criterion for performance evaluation: the ratio of the number of dialogs whose states are correctly predicted to the total number of dialogs.

$$\begin{aligned} Accuracy = \frac{number\;of\;dialogs\;with\;correctly\;predicted\;states}{total\;number\;of\;dialogs} \end{aligned}$$
(15)

3.3 Parameters

In the first-level detection model, the dimension of the word vectors, initialized randomly, is 50. The model is trained with stochastic gradient descent [27] with momentum 0.9 and learning rate 0.01, and the batch size is set to 5. Filter widths of 3, 5, 7, and 9 are adopted in the bottom CNN layer, each with 100 feature maps, and the 100 features obtained for each filter width are concatenated as the representation of the corresponding turn. The number of hidden nodes in the upper LSTM layer is set to 64. Except for the initialization method and the dimension of the word vectors, the update scheme for known slot values uses the same parameters as the first-level detection model: its word vectors have dimension 300 and are initialized with semantically specialized Paragram-SL999 vectors [28].
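For reference, the hyperparameters above can be collected as follows. The dictionary layout and the placeholder module are ours; the numbers are those stated in this section, and PyTorch's SGD with momentum is assumed to match the optimizer of [27].

```python
import torch
import torch.nn as nn

# Hyperparameters as stated above; the placeholder module only serves to show
# the optimizer wiring, not the actual CNN+LSTM tracker.
detection_cfg = dict(word_dim=50, filter_widths=(3, 5, 7, 9), feature_maps=100,
                     lstm_hidden=64, lr=0.01, momentum=0.9, batch_size=5)
known_cfg = dict(detection_cfg, word_dim=300)   # Paragram-SL999 initialization

model = nn.Linear(10, 2)  # stand-in for the detection model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=detection_cfg["lr"],
                            momentum=detection_cfg["momentum"])
```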

4 Experiment and Analysis

4.1 Experimental Results

Experimental results of the hierarchical dialog state tracking model and other models trained and evaluated on the DSTC and WOZ2.0 datasets are shown in Table 3. “Pricerange”, “food”, and “area” denote the three slots of the datasets, “all” denotes the average accuracy over the three slots, and “joint” is the percentage of dialogs with all three slots correctly predicted. The best performance on each dataset is shown in bold.

Table 3 DSTC and WOZ2.0 test set accuracies of DST models

As can be seen from Table 3, the accuracy of the proposed HDSTM is higher than that of the model of Henderson et al. [19], which focuses on unseen slot values, and that of Kadlec et al. [23], which derives several heuristics from existing NLU results but in fact does nothing about unknown slot values. To the best of our knowledge, the HDSTM sets a new state-of-the-art result on the DSTC3 dataset. The model of Mrkšić et al. [12] holds the state-of-the-art results on the WOZ2.0 and DSTC2 datasets; we evaluated our model on these datasets as well and found that the HDSTM achieves comparable performance. This is because WOZ2.0 and DSTC2 are not designed around unknown slot values: in WOZ2.0 there are few unknown slot values for the slots “food” and “area” and none for “pricerange”, and DSTC2 does not suffer from the problem at all. The experimental results therefore show that our model applies not only to datasets with unknown slot values but also to regular datasets without them. Statistical significance tests based on five-fold cross validation on WOZ2.0 and DSTC2 support the same conclusions.

4.2 Dialog State Tracking vs. Unknown Slot Values

The model's ability to identify unknown slot values plays a significant role in DST when unknown slot values appear. Table 4 gives an example: “cuban” is an unknown value for slot “food” in the WOZ2.0 dataset. When “cuban” occurs, the model of Mrkšić et al. [12] predicts the known slot value Null, a mistake, because it cannot process unknown slot values, while our model gives the correct result (highlighted in bold in Table 4).

Table 4 Identification results for unknown slot value “cuban”

As can be seen from Table 4, classification-based models such as that of Mrkšić et al. [12] are inherently incapable of handling unknown values, while our model handles them effectively. We also ran our model and the model of Mrkšić et al. [12] on datasets with increasing numbers of unknown slot values and found that as this number grows, the performance of the model of Mrkšić et al. [12] drops rapidly, while the performance of our model degrades only gradually. Since neither DSTC nor WOZ2.0 naturally provides varying numbers of unknown slot values, we pick the “food” slot of WOZ2.0 and simulate them: we gradually select increasing numbers of food types in the training set as unknown and discard all training instances whose correct food type is among the selected unknown types (a sketch of this protocol follows Fig. 4). The accuracy of our model and of the model of Mrkšić et al. [12] on datasets with increasing numbers of unknown slot values is shown in Fig. 4; the results confirm our point of view.

Fig. 4 Accuracy of models on the datasets with increasing numbers of unknown slot values
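A sketch of this hold-out protocol, assuming training examples are (utterance, food label) pairs (the data format is illustrative):

```python
def hold_out_unknowns(train_examples, unknown_food_types):
    """Treat the selected food types as unknown and drop every training
    instance whose gold food type is one of them (Sect. 4.2 protocol)."""
    unknown = set(unknown_food_types)
    return [(utt, label) for utt, label in train_examples if label not in unknown]

train = [("serving cuban food", "cuban"), ("a chinese place", "chinese")]
print(hold_out_unknowns(train, ["cuban"]))   # keeps only the 'chinese' instance
```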

The benefits of our model are evident when a large number of unknown slot values emerges; the result on the DSTC3 dataset is a good example. As can be seen from Table 1, if the problem of unknown slot values were not solved, at least 50.76% of the dialog states in the DSTC3 test set would be judged wrongly. The joint dialog state tracking accuracy is 72.71%, of which 32.49% comes from unknown slot values. The overall state tracking performance is thus greatly improved by the accurate detection and distinction of unknown slot values in the DSTC3 dataset.

4.3 Influence of Negative Examples

Does the method of constructing negative examples adopted in this paper affect the performance on known slot values, and are the experimental results influenced by the number of negative examples chosen? To answer these questions, we first analyze the impact of the negative examples on the performance for known slot values, and then design experiments to check whether selecting different numbers of negative examples affects the results on the DSTC and WOZ2.0 datasets.

4.3.1 Influence on the Performance for Known Slot Values

The experimental results for models at each level in the hierarchical dialog state tracking framework on DSTC and WOZ2.0 datasets are shown in Table 5.

Table 5 Accuracy of each level model in HDSTM

The performance of the model at each level is good. The experimental results for N show that unknown slot values can be well distinguished by querying the knowledge base, whether they are simulated or real. Since the occurrence of unknown slot values is effectively detected by the detection model D and the simulated unknown slot values are well distinguished by querying the knowledge base, the adopted method of constructing negative examples has little influence on the performance for known slot values.

The experimental results on WOZ2.0 and DSTC2 in Table 3 also illustrate this point. As Table 2 shows, there are almost no unknown slot values in the WOZ2.0 and DSTC2 datasets, so the results on these datasets can be viewed as results for processing known slot values only. It can again be concluded that the adopted method of constructing negative examples has little influence on the performance for known slot values.

4.3.2 Influence on the Overall Performance

We analyze whether the experimental results are sensitive to the number of negative examples, taking slot “food” of WOZ2.0 as an example. First, the known slot values are sorted by the amount of their training data in descending order. Then the last 23, 24, 25, and so on up to 70 known slot values are simulated as unknown slot values; the last 23 known slot values account for 2.64% of the training data, and the last 70 account for 49.57%. The accuracy of the hierarchical dialog state tracking model when simulating different numbers of unknown slot values for slot “food” on the WOZ2.0 dataset is shown in Fig. 5.

Fig. 5 Influence of selecting different numbers of negative examples

It can be observed from Fig. 5 that the curve for slot “food” is smooth, with an amplitude of variation below 0.06. This shows that the experimental results are not sensitive to the number of negative examples and remain stable over a wide range. The curves for the other slots on their corresponding datasets show the same trend within a certain percentage of the training data.

5 Conclusions

This paper proposes a hierarchical dialog state tracking framework to model dialog state tracking with unknown slot values. Experimental results on the DSTC and WOZ2.0 datasets show that the proposed framework achieves good performance. In particular, the discovery and distinction of unknown slot values greatly improve the final dialog state tracking performance, illustrating the effectiveness of the proposed framework for addressing unknown slot values. Moreover, the model applies not only to datasets with unknown slot values but also to regular datasets without them.

Unknown slot values can be effectively detected by constructing negative examples. Analysis of the experimental results shows that the adopted method of constructing negative examples has little influence on the performance for known slot values. Moreover, as long as the selected negative examples account for a certain percentage of the training data, the number of negative examples chosen has no obvious influence on the experimental results. This demonstrates the generality and applicability of the HDSTM.

However, the models at each level of the hierarchical dialog state tracking framework are currently separate. Integrated training of the whole framework and its influence on the experimental results remain to be explored.