
1 Introduction

The task of a retrieval-based conversation system is to select the most appropriate response from a set of responses given the input context of a conversation. The context is typically an utterance or a sequence of utterances produced by a human or by the system itself. Most of the state-of-the-art approaches to retrieval-based conversation systems are based on deep neural networks (NNs) [14, 16]. Under these approaches, the typical response selection pipeline consists of the following steps [2]:

  1.

    Encode the given context and pre-defined response candidates into numeric vectors, or thought vectors, using NNs;

  2.

    Compute the value of a matching function (matching score) for each pair consisting of the context vector and a response candidate vector;

  3.

    Select the response candidate with the highest matching score.

During step 1, in order to obtain thought vectors that faithfully represent the semantics of input contexts and responses, the conversation model is trained beforehand to return high matching scores for true context-response pairs and low scores for false ones.
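To make the pipeline concrete, the following Python sketch shows steps 1-3; the `encode_context` and `encode_response` functions are hypothetical stand-ins for trained encoders, and cosine similarity is used here as one possible matching function:

```python
import numpy as np

def cosine(u, v):
    """Matching function: cosine similarity between two thought vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def select_response(context, candidates, encode_context, encode_response):
    """Steps 1-3 of the pipeline: encode, score every candidate, pick the best."""
    c_vec = np.asarray(encode_context(context))                 # step 1: context thought vector
    scores = [cosine(c_vec, np.asarray(encode_response(r)))     # step 2: matching scores
              for r in candidates]
    return candidates[int(np.argmax(scores))]                   # step 3: highest-scoring candidate
```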

The challenge we faced while building the above pipeline was that the resulting model often returned high matching scores for semantically similar contexts and responses. Consequently, the model frequently repeated or rephrased input contexts instead of giving quality responses.

Consider the following conversations:

  A.

    Context: “What is the purpose of living?”

    Response: “What is the purpose of existence?”

  B.

    Context: “What is the purpose of living?”

    Response: “It’s a very philosophical question.”

The effect of rephrasing, or echoing, in conversation A, in contrast to the appropriate response in conversation B, can be explained by the above pipeline. It results from the fact that contexts and responses often contain the same concepts [4, 13]; hence, during training on conversational datasets, the NNs simply end up trying to fit the semantics of the input. A similar effect, named “lexical repetition”, was also observed in [9].

In this paper, we suggest a simple and natural solution to the echoing problem for end-to-end retrieval-based conversation systems. Our solution is based on a widely used hard negative mining approach [10], which forces the conversation model to produce low matching scores for similar contexts and responses.

The paper is organized as follows. First, we describe the hard negative mining method and how we utilize it to overcome the echoing problem. Then, we introduce the evaluation metrics, our results and benchmarks for the echoing problem. We also provide the evaluation dataset used in the experiments for further research.

2 Hard Negative Mining

Let \(D=\{(c_i, r_i)\}\), \(i \in \{1, \dots, N\}\), be a dataset of conversational context-response pairs, where \(c_i\) and \(r_i\) are the i-th context and response, respectively.

Our goal is to build a conversation model that satisfies the following condition:

$$\begin{aligned} M(c_i, r_i) > M(c_i, r_j) \end{aligned}$$
(1)

for all \(i\) and all \(j \ne i\) such that \(r_j\) is not an appropriate response for \(c_i\). In other words, the resulting model should return a higher matching score for appropriate responses than for inappropriate ones.

To train this model, we also need false context-response pairs as negative examples in addition to the positive ones presented in D. Consider two approaches to obtain the negative pairs: random sampling and hard negative mining. Under the first approach, we randomly select \(r_j\) from D for each \(c_i\). If D is large and diverse enough, then a randomly selected \(r_j\) is almost always inappropriate for a corresponding \(c_i\).

In contrast to random sampling, hard negative mining imposes a special constraint on responses selected as negatives. Let \(M_0\) be a conversation model trained on random pairs used as negative training examples. Then, we search for a new set of negative pairs \((c_i, r_j)\), so that their matching score satisfies the following condition:

$$\begin{aligned} M_0(c_i, r_i)-M_0(c_i, r_j) \le m \end{aligned}$$
(2)

where m is a margin (hyperparameter) between the scores of positive and negative pairs [3]. The new set of pairs is used to train the next model \(M_1\), which, in turn, is used to search for negative pairs to train \(M_2\), and so on [1].

The intuitive idea behind hard negative mining is to select only negatives that have relatively high matching scores, and thus can be interpreted as errors of the conversation model. As a result, the model converges faster compared to random sampling [10].

Following this intuition, we can solve the echoing problem by considering contexts themselves as possible responses, so that the pairs \((c_i, c_i)\) can be selected as hard negatives (see the sketch below). In the next section, we demonstrate that this approach can ultimately prevent the conversation model from assigning a high rank to responses that are similar to contexts.
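As an illustrative sketch (not necessarily the exact procedure used here), hard negatives satisfying condition (2) could be mined from a candidate pool that also contains the contexts themselves; the callable `score` plays the role of the previously trained model \(M_0\):

```python
def mine_hard_negatives(contexts, responses, score, margin):
    """Collect hard negative pairs (c_i, cand) satisfying condition (2):
    M0(c_i, r_i) - M0(c_i, cand) <= margin.
    The candidate pool includes the contexts themselves, so echo pairs
    (c_i, c_i) can be selected as hard negatives."""
    candidates = responses + contexts
    hard_pairs = []
    for c, r_pos in zip(contexts, responses):
        pos_score = score(c, r_pos)
        for cand in candidates:
            if cand != r_pos and pos_score - score(c, cand) <= margin:
                hard_pairs.append((c, cand))
    return hard_pairs
```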

3 Experiments

For our experiments, we implement a model similar to the Basic QA-LSTM described in [12]. It has two bidirectional LSTMs of size 2048 (1024 units in each direction) with separate sets of weights, which encode a context and a response independently. We apply max pooling over the LSTM outputs to obtain the final thought vectors, and use cosine similarity as the output matching function. Input words are represented as embeddings of size 256, initialized with pre-trained word2vec vectors [8] and not updated during model training. Word sequences longer than 20 words are trimmed from the right, and the context encoder is fed with only one dialog step at a time.
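A minimal PyTorch sketch of such a dual-encoder model is given below; the hyperparameters follow the description above, while the class and variable names, and the use of `torch.nn.LSTM`, are illustrative assumptions rather than the original implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Two independent bidirectional LSTM encoders with max pooling and cosine matching."""
    def __init__(self, vocab_size, emb_dim=256, hidden=1024, pretrained_emb=None):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        if pretrained_emb is not None:               # initialize from pre-trained word2vec vectors
            self.emb.weight.data.copy_(pretrained_emb)
        self.emb.weight.requires_grad = False        # embeddings are not updated during training
        # separate weights for the context and response encoders (1024 units per direction)
        self.context_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.response_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)

    def _encode(self, lstm, tokens):
        out, _ = lstm(self.emb(tokens))              # (batch, seq_len <= 20, 2 * hidden)
        return out.max(dim=1).values                 # max pooling over time -> thought vector

    def forward(self, context_tokens, response_tokens):
        c = self._encode(self.context_lstm, context_tokens)
        r = self._encode(self.response_lstm, response_tokens)
        return F.cosine_similarity(c, r, dim=-1)     # matching score
```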

3.1 Models

In order to study the impact of hard negative mining on the echoing problem, we train three models using the following strategies: random negative sampling (\( RN \)), hard negative mining based on responses only (\( HN_r \)), and hard negative mining based on both responses and contexts (\( HN_{r+c} \)). We also consider the following baseline approach (\( BL \)): we use the \( RN \) model to rank responses at the testing stage and then simply filter out responses equal to the given context.
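For illustration, the \( BL \) post-filtering could be as simple as the following sketch (names are assumptions):

```python
def baseline_filter(context, ranked_candidates):
    """BL approach: rank candidates with the RN model, then drop any candidate
    that is equal to the given context before returning the top results."""
    return [cand for cand in ranked_candidates if cand != context]
```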

3.2 Datasets

We train the models on 79M tweet-reply pairs from a Twitter data archive (see Footnote 1).

We perform the evaluation on our own dataset (see Footnote 2). This dataset consists of 759 context-response pairs from human text conversations, where both the context and the response consist of a single sentence (see Table 1). We split the dataset into validation and test subsets of 250 and 509 pairs, respectively. We use this dataset because it is clean, diverse, and covers multiple topics of real-life conversations. We also find it suitable for validating the echoing problem, as well as for estimating the overall model quality.

Table 1. Evaluation dataset sample (see Sect. 3.2)

3.3 Training

The models are trained with the Adam optimizer [5] with the size of mini-batches set to 512. Intermediate models that show the highest values of the Average Precision metric on the validation set (see Sect. 3.4) are selected as the resulting models.

We use a triplet loss [3] as an objective function:

$$\begin{aligned} max(0, m - M(c_i, r_i) + M(c_i, r_j)) \end{aligned}$$
(3)

where the margin m is set to 0.05. For each positive pair \((c_i, r_i)\), a negative \((c_i, r_j)\) is selected only within the current mini-batch, using the intermediate model M obtained by the time this batch is processed. We select only the hard negative \(r_j\) with the highest matching score \(M(c_i, r_j)\) that satisfies the following condition:

$$\begin{aligned} 0 \le M(c_i, r_i) - M(c_i, r_j) \le m \end{aligned}$$
(4)

The constraint \(0 \le M(c_i, r_i) - M(c_i, r_j)\) is used to filter out the “hardest” negatives, which in practice hinder convergence and lead to bad local optima [10].
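A sketch of this in-batch selection combined with the triplet loss (3) might look as follows; it assumes a precomputed mini-batch score matrix, and for the \( HN_{r+c} \) setting the candidate columns would additionally include the batch contexts:

```python
import torch

def in_batch_triplet_loss(scores, margin=0.05):
    """In-batch hard negative mining with the triplet loss (3).

    scores: (B, C) matrix of matching scores M(c_i, candidate_j); the first B
    candidate columns are the batch responses, so scores[i, i] is the positive
    pair M(c_i, r_i). Extra columns (e.g. the batch contexts for HN_{r+c})
    are treated as additional negative candidates. For each row, the hardest
    negative satisfying condition (4) is selected."""
    B = scores.size(0)
    idx = torch.arange(B, device=scores.device)
    pos = scores[idx, idx]                                   # M(c_i, r_i)
    diff = pos.unsqueeze(1) - scores                         # M(c_i, r_i) - M(c_i, cand_j)
    is_pos = torch.zeros_like(scores, dtype=torch.bool)
    is_pos[idx, idx] = True
    valid = (diff >= 0) & (diff <= margin) & ~is_pos         # condition (4)
    neg = scores.masked_fill(~valid, float("-inf")).max(dim=1).values  # hardest valid negative
    has_neg = torch.isfinite(neg)
    if not has_neg.any():                                    # no valid negatives in this batch
        return scores.new_zeros(())
    return torch.clamp(margin - pos[has_neg] + neg[has_neg], min=0).mean()  # eq. (3)
```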

We noticed that while training the \( HN_{r+c} \) model, \((c_i, c_i)\) negative pairs constitute up to 50% of the mini-batch.

3.4 Evaluation Methodology and Metrics

For each \( context_i \) from the evaluation set, we compute matching scores for all available pairs \( (context_i, answer) \), where \( answer \) comes not only from the responses but also from all available contexts. To evaluate these results, we sort the answers by matching score in descending order and compute the following metrics: Average Precision [7], Recall@2, Recall@5, and Recall@10 [6]. The last three metrics are indicator functions that return 1 if the ground-truth response occurs in the top 2, 5, and 10 candidates, respectively. We also introduce the following context echoing metrics:

  • \( rank_{context} \) – position (starting from zero) of the input context in the sorted results. The greater the rank, the less the model tends to return the input context among the top results

  • \( diff_{top} \) – difference between the top result score and the input context score. The greater the difference, the less the model tends to return relatively high scores for the context

  • \( diff_{response} \) – difference between the ground-truth response score and the input context score. The greater the difference, the less the model tends to return similar scores for the ground-truth response and for the context

For each metric, we compute the overall quality as an average across all test contexts. Note that we do not report context echoing metrics for the \( BL \) model, since echo-responses are filtered out of its results.
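For concreteness, the per-context echoing metrics and Recall@k could be computed as in the following plain-Python sketch; `scores` is assumed to map every candidate answer (responses and contexts) to its matching score with the given context:

```python
def echoing_metrics(context, true_response, scores):
    """Compute rank_context, diff_top and diff_response for one test context."""
    ranked = sorted(scores, key=scores.get, reverse=True)     # descending by matching score
    rank_context = ranked.index(context)                      # 0 means the context is ranked first
    diff_top = scores[ranked[0]] - scores[context]            # top score minus context score
    diff_response = scores[true_response] - scores[context]   # ground truth minus context score
    return rank_context, diff_top, diff_response

def recall_at_k(true_response, ranked, k):
    """Indicator metric: 1 if the ground-truth response is among the top-k candidates."""
    return int(true_response in ranked[:k])
```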

3.5 Results

The results of the evaluation based on the test set are presented in Table 2. As we can see, the proposed \( HN_{r+c} \) model achieves the highest values on almost all metrics compared to the other approaches. According to \( rank_{context} \), this model does not tend to rank input contexts highly or place them among the top response candidates. Still, according to the \( diff_{response} \) metric, the average score of a ground-truth response is lower than the score of the context, which means that the context can still be ranked higher than the ground-truth response.

Table 2. Evaluation results based on the context-response test set (See footnote 2).
Table 3. Top 3 responses for a few input contexts sorted by matching score.

We also studied the models’ output. Examples of top-ranked responses for different contexts are presented in Table 3. As we can see, the \( RN \) and \( HN_r \) models often select responses identical or very similar to the input context, while the proposed \( HN_{r+c} \) model selects appropriate responses that are not necessarily semantically similar to the context. Based on this observation, we suggest that the proposed model filters out not only exact copies of the context but also candidates with similar semantics.

4 Related Work

Previous work on dialog systems has paid little attention to the echoing problem. A possible reason is the “soft” evaluation conditions: test samples are constructed from a relatively small number of negative responses [3, 6, 14], which usually do not “echo” the test context. In [9], “lexical repetition” is regularized by utilizing a word-overlap feature while training an SMT-based dialog system. In [11, 13, 14], echoing is avoided by considering only responses whose associated contexts in the dataset have high TF-IDF similarity with the given context. However, the latter approach is not applicable if only a set of responses is available for ranking at the testing stage, which can be the case for some domains and applications [15].

5 Conclusion

In this study, we applied a hard negative mining approach to train a retrieval-based conversation system in order to address the echoing problem, that is, to reduce inappropriate responses that are identical or too similar to the input context. In addition to responses, we considered contexts themselves as possible hard negative candidates. The evaluation shows that the resulting model avoids echoing the input context, tends to select candidates that are more appropriate as responses, and achieves better results in terms of the Average Precision and Recall@N metrics compared to models trained without the proposed approach.