1 Introduction

To improve Spoken Dialogue Systems (SDSs), which allow humans to communicate with computer systems via speech, the Interaction Quality (IQ) metric was designed. This metric, proposed in [15, 17], was conceived as an indicator for SDSs that can reflect problematic situations during the interaction. Later, the metric was adapted to human-human communication (HHC) [21], under the assumption that HHC and human-computer spoken interaction (HCSI) resemble each other and that the results of such an adaptation may therefore be used for further improvement of SDSs.

The IQ model for HHC is based on more than 1,200 features describing the agent's, the customer's, and overlapping speech, as well as the dialogue itself [22]. All of these features can be extracted automatically. The features for IQ modelling are subdivided into three parameter levels: exchange, window, and dialogue [17, 22].

To reduce the computational complexity in terms of total feature extraction time and algorithm speed (for IQ modelling), we have tried to reduce the number of features by analysing the significance of each parameter level's contribution. For this research we designed IQ models by applying several classification algorithms, implemented in RapidMiner and WEKA [10], to different data sets (e.g. a data set containing only the features from the exchange and window levels).

The remainder of this paper is structured as follows: Sect. 2 gives a brief description of IQ and the results of its modelling for HCSI using the different interaction parameter levels and their combinations. Section 3 briefly describes the spoken corpus on which all computations were conducted, followed by a description of the formulated classification problems and the utilised algorithms in Sect. 4. Section 5 presents the obtained results, which are then discussed in Sect. 6. Finally, Sect. 7 concludes the paper and outlines future work.

2 Related Work

The idea of the IQ paradigm was introduced in [15, 17] for assessing SDS performance during an ongoing interaction. Originally, the paradigm was derived from the concept of User/Customer Satisfaction (CS), which is widely used in different spheres. CS is usually assessed manually by customers at the end of calls/transactions in various surveys. In contrast to CS, the IQ metric allows an SDS's performance to be evaluated at any point during the interaction. The IQ model for HCSI is based on features from three parameter levels. The first, the exchange level, consists of information about the current system-user exchange. The second, the window level, comprises features (some statistics) from the n last exchanges. The third, the dialogue level, describes the complete dialogue up to the current exchange [17]. The complete list of features can be found in [16, 17].

To better understand each level's contribution to the overall estimation performance, different experiments based on the features from each parameter level and their combinations were conducted [23].

The results described in [23] show that the best performance in terms of Unweighted Average Recall (UAR) [14], linearly weighted Cohen's Kappa [4, 5], and Spearman's Rho [20] was achieved using all parameters. These experiments also revealed that the parameters from the window level play an important role in the overall performance.

3 Corpus Description

The experiments described in this paper were conducted on the spoken corpus [22], which consists of 53 task-oriented dialogues between employees and customers. After manual diarisation, all dialogues were split into 1,165 agent-customer exchanges.

Each agent-customer exchange is described by more than 1,200 features, which reflect the different interaction parameter levels: exchange, window, and dialogue. The exchange-level features contain acoustic attributes extracted by openSMILE [7] for agent/customer/overlapping speech (the feature vector used in the InterSpeech 2009 Emotion Challenge, comprising 384 attributes [18]), information about speech and pause durations, manually annotated emotions, and others.

In turn, the dialogue and window levels are represented in the corpus by features such as:

  • the total/mean duration of an exchange,

  • the total/mean duration and the percentage of the duration of agent speech, customer speech, overlapping speech, and pauses between turns,

  • the total pause duration between exchanges,

  • the total/mean number of the fragments with speech overlaps,

  • the number of exchanges in which the first speaker is the agent, the customer, or overlapping speech.

The window level covers the three last exchanges with respect to the current exchange.
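As an illustration, window-level statistics of this kind can be sketched as follows. This is a minimal example, not the authors' extraction code; the feature names and the duration-based statistic are assumptions:

```python
def window_features(durations, n=3):
    """Total and mean exchange duration over the last n exchanges
    (hypothetical window-level features)."""
    window = durations[-n:]          # the n most recent exchanges
    return {"win_total_duration": sum(window),
            "win_mean_duration": sum(window) / len(window)}
```

For each new exchange, such statistics are recomputed over the sliding window, so the features adapt to the local course of the dialogue.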

3.1 Interaction Quality

Each observation in the corpus, i.e. each agent-customer exchange, was annotated with two IQ score labels. The two types of IQ assessment are based on different IQ-labelling guidelines, which can be found in [21].

The first approach uses an absolute scale similar to the IQ score annotation guideline for HCSI [16]. Although this scale consists of five indicators (1-bad, 2-poor, 3-fair, 4-good, 5-excellent), only three classes (the IQ scores “3”, “4”, and “5”) are present in this corpus. The largest part (96.39%) of all observations belongs to the class with the IQ score “5”, while the smallest class (IQ score “3”) covers only four observations. We will denote the first approach as IQ1.

In contrast to the first approach, the second approach relies on a scale of changes, which is subsequently transformed into an absolute scale. The scale of changes comprises the following scores: “−2”, “−1”, “0”, “1”, “2”, and “1_abs” (the last score is already on the absolute scale). Then, under the assumption from the first approach that all dialogues start with the IQ score “5” (on the absolute scale), the obtained labels were converted into an absolute scale. As a result, we obtained four scores: “6”, “5”, “4”, and “3”. The majority class (IQ score “5”) covers 88.24% of all exchanges, while the second largest class (“6”) comprises 8.24% of all observations. The minority class “3” again contains four exchanges. We will refer to the second approach as IQ2.
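A sketch of this conversion is given below. It is our reading of the guideline, not the authors' code; in particular, treating a score like "1_abs" as a reset to that absolute value is an assumption:

```python
def to_absolute(changes, start=5):
    """Convert per-exchange change scores into absolute IQ scores,
    assuming every dialogue starts at IQ = 5 (absolute scale)."""
    absolute, current = [], start
    for score in changes:
        if score.endswith("_abs"):           # e.g. "1_abs": already absolute
            current = int(score.split("_")[0])
        else:                                # relative change: -2 .. +2
            current += int(score)
        absolute.append(current)
    return absolute
```

For example, the change sequence "1", "0", "-1", "-1" yields the absolute scores 6, 6, 5, 4, which matches the four-score range reported above.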

3.2 Emotions

Each agent/customer turn was annotated with three different emotion labels. For this labelling we chose three sets from [19], which were then adapted for IQ modelling. The set em1 contains the categories angry, sad, neutral, and happy. The next set, em2, extends em1 with the categories disgust/irritation and boredom. In the last emotion set, em3, the category “boredom” of em2 is replaced by the category “surprise”. It should be pointed out that not all categories of each set are present in this corpus.

Each set (em{1, 2, 3}) was then subdivided into neutral vs. other emotions (denoted em{1, 2, 3}2) and into negative, neutral, and positive emotions (denoted em{1, 2, 3}3). This decomposition was performed to understand how complex an emotion set is required for good IQ prediction.
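The decomposition of an emotion set into its two-class and three-class variants can be sketched as follows; the valence assigned to each category is our assumption, shown here for em1:

```python
# Hypothetical valence mapping for the em1 categories.
VALENCE = {"angry": "negative", "sad": "negative",
           "neutral": "neutral", "happy": "positive"}

def two_class(label):
    """em{1,2,3}2: neutral vs. all other emotions."""
    return "neutral" if label == "neutral" else "other"

def three_class(label):
    """em{1,2,3}3: negative / neutral / positive valence."""
    return VALENCE[label]
```

Applying these mappings to the turn-level annotations produces the coarser label sets without re-annotating the corpus.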

4 Experimental Setup

The IQ score estimation task can be formulated as a classification problem, in our case with three classes for IQ1 and four classes for IQ2. In total, our research covers eighteen different sets, each a combination of an IQ label (IQ1 or IQ2) and an emotion set (nine sets: the three main sets and the two sets derived from each of them). Hereinafter we call these combinations tasks.

Instead of using all possible combinations of the different parameter levels, as in [23], we carried out our computations on four sets:

  • the exchange, window, dialogue levels,

  • the exchange and window levels,

  • the exchange and dialogue levels,

  • the exchange level.

The reason is that the features from the window and dialogue levels alone do not contain sufficient information for describing the interaction in HHC, which is why the exchange level is included in every set.
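Selecting these four subsets from the full feature table can be sketched as follows; the per-feature level assignment and the feature names are illustrative:

```python
# Illustrative mapping from feature name to its parameter level.
LEVEL_OF = {"agent_pitch_mean": "exchange",
            "win_total_duration": "window",
            "dlg_overlap_count": "dialogue"}

# The four evaluated level combinations (exchange level always included).
COMBINATIONS = [("exchange", "window", "dialogue"),
                ("exchange", "window"),
                ("exchange", "dialogue"),
                ("exchange",)]

def select_levels(features, levels):
    """Keep only the features belonging to the given parameter levels."""
    return {name: value for name, value in features.items()
            if LEVEL_OF.get(name) in levels}
```

Running each classifier on every entry of `COMBINATIONS` then yields the comparison of level contributions described above.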

For the experiments the following classification algorithms were chosen: the Kernel Naive Bayes classifier (NBK) [11], the k-Nearest Neighbours algorithm (kNN) [25], L2-regularised Logistic Regression (LR) [3], and Support Vector Machines [6, 24] trained by Sequential Minimal Optimisation (SVM) [13].

To assess classification performance, we performed 10-fold cross-validation to obtain statistically reliable results. We split the data into training and testing sets and then introduced an additional, inner 10-fold cross-validation on the training sets. This inner cross-validation was used for grid-search optimisation of the classification algorithms' parameters, maximising the \(F_{1}\)-score [9].
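A minimal scikit-learn sketch of such a nested cross-validation is shown below; the paper's computations used RapidMiner and WEKA, and the data and the tuned parameter grid here are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the corpus features and IQ labels.
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)

# Inner 10-fold CV: grid search over k, maximising the macro F1-score.
inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 3, 5, 7]},
                     scoring="f1_macro", cv=10)

# Outer 10-fold CV: performance estimate of the tuned model.
outer_scores = cross_val_score(inner, X, y, scoring="f1_macro", cv=10)
```

Keeping parameter tuning strictly inside the inner loop prevents the optimisation from leaking information about the outer test folds.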

For dimensionality reduction we employed the data transformation technique Principal Component Analysis (PCA) [1] with a fixed cumulative variance of 0.99. The data were pre-processed beforehand: each column (the values of one attribute) was statistically normalised, so that its mean equals 0 and its variance equals 1.
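With scikit-learn, this pre-processing and reduction step can be sketched as follows (synthetic data; the paper used RapidMiner/WEKA):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))      # stand-in for the ~1200-feature table

# z-normalise each column, then keep enough principal components
# to explain 99% of the cumulative variance.
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.99))
X_reduced = pipe.fit_transform(X)
```

Passing a float in (0, 1) as `n_components` makes scikit-learn's PCA select the smallest number of components whose explained variance exceeds that fraction.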

Furthermore, all non-numeric attributes, such as emotions, speaker gender, and “who starts an exchange”, were transformed into numeric attributes using dummy coding, which replaces a nominal attribute with m categories by m new binary attributes containing 0 or 1. These values reflect the absence or presence of the respective category for each observation.
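A minimal sketch of dummy coding as described above:

```python
def dummy_code(values):
    """Replace a nominal attribute with m categories by m binary
    indicator attributes (1 = category present, 0 = absent)."""
    categories = sorted(set(values))
    return [[1 if value == cat else 0 for cat in categories]
            for value in values]
```

For instance, a "who starts an exchange" column with the values agent/customer becomes two 0/1 columns, one per category.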

5 Results

The assessment of the classification algorithms' performance is based on classification performance measures such as accuracy, Unweighted Average Recall [14], and the \(F_{1}\)-score, averaged over ten computations on different train-test splits. In this paper, however, we report the \(F_{1}\)-score as the main classification performance measure for this study.

Part of the results, namely those for the classification task with em13 for the different combinations of the interaction parameter levels, is depicted in Fig. 1. The same results for all classification problems, for both IQ1 and IQ2, are presented in Table 1. The results in Fig. 1 and Table 1 were achieved with the kNN algorithm.

Fig. 1. kNN performance in \(F_{1}\)-score for the different combinations of the parameter levels for the emotion set em13.

Table 1. kNN performance in \(F_{1}\)-score for the different combinations of the parameter levels.

Regarding numerical evaluations in terms of accuracy, the best results were obtained with kNN and LR.

6 Discussion

To determine statistically significant differences between the obtained results, we relied on one-way analysis of variance (one-way ANOVA) [2] and Tukey's honest significant difference (HSD) test [12] with the default settings of the R programming language. The one-way ANOVA determined that the differences between the means are statistically significant for IQ1 and IQ2 across almost all classification performance measures and classification problems. To find out which algorithms gave statistically significantly different results, we used Tukey's HSD test. This test revealed that in almost all cases there are statistically significant differences between the results of NBK and those of the other algorithms.
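The ANOVA step can be sketched with SciPy (the paper used R; the per-fold scores below are invented for illustration):

```python
from scipy.stats import f_oneway

# Hypothetical per-fold F1-scores of three classifiers.
nbk = [0.42, 0.44, 0.40, 0.43, 0.41]
knn = [0.55, 0.57, 0.54, 0.56, 0.58]
lr  = [0.54, 0.56, 0.53, 0.55, 0.57]

# One-way ANOVA: do the group means differ significantly?
stat, p_value = f_oneway(nbk, knn, lr)
```

A small p-value only indicates that some group means differ; a post-hoc test such as Tukey's HSD is then needed to locate the differing pairs.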

Moreover, we applied these tests to determine statistically significant differences between the results obtained with the kNN algorithm on the different combinations of the interaction parameter levels. Here, the one-way ANOVA found no statistically significant differences between the results.

However, it should be mentioned that for IQ1, in almost all classification problems, the models trained on data excluding the dialogue level performed better than those using the dialogue parameter level. In turn, for IQ2, in almost all classification problems, excluding the window and dialogue levels simultaneously led to a decrease in performance.

The baseline accuracies (a classifier that always predicts the majority class) for IQ1 and IQ2 are 0.964 and 0.882, respectively. The corresponding \(F_{1}\)-score baselines are 0.327 and 0.234.
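These \(F_{1}\)-score baselines can be recovered from the class shares: with a majority-class share a, the majority class gets an \(F_{1}\) of 2a/(a+1) (precision a, recall 1) and every other class gets 0, so the macro-averaged \(F_{1}\) is 2a/((a+1)k) for k classes. This derivation is our reconstruction, but it reproduces the reported values:

```python
def majority_baseline_f1(majority_share, n_classes):
    """Macro F1 of always predicting the majority class: that class has
    precision = majority_share and recall = 1; all others have F1 = 0."""
    f1_majority = 2 * majority_share / (majority_share + 1)
    return f1_majority / n_classes
```

With a = 0.9639 and k = 3 (IQ1) this gives approximately 0.327, and with a = 0.8824 and k = 4 (IQ2) approximately 0.234.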

Given that the data are highly imbalanced, the achieved results are not fully convincing, although in almost all classification performance measures and classification problems the obtained results outperform the baselines. For some algorithms, however, the results do not outperform the baseline in terms of accuracy.

Interestingly, the best results in terms of \(F_{1}\)-score were in all cases achieved with the kNN algorithm. To determine whether the results obtained with kNN and LR in terms of accuracy differ statistically significantly from the baselines, Student's t-test [8] was employed. In the case of IQ2, the p-value is less than 0.007 for all tasks and both algorithms. For the kNN model on IQ1, the p-value exceeds 0.15. Concerning the LR-based model results for IQ1, for some tasks the p-values exceed 0.05.
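Such a comparison against a fixed baseline can be sketched as a one-sample t-test; the accuracy values below are invented, and the paper's exact test setup is not specified beyond citing [8]:

```python
from scipy.stats import ttest_1samp

baseline = 0.882                     # IQ2 majority-class accuracy
# Hypothetical per-split accuracies of a tuned classifier.
accuracies = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.93, 0.90, 0.91, 0.92]

# Does the mean accuracy differ significantly from the baseline?
stat, p_value = ttest_1samp(accuracies, popmean=baseline)
```

A positive test statistic with a small p-value indicates that the classifier's mean accuracy significantly exceeds the baseline.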

Hence, from the results of Student's t-test between the obtained results (in terms of accuracy) and the baseline (0.964) for IQ1, we conclude that statistically significant results were achieved with LR, but not for all tasks. In almost all cases, using only the exchange parameter level is not sufficient. Splitting the emotion sets into two classes did not lead to significantly better results.

In turn, for IQ2 the obtained results in terms of accuracy statistically significantly outperform the baseline for both algorithms, kNN and LR, across all classification problems.

It should be mentioned that applying PCA reduced the number of features by a factor of approximately 2.5 (from roughly 1,200 to 470) and consequently increased the computational speed in terms of execution time.

7 Conclusions and Future Work

In this paper we have analysed the significance of the different interaction parameter levels for the task of IQ modelling for HHC. Our research has revealed the impact of the different interaction parameter levels and their combinations on IQ modelling for HHC. The differences between the feature lists for IQ modelling for HCSI and HHC did not lead to the same results. This might partially be explained by the fact that the corpus used is highly imbalanced and that all labels were annotated by only one expert rater.

As a future direction, we plan to apply ensemble-based classifiers for predicting the IQ score. Given the rather high dimensionality of the feature space, other dimensionality reduction methods might be helpful. Moreover, considering the highly imbalanced classes, approaches for multiclass imbalanced data should be explored.