1 Introduction

To improve Spoken Dialogue Systems (SDSs), which allow humans to communicate with computer systems via speech, the Interaction Quality (IQ) metric was designed. This metric, proposed in [15, 17], was conceived as an indicator for SDSs that can reflect problematic situations during the interaction. Later, the metric was adapted to human-human communication (HHC) [21], under the assumption that HHC and human-computer spoken interaction (HCSI) resemble each other and that the results of such an adaptation may therefore be used for further improvement of SDSs.

The IQ model for HHC is based on more than 1,200 features describing the agent's, the customer's, and overlapping speech, as well as the dialogue itself [22]. All of these features can be extracted automatically. The features for IQ modelling are subdivided into three parameter levels: exchange, window, and dialogue [17, 22].

To reduce the computational complexity in terms of total feature extraction time and algorithm speed (for IQ modelling), we have tried to reduce the number of features by analysing the significance of each parameter level's contribution. For this research we designed IQ models by applying several classification algorithms, implemented in RapidMiner and WEKA [10], to different data sets (e.g. a data set containing only the features from the exchange and window levels).

The remainder of this paper is structured as follows: Sect. 2 gives a brief description of IQ and the results of its modelling for HCSI using the different interaction parameter levels and their combinations. Section 3 briefly describes the spoken corpus on which all computations were conducted, followed by a description of the formulated classification problems and the utilised algorithms in Sect. 4. Section 5 presents the obtained results, which are then discussed in Sect. 6. Finally, Sect. 7 concludes the paper and outlines future work.

2 Related Work

The idea of the IQ paradigm was introduced in [15, 17] for assessing SDS performance during an ongoing interaction. Originally, the paradigm was derived from the concept of User/Customer Satisfaction (CS), which is widely used in different spheres. CS is usually assessed manually by customers at the end of calls/transactions in various surveys. In contrast to CS, the IQ metric allows an SDS's performance to be evaluated at any point during the interaction. The IQ model for HCSI is based on features from three parameter levels. The first, the exchange level, consists of information about the current system-user exchange. The second, the window level, comprises features (some statistics) from the n last exchanges. The third, the dialogue level, describes the complete dialogue up to the current exchange [17]. The complete list of features can be found in [16, 17].

To better understand each level's contribution to the overall estimation performance, different experiments based on the features from each parameter level and their combinations were conducted [23].

The results described in [23] show that the best performance in terms of Unweighted Average Recall (UAR) [14], linearly weighted Cohen's Kappa [4, 5], and Spearman's Rho [20] was achieved using all parameters. These experiments also revealed that the parameters from the window level play an important role in the overall performance.

3 Corpus Description

The experiments described in this paper were conducted on the spoken corpus [22], which consists of 53 task-oriented dialogues between employees and customers. After manual diarisation, all dialogues were split into 1,165 agent-customer exchanges.

Each agent-customer exchange is described by more than 1,200 features, which reflect the different interaction parameter levels: exchange, window, and dialogue. The exchange-level features contain acoustic attributes extracted by openSMILE [7] for agent/customer/overlapping speech (the feature vector used in the InterSpeech 2009 Emotion Challenge, comprising 384 attributes [18]), information about speech and pause durations, manually annotated emotions, and others.

In turn, the dialogue and window levels are represented in the corpus by features such as:

  • the total/mean duration of an exchange,

  • the total/mean duration and the percentage of the duration of agent speech, customer speech, overlapping speech, and pauses between turns,

  • the total pause duration between exchanges,

  • the total/mean number of the fragments with speech overlaps,

  • the number of exchanges in which the first speaker is the agent, the customer, or overlapping speech.

The window level covers the three last exchanges with respect to the current exchange.
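As an illustration, window-level statistics of this kind can be sketched as follows. This is a minimal example, not the authors' extraction code; the feature names and the duration-based statistic are assumptions:

```python
def window_features(durations, n=3):
    """Total and mean exchange duration over the last n exchanges
    (hypothetical window-level features)."""
    window = durations[-n:]          # the n most recent exchanges
    return {"win_total_duration": sum(window),
            "win_mean_duration": sum(window) / len(window)}
```

For each new exchange, such statistics are recomputed over the sliding window, so the features adapt to the local course of the dialogue.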

3.1 Interaction Quality

Each observation in the corpus, i.e. each agent-customer exchange, was annotated with two IQ score labels. The two types of IQ assessment are based on different IQ-labelling guidelines, which can be found in [21].

The first approach uses an absolute scale similar to the IQ score annotation guideline for HCSI [16]. Although this scale consists of five indicators (1-bad, 2-poor, 3-fair, 4-good, 5-excellent), only three classes (the IQ scores “3”, “4”, and “5”) are present in this corpus. The largest part (96.39%) of all observations belongs to the class with the IQ score “5”, while the smallest class (IQ score “3”) covers only four observations. We will denote the first approach as IQ1.

In contrast to the first approach, the second approach relies on a scale of changes, which is subsequently transformed into an absolute scale. The scale of changes comprises the following scores: “−2”, “−1”, “0”, “1”, “2”, and “1_abs” (the last score is already on the absolute scale). Then, under the assumption from the first approach that all dialogues start with the IQ score “5” (on the absolute scale), the obtained labels were converted into an absolute scale. As a result, we obtained four scores: “6”, “5”, “4”, and “3”. The majority class (IQ score “5”) covers 88.24% of all exchanges, while the second largest class (“6”) comprises 8.24% of all observations. The minority class “3” again contains four exchanges. We will refer to the second approach as IQ2.
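A sketch of this conversion is given below. It is our reading of the guideline, not the authors' code; in particular, treating a score like "1_abs" as a reset to that absolute value is an assumption:

```python
def to_absolute(changes, start=5):
    """Convert per-exchange change scores into absolute IQ scores,
    assuming every dialogue starts at IQ = 5 (absolute scale)."""
    absolute, current = [], start
    for score in changes:
        if score.endswith("_abs"):           # e.g. "1_abs": already absolute
            current = int(score.split("_")[0])
        else:                                # relative change: -2 .. +2
            current += int(score)
        absolute.append(current)
    return absolute
```

For example, the change sequence "1", "0", "-1", "-1" yields the absolute scores 6, 6, 5, 4, which matches the four-score range reported above.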

3.2 Emotions

Each agent/customer turn was annotated with three different emotion labels. For this labelling we chose three sets from [19], which were then adapted for IQ modelling. The set em1 contains the categories angry, sad, neutral, and happy. The next set, em2, extends em1 with the categories disgust/irritation and boredom. In the last emotion set, em3, the category “boredom” of em2 is replaced by the category “surprise”. It should be pointed out that not all categories of each set are present in this corpus.

Each set (em{1, 2, 3}) was then subdivided into neutral vs. other emotions (denoted em{1, 2, 3}2) and into negative, neutral, and positive emotions (denoted em{1, 2, 3}3). This decomposition was performed to understand how complex an emotion set is required for good IQ prediction.
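The decomposition of an emotion set into its two-class and three-class variants can be sketched as follows; the valence assigned to each category is our assumption, shown here for em1:

```python
# Hypothetical valence mapping for the em1 categories.
VALENCE = {"angry": "negative", "sad": "negative",
           "neutral": "neutral", "happy": "positive"}

def two_class(label):
    """em{1,2,3}2: neutral vs. all other emotions."""
    return "neutral" if label == "neutral" else "other"

def three_class(label):
    """em{1,2,3}3: negative / neutral / positive valence."""
    return VALENCE[label]
```

Applying these mappings to the turn-level annotations produces the coarser label sets without re-annotating the corpus.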

4 Experimental Setup

The IQ score estimation task can be formulated as a classification problem, in our case with three classes for IQ1 and four classes for IQ2. In total, our research covers eighteen different sets, each a combination of an IQ label (IQ1 or IQ2) and an emotion set (nine sets: the three main sets and the two sets derived from each of them). Hereinafter we call these combinations tasks.

Instead of using all possible combinations of the different parameter levels, as in [23], we carried out our computations on four sets:

  • the exchange, window, dialogue levels,

  • the exchange and window levels,

  • the exchange and dialogue levels,

  • the exchange level.

The reason is that the features from the window and dialogue levels alone do not contain sufficient information for describing the interaction in HHC, which is why the exchange level is included in every set.
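Selecting these four subsets from the full feature table can be sketched as follows; the per-feature level assignment and the feature names are illustrative:

```python
# Illustrative mapping from feature name to its parameter level.
LEVEL_OF = {"agent_pitch_mean": "exchange",
            "win_total_duration": "window",
            "dlg_overlap_count": "dialogue"}

# The four evaluated level combinations (exchange level always included).
COMBINATIONS = [("exchange", "window", "dialogue"),
                ("exchange", "window"),
                ("exchange", "dialogue"),
                ("exchange",)]

def select_levels(features, levels):
    """Keep only the features belonging to the given parameter levels."""
    return {name: value for name, value in features.items()
            if LEVEL_OF.get(name) in levels}
```

Running each classifier on every entry of `COMBINATIONS` then yields the comparison of level contributions described above.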

For the experiments the following classification algorithms were chosen: the Kernel Naive Bayes classifier (NBK) [11], the k-Nearest Neighbours algorithm (kNN) [25], L2-regularised Logistic Regression (LR) [3], and Support Vector Machines [6, 24] trained by Sequential Minimal Optimisation (SVM) [13].

To assess classification performance, we performed 10-fold cross-validation to obtain statistically reliable results. We split the data into training and testing sets and then introduced an additional, inner 10-fold cross-validation on the training sets. This inner cross-validation was used for grid-search optimisation of the classification algorithms' parameters, maximising the \(F_{1}\)-score [9].
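A minimal scikit-learn sketch of such a nested cross-validation is shown below; the paper's computations used RapidMiner and WEKA, and the data and the tuned parameter grid here are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the corpus features and IQ labels.
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)

# Inner 10-fold CV: grid search over k, maximising the macro F1-score.
inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 3, 5, 7]},
                     scoring="f1_macro", cv=10)

# Outer 10-fold CV: performance estimate of the tuned model.
outer_scores = cross_val_score(inner, X, y, scoring="f1_macro", cv=10)
```

Keeping parameter tuning strictly inside the inner loop prevents the optimisation from leaking information about the outer test folds.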

For dimensionality reduction we employed the data transformation technique Principal Component Analysis (PCA) [1] with a fixed cumulative variance of 0.99. The data were pre-processed beforehand: each column (the values of one attribute) was statistically normalised, so that its mean equals 0 and its variance equals 1.
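With scikit-learn, this pre-processing and reduction step can be sketched as follows (synthetic data; the paper used RapidMiner/WEKA):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))      # stand-in for the ~1200-feature table

# z-normalise each column, then keep enough principal components
# to explain 99% of the cumulative variance.
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.99))
X_reduced = pipe.fit_transform(X)
```

Passing a float in (0, 1) as `n_components` makes scikit-learn's PCA select the smallest number of components whose explained variance exceeds that fraction.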

Furthermore, all non-numeric attributes, such as emotions, speaker gender, and “who starts an exchange”, were transformed into numeric attributes using dummy coding, which replaces a nominal attribute with m categories by m new binary attributes containing 0 or 1. These values reflect the absence or presence of the respective category for each observation.
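A minimal sketch of dummy coding as described above:

```python
def dummy_code(values):
    """Replace a nominal attribute with m categories by m binary
    indicator attributes (1 = category present, 0 = absent)."""
    categories = sorted(set(values))
    return [[1 if value == cat else 0 for cat in categories]
            for value in values]
```

For instance, a "who starts an exchange" column with the values agent/customer becomes two 0/1 columns, one per category.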

5 Results

The assessment of the classification algorithms' performance is based on classification performance measures such as accuracy, Unweighted Average Recall [14], and the \(F_{1}\)-score, averaged over ten computations on different train-test splits. In this paper, however, we report the \(F_{1}\)-score as the main classification performance measure for this study.

Part of the results, namely those for the classification task with em13 for the different combinations of the interaction parameter levels, is depicted in Fig. 1. The same results for all classification problems, for both IQ1 and IQ2, are presented in Table 1. The results in Fig. 1 and Table 1 were achieved with the kNN algorithm.

Fig. 1. kNN performance in \(F_{1}\)-score for the different combinations of the parameter levels for the emotion set em13.

Table 1. kNN performance in \(F_{1}\)-score for the different combinations of the parameter levels.

Regarding numerical evaluations in terms of accuracy, the best results were obtained with kNN and LR.

6 Discussion

To determine statistically significant differences between the obtained results, we relied on one-way analysis of variance (one-way ANOVA) [2] and Tukey's honest significant difference (HSD) test [12] with the default settings of the R programming language. The one-way ANOVA determined that the differences between the means are statistically significant for IQ1 and IQ2 across almost all classification performance measures and classification problems. To find out which algorithms gave statistically significantly different results, we used Tukey's HSD test. This test revealed that in almost all cases there are statistically significant differences between the results of NBK and those of the other algorithms.
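The ANOVA step can be sketched with SciPy (the paper used R; the per-fold scores below are invented for illustration):

```python
from scipy.stats import f_oneway

# Hypothetical per-fold F1-scores of three classifiers.
nbk = [0.42, 0.44, 0.40, 0.43, 0.41]
knn = [0.55, 0.57, 0.54, 0.56, 0.58]
lr  = [0.54, 0.56, 0.53, 0.55, 0.57]

# One-way ANOVA: do the group means differ significantly?
stat, p_value = f_oneway(nbk, knn, lr)
```

A small p-value only indicates that some group means differ; a post-hoc test such as Tukey's HSD is then needed to locate the differing pairs.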

Moreover, we applied these tests to determine statistically significant differences between the results obtained with the kNN algorithm on the different combinations of the interaction parameter levels. Here, the one-way ANOVA found no statistically significant differences between the results.

However, it should be mentioned that for IQ1, in almost all classification problems, the models trained on data excluding the dialogue level performed better than those using the dialogue parameter level. In turn, for IQ2, in almost all classification problems, excluding the window and dialogue levels simultaneously led to a decrease in performance.

The baseline accuracies (a classifier that always predicts the majority class) for IQ1 and IQ2 are 0.964 and 0.882, respectively. The corresponding \(F_{1}\)-score baselines are 0.327 and 0.234.
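These \(F_{1}\)-score baselines can be recovered from the class shares: with a majority-class share a, the majority class gets an \(F_{1}\) of 2a/(a+1) (precision a, recall 1) and every other class gets 0, so the macro-averaged \(F_{1}\) is 2a/((a+1)k) for k classes. This derivation is our reconstruction, but it reproduces the reported values:

```python
def majority_baseline_f1(majority_share, n_classes):
    """Macro F1 of always predicting the majority class: that class has
    precision = majority_share and recall = 1; all others have F1 = 0."""
    f1_majority = 2 * majority_share / (majority_share + 1)
    return f1_majority / n_classes
```

With a = 0.9639 and k = 3 (IQ1) this gives approximately 0.327, and with a = 0.8824 and k = 4 (IQ2) approximately 0.234.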

Given that the data are highly imbalanced, the achieved results are not fully convincing, although in almost all classification performance measures and classification problems the obtained results outperform the baselines. For some algorithms, however, the results do not outperform the baseline in terms of accuracy.

Interestingly, the best results in terms of \(F_{1}\)-score were in all cases achieved with the kNN algorithm. To determine whether the results obtained with kNN and LR in terms of accuracy differ statistically significantly from the baselines, Student's t-test [8] was employed. In the case of IQ2, the p-value is less than 0.007 for all tasks and both algorithms. For the kNN model on IQ1, the p-value exceeds 0.15. Concerning the LR-based model results for IQ1, for some tasks the p-values exceed 0.05.
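Such a comparison against a fixed baseline can be sketched as a one-sample t-test; the accuracy values below are invented, and the paper's exact test setup is not specified beyond citing [8]:

```python
from scipy.stats import ttest_1samp

baseline = 0.882                     # IQ2 majority-class accuracy
# Hypothetical per-split accuracies of a tuned classifier.
accuracies = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.93, 0.90, 0.91, 0.92]

# Does the mean accuracy differ significantly from the baseline?
stat, p_value = ttest_1samp(accuracies, popmean=baseline)
```

A positive test statistic with a small p-value indicates that the classifier's mean accuracy significantly exceeds the baseline.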

Hence, from the results of Student's t-test between the obtained results (in terms of accuracy) and the baseline (0.964) for IQ1, we conclude that statistically significant results were achieved with LR, but not for all tasks. In almost all cases, using only the exchange parameter level is not sufficient. Splitting the emotion sets into two classes did not lead to significantly better results.

In turn, for IQ2 the obtained results in terms of accuracy statistically significantly outperform the baseline for both algorithms, kNN and LR, across all classification problems.

It should be mentioned that applying PCA reduced the number of features by a factor of approximately 2.5 (from roughly 1,200 to 470) and consequently increased the computational speed in terms of execution time.

7 Conclusions and Future Work

In this paper we have analysed the significance of the different interaction parameter levels for the task of IQ modelling for HHC. Our research has revealed the impact of the different interaction parameter levels and their combinations on IQ modelling for HHC. The differences between the feature lists for IQ modelling for HCSI and HHC did not lead to the same results. This might partially be explained by the fact that the corpus used is highly imbalanced and that all labels were annotated by only one expert rater.

As a future direction, we plan to apply ensemble-based classifiers for predicting the IQ score. Given the rather high dimensionality of the feature space, other dimensionality reduction methods might be helpful. Moreover, considering the highly imbalanced classes, approaches for multiclass imbalanced data should be explored.