Abstract
A diverse set of addressee detection methods is currently under discussion. Typically, wake words are used, but these force an unnatural interaction and are error-prone, especially in the case of false positive classifications (the user says the wake word without intending to interact with the device). Technical systems should therefore be enabled to detect device-directed speech themselves. To enrich research on speech analysis in HCI, we conducted studies with a commercial voice assistant, Amazon's ALEXA (Voice Assistant Conversation Corpus, VACC), and complemented objective speech analysis with subjective self-reports and external reports on possible differences between speaking with the voice assistant and speaking with another person. The analysis revealed a set of features specific to device-directed speech. It can be concluded that speech-based addressing of a technical system is a mainly conscious process involving individual modifications of speaking style.
Notes
1. The wake word to activate Amazon's ALEXA from its "inactive" state, so that a request can be made, is 'Alexa' by default.
2. The confederate speaker was introduced to the participants as "Jannik".
3. Participants were anonymized using letters in alphabetical order.
4. German-speaking annotators were anonymized using letters in alphabetical order marked with *.
5. Non-German-speaking annotators were anonymized using letters in alphabetical order marked with **.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this chapter
Siegert, I., Krüger, J. (2021). “Speech Melody and Speech Content Didn’t Fit Together”—Differences in Speech Behavior for Device Directed and Human Directed Interactions. In: Phillips-Wren, G., Esposito, A., Jain, L.C. (eds) Advances in Data Science: Methodologies and Applications. Intelligent Systems Reference Library, vol 189. Springer, Cham. https://doi.org/10.1007/978-3-030-51870-7_4
DOI: https://doi.org/10.1007/978-3-030-51870-7_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-51869-1
Online ISBN: 978-3-030-51870-7
eBook Packages: Intelligent Technologies and Robotics (R0)