Abstract
A diverse set of addressee detection methods is currently under discussion. Typically, wake words are used, but these force an unnatural interaction and are error-prone, especially in the case of false positive classifications (the user says the wake word without intending to interact with the device). Technical systems should therefore be enabled to detect device-directed speech themselves. To enrich research on speech analysis in HCI, we conducted studies with a commercial voice assistant, Amazon's ALEXA (Voice Assistant Conversation Corpus, VACC), and complemented objective speech analysis with subjective self-reports and external reports on possible differences between speaking with the voice assistant and speaking with another person. The analysis revealed a set of features specific to device-directed speech. It can be concluded that speech-based addressing of a technical system is a mainly conscious process involving individual modifications of speaking style.
Notes
1. The wake word to activate Amazon's ALEXA from its "inactive" state, so that a request can be made, is 'Alexa' by default.
2. The confederate speaker was introduced to the participants as "Jannik".
3. Participants were anonymized using letters in alphabetical order.
4. German-speaking annotators were anonymized using letters in alphabetical order marked with *.
5. Non-German-speaking annotators were anonymized using letters in alphabetical order marked with **.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this chapter
Siegert, I., Krüger, J. (2021). “Speech Melody and Speech Content Didn’t Fit Together”—Differences in Speech Behavior for Device Directed and Human Directed Interactions. In: Phillips-Wren, G., Esposito, A., Jain, L.C. (eds) Advances in Data Science: Methodologies and Applications. Intelligent Systems Reference Library, vol 189. Springer, Cham. https://doi.org/10.1007/978-3-030-51870-7_4
DOI: https://doi.org/10.1007/978-3-030-51870-7_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-51869-1
Online ISBN: 978-3-030-51870-7
eBook Packages: Intelligent Technologies and Robotics (R0)