“Speech Melody and Speech Content Didn’t Fit Together”—Differences in Speech Behavior for Device Directed and Human Directed Interactions

Chapter in: Advances in Data Science: Methodologies and Applications

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 189)

Abstract

A diverse set of addressee detection methods is currently being discussed. Typically, wake words are used, but these force an unnatural interaction and are error-prone, especially in the case of false-positive classification (the user says the wake word without intending to interact with the device). Technical systems should therefore be enabled to detect device-directed speech themselves. To enrich research in the field of speech analysis in HCI, we conducted studies with a commercial voice assistant, Amazon’s ALEXA (Voice Assistant Conversation Corpus, VACC), and complemented objective speech analysis with subjective self- and external reports on possible differences between speaking with the voice assistant and speaking with another person. The analysis revealed a set of features specific to device-directed speech. We conclude that speech-based addressing of a technical system is a mainly conscious process that includes individual modifications of the speaking style.
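The distinction described above, separating device-directed from human-directed utterances by their acoustic properties rather than by a wake word, can be illustrated with a toy classifier. The sketch below is purely hypothetical and is not the authors' pipeline: it uses two synthetic prosodic features (pitch variability and loudness) with invented class distributions, and a simple nearest-centroid rule, whereas the actual study analyzed rich acoustic feature sets extracted from real recordings.

```python
# Hypothetical illustration of acoustic addressee detection:
# classify utterances as device- or human-directed from two
# synthetic prosodic features. The assumption encoded here
# (device-directed speech being louder and more monotone) is
# for illustration only.
import random

random.seed(0)

def make_utterance(device_directed):
    # Draw a (pitch variability, loudness) pair from one of two
    # invented class distributions.
    if device_directed:
        pitch_var = random.gauss(10.0, 2.0)   # more monotone
        loudness = random.gauss(70.0, 3.0)    # louder
    else:
        pitch_var = random.gauss(25.0, 2.0)
        loudness = random.gauss(60.0, 3.0)
    return (pitch_var, loudness)

def centroid(samples):
    n = len(samples)
    return tuple(sum(s[i] for s in samples) / n for i in range(2))

def classify(x, c_dev, c_hum):
    # Nearest-centroid rule: assign to the closer class centroid.
    d_dev = sum((a - b) ** 2 for a, b in zip(x, c_dev))
    d_hum = sum((a - b) ** 2 for a, b in zip(x, c_hum))
    return "device" if d_dev < d_hum else "human"

train_dev = [make_utterance(True) for _ in range(50)]
train_hum = [make_utterance(False) for _ in range(50)]
c_dev, c_hum = centroid(train_dev), centroid(train_hum)

test_set = [(make_utterance(True), "device") for _ in range(20)] + \
           [(make_utterance(False), "human") for _ in range(20)]
accuracy = sum(classify(x, c_dev, c_hum) == y for x, y in test_set) / len(test_set)
print(f"toy addressee-detection accuracy: {accuracy:.2f}")
```

Because the synthetic classes are well separated, the toy classifier performs near-perfectly; the chapter's point is precisely that such separation also exists, to a measurable degree, in real device-directed versus human-directed speech.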


Notes

  1. The wake word that, by default, activates Amazon’s ALEXA from its “inactive” state so that a request can be made is ‘Alexa’.

  2. The confederate speaker was introduced to the participants as “Jannik”.

  3. Participants were anonymized using letters in alphabetical order.

  4. German-speaking annotators were anonymized using letters in alphabetical order, marked with *.

  5. Non-German-speaking annotators were anonymized using letters in alphabetical order, marked with **.


Author information

Correspondence to Ingo Siegert.


Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Siegert, I., Krüger, J. (2021). “Speech Melody and Speech Content Didn’t Fit Together”—Differences in Speech Behavior for Device Directed and Human Directed Interactions. In: Phillips-Wren, G., Esposito, A., Jain, L.C. (eds) Advances in Data Science: Methodologies and Applications. Intelligent Systems Reference Library, vol 189. Springer, Cham. https://doi.org/10.1007/978-3-030-51870-7_4
