Abstract
Conversational agents are computer programs that engage human users in conversation to assist, educate, or entertain. They have attracted substantial research interest ever since the advent of artificial intelligence in the 1950s and 1960s, and recent advances in cloud computing, together with the availability of smart devices with wireless high-speed Internet connections, have led to rapid progress in the engineering of conversational technology. “Modern” conversational agents understand spoken language, are able to answer complicated questions, or interact with humans in dialogs of hundreds of user turns. This article describes the strengths of modern conversational agents, which are driven by synergy among high-performing speech recognition, smart devices, high-speed Internet, cloud computing, standardization, and crowdsourcing. It also shows how the field is primarily driven by commercial stakeholders and how open-source alternatives are expected to play a major role in the future of modern conversational agents.
Copyright information
© 2014 Springer Fachmedien Wiesbaden
About this chapter
Cite this chapter
Suendermann-Oeft, D. (2014). Modern Conversational Agents. In: Jähnert, J., Förster, C. (eds) Technologien für digitale Innovationen. Springer VS, Wiesbaden. https://doi.org/10.1007/978-3-658-04745-0_4
DOI: https://doi.org/10.1007/978-3-658-04745-0_4
Publisher Name: Springer VS, Wiesbaden
Print ISBN: 978-3-658-04744-3
Online ISBN: 978-3-658-04745-0
eBook Packages: Humanities, Social Science (German Language)