Abstract
Virtual worlds are developing rapidly over the Internet. They are visited by avatars and staffed with Embodied Conversational Agents (ECAs). An avatar is a representation of a physical person; each person controls one or several avatars and usually receives feedback from the virtual world on an audio-visual display. Ideally, all senses should be engaged to feel fully immersed in a virtual world; in practice, sound, vision and sometimes touch are the available modalities. This paper reviews the technological developments that enable audio-visual interactions in virtual and augmented reality worlds, with emphasis on speech and gesture interfaces, including talking face analysis and synthesis.
© 2009 Springer-Verlag Berlin Heidelberg
Chollet, G. et al. (2009). Multimodal Human Machine Interactions in Virtual and Augmented Reality. In: Esposito, A., Hussain, A., Marinaro, M., Martone, R. (eds) Multimodal Signals: Cognitive and Algorithmic Issues. Lecture Notes in Computer Science(), vol 5398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00525-1_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00524-4
Online ISBN: 978-3-642-00525-1
eBook Packages: Computer Science (R0)