Abstract
In recent years, audio-visual speech recognition has emerged as an active field of research thanks to advances in pattern recognition, signal processing and machine vision. Its ultimate goal is to allow human-computer communication by voice while also exploiting the visual information contained in the audio-visual speech signal. This paper presents an automatic command recognition system that uses audio-visual information. The system is intended to control the da Vinci laparoscopic robot. The audio signal is parametrized using Mel Frequency Cepstral Coefficients. In addition, features based on the points that define the outer contour of the mouth according to the MPEG-4 standard are used to extract the visual speech information.
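As an illustration of the Mel Frequency Cepstral Coefficients parametrization named in the abstract, the following is a minimal sketch of the textbook pipeline for a single audio frame: Hamming window, power spectrum, triangular mel filterbank, logarithm, and DCT-II. It is not the authors' implementation. The function names, the naive DFT, and the parameter values (20 filters, 12 coefficients, 256-sample frames at 8 kHz) are assumptions chosen for illustration; the abstract does not specify them.

```python
import cmath
import math


def hz_to_mel(f):
    """Convert a frequency in Hz to the mel scale (common 2595/700 formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-windowed power spectrum of one frame (naive DFT, bins 0..N/2)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [low + (high - low) * i / (num_filters + 1)
                  for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    filters = [[0.0] * (n_fft // 2 + 1) for _ in range(num_filters)]
    for j in range(1, num_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        for k in range(left, centre):      # rising edge
            filters[j - 1][k] = (k - left) / (centre - left)
        for k in range(centre, right):     # peak and falling edge
            filters[j - 1][k] = (right - k) / (right - centre)
    return filters


def mfcc_frame(frame, sample_rate, num_filters=20, num_ceps=12):
    """MFCCs of one frame; num_filters and num_ceps are assumed defaults."""
    spec = power_spectrum(frame)
    fbank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Floor the filter energies to avoid log(0) on silent frames.
    log_energies = [math.log(max(sum(w * p for w, p in zip(filt, spec)), 1e-10))
                    for filt in fbank]
    # DCT-II of the log energies; coefficient 0 (overall energy) is skipped.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, e in enumerate(log_energies))
            for c in range(1, num_ceps + 1)]


# Hypothetical input: a 1 kHz tone sampled at 8 kHz, one 256-sample frame.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]
coefficients = mfcc_frame(tone, SAMPLE_RATE)
```

In a recognizer, such a vector would be computed for each short overlapping frame of the utterance, and the resulting sequence would then be passed to the classifier together with the visual features.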
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ceballos, A., Gómez, J., Prieto, F., Redarce, T. (2009). Robot Command Interface Using an Audio-Visual Speech Recognition System. In: Bayro-Corrochano, E., Eklundh, JO. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2009. Lecture Notes in Computer Science, vol 5856. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10268-4_102
DOI: https://doi.org/10.1007/978-3-642-10268-4_102
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10267-7
Online ISBN: 978-3-642-10268-4
eBook Packages: Computer Science, Computer Science (R0)