Abstract
This paper presents an overview of the main multi-modal speech enhancement methods reported to date. In particular, a new MATLAB-based toolbox for processing audio-visual data, developed by Barbosa et al. (2007), is reviewed and its performance potential evaluated. It is shown that the tool does not represent a complete and comprehensive speech processing solution, but rather serves as a standardised yet versatile base on which further research can build. To demonstrate this versatility, preliminary examples that apply these computational procedures to an audio-visual corpus are presented. Finally, some future research directions in the area of multi-modal speech processing are outlined, including work that the authors aim to carry out with the aid of this newly developed audio-visual MATLAB toolbox, such as toolbox customisation and the processing of noisy speech in real-world environments.
Keywords
- Discrete Cosine Transform
- Gaussian Mixture Model
- Audio Signal
- Blind Source Separation
- Speech Enhancement
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Haykin, S., Chen, Z.: The Cocktail Party Problem. Neural Computation 17(9), 1875–1902 (2005)
Sumby, W.H., Pollack, I.: Visual Contribution to Speech Intelligibility in Noise. J. Acoust. Soc. America 26(2), 212–215 (1954)
Schwartz, J.L., Berthommier, F., Savariaux, C.: Audio-visual scene analysis: evidence for a "very-early" integration process in audio-visual speech perception. In: ICSLP 2002, pp. 1937–1940 (2002)
Barker, J., Shao, X.: Audio-Visual Speech Fragment Decoding. In: AVSP 2007, paper L5-2 (accepted, 2007)
Almajai, I., Milner, B.: Maximising Audio-Visual Speech Correlation. In: AVSP 2007, paper P16 (accepted, 2007)
Barbosa, A.V., Yehia, H.C., Vatikiotis-Bateson, E.: MATLAB toolbox for audiovisual speech processing. In: AVSP 2007, paper P38 (accepted, 2007)
Rivet, B., Girin, L., Jutten, C.: Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures. IEEE Trans. on Audio, Speech, and Lang. Processing 15(1), 96–108 (2007)
Almajai, I., Milner, B., Darch, J., Vaseghi, S.: Visually-Derived Wiener Filters for Speech Enhancement. In: ICASSP 2007, vol. 4, pp. IV-585–IV-588 (2007)
Scanlon, P., Reilly, R.: Feature analysis for automatic speechreading. In: 2001 IEEE Fourth Workshop on Multimedia Signal Processing, pp. 625–630 (2001)
Hazen, T.J., Saenko, K., La, C.H., Glass, J.R.: A Segment Based Audio-Visual Speech Recognizer: Data Collection, Development, and Initial Experiments. In: ICMI 2004: Proceedings of the 6th International Conference on Multimodal Interfaces, pp. 235–242 (2004)
Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W.: Recent Advances in the Automatic Recognition of Audiovisual Speech. Proceedings of the IEEE 91(9), 1306–1326 (2003)
Goecke, R.: Current Trends In Joint Audio-Video Signal Processing: A Review. In: Proceedings of the Eighth Int. Symposium on Signal Processing and Its Applications, pp. 70–73 (2005)
Potamianos, G., Neti, C., Deligne, S.: Joint Audio-Visual Speech Processing for Recognition and Enhancement. In: AVSP 2003, pp. 95–104 (2003)
Sanderson, C.: Biometric Person Recognition: Face, Speech and Fusion. VDM-Verlag (2008)
Lee, B., Hasegawa-Johnson, M., Goudeseune, C., Kamdar, S., Borys, S., Liu, M., Huang, T.: AVICAR: audio-visual speech corpus in a car environment. In: Interspeech 2004, pp. 2489–2492 (2004)
Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active Appearance Models. IEEE Trans. On Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001)
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Abel, A., Hussain, A. (2009). Multi-modal Speech Processing Methods: An Overview and Future Research Directions Using a MATLAB Based Audio-Visual Toolbox. In: Esposito, A., Hussain, A., Marinaro, M., Martone, R. (eds) Multimodal Signals: Cognitive and Algorithmic Issues. Lecture Notes in Computer Science, vol 5398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00525-1_12
DOI: https://doi.org/10.1007/978-3-642-00525-1_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00524-4
Online ISBN: 978-3-642-00525-1
eBook Packages: Computer Science, Computer Science (R0)