Extraction of Features for Lip-reading Using Autoencoders

Paleček, Karel

doi:10.1007/978-3-319-11581-8_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8773)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

We study the incorporation of facial depth data in the task of isolated word visual speech recognition. We propose novel features based on unsupervised training of a single-layer autoencoder. The features are extracted from both the video and depth channels obtained by a Microsoft Kinect device. We perform all experiments on our database of 54 speakers, each uttering 50 words. We compare our autoencoder features to traditional methods such as DCT or PCA. The features are further processed by a simplified variant of hierarchical linear discriminant analysis in order to capture the speech dynamics. The classification is performed using a multi-stream Hidden Markov Model for various combinations of audio, video, and depth channels. We also evaluate the visual features in joint audio-video isolated word recognition in noisy environments.
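
To make the idea of unsupervised single-layer autoencoder features concrete, the following is a minimal sketch in Python with NumPy. It is not the author's implementation: the abstract does not state the input size, the number of hidden units, the activation function, the loss, or the optimizer, so the 8x8 patch size, 16 sigmoid hidden units, linear decoder, mean-squared-error loss, and plain batch gradient descent are all assumptions chosen for illustration. The input here is random synthetic data standing in for flattened mouth-region patches; the function names `train_autoencoder` and `encode` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(X, n_hidden=16, lr=0.1, epochs=200, seed=0):
    """Train a single-hidden-layer autoencoder by batch gradient descent.

    X is an (n_samples, n_inputs) array scaled to [0, 1].
    Returns the encoder parameters and the per-epoch reconstruction MSE.
    All hyperparameters are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden))   # encoder weights
    b1 = np.zeros(n_hidden)                    # encoder bias
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))   # decoder weights
    b2 = np.zeros(d)                           # decoder bias
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)               # hidden code (the features)
        X_hat = H @ W2 + b2                    # linear reconstruction
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Gradients of (1 / (2n)) * sum of squared errors; this objective
        # has the same minimiser as the MSE recorded above.
        dW2 = H.T @ err / n
        db2 = err.mean(axis=0)
        dZ = (err @ W2.T) * H * (1.0 - H)      # backprop through the sigmoid
        dW1 = X.T @ dZ / n
        db1 = dZ.mean(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W1, b1), losses


def encode(X, params):
    """Map inputs to hidden-layer activations, used as the feature vector."""
    W1, b1 = params
    return sigmoid(X @ W1 + b1)


# Synthetic stand-in for 200 flattened 8x8 grayscale (or depth) patches.
rng = np.random.default_rng(0)
patches = rng.random((200, 64))
params, losses = train_autoencoder(patches)
features = encode(patches, params)             # one 16-dim vector per patch
```

In a pipeline like the one the abstract describes, such per-frame feature vectors would take the place of DCT or PCA coefficients before any later dynamic-feature and classification stages; those stages are not sketched here.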

Author information

Authors and Affiliations

The Institute of Information Technology and Electronics, Technical University of Liberec, Studentská 2/1402, 46117, Liberec, Czech Republic
Karel Paleček

Editor information

Editors and Affiliations

Speech and Multimodal Interfaces Laboratory, St. Petersburg Institute of Informatics and Automation of the Russian Academy of Sciences, 39, 14th line, 199178, St. Petersburg, Russia
Andrey Ronzhin
Institute of Applied and Mathematical Linguistics, Moscow State Linguistic University, 38, Ostozhenka, 119034, Moscow, Russia
Rodmonga Potapova
Faculty of Technical Sciences, University of Novi Sad, 6, Trg Dositeja Obradovića, 21000, Novi Sad, Serbia
Vlado Delic

About this paper

Cite this paper

Paleček, K. (2014). Extraction of Features for Lip-reading Using Autoencoders. In: Ronzhin, A., Potapova, R., Delic, V. (eds) Speech and Computer. SPECOM 2014. Lecture Notes in Computer Science, vol 8773. Springer, Cham. https://doi.org/10.1007/978-3-319-11581-8_26

DOI: https://doi.org/10.1007/978-3-319-11581-8_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11580-1
Online ISBN: 978-3-319-11581-8
eBook Packages: Computer Science, Computer Science (R0)
