Abstract
Generating audio from a visual scene is an extremely challenging yet useful task, with applications in remote surveillance, speech comprehension for hearing-impaired people, and silent speech interfaces (SSIs). Owing to recent advances in deep neural network techniques, there has been considerable research effort toward speech reconstruction from silent videos, also known as visual speech. In this survey paper, we review several recent papers in this area and compare them in terms of their architectural models and the accuracy they achieve.
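To make the task concrete, below is a minimal sketch of the generic pipeline shared by many of the surveyed systems: a visual encoder maps a sequence of lip-region frames to per-frame acoustic features such as a mel-spectrogram, which a separate vocoder would then convert to a waveform. The sketch assumes PyTorch; the class name Lip2Spec, the layer sizes, and the frame and feature dimensions are illustrative assumptions, not any specific paper's architecture.

```python
# Illustrative sketch only: a 3D-CNN front end for short-term lip motion,
# a recurrent layer for longer temporal context, and a per-frame regression
# to mel-spectrogram features. All sizes are hypothetical placeholders.
import torch
import torch.nn as nn

class Lip2Spec(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        # 3D convolutions see a few neighbouring frames at a time.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time axis, pool space
        )
        # Bidirectional LSTM models longer-range temporal context.
        self.rnn = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        # Per-frame regression to acoustic features.
        self.head = nn.Linear(2 * hidden, n_mels)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 1, time, height, width) grayscale lip crops
        feats = self.encoder(frames)           # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)  # (B, 64, T)
        feats = feats.transpose(1, 2)          # (B, T, 64)
        out, _ = self.rnn(feats)               # (B, T, 2 * hidden)
        return self.head(out)                  # (B, T, n_mels)

# Example: 75 frames of 64x64 lip crops -> 75 mel-spectrogram frames.
model = Lip2Spec()
spec = model(torch.randn(2, 1, 75, 64, 64))
print(spec.shape)  # torch.Size([2, 75, 80])
```

In the papers reviewed, variants of this template differ mainly in the visual encoder (2D vs. 3D CNNs, multi-view inputs), the temporal model (LSTMs, GANs), and the acoustic target (spectrograms, vocoder parameters), which is the axis along which the comparison in this survey is organized.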
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Suresh, K., Gopakumar, G., Duttagupta, S. (2021). Generating Audio from Lip Movements Visual Input: A Survey. In: Paprzycki, M., Thampi, S.M., Mitra, S., Trajkovic, L., El-Alfy, E.-S.M. (eds) Intelligent Systems, Technologies and Applications. Advances in Intelligent Systems and Computing, vol 1353. Springer, Singapore. https://doi.org/10.1007/978-981-16-0730-1_21
DOI: https://doi.org/10.1007/978-981-16-0730-1_21
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-0729-5
Online ISBN: 978-981-16-0730-1
eBook Packages: Intelligent Technologies and Robotics