Abstract
Generating audio from a visual scene is an extremely challenging yet useful task, with applications in remote surveillance, speech comprehension for hearing-impaired people, and silent speech interfaces (SSIs). Owing to recent advances in deep neural network techniques, there has been considerable research effort toward speech reconstruction from silent videos, also known as visual speech. In this survey paper, we review several recent papers in this area and compare them in terms of their architectural models and the accuracy they achieve.
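To make the task concrete, below is a minimal sketch of the generic pipeline shared by many of the surveyed systems: a visual encoder maps a sequence of lip-region frames to per-frame acoustic features such as a mel-spectrogram, which a separate vocoder would then convert to a waveform. The sketch assumes PyTorch; the class name Lip2Spec, the layer sizes, and the frame and feature dimensions are illustrative assumptions, not any specific paper's architecture.

```python
# Illustrative sketch only: a 3D-CNN front end for short-term lip motion,
# a recurrent layer for longer temporal context, and a per-frame regression
# to mel-spectrogram features. All sizes are hypothetical placeholders.
import torch
import torch.nn as nn

class Lip2Spec(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        # 3D convolutions see a few neighbouring frames at a time.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time axis, pool space
        )
        # Bidirectional LSTM models longer-range temporal context.
        self.rnn = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        # Per-frame regression to acoustic features.
        self.head = nn.Linear(2 * hidden, n_mels)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 1, time, height, width) grayscale lip crops
        feats = self.encoder(frames)           # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)  # (B, 64, T)
        feats = feats.transpose(1, 2)          # (B, T, 64)
        out, _ = self.rnn(feats)               # (B, T, 2 * hidden)
        return self.head(out)                  # (B, T, n_mels)

# Example: 75 frames of 64x64 lip crops -> 75 mel-spectrogram frames.
model = Lip2Spec()
spec = model(torch.randn(2, 1, 75, 64, 64))
print(spec.shape)  # torch.Size([2, 75, 80])
```

In the papers reviewed, variants of this template differ mainly in the visual encoder (2D vs. 3D CNNs, multi-view inputs), the temporal model (LSTMs, GANs), and the acoustic target (spectrograms, vocoder parameters), which is the axis along which the comparison in this survey is organized.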
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Suresh, K., Gopakumar, G., Duttagupta, S. (2021). Generating Audio from Lip Movements Visual Input: A Survey. In: Paprzycki, M., Thampi, S.M., Mitra, S., Trajkovic, L., El-Alfy, E.-S.M. (eds) Intelligent Systems, Technologies and Applications. Advances in Intelligent Systems and Computing, vol 1353. Springer, Singapore. https://doi.org/10.1007/978-981-16-0730-1_21
DOI: https://doi.org/10.1007/978-981-16-0730-1_21
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-0729-5
Online ISBN: 978-981-16-0730-1
eBook Packages: Intelligent Technologies and Robotics