
Audio Driven Artificial Video Face Synthesis Using GAN and Machine Learning Approaches

  • Conference paper
Computational Intelligence in Pattern Recognition (CIPR 2022)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 480)

Abstract

Nowadays, a large number of people share their opinions in either audio or video format over the internet. Some of these videos are real and some are fake, so we need to distinguish between the two with the help of digital forensics. In this paper, the authors discuss the different types of artificial face synthesis methods and then analyze deepfake videos using machine learning methods. In artificial face synthesis, based on an incoming audio stream in any language, a face image or source video of a single person is animated with full lip synchronization and synthesized expression. For full lip synchronization, a GAN can also be used to train the generative models.
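To make the pipeline the abstract describes concrete, below is a minimal PyTorch sketch of an audio-conditioned GAN for lip synchronization: a generator fuses an identity frame with an audio feature window (e.g. MFCCs) to predict a lip-synced frame, and a discriminator judges frames jointly with the conditioning audio. All layer sizes, feature dimensions, and the combined adversarial plus L1 objective are illustrative assumptions in the spirit of conditional GANs, not the architecture evaluated in this paper.

```python
# Hypothetical sketch: conditional GAN for audio-driven lip sync.
# Shapes, layer sizes, and the training step are illustrative assumptions.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Fuses an identity frame with an audio feature window to predict a synced frame."""
    def __init__(self, audio_dim=128, img_channels=3):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.img_enc = nn.Sequential(
            nn.Conv2d(img_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128 + 256, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, img_channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, identity_img, audio_feat):
        a = self.audio_enc(audio_feat)                        # (B, 256)
        v = self.img_enc(identity_img)                        # (B, 128, H/4, W/4)
        a = a[:, :, None, None].expand(-1, -1, v.size(2), v.size(3))
        return self.dec(torch.cat([v, a], dim=1))             # (B, 3, H, W)

class Discriminator(nn.Module):
    """Scores whether a frame is a real face frame consistent with the audio."""
    def __init__(self, audio_dim=128, img_channels=3):
        super().__init__()
        self.img_enc = nn.Sequential(
            nn.Conv2d(img_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(128 + audio_dim, 1)

    def forward(self, frame, audio_feat):
        return self.head(torch.cat([self.img_enc(frame), audio_feat], dim=1))

# One illustrative adversarial training step on random stand-in tensors.
G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

identity = torch.randn(4, 3, 64, 64)  # reference face frames
audio = torch.randn(4, 128)           # e.g. one MFCC window per frame
real = torch.randn(4, 3, 64, 64)      # ground-truth synced frames

fake = G(identity, audio)
# Discriminator step: real frames score 1, generated frames score 0.
loss_d = bce(D(real, audio), torch.ones(4, 1)) + bce(D(fake.detach(), audio), torch.zeros(4, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()
# Generator step: fool the discriminator, plus an L1 reconstruction term.
loss_g = bce(D(fake, audio), torch.ones(4, 1)) + nn.functional.l1_loss(fake, real)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

In practice, systems of this kind typically add a dedicated lip-sync or sequence discriminator and train on audio-visual corpora; the random tensors above merely stand in for preprocessed frame/audio pairs.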



Author information


Corresponding author

Correspondence to Arnab Kumar Das.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Kumar Das, A., Naskar, R. (2022). Audio Driven Artificial Video Face Synthesis Using GAN and Machine Learning Approaches. In: Das, A.K., Nayak, J., Naik, B., Vimal, S., Pelusi, D. (eds) Computational Intelligence in Pattern Recognition. CIPR 2022. Lecture Notes in Networks and Systems, vol 480. Springer, Singapore. https://doi.org/10.1007/978-981-19-3089-8_23
