
Audio Driven Artificial Video Face Synthesis Using GAN and Machine Learning Approaches

  • Conference paper
Computational Intelligence in Pattern Recognition (CIPR 2022)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 480)

Abstract

Nowadays, a large number of people share their opinions in either audio or video format over the internet. Some of these videos are real and some are fake, so we need to distinguish between the two with the help of digital forensics. In this paper, the authors discuss the different types of artificial face synthesis methods and then analyze deepfake videos using machine learning methods. In artificial face synthesis, based on an incoming audio stream in any language, a face image or source video of a single person is animated with full lip synchronization and synthesized expression. For full lip synchronization, a GAN can also be used to train the generative models.
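To make the pipeline the abstract describes concrete, below is a minimal PyTorch sketch of an audio-conditioned GAN for lip synchronization: a generator fuses an identity frame with an audio feature window (e.g. MFCCs) to predict a lip-synced frame, and a discriminator judges frames jointly with the conditioning audio. All layer sizes, feature dimensions, and the combined adversarial plus L1 objective are illustrative assumptions in the spirit of conditional GANs, not the architecture evaluated in this paper.

```python
# Hypothetical sketch: conditional GAN for audio-driven lip sync.
# Shapes, layer sizes, and the training step are illustrative assumptions.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Fuses an identity frame with an audio feature window to predict a synced frame."""
    def __init__(self, audio_dim=128, img_channels=3):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.img_enc = nn.Sequential(
            nn.Conv2d(img_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128 + 256, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, img_channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, identity_img, audio_feat):
        a = self.audio_enc(audio_feat)                        # (B, 256)
        v = self.img_enc(identity_img)                        # (B, 128, H/4, W/4)
        a = a[:, :, None, None].expand(-1, -1, v.size(2), v.size(3))
        return self.dec(torch.cat([v, a], dim=1))             # (B, 3, H, W)

class Discriminator(nn.Module):
    """Scores whether a frame is a real face frame consistent with the audio."""
    def __init__(self, audio_dim=128, img_channels=3):
        super().__init__()
        self.img_enc = nn.Sequential(
            nn.Conv2d(img_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(128 + audio_dim, 1)

    def forward(self, frame, audio_feat):
        return self.head(torch.cat([self.img_enc(frame), audio_feat], dim=1))

# One illustrative adversarial training step on random stand-in tensors.
G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

identity = torch.randn(4, 3, 64, 64)  # reference face frames
audio = torch.randn(4, 128)           # e.g. one MFCC window per frame
real = torch.randn(4, 3, 64, 64)      # ground-truth synced frames

fake = G(identity, audio)
# Discriminator step: real frames score 1, generated frames score 0.
loss_d = bce(D(real, audio), torch.ones(4, 1)) + bce(D(fake.detach(), audio), torch.zeros(4, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()
# Generator step: fool the discriminator, plus an L1 reconstruction term.
loss_g = bce(D(fake, audio), torch.ones(4, 1)) + nn.functional.l1_loss(fake, real)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

In practice, systems of this kind typically add a dedicated lip-sync or sequence discriminator and train on audio-visual corpora; the random tensors above merely stand in for preprocessed frame/audio pairs.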



Author information


Corresponding author

Correspondence to Arnab Kumar Das.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Kumar Das, A., Naskar, R. (2022). Audio Driven Artificial Video Face Synthesis Using GAN and Machine Learning Approaches. In: Das, A.K., Nayak, J., Naik, B., Vimal, S., Pelusi, D. (eds) Computational Intelligence in Pattern Recognition. CIPR 2022. Lecture Notes in Networks and Systems, vol 480. Springer, Singapore. https://doi.org/10.1007/978-981-19-3089-8_23
