Abstract
Nowadays, a large number of people share their opinions in audio or video format over the Internet. Some of these videos are genuine and some are fake, so digital forensics is needed to distinguish between the two. In this paper, the authors discuss different types of artificial face synthesis methods and then analyze deepfake videos using machine learning methods. In artificial face synthesis, a face image or source video of a single person is animated, with full lip synchronization and synthesized expression, based on an incoming audio stream in any language. For full lip synchronization, a GAN can also be used to train the generative models.
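The abstract's last point, that a GAN can drive lip-synchronized face synthesis from audio, can be illustrated with a toy conditional-GAN training step. The sketch below is an illustrative assumption, not the authors' model: both networks are single linear layers, the "audio" and "mouth-region" vectors are random stand-ins for MFCC features and mouth landmarks, and the gradients are written out by hand so the example needs only NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper):
# the generator maps an audio feature vector plus noise to a
# "mouth-region" vector; the discriminator judges (audio, mouth) pairs.
AUDIO_DIM, NOISE_DIM, MOUTH_DIM = 8, 4, 6

# Single linear layers stand in for the real networks.
G_W = rng.normal(0, 0.1, (AUDIO_DIM + NOISE_DIM, MOUTH_DIM))
D_W = rng.normal(0, 0.1, (AUDIO_DIM + MOUTH_DIM, 1))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generator(audio, noise):
    # G(audio, z) -> synthetic mouth-region vector
    return np.concatenate([audio, noise], axis=1) @ G_W

def discriminator(audio, mouth):
    # D(audio, mouth) -> probability the pair is real
    return sigmoid(np.concatenate([audio, mouth], axis=1) @ D_W)

def train_step(audio, real_mouth, lr=0.05):
    """One adversarial update: D on real+fake pairs, then G through D."""
    global G_W, D_W
    n = audio.shape[0]
    noise = rng.normal(size=(n, NOISE_DIM))
    fake_mouth = generator(audio, noise)

    # Discriminator step: push D(real) -> 1 and D(fake) -> 0.
    for mouth, target in [(real_mouth, 1.0), (fake_mouth, 0.0)]:
        x = np.concatenate([audio, mouth], axis=1)
        p = sigmoid(x @ D_W)
        D_W -= lr * (x.T @ (p - target)) / n   # dBCE/dD_W

    # Generator step: push D(fake) -> 1, backpropagating through D.
    z = np.concatenate([audio, noise], axis=1)
    fake = z @ G_W
    p = sigmoid(np.concatenate([audio, fake], axis=1) @ D_W)
    d_fake = (p - 1.0) @ D_W[AUDIO_DIM:].T / n  # dBCE/dfake via D
    G_W -= lr * (z.T @ d_fake)

# Toy "dataset": mouth vectors that depend linearly on the audio.
audio = rng.normal(size=(16, AUDIO_DIM))
real_mouth = audio @ rng.normal(size=(AUDIO_DIM, MOUTH_DIM))
for _ in range(50):
    train_step(audio, real_mouth)
```

In a real system the linear layers would be convolutional/recurrent networks and the conditioning would come from encoded speech features, but the adversarial structure (a conditional generator fooling a pair-wise discriminator) is the same.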
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Kumar Das, A., Naskar, R. (2022). Audio Driven Artificial Video Face Synthesis Using GAN and Machine Learning Approaches. In: Das, A.K., Nayak, J., Naik, B., Vimal, S., Pelusi, D. (eds) Computational Intelligence in Pattern Recognition. CIPR 2022. Lecture Notes in Networks and Systems, vol 480. Springer, Singapore. https://doi.org/10.1007/978-981-19-3089-8_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-3088-1
Online ISBN: 978-981-19-3089-8
eBook Packages: Intelligent Technologies and Robotics