Deep Learning Framework Based on Audio–Visual Features for Video Summarization

Rhevanth, M.; Ahmed, Rashad; Shah, Vithik; Mohan, Biju R.

doi:10.1007/978-981-19-0840-8_17

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 858)

Abstract

Video summarization (VS) has attracted considerable interest in recent years, leading to numerous applications in computer vision domains such as video extraction, image captioning, indexing, and browsing. Conventional VS studies often aim to improve VS algorithms by adding high-quality features and clustering to pick representative visual elements. Many existing VS mechanisms consider only the visual aspect of the input video, thereby ignoring the influence of audio features on the generated summary. To address this issue, we propose an efficient video summarization technique that processes both visual and audio content while extracting key frames from the raw video input. The structural similarity index (SSIM) is used to measure similarity between frames, while mel-frequency cepstral coefficients (MFCCs) are used to extract features from the corresponding audio signals. By combining these two features, the redundant frames of the video are removed. The resulting key frames are refined using a deep convolutional neural network (CNN) model to retrieve a list of candidate key frames, which finally constitute the summary of the data. The proposed system is evaluated on video datasets from YouTube that contain events, which helps in better understanding the video summary. Experimental observations indicate that the inclusion of audio features and an efficient refinement technique, followed by an optimization function, provides better summary results than standard VS techniques.
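The redundancy-removal step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the SSIM and MFCC cues are combined, so everything here is an assumption made for illustration: the function names `global_ssim` and `select_non_redundant`, the thresholds, the rule that a frame is dropped only when it is close to the last kept frame in both modalities, and the simplified single-window SSIM (the usual formulation averages over local windows). The audio features are taken as precomputed per-frame vectors, for example MFCCs; their extraction is not shown.

```python
import numpy as np


def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM computed over two whole grayscale frames.

    Simplification (assumption): statistics are taken over the full frame
    instead of being averaged over local sliding windows.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Standard SSIM stabilising constants with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def select_non_redundant(frames, audio_feats, ssim_thresh=0.9, audio_thresh=1.0):
    """Return indices of frames kept after redundancy removal.

    Hypothetical combination rule: frame i is redundant only if it is
    visually similar (SSIM >= ssim_thresh) AND acoustically similar
    (Euclidean distance <= audio_thresh) to the last kept frame.
    """
    if len(frames) == 0:
        return []
    kept = [0]  # always keep the first frame as the initial reference
    for i in range(1, len(frames)):
        j = kept[-1]
        visually_similar = global_ssim(frames[i], frames[j]) >= ssim_thresh
        audio_dist = np.linalg.norm(
            np.asarray(audio_feats[i], dtype=np.float64)
            - np.asarray(audio_feats[j], dtype=np.float64)
        )
        if not (visually_similar and audio_dist <= audio_thresh):
            kept.append(i)
    return kept
```

With this rule, a visually static shot whose audio changes sharply is still retained, which is one plausible reading of why audio features matter for the summary; the frames that survive would then be passed to the CNN-based refinement stage mentioned above.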

Author information

Authors and Affiliations

Department of Information Technology, National Institute of Technology Karnataka, Surathkal, Mangalore, 575025, India
M. Rhevanth, Rashad Ahmed, Vithik Shah & Biju R. Mohan

Authors

M. Rhevanth
Rashad Ahmed
Vithik Shah
Biju R. Mohan

Corresponding author

Correspondence to M. Rhevanth.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, National Institute of Technology Arunachal Pradesh (NITAP), Itanagar, Arunachal Pradesh, India
Deepak Gupta
Department of Computer Science and Engineering, National Institute of Technology Arunachal Pradesh (NITAP), Itanagar, Arunachal Pradesh, India
Koj Sambyo
School of Computer Science, Faculty of Engineering and IT, University of Technology, Sydney, NSW, Australia
Mukesh Prasad
Department of Information Technology, Indian Institute of Information Technology (IIIT), Allahabad, Uttar Pradesh, India
Sonali Agarwal

Cite this paper

Rhevanth, M., Ahmed, R., Shah, V., Mohan, B.R. (2022). Deep Learning Framework Based on Audio–Visual Features for Video Summarization. In: Gupta, D., Sambyo, K., Prasad, M., Agarwal, S. (eds) Advanced Machine Intelligence and Signal Processing. Lecture Notes in Electrical Engineering, vol 858. Springer, Singapore. https://doi.org/10.1007/978-981-19-0840-8_17

DOI: https://doi.org/10.1007/978-981-19-0840-8_17
Published: 26 June 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-0839-2
Online ISBN: 978-981-19-0840-8
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
