1 Introduction

The growth of social media over the last decade has encouraged users to create and immediately post their own content with few restrictions. As a result, user-generated content has been increasing rapidly on the Web.

The most widely used forms of user-generated content are text messages (e.g., tweets on Twitter), audio such as speech and music, images, and short sequences of moving images known as video clips. User-generated content is a valuable source of information that captures people's interests, thoughts, and actions, and automatic video understanding [1] is crucial for exploiting it. Different types of videos [2] can be produced, as shown in Fig. 1.

Fig. 1 Overview of types of videos

2 Video

A video is a short sequence of moving images together with audio. It is essentially an ordered arrangement of frames, each frame being a 2D image, so the basic unit of a video is the frame. A video can also be viewed as a collection of scenes, as shown in Fig. 2 [3], where a scene is a group of shots that share a similar setting. A video therefore contains an enormous amount of content in terms of scenes, shots, and frames.

Fig. 2 Internal complex architecture of video
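As a rough illustration of this hierarchy (our own sketch, not taken from [3]), the classes below model a video as a list of scenes, a scene as a collection of shots with a similar setting, and a shot as a contiguous sequence of frames.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class Frame:
    index: int           # position of the frame in the video
    image: np.ndarray    # the 2D representation (H x W x 3 pixel array)


@dataclass
class Shot:
    frames: List[Frame] = field(default_factory=list)   # contiguous frames from one take


@dataclass
class Scene:
    shots: List[Shot] = field(default_factory=list)     # shots sharing a similar setting


@dataclass
class Video:
    scenes: List[Scene] = field(default_factory=list)   # the whole video as scenes
```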

2.1 User-Generated Video Opportunities and Challenges

In general, user-generated video (UGV) is of lower quality and less organized than professionally produced video. Because capturing conditions are limited, it is often of poorer quality than professionally edited video; sports and news videos, for instance, are produced after careful editing. Most UGV, in contrast, is captured by individual users on their own smartphones and uploaded to the Web without any editing. UGV is therefore unstructured.

According to published platform statistics, active Twitter users create about 500 million tweets every day and YouTube users upload around 300 h of video every minute, which makes it difficult for people to find the appropriate content among such a huge number of videos. Automatic event detection in user-generated video data is therefore needed in many computer vision applications. Finding a solution that automatically figures out the events captured in this large collection of videos is not an easy task. User-generated video data thus provides both opportunities and challenges. The primary challenge is how to handle such huge volumes of data efficiently; in addition, it is difficult to search for videos based on the specific events a user is interested in.

3 Event Detection

The aim of event detection in user-generated videos is to automatically detect video clips that contain the specific events a user is interested in. Event detection using single-modality features, or a fusion of features from multiple modalities, can be performed on sports, news, movie, surveillance, large-scale, and user-generated videos. Detecting user-specified events in UGV is quite challenging because the videos are captured by non-professional users under restricted conditions; they are unstructured and unconstrained, and they suffer from poor quality, poor lighting, blurring, complex camera motion, chaotic background clutter, and occlusions.

An event is a significant occurrence specified by the user; it happens at a certain place at a certain time. Examples include human actions such as jumping and running, and object-related events such as kicking a ball or riding a horse. An event is characterized by its type, time, location, and description. The process of identifying the occurrence of an event in natural or man-made content is called event detection: detecting a substantial change in sea level is natural event detection, while detecting events such as smiling or frowning in images or videos is man-made event detection.

Social media platforms allow millions of people to communicate daily and to share information ranging from world-level events, for example the World Cup, to personal events such as weddings and graduations. A massive amount of data is created by individual users in the form of text, videos, and photos. Research so far has provided efficient solutions for storing such large data, but retrieving, handling, and processing it, particularly for videos, remain a challenge. Monitoring and evaluating user-generated video content can therefore yield undoubtedly valuable information.
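Purely for illustration, the event attributes described above (type, time, location, and description) can be captured in a simple record; the class and field names below are our own choice, not taken from the literature.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_type: str    # e.g., "kicking a ball" or "riding a horse"
    time: float        # when the event occurs within the video, in seconds
    location: str      # where the event happens
    description: str   # free-text description supplied by the user
```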

3.1 General Event Detection Methodology

  • Frame extraction—Depending on its length, a video consists of a large number of frames. Frame extraction represents the video as a sequence of image frames (Fig. 3).

    Fig. 3 General event detection methodology

  • Key frame extraction—Extracting key frames is a basic step in video-related tasks that removes duplicate frames carrying redundant data. The extracted key frames represent the characteristic content of the video.

  • Feature extraction—This step extracts important features from the video data to enable semantic understanding. Visual, audio, and audio–visual features can be extracted. Spatial, transform, color, texture, shape, edge and boundary, structure, layout, and motion are some of the visual features. The most common audio classes in videos are speech, silence, music, and combinations of these three. Audio features can also be combined with low-level visual features for key frame extraction.

  • Classification—After feature extraction, classifiers generate scores based on different model formulations and feature sets. Classifiers are among the most significant tools in image processing: mathematical algorithms trained on a data set to reach a required performance level within a given amount of time. The classifier is trained so that the system can easily differentiate between the classes.

  • Score fusion—This step combines the scores computed from different features and modalities, and the final detection decision is made on the fused score. Minimal sketches of these steps are given after this list.
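For the frame extraction and key frame extraction steps, the following is a minimal sketch, assuming OpenCV (cv2) and NumPy are available; the fixed sampling step and the simple frame-differencing rule used to pick key frames are illustrative choices of ours, not the method of any cited work.

```python
import cv2
import numpy as np


def extract_frames(video_path, step=5):
    """Frame extraction: decode every `step`-th frame into a list of BGR images."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames


def extract_key_frames(frames, threshold=30.0):
    """Key frame extraction: keep a frame only if it differs enough from the
    previous key frame, discarding near-duplicate frames."""
    key_frames, previous = [], None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if previous is None or np.mean(cv2.absdiff(gray, previous)) > threshold:
            key_frames.append(frame)
            previous = gray
    return key_frames
```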
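For the feature extraction, classification, and score fusion steps, a similarly minimal sketch is given below. A normalized color histogram stands in for the visual feature, a scikit-learn SVM stands in for the classifier, and a weighted average implements score fusion; all three are assumptions made only for illustration.

```python
import cv2
import numpy as np
from sklearn.svm import SVC


def color_histogram(frame, bins=8):
    """Feature extraction: a normalized 3-D color histogram of one key frame."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
    return cv2.normalize(hist, hist).flatten()


def train_classifier(features, labels):
    """Classification: train one classifier per modality; probability=True
    makes the classifier output scores via predict_proba."""
    clf = SVC(probability=True)
    clf.fit(np.asarray(features), np.asarray(labels))
    return clf


def fuse_scores(score_lists, weights):
    """Score fusion: weighted average of per-modality event scores."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * np.asarray(s, dtype=float) for w, s in zip(weights, score_lists))
```

Per-modality scores obtained from `predict_proba` can then be passed to `fuse_scores` with weights reflecting how much each modality is trusted, and the final decision is made by thresholding the fused score.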

To improve the accuracy of event detection in user-generated videos, features can be extracted from four different modalities: text, audio, visual, and motion. An overview of the different modalities, with an example and a description of each, is given in Table 1 [4].

Table 1 Overview of different modalities

4 Multimodal Event Detection

Multimodal event detection is based on multimodal fusion, i.e., combining multiple features from different modalities. Fusing various modalities can provide complementary information, so it is important to know which modalities contribute most to a given analysis task. Fusing visual and audio features together with the textual data of a user-generated video improves the accuracy of event detection. Extracting useful features from each modality separately, so that events can be recognized more reliably, is therefore a further important task, because the modalities have distinct characteristics.

As different modalities possess different characteristics, it is advisable to take the confidence levels of the individual modalities into account when carrying out the required multimedia application. A multimodal event detection framework [11] is shown in Fig. 4.

Fig. 4 Framework of multimodal event detection
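As a toy complement to the framework in Fig. 4 (not an implementation of the cited work), the snippet below contrasts the two common fusion strategies: early fusion, which concatenates per-modality feature vectors before a single classifier is applied, and late fusion, which combines per-modality scores weighted by assumed modality confidences.

```python
import numpy as np


def early_fusion(feature_vectors):
    """Early fusion: concatenate features from different modalities into one vector."""
    return np.concatenate([np.asarray(f).ravel() for f in feature_vectors])


def late_fusion(scores, confidences):
    """Late fusion: combine per-modality event scores, weighting each modality
    by an assumed confidence level (e.g., derived from validation accuracy)."""
    confidences = np.asarray(confidences, dtype=float)
    weights = confidences / confidences.sum()
    return float(np.dot(weights, np.asarray(scores, dtype=float)))


# Example: visual, audio, and text scores for one candidate event clip.
fused = late_fusion(scores=[0.82, 0.40, 0.65], confidences=[0.5, 0.2, 0.3])
print(f"fused event score: {fused:.3f}")  # the detection decision thresholds this score
```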

In Table 2, we list the papers that used visual features alone and those that combined features from multiple modalities for different multimedia tasks.

Table 2 Overview of related work

5 Conclusion

Video is a rich source of information, and video data offers a broad range of research applications such as multimodal event detection, which in turn supports video browsing, video indexing, video summarization, and content-based video retrieval. In recent years, multimodal event detection has received widespread research attention because of the exponential increase in the volume of Web video data. This paper focused on multimodal event detection in user-generated video content, where the objective is to detect video clips by the key event happening in each clip through the fusion of different features from different modalities. We discussed a number of challenges that arise from the exponential growth of unstructured user-generated Web video content, gave an outline of event detection and of the general multimodal event detection framework, and presented key points about modalities, features, classifiers, and fusion techniques. We believe that this paper can give useful insights to researchers who are just starting to investigate this area.