
8.1 Introduction

Detecting violent scenes in movies is an important feature in various use cases related to video on demand and the protection of children against offensive content. In the framework of the MediaEval benchmark initiative, we have developed a large dataset for this task and assessed various approaches via comparative evaluations.

MediaEval (footnote 1) is a benchmarking initiative dedicated to evaluating new algorithms for multimedia access and retrieval. MediaEval emphasizes the multimodal character of the data (speech, audio, visual content, tags, users, context, etc.). As a track of MediaEval, the Affect Task on Violent Scenes Detection involves the automatic detection of violent segments in movies. The challenge derives from a use case at the company Technicolor (footnote 2). Technicolor is a provider of services and solutions in multimedia entertainment, in particular in the field of helping users select the most appropriate content according to, for example, their profile. In this context, a particular use case arises which involves helping users choose movies that are suitable for children in their family, by previewing the parts of the movies (i.e., scenes or segments) that include the most violent moments [9].

Such a use case raises several substantial difficulties, among which the subjectivity involved in selecting violent moments is certainly the most important. Indeed, the definition of a violent event remains highly subjective and depends on the viewers, their culture, and their gender. Agreeing on a common definition of a violent event is not easy, which explains why each work related to violence in the literature exhibits a different definition. The semantic nature of the events to retrieve also contributes to the difficulty of the task, as it entails a huge semantic gap between features and interpretation. Due to the targeted content (i.e., Hollywood movies) and the nature of the events, multimodality is also an important characteristic of the task, which stresses its ambitious and challenging nature even more.

The choice of the targeted content raises additional challenges which are not addressed in similar evaluation tasks, for example in the TRECVid Surveillance Event Detection or Multimedia Event Detection Evaluation Tracks (footnote 3). Indeed, systems will have to cope with content of very different genres that may contain special editing effects, which may alter the events to detect.

In the literature, violent scene detection in movies has received very little attention so far. Moreover, comparing existing results is impossible because of the different definitions of violence adopted. As a consequence of the differences in the definition of violence, methods suffer from a lack of standard, consistent, and substantial datasets. The Affect task of MediaEval constitutes a first attempt to address all these needs and establish a standard with state-of-the-art performance for future reference.

This chapter provides a thorough description of the Violent Scene Detection (VSD) dataset and reviews the state of the art for this task. The main contributions can be summarized as follows:

  • the proposal of a definition of violence in movies and its validation in the community;

  • the design of a comprehensive dataset of 18 Hollywood movies annotated for violence and for concepts related to violence. Insights about annotation challenges are also provided;

  • a detailed description of the state of the art in violence detection;

  • a comparison of the systems that competed in the 2011 and 2012 benchmarks and the description of two of the best performing systems.

The chapter is organized as follows. Section 8.2 reviews previous research on violence detection in videos. Section 8.3 provides an overview of the violent scene detection task after 2 years of implementation within the MediaEval benchmarking initiative. Section 8.4 reports the results of the benchmark with a short comparative description of the competing systems. Section 8.5 provides an in-depth description of two of the best ranked systems with an explicit focus on the contribution of the multimodal information fusion.

8.2 A Review of the Literature

Automatically detecting violent scenes in movies received very limited attention prior to the establishment of the MediaEval violence detection task [21].

A closely related problem is action recognition focusing on detecting human violence in real-world scenarios. Datta et al. [8] proposed a hierarchical approach for detecting distinct violent events involving two people, e.g., fist fighting, hitting with objects, and kicking. They computed the motion trajectory of image structures, i.e., the acceleration measure vector and its jerk. Their method was validated on 15 short sequences including around 40 violent scenes. Another example is the approach in [40], which aims at detecting instances of aggressive human behavior in public environments. The authors used a Dynamic Bayesian Network (DBN) as a fusion mechanism to aggregate aggression scene indicators, e.g., “scream,” “passing train,” or “articulation energy.” Evaluation was carried out using 13 clips featuring various scenarios, such as “aggression towards a vending machine” or “supporters harassing a passenger.”

Sports videos were also used for violence detection, usually relying on the bag of visual words (BoVW) representation. For instance, [32] addresses fight detection using BoVW along with space-time interest points and motion scale-invariant feature transform (MoSIFT) features. The authors evaluated their method on 1,000 clips containing different actions from ice hockey videos labeled at the frame level. The highest reported detection accuracy is near \(90\,\%\). A similar experiment is reported in [11], which used BoVW with local spatio-temporal features for sports and surveillance videos. Experiments show that motion patterns tend to provide better performance than spatio-visual descriptors.

One of the early approaches targeting broadcast videos is from Nam et al. [31], where violent events were detected using multiple audio-visual signatures, e.g., description of motion activity, blood and flame detection, violence/nonviolence classification of the soundtrack, and characterization of sound effects. Only qualitative validations were reported. More recently, Gong et al. [17] used shot length, motion activity, loudness, speech, light, and music as features for violence detection. A modified semi-supervised learning model was employed for detection and evaluated on 4 Hollywood movies, achieving an F-measure of \(0.85\) at best. Similarly, Giannakopoulos et al. [14] used various audio-visual features for violence detection in movies, e.g., spectrogram, chroma, energy entropy, Mel-Frequency Cepstral Coefficients (MFCC), average motion, motion orientation variance, and a measure of the motion of people or faces in the scene. Modalities were combined by a meta-classification architecture that classified mid-term video segments as “violent” or “non-violent.” Experimental validation was performed on 50 video segments ripped from 10 different movies (totaling 150 min), with F-measures up to \(0.58\). Lin and Wang [27] proposed a violent shot detector that used a modified probabilistic Latent Semantic Analysis (pLSA). Audio features as well as visual concepts such as motion, flame, explosion, and blood were employed. Final integration was achieved through a co-training scheme, typically used when dealing with small amounts of training data and large amounts of unlabeled data. Experimental validation was conducted on 5 movies, showing an average F-measure of \(0.88\).

Most of the approaches are naturally multimodal, exploiting both the image and sound tracks. However, a few works approached the problem based on a single modality. For example, [6] used Gaussian mixture models (GMM) and hidden Markov models (HMM) to model audio events over time. They considered the presence of gunplay and car racing with audio events such as “gunshot,” “explosion,” “engine,” “helicopter flying,” “car braking,” and “cheers.” Validation was performed on a very restricted dataset, containing 5-min excerpts extracted from 5 movies, leading to an average F-measure of up to \(0.90\). In contrast, [4] used only visual concepts such as face, blood, and motion information to determine whether an action scene had violent content or not. The specificity of their approach lies in addressing more semantics-bearing scene structures of the video rather than simple shots.

In general, most existing approaches focus on finding the right concepts that can be translated into violence, and their findings are bounded by the size of the dataset and the definition of violence adopted. Because of the high variability of violent events in movies, no common and sufficiently objective definition of violent events has ever been proposed to the community, even when restricting to physical violence. On the contrary, each piece of work dealing with the detection of violent scenes provides its own definition of the violent events to detect. For instance, [4] targeted “a series of human actions accompanied with bleeding,” while [11, 32] looked for “scenes containing fights, regardless of context and number of people involved.” In [14], the following definition is used: “behavior by persons against persons that intentionally threatens, attempts, or actually inflicts physical harm.” In [17], the authors were interested in “fast paced scenes which contain explosions, gunshots and person-on-person fighting.” Moreover, violent scenes and action scenes have often been conflated in the past, as in [5, 17].

The lack of a common definition, and the resulting absence of a substantial reference dataset, has so far made it very difficult to compare methods that were sometimes developed for a very specific type of violence. This is precisely the gap that we attempt to fill with the MediaEval violent scene detection task, by creating a benchmark based on a clear and generalizable definition of violence to advance the state of the art on this topic.

8.3 Affect Task Description

The 2011 and 2012 Affect Task required participants to deploy multimodal approaches to automatically detect portions of movies depicting violence. Though not a strict requirement, we tried to emphasize multimodality for several reasons. First, videos are multimodal. Second, violence might be present in all modalities, though not necessarily at the same time; this is clearly the case for images and soundtracks. Violence might also be reflected in subtitles, though verbal violence was not considered. Even with a definition of violence limited to physical violence, single-modality approaches were bound to be suboptimal, and most participants ended up using both visual and audio features.

The key to creating a corpus for comparative evaluation remains a general definition of the notion of violence that eases annotation while encompassing a large variety of situations. We discuss here the notion of violence and justify the definition that was adopted, before describing the dataset and evaluation rules.

8.3.1 Toward a Definition of Violence

The notion of violence remains highly subjective as it depends on viewers. The World Health Organization (WHO) [39] defines violence as “the intentional use of physical force or power, threatened or actual, against oneself, another person, or against a group or community that either results in or has a high likelihood of resulting in injury, death, psychological harm, maldevelopment, or deprivation.” According to the WHO, three types of violence can be distinguished, namely self-inflicted, interpersonal, and collective [24]. Each category is divided according to characteristics related to the setting and nature of the violence, e.g., physical, sexual, psychological, and deprivation or neglect.

In the context of movies and television, Kriegel [23] defines violence on TV as “an unregulated force that affects the physical or psychological integrity to challenge the humanity of an individual with the purpose of domination or destruction.”

These definitions only cover intentional actions and, as such, do not include accidents, which are of interest in the use case considered, as they also result in potentially shocking gory and graphic scenes, e.g., a bloody crash. We therefore adopted an extended definition of violence that includes accidents while being as objective as possible and reducing the complexity of the annotation task. In MediaEval 2011 and 2012, violence is defined as physical violence or an accident resulting in human injury or pain. Violent events are therefore limited to physical violence, verbal and psychological violence being intentionally excluded.

Although we attempted to narrow the field of violent events down to a set of events as objectively violent as possible, some borderline cases remain. First of all, sticking to this definition leads to the rejection of some shots in which the results of physical violence are shown but not the violent act itself. For example, shots in which one can see a dead body with many injuries and blood were not annotated as violent. In contrast, a character simply slapping another in the face is considered a violent action according to the task definition. Other events, such as “intent to kill,” in which one sees somebody shooting at somebody else with the clear intent to kill but the targeted person escapes with no injury, were also discussed and finally not kept in the violent set. Conversely, scenes where the shooter is not visible but where shooting at someone is obvious from the audio, e.g., one can hear the gunshot, possibly with screams afterward, were annotated as violent. Interestingly, such scenes emphasize the multimodal characteristic of the task. Shots showing actions resulting in pain but with no intent to be violent or, on the contrary, with the aim of helping rather than harming, e.g., segments showing surgery without anesthetics, fit the definition and were therefore deemed violent.

Another keenly discussed borderline case concerned events such as shots showing the destruction of a whole city or the explosion of a moving tank. Technically speaking, these shots do not show any proof of death or injury, though one can reasonably assume that the city or the tank was not empty at the time of destruction. Consequently, such cases, where pain or injury is implicit, were annotated as violent. Finally, the shot showing a violent action and the shot showing its result are sometimes separated by several nonviolent shots. In this case, the entire segment was annotated as violent if the duration between the two violent shots (action and result) was short enough (less than 2 s).

8.3.2 Data Description

In line with the use case considered, the dataset consists of Hollywood movies from a comprehensive range of genres, from extremely violent movies to movies without violence. In 2011, 15 movies were considered; 3 additional movies were added in 2012. Of these 18 movies, 12 were designated as development data (footnote 4) in 2011. The three movies used as the test set (footnote 5) in 2011 were shifted to the development set in 2012, and three additional movies were provided for evaluation. The list of movies, along with some characteristics, is given in Table 8.1.

Table 8.1 Movie dataset (2011 dev. set: first 12 movies; 2011 test set: following 3 movies; 2012 dev. set: first 15 movies; 2012 test set: last 3 movies)

The development dataset represents a total of 26,108 shots in 2012—as given by automatic shot segmentation—for a total duration of 102,851 s. Violent content corresponds to 9.25 % of the total duration and 12.27 % of the shots, highlighting the fact that violent segments are not so scarce in this database. We tried to respect the genre distribution (from extremely violent to nonviolent) both in the development and test sets. This appears in the statistics, as some movies such as Billy Elliot or The Wizard of Oz contain a small proportion of violent shots (around 5 %). The choice we made for the definition of violence impacts the proportion of annotated violence in some movies such as The Sixth Sense where violent shots amount to only 2.8 % of the duration. However, the movie contains several shocking scenes of dead people which do not fit the definition of violence that we adopted. In a similar manner, psychological violence, such as what may be found in Billy Elliot, was also not annotated, which also explains the small number of violent shots in this particular movie.

The violent scenes dataset was created by seven human assessors. In addition to segments containing physical violence according to the definition adopted, annotations also include high-level concepts potentially related to violence for the visual and audio modalities, highlighting the multimodal character of the task.

The annotation of violent segments was conducted using a three-step process, with the same so-called “master annotators” for all movies. A first master annotator extracted all violent segments. A second master annotator reviewed the annotated segments and added segments possibly missed, according to his/her own judgment. Disagreements were discussed on a case-by-case basis, with a third master annotator making the final decision in case of an unresolved disagreement. Each annotated violent segment contained a single action whenever possible. In the case of overlapping actions, the corresponding global segment was provided as a whole; this was indicated in the annotation files by adding the tag “multiple action scene.” The boundaries of each violent segment were defined at the frame level, i.e., by indicating the start and end frame numbers.

The high-level video concepts were annotated through a simpler process, involving only two annotators. Each movie was first processed by an annotator and then reviewed by one of the master annotators.

Seven visual concepts are provided: presence of blood, fights, presence of fire, presence of guns, presence of cold weapons, car chases, and gory scenes. For the benchmark, participants had the option to carry out detection of these high-level concepts. However, concept detection is not among the task's goals, and the high-level concept annotations were only provided on the development set. Each of these high-level concepts follows the same annotation format as the violent segments, i.e., starting and ending frame numbers, possibly with additional tags providing further details. For blood annotations, a tag in each segment specifies the proportion of the screen covered in blood. Four tags were considered for fights: only two people fighting, a small group of people (roughly fewer than 10), a large group of people (more than 10), and distant attack (i.e., no real fight, but somebody is shot or attacked at a distance). As for the presence of fire, anything from big fires and explosions to fire coming out of a gun while shooting, a candle, a cigarette lighter, a cigarette, or sparks was annotated; e.g., a space shuttle taking off also generates fire and therefore receives a fire label. An additional tag may indicate special colors of the fire (i.e., not yellow or orange). Segments showing firearms (respectively cold weapons) were annotated whenever any type of gun (respectively cold weapon), or part thereof, or assimilated arms appeared. Annotations of gory scenes are more difficult: in the present task, they indicate graphic images of bloodletting and/or tissue damage, including horror or war representations. As this is also a subjective and difficult notion to define, some additional segments showing disgusting mutants or creatures were annotated as gore. In this case, additional tags describing the event/scene were added.

For the audio modality, three audio concepts were annotated, namely gunshots, explosions, and screams. These concepts were annotated using the English audio tracks. Contrary to the video concepts, audio segments are identified by start and end times in seconds. Additional tags may be added to each segment to distinguish different types of subconcepts; for instance, a distinction was made between gunshots and cannon fire. All kinds of explosions were annotated, even magic explosions as well as explosions resulting from shells or cannonballs in cannon fire. Finally, scream annotations are also provided, although for 9 movies only; anything from nonverbal screams to what was called “effort noise” was annotated, as long as the noise came from a human or a humanoid. Effort noises were distinguished from the other screams by using two different tags in the annotation.

In addition to the annotation data, automatically generated shot boundaries with their corresponding key frames, as detected by Technicolor’s software, were also provided with each movie.

8.3.3 Evaluation Rules

Due to copyright issues, the video content was not distributed and participants were required to buy the DVDs. Participants were allowed to use all information automatically extracted from the DVDs, including visual and auditory material as well as subtitles. English was the chosen language for both the audio and subtitle channels. The use of any other data not included in the DVDs (web sites, synopses, etc.) was not allowed.

Two types of runs were initially considered in the task, a mandatory shot classification run and an optional segment detection one. The shot classification run consisted in classifying each shot provided by Technicolor’s shot segmentation software as violent or not. Decisions were to be accompanied by a confidence score where the higher the score, the more likely the violence. Confidence scores were optional in 2011 and compulsory in 2012 because of the chosen metric. The segment detection run involved detection of the violent segment boundaries, regardless of the shot segmentation provided.

System comparison was based on different metrics in 2011 and 2012. In 2011, performance was measured using a detection cost function weighting false alarms (FA) and missed detections (MI), according to

$$\begin{aligned} C = C_{\text {fa}}\cdot P_{\text {fa}} + C_{\text {miss}}\cdot P_{\text {miss}} \end{aligned}$$
(8.1)

where the costs \(C_{\text {fa}} = 1\) and \(C_{\text {miss}} = 10\) were arbitrarily defined to reflect (a) the prior probability of the situation and (b) the cost of making an error. \(P_{\text {fa}}\) and \(P_{\text {miss}}\) are the estimated probabilities of false alarms (false positives) and missed detections (false negatives), respectively, given the system's output and the reference annotation. In the shot classification run, the false alarm and missed detection probabilities were computed on a per-shot basis, while in the segment-level run they were computed per unit of time, i.e., the durations of the reference and detected segments are compared. This cost function is called the “MediaEval cost” in all that follows.
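To make the shot-level metric concrete, the following is a minimal Python sketch of Eq. (8.1), not the official scoring tool; it assumes that \(P_{\text {fa}}\) is estimated as the fraction of nonviolent shots labeled violent and \(P_{\text {miss}}\) as the fraction of violent shots that were missed.

```python
# Minimal sketch of the shot-level MediaEval cost (Eq. 8.1); not the official
# scoring tool. `reference` and `hypothesis` are parallel lists of 0/1 labels
# (1 = violent), one entry per shot.
def mediaeval_cost(reference, hypothesis, c_fa=1.0, c_miss=10.0):
    n_violent = sum(reference)
    n_nonviolent = len(reference) - n_violent
    false_alarms = sum(1 for r, h in zip(reference, hypothesis) if r == 0 and h == 1)
    misses = sum(1 for r, h in zip(reference, hypothesis) if r == 1 and h == 0)
    p_fa = false_alarms / n_nonviolent if n_nonviolent else 0.0
    p_miss = misses / n_violent if n_violent else 0.0
    return c_fa * p_fa + c_miss * p_miss

# Example: a naive system labeling every shot as violent has P_miss = 0 and
# P_fa = 1, hence a cost of exactly 1.0, the baseline mentioned in the text.
print(mediaeval_cost([1, 0, 0, 1, 0], [1, 1, 1, 1, 1]))  # -> 1.0
```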

Experience taught us that the MediaEval detection cost was too strongly biased toward low missed-detection rates, leading to systems hardly reaching cost values lower than 1, i.e., performing no better than a naive system classifying all shots as violent. In 2012, we therefore adopted the Mean Average Precision computed over the 100 top-ranked violent segments (MAP@100) as the evaluation metric. Note that this measure is also well adapted to the search-related use case that serves as a basis for our work.
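A minimal sketch of the average precision over the 100 top-ranked shots of a single movie is given below; MAP@100 for a run is then the mean of this value over the test movies. The handling of ties and of movies with fewer than 100 violent shots in the official evaluation tool may differ.

```python
def average_precision_at_100(scored_shots, k=100):
    """Average precision over the k top-ranked shots of one movie.

    `scored_shots` is a list of (confidence, is_violent) pairs. MAP@100 for a
    run is the mean of this value over the test movies; tie-breaking and
    normalization in the official tool may differ slightly.
    """
    ranked = sorted(scored_shots, key=lambda x: x[0], reverse=True)[:k]
    hits, precision_sum = 0, 0.0
    for rank, (_, is_violent) in enumerate(ranked, start=1):
        if is_violent:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```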

We also report detection error tradeoff curves, showing \(P_{\text {fa}}\) as a function of \(P_{\text {miss}}\) given a segmentation and the confidence score for each segment, to compare potential performance at different operating points. Note that in the segment detection run, DET curves are possible only for systems returning a dense segmentation (a list of segments that spans the entire video): segments not present in the output list are considered as non violent for all thresholds.

8.4 Results

In 2011, the Affect Task on Violent Scenes Detection was proposed in MediaEval as a pilot for its first year. Thirteen teams, corresponding to 16 research groups when counting joint submission proposals, declared interest in the task. Finally, six teams registered and completed the task, representing four different countries, for a grand total of 29 submitted runs. These figures show the interest in the task in this first year. This was confirmed in 2012, with 11 teams registering, of which 8 completed the task, submitting 36 runs for evaluation. Interest is also emphasized by the wide geographic spread of the teams. Interestingly, the multimodal aspect of the task is reflected in the fact that participants come from different communities, namely the audio and the image processing communities. A more detailed evolution of the task over these two years is summarized in Fig. 8.1.

Fig. 8.1 Evolution of the participation in the task between 2011 and 2012

Official results are reported in Table 8.2. Despite the change of official metric between 2011 and 2012, MAP values were also computed on the 2011 submissions. Similarly, the MediaEval cost is reported for 2012. It should nevertheless be noted that these two metrics imply different tunings of the systems (toward low precision rate for the MediaEval cost, and on the contrary toward high precision for the MAP), meaning that metric values should be compared cautiously, as systems were not optimized in the same way.

Table 8.2 Official results of the 2011 and 2012 Affect task evaluation at MediaEval

In both 2011 and 2012, participants predominantly submitted runs for the shot classification task; only the ARF team submitted a segment-level run, in 2012. Results show a substantial improvement between 2011 and 2012. Although the overall performance of the systems proposed in 2011 was not good enough to satisfy the requirements of a real-life commercial system, in 2012 three systems reached MAP@100 values above 60 %. Research thus still needs to be conducted on this subject, but state-of-the-art systems already show convincing performance.

Fig. 8.2 Detection error trade-off curves for all participants in 2011 (a) and 2012 (b)

Detection error trade-off curves, obtained from the confidence values provided by the participants, are given in Fig. 8.2 for the best run of each participant according to the official metric of the corresponding year. Clearly, the ordering of the systems differs according to the operating point. Once again, direct comparison of the 2011 and 2012 curves should be made with caution. Nevertheless, improvements can be observed between the two years: whereas in 2011 only one participant reached, at best, a false alarm rate of 20 % for a missed detection rate of about 25 %, in 2012 at least two participants reached similar results and three additional teams obtained fair results.

Analyzing the 2011 submissions, three different system categories can be distinguished. Two participants (NII [26] and LIG [37]) treated violent scene detection as a concept detection problem, applying generic systems developed for the TRECVid evaluations, potentially with specific tuning. Both sites used classic video-only features, computed on the provided key frames and based on color, texture, and edges, either local (interest points) or global, together with classic classifiers. One participant (DYNI [15]) proposed a classifier-free technique exploiting only two low-level audio and video features, computed on each successive frame, both measuring the activity within a shot; after a late fusion step, decisions were taken by comparison with a threshold. The last group of participants (TUB [2], UGE [16] and TI [33]) built supervised classification systems dedicated to violent scene detection. Different classifiers were used, from SVMs and Bayesian networks to linear or quadratic discriminant analysis. All used multimodal features, either audio-video or audio-video-text (UGE). Features were computed globally for each shot (UGE, TI) or on the provided key frames (TUB).

In 2012, all systems were supervised classification systems; LIG [10] and NII [25] continued with improved versions of their generic concept detection systems, while the others implemented versions dedicated to violent scene detection. The chosen classifiers were mostly SVMs, with some exceptions using neural networks and Bayesian networks. It should be noted that most participants [1, 10, 13, 22, 35, 38] opted for multimodal (audio \(+\) video) systems and that multimodality seems to help the performance of such systems. Globally, classic low-level audio features (MFCC, zero-crossing rate, asymmetry, roll-off, etc.) and video features (color histograms, texture-related features, Scale Invariant Feature Transform-like descriptors, Histograms of Oriented Gradients, visual activity, etc.) were extracted. One exception is the use of multi-scale local binary pattern histogram features by DYNI [30]. In addition to these classic features, audio and video mid-level concept detection was also used in this second year [10, 22, 25, 38], thanks to the annotated high-level concepts. Such mid-level concepts, especially when used in a two-step classification scheme [38], seem promising.

Based on these results, one may draw tentative conclusions about the characteristics that are most likely to be useful for violence detection. Local video features (SIFT-like) did not add much information to the systems. On the contrary, taking advantage of different modalities seems to improve performance, especially when the modalities are merged using late fusion. Although the results do not prove its impact one way or another, temporal integration also seems of interest; it was carried out in different manners in the systems, either by using contextual features, i.e., features at different times, or by temporal smoothing or aggregation of the decisions at the output of the chain. Using intermediate detection of high-level concepts related to violence, such as those provided in the task, seems to be rewarding.

8.5 Multimodal Approaches

The progress achieved between 2011 and 2012 can probably be explained by two main factors. Data availability is undoubtedly the first one, along with experience on the task. Exploiting multimodal features is the other key. While many systems made very limited use of multiple modalities in 2011, multimodal integration became much more widespread in 2012, mostly relying on the audio and visual modalities.

We provide here details for two multimodal systems which competed in 2012, namely the ARF system based on mid-level concepts detected from multimodal input and the Technicolor/IRISA system which directly exploits a set of low-level audio and visual features.

8.5.1 A Mid-Level Concept Fusion Approach

We describe the approach developed by the ARF team [21, 38], relying on fusing mid-level concept predictions inferred from low-level features by employing a bank of multilayer perceptron classifiers featuring a dropout training scheme.

The motivation for this approach lies in the high variability in appearance of violent scenes in movies and the low amount of training data usually available. In this scenario, training a classifier to predict violent frames directly from visual and auditory features is rather difficult. The ARF system therefore uses the high-level concept ground truth provided with the task to infer mid-level concepts as an intermediate step toward the final violence detection goal, thus attempting to narrow the semantic gap. Experiments showed that predicting mid-level concepts from low-level features is more feasible than directly predicting all forms of violence.

8.5.1.1 Description of the System

Violence detection is first carried out at the frame level by classifying each frame as violent or nonviolent. Segment-level prediction (shot level or arbitrary length) is then obtained by a simple aggregation of frame-level decisions. Given the complexity of this task, i.e., labeling individual frames rather than video segments (ca. 160,000 frames per movie), classification is tackled by exploiting the inherently parallel architecture of neural networks. The system involves several processing steps, as illustrated in Fig. 8.3.

Fig. 8.3 Description of the ARF team's system developed for MediaEval 2012 (black boxes refer to classifiers)

Multimodal features: First, raw video data are converted into content descriptors whose objective is to capture meaningful properties of the audio-visual information. Feature extraction is carried out at the frame level. Given the specificity of the task, the system was tested using audio, color, visual, and temporal structure information, which is relevant both to the violence-related concepts and to the violent content itself. The results reported in 2012 were obtained with the following descriptors:

  • audio descriptors (196 dimensions) consist of general purpose descriptors: linear prediction coefficients, line spectrum pairs, MFCCs, zero-crossing rate, and spectral centroid, flux, rolloff, and kurtosis, augmented with the variance of each feature over a window of 0.8 s around the current frame (footnote 6);

  • color descriptors (11 dimensions) using the color naming histogram proposed in [12], which maps colors to 11 universal color names (“black”, “blue”, “brown”, “gray”, “green”, “orange”, “pink”, “purple”, “red”, “white”, and “yellow”);

  • visual features (81 dimensions) which consist of the 81-dimensional Histogram of Oriented Gradients [29];

  • temporal structure (1 dimension), a measure of visual activity. The cut detector of [20], which measures visual discontinuity by means of a difference between the color histograms of consecutive frames, was modified to account for a broader range of significant visual changes. For each frame, the descriptor is the number of detections within a time window centered on the current frame; high values indicate important visual changes, which are typically related to action (a minimal sketch is given after this list).
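As an illustration of this temporal-structure descriptor, here is a minimal sketch that counts, for each frame, how many detected visual discontinuities fall within a window centered on that frame; the window size and the cut detector feeding it are assumptions, not the values used by the ARF team.

```python
import bisect

def cut_activity(cut_frames, n_frames, half_window=25):
    """Per-frame count of detected cuts within a window centered on the frame.

    `cut_frames` is a sorted list of frame indices at which a visual
    discontinuity was detected (e.g., by a cut detector). The half-window of
    25 frames (about 1 s at 25 fps) is an illustrative value only.
    """
    activity = []
    for f in range(n_frames):
        lo = bisect.bisect_left(cut_frames, f - half_window)
        hi = bisect.bisect_right(cut_frames, f + half_window)
        activity.append(hi - lo)  # high values indicate action-like editing
    return activity
```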

Neural network classification: Both at the concept level and at the violence level, classification is carried out with a neural network, namely a multilayer perceptron with a single hidden layer of 512 logistic sigmoid units. The network is trained by gradient descent on the cross-entropy error with backpropagation [36], using the recent idea in [19] to improve generalization: for each presented training case, a fraction of the input and hidden units is omitted from the network and the remaining weights are scaled up to compensate. The set of dropped units is chosen at random for each presentation of a training case, so that many different combinations of units are trained during an epoch.
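To illustrate the dropout scheme, the following is a minimal NumPy sketch of a forward pass through a single-hidden-layer perceptron with 512 logistic sigmoid units and “inverted” dropout, where surviving activations are rescaled so the same weights can be used unchanged at test time; the drop rates and the exact rescaling convention are assumptions, and the 289-dimensional input simply corresponds to the sum of the descriptor dimensions listed above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward_dropout(x, W1, b1, W2, b2, p_drop_in=0.2, p_drop_hid=0.5,
                        rng=np.random.default_rng(0)):
    """One training-time forward pass of a single-hidden-layer MLP with dropout.

    A random fraction of input and hidden units is dropped and the survivors
    are rescaled to compensate ("inverted" dropout). Drop rates are
    illustrative assumptions, not the published settings.
    """
    mask_in = (rng.random(x.shape) >= p_drop_in) / (1.0 - p_drop_in)
    h = sigmoid((x * mask_in) @ W1 + b1)            # 512 logistic hidden units
    mask_hid = (rng.random(h.shape) >= p_drop_hid) / (1.0 - p_drop_hid)
    return sigmoid((h * mask_hid) @ W2 + b2)        # concept / violence scores

# Example with illustrative dimensions: 196 + 11 + 81 + 1 = 289 input features.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 289))
W1, b1 = 0.01 * rng.standard_normal((289, 512)), np.zeros(512)
W2, b2 = 0.01 * rng.standard_normal((512, 1)), np.zeros(1)
print(mlp_forward_dropout(x, W1, b1, W2, b2).shape)  # -> (1, 1)
```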

Concept detection consists of a bank of perceptrons, each trained to respond to one of the targeted violence-related concepts, such as the presence of “fire,” the presence of “gunshots,” or “gory” scenes (see Sect. 8.3.2). As a result, a concept prediction value in \([0,1]\) is obtained for each concept. These values are used as inputs to a second classifier, acting as a final fusion scheme that provides values for the two classes “violence” and “nonviolence” on a frame-by-frame basis. For all classifiers, the parameters were trained using the reference annotations provided with the data.

Violence classification: Frame-level violence predictions for unlabeled data are given by the system's output when fed with the new data descriptors. As the prediction is provided at the frame level, aggregation into segments is performed by assigning to each segment a violence score corresponding to the highest predictor output over the frames within the segment. Segments are then tagged as “violent” or “nonviolent” depending on whether their violence score exceeds a certain threshold (determined in the training step of the violence classifier).
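The frame-to-segment aggregation can be written compactly; the sketch below assumes precomputed frame-level scores in \([0,1]\), segments given as frame-index ranges, and an externally determined threshold.

```python
def segment_violence(frame_scores, segments, threshold):
    """Aggregate frame-level violence scores into segment-level decisions.

    `segments` is a list of (start_frame, end_frame) pairs (end exclusive);
    each segment receives the maximum score of its frames and is tagged
    violent when that score exceeds the threshold learned on the dev data.
    """
    results = []
    for start, end in segments:
        score = max(frame_scores[start:end])
        results.append((score, score > threshold))
    return results
```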

8.5.1.2 Results

Results are evaluated on the shot classification task and on the segment detection one.

Shot-level classification: To highlight the contribution of the concept fusion scheme, different feature combinations were tested: ARF-(c) uses only the mid-level concept predictions as features for violence detection; ARF-(a) uses only audio descriptors, i.e., the violence classifier is trained directly on features instead of concept prediction outputs; ARF-(v) uses only visual features; ARF-(av) uses audio-visual features; finally, ARF-(avc) uses all concepts and audio-visual features, with an early fusion aggregation of concept predictions and features.

Results on the 2012 benchmark, reported in Table 8.3, show an F-measure of 49.9, which placed the system among the top systems. The lowest discriminative power is achieved using only visual descriptors (ARF-(v)), with an F-measure of 35.6. Compared to visual features, audio features show better descriptive power, providing an F-measure of 46.3. The combination of descriptors (early fusion) tends to reduce their efficiency and yields lower performance than the use of concepts alone, e.g., audio-visual (ARF-(av)) yields an F-measure of 44.6, while audio-visual-concepts (ARF-(avc)) achieves 42.4.

Table 8.3 ARF team violence shot-level detection results at MediaEval 2012

Figure 8.4 shows the precision-recall curves for this system. The concept fusion scheme (red line) again provides significantly higher recall than the sole use of audio-visual features, or the combination of all, for precisions of \(25\,\%\) and above.

Fig. 8.4 ARF system precision-recall curves [21]

Arbitrary segment-level results: At the segment detection level, the fusion of mid-level concepts achieves average precision and recall values of 42.21 and 40.38 %, respectively, with an F-measure of 41.3. This corresponds to a miss rate (at the time level) of 50.69 % and a very low false alarm rate of only 6 %. These results are promising considering the difficulty of precisely detecting the exact time interval of violent scenes, as well as the subjectivity of the human assessment (reflected in the ground truth).

8.5.2 Direct Modeling of Multimodal Features

We describe here the approach adopted in the joint submission of Technicolor and IRISA in 2012, which directly models a set of multimodal features to infer violence at the shot level. Relying on Bayesian networks and, more specifically, on structure learning in Bayesian networks [18], we investigate multimodal integration via early and late fusion strategies, together with temporal integration.

8.5.2.1 Description of the System

Figure 8.5 provides a schematic overview of the various steps implemented in Technicolor's system. Violence detection is performed at the shot level via direct modeling of audio and visual features aggregated over shots. Classification is then performed either on the entire set of multimodal features or independently for each modality; in the latter case, late fusion is used to combine the modalities. In both cases, temporal information can be used at two distinct levels: in the model, with contextual features, or as a postprocessing step that smooths decisions taken on a per-shot basis.

Fig. 8.5 Description of the Technicolor/IRISA system at MediaEval 2012

Multimodal features: For each shot, different low-level features are extracted from both the audio and the video signals of the movies:

  • Audio features: the audio features, extracted using 40 ms frames with 20 ms overlap, are: the energy (E), the frequency centroid (C), the asymmetry (A), the flatness (F), the 90 % frequency roll-off (R), and the zero-crossing rate (Z) of the signal. These features are normalized to zero mean and unit variance, and averaged over the duration of a shot, in order to obtain a single value per shot for each feature. The audio feature vector dimension is \(D=6\);

  • Video features: the video features extracted per shot are: the shot length (SL), the mean proportion of blood color pixels (B), the mean activity (AC), the number of flashes (FL), the mean proportion of fire color pixels (FI), a measurement of color coherence (CC), the average luminance (AVL), and three color harmony features, the majority harmony template (Tp), the majority harmony template mean angle (Al), and the majority harmony template mean energy (Em) [3]. The feature vector dimension is \(D = 10\).

Features are quantized into 21 bins on a per-movie basis, except for the majority harmony template, whose values are already quantized over 9 bins.
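A minimal sketch of this per-movie preparation for one audio feature (normalization to zero mean and unit variance, shot-level averaging, then quantization into 21 bins) is given below; equally spaced bins over the per-movie range are an assumption, not necessarily the scheme used in the original system.

```python
import numpy as np

def shot_level_feature(frame_values, shot_bounds, n_bins=21):
    """Normalize a per-frame audio feature, average it per shot, and quantize.

    `frame_values` is a 1-D array of raw per-frame values for one movie and
    `shot_bounds` a list of (start, end) frame-index pairs. Equally spaced
    bins over the per-movie range are an assumption, not the published choice.
    """
    values = np.asarray(frame_values, dtype=float)
    values = (values - values.mean()) / (values.std() + 1e-12)  # zero mean, unit variance
    per_shot = np.array([values[s:e].mean() for s, e in shot_bounds])
    edges = np.linspace(per_shot.min(), per_shot.max(), n_bins + 1)
    return np.clip(np.digitize(per_shot, edges[1:-1]), 0, n_bins - 1)
```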

Bayesian network classification: Bayesian networks are used as a classification technique. The idea behind Bayesian networks is to build a probabilistic network on top of the input features with a node in the network for classification of violence. The network represents conditional dependencies and independencies between the features, and it is possible to learn the structure of the graph using structure learning algorithms. The output of the classifier is, for each shot, the estimated posterior probabilities for each class, viz., violence and nonviolence.

We compared a so-called naive structure, which basically links all the features to the class variable, with structures learned using either forest-augmented networks (FAN) [28] or K2 [7]. The FAN structure consists in building a tree on top of the naive structure based on some criterion related to classification accuracy. On the contrary, the K2 algorithm does not impose the naive structure but rather attempts a better description of the data based on a Bayesian information criterion, thus not necessarily targeting better classification.
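To make the “naive structure” concrete, the sketch below scores the violence class for a shot described by quantized features, with conditional probability tables estimated by Laplace-smoothed counting on the development data; it covers only the naive structure (every feature conditioned on the class), not the FAN or K2 variants, and is not the toolbox used in the original system.

```python
from collections import defaultdict
import math

def train_naive_bayes(shots, labels, n_bins=21, alpha=1.0):
    """Estimate P(class) and P(feature = bin | class) by Laplace-smoothed counts.

    `shots` is a list of tuples of quantized feature values (one tuple per
    shot) and `labels` the corresponding 0/1 violence labels. Only the naive
    structure is illustrated; FAN and K2 additionally learn edges between
    features.
    """
    class_counts = defaultdict(float)
    feat_counts = defaultdict(float)   # keyed by (class, feature index, bin)
    for x, y in zip(shots, labels):
        class_counts[y] += 1
        for j, b in enumerate(x):
            feat_counts[(y, j, b)] += 1

    def log_posterior(x, y):
        lp = math.log(class_counts[y] / len(shots))
        for j, b in enumerate(x):
            lp += math.log((feat_counts[(y, j, b)] + alpha)
                           / (class_counts[y] + alpha * n_bins))
        return lp

    def predict_proba(x):
        """Return P(violent | features) for a quantized feature tuple x."""
        scores = {y: log_posterior(x, y) for y in (0, 1)}
        m = max(scores.values())
        z = sum(math.exp(s - m) for s in scores.values())
        return math.exp(scores[1] - m) / z

    return predict_proba
```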

Temporal integration: Two strategies for integrating temporal information were tested. The first one is a contextual representation of the shots at the input of the classifier, where the classification of a shot relies on its own features augmented with the features of the neighboring shots. If we denote by \(F_i\) the features of shot \(i\), its contextual representation is given by:

$$\begin{aligned} F_i^{\star } := \lbrace F_{i-n}, F_{i-n+1}, \ldots , F_{i-1}, F_i, F_{i+1},\ldots , F_{i+n-1},F_{i+n}\rbrace \end{aligned}$$
(8.2)

where the context size was set to \(n=5\) (empirically determined).
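A minimal sketch of this contextual representation is shown below; the handling of shots near the movie boundaries (here, index clamping) is an assumed convention.

```python
def contextual_features(shot_features, n=5):
    """Build the contextual representation of Eq. (8.2).

    `shot_features` is a list of per-shot feature vectors (lists); each shot is
    represented by the concatenation of its own features and those of its n
    neighbors on each side. Index clamping at the movie boundaries is an
    assumed convention, not necessarily the one used in the original system.
    """
    last = len(shot_features) - 1
    contextual = []
    for i in range(len(shot_features)):
        ctx = []
        for offset in range(-n, n + 1):
            j = min(max(i + offset, 0), last)
            ctx.extend(shot_features[j])
        contextual.append(ctx)
    return contextual
```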

In addition to the contextual representation, we also used temporal filtering to smooth the independent shot-by-shot classification, considering two types of filters:

  • a majority vote over a sample window of size \(k=5\), after thresholding the probabilities.

  • an average of the probabilities over a sliding window of size \(k = 5\), before thresholding the probabilities.

Contrary to averaging, the majority vote does not directly provide a confidence score for the decision taken. We therefore implemented the following heuristic. For a given shot, if the vote results in a violent decision, the confidence score is set to \(\min \{P(S_{\textit{v}})\}\), where \(P(S_{\textit{v}})\) is the set of probabilities of the shots considered as violent within the window. If the vote results in a nonviolent decision, the confidence score is set to \(\max \{P(S_{{\textit{nv}}})\}\), where \(P(S_{{\textit{nv}}})\) is the set of probabilities of the shots considered as nonviolent within the window.
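The two filters and the confidence heuristic can be sketched as follows; window handling at the sequence edges and the strictness of the threshold comparison are assumed conventions.

```python
def smooth_decisions(probs, threshold, k=5, mode="average"):
    """Temporal smoothing of per-shot violence probabilities (window size k).

    "average" filters the probabilities before thresholding; "majority" votes
    on thresholded decisions and derives a confidence with the heuristic
    described in the text (min over violent shots, max over nonviolent ones).
    Edge handling and strict/non-strict comparisons are assumed conventions.
    """
    half = k // 2
    out = []
    for i in range(len(probs)):
        window = probs[max(0, i - half): i + half + 1]
        if mode == "average":
            score = sum(window) / len(window)
            out.append((score, score > threshold))
        else:  # majority vote
            violent = [p for p in window if p > threshold]
            nonviolent = [p for p in window if p <= threshold]
            if len(violent) > len(nonviolent):
                out.append((min(violent), True))
            else:
                out.append((max(nonviolent), False))
    return out
```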

Multimodal integration: For multimodal integration, early fusion and late fusion are compared. Early fusion consists in concatenating the audio and video attributes into a common feature vector, on which the violence classifier is then learned. Late fusion consists in fusing the outputs of an audio classifier and a video classifier. To fuse the outputs for the \(i\)th shot, the following rule is used:

$$\begin{aligned} P_{\text {fused}}^{s_i}(P_{v_a}^{s_i},P_{v_v}^{s_i}) = \left\{ \begin{array}{ll} \max \{P_{v_a}^{s_i},P_{v_v}^{s_i}\} &{} \text {if both decisions are violent} \\ \min \{P_{v_a}^{s_i},P_{v_v}^{s_i}\} &{} \text {if both decisions are nonviolent} \\ P_{v_a}^{s_i} \cdot P_{v_v}^{s_i} &{} \text {otherwise} \end{array} \right. \end{aligned}$$
(8.3)

where \(P_{v_a}^{s_i}\) (respectively \(P_{v_v}^{s_i}\)) is the probability that shot \(i\) is violent as given by the audio (respectively video) classifier. This simple rule yields a high score when both classifiers agree on violence and a low score when they agree on nonviolence.
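A direct transcription of Eq. (8.3) in Python is given below; the decision threshold used to decide whether each classifier "votes" violent is assumed to be 0.5.

```python
def fuse_late(p_audio, p_video, threshold=0.5):
    """Late fusion of per-shot violence probabilities (Eq. 8.3).

    A classifier is considered to vote "violent" when its probability exceeds
    the threshold (the 0.5 default is an assumption). Agreement on violence
    keeps the larger score, agreement on nonviolence the smaller, and
    disagreement multiplies the two scores.
    """
    audio_violent = p_audio > threshold
    video_violent = p_video > threshold
    if audio_violent and video_violent:
        return max(p_audio, p_video)
    if not audio_violent and not video_violent:
        return min(p_audio, p_video)
    return p_audio * p_video
```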

8.5.2.2 Results

We first compare the different strategies using cross-validation over the 15 development movies, leaving one movie out for testing on each fold. We then report results for the best configuration on the official 2012 evaluation.

The MAP@100 values obtained in cross-validation for the audio only, the video only, and the early fusion experiments are presented in Table 8.4. For the late fusion experiments, all classifier combinations, i.e., the naive structure, the FAN, or the K2 networks, with or without context, with or without temporal filtering, have been tested. The seven best combinations are presented in Table 8.5.

Table 8.4 MAP@100 values obtained via cross-validation
Table 8.5 Results obtained for the seven best late fusion parameter combinations

It is interesting to note that, while FAN networks are supposed to perform well in classification, they were outclassed by the K2 and naive structures in these experiments. The latter two structures provide equivalent results, which shows that structure learning is not always beneficial. One must also note that, while the influence of context is not always clear for the modalities presented in Table 8.4, temporal filters systematically improve the results, showing the importance of the temporal aspect of the signal; however, it is not possible to say which filter provides the best performance. Finally, the importance of multimodal integration is clearly shown, as the best results were obtained via both early and late fusion. The importance of temporal integration is further reinforced by the late fusion results: among the best combinations, the contextual naive structure is always used for the video modality, and a temporal filter is always applied after the fusion step. Moreover, late fusion seems to perform better than early fusion.

The system submitted to the 2012 campaign is the best system obtained via late fusion. It uses a noncontextual K2 network for the audio modality, a contextual naive network for the video modality, and a sliding-window probability-averaging filter after fusion. It was applied to the test movies and the results obtained are presented in Table 8.6.

Table 8.6 Results obtained on the test movies

The first thing to note is that the results are much better than in the cross-validation experiments (\(\simeq +\)18 %). Taking a closer look at the individual results for each movie, it appears that the lowest results are obtained for the movie Fight Club, whereas for the other systems presented in the 2012 campaign, the lowest results were usually obtained for Dead Poets Society. This is encouraging as, contrary to the other systems, this system was able to cope with such a nonviolent movie. The “low” results obtained for Fight Club can be explained by the very particular type of violence present in this movie, which might be under-represented in the training database. Similarly, the good results obtained for Independence Day can be explained by its similarity with the movie Armageddon, present in the training set.

These results again clearly emphasize the importance of multimodal integration through late fusion of classifiers. Finally, the overall MAP@100 of 61.82 is already convincing with regard to the evolution of the task toward real-life commercial systems.

8.6 Conclusions

Running the Violent Scene Detection task in the framework of the MediaEval benchmark initiative for two years has produced two major results: a comprehensive dataset for studying violence detection in videos, with a focus on Hollywood movies, and state-of-the-art multimodal methods that establish a baseline for future research to compare with. The results of the evaluation, illustrated by the two systems described in this chapter, clearly emphasize the crucial role of multimodal integration, either for mid-level concept detection or for the direct detection of violence. The two models compared here, namely Bayesian networks and neural networks, have proven beneficial for learning relations between audio and video features for the task of violence detection.

Many questions remain to be addressed, among which we believe two to be crucial. First, Bayesian networks with structure learning, as well as neural networks, implicitly learn the relations between features for better classification; still, it was observed that late fusion performs similarly. There is therefore a need for better models of the multimodal relations. Second, mid-level concept detection has proven beneficial, reducing the semantic gap between features and the classes of interest. There is, however, still a huge gap between features and concepts such as gunshots, screams, or explosions, as demonstrated by various experiments [21, 34]. An interesting idea for the future is to infer concepts in a data-driven manner, letting the data define concepts whose semantic interpretation is found post hoc. Again, Bayesian networks and neural networks might be exploited to this end, with hidden nodes whose meaning has to be inferred.