1 Introduction

Nowadays people are confronted with a great mass of information that is at the same time important and difficult to exploit. The causes of this problem may be summarized in two points. First, acquisition devices have become more and more sophisticated while remaining affordable: we now speak of cameras with megapixels of resolution and gigabytes of storage capacity. Second, the internet has reached nearly every home. Increasing bit rates and the evolution of compression techniques have encouraged people to share large quantities of information. The internet is not only the biggest network in the world but also the biggest database.

In our previous works [11–13, 19, 20, 22] we were interested in news video archives and in how to accelerate their browsing. In this paper we are interested in movies and movie archives.

The movie industry is one of the greatest producers of video (movies). According to the IMDb web portal [18], there is by now an important stock of movies representing a total of thousands of hours. According to [36], the video-on-demand service is becoming more and more popular, with millions of video-on-demand homes in the United States. VOD is projected to grow from its availability in approximately 53% of U.S. cable households to nearly 75% in the near future. Much of the attention is focused on the delivery of movies and network television programming.

Faced with this great mass of movies, the challenge for every user is to find what they need rapidly. The solution suggested by researchers is to segment every movie into semantic entities and to give users the opportunity to access them non-sequentially and to watch particular scenes of the movie. This can help them judge the relevance of the movie and avoid wasting time. As it is difficult and time-consuming to segment every movie into scenes manually, the proposed systems are fully automatic. As part of this research field, we propose in this paper a robust and automatic method to perform this segmentation. Video segmentation, and especially video shot segmentation, is not a new research field; it dates back to the nineties.

Brunelli et al. [5] define the structuring of the video as decomposing the stream into shots and scenes. A shot consists of a sequence of frames recorded by one camera contiguously and representing a continuous action in time or space. A video scene consists of a sequence of semantically correlated shots [25] (see Fig. 1).

Fig. 1

The video structure

By now, video shot detection systems have matured and are able to detect all types of transitions with high recall and precision rates [10].

However, the video scene detection problem is more complicated and still requires more work. Indeed, the major problem of scene detection is the lack of an exact criterion to define a scene boundary. The only heuristic that may be used is that the shots belonging to one scene share the same actors and occur at the same moment and in the same place. For this reason, as shown in Fig. 2, all the proposed works proceed in three steps to accomplish video scene detection. First, the movie is segmented into shots. Second, a representative image is computed for every shot (some works use several representative frames instead of one). The final step is the extraction of signatures from these representative images and their grouping into scenes.

Fig. 2

The general framework of scenes extraction

As part of this effort, we propose in this paper a novel approach for video scene detection. It builds on two observations: video scenes may be discovered by detecting shot agglomerations, and localizing action zones may be useful to avoid the over-segmentation that occurs when detecting scenes containing action shots. This new vision of scene detection is proposed to remedy the problems of classic approaches, which use one-to-one shot similarity to delimit scenes.

The rest of the paper is organized as follows: in Section 2, we discuss works related to video scene detection. In Section 3, we state the problem, present our approach and detail our contributions. Results of our approach are shown in Section 4. We conclude with directions for future work.

2 Related works

We can classify all the proposed works into two categories: the unimodal perspective and the multimodal perspective. The first perspective is based only on visual signatures to group shots that are visually similar and temporally close. In this perspective we can evoke the important and pioneering work of Yeung et al. [42]. In this work, the authors rely on the assumption that every scene has a repetitive shot structure and use graphs to detect scene boundaries. Each node of the graph represents a shot, and the transitions between two nodes are weighted according to the visual similarity and the temporal locality of the two corresponding shots. This approach is effective with dialog scenes; however, it may fail with other kinds such as action scenes (see Fig. 3). The work of Riu et al. [33] has the advantage of integrating motion information into the measure of similarity between two shots. As Yeung et al. [42], they use temporal locality and visual similarity to group shots into the same scene. The major drawback of their approach remains the clustering algorithm, which depends heavily on many parameters and thresholds. Rasheed and Shah [32] follow the same line as Riu et al. [33] by integrating motion information into the shot-grouping criteria. Their main contribution consists in relying on global shot similarities and graph partitioning to cluster shots and extract scenes. However, their clustering technique is relatively complex and the integration of the motion information into the similarity measure is rather simplistic. Hanjalic et al. [15] suggested a new way to compute shot keyframes by concatenating all the shot images. Their clustering algorithm is based on block matching to measure the similarity of the keyframes. They use two sliding windows to do the clustering and detect scene boundaries: the two windows start from a given shot and perform a forward and a backward search to find the nearest matching shot. The problem of this approach is also the important number of parameters on which it depends; in fact, we have to fix the matching threshold, the size of the backward window and the size of the forward window. Tavanapong and Zhou [38] use the same clustering algorithm proposed by Hanjalic et al. [15]. However, they suggest a new way to compute the similarity between shots: they compute color features only on the shot corners, because they suppose that in the shots of a same scene only the corners remain unchanged. This assumption may hold for static dialog scenes, but its validity in action scenes is doubtful. Similarly to [15], the authors of [43] and [40] proposed two approaches also based on sliding windows that try to gather successive similar shots. [40, 43] differ from [15] by their similarity measures and clustering algorithms; however, as for [15], the problem of this technique (the sliding window) remains the choice of the window size and of the sliding strategy. Yeo and Kender [21] proposed a memory-based approach to segment video into scenes. A buffer is used to store shots of a same scene, and new shots are added to the buffer by measuring the visual distance between the incoming shots and the shots in the buffer. Ngo et al. [30] proposed an approach that performs a motion characterization of shots and background mosaicking in order to reconstruct the scenes and detect their boundaries. Similarly, Chen et al. [9] proposed an approach for scene detection based on mosaic images. They do not represent every shot by one keyframe or a set of keyframes, but by a mosaic image, in order to gather the maximum of information from the shot. They then extract color and texture signatures from every mosaic image and rely on some cinematic rules to group shots into scenes. The idea of representing a shot by a mosaic image is interesting; however, the creation of mosaic images is still a research problem. Besides, relying on cinematographic rules to detect some types of scenes is not always efficient, for two reasons. First, cinematographic rules are general rules, they change continuously and filmmakers often transgress them. Second, quantifying cinematographic rules effectively is still a difficult problem [14].

Fig. 3

An example of an action scene

In the same way, the authors of [6] rely on cinematographic rules to detect scene boundaries. They proposed a finite state machine designed according to cinematographic rules and scene patterns to extract dialog or action scenes. In [7] the authors presented a scene change detection system using an unsupervised segmentation algorithm and object tracking. Object tracking helps compute correlations between successive frames and shots to detect scene boundaries. In [26] Lin et al. presented an approach that proceeds in two steps to detect scene boundaries. First a shot segmentation is performed; to properly describe the significant changes that may occur within shots, some of them may be segmented into sub-shots. Then, scene boundaries are extracted by analyzing the splitting and merging force competitions at each shot boundary. The main drawback of [7, 26] resides in the features used: they employ only color information, which is not always efficient, especially for action scenes.

The number of works in the second perspective is small relative to the number of works in the first one. In this perspective, we can evoke the well-known work of Sundaram and Schang [37], in which the authors define a scene as a contiguous segment of data having consistent long-term audio and video characteristics. They detect audio scenes based on ten auditory features and a sliding window that tries to detect significant changes. Nearly the same thing is done on the visual track. The final step of the process merges the visual scenes and the auditory scenes to obtain the final scene boundaries. In the same perspective, we can also evoke the work of Huang et al. [17], in which the authors follow the same line as Sundaram and Schang [37]. To detect scene boundaries, they detect color breaks, audio breaks and motion breaks, relying on the assumption that a scene break occurs only when all three kinds of break occur at the same time.

3 Proposed framework

3.1 System framework

We think that to build an efficient system for scene detection, we have to understand how these scenes are made. As stated in [35], a video is the result of an authoring process. When producing a video, an author starts from a conceptual idea. The semantic intention is then articulated in (sub)consciously selected conventions and techniques (rules) for the purpose of emphasizing aspects of the content. In this section, we take a quick look at some important cinematic rules on which the majority of movies are built. According to [1, 14, 41] there are four important rules used by filmmakers to construct scenes and shots:

  • Rule 1: The 180° Rule. This rule dictates that the camera should stay in one of the areas on either side of the axis of action.

  • Rule 2: Action Matching Rule. It indicates that the motion direction should be the same in two consecutive shots that record the continuous motion of an actor.

  • Rule 3: Film Rhythm Rule. It refers to the perceived rate and regularity of sounds, series of shots, and motion within the shots. Rhythmic factors include beat, accent, and tempo. Within the same scene, the shots have a similar rhythm.

  • Rule 4: Establishing Shot. It shows the spatial relations among the important figures, objects, and setting in a scene. Usually, the first and the last few shots in a dialog scene are establishing shots.

After reading these four rules we can deduce two important things. First, according to Rules 1, 3 and 4, when trying to relax the audience, filmmakers use long shots with common backgrounds and surroundings, repetitive camera angles for subsequent shots, and quiet audio tracks consisting essentially of speech and silence segments [24]. Second, according to Rules 2 and 4, when trying to excite the viewer, filmmakers use sound effects together with a rough visual track to impose a tempo and a rhythm on the viewer. Based on these rules and heuristics, we suggest a scene extraction approach composed of three steps. The first step uses the visual content information to perform a preliminary scene detection. The second step localizes action zones. Finally, the third step merges the results of the first two.

In fact, the presence of action zones may cause an over-segmentation because the visual content of the shots of action scenes changes enormously. This over-segmentation may be corrected by merging scenes that intersect with the same action zones.

Our approach is a multimodal approach. Indeed, as filmmakers use the visual and auditory modalities to transmit their cinematographic messages, we think that proposing a multimodal approach increases the system's chances of success. Compared with existing multimodal systems, our contribution consists in proposing a new way of integrating the visual and the auditory information to determine scene boundaries. Existing systems use many rules to merge the two kinds of boundaries. We do not agree with Sundaram and Schang, who affirm in [37] that “audio data must be used to find exact scenes boundaries and not only for identifying important regions or detecting events as explosions in video sequences”, because within a given scene, and contrary to the visual data, the audio data changes a lot, even in a dialog scene. In one scene, we may have silences, male voices, female voices, car sounds, musical backgrounds... The ranges over which the auditory features vary are therefore wide, and it is very difficult to delimit scenes correctly using the audio data alone. We agree that the auditory information is important to correctly detect scene boundaries; however, its role must be corrective, especially in detecting action scenes. For this reason, the features used are not classified into visual and auditory features, as all multimodal approaches have done, but into tempo features and content features, used to detect respectively action scenes and non-action scenes, which are the principal categories of scenes (see Fig. 4).

Fig. 4

The general taxonomy of movie scenes

The second contribution consists in proposing a new method for the detection of non-action scenes using content features and the Kohonen map. In fact, contrary to all the proposed works (unimodal and multimodal), we extract scenes by localizing the agglomerations of shots and not by relying on one-to-one shot similarity.

The third contribution consists in proposing, for action scenes, a method totally different from the one used for the detection of the other kinds of scenes. In action movies, where there are many action scenes, we rely on shot classification using Fuzzy C-Means and tempo features (motion, audio energy, shot frequency) to localize action scenes.

Indeed, over-segmentation may occur when extracting action scenes based only on the content information. This over-segmentation is corrected by localizing the action zones and merging scenes intersecting with the same action zones (see Fig. 5).

Fig. 5

The scene pathfinder

3.2 Preliminary scene extraction

A scene is defined in [1] as a segment of a narrative film that usually takes place in a single time and place, often with the same characters. As mentioned in the pioneering work of Yeung et al. [42], shots belonging to the same scene are characterized by two important things. First, they share nearly the same surroundings and many common objects. Second, they are temporally close. For this reason, and like the majority of proposed works [9, 15, 26, 30, 32, 33, 38, 42, 43], we rely on these criteria to cluster shots.

Excepting [32, 42], which use graph theory, all the others use one-to-one shot similarity and thresholds fixed by empirical studies to extract scenes. Relying on one-to-one shot similarity may be problematic, because even the shots of a same scene are generally different, so a direct comparison may not be useful. However, if we compare them to the shots of other scenes, they share some common properties such as the background. So grouping shots into scenes must be done relative to the other shots, and not only by finding direct similarities between them.

To accomplish this, we rely on pattern recognition techniques, and in particular on clustering techniques, which are suitable for this kind of processing. The clustering operation tries to gather elements that share common properties relative to the other elements, based on descriptors that quantify these properties.

To discover the shot agglomerations (scenes), we have to choose the right descriptors, i.e. those which discriminate the shots suitably. Shots belonging to the same scene share two important things: the general luminance and the background (see Fig. 6).

  • The luminance information of a given image is the brightness distribution or the luminosity within this image. Generally two different scenes have two different luminosities due to the difference in lighting conditions, in places and in time when the scenes take place (indoor, outdoor, day, night, cloudy weather, sunny weather…).

  • The background information is also important because it describes the surroundings and the place where the scene takes place. To extract the background information we rely essentially on two descriptors: color and texture. Color is a general descriptor for all kinds of backgrounds. Texture is also important for textured backgrounds such as forests or buildings (see Fig. 7).

Fig. 6

The luminosity is one of the common features of shots belonging to the same scene

Fig. 7

Texture plays a key role in clustering shots

To extract the luminance and the background information we use the standard HSV color histogram and the Fourier histogram. The HSV histogram is a classical color signature in the HSV color space. This signature makes available information about the color content of the image in its non-altered state. Its performance is good compared to other color signatures [4].

The Fourier histogram signature is computed on the gray-level image and contains information about texture and scale. These two descriptors already exist in our IKONA CBIR engine [4] and have been tested in many contexts in the field of image processing, such as content-based image retrieval and relevance feedback.
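As the exact parameterization of the IKONA descriptors is not given here, the following Python sketch shows one plausible way to build such a concatenated signature; the bin counts (128 HSV bins plus 56 Fourier bins, giving the 184 components used in the next section) and the log-magnitude binning of the spectrum are illustrative assumptions, not the IKONA implementation.

```python
import cv2
import numpy as np

def shot_signature(keyframe_bgr, hsv_bins=(8, 4, 4), fourier_bins=56):
    """Concatenated colour + texture signature for one shot keyframe.
    Bin counts are illustrative (8*4*4 + 56 = 184 components)."""
    # HSV colour histogram, normalised to sum to 1
    hsv = cv2.cvtColor(keyframe_bgr, cv2.COLOR_BGR2HSV)
    color_hist = cv2.calcHist([hsv], [0, 1, 2], None, list(hsv_bins),
                              [0, 180, 0, 256, 0, 256]).flatten()
    color_hist /= color_hist.sum() + 1e-9

    # "Fourier histogram": histogram of the log-magnitude spectrum of the
    # grey-level image, used here as a rough stand-in for a texture/scale descriptor
    gray = cv2.cvtColor(keyframe_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray)))
    tex_hist, _ = np.histogram(np.log1p(spectrum), bins=fourier_bins)
    tex_hist = tex_hist.astype(np.float32)
    tex_hist /= tex_hist.sum() + 1e-9

    return np.concatenate([color_hist, tex_hist])
```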

3.3 Discovering scene agglomerations through Kohonen maps

One of the drawbacks of the proposed approaches is that they rely on one-to-one shot similarity instead of studying agglomerations. Kohonen maps [23] have proven effective at doing this. We have already used them to perform a macro-segmentation: in our previous work [13], we used the Kohonen map to segment news broadcast programs into stories by discovering the cluster (the agglomeration) of anchor shots. Kohonen maps are well suited for mapping high-dimensional vectors (shots) onto a two-dimensional space. In fact, the Kohonen map is composed of two layers: the input layer, which corresponds to the input elements, and an output layer, which corresponds to a set of connected neurons. Every input element is represented by an n-dimensional vector X = (x1, x2, …, xn) and connected to the m nodes of the map through weights Wij (see Fig. 8).

Fig. 8

The structure of the Kohonen map

The mechanism of the Kohonen map may be summarized as follows:

  • The connection weights are randomly initialised

  • For every input vector, the weights of the connections linking this vector to the neurons are updated. The key idea introduced by Kohonen maps is the neighbourhood relation: to preserve the relations between neighbouring nodes, not only the weights of the winning node are adjusted, but those of its neighbours are updated as well. A node whose weight vector closely matches the input vector has a small activation level, and a node whose weight vector is very different from the input vector has a large activation level. The node in the network with the smallest activation level is deemed the "winner" for the current input vector. The further a neighbour is from the winner, the smaller its weight change.

  • The winning node is computed using the following formula:

    $$ j* = \arg {\min_j}\sum\limits_{i = 1}^n {{{\left( {{X_i}(t) - {W_{ij}}(t)} \right)}^2}} $$
    (1)
  • The weight of the winning neuron is modified at every iteration as follows:

    $$ {W_{ij}}(t) = {W_{ij}}\left( {t - 1} \right) + a(t)\left[ {{X_i}(t) - {W_{ij}}\left( {t - 1} \right)} \right] $$
    (2)
  • The weights of the neighbouring neurons are modified at every iteration as follows:

    $$ {W_{ij}}(t) = {W_{ij}}\left( {t - 1} \right) + a(t){h_j}\left( {j*,t} \right)\left[ {{X_i}(t) - {W_{ij}}\left( {t - 1} \right)} \right] $$
    (3)

    Where t represents the time step and a(t) is the learning rate, which decreases with time. hj(j*,t) represents the amount of influence of the training sample on the node. It is computed as follows:

    $$ {h_j}\left( {j*,t} \right) = \exp - \frac{{{{\left\| {j - j*} \right\|}^2}}}{{2{r^2}(t)}} $$
    (4)

    Where r(t) is the neighbourhood radius which typically decreases with time.

As already mentioned, we compute two features for every shot: the HSV color histogram and the Fourier histogram. The concatenation of these two features constitutes a vector of 184 components. For every movie, we train a Kohonen map with the vectors representing the shots of this movie. As a result of the training process, every shot (training example) is attributed to a node (its winning node).
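The training loop corresponding to Eqs. (1)-(4) can be sketched as follows; the grid size, number of epochs and decay schedules are assumed values, since they are not specified here.

```python
import numpy as np

def train_som(shots, grid=(10, 10), epochs=20, a0=0.5, r0=3.0, seed=0):
    """Minimal SOM training loop following Eqs. (1)-(4); `shots` is an
    (n_shots, 184) array of concatenated HSV + Fourier descriptors."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h * w, shots.shape[1]))            # random initialisation
    coords = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)

    t_max = epochs * len(shots)
    t = 0
    for _ in range(epochs):
        for x in shots[rng.permutation(len(shots))]:
            a = a0 * (1.0 - t / t_max)                        # decaying learning rate a(t)
            r = max(r0 * (1.0 - t / t_max), 0.5)              # decaying radius r(t)
            winner = np.argmin(((weights - x) ** 2).sum(axis=1))   # Eq. (1)
            d2 = ((coords - coords[winner]) ** 2).sum(axis=1)
            hj = np.exp(-d2 / (2.0 * r * r))                  # Eq. (4)
            weights += a * hj[:, None] * (x - weights)        # Eqs. (2)-(3)
            t += 1

    # winning node of every shot after training
    bmu = np.array([np.argmin(((weights - x) ** 2).sum(axis=1)) for x in shots])
    return weights, coords, bmu
```

The returned `bmu` array gives, for every shot, its winning node; this is the input of the scene extraction step described next.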

After training the Kohonen map, we try to discover the shot agglomerations that represent the scenes of the movie (see Fig. 8). If we map the definition of a scene onto the Kohonen map, a scene is the set of shots located in the same zone of the map, where a zone is identified by a set of nodes. Since every shot is attached to a unique node, its winning node, shots attached to neighbouring nodes may belong to the same scene.

In order to extract scenes from the Kohonen map, we rely on the two following assumptions. First, two shots which belong to the same scene either belong to one node or to two neighbouring nodes (see Fig. 8). Second, shots belonging to one scene must also be temporally close. As in [32], we fix the temporal threshold at 30 seconds: two shots A and B may belong to the same scene only if |MA − MB| is below 30 seconds, where MA and MB are respectively the timestamps of the middle frames of shots A and B. This threshold is discussed in the experimentation section.

The pseudocode of our clustering algorithm is shown in Fig. 9. First, we determine the winning neuron of every shot. Then, if two shots have the same winning neuron or neighbouring winning neurons (direct neighbours, see Fig. 8) and are temporally close (the temporal distance between their middle frames is below 30 seconds), we put them into the same scene. Finally, if two scenes have one or more shots in common they are automatically merged into one scene. Besides, if two scenes temporally intersect they are also merged.

Fig. 9

Pseudo code of the scene pathfinder

For instance, if a given scene A contains the shots i, i+2 and i+3 and a scene B contains the shots i+1 and i+4, then A and B are merged into one scene. In fact, a scene may contain shots that are completely different and do not share any common object (see Fig. 10). The temporal continuity of the scene allows gathering all these shots.

Fig. 10

An example of a scene that takes place in more than one place
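The grouping just described (Fig. 9) can be sketched in Python as follows, assuming the `bmu` and `coords` arrays of the training sketch above and a list of middle-frame timestamps; the union-find structure and the interval-overlap merge are implementation choices of this sketch, not the exact procedure of Fig. 9.

```python
import numpy as np

def preliminary_scenes(bmu, coords, mid_times, tau=30.0):
    """Group shots whose winning nodes coincide or are direct neighbours on
    the map and whose middle frames are at most tau seconds apart; scenes
    sharing shots or temporally intersecting end up merged."""
    n = len(bmu)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(n):
        for j in range(i + 1, n):
            close_on_map = np.abs(coords[bmu[i]] - coords[bmu[j]]).max() <= 1
            close_in_time = abs(mid_times[i] - mid_times[j]) <= tau
            if close_on_map and close_in_time:
                union(i, j)

    # merge scenes whose shot-index spans overlap (temporal intersection)
    merged = True
    while merged:
        merged = False
        spans = {}
        for i in range(n):
            l = find(i)
            lo, hi = spans.get(l, (i, i))
            spans[l] = (min(lo, i), max(hi, i))
        labels = list(spans)
        for a in range(len(labels)):
            for b in range(a + 1, len(labels)):
                (lo1, hi1), (lo2, hi2) = spans[labels[a]], spans[labels[b]]
                if lo1 <= hi2 and lo2 <= hi1:
                    union(labels[a], labels[b])
                    merged = True

    return [find(i) for i in range(n)]  # shots sharing a label form one scene
```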

3.4 Localizing action zones

The major drawback of all proposed works is that they do not have a specific process to localize action zones. They use the same process to extract all kinds of scenes: they compute a list of features (generally visual features dealing only with the content) for all shots and then cluster the shots to detect scene boundaries. Some works, such as [32, 33], try to introduce into this list some specific features, such as motion, to take into account the case of action scenes, and then perform an early fusion of all features before starting the clustering process. However, this remains insufficient. In fact, in action scenes the visual content changes a lot while the tempo stays nearly the same (high). So the features describing the content may distort the clustering process, with a risk of over-segmentation. That is why, in the majority of works, the results get worse with action movies.

An action scene is characterized by three important phenomena [1, 41]. First, it contains a lot of motion: motion of objects (actors, cars…) and camera motion (pan, tilt, zoom…). That is why shots of the same scene do not share any common background or surroundings. The second phenomenon is the special sound effects used to excite and stimulate the viewer's attention: filmmakers amplify the actors' voices and introduce explosion and gunfire sounds from time to time. The third important phenomenon is the duration and the number of shots per minute. Action scenes are filmed by many cameras, so the filmmaker switches constantly between cameras to film the scene from many viewpoints.

After a deep study of these phenomena, we suggest quantifying them by means of three descriptors computed for every shot, on which we rely to cluster these shots. These descriptors are detailed in the next sections.

3.4.1 Motion activity analysis

The intensity of motion activity gives the viewer information about the tempo of the movie. The motion intensity of a given segment tries to capture the intensity of the action in this segment [8, 34]. Action scenes have a high intensity of action due to strong camera motion and object motion. Non-action scenes, such as dialog scenes, have a low intensity of action because the camera is generally fixed and the objects are static.

Many descriptors have been proposed to measure motion activity. The Lucas-Kanade optical flow [28] is often used to estimate the motion of one shot: it estimates the direction and the speed of object motion from one frame to another in the same shot. The estimation of the optical flow is based on the assumption that the image intensity is constant between two times t and t+dt. This assumption may be mathematically formalized by the following equation:

$$ {I_x}u + {I_y}v + {I_t} = 0 $$
(5)

In this equation \(I_x\), \(I_y\) and \(I_t\) are the spatiotemporal image brightness derivatives, \(u\) is the horizontal optical flow and \(v\) is the vertical optical flow.

Let \(u(t,i)\) and \(v(t,i)\) denote the flow computed between two frames of one shot and averaged over the i-th 16×16 macroblock. The spatial activity matrix is defined as follows:

$$ Ac{t_{i,j}} = \sqrt {u{{\left( {t,i} \right)}^2} + v{{\left( {t,i} \right)}^2}} $$
(6)

The activity between two frames is obtained by averaging the motion vector magnitudes of the macroblocks over the entire frame. It is computed as follows:

$$ FrAct = \frac{1}{NBlock}\sum\limits_i {\sum\limits_j {Ac{t_{i,j}}} } $$
(7)

where NBlock is the number of macroblocks in the frame. The activity of a shot is the average of its frame activities. It is computed as follows:

$$ ShotAct = \frac{1}{NFrame}\sum\limits_k {FrAc{t_k}} $$
(8)

where NFrame is the number of frames in the shot.
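A sketch of Eqs. (6)-(8) in Python with OpenCV is given below; dense Farneback flow is used here as a convenient stand-in for the Lucas-Kanade estimator cited above, and the flow parameters are illustrative.

```python
import cv2
import numpy as np

def shot_activity(frames_gray, block=16):
    """Eqs. (6)-(8): average optical-flow magnitude over 16x16 macroblocks,
    then over the frames of the shot. `frames_gray` is a list of grey frames."""
    frame_acts = []
    for prev, curr in zip(frames_gray[:-1], frames_gray[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        u, v = flow[..., 0], flow[..., 1]
        mag = np.sqrt(u ** 2 + v ** 2)                       # Eq. (6) per pixel
        h, w = mag.shape
        blocks = mag[:h - h % block, :w - w % block]
        blocks = blocks.reshape(h // block, block, w // block, block)
        act = blocks.mean(axis=(1, 3))                       # per-macroblock activity
        frame_acts.append(act.mean())                        # Eq. (7)
    return float(np.mean(frame_acts)) if frame_acts else 0.0  # Eq. (8)
```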

3.4.2 Audio energy analysis

The audio track contributes significantly to the perception of the tempo of the movie. Action scenes are generally characterized by musical backgrounds with many sound effects. As in [27], in order to discriminate between voiced and unvoiced sounds, researchers generally use the energy. Unvoiced sounds such as music and sound effects have a larger dynamic range than speech and are generally characterized by an important energy. We propose to compute the Short-Time Average Energy of the audio track of every shot. The Short-Time Average Energy of a discrete signal is defined as follows:

$$ E\left( {sho{t_n}} \right) = \frac{1}{N}\sum\limits_i {s{{(i)}^2}} $$
(9)

Where s(i) is the discrete-time audio signal, i is the time index and N is the number of audio samples of shot n. We compute the energy of every shot of the movie, aiming at distinguishing energetic shots.
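Eq. (9) reduces to a few lines of Python; how the mono PCM samples of each shot are obtained (demuxing, resampling) is left outside this sketch.

```python
import numpy as np

def short_time_energy(samples):
    """Eq. (9): average of the squared audio samples of one shot.
    `samples` is the mono PCM signal of the shot as a float array."""
    s = np.asarray(samples, dtype=np.float64)
    return float(np.mean(s ** 2)) if s.size else 0.0
```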

Indeed, Fig. 11 displays the variation of the Short-Time Average Energy over a movie. The peaks correspond to action zones in the movie, and the valleys generally correspond to dialog scenes, in which we find either silences or speech segments characterized by a low energy.

Fig. 11

Audio short time energy of a movie

3.4.3 Shot frequency

Action scenes are characterized by an important number of short shots that stream rapidly. Our idea is to compute the shot frequency, i.e. the number of shots per minute relative to a given shot (the reference shot). We place every shot at the center of a one-minute interval and count the number of shots that fall within this interval (see Fig. 12). This number is the third feature of every shot.
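The computation is straightforward; the sketch below assumes a list of middle-frame timestamps (in seconds) for all shots of the movie.

```python
def shot_frequency(mid_times, ref_index, window=60.0):
    """Number of shots whose middle frame falls in a one-minute interval
    centred on the reference shot (see Fig. 12)."""
    center = mid_times[ref_index]
    lo, hi = center - window / 2.0, center + window / 2.0
    return sum(1 for t in mid_times if lo <= t <= hi)
```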

Fig. 12

The shot frequency relative to the selected reference shot is 10

3.4.4 Using Fuzzy C-Means to extract action shots

After computing the three features (motion, audio energy and shot frequency) for every shot, the final step consists in using these features to discriminate between action and non-action shots. Two ways may be explored here: either we rely on heuristics and thresholds, or we rely on pattern recognition techniques. Pattern recognition techniques, and in particular unsupervised clustering techniques, are suitable for this kind of task and very efficient at discriminating classes. Besides, our problem is a typical unsupervised clustering problem, and the number of classes is known: the class of action shots and the class of non-action shots. One of these clustering techniques is Fuzzy C-Means, introduced by Bezdek [2]. The Fuzzy C-Means (FCM) algorithm is an iterative clustering method that produces an optimal c-partition by minimizing the weighted within-group sum of squared errors objective function \( {J_q}\left( {U,V} \right) \):

$$ {J_q}\left( {U,V} \right) = \sum\limits_{k = 1}^n {\sum\limits_{i = 1}^c {{{\left( {{u_{ik}}} \right)}^q}{d^2}\left( {{x_k},{v_i}} \right)} } $$
(10)

Where \( X = \left\{ {{x_1},{x_2},...,{x_n}} \right\} \subseteq {R^p} \) is the set of data items, n is the number of data items, c is the number of clusters with \( 2 \leqslant c < n \), \( u_{ik} \) is the degree of membership of \( x_k \) in the i-th cluster, q is a weighting exponent on each fuzzy membership, \( v_i \) is the prototype of the center of cluster i, and \( d\left( {x_k},{v_i} \right) \) is a distance measure between object \( x_k \) and cluster center \( v_i \). A solution of the objective function can be obtained via an iterative process: the membership matrix \( U = \left( u_{ik} \right) \) is randomly initialized, and at each iteration we compute the new membership coefficients and the new center of each cluster. In our context we have two classes, and the data items are shots represented by vectors of three components (motion, audio energy and shot frequency). At the end of the clustering, every shot is attributed to the cluster in which it has the highest membership degree. We then identify the class of action shots as the class with the higher motion values. The action shots are finally ordered temporally to localize the action zones exactly, as shown in Fig. 13.
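Eq. (10) can be minimized with the classical alternating updates of the memberships and the centers. The sketch below is a minimal NumPy implementation on the three tempo features; the exponent q = 2, the iteration limit and the tolerance are assumed standard values.

```python
import numpy as np

def fuzzy_cmeans(X, c=2, q=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal Fuzzy C-Means (Eq. 10); X is (n_shots, 3) with motion, audio
    energy and shot frequency. Returns the centers and the membership matrix U."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)                     # random fuzzy partition
    for _ in range(n_iter):
        Um = U ** q
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]    # cluster prototypes v_i
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (q - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)         # membership update u_ik
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

# Each shot goes to the cluster with its highest membership; the action class
# can then be taken as the one with the higher mean motion value:
#   labels = U.argmax(axis=1); action_class = centers[:, 0].argmax()
```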

Fig. 13

The temporal distribution of the classified video shots. Red squares represent action shots and green squares represent non-action shots

However, action scenes do not start directly with action shots. They start with calm shots and a calm tempo and, as time goes by, the tempo and the rhythm increase [1]. Besides, they generally finish with calm shots. This is not problematic for us, because we do not aim, through the detection of action zones, to find the exact scene boundaries. We aim at localizing action zones to remedy the over-segmentation that may occur when extracting scenes from the Kohonen map.

The result of the Fuzzy C-Means classifier is the detection of the cores of action scenes (the action zones). Contiguous action zones then help us merge over-segmented scenes: the preliminary scenes extracted from the Kohonen map that intersect with the same action zone are merged.
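A sketch of this corrective merge is given below, assuming that both the preliminary scenes and the action zones are available as (start, end) intervals in seconds; the interval representation is an assumption of this sketch.

```python
def merge_with_action_zones(scenes, action_zones):
    """Merge preliminary scenes (list of (start, end) times from the Kohonen
    map) that intersect the same action zone (list of (start, end) times)."""
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    merged = list(scenes)
    for zone in action_zones:
        hit = [s for s in merged if overlaps(s, zone)]
        if len(hit) > 1:
            fused = (min(s[0] for s in hit), max(s[1] for s in hit))
            merged = [s for s in merged if s not in hit] + [fused]
    return sorted(merged)
```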

4 Experiments

4.1 Scene change detection

To show the efficiency of our system we conducted experiments on five movies, as shown in Table 1. This database has already been used by the authors of [9] to test their approach.

Table 1 Experimental results with ground truth

We chose to test our system on this database for two reasons. First, the database includes movies belonging to different cinematographic genres, which helps us test the system properly on many kinds of scenes. Second, as we compare our system to the one proposed in [9], it is suitable to use the same database.

We use the recall and the precision rates as the measures of performance. They are defined as follows:

$$ {{\text{recall}} = \frac{N_c}{N_g}} $$
(11)
$$ {{\text{Precision}} = \frac{N_c}{{{N_c} + {N_f}}}} $$
(12)

Where Ng is the number of scenes of the ground truth, Nc is the number of scenes correctly detected and Nf is the number of scenes wrongly detected.

The ground truth has been generated by two real users. We explained the definition of a scene to the two users before giving them the database. Then we asked each user to watch every movie of the database and to delimit its scenes. The results of the two segmentations were merged to generate the final scene boundaries.

Table 1 shows that our system generally presents encouraging results. However, these results vary according to the genre of the film. The best results are obtained with action movies (“Bugs” and “Dungeons and Dragons”). This shows the efficiency of our strategy for extracting action scenes, which is the weak spot of the majority of approaches. The major problem of action scenes remains the significant lighting changes caused by explosions and flashing lights. As the features related to tempo do not deal with content information, the classification of shots into action/non-action shots using the Fuzzy C-Means classifier was both efficient and very useful to remedy the problem of over-segmentation.

Encouraging results are also obtained on the movie “Little Voice”. This proves the merit of the Kohonen map in delimiting non-action scenes. Dramatic movies are essentially composed of dialog scenes in which characters converse in settings sharing many common objects and backgrounds. These encouraging results show that discovering shot agglomerations is advantageous; this is further demonstrated in Section 4.4, where we compare our system to other systems in which the authors rely on one-to-one shot similarity to cluster shots into scenes.

However, more work remains to be done for comedic and musical movies. These kinds of movies do not follow the common cinematographic rules. For instance, in comedic movies a given scene may evolve in different contexts and in different settings. That is why the Kohonen map may miss many scenes and make many false detections: shots of one scene may be located in different zones of the map. For this reason, we have to consider adding a third path to our system and using other kinds of assumptions to find these kinds of scenes.

We also have to mention that some scenes are ambiguous and very hard to delimit automatically. These kinds of scenes are discussed in the following section.

4.2 Analysis of the ambiguity of some scenes

To establish the limitations of our technique, and of scene detection systems in general, it is important to discuss some types of scenes which are ambiguous and very difficult to delimit automatically.

Some consecutive scenes may occur in the same place or under the same conditions. For instance, in the movie “Dungeons and Dragons” many successive scenes take place in the forest (common background and common texture). The boundaries of these scenes are indistinguishable: visual features and clustering techniques are incapable of delimiting them properly.

Lighting conditions may also perturb the detection process, especially when consecutive scenes take place at night or in dark indoor places. In these conditions, the background is very dark and foreground objects such as faces or decor elements are indistinguishable. We encountered this kind of scene in the movie "Bugs": many scenes of this movie take place in a train tunnel. This kind of scene causes under-segmentation for our system, and for the majority of systems in general, because neither the visual information nor the auditory information is able to distinguish between these scenes.

Multi-angular scenes are also ambiguous. The visual coherence between the shots of a multi-angular scene is reduced because they are filmed with many cameras and display different kinds of background and foreground objects. As an example, we may cite dialog scenes which take place in streets (crowd scenes). In these scenes actors converse and, from time to time, we may see a passing car, a passing person, a building, a neon sign… In this kind of scene, using global visual features to cluster shots is not very efficient; local visual features may be a more suitable solution.

Moreover, as shown in Fig. 14, scenes that include shots taken at different camera distances are also ambiguous and may cause over-segmentation. As mentioned by Bordwell and Thompson [3], we distinguish seven different types of shots: extreme long shot, long shot, medium long shot, medium shot, medium close-up, close-up and extreme close-up. Indeed, due to camera zooms we may have a master shot followed by a close-up followed by a medium long shot… Clustering techniques using classical distances may not be very efficient in these conditions.

Fig. 14

A scene including shots taken at different camera distances

Our system, and the systems proposed in the literature in general, is essentially based on visual information. We showed that visual information is not sufficient to delimit some ambiguous scenes. We also think that the solution will not come from the auditory information, for the reasons evoked in Section 2. However, the solution may come from the textual information generated through automatic speech recognition: a deep semantic analysis of this textual information through natural language processing (NLP) techniques may help delimit these scenes. A study of the actors' speech could be carried out to detect significant changes in linguistic concepts.

The solution may also come from user interaction to correct some defects in the delimitation of these ambiguous scenes. However, this solution may lengthen the overall time of the scene detection process.

4.3 Determining the temporal tolerance factor τ

The temporal tolerance factor is the threshold used to decide whether two similar shots belong to the same scene. As in [32], we conducted a study to fix a suitable tolerance factor. We studied how the recall and precision rates vary with the tolerance factor for the movies “Bugs” and “Walk the Line”. The movie “Bugs” is an action movie characterized by short shots and short scenes; the mean and standard deviation of its scene duration are respectively 106.27 s and 106.48 s. The movie “Walk the Line” is a musical movie characterized by long shots and scenes; the mean and standard deviation of its scene duration are respectively 160.59 s and 60.37 s.

Fig. 15 shows that a threshold of 30 seconds is suitable for delimiting scenes in all kinds of movies. This threshold may be exceeded in the case of dramatic, musical and comedic movies, because the scenes of these movies are long enough and the risk of over-segmentation is weak. However, in action movies this threshold is the optimum: scenes in these movies are short, and increasing the threshold may cause under-segmentation when similar shots appear in neighbouring scenes.

Fig. 15

Recall and precision against the tolerance factor

4.4 Comparison results

We implemented the work of [9], which presents good results relative to the well-known work of Yeung et al. [42]. However, as we failed to obtain the ground truth used in [9], we created our own ground truth as follows: first, we segmented the movies into shots [31], then we manually grouped the shots into scenes according to a strict scene definition. We also implemented the work of Tavanapong et al. [38]. This work is another shot-to-shot approach, based on the assumption that the shots of a same scene have common zones, namely the corners; the features used for the clustering process are computed on these corners.

The results of the comparison are shown in Table 2. Generally, our system performs better than the systems of Chen and Tavanapong. This confirms that discovering shot agglomerations may be an alternative to shot-to-shot methods. The low results obtained by Tavanapong's system demonstrate the adequacy of this proposal: Tavanapong's system is a typical shot-to-shot approach which uses two sliding windows (backward and forward) to cluster shots into scenes.

Table 2 Comparison results of our system with the systems of Chen et al. and of Tavanapong et al.

Besides, regarding the results obtained on action movies, we can affirm that to delimit action scenes we have to rely not only on content (as the majority of approaches do, including those of Chen et al. and Tavanapong et al.) but also on tempo. Indeed, although Chen's system and Tavanapong's system adopted two different strategies to describe the content information of shots, their results on action movies are low relative to our system. Tavanapong's system adopted a local description of shots (shot corners), whereas Chen's system adopted a global description (mosaic images). This comparison shows that whatever kind of content features is used, they remain insufficient to detect action scenes. The content is necessary to fix the boundaries of action scenes, because they generally start and finish with calm shots. However, the core of these scenes is agitated; there, the content information may be useless and the tempo information plays a key role.

This comparison was also very useful for us because it shows that representing a shot by a single frame is not always the most suitable solution, especially in non-action movies where shots are long and generally evolve in different contexts and settings. As Chen et al. [9] use a mosaic image to represent every shot, the details of the shots are kept and the clustering process is more efficient. That is why the precision rates of Chen's system on the movies “Walk the Line” and “Little Voice” are better.

5 Conclusion and perspectives

We presented in this paper a new system with a new vision to extract scenes from movies. Segmenting a database of movies into scenes has the advantage of making the browsing operation quicker.

The proposed system has three essential contributions. First, and contrary to the majority of proposed approaches, we propose a multimodal system. Second, and contrary to existing multimodal approaches, we do not use the visual features and the audio features separately to detect scene boundaries; we fuse them and perform a clustering based on the resulting vectors. Finally, we divide the scenes of movies into two important classes: action scenes and non-action scenes. To detect non-action scenes (dialog, monolog, landscape, romance...) we rely on the content information and the Kohonen map to discover the agglomerations of shots (scenes) having common backgrounds and objects.

On the other hand, we use audio-visual tempo features and the Fuzzy C-Means classifier to delimit the cores of action scenes (fight, car chase, war, gunfire...) and to remedy the over-segmentation that may occur in action scenes.

The obtained results are encouraging and show the merit of this new vision. However, the results may still be improved, as our approach still suffers from over/under-segmentation. That is why we plan to improve the features used and to add new ones. We also plan to rely on object segmentation and tracking to cluster shots and delimit some ambiguous scenes.