1 Introduction

Nowadays people are confronted with a great mass of information that is at the same time important and difficult to exploit. The causes of this problem may be summarized in two points. First, acquisition devices have become more and more sophisticated while remaining affordable: we now speak of cameras with megapixels of resolution and gigabytes of storage capacity. Second, the internet has reached nearly every home. Increasing bit rates and the evolution of compression techniques have encouraged people to share large quantities of information. The internet is not only the biggest network in the world but also the biggest database.

In our previous works [11–13, 19, 20, 22] we were interested in news video archives and in how to accelerate their browsing. In this paper we are interested in movies and movie archives.

The movie industry is one of the greatest producers of video (movies). According to the IMDb web portal [18], there is by now an important stock of movies representing a total of thousands of hours. According to [36], the video-on-demand service is becoming more and more popular, with millions of video-on-demand homes in the United States. VOD is projected to grow from its availability in approximately 53% of U.S. cable households to nearly 75% in the near future. Much of the attention is focused on the delivery of movies and network television programming.

Faced with this great mass of movies, the challenge for every user is to find what they need rapidly. The solution suggested by researchers is to segment every movie into semantic entities and to give users the opportunity to access them non-sequentially and to watch particular scenes of the movie. This can help them judge the relevance of the movie and avoid wasting time. As it is difficult and time-consuming to segment every movie into scenes manually, the proposed systems are fully automatic. As part of this research field, we propose in this paper a robust and automatic method to perform this segmentation. Video segmentation, and especially video shot segmentation, is not a new research field; it dates back to the nineties.

Brunelli et al. [5] define the structuring of the video as decomposing the stream into shots and scenes. A shot consists of a sequence of frames recorded by one camera contiguously and representing a continuous action in time or space. A video scene consists of a sequence of semantically correlated shots [25] (see Fig. 1).

Fig. 1

The video structure

By now, video shot detection systems have matured and are able to detect all types of transitions with high recall and precision rates [10].

However, the video scene detection problem is more complicated and still requires more work. Indeed, the major problem of scene detection is the lack of an exact criterion to define a scene boundary. The only heuristic that may be used is that the shots belonging to one scene share the same actors and occur at the same moment and in the same place. For this reason, as shown in Fig. 2, all the proposed works proceed in three steps to accomplish video scene detection. First, the movie is segmented into shots. Second, a representative image is computed for every shot (some works use several representative frames instead of one). The final step is the extraction of signatures from these representative images and their grouping into scenes.

Fig. 2

The general framework of scenes extraction

As part of this effort, we propose in this paper a novel approach for video scene detection. It builds on two observations: video scenes may be discovered by detecting shot agglomerations, and localizing action zones may be useful to avoid the over-segmentation that occurs when detecting scenes containing action shots. This new vision of scene detection is proposed to remedy the problems of classic approaches, which use one-to-one shot similarity to delimit scenes.

The rest of the paper is organized as follows: in Section 2, we discuss works related to video scene detection. In Section 3, we state the problem, present our approach and detail our contributions. Results of our approach are shown in Section 4. We conclude with directions for future work.

2 Related works

We can classify all the proposed works into two categories: the unimodal perspective and the multimodal perspective. The first perspective is based only on visual signatures to group shots that are visually similar and temporally close. In this perspective we can evoke the important and pioneering work of Yeung et al. [42]. In this work, the authors rely on the assumption that every scene has a repetitive shot structure and use graphs to detect scene boundaries. Each node of the graph represents a shot, and the transitions between two nodes are weighted according to the visual similarity and the temporal locality of the two corresponding shots. This approach is effective with dialog scenes; however, it may fail with other kinds such as action scenes (see Fig. 3). The work of Riu et al. [33] has the advantage of integrating motion information into the measure of similarity between two shots. As Yeung et al. [42], they use temporal locality and visual similarity to group shots into the same scene. The major drawback of their approach remains the clustering algorithm, which depends heavily on many parameters and thresholds. Rasheed and Shah [32] follow the same line as Riu et al. [33] by integrating motion information into the shot-grouping criteria. Their main contribution consists in relying on global shot similarities and graph partitioning to cluster shots and extract scenes. However, their clustering technique is relatively complex and the integration of the motion information into the similarity measure is rather simplistic. Hanjalic et al. [15] suggested a new way to compute shot keyframes by concatenating all the shot images. Their clustering algorithm is based on block matching to measure the similarity of the keyframes. They use two sliding windows to do the clustering and detect scene boundaries: the two windows start from a given shot and perform a forward and a backward search to find the nearest matching shot. The problem of this approach is also the important number of parameters on which it depends; in fact, we have to fix the matching threshold, the size of the backward window and the size of the forward window. Tavanapong and Zhou [38] use the same clustering algorithm proposed by Hanjalic et al. [15]. However, they suggest a new way to compute the similarity between shots: they compute color features only on the shot corners, because they suppose that in the shots of a same scene only the corners remain unchanged. This assumption may hold for static dialog scenes, but its validity in action scenes is doubtful. Similarly to [15], the authors of [43] and [40] proposed two approaches also based on sliding windows that try to gather successive similar shots. [40, 43] differ from [15] by their similarity measures and clustering algorithms; however, as for [15], the problem of this technique (the sliding window) remains the choice of the window size and of the sliding strategy. Yeo and Kender [21] proposed a memory-based approach to segment video into scenes. A buffer is used to store shots of a same scene, and new shots are added to the buffer by measuring the visual distance between the incoming shots and the shots in the buffer. Ngo et al. [30] proposed an approach that performs a motion characterization of shots and background mosaicking in order to reconstruct the scenes and detect their boundaries. Similarly, Chen et al. [9] proposed an approach for scene detection based on mosaic images. They do not represent every shot by one keyframe or a set of keyframes, but by a mosaic image, in order to gather the maximum of information from the shot. They then extract color and texture signatures from every mosaic image and rely on some cinematic rules to group shots into scenes. The idea of representing a shot by a mosaic image is interesting; however, the creation of mosaic images is still a research problem. Besides, relying on cinematographic rules to detect some types of scenes is not always efficient, for two reasons. First, cinematographic rules are general rules, they change continuously and filmmakers often transgress them. Second, quantifying cinematographic rules effectively is still a difficult problem [14].

Fig. 3

An example of an action scene

In the same way, the authors of [6] rely on cinematographic rules to detect scene boundaries. They proposed a finite state machine designed according to cinematographic rules and scene patterns to extract dialog or action scenes. In [7] the authors presented a scene change detection system using an unsupervised segmentation algorithm and object tracking. Object tracking helps compute correlations between successive frames and shots to detect scene boundaries. In [26] Lin et al. presented an approach that proceeds in two steps to detect scene boundaries. First a shot segmentation is performed; to properly describe the significant changes that may occur within shots, some of them may be segmented into sub-shots. Then, scene boundaries are extracted by analyzing the splitting and merging force competitions at each shot boundary. The main drawback of [7, 26] resides in the features used: they employ only color information, which is not always efficient, especially for action scenes.

The number of works in the second perspective is small relative to the number of works in the first one. In this perspective, we can evoke the well-known work of Sundaram and Schang [37], in which the authors define a scene as a contiguous segment of data having consistent long-term audio and video characteristics. They detect audio scenes based on ten auditory features and a sliding window that tries to detect significant changes. Nearly the same thing is done on the visual track. The final step of the process merges the visual scenes and the auditory scenes to obtain the final scene boundaries. In the same perspective, we can also evoke the work of Huang et al. [17], in which the authors follow the same line as Sundaram and Schang [37]. To detect scene boundaries, they detect color breaks, audio breaks and motion breaks, relying on the assumption that a scene break occurs only when all three kinds of break occur at the same time.

3 Proposed framework

3.1 System framework

We think that to build an efficient system for scene detection, we have to understand how these scenes are made. As stated in [35], a video is the result of an authoring process. When producing a video, an author starts from a conceptual idea. The semantic intention is then articulated in (sub)consciously selected conventions and techniques (rules) for the purpose of emphasizing aspects of the content. In this section, we take a quick look at some important cinematic rules on which the majority of movies are built. According to [1, 14, 41] there are four important rules used by filmmakers to construct scenes and shots:

  • Rule 1: The 180° Rule. This rule dictates that the camera should stay in one of the areas on either side of the axis of action.

  • Rule 2: Action Matching Rule. It indicates that the motion direction should be the same in two consecutive shots that record the continuous motion of an actor.

  • Rule 3: Film Rhythm Rule. It refers to the perceived rate and regularity of sounds, series of shots, and motion within the shots. Rhythmic factors include beat, accent, and tempo. Within the same scene, the shots have a similar rhythm.

  • Rule 4: Establishing Shot. It shows the spatial relations among the important figures, objects, and setting in a scene. Usually, the first and the last few shots in a dialog scene are establishing shots.

After reading these four rules we can deduce two important things. First, according to Rules 1, 3 and 4, when trying to relax the audience, filmmakers use long shots with common backgrounds and surroundings, repetitive camera angles for subsequent shots, and quiet audio tracks consisting essentially of speech and silence segments [24]. Second, according to Rules 2 and 4, when trying to excite the viewer, filmmakers use sound effects together with a rough visual track to impose a tempo and a rhythm on the viewer. Based on these rules and heuristics, we suggest a scene extraction approach composed of three steps. The first step uses the visual content information to perform a preliminary scene detection. The second step localizes action zones. Finally, the third step merges the results of the first two.

In fact, the presence of action zones may cause an over-segmentation because the visual content of the shots of action scenes changes enormously. This over-segmentation may be corrected by merging scenes that intersect with the same action zones.

Our approach is a multimodal approach. Indeed, as filmmakers use the visual and auditory modalities to transmit their cinematographic messages, we think that proposing a multimodal approach increases the system's chances of success. Compared with existing multimodal systems, our contribution consists in proposing a new way of integrating the visual and the auditory information to determine scene boundaries. Existing systems use many rules to merge the two kinds of boundaries. We do not agree with Sundaram and Schang, who affirm in [37] that “audio data must be used to find exact scenes boundaries and not only for identifying important regions or detecting events as explosions in video sequences”, because within a given scene, and contrary to the visual data, the audio data changes a lot, even in a dialog scene. In one scene, we may have silences, male voices, female voices, car sounds, musical backgrounds... The ranges over which the auditory features vary are therefore wide, and it is very difficult to delimit scenes correctly using the audio data alone. We agree that the auditory information is important to correctly detect scene boundaries; however, its role must be corrective, especially in detecting action scenes. For this reason, the features used are not classified into visual and auditory features, as all multimodal approaches have done, but into tempo features and content features, used to detect respectively action scenes and non-action scenes, which are the principal categories of scenes (see Fig. 4).

Fig. 4

The general taxonomy of movie scenes

The second contribution consists in proposing a new method for the detection of non-action scenes using content features and the Kohonen map. In fact, contrary to all the proposed works (unimodal and multimodal), we extract scenes by localizing the agglomerations of shots and not by relying on one-to-one shot similarity.

The third contribution consists in proposing, for action scenes, a method totally different from the one used for the detection of the other kinds of scenes. In action movies, where there are many action scenes, we rely on shot classification using Fuzzy C-Means and tempo features (motion, audio energy, shot frequency) to localize action scenes.

Indeed, over-segmentation may occur when extracting action scenes based only on the content information. This over-segmentation is corrected by localizing the action zones and merging scenes intersecting with the same action zones (see Fig. 5).

Fig. 5

The scene pathfinder

3.2 Preliminary scene extraction

A scene is defined in [1] as a segment of a narrative film that usually takes place in a single time and place, often with the same characters. As mentioned in the pioneering work of Yeung et al. [42], shots belonging to the same scene are characterized by two important things. First, they share nearly the same surroundings and many common objects. Second, they are temporally close. For this reason, and like the majority of proposed works [9, 15, 26, 30, 32, 33, 38, 42, 43], we rely on these criteria to cluster shots.

Excepting [32, 42], which use graph theory, all the others use one-to-one shot similarity and thresholds fixed by empirical studies to extract scenes. Relying on one-to-one shot similarity may be problematic, because even the shots of a same scene are generally different, so a direct comparison may not be useful. However, if we compare them to the shots of other scenes, they share some common properties such as the background. So grouping shots into scenes must be done relative to the other shots, and not only by finding direct similarities between them.

To accomplish this, we rely on pattern recognition techniques, and in particular on clustering techniques, which are suitable for this kind of processing. The clustering operation tries to gather elements that share common properties relative to the other elements, based on descriptors that quantify these properties.

To discover the shot agglomerations (scenes), we have to choose the right descriptors, i.e. those which discriminate the shots suitably. Shots belonging to the same scene share two important things: the general luminance and the background (see Fig. 6).

  • The luminance information of a given image is the brightness distribution or the luminosity within this image. Generally two different scenes have two different luminosities due to the difference in lighting conditions, in places and in time when the scenes take place (indoor, outdoor, day, night, cloudy weather, sunny weather…).

  • The background information is also important because it describes the surroundings and the place where the scene takes place. To extract the background information we rely essentially on two descriptors: color and texture. Color is a general descriptor for all kinds of backgrounds. Texture is also important for textured backgrounds such as forests or buildings (see Fig. 7).

Fig. 6

The luminosity is one of the common features of shots belonging to the same scene

Fig. 7

Texture plays a key role in clustering shots

To extract the luminance and the background information we use the standard HSV color histogram and the Fourier histogram. The HSV histogram is a classical color signature in the HSV color space. This signature makes available information about the color content of the image in its non-altered state. Its performance is good compared to other color signatures [4].

The Fourier histogram signature is computed on the gray-level image and contains information about texture and scale. These two descriptors already exist in our IKONA CBIR engine [4] and have been tested in many contexts in the field of image processing, such as content-based image retrieval and relevance feedback.
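As the exact parameterization of the IKONA descriptors is not given here, the following Python sketch shows one plausible way to build such a concatenated signature; the bin counts (128 HSV bins plus 56 Fourier bins, giving the 184 components used in the next section) and the log-magnitude binning of the spectrum are illustrative assumptions, not the IKONA implementation.

```python
import cv2
import numpy as np

def shot_signature(keyframe_bgr, hsv_bins=(8, 4, 4), fourier_bins=56):
    """Concatenated colour + texture signature for one shot keyframe.
    Bin counts are illustrative (8*4*4 + 56 = 184 components)."""
    # HSV colour histogram, normalised to sum to 1
    hsv = cv2.cvtColor(keyframe_bgr, cv2.COLOR_BGR2HSV)
    color_hist = cv2.calcHist([hsv], [0, 1, 2], None, list(hsv_bins),
                              [0, 180, 0, 256, 0, 256]).flatten()
    color_hist /= color_hist.sum() + 1e-9

    # "Fourier histogram": histogram of the log-magnitude spectrum of the
    # grey-level image, used here as a rough stand-in for a texture/scale descriptor
    gray = cv2.cvtColor(keyframe_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray)))
    tex_hist, _ = np.histogram(np.log1p(spectrum), bins=fourier_bins)
    tex_hist = tex_hist.astype(np.float32)
    tex_hist /= tex_hist.sum() + 1e-9

    return np.concatenate([color_hist, tex_hist])
```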

3.3 Discovering scene agglomerations through Kohonen maps

One of the drawbacks of the proposed approaches is that they rely on one-to-one shot similarity instead of studying agglomerations. Kohonen maps [23] have proven effective at doing this. We have already used them to perform a macro-segmentation: in our previous work [13], we used the Kohonen map to segment news broadcast programs into stories by discovering the cluster (the agglomeration) of anchor shots. Kohonen maps are well suited for mapping high-dimensional vectors (shots) onto a two-dimensional space. In fact, the Kohonen map is composed of two layers: the input layer, which corresponds to the input elements, and an output layer, which corresponds to a set of connected neurons. Every input element is represented by an n-dimensional vector X = (x1, x2, …, xn) and connected to the m nodes of the map through weights Wij (see Fig. 8).

Fig. 8

The structure of the Kohonen map

The mechanism of the Kohonen map may be summarized as follows:

  • The connection weights are randomly initialised

  • For every input vector, the weights of the connections linking this vector to the neurons are updated. The key idea introduced by Kohonen maps is the neighbourhood relation: to preserve the relations between neighbouring nodes, not only the weights of the winning node are adjusted, but those of its neighbours are updated as well. A node whose weight vector closely matches the input vector has a small activation level, and a node whose weight vector is very different from the input vector has a large activation level. The node in the network with the smallest activation level is deemed the "winner" for the current input vector. The further a neighbour is from the winner, the smaller its weight change.

  • The winning node is computed using the following formula:

    $$ j* = \arg {\min_j}\sum\limits_{i = 1}^n {{{\left( {{X_i}(t) - {W_{ij}}(t)} \right)}^2}} $$
    (1)
  • The weight of the winning neuron is modified at every iteration as follows:

    $$ {W_{ij}}(t) = {W_{ij}}\left( {t - 1} \right) + a(t)\left[ {{X_i}(t) - {W_{ij}}\left( {t - 1} \right)} \right] $$
    (2)
  • The weights of the neighbouring neurons are modified at every iteration as follows:

    $$ {W_{ij}}(t) = {W_{ij}}\left( {t - 1} \right) + a(t){h_j}\left( {j*,t} \right)\left[ {{X_i}(t) - {W_{ij}}\left( {t - 1} \right)} \right] $$
    (3)

    Where t represents the time step and a(t) is the learning rate, which decreases with time. hj(j*,t) represents the amount of influence of the training sample on the node. It is computed as follows:

    $$ {h_j}\left( {j*,t} \right) = \exp - \frac{{{{\left\| {j - j*} \right\|}^2}}}{{2{r^2}(t)}} $$
    (4)

    Where r(t) is the neighbourhood radius which typically decreases with time.

As already mentioned, we compute two features for every shot: the HSV color histogram and the Fourier histogram. The concatenation of these two features constitutes a vector of 184 components. For every movie, we train a Kohonen map with the vectors representing the shots of this movie. As a result of the training process, every shot (training example) is attributed to a node (its winning node).
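The training loop corresponding to Eqs. (1)-(4) can be sketched as follows; the grid size, number of epochs and decay schedules are assumed values, since they are not specified here.

```python
import numpy as np

def train_som(shots, grid=(10, 10), epochs=20, a0=0.5, r0=3.0, seed=0):
    """Minimal SOM training loop following Eqs. (1)-(4); `shots` is an
    (n_shots, 184) array of concatenated HSV + Fourier descriptors."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h * w, shots.shape[1]))            # random initialisation
    coords = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)

    t_max = epochs * len(shots)
    t = 0
    for _ in range(epochs):
        for x in shots[rng.permutation(len(shots))]:
            a = a0 * (1.0 - t / t_max)                        # decaying learning rate a(t)
            r = max(r0 * (1.0 - t / t_max), 0.5)              # decaying radius r(t)
            winner = np.argmin(((weights - x) ** 2).sum(axis=1))   # Eq. (1)
            d2 = ((coords - coords[winner]) ** 2).sum(axis=1)
            hj = np.exp(-d2 / (2.0 * r * r))                  # Eq. (4)
            weights += a * hj[:, None] * (x - weights)        # Eqs. (2)-(3)
            t += 1

    # winning node of every shot after training
    bmu = np.array([np.argmin(((weights - x) ** 2).sum(axis=1)) for x in shots])
    return weights, coords, bmu
```

The returned `bmu` array gives, for every shot, its winning node; this is the input of the scene extraction step described next.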

After training the Kohonen map, we try to discover the shot agglomerations that represent the scenes of the movie (see Fig. 8). If we map the definition of a scene onto the Kohonen map, a scene is the set of shots located in the same zone of the map, where a zone is identified by a set of nodes. Since every shot is attached to a unique node, its winning node, shots attached to neighbouring nodes may belong to the same scene.

In order to extract scenes from the Kohonen map, we rely on the two following assumptions. First, two shots which belong to the same scene either belong to one node or to two neighbouring nodes (see Fig. 8). Second, shots belonging to one scene must also be temporally close. As in [32], we fix the temporal threshold at 30 seconds: two shots A and B may belong to the same scene only if |MA − MB| is below 30 seconds, where MA and MB are respectively the timestamps of the middle frames of shots A and B. This threshold is discussed in the experimentation section.

The pseudocode of our clustering algorithm is shown in Fig. 9. First, we determine the winning neuron of every shot. Then, if two shots have the same winning neuron or neighbouring winning neurons (direct neighbours, see Fig. 8) and are temporally close (the temporal distance between their middle frames is below 30 seconds), we put them into the same scene. Finally, if two scenes have one or more shots in common they are automatically merged into one scene. Besides, if two scenes temporally intersect they are also merged.

Fig. 9

Pseudo code of the scene pathfinder

For instance, if a given scene A contains the shots i, i+2 and i+3 and a scene B contains the shots i+1 and i+4, then A and B are merged into one scene. In fact, a scene may contain shots that are completely different and do not share any common object (see Fig. 10). The temporal continuity of the scene allows gathering all these shots.

Fig. 10

An example of a scene that takes place in more than one place
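The grouping just described (Fig. 9) can be sketched in Python as follows, assuming the `bmu` and `coords` arrays of the training sketch above and a list of middle-frame timestamps; the union-find structure and the interval-overlap merge are implementation choices of this sketch, not the exact procedure of Fig. 9.

```python
import numpy as np

def preliminary_scenes(bmu, coords, mid_times, tau=30.0):
    """Group shots whose winning nodes coincide or are direct neighbours on
    the map and whose middle frames are at most tau seconds apart; scenes
    sharing shots or temporally intersecting end up merged."""
    n = len(bmu)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(n):
        for j in range(i + 1, n):
            close_on_map = np.abs(coords[bmu[i]] - coords[bmu[j]]).max() <= 1
            close_in_time = abs(mid_times[i] - mid_times[j]) <= tau
            if close_on_map and close_in_time:
                union(i, j)

    # merge scenes whose shot-index spans overlap (temporal intersection)
    merged = True
    while merged:
        merged = False
        spans = {}
        for i in range(n):
            l = find(i)
            lo, hi = spans.get(l, (i, i))
            spans[l] = (min(lo, i), max(hi, i))
        labels = list(spans)
        for a in range(len(labels)):
            for b in range(a + 1, len(labels)):
                (lo1, hi1), (lo2, hi2) = spans[labels[a]], spans[labels[b]]
                if lo1 <= hi2 and lo2 <= hi1:
                    union(labels[a], labels[b])
                    merged = True

    return [find(i) for i in range(n)]  # shots sharing a label form one scene
```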

3.4 Localizing action zones

The major drawback of all proposed works is that they do not have a specific process to localize action zones. They use the same process to extract all kinds of scenes: they compute a list of features (generally visual features dealing only with the content) for all shots and then cluster the shots to detect scene boundaries. Some works, such as [32, 33], try to introduce into this list some specific features, such as motion, to take into account the case of action scenes, and then perform an early fusion of all features before starting the clustering process. However, this remains insufficient. In fact, in action scenes the visual content changes a lot while the tempo stays nearly the same (high). So the features describing the content may distort the clustering process, with a risk of over-segmentation. That is why, in the majority of works, the results get worse with action movies.

An action scene is characterized by three important phenomena [1, 41]. First, it contains a lot of motion: motion of objects (actors, cars…) and camera motion (pan, tilt, zoom…). That is why shots of the same scene do not share any common background or surroundings. The second phenomenon is the special sound effects used to excite and stimulate the viewer's attention: filmmakers amplify the actors' voices and introduce explosion and gunfire sounds from time to time. The third important phenomenon is the duration and the number of shots per minute. Action scenes are filmed by many cameras, so the filmmaker switches constantly between cameras to film the scene from many viewpoints.

After a deep study of these phenomena, we suggest quantifying them by means of three descriptors computed for every shot, on which we rely to cluster these shots. These descriptors are detailed in the next sections.

3.4.1 Motion activity analysis

The intensity of motion activity gives the viewer information about the tempo of the movie. The motion intensity of a given segment tries to capture the intensity of the action in this segment [8, 34]. Action scenes have a high intensity of action due to strong camera motion and object motion. Non-action scenes, such as dialog scenes, have a low intensity of action because the camera is generally fixed and the objects are static.

Many descriptors have been proposed to measure motion activity. The Lucas-Kanade optical flow [28] is often used to estimate the motion of one shot: it estimates the direction and the speed of object motion from one frame to another in the same shot. The estimation of the optical flow is based on the assumption that the image intensity is constant between two times t and t+dt. This assumption may be mathematically formalized by the following equation:

$$ {I_x}u + {I_y}v + {I_t} = 0 $$
(5)

In this equation \(I_x\), \(I_y\) and \(I_t\) are the spatiotemporal image brightness derivatives, \(u\) is the horizontal optical flow and \(v\) is the vertical optical flow.

Let \(u(t,i)\) and \(v(t,i)\) denote the flow computed between two frames of one shot and averaged over the i-th 16×16 macroblock. The spatial activity matrix is defined as follows:

$$ Ac{t_{i,j}} = \sqrt {u{{\left( {t,i} \right)}^2} + v{{\left( {t,i} \right)}^2}} $$
(6)

The activity between two frames is obtained by averaging the motion vector magnitudes of the macroblocks over the entire frame. It is computed as follows:

$$ FrAct = \frac{1}{NBlock}\sum\limits_i {\sum\limits_j {Ac{t_{i,j}}} } $$
(7)

where NBlock is the number of macroblocks in the frame. The activity of a shot is the average of its frame activities. It is computed as follows:

$$ ShotAct = \frac{1}{NFrame}\sum\limits_k {FrAc{t_k}} $$
(8)

where NFrame is the number of frames in the shot.
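A sketch of Eqs. (6)-(8) in Python with OpenCV is given below; dense Farneback flow is used here as a convenient stand-in for the Lucas-Kanade estimator cited above, and the flow parameters are illustrative.

```python
import cv2
import numpy as np

def shot_activity(frames_gray, block=16):
    """Eqs. (6)-(8): average optical-flow magnitude over 16x16 macroblocks,
    then over the frames of the shot. `frames_gray` is a list of grey frames."""
    frame_acts = []
    for prev, curr in zip(frames_gray[:-1], frames_gray[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        u, v = flow[..., 0], flow[..., 1]
        mag = np.sqrt(u ** 2 + v ** 2)                       # Eq. (6) per pixel
        h, w = mag.shape
        blocks = mag[:h - h % block, :w - w % block]
        blocks = blocks.reshape(h // block, block, w // block, block)
        act = blocks.mean(axis=(1, 3))                       # per-macroblock activity
        frame_acts.append(act.mean())                        # Eq. (7)
    return float(np.mean(frame_acts)) if frame_acts else 0.0  # Eq. (8)
```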

3.4.2 Audio energy analysis

The audio track contributes significantly to the perception of the tempo of the movie. Action scenes are generally characterized by musical backgrounds with many sound effects. As in [27], in order to discriminate between voiced and unvoiced sounds, researchers generally use the energy. Unvoiced sounds such as music and sound effects have a larger dynamic range than speech and are generally characterized by an important energy. We propose to compute the Short-Time Average Energy of the audio track of every shot. The Short-Time Average Energy of a discrete signal is defined as follows:

$$ E\left( {sho{t_n}} \right) = \frac{1}{N}\sum\limits_i {s{{(i)}^2}} $$
(9)

Where s(i) is the discrete-time audio signal, i is the time index and N is the number of audio samples of shot n. We compute the energy of every shot of the movie, aiming at distinguishing energetic shots.
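Eq. (9) reduces to a few lines of Python; how the mono PCM samples of each shot are obtained (demuxing, resampling) is left outside this sketch.

```python
import numpy as np

def short_time_energy(samples):
    """Eq. (9): average of the squared audio samples of one shot.
    `samples` is the mono PCM signal of the shot as a float array."""
    s = np.asarray(samples, dtype=np.float64)
    return float(np.mean(s ** 2)) if s.size else 0.0
```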

Indeed, Fig. 11 displays the variation of the Short-Time Average Energy over a movie. The peaks correspond to action zones in the movie, and the valleys generally correspond to dialog scenes, in which we find either silences or speech segments characterized by a low energy.

Fig. 11

Audio short time energy of a movie

3.4.3 Shot frequency

Action scenes are characterized by an important number of short shots that stream rapidly. Our idea is to compute the shot frequency, i.e. the number of shots per minute relative to a given shot (the reference shot). We place every shot at the center of a one-minute interval and count the number of shots that fall within this interval (see Fig. 12). This number is the third feature of every shot.
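The computation is straightforward; the sketch below assumes a list of middle-frame timestamps (in seconds) for all shots of the movie.

```python
def shot_frequency(mid_times, ref_index, window=60.0):
    """Number of shots whose middle frame falls in a one-minute interval
    centred on the reference shot (see Fig. 12)."""
    center = mid_times[ref_index]
    lo, hi = center - window / 2.0, center + window / 2.0
    return sum(1 for t in mid_times if lo <= t <= hi)
```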

Fig. 12

The shot frequency relative to the selected reference shot is 10

3.4.4 Using Fuzzy C-Means to extract action shots

After computing the three features (motion, audio energy and shot frequency) for every shot, the final step consists in using these features to discriminate between action and non-action shots. Two ways may be explored here: either we rely on heuristics and thresholds, or we rely on pattern recognition techniques. Pattern recognition techniques, and in particular unsupervised clustering techniques, are suitable for this kind of task and very efficient at discriminating classes. Besides, our problem is a typical unsupervised clustering problem, and the number of classes is known: the class of action shots and the class of non-action shots. One of these clustering techniques is Fuzzy C-Means, introduced by Bezdek [2]. The Fuzzy C-Means (FCM) algorithm is an iterative clustering method that produces an optimal c-partition by minimizing the weighted within-group sum of squared errors objective function \( {J_q}\left( {U,V} \right) \):

$$ {J_q}\left( {U,V} \right) = \sum\limits_{k = 1}^n {\sum\limits_{i = 1}^c {{{\left( {{u_{ik}}} \right)}^q}{d^2}\left( {{x_k},{v_i}} \right)} } $$
(10)

Where \( X = \left\{ {{x_1},{x_2},...,{x_n}} \right\} \subseteq {R^p} \) is the set of data items, n is the number of data items, c is the number of clusters with \( 2 \leqslant c < n \), \( u_{ik} \) is the degree of membership of \( x_k \) in the i-th cluster, q is a weighting exponent on each fuzzy membership, \( v_i \) is the prototype of the center of cluster i, and \( d\left( {x_k},{v_i} \right) \) is a distance measure between object \( x_k \) and cluster center \( v_i \). A solution of the objective function can be obtained via an iterative process: the membership matrix \( U = \left( u_{ik} \right) \) is randomly initialized, and at each iteration we compute the new membership coefficients and the new center of each cluster. In our context we have two classes, and the data items are shots represented by vectors of three components (motion, audio energy and shot frequency). At the end of the clustering, every shot is attributed to the cluster in which it has the highest membership degree. We then identify the class of action shots as the class with the higher motion values. The action shots are finally ordered temporally to localize the action zones exactly, as shown in Fig. 13.
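Eq. (10) can be minimized with the classical alternating updates of the memberships and the centers. The sketch below is a minimal NumPy implementation on the three tempo features; the exponent q = 2, the iteration limit and the tolerance are assumed standard values.

```python
import numpy as np

def fuzzy_cmeans(X, c=2, q=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal Fuzzy C-Means (Eq. 10); X is (n_shots, 3) with motion, audio
    energy and shot frequency. Returns the centers and the membership matrix U."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)                     # random fuzzy partition
    for _ in range(n_iter):
        Um = U ** q
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]    # cluster prototypes v_i
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (q - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)         # membership update u_ik
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

# Each shot goes to the cluster with its highest membership; the action class
# can then be taken as the one with the higher mean motion value:
#   labels = U.argmax(axis=1); action_class = centers[:, 0].argmax()
```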

Fig. 13

The temporal distribution of the classified video shots. Red squares represent action shots and green squares represent non-action shots

However, action scenes do not start directly with action shots. They start with calm shots and a calm tempo and, as time goes by, the tempo and the rhythm increase [1]. Besides, they generally finish with calm shots. This is not problematic for us, because we do not aim, through the detection of action zones, to find the exact scene boundaries. We aim at localizing action zones to remedy the over-segmentation that may occur when extracting scenes from the Kohonen map.

The result of the Fuzzy C-Means classifier is the detection of the cores of action scenes (the action zones). Contiguous action zones then help us merge over-segmented scenes: the preliminary scenes extracted from the Kohonen map that intersect with the same action zone are merged.
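A sketch of this corrective merge is given below, assuming that both the preliminary scenes and the action zones are available as (start, end) intervals in seconds; the interval representation is an assumption of this sketch.

```python
def merge_with_action_zones(scenes, action_zones):
    """Merge preliminary scenes (list of (start, end) times from the Kohonen
    map) that intersect the same action zone (list of (start, end) times)."""
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    merged = list(scenes)
    for zone in action_zones:
        hit = [s for s in merged if overlaps(s, zone)]
        if len(hit) > 1:
            fused = (min(s[0] for s in hit), max(s[1] for s in hit))
            merged = [s for s in merged if s not in hit] + [fused]
    return sorted(merged)
```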

4 Experiments

4.1 Scene change detection

To show the efficiency of our system we conducted experiments on five movies, as shown in Table 1. This database has already been used by the authors of [9] to test their approach.

Table 1 Experimental results with ground truth

We chose to test our system on this database for two reasons. First, the database includes movies belonging to different cinematographic genres, which helps us test the system properly on many kinds of scenes. Second, as we compare our system to the one proposed in [9], it is suitable to use the same database.

We use the recall and the precision rates as the measures of performance. They are defined as follows:

$$ {{\text{recall}} = \frac{N_c}{N_g}} $$
(11)
$$ {{\text{Precision}} = \frac{N_c}{{{N_c} + {N_f}}}} $$
(12)

Where Ng is the number of scenes of the ground truth, Nc is the number of scenes correctly detected and Nf is the number of scenes wrongly detected.

The ground truth has been generated by two real users. We explained the definition of a scene to the two users before giving them the database. Then we asked each user to watch every movie of the database and to delimit its scenes. The results of the two segmentations were merged to generate the final scene boundaries.

Table 1 shows that our system generally presents encouraging results. However, these results vary according to the genre of the film. The best results are obtained with action movies (“Bugs” and “Dungeons and Dragons”). This shows the efficiency of our strategy for extracting action scenes, which is the weak spot of the majority of approaches. The major problem of action scenes remains the significant lighting changes caused by explosions and flashing lights. As the features related to tempo do not deal with content information, the classification of shots into action/non-action shots using the Fuzzy C-Means classifier was both efficient and very useful to remedy the problem of over-segmentation.

Encouraging results are also obtained on the movie “Little Voice”. This proves the merit of the Kohonen map in delimiting non-action scenes. Dramatic movies are essentially composed of dialog scenes in which characters converse in settings sharing many common objects and backgrounds. These encouraging results show that discovering shot agglomerations is advantageous; this is further demonstrated in Section 4.4, where we compare our system to other systems in which the authors rely on one-to-one shot similarity to cluster shots into scenes.

However, more work remains to be done for comedic and musical movies. These kinds of movies do not follow the common cinematographic rules. For instance, in comedic movies a given scene may evolve in different contexts and in different settings. That is why the Kohonen map may miss many scenes and make many false detections: shots of one scene may be located in different zones of the map. For this reason, we have to consider adding a third path to our system and using other kinds of assumptions to find these kinds of scenes.

We also have to mention that some scenes are ambiguous and very hard to delimit automatically. These kinds of scenes are discussed in the following section.

4.2 Analysis of the ambiguity of some scenes

To establish the limitations of our technique, and of scene detection systems in general, it is important to discuss some types of scenes which are ambiguous and very difficult to delimit automatically.

Some consecutive scenes may occur in the same place or under the same conditions. For instance, in the movie “Dungeons and Dragons” many successive scenes take place in the forest (common background and common texture). The boundaries of these scenes are indistinguishable: visual features and clustering techniques are incapable of delimiting them properly.

Lighting conditions may also perturb the detection process, especially when consecutive scenes take place at night or in dark indoor places. In these conditions, the background is very dark and foreground objects such as faces or decor elements are indistinguishable. We encountered this kind of scene in the movie "Bugs": many scenes of this movie take place in a train tunnel. This kind of scene causes under-segmentation for our system, and for the majority of systems in general, because neither the visual information nor the auditory information is able to distinguish between these scenes.

Multi-angular scenes are also ambiguous. The visual coherence between the shots of a multi-angular scene is reduced because they are filmed with many cameras and display different kinds of background and foreground objects. As an example, we may cite dialog scenes which take place in streets (crowd scenes). In these scenes actors converse and, from time to time, we may see a passing car, a passing person, a building, a neon sign… In this kind of scene, using global visual features to cluster shots is not very efficient; local visual features may be a more suitable solution.

Moreover, as shown in Fig. 14, scenes that include shots taken at different camera distances are also ambiguous and may cause over-segmentation. As mentioned by Bordwell and Thompson [3], we distinguish seven different types of shots: extreme long shot, long shot, medium long shot, medium shot, medium close-up, close-up and extreme close-up. Indeed, due to camera zooms we may have a master shot followed by a close-up followed by a medium long shot… Clustering techniques using classical distances may not be very efficient in these conditions.

Fig. 14

A scene including shots taken at different camera distances

Our system, and the systems proposed in the literature in general, is essentially based on visual information. We showed that visual information is not sufficient to delimit some ambiguous scenes. We also think that the solution will not come from the auditory information, for the reasons evoked in Section 2. However, the solution may come from the textual information generated through automatic speech recognition: a deep semantic analysis of this textual information through natural language processing (NLP) techniques may help delimit these scenes. A study of the actors' speech could be carried out to detect significant changes in linguistic concepts.

The solution may also come from user interaction to correct some defects in the delimitation of these ambiguous scenes. However, this solution may lengthen the overall time of the scene detection process.

4.3 Determining the temporal tolerance factor τ

The temporal tolerance factor is the threshold used to decide whether two similar shots belong to the same scene. As in [32], we conducted a study to fix a suitable tolerance factor. We studied how the recall and precision rates vary with the tolerance factor for the movies “Bugs” and “Walk the Line”. The movie “Bugs” is an action movie characterized by short shots and short scenes; the mean and standard deviation of its scene duration are respectively 106.27 s and 106.48 s. The movie “Walk the Line” is a musical movie characterized by long shots and scenes; the mean and standard deviation of its scene duration are respectively 160.59 s and 60.37 s.

Fig. 15 shows that a threshold of 30 seconds is suitable for delimiting scenes in all kinds of movies. This threshold may be exceeded in the case of dramatic, musical and comedic movies, because the scenes of these movies are long enough and the risk of over-segmentation is weak. However, in action movies this threshold is the optimum: scenes in these movies are short, and increasing the threshold may cause under-segmentation when similar shots appear in neighbouring scenes.

Fig. 15

Recall and precision against the tolerance factor

4.4 Comparison results

We implemented the work of [9], which presents good results relative to the well-known work of Yeung et al. [42]. However, as we failed to obtain the ground truth used in [9], we created our own ground truth as follows: first, we segmented the movies into shots [31], then we manually grouped the shots into scenes according to a strict scene definition. We also implemented the work of Tavanapong et al. [38]. This work is another shot-to-shot approach, based on the assumption that the shots of a same scene have common zones, namely the corners; the features used for the clustering process are computed on these corners.

The results of the comparison are shown in Table 2. Generally, our system performs better than the systems of Chen and Tavanapong. This confirms that discovering shot agglomerations may be an alternative to shot-to-shot methods. The low results obtained by Tavanapong's system demonstrate the adequacy of this proposal: Tavanapong's system is a typical shot-to-shot approach which uses two sliding windows (backward and forward) to cluster shots into scenes.

Table 2 Comparison results of our system with the systems of Chen et al. and of Tavanapong et al.

Besides, regarding the results obtained on action movies, we can affirm that to delimit action scenes we have to rely not only on content (as the majority of approaches do, including those of Chen et al. and Tavanapong et al.) but also on tempo. Indeed, although Chen's system and Tavanapong's system adopted two different strategies to describe the content information of shots, their results on action movies are low relative to our system. Tavanapong's system adopted a local description of shots (shot corners), whereas Chen's system adopted a global description (mosaic images). This comparison shows that whatever kind of content features is used, they remain insufficient to detect action scenes. The content is necessary to fix the boundaries of action scenes, because they generally start and finish with calm shots. However, the core of these scenes is agitated; there, the content information may be useless and the tempo information plays a key role.

This comparison was also very useful for us because it shows that representing a shot by a single frame is not always the most suitable solution, especially in non-action movies where shots are long and generally evolve in different contexts and settings. As Chen et al. [9] use a mosaic image to represent every shot, the details of the shots are kept and the clustering process is more efficient. That is why the precision rates of Chen's system on the movies “Walk the Line” and “Little Voice” are better.

5 Conclusion and perspectives

We presented in this paper a new system with a new vision to extract scenes from movies. Segmenting a database of movies into scenes has the advantage of making the browsing operation quicker.

The proposed system has three essential contributions. First, and contrary to the majority of proposed approaches, we propose a multimodal system. Second, and contrary to existing multimodal approaches, we do not use the visual features and the audio features separately to detect scene boundaries; we fuse them and perform a clustering based on the resulting vectors. Finally, we divide the scenes of movies into two important classes: action scenes and non-action scenes. To detect non-action scenes (dialog, monolog, landscape, romance...) we rely on the content information and the Kohonen map to discover the agglomerations of shots (scenes) having common backgrounds and objects.

On the other hand, we use audio-visual tempo features and the Fuzzy C-Means classifier to delimit the cores of action scenes (fight, car chase, war, gunfire...) and to remedy the over-segmentation that may occur in action scenes.

The obtained results are encouraging and show the merit of this new vision. However, the results may still be improved, as our approach still suffers from over/under-segmentation. That is why we plan to improve the features used and to add new ones. We also plan to rely on object segmentation and tracking to cluster shots and delimit some ambiguous scenes.