1 Introduction

Lately, the Internet has become the most widely used system for exchanging and sharing information. With advances in communication engineering, an enormous amount of data is shared or transferred between points connected through the Internet. Social networking Web sites and other business systems also generate huge amounts of data, namely text, audio, video and other graphics media, which are difficult to store and process because of their size. Video signals generally require a larger amount of storage space and constitute a major part of the overall Internet traffic [25]. In order to gain insights into any video data, it is essential to understand the content of the video. Usually, computer-based automatic approaches are used to discover the content present in the video, for instance in scene recognition and video enhancement for consumer devices, to name a few [26]. For analysis, a video is divided into segments called shots; this process is commonly known as temporal video segmentation or shot boundary detection (SBD) in the field of video processing.

The aim of shot boundary detection is to find the boundaries, or transitions, between the shots in a video. A transition can be: (1) an abrupt transition, where a sudden change occurs between two consecutive frames, or (2) a gradual transition, where the change extends over a certain number of frames. A gradual transition is an editing effect which can be further divided into fade-in, fade-out, dissolve and wipe [10]. Fade-in, fade-out and dissolve are caused by lighting effects, whereas a wipe is mainly an animated effect. In a fade-in, the shot slowly appears from a monochromatic or blank frame, which is usually a black frame [7]. In a fade-out, the shot slowly disappears into a monochromatic or blank frame (usually a black frame) [24]. In a dissolve, the current shot gradually changes into the next shot in the video. It is usually a simultaneous combination of a fade-out and a fade-in, without the occurrence of monochromatic or blank frames.

In this paper, a shot boundary detection system using Gist and a local descriptor is proposed, in which the Gist feature is extracted from each frame in a video and the similarity between consecutive frames is calculated using Pearson’s correlation coefficient. Abrupt transitions are detected using the Gist and MSER features of the selected frames, as given in Algorithm 1. Gradual transitions are detected using the MSER feature and the proposed pattern matching method. The experimental results show good recall, precision and F1 score for the proposed system.

The novel contributions of the paper are as follows:

  1. We propose an MSER pattern-based approach for the detection of shot boundaries.

  2. We propose an activity-based shot boundary detection. Activity here refers to those transition regions where there is a possibility of the presence of shot boundaries. This approach reduces the computational complexity by processing the transition regions only.

  3. We propose a dual confirmation stage which reduces the false positives, leading to increased performance.

The paper is organized as follows: Sect. 2 gives a short survey about SBD. Section 3 explains the theoretical background of the feature extraction techniques used in the system. A detailed explanation of the proposed shot boundary detection system is given in Sect. 4. Section 5 contains the experimental result and discussion which is followed by conclusion in Sect. 6.

2 Related works

Researchers have carried out many works on temporal video segmentation, or SBD, and the task is monitored by TRECVid [22]. SBD can be approached in two ways: (1) working in the compressed domain and (2) working in the uncompressed domain. Features such as discrete cosine transform (DCT) coefficients [11] and motion vectors [5, 16, 21] are used for SBD in the compressed domain.

In the uncompressed domain, the features can be broadly classified into local features and global features. In a local feature-based approach, some local features which can represent or approximate the whole frame are extracted from it, thereby reducing the computational requirements. Information such as color content and pixel intensity is a global feature and has been used for SBD [8]. The color histogram [6, 14, 23, 27] and the intensity histogram [28] are the most widely used global features for SBD: the histogram of each frame is computed, and the difference or similarity between the corresponding bins of the histograms of consecutive frames is calculated. Instead of taking the histogram, the pixel information in either color or intensity can be employed directly for SBD by considering the similarity between corresponding pixels [17, 28]. In [9], a hybrid system using Gist and color histogram is proposed for SBD. A frame transition parameter combining global and local features is proposed in [18] for SBD, in which a neural network scheme is used to classify the type of transition.

In [12, 13], a frame skipping technique is proposed for SBD in which a preprocessing step based on adaptive thresholding is adopted. A triangle pattern matching method is deployed to detect the gradual transitions. In [4], an SBD method using a foveation technique is proposed in which a local attention consistency measure is calculated to find the transitions. Foveation is an image feature where the image resolution is higher at some fixation point than in the surrounding region of the image. Other image features, such as texture features, are also used for analyzing the content of the image. In [3], energy, moment and contrast are used for SBD, with the texture features calculated using a co-occurrence matrix. In [2], an SBD method is proposed using entropy and the SURF feature, in which the entropy of a frame is calculated from the intensity histogram. In [1], a video retrieval system is presented using maximally stable extremal regions (MSER) as the feature.

3 Feature extraction

This section briefly describes the feature extraction techniques used in the proposed system.

3.1 Gist

Oliva et al. [20] defined the Gist of a scene as the perceptual and semantic information of a scene perceived by an observer irrespective of the details of the scene. The Gist of a scene comprises perceptual information (referred to as Perceptual Gist), which contains image properties such as the spatial frequency, color and texture, and conceptual information (referred to as Conceptual Gist), which contains the semantic information of the scene. A detailed description of Gist can be found in [20].

It represents the “Gist of a scene” by using a set of perceptual properties such as naturalness, openness, roughness, ruggedness and expansion, which are related to the shape of the scene space and exclude the information of local objects. Together they provide a holistic description of real-world scenes which is termed the Spatial Envelope. As only the background/scene information is captured by the Gist feature, it reduces the sensitivity to object motion, which would otherwise increase the errors in transition detection.

3.2 Maximally stable extremal region

Maximally stable extremal regions (MSER) were proposed by Matas et al. [15]. MSER is a widely used approach for finding correspondences between two images. Maximally stable regions are determined by varying a threshold T (as given in Eq. 1) and observing the corresponding variation in the size of the local regions in the thresholded image.

$$\begin{aligned} 0< T < I \end{aligned}$$
(1)

where

$$\begin{aligned} I= \text {max}(P) \end{aligned}$$
(2)

and P is the pixel values in the image.

A region is considered to be maximally stable if the variation of T brings a change \(C\) with \(C < S\), where S is the maximum allowed variation. For a detailed description, readers are referred to [15].
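The threshold sweep of Eqs. 1 and 2 can be illustrated on a toy image. The following is a minimal sketch and not the detector of [15]: it follows a single region grown from a hand-picked seed pixel, and the function names, the step `delta` and the relative-area definition of the change C are assumptions introduced only for illustration.

```python
from collections import deque


def region_area(image, seed, t):
    """Area of the 4-connected region around `seed` whose pixels are <= t."""
    rows, cols = len(image), len(image[0])
    if image[seed[0]][seed[1]] > t:
        return 0
    seen = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in seen and image[nr][nc] <= t:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen)


def stable_thresholds(image, seed, s, delta=1):
    """Thresholds T with 0 < T < I at which the relative area change C is below S."""
    i_max = max(max(row) for row in image)  # Eq. 2: I = max(P)
    stable = []
    for t in range(1, i_max):  # Eq. 1: 0 < T < I
        area = region_area(image, seed, t)
        if area == 0:
            continue
        # Assumed stability measure: growth of the region between T - delta and T + delta.
        change = (region_area(image, seed, t + delta) - region_area(image, seed, t - delta)) / area
        if change < s:
            stable.append(t)
    return stable


# A dark 3x3 blob on a bright background keeps its size over a wide range of T.
toy = [
    [9, 9, 9, 9, 9],
    [9, 2, 2, 2, 9],
    [9, 2, 1, 2, 9],
    [9, 2, 2, 2, 9],
    [9, 9, 9, 9, 9],
]
print(stable_thresholds(toy, (2, 2), s=0.5))
```

In this toy case the blob does not change size for thresholds between its own intensity and that of the background, which is the behavior that marks a region as maximally stable.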

4 Proposed system

This section describes the proposed shot boundary detection system. The flow of the proposed approach is illustrated in Fig. 1.

Fig. 1 Block diagram of the proposed system

The proposed system (see Fig. 1) starts with a preprocessing step which includes:

  1. Converting each color frame to a gray-level image.

  2. Resizing each frame to \(S\times S\), where \(S=256\) in our experiment.

The features of the preprocessed frames are extracted using Gist. Gist describes the overall scene of an image and is treated here as a scale-invariant feature, so resizing the image does not change the Gist features. The perceptual and conceptual information can be extracted without considering the objects present in the image [19], which reduces the effect of object motion in our proposed system. After feature extraction, the correlation between consecutive frames is computed using the Pearson correlation coefficient \((C_i)\). The coefficient is 1 if the consecutive frames are identical and decreases toward 0 as the dissimilarity between the frames increases, as shown in Fig. 2.
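The similarity computation can be sketched as follows. This is a minimal illustration that assumes each frame has already been reduced to a numeric descriptor vector (the Gist extraction itself is not shown); the function names and the handling of constant vectors are our assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length feature vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def consecutive_correlations(features):
    """C_i between the descriptors of frame i and frame i + 1."""
    return [pearson(features[i], features[i + 1]) for i in range(len(features) - 1)]


# Toy descriptors standing in for Gist vectors of three frames.
frames = [[1, 2, 3, 4], [1, 2, 3, 4], [4, 3, 2, 1]]
print(consecutive_correlations(frames))
```

Note that the Pearson coefficient can in principle become negative for anti-correlated descriptors, as in the last pair of the toy example; such values also fall below any threshold close to 1.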

Fig. 2 Correlation coefficient for the video D6.mpg

A threshold, Th, is applied to separate possible transition frames from non-transition frames. The threshold was chosen through careful observation and experimentation on different videos from the TRECVid database. Empirically, we found the optimal range of Th to be [0.95, 0.99]. For the remainder of our discussion, we set \(Th=0.99\). Frames are classified as transition or non-transition frames according to Eq. 3.

$$\begin{aligned} \text {Frame}={\left\{ \begin{array}{ll} \text {transition} &{} \text {if}\ C_i < Th \\ \text {non-transition} &{} \text {otherwise} \\ \end{array}\right. } \end{aligned}$$
(3)

where \(C_i\) is the correlation coefficient between ith and \(i+1\)th frames.
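Eq. 3 amounts to a single comparison per frame pair. A minimal sketch, assuming the coefficients \(C_i\) are held in a Python list (the function name is ours):

```python
def candidate_frames(correlations, th=0.99):
    """Indices i whose correlation C_i with frame i + 1 falls below Th (Eq. 3)."""
    return [i for i, c in enumerate(correlations) if c < th]


# Illustrative values only; 0.99 itself is not below the threshold.
print(candidate_frames([0.999, 0.97, 0.995, 0.60, 0.99]))
```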

After determining the similarity between the frames, the algorithm groups consecutive transition frames such that one transition effect is captured in one group. Note that not every group contains a gradual or cut transition; a group may also be caused by a motion effect. Therefore, it is essential to perform a post-processing operation on the groups in order to determine the correct nature of the transition. The grouping also takes into account frames which are not consecutive but are separated by only one or two missing frames. As an arbitrary example, if frames 14 and 16 are present but frame 15 is missing after the above processes, then the group (14, 15, 16) is formed. This step helps to increase the detection of possible transition frames. Together, the groups contain the possible transitions of the whole video.
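The grouping step, including the bridging of one or two missing frames, might be sketched as follows. The function name and the `max_gap` parameter are assumptions; the paper specifies only that gaps of one or two frames are bridged.

```python
def group_candidates(indices, max_gap=2):
    """Group candidate frame indices, bridging up to `max_gap` missing frames."""
    groups = []
    for idx in sorted(indices):
        if groups and idx - groups[-1][-1] <= max_gap + 1:
            # Fill in the skipped frames so that the group stays contiguous.
            groups[-1].extend(range(groups[-1][-1] + 1, idx + 1))
        else:
            groups.append([idx])
    return groups


# Frames 14 and 16 are candidates, frame 15 is not: it is added to the group.
print(group_candidates([3, 4, 5, 14, 16, 30]))
```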

Separately, one of the main problems in the system is blank frames, which behave like abrupt transitions. Therefore, for each frame classified as a transition frame by Eq. 3, MSER is calculated and the number of MSER regions is recorded in \(R_i\). The frames which satisfy Eq. 4 are discarded.

$$\begin{aligned} R_i = 0 \end{aligned}$$
(4)

The steps explained above remove unwanted information from the video, thereby reducing the chance of false detection. They leave clear groups of frames to which the abrupt and gradual transition detection algorithms described in the following text can be applied. Note that a blank frame, i.e., a frame from which zero MSER regions are extracted, is discarded only in abrupt transition detection, because it behaves like an abrupt transition. For gradual transition detection, blank frames play a vital role in detecting the fade effects, which are a kind of gradual transition.
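Because blank frames are dropped for abrupt detection but remain relevant for fades, it can be convenient to separate them rather than delete them. A minimal sketch, assuming the MSER count \(R_i\) of every frame is already available in a list (the function name is ours):

```python
def split_blank_frames(candidates, mser_counts):
    """Split candidate frame indices into non-blank and blank (R_i = 0, Eq. 4)."""
    kept, blank = [], []
    for i in candidates:
        (blank if mser_counts[i] == 0 else kept).append(i)
    return kept, blank


# Illustrative MSER counts for eight frames; frame 3 is blank.
counts = [12, 15, 9, 0, 0, 3, 8, 11]
print(split_blank_frames([2, 3, 7], counts))
```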

4.1 Abrupt transition detection

For abrupt transition detection, another threshold, \(Th_\mathrm{cut}\), is considered. \(Th_\mathrm{cut}\) is selected in such a way that all the spikes in the correlation curve, as shown in Fig. 2, are included. The spikes are the possible abrupt transitions. Algorithm 1 gives the process of detecting abrupt transitions; its efficiency depends on the careful and proper selection of \(Th_\mathrm{cut}\).

Algorithm 1 Abrupt transition detection

Algorithm 1 is applied to each group \(G_i\), where i indexes the groups. Frames whose correlation value is not less than the threshold Th are completely omitted from the transition detection. Blank frames in a video introduce additional errors in abrupt transition detection, so removing them beforehand increases the precision and F1 score (see Sect. 5).
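The following is not Algorithm 1 itself but a minimal sketch of one plausible decision rule consistent with the description above: within a group, a frame is flagged as a cut if its correlation falls below \(Th_\mathrm{cut}\) and neither it nor its successor is blank. Checking the successor frame, the function name and the data layout are our assumptions.

```python
def detect_cuts(group, correlations, mser_counts, th_cut=0.8):
    """Frames in one group flagged as abrupt transitions (illustrative rule only)."""
    cuts = []
    for i in group:
        if correlations[i] >= th_cut:
            continue  # the dip in the correlation curve is not deep enough
        if mser_counts[i] == 0 or mser_counts[i + 1] == 0:
            continue  # a blank frame is involved: leave it to the fade detection
        cuts.append(i)
    return cuts


# correlations[i] compares frame i with frame i + 1, so there is one more frame than coefficients.
corr = [0.99, 0.50, 0.90, 0.30, 0.99]
mser = [40, 42, 35, 37, 0, 20]
print(detect_cuts([1, 2, 3], corr, mser))
```

In the toy data, frame 3 shows a deep dip but borders a blank frame, so it is not reported as a cut.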

4.2 Gradual transition detection

4.2.1 Analysis of MSER count in a video

Before applying the gradual transition algorithm, the TRECVid video test datasets are analyzed using the MSER features. For each video frame, the MSER features are extracted and the number of MSER regions is counted and recorded (hereafter referred to as the MSER count). This step is repeated for all the remaining frames. Then, the transition parts are analyzed by plotting the MSER count as shown in Fig. 3.

Fig. 3 MSER count value for the video BG_37770.mpg showing fade-out (A), fade-in (B), no transition (C) and dissolve (D)

It is observed that the behavior of fade-in, fade-out and dissolve transitions is similar and can be represented by a single pattern (a slope). Analysis of the pattern also shows that the three can be differentiated from each other: in a fade-out the MSER count of the last frame, and in a fade-in the MSER count of the first frame, of the transition is always near the 0 (zero) level (see Fig. 3 A, B), whereas in a dissolve the MSER count of neither the starting frame nor the ending frame of the transition approaches the zero level. The wipe transition is not considered in the proposed system. From the behavior analysis of the gradual transition frames in the TRECVid datasets, it is concluded that the gradual transitions share a common behavior.
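The distinction drawn above between fades and dissolves can be expressed as a check on the end points of a transition. This is a minimal sketch; the tolerance `near_zero`, the function name and the "undetermined" outcome for sequences that are near zero at both ends are assumptions, since the text does not quantify "near to zero".

```python
def classify_gradual(mser_counts, near_zero=2):
    """Label a gradual transition from the MSER counts of its frames."""
    first_low = mser_counts[0] <= near_zero
    last_low = mser_counts[-1] <= near_zero
    if last_low and not first_low:
        return "fade-out"  # the shot disappears into a blank frame
    if first_low and not last_low:
        return "fade-in"  # the shot appears from a blank frame
    if not first_low and not last_low:
        return "dissolve"  # neither end approaches the zero level
    return "undetermined"


print(classify_gradual([30, 22, 15, 7, 1]))
print(classify_gradual([0, 5, 14, 25]))
print(classify_gradual([30, 24, 18, 26]))
```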

Fig. 4 Gradual transitions: (a) decreasing and (b) increasing patterns detected from the MSER count (L is the length of the gradual transition)

4.2.2 Gradual detection approach

In this section, we discuss the approach used to determine the gradual transitions in a video sequence. Determining gradual transitions is relatively complex compared to finding cut transitions. This complexity arises because the image features or characteristics vary slowly over a number of frames. In a cut transition, by contrast, the changes are abrupt and can be determined by monitoring an appropriate metric between two frames.

A gradual transition can be seen as the slow replacement of a scene \(S_i\) by another scene \(S_{i+1}\). Such a transition usually takes place over many frames. Therefore, it is essential to observe multiple frames in order to determine the gradual transition. In the present work, we use the MSER feature as a discriminating feature to find the inter-frame transition.

A graphical visualization of the MSER variation for a gradual transition and for a normal scene is given in Fig. 3. The figure shows that, unlike in a normal scene, the MSER count varies in an increasing or a decreasing pattern during a gradual transition. Such behavior has been observed for 300 gradual transitions (from the TRECVid datasets). Therefore, we propose in this paper that an increasing or decreasing pattern (i.e., regions A and B in Fig. 3) observed in a video can be taken as a candidate gradual transition.
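A simple way to look for such patterns is to scan the MSER count sequence for monotonic runs. The sketch below makes two assumptions that are not taken from the paper: runs must be strictly monotonic (real counts are noisy and would likely need smoothing or a tolerance), and the function name is ours. The default length limits follow the 5–31 frame range used in the experiments of Sect. 5.

```python
def monotonic_runs(counts, min_len=5, max_len=31):
    """Return (start, end, direction) for strictly monotonic runs of MSER counts."""
    runs = []
    start = 0
    direction = 0  # +1 increasing, -1 decreasing, 0 not yet known
    for i in range(1, len(counts) + 1):
        step = 0
        if i < len(counts):
            diff = counts[i] - counts[i - 1]
            step = (diff > 0) - (diff < 0)
        if step != 0 and (direction == 0 or step == direction):
            direction = step  # the current run continues through frame i
            continue
        # The current run ends at frame i - 1.
        length = i - start
        if direction != 0 and min_len <= length <= max_len:
            label = "increasing" if direction > 0 else "decreasing"
            runs.append((start, i - 1, label))
        # A reversal starts a new run at the turning frame; a flat step starts after it.
        start = i - 1 if step != 0 else i
        direction = step
    return runs


# Illustrative counts: a fade-out-like slope followed by a fade-in-like slope.
counts = [40, 40, 40, 33, 25, 16, 8, 1, 1, 1, 9, 17, 26, 34, 41, 41]
print(monotonic_runs(counts))
```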

Table 1 Description of the TRECVid 2001 and 2007 video dataset used for SBD

Before declaring a gradual transition, all the detected gradual transitions must satisfy Eq. 5.

$$\begin{aligned} \text {match}\_{\hbox {feature}}(f_{L_s-n},f_{L_e+n}) \le Th_{\hbox {verification}} \end{aligned}$$
(5)

where \(f_{L_s-n}\) and \(f_{L_e+n}\) are the MSER features of the nth frame before the start and the nth frame after the end, respectively, of the possible gradual transition of length L (shown in Fig. 4). \(L_s\) and \(L_e\) are the starting and ending frames of the detected gradual transition. \(Th_{\hbox {verification}}\) is a threshold used for the verification of the gradual transition.

Equation 5 is used in our proposed system in order to reduce the number of false detections of the gradual transition. If any gradual transition does not satisfy Eq. 5, then the possible gradual transition is not considered as gradual transition and is discarded.
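The verification of Eq. 5 can be sketched as below. The paper does not specify how match_feature is computed, so the matcher here is a hypothetical stand-in that counts identical region descriptors; representing each frame by a list of hashable descriptors and rejecting candidates too close to the ends of the video are likewise assumptions. The defaults \(n=2\) and a threshold of 4 are the values reported in Sect. 5.

```python
def match_feature(regions_a, regions_b):
    """Hypothetical matcher: number of region descriptors present in both frames."""
    return len(set(regions_a) & set(regions_b))


def verify_gradual(frame_regions, l_s, l_e, n=2, th_verification=4):
    """Eq. 5: accept a candidate only if frames L_s - n and L_e + n match weakly."""
    before, after = l_s - n, l_e + n
    if before < 0 or after >= len(frame_regions):
        return False  # assumption: candidates too close to the video ends are rejected
    return match_feature(frame_regions[before], frame_regions[after]) <= th_verification


# Toy video: one shot, a four-frame transition, then a different shot.
shot_a = list(range(0, 10))
shot_b = list(range(100, 110))
video = [shot_a] * 4 + [shot_a[:5] + shot_b[:5]] * 4 + [shot_b] * 4
print(verify_gradual(video, l_s=4, l_e=7))

# The same span inside a single unchanged shot is rejected.
print(verify_gradual([shot_a] * 12, l_s=4, l_e=7))
```

The intuition is that frames just outside a genuine transition belong to different shots and therefore share few features, whereas a pattern caused by motion within one shot leaves many features in common.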

5 Experimental results

For experimentation, the TRECVid 2001 and 2007 video test datasets are considered for shot boundary detection; the description of the videos, with the number of abrupt and gradual transitions (excluding wipe transitions), is given in Table 1. The TRECVid 2001 video test data are openly available for research and can be downloaded from the Open Video Project. The TRECVid 2007 video test data were obtained from the Netherlands Institute for Sound and Vision on request for research.

Table 2 F1 score of the abrupt transition using different \(Th_\mathrm{cut}\) value in TRECVid 2001 video test dataset
Fig. 5 An illustration of (a) fade-in, (b) fade-out and (c) gradual transitions detected from the video “D6.mpg”

The performance of the proposed system is evaluated using recall, precision and F1 score as in [13], which are calculated using Eqs. 6–8, respectively.

$$\begin{aligned}&\text {Recall}=\frac{N_\mathrm{C}}{N_\mathrm{C}+N_\mathrm{M}} \end{aligned}$$
(6)
$$\begin{aligned}&\text {Precision}=\frac{N_\mathrm{C}}{N_\mathrm{C}+N_\mathrm{F}} \end{aligned}$$
(7)
$$\begin{aligned}&F1\,\,\text {score}=\frac{2\times \text {Recall}\times \text {Precision}}{\text {Recall}+\text {Precision}} \end{aligned}$$
(8)

where \(N_\mathrm{C}\) is the number of correct transitions detected, \(N_\mathrm{M}\) is the number of missed transitions and \(N_\mathrm{F}\) is the number of false transitions detected.
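Eqs. 6–8 translate directly into code. In the sketch below the function name and the zero-denominator handling are our assumptions, and the counts in the example are made up for illustration; they are not results from the paper.

```python
def sbd_scores(n_c, n_m, n_f):
    """Recall, precision and F1 score from correct, missed and false detections."""
    recall = n_c / (n_c + n_m) if n_c + n_m else 0.0
    precision = n_c / (n_c + n_f) if n_c + n_f else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1


# Hypothetical counts: 90 correct, 10 missed, 30 false detections.
recall, precision, f1 = sbd_scores(90, 10, 30)
print(f"recall={recall:.3f} precision={precision:.3f} F1={f1:.3f}")
```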

In our experiment, the threshold for cuts, \(Th_\mathrm{cut}\), is selected by observing the graph of the correlation coefficient values (an example is shown in Fig. 2). Different \(Th_\mathrm{cut}\) values ranging from 0.75 to 0.95 were tested, and the F1 score of the abrupt transitions for the TRECVid 2001 video test data was determined for each in order to find an effective value or range of \(Th_\mathrm{cut}\) (Table 2).

Table 2 shows that the \(Th_\mathrm{cut}\) value at 0.8 yields an overall good F1 score and this \(Th_\mathrm{cut}\) value is used for our experiment.

For gradual transitions, the increasing or decreasing pattern generated using the local descriptor is observed within the groups extracted using Gist. The formation of the groups helps to exclude most of the non-transition frames from the processing of gradual transitions, thereby reducing the processing time. The verification threshold \(Th_\mathrm{verification}\) in Eq. 5 is determined experimentally. In our experiment, 4 is taken as the \(Th_\mathrm{verification}\) value and \(n=2\), i.e., the second frame before the start and the second frame after the end of the gradual transition of length L are compared. Some of the sample patterns observed for gradual transitions are shown in Fig. 4.

It is observed from our experiment that, in the TRECVid videos, the length of the gradual transitions usually ranges from 6 to 32 frames, as in [13]. In some cases, two or three gradual transitions exist in one group, and the number of frames in that group is correspondingly larger (e.g., 200 frames in a group). In such cases, we apply our technique for gradual transitions to a range of 5–31 frames at a time within a group. Some examples of the detected gradual transitions are shown in Fig. 5.

Table 3 Proposed system results for TRECVid 2007 video test dataset
Table 4 Comparison of the SBD using color feature [27] with the proposed system
Table 5 Comparisons between proposed system with SVD and pattern matching [13]
Table 6 Comparison of the SBD using GA and fuzzy logic [23] with the proposed system

Table 3 shows the performance of the proposed system on the TRECVid 2007 test data. The proposed system is compared with SBD using color feature [27], SBD using SVD and pattern matching [13] and SBD using GA and fuzzy logic [23] in Tables 4, 5 and 6, respectively, where R, P and F1 represent recall, precision and F1 score. Tables 4, 5 and 6 show that the proposed approach yields a better F1 score for cut and gradual transitions than [13, 23, 27], which indicates the advantage of the proposed approach. In [9], a shot boundary detection system using Gist and color histogram is proposed with an overall approximate performance of 84 and 91% for precision and recall, respectively, whereas our proposed system yields an overall approximate performance of 89 and 94% for precision and recall, respectively.

6 Conclusion

This paper proposed a shot boundary detection algorithm using Gist and MSER. Gist is extracted from each frame in a video and the similarity between consecutive frames is calculated using Pearson’s correlation coefficient. A threshold is applied to select the frames whose correlation is less than the threshold; these frames have a high chance of belonging to a transition (either abrupt or gradual). Consecutive frames are taken to form groups and each group is examined for the occurrence of transitions. For abrupt transitions, Gist and a cut threshold are used, and the local descriptor is used to reduce false detections. For gradual transitions, a pattern matching approach is proposed in which an increasing or decreasing pattern of the MSER count confirms the existence of a gradual transition. A key contribution of this paper is that most of the non-transition frames are skipped in the further processing for detecting the transitions. The proposed system yields better results, as shown in Tables 3, 4, 5 and 6.