
1 Introduction

Retrieving the semantic content of a video is a cumbersome task owing to the large size of video data. The retrieval problem can be solved by indexing video frames, but at a high processing cost. Scene change detection [2, 4, 6, 8,9,10] is generally used as a content detection method for annotation, scene analysis, fast searching, and indexing. It is also used in video compression, where it serves to estimate key frames. Manual indexing and annotation of large multimedia collections are time-consuming tasks, which motivates researchers to develop automatic scene change detection algorithms. The location of a scene change in a video sequence appears as a boundary [13], because a continuous scene appears as a flow of continuous action that depends on the foreground and background contents. The background and foreground of a continuous scene possess similar contents that make up a shot [7, 12], i.e., a unit of content similarity. A scene change appears at the boundary between two shots or between two scenes. Therefore, for content retrieval, the video is organized into groups of shots. The aim of a scene change detection method is to partition the video sequence into meaningful and manageable segments (shots) [10] for video indexing. A key frame is then extracted from each segment to represent its spatial and temporal features. A scene change [3] in a video stream can also be described as a change of feature points or of pixel intensity between two consecutive frames beyond a noticeable limit. This limit is realized as a threshold [5, 14], which is widely used for the detection of scene changes. A threshold may be fixed or dynamic; the value of a dynamic threshold [11, 18] is continually updated according to the content of the scene segment. Measuring the similarity between two consecutive frames is the basic idea of scene change detection, and most prior work follows this methodology.

This paper is organized as follows. Section 2 explains the proposed scene change detection method, Section 3 presents the experimental results, and Section 4 concludes the paper. Throughout the paper, the uppercase letters X, Y, and T denote the respective axes, and the lowercase letters x, y, and t denote the corresponding flow directions. \(V_t{XY}\) represents the video cube with XY frames along the t direction (Fig. 1). Similarly, \(V_x{TY}\) and \(V_y{TX}\) represent the video cubes with TY and TX frames along the x- and y-directions.
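For concreteness, the slicing implied by this notation can be sketched as follows. This is an illustrative fragment (not from the paper), assuming the video is loaded as a NumPy array of shape (T, Y, X); all variable names and dimensions are hypothetical.

```python
import numpy as np

# Hypothetical video cube: 64 frames of height 120 and width 160,
# stacked along the temporal axis, i.e., shape (T, Y, X).
video = np.random.rand(64, 120, 160)

xy_frame = video[10, :, :]  # V_t{XY}: spatial XY frame at t = 10 (Fig. 1a)
ty_frame = video[:, :, 80]  # V_x{TY}: spatiotemporal TY frame at x = 80 (Fig. 1b)
tx_frame = video[:, 60, :]  # V_y{TX}: spatiotemporal TX frame at y = 60 (Fig. 1c)
```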

Fig. 1 Video cube. a \(V_t{XY}\), b \(V_x{TY}\), c \(V_y{TX}\)

2 Proposed Detection Method

The proposed scene change detection method is a hybrid approach [16] that incorporates both the spatial frames (XY) and the spatiotemporal frames (TY and TX). The novelty of the proposed technique is to treat the video as a cube and process the entire cube at once; unlike existing techniques, the proposed method does not rely on frame-by-frame processing. In the spatial frames, SPREF [1]-based frame energy is used to detect abrupt scene changes, and the obtained result is fused with that of the method reported in [15], which uses both spatiotemporal frames. The next section explains the proposed SPREF-based detection method.

2.1 SPREF-Based Detection Method

SPREF (spatiotemporal regularity flow) [1] is a general framework for modeling video. One advantage of this model is that it treats the video as a cube. SPREF is a 3D vector field that defines the regular flow direction as the path along which pixel intensity varies the least. If the scene is continuous, the intensities and flow vectors of the frames vary regularly; at the location of an abrupt scene change, however, they show a large deviation. Using the SPREF model, we detect this deviation at the scene change boundary with the help of a flow energy function. In this paper, the translational SPREF model is used and the flow energy is defined as

$$\begin{aligned} E(t)=\sum _{\varOmega } \bigg \vert \bigg ( I\star \frac{\partial H}{\partial x} \bigg ) c'_1(t)+ \bigg ( I\star \frac{\partial H}{\partial y}\bigg )c'_2(t)+I\star \frac{\partial H}{\partial t}\bigg \vert ^2 \end{aligned}$$
(1)

where \(c'_1(t)\) and \(c'_2(t)\) are the flow vector components along the x- and y-directions of the video frame XY with flow direction t, H is a Gaussian filter, I is the image intensity, and \(\star \) denotes convolution. \(\varOmega \) is the video cube region over which the energy is summed, and the term c denotes the translational flow. The flow energy function (Eq. 1) is solved using translated box spline functions b(u) of the first degree; owing to the smoothness of the spline, the regular flow direction is approximated by minimizing the flow energy function. The flow vectors are expressed in terms of spline coefficients as

$$\begin{aligned} c'_m(u)=\sum _{n}\alpha _{n}^{m}b(2^{-l}\ u-n) \end{aligned}$$
(2)

where \(m \in \{1,2\}\) and \(u \in (t)\). The term \(\alpha _{n}^{m}\) is the nth spline coefficient. The length of the temporal axis of the video cube region \(\varOmega \) is \(2^{k}\). The scaling factor of the video cube is l, with \(l = 1, 2,\ldots ,k\), and the number of spline coefficients is \(n = 2^{k-l}\). The spline function used here is defined as

$$\begin{aligned} b(z) = {\left\{ \begin{array}{ll} 1-|z| &{}\text {if }|z|<1\\ 0 &{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(3)
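The following is a minimal sketch (ours, not the authors' code) of Eqs. (2) and (3): the first-degree box spline and the reconstruction of a flow component from given spline coefficients. It assumes the coefficients \(\alpha _{n}^{m}\) have already been obtained by minimizing Eq. (1); all function names are illustrative.

```python
import numpy as np

def box_spline(z):
    """First-degree box spline b(z) of Eq. (3): a triangular hat function."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) < 1.0, 1.0 - np.abs(z), 0.0)

def flow_component(u, alpha, l):
    """Evaluate c'_m(u) of Eq. (2) as a superposition of translated,
    dilated box splines with coefficients alpha[n]."""
    u = np.atleast_1d(u).astype(float)
    n = np.arange(len(alpha))
    basis = box_spline(2.0 ** (-l) * u[None, :] - n[:, None])  # shape (n, u)
    return alpha @ basis

# Example: a cube with temporal length 2^k = 64 and scale l = 1,
# so n = 2^(k-l) = 32 coefficients describe the flow along t.
k, l = 6, 1
alpha = np.random.rand(2 ** (k - l))
c1 = flow_component(np.arange(2 ** k), alpha, l)
```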

Since the flow energy function combines the intensity and the flow vectors of a frame, it estimates the regularity of the frame efficiently; if no scene change occurs in a sequence of frames, all the frames are regular in terms of their flow energy. Because it fuses both features, the flow energy defined by the SPREF model captures the regularity of the frame contents more effectively than either the pixel intensities or the flow vectors alone. An abrupt scene change in the XY frames creates a large deviation in their flow energy, and the location of this deviation is identified by the proposed threshold value:

$$\begin{aligned} \text {Threshold} = \sqrt{\frac{1}{N}\sum \limits _{t=1}^{N}(E_t-\mu )^2} \end{aligned}$$
(4)
$$\begin{aligned} \text {where} \quad \mu = \frac{1}{N}\sum \limits _{t = 1}^{N}E_{t} \end{aligned}$$
(5)

where \(E_{t}\) represents the flow energy of the tth frame and N is the total number of frames. An abrupt scene change is detected with the help of the following condition:

$$\begin{aligned} E_{detected} = {\left\{ \begin{array}{ll} 1 &{} \text {if } E_t \ge 5 \times \text {Threshold} \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

\(E_{detected}\) marks the locations of abrupt scene change with the value 1. It has been observed that all the peaks corresponding to abrupt scene changes are greater than the mean value of the flow energy; the standard deviation quantifies how far these peak values deviate from the mean, which is why it is selected for thresholding. A minimal sketch of this thresholding step is given below; the next section then explains the detection approach using spatiotemporal frames.
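The following fragment illustrates Eqs. (4)-(6). It is a sketch under the assumption that the per-frame flow energies \(E_t\) of Eq. (1) have already been computed; the function and variable names are ours.

```python
import numpy as np

def detect_abrupt_changes(E, factor=5.0):
    """Apply Eqs. (4)-(6) to a 1-D array of per-frame flow energies E_t."""
    mu = E.mean()                                  # Eq. (5): mean flow energy
    threshold = np.sqrt(np.mean((E - mu) ** 2))    # Eq. (4): standard deviation
    return (E >= factor * threshold).astype(int)   # Eq. (6): 1 at change locations

# Toy example: a flat energy profile with two sharp peaks.
E = np.ones(100)
E[[30, 70]] = 50.0
print(np.flatnonzero(detect_abrupt_changes(E)))  # -> [30 70]
```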

2.2 Boundary Detection in Spatiotemporal Frames

Apart from the flow energy, spatiotemporal frames are also considered for the detection of abrupt scene changes, as proposed in [15]. In this method, both spatiotemporal frames, TY and TX, are used. An abrupt scene change produces a large pixel intensity variation between two XY frames, but in the spatiotemporal frames this variation appears as a vertical line or vertical boundary. The aim of this step is to detect the location of scene change as a vertical line and to combine the result with that obtained in the previous section. The spatiotemporal frame-based abrupt scene change detection method [15] is summarized as follows:

  • Select four sampled TY and TX frames:

    $$\begin{aligned} \begin{aligned} S_{TY} = [TY_{s_1}, TY_{s_2}, TY_{s_3}, TY_{s_4}]\\ S_{TX} = [TX_{s_1}, TX_{s_2}, TX_{s_3}, TX_{s_4}] \end{aligned} \end{aligned}$$
    (7)

    where \(s_1\) is the first frame and the sampling interval for the other three frames is taken as

    $$\begin{aligned} s' = \lfloor N/4.5 \rfloor \end{aligned}$$
    (8)

    N is the total number of frames along the flow direction.

  • The Canny edge detection method is used to detect the edges of all the sampled frames, producing binary images. The binary images of the sampled frames are represented as

    $$\begin{aligned} \begin{aligned} S_{TYedge} = [TY'_{s_1}, TY'_{s_2}, TY'_{s_3}, TY'_{s_4}]\\ S_{TXedge} = [TX'_{s_1}, TX'_{s_2}, TX'_{s_3}, TX'_{s_4}] \end{aligned} \end{aligned}$$
    (9)
  • In a binary image, the pixels of the detected edges are assigned the value \('1'\), and only vertical lines are considered because they form part of a scene change location. As discussed earlier, an abrupt scene change appears as a vertical line that occupies a column in both the TY and TX frames.

  • Spatiotemporal frames are treated as noisy images, and hence the boundary of a scene change may sometimes be distorted. Therefore, a boundary is accepted as a scene change location only when its length reaches

    $$\begin{aligned} \text {Length} = 40\,\% \text { of the frame height} \end{aligned}$$
    (10)

    where the frame height is the height of the TY or TX frame.

  • Condition: \( (\text {number of 1's in the boundary line}) \ge \text {Length} \).

    If any boundary or vertical line of pixel value \('1'\) satisfies the above condition in both the TY and TX frames, it is interpreted as the location of an abrupt scene change.

  • The above procedure is repeated for all the sampled frames.

  • A look-up table (Table 1) is then generated, in which all the detected locations obtained from all the frames (TY, TX, and XY) are tabulated.

  • Only those locations that appear in at least two of these spatial and spatiotemporal frames are retained (a fusion sketch is given after this list).
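A compact sketch of the column test (Eq. 10) and the two-out-of-three fusion follows. It assumes OpenCV for Canny edge detection (any implementation would do), 8-bit grayscale spatiotemporal frames, hypothetical Canny thresholds, and a matching tolerance of our choosing; none of these settings come from the paper.

```python
import numpy as np
import cv2  # assumed available for Canny edge detection

def columns_with_long_edges(st_frame, ratio=0.4):
    """Return column indices of a TY or TX frame whose vertical edge
    content is at least 40% of the frame height (Eq. 10).
    st_frame is assumed to be an 8-bit grayscale image."""
    edges = cv2.Canny(st_frame, 100, 200) // 255   # binary edge map of 0's and 1's
    length = ratio * edges.shape[0]                # Eq. (10): 40% of frame height
    return np.flatnonzero(edges.sum(axis=0) >= length)

def fuse_locations(ty_locs, tx_locs, xy_locs, tol=2):
    """Keep locations reported by at least two of the TY, TX, and XY sources,
    matching locations that fall within +/- tol frames of each other."""
    sources = [np.asarray(s) for s in (ty_locs, tx_locs, xy_locs)]
    fused = []
    for c in np.unique(np.concatenate(sources)):
        votes = sum(np.any(np.abs(s - c) <= tol) for s in sources)
        if votes >= 2 and not any(abs(c - f) <= tol for f in fused):
            fused.append(int(c))
    return fused
```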

Table 1 Detected locations in spatial and spatiotemporal frames of anni006
Fig. 2 Scenes of gstennis

Fig. 3 Scene change detection in gstennis video

3 Experimental Results

Four natural test videos are taken from [17] for the experiments: gstennis, anni002, anni003, and anni006. The video sequence gstennis (Fig. 2) has 64 frames with one scene change; its detection in the spatiotemporal TY frame is shown in Fig. 3. Since only one scene change appears in the gstennis sequence, only one boundary or vertical line appears in its binary image (Fig. 3b). The numbers of frames in videos anni002, anni006, and anni003 are 2048, 2048, and 1024, respectively, and their scenes are shown in Fig. 4. The total numbers of abrupt scene changes in anni002, anni003, and anni006 are 12, 7, and 19, respectively. All the spatial and spatiotemporal frames have been processed, and the results of both methods have been combined to obtain the final result. The flow energy functions of anni002 and anni003 are shown in Figs. 5 and 6, respectively; the vertical axis of each energy plot is the magnitude of the flow energy, and the horizontal axis is the frame number. The flow energy of the XY frames of anni002 is shown in Fig. 5a, where the deviations in the plot indicate the locations of scene change. The locations obtained through the condition of Eq. (6) are shown in Fig. 5b, with the value 1 assigned to each detected location. Similarly, Fig. 6 shows the scene change detection for anni003. The spatiotemporal TY and TX frames of video anni006 are shown in Figs. 7 and 8; in their binary images, the vertical lines represent the locations of scene change, and each occupies a column of the frame. These columns (in the TY and TX frames) are converted directly into XY frame numbers with scene change locations. The flow energy of the XY frames of anni006 is shown in Fig. 9a, and the locations of abrupt scene change detected by the proposed method are shown in Fig. 9b. The scene change locations in anni006 obtained by both methods are combined and tabulated in Table 1, which has three horizontal sections for the results obtained from the TY, TX, and XY frames.

Fig. 4 Frames of different scenes

Fig. 5 Detection of scene change in anni002

Fig. 6 Detection of scene change in anni003

Fig. 7 TY frame of video anni006

Fig. 8 TX frame of video anni006

Fig. 9 Detection of scene change in anni006

Among the three sections of spatial and spatiotemporal frames, a location that appears in at least two sections is taken as the approximate location of an abrupt scene change. As shown in Table 1, the locations obtained by the proposed method are [77, 215, 277, 349, 413, 530, 581, 702, 865, 936, 1029, 1142, 1318, 1554, 1774, 1890, 1976], which coincide exactly with the actual locations. No missed or false detections are found; the proposed method detects all the abrupt scene changes. For performance evaluation, the proposed method has been compared with the methods given in [2, 8], and the comparison is shown in Table 2. The accuracy of the proposed method has been evaluated using precision, recall, and the F1 score, where the F1 score is defined as

$$\begin{aligned} F1 = 2 \times \frac{\text {precision}\times \text {recall}}{\text {precision} + \text {recall}} \end{aligned}$$
(11)

Since the proposed method detects all the scene changes, its F1 score is high compared with those of the methods reported in [2, 8].
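For reference, the metrics can be computed as below. The matching tolerance and the illustrative numbers are our assumptions, not measurements from the paper.

```python
def precision_recall(detected, actual, tol=2):
    """Count a detection as correct if it lies within +/- tol frames
    of an actual scene change location (tolerance is an assumption)."""
    tp = sum(any(abs(d - a) <= tol for a in actual) for d in detected)
    return tp / len(detected), tp / len(actual)

def f1_score(precision, recall):
    """F1 score of Eq. (11)."""
    return 2 * precision * recall / (precision + recall)

# With no missed or false detections, precision = recall = 1 and F1 = 1.
p, r = precision_recall([77, 215, 277], [77, 215, 277])
print(f1_score(p, r))  # -> 1.0
```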

Table 2 Comparison

4 Conclusion

The proposed abrupt scene change detection method incorporates both the intensity and the flow energy of the frames. Using spatial as well as spatiotemporal frames reduces false and missed detections: the edges of the spatiotemporal frames and the flow energy of the spatial frames are used together for detection. The detection accuracy of the proposed method is high, with no false or missed detections, compared with the other methods.