1 Introduction

With the rapid advancement of multimedia technology, it has become easier to generate, capture, store, edit and distribute video data. As a result applications like archiving,content based retrieval, copy detection and summarization of video data have emerged as an active area of research. In all such applications, it is important to segment data into meaningful units. Thus, video segmentation becomes the fundamental step. It may be categorized as temporal and semantic segmentation. At the lowest level video data is a collection of frames. The temporal unit, i.e., shot is a collection of consecutive frames captured in a single session (between camera on and off) of camera operation. Collection of semantically related consecutive shots depicting an event or a part of the story is taken as a scene which is a semantic unit. Temporal segmentation i.e. the shot detection is the first level of segmentation. In the applications like content based retrieval or copy detection, a query sequence is submitted which needs to be matched with the video data present in the database. As the different shots in the query sequence may find match with the shots of different sequences in the database, such applications require a simple and low cost shot detection scheme to enable shot level matching. This observation motivates us for the present work.

1.1 Past work

Shot detection stands for determining the boundary frames i.e. in case of a transition from one shot to another the last frame of the previous shot and the first frame of the following shot are to be identified. Shot transition can be broadly categorized as abrupt and gradual. Abrupt transition or cut occurs more often and is the outcome of natural process where one is appended to another. Thus, transition between the shots is limited within two consecutive frames. Gradual transition is the result of post-shooting process. Unlike cut, it has certain duration. Dissolve, fade-out, fade-in etc. fall into this category. Lot of research has been carried out to detect the shot boundaries. In case of cut, the transition occurs between two consecutive frames which are visually very dissimilar and hence it is easier. The detection of gradual transition is more challenging.

Shot detection algorithms are mostly focused on finding out the visual discontinuity between the frames. For abrupt change, such discontinuity can be readily detected by looking into two successive frames whereas it is not so prominent for a gradual transition unless the frames being compared are temporally well separated. The major tasks of a shot detection algorithm includes representing a frame in terms of visual descriptors, and discontinuity detection based on a similarity measure or a transition model. Smeaton [23] has presented a comprehensive description of different aspects of shot boundary detection methodology including the evaluation policies and measures adopted in TRECVID.

In order to describe the frame-content, variety of features have been used. Pixel-level difference of consecutive frames [9, 10, 15], gray-level or colour histogram [810] are quite simple and are widely used. Study presented in [23] also indicates the wide use of colour histogram as the descriptor. Based on the joint histogram of consecutive frames, mutual information is computed [12]. The pixel difference is sensitive to object motions. On the other hand, histograms are global in nature and lacks spatial information. Motion based features have also been tried [2, 11, 19]. Many researchers have dealt with edge and gradient based features [1, 12, 15, 25].

Content of the frames in a video are represented by the feature vector and suitable similarity measures are applied on the feature vectors to compute the similarity between the frames. As discussed in [23], different distance measures like Euclidean distance, Manhattan distance, histogram intersection have been used by the researchers. In order to detect the shot boundaries, the common practice is to compute the change in visual content and to check whether it exceeds a threshold or not [4, 9, 21]. Though the approach is simple but the selection of threshold is crucial. Multiple thresholds (one for abrupt change and another for gradual changes) have also been considered [6, 13]. For detecting gradual transition, use of sliding window is quite common [9, 12, 13]. But the success depends on the proper selection of window size. Le et al. [14], in their approach mapped the task to the problem of text segmentation in natural language processing. Amiri et al. [3] have presented a generalized eigenvalue decomposition based system. Use of various machine learning techniques is also quite common [23]. Decision tree based classification [20], KNN [8], fuzzy clustering [11], hidden Markov model [28], neural network [18], SVM [15, 24] are few such examples. All such techniques have their own merits and demerits in terms of complexity, tuning of various parameters, proper training etc.

A very few works have relied on model to describe the frames, and have detected the boundary frames based on that model. Shot transition model has been presented in [10] based on the post production editing process. In [17], cut and gradual transitions are modeled independently as delta function and rectangular function respectively. Temporal statistics based model [16] and background similarity based model [7] are also tried. In [18], a unified model has been presented to handle both abrupt and gradual transitions. Based on the model, a frame estimation scheme is formulated. Boundary is detected based on model parameters and frame estimation errors. Finally, frames are classified using multi-layer perceptron network.

It is evident from the past study that wide variety of approaches have been tried by the researchers. Most of the aforesaid methods either are too simple to handle complex transition or computationally too expensive. Moreover, many of the methods need nontrivial parameters (e.g., transition threshold or moving widow size) to be supplied by the users. In many of the cases the algorithms are developed for a specific (e.g. cut) type of transition. In this work a simple but novel scheme is presented to detect the shot boundaries. Proposed scheme relies on a shot transition model which assumes that frames in a shot are visually very similar with respect to a reference frame, and the discontinuity in similarity value occurs at a shot transition. Two algorithms are presented of which the first one is a direct implementation of the underlying model. The second and the final one is an improvement over the first and it evolves around the linear approximation of the similarity values. Slope of this linear approximation finally indicates whether there is a transition or not. The organization of the rest of the paper is as follows. Proposed methodology is presented in Section 2. Section 3 contains the experimental result and the paper is concluded in Section 4.

2 Proposed methodology

Proposed methodology draws the motivation from the window based scheme that is generally intended for detecting gradual transition. The simplest technique for detecting cut/abrupt change is based on the difference between the visual descriptors of two consecutive frames. In case of cut, the difference is quite high. It is not so for a gradual transition where two consecutive frames may not differ much. Thus, it is difficult to identify such frames. To address this problem, concept of window is introduced. A sliding window of size N is considered, and a frame is compared with its N-th predecessor. The temporal separation of the gradual transition frames with respect to the reference one makes the difference more significant in comparison to the frames within a shot. Determining the size of window as well as the threshold for detecting the transitions become very crucial.

2.1 Transition model

For easy understanding of the transition model, let us for the time being consider that a video clip consists of only one transition which may be either abrupt or gradual and such a sequence is taken as the window. We will see later (in Section 2.3) that the proposed algorithm can extend the transition model to work with large video.

Frames in the window are represented by suitable descriptors. As the frames in a shot are very cohesive in nature, all the frames of first shot in the window will possess high similarity with the reference frame, which may be the first frame of the window. However, difference between the frames are not zero because of the object motion, occlusion etc. By designing the frame descriptors properly, such effect may be kept very low. To make the reference descriptor more robust to noise and local variations, average of the first K descriptors in the window is taken as the descriptor corresponding to the reference frame. In our experiment, K is taken as five. On the other hand, frames of the next shot will usually exhibit very low similarity with the reference frame. If the similarity values are plotted against the frame numbers, an abrupt change from high to low similarity as shown in Fig. 1a is expected. For gradual transition, the similarity value slowly decreases from high to low as shown in Fig. 1b. Frames in gradual transition phase bears the trace of both the shots – the previous one and the following one. The impact of previous one gradually decreases and that of the next one increases. The task is to identify the boundary frames i.e. the last frame of current shot or the first one of the next shot. Based on the said model, the boundaries can be identified by determining the suitable threshold values denoting high and low similarity.

Fig. 1
figure 1

Transition model: a Abrupt change or cut and b Gradual transition

In case of a conventional window based scheme a frame is compared with N-th predecessor. The difference between the similarity values obtained for the frames in the shot and for those in the gradual transition phase may not be significant and it depends heavily on N. In the proposed scheme the reference frame is temporally well separated (separated by the whole shot) from the frames in transition phase. Thereby, the similarity values of the frames in transition phase will be considerably different in comparison to the frames in the first shot in the window. As a result, selection of proper threshold value is easier.

2.2 Computation of features

Features to represent the frame-content should be such that minor changes in the content would have marginal impact on the descriptor. It is desired that despite of changes present in the frames within a shot the computed feature values for them are very close to each other. A global feature can serve this purpose well. Another requirement is that the feature must have enough discriminative power to distinguish the within shot frames and the transition frames in case of gradual transition. As gradual transition takes place over a sequence of frames, such transition frames are also very similar to each other. In this context, a feature, which is global in nature, may fail to provide values to distinguish between within-shot and transition frames. On the other hand, local features may be very sensitive and for the frames within a shot show significantly different values. Thus, the features are to be chosen carefully to fulfill these diverging requirements.

In our work, we have relied on edge based features. Frame image is first converted to gray-scale, and intensity gradient at each pixel is then computed using Sobel operator. The gradient image thus formed is then thresholded to obtain the edge image. Average gradient is taken as the threshold value, and the edges thus obtained are illumination invariant. During gradual transition, current frame bears the impact of both the previous shot and the next shot. Edges from previous frame gradually fade away and that from next one emphasizes. We intend to capture this phenomena. So the edge image is divided into N p partitions of equal size. The collection of normalized count of edge pixels in the partitions is taken as the descriptor. Thus, a N p -dimensional feature vector < v 1,v 2,…,v N p > is obtained where v i stands for normalized count of edge pixels in the i-th partition and \({\sum }_{i} v_{i} = 1\). The feature vector reflects the spatial distribution of the edge pixels. For the frames within a shot, edge point distribution remains almost unaltered over the frames unless N p is quite high. But, for the transition frames the distribution varies significantly as new edges are gradually emphasized and the old one weakens. Thus, appearing/vanishing edges during gradual transition make the distribution quite dynamic which is almost static for frames within a shot. Thus, the computed feature vector meets the diverging requirement. It may be noted that high value of N p will make the descriptor too sensitive, while a low value for the same will reduce its discriminating power. Hence, a moderate value is chosen to serve the purpose and in our experiment it is taken as 16.

2.3 Boundary detection

Boundary detection process is based on the model presented in Section 2.1. As discussed, we assume that the model considers a window consisting of the frames of two consecutive shots. Thus, there exists only one instance of shot transition. Few examples of sequences with only one transition and corresponding plots of the similarity values of the frames with respect to the reference frame in the sequence are shown in Fig. 2. In order to highlight the characteristics of the transition, the plots are restricted to a subset of transition frames and some preceding and following frames. It is evident that the plots follow the model we propose.

Fig. 2
figure 2

Sequences with single transition – (i) abrupt transition and (ii) gradual transition

In reality, video sequences are large enough consisting of number of shots. If such a sequence is considered as the window and initial five frame of the sequence acts as the basis of reference frame for the rest then it is more likely that the similarity pattern reflected by the transition frames gets deviated from the model. A few examples of video sequences with multiple transitions and corresponding plots of the similarity values are shown in Fig. 3. Such a sequence contains multiple transitions. First one maintains the desired pattern but the subsequent transitions may fail to comply the model. Fade-out (fade-in) is also a kind of gradual transition where brightness of frames gradually lessens (becomes more) to a dark (bright) frame. The plots of similarity values as shown in Fig. 4 is apparently quite different from the assumed model, but the algorithms show that they work with such sequences and utilizes the transition characteristics reflected by the model.

Fig. 3
figure 3

Sequences with multiple transitions of different type – (i) multiple gradual transitions and (ii) multiple abrupt transitions

Fig. 4
figure 4

A sequence with gradual transition (fade-in and fade-out)

Shot boundary detection algorithm starts with a window with the initial N consecutive frames from the video data. N is taken as any value in [l m i n ,l m a x ] where l m i n and l m a x denote the estimates for minimum and maximum shot length respectively. Thus, selecting a value for N is not very critical. Following the assumed model, the algorithm tries to identify the first occurrence of shot transition within the window and also the corresponding transition frames. Once those are identified, set of frames in the window prior to the transition phase is taken as a shot. Window is refreshed to have the frames following the transition phase from the current set and also sufficient frames from the original video are augmented to make the window N frames long. The algorithm proceeds with the updated window. In case N is smaller than the length of the shot which is not known apriori, the algorithm is likely to fail in detecting a transition. To come out of it, subsequent N frames from the original video are appended to the current content in the window to double its length and pass it to the process for the next iteration of shot detection algorithm. Thus, the proposed algorithm is built around the transition model that runs iteratively to handle the large sequence.

Consider, a video sequence vid that needs to be segmented. Each frame in the video sequence is represented by the edge based descriptor as described in Section 2.2. Similarity of each frame with the reference descriptor in the window is computed by using a suitable distance measure. In this work, we have adopted the Bhattacharyya distance [5] between the normalized descriptors of corresponding frames. The overall shot detection algorithm, s e g m e n t_v i d e o(v i d) may be described as follows.

segment_video(vid)

begin

S=first N frames from vid

while (all frames in vid are not tested)

 begin

  Compute similarity array sim for S

  Smooth sim

   c h k = v e r i f y_u n i f o r m i t y(s i m)

  if (chk) then S = S∪ {subsequent N frames from vid}

  else

    begin

      t = g e t_t r a n s i t i o n_s e q u e n c e(S, s i m)

      shot =sequence of frames in S preceding t

     if (#(shot) < l m i n ) then consider it as transition

      seq=sequence of frames in S following t

     if (#(seq)> 0) then

      S = s e q∪ {subsequent (N-#(seq)) frames from vid}

     else S = S∪ {subsequent N frames from vid }

    end

  end

end

This algorithm may also be referred to as Algorithm 1.

s i m[i] denotes the similarity between the reference descriptor and the descriptor of i-th frame in the window. Due to change in photometric condition in the environment, frames with similar content may have different visual appearance. Moreover, descriptors of within shot frames may also reflect small variation. To combat the problem, similarity values are smoothened through mean filtering. Filtering is done by considering a moving window of size 5. v e r i f y_u n i f o r m i t y() checks whether the given set of frames in the window is a part of single shot or not. It returns true if the sequence is uniform or homogeneous i.e. the sequence should not be segmented further else false is returned. This is essential to avoid over splitting of the sequence. Temporally separated frames in a shot may also differ due to various reasons like motion of the objects and occlusion. In the feature space also, it may be reflected to some extent despite of the care taken to design the descriptors. But, such variation is usually quite low. Based on this assumption, v e r i f y_u n i f o r m i t y(s i m) works as follows.

v e r i f y_u n i f o r m i t y(s i m)

begin

m 1 = m a x{s i m[i]}

m 2 = m i n{s i m[i]}

 if \(\frac {m_{2}}{m_{1}} > th\) then return true else return false

end

A very high value for th may declare a shot as non-uniform one resulting into over splitting. On the other hand a low value may fail to discriminate subsequent shots. The value of th is chosen experimentally. The procedure exploits the fact that the variation in the similarity values (elements in sim) is quite low within a shot. It is well reflected in the example shown in Fig. 5.

Fig. 5
figure 5

A sequence with single shot

Once a sequence is considered as a non-uniform one i.e. it contains multiple shots, the task then becomes to detect the transition frames. The function g e t_t r a n s i t i o n_s e q u e n c e(S,s i m) does this job as follows.

g e t_t r a n s i t i o n_s e q u e n c e(S, s i m)

begin

 t ={} and i = 1

 while (s i m[i] > = h i g h) {increment i}

t = tf r a m e i

 while (s i m[i] > l o w and s i m[i] < h i g h){

   t = tf r a m e i

  Increment i}

end

The frames in the initial shot should show high similarity value, while the frames of the following shots are expected to have low similarity value. For the frames in gradual transition, the similarity values are supposed to be in between. The set t holds the transition frames. For an abrupt change, according to the procedure described above, t is supposed to have only one frame. But, because of smoothing operation on sim, such transition may be blurred and may give rise to multiple transition frames. It also affects the localization of gradual transitions. However, for the applications like content based retrieval system, such compromise does not affect the final result of the application. The high and low values are taken as μ + k 1 σ and μk 1 σ respectively, where μ and σ stand for mean and standard deviation of similarity values respectively and value of k 1 is chosen experimentally. It is assumed that a shot consists of at least a minimum number of frames, l m i n and a shot of smaller length is discarded and taken as transition. Part of the gradual transition frames may be identified as shot giving rise to false boundary. However, by discarding the shots of smaller length (< l m i n ) the cases of such false detection can be minimized. In our experiment we have chosen l m i n as 30. It may be noted that content of the window (i.e. S) is changed after each iteration. Frames representing the first shot and the first transition are removed from S. Subsequent set of frames from video sequence get appended to S. Under two circumstances, S is updated in a different manner. First case arises when S is uniform and no transition is identified. Second case corresponds to the situation when a shot and following transition is successfully detected and there is no subsequent frame. In both the cases, next set of frames from video data is appended to the current content of S. With the new window S, next iteration proceeds afresh. The similarity array, sim has to be recomputed once the reference frame is changed in any situation. Otherwise, partial computation is required for only the newly appended frames in S.

The algorithm s e g m e n t_v i d e o() works with large video sequence by taking a part of it in each iteration. At a point of time, the window may have multiple transitions. Few cases are shown in Figs. 3 and 4 which reveal that the similarity values may not always reflect the pattern assumed in the model. Moreover, high and low thresholds are chosen directly from the similarity values. Thus, the presence of number of shots in the window can have significant impact on the pattern of similarity values and consequently threshold values. As a result, performance may suffer. It has motivated us to modify the algorithm to enable the easy selection of threshold and handling of the window with multiple transitions in an elegant way.

The modified algorithm relies on the fact that the frames in a shot are almost similar in terms of feature values. Hence, with respect to a common reference frame, their similarity values are also very close to each other. If straight lines are fitted over similarity values of a subset of contiguous frames then such lines are almost horizontal for the frames within a shot. On the other hand, line segments correspond to transition (abrupt/gradual) regions have higher slope. Thus, thresholding slope magnitude the shot boundaries are detected. The steps of m o d i f i e d_s e g m e n t_v i d e o(v i d) algorithm are summarized as follows.

m o d i f i e d_s e g m e n t_v i d e o(v i d)

begin

S =first N frames from vid

 while (all the frames in vid are not tested)

  begin

   Compute similarity array sim for S

   Smooth sim

    c h k = v e r i f y_u n i f o r m i t y(s i m)

   if (chk) then S = S∪ {subsequent N frames from vid}

   else begin

      for each frame f i S {

        Form s u b s e t i = s i m[j] where j∈[ik,i + k]

        Over s u b s e t i , fit a straight line following least square regression

        Store the magnitude of the slope of the line segments

        in an array s i m_s l o p e }

      Mark the frames f j with s i m_s l o p e[j]>t h as transition frames

      Detect the shots as set of frames between two consecutive transition

      Shots of length less than l m i n are taken as transition sequence

       S = S - {set of frames in S prior to last detected shot}

       S = S ∪ {subsequent N frames from vid}

    end

   end

end

This algorithm may also be referred to as Algorithm 2. Frames corresponding to the line segments with high magnitude of slope are taken as the transition frames. It may be noted that multiple transitions and hence multiple shots present in a window can be detected. In s e g m e n t_v i d e o() algorithm, window for the next iteration starts with the frames following the first transition. In the modified algorithm m o d i f i e d_s e g m e n t_v i d e o(), next iterations proceeds from the beginning of the last detected shot. For a gradual transition, it may so happen that all the consecutive line segments passing through the transition regions are not marked as boundary. To address the problem any shot of length smaller than l m i n is considered as transition sub-sequence Fig. 6 shows the plot of the slope magnitude for different video sequences in Figs. 2, 3 and 4. It is evident that slope of relatively higher magnitude corresponds to transition zone in the video sequence. Threshold selection for slope is much easier compared to selection of high and low thresholds for similarity value. In our experiment, th is taken as μ s +k 2 σ s where μ s and σ s denote average and standard deviation of the values in slope. The value of k 2 is determined experimentally.

Fig. 6
figure 6

Plot of slope magnitude corresponding to video sequences in Figs. 2, 4 and 3

2.4 Post-processing

The visual content within a shot may vary significantly due to the reasons like camera and/or object motion, presence of high activity. Such variation particularly in comparison to the initial part of the shot may result into oversplitting. To minimize the false boundary detection a post-processing step is proposed here.

Let S t be a detected transition sequence. Usually for abrupt transition ∣S t ∣=2, otherwise it is >2. f s and f l are the first and last frame in S t respectively. f s (resp. f l ) is estimated from f l+1 (resp. f s−1) to obtain \(f_{e_{s}}\) (resp. \(f_{e_{l}}\)). If either f s is highly similar with \(f_{e_{s}}\) or f l is highly similar with \(f_{e_{l}}\) then S t is taken as false transition Thus, the steps are follows.

begin

\(f_{e_{s}}=\)estimate of f s based on f l+1

\(f_{e_{l}}=\)estimate of f l based on f s−1

s i m s = similarity between f s and \(f_{e_{s}}\)

s i m l =similarity between f l and \(f_{e_{l}}\)

if (s i m s t h p p or s i m l t h p p ) then

S t is a false transition

end

In order to estimate f s from f l+1, f s is first divided into blocks of size 16×16. Let b i is a block of f s centered at (x,y). A match for it is searched in a window of size 33×33 centered at (x,y) in f l+1. Matching is based on the sum of absolute difference of grayscale intensity. The best matched block from the search window is the estimate for b i in \(f_{e_{s}}\). Following the similar process \(f_{e_{l}}\) is formed. To generate the feature vector for the frames f s , \(f_{e_{s}}\) and \(f_{e_{l}}\), corresponding frame is divided into 16 blocks. For each block 256 bin intensity histogram is formed. All such histograms are concatenated in raster scan order and normalized to form the descriptor. Histogram intersection between the descriptors provides the similarity. Value of t h p p is experimentally determined as 0.8.

3 Experimental results

In our experiment we have used a dataset consisting of various sequences taken from TRECVID 2001 and 2005 test databases as well as some recorded sports and news video sequences from TV and Internet. The recorded sequences are made available at http://www.isical.ac.in/%7Eecsu/. Three movies of different types are also included in the dataset. Detailed description of the dataset is provided in Table 1. It reflects wide variety containing both types of transitions – abrupt and gradual change. Gradual change includes dissolve, fade-in and fade-out. The dataset has been grountruthed manually to note down the start and end frame numbers of each shot. For abrupt transition, last frame of pre-transition shot and first frame of post-transition shot are marked as transition frames. For gradual transitions, a set of consecutive frames are marked as transition. However, to evaluate the detected boundaries arising out of computational methods a relaxation of ±5 frames is considered for gradual transition. Moreover, a gradual transition of smaller length (≤5 frames), if detected as abrupt one is also considered as correct.

Table 1 Description of the dataset

A video sequence is provided as the input to shot detection algorithm. Identified transition frames are compared with the corresponding groundtruth to identify the miss and false boundary detection. The performance is measured in terms of Recall, Precision and F-measure. These are computed as follows.

$$Recall (R) = \frac{Correct}{Correct + Miss} $$
$$Precision (P) = \frac{Correct}{Correct + False}$$
$$F-measure (F) = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

To obtain the measures in percentage, 100 is multiplied with them. In case of a shot boundary detection algorithm, it is desired to have high recall (i.e. low rate of miss in detecting a boundary) and high precision (i.e. low rate of false boundary detection).

The experiment has been conducted following both the algorithms – s e g m e n t_v i d e o() (Algorithm 1) and m o d i f i e d_s e g m e n t_v i d e o() (Algorithm 2). A moderate value for N is chosen to work with long sequences and it is taken as 800. Both the algorithms first verify whether the given sequence is a homogeneous one (i.e. consisting of one shot only) or not. If it is identified as a non-homogeneous one then the shot boundaries present in the sequence are detected. v e r i f y_u n i f o r m i t y() carries out the test based on a threshold value th. For different values of th, the performance of v e r i f y_u n i f o r m i t y() is shown in Table 2. It is observed that for higher value of th, uniform sequence is splitted into sub-sequences whereas for lower value, non-uniform sequences may get considered as uniform. Analysing the experimental result, th is taken as 0.9 for subsequent experiments.

Table 2 Performance of sequence uniformity test for different threshold values

Algorithm 1 considers the frames with similarity value lying between μ + k 1 σ and μk 1 σ are the transition frame(s). In order to choose the optimal value of k 1, an experiment has been carried by varying its value for a number of sequences with abrupt and/or gradual transition. For tuning the parameter, sequences consisting of single transition are taken as input. Higher the value of k 1, more frames may tend to qualify as transition frames and it leads to higher recall with lower precision. To judge the diverging requirement of high recall and high precision, performance is measured in terms of F-measure. The result for different values of k 1 is shown in Table 3. Based on this table the value of k 1 is taken as 1.

Table 3 Performance of Algorithm 1 for different values of k 1

In Algorithm 2, lines with slope greater than μ s +k 2 σ s qualify for transition regions. Value of k 2 is determined following the similar experiment as it is done in case of k 1. Table 4 shows the performance of the algorithm for different values of k 2. Higher the value of k 2 the probability of miss in detection increases but higher precision is achieved. For low values of k 2, recall increases but false detection also rises. Hence, F-measure is considered to determine the suitable value of k 2 and it is found to be 1.5.

Table 4 Performance of Algorithm 2 for different values of k 2

In order to study the performance of Algorithms 1 and 2, all the video sequences in the described dataset is used as the input with related parameter values. The experiment is thus conducted in a single run. Each sequence as a whole is provided as input to the algorithms. Table 5 shows the performance of the two algorithms. Algorithm 1 works based on the similarity value whereas Algorithm 2 (without post-processing) is the modified version that works based on the slope of line segment fitted on the overlapped subset of similarity values. It is evident from Table 5 that Algorithm 2 (without post-processing) performs better than Algorithm 1 in terms of both precision and recall as both false detection and miss detection are less for the second algorithm. It has been noted that still the false detections are large enough for sequences with high action where content varies significantly within a shot. Mostly, it gives rise to unwanted detection of abrupt detection. In order to minimize such false detection, post-processing is applied on the output of Algorithm 2 and the performance as shown in Table 5 improves significantly.

Table 5 Performance measure of proposed methodologies (Algorithm 1 and 2) (all figures in scale 0–1)

In order to compare the performance of the proposed system (Algorithm 2 – with post-processing), we have implemented the system proposed by Zhang et al. [27] and experimented with the dataset described in Table 1. The system in [27] deals with intensity histograms of R, G and B channels. Each frame is divided into number of blocks and histogram of each block are concatenated. Based on Fisher criterion in linear discriminant analysis a continuity measure is considered to detect the shot boundaries which are classified as abrupt and gradual transition using support vector machine. The scheme is simple enough but suffers from large number of false detection. We have also implemented the system of Tsinghua University [26] which is top performer as described in [23]. It has three components like fade-out/fade-in (FOI) detector, cut detector and gradual transition (GT) detector. Colour histrogram, pixel-wise difference feature, average and standard deviation of pixel intensities and motion vectors are used as the descriptors for visual content. FOI and cut detectors are simple and GT detector is based on finite state automata model. The scheme considers number of thresholds which are critical. It provides high recall for cut but precision suffers due to false detection. Comparative results shown in Table 6 indicates that the proposed methodology performs much better. It may be noted that for the proposed system, an interval for precision, recall and F-measures are provided. To obtain such intervals, number of runs are conducted. 90 % transitions are randomly selected from each video and used as the input in each run. In this way twenty runs are conducted. Finally, the intervals for the parameters are computed using Z-score based technique [22] with 95% confidence level. It is clear that even the lower bounds for the parameters are much higher than the value of corresponding parameter for other systems.

Table 6 Comparison of shot boundary detection performance (all figures in scale 0–1) on the dataset described in Table 1

For comparison, we have also worked with a common set of video sequences as used in [1] and these are taken from TRECVID 2001 dataset. System of Adjeroh et al. [1] follows an adaptive approach based on edge maps and it outperformed number of systems as presented in their work. Result on the same dataset is also presented in [18]. As the results of number of systems are already available, for comparison we have applied the proposed methodology (Algorithm 2 – with post-processing) and also the schemes presented in [26, 27] on the same dataset. The figures of performance metrics for different systems are presented in Table 7. As in case of Table 6, here also an interval for F-measure is provided for the proposed system and it is computed in the similar manner as discussed earlier. It shows that the proposed system performs better than the rest. Among rest, the performance of the unified model based system in [18] is closest to that of the proposed system. The said system [18] estimates a frame from its predecessor and follower using histogram, edge and motion based features. Finally, Multilayer Perceptron based neural network is used for classification of the frames and shots are detected by applying post-processing on classifier output. Although it performs better than the proposed method but the strength of the proposed method lies in its simplicity and low cost. Apart from other overheads, the computational cost of motion based local features used in [18] is of o(w 2 W 2) where size of the blocks to be matched is w×w and search region is of size W×W. Such costly matching has to be carried out for number of blocks and that too for each frame. Similar cost is also incurred in post-processing step of the proposed system. But, it has to carried out only for limited number of frames (at the most two frames per transition). On the other hand in the proposed system, cost for feature computation is o(n) where n is the number of pixels in a frame. In case of machine learning technique based boundary detection, the major hurdle lies in proper training of the system and tuning of the parameters. The success depends heavily on those two. In case of multilayer perceptron network as used in [18], classification cost increases with the increase in number of layers and nodes present in the network. Thus, the proposed similarity matrix based boundary detection technique is relatively simple and of low cost compared to [18].

Table 7 Comparison of shot boundary detection performance in terms of F-measure(all figures in scale 0–1) on the dataset used in [1]

4 Conclusion

In this work, we have presented a shot detection methodology developed following a simple shot transition model based on the similarity of the frames with respect to a reference frame. The model assumes that the frames in an individual shot are very similar and whenever a shot transition occurs a discontinuity in similarity values appears. Here, two algorithms are presented and they can work on large video sequence iteratively by taking a sub-sequence in the window. First algorithm directly applies the transition model and considers a threshold on the similarity values to detect the transitions. Its performance suffers if there exists multiple transitions in the window, and the selection of threshold on similarity value is non-trivial task. To overcome these limitations second and the final algorithm is presented which is the improvement over the first one and focuses on the slope of linear approximation of similarity values rather than the absolute similarity values. The said slope is estimated using least square regression. There is a possibility that the sequences which are not so well behaved may be oversplitted and to combat the issue a simple post-processing technique is presented. Experimental result and comparison with other systems indicate that the performance of our final algorithm is quite satisfactory and better. In the current scope of the work different threshold and other parameter values are selected experimentally. In future, effort may be put to formalize the same.