1 Introduction

The escalation of multimedia information in cyberspace over the past decades has led to a swift increase in data transmission volume and repository size. This increase has motivated the exploration of more efficient techniques for processing and storing data content [4].

Several multimedia data types are available in cyberspace, such as audio, image, and video; among them, video is considered the most attractive. Video is also the most storage-consuming multimedia type and carries more valuable information than the other types. The dominance of video in cyberspace is due to the tremendous development in computer performance, the spread of recording devices, and affordable storage media. In addition, video sharing websites (VSW), such as YouTube, have been used by individuals and companies to extend their audience. Statistics show that videos are uploaded to and viewed at VSW at an inconceivable rate [5].

The enormous increase in video data has created a substantial need for effective tools that can manage, manipulate, and store that sheer volume of data [33]. This can be achieved by attaching video content to its storage (indexing) and then decomposing videos into their basic units by video structure analysis. Video structure analysis is fairly considered a challenging task because of the following video attributes: vast information compared to images, the huge size of raw data, and the absence of a prior definition of video structure [13].

The aim of video structure analysis is to partition a video into its basic elements. The video structure levels are: 1) frames, 2) shots, 3) scenes, and 4) stories. The shot is considered the basic element of video and the most suitable level for content-based video indexing and retrieval (CBVIR) [4].

A shot is defined as a continuous sequence of frames captured by a single, uninterrupted operation of a video capturing device. Shots are accumulated to form a scene, scenes are linked to compose a story, and so forth to produce a video. When two shots are attached directly together, a hard transition (HT) is produced. On the other hand, a soft transition (ST) is generated using an editing effect (indirect concatenation). HTs are more predominant in videos than STs [19]. Shot boundary detection (SBD), also known as temporal video segmentation, is the process of decomposing a video into its basic elements, namely shots. SBD is the initial and substantial step of CBVIR, and its accuracy largely affects the efficiency of the entire CBVIR system [4].

SBD algorithms include the following three main steps: 1) representation of the visual information (feature extraction), 2) similarity/dissimilarity measurement (continuity signal), and 3) transition identification. Feature extraction is a significant module in SBD algorithms; several types of features are utilized, such as pixel-based, histogram-based, and edge-based features [4]. The similarity measure is computed as the distance between the features of consecutive frames; city-block, Euclidean, and correlation are examples of such measures. To find transitions between shots, transition identification (detection) is performed using statistical machine learning (supervised and unsupervised) or thresholds (fixed and adaptive).
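To make these three steps concrete, the following minimal Python sketch (our illustration, not code from any cited work) chains a simple gray-level histogram feature, a city-block continuity signal, and a fixed threshold; the threshold value is an arbitrary assumption:

```python
import numpy as np

def frame_feature(frame, bins=64):
    """Step 1: represent a grayscale frame by its normalized intensity histogram."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
    return hist / hist.sum()  # normalize so the frame size does not matter

def continuity_signal(frames):
    """Step 2: city-block distance between features of consecutive frames."""
    feats = [frame_feature(f) for f in frames]
    return np.array([np.abs(feats[k + 1] - feats[k]).sum()
                     for k in range(len(feats) - 1)])

def detect_transitions(signal, threshold=0.5):
    """Step 3: a fixed threshold declares a transition between frames k and k+1."""
    return np.flatnonzero(signal > threshold)

# Usage: frames is a list of 2-D uint8 arrays decoded from a video.
# cuts = detect_transitions(continuity_signal(frames))
```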

SBD has attracted many researchers' attention over the past two decades. It is mainly divided into two categories: compressed and uncompressed domains. The latter has received more attention than the compressed domain because of its tremendous and valuable visual information. However, uncompressed-domain algorithms require additional processing time due to the decoding of video frames [4].

Pixel-based, histogram-based, edge-based, and motion-based techniques are examples of feature extraction approaches; an extensive survey can be found in [4]. Recently, SBD analysis has been performed in the transform domain rather than in the time domain, because the transform domain allows signals to be viewed differently and offers a powerful capability for analyzing signal components [25, 26]. To transform a signal/image from the time/spatial domain into the transform (moment) domain, orthogonal polynomials are employed. Orthogonal moments, i.e., transform coefficients, are scalar quantities used to represent the visual information; they represent the projection of the signal onto the orthogonal polynomials. The strength of orthogonal polynomials (OPs) is characterized by their energy compaction and localization properties [26].

Different OPs occupy a significant position in the signal processing and computer vision fields [2], such as the discrete Krawtchouk-Tchebichef transform (DKTT) [26]. The DKTT shows higher energy compaction and better localization properties compared to the Tchebichef polynomial (TP) and the Krawtchouk polynomial (KP).

Energy compaction is an important property: the DKTT tends to pack a large fraction of the signal energy into relatively few transform coefficients. In addition, the localization property improves the overall quality of the DKTT, especially when the region of interest (ROI) is known a priori in the image, and adds value to feature extraction by reducing the computation time.

Hence, this paper introduces the use of the DKTT as a feature extraction tool for representing video frames. In addition, new features, namely the moments of smoothed and gradient frames, are extracted from video frames using OPs (DKTT). These features are then fused to form a single feature vector. A support vector machine (SVM) is used to classify transition and non-transition frames based on the fused features.

This paper is organized as follows: Section 2 gives a brief survey of the related work. Section 3 introduces OPs and the mathematical model of the implemented OP. Section 4 presents the proposed OP-based feature extraction method for detecting HTs. Section 5 presents the experimental results to highlight the effectiveness of the extracted features. Finally, Section 6 concludes the paper.

2 Related work

In the literature, various methods based on different schemes for SBD have been introduced. Generally, SBD workflow is divided into three stages: 1) representing the visual information for video frames (feature extraction), 2) constructing the continuity signal from the extracted features, and 3) detecting transitions. In this section, related works for each of the aforementioned stages are discussed.

2.1 Feature extraction

The features extracted from video frames are used to concisely represent the visual information of those frames. Different types of visual information representation have been introduced, such as pixel-based, histogram-based, edge-based, and transform-based. The simplest and fastest approach is the pixel-based technique (PBT) [14, 43, 53], where the pixel intensities are directly employed to represent the visual information. However, PBTs are highly sensitive to object/camera motion, various camera operations, and global motion. This sensitivity strongly affects the accuracy of the SBD algorithm and reduces the precision rate due to a high false alarm rate [4]. In addition to this sensitivity, PBTs also suffer from missed detections [18].

To alleviate the sensitivity of PBTs, histogram-based techniques (HBTs) were introduced [15, 16, 42]. HBTs replace the dependency on spatial information with the color distribution of each frame, which makes them less sensitive to small global and local motions. Different color spaces are employed to extract histograms from frames, such as gray [41], RGB [40], HSV [11], and L*a*b* [20]. Although HBTs are less sensitive to object/camera motion than PBTs, false positives are still reported under large object/camera motions and flash lights. In addition, missed detections frequently occur between two shots belonging to the same scene because of the similarity in color distribution between the frames of neighboring shots.

Algorithms based on edge features were also presented to reduce the influence of object/camera motion and flash light. In [49], Yoo et al. presented an edge object tracking-based algorithm to detect shots. Alternatively, the ratio of exiting and entering edges between consecutive frames has been utilized to detect shot transitions [51, 52]. Although edge-based techniques (EBTs) benefit from edges as a frame feature, their detection accuracy is still unsatisfactory because: 1) EBTs are expensive due to the number of processes employed (edge detection, edge change ratio, motion compensation or edge tracking), and 2) their performance falls below that of HBTs [4]. However, EBTs are able to suppress the flash light effect because edges are invariant to illumination change.

Recently, transform-based techniques (TBTs) have been utilized for transition detection. A TBT uses a linear transform to compute transform coefficients, which are then considered as features in the SBD algorithm; Fourier transform coefficients are one example [34, 45]. Porter et al. [34] used the correlation between the transform coefficients of video frame blocks as features to detect transitions. Vlachos [45] utilized the phase correlation between transform coefficients as a feature. Priya and Domnic [35, 36] proposed an SBD algorithm using the Walsh-Hadamard transform (WHT); their algorithm extracted features for a small block size (4 × 4) using different OP basis functions, after resizing each video frame to 256 × 256. The Non-Subsampled Contourlet Transform (NSCT) was utilized for SBD in [27]. That algorithm extracted features from the low-frequency sub-band and seven high-frequency sub-bands of each CIE L*a*b* color channel. The reported accuracy of this method is acceptable and comparable to the SBD algorithm presented in [36]. Although the algorithms proposed in [27, 36] reported high accuracy, their computational cost was high: in [36] it stems from extracting features for a small block size with different OP basis functions, while in [27] it is due to employing multiple levels of decomposition, block processing, and considering all CIE L*a*b* channels. Zernike and Fourier-Mellin transforms were used to extract features in [8] to detect shot transitions. These features were treated as shape descriptors alongside two other features, namely the color histogram and phase correlation; the three features were combined to form a feature vector, which was then used to detect HTs. This algorithm provided acceptable HT detection accuracy; however, the Zernike and Fourier-Mellin transforms require coordinate transformation and suitable approximation [28], which increase the computational cost.

2.2 Continuity signal

The continuity signal is constructed to capture the temporal variation between features. Basically, it is constructed by finding the difference (dissimilarity) between frame features or by computing the correlation coefficients between features. It should be noted that the continuity signal is computed either between consecutive frames or between two frames within a temporal distance. The similarity between frame features is measured using cosine similarity [10], normalized correlation [34], and correlation [44]. On the other hand, the dissimilarity between frame features is computed using the city-block distance [36], edge change ratio [52], histogram intersection [15], and accumulated histogram difference [16].
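For illustration, several of the measures named above reduce to a few lines each; in this hedged sketch, f1 and f2 (or normalized histograms h1 and h2) stand for the feature vectors of the two compared frames:

```python
import numpy as np

def city_block(f1, f2):
    """Dissimilarity: L1 (city-block) distance, as used in [36]."""
    return np.abs(f1 - f2).sum()

def cosine_similarity(f1, f2):
    """Similarity: cosine of the angle between feature vectors, as in [10]."""
    return f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2))

def histogram_intersection(h1, h2):
    """Similarity between normalized histograms, as in [15]."""
    return np.minimum(h1, h2).sum()
```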

2.3 Classification of continuity signal

The constructed continuity signal (similarity or dissimilarity measure) is utilized to detect shot transitions, i.e., to differentiate between transition and non-transition frames. Threshold-based techniques are considered the simplest classification methods [4]: the continuity signal is compared with a threshold to declare a transition or non-transition state. Algorithms that rely on a threshold are considered unreliable since the threshold cannot be generalized to different video genres [6]. In addition, algorithms based on a single threshold are unable to distinguish between transitions, non-transitions, object/camera motion, and flash light. Machine learning, which treats detection as a classification problem, is used to overcome the limitations of thresholds and to employ multiple features. Neural networks and SVMs are examples of machine learning algorithms. Machine learning algorithms are very popular in image processing [9, 21] and can be classified into supervised and unsupervised methods. The difficulty of employing them stems from the selection of a suitable feature combination [32], because good features significantly increase classifier performance [22].

It is obvious from the literature that SBD algorithms perform well in transition detection. Moreover, there is evident progress in SBD algorithms, from simple feature comparison to rigorous probabilistic approaches and complex SBD models. Nevertheless, transition detection still needs to be accelerated while achieving higher accuracy. Essentially, the accuracy of transition detection depends on the ability of the algorithm to distinguish between transitions and object/camera motion, because features cannot always model the difference between images clearly. Another important drawback evident in most existing SBD methods is their high computational cost, which becomes a bottleneck for real-time applications. Among the various techniques presented, TBTs based on OPs show improved detection accuracy owing to their ability to reduce the impact of disturbance factors on transition detection. However, existing TBTs suffer from several drawbacks. First, employing a single similarity signal leads to inadequate performance. Besides, each transform has its own limitations: Zernike and Fourier-Mellin require coordinate transformation and approximation, Walsh-Hadamard requires frame resizing, and the discrete wavelet transform requires predefined multiple decomposition levels. Clearly, there is a growing need for an algorithm designed with a joint evaluation of the following aspects: minimizing the computational cost and tackling detection accuracy by increasing recall and precision for various video genres. Moreover, the algorithm must be based on a robust backbone transform that provides a good representation of the visual video content, a good localization property, high energy compaction capability, and effective extraction of the appropriate features. All of the points above are critical in an SBD algorithm and need to be considered.

3 Discrete Krawtchouk-Tchebichef polynomials and moments

The objective of this section is to introduce the mathematical model of the utilized OPs, namely the Krawtchouk-Tchebichef polynomial (KTP) [26], and their moments. The KTP of the nth order, Pn(x; p, N), can be defined as follows:

$$ P_{n}(x;p,N)=\sum\limits_{j = 0}^{N-1}k_{j}(n;p) t_{j}(x) $$
(1)

where kj(n; p) is the KP of the jth order, which is given as follows [3]:

$$\begin{array}{@{}rcl@{}} k_{j}(n;p)&=&\sqrt{\frac{A}{B}} {}_{2}F_{1}\left( -j,-n;1-N;\frac{1}{p}\right)\\ j,n&=&0,1,\dots,N-1; N>0, p\in(0,1) \end{array} $$
(2)

where A and B are defined as follows:

$$\begin{array}{@{}rcl@{}} A&=&\binom{N-1}{n} p^{n} (1-p)^{N-1-n} \end{array} $$
(3)
$$\begin{array}{@{}rcl@{}} B&=&(-1)^{j} \left( \frac{1-p}{p}\right)^{j} \frac{j!}{(1-N)_{j}} \end{array} $$
(4)

where \( \binom {a}{b} \) represents the binomial coefficient, and (a)j stands for the Pochhammer symbol [2].

tj(x) symbolizes the TP [1] of the jth order, which is given as follows:

$$\begin{array}{@{}rcl@{}} t_{j}(x)&=&\frac{(-N + 1)_{j} {_{3}F_{2}(-j,-x,j + 1;1,1-N;1)}}{\sqrt{(2j)! \binom{N+j}{2j + 1}}}\\ j,x&=&0,1,\dots,N-1; N>0 \end{array} $$
(5)

Lastly, 2F1 and 3F2 in (2) and (5) denote the hypergeometric functions [2].
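For readers who prefer code, the following Python sketch is a direct, hedged transcription of (1)-(5). It evaluates the terminating hypergeometric series term by term, which is numerically practical only for small N; fast recurrence-based computation, as in [26], would be used in practice:

```python
import numpy as np
from math import comb, factorial

def poch(a, k):
    """Pochhammer symbol (a)_k = a (a+1) ... (a+k-1)."""
    r = 1.0
    for i in range(k):
        r *= a + i
    return r

def krawtchouk(j, n, N, p):
    """Normalized KP k_j(n;p) of Eq. (2)-(4); the 2F1 series terminates at j."""
    A = comb(N - 1, n) * p**n * (1 - p)**(N - 1 - n)
    B = (-1)**j * ((1 - p) / p)**j * factorial(j) / poch(1 - N, j)
    f = sum(poch(-j, s) * poch(-n, s) / (poch(1 - N, s) * factorial(s)) * (1 / p)**s
            for s in range(j + 1))
    return np.sqrt(A / B) * f

def tchebichef(j, x, N):
    """Normalized TP t_j(x) of Eq. (5); the 3F2 series terminates at min(j, x)."""
    f = sum(poch(-j, s) * poch(-x, s) * poch(j + 1, s)
            / (poch(1, s) * poch(1 - N, s) * factorial(s))
            for s in range(min(j, x) + 1))
    return poch(1 - N, j) * f / np.sqrt(factorial(2 * j) * comb(N + j, 2 * j + 1))

def ktp_matrix(N, p=0.5):
    """P[n, x] = P_n(x; p, N), built from Eq. (1) as a product of KP and TP matrices."""
    K = np.array([[krawtchouk(j, n, N, p) for j in range(N)] for n in range(N)])
    T = np.array([[tchebichef(j, x, N) for x in range(N)] for j in range(N)])
    return K @ T

# Sanity check for small N: ktp_matrix(8) @ ktp_matrix(8).T should be close to
# the identity, since the normalized KP and TP bases are both orthonormal.
```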

Orthogonal moments are the projections of a signal onto the OP basis functions. The orthogonal moments of a 2D signal I(x, y) (an image) of size N1 × N2 can be computed as follows:

$$\begin{array}{@{}rcl@{}} \phi_{nm}&=&\sum\limits_{x = 0}^{N_{1}-1}\sum\limits_{y = 0}^{N_{2}-1}P_{n}(x) P_{m}(y) I(x,y) \\ n&=&\frac{N_{1}-MO_{n}}{2},\dots,\frac{N_{1}-2}{2},\frac{N_{1}}{2},\dots,\frac{N_{1}+MO_{n}-2}{2} \\ m&=&\frac{N_{2}-MO_{m}}{2},\dots,\frac{N_{2}-2}{2},\frac{N_{2}}{2},\dots,\frac{N_{2}+MO_{m}-2}{2} \end{array} $$
(6)

where MOn and MOm are the maximum moment orders used for signal representation, so that the selected orders are centered around the middle of the order range. The computation of moments can be performed in matrix form as follows:

$$ {\Phi}=P_{1} I {P_{2}^{T}} $$
(7)

where P1 and P2 are the matrix forms of Pn(x) and Pm(y), respectively, and Φ is the matrix form of the moments ϕnm. For moment order selection, Fig. 1 shows the construction of the matrix P with the localization parameter p and moment order MO.

Fig. 1
figure 1

KTP generation with moment order (MO)

To reconstruct I from the moment domain, the inverse moment transformation is applied as follows:

$$ \hat{I}={P_{1}^{T}} {\Phi} P_{2} $$
(8)

where \( \hat {I} \) is the reconstructed 2D signal from the moment domain.
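In code, (7) and (8) are simply two matrix products. A minimal sketch, reusing the illustrative ktp_matrix helper from the previous sketch:

```python
import numpy as np

def moments(I, P1, P2):
    """Forward transform, Eq. (7): Phi = P1 * I * P2^T."""
    return P1 @ I @ P2.T

def reconstruct(Phi, P1, P2):
    """Inverse transform, Eq. (8): I_hat = P1^T * Phi * P2."""
    return P1.T @ Phi @ P2

# Usage with a square image I of size N x N:
# P = ktp_matrix(N, p=0.5)        # full-order basis
# Phi = moments(I, P, P)          # all N x N moments
# np.allclose(reconstruct(Phi, P, P), I)   # exact when all orders are kept
```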

4 SBD algorithm based on orthogonal moments

This section introduces the proposed SBD algorithm based on orthogonal moments. The proposed algorithm includes three steps: A) OP-based feature extraction, B) dissimilarity measurement, and C) detection of transitions (shot boundaries). The proposed methodology is shown in Fig. 2, and each step is described in detail in the following subsections.

Fig. 2
figure 2

The proposed methodology

4.1 Feature extraction

The proposed SBD algorithm is based on moments (features) computed in the DKTT domain. The considered features are the smoothed moments, the moments of image gradients (MOGs) in the x-direction, and the MOGs in the y-direction, i.e., three groups of moments are used. Averaging (smoothing) is applied to video frames to decrease the effect of noise and camera/object motion [4]. To extract the moments of a smoothed video frame, a smoothing kernel is applied to the video frame prior to the feature extraction process. Assume fk is the kth video frame with a size of N1 × N2, and Sx and Sy are the smoothing kernels in the x- and y-directions, respectively. Then, the smoothed image in the x-direction can be obtained by convolving the video frame with Sx as follows:

$$ I_{SX}=S_{x}*f_{k} $$
(9)

When the obtained smoothed image (ISX) is convolved with the kernel in the y-direction (Sy), the resulting image is smoothed in both the x- and y-directions. This operation is performed as follows:

$$ I_{SXY}=S_{y} * I_{SX} $$
(10)

where ISXY is the smoothed image in both directions. To reduce the number of convolution operations, the associative property of convolution is applied so that both kernels are combined into a single kernel Sxy. Thus, the smoothed video frame in both directions (ISXY) can be obtained directly as follows:

$$ I_{SXY}(k)=S_{xy} * f_{k} $$
(11)

where ∗ is the convolution operation, \( S_{x} = \frac {1}{\sqrt {2 \pi } \sigma _{x}} e^{-\frac {(x-\mu _{x})^{2}}{2{\sigma _{x}^{2}}}}\), and \( S_{y} = \frac {1}{\sqrt {2 \pi } \sigma _{y}} e^{-\frac {(y-\mu _{y})^{2}}{2{\sigma _{y}^{2}}}}\). To compute the moments of the smoothed video frame, (7) is applied as follows:

$$ {\Phi}_{S}(k)=P_{1} I_{SXY}(k) {P_{2}^{T}} $$
(12)

where P1 and P2 are the KTP matrices. To enhance the detection accuracy, gradient image kernels such as difference, Sobel, and Prewitt can be utilized to decompose an image into its gradients in the x- and y-directions. The image gradient captures the intensity change in the horizontal and vertical directions. Image gradients have proven to be useful and reasonable tools for image representation [4], and there have been successful attempts to employ them in computer vision applications [24]. Accordingly, MOGs are considered as features in the design of the proposed SBD algorithm because of their ability to reduce the flash light effect [17].

To compute the MOGs (ΦGX and ΦGY) in the x- and y-directions, the gradient kernel operators GX and GY are utilized to find the gradients of the video frames as follows:

$$\begin{array}{@{}rcl@{}} {\Phi}_{GX}(k)&=&P_{1} I_{GX}(k) {P_{2}^{T}} \end{array} $$
(13)
$$\begin{array}{@{}rcl@{}} {\Phi}_{GY}(k)&=&P_{1} I_{GY}(k) {P_{2}^{T}} \end{array} $$
(14)

where IGX and IGY are the image gradients of the frame fk in the x- and y-directions, respectively, which are computed as follows:

$$\begin{array}{@{}rcl@{}} I_{GX}(k) &=&G_{X} * f_{k} \end{array} $$
(15)
$$\begin{array}{@{}rcl@{}} I_{GY}(k) &=&G_{Y} * f_{k} \end{array} $$
(16)

where Gx = [− 1 1], and Gy = [− 1 1]T.
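A hedged sketch of (11), (15), and (16) using SciPy's 2-D convolution is given below; the Gaussian kernel size and σ are our assumptions, since the paper does not fix them:

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_kernel_1d(size=5, sigma=1.0):
    """Sampled, normalized 1-D Gaussian (zero-centered)."""
    x = np.arange(size) - (size - 1) / 2
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def smoothed_and_gradients(frame):
    """Return I_SXY (Eq. 11), I_GX (Eq. 15), and I_GY (Eq. 16) for one frame."""
    g = gaussian_kernel_1d()
    Sxy = np.outer(g, g)                  # combined separable kernel S_xy
    I_sxy = convolve(frame.astype(float), Sxy)
    Gx = np.array([[-1.0, 1.0]])          # horizontal difference kernel
    Gy = Gx.T                             # vertical difference kernel
    I_gx = convolve(frame.astype(float), Gx)
    I_gy = convolve(frame.astype(float), Gy)
    return I_sxy, I_gx, I_gy
```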

The efficiency of the SBD algorithm can be increased by employing local features, which are more robust than global features. Generally, local features are utilized to minimize the disturbance effect of object/camera motion and to draw a more consistent temporal variation within shots [23]. Accordingly, video frames are divided into blocks from which local features are extracted, reducing the effect of flash light, object/camera motion, and camera operations. To obtain the local moments (features), the moments are computed for each block of the smoothed frame (ISXY) and the gradient frames (IGX and IGY).

To summarize the feature extraction process of the proposed SBD algorithm, the procedure for computing the local moments (features) is as follows:

  1.

    The video frames are decoded.

  2.

    For each video frame, average the RGB color planes as follows:

    $$ f_{k}=\frac{1}{3}\sum\limits_{c\in \{R,G,B\}}f_{k}(x,y,c) $$
    (17)
  3.
    (a)

      For each video frame, compute smoothed and gradient frames using (11), (15), and (16).

    (b)

      Acquire the video frame dimensions (N1 and N2).

  4.
    (a)

      Set the number of blocks for local features (v1 and v2) and the moment orders (MOn and MOm). Then compute the block sizes B1 and B2 as follows:

      $$\begin{array}{@{}rcl@{}} B_{1}&=& \frac{N_{1}}{v_{1}} \end{array} $$
      (18)
      $$\begin{array}{@{}rcl@{}} B_{2}&=& \frac{N_{2}}{v_{2}} \end{array} $$
      (19)
    (b)

      Partition the smoothed and gradient images into non-overlapping blocks of size B1 × B2.

  5.
    (a)

      Generate KTPs (P1 and P2) using (1).

    (b)

      Compute ΦS for each video frame block using (12).

    (c)

      Compute ΦGX and ΦGY for each video frame block using (13) and (14).

The aforementioned steps are illustrated in Fig. 3 for further elucidation; a code sketch of the block-wise moment computation follows the figure.

Fig. 3
figure 3

Flow diagram of the moment extraction stage
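Putting the procedure together, the sketch below computes the local (per-block) moments of steps 4 and 5 for one preprocessed image. The block counts and the helpers (smoothed_and_gradients, ktp_matrix) are assumptions carried over from the earlier sketches, and the moment-order selection of Fig. 1 is omitted for brevity:

```python
import numpy as np

def local_moments(img, P1, P2, v1=4, v2=4):
    """Per-block moments, i.e., Eq. (12)-(14) applied block-wise."""
    N1, N2 = img.shape
    B1, B2 = N1 // v1, N2 // v2              # block sizes, Eq. (18)-(19)
    feats = []
    for i in range(v1):
        for j in range(v2):
            block = img[i * B1:(i + 1) * B1, j * B2:(j + 1) * B2]
            feats.append(P1 @ block @ P2.T)  # Eq. (7) applied to one block
    return feats                             # v1*v2 moment matrices

# Usage for the k-th grayscale frame (square blocks assumed, so P1 = P2):
# I_sxy, I_gx, I_gy = smoothed_and_gradients(gray_frame)
# P = ktp_matrix(gray_frame.shape[0] // 4)   # basis sized to one block
# Phi_S, Phi_GX, Phi_GY = (local_moments(I, P, P) for I in (I_sxy, I_gx, I_gy))
```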

4.2 Construction of dissimilarity signal

In this section, the dissimilarity signal is constructed using the extracted smoothed moments (ΦS) and MOGs (ΦGX and ΦGY). The dissimilarity is accumulated over the blocks of frames k and k + 1. The city-block distance is used to find the dissimilarity between the moments of two consecutive frames as follows:

$$\begin{array}{@{}rcl@{}} &&D_{S}(k)=\sum\limits_{v = 1}^{v_{1}\times v_{2}}\sum\limits_{n,m\in{\Phi}_{S}}\left|{\Phi}_{S}(k)-{\Phi}_{S}(k + 1)\right| \end{array} $$
(20)
$$\begin{array}{@{}rcl@{}} &&D_{X}(k)=\sum\limits_{v = 1}^{v_{1}\times v_{2}}\sum\limits_{n,m\in{\Phi}_{GX}}\left|{\Phi}_{GX}(k)-{\Phi}_{GX}(k + 1)\right| \end{array} $$
(21)
$$\begin{array}{@{}rcl@{}} &&D_{Y}(k)=\sum\limits_{v = 1}^{v_{1}\times v_{2}}\sum\limits_{n,m\in{\Phi}_{GY}}\left|{\Phi}_{GY}(k)-{\Phi}_{GY}(k + 1)\right| \end{array} $$
(22)

The three dissimilarity signals (DS, DX, and DY) are concatenated to form the feature vector (FV) as follows:

$$ FV(k)=\left[\begin{array}{lllll} D_{S}(k)\\ D_{X}(k)\\ D_{Y}(k) \end{array}\right] \Longrightarrow FV=\left[\begin{array}{lllll} D_{S}(1) & D_{S}(2) & {\cdots} & D_{S}(N_{f}-1)\\ D_{X}(1) & D_{X}(2) & {\cdots} & D_{X}(N_{f}-1)\\ D_{Y}(1) & D_{Y}(2) & {\cdots} & D_{Y}(N_{f}-1) \end{array}\right] $$
(23)

Obviously, the size of the feature vector (FV) is 3 × (Nf − 1), where Nf is the number of frames in the video.

Temporal (contextual) information is an important factor for detecting transitions, in which the features of the previous and next frames are considered to improve the detection accuracy [23]. Accordingly, the features of the previous two frames, the next two frames, and the current frame are combined to form the feature vector (FVZ) as follows:

$$ FV_{Z}(k)=\left[\begin{array}{lllll} FV(k-2)\\ FV(k-1)\\ FV(k)\\ FV(k + 1)\\ FV(k + 2) \end{array}\right] $$
(24)

It is worth mentioning that the size of the feature vector (FVZ) is 15 × (Nf − 1), which results from 3 dissimilarity signals (DS(k), DX(k), and DY(k)) across 5 temporal positions (k − 2, k − 1, k, k + 1, k + 2). The obtained feature vector FVZ is used as the input to the SVM classifier for training and testing.
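The construction of (20)-(24) can be sketched as follows; the per-frame feature lists come from the hypothetical local_moments helper above, and boundary frames without two neighbors on each side are skipped:

```python
import numpy as np

def frame_distance(feats_k, feats_k1):
    """City-block distance accumulated over all blocks and moments, Eq. (20)-(22)."""
    return sum(np.abs(a - b).sum() for a, b in zip(feats_k, feats_k1))

def build_FV(S, GX, GY):
    """Eq. (23): one column of three dissimilarities per consecutive frame pair."""
    Nf = len(S)
    return np.array([[frame_distance(S[k], S[k + 1]),
                      frame_distance(GX[k], GX[k + 1]),
                      frame_distance(GY[k], GY[k + 1])]
                     for k in range(Nf - 1)]).T          # shape 3 x (Nf - 1)

def temporal_stack(FV, k):
    """Eq. (24): the k-th column plus two neighbors on each side (15 values).
    Valid for 2 <= k <= FV.shape[1] - 3."""
    return np.concatenate([FV[:, k + d] for d in (-2, -1, 0, 1, 2)])
```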

4.3 Transition detection

In this stage, an SVM [7] is used to identify transition and non-transition frames. The SVM is utilized in this work to detect HTs owing to its powerful classification ability [29, 46]. In this regard, a training phase is required to obtain the SVM model; the trained model is then used to classify the input feature vector. As described above, temporal information, i.e., the features of the previous and next frames, is taken into consideration to improve the detection accuracy.

Feature normalization is an important process and should be applied to both the training and testing feature vectors [12]. In practice, features lie within diverse dynamic ranges; hence, features with large values have a greater impact on the cost function than features with small values [38]. The normalization process is performed to prevent features with large numeric ranges from dominating those with smaller ranges, i.e., to ensure a similar dynamic range [3]. This can be achieved by transforming the kth feature vector FVZ, with mean \( \mu _{FV_{Z}} \) and standard deviation \( \sigma _{FV_{Z}} \), to the desired mean μdes and standard deviation σdes as follows:

$$ FV_{ZN}=\left( FV_{Z}(k) - \mu_{FV_{Z}}(k) \right)\left( \frac{\sigma_{des}}{\sigma_{FV_{Z}}} \right) + \mu_{des} $$
(25)

where FVZN is the normalized feature vector. In this work, the feature vector FVZ is normalized with μdes = 0 and σdes = 1 to achieve the final feature vector (FVZN) which is used in the training and detection phases.
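With μdes = 0 and σdes = 1, (25) is ordinary z-score standardization. A minimal sketch, where the statistics are estimated from the training vectors only and reused for testing:

```python
import numpy as np

def zscore_normalize(FVZ_train, FVZ_test):
    """Eq. (25) with mu_des = 0 and sigma_des = 1, applied per feature row."""
    mu = FVZ_train.mean(axis=1, keepdims=True)
    sigma = FVZ_train.std(axis=1, keepdims=True)
    return (FVZ_train - mu) / sigma, (FVZ_test - mu) / sigma
```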

Cross-validation and grid search are used to tune the SVM parameters: the kernel parameter (γ) and the penalty parameter (C). The cross-validation procedure helps prevent the overfitting problem. Several pairs of (C, γ) values were tested in the SVM model, and the pair with the highest cross-validation accuracy was adopted. Based on 5-fold cross-validation results, the grid search successfully finds the optimal pair of parameters.
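With scikit-learn (an implementation choice of this sketch; the paper reports MATLAB), the described 5-fold grid search over (C, γ) for an RBF-kernel SVM looks as follows; the grid values are assumptions, not taken from the paper:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: normalized FV_ZN samples (one 15-D row per frame pair); y: transition labels.
param_grid = {"C": [2**e for e in range(-3, 10, 2)],
              "gamma": [2**e for e in range(-11, 2, 2)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train)        # picks the (C, gamma) pair with the
# model = search.best_estimator_      # highest 5-fold cross-validation score
```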

5 Experimental results

In this section, the performance of the proposed SBD algorithm is evaluated in terms of computational cost and the ability to detect HTs. The proposed algorithm is tested on different datasets provided by TRECVID, which is co-sponsored by the National Institute of Standards and Technology (NIST) [39]. The datasets used in this work are TRECVID2005, TRECVID2006, and TRECVID2007. These datasets include 42 videos comprising more than 19 hours of video.

To train the SVM model, 7 videos from the datasets are utilized; the remaining 35 videos are used for testing. The evaluation is based on three measures: Precision (\( \mathcal {P} \)), Recall (\( \mathcal {R} \)), and harmonic mean (F1). These measures are defined as follows [4]:

$$\begin{array}{@{}rcl@{}} \mathcal{P}\%&=&\frac{N_{C}}{N_{C} + N_{F}} \times 100 \end{array} $$
(26)
$$\begin{array}{@{}rcl@{}} \mathcal{R}\%&=&\frac{N_{C}}{N_{C} + N_{M}} \times 100 \end{array} $$
(27)
$$\begin{array}{@{}rcl@{}} F1&=&\frac{2 \mathcal{P} \mathcal{R}}{\mathcal{P}+\mathcal{R}} \end{array} $$
(28)

where NC represents the correctly detected transitions, NF the falsely detected transitions, and NM the missed transitions. The experiments were carried out using MATLAB on an HP laptop.
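In code, (26)-(28) reduce to a few lines; NC, NF, and NM are the counts defined above:

```python
def evaluation_measures(N_C, N_F, N_M):
    """Precision and recall in percent, Eq. (26)-(27), and harmonic mean F1, Eq. (28)."""
    P = 100.0 * N_C / (N_C + N_F)
    R = 100.0 * N_C / (N_C + N_M)
    return P, R, 2 * P * R / (P + R)
```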

Table 1 summarizes the performance of the proposed algorithm in terms of precision, recall, and F1 score. The performance evaluation is carried out using different moment selection orders. In addition, the features are evaluated with and without MOGs. In the experiment, the number of blocks (v1 × v2) is set to 4 × 4, and the moment selection ratios considered are 10% and 20% of the total number of moments (see Fig. 1). Note that the actual number of blocks is 2v1 × 2v2, because the localization property of the KTP can represent each block by 4 sub-blocks; therefore, the computation cost is reduced by a factor of 4.

Table 1 The accuracy measures of the proposed SBD

From Table 1, it is observed that the moment selection ratio of 10% yields better accuracy than 20%, because the more moments are included, the more the dissimilarity signal fluctuates. In addition, the accuracy measures of the proposed algorithm with MOGs are better than those without MOGs; in other words, the MOGs reduce the effect of disturbance factors.

In addition to the accuracy results, the computational time is computed for each dataset and reported in Table 2, along with the number of processed frames and the processing time per frame. It can be noted that the computational cost increases when the moment selection ratio of 20% is used instead of 10%. For instance, the computational cost is reduced by 89 sec when 10% of the moments and the smoothed moments are considered for feature extraction. Likewise, the total processing time is decreased by 162 sec when MOGs are considered. Furthermore, the computational cost of the SBD algorithm increases by ∼2.5 times when both the smoothed moments and MOGs are used. Although the computational time increases, the accuracy of the proposed SBD algorithm also increases (see Table 1). Moreover, the computational time of the proposed algorithm is ∼16% of the real-time video duration, which is adequate for real-time processing.

Table 2 The processing time of the proposed SBD algorithm

The TRECVID datasets are particularly challenging owing to the diversity of their content, from static to highly disturbed shots. Figure 4 shows visual examples of HTs correctly detected by the proposed algorithm, along with the different types of disturbance factors occurring around them. As shown in the figure, the proposed algorithm is able to distinguish between HTs and disturbance factors such as local object motion, global object motion, and object occlusion. Furthermore, the proposed algorithm is able to detect HTs between two highly similar shots, as depicted in the last row of Fig. 4.

Fig. 4
figure 4

Sample of HTs correctly detected using the proposed algorithm with the presence of disturbance factors

For further illustration, Fig. 5 shows different sampled frame sequences with disturbance factors from the datasets that are correctly classified as non-transition sequences. These disturbance factors include camera flash light, explosions, fast object motion, and different camera operations.

Fig. 5
figure 5

Samples of different disturbance factors in the video datasets

For comparison purposes, the proposed SBD algorithm is first compared to the top 7 competitors in TRECVID 2005 [30] and TRECVID 2006 [31], as shown in Fig. 6. It can be observed that the proposed method outperforms the top competitors in both competitions in terms of harmonic mean (F1%). For TRECVID 2005, as shown in Fig. 6a, the proposed algorithm outperforms the top 7 competitors in the SBD task, with an improvement of ∼2.5% over the KDDI algorithm (highest accuracy among the top TRECVID 2005 competitors) and ∼5.8% over the CLIPS algorithm (lowest accuracy among them). For TRECVID 2006, as shown in Fig. 6b, the improvements of the proposed algorithm are ∼4.5% and ∼9.0% over Tsinghua (highest accuracy in TRECVID 2006) and Huazhong (lowest accuracy in TRECVID 2006), respectively.

Fig. 6
figure 6

Comparison between proposed algorithm and top competitors in TRECVID a 2005, and b 2006

In addition, the proposed algorithm is compared to the following state-of-the-art algorithms: the SVM and Non-Subsampled Contourlet Transform-based SBD (SVMNSCT-SBD) [27], the threshold and Non-Subsampled Contourlet Transform-based SBD (THNSCT-SBD) [37], the Walsh-Hadamard transform-based SBD (WHT-SBD) [36], and the concatenated block-based histogram SBD (CBBH-SBD) [50]. The comparison is presented in Tables 3 and 4 in terms of accuracy and computational cost. The proposed algorithm shows an improvement over the THNSCT-SBD and CBBH-SBD algorithms and comparable results to the SVMNSCT-SBD and WHT-SBD algorithms in terms of harmonic mean (F1 score). On the other hand, the proposed algorithm shows remarkable progress in terms of the time required to process video frames (see Tables 3 and 4).

Table 3 Accuracy comparison using TRECVID2007 dataset
Table 4 Accuracy comparison using TRECVID2005 dataset

Although the proposed algorithm shows promising results in terms of accuracy and computational cost, its performance can be further improved in terms of computational cost by using multi-core CPUs/GPUs [54, 55] and distributed optimization [47, 48].

6 Conclusion

In this paper, a new SBD algorithm based on new features extracted using OPs is presented. Frame smoothing is used to reduce noise, while the frame gradients in the x- and y-directions are used for feature extraction. The moments extracted from the smoothed and gradient video frames are used to detect HTs effectively, and temporal information is taken into account when the feature vector is constructed. The results show that the presented algorithm outperforms the TRECVID competitors and several state-of-the-art algorithms. However, it is noticed that the computational cost increases when multiple image kernels are convolved with each video frame. Therefore, future work will focus on further reducing the computational cost of the SBD algorithm as well as detecting soft transitions, so that the algorithm can be generalized to detect different transition types.