1 INTRODUCTION

In recent years, the Internet and social media platforms have become ubiquitous, and a plethora of video information is generated every minute. With the proliferation of 5G technology and advances in smartphones, mobile users and the Internet of Things (IoT) are predicted to increase mobile video data traffic. Developments in video acquisition technologies have led to the creation of massive video repositories on storage platforms. Users may prefer to query videos based on content rather than sequentially accessing the video data, which demands sophisticated technology for representing, indexing and retrieving multimedia data. Manual video management is arduous, and hence it is crucial to develop efficient algorithms to store, index and retrieve videos. This domain of research is referred to as Content Based Video Retrieval (CBVR). CBVR is an inherent extension of Content Based Image Retrieval (CBIR): a CBVR system provides relevant video shots/clips in response to a user query. The approaches and paradigms for CBVR must align computer vision with human perception [8]. The term “content” stands for image features such as color, shape and texture, and the term “retrieval” refers to techniques that fetch results in accordance with user perception. Thus, CBVR can be posed as the search for videos that match the query given by the user.

CBVR technology has been successfully used in several applications such as crime prevention, biometrics, gesture recognition, biodiversity information systems, medicine, digital libraries and historical research. The widespread use of video has increased the demand for automated tools for efficient indexing, browsing and retrieval of video data [13]. Since video retrieval is not effective with the conventional query-by-text technique, CBVR is considered one of the best practical solutions for better retrieval quality [46]. The rich structure of video offers tremendous scope for enhancing the performance of conventional search engines [21]. It is vital to develop appropriate measures to manage multimedia information effectively, efficiently and meaningfully [7].

The research community working on CBVR has identified several challenges in the design of effective and efficient CBVR systems. Shot Boundary Detection (SBD) is a crucial step in the design of a CBVR system and aims at segmenting a video into a number of structural elements (scenes, shots or frames) [7]. Hence, it is also termed video temporal segmentation and is central to video summarization and indexing. Challenges in SBD include designing efficient feature descriptors and threshold-free algorithms that achieve a high detection rate for any type of shot transition. The logical, hierarchical representation of video is shown in Fig. 1. Shot boundaries are marked by the interruptions between camera operations. A video stream is viewed as a set of distinct scenes, and a shot is a sequence of successive frames captured by a single camera. A representative frame, termed a keyframe, can be identified from every shot; the keyframes constitute a summary of the video and are further used for retrieval.

Fig. 1
figure 1

Hierarchy of video units

A frame represents a single image in a video. Consecutive frames within a shot are highly redundant in appearance and behaviour. According to the video editing effects, shot boundaries are classified as either abrupt (hard) or gradual (soft) transitions, based on the inherent behaviour, properties and length of the videos [48]. Figure 2 presents the categories of video shot transitions. Detecting gradual transitions is more complex than detecting abrupt transitions because of camera and object motion [8].

Fig. 2
figure 2

Categories of video shot transition

Abrupt shot detection

An abrupt transition is caused by a rapid change in visual content between adjacent frames. Such transitions show a substantial visual discontinuity between frames and are termed hard cuts.

Gradual shot detection

A gradual transition is caused by a slow, continuous change in visual content across multiple frames. Such transitions exhibit a progressive change over a group of frames and are termed smooth transitions. Gradual transition detection is difficult due to the continuous nature of the editing effects. Editing effects regarded as gradual transitions include fade-in, fade-out and dissolve.

The primary step in SBD involves feature extraction and representation of frames. Subsequently, similarity/dissimilarity measures are computed to locate transitions between frames, and shot boundaries are declared wherever significant changes occur. A shot transition occurs when there is a drastic change in visual content between frames. In the literature, SBD techniques are broadly classified into two categories based on the feature extraction domain: compressed and uncompressed. Features extracted in the compressed domain make SBD algorithms fast, as no decoding of video frames is required. However, researchers pay more attention to the uncompressed domain because of its richness in the visual information of video frames, as discussed in [3]. The SBD algorithms proposed in this work all operate in the uncompressed domain.

Researchers in the field of image and video analytics have shown that methods based on soft computing techniques [5, 33, 34, 47] perform better than conventional methods [9]. Two-dimensional discretization introduces inherent uncertainty into digital frames, so some amount of uncertainty exists even in the simplest feature extraction approach [29]. Ambiguities in digital frames arise from object position and pixel intensity. Among the interesting features, edges delineate object boundaries through variations in pixel intensity. Fuzzy set theory has been a vital choice for handling such ambiguities and for processing edges [29].

In the proposed work, shot boundary detection is achieved by employing edge information and incorporating fuzzy logic [57]. The edge detection process typically consists of several sub-processes or phases [52]. The mathematical framework introduced by Bezdek et al. [6], consisting of the phases of conditioning, feature extraction, blending and scaling, is used by many edge detection methods in the literature. Fuzzy logic can be applied in all phases or at a specific phase, depending on the nature and complexity of the video, to produce better shot detection results [29]. In classical edge detection, the binarization step causes information loss in digital frames [35]. Using fuzzy sets as an intermediate representation of edges can preserve this important information [29, 51], describe the content of a video frame better, and handle the ambiguities in digital frames. This has motivated the authors to carry out the proposed SBD work.

The focus of this work is to identify abrupt and gradual transitions in videos. An attempt is made to enhance edge detection capability and address uncertainty in digital frames. Initially, the grayscale frames are transformed into gradient frames using the Sobel detector. The Sobel gradient distribution of pixels is then subjected to fuzzification using a triangular membership function (MF). The proposed method applies a block-based cumulative sum to each 3 × 3 block of pixels of the fuzzified gradient frame. This local feature discriminates the spatial distribution and is robust to noise and illumination variation. Further, the mean of the cumulative sums is computed and used to produce the MCSH histogram of every video frame. Thus, the video is represented as a sequence of MCSHs, each of which describes a video frame globally. A thresholding strategy based on the Relative Standard Deviation (RSD) statistical measure is applied to the MCSH histograms of every frame for shot transition detection. The efficacy of the proposed method in terms of precision, recall and F1-score has been demonstrated through extensive experiments on the TRECVID and VideoSeg datasets.

The rest of the paper is organized as follows: Section 2 gives a detailed description of related work. Section 3 presents the proposed methodology, covering feature extraction and shot boundary detection. Experimental analysis and results are discussed in Section 4, followed by the conclusion in Section 5.

2 LITERATURE REVIEW

With the advancement of CBVR technology, there is great demand for robust and reliable SBD algorithms [7]. A comprehensive survey of recent developments in SBD has been reported by Abdulhussain et al. [3]. Numerous challenges of shot boundary detection and extensive reviews of several techniques are presented by Hu et al. [19] and Yuan et al. [56]. Important work on SBD addressing abrupt and gradual transitions is discussed in the following paragraphs, with emphasis on approaches that explore histogram, edge-based and block-based features, and soft computing techniques.

A histogram is a global feature and does not capture the spatial details of pixels; hence it is more robust to camera or object motion than pixel-based methods. Mas and Fernandez [32] demonstrated the effectiveness of a color histogram descriptor using a color space and quantization method discriminating the least significant bits of each RGB component. The city block distance between color histograms was measured and compared against a threshold to detect shot cuts. Ji et al. [22] used accumulative histogram differences and support points for the detection of dissolve transitions. Lu and Shi [30] proposed a singular value decomposition and candidate segment selection method for SBD, in which a frame feature matrix is formed by extracting color histograms in hue-saturation-value space to identify cut and gradual transitions. Li et al. [27] presented a three-stage approach based on multilevel difference of color histograms for detecting cut and gradual boundaries. Hannane et al. [16] detected shot boundaries by extracting SIFT and edge-SIFT keypoints from each frame; an adaptive threshold is applied to the SIFT-PDH distance values between frames to identify shot boundaries. Prasertsakul et al. [36] presented a technique for classifying several camera operations in videos using a 2D histogram. 2D motion vector (MV) fields are generated by applying an existing block-based MV estimation method in polar coordinates, and MVs in each frame that share similar magnitude and orientation are used to classify the camera operations via the 2D histogram.

An edge is an important local feature that represents a discontinuity in pixel intensity: pixels belonging to the same object exhibit intensity continuity, and vice versa. Significant changes in edge pixels between consecutive frames indicate a shot change; however, since spatial information is not considered, shot boundaries may be missed [48]. This technique is applied to detect both abrupt and gradual transitions. Heng and Ngan [18] presented shot boundary detection using object-based edge detection. The authors proposed a time-stamp transferring mechanism that utilizes information across multiple frames; moving objects across the gradual transition frames, rather than adjacent frames, are tracked using edge object tracking. Zheng et al. [60] proposed a heuristic algorithm for the detection of fade-in and fade-out transitions, which uses the Roberts edge detector and detects transitions by separating them from object motion with a predefined adaptive threshold. Adjeroh et al. [4] introduced an adaptive edge-oriented framework using multilevel features based on shot variability to identify abrupt transitions. Three levels of adaptation are considered: at the feature extraction stage using locally-adaptive edge maps, at the video sequence level, and at the individual shot level. Adaptive parameters for the multilevel edge-based approach are formulated to determine adaptive thresholds for shot boundary detection. Priya and Domnic [38] used edge strength as a feature vector, extracted by projecting blocks of frames onto a vector space. The sum of absolute differences between the features of corresponding blocks is evaluated, and shot transitions are categorized using the similarity difference values.

A block-based approach acts as an intermediary between local and global feature-based approaches. Since spatial resolution is reduced by using blocks instead of pixels, this method is less sensitive to object and camera motion. Shahraray [43] proposed a block-based technique that divides the frame into 12 non-overlapping blocks. Non-linear order statistics are used to find the best match between respective neighbourhoods of the previous frame, and sustained low-level increases in match values are identified to detect shot cuts. Lee et al. [25] computed block differences in HSV color space; the mean hue and saturation values of two successive blocks are compared to detect shots. Lian [28] proposed pixel-, histogram- and motion-based frame differences that resist flash and light changes, avoiding false positives in shot boundary detection. Jiang et al. [23] proposed a combined pixel- and histogram-based method using unevenly blocked color histograms and uneven pixel value differences in moving windows. Rashmi and Nagendraswamy [39] proposed a shot cut method using edge information, constructing a histogram by assigning binary weights to each sliding 2 × 2 block/mask of a video in overlapping and non-overlapping modes. To enhance the discriminative capability of the spatial distribution, Rashmi and Nagendraswamy [40] proposed a midrange LBP texture descriptor in which a midrange threshold is applied to each pixel across a 3 × 3 block of the image matrix to produce a histogram, and an adaptive threshold is used to detect shot boundaries. Wu et al. [54] proposed Unsupervised Deep Video Hashing, in which balanced code learning and hash function learning are integrated and optimized for video retrieval; feature clustering and binarization preserve the neighbourhood structure, and smart rotations are used to generate effective hash codes.
Wu and Xu [55] proposed a bottom-up and top-down attention model to perform color image saliency detection in news video, computing multi-scale local and global motion conspicuity maps on eye-tracking datasets. Shen et al. [44] proposed video event detection using a subspace selection technique, with a unified transformation matrix for projecting different modalities for individual recognition tasks. Zhang et al. [59] proposed a flash model and a cut model using a local window-based method to detect false transitions. Zhang et al. [58] proposed a shot boundary detection technique based on block-wise principal component analysis that divides the video into several segments. Shot eigenspaces are established on the training segments, and the candidate segments are projected onto the corresponding shot eigenspace to extract feature vectors; analysis and pattern matching then identify abrupt and gradual shot transitions in the video. Cirne et al. [11] proposed a video summarization method using color co-occurrence matrices as frame representations, with feature extraction performed at multiple scales; normalized sums of squared differences between frames are computed to detect shots. Abdulhussain et al. [2] proposed an Orthogonal Polynomial (OP) algorithm for the detection of hard transitions; the OP domain is computed using Krawtchouk-Tchebichef polynomials, and shots are identified using Support Vector Machines.

In recent years, many researchers have focused on soft computing techniques to handle uncertainties in images for SBD. Jadon et al. [20] fuzzified frame-to-frame property difference values using a Rayleigh distribution and framed fuzzy rules for the detection of abrupt and gradual changes. Lee et al. [26] used an ART2 neural network for video scene change detection. Küçüktunç et al. [24] presented a color histogram-based shot boundary detection algorithm that detects both cuts and gradual transitions with a fuzzy linking method in the L*a*b* color space. Dadashi and Kanan [12] suggested a fuzzy rule-based cut detection approach in which a set of fuzzy rules is evaluated. Thounaojam et al. [50] used normalized RGB color histogram differences between consecutive frames for shot detection, with a fuzzy logic system optimized by a Genetic Algorithm to find the optimal ranges of the fuzzy membership functions. Hassanien et al. [17] presented an SBD technique based on spatio-temporal Convolutional Neural Networks (CNN), applying deep learning to a large SBD dataset of 3.5 million frames of sharp and gradual transitions, with image compositing models used to generate the transitions. Rashmi and Nagendraswamy [41] applied the correlation coefficient between consecutive fuzzified frames using fuzzy set and IFS techniques for abrupt shot detection. Artificial neural networks (ANN) represent an important paradigm in the soft computing domain. Gygli et al. [15] developed a ridiculously fast shot boundary detection method with fully convolutional neural networks, in which a large temporal context is used by the CNN for detection at unprecedented speed.

3 PROPOSED METHODOLOGY

The framework of the proposed shot boundary detection methodology is presented in Fig. 3. The main challenge is to develop a simple approach that eliminates false shot detections caused by illumination, camera operation, object motion and noise. Therefore, a combination of local and global features, the MCSH histograms, is used to address these problems. The proposed model detects both abrupt and gradual transitions in videos using MCSH histograms.

Fig. 3
figure 3

Framework of the Proposed Methodology for Shot Boundary Detection

3.1 Feature Extraction and Representation

Extraction of meaningful features and efficient representation of video frames play a very important role in the effective detection of shots in videos. The following subsections present a detailed description of the proposed feature extraction and representation of video frames.

3.1.1 Sobel Gradient Frames

Initially, all RGB frames of the video are converted to grayscale. A wide variety of edge detection techniques exist to measure intensity changes [10, 37, 45], and the Sobel edge detector has been found to outperform other edge detectors in terms of accuracy and computational efficiency [1]. In this work, the Sobel detector [45] is used to convolve the grayscale frame pixels with the respective convolution masks to obtain the gradient frame. For every pixel in the grayscale frame, the horizontal and vertical components of the gradient are obtained by convolution with the two 3 × 3 masks shown in Fig. 4:

Fig. 4
figure 4

Sobel 3 × 3 convolution masks

The magnitude of the gradient measures the rate of change of intensity at pixel location (x, y) and is computed as:

$$ {G}_{\left(x,y\right)}=\sqrt{G_x^2+{G}_y^2} $$
(1)
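As a concrete illustration of the masks in Fig. 4 and Eq. (1), the gradient frame can be computed as sketched below. This is not the authors' implementation; it is a minimal NumPy version in which processing only the interior pixels (no border padding) is our own assumption.

```python
import numpy as np

# Sobel 3x3 masks for the horizontal (Gx) and vertical (Gy) components
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
KY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def sobel_gradient(gray):
    """Return the Sobel gradient-magnitude frame of a grayscale frame (Eq. 1)."""
    h, w = gray.shape
    g = np.zeros((h, w), dtype=float)
    for i in range(1, h - 1):              # interior pixels only (assumption)
        for j in range(1, w - 1):
            patch = gray[i - 1:i + 2, j - 1:j + 2]
            gx = np.sum(KX * patch)        # horizontal component
            gy = np.sum(KY * patch)        # vertical component
            g[i, j] = np.hypot(gx, gy)     # sqrt(gx^2 + gy^2), Eq. (1)
    return g
```

`np.hypot` computes the magnitude of Eq. (1) in a numerically stable way.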

3.1.2 Fuzzification of Sobel Gradient Frames

In order to capture the vagueness and uncertainty present in the data, crisp data must be converted to fuzzy data through fuzzification. Membership functions (MFs) are used to carry out fuzzification. An MF is a curve that defines how each point in the input space is mapped to a membership value between 0 and 1 [57]. Different MFs can be used to fuzzify the data; the choice must be determined empirically by studying the functions for a specific application. The most commonly used MFs in the literature are the trapezoidal, triangular and Gaussian functions, as they produce good results. The responsibility of choosing the shape of the MF lies with the user and the application: a triangular/trapezoidal MF is used if the system needs significant dynamic variation within a short period of time, and a Gaussian MF is used if high control accuracy is required [31].

In the proposed approach, the Sobel gradient frame is subjected to fuzzification using triangular MF and the parameters are formulated as follows:

$$ {\mu}_A\left({G}_{ij}\right)=\left\{\begin{array}{l}\frac{G_{ij}-a}{b-a}\kern1.46em if\kern0.5em a\le {G}_{ij}\le b\\ {}\frac{c-{G}_{ij}}{c-b}\kern1.2em if\kern0.5em b\le {G}_{ij}\le c\\ {}0\kern3.559999em otherwise\end{array}\right. $$
(2)

where Gij represents the gradient frame. The parameters a, b and c specify the boundaries, with a < b < c, and determine the x-coordinates of the boundaries of the fuzzy triangular MF. In the proposed approach, the left and right boundary values are set to the minimum and maximum pixel values of the respective gradient frame, and the core value is set to the midrange value. A numerical illustration of the fuzzified Sobel gradient pixel values is depicted in Fig. 5.

Fig. 5
figure 5

Illustration of obtaining Fuzzified Sobel Gradient Values
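The fuzzification of Eq. (2), with the frame-specific parameters described above (a = minimum, c = maximum, b = midrange of the gradient frame), can be sketched as follows. This is an illustrative version, not the authors' code.

```python
import numpy as np

def fuzzify_gradient(grad):
    """Triangular-MF fuzzification of a Sobel gradient frame (Eq. 2).
    a and c are the frame's min/max gradient values; b is the midrange."""
    a, c = float(grad.min()), float(grad.max())
    b = (a + c) / 2.0                          # midrange core value
    mu = np.zeros_like(grad, dtype=float)      # "otherwise" branch of Eq. (2)
    rising = (grad >= a) & (grad <= b)
    falling = (grad > b) & (grad <= c)
    if b > a:
        mu[rising] = (grad[rising] - a) / (b - a)   # (G - a) / (b - a)
    if c > b:
        mu[falling] = (c - grad[falling]) / (c - b) # (c - G) / (c - b)
    return mu
```

The guards on `b > a` and `c > b` simply avoid division by zero on a constant frame, where the triangle degenerates.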

3.1.3 Block Computation and Histogram Construction

The SBD process in the proposed work is based on establishing MCSH histograms for every frame of a video. The process of extracting the MCSH histogram of a sample frame (#881 of the anni006 video sequence) is illustrated in Fig. 6. Initially, the video frame is transformed from grayscale to fuzzified gradient form, as discussed in Sections 3.1.1 and 3.1.2.

Fig. 6
figure 6

Illustration of Block Cumulative Sum and Histogram construction on a sample frame

A 3 × 3 block at each pixel position of the fuzzified frame is considered for the evaluation of the cumulative sum in overlapping mode. The block slides over the frame at every pixel position from left to right and top to bottom. For each block, the cumulative sum is evaluated as follows:

$$ CS(i)={\sum}_{k=1}^iB(k) $$
(3)

where B(k) denotes the kth of the nine pixel values of a 3 × 3 block and CS(i) is the corresponding cumulative sum, with the blocks moving in overlapping mode. Further, the mean of the cumulative sum values for each block is computed as follows:

$$ \mu =\frac{\sum \limits_{i=1}^n CS(i)}{n} $$
(4)

where μ is the mean of the cumulative sum values of each 3 × 3 overlapping block and n = 9. The mean value evaluated for each 3 × 3 block at each pixel position is used to construct a histogram for each fuzzified frame, which is represented as a feature vector, as illustrated in Fig. 6. The MCSH histograms of two different frames (#881 and #1236) of the anni006 video sequence are depicted in Fig. 7 and exhibit distinct bin values.

Fig. 7
figure 7

Representation of MCSH Histogram for two different frames of anni006 video sequence
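The block cumulative sum of Eq. (3), its mean in Eq. (4), and the resulting MCSH histogram can be sketched as below. The bin count and bin range are our illustrative assumptions; the paper does not fix them here.

```python
import numpy as np

def mcsh_histogram(fuzzy, bins=64):
    """MCSH histogram of a fuzzified gradient frame (Eqs. 3-4).

    For every overlapping 3x3 block, the cumulative sums CS(1)..CS(9) of
    its nine pixels are taken and averaged; the per-block means are then
    binned into a histogram. The bin count (64) and the bin range are
    illustrative choices, not specified by the paper.
    """
    h, w = fuzzy.shape
    means = []
    for i in range(h - 2):
        for j in range(w - 2):
            block = fuzzy[i:i + 3, j:j + 3].ravel()  # B(1)..B(9)
            cs = np.cumsum(block)                     # CS(i), Eq. (3)
            means.append(cs.mean())                   # Eq. (4), n = 9
    # For membership values in [0, 1], the block mean lies in [0, 5].
    hist, _ = np.histogram(means, bins=bins, range=(0.0, 5.0))
    return hist
```

The histogram serves as the frame's global feature vector, while each block mean remains a local, spatially discriminative quantity.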

3.2 Shot Boundary Detection

Shot boundary detection is carried out on video sequences from the TRECVID and VideoSeg datasets. The datasets are challenging, since they include a large variety of shot breaks. Both abrupt and gradual shot transitions are detected with the aid of the RSD measure applied to each MCSH histogram. A mechanism for detecting and eliminating unwanted frames is also proposed. The following subsections present a detailed description of the proposed shot detection process.

3.2.1 Detection and Elimination of Unwanted Frames

The TRECVID dataset contains non-transition frames in addition to abrupt and gradual transitions. The frames in the non-transition group are unwanted frames caused by flash/light variations or object or camera motion. Even blank/black frames are considered unwanted, as they act as abrupt transitions [49]. In order to reduce false detections, it is necessary to eliminate unwanted frames prior to abrupt and gradual transition detection. This is accomplished by applying the RSD statistical measure to the MCSH histogram of every frame of a video.

Let Pj, j = 1, 2, ..., n, denote the MCSH histogram values of a frame. Then RSDi, the coefficient of variation corresponding to the ith MCSH histogram, is computed as follows:

$$ {RSD}_i=\frac{\sigma }{\mu } $$
(5)

where \( \mu =\frac{\sum_{j=1}^n P_j}{n} \) and \( \sigma =\sqrt{\frac{\sum_{j=1}^n{\left({P}_j-\mu \right)}^2}{n}} \)
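Eq. (5) can be sketched as follows; note that the population standard deviation (divisor n) is used, matching the σ defined above.

```python
import numpy as np

def rsd(hist):
    """Relative standard deviation (coefficient of variation, Eq. 5)
    of one MCSH histogram's bin values P_1..P_n."""
    p = np.asarray(hist, dtype=float)
    mu = p.mean()                 # mean of the bin values
    sigma = p.std()               # population std (divisor n), as in Eq. 5
    return sigma / mu
```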

A threshold is set empirically for each video to separate transition and non-transition frames, using the RSD values evaluated for all MCSH histograms of the video:

$$ {T}_{NT}=\mu +\alpha \sigma $$
(6)

where TNT is the threshold, α is a constant, and μ and σ are the mean and standard deviation of all computed RSD values of the entire video. The constant α is chosen by observing the RSD graph; a sample illustration for the BG_37309 video is depicted in Fig. 8. During experimentation, it was found that frames whose RSD values lie above the threshold TNT are unwanted frames, and they are excluded from the shot detection process. This segregation mechanism reduces false detections.

Fig. 8
figure 8

Illustration of RSD showing non transition frames for BG_37309 video
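The segregation rule of Eq. (6) can be sketched as below. Since α is tuned per video by inspecting the RSD graph, the default used here is purely illustrative.

```python
import numpy as np

def unwanted_frame_mask(rsds, alpha=2.0):
    """Flag frames whose RSD exceeds T_NT = mu + alpha * sigma (Eq. 6).
    mu and sigma are taken over the RSD values of the whole video;
    alpha = 2.0 is only an illustrative default, not the paper's value."""
    rsds = np.asarray(rsds, dtype=float)
    t_nt = rsds.mean() + alpha * rsds.std()
    return rsds > t_nt            # True = unwanted (non-transition) frame
```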

Note that blank frames are removed only during the abrupt shot detection process. During gradual shot detection, blank frames are retained in the sequence, as they are an integral part of the fade-in and fade-out editing effects. The RSD values computed above for the corresponding videos are further utilized by the abrupt and gradual detection algorithms detailed in the following subsections.

3.2.2 Abrupt Shot Transition Detection

The significant difference between frames depends upon the salient content within the frames. Ford et al. [14] found that histogram metrics yield the best results when computed on blocks in the case of abrupt transitions. In the proposed work, the difference between the RSD values computed for each MCSH histogram, as formulated in Eq. 5 of Section 3.2.1, is used to detect abrupt shots. Let RSDk and RSDk+1 be the RSD values of two consecutive MCSH histograms of a video. The comparison between RSDk and RSDk+1, denoted by the distance DRSD, is computed as the difference:

$$ {D}_{RSD}={RSD}_k-{RSD}_{k+1} $$
(7)

This difference is computed for all consecutive frame pairs of the entire video. The distance values thus evaluated are depicted in Fig. 9 for the D6 video of the TRECVID 2001 dataset.

Fig. 9
figure 9

Distribution of difference between RSD values of D6 (nad58) video

It can be clearly observed from Fig. 9 that the RSD distance between two consecutive frames belonging to the same shot produces low peaks, while camera breaks produce prominent peaks. Camera breaks are thus signified by peak variations in the RSD distance, and a threshold mechanism is devised for identifying prominent peaks. Let μ be the mean and σ the standard deviation of the DRSD values, let α be a constant, and compute the threshold TAT as follows:

$$ {T}_{AT}=\mu +\alpha \sigma $$
(8)

Distance values above the threshold TAT are considered prominent peaks indicating a camera break.
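Eqs. (7)-(8) can be sketched as follows. Eq. (7) is a signed difference; taking its absolute value so that peaks in either direction are detected is our assumption, as is the default α.

```python
import numpy as np

def detect_abrupt(rsds, alpha=3.0):
    """Abrupt cuts from consecutive-frame RSD differences (Eqs. 7-8).
    Returns indices k where the RSD distance rises above
    T_AT = mu + alpha * sigma. Using |D_RSD| rather than the signed
    difference of Eq. (7), and alpha = 3.0, are illustrative choices."""
    rsds = np.asarray(rsds, dtype=float)
    d = np.abs(rsds[:-1] - rsds[1:])       # D_RSD for frame pairs (k, k+1)
    t_at = d.mean() + alpha * d.std()      # Eq. (8)
    return np.flatnonzero(d > t_at)        # prominent peaks = camera breaks
```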

3.2.3 Gradual Shot Transition Detection

Determining gradual transitions is more complex than abrupt transitions due to camera and object motion [8]. The complexity arises because the varying frame features are spread across a number of frames. Certain statistical parameters aid in revealing distinct patterns for gradual transitions [8]. It has been observed that fade-in, fade-out and dissolve transitions can each be characterized by a specific pattern.

A fade-in transition is a superimposed combination of blank frames and the initial frames of a shot: the contribution of the blank frames decreases while the frames of the appearing shot become prominent. A fade-out is the reverse of a fade-in: the visual content of the current shot loses its intensity and gradually turns into a black frame. A dissolve transition lasts for a few frames, during which a shot overlaps with the succeeding shot: the intensity of the current shot decreases gradually while the intensity of the appearing shot increases linearly, making the dissolve effectively a combination of fade-out and fade-in. The feature value of the last frame of a fade-out and of the first frame of a fade-in is close to zero. Usually, a dissolve transition is a combination of fade-in and fade-out without the occurrence of blank frames, as depicted in Fig. 10.

Fig. 10
figure 10

Illustration of (a) Fade-in (b) Fade-out and (c) Dissolve transitions

Before applying the gradual transition algorithm, the abrupt shots and non-transition frames identified using the threshold mechanism described in Section 3.2.1 are excluded from the sequence of frames. In order to choose the frames for the gradual detection process, a threshold is set using the RSD values of the MCSH histograms as follows:

$$ {T}_{GT}=\mu +\sigma $$
(9)

where μ is the mean and σ the standard deviation of the RSD values computed over the entire video. Frames whose RSD values fall between TGT and TNT are considered by the gradual transition detection algorithm, as shown in Fig. 11. This criterion helps curtail false detections and improves the efficiency of the algorithm.

Fig. 11
figure 11

Criteria set for gradual transition detection

In order to identify overlapping information across a group of frames, an appropriate technique is needed to identify and recognize the patterns. In the proposed approach, the RSD measure applied to each MCSH histogram of the corresponding group is utilized by the gradual shot detection algorithm. In the subsequent step, the mean of the RSD values of every frame of the corresponding group is computed as follows:

$$ {M}_{RSD}=\frac{1}{n}\sum \limits_{i=1}^n {RSD}_i $$
(10)

where MRSD is the mean of all RSD values in the frame group. Further, the squared difference between the RSD value of each frame and MRSD is computed, and the frame feature is obtained as the relative difference, formulated in the following equation:

$$ {F}_i=\frac{{\left({RSD}_i-{M}_{RSD}\right)}^2}{M_{RSD}} $$
(11)

where Fi is the feature value computed for each frame in the sequence. During experimentation, feature values are computed for groups of frames, with the group size chosen empirically at each instance. Since gradual transitions occur over multiple sequences of frames, it is essential to observe patterns over multiple frames. After plotting the Fi values for each frame in the group, the increasing or decreasing pattern is examined to identify the various types of gradual transition (dissolve, fade-in and fade-out; wipe transitions are excluded). Based on the plotted Fi values, the pattern can be characterized as a fade-out, fade-in or dissolve transition. This step is repeated for all remaining groups of frames in the sequence. The gradual transition detection results in Fig. 12 show an increasing pattern representing fade-in (a), a decreasing pattern representing fade-out (b), and a dissolve pattern (c).

Fig. 12
figure 12

Illustration of (a) Fade-in (b) Fade-out and (c) Dissolve patterns
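Eqs. (10)-(11) for one candidate frame group can be sketched as below; the monotonicity helper is our own illustrative addition, not part of the paper.

```python
import numpy as np

def gradual_features(group_rsds):
    """Frame features F_i for one candidate frame group (Eqs. 10-11):
    the squared deviation of each frame's RSD from the group mean M_RSD,
    relative to M_RSD. An increasing/decreasing F_i pattern over the
    group suggests a fade-in/fade-out; a fall-then-rise suggests a
    dissolve."""
    r = np.asarray(group_rsds, dtype=float)
    m = r.mean()                   # M_RSD, Eq. (10)
    return (r - m) ** 2 / m        # F_i, Eq. (11)

def is_monotonic_increasing(f):
    """Illustrative helper (ours, not the paper's) for a fade-in check."""
    return bool(np.all(np.diff(f) > 0))
```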

4 EXPERIMENTAL RESULTS AND DISCUSSION

Dataset

The experimental analysis of the proposed method has been performed on the TRECVID and VideoSeg benchmark datasets, assessed with common metrics and compared with the baselines. The benchmark datasets are described in the following tables, along with the ground-truth information on the camera effects in every video. The potential of the proposed method is analyzed using video sequences taken from the US National Institute of Standards and Technology (NIST) TRECVID 2001 and 2007 datasets. The TRECVID 2001 dataset, described in Table 1, can be downloaded from the Open Video Project, whereas the TRECVID 2007 data, described in Table 2, has to be obtained from the Netherlands Institute for Sound and Vision. In addition, the VideoSeg benchmark dataset [53], containing 10 videos of varied quality and resolution, is used for the experimental analysis and summarized in Table 3.

Table 1 Description of TRECVID 2001 dataset
Table 2 Description of TRECVID 2007 dataset
Table 3 Description of VIDEOSEG dataset

The datasets considered for experimentation are of varied length and genre and include challenging scenarios comprising video editing effects along with camera/object motion and illumination variation. The presence of camera/object motion and sudden light variation causes ambiguous shot boundaries.

Discussion

The performance of the proposed method is evaluated using the quantitative metrics Recall, Precision and F1-score, which are formulated as follows:

$$ Recall=\frac{N_C}{N_C+{N}_M} $$
(12)
$$ Precision=\frac{N_C}{N_C+{N}_F} $$
(13)
$$ F 1\hbox{-} score=\frac{2\ast Recall\ast Precision}{Recall+ Precision} $$
(14)

where NC is the number of correct detections, NM the number of missed detections and NF the number of false detections. F1-score is the harmonic mean of recall and precision and therefore reflects both rates; the algorithm with the highest F1-score is regarded as the most efficient. Experiments were carried out using MATLAB on an Intel Core i5 processor running at 2.20 GHz with 8 GB RAM. The complexity of the proposed method depends on the resolution of the video frames in the dataset; in general, its time complexity is of the order of Θ(n²). The feature extraction time per frame was recorded as 0.84 ms for TRECVID 2001, 1.32 ms for TRECVID 2007 and 4.2 ms for the VideoSeg dataset.
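Equations 12 to 14 translate directly into code; a minimal sketch computing the three metrics from the detection counts:

```python
def sbd_metrics(n_correct, n_missed, n_false):
    """Recall, precision and F1-score (Eqs. 12-14) from the counts of
    correct, missed and false shot-boundary detections."""
    recall = n_correct / (n_correct + n_missed)
    precision = n_correct / (n_correct + n_false)
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1
```

For example, 8 correct detections with 2 missed and 2 false boundaries give recall, precision and F1-score of 0.8 each.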

The efficiency of the proposed system relies on specific threshold values set empirically for each category of transition. The discriminating strength of the MCSH histograms underpins the overall performance of the proposed approach, and the RSD statistical measure aids in the effective detection of abrupt and gradual transitions. Unwanted frames are removed by setting the threshold TNT as described in section 3.2.1. The constant α is chosen in the range 0.1 to 1 to reduce false detections.

4.1 Results on Abrupt Transition Detection

During abrupt transition detection, an appropriate threshold is set to identify dominant peaks based on observations of the RSD difference graph obtained for each video. The threshold TAT described in section 3.2.2 is varied experimentally to determine the F1-score. For the TRECVID 2001 dataset, different values of α are chosen in the range 0.1 to 3 by observing the RSD difference graph (example shown in Fig. 9) and the experimental results are recorded, whereas α is chosen in the range 0.1 to 1 for the TRECVID 2007 dataset and 0.1 to 2.5 for the VideoSeg dataset.
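The peak-picking step can be sketched as below. The exact form of TAT is defined in section 3.2.2 of the paper and is not reproduced here; this sketch assumes the common adaptive form TAT = μ + ασ over the RSD difference signal:

```python
import numpy as np

def detect_abrupt(rsd_diff, alpha):
    """Sketch of threshold-based abrupt-transition detection.
    Assumes TAT = mu + alpha * sigma of the RSD difference signal,
    which is an assumption about Sec. 3.2.2, not the paper's exact formula."""
    rsd_diff = np.asarray(rsd_diff, dtype=float)
    t_at = rsd_diff.mean() + alpha * rsd_diff.std()
    # frame indices whose RSD difference forms a dominant peak above TAT
    return [i for i, d in enumerate(rsd_diff) if d > t_at]
```

A single dominant peak in an otherwise flat RSD difference signal is flagged as a cut, while a flat signal yields no detections.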

The performance of the proposed method relative to other state-of-the-art approaches is reported in Tables 4, 6 and 8 for the TRECVID 2001, TRECVID 2007 and VideoSeg datasets respectively. An additional comparison has been made with the recent state-of-the-art algorithm proposed by Sasithradevi et al. [42] for some of the video sequences of the TRECVID 2001 and TRECVID 2007 datasets; the results are recorded in Tables 5 and 7 respectively.

Table 4 Performance comparison for abrupt shot transition with Thounaojam et al., (2016, 2017) on TRECVID 2001 dataset
Table 5 Performance comparison for abrupt shot transition with Sasithradevi et al., (2020) on TRECVID 2001 dataset

The results in Tables 4, 5, 6, 7 and 8 signify the improved efficiency of the proposed algorithm over the state-of-the-art methods, and the performance is depicted graphically in Figs. 13(a) to 13(e). The comparative analysis in these graphs shows that the proposed method outperforms the other SBD approaches. This is largely because the proposed method segments the video using a combination of local and global frame features. The improvement stems from the method's ability to keep the number of missed and false transitions small.

Table 6 Performance comparison for abrupt shot transition with Thounaojam et al., (2016) on TRECVID 2007 dataset
Table 7 Performance comparison for abrupt shot transition with Sasithradevi et al.,(2020) on TRECVID 2007 dataset
Table 8 Performance comparison for abrupt shot transition on VideoSeg dataset
Fig. 13

Comparative results of abrupt shot transition detection

4.2 Results on Gradual Transition Detection

The criteria for gradual transition detection are established using two locally adaptive threshold values based on observations of the RSD difference graph (example shown in Fig. 11). The threshold TNT has already been discussed in the previous section; the threshold TGT is set from the mean and standard deviation of the RSD values computed over the MCSH histograms of the entire video. Frames falling in the range between TNT and TGT are passed to the gradual transition detection mechanism. Forming this sequence of frames while excluding non-transition frames reduces both the processing time and the false transitions during the detection phase.
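The candidate-frame selection can be sketched as follows. The paper sets TNT and TGT from the mean and standard deviation of the RSD values; the specific α/β weighting used here is an assumption for illustration, not the paper's exact formula:

```python
import numpy as np

def gradual_candidates(rsd, alpha=0.5, beta=1.0):
    """Sketch of candidate-frame selection for gradual transitions.
    TNT and TGT are derived from the mean/std of the per-frame RSD values;
    the alpha/beta weights below are illustrative assumptions."""
    rsd = np.asarray(rsd, dtype=float)
    mu, sigma = rsd.mean(), rsd.std()
    t_nt = mu + alpha * sigma  # lower bound: excludes non-transition frames
    t_gt = mu + beta * sigma   # upper bound: abrupt cuts handled separately
    return [i for i, v in enumerate(rsd) if t_nt < v < t_gt]
```

Only frames with moderately elevated RSD values are retained: flat regions fall below TNT, while sharp peaks above TGT are left to the abrupt-transition detector.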

Since gradual transitions share a common behaviour, the proposed method generates patterns using the relative frame feature difference computed for groups of frames with Eq. 11 of section 3.2.3. Thounaojam et al. [49] observed that the length of a gradual transition ranges from 6 to 32 frames for TRECVID videos.

Building on this observation, the authors conducted an empirical study of the patterns by grouping the frames into groups of 5, 10, 20, 25 and 30, plotting the relative frame difference values for each group size. After thorough analysis, a group size of 30 frames yielded the expected behavioural pattern for ascertaining fade-in, fade-out and dissolve transitions (excluding wipe transitions). This empirical setup and analysis was performed to find the detection rates used to determine the F1-score.

A comparison of the results with state-of-the-art approaches on the benchmark datasets is detailed in Tables 9 and 11 for the TRECVID 2001 and TRECVID 2007 datasets respectively. An additional comparison has been performed with the recent state-of-the-art algorithms proposed by Sasithradevi et al. [42] for some of the video sequences of the TRECVID 2001 and 2007 datasets in Tables 10 and 12 respectively. After the experimental study and comprehensive analysis, the proposed method exhibits significant progress compared with state-of-the-art approaches, as depicted graphically in Figs. 14(a) to 14(d). The proposed method handles non-transition frames depicting camera/object motion, which contributes to the improvement in the achieved results.

Table 9 Performance comparison for gradual shot transition with Lu and Shi (2013) and Thounaojam et al., (2016, 2017) on TRECVID 2001 dataset
Table 10 Performance comparison for gradual shot transition with Sasithradevi et al.,(2020) on TRECVID 2001 dataset
Table 11 Performance comparison for gradual shot transition with Thounaojam et al., (2016) on TRECVID 2007 dataset
Table 12 Performance comparison for gradual shot transition with Sasithradevi et al.,(2020) on TRECVID 2007 dataset
Fig. 14

Comparative results of gradual shot transition detection

In summary, the empirical study on the datasets emphasizes that the proposed method performs consistently well in complex environments, preserving a good trade-off between recall and precision. During feature extraction, the spatial resolution is reduced by using blocks instead of pixels, making the method less sensitive to object and camera motion. However, the proposed method is threshold dependent and sensitive to complex camera and light variation, which affects the overall performance of the algorithm. The algorithm's efficiency is limited in identifying camera zooming and panning effects, and the method is computationally more expensive than global histogram techniques. False detections may occur when frames from two different shots have similar histograms due to similar color values.

5 CONCLUSION

In this work, a simple and effective method to detect shot boundaries in videos is proposed. The method exploits the concepts of fuzzy sets, the Sobel gradient, block-based MCSH histograms and the RSD statistical measure to address abrupt and gradual transition detection in videos. The algorithm applies the RSD measure to each MCSH histogram with a threshold mechanism to determine transitions. The experimental observations signify that the discriminating strength of the MCSH histograms produced good results on the benchmark datasets. Abrupt transitions are identified from the difference between the RSD measures of consecutive MCSH histograms. Patterns for gradual transitions are observed by plotting the relative difference of the RSD values obtained from the MCSH histograms in each group of frames. Experiments were performed on benchmark datasets, viz. the TRECVID 2001, TRECVID 2007 and VideoSeg datasets. The efficacy of the proposed method is on par with some of the state-of-the-art SBD methods. As future work, efforts will be made to reduce the computational complexity of the algorithm, and other feature descriptors will be explored to analyze the visual content of video frames. Advanced fuzzy logic can also be explored to better address the uncertainty prevalent in most video frames.