
1 Introduction

The development of multimedia technology has made plenty of high-performance, easy-to-operate video capturing devices available to the general public at affordable cost. As a result, a huge volume of video data is created every day: videos from different sources are uploaded, downloaded and viewed at a remarkable rate, escalating the amount of digital video information in cyberspace. This huge volume of video data is also easily accessible to the public, so there is a substantial need for efficient management of video information, from video indexing, retrieval and classification to summarisation [1]. The hierarchical levels of video structure, from bottom to top, are frames, shots, scenes and stories: shots combine to form a scene, scenes combine to form a story, and these in turn combine to produce a video. Shot change detection is the first and most crucial step towards this goal. It divides the video data into groups of semantically related frames sharing the same content. The boundary between two consecutive shots is generally either a hard cut or a gradual transition (GT). The hard cut, also called a cut transition, is observed to be the most predominant type of transition in videos [2].

2 Review of Literature

Many algorithms have been developed in the literature for automatic detection of the type and location of transitions. However, sudden illumination changes such as flashlights and stage effects, as well as high-speed object/camera motion, can be misinterpreted as shot changes, leading to falsely detected shot boundaries. Sudden illumination changes and high-motion scenes are common in fantasy and thriller movies, news reports and sports videos, where they are intentionally included to make the content attractive to the audience. Eliminating the influence of these effects on shot change detection is therefore a crucial step. To ensure this, a suitable feature needs to be extracted, followed by formulation of the similarity measure and shot boundary detection (SBD). Many contributions on video SBD can be found in the literature [3,4,5]. Features deployed for SBD include pixel intensity [6], histograms [7], edges, and SURF and SIFT features [8]. Intensity-based features are found to be highly sensitive to large motion and light variation. Histogram-based algorithms handle motion and illumination variation comparatively better than intensity-based approaches. Besides these, various edge and gradient-based features can be found in the literature. Edge-based approaches are insensitive to small variations in light; however, large illumination variations tend to destroy the edge features, leading to false transitions. To improve efficiency, some researchers combined multiple features [9,10,11] for transition detection. Many authors proposed shot boundary techniques based on new feature spaces: LBPHF [12], LBP [13], CSLBP [14] and LDP [15] have been deployed for shot boundary detection and are efficient in handling videos with sudden changes in lighting conditions. As an improvement over LBP-based methods, Chakraborty et al. [16] proposed an LTP-based approach in the Lab colour space for SBD under high object/camera motion, which is less sensitive to noise and illumination. In the current work, the illumination insensitivity of phase congruency (PC) is exploited to develop a new feature similarity (FS) measure, PCFS. The proposed PCFS outperforms histogram- and LBP-based feature similarity measures. The rest of this paper is arranged as follows: Sect. 3 introduces image feature extraction using phase congruency, its significance for the current problem and the development of the similarity measure for transition detection; Sect. 4 presents the proposed PC-based cut detection; Sect. 5 presents the simulation results and discussions supporting the effectiveness of the extracted feature and algorithm, followed by conclusions in Sect. 6.

3 Phase Congruency

Phase congruency is derived from the phase information of a signal in its frequency domain representation. The literature establishes that phase is a crucial parameter for the perception of visual features [17]. Further, evidence given by Morrone [18] indicates that the human visual system responds strongly to image locations with highly ordered phase information. Thus, when sketching a scene, a human tends to trace precisely the edges and interest points, which are the points of highest phase order in a frequency-based image representation. The locations where the Fourier components are maximally in phase correspond to visually distinctive image features such as edges, lines and Mach bands; hence, highly informative features can be captured at points with high PC values. The PC model thus defines features as points in an image with high phase order and uniquely characterizes the image luminance function. PC is a dimensionless quantity with values between 0 and 1, and it is invariant to the scale, illumination and contrast of an image [19, 20]. It allows a wide range of feature types, such as edges, corners, structures and contours of objects, to be detected within a single framework. Gradient-based edge detection operators are sensitive to illumination variations and lack accurate, consistent localization, shortcomings that PC-based features overcome. Moreover, PC mimics the response of human visual perception to contours, and Yu et al. [21] showed that it is well capable of distinguishing the structural information content of a scene. The points where the Fourier waves at different frequencies have congruent phases capture the visually differentiable features; that is, at points of high phase congruency (PC), highly informative features can be extracted.
PC is computed from the frequency response of the log Gabor filter, whose transfer function is given by

$$\begin{aligned} LG(w,\theta )=e^{\frac{-(\log (\frac{w}{w_0}))^2}{2(\log (\frac{k}{w_0}))^2}}e^{\frac{-(\theta -\theta _0)^2}{2\sigma _{\theta }^2}} \end{aligned}$$
(1)

where \(w_0\) is the centre frequency of the filter, \(\theta _0\) the orientation of the filter, \(k/w_0\) a ratio that is kept constant to fix the radial bandwidth, and \(\sigma _{\theta }\) the angular spread of the filter.
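As an illustration, the transfer function in (1) can be sampled on a discrete frequency grid. The NumPy sketch below builds the filter with the DC response forced to zero; the parameter names (`k_ratio` for \(k/w_0\), `sigma_theta` for \(\sigma _{\theta }\)) and their default values are our illustrative choices, not values fixed by the paper.

```python
import numpy as np

def log_gabor(shape, w0, theta0, k_ratio=0.55, sigma_theta=np.pi / 8):
    """Frequency-domain log Gabor transfer function LG(w, theta) of Eq. (1).

    w0: centre frequency, theta0: filter orientation,
    k_ratio = k / w0 controls the radial bandwidth (illustrative default).
    """
    rows, cols = shape
    # Normalised frequency coordinates with DC at the centre of the grid.
    u = np.fft.fftshift(np.fft.fftfreq(cols))
    v = np.fft.fftshift(np.fft.fftfreq(rows))
    U, V = np.meshgrid(u, v)
    w = np.hypot(U, V)
    w[rows // 2, cols // 2] = 1.0            # avoid log(0) at DC
    theta = np.arctan2(-V, U)

    # Radial part: Gaussian on a log frequency axis (zero DC by construction).
    radial = np.exp(-(np.log(w / w0)) ** 2 / (2 * np.log(k_ratio) ** 2))
    radial[rows // 2, cols // 2] = 0.0       # enforce zero response at DC

    # Angular part: Gaussian in orientation, with the difference wrapped
    # to [-pi, pi] so the filter behaves correctly across the wrap-around.
    dtheta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
    angular = np.exp(-dtheta ** 2 / (2 * sigma_theta ** 2))
    return radial * angular
```

A bank of such filters is obtained by varying `w0` over scales and `theta0` over orientations.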

For an image f(x, y), with \(M_{so}^{ev}\) and \(M_{so}^{od}\) as the even-symmetric and odd-symmetric components of the log Gabor filter at scale \(\mathbf {s}\) and orientation \(\mathbf {o}\), the responses of the quadrature pair of filters are given by \(ev_{so}(x,y)\) and \(od_{so}(x,y)\), respectively, in (2)

$$\begin{aligned}{}[ev_{so}(x,y),od_{so}(x,y)]=[f(x,y)*M_{so}^{ev},f(x,y)*M_{so}^{od}] \end{aligned}$$
(2)

The amplitude at scale \(\mathbf {s}\) and orientation \(\mathbf {o}\) is given by (3)

$$\begin{aligned} A_{so}=\sqrt{ev^2_{so}(x,y)+od^2_{so}(x,y)} \end{aligned}$$
(3)

Hence, the phase congruency representation of an image f(x, y) in its simplest form, without considering the weight and noise components, is given by (4)

$$\begin{aligned} PC(x,y)=\frac{\sum _{o}\sqrt{(\sum _{s}ev_{so}(x,y))^2+(\sum _{s}od_{so}(x,y))^2}}{\epsilon +\sum _{o}\sum _{s}A_{so}(x,y)} \end{aligned}$$
(4)

\(\mathbf {\epsilon }\) is a small constant that avoids division by zero in the denominator of (4). The phase congruency representation is a frequency-based model of visual information: it supposes that, instead of processing visual data spatially, the visual system can perform equivalent processing via the phase and amplitude of the individual frequency components of a signal. In the evaluation of PC, frequency domain processing is achieved through the Fourier transform. Kovesi [22] showed that corners and edges are well detected using PC; however, the problem of discriminating abrupt shot transitions in videos has not previously been addressed using phase congruency features. Kovesi formulated PC via a log Gabor filter function, which, in contrast to the Gabor function, maintains zero DC for arbitrarily large bandwidths. Moreover, the log Gabor filter is characterized by an extended tail in the higher-frequency region, preserving the high-frequency details of the image [23].

To show the features captured by PC, we consider a hypothetical image with prominent edges and corners, given in Fig. 1a. The corresponding edge strength, corner strength and complete PC map are illustrated in Fig. 1b, c and d, respectively: edges and corners are well captured through PC. To illustrate the illumination insensitivity of the PC feature map, we consider two consecutive frames, the 165th and 166th frames of the video “Little Miss Sunshine”, of which the 166th frame is exposed to a flashlight. The original frames and the corresponding PC feature frames are given in Fig. 2a and b, respectively. It is clearly visible that the PC feature frames of frames 165 and 166 are very similar and little affected by the illumination variation caused by the flashlight; PC is hence suitable for developing an illumination-insensitive similarity measure for abrupt transition detection. Motivated by these characteristics, this paper introduces PC-based feature extraction for representing the frame content of a video and for illumination-insensitive shot boundary detection.
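Equations (2)–(4) can be combined into a short computational sketch. The version below is a minimal, self-contained NumPy implementation of the simplest-form PC of (4), with the log Gabor bank of (1) built directly in the frequency domain; the even and odd responses of (2) are the real and imaginary parts of the filtered image. The scale/orientation counts and bandwidth constants are illustrative assumptions, and no noise compensation or weighting is applied.

```python
import numpy as np

def phase_congruency(img, n_scales=4, n_orients=4, eps=1e-4):
    """Simplest-form PC map of Eq. (4), without weighting or noise terms."""
    rows, cols = img.shape
    # Frequency grid with DC at the centre.
    u = np.fft.fftshift(np.fft.fftfreq(cols))
    v = np.fft.fftshift(np.fft.fftfreq(rows))
    U, V = np.meshgrid(u, v)
    w = np.hypot(U, V)
    w[rows // 2, cols // 2] = 1.0               # avoid log(0) at DC
    theta = np.arctan2(-V, U)

    F = np.fft.fftshift(np.fft.fft2(img))
    sum_amp = np.zeros((rows, cols))            # sum over s, o of A_so, Eq. (3)
    energy = np.zeros((rows, cols))             # numerator of Eq. (4)
    for o in range(n_orients):
        theta0 = o * np.pi / n_orients
        dtheta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
        angular = np.exp(-dtheta ** 2 / (2 * (np.pi / 8) ** 2))
        sum_ev = np.zeros((rows, cols))
        sum_od = np.zeros((rows, cols))
        for s in range(n_scales):
            w0 = 0.25 / (2.0 ** s)              # halve centre frequency per scale
            radial = np.exp(-np.log(w / w0) ** 2 / (2 * np.log(0.55) ** 2))
            radial[rows // 2, cols // 2] = 0.0  # zero DC
            resp = np.fft.ifft2(np.fft.ifftshift(F * radial * angular))
            ev, od = resp.real, resp.imag       # quadrature pair of Eq. (2)
            sum_ev += ev
            sum_od += od
            sum_amp += np.hypot(ev, od)         # A_so of Eq. (3)
        energy += np.hypot(sum_ev, sum_od)      # per-orientation local energy
    return energy / (eps + sum_amp)             # Eq. (4)
```

By the triangle inequality the numerator never exceeds the amplitude sum, so the returned map stays in [0, 1], matching the dimensionless, bounded nature of PC described above.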

Fig. 1

a Hypothetical image having corners and edges, b corresponding edge strength image, c corresponding corner strength image, d corresponding PC map of the image

Fig. 2

a 165th and 166th frame of video “LM”, b PC feature image of corresponding frames

4 Proposed PC-Based CUT Detection

This section explains the proposed PC-based abrupt transition detection algorithm. The PC feature-based similarity (PCFS) between consecutive PC feature frames is given by (5).

$$\begin{aligned} PCFS(t,t+1)=\sum _{k=1}^{\text {Row}}\sum _{l=1}^{Col}\left| PC_t(k,l)-PC_{t+1}(k,l)\right| \end{aligned}$$
(5)

\(PCFS(t,t+1)\) is the PC-based feature similarity measure between the tth and \((t+1)\)th frames; \(PC_t\) and \(PC_{t+1}\) are the PC feature frames of the tth and \((t+1)\)th frames, respectively; Row and Col are the numbers of rows and columns in the frame. For abrupt transition (AT) detection, the PC-based similarity is compared with the threshold Th given by Zhang et al. [7].

$$\begin{aligned} \text {Th}=\mu _s+\beta \times \sigma _{s} \end{aligned}$$
(6)

where \(\beta \) is a constant whose value lies between 4 and 8, and \(\mu _s\) and \( \sigma _{s} \) are the mean and standard deviation of the PC feature-based similarity values.
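A minimal sketch of the resulting detection rule, combining the PCFS of (5) with the adaptive threshold of (6). The choice \(\beta = 5\) is one illustrative value from the stated range, and the frames are assumed to be pre-computed PC feature maps.

```python
import numpy as np

def pcfs(pc_t, pc_t1):
    """Eq. (5): sum of absolute differences between consecutive PC
    feature frames. A large spike in this value suggests a content change."""
    return np.abs(pc_t - pc_t1).sum()

def detect_cuts(pc_frames, beta=5.0):
    """Eq. (6): flag frame pairs whose PCFS exceeds
    Th = mu_s + beta * sigma_s (beta typically in [4, 8])."""
    d = np.array([pcfs(a, b) for a, b in zip(pc_frames[:-1], pc_frames[1:])])
    th = d.mean() + beta * d.std()
    return np.flatnonzero(d > th)   # index t marks a cut between t and t+1
```

Note that because the threshold is computed from the whole sequence, the detector adapts automatically to the overall activity level of each video.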

Table 1 Ground truth transition details of different genres of test videos
Table 2 Performance comparison of histogram, LBP and proposed PCFS methods

5 Simulations and Result Discussions

For validation of the proposed PC-based similarity measure, we considered ten videos of different genres, such as English movies, sitcoms, soccer, cartoons and documentaries, consisting of 78,447 frames and 437 cuts in total, collected from [24] and the Internet. Detailed information on the test videos is given in Table 1. The proposed PCFS is compared with the histogram-based similarity approach (ASHD) [7] and the LBP-based similarity approach [14]. Recall (Rec), precision (Pr) and F1-measure are used to validate the different SBD methods. It is clearly observed from Table 2 that the proposed PCFS achieves the highest average Rec, Pr and F1-measures.
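For reference, the three evaluation measures can be computed from the detection counts as follows; this small helper is our own illustration of the standard definitions, not code from the paper.

```python
def sbd_metrics(n_detected, n_correct, n_ground_truth):
    """Recall, precision and F1-measure used to compare SBD methods.

    n_correct: detected cuts that match a ground-truth cut,
    n_detected: total cuts reported by the detector,
    n_ground_truth: total annotated cuts.
    """
    rec = n_correct / n_ground_truth          # fraction of true cuts found
    pr = n_correct / n_detected               # fraction of reports that are true
    f1 = 2 * pr * rec / (pr + rec)            # harmonic mean of Pr and Rec
    return rec, pr, f1
```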

6 Conclusions

In this article, an illumination-insensitive, phase congruency feature-based abrupt transition detection algorithm has been proposed. Besides illumination insensitivity, this feature is robust against contrast and scale changes as well. The performance of the proposed model is validated on publicly available videos and the standard benchmark TRECVid data set. The limitations of using PC are its computational complexity, the many parameters that must be set to suit the application, and its sensitivity to image noise. Noise reduction prior to PC evaluation may further improve the results. In the future, the PC-based feature can be integrated with other features for efficient detection of gradual transitions.