
1 Introduction

In this paper, we propose a feature-based approach for video summarization. Video summarization aims to generate a short video that retains all the salient activities of the input video; we propose to retain the salient frames while generating the summary. With the huge volume of video content available on the internet, searching for the most relevant and effective information is time consuming for the user. Video summarization addresses this by producing a short video containing the most informative frames of the original. It finds applications in video surveillance systems [3, 24, 26], in which computer vision algorithms such as tracking, behaviour analysis, and object segmentation are integrated into cameras and/or servers, as well as in movie trailer generation, sports summary generation, etc.

Many researchers have worked on video summarization. Objects and people within a video play a vital role in video summarization [18], because we generally represent the events in a video by people/objects and their activities. Moreover, people/objects in a video carry high-level semantic meaning, and humans are usually more attentive to the moving objects in a video [6, 8]. However, extracting moving objects from a video with illumination changes, high noise, poor contrast, and a multimodal environment is considered a challenging problem [2, 22]. In videos with low contrast, the edges of objects are given higher prominence [22], but such edge-based methods are further sensitive to variations in the shape and position of the object.

To resolve these problems, the theory of edge-segments (i.e., groups of connected sequential edge pixels) can be applied [8]. However, [4, 6] claim that edge-segment based methods fail when the video has shape matching errors or local shape distortions. State-of-the-art methods for object detection use ellipses or circles to represent curve fragments [6], yet the problem persists if the video has low illumination [6]. Moreover, in the real world an object can take shapes other than circular, elliptical, parabolic, or hyperbolic curves, so object detection methods that approximate shapes by these primitive structures fail in such circumstances, even though a conic segment can easily be fitted to simple objects.

In [10], the authors use a set of similar objects to build a model for summarization. The authors in [16] present a part-based object movement framework. The authors in [14] apply an object bank and object-like windows to extract objects and then perform story-based video summarization. The authors in [5] propose a complementary background model. Pixel-based motion energy and edge features are combined in [23] for summarization. The authors in [12] propose a background subtraction method to detect foreground objects for video summarization, and the authors in [13] modify this idea by combining Aggregated Channel Features (ACF) detection with a background subtraction technique for object detection.

In [21], authors propose a video summarization technique by merging three multi-modal human visual sensitive features, namely, motion information, foreground objects, and visual saliency.

The authors in [15] propose a min-cut based approach for generating a storyboard; they also modify this idea and propose a Bayesian foraging technique to detect objects and their activities for summarizing a video. The grid background model is applied in [7]. The authors of [17] deploy a key-point matching technique for video segmentation, and the authors in [8] apply spatio-temporal slices to select the states of the object motion.

The authors in [9] propose a learning-based approach for video summarization; they describe the objects in a video by histograms of optical flow orientations and then apply an SVM-based classifier. The authors in [19] propose an unsupervised framework for video summarization via joint embedding and sparse representative selection. The objective function is two-stream in nature: the first objective captures multi-view correlations using an embedding, which assists in extracting a diverse set of representatives, and the second uses an \(L_2\) norm to model sparsity while selecting representative shots for the summary. The authors in [28] use an RNN to exploit the temporal relationship between frames for saliency detection, and the authors in [20] use a fully connected neural network for video summarization. However, all of these techniques need high computational capability, which makes low-cost real-time implementation very difficult.

The authors in [27] apply a modularity cut algorithm to track objects for summary generation. A Gaussian mixture model based approach is employed in [4], in which key frames are selected based on the cluster parameters. The authors in [4, 6] use geometric primitives (such as lines and arcs) as more distinguishable descriptors than edge pixels or edge segments.

These primitives are independent of the size of the object and are efficient for matching and comparison. They are also invariant to scale and viewpoint changes. Thus, geometric primitives effectively represent objects with complex shapes and structures, and they are also useful in cognitive systems [11].

In this paper, we propose to fuse foreground/background segmentation with geometric features for saliency detection in order to achieve video summarization. Towards this, we make the following contributions:

  • We propose to detect the saliency of a frame by detecting the saliency of its foreground and background separately, and then combining the probabilities of the foreground and background being salient to decide whether the frame is salient.

    • We propose to model the image as an MRF and use the MAP estimate, computed via graph cuts, as the final solution for foreground and background segmentation.

    • We propose to combine the probabilities of foreground and background being salient using the Dempster-Shafer Combination Rule (DSCR).

  • We propose to use changes in a variant of the geometric features (such as lines and arcs) to decide the saliency of a frame. For efficient extraction of the geometric primitives,

    • We propose to extract PCA features to detect the principal components of the foreground and background frames.

    • We convert the image from RGB to YCbCr and compute PCA on the Y channel of the frame so as to retain the chroma information.

  • We demonstrate the results on the BL-7F dataset and compare them with state-of-the-art techniques using quantitative metrics such as the condensation ratio and the retention ratio.

2 Proposed Framework

We demonstrate the proposed framework in Fig. 1. We propose to detect the saliency of a frame by detecting the saliency of its foreground and background separately. We detect changes in the PCA-based and geometric primitives (such as lines and contours) by computing the standard deviation of their differences between consecutive frames of each segment and comparing it with a heuristically set threshold. The threshold for the background is kept much lower than that for the foreground, with the intuition that even a small motion in the background is far more significant than a small motion in the foreground. We compute separate probabilities of the foreground and background segments being salient, combine them using DSCR to obtain a joint probability, and decide whether the given frame is salient based on a decision boundary set on this joint probability.
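To make the per-frame decision concrete, the following is a minimal Python sketch, assuming the foreground and background saliency probabilities of each frame have already been computed as described in Sects. 2.2-2.4. The decision boundary is a hypothetical placeholder, and the closed-form combination used here is the zero-disbelief special case of DSCR discussed in Sect. 2.4.

```python
from typing import List, Tuple

DECISION_BOUNDARY = 0.6  # hypothetical value; set heuristically in practice


def combine_dscr(p_fg: float, p_bg: float) -> float:
    """DSCR with zero disbelief masses: the conflict term vanishes and the
    joint probability reduces to p_fg + p_bg - p_fg * p_bg (see Sect. 2.4)."""
    return p_fg + p_bg - p_fg * p_bg


def select_salient_frames(probs: List[Tuple[float, float]]) -> List[int]:
    """probs[i] = (P_foreground_salient, P_background_salient) for frame i.
    Returns the indices of the frames retained in the summary."""
    salient = []
    for i, (p_fg, p_bg) in enumerate(probs):
        if combine_dscr(p_fg, p_bg) > DECISION_BOUNDARY:
            salient.append(i)
    return salient
```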

Fig. 1. Proposed framework

2.1 Foreground and Background Segmentation

We propose to separate the foreground of the scene from the background using energy minimization via graph cut. We model every frame as an MRF (Markov Random Field) and use the MAP (Maximum A Posteriori) estimate as the final solution. In this framework, we use a grid graph over the image pixels for the MRF. We then seek the labelling \(f\) of the image pixels with minimum energy:

$$E(f) = E_{smooth}(f) + E_{data}(f)$$

where \(E_{data}(f)\) is defined by

$$E_{data}(f) = \sum _{p\in P}D_{p}(f_{p})$$

Here \(E_{smooth}(f)\) measures the extent to which \(f\) is not piecewise smooth, whereas \(E_{data}(f)\) measures the total disagreement between \(f\) and the observed data. Researchers have proposed many different forms of these energy functions. The form of \(E_{smooth}(f)\) is typically

$$E_{smooth}(f) = \sum _{\{p,q\}\in N} u_{\{p,q\}}\cdot T(f_{p} \ne f_{q})$$

Here, \(T(\cdot)\) is the indicator function, which outputs 1 if its argument is true and 0 otherwise. We use the Potts model, in which discontinuities between any pair of labels are penalized equally; this is, in some sense, the simplest discontinuity-preserving model.
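The sketch below illustrates this energy minimization for the binary foreground/background case: it builds a Potts-model MRF on the pixel grid and solves it via max-flow/min-cut using the PyMaxflow library. The data term used here, a squared difference from rough foreground/background intensity estimates, and the numeric parameters are placeholder assumptions for illustration; the framework only requires that some \(D_p\) and smoothness weight \(u_{\{p,q\}}\) be supplied.

```python
import numpy as np
import maxflow  # PyMaxflow: pip install PyMaxflow


def potts_segmentation(gray, fg_mean=200.0, bg_mean=60.0, smooth_weight=50.0):
    """Binary foreground/background labelling of a grayscale frame by minimizing
    E(f) = E_data(f) + E_smooth(f) with a Potts smoothness term via graph cut.
    fg_mean / bg_mean are rough intensity models used only for this sketch."""
    gray = gray.astype(np.float64)

    # Unary data costs D_p(f_p): squared difference from each model's mean intensity.
    cost_fg = (gray - fg_mean) ** 2
    cost_bg = (gray - bg_mean) ** 2

    g = maxflow.Graph[float]()
    nodeids = g.add_grid_nodes(gray.shape)

    # Potts smoothness: neighbouring pixels with different labels are all
    # penalized by the same constant weight u_{p,q}.
    g.add_grid_edges(nodeids, smooth_weight)

    # Terminal edges carry the data costs of the two labels.
    g.add_grid_tedges(nodeids, cost_fg, cost_bg)

    g.maxflow()
    # Under this wiring, True marks pixels assigned the foreground label.
    return g.get_grid_segments(nodeids)
```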

We then obtain the two segments of the image, one corresponding to foreground and the other corresponding to background. The foreground and background segmentation for two datasets is shown in Fig. 2.

Fig. 2. Segmentation of an image into foreground and background frames: (a), (d) original images; (b), (e) corresponding foreground frames; (c), (f) corresponding background frames

2.2 Saliency Detection of Foreground and Background Frames

We demonstrate the saliency detection block in Fig. 3. The input to saliency detection is a segmented frame (foreground or background). We propose to use changes in a variant of the geometric primitives to decide the saliency of a frame. We extract this variant of the geometric features, which we call the frame feature descriptors (FFD); the FFD extraction process is demonstrated in Fig. 4. We then compute the standard deviation between the extracted feature vectors of consecutive frames. The probability of a frame being salient is determined by the extent to which this standard deviation exceeds a heuristically set threshold.
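A minimal sketch of this step is given below. The "standard deviation between" descriptors is interpreted here as the standard deviation of their element-wise difference, assuming the descriptors are aligned to a common fixed length; the mapping from the amount by which it exceeds the threshold to a probability, and the threshold itself, are illustrative choices rather than the exact ones used in our experiments.

```python
import numpy as np


def segment_saliency_probability(ffd_prev, ffd_curr, threshold):
    """Probability that a foreground/background segment is salient, based on the
    standard deviation between the FFD vectors of consecutive frames (Sect. 2.3)."""
    deviation = float(np.std(np.asarray(ffd_curr) - np.asarray(ffd_prev)))
    if deviation <= threshold:
        return 0.0
    # The probability grows with the extent by which the deviation exceeds the
    # heuristically set threshold, saturating at 1 (illustrative mapping).
    return min(1.0, (deviation - threshold) / threshold)
```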

Fig. 3. Saliency detection of foreground and background frames

2.3 Extraction of Frame Feature Descriptors (FFD)

The process of FFD extraction is demonstrated in Fig. 4. We convert the RGB frames of the video to YCbCr, so that the chroma information is retained while only the luma is processed. We apply PCA to the 'Y' channel of the frame to obtain a PCA-transformed 'Y' channel, and convert the result back to RGB to obtain images with enhanced principal components. We then extract geometric features from these enhanced images.
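The sketch below illustrates this enhancement step using OpenCV and scikit-learn on a standard 8-bit frame. OpenCV's YCrCb conversion is used as the YCbCr transform (channel order Y, Cr, Cb), and the number of retained principal components is a hypothetical parameter.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA


def enhance_principal_components(bgr_frame, n_components=20):
    """Apply PCA to the luma (Y) channel only, leaving the chroma channels
    untouched, and convert back to an RGB-like (BGR) frame."""
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    y = ycrcb[:, :, 0].astype(np.float64)

    # Treat each row of the Y channel as a sample and keep the leading components.
    pca = PCA(n_components=min(n_components, y.shape[0], y.shape[1]))
    y_enhanced = pca.inverse_transform(pca.fit_transform(y))

    ycrcb[:, :, 0] = np.clip(y_enhanced, 0, 255).astype(np.uint8)
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```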

Fig. 4. Extraction of frame feature descriptors (FFD)

We represent the objects present in the frames using geometric features. We apply Canny edge detection to find the edges of the objects, and from these edges we extract geometric features such as line segments and contours. Contours represent the largest areas in the frames; hence, we detect differences between frames by monitoring changes in the positions of the line segments that lie within the contour areas.
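A sketch of this feature extraction with standard OpenCV routines is shown below (OpenCV 4.x API). The Canny and Hough parameters are illustrative assumptions, and the descriptor here is simply the endpoints of the line segments whose midpoints fall inside the largest contour, concatenated into a vector.

```python
import cv2
import numpy as np


def extract_ffd(enhanced_bgr):
    """Illustrative frame feature descriptor (FFD): line segments that lie
    inside the largest contour of the Canny edge map."""
    gray = cv2.cvtColor(enhanced_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)

    # Contours of the edge map; keep the one covering the largest area.
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return np.zeros(0)
    largest = max(contours, key=cv2.contourArea)

    # Probabilistic Hough transform gives candidate line segments (x1, y1, x2, y2).
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=40,
                            minLineLength=20, maxLineGap=5)
    if lines is None:
        return np.zeros(0)

    # Keep only the segments whose midpoint lies inside the largest contour.
    kept = []
    for x1, y1, x2, y2 in lines[:, 0, :]:
        mid = (float(x1 + x2) / 2.0, float(y1 + y2) / 2.0)
        if cv2.pointPolygonTest(largest, mid, False) >= 0:
            kept.extend([x1, y1, x2, y2])
    return np.asarray(kept, dtype=np.float64)
```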

2.4 Joint Probability Using DSCR

We combine the two probabilities using the Dempster-Shafer Combination Rule (DSCR) to obtain the joint probability, and we decide whether the given frame is salient based on a decision boundary set on the joint probability. Let \(P_1\) and \(P_2\) be the probabilities to be combined. Rather than combining two probabilities directly, DSCR combines two hypotheses, each characterized by three parameters: a mass of belief, a mass of disbelief, and a mass of uncertainty. We construct hypotheses \(hyp_1\) and \(hyp_2\) as sets of a mass of belief (\(m(b)\)), a mass of disbelief (\(m(d)\)), and a mass of uncertainty (\(m(u)\)). We set the mass of belief \(m_1(b)\) for \(hyp_1\) to \(P_1\) and the mass of belief \(m_2(b)\) for \(hyp_2\) to \(P_2\). We set the masses of disbelief \(m_1(d)\) and \(m_2(d)\) to 0, and the masses of uncertainty \(m_1(u)\) and \(m_2(u)\) to \(1-P_1\) and \(1-P_2\), respectively. We combine \(hyp_1\) and \(hyp_2\) using the combination table shown in Table 1.

Table 1. Combination table

In the combination table, the product of the mass of belief of one hypothesis and the mass of disbelief of the other gives rise to conflict, represented by \(\emptyset \). The product of the two masses of belief, or of a mass of belief and a mass of uncertainty, represents joint belief and is denoted by \(\psi _{1}\); similarly, \(\psi _{2}\) denotes the joint disbelief.

The combined belief of the two evidences is taken as the joint probability and is given by:

$$\text{Joint Probability}=\frac{\sum \psi _{1} }{1- \sum \emptyset }$$

The advantage of using DSCR to combine the two probabilities is that it accommodates the fact that if \(P_1\) is the probability of a frame being salient, then \(1-P_1\) need not be the probability of the frame being non-salient; it can represent uncertainty as well.
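A minimal implementation of this combination is sketched below; it follows the construction above (disbelief masses set to zero), and the final normalization makes explicit that in this special case the rule reduces to \(P_1 + P_2 - P_1 P_2\), since the conflict term vanishes.

```python
def dscr_joint_probability(p1, p2):
    """Combine two saliency probabilities with the Dempster-Shafer Combination
    Rule using the hypothesis construction of Sect. 2.4:
    belief = P_i, disbelief = 0, uncertainty = 1 - P_i."""
    b1, d1, u1 = p1, 0.0, 1.0 - p1   # hypothesis 1
    b2, d2, u2 = p2, 0.0, 1.0 - p2   # hypothesis 2

    # Conflict: belief of one hypothesis times disbelief of the other.
    conflict = b1 * d2 + d1 * b2

    # Joint belief: belief x belief, plus belief x uncertainty (both orders).
    joint_belief = b1 * b2 + b1 * u2 + u1 * b2

    # Normalize by the non-conflicting mass. With zero disbelief the conflict
    # is 0, so this reduces to p1 + p2 - p1 * p2.
    return joint_belief / (1.0 - conflict)
```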

3 Results and Discussions

We evaluate our approach using the BL-7F dataset. In this dataset, 19 surveillance videos are taken from fixed surveillance cameras located on the seventh floor of the Barry Lam Building at the National Taiwan University. Each video consists of 12,900 frames with a duration of 7 min and 10 s. We compare our results using the retention ratio and the condensation ratio as evaluation metrics.

Table 2. Comparison of the condensation ratio (in percentage) of the proposed method with different state-of-the-art techniques [1, 18, 25] for different surveillance videos. Here, RR denotes the retention ratio, which is 1 for all results unless mentioned otherwise.

Retention ratio is the ratio of the number of objects in the summarized video to the number of objects in the original video.

$$ RR=\frac{\text{number of objects in summarized video}}{\text{number of objects in input video}} $$

Condensation ratio measures the extent to which the input video is condensed, i.e., the percentage of the input video length that is removed in the summary.

$$ CR=\left(1-\frac{\text{length of summarized video}}{\text{length of input video}}\right)\times 100 $$
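As a hypothetical worked example, if an input video of 12,900 frames is condensed to a 645-frame summary in which every object of the input still appears, then \(RR = 1\) and \(CR = (1 - 645/12900) \times 100 = 95\%\).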

We find that the proposed method gives better results than the other state-of-the-art techniques. The retention ratio for the proposed method is unity for all videos, and the condensation ratios are also very high compared to the existing methods. The comparison of the condensation ratio (in percentage) of the proposed method with different state-of-the-art techniques [1, 18, 25] for different surveillance videos is given in Table 2.

4 Conclusions

In this paper, we have proposed a geometric-feature and frame-segmentation based approach for video summarization. We detected saliency in the foreground and background of the image separately. We proposed to model the image as an MRF (Markov Random Field) and use the MAP (Maximum A Posteriori) estimate as the final solution to segment the image into foreground and background. Salient frames were effectively identified by the variation in the frame feature descriptors derived from a variant of geometric features. We proposed to combine the probabilities of the foreground and background segments being salient using the DSCR (Dempster-Shafer Combination Rule). We modelled the summarized video as the collection of salient frames over a user-defined duration. We have demonstrated the results on several videos of the BL-7F dataset and compared them with state-of-the-art techniques using the retention ratio and condensation ratio as quality parameters, showing the superiority of the proposed method over the other algorithms.