Abstract
Because of the sheer volume of data involved, storing and processing video is a challenging problem. It becomes even more difficult for surveillance videos because of their length. It is therefore essential to design algorithms that allow faster browsing of video data with as much compression as possible. In this paper, we propose a novel decomposition algorithm that reduces the redundancy of a video cube by segmenting the motion-salient regions using a total variation approach. We further use the decomposition algorithm to summarize a video shot for easy interpretation of the event it contains. We propose two different methods for the summarization process and demonstrate that video summarization drastically reduces the storage requirement without sacrificing the understanding of the video content.
1 Introduction
Video is one of the most popular media in today’s world. It is extensively used for surveillance, communication, defense, and entertainment. However, video data is difficult to store and browse because of its huge volume. Though various video compression algorithms save storage space, they cannot reduce the time needed to browse a video. Thus, it is necessary to design new algorithms that reduce both the storage demand and the browsing time, so that the events in a video can be easily interpreted. One of the most popular approaches is the generation of a video storyboard or video skimming [1, 5, 6, 10]. However, it is difficult to understand an event from video skimming, and it often requires multiple frames to represent a video. Recently, video summarization techniques have become popular because they enable more efficient browsing and file size management [3, 9, 11, 12, 15]. Video summarization algorithms detect the salient events in a video and generate a single image that represents each event without disturbing its continuity. In [11], the authors proposed an interactive video summarization algorithm. In [12], the authors extracted the portions of a video where events are denser than elsewhere and used the extracted video for summarization. Sunkavalli et al. proposed a saliency-based algorithm to summarize a video [15]. In [8], the authors exploited spatiotemporal information along with a graph-cut method to generate a space-time montage from the input video. In [14], the authors proposed a multi-scale approach to summarize a video with minimum visual distortion. However, most of these algorithms cannot segment the moving regions precisely, and the edges get blurred in the final summarized image [15].
In this paper, we propose a video summarization algorithm based on video decomposition. The video decomposition algorithm extracts the motion salient regions with sharp object boundary. Then, we temporally sample the motion salient regions to generate the final summarized image. We use both uniform sampling and non-uniform sampling to summarize an input video.
Contributions:

- We propose a novel decomposition algorithm that is parallelizable.

- We use both uniform and non-uniform sampling to generate summarized images of input videos with minimum visual distortion.

- As the summarization algorithms are based on the video decomposition technique, the final summarized image has sharp object boundaries.
The rest of the paper is organized as follows. In Sect. 2, the novel video decomposition algorithm is discussed along with the summarization techniques. In Sect. 3, we discuss different aspects of the video decomposition algorithm and summarize different input videos using both uniform and non-uniform sampling. Finally, we conclude the paper in Sect. 4, discussing the impact of the work and its future prospects.
2 Proposed Algorithm
We first briefly describe the parallelizable video decomposition scheme on which both summarization methods are built. From an input video, the decomposition technique estimates a background video, whose frames are visually similar, and a residual video that carries all the remaining information. The residual video is then used to summarize the motion information, while the visually similar video is used to estimate the background. In Fig. 1, we show the basic block diagram of the proposed algorithm.
As the main objective of video decomposition is to decompose the input video, say \(\mathbf {V}\) into background video \(\mathbf {L}\) and feature video \(\mathbf {S}\), we will first estimate the background video from the input video cube and then construct the feature video \(\mathbf {S}\) using \(\mathbf {V}\) and \(\mathbf {L}\).
Let us assume that an input video \(\mathbf {V}\) has K frames with frame resolution \(M\times N\). If a pixel \(\mathbf {p}=(x,y)\) lies in the background of the video, the intensity at that pixel location will not vary along the time axis. Thus, consider a vector \(\mathbf {l_p}\) at pixel location \(\mathbf {p}\) along the time axis, such that \(l_p^i\), the \(i\mathrm{th}\) element of the vector, represents the intensity at pixel location \(\mathbf {p}\) in the \(i\mathrm{th}\) frame of input video \(\mathbf {V}\). If we calculate a vector \(\mathbf {x_p}\) such that

$$\begin{aligned} x_p^i = l_p^{i+1} - l_p^i, \quad i = 1, 2, \ldots , K-1, \end{aligned}$$ (1)

the vector \(\mathbf {x_p}\) will be a sparse vector if \(\mathbf {p}\) belongs to the background of the video. We can represent Eq. 1 as \(\mathbf {x_p}=\mathbf {Dl_p}\), where \(\mathbf {D}\) is the variation matrix. If we take the consecutive element-wise difference, as shown in Eq. 1, then we can express \(\mathbf {D}\) as the \((K-1)\times K\) bidiagonal matrix

$$\begin{aligned} \mathbf {D} = \begin{bmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{bmatrix}. \end{aligned}$$
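The role of the variation matrix \(\mathbf {D}\) can be illustrated with a short numerical sketch (Python/NumPy; the 7-sample intensity profile is hypothetical): \(\mathbf {Dl_p}\) is sparse when the temporal profile is constant except for a brief foreground pass.

```python
import numpy as np

def variation_matrix(K):
    """Build the (K-1) x K first-difference matrix D so that
    (D @ l)[i] = l[i+1] - l[i], i.e. x_p = D l_p."""
    D = np.zeros((K - 1, K))
    idx = np.arange(K - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

# Temporal profile of a background pixel briefly occluded by an object:
l = np.array([10., 10., 10., 40., 40., 10., 10.])
x = variation_matrix(len(l)) @ l
# x is nonzero only at the two intensity changes, hence sparse
```

A foreground pixel, by contrast, produces a dense difference vector, which is what the sparsity penalty below exploits.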
Using this idea of a background pixel, we consider a vector \(\mathbf {v_p}\) at any pixel location \(\mathbf {p}\in M\times N\), such that \(v_p^i\), the \(i\mathrm{th}\) element of the vector, represents the intensity at pixel location \(\mathbf {p}\) in the \(i\mathrm{th}\) frame of input video \(\mathbf {V}\). Then, to get the background intensity, we estimate \(\mathbf {l_p}\) from \(\mathbf {v_p}\) such that \(\mathbf {Dl_p}\) is a sparse vector. We define the optimization problem as

$$\begin{aligned} \hat{\mathbf {l}}_{\mathbf {p}} = \mathop {\mathrm {arg\,min}}\limits _{\mathbf {l_p}} \left\| \mathbf {v_p}-\mathbf {l_p}\right\| _2^2 + \lambda \left\| \mathbf {Dl_p}\right\| _0, \end{aligned}$$ (2)
where \(\left\| .\right\| _2\) and \(\left\| .\right\| _0\) denote \(l_2\) norm and \(l_0\) norm of a vector, respectively. The first term of the expression is the data fidelity term and the second term ensures that the estimated vector \(\mathbf {l_p}\) is smooth and \(\lambda \) is a non-negative weight that determines the level of smoothness in the final estimate of \(\mathbf {l_p}\). As \(\lambda \) increases, estimated \(\mathbf {l_p}\) becomes smoother, i.e., \(\mathbf {Dl_p}\) becomes sparser.
As the optimization problem defined in Eq. 2 is non-convex, the estimation of the optimal \(\mathbf {l_p}\) is NP-hard. To reduce the computational complexity while keeping the concept of sparsity intact, we replace the \(l_0\) norm with the \(l_1\) norm [2, 4] and modify the optimization problem of Eq. 2 as

$$\begin{aligned} \hat{\mathbf {l}}_{\mathbf {p}} = \mathop {\mathrm {arg\,min}}\limits _{\mathbf {l_p}} \left\| \mathbf {v_p}-\mathbf {l_p}\right\| _2^2 + \lambda \left\| \mathbf {Dl_p}\right\| _1. \end{aligned}$$ (3)
To solve the convex problem defined in Eq. 3, we apply the iterative reweighted norm (IRN) approach. IRN uses the concept of the iteratively reweighted least squares (IRLS) method to convert the \(l_p\) norm of a vector into a weighted \(l_2\) norm. This solves the optimization problem in fewer iterations [16], as the \(l_2\) norm is differentiable and leads to a closed-form solution with an iterative update step for the weight matrix. A simplified form of IRN states that the \(l_p\) norm minimization of \(\mathbf {q}=[q_1,q_2,\ldots q_n]^t\) can be solved using a weighted least squares problem, since

$$\begin{aligned} \left\| \mathbf {q}\right\| _p^p \approx \left\| \mathbf {R}^{1/2}\mathbf {q}\right\| _2^2, \end{aligned}$$
where \(\mathbf {R}\) is a diagonal matrix with each diagonal element defined as \(\mathbf {R}^{1/2}_{i,i}=(|q_i|^{1-p/2}+\epsilon )^{-1}\), and \(\epsilon \) is a small positive constant added to avoid division by zero [16].
Using the concept of IRLS, we modify Eq. 3 and define the cost function \(C(\mathbf {l_p}^{(k)})\) at iteration k as

$$\begin{aligned} C(\mathbf {l_p}^{(k)}) = \left\| \mathbf {v_p}-\mathbf {l_p}^{(k)}\right\| _2^2 + \lambda \left\| {\mathbf {R}^{(k)}}^{1/2}\mathbf {Dl_p}^{(k)}\right\| _2^2, \end{aligned}$$
where weighting matrix \({\mathbf {R}^{(k)}}^{1/2}\) is calculated considering \(\mathbf {q}^{(k)}=\mathbf {Dl_p}^{(k-1)}\).
To minimize the cost function, we differentiate the right-hand side with respect to \(\mathbf {l_p}^{(k)}\) and set the result equal to zero. A mathematical simplification gives us the linear system

$$\begin{aligned} \mathbf {\Psi }\,\mathbf {l_p}^{(k)} = \mathbf {v_p}, \end{aligned}$$ (7)
where \(\mathbf {I}\) is the identity matrix of dimension \(K \times K\) and \(\mathbf {\Psi }=\lambda \mathbf {D}^t\mathbf {R}^{(k)}\mathbf {D}+\mathbf {I}\). We end the iterations when \(\mathcal {Q}(\mathbf {l_p}^{(k)})-\mathcal {Q}(\mathbf {l_p}^{(k-1)})=\mathbf {0}\) or when \(\text {rank}(\mathbf {\Psi })<K\), where \(\mathcal {Q}\) is a quantizer that rounds each element of a vector to its nearest integer and \(\mathbf {0}\) is the null vector. It is important to note that \(\mathbf {\Psi }\) is a symmetric matrix, as \(\mathbf {R}^{(k)}\) is diagonal, and that \(\mathbf {\Psi }\) has a diagonal loading, i.e., it remains invertible even if \(\mathbf {D}^t\mathbf {R}^{(k)}\mathbf {D}=\mathbf {0}\). The linear system \(\mathbf {\Psi }\mathbf {l_p}^{(k)}=\mathbf {v_p}\) defined in Eq. 7 can be solved for \(\mathbf {l_p}^{(k)}\) using Newton’s method without performing the matrix inversion explicitly [13].
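The IRN iteration above can be sketched as follows. This is a minimal single-pixel illustration, not the authors' implementation: the values of \(\lambda \), \(\epsilon \), and the iteration cap are our assumptions, the rank-based stopping test is omitted, and a direct solver stands in for the Newton-type solver of [13].

```python
import numpy as np

def irn_background(v, lam=10.0, eps=1e-6, max_iter=60):
    """Estimate the smooth background profile l_p from a temporal
    intensity vector v_p by minimizing ||v - l||_2^2 + lam*||D l||_1
    with the IRN / IRLS iteration."""
    K = len(v)
    # (K-1) x K first-difference (variation) matrix D
    D = np.zeros((K - 1, K))
    idx = np.arange(K - 1)
    D[idx, idx], D[idx, idx + 1] = -1.0, 1.0

    l = v.astype(float).copy()
    prev_q = None
    for _ in range(max_iter):
        # IRN weight for p = 1: R_ii ~ 1 / (|(D l)_i| + eps), so the
        # weighted l2 penalty mimics the l1 penalty of Eq. 3.
        R = np.diag(1.0 / (np.abs(D @ l) + eps))
        Psi = lam * (D.T @ R @ D) + np.eye(K)  # diagonally loaded => invertible
        l = np.linalg.solve(Psi, v)            # solve Psi l = v (Eq. 7)
        q = np.rint(l)                         # element-wise quantization Q(.)
        if prev_q is not None and np.array_equal(q, prev_q):
            break                              # quantized iterate unchanged
        prev_q = q
    return l

# Constant background briefly occluded by a bright foreground burst:
v = np.array([10., 10., 60., 60., 10., 10., 10., 10.])
l_hat = irn_background(v)
s_hat = v - l_hat   # residual: carries the burst, i.e. the motion
```

The residual \(s\) is largest at the burst frames, which is exactly the motion information that \(\mathbf {S}\) collects below.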
Finally, we construct the two videos \(\mathbf {L}\) and \(\mathbf {S}\), where \(\mathbf {L}\) contains the background information of the input video and \(\mathbf {S}=\mathbf {V}-\mathbf {L}\) contains its motion information. The intensity values at pixel location \(\mathbf {p}\) in the \(i\mathrm{th}\) frame are \(l_p^i\) and \(s_p^i\) for videos \(\mathbf {L}\) and \(\mathbf {S}\), respectively, where \(l_p^i\) is the \(i\mathrm{th}\) element of the estimated vector \(\mathbf {l_p}^{(k)}\).
To summarize the video frames, first we calculate the background image B as
where \(b_p\) is the intensity at pixel \(\mathbf {p}\) in background B.
Next, we uniformly sub-sample the feature video \(\mathbf {S}\) with sampling rate z to construct a video \(\mathbf {S}_u\) such that
where \(S_u^i\) and \(S^i\) are the \(i\mathrm{th}\) frames of videos \(\mathbf {S}_u\) and \(\mathbf {S}\), respectively.
As \(\mathbf {V}=\mathbf {L}+\mathbf {S}\), neither \(\mathbf {S}_u\) nor \(\mathbf {S}\) contains the actual intensity values of a moving object. We apply adaptive thresholding on video \(\mathbf {S}_u\) to extract the moving object. If \(\mathbf {F}\) is the uniformly sampled motion-segmented video, then,
where \(s_{u_p}^j\) is the intensity at pixel location \(\mathbf {p}\) in the \(j\mathrm{th}\) frame of video \(\mathbf {S}_u\) and \(\tau _j\) is an adaptive constant calculated as \(\tau _j=\mu _j+\sigma _j\), where \(\mu _j\) and \(\sigma _j\) are the mean and standard deviation of frame \(S_u^j\), respectively, and \(j\in J_u\) where \(J_u=\{1,2,3,\ldots \lfloor K/z \rfloor \}\).
We generate the uniformly summarized image \(I_u\) as
where \(i_{u_p}\) is the intensity value at pixel location \(\mathbf {p}\) in \(I_u\).
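The uniform summarization step can be sketched compactly (Python/NumPy). The exact sampling phase and overlay rule are not fully specified above, so this should be read as one plausible instantiation rather than the authors' exact procedure.

```python
import numpy as np

def summarize_uniform(S, B, z=5):
    """Uniform summarization sketch: take every z-th frame of the
    feature video S (shape K x M x N), keep pixels whose residual
    magnitude exceeds the adaptive threshold tau_j = mu_j + sigma_j,
    and paste the recovered object intensity (B + S) onto the
    background image B."""
    I_u = B.astype(float).copy()
    for j in range(0, S.shape[0], z):       # uniform temporal sampling
        frame = S[j]
        tau = frame.mean() + frame.std()    # adaptive threshold tau_j
        mask = np.abs(frame) > tau          # motion-salient pixels
        I_u[mask] = B[mask] + frame[mask]   # actual object intensity V = L + S
    return I_u
```

Because each sampled frame is composited independently, objects from different time instants can land on overlapping pixels, which is precisely the distortion noted below for scenes with non-uniform motion.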
Though uniform summarization generates satisfactory results for simple videos, it performs poorly when the scene has non-uniform motion, acceleration, or multiple moving objects. Thus, we define another approach that summarizes an input video non-uniformly.
To do so, we first segment the motion information present in input video \(\mathbf {V}\) using the information present in \(\mathbf {S}\). If \(\mathbf {U}\) is the final motion segmented video, then,
where \(u_p^i\) and \(s_p^i\) are the intensities at pixel location \(\mathbf {p}\) in the \(i\mathrm{th}\) frame of videos \(\mathbf {U}\) and \(\mathbf {S}\), respectively, and \(\tau _i\) is an adaptive constant calculated as \(\tau _i=\mu _i+\sigma _i\) where \(\mu _i\) and \(\sigma _i\) are the mean and standard deviation of frame \(S^i\), respectively.
Next, we select the indices of the frames such that
where d(.) is the dilation operation performed on a frame, \(\phi \) is a null matrix, and \(\cap \) computes the spatial intersection of nonzero elements in two images. Thus, we get a sampled video \(\mathbf {F}_n\) such that
where \(F_n^i\) and \(U^i\) are the \(i\mathrm{th}\) frames of videos \(\mathbf {F}_n\) and \(\mathbf {U}\), respectively. We construct the final non-uniformly summarized image \(I_n\) as
where \({f^{i}_{n_p}}\) is the intensity at pixel location \(\mathbf {p}\) in the \(i\mathrm{th}\) frame of video \(\mathbf {F}_n\).
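The non-uniform selection can be sketched as a greedy scan over the motion-segmented video \(\mathbf {U}\): a frame is admitted only when its dilated support does not intersect that of the last admitted frame. The 5x5 structuring element and the pairwise comparison against only the most recently selected frame are simplifying assumptions on our part.

```python
import numpy as np

def dilate(mask, r=2):
    """Binary dilation d(.) by a (2r+1) x (2r+1) square structuring
    element, implemented with shifted ORs over a zero-padded copy."""
    padded = np.pad(mask, r)
    out = np.zeros_like(mask)
    H, W = mask.shape
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= padded[dy:dy + H, dx:dx + W]
    return out

def select_nonuniform(U, r=2):
    """Greedy non-uniform frame selection: starting from the first
    motion-segmented frame of U (shape K x M x N), add a later frame
    only if the intersection of the dilated supports is empty
    (the d(.) ... = phi test)."""
    selected = [0]
    last = dilate(U[0] != 0, r)
    for i in range(1, U.shape[0]):
        cur = dilate(U[i] != 0, r)
        if not np.any(last & cur):   # spatial intersection is empty
            selected.append(i)
            last = cur
    return selected
```

Initializing the scan from the last frame instead of the first generally yields a different set of indices, which matches the initialization-dependent behavior discussed in Sect. 3.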
3 Experimental Results
To validate the decomposition and summarization algorithms, we test them on different input videos. Figure 2a shows a frame from a typical input video, and Fig. 2c shows the corresponding frame from feature video \(\mathbf {S}\). In Fig. 3a, we show the estimated \(\mathbf {l_p}\) for different \(\lambda \) values. The input vector \(\mathbf {v_p}\) is the change in intensity at the center of the red circle shown in Fig. 2a. The change in rank of the video \(\mathbf {L}\) is shown in Fig. 3b; the rank is calculated as described in [7]. It is important to note that in our previous work [2], we reported a parallelizable decomposition method based on a majorization-minimization algorithm. However, the algorithm in [2] takes a large number of iterations (\({\sim }500\)) to complete the decomposition. As shown in Fig. 3b, the proposed decomposition minimizes the rank in far fewer steps (\({\sim }60\)) without increasing the complexity of the algorithm. Thus, the proposed decomposition algorithm is faster than the algorithm of [2]. In Fig. 3c and d, we compare the execution times of these decomposition algorithms on different datasets, further validating the claim. As the decomposition algorithms are pixel based, they are parallelizable, and increasing the number of processor cores reduces the execution time in both cases.
In Fig. 4, we show the outputs of both summarization algorithms for different input videos. Figure 4a–d shows frames of the input videos. All the videos in the dataset contain complex motion such as acceleration, multiple objects, and nonlinear motion. Figure 4e shows the respective summarized images using the uniform summarization method, and Fig. 4f shows the summarized images using the non-uniform method. However, as mentioned in Sect. 2, the uniformly summarized image \(I_u\) may contain distortion due to overlapping regions; in Fig. 4e, these regions are marked with black rectangles. As shown in Fig. 4f, the non-uniformly summarized images \(I_n\) are free from such artifacts. An interesting property of the non-uniform summarization algorithm is that the summarized image \(I_n\) may differ for the same input video depending on the frame used to initialize the summarization process. This is depicted in Fig. 5: for the same input videos, Fig. 5a and b show the final non-uniformly summarized images initialized from the first frame and the last frame, respectively.
Besides making an event easier to interpret, video summarization drastically reduces the file size, as it generates a single image as the final output. Table 1 lists the sizes of the input videos and of the summarized images \(I_u\) and \(I_n\), along with the execution times of both proposed summarization algorithms. It is evident that the space-time requirements of the two algorithms are comparable. All evaluations were done in MATLAB 2013 using 4 cores of an Intel(R) Core(TM) i7-4770 3.90 GHz processor with 8 GB RAM.
4 Conclusion
Storage and interpretation of videos require a large amount of resources. It is crucial to develop algorithms that can represent an input video using as few resources as possible without disturbing the flow of events. In this paper, we proposed two algorithms that summarize an input video into a single image using uniform and non-uniform sampling of the video frames. The methods consume little disk space and execute quickly, as the algorithms are entirely pixel based and can be run in parallel. Though the uniformly summarized image may contain some distortion depending on the content of the input video, the non-uniformly summarized image is free of such distortion, at the cost of slightly more resources.
Though the proposed summarization algorithms work for static cameras, extending them to videos with camera motion is an important direction for future work.
References
Bhattacharya, S., Gupta, S., Venkatesh, K.: Video shot detection & story board generation using video decomposition. In: Proceedings of the Sixth International Conference on Computer and Communication Technology 2015, pp. 223–227. ACM (2015)
Bhattacharya, S., Venkatesh, K., Gupta, S.: Background estimation and motion saliency detection using total variation-based video decomposition. Signal, Image and Video Processing 11(1), 113–121 (2017)
Brunelli, R., Mich, O., Modena, C.M.: A survey on the automatic indexing of video data. Journal of Visual Communication and Image Representation 10(2), 78–112 (1999)
Donoho, D.L.: Compressed sensing. IEEE Transactions on Information Theory 52(4), 1289–1306 (2006)
Furini, M., Geraci, F., Montangero, M., Pellegrini, M.: STIMO: Still and moving video storyboard for the web scenario. Multimedia Tools and Applications 46(1), 47 (2010)
Goldman, D.B., Curless, B., Salesin, D., Seitz, S.M.: Schematic storyboarding for video visualization and editing. ACM Transactions on Graphics (TOG) 25, 862–871 (2006)
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? Journal of the ACM 58(3) (2011)
Kang, H.W., Chen, X.Q.: Space-time video montage. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 1331–1338. IEEE (2006)
Money, A.G., Agius, H.: Video summarisation: A conceptual framework and survey of the state of the art. Journal of Visual Communication and Image Representation 19(2), 121–143 (2008)
Mundur, P., Rao, Y., Yesha, Y.: Keyframe-based video summarization using Delaunay clustering. International Journal on Digital Libraries 6(2), 219–232 (2006)
Pritch, Y., Rav-Acha, A., Peleg, S.: Nonchronological video synopsis and indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11), 1971–1984 (2008)
Rav-Acha, A., Pritch, Y., Peleg, S.: Making a long video short: Dynamic video synopsis. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 435–441. IEEE (2006)
Rodríguez, P., Wohlberg, B.: Efficient minimization method for a generalized total variation functional. IEEE Transactions on Image Processing 18(2), 322–332 (2009)
Simakov, D., Caspi, Y., Shechtman, E., Irani, M.: Summarizing visual data using bidirectional similarity. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8. IEEE (2008)
Sunkavalli, K., Joshi, N., Kang, S.B., Cohen, M.F., Pfister, H.: Video snapshots: Creating high-quality images from video clips. IEEE Transactions on Visualization and Computer Graphics 18(11), 1868–1879 (2012)
Uruma, K., Konishi, K., Takahashi, T., Furukawa, T.: Image colorization based on the mixed l0/l1 norm minimization. In: 2012 19th IEEE International Conference on Image Processing (ICIP), pp. 2113–2116. IEEE (2012)
© 2018 Springer Nature Singapore Pte Ltd.
Bhattacharya, S., Venkatesh, K.S., Gupta, S. (2018). Video Summarization Using Novel Video Decomposition Algorithm. In: Chaudhuri, B., Kankanhalli, M., Raman, B. (eds) Proceedings of 2nd International Conference on Computer Vision & Image Processing. Advances in Intelligent Systems and Computing, vol 704. Springer, Singapore. https://doi.org/10.1007/978-981-10-7898-9_32