1 Introduction

Video is one of the most popular media in today's world. It is used extensively for surveillance, communication, defense, and entertainment. However, video data are difficult to store and browse because of their huge volume. Although various video compression algorithms save storage space, they cannot reduce the time needed to browse a video. Thus, it is necessary to design new algorithms that reduce both the storage demand and the browsing time, so that the events in a video can be interpreted easily. One popular approach is the generation of a video storyboard or video skim [1, 5, 6, 10]. However, it is difficult to understand an event from a video skim, and multiple frames are often required to represent a video. Recently, video summarization techniques have become popular because they enable more efficient browsing and file-size management [3, 9, 11, 12, 15]. Video summarization algorithms detect the salient events in a video and generate a single image that represents each event without disturbing its continuity. In [11], the authors proposed an interactive video summarization algorithm. In [12], the authors extracted the portion of a video where events are denser than elsewhere and used the extracted segment for summarization. Sunkavalli et al. proposed a saliency-based algorithm to summarize a video [15]. In [8], the authors exploited spatiotemporal information along with a graph-cut method to generate a space-time montage from the input video. In [14], the authors proposed a multi-scale approach to summarize a video with minimum visual distortion. However, most of these algorithms cannot segment the moving regions precisely, and the edges become blurred in the final summarized image [15].

In this paper, we propose a video summarization algorithm based on video decomposition. The video decomposition algorithm extracts the motion-salient regions with sharp object boundaries. We then temporally sample the motion-salient regions to generate the final summarized image, using both uniform and non-uniform sampling.

Contributions:

  • We propose a novel decomposition algorithm that is parallelizable.

  • We use both uniform and non-uniform sampling to generate summarized images of input videos with minimum visual distortion.

  • As the summarization algorithms are based on the video decomposition technique, the final summarized image has sharp object boundaries.

The rest of the paper is organized as follows. In Sect. 2, the proposed video decomposition algorithm is discussed along with the summarization techniques. In Sect. 3, we discuss different aspects of the video decomposition algorithm and summarize different input videos using both uniform and non-uniform sampling. Finally, we conclude the paper in Sect. 4, discussing the impact of the work and its future prospects.

2 Proposed Algorithm

Before discussing the summarization techniques, we briefly describe the parallelizable video decomposition scheme on which they are built. From an input video, the decomposition technique estimates a background video, whose frames are visually similar, and a residual video that carries all the remaining information. The residual video is then used to summarize the motion information, while the visually similar video is used to estimate the background. In Fig. 1, we show the basic block diagram of the proposed algorithm.

As the main objective of video decomposition is to decompose the input video \(\mathbf {V}\) into a background video \(\mathbf {L}\) and a feature video \(\mathbf {S}\), we first estimate the background video from the input video cube and then construct the feature video \(\mathbf {S}\) from \(\mathbf {V}\) and \(\mathbf {L}\).

Fig. 1 Block diagram of the proposed algorithm

Let us assume that an input video \(\mathbf {V}\) has K frames of resolution \(M\times N\). If a pixel \(\mathbf {p}=(x,y)\) belongs to the background of the video, the intensity at that pixel location will not vary along the time axis. Consider a vector \(\mathbf {l_p}\) at pixel location \(\mathbf {p}\) along the time axis, whose \(i\mathrm{th}\) element \(l_p^i\) represents the intensity at \(\mathbf {p}\) in the \(i\mathrm{th}\) frame of the input video \(\mathbf {V}\). If we then calculate a vector \(\mathbf {x_p}\) such that

$$\begin{aligned} \mathbf {x_p}=[l_p^1-l_p^2,l_p^2-l_p^3, \ldots , l_p^{K-1}-l_p^K]^t \end{aligned}$$
(1)

the vector \(\mathbf {x_p}\) will be sparse if \(\mathbf {p}\) belongs to the background of the video. We can represent Eq. 1 as \(\mathbf {x_p}=\mathbf {Dl_p}\), where \(\mathbf {D}\) is the variation matrix. Taking the consecutive element-wise differences, as shown in Eq. 1, \(\mathbf {D}\) can be expressed as

$$\mathbf {D}= \left[ \begin{array}{ccccc} 1 & -1 & 0 & \ldots & 0\\ 0 & 1 & -1 & \ldots & 0\\ & & & \ddots & \\ 0 & 0 & \ldots & 1 & -1 \end{array} \right] _{(K-1)\times K}$$
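
For concreteness, \(\mathbf {D}\) can be constructed directly as a first-order difference matrix. The following NumPy sketch (the function name `make_diff_matrix` is our own, not from the paper) builds \(\mathbf {D}\) and illustrates that \(\mathbf {x_p}=\mathbf {Dl_p}\) is sparse for a nearly constant background profile:

```python
import numpy as np

def make_diff_matrix(K: int) -> np.ndarray:
    """Build the (K-1) x K variation matrix D of Eq. 1."""
    D = np.zeros((K - 1, K))
    idx = np.arange(K - 1)
    D[idx, idx] = 1.0       # +1 on the main diagonal
    D[idx, idx + 1] = -1.0  # -1 on the first superdiagonal
    return D

# A background pixel has a nearly constant intensity profile,
# so x_p = D @ l_p is (approximately) a sparse vector:
l_p = np.array([120.0, 120.0, 121.0, 120.0, 120.0])
x_p = make_diff_matrix(5) @ l_p   # [0, -1, 1, 0] -- mostly zeros
```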

Using this characterization of a background pixel, we consider a vector \(\mathbf {v_p}\) at any pixel location \(\mathbf {p}\) in the \(M\times N\) frame, whose \(i\mathrm{th}\) element \(v_p^i\) represents the intensity at \(\mathbf {p}\) in the \(i\mathrm{th}\) frame of the input video \(\mathbf {V}\). To obtain the background intensity, we estimate \(\mathbf {l_p}\) from \(\mathbf {v_p}\) such that \(\mathbf {Dl_p}\) is a sparse vector. We define the optimization problem as

$$\begin{aligned} \begin{aligned}&\underset{\mathbf {l_p}}{\text {minimize }} \quad \{\left\| \mathbf {v_p- l_p}\right\| _2^2+\lambda \left\| \mathbf {Dl_p}\right\| _0\} \\&\text {subject to} \quad \lambda \ge 0 \end{aligned} \end{aligned}$$
(2)

where \(\left\| .\right\| _2\) and \(\left\| .\right\| _0\) denote the \(l_2\) and \(l_0\) norms of a vector, respectively. The first term is the data fidelity term; the second term ensures that the estimated vector \(\mathbf {l_p}\) is smooth. The non-negative weight \(\lambda \) determines the level of smoothness in the final estimate: as \(\lambda \) increases, the estimated \(\mathbf {l_p}\) becomes smoother, i.e., \(\mathbf {Dl_p}\) becomes sparser.

As the optimization problem defined in Eq. 2 is non-convex, the estimation of the optimal \(\mathbf {l_p}\) is NP-hard. To reduce the computational complexity while keeping the concept of sparsity intact, we replace the \(l_0\) norm with the \(l_1\) norm [2, 4] and modify the optimization problem of Eq. 2 as

$$\begin{aligned} \begin{aligned}&\underset{\mathbf {l_p}}{\text {minimize}} \quad \{\left\| \mathbf {v_p- l_p}\right\| _2^2+\lambda \left\| \mathbf {Dl_p}\right\| _1\} \\&\text {subject to} \quad \lambda \ge 0 \end{aligned} \end{aligned}$$
(3)

To solve the convex problem defined in Eq. 3, we apply the iterative reweighted norm (IRN) approach. IRN uses the concept of the iteratively reweighted least squares (IRLS) method to convert the \(l_p\) norm of a vector into a weighted \(l_2\) norm. This solves the optimization problem in fewer iterations [16], as the \(l_2\) norm is differentiable and leads to a closed-form solution with an iterative update of the weight matrix. A simplified form of IRN states that the \(l_p\) norm minimization of \(\mathbf {q}=[q_1,q_2,\ldots q_n]^t\) can be solved as a weighted least squares problem,

$$\begin{aligned} \left\| \mathbf {q}\right\| _p^p=\sum _j |q_j|^p=\left\| \mathbf {R}^{1/2}\mathbf {q}\right\| _2^2 \end{aligned}$$
(4)

where \(\mathbf {R}\) is a diagonal matrix with each diagonal element defined as \(\mathbf {R}^{1/2}_{i,i}=(|q_i|^{1-p/2}+\epsilon )^{-1}\), and \(\epsilon \) is a small positive constant added to avoid division by zero [16].
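
As an illustration, for \(p=1\) (the case needed for Eq. 3) the diagonal of \(\mathbf {R}^{1/2}\) reduces to \((|q_i|^{1/2}+\epsilon )^{-1}\). A minimal sketch of this weight update, with our own naming:

```python
import numpy as np

def irn_weights(q: np.ndarray, p: float = 1.0, eps: float = 1e-6) -> np.ndarray:
    """Diagonal of R^{1/2} so that ||R^{1/2} q||_2^2 approximates ||q||_p^p (Eq. 4)."""
    return 1.0 / (np.abs(q) ** (1.0 - p / 2.0) + eps)
```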

Using the concept of IRLS, we modify Eq. 3 and define the cost function \(C(\mathbf {l_p}^{(k)})\) as

$$\begin{aligned} C(\mathbf {l_p}^{(k)})=\frac{1}{2}\left\| \mathbf {v_p- l_p}^{(k)}\right\| _2^2+\frac{\lambda }{2}\left\| {\mathbf {R}^{(k)}}^{1/2}\mathbf {Dl_p}^{(k)}\right\| _2^2 \end{aligned}$$
(5)

where the weighting matrix \({\mathbf {R}^{(k)}}^{1/2}\) is calculated considering \(\mathbf {q}^{(k)}=\mathbf {Dl_p}^{(k-1)}\).

To minimize the cost function, we differentiate the right-hand side of Eq. 5 with respect to \(\mathbf {l_p}^{(k)}\) and set the result to zero, which gives \((\mathbf {l_p}^{(k)}-\mathbf {v_p})+\lambda \mathbf {D}^t\mathbf {R}^{(k)}\mathbf {D}\mathbf {l_p}^{(k)}=\mathbf {0}\). Rearranging yields

$$\begin{aligned} \mathbf {l_p}^{(k)}=(\lambda \mathbf {D}^t\mathbf {R}^{(k)}\mathbf {D}+\mathbf {I})^{-1}\mathbf {v_p}\end{aligned}$$
(6)
$$\begin{aligned} \mathbf {l_p}^{(k)}= \mathbf {\Psi }^{-1}\mathbf {v_p} \end{aligned}$$
(7)

where \(\mathbf {I}\) is the identity matrix of dimension \(K \times K\) and \(\mathbf {\Psi }=\lambda \mathbf {D}^t\mathbf {R}^{(k)}\mathbf {D}+\mathbf {I}\). We terminate the iterations when \(\mathcal {Q}(\mathbf {l_p}^{(k)})-\mathcal {Q}(\mathbf {l_p}^{(k-1)})=\mathbf {0}\), where \(\mathcal {Q}\) quantizes each element of a vector to its nearest integer and \(\mathbf {0}\) is the null vector, or if \(\text {rank}(\mathbf {\Psi })<K\). It is important to note that \(\mathbf {\Psi }\) is symmetric, since \(\mathbf {R}^{(k)}\) is diagonal, and that \(\mathbf {\Psi }\) has a diagonal loading, i.e., the matrix \(\mathbf {\Psi }\) is invertible even if \(\mathbf {D}^t\mathbf {R}^{(k)}\mathbf {D}=\mathbf {0}\). The linear system \(\mathbf {\Psi }\mathbf {l_p}^{(k)}=\mathbf {v_p}\) defined in Eq. 7 can be solved for \(\mathbf {l_p}^{(k)}\) using Newton's method without performing the matrix inversion explicitly [13].
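
Putting Eqs. 5–7 together, the background profile at one pixel can be estimated by alternating the weight update and the linear solve. The sketch below uses a direct dense solver instead of the Newton-type solver of [13], and a stopping heuristic based on the quantizer \(\mathcal {Q}\); the function and parameter names are our own assumptions:

```python
import numpy as np

def estimate_background_profile(v_p, lam=100.0, eps=1e-6, max_iter=100):
    """IRN estimate of the smooth background profile l_p (Eqs. 5-7).

    v_p : length-K array, intensity at one pixel location over time.
    """
    K = v_p.shape[0]
    D = make_diff_matrix(K)            # variation matrix from the sketch above
    l_p = v_p.astype(float).copy()     # initialize with the observed profile
    prev = None
    for _ in range(max_iter):
        q = D @ l_p                              # q^(k) = D l_p^(k-1)
        r = irn_weights(q, p=1.0, eps=eps) ** 2  # diagonal of R^(k) for l1
        Psi = lam * (D.T * r) @ D + np.eye(K)    # Psi = lam D^t R D + I
        l_p = np.linalg.solve(Psi, v_p)          # Eq. 7 without explicit inverse
        rounded = np.rint(l_p)                   # quantizer Q(.)
        if prev is not None and np.array_equal(rounded, prev):
            break                                # Q(l^(k)) - Q(l^(k-1)) = 0
        prev = rounded
    return l_p
```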

Finally, we construct the two videos \(\mathbf {L}\) and \(\mathbf {S}\), where \(\mathbf {L}\) contains the background information of the input video and \(\mathbf {S}\), defined as \(\mathbf {S}=\mathbf {V}-\mathbf {L}\), contains the motion information of \(\mathbf {V}\). The intensity values at pixel location \(\mathbf {p}\) in the \(i\mathrm{th}\) frame are \(l_p^i\) and \(s_p^i\) for videos \(\mathbf {L}\) and \(\mathbf {S}\), respectively, where \(l_p^i\) is the \(i\mathrm{th}\) element of the estimated vector \(\mathbf {l_p}^{(k)}\).

To summarize the video frames, we first calculate the background image B as

$$\begin{aligned} b_p=\frac{\sum _{i=1}^K l_p^i}{K} \end{aligned}$$

where \(b_p\) is the intensity at pixel \(\mathbf {p}\) in background B.
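
Since every pixel location is processed independently, the decomposition is embarrassingly parallel. A sequential sketch of the full decomposition and background image computation follows; in practice the inner loop would be distributed over cores, and the array layout and names are our assumptions:

```python
import numpy as np

def decompose(V, lam=100.0):
    """Decompose a K x M x N video V into L, S, and background image B."""
    K, M, N = V.shape
    L = np.empty((K, M, N))
    for y in range(M):              # independent per pixel -> parallelizable
        for x in range(N):
            L[:, y, x] = estimate_background_profile(V[:, y, x], lam)
    S = V - L                       # feature (residual) video
    B = L.mean(axis=0)              # background image, b_p = (1/K) sum_i l_p^i
    return L, S, B
```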

Next, we uniformly sub-sample the feature video \(\mathbf {S}\) with sampling rate z to construct a video \(\mathbf {S}_u\) such that

$$\begin{aligned} S_u^j=S^{zj};\;\; j=1,2,3\ldots \lfloor K/z \rfloor \end{aligned}$$
(8)

where \(S_u^j\) and \(S^j\) are the \(j\mathrm{th}\) frames of videos \(\mathbf {S}_u\) and \(\mathbf {S}\), respectively.

As \(\mathbf {V}=\mathbf {L}+\mathbf {S}\), neither \(\mathbf {S}_u\) nor \(\mathbf {S}\) contains the actual intensity values of a moving object. We therefore apply adaptive thresholding on video \(\mathbf {S}_u\) to extract the moving object. If \(\mathbf {F}\) is the uniformly sampled motion-segmented video, then,

$$\begin{aligned} f_p^j={\left\{ \begin{array}{ll} v_p^j & \text {if }|s_{u_p}^j|\ge \tau _j\\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(9)

where \(s_{u_p}^j\) is the intensity at pixel location \(\mathbf {p}\) in the \(j\mathrm{th}\) frame of video \(\mathbf {S}_u\) and \(\tau _j\) is an adaptive threshold calculated as \(\tau _j=\mu _j+\sigma _j\), where \(\mu _j\) and \(\sigma _j\) are the mean and standard deviation of frame \(S_u^j\), respectively, and \(j\in J_u\) with \(J_u=\{1,2,3,\ldots ,\lfloor K/z \rfloor \}\).
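
A sketch of Eqs. 8–9, assuming the arrays produced by the decomposition sketch above (0-based frame indices, so frame \(zj\) of the paper is index \(zj-1\)):

```python
import numpy as np

def uniform_motion_frames(V, S, z):
    """Subsample S at rate z (Eq. 8) and keep pixels above tau_j (Eq. 9)."""
    K = S.shape[0]
    F = []
    for j in range(1, K // z + 1):
        t = z * j - 1                               # frame zj in 0-based indexing
        s = S[t]
        tau = s.mean() + s.std()                    # adaptive threshold tau_j
        f = np.where(np.abs(s) >= tau, V[t], 0.0)   # true intensities from V
        F.append(f)
    return np.stack(F)
```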

We generate the uniformly summarized image \(I_u\) as

$$\begin{aligned} i_{u_p}={\left\{ \begin{array}{ll} f_p^{j} & \text {if }f_p^{j}\ne 0\text { for any }j\in J_u \\ b_p & \text {otherwise} \end{array}\right. } \end{aligned}$$
(10)

where \(i_{u_p}\) is the intensity value at pixel location \(\mathbf {p}\) in \(I_u\).
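
The compositing step of Eq. 10 pastes the segmented motion pixels over the background. Eq. 10 leaves open which \(j\) to use when several frames are nonzero at the same pixel; the sketch below keeps the first hit:

```python
def composite(F, B):
    """Paste nonzero motion pixels of F over background B (Eqs. 10 and 14)."""
    I = B.copy()
    painted = np.zeros(B.shape, dtype=bool)
    for f in F:                      # scan frames; keep the first nonzero hit
        hit = (f != 0) & ~painted
        I[hit] = f[hit]
        painted |= hit
    return I

# Usage (z is the sampling rate):
# L, S, B = decompose(V)
# I_u = composite(uniform_motion_frames(V, S, z=10), B)
```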

Though the uniformly summarized image is satisfactory for simple videos, it performs poorly if the scene contains non-uniform motion, acceleration, or multiple moving objects. Thus, we define another approach that summarizes an input video non-uniformly.

To do so, we first segment the motion information present in the input video \(\mathbf {V}\) using the information in \(\mathbf {S}\). If \(\mathbf {U}\) is the final motion-segmented video, then,

$$\begin{aligned} u_p^i={\left\{ \begin{array}{ll} v_p^i & \text {if }|s_p^i|\ge \tau _i\\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(11)

where \(u_p^i\) and \(s_p^i\) are the intensities at pixel location \(\mathbf {p}\) in the \(i\mathrm{th}\) frames of videos \(\mathbf {U}\) and \(\mathbf {S}\), respectively, and \(\tau _i\) is an adaptive threshold calculated as \(\tau _i=\mu _i+\sigma _i\), where \(\mu _i\) and \(\sigma _i\) are the mean and standard deviation of frame \(S^i\), respectively.

Next, we select the indices of the frames such that

$$\begin{aligned} J_n=\{i: d(U^i) \cap d(U^h)=\phi ,\; i,h\in \{1,\ldots ,K\},\; i\ne h\} \end{aligned}$$
(12)

where d(.) is the dilation operation performed on a frame, \(\phi \) is the null matrix, and \(\cap \) computes the spatial intersection of the nonzero elements of two images. We thus obtain a sampled video \(\mathbf {F}_n\) such that

$$\begin{aligned} F_n^i=U^i;\;\; i\in J_n \end{aligned}$$
(13)

where \(F_n^i\) and \(U^i\) are the \(i\mathrm{th}\) frames of videos \(\mathbf {F}_n\) and \(\mathbf {U}\), respectively. We construct the final non-uniformly summarized image \(I_n\) as

$$\begin{aligned} i_{n_p}={\left\{ \begin{array}{ll} f^{i}_{n_p} & \text {if }f^{i}_{n_p}\ne 0\text { for any }i\in J_n \\ b_p & \text {otherwise} \end{array}\right. } \end{aligned}$$
(14)

where \({f^{i}_{n_p}}\) is the intensity at pixel location \(\mathbf {p}\) in the \(i\mathrm{th}\) frame of video \(\mathbf {F}_n\).
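
Eq. 12 defines \(J_n\) implicitly through pairwise empty intersections; one way to realize it is a greedy scan that accepts a frame only if its dilated motion mask does not touch any previously accepted mask. A sketch of Eqs. 11–14 under this greedy interpretation (the `start` parameter reflects the initialization sensitivity discussed in Sect. 3; all names are ours):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def nonuniform_summary(V, S, B, start=0):
    """Greedy realization of Eqs. 11-14: non-overlapping motion snapshots."""
    K = S.shape[0]
    covered = np.zeros(B.shape, dtype=bool)
    I = B.copy()
    order = list(range(start, K)) + list(range(start))  # scan from `start`
    for i in order:
        s = S[i]
        mask = np.abs(s) >= s.mean() + s.std()   # Eq. 11 segmentation U^i
        dil = binary_dilation(mask)              # d(U^i) in Eq. 12
        if mask.any() and not (dil & covered).any():   # empty intersection
            I[mask] = V[i][mask]                 # paste true intensities (Eq. 14)
            covered |= dil                       # accepted frame: i is in J_n
    return I
```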

3 Experimental Results

To validate the decomposition and summarization algorithms, we test them on different input videos. Figure 2a shows a frame from a typical input video, and Fig. 2c shows the corresponding frame from the feature video \(\mathbf {S}\). In Fig. 3a, we show the estimated \(\mathbf {l_p}\) for different values of \(\lambda \); the input vector \(\mathbf {v_p}\) is the intensity variation over time at the center of the red circle shown in Fig. 2a. The change in rank of the video \(\mathbf {L}\) over the iterations is shown in Fig. 3b, where the rank is calculated as described in [7]. Note that in our previous work [2], we reported a parallelizable decomposition method based on the majorization-minimization algorithm. However, the algorithm in [2] takes a large number of iterations (\({\sim }500\)) to complete the decomposition, whereas, as shown in Fig. 3b, the proposed decomposition minimizes the rank in a much smaller number of steps (\({\sim }60\)) without increasing the complexity of the algorithm. Thus, the proposed decomposition algorithm is faster than that of [2]. In Fig. 3c and d, we compare the execution times of the two decomposition algorithms on different datasets, which further validates this claim. As the decomposition algorithms are pixel based, they are parallelizable, and increasing the number of processor cores reduces the execution time in both cases.

Fig. 2 a Frame of an input video; b estimated background; c corresponding frame from the feature video \(\mathbf {S}\)

Fig. 3 a Estimation of \(\mathbf {l_p}\) for different values of \(\lambda \); b rank of \(\mathbf {L}\) versus iteration for \(\lambda =100\); c execution time of [2] for different numbers of cores on different datasets; d execution time of the proposed algorithm

Fig. 4 a–d Frames of input videos; e uniformly summarized images \(I_u\); f non-uniformly summarized images \(I_n\)

In Fig. 4, we show the outputs of both summarization algorithms for different input videos. In Fig. 4a–d, we show frames of the input videos. All the videos in the dataset contain complex motions such as acceleration, multiple objects, and nonlinear motion. Figure 4e shows the corresponding summarized images using the uniform summarization method, and Fig. 4f shows the summarized images using the non-uniform method. However, as mentioned in Sect. 2, the uniformly summarized image \(I_u\) may contain distortion due to overlapping regions; in Fig. 4e, we mark the overlapping regions with black rectangles. As shown in Fig. 4f, the non-uniformly summarized images \(I_n\) are free from such artifacts. An interesting property of the non-uniform summarization algorithm is that the summarized image \(I_n\) may differ for the same input video depending on the frame used to initialize the summarization process. This is depicted in Fig. 5: for the same input videos, Fig. 5a and b show the final non-uniformly summarized images initialized from the first frame and the last frame, respectively.

Fig. 5 a Non-uniformly summarized image starting from the first frame; b non-uniformly summarized image starting from the last frame

Besides making an event easier to interpret, the video summarization algorithm drastically reduces the file size, as it generates a single image as the final output. In Table 1, the sizes of the input videos and of the summarized images \(I_u\) and \(I_n\) are shown. The execution times of both proposed summarization algorithms are also tabulated in Table 1. It is evident that the space-time requirements of the two algorithms are comparable. All evaluations were done in MATLAB 2013 using 4 cores on an Intel(R) Core(TM) i7-4770 3.90 GHz processor with 8 GB RAM.

Table 1 File sizes and execution times after summarization for different videos

4 Conclusion

Storage and interpretation of videos require a large amount of resources. It is crucial to develop algorithms that can represent an input video using as few resources as possible without disturbing the flow of the events. In this paper, we proposed two algorithms that summarize an input video into a single image using uniform and non-uniform sampling of the video frames. The methods consume little disk space and execute quickly, as the entire pipeline is pixel based and can be run in parallel. Though the uniformly summarized image may contain some distortion depending on the content of the input video, the non-uniformly summarized image is free from such distortion. However, the non-uniform summarization requires slightly more resources.

Though the proposed summarization algorithms work for static cameras, designing similar algorithms for videos with camera motion remains future work.