Abstract
In this paper, we present a novel video stabilization method based on a pixel-wise motion model. To avoid the distortion introduced by traditional feature-point based motion models, we focus on constructing a more accurate model of the motion in videos. By taking advantage of dense optical flow, we can obtain the dense motion field between adjacent frames and set up a sufficiently accurate pixel-wise motion model. Our method first estimates the dense motion field between adjacent frames using a proposed PatchMatch based algorithm, which is specially designed for similar video frames rather than arbitrary images in order to achieve higher speed and better performance. Then, a simple and fast smoothing algorithm is applied to stabilize the jittered motion. After that, we warp the input frames using a weighted average algorithm to construct the output frames. Some pixels in the output frames may still be empty after the warping step, so in the last step these empty pixels are filled using a patch based image completion algorithm. We test our method on many challenging videos and demonstrate the accuracy of our model and the effectiveness of our method.
1 Introduction
For amateurs, a hand-held video camera is an easy-to-use tool for shooting videos. It can be used conveniently by people without any knowledge of photography, even a child, and has therefore become more and more popular in recent years. Meanwhile, the quality of videos captured with hand-held cameras is still much lower than that of videos captured with professional devices. The main problem for videos captured by hand-held cameras is jitter. Professionals use special equipment, such as dolly tracks and steadicams, to keep the camera path smooth and under control. The path of a hand-held camera, however, cannot be well controlled because of hand shake and the movement of the photographer. Jitter greatly decreases the quality of videos. Therefore, a technique for removing jitter from videos, called video stabilization, is necessary.
Morimoto et al. [21] abstract video stabilization into three steps: motion estimation, motion compensation and image reconstruction. Prior video stabilization methods can be generally divided into two categories based on the motion model they use in the first two steps. The first class is 2D methods. In these methods, every frame is considered as a single plane and a 2D transformation (usually affine or perspective) between two frames is used to model the motion. The video is stabilized by smoothing the parameters of these transformation matrices. These methods are generally computationally efficient because the models they use are very simple. On the other hand, such simple models cannot handle videos with complex scenes well, such as scenes containing multiple planes or large depth changes. The second class is 3D methods. These methods recover the full 3D structure of the scenes in the jittered videos. After that, videos can easily be stabilized by directly smoothing or designing the camera path. 3D methods can stabilize videos with complex scenes on which 2D methods may fail. However, this class of methods depends heavily on Structure from Motion (SfM) to recover the 3D structure of the scenes. Since SfM methods are not robust enough, 3D stabilization methods cannot work on videos without enough 3D structure information, and they are usually very slow.
Most previous methods set up a motion model based on feature point correspondences (trajectories). However, in the image reconstruction step, every pixel in the output video frames, rather than only several feature points, must be assigned a value. In previous methods, homographies are widely used to solve this problem. Many works use single homography warping to generate output frames. This is a 2D assumption which suffers from the disadvantages of all 2D methods, i.e. it is not capable of modelling complex scenes. In order to improve on single homography warping, Liu et al. [13] propose a new warping method named content preserving warping (CPW). This is a warping method using multiple homographies: it splits each frame into many grids and warps each grid using a different homography. While remaining robust, this technique generates more plausible results than single homography warping. CPW improves the image reconstruction step, but its motion estimation and motion compensation steps are still based on feature points. Liu et al. [16] propose a method which introduces a mesh-based idea into the motion estimation step. This method splits every input frame into grids in the first place and then models camera motion with a bundle of camera paths, each of which belongs to a grid cell. An optimization based smoothing algorithm is performed on this bundle of camera paths, and then the multiple-homography warping of CPW can be combined with this method seamlessly. Compared to other works, this method greatly reduces the use of feature points: they are used only to extract the motion of grids between adjacent frames, and they are matched only between adjacent frames.
From the above we can see that recent methods try to avoid the disadvantages of using feature points. [13] and [16] try to handle all pixels in the frames rather than only the detected feature points. In fact, a feature-point guided homography cannot represent the motion of a whole frame well, because the number of feature points is usually much smaller than the number of pixels in one frame. Homographies are used to warp all pixels in each frame, but they are calculated using only a few feature points. Therefore, the performance is not uniform within a frame: it is better in areas with more feature points and worse in areas with fewer.
In this paper, we aim to stabilize jittered videos using a motion model which represents the motion between frames more precisely than previous models based on feature points. To this end, we propose a pixel-wise motion model which calculates the dense motion field for every pair of adjacent frames. This pixel-wise motion model is based on dense optical flow using nearest neighbour fields [1]. The corresponding position of a pixel in another frame is calculated by finding the nearest neighbour patch. With the help of dense optical flow, correspondences of all pixels between adjacent frames can be found, and thus the motion model is dense and uniform. Every pixel takes part in the stabilization procedure, so its stabilized position does not need to be approximated from other pixels (as in warping with homographies). Therefore, this dense motion model is more precise than models based on sparse feature points. In addition to the new motion model, we also propose a simple and fast smoothing algorithm, a pixel-wise frame warping algorithm and a patch based image completion algorithm to complete the stabilization.
The main novelty of our work is that it stabilizes videos in a pixel-wise manner throughout the whole stabilization process. [16], which handles videos with a regular grid mesh, is the most similar method to ours so far. In fact, our method can be considered as the extreme case of [16], i.e., the grid size is only one pixel. However, the algorithm of our method is totally different from that of [16]. The contributions of our method lie in the following aspects:
-
1.
A novel dense motion estimation algorithm is proposed. A correspondence for every pixel is searched for in adjacent frames. Since there is strong spatial coherency between adjacent frames, we take advantage of it to greatly narrow down the searching range.
-
2.
A pixel-wise frame warping algorithm is proposed. We use a weighted average strategy so that most pixels in the output frames get correct values. Some empty pixels may still exist after warping, so a simple patch based image completion algorithm is proposed to fill them.
2 Related works
2.1 Video stabilization methods
According to the dimensionality of the motion model they use, previous video stabilization methods can be divided into two categories: 2D methods and 3D methods.
2.1.1 2D methods
The basic assumption of 2D methods is that the whole scene lies on a single plane. Under this assumption, all pixels in one frame can be handled by one single transformation matrix. Some early methods such as [20] use a four-parameter rigid transformation and some such as [9] use a six-parameter affine transformation. Later 2D methods such as [5, 19] usually use an eight-parameter perspective transformation because it is more flexible. Litvin et al. [12] construct a chain of affine transformations among consecutive frames; they treat jitter as noise in the chain and propose a probabilistic model to filter the noise. This method obviously suffers from cumulative error: as the frame number increases, the error becomes larger and larger. To avoid cumulative error, Matsushita et al. [19] propose a local method that filters the motion of each frame only within a few neighbouring frames. Gleicher et al. [5] infer the user's intention in operating the camera and carefully design the transformations to fulfill the intended motion rather than simply filtering the transformations. Their method assumes the video segments to be either static or moving and calculates the transformation matrices using a mosaic-based technique, which can generate a film-like effect. Similar to [5], Grundmann et al. [7] classify camera motion into three categories, constant, linear and parabolic, depending on the high-order derivatives of the trajectories; the trajectories are then smoothed using L1-optimization. Rather than compensating inter-frame transformations directly, Lee et al. [11] seek the transformations that make the trajectories as smooth as possible. Walha et al. [25] present a video stabilization and moving object detection system based on camera motion estimation. Most methods use SIFT [17] or KLT [23] to find and track feature points. Instead, Okade et al. [22] use Maximally Stable Extremal Regions (MSER), which are fast and robust for the purpose of video stabilization.
Many recent methods, which also belong to the 2D category, directly calculate the stabilized positions of feature points. To avoid distortion, they take different constraints into consideration. Liu et al. [14] propose a video stabilization method based on the observation that smoothed trajectories should approximately lie in a low-rank subspace; they enforce subspace constraints on the trajectory matrix to obtain the smoothed trajectories. To avoid reconstructing 3D structure, Goldstein et al. [6] present a method that smooths trajectories utilizing the constraints of epipolar geometry. Wang et al. [26] filter trajectories while keeping similarity constraints among neighbouring trajectories using triangulation. Wang et al. [27] detect multiplane structure from videos and then stabilize every plane independently. Liu et al. [16] improve upon [13]: frames are divided into grids immediately after the trajectory extraction step and the motion of the grids is calculated from the motion of the trajectories. Not only the warping step but also the stabilization step is performed using multiple homographies.
2D methods are usually fast and robust, but they cannot handle videos with complex scenes because of the single plane assumption. If the scene is constructed by more than one plane, the motion between frames cannot be modelled using only one homography. In contrast, the motion model of our method is much more flexible. Since we deal with every pixel separately, the motion between two frames is modelled using dense optical flow. Therefore, no matter how complex the scene is, our method can handle it well.
2.1.2 3D methods
3D methods recover the full 3D structure of the scene and the camera motion. Videos can then be stabilized easily by smoothing or planning the camera path directly. Zhang et al. propose a 3D video stabilization method [28]. It requires the intrinsic and extrinsic camera parameters of every frame as input to the stabilization step and calculates the stabilized camera parameters using optimization. Liu et al. [13] propose a new warping method in order to generate perceptually plausible output frames. This method divides every frame into grids and applies spatially-varying warps according to the smoothed trajectories. Zhou et al. propose another 3D method [29]. It detects planes from videos and improves [13] to achieve higher performance on videos with multiple planes. Chouvatut et al. [3] estimate camera motion and reconstruct a human face from a video sequence.
If the dense 3D structure of the scene can be recovered, 3D methods can generate results easily. However, recovering dense 3D structure from videos is very difficult. In fact, nearly all previous 3D methods recover only sparse 3D structure, except that Liu et al. [15] use a depth camera to obtain additional dense depth information. Dense optical flow is also used in [19], but only to find the 2D transformation matrix between two frames. Recently, Liu et al. [24] propose a method which also stabilizes videos in a pixel-wise manner. Instead of smoothing trajectories, this method smooths pixel profiles, which collect the motion vectors at a fixed pixel location, to stabilize jittered videos.
3D methods can generate many satisfying results, but they are usually not robust enough. They depend strongly on Structure from Motion (SfM) methods, which are used to recover the 3D structure of the video. However, even state-of-the-art SfM methods may fail in some situations, leading to failure cases in 3D video stabilization methods. In contrast, our method depends on dense optical flow, which seldom fails. Therefore, our method is more robust than 3D methods.
2.2 PatchMatch based dense optical flow
Optical flow is a basic and important technique in computer vision which has been studied for a very long time. Sparse optical flow methods extract and track feature points in videos, while dense optical flow methods do so for every pixel. In sparse methods, only highly reliable points are chosen as feature points, so the number of feature points is much smaller than the number of all pixels. There is a large body of work on optical flow. Here we only discuss the works that are strongly related to ours.
PatchMatch [1] is an algorithm for matching similar patches between a pair of images. For one patch in one image, it can find the best-matching patch in the other image by calculating a Nearest-Neighbour Field (NNF). Instead of using a naive brute-force search, [1] proposes a fast searching algorithm based on the observation that, since images are continuous, the displacements of the best matches of neighbouring patches should also be close to each other. Block based motion estimation can also be used; some fast techniques of this kind are proposed in [4].
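To make the search strategy concrete, the core of PatchMatch (random initialization, scan-order propagation and an exponentially shrinking random search) can be sketched as follows. This is a minimal single-channel NumPy sketch of the general idea, not the implementation of [1]; all function and variable names are our own.

```python
import numpy as np

def patchmatch_nnf(A, B, patch=3, iters=4, seed=0):
    """Simplified single-channel PatchMatch sketch: for each patch in A,
    find an approximately nearest patch in B."""
    rng = np.random.default_rng(seed)
    h, w = A.shape
    hb, wb = B.shape
    # nnf[y, x] = (by, bx): top-left corner of current best match in B
    H, W = h - patch + 1, w - patch + 1
    nnf = np.stack([rng.integers(0, hb - patch + 1, (H, W)),
                    rng.integers(0, wb - patch + 1, (H, W))], axis=-1)

    def dist(ay, ax, by, bx):
        pa = A[ay:ay + patch, ax:ax + patch]
        pb = B[by:by + patch, bx:bx + patch]
        return float(((pa - pb) ** 2).sum())

    best = np.array([[dist(y, x, *nnf[y, x]) for x in range(W)] for y in range(H)])
    for it in range(iters):
        # alternate scan direction between iterations, as in PatchMatch
        ys = range(H) if it % 2 == 0 else range(H - 1, -1, -1)
        xs = range(W) if it % 2 == 0 else range(W - 1, -1, -1)
        step = 1 if it % 2 == 0 else -1
        for y in ys:
            for x in xs:
                # propagation: try the shifted matches of scan-order neighbours
                for dy, dx in ((-step, 0), (0, -step)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        by = min(max(nnf[ny, nx, 0] - dy, 0), hb - patch)
                        bx = min(max(nnf[ny, nx, 1] - dx, 0), wb - patch)
                        d = dist(y, x, by, bx)
                        if d < best[y, x]:
                            best[y, x], nnf[y, x] = d, (by, bx)
                # random search in an exponentially shrinking window
                rad = max(hb, wb)
                while rad >= 1:
                    by = int(np.clip(nnf[y, x, 0] + rng.integers(-rad, rad + 1),
                                     0, hb - patch))
                    bx = int(np.clip(nnf[y, x, 1] + rng.integers(-rad, rad + 1),
                                     0, wb - patch))
                    d = dist(y, x, by, bx)
                    if d < best[y, x]:
                        best[y, x], nnf[y, x] = d, (by, bx)
                    rad //= 2
    return nnf, best
```

Propagation exploits the coherency observation above, while the random search lets good matches escape local minima; together they avoid an exhaustive search over all patch positions.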
PatchMatch [1] has been sped up by many subsequent methods. Korman et al. [10] use hashing to make information propagate faster. He et al. [8] use Propagation-Assisted KD-Trees to avoid the time-consuming backtracking of traditional tree-based methods. PatchMatch [1] has also been used to compute dense optical flow. Chen et al. [2] use the NNF of PatchMatch to compute an initial motion field and refine it as a motion segmentation problem. PatchMatch Filter [18] combines PatchMatch-based randomized search and efficient edge-aware image filtering to obtain a state-of-the-art dense optical flow algorithm with high accuracy and speed.
3 Our approach
3.1 Overview
Given a jittered input video which consists of n frames, we denote the input frames as {I i },i ∈ [1, n]. We assume that the size of the frames is h by w, then every pixel can be denoted by its coordinate x = (x, y), x ∈ [1, h], y ∈ [1, w]. The output of our method is a stabilized video which consists of n frames denoted as {O i }, i ∈ [1, n].
The pipeline of our method is shown in Fig. 1. Our method mainly consists of four steps. The first is the motion estimation step. For every frame in the input video, we detect the motion of every pixel between this frame and each adjacent frame to form a dense motion field. For every pixel x in frame I i , we find the corresponding pixel in frame I i−1 and denote its position as \(C_{i}^{i-1}(\mathbf {x})\). Similarly, we also find the corresponding pixel \(C_{i}^{i+1}(\mathbf {x})\) in frame I i+1. To achieve high performance and efficiency, we propose a PatchMatch based dense optical flow algorithm which focuses on adjacent video frames with strong spatial coherency.
The second step is the motion smoothing step. In this step, we calculate the stabilized position S i (x) for every pixel x in input frame I i . As when smoothing a trajectory, any low-pass filter can be used; in our experiments a Gaussian filter is used for simplicity. Although we only know the motion between adjacent frames, we can calculate the motion between any two frames by chaining the motion of every pair of adjacent frames between them.
The third step is the frame warping step. After the two steps above, we know the stabilized position of every pixel in the input frames. To construct the output frames, we need to warp I i to O i using these correspondences. If the stabilized positions were all integers, this step could be done easily by directly copying values. But since the original integer coordinates are smoothed in the second step, the stabilized positions may not be integers. We therefore use a weighted average algorithm inspired by bilinear interpolation to calculate an average value for every pixel in the output frames.
The last step is the image completion step. It cannot be guaranteed that all pixels in the output frames are assigned a value in the third step, so we propose an image completion algorithm to fill the empty pixels. Since the number of empty pixels is very small compared to the total number of pixels, we calculate the corresponding position in the input frame for every empty pixel using a patch based algorithm, and the corresponding pixel's value is copied to fill the empty pixel.
3.2 Motion estimation
In this section, we describe the dense optical flow based motion estimation algorithm in detail. For every pixel x in every input frame I i , the goal of our motion estimation algorithm is to find the corresponding positions \(C_{i}^{i-1}(\mathbf{x})\) in I i−1 and \(C_{i}^{i+1}(\mathbf{x})\) in I i+1.
Denote the patch centered at x in I i as P i (x). For every pixel x in I i , we get the corresponding pixel in I i−1 by searching for the nearest neighbour patch of P i (x):

$$ C_{i}^{i-1}(\mathbf{x})=\underset{\mathbf{y}}{\arg\min}\,||P_{i}(\mathbf{x})-P_{i-1}(\mathbf{y})|| $$(2)
where ||⋅|| denotes the L 2 norm. Obviously, it is not practical to perform this search exhaustively. Moreover, since most areas of an image are continuous, a spatial coherency constraint should hold, so (2) is improved to:

$$ C_{i}^{i-1}(\mathbf{x})=\underset{\mathbf{y}}{\arg\min}\left(||P_{i}(\mathbf{x})-P_{i-1}(\mathbf{y})||+w\sum\limits_{\mathbf{z}\in N(\mathbf{x})}||\mathbf{y}-C_{i}^{i-1}(\mathbf{z})||\right) $$(3)
where N(x) contains the four neighbour pixels of x and w is a weight to balance the data term and smoothing term.
It is very difficult to solve (3) directly. PatchMatch [1] uses an iterative strategy: in every iteration, for each pixel it examines the patches suggested by two of its neighbour pixels as well as a randomly sampled patch, and the result is updated with the best candidate. This strategy greatly reduces computation, but since PatchMatch is designed for two arbitrary images, the searching range is still too wide for adjacent frames.
We still take advantage of the iterative strategy of PatchMatch. Moreover, we greatly narrow down the searching range for every pixel based on the following observation: if two pixels are neighbours in one frame, their corresponding pixels in the other frame will be very near to each other, usually within several pixels. Since two adjacent frames are captured almost at the same time under the same scene (except for sudden scene changes, which are generally not considered in video stabilization methods), the motion between these two frames approximates a translation. It is obvious that two neighbouring pixels are still neighbours after a translation.
Now we explain our motion estimation algorithm in detail, assuming we are dealing with frames I i and I i−1.
-
1.
At first, we initialize the dense motion field \(C_{i}^{i-1}(\mathbf {x})\) with random values and run the original PatchMatch for 1-2 iterations. After that, a large number of pixels in I i are correctly matched to I i−1. Though not all pixels are matched correctly, the overall motion can be detected and used to further refine the motion field.
-
2.
Next, we calculate the average motion for every pixel using neighbour pixels:
$$ \overline{C}_{i}^{i-1}(\mathbf{x})=\frac{{\sum}_{\mathbf{y}\in N(\mathbf{x})}C_{i}^{i-1}(\mathbf{y})}{|N(\mathbf{x})|} $$(4)

If \(||\overline {C}_{i}^{i-1}(\mathbf {x})-C_{i}^{i-1}(\mathbf {x})||>th\), we replace \(C_{i}^{i-1}(\mathbf {x})\) with \(\overline {C}_{i}^{i-1}(\mathbf {x})\). Here |N(x)| denotes the number of elements in N(x), and th is set to 5 in our experiments.
-
3.
Then we perform a modified version of PatchMatch. In this process, the window of random search is limited to a small window around \(C_{i}^{i-1}(\mathbf{x})\) (30 × 30 in our experiments) and we use the energy term in (3) instead of (2) to calculate the distance between two patches.
-
4.
Repeat steps 2 and 3 for 3-5 iterations to get the final dense motion field.
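The neighbour-averaging refinement in step 2 above can be sketched as follows. This is a minimal NumPy illustration under our own naming, not the authors' code; we assume the replacement test is performed against the original (unmodified) field, which is why the update is written to a copy.

```python
import numpy as np

def average_refine(C, th=5.0):
    """Step 2 sketch: replace motion vectors that deviate from the mean of
    their 4-neighbours by more than `th` pixels with that mean.
    C has shape (h, w, 2): C[y, x] is the motion vector at pixel (y, x)."""
    h, w, _ = C.shape
    out = C.copy()  # test against the original field, write into a copy
    for y in range(h):
        for x in range(w):
            nbrs = [C[ny, nx]
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= ny < h and 0 <= nx < w]
            avg = np.mean(nbrs, axis=0)
            if np.linalg.norm(avg - C[y, x]) > th:
                out[y, x] = avg  # outlier: snap to the neighbourhood average
    return out
```

A single isolated outlier in an otherwise smooth field is thus pulled back to its neighbours' consensus, which is exactly the role this step plays before the constrained PatchMatch pass in step 3.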
3.3 Smoothing
When the motion of every pixel is known, the smoothing step is carried out to get the stabilized positions of the pixels. In fact, any low-pass filter can be used; we use a simple Gaussian filter in our experiments.
According to Section 3.2, for every pixel at x in frame I i , we know the corresponding pixel in I i+1 is at \(C_{i}^{i+1}(\mathbf{x})\). Since we also know that the corresponding pixel for \(\mathbf{x}^{\prime}=C_{i}^{i+1}(\mathbf{x})\) in I i+1 is at \(C_{i+1}^{i+2}(\mathbf{x}^{\prime})\) in I i+2, the corresponding pixel of x in frame I i+2 is at \(C_{i+1}^{i+2}(C_{i}^{i+1}(\mathbf{x}))\). In general, we can propagate the motion to get the correspondence of this pixel in any frame. For example, the coordinate of the corresponding pixel in frame I j (j>i) is:

$$ C_{i}^{j}(\mathbf{x})=C_{j-1}^{j}(C_{j-2}^{j-1}({\cdots} C_{i}^{i+1}(\mathbf{x}){\cdots})) $$(5)
For the correspondence in frame I k (k<i), the equation is similar:

$$ C_{i}^{k}(\mathbf{x})=C_{k+1}^{k}(C_{k+2}^{k+1}({\cdots} C_{i}^{i-1}(\mathbf{x}){\cdots})) $$(6)
For every pixel x in frame I i , we calculate its stabilized position using the Gaussian filter as follows (with \(C_{i}^{i}(\mathbf{x})=\mathbf{x}\)):

$$ S_{i}(\mathbf{x})=\frac{{\sum}_{j\in W(i)}g(j-i)\,C_{i}^{j}(\mathbf{x})}{{\sum}_{j\in W(i)}g(j-i)} $$(7)
where W(i) = {j|i − n ≤ j ≤ i + n} contains the indices of the neighbour frames of frame I i and \(g(k)=\frac {1}{\sqrt {2\pi }\sigma }e^{-k^{2}/2\sigma ^{2}}\) is the Gaussian kernel. n controls the size of the frame neighbourhood: a larger value of n leads to smoother results but lower speed. We find that n = 15 is enough for most videos. σ controls the strength of smoothing and we set it to 50.
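Assuming the chained correspondences have already been evaluated into a per-pixel trajectory of positions across frames, the Gaussian smoothing above might be sketched as follows (illustrative NumPy code; the function name and data layout are our own assumptions).

```python
import numpy as np

def gaussian_smooth_positions(traj, n=15, sigma=50.0):
    """Gaussian-filter one pixel's trajectory of positions.
    traj[j] is the pixel's propagated position in frame j, shape (T, 2).
    The window is clipped at the video boundaries and the kernel weights are
    renormalized over the clipped window."""
    traj = np.asarray(traj, dtype=float)
    T = len(traj)
    out = np.empty_like(traj)
    for i in range(T):
        lo, hi = max(0, i - n), min(T - 1, i + n)
        ks = np.arange(lo, hi + 1)
        # unnormalized Gaussian weights; the constant factor cancels below
        g = np.exp(-((ks - i) ** 2) / (2 * sigma ** 2))
        out[i] = (g[:, None] * traj[lo:hi + 1]).sum(axis=0) / g.sum()
    return out
```

Because the kernel is renormalized over the window, a perfectly steady trajectory passes through unchanged, while high-frequency jitter is averaged away.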
Some previous methods, such as [14, 26], smooth trajectories while keeping certain constraints between trajectories. In contrast, we smooth every pixel independently without any constraint and obtain quite good results (please refer to Section 4 and the accompanying video). This is attributed to the patch based dense motion estimation algorithm of Section 3.2: the relationships between neighbouring pixels have already been constrained in the motion estimation step through the use of patches. Similar to the situation between two adjacent frames in Section 3.2, the motion between I i and O i is nearly a translation, so neighbouring pixels are very likely to remain neighbours after stabilization. Therefore, the distortion in the output frames is too small to be observed.
3.4 Frame warping
Now we want to generate every stabilized frame O i by warping I i to O i using the correspondences of all pixels. A simple idea is to copy the value of pixel x in I i to the position S i (x) in O i . Unfortunately, S i (x) is the result of Gaussian smoothing and is usually not an integer position, so it does not correspond to a pixel directly. We tried rounding S i (x) so that pixel values can be copied directly, but as Fig. 2a shows, many pixels are not assigned a value because of the discontinuity introduced by rounding. Instead, we use a weighted average algorithm to calculate the pixel value for every pixel x ′ in O i . In Fig. 2b, we can see that most pixels are assigned a value by our algorithm.
In an ordinary warping process (such as affine or perspective warping), the corresponding position \(S_{i}^{-1}(\mathbf {x^{\prime }})\) (not integer) in input frame I i is known for every pixel x ′ (integer) in output frame O i , which is the inverse of our situation. Bilinear interpolation is generally used in that case to compute the value of every output pixel from the four nearest pixels in the input frame. Inspired by bilinear interpolation, we make every pixel x in I i contribute to the four pixels in O i nearest to S i (x). Every pixel x ′ in O i may be affected by more than one pixel in I i , so a weighted average value is calculated. The weights in our algorithm are the same as in bilinear interpolation. The details of this step are shown in Algorithm 1. Note that the second output \(S_{i}^{-1}(\mathbf {x^{\prime }})\) is used in the image completion step (Section 3.5).
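The weighted average warping can be sketched as follows: each input pixel splats its value onto the four output pixels nearest to its stabilized position, using bilinear weights, and each output pixel divides its accumulated value by its accumulated weight. This is a simplified grayscale NumPy sketch under our own naming, not Algorithm 1 itself.

```python
import numpy as np

def splat_warp(I, S):
    """Forward-warp grayscale frame I of shape (h, w).
    S[y, x] = (sy, sx) is the (possibly non-integer) stabilized position of
    input pixel (y, x). Returns the warped frame and an `empty` mask marking
    output pixels that received no contribution."""
    h, w = I.shape
    acc = np.zeros((h, w), dtype=float)   # accumulated weighted values
    wsum = np.zeros((h, w), dtype=float)  # accumulated weights
    for y in range(h):
        for x in range(w):
            sy, sx = S[y, x]
            y0, x0 = int(np.floor(sy)), int(np.floor(sx))
            fy, fx = sy - y0, sx - x0
            # bilinear weights over the four surrounding output pixels
            for dy, dx, wt in ((0, 0, (1 - fy) * (1 - fx)),
                               (0, 1, (1 - fy) * fx),
                               (1, 0, fy * (1 - fx)),
                               (1, 1, fy * fx)):
                oy, ox = y0 + dy, x0 + dx
                if 0 <= oy < h and 0 <= ox < w and wt > 0:
                    acc[oy, ox] += wt * I[y, x]
                    wsum[oy, ox] += wt
    empty = wsum == 0
    out = np.where(empty, 0.0, acc / np.maximum(wsum, 1e-12))
    return out, empty
```

When S is the identity, every weight collapses onto a single output pixel and the frame is reproduced exactly; the `empty` mask is what the image completion step (Section 3.5) consumes.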
3.5 Image completion
Some pixels in the output frames may still not be assigned a value after the warping step. From Algorithm 1 we can see that our warping method is applied to every pixel in the input frames rather than directly to pixels in the output frames. Every pixel in an input frame affects the four nearest pixels in the output frame, but it is not ensured that all pixels in the output frame are affected. Fortunately, the number of these empty pixels is usually very small; in our experiments it is usually less than 1% of the number of pixels in a frame.
In order to generate complete output frames, an image completion step is proposed. Our image completion algorithm is also based on PatchMatch. The PatchMatch method [1] is originally designed for structural image editing, including image completion. In fact, we could directly use PatchMatch or any other image completion algorithm to fill the empty pixels, but since we already know the correspondences of most pixels between each pair of input and output frames, we can simplify this procedure using this information.
In Algorithm 1, the second output is calculated to assist image completion. \(S_{i}^{-1}(\mathbf {x^{\prime }})\) is the approximate original position in I i of every non-empty pixel x ′ in O i . This is not the exact value, but it narrows down the searching range for nearby empty pixels. First we find all the empty pixels: we set \(empty(\mathbf{x}^{\prime})=true\) for every pixel x ′ that received no contribution in Algorithm 1.
Then, for every empty pixel in O i , we find the patch in I i with the least distance to the patch centered at this pixel, and copy its center pixel's value to fill the empty pixel. To find this patch, we simply apply PatchMatch between O i and I i , with the following modifications to make it work more efficiently:
-
1.
The values of non-empty pixels should not be modified, i.e. the PatchMatch algorithm is applied only to empty pixels. This greatly reduces computation.
-
2.
We use \(S_{i}^{-1}(\mathbf {x})\) as the initial value for every pixel x in O i rather than a random value. The values at non-empty pixels have been calculated in Algorithm 1. Every empty pixel x is initialized as the average of the values at neighbouring non-empty pixels:
$$ S_{i}^{-1}(\mathbf{x})=\frac{{\sum}_{\mathbf{y}\in N_{ne}(\mathbf{x})}S_{i}^{-1}(\mathbf{y})}{|N_{ne}(\mathbf{x})|} $$(9)

where \(N_{ne}(\mathbf{x})=\{\mathbf{y}\,|\,||\mathbf{y}-\mathbf{x}||<th,\ empty(\mathbf{y})=false\}\), \(|N_{ne}(\mathbf{x})|\) is the number of elements in \(N_{ne}(\mathbf{x})\), and th is set to 5 in our experiments.
-
3.
The window of random search is restricted to a window centering at the initial position \(S_{i}^{-1}(\mathbf {x})\). Because of the spatial continuity of video frames, the correspondence of an empty pixel ought to be near those of its neighbouring pixels, so we do not have to search patches far away. The window of random search is set to 30 × 30 in our experiments.
-
4.
Note that the patch centering at one empty pixel may contain other empty pixels. These pixels are ignored when calculating patch distance.
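The neighbour-averaging initialization of (9) in item 2 above can be sketched as follows (hypothetical NumPy code under our own naming; the authors' implementation is not available).

```python
import numpy as np

def init_empty_positions(S_inv, empty, th=5):
    """Initialize S_i^{-1} at empty output pixels as the average of the
    already-known S_i^{-1} values at non-empty pixels within radius th.
    S_inv has shape (h, w, 2); empty is a boolean mask of shape (h, w)."""
    out = S_inv.copy()
    ys, xs = np.nonzero(~empty)  # coordinates of all non-empty pixels
    for ey, ex in zip(*np.nonzero(empty)):
        d2 = (ys - ey) ** 2 + (xs - ex) ** 2
        near = d2 < th * th  # non-empty neighbours within the threshold
        if near.any():
            out[ey, ex] = S_inv[ys[near], xs[near]].mean(axis=0)
    return out
```

This initialization is what lets the restricted random search of item 3 start close to the correct source position instead of at a random location.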
Figure 3 presents some frames before and after our image completion step.
4 Results and comparisons
We have tested our algorithm on many different videos covering a wide range of scenes, both indoor and outdoor. All videos used in our experiments were captured with hand-held cameras and exhibit obvious jitter. The many successful results show that our method is effective and robust. In Fig. 4, we show frames of some results. Since videos cannot be shown in this paper, please refer to the accompanying videos. We also compare our method with some previous methods and discuss the comparison in this section.
Comparison with PatchMatch
The motion estimation step of our method is based on PatchMatch. PatchMatch can find nearest neighbour patches very quickly. Although it takes spatial coherency into consideration, it does not enforce this constraint. In our experiments, however, we find that spatial coherency is very important for stabilization, so we enforce this constraint in our motion estimation step. Figure 5 shows the comparison of our method and PatchMatch on one frame. Figure 5a and b show the spatial coherency energy (the second energy term in (3)) of the matching results of PatchMatch and our algorithm respectively; whiter pixels indicate larger energy at that position. We can see that the spatial coherency energy is much smaller with our algorithm. The final stabilization results using the two algorithms are shown in Fig. 5c and d, and our result is also much better.
User study
A user study was conducted to compare the performance of our method with other methods. Two previous methods are used in this user study: one of them [19] uses single homography warping, and the other [16] uses multiple homography warping. 10 sets of videos are used, and each set consists of an original shaky video and 3 stabilized results (the result of our method and those of the two methods above). Each set of videos was displayed to the 78 participants, who were asked to give a score between 0 and 100 for every stabilization result, with better stabilization receiving a higher score. The results were displayed randomly and anonymously to the participants. Snapshots of the ten shaky videos and the average scores of the 10 video sets are shown in Fig. 6. Detailed information on the 10 shaky videos is given in Table 1.
From Fig. 6 we can see that [19] gets the lowest scores in all cases and the scores of [16] are clearly higher than those of [19]. The scores of our method are very close to those of [16]. Although our scores are slightly lower in most cases, the difference in visual effect is too small to be perceived. Based on our experience, the difference in stability between two videos can hardly be distinguished by human eyes if the difference in scores is less than 10. In the user study, participants were asked to give an accurate score to every video, so they may have watched each video many times carefully before scoring. In practice, when the videos are watched normally only once, the performances are quite comparable. This can be easily seen by taking a glance at the demo provided together with this paper. The comparisons between our method and these two categories of methods are given below.
Comparison with methods using single homography warping
Single homography warping is widely used in early methods, especially 2D methods such as [5, 19]. These methods are based on the single plane assumption, so they cannot handle videos with complex scenes well. Different from these methods, our method directly handles every pixel separately, no matter whether the pixels are on the same plane or not. From Fig. 6 we can see that our method gets much higher scores than [19].
Comparison with methods using multiple homographies warping
Multiple homographies warping was proposed to avoid the disadvantage of single homography warping: different homographies are used in different areas of each frame. This kind of method includes [13, 16] and [26]. Going a step further, our method calculates the stabilized position of every pixel in the frames, which generates results comparable to state-of-the-art methods. From Fig. 6 we can see that the scores of our method are very close to those of [16].
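The pixel-wise idea can be sketched as a forward warp driven by a dense displacement field (a simplified illustration under our own naming, not the paper's actual weighted average warping code): every pixel carries its own motion, and pixels that receive no source value remain empty, to be filled by the completion step.

```python
import numpy as np

def warp_pixelwise(frame, dx, dy):
    """Forward-warp each pixel of a grayscale frame by its own (dx, dy).

    No planarity assumption is needed, since every pixel moves
    independently; destination pixels that receive no source value
    remain empty (zero) and would later be filled by image completion.
    """
    h, w = frame.shape
    out = np.zeros_like(frame)
    for y in range(h):
        for x in range(w):
            tx = x + int(round(dx[y, x]))
            ty = y + int(round(dy[y, x]))
            if 0 <= tx < w and 0 <= ty < h:
                out[ty, tx] = frame[y, x]
    return out
```

A zero field reproduces the input, while an arbitrary per-pixel field can move neighboring pixels differently, which is what single or piecewise homographies cannot express exactly.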
Quantitative evaluation
Evaluating the stability of a video with a quantitative metric is a difficult problem, and most previous methods offer only subjective evaluations or comparisons. In [16], three metrics are proposed to evaluate the quality and stability of result videos. However, the stability metric of [16] requires the bundled camera paths computed by that method, so it is not suitable for our method. We therefore propose another metric that is applicable to other methods as well. Between each pair of adjacent frames, the correspondence of every pixel is already known, and the distance between corresponding pixels is the motion of that pixel. We use the average motion of all pixels in a video to represent the motion of the video. The motion of the original videos and of the results of our method and of [16, 19] is shown in Table 2. We can see that the motion of shaky videos is greatly reduced by our method.
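The metric can be computed directly from the dense flow fields (a minimal sketch with our own function name; the flow layout `(H, W, 2)` holding per-pixel `(dx, dy)` is an assumption):

```python
import numpy as np

def average_motion(flows):
    """Mean per-pixel motion magnitude over a whole video.

    `flows` is a list of dense flow fields, one per adjacent frame
    pair, each of shape (H, W, 2) holding the (dx, dy) displacement
    of every pixel. Lower values indicate a more stable video.
    """
    mags = [np.hypot(f[..., 0], f[..., 1]).mean() for f in flows]
    return float(np.mean(mags))
```

Because it only needs pixel correspondences between adjacent frames, the metric can score any method's output, not just ours.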
Limitation
The major limitation of our method is its low computation speed. Video processing is usually time-consuming because a video contains many frames; although none of the four steps of our method is particularly expensive, the total time needed to stabilize a video is still large. Another limitation is that our method does not work well on moving objects, because their motion is irregular and difficult to estimate.
Future work
One direction for future work is to speed up the method. The high precision of our method slows down the computation; perhaps not every pixel needs to be handled, which would reduce the computational burden. We also need to improve the performance on moving objects; one idea is to segment moving objects from the video and handle them separately. Finally, there is so far no widely accepted numerical standard for evaluating the stability of videos, so results are usually evaluated and compared by human eyes. To avoid subjective evaluation, a numerical stability metric is strongly needed.
5 Conclusions
In this paper, we present a novel pixel-wise video stabilization method. Instead of a model based on sparse feature points, a dense motion field is used to model the motion between adjacent frames. A patch based motion estimation algorithm is proposed to obtain the dense motion field, followed by a Gaussian smoothing step. Frames are warped using a weighted average warping algorithm, and empty pixels are filled using a patch based image completion algorithm. Our method handles videos with complex scenes well, and many successful results show its effectiveness.
References
Barnes C, Shechtman E, Finkelstein A, Goldman D (2009) Patchmatch: a randomized correspondence algorithm for structural image editing. In: ACM Transactions on Graphics-TOG, vol 28. doi:10.1145/1531326.1531330
Chen Z, Jin H, Lin Z, Cohen S, Wu Y (2013) Large displacement optical flow from nearest neighbor fields. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2443–2450. doi:10.1109/CVPR.2013.316
Chouvatut V, Madarasmi S, Tuceryan M (2013) 3D face and motion estimation from sparse points using adaptive bracketed minimization. In: Multimedia tools and applications, vol 63, pp 569–589. doi:10.1007/s11042-011-0925-8
Furht B, Greenberg J, Westwater R (1996) Motion Estimation Algorithms for Video Compression. Springer
Gleicher ML, Liu F (2008) Re-cinematography: Improving the camera dynamics of casual video. In: ACM Transactions on Multimedia Computing, Communications, and Applications, vol 5. doi:10.1145/1404880.1404882
Goldstein A, Fattal R (2012) Video stabilization using epipolar geometry. In: ACM Transactions on Graphics(TOG), vol 31. doi:10.1145/2231816.2231824
Grundmann M, Kwatra V, Essa I (2011) Auto-directed video stabilization with robust l1 optimal camera paths. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 225–232. doi:10.1109/CVPR.2011.5995525
He K, Sun J (2012) Computing nearest-neighbor fields via propagation-assisted kd-trees. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 111–118. doi:10.1109/CVPR.2012.6247665
Irani M, Rousso B, Peleg S (1994) Recovery of ego-motion using image stabilization. In: 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp 454–460. doi:10.1109/CVPR.1994.323866
Korman S, Avidan S (2011) Coherency sensitive hashing. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp 1607–1614. doi:10.1109/ICCV.2011.6126421
Lee KY, Chuang YY, Chen BY, Ouhyoung M (2009) Video stabilization using robust feature trajectories. In: 2009 IEEE 12th International Conference on Computer Vision, pp 1397–1404. doi:10.1109/ICCV.2009.5459297
Litvin A, Konrad J, Karl WC (2003) Probabilistic video stabilization using Kalman filtering and mosaicing. In: Symposium on Electronic Imaging, Image and Video Communications, pp 663–674. doi:10.1117/12.476436
Liu F, Gleicher M, Jin H, Agarwala A (2009) Content-preserving warps for 3D video stabilization. In: ACM Trans. Graphic, vol 28. doi:10.1145/1531326.1531350
Liu F, Gleicher M, Wang J, Jin H, Agarwala A (2011) Subspace video stabilization. In: ACM Trans. Graphic, vol 30. doi:10.1145/1899404.1899408
Liu S, Wang Y, Yuan L, Bu J, Tan P, Sun J (2012) Video stabilization with a depth camera. In: Proceedings of the CVPR, pp 89–95. doi:10.1109/CVPR.2012.6247662
Liu S, Yuan L, Tan P, Sun J (2013) Bundled camera paths for video stabilization. In: ACM Transactions on Graphics (TOG), vol 32. doi:10.1145/2461912.2461995
Lowe DG (1999) Object recognition from local scale-invariant features. In: The proceedings of the seventh IEEE international conference on Computer vision, vol 2, pp 1150–1157. doi:10.1109/ICCV.1999.790410
Lu J, Yang H, Min D, Do MN (2013) Patch match filter: Efficient edge-aware filtering meets randomized search for fast correspondence field estimation. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1854–1861. doi:10.1109/CVPR.2013.242
Matsushita Y, Ofek E, Tang X, Shum HY (2005) Full-frame video stabilization. In: Proceedings of the CVPR, pp 50–57. doi:10.1109/CVPR.2005.166
Morimoto C, Chellappa R (1996) Automatic digital image stabilization. In: Proceedings of the CVPR
Morimoto C, Chellappa R (1998) Evaluation of image stabilization algorithms. In: 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol 5, pp 2789–2792. doi:10.1109/ICASSP.1998.678102
Okade M, Biswas PK (2014) Video stabilization using maximally stable extremal region features. In: Multimedia Tools and Applications, vol 68, pp 947–968. doi:10.1007/s11042-012-1095-z
Shi J, Tomasi C (1994) Good features to track. In: 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp 593–600. doi:10.1109/CVPR.1994.323794
Liu S, Yuan L, Tan P, Sun J (2014) Steadyflow: Spatially smooth optical flow for video stabilization. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 4209–4216. doi:10.1109/CVPR.2014.536
Walha A, Wali A, Alimi AM (2014) Video stabilization with moving object detecting and tracking for aerial video surveillance. In: Multimedia Tools and Applications, pp 1–23. doi:10.1007/s11042-014-1928-z
Wang Y, Liu F, Hsu P, Lee T (2013) Spatially and temporally optimized video stabilization. In: IEEE Transactions on Visualization and Computer Graphics, pp 1354–1361. doi:10.1109/TVCG.2013.11
Wang ZQ, Zhang L, Huang H (2013) Multiplane video stabilization. In: Computer Graphics Forum, vol 32, pp 265–273. doi:10.1111/cgf.12234
Zhang G, Hua W, Qin X, Shao Y, Bao H (2009) Video stabilization based on a 3D perspective camera model. In: The Visual Computer, vol 25, pp 997–1008. doi:10.1007/s00371-009-0310-z
Zhou Z, Jin H, Ma Y (2013) Plane-based content-preserving warps for video stabilization. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2299–2306. doi:10.1109/CVPR.2013.298
Acknowledgments
We would like to thank the anonymous reviewers for their helpful advice. We also thank the participants in our user study. This work is supported by the National Natural Science Foundation of China under Project No. 61133008 and No. 61202180.
Wang, Z., Huang, H. Pixel-wise video stabilization. Multimed Tools Appl 75, 15939–15954 (2016). https://doi.org/10.1007/s11042-015-2907-8