1 Introduction

With the spread of portable video devices such as mobile phone cameras, camcorders, and action cams, many people now own personal video imaging devices, and it has become increasingly popular to use them to share footage or personal broadcasts. Research on recognizing such video content has also been carried out [19]. However, personal video imaging devices often suffer from image shake caused by movement of the user and camera. To address this problem, researchers have developed image stabilization techniques of four types: 1) mechanical, 2) optical, 3) electronic, and 4) digital [5].

The mechanical, optical, and electronic image stabilization techniques achieve good stabilization results, but only through additional hardware-based compensation. Such compensation has disadvantages, including an increase in the size of the imaging device and the cost of the additional hardware. Therefore, much research has focused on digital image stabilization, a purely software-based technique consisting of three stages: 1) motion estimation [7], 2) motion compensation [17], and 3) motion inpainting [15]. Since each of the three stages is a research field in its own right, we focus on the motion estimation stage, which estimates camera motion.

Currently, there are two main approaches to motion estimation. The first is intensity-based estimation, which uses changes in the color information (intensity) of the image frames [2, 14, 22, 26]. When the input video is split into frames, the interval between consecutive frames is very short (about 33 ms at 30 fps). A pixel therefore changes little between frames, and camera motion can be estimated by tracking pixel intensities. This approach requires only simple per-pixel operations and is thus computationally fast, but it is sensitive to illumination changes, motion blur, and similar effects. The second approach is feature point-based estimation, which extracts feature points such as edges and corners visible in successive frames [3, 4, 8, 9, 12, 13, 16, 18, 21, 24, 25]. Because the extracted feature points are largely unaffected by the imaging environment, this approach yields robust and accurate motion estimation under various environmental conditions, and it is consequently used in many areas [23]. Its drawback is that the computation time can be much longer than for the intensity-based approach. In addition, a method exploiting the relationship between successive images has also been proposed [20].

In general, the motion estimation stage assesses camera motion in three steps: 1) local motion estimation, 2) accuracy improvement, and 3) global motion estimation. When motion estimation is performed in three steps, the accuracy improvement step that separates local from global motion can take a long time, and because it relies on RANSAC (RANdom SAmple Consensus), the image stabilization may have to be repeated. Therefore, one-step motion estimation techniques with a simple framework have been studied [13].

In this paper, we propose a method that reduces motion estimation to a one-step process and eliminates the problems that can occur in the accuracy improvement stage by using the k-means clustering algorithm. We further propose an algorithm that minimizes the residual error of feature matching by optimizing the transformation matrix with a similarity constraint-based optimization algorithm.

The remainder of this paper is organized as follows. Section 2 explains the proposed algorithm in detail, Section 3 presents experimental results, and Section 4 summarizes the contributions of this research.

2 Digital image stabilization using similarity-constrained optimization

2.1 System overview

In this section, we propose an advanced digital image stabilization method based on similarity-constrained optimization, as shown in Fig. 1. To improve image stabilization accuracy, feature points are extracted using the moment-based speeded-up robust features (MSURF) algorithm [10], a modified version of the conventional SURF algorithm that uses Gaussian-Hermite moments [11]. MSURF improves the robustness of the image features and provides better matching accuracy.

Fig. 1 Architecture of the proposed method

Conventional digital image stabilization techniques use the RANSAC algorithm in the motion estimation stage. Because RANSAC relies on a random function, it can produce different results each time it is run, and its computation time is long. We therefore propose a k-means clustering algorithm combined with a similarity constraint-based optimization technique to solve these problems.

2.2 MSURF-based feature point detection

The first step in digital image stabilization is to divide the input video into frames and obtain the transformation matrix T representing the movement between them. In this paper, the MSURF algorithm [10] is applied for more accurate matching, since it offers better feature matching performance than SURF [1], which in turn is similar to the scale-invariant feature transform (SIFT) algorithm [18] but faster. Figure 2 shows an overview of the MSURF algorithm applied to an input image. While conventional SURF uses the Hessian matrix, MSURF uses the modified discrete Gaussian-Hermite moment (MDGHM) matrix for position x = (i,j) of each feature point as follows:

$$ \mathbf{M}(\mathbf{x},\sigma)= \left[\begin{array}{ll} \hat{\eta}_{p,0}(\mathbf{x},\sigma) & \hat{\eta}_{p,q}(\mathbf{x},\sigma)\\\hat{\eta}_{p,q}(\mathbf{x},\sigma) & \hat{\eta}_{0,q}(\mathbf{x},\sigma)\end{array}\right], $$
(1)

where \(\hat {\eta }_{p,q}(\mathbf {x},\sigma )\) is the MDGHM [11] at feature position x on input image I, and is defined as

$$ \hat{\eta}_{p,q}(i,j,\sigma) = \frac{4}{(M-1)(N-1)}\sum\limits_{u = 1}^{M}{\sum\limits_{v = 1}^{N}{I\left( u + i - {}^{M}/{}_{2},v + j - {}^{N}/{}_{2} \right){{{\hat{H}}}_{p}}\left( x,\sigma \right){{{\hat{H}}}_{q}}\left( y,\sigma \right)}}, $$
(2)
$$ \left\{ \begin{array}{l} \hat{H}_{p}(x,\sigma)= \frac{2}{M-1}\frac{1}{\sqrt{{{2}^{p}}p!\sqrt{\pi }\sigma }}\exp \left( -{{{x}^{2}}}/{2{{\sigma }^{2}}} \right){{H}_{p}}\left( {x}/{\sigma } \right),\\ \hat{H}_{q}(y,\sigma)= \frac{2}{N-1}\frac{1}{\sqrt{{{2}^{q}}q!\sqrt{\pi }\sigma }}\exp \left( -{{{y}^{2}}}/{2{{\sigma }^{2}}} \right){{H}_{q}}\left( {y}/{\sigma } \right), \end{array} \right. $$
(3)
$$ \left\{ \begin{array}{l} {{H}_{p}}\left( {x}/{\sigma } \right)={{\left( -1 \right)}^{p}}\exp \left( \frac{{{x}^{2}}}{{{\sigma }^{2}}} \right)\frac{{{d}^{p}}}{d{{t}^{p}}}\exp \left( \frac{-{{x}^{2}}}{{{\sigma }^{2}}} \right),\\ {{H}_{q}}\left( {y}/{\sigma } \right)={{\left( -1 \right)}^{q}}\exp \left( \frac{{{y}^{2}}}{{{\sigma }^{2}}} \right)\frac{{{d}^{q}}}{d{{t}^{q}}}\exp \left( \frac{-{{y}^{2}}}{{{\sigma }^{2}}} \right). \end{array} \right. $$
(4)
Fig. 2 Overview of the MSURF algorithm

In (2), M × N is the size of the mask centered at feature position x on input image I, (u,v) indexes positions within the mask, and (x,y) are the corresponding kernel coordinates. By using (1) to (4), the MSURF descriptor \(\hat {\mathbf {v}}\) is generated as

$$ \hat{\mathbf{v}}= \left( \sum{{{{\hat{\eta }}}_{p,0}}},\sum{{{{\hat{\eta }}}_{0,q}}},\sum{\left| {{{\hat{\eta }}}_{p,0}} \right|},\sum{\left| {{{\hat{\eta }}}_{0,q}} \right|} \right). $$
(5)
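For concreteness, the following Python sketch evaluates the weighted 1-D Gaussian-Hermite kernels of (3)-(4) and the MDGHM of (2) for a single image patch. The normalization of the mask coordinates to [-1, 1] is our illustrative assumption; the exact discretization follows [11].

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval  # physicists' Hermite H_p

def gh_kernel(p, coords, sigma, M):
    """Weighted 1-D Gaussian-Hermite kernel of Eq. (3)."""
    c = np.zeros(p + 1); c[p] = 1.0            # select the order-p polynomial
    Hp = hermval(coords / sigma, c)            # H_p(x / sigma), Eq. (4)
    norm = (2.0 / (M - 1)) / sqrt(2.0**p * factorial(p) * sqrt(pi) * sigma)
    return norm * np.exp(-coords**2 / (2.0 * sigma**2)) * Hp

def mdghm(patch, p, q, sigma):
    """MDGHM of Eq. (2) for one M x N patch centered on a feature point.
    Mask coordinates normalized to [-1, 1] are an illustrative choice."""
    M, N = patch.shape
    hx = gh_kernel(p, np.linspace(-1.0, 1.0, M), sigma, M)
    hy = gh_kernel(q, np.linspace(-1.0, 1.0, N), sigma, N)
    return 4.0 / ((M - 1) * (N - 1)) * (hx @ patch @ hy)
```

The four descriptor sums in (5) are then accumulated from such moments over the sub-regions around each feature point.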

As a result, robust feature points can be extracted using (5). Figure 3 shows an example input image to which the MSURF algorithm is applied. The center of each green circle in the figure marks a feature point of the input image frame, and the size of the circle represents the strength of the feature point.

Fig. 3 Example image for applying the MSURF algorithm

Using the MSURF algorithm, feature point matching between consecutive image frames can be performed. Figure 4 shows the distribution of matching points between consecutive image frames.

Fig. 4 Feature matching between two consecutive images

Since the locations of feature points can be found by matching them between successive frames, the final transformation matrix T can be obtained using a similarity transformation matrix, which maintains and compensates for the size and direction of structures in the image frames.
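A minimal sketch of this detection-and-matching step is shown below. Since MSURF has, to our knowledge, no publicly available implementation, OpenCV's ORB detector stands in for it here; the surrounding flow (detect, describe, cross-check match, collect matched coordinates) is the same.

```python
import cv2
import numpy as np

def match_features(prev_gray, next_gray, n_features=500):
    """Detect features in two consecutive gray frames and match them.
    ORB is a stand-in for MSURF [10], which has no OpenCV binding."""
    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(next_gray, None)
    # Cross-checked brute-force matching keeps only mutual best matches
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    pts_prev = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts_next = np.float32([kp2[m.trainIdx].pt for m in matches])
    return pts_prev, pts_next  # (n, 2) matched coordinate arrays
```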

2.3 Outlier elimination by k-means clustering

When feature point matching using the MSURF algorithm is performed, matches with outliers may occur. Since these can ultimately lead to calculating a wrong transformation matrix, outliers must be removed before the final transformation matrix is computed. Conventional image stabilization algorithms address this problem with the RANSAC algorithm. However, since RANSAC suffers from non-repeatable results and long computation time, research into alternatives is ongoing.

In this study, we applied the k-means clustering algorithm to solve this problem. After the feature points obtained through the MSURF algorithm are divided into k clusters, each cluster has one centroid. If the transformation matrix T is computed from these center points, the effect of the RANSAC algorithm is achieved because the influence of the outliers is minimized. In addition, this solves the problem of the transformation matrix being biased toward regions where feature points are concentrated within the image frame. This local bias problem must be solved before calculating the final transformation matrix; otherwise, the transformation of a specific region is overemphasized and an incorrect transformation matrix is calculated. Figure 5 shows the matching using the center point of each cluster, and a sketch of this centroid-based matching follows the figure.

Fig. 5 Matching using each cluster center point
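The sketch below illustrates the centroid-based matching. The paper does not specify the feature space used for clustering; clustering the 4-D joint vectors (x_t, x_{t+1}) and fixing the k-means seed for repeatability are our assumptions.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

def centroid_matches(pts_prev, pts_next, K=20):
    """Replace n raw matches by K cluster-centroid pairs to damp outliers."""
    joint = np.hstack([pts_prev, pts_next])              # (n, 4) joint vectors
    labels = KMeans(n_clusters=K, n_init=10,
                    random_state=0).fit_predict(joint)   # fixed seed: repeatable
    c_prev = np.array([pts_prev[labels == k].mean(axis=0) for k in range(K)])
    c_next = np.array([pts_next[labels == k].mean(axis=0) for k in range(K)])
    return c_prev.astype(np.float32), c_next.astype(np.float32)

# Initial similarity transform (scale, rotation, translation) from the centroids.
# With outliers already damped, a robust least-squares fit suffices:
# c_prev, c_next = centroid_matches(pts_prev, pts_next)
# T0, _ = cv2.estimateAffinePartial2D(c_prev, c_next, method=cv2.LMEDS)  # 2 x 3
```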

In the k-means clustering algorithm, setting an appropriate K is very important. A commonly applied rule of thumb is

$$ K \approx \sqrt{{n}/{2}}, $$
(6)

where n is the total number of data items. Since the number of feature points in an image is usually several tens to several hundreds, the K value determined by (6) produces too many clusters to resolve the local bias problem.

In this study, we instead determine the K value using the elbow method. As K increases, the sum of squared errors (SSE) of the k-means clusters is calculated. The SSE decreases rapidly while K is below the true number of clusters and slowly once K exceeds it, so K is chosen at the point where the rate of decrease changes. Figure 6 shows a comparison of inter-frame transformation fidelity (ITF) performance according to the K value for the 100th and 101st frames.

Fig. 6 ITF results according to K value

The minimum K value started at a reference value of 5, which was set as the minimum when the image frame was divided, and the ITF value can be seen to increase with K and then decrease again. Although this result differed slightly for each image frame, K was fixed at 20, the value showing the most stable performance; a sketch of the elbow-based selection is given below.
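The following sketch implements the elbow selection under our assumptions: the k-means inertia serves as the SSE, the seed is fixed, and the 0.9 flattening threshold is an illustrative choice not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_K(points, k_min=5, k_max=40):
    """Pick K where the SSE curve stops dropping steeply (elbow method)."""
    ks = list(range(k_min, k_max + 1))
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0)
           .fit(points).inertia_ for k in ks]
    drops = np.diff(sse)                       # negative values: SSE decreases
    for i in range(1, len(drops)):
        if drops[i] > 0.9 * drops[i - 1]:      # decrease slowed markedly
            return ks[i]
    return ks[-1]
```

Here `points` is the matched feature data being clustered, as in the centroid sketch above.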

2.4 Similarity-constrained optimization

Since the k-means clustering algorithm does not use the random function found in RANSAC, similar results were obtained over repeated experiments on the same image. Figure 7 compares the outlier elimination results of RANSAC and the k-means clustering-based method over 100 trials on the same frame. Although both methods produced similar mean results, the standard deviation (SD) of the k-means clustering-based method was about 57% lower than that of RANSAC, as shown in Table 1.

Fig. 7 Comparison of feature matching accuracy for RANSAC and the k-means clustering-based method: a an example image from the test sequences, b result data for the ITF metric

Table 1 ITF result comparison between RANSAC and k-means

The feature points in the next frame can be predicted from those in the previous frame by using the transformation matrix T, which represents the inter-frame conversion relation. We denote the ith feature point on an arbitrary frame t by \(\mathbf {x}^{i}_{t}\), and let \(\mathbf {x}^{i}_{t}\) move to \(\mathbf {x}^{i}_{t + 1}\) through the transformation matrix T. In this feature matching process, two types of error may occur, as shown in Fig. 8.

Fig. 8 Errors in feature matching

In Fig. 8, error #1 is the deviation between the predicted point T\(\mathbf{x}^{i}_{t}\) and the actual location of the corresponding feature point \(\mathbf {x}^{i}_{t + 1}\) on frame (t + 1). Conversely, when \(\mathbf {x}^{i}_{t + 1}\) is mapped back to frame t through the inverse transformation matrix T− 1, it does not land exactly on \(\mathbf {x}^{i}_{t}\), which produces error #2. These errors are combined into a residual error er as defined in (7), and the components of the transformation matrix T are optimized in the direction that minimizes this residual error.

$$ {e}_{r}=\sum\limits_{i = 1}^{n}{\left\{ {{\left[ T\times \mathbf{x}_{t}^{i}-\mathbf{x}_{t + 1}^{i} \right]}^{2}}+{{\left[ {{T}^{-1}}\times \mathbf{x}_{t + 1}^{i}-\mathbf{x}_{t}^{i} \right]}^{2}} \right\}}. $$
(7)
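In code, with T as the 3 × 3 homogeneous transformation matrix and the matched points stored as (n, 2) arrays, the residual error of (7) can be evaluated as follows.

```python
import numpy as np

def residual_error(T, pts_t, pts_t1):
    """Bidirectional residual error e_r of Eq. (7).
    T: 3 x 3 homogeneous transform; pts_t, pts_t1: (n, 2) matched points."""
    ones = np.ones((len(pts_t), 1))
    P_t, P_t1 = np.hstack([pts_t, ones]), np.hstack([pts_t1, ones])
    fwd = (P_t @ T.T)[:, :2] - pts_t1                  # error #1: T x_t vs x_{t+1}
    bwd = (P_t1 @ np.linalg.inv(T).T)[:, :2] - pts_t   # error #2: T^-1 x_{t+1} vs x_t
    return float(np.sum(fwd**2) + np.sum(bwd**2))
```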

The transformation matrix T consists of a 2 × 2 rotation matrix and a 2 × 1 translation component, and it is classified as a similarity or affine transformation matrix according to the characteristics of the rotation matrix component.

The similarity transformation matrix has the property that the 2 × 2 rotation component is scaled uniformly, so the size and direction of structures in the image can be kept constant. In contrast, the components of an affine transformation matrix vary independently, giving it more degrees of freedom for compensating the whole image region than the similarity transformation matrix. However, if an optimization algorithm is applied repeatedly to minimize the residual error, these high degrees of freedom distort the size and direction of the structures in the image. Figure 9 shows the result images obtained using either a similarity or an affine matrix as the transformation matrix.

Fig. 9 Comparison of stabilization results: a similarity matrix, b affine matrix

If the similarity transformation matrix is used as in Fig. 9a, the size and direction of the structure in the image will be maintained even if the optimization algorithm is executed several times. However, if the affine transformation matrix is used as in Fig. 9b, there is a problem in that the size and direction of the structures in the image become distorted. Therefore, if the optimization algorithm is used in the digital image stabilization technique, it is advantageous to use the constraint that can maintain the similarity characteristic.

To use the similarity transformation property, defined by a nonlinear equation, as a constraint condition, we employ a Lagrange multiplier λ, the standard method for imposing a constraint equation on the transformation matrix T. The constraint term is defined as

$$ \lambda \left( {{t}_{11}}^{2}+{{t}_{21}}^{2}-{{t}_{12}}^{2}-{{t}_{22}}^{2} \right),\text{ for }T=\left[ \begin{array}{lll} {{t}_{11}} & {{t}_{12}} & {{t}_{13}} \\ {{t}_{21}} & {{t}_{22}} & {{t}_{23}} \\ 0 & 0 & 1 \end{array} \right]. $$
(8)

The similarity transformation matrix T has the nonlinear property \({t}_{11}^{2}+{{t}_{21}}^{2}-{{t}_{12}}^{2}-{{t}_{22}}^{2}= 0\); thus, if the Lagrange multiplier term is used as the constraint, this expression can be kept close to 0 at all times. Finally, from (7) and (8), we define the Lagrangian that reduces the residual error as

$$ L\left( \mathbf{x}_{t}^{i},\lambda \right)={{e}_{r}}+\lambda \left( {{t}_{11}}^{2}+{{t}_{21}}^{2}-{{t}_{12}}^{2}-{{t}_{22}}^{2} \right). $$
(9)

The objective function er in (9) minimizes the positional error for accurate feature point matching between the previous frame and the next frame, while the latter part of (9) is the constraint that maintains the characteristics of the similarity transformation matrix via the Lagrange multiplier. We therefore solve (9) with the similarity-constrained optimization method. Since (9) involves several nonlinear equations, we use sequential quadratic programming (SQP), a representative nonlinear optimization technique.
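A minimal sketch of this step is given below, using SciPy's SLSQP solver as the SQP implementation (the paper does not name a specific solver) together with the residual_error() defined after (7).

```python
import numpy as np
from scipy.optimize import minimize

def optimize_similarity(T0, pts_t, pts_t1):
    """Refine T by minimizing e_r of Eq. (7) under the similarity
    constraint of Eq. (8), via sequential quadratic programming."""
    def unpack(v):                      # v = (t11, t12, t13, t21, t22, t23)
        return np.array([[v[0], v[1], v[2]],
                         [v[3], v[4], v[5]],
                         [0.0,  0.0,  1.0]])

    objective = lambda v: residual_error(unpack(v), pts_t, pts_t1)
    # Similarity property: t11^2 + t21^2 - t12^2 - t22^2 = 0
    constraint = {"type": "eq",
                  "fun": lambda v: v[0]**2 + v[3]**2 - v[1]**2 - v[4]**2}
    res = minimize(objective, np.asarray(T0)[:2, :].ravel(),  # MSURF estimate
                   method="SLSQP", constraints=[constraint],
                   options={"ftol": 0.1, "maxiter": 100})     # delta-f tolerance
    return unpack(res.x)
```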

The selection of the initial starting point is a very important issue in optimization; in the proposed method, the search starts as near as possible to the global minimum because we initialize with the transformation matrix obtained from MSURF. The stopping criteria are also important, because otherwise the optimization algorithm may require excessive computation time or diverge. Figure 10 shows the conditions for finding the minimum residual error point from the initial conditions of SQP. The tolerances used as stopping conditions are set so that the change Δx in the transformation matrix T components is 10−10 and the change Δf in the residual error er is 0.1. In practice, since both the change in the T components and the change in the residual error are checked during each iteration, the minimum point is found after a few dozen iterations.

Fig. 10 Conditions for finding points with minimum residual error using SQP

Figure 11 shows an example of the residual error optimization results obtained with sequential quadratic programming under these settings. The residual position error, which the initial k-means clustering step had already reduced to approximately 4.05, converges to around 1.6 as the optimization proceeds. Finally, the variation of the position error converges to within the threshold of 0.1, and the minimum point is declared after 24 iterations in total.

Fig. 11 Residual error optimization using SQP

Through the similarity-constrained optimization process, we obtain a robust transformation matrix T and use it to stabilize the image sequence.

3 Experiments

3.1 Experimental environment

3.1.1 Image dataset and comparison methods

We conducted an experiment to investigate the accuracy of the image stabilization technique. Figure 12 shows example images of the five types of image sequences used for evaluation. Figure 12a is a typical static scene sequence with no changes in the background, while Fig. 12b was taken by a camera moving from left to right together with the background; since the camera and background move at the same time, the camera's moving path cancels out, and the images reflect only changes in the shooting angle. Figure 12c contains a complex background with an excessive amount of feature information, which can be a problem for image stabilization techniques using feature points or subspace detection, while Fig. 12d is a drone hovering sequence of a kind that often occurs in real life. Finally, Fig. 12e is a dynamic sequence in which the camera and foreground objects move simultaneously.

Fig. 12 Example images of the five image sequence conditions for evaluation: a fixed camera, b horizontally moving camera with fixed objects, c fixed complex background, d forward-moving camera with momentary vertical and horizontal motion, e forward-moving camera with forward-moving objects

To compare image stabilization performance, we selected four methods: the original data without stabilization, the methods from [13] and [6], and the proposed method, and evaluated them on the five image sequences shown in Fig. 12.

3.1.2 Evaluation

To compare image stabilization performance, we use ITF to evaluate the relative stabilization results between the image frames, which can be defined as

$$ ITF=\frac{1}{{{N}_{F}}-1}\sum\limits_{k = 1}^{{{N}_{F}}-1}{PSNR\left( k \right)},\ $$
(10)

where NF represents the total number of frames, and the peak signal-to-noise ratio (PSNR) represents the maximum signal-to-noise ratio; this is derived as

$$ PSNR\left( k \right)= 10\log_{10} \frac{{I}_{MAX}^{2}}{MSE\left( k \right)}, $$
(11)

in which

$$ MSE\left( k \right)=\frac{1}{N}\sum\limits_{x = 0}^{w-1}{\sum\limits_{y = 0}^{h-1}{{{\left( {{I}_{k}}\left( x,y \right)-{{I}_{k + 1}}\left( x,y \right) \right)}^{2}},\text{ for}\left( x,y \right)\in {{S}_{nb}},}} $$
(12)

where w, h, and IMAX represent the image width, height, and peak intensity value of input image I, respectively; and N and Snb denote the number of non-blank pixels and the non-blank region of input image I, respectively. The value Ik(x,y) in (12) represents the intensity at position (x,y) in frame k, and the difference in brightness at the same position between consecutive frames is treated as the error.
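A reference implementation of (10)-(12) for a gray-scale sequence is sketched below; treating exact-zero pixels as the blank region introduced by frame warping is our simplification.

```python
import numpy as np

def itf(frames, i_max=255.0):
    """Mean inter-frame PSNR (ITF) of Eqs. (10)-(12) over a sequence."""
    psnrs = []
    for a, b in zip(frames[:-1], frames[1:]):
        nb = (a > 0) & (b > 0)                        # non-blank region S_nb
        diff = a[nb].astype(np.float64) - b[nb].astype(np.float64)
        mse = np.mean(diff**2)                        # Eq. (12), N = nb.sum()
        psnrs.append(10.0 * np.log10(i_max**2 / mse)) # Eq. (11)
    return float(np.mean(psnrs))                      # Eq. (10)
```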

3.1.3 Experimental results

We evaluated accuracy for the five cases. In cases (a) and (c), the camera position was fixed, so they can be considered static scenes, whereas in cases (b), (d), and (e), the camera position changed continuously, so they can be considered dynamic scenes. Figure 13 illustrates the frame-by-frame experimental results for the five image sequences, and Table 2 shows the mean and SD of the ITF values. In Table 2, a higher mean ITF indicates more accurate image stabilization, while a lower ITF SD indicates more consistent stabilization.

Fig. 13 Experimental results for the five image sequences

Table 2 Mean and SD for the ITF values

As shown in Table 2, for most of the image dataset, the mean and SD of the ITF values obtained with the proposed method were better than those of the existing methods. In case (a), the proposed method showed a mean ITF value 0.36 dB and 0.04 dB lower than [13] and [6], respectively. This stems from the relatively complex objects in the images of (a), which reduce stabilization accuracy by affecting the determination of the number of clusters in the k-means outlier removal step. According to the mean ITF values, the proposed method improved on [13] and [6] by 0.24 dB and 0.27 dB in case (b), 0.29 dB and 2.06 dB in case (c), 11.47 dB and 2.86 dB in case (d), and 1.4 dB and 1.38 dB in case (e), respectively. These results demonstrate that the proposed method is superior to the existing methods in stabilization accuracy under both static and dynamic scene conditions.

Figure 13 also shows the resulting images of the ground truth comparison for the five image sequence types. The left, middle, and right columns show the 1st, 100th, and 200th frames, respectively, of each image sequence, while the first to fourth rows show the original image sequence and the results of [6, 13] and the proposed method, respectively. To ease comparison, baselines are marked with red dashed lines in each image.

As shown in Fig. 13a, all methods except the original image sequence showed similar stabilization performance, as this case is inherently stable. In the case of Fig. 13b, the proposed method consistently showed the best performance because it always pointed to a similar reference point. The results of [6] were not good in this case, since that method is optimized for dynamic image sequences with many camera movements. Moreover, with [13], the degrees of freedom of the transformation matrix grew as the optimization proceeded, distorting the size and direction of structures in the image. In contrast, since the proposed method maintains similarity throughout the optimization, the size and direction of the structures were kept constant.

In Fig. 13c, [6] showed the worst ITF performance of the compared techniques, while [13] and the proposed method gave similarly good mean and SD values. However, Table 2 confirms that the proposed method performed best in terms of ITF. Figure 13d shows the ground truth comparison for the drone image sequence. Compared with the other techniques, the proposed method produced the most stable results and the best mean ITF value. However, the ITF SD varied greatly over the sequence, because a hovering drone does not maintain consistent attitude and frequently rolls about the x-axis. With [13], the algorithm could no longer proceed from the 200th frame onward due to image distortion, while with [6], the ITF was consistent overall but the mean ITF value was lower than that of the proposed method (see Table 2).

The experimental results in Fig. 13e show a zooming effect, in which the size of the background changes because the camera moved forward. Since the proposed method compensates for camera shake by counteracting the shaking of the whole image, the structures in the image became smaller due to this zoom effect, a limitation of the proposed method to be addressed in future research. Nevertheless, although [6, 13] and the proposed method showed similar mean ITF values, Table 2 shows that the proposed method achieved the best ITF mean and SD.

We also evaluated the computation time for two image sizes, 720 × 1280 and 360 × 640, because the computation time of a feature point-based algorithm depends on the image size. Table 3 compares the computation speeds (fps) of the conventional approaches and the proposed method. The experiments were run on an AMD Ryzen 7 1700 8-core 3.0 GHz CPU with 16 GB RAM and a GTX 1080 GPU. Real-time operation of image processing algorithms is generally considered possible at 20 fps or higher. Although the proposed algorithm is slower than the intensity-based method, the measured speeds show that it can run in real time for small images.

Table 3 Computation time comparison

4 Conclusions

In this paper, we proposed an advanced image stabilization technique consisting of three steps. First, local motion features are detected using the MSURF algorithm. Second, outliers in the local motion are removed using k-means clustering before global motion is estimated, a task that has conventionally been performed with the RANSAC algorithm. RANSAC yields different results each time it is executed because it uses a random function to determine the outliers, whereas k-means clustering guarantees the repeatability of the stabilization results because its calculation involves no random function. Third, the computed transformation matrix is optimized using the similarity-constrained optimization method. Digital image stabilization techniques use a similarity, affine, or projective matrix as the transformation matrix. Because affine and projective matrices have high degrees of freedom in their components, the size and direction of structures in the image may become distorted as the optimization proceeds. Constraints that maintain the similarity characteristics of the transformation matrix are therefore needed during optimization, and the similarity-constrained optimization method provides them. The experimental results on various image sequences confirmed the superiority of the proposed algorithm over previously reported ones.