1 Introduction

Video inpainting is the process of removing undesired objects from video frames and filling the resulting holes with sophisticated algorithms so that no artifacts remain. It is also used to reconstruct damaged parts of a video sequence. The frames of the video sequence that contain the undesired object are called target frames, and the hole created after removing the object is called the target region. The frames that contain pixel information similar to what is needed to fill the target region are called source frames. Video inpainting has many applications in video processing, such as video restoration, video stabilization, and film post-processing. Image inpainting is the related technique of restoring damaged images: unwanted information is removed from the image and the hole is filled with appropriate data from the remaining region of the image.

1.1 Video Inpainting: Related Work

Total variation (TV) and diffusion-based methods [1, 2] are used for inpainting small regions and for removing text and scratches. Removal and inpainting of large regions are achieved with exemplar-based inpainting methods [3]. The basic exemplar-based image inpainting was proposed by Criminisi et al. [4] and was later extended to robust exemplar-based inpainting using region segmentation [5]. Video inpainting is obtained by extending the Criminisi method from images to video while maintaining spatiotemporal coherence between the frames [6,7,8,9]. These methods achieve good results but struggle in the presence of camera movement. Granados et al. [10] introduced graph-cuts optimization to implement video inpainting. Homography-based image registration was used to align the input video frames to the target frame, and the missing pixels in the video frames are filled with information taken from the registered frames. Cost function minimization is used to find the best pixel values to fill the target region [11, 12]. This entire process is time-consuming, even for low-resolution videos and long sequences.

Newson et al. [13] improved on the Granados inpainting by applying the PatchMatch algorithm [14] in the spatiotemporal domain with pyramids of frames. In this method, approximate nearest neighbors (ANN) are computed efficiently after aligning all frames to the middle frame using an affine transformation. However, because all frames are aligned to the middle frame, the border regions that do not intersect with every frame are not aligned properly, which degrades the results.

Ebdelli et al. [15] achieved well-reconstructed video by aligning more frames to the target frame using region-based homography transformation. It produces excellent results when the alignment is correct, but poor results when frames are misaligned. Huang et al. [16] introduced a video completion technique that uses both the optical flow and the color information of pixels in the target region of the frames. Temporal consistency is maintained through both a pixel-wise flow field and patch-based optimization, and the frames are reconstructed through iterative computation of forward and backward flow fields, which increases the computational complexity of the algorithm.

Recently, a novel video inpainting technique was implemented that hybridizes the cuckoo search algorithm with multi-verse optimization (CS-MVO) [17] to optimize the patch matching, and uses a recurrent neural network (RNN) to categorize each patch as smooth or structured. This method produced optimal video inpainting results compared to the available methods. Another video inpainting technique uses an enhanced priority computation method and selects the optimal patch for inpainting the target region with grey wolf optimization (GWO) [18] and a modified artificial bee colony algorithm [19]. It outperformed existing video inpainting techniques in terms of PSNR, SSIM, and edge similarity.

From these works, one can understand that the proper alignment of the source frames to the target frame plays a vital role in video inpainting, since it maintains spatiotemporal coherence between the frames of the reconstructed video sequence. In this paper, inspired by [20], an advanced homography-based registration method is proposed to attain an efficient alignment of the frames to the target frame. This registration technique is then used in the proposed video inpainting.

Traditional homography-based methods are implemented in two steps: feature point extraction, followed by feature point matching between the images and homography estimation. The feature points are extracted using the scale-invariant feature transform (SIFT) [21] or speeded-up robust features (SURF) [22]. Then, global correspondence is established by matching the features, and the transformation between the frames is estimated with a robust estimation method, random sample consensus (RANSAC) [23]. Incorrect feature point correspondences caused by illumination changes and viewpoint differences are called outliers and lead to inaccurate homography estimates. Robust estimation methods are required to remove the outliers produced by feature matching, and a proper selection of inliers avoids them. The advanced homography transformation method proposed here improves the quality of the frame alignment.
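For reference, the following minimal Python sketch illustrates this conventional SIFT + RANSAC registration pipeline (OpenCV's SIFT, brute-force matching with Lowe's ratio test, and RANSAC-based homography estimation are assumed; the frame file names are placeholders, not part of the proposed method):

```python
# Minimal sketch of the conventional SIFT + RANSAC registration pipeline
# described above (OpenCV >= 4.4; the frame file names are placeholders).
import cv2
import numpy as np

src = cv2.imread("source_frame.png", cv2.IMREAD_GRAYSCALE)
tgt = cv2.imread("target_frame.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(src, None)
kp2, des2 = sift.detectAndCompute(tgt, None)

# Match descriptors and keep the best correspondences (Lowe's ratio test).
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

pts_src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
pts_tgt = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# RANSAC rejects outlier correspondences while estimating the homography.
H, inlier_mask = cv2.findHomography(pts_src, pts_tgt, cv2.RANSAC, 3.0)

# Warp the source frame into the target frame's coordinate system.
aligned = cv2.warpPerspective(src, H, (tgt.shape[1], tgt.shape[0]))
```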

In this paper, the image registration is carried out by extracting the feature points with the HALF-SIFT [24] method instead of SIFT. SIFT suffers from a localization error introduced by its feature detector; HALF-SIFT extracts more accurately localized feature points and rectifies this error. The localization error is assumed to arise either from pixel intensity noise or from the feature extraction method itself, and to have zero mean and a non-uniform distribution. This type of error is called an anisotropic and non-uniformly distributed localization error [25,26,27,28], also referred to as a random localization error [20]. Such an error does not satisfy the assumptions of the RANSAC estimator. To estimate this type of error, covariance matrices are used in the proposed video inpainting method.

Next, the selection of inliers and the estimation of the homography matrix are performed using covariance-weighted maximum likelihood sample consensus (CW MLESAC). Finally, the homography matrix is refined with the covariance-weighted Levenberg–Marquardt algorithm (CW L-M). The last step of the video inpainting method is target region filling, which is completed by globally minimizing a robust energy cost function with the expansion move algorithm [29,30,31]. To assess the effectiveness of the proposed method, experiments are conducted on the densely annotated video segmentation (DAVIS) dataset [32].

The paper is organized as follows: the proposed video inpainting approach is described in Sect. 2. Section 3 presents the experimental results of the proposed method and a comparison of metrics with existing methods. The concluding remarks are given in Sect. 4.

2 Proposed Video Inpainting Method

The proposed method is implemented in two main steps: the first is image registration, and the second is target region filling. The target frame with a hole is inpainted by aligning the neighboring frames to the target frame, and the hole is filled by diffusing pixels from the aligned frames into the target region. The image registration is achieved with an advanced homography-based image registration method, and the hole in the target frame, called the target region, is filled by globally minimizing a robust energy function. The flow of the proposed method is shown in Fig. 1.

Fig. 1 Flow of the proposed video inpainting method

2.1 Advanced Homography-based Registration Method

In the proposed image registration method, the feature points are extracted using the HALF-SIFT [24] method instead of SIFT. HALF-SIFT extracts more accurately localized feature points and rectifies the localization error caused by the SIFT feature detector. Next, the localization error due to pixel intensity noise is estimated with a covariance matrix. Feature matching between the frames is performed using minimum Euclidean distance. Then, inliers are selected and the homography matrix is estimated with covariance-weighted maximum likelihood sample consensus (CW MLESAC), and finally the homography matrix is refined with the covariance-weighted Levenberg–Marquardt algorithm (CW L-M).

2.1.1 Localized Feature Points Extraction Using HALF-SIFT

The localization error of feature points in the SIFT method is due to the use of parabolic interpolation to estimate the coordinates of the feature points. The HALF-SIFT method improves the interpolation accuracy and thereby produces accurately localized feature points.

To extract accurately localized feature points, HALF-SIFT uses regression analysis to minimize the distance between the sampling points near the feature point in the difference-of-Gaussians pyramid and a response model function. The response model function is the output obtained from the difference-of-Gaussians (DoG) filter when a Gaussian function is applied as its input. It is described by the parameter vector \(v = \left( {m_{0} ,n_{0} , a, b, c, r} \right)\), where \(M_{0} = \left( {m_{0} ,n_{0} } \right)\) is the accurate position of the feature point. This parameter vector \(v\) is optimized with the Levenberg–Marquardt algorithm. The optimized objective function [24] is taken as

$$ \hat{v} = arg\mathop {\min }\limits_{v} \left\{ {\mathop \sum \limits_{i,j,p = - 1,0,1} \parallel X\left( {m_{c} \pm i,n_{c} \pm j,\left( {2^{1/3} } \right)^{p} \alpha } \right) - X_{v} \left( {m_{c} \pm i,n_{c} \pm j,\left( {2^{1/3} } \right)^{p} \alpha } \right)\parallel_{2}^{2} } \right\} $$
(1)

Difference of Gaussians (DoG) is a feature enhancement algorithm that subtracts one Gaussian-blurred version of the original image from another, less blurred, version. The result is a set of images at a variety of scales, each being a "difference of Gaussians"; this set is called the difference-of-Gaussians pyramid and is denoted \(X\left( \cdot \right)\). Here, \(\left( {m_{c} ,n_{c} } \right)\) is the local maximum point extracted with SIFT, \(X_{v} \left( \cdot \right)\) is the response model function described by the parameter vector \(v\), \(\alpha\) is the scale amplitude in SIFT, and \(i\), \(j\), and \(p\) index the sampling offsets around the extremum in position and scale.
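As an illustration of the least-squares fit in Eq. (1), the sketch below refines a detected extremum by fitting a parametric response model to the 3 × 3 × 3 DoG neighborhood with SciPy's Levenberg–Marquardt solver. The Gaussian-shaped `response_model` is only a hypothetical stand-in, not the exact HALF-SIFT response model of [24], and `dog_layers` is assumed to map \(p \in \{-1, 0, 1\}\) to the corresponding DoG layer:

```python
# Illustrative refinement of a SIFT extremum following the idea of Eq. (1):
# least-squares fit of a response model to the 3x3x3 DoG samples around the
# detected point. The model below is a hypothetical stand-in, not the exact
# HALF-SIFT response model.
import numpy as np
from scipy.optimize import least_squares

def response_model(v, m, n, s):
    """Hypothetical response model X_v(m, n, s) with v = (m0, n0, a, b, c, r)."""
    m0, n0, a, b, c, r = v
    return a * np.exp(-((m - m0) ** 2 + (n - n0) ** 2) / (2.0 * (b * s) ** 2)) + c + r * s

def refine_keypoint(dog_layers, mc, nc, alpha):
    """dog_layers: dict mapping p in {-1, 0, 1} to the DoG layer at scale (2**(1/3))**p * alpha;
    (mc, nc): column/row of the extremum detected by SIFT."""
    samples, coords = [], []
    for p in (-1, 0, 1):
        layer = dog_layers[p]
        s = (2.0 ** (1.0 / 3.0)) ** p * alpha
        for i in (-1, 0, 1):
            for j in (-1, 0, 1):
                samples.append(float(layer[nc + j, mc + i]))
                coords.append((mc + i, nc + j, s))
    samples, coords = np.asarray(samples), np.asarray(coords)

    def residuals(v):
        return response_model(v, coords[:, 0], coords[:, 1], coords[:, 2]) - samples

    v0 = np.array([mc, nc, samples.max(), 1.0, samples.mean(), 0.0])
    fit = least_squares(residuals, v0, method="lm")   # Levenberg-Marquardt, as in [24]
    return fit.x[0], fit.x[1]                         # refined sub-pixel position (m0, n0)
```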

2.2 Estimation of Localization Error Using the Covariance Matrix

The localization error of a feature point depends on the distribution of the pixel intensity values near the feature point. An uneven distribution of these values leads to an anisotropic and non-identical localization error. A covariance matrix is used to represent this anisotropic and non-identical error [25], expressed as

$$ {\Lambda } = \sigma .G_{r} .\left[ {\begin{array}{*{20}c} \mu & 0 \\ 0 & {1 - \mu } \\ \end{array} } \right].G_{r}^{T} $$
(2)

where \(G_{r} = \left[ {\begin{array}{*{20}c} {\cos \theta } & { - \sin \theta } \\ {\sin \theta } & { \cos \theta } \\ \end{array} } \right]\) is the rotation matrix representing an anti-clockwise rotation by an angle \(\theta\). Here, \(\sigma \in \left[ {0,\infty } \right)\) is the scale, \(\mu \in \left( {0,1} \right)\) is the eccentricity, and \(\theta \in \left[ {0,\pi } \right)\) is the rotation angle of the covariance matrix \({\Lambda }\).
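For concreteness, Eq. (2) can be transcribed directly into a few lines of NumPy:

```python
# Direct NumPy transcription of Eq. (2): the covariance matrix built from
# its scale (sigma), eccentricity (mu) and rotation angle (theta) parameters.
import numpy as np

def covariance_from_params(sigma, mu, theta):
    c, s = np.cos(theta), np.sin(theta)
    G = np.array([[c, -s], [s, c]])     # anti-clockwise rotation by theta
    D = np.diag([mu, 1.0 - mu])         # eccentricity terms on the diagonal
    return sigma * G @ D @ G.T
```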

In this work, the random localization error of feature points is assumed as a bilateral Gaussian model and expressed in the covariance matrix, which is estimated as

$$ {\Lambda }^{{\left( {\alpha_{v} } \right)}} = \left( {\mathop \sum \limits_{{\left( {k,l} \right) \in N_{v} }} u\left( {k,l} \right).\left[ {\begin{array}{*{20}c} {X_{mm} \left( {k,l,\alpha_{v} } \right)} & {X_{mn} \left( {k,l,\alpha_{v} } \right)} \\ {X_{mn} \left( {k,l,\alpha_{v} } \right)} & {X_{nn} \left( {k,l,\alpha_{v} } \right)} \\ \end{array} } \right]} \right)^{ - 1} $$
(3)
$$ {\Lambda } = {\Lambda }^{{\left( {\alpha_{v} } \right)}} .\left( {\frac{{{\text{res}}\;\left( {\alpha_{0} } \right)}}{{{\text{res}}\,\left( {\alpha_{v} } \right)}}} \right)^{2} $$
(4)

where \(u\left( {k,l} \right)\) is the Gaussian weighting coefficient, \(X\left( { \cdot ,\alpha_{v} } \right)\) is the corresponding layer of the scale-space pyramid, \(N_{v}\) is a small neighborhood of the feature point over which \(\left( {k,l} \right)\) ranges, \(\alpha_{v}\) is the scale parameter of the current layer, and \({\text{res}}\left( {\alpha_{0} } \right)\) and \({\text{res}}\left( {\alpha_{v} } \right)\) are the image resolutions of the layers of the scale-space pyramid whose scale parameters are \(\alpha_{0}\) and \(\alpha_{v}\), respectively. The sub-indices \(m\) and \(n\) in Eq. (3) refer to the image coordinates of the local maximum position obtained from the SIFT method.
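A possible NumPy sketch of Eqs. (3) and (4) is given below; it assumes that \(X_{mm}\), \(X_{mn}\), and \(X_{nn}\) are the second partial derivatives of the DoG layer, and the window half-width and Gaussian width of the weighting \(u(k,l)\) are illustrative choices rather than values prescribed by the method:

```python
# Sketch of Eqs. (3)-(4): a Gaussian-weighted sum of second-derivative matrices
# of the DoG layer around the keypoint, inverted and rescaled to the base
# resolution. X_mm, X_mn, X_nn are taken here as second partial derivatives of
# the layer; the window half-width and Gaussian width are illustrative choices.
import numpy as np

def localization_covariance(dog_layer, m, n, res_ratio, half=2, gauss_sigma=1.0):
    """dog_layer: layer X(., alpha_v); (m, n): keypoint column/row;
    res_ratio: res(alpha_0) / res(alpha_v) between the pyramid layers."""
    layer = dog_layer.astype(np.float64)
    Xm = np.gradient(layer, axis=1)               # derivative along m (columns)
    Xn = np.gradient(layer, axis=0)               # derivative along n (rows)
    Xmm = np.gradient(Xm, axis=1)
    Xnn = np.gradient(Xn, axis=0)
    Xmn = np.gradient(Xm, axis=0)

    A = np.zeros((2, 2))
    for dl in range(-half, half + 1):
        for dk in range(-half, half + 1):
            k, l = m + dk, n + dl
            u = np.exp(-(dk * dk + dl * dl) / (2.0 * gauss_sigma ** 2))   # u(k, l)
            A += u * np.array([[Xmm[l, k], Xmn[l, k]],
                               [Xmn[l, k], Xnn[l, k]]])
    cov_layer = np.linalg.inv(A)                  # Eq. (3)
    return cov_layer * res_ratio ** 2             # Eq. (4)
```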

2.2.1 Selection of Inliers with CW MLESAC

In traditional feature point matching methods such as LMedS and RANSAC, the residuals are calculated using Eq. (5). For this calculation, the set of correspondences is written as \(C = \left\{ {\left. {\left( {c_{1i} ,c_{2i} } \right)} \right| i = 1,2,3, \ldots , n} \right\}\), where \(c_{1i}\) and \(c_{2i}\) are the pixel positions of matched feature points in the two images.

$$ r_{i} = \sqrt {\parallel e\parallel_{2}^{2} } = \sqrt {\parallel c_{2i} - \hat{c}_{2i} \parallel_{2}^{2} } $$
(5)

where \(e\) is the re-projection error vector and \(\hat{c}_{2i} = H\left( {c_{1i} } \right)\) is the pixel position predicted by the homography transform \(H\). The residuals calculated with this formula do not account for the anisotropic and non-identical properties of the localization error, so they lead to inaccurate homography estimation. This is rectified by selecting the inliers with the CW MLESAC method, which uses normalized covariance-weighted residuals (NCWR) that incorporate the properties of the feature point localization error.

The covariance matrix is decomposed as

$$ {\Lambda } = U.\left[ {\begin{array}{*{20}c} {\varepsilon_{1} } & 0 \\ 0 & {\varepsilon_{2} } \\ \end{array} } \right]. U^{T} $$
(6)

Here, \(\varepsilon_{1}\) and \(\varepsilon_{2}\) are the eigenvalues of the covariance matrix, and \(U = \left[ {u_{1} ,u_{2} } \right]\) contains the corresponding eigenvectors.

In the proposed CW MLESAC method, the residual \(r_{i}\) is replaced with the NCWR \(\overline{r}_{i}\), defined as

$$ \overline{r}_{i} = \sqrt {\left( {\overline{e}_{{\varepsilon_{1} }} } \right)^{2} + \left( {\overline{e}_{{\varepsilon_{2} }} } \right)^{2} } = \sqrt {\frac{{2\varepsilon_{1} }}{{\varepsilon_{1} + \varepsilon_{2} }}(e_{{u_{1} }} )^{2} + \frac{{2\varepsilon_{2} }}{{\varepsilon_{1} + \varepsilon_{2} }}(e_{{u_{2} }} )^{2} } = \sqrt {\frac{{2\varepsilon_{1} \left( {u_{1}^{T} .\left( {c_{2i} - \hat{c}_{2i} } \right)} \right)^{2} }}{{\varepsilon_{1} + \varepsilon_{2} }} + \frac{{2\varepsilon_{2} \left( {u_{2}^{T} .\left( {c_{2i} - \hat{c}_{2i} } \right)} \right)^{2} }}{{\varepsilon_{1} + \varepsilon_{2} }}} $$
(7)

Finally, the best inlier correspondences \(C_{{{\text{inlier}}}}\) and the homography matrix \(H_{{{\text{inlier}}}}\) are selected using CW MLESAC.
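The following sketch shows how the NCWR of Eq. (7) could be evaluated for a set of correspondences given a candidate homography and per-point covariance matrices; inside a MLESAC-style hypothesis-and-verify loop, these values would replace the Euclidean residuals of Eq. (5) when scoring each candidate (the loop itself is omitted):

```python
# Sketch of the normalized covariance-weighted residual (NCWR) of Eq. (7).
# Within a MLESAC-style hypothesis-and-verify loop, these residuals would
# replace the Euclidean residuals of Eq. (5) when scoring each candidate
# homography; the hypothesis loop itself is omitted here.
import numpy as np

def project(H, pts):
    """Apply a 3x3 homography H to an Nx2 array of points."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def ncwr(H, c1, c2, covariances):
    """c1, c2: Nx2 matched points; covariances: per-point 2x2 error covariance matrices."""
    c2_hat = project(H, c1)
    r = np.empty(len(c1))
    for i, cov in enumerate(covariances):
        eps, U = np.linalg.eigh(cov)            # eigenvalues/eigenvectors as in Eq. (6)
        e = c2[i] - c2_hat[i]                   # re-projection error vector
        proj = U.T @ e                          # error components along the eigenvector axes
        w = 2.0 * eps / eps.sum()               # weights 2*eps_k / (eps_1 + eps_2)
        r[i] = np.sqrt(np.sum(w * proj ** 2))   # Eq. (7)
    return r
```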

2.2.2 Refining of Homography Matrix with CW L-M

Consider \(C_{{{\text{inlier}}}} = \left\{ {\left. {\left( {c_{1i} ,c_{2i} } \right)} \right| i = 1,2,3, \ldots , n} \right\}\) as the inlier correspondences, where \(c_{1i}\) and \(c_{2i}\) are the pixel positions in the two images. The objective function \(F\) used in the existing L–M method is

$$ F = \arg \;\mathop {\min }\limits_{H} \mathop \sum \limits_{i = 1}^{n} \parallel e\parallel_{2}^{2} = \arg \;\mathop {\min }\limits_{H} \mathop \sum \limits_{i = 1}^{n} \parallel c_{2i} - H(c_{1i} )\parallel_{2}^{2} $$
(8)

where \(e\) is the re-projection error vector and \(H\left( \cdot \right)\) is the homography transform.

This objective function does not account for the anisotropic and non-identical properties of the localization error, so the resulting homography estimate is not accurate. This problem is solved with a new homography matrix refinement method, the CW L-M algorithm.

This uses the covariance weighted objective function \( \left( {\tilde{F}} \right)\), which is taken as

$$ \tilde{F} = \arg \mathop {\min }\limits_{H} \mathop \sum \limits_{i = 1}^{n} \left[ {\frac{{\left( {u_{1i }^{T} .\left( {c_{2i} - H\left( {c_{1i} } \right)} \right)} \right)^{2} }}{{\varepsilon_{1i} }} + \frac{{\left( {u_{2i }^{T} .\left( {c_{2i} - H\left( {c_{1i} } \right)} \right)} \right)^{2} }}{{\varepsilon_{2i} }}} \right] $$
(9)

where \(\varepsilon_{1i}\) and \(\varepsilon_{2i}\) are the eigenvalues of the covariance matrix of the \(i\)-th feature point correspondence in \(C_{{{\text{inlier}}}}\), and \(U = \left[ {u_{1i} ,u_{2i} } \right]\) contains the corresponding eigenvectors.

In Eq. (9), the new objective function accounts for the anisotropic properties of the localization error: the re-projection error vector is rotated onto the axes defined by the eigenvectors of the covariance matrix, and each component is divided by the corresponding eigenvalue, so that different feature points contribute differently. As a result, the localization error of the feature points becomes isotropic and identically distributed, and the L–M method then yields the optimal solution.
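A sketch of this covariance-weighted refinement is shown below, using SciPy's Levenberg–Marquardt solver on the residuals of Eq. (9); the parameterization of \(H\) by its first eight entries (with \(H_{33} = 1\)) is an illustrative choice rather than the paper's exact parameterization:

```python
# Sketch of the covariance-weighted Levenberg-Marquardt refinement of Eq. (9).
import numpy as np
from scipy.optimize import least_squares

def project(H, pts):
    """Apply a 3x3 homography H to an Nx2 array of points."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def cw_residuals(h, c1, c2, eigvals, eigvecs):
    H = np.append(h, 1.0).reshape(3, 3)
    e = c2 - project(H, c1)                           # Nx2 re-projection errors
    res = []
    for i in range(len(c1)):
        proj = eigvecs[i].T @ e[i]                    # rotate error onto eigenvector axes
        res.extend(proj / np.sqrt(eigvals[i]))        # divide by sqrt(eps_1i), sqrt(eps_2i)
    return np.asarray(res)                            # squaring and summing gives Eq. (9)

def refine_homography(H0, c1, c2, covariances):
    """H0: initial homography from CW MLESAC; covariances: per-inlier 2x2 matrices."""
    eigvals, eigvecs = zip(*[np.linalg.eigh(c) for c in covariances])
    fit = least_squares(cw_residuals, (H0 / H0[2, 2]).ravel()[:8],
                        args=(c1, c2, eigvals, eigvecs), method="lm")
    return np.append(fit.x, 1.0).reshape(3, 3)
```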

2.3 Target Region Filling

The hole \({\Phi }_{t}\) in the target frame \(I_{t}\), called the target region, is filled with similar pixels from the group of frames aligned to the target frame while maintaining spatiotemporal coherence. The missing region in the target frame is inpainted by globally minimizing an energy function with the expansion move algorithm [29,30,31]. If \(M\) past and \(M\) future neighboring frames are considered, every pixel in the hole is inpainted with the best-suited pixels from the \(2M\) registered frames \(\tilde{I}_{i} , i = 1 \ldots 2M\). Let \(S\left( v \right)\) and \(S\left( w \right)\) denote the labels assigning a matching registered frame to pixels \(v\) and \(w\), respectively. The optimal labeling \(S^{*}\) is obtained by minimizing the energy cost function over all the pixels of the hole in the target frame:

$$ S^{*} = \arg \mathop {\text{ min }}\limits_{S} \xi \left( S \right) $$
(10)

with

$$ \xi \left( S \right) = \mathop \sum \limits_{{v \in \Phi_{t} }} E_{d} \left( {S\left( v \right)} \right) + \gamma \mathop \sum \limits_{{\left( {v,w} \right) \in N\left( {\Phi_{t} } \right)}} E_{sm} \left( {S\left( v \right), S\left( w \right)} \right) $$
(11)

where \(E_{d} \left( {S\left( v \right)} \right)\) is the data term, which measures how well the label \(S\left( v \right)\) explains the stationary background surrounding the pixel \(v\). The data term is divided into three terms:

$$ E_{d} \left( {S\left( v \right)} \right) = E_{0} \left( {\hat{I}_{S\left( v \right)} } \right) + E_{1} \left( {\hat{I}_{S\left( v \right)} } \right) + E_{2} \left( {\hat{I}_{S\left( v \right)} } \right) $$
(12)

where the first term, \(E_{0} \left( {\hat{I}_{S\left( v \right)} } \right) = \log \left( {1 + {\text{SSD}}\left( {I_{t} , \hat{I}_{S\left( v \right)} } \right)} \right)\), is the logarithm of the sum of squared differences between the current target frame \(I_{t}\) and the frame \(\hat{I}_{S\left( v \right)}\) registered to it. This term favors pixels from the best-aligned frames, i.e., those with the smallest alignment error, and the logarithm limits the dynamic range of the term.

The second term, \(E_{1} \left( {\hat{I}_{S\left( v \right)} } \right) = \frac{1}{{2M\left| {{\Omega }_{S\left( v \right)} } \right|}}\mathop \sum \limits_{k = - M}^{M} \parallel {\Omega }_{S\left( v \right)} - {\Omega }_{k} \parallel^{2}\), compares patches across the registered frames, where \({\Omega }_{S\left( v \right)}\) is the patch centered at pixel \(v\) in the registered frame \(\hat{I}_{S\left( v \right)}\), \(\left| {{\Omega }_{S\left( v \right)} } \right|\) is the number of pixels in the patch, and \({\Omega }_{k}\) is the corresponding patch in the \(k\)-th registered frame. This term is responsible for maintaining temporal consistency in the inpainting results.

The third term, \(E_{2} \left( {\hat{I}_{S\left( v \right)} } \right) = \left( {I_{s} \left( v \right) - \hat{I}_{S\left( v \right)} \left( v \right)} \right)^{2}\), measures the similarity between the pixel in the target frame and the pixel in the source frame at \(v\), where \(I_{s} \left( v \right)\) is obtained by spatial inpainting. This term is responsible for the spatiotemporal coherence of the video inpainting results.

The second term in Eq. (11), \(E_{sm} \left( {S\left( v \right), S\left( w \right)} \right)\), is the smoothness term, calculated between each pair of adjacent pixels in the hole of the target frame as

$$ E_{sm} \left( {S\left( v \right), S\left( w \right)} \right) = \parallel \hat{I}_{S\left( v \right)} \left( v \right) - \hat{I}_{S\left( w \right)} \left( v \right)\parallel^{2} + \parallel \hat{I}_{S\left( v \right)} \left( w \right) - \hat{I}_{S\left( w \right)} \left( w \right)\parallel^{2} $$
(13)

The smoothness term maintains spatial consistency by encouraging adjacent pixels in the target region to be filled with similar pixel values from the neighboring registered frames. \(N\left( {{\Phi }_{t} } \right)\) denotes the set of 4-connected neighboring pixel pairs within the hole. The value of \(\gamma\) in Eq. (11) is chosen as 10 to balance the data term and the smoothness term. Equation (10) is minimized using the expansion move algorithm [29,30,31].
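The sketch below illustrates how the data term of Eq. (12) could be evaluated for a single hole pixel and candidate label; the patch size and the spatially inpainted estimate \(I_{s}(v)\) are assumed inputs, frames are taken as grayscale for simplicity, and in a full implementation these costs would be computed for every hole pixel and label before minimizing Eq. (11) with an expansion-move (graph-cut) solver:

```python
# Sketch of the data term of Eq. (12) for one hole pixel v and one candidate
# label S(v) (i.e., one registered frame). Frames are assumed grayscale.
import numpy as np

def patch(img, y, x, half=3):
    """Square patch of side 2*half+1 centred at (y, x)."""
    return img[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)

def data_term(target, registered, label, y, x, I_s_v):
    """target: target frame I_t; registered: list of the 2M aligned frames;
    label: candidate index S(v); (y, x): hole pixel v; I_s_v: spatial inpainting estimate at v."""
    chosen = registered[label]

    # E0: alignment quality of the chosen frame around v (log-limited SSD).
    # In practice the SSD is evaluated only over the known (non-hole) pixels.
    E0 = np.log(1.0 + np.sum((patch(target, y, x) - patch(chosen, y, x)) ** 2))

    # E1: temporal consistency of the chosen patch against all registered frames.
    p_v = patch(chosen, y, x)
    E1 = np.mean([np.sum((p_v - patch(f, y, x)) ** 2) for f in registered]) / p_v.size

    # E2: agreement with the spatially inpainted estimate at v.
    E2 = (float(I_s_v) - float(chosen[y, x])) ** 2

    return E0 + E1 + E2
```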

3 Experimental Results and Discussion

The proposed video inpainting algorithm is implemented in MATLAB. Inpainting a video of around 100 frames takes at most 3 h on an Intel Core i3 processor with 8 GB RAM. The proposed algorithm was tested on multiple complex video sets that include motion blur, camera movement, dynamic backgrounds, and complex hole shapes, and the results were compared with existing video completion methods. The results obtained with the proposed method are visually plausible in comparison with existing video inpainting methods.

In the proposed advanced homography-based video inpainting method, the experiments are carried out for the quantitative analysis in terms of PSNR and SSIM.

The PSNR and SSIM values are determined as follows:

  • A separate dataset of composite videos is created from 11 videos. These 11 videos are treated as the ground truth videos.

  • Each composite video is created by adding a mask object, taken from the DAVIS dataset, to the ground truth video.

  • This created composite video is treated as an input video for inpainting.

  • The proposed inpainting method is applied to the composite video to remove the added mask object.

  • The resulting video is compared with the corresponding ground truth video to calculate the PSNR and SSIM values.

  • In the same manner, all 11 composite videos are inpainted and the metrics are determined.

  • Likewise, the existing video inpainting techniques are applied to the composite videos to determine their PSNR and SSIM values (a sketch of the frame-wise metric computation is given below).
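A minimal sketch of the frame-wise metric computation is given here, assuming a recent scikit-image and that the ground-truth and inpainted videos are available as lists of equally sized uint8 RGB frames:

```python
# Sketch of the frame-wise PSNR/SSIM evaluation described above; per-frame
# scores are averaged over the video.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_metrics(gt_frames, inpainted_frames):
    psnrs, ssims = [], []
    for gt, out in zip(gt_frames, inpainted_frames):
        psnrs.append(peak_signal_noise_ratio(gt, out, data_range=255))
        ssims.append(structural_similarity(gt, out, channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```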

The video inpainting results for a few video sequences are shown in Fig. 2; the odd rows show the frames with the mask of the object to be removed, and the even rows show the resulting frames after removing the object with the proposed method.

Fig. 2 Object removal; odd rows: input video frames (videos 1, 2, 3); even rows: frames inpainted with the proposed method

3.1 Comparison with Existing Methods

The video inpainting approach by Granados et al. [10] is quite similar to the proposed method in this paper. Homography-based image registration is used in [10] to align the input video frames to the target frame, and the missing pixels in the video frames are filled with information taken from the registered frames. Cost function minimization is used to find the best pixel values to fill the target region. This entire process is time-consuming, even for low-resolution videos and long sequences.

In this work, a more advanced and robust homography estimation than that of [10] is proposed to align the input video frames to the target frame, and the inpainting quality is further improved through the data term of the energy function used to fill the target region. The authors of [10] have not published their code, and it would be difficult to reproduce their work without introducing errors; hence, it is not compared with the proposed work in terms of performance metrics. The patch-based video inpainting algorithm [13] used spatiotemporal sampling to fill the target region and is compared with the proposed work. The frames of the video sequences with the object to be removed and the inpainted results are shown in Figs. 3, 4 and 5. From these results, it can be seen that artifacts occur in [13] due to the incorrect flow of pixels from the known region into the hole of the target frame and due to incorrect frame alignment. The proposed method achieves more reliable results than [13] because of the proper alignment of the source frames to the target frame.

Fig. 3 Comparison of results with [13]: a input video frames with an object to be removed (video 4); b results from [13]; c results from the proposed method

Fig. 4 Comparison of results with [13]: a input video frames with an object to be removed (video 5); b results from [13]; c results from the proposed method

Fig. 5 Comparison of results with [13]: a input video frames with an object to be removed (video 6); b results from [13]; c results from the proposed method

The method of [16] is computationally complex because of the forward and backward flow fields computed over the image sequence. As mentioned by its authors, it produces noticeable artifacts in videos containing a dynamic camera, foreground, and background. In comparison, our method reconstructs the frames without such artifacts in all of the above cases, owing to the highly accurate feature point computation used to match pixels across frames. The comparison of the inpainted results with [16] is shown in Figs. 6 and 7.

Fig. 6 Comparison of results with [16]: a input video frames with an object to be removed (video 7); b results from [16]; c results from the proposed method

Fig. 7 Comparison of results with [16]: a input video frames with an object to be removed (video 8); b results from [16]; c results from the proposed method; d mask of the object to be removed for a complex frame and the artifact in the inpainted frame

The results obtained from the proposed method are compared with [13] and [16] as shown in Figs. 8, 9 and 10.

Fig. 8 Comparison of results with [13] and [16]: a input video frames with an object to be removed (video 9); b results from [13]; c results from [16]; d results from the proposed method

Fig. 9 Comparison of results with [13] and [16]: a input video frames with an object to be removed (video 10); b results from [13]; c results from [16]; d results from the proposed method

Fig. 10 Comparison of results with [13] and [16]: a input video frames with an object to be removed (video 11); b results from [13]; c results from [16]; d results from the proposed method

The PSNR and SSIM values for the existing methods are computed by running the code available from the respective authors' pages, and the proposed method is applied to the same videos used in those methods. The PSNR and SSIM results for the different videos are given in Table 1, where the state-of-the-art inpainting methods of Newson [13] and Huang [16] are compared with the proposed method on 11 input videos. The experiment is carried out on 11 standard input videos from the DAVIS dataset, and the average PSNR and SSIM are evaluated to compare the performance of the proposed method with the existing methods. The average PSNR of the proposed method is 21.075, whereas that of Newson [13] is 19.921 and that of Huang [16] is 20.556. From these results, the proposed algorithm enhances PSNR by about 6% and 3% compared with [13] and [16], respectively. The average SSIM of the proposed method over the 11 videos is 0.932, which improves on the values of 0.9 for [13] and 0.907 for [16].

Table 1 Comparison of PSNR and SSIM values of various videos with available methods in the literature [13, 16] and the proposed method

Figure 11 shows the variation of the PSNR and SSIM values over the different videos from the DAVIS dataset. Video 2 has the highest PSNR value among all the videos, whereas video 8 has the lowest; hence, the average PSNR is evaluated and compared with the state-of-the-art works. The PSNR values of the proposed method and the state-of-the-art methods are compared in Fig. 12, where it can be observed that the PSNR of the proposed method is improved for all the standard input videos compared with both Newson [13] and Huang [16]. The variation of the SSIM of the proposed method is shown in Fig. 13, which indicates that a significant improvement is achieved by the proposed method.

Fig. 11 Comparison graph of PSNR and SSIM values

Fig. 12 Comparison of PSNR values for the standard input videos

Fig. 13 Comparison of SSIM values

4 Conclusions

A novel video inpainting method using advanced homography-based image registration is proposed. In this registration, HALF-SIFT is used for accurate feature point extraction, and a covariance matrix is used to estimate the localization error. The proper selection of inliers, which removes outliers, is achieved with CW MLESAC, and the homography matrix is further refined with CW L-M. This image registration process strengthens the spatiotemporal coherence of the video inpainting. The hole in the target frame is then inpainted by globally minimizing the energy cost function. The experimental results are compared visually with two video inpainting methods from the literature, and the comparison shows that the proposed method produces higher-quality results than the existing methods. The performance metrics PSNR and SSIM are determined for the proposed method and compared with the results of the existing methods: averaged over 11 videos, the proposed method improves the average PSNR by about 6% and 3% over [13] and [16], respectively, and also improves the average SSIM.