1 Introduction

Video inpainting is the process of removing undesired objects from video frames and filling the resulting holes with sophisticated algorithms so that no artifacts remain. It is also used to reconstruct damaged parts of a video sequence. The frames of the video sequence that contain the undesired object are called target frames, and the hole created after removing the object is called the target region. The frames that contain pixel information similar to what is needed to fill the target region are called source frames. Video inpainting has many applications in video processing, such as video restoration, video stabilization, and film post-processing. Image inpainting is the related technique of restoring damaged images: unwanted information is removed from the image and the hole is filled with appropriate data from the remaining region of the image.

1.1 Video Inpainting: Related Work

Total variation (TV) and diffusion-based methods [1, 2] are used for inpainting small regions and for removing text and scratches. Removal and inpainting of large regions are achieved with exemplar-based inpainting methods [3]. The basic exemplar-based image inpainting was proposed by Criminisi et al. [4] and was later extended to robust exemplar-based inpainting using region segmentation [5]. Video inpainting is obtained by extending the Criminisi method from images to video while maintaining spatiotemporal coherence between the frames [6,7,8,9]. These methods achieve good results but struggle in the presence of camera movement. Granados et al. [10] introduced graph-cuts optimization to implement video inpainting. Homography-based image registration was used to align the input video frames to the target frame, and the missing pixels in the video frames are filled with information taken from the registered frames. Cost function minimization is used to find the best pixel values to fill the target region [11, 12]. This entire process is time-consuming, even for low-resolution videos and long sequences.

Newson et al. [13] improved on the Granados inpainting by applying the PatchMatch algorithm [14] in the spatiotemporal domain with pyramids of frames. In this method, approximate nearest neighbors (ANN) are computed efficiently after aligning all frames to the middle frame using an affine transformation. However, because all frames are aligned to the middle frame, the border regions that do not intersect with every frame are not aligned properly, which degrades the results.

Ebdelli et al. [15] achieved well-reconstructed video by aligning more frames to the target frame using region-based homography transformation. It produces excellent results when the alignment is correct, but poor results when frames are misaligned. Huang et al. [16] introduced a video completion technique that uses both the optical flow and the color information of pixels in the target region of the frames. Temporal consistency is maintained through both a pixel-wise flow field and patch-based optimization, and the frames are reconstructed through iterative computation of forward and backward flow fields, which increases the computational complexity of the algorithm.

Recently, a novel video inpainting technique was implemented that hybridizes the cuckoo search algorithm with multi-verse optimization (CS-MVO) [17] to optimize the patch matching, and uses a recurrent neural network (RNN) to categorize each patch as smooth or structured. This method produced optimal video inpainting results compared to the available methods. Another video inpainting technique uses an enhanced priority computation method and selects the optimal patch for inpainting the target region with grey wolf optimization (GWO) [18] and a modified artificial bee colony algorithm [19]. It outperformed existing video inpainting techniques in terms of PSNR, SSIM, and edge similarity.

From these works, one can understand that the proper alignment of the source frames to the target frame plays a vital role in video inpainting, since it maintains spatiotemporal coherence between the frames of the reconstructed video sequence. In this paper, inspired by [20], an advanced homography-based registration method is proposed to attain an efficient alignment of the frames to the target frame. This registration technique is then used in the proposed video inpainting.

Traditional homography-based methods are implemented in two steps: feature point extraction, followed by feature point matching between the images and homography estimation. The feature points are extracted using the scale-invariant feature transform (SIFT) [21] or speeded-up robust features (SURF) [22]. Then, global correspondence is established by matching the features, and the transformation between the frames is estimated with a robust estimation method, random sample consensus (RANSAC) [23]. Incorrect feature point correspondences caused by illumination changes and viewpoint differences are called outliers and lead to inaccurate homography estimates. Robust estimation methods are required to remove the outliers produced by feature matching, and a proper selection of inliers avoids them. The advanced homography transformation method proposed here improves the quality of the frame alignment.
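For reference, the following minimal Python sketch illustrates this conventional SIFT + RANSAC registration pipeline (OpenCV's SIFT, brute-force matching with Lowe's ratio test, and RANSAC-based homography estimation are assumed; the frame file names are placeholders, not part of the proposed method):

```python
# Minimal sketch of the conventional SIFT + RANSAC registration pipeline
# described above (OpenCV >= 4.4; the frame file names are placeholders).
import cv2
import numpy as np

src = cv2.imread("source_frame.png", cv2.IMREAD_GRAYSCALE)
tgt = cv2.imread("target_frame.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(src, None)
kp2, des2 = sift.detectAndCompute(tgt, None)

# Match descriptors and keep the best correspondences (Lowe's ratio test).
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

pts_src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
pts_tgt = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# RANSAC rejects outlier correspondences while estimating the homography.
H, inlier_mask = cv2.findHomography(pts_src, pts_tgt, cv2.RANSAC, 3.0)

# Warp the source frame into the target frame's coordinate system.
aligned = cv2.warpPerspective(src, H, (tgt.shape[1], tgt.shape[0]))
```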

In this paper, the image registration is carried out by extracting the feature points with the HALF-SIFT [24] method instead of SIFT. SIFT suffers from a localization error introduced by its feature detector; HALF-SIFT extracts more accurately localized feature points and rectifies this error. The localization error is assumed to arise either from pixel intensity noise or from the feature extraction method itself, and to have zero mean and a non-uniform distribution. This type of error is called an anisotropic and non-uniformly distributed localization error [25,26,27,28], also referred to as a random localization error [20]. Such an error does not satisfy the assumptions of the RANSAC estimator. To estimate this type of error, covariance matrices are used in the proposed video inpainting method.

Next, the selection of inliers and the estimation of the homography matrix are performed using covariance-weighted maximum likelihood sample consensus (CW MLESAC). Finally, the homography matrix is refined with the covariance-weighted Levenberg–Marquardt algorithm (CW L-M). The last step of the video inpainting method is target region filling, which is completed by globally minimizing a robust energy cost function with the expansion move algorithm [29,30,31]. To assess the effectiveness of the proposed method, experiments are conducted on the densely annotated video segmentation (DAVIS) dataset [32].

The paper is organized as follows: the proposed video inpainting approach is described in Sect. 2. Section 3 presents the experimental results of the proposed method and a comparison of metrics with existing methods. The concluding remarks are given in Sect. 4.

2 Proposed Video Inpainting Method

The proposed method is implemented in two main steps: the first is image registration, and the second is target region filling. The target frame with a hole is inpainted by aligning the neighboring frames to the target frame, and the hole is filled by diffusing pixels from the aligned frames into the target region. The image registration is achieved with an advanced homography-based image registration method, and the hole in the target frame, called the target region, is filled by globally minimizing a robust energy function. The flow of the proposed method is shown in Fig. 1.

Fig. 1 Flow of the proposed video inpainting method

2.1 Advanced Homography-based Registration Method

In the proposed image registration method, the feature points are extracted using the HALF-SIFT [24] method instead of SIFT. HALF-SIFT extracts more accurately localized feature points and rectifies the localization error caused by the SIFT feature detector. Next, the localization error due to pixel intensity noise is estimated with a covariance matrix. Feature matching between the frames is performed using minimum Euclidean distance. Then, inliers are selected and the homography matrix is estimated with covariance-weighted maximum likelihood sample consensus (CW MLESAC), and finally the homography matrix is refined with the covariance-weighted Levenberg–Marquardt algorithm (CW L-M).

2.1.1 Localized Feature Points Extraction Using HALF-SIFT

The localization error of feature points in the SIFT method is due to the use of parabolic interpolation to estimate the coordinates of the feature points. The HALF-SIFT method improves the interpolation accuracy and thereby produces accurately localized feature points.

To extract accurately localized feature points, HALF-SIFT uses regression analysis to minimize the distance between the sampling points near the feature point in the difference-of-Gaussians pyramid and a response model function. The response model function is the output obtained from the difference-of-Gaussians (DoG) filter when a Gaussian function is applied as its input. It is described by the parameter vector \(v = \left( {m_{0} ,n_{0} , a, b, c, r} \right)\), where \(M_{0} = \left( {m_{0} ,n_{0} } \right)\) is the accurate position of the feature point. This parameter vector \(v\) is optimized with the Levenberg–Marquardt algorithm. The optimized objective function [24] is taken as

$$ \hat{v} = arg\mathop {\min }\limits_{v} \left\{ {\mathop \sum \limits_{i,j,p = - 1,0,1} \parallel X\left( {m_{c} \pm i,n_{c} \pm j,\left( {2^{1/3} } \right)^{p} \alpha } \right) - X_{v} \left( {m_{c} \pm i,n_{c} \pm j,\left( {2^{1/3} } \right)^{p} \alpha } \right)\parallel_{2}^{2} } \right\} $$
(1)

Difference of Gaussians (DoG) is a feature enhancement algorithm that subtracts one Gaussian-blurred version of the original image from another, less blurred, version. The result is a set of images at a variety of scales, each being a "difference of Gaussians"; this set is called the difference-of-Gaussians pyramid and is denoted \(X\left( \cdot \right)\). Here, \(\left( {m_{c} ,n_{c} } \right)\) is the local maximum point extracted with SIFT, \(X_{v} \left( \cdot \right)\) is the response model function described by the parameter vector \(v\), \(\alpha\) is the scale amplitude in SIFT, and \(i\), \(j\), and \(p\) index the sampling offsets around the extremum in position and scale.
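As an illustration of the least-squares fit in Eq. (1), the sketch below refines a detected extremum by fitting a parametric response model to the 3 × 3 × 3 DoG neighborhood with SciPy's Levenberg–Marquardt solver. The Gaussian-shaped `response_model` is only a hypothetical stand-in, not the exact HALF-SIFT response model of [24], and `dog_layers` is assumed to map \(p \in \{-1, 0, 1\}\) to the corresponding DoG layer:

```python
# Illustrative refinement of a SIFT extremum following the idea of Eq. (1):
# least-squares fit of a response model to the 3x3x3 DoG samples around the
# detected point. The model below is a hypothetical stand-in, not the exact
# HALF-SIFT response model.
import numpy as np
from scipy.optimize import least_squares

def response_model(v, m, n, s):
    """Hypothetical response model X_v(m, n, s) with v = (m0, n0, a, b, c, r)."""
    m0, n0, a, b, c, r = v
    return a * np.exp(-((m - m0) ** 2 + (n - n0) ** 2) / (2.0 * (b * s) ** 2)) + c + r * s

def refine_keypoint(dog_layers, mc, nc, alpha):
    """dog_layers: dict mapping p in {-1, 0, 1} to the DoG layer at scale (2**(1/3))**p * alpha;
    (mc, nc): column/row of the extremum detected by SIFT."""
    samples, coords = [], []
    for p in (-1, 0, 1):
        layer = dog_layers[p]
        s = (2.0 ** (1.0 / 3.0)) ** p * alpha
        for i in (-1, 0, 1):
            for j in (-1, 0, 1):
                samples.append(float(layer[nc + j, mc + i]))
                coords.append((mc + i, nc + j, s))
    samples, coords = np.asarray(samples), np.asarray(coords)

    def residuals(v):
        return response_model(v, coords[:, 0], coords[:, 1], coords[:, 2]) - samples

    v0 = np.array([mc, nc, samples.max(), 1.0, samples.mean(), 0.0])
    fit = least_squares(residuals, v0, method="lm")   # Levenberg-Marquardt, as in [24]
    return fit.x[0], fit.x[1]                         # refined sub-pixel position (m0, n0)
```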

2.2 Estimation of Localization Error Using the Covariance Matrix

The localization error of a feature point depends on the distribution of the pixel intensity values near the feature point. An uneven distribution of these values leads to an anisotropic and non-identical localization error. A covariance matrix is used to represent this anisotropic and non-identical error [25], expressed as

$$ {\Lambda } = \sigma .G_{r} .\left[ {\begin{array}{*{20}c} \mu & 0 \\ 0 & {1 - \mu } \\ \end{array} } \right].G_{r}^{T} $$
(2)

where \(G_{r} = \left[ {\begin{array}{*{20}c} {\cos \theta } & { - \sin \theta } \\ {\sin \theta } & { \cos \theta } \\ \end{array} } \right]\) is the rotation matrix representing an anti-clockwise rotation by an angle \(\theta\). Here, \(\sigma \in \left[ {0,\infty } \right)\) is the scale, \(\mu \in \left( {0,1} \right)\) is the eccentricity, and \(\theta \in \left[ {0,\pi } \right)\) is the rotation angle of the covariance matrix \({\Lambda }\).
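For concreteness, Eq. (2) can be transcribed directly into a few lines of NumPy:

```python
# Direct NumPy transcription of Eq. (2): the covariance matrix built from
# its scale (sigma), eccentricity (mu) and rotation angle (theta) parameters.
import numpy as np

def covariance_from_params(sigma, mu, theta):
    c, s = np.cos(theta), np.sin(theta)
    G = np.array([[c, -s], [s, c]])     # anti-clockwise rotation by theta
    D = np.diag([mu, 1.0 - mu])         # eccentricity terms on the diagonal
    return sigma * G @ D @ G.T
```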

In this work, the random localization error of feature points is assumed as a bilateral Gaussian model and expressed in the covariance matrix, which is estimated as

$$ {\Lambda }^{{\left( {\alpha_{v} } \right)}} = \left( {\mathop \sum \limits_{{\left( {k,l} \right) \in N_{v} }} u\left( {k,l} \right).\left[ {\begin{array}{*{20}c} {X_{mm} \left( {k,l,\alpha_{v} } \right)} & {X_{mn} \left( {k,l,\alpha_{v} } \right)} \\ {X_{mn} \left( {k,l,\alpha_{v} } \right)} & {X_{nn} \left( {k,l,\alpha_{v} } \right)} \\ \end{array} } \right]} \right)^{ - 1} $$
(3)
$$ {\Lambda } = {\Lambda }^{{\left( {\alpha_{v} } \right)}} .\left( {\frac{{{\text{res}}\;\left( {\alpha_{0} } \right)}}{{{\text{res}}\,\left( {\alpha_{v} } \right)}}} \right)^{2} $$
(4)

where \(u\left( {k,l} \right)\) is the Gaussian weighting coefficient, \(X\left( { \cdot ,\alpha_{v} } \right)\) is the corresponding layer of the scale-space pyramid, \(N_{v}\) is a small neighborhood of the feature point over which \(\left( {k,l} \right)\) ranges, \(\alpha_{v}\) is the scale parameter of the current layer, and \({\text{res}}\left( {\alpha_{0} } \right)\) and \({\text{res}}\left( {\alpha_{v} } \right)\) are the image resolutions of the layers of the scale-space pyramid whose scale parameters are \(\alpha_{0}\) and \(\alpha_{v}\), respectively. The sub-indices \(m\) and \(n\) in Eq. (3) refer to the image coordinates of the local maximum position obtained from the SIFT method.
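A possible NumPy sketch of Eqs. (3) and (4) is given below; it assumes that \(X_{mm}\), \(X_{mn}\), and \(X_{nn}\) are the second partial derivatives of the DoG layer, and the window half-width and Gaussian width of the weighting \(u(k,l)\) are illustrative choices rather than values prescribed by the method:

```python
# Sketch of Eqs. (3)-(4): a Gaussian-weighted sum of second-derivative matrices
# of the DoG layer around the keypoint, inverted and rescaled to the base
# resolution. X_mm, X_mn, X_nn are taken here as second partial derivatives of
# the layer; the window half-width and Gaussian width are illustrative choices.
import numpy as np

def localization_covariance(dog_layer, m, n, res_ratio, half=2, gauss_sigma=1.0):
    """dog_layer: layer X(., alpha_v); (m, n): keypoint column/row;
    res_ratio: res(alpha_0) / res(alpha_v) between the pyramid layers."""
    layer = dog_layer.astype(np.float64)
    Xm = np.gradient(layer, axis=1)               # derivative along m (columns)
    Xn = np.gradient(layer, axis=0)               # derivative along n (rows)
    Xmm = np.gradient(Xm, axis=1)
    Xnn = np.gradient(Xn, axis=0)
    Xmn = np.gradient(Xm, axis=0)

    A = np.zeros((2, 2))
    for dl in range(-half, half + 1):
        for dk in range(-half, half + 1):
            k, l = m + dk, n + dl
            u = np.exp(-(dk * dk + dl * dl) / (2.0 * gauss_sigma ** 2))   # u(k, l)
            A += u * np.array([[Xmm[l, k], Xmn[l, k]],
                               [Xmn[l, k], Xnn[l, k]]])
    cov_layer = np.linalg.inv(A)                  # Eq. (3)
    return cov_layer * res_ratio ** 2             # Eq. (4)
```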

2.2.1 Selection of Inliers with CW MLESAC

In traditional feature point matching methods such as LMedS and RANSAC, the residuals are calculated using Eq. (5). For this calculation, the set of correspondences is written as \(C = \left\{ {\left. {\left( {c_{1i} ,c_{2i} } \right)} \right| i = 1,2,3, \ldots , n} \right\}\), where \(c_{1i}\) and \(c_{2i}\) are the pixel positions of matched feature points in the two images.

$$ r_{i} = \sqrt {\parallel e\parallel_{2}^{2} } = \sqrt {\parallel c_{2i} - \hat{c}_{2i} \parallel_{2}^{2} } $$
(5)

where \(e\) is the re-projection error vector and \(\hat{c}_{2i} = H\left( {c_{1i} } \right)\) is the pixel position predicted by the homography transform \(H\). The residuals calculated with this formula do not account for the anisotropic and non-identical properties of the localization error, so they lead to inaccurate homography estimation. This is rectified by selecting the inliers with the CW MLESAC method, which uses normalized covariance-weighted residuals (NCWR) that incorporate the properties of the feature point localization error.

The covariance matrix is decomposed as

$$ {\Lambda } = U.\left[ {\begin{array}{*{20}c} {\varepsilon_{1} } & 0 \\ 0 & {\varepsilon_{2} } \\ \end{array} } \right]. U^{T} $$
(6)

Here, \(\varepsilon_{1}\) and \(\varepsilon_{2}\) are the eigenvalues of the covariance matrix, and \(U = \left[ {u_{1} ,u_{2} } \right]\) contains the corresponding eigenvectors.

In the proposed CW MLESAC method, the residual \(r_{i}\) is replaced with the NCWR \(\overline{r}_{i}\), defined as

$$ \overline{r}_{i} = \sqrt {\left( {\overline{e}_{{\varepsilon_{1} }} } \right)^{2} + \left( {\overline{e}_{{\varepsilon_{2} }} } \right)^{2} } = \sqrt {\frac{{2\varepsilon_{1} }}{{\varepsilon_{1} + \varepsilon_{2} }}(e_{{u_{1} }} )^{2} + \frac{{2\varepsilon_{2} }}{{\varepsilon_{1} + \varepsilon_{2} }}(e_{{u_{2} }} )^{2} } = \sqrt {\frac{{2\varepsilon_{1} \left( {u_{1}^{T} .\left( {c_{2i} - \hat{c}_{2i} } \right)} \right)^{2} }}{{\varepsilon_{1} + \varepsilon_{2} }} + \frac{{2\varepsilon_{2} \left( {u_{2}^{T} .\left( {c_{2i} - \hat{c}_{2i} } \right)} \right)^{2} }}{{\varepsilon_{1} + \varepsilon_{2} }}} $$
(7)

Finally, the best inlier correspondences \(C_{{{\text{inlier}}}}\) and the homography matrix \(H_{{{\text{inlier}}}}\) are selected using CW MLESAC.
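The following sketch shows how the NCWR of Eq. (7) could be evaluated for a set of correspondences given a candidate homography and per-point covariance matrices; inside a MLESAC-style hypothesis-and-verify loop, these values would replace the Euclidean residuals of Eq. (5) when scoring each candidate (the loop itself is omitted):

```python
# Sketch of the normalized covariance-weighted residual (NCWR) of Eq. (7).
# Within a MLESAC-style hypothesis-and-verify loop, these residuals would
# replace the Euclidean residuals of Eq. (5) when scoring each candidate
# homography; the hypothesis loop itself is omitted here.
import numpy as np

def project(H, pts):
    """Apply a 3x3 homography H to an Nx2 array of points."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def ncwr(H, c1, c2, covariances):
    """c1, c2: Nx2 matched points; covariances: per-point 2x2 error covariance matrices."""
    c2_hat = project(H, c1)
    r = np.empty(len(c1))
    for i, cov in enumerate(covariances):
        eps, U = np.linalg.eigh(cov)            # eigenvalues/eigenvectors as in Eq. (6)
        e = c2[i] - c2_hat[i]                   # re-projection error vector
        proj = U.T @ e                          # error components along the eigenvector axes
        w = 2.0 * eps / eps.sum()               # weights 2*eps_k / (eps_1 + eps_2)
        r[i] = np.sqrt(np.sum(w * proj ** 2))   # Eq. (7)
    return r
```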

2.2.2 Refining of Homography Matrix with CW L-M

Consider \(C_{{{\text{inlier}}}} = \left\{ {\left. {\left( {c_{1i} ,c_{2i} } \right)} \right| i = 1,2,3, \ldots , n} \right\}\) as the inlier correspondences, where \(c_{1i}\) and \(c_{2i}\) are the pixel positions in the two images. The objective function \(F\) used in the existing L–M method is

$$ F = \arg \;\mathop {\min }\limits_{H} \mathop \sum \limits_{i = 1}^{n} \parallel e\parallel_{2}^{2} = \arg \;\mathop {\min }\limits_{H} \mathop \sum \limits_{i = 1}^{n} \parallel c_{2i} - H(c_{1i} )\parallel_{2}^{2} $$
(8)

where \(e\) is the re-projection error vector and \(H\left( \cdot \right)\) is the homography transform.

This objective function does not account for the anisotropic and non-identical properties of the localization error, so the resulting homography estimate is not accurate. This problem is solved with a new homography matrix refinement method, the CW L-M algorithm.

This uses the covariance weighted objective function \( \left( {\tilde{F}} \right)\), which is taken as

$$ \tilde{F} = \arg \mathop {\min }\limits_{H} \mathop \sum \limits_{i = 1}^{n} \left[ {\frac{{\left( {u_{1i }^{T} .\left( {c_{2i} - H\left( {c_{1i} } \right)} \right)} \right)^{2} }}{{\varepsilon_{1i} }} + \frac{{\left( {u_{2i }^{T} .\left( {c_{2i} - H\left( {c_{1i} } \right)} \right)} \right)^{2} }}{{\varepsilon_{2i} }}} \right] $$
(9)

where \(\varepsilon_{1i}\) and \(\varepsilon_{2i}\) are the eigenvalues of the covariance matrix of the \(i\)-th feature point correspondence in \(C_{{{\text{inlier}}}}\), and \(U = \left[ {u_{1i} ,u_{2i} } \right]\) contains the corresponding eigenvectors.

In Eq. (9), the new objective function accounts for the anisotropic properties of the localization error: the re-projection error vector is rotated onto the axes defined by the eigenvectors of the covariance matrix, and each component is divided by the corresponding eigenvalue, so that different feature points contribute differently. As a result, the localization error of the feature points becomes isotropic and identically distributed, and the L–M method then yields the optimal solution.
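A sketch of this covariance-weighted refinement is shown below, using SciPy's Levenberg–Marquardt solver on the residuals of Eq. (9); the parameterization of \(H\) by its first eight entries (with \(H_{33} = 1\)) is an illustrative choice rather than the paper's exact parameterization:

```python
# Sketch of the covariance-weighted Levenberg-Marquardt refinement of Eq. (9).
import numpy as np
from scipy.optimize import least_squares

def project(H, pts):
    """Apply a 3x3 homography H to an Nx2 array of points."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def cw_residuals(h, c1, c2, eigvals, eigvecs):
    H = np.append(h, 1.0).reshape(3, 3)
    e = c2 - project(H, c1)                           # Nx2 re-projection errors
    res = []
    for i in range(len(c1)):
        proj = eigvecs[i].T @ e[i]                    # rotate error onto eigenvector axes
        res.extend(proj / np.sqrt(eigvals[i]))        # divide by sqrt(eps_1i), sqrt(eps_2i)
    return np.asarray(res)                            # squaring and summing gives Eq. (9)

def refine_homography(H0, c1, c2, covariances):
    """H0: initial homography from CW MLESAC; covariances: per-inlier 2x2 matrices."""
    eigvals, eigvecs = zip(*[np.linalg.eigh(c) for c in covariances])
    fit = least_squares(cw_residuals, (H0 / H0[2, 2]).ravel()[:8],
                        args=(c1, c2, eigvals, eigvecs), method="lm")
    return np.append(fit.x, 1.0).reshape(3, 3)
```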

2.3 Target Region Filling

The hole \({\Phi }_{t}\) in the target frame \(I_{t}\), called the target region, is filled with similar pixels from the group of frames aligned to the target frame while maintaining spatiotemporal coherence. The missing region in the target frame is inpainted by globally minimizing an energy function with the expansion move algorithm [29,30,31]. If \(M\) past and \(M\) future neighboring frames are considered, every pixel in the hole is inpainted with the best-suited pixels from the \(2M\) registered frames \(\tilde{I}_{i} , i = 1 \ldots 2M\). Let \(S\left( v \right)\) and \(S\left( w \right)\) denote the labels assigning a matching registered frame to pixels \(v\) and \(w\), respectively. The optimal labeling \(S^{*}\) is obtained by minimizing the energy cost function over all the pixels of the hole in the target frame:

$$ S^{*} = \arg \mathop {\text{ min }}\limits_{S} \xi \left( S \right) $$
(10)

with

$$ \xi \left( S \right) = \mathop \sum \limits_{{v \in \Phi_{t} }} E_{d} \left( {S\left( v \right)} \right) + \gamma \mathop \sum \limits_{{\left( {v,w} \right) \in N\left( {\Phi_{t} } \right)}} E_{sm} \left( {S\left( v \right), S\left( w \right)} \right) $$
(11)

where \(E_{d} \left( {S\left( v \right)} \right)\) is the data term, which measures how well the label \(S\left( v \right)\) explains the stationary background surrounding the pixel \(v\). The data term is divided into three terms:

$$ E_{d} \left( {S\left( v \right)} \right) = E_{0} \left( {\hat{I}_{S\left( v \right)} } \right) + E_{1} \left( {\hat{I}_{S\left( v \right)} } \right) + E_{2} \left( {\hat{I}_{S\left( v \right)} } \right) $$
(12)

where the first term, \(E_{0} \left( {\hat{I}_{S\left( v \right)} } \right) = \log \left( {1 + {\text{SSD}}\left( {I_{t} , \hat{I}_{S\left( v \right)} } \right)} \right)\), is the logarithm of the sum of squared differences between the current target frame \(I_{t}\) and the frame \(\hat{I}_{S\left( v \right)}\) registered to it. This term favors pixels from the best-aligned frames, i.e., those with the smallest alignment error, and the logarithm limits the dynamic range of the term.

The second term, \(E_{1} \left( {\hat{I}_{S\left( v \right)} } \right) = \frac{1}{{2M\left| {{\Omega }_{S\left( v \right)} } \right|}}\mathop \sum \limits_{k = - M}^{M} \parallel {\Omega }_{S\left( v \right)} - {\Omega }_{k} \parallel^{2}\), compares patches across the registered frames, where \({\Omega }_{S\left( v \right)}\) is the patch centered at pixel \(v\) in the registered frame \(\hat{I}_{S\left( v \right)}\), \(\left| {{\Omega }_{S\left( v \right)} } \right|\) is the number of pixels in the patch, and \({\Omega }_{k}\) is the corresponding patch in the \(k\)-th registered frame. This term is responsible for maintaining temporal consistency in the inpainting results.

The third term, \(E_{2} \left( {\hat{I}_{S\left( v \right)} } \right) = \left( {I_{s} \left( v \right) - \hat{I}_{S\left( v \right)} \left( v \right)} \right)^{2}\), measures the similarity between the pixel in the target frame and the pixel in the source frame at \(v\), where \(I_{s} \left( v \right)\) is obtained by spatial inpainting. This term is responsible for the spatiotemporal coherence of the video inpainting results.

The second term in Eq. (11), \(E_{sm} \left( {S\left( v \right), S\left( w \right)} \right)\), is the smoothness term, calculated between each pair of adjacent pixels in the hole of the target frame as

$$ E_{sm} \left( {S\left( v \right), S\left( w \right)} \right) = \parallel \hat{I}_{S\left( v \right)} \left( v \right) - \hat{I}_{S\left( w \right)} \left( v \right)\parallel^{2} + \parallel \hat{I}_{S\left( v \right)} \left( w \right) - \hat{I}_{S\left( w \right)} \left( w \right)\parallel^{2} $$
(13)

The smoothness term maintains spatial consistency by encouraging adjacent pixels in the target region to be filled with similar pixel values from the neighboring registered frames. \(N\left( {{\Phi }_{t} } \right)\) denotes the set of 4-connected neighboring pixel pairs within the hole. The value of \(\gamma\) in Eq. (11) is chosen as 10 to balance the data term and the smoothness term. Equation (10) is minimized using the expansion move algorithm [29,30,31].
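The sketch below illustrates how the data term of Eq. (12) could be evaluated for a single hole pixel and candidate label; the patch size and the spatially inpainted estimate \(I_{s}(v)\) are assumed inputs, frames are taken as grayscale for simplicity, and in a full implementation these costs would be computed for every hole pixel and label before minimizing Eq. (11) with an expansion-move (graph-cut) solver:

```python
# Sketch of the data term of Eq. (12) for one hole pixel v and one candidate
# label S(v) (i.e., one registered frame). Frames are assumed grayscale.
import numpy as np

def patch(img, y, x, half=3):
    """Square patch of side 2*half+1 centred at (y, x)."""
    return img[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)

def data_term(target, registered, label, y, x, I_s_v):
    """target: target frame I_t; registered: list of the 2M aligned frames;
    label: candidate index S(v); (y, x): hole pixel v; I_s_v: spatial inpainting estimate at v."""
    chosen = registered[label]

    # E0: alignment quality of the chosen frame around v (log-limited SSD).
    # In practice the SSD is evaluated only over the known (non-hole) pixels.
    E0 = np.log(1.0 + np.sum((patch(target, y, x) - patch(chosen, y, x)) ** 2))

    # E1: temporal consistency of the chosen patch against all registered frames.
    p_v = patch(chosen, y, x)
    E1 = np.mean([np.sum((p_v - patch(f, y, x)) ** 2) for f in registered]) / p_v.size

    # E2: agreement with the spatially inpainted estimate at v.
    E2 = (float(I_s_v) - float(chosen[y, x])) ** 2

    return E0 + E1 + E2
```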

3 Experimental Results and Discussion

The proposed video inpainting algorithm is implemented in MATLAB. Inpainting a video of around 100 frames takes at most 3 h on an Intel Core i3 processor with 8 GB RAM. The proposed algorithm was tested on multiple complex video sets that include motion blur, camera movement, dynamic backgrounds, and complex hole shapes, and the results were compared with existing video completion methods. The results obtained with the proposed method are visually plausible in comparison with existing video inpainting methods.

In the proposed advanced homography-based video inpainting method, the experiments are carried out for the quantitative analysis in terms of PSNR and SSIM.

The PSNR and SSIM values are determined as follows:

  • A separate dataset of composite videos is created from 11 videos. These 11 videos are treated as the ground truth videos.

  • Each composite video is created by adding a mask object, taken from the DAVIS dataset, to the ground truth video.

  • This created composite video is treated as an input video for inpainting.

  • The proposed inpainting method is applied to the composite video to remove the added mask object.

  • The resulting video is compared with the corresponding ground truth video to calculate the PSNR and SSIM values.

  • In the same manner, all 11 composite videos are inpainted and the metrics are determined.

  • Likewise, the existing video inpainting techniques are applied to the composite videos to determine their PSNR and SSIM values (a sketch of the frame-wise metric computation is given below).
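A minimal sketch of the frame-wise metric computation is given here, assuming a recent scikit-image and that the ground-truth and inpainted videos are available as lists of equally sized uint8 RGB frames:

```python
# Sketch of the frame-wise PSNR/SSIM evaluation described above; per-frame
# scores are averaged over the video.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_metrics(gt_frames, inpainted_frames):
    psnrs, ssims = [], []
    for gt, out in zip(gt_frames, inpainted_frames):
        psnrs.append(peak_signal_noise_ratio(gt, out, data_range=255))
        ssims.append(structural_similarity(gt, out, channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```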

The video inpainting results for a few video sequences are shown in Fig. 2; the odd rows show the frames with the mask of the object to be removed, and the even rows show the resulting frames after removing the object with the proposed method.

Fig. 2 Object removal; odd rows: input video frames (videos 1, 2, 3); even rows: frames inpainted with the proposed method

3.1 Comparison with Existing Methods

The video inpainting approach by Granados et al. [10] is quite similar to the proposed method in this paper. Homography-based image registration is used in [10] to align the input video frames to the target frame, and the missing pixels in the video frames are filled with information taken from the registered frames. Cost function minimization is used to find the best pixel values to fill the target region. This entire process is time-consuming, even for low-resolution videos and long sequences.

In this work, a more advanced and robust homography estimation than that of [10] is proposed to align the input video frames to the target frame, and the inpainting quality is further improved through the data term of the energy function used to fill the target region. The authors of [10] have not published their code, and it would be difficult to reproduce their work without introducing errors; hence, it is not compared with the proposed work in terms of performance metrics. The patch-based video inpainting algorithm [13] used spatiotemporal sampling to fill the target region and is compared with the proposed work. The frames of the video sequences with the object to be removed and the inpainted results are shown in Figs. 3, 4 and 5. From these results, it can be seen that artifacts occur in [13] due to the incorrect flow of pixels from the known region into the hole of the target frame and due to incorrect frame alignment. The proposed method achieves more reliable results than [13] because of the proper alignment of the source frames to the target frame.

Fig. 3 Comparison of results with [13]: a input video frames with an object to be removed (video 4); b results from [13]; c results from the proposed method

Fig. 4 Comparison of results with [13]: a input video frames with an object to be removed (video 5); b results from [13]; c results from the proposed method

Fig. 5 Comparison of results with [13]: a input video frames with an object to be removed (video 6); b results from [13]; c results from the proposed method

The method of [16] is computationally complex because of the forward and backward flow fields computed over the image sequence. As mentioned by its authors, it produces noticeable artifacts in videos containing a dynamic camera, foreground, and background. In comparison, our method reconstructs the frames without such artifacts in all of the above cases, owing to the highly accurate feature point computation used to match pixels across frames. The comparison of the inpainted results with [16] is shown in Figs. 6 and 7.

Fig. 6 Comparison of results with [16]: a input video frames with an object to be removed (video 7); b results from [16]; c results from the proposed method

Fig. 7 Comparison of results with [16]: a input video frames with an object to be removed (video 8); b results from [16]; c results from the proposed method; d mask of the object to be removed for a complex frame and the artifact in the inpainted frame

The results obtained from the proposed method are compared with [13] and [16] as shown in Figs. 8, 9 and 10.

Fig. 8 Comparison of results with [13] and [16]: a input video frames with an object to be removed (video 9); b results from [13]; c results from [16]; d results from the proposed method

Fig. 9 Comparison of results with [13] and [16]: a input video frames with an object to be removed (video 10); b results from [13]; c results from [16]; d results from the proposed method

Fig. 10 Comparison of results with [13] and [16]: a input video frames with an object to be removed (video 11); b results from [13]; c results from [16]; d results from the proposed method

The PSNR and SSIM values for the existing methods are computed by running the code available from the respective authors' pages, and the proposed method is applied to the same videos used in those methods. The PSNR and SSIM results for the different videos are given in Table 1, where the state-of-the-art inpainting methods of Newson [13] and Huang [16] are compared with the proposed method on 11 input videos. The experiment is carried out on 11 standard input videos from the DAVIS dataset, and the average PSNR and SSIM are evaluated to compare the performance of the proposed method with the existing methods. The average PSNR of the proposed method is 21.075, whereas that of Newson [13] is 19.921 and that of Huang [16] is 20.556. From these results, the proposed algorithm enhances PSNR by about 6% and 3% compared with [13] and [16], respectively. The average SSIM of the proposed method over the 11 videos is 0.932, which improves on the values of 0.9 for [13] and 0.907 for [16].

Table 1 Comparison of PSNR and SSIM values of various videos with available methods in the literature [13, 16] and the proposed method

Figure 11 shows the variation of the PSNR and SSIM values over the different videos from the DAVIS dataset. Video 2 has the highest PSNR value among all the videos, whereas video 8 has the lowest; hence, the average PSNR is evaluated and compared with the state-of-the-art works. The PSNR values of the proposed method and the state-of-the-art methods are compared in Fig. 12, where it can be observed that the PSNR of the proposed method is improved for all the standard input videos compared with both Newson [13] and Huang [16]. The variation of the SSIM of the proposed method is shown in Fig. 13, which indicates that a significant improvement is achieved by the proposed method.

Fig. 11 Comparison graph of PSNR and SSIM values

Fig. 12 Comparison of PSNR values for the standard input videos

Fig. 13 Comparison of SSIM values

4 Conclusions

A novel video inpainting method using advanced homography-based image registration is proposed. In this registration, HALF-SIFT is used for accurate feature point extraction, and a covariance matrix is used to estimate the localization error. The proper selection of inliers, which removes outliers, is achieved with CW MLESAC, and the homography matrix is further refined with CW L-M. This image registration process strengthens the spatiotemporal coherence of the video inpainting. The hole in the target frame is then inpainted by globally minimizing the energy cost function. The experimental results are compared visually with two video inpainting methods from the literature, and the comparison shows that the proposed method produces higher-quality results than the existing methods. The performance metrics PSNR and SSIM are determined for the proposed method and compared with the results of the existing methods: averaged over 11 videos, the proposed method improves the average PSNR by about 6% and 3% over [13] and [16], respectively, and also improves the average SSIM.