1 Introduction

3D video and free viewpoint video (FVV) have generated great interest in the past few years due to their realistic and interactive experience [38]. Compared with traditional 2D video, the addition of depth perception helps viewers better distinguish the occlusion relationships between objects in the scene. Due to the limitation of transmission bandwidth, it is not practical to generate FVV with a large number of video capture devices [20, 23]. One practical approach is to generate a series of virtual views at the receiving end based on one or more key reference views. In this case, video-plus-depth or multiview video-plus-depth is the common format for transmitting 3D video, because a depth image requires less bandwidth than a color image of the same size [41]. Among view synthesis techniques, depth-image-based rendering (DIBR) is widely used to produce content for 3D video and FVV, as it requires only a single reference view and its associated depth image [10]. In DIBR, all pixels in the reference image are projected into the world coordinate system based on their respective depth values, and the virtual image is then synthesized by projecting these points onto the target image plane. This process is called 3D warping [25].
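As a concrete illustration, the following sketch implements the two-step 3D warping described above in Python with NumPy. It is a minimal example under our own assumptions, not the paper's implementation: the intrinsics `K_ref`/`K_virt`, the relative pose `(R, t)`, and the use of metric depth in `depth` (rather than the 8-bit inverse-depth convention adopted later in this paper) are all ours.

```python
import numpy as np

def warp_3d(color, depth, K_ref, K_virt, R, t):
    """Forward 3D warping sketch: back-project each reference pixel to 3D
    using its depth, then project into the virtual camera. Unfilled
    pixels in the output are the holes discussed below."""
    h, w = depth.shape
    virt = np.zeros_like(color)
    z_buf = np.full((h, w), np.inf)              # keep the nearest surface
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    src_u, src_v = us.ravel(), vs.ravel()
    pix = np.stack([src_u, src_v, np.ones_like(src_u)])  # homogeneous (3, N)
    pts = np.linalg.inv(K_ref) @ pix * depth.ravel()     # back-project to 3D
    pts = R @ pts + t.reshape(3, 1)                      # reference -> virtual frame
    proj = K_virt @ pts
    u2 = np.round(proj[0] / proj[2]).astype(int)         # rounding causes cracks
    v2 = np.round(proj[1] / proj[2]).astype(int)
    ok = (proj[2] > 0) & (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    for i in np.flatnonzero(ok):                         # simple Z-buffer test
        if proj[2, i] < z_buf[v2[i], u2[i]]:
            z_buf[v2[i], u2[i]] = proj[2, i]
            virt[v2[i], u2[i]] = color[src_v[i], src_u[i]]
    return virt, z_buf
```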

In fact, during the 3D warping process, a critical problem arises as the viewpoint moves: background regions occluded by foreground objects in the reference view may become visible in the synthesized view. As no pixels from the reference image are projected into these regions, holes are created, called disocclusions (shown in Fig. 1) [7]. Figure 1(a) and 1(b) are the synthesized results for a small baseline and a large baseline, respectively. It can be seen that the disocclusion becomes larger as the baseline increases (marked in white). In addition, since the depth image reflects the distance from the object to the camera [14], the larger the depth discontinuity between adjacent pixels at a foreground-background junction, the greater their separation in the virtual image, resulting in a larger disocclusion. Therefore, reasonable filling of these disocclusions is critical to the quality of the virtual image. In addition to disocclusions, there are other types of artifacts in the virtual image, such as ghosts, cracks, and out-of-field regions (OFRs) [26], as shown in Fig. 1. Ghosts arise when pixels on the foreground boundary are assigned the depth values of the adjacent background during depth acquisition; they are incorrectly projected into the background region and blended with the background texture. Cracks, usually 1–2 pixels wide, are caused by the rounding of target pixel positions. Moreover, when the virtual view exceeds the imaging range of the reference view, OFRs appear at the edges of the virtual image. Since these artifacts are difficult to avoid in the original 3D warping, especially disocclusions, filling them in a visually plausible way is a challenging task.

Fig. 1 Original 3D warping result. a Small baseline, and b large baseline

In this paper, to fill the disocclusions properly and improve the visual quality of the virtual view, a disocclusion-type-aware hole filling method is proposed. Our main contributions are as follows: 1) We divide the disocclusions into two types based on the depth information of their edge pixels: foreground-background (FG-BG) disocclusions and background-background (BG-BG) disocclusions. 2) For the former, an adaptive depth image preprocessing method is introduced to decompose the disocclusions so that they can be easily filled in the virtual view; for the latter, the corresponding foreground objects are removed so that the disocclusions can be filled with reliable background texture. 3) A modified depth-based inpainting method is proposed for the filling process to increase the reliability of the results. The rest of the paper is organized as follows. Related work is reviewed in Section 2. The proposed disocclusion handling approach is described in Section 3. Experimental results are presented in Section 4. Finally, Section 5 discusses the limitations and concludes the paper.

2 Related work

Under the condition of single-view rendering, disocclusion handling methods can generally be divided into two categories. The first preprocesses the depth image before 3D warping. Depth images can be obtained by depth cameras or stereo matching algorithms, and accurate depth values are important for identifying foreground and background pixels [35, 42]. Since disocclusions are caused by depth discontinuities, some methods apply a low-pass filter to smooth the depth image. Symmetric Gaussian filters are proposed in [37] to remove isolated noise pixels in the depth image and reduce the area of disocclusions. This approach usually leads to the rubber-sheet effect, i.e., geometric distortion of objects. On this basis, Zhang et al. [44] apply an asymmetric Gaussian filter that smooths depth discontinuities more strongly in the vertical direction than in the horizontal direction. Owing to its anisotropic nature, this filter prevents disocclusions while avoiding the rubber-sheet effect. However, regions that do not cause disocclusions are also smoothed, degrading the depth layers and the visual quality. To overcome this problem, Chen et al. [2] propose an edge-dependent Gaussian filter, which only smooths the edges in the horizontal direction and prevents geometric distortion of non-hole regions. Liu et al. [21] apply a structure-aided filter to process the depth image in a transformed domain; adaptive smoothing helps prevent the generation of disocclusions. However, smoothing of the edges may blur the foreground boundaries, and the above methods are suitable only for small-baseline configurations or small depth discontinuities. Targeting large baselines, Lei et al. [15] propose a divide-and-conquer depth image preprocessing method that decomposes disocclusions by reducing the depth discontinuity along the foreground edge. This method achieves good results for simple backgrounds, but for complex backgrounds, especially those rich in vertical texture, the change of depth values may distort the structure.

The second category fills the disocclusions with surrounding textures in the spatial or temporal domain. In the spatial domain, owing to its good performance in recovering unknown regions, image inpainting has been introduced into disocclusion filling. The exemplar-based inpainting method proposed by Criminisi et al. [5] is widely used for hole filling. In this approach, the confidence and texture of the hole boundary pixels are combined to determine the filling priority, and the hole is filled with the most similar patch from the source region. In [33], a graph-cut algorithm and an exemplar-based inpainting technique are combined to fill the holes. Shen et al. [34] propose a gradient-based image completion algorithm, which reconstructs the image from gradient maps by solving a Poisson equation. In [43], a deep learning scheme is used for image content prediction and aesthetic assessment. In fact, as disocclusions originate from the background region, they should be filled with background texture. However, the classic method treats the foreground and background boundaries equally, causing some foreground textures to be sampled into the disocclusion. To prevent this problem, some improved methods incorporate depth information into the inpainting. In [6], an additional depth term is used in the calculation of priority, and patches with lower depth variance are assigned higher priority; moreover, the matching cost takes into account the similarity of both color and depth. This method fills the disocclusion under the condition that the depth image of the virtual view is known, which may not hold in practical applications. Ahn et al. [1] generate the depth image of the virtual view during 3D warping; disocclusions in the color image and the depth image are then filled simultaneously by depth-based image inpainting. Kao [13] optimizes the priority calculation of [6] and uses a depth-based gray-level distance to measure patch similarity. However, when the depth values on the foreground edge are inaccurate, ghosts appear at the edge of the disocclusion, causing foreground pixels to penetrate into the filled region. To reduce the interference of foreground pixels in hole filling, Luo et al. [22] remove the foreground object from the reference image based on depth information and then apply an improved version of Criminisi's method to generate a background image, which is used to fill the disocclusion after 3D warping. Han et al. [12] propose a layered 3D warping method in which foreground objects are segmented, and 3D perception and visual comfort are balanced by disparity control. However, these methods depend strongly on the accuracy of foreground extraction. When the depth image contains multiple layers, it is difficult to segment the foreground object with edge detection or a thresholding method such as Otsu's method [29]; it may even be necessary to specify the foreground object manually, which affects the robustness of the method.

In the temporal domain, as a foreground object moves, the background it occludes also changes. Therefore, some methods fill the disocclusion by exploiting reliable background pixels from other frames. Sun et al. [36] use a switchable Gaussian model to establish an online background model that adapts to different scenes. Lie et al. [18] propose a background sprite model that combines temporal and spatial information to remove the foreground. In [28], a background modeling method based on the novel-view domain is presented to fill the disocclusion, but it needs to generate several novel viewpoints, which increases the computational complexity. In [8], an improved Gaussian mixture model (GMM) is used to establish an adaptive background model independently for each pixel in the reference image. However, these background models achieve good results only under the assumption of moving foreground objects. For still foreground objects or a single still image, the foreground pixels may not be completely removed, and there is still a risk of foreground pixel penetration.

For view synthesis with two or more reference views, more reliable pixels are available to fill the disocclusion due to the increased number of views [32]. Zhu et al. [47] explore the mechanism of disocclusion generation in view interpolation and fill the disocclusions based on visible and invisible background information. In [16], experimental results show that multi-view rendering can effectively reduce the area of disocclusions in view interpolation and extrapolation. Most disocclusions can be filled by view merging, which reduces the difficulty of postprocessing. However, multi-view rendering requires more image acquisition equipment and higher transmission bandwidth [4]. In addition, some holes remain in the merging result and still need to be filled, especially for viewpoint extrapolation. For special applications, such as 2D-to-3D conversion systems, only a single view is available. Therefore, exploring disocclusion filling methods for single-view synthesis is still necessary.

In order to reduce the penetration of foreground texture and select highly reliable background pixels to fill the disocclusion, we apply the Laplacian of Gaussian (LOG) operator to generate the Laplacian image of the reference depth image, which is used to identify the pixels on the disocclusion boundary and thereby divide the disocclusions into two types: FG-BG disocclusions and BG-BG disocclusions. For the former, an adaptive depth image preprocessing method is introduced to remove the ghosts and decompose the disocclusions based on the complexity of the surrounding background texture. For the latter, the valid pixels on the boundary are marked in the reference image, and the corresponding foreground object that occludes the background is extracted under the guidance of the depth information, so that the disocclusion can be filled with reliable background texture. In addition, a postprocessing step is applied to deal with the remaining artifacts after merging.

3 Proposed disocclusion handling approach

The flowchart of the proposed method is shown in Fig. 2. Firstly, the reference Laplacian image is generated by applying the LOG operator to the depth image; disocclusions in the original virtual image are marked and divided into two types. Secondly, the FG boundary leading to each FG-BG disocclusion and its surrounding BG texture are adaptively optimized in the depth image, and the ghosts are removed in this step. Thirdly, based on the valid pixels located on the BG-BG disocclusion boundaries, the associated foreground objects are detected and removed according to the depth features, and the removed regions are filled by the depth-based inpainting method. The optimized reference image and its depth image are used as input to 3D warping, and the output is merged with the original virtual image. Finally, the remaining artifacts are handled during postprocessing. A pseudocode description of the proposed method is given in Algorithm 1. In the following, the core steps of the proposed method are described in detail.

Fig. 2 Flowchart of the proposed method. a Disocclusion classification, and b view synthesis module with disocclusion mask for hole filling

Algorithm 1 Pseudocode description of the proposed method

3.1 Classification and marking of disocclusions

Figure 1 shows that the depth value distribution of the pixels on the disocclusion boundary is affected by the baseline setup and the depth discontinuity. In the case of a small baseline and small depth discontinuity, a small disocclusion is generated by the movement of the viewpoint. When the FG object has a certain width, this disocclusion appears only between the FG and the BG, that is, its two sides belong to the FG and the BG respectively, as shown in Fig. 1(a). On the contrary, as the baseline and depth discontinuity increase, the difference between FG and BG displacement in the virtual image becomes larger. The area of the disocclusion therefore grows until the entire FG is projected onto new BG, and the whole occluded BG is exposed in the virtual view, as shown in Fig. 1(b). In this case, most of the disocclusion edge pixels belong to the BG. In general, disocclusions occur on the right side of the foreground for the right synthesized view, and vice versa for the left synthesized view. Ghosts appear in the BG region adjacent to the disocclusion as a mixture of FG and BG textures. Accordingly, this paper takes disocclusion handling for the right virtual view as an example; a similar process can be applied to the rendering of other virtual views. Based on the depth value distribution of the boundary pixels, we classify the disocclusions and apply different handling strategies depending on their nature.

The study in [19] shows that the Laplacian is sensitive to the disocclusion boundary and is directionally invariant. Therefore, the Laplacian operator is used to identify each valid pixel along the disocclusion boundary as an FG boundary pixel or a BG boundary pixel. To increase robustness against noise, the LOG operator is introduced. In the case of single-view rendering, based on the features of the depth image, pixels along the disocclusion boundary are identified and labeled as follows:

$$ L\left(u,v\right)=\left\{\begin{array}{l}\delta {\Omega}_{FG},\mathrm{if}\ {\left(\varDelta d\right)}_w\left(u,v\right)<0\\ {}\delta {\Omega}_{BG},\mathrm{if}\ {\left(\varDelta d\right)}_w\left(u,v\right)>0\end{array}\right.,\mathrm{for}\ \left(u,v\right)\in \delta \Omega $$
(1)

where δΩFG and δΩBG denote the FG boundary and BG boundary, respectively. d is the input depth image and Δd is the Laplacian of d. (Δd)w denotes the warped Laplacian of the original depth image, and δΩ is the boundary of the disocclusion Ω. It should be noted that a Laplacian value equal to zero means that there is no depth discontinuity at that position and no disocclusion is generated; therefore, the zero Laplacian value is not considered in Eq. (1). The classification result of the disocclusion boundary pixels is shown in Fig. 3, where the FG boundary is marked in red and the BG boundary is marked in green. On this basis, disocclusions can be divided into FG-BG disocclusions (ΩFB) and BG-BG disocclusions (ΩBB) according to the proportion of foreground pixels on the boundary. Since the disocclusion is caused by a depth discontinuity between the FG and the BG, it always has at least one BG boundary side. In this paper, the classification threshold for the FG ratio is set to 35%: a disocclusion is classified as an FG-BG disocclusion if at least 35% of its boundary pixels are FG, and as a BG-BG disocclusion otherwise. After the classification, the boundary pixels of the two types of disocclusions are marked in the reference depth image as follows:

$$ {L}_{FB}\left(u,v\right)=\left\{\begin{array}{l}1,\mathrm{if}\ {\left(u,v\right)}_w\in \delta {\Omega}_{FB}\\ {}0,\mathrm{otherwise}\end{array}\right. $$
(2)
$$ {L}_{BB}\left(u,v\right)=\left\{\begin{array}{l}1,\mathrm{if}\ {\left(u,v\right)}_w\in \delta {\Omega}_{BB}\\ {}0,\mathrm{otherwise}\end{array}\right. $$
(3)

where LFB denotes the mask of the FG-BG disocclusion boundary and LBB denotes the mask of the BG-BG disocclusion boundary, as shown in Fig. 3(c) and (d). (u, v)w denotes the pixel in the virtual image corresponding to (u, v). In the subsequent steps, only the marked pixels and their neighbors are processed, which reduces both the computational complexity and the degradation of visual quality.
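The labeling of Eq. (1) and the 35% rule can be sketched as follows. This is a simplified illustration, not the paper's code: the inputs `hole_mask` (true inside disocclusions in the virtual image) and `log_warped` (the LOG of the reference depth image after 3D warping), as well as the function names, are our assumptions.

```python
import numpy as np
from scipy import ndimage

def classify_disocclusions(hole_mask, log_warped, fg_ratio_thresh=0.35):
    """Label each hole's boundary pixels as FG/BG by the sign of the
    warped LOG of the depth image (Eq. (1)), then classify the hole."""
    labels, n = ndimage.label(hole_mask)            # connected disocclusions
    kinds = {}
    for i in range(1, n + 1):
        region = labels == i
        # boundary = valid pixels immediately outside the hole
        boundary = ndimage.binary_dilation(region) & ~hole_mask
        vals = log_warped[boundary]
        vals = vals[vals != 0]                      # zero LOG: no discontinuity
        if vals.size:
            fg_ratio = np.mean(vals < 0)            # negative -> FG side
            kinds[i] = 'FG-BG' if fg_ratio >= fg_ratio_thresh else 'BG-BG'
    return labels, kinds

# The reference LOG itself can be computed, for example, as
# log_ref = ndimage.gaussian_laplace(depth.astype(float), sigma=1.0)
# and warped to the virtual view alongside the depth image.
```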

Fig. 3 Classification and marking of the disocclusion. a FG-BG disocclusion, b BG-BG disocclusion, c FG-BG disocclusion boundary in reference image, and d BG-BG disocclusion boundary in reference image

3.2 Adaptive preprocessing for FG-BG disocclusion

Figure 3(c) indicates that for the right virtual view, the FG-BG disocclusion boundary pixels are mainly distributed on the right side of the FG object in the reference image. Due to the depth discontinuity caused by occlusion, adjacent pixels in the reference image are separated in the virtual image, resulting in disocclusions. The accuracy of the depth image is very important for view synthesis. Meanwhile, the boundaries of objects in the depth image may be mismatched with those of the color image because of the limitations of stereo matching algorithms and hardware equipment. Some FG pixels are assigned BG depth values and thus separated from the FG object; they are projected into the BG region and blended with the BG texture, producing ghosts. If left unprocessed, the subsequent hole filling algorithm will spread these FG textures, producing incorrect results. Therefore, an adaptive depth image preprocessing method is presented in this section to remove ghosts and decompose the FG-BG disocclusion.

For the right virtual view, the FG-BG disocclusion appears where the depth value in the depth image changes from high to low. It should be noted that in this paper, the depth value uniformly refers to the pixel value in the depth image: larger depths are represented by smaller values, so that the depth value of the FG is higher than that of the BG, as shown in Fig. 4. In this case, the FG boundary that may create an FG-BG disocclusion can be detected according to the depth discontinuity and marked as:

$$ {E}_{FG}\left(u,v\right)=\left\{\begin{array}{l}1,\mathrm{if}\ d\left(u,v\right)-d\left(u+1,v\right)>{T}_1\\ {}0,\mathrm{otherwise}\end{array}\right. $$
(4)

where T1 is the segmentation threshold between the FG and BG. Its value is set based on the depth distribution of the scene in order to extract a more complete FG boundary. Since the FG-BG disocclusion boundary pixels are marked in the reference image and its depth image, the FG boundary pixels extracted by Eq. (4) are further filtered. The FG boundary for preprocessing is limited to pixels adjacent to the FG-BG disocclusion boundary and is marked as:

$$ Mask\left(u,v\right)=\left\{\begin{array}{l}1,\mathrm{if}\ {E}_{FG}\left(u,v\right)=1\&\&N\left(u,v\right)\cap {L}_{FB}\ne \varnothing \\ {}0,\mathrm{otherwise}\end{array}\right. $$
(5)

where N(u, v) denotes the set of (u, v) and its immediate neighbors. Figure 4(a) shows the filtered FG boundary mask. Based on the marked pixels, the depth values of the neighboring BG are optimized. Since image texture varies across the scene, texture analysis is performed in these BG regions. For an identified BG region with flat texture, the distinct depth discontinuity is replaced by a gradually decreasing depth change, so that the FG-BG disocclusion is decomposed into multiple small holes through the preprocessing of the depth image. After decomposition, each hole becomes smaller and its surrounding pixels belong to the BG texture, so it can be filled easily. In order not to cause geometric distortion, we only decompose disocclusions with flat BG textures; for the remaining regions with complex texture, the depth values are preserved to avoid geometric distortion in the virtual image. Moreover, in order to remove ghosts, all of the marked FG boundaries are extended into the BG region in the depth image, so as to correct errors in depth acquisition and prevent the penetration of foreground pixels in the subsequent inpainting process. A sketch of the detection and masking steps of Eqs. (4) and (5) is given below.
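The following sketch implements Eqs. (4) and (5) in Python/NumPy. It treats u as the column index so that (u+1, v) is the pixel to the right, and approximates the neighborhood test N(u, v) ∩ LFB ≠ ∅ by a 3 × 3 dilation of LFB; the function and argument names are our assumptions.

```python
import numpy as np
from scipy import ndimage

def mark_fg_boundary(depth, L_FB, T1):
    """Eq. (4): detect FG boundary pixels where the depth drops by more
    than T1 toward the right; Eq. (5): keep only those adjacent to a
    marked FG-BG disocclusion boundary pixel."""
    d = depth.astype(int)                         # avoid uint8 wrap-around
    e_fg = np.zeros(depth.shape, dtype=bool)
    e_fg[:, :-1] = d[:, :-1] - d[:, 1:] > T1      # d(u,v) - d(u+1,v) > T1
    near_lfb = ndimage.binary_dilation(L_FB.astype(bool),
                                       structure=np.ones((3, 3), bool))
    return e_fg & near_lfb                        # Mask(u, v) of Eq. (5)
```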

Fig. 4 Marking and classification of FG boundary. a Marking result, and b classification result based on texture complexity (R1 marked in green, and R2 marked in yellow)

As gradient information reflects the complexity of image texture, we apply a gradient operator to perform the texture analysis. The marked FG boundary is classified into two regions as:

$$ \left\{\begin{array}{l}{R}_1=\left\{\left.\left(u,v\right)\right| Mask\left(u,v\right)=1\&\&\lambda \cdot \left|{G}_u\left(u+k,v\right)\right|+\left(1-\lambda \right)\cdot \left|{G}_v\left(u+k,v\right)\right|<{T}_2,k=\left\{2,3,\cdots {W}_i\right\}\right\}\\ {}{R}_2=\left\{\left.\left(u,v\right)\right| Mask\left(u,v\right)=1\&\&\left(u,v\right)\notin {R}_1\right\}\end{array}\right. $$
(6)

where Gu(u, v) and Gv(u, v) represent the gradient values in the horizontal and vertical directions at (u, v), respectively. The human visual system obtains depth cues mainly from horizontal rather than vertical disparity [44]. Compared with the vertical direction, geometric distortion in the horizontal direction is much less tolerable and significantly reduces the visual quality of the virtual image, so the horizontal direction should be given a higher weight. In addition, the preprocessing is performed in the horizontal direction, so the weight coefficient λ is set to 0.7 in the experiment. Wi represents the average width of the FG-BG disocclusion i, and T2 represents the gradient threshold. For the depth image, a gradient difference means a change of pixel value in an 8-bit single-channel image in which the value represents the disparity in pixels between the outermost views; large gradient differences reflect large texture changes. In order to reduce geometric distortion, the threshold T2 should be set to a small value. Considering that the color image has three channels, T2 is set to 30 in the experiment. The classification result into R1 and R2 is shown in Fig. 4(b). In this way, the distortion can be reduced when decomposing the disocclusion, especially for linear structures, and the depth values of the adjacent FG can also be preserved.

For the region R1 with flat texture, as the ghosts are usually 1–2 pixels wide, the marked FG boundary is first extended in the depth image to remove the ghosts as:

$$ d\left(u+1,v\right)=d\left(u+2,v\right)=d\left(u,v\right),\mathrm{if}\ \left(u,v\right)\in {R}_1 $$
(7)

For a fixed baseline and scene, the area of an FG-BG disocclusion is proportional to the depth discontinuity between adjacent pixels in the reference depth image; a larger depth discontinuity results in a larger disocclusion. Due to the additive nature of the error, the reliability of the inpainting method decreases toward the center of the disocclusion. Therefore, the core idea of disocclusion decomposition is to reduce the depth discontinuity between adjacent pixels. For the BG region adjacent to the marked FG boundary, except for the extended region whose depth value is replaced with the FG depth value, a linear descent is applied in the horizontal direction to optimize the depth values: the large depth discontinuity between FG and BG is replaced with a slow depth drop spread over several pixels. In order to obtain an even width for each hole after decomposition, a fixed depth drop parameter m is set based on the depth distribution of the scene. For the FG boundary belonging to R1, the linear drop process for each row can be expressed as:

$$ d\left(u+k+1,v\right)=d\left(u+k,v\right)-m,\mathrm{if}\ \left(u,v\right)\in {R}_1\&\&d\left(u+k,v\right)-d\left(u+k+1,v\right)>m,k=\left\{2,3,\cdots {W}_i\right\} $$
(8)

In the experiment, it is necessary to ensure that the value of m ⋅ (Wi − 1) is greater than the maximum depth discontinuity of the disocclusion i. Through the decomposition process, the flat background texture is evenly distributed around each hole, thus preventing FG texture penetration. In addition, the reduction of the hole area helps to improve the accuracy of the inpainting result.
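Combining Eqs. (7) and (8), a minimal sketch of the R1 processing is shown below. It assumes an integer depth image with u as the column index and adds bounds checks for pixels near the right image border; the names are ours.

```python
import numpy as np

def preprocess_r1(depth, r1_mask, m, W):
    """Ghost removal (Eq. (7)) and linear depth drop (Eq. (8)) for the
    flat-texture FG boundary region R1. `W` stands in for the average
    disocclusion width W_i; `m` is the fixed depth drop per pixel."""
    d = depth.astype(int)                        # avoid uint8 wrap-around
    h, w = d.shape
    for v, u in np.argwhere(r1_mask):            # v = row, u = column
        if u + 2 < w:
            d[v, u + 1] = d[v, u + 2] = d[v, u]  # Eq. (7): cover the ghost band
        for k in range(2, W + 1):                # Eq. (8): k = 2, 3, ..., W_i
            if u + k + 1 >= w:
                break
            if d[v, u + k] - d[v, u + k + 1] > m:
                d[v, u + k + 1] = d[v, u + k] - m
    return np.clip(d, 0, 255).astype(depth.dtype)
```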

For the region R2 with rich texture, as the contour of the FG is irregular, the decomposition process may assign different depth values within the same linear texture, which manifests as stretching and distortion of the texture in the virtual image. Thus, only the ghost removal process is performed on this region, as follows:

$$ d\left(u+1,v\right)=d\left(u+2,v\right)=d\left(u,v\right),\mathrm{if}\ \left(u,v\right)\in {R}_2 $$
(9)

The depth image preprocessing result for the FG-BG disocclusion is shown in Fig. 5(a). Note that the boundary of the FG is extended into the BG region. The result in Fig. 5(b) shows that the proposed method adaptively implements a gradual transition from FG to BG, thereby effectively decomposing the disocclusion and improving the filling accuracy.

Fig. 5 Processing result of the proposed method. a Depth image preprocessing result (original FG boundaries marked in red), and b 3D warping result with adaptive preprocessing

3.3 FG removal for BG-BG disocclusion

In general, disocclusions belong to the BG region that is occluded by the FG object in the reference view. These regions become visible in the virtual view, but the associated pixels cannot be retrieved from the reference image, thus causing holes. For view synthesis, as the viewpoint moves, different depth values mean different displacements of the pixels: compared with a BG object, the parallax of an FG object is larger under the same baseline. In this case, when the FG object occludes a new BG region, that is, when the FG and BG visible in the reference view overlap in the virtual image, the BG occluded in the reference view is completely exposed, forming a BG-BG disocclusion whose two sides both belong to the BG region. GMM-based methods can detect FG objects and establish a BG model through background subtraction. However, such methods are only applicable to moving FG objects; still FG objects are usually considered part of the BG and remain in the BG model. For FG extraction methods based on edge detection, threshold selection is a difficult task.

In this paper, features of the depth image are used to extract the FG objects that occlude the BG-BG disocclusion regions in the reference view. As the BG-BG disocclusion boundary pixels are located on the boundary of the FG object in the reference image, the edge of the FG object can be extracted by detecting these pixels. Since adjacent pixels in the same FG object have similar depth values, extraction and removal of the specific FG objects can be achieved by an iterative process. Furthermore, in order to remove artifacts and completely extract the FG objects, the depth values of the disocclusion boundary pixels are corrected before the iterative process.

After the valid pixels on the BG-BG disocclusion boundary are marked, these pixels are re-projected into the reference image, as shown in Fig. 6(a). A threshold is assigned to each disocclusion boundary based on the depth information to assist in the extraction of the FG object. In order to remove outliers, for the BG-BG disocclusion i, the FG-BG segmentation threshold Ti is set as the average depth value of the pixels along its boundary after removing the 10% highest and lowest values. For each disocclusion, a minimum bounding rectangle (MBR) is established in the reference depth image based on the distribution of its boundary pixels, and the iterative process takes place within this rectangle. Due to the existence of ghosts, most of the marked boundary pixels have BG depth values although they belong to the foreground object in the color image; thus, the depth values of these pixels are first corrected. According to the condition under which a disocclusion occurs, there must be pixels with FG depth values in the neighborhood of the boundary pixels. Therefore, the depth values of the marked pixels are replaced with the highest depth value within their neighborhood, that is, they are re-marked as FG pixels. Starting from the marked FG boundary, the complete FG object is gradually extracted by judging whether its immediate neighbors belong to the FG. For a pixel x on the boundary Fi of FG object i, let y be an unmarked neighbor of x. According to the characteristics of the depth image, if y belongs to the FG, it should have a depth value similar to that of x, and its depth value should be higher than that of the BG it occludes. Therefore, the condition under which y is appended to FG i can be expressed as follows:

$$ y\in {F}_i,\ \mathrm{if}\ \left|d(x)-d(y)\right|<\alpha \ \&\&\ d(y)>{T}_i,\ \mathrm{for}\ y\in N(x)\cap {\mathrm{MBR}}_i $$
(10)

where α is a small value and is set to 3 in our experiment. When all the pixels of Fi have been traversed, the new state of the FG object is used as the input to the next iteration. The iterative process continues until all the pixels in the MBR have been traversed, completing the extraction of the FG object. In addition, a morphological dilation operation is performed on the extracted FG region to extend the boundary outward so that all ghost pixels are included. The extracted foreground mask is shown in Fig. 6(d); the corresponding FG objects are then removed from the reference image and its depth image, as shown in Fig. 6(e) and (f). Note that some parts of the FG object are still not removed. This is because their depth values are very close to those of the neighboring BG pixels, so no disocclusion is generated in the virtual image, and the disocclusion filling is not affected.
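The iterative extraction can be written as a breadth-first region growing over Eq. (10). The sketch below is ours, with assumed names: `seed_mask` holds the corrected FG boundary pixels of one disocclusion, `mbr` is its bounding rectangle as (row0, row1, col0, col1), and the morphological dilation mentioned above is left to the caller.

```python
import numpy as np
from collections import deque

def extract_fg_object(depth, seed_mask, mbr, T_i, alpha=3):
    """Grow the FG object from the corrected boundary seeds by depth
    similarity (Eq. (10)), restricted to the hole's MBR."""
    r0, r1, c0, c1 = mbr
    fg = seed_mask.copy()
    queue = deque(map(tuple, np.argwhere(seed_mask)))
    while queue:
        v, u = queue.popleft()                   # x = (v, u)
        for dv, du in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            y = (v + dv, u + du)                 # candidate neighbor
            if not (r0 <= y[0] < r1 and c0 <= y[1] < c1) or fg[y]:
                continue
            # Eq. (10): |d(x) - d(y)| < alpha  &&  d(y) > T_i
            if abs(int(depth[v, u]) - int(depth[y])) < alpha and depth[y] > T_i:
                fg[y] = True
                queue.append(y)
    return fg
```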

Fig. 6 FG removal for BG-BG disocclusion. a Disocclusion edge mask in reference image, b FG extraction result of the 10th iteration, c FG extraction result of the 30th iteration, d final FG extraction result, e FG removal result in reference image, and f FG removal result in depth image

3.4 Removed region filling

As mentioned in Section 2, Criminisi's algorithm is an effective image inpainting method. It takes into account both the texture and the structure information of the image: it determines the filling order based on a priority term and searches the source region for the most similar patch to fill the hole. Compared with pixel-based methods, it does not introduce blurring artifacts. In this paper, we extend Criminisi's algorithm to fill the removed regions in the reference image and its depth image.

Image inpainting starts from the boundary of the removed regions. In the classic Criminisi algorithm, for an input image I, Ω is the region to be filled and Φ represents the remaining source region (Φ = I − Ω). For a pixel p on the boundary of Ω, the square patch Ψp centered at p is to be filled, and its priority is defined as:

$$ P(p)=C(p)\cdot D(p) $$
(11)

where C(p) and D(p) are the confidence term and the data term, respectively. C(p) denotes the percentage of valid pixels in Ψp, and D(p) represents the strength of the isophotes at the boundary, encouraging linear structures to be inpainted first. Once all priorities on the boundary are computed, the pixel \( \hat{p} \) with the highest priority and its corresponding patch \( {\Psi}_{\hat{p}} \) are found. Then, the most similar patch \( {\Psi}_{\hat{q}} \) in the source region is searched to fill the hole in \( {\Psi}_{\hat{p}} \). The similarity between two patches is computed as the Sum of Squared Differences (SSD) over the valid pixels, so \( {\Psi}_{\hat{q}} \) is determined as follows:

$$ {\Psi}_{\hat{q}}=\arg \underset{\Psi_q\in \Phi}{\min}\mathrm{SSD}\left({\Psi}_{\hat{p}},{\Psi}_q\right) $$
(12)

As the hole filling is an iterative process, the confidence term is updated before the next iteration as follows:

$$ C(p)=C\left(\hat{p}\right),\forall p\in {\Psi}_{\hat{p}}\cap \Omega $$
(13)

Note that for removed region filling, as the FG objects and ghosts have been removed, the pixels around the hole belong to the BG region. Compared with disocclusion filling in the virtual image, the texture in the reference image is taken from the real scene; therefore, performing the filling in the reference image prevents the spread of artifacts introduced by 3D warping. Although the hole filling no longer starts from the FG boundary, the guidance of depth information is still of great significance for images with multiple depth layers. In the proposed method, the modified priority term can be expressed as:

$$ P(p)=C(p)\cdot D(p)\cdot Z(p) $$
(14)

where Z(p) is the introduced depth term, computed from the average depth value of the valid pixels in Ψp as:

$$ Z(p)=\frac{d_{\mathrm{max}}-\frac{\sum_{q\in {\Psi}_p\cap {\Phi}^{\prime }}d(q)}{\left|{\Psi}_p\cap {\Phi}^{\prime}\right|}}{d_{\mathrm{max}}-{d}_{\mathrm{min}}} $$
(15)

where dmax and dmin are the highest and lowest nonzero depth values in the depth image, respectively. Φ′ is the newly defined source region of size N × N centered at \( \hat{p} \), where N is set to twice the length of the long side of MBRi to ensure both the exploration of suitable source patches and the spatial locality of texture. The depth term gives higher priority to patches with lower depth values, thereby ensuring that the filling starts from a local BG region in the case of multiple depth layers. Once all the priorities are computed, \( {\Psi}_{\hat{q}} \) is searched within Φ′. Since adjacent patches in the BG region usually have similar depth values, depth information is introduced into the similarity computation, and \( {\Psi}_{\hat{q}} \) is determined as:

$$ {\Psi}_{\hat{q}}=\arg \underset{\Psi_q\in {\Phi}^{\prime }}{\min}\left[{\mathrm{SSD}}_{\mathrm{color}}\left({\Psi}_{\hat{p}},{\Psi}_q\right)+{\mathrm{SSD}}_{\mathrm{depth}}\left({\Psi}_{\hat{p}},{\Psi}_q\right)\right] $$
(16)

Moreover, in order to keep the depth term up to date, the removed region filling is performed simultaneously in the reference image and its depth image. The filling result is shown in Fig. 7, and it is used to fill the BG-BG disocclusions in the virtual image.
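The two depth-aware modifications, the priority weight Z(p) of Eq. (15) and the joint color-plus-depth patch cost of Eq. (16), can be sketched as follows. The surrounding exemplar-based loop (priority scan, patch copy, confidence update of Eq. (13)) is omitted, and the names are assumptions.

```python
import numpy as np

def depth_term(depth_patch, valid, d_max, d_min):
    """Eq. (15): normalized inverse of the mean depth over the valid
    pixels, so patches on the lower-valued background get higher
    priority."""
    mean_d = depth_patch[valid].mean()
    return (d_max - mean_d) / (d_max - d_min)

def joint_patch_cost(color_p, depth_p, color_q, depth_q, valid):
    """Eq. (16): SSD over the valid pixels of the target patch, computed
    on color and depth jointly, so candidates from a different depth
    layer are penalized even when their color happens to match."""
    cost = np.sum((color_p[valid].astype(float) - color_q[valid]) ** 2)
    cost += np.sum((depth_p[valid].astype(float) - depth_q[valid]) ** 2)
    return cost
```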

Fig. 7 Filling result of the removed region. a Filling result in color image, and b filling result in depth image

3.5 Merging and postprocessing

After the depth image preprocessing and the corresponding FG removal, the optimized reference image and its depth image are warped to the target view. As the corresponding FG objects have been removed, BG-BG disocclusions no longer appear in the rendered image, and each FG-BG disocclusion is adaptively divided into several small holes. These holes are almost entirely surrounded by BG pixels, so the proposed inpainting method can effectively fill them with reliable BG textures. Since some FG objects may have been removed from the optimized reference image, a subsequent merging process is necessary, which combines the original DIBR result with the optimized virtual image. In order to preserve the correct occlusion relationship between the FG and BG, the Z-buffer algorithm [11] is applied to pixels at the same location.

However, there are still some other types of artifacts in the original DIBR result, such as cracks and OFRs. They are not caused by the occlusion between FG and BG objects; therefore, the method described above does not detect or handle them. After merging, an additional postprocessing step is applied to deal with these remaining artifacts. As proposed in our previous work [3], cracks caused by the rounding of coordinate values are filled by the optimized DIBR algorithm, and the reference view extension approach is used to fill OFRs. Moreover, for other small holes caused by depth errors, since they are usually inside the FG or BG objects, a satisfactory result can be obtained by applying a simple inpainting method in the postprocessing.
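A minimal sketch of the merging step follows, under this paper's 8-bit convention that larger depth values are nearer the camera; the array names and hole masks are assumptions.

```python
import numpy as np

def merge_views(color_a, depth_a, hole_a, color_b, depth_b, hole_b):
    """Z-buffer merge of the original DIBR result (a) with the optimized
    warping result (b): where both provide a pixel, keep the nearer one
    (higher depth value). Pixels missing from both are left for the
    postprocessing step (cracks, OFRs, small holes)."""
    out_c, out_d = color_b.copy(), depth_b.copy()
    use_a = ~hole_a & (hole_b | (depth_a > depth_b))    # Z-buffer test
    out_c[use_a] = color_a[use_a]
    out_d[use_a] = depth_a[use_a]
    remaining = hole_a & hole_b
    return out_c, out_d, remaining
```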

4 Experimental results and discussion

In order to validate the proposed approach, two multiview video-plus-depth (MVD) sequences, “Ballet” and “Breakdancers”, are used in our experiment [48]. Each sequence contains 8 viewpoints with a resolution of 1024 × 768 pixels and a length of 100 frames. The “Ballet” sequence has a large depth discontinuity between the FG and BG objects, and one FG object is far from the other. The “Breakdancers” sequence contains a series of FG objects with similar depth layers, and FG objects overlap in some frames; its small depth discontinuities result in small disocclusions, even in the case of a large baseline. In addition, five public image-plus-depth sequences from the Middlebury stereo data sets [31] are used to evaluate the performance of our method. In these data sets, the ground truth of the depth image is obtained using high-precision structured light, and the camera parameters, both internal and external, are known. The synthesized view is named after the video sequence and the projection configuration, i.e., BA54 denotes the sequence warped from view 5 to view 4 of “Ballet”. To evaluate the performance of the proposed method, we measure the subjective and objective quality of the synthesized virtual view at different projection configurations and compare the results with previous methods. The depth distribution of the scene affects the setting of some parameters. In our experiment, T1 is set to 25 for “Ballet” and 15 for “Breakdancers” and the Middlebury data sets; m is set to 8 for “Ballet” and 5 for “Breakdancers” and the Middlebury data sets.

Seven previous methods are chosen for subjective and objective quality comparisons: Criminisi's classic inpainting method [5], Daribo's depth-aided inpainting method [6], Ahn's inpainting-based method [1], Lei's divide-and-conquer hole filling method [15], Kao's depth-based inpainting method [13], our previous work [3], and Oliveira's filling method [7]. It should be noted that Daribo's method assumes that the depth image of the virtual view is known, while the other methods and the proposed method need only a single reference view and its depth image for view synthesis.

The visual quality comparison results of disocclusions for a small baseline are shown in Fig. 8, including BA54, BA56, BR43 and BR45. Influenced by the baseline distance and the depth discontinuity between the FG and adjacent BG objects, all disocclusions belong to the FG-BG type. It can be seen that the proposed method outperforms the others in terms of the synthesized appearance and looks more plausible, while the filling results of the other methods contain artifacts or FG textures. In Criminisi's method [5], although the classic inpainting uses patches to propagate textures and prevent blurring effects, it does not take into account the fact that disocclusions usually belong to the BG region; simultaneous inpainting from the FG and BG boundaries causes FG texture to be incorrectly introduced into the synthesized image, as shown in Fig. 8(c). In Daribo's method [6], depth variance is introduced into the computation of priority and patch distance, but the presence of ghosts produces some artifacts in the disocclusion region, as shown in Fig. 8(d); moreover, this method requires the depth image of the virtual view, which is difficult to obtain in practical applications. Ahn's method [1] solves this problem by generating the virtual depth image during 3D warping, so that the disocclusion is filled simultaneously in the virtual image and its depth image. The introduction of depth information reduces the priority of FG pixels, but when the boundaries of FG objects in the color image are mismatched with those of the depth image, unexpected defects can be generated in the filling results, as shown in Fig. 8(e). The results of Lei's method [15] are shown in Fig. 8(f); it produces satisfactory results for regions with simple background textures, but for background regions with complex textures, the modified depth values cause structural distortion of BG objects, and the subsequent inpainting amplifies the error, reducing the visual quality of the virtual view. Moreover, some artifacts occur along the boundary of the FG object. In Kao's method [13], depth image preprocessing is applied before 3D warping, and the width of the FG object is extended in the depth image. However, for FG-BG disocclusions, ghosts only occur on the BG side, so extending the other side causes some BG textures to be incorrectly projected into the virtual image. The priority computation based on depth variance is not sensitive to the distinction between FG and BG, especially for scenes with multiple depth layers, which produces defects and erroneous textures in the synthesized results, as shown in Fig. 8(g). Figure 8(h) shows the results of our previous work [3], in which a local background term and a depth term are introduced into the priority computation. This method selects the appropriate BG texture to fill the disocclusion in most cases, but in regions with low confidence, some wrong textures may be sampled. In Oliveira's method [7], the confidence term and data term are replaced by a depth term and a background term in the priority calculation, which makes the disocclusion filling start from the BG side. However, the texture information is ignored, so linear structures cannot be extended preferentially, as shown in Fig. 8(i). The proposed method decomposes the disocclusions based on the texture complexity, which reduces the number of low-confidence fillings and produces the reasonable textures shown in Fig. 8(j).

Fig. 8 Visual quality comparison of disocclusions for small baseline. a Warped virtual image, b magnified parts of (a), c Criminisi's method, d Daribo's method, e Ahn's method, f Lei's method, g Kao's method, h Chen's method, i Oliveira's method, j proposed method, and k ground truth

In the case of a large baseline, the area of the disocclusion increases, and its type is related to the depth distribution of the scene. For the “Ballet” sequence, the disocclusions generated in the virtual image are converted to the BG-BG type, while most of the disocclusions in the virtual image of “Breakdancers” still maintain the FG-BG type, as shown in Fig. 9(a). The increased disocclusion area makes the filling process more challenging. The comparison results for disocclusion handling are shown in Fig. 9(c)-(j). The textures synthesized by the proposed method look more plausible than the others. Especially for BG-BG disocclusions, as the associated FG objects are completely removed, the hole can be filled with the appropriate background textures around it, so the result looks more like the ground truth, while the other methods produce some artifacts and unrealistic textures.

Fig. 9 Visual quality comparison of disocclusions for large baseline. a Warped virtual image, b magnified parts of (a), c Criminisi's method, d Daribo's method, e Ahn's method, f Lei's method, g Kao's method, h Chen's method, i Oliveira's method, j proposed method, and k ground truth

To objectively evaluate the performance of the proposed method, the peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [40], feature similarity index for color images (FSIMc) [45], and visual saliency-induced index (VSI) [46] are used to measure the similarity between the synthesized image and the ground truth. PSNR measures the squared intensity difference and SSIM measures the perceptual visual quality, while FSIMc and VSI focus on gradient and color features. A higher PSNR value means higher similarity between the two images. The values of SSIM, FSIMc and VSI are normalized to 0–1, where 1 denotes the highest similarity, i.e., two identical images. The objective comparison results for the MVD sequences are given in Tables 1, 2, 3 and 4; the proposed method achieves the best overall results. The frame-wise objective results for BA54 are shown in Fig. 10, where the proposed method outperforms the other methods in most cases. The objective comparison results indicate that the proposed algorithm can effectively handle the disocclusions and is robust to changes of scene and baseline distance.
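For reference, per-frame PSNR and SSIM can be computed with scikit-image as sketched below (the `channel_axis` argument assumes a recent scikit-image version); FSIMc and VSI are not part of scikit-image and would require separate implementations, so they are omitted here.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frame(synthesized, ground_truth):
    """Per-frame PSNR and SSIM between the synthesized view and the
    ground-truth view (both 8-bit RGB arrays of the same size)."""
    psnr = peak_signal_noise_ratio(ground_truth, synthesized, data_range=255)
    ssim = structural_similarity(ground_truth, synthesized,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```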

Table 1 PSNR Comparison for MVD sequences
Table 2 SSIM Comparison for MVD sequences
Table 3 FSIMc Comparison for MVD sequences
Table 4 VSI Comparison for MVD sequences
Fig. 10 Objective results for BA54. a PSNR results, b SSIM results, c FSIMc results, and d VSI results

For image sequence rendering, the subjective comparison results are shown in Fig. 11. Since the reference depth image is obtained by structured light, it has high accuracy, which is conducive to view synthesis. As the depth image of the virtual view is not provided in the Middlebury data sets, the performance of algorithm [6] is not evaluated. The subjective comparison shows that the proposed method gives better results in most cases. Our method applies effective measures to the different types of disocclusion: during the filling process, the edge of the FG object is maintained and errors caused by artifacts are effectively prevented, and for BG-BG disocclusions, as the corresponding FG is removed, the FG texture is not sampled into the disocclusion region. The objective comparison results for the Middlebury data sets are shown in Tables 5, 6, 7 and 8; the proposed method again achieves the best overall results.

Fig. 11 Visual quality comparison of disocclusions for image sequence rendering. a Warped virtual image, b magnified parts of (a), c Criminisi's method, d Ahn's method, e Lei's method, f Kao's method, g Chen's method, h Oliveira's method, i proposed method, and j ground truth

Table 5 PSNR Comparison for still image data sets
Table 6 SSIM Comparison for still image data sets
Table 7 FSIMc Comparison for still image data sets
Table 8 VSI Comparison for still image data sets

In addition to the visual quality results, the computational complexity of the proposed method is analyzed. Compared with the original DIBR, the computational cost of the proposed method is higher. In the MATLAB environment, for the synthesis of the first frame of BA54, the original DIBR algorithm takes 1542 ms, while the proposed method takes 10,314 ms, a ratio of elapsed time of 6.69. As the proposed method performs the 3D warping process multiple times and introduces a patch-based inpainting method, the improvement in visual quality comes with an increase in computational complexity. In addition, the proposed method is implemented in MATLAB with unoptimized code; the running time can be greatly reduced by reimplementing the method in a more efficient language such as C++. GPU-based parallel computing or related hardware such as FPGAs is another viable way to improve computational efficiency. In some steps of the proposed method, the computation for each pixel to be processed is independent, such as the 3D warping, the detection and marking of FG or BG pixels, and the computation of priority and similarity in the inpainting process; therefore, parallel computing in software or hardware can be introduced into these steps to reduce the computational cost.

At present, deep learning technology is widely used in the field of image processing, e.g., for deep visual attention prediction [39], fast tracking [17], and image segmentation. Trained on a large amount of data, an established deep learning framework can quickly output accurate results and is very robust for known scenes. In terms of virtual view synthesis, the scenes used are common and available, and deep networks can be used for moving foreground object tracking, foreground segmentation, and image inpainting. Therefore, in future work, we will try to combine our method with deep networks: migrating existing networks to the view synthesis scenario and introducing depth information to train and improve the network framework, so that the accuracy and efficiency of the view synthesis method can be further improved.

5 Conclusion

In this paper, the underlying mechanisms of disocclusion generation and the factors affecting the distribution of the pixels around a disocclusion are investigated. On this basis, a disocclusion-type-aware hole filling method is proposed. Disocclusions in the virtual image are identified and divided into two types. FG-BG disocclusions are divided into several small holes through texture-based depth image preprocessing. BG-BG disocclusions are filled by extracting the corresponding FG objects in the reference image and replacing them with the surrounding BG textures. Further, additional postprocessing is introduced to handle the holes that remain after merging. Subjective and objective comparison results demonstrate that the proposed method produces trustworthy content under different scenes and baseline distances. The proposed method still has some limitations in the classification of disocclusions: due to the complexity of the scene, the two types of disocclusion may occur in combination, and the current classification based on pixel distribution may not achieve optimal results. Our future work will consider a more intelligent and robust classification method to achieve disocclusion segmentation and sequential processing. Reducing the computational cost is also an important concern. This paper does not explore the temporal correlation between adjacent frames, yet there may be a common region in the disocclusions exposed in two adjacent frames. Since our method processes each frame separately, it may produce different filling results and cause flicker between frames. For the rendering of single frames, such as stereoscopic image pairs, this phenomenon can be ignored, but for the rendering of virtual view sequences, it is necessary to maintain temporal consistency. Compared with filling the disocclusion in each frame independently, considering the movement of the FG object in the time domain and reusing the inpainted BG information can reduce the computational complexity and maintain temporal consistency [24]. In future work, we will consider using global optimization to recover the texture of disocclusions, so that flicker between frames can be reduced. Further, deep learning techniques have proved successful in target detection and image inpainting, which can provide inspiration for FG object detection and real-time disocclusion filling [9, 27, 30].