1 Introduction

A depth map is an image that encodes the distance of object surfaces from a viewpoint. Depth maps are widely used in applications such as robotics, three-dimensional (3D) television, and interactive view interpolation. Passive stereo and active depth sensors are employed in many applications to acquire depth maps of dynamic scenes in real time [10, 26]. Consequently, the development of autonomous systems capable of understanding the shape and location of a target object within a depth map is an active research area in computer vision and image processing [10, 11, 19, 26]. Kinect, which was designed by Microsoft for computer gaming, is a popular alternative to expensive laser scanners in video surveillance, robotics, and forensics applications [27]. Kinect sensors provide depth and color images simultaneously at frame rates of up to 30 fps. Integrating the depth and color data yields colored points, and each frame may contain 300,000 points. These characteristics have attracted attention from other fields, including mapping and 3D modeling. However, to generate 3D models for use in various applications, the high-level surface geometry must be inferred from noisy point-based data. Simply connecting neighboring points yields noisy, low-quality meshes, which suffer from occlusions, shadowing, and erroneous regions introduced during depth estimation. Thus, to separate the layers of the acquired depth map effectively, it is necessary to remove the noise and sharpen the boundaries [25].

The resolution of a depth map is lower than that of a color image because of noise degradation during the depth data acquisition process. Consequently, numerous approaches have been proposed for depth map enhancement to remove the noise and retain the layers of the given depth map. However, most of these approaches share the problems that arise when techniques designed for monoscopic color image enhancement, such as spatial resolution enhancement, denoising, and sharpening, are applied to depth maps. They remain problematic for depth map enhancement because they either use the color and depth images jointly to improve the quality or require large numbers of training patches for learning-based enhancement. Therefore, these methods are highly dependent on the quality of the color image, the training patches, and the application. To overcome these drawbacks and to improve the performance of depth map enhancement without prior information about the given depth map, we propose a novel method called adaptive total variation minimization (ATVM), which facilitates both noise smoothing and boundary sharpening. The proposed method is obtained by combining the moving least squares (MLS) and total variation (TV) minimization methods. The MLS model provides very satisfactory results for image reconstruction but is weak against outliers. In contrast, TV minimization eliminates outliers effectively because outliers produce large variation values. Thus, by incorporating the TV regularizer into the MLS model, the solution achieves a higher-order approximation than those of the conventional TV and MLS methods. We filter the noise in the depth map using a refined TV minimization technique that uses edge-preserving and noise-reducing smoothing filters [23].

2 Related work

Many previous studies have proposed conventional two-dimensional image enhancement approaches in the fields of computer vision and pattern recognition [1, 6, 23]. In particular, BM3D is widely used to remove noise from a given image, but it requires the statistical variance of the noise to be known in advance in order to remove the noise effectively. Conventional techniques used to enhance the contrast, sharpness, and color vividness of an image have been applied directly to depth map enhancement, where local adjustments are made to increase the amount of high-frequency components [21]. Previous approaches have achieved denoising and image sharpening by decreasing or increasing the high-frequency components according to the local image characteristics [10, 21]. For example, Subedar et al. [1] and Kim et al. [12] used high-pass filters to enhance depth-based sharpness, as well as depth estimation and contrast enhancement. Kinect-based depth map enhancement and denoising [9, 21] have received considerable attention as preprocessing steps for 3D scene analysis and human motion analysis. KinectFusion [16] was also designed to enhance the quality of the depth map using multiple depth map images. However, these previously proposed depth map enhancement algorithms used the normal light image, and the enhanced image was obtained by adding a depth-weighted, high-pass-filtered color image to the original image [21]. This approach has the limitation that it cannot remove the noise from an unknown depth map image without prior information. Eisenmann and Durand [7] proposed a cross-bilateral filter, in which they modified the bilateral filter to compute the edge-preserving term as a function of the depth map image and replaced the intensity value of each pixel with a weighted average of the intensity values of nearby pixels. However, their method preserved edges that did not actually appear in the noisy input depth map image.
PDE-based denoising methods based on a variational approach of energy functional minimization have also been used for image smoothing with edge preservation. A popular variational denoising method is the TV minimization process of Rudin, Osher, and Fatemi [20]. According to previous image restoration studies, TV regularization preserves salient edges while removing noise. However, if the variation-minimizing effect is too strong, the smooth regions become flat or constant, thereby yielding a restored image that looks unnatural. This is known as the staircase effect [18], and it is primarily attributed to the fact that the TV minimization method estimates the image using a piecewise constant approximation. Thus, several variants of the TV functional have been proposed to avoid the staircase effect and to obtain a higher-order approximation of the reconstructed image [3, 5, 17]. The MLS [2, 14] and kernel regression [24] methods, in which the optimal fit is expressed as a linear combination of polynomials, have proved to be quite useful in image interpolation as well as denoising and super-resolution [22]. However, MLS-based algorithms are weak against noise because least squares methods are, in general, weak against outliers. In addition, interpolating images across edges introduces artifacts (blurring or ringing) into the resulting images.

In this study, we employ an ATVM technique that preserves the details of the observed depth map with high accuracy. To preserve strong edges while smoothing noise, we add a TV regularization term to the moving least squares method and use weight functions that consider the similarity of the local areas around the evaluation and reference positions.

3 ATVM-based depth map enhancement

Let \( I:=\left\{I\left(i,j\right):i=1,\dots, {n}_1,\ j=1,\dots, {n}_2\right\} \) with positive integers \( {n}_1 \) and \( {n}_2 \). Put \( \left\{1,\dots, {n}_1\right\}\times \left\{1,\dots, {n}_2\right\}=\left\{{X}_1,{X}_2,{X}_3,\dots, {X}_N\right\} \). Then the observed depth map image I can be treated as a discrete sampling of a function at the point set \( \left\{{X}_1,\dots, {X}_N\right\} \) in a domain \( \varOmega \subset {\mathbb{R}}^2 \), where \( N={n}_1{n}_2 \) is the size of the image. If the given image is contaminated by noise during the image acquisition process, we may write I as \( I\left({X}_l\right)=f\left({X}_l\right)+{\varepsilon}_l \), l = 1, …, N, where \( f\left({X}_l\right) \) is the value of an underlying function f and \( {\varepsilon}_l \) indicates the additive noise at the location \( {X}_l \). The denoising method used to construct a denoised image from a depth map image is introduced below.

3.1 Total variation minimization method

For a given noisy image I, the TV minimization technique [4, 15] generates a denoised image Î by solving the following minimization problem

$$ \widehat{I}=\underset{u}{ \arg \min }{\left\Vert u-I\right\Vert}_2^2+\mu {\left\Vert u\right\Vert}_{TV} $$
(1)

where \( {\left\Vert u\right\Vert}_{TV}={\displaystyle \underset{\varOmega }{\int}\left|\nabla u\right|dX} \) with the gradient operator ∇. The second term in Eq. 1, \( {\left\Vert \cdot \right\Vert}_{TV} \), is called the total variation norm, and the solution of the minimization problem has the property of preserving sharp edges in images while removing noise. This is a desirable property because the visual quality of an image depends greatly on the preservation of edges. However, the TV scheme tends to turn the observed image into a piecewise constant image, which exhibits many false jump discontinuities and is visually unpleasant. This is mainly attributable to the fact that the TV minimization method approximates an image with only first-order accuracy.
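As a rough illustration of the two terms in Eq. 1, the following sketch evaluates a discrete version of the energy on a pixel grid. The forward-difference discretization of the gradient and the function names `total_variation` and `rof_energy` are assumptions made for this example; the text does not specify a discretization.

```python
import numpy as np

def total_variation(u):
    """Discrete isotropic TV: sum over pixels of |grad u| (forward differences)."""
    u = np.asarray(u, dtype=float)
    # Forward differences; the last column/row is repeated so the border difference is 0.
    dx = np.diff(u, axis=1, append=u[:, -1:])
    dy = np.diff(u, axis=0, append=u[-1:, :])
    return float(np.sum(np.sqrt(dx ** 2 + dy ** 2)))

def rof_energy(u, noisy, mu):
    """Discrete counterpart of the objective in Eq. 1: ||u - I||_2^2 + mu * ||u||_TV."""
    u = np.asarray(u, dtype=float)
    noisy = np.asarray(noisy, dtype=float)
    return float(np.sum((u - noisy) ** 2) + mu * total_variation(u))
```

For a clean two-layer step image the TV term only counts the jump along the layer boundary, whereas additive noise increases it at every pixel, which is why minimizing the energy suppresses noise while tolerating sharp edges.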

3.2 Adaptive moving least squares method with a total variation minimizing regularization term

In this section, we suggest an improved TV minimization approach, which is formulated specifically for depth map image denoising.

We employ the adapted least squares technique with the total variation regularization term from [13]. Let I be a given reference image defined on a domain Ω and let \( {X}^o \) be an evaluation point in Ω. We obtain the denoised value \( \widehat{I}\left({X}^o\right) \) by constructing a local polynomial of degree m in \( {\mathbb{R}}^2 \), \( p(X):={p}_{X^o}(X) \), and evaluating p at \( {X}^o \), i.e., \( \widehat{I}\left({X}^o\right):=p\left({X}^o\right) \). The polynomial can be written as \( p(X):={\displaystyle \sum_{{\left|\alpha \right|}_1\le m}{c}_{\alpha }{X}^{\alpha }} \). For example, if m = 2, α ∈ {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}. Specifically, the coefficients \( {c}_{\alpha } \) are obtained by minimizing the following energy functional:

$$ \underset{p\in {\varPi}_m}{ \arg \min}\left\{{\displaystyle \sum_{l=1}^N\Big({\left|p\left({X}_l\right)-I\left({X}_l\right)\right|}^2w\left({X}^o,{X}_l,I\right)+\lambda {\left\Vert \nabla p\left({X}_l\right)\right\Vert}_{TV}\Big)}\right\} $$
(2)

where Π m is the space of bivariate polynomials of degree ≤ m and w is a specialized weighting function for the denoising solution, which ensures that it obtains a result that preserves textures or repeated local features. Specifically, we use the weighting function defined as

$$ w\left({X}^o,X,I\right)= exp\left\{-\frac{{\displaystyle \sum_{l\in S}{G}_a(l){\left|I\left({X}^o+l\right)-I\left(X+l\right)\right|}^2}}{h_0^2}\right\} $$
(3)

where \( {h}_0^2 \) is a small positive value, \( {G}_a \) is a Gaussian function with standard deviation a, and S is a suitable (small) stencil for patch comparison around \( {X}^o \) and X. The weighting function is data adaptive and considers the similarity of the local areas around the two positions \( {X}^o \) and X. In our proposed method, we construct p locally in the image by solving the minimization problem in Eq. 2 for each evaluation point in Ω. Thus, the overall approximation function becomes \( \widehat{I}(X):=p(X):={p}_X(X) \) for all X ∈ Ω.
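The weighting function in Eq. 3 can be sketched for a pixel grid as follows. This is only an illustration under stated assumptions: the stencil S is taken to be a square window of half-width `radius`, the Gaussian kernel is normalized to sum to one, image borders are handled by reflection, and the name `patch_weight` and all default parameter values are hypothetical.

```python
import numpy as np

def patch_weight(image, p0, p, radius=1, a=1.0, h0=10.0):
    """Sketch of w(X^o, X, I) in Eq. 3 for pixel positions p0 = X^o and p = X.

    Assumptions: square stencil of half-width `radius`, normalized Gaussian G_a,
    reflective padding at the image border.
    """
    padded = np.pad(np.asarray(image, dtype=float), radius, mode="reflect")
    offsets = np.arange(-radius, radius + 1)
    oy, ox = np.meshgrid(offsets, offsets, indexing="ij")
    g = np.exp(-(ox ** 2 + oy ** 2) / (2.0 * a ** 2))
    g /= g.sum()  # normalization is an assumption, not stated in the text

    def patch(q):
        # Window of size (2*radius + 1)^2 centered at q, in padded coordinates.
        i, j = q[0] + radius, q[1] + radius
        return padded[i - radius:i + radius + 1, j - radius:j + radius + 1]

    # Gaussian-weighted squared patch distance, then the exponential of Eq. 3.
    dist2 = np.sum(g * (patch(p0) - patch(p)) ** 2)
    return float(np.exp(-dist2 / h0 ** 2))
```

Two positions inside the same flat layer receive a weight close to one, while positions on opposite sides of a depth discontinuity receive a weight close to zero, so samples from the other layer contribute little to the local fit.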

The minimization model (Eq. 2) with the L 1 term can be solved using the split Bregman iteration algorithm [8]. In our method, we obtain the solution based on the following iterated steps for each X:

$$ \begin{array}{l}\mathrm{step}\kern0.5em 1:\ {p}^{k+1}(X)=\underset{p}{ \arg \min}\Big\{{\displaystyle \sum_{l=1}^N\frac{\lambda }{2}{\left|p\left({X}_l\right)-I\left({X}_l\right)\right|}^2w\left(X,{X}_l\right)}\hfill \\ {}\kern6em +\frac{\mu }{2}{\left|{d}^k\left({X}_l\right)-\nabla p\left({X}_l\right)-{b}^k\left({X}_l\right)\right|}^2w\left(X,{X}_l\right)\Big\}\hfill \\ {}\mathrm{step}\kern0.5em 2:\ {d}^{k+1}(X)=\mathrm{shrink}\left(\nabla {p}^{k+1}(X)+{b}^k(X),1/\mu \right)\hfill \\ {}\mathrm{step}\kern0.5em 3:\ {b}^{k+1}(X)={b}^k(X)+\nabla {p}^{k+1}(X)-{d}^{k+1}(X)\hfill \end{array} $$

where shrink(x, γ) = max(|x| − γ, 0) ⋅ sign(x).
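To make the structure of the three steps concrete, the sketch below applies the same iteration to a much simpler problem: plain TV denoising of a one-dimensional signal, without the local polynomial or the weights w. It is an illustration only, not the ATVM solver itself; the function name `split_bregman_tv_1d`, the dense difference matrix, and the default values of `lam`, `mu`, and `n_iter` are assumptions chosen for this example.

```python
import numpy as np

def shrink(x, gamma):
    """Soft thresholding: shrink(x, gamma) = max(|x| - gamma, 0) * sign(x)."""
    return np.maximum(np.abs(x) - gamma, 0.0) * np.sign(x)

def split_bregman_tv_1d(f, lam=0.1, mu=0.1, n_iter=500):
    """Minimize lam/2 * ||u - f||^2 + ||D u||_1 for a 1D signal f.

    Simplified analogue of steps 1-3: u plays the role of p and D u of grad p.
    """
    f = np.asarray(f, dtype=float)
    n = f.size
    D = np.diff(np.eye(n), axis=0)           # forward differences, shape (n-1, n)
    A = lam * np.eye(n) + mu * D.T @ D       # normal equations of the quadratic step
    d = np.zeros(n - 1)
    b = np.zeros(n - 1)
    u = f.copy()
    for _ in range(n_iter):
        # step 1: quadratic subproblem in u
        u = np.linalg.solve(A, lam * f + mu * D.T @ (d - b))
        # step 2: shrinkage with threshold 1/mu
        d = shrink(D @ u + b, 1.0 / mu)
        # step 3: Bregman variable update
        b = b + D @ u - d
    return u
```

Only step 1 changes in the full method, where the unknown is the coefficient vector of the local polynomial and both quadratic terms carry the weights w(X, X_l); steps 2 and 3 keep the same closed form.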

Without the second term, the energy functional in Eq. 2 reduces to the conventional weighted least squares approximation, which fits the data by local polynomial approximation [14]. In the previous study of Eq. 2, the TV regularization term was proved to provide a better denoising property. In general, a least squares method is weak against outliers; therefore, it is not usually the best tool for denoising. However, in our method, the TV regularization term eliminates the noise very quickly, which helps the regularized method produce a better approximation of the original noise-free image.
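The unregularized case (λ = 0 in Eq. 2) can be sketched as a weighted least squares fit of a bivariate polynomial. In the sketch below, the coordinates are centered at the evaluation point so that the fitted value there equals the constant coefficient; the names `multi_indices` and `mls_value`, the use of the whole image as the fitting region, and passing the weights as a precomputed array are assumptions made for this example.

```python
import numpy as np

def multi_indices(m):
    """All exponent pairs alpha with |alpha|_1 <= m, ordered by total degree."""
    return [(a, s - a) for s in range(m + 1) for a in range(s, -1, -1)]

def mls_value(image, i0, j0, weights, m=2):
    """Weighted least squares fit of a degree-m polynomial, evaluated at (i0, j0).

    `weights` has the same shape as `image` and holds w(X^o, X_l, I) for every pixel.
    """
    image = np.asarray(image, dtype=float)
    ii, jj = np.indices(image.shape)
    # Coordinates centered at the evaluation point X^o = (i0, j0).
    x = (ii - i0).ravel().astype(float)
    y = (jj - j0).ravel().astype(float)
    basis = np.stack([x ** a * y ** b for a, b in multi_indices(m)], axis=1)
    # Scaling rows by sqrt(w) turns the weighted problem into an ordinary one.
    sw = np.sqrt(np.asarray(weights, dtype=float).ravel())
    coef, *_ = np.linalg.lstsq(basis * sw[:, None], image.ravel() * sw, rcond=None)
    return float(coef[0])  # p(X^o) is the constant term in centered coordinates
```

If the data are exactly a polynomial of degree at most m, the fit reproduces them for any positive weights, which is the higher-order approximation property that the TV term alone does not provide.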

4 Experiments

We conducted numerical experiments using synthetic and real depth map data to evaluate the performance of the ATVM-based depth map enhancement method. To assess the improvement in depth accuracy obtained with the proposed method, we tested it on data with known ground truth (synthetic data) from the Middlebury stereo data set, as shown in Table 1. To generate a noisy depth map from each data set, we added Gaussian noise with a standard deviation of 20 to the ground truth image. To evaluate the depth map enhancement quantitatively against the established ground truth, we used the peak signal-to-noise ratio (PSNR), a widely used measure defined as the ratio between the maximum possible power of a signal and the power of the noise. Table 1 compares the depth map enhancement and denoising results obtained using bilateral denoising [7], general TVM [20], and our approach. The PSNR values indicate that our proposed approach is very effective at removing the noise from the given data. In particular, our approach provides better noise reduction and sharpening for noisy depth maps that include multiple layers.
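The evaluation protocol described above can be sketched as follows. The peak value of 255 assumes 8-bit depth images, and the noisy image is not clipped to the valid range; both choices, as well as the function names `add_gaussian_noise` and `psnr`, are assumptions for this example rather than details given in the text.

```python
import numpy as np

def add_gaussian_noise(depth, sigma=20.0, rng=None):
    """Return depth + zero-mean Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng() if rng is None else rng
    depth = np.asarray(depth, dtype=float)
    return depth + rng.normal(0.0, sigma, depth.shape)

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` assumes an 8-bit value range."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))
```

Under these assumptions, a noisy input with a standard deviation of 20 has a PSNR of roughly 20 log10(255/20) ≈ 22.1 dB, which gives a baseline against which the PSNR of a denoised result can be read.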

Table 1 Quantitative comparison of depth map enhancement using our proposed approach and previous approaches based on the PSNR

Figure 1 shows the given noisy depth map image and the final image after removing the noise with our proposed approach. To visualize how much our approach improves on the given noisy depth map image, we display each depth map together with its normal image. As shown in Fig. 1, the result of our approach is clearly better in complex areas where different objects are mixed, because the ATVM-based denoising and enhancement approach is very efficient at retaining the edges while removing the noise around the object.

Fig. 1 Depth map enhancement using our approach, visualized with normal images. a Input noisy depth and its normal image. b Enhanced depth and its normal image
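A normal image of the kind used in this visualization can be approximated from a depth map by finite differences. The sketch below treats the depth map as a height field over the pixel grid and ignores the camera intrinsics, which is a simplifying assumption; the text does not state how the normal images were computed, and the name `normal_image` is hypothetical.

```python
import numpy as np

def normal_image(depth):
    """Unit surface normals of z = depth(row, col), as an (H, W, 3) array.

    Assumption: unit pixel spacing and no camera model; normal ~ (-dz/dx, -dz/dy, 1).
    """
    depth = np.asarray(depth, dtype=float)
    dz_drow, dz_dcol = np.gradient(depth)  # derivatives along rows (y) and columns (x)
    normals = np.dstack((-dz_dcol, -dz_drow, np.ones_like(depth)))
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return normals
```

Because the normals depend on depth differences, noise that is barely visible in the depth values becomes obvious in the normal image, which makes it a sensitive way to inspect a denoising result.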

In the next experiment, we tested the performance of the ATVM-based depth map enhancement method using a real depth map obtained by Kinect sensors, which has a resolution of 640 × 480 pixels. Figure 2 shows the depth map obtained after applying our proposed ATVM-based approach. The middle column of Fig. 2 shows the original depth map image from Kinect, in which it is not easy to discern the shape and depth of the target object. The right column of Fig. 2 shows the depth map refined using our approach. The enhanced depth map obtained using the ATVM-based approach displays the details of the scene better than the input Kinect depth map image. Removing the noise and enhancing the layers of the depth map makes it easier to analyze the shape of the target objects and the 3D scene. Thus, by applying ATVM-based depth map enhancement, we can separate the layers of the given image and analyze the scene. In particular, compared with notable previous approaches such as KinectFusion [16], which also refines the Kinect depth map but uses multiple depth maps, the advantage of our approach is that it uses a single depth map while retaining the edges and removing the noise from the input depth map.

Fig. 2 Real depth map enhancement using our proposed approach based on images captured by Kinect sensors. a Input RGB image. b Given noisy depth from Kinect. c Our approach

To visualize the difference between our approach and previous approaches such as the bilateral and TV methods, Fig. 3 shows the denoised depth maps for data captured by Kinect. In particular, depth map enhancement and denoising using our approach keeps the layers separated and removes the noise within flat layers. It can therefore be used for layer separation by removing the noise from Kinect depth maps.

Fig. 3 Qualitative comparison of depth enhancement and denoising using our approach and previous approaches. a Bilateral approach [7]. b General TVM approach [20]. c Our approach

5 Conclusion

In this study, we proposed a novel depth map enhancement approach based on ATVM. Our method employs a moving least squares method combined with TV minimization to retain the edges and remove the noise from input depth images. The moving least squares method facilitates rapid denoising, which allows us to obtain a sufficiently smooth approximation. The TV-based depth map denoising and deblurring approach exhibits robust performance in reducing the noise while retaining the edges in the depth map. Experiments using real and synthetic images demonstrated that our ATVM-based depth map enhancement method satisfied these objectives. By enhancing the resolution of the depth map, the proposed scheme retained the benefits of the TV minimization method and preserved geometric information. In particular, the proposed ATVM performed well in maintaining the details of the target object while reducing the noise, without requiring prior information.