Stereo visualization is an actively developing area of three-dimensional (3D) computer vision, where the central problem is to estimate the positions of objects in 3D space on the basis of constructed depth maps (DM).

Computer vision problems include the segmentation and detection of objects in 3D, which requires extracting information from multidimensional data in real systems and transforming these data on the basis of efficient descriptors that substantially reduce the dimension of the problem. Image contours and the relations between them are basic elements in computer vision and robotics; they represent generalized information about analyzed scenes in the form of a series of features important for recognition. Among descriptors of this type, DAISY and SID are the most efficient and the most widely used in practice.

Based on the ideas described in [112], an original method for reconstructing depth maps from color stereo pairs in 3D visualization is proposed and justified. The new image descriptor (VPR) relies on visual primitives and the relations between them and characterizes differences in color, plane positions, distances, and angles between the primitives. A fundamental difference between VPR and other well-known descriptors is that it uses structural and semantic information. This improves its robustness, specifically, in the case of nonideal recording, illumination, and reflection in color stereo pairs. Numerical experiments, together with an analysis and a physical interpretation of their results under actual radiometric differences in the exposure and illumination of stereo image pairs, have shown that the new approach is superior to other available descriptors.

FORMULATION OF THE PROBLEM AND SOLUTION METHOD

Our theoretical description of the proposed image descriptor is based on the Riesz transform [9]. For a given two-dimensional signal \(f(x,y)\), two odd signal components are formed:

$$F_{01,02}(w) = \begin{cases} i\,\dfrac{w_{x,y}}{\lVert w \rVert}\,F(w), & w \ne 0, \\ 0, & w = 0. \end{cases}$$
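For illustration, the following minimal NumPy sketch computes the two odd components in the frequency domain exactly as in this definition; the function name and the FFT grid conventions are ours, not part of the method's published code.

```python
import numpy as np

def riesz_components(f):
    """Two odd (Riesz) components of a 2D signal f(x, y), computed in the
    Fourier domain: F_01 = i*w_x/||w||*F(w), F_02 = i*w_y/||w||*F(w),
    with both components set to zero at w = 0."""
    F = np.fft.fft2(f)
    wy, wx = np.meshgrid(np.fft.fftfreq(f.shape[0]),
                         np.fft.fftfreq(f.shape[1]), indexing="ij")
    norm = np.hypot(wx, wy)
    norm[0, 0] = 1.0                      # avoid 0/0 at the DC term
    H1, H2 = 1j * wx / norm, 1j * wy / norm
    H1[0, 0] = H2[0, 0] = 0.0             # F_01(0) = F_02(0) = 0
    f01 = np.real(np.fft.ifft2(H1 * F))
    f02 = np.real(np.fft.ifft2(H2 * F))
    return f01, f02
```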

The Log-Gabor filter \(G_e(w) = \exp\left(-\dfrac{\log^2\left(\lVert w \rVert / w_0\right)}{2(\log \sigma_0)^2}\right)\) is used, for which the two odd components \(G_{01}(w)\) and \(G_{02}(w)\) are found, and the isotropic filter \(G(w) = \sqrt{G_{01}^2(w) + G_{02}^2(w)}\) is applied.
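A sketch of the radial Log-Gabor transfer function on the same FFT grid is given below; the specific values passed for \(w_0\) and \(\sigma_0\) are left to the caller, since the paper does not fix them here.

```python
def log_gabor(shape, w0, sigma0):
    """Radial Log-Gabor transfer function G_e(w) on the FFT frequency grid."""
    wy, wx = np.meshgrid(np.fft.fftfreq(shape[0]),
                         np.fft.fftfreq(shape[1]), indexing="ij")
    r = np.hypot(wx, wy)
    r[0, 0] = 1.0                   # dummy value; the DC term is zeroed below
    G = np.exp(-np.log(r / w0) ** 2 / (2 * np.log(sigma0) ** 2))
    G[0, 0] = 0.0                   # a Log-Gabor filter passes no DC component
    return G
```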

After filtering, we form a vector of three components (one even and two odd), the so-called monogenic signal, \(f_m(x) = [f_e(x), f_{01}(x), f_{02}(x)]\). In spherical coordinates, this vector is characterized by its length \(A(x)\) and two angular coordinates:

$$\phi(x) = \arctan\left(\frac{f_0(x)}{f_e(x)}\right) \quad \text{and} \quad \theta(x) = \arctan\left(\frac{f_{02}(x)}{f_{01}(x)}\right),$$

where \(f_0(x) = \sqrt{f_{01}^2(x) + f_{02}^2(x)}\) is the amplitude of the odd (Riesz) part.

The 2D visual primitive \(\Pi_{2\mathrm{D}}(x)\) is determined by the three parameters \(A(x)\), \(\phi(x)\), and \(\theta(x)\), while the 3D visual primitive is characterized by the vector \(\Pi_{3\mathrm{D}}(x) = (A(x), \theta(x), \phi(x), d(x))\), where \(d(x)\) determines the position of the primitive in three-dimensional space.
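Under these definitions, the per-pixel primitive parameters can be computed as follows (a sketch; arctan2 replaces arctan to keep the full angular range, and \(f_0\) is the odd-part amplitude noted above).

```python
def primitive_2d(fe, f01, f02):
    """Parameters (A, phi, theta) of the 2D visual primitive at each pixel,
    derived from the monogenic signal [fe, f01, f02]."""
    A = np.sqrt(fe**2 + f01**2 + f02**2)      # vector length A(x)
    phi = np.arctan2(np.hypot(f01, f02), fe)  # local phase phi(x)
    theta = np.arctan2(f02, f01)              # local orientation theta(x)
    return A, phi, theta
```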

The relations between two primitives are determined by a number of features. Specifically, the cocolority is computed in the CIELab color space as follows:

$$R^C(\Pi_i, \Pi_j) = \sqrt{(L_i - L_j)^2 + (a_i - a_j)^2 + (b_i - b_j)^2}.$$
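In code, the cocolority is simply the Euclidean distance between two CIELab triples, e.g.:

```python
def cocolority(lab_i, lab_j):
    """R^C: Euclidean distance between two CIELab color triples."""
    (Li, ai, bi), (Lj, aj, bj) = lab_i, lab_j
    return ((Li - Lj) ** 2 + (ai - aj) ** 2 + (bi - bj) ** 2) ** 0.5
```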

Other features characterize the color, position, and orientation of the primitives and the relations between them, namely, the angle between the primitives,

$$R^A(\Pi_i, \Pi_j) = \arccos\left(\frac{\Pi_i \cdot \Pi_j}{\lvert \Pi_i \rvert\,\lvert \Pi_j \rvert}\right);$$

the normalized distance between them,

$$R^{ND}(\Pi_i, \Pi_j) = \frac{1}{2}\left(\frac{\lvert W_i \times \Pi_j \rvert}{\lvert \Pi_j \rvert} + \frac{\lvert W_j \times \Pi_i \rvert}{\lvert \Pi_i \rvert}\right);$$

and the coplanarity

$$R^P(\Pi_i, \Pi_j) = \frac{1}{2}\left[\pi - \arccos\left(\frac{n \cdot A^O(\Pi_i)}{\lvert n \rvert\,\lvert A^O(\Pi_i) \rvert}\right) - \arccos\left(\frac{n \cdot A^O(\Pi_j)}{\lvert n \rvert\,\lvert A^O(\Pi_j) \rvert}\right)\right]$$

(see [13]).
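These three relations can be sketched directly from the formulas; note that the meanings of \(W_i\), \(n\), and \(A^O\) are not spelled out in the text, so the readings below (displacements between primitive positions, the candidate plane normal, and the primitive orientation vectors, respectively) are our assumptions.

```python
def angle_relation(p_i, p_j):
    """R^A: angle between two primitive vectors (the cosine is clipped so
    that rounding noise cannot push arccos outside [-1, 1])."""
    c = np.dot(p_i, p_j) / (np.linalg.norm(p_i) * np.linalg.norm(p_j))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def normalized_distance(w_i, w_j, p_i, p_j):
    """R^ND; w_i, w_j are taken here to be the displacement vectors between
    the primitives' 3D positions (an assumed reading of W_i, W_j)."""
    return 0.5 * (np.linalg.norm(np.cross(w_i, p_j)) / np.linalg.norm(p_j)
                  + np.linalg.norm(np.cross(w_j, p_i)) / np.linalg.norm(p_i))

def coplanarity(n, a_i, a_j):
    """R^P; n is the normal of the candidate plane, and a_i, a_j are the
    orientation vectors A^O of the two primitives (again assumed)."""
    return 0.5 * (np.pi - angle_relation(n, a_i) - angle_relation(n, a_j))
```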

The novel VPR descriptor is based on the relations between the primitives and is implemented as shown in the block diagram of the method in Fig. 1: for a given color image \(I_{RGB}\), the colors are first transformed from RGB to the CIELab space. The cocolority is determined from the monogenic signal, which is formed for \(S\) scales and \(\Sigma\) kernels (\(i = 1, 2, \ldots, S\); \(j = 1, 2, \ldots, \Sigma\)) with the use of \(S \times \Sigma\) Gaussian filters with parameters \(s_i\) and \(\sigma_j\). Each filter performs a convolution with the image \(I_L\) of the channel \(L\) in the CIELab space and forms the \(3 \times S \times \Sigma\) components \(f_m(x) = [f_e(x), f_{01}(x), f_{02}(x)]\) of the monogenic signal.

Fig. 1. Block diagram of the proposed method.
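The filter-bank stage of the diagram can be sketched as follows; how the pair \((s_i, \sigma_j)\) maps to the Log-Gabor parameters \((w_0, \sigma_0)\) is not given in the text, so the mapping below is an assumption.

```python
def monogenic_bank(I_L, scales, sigmas):
    """Monogenic responses of the CIELab L channel for an S x Sigma filter
    bank, using log_gabor and riesz_components defined above."""
    F = np.fft.fft2(I_L)
    bank = {}
    for s in scales:                    # i = 1, ..., S
        for sg in sigmas:               # j = 1, ..., Sigma
            G = log_gabor(I_L.shape, w0=1.0 / s, sigma0=sg)  # assumed mapping
            fe = np.real(np.fft.ifft2(G * F))       # even component
            f01, f02 = riesz_components(fe)         # two odd components
            bank[(s, sg)] = primitive_2d(fe, f01, f02)
    return bank
```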

For each pixel, the new descriptor consists of a vector determined by the relations between the primitive at \((x, y)\) and the primitives at \((u, v)\) in a square window \(W\) centered at \((x, y)\): \(H_A(x, y) = [R^A\{s_i, \sigma_j\}(1, 1), \ldots, R^A\{s_i, \sigma_j\}(u, v)]\), where \(R^A\) is the angle determined by the relation between \((x, y)\) and \((u, v)\). The feature vector of the angles between the primitives for the \(S \times \Sigma\) Gaussian filters is given by the matrix expression

$$D_A(x, y) = \begin{bmatrix} H_A(x, y)_{s_1}^{\sigma_1} & \cdots & H_A(x, y)_{s_S}^{\sigma_1} \\ \vdots & \ddots & \vdots \\ H_A(x, y)_{s_1}^{\sigma_\Sigma} & \cdots & H_A(x, y)_{s_S}^{\sigma_\Sigma} \end{bmatrix}.$$
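A per-pixel row \(H(x, y)\) of such a matrix can be assembled by scanning the window \(W\), e.g. (a sketch; `prims` is assumed to map pixel coordinates to primitive vectors, and `relation` is any of the relation functions above):

```python
def relation_vector(prims, x, y, half, relation):
    """H(x, y): relation values between the primitive at (x, y) and every
    primitive at (u, v) in a square window of side 2*half + 1."""
    center = prims[x, y]
    return np.array([relation(center, prims[u, v])
                     for u in range(x - half, x + half + 1)
                     for v in range(y - half, y + half + 1)])
```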

Expressions for the features characterized by the normalized distances, \(D_{ND}(x, y)\), and the coplanarity parameters, \(D_{CP}(x, y)\), are found in a similar manner:

$$D_{ND}(x, y) = \begin{bmatrix} H_{ND}(x, y)_{s_1}^{\sigma_1} & \cdots & H_{ND}(x, y)_{s_S}^{\sigma_1} \\ \vdots & \ddots & \vdots \\ H_{ND}(x, y)_{s_1}^{\sigma_\Sigma} & \cdots & H_{ND}(x, y)_{s_S}^{\sigma_\Sigma} \end{bmatrix}, \quad D_{CP}(x, y) = \begin{bmatrix} H_{CP}(x, y)_{s_1}^{\sigma_1} & \cdots & H_{CP}(x, y)_{s_S}^{\sigma_1} \\ \vdots & \ddots & \vdots \\ H_{CP}(x, y)_{s_1}^{\sigma_\Sigma} & \cdots & H_{CP}(x, y)_{s_S}^{\sigma_\Sigma} \end{bmatrix}.$$

The final descriptor at \((x, y)\) combines the above features: \(D_{VPR}(x, y) = [D_A(x, y), D_C(x, y), D_{CP}(x, y), D_{ND}(x, y)]\). Next, the depth maps (DM) are reconstructed using a block matching procedure based on comparing the features of the two images of a stereo pair.
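Since the paper does not give the matching code, the sketch below uses plain winner-takes-all matching of the per-pixel descriptors along a scan line as a stand-in for the block matching step.

```python
def disparity_map(desc_left, desc_right, max_disp):
    """Disparity by nearest-descriptor search along the same scan line.

    desc_left, desc_right: arrays of shape (H, W, K) holding a K-dimensional
    descriptor per pixel; returns an integer disparity map of shape (H, W).
    """
    H, W = desc_left.shape[:2]
    dm = np.zeros((H, W), dtype=np.int32)
    for y in range(H):
        for x in range(W):
            costs = [np.linalg.norm(desc_left[y, x] - desc_right[y, x - d])
                     for d in range(min(max_disp, x) + 1)]
            dm[y, x] = int(np.argmin(costs))
    return dm
```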

NUMERICAL RESULTS

We studied synthetic benchmark images taken from the Middlebury Stereo Vision website [14, 15]. The 2005 dataset contains a series of different stereo pairs together with ground-truth (GT) images (true depth maps) in the full format (1390 × 1110 pixels), as well as in 1/2 and 1/3 formats. Additionally, the robustness of the new descriptor was examined using the 2014 dataset, which contains 33 stereo pairs divided into three groups (10 for training, 10 for testing, and 13 additional pairs without GT). Moreover, for each stereo pair, this dataset provides two images obtained under different illumination (L) or exposure (E) conditions.

All data in the experiments were processed together in order to confirm the efficiency and robustness of the VPR descriptor as applied to the reconstruction of depth maps for stereo pairs obtained under different conditions.

The quality of depth map reconstruction was analyzed using the QBP criterion (the fraction of bad matching pixels). For each of the studied DM images, the QBP value was computed by the formula

$$QBP = \frac{1}{N}\sum_{(x,y)} \left( \lvert DM_I(x,y) - DM_{GT}(x,y) \rvert > \delta_d \right), \qquad (3)$$

where \(N\) is the number of pixels in the image or frame; \(DM_I\) and \(DM_{GT}\) are the estimated and true (GT) depth maps, respectively; \(\delta_d = 1\); and the parenthesized inequality is treated as an indicator equal to 1 when it holds and 0 otherwise.
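Equation (3) is a one-liner in NumPy: the comparison produces a Boolean mask whose mean is the fraction of bad pixels.

```python
def qbp(dm_est, dm_gt, delta_d=1.0):
    """QBP of Eq. (3): fraction of pixels whose disparity error
    exceeds delta_d."""
    return float(np.mean(np.abs(dm_est - dm_gt) > delta_d))
```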

In Fig. 2, the depth maps reconstructed with the VPR descriptor are visually compared with those produced by the DAISY and SID descriptors, which are the best descriptors available in the literature; the traditional block matching technique is used to compute the DM in all three methods.

Fig. 2. Results of depth map reconstruction: (a) true (GT) depth map and the maps reconstructed using the following descriptors: (b) DAISY, (c) SID, (d) VPR, (e) DAISY E (nonideal exposure), and (f) VPR E (nonideal exposure) for the Adirondack, Piano, and Playtable images (from top to bottom).

The presented images show that, under different illumination (L) and exposure (E) conditions, the SID descriptor strongly smooths the objects and fails to estimate the depth maps correctly. The depth maps reconstructed by VPR and DAISY are of similar quality both in terms of the quantitative metric and in a subjective analysis. An analysis of the various images (Fig. 2, Table 1) suggests that the VPR descriptor is the most robust to imperfections in stereo image pairs.

Table 1. QBP values for VPR and other descriptors as applied to depth map reconstruction in the case of nonideal exposure (E) and illumination (L) of stereo pairs

Since the traditional block matching technique is used in the depth map reconstruction, the occlusion problem cannot be resolved directly. However, the new VPR descriptor can be used in conjunction with other matching algorithms, such as semiglobal matching or graph cuts.

CONCLUSIONS

An analysis of the numerical results produced by the new method for depth map reconstruction suggests the following important conclusions:

(i) The proposed and substantiated method is based on visual primitives and the relations between them, namely, the cocolority, coplanarity, distance, and angle between the primitives. The method makes it possible to improve the quality of reconstructed depth maps.

(ii) The performance characteristics of the new VPR descriptor are superior to those of other descriptors widely used in the literature, namely, DAISY and SID.

(iii) It has been confirmed experimentally that the VPR descriptor demonstrates the best robustness in depth map formation in the case of radiometric differences in the exposure and illumination of stereo image pairs.