Keywords

1 Introduction

Real-time 3D intra-operative tissue surface shape recovery from stereo images is important in Computer Assisted Surgery (CAS). The reconstructed depth is a crucial for dense Simultaneous Localization and Mapping (SLAM) [23, 24], AR system [11, 28] and diseases diagnosis [13, 18]. All stereo matching procedures follow the pinhole camera model [2] and conduct image rectification, undistortion, and disparity estimation. The stereo matching techniques are normally classified into two categories regarding disparity estimation: prior-free and learning-based. Conventional prior-free methods estimate the pixel-wise disparity using the image alignment techniques [8, 12, 14, 20, 25]. Based on the left-right image consistency assumption (photo-metric or feature-metric), they either use corner feature registration, dense direct pixel searching, or a combination. Differently, Deep Neural Network (DNN) based techniques directly learn the disparity from the training image pairs [3, 16, 26, 29, 30]. Although DNN methods are reported to be efficient, the results may be invalidated with changing parameters such as focal length and baseline or a large texture difference between the training and testing data [1, 19]. Moreover, the DNN-based methods heavily depend on the size and quality of the annotated training data, which are not accessible in many CAS scenarios.

In the category of prior-free methods, ELAS [8] is still one of the most widely used stereo matching algorithms due to its robustness and accuracy [23, 24, 32, 33]. It is also the most popular method in the industry [4, 31]. ELAS uses Sobel descriptors to match sparse corners as the supporting points and triangulate the pixel-wise disparity prior. Then, the optimal dense disparity is retrieved with its proposed maximum a-posteriori algorithm. Its two-step process requires around 0.25–1 s on a single modern CPU core. This paper aims for a faster CPU-based stereo matching method.

The Dense Inverse Searching (DIS) [14] shows the potential of dense direct matching without the time-consuming sparse supporting points alignment. By resizing the left and right images to several coarse scales, it adopts and modifies the Lucas-Kanade (LK) optical flow algorithm [17] for fast estimating the pixel-wise optimal disparity. [14] demonstrates that real-time computation is possible with its patch-based coarse-to-fine dense matching, where patch refers to an arbitrary squared image segment. However, DIS is strictly built based on the photometric consistency and surface texture abundance assumptions, which cannot always be satisfied in CAS. The two main challenges are the textureless/dark surfaces and the serious non-Lambertian reflectance. The weak/dark texture, which widely exists in CAS, leads to ambiguous photometric consistency. Meanwhile, non-Lambertian reflectance brings uneven disturbance on the surfaces, and it cannot be eliminated by just enforcing the patch normalization [21].

In this paper, to deal with photometric inconsistency and non-Lambertian reflectance in stereo matching, we propose a Bayesian Dense Inverse Searching (BDIS) to quantify the posterior probability of each optimized patch. A spatial Gaussian Mixture Model (GMM) is further adapted to quantify pixel-wise confidence within the patch. The final pixel-wise disparity is the fusion of multiple local overlapping patches, reducing the impact of those patches suffering from the textureless/dark surfaces or the non-Lambertian reflectance. In extreme cases, it is beneficial to give up the disparity estimation of some patches identified as dubious. In particular, this work has the following contributions:

  • A Bayesian approach is developed to quantify the posterior probability of the patch.

  • A spatial GMM is introduced to quantify the pixels’ confidence within the patch.

  • To our knowledge, BDIS is the first single core CPU based stereo matching approach that achieves similar performance to the near real-time method ELAS.

2 Methodology

2.1 Multiscale DIS

Figure 1 shows the DIS (based on fast LK) algorithm for stereo matching proposed by [14]. It is a modified version of the LK algorithm. We use the fast DIS as our base framework. Note that the variational refinement module in [14] is abandoned because it has a small (less than \(0.5\%\)) contribution in promoting the accuracy. The modified fast LK based DIS is achieved by minimizing the following objective function:

$$\begin{aligned} \varDelta \textbf{u}={\text {argmin}}_{\varDelta \textbf{u}^{\prime }} \sum _{x}\left[ I_{r}\left( \textbf{x}+\textbf{u}\right) -I_{l}(\textbf{x}+\varDelta \textbf{u}^{\prime })\right] ^{2}, \end{aligned}$$
(1)

where \(\textbf{x}\) is the processed location, \(\textbf{u}\) is the estimated disparity in the loop, \(I_l\) and \(I_r\) are the left image patch and right image, and \(\varDelta \textbf{u}\) is the optimal update of \(\textbf{u}\) at one loop. Different from authentic LK, \(\varDelta \textbf{u}^{\prime }\) is moved from the right image to the left image patch. The improvement avoids the expensive re-evaluation of the Hessian on the right image. (1) is traversed on all patches at different scales. The disparity at the fine-scale level is initialized at the optimized coarse scale. The optimal disparity at the location \(\textbf{x}\) is the weighted fusion with all covering patches using inverse residual:

$$\begin{aligned} \hat{\textbf{u}}_\textbf{x} = \sum _{k \in \varOmega } \frac{1/ {\text {max}}(\Vert I_l(\textbf{x}+\textbf{u}^{(k)})-I_r(\textbf{x})\Vert ^2,1)}{\sum _{k \in \varOmega } 1/ {\text {max}}( \Vert I_l(\textbf{x}+\textbf{u}^{(k)})-I_r(\textbf{x})\Vert ^2,1)} \textbf{u}^{(k)}, \end{aligned}$$
(2)

where \(\varOmega \) is the set of patches covering the position \(\textbf{x}\), \(\textbf{u}^{(k)}\) is the estimated disparity of the patch k and \({\text {max}}(\cdot ,\cdot )\) selects the maximum value. The pixel-wise disparity \(\hat{\textbf{u}}_\textbf{x}\) is the weighted average of the estimated disparities from all patches, wherein the weight is the inverse residual of brightness.

Fig. 1.
figure 1

The framework of the DIS algorithm [14]. It uses 3 scale levels as an example.

2.2 The Bayesian Patch-Wise Posterior Probability

The residual-based weighted average fusion (2) suffers from the ambiguities brought by the textureless/dark surface and non-Lambertian reflectance. The textureless/dark surface leads to ambiguous local minima of the cost function penalizing photometric inconsistency (1) and misleads the algorithm to be over-confident on the estimation. Furthermore, the photometric consistency presumption is seriously violated on the surface affected heavily by the non-Lambertian reflectance. The affine lighting changes formulation in previous large-scale SLAM studies [7] cannot fully tackle the complex and severe non-Lambertian reflectance in CAS. In both situations, the weights retrieved from the photometric residuals (2) are misleading. To overcome the difficulty in defining the confidence of the estimated disparity, we propose a Bayesian model to correctly estimate the confidence in the presence of textureless surface and non-Lambertian reflectance. Since the uncertainty distribution of both the left and right scenes is unclear, it is difficult to conduct the direct inference of the posterior probability in terms of disparity. Thus, we implicitly infer the probability with Bayesian modeling using Conditional Random Fields (CRF) [27]. The posterior probability of the patch-wise disparity \(\textbf{u}^{(k)}\) is

$$\begin{aligned} \begin{aligned} p(\textbf{u}^{(k)}|I_l,I_r)\propto \frac{p(I_r|I_l,\textbf{u}^{(k)})}{p(I_r,I_l,\textbf{u}^{(k)})}\propto \frac{p(I_r|I_l,\textbf{u}^{(k)})}{\varSigma _{\textbf{u}_i^{(k)} \in \mathcal {P}} p(I_r|I_l,\textbf{u}_i^{(k)})}\propto \frac{p(I_r|I_l,\textbf{u}^{(k)})}{\varSigma _{\textbf{u}_i^{(k)} \in \mathcal {P}'} p(I_r|I_l,\textbf{u}_i^{(k)})}\textrm{r}, \end{aligned} \end{aligned}$$
(3)

where \(\mathcal {P}\) is the domain of all possible choice of \(\textbf{u}_i^{(k)}\). To reduce computational load, \(\textrm{r}\) is applied as the constant compensation ratio for all patches within the window. \(\mathcal {P}\) is reduced to a small window \(\mathcal {P}'\) assuming the rest candidates are numerically trivial.

Equation (3) indicates that the posterior probability of the disparity can be obtained by traversing the probability on all possible \(\textbf{u}_i^{(k)}\). And the possible choice of disparity is equal to window size \(\textrm{s}\). Even though the posterior probability suffers from the textureless surface and non-Lambertian reflectance, the illumination consistency probability is proportional to the residuals because the set of neighboring disparities is within one patch, and the impact of the issues is consistent. Thus, we model the illumination consistency probability \(p(I_r|I_l,\textbf{u}_i^{(k)},\textbf{x})\) based on the Boltzmann distribution [15] as

$$\begin{aligned} p(I_r|I_l,\textbf{u}_i^{(k)}) = \exp \left( -\frac{\Vert I_l(\textbf{u}_i^{(k)})-I_r(\textbf{u}_i^{(k)})\Vert ^2_\textrm{F}}{2\sigma _r^2\textrm{s}^2}\right) , \end{aligned}$$
(4)

where \(\Vert \cdot \Vert _\textrm{F}\) is the Frobenius norm and \(\sigma _r\) is the hyperparameter to describe the variance of the brightness. The relative posterior probability can be obtained with (3) and (4). Generally, the absolute exponential parameter of the Boltzmann distribution denotes the entropy of the state. In our case, such entropy is defined as (4). Image with abundant texture has more entropy loss. Hence, the entropy item is highly related to the photometric inconsistency loss.

Figure 2 shows the relationship between the illumination consistency probability density function and the texture. The response is stronger on the textured surface. The residuals are always small in the textureless surface, no matter how the left and right images are aligned. (2) cannot correctly measure the weights while (4) describes the relative probability of the estimation. Moreover, it tests the local convergence to filter the Saddle point solutions.

Fig. 2.
figure 2

The probability density function of the textureless region.

2.3 The Prior Spatial Gaussian Probability

In addition to the patch-wise posterior probability of the disparity in the last section, a spatial GMM is adopted to estimate pixel-wise probability within the patch. Considering that medical images are natural images, a multivariate Gaussian distribution is adopted to measure the confidence of the pixel-wise probability using a Gaussian mask. In accordance with the multivariate Gaussian distribution, the center of the patch has higher confidence than the edge pixels since those central pixels preserve more information for inference. Assuming all pixels in the patch are i.i.d, we have

$$\begin{aligned} \begin{aligned} p(\textbf{u}^{(k)}|I_l,I_r,\textbf{x}) \propto p(\textbf{u}^{(k)}|I_l,I_r)\exp \left( -\frac{\sum _{\mathbf {\xi }^{(k)}(\textbf{x})}\Vert \textbf{x}-\mathbf {\xi }^{(k)}(\textbf{x})\Vert ^2_\textrm{F}}{2\sigma _s^2}\right) , \end{aligned} \end{aligned}$$
(5)

where \(\mathbf {\xi }^{(k)}(\textbf{x})\) is the set of all pixel positions within the patch k in image coordinate. \(\sigma _s\) is the 2D spatial variance of the probability. Note that (5) is independent of the patch and can therefore be pre-computed before the process. Combining (3), (4) and (5), the final pixel-wise posterior probability distribution can be represented as follows,

$$\begin{aligned} \begin{aligned} p(\textbf{u}^{(k)}\!|\!I_l,I_r,\textbf{x})\! \propto \! \exp \left( \!-\frac{\!\sum _{\!\mathbf {\xi }^{(k)}(\textbf{x})}\!\Vert \textbf{x}-\mathbf {\xi }^{(k)}(\textbf{x})\Vert ^2_\textrm{F}}{2\sigma _s^2}\right) \frac{\!\exp \left( -\frac{\!\Vert I_l(\textbf{u}^{(k)})-I_r(\textbf{u}^{(k)})\!\Vert ^2_\textrm{F}}{2\sigma _r^2\textrm{s}^2}\!\right) }{\!\sum _{\textbf{u}_i^{(k)}\! \in \! \mathcal {P}}\exp \left( -\frac{ \Vert I_l(\textbf{u}_i^{(k)})-I_r(\textbf{u}_i^{(k)})\Vert ^2_\textrm{F}}{2\sigma _r^2\textrm{s}^2}\!\right) }. \end{aligned} \end{aligned}$$

Finally, it should be emphasized that (4) and (5) are not the cost functions but probability/weight for each patch or pixel. Costly optimization steps are avoided.

3 Results and Discussion

BDIS was compared with DIS [14], SGBM [12] and ELAS [8] on the in-vivo and the synthetic data setsFootnote 1. The computations were implemented on a commercial desktop (i5-9400) in C++. DNN-based methods PSMNet [5] and GwcNet [10] were also compared for completeness and the computation was conducted on the GTX 1080ti in PyTorch. The public in-vivo stereo videos from [9] were adopted which contains 200 images with size \(640\times 480\) and 200 images with size \(288\times 360\). All stereo images were rectified, undistorted, calibrated, and vertically aligned with the provided intrinsic and extrinsic parameters. We also provided a synthetic data set generated from an off-the-shelf virtual phantom of a male’s digestive system. A virtual handheld colonoscope was placed inside the colon and was manipulated to go through the colon to collect the depth and stereo images. The 3D game engine Unity3DFootnote 2 was used to generate the sequential stereo and depth images with a pin-hole camera in size \(640 \times 480\). The synthetic distortion-free data has accurate intrinsic and extrinsic parameters. Both diffuse lighting (100 frames) and non-Lambertian reflectance (100 frames) were simulated. \(\gamma \) was set to 0.75 for \(640\times 480\) and 0.25 for \(288\times 360\) data to discard the patch without enough valid pixels. \(\sigma _r\) and \(\sigma _s\) were set to 4; the sampling within one Bayesian window was 5; the disturbance from the convergence was 0.5 and 1 pixel.

Fig. 3.
figure 3

Sample reconstructions in Diffuse lighting and Lambertian reflectance scenarios.

Table 1. The results on the synthetic data with diffused light and Lambertian reflectance.

3.1 Quantitative Comparisons on the Synthetic Data Set

BDIS was compared quantitatively with the baseline methods ELAS, SGBM, DIS, PSMNet, and GwcNet. The comparison between the prior-based DNN-based method and BDIS is for completeness only. The default setting of PSMNet and GwcNet were strictly followed. The pre-trained networks were adopted and finetuned with the labeled 300 (training) and 50 (validation) synthetic images for training and validation. Both were trained with Adam optimizer in 300 epochs.

Table 1 and Fig. 3 show the comparisons on the synthetic data set, which are unaffected by distortion and inaccurate camera parameters. Considering the median error, BDIS is the best and has \(9.55\%\) and \(17.68\%\) higher accuracy than ELAS in diffuse lighting and non-Lambertian reflectance. The results indicate that BDIS is more advantageous in the scenario of non-Lambertian reflectance over ELAS, thus more robust in surgical scenarios. Results also show that BDIS cannot handle the edges well. Figure 3 and Table 1 reveal the bad mean error comparison is attributed to the small group of far-out points on the dark regions/edges. The number of valid prediction suggest BDIS produces more predictions but suffers from inaccurate dark region predictions.

Readers may notice the bad performance of DNN, which contradicts the conclusion from [1]. The reason is that the finetuning training process does not yield satisfying model parameters. The synthetic data set for transfer learning and the data used to pre-train the DNN are significantly different in terms of textures. Studies [6, 22] indicate that the performance of the convolutional DNN is heavily dependent on the image texture, and efforts were devoted to bridging the domain gap [6, 34]. The bad training process indicates its strong dependency on the training data set, which can be avoided using prior-free methods. Further tests will be conducted on labeled in-vivo data set.

3.2 Qualitative Comparisons on the In-vivo Dataset

We compared ELAS, BDIS, DIS, and SGBM on the in-vivo data sets. Since no ground truth is provided, DNN-based methods cannot be implemented. We aim to show that BDIS achieves similar accuracy as ELAS since near real-time ELAS is widely used in the community. Based on the scope-to-surface distance, the samples were categorized into five groups. Results show that BDIS achieves an average 0.4–1.66 mm (median error) and 0.65–2.32 mm (mean error) deviation from ELAS’s results.

The invalid/dark/bright pixels lead to photometric inconsistency in the stereo matching process. Figure 4 shows the qualitative comparisons of ELAS, DIS, SGBM, and BDIS on the relatively well-textured images. Generally, BDIS achieves similar performance as ELAS but better matches pixels at the image edge with fewer outliers. DIS and SGBM suffer from the wrong edges. Invalid pixels inevitably exist on the edges of the rectified image after the image undistortion. Thus, in the coarse-level patch disparity estimation, patches with more invalid pixels are more likely to fail in convergence or yield local minima (abnormal depth) due to insufficient information. The dubious predictions, however, substantially influence the prediction and the initialization of the disparity at the finer-scale patch optimization (as in (2)). BDIS solves the problem by quantifying the posterior probability, discarding the patch that does not converge, and lowering the patches’ probabilities with invalid pixels. Although the discarded patch does not help yield disparity, other patches compensate for the loss. If one pixel is not covered by any patch, we follow ELAS not to optimize the pixel.

Another noticeable problem is the ambiguous local minima in the cost function, which penalizes the photometric inconsistency. Figure 4 shows BDIS has fewer local minima than DIS and SGBM and is similar to ELAS. Figure 4 (a–b) indicates that the BDIS addresses the patchs’ probabilities with textured and alleviates the ambiguous disparity from the textureless surface. Figure 4 (c–e) show that the ambiguities caused by the illumination have been greatly reduced. The quantitative results also provide evidence on its side. It should be emphasized that this work does not enforce any prior smoothness constraint in the optimization process.

We additionally tested BDIS and ELAS on the surfaces with serious non-Lambertian reflectance (Fig. 5). The photometric consistency of this data deteriorates significantly. Figure 5 shows that the center of the soft tissue is exposed to intense lighting while the marginal region is dark. Figure 5 indicates that ELAS suffers from the ambiguity on the marginal dark regions while BDIS can ignore or estimate most dark pixels correctly.

3.3 Processing Rate Comparison

We compared the time consumption of ELAS, DIS, and BDIS on a single core of CPU (i5-9400). BDIS runs 10 Hz on \(640\times 480\) image and 25 Hz on \(360\times 288\) image while ELAS achieves 4 Hz and 11 Hz. The two DNN methods GwcNet and PSMnet run 3 Hz and 5 Hz on GTX 1080ti. BDIS consumes double the time of DIS. The majority of the extra time of BDIS is devoted to patch-wise window traversing. Since the sampling window size is 5 in the experiment, 5 more times residual estimations are needed. In general, BDIS achieves similar/better performance over ELAS but runs 2 times faster.

Fig. 4.
figure 4

Sample recovered shapes of the 5 classes. Circles mark the regions with large error.

Fig. 5.
figure 5

The qualitative comparisons on the heavy Lambertian reflectance and dark case.

4 Conclusion

We propose BDIS, the first CPU-level real-time stereo matching approach for CAS. BDIS inherits the fast performance of DIS while being more robust to textureless/dark surface and severe non-Lambertian reflectance. It achieves similar or better performance in accuracy as the near real-time method ELAS. A Bayesian approach and a spatial GMM are developed to describe the relative confidence of the pixel-wise disparity to achieve the performance. Experiments indicate that BDIS has fewer outliers than DIS and achieves a lower amount of outlier predictions than the near real-time ELAS.