1 Introduction

The use of 3D information has many applications, from navigation in mobile robots to object reconstruction and augmented reality. Computer vision approaches for extracting 3D information have been widely studied, producing a variety of algorithms that can be broadly classified into dense and sparse. Dense approaches try to obtain depth information for every pixel in the input images; they exploit smoothness constraints on surfaces to increase the accuracy of the obtained depths, but this comes at a high computational cost. In contrast, sparse approaches obtain depth information only for important features in the images. Image features may correspond to pixels, blobs, edges or ridges. By using only those important features, the amount of computation required is reduced, as only a few pixels are processed for the calculation of depth.

Point-based approaches have been successfully used to reconstruct highly textured areas, but they fail in areas with low texture [1]. In contrast, image edges are present at object boundaries (borders, colour or texture changes). Motivated by this availability of edges, this work proposes a new algorithm that extracts disparity information by matching only a few anchor points and then propagates the disparities while the edges are simultaneously detected in a stereo-pair.

1.1 Related Work

Due to the reduced cost involved in computing depth maps from image edges, different approaches have been proposed to take advantage of them. The approaches found in the literature can be divided into line-based and curve-based. Line-based approaches assume that a straight line may be fitted to the image edges. This family of approaches has been successfully used for wide-baseline stereo [2], visual odometry [3, 4] and Structure from Motion [5]. Although effective in some cases, these approaches fail if few or no straight lines are present in the image, as is the case in outdoor scenarios where curved objects like trees and humans are present.

Only a few approaches have been proposed that use curves as image edges. Rao et al. [6] used Bézier curves to solve the SLAM problem, but only the endpoints of the curves are used to perform the tracking. To our knowledge, the approaches presented by Witt and Weltin [7, 8] and Ohta and Kanade [9] are the only ones able to obtain disparity maps in real time for edge pixels without fitting any model to the edges.

In [7], Witt and Weltin use Canny edge detection [10] to extract the image edges. Every edge pixel is then matched across the stereo pair using different support regions in an area-based approach, and a confidence measure is obtained for each match. Interpolation is then performed to improve poor or missing matches. In [8], Witt and Weltin use Canny edge detection to find the edge pixels and match them using a strip as support region in an area-based approach. Only highly confident matches are kept, and the remaining disparities are recovered by dynamic programming. In [9], Ohta and Kanade detect edges using image gradients, match the edges on the same scanline by dynamic programming, and then apply dynamic programming across scanlines to identify and recover wrong matches. Although fast, these approaches require the computation of descriptors, matching costs and optimization for every edge pixel.

1.2 Contributions

This paper introduces an approach able to extract chains of connected 3D points which represent the 3D edges of the objects in the scene captured by the stereo-pair. In summary, the main advantages of this approach over state-of-the-art edge-based approaches are:

  • Edge matching robust to partial occlusions.

  • Only a small fraction of the edge points need to be matched, reducing the number of descriptor and cost computations required.

  • The extraction of chains of connected 3D points which represent the edges of the objects in 3D comes as a by-product of the simultaneous smart routing.

The approach for obtaining this is detailed in the following section.

Fig. 1. Pipeline of simultaneous edge drawing.

2 Disparity Estimation by Simultaneous Edge Drawing

The proposed algorithm obtains the disparity of edge pixels while simultaneously detecting the edges in a stereo pair. This is achieved by extending the Edge Drawing (ED) algorithm proposed by Topal and Akinlar [11,12,13] so that it runs simultaneously on both images of the stereo pair. This reduces the required amount of computation, as only a few anchor points are matched across the image-pair. The obtained disparities are propagated along the edges as they are detected by linking the pixels along the image gradient. The pipeline of the proposed algorithm is shown in Fig. 1.

The algorithm for sparse disparity calculation by Simultaneous Edge Drawing (SED) consists of the following stages:

  1. Computation of the gradient magnitude, edge candidates and orientations.

  2. Extraction of anchor points in the scale space by Edge Focusing [14].

  3. Anchor matching by an area-based approach.

  4. Edge linking and disparity propagation by simultaneous smart routing.

  5. Curve merging.

The algorithm for computing the gradient magnitude, edge candidates and orientations is inherited from ED. The extraction of the anchor points is performed across the scale space in order to reduce the number of anchor points created by noise or texture. The anchor matching is common to other pixel-based stereo-matching approaches. The simultaneous smart routing extends ED to run on a stereo-pair. The curve merging uses a simple approach for recovering fragmented edges. Each of these steps is detailed in the following sections, and a high-level sketch of the pipeline is given below.
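As a reading aid, the overall flow can be summarized in the following skeleton. All helper names are hypothetical placeholders for the stages listed above, not the implementation itself; later sections include sketches for the individual stages.

```python
def sed_pipeline(left, right):
    """Illustrative skeleton of SED; every helper name is a placeholder."""
    # Stage 1: gradient magnitude, edge candidates and orientations (per image).
    G_l, E_l, H_l = compute_gradients(left)
    G_r, E_r, H_r = compute_gradients(right)

    # Stage 2: anchor points tracked through the scale space (Edge Focusing).
    anchors = extract_anchors(left)

    # Stages 3 and 4: match non-linked anchors one at a time; each confident
    # match seeds a simultaneous smart routing that links edge pixels in both
    # images and propagates the disparity along the detected edge.
    linked = set()
    chains = []
    for a in anchors:
        if tuple(a) in linked:
            continue  # already reached by a previous routing
        match = match_anchor(a, left, right)  # area-based matching, Sect. 2.3
        if match is not None:
            chains.append(simultaneous_smart_routing(
                a, match, G_l, G_r, E_l, E_r, H_l, H_r, linked))

    # Stage 5: merge nearby curve endpoints with similar disparity.
    return merge_curves(chains)
```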

2.1 Computation of the Gradient Magnitude, Edge Candidates and Orientations

The first step in ED is to compute the gradient magnitude G(p), the edge candidates E(p) and the orientations H(p) for every pixel p in the stereo-pair. This step is performed on a smoothed version of the input images. As in ED, the smoothing is performed by convolving the input images with a Gaussian kernel with \(\sigma = 1\). The gradient magnitude is calculated as in the parameter-free version of ED (EDPF [13]). The horizontal and vertical gradients, \(g_x(p)\) and \(g_y(p)\) respectively, are calculated using the Prewitt operator [15]. The gradient magnitude is then computed as

$$\begin{aligned} G(p) = \sqrt{(g_x(p))^2 + (g_y(p))^2} \end{aligned}$$
(1)

The edge orientation at every pixel p is calculated by comparing the horizontal and vertical gradients. If \(|g_x(p)| < |g_y(p)|\), the pixel is assumed to be traversed by a horizontal edge, \(H(p) = 1\). Otherwise, the edge is assumed to be vertical, \(H(p) = 0\), following the same criteria as in ED.

The edge candidates are the pixels which may contain an edge; they are identified by gradient thresholding. Pixels with a gradient magnitude larger than a threshold \(t_g\) are taken as edge candidates. The threshold \(t_g\) is used as in the parameter-free version of ED, where only gradient values produced by quantization error are discarded. For the Prewitt operator this occurs at \(t_g = 8.48\), as in ED [13].
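Under the definitions above, this stage can be sketched as follows. The Prewitt kernels, the Gaussian smoothing with \(\sigma = 1\), the orientation rule and the threshold \(t_g = 8.48\) come from the text; the array conventions and SciPy usage are our assumptions (\(\sigma\) is exposed as a parameter for reuse in Sect. 2.2).

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def compute_gradients(img, t_g=8.48, sigma=1.0):
    """Gradient magnitude G (Eq. 1), edge candidates E and orientations H."""
    smoothed = gaussian_filter(img.astype(np.float64), sigma=sigma)

    # Prewitt kernels for the horizontal and vertical gradients.
    kx = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=np.float64)
    gx = convolve(smoothed, kx)    # g_x(p)
    gy = convolve(smoothed, kx.T)  # g_y(p)

    G = np.sqrt(gx ** 2 + gy ** 2)  # Eq. (1)
    H = np.abs(gx) < np.abs(gy)     # True (1): horizontal edge; False (0): vertical
    E = G > t_g                     # edge candidates by gradient thresholding
    return G, E, H
```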

2.2 Extraction of the Anchor Points

The anchor points in ED are pixels with a high chance of being edge pixels; in this work we take only those pixels with a high chance of being edge pixels across the scale space. This is done to reduce the number of anchors corresponding to edges created by weak texture or noise. To this end, the Edge Focusing algorithm [14] is applied: non-maxima suppression is performed at a coarse level and the detected pixels are then tracked through the scale space until a fine level is reached.

As in [14], the maximum sigma for the scale space is set to \(\sigma _{MAX} = 4\) and the size of the step between scale space levels is set to \(\delta \sigma = 0.5\) in order to guarantee that the detected edge pixels do not move more than one pixel between scales.

Additionally, as in ED, a scan interval s is used to reduce the number of processed anchor points: only one out of every s rows is processed, i.e. if \(s = 2\) only one of every two rows is processed, and if \(s = 3\) only one of every three. A sketch of this anchor-extraction stage follows.
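The sketch below is a simplified stand-in for this stage, not the exact procedure of [14]: the scale-space parameters (\(\sigma _{MAX} = 4\), \(\delta \sigma = 0.5\)) and the one-pixel tracking property come from the text, while the non-maxima suppression and the survival test are our own simplifications. It reuses compute_gradients() from the previous sketch.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def non_maxima_suppression(G, H):
    """Keep pixels whose gradient is a maximum across the edge: horizontal
    edges (H = 1) are compared with their vertical neighbours, and vice versa."""
    keep = np.zeros(G.shape, dtype=bool)
    keep[1:-1, :] = H[1:-1, :] & (G[1:-1, :] >= G[:-2, :]) & (G[1:-1, :] > G[2:, :])
    keep[:, 1:-1] |= ~H[:, 1:-1] & (G[:, 1:-1] >= G[:, :-2]) & (G[:, 1:-1] > G[:, 2:])
    return keep

def extract_anchors(img, sigma_max=4.0, d_sigma=0.5, s=2):
    """Coarse-to-fine anchor extraction in the spirit of Edge Focusing [14]."""
    G, E, H = compute_gradients(img, sigma=sigma_max)
    anchors = non_maxima_suppression(G, H) & E
    sigma = sigma_max - d_sigma
    while sigma >= 1.0:
        G, E, H = compute_gradients(img, sigma=sigma)
        candidates = non_maxima_suppression(G, H) & E
        # A step of 0.5 guarantees edges move at most one pixel per level, so
        # an anchor survives only if a candidate lies within one pixel of it.
        anchors = candidates & binary_dilation(anchors, np.ones((3, 3), bool))
        sigma -= d_sigma
    anchors[np.arange(anchors.shape[0]) % s != 0, :] = False  # scan interval
    return np.argwhere(anchors)  # anchor locations as (row, col) pairs
```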

2.3 Anchor Matching

The anchor points are matched using an area-based stereo-matching approach. The approach is selected as in [8], but using the Complete Rank Transform [16] as pixel descriptor. The Complete Rank Transform uses the morphological order of the pixels in a window to create a descriptor invariant to changes in illumination. An example of this image transform is shown in Fig. 2. As in [8], the descriptors are computed at each side of the anchor points. The Sum of Absolute Differences (SAD) is taken as matching cost and a Winner-Takes-All (WTA) strategy is applied for selecting the best match, as in [8]. A left-right consistency check is used to identify occluded anchors and discard them [17].

Fig. 2. Example of the complete rank transform, CRT(p) = 1, 3, 7, 1, 5, 6, 0, 4, 6.
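The descriptor itself is simple to express. The sketch below assigns each window pixel the count of strictly smaller pixels, so ties share a rank, which is consistent with the repeated values in the example of Fig. 2; the window size and the absence of border handling are our assumptions.

```python
import numpy as np

def crt_descriptor(img, y, x, m=9, n=9):
    """Complete Rank Transform of the m x n window centred at (y, x): each
    pixel is replaced by its rank, i.e. the count of strictly smaller pixels."""
    window = img[y - m // 2:y + m // 2 + 1, x - n // 2:x + n // 2 + 1].ravel()
    return np.array([(window < v).sum() for v in window])

def sad_cost(d1, d2):
    """Sum of Absolute Differences between two CRT descriptors."""
    return int(np.abs(d1.astype(int) - d2.astype(int)).sum())
```

As in [8] and Fig. 3, two such descriptors are computed per anchor (one window on each side), and the scanline candidate with the minimum total SAD is selected by WTA.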

Additionally, the Peak-Ratio Naive Inverse (PKRNI) and Max Cost Ratio (MCR) confidence metrics are applied in order to remove possible wrong matches. The PKRNI is based on the PKR confidence metric [18] but bounded to the interval [0, 1]. This is:

$$\begin{aligned} PKRNI = 1 - \frac{c_1 + 1}{c_{2m} + 1} \end{aligned}$$
(2)

where \(c_1\) is the minimum found cost and \(c_{2m}\) corresponds to the second minimum cost. In the case where there is only one candidate, \(c_{2m}\) is taken as the maximum cost for the descriptor \(c_{MAX}\). For the CRT, \(c_{MAX}\) is computed as:

$$\begin{aligned} c_{MAX} = (mn - 1) \cdot mn \end{aligned}$$
(3)

where m and n are the number of rows and columns of the window used for the CRT. The MCR confidence metric is simply a ratio between the minimum cost and the maximum cost, bounded to the interval [0, 1]:

$$\begin{aligned} MCR = 1 - \frac{c_1}{c_{MAX}} \end{aligned}$$
(4)

Only highly confident matches are kept; to determine these, thresholds \(t_{PKRNI}\) and \(t_{MCR}\) are applied to the PKRNI and MCR confidence metrics. Both metrics can be computed as in the sketch below.
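The following minimal sketch follows Eqs. (2)-(4); the list of costs over all candidate disparities is assumed to be available from the matching step.

```python
def confidence_metrics(costs, m=9, n=9):
    """PKRNI (Eq. 2) and MCR (Eq. 4) for the candidate costs of one anchor."""
    c_max = (m * n - 1) * (m * n)  # Eq. (3): maximum SAD of a CRT descriptor
    ordered = sorted(costs)
    c1 = ordered[0]
    # With a single candidate, the second minimum is replaced by c_max.
    c2m = ordered[1] if len(ordered) > 1 else c_max
    pkrni = 1.0 - (c1 + 1.0) / (c2m + 1.0)
    mcr = 1.0 - c1 / c_max
    return pkrni, mcr

# A match is kept only if both values exceed their thresholds, e.g.
# pkrni > t_PKRNI and mcr > t_MCR (0.44 and 0.88 in Sect. 3).
```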

Fig. 3. Location of windows for a 3\(\,\times \,\)3 CRT used for creating the anchor point descriptors. The red pixels are the anchor points, the yellow pixels are used for the left descriptor and the blue ones for the right descriptor. Gray pixels are edge pixels. (Color figure online)

It is important to note that only non-linked anchors are matched, one at a time. For every matched anchor a simultaneous smart routing is performed, and if an anchor is already linked it is not processed. This decreases the number of required descriptor and cost computations, as only a small portion of the anchors is processed.

2.4 Disparity Propagation by Simultaneous Smart Routing

The matched anchor points are used as seeds for the simultaneous smart routing. The simultaneous smart routing procedure extends the smart routing from ED to run simultaneously on each image in the stereo pair and produces a chain of 3D points. Every point in this chain has the form \((x, y, d)\), where x and y are the coordinates in the left image and d is the disparity of the pixel.

The simultaneous smart routing takes as input a matched anchor pair and then, using the same principle as ED, moves one pixel at a time in each image of the image-pair. This displacement across the edge candidates is performed until any of the following stopping criteria is met:

  • The next pixel to be linked in either image of the stereo pair is not in the edge map E of the left or right image, i.e., the thresholded gradient of the next pixel is zero (as in ED).

  • A previously linked pixel is found (as in ED).

  • The difference in the y location of the pixels is larger than an epipolar threshold \(t_e\).

The threshold \(t_e\) is introduced to add tolerance to errors in the image rectification. The epipolar deviation compared against this threshold is the difference in the y locations of the pixels currently being linked, i.e. the routing stops when

$$\begin{aligned} |y^l_c - y^r_c| > t_e \end{aligned}$$
(5)

where \(y^l_c\) and \(y^r_c\) are the y locations of the currently linked pixel in the left and right images. The stopping criteria can be checked as in the sketch below.
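In the following sketch, pixel coordinates as (y, x) tuples and the boolean maps for edge candidates and already-linked pixels are our conventions; the three tests themselves follow the criteria above.

```python
def should_stop(p_l, p_r, E_l, E_r, linked_l, t_e=1):
    """Stopping criteria of the simultaneous smart routing (Sect. 2.4)."""
    if not (E_l[p_l] and E_r[p_r]):    # next pixel is not an edge candidate
        return True
    if linked_l[p_l]:                  # a previously linked pixel is reached
        return True
    return abs(p_l[0] - p_r[0]) > t_e  # epipolar deviation, Eq. (5)
```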

To determine the next pixel to be linked, an approach similar to that of ED is applied: starting at a pair of matched anchor points, if a vertical edge passes through the matched pair, the linking in the left image is started upwards and downwards; if the edge is horizontal, the linking is started to the left and to the right. The same is applied to the right image.

As in [11, 13, 19], it is not obvious how to handle changes in the orientation of the edges during smart routing; Fig. 4 describes the criteria applied in this work. For a horizontally oriented pixel, if the last displacement was to the right then the pair of points is moved to the right; otherwise it is moved to the left. A similar rule is applied for vertical edges: if the last displacement was up, the pair is moved up; otherwise it is moved down. The direction of movement is calculated for each image independently. After obtaining the direction of movement, the three immediate neighbours in that direction are considered and the one with the maximum gradient is taken as the next pixel.

Fig. 4. Procedure used to get the direction of the next pixel to be linked in SED.
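The rule of Fig. 4, combined with the three-neighbour comparison, could be expressed as below; it is applied independently in each image. The coordinate conventions and the omission of boundary checks are our assumptions.

```python
def next_pixel(G, H, p, last_move):
    """One routing step: choose a direction from the edge orientation and the
    last displacement (Fig. 4), then take the neighbour of maximum gradient."""
    y, x = p
    if H[y, x]:  # horizontal edge: keep going right if we were going right
        dy, dx = (0, 1) if last_move[1] > 0 else (0, -1)
    else:        # vertical edge: keep going up if we were going up
        dy, dx = (-1, 0) if last_move[0] < 0 else (1, 0)
    # The three immediate neighbours in that direction (straight + diagonals);
    # boundary checks are omitted for brevity.
    if dy == 0:
        neighbours = [(y - 1, x + dx), (y, x + dx), (y + 1, x + dx)]
    else:
        neighbours = [(y + dy, x - 1), (y + dy, x), (y + dy, x + 1)]
    return max(neighbours, key=lambda q: G[q])
```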

Figure 5 illustrates the linking procedure in the left and right images. Starting at the anchor point (red), the linking proceeds to the left. By comparing the three immediate neighbours on the left, \(\{50, 55, 47\}\) and \(\{49, 53, 46\}\) for the left and right images respectively, the pixel with the maximum gradient is selected as the next one in each image, i.e. 55 and 53. The next compared elements are \(\{47, 50, 45\}\) and \(\{43, 49, 43\}\), selecting 50 and 49 for the left and right images respectively. Then \(\{30, 35, 47\}\) and \(\{29, 34, 48\}\) are compared, selecting 47 and 48. At this stage a change in orientation is found and the direction of linking is determined by the algorithm shown in Fig. 4. The new linking direction is DOWN for each image and the compared elements are \(\{50, 48, 32\}\) and \(\{51, 46, 32\}\), selecting 50 and 51 for the left and right images respectively. This is repeated until one of the termination conditions is reached.

If required, in order to obtain sub-pixel accurate disparities efficiently, the edges are located at the sub-pixel level following the approach in [20]; the difference in the horizontal location is then taken as the disparity.

Fig. 5. Linking procedure. (Color figure online)

2.5 Curve Merging

After the extraction of the edge segments, they are merged in order to obtain longer curves. Two endpoints are merged if their distance is smaller than a radius r and the difference between their disparity values is smaller than 1. This disparity value is selected so that only pixels in the same disparity region are merged. After the edges are merged, only those with a minimum length of 1% of the image diagonal are kept, as in [21]. A sketch of the merging criterion is given below.
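The merging test for a pair of endpoints is a direct translation of the two conditions above (the default radius of 5 pixels is the value used in the experiments, Sect. 3):

```python
import math

def should_merge(end_a, end_b, r=5.0):
    """Merge two curve endpoints (x, y, d) if they are closer than r pixels
    and belong to the same disparity region (disparity difference below 1)."""
    ax, ay, ad = end_a
    bx, by, bd = end_b
    return math.hypot(ax - bx, ay - by) < r and abs(ad - bd) < 1.0
```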

After linking and merging, the output of the algorithm is a set of chains of connected, well-localized and one-voxel-wide 3D points of the form \((x, y, d)\), where x and y are the location in the left image and d is the disparity at the pixel. As the resulting chains of 3D points are obtained from the edges in the image, they can be used to identify 3D objects by template matching or reconstruction.

3 Evaluation

In this section the performance of Simultaneous Edge Drawing (SED) is tested on the Middlebury [22] and KITTI [23] datasets, in order to compare it with existing methods. All of the experiments were run on a laptop with an Intel Core i7-2675QM processor at 2.1 GHz and 8 GB of RAM, using only one thread.

To our knowledge, the only methods able to obtain disparity values in real time for edge pixels without fitting a line or a curve to the edge pixels prior to stereo matching are EMCBR [7] and EBDP [8]. As only a subset of the Middlebury dataset is used in [7, 8], the same subset is used for comparison.

The parameters used for SED are kept fixed for all the experiments except for the computation of sub-pixel values: when comparing against EBDP and EMCBR, sub-pixel disparities are computed, whereas for the KITTI dataset they are not used, as the KITTI dataset does not provide sub-pixel values. The scan interval is set to \(s = 2\) in order to reduce the number of processed anchors; larger values could decrease the number of anchors further, at the cost of missing small structures. As pixel descriptor, a CRT of size 9\(\,\times \,\)9 is used; larger windows do not produce a significant increase in accuracy but increase the required number of computations. No aggregation is used, as it did not show any improvement in accuracy for transformation windows larger than 9\(\,\times \,\)9. Additionally, in order to keep only confident matches, the confidence thresholds are set to \(t_{PKRNI} = 0.44\) and \(t_{MCR} = 0.88\); these values were selected because they increased the accuracy without significantly decreasing the number of obtained matches. For the linking stage the epipolar threshold is set to \(t_e = 1\) to allow a small offset of the epipolar lines. Finally, for the merging stage a radius of 5 pixels is used; this value was found experimentally.

The edge-based ground truth is obtained by dilating the dense ground-truth image with a 3\(\,\times \,\)3 structuring element to keep the larger disparities; edges are then detected using the Canny algorithm, and only values at the detected edges are kept, as in [7, 8]. A pixel is marked as erroneous if the difference between the obtained disparity and the ground truth is \(> 1\). No information for other error thresholds is provided by EBDP and EMCBR. This procedure can be reproduced along the lines of the sketch below.
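In this sketch the use of OpenCV and the Canny thresholds are our assumptions, while the 3×3 dilation and the edge masking follow [7, 8]; gray is assumed to be the 8-bit left image.

```python
import cv2
import numpy as np

def edge_ground_truth(gt_disp, gray, canny_low=50, canny_high=150):
    """Edge-based ground truth: dilate the dense ground truth with a 3x3
    structuring element (keeping the larger disparities), then retain its
    values only at the Canny edges of the intensity image."""
    dilated = cv2.dilate(gt_disp.astype(np.float32), np.ones((3, 3), np.uint8))
    edges = cv2.Canny(gray, canny_low, canny_high) > 0
    return np.where(edges, dilated, 0.0)
```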

Table 1 shows the results obtained by running SED on the same subset of the Middlebury dataset as EBDP and EMCBR. Although the accuracy is similar to that obtained by EBDP and EMCBR, the running time is twice as fast, considering that only one core is used in our experiments whereas two cores are used by EBDP and EMCBR. This reduction in the number of computations is important in low-cost embedded systems where multiple cores and GPUs may not be available. It is important to note that while no such information is provided for EBDP and EMCBR, for SED an error threshold of 3 yields accuracies of around 99%. The running times on the Middlebury dataset include 30–50 ms for the Edge Focusing algorithm, this being the most computationally expensive part of SED; 9–22 ms for anchor matching; and 1–3 ms for simultaneous smart routing. The reduction in computing time is explained by the fact that typically less than 50% of the anchor points are matched, avoiding the computation of many descriptors and cost values. Figure 6 shows the images obtained by SED and EBDP.

Table 1. Comparison between SED, EBDP and EMCBR.
Fig. 6. Disparity results for SED (top) and EBDP (bottom). Green pixels have an error \(\epsilon \le 1\). Red pixels have an error \(\epsilon > 1\). White pixels are unmatched. The images for EBDP were taken from [8]. (Color figure online)

Table 2 shows the results obtained on the KITTI 2015 stereo dataset. These results would place our algorithm in the top 15 of the ranking at the time of writing, being the only one with a running time smaller than 1 s without the use of multiple cores or GPUs.

Table 2. Results from SED on the KITTI stereo evaluation 2015 using only the estimated pixels. D1-bg = outliers on background regions, D1-fg = outliers on foreground regions, D1-all = outliers over all ground-truth pixels. This would place our approach in the top 15 at the time of writing, being the only one with a running time smaller than 1 s without the use of multiple cores or GPUs.
Table 3. Results from SED on the Middlebury v3 stereo evaluation using only the estimated pixels. The selected metric is bad 2.0 and the mask is for non-occluded pixels only. These values are the default for the ranking.

Table 3 shows the results obtained on the Middlebury v3 stereo dataset. These results would place our algorithm among the top-performing methods in the ranking at the time of writing.

4 Conclusions

This paper presented a new approach that uses edge information to reduce the number of matched pixels in a local approach and produces sparse disparity maps which represent the semantics of the image. The proposed approach extends a robust edge detector to run simultaneously on a stereo-pair, producing chains of ordered 3D points. The key speed-up lies in the reduction of the number of descriptors and costs computed, as only a few pixels need to be matched. Further research includes the implementation of the approach on a low-cost embedded system and the use of the obtained chains of 3D points for obstacle detection.

Potential improvements can be identified in terms of speed, as Edge Focusing is the most time-consuming step of the algorithm, followed by the matching. Improvements in accuracy could be obtained by using a different pixel descriptor that better describes horizontal edges and by using an alternative approach for identifying occlusions. The connectivity information could additionally be used in a post-processing step to recover missing edges, but this is beyond the scope of this work.