1 Introduction

Depth is one of the most important cues in perceiving the three-dimensional characteristics of objects in a scene captured by cameras. In computer vision, the value representing the distance from each object in the scene to the focal point of the camera is called depth, and an image storing these values for all pixels is referred to as a depth map. Depth maps are essential in a variety of applications such as view synthesis, robot vision, 3D scene reconstruction, human-computer interaction, and advanced driver assistance systems [13]. The performance of these applications is highly dependent on the quality and accuracy of the depth map. Thus, generating an accurate depth map is of substantial importance. The main objective of depth estimation methods is to generate a per-pixel depth map of the scene based on two or more reference images of the same scene taken from different angles. The reference images are captured by a calibrated stereo camera system in which the cameras are parallel to each other or are set at a slight angle. The images are typically stereo rectified so that corresponding points in each image have identical vertical coordinates.

Most stereo matching methods consist of the following four steps:

  • Matching cost calculation: The similarity of image locations is measured by defining a matching cost. Normally, a matching cost is calculated at each pixel for all disparities under consideration. Common matching costs include absolute differences (AD) and squared differences (SD). More complicated matching costs have also been proposed, such as mutual information (MI), the Census transform, and the rank transform. An evaluation of different cost functions for stereo matching is provided in [4].

  • Cost aggregation: The calculated matching cost is aggregated over a support region. Numerous cost aggregation methods have been proposed recently [33–35]. The most common approach is to use a square window. Cross-based cost aggregation is an alternative.

  • Disparity optimization: Disparity refers to the distance between two corresponding points in the left and right images of a stereo pair. The disparity which minimizes the cost function is chosen as the optimum value and the depth value is calculated accordingly. Several optimization techniques have been proposed to estimate the optimum depth value.

  • Depth refinement: Depending on the application, the requirement for depth map accuracy varies. Some applications, like human pose estimation in the gaming industry, are satisfied by low-resolution depth maps. However, higher accuracy is vital for applications like depth image-based rendering (DIBR) techniques and stereo-based advanced driver assistance systems (ADAS).
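The first three steps above can be illustrated with a minimal sketch on a single scanline. It is not the method proposed in this paper: it assumes an absolute-difference cost, a fixed 1D window, and winner-take-all selection. The function names, the `radius` parameter, and the out-of-range penalty of 255 are illustrative assumptions.

```python
def ad_cost(left_row, right_row, x, d):
    """Absolute-difference cost between left pixel x and right pixel x - d."""
    return abs(left_row[x] - right_row[x - d])


def wta_disparity(left_row, right_row, x, max_disp, radius=1):
    """Pick the disparity with the lowest window-aggregated AD cost."""
    best_d, best_cost = 0, float("inf")
    for d in range(max_disp + 1):
        cost = 0
        for dx in range(-radius, radius + 1):
            xl = x + dx
            # Only compare positions that exist in both rows.
            if 0 <= xl < len(left_row) and 0 <= xl - d < len(right_row):
                cost += ad_cost(left_row, right_row, xl, d)
            else:
                cost += 255  # assumed penalty for out-of-range samples
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Toy rectified scanlines: the left row is the right row shifted by 2 pixels.
right = [10, 10, 80, 200, 90, 10, 10, 10]
left = [10, 10, 10, 10, 80, 200, 90, 10]
print(wta_disparity(left, right, x=5, max_disp=3))
```

On this toy input the window around x = 5 matches exactly at a shift of two pixels, so that disparity is selected.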

Depth maps can be estimated using either stereo matching techniques or depth sensors. With the advent of depth sensors, novel camera systems have been developed [5], which generate depth maps in real time. The measurement of depth in such sensors is often performed using either time-of-flight (TOF) systems or infrared pattern deformation. Depth maps acquired by depth sensors are usually noisy and suffer from poorly generated depth boundaries. Figure 1 shows an example of a color image and the corresponding depth map of a scene captured by a Kinect device [5]. We can see that the depth map has extremely poor quality.

Fig. 1 Example of a Kinect dataset: a color image, and b depth map (color figure online)

Stereo matching techniques can be classified into two groups, namely local and global techniques. Local methods [6, 7] consider a finite-size window to estimate the disparity. Thus, the window size plays an important role in such methods. Local methods are fast and computationally simple, but they are highly error-prone and the estimated depth maps are usually inaccurate. In global techniques, on the other hand, an energy function is globally optimized to find the disparity. Global depth estimation techniques can generate high-quality depth maps. The most popular techniques in this category include belief propagation [8], graph cuts [9] and dynamic programming [10]. However, due to the computational complexity of such algorithms, it is not feasible to use them in real-time applications. Semi-global matching (SGM) [11], which was first introduced by Hirschmuller, performs pixelwise matching based on mutual information and an approximation of a global smoothness constraint. Therefore, a good trade-off between accuracy and runtime is obtained. However, it achieves limited performance under illumination changes. Despite the presence of such promising depth estimation techniques, there are still several problems in the generated depth maps. The existence of holes and the sensitivity to noise and illumination changes are the most significant problems. These problems are partially solved by post-processing the depth maps using depth refinement techniques [12, 13].

In this paper, we propose a depth estimation algorithm based on the Census transform and stripe-based optimization. The main contribution of the proposed method is to improve the quality of estimated depth maps with low computational complexity so that it can satisfy the requirements of real-time applications such as 3D scene reconstruction and stereo-based pedestrian detection. We focus on applying the proposed depth estimation method to stereo-based pedestrian detection in advanced driver assistance systems. The proposed method increases the performance of ADAS by improving the pedestrian detection rate. Simulation results indicate a significant increase in depth map accuracy and quality. Performance evaluations are performed using standard benchmarks and real-world scenarios to demonstrate the efficiency of the proposed algorithm.

The rest of the paper is organized as follows. In Sect. 2, we review previous work related to real-time depth estimation. The proposed method is presented in detail in Sect. 3. The experimental results are provided in Sect. 4, and the paper is concluded in Sect. 5.

2 Related work

Several approaches have been proposed in the past for estimating dense disparity maps from stereo images captured by static cameras. Due to the development of various applications that require real-time processing, such as advanced driver assistance systems, real-time depth estimation has become a more attractive research area in recent years. In this section, we will briefly review recently proposed real-time depth estimation techniques.

During the past several years, many stereo matching techniques have been proposed. A survey study of different stereo matching algorithms is available in [14]. A multi-resolution stereo matching technique implemented in graphics hardware is presented in [15]. The multi-resolution scheme reduces the noise and sum-of-squared-difference (SSD) is used to aggregate matching cost. The optimum disparity values are determined based on the winner-take-all criterion. A depth estimation method for uncalibrated stereo images is proposed in [16].

Due to the high computational complexity of some stereo matching techniques, real-time performance is only achieved via specially designed hardware. In [17], a graphics processing unit (GPU)-based depth estimation method is proposed which exploits adaptive cost aggregation windows. The windows change shape according to the local content of the image, such as edges and corners. A near-real-time stereo matching technique based on dynamic programming is presented in [18]. Two different implementations are presented in the paper. The first one uses the GPU for matching cost calculation and the CPU for disparity estimation, while the second one uses the GPU for both calculations. A real-time correlation-based stereo matching algorithm is proposed in [19] which uses an image-gradient-guided cost aggregation scheme. The scheme is designed to fit the architecture of GPUs and the whole algorithm is run on the graphics board.

Depth estimation using SGM has been studied intensively in recent years. The original SGM implementation on GPU was first presented in [20]. An implementation of SGM with 8 accumulation paths on a field programmable gate array (FPGA) is presented in [21]. The method proposed in [22] uses a 4-path cost aggregation method, and an extended work with fewer aggregation paths is presented in [23].

Eliminating the diagonal integration paths is an idea proposed in [24]. By only using 4 directions for cost accumulation, the method proposed in [24] achieves a run time of 19 ms for a \(512 \times 383\) image resolution. iSGM is a technique introduced in [25] as a new cost integration concept for semi-global matching by iteratively reducing the search space. A GPU-based SGM depth estimation method is proposed in [26] which can achieve real-time processing on high resolution reference images. The method runs with the frame rate of 25 fps on images with \(1024 \times 768\) resolution and 128 disparity levels. In [27], a two-dimensional parallelization scheme for SGM is proposed. The FPGA implementation of the algorithm achieves real-time processing for VGA image resolutions.

In recent years, a number of depth estimation methods based on the Census transform have been proposed. In [28], an SGM-based technique is proposed which uses a \(5 \times 5\) Census transform for computing the similarity while conducting cost accumulation over 8 paths. In [29], a disparity map is estimated by applying the Census transform prior to the SGM-based optimization. The depth map is refined by a segmentation and plane fitting approach. The authors in [30] introduce an algorithm based on semi-global matching and the Census transform. Two sub-pixel interpolation functions are implemented to increase the accuracy at the sub-pixel level. A modified Census transform using semi-global matching is introduced in [31]. The authors in [32] propose an algorithm using adaptive window patterns for the Census transform.

Different cost aggregation methods have been studied in recent years. In [33], a segmentation-based adaptive support region for cost aggregation is proposed. A geodesic support weight for stereo matching is proposed in [34], where pixels with low geodesic distance are given high support weight. A cross-based cost aggregation method is proposed in [35].

Apart from the existence of holes in the estimated depth maps, other artifacts may limit the efficiency of stereo matching techniques. A few limitations of the existing depth estimation methods can be listed as:

  1. Sensitivity to noise and illumination changes,

  2. Inaccurate object boundaries, and

  3. High computational complexity.

Taking all these limitations into consideration, the main goal of the proposed depth estimation method is to provide a high-quality depth map which can be used in real-time applications. Using multiple symmetric Census window patterns and performing most of the tasks on low-resolution images reduces the overall computational complexity and makes the algorithm applicable to real-time advanced driver assistance systems.

3 Proposed method

In our proposed method, we use a stereo image pair as the reference to generate a depth map. The Census transform is chosen as the matching metric and the local characteristics of depth maps are exploited using different window patterns in Census transform. The proposed technique consists of four steps: down-sampling and mask generation, cost calculation and aggregation, semi-global optimization, and depth refinement. Figure 2 shows the block diagram of the proposed algorithm.

Fig. 2 Block diagram of the proposed technique

3.1 Down-sampling and mask generation

In an ideal case, pixels belonging to the same object which are positioned at the same distance from the camera should have the same depth value. However, this does not occur in real-world scenarios for several reasons, such as illumination changes within an object and mismatches in stereo matching algorithms. We use the curvature of the pixels in the given color images to distinguish between the smooth areas and sharp edges of the objects in the scene. To reduce the computational complexity of the algorithm, the first step is to down-sample the stereo color images by a factor of 4. Then, we obtain a mask which indicates the smoothness of different regions in the reference image.

Fig. 3 Mask generation: a color image and b mask (color figure online)

The curvature is calculated using the first- and second-order gradients of each pixel given by Eq. (1).

$$\begin{aligned} k\left( {x,y} \right) =\frac{\mu _{xx} \mu _x^2 -2\mu _x \mu _{xy} \mu _y +\mu _{yy} \mu _y^2}{\left( \mu _x^2 +\mu _y^2\right) ^{3/2}}, \end{aligned}$$
(1)

where \(\mu _x\) and \(\mu _{xx}\) are the first- and second-order gradients, respectively. Subscripts indicate the direction of the gradient. The Prewitt kernel is used to find the gradient. After computing the curvature, we aggregate the values over a \(5 \times 5\) window using Eq. (2) and store them in a curvature map. A binary mask is generated by Eq. (3) using the curvature map. When the aggregated curvature of a pixel is less than a threshold, a zero value is assigned to the mask. An example of mask generation is shown in Fig. 3.

$$\begin{aligned} \mathrm{agg}_c= & {} \mathop {\sum }\limits _{{\varDelta } y=-2}^2 \mathop {\sum }\limits _{{\varDelta } x=-2}^2 k\left( {x+{\varDelta } x,y+{\varDelta } y} \right) , \end{aligned}$$
(2)
$$\begin{aligned} M\left( {x,y} \right)= & {} \left\{ {{ \begin{array}{ll} 0,&{} \quad \mathrm{if}\,\mathrm{agg}_c <T_1, \\ 1,&{} \quad \mathrm{Otherwise}. \\ \end{array}}} \right. \end{aligned}$$
(3)
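A minimal sketch of Eqs. (1) to (3) on a grayscale image stored as a list of rows may help make the mask generation concrete. The border handling (replicated pixels for the gradients, clipped window for the aggregation), the zero curvature assigned to perfectly flat pixels, and all function names are assumptions made for illustration and are not specified in the text.

```python
PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
PREWITT_Y = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]


def filter3x3(img, kernel):
    """Correlate img with a 3x3 kernel, replicating border pixels (assumed)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(3):
                for i in range(3):
                    yy = min(max(y + j - 1, 0), h - 1)
                    xx = min(max(x + i - 1, 0), w - 1)
                    acc += kernel[j][i] * img[yy][xx]
            out[y][x] = acc
    return out


def curvature_map(img, eps=1e-9):
    """Per-pixel curvature k(x, y) following Eq. (1)."""
    mx, my = filter3x3(img, PREWITT_X), filter3x3(img, PREWITT_Y)
    mxx, mxy = filter3x3(mx, PREWITT_X), filter3x3(mx, PREWITT_Y)
    myy = filter3x3(my, PREWITT_Y)
    h, w = len(img), len(img[0])
    k = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx, gy = mx[y][x], my[y][x]
            num = mxx[y][x] * gx**2 - 2 * gx * mxy[y][x] * gy + myy[y][x] * gy**2
            den = (gx**2 + gy**2) ** 1.5
            # Assumption: pixels with zero gradient get zero curvature.
            k[y][x] = num / den if den > eps else 0.0
    return k


def binary_mask(k, t1):
    """Aggregate curvature over a 5x5 window (Eq. 2) and threshold it (Eq. 3)."""
    h, w = len(k), len(k[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            agg = sum(
                k[y + dy][x + dx]
                for dy in range(-2, 3)
                for dx in range(-2, 3)
                if 0 <= y + dy < h and 0 <= x + dx < w
            )
            mask[y][x] = 0 if agg < t1 else 1
    return mask
```

For a constant image every gradient is zero, so the curvature map is zero and any positive threshold yields an all-zero (smooth) mask.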

3.2 Cost calculation and aggregation

The Census transform [17] maps the local neighborhood surrounding a pixel \(I_c \left( {x,y} \right) \) to a bit-pattern. The transform relies on the relative ordering of local intensity values and not on the pixel values themselves. Figure 4 shows an example of the Census transform of a window image with respect to the center pixel.

Fig. 4 Census transform example of a window image

The Census transform converts relative intensity difference to 0 or 1 in a bit-pattern.

In this paper, Census transform is used to calculate the cost function for both left and right images. We use a simple Census window pattern for the smooth regions to reduce the computational complexity and use a more complex pattern for the non-uniform regions which usually contain edges and object boundaries. The adaptive Census window patterns are shown in Fig. 5 where the selected positions are denoted by black pixels.

Fig. 5 Census window patterns: a \(\hbox {P}_{1}\) for uniform regions, and b \(\hbox {P}_{2}\) for non-uniform regions

The black pixels shown in Fig. 5 are used in a symmetric way to generate the binary bit-pattern. Therefore, only half of the pixels are directly engaged in the computation process. The other half of pixel values are symmetrically copied using the pre-calculated data. Figure 6 shows the center-symmetric window pattern of \(P_1\). Similarly, the binary bit-pattern of \(P_2\) pattern is symmetrically obtained. For the pixel \(I_c \left( {x,y} \right) ,\) the Census transform is calculated using Eq. (4)

$$\begin{aligned} R_T \left( {x,y} \right) =\otimes _{\left( {i,j} \right) \in N} \xi \left( I_{c} \left( {x,y} \right) ,I_{c} \left( {x+i,y+j} \right) \right) , \end{aligned}$$
(4)

where N is the neighborhood of the current pixel within the Census transform, \(\xi \) is the step function and \(\otimes \) is bitwise concatenation. The step function is defined by Eq. (5).

Fig. 6 Center-symmetric Census window

$$\begin{aligned} \xi \left( I_{c} \left( {x,y} \right) ,I_{c} \left( {x+i,y+j} \right) \right) =\left\{ {{ \begin{array}{ll} 0, &{} \quad \mathrm{if}\,I_{c} \left( {x,y} \right) <I_{c} \left( {x+i,y+j} \right) , \\ 1, &{} \quad \mathrm{Otherwise}. \\ \end{array}}} \right. \end{aligned}$$
(5)
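Equations (4) and (5) can be sketched as follows. Since the actual window patterns \(P_1\) and \(P_2\) are only given graphically in Fig. 5, this example assumes a full \(3 \times 3\) neighborhood in raster order; the names `census_bits` and `N8` and the sample intensities are illustrative, and the center-symmetric shortcut of Fig. 6 is not reproduced here.

```python
def census_bits(img, x, y, offsets):
    """Concatenate the step-function outputs of Eq. (5) over the offsets in N."""
    center = img[y][x]
    bits = 0
    for i, j in offsets:
        # Eq. (5): 0 if the center is darker than the neighbor, 1 otherwise.
        bit = 0 if center < img[y + j][x + i] else 1
        bits = (bits << 1) | bit  # bitwise concatenation of Eq. (4)
    return bits


# Assumed neighborhood: the 8 neighbors of a 3x3 window in raster order.
N8 = [(i, j) for j in (-1, 0, 1) for i in (-1, 0, 1) if (i, j) != (0, 0)]

window = [
    [124, 74, 32],
    [124, 64, 18],
    [157, 116, 84],
]
print(format(census_bits(window, 1, 1, N8), "08b"))  # prints 00101000
```

Only the two neighbors that are darker than the center value 64 (32 and 18) produce a 1, which illustrates why the descriptor depends on intensity ordering rather than absolute values.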

The binary mask generated in the previous step is used to decide which pattern to use. The decision criterion is made as follows:

$$\begin{aligned} \alpha= & {} \mathop {\sum }\limits _{\left( {i,j} \right) \in N} M\left( {x+i,y+j} \right) , \end{aligned}$$
(6)
$$\begin{aligned} \mathrm{Pattern}= & {} \left\{ {{ \begin{array}{ll} P_1, &{} \quad \mathrm{if}\,\alpha <T_2, \\ P_2, &{} \quad \mathrm{Otherwise}. \\ \end{array}}} \right. \end{aligned}$$
(7)
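The decision rule of Eqs. (6) and (7) amounts to counting non-smooth mask entries around the pixel. The sketch below assumes an 8-neighborhood for N and skips positions outside the image; the threshold value and the names are chosen only for illustration.

```python
def select_pattern(mask, x, y, offsets, t2):
    """Return 'P1' for smooth neighborhoods and 'P2' otherwise (Eqs. 6 and 7)."""
    h, w = len(mask), len(mask[0])
    # Eq. (6): sum the binary mask over the neighborhood N.
    alpha = sum(
        mask[y + j][x + i]
        for i, j in offsets
        if 0 <= y + j < h and 0 <= x + i < w  # assumed border handling
    )
    # Eq. (7): simple pattern for smooth regions, complex pattern otherwise.
    return "P1" if alpha < t2 else "P2"


# Assumed neighborhood: the 8 neighbors of a 3x3 window.
NEIGHBORS = [(i, j) for j in (-1, 0, 1) for i in (-1, 0, 1) if (i, j) != (0, 0)]

example_mask = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
print(select_pattern(example_mask, 1, 1, NEIGHBORS, t2=2))
print(select_pattern(example_mask, 2, 1, NEIGHBORS, t2=2))
```

The pixel at (1, 1) sees only smooth neighbors and gets the simple pattern, while the pixel at (2, 1) borders the column of ones and gets the more complex one.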

The cost function is calculated by finding the Hamming distance between the obtained bit-patterns of the left and right reference images using Eq. (8).

$$\begin{aligned} C\left( {\left( {x,y} \right) ,d} \right) =\mathop {\sum }\limits _{\left( {i,j} \right) \in N} d_\mathrm{H} \left( {R_{T,l} \left( {x+i,y+j} \right) ,R_{T,r} \left( {x+i-d,y+j} \right) } \right) \end{aligned}$$
(8)

In Eq. (8), \(R_{T}\) is the calculated bit-pattern, d is the disparity, \(d_\mathrm{H}\) is the Hamming distance function, and the subscripts l and r refer to the left and right reference images, respectively.
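As a sketch of Eq. (8), the cost can be computed from two precomputed Census maps stored as lists of integer bit-patterns. The horizontal three-pixel neighborhood and the toy bit-patterns are assumptions for illustration; positions are assumed to lie inside both maps.

```python
def hamming(a, b):
    """Number of differing bits between two Census bit-patterns."""
    return bin(a ^ b).count("1")


def matching_cost(census_l, census_r, x, y, d, offsets):
    """Eq. (8): sum of Hamming distances over the neighborhood N for disparity d."""
    total = 0
    for i, j in offsets:
        total += hamming(census_l[y + j][x + i], census_r[y + j][x + i - d])
    return total


# Toy Census maps: the left row equals the right row shifted by one pixel.
census_r = [[0b1010, 0b1100, 0b0110, 0b0011, 0b1001]]
census_l = [[0b0000, 0b1010, 0b1100, 0b0110, 0b0011]]
ROW_N = [(-1, 0), (0, 0), (1, 0)]  # assumed horizontal neighborhood

print(matching_cost(census_l, census_r, 2, 0, 1, ROW_N))  # cost at the true shift
print(matching_cost(census_l, census_r, 2, 0, 0, ROW_N))  # cost at zero disparity
```

The cost vanishes at the true shift of one pixel and is strictly larger at zero disparity, which is the behavior the later optimization relies on.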

With the cost calculated for each pixel, the next step is to aggregate each pixel’s cost over a support region. The main goal of cost aggregation is to reduce the matching ambiguities and noise present in the initial cost. A modified cross-based cost aggregation is proposed, based on the assumption that neighboring pixels with similar colors and spatial characteristics usually belong to the same object and should have similar depth values. The proposed cost aggregation method consists of two steps:

  1. Creating the cross shape, and

  2. Aggregating cost over the created cross.

In the first step, an adaptive cross with varying arms is constructed for each pixel. Given a pixel p, the endpoint of the arm is defined as \(p_1\) when one of the three following rules is not met:

  • The color difference between p and \(p_1\) should be less than a predefined threshold.

  • The spatial distance between p and \(p_1\) should be less than a preset maximum length.

  • The difference between the curvature values of p and \(p_1\) in the curvature map should not exceed a threshold.

The above-mentioned criteria are defined by Eq. (9).

$$\begin{aligned} \left| {I_c \left( p \right) -I_c \left( {p_1} \right) } \right|\le & {} \tau _1 ,\nonumber \\ \parallel p-p_1\parallel\le & {} L,\nonumber \\ \left| {\mathrm{curv}\left( p \right) -\mathrm{curv}\left( {p_1} \right) } \right|\le & {} \tau _2 , \end{aligned}$$
(9)

where L is the maximum length. \(\tau _1\) and \(\tau _2\) are predefined thresholds. Large thresholds are usually set for textureless regions to include adequate intensity variation.

The second step is aggregating the cost values over the created cross. The intermediate cost is obtained by summing the cost values horizontally and the final cost is calculated by adding all the intermediate data vertically. The whole process is shown in Fig. 7.
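The two steps can be sketched for a single disparity slice of the cost volume as follows. This is a simplified reading of the text, assuming grayscale intensities, unnormalized sums, and arms that stop at the image border; `arm_length` and `cross_aggregate` are illustrative names rather than the authors' implementation.

```python
def arm_length(img, curv, x, y, dx, dy, tau1, tau2, max_len):
    """Extend an arm from p = (x, y) along (dx, dy) while Eq. (9) holds."""
    h, w = len(img), len(img[0])
    length = 0
    while length < max_len:  # spatial rule: arm no longer than L
        nx, ny = x + (length + 1) * dx, y + (length + 1) * dy
        if not (0 <= nx < w and 0 <= ny < h):
            break
        if abs(img[y][x] - img[ny][nx]) > tau1:  # color rule
            break
        if abs(curv[y][x] - curv[ny][nx]) > tau2:  # curvature rule
            break
        length += 1
    return length


def cross_aggregate(cost, img, curv, tau1, tau2, max_len):
    """Sum costs along horizontal arms, then sum those results along vertical arms."""
    h, w = len(cost), len(cost[0])
    args = (tau1, tau2, max_len)
    horiz = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            left = arm_length(img, curv, x, y, -1, 0, *args)
            right = arm_length(img, curv, x, y, 1, 0, *args)
            horiz[y][x] = sum(cost[y][x - left : x + right + 1])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            up = arm_length(img, curv, x, y, 0, -1, *args)
            down = arm_length(img, curv, x, y, 0, 1, *args)
            out[y][x] = sum(horiz[yy][x] for yy in range(y - up, y + down + 1))
    return out


# Toy image with two flat regions; the arms must not cross the intensity edge.
toy_img = [[10, 10, 200, 200] for _ in range(3)]
toy_curv = [[0.0] * 4 for _ in range(3)]
toy_cost = [[1.0] * 4 for _ in range(3)]
agg = cross_aggregate(toy_cost, toy_img, toy_curv, tau1=20, tau2=1.0, max_len=5)
print(agg[1][1])
```

In the toy image each support region covers two columns and three rows, so every aggregated value equals 6 and no cost leaks across the intensity edge.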

Fig. 7 Cross-based aggregation

3.3 Stripe-based optimization

The optimization path for finding the best match is formed within multiple-size horizontal stripes. Figure 8 shows the corresponding adaptive stripe pattern. Depending on the structure of the reference image, typically up to 10 stripes with adaptive widths are used. For each pixel, we consider 4 directions of up, down, left and right and the direction which has the least cost value is decided as the next path. The path accumulation is performed in the decided directions. However, the paths which cross the stripes are cut so that the information cannot propagate over them. The size of each stripe is obtained based on the reference image structure.

Fig. 8 Stripe-based optimization path

At this stage of the algorithm, we obtain the best match for each pixel and calculate the disparity. The optimal disparity is calculated by minimizing the energy function. As shown in Eq. (10), the energy function consists of three terms. The first term is the matching cost from the previous step, which is based on the Census transform. The other two terms are the smoothness constraints. We add two penalty terms to the matching cost function to take into account slight and abrupt changes in the disparity of neighboring pixels.

$$\begin{aligned} E\left( {p,D} \right) =\sum C\left( {p,d_p} \right) +\mathop {\sum }\limits _{q\in N} f_1 F\left[ \left| d_q -d_p \right| =1\right] +\mathop {\sum }\limits _{q\in N} f_2 F\left[ \left| d_q -d_p \right| >1\right] \end{aligned}$$
(10)

where N is the neighborhood of the current pixel p, and \(d_p\) and \(d_q\) are the disparity values of pixels p and q. \(F\left[ \cdot \right] \) equals one when its argument is true and zero otherwise. \(f_1\) is the penalty term applied when the disparity values of neighboring pixels differ by one. A larger penalty term \(f_2\) is added when the neighboring disparity values differ by more than one.

The disparity image is estimated by finding the optimum value which minimizes the cost function as stated in Eq. (11).

$$\begin{aligned} \mathrm{Disparity}=\mathop {\hbox {Argmin}}\nolimits _d E\left( {p,d} \right) \end{aligned}$$
(11)

The 4-direction optimization path is used to find the optimal value.
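To illustrate how the two penalties in Eq. (10) shape the result, the sketch below evaluates the energy on a one-dimensional chain of pixels and minimizes it exactly by dynamic programming. It is a simplification and not the stripe-based 4-direction scheme described above: the neighborhood is assumed to be the adjacent pixel on the chain, each pair is counted once, and the cost table is made up.

```python
def penalty(dp, dq, f1, f2):
    """Smoothness penalty of Eq. (10) for one pair of neighboring disparities."""
    diff = abs(dp - dq)
    if diff == 0:
        return 0
    return f1 if diff == 1 else f2


def chain_energy(cost, disp, f1, f2):
    """Energy on a 1D chain; cost[p][d] is the aggregated matching cost."""
    data = sum(cost[p][d] for p, d in enumerate(disp))
    smooth = sum(penalty(disp[p], disp[p + 1], f1, f2) for p in range(len(disp) - 1))
    return data + smooth


def minimize_chain(cost, f1, f2):
    """Exact minimizer of chain_energy via dynamic programming."""
    n, levels = len(cost), len(cost[0])
    best = list(cost[0])
    back = []
    for p in range(1, n):
        new, arg = [], []
        for d in range(levels):
            prev = min(range(levels), key=lambda k: best[k] + penalty(k, d, f1, f2))
            new.append(cost[p][d] + best[prev] + penalty(prev, d, f1, f2))
            arg.append(prev)
        best = new
        back.append(arg)
    d = min(range(levels), key=lambda k: best[k])
    disp = [d]
    for arg in reversed(back):  # backtrack from the last pixel to the first
        d = arg[d]
        disp.append(d)
    return disp[::-1]


# Pixel 2 weakly prefers disparity 2; its neighbors strongly prefer disparity 0.
toy_costs = [[0, 5, 5], [0, 5, 5], [4, 5, 0], [0, 5, 5]]
print(minimize_chain(toy_costs, f1=0, f2=0))   # no smoothness term
print(minimize_chain(toy_costs, f1=1, f2=10))  # penalties suppress the outlier
```

Without penalties the isolated outlier at the third pixel survives; with \(f_1 = 1\) and \(f_2 = 10\) the smoother labeling has lower energy and is selected instead.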

3.4 Depth refinement

The depth map obtained by the proposed modified Census transform and stripe-based optimization is further refined using the proposed depth refinement algorithm. The proposed algorithm consists of the following two steps:

  1. Filling the holes in the estimated disparity image, and

  2. Sharpening the edges and object boundaries.

Since we obtained the disparity image based on down-sampled reference images, we bring it back to the original size while performing refinement. The estimated disparity image from the previous steps has some holes due to the occlusion and mismatches which need to be filled. The hole regions usually belong to the background which cannot be seen from the other reference view. Hence, the algorithm fails to estimate a depth value for those specific regions. To fill up the holes, we first select the pixels that belong to the background among the neighboring non-zero pixels of the hole region. The holes are then filled by a weighted average on the selected correct pixels using Eq. (12).

$$\begin{aligned} d_\mathrm{hole} =\frac{\mathop {\sum }\nolimits _{i=1}^N w_i d_i^\mathrm{bg}}{\mathop {\sum }\nolimits _{i=1}^N w_i} \end{aligned}$$
(12)

where \(d_i^\mathrm{bg}\) is the background depth value and \(w_i\) is the weighting factor based on the distance from the background depth pixel to the current hole.
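A minimal sketch of Eq. (12) with Gaussian weights of the form given in Eq. (14) is shown below. How the background pixels are selected is not detailed in the text, so the function simply receives them as a list; the names and the value of \(\sigma \) are illustrative assumptions.

```python
import math


def gaussian_weight(dist, sigma):
    """Gaussian weight of the form in Eq. (14)."""
    return math.exp(-((dist / sigma) ** 2))


def fill_hole(hole_xy, background, sigma):
    """Eq. (12): weighted average of background depths around a hole pixel.

    background is a list of ((x, y), depth) pairs already chosen as
    background neighbors; the selection rule itself is outside this sketch.
    """
    hx, hy = hole_xy
    num = den = 0.0
    for (x, y), depth in background:
        w = gaussian_weight(math.hypot(x - hx, y - hy), sigma)
        num += w * depth
        den += w
    return num / den if den > 0 else 0.0


# Two equidistant background pixels contribute equally.
print(fill_hole((0, 0), [((1, 0), 10.0), ((-1, 0), 20.0)], sigma=2.0))
# A nearby pixel dominates a distant one.
print(fill_hole((0, 0), [((1, 0), 10.0), ((6, 0), 20.0)], sigma=2.0))
```

The second call returns a value close to 10 because the weight of the pixel six samples away is negligible at this \(\sigma \).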

The weights are calculated using Gaussian distribution based on the distance to the current pixel using Eq. (14). Therefore, the farther pixels would have less impact on the calculated depth value. The output of the hole filling algorithm is a low-resolution dense disparity image which needs to be up-sampled to the original size. The up-sampling is performed by applying a trilateral filter which makes the boundaries sharper and corrects the misaligned regions using Eq. (13). The designed filter consists of three terms: depth data, texture data and the curvature.

$$\begin{aligned} d\left( p \right) =\frac{1}{w}\sum d\left( q \right) \cdot f_\mathrm{dep} \left( \parallel {d_p -d_q} \parallel \right) \cdot f_\mathrm{tex} \left( \parallel {C_p -C_q} \parallel \right) \cdot f_\mathrm{curv} \left( \parallel {k_p -k_q} \parallel \right) \end{aligned}$$
(13)

where \(\parallel .\parallel \) is the Euclidean distance between two pixels and d is the disparity value. C and k are the color and curvature values, respectively. f is the Gaussian distribution with standard deviation \(\sigma \) defined by Eq. (14).

$$\begin{aligned} f\left( x \right) =e^{-\left( {\parallel x\parallel /\sigma } \right) ^{2}} \end{aligned}$$
(14)

We also ensure that no new depth values are introduced during the up-sampling. Therefore, when the disparity image is filtered using the trilateral filter, the new depth values are adjusted by mapping them to the nearest depth value which already exists in the disparity image.
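The filter of Eq. (13) and the subsequent adjustment can be sketched for a single output pixel as follows. Scalar (grayscale) color values, separate standard deviations for the three terms, and the normalizer w taken as the sum of the weights are assumptions made for illustration.

```python
import math


def gauss(x, sigma):
    """Gaussian term of the form in Eq. (14)."""
    return math.exp(-((abs(x) / sigma) ** 2))


def trilateral_value(center, neighbors, sig_dep, sig_tex, sig_curv):
    """Eq. (13) for one pixel; items are (disparity, color, curvature) tuples."""
    d_p, c_p, k_p = center
    num = norm = 0.0
    for d_q, c_q, k_q in neighbors:
        w = (
            gauss(d_p - d_q, sig_dep)
            * gauss(c_p - c_q, sig_tex)
            * gauss(k_p - k_q, sig_curv)
        )
        num += w * d_q
        norm += w
    return num / norm if norm > 0 else d_p


def snap_to_existing(value, existing):
    """Map a filtered value to the nearest disparity already in the image."""
    return min(existing, key=lambda d: abs(d - value))


pixel = (12.0, 100.0, 0.1)
support = [(12.0, 101.0, 0.1), (13.0, 99.0, 0.1), (40.0, 220.0, 0.9)]
filtered = trilateral_value(pixel, support, sig_dep=4.0, sig_tex=10.0, sig_curv=0.5)
print(snap_to_existing(filtered, [12.0, 13.0, 40.0]))
```

The third neighbor lies across an edge in depth, color and curvature, so its weight is effectively zero and the filtered value stays between 12 and 13 before being snapped to an existing disparity.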

4 Experimental results

In this section, we present simulation results to show the efficiency of the proposed method compared with state-of-the-art SGM-based stereo matching methods.

Fig. 9 Visual comparison of depth maps generated by different algorithms using Middlebury dataset [36]: a the original test images, b ground truth depth maps, and depth maps obtained by c proposed method, d PlaneFitSGM [40], e gradient-based Census [41]

To evaluate the performance of our depth estimation and refinement algorithms, we used the Middlebury [36], KITTI stereovision benchmark suite [37], and the Daimler dataset [38]. The workstation runs the Windows 7 operating system with Intel Xeon Quad-Core processor and 8 GB RAM.

To evaluate the performance of the proposed stereo matching method, we compute the error statistics with respect to the ground truth depth map. The percentage of bad matching pixels is a common quality measure used to compare the performance of different depth estimation methods.

The performance of the proposed method is compared with the following reference systems. The method proposed in [20] is the original SGM which uses pixelwise matching of mutual information and approximates a global smoothness constraint by combining multiple 1D constraints. The stereo matching proposed in [39] uses a sparse Census mask and the algorithm is implemented on embedded systems. PlaneFitSGM [40] is based on a Census-based cost calculation and performs the disparity optimization on horizontal stripes of the image. Gradient-based Census [41] performs correlation-based stereo matching and is suitable for the deployment in embedded real-time systems.

Four different image pairs from the Middlebury dataset [36] are chosen for the evaluation. Figure 9 compares the quality of depth maps generated by the proposed method and the references. The use of multiple Census window patterns and efficient depth refinement technique by considering the edges of objects makes the proposed method comparable to other methods of SGM family.

Table 1 indicates the error statistics of percentage of bad pixels with respect to the provided ground truth depth map by Middlebury benchmark [36] for different algorithm configurations. The percentage of bad pixels evaluation criterion is defined by Eq. (15).

$$\begin{aligned} P_{bp} =\frac{1}{N}\sum \left( {\left| {d_\mathrm{G} \left( {x,y} \right) -d_\mathrm{GT} \left( {x,y} \right) } \right| >\sigma _d} \right) , \end{aligned}$$
(15)

where \(d_\mathrm{G}\) and \(d_\mathrm{GT}\) are the generated and ground truth depth values, respectively, \(\sigma _d\) is the error tolerance and N is the total number of pixels in the image.
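Equation (15) translates directly into a short function. The sketch below returns a fraction (multiply by 100 for a percentage); the default tolerance of one disparity level is an assumption, since the value of \(\sigma _d\) used in the experiments is not stated here.

```python
def bad_pixel_rate(generated, ground_truth, tol=1.0):
    """Eq. (15): fraction of pixels whose absolute error exceeds the tolerance."""
    total = bad = 0
    for row_g, row_gt in zip(generated, ground_truth):
        for d_g, d_gt in zip(row_g, row_gt):
            total += 1
            if abs(d_g - d_gt) > tol:
                bad += 1
    return bad / total


estimated = [[10, 12], [20, 25]]
reference = [[10, 10], [20, 20]]
print(100 * bad_pixel_rate(estimated, reference))  # percentage of bad pixels
```

Two of the four toy pixels deviate by more than the tolerance, so the rate is 0.5.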

The percentage of correctly matched pixels for different Census-based matching techniques is shown in Fig. 10. Avoiding any new depth value in the refinement step of the proposed method significantly reduces the percentage of bad pixels compared with the methods in [20, 39, 40].

Table 1 Percentage of bad pixels for the Middlebury dataset [36]

In addition, using the Census transform as the matching metric helps find a more reliable match for each pixel and increases the percentage of correctly matched pixels.

Fig. 10 Percentage of correctly matched pixels: comparison between the proposed method and algorithms in [20, 39, 40]

Table 2 compares the processing time of the proposed algorithm with other state-of-the-art techniques. The experiments are done by running the algorithms, implemented in C, on the Middlebury dataset [36] on a CPU.

Table 2 Computational time complexity

Applying adaptive Census window patterns reduces the overall complexity of the algorithm and leads to achieving higher frame rates.

The proposed algorithm has been tested on KITTI dataset [37] which consists of 194 training image pairs and 195 test image pairs. The images have \(1224 \times 370\) pixels resolution. Figure 11 shows the result of depth estimation and refinement for a sample left side image of KITTI dataset [37]. The holes are filled in the refined depth map and objects have much sharper boundaries.

Fig. 11 Proposed method output: a color image and b estimated depth map, and c refined depth map using KITTI dataset [37] (color figure online)

Fig. 12 Hole filling comparison using KITTI dataset [37]: a filling by all the surrounding correct pixels, b only considering background depth pixels

As stated earlier, the proposed depth refinement algorithm uses only neighboring background pixels to fill the holes and also incorporates a depth adjustment stage to ensure that no new depth value replaces a correct depth value in the depth map. Figure 12 compares the results of hole filling, where in Fig. 12b only the background depth pixels are used. Considering only the background pixels for depth refinement avoids several artifacts such as shrinking of depth values around object boundaries (similar to the case around the pedestrian and the trash can in Fig. 12b). One should keep in mind that Fig. 12 only shows the hole filling result without applying edge sharpening.

For a more complex scenario where several pedestrians are present in the scene, it becomes more challenging to derive a clear depth map for each pedestrian individually. Figure 14 shows the result of the proposed depth estimation method on four different images of the KITTI dataset [37] under this challenging situation.

Since the main target application of the proposed method is pedestrian detection systems, we provide additional results based on pedestrian detection rate. We apply our proposed depth estimation method to the stereo-based pedestrian detection algorithm presented in [2]. The proposed method in [2] uses an adaptive window for region of interest (ROI) generation using depth maps. Also, Support Vector Machine (SVM) is used to classify the ROIs into pedestrian and non-pedestrian classes. The Daimler dataset [38] has been used for this part of experimental results. It consists of 21790 image pairs with size of \(640 \times 480\) pixels captured from a stereo vision camera mounted on a vehicle.

To calculate the ROC curve, we plot the detection rate versus the average number of false positives per frame (FPPF) using Eq. (16).

$$\begin{aligned} \mathrm{Detection\, rate}= & {} {\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}},\nonumber \\ \mathrm{FPPF}= & {} {\frac{\mathrm{TFPP}}{\#\,\mathrm{frames}}}, \end{aligned}$$
(16)

where TP, FN and FP are the numbers of true positives, false negatives and false positives, respectively.
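The two quantities in Eq. (16) can be computed as below. The sketch assumes that the numerator TFPP denotes the total number of false positives accumulated over all frames, which is an interpretation and is not stated explicitly; the counts in the example are made up.

```python
def detection_rate(tp, fn):
    """Detection rate of Eq. (16): TP / (TP + FN)."""
    return tp / (tp + fn)


def fppf(total_false_positives, num_frames):
    """False positives per frame; the numerator is assumed to be the total FP count."""
    return total_false_positives / num_frames


# One ROC point from hypothetical counts at a fixed classifier threshold.
print(detection_rate(tp=90, fn=10), fppf(total_false_positives=50, num_frames=200))
```

Sweeping the classifier threshold and recomputing both values at each setting would give the points of one ROC curve.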

The calculated ROC curves are shown in Fig. 13. The detection rates are improved using the proposed method. In other words, for a fixed false-positive probability, we get higher detection rate.

Fig. 13 ROC curve comparison between different algorithms using Daimler dataset [38]

Fig. 14 Depth estimation by the proposed method on real-world stereo images. a Left color images. b Estimated depth map (color figure online)

5 Conclusion

In this paper, a novel depth estimation algorithm has been proposed. The proposed method is based on adaptive window patterns of the Census transform, which make it robust against illumination changes and suitable for applications like advanced driver assistance systems. By down-sampling the reference images, the computational complexity of the whole algorithm is reduced. A modified cross-based cost aggregation technique is proposed that generates cross-shaped support regions for each pixel individually. A stripe-based optimization path is used in finding the best match while avoiding error propagation into unwanted regions. The proposed depth refinement technique aims at filling the holes and sharpening the object boundaries. The background depth pixels are used to fill the holes of the estimated depth map and the proposed trilateral filter is used to enhance the quality of the depth map. Simulation results indicate that the proposed methods fulfill these aims by improving the quality of the generated depth maps and reducing the computational complexity (Fig. 14).