1 Introduction

Detecting the road and its boundaries is fundamental for autonomous vehicles to navigate routes and avoid obstacles. Although various sensors are mounted on a vehicle to help the system perceive the environment, visual sensors, such as video cameras, can provide informative cues at a lower cost. Given a monocular colour image captured by such a camera, the goal of road and road boundary detection is to identify which pixels belong to road areas and which belong to road boundaries.

To detect road areas, segmentation techniques such as [1] are usually employed. Using either traditional classifiers [2, 3] or deep neural networks [4, 5], the category of each pixel can be estimated. However, these methods are not aware of the existence of road boundaries. The boundaries, in turn, are not simply the edges of the detected road areas. In general, road boundaries can be defined as the marks that separate road from non-road areas, such as the curb stones between vehicle and pedestrian paths and the white lines between the road and parking areas. Detecting road boundaries is challenging because they take various forms. Some studies [6, 7] attempt to identify road boundaries using a trained classifier, while others [8, 9] simply refine the contours without knowing their locations. Few studies consider both road detection and road boundary detection in a unified framework.

In this paper, we summarise the relationship between the road and its boundaries using a Bayesian network and propose RBNet to tackle the corresponding detection tasks. A critical observation of our study is that although road boundaries have diversified appearance, they contain abundant structural information that helps define the road areas. We therefore conclude that there exists a contextual connection between road areas and road boundaries, which can be formulated as a graphical probabilistic model. Following this model, RBNet detects both road and road boundaries in a single step. The training procedure of RBNet can be formulated as a multi-task learning problem in which visual features are shared across tasks. After properly training RBNet, we evaluate its effectiveness on the widely used KITTI road benchmark [10] and report the performance on its official website. Favourable performance is achieved by the proposed method against other competitive algorithms. Statistical results also verify the existence of the contextual relationship between road boundaries and road areas in the road scene.

2 Related Work

Vision-based road and road boundary detection methods can be divided into two groups: model-based and learning-based. Model-based methods build a shape model or an appearance model to describe the road. In shape models, the boundaries of the road are commonly represented by curves such as Bézier splines [11] and cubic splines [12], and random sample consensus (RANSAC) is usually used to find the best-fitting parameters of the curves. As an example of an appearance model, [13] describes the road as a linear combination of different colour planes, and the colour distribution of each pixel is used to decide whether it belongs to the road. These model-based methods are accurate when familiar road patterns appear, but they are vulnerable to complicated road scenes. In contrast, learning-based methods mainly adopt a trained classifier to distinguish road areas from non-road areas [3].

With the recent development of deep learning techniques, convolutional neural networks (CNNs) have achieved record-breaking performance in image segmentation tasks. With deep CNNs, road areas can be effectively segmented from images [4, 14]. The major obstacle to improving CNN-based segmentation is that high-level semantic features are too coarse to define the contours or boundaries of an instance. To relieve this boundary issue, the fully convolutional network (FCN) [15] fuses results from low-level feature maps. A similar work [4] facilitates precise localisation by following the “U-net” architecture [16], which consists of a contracting network to capture context and a symmetric expanding network to enrich details. Such architectures help recover spatial details but remain weak on boundaries. In another group of studies, the boundaries of the segments are refined using conditional random fields (CRFs) in a post-processing step. CRF-based methods integrate score maps generated by a CNN with pairwise features [17], and their inference can be carried out efficiently by high-dimensional filtering [8]. CRF-based methods are good at refining contours but are not aware of the existence of the boundaries. More closely related studies detect object boundaries to refine the segmentation. For example, LRR [18] distinguishes boundaries with masking operations and uses Laplacian reconstruction to improve accuracy, while [6] first detects obstacle boundaries with a CNN and then obtains road areas with a graph-cut algorithm. Although these studies actively identify boundaries, the contextual relationship between the detected boundaries and the segmentation results has not been sufficiently studied. Considering this context can also mitigate the label noise problem [24] to some extent.

3 Road and Road Boundary Detection Network

In this section, we first summarise the road and road boundary detection tasks with a unified Bayesian network model that formulates the relationship among the road, the road boundaries and the input image in one probabilistic graph. Following the structure of this Bayesian network, we then introduce a deep neural network, called RBNet, to detect road and road boundaries simultaneously.

3.1 The Bayesian Network Model

Tackling road and road boundary detection separately would be time-consuming and would require carefully designed algorithms to fuse the results for better performance. To relieve this issue, we detect road areas and road boundaries simultaneously by formulating them in a unified model. Specifically, we find that the identification of road areas is influenced not only by local appearance but also by the road boundaries. For example, if a pixel in the image is enclosed by the road boundaries, it can be considered part of the road area regardless of its visual appearance. However, road boundaries cannot be defined directly from the edges of the road areas, because those edges may not always correspond to actual boundaries; for instance, the edges of the road areas may lie on the image border rather than on real road boundaries. Therefore, it is more appropriate to define the road boundaries based on visual appearance alone. Accordingly, we summarise the relationships among road areas, road boundaries and the input image as a Bayesian network, whose structure is illustrated in Fig. 1.

Fig. 1. The Bayesian network for road and road boundary detection. In this model, the road area, the road boundary, and the image are three nodes in the graph. The directed arrows represent the dependencies among the nodes.

Formally, we consider the road, the road boundary and the input image as three nodes in the Bayesian network, and we cast both road detection and road boundary detection as pixel-wise classification. Suppose \(r_{x,y}\) denotes the labelling of the pixel at (x, y) with respect to road areas on the input image I. Then segmenting road areas amounts to acquiring an assignment \(R = \{r_{x,y}\}\) that allocates 1 to in-road pixels and 0 to the rest. Similarly, suppose \(b_{x,y}\) denotes the labelling of the pixel at (x, y), which takes 1 if the pixel belongs to the road boundaries and 0 otherwise. We refer to \(B = \{b_{x,y}\}\) as the classification results for road boundaries on the whole image. According to the graph in Fig. 1, the joint probability of R, B, and I is given by:

$$\begin{aligned} P(R, B, I) = P(R|B, I)P(B|I)P(I) \end{aligned}$$
(1)

Therefore, road detection and road boundary detection can be solved by estimating P(B) and P(R). According to Eq. 1, we have:

$$\begin{aligned} P(B) = \sum _{R\in \{-,+\},\; I\in \{-,+\}} P(R, B, I) \end{aligned}$$
(2)
$$\begin{aligned} P(R) = \sum _{B\in \{-,+\},\; I\in \{-,+\}} P(R, B, I) \end{aligned}$$
(3)

Let \(I^+\) denote that there is a road in the image and \(I^-\) denote the opposite. Since we work on urban road images, we can assume that there is always a road in the camera’s view and thus take \(P(I^+)\) to be 1. As a result, P(B) can be computed by:

$$\begin{aligned} P(B) = P(B|I^+) \end{aligned}$$
(4)
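To make this step explicit (the derivation is left implicit in the text), substituting the factorisation of Eq. 1 into Eq. 2 and marginalising gives:

$$\begin{aligned} P(B) = \sum _{I\in \{-,+\}} P(B|I) P(I) \sum _{R\in \{-,+\}} P(R|B, I) = \sum _{I\in \{-,+\}} P(B|I) P(I) = P(B|I^+) \end{aligned}$$

since \(\sum _{R} P(R|B, I) = 1\), and \(P(I^+) = 1\) implies \(P(I^-) = 0\).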

The road detection task requires computing the marginal probability P(R). Based on Eq. 3, we have:

$$\begin{aligned} P(R) = P(R|B^-, I^+) P(B^-|I^+) + P(R|B^+, I^+) P(B^+|I^+) \end{aligned}$$
(5)

where \(B^+\) denotes the collection of pixels identified as road boundaries and \(B^-\) denotes its complement. Accordingly, \(B=\{B^+, B^-\}\).

3.2 The Deep Neural Network for Road and Road Boundary Detection

As mentioned above, road and road boundary detection can be accomplished by inferring P(B) and P(R) based on Eqs. 4 and 5 respectively. However, empirically estimating these probabilities would be unreliable and arduous, because the dependencies in the Bayesian network can be ambiguous due to environmental noise and complex road scenarios. To obtain a faithful estimation of the probabilities, we employ a deep neural network, RBNet, to learn the dependencies statistically.

To implement RBNet properly, we first decompose the inference over the Bayesian network into several independent sub-tasks and then introduce corresponding task-specific sub-networks to solve them. As a result, the training procedure can be formulated as a multi-task learning problem. In the following sections, we denote by l the loss function for each pixel.

Road Detection. The road detection task can be achieved by inferring P(R) based on Eq. 5. The inference requires the sum of \(P(R|B^+,I^+)P(B^+|I^+)\) and \(P(R|B^-,I^+)P(B^-|I^+)\). Computing \(P(R|B^-,I^+)\) and \(P(R|B^+,I^+)\) is straightforward. On one hand, \(P(R| B^-, I^+)\) stands for the probability that each pixel in the image belongs to the road area when no boundary is detected, which can be interpreted as the common segmentation task. On the other hand, the estimation of \(P(R| B^+, I^+)\) can be regarded as the prediction of road areas based only on the road boundary detection results. However, computing a product such as \(P(R|B^+,I^+)P(B^+|I^+)\) should not be treated as an element-wise multiplication of \(P(R|B^+,I^+)\) and \(P(B^+|I^+)\), because a road boundary pixel may affect road pixels in a larger image region. To infer P(R) properly, we rewrite Eq. 5 in the following form:

$$\begin{aligned} P(R) = P(R|B^-, I^+) + \left( P(R|B^+, I^+) - P(R|B^-, I^+) \right) P(B^+|I^+) \end{aligned}$$
(6)

given the fact that \(P(b^+_{x,y}|I^+) + P(b^-_{x,y}|I^+) = 1\) for \(b_{x,y}^+\in B^+\) and \(b_{x,y}^-\in B^-\). Consequently, inferring P(R) amounts to summing the two terms on the right side of Eq. 6.
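For clarity, the rewriting from Eq. 5 to Eq. 6 is a direct substitution of \(P(B^-|I^+) = 1 - P(B^+|I^+)\):

$$\begin{aligned} P(R)&= P(R|B^-, I^+)\left( 1 - P(B^+|I^+)\right) + P(R|B^+, I^+) P(B^+|I^+) \\&= P(R|B^-, I^+) + \left( P(R|B^+, I^+) - P(R|B^-, I^+)\right) P(B^+|I^+) \end{aligned}$$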

The term \(P(R| B^-, I^+)\) can be obtained directly from the segmentation results. We introduce a sub-network in RBNet to compute this term, whose overall training loss is defined as:

$$\begin{aligned} L_1(\theta _1) = \sum _{x,y} l(r^*_{x,y}, r_{x,y}(\theta _1)) \end{aligned}$$
(7)

where \(r^*_{x,y}\) represents the ground truth at location (x, y) for road detection and \(r_{x,y}(\theta _1)\) denotes the output of the sub-network parameterised by \(\theta _1\) at the same location.
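As an illustration only, here is a minimal PyTorch sketch of the loss in Eq. 7, assuming the segmentation sub-network outputs a two-channel logit map (the paper specifies the multinomial logistic loss but not a concrete implementation):

```python
import torch.nn.functional as F

def road_loss(road_logits, road_gt):
    """Eq. 7: sum of per-pixel multinomial logistic losses.

    road_logits: (N, 2, H, W) scores from the segmentation
                 sub-network parameterised by theta_1.
    road_gt:     (N, H, W) ground-truth labels r*_{x,y} in {0, 1}.
    """
    # cross_entropy applies a per-pixel log-softmax, which matches
    # the multinomial logistic loss l used in the paper.
    return F.cross_entropy(road_logits, road_gt, reduction="sum")
```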

The second term on the right side of Eq. 6 can be viewed as a residual of \(P(R| B^-, I^+)\) when computing P(R). We name it the contextual residual. Its form suggests that it can be computed from the road boundary detection results, \(P(B^+| I^+)\). Using the road labels as supervision, the training loss of the sub-network that predicts the contextual residual is defined as:

$$\begin{aligned} L_2(\theta _2) = \sum _{x,y} l\left( r^*_{x,y} , r_{x,y}\left( \theta _2, B^+(\theta _3)\right) + r_{x,y}(\theta _1)\right) \end{aligned}$$
(8)

where \(B^+(\theta _3)\) represents \(P(B^+|I^+)\) as estimated by the road boundary detection sub-network parameterised by \(\theta _3\), and \(\theta _2\) denotes the parameters of the sub-network that computes the contextual residual.
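A hedged sketch of how the contextual residual of Eq. 8 could be wired up; the class name and kernel size are illustrative assumptions, since the paper only states that larger kernels are used to capture broader context:

```python
import torch.nn as nn

class ContextualResidual(nn.Module):
    """Predicts the residual term of Eq. 6 from the boundary map."""

    def __init__(self, kernel_size=9):  # large kernel: an assumed value
        super().__init__()
        # A boundary pixel influences road pixels over a broad
        # neighbourhood, hence the large receptive field.
        self.conv = nn.Conv2d(1, 2, kernel_size, padding=kernel_size // 2)

    def forward(self, boundary_prob, road_logits):
        residual = self.conv(boundary_prob)  # r(theta_2, B+(theta_3))
        return road_logits + residual        # argument of l in Eq. 8
```

The refined map is then scored against the road labels with the same per-pixel loss, giving \(L_2\).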

Road Boundary Detection. Based on Eq. 4, P(B) can be directly inferred from \(P(B|I^+)\). Suppose the ground truth of the road boundaries is given by \(\{b^*_{x,y}\}\); the loss for training this sub-network is:

$$\begin{aligned} L_3(\theta _3) = \sum _{x,y} l(b^*_{x,y}, b_{x,y}(\theta _3)) \end{aligned}$$
(9)

where \(b_{x,y}(\theta _3)\) represents the estimated road boundary label for the pixel at (x, y) from the sub-network parameterised by \(\theta _3\). We manually labelled the road boundaries as supervision, based on the ground truth of the road detection task.

Multi-task Learning. Overall, we formulate the training procedure of RBNet as a multi-task learning problem. Let \(\varTheta \) denote all the parameters of RBNet, so that \(\varTheta = \{\theta _1, \theta _2, \theta _3\}\). The general training loss of RBNet is defined as:

$$\begin{aligned} \textit{Loss}(\varTheta ) =\mu _1 L_1(\theta _1) + \mu _2 L_2(\theta _2) + \mu _3 L_3(\theta _3) \end{aligned}$$
(10)

where \(\mu _i\) represents the loss weight for the corresponding task. Training RBNet can thus be viewed as minimising the overall loss function with respect to \(\varTheta \). Furthermore, to train RBNet in an end-to-end manner, we make the CNN features shareable across tasks, which means that the subset of \(\varTheta \) used to compute the visual features is shared among \(\theta _1\), \(\theta _2\) and \(\theta _3\). Sharing features brings further advantages; for instance, both abstract semantics and fine spatial details can be maintained to ensure good performance.
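Putting the pieces together, a minimal sketch of the weighted loss in Eq. 10 (the output names are illustrative; all three maps are computed from the shared backbone features):

```python
import torch.nn.functional as F

def rbnet_loss(outputs, road_gt, boundary_gt, mu=(1.0, 1.0, 0.1)):
    """Eq. 10: weighted sum of the three task-specific losses."""
    l1 = F.cross_entropy(outputs["road_logits"], road_gt,
                         reduction="sum")                      # Eq. 7
    l2 = F.cross_entropy(outputs["refined_logits"], road_gt,
                         reduction="sum")                      # Eq. 8
    l3 = F.cross_entropy(outputs["boundary_logits"], boundary_gt,
                         reduction="sum")                      # Eq. 9
    return mu[0] * l1 + mu[1] * l2 + mu[2] * l3
```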

Fig. 2. Detailed architecture of RBNet. The blue cubes represent convolution layers. The symbols k, c, and s below the cubes respectively denote the kernel sizes, channel numbers, and strides of the corresponding convolution operations.

Implementation Details. Figure 2 shows the detailed architecture of the proposed RBNet. As illustrated, RBNet involves several steps. First, we use a pre-trained deep convolutional neural network (DCNN) model, which typically has five convolution blocks, to extract visual features. We then adopt a hypercolumn-like architecture, whose details can be found in [19], to fuse and interpret features extracted at different depths. Three task-specific networks follow: the road boundary detection network, the semantic segmentation network, and the contextual residual network.

Specifically, considering its expressive power and efficiency, we adopt the ResNet50 [20] network as the pre-trained DCNN model. The features from the conv2, conv3, conv4 and conv5 blocks are connected to the hypercolumn, followed by two fully convolutional layers. Subsequently, three task-specific sub-networks are employed to fulfil the multi-task learning objective in Eq. 10. In the boundary detection network and the road detection network, we use convolution layers with small kernels. To capture structural context in a broader region, we use larger kernels in the convolution layer that predicts the contextual residual. In our implementation, the loss function l is the multinomial logistic loss.
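To make the data flow concrete, the following PyTorch skeleton mirrors the pipeline of Fig. 2 under stated assumptions: the hypercolumn fusion, head widths, and the large-kernel size are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class RBNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1,
                                  resnet.relu, resnet.maxpool)
        # conv2..conv5 blocks of ResNet50 feed the hypercolumn
        self.blocks = nn.ModuleList([resnet.layer1, resnet.layer2,
                                     resnet.layer3, resnet.layer4])
        fused = 256 + 512 + 1024 + 2048  # hypercolumn channels
        self.fuse = nn.Sequential(       # the two convolution layers
            nn.Conv2d(fused, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
        self.boundary_head = nn.Conv2d(256, 2, 3, padding=1)  # small kernel
        self.road_head = nn.Conv2d(256, 2, 3, padding=1)      # small kernel
        self.residual_head = nn.Conv2d(1, 2, 9, padding=4)    # large kernel

    def forward(self, x):
        f = self.stem(x)
        feats = []
        for block in self.blocks:
            f = block(f)
            feats.append(f)
        size = feats[0].shape[-2:]
        hyper = torch.cat([F.interpolate(t, size=size, mode="bilinear",
                                         align_corners=False)
                           for t in feats], dim=1)
        shared = self.fuse(hyper)              # features shared by all tasks
        boundary_logits = self.boundary_head(shared)
        # P(B+|I+): probability channel of the boundary class
        b_pos = F.softmax(boundary_logits, dim=1)[:, 1:2]
        road_logits = self.road_head(shared)   # scores for P(R|B-,I+)
        refined_logits = road_logits + self.residual_head(b_pos)  # Eq. 6
        return {"boundary_logits": boundary_logits,
                "road_logits": road_logits,
                "refined_logits": refined_logits}
```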

4 Experiment

4.1 Setup

In this section, we comprehensively evaluate the effectiveness of RBNet and compare it with other competing algorithms. Since, under the summarised Bayesian network, road boundary detection benefits road detection, we mainly evaluate performance on the road detection benchmark. To best assess the proposed approach, we conduct the evaluation on the KITTI road benchmark [10], where the results of evaluated methods are made publicly accessible on the official website. The KITTI road detection benchmark divides the images into three sets: urban marked (UM), urban multiple marked lanes (UMM) and urban unmarked (UU). There are in total 289 images for training and 290 images for testing.

Evaluation Metrics. We follow the evaluation metrics of the KITTI road benchmark, as discussed in [21]. The metrics include maximum F1-measure (MaxF), average precision (AP), precision (PRE), recall (REC), false positive rate (FPR), and false negative rate (FNR). The latter four measures are computed at the working point of MaxF. According to KITTI's evaluation protocol, all results are transformed into bird's-eye-view space for evaluation.

Training. While training RBNet, we randomly flip and crop the training images and add small perturbations to the RGB channels of the input data. The input images are resized to a uniform size of \(300\times 900\). The loss weights \(\mu _1\), \(\mu _2\), \(\mu _3\) in Eq. 10 are set to 1, 1, and 0.1 respectively. We train for 100k epochs, and the learning rate is decayed from 0.01 using the “poly” policy. All computation runs on a cluster node (8 cores @ 3.50 GHz, 32 GB RAM) accelerated with a GPU card (NVIDIA Tesla K20c, 5 GB). The overall processing time of RBNet is 0.18 seconds per frame.
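A small sketch of the “poly” decay policy mentioned above; the power of 0.9 is a common convention (e.g. in DeepLab) but is an assumption here, since the paper does not state it:

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """'poly' policy: decay the learning rate from base_lr towards 0."""
    return base_lr * (1.0 - step / float(max_steps)) ** power

# As in the paper, the rate starts at 0.01; e.g. over 100k steps:
# lr = poly_lr(0.01, step, 100_000)
```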

Boundary Refinement. To demonstrate the effect of detecting road boundaries, we use the detected boundaries to refine the detected road areas by eliminating potential false positives. Specifically, we first find the left and right boundaries of the road from the boundary detection results and then refine the confidence score of each pixel according to its location relative to the identified boundaries; pixels outside the boundaries can be treated as false positives. If better road detection results are obtained in this way, the contextual relationship between road and road boundaries is supported.
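One plausible realisation of this refinement; the row-wise scan and the down-weighting factor below are assumptions, as the paper does not give the exact rule:

```python
import numpy as np

def refine_with_boundaries(road_conf, boundary_mask, outside_weight=0.2):
    """Down-weight road confidence outside the detected left/right
    road boundaries (the weight is an illustrative choice).

    road_conf:     (H, W) road confidence scores in [0, 1].
    boundary_mask: (H, W) binary map of detected boundary pixels.
    """
    refined = road_conf.copy()
    for y in range(road_conf.shape[0]):
        cols = np.flatnonzero(boundary_mask[y])
        if cols.size < 2:
            continue  # no boundary pair detected in this row
        left, right = cols.min(), cols.max()
        refined[y, :left] *= outside_weight       # left of the road
        refined[y, right + 1:] *= outside_weight  # right of the road
    return refined
```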

Table 1. Per-category performance. “UM”, “UMM” and “UU” denote the detection tasks for urban marked road, urban multiple marked lanes, and urban unmarked road respectively. “Lane” denotes the ego-lane detection task based on “UM”. Bold indicates the best performance.
Table 2. Overall performance on the KITTI benchmark across all the “UM”, “UMM” and “UU” test sets. Best scores are in bold.

4.2 Results

In this section, we compare RBNet for general road detection against other state-of-the-art methods on the KITTI road benchmark. The compared algorithms include Up_Conv [4], DDN [5], FTP [12], FCN_LC [22], SPRAY [23], and StixelNet [6]. MaxF and AP are the main metrics used for comparison.

Table 1 shows the evaluation results for the different task categories in the KITTI benchmark, namely the MaxF and AP scores for the UM, UMM, and UU roads. The effectiveness of the proposed method is demonstrated by the fact that it achieves the highest MaxF score in every category.

Combining the results of all the “UM”, “UMM” and “UU” roads, the overall performance of the evaluated algorithms is reported in Table 2. In this measurement, the proposed RBNet also outperforms the other algorithms on most criteria, including MaxF and AP, which supports both the correctness of the summarised Bayesian network and the robustness of RBNet for general road detection. Qualitative results on the KITTI benchmark are shown in Fig. 3.

Fig. 3. Qualitative results on the KITTI road detection benchmark. The results are from: (a) UM; (b) UMM; (c) UU; and (d) Lane. The detected road boundaries and the segmented road areas are shown in yellow and green respectively. (Color figure online)

5 Conclusion

In this work, we formulate road detection and road boundary detection as a unified Bayesian network model based on the contextual relationship between road boundaries and road areas in an image. We then propose RBNet to estimate the probabilities of this Bayesian network; RBNet detects road boundaries and road areas in a single processing step. The empirical study on the KITTI benchmark demonstrates the effectiveness of RBNet. In future research, we will accelerate processing to meet real-time demands.