
1 Introduction

Road surface markings (RSMs) and road edges (REs) are among the most important elements for guiding autonomous vehicles (AVs). High-definition maps with accurate RSM information are useful for many applications such as navigation, prediction of upcoming road situations and road maintenance [23]. RSM and RE detection is also vital in the context of pavement distress detection, both to ensure that RSMs are not confused with pavement defects and to eliminate the areas where cracks, potholes and other defects cannot possibly be found [17].

The detection of stationary objects of interest related to roadways is usually addressed by using video streams or still images acquired by digital cameras mounted on a vehicle. For example, Reach-U Ltd., an Estonian company specializing in geographic information systems, location-based solutions and cartography, has developed a high-speed mobile mapping system employing six high-resolution cameras for recording images of roads. The resulting orthoframes, assembled from the recorded panoramic images, are available to the Estonian Road Administration via a web application called EyeVi [16] and are used as the source data in the present study.

RSM extraction methods are often based on intensity thresholds and are therefore sensitive to data quality, including lighting, shadows and RSM wear. RE detection is generally considered even more difficult than lane mark detection because there is no clear boundary between the road and the RE, owing to variations in road and roadside materials and colors [9]. Due to recent advances in deep learning, semantic segmentation of RSMs and REs is increasingly implemented within the deep learning framework [21].

The present study combines RSM and RE detection in a hierarchical convolutional neural network to enhance the method’s predictive ability. On narrow rural roads without RSMs, RE detection alone can provide sufficient information for navigation; on wider highways, however, REs may fall outside the cameras’ field of view and cannot provide sufficient information for lane selection. Simultaneous detection of REs and RSMs can also improve the accuracy of both tasks, e.g., RSMs suggested by the network outside the road can be eliminated either implicitly within the network or explicitly during post-processing.

2 Related Work

In recent years, several studies have tackled tasks of RSM and RE detection for a variety of reasons. For example, [3, 19] employ RSMs for car localization, [1] applies RSMs to detect REs, [2, 4, 6, 26] classify RSMs whereas [7, 9, 12, 20, 21, 27] focus on pixel level segmentation to detect the exact shapes and locations of RSMs.

These works have used different input data. For example, [12, 22, 24] use 3D data collected by LiDAR, [15, 25] use 3D spatial data generated from images captured by a stereo camera, [11] uses radar for detecting metal guardrails and [9, 21, 26] use 2D still images captured by a camera. Radars have an advantage over cameras as they can be deployed regardless of lighting conditions; however, metal guardrails are rarely found on smaller rural roads. LiDARs can also operate under unfavorable lighting conditions that cause over- or under-lit road surfaces, but high quality LiDARs are expensive. Both 2D cameras and 3D (stereo) cameras are adversely affected by shadows and over- or under-lit road surfaces, and while 3D cameras can provide additional information, processing that information requires more computational power. Consequently, 2D cameras are still the most suitable equipment for capturing data for RSM and RE detection, provided that the detection method is “smart” enough to cope with even the most adverse lighting conditions such as hard shadows and over- or under-lit road surfaces. Thus the RSM and RE detection methods must address these issues.

Possible RSM and RE detection methods and techniques include, for example, filtering [10] and neural network (NN) based detection [26]. While filtering was the method of choice in older research, NN based methods have come to dominate recent research. Even though NN based methods have gained popularity, the way NNs are implemented for RSM and RE detection has evolved significantly over the years. Earlier approaches used classifier networks: the segmentation of images was achieved by applying a sliding window classifier over the whole input image. To improve the quality of the predictions, fully convolutional autoencoders (AEs) that can produce a segmented output image directly replaced the sliding window method. However, since the ‘encoding’ part of the AE is subject to data loss, AEs have in turn been superseded by residual networks (e.g., [5, 8]) with direct forward connections between encoder and decoder layers for higher quality image segmentation.

3 Methodology

3.1 Neural Network Design

The proposed architecture consists of three connected neural networks of identical structure (Fig. 1). Each combines the architecture of a symmetrical AE (with a dimension-reducing encoder and an expanding decoder) with those of the residual network (ResNet) [5] and DenseNet [8], using shortcut connections between encoder and decoder layers implemented by feature map concatenation.

Fig. 1. Illustration of sub-neural network design using 2 blocks with 3 convolutional filters

This NN can be further divided into sub-blocks (hereafter blocks) that all have a similar structure. In essence, a block consists of a set of convolutional layers followed by a size transformation layer (either max pooling for encoder or upsampling for decoder). As the NN is symmetric, encoder and decoder have equal number of blocks. Thus a NN with “two blocks” would be a NN that has two encoding and two decoding blocks. Each of these blocks has an equal number of convolutional layers with \(3\times 3\) kernel. After each decoding block, there is a concatenation layer in order to fuse the upsampled feature map with the corresponding convolutional layer’s feature map from the encoder part of the network.

The overall design of the sub-NN is as follows:

  1. Input layer (RGB image)

  2. \(n\times \) encoding block

  3. \(n\times \) decoding block + concatenation layer

  4. flattening convolutional layer

  5. Output layer (gray scale image).

All the convolutional layers (except the last one) use LeakyReLU (\(\alpha = 0.1\)) as the activation function. It has been observed that a large gradient flowing through a more common ReLU neuron can update the weights in such a way that the neuron never activates on any datapoint again; once a ReLU ends up in this state, it is unlikely to recover. LeakyReLU returns a small value when its input is less than zero, giving the neuron a chance to recover [13]. The convolutional layers’ padding is ‘same’, i.e., the height and width of input and output feature maps are the same for each convolutional layer within a block.
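The following is a minimal Keras sketch of one such sub-network, assuming the TensorFlow/Keras library mentioned in Sect. 4.1; the block count, filter counts and the sigmoid output activation are illustrative placeholders, not the values of Table 1.

from tensorflow.keras import layers, Model

def conv_stack(x, filters, n_convs):
    # a set of 3x3 convolutions with 'same' padding and LeakyReLU(0.1)
    for _ in range(n_convs):
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.LeakyReLU(alpha=0.1)(x)
    return x

def build_sub_nn(n_blocks=2, filters=(16, 32), n_convs=3, channels=3):
    inp = layers.Input((224, 224, channels))
    x, skips = inp, []
    for b in range(n_blocks):                    # encoding blocks
        x = conv_stack(x, filters[b], n_convs)
        skips.append(x)                          # source of the shortcut connection
        x = layers.MaxPooling2D(2)(x)
    for b in reversed(range(n_blocks)):          # decoding blocks
        x = conv_stack(x, filters[b], n_convs)   # convolutions before upsampling
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skips[b]])  # fuse with the encoder feature map
    # flattening convolutional layer -> single-channel (gray scale) mask;
    # sigmoid is an assumption consistent with the binary crossentropy loss
    out = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)
    return Model(inp, out)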

The U-Net architecture proposed in [18] utilizes a similar idea; however, the order of layers in the decoder part of their NN is different: U-Net performs upsampling and concatenation before the convolutional layers, whereas the proposed design applies them after.

The number of computations required by a convolutional layer increases as the size of its inputs (the feature maps from the previous layer) increases. The size of a layer’s input is given by Eq. 1, where \(width\) and \(height\) are the width and height of the feature map from the previous layer and \(depth\) is the number of convolutional filters in the previous (convolutional) layer.

$$\begin{aligned} V = width \times height \times depth \end{aligned}$$
(1)

Because U-Net performs upsampling and concatenation before the convolutional layers, the input size (hence the required computational power) is more than four times bigger (Eq. 2) than with the proposed NN. Upsampling with kernel size (2 \(\times \) 2) doubles both height and width, and concatenation further increases depth by adding \(depth_{enc}\) channels from an encoder block to the feature map.

$$\begin{aligned} \begin{array}{ll} V_{\mathtt {UNet}} = width_{\mathtt {UNet}} \times height_{\mathtt {UNet}} \times depth_{\mathtt {UNet}} \\ \qquad \,\, = \left( 2\times width\right) \times \left( 2\times height\right) \times \left( depth+depth_{enc}\right) \\ \end{array} \end{aligned}$$
(2)
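For illustration, with the assumed values \(width = height = 56\) and \(depth = depth_{enc} = 64\) (chosen for this example only, not taken from Table 1), Eq. 1 and Eq. 2 give

$$\begin{aligned} V = 56 \times 56 \times 64 = 200\,704, \qquad V_{\mathtt {UNet}} = \left( 2\times 56\right) \times \left( 2\times 56\right) \times \left( 64+64\right) = 1\,605\,632 = 8\,V, \end{aligned}$$

i.e., a factor of \(4\left( 1 + depth_{enc}/depth\right) \) in general.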

Since RSMs and REs have different characteristics, these two tasks are performed using two different NNs. Both NNs have an RGB input of size (224 \(\times \) 224 \(\times \) 3) and a single-channel (224 \(\times \) 224) output, i.e., a gray scale mask (Fig. 1). One pixel corresponds roughly to 3.37 mm, thus the width and height of the input segment are about 75.42 cm.

The resulting outputs from RE and RSM detection are concatenated with the original input image and fed to a third NN (REFNET) to refine the final results (Fig. 2). Hence, the final structure of the whole NN combines three NNs: two parallel NNs (RENN and RSMNN) and REFNET. REFNET has the same general design as RENN and RSMNN, but the number of blocks and the number of convolutional filters per block differ (Table 1). It must be noted that adding REFNET incurs relatively small overhead because it has a considerably smaller number of trainable parameters compared to RENN and RSMNN, even though REFNET has a higher block count. Table 1 describes the chosen architecture that had the best performance.

Table 1. Neural network parameters
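A hedged sketch of how the three sub-networks might be wired together, reusing the build_sub_nn builder from the earlier sketch; the block and filter counts, and the exact number and meaning of REFNET outputs, follow Fig. 2 and Table 1, which this sketch does not reproduce.

from tensorflow.keras import layers, Model

# Block and filter counts below are placeholders; the values used in the study
# are those of Table 1.
renn   = build_sub_nn(channels=3)              # road edge detector (RENN)
rsmnn  = build_sub_nn(channels=3)              # road surface marking detector (RSMNN)
refnet = build_sub_nn(channels=5)              # refinement network (RGB + 2 masks)

inp      = layers.Input((224, 224, 3))
re_mask  = renn(inp)                           # 224 x 224 x 1 RE mask
rsm_mask = rsmnn(inp)                          # 224 x 224 x 1 RSM mask
refined  = refnet(layers.Concatenate()([inp, re_mask, rsm_mask]))
model    = Model(inp, [re_mask, rsm_mask, refined])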

Both RENN and RSMNN are pre-trained for 20 epochs. Several combinations of blocks and layers were tested and the best performing RENN and RSMNN were chosen. Next, all three NNs – RENN, RSMNN and REFNET – are trained for an additional 60 epochs. The NNs were trained using the Adadelta optimizer with an initial learning rate of \(1.0\) and binary crossentropy as the loss function. Automatic learning rate reduction was applied if the validation loss did not improve in 15 epochs; each reduction halved the current learning rate.
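A sketch of this training schedule with standard Keras components, assuming the model objects from the previous sketch and placeholder data generators train_gen and val_gen:

from tensorflow.keras.optimizers import Adadelta
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate when validation loss has not improved for 15 epochs.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=15)

# Pre-train the two detector networks separately for 20 epochs.
for net in (renn, rsmnn):
    net.compile(optimizer=Adadelta(learning_rate=1.0), loss='binary_crossentropy')
    net.fit(train_gen, validation_data=val_gen, epochs=20, callbacks=[reduce_lr])

# Train the combined RENN + RSMNN + REFNET model for a further 60 epochs.
model.compile(optimizer=Adadelta(learning_rate=1.0), loss='binary_crossentropy')
model.fit(train_gen, validation_data=val_gen, epochs=60, callbacks=[reduce_lr])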

Fig. 2. Overall structure of the final neural network

3.2 Evaluation Methodology

There are several metrics for evaluating segmentation quality. The most common of these is accuracy (Eq. 3), which gives the percentage of pixels in the image that are classified correctly

$$\begin{aligned} A _{cc} = \frac{ TP + TN }{ TP + FP + TN + FN }, \end{aligned}$$
(3)

where TP = True Positives (pixels correctly predicted to belong to the given class), TN = True Negatives (pixels correctly predicted not to belong to the given class), FP = False Positives (pixels falsely predicted to belong to the given class), and FN = False Negatives (pixels falsely predicted not to belong to the given class). In cases where there is a class imbalance in the evaluation data, e.g. TN \(\gg \) TP, FP, FN (which is often the case in practice), accuracy is biased toward an overoptimistic estimate.

For class-imbalanced problems, precision (Eq. 4) and recall (Eq. 5) (showing how many of the predicted positives are actually true positives and how many of the actual positives are predicted as positives, respectively)

$$\begin{aligned} P _r = \frac{ TP }{ TP + FP }, \end{aligned}$$
(4)
$$\begin{aligned} R _c = \frac{ TP }{ TP + FN }, \end{aligned}$$
(5)

or balanced accuracy, which is the average of recall (also known as sensitivity or True Positive Rate) and specificity (also known as True Negative Rate) (Eq. 6)

$$\begin{aligned} Acc_{bal} = \frac{TPR+TNR}{2}=\left( \frac{TP}{TP+FN}+\frac{TN}{TN+FP}\right) \div 2, \end{aligned}$$
(6)

give a better estimate.

In addition, the Jaccard similarity coefficient or Intersection over Union (IoU) (Eq. 7) can be used:

$$\begin{aligned} IoU =\frac{\left| X\bigcap Y\right| }{\left| X\bigcup Y\right| }= \frac{ TP }{ TP + FP + FN }. \end{aligned}$$
(7)

IoU is the area of overlap between the predicted segmentation (X) and the ground truth (Y) divided by the area of union between the predicted segmentation and the ground truth.

For binary or multi-class segmentation problems, the mean IoU is calculated by taking the IoU of each class and averaging them.

$$\begin{aligned} mIoU =n^{-1}\sum ^n_{i=1}IoU_i \end{aligned}$$
(8)

The same applies for the other metrics (Eq. 6, Eq. 4, Eq. 5).
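A small NumPy sketch of these pixel-wise measures for a single binary class; segmentation_metrics is a hypothetical helper for illustration, not the evaluation code used in the study.

import numpy as np

def segmentation_metrics(pred, gt):
    """Pixel-wise metrics for one binary class; pred and gt are boolean masks."""
    tp = np.sum( pred &  gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum( pred & ~gt)
    fn = np.sum(~pred &  gt)
    precision = tp / (tp + fp)                    # Eq. 4
    recall    = tp / (tp + fn)                    # Eq. 5 (sensitivity, TPR)
    acc_bal   = (recall + tn / (tn + fp)) / 2     # Eq. 6
    iou       = tp / (tp + fp + fn)               # Eq. 7
    return precision, recall, acc_bal, iou

# Eq. 8: the mean IoU is simply the average of the per-class IoU values.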

4 Experimental Results

4.1 Setup of the Experiments

The proposed methods are evaluated on a dataset that contains a collection of 314 orthoframe images of Estonian roads that each have a size of 4096 \(\times \) 4096 pixels.

Fig. 3. Image annotation. The annotated RSMs are outlined by red color and the area outside the road masked out by the annotation mask is painted green for illustration. (Color figure online)

The images in the dataset are manually annotated to generate ground truth masks for both RSMs and RE. The annotation is performed by using Computer Vision Annotation Tool (CVAT) (Fig. 3). These vector graphics annotations are then converted into image masks with separate masks for RSM and RE.

As a preliminary step before training and testing the proposed NN, automatic image pre-processing (i.e., white balancing) is applied to each individual input image using the gray world assumption with a saturation threshold of 0.95. The goal of this pre-processing is to mitigate color variance (e.g., due to color differences in sunlight).
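As an illustration, a gray world white balance with a saturation threshold can be applied as follows; the use of OpenCV's xphoto module here is an assumption, any equivalent gray world implementation would do.

import cv2

# Gray world white balancing with a saturation threshold of 0.95:
# near-saturated pixels are ignored when estimating the illuminant.
wb = cv2.xphoto.createGrayworldWB()
wb.setSaturationThreshold(0.95)
balanced = wb.balanceWhite(cv2.imread('orthoframe.png'))  # file name is a placeholder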

Next, 249 of the 314 pre-processed images are used to build the training/validation dataset of image segments. This dataset contains 36497 image segments and each segment has a size of 224 \(\times \) 224 pixels (Fig. 3). It must be noted that not all segments of a pre-processed image are included in the training/validation dataset. Since a large part of the 4096 \(\times \) 4096 pixel image is masked out (depicted by black color in Fig. 3), only segments with 20% or more non-masked pixels are considered. These segments are then divided into two groups: 1) those that include annotations and 2) those that do not. A segment is considered to be annotated if and only if at least 5% of its pixels contain annotations. In order to prevent significant class imbalance in the training data, the two groups (i.e., segments with and without annotations) have to be of equal size. The segments in the two groups do not overlap.
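The following sketch illustrates these selection rules; the 20% and 5% thresholds come from the text above, while the non-overlapping tiling, the mask conventions and all function names are assumptions made for illustration.

import numpy as np

def select_segments(image, road_mask, ann_mask, size=224):
    """Tile one pre-processed orthoframe and keep segments per the rules above."""
    annotated, background = [], []
    h, w = road_mask.shape
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            road = road_mask[y:y + size, x:x + size]
            ann  = ann_mask[y:y + size, x:x + size]
            if np.mean(road > 0) < 0.20:          # require >= 20% non-masked pixels
                continue
            seg = image[y:y + size, x:x + size]
            if np.mean(ann > 0) >= 0.05:          # >= 5% annotated pixels
                annotated.append(seg)
            else:
                background.append(seg)
    n = min(len(annotated), len(background))      # keep the two groups balanced
    return annotated[:n], background[:n]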

Built-in image pre-processing methods of the TensorFlow/Keras library are randomly applied to each 224 \(\times \) 224 \(\times \) 3 segment before training in order to perform data augmentation. These methods include horizontal flip, vertical flip, width shift (\(\pm 5\%\)), height shift (\(\pm 5\%\)) and rotation (<90\(^{\circ }\)).

Fig. 4. Augmentation of input data

Rotation and/or shifting without zooming in can lead to situations where part of the modified segment does not include pixels from the original segment. Therefore, these pre-processing methods are performed using the ‘nearest’ fill method (Fig. 4).
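The corresponding Keras configuration could look as follows; in practice the same random transform has to be applied to the image and to its ground truth mask, which typically requires two synchronized generators (sharing a seed) and is omitted here for brevity.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    horizontal_flip=True,
    vertical_flip=True,
    width_shift_range=0.05,    # +/- 5% width shift
    height_shift_range=0.05,   # +/- 5% height shift
    rotation_range=90,         # rotations of less than 90 degrees
    fill_mode='nearest')       # fill uncovered pixels with the nearest value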

4.2 Image Post-processing and Testing Results

Testing is performed on 65 images which were not included in the training or validation data sets. The images are turned into 224 \(\times \) 224 \(\times \) 3 RGB image segments similarly to the process described in Sect. 4.1. The main difference is that after pre-processing (i.e., white balancing), the testable segments are generated from all segments that are not 100% black. These testable segments are then passed to the NNs and the resulting output of the NNs is re-combined such that input segments that were 100% black are also fully black in the combined 4096 \(\times \) 4096 output image. This process is repeated on each test image three times, each time with a different amount of padding. The three outputs are cropped to the original size and averaged to ensure that the final generated masks are smooth.
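A simplified sketch of this tile-predict-stitch-average procedure; the padding amounts, tiling stride and array conventions are assumptions made for illustration (the study does not state them), and net stands for a trained network that produces a single-channel mask.

import numpy as np

def predict_full(image, net, paddings=(0, 74, 148), size=224):
    """Tile the orthoframe, predict each tile, stitch, crop and average."""
    h, w = image.shape[:2]
    outputs = []
    for p in paddings:
        padded = np.pad(image, ((p, size), (p, size), (0, 0)))   # zero padding
        mask = np.zeros(padded.shape[:2], dtype=np.float32)
        for y in range(0, h + p, size):
            for x in range(0, w + p, size):
                tile = padded[y:y + size, x:x + size]
                if tile.max() == 0:                     # skip 100% black segments
                    continue
                mask[y:y + size, x:x + size] = net.predict(tile[None])[0, ..., 0]
        outputs.append(mask[p:p + h, p:p + w])          # crop back to original size
    return np.mean(outputs, axis=0)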

Before the final evaluation, the re-combined outputs undergo image post-processing. First, thresholding using Otsu’s method [14] is applied to both the RE and RSM images to produce binary images (Fig. 5). Next, holes are filled in the RE image. Then RE contours are detected and small objects are filtered out based on their area. Next, the refined RE image is used as a mask to rule out false positive RSMs. Finally, RSM contours are detected and small objects are filtered out based on their area (Fig. 6).
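A sketch of this post-processing chain with OpenCV and SciPy; the minimum object area and the 8-bit scaling are illustrative assumptions, not values reported in the study.

import cv2
import numpy as np
from scipy import ndimage

def drop_small_objects(binary, min_area):
    # keep only contours whose area is at least min_area
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    keep = [c for c in contours if cv2.contourArea(c) >= min_area]
    out = np.zeros_like(binary)
    cv2.drawContours(out, keep, -1, 255, thickness=cv2.FILLED)
    return out

def post_process(re_prob, rsm_prob, min_area=500):
    """Otsu thresholding, hole filling, area filtering and RE-based RSM masking."""
    _, re_bin = cv2.threshold((re_prob * 255).astype(np.uint8), 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    _, rsm_bin = cv2.threshold((rsm_prob * 255).astype(np.uint8), 0, 255,
                               cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    re_bin = ndimage.binary_fill_holes(re_bin > 0).astype(np.uint8) * 255
    re_bin = drop_small_objects(re_bin, min_area)          # refined RE image
    rsm_bin = cv2.bitwise_and(rsm_bin, re_bin)             # rule out RSMs outside RE
    rsm_bin = drop_small_objects(rsm_bin, min_area)
    return re_bin, rsm_bin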

Fig. 5. Road surface markings and road edges masks before applying Otsu’s method (left and middle, respectively) and the combined result after post-processing (right).

The numbers of pixels classified as true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are determined over all test images (Table 2). Based on the resulting figures, recall (\(R_c\)), precision (\(P_r\)), balanced accuracy (\(ACC_{bal}\)) and IoU are calculated and given in Table 3. The figures in Table 3 imply that RE detection has somewhat better quality than RSM detection. This is, however, deceptive: because of the smaller overall size of RSMs (TP + FN in Table 2), minor detection imperfections result in a greater degradation of the performance indices.

Fig. 6. Near-perfect detection of road surface markings (outlined by blue color) and road edges (pale red). (Color figure online)

Fig. 7. Detection of road surface markings (outlined by blue color) and road edges (pale red) in presence of hard shadows. (Color figure online)

In contrast to the performance measures over all test images in Table 3, Table 4 provides a performance analysis on a per-orthoframe basis. First, performance measures were calculated for each image separately. Next, statistical measures such as the average, minimum, maximum value and standard deviation (\(\sigma \)) were calculated. These additional characteristics show that RE detection accuracy is in fact less consistent and, in particular, more sensitive to hard shadows, as Fig. 7 demonstrates. The standard deviation (\(\sigma \)) for RSM detection is lower in all categories compared to RE detection.

Table 2. Pixel classification after post-processing
Table 3. Detection performance indices after post-processing
Table 4. Detailed measures of per image results

This was expected because RE detection is considered to be the harder problem to solve. The detection performance of RENN did not improve when the NN size (block and/or convolutional layer count) was increased. However, adding contextual information (RSMNN) increased the proposed method’s predictive ability for both RE and RSM detection when the outputs of RENN and RSMNN were combined in REFNET.

5 Conclusion

In this study, we proposed a convolutional neural network-powered method for concurrent road edge (RE) and road surface marking (RSM) detection from orthoframes, even under severely adverse conditions such as bad lighting (shadows, over-lit road surface, etc.) and RSM wear. The IoU measures (88% and 84% for RE and RSM detection, respectively) indicate that the network performs well in most conditions; however, hard shadows, cast either by trees or buildings alongside the road, present a problem, particularly for RE detection.

In future works, we intend to research how to improve the predictive ability of the method by using increased contextual awareness either by incorporating data about the neighboring image segments or by using the whole orthoframe image as input.