
1 Introduction

Road extraction in high-resolution remote sensing images aims at detecting and segmenting road pixels. It amounts to judging each pixel as road or non-road and is usually treated as a binary classification problem. Roads are an integral part of vehicle navigation, city planning, geographic information updating, and so on. At present, the task of road extraction mainly comprises road surface detection [1] and road centerline extraction [2, 3]. The former extracts all road pixels, while the latter labels only the skeletons of roads, which are used to provide directions. Some methods also extract the road surface and centerline simultaneously [4].

In the field of high-resolution remote sensing road extraction, numerous methods have been proposed in recent years. Early methods extract low-level features (e.g., edges, corners, gradients) and define heuristic rules (e.g., geometrical shape) to classify pixels as road or non-road. S. Hinz and A. Baumgartner combined road and context information, including radiometric and geometric information, to extract roads [5]. In [6, 7], uniform areas with characteristic shapes or geometric properties were first detected in the image, and a region-growing technique was then used to generate the road map. The problem with these methods is that the features and rules they rely on suit only simple scenes, whereas roads in high-resolution remote sensing images are complex and irregular.

Several methods have applied deep learning to road extraction in high-resolution remote sensing images. Y. Zhang et al. used multi-source data and multiple features to improve the accuracy of road extraction [8]. Y. Wei et al. designed a road structural loss function to constrain road edges [9]. G. Máttyus et al. inferred the correct roads from an initial segmentation result [10]. F. Bastani et al. designed an iterative graph construction method to output the road map [11]. Recent methods adopt the idea of semantic segmentation [12,13,14,15] and take roads as foreground and non-roads as background. Z. Zhang et al. combined the strengths of residual learning and U-Net [16] to extract roads [17]. Y. Xu et al. fused attention mechanisms into DenseNet to capture local and global road information simultaneously [18]. L. Zhou et al. used a larger receptive field to preserve detailed information and thus obtained better road extraction results [19]. Although deep learning-based methods have made progress, they still suffer from incoherent road edges. These issues are mainly caused by edge interference, including shadows of roadside trees or buildings and vehicles on the roads, which can be observed in high-resolution remote sensing images. To solve these issues, a novel high-resolution remote sensing road extraction (RSRE) method is proposed to refine road topology information. In addition to increasing the receptive field to keep context information, RSRE also considers the spatial relations of road pixels in an image. These spatial relations help to learn the topology of roads with weak coherence in high-resolution remote sensing images. Therefore, RSRE can alleviate the incoherence issues that occur in many existing deep learning-based methods.

In this paper, RSRE focuses on refining road topology information in high-resolution remote sensing images. Topology information refinement means maintaining the shape, structure, and connections of roads throughout the whole image. Based on the encoder–decoder architecture often used in semantic segmentation networks, RSRE inserts a dilation module (DM) and a message module (MM) between the encoder and decoder to enhance the connectivity of road edges. Dilated convolutions in DM increase the receptive field to keep detailed context information, while slice-by-slice convolutions in MM enable message passing across rows and columns of the image to capture spatial relations of pixels. After the encoder extracts image features, DM and MM reprocess these features before they are passed to the decoder. Finally, RSRE uses a sigmoid layer and a threshold value to output road maps. Furthermore, a new loss function is proposed so that RSRE does not favor the non-road class, which contains most of the pixels in an image. Experimental results show that RSRE performs well on both the DeepGlobe Road dataset [20] and the Massachusetts Road dataset [21].

2 Method

2.1 RSRE Architecture

The original high-resolution remote sensing images are large, and roads usually span the whole image with natural properties such as topology and connectivity. Therefore, RSRE takes a 1024 × 1024 high-resolution image as input to reduce the loss of detail caused by cropping and generates a road map with refined road topology information and better road connectivity. As shown in Fig. 1, RSRE has an encoder–decoder structure and combines low-level detail information with high-level semantic information to extract roads from high-resolution images.

Fig. 1

RSRE architecture. Symmetrical blocks represent features with the same size and number of channels. The expression \(n^{2} \times c\) below a block means that the feature size is \(n \times n\) and the number of channels is c. RSRE has an encoder–decoder structure; the center part, consisting of DM and MM, is the core of RSRE

There are three parts in RSRE: the encoder, center, and decoder. Like the architecture of D-LinkNet, the encoder extracts feature maps from the input high-resolution remote sensing image and uses ResNet34 [22] pretrained on the ImageNet [23] dataset. The center of RSRE fuses the features reprocessed by DM and MM to keep the topology information of roads. The decoder uses transposed convolution layers [24] for up-sampling and restores the resolution from 32 × 32 to 1024 × 1024. Finally, RSRE uses a sigmoid layer and a threshold value to output road maps: pixels whose sigmoid output probability is larger than 0.5 are considered road, while the others are considered non-road.
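The following is a minimal PyTorch sketch of this encoder–center–decoder pipeline, not the released implementation: the exact decoder widths, the additive fusion of the DM and MM outputs, and the placeholder `dm`/`mm` modules (sketched in Sect. 2.2) are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models


class RSRE(nn.Module):
    """Encoder-center-decoder skeleton; the DM and MM modules are passed in."""

    def __init__(self, dm: nn.Module, mm: nn.Module):
        super().__init__()
        resnet = models.resnet34(pretrained=True)             # encoder pretrained on ImageNet
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        self.dm, self.mm = dm, mm                              # center part (Sect. 2.2)
        # Decoder: five transposed-conv stages restore 32x32 back to 1024x1024.
        channels = [512, 256, 128, 64, 32, 16]
        self.decoder = nn.Sequential(
            *[nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                            nn.ReLU(inplace=True))
              for cin, cout in zip(channels[:-1], channels[1:])],
            nn.Conv2d(channels[-1], 1, kernel_size=1),
        )

    def forward(self, x):                                      # x: (N, 3, 1024, 1024)
        feat = self.encoder(x)                                 # (N, 512, 32, 32)
        feat = self.dm(feat) + self.mm(feat)                   # fuse center branches (sum assumed)
        return torch.sigmoid(self.decoder(feat))               # per-pixel road probability
```

Thresholding the returned probabilities at 0.5, as described above, yields the binary road map.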

2.2 DM and MM

The center part refines the topology structure of roads in the high-resolution remote sensing image. The high-dimensional hidden-layer features are selected as the input of the center part because of their rich information. This part is composed of two parallel operations: DM and MM. DM increases the receptive field without reducing the feature resolution through dilated convolution layers [25] with series and parallel connections. As shown in Fig. 2, DM stacks the result of each dilation rate, which contributes to capturing multi-scale context.

Fig. 2

DM architecture. DM contains dilated convolutions with series and parallel connections. The expression \(n^{2} \times c\) on a feature block means that the size is \(n \times n\) and the number of channels is c. The parameter r represents the dilation rate, and f denotes the receptive field
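A minimal sketch of such a dilation module is given below; the dilation rates (1, 2, 4, 8) and the summation-based fusion are assumptions in the spirit of Fig. 2 and D-LinkNet, not the exact configuration.

```python
import torch.nn as nn


class DilationModule(nn.Module):
    """Cascaded 3x3 dilated convolutions whose intermediate outputs are summed."""

    def __init__(self, channels=512, rates=(1, 2, 4, 8)):
        super().__init__()
        # padding = dilation rate keeps the 32x32 feature resolution unchanged.
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                          nn.ReLU(inplace=True))
            for r in rates
        )

    def forward(self, x):
        out = x                      # identity (dilation-free) path
        feat = x
        for stage in self.stages:    # series connection: receptive field grows stage by stage
            feat = stage(feat)
            out = out + feat         # parallel connection: stack the result of each rate
        return out
```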

Although DM helps to obtain multi-scale context information by increasing the receptive field, it lacks correlation information between pixels. Because the values of a dilated convolution are computed from pixels in mutually independent rows and columns, these pixels are not correlated with each other. As a result, local information may be lost and continuity becomes poor, which is critical for roads: they are continuous over long distances and have strong spatial relations but weak appearance clues in high-resolution remote sensing images.

In order to solve this potential issue of DM, MM adopts the spatial CNN module [26] from the field of computer vision to enhance the spatial relations of pixels in high-resolution remote sensing images. Through slice-by-slice convolutions within feature maps, it can better propagate spatial information of pixels across rows and columns and thus effectively preserve the topology information of roads with long, thin structures in high-resolution images.

As shown in Fig. 3, MM is also applied to the high-dimensional hidden-layer features. The height, width, and number of channels of the input feature are 32, 32, and 512, respectively. MM slices in four directions: upward, downward, left, and right. Only the downward and upward directions are shown in the figure; the left and right directions are similar. In each direction, the feature is sliced along its height (upward and downward) or width (left and right). The first slice goes through a convolution and a rectified linear unit (ReLU) and is then added to the next slice to form a new slice. This process is repeated sequentially until the last slice is updated, which yields a new feature of size 32 × 32 × 512. The weights of the slice-by-slice convolutions are shared within the same direction and, unlike in spatial CNN, are initialized randomly.

Fig. 3

MM architecture. MM contains slice-by-slice convolutions within features. The expression \(n^{2} \times c\) on a feature means that the size is \(n \times n\) and the number of channels is c. The parameter k is the kernel width used in MM
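The sketch below illustrates one direction (downward) of the slice-by-slice message passing described above; the kernel width k = 9 and the use of a single shared 1 × k convolution per direction are assumptions. The other three directions are analogous and would be applied in sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MessageDown(nn.Module):
    """Propagate information from each row slice to the row below it."""

    def __init__(self, channels=512, k=9):
        super().__init__()
        # 1 x k convolution shared by all slices in this direction.
        self.conv = nn.Conv2d(channels, channels, kernel_size=(1, k),
                              padding=(0, k // 2))

    def forward(self, x):                         # x: (N, C, H, W)
        rows = list(torch.unbind(x, dim=2))       # H slices of shape (N, C, W)
        for i in range(1, len(rows)):
            # Convolve + ReLU the previous slice, then add it to the current one.
            msg = self.conv(rows[i - 1].unsqueeze(2)).squeeze(2)
            rows[i] = rows[i] + F.relu(msg)
        return torch.stack(rows, dim=2)           # back to (N, C, H, W)
```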

2.3 Loss Function

Although roads are distributed over the entire high-resolution image, the imbalance between road pixels and non-road pixels may tilt the training result toward the non-road class, which has more pixels. Therefore, a new loss function is used in this paper, which combines a dice coefficient term (1) and binary cross-entropy (2).

$$l_{\text{dice}} = 1 - \frac{{\sum\nolimits_{n = 1}^{N} {y_{n} f_{\text{w}} \left( {x_{n} } \right)} + m}}{{\sum\nolimits_{n = 1}^{N} {y_{n} + f_{\text{w}} \left( {x_{n} } \right)} + m}}.$$
(1)
$$l_{\text{bce}} = - \frac{1}{N}\sum\limits_{n = 1}^{N} {\left[ {y_{n} \cdot \log f_{\text{w}} \left( {x_{n} } \right) + (1 - y_{n} ) \cdot \log \left( {1 - f_{\text{w}} \left( {x_{n} } \right)} \right)} \right]} .$$
(2)

Here, xn denotes the nth high-resolution remote sensing image, where \(n = 1,2,3, \ldots ,N\) and N is the mini-batch size. yn denotes the ground truth (GT) of image xn, which is a binary map. The expression \(f_{\text{w}} \left( {x_{n} } \right)\) denotes the output of RSRE, where \({\text{w}}\) represents the weights of RSRE that need to be optimized. The parameter m is an adjustable term added to both the numerator and denominator of \(l_{\text{dice}}\).

Loss \(l_{\text{dice}}\) can be regarded as measuring the similarity of road contours between the GT and the prediction \(f_{\text{w}} \left( {x_{n} } \right)\). Loss \(l_{\text{bce}}\) is often used in training semantic segmentation networks. However, in the case of extremely imbalanced data, the cross-entropy loss becomes much smaller than the dice loss after multiple iterations, and the effect of \(l_{\text{bce}}\) is lost. Road extraction can be regarded as a pixel-level recognition task in which only road pixels are viewed as positive samples, so there is a great imbalance between road pixels and non-road pixels. To relieve this imbalance, RSRE combines \(l_{\text{dice}}\) and \(l_{\text{bce}}\) with a weight \(\lambda\) and adjusts m to control the effects of the different losses on the training result. Thus, the final loss function takes the form in (3):

$${\text{loss}}_{\text{w}} = l_{\text{dice}} + \lambda l_{\text{bce}} .$$
(3)

Parameter w denotes the weights of RSRE that need to be updated, and \(\lambda\) is a constant coefficient set manually. By minimizing the loss function, the optimal w is obtained gradually. RSRE uses the Adam optimizer to minimize the loss function.
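A minimal sketch of the combined loss in Eqs. (1)–(3) is given below, using the best setting reported later (m = 0.5, λ = 0.01) as defaults; reading the sums in Eq. (1) as running over all pixels of the mini-batch is an assumption.

```python
import torch.nn.functional as F


def rsre_loss(pred, target, m=0.5, lam=0.01):
    """pred: sigmoid probabilities, target: binary GT; both of shape (N, 1, H, W)."""
    p, y = pred.flatten(), target.flatten()
    l_dice = 1.0 - ((p * y).sum() + m) / ((p + y).sum() + m)   # Eq. (1)
    l_bce = F.binary_cross_entropy(pred, target)                # Eq. (2)
    return l_dice + lam * l_bce                                 # Eq. (3)
```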

3 Experiment

3.1 Datasets

The method is tested on two large datasets. The first is the DeepGlobe Road dataset, in which each image has a resolution of 1024 × 1024 pixels. The image scenes include urban, rural, wilderness, seaside, tropical rainforest, and others. Because only the training images have labels, for the convenience of measuring road extraction accuracy, the experiments divide the 6226 labeled training images into 4358 for training and 1868 for testing.

The second dataset is the Massachusetts Road dataset. The image size is 1500 × 1500 with a resolution of 1.2 m per pixel. In the original training data, some images do not match their labels because, in the original work, the dataset was used to study robustness. In this paper, the mismatched image and GT pairs are first deleted, the images are then center-cropped to 1024 × 1024, and RSRE is trained on the remaining 737 images and tested on 49 images.

3.2 Implementation Details

In the experiments, PyTorch [27] is used as the deep learning framework. In the training phase, the mini-batch size is 16 on 2 GPUs. The learning rate is initially set to 2e−4 and reduced by a factor of 0.1 every 20 epochs. RSRE adopts data augmentation to avoid over-fitting without cross-validation, including horizontal flip, vertical flip, diagonal flip, color jittering, image shifting, and scaling. In the prediction phase, each image undergoes horizontal, vertical, and diagonal flips, so each image is predicted 8 times, and the predicted probabilities are averaged.
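The following sketch shows one possible realization of the optimizer schedule and the 8-fold test-time augmentation described above; the exact flip combinations and the assumption that the model returns per-pixel road probabilities (as in the Sect. 2.1 sketch) are for illustration only.

```python
import torch


def make_optimizer(model):
    """Adam with the schedule described above: lr 2e-4, reduced by 0.1 every 20 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    return optimizer, scheduler


@torch.no_grad()
def predict_tta(model, image):
    """Average 8 flip-augmented predictions; image: (1, 3, 1024, 1024)."""
    probs = []
    for h in (False, True):                    # horizontal flip
        for v in (False, True):                # vertical flip
            for d in (False, True):            # diagonal flip (transpose H and W)
                x = image
                if h:
                    x = torch.flip(x, dims=[3])
                if v:
                    x = torch.flip(x, dims=[2])
                if d:
                    x = x.transpose(2, 3)
                p = model(x)                   # assumed to return road probabilities
                if d:                          # undo the flips on the prediction
                    p = p.transpose(2, 3)
                if v:
                    p = torch.flip(p, dims=[2])
                if h:
                    p = torch.flip(p, dims=[3])
                probs.append(p)
    return torch.stack(probs).mean(dim=0)      # average the 8 predictions
```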

3.3 Result and Analysis

To assess the effectiveness of RSRE in road extraction from high-resolution remote sensing images, the precision (P), recall (R) [28], and F1-score are introduced as follows:

$$P = \frac{\text{TP}}{\text{TP} + \text{FP}},\quad R = \frac{\text{TP}}{\text{TP} + \text{FN}},\quad F_{1} = 2 \times \frac{P \times R}{P + R}.$$
(4)

where TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively. P is the proportion of pixels predicted as road that are truly road, and R is the proportion of road pixels in the image that are correctly detected. The F1-score is the harmonic mean of P and R.
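A small sketch of how these pixel-level metrics can be computed from binary road maps is given below; the epsilon guard against division by zero is an added safeguard.

```python
import numpy as np


def road_metrics(pred, gt, eps=1e-8):
    """pred, gt: binary arrays with 1 = road, 0 = non-road."""
    tp = np.logical_and(pred == 1, gt == 1).sum()   # true positives
    fp = np.logical_and(pred == 1, gt == 0).sum()   # false positives
    fn = np.logical_and(pred == 0, gt == 1).sum()   # false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1
```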

RSRE is compared with U-Net, LinkNet, and D-LinkNet on the two datasets, and the P, R, and F1-score of the road class are reported in Table 1. The architectures of U-Net and LinkNet are modified to accept the 1024 × 1024 high-resolution remote sensing input, and they use cross-entropy as the loss function. As the baseline of RSRE, D-LinkNet only uses DM; it has neither MM nor the constrained loss. The best result of RSRE is obtained with m = 0.5, λ = 0.01. The results show that, after the center feature fusion and the loss constraint, the P and F1-score of road extraction increase. R drops slightly because precision and recall are a pair of contradictory measures: when precision is high, recall is often low.

Table 1 Results of RSRE compared with other methods on the two datasets; the best values of precision (P), recall (R), and F1-score are highlighted in bold

The effect of different weight combinations on the Massachusetts Road dataset is shown in Table 2. As λ decreases, the influence of \(l_{\text{bce}}\) decreases gradually, and both P and the F1-score improve. This shows that constraining the binary cross-entropy can alleviate the poor performance caused by sample imbalance. The parameter m has the best adjustment effect on \(l_{\text{dice}}\) at 0.5. Therefore, RSRE obtains its best result at m = 0.5, λ = 0.01.

Table 2 Results of RSRE with different combinations of \(m, \lambda\) on the Massachusetts Road dataset

To observe the ability of the proposed RSRE, some typical images from the DeepGlobe Road dataset are shown in Fig. 4. RSRE is tested against different backgrounds, including river, building roof, urban, and rural scenes. The compared road maps demonstrate that RSRE can maintain road topology information effectively in the presence of disturbances. The results of D-LinkNet are superior to those of U-Net and LinkNet because of the use of DM, which proves that increasing the receptive field is necessary. Although the river in the image has an approximately linear edge, building roofs may block part of the road edge or pavement and thus destroy the continuity of the road. The connectivity of the roads inside the red circles in the figure is better preserved by RSRE than by the other methods.

Fig. 4

Example results of RSRE and other methods tested on the DeepGlobe Road dataset. From top to bottom, the backgrounds are river, building roof, urban, and rural

Results of RSRE compared with other methods on the Massachusetts Road dataset are shown in Fig. 5. Compared with U-Net, LinkNet, and D-LinkNet, RSRE can detect most of the roads correctly with refined road topology information. By increasing the receptive field and enhancing spatial relations through DM and MM in the center of the network, RSRE keeps rich road information to refine topology in high-resolution remote sensing images. Therefore, when generating the road map, the context and spatial relation information captured by RSRE effectively alleviates the incoherence caused by the shadows of sheltering trees and by vehicles on the roads.

Fig. 5

Example results of RSRE and other methods tested on the Massachusetts Road dataset

4 Summary

In this paper, RSRE extracts roads from high-resolution remote sensing images, paying particular attention to refining road topology information. Through the feature fusion of DM and MM, it refines road topology information and thus effectively preserves the continuity of the long, thin structure of roads. The new combined loss function addresses the imbalance between road and non-road pixels. The results on the two datasets show that RSRE can alleviate the discontinuity and incoherence of roads caused by edge interference, and it performs well in different backgrounds. However, the test results of all methods are not very accurate for very short and thin road sections, and RSRE makes incorrect recognitions in heavily sheltered areas. Therefore, future work will focus on solving these issues.