
1 Introduction

1.1 Background

Remote sensing imagery, both satellite and aerial, contains a wealth of terrain-feature-specific information such as land-cover distribution, building footprints, waterbody extent, and vegetation and forest boundaries. Extracting this feature information without losing its relative context within the image is an important task in remote sensing image processing [1, 2]. Feature extraction is usually performed by identifying a common pattern among pixels and grouping them together, the resulting group of pixels constituting a feature [3]. One of the most crucial requirements for accurate image feature extraction is the preservation of finer spatial details such as edges and corners. Early feature extraction methods were time-consuming and required extensive, expensive human intervention [4]. This was mostly because of the unavailability of higher spatial resolution data in conjunction with the technical infrastructure at the time. However, with advancements in digital image processing systems and the increased availability and accessibility of high spatial resolution data from both satellites and Unmanned Aerial Vehicles (UAVs), image feature extraction has consistently remained one of the most active research topics in remote sensing image processing [5].

1.2 Previous Works

In remote sensing feature extraction, building extraction is one of the most vital areas of research. With applications spread across various pipelines of urban mapping and management, disaster management, change detection, and the maintenance and updating of geodatabases, building extraction has caught the attention of researchers worldwide for developing robust and accurate algorithms to automate the process [6]. Early methods of building extraction were based on applying statistical and morphological operations on individual pixels to group them together [7], hence automating the task to some extent. One of the most persistent issues in building extraction, carried over from early methods to recent ones, is the differentiation of foreground from background as well as building from non-building objects [8]. To differentiate between these, spectral and geometrical cues such as colour, shape and line have been used to extract buildings from very high-resolution imagery [9]. Another study combined distinctive corners while estimating building outlines to extract buildings [10], but was unable to extract irregular-shaped buildings. At the beginning of the last decade, a generic index called the Morphological Building Index (MBI) was introduced to extract buildings from high-resolution satellite imagery based on spectral information [11]. While this method was able to successfully extract buildings with an irregular shape, it failed in shadowy regions and also could not extract buildings located close together (instance extraction). A follow-up study to MBI proposed a Morphological Building/Shadow Index, which defined a building index as well as a shadow index and was specifically aimed at addressing the shortcomings of the MBI method [12].

With the recent availability of powerful computing systems as well as finer-resolution data, artificial intelligence-based deep learning algorithms such as Convolutional Neural Networks (CNNs) are being used extensively for building extraction, given their advantage of hierarchical feature extraction without losing contextual information [13, 14, 15]. In general, a deep learning architecture consists of a network structure with many hidden layers leading to hierarchical feature extraction, thus eliminating the problem of inadequate representation of learning features [16]. Building-A-Nets is an adversarial network for robust extraction of building rooftops. The Multiple Feature Reuse Network (MFRN) is a resource-efficient CNN to detect building edges from high spatial resolution satellite imagery [17]. A special type of pre-trained CNN, called a Fully Convolutional Network (FCN), is also widely used for transfer learning-based building extraction. A few such popular FCNs are VGG-16 [18], ResNet [19], Deeplab [20], DenseNet [21], SegNet [22] and U-Net [23]. Studies specifically on building extraction from UAV images have also increased of late. SegNet and U-Net have been used in an ensemble manner to improve building footprint extraction from high-resolution UAV imagery [24]. Techniques such as dilated spatial pyramid pooling [25], multi-stage multi-task learning [26], and channel attention mechanisms [27] have been used to improve building segmentation accuracy from UAV data. Variants of the U-Net architecture have also been tested for building extraction, and studies indicate that the U-Net is the most suitable for dense image building extraction [15, 28, 29].

1.3 Objective and Summary

In some cases, FCN-based segmentation is visually degraded when building boundaries are blurred [30]. Moreover, high spatial resolution data is generally restricted to three or four spectral channels, which makes it difficult to differentiate buildings from other spatially similar features [24]. To address these issues, this study proposes a deep learning-based segmentation approach that combines a pre-trained FCN with a U-Net trained for building extraction, to extract buildings from high-resolution RGB UAV imagery. The learning of a deep Residual Network (ResNet) trained on the ImageNet dataset is transferred to the segmentation-based FCN U-Net, forming a combined Res-U-Net architecture. In this Res-U-Net, the pre-trained ResNet helps capture more context for features spatially similar to buildings, while the U-Net learns building segmentation based on a unique loss function (discussed in Sect. 3.3) that simultaneously accounts for the crispness as well as the region of a segmented building, hence preventing prediction leakage outside the feature in the case of blurred boundaries. Subsequent sections of the paper discuss the dataset details, the data preparation and training methodology, and the results and their inferences, before concluding the study.

2 Dataset Details

This study uses the Inria Aerial Image Labelling (IAIL) dataset. This dataset contains a total of 360 orthorectified images (180 for training and 180 for testing), each tile covering 1500 m × 1500 m at 30 cm spatial resolution with red, green and blue bands. Each image is of size 5000 × 5000 pixels. Covering an area of 81 km² per city across the US cities of Austin, Chicago and Kitsap County and the Austrian regions of Vienna and West Tyrol, the dataset contains 36 images from each city, with high variance in terms of urban density and building spacing. Moreover, numerous instances of shadowy features and shadowy backgrounds are present, especially in the images from Chicago, US. The ground truth of the training set is provided as a binary feature image with two classes, namely building and non-building. Since ground truth is provided only for the training set of 180 images, we use only those 180 images to train and validate our model. Figure 1 shows the UAV image and its corresponding ground truth as available from the IAIL training set, for each of the five cities.

Fig. 1

Data samples from the IAIL dataset, one from each city a Austin, USA, b Chicago, USA, c Kitsap County, USA, d West Tyrol, Austria, e Vienna, Austria

3 Methodology

3.1 Data Preparation Methodology

A single image is of size 5000 × 5000 pixels. We split it into small data chips of size 224 × 224 pixels in accordance with the proposed network architecture, resulting in 484 chips from a single image. However, a certain number of chips contain few or no buildings, creating a bias in the data which could result in model misfit. To ensure uniformity of the 224 × 224 chips in terms of building content, we further filter the 484 chips using a High Label Filter (Eq. 1), defined as the ratio of the number of labelled pixels to the total number of pixels in a 224 × 224 chip. We apply a threshold of 0.3 in the High Label Filter, which excludes chips with a label density of less than 30% and hence removes the earlier bias in the data. Figure 2 shows the data preparation methodology for a single image. This process is performed for all 180 images and their labels. Passing the 87,120 chips obtained from the 180 images (180 × 484) through the High Label Filter yields 27,164 chips of size 224 × 224. The proposed model is trained and validated on these 27,164 chips, and entire images of size 5000 × 5000 are used for testing.

Fig. 2

Data preparation methodology for a single image

$$HLF=\frac{{\sum }_{i=0}^{224*224}{building\_pixel}_{i}}{{\sum }_{i=0}^{224*224}{image\_pixel}_{i}}$$
(1)
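For illustration, a minimal NumPy sketch of this chipping and High Label Filter step is given below. The function and variable names are hypothetical and not taken from the authors' code; it assumes the image and its binary label mask are already loaded as arrays.

```python
import numpy as np

CHIP = 224            # chip size expected by the network
HLF_THRESHOLD = 0.3   # minimum label density (Eq. 1) to keep a chip

def chip_and_filter(image, label, chip=CHIP, threshold=HLF_THRESHOLD):
    """Split a (H, W, 3) image and its (H, W) binary label into chip x chip tiles
    and keep only tiles whose building-pixel ratio is at least `threshold`."""
    kept_images, kept_labels = [], []
    rows, cols = label.shape[0] // chip, label.shape[1] // chip  # 5000 // 224 = 22, i.e. 484 chips
    for r in range(rows):
        for c in range(cols):
            ys, xs = r * chip, c * chip
            lbl_chip = label[ys:ys + chip, xs:xs + chip]
            hlf = lbl_chip.sum() / float(chip * chip)  # Eq. (1): building pixels / total pixels
            if hlf >= threshold:
                kept_images.append(image[ys:ys + chip, xs:xs + chip, :])
                kept_labels.append(lbl_chip)
    return kept_images, kept_labels
```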

3.2 Network Architecture

In this study, the U-Net architecture is implemented with a dynamic decoder to learn building extraction as a fully convolutional network (FCN). The architecture essentially consists of two major operations: image contraction performed by the encoder and image expansion performed by the decoder (Fig. 3). The encoder is responsible for pooling out the necessary information within each convolution kernel, which is done by max pooling operations. The decoder helps preserve precise local information, such as building edges in blurred images, which is done by upsampling and convolving with transposed kernels. Each step of the encoder is connected to the corresponding inverse step of the decoder using skip connections. The advantages of using a dynamic network are the automatic creation of the decoder based on how the encoder is initialized [31] and the ability to work with almost any patch size [32].

Fig. 3

Proposed Res-U-Net architecture described in terms of U-Net encoders and decoders, along with the pre-trained ResNet34 layers

U-Net, being an end-to-end FCN, can easily be initialized with the weights of a deeper CNN. We initialize the proposed dynamic U-Net architecture with the weights of a ResNet34 trained on ImageNet, forming a Res-U-Net. The proposed Res-U-Net comprises multiple sequential blocks as well as dynamic U-Net blocks initialized with ResNet34. Each encoder-decoder block of the architecture consists of a series of 2D batch normalization and ReLU activations which extract the trainable features from the data. Table 1 shows the specific network architecture of the proposed Res-U-Net. The input to the network is an RGB image of shape (224, 224, 3), from which the network segments buildings and outputs a segmented map of shape (224, 224, 2). Here, the prediction contains two channels: a boolean array holding the discrete building/non-building prediction for every pixel, and a float32 array containing the logit probability score of every pixel being a building. The latter is helpful in refining the results by further pooling the probability scores with bounded functions such as the sigmoid.

Table 1 Specific proposed network architecture with individual layer parameters
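The paper does not name the framework used to build the dynamic U-Net. As one possible realization, the sketch below instantiates a ResNet34-encoded U-Net with the third-party segmentation_models_pytorch package and derives the boolean and probability outputs described above; the package choice and parameter names are assumptions, not the authors' implementation.

```python
import torch
import segmentation_models_pytorch as smp  # assumed third-party package, not named in the paper

# U-Net decoder built on top of an ImageNet pre-trained ResNet34 encoder
model = smp.Unet(
    encoder_name="resnet34",     # encoder weights transferred from ImageNet
    encoder_weights="imagenet",
    in_channels=3,               # RGB chips of shape (224, 224, 3)
    classes=1,                   # one logit per pixel for the building class
)

x = torch.randn(6, 3, 224, 224)  # one batch of 6 chips, as used in training
logits = model(x)                # float32 logit scores, shape (6, 1, 224, 224)
probs = torch.sigmoid(logits)    # bounded probability score per pixel
mask = probs > 0.5               # boolean building / non-building prediction
```

Together, `mask` and `probs` correspond to the two output channels described above.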

3.3 Training the Network

After weight initialization of the proposed Res-U-Net, a transfer learning methodology was used to train it for building extraction. Figure 4 shows the step-by-step training methodology. Out of 27,164 image-label pairs, the network was trained on 23,089 pairs (85%) and validated on the remaining 4075 pairs (15%) of images and their corresponding labels. The network was trained with a batch size of 6 and a patch size of 224 × 224 for 30 epochs, with roughly 1200 batches processed per epoch. Training was cut off based on loss convergence (Fig. 5a). The learning was carried out on nearly 20 million parameters extracted at different layers of the network. The network was optimized with the ADAM optimizer at a learning rate of 0.0001 and a decay rate of 0.9.
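A minimal PyTorch sketch of this training configuration is given below; it is an illustrative routine rather than the authors' code, it assumes a `train_loader` over the 23,089 training chips and the combo loss of Eqs. (2)-(4) defined later in this section, and it interprets the 0.9 decay rate as an exponential learning-rate decay (an assumption).

```python
import torch

def train(model, train_loader, combo_loss, epochs=30, lr=1e-4, decay=0.9):
    """Hypothetical training routine mirroring the reported hyper-parameters:
    batch size 6 (set in the loader), 30 epochs, Adam at lr 1e-4."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=decay)
    for epoch in range(epochs):
        model.train()
        for images, masks in train_loader:    # ~1200 batches of 6 chips per epoch
            optimizer.zero_grad()
            logits = model(images)
            loss = combo_loss(logits, masks)  # BCE + Dice combo loss, Eqs. (2)-(4)
            loss.backward()
            optimizer.step()
        scheduler.step()                      # decay the learning rate once per epoch
    return model
```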

Fig. 4

Network training methodology for building extraction using transfer learning

Fig. 5

a Combo loss variation, b accuracy variation and c IoU variation in 30 epochs of training

A combination of Binary Cross Entropy (BCE) loss (Eq. 2) and Dice loss (Eq. 3) was used to train the network. BCE is a probability distribution-based loss [33] and was used to minimize the entropy between the prediction and the ground truth in terms of buildings as features; it was also helpful in preserving crispness near boundary regions. Dice loss is a region-based, Intersection-over-Union-like metric [34] and was used to maximize the overlap and similarity between the predicted region and the ground-truth feature region. Hence, a combo loss was defined (Eq. 4) which focuses on both boundary and region preservation. Figure 5a shows the loss-based convergence of the model over 30 epochs of training. After training for 30 epochs and processing 36,000 batches, the model converged and was saved, with an overall accuracy of 95.7% and a mean Intersection over Union (IoU) of 0.83.

$${BCE}_{Loss}=-\frac{1}{patchsize}{\sum }_{i=1}^{patchsize}\left[{g}_{i}\times \mathrm{log}\,{p}_{i}+\left(1-{g}_{i}\right)\times \mathrm{log}\left(1-{p}_{i}\right)\right]$$
(2)
$$Dice\,\,Loss=1-\frac{2\times {\sum }_{i=1}^{patchsize}{p}_{i}{g}_{i}}{{\sum }_{i=1}^{patchsize}{p}_{i}^{2}+{\sum }_{i=1}^{patchsize}{g}_{i}^{2}}$$
(3)
$$Combo\,\,Loss= {BCE}_{Loss}+DiceLoss$$
(4)

where g = ground truth mask, p = predicted building probability map and patchsize = number of pixels in a chip (224 × 224)
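A minimal PyTorch sketch of this combo loss, written directly from Eqs. (2)-(4), could look as follows; the small smoothing constant is an implementation detail added for numerical stability and is not part of the paper.

```python
import torch
import torch.nn.functional as F

def combo_loss(logits, target, eps=1e-7):
    """Combo loss of Eq. (4): binary cross entropy (Eq. 2) plus Dice loss (Eq. 3).
    `logits` are raw network outputs; `target` is the binary ground-truth mask (float)."""
    p = torch.sigmoid(logits)                   # predicted building probabilities
    bce = F.binary_cross_entropy(p, target)     # Eq. (2), averaged over the patch
    intersection = (p * target).sum()
    dice = (2.0 * intersection + eps) / ((p ** 2).sum() + (target ** 2).sum() + eps)
    return bce + (1.0 - dice)                   # Eq. (4): BCE loss + Dice loss
```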

4 Results and Discussion

Figure 6 shows the building extraction results for select RGB images from each city of the IAIL dataset. The first column is the input to the model, the second column is the ground truth, the third column is the segmented building map predicted by the model and the fourth column shows the evaluation of the prediction with True Positives (TP) in white, True Negatives (TN) in black, False Positives (FP) in red and False Negatives (FN) in yellow. These are original images of size 5000 × 5000 from the IAIL dataset; predictions are obtained by clipping to chips of 224 × 224, segmenting buildings and then merging back to the original size of 5000 × 5000. Figure 6 covers the different conditions for building extraction, such as the surrounding land-cover classes, urban density and shadows, in each city. Figure 6a, c, f show successful building extraction under high urban density with closely spaced buildings, with rare instance segmentation issues. Figure 6b shows effective building extraction even in shadowy regions; notably, shadows are not falsely classified as buildings, a common challenge in building extraction [12]. Figure 6a, b, f show successful building extraction in the presence of spectrally similar features such as cemented roads and parking lots, as well as spatially similar features such as roads, open grounds and vegetation patches with building-like shapes. The model is also able to segment buildings when the dominant land cover in the image is not urban: Fig. 6d, e contain a large cover of vegetation, and Fig. 6b, e contain a large area of water.

Fig. 6

Select instances of building extraction results from each city of the IAIL dataset. First column is RGB input to the model, the second column is model prediction for building segmentation, third column is ground truth and the fourth column is the evaluation image showing TP (white), TN (black), FP (red) and FN (yellow). a, b From Austin, USA, c from Chicago, USA, d from Kitsap County, USA, e from Tyrol West, Austria, f from Vienna Austria

To quantify the prediction made by the model in terms of binary segmentation, the metrics of accuracy (5), precision (6), recall (7) and F1-score (8) were used. To further perform a feature-based evaluation, object-based metrics such as branching factor (9), miss factor (10), detection percentage (11) and IoU or quality percentage (12) (also popularly known as the Jaccard index) were used. Table 2 shows the metrics for the individual images in Fig. 6.

Table 2 Metrics for individual images of Fig. 6
$$accuracy=\frac{tp+tn}{tp+tn+fp+fn}$$
(5)
$$precision=\frac{tp}{tp+fp}$$
(6)
$$recall=\frac{tp}{tp+fn}$$
(7)
$$f1=2\times \frac{precision\times recall}{precision+recall}$$
(8)
$$branchingFactor=\frac{fp}{tp}$$
(9)
$$missFactor=\frac{fn}{tp}$$
(10)
$$detectionPercentage=100\times \frac{tp}{tp+fn}$$
(11)
$$qualityPercentage/IoU=100\times \frac{tp}{tp+fn+fp}$$
(12)

where tp = True Positive, fp = False Positive, tn = True Negative and fn = False Negative.
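The following NumPy sketch computes Eqs. (5)-(12) from a predicted and a ground-truth binary mask; it is an illustrative reimplementation, not the authors' evaluation code.

```python
import numpy as np

def evaluation_metrics(pred, truth):
    """Pixel- and object-based metrics of Eqs. (5)-(12) from two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    tn = np.logical_and(~pred, ~truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),       # Eq. (5)
        "precision": tp / (tp + fp),                       # Eq. (6)
        "recall": tp / (tp + fn),                          # Eq. (7)
        "f1": 2 * tp / (2 * tp + fp + fn),                 # Eq. (8), equivalent form
        "branching_factor": fp / tp,                       # Eq. (9)
        "miss_factor": fn / tp,                            # Eq. (10)
        "detection_pct": 100 * tp / (tp + fn),             # Eq. (11)
        "iou_pct": 100 * tp / (tp + fn + fp),              # Eq. (12)
    }
```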

Figure 7 shows the city-wise metrics of model validation. Tyrol West and Vienna from the IAIL dataset exhibit highly favourable conditions for building extraction, whereas extracting buildings from Chicago and Kitsap has been the most challenging. This is due to shadowy regions, typically shadows being cast on other buildings. Though the proposed model successfully discriminates between shadowy regions and buildings and avoids shadows as false positives, it faces significant challenges in extracting buildings that lie under shadows. This drastically increases the rate of false negatives, as the model dismisses shadowed buildings as mere shadow regions (Fig. 8a, b). A potential reason for this could be the loss of spectral variance as well as of the spatial distinctness of a building under shadow. Moreover, another isolated issue encountered in a Kitsap image is a patch of waterbody being falsely segmented as building, resulting in a high number of false positives (Fig. 8c). This could be due to multiple reasons, such as spectral similarity of the waterbody area due to turbidity, or saturation of DN values in those areas due to direct sun glint on the sensor. Such instances of shadowed buildings and atypical water areas are prominent in the images from Chicago and Kitsap, and hence the extraction results are lowest for these two cities of the IAIL dataset. Figure 8 shows select instances of buildings under shadows which result in a high number of false negatives.

Fig. 7

City-wise prediction metrics from the IAIL dataset validation part

Fig. 8

Select instances where buildings are covered under shadows, leading to high false negative rate. First column is RGB input to the model, second column is model prediction for building segmentation, third column is ground truth and fourth column is evaluation image showing TP (white), TN (black), FP (red) and FN (yellow). a, b From Chicago, USA, c from Kitsap County

Despite these specific challenges and rare instance segmentation issues, the overall performance of the model on the validation set of 4075 images is highly favourable. The high values of the evaluation metrics, especially IoU, also indicate that the proposed model segments buildings well within the feature edges and that there is no region loss except when the building itself is under a shadow. When compared with other deep learning-based approaches, the proposed model increases the average IoU to 0.80 and the average F1-score to 0.86. Table 3 shows the overall evaluation metrics of the model for the validation set as well as a comparison of those metrics with other studies on the same IAIL dataset.

Table 3 Overall metrics of the proposed approach and their comparison with existing approaches

5 Conclusion

In this research work, building extraction from UAV imagery was explored using deep learning and a transfer learning methodology. A Res-U-Net architecture, consisting of U-Net blocks initialized with pre-trained ResNet34 weights, was used to learn building extraction from the IAIL dataset. The combination of ResNet and U-Net was used in an attempt to overcome the problems of blurred building boundaries and limited spectral resolution in building extraction. Moreover, a combined loss function that accounts for both the building region and the building boundaries was used to train the proposed Res-U-Net. The model was trained and validated on 180 images from five different cities of the US and Austria, which exhibit high variance in terms of urban density and dominant land cover. The proposed model successfully segmented buildings in all cases, with rare instance segmentation issues. Model performance was measured using quantitative confusion-matrix metrics as well as object-based metrics such as branching factor, miss factor and IoU. When comparing these metrics with those of existing deep learning-based methods, highly favourable results were noted. Specific challenges, such as extracting buildings lying under shadow and excluding turbid or sun-glint-affected waterbodies falsely segmented as buildings, were also identified and remain open for research.