1 Introduction

Unmanned aerial vehicles (UAVs) have developed steadily over the last decade as low-cost, light-weight imaging platforms. As a consequence, UAVs have been widely applied to remote sensing tasks, for example traffic monitoring, monitoring of vegetation cover [1, 6, 9, 10, 22, 26], archaeology [4, 12, 20], meteorology [21], volcano monitoring [2], and forest fire monitoring [3]. The large-area remote sensing images obtained using UAVs contain a wealth of ground information, and UAV-based remote sensing systems can provide complex traffic data. As a supplement to traditional traffic devices, these data have begun to be used for vehicle detection. Vehicle detection from remote sensing images has been applied in various fields: images of road networks and of the distribution of vehicles across areas provide information for urban planning and traffic monitoring, and vehicle detection and tracking from aerial images is also an important component of any video monitoring system.

In recent years, numerous algorithms have been proposed to detect and recognise vehicles in aerial images. They mainly include methods based on artificially designed feature models, shallow neural networks, and deep neural networks, including deep belief networks (DBNs) and convolutional neural networks (CNNs).

Methods based on artificially designed feature models mainly detect objects with illumination invariance by exploiting the distinguishable structures and shapes of the objects. Such methods generally present a low recall rate (RR) and a high false alarm rate (FAR). Shallow neural networks rely on simple characteristics and few network layers; they therefore show poor robustness to the displacement and rotation of objects, and high FARs in complex scenes. Deep learning algorithms include DBNs, deep convolutional neural networks (DCNNs), and so on. Deep neural network based vehicle detection has proven to be the most flexible, robust, and precise approach, and is currently one of the best methods for vehicle detection from large-area remote sensing images.

This research proposes a spatial pyramid pooling (SPP) based DCNN for vehicle detection. When the size of the images input into a traditional DCNN changes, stretching or cropping the images results in distortion or information loss. The SPP-based DCNN instead adopts a multi-scale spatial pyramid model to down-sample feature maps of different sizes, thus generating feature vectors of a fixed length. In this way, the network can process original images directly, without stretching- or cropping-induced deformation, thus improving the detection effect.

2 Rapid extraction of candidate objects

Detecting specific objects in remote sensing images requires first discovering the objects of interest within large-area images; applying complex classifiers directly to the original images is impractical. The large-area images therefore need to be segmented so as to reduce the size of the images input into the classifiers and to extract candidate windows containing suspected targets. This study adopts rapid candidate-object extraction based on binarized normed gradients (NG) to obtain such candidate windows.

2.1 Normed gradient features

The binarized-NG-based rapid extraction of candidate objects uses a generic object detection model to locate, rapidly and effectively, all objects to be detected in remote sensing images. Generally speaking, generic objects have independent, well-defined, closed boundaries. On this basis, the NG of an image presents obvious distinctions when the windows of actual objects are resized to a fixed size (e.g., 8 × 8); slight changes in closed boundaries are then reflected in the NG characteristics. To begin with, the input images are normalised to different quantised sizes, and the NG of each resized image is calculated. Then, the values in an 8 × 8 region of the image are taken as the 64-dimensional (64D) NG feature of the corresponding window. Afterwards, as shown in Fig. 1, a 64D linear model is trained to select proposal windows containing targets based on these NG characteristics.

Fig. 1 NG features
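
As a concrete illustration, the following Python sketch computes the 64D NG feature of a single grey-scale candidate window. The OpenCV resizing helper and the clipping value of 255 are assumptions, not details given in the text.

```python
import numpy as np
import cv2  # hypothetical dependency, used only for resizing


def ng_feature(window):
    """64D normed gradient (NG) feature of one candidate window.

    The window is resized to the fixed 8 x 8 size; the gradient norm
    |gx| + |gy| at each of the 64 positions forms the feature vector.
    """
    g = cv2.resize(window, (8, 8)).astype(np.float32)
    gx = np.zeros_like(g)
    gy = np.zeros_like(g)
    gx[:, 1:-1] = g[:, 2:] - g[:, :-2]   # [-1, 0, 1] operator in x
    gy[1:-1, :] = g[2:, :] - g[:-2, :]   # [-1, 0, 1] operator in y
    ng = np.minimum(np.abs(gx) + np.abs(gy), 255.0)  # clipped L1 norm
    return ng.reshape(64)                # 64D NG feature g_l
```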

2.2 General target models for NG learning

Inspired by the fact that human visual systems can perceive objects before recognising them, the 64D NG and its approximate binarisation are input into classifiers. A two-stage cascaded support vector machine (SVM) then scores the characteristics so as to acquire a model for identifying objects in image windows.

Stage 1: the weight w of the classifier is learnt using a linear SVM. To find generic objects in an image, this study pre-defines and quantises the window sizes (scale and length-to-width ratio) used to scan the image. A linear model w ∈ ℝ64 is applied to score the windows:

$$ x_l = \left\langle w, g_l \right\rangle $$
(1)
$$ l = \left( i, x, y \right) $$
(2)

where $x_l$, $g_l$, $l$, $i$, and $(x, y)$ denote the confidence score, the NG feature, the position, the size, and the coordinates of a window, respectively, while $w \in \mathbb{R}^{64}$ denotes the weight of the classifier to be learnt.

Stage 2: linear SVMs are again used to learn the weights v and t of the second-stage classifier. Based on the confidence scores of the candidate windows acquired during Stage 1, a group of proposal windows is selected for each size i using non-maximum suppression (NMS). Windows at some sizes, for example 512 × 512, are unlikely to contain vehicles, so the algorithm screens these windows further:

$$ z_l = v_i \cdot x_l + t_i $$
(3)

where $v_i, t_i \in \mathbb{R}$, with $v_i \in v$ and $t_i \in t$; $v_i$ and $t_i$ denote the learnt coefficient and bias term for each quantised size i, respectively, while $z_l$ stands for the score of the window in Stage 2. The windows with the highest scores are taken as input for the next step.
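
The two-stage scoring of Eqs. (1)-(3) can be summarised in a short sketch. The data layout below is hypothetical, and w, v, and t are assumed to be the already-learnt SVM parameters; ng_feature() is the illustrative helper from Section 2.1.

```python
import numpy as np


def score_windows(w, v, t, windows):
    """Two-stage scoring sketch for candidate windows.

    `windows` is a list of (size_i, x, y, g) tuples, where g is the 64D
    NG feature from ng_feature(). Returns windows sorted by z_l so the
    highest-scoring proposals can be passed to the next step.
    """
    scored = []
    for size_i, x, y, g in windows:
        x_l = float(np.dot(w, g))          # Eq. (1): x_l = <w, g_l>
        z_l = v[size_i] * x_l + t[size_i]  # Eq. (3): per-size calibration
        scored.append((z_l, size_i, x, y))
    return sorted(scored, reverse=True)
```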

3 DCNN

3.1 Pre-processing of maximum NG with multi-thresholds

As end-to-end network structures, DCNNs can process original images directly. Nevertheless, when remote sensing images are disturbed by the environment, for instance when objects are occluded by trees or buildings, the characteristics learnt by the neural network include noise and real information about the objects is lost. To overcome this problem, a pre-processing algorithm based on the maximum NG with multiple thresholds is proposed.

The outline of a vehicle in a remote sensing image carries much of the information that distinguishes vehicles from other objects. In this study, the one-dimensional differential operator [−1, 0, 1] is used to calculate the gradient image in each channel of a colour image. The gradient with the maximum L1 norm at each pixel position is taken as the final gradient of that pixel.
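
A minimal sketch of this maximum-NG computation, assuming an H × W × 3 colour image stored as a NumPy array:

```python
import numpy as np


def max_normed_gradient(img):
    """Per-pixel maximum-L1-norm gradient over the colour channels.

    The [-1, 0, 1] operator gives gx, gy in each channel; at each pixel
    the channel with the largest L1 norm |gx| + |gy| supplies the final
    gradient value.
    """
    img = img.astype(np.float32)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1, :] = img[:, 2:, :] - img[:, :-2, :]
    gy[1:-1, :, :] = img[2:, :, :] - img[:-2, :, :]
    l1 = np.abs(gx) + np.abs(gy)   # L1 norm per channel
    return l1.max(axis=2)          # keep the strongest channel response
```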

In the gradient images obtained using the maximum NG of multi-channel colour images, the edges of objects are more obvious than in images acquired using traditional algorithms. However, when objects differ little from the background, or are disturbed, the edge information in the maximum-NG image is still ambiguous and objects cannot be distinguished effectively. To overcome this problem, the gradient images are processed using a multiple-threshold method so as to enhance originally faint outline information. As shown in Fig. 2, there is no obvious difference between dark cars and the shadowed background, and some cars on the roads are occluded by trees. After processing with the multi-threshold method, the outlines of the dark cars are enhanced, while the texture of the trees is suppressed. In this study, two thresholds (40 and 130) are adopted. Similar multiple-threshold processing is also performed on the grey-scale images.
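
The text does not spell out the exact thresholding operation, so the sketch below takes one plausible reading: each threshold yields its own map in which responses below the threshold are suppressed, so faint outlines survive in the low-threshold map while background texture is removed in the high-threshold map.

```python
import numpy as np


def multi_threshold(grad, thresholds=(40, 130)):
    """Multiple-threshold enhancement of a maximum-NG gradient map.

    A plausible-reading sketch: one output map per threshold, keeping
    responses at or above the threshold and zeroing the rest.
    """
    return [np.where(grad >= t, grad, 0.0) for t in thresholds]
```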

Fig. 2 Multiple threshold images with the maximum NG

3.2 DCNN

The DCNN used in this research contains four convolution layers, four max-pooling down-sampling layers, and three fully-connected layers. Each convolution layer contains multiple convolution kernels, and each kernel outputs a feature map on which the weights of the neurons are shared. After the neurons are processed by the ReLU non-linear activation function, the output characteristics are more robust to small displacements. The number of feature maps in each down-sampling layer equals that of the preceding layer, and the neurons in these maps are connected to local receptive fields of the preceding layer. Sampling-based dimension reduction decreases the number of neurons and thereby extracts higher-level characteristics.

3.3 Convolution layers and feature mapping

For a traditional DCNN, the first four layers are convolution layers, each followed by a max-pooling layer; these pooling layers can be regarded as special convolution layers, and in general they process images with sliding windows. The final layers are fully-connected, and the last one uses an N-dimensional SVM classifier to classify the images, where N denotes the number of categories.

Deep neural networks normally require input images of a fixed size; however, this requirement arises mainly because the fully-connected layers require input vectors of a fixed length. Images of any size can be input into the convolution layers, as these use sliding filters to output feature maps. These maps have roughly the same length-to-width ratio as the input images and encode both response intensity and spatial position.

The process of using convolution layers to generate feature maps is similar to the way traditional methods [32] produce characteristic images. Traditional methods extract characteristics using scale-invariant feature transform (SIFT) vectors [23] or image blocks. These characteristics are then coded through vector quantisation [8, 25], sparse coding, Fisher kernels, etc. The coded characteristics comprise several characteristic images and are pooled using Bag-of-Words (BoW) or spatial pyramids. Deep convolution characteristics can be pooled similarly.

3.4 Multi-scale SPP layers

Also called spatial pyramid matching, SPP [5, 7, 8], an extended version of the BoW model [11], is one of the most successful methods used in computer vision. It partitions an image into progressively finer levels and aggregates local features within them. Before the great success of CNNs, SPP was a key component of competitive classification [13, 15, 17] and detection [18] systems. By combining SPP with the network's pooling operations, this research enables neural networks to process input images of different sizes.

Images of any size can be input into the convolution layers of a DCNN to generate feature maps of corresponding sizes. The fully-connected layers, however, need input vectors of fixed length, which can be generated through pooling, as in the BoW method. SPP, which improves on BoW, obtains information from local spatial bins through pooling. For input images of any size, the size of the spatial bins scales with that of the image, while their number is fixed. This differs from the sliding windows used in traditional deep networks, whose number depends on the size of the input images.

To make the deep network adapt to input images of any size, multi-scale SPP layers are used in place of the final max-pooling layer, as shown in Fig. 3. Within each spatial bin, the responses of the filters are pooled. If there are M bins, the outputs of the SPP layer are kM-dimensional vectors of fixed length, where k denotes the number of filters in the last convolution layer. These fixed-dimension vectors are the inputs to the fully-connected layers.
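
A minimal sketch of such an SPP layer, written with PyTorch (an assumption; the paper does not name a framework) and using max pooling within each bin, the standard SPP-net choice:

```python
import torch
import torch.nn.functional as F


def spatial_pyramid_pool(x, levels=(1, 2, 3)):
    """SPP over feature maps of any size.

    x: (N, k, H, W) responses of the last convolution layer. Each level
    l pools into a fixed l x l grid of spatial bins whose size scales
    with H and W, so the output is always kM-dimensional, M = sum(l*l).
    """
    n = x.size(0)
    pooled = [F.adaptive_max_pool2d(x, l).reshape(n, -1) for l in levels]
    return torch.cat(pooled, dim=1)
```

With k = 128 filters and levels (1, 2, 3), M = 14 and the output has 128 × 14 = 1792 dimensions, matching the network described in Section 4, regardless of the input image size.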

Fig. 3 The structure of a network containing SPP layers

4 Experimental results and analysis

The database used in this study is the DLR Munich Vehicle dataset provided by the Remote Sensing Technology Institute of the German Aerospace Centre [14, 16]. The images were acquired over Munich using the DLR 3K camera system. Owing to the disturbance of various factors, such as roads, streets, trees, and similar-looking objects, the urban environment is far more complex than rural areas, making vehicle detection in this database challenging.

The aerial images in the Munich Vehicle dataset are optical images with a resolution of 5616 × 3744 pixels, obtained from 1000 m above the ground using Canon EOS 1Ds Mark III cameras (focal length 50 mm) installed on an aircraft. The images have a ground sampling distance of 13 cm and are saved in JPEG format. The database contains 20 images in total, half used for training while the rest form the test dataset. The training images contain 3418 cars and 54 trucks as positive samples, while the test dataset includes 5928 vehicles.

To measure the performance of the DCNN algorithm, this research adopts the FAR, the precision rate (PR), and the RR as evaluation metrics. They are defined as follows:

$$ \begin{cases} \mathrm{FAR} = \dfrac{\text{number of false alarms}}{\text{number of vehicles}} \times 100\% \\[4pt] \mathrm{PR} = \dfrac{\text{number of detected vehicles}}{\text{number of detected objects}} \times 100\% \\[4pt] \mathrm{RR} = \dfrac{\text{number of detected vehicles}}{\text{number of vehicles}} \times 100\% \end{cases} $$
(4)

The lower the FAR, the fewer background windows are falsely regarded as vehicles; a higher PR means that a larger proportion of the detected objects are actually vehicles; and a higher RR means that more of the true vehicles are detected. Accordingly, the algorithm aims for the lowest possible FAR together with the highest possible PR and RR.
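
Eq. (4) translates directly into code; the sketch below assumes that false alarms are detected objects which are not vehicles, which follows from the definitions above.

```python
def detection_metrics(n_detected_vehicles, n_detected_objects, n_vehicles):
    """FAR, PR, and RR of Eq. (4), returned as percentages."""
    n_false_alarms = n_detected_objects - n_detected_vehicles
    far = 100.0 * n_false_alarms / n_vehicles
    pr = 100.0 * n_detected_vehicles / n_detected_objects
    rr = 100.0 * n_detected_vehicles / n_vehicles
    return far, pr, rr
```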

Based on the size distribution of vehicles in the DLR Munich Vehicle dataset, the binarized-NG-based candidate object extraction uses candidate windows with sizes of 32, 48, 64, 80, 96, 112, and 128 pixels. After normalisation, these windows measure 8 × 8. After the windows are scored by the ranking SVM in Stage 1, approximate scores for neighbouring sizes (within ±1 quantised size) are calculated under NMS using the parameters Nw = 3 and Ng = 4. As shown in Table 1, with 50,000 candidate windows the binarized-NG-based rapid extraction algorithm achieves a detection rate (DR) of 98.6 %, whereas the DR of the sliding-window algorithm is only 16.3 %, far lower than that of the proposed algorithm.
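
The paper does not give the overlap criterion used in its NMS step, so the greedy sketch below assumes the usual intersection-over-union test.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)


def nms(windows, iou_threshold=0.5):
    """Greedy non-maximum suppression sketch.

    windows: list of (score, x1, y1, x2, y2). Windows overlapping an
    already-kept, higher-scoring window by more than iou_threshold
    are discarded.
    """
    kept = []
    for w in sorted(windows, reverse=True):
        if all(iou(w[1:], k[1:]) <= iou_threshold for k in kept):
            kept.append(w)
    return kept
```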

Table 1 Window number and DR based on binarized NGs

The DCNN designed in this study for vehicle detection consists of four convolution layers and four max-pooling layers in series. The images input into the network are the maximum-NG maps with the multiple thresholds 40 and 130, respectively. The same thresholds are also applied to the original gradient images and to produce the multiple-threshold grey-scale images. Together with the original grey-scale image, a total of six images are input to the network.

Taking an input image measuring 64 × 64 as an example, the structure of the network is as follows. The kernel of the first convolution layer measures 7 × 7 with a stride of 1, generating 84 feature maps measuring 58 × 58; max-pooling is adopted in the first pooling layer, with a 3 × 3 template and a stride of 2. With a 5 × 5 kernel and a stride of 1, the second convolution layer produces 96 feature maps measuring 24 × 24; the second pooling layer uses max-pooling with the same template and stride as the first. The third convolution layer, with a 3 × 3 kernel and a stride of 1, generates 128 feature maps, each measuring 10 × 10; the third pooling layer adopts max-pooling for overlapped sampling, with a 2 × 2 template and a stride of 1. The fourth convolution layer, with a 3 × 3 kernel and a stride of 1, generates 128 feature maps measuring 7 × 7, while three pyramid pooling grids with sizes of 1 × 1, 2 × 2, and 3 × 3 are used in the fourth pooling layer. Each of the 128 feature maps thus contributes 14 dimensions, giving 1792 dimensions in total. The first and second fully-connected layers output 1024 and 256 dimensions, respectively; during training these two layers are learnt using drop-out, and a 2-dimensional SVM classifier output judges whether or not a detected object is a vehicle. For the other input images, the structure and parameters of the network are identical except for the sizes of the generated feature maps, so they are not described here.
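
The description above can be collected into a network sketch. PyTorch, the rounding of pooled map sizes, and the final linear layer standing in for the SVM classifier are all assumptions; because of the SPP layer, the 1792-dimensional fully-connected input does not depend on the exact map sizes, which may differ by a pixel or two from those quoted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SppVehicleNet(nn.Module):
    """Sketch of the described detector: four convolution layers, three
    max-pooling layers, an SPP layer in place of the fourth pooling
    layer, and three fully-connected layers."""

    def __init__(self, in_channels=6):             # six input images (see above)
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 84, kernel_size=7, stride=1)
        self.conv2 = nn.Conv2d(84, 96, kernel_size=5, stride=1)
        self.conv3 = nn.Conv2d(96, 128, kernel_size=3, stride=1)
        self.conv4 = nn.Conv2d(128, 128, kernel_size=3, stride=1)
        self.pool12 = nn.MaxPool2d(3, stride=2)     # pooling layers 1 and 2
        self.pool3 = nn.MaxPool2d(2, stride=1)      # overlapped sampling
        self.fc1 = nn.Linear(128 * (1 + 4 + 9), 1024)  # SPP output: 1792D
        self.fc2 = nn.Linear(1024, 256)
        self.fc3 = nn.Linear(256, 2)                # stand-in for the SVM output
        self.drop = nn.Dropout()                    # drop-out on both FC layers

    def forward(self, x):
        x = self.pool12(F.relu(self.conv1(x)))
        x = self.pool12(F.relu(self.conv2(x)))
        x = self.pool3(F.relu(self.conv3(x)))
        x = F.relu(self.conv4(x))
        x = spatial_pyramid_pool(x, levels=(1, 2, 3))  # sketch in Section 3.4
        x = self.drop(F.relu(self.fc1(x)))
        x = self.drop(F.relu(self.fc2(x)))
        return self.fc3(x)
```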

For comparison, the detection results of several other methods on the DLR Munich Vehicle dataset are also listed. These algorithms are: a general DCNN, histograms of oriented gradients with an SVM (HOG + SVM) [19], local binary patterns with an SVM (LBP + SVM), Adaboost, and MVC [24]. For these algorithms, grey-scale images are used as input. For the HOG features, nine orientation bins are adopted and the input grey-scale images measure 64 × 64; they are divided into 1 × 1 + 2 × 2 + 3 × 3 + 4 × 4 + 5 × 5 = 55 blocks, so the HOG feature has 55 × 9 = 495 dimensions. For the LBP features, with P = 8 and R = 1.5 along with 58 uniform patterns and one non-uniform pattern, the feature dimension is 59 × 55 = 3245. The kernel function of the SVMs is a radial basis function. Five kinds of Haar feature and 2000 stumps are used in Adaboost. The detection results are given in Table 2 and Fig. 4; Table 3 and Fig. 5 show the PR of the MVC algorithm [16]. Since the MVC algorithm uses the same data as this study, Table 3 and Fig. 5 only show RR and PR results.

Table 2 The RRs and FARs of vehicle detection using SPP-based DCNN
Fig. 4 The RRs and FARs of vehicle detection using the SPP-based DCNN

Table 3 The RRs and PRs of vehicle detection using SPP-based DCNN
Fig. 5 The RRs and PRs of vehicle detection using the SPP-based DCNN

The experiments indicate that the detection rate of the proposed algorithm is improved over those of traditional algorithms, including HOG + SVM, LBP + SVM, Adaboost, MVC, and the general DCNN. When the RR is fixed at 95 %, the multi-scale spatial pyramid network achieves a detection rate of 92.9 % and a false detection rate of 19.8 %, whereas the detection rate and false detection rate of the general DCNN are 80.5 % and 34.7 %, respectively. This is because multi-scale spatial pyramids reduce over-fitting in the network; meanwhile, they extract the characteristics of objects at different resolutions, giving a better detection effect for input images of different scales.

The impact of key parameters on the performance of the algorithm is also analysed. Since the PR of the SPP-based DCNN is already very high when the RR is 95 %, the FAR is used as the main evaluation criterion. Table 4 and Fig. 6 show a false alarm rate of 22.8 % when a single-scale image is taken as input. The FAR increases because fewer characteristics can be extracted from a single-scale image than from multi-scale images.

Table 4 The RRs and FARs of vehicle detection using a single-scale image
Fig. 6 The RRs and FARs of vehicle detection using a single-scale image

The effect of the gradient images with multiple thresholds is tested as well. As shown in Table 5 and Fig. 7, when the RR is fixed at 95 % and a grey-scale image is taken as input, the network shows a false alarm rate of 45.5 %. Gradient images with multiple thresholds provide more information within certain grey-level ranges, and the lack of this information leads to an increased FAR. When the RR is fixed at 95 % and grey-scale images with rotation are taken as input, the network shows a false alarm rate of 41.1 %, lower than for grey-scale images without rotation, because rotated images help the DCNN obtain richer features; however, this is still poorer than the performance obtained with the multiple-threshold gradient images. Figure 8 shows the detection results.

Table 5 The RRs and FARs of vehicle detection using grey-scale images with rotation
Fig. 7 The RRs and FARs of vehicle detection using grey-scale images with rotation

Fig. 8 The detection effect of the DCNN

5 Conclusions

This study investigates the detection of vehicles from remote sensing images in the light of the characteristics of aerial remote sensing images obtained using UAVs. By using a multi-scale SPP-based DCNN to detect images of different sizes, the detection effect is improved. Although the detection algorithm presents favourable generalisation capability, more universal algorithms still need to be studied to increase its generality by reducing the number of pre-processing steps required. Moreover, the background of objects in remote sensing images is generally complex, as objects in city scenes are disturbed by streets, trees, architectural shadows, similar-looking objects, etc. It is therefore necessary to investigate more robust characteristics so as to improve the detection effect of the algorithm in complex environments.