
1 Introduction

Tremendous progress has been made in the last few years with respect to aerial scene representation learning - from datasets (DeepGlobe [7], xView [19], DOTA [37], SkyScapes [4]) to better network architectures (RA-FCN [25], ROI transformer [9], SCRDet [38]). Most of these approaches are developed for either object detection or semantic segmentation. Recently, there has been growing interest in entity counting via regression [1, 2, 22], as opposed to the commonly used detection-based counting; regression is appealing because it requires comparatively fewer parameters while achieving similar, if not better, accuracy, especially for crowded scenes. However, regression-based entity counting has been explored mostly using ground imagery [20, 22, 24, 28]. In this paper, we focus on entity (e.g., vehicle) counting from overhead imagery without relying on any localization information.

More specifically, we try to answer the question: what can we do to improve feature representations pretrained on ground imagery, e.g., ImageNet, for aerial vehicle counting? Following the same line as Aich and Stavness [1], we start by fine-tuning a pretrained VGG-16 network [34] on the PUCPR+ and CARPK datasets and use it as our baseline. Singh et al. showed that a network trained with self-supervised semantic inpainting was able to outperform its ImageNet-pretrained counterpart on aerial semantic segmentation by learning domain-specific features [35]. Hence, it is possible to use self-supervision for improved feature learning in the aerial domain. However, PUCPR+ and CARPK contain only 100 and 989 training images respectively, making it difficult to perform full-fledged self-supervision from scratch. Moreover, since background categories such as vegetation, roads and buildings dominate the content of these datasets, the likelihood of learning vehicle-specific features is lower than that of background categories.

In order to learn meaningful features and cope with the scarcity of vehicle-specific samples within the datasets, we propose to adapt representations pretrained on ground images, under the premise that features learned from ground images capturing color, texture and edges can be reused for entity counting in aerial images. We investigate two alternative approaches for feature adaptation: 1) proxy self-supervision tasks: we apply the self-supervision tasks to the ImageNet-pretrained VGG-16 network instead of its randomly initialized version (Sect. 3.1), and 2) network modifications: we experiment with squeeze-and-excitation blocks and active rotating filters as means for feature re-calibration (Sect. 3.2). We achieve better performance than the state of the art in regression-based entity counting. Compared to detection-based approaches, our methods are more appealing since they achieve counting while bypassing precise localization, which requires more complex network architectures and large amounts of annotations to support training.

2 Related Work

Vehicle counting has been tackled previously using various approaches - from object detection [3, 5, 13, 14] to regression and matching [1, 2, 22]. The former set of approaches involves designing complex networks with extensive hyperparameter search (for example, over anchor scales, anchor ratios and learning rates), while the latter requires the training to be structured carefully for the network to capture the data distribution. Hsieh et al. released two datasets captured from a drone - PUCPR+ and CARPK - with bounding box annotations, and used spatially regularized constraints to improve localization performance [6, 14]. Goldman et al. proposed a soft-IoU (intersection over union) layer as a third head of the RPN detector, alongside the object score and coordinates, to help resolve densely packed object detections [13]. Amato et al. adapted YOLOv3 for aerial detection by jointly training the ImageNet-pretrained layers and the dataset-specific layers to maximize the use of both [3, 8, 32]. Cai et al. proposed the Guided Attention Network (GA-Net), which consists of foreground and background attention blocks learned explicitly to extract discriminative features within the imagery. They also proposed a new data augmentation method that switches between different times of the day using brightness adjustment and Perlin noise, leading to a considerable boost in detection performance [5, 29].

Aich and Stavness proposed the first approach for one-look regression on these datasets - they combined count regression with heatmap regulation of the network [1]. In their approach, the network is trained with two loss functions - an L1 loss on the count and a smooth L1 loss that matches the corresponding class activation map to the ground-truth object locations rendered as Gaussians. We denote the networks trained with and without heatmap regulation as VGG-GAP-HR and VGG-GAP in Table 1, respectively. Aich and Stavness further replaced the global average pooling layer at the end of VGG-16's convolutional backbone with a global sum pooling layer to achieve resolution invariance [2]. Lu et al. formulated counting as a template matching problem by learning a density map prediction over samples from the ImageNet-VID dataset [22, 33]. They matched a single snapshot of the object of interest against the whole frame, training the network with a weighted L2 loss between the output density map and the ground-truth locations (similar to [1]). They used domain adapters to transfer from the ImageNet-VID dataset to the CARPK dataset for vehicle counting [31].

Self-supervised Learning. Self-supervised learning has attracted a lot of research interest [18, 21, 35] in the computer vision community, as it extracts supervisory signals directly from the design of pretext tasks instead of relying on extensive labeling. Kolesnikov, Zhai and Beyer [18] explored the quality of representations learned from rotation [12], exemplar [11], relative patch location [10] and jigsaw [26] tasks using the ImageNet [8] and Places205 [39] datasets. Singh et al. improved the performance of a ResNet-18 network trained from scratch by adding a self-supervised semantic inpainting loss, wherein the network is forced to learn overhead-specific features to correctly fill in the masked-out regions [35]. Liu et al. [21] improved the performance of crowd counting networks by leveraging unlabeled data with a ranking loss - given two areas sampled from an image in a concentric manner, the network has to ensure that the count predicted for the smaller area is no larger than the count predicted for the larger area.

3 Methodology

We establish a baseline for vehicle counting by removing the last set of convolutional layers from a VGG-16 network pretrained on ImageNet and retrofitting a single fully connected layer that predicts the final count (VGG-GAP [1]). Our focus, herein, is to improve the baseline performance by re-calibrating the features learned from ground imagery towards vehicle counting in aerial imagery. To this end, we describe the two distinct approaches we investigated (summarized in Fig. 1). The first, a data-driven or indirect scheme, encourages suitable features to be learned by introducing proxy self-supervised training. The second directly selects or imposes suitable feature properties by introducing additional network layers. While self-supervision has been widely studied as an unsupervised representation learning method for various downstream tasks, its application to adapting features from ground imagery to aerial imagery has not been explored at large, especially in scenarios with sparse annotation. Moreover, the effectiveness of indirect adaptation via self-supervision and direct adaptation via network modification has not been thoroughly investigated and compared in the literature. We address these two issues in this paper.
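For concreteness, the baseline can be sketched as follows. The exact layer index at which VGG-16 is truncated and the regressor head shown here are illustrative assumptions rather than the published configuration.

```python
import torch.nn as nn
from torchvision.models import vgg16

class VGGGAPCounter(nn.Module):
    def __init__(self, pretrained=True):
        super().__init__()
        backbone = vgg16(pretrained=pretrained).features
        # Keep VGG-16 up to (and including) the 4th max pooling layer,
        # i.e. drop the last set of convolutional layers (layer index is an assumption).
        self.features = nn.Sequential(*list(backbone.children())[:24])
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.fc = nn.Linear(512, 1)          # single FC layer predicting the count

    def forward(self, x):
        f = self.features(x)                 # (B, 512, H/16, W/16)
        v = self.gap(f).flatten(1)           # (B, 512)
        return self.fc(v).squeeze(1)         # (B,) predicted counts
```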

Fig. 1.

An overview of all the methods investigated in this paper: (a) rotation invariance on sampled crops, (b) jigsaw solver, (c) semantic inpainting, (d) squeeze-and-excitation block, (e) active rotating filters.

3.1 Proxy Self-supervision Tasks

Rotation Invariance (RotNet): Proposed by Gidaris et al., who create four copies of a single image by rotating it (implemented via transpose and flip operations) and then train a convolutional network to predict the geometric transform applied to the image relative to its original setting [12]. This helps the network learn informative features, focus on the most salient object in the scene, and gauge its default appearance. However, we cannot directly apply this task to aerial imagery, as there is no de-facto default appearance: for example, cars can appear with the front facing north or south, and both are plausible settings. Hence, we modify the task and minimize the loss shown in Fig. 1(a):

$$\begin{aligned} loss(X_i,\theta ) = - \frac{1}{K}\sum _{y=1}^{K} \log F^y( g(X_i|y), X_i | \theta ), \end{aligned}$$
(1)

where \(X_i\) is an image sampled from the dataset and \(\{g(\cdot |y)\}_{y=1}^{K}\) applies the geometric transformation with label y to image \(X_i\). \(F^y(\cdot )\) and \(\theta \) denote the predicted probability for label y and the learnable parameters of model F, respectively. We convert the problem into a siamese formulation - the network receives an image and its rotated version as inputs and is tasked with predicting the rotation used to generate the rotated input. Following [12], we use 0, 90, 180, and 270\(^\circ \) as the options for \(g(\cdot )\) and discuss the rest of the implementation details in Sect. 4.2.
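A minimal PyTorch sketch of this siamese rotation-prediction proxy is given below; the two-branch head, its dimensions, and the patch handling are illustrative assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotationSiamese(nn.Module):
    def __init__(self, backbone, feat_dim=512, num_rotations=4):
        super().__init__()
        self.backbone = backbone                 # shared truncated VGG-16 features
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(2 * feat_dim, num_rotations)  # predict the relative rotation

    def forward(self, x, x_rot):
        f1 = self.gap(self.backbone(x)).flatten(1)
        f2 = self.gap(self.backbone(x_rot)).flatten(1)
        return self.head(torch.cat([f1, f2], dim=1))  # logits over {0, 90, 180, 270} degrees

def rotation_loss(model, patches):
    # patches: (B, C, H, W) square crops; each is rotated by a random multiple of 90 degrees
    y = torch.randint(0, 4, (patches.size(0),), device=patches.device)
    rotated = torch.stack([torch.rot90(p, int(k), dims=(1, 2)) for p, k in zip(patches, y)])
    logits = model(patches, rotated)
    return F.cross_entropy(logits, y)  # cross-entropy realizes the log term of Eq. 1
```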

Jigsaw Solver: Proposed by Noroozi and Favaro to learn contextual representations by training a convolutional network to solve jigsaw puzzles [26]. This task encourages the network to learn discriminative features, as it has to place a randomly shuffled set of K patches (K = 9 by default) in the correct order. In practice, we implement this approach by minimizing the loss shown in Fig. 1(b):

$$\begin{aligned} loss(X_i,\theta ) = - \frac{1}{K}\sum _{y=1}^{K} \log F^y( g(X_i|y) | \theta ), \end{aligned}$$
(2)

where \(\{g(\cdot |y)\}_{y=1}^{K}\) splits the image \(X_i\) according to the tile configuration \(y=(A_1,A_2,\dots ,A_9)\). The original paper used a subset of 1000 permutations selected for high Hamming distance; we use 15 of those permutations in our approach, as we did not find that using more permutations offered a good trade-off between training time and network performance.
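The sketch below illustrates the jigsaw proxy under these choices (3 \(\times \) 3 tiles, 15 permutations); the permutation set and the classification head are placeholders rather than the exact published configuration.

```python
import torch
import torch.nn as nn

# Placeholder permutation set; the authors draw 15 high-Hamming-distance
# permutations from the set proposed in [26].
PERMUTATIONS = [torch.randperm(9).tolist() for _ in range(15)]

def make_jigsaw(img, perm, grid=3):
    # img: (C, H, W) with H and W divisible by `grid`
    _, h, w = img.shape
    th, tw = h // grid, w // grid
    tiles = [img[:, i*th:(i+1)*th, j*tw:(j+1)*tw]
             for i in range(grid) for j in range(grid)]
    return torch.stack([tiles[p] for p in perm])      # (9, C, th, tw) shuffled tiles

class JigsawNet(nn.Module):
    def __init__(self, backbone, feat_dim=512, num_perms=15):
        super().__init__()
        self.backbone = backbone                      # shared truncated VGG-16 features
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(9 * feat_dim, num_perms)

    def forward(self, tiles):                         # tiles: (B, 9, C, th, tw)
        b = tiles.size(0)
        f = self.gap(self.backbone(tiles.flatten(0, 1))).flatten(1)   # (B*9, feat_dim)
        return self.head(f.view(b, -1))               # logits over the permutation index
```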

Semantic Inpainting: We use least squares generative adversarial networks (LSGAN) with a perceptual loss to learn the task of filling in holes placed within an image (Fig. 1(c)) [16, 23]. Our motivation for this task is to encourage features based on strong contextual and relational information - for example, if there are pixels to be filled around a red car, how do we make the network aware of what color should fill the corresponding pixels? Instead of random masks, we use a fixed \(4\times 4\) hole grid centered within a \(12\times 12\) mask, replicated across the image - unlike other datasets that contain a wide distribution of images, PUCPR+ and CARPK have a relatively fixed area of focus, so we use fixed masks with extensive image rotations to capture more variance.
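A hedged sketch of the LSGAN inpainting objective is shown below; the mask construction, loss weighting, and perceptual feature extractor are assumptions, and the generator and discriminator architectures are left abstract.

```python
import torch
import torch.nn.functional as F

def lsgan_inpainting_step(G, D, vgg_feats, img, mask, lambda_perc=0.1):
    # img: (B, 3, H, W); mask: (B, 1, H, W) with 1 marking the held-out pixels
    corrupted = img * (1 - mask)
    filled = G(corrupted)

    # Least-squares GAN objectives: D pushes real -> 1 and fake -> 0, G pushes fake -> 1
    real_score, fake_score = D(img), D(filled.detach())
    d_loss = 0.5 * (F.mse_loss(real_score, torch.ones_like(real_score)) +
                    F.mse_loss(fake_score, torch.zeros_like(fake_score)))

    gen_score = D(filled)
    g_adv = 0.5 * F.mse_loss(gen_score, torch.ones_like(gen_score))
    g_rec = F.l1_loss(filled * mask, img * mask)            # reconstruct the masked region
    g_perc = F.mse_loss(vgg_feats(filled), vgg_feats(img))  # perceptual consistency
    return g_adv + g_rec + lambda_perc * g_perc, d_loss
```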

3.2 Network Modifications

Squeeze and Excitation Blocks: Introduced by Hu et al., these blocks perform feature re-calibration by adaptively weighting each channel of the feature maps (Fig. 1(d)) [15]. We hypothesize that not all ImageNet-learned features contribute to aerial imagery, and hence apply SE blocks as channel attention over the features for adaptation. Assuming \(\mathbf{v} _{H \times W \times C}\) is the output of a convolutional block, where H, W, and C represent the height, width, and number of channels, respectively, the squeeze operation applies a global average pooling layer to aggregate the channel-wise responses \(\mathbf{z} = \{ \mathbf{z} _c \}\) as

$$\begin{aligned} \mathbf{z} _c = \mathbf {F}_{squeeze}(\mathbf{v} _c) = \frac{1}{H\times W}\sum _{i=1}^{H} \sum _{j=1}^{W} \mathbf{v} _c(i,j), \end{aligned}$$
(3)

where i, j, c are the indices for height, width, and channel, respectively. The squeezed representation \(\mathbf{z} \) is passed through two fully connected layers parameterized by \(W_1 \in \mathbb {R}^{\frac{C}{r} \times C}\) and \(W_2 \in \mathbb {R}^{C \times \frac{C}{r}}\) to compute the inter-channel dependencies in the excitation operation as

$$\begin{aligned} \mathbf {s} = \mathbf {F}_{excite}(\mathbf {z}, W_1, W_2) = \sigma (W_2 \delta (W_1\mathbf {z})), \end{aligned}$$
(4)

where \(\delta \), \(\sigma \) and r represent the ReLU non-linearity, the sigmoid activation and the reduction ratio, respectively. Finally, the initial features are scaled by the inter-channel weights to obtain the re-calibrated features \(\widetilde{\mathbf {v}} = \mathbf {s} \cdot \mathbf {{v}}\). Based on our experiments, we insert the SE block after the last max pooling layer and before the last convolutional layer, and choose \(r=2\), to maximize the prediction performance while minimizing changes to the network architecture.
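The SE block of Eqs. 3-4 can be written compactly as below, with \(r=2\) and 512 channels matching the feature maps at our insertion point (the channel count is an assumption based on the truncated VGG-16 backbone).

```python
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels=512, r=2):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)       # Eq. 3: channel-wise global average
        self.excite = nn.Sequential(                 # Eq. 4: W1, ReLU, W2, sigmoid
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, v):                            # v: (B, C, H, W)
        b, c = v.shape[:2]
        s = self.excite(self.squeeze(v).view(b, c))  # (B, C) channel weights
        return v * s.view(b, c, 1, 1)                # re-calibrated features
```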

Active Rotating Filters: Proposed by Zhou et al. and further developed by Wang et al. to produce rotation-invariant filters [36, 40]. Active Rotating Filters (ARFs) generate feature maps with orientation channels - during the convolution, each filter rotates internally and produces feature maps that capture the receptive field layout from K different orientations (for example, K = 4 \(\rightarrow \) 0, 90, 180, and 270\(^\circ \) - Fig. 1(e)). This improves the generalization capacity of the network to orientations that have not been seen before, with significantly less need for data augmentation; hence, ARFs are a naturally viable candidate for aerial imagery, where objects do not follow a default orientation. To assimilate the gathered orientation information, Zhou et al. proposed ORAlign, which calculates the dominant orientation and aligns the features towards it [40]. Wang et al. developed this further into S-ORAlign, using concepts from SE blocks and fixing the backpropagation to work with a constant learning rate [36]. We adopt ARFs with S-ORAlign in our approach for feature adaptation, imposing the desired orientation invariance for improved performance in vehicle counting.
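As a rough illustration of the idea (not the ORN/S-ORAlign implementation of [36, 40]), the sketch below rotates a shared filter bank to K = 4 orientations with torch.rot90 and keeps the dominant orientation response; the published layers additionally rotate filters internally at finer angles and realign the orientation channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleARFConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)

    def forward(self, x):
        # Convolve with the same filter bank rotated to K = 4 orientations;
        # the results are stacked as orientation channels.
        outs = [F.conv2d(x, torch.rot90(self.weight, r, dims=(2, 3)),
                         padding=self.weight.shape[-1] // 2)
                for r in range(4)]
        return torch.stack(outs, dim=2)              # (B, out_ch, 4, H, W)

def oralign_max(feat):
    # Crude stand-in for ORAlign/S-ORAlign: keep the dominant orientation response.
    return feat.max(dim=2).values                    # (B, out_ch, H, W)
```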

4 Experiments and Results

4.1 Datasets

The PUCPR+ dataset contains images of a parking lot captured from a high altitude at a slanted view. It is a subset of the PUCPR dataset [6] and has images under different weather conditions including sunny, cloudy and overcast. The dataset contains 100 training images and 25 test images. The number of car instances varies from zero to 331 in the training set and from one to 328 in the test set. The CARPK dataset was released along with the PUCPR+ dataset in [14]. It is the first large-scale aerial dataset for vehicle counting in parking lots under diverse location and weather conditions. It contains 989 training images and 459 test images. The number of car instances varies from one to 87 in the training set and from two to 188 in the test set. CARPK differs from PUCPR+ in two ways: 1) it has a diverse location setting, whereas the images in PUCPR+ overlook the same region at all times, and 2) it has a more complex count distribution. The images in both datasets are at 720 \(\times \) 1280 resolution.

4.2 Experimental Settings

We use PyTorch to evaluate all proposed approaches on the PUCPR+ and CARPK datasets [14, 27]. We drop the last set of convolutional layers from the VGG-16 network following previous works [1, 14], under the presumption that this retains just enough downsampling for all vehicles in the scene to remain perceivable in the last feature map.

We downsample the images by a factor of 2 (\(720 \times 1280 \rightarrow 360 \times 640\)) for all experiments, since we observe negligible performance difference between these two resolutions (this is also consistent with the approach adopted by VGG-GAP [1]). We also split off 10% of the training set as a validation set using stratified sampling, so that the error metrics are more informative than with random sampling. We use the validation set for hyperparameter search and final model selection across all epochs. We train our networks on the task of count regression for 30 epochs with a learning rate of \(1e-4\) and then for 20 more epochs at a learning rate of \(1e-5\), with a batch size of 16. Unless mentioned otherwise, we use the Adam optimizer [17] in all our experiments. We apply random horizontal flip, random vertical flip, and color jittering to both datasets. In addition, we observe that the orientation of vehicles in CARPK has more variance compared to PUCPR+. Hence, we add data augmentation in the form of transposing the image to account for more car orientations, which we refer to as transposed augmentation in the following discussion.
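A sketch of this fine-tuning schedule is shown below; the L1 count loss is assumed following [1], and the model construction and data loading are omitted.

```python
import torch
import torch.nn as nn

def finetune_counter(model, train_loader, device="cuda"):
    criterion = nn.L1Loss()   # L1 loss on the count (assumed, following [1])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.to(device).train()
    for epoch in range(50):
        if epoch == 30:                               # 30 epochs at 1e-4, then 20 at 1e-5
            for group in optimizer.param_groups:
                group["lr"] = 1e-5
        for images, counts in train_loader:           # batch size 16
            images, counts = images.to(device), counts.to(device).float()
            optimizer.zero_grad()
            loss = criterion(model(images), counts)
            loss.backward()
            optimizer.step()
    return model
```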

For the proxy self-supervision tasks, we sample 10 random patches of resolution between 72 \(\times \) 72 and 90 \(\times \) 90 per image. For the rotation invariance and jigsaw solver tasks, we train with a batch size of 50 for 30 epochs, using an initial learning rate of \(1e-3\) and dropping the learning rate by a factor of 10 at the 15th and 23rd epochs. For semantic inpainting, we use an initial learning rate of \(2e-4\) for the generator and \(2e-5\) for the discriminator. We observed that the discriminator learns at a faster rate; to even out the training curves, we use stochastic gradient descent (SGD) as the optimizer for the discriminator. We train the networks for 30 epochs, after which we discard the discriminator and use the encoder from the generator for count regression fine-tuning.
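The patch sampling for the proxy tasks can be sketched as follows; the sampler itself is an illustrative assumption, and the learning-rate drops can be realized, e.g., with a MultiStepLR scheduler.

```python
import random
import torch

def sample_patches(img, num_patches=10, min_size=72, max_size=90):
    # img: (C, H, W) tensor; returns `num_patches` square crops with side in [72, 90]
    _, h, w = img.shape
    patches = []
    for _ in range(num_patches):
        s = random.randint(min_size, max_size)
        top, left = random.randint(0, h - s), random.randint(0, w - s)
        patches.append(img[:, top:top + s, left:left + s])
    return patches

# Learning-rate drops for the rotation and jigsaw proxies (assuming an `optimizer` exists):
# scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 23], gamma=0.1)
```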

4.3 Evaluation Metrics

We use the Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), %Overestimate (%OA) and %Underestimate (%UA) to report all results:

$$\begin{aligned} \textit{MAE} = \frac{\sum _{i}|y_i-x_i|}{N}, \textit{RMSE} = \sqrt{\frac{\sum _{i}\left( y_i-x_i\right) ^{2}}{N} },&\end{aligned}$$
(5)
$$\begin{aligned} \%\textit{OA} = \frac{\sum _{i}|y_i-x_i|I_{[(y_i-x_i)>0]}}{\sum _{i}x_i} \times 100, \%\textit{UA} = \frac{\sum _{i}|y_i-x_i|I_{[(y_i-x_i)<0]}}{\sum _{i}x_i} \times 100,&\end{aligned}$$
(6)

where \(y_i\) and \(x_i\) are the predicted and actual counts for image sample i, and N is the total number of image samples. Hence, we not only obtain overall network performance from Eq. 5, but also a comparative measure of over- and under-estimated predictions from Eq. 6. We use MAE as the primary metric of interest throughout the discussion of our results.
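These metrics translate directly into code; the sketch below assumes the predicted and ground-truth counts are collected into NumPy arrays over the test set.

```python
import numpy as np

def count_metrics(y_pred, y_true):
    # y_pred, y_true: 1-D arrays of predicted and ground-truth counts (Eqs. 5-6)
    diff = y_pred - y_true
    mae = np.abs(diff).mean()
    rmse = np.sqrt((diff ** 2).mean())
    over = 100.0 * np.abs(diff[diff > 0]).sum() / y_true.sum()    # %OA
    under = 100.0 * np.abs(diff[diff < 0]).sum() / y_true.sum()   # %UA
    return mae, rmse, over, under
```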

Table 1. Performance of different methods on PUCPR+ and CARPK datasets. We highlight the best results in each group of methods - detection vs. regression - in bold.

4.4 Results

We discuss the performance of our best performing method in comparison with other published methods in Table 1, and present an ablation study in Table 2. Our method achieves the best performance among regression-based methods [1]. While we observe an increase of about 2-3 in MAE and RMSE with respect to the best performing detection-based method, our method requires only about half the computational complexity and can be trained without localization annotation, which is known to be expensive to acquire. This clearly demonstrates the effectiveness of our method for entity counting on aerial datasets with sparse annotation.

Table 2 compares the performance of our methods under different configurations on PUCPR+ and CARPK. We report results with transposed augmentation for CARPK. For PUCPR+, most of the configurations under study - both feature adaptation via self-supervision and feature selection via network modification - produce better performance than the pretrained baseline. In particular, RotNet-based self-supervision produces the largest improvement, followed by semantic inpainting and ARFs, demonstrating the efficacy of representation adaptation. We show activation maps from the pretrained baseline and the RotNet-trained network in Fig. 2. We observe that the latter has finer activation details compared to the ImageNet-pretrained network; this is learned via proxy self-supervision tasks without using any localization information.

Table 2. Ablation study of all approaches discussed in Sect. 3 on PUCPR+ and CARPK datasets using a baseline VGG-16 ImageNet-pretrained network. We highlight the best results in bold with MAE as the metric of interest.

For the more complex scenes in CARPK, we have two key observations when transposed augmentation is not used:

  • the pretrained baseline gives an MAE of 11.5 ± 1.4. This demonstrates the effectiveness of simply understanding the training and test data distributions and adjusting the data augmentation accordingly. We also observe in Fig. 3 that the activations for vehicles have lower intensities when transposed augmentation is not used, especially in cases where the orientations do not match the training set distribution.

  • RotNet and SE blocks give an MAE of 9.3 ± 1.2 and 10.2 ± 0.7 respectively, showing that feature adaptation is essential for aerial imagery.

However, with transposed augmentation, we notice that only semantic inpainting and SE blocks, which are complementary to transformation-based augmentation, can further improve the performance (Table 2). This further validates our hypothesis that not all ImageNet-learned features contribute to complex aerial imagery, and that feature adaptation is essential for good performance.

Fig. 2.

Exemplar activation maps for images from the PUCPR+ dataset: input image with ground truth count (left), activation maps from pretrained network (middle), and activation maps from the network finetuned with rotation invariance proxy task (right).

Fig. 3.

Exemplar activation maps for images from the CARPK dataset, illustrating the differences resulting from orientation-based augmentation. The first row shows images sampled from the training set. Rows 2 and 3 display the input image (left) and the activation maps from the network trained with (middle) and without (right) transposed-image augmentation.

Additionally, we performed an ablation study in which we trained the network from scratch with rotation invariance as the self-supervised task. The network trained from scratch without any self-supervision or localization information achieves an MAE of 124.75 on PUCPR+, while the network with RotNet-based self-supervision achieves an MAE of 17.05. Although this does not match the MAE of 3.72 achieved by the RotNet-based ImageNet-pretrained network, self-supervised learning still leads to a significant difference in MAE, strengthening the case for aerial self-supervised learning. Figure 4 compares the first convolutional layer of three networks - two based on pretrained ImageNet features and the third trained from scratch with self-supervised RotNet. We observe that the weights and biases of the pretrained and RotNet-proxy VGG-16 networks are identical, which visually confirms our hypothesis of feature re-usage between ground and aerial imagery, as the weights appear to be looking for the same early set of features. For the network trained from scratch with RotNet as the self-supervised task, the stored information is harder to interpret - however, we can still observe that the network is looking for edge and color information, given that not a single filter is monochromatic.

Fig. 4.

Comparison between networks trained on PUCPR+ dataset. Top and bottom rows: weights and biases of the first convolutional layer from VGG-16. Left column: pretrained ImageNet. Middle column: trained with RotNet proxy. Right column: trained with RotNet from scratch.

To further understand where the networks actually differ, we use singular vector canonical correlation analysis (SVCCA) [30] to compare the activations of the two networks on a fixed set of input images from the dataset. SVCCA combines singular value decomposition and canonical correlation analysis to measure similarity between different sets of feature maps without accounting for filter orderings. From Figs. 5 and 6, we observe that the activations differ after the second max pooling layer, further strengthening our hypothesis of feature re-usage.
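A minimal NumPy sketch of this comparison is given below; it follows the standard SVCCA recipe [30] of SVD-based dimensionality reduction followed by canonical correlations, with the number of retained directions chosen arbitrarily here.

```python
import numpy as np

def svcca_similarity(acts1, acts2, keep=20):
    # acts1, acts2: (num_neurons, num_datapoints) activations from the two networks
    def reduce(a, k):
        a = a - a.mean(axis=1, keepdims=True)          # center over datapoints
        _, s, vh = np.linalg.svd(a, full_matrices=False)
        return (s[:k, None] * vh[:k]).T                # (num_datapoints, k) top directions
    x, y = reduce(acts1, keep), reduce(acts2, keep)
    qx, _ = np.linalg.qr(x)                            # orthonormal bases of the subspaces
    qy, _ = np.linalg.qr(y)
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)   # canonical correlations
    return rho.mean()
```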

Fig. 5.

CCA similarity amongst different layers of VGG-16 comparing the activations of pretrained and RotNet proxy on PUCPR+ dataset. (a), (b), (c) indicate the similarities at the maxpool stages and (d) indicates the similarities before the global average pooling layer.

Fig. 6.

CCA similarity amongst different layers of VGG-16 comparing the activations with and without transposed-image augmentation on CARPK dataset. (a), (b), (c) indicate the similarities at the maxpool stages and (d) indicates the similarities before the global average pooling layer.

5 Conclusion

We studied a suite of approaches that help in learning better features for vehicle counting from aerial imagery with small-scale datasets. Our study showed that different adaptation approaches yield different amounts of performance improvement depending on the data characteristics. With a suitable adaptation scheme, we achieved substantial performance improvements on both the PUCPR+ and CARPK datasets.