1 Introduction

Vehicle object detection plays an important role in intelligent transportation systems (ITS). It is the prerequisite and foundation for follow-up work such as vehicle recognition, vehicle tracking, and traffic statistics [1]. With the rapid development of deep neural networks, object detection has become a major research hotspot in computer vision. General object detection has achieved great success, driven by deep convolutional neural networks (DCNNs). Object detection combines localization and recognition: it not only locates the position of an object in an image or video, but also recognizes its category. It is widely used in intelligent transportation systems, intelligent monitoring systems, military target detection, and medical imaging. However, vehicle detection still faces many challenges in complex traffic scenes, such as varying lighting conditions, occlusion, and low resolution [2].

Nowadays, many scholars worldwide are devoted to the study of object detection and have obtained good results. The proposed architectures can be divided into two categories: two-stage detectors [3,4,5,6,7] and one-stage detectors [8,9,10]. Two-stage detectors achieve better detection accuracy, but sacrifice speed and consume more resources. One-stage detectors have lower detection accuracy but are more efficient in training and inference, making them more suitable for real-time detection in real scenes.

In order to detect objects of different scales, CNN-based object detection algorithms adopt multi-scale outputs [11,12,13]. Among them, YOLO v3 and Mask R-CNN use the Feature Pyramid Network (FPN) [14] idea to fuse feature maps of adjacent scales through concatenation. FPN uses a top-down pathway with lateral connections to fuse the features of two adjacent scales. High-resolution feature maps contain more fine-grained features of the object, while low-resolution feature maps contain more contour information. Effective feature aggregation can improve network performance.

With the introduction of FPN, choosing a suitable FPN output layer becomes a problem that must be solved. The traditional method selects a layer based on the Region of Interest (RoI) obtained by the RPN: given the width w and height h of the RoI, formula (1) proposed in [14] is used to find the best level k:

$$k = \left\lfloor {k_{0} + \log_{2} \left( {\sqrt {wh} /m} \right)} \right\rfloor$$
(1)

where m represents the input size used for ImageNet pre-training (224), and k0 represents the target level to which an RoI with area m × m is mapped. However, we believe that the choice of a single FPN layer may limit the descriptive ability of the network. This view is supported by [15], which achieved better detection accuracy than the baseline method by summing the candidate regions generated by all the feature layers. However, this summation method inevitably increases the complexity of the network, and the training process requires more resources. This is understandable, because the summation method increases the number of candidate regions by 5–6 times or even more, and therefore requires a lot of computing resources.
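To make the baseline level-selection rule concrete, the following minimal sketch computes the level k of formula (1) for a given RoI. The defaults k0 = 4, m = 224 and the clamping to levels P2–P5 follow the FPN paper and are assumptions of the example, not settings taken from this work.

```python
import math

def fpn_level(w, h, k0=4, m=224, k_min=2, k_max=5):
    """Select the FPN level for an RoI of width w and height h (formula (1)).

    k0 and m follow the FPN paper's defaults; the result is clamped to the
    available pyramid levels P2-P5. Illustrative values only.
    """
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / m))
    return max(k_min, min(k_max, k))

# An RoI roughly the size of the ImageNet crop maps to level k0:
print(fpn_level(224, 224))  # 4
print(fpn_level(64, 48))    # small RoI -> lower (higher-resolution) level
```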

Inspired by [15], a candidate region aggregation network (CRAN) is proposed in this paper. First, the FPN output features are re-extracted through a convolutional layer; then, a quality score module is constructed to calculate the similarity between different feature layers. The similarity result is used as a quantity factor that determines the number of candidate regions kept from the corresponding feature layer. Finally, a more comprehensive set of candidate regions is generated. Since the quantity factors are derived from the FPN output feature maps, and each group of feature maps is derived from the same input image, the proposed quality score module can be applied to any input image. In addition, in order to address the difficulty of detecting small targets in vehicle detection, an area cross entropy loss function is proposed. This paper designs a monotonically decreasing function of the area of the candidate region and uses it to weight the cross entropy loss. Our intuition is that small targets should be assigned more weight, while large targets require less weight. The introduction of the area cross entropy loss benefits the detection of small targets and improves the performance of the model.

The main contributions of this paper are as follows:

  1. A novel candidate region aggregation network (CRAN) is proposed to effectively aggregate the candidate regions of feature layers at different scales, improving the ability of the network to handle multi-scale objects.

  2. An area cross entropy loss function is proposed to improve the detection performance of the model for small targets. Each candidate region is assigned a different weight during the classification process, and the weight depends on the area of the candidate region.

  3. The proposed CRAN and area cross entropy loss are introduced into current state-of-the-art detectors and tested on challenging datasets.

The remainder of this paper is organized as follows. Object detection architectures and feature fusion methods are reviewed in Sect. 2. Section 3 describes the proposed approach. In Sect. 4, the experimental setup, benchmark datasets and experimental results are presented. The conclusions are given in Sect. 5.

2 Related work

2.1 Object detection

With the increasing popularity of intelligent transportation systems, many researchers have begun to study vehicle object detection [16]. There were many outstanding early studies, such as Haar [17], SIFT [18], HOG [19], and DPM [20, 21]. However, traditional detection algorithms require manual design of target features, which leads to high complexity and a large amount of redundancy, severely affects the running speed, and makes engineering deployment in real scenarios difficult. With the development of deep learning, especially of detection algorithms based on convolutional neural networks, object detection has entered an intelligent development stage. Through parameter sharing and sparse connections, CNN-based detectors avoid the complicated process of manually extracting features, effectively solving the problems of poor portability and missing features of traditional models [22]. In addition, with the rapid development of GPU technology, the computing speed of deep learning has increased exponentially.

Recently, CNN-based two-stage and one-stage detectors have continuously advanced object detection performance on several benchmark datasets. The first is the two-stage architecture based on R-CNN [3, 23, 24]. In order to improve the training efficiency of the network, Ren et al. [5] proposed Faster R-CNN in 2015, which designed an RPN to generate proposals under a unified framework (Fig. 1). Subsequently, a series of excellent two-stage detectors appeared, trying to improve the network by optimizing the architecture [25,26,27], refining the training strategy [28, 29], or adding auxiliary modules [30,31,32].

In 2016, the YOLO method was proposed by Redmon et al. [8]. Bounding box regression and classification are directly integrated into the same convolutional network, obtaining extremely fast detection speed. However, due to the rough network design, it falls short of the accuracy requirements of real-time object detection, with problems such as inaccurate localization and poor detection of small objects and multiple objects. Subsequently, Redmon et al. continued to improve the YOLO algorithm and proposed YOLO v2 [33] and YOLO v3 [11], respectively. Meanwhile, Liu et al. [9] proposed the single-shot detector (SSD), which combines the regression idea of the YOLO model with the anchor mechanism of Faster R-CNN. SSD surpasses Faster R-CNN in detection speed and accuracy, but it does not consider the correlation between different layers and different scale targets, resulting in poor detection of small objects. Later, RSSD and DSSD were proposed, and the performance was greatly improved.

2.2 Multi-scale features

As the limitations of single-feature representations have become more and more prominent, researchers have begun to study multi-feature fusion in order to find a better feature representation. Existing feature fusion methods can be divided into two types: direct averaging and weighted summation; in fact, the former is a special case of the latter. Many scholars have pursued this line of research [16, 34,35,36] and achieved good performance. However, in object detection we often need to deal with targets of different scales, so multi-scale issues must be considered. [14] proposed the feature pyramid network (FPN) to build feature representations at different levels, which has been proven effective for general object detection. However, the selection of the FPN output feature layer is heuristic, which limits its performance to a certain extent. On this basis, a novel candidate region aggregation network is designed to effectively utilize the information of all FPN output layers and improve the performance of the network.

2.3 Classification loss function

In terms of object classification, the cross entropy loss function adjusts network parameters by describing the distance between probability vectors. It has consistently performed well and is used in many advanced algorithms [3, 4, 6, 7]. However, as can be seen from its expression, the cross entropy loss assigns a weight of 1 to all input samples, which makes it perform poorly on complex problems such as a serious imbalance in the number of samples (e.g., 1:100) or an excessive gap between object sizes. To address this problem, [8] proposed the focal loss, which effectively solves the imbalance in sample category ratios by adding a balancing coefficient to the cross entropy. In vehicle detection, the detection of small targets and low-resolution targets has always been a challenging problem. Therefore, based on the idea of area-weighted loss in [37], we propose a cross entropy loss function based on an area factor. By assigning larger weights to small targets and smaller weights to large targets, the problem of excessive object size gaps in vehicle detection is effectively alleviated.

3 Methodology

In this section, the candidate region aggregation network (CRAN) and area cross entropy loss function are described in detail.

3.1 Candidate region aggregation network (CRAN)

The proposal of FPN effectively solves the problem of multi-scale feature selection; it is an architecture that can select appropriate features according to the scale of the object. Many studies have shown that FPN maintains effective spatial information while avoiding the complicated computation caused by refining features at every scale. In existing architectures, the feature map is generally selected by the baseline method, which picks a single layer as the input of the RoI layer. Although the baseline method is the more common choice, [15] argues that it behaves similarly to random selection and has verified this idea through experiments. The experiments selected samples from the COCO dataset and compared the baseline method, a random-selection method and a direct summation method. Figure 2 shows the progress of the training process. It can be seen that the progress of the random method and the baseline method are quite similar, and the gap in average accuracy is small. This shows that each FPN output feature map contains valid information, and that using any single feature map to represent the input image is not comprehensive. In addition, the experiments also directly summed the FPN output feature maps. The results show that the training progress of the summation method is basically consistent with the baseline method, and after the 9th epoch its test accuracy exceeds that of the baseline method. These experimental results show that no FPN output feature map can be ignored. Therefore, effectively aggregating the FPN output feature maps is of great help in improving the performance of the network model.

Fig. 1
figure 1

Two-stage detector architecture

Fig. 2
figure 2

The average prediction accuracy of different FPN layers selected under the COCO data set

Based on the above observations, this paper proposes a candidate region aggregation network (CRAN). Our motivation is that although summation increases the richness of the candidate regions, it greatly increases the consumption of computing resources, and a large number of candidate regions can easily cause inter-class interference. Therefore, this paper tries to process the generated candidate regions so as to minimize their number while preserving their richness. CRAN mainly consists of three modules: a feature re-extraction module, a quality score module and an aggregation module. The network structure is shown in Fig. 3.

Fig. 3
figure 3

Candidate region aggregation network

3.1.1 Feature re-extraction module

Through the ResNet50 backbone network and FPN, feature maps P2–P6 of different scales can be extracted from the input image, where P6 is obtained by down-sampling P5. Since these features are obtained by superimposing the up-sampled features on the basic features C2–C5, in order to better fuse the two, this paper re-extracts the merged features P2–P6. The FPN output features are re-extracted using a convolution kernel of size 3 × 3; [38] shows that this can effectively improve the quality of the features.

3.1.2 Quality score module

The quality score module mainly learns the quantity factors of the FPN output feature maps by introducing an attention mechanism. We explored two ways to learn the quantity factors: one based on a Feedforward Neural Network (FNN), the other based on a Convolutional Neural Network (CNN).

The FNN-based method first converts feature maps of different scales into the same size through forward propagation, then calculates the similarity between the feature map chosen by the baseline method and every other feature map, and finally determines the feature quality score through a normalization operation. The basic structure is shown in Fig. 4, where Pk is the feature map of the kth layer selected by the baseline method, Pi is an FPN output feature map, and Pi* is the result of flattening Pi. It is worth noting that the vector sizes corresponding to different object sizes are different. To ensure that the similarity calculation is not affected by the size, this paper uses cosine similarity as the benchmark for measuring the similarity of the feature maps.

The quantity factor is as follows:

$${\text{Value}}_{i} = \left\{ {\begin{array}{*{20}c} {\frac{{P_{i}^{*} \cdot P_{k} }}{{\left\| {P_{i}^{*} } \right\| \cdot \left\| {P_{k} } \right\|}}} & {i \ne k} \\ 1 & {i = k} \\ \end{array} } \right.$$
(2)
$$\varepsilon_{i} = {\text{Softmax}}\left( {{\text{Value}}_{i} } \right) = \frac{{e^{{{\text{Value}}_{i} }} }}{{\sum\nolimits_{j = 2}^{6} {e^{{{\text{Value}}_{j} }} } }}$$
(3)
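The following sketch illustrates the FNN-based quantity-factor computation of formulas (2) and (3). For simplicity it resizes the flattened maps with linear interpolation in place of the learned feedforward mapping described above; the tensor shapes, level indices and this substitution are assumptions of the example, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fnn_quality_scores(feats, k):
    # feats: dict {level: tensor (C, H_l, W_l)} of re-extracted FPN outputs P2-P6
    # k: level chosen by the baseline rule
    ref = feats[k].flatten().unsqueeze(0)          # P_k flattened: (1, L_k)
    levels = sorted(feats)
    values = []
    for lvl in levels:
        if lvl == k:
            values.append(torch.tensor(1.0))       # Value_k = 1 (formula (2))
            continue
        v = feats[lvl].flatten().unsqueeze(0)      # P_i* (flattened P_i)
        # bring P_i* to the length of P_k before comparing (interpolation used
        # here in place of the learned feedforward mapping)
        v = F.interpolate(v.unsqueeze(0), size=ref.shape[-1],
                          mode="linear", align_corners=False).squeeze(0)
        values.append(F.cosine_similarity(v, ref, dim=-1).squeeze())
    eps = torch.softmax(torch.stack(values), dim=0)  # formula (3)
    return dict(zip(levels, eps))
```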

The CNN-based method first reduces each output feature map to a 1 × 1 feature value through several valid (unpadded) convolutions, then calculates the similarity between the feature value obtained at each layer and the feature value of the specified layer, and finally uses the similarity result as the quantity factor that determines the number of candidate regions kept from the corresponding feature layer. The specified layer is obtained from the baseline method. Since every FPN output feature is derived from the input image, we believe there are similarities among all the feature maps (Fig. 4). The network structure is shown in Fig. 5:

Fig. 4
figure 4

Quality score module based on feedforward neural network

Fig. 5
figure 5

Quality score module based on convolutional neural network

In Fig. 5, Valuei is the feature value obtained from the FPN output Pi of the ith layer, and the weight is calculated as follows:

$$w_{i} = \frac{{{\text{Value}}_{k} - \left| {{\text{Value}}_{k} - {\text{Value}}_{i} } \right|}}{{{\text{Value}}_{k} }}$$
(4)

where Valuek is the feature value obtained from the feature map Pk of the kth layer selected by the baseline method.
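A hedged sketch of the CNN-based quality score module follows: each FPN output is reduced to a single value by unpadded convolutions and the per-level weights are computed with formula (4). The channel width, the number of reduction layers and the final pooling step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CNNQualityScore(nn.Module):
    """Reduce each FPN output P_i to one value and weight levels by formula (4)."""

    def __init__(self, channels=256):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2),  # valid (unpadded) conv
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1),
            nn.AdaptiveAvgPool2d(1),   # collapse the remaining spatial size to 1x1
        )

    def forward(self, feats, k):
        # feats: dict {level: tensor (N, C, H, W)}; k: baseline level index
        values = {lvl: self.reduce(f).flatten(1).mean(1) for lvl, f in feats.items()}
        v_k = values[k]
        # formula (4): w_i = (Value_k - |Value_k - Value_i|) / Value_k
        weights = {lvl: (v_k - (v_k - v).abs()) / v_k for lvl, v in values.items()}
        return weights
```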

3.1.3 Aggregation module

The main function of this module is to generate the group of candidate regions according to the quantity factor of each scale's feature map. Specifically, a series of candidate regions is generated for each feature map. We fully retain the candidate regions of the feature layer chosen by the baseline method, and retain only part of the candidate regions of the remaining feature layers. The number of retained candidate regions is determined by the quantity factor, as expressed by formulas (5) and (6).

$${\text{Num}}_{i} = N_{i} \times \varepsilon_{i}$$
(5)
$$N_{i} = H_{i} \times W_{i} \times {\text{anchors}}$$
(6)

where Ni is the number of candidate regions generated by the feature layer Pi, and Numi is the number retained. The number of candidate regions generated by each feature layer is determined by the size of its feature map: Hi and Wi denote the height and width of the feature layer Pi, respectively, and anchors denotes the number of anchors generated at each feature point, which is usually set to 9.
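The aggregation step can be sketched as follows: for every level, Numi proposals are kept according to formulas (5) and (6), with the baseline level kept in full. Selecting the retained proposals by objectness score is our assumption, since the text above only specifies how many are kept per level.

```python
import torch

def aggregate_proposals(proposals, scores, factors, k):
    """Keep Num_i = N_i * eps_i proposals per FPN level (formulas (5)-(6)).

    proposals[l]: (N_l, 4) candidate boxes of level l (N_l = H_l * W_l * anchors)
    scores[l]:    (N_l,)   objectness scores of level l
    factors[l]:   quantity factor eps_l from the quality score module
    k:            baseline level, whose candidates are kept in full
    """
    kept_boxes, kept_scores = [], []
    for lvl, boxes in proposals.items():
        n_l = boxes.shape[0]
        num = n_l if lvl == k else int(n_l * float(factors[lvl]))   # formula (5)
        top = torch.topk(scores[lvl], min(num, n_l)).indices        # highest-scoring ones
        kept_boxes.append(boxes[top])
        kept_scores.append(scores[lvl][top])
    return torch.cat(kept_boxes), torch.cat(kept_scores)
```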

3.2 Loss function

The loss function of object detection is mainly composed of two parts, the classification loss and the localization loss, which can be described as:

$$L_{{{\text{Loss}}}} = \frac{1}{{N_{{{\text{cls}}}} }}\sum\nolimits_{i} {L_{{{\text{cls}}}} \left( {p_{i} ,p_{i}^{ * } } \right)} + \lambda \frac{1}{{N_{{{\text{loc}}}} }}\sum\nolimits_{i} {t_{i} L_{{{\text{loc}}}} \left( {g_{i} ,g_{i}^{ * } } \right)}$$
(7)

where i is the anchor index, pi is the classification probability of anchor i, and pi* is the probability that anchor i carries the true label; gi is the coordinate vector of the predicted bounding box, and gi* is the coordinate vector of the ground truth; ti indicates the sample type, being 1 if the anchor is positive and 0 otherwise. In order to train the detection network, we need positive samples and their ground truth. We calculate the degree of overlap between each candidate box and the ground truth bounding boxes. A candidate box is defined as a positive sample if its overlap exceeds the threshold (0.5); in addition, the candidate box with the largest overlap with each ground truth is also assigned as positive.
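A minimal sketch of this positive-sample assignment, using torchvision's IoU helper, is given below; the box format and the forced best-match per ground truth follow the description above, and the helper name is ours.

```python
import torch
from torchvision.ops import box_iou

def assign_positives(candidates, gt_boxes, pos_thresh=0.5):
    """Label candidate boxes (t_i) against ground-truth boxes.

    candidates, gt_boxes: (N, 4) and (M, 4) tensors in (x1, y1, x2, y2) form.
    A candidate is positive if its best IoU exceeds 0.5; the best candidate
    for each ground truth is also forced positive so every object is matched.
    """
    iou = box_iou(candidates, gt_boxes)          # (N, M) overlap matrix
    best_iou, _ = iou.max(dim=1)
    labels = (best_iou > pos_thresh).float()     # t_i from the threshold rule
    labels[iou.argmax(dim=0)] = 1.0              # best candidate per ground truth
    return labels
```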

3.2.1 Area cross entropy loss function

For the classification loss Lcls, the multi-class cross entropy is usually used, applying a negative log-likelihood over all object classes. The specific expression is as follows:

$$L\left( {p_{i} ,q_{i} } \right) = { - }\sum\limits_{j = 1}^{c} {q_{ij} \times \log \left( {p_{ij} } \right)}$$
(8)

Among them, qij is a one-hot vector, which is defined as follows:

$$q_{ij} = \left\{ {\begin{array}{*{20}c} 1 & { \, i_{{{\text{th}}}} {\text{ sample category is }}j} \\ 0 & {{\text{otherwise}}} \\ \end{array} } \right.$$
(9)

where pij represents the probability that the ith sample belongs to category j. The Softmax function is used when calculating the category probabilities.

It is not difficult to see that the weight of every sample in the cross entropy loss function is 1, which effectively ignores the size of the object. However, in real scenes the attention we need to pay to objects of different scales differs; a study has shown that when detecting multi-scale targets simultaneously, more attention needs to be devoted to small target detection [39].

In order to deal with the difficulty of detecting small objects and low-resolution objects, this paper adopts the area weight idea in [37] and designs an area cross entropy loss function. Our aim is to design a weight parameter that depends on the target size and assigns different weights to different objects. Using only width or height as a weighting factor is not the best choice, due to the existence of targets with large aspect ratios, such as buses and coaches. Therefore, an area-based weight parameter mi is proposed. Because the areas of the proposals differ greatly, we normalize the area to the range 0 to 1 and design a monotonically decreasing function of the area. To prevent the weight from becoming too small, the weighting factor mi remains greater than 1 and less than 2. For the definition of mi, we refer to the form of the Softmax function and define it as follows:

$$m_{i} = 1 + e^{{ - s_{i} }}$$
(10)

where si represents the normalized area of the ith predicted box.

Figure 6 shows the curve of the weight factor mi. It can be seen that a larger weight is obtained when the area of the predicted box is relatively small; conversely, when the area of the predicted box is relatively large, a smaller weight is obtained.

Fig. 6
figure 6

Regional weight factor expression

In summary, the area cross entropy loss function defined in this article is:

$$L_{{{\text{Area}}}} \left( {p_{i} ,q_{i} } \right) = - \sum\limits_{j = 1}^{c} {\left( {1 + e^{{ - s_{i} }} } \right) \times q_{ij} \times \log \left( {p_{ij} } \right)}$$
(11)

Compared with the global area weight in [37], the difference is that we only apply the weight to the object classification process, because our intuition is that the final regression process is based on the classification.
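The area cross entropy loss of formula (11) can be sketched in a few lines; the sketch assumes the candidate-box areas s_i have already been normalized to [0, 1] as described above.

```python
import torch
import torch.nn.functional as F

def area_cross_entropy(logits, targets, areas):
    """Area cross entropy loss (formulas (10)-(11)).

    logits:  (N, C) per-candidate class scores
    targets: (N,)   class indices
    areas:   (N,)   candidate-box areas, normalized to [0, 1]
    Each sample's cross entropy is scaled by m_i = 1 + exp(-s_i), so small
    boxes get weights close to 2 and large boxes weights close to 1.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # -sum_j q_ij log p_ij
    m = 1.0 + torch.exp(-areas)                              # formula (10)
    return (m * ce).mean()

# toy usage: a small box (area 0.02) contributes more than a large one (0.9)
logits = torch.randn(2, 4)
targets = torch.tensor([1, 3])
areas = torch.tensor([0.02, 0.9])
loss = area_cross_entropy(logits, targets, areas)
```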

3.2.2 Location loss

For the location loss, we choose the Smooth L1 function, which has fast convergence and good smoothness; its expression is:

$${\text{Smooth }}L_{1} \left( x \right) = \left\{ {\begin{array}{*{20}c} {0.5x^{2} } & {{\text{if }}\left| x \right| < 1} \\ {\left| x \right| - 0.5} & {{\text{otherwise}}} \\ \end{array} } \right.$$
(12)

In the overall loss calculation, ti × Lloc means that only the loss of positive samples is activated. In the process of regressing a candidate box d to its ground truth box g, the offsets of the center (cx, cy), height (h) and width (w) can be expressed as:

$$\begin{array}{*{20}c} {\hat{g}_{j}^{cx} = \frac{{g_{j}^{cx} - d_{i}^{cx} }}{{d_{i}^{w} }},} & {\hat{g}_{j}^{cy} = \frac{{g_{j}^{cy} - d_{i}^{cy} }}{{d_{i}^{h} }}} \\ {\hat{g}_{j}^{w} = \log \left( {\frac{{g_{j}^{w} }}{{d_{i}^{w} }}} \right),} & {\hat{g}_{j}^{h} = \log \left( {\frac{{g_{j}^{h} }}{{d_{i}^{h} }}} \right)} \\ \end{array}$$
(13)
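For completeness, the offset encoding of formula (13) can be sketched as follows, assuming boxes in center form (cx, cy, w, h); d denotes the candidate box and g the matched ground truth, and the function name is illustrative.

```python
import torch

def encode_offsets(gt, d):
    """Encode ground-truth boxes against candidate boxes as in formula (13).

    gt, d: (N, 4) tensors in center form (cx, cy, w, h). The returned offsets
    are the regression targets for the localization branch.
    """
    t_cx = (gt[:, 0] - d[:, 0]) / d[:, 2]
    t_cy = (gt[:, 1] - d[:, 1]) / d[:, 3]
    t_w = torch.log(gt[:, 2] / d[:, 2])
    t_h = torch.log(gt[:, 3] / d[:, 3])
    return torch.stack([t_cx, t_cy, t_w, t_h], dim=1)
```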

In summary, our loss calculation can be defined as:

$$L_{{{\text{Loss}}}} = \frac{1}{{N_{{{\text{cls}}}} }}\sum\nolimits_{i} {\left( {1 + e^{{ - s_{i} }} } \right)L_{{{\text{cls}}}} \left( {p_{i} ,p_{i}^{*} } \right)} + \lambda \frac{1}{{N_{{{\text{loc}}}} }}\sum\nolimits_{i} {t_{i} L_{{{\text{loc}}}} \left( {g_{i} ,g_{i}^{*} } \right)}$$
(14)

After adding the weight mi to LLoss, the size of the object will affect the loss and gradient, and the smaller the object, the greater the impact on the result.

4 Experiments

This section reports the experimental details, including the object detection datasets, experimental environment, evaluation metrics, implementation details, and experimental results.

4.1 Data set and evaluation metrics

4.1.1 Data set

With the development of the object detection field, many challenging data sets have been released for further research, such as PASCAL VOC, COCO, KITTI. In order to evaluate our proposal, experiments were carried out on the UA-DETRAC and KITTI datasets [24].

4.1.1.1 UA-DETRAC data set [40]

It is a challenging multi-object detection benchmark for real scenes. The dataset contains a series of video sequences of different scenes, shot at 24 different locations in Beijing and Tianjin, China. There are more than 140,000 video frames in the entire dataset, including 84 K for the training set and 56 K for the test set. Since only the training data contains vehicle annotations, we divide the training set into two parts, of which 56 K is used for training and 28 K is used for validation.

4.1.1.2 KITTI data set [41]

It is the most representative object detection benchmark for autonomous driving scenarios. Most of the pictures in KITTI are taken from a driving recorder and contain real image data under various road conditions. This paper mainly uses the vehicle detection part of the dataset, including 7 K training pictures and 7 K test pictures. Since only the training data contains vehicle annotations, we also divide the training set into two parts, where 4 K is used for training and 3 K is used for validation.

4.1.2 Evaluation metrics

The COCO python package is used in ablation experiments, using APs, APm, and APl to verify the effectiveness of CRAN and area cross entropy. In addition, in order to verify the performance of the proposed method in the state-of-the-art network architecture, we also conducted experiments on the recent DETRAC benchmark [40] and KITTI benchmark [41].

4.2 Implementation details

4.2.1 Pre-processing

This paper first selects ResNet50 as the backbone network and introduces FPN to extract multi-scale feature maps. The input size for DETRAC is 540 × 960 pixels, and the input size for KITTI is 576 × 1920 pixels. The generalization ability of the model is improved by means of data augmentation.

4.2.2 Training

In order to obtain a more accurate mapping, all our parameter settings follow those in [22]. The training set is used to train the network, and the validation set is used to verify the training results. During training, the batch size on each GPU is 4. The "Xavier" method is used to initialize the convolutional layer parameters, and stochastic gradient descent (SGD) is used to optimize the model. In particular, 12 epochs are used for training; the initial learning rate is 0.01, it is halved as the epochs increase, and it decays to 0.0001 after the 9th epoch. In addition, we use optimization techniques such as batch normalization and dropout for each method.
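A minimal sketch of this optimizer and learning-rate schedule is given below. The exact epochs at which the rate is halved are an assumption, since the text only specifies the initial rate, the halving behaviour and the final decay to 0.0001 after the 9th epoch; a small placeholder module stands in for the detector.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # placeholder for the ResNet50+FPN detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def lr_at_epoch(epoch):
    # halve the rate every few epochs (assumed milestones), then 1e-4 after epoch 9
    if epoch >= 9:
        return 1e-4
    return 0.01 * (0.5 ** (epoch // 4))

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: lr_at_epoch(e) / 0.01)

for epoch in range(12):
    # ... one training epoch with batch size 4 per GPU ...
    scheduler.step()
```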

4.2.3 Testing

During testing, we use the trained object detection model to obtain the category and bounding box of each object in the test set, and then compare them with the label data to obtain the test accuracy.

4.2.4 Experimental environment

Our experiments are based on the Python language and the PyTorch 1.2 framework on the Ubuntu 16.04 operating system. The main hardware configuration includes a 2.4 GHz CPU and 64 GB RAM. On this basis, GTX 1080Ti graphics cards (12 GB memory) are used to accelerate training.

4.3 Ablation analysis

This paper designs ablation experiments on the COCO dataset to verify the performance of the proposed CRAN and area cross entropy on different evaluation metrics. The same parameter settings are used throughout, and mAP with an IoU threshold of 0.7 is used to ensure the fairness of the experiments.

4.3.1 Baseline setting

In this paper, a baseline network based on Faster R-CNN is constructed; the backbone network is ResNet50, and FPN is used for multi-scale feature extraction. Table 1 shows the output feature sizes of FPN. Experimental results show that the global mAP is 36.5% on the test set [22].

Table 1 FPN output feature map size

4.3.2 Effect of CRAN module

The CRAN module is applied to Faster R-CNN, and the basic network settings are consistent with the baseline method. The experimental results are shown in Table 2. It can be seen that the detection results improve after adding the CRAN module. During the experiments, we evaluated the FNN-based CRAN module and the CNN-based CRAN module separately. From the results in Table 2, it can be seen that the detection accuracy of the CNN-based method is better than that of the FNN-based method on the validation set, and its model size is also smaller. This is understandable, because the FNN-based method contains more trainable parameters, while the CNN-based method has more advantages in processing two-dimensional data. Therefore, the subsequent experiments in this paper use the CNN-based CRAN method.

Table 2 Ablation analysis on CRAN module

The baseline method has suboptimal performance in recalling objects of various scales, especially tiny ones, as depicted in Fig. 7a. As shown in Fig. 7b, our CRAN performs considerably well and achieves an encouraging recall rate on the COCO validation set.

Fig. 7
figure 7

Baseline a and CRAN b. Visibly, CRAN obtains better accuracy and, in our experiments on the COCO dataset, achieves a higher recall than the baseline

In addition, this paper also conducted experiments on the DETRAC dataset. The visualization results in Fig. 8 further illustrate the effectiveness of our method. Our CRAN performs considerably well and achieves an encouraging recall rate of over 99% on the DETRAC validation set.

Fig. 8
figure 8

Baseline a and CRAN b. Visibly, CRAN obtains better accuracy and, in our experiments on the DETRAC dataset, achieves a higher recall than the baseline

4.3.3 Effect of area cross entropy loss function

Three experiments are designed in this paper to verify the effectiveness of the area cross entropy loss function: it is applied to RPN classification, to object classification, and to both simultaneously. Table 3 reports the comparison between the three experimental settings and the baseline method. It can be seen that the proposed area cross entropy loss improves both the RPN classification process and the object classification process, especially for the detection of small targets, which also verifies our ideas. We found that when the area cross entropy loss is applied to both the RPN classification and the object classification processes, the mAP improves greatly. Therefore, in the subsequent experiments, this paper applies the area cross entropy loss to both the RPN classification and object classification processes.

Table 3 Ablation analysis on Area loss

Figure 9 shows the loss during the training process. It can be seen from Fig. 9a that our area cross entropy loss converges faster than the baseline method, and the overall loss value is smaller; Fig. 9b shows the classification and localization losses of our method separately. Obviously, the contribution of the classification loss to the overall loss is greater at the beginning of training, which also validates the idea in Sect. 3.2 of this paper.

Fig. 9
figure 9

Loss during training. a Baseline method overall loss and our overall loss; the area cross entropy loss converges faster and its loss value is smaller. b Our classification and localization losses

4.4 Application of CRAN and area cross entropy to different architectures

We apply the proposed CRAN and area cross entropy to several current state-of-the-art architectures and verify the performance of the method on the UA-DETRAC dataset and the KITTI dataset. For one-stage networks, which have no RPN process, CRAN is applied to aggregate the candidate regions of different feature layers, and the area cross entropy is only applied to the object classification process.

4.4.1 Performance test on UA-DETRAC data set

This paper tests our method on the UA-DETRAC dataset, and the training results of the training set and validation set were submitted to the UA-DETRAC benchmark. The comparative experimental results are reported in Table 4. It can be seen that our proposed method performs well with outstanding detectors. In particular, the performance on the Hard subset is greatly improved, which is consistent with our original intention in designing the area cross entropy. It is worth noting that our method performs well on two-stage detectors, improving the detection accuracy by more than 1.0% on average. In addition, the test results of several outstanding anchor-free methods on UA-DETRAC (CornerNet, CenterNet, FCOS) are also listed. We found that the method proposed in this paper enables several excellent detectors to surpass the anchor-free methods.

Table 4 Performance evaluation on the UA-DETRAC dataset. ++ means adding our proposed CRAN module and area cross entropy loss; (+ **) means improved detection performance

4.4.2 Performance detection on KITTI data set

In order to verify the performance of the proposed method with state-of-the-art network structures, we also conducted training and testing on the KITTI dataset and fully evaluated our method on the KITTI benchmark. Applying our method to several outstanding detectors, Table 5 gives a comparison of the methods. It can be clearly seen that for these detectors the proposed method improves the mAP by more than 1.0%, especially on the Hard subset.

Table 5 Performance evaluation on the KITTI dataset. ++ means adding our proposed CRAN module and area cross entropy loss; (+ **) means improved detection performance

This improvement is more obvious in Fig. 10, where we visualize some detection cases on the DETRAC and KITTI test sets. It can be clearly seen from the successful cases that the proposed method can better detect small targets at long distances. In particular, it can also detect occluded, blurred, and night-time vehicles. In short, our method can not only be applied to a variety of detectors, but also enhances the generalization ability of the network.

Fig. 10
figure 10

Success cases from DETRAC and KITTI

5 Conclusion

In order to improve the performance of vehicle object detection, this paper proposed CRAN and an area cross entropy loss to improve, respectively, the recall rate of the model and the detection performance on difficult instances. The ablation experiments show that the proposed methods not only greatly improve the recall rate, but also promote model convergence. Finally, the experimental results on the UA-DETRAC and KITTI datasets show that our method increases the mAP of several existing advanced detectors, especially two-stage detectors, by more than 1%.