1 Introduction

Extracting useful information with CNN-based algorithms is crucial for collecting and analyzing image-based traffic information. Drone footage offers a different perspective from the black-box or installed cameras of conventional vehicles: a drone captures a wide area at once, which is essential for comprehending overall vehicle movement and traffic flow. Vehicle movement can be categorized into three basic states: a normal state, in which the vehicle moves along its lane; a lane-changing state, in which the vehicle changes lanes; and a stop state, in which the vehicle comes to a halt.

This research aims to design a CNN model that classifies the three vehicle movement states from drone images, applying mosaic data augmentation [1] and soft label assignment [2]. The dataset was collected using a drone that captured bird's-eye-view images of the road. Mosaic augmentation is used to increase data diversity and prevent overfitting during model training; this technique randomly combines four selected images into a new mosaic image. Soft label assignment is then used to determine the labels of the mosaic images. These techniques demonstrate the potential of drones for traffic information analysis and the effectiveness of the proposed methodology in improving classification performance for autonomous driving systems.

2 Related Work

2.1 Autonomous Vehicle Dataset for Object Detection

Images and annotation data collected on various roads have been accumulated over time, and with advances in object classification and detection, autonomous driving technology is developing rapidly. The Cityscapes [3] and KITTI [4] datasets were created for autonomous vehicle research [5,6,7,8,9,10]. Both datasets provide images and annotations of on-road traffic conditions captured by vehicle-mounted cameras. The KITTI dataset also includes 3D bounding box locations and camera calibration information obtained with a 3D laser scanner.

2.2 Drone-Based Dataset for Object Detection

The Stanford drone dataset [11] is the first public aerial image dataset collected with drones. It contains ten kinds of tracking information (Track ID, (xmin, ymin), (xmax, ymax), frame, lost, occluded, generated, label) for objects on the road in each video, gathered at eight locations on the Stanford campus. The targets cover six classes (Bicyclist, Pedestrian, Skateboarder, Cart, Car, and Bus). However, the bounding box annotations are coarse, which limits the performance of object detection algorithms trained on them. The VisDrone [12] dataset is a large-scale drone image dataset produced by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China, with the aim of supporting drone-based computer vision applications. Using drone-mounted cameras, 288 videos were collected from 14 urban areas in China, yielding 2.6 million bounding boxes over ten classes (pedestrian, person, bus, car, van, truck, bicycle, awning tricycle, motorcycle, and tricycle). The data have been validated through the VisDrone challenge [13,14,15] and are utilized in various kinds of research [16,17,18]. The Institut für Kraftfahrwesen Aachen research team built drone-based road user trajectory datasets for various situations; since tests of vehicles related to autonomous driving are conducted scenario by scenario, these datasets present reliable, high-quality data criteria. highD [19] is a large-scale vehicle trajectory dataset for German highways, covering six locations, 16.5 h of video, and 110,000 trajectories. inD [20] addresses the fact that automated vehicles require data-based analysis methods to understand complex environments: by collecting road images with drones, it extracts vehicle trajectories and natural road behavior, and provides road conditions together with vehicles, bicycles, and pedestrians at four kinds of German intersections. rounD [21] includes the movement trajectories of cars, vans, trucks, buses, pedestrians, bicycles, and motorcycles at three German roundabouts; positions, headings, speeds, accelerations, and object classes are extracted from the video and provided as data.

3 Proposed Algorithm

Figure 1 illustrates the overall process for classifying vehicle states. The process consists of four components: 1) vehicle detection with YOLOv5, 2) mosaic data augmentation, 3) soft label assignment, and 4) a network for vehicle state classification. These four components work together to determine the movement status of vehicles. This section explains the proposed methods in detail.

Fig. 1. An overview of the vehicle state classification process, comprising object detection, data augmentation (mosaic augmentation), soft label assignment, and VSNet (vehicle state network).

3.1 Vehicle Detection

This study first presents an approach for detecting vehicles using YOLOv5 [22], an advanced object detection algorithm that has achieved state-of-the-art performance on a variety of visual recognition tasks. YOLOv5 is an abbreviation for “You Only Look Once version 5” and extends the original YOLO algorithm with improvements in speed and accuracy. It is based on a deep neural network architecture that efficiently extracts features from images and predicts object bounding boxes and class probabilities in a single forward pass.

The YOLOv5 algorithm comprises two main components: a feature extraction backbone and a detection head. The backbone network is built on an efficient architecture that has proven effective across a wide range of vision tasks. The detection head employs anchor boxes and grid cells to predict object locations and classes at multiple scales. To adapt YOLOv5 to the vehicle detection task, the model is fine-tuned on a custom dataset of drone flight images using transfer learning; vehicle types are limited to car, truck, and bus. Specifically, the network is initialized with weights pre-trained on the COCO dataset [23] and trained with advanced data augmentation and optimization techniques.
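To make this concrete, the snippet below is a minimal sketch of loading a COCO-pretrained YOLOv5 model through the official torch.hub entry point and restricting it to the vehicle classes; the image file name is hypothetical, and fine-tuning itself is performed with the training script of the ultralytics/yolov5 repository rather than shown here.

```python
import torch

# Load YOLOv5-large with COCO-pretrained weights via the official hub entry point.
model = torch.hub.load('ultralytics/yolov5', 'yolov5l', pretrained=True)

# Restrict predictions to the COCO vehicle classes: 2 = car, 5 = bus, 7 = truck.
model.classes = [2, 5, 7]

# Run detection on a drone frame (hypothetical file name).
results = model('drone_frame.jpg')
detections = results.xyxy[0]  # rows of (xmin, ymin, xmax, ymax, confidence, class)

# After fine-tuning with the repository's train.py, the resulting custom weights
# can be loaded the same way:
# model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
```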

3.2 Mosaic Data Augmentation

Fig. 2. With several mosaic ratios, mosaic augmentation combines four images into a new image.

Mosaic data augmentation was proposed in YOLOv4 [1]. This method selects four images from the dataset and arranges them into a single image, as shown in Fig. 2, according to the mosaic ratio \(\mathcal {M}_r\), which is chosen randomly from the range [0.3, 0.7]. I(i) denotes the image at a randomly selected index i in the dataset, and every I(i) is resized to \(512\times 512\); \(w_i\) and \(h_i\) denote the width and height of I(i). The mosaic contains four images I(n), where n is the placement order, \(n=0,\dotsc ,3\). Based on \(\mathcal {M}_r\), the width and height of each of the four images are determined as follows:

$$\begin{aligned} I(n) = {\left\{ \begin{array}{ll} \text {new w}_0=\mathcal {M}_r\times 512,\ \text {new h}_0=\mathcal {M}_r\times 512 &{} \quad \text {if } n = 0 \\ \text {new w}_1=512-\text {new w}_0,\ \text {new h}_1=512-\text {new h}_3 &{} \quad \text {if } n = 1 \\ \text {new w}_2=\mathcal {M}_r\times 512,\ \text {new h}_2=512-\text {new h}_0 &{} \quad \text {if } n = 2 \\ \text {new w}_3=512-\text {new w}_2,\ \text {new h}_3=\mathcal {M}_r\times 512 &{} \quad \text {if } n = 3 \end{array}\right. } \end{aligned}$$
(1)

This approach enables the model to learn from multiple images simultaneously, improving generalization by incorporating diverse contextual information into a single image. At each iteration, mosaic images are generated by sampling random image indices in quantity equal to the batch size.
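A minimal sketch of this augmentation in Python, assuming OpenCV for resizing; the tile placement follows the widths and heights of Eq. (1), while implementation details such as interpolation are assumptions.

```python
import cv2
import numpy as np

def mosaic_augment(images, mr, size=512):
    """Combine four images into one mosaic, following Eq. (1).

    images : list of four HxWx3 uint8 arrays (resized per tile below).
    mr     : mosaic ratio, drawn uniformly from [0.3, 0.7].
    Placement per Eq. (1): image 0 top-left, image 3 top-right,
    image 2 bottom-left, image 1 bottom-right.
    """
    cx = int(mr * size)  # new_w0 = new_w2 = Mr * 512
    cy = int(mr * size)  # new_h0 = new_h3 = Mr * 512
    out = np.zeros((size, size, 3), dtype=np.uint8)
    out[:cy, :cx] = cv2.resize(images[0], (cx, cy))                # n = 0
    out[:cy, cx:] = cv2.resize(images[3], (size - cx, cy))         # n = 3
    out[cy:, :cx] = cv2.resize(images[2], (cx, size - cy))         # n = 2
    out[cy:, cx:] = cv2.resize(images[1], (size - cx, size - cy))  # n = 1
    return out

# Usage: four random dataset indices per mosaic, one ratio per sample.
# mr = np.random.uniform(0.3, 0.7)
```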

3.3 Soft Label Assignment

After applying mosaic augmentation, a soft label assignment method is proposed for label allocation. In a mosaic image, the four original images correspond to four labels. The soft label is created by referring to label smoothing [2]; the soft label \(\mathcal {S}(x)\) is shown in Eq. (2). For a training sample x, \(h_i(k|x)\) represents the hard label distribution of the i-th of the four images over the classes \(k \in \{0,1,2\}\). The hyperparameter \(\alpha \) is assigned a weight value between 0 and 1, and K, which denotes the number of images, is 4. Equation (2) is defined as follows:

$$\begin{aligned} \mathcal {S}(x) =\frac{1}{N}\displaystyle \sum _{i=1}^{N}\{(1-\alpha )h_i(k|x) + \alpha / K\} \end{aligned}$$
(2)

The ground truth label distribution is multiplied by the weight \((1-\alpha )\) and interpolated toward the uniform term \(\alpha /K\). The label of the mosaic image is the soft label obtained by averaging \((1-\alpha )h_i(k|x) + \alpha / K\) over the four images. Equation (2) thus represents the ground truth soft label, which adjusts the ground truth label distribution by applying label smoothing to mosaic images for classification models. In conclusion, the combination of mosaic augmentation and soft label assignment improves classification model performance.
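A small sketch of Eq. (2) in Python; the example class indices are hypothetical, and the function follows the equation literally (the averaged distribution is not re-normalized).

```python
import numpy as np

def soft_label(hard_labels, alpha, num_classes=3):
    """Soft label for a mosaic image, following Eq. (2).

    hard_labels : class indices of the four source images, e.g. [0, 2, 1, 0].
    alpha       : smoothing weight between 0 and 1; K = 4 images as in the text.
    """
    K = len(hard_labels)                          # 4 images per mosaic
    dists = np.eye(num_classes)[hard_labels]      # one-hot h_i(k|x), shape (4, 3)
    smoothed = (1 - alpha) * dists + alpha / K    # per-image label smoothing
    return smoothed.mean(axis=0)                  # average over the four images

# Hypothetical usage: two 'lane_change', one 'stop', one 'safe' source image.
# s = soft_label([0, 2, 1, 0], alpha=0.7)
```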

3.4 Vehicle State Classification

The classification model is adapted from previous work [24]. The proposed model comprises the Wide Area Feature Extraction (WAFE) module and the Deformable Residual (DR) module. These modules play critical roles in extracting and focusing on feature information; the following sections explain them layer by layer.

Fig. 3. The left module in the illustration is the WAFE module. It uses dilated convolutional layers to appropriately extract features while reducing computation when objects in the image are far apart. The right module is the DR module, which applies a variety of receptive fields through deformable convolutional layers to extract features that are meaningful for judging vehicle status.

Wide Area Feature Extraction Module (WAFE Module). To classify the state of the target vehicle, the input image must capture the positions and states of the surrounding vehicles. As Fig. 5 shows, the vehicles in the image are mostly separated from one another. To exclude unnecessary information such as background, the first convolutional layer applies a \(5\times 5\) kernel with 64 filters, a stride of 4, and a dilation ratio of 3. A subsequent \(1\times 1\) convolutional layer extends the number of channels from 64 to 128. To stabilize the learning process, batch normalization (BN) [25] is performed after every convolutional layer, and the proposed network employs Gaussian Error Linear Units (GELU) [26] as the activation function. The feature map is then divided into four groups of 32 channels, and each group is passed through a \(3\times 3\) convolutional layer with one of four dilation ratios, [1, 3, 5, 7]. The four group outputs are concatenated and a \(1\times 1\) kernel is applied. Additionally, a residual connection incorporates the previous information into the feature map before the max pooling operation.
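The PyTorch sketch below illustrates this layer ordering under the stated hyperparameters; the padding values and exact stem layout are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class WAFEModule(nn.Module):
    """Sketch of the WAFE module described above (paddings are assumptions)."""

    def __init__(self):
        super().__init__()
        # 5x5 conv, 64 filters, stride 4, dilation 3, then 1x1 conv: 64 -> 128.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=4, dilation=3, padding=6),
            nn.BatchNorm2d(64), nn.GELU(),
            nn.Conv2d(64, 128, kernel_size=1),
            nn.BatchNorm2d(128), nn.GELU(),
        )
        # Four 32-channel groups, each with its own dilation from [1, 3, 5, 7].
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(32, 32, kernel_size=3, dilation=d, padding=d),
                nn.BatchNorm2d(32), nn.GELU())
            for d in (1, 3, 5, 7)
        ])
        self.fuse = nn.Sequential(
            nn.Conv2d(128, 128, kernel_size=1),
            nn.BatchNorm2d(128), nn.GELU())
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.stem(x)
        groups = torch.split(x, 32, dim=1)  # 128 channels -> 4 groups of 32
        y = torch.cat([b(g) for b, g in zip(self.branches, groups)], dim=1)
        y = self.fuse(y) + x                # residual before max pooling
        return self.pool(y)
```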

Deformable Residual Module (DR Module). The deformable residual module is a modification of the deformable convolutional layer [27] that extracts flexible spatial information from the output features of the WAFE module. As illustrated in Fig. 4(a), a traditional \(3\times 3\) convolutional layer has a fixed receptive field in the image area, represented by the red and blue dots. However, in the image data used for vehicle detection, the vehicles are often separated from each other, so a fixed receptive field would extract feature information that includes unnecessary background. To address this issue and perform more effective convolutional operations, deformable convolution is employed. Figure 4(b) shows how deformable convolution generates offsets with a convolutional layer and performs the convolution operation through this offset information.

$$\begin{aligned} \textbf{y}(\textbf{p}_0)=\displaystyle \sum _{\textbf{p}_n\in \mathcal {R}}\textbf{w}(\textbf{p}_n)\cdot \textbf{x}(\textbf{p}_0+\textbf{p}_n+\bigtriangleup \textbf{p}_n) \end{aligned}$$
(3)

Equation (3) describes how the offset determines the kernel positions in deformable convolution. \(\textbf{y}\) is the output feature map. The kernel grid \(\mathcal {R}\) defines the receptive field, \(\mathcal {R}=\{(-1,-1),(-1,0),\dotsc ,(0,1),(1,1)\}\). The convolution is computed at pixel position \(\textbf{p}_0\) of the input feature map \(\textbf{x}\) and at the individual positions \(\textbf{p}_n\in \mathcal {R}\), shifted by the offsets \(\bigtriangleup \textbf{p}_n\). The offset \(\bigtriangleup \textbf{p}_n\) is generated by a convolutional layer and is trained at each iteration. Thus, \(\textbf{p}_0+\textbf{p}_n+\bigtriangleup \textbf{p}_n\) ultimately determines the sampling position of the input, and the convolution multiplies the value at that position by the kernel weight \(\textbf{w}(\textbf{p}_n)\).
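As an illustration, a deformable convolution of this form can be sketched with torchvision's DeformConv2d, where the offsets \(\bigtriangleup \textbf{p}_n\) are predicted by an ordinary convolutional layer as in [27]; the channel width below is illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """Minimal sketch of Eq. (3): offsets from a plain conv feed DeformConv2d."""

    def __init__(self, channels=128):
        super().__init__()
        # Two offsets (x, y) per kernel position: 2 * 3 * 3 = 18 channels.
        self.offset = nn.Conv2d(channels, 18, kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        # y(p0) = sum_n w(p_n) * x(p0 + p_n + Δp_n), with learned Δp_n.
        return self.deform(x, self.offset(x))

# x = torch.randn(1, 128, 64, 64); y = DeformBlock()(x)  # same spatial size
```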

Fig. 4. (a) \(3\times 3\) conventional convolution; (b) \(3\times 3\) deformable convolution. In deformable convolution, the deep red and dark blue dots focus on the vehicle in the input image. (Color figure online)

The Convolutional Block Attention Module (CBAM) [28] provides complementary channel-wise and spatial-wise attention and is applied to the output of the three deformable convolutional layer operations. The fully connected layer then receives the feature map computed by two deformable convolutional layers.

Loss Function. During training, both hard and soft labels contribute to the loss function. For hard labels, the loss used in this study is Focal Loss [29], which balances the training process and prevents bias towards one class when the data are imbalanced. For soft labels, the loss is the mean squared error (MSE); since the soft label values are floating-point numbers, the error is averaged. The proposed total loss function \(\mathcal {L}\) is as follows:

$$\begin{aligned} \mathcal {L} = \alpha _l\mathcal {L}^{cls}_{hl}+(1-\alpha _l)\mathcal {L}^{cls}_{sl} \end{aligned}$$
(4)

\(\mathcal {L}^{cls}_{hl}\) is the loss computed on hard labels, and \(\mathcal {L}^{cls}_{sl}\) is the loss on soft labels. The parameter \(\alpha _l\) weights the hard and soft losses; since soft labels occur less frequently than hard labels, the hard label weight is higher, with \(\alpha _l = 0.9\).
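A sketch of Eq. (4) in PyTorch; the focal loss formulation and the masking of hard label (original image) versus soft label (mosaic image) samples within a batch are assumptions about how the two branches are combined.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, hard_target, soft_target, is_mosaic,
               alpha_l=0.9, gamma=2.0):
    """Sketch of Eq. (4): focal loss on hard labels, MSE on soft labels.

    logits      : (B, 3) class scores from the classifier.
    hard_target : (B,) class indices; entries for mosaic samples are placeholders.
    soft_target : (B, 3) soft labels from Eq. (2); rows for original samples unused.
    is_mosaic   : (B,) bool mask (assumes each batch holds both sample kinds).
    """
    probs = logits.softmax(dim=1)
    # Multi-class focal loss [29]: -(1 - p_t)^gamma * log(p_t), on original images.
    ce = F.cross_entropy(logits, hard_target, reduction='none')
    pt = probs.gather(1, hard_target.unsqueeze(1)).squeeze(1)
    hard_loss = ((1.0 - pt) ** gamma * ce)[~is_mosaic].mean()
    # MSE against the soft label distributions of the mosaic images.
    soft_loss = F.mse_loss(probs[is_mosaic], soft_target[is_mosaic])
    return alpha_l * hard_loss + (1.0 - alpha_l) * soft_loss
```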

4 Experiment

Drone Image Dataset. The drone images are captured from a top-down perspective, as shown in Fig. 5, and vehicle detection is performed on them with the YOLOv5 large model. After detection, each image is cropped around the five vehicles surrounding the target vehicle. The dataset consists of three classes: lane_change, safe, and stop; the total numbers of training and test samples are shown in Table 1.

Configuration Details. The Adam optimizer [30] is employed with a learning rate of 0.001, and training runs for 200 epochs. Four NVIDIA RTX 3090 GPUs, each with 24 GB of memory, are used with a batch size of 16.

Table 1. Information on the train and test datasets for vehicle state classification.
Fig. 5. Illustration of the drone image dataset. The view in the pictures is a bird's-eye view.

Object Detection. YOLOv5 [22] is adopted as the object detection algorithm, with car and truck as the two classes considered for training and testing. Table 2 presents the object detection performance on the train and test datasets: training on 9,776 images yields 95.75 mAP(\(AP_{50}\)) and 83.8 mAP(\(AP_{50:95}\)), while testing on 2,200 images yields 91.8 mAP(\(AP_{50}\)) and 80.3 mAP(\(AP_{50:95}\)). Using this detector, other traffic videos are analyzed to identify and extract vehicle information, including position and class.

Table 2. The mAP performance of YOLOv5 on drone train and test dataset.

VSNet Performance. The performance of the network for classifying the final vehicle state, shown in Table 3, is compared to the Dilated Residual Network (DRN) [31]. DRN is a classification model derived from ResNet [32] that replaces convolutional layers with dilated convolutional layers. The proposed model utilizes dilated and deformable convolutional layers to extract features from a wide area, so it is compared with DRN, which is composed of dilated convolutional layers. DRN has four types, A, B, C, and D, with additional dilated blocks and skip connections; types C and D are used in this paper, type D being a simplified version of type C.

Table 3. Comparison result with dilated residual networks (DRN) and vehicle state network (VSNet) for data augmentation.

Compared to DRN_D_22, the first proposed model shows a 16.9% difference in accuracy but reduces the number of parameters by 92.2%. The table also presents the results of applying mosaic and color data augmentation to the proposed model: when both augmentations are applied, it achieves the best performance among the presented results, with an accuracy of 83.41%. Furthermore, compared to the DRN_C_42 model, it achieves 1.63% higher accuracy with 96% fewer parameters.

Table 4 presents the accuracy of the proposed model according to the soft label weight. The highest accuracy of 83.41% is achieved when \(\alpha \) is set to 0.7. As \(\alpha \) gradually decreases, performance decreases as well, because the soft label values diverge further from the original hard label values, which affects learning.

Table 4. Accuracy of vehicle state classification according to the soft label weight \(\alpha \).

5 Conclusion

This study applies mosaic augmentation and soft label assignment to classify vehicle states in drone images. Mosaic augmentation combines existing images into new ones, increasing the amount of data and improving generalization for a limited dataset. Soft label assignment is then used to generate labels for the mosaic images in vehicle state classification. Together, these two techniques contribute to stable training and enhance the accuracy of the proposed classification model.