
1 Introduction

Quickly and accurately extracting vehicle-related data from video is a fundamental topic in unmanned driving [11] and road monitoring. In the transportation field, vehicle speed and density are two important measurements of vehicle travel: with them, speeding and other violations can be detected, which has a great impact on traffic safety. The topic is therefore of broad interest for applications such as traffic supervision and data analysis.

There are many ways to obtain data about a vehicle's travel, and they fall mainly into two classes. The first is to analyze the signals generated by electronic devices such as radar [8] or the vehicle's own attached sensors [9, 10].

The other direction uses cameras or similar devices to capture video for object detection and tracking [4]. Road monitoring can cover vehicles within a certain range, and vehicle data can be measured with object detection and tracking methods from computer vision [13]. However, speed measurements suffer large deviations because a surveillance camera's pitch angle differs for near and far objects; we show this phenomenon in Fig. 1. Moreover, since surveillance cameras are generally mounted low, vehicles on the road occlude and overlap one another, which raises many problems for analysis algorithms.

Fig. 1.

When a vehicle travels at an approximately uniform speed, measurements from such devices can still exhibit the phenomenon shown here: the speed in the first half fluctuates up and down, and the speed in the second half rises suddenly.

Nowadays, it is possible to extract motion information from video sequences thanks to advances in multiple object tracking. One line of pioneering work improves the accuracy of object detectors such as YOLOv3, trained on the PASCAL VOC and Microsoft COCO datasets, which include labels such as person, boat, and vehicle. On the other hand, better matching mechanisms have made multiple object tracking more reliable.

State-of-the-art multiple object trackers are trained on public datasets that contain multiple classes, including vehicles seen from an oblique perspective. These models perform poorly on vehicle tracking from a top-down perspective, because vehicles carry less visual information in this view, as in drone video.

The development of drone equipment makes it possible to capture highways from a fixed altitude. This ensures that the scene in the video is restored more realistically, so that accurate vehicle data can be obtained with suitable techniques. The drone has a panoramic view, which keeps vehicles in the scene complete and free of large scale changes. The high-altitude overhead view also avoids occlusion between vehicles, simplifying the scene and allowing all vehicles in it to be monitored simultaneously.

Using video captured by drones, we design a vehicle speed analysis framework that tracks vehicles in real time, calculates speed from their traces, and corrects the measured speeds. Experimental results show that the framework effectively computes vehicle running speed and performs well at suppressing the noise that arises during measurement. Our main contributions are threefold.

  • We implement a real-time multiple object tracking framework for UAV video based on the YOLOv3 detection system and a Kalman Filter.

  • We propose a Gaussian Filter [5] to remove noise from the vehicle data and refine the calculation of vehicle speed.

  • We build a vehicle dataset from a large number of drone videos that contains the actual speed of each vehicle at each moment.

2 Related Work

In the following, we review known public traffic scene datasets and drone datasets. Almost all traffic scene datasets consist of images from in-vehicle devices and surveillance videos. In most samples the scale of vehicles varies widely, and many vehicles occlude one another, as in KITTI. In drone datasets, the proportion of vehicle samples is small and, most critically, the actual speed of the vehicles is not included, as in the Stanford Drone Dataset.

Obtaining information such as speed and density from a video sequence usually requires a real-time, stable, and accurate tracking framework, typically built by combining a detector with a motion prediction model. Driven by YOLO, which relies on the darknet-53 network, higher-accuracy vehicle predictions and real-time detection are achieved. Adding a Kalman Filter to the tracking framework predicts objects in the next frame by exploiting the temporal and spatial information of vehicles in the video sequence. To correct the vehicle speed and other data generated by the tracking framework, we propose a suitable filter.

Filter selection roughly divides into two categories. One line of work is based on a transformation: the data is transformed from the spatial domain to the frequency domain by the Fourier Transform [14], processed in the frequency domain, and then returned to the spatial domain by the inverse transform [15].

The other line uses spatial domain filters, which process data or signals directly without any transformation, for example by operating directly on the pixels of an image. This approach exploits the distribution of the data. In our experiments we noticed that the vehicle data approximates a certain distribution, and spatial domain filtering is more intuitive and simpler than frequency domain filtering, with lower computational complexity. A minimal sketch contrasting the two routes on a noisy 1-D signal is shown below.
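The following sketch illustrates the two categories under simple assumptions; the cutoff index and kernel width are illustrative placeholders, not tuned values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# A noisy 1-D signal standing in for per-frame tracking data.
signal = np.sin(np.linspace(0, 4 * np.pi, 120)) + np.random.normal(0, 0.2, 120)

# Frequency-domain route: forward FFT, zero the high frequencies, inverse FFT.
spectrum = np.fft.rfft(signal)
spectrum[10:] = 0                        # illustrative cutoff
freq_filtered = np.fft.irfft(spectrum, n=len(signal))

# Spatial-domain route: convolve directly with a Gaussian kernel.
spatial_filtered = gaussian_filter1d(signal, sigma=2.0)
```

The spatial route requires no forward and inverse transform, which is the lower computational complexity referred to above.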

We start from our observation and analysis of the large flutter of bounding boxes that occurs during object tracking. This flutter means the coordinates of the bounding boxes are not accurate enough; it stems from the tracking framework not being perfectly stable [1], which in practice is inevitable. We therefore perform a statistical analysis of the tracking results and try to correct them.

To correct the vehicle data well, we design a Gaussian filter based on the characteristics of the video and the distribution of the vehicle data. This module exploits the fact that the vehicle data is close to a Gaussian distribution [7] and removes noise with the filter.

Our experiments show that the Gaussian filter works well on the speed data of moving vehicles in drone video. We use MSE to evaluate the effect of our vehicle speed correction. On the test dataset, fluctuations in the measured vehicle speed were significantly reduced and brought closer to the true values. There is nonetheless still much room to exploit in data analysis and processing.

3 Our Proposed System

Based on the above analysis, we design a real-time vehicle detection system for UAV video using the YOLO model [16]. In addition, a Kalman Filter is combined into the system to implement the tracking module [17, 18]. Finally, a Gaussian Filter is introduced to suppress noise in the results produced by the tracking module, so that the proposed system achieves more accurate vehicle speed estimates, as shown in Fig. 2.

Fig. 2.

The proposed system consists of a Gaussian Filter and a multiple object tracking module composed of YOLO detection and a Kalman Filter

3.1 YOLO Detection System

The YOLOv3 detection system [1] is a state-of-the-art framework for real-time object detection; we refer to it simply as YOLO. YOLO creatively uses anchor boxes in its network design to obtain direct location predictions. The image is divided into a 7 × 7 grid, and each grid cell uses 9 anchor boxes [2, 6], obtained by clustering and of fixed dimensions, to predict the possible bounding boxes and locations of objects. Although this makes the model predict more than a thousand boxes, objects become easier to detect. With this design, however, the 9 anchor boxes have limited coverage at small scales, and predicting from a grid of cells leaves the \( (x, y) \) locations of the bounding boxes with some instability.
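For reference, YOLOv3 decodes each anchor prediction into box coordinates using the standard formulation from the YOLOv3 paper, where \( (c_x, c_y) \) is the offset of the grid cell, \( (p_w, p_h) \) is the anchor's prior size, and \( t_x, t_y, t_w, t_h \) are the raw network outputs:

\( b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h} \)

Because the center offsets \( \sigma(t_x), \sigma(t_y) \) can vary freely within a cell from frame to frame, small changes in the network outputs appear as sub-cell jitter of the box center, which is one source of the instability just discussed.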

3.2 Kalman Filter

The Kalman Filter [3] is a linear filter for systems that can be described by a linear stochastic difference equation. It keeps track of the estimated state of the system and the variance, or uncertainty, of that estimate.

In our drone video, all vehicles are in lanes and always head in the same direction. Most importantly, the separation between vehicles is very clear, there is no occlusion at all, and the contour of each vehicle remains the same. Some examples are shown in Fig. 3. A Kalman Filter is therefore well suited to removing noise and predicting vehicle location changes in this setting.

Fig. 3.

Since the drone has a panoramic view, vehicles may be crowded but are never occluded in the UAV video.
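As a minimal sketch (not our exact implementation), a constant-velocity Kalman filter for one coordinate of a vehicle's center can be written as follows; the state is \( [x, \dot{x}] \), and the noise magnitudes are illustrative placeholders:

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal 1-D constant-velocity Kalman filter (illustrative sketch)."""

    def __init__(self, dt=1 / 30, accel_noise=0.1, meas_noise=1.0):
        self.F = np.array([[1, dt], [0, 1]])          # state transition
        self.H = np.array([[1, 0]])                   # we observe position only
        self.Q = accel_noise * np.array([[dt**4 / 4, dt**3 / 2],
                                         [dt**3 / 2, dt**2]])  # process noise
        self.R = np.array([[meas_noise]])             # measurement noise
        self.x = np.zeros((2, 1))                     # state: [position, velocity]
        self.P = np.eye(2)                            # state covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[0, 0]                           # predicted position

    def update(self, z):
        y = np.array([[z]]) - self.H @ self.x         # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
```

Per frame, predict() forecasts the vehicle's position and update() corrects it with the detected box center; the same filter applies independently to the y coordinate.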

3.3 Gaussian Filter

To calculate vehicle speed more accurately, we introduce a Gaussian filter, which we show has a significant effect on correcting tracking data [19, 20].

In any object detection framework and in any scientific calculation, errors are inevitable; if the error stays within an allowable range, the results remain acceptable. Although YOLO adopts several strategies to improve localization accuracy, its location predictions still show some instability [2]. In particular, the anchor boxes that YOLO uses to help the CNN [22] locate targets introduce an unstable error between the coordinates of the bounding boxes and the ground truth [21]. We address this issue with a suitable Gaussian filter.

A Gaussian filter is a signal processing filter whose impulse response is a Gaussian function; it is commonly used to eliminate Gaussian noise. The tracking data points recorded when a vehicle has just entered tracking, or is about to leave it, tend to fluctuate greatly, while the data is more stable in between. Sudden acceleration or deceleration of the vehicle can also cause large deviations in the data calculated by our tracking module, meaning the tracking sometimes lags behind the vehicle's actual displacement. The Gaussian filter helps to correct these mutated data points.

Taking vehicle speed as an example, the speed of a vehicle can be regarded as constant within one second. In the actual experiments we use the frame, rather than the second, as the unit of speed measurement to obtain more accurate data.

The per-frame speed of the vehicle is calculated as \( \frac{\left| x_{p} - x_{n} \right|}{1/fps} \), where \( fps, x_{p}, x_{n} \) denote the frames per second, the x coordinate in the current frame, and the x coordinate in the previous frame, respectively. We find that the speed computed per frame varies considerably but regularly. We call this phenomenon “data flutter” (Fig. 4): the tracking data float around the true value and satisfy a certain distribution law. Given this distribution, vehicle speed can be corrected with a suitably parameterized Gaussian filter.
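A minimal sketch of this computation and smoothing step, assuming per-frame x coordinates from the tracker and using SciPy's 1-D Gaussian filter; the kernel width here is a placeholder, whereas in our experiments the filter parameters are estimated from the training portion of the tracking data:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def per_frame_speed(x_coords, fps=30):
    """Speed per frame: |x_p - x_n| / (1 / fps), i.e. displacement * fps."""
    return np.abs(np.diff(x_coords)) * fps

def smooth_speed(speeds, sigma=2.0):
    """Suppress data flutter with a 1-D Gaussian kernel (sigma illustrative)."""
    return gaussian_filter1d(speeds, sigma=sigma)

# Example: x coordinates of one tracked vehicle over 31 frames (synthetic).
x = np.cumsum(np.full(31, 2.0) + np.random.normal(0, 0.3, 31))
raw = per_frame_speed(x, fps=30)        # 30 noisy per-frame speeds
corrected = smooth_speed(raw)           # flutter reduced toward the true value
```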

Fig. 4.

Illustration of data flutter. (a) Within one second (assuming a video frame rate of 30), the forward speed of a uniformly moving vehicle fluctuates greatly from frame to frame. (b) The vehicle speed calculated by the tracking framework is close to a normal distribution.

Analyzing these observations, the errors are partly related to the precision of the multi-target detection framework and partly to the high resolution of the video itself. Since the data follows a certain distribution law, a filter with suitable parameters can greatly improve the tracking results.

We apply the proposed Gaussian Filter to the vehicle speed data and perform confirmatory experiments in Sect. 4 to show the difference.

4 Experiments

Our improvement targets vehicle tracking data in drone video. The method effectively corrects the tracking data and improves detection accuracy. We note that although many public tracking datasets exist, they do not provide real-world motion information, such as speed, for their targets. We therefore carry out our experiments on our own drone video dataset, which contains 45 car tracking groups, the real speed of each vehicle, and 4834 usable data records for training and testing.

In a tracking system, the components chosen for the framework largely determine its real-time performance and detection accuracy. Our implementation builds on YOLO, which performs well in both respects, and on the classic Kalman Filter for prediction; most importantly, we add a suitable Gaussian filter to smooth the vehicle speed data produced by tracking.

4.1 Implementation Details

We train YOLOv3 on 2750 pictures of vehicles taken by a drone at a fixed height. The images have high resolutions of 2704 × 1520 or 1920 × 1080, so vehicles are usually small in the UAV view. We normalize the input size to 416 × 416 and set the learning rate and momentum to 0.001 and 0.9. We adopt the steps strategy, reducing the learning rate by a factor of 10 at 5000, 8000, and 12000 iterations; training stops at a maximum of 15000 iterations. For data augmentation, we set both the saturation and exposure parameters to 1.5 and the hue to 0.1, which increases the amount of data to some extent.
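For concreteness, these hyperparameters correspond to a darknet-style .cfg excerpt along the following lines; this is a sketch assuming the standard darknet configuration keys, not our exact file:

```
[net]
width=416
height=416
momentum=0.9
learning_rate=0.001
policy=steps
steps=5000,8000,12000
scales=.1,.1,.1
max_batches=15000
saturation=1.5
exposure=1.5
hue=.1
```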

The Kalman Filter is a classic approach to linear filtering and prediction problems. We set its deltatime to 0.2 to make targets more “massive”. Since the acceleration is unknown, it is treated as process noise, and we set Accel_noise_mag to 0.1.

Whether detected objects in consecutive frames of the video sequence are regarded as the same object is decided by an IOU judgment strategy.
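A minimal sketch of such an IOU check between two axis-aligned boxes follows; the threshold is an illustrative placeholder, since the exact value is not stated above:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def same_object(prev_box, curr_box, threshold=0.5):  # threshold is illustrative
    """Treat detections in consecutive frames as one object if IOU is high."""
    return iou(prev_box, curr_box) >= threshold
```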

Observing that the bounding boxes flutter within a small range made us think about how to stabilize the detection boxes. By analyzing the tracking data of every car, we found that the error between the predicted value and the truth value is close to a certain normal distribution (Fig. 4). A Gaussian filter with suitable parameters can therefore yield strong performance on the tracking data extracted from the video.

Datasets.

Existing public tracking datasets lack object motion information, while providing it is part of our system's main contribution. We therefore built our own vehicle dataset from drone video, with speed information recorded by speedometer; some examples are shown in Fig. 3. We collect the data points in groups of N (e.g., 30) frames, one group per second of each video. The data in Table 1 comes from video at 30 fps, so every second contributes 30 data points for analysis. The grouped data is split into 810/270 data points for training and testing.

Table 1. Vehicle speed evaluated with MSE on the test dataset. We choose three cars, “Car 1, Car 2, Car 3”, over the last three seconds of the video, denoted “Time 1, Time 2, Time 3”. The vehicle labels refer to the chosen cars as they appear in different videos. The groups in bold are the better ones: their MSE is lower, and their mean and output error are smaller.

Performance Measure.

The data analysis process is treated as a regression task, so the mean square error (MSE) is used as the main performance measure.
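For completeness, with \( n \) test points, measured speeds \( \hat{v}_{i} \), and true speeds \( v_{i} \), this is the standard

\( \text{MSE} = \frac{1}{n}\sum\nolimits_{i=1}^{n} \left( \hat{v}_{i} - v_{i} \right)^{2} \)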

For fairness, we choose the same car on the same section of road, and we pick videos in both the upstream and downstream directions for testing.

Upstream Vehicle Video.

In the video, “upstream” means the car drives from right to left in the scene. As shown in Fig. 5, our tracking framework completes the tracking well. We use the tracking data from the earlier period in which the target vehicle appears as training data to estimate the parameters of the corresponding Gaussian filter, and then apply it to estimate and correct the vehicle's speed over a later period.

Fig. 5.

The vehicle selected by the red rectangle is the target vehicle. (a) The red arrow indicates the direction of motion, from right to left, and (b) the red thin line is the trajectory traced by the vehicle. (Color figure online)

For this video, we use about 270 data points of the target vehicle's speed to estimate μ and σ, which gives us a suitable Gaussian filter. We then use the filter to correct the vehicle's speed data for the last three seconds of the video.

As shown in the first row and the first column of the second row of Fig. 6, the tracking data is handled well. The red lines with diamonds show unprocessed vehicle speeds; the blue lines show the speed data processed by the Gaussian filter. After the last three seconds of the video's tracking data are processed by the Gaussian filter, the blue-line speed is clearly stabilized: jitter is significantly reduced, and the curve is closer to the real vehicle speed shown by the green line. Our MSE measure improves by about 71.73%, and another set of data improves by about 98.6%. The results are shown in the first three rows of Table 1.

Fig. 6.

Overview of our resulting data plots. (a) Speed line chart of the upstream vehicle in the third-to-last second of the video; (b) and (c) show the second-to-last and last seconds. (d) Speed line chart of the downstream vehicle in the third-to-last second; (e) and (f) show the second-to-last and last seconds. (Color figure online)

Downstream Vehicle Video.

In the video, “downstream” means the car drives from left to right in the scene, as shown in Fig. 7. This set contains 210 tracking data points on the speed of the same vehicle appearing in the downstream video for training, and 90 tracking data points for testing. The testing points are divided into 3 groups, corresponding to the last three seconds of the vehicle's speed data. We train on the 210 tracking points to obtain a usable Gaussian filter for testing. The results are shown in the middle three rows of Table 1.

Fig. 7.

The vehicle selected by the red rectangle is the target vehicle. (a) The red arrow indicates the direction of motion, from left to right, and (b) the blue thin line attached to the target vehicle is the trajectory traced by the vehicle.

As shown in Table 1, with a suitable Gaussian filter, the three sets of tracking data grouped by vehicle are well corrected after processing: the data is distributed near the real-data baseline with less jitter and error. The method improves the MSE of the data by about 59.21%–95.84%. We also use the “AVERAGE” of the errors to measure noise suppression. The significantly smaller MSE values in the table show that the vehicle speed at different times is closer to the true value when our method is used, and the “AVERAGE” indicates that the system's output error is lower than before. Finally, the system outputs the processed data, such as vehicle speed, which can be applied in traffic analysis.

The statistics in Table 1 show that the Gaussian Filter notably improves the accuracy of the vehicle speed and similar quantities obtained by the tracking framework. All of the above experiments use our vehicle dataset, which contains the real speed of the vehicles.

5 Conclusion

After constructing a vehicle dataset containing vehicle motion information, we combine YOLO and a Kalman Filter into a tracking module that extracts motion information such as vehicle speed. The module tracks vehicles appearing in drone video in real time (Fig. 8). By statistically analyzing the vehicle speeds extracted by the tracking module, we propose a filter that refines the vehicle speed and achieves an excellent improvement under MSE evaluation on the vehicle dataset we built. The Gaussian Filter we use is effective at removing the noise present in vehicle speed, and in the last stage the system outputs the corrected, more accurate vehicle speed.

Fig. 8.

Illustration of vehicle tracking in drone video. The line following the vehicle is the trajectory of the vehicle.

We hope that making our implementation details publicly available will help the community adopt these useful strategies for dealing with tracking data and advance related techniques.