1 Introduction

Traffic management using cameras mounted on drones or airplanes has become a hot topic [18, 29]. Airborne surveillance offers several advantages over traditional monitoring techniques such as stationary cameras and bridge sensors: it provides wider coverage at lower cost and responds better in emergencies. Moreover, UAVs are highly portable and can collect traffic data over areas with difficult geography, where conventional data-gathering techniques cannot be applied. However, the data collected by UAV cameras must be turned into useful information. For traffic monitoring, vehicle detection and tracking are essential and challenging tasks. While many existing approaches handle surveillance detection and tracking challenges such as shadows, occlusion, and reflections using cameras at a fixed location, the most challenging issue in aerial video is the camera movement [20]. In this case, both the background and the foreground move in the image due to the irregular motion of the camera mounted on the UAV.

Traditional approaches that extract traffic information from static cameras, based on blob detection or background subtraction, cannot work efficiently on UAV-based videos. Vehicle detection and tracking are usually handled as two separate processes. Vehicle detection in images depends on spatial appearance features, while vehicle tracking depends on both spatial appearance and temporal motion features. Detection strategies based on visual and appearance features in the spatial domain have started to saturate [2, 27, 36], and tracking strategies that depend on manual detection initialization are not practical in real applications [7, 8, 16]. Various approaches have been introduced in the literature to tackle vehicle detection and tracking in images and videos. However, the problem remains open due to changes in appearance, scale, and viewpoint of the objects, shadows, camera movement, illumination conditions, and partial occlusion. While significant progress has been made on detection or tracking separately, most existing methods still suffer from high computational complexity and low accuracy.

Most vehicle surveillance strategies rely mainly on vehicle detection or vehicle tracking with good accuracy, yet false positive and false negative results still occur. Achieving real-time vehicle detection and tracking in airborne videos is very difficult, so some researchers focus only on the detection problem [17, 24, 27, 29, 30, 32, 38] or only on the tracking problem [7, 8, 11]. Moreover, tracking multiple vehicles moving at different speeds, under shadow and illumination changes caused by daylight and moving clouds, is another challenge. Little effort has been made so far to use tracking information to improve detection accuracy and to reduce storage and training costs. Furthermore, object detectors usually suffer from missed detections and false positives, which deteriorate the tracking process. Because of these issues, this paper presents an effective collaborative strategy between detection and tracking information to improve both processes.

Detecting and tracking vehicles simultaneously is an efficient strategy for achieving accurate detection and tracking results. Morphological operations have been used in object detection with good foreground/background discrimination [38], but their performance is not perfect. Detection and tracking accuracy can be enhanced by reducing false-positive and false-negative results through a refining process that improves the morphological foreground detection. Hence, a new vehicle detection and tracking strategy is proposed in this paper, based on a collaboration between detection information collected using Top-hat and Bottom-hat transformations and tracking information from the KLT tracker. The work relies on analyzing feature-point information across a fixed number of frames and on using the temporal information of the detection and tracking feature points between framesets to achieve better detection and tracking decisions.

In this work, we exploit the strengths of both detection and tracking information to solve the problems associated with each of them separately. An efficient collaborative real-time approach is introduced that strengthens the interaction between a non-trained object detector based on Top-hat and Bottom-hat transformations and a KLT-based tracking methodology. The paper introduces a new simultaneous vehicle detection and tracking method built on a three-step procedure applied to each frameset. First, the Top-hat and Bottom-hat transformations are exploited in the vehicle detection process; second, the detections are refined and clustered using optical flow and k-means clustering. The Top-hat and Bottom-hat transformations are applied to the first frame of the frameset, while the refining analysis considers the remaining frames, yielding a robust discrimination between foreground vehicles and noisy background regions. Third, an effective connecting strategy assigns each vehicle to its corresponding trajectory based on the collected detection and tracking information. As a result, detection and tracking accuracy increase, and the algorithm works efficiently in different and challenging environments. Experiments carried out on airborne videos confirm superior performance in terms of precision, recall, tracking accuracy, and execution time compared to existing techniques. The method can also be used in real-time applications thanks to its minimal execution time.

The paper is organized as follows: related work is introduced in Section 2. The proposed vehicle detection and tracking method is described in Section 3. In Section 4, the experimental results are presented and evaluated. Finally, the work is concluded in Section 5.

2 Related work

Most existing approaches handle only one of the two research topics, either vehicle detection or vehicle tracking. Vehicle detection methods can be divided into two groups: methods based on machine learning and methods based on segmentation. Machine-learning-based methods, including sliding-window approaches [24, 28, 30] and appearance-feature approaches [2, 27, 36], first extract vehicle features and then feed them into classification models that identify the class of the test vehicle. These methods can only detect targets from pre-learned object classes. In [27], a support vector machine (SVM) was used for detection after pixel classification with an AdaBoost classifier trained on Haar-like features. Deep learning strategies have been suggested for object detection [10, 12, 14, 22], but their detection accuracy depends on the training images and they incur higher computational complexity. The authors of [37] provide a review and discussion of deep learning strategies for object detection. Segmentation-based methods include morphological operations [38], thresholding [21, 35], and edge detection [31], typically combined with connected-component labeling. To further enhance those methods, several detection approaches were presented in [17] and [32]. In [17], the authors proposed a technique that combines visual, temporal, and spatial features. They first divide the video frames into uniform regions and build a region adjacency graph for each frame. A multi-graph matching algorithm then matches each region with its correspondences across frames, and a graph coloring method finally labels each region as background or foreground. More recently, the authors of [32] used image stacking in a novel way: image registration is applied to small vehicles only, and a warping process blurs the stationary background in the neighborhood of the moving vehicles. The main goal of this algorithm is to suppress disturbing background image parts, which are smoothed so that the observed vehicle can be extracted from the background. However, these strategies demand high computational resources.

Regarding vehicle tracking, various efforts have been made for multiple-target tracking in aerial videos [4, 7, 8, 23, 26, 34]. Existing methods can be classified into two categories: tracking through detection and tracking through prediction. In tracking-through-detection methods [23, 34], also called Data Association Tracking (DAT), vehicles are first localized in the scene using various detection strategies, and then each target's identity is assigned via a data association scheme. Multi-Hypothesis Tracking (MHT) [26] and Joint Probabilistic Data Association (JPDAF) [4] are the two most widely used algorithms, but they rely on assumptions that make them unsuitable for tracking a large number of targets, and they have a high computational cost. Tracking-through-prediction methods track multiple targets by running several single-target trackers (STT), each performing prediction-based tracking in two main steps: an estimated region of interest is first predicted, and the target is then localized by the tracker in this region. The prediction step is established using various target motion models. The Kalman filter is used for tracking linear object motion, i.e., objects that travel with constant movement parameters (acceleration or velocity) [25]. However, the linear motion model cannot handle the diverse target dynamics in aerial surveillance videos. Non-linear motion is handled using the Extended Kalman Filter (EKF) [5], the Unscented Kalman Filter (UKF) [33], and the Particle Filter (PF) [11, 15]. Despite the effectiveness of such non-linear motion models in handling tracking challenges, their performance degrades when tracking a large number of targets due to model complexity. Furthermore, any error in the prediction phase may lead to accumulated and irreversible tracking failure in the absence of detection responses. A tracking strategy based on dynamic grouping was introduced in [8], where the authors used a collaborative framework with two levels of tracking. The higher level forms a relevance network and divides vehicles into several groups based on velocity magnitude and direction, where the relevance is estimated from information obtained at the lower level with the help of Kalman filtering and histogram matching. However, grouping vehicles is difficult when their movement is not uniform. A hierarchical layered tracking structure was proposed in [7] to build an inter-object complementary assistance system for vehicle tracking. This system provides more reliable information about the motion model using a novel non-linear motion and interaction model. Nevertheless, these strategies have limitations, including manual detection initialization, higher complexity, and the inability to work efficiently with different motion models.

Recently, several authors have worked on simultaneous vehicle detection and tracking, as in [1, 3, 16]. In [16], a hybrid framework was developed in which frame differencing is combined with background subtraction for detection after a frame registration process, followed by a non-linear motion model, called adaptive velocity adaptive acceleration (AVAA), for tracking. However, the object bounding boxes are initialized by hand in the first frames to produce the initial velocity and acceleration, and the strategy has a low processing speed. In [3], the authors proposed a multiscale target hypothesis in the detection part, where vehicles were characterized by a blob detection method based on the Canny edge detector. The temporal consistency of these hypotheses was then tested to eliminate outlier observations and enhance the detection results. Finally, tracking was accomplished by minimizing the spatial distance between associated vehicles in consecutive frames. The disadvantage of this method is the failure of the Canny edge detector, despite the good achievable precision. The authors of [1] proposed a vehicle detection and tracking strategy using the Harris detector and the Lucas-Tomasi feature tracker to detect good features that are tracked across the image frames. Background points were removed by measuring temporal changes in the histogram of the pixels around each point, leaving the foreground features, which were then clustered into separate trackable vehicles according to their movement angles and displacement magnitudes. The algorithm achieved real-time detection and tracking of vehicles in airborne videos but with low precision and recall. This paper introduces a new approach to achieve fast and accurate vehicle detection and tracking. Vehicles are detected in the first frame of each frameset by analyzing their feature characteristics across the remaining consecutive frames, while background regions are excluded. The vehicle features are updated every fixed number of frames, together with an effective algorithm that assigns each vehicle to its corresponding trajectory.

3 The proposed algorithm

The main steps of the proposed automatic vehicle detection and tracking scheme are described in the following subsections, as shown in Fig. 1.

Fig. 1

Workflow chart of the proposed system for a specific frameset K

3.1 Initial object detector

This step initially determines the object regions of interest in the first frame of every fixed number of frames, as shown in Fig. 2. These regions are estimated using a combination of grayscale closing with the Bottom-hat transformation and grayscale opening with the Top-hat transformation [38], as described in equations 1, 2, 3 and 4.

$$ \text{Opening : }I\circ ST = (I\ominus ST )\oplus ST $$
(1)
$$ \text{Top-hat : }TOH = I-(I\circ ST ) $$
(2)
$$ \text{Closing : }I\bullet ST = (I\oplus ST )\ominus ST $$
(3)
$$ \text{Bottom-hat : }BOH = (I\bullet ST)-I $$
(4)

Where I denotes the original image, ST is the structuring element, ∘ represents the grayscale opening, and ∙ represents the grayscale closing. ⊖ and ⊕ represent the erosion and dilation operators, respectively. TOH denotes the frame image after the Top-hat transformation, and BOH denotes it after the Bottom-hat transformation.

Fig. 2

Regions of detection resulting from the Top-hat & Bottom-hat transformations

The initial detection method includes the following major steps. First, the structuring element of the morphological transformation is selected based on the camera spatial resolution and the altitude above the ground. Then, the grayscale opening, the grayscale Top-hat transformation, and Otsu thresholding are employed to detect bright vehicles on a darker background, whereas the grayscale closing, the grayscale Bottom-hat transformation, and Otsu thresholding are used to detect dark vehicles on a lighter background. In this way, vehicles are detected on both light and dark backgrounds. The hypothesis vehicles obtained in the two cases are then overlaid, and identical hypotheses are merged into a single one with a closing transformation. Bright objects are extracted by the Top-hat transformation, while dark objects are extracted by the Bottom-hat transformation. More details can be found in [38].
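The following Python/OpenCV sketch illustrates this combination of top-hat/bottom-hat filtering with Otsu thresholding. It is a minimal illustration, not the authors' exact implementation: the function name, the structuring-element radius, and the connected-component post-processing are assumptions.

```python
import cv2
import numpy as np

def initial_vehicle_detection(gray, se_radius=10):
    """Rough sketch of the top-hat / bottom-hat detector (Section 3.1).

    gray      : single-channel frame (uint8)
    se_radius : structuring-element radius in pixels, chosen from the
                camera spatial resolution and flight altitude (assumed here)
    """
    st = cv2.getStructuringElement(
        cv2.MORPH_ELLIPSE, (2 * se_radius + 1, 2 * se_radius + 1))

    # Top-hat response: bright vehicles on a darker background (Eqs. 1-2)
    tophat = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, st)
    # Bottom-hat response: dark vehicles on a lighter background (Eqs. 3-4)
    bothat = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT, st)

    # Otsu thresholding of each response to obtain hypothesis masks
    _, mask_bright = cv2.threshold(tophat, 0, 255,
                                   cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    _, mask_dark = cv2.threshold(bothat, 0, 255,
                                 cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Overlay the two hypotheses and merge overlapping blobs with a closing
    mask = cv2.morphologyEx(cv2.bitwise_or(mask_bright, mask_dark),
                            cv2.MORPH_CLOSE, st)

    # Connected components give the candidate vehicle bounding boxes
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    return [tuple(stats[i, :4]) for i in range(1, n)]  # (x, y, w, h) boxes
```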

Morphological operations are applied with a structuring element chosen according to the spatial resolution of the camera sensor mounted on the UAV. The area covered by each image pixel is given by the camera spatial resolution, which can be calculated from the resolution cell B (the area covered on the Earth's surface). The resolution cell is determined by multiplying the camera height above the ground, C, by its instantaneous field of view (IFOV), A, where the IFOV is the angular cone of visibility of the camera, as illustrated in Fig. 3.
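As a rough worked example of this relation (the function name and all numeric values below are illustrative assumptions, not values taken from the paper), the ground footprint of one pixel is B = C · A with the IFOV A expressed in radians, and a structuring-element radius can be derived from the expected vehicle size divided by this footprint:

```python
def structuring_element_radius(height_m, ifov_mrad, vehicle_size_m=4.5):
    """Illustrative computation of the SE radius (all values are assumptions).

    height_m       : camera height above ground, C
    ifov_mrad      : instantaneous field of view, A, in milliradians
    vehicle_size_m : expected vehicle length on the ground
    """
    # Resolution cell: ground footprint of one pixel, B = C * A
    gsd_m = height_m * (ifov_mrad * 1e-3)
    # Radius (in pixels) of a disc roughly covering half a vehicle
    return max(1, round(vehicle_size_m / (2.0 * gsd_m)))

# e.g. an assumed 150 m altitude with a 1.5 mrad IFOV gives ~0.225 m/pixel,
# so a disc of radius ~10 pixels for a ~4.5 m vehicle
print(structuring_element_radius(150, 1.5))  # -> 10
```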

Fig. 3

Structuring element selection based on drone altitude and camera spatial resolution

This detection algorithm has low computational complexity, since it can detect various vehicle types without a pre-training process, and it achieves satisfactory recall. Nonetheless, it also detects some false-positive objects, which lowers precision. These false positives are eliminated by employing K-means clustering and the optical-flow tracking information, as described in the following subsections. The false positives obtained from background regions have different motion characteristics from the foreground vehicles; hence, we exploit the motion information of their feature points to discard them before the final decision. Dynamically estimating the number of vehicles is another key factor in the detection phase. The detection process is therefore run regularly, to capture changes in the number of moving objects and to keep the detector and tracker automatically updated.

3.2 Feature points detector

Selecting good features within the bounding boxes produced by the detection step is necessary for robustly tracking feature points across frames. While the Harris detector is the most famous corner detector, the Shi-Tomasi detector performs better [20]. Here, the Shi-Tomasi detector is applied within each region produced by the detection algorithm described in the previous section to extract robust corner points, as illustrated in Fig. 4.
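A minimal sketch of this step, assuming OpenCV's goodFeaturesToTrack as the Shi-Tomasi implementation; the function name and the parameter values are illustrative assumptions.

```python
import cv2
import numpy as np

def extract_corner_points(gray, boxes, max_corners=10, quality=0.01, min_dist=3):
    """Shi-Tomasi corners inside each detected bounding box (illustrative values)."""
    points = []
    for (x, y, w, h) in boxes:
        roi = gray[y:y + h, x:x + w]
        corners = cv2.goodFeaturesToTrack(roi, maxCorners=max_corners,
                                          qualityLevel=quality,
                                          minDistance=min_dist)
        if corners is None:
            continue
        # Shift ROI coordinates back to full-frame coordinates
        points.append(corners.reshape(-1, 2) + np.float32([x, y]))
    return points  # one array of (x, y) corners per detected region
```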

Fig. 4

Feature points extraction process using Shi-Tomasi detector in the detected regions of interest

Periodically updating the vehicle feature points is essential to guarantee long-term tracking, because these features may disappear as a result of illumination changes and out-of-plane rotation. Therefore, the detection process is repeated regularly every fixed number of frames, NFrames. The feature points newly extracted by the detection step are combined with the features tracked up to the last frame of the previous NFrames. Hence, the system is updated every NFrames frames by integrating both the tracking and detection feature points.

3.3 Motion and foreground vehicles clustering

This step has two main purposes: removing the background regions to eliminate false-positive detections, and clustering the extracted foreground vehicles. Both are achieved using the K-means clustering algorithm and optical-flow information. The K-means clustering algorithm and the way it is combined with optical flow are explained in the following subsections.

3.3.1 K-means clustering

K-means clustering is an unsupervised, computationally efficient algorithm suited to large datasets. Initially, K samples (K is a positive integer) are chosen randomly to serve as the initial centroids approximating the centers of the initial clusters. In short, K-means groups the objects into K clusters based on their features. The grouping is done by minimizing the sum of squared (Euclidean) distances between the data points and their corresponding cluster centroid. The K-means algorithm iterates over the three steps below until convergence (i.e., until the clusters are stable).

  • Determine the centroid coordinates.

  • Determine the distance between the centroids and each data feature.

  • Group the data based on the minimum distance to find the closest centroid.

In the second and third steps, the data points are assigned to clusters using the centroids and the nearest-neighbor rule. The data point Cj is assigned to cluster \( CL_{k} \left (m\right ) \) if

$$ \left | C_{j}-M_{k} \left( m\right) \right | < \left | C_{j}-M_{c} \left( m\right) \right | \qquad \text{for all } c=1,2,\ldots,K \text{ and } c \neq k $$
(5)

Let \( C=\left \{ C_{1}, C_{2}, C_{3},\ldots,C_{N} \right \} \) be the data set (feature coordinates, angles, or displacements) and K the number of clusters; Ml is the centroid of cluster CLl, defined by the following equation.

$$ M_{l} \left( m\right) = \frac{1}{n_{l}}\sum\limits_{C \in CL_{l}}C \qquad , \ l=1,2,3,\ldots,K $$
(6)

Where nl is the number of features in cluster CLl, l is the cluster index, m is the iteration number, and N is the number of features. Convergence is reached when none of the cluster centroids change, that is,

$$ M_{l} \left( m+1\right) = M_{l} \left( m\right) \qquad \text{for } l=1,2,3,\ldots,K $$
(7)

at which point the final clusters are obtained.

3.3.2 Motion and vehicle clustering

To separate the vehicle clusters from the background regions, we exploit the fact that each has different motion characteristics. Optical-flow-based feature point tracking is a suitable tool for this purpose, since it offers accurate feature-point matching with fast processing speed. After the detection step and the extraction of robust points in frame f(x, y, t), each detected corner point is tracked from frame t to frame t + 1 using the Kanade-Lucas optical-flow approach. The optical-flow result on the first frame pair is a set of vectors C, e.g., Ci = (Si, θi), where Si and θi are given by

$$ S_{i} = \sqrt{(X_{2}-X_{1})^{2}+(Y_{2}-Y_{1})^{2}} $$
(8)
$$ \theta_{i} = \arctan\left (\frac{Y_{2}-Y_{1}}{X_{2}-X_{1}} \right ) $$
(9)

Each element in C corresponds to a feature point Pi tracked from frame t to frame t + 1, where Si and θi are the displacement magnitude and angle of that feature point, respectively. In the initial stage, before background elimination, each feature point is a vertex Pi = (Si, θi), and K-means clustering groups the feature points based only on their displacement magnitudes and angles. The K-means clustering algorithm successfully separates the feature points of the vehicle regions from those of the background regions, as shown in Fig. 5 (a), by considering the point trajectories over multiple frames.
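The sketch below illustrates this background/foreground separation: points are tracked with pyramidal Lucas-Kanade, the displacement magnitude S and angle θ are computed as in Eqs. 8-9, and a two-cluster k-means splits the points. The function name, parameter values, and the rule for deciding which cluster is foreground are assumptions of this sketch.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def separate_foreground(prev_gray, next_gray, points):
    """Track points with Lucas-Kanade and split them into two motion clusters."""
    p0 = points.reshape(-1, 1, 2).astype(np.float32)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)
    ok = status.ravel() == 1
    p0, p1 = p0[ok].reshape(-1, 2), p1[ok].reshape(-1, 2)

    dx, dy = p1[:, 0] - p0[:, 0], p1[:, 1] - p0[:, 1]
    S = np.hypot(dx, dy)          # displacement magnitude, Eq. 8
    theta = np.arctan2(dy, dx)    # displacement angle, Eq. 9

    # K = 2: one cluster for background points, one for vehicle points
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.column_stack([S, theta]))

    # Heuristic (assumption): the cluster with larger mean displacement is foreground
    fg = labels == (0 if S[labels == 0].mean() > S[labels == 1].mean() else 1)
    return p1[fg], S[fg], theta[fg]
```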

Fig. 5

Clustering the foreground vehicles using the K-means clustering and points information analysis

It should be noted that the trajectories of the background feature points over multiple frames exhibit either very small or very large variations in S and θ [13], whereas the trajectories of the vehicle feature points show motion variations that are neither too small nor too large. Hence, the background regions can be eliminated using the K-means clustering approach. At this stage, K is set to two: one cluster for the foreground and one for the background.

After removing the background cluster, as shown in Fig. 5 (b), the remaining detection regions contain only the foreground vehicles, which can then be grouped using K-means clustering, as shown in Fig. 5 (c). In this case, the elements of the vector CF contain SF, θF, and the point coordinates, where F refers to the foreground, and the value KF is the number of detected vehicles. Every feature point is a vertex Pi = (Xi, Yi, Si, θi), where Xi and Yi are its X and Y coordinates in the current frame, and Si and θi are its displacement and angle between two consecutive frames, respectively; in Fig. 5 (c), four vehicles are detected.
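Continuing the sketch above (all names are illustrative), the foreground points can then be grouped into KF vehicle clusters on the feature vector (X, Y, S, θ). In practice the coordinates and motion features have different scales, so a normalization step would typically precede the clustering; it is omitted here for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_into_vehicles(points_xy, S, theta, n_vehicles):
    """Cluster foreground feature points into vehicles (illustrative sketch).

    points_xy  : (N, 2) foreground point coordinates in the current frame
    S, theta   : per-point displacement magnitude and angle
    n_vehicles : KF, the number of detected vehicle regions
    """
    features = np.column_stack([points_xy, S, theta])  # P_i = (X, Y, S, theta)
    labels = KMeans(n_clusters=n_vehicles, n_init=10).fit_predict(features)
    return [points_xy[labels == k] for k in range(n_vehicles)]
```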

3.4 Connecting vehicle cluster trajectories

The detected vehicle regions, together with their most robust feature points inside each cluster, are assigned a bounding box created from the feature-point coordinates of that cluster. Each bounding box, with its feature points, receives a unique ID that is tracked within the frameset. To decide whether a newly detected vehicle should receive the same ID as an existing one or a new ID, the intersection area between the newly detected bounding box and the old tracked bounding boxes is calculated. The following two cases are considered; a minimal code sketch of this rule is given after the list.

  1. Maintaining the same vehicle cluster: the intersection area in this case is greater than a predetermined percentage α, so the newly detected vehicle keeps the label (ID) of its matched old vehicle, as shown in Fig. 6 (b). The same ID is also kept when the detection algorithm fails to detect a tracked vehicle; in that case the tracking result is carried over with the same ID to the next frameset.

  2. Creating a new vehicle cluster: this case is established when there is no intersection area, or when the estimated intersection area is less than or equal to α, so the newly detected vehicle is assigned a new label (ID). This case is shown in Fig. 6 (a), where the newly detected vehicle marked in red in frame 121 has no intersection area with the existing tracked vehicles in frame 120, so it is registered as a new vehicle.
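A minimal sketch of this ID-assignment rule, assuming axis-aligned (x, y, w, h) boxes and an overlap ratio normalized by the smaller box area (this normalization is an assumption; the text only states an intersection-area percentage):

```python
def assign_track_ids(new_boxes, tracks, alpha=0.25, next_id=0):
    """Match new detections to existing tracks by overlap area (sketch).

    new_boxes : list of (x, y, w, h) from the current detection step
    tracks    : dict {track_id: (x, y, w, h)} carried over by the tracker
    alpha     : overlap threshold (25% as used in the experiments)
    next_id   : first unused ID
    """
    def overlap(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        ih = max(0, min(ay + ah, by + bh) - max(ay, by))
        return iw * ih

    assignments = {}
    for box in new_boxes:
        best_id, best_ratio = None, 0.0
        for tid, tbox in tracks.items():
            # Ratio relative to the smaller of the two boxes (assumption)
            ratio = overlap(box, tbox) / min(box[2] * box[3], tbox[2] * tbox[3])
            if ratio > best_ratio:
                best_id, best_ratio = tid, ratio
        if best_ratio > alpha:        # case 1: keep the existing ID
            assignments[best_id] = box
        else:                         # case 2: create a new vehicle ID
            assignments[next_id] = box
            next_id += 1
    return assignments, next_id
```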

Fig. 6

Connecting vehicle trajectories based on the detected and tracked rectangular boxes

3.5 Vehicle points tracking

The obtained vehicle features are tracked using the KLT tracker [6]. First, the vehicle corner points are extracted and clustered in the first frame of each frameset (NFrames). The KLT tracker is then used to track them through the remaining frames, from the second frame, as shown in Fig. 7 (a), to the Nth frame, as shown in Fig. 7 (b).

Fig. 7

Tracking feature points from the second frame (a) to the last frame (b) in the frameset

Vehicle bounding boxes are updated continuously in each frame using the point coordinates produced by the KLT tracker, and each vehicle trajectory is built between every pair of images by measuring the displacement vector. This vector is added to the previous corresponding vector in every frame to produce a continuous trajectory for each vehicle.

To solve the problem of a fixed bounding-box size, we propose an adaptive bounding box based on the points produced by the optical flow: in every frame, after obtaining the strongest points as in Fig. 8 (a), the correct bounding box can easily be formed relative to these points, as in Fig. 8 (b). This is done using the maximum and minimum feature-point coordinates Xmin, Ymin, Xmax, and Ymax of each vehicle, as shown in Fig. 9

$$ \text{Bounding Box} = \left [ X_{min}\quad Y_{min}\quad \left (X_{max}-X_{min} \right )\quad \left (Y_{max}-Y_{min} \right )\right ] $$
(10)
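A short sketch of this per-frame update, assuming OpenCV's pyramidal Lucas-Kanade implementation of the KLT tracker; each box is re-formed from the extreme point coordinates as in Eq. 10. The function name and data layout are assumptions.

```python
import cv2
import numpy as np

def update_vehicle_boxes(prev_gray, next_gray, vehicle_points):
    """Track each vehicle's points with KLT and refit its bounding box (Eq. 10)."""
    boxes, tracked = {}, {}
    for vid, pts in vehicle_points.items():
        p0 = pts.reshape(-1, 1, 2).astype(np.float32)
        p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)
        p1 = p1[status.ravel() == 1].reshape(-1, 2)
        if len(p1) == 0:
            continue  # lost vehicle: it is re-detected at the next frameset
        x_min, y_min = p1.min(axis=0)
        x_max, y_max = p1.max(axis=0)
        boxes[vid] = (x_min, y_min, x_max - x_min, y_max - y_min)  # Eq. 10
        tracked[vid] = p1
    return boxes, tracked
```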
Fig. 8

Variable bounding box creation: the vehicle rectangular boxes shown in (b) are created from the vehicle feature-point coordinates shown in (a)

Fig. 9

Varying bounding box coordinates, each bounding box is created based on the minimum and maximum feature point coordinates

4 Experimental results

The proposed method is evaluated, compared with existing methods [1, 3, 7, 8, 16, 17, 27, 32], and the experimental results are discussed in this section. For a fair comparison with recent detection and tracking strategies, we focus on datasets containing vehicle objects only, because the Top-hat & Bottom-hat transformations and the optical-flow information cannot discriminate between vehicles and other object classes. The DARPA VIVID dataset [9] and a police chase video are used to validate the proposed method. VIVID is widely used as a benchmark for evaluating moving-object detection and tracking algorithms. It contains various challenges such as a varying number of vehicles, occlusion, pose variation, blocked vehicles, and out-of-plane rotation. In addition, a real-world airborne video of a police chase, containing 20 vehicles moving at high speed without following any pattern, is used. It should be noted that ground-truth information is not available for the vehicular targets in these datasets; hence, the tracking accuracy was evaluated by manually labeling the results, as in [3].

Here, the detection process was run every ten frames, NFrames = 10, to reduce the computational complexity. As described in Fig. 3, the structuring element was chosen to be a disc of radius 10 and 20 pixels for the VIVID and police chase videos, respectively, based on the camera spatial resolution, the vehicle size, and the altitude above the ground. A newly detected vehicle takes the ID of the corresponding vehicle, based on the trajectory history, when the intersection area between them is greater than 25 percent, i.e., α = 25%. We examined the algorithm with different values of α and found that if α is too high, the same vehicle may be classified as a new one, which increases the identity-switch error and consequently lowers the tracking accuracy. The value α = 25% yielded the best accuracy in our experiments. The proposed method offers a robust, automatic multi-vehicle detection and tracking system. A quantitative evaluation of the detection is discussed in Section 4.1 and compared with recent approaches [1, 3, 17, 27, 32]. In Section 4.2, a comparison with tracking approaches [1, 3, 7, 8, 16] is presented to examine the tracking performance of the proposed methodology.

4.1 Detection evaluation

The comparison with recent techniques was conducted to confirm the contribution of the proposed method, as illustrated in Tables 1 and 3. The detection accuracy is evaluated using the quantitative performance metrics commonly adopted as a standard evaluation [19], namely precision and recall. Precision is calculated as the percentage of correctly detected vehicle pixels [true positives (TP)] over the total number of detected object pixels, including TPs and false positives (FP).

$$ Precision=\frac{\left | TP \right |}{\left | TP \right |+\left | FP \right |} $$
(11)

Recall is the ratio of correctly detected vehicle pixels to the total number of actual vehicle pixels, which includes the false negative pixels (FN).

$$ Recall=\frac{\left | TP \right |}{\left | TP \right |+\left | FN \right |} $$
(12)
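For completeness, a direct pixel-level implementation of Eqs. 11 and 12, assuming binary detection and ground-truth masks (the function name is an assumption):

```python
import numpy as np

def precision_recall(pred_mask, gt_mask):
    """Pixel-level precision and recall (Eqs. 11-12) from binary masks."""
    pred_mask = pred_mask.astype(bool)
    gt_mask = gt_mask.astype(bool)
    tp = np.logical_and(pred_mask, gt_mask).sum()      # true positive pixels
    fp = np.logical_and(pred_mask, ~gt_mask).sum()     # false positive pixels
    fn = np.logical_and(~pred_mask, gt_mask).sum()     # false negative pixels
    return tp / (tp + fp), tp / (tp + fn)
```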

The precision and recall results for approaches focused on detection and on simultaneous detection and tracking are reported in Tables 1 and 3, respectively. The detection algorithm based on the stacking technique [32] exceeds the accuracy of the state-of-the-art algorithms. However, its precision is reduced by the false-positive detections that arise from the registration process. As discussed in Section 2, this method registers the images around small vehicles only and warps them to blur the stationary background in the neighborhood of the moving vehicles, so that the observed vehicle can be extracted from the smoothed background; such a strategy requires a large computational effort.

Table 1 Detection accuracy comparison of different methods on VIVID dataset

In [27], an AdaBoost classifier trained on Haar-like features was used to classify pixels, and a support vector machine (SVM) was used for vehicle detection. However, this strategy fails to distinguish moving vehicles from the background efficiently. It also cannot detect the whole shape of the moving vehicles, because the block regions may not be consistent in visual features and motion transformations. Image registration has been used as a basis for moving-object detection: registration strategies try to find correspondences between consecutive frame pairs based on appearance transformations. However, spatial information is often ignored, the different motions of multiple moving objects cannot be modeled efficiently, and registration does not handle occlusion well, which can result in missed objects. The authors of [17] presented an approach to address these problems. Video frames are first divided into uniform regions, followed by the construction of region adjacency graphs to represent each frame. The corresponding regions are then matched across consecutive frames by a multi-graph matching algorithm, and a graph coloring algorithm finally labels objects as background or foreground regions. However, this algorithm has high computational complexity because it must choose the optimal frame sequence length for the best multi-graph matching performance. Although the results of [3] show good precision, the recall drops dramatically because the Canny edge detector tends to fail in low-contrast scenes. A further weakness of this algorithm is its inability to handle occlusion, since distinctive objects are characterized with the Canny edge detector for blob detection, which generates multi-scale target hypotheses with temporal consistencies and adaptive thresholding. Table 3 shows that the low precision of [3] on Egtest3 results from the many occlusion cases in this scene. The proposed algorithm achieves the best precision and recall, with average percentages of 97.5% and 95.1%, respectively. This result stems from the elimination of false-positive detections, obtained by combining the information from the detection step with the feature-point motion analysis.

4.2 Tracking Evaluation

To evaluate the proposed method from the tracking perspective, we compare its vehicle tracking results with those of recent tracking techniques. Two tracking performance measures are used [19]: the multiple object tracking accuracy (MOTA) and the tracking speed in frames per second (FPS). MOTA, which is the most widely used figure to evaluate a tracker's performance, is defined as

$$ MOTA=1-\frac{{\sum}_{f=1}^{ft}\left (FN_{f}+FP_{f}+IDSW_{f} \right )}{{\sum}_{f=1}^{ft}\left (GT_{f}\right )} $$
(13)

Where f is the frame index and ft is the total number of frames. MOTA integrates three sources of error: false negatives, false positives, and identity switches. The frames-per-second (FPS) metric reflects the processing speed of an algorithm, computed from its processing time.
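A direct implementation of Eq. 13 over per-frame error counts (a simple sketch, assuming those counts are already available):

```python
def mota(fn, fp, idsw, gt):
    """Multiple Object Tracking Accuracy (Eq. 13).

    fn, fp, idsw, gt : per-frame lists of false negatives, false positives,
                       identity switches, and ground-truth object counts.
    """
    errors = sum(fn) + sum(fp) + sum(idsw)
    return 1.0 - errors / float(sum(gt))

# e.g. mota(fn=[1, 0], fp=[0, 1], idsw=[0, 0], gt=[10, 10]) -> 0.9
```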

As shown in Tables 2 and 3, the proposed method achieves the highest accuracy. Unlike the comparison algorithms, where the vehicles in the first frames are tracked manually to initialize the tracking process [7], or where the vehicle locations must be initialized manually [8, 16], the proposed approach automatically detects and tracks the vehicles. Moreover, the proposed algorithm achieves the lowest execution time. It is worth mentioning that, unlike the existing methods, the reported execution time comprises both the detection and the tracking time. The method was implemented in MATLAB and run on a single core of an Intel i7, 3.4 GHz processor with 4 GB RAM. The average rate of 34.64 fps achieved by the proposed approach on the VIVID dataset is highly encouraging compared to the 26.9 fps of the method presented in [3]. The method of [1] obtains the best processing speed among the state of the art, but its average detection and tracking results drop dramatically because the algorithm depends only on the feature points and the histogram of pixels.

Table 2 Tracking (MOTA %) and time performance (FPS) comparison for the VIVID dataset
Table 3 Simultaneous Detection and Tracking Accuracy Comparison of Different Methods on VIVID Dataset

Detection and tracking results on the police chase video are shown in Table 4. The proposed method achieves the highest precision and recall of 98% and 91%, respectively, compared to 91% and 78.7% in [1]. Moreover, the proposed method increases the tracking accuracy by 12.2 percentage points, from 79% to 91.2%, compared to the method in [1].

Table 4 Simultaneous Detection and Tracking Accuracy Comparison of Different Methods on Police Chase Video

Figure 10 shows some important visual examples of the experimental results. The first row illustrates the success in handling various motion patterns without prior knowledge: tracking persists without a change in the vehicle labels. In contrast, tracking accuracy deteriorates when a restrictive probabilistic motion model is employed, since some objects cannot be tracked through the various turning angles. The occlusion problem is solved, as illustrated in the second and third rows of Fig. 10, by combining the detection and tracking results and selecting the minimum Euclidean distance between the centroids of their intersection areas. The fourth row demonstrates the behavior of the proposed algorithm when vehicle detection fails due to camera defocusing: the tracks continue without failure despite the missing observations, although an identity-switch error is generated because some feature points are lost. The success of the proposed method in dealing with vehicles blocked by trees can be observed in the fifth row, where the repeated detection process prevents the blocked vehicle from being lost. The last row shows the success in tracking high-speed vehicles without losing the trajectory. Furthermore, these sub-figures show that scale changes are handled by using the proper structuring elements in the detection phase.

Fig. 10

Sample results on the VIVID dataset (first five rows) and the police chase video (last row)

5 Conclusion

This paper proposes an efficient real-time approach for vehicle detection and tracking in aerial videos. In this method, the detection and tracking procedures are linked together so that robust feature points are obtained and updated regularly every fixed number of frames. These feature points are extracted and analyzed by a combined process of grayscale morphological detection, optical flow, and K-means clustering. Furthermore, the feature points are clustered based on their motion properties to distinguish foreground features from the background, and each detected vehicle is assigned to its corresponding cluster based on the intersection area between the detected and tracked point information. The proposed algorithm has been tested on videos taken in different scenarios to verify its efficiency over other algorithms. Excellent detection and tracking accuracy and fast processing speed are the main factors that make the proposed algorithm suitable for real-time applications.