1 Introduction

Nowadays, UAVs are widely used and their market continues to expand because of their low cost, unique flexibility and high-altitude operational capability. Compared with humans, UAVs can carry out tasks including but not limited to disaster search [1], power line inspection [2] and traffic monitoring [3] safely, easily and efficiently. According to Carroll and Rathbone [4], the estimated budget for traffic data collection is about $5 million per year in an average metropolitan area, while using UAVs can reduce the total cost by 20% and halve the collection procedures. Therefore, the UAV is called the best tool for performing 3D (the Dull, the Dirty and the Dangerous) tasks [5].

Object tracking is one of the hot topics in computer vision: a bounding box locks onto a region of interest (ROI) such as a person or a vehicle, and given the initial location of the target, the computer finds its location in the subsequent frames. This technology is an important UAV application for ground strike, pursuing criminal vehicles etc., and also plays an important role in other processes such as estimating the velocity and position of an object [6], UAV landing [7], and search and rescue [1]. In aerial surveillance particularly, object tracking allows the traffic flow over a highway to be estimated over a period, and a proactive approach can be adopted for effective traffic management by identifying and evaluating potential problems before they occur [4], as shown in Fig. 1.

Fig. 1 Aerial surveillance over a highway

In general, tracking accuracy reflects tracking performance. Many factors affect this performance, such as illumination change, abrupt motion, scale variation and full or partial occlusion [8]. Although tracking algorithms are becoming more robust and efficient, none can handle all scenarios [8]. In addition, unlike tracking with a static camera, aerial object tracking is also influenced by the low sampling rate, low resolution and unstable camera platform caused by the moving vehicle and wind, which lead to tracking drift. When the flight altitude is great, objects on the ground appear so small that they are hard to detect. Hence, realizing a robust and stable tracking algorithm or system is still an issue to be addressed.

The rest of the paper is organized as follows: Sect. 2 introduces the history and current research institutions of UAV vision, while Sect. 3 summarizes the sensors used on aerial platforms. Section 4 discusses tracking frameworks and algorithms, common datasets and evaluation metrics. Future directions are given in Sect. 5. Finally, Sect. 6 concludes this paper.

2 The Development of UAV Vision

2.1 History of UAV Vision

The first aerial image was captured by Nadar, a famous French photographer, in December 1858 [9]. He used an old-fashioned wet-plate camera on a hot air balloon. Later, in World War II, the main belligerent countries used aerial cameras to carry out reconnaissance, but this approach could not meet real-time needs, so attention turned to inventing airborne optoelectronic platforms. The famous tactical UAV "Scout", created by Israel, was able to send video, obtained through the visible-light sensors in its optoelectronic platform, back to a display. During the Lebanon war in 1982, Israel became the first country to use real-time image transfer technology on an aerial platform [10]. Until the 1990s, UAVs were used almost exclusively in military applications; since then, they have also found commonplace usage in civilian applications. For instance, New Mexico State University used a UAV to observe whether fishermen were fishing in legal areas [11].

2.2 Current Research Institution

Medioni and his research group at the Institute for Robotics and Intelligent Systems, University of Southern California, devote themselves to aerial vision research [12]. They are developing a wide-area aerial surveillance system and aim to build an efficient, scalable framework that provides activity inference from airborne imagery [13]. This system includes image mosaicking, video stabilization, object detection and tracking, and activity inference from wide-area aerial videos [14,15,16].

The Air Lab of Carnegie Mellon University develops and tests perception and planning algorithms for UAVs [17]. Its research fields include indoor scene understanding, indoor flight in degraded visual environments, micro air vehicle scouts for intelligent semantic mapping etc.

UAV Vision is a company that designs and manufactures high-performance, lightweight, gyro-stabilized camera payloads for ISR applications [18]. Its sensors can be installed on different aircraft, such as fixed-wing, multi-rotor or rotary-wing UAVs, and carry out various tasks like disaster management and search and rescue. In particular, with their CM202U a user can track a moving vehicle from a long distance [19], as shown in Fig. 2.

Fig. 2 Tracking a moving vehicle for law enforcement applications

DJI is a famous Chinese UAV company that manufactures and designs UAVs, cameras, flight control systems etc. [20]. In the civil domain, its products are used globally in the music, television and film industries. According to statistics, DJI is the world leader in the civilian drone and aerial imaging technology industry, accounting for 85% of the global consumer drone market [21].

In addition, associated conferences and journals also offer platforms to UAV researchers and enthusiasts, for instance the Automated Vehicles Symposium [22], the International Conference on Unmanned Aircraft Systems [23] and the International Journal of Intelligent Unmanned Systems.

3 Sensors Used in Aerial Platform

Without the airborne optoelectronic platform, UAV vision could not develop, so advances in optoelectronic platforms benefit this technology. This section introduces some common sensors used on aerial platforms. Each sensor has its own imaging mechanism and characteristics, which are described in Table 1.

Table 1 Common sensors and main features

4 Aerial Platform Based Object Tracking

In this section, we first review the object tracking algorithms used on UAVs, then present common datasets and evaluation metrics.

4.1 Common Framework

Object tracking in aerial surveillance estimates the states of a target on the ground, whose initial state is given either by a detection algorithm or by selecting the ROI manually. As shown in Fig. 4, aerial platform based object tracking consists of three main steps: (1) ego motion compensation, (2) object detection and (3) object tracking. Behavior analysis for decision making is the output.

Fig. 3 False alarms due to moving camera [24]

Ego Motion Compensation. Ego motion compensation stabilizes the image against camera motion by registering video frames onto a reference plane. It is the basic step: otherwise, the changing pixel intensities of the background will produce false alarms in the next step, as shown in Fig. 3.

Compensation algorithms can be divided into gray-level based [25], feature based [26] and transform-domain based [27]. In aerial surveillance, feature based methods are often used: feature information such as corners, points, lines and edges is extracted from two images, the features are matched between them, and an affine model is established to complete the registration.
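As an illustration, the following is a minimal sketch of such a feature-based registration pipeline using OpenCV; the choice of ORB features, the matching strategy and the function names are our own assumptions rather than details from the cited works.

```python
# Minimal feature-based ego motion compensation sketch (illustrative only).
import cv2
import numpy as np

def compensate_ego_motion(ref_gray, curr_gray):
    """Register curr_gray onto the reference frame with an affine model."""
    orb = cv2.ORB_create(nfeatures=1000)              # corner-like features
    kp1, des1 = orb.detectAndCompute(ref_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)

    # Match descriptors between the two frames.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    src = np.float32([kp2[m.trainIdx].pt for m in matches])  # current frame
    dst = np.float32([kp1[m.queryIdx].pt for m in matches])  # reference frame

    # Robustly estimate the affine model and warp the current frame
    # back onto the reference plane.
    A, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    h, w = ref_gray.shape
    return cv2.warpAffine(curr_gray, A, (w, h))
```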

Object Detection. The means of detection vary. To focus on a suspicious object, the ROI can be selected manually. When the UAV flies high to monitor traffic conditions, there are many vehicles on the ground and a detection algorithm is needed, in which false alarms may occur. Optical flow [28], frame differencing [29] and background subtraction [30] are common detection methods.

Optical flow is defined as the apparent motion of brightness patterns or feature points in the image, which can be calculated from the movement of pixels with the same brightness value between two consecutive images [31]. Frame differencing uses the difference between two adjacent frames to detect moving objects. Background subtraction uses the gray-level difference between the current image and a background image to detect objects. In addition, parallax, similar appearances, objects merging or splitting, occlusion etc. affect detection accuracy [32].
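The latter two schemes can be sketched in a few lines of OpenCV; the threshold and the MOG2 background model below are illustrative assumptions, not choices from the cited methods.

```python
# Minimal detection sketches: frame differencing and background subtraction.
import cv2

def frame_differencing(prev_gray, curr_gray, thresh=25):
    """Flag pixels whose intensity changed between two adjacent frames."""
    diff = cv2.absdiff(prev_gray, curr_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask

# Background subtraction keeps a learned background model per pixel.
bg_model = cv2.createBackgroundSubtractorMOG2(history=30)

def background_subtraction(frame):
    """Flag pixels that deviate from the learned background model."""
    return bg_model.apply(frame)
```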

Object Tracking. Tracking methods can be divided into generative methods [33] and discriminative methods [34]. Generative methods model the object area in the current frame and search the next frame for the most similar area as the predicted location. Discriminative methods extract features of the object and the background in the current frame as positive and negative samples, respectively, to train a classifier; in the next frame, the classifier distinguishes the foreground, and the result is used to update the classifier. Discriminative methods are now popular because they are more robust.

There are three main modules in object tracking [8]. First, the target representation scheme defines an object as anything that is of interest for further analysis [35]. Second, the search mechanism estimates the state of the target objects. Third, the model update adjusts the target representation or model to account for appearance variations.

In aerial surveillance tracking, data association trackers, which belong to the generative methods, are often used. Such a tracker takes as input a number of data points of the form \( (X,t) \), where X is a position (usually in 2- or 3-space) and t is the timestamp associated with that position [36]. The tracker then assigns an identifier to each data point, indicating the track ID of each object.
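A minimal greedy nearest-neighbor association over such \( (X,t) \) points might look as follows; the distance gate and data structures are illustrative assumptions, and a practical tracker would resolve assignment conflicts globally (e.g. with the Hungarian algorithm).

```python
# Minimal nearest-neighbor data association sketch (illustrative only).
import numpy as np

def associate(tracks, detections, max_dist=30.0):
    """Greedily assign each detected position to the nearest live track.

    tracks: dict mapping track ID -> last known position (x, y)
    detections: list of positions (x, y) observed at the current time t
    """
    next_id = max(tracks, default=-1) + 1
    for x in detections:
        ids = list(tracks)
        dists = [np.linalg.norm(np.subtract(x, tracks[i])) for i in ids]
        if dists and min(dists) < max_dist:
            tracks[ids[int(np.argmin(dists))]] = x   # extend existing track
        else:
            tracks[next_id] = x                      # spawn a new track ID
            next_id += 1
    return tracks
```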

Behavior Analysis. Behavior analysis includes the recognition of events, group activities, human roles, traffic accident prediction etc. It is the output of aerial surveillance tracking, on which administrators base their decisions. Probabilistic network methods are widely used because of their robustness to small changes in motion sequences in time and space.

Such a method defines each static posture of a movement as a state or a set of states, connects these states through a network, and uses probabilities to describe the switching from state to state. Hidden Markov Models [37] and Dynamic Bayesian Networks [38] are representative examples.
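To make the state-switching idea concrete, the forward pass of a two-state HMM can be sketched as follows; all probabilities below are placeholder values, not parameters from the cited works.

```python
# Minimal HMM forward-pass sketch: states are static postures,
# A encodes the probabilistic switching between them (illustrative values).
import numpy as np

A = np.array([[0.9, 0.1],    # P(state_t | state_{t-1})
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],    # P(observation | state)
              [0.4, 0.6]])
pi = np.array([0.5, 0.5])    # initial state distribution

def sequence_likelihood(obs):
    """Return P(observation sequence) under the model."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(sequence_likelihood([0, 1, 1, 0]))  # likelihood of a short posture sequence
```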

4.2 Object Tracking Algorithms

In [39], Medioni et al. presented a methodology in 1997 to analyze a video stream taken from a UAV, whose goal was to provide an alert mechanism to a human operator. It was the beginning of their Video Surveillance and Monitoring (VSAM) project. The main procedure follows Fig. 4. In [32, 40], they followed the same scheme and plotted object trajectories on the mosaic image to help infer behavior. Nevertheless, these methods cannot deal with the effect of parallax, which leads to false detection alarms.

Fig. 4 Common framework of aerial platform based object tracking

More recently, in 2017, they used a detection-based tracker (DBT) and a local context tracker (LCT) simultaneously to track vehicles on the ground [41]. Because objects in airborne images are small and gray and the displacement of a moving target is large, relying merely on the DBT is unreliable. The LCT, which explores spatial relations of a target to avoid unreasonable model deformation in the next frame, relaxes the dependency on frame-differencing motion detection and appearance information, while the DBT explicitly handles merged detections in detection association. The results showed that this method has a high detection rate, apart from its high computation time and inability to handle long-term occlusion (Fig. 5).

Fig. 5 Results of DBT only (first row) and DBT & LCT (second row)

Ali et al. proposed the COCOA system for tracking in aerial imagery [42]. The whole framework resembles [32, 40]. The system works well, but the scenario is simple and no vehicle merging occurs. In [43], they used motion and appearance context for tracking and re-acquisition, the first use of context knowledge in aerial imagery processing. Briefly, the appearance context is used to discriminate whether objects are occluded, and the similar motion context of the unoccluded objects is used to predict the location of the occluded ones, as shown in Fig. 6. This handles occlusion, but it needs reference knowledge and does not take slow or stopped vehicles into account.

Fig. 6 Using motion context to handle occlusion

Perera et al. proposed a tracking method that handles long occlusions and split-merge events simultaneously [44]. Object detection is performed by background modeling, which flags whether a pixel belongs to the foreground or the background. A simple nearest-neighbor data association tracker is used, in which a Kalman filter updates the position and velocity of objects. Long occlusion is solved by linking tracklets according to a one-to-one correspondence. For merges and splits, suppose two objects A and B merge for a while, so that

$$ \begin{aligned} A & = \{ T_{a,1} , \ldots ,T_{a,m} ,T_{c,1} , \ldots ,T_{c,o} ,T_{d,1} , \ldots ,T_{d,p} \} \\ B & = \{ T_{b,1} , \ldots ,T_{b,n} ,T_{c,1} , \ldots ,T_{c,o} ,T_{e,1} , \ldots ,T_{e,q} \} \\ \end{aligned} $$
(1)

where the \( T_{c} \) tracklets represent the merging period and the rest the splitting periods. Using the pairwise assumption:

$$ \begin{aligned} P(A,B) & = P(\{ T_{a,1} , \ldots ,T_{a,m} \} ) \times P(\{ T_{b,1} , \ldots ,T_{b,n} \} ) \\ & \quad \times P_{m} (\{ T_{a,m} ,T_{b,n} \to T_{c,1} \} ) \times P(\{ T_{c,1} , \ldots ,T_{c,o} \} ) \\ & \quad \times P_{s} (\{ T_{c,o} \to T_{d,1} ,T_{e,1} \} ) \times P(\{ T_{d,1} , \ldots ,T_{d,p} \} ) \\ & \quad \times P(\{ T_{e,1} , \ldots ,T_{e,q} \} ) \\ \end{aligned} $$
(2)

where \( P_{m} \) and \( P_{s} \) denote the probability of a merge and of a split, respectively. The results showed that the tracker is not confused after two vehicles merge and continues tracking the same vehicle when they split, as shown in Fig. 7. However, this method needs 30 frames to initialize the background model and does not take slow or stopped vehicles into account.

Fig. 7 The images (red border) show confusion, while the images (blue border) show linking after merge processing
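To make the position/velocity update used by such trackers concrete, the following is a minimal constant-velocity Kalman filter sketch; the noise covariances and time step are illustrative assumptions, not values from [44].

```python
# Minimal constant-velocity Kalman filter sketch (illustrative parameters).
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0],   # state transition over [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # only the position is observed
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 0.01           # process noise (assumed)
R = np.eye(2) * 1.0            # measurement noise (assumed)

def kalman_step(x, P, z):
    """One predict/update cycle for a single tracked vehicle."""
    x = F @ x                            # predict state
    P = F @ P @ F.T + Q                  # predict covariance
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ (z - H @ x)              # correct with measurement z
    P = (np.eye(4) - K @ H) @ P
    return x, P
```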

Xiao et al. proposed a joint probabilistic relation graph approach to detect and track vehicles [45], in which background subtraction is used because it makes up for the drawback of three-frame subtraction, namely that slow or stopped vehicles are hard to detect. A vehicle behavior model is exploited to estimate the potential travel direction and speed of each individual vehicle. In line with expectations, more stopped and slow vehicles are detected, but the overlap of two merging vehicles affects detection accuracy. In the reported results, track identities were missing, as all detected vehicles were marked with the same color.

Keck et al. realized real-time tracking of low-resolution vehicles for aerial surveillance [36]. The airborne images have about 100 megapixels, which increases the computation burden. To solve this problem, they divided the large images into tiles and set up TileProcessors to process the tiles in parallel. The FAST-9 algorithm (feature based), three-frame differencing and a Kalman filter are used to perform registration, detection and tracking, respectively. The quantitative results show high detection and tracking accuracy, and thanks to the parallelism, the computational efficiency meets real-time needs. However, occlusion, merges and splits, and objects with the same appearance affect this accuracy.
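The tiling idea can be sketched as follows; the tile size, the placeholder detector and the use of `ProcessPoolExecutor` are our own illustrative choices, not implementation details from [36].

```python
# Minimal tile-parallel processing sketch for very large aerial frames.
from concurrent.futures import ProcessPoolExecutor

def split_into_tiles(frame, tile=1024):
    """Cut a large frame into tile x tile sub-images with their offsets."""
    h, w = frame.shape[:2]
    return [((r, c), frame[r:r + tile, c:c + tile])
            for r in range(0, h, tile)
            for c in range(0, w, tile)]

def detect_in_tile(job):
    """Placeholder per-tile detector (e.g. three-frame differencing)."""
    (r, c), tile = job
    local_dets = []                                 # (y, x) in tile coordinates
    return [(r + y, c + x) for y, x in local_dets]  # back to frame coordinates

def process_frame(frame):
    """Run the per-tile detectors in parallel and merge their outputs."""
    with ProcessPoolExecutor() as pool:
        per_tile = pool.map(detect_in_tile, split_into_tiles(frame))
    return [det for dets in per_tile for det in dets]
```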

Some state-of-the-art methods are summarized in Table 2.

Table 2 Object tracker in aerial surveillance, their components and performance

4.3 Common Datasets

Some common datasets of airborne imagery that can be used for object tracking are collected and listed below:

VIVID dataset [48]. This dataset was created for tracking ground vehicles from airborne sensor platforms. It offers a ground-truthed data set, some baseline tracking algorithms and a mechanism for comparing your results with the ground truth.

UAV123 dataset [49]. All videos in this dataset are captured from low-altitude UAVs. It contains a total of 123 video sequences and more than 110 K frames.

CLIF 2006 dataset [50]. This dataset was established by the Air Force Research Laboratory in the United States for aerial surveillance research. Its features are high altitude, a large field of view and small objects.

SEAGULL dataset [51]. A multi-camera, multi-spectrum (visible, infrared, near-infrared and hyperspectral) image sequence dataset for research on sea monitoring and surveillance. The image sequences were recorded from a fixed-wing UAV flying above the Atlantic Ocean.

In addition, the Image Sequence Server dataset [52], the WPAFB 2009 dataset [53], the UCF Aerial Action Data Set [54] and the UCLA Aerial Event Dataset [55] are also common aerial image datasets.

4.4 Evaluation Metrics

When a tracking algorithm is run, the results should be evaluated both qualitatively and quantitatively to show whether the algorithm is robust.

Qualitative evaluation. Generally, one or more bounding boxes contain the object(s) to be tracked, and the evaluation is carried out by eye.

If the tracking algorithm is robust and accurate, the bounding box stays locked on the object's appearance as much as possible whenever illumination change, occlusion, abrupt motion etc. occur. If the box drifts, the tracker is weak and inaccurate.

Quantitative evaluation. Qualitative evaluation alone is not persuasive, so it is always coupled with quantitative evaluation.

In [8], the authors introduce four metrics for evaluating single-object tracking: Center Location Error (CLE), Distance Precision (DP), Overlap Precision (OP) and Frames Per Second (FPS).

CLE is the Euclidean distance between the estimated location \( (x, y) \) and the ground-truth location \( (x_0, y_0) \) of the object. The smaller the value, the better the performance.

$$ \text{CLE} = \sqrt {(x - x_{0} )^{2} + (y - y_{0} )^{2} } $$
(3)

DP is the percentage of frames in the whole sequence whose CLE is smaller than a threshold. The higher the value, the better the performance.

$$ \text{DP} = \frac{{N_{{\text{CLE} \le \text{th}}} }}{N} \times 100\% $$
(4)

OP is the percentage of frames in which the overlap rate \( \phi \) between the output bounding box and the ground-truth area is higher than a threshold. The higher the value, the better the performance.

$$ \text{OP} = \frac{{N_{{\phi \ge \text{th}}} }}{N} \times 100\% ,\phi = \frac{{A_{{\text{output}}} \cap A_{{\text{ground}\;\text{truth}}} }}{{A_{{\text{output}}} \cup A_{{\text{ground}\;\text{truth}}} }} $$
(5)

FPS is the number of frames N the algorithm can process in a time span of t seconds. The higher the value, the better the performance:

$$ \text{FPS} = N/t $$
(6)
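For reference, Eqs. (3)-(6) can be computed directly as below; the box format (x1, y1, x2, y2) and the thresholds are illustrative assumptions.

```python
# Minimal implementations of the four evaluation metrics (illustrative).
import numpy as np

def cle(p, g):
    """Eq. (3): Euclidean distance between estimated and ground-truth centers."""
    return float(np.hypot(p[0] - g[0], p[1] - g[1]))

def overlap(a, b):
    """Overlap rate phi of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def dp(centers, gt_centers, th=20.0):
    """Eq. (4): share of frames with CLE below the threshold, in percent."""
    errs = [cle(p, g) for p, g in zip(centers, gt_centers)]
    return 100.0 * sum(e <= th for e in errs) / len(errs)

def op(boxes, gt_boxes, th=0.5):
    """Eq. (5): share of frames with overlap above the threshold, in percent."""
    phis = [overlap(a, b) for a, b in zip(boxes, gt_boxes)]
    return 100.0 * sum(phi >= th for phi in phis) / len(phis)

def fps(n_frames, seconds):
    """Eq. (6): frames processed per second."""
    return n_frames / seconds
```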

5 Future Directions

Although state-of-the-art methods achieve lower false alarm rates and higher tracking accuracy, some issues remain "bottlenecks" that constrain the further development of UAV-based tracking.

(1) Appearance change

When an object moves, its pose and shape may change. In addition, illumination variation also affects the tracker.

(2) Occlusion

Because the object is lost from view during occlusion, trackers may fail to resume tracking when the occlusion ends.

(3) Complex background

Due to the high-altitude surveillance viewpoint, objects may be drowned in the background, which makes detection difficult.

(4) Merge and split

When objects merge, some trackers treat them as one object, losing identities or even switching track IDs.

(5) Computation efficiency

With improving sensors, rising megapixel counts and more objects being tracked, the amount of computation grows. The requirements of efficient algorithms and high-performance hardware need to be met.

In the future, some improvements and innovations may be realized:

(1) Rarely use traditional generative methods

Detection-based tracking will be the mainstream for aerial surveillance, in which background information, local models and dynamic models are critical components [8]. Fully using background information separates the object from the background well, local models counter appearance change, and dynamic models predict motion so that the search region can be minimized.

(2) Surveillance with AI technology

Artificial Intelligence (AI) is now widely and deeply studied. Machine learning (ML) and deep learning (DL) have shown great power in computer vision, automation and even Go [56]. Through feature extraction and training on massive data, computers can compete with humans and take over some of our work. As shown in Fig. 8, with the support of the department of transportation and automobile manufacturers, models could be trained on the data (prior knowledge) they offer, and detection and tracking algorithms could realize UAV traffic monitoring. After the UAV captures and transfers video to the data processing center (DPC), the brand, size, velocity etc. of vehicles can be recognized online, while situation estimation and congestion judgment are performed through object tracking. The operators receive all of this information at their computers and can then decide whether to send out a warning signal or grant driving priority.

Fig. 8 The UAV-based traffic surveillance system in the future

(3) No ego motion compensation

Advanced UAV stabilization technology, such as flight control and wind estimation, may reduce the need for ego motion compensation so that this step becomes optional.

(4) Lower computation burden

Advanced hardware, processors and efficient algorithms on the aerial platform can relieve the computation burden so that the scene and the processed data can be relayed to observers in real time.

(5) Persistent working ability

UAVs should meet the need for all-day, all-weather operation if they are to be used in engineering practice. Working only in good weather is far from enough. Waterproofing, battery technology etc. should progress as soon as possible.

(6) More open airspace

Limited flying space makes the UAV useless. In the future, more airspace will be available to operators. While UAVs fly, operators should obey the flight rules in non-prohibited zones and keep in mind that prohibited zones are inviolable at all times.

6 Conclusion

This paper presents a survey of object tracking in aerial surveillance. First, the development history and current research institutions are reviewed. Then, frequently used sensors are summarized, followed by detailed descriptions of the common framework and representative tracking algorithms of aerial surveillance. Finally, suggestions and future directions are proposed for the deficiencies of current technologies, from which we conclude that by combining advanced algorithms with AI technology, the UAV can play a greater role in the field of aerial surveillance.