
1 Introduction

Recently, UAVs have been used more and more widely for news gathering and search and rescue in both military and civil domains, especially in reconnaissance-and-strike integrated systems built around aerial surveillance. Aerial surveillance today is performed primarily with video cameras, and exploitation of UAV video data is increasingly critical for surveillance systems. The large volume of video from UAVs would overwhelm human operators inspecting the data visually, so automatic video analysis and processing is essential: it not only reduces the human workload but also improves the fully autonomous surveillance capability of future UAVs.

Moving-object detection and tracking for UAVs is a subject of active research. Detection of a moving object of interest is based on motion information such as optical flow or background subtraction. Tracking maintains a consistent identity for the object over the frames based on appearance, shape, or kinematic information. A large amount of research on moving-object detection and tracking in video has been performed in past years, and a variety of methods have been developed, including region-based [1–3], feature-based [4–6], and contour-based [7, 8] methods. Although many methods have been proposed, most can only be used for specific applications under restrictive assumptions, such as a stationary camera in a constrained and uncluttered environment. For UAV videos captured by a fast-moving camera, moving-object detection and tracking poses the following key challenges.

The camera is mounted on the moving UAV, so two independent motions are involved: the motion of the objects and the motion of the camera, and the two are blended together. To detect moving objects robustly against a moving background, the system must be able to separate these two independent motions.

A UAV may move very fast, and an object may also move fast. Objects in UAV video tend to have low resolution, and the background is usually cluttered. Various types of noise are added at various stages, including changing lighting conditions, changing appearance, and rotations and scale changes of objects. Furthermore, objects in UAV video may be so small that they resemble noise, and they may suffer partial or total occlusion.

UAV video analysis requires fast tracking of moving objects, so a compromise typically exists between achieving real-time tracking performance and constructing a robust tracking algorithm.

Frame difference, which compares two consecutive image frames and finds moving objects based on the difference, is perhaps the most intuitive and fastest algorithm for moving object detection, especially when the viewing camera is static. However, for UAV videos, straightforward differencing is not applicable because a large difference is generated by simply moving the camera even if nothing moves in the environment. The global-motion of the background should be eliminated so that the remaining motions, which are due to moving objects, can be detected.

To overcome the first problem mentioned above, various methods have been proposed to stabilize camera motion by tracking features [9] or computing optical flow [10]. These approaches focus on estimating the transformation between the two image coordinate systems. Features associated with foreground objects should be eliminated, since they lead to an incorrect estimate of the global motion of the background.

Once moving objects have been identified, they need to be tracked. Two major components can be distinguished in a typical visual tracker [11]. Target representation and localization is a mostly bottom-up process that copes with changes in the appearance of the target. Filtering and data association is a mostly top-down process dealing with the dynamics of the tracked object, learning of scene priors, and evaluation of different hypotheses using a Bayesian filter. How the two components are combined and weighted is application dependent, and they play a decisive role in the robustness and efficiency of the tracker.

In this paper we propose an approach to detect and track moving objects for UAV. The block diagram of the proposed approach is depicted in Fig. 1. It includes three steps: compensation of the image flow induced by a moving camera, detection of moving regions in each frame, and tracking of moving objects in time.

Fig. 1 The proposed detecting and tracking approach

2 Global-Motion Estimation

The motion induced by the moving camera must be canceled before motion detection. The global motion can be estimated by tracking features between images. When the camera moves, two consecutive images, \( I^{t} \) (the image at time \( t \)) and \( I^{t - 1} \) (the image at time \( t - 1 \)), are in different coordinate systems. Global-motion estimation finds a transformation from the image coordinates of \( I^{t - 1} \) to those of \( I^{t} \) so that the two images can be compared directly. The transformation can be estimated from two corresponding feature sets: a set of features \( f^{t - 1} \) in \( I^{t - 1} \) and a set of corresponding features \( f^{t} \) in \( I^{t} \).

2.1 Feature Detection and Matching

In general, two basic questions must be answered: how to select features and how to track them from frame to frame. The two most successful visual features are corner features [12] and SIFT (Scale Invariant Feature Transform) features [13]. A corner feature is characterized by high intensity changes in both the horizontal and vertical directions. SIFT is a more advanced visual feature, known to be relatively invariant to image translation, scaling, and rotation, and partially invariant to changes in illumination and local image deformations. However, SIFT is computationally intensive and not suitable for real-time applications unless implemented in hardware. We adopt the KLT [4] feature selection algorithm (a well-known corner detector) to build the corresponding feature sets because of its computational efficiency. The feature selection algorithm runs on image \( I^{t - 1} \) and generates features \( f^{t - 1} \). The corresponding feature set \( f^{t} \) is constructed by tracking the same features on image \( I^{t} \). Figure 2a shows the outdoor environment, Fig. 2b shows the features selected from image \( I^{t} \), Fig. 2c shows the same features tracked 5 frames later on image \( I^{t + 5} \), and Fig. 2d shows the optical flow between the two frames.
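
To make the step concrete, the following minimal Python/OpenCV sketch pairs corner selection with pyramidal Lucas-Kanade tracking in the KLT style. The paper's own implementation is in C++ (Sect. 5); the function name and the parameter values (max_feats, qualityLevel, minDistance) are illustrative assumptions, not the authors' settings.

```python
import cv2

def klt_correspondences(prev_gray, curr_gray, max_feats=400):
    """Select corner features in I^{t-1} and track them into I^{t}."""
    # Shi-Tomasi corner selection (the "good features to track" of KLT)
    f_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_feats,
                                     qualityLevel=0.01, minDistance=8)
    # Pyramidal Lucas-Kanade tracking of the same features into the next frame
    f_curr, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                    f_prev, None)
    ok = status.ravel() == 1          # keep only successfully tracked pairs
    return f_prev[ok].reshape(-1, 2), f_curr[ok].reshape(-1, 2)
```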

Fig. 2 a Outdoor environment. b Features detected on image \( I^{t} \). c Features tracked on image \( I^{t + 5} \). d Optical flow between the two images

2.2 Valid Feature Determination

Once the correspondence \( f = \left\{ {f^{t - 1} ,f^{t} } \right\} \) is known, the geometric transform aligning \( I^{t - 1} \) and \( I^{t} \) can be estimated using a transformation model. However, as mentioned before, some features associated with foreground moving objects may be included, which would lead to an inaccurate estimate of the global motion of the background. Those features should therefore be eliminated from the feature set before the transform is computed. We call features belonging to the background valid features (inliers) and foreground features invalid features (outliers).

Since features belonging to the background move in a uniform way, we assume that the distribution of their optical flow vectors is Gaussian. According to the Gaussian theory, the probability of falling outside the range \( \left[ {\mu - 3\sigma ,\mu + 3\sigma } \right] \) is very small, where \( \mu \) and \( \sigma \) denote the mean and standard deviation respectively. When the flow vector of two corresponding features falls outside this range, the features probably belong to foreground objects. Let \( V_{i} \) denote the optical flow vector of a selected feature between two consecutive frames, let \( \left\| {V_{i} } \right\| \) and \( Ang\left( {V_{i} } \right) \) denote its norm and direction, and let \( \left( {\mu_{\left\| \cdot \right\|} ,\sigma_{\left\| \cdot \right\|} } \right) \) and \( \left( {\mu_{Ang} ,\sigma_{Ang} } \right) \) denote the mean and standard deviation of the norm and of the direction angle over all selected features. The determination condition (outlier test) is defined as follows:

$$ \left\{ {\begin{array}{ll} {f_{i} \in F_{\text{in}} } & {{\text{if}}\,\left| {\left\| {V_{i} } \right\| - \mu_{\left\| \cdot \right\|} } \right| <\; 3 \cdot \sigma_{\left\| \cdot \right\|} \;{\text{and}}\;\left| {Ang\left( {V_{i} } \right) - \mu_{Ang} } \right| <\; 3 \cdot \sigma_{Ang} } \\ {f_{i} \in F_{\text{out}} } & {\text{otherwise}} \\ \end{array} } \right. $$
(1)

where \( F_{\text{in}} \) and \( F_{\text{out}} \) denote valid and invalid features respectively. Figure 3 shows the result of the valid feature determination algorithm: \( F_{\text{in}} \) is marked with red circles and \( F_{\text{out}} \) with green circles. In the frame, the aircraft is moving while the buildings and trees are static, and all features associated with the aircraft are detected and eliminated. Note that valid feature determination fails when a moving object is very close to the camera, since it then occupies most of the image and the feature optical flow no longer follows a Gaussian distribution.
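
A minimal sketch of the test in Eq. (1), assuming the feature pairs come from the KLT sketch above. The direction statistics here ignore the 2π wrap-around of angles, a simplification the paper does not address.

```python
import numpy as np

def split_inliers(f_prev, f_curr):
    """Apply the 3-sigma test of Eq. (1) to the flow vectors V_i."""
    flow = f_curr - f_prev                       # optical flow vectors V_i
    norm = np.linalg.norm(flow, axis=1)          # ||V_i||
    ang = np.arctan2(flow[:, 1], flow[:, 0])     # Ang(V_i)
    in_mask = ((np.abs(norm - norm.mean()) < 3 * norm.std()) &
               (np.abs(ang - ang.mean()) < 3 * ang.std()))
    return f_prev[in_mask], f_curr[in_mask]      # F_in; the rest is F_out
```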

Fig. 3 Inliers (red circles) and outliers (green circles)

2.3 Transformation Model

Using the remaining inliers (the valid feature set), we can compute a relatively accurate transformation model. Commonly used models include the affine, bilinear, and pseudo-perspective models. When the interval between consecutive images is very small, most of the global motion can be captured by an affine model, which covers translation, rotation, shearing, and scaling. Hence, we use an affine model here:

$$ \left[ {\begin{array}{*{20}c} {f_{x}^{t} } \\ {f_{y}^{t} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {af_{x}^{t - 1} + bf_{y}^{t - 1} + t_{1} } \\ {cf_{x}^{t - 1} + df_{y}^{t - 1} + t_{2} } \\ \end{array} } \right] $$
(2)

We compute the transformation model \( T_{t - 1}^{t} \) by estimating the parameters of the affine model with the RANSAC (RANdom SAmple Consensus) algorithm [14], an iterative method chosen for its robustness.
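
As an illustration, OpenCV's estimateAffine2D runs RANSAC internally over the point correspondences; the reprojection threshold below is an assumed value, not one reported by the authors.

```python
import cv2

def estimate_global_motion(f_prev_in, f_curr_in):
    """Fit the affine model of Eq. (2) with RANSAC over the inlier features."""
    T, _ransac_mask = cv2.estimateAffine2D(f_prev_in, f_curr_in,
                                           method=cv2.RANSAC,
                                           ransacReprojThreshold=3.0)
    return T   # 2x3 matrix [[a, b, t1], [c, d, t2]] mapping I^{t-1} to I^t coords
```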

2.4 Motion Compensation

For frame differencing, image \( I^{t - 1} \) is warped using the transformation model before being compared to image \( I^{t} \), in order to eliminate the effect of global motion. For each pixel \( \left( {x,y} \right) \),

$$ I_{\text{comp}}^{t - 1} \left( {x,y} \right) = I^{t - 1} \left( {\left( {T_{t - 1}^{t} } \right)^{ - 1} \left( {x,y} \right)} \right) $$
(3)

where \( I_{\text{comp}}^{t - 1} \) represents the compensation of global-motion of image \( I^{t - 1} \).
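
A sketch of this step, assuming the 2 × 3 affine matrix T comes from the RANSAC sketch above. With default flags, cv2.warpAffine samples the source through the inverted transform, which is exactly the form of Eq. (3); warping an all-ones mask the same way records the defined area needed by Eq. (5).

```python
import cv2
import numpy as np

def compensate(prev_frame, T, frame_shape):
    """Warp I^{t-1} into the coordinate frame of I^{t} (Eq. (3))."""
    h, w = frame_shape[:2]
    # default warpAffine computes dst(x, y) = src(T^{-1}(x, y)), as in Eq. (3)
    comp = cv2.warpAffine(prev_frame, T, (w, h))
    # warp an all-ones mask identically to mark the defined area of Eq. (5)
    defined = cv2.warpAffine(np.ones((h, w), np.uint8), T, (w, h))
    return comp, defined
```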

3 Moving Object Detection

The difference image between two consecutive images is computed using the compensated image:

$$ I_{\text{diff}} \left( {x,y} \right) = \left| {I_{\text{comp}}^{t - 1} \left( {x,y} \right) - I^{t} \left( {x,y} \right)} \right| $$
(4)

Note that some pixels near the border move out of the frame after compensation, causing an undefined area in the compensated frame. To obtain a complete frame, we redefine the difference frame as follows:

$$ I_{\text{diff}} \left( {x,y} \right) = \left\{ {\begin{array}{ll} {\left| {I_{\text{comp}}^{t - 1} \left( {x,y} \right) - I^{t} \left( {x,y} \right)} \right|} & {{\text{if}}\;\left( {x,y} \right) \in {\text{defined area}}} \\ 0 & {{\text{if}}\;\left( {x,y} \right) \in {\text{undefined area}}} \\ \end{array} } \right. $$
(5)
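
Continuing the sketch above, Eqs. (4)-(5) reduce to an absolute difference with the undefined border zeroed out; grayscale frames are assumed here.

```python
import cv2

def motion_difference(comp, defined, curr_frame):
    """Compute the difference image of Eqs. (4)-(5)."""
    diff = cv2.absdiff(comp, curr_frame)      # |I_comp^{t-1} - I^t|
    diff[defined == 0] = 0                    # undefined border contributes nothing
    return diff
```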

Figure 4 compares the result of two cases: frame difference without global-motion compensation (Fig. 4a) and with global-motion compensation (Fig. 4b).

Fig. 4 a Frame difference without global-motion compensation. b Frame difference with global-motion compensation

With the image processing introduced above we can detect the moving object in theory, but in practice, due to illumination and background variation, the frame contains noise. Hence, we additionally perform a morphological closing operation on the result to remove camera noise and to connect object pixels into regions. The morphological processing consists of erosion and dilation operations, defined as follows:

$$ A \ominus B = \left\{ {x\left| {\left( B \right)_{x} \subseteq A} \right.} \right\} $$
(6)
$$ A \oplus B = \left\{ {x\left| {\left( {\hat{B}} \right)_{x} \cap \;A \ne \varnothing} \right.} \right\} $$
(7)

where A is the image to be processed and B is a structuring element of \( 20 \times 20 \) pixels. The resulting image is then size-filtered: only connected regions larger than a threshold are kept. Figure 5 compares the noisy difference image with the result after morphological processing; after this step the moving object can be extracted clearly.
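
A sketch of this stage, assuming a grayscale difference image. The 20 × 20 structuring element follows the paper; the binarization threshold and minimum region area (bin_thresh, min_area) are assumed values, since the paper only states that a size threshold is used.

```python
import cv2
import numpy as np

def extract_moving_regions(diff, bin_thresh=30, min_area=100):
    """Binarize, morphologically close, and size-filter the difference image."""
    _ret, binary = cv2.threshold(diff, bin_thresh, 255, cv2.THRESH_BINARY)
    kernel = np.ones((20, 20), np.uint8)          # the 20x20 structuring element B
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # dilation, then erosion
    # keep only connected regions larger than the area threshold
    n, _labels, stats, _cent = cv2.connectedComponentsWithStats(closed)
    return [tuple(stats[i, :4]) for i in range(1, n)
            if stats[i, cv2.CC_STAT_AREA] >= min_area]   # (x, y, w, h) boxes
```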

Fig. 5 a Frame difference without morphological processing. b Frame difference with morphological processing

4 Object Tracking

In previous work, the tracked object was usually initialized with a hand-drawn region. In aerial video surveillance it is difficult to mark the exact region of interest because the video moves fast, and a region containing noise and unrelated information leads to wrong tracking. In this paper we initialize the region with the information about the moving object obtained from the object detection module, which overcomes the shortcomings of manual initialization.

Although the KLT algorithm can track feature points, it cannot identify which feature points belong to the target we want to track. Hence, we choose a different feature space. Mean Shift, which is based on moving a kernel, is an efficient technique that automatically finds the local mode of a set of data. Here we employ the Mean Shift algorithm with color information as the object feature.

4.1 Object Tracking Using Mean Shift Algorithm

The Mean Shift tracking algorithm is divided into three parts: target representation, similarity measurement, and target location computation [15]. First, the reference target model is represented by its probability density function (pdf) \( \hat{q} = \left\{ {\hat{q}_{u} } \right\}_{u = 1 \ldots m} \), \( \sum\nolimits_{u = 1}^{m} {\hat{q}_{u} = 1} \). A target candidate defined at location y is characterized by the pdf \( \hat{p}\left( y \right) = \left\{ {\hat{p}_{u} \left( y \right)} \right\}_{u = 1 \ldots m} \), \( \sum\nolimits_{u = 1}^{m} {\hat{p}_{u} = 1} \), where m is the number of histogram bins. Second, the similarity between the target model and the target candidates in the next frame is measured using the metric derived from the Bhattacharyya coefficient \( \hat{\rho }\left( y \right) \equiv \rho \left[ {\hat{p}\left( y \right),\hat{q}} \right] = \sum\nolimits_{u = 1}^{m} {\sqrt {\hat{p}_{u} \left( y \right)\hat{q}_{u} } } \): the larger the coefficient, the more similar the candidate is to the target model, and if the current candidate is not similar enough, the algorithm searches nearby. The search procedure uses the gradient information provided by the Mean Shift vector. Mean Shift was originally presented in 1975 by Fukunaga and Hostetler [16]. We use Mean Shift iterations to find the maximum of the coefficient, which yields the most reliable candidate. The Mean Shift algorithm computes a locally optimal result quickly and effectively, and is superior to blind search; moreover, the color feature used for tracking is simple and effective in most situations.
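
For illustration, OpenCV's built-in meanShift climbs a back-projected color histogram, a close relative of the Bhattacharyya-based search described above rather than the authors' exact implementation; the hue-only histogram and the bin count are assumptions.

```python
import cv2

def make_target_model(frame, box, bins=16):
    """Build the m-bin color histogram q_hat from the detected region."""
    x, y, w, h = box
    hsv = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180])   # hue pdf
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist

def mean_shift_step(frame, box, hist):
    """One Mean Shift search: climb the back-projected similarity surface."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _niter, box = cv2.meanShift(backproj, box, crit)
    return box    # updated (x, y, w, h) window
```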

However, the Mean Shift tracking algorithm has some shortcomings. It cannot track fast-moving objects, because the most reliable location y is obtained by a Taylor expansion of \( \hat{\rho }\left( y \right) \) around the probable location \( \hat{y}_{0} \):

$$ \rho \left[ {\hat{p}\left( y \right),\hat{q}} \right] \approx \frac{1}{2}\sum\limits_{u = 1}^{m} {\sqrt {\hat{p}_{u} \left( {\hat{y}_{0} } \right)\hat{q}_{u} } } + \frac{1}{2}\sum\limits_{u = 1}^{m} {\hat{p}_{u} \left( y \right)\sqrt {\frac{{\hat{q}_{u} }}{{\hat{p}_{u} \left( {\hat{y}_{0} } \right)}}} } $$
(8)

When the target moves fast, the new location is far from the old one, and the expansion around the old neighborhood is no longer valid.

Besides, the color feature used to describe the target and the candidates in the Mean Shift tracking algorithm is relatively weak: when clutter similar in color to the target appears in the background, it causes wrong tracking.

4.2 Kalman Prediction

The Mean Shift tracking algorithm is a bottom-up, appearance-based tracking method, while the Kalman filter is entirely a top-down process. The Kalman filter is an optimal recursive data processing algorithm consisting of two steps: prediction and correction. In the former step, the state is predicted with the dynamic model; in the latter, it is corrected with the observation model. Since the error covariance of the estimate is minimized, it can be regarded as an optimal estimator.

For tracking, the information characterizing the target is defined by the state sequence \( \left\{ {X_{k} } \right\}_{k = 0,1, \ldots } = \left\{ {x_{k} ,y_{k} ,v_{xk} ,v_{yk} } \right\} \), representing the target's location and velocity, whose evolution in time is specified by the dynamic equation \( X_{k} = FX_{k - 1} + w_{k} \). The available measurements \( \left\{ {Z_{k} } \right\}_{k = 1, \ldots } = \left\{ {x_{k} ,y_{k} } \right\} \) are related to the corresponding states through the measurement equation \( Z_{k} = HX_{k} + v_{k} \). The matrix \( F \) is the system matrix and \( H \) is the measurement matrix. The noise sequences \( w_{k} \) and \( v_{k} \) are Gaussian, \( w_{k} \sim N\left( {0,\sigma_{\text{w}}^{2} } \right),\;v_{k} \sim N\left( {0,\sigma_{\text{v}}^{2} } \right) \).

A target does not always move with uniform velocity: it may move fast or slowly, turn quickly, or brake suddenly. We therefore suppose that it moves in a linear motion with random acceleration noise, and design the state updates in the following way:

$$ x_{k} = x_{k - 1} + \upsilon_{{x_{k - 1} }} t + \frac{1}{2}w_{k} t^{2} $$
(9)
$$ y_{k} = y_{k - 1} + \upsilon_{{y_{k - 1} }} t + \frac{1}{2}w_{k} t^{2} $$
(10)
$$ \upsilon_{{x_{k} }} = \upsilon_{{x_{k - 1} }} + w_{k} t $$
(11)
$$ \upsilon_{{y_{k} }} = \upsilon_{{y_{k - 1} }} + w_{k} t $$
(12)

where t denotes the frame interval. Hence, we design the dynamic equation and the measurement equation as follows:

$$ \left( {\begin{array}{*{20}c} {x_{k} } \\ {y_{k} } \\ {\upsilon_{{x_{k} }} } \\ {\upsilon_{{y_{k} }} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} 1 & 0 & t & 0 \\ 0 & 1 & 0 & t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {x_{k - 1} } \\ {y_{k - 1} } \\ {\upsilon_{{x_{k - 1} }} } \\ {\upsilon_{{y_{k - 1} }} } \\ \end{array} } \right) + \left( {\begin{array}{*{20}c} {\frac{{t^{2} }}{2}} \\ {\frac{{t^{2} }}{2}} \\ t \\ t \\ \end{array} } \right)w_{k} $$
(13)
$$ \left( {\begin{array}{*{20}c} {x_{k} } \\ {y_{k} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {x_{k} } \\ {y_{k} } \\ {\upsilon_{{x_{k} }} } \\ {\upsilon_{{y_{k} }} } \\ \end{array} } \right) + \left( {\begin{array}{*{20}c} 1 \\ 1 \\ \end{array} } \right)v_{k} $$
(14)

where \( \sigma_{\text{w}} = \sigma_{\text{v}} = 5 \). The objective of Kalman prediction is to estimate the state \( X_{k} \) given all measurements \( Z_{1:k} \) up to that moment, or equivalently to construct the probability density function \( p\left( {x_{k} \left| {Z_{1:k} } \right.} \right) \).
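
A sketch of this filter built on cv2.KalmanFilter, with F and H taken directly from Eqs. (13) and (14), the process noise scaled through the acceleration gain vector of Eq. (13), and \( \sigma_{\text{w}} = \sigma_{\text{v}} = 5 \) as above.

```python
import cv2
import numpy as np

def make_kalman(t=1.0, sigma_w=5.0, sigma_v=5.0):
    """Constant-velocity Kalman filter implementing Eqs. (13)-(14)."""
    kf = cv2.KalmanFilter(4, 2)            # state (x, y, vx, vy), measurement (x, y)
    kf.transitionMatrix = np.array([[1, 0, t, 0],
                                    [0, 1, 0, t],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)        # F
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)       # H
    g = np.array([[t * t / 2], [t * t / 2], [t], [t]], np.float32)    # acceleration gain
    kf.processNoiseCov = (g @ g.T) * np.float32(sigma_w ** 2)         # Q = G G^T sigma_w^2
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * np.float32(sigma_v ** 2)  # R
    return kf
```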

4.3 Fusing Kalman Prediction and Mean Shift Search

Considering the complementary advantages and disadvantages of Mean Shift and the Kalman filter, we combine Mean Shift search and Kalman prediction to track quickly and robustly. The framework is shown in Fig. 6. The Kalman filter first predicts the position of the target, and Mean Shift then searches in the confidence region estimated by the Kalman filter, which largely improves search efficiency by reducing the search area. The Mean Shift search result is in turn used as the measurement for the Kalman filter to predict the position of the target in the next frame. Since an inaccurate Mean Shift result would lead to an unreliable Kalman prediction, we add a validation test. Let \( \left( {\hat{x}_{k} ,\hat{y}_{k} } \right) \) denote the value predicted by the Kalman filter and \( \left( {x_{k} ,y_{k} } \right) \) the Mean Shift search result. The two values are generally very close, so \( e\left( k \right) = \sqrt {\left( {x_{k} - \hat{x}_{k} } \right)^{2} + \left( {y_{k} - \hat{y}_{k} } \right)^{2} } \) is small. When the difference is large, we infer that the Mean Shift result is invalid and do not use it as the Kalman measurement.
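
A sketch of one iteration of this loop, reusing the mean_shift_step and make_kalman sketches above; the gate value e_max is an assumed number, since the paper only requires that a large difference invalidates the Mean Shift result.

```python
import numpy as np

def track_step(kf, frame, box, hist, e_max=20.0):
    """One frame of the fused tracker: predict, search, validate, correct."""
    pred = kf.predict()                              # Kalman prediction (x_hat, y_hat, ...)
    x_hat, y_hat = float(pred[0, 0]), float(pred[1, 0])
    w, h = box[2], box[3]
    # seed the Mean Shift search at the predicted location (reduced search area)
    seeded = (int(x_hat - w / 2), int(y_hat - h / 2), w, h)
    box = mean_shift_step(frame, seeded, hist)       # sketch from Sect. 4.1
    xc, yc = box[0] + w / 2.0, box[1] + h / 2.0
    e = np.hypot(xc - x_hat, yc - y_hat)             # e(k): prediction-search distance
    if e < e_max:                                    # valid search: feed back as measurement
        kf.correct(np.array([[xc], [yc]], np.float32))
    return box
```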

Fig. 6 New tracking algorithm fusing Mean Shift search and Kalman prediction

5 Experiments

We have developed a system implementation for testing the detection and tracking method, using the Visual C++ 2010 platform and the OpenCV library, on a computer with an Intel Xeon @ 2.80 GHz processor.

5.1 Moving Object Detection Results

The city video was taken from an airplane. The three rows in Fig. 7 show the detection results for frames 37, 41, and 44, respectively; the goal is to detect a car moving along the road. The first column shows the frame difference without global-motion compensation, while the second column shows it with compensation. In the third column, the detected moving region is marked with a red rectangle. There is only one moving car in the scene. We compute the location and velocity of the moving region as input to the tracking module. The total processing time is 215 ms.

Fig. 7 Moving object detection from a moving camera. Frames 37, 41, and 44 are shown

5.2 Moving Object Tracking Results

In a football sequence, the ball moves fast and in a random way. Since the motion characteristics of targets tracked from UAVs are similar to those of the ball, we use the football sequence to test the performance of the tracking algorithms. When tracking the ball using Mean Shift only (Fig. 8), it is tracked well in the first few frames but lost later due to its fast movement, as the analysis in Sect. 4 predicts.

Fig. 8 Results of football tracking with Mean Shift

Figure 9 shows the tracking result for the same football sequence using the method that fuses Mean Shift search and Kalman prediction. Whether the ball moves relatively slowly while dribbled by a player (frames 1, 28, 32, 45), moves very fast after being shot (frames 80, 88), or is partially occluded by contesting players (frame 69), it is tracked robustly.

Fig. 9 Results of football tracking by fusing Mean Shift and Kalman. Frames 1, 28, 32, 45, 56, 69, 80, and 88 (left to right, top to bottom) are shown

In a car sequence (Fig. 10) of 200 frames at 512 × 288 pixels, we intend to track the black car. Since the histogram of the shadow is similar to that of the car itself, Mean Shift tracking converges to the shadow.

Fig. 10 Results of car tracking with Mean Shift

Figure 11 shows the result of tracking the black car using the fusion method proposed in this paper. Observe that the overall algorithm is able to track the car in the presence of similar objects (the car shadow in frame 110) and partial occlusion by an obstacle on the road (frame 171).

Fig. 11 Results of tracking by fusing Mean Shift and Kalman. Frames 1, 110, and 171 (left to right) are shown

In a closer analysis of the car sequence, similar objects in the neighborhood and occlusion increase the measurement uncertainty. In Fig. 12 we plot the measurements obtained by Mean Shift (dotted line) and the object locations estimated by the Kalman filter (solid line) in the X and Y directions. In most frames the two values are very close to each other. However, the presence of the car shadow from frame 72 to 149 increases the state uncertainty and makes the two values diverge. The sharp peak from frame 169 to 175 in the dotted line corresponds to an invalid Mean Shift search caused by occlusion (see Fig. 11). By fusing Mean Shift search and Kalman prediction, we obtain a robust track.

Fig. 12 Measurements and estimated states for the car sequence

Besides robustness, we also care about real-time performance. When searching for the target in one frame of 512 × 288 pixels, the average time using Mean Shift alone is 31 ms, while fusing Mean Shift and Kalman takes 26 ms, so the fusion also yields a faster track. The fusion tracking algorithm is thus well suited to our aerial video tracking.

6 Conclusion and Future Work

A real-time method for moving object detection and tracking on an unmanned aerial vehicle was introduced. First, the global motion of the background is estimated using corresponding valid feature sets. Moving objects are then detected after compensating consecutive frames. The location and velocity of the moving object obtained from the detection module are used to initialize the tracking module. Finally, the Mean Shift algorithm and Kalman prediction are integrated to track objects quickly and robustly. The results show that the method is general and computationally inexpensive, making it fit for aerial video surveillance.

Additional research is needed to improve the performance of the method, such as optimizing the object detection algorithm to reduce computational cost. Moreover, the current system can only estimate the location, direction, and speed of moving objects. Since a single camera is limited in retrieving depth information, a camera alone is not rich enough to construct full 3-dimensional models of moving objects. A laser rangefinder, which provides depth information in a single plane, could be used to complement the camera.