1 Introduction

Detection and tracking of objects and structures with visual sensors have been studied thoroughly over the last decades. This is because visual sensors are useful in a vast number of applications and industries, including the maritime sector, the automotive industry, surveillance, and monitoring. Visual sensors can capture an enormous amount of information through images, which may be why machine and robotic vision are attractive research fields.

Commercial visual sensors are used more and more, which has pushed their price and weight down rapidly in recent years. This has opened up the possibility of utilizing visual sensors in lightweight unmanned aerial vehicles (UAVs). Moreover, the computational capacity of small on-board computers grows every year and permits the use of cameras to perform complex tasks in real-time applications. Visual sensors can be exploited in many ways in UAV operations, including navigation [1, 2], sense and avoid technology [3, 4], inspection [5], object detection and tracking [6, 7], georeferencing [8, 9], and airborne simultaneous localization and mapping (SLAM) [10, 11].

UAVs equipped with a visual sensor can be beneficial at sea because information captured from the air may be very different from the information available at the sea surface. This can, for instance, be utilized for iceberg detection in the Arctic or for detecting floating objects hidden in waves [12]. Moreover, UAVs can monitor and conduct operations on their own or in cooperation with vessels at sea. One particular scenario where a UAV can be useful is as a scouting system for autonomous ship operations [13]. In order to avoid obstacles and plan the most effective path, autonomous ships need the location of floating objects near their desired path. UAVs can be used to detect and track floating objects and assist ships at sea.

Object detection is the problem of identifying objects of interest within an image. Detecting objects in images captured from a moving platform is different from detecting objects in images captured at rest, even though the goal is the same. Moreover, detecting objects in a maritime environment is not the same as identifying, e.g., pedestrians at an intersection. Therefore, it is necessary to find viable solutions for a specific application in most cases, especially if the aim is a real-time solution working on an embedded computer with restricted memory and processing capacity. When looking for floating objects at sea, small structures and vessels are expected to make up the majority of the obstacles. It may be challenging to distinguish small objects from the sea surface and to detect dark vessels with a regular day camera, particularly during poor lighting conditions. That is why sensors capturing images in a spectrum other than the visible may be attractive. Vessels at sea have heat sources and materials that are warmer or have higher emissivity than the sea surface. Therefore, using thermal cameras for object detection at sea is a viable alternative.

Target tracking is the problem of estimating the position, and most often also the velocity, of an object. Since many objects may be present, multiple target tracking needs to be considered, which is much harder than tracking a single target. This is because several objects may be located within the same image, and all of these observations must be connected to an existing track (or a new track if the object is detected for the first time). In the literature, this is normally referred to as data association, and it is the first part of multiple target tracking. The second part is the filtering problem, where the measurements are used to estimate the position and velocity of all targets. The filtering part usually consists of individual state estimators for every object.

1.1 Related Literature

Object detection and tracking are studied thoroughly in the literature and a great number of surveys exist [14,15,16,17]. Covering all of the relevant research in one section is impossible. Therefore, this section mainly focuses on object detection and tracking with UAVs and on applications at sea.

Object detection and tracking of floating structures in images captured from a UAV operating at high speed are challenging because the velocity of the visual sensor typically exceeds the velocity of the objects significantly. Therefore, motion-based segmentation techniques are not appropriate unless one can compensate for the camera motion accurately. This requires knowledge of the navigation states of the UAV and time-synchronized data with high precision. Such information is not necessarily available, and motion-based segmentation techniques are therefore assumed to be inappropriate in this paper. Thus, it is necessary to identify algorithms that work well under this assumption.

Feature-based techniques are an alternative and include popular point detectors, such as SIFT [18], SURF [19], and the KLT detector [20]. Point detectors are usually computationally effective and quite robust since they may be scale and rotation invariant. However, they need to be combined with another detection method because features may be located in places where objects are not present. Moreover, several features can be present on the same object. Therefore, it is necessary to identify the ones that belong to objects and not the background. Another issue is that it is easier to associate features between two subsequent images than across a large sequence of images. Therefore, using single features obtained with, e.g., SIFT is not very robust for tracking purposes where objects should be tracked over longer periods of time. This is also a consequence of the point detectors' ability to find many features, because association is harder on a large set of features. The large set is, on the other hand, very beneficial when calculating, e.g., optical flow [6].

A template matching approach for visual detection and tracking with UAVs is presented in [21]. However, in order to be robust, a great number of templates need to be available because template matching is neither scale nor rotation invariant. Moreover, the objects of interest need to be known beforehand, which is information that is unavailable in this scenario. Color segmentation has also been applied in object detection with UAVs [22], but requires parameters tailored toward a specific type of object. Thus, it is not as robust since floating objects at sea can have many different appearances. Leira et al. [7] demonstrate a stepwise solution for detecting marine vessels in thermal images, which works well as long as the temperature difference between the sea surface and objects of interest is large.

For target tracking, several solutions exist both for the filtering part and the data association part. The greatest distinction in object tracking is between tracking a single target and tracking multiple targets (i.e., whether data association is necessary or not). The filtering part of object tracking consists of state estimators that estimate the position and velocity of the targets. Several alternatives are available, including stochastic methods, such as the Kalman filter and particle filter, and deterministic solutions such as observers. State estimation with applications for tracking is described thoroughly in [23] and includes multiple model approaches. A tracking strategy for vision-based applications on-board UAVs is described in [24], and a correlation method for ship detection with a visual sensor mounted in a UAV is presented in [25]. Tracking strategies for vessels at sea from UAVs equipped with a thermal camera are presented in [6, 7]. Other tracking strategies with UAVs are described in [9, 26, 27].

Data association has also been studied thoroughly and is closely related to classification. A general description of pattern recognition (classification) and machine learning with methods useful for data association is given in [28]. Data association in multiple target tracking with the recursive RANSAC algorithm is described in [27, 29], but requires the number of objects to be known beforehand. Moreover, it can be computationally demanding. Nearest neighbor data association is described in [30] and will be utilized in this paper because of the limited processing capacity of the on-board computer.

1.2 Main Contributions

The robustness of vision-based tracking systems is a challenge because of issues like changes in illumination, occlusion, limited field of view, and real-time requirements. This paper will extend the tracking system in [7] and propose a more robust system for multiple object tracking.

A complete stepwise solution for multiple object detection and tracking is thoroughly described in this paper. A region-based detection algorithm will be used to extract bounding boxes for floating objects at sea. Moreover, the center of the object can be computed, even if only a part of the object is visible, as long as the entire object has been visible at some earlier point. In addition, several properties of each object can be extracted in a feature vector for data association purposes. This vector is tailored toward the use of thermal cameras in maritime environments. The distance between the current estimate and a new observation is added to the feature vector to make it possible to distinguish two otherwise identical objects. Three different distance measures are evaluated with the Mahalanobis distance to increase the robustness of the data association. A nonlinear tracking filter will be used to estimate the position and velocity of detected objects. The tracking system is divided into modules, which makes it possible to change and expand every part of the system.

The proposed system is evaluated in real-time simulations of data gathered during a flight experiment at sea. A marine vessel is detected and tracked in several image sequences. Furthermore, different distance measures for data association will be evaluated with Monte Carlo simulations in a separate case study.

1.3 Organization

The remainder of this paper is divided into four parts. Section 2 describes how objects are detected in images; the method is tailored toward the use of airborne thermal cameras at sea. Section 3 describes a tracking system that is able to track multiple targets where the number of targets is unknown beforehand. This includes both a data association part, which is necessary to connect existing targets with new observations, and a filtering part that estimates the position and velocity of each target. Section 4 presents experiments that have been carried out to evaluate the performance of the detection and tracking system, together with the results. The paper is concluded in Sect. 5.

2 Machine Vision

This section describes a machine vision system that utilizes thermal images to find floating objects at sea. The object detection module extracts the center of each object in the image frame and is described thoroughly in the first part of this section. It is more useful to estimate the position of the objects in a world-fixed frame. Therefore, a georeferencing algorithm can be used to transform the image frame coordinates to world-fixed coordinates. This is briefly described in the second part. The last part of this section describes features that can be extracted in the machine vision system and that are useful for data association.

2.1 Object Detection

Object detection at sea using thermal images has not been described extensively in the literature even though some examples exist. In this section, an edge detection-based method for locating floating objects will be presented.

In order to automatically detect objects of interest, the thermal image, \(\mathbf {I}\), is first smoothed. This is done to reduce the thermal noise present in the image and was found to make the process of detecting edges more robust than when performed on the raw thermal images. The smoothing is conducted by convolving (denoted by \(*\)) the image with a Gaussian kernel \(\mathbf {g}\), an \(n\times {n}\) kernel approximating a Gaussian distribution with a standard deviation of \(\sigma _{g}\), i.e.,

$$\begin{aligned} \mathbf {I_{s}}[x,y] = (\mathbf {I}*{\mathbf {g}})[x,y] = \sum _{k=-\frac{n-1}{2}}^{k=\frac{n-1}{2}}{\sum _{m=-\frac{n-1}{2}}^{m=\frac{n-1}{2}}{\mathbf {I}[x-m,y-k]\mathbf {g}[m,k]}} \end{aligned}$$
(1)
Fig. 1 Before (a) and after (b) smoothing of the original image. The image shows a large boat (length of 55 m), a rigid-hulled inflatable boat and a small buoy

\(\mathbf {I}\) and \(\mathbf {I_s}\) are \(w\times {h}\) matrices, where w and h are the width and height of the original thermal image. [x, y] are integers representing a pixel coordinate in the image, and [m, k] are integers representing a coordinate in the Gaussian kernel approximation \(\mathbf {g}\). The result of smoothing an image showing a large boat (length of 55 m), a rigid-hulled inflatable boat (RHIB) and a small buoy can be seen in Fig. 1. Notice that the upper left corner has slightly brighter pixels than the rest of the image, due to inaccurate camera calibration. Now, to detect the edges in the resulting smoothed image \(\mathbf {I}_{s}\), the gradient image, \(\mathbf {G_I}\), of \(\mathbf {I}_{s}\) is calculated. The gradient image of \(\mathbf {I}_s\) is found by the following calculation

$$\begin{aligned} \begin{aligned}&\mathbf {G_I}_{x}[x,y] = (\mathbf {I}_s*{\mathbf {P}})[x,y] = \sum _{k=-1}^{k=1}{\sum _{m=-1}^{m=1}{\mathbf {I}_s[x-m,y-k]\mathbf {P}[m,k]}}, \\&\mathbf {G_I}_{y}[x,y] = (\mathbf {I}_s*{\mathbf {P}^T})[x,y] = \sum _{k=-1}^{k=1}{\sum _{m=-1}^{m=1}{\mathbf {I}_s[x-m,y-k]\mathbf {P}^T[m,k]}}, \\&\mathbf {G_I}[x,y] = \sqrt{\mathbf {G_I}^2_{x}[x,y] + \mathbf {G_I}^2_{y}[x,y]} \end{aligned} \end{aligned}$$
(2)

\(\mathbf {P}\), also referred to as the Prewitt operator [31], is defined as the \(3\times 3\) matrix

$$\begin{aligned} \mathbf {P} := \begin{bmatrix} -1&0&1\\ -1&0&1 \\ -1&0&1 \end{bmatrix} \end{aligned}$$
(3)
Fig. 2 Before (a) and after (b) gradient magnitude thresholding

Fig. 3 Before (a) and after (b) removing blobs which are either too small or too large to be an object of interest

The resulting gradient image \(\mathbf {G_I}\) can be seen in Fig. 2a. The large boat, the RHIB and the small buoy are clearly visible after these image processing operations. However, the waves and ripples in the ocean, in addition to some of the noise in the image, are still visible, albeit smaller in magnitude than the objects of interest. Because of this, they can be removed by applying a threshold to the magnitude of the gradients. That is, all pixels in the gradient image that have a magnitude less than a certain threshold \(T_g\) can be removed. This is achieved by the following operation

$$\begin{aligned} \mathbf {G_I}(x,y)= {\left\{ \begin{array}{ll} \text {maxValue} &{} \text {if } \mathbf {G_I}(x,y) \ge T_g\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(4)

where maxValue is the maximum brightness value a pixel in \(\mathbf {G_I}\) can have. From Fig. 2b it is readily seen that, after this operation, mostly objects with a distinct heat signature are left in the image.
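
For concreteness, the smoothing, gradient, and thresholding steps of Eqs. (1)-(4) could be sketched as follows using OpenCV and NumPy. The kernel size n, \(\sigma _g\), and \(T_g\) are illustrative values and not the parameters used in the experiments.

```python
import cv2
import numpy as np

def edge_image(thermal, n=5, sigma_g=1.0, T_g=60, max_value=255):
    """Smoothing, Prewitt gradient magnitude, and thresholding (Eqs. 1-4)."""
    # Eq. (1): Gaussian smoothing to suppress thermal noise
    smoothed = cv2.GaussianBlur(thermal, (n, n), sigma_g).astype(np.float32)

    # Eq. (3): the Prewitt operator P and its transpose
    P = np.array([[-1, 0, 1],
                  [-1, 0, 1],
                  [-1, 0, 1]], dtype=np.float32)

    # Eq. (2): horizontal/vertical gradient components and the magnitude
    gx = cv2.filter2D(smoothed, -1, P)
    gy = cv2.filter2D(smoothed, -1, P.T)
    grad = np.sqrt(gx ** 2 + gy ** 2)

    # Eq. (4): keep only pixels whose gradient magnitude is at least T_g
    return np.where(grad >= T_g, max_value, 0).astype(np.uint8)
```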

Looking at Fig. 3a, it is clear that some of the blobs do not originate from any object of interest (i.e., the small dots scattered across the image), and they therefore have to be filtered out. To filter out the unwanted blobs from the image, a connected component algorithm [32] is used to group and label connected pixels into blobs. Furthermore, the area of each blob is calculated, and blobs with a smaller or larger area than what is expected from an object of interest are removed from the image. The result of this process is seen in Fig. 3. The resulting image (Fig. 3b) is hereby referred to as the binary image, \(\mathbf {B}\), of the original image \(\mathbf {I}\).

After applying the image analysis methods just described and finding the bounding boxes for each detected object, the detected objects can be seen in Fig. 4a. However, large objects with some internal texture can trigger detections within the interior of the actual object, because the texture of the object shows up in the edge detector. In order to make every object show up as one unique detection, bounding boxes completely contained in a larger bounding box are removed. The result of this process is seen in Fig. 4b, where the three detected objects remain and are now ready for further evaluation.
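
A compact sketch of this blob filtering and nested bounding-box removal, again using OpenCV; the area limits are placeholder values.

```python
import cv2

def candidate_boxes(binary, min_area=20, max_area=5000):
    """Label blobs, discard implausibly small/large ones, and drop bounding
    boxes completely contained in a larger box."""
    n_labels, _, stats, _ = cv2.connectedComponentsWithStats(binary)

    boxes = []
    for i in range(1, n_labels):                 # label 0 is the background
        x, y, w, h, area = stats[i]
        if min_area <= area <= max_area:
            boxes.append((x, y, w, h))

    def contained(a, b):                         # is box a fully inside box b?
        return (a[0] >= b[0] and a[1] >= b[1] and
                a[0] + a[2] <= b[0] + b[2] and
                a[1] + a[3] <= b[1] + b[3])

    return [a for a in boxes
            if not any(a is not b and contained(a, b) for b in boxes)]
```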

Fig. 4 Before (a) and after (b) removing detections completely contained in the interior of other detections

The detection step provides the position of objects of interest in the image. However, since the detector is based on edge detection, the areas of an image that are highlighted as interesting will often only contain the exterior edges of an object. When performing, for instance, recognition based on characteristics such as size, average temperature, and overall form, it is crucial that the whole object is evaluated. To expand the detections to also include the interior of the objects of interest, an algorithm that seeks to fill holes in the binary image [31] is applied. The result of applying this algorithm can be seen in Fig. 5.

Fig. 5 Before (a) and after (b) filling the interior holes in the detected objects

Using the location of the bright pixels in the binary image seen in Fig. 5b, the center positions of the remaining blobs are calculated and passed on to the tracking module as measurements.
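
A minimal sketch of the hole filling and centroid extraction; the flood-fill variant used here assumes that the pixel in the upper left corner belongs to the background, which is an assumption made purely for illustration.

```python
import cv2
import numpy as np

def fill_holes_and_centroids(binary):
    """Fill interior holes of the detected blobs and return their centroids
    (pixel coordinates) as measurements for the tracking module."""
    h, w = binary.shape
    flood = binary.copy()
    mask = np.zeros((h + 2, w + 2), np.uint8)   # floodFill requires a padded mask
    cv2.floodFill(flood, mask, (0, 0), 255)     # fill the background region
    holes = cv2.bitwise_not(flood)              # pixels not reached = interior holes
    B = cv2.bitwise_or(binary, holes)           # binary image with filled blobs

    _, _, _, centroids = cv2.connectedComponentsWithStats(B)
    return B, centroids[1:]                     # drop the background centroid
```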

Fig. 6 Partially visible object (a), and its corresponding ellipse approximating the object’s form (red circle), object center (red cross), and the orientation of the major axis of the ellipse (blue line) illustrated in image (b)

Note that the center of the object blob detected in an image is not necessarily the same as the center of the detected object. Such an example can be seen in Fig. 6. The calculated blob center is illustrated with a red cross in Fig. 6b, and since the boat is only partly visible in the image, the blob center does not coincide with the true object center. In order to adjust the calculated blob center to a pixel coordinate which is closer to the actual object center, the orientation of the detected blob is calculated by approximating the detected blob as an ellipse with a minor and a major axis. This can be done by using the second-order central moments to construct a covariance matrix for the detected blob’s pixels. The \((p+q)\)th order central moment of a blob or an image region \(\mathbf {O}\) is here given as [32]

$$\begin{aligned} {m_{pq}} = \sum _{x,y \in \mathbf {O}} x^{p}y^{q}\mathbf {B}(x,y),\quad p,q = 0,1,\ldots \end{aligned}$$
(5)

The second-order covariance matrix of an image region \(\mathbf {O}\) can be calculated as [32]

$$\begin{aligned} cov[\mathbf {O}(x,y)] = \begin{bmatrix} m_{20}&m_{11}\\ m_{11}&m_{02}\\ \end{bmatrix} \end{aligned}$$
(6)

Now, the eigenvectors of this matrix correspond to the major and minor axes of the ellipse approximating the detected object blob in the image region \(\mathbf {O}\); hence, the direction of the eigenvector corresponding to the largest eigenvalue can be used to approximate the orientation of the detected blob. Figure 6b shows the ellipse constructed using the two eigenvectors as major and minor axes for the approximation of the detected blob’s form, as well as the calculated blob center marked with a red cross. The blue line shows the direction of the major axis of the blob, and can be used as a measurement of the direction. Note that in order for this process to yield information about the orientation of an object, the object has to be noncircular, and a sufficient part of the object has to be visible so that the length of the major axis of the blob is larger than the expected full length of the minor axis.

Assuming that the full length of the partially detected object is already known, the expected length of the boat in pixels can be calculated. Using this information combined with the results of the calculations above, a more accurate estimate of the detected object’s center in the pixel coordinate frame can be found. This is done by moving the calculated blob center the following amount

$$\begin{aligned} \varDelta {\mathbf {c}} = \frac{\mathbf {L}}{2}-(\mathbf {a}-\mathbf {i}) \end{aligned}$$
(7)

where \(\mathbf {L}\) is the expected object length decomposed in the image plane according to the approximated orientation, i.e., \(||\mathbf {L}|| = L\), where L is the expected length of the boat. Further, \(\mathbf {a}\) is the major axis and \(\mathbf {i}\) is the vector from the blob’s centroid to the closest intersection with the image frame along the direction of the major axis, illustrated as a blue line in Fig. 6b. Adding \(\varDelta {\mathbf {c}}\) to the calculated blob center will effectively move the detected object’s center along the calculated major axis of the blob toward the expected location of the object center. This can sometimes lead to the object center being computed to be located outside the image borders, but this is not an issue since the theoretical camera model is not restricted by the field of view.
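
A sketch of how the orientation can be obtained from the second-order moments of Eqs. (5)-(6), assuming the blob pixels are available as an (N, 2) array of (x, y) coordinates. The shift of Eq. (7) is only indicated in a comment, since it additionally requires the expected object length and the choice of direction toward the truncating image border.

```python
import numpy as np

def blob_orientation(pixels):
    """Approximate a blob's orientation from the eigenvectors of its
    second-order moment matrix; pixels is an (N, 2) array of (x, y)."""
    xs = pixels[:, 0].astype(float)
    ys = pixels[:, 1].astype(float)
    cx, cy = xs.mean(), ys.mean()                 # blob centroid
    dx, dy = xs - cx, ys - cy

    # Second-order central moments arranged as a covariance matrix (Eq. 6)
    cov = np.array([[np.dot(dx, dx), np.dot(dx, dy)],
                    [np.dot(dx, dy), np.dot(dy, dy)]]) / len(xs)

    # The eigenvector with the largest eigenvalue spans the major axis;
    # its angle with the image x-axis approximates the blob orientation.
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]
    angle = np.arctan2(major[1], major[0])

    # For a partially visible object, the centroid (cx, cy) would then be
    # shifted by delta_c = L/2 - (a - i) along 'major' toward the expected
    # object center, as in Eq. (7).
    return (cx, cy), angle, major
```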

2.2 Georeferencing

Georeferencing can be defined as the transformation from a pixel position in the image frame to world-fixed coordinates. It can be very advantageous for UAVs because the pixel position of an object contains little information when the scene (location of the image frame) changes continuously. This is the main motivation behind georeferencing and is described thoroughly in [6, 8, 9, 33]. However, in this scenario, where we are interested in continuous tracking of floating objects (rather than acquiring “standalone” measurements of world-fixed coordinates), georeferencing is not necessary because it happens indirectly in the nonlinear measurement model described in Sect. 3.1. Nevertheless, georeferencing has worked well in the aforementioned papers, and therefore, it will be used as a benchmark in this paper. Note that georeferencing without a range measurement is only possible as long as all points in the image are located in the same plane. This is the case at sea, and that is why georeferencing can be very useful in maritime environments.

2.3 Extracting Feature Vector for Classification and Data Association

When multiple objects are tracked simultaneously, it is necessary to associate new measurements with objects that are being tracked. In other words, it is necessary to find out which object a new measurement belongs to. This can be achieved by extracting features for detected objects through image analysis and is described extensively in [7]; the main motivation is to use features in the image that can distinguish between several objects. However, complex machine vision routines are challenging to run in real-time on resource-limited embedded computers on-board UAVs. Therefore, it is important to extract features that are computationally effective, but still robust in the sense of data association. Moreover, features should be scale and rotation invariant because the sensor position and field of view change continuously. A feature vector is reliable if it describes an object uniquely with high accuracy. In other words, two different objects should never have feature vectors that are similar. However, this may be the case if features are only gathered from the images. Consider a situation where two circles of identical temperature and size are located in two locations in a thermal image. It is very hard to find unique features for similar objects just by using image properties. Therefore, it is important to find an additional feature that does not directly depend on the image properties. This can, for example, be some sort of position measure and is described more closely in Sect. 3.2. Note that a position measure in the image plane is not necessarily reliable because the location in the image changes continuously with the UAV. Hence, the distance measure should be based on world-fixed coordinates.

Thermal images consist of pixels where a single channel is used to describe the intensity (temperature) of each pixel. Image features that can be used for data association are very diverse and can include information about intensity, gradients, edges, and many others. In [7], the area, average temperature, and the first normalized central moment proposed by Hu [34] are the features used for data association. They work well in thermal images and can be combined with a distance measure. A global nearest neighbor search can be used to associate the feature vector of detected objects with existing tracks. This is described extensively in [7].
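
As a sketch, the image-based part of such a feature vector could be assembled as follows (area, average intensity as a temperature proxy, and the first Hu moment); the world-fixed distance term discussed in Sect. 3.2 is appended by the tracker and is not computed here. The blob mask is assumed to be a binary single-channel image.

```python
import cv2
import numpy as np

def image_features(thermal, blob_mask):
    """Per-object features for data association: blob area, average pixel
    intensity over the blob, and the first (normalized) Hu moment."""
    area = float(np.count_nonzero(blob_mask))
    avg_intensity = float(thermal[blob_mask > 0].mean())

    # First Hu moment of the blob shape (invariant to scale and rotation)
    hu1 = cv2.HuMoments(cv2.moments(blob_mask, binaryImage=True))[0, 0]

    return np.array([area, avg_intensity, hu1])
```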

3 Multiple Target Tracking

This section describes a tracking system that is able to estimate the position and velocity of several objects simultaneously. The first part describes the filtering part of the tracking system. The latter part presents how data association is solved in order to track several objects automatically at the same time. In this section, it will be assumed that a set of observations is available from the machine vision system. Moreover, the image frame coordinates of each object center are obtained by the object detection algorithm. Note that initialization of new tracks will not be covered in this paper; Leira et al. [7] cover this in more detail.

3.1 The Filtering Part of the Tracking System - State Estimation

In the filtering part of the tracking system, some challenges arise when images are captured from a moving platform with a pose (position and attitude) that changes rapidly. This is because the object position in the image will vary with the attitude and North-East-Down (NED) position (also called the world-frame position) of the UAV. Moreover, it is advantageous to track objects in world-fixed coordinates because it is less informative to track objects in the image plane when the position of the camera changes quickly. Therefore, it is necessary to know the pose of the UAV accurately in order to track objects in a world-fixed coordinate frame. Furthermore, time synchronization between the camera and the UAV navigation system becomes crucial since it is necessary to know the pose of the UAV as closely as possible to the time when an image is captured. How navigation uncertainty can be included in the tracking system is described in [35]. Moreover, the concept of simultaneous localization and mapping with moving landmarks is closely related to multiple target tracking, so the airborne SLAM literature is also relevant [10, 11, 36]. However, in this paper, the UAV position and attitude are assumed to be known from the autopilot, but the time synchronization is somewhat uncertain. This will obviously affect the performance and robustness of the tracking system in situations where the navigation data are unreliable, but this is not the focus of this paper.

Mono-cameras are only able to capture measurements of the bearing and not the range. This originates from the fact that only two world-frame coordinates can be acquired from two image coordinates when the range is unknown. This is not critical at sea because floating objects are located on the sea surface where the altitude is known. Therefore, all objects can be assumed to be located in the same plane. This is referred to as the flat-earth assumption, which theoretically makes it possible to find the horizontal world coordinates as a function of the UAV attitude, altitude, and image plane coordinates. This is most often the basis for georeferencing unless a range measurement is available. However, this assumption is fragile with respect to variations in altitude and errors in the navigation states. The navigation states are known from an extended Kalman filter (EKF), but are not necessarily entirely correct even though the accuracy is expected to be good. Moreover, because the navigation states are available from the EKF, the error is most likely not Gaussian and could perhaps be correlated from frame to frame. This might violate the Gaussian assumptions in the Kalman filter. In addition, georeferencing is a nonlinear transformation where the true measurement is the pixel position and not the NE positions. Thus, the noise related to the true measurement (pixel position) theoretically has to go through the same transformation to find the measurement covariance in the Kalman filter, and that is very challenging. Therefore, the measurement noise covariance is normally a matter of trial and error and not directly related to the accuracy of the sensors. Considering these issues, it is natural to look for another solution where the georeferencing operation is avoided, but where one can still take advantage of the fact that objects are located on the sea surface. This leads to a nonlinear measurement model, which entails the use of a nonlinear tracking filter such as the EKF. The goal of the filtering part is to estimate the position and velocity of an object with a state estimator. Since one instance of the state estimator will be created for every object, it is only necessary to consider how this is achieved for a single target.

3.1.1 Target Motion Model

In stochastic state estimation, it is necessary to define a motion model describing how the states evolve. How to choose a motion model is described in [37] and depends on the type of targets that are expected to be detected. In this paper, a constant velocity model (white noise acceleration) is chosen because the dynamics of floating objects are assumed to be slow. The goal is to estimate the position and velocity in the NE plane. The discrete-time constant velocity motion model at time step [k] is defined as

$$\begin{aligned} \begin{aligned} \mathbf {x}^t[k+1]&= \mathbf {F}^t\mathbf {x}^t[k] + \mathbf {E}^t\mathbf {v}^t[k] \\ \end{aligned} \end{aligned}$$
(8)

where \(\mathbf {x}^t = [p^N_t, p^E_t, v^N_t, v^E_t]^{\top }\) is the state vector consisting of the target position and velocity in the NE frame, and \(\mathbf {v}^t = [v^N_v, v^E_v]^{\top }\) is assumed to be zero-mean Gaussian white noise with covariance \(\mathbf {Q}\). \(\mathbf {F}^t\) and \(\mathbf {E}^t\) are the system matrices defined as

$$\begin{aligned} \mathbf {F}^t = \begin{bmatrix} 1&0&T&0 \\ 0&1&0&T \\ 0&0&1&0 \\ 0&0&0&1 \\ \end{bmatrix}, \quad \mathbf {E}^t = \begin{bmatrix} \frac{1}{2}T^2&0 \\ 0&\frac{1}{2}T^2 \\ T&0 \\ 0&T \end{bmatrix} \end{aligned}$$

where T is the sampling period of the camera. The down position is zero for surface objects at sea (given that the origin of NED is placed at sea level) and is not a part of the state vector. Note that the motion model is linear, which enables the use of the standard equations for the time update (prediction) in the Kalman filter.
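
A minimal sketch of these system matrices and the corresponding Kalman filter time update:

```python
import numpy as np

def cv_model(T):
    """System matrices of the constant velocity model in Eq. (8) for the
    state x = [p_N, p_E, v_N, v_E]; T is the camera sampling period."""
    F = np.array([[1, 0, T, 0],
                  [0, 1, 0, T],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    E = np.array([[0.5 * T ** 2, 0],
                  [0, 0.5 * T ** 2],
                  [T, 0],
                  [0, T]], dtype=float)
    return F, E

def time_update(x, P, F, E, Q):
    """Kalman filter prediction with white noise acceleration covariance Q."""
    return F @ x, F @ P @ F.T + E @ Q @ E.T
```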

Fig. 7 Illustration of the azimuth (\(\varphi \)) and elevation (\(\vartheta \)) angles. The camera points straight down from the image plane to the world-frame where \(z^c\) is the range from the camera to the plane the pixel is located in. This is simply the altitude of the UAV with zero roll, pitch, and gimbal angles

3.1.2 Tracking System Based on the Extended Kalman Filter

The bearing-only measurement model is based on the pinhole camera model. It relates the image plane coordinates (u, v) to a camera-fixed coordinate frame {C} (see Fig. 7):

$$\begin{aligned} \mathbf {z}_1(k) = \begin{bmatrix} z_{1_1} \\ z_{1_2} \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix} = \underbrace{\frac{f}{z^c} \begin{bmatrix} x^c \\ y^c \end{bmatrix}}_{\mathbf {h}_1}, \quad z^c \ne 0 \end{aligned}$$
(9)

where f is the focal length of the camera lens. The image plane coordinates (u, v) are measured by the detection module. The camera-fixed coordinates \([x^c_t, y^c_t, z^c_t]^{\top }\) of the target are used to relate the world-fixed coordinates of the camera (\(\mathbf {p}^n_{uav}\)) and the target (\(\mathbf {p}^n_t\)) through the following model

$$\begin{aligned} \mathbf {p}^c_{t} = \begin{bmatrix} x^c_t \\ y^c_t \\ z^c_t \end{bmatrix} = \mathbf {R}^c_n(\mathbf {p}^n_t - \mathbf {p}^n_{uav}) \end{aligned}$$
(10)

where \(\mathbf {R}^c_n\) is the rotation matrix from NED to {C}. In the filtering part, \(\mathbf {h}_1\) will be used as the measurement model, where the covariance can be chosen as a diagonal matrix \(\mathbf {R}\) with a pixel uncertainty related to each pixel coordinate. Note that it is necessary to insert (10) in (9) to get a measurement model that depends on the UAV attitude, gimbal orientation, UAV NED positions, and the NED positions of the target. This is beneficial because the measurement model now directly depends on parameters of interest, and not on the camera-fixed coordinates of the target.
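
A sketch of this measurement model (Eqs. 9-10), assuming the pixel coordinates are expressed in the same metric units as the focal length and that \(\mathbf {R}^c_n\) rotates NED vectors into the camera frame:

```python
import numpy as np

def h1(x_target, p_uav_ned, R_c_n, f):
    """Predicted pixel coordinates (u, v) of a target with NE position
    x_target[:2], given the UAV NED position, the NED-to-camera rotation
    R_c_n, and the focal length f."""
    p_t_ned = np.array([x_target[0], x_target[1], 0.0])  # flat-earth: down = 0
    x_c, y_c, z_c = R_c_n @ (p_t_ned - p_uav_ned)        # Eq. (10)
    return (f / z_c) * np.array([x_c, y_c])              # Eq. (9), z_c != 0
```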

The bearing-only observation model can also be represented in terms of azimuth (\(\varphi \)) and elevation (\(\vartheta \)) angles

$$\begin{aligned} \mathbf {z}_2(k) = \begin{bmatrix} z_{2_1} \\ z_{2_2} \end{bmatrix} = \begin{bmatrix} \varphi \\ \vartheta \end{bmatrix} = \underbrace{\begin{bmatrix} \arctan \left( \frac{y^c}{x^c} \right) \\ \arctan \left( \frac{z^c}{\sqrt{(x^c)^2 + (y^c)^2}} \right) \end{bmatrix}}_{\mathbf {h}_2} = \begin{bmatrix} \arctan \left( \frac{\frac{1}{f_v}v}{\frac{1}{f_u}u} \right) \\ \arctan \left( \frac{f_u}{u} \cos (\varphi ) \right) \end{bmatrix} \end{aligned}$$
(11)

where \(f_u\) and \(f_v\) are camera parameters in the intrinsic camera matrix obtained by the lens specification (or camera calibration). The azimuth and elevation angles are illustrated in Fig. 7 and this representation of the measurement can be used for data association.

The measurement model (9) is nonlinear, and the use of a nonlinear state estimator is necessary. The most common and straightforward solution is to use an extended Kalman filter (EKF) to estimate the position and velocity of the target. In order to use the EKF, it is necessary to find the Jacobian of \(\mathbf {h}_1\) with respect to the states:

$$\begin{aligned} \frac{\partial \mathbf {h}_1}{\partial \mathbf {x}^t} = \begin{bmatrix} \frac{\partial z_{1_1}}{\partial x^n_t}|_{\hat{\mathbf {x}}^t_{k|k-1}}&\frac{\partial z_{1_1}}{\partial y^n_t}|_{\hat{\mathbf {x}}^t_{k|k-1}}&0&0 \\ \frac{\partial z_{1_2}}{\partial x^n_t}|_{\hat{\mathbf {x}}^t_{k|k-1}}&\frac{\partial z_{1_2}}{\partial y^n_t}|_{\hat{\mathbf {x}}^t_{k|k-1}}&0&0 \end{bmatrix} \end{aligned}$$
(12)

where \(\hat{\mathbf {x}}^t_{k|k-1}\) is the predicted state \({\mathbf {x}^t}\) at the current time step. The last two columns of the Jacobian are zero because the measurement model \(\mathbf {h}_1\) does not depend on the target velocities.
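
A sketch of the resulting EKF measurement update, where the Jacobian of Eq. (12) is evaluated by central differences (only the position columns come out nonzero, as noted above). Here, h is the measurement function with the current UAV pose bound in, e.g., the h1 function sketched earlier, and R is the pixel measurement covariance.

```python
import numpy as np

def ekf_update(x_pred, P_pred, z, h, R, eps=1e-6):
    """EKF measurement update for a nonlinear measurement function h(x)."""
    n = len(x_pred)
    H = np.zeros((len(z), n))
    for j in range(n):                       # central-difference Jacobian (Eq. 12)
        dx = np.zeros(n)
        dx[j] = eps
        H[:, j] = (h(x_pred + dx) - h(x_pred - dx)) / (2 * eps)

    nu = z - h(x_pred)                       # innovation
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_upd = x_pred + K @ nu
    P_upd = (np.eye(n) - K @ H) @ P_pred
    return x_upd, P_upd, nu, S               # nu and S are reused for association
```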

The filtering part proposed here is somewhat more involved than a linear tracking filter based on georeferencing [6, 7] since it is nonlinear. Problems with linearization and initialization follow with the EKF, and the measurements (pixel coordinates) may in many cases be less informative than NE positions (for humans). On the other hand, the uncertainty related to the measurement model is much more naturally characterized in the nonlinear tracking filter. In the georeferencing approach, the true measurement (pixel position) goes through a complex transformation to obtain the NE positions that are used as a measurement in the tracking filter. This means that the covariance matrix of the measurement noise related to the pixel position should go through the same transformation. This is not necessary in the nonlinear case.

3.2 Data Association

Data association is the task of assigning newly detected measurements to existing tracks, or creating a new track when an object enters the image for the first time. Solving the data association problem is most often easy for the human eye and brain, but more difficult for computers. This is especially the case in situations where the scale and rotation of the object change with the movement of the UAV. Moreover, changes in illumination or appearance are challenges that are much harder for computers to deal with than for humans. In addition, because of the restricted processing capacity, advanced and complex methods are not suitable for use in real-time solutions on-board UAVs. Therefore, fast algorithms are necessary, and this is likely to affect the reliability and robustness of the association. A trade-off between robustness and processing time is most often inevitable. Luckily, a few errors in the association part may not be critical because the filtering part will remove measurements that are outliers.

In this paper, data association will be based on [7] and the global nearest neighbor (GNN) search. GNN is an approach where new measurements are associated with the existing objects (tracks) that are closest with respect to a predefined similarity measure. The similarity measure can be based on several different features, including distance (from measurement to estimated position), size, intensity, color distribution, and so forth. The least squares distance or a similar measure is used to connect new measurements with existing tracks. In situations where a new measurement is far from every existing track with respect to the similarity measure (a threshold must be assigned), a new object has most probably been detected. Thus, a new track is added to the tracking system. The search is called global because all measurements are considered together to find the optimal solution for the entire set of measurements at each time instant. If association is not handled globally, the order in which measurements are received will affect the result significantly, because the first observation can be assigned to an object that the second observation is closer to.

In Sect. 2.3 it was emphasized that data association should not only be based on image properties because there is a possibility of two similar objects being present within the same image. Therefore, it is necessary to add another measure to the association process. In target tracking, a tracking gate (an ellipse based on the mean and covariance of the estimates) is usually created around each object. An observation of an object is expected to fall within the tracking gate. This is usually described in terms of the Mahalanobis distance

$$\begin{aligned} \gamma = \varvec{\nu }^{\top } \mathbf {S}^{-1}\varvec{\nu } \end{aligned}$$
(13)

where \(\varvec{\nu }\) (innovation) and \(\mathbf {S}\) (innovation covariance) are defined as

$$\begin{aligned} \begin{aligned} \varvec{\nu }&= \mathbf {z}^t - \mathbf {h}^t(\mathbf {p}^n_{uav}, \mathbf {R}^c_n, \mathbf {p}^n_{target}) \\ \mathbf {S}&= \frac{\partial \mathbf {h}^t}{\partial \mathbf {x}^t}\mathbf {P}^t \left( \frac{\partial \mathbf {h}^t}{\partial \mathbf {x}^t}\right) ^{\top } + \mathbf {R} \end{aligned} \end{aligned}$$
(14)

\(\gamma \) is now a measure of the “inverse likelihood” that a certain observation is related to a specific object. The minimal Mahalanobis distance specifies the object that most likely is the origin of the measurement. Since the Mahalanobis distance is a normalized quantity, it can be combined with the information gathered in the feature vector to do the GNN least squares search for data association. This is described thoroughly in [7] and will not be the main focus here. In this paper, the focus will be directed toward finding the most robust measure for the Mahalanobis distance because the innovation can be computed in different ways. Three models for the innovation are presented and evaluated.
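
A sketch of the global association step based on the Mahalanobis distance of Eq. (13). The global search is implemented here with the Hungarian algorithm from SciPy as a stand-in for the least squares search of [7], and the gate value is an example chi-square threshold rather than a parameter from the paper. In a full implementation, the feature-vector terms would be added to the cost before the assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def gnn_associate(innovations, innovation_covs, gate=9.21):
    """Global nearest neighbor association. innovations[i][j] and
    innovation_covs[i][j] are nu and S of measurement j evaluated against
    track i (Eqs. 13-14). Returns a list of (track, measurement) pairs."""
    n_tracks = len(innovations)
    n_meas = len(innovations[0])
    large = 1e6                                    # cost for gated-out pairs
    cost = np.full((n_tracks, n_meas), large)
    for i in range(n_tracks):
        for j in range(n_meas):
            nu = innovations[i][j]
            gamma = nu @ np.linalg.inv(innovation_covs[i][j]) @ nu  # Eq. (13)
            if gamma <= gate:
                cost[i, j] = gamma

    rows, cols = linear_sum_assignment(cost)       # global minimum-cost assignment
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < large]
```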

3.2.1 The Linear Case

When the tracking is conducted in a Kalman filter where georeferencing is used to obtain measurements, the innovation in (14) can be calculated directly as the difference between the measurement and the predicted position for the tracked object. The innovation covariance can be extracted from the Kalman filter. The only difference from (14) is that the innovation covariance is a linear function of the measurement matrix \(\mathbf {H}\), so linearization is avoided. This is a very straightforward approach and works well in [7]. However, this method can only be used in a linear tracking filter. Therefore, it is important to find an alternative solution with equal or better performance in the nonlinear case. This method is, therefore, used as a benchmark. It will be referred to as the linear association method (or georeferencing) in the rest of this paper.

3.2.2 The Nonlinear Case

In the nonlinear case, two different representations of the innovation can be used to find the Mahalanobis distance. The first method is based on (9) and the second on (11). Both representations have advantages and disadvantages. Bryson and Sukkarieh [11] argue that the azimuth and elevation representation is beneficial because it is more accurate near the image borders. However, the situation might be different when all objects of interest are located in the same plane and perhaps very close together as well. Therefore, both representations will be evaluated.

The first representation is based on the pinhole measurement model and referred to as the pixel representation. The innovation is calculated as

$$\begin{aligned} \varvec{\nu }_1 = \begin{bmatrix} u \\ v \end{bmatrix} - \underbrace{\frac{f}{z^c} \begin{bmatrix} x^c \\ y^c \end{bmatrix}}_{\mathbf {h}_1} \end{aligned}$$
(15)

where \((x^c, y^c, z^c)\) is calculated from (10). It is important that the pixel coordinates (u, v) are represented in meters, which can be achieved using the camera and lens specifications. The innovation covariance is calculated with (14), where the measurement function \(\mathbf {h}_1\) is given by (9) and linearized with respect to the position of the target.

The second representation is based on the azimuth and elevation angles and referred to as the angle representation. The innovation can be calculated as

$$\begin{aligned} \varvec{\nu _2} = \begin{bmatrix} \arctan \left( \frac{\frac{1}{f_v}v}{\frac{1}{f_u}u} \right) \\ \arctan \left( \frac{f_u}{u} \cos (\varphi ) \right) \end{bmatrix} - \underbrace{\begin{bmatrix} \arctan \left( \frac{y^c}{x^c} \right) \\ \arctan \left( \frac{z^c}{\sqrt{(x^c)^2 + (y^c)^2}} \right) \end{bmatrix}}_{\mathbf {h}_2} \end{aligned}$$
(16)

The innovation covariance is calculated with (14), where the measurement function \(\mathbf {h}_2\) is linearized with respect to the position of the target. Note that the analytical Jacobian is quite complex for both \(\mathbf {h}_1\) and \(\mathbf {h}_2\). Therefore, it may be advantageous to evaluate the Jacobian numerically to reduce the computation time.
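
A sketch of the corresponding angle-form innovation (Eq. 16); \(f_u\) and \(f_v\) are the intrinsic focal parameters, and arctan2 is used for the azimuth terms to keep the quadrant well defined.

```python
import numpy as np

def innovation_angles(u, v, p_c, f_u, f_v):
    """Innovation in azimuth/elevation form (Eq. 16). (u, v) are the measured
    pixel coordinates and p_c = [x_c, y_c, z_c] is the predicted target
    position in the camera frame from Eq. (10)."""
    # Measured azimuth and elevation reconstructed from the pixel coordinates
    phi_meas = np.arctan2(v / f_v, u / f_u)
    theta_meas = np.arctan((f_u / u) * np.cos(phi_meas))

    # Predicted azimuth and elevation from the camera-frame position (Eq. 11)
    x_c, y_c, z_c = p_c
    phi_pred = np.arctan2(y_c, x_c)
    theta_pred = np.arctan2(z_c, np.hypot(x_c, y_c))

    return np.array([phi_meas - phi_pred, theta_meas - theta_pred])
```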

4 Experiments and Results

This section describes two different case studies that have been conducted to evaluate separate parts of the tracking system. The first case study looks into data association and the methods described in Sect. 3.2. The second case study looks into tracking of a single target.

Table 1 Covariance of the Gaussian white noise added to the UAV navigation states and the object detection algorithm in case study 1

4.1 Case Study 1 - Evaluation of the Data Association Methods

The main motivation behind the first case study is to compare the different methods for data association in terms of the Mahalanobis distance. Note that only the distance measures described in Sect. 3.2 will be used, so image properties are not relevant in this case study. Evaluation of the methods is achieved through Monte Carlo simulations. Two objects at rest, located 10 m (north) and 15 m (east) from each other (mean values), are simulated. Gaussian noise with a standard deviation of 10 m is added to the north and east initial positions for both objects. Therefore, the true distance between the objects will vary for each simulation to test the robustness of the data association in different situations. The true pixel positions in the image are calculated with the navigation states of the UAV and the position of the objects for each time step (with backwards georeferencing), before Gaussian noise (properties described in Table 1) is added to the true pixel positions and used as measurements. Moreover, white noise with properties described in Table 1 is added to the navigation states of the UAV so that georeferencing will lead to different north and east positions than the true ones. This ensures a realistic simulation where there is uncertainty related to both the image detection algorithm and the navigation states of the UAV.

One object (referred to as the main object) is tracked with a Kalman filter so that the estimated position and covariance are available. The goal is to find out which of the two measurements belongs to the main object at each time instant. 141 images are processed for each Monte Carlo simulation, and the association is either right or wrong for every image. The success rate (number of right associations divided by the number of images) will be the most important measure of performance. The states for the main object go through 50 iterations in the Kalman filter initially (before association begins) to ensure that the estimates are somewhat consistent before data association is conducted. The measurements for these iterations are generated as the true position with added Gaussian noise with a standard deviation of 10 m in both the north and east positions. Nevertheless, the estimated position for the main object is not more accurate than what can realistically be expected in a real situation (as will be shown in the results). In total, 10000 Monte Carlo simulations are conducted.

In order to conduct the Monte Carlo simulations, some design properties need to be stated. The camera sampling period is 0.1 s and, thus, each simulation covers a time period of 14 s (the first image is captured at time zero). The navigation states of the UAV are sampled at the same frequency. Image capture and sampling of the navigation states are not synchronized and, therefore, navigation states at the exact time instant when an image is captured are in general not available. The initial conditions and measurement noise related to each association method are summarized in Table 2. Note that all states are initially uncorrelated (only zeros on the off-diagonal of the covariance matrices). The navigation states were recorded during a field experiment, and the objects have been simulated at locations that are visible in the image plane. The covariance of the measurement noise (the bottom four elements in Table 2) is quite large, but that is mainly because the accuracy of the navigation states is limited and thus the measurements are quite poor at times. In other words, the magnitude of the covariance is caused by the navigation states and not by the accuracy of the object detection algorithm. The payload used in the field experiment is described more closely in [8].

Table 2 Initial conditions and covariance of the Gaussian white noise related to the measurements for the different association methods in discrete-time

4.1.1 Results - Evaluation of Data Association Methods

Table 3 presents the main result of the first case study. The success rate for the different methods is very similar, which means that all of the representations are applicable in practice. More importantly, the nonlinear methods have the same performance as the linear method. Therefore, a nonlinear tracking filter can be used for multiple target tracking. Note that the azimuth and elevation approach is the least accurate. However, the performance is so similar that one cannot rule out the possibility of the design parameters being the reason for the difference. Nevertheless, different simulation parameters (covariances) have been tried and the results seem to be consistent in different situations. Note that the nonlinear association in pixel form is the most accurate one overall.

Table 3 Success rate for the data association methods in 10000 Monte Carlo Simulations
Fig. 8 Results for a single simulation in Case Study 1. a Correct association or not for the three methods. High value indicates successful association. The nonlinear case 1 and 2 are the pixel form and angle form, respectively. b Mahalanobis distance in the linear case

Fig. 9 Results for a single simulation in Case Study 1. a Mahalanobis distance in the nonlinear case with measurement model (9). b Mahalanobis distance in the nonlinear case with azimuth and elevation angles (11)

Fig. 10 Results for a single simulation in Case Study 1. a Estimated north and east positions for the main object. b Measurements of the main object and the second object (with georeferencing)

Figures 8, 9 and 10 show results of a single Monte Carlo simulation. The success rate for the angle representation is 92.2% in this particular simulation. The success rate for the nonlinear approach in pixel form and for the linear representation is 91.5%. Hence, in this simulation, the angle representation is able to associate one more image correctly compared to the other representations. One interesting observation in Fig. 8a is that even though the pixel and angle forms have almost the same success rate, the association mistakes do not happen on the exact same set of images. Therefore, there seem to be situations where the pixel representation has advantages and other situations where the angle form is better. Thus, it could be possible to get an even more robust association by combining the methods in the nonlinear case. However, this is not investigated further in this paper.

Figures 8b and 9 show the Mahalanobis distance for the three methods. The Mahalanobis distance for the measurement from the main object is obviously much smaller, but some outliers exist. Figure 10 shows the estimated north and east positions for the main object and the measurements for the main and second object with georeferencing. Notice that the measurements are distributed over a large area in the north–east plane, and that is mainly because of the noise related to the UAV navigation states. Noise in the image detection algorithm has a much smaller influence than errors in the attitude. This is because georeferencing uses the attitude of the UAV to calculate the ray from the camera to the sea surface. An error of two degrees in roll will move the ray much further on the sea surface (when the altitude is large) compared to a small error in pixels. This has been verified in the simulations and shows the necessity of having reliable knowledge about the navigation states to get good performance in target tracking from a moving platform.

4.2 Case Study 2 - Tracking a Single Target

The second case study evaluates the performance of the tracking system when a single target is considered. The target is the ship displayed in Fig. 11 and has a length of approximately 70 m. Data association is not necessary because all of the measurements are known to originate from the target. Thus, the object detection algorithm and the nonlinear filtering part of the tracking system are the focus of this case study. This experiment is based on navigation data and images captured during a flight experiment with the X8 Skywalker fixed-wing UAV and the payload described in [8]. The target is visible in 441 images over a period of 45 s. Note that the images are based on five different segments in time that have been merged into one segment. This ensures that the ship is visible in all images in this case study. The true position of the target is measured with GPS and used as a benchmark (ground truth). Moreover, estimates with and without the centroid adjustment will be shown to illustrate why it can be beneficial to calculate the center of the ship in situations where only a part of the vessel is visible. Only a part of the vessel (as illustrated in Fig. 11) is visible in the majority of the images.

Fig. 11 Thermal images of the ship that is tracked in the second case study. Many images only contain a small part of the ship (as illustrated in the right image). That is why it is beneficial to use the size of the ship to get a better measure of the center in situations like this one

The design parameters that need to be stated in this case study include initial conditions for the states and covariance of the measurement (\(\mathbf {R}\)) and process (\(\mathbf {Q}\)) noise. \(\mathbf {R}\), \(\mathbf {Q}\) and the initial covariance of the states (\(\mathbf {P}(0)\)) are given as

$$\begin{aligned} \begin{aligned} \mathbf {P}(0)&= \begin{bmatrix} (10\,\mathrm{m})^2&0&0&0 \\ 0&(10\,\mathrm{m})^2&0&0 \\ 0&0&(\sqrt{20}\,\mathrm{m/s})^2&0 \\ 0&0&0&(\sqrt{20}\,\mathrm{m/s})^2 \end{bmatrix} \\ \\ \mathbf {R}&= \frac{1}{h}\begin{bmatrix} (100 \,\text {pixels})^2&0 \\ 0&(80 \,\text {pixels})^2 \end{bmatrix} \quad \mathbf {Q} = \frac{1}{h} \begin{bmatrix} (0.5\,\mathrm{m/s}^2)^2&0 \\ 0&(0.5\,\mathrm{m/s}^2)^2 \end{bmatrix} \end{aligned} \end{aligned}$$
(17)

where h is the step length (the sampling period of the camera). \(\mathbf {R}\) and \(\mathbf {Q}\) are given in discrete time. The initial position is given by georeferencing of the first available measurement, and the initial velocity is set to zero. Moreover, an approximation of the size of the ship (in pixels) is assumed to be known. In practice, the size is determined the first time the whole vessel is visible in an image. The ship is almost at rest during the tracking period, but the system has no knowledge about this behavior beforehand. The process noise covariance is chosen as white noise acceleration with a standard deviation of \(0.5\,\mathrm{m/s}^2\) in continuous time (in both north and east directions). This is assumed to be realistic because large vessels have slow dynamics. Therefore, it is expected that this covariance would work for moving targets as well. The camera captures 10 images each second with a resolution of 640 \(\times \) 512 pixels, and the navigation states of the UAV are stored at the same rate.

4.2.1 Results - Tracking a Single Target

441 images have been processed during the second case study. The object detection algorithm is able to detect the vessel in 438 of the images, which gives a success rate of 99.3%. The high success rate is explained by the high signal-to-noise ratio in the images (see Fig. 11). There is clearly a large difference in temperature between the vessel and the background. Hence, it is possible to separate the vessel from the background.

Figure 12 shows the NE positions obtained with georeferencing with and without the centroid adjustment. A larger part of the measurements is distributed near the center of the vessel with the centroid adjustment, even though the difference is quite small. Some important measures are summarized in Table 4. The mean georeferenced position is somewhat closer to the GPS reference with the centroid adjustment. Therefore, the benefit of finding the center is clearly illustrated. The variance increases a bit with the centroid adjustment. That is perhaps because the adjustment is only performed in situations where the major axis of the blob is significantly longer than the expected length of the minor axis (cf. Sect. 2.1). Therefore, the center is not adjusted in every image, and the distribution is likely to have a larger variance. This is because the difference between measurements with and without the adjustment is expected to be larger than the difference between two measurements without adjustment. Note that the variance is mainly related to sensor noise in the navigation states of the UAV and not to uncertainty in the object detection algorithm. Hence, the variance would be reduced significantly if the navigation states were perfectly known.

Fig. 12 Georeferenced north and east positions with (a) and without (b) the centroid adjustment

Table 4 Results of georeferencing with and without the centroid adjustment during object detection
Fig. 13 Total distance between the reference (GPS measured position) and the estimated position in the tracking period with (red) and without (blue) the centroid adjustment

Figure 13 shows the total distance between the GPS measured position and the estimated position with and without the centroid adjustment. After the initialization period, the estimate with the adjustment is more accurate, even though the difference is below 5 m most of the time. The mean distance between the true and estimated position is 11.54 and 13.07 m with and without the centroid adjustment, respectively. However, if the initialization period is neglected (the first 10 s), the mean is 9.96 and 12.21 m, respectively. The accuracy increases with the centroid adjustment, and the adjustment seems to be beneficial. Nevertheless, it is important to emphasize that the centroid adjustment is only necessary for targets of a certain length. The difference would be negligible for small vessels.

The estimated position is convincing, especially because the accuracy of the navigation states is limited. This is also supported by Fig. 10b, where it can be seen that the spread of the georeferenced points is large even though the noise related to the object detection algorithm (Table 1) is quite low. The GPS on the ship was not mounted exactly in the center of the ship, and GPS is also affected by noise. Therefore, one cannot rule out that the true distance between the estimates and the ship center is slightly smaller or larger than reported. The estimated position is more accurate than the estimates obtained with georeferencing in [6]. Thus, the nonlinear filtering part seems to be a viable alternative to a linear tracking filter.

Figure 14 shows the estimated speed for the vessel. The vessel is nearly at rest during the simulation (dynamic positioning) and the estimated speed is close to zero after the first period. It is interesting to see that the speed is closer to zero with the centroid adjustment. A speed of exactly zero is hard to achieve because it is impossible to compensate perfectly for the motion of the camera. Nevertheless, the estimated speed is very small so the performance of the filtering part seems to be reliable.

Fig. 14 Estimated speed with (red) and without (blue) the centroid adjustment

5 Concluding Remarks

This paper presented a vision-based tracking system for unmanned aerial vehicles. The tracking system is tailored toward floating objects and structures at sea. The tracking system consists of an object detection algorithm that is able to find marine vessels in thermal images. Moreover, if the size of the vessel is known (or computed the first time the whole structure is visible), the detection algorithm is able to compute the center of the ship, even when only a part of the vessel is visible in the image. This is obviously very beneficial for applications that require estimates with high precision. The tracking system also contains a filtering part that estimates the position and velocity of each target. In this paper, a nonlinear state estimator has been used to estimate the states based on the pixel position of the center of the object. Moreover, the tracking system includes a data association part that is used to find the origin of each measurement when multiple targets are tracked simultaneously.

Every part of the tracking system has been evaluated in this paper. The data association part was studied in the first case study, where it was shown that data association is equally accurate with a nonlinear tracking filter as in the linear case. This is an important result as it permits the use of the nonlinear tracking filter. The second case study evaluated the detection algorithm and the nonlinear filtering part. A single target was considered, and it was shown that the detection algorithm is very reliable when the temperature difference between the vessel and the sea surface is large. Moreover, the estimated position and speed were very close to the true position and speed. Therefore, all parts of the tracking system work well individually, and multiple targets can be tracked.