Introduction

There are numerous approaches to target tracking and following, which vary in implementation complexity, computational expense (CPU usage), optimality, and other measures. We will focus primarily on methods appropriate for multirotor UAVs carrying a body-fixed monocular camera.

Any vision-based target-tracking method will need to extract measurements from the images. Recently, there has been extensive research in object detection and identification using deep neural networks such as YOLO [1], R-CNN [2], and others [3]. These methods achieve high accuracy, but are computationally expensive. We discuss other methods in “Visual Front End.”

For a good review of target-tracking algorithms, we refer the reader to [4]. There exist optimal methods such as multiple hypothesis tracking (MHT) [5] and the probabilistic multi-hypothesis tracker (PMHT) [6]; however, they are difficult to implement and not feasible to run in real time [7]. A variant of the MHT is the track-oriented MHT (TO-MHT) [8], which can run in real time. There are other simple, computationally efficient techniques such as the global nearest neighbor filter (GNNF) [9] and the joint probabilistic data association filter (JPDAF) [10]; however, they cannot initialize tracks. A track consists of a state estimate (position, velocity, etc.) and error covariance of the target. A recently developed suboptimal target-tracking algorithm is Recursive Random Sample Consensus (R-RANSAC), which efficiently initializes and manages tracks [11,12,13,14,15]. We discuss this algorithm in more detail in “R-RANSAC Multiple Target Tracker.”

In order to follow the detected and tracked targets, a control approach known as image-based visual servoing (IBVS) is commonly implemented [16]. IBVS in its most basic form is implemented using a proportional-integral-derivative (PID) controller [17], which is simple to implement and responsive, but suffers from increased error and overshoot due to UAV rotations [18]. The IBVS-Sphere (IBVS-Sph) approach effectively combats this problem by mapping the image plane to a virtual spherical image plane around the camera; however, this and similar spherical mapping techniques become unstable as the target moves beneath the UAV [18]. An optimal solution is to map onto a virtual image plane parallel to the ground directly above the detected targets. This reduces error due to UAV pose and the discrepancies caused by spherical mapping. A few examples of this approach, such as the desired robust integral of the sign of error (DCRISE) controller, can be found in [19, 20]. The algorithm described in this paper takes advantage of this parallel virtual image plane, which will be discussed in more detail in “Target-Following Controller.”

The paper is organized as follows. In “Architecture,” we present the tracking and following architecture, followed by the visual front end, R-RANSAC, and the controller in “Visual Front End,” “R-RANSAC Multiple Target Tracker,” and “Target-Following Controller.” Finally, we discuss our results and conclusions in “Results” and “Conclusions.”

Architecture

In this paper, we assume that a monocular camera is rigidly mounted on a multirotor UAV equipped with an IMU, an on-board computer, an autopilot, and an altitude sensor. The camera sends images into the visual front-end pipeline. The visual front end is responsible for extracting point measurements of targets from the images and computing the homography and essential matrices, as shown in Fig. 1.

Fig. 1 Target-tracking and following architecture

The visual front end produces point measurements that are processed by the tracking back end, labeled R-RANSAC in Fig. 1, which produces tracks (position, velocity, plus error covariances) of targets. Target tracking is done in the current image frame, which requires transforming measurements and tracks expressed in the previous camera frame to the current frame.

Since we assume that the target is moving on a fairly planar ground, we can use the homography matrix to transform the measurements and tracks to the current image frame, as shown in Fig. 1.

The multiple target-tracking algorithm R-RANSAC sends the tracks to the track selector which determines which track to follow. The selected track is then passed to the controller which sends commands to the flight control unit (FCU) that enables target following, as shown in Fig. 1.

Visual Front End

This section describes the visual front end shown in Fig. 1.

Images from the camera, along with the camera’s intrinsic parameters, are given to the visual front end, which is responsible for producing measurements and calculating the homography and essential matrices. The visual front end can use a variety of algorithms to extract measurements; some that we have used include image differencing, color segmentation, YOLO, and apparent feature motion. The image-differencing method computes the difference between two images, looks for disparities of certain shapes and sizes (blobs) caused by moving objects, and takes the centers of the blobs as measurements. When the camera is moving, we use the homography to transform one image into the frame of the other image before differencing. The color segmentation method looks for blobs with a specific color, size, and shape profile to extract point measurements. This method of course assumes that the target of interest has a unique color, and is in general not very useful except in simple, controlled environments.
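To make the image-differencing method concrete, the following is a minimal sketch in Python with OpenCV; the function name, thresholds, and blob-size limits are illustrative choices of ours, and the sketch assumes the homography from the previous to the current frame has already been computed (e.g., as described in “Estimate Homography”).

```python
import cv2
import numpy as np

def difference_measurements(prev_gray, curr_gray, H_prev_to_curr,
                            thresh=25, min_area=50, max_area=5000):
    """Illustrative image-differencing front end (names and thresholds are ours).

    Warps the previous frame into the current frame using the homography,
    differences the images, and returns blob centers as point measurements.
    """
    h, w = curr_gray.shape
    # Compensate for camera motion before differencing.
    warped_prev = cv2.warpPerspective(prev_gray, H_prev_to_curr, (w, h))
    diff = cv2.absdiff(curr_gray, warped_prev)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))

    measurements = []
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        area = cv2.contourArea(c)
        if min_area <= area <= max_area:          # keep blobs of plausible size
            m = cv2.moments(c)
            measurements.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return measurements
```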

The method that we use to implement the visual front end is described in the remainder of this section and is shown graphically in Fig. 1. In particular, the KLT feature tracker extracts matching points between consecutive images. Using the matching points, the homography is computed. The matching points that are outliers to the homography are considered moving. Their motion can be caused by a moving target, noise, or parallax. Motion caused by parallax is filtered out using the essential matrix.

KLT Feature Tracker

In order to compute the homography and essential matrices, and to calculate apparent feature motion, good features need to be tracked between consecutive frames. A common and popular method is to use “Good Features to Track” [21] to select good features from the previous image and find their corresponding features in the current image using the Kanade-Lucas-Tomasi (KLT) feature tracker [22, 23]. The combination of the two algorithms yields matching features. These algorithms can be implemented using the OpenCV functions goodFeaturesToTrack() and calcOpticalFlowPyrLK() [24].
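A minimal sketch of this feature-matching step using the OpenCV functions above is given below; the parameter values (number of corners, window size, pyramid levels) are illustrative, not necessarily those used in our implementation.

```python
import cv2
import numpy as np

def match_features(prev_gray, curr_gray, max_corners=500):
    """Select good features in the previous frame and track them into the
    current frame with pyramidal Lucas-Kanade; returns matched point pairs."""
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                       qualityLevel=0.01, minDistance=7)
    if prev_pts is None:
        return np.empty((0, 2)), np.empty((0, 2))

    curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev_pts, None,
        winSize=(21, 21), maxLevel=3)

    good = status.ravel() == 1                 # keep only successfully tracked points
    return prev_pts.reshape(-1, 2)[good], curr_pts.reshape(-1, 2)[good]
```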

Estimate Homography

The homography describes the transformation between image frames and maps static features from one image to static features in the other image. Thus, if we map the matched features into the same image frame and compare the distances to their matched counterparts, we can identify which features correspond to static objects and which correspond to moving objects.

The matching features obtained from the KLT tracker are used to estimate the homography. The relevant geometry of the Euclidean homography is shown in Fig. 2.

Fig. 2 The geometry for the derivation of the homography matrix between two camera poses

Suppose that pf is a feature point that lies on a plane defined by the normal (unit) vector n. Let \(\boldsymbol {p}_{f/a}^{a}\) and \(\boldsymbol {p}_{f/b}^{b}\) be the position of pf relative to frames a and b, expressed in those frames respectively. Then, as shown in Fig. 2, we have

$$ \boldsymbol{p}_{f/b}^{b} = {R_{a}^{b}} \boldsymbol{p}_{f/a}^{a} + \boldsymbol{p}_{a/b}^{b}. $$

Let da be the distance from the origin of frame a to the planar scene, and observe that

$$ d_{a} = \boldsymbol{n}^{\top} \boldsymbol{p}_{f/a}^{a} \qquad \implies \qquad \frac{\boldsymbol{n}^{\top} \boldsymbol{p}_{f/a}^{a}}{d_{a}} = 1. $$

Therefore, we get that

$$ \begin{array}{@{}rcl@{}} \boldsymbol{p}_{f/b}^{b} = \left( {R_{a}^{b}} + \frac{\boldsymbol{p}_{a/b}^{b}}{d_{a}}\boldsymbol{n}^{\top} \right) \boldsymbol{p}_{f/a}^{a}. \end{array} $$
(1)

Let \(\boldsymbol {p}_{f/a}^{a} = (p_{xa}, p_{ya}, p_{za})^{\top }\) and \(\boldsymbol {p}_{f/b}^{b} = (p_{xb}, p_{yb}, p_{zb})^{\top }\), and let \({\epsilon }_{f/a}^{a} = (p_{xa}/p_{za}, p_{ya}/p_{za}, 1)^{\top }\) represent the normalized homogeneous coordinates of \(\boldsymbol {p}_{f/a}^{a}\) projected onto image plane a, and similarly for \({\epsilon }_{f/b}^{b}\). Then, Eq. 1 can be written as

$$ \frac{p_{zb}}{p_{za}} {\epsilon}_{f/b}^{b} = \left( {R_{a}^{b}} + \frac{\boldsymbol{p}_{a/b}^{b}}{d_{a}} \boldsymbol{n}^{\top} \right){\epsilon}_{f/a}^{a}. $$
(2)

Defining the scalar γf = pzb/pza, we get

$$ \gamma_{f} {\epsilon}_{f/b}^{b} = {H_{a}^{b}}{\epsilon}_{f/a}^{a}, $$
(3)

where

$$ {H_{a}^{b}} \stackrel{\triangle}{=} \left( {R_{a}^{b}} + \frac{\boldsymbol{p}_{a/b}^{b}}{d_{a}}\boldsymbol{n}^{\top} \right) $$
(4)

is called the Euclidean homography matrix between frames a and b [25]. Equation 3 demonstrates that the Euclidean homography matrix \({H_{a}^{b}}\) transforms the normalized homogeneous pixel location of pf in frame a into a homogeneous pixel location of pf in frame b. The scaling factor γf, which is feature point dependent, is required to put \({\epsilon}_{f/b}^{b}\) in normalized homogeneous coordinates, where the third element is unity.

The homography can be calculated using the OpenCV function findHomography(), which combines a 4-point algorithm with a RANSAC [26, 27] process to find the homography matrix that best fits the data. The findHomography() function scales the elements of H so that the (3,3) element is equal to one. Feature pairs that do not satisfy Eq. 3 are labeled as features that are potentially moving in the environment.
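For illustration, the following sketch shows a typical call to findHomography() on the matched pixel coordinates from the KLT step above, and how the returned inlier mask can be used to separate features consistent with camera ego-motion from potentially moving ones; the RANSAC reprojection threshold is an illustrative value.

```python
import cv2
import numpy as np

# prev_pts, curr_pts: Nx2 arrays of matched pixel coordinates (e.g., from the
# KLT sketch above). findHomography fits H with RANSAC and returns an inlier
# mask; the reprojection threshold (in pixels) is illustrative.
H, inlier_mask = cv2.findHomography(prev_pts, curr_pts,
                                    method=cv2.RANSAC,
                                    ransacReprojThreshold=3.0)

inliers  = curr_pts[inlier_mask.ravel() == 1]   # consistent with camera ego-motion
outliers = curr_pts[inlier_mask.ravel() == 0]   # potentially moving features
```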

Estimate Essential Matrix

The homography works well to segment moving from non-moving features provided that the scene is planar; however, that is rarely the case due to trees, lamp posts, and other objects that stick out of the ground. Objects that do not lie on the plane used to define the homography will be outliers to the homography and appear as moving features even if they are static. The essential matrix provides a strategy to filter out these static features using the epipolar constraint.

Figure 3 shows the essence of epipolar geometry. Let pf be the 3D position of a feature point in the world, and let \(\boldsymbol {p}_{f/a}^{a}\) be the position vector of the feature point relative to frame \(\mathcal {F}_{a}\) expressed in frame \(\mathcal {F}_{a}\), and similarly for \(\boldsymbol {p}_{f/b}^{b}\).

Fig. 3 Epipolar geometry

The relationship between \(\boldsymbol {p}_{f/a}^{a}\) and \(\boldsymbol {p}_{f/b}^{b}\) is given by

$$ \boldsymbol{p}_{f/b}^{b} = {R_{a}^{b}} \boldsymbol{p}_{f/a}^{a} + \boldsymbol{p}_{a/b}^{b}. $$
(5)

Defining

$$ \left \lfloor{\begin{pmatrix}a\\b\\c\end{pmatrix}}\right \rfloor = \begin{pmatrix} 0 & -c & b \\ c & 0 & -a \\ -b & a & 0 \end{pmatrix} $$

as the cross-product operator, then multiplying both sides of Eq. 5 on the left by \(\lfloor {\boldsymbol {p}_{a/b}^{b}}\rfloor \) gives

$$ \lfloor{\boldsymbol{p}_{a/b}^{b}}\rfloor \boldsymbol{p}_{f/b}^{b} = \lfloor{\boldsymbol{p}_{a/b}^{b}}\rfloor {R_{a}^{b}} \boldsymbol{p}_{f/a}^{a}. $$

Since \(\lfloor {\boldsymbol {p}_{a/b}^{b}}\rfloor \boldsymbol {p}_{f/b}^{b} = \boldsymbol {p}_{a/b}^{b} \times \boldsymbol {p}_{f/b}^{b}\) must be orthogonal to \(\boldsymbol {p}_{f/b}^{b}\), we have that

$$ \boldsymbol{p}_{f/b}^{b\top} \lfloor{\boldsymbol{p}_{a/b}^{b}}\rfloor {R_{a}^{b}} \boldsymbol{p}_{f/a}^{a} = 0. $$
(6)

Dividing (6) by the norm of \(\boldsymbol {p}_{a/b}^{b}\), and defining

$$ \boldsymbol{t}_{a/b}^{b}\overset{\triangle}{=} \frac{\boldsymbol{p}_{a/b}^{b}}{\left\|\boldsymbol{p}_{a/b}^{b}\right\|} $$

gives

$$ \boldsymbol{p}_{f/b}^{b\top} \lfloor{\boldsymbol{t}_{a/b}^{b}}\rfloor {R_{a}^{b}} \boldsymbol{p}_{f/a}^{a} = 0. $$
(7)

The matrix

$$ {E_{a}^{b}} = \lfloor{\boldsymbol{t}_{a/b}^{b}}\rfloor {R_{a}^{b}} $$
(8)

is called the essential matrix and is completely defined by the relative pose \(({R_{a}^{b}}, \boldsymbol {p}_{a/b}^{b})\).

Dividing (7) by the depths pza and pzb of the feature in frames a and b gives

$$ {\epsilon}_{f/b}^{b\top} ~{E_{a}^{b}}~ {\epsilon}_{f/a}^{a} = 0, $$
(9)

where \({\epsilon }_{f/a}^{a}\) and \({\epsilon }_{f/b}^{b}\) are the normalized homogeneous image coordinates of the feature in frame a (respectively frame b). This equation is the epipolar constraint and serves as a constraint between static point correspondences.

The epipoles \(\bar {e}_{a}\) and \(\bar {e}_{b}\) shown in Fig. 3 are the intersections of the line connecting \(\mathcal {F}_{a}\) and \(\mathcal {F}_{b}\) with each image plane. The epipolar lines \(\ell_{a}\) and \(\ell_{b}\) are the intersections of the plane \((\mathcal {F}_{a}, \mathcal {F}_{b}, \boldsymbol{p}_{f})\) with the image planes. The epipolar constraint with the pixel \({\epsilon }_{f/a}^{a}\) is satisfied by any pixel on the epipolar line \(\ell_{b}\). In other words, if pf is a static feature, or its motion is along the epipolar line, then its point correspondences \({\epsilon }_{f/a}^{a}\) and \({\epsilon }_{f/b}^{b}\) will satisfy the epipolar constraint [28].

The essential matrix can be calculated using the OpenCV function findEssentialMat(), which uses the five-point Nistér algorithm [29] coupled with a RANSAC process.
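For illustration, a typical call in Python looks like the following; the RANSAC probability and threshold values are illustrative, K denotes the camera intrinsic matrix, and prev_pts and curr_pts are the matched pixel coordinates from the KLT step above.

```python
import cv2

# E satisfies the epipolar constraint between the matched points; the mask
# flags point pairs consistent with the recovered relative pose.
E, e_mask = cv2.findEssentialMat(prev_pts, curr_pts, K,
                                 method=cv2.RANSAC,
                                 prob=0.999, threshold=1.0)
```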

Moving/Non-moving Segmentation

This section describes the “Moving/non-moving Segmentation” block shown in Fig. 1. The purpose of this block is to segment the tracked feature pairs into those that are stationary in the environment, and those that are moving relative to the environment. As shown in Fig. 1, the inputs to the “Moving/non-moving Segmentation” block at time k are the homography \(H_{k-1}^{k}\), the essential matrix \(E_{k-1}^{k}\), and the set of matching feature points \({\mathscr{M}}_{k}=\{({{\epsilon }_{i}^{k}}, {\epsilon }_{i}^{k-1})\}\) between image \(\mathcal {I}_{k-1}\) and image \(\mathcal {I}_{k}\).

When the camera is mounted on a moving UAV observing a scene where most of the objects in the scene are not moving, the homography computed from planar matching features will correspond to the motion of the UAV. As previously stated, moving objects or static objects not coplanar with the features used to compute the homography will appear to have motion when their corresponding features from the previous image are mapped to the current image. Therefore, given the set of matching feature points \({\mathscr{M}}_{k}\), we can segment \({\mathscr{M}}_{k}\) into two disjoint sets \({\mathscr{M}}_{k}^{in}\) for inliers and \({\mathscr{M}}_{k}^{out}\) for outliers where, for some small η1 > 0

$$ \begin{array}{@{}rcl@{}} \mathcal{M}_{k}^{in} &=& \left\{ ({{\epsilon}_{i}^{k}}, {\epsilon}_{i}^{k-1})\in \mathcal{M}_{k} \mid \left\|\gamma_{i} {{\epsilon}_{i}^{k}} - H_{k-1}^{k} {\epsilon}_{i}^{k-1}\right\| \leq \eta_{1} \right\} \\ \mathcal{M}_{k}^{out} &=& \left\{ ({{\epsilon}_{i}^{k}}, {\epsilon}_{i}^{k-1})\in \mathcal{M}_{k} \mid \left\|\gamma_{i} {{\epsilon}_{i}^{k}} - H_{k-1}^{k} {\epsilon}_{i}^{k-1}\right\| > \eta_{1} \right\}. \end{array} $$

Therefore, \({\mathscr{M}}_{k}^{in}\) are all matching feature pairs that are explained by the homography \(H_{k-1}^{k}\), and therefore correspond to ego-motion of the UAV, and \({\mathscr{M}}_{k}^{out}\) are all matching feature pairs that are not explained by the homography \(H_{k-1}^{k}\), and therefore potentially correspond to moving objects in the environment.

Figure 4 illustrates the application of this homography segmentation scheme, where feature outliers \({\mathscr{M}}_{k}^{out}\) have been retained.

Fig. 4 Motion detection using the homography matrix. Matching features are shown in red and blue. The set \({\mathscr{M}}_{k}^{in}\) is shown in blue, and the set \({\mathscr{M}}_{k}^{out}\) is shown in red

The homography matrix provides good moving/non-moving segmentation either if the motion of the UAV is purely rotational, or if the surrounding environment is planar. A planar environment may be an adequate assumption for a high-flying fixed-wing vehicle moving over mostly flat terrain. However, it is not a good assumption for a multirotor UAV moving in complex 3D environments, where non-planar, stationary features will appear to be moving due to parallax. In that case, the potentially moving features \({\mathscr{M}}_{k}^{out}\) need to be further processed to discard features from the 3D scene that are not moving.

Our approach uses the epipolar constraint given in Eq. 9 that is satisfied by stationary 3D points. Therefore, potentially moving 3D points are given by

$$ \mathcal{M}_{k}^{moving} = \left\{ ({{\epsilon}_{i}^{k}}, {\epsilon}_{i}^{k-1})\in \mathcal{M}_{k}^{out} \mid \left| {\epsilon}_{i}^{k\top} E_{k-1}^{k} {\epsilon}_{i}^{k-1} \right| > \eta_{2} \right\} $$

for some small η2 > 0.
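A sketch of the resulting two-stage segmentation is given below. The thresholds eta1 and eta2 are illustrative, and the homography residual is evaluated after re-normalizing the mapped point, which is equivalent to the definition above up to the scale factor γi.

```python
import numpy as np

def segment_moving(eps_prev, eps_curr, H, E, eta1=0.01, eta2=1e-3):
    """Two-stage moving/non-moving segmentation (thresholds illustrative).

    eps_prev, eps_curr: Nx3 normalized homogeneous coordinates of matched
    features in frames k-1 and k. H, E: homography and essential matrix
    from frame k-1 to frame k.
    """
    moving = []
    for e_prev, e_curr in zip(eps_prev, eps_curr):
        # Map the previous feature into the current frame with the homography
        # and re-normalize so the third element is one.
        mapped = H @ e_prev
        mapped = mapped / mapped[2]
        if np.linalg.norm(e_curr - mapped) <= eta1:
            continue                      # explained by ego-motion: inlier
        # Outlier to the homography: check the epipolar constraint to
        # discard apparent motion caused by parallax from static 3D structure.
        if abs(e_curr @ E @ e_prev) > eta2:
            moving.append(e_curr)         # violates epipolar constraint: moving
    return np.array(moving)
```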

Figure 5 illustrates the moving/non-moving segmentation scheme using video from a multirotor flying in close proximity to 3D terrain. The blue feature points correspond to features on 3D objects, which due to parallax are not discarded by the homography threshold and are therefore elements of \({\mathscr{M}}_{k}^{out}\). However, these points satisfy the epipolar constraint and therefore are not flagged as moving features. The red dots in Fig. 5 correspond to \({\mathscr{M}}_{k}^{moving}\) and are actually moving in the scene. One drawback to this approach is that features that are moving along the epipolar lines (i.e., moving in the same direction as the camera) will be filtered out. However, this can be mitigated by controlling the camera so that its motion is not aligned with the target’s motion.

Fig. 5 Motion detection using the essential matrix. Matching pairs in \({\mathscr{M}}_{k}^{out}\) are shown in blue and red, where the red features are in \({\mathscr{M}}_{k}^{moving}\)

R-RANSAC Multiple Target Tracker

Recursive Random Sample Consensus (R-RANSAC) is a modular multiple target tracking (MTT) paradigm originally developed in [11,12,13,14,15] and extended by various others [30,31,32,33,34,35,36,37]. The novel aspects of R-RANSAC include feature (measurement) and track propagation, track initialization, and track management. R-RANSAC tracks objects in the current camera frame. Since the camera frame moves as the UAV moves, features and tracks need to be transformed to the current camera frame. As new measurements are received, tracks are initialized, updated, and managed.

Transform Measurements and Tracks

This section describes the “Transform measurements and tracks to current camera frame” block shown in Fig. 1, which transforms all measurements and tracks from the previous image frame to the current image frame.

We have shown how normalized pixel coordinates are transformed between frames by the homography matrix as

$$ \gamma_{f} {\epsilon}_{f/b}^{b} = {H_{a}^{b}} {\epsilon}_{f/a}^{a}. $$

The visual multiple target tracking algorithm produces pixel velocities, pixel accelerations, and 2 × 2 covariance matrices associated with each of these quantities. In this section, we show how to transform pixel velocities, accelerations, and covariances using the homography matrix.

Throughout this section, we will use the following notation. The homography matrix will be decomposed into block elements as

$$ {H_{a}^{b}} \overset{\triangle}{=} \begin{pmatrix} H_{1} & \boldsymbol{h}_{2} \\ \boldsymbol{h}_{3}^{\top} & h_{4} \end{pmatrix} , $$

and the homogeneous image coordinates are decomposed as \({\epsilon} \overset{\triangle}{=} (\hat{{\epsilon}}^{\top}, 1)^{\top}\).

From Eq. 3, we have the relationship

$$ \begin{array}{@{}rcl@{}} & \gamma_{f} \begin{pmatrix}\hat{{\epsilon}}_{f/b}^{b} \\ 1 \end{pmatrix} = \begin{pmatrix} H_{1} & \boldsymbol{h}_{2} \\ \boldsymbol{h}_{3}^{\top} & h_{4} \end{pmatrix} \begin{pmatrix} \hat{{\epsilon}}_{f/a}^{a} \\ 1 \end{pmatrix}, \\ \iff & \begin{pmatrix}\gamma_{f}\hat{{\epsilon}}_{f/b}^{b} \\ \gamma_{f} \end{pmatrix} = \begin{pmatrix} H_{1} \hat{{\epsilon}}_{f/a}^{a} + \boldsymbol{h}_{2} \\ \boldsymbol{h}_{3}^{\top}\hat{{\epsilon}}_{f/a}^{a} + h_{4} \end{pmatrix}, \end{array} $$

which implies that

$$ \begin{array}{@{}rcl@{}} \hat{{\epsilon}}_{f/b}^{b} = \frac{ H_{1} \hat{{\epsilon}}_{f/a}^{a} + \boldsymbol{h}_{2}}{\boldsymbol{h}_{3}^{\top}\hat{{\epsilon}}_{f/a}^{a} + h_{4}} \quad \text{ and } \quad \gamma_{f} = \boldsymbol{h}_{3}^{\top}\hat{{\epsilon}}_{f/a}^{a} + h_{4}. \end{array} $$

Defining the function

$$ g(\hat{{\epsilon}}, H) \overset{\triangle}{=} \frac{H_{1}\hat{{\epsilon}} + \boldsymbol{h}_{2}}{\boldsymbol{h}_{3}^{\top} \hat{{\epsilon}} + h_{4}}, $$
(10)

we have that 2D pixels are transformed between frames as \(\hat{{\epsilon}}_{f/b}^{b} = g(\hat{{\epsilon}}_{f/a}^{a}, {H_{a}^{b}})\). Therefore, the 2D pixel velocity is transformed as

$$ \begin{array}{@{}rcl@{}} \dot{\hat{{\epsilon}}}_{f/b}^{b} = \frac{\partial g}{\partial \hat{{\epsilon}}}\Big|_{\hat{{\epsilon}}=\hat{{\epsilon}}_{f/a}^{a}} \dot{\hat{{\epsilon}}}_{f/a}^{a} = G(\hat{{\epsilon}}_{f/a}^{a}, {H_{a}^{b}})\dot{\hat{{\epsilon}}}_{f/a}^{a}, \end{array} $$
(11)

where

$$ \begin{array}{@{}rcl@{}} G(\hat{{\epsilon}}, H) =\frac{(\boldsymbol{h}_{3}^{\top} \hat{{\epsilon}} + h_{4})H_{1} - (H_{1}\hat{{\epsilon}}+\boldsymbol{h}_{2})\boldsymbol{h}_{3}^{\top}}{(\boldsymbol{h}_{3}^{\top} \hat{{\epsilon}} + h_{4})^{2}}. \end{array} $$
(12)

The next theorem shows how position and velocity covariances are transformed between images.

Theorem 1

Suppose that \({H_{a}^{b}}\) is the homography matrix between frames a and b, and that \(\hat {{\epsilon }}_{f/a}^{a}\) and \(\dot {\hat {{\epsilon }}}_{f/a}^{a}\) are random vectors representing the pixel location and velocity of feature f in frame a, with means \(\hat {\boldsymbol {\mu }}_{f/a}^{a}\) and \(\dot {\hat {\boldsymbol {\mu }}}_{f/a}^{a}\), and covariances \({{\Sigma }_{p}^{a}}\) and \({{\Sigma }_{v}^{a}}\), respectively. Suppose that \(\hat {{\epsilon }}_{f/a}^{a}\) is transformed according to \(\hat {{\epsilon }}_{f/b}^{b} = g(\hat {{\epsilon }}_{f/a}^{a}, {H_{a}^{b}})\), where g is defined in Eq. 10. Then, the means and covariances of \(\hat {{\epsilon }}_{f/b}^{b}\) and \(\dot {\hat {{\epsilon }}}_{f/b}^{b}\) are given by

$$ \begin{array}{@{}rcl@{}} \hat{\boldsymbol{\mu}}^{b} &=& g(\hat{\boldsymbol{\mu}}^{a}, {H_{a}^{b}}) \\ \dot{\hat{\boldsymbol{\mu}}}^{b} &=& G(\hat{\boldsymbol{\mu}}^{a}, {H_{a}^{b}}) \dot{\hat{\boldsymbol{\mu}}}^{a} \\ {{\Sigma}_{p}^{b}} &=& G(\hat{\boldsymbol{\mu}}^{a}, {H_{a}^{b}}) {{\Sigma}_{p}^{a}} G^{\top}(\hat{\boldsymbol{\mu}}^{a},{H_{a}^{b}}) \\ {{\Sigma}_{v}^{b}} &=& G(\hat{\boldsymbol{\mu}}^{a}, {H_{a}^{b}}) {{\Sigma}_{v}^{a}} G^{\top}(\hat{\boldsymbol{\mu}}^{a},{H_{a}^{b}}) \end{array} $$

where G is defined in Eq. 12.
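A minimal sketch implementing g, its Jacobian G from Eq. 12, and the transformations of Theorem 1 is shown below; the function name and argument layout are ours.

```python
import numpy as np

def transform_track(H, mu_p, mu_v, Sigma_p, Sigma_v):
    """Transform a track's 2D pixel mean, velocity, and covariances from
    frame a to frame b using the homography H (Eqs. 10-12 and Theorem 1)."""
    H1 = H[:2, :2]
    h2 = H[:2, 2]
    h3 = H[2, :2]
    h4 = H[2, 2]

    denom = h3 @ mu_p + h4
    mu_p_b = (H1 @ mu_p + h2) / denom                       # g(mu, H), Eq. 10
    # Jacobian of g with respect to the pixel location, Eq. 12.
    G = (denom * H1 - np.outer(H1 @ mu_p + h2, h3)) / denom**2

    mu_v_b = G @ mu_v
    Sigma_p_b = G @ Sigma_p @ G.T
    Sigma_v_b = G @ Sigma_v @ G.T
    return mu_p_b, mu_v_b, Sigma_p_b, Sigma_v_b
```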

Track Initialization

Given that the measurements and tracks are expressed with respect to the same coordinate frame, we use the new measurements that do not belong to any existing track to initialize new tracks.

For simplicity, suppose that there are two observable targets in the camera field-of-view whose motion can be described by a linear time-invariant model. Some of the camera’s measurements correspond to a target while others are spurious false measurements. Since there are multiple targets and false measurements, we need a way to associate measurements with their respective targets or with noise. We do this using the standard RANSAC algorithm.

Suppose that we currently have one target in the field-of-view of the camera and a batch of measurements as depicted in Fig. 6.

Fig. 6 Black dots indicate measurements, and the current batch of measurements is denoted with z. A particular minimum subset, including the current measurement zk, is denoted with red circles. A track hypothesis generated from the minimum subset is depicted with the red curve. Alternate trajectory hypotheses that are not selected are shown in yellow

We take a minimum subset of measurements such that the target’s trajectory can be reconstructed by the measurements in the subset and so that at least one of the measurements is from the latest time step. One particular minimum subset is depicted in Fig. 6 using red circles.

Using the minimum subset, a trajectory hypothesis is generated. The trajectory hypothesis is used to identify other measurement inliers (i.e., measurements that are close to the trajectory hypothesis). The trajectory hypothesis is then scored using the number of inliers. An example of a trajectory hypothesis is depicted in Fig. 6 by the red line.

This process is repeated up to a predetermined number of times. The trajectory hypothesis with the largest number of inliers is then filtered (e.g., using an EKF) to produce a new current track estimate. An example of track initialization with multiple targets is shown in Fig. 7. Alternate trajectory hypotheses that were not selected during initialization are shown in yellow in Fig. 7.
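To illustrate the idea, the following is a simplified sketch of the initialization step using a constant-velocity target model and a minimum subset of two measurements (the latest measurement plus one random earlier one); the iteration count and inlier threshold are illustrative, and the full R-RANSAC implementation differs in its motion model and scoring details.

```python
import numpy as np

def initialize_track(measurements, times, num_iters=100, inlier_thresh=0.02):
    """Simplified RANSAC track initialization with a constant-velocity model.

    measurements: list of 2D numpy arrays; times: matching time stamps; the
    latest measurement is the last entry. Returns the best (position, velocity)
    hypothesis at the latest time and the indices of its inliers.
    """
    if len(measurements) < 2:
        return None, []

    z_k, t_k = measurements[-1], times[-1]
    best_hyp, best_inliers = None, []

    for _ in range(num_iters):
        # Minimum subset: the latest measurement plus one random earlier one.
        j = np.random.randint(0, len(measurements) - 1)
        dt = t_k - times[j]
        if dt <= 0:
            continue
        vel = (z_k - measurements[j]) / dt        # hypothesis velocity
        # Score the hypothesis by counting measurements near the back-propagated track.
        inliers = [i for i, (z, t) in enumerate(zip(measurements, times))
                   if np.linalg.norm(z - (z_k + vel * (t - t_k))) < inlier_thresh]
        if len(inliers) > len(best_inliers):
            best_hyp, best_inliers = (z_k, vel), inliers

    # The best hypothesis would then be smoothed (e.g., by an EKF) over its inliers.
    return best_hyp, best_inliers
```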

Fig. 7 Track initialization for multiple targets

Track Management

When new measurements are received, they are associated to either an existing track or are used to initialize a new track. The measurements that are associated to a track are used to update the track. The modular design of R-RANSAC allows us to use various techniques to associate measurements and update the tracks. Some popular methods include the global nearest neighbor filter [38, 39], probabilistic data association filter [40], and joint probabilistic data association filter [40, 41]. Other possibilities include algorithms in [7].

R-RANSAC maintains a bank of M tracks. As the track initializer generates new tracks, tracks are pruned to keep the number of tracks at or below M. Every track is rated by the number of inliers it has and its lifetime. When there are more than M tracks, tracks with the lowest ratings are pruned until there are only M tracks.

As tracks are propagated and updated, they may leave the field-of-view of the camera, they may coalesce, or they may stop receiving measurements. To handle these situations, we remove tracks that have not received a measurement for a predetermined period of time, and we merge similar tracks.
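A simplified sketch of this track-management logic is given below; the Track fields, rating function, and thresholds are illustrative choices of ours rather than those of the R-RANSAC implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Track:
    x: np.ndarray            # state estimate (e.g., position and velocity)
    inlier_ratio: float
    lifetime: float          # seconds since initialization
    time_since_meas: float   # seconds since last associated measurement

def manage_tracks(tracks, M=10, stale_time=2.0, merge_dist=0.05):
    """Illustrative track management: drop stale tracks, merge near-duplicates,
    and prune the lowest-rated tracks so at most M remain."""
    # Remove tracks that have not received a measurement recently.
    tracks = [t for t in tracks if t.time_since_meas < stale_time]

    # Merge tracks whose position estimates nearly coincide (keep the better one).
    merged = []
    for t in sorted(tracks, key=lambda t: t.inlier_ratio, reverse=True):
        if all(np.linalg.norm(t.x[:2] - m.x[:2]) > merge_dist for m in merged):
            merged.append(t)

    # Rate by inliers and lifetime, then keep only the best M.
    merged.sort(key=lambda t: t.inlier_ratio * t.lifetime, reverse=True)
    return merged[:M]
```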

Good tracks, i.e., tracks that have a high inlier ratio, are given a unique numerical track ID. The good tracks are passed to the track selection algorithm at every time step.

Track Selection

R-RANSAC passes good tracks to the track selector which chooses a track to follow. In this section, we list several possible options for target selection.

Target Closest to the Image Center

One option is to follow the track that is closest to the image center. If visual-MTT returns a set of normalized image coordinates 𝜖i for the tracks, then select the track that minimizes ∥𝜖i∥.

Target Recognition

A common automatic method for track selection is target recognition using visual information. This method compares the tracks to a visual profile of the target of interest. If a track matches the visual profile, then it is followed. A downside of this method is that it requires the visual profile to be built beforehand. For visual target recognition algorithms, see [42,43,44].

User Input

A manual method for track selection is to query a user about which track should be followed. After the user has been queried, a profile of the target can be built from gathered data so that the track can be recognized in the future. One example of this is [45], which uses a DNN to build the visual profile online.

The selected track is communicated to the target following controller.

Target-Following Controller

This section overviews one possible target-following controller as shown in Fig. 1. The controller consists of three parts: (1) a PID strategy that uses a height-above-ground sensor to maintain a constant pre-specified height above the ground, (2) a position controller that follows the target based on the track information, and (3) a heading controller that aligns the UAV’s heading with the target’s heading. In this section, we describe the position and heading controllers in detail.

The provided track contains the state estimate of the target in normalized image coordinates. Image coordinates are not invariant to the roll and pitch of the UAV; therefore, we design the controller in the normalized virtual image plane.

Let \(\boldsymbol {p}_{t/c}^{c}\) denote the position of the target relative to the camera expressed in the camera frame; the track produced by R-RANSAC is in normalized image coordinates and is given by

$$ {\epsilon}_{t/c}^{c} = \frac{K_{c}^{-1}\boldsymbol{p}_{t/c}^{c}}{\boldsymbol{e}_{3}^{\top} K_{c}^{-1} \boldsymbol{p}_{t/c}^{c}} $$

where Kc is the camera intrinsic parameter matrix [46]. The target’s velocity is given by \(\dot {{\epsilon }}_{t/c}^{c}\). Note that the third element of \(\epsilon _{t/c}^{c}\) is 1, and the third element of \(\dot {{\epsilon }}_{t/c}^{c}\) is 0.

The coordinate axes in the camera frame are defined so that the z-axis points along the optical axis, the x-axis points to the right when looking at the image from the optical center in the direction of the optical axis, and the y-axis points down in the image, to form a right-handed coordinate system. Alternatively, the virtual camera frame is defined so that the z-axis points down toward the ground, i.e., is equal to e3, and the x and y axes are projections of the camera x and y axis onto the plane orthogonal to e3. A notional depiction of the camera and virtual camera frame is shown in Fig. 8.

Fig. 8 A notional depiction of the camera frame and the virtual camera frame. The optical axis of the virtual camera frame is the projection of the optical axis of the camera frame onto the down vector e3

The virtual camera frame is obtained from the camera frame through a rotation that aligns the optical axis with the down vector e3. The rotation, denoted \({R_{c}^{v}}\), is a function of the roll and pitch angles of the multirotor, as well as the geometry of how the camera is mounted to the vehicle.

Therefore, the normalized virtual image coordinates of the track in the virtual camera frame are given by

$$ {\epsilon}_{t/c}^{v} = \frac{{R_{c}^{v}}{\epsilon}_{t/c}^{c}}{\boldsymbol{e}_{3}^{\top} {R_{c}^{v}} {\epsilon}_{t/c}^{c}}. $$
(13)

Similarly, the pixel velocity in normalized virtual image coordinates is given by

$$ \begin{array}{@{}rcl@{}} \dot{{\epsilon}}_{t/c}^{v} = \frac{1}{(\boldsymbol{e}_{3}^{\top} {R_{c}^{v}} {\epsilon}_{t/c}^{c})^{2}} \left( (\boldsymbol{e}_{3}^{\top} {R_{c}^{v}} {\epsilon}_{t/c}^{c})I - {R_{c}^{v}} {\epsilon}_{t/c}^{c} \boldsymbol{e}_{3}^{\top} \right)\notag\\ \cdot\left( {R_{c}^{v}} \lfloor{\boldsymbol{\omega}_{c/v}^{c}}\rfloor{\epsilon}_{t/c}^{c} + {R_{c}^{v}}\dot{{\epsilon}}_{t/c}^{c} \right). \end{array} $$
(14)
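A minimal sketch of these two transformations is given below; the function and variable names are ours, and omega_c denotes the angular velocity \(\boldsymbol{\omega}_{c/v}^{c}\) appearing in Eq. 14.

```python
import numpy as np

def skew(v):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def to_virtual_frame(eps_c, eps_dot_c, R_c_v, omega_c):
    """Transform normalized image coordinates and their rates from the camera
    frame to the virtual camera frame (Eqs. 13 and 14).

    eps_c, eps_dot_c: 3-vectors with third elements 1 and 0, respectively.
    R_c_v: rotation from the camera frame to the virtual camera frame.
    omega_c: angular velocity of the camera relative to the virtual frame,
             expressed in the camera frame.
    """
    e3 = np.array([0.0, 0.0, 1.0])
    denom = e3 @ R_c_v @ eps_c
    eps_v = (R_c_v @ eps_c) / denom                                  # Eq. 13

    proj = denom * np.eye(3) - np.outer(R_c_v @ eps_c, e3)
    eps_dot_v = (proj @ (R_c_v @ skew(omega_c) @ eps_c
                         + R_c_v @ eps_dot_c)) / denom**2            # Eq. 14
    return eps_v, eps_dot_v
```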

Equations 13 and 14 are computed from vision data using the R-RANSAC tracker described in the previous section. We also note that \({\epsilon }_{t/c}^{v}\) is simply the normalized line-of-sight vector expressed in the virtual camera frame, i.e.,

$$ {\epsilon}_{t/c}^{v} = \frac{\boldsymbol{p}_{t/c}^{v}}{{\boldsymbol{e}_{3}^{T}}\boldsymbol{p}_{t/c}^{v}} = \lambda\boldsymbol{p}_{t/c}^{v}, $$

where \(\lambda = 1/({\boldsymbol {e}_{3}^{T}}\boldsymbol {p}_{t/c}^{v})\) is the inverse of the constant height-above-ground. In addition, we have that

$$ \begin{array}{@{}rcl@{}} \ddot{{\epsilon}}_{t/c}^{v} &= \lambda\ddot{\boldsymbol{p}}_{t/c}^{v} = \lambda \left( \ddot{\boldsymbol{p}}_{t/i}^{v} - \ddot{\boldsymbol{p}}_{c/i}^{v}\right), \end{array} $$

where \(\dot {\boldsymbol {p}}_{t/i}^{v}\) and \(\dot {\boldsymbol {p}}_{c/i}^{v}\) are the inertial velocities of the target and camera, and \(\ddot {\boldsymbol {p}}_{t/i}^{v}\) and \(\ddot {\boldsymbol {p}}_{c/i}^{v}\) are the inertial accelerations of the target and camera, all expressed in the virtual camera frame.

If we assume that the inertial acceleration of the target is 0, and that the center of the camera frame is the center of the multirotor body frame, then

$$ \ddot{{\epsilon}}_{t/c}^{v} = -\lambda \boldsymbol{a}^{v}, $$

where \( \boldsymbol {a}^{v} = \ddot {\boldsymbol {p}}_{b/i}^{v} = \ddot {\boldsymbol {p}}_{c/i}^{v}\) is the commanded acceleration of the multirotor.

We now have the following theorem.

Theorem 2

Assume that the inertial acceleration of the target is 0, and that the height-above-ground is constant and known. Let \({\epsilon }_{d/c}^{v}\) be the desired constant normalized line-of-sight vector to the target, and let

$$ \boldsymbol{a}^{v} = \frac{1}{\lambda}\left( (k_{1}+k_{2})\dot{{\epsilon}}_{t/c}^{v} + k_{1}k_{2}\left( {\epsilon}_{t/c}^{v}-{\epsilon}_{d/c}^{v}\right)\right), $$
(15)

where k1 > 0 and k2 > 0 are control gains, then

$$ {\epsilon}_{t/c}^{v}\to {\epsilon}_{d/c}^{v}. $$

The desired attitude is selected to align with the target’s velocity vector \(\dot {\boldsymbol {p}}_{t/i}^{v}\) as follows:

$$ \begin{array}{@{}rcl@{}} {R_{d}^{i}} &=& \begin{pmatrix} \boldsymbol{r}_{1} & \boldsymbol{r}_{2} & \boldsymbol{r}_{3} \end{pmatrix} \end{array} $$
(16)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{r}_{1} &=& \frac{(I-\boldsymbol{e}_{3}\boldsymbol{e}_{3}^{\top})\dot{\boldsymbol{p}}_{t/i}^{v}}{\left\|(I-\boldsymbol{e}_{3}\boldsymbol{e}_{3}^{\top}) \dot{\boldsymbol{p}}_{t/i}^{v}\right\|} \end{array} $$
(17)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{r}_{2} &=& \boldsymbol{e}_{3} \times \boldsymbol{r}_{1} \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{r}_{3} &=& \boldsymbol{e}_{3}. \end{array} $$
(19)

Therefore, the x-axis of the desired frame points in the horizontal direction of the target’s velocity vector, and the attitude is otherwise aligned with the body-level frame. The attitude control scheme is derived using the technique given in [47].
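A minimal sketch of the control computations in Eqs. 15 through 19 is given below; the gains are illustrative, and the sketch assumes the track state has already been transformed into the virtual camera frame as described above.

```python
import numpy as np

E3 = np.array([0.0, 0.0, 1.0])

def acceleration_command(eps_v, eps_dot_v, eps_d, lam, k1=1.0, k2=1.0):
    """Acceleration command of Eq. 15 (gains k1, k2 are illustrative).
    lam = 1 / height-above-ground; eps_* are normalized virtual image coords."""
    return ((k1 + k2) * eps_dot_v + k1 * k2 * (eps_v - eps_d)) / lam

def desired_attitude(target_vel_v):
    """Desired rotation of Eqs. 16-19: x-axis along the horizontal projection
    of the target velocity, z-axis down, y-axis completing the right-handed frame."""
    horiz = (np.eye(3) - np.outer(E3, E3)) @ target_vel_v
    r1 = horiz / np.linalg.norm(horiz)
    r3 = E3
    r2 = np.cross(r3, r1)
    return np.column_stack((r1, r2, r3))
```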

Following Multiple Targets

We briefly mention two approaches to following multiple targets. If the targets are clustered together, then following can be achieved by aligning their average position with the camera’s optical center using a technique similar to the one presented in this paper. A more realistic and common approach is a decentralized multiple-target-tracking scheme that uses a fleet of UAVs to cooperatively track targets in their respective surveillance regions and share their information via a communication network [48].

Results

We implemented the target tracking and following pipeline in simulation using PX4 software-in-the-loop with Gazebo and ROS [49]. We used the IRIS multirotor model provided by PX4, with a camera pitched down by 45°, and the default simulated noise values. A single target was commanded to move in a square. For simplicity, we had the UAV find the target using visual MTT before commanding the target to move. Once the target began moving, the UAV followed it fairly well in the normalized virtual image plane. Figure 9 shows the error plots. Notice that the yaw angle error increases sharply at several points; these occur when the target turns 90°. The turns also increase the error in the north-east plane. The results show the effectiveness of the complete pipeline and its robustness to target modeling errors.

Fig. 9 The X and Y errors are in the normalized virtual image plane in units of meters, and the yaw error is in units of radians

A video of the simulation is available at https://youtu.be/C6JWr1dGsBQ.

Conclusions

We have presented a review of a complete pipeline for tracking and following a target using a fixed monocular camera on a multirotor UAV. In future work, we plan to improve the controller to track multiple targets simultaneously, and incorporate target recognition for when tracks leave the camera field-of-view.