1 Introduction

Multi-Object Tracking (MOT) predominantly follows the tracking-by-detection paradigm. An MOT system typically comprises a general detector (Ren et al., 2015; Ge et al., 2021) and a generic motion-based tracker (Zhang et al., 2022; Cao et al., 2022; Bewley et al., 2016).

Although the Kalman Filter (KF) is a crucial motion estimation module for many state-of-the-art trackers (Wang et al., 2020; Zhou et al., 2020; Zhang et al., 2021, 2022), it has limitations in motion modeling and computational efficiency when applied to MOT.

Concerning motion modeling, the KF assumes uniform object motion, which is unsuitable for the varied motion patterns found in general tracking scenes, as shown in Fig. 1. Although the Extended KF (Smith et al., 1962) and the Unscented KF (Julier & Uhlmann, 1997) were introduced to handle non-uniform motion through Taylor approximations, they are computationally complex and cannot estimate arbitrary non-uniform motion, such as the highly random motions of dancers in stage scenes. Additionally, the KF is sensitive to noise, leading to the escape problem (Cao et al., 2022): when an object is lost, its bounding box, continuously predicted by the KF without the supervision of observations, rapidly escapes along the current velocity direction, making the object difficult to retrace. For example, in Fig. 1, when an object is lost at a certain circular point, the box predicted by the KF rapidly escapes along the velocity direction.

Concerning computational efficiency, each tracklet is represented by a distinct KF, and all KFs are updated sequentially, so the total time consumption grows linearly with the number of input tracklets, i.e., a time complexity of O(n), where n denotes the total number of tracklets. This expense weakens tracking efficiency, especially in large-scale object tracking. For example, the CPU-based tracking algorithm OCSORT (Cao et al., 2022), despite improving the KF to attain state-of-the-art performance, falls short in computational efficiency: it experiences nearly a 30\(\times \) increase in time consumption when the number of input tracklets grows from 6 (as in KITTI) to 139 (as in MOT20), as illustrated in Fig. 2. Thus, an optimal tracker must strike a balance between tracking precision and computational efficiency.

Fig. 1

Illustration of object trajectories in typical tracking scenarios. Four prevalent motion patterns are displayed: Low-speed, High-speed, Static, and Non-linear. The bold colored lines represent the central coordinate trajectories of the objects, while the black arrow lines indicate the direction of the object’s velocity at specific circular points

Fig. 2

Efficiency and HOTA comparisons between our method and other CPU & motion-based trackers across various benchmarks. The number in parentheses represents the average number of tracklets per frame for the respective test set. In the case of KITTI, the two scores correspond to HOTA for the Car and Pedestrian categories, respectively

To address these issues, we introduce a novel Parallel Kalman Filter (PKF) that models non-uniform motion while achieving a time complexity of O(1).

In modeling, the importance of a suitable set of state variables for the KF cannot be overstated. We therefore revise the conventional eight-tuple state variables and replace them with a simplified four-tuple set that contains only the 2D coordinates of the object center and their corresponding velocities. This simplification reduces the computational load and aligns better with the assumption that objects move little between adjacent frames, enabling effective modeling of non-uniform motion. Taking cues from the fundamental motion equation \(V = S/T\), we propose a unique state transfer formulation, called the non-uniform formulation, based on the simplified state variables. It models non-uniform motion as uniform motion by transforming the time interval \(\Delta t\) from a constant into a variable related to displacement. Moreover, to address the escape problem, we incorporate a deceleration strategy into the control-input model of the proposed formulation.

In computation, reducing the computational load and increasing parallelism are the two main routes to higher efficiency. By simplifying the state variables and adhering to the strict matrix representation of our non-uniform formulation, we introduce a parallel computation method that transposes the computation graph of PKF from matrix form to quadratic form, significantly reducing the computational load and enabling parallel computation across distinct tracklets via CUDA. Consequently, the time consumption of PKF becomes independent of the input tracklet scale, i.e., O(1). Overall, within PKF, the simplified state variables serve as the cornerstone of both modeling and computation: the design of the non-uniform formulation facilitates parallel computing, and the parallel implementation in turn accelerates the non-uniform formulation across diverse tracklets.

Although PKF can achieve high tracking efficiency through CUDA acceleration, the other conventional modules of the tracker remain CPU-based, leading to a bottleneck in large-scale object tracking. To further improve tracking efficiency in large-scale object tracking scenarios, we introduce Fast, the first fully GPU-based tracker paradigm, based on PKF; and FastTrack, the MOT system composed of Fast and a general detector, offering high efficiency and generality. Within FastTrack, Fast only requires bounding boxes with scores and class ids to perform a single association during one iteration, allowing for enhanced efficiency and generality.

Within Fast, we propose corresponding GPU-based modules to replace the conventional CPU-based modules. We innovatively introduce a highly efficient GPU 2D-array data structure to manage tracklets instead of instances like most previous works (Zhang et al., 2022; Cao et al., 2022; Bewley et al., 2016), enabling efficient parallel access. Furthermore, we propose a novel cost matrix implemented in CUDA, capable of automatically determining association priorities based on scores within a single association. This novel cost matrix also facilitates multi-object and multi-class tracking by simply shifting all boxes along the x-axis by the distance of class id times the input image width before calculating the Intersection over Union (IoU). Additionally, we propose a new association metric, HIoU, to replace IoU when tracking pedestrian or traffic scenes. Lastly, we implement the Auction Algorithm (Bertsekas, 1992a) for the asymmetric assignment problem using CUDA for the first time, replacing conventional CPU-based linear assignment algorithms such as the Hungarian Algorithm (Kuhn, 1955) or LAPJV (Jonker and Volgenant, 1987).
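The class-id shift described above can be illustrated with a short sketch (a hedged NumPy stand-in for the CUDA kernel; `iou_matrix` and `shift_by_class` are illustrative names, not the paper's implementation). Shifting every box along the x-axis by class id times the image width guarantees zero IoU between boxes of different classes, so a single IoU computation covers all classes at once:

```python
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU for boxes in (x1, y1, x2, y2) format."""
    a = boxes_a[:, None, :]   # (N, 1, 4)
    b = boxes_b[None, :, :]   # (1, M, 4)
    ix1 = np.maximum(a[..., 0], b[..., 0])
    iy1 = np.maximum(a[..., 1], b[..., 1])
    ix2 = np.minimum(a[..., 2], b[..., 2])
    iy2 = np.minimum(a[..., 3], b[..., 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter + 1e-9)

def shift_by_class(boxes, class_ids, img_w):
    """Shift each box along x by class_id * img_w so that boxes of
    different classes can never overlap; one IoU call then handles
    multi-class association."""
    shifted = boxes.astype(np.float64).copy()
    shifted[:, [0, 2]] += class_ids[:, None] * img_w
    return shifted
```

With this trick, boxes at the same position but with different class ids yield an IoU of exactly zero, so cross-class matches are automatically excluded from the cost matrix.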

The conducted experiments demonstrate that the average time per iteration of PKF on a GTX 1080Ti is only 0.2 ms and is independent of the input scale. Based on PKF and the other proposed modules, Fast achieves a real-time efficiency of 250 FPS on a GTX 1080Ti and 42 FPS even on the embedded CUDA device Jetson AGX Xavier. The efficiency is unaffected by the number of tracklets even on the MOT20 dataset, with 139 objects per frame on average, which has never been achieved by conventional CPU-based trackers. As shown in Fig. 2, Fast is 7\(\times \) faster than the state-of-the-art CPU & motion-based tracker OCSORT in the large-scale tracking scenes of MOT20 and obtains state-of-the-art performance on four benchmarks, i.e., MOT17 (Milan et al., 2016), MOT20 (Dendorfer et al., 2020), KITTI (Geiger et al., 2013), and DanceTrack (Sun et al., 2021).

In summary, our work presents three significant contributions:

  • We propose a novel Parallel Kalman Filter (PKF) that models non-uniform motion and achieves a time complexity of O(1). PKF modifies the conventional state variables, proposes a non-uniform formulation, incorporates a deceleration strategy to tackle the escape problem, and leverages a parallel computation method to reduce computational load.

  • We introduce the first fully GPU-based tracker paradigm called Fast, which greatly improves tracking efficiency in large-scale object tracking scenarios; and FastTrack, the MOT system consisting of Fast and a general detector, allowing for high efficiency and generality. Within FastTrack, Fast only requires bounding boxes with scores and class ids to perform a single association during one iteration.

  • We propose innovative GPU-based modules within Fast to replace conventional CPU-based modules, such as a highly efficient GPU 2D-array data structure for managing tracklets, a novel cost matrix implemented in CUDA for automatic association priority determination, a novel association metric HIoU, and the first implementation of the Auction Algorithm via CUDA for the asymmetric assignment problem. These GPU-based modules contribute to the real-time efficiency and generality of FastTrack in large-scale object tracking scenarios.

2 Related Works

2.1 Tracking-by-Detection

Tracking-by-detection has become the dominant paradigm in the MOT task. This paradigm divides an MOT system into two separate parts, i.e., the detector and the tracker. In the basic case, the detector provides the tracker with detection results (bounding boxes with confidence scores and class ids) for each video frame, and the tracker uses motion estimation to achieve tracking. In recent years, with the rapid development of object detection, more general object detectors (Redmon and Farhadi, 2018; Bochkovskiy et al., 2020) have achieved both high recall and high precision. Consequently, numerous tracking methods (Lu et al., 2020; Peng et al., 2020; Zhou et al., 2020; Wu et al., 2021; Zhang et al., 2022) have started utilizing powerful detectors (e.g., RetinaNet (Lin et al., 2017), CenterNet (Zhou et al., 2019), or YOLOX (Ge et al., 2021)) to obtain superior tracking performance. It has become a trend to combine high-performance detectors with concise and generic motion-based trackers into MOT systems. For instance, SORT (Bewley et al., 2016) first employed Faster R-CNN (Ren et al., 2015) as its detector and the Kalman Filter (Kalman, 1960) as its motion estimation module, achieving state-of-the-art performance in 2016 with a simple and efficient tracker. Building on SORT, ByteTrack (Zhang et al., 2022) achieved state-of-the-art results on MOT17 and MOT20 by using the advanced detector YOLOX (Ge et al., 2021). Before ByteTrack, many methods employing RetinaNet or CenterNet opted to directly filter out low-score boxes (scores below 0.5) to eliminate most False Positive (FP) boxes and guarantee tracking performance, owing to the low precision of detectors at the time. However, the high recall and precision of YOLOX ensure that even low-score boxes are likely to be True Positive (TP) boxes. Therefore, ByteTrack achieves state-of-the-art performance while ensuring simplicity and efficiency by employing YOLOX and score-based cascade association.
In our approach, Fast, we take it a step further by exploiting the score, i.e., by fusing tracklet and detection scores into the cost matrix to automatically prioritize matches within a single association.

In addition, tracking is essentially a computationally intensive task, yet current trackers are primarily CPU-based and implemented with object-oriented programming. GPUs have not been well explored for tracker implementation due to the programming gap between GPU and CPU. In this paper, we propose the first fully GPU-based tracker paradigm, which significantly improves tracking efficiency in large-scale object tracking scenarios.

2.2 Kalman Filter

Introduction The Kalman Filter (Bishop et al., 2001) is a classical motion estimation algorithm consisting of two phases: prediction and update. In the prediction phase, the KF uses the previous state to estimate the current state. The update phase incorporates observations of the current state to provide a more accurate state estimate.

At each iteration, two variables are maintained for each tracked object: the state estimate \({\textbf{x}}\) and its posterior estimated error covariance matrix \({\textbf{P}}\). The prediction phase of the KF is characterized by the state-transition model \({\textbf{F}}\), the control-input model \({\textbf{B}}\) with the control vector \({\textbf{u}}\), and the covariance of the process noise \({\textbf{Q}}\). The update phase is described by the observation model \({\textbf{H}}\) and the covariance of the observation noise \({\textbf{R}}\).

At each time step t, the KF first predicts the state estimate \({\textbf{x}}_{t|t-1}\) and its covariance matrix \({\textbf{P}}_{t|t-1}\) using the following equations:

$$\begin{aligned}&{{\textbf{x}}}_{t|t-1} = {\textbf{F}}_t {{\textbf{x}}}_{t-1} + {\textbf{B}}_t {{\textbf{u}}}_{t}, \end{aligned}$$
(1a)
$$\begin{aligned}&{\textbf{P}}_{t|t-1} = {\textbf{F}}_t {\textbf{P}}_{t-1} {\textbf{F}}_t^\top + {\textbf{Q}}_t, \end{aligned}$$
(1b)

where Eq. 1a models the object motion with a state transfer formulation.

The KF then updates the state estimate and covariance matrix based on the observation \({\textbf{z}}_t\) to obtain more accurate estimates (\({{\textbf{x}}}_{t}\) and \({\textbf{P}}_{t}\)) using the following equations:

$$\begin{aligned}&{\textbf{S}}_t = {\textbf{H}}_t{\textbf{P}}_{t|t-1}{\textbf{H}}_t^\top +{\textbf{R}}_t, \end{aligned}$$
(2a)
$$\begin{aligned}&{\textbf{K}}_t = {\textbf{P}}_{t|t-1} {\textbf{H}}_t^\top {\textbf{S}}_t^{-1}, \end{aligned}$$
(2b)
$$\begin{aligned}&{{\textbf{x}}}_{t} = {{\textbf{x}}}_{t|t-1} + {\textbf{K}}_t ({\textbf{z}}_t - {\textbf{H}}_t{{\textbf{x}}}_{t|t-1}), \end{aligned}$$
(2c)
$$\begin{aligned}&{\textbf{P}}_{t} = {\textbf{P}}_{t|t-1} - {\textbf{K}}_t{\textbf{S}}_t{\textbf{K}}_t^\top , \end{aligned}$$
(2d)

where the Kalman gain is denoted by matrix \({\textbf{K}}\), and the system uncertainty is represented by matrix \({\textbf{S}}\), which is the projected \({\textbf{P}}\) in the measurement space. Additionally, Eq. 2d can also be expressed as:

$$\begin{aligned} {\textbf{P}}_{t} = ({\textbf{I}} - {\textbf{K}}_t{\textbf{H}}_t){\textbf{P}}_{t|t-1}, \end{aligned}$$
(3)

where the identity matrix is denoted by \({\textbf{I}}\); the proof of this equivalence is standard and can be found in common Kalman Filter references.
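For concreteness, the predict-update cycle of Eqs. 1–2 can be sketched as a minimal NumPy implementation (an illustrative textbook version, not the paper's code):

```python
import numpy as np

def kf_predict(x, P, F, Q, Bu=None):
    """Eq. (1a)-(1b): propagate the state estimate and its covariance."""
    x_pred = F @ x if Bu is None else F @ x + Bu
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """Eq. (2a)-(2d): correct the prediction with observation z."""
    S = H @ P_pred @ H.T + R             # innovation (system uncertainty)
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x = x_pred + K @ (z - H @ x_pred)    # corrected state
    P = P_pred - K @ S @ K.T             # corrected covariance
    return x, P
```

A constant-velocity 1D state \([u, {\dot{u}}]\) with \({\textbf{F}} = [[1, 1], [0, 1]]\) and \({\textbf{H}} = [1, 0]\) reproduces the uniform-motion behavior discussed in the text.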

The uniform motion assumption of the KF is restrictive, leading to the development of the Extended KF (EKF) (Smith et al., 1962) and the Unscented KF (UKF) (Julier & Uhlmann, 1997), which handle non-uniform motion via first-order Taylor linearization and the unscented transform, respectively. However, these methods still rely on the Gaussian approximation underlying the KF and cannot accurately estimate arbitrary non-uniform motions, such as the highly random motion of dancers. Particle filters (Gustafsson et al., 2002) address non-uniform motions through sampling-based posterior estimation, but at the cost of a substantial increase in computational complexity.

Fig. 3

IoU statistics on general tracking scenes. The top plot displays the videos from the train sets of MOT17, MOT20, and KITTI; the bottom plot exhibits the videos from the train set and the validation set of DanceTrack

Fig. 4

Height and width ratio statistics on general tracking scenes. The title of each plot indicates the dataset and category

Application In the context of the MOT task, SORT (Bewley et al., 2016) initially applies the KF to model objects with uniform motion, assuming that the inter-frame displacements of each object are approximately equal. DeepSORT (Wojke et al., 2017) builds on SORT by improving the representation of \({\textbf{x}}\). Subsequently, most of the related works (Wang et al., 2020; Zhou et al., 2020; Zhang et al., 2021, 2022) directly employ the same KF used in DeepSORT as their motion estimation module.

In DeepSORT, the KF’s state estimate \({\textbf{x}}\) is an eight-tuple, \({\textbf{x}} = [u, v, \gamma , h, {\dot{u}}, {\dot{v}}, {\dot{\gamma }}, {\dot{h}}]^\top \), where (u, v) represents the 2D coordinates of the object center, \(\gamma \) is the box aspect ratio, and h is the box height. The remaining four variables with dots indicate the corresponding velocities.

DeepSORT and SORT assume uniform motion for all tracked objects. Consequently, \({\textbf{B}}\) and \({\textbf{u}}\) are discarded in Eq. 1a, and the state-transition model \({\textbf{F}}\) becomes:

$$\begin{aligned} {\textbf{F}}_{t} = \begin{bmatrix} {\textbf{I}}_{4 \times 4} &{} \Delta t\, {\textbf{I}}_{4 \times 4}\\ {\textbf{0}}_{4\times 4} &{} {\textbf{I}}_{4 \times 4} \end{bmatrix}, \end{aligned}$$
(4)

where the time difference \(\Delta t\) between two steps is consistent (1 by default) throughout the iterations. The process noise \({\textbf{Q}}\) and the observation noise \({\textbf{R}}\) are defined as follows:

$$\begin{aligned}&{\textbf{Q}}_{t} = \textbf{diag}(\phi h^2, \phi h^2, 10^{-4}, \phi h^2, \psi h^2, \psi h^2, 10^{-10}, \psi h^2), \end{aligned}$$
(5a)
$$\begin{aligned}&{\textbf{R}}_{t} = \textbf{diag}(\phi h^2, \phi h^2, 10^{-2}, \phi h^2), \end{aligned}$$
(5b)

where \(\phi \) and \(\psi \) represent the position weight and the velocity weight, respectively. The observation model \({\textbf{H}}\) is given by:

$$\begin{aligned} {\textbf{H}}_{t} = \begin{bmatrix} {\textbf{I}}_{4 \times 4}&{\textbf{0}}_{4\times 4} \end{bmatrix}. \end{aligned}$$
(6)
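Putting Eqs. 4–6 together, the DeepSORT-style models can be sketched as follows (a hedged illustration; the default weight values are assumptions borrowed from common DeepSORT implementations, not necessarily the exact constants used here):

```python
import numpy as np

def deepsort_models(h, dt=1.0, phi=(1 / 20) ** 2, psi=(1 / 160) ** 2):
    """Eq. (4)-(6): uniform-motion models for the eight-tuple state
    [u, v, gamma, h, du, dv, dgamma, dh]. phi/psi are the position and
    velocity weights; defaults here are illustrative assumptions."""
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)                    # position += dt * velocity
    H = np.hstack([np.eye(4), np.zeros((4, 4))])  # observe (u, v, gamma, h)
    Q = np.diag([phi * h**2, phi * h**2, 1e-4, phi * h**2,
                 psi * h**2, psi * h**2, 1e-10, psi * h**2])
    R = np.diag([phi * h**2, phi * h**2, 1e-2, phi * h**2])
    return F, H, Q, R
```

Note how the aspect-ratio entries (\(10^{-4}\), \(10^{-10}\), \(10^{-2}\)) are fixed constants rather than scaled by the box height h.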

Limitations Nevertheless, there are several issues with the aforementioned KF application.

First, the variables \(\gamma \) and h should not be incorporated into the state estimate \({\textbf{x}}\). Under the assumption that the displacements of objects in adjacent frames are similar, \(\gamma \) and h remain approximately constant between adjacent frames and should not be modeled with uniform-velocity dynamics. To adhere to this assumption, \(\gamma \) and h need to be excluded from the state estimation.

Second, the matrix \({\textbf{F}}\) can only predict objects under uniform motion, which is unsuitable for most motion patterns, particularly for objects with pronounced non-linear motion, such as dancers in DanceTrack.

Third, the KF is sensitive to noise and thus susceptible to the escape problem when the tracked object is lost. By design, the KF operates as a predict-update loop, where the update phase supervises the predicted state estimate to correct noise. When the tracked object is lost, the KF executes only the prediction phase, so noise is continuously amplified (visible as the rapid escape of the predicted box), making the object difficult to retrace. Recently, OCSORT (Cao et al., 2022) introduced the Observation-centric Online Smoothing strategy to mitigate the noise accumulated during the lack of observations once a lost object is retraced. However, this strategy does not increase the probability of the object being retraced in the first place, and it is not GPU-friendly.

Fourth, KF is computationally demanding, with a time complexity of O(n), which implies that its efficiency drastically declines as the number of tracked objects increases.

In this paper, we present the Parallel Kalman Filter to address these limitations.

2.3 Association

Association is also a core aspect of the MOT task, which primarily involves calculating a cost matrix between tracklets and detections, and then matching them based on the cost matrix. Among all the cues, position information is the most generic. In contrast to appearance or feature information, it can be directly obtained from a general object detector without the need for additional feature extraction networks or modifications to the original detection network. As a result, numerous methods (Bewley et al., 2016; Wojke et al., 2017; Zhang et al., 2022) utilize IoU to compute the cost matrix. Following the cost matrix calculation, tracklets and detections are matched using an assignment strategy. This can be achieved through classical linear assignment problem solutions such as the Hungarian Algorithm (Kuhn, 1955) or LAPJV (Jonker and Volgenant, 1987). For instance, SORT employs the Hungarian Algorithm for single association; DeepSORT (Wojke et al., 2017) uses the Hungarian Algorithm for cascade association; ByteTrack utilizes LAPJV for cascade association. However, both the Hungarian Algorithm and LAPJV are CPU-based implementations. To implement a fully GPU-based tracker, we introduce another classical solution to the linear assignment problem called the Auction Algorithm (Bertsekas, 1992a), which can be implemented on a GPU. In this paper, we successfully leverage CUDA to implement the Auction Algorithm and utilize it as the assignment strategy for our tracker paradigm.
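A minimal forward Auction Algorithm for a square maximization assignment might look as follows (an illustrative CPU sketch of the Bertsekas scheme, not the paper's CUDA implementation; the asymmetric and parallel variants add bookkeeping on top of this loop):

```python
import numpy as np

def auction_assign(value, eps=1e-3):
    """Forward auction (Bertsekas, 1992) for a square maximization
    assignment: value[i, j] is the benefit of giving object j to person i.
    Returns owner, where owner[i] is the object assigned to person i."""
    n = value.shape[0]
    prices = np.zeros(n)
    owner = -np.ones(n, dtype=int)        # person -> object
    obj_owner = -np.ones(n, dtype=int)    # object -> person
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        net = value[i] - prices           # net benefit of each object to i
        j = int(np.argmax(net))
        best = net[j]
        net[j] = -np.inf
        second = net.max()                # second-best net benefit
        prices[j] += best - second + eps  # bid: raise the winner's price
        prev = obj_owner[j]
        if prev >= 0:                     # evict the previous owner
            owner[prev] = -1
            unassigned.append(prev)
        obj_owner[j] = i
        owner[i] = j
    return owner
```

With \(\epsilon > 0\) the prices increase strictly at each bid, which guarantees termination; the bidding steps of different persons are independent, which is what makes the algorithm amenable to a GPU implementation.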

Table 1 Basic information about four benchmarks

3 Numerical Statistics

To address the modeling issues of KF, we conduct statistical analyses of general tracking scenarios to elucidate the characteristics of object motion and summarize priors.

For the sake of convenience, four benchmarks are selected for our study, namely, MOT17 (Milan et al., 2016), MOT20 (Dendorfer et al., 2020), KITTI (Geiger et al., 2013), and DanceTrack (Sun et al., 2021), with the corresponding information presented in Table 1. MOT17 and MOT20 comprise pedestrian scenes. MOT17 contains a relatively small number of videos and scenes compared to the other datasets, while MOT20 increases the density of pedestrians and emphasizes occlusions between them. The pedestrian movements in MOT17 and MOT20 are quite regular (low-speed and nearly uniform), and they maintain distinguishable appearances. DanceTrack includes a large number of stage scenes. The similar-looking dancers in these stages move chaotically, and their movements differ significantly from frame to frame, posing considerable challenges for modeling non-uniform motion. KITTI is among the first large-scale MOT datasets for traffic scenes, focusing on tracking high-speed cars and pedestrians. In this section, we perform statistical analyses on the ground truths of the above four benchmark datasets, including the train sets of MOT17, MOT20, KITTI, and DanceTrack, as well as the validation set of DanceTrack.

3.1 Trajectory Overlap

As illustrated in Fig. 3, we count the IoU of the same trajectory in adjacent frames. We find that for low-speed objects (mostly in MOT17/20 and DanceTrack), the majority of IoU distributions lie between 0.7 and 1, indicating that low-speed objects can be easily associated between frames even without estimation. For high-speed objects (cars or pedestrians in KITTI), the uniform motion modeling of KF is crucial for their tracking because they predominantly exhibit near-linear motion, as demonstrated by the upper right car in Fig. 1.

Modeling non-uniform motion while maintaining the simplicity of KF’s state transfer formulation is challenging because we cannot predict the object’s motion pattern in advance or express the object’s motion pattern in a mathematical formulation. However, based on the aforementioned IoU statistics, we can suppress KF’s predicted displacement by determining whether the object is moving at low speed, thereby achieving non-uniform motion modeling.

Hence, the prior derived from Trajectory Overlap is that, compared to high-speed objects, low-speed objects exhibit denser IoU distributions with high scores in adjacent frames. This demonstrates that when tracking low-speed objects, KF does not need to perform aggressive state estimation because the probability of high overlap between the current state and the observed state is high. Conversely, KF needs to perform aggressive state estimation to increase the overlap probability with the observation for tracking high-speed objects.

3.2 Height and Width Ratio

As depicted in Fig. 4, we count the ratio of height and width of the same trajectory in adjacent frames. The calculation formula is as follows:

$$\begin{aligned} R = \frac{\min (x_{cur}, x_{next})}{\max (x_{cur}, x_{next})}, \end{aligned}$$
(7)

where R denotes the height/width ratio, and \(x_{cur}\) and \(x_{next}\) represent the height/width of the same trajectory in the current and next frames. It is evident that, compared to dancers in DanceTrack, pedestrians and cars in MOT17/20 and KITTI exhibit stronger height invariance (ratio distribution between 0.7 and 1) than width invariance (ratio distribution between 0.4 and 1). This implies that we can enhance the accuracy of the tracker by introducing a height invariance prior into the tracker design.

Therefore, the prior from Height & Width Ratio is that pedestrians and cars between adjacent frames exhibit strong height invariance, demonstrating that when the height ratio of two cars or pedestrians is small (e.g., less than 0.7), there is a high probability that these two objects do not belong to the same trajectory.
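The prior can be sketched as a simple gating test (illustrative helper names; the 0.7 threshold follows the ratio distribution discussed above):

```python
def hw_ratio(cur, nxt):
    """Eq. (7): similarity ratio of a height (or width) across frames."""
    return min(cur, nxt) / max(cur, nxt)

def same_trajectory_plausible(h_cur, h_next, thresh=0.7):
    """Height-invariance prior: a pair of boxes whose height ratio falls
    below the threshold is unlikely to belong to the same trajectory."""
    return hw_ratio(h_cur, h_next) >= thresh
```

Such a gate can be used to veto candidate matches before or during the cost-matrix computation.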

Fig. 5

Illustration of the concept of adaptive \(\Delta t\). The relative stability of the velocity V is achieved by combining \(\Delta t\) with the displacement S, i.e., \(T = S/V\)

4 Proposed Methods

4.1 Parallel Kalman Filter

Modeling As discussed in the Limitations part of Sect. 2.2, we remove the variables \(\gamma \) and h, so PKF defines the state \({\textbf{x}}\) as a four-tuple, i.e., \({\textbf{x}} = [u, v, {\dot{u}}, {\dot{v}}]^\top \), where (u, v) is the 2D coordinate of the object center and \({\dot{u}}\) and \({\dot{v}}\) are the corresponding velocities. Accordingly, the process noise \({\textbf{Q}}\), the observation noise \({\textbf{R}}\), and the observation model \({\textbf{H}}\) are reformulated as

$$\begin{aligned}&{\textbf{Q}}_t = \textbf{diag}(\phi h^2, \phi h^2, \psi h^2, \psi h^2),\end{aligned}$$
(8a)
$$\begin{aligned}&{\textbf{R}}_t = \textbf{diag}(\phi h^2, \phi h^2), \end{aligned}$$
(8b)
$$\begin{aligned}&{\textbf{H}}_{t} = \begin{bmatrix} {\textbf{I}}_{2 \times 2}&{\textbf{0}}_{2 \times 2} \end{bmatrix}, \end{aligned}$$
(8c)

where the weights \(\phi \) and \(\psi \) are \((\frac{1}{20})^2\) and \((\frac{1}{80})^2\), respectively. The simplified state variables allow for more effective modeling of non-uniform motion and reduce the computational complexity.

As discussed in Sect. 3.1, to model non-uniform motion, as shown in Fig. 5, we start from the basic motion equation \(V=S/T\) and model non-uniform motion as uniform motion by transforming \(\Delta t\) in Eq. 1a from a constant into an adaptive variable related to the displacement s, as Eq. 9 shows. Based on the prior of Trajectory Overlap, we introduce the \(\Delta t\) threshold factor \(\xi \) into \(\Delta t\) to distinguish high-speed objects from low-speed objects, with the aim of suppressing the estimated displacement of low-speed objects only. The variable \(\Delta t\) for \({\dot{u}}\) and \({\dot{v}}\) is defined as

$$\begin{aligned} \left\{ \begin{array}{ll} \Delta t_{{\dot{u}}_{t-1}} = \min \left( \xi , \frac{1}{\left| {\dot{u}}_{t-1} \right| }\right) {s}_{u_{t-1}},\\ \Delta t_{{\dot{v}}_{t-1}} = \min \left( \xi , \frac{1}{\left| {\dot{v}}_{t-1} \right| }\right) {s}_{v_{t-1}}, \end{array}\right. \end{aligned}$$
(9)

where the object is considered to be moving at a low speed when the absolute value of the velocity is less than \(\frac{1}{\xi }\) and the displacement s is expressed as follows:

$$\begin{aligned} \left\{ \begin{array}{ll} s_{u_{t-1}} = | u_{t-1} - u_{t-2} |, \\ s_{v_{t-1}} = | v_{t-1} - v_{t-2} |. \end{array}\right. \end{aligned}$$
(10)

In practice, s is usually smoothed to reduce noise via a linear smoothing weight \(\omega \):

$$\begin{aligned} \left\{ \begin{array}{ll} s_{u_{t-1}} = \omega | u_{t-1} - u_{t-2} | + (1 - \omega )s_{u_{t-2}},\\ s_{v_{t-1}} = \omega | v_{t-1} - v_{t-2} | + (1 - \omega )s_{v_{t-2}}. \end{array}\right. \end{aligned}$$
(11)
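Eqs. 9–11 can be sketched per axis as follows (a hedged illustration; the default \(\xi \) and \(\omega \) values are assumptions, not the paper's tuned hyperparameters):

```python
def adaptive_dt(vel_prev, s_prev, xi=0.05):
    """Eq. (9): dt = min(xi, 1/|v|) * s. Low-speed objects (|v| < 1/xi)
    take the xi branch, which suppresses the predicted displacement;
    high-speed objects recover near-uniform motion since v * (s/|v|) ~ s."""
    return min(xi, 1.0 / (abs(vel_prev) + 1e-9)) * s_prev

def smoothed_disp(pos_prev, pos_prev2, s_prev2, omega=0.5):
    """Eq. (10)-(11): exponentially smoothed per-axis displacement."""
    return omega * abs(pos_prev - pos_prev2) + (1 - omega) * s_prev2
```

For a fast object (\(|v| = 100\), \(s = 100\)) the predicted displacement \(v \cdot \Delta t\) stays close to s, while for a slow object (\(|v| = 2\)) it is scaled down by \(\xi \).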
Fig. 6

Illustration of the escape problem. a, b Display the predicted trajectories of the same bounding box by KF and PKF, respectively, after the tracked person gets lost. The red dashed box represents the box at the beginning of the lost trajectory; the red solid box represents the box after 15 consecutive prediction phases of KF or PKF. The box predicted by the KF escapes in the direction of the velocity with noise, while the PKF suppresses the noise and keeps the box around the dashed box so that the person can be retraced (Color figure online)

To solve the escape problem, we restore the control-input model in Eq. 1a to implement a deceleration strategy. As discussed in Sect. 2.2, when the tracked object is lost, the KF performs only the prediction phase, so the noise is continuously amplified, as shown in Fig. 6. We therefore gradually reduce the current velocity according to the degree of loss to suppress the accumulated noise. The deceleration ratio r is defined as

$$\begin{aligned} r = \frac{f_{id} - f_{end} - 1}{\tau }, \end{aligned}$$
(12)

where \(f_{id}\) is the current frame number and \(f_{end}\) is an attribute of the tracklet called EndFrame (see the Storage part of Sect. 4.2); \(\tau \) is the maximum lost threshold for removing tracklets that have been lost for too long, set to 30 by default. r is 0 while the tracklet is tracked and decelerates the velocity once the tracklet is lost. When an object is lost (i.e., \(f_{id} - f_{end} - 1 > 0\) in Eq. 12), the deceleration strategy can stop the object at an early stage of loss to increase the probability of its being retraced. The four variables in the non-uniform formulation (Eq. 1a) are therefore expressed as follows:

$$\begin{aligned} \left\{ \begin{array}{ll} u_{t|t-1} = u_{t-1} + {\dot{u}}_{t-1} \Delta t_{{\dot{u}}_{t-1}} (1 - \frac{r}{2}),\\ v_{t|t-1} = v_{t-1} + {\dot{v}}_{t-1} \Delta t_{{\dot{v}}_{t-1}} (1 - \frac{r}{2}),\\ {\dot{u}}_{t|t-1} = {\dot{u}}_{t-1} (1 - r),\\ {\dot{v}}_{t|t-1} = {\dot{v}}_{t-1} (1 - r). \end{array}\right. \end{aligned}$$
(13)
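Eqs. 12–13 can be sketched as follows (an illustrative per-axis version; `decel_ratio` and `pkf_predict_state` are hypothetical helper names, not the paper's code):

```python
def decel_ratio(f_id, f_end, tau=30):
    """Eq. (12): 0 while the tracklet is tracked (f_id = f_end + 1),
    growing linearly with the number of lost frames up to tau."""
    return (f_id - f_end - 1) / tau

def pkf_predict_state(u, v, du, dv, dt_u, dt_v, r):
    """Eq. (13): non-uniform prediction with the deceleration strategy.
    Positions advance by a damped displacement; velocities shrink by (1-r)."""
    u_pred = u + du * dt_u * (1 - r / 2)
    v_pred = v + dv * dt_v * (1 - r / 2)
    return u_pred, v_pred, du * (1 - r), dv * (1 - r)
```

After 15 lost frames with \(\tau = 30\), r reaches 0.5, so the velocity is halved at each prediction and the box quickly comes to rest near where the object was lost.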
Algorithm 1

Pseudo-code of PKF Prediction.

Algorithm 2

Pseudo-code of PKF Update.

Under the matrix representation, \({\textbf{F}}\) and \(\textbf{Bu}\) can be represented as

$$\begin{aligned} {\textbf{F}}_t = \begin{bmatrix} {\textbf{I}}_{2 \times 2} &{} \begin{bmatrix} \Delta t_{{\dot{u}}_{t-1}} &{} 0 \\ 0 &{} \Delta t_{{\dot{v}}_{t-1}} \end{bmatrix}\\ {\textbf{0}}_{2 \times 2} &{} {\textbf{I}}_{2 \times 2} \end{bmatrix}, \end{aligned}$$
(14)
$$\begin{aligned} {\textbf{B}}_t {\textbf{u}}_t = \begin{bmatrix} - {\dot{u}}_{t-1} \Delta t_{{\dot{u}}_{t-1}} \frac{r}{2} \\ - {\dot{v}}_{t-1} \Delta t_{{\dot{v}}_{t-1}} \frac{r}{2} \\ - {\dot{u}}_{t-1} r \\ - {\dot{v}}_{t-1} r \end{bmatrix}. \end{aligned}$$
(15)

Computation To attain an O(1) time complexity, we exploit our simplified state variables and the strict matrix representation of our non-uniform formulation. In this context, we introduce an innovative parallel computation method. This method transposes the computation graph of the PKF from a matrix to a quadratic form, capitalizing on the attributes of sparse matrices within the PKF. Consequently, we facilitate parallel computation across distinct objects employing CUDA.

Specifically, the conventional KF uses matrices to represent its variables (e.g., \({\textbf{F}}\), \({\textbf{P}}\), and \({\textbf{H}}\)), and the calculations are carried out by matrix multiplication. However, all the matrices involved in the KF are sparse with fixed nonzero positions, which means that much of the matrix-multiplication work is spent computing meaningless zeros. Separating out these zero-related computations is therefore critical for improving efficiency. Meanwhile, the separated computation reduces to a finite set of quadratic operations, which can easily be parallelized across different tracklets via CUDA. For example, the covariance matrix \({\textbf{P}}\) of PKF is represented as follows:

$$\begin{aligned} {\textbf{P}} = \begin{bmatrix} p_{1} &{} &{} p_{7} &{} \\ &{} p_{2} &{} &{} p_{8} \\ p_{5} &{} &{} p_{3} &{} \\ &{} p_{6} &{} &{} p_{4} \end{bmatrix}, \end{aligned}$$
(16)

where the positions without variables have a value of 0 at every time step, so we can reduce the computational load by eliminating these zero-valued positions, in both storage and computation. In storage, we define the covariance vector \({\textbf{p}}\) to equivalently represent the matrix \({\textbf{P}}\), i.e., \({\textbf{p}} = [p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8]\). We also define the other vectors involved in the PKF calculation: the state vector \(\varvec{{\hat{x}}} = [u, v, {\dot{u}}, {\dot{v}}]\), the observation vector \(\varvec{{\hat{z}}} = [{\bar{u}}, {\bar{v}}, {\bar{w}}, {\bar{h}}]\), and the property vector \({\textbf{o}} = [w, h, f_{end}, s'_u, s'_v, u', v']\), where \(f_{end}\) is the attribute of the tracklet called EndFrame, and \(s'_u, s'_v, u',\) and \(v'\) are the previous displacements and coordinates, respectively. Algorithms 1 and 2 present the calculation process of the prediction phase and the update phase of PKF, respectively. In Algorithm 2, \(s_1\) and \(s_2\) denote the elements on the main diagonal of the matrix \({\textbf{S}}\), i.e., \({\textbf{S}} = \textbf{diag}(s_1, s_2)\), and \(k_1\), \(k_2\), \(k_3\), and \(k_4\) denote the nonzero entries of the Kalman gain, i.e.,

$$\begin{aligned} {\textbf{K}} = \begin{bmatrix} k_{1} &{} \\ &{} k_{2} \\ k_{3} &{} \\ &{} k_{4} \end{bmatrix}. \end{aligned}$$
(17)

\(r_u\), \(r_v\), \(r_w\), and \(r_h\) are the residuals of u, v, w, and h between observation and estimation. For initialization, u, v in \(\varvec{{\hat{x}}}\) and w, h in \({\textbf{o}}\) are set to the corresponding values in the input detection result; \({\dot{u}}\), \({\dot{v}}\) in \(\varvec{{\hat{x}}}\) and \(s'_u\), \(s'_v\) in \({\textbf{o}}\) are set to 0; \(u'\), \(v'\) in \({\textbf{o}}\) are set to the values of u and v, respectively; and \(f_{end}\) is set to the current frame id \(f_{id}\). The vector \({\textbf{p}}\) is initialized to \([4\phi h^2, 4\phi h^2,100\psi h^2,100\psi h^2,0,0,0,0]\) following DeepSORT.

In the CUDA implementation, we employ two threads per object for the u-related and v-related computations, respectively. Each block owns 32 threads, and the number of blocks is set to \(\lceil \frac{n}{16} \rceil \), where n is the number of tracked objects.
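To make the quadratic-form computation concrete, the following is a minimal NumPy sketch of the PKF prediction phase operating on all tracklets at once (the actual implementation launches raw CUDA kernels through CuPy, and the process-noise terms `q_pos`/`q_vel` are illustrative assumptions, not values from the paper):

```python
import numpy as np

def pkf_predict(x, p, dt_u, dt_v, r, q_pos=1e-2, q_vel=1e-4):
    """Prediction phase of PKF in scalar ("quadratic") form.

    x : (n, 4) states [u, v, u_dot, v_dot]   (cf. Eqs. 14-15)
    p : (n, 8) covariance entries p1..p8     (sparse P of Eq. 16)
    dt_u, dt_v : (n,) per-object time steps; r : deceleration ratio.
    q_pos, q_vel are illustrative process-noise terms (assumption).
    """
    u, v, du, dv = x.T
    p1, p2, p3, p4, p5, p6, p7, p8 = p.T
    # State: x <- F x + B u (Eqs. 14-15): the position advances by a
    # half-decelerated step, the velocity is damped by (1 - r).
    u_n = u + du * dt_u * (1.0 - r / 2.0)
    v_n = v + dv * dt_v * (1.0 - r / 2.0)
    du_n = du * (1.0 - r)
    dv_n = dv * (1.0 - r)
    # Covariance: P <- F P F^T + Q, written out entry-by-entry for the
    # u-block [[p1, p7], [p5, p3]] and the v-block [[p2, p8], [p6, p4]],
    # so each tracklet reduces to a few independent scalar operations.
    p1_n = p1 + dt_u * (p5 + p7) + dt_u**2 * p3 + q_pos
    p7_n = p7 + dt_u * p3
    p5_n = p5 + dt_u * p3
    p3_n = p3 + q_vel
    p2_n = p2 + dt_v * (p6 + p8) + dt_v**2 * p4 + q_pos
    p8_n = p8 + dt_v * p4
    p6_n = p6 + dt_v * p4
    p4_n = p4 + q_vel
    return (np.stack([u_n, v_n, du_n, dv_n], axis=1),
            np.stack([p1_n, p2_n, p3_n, p4_n,
                      p5_n, p6_n, p7_n, p8_n], axis=1))
```

Because every output entry depends only on a handful of scalars from the same tracklet, the loop over tracklets disappears entirely on the GPU: each thread evaluates these expressions for one object.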

Fig. 7

Illustration of different storage structures for trackers. a 2D-array. b List with instances. The illustration shows only a portion of the components (e.g., Box or Score) of tracklets

4.2 Fast

Storage In previous works (Lu et al., 2020; Peng et al., 2020; Zhou et al., 2020; Wu et al., 2021; Zhang et al., 2022), tracklets are usually represented as instances and stored in a list, as shown in Fig. 7b. Since the program cannot access these instances in parallel, updating or accessing each tracklet executes with time complexity O(n). As shown in Fig. 7a, guided by Occam’s Razor, we abandon the instance representation and propose a GPU 2D-array to store tracklets. In functionality, the two storage schemes can perform exactly the same operations, such as modifying certain information or adding/removing certain tracklets; in efficiency, thanks to CUDA acceleration, parallel update or access operations can be performed on all tracklets with a time complexity of O(1). In Fast, each tracklet has the attributes Box (uvwh), Score, ClassID, TrackID, State (one of Track, Lost, or New), EndFrame, and the PKF-related variables (\(\varvec{{\hat{x}}}, {\textbf{p}},\) and \({\textbf{o}}\)).
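The storage idea can be sketched as follows with a NumPy 2D-array (CuPy exposes the same API on GPU); the column layout and the numeric State encoding are assumptions for the example, not the paper's exact schema:

```python
import numpy as np  # stand-in for CuPy: `import cupy as np` on GPU

# Illustrative column layout for the tracklet 2D-array (assumption: the
# paper lists the attributes but not their column order).
COLS = ["u", "v", "w", "h", "score", "class_id", "track_id", "state", "end_frame"]
TRACK, LOST, NEW = 0.0, 1.0, 2.0        # State encoding (illustrative)
STATE = COLS.index("state")

tracklets = np.zeros((4, len(COLS)), dtype=np.float32)
tracklets[:, STATE] = [LOST, TRACK, LOST, NEW]

# O(1)-depth parallel update: flip every Lost tracklet to Track at once,
# instead of looping over per-tracklet instances.
mask = tracklets[:, STATE] == LOST
tracklets[mask, STATE] = TRACK

# Removing tracklets is a boolean-mask copy; adding is a row append.
keep = tracklets[:, STATE] != NEW
tracklets = tracklets[keep]
```

On the GPU, each of these masked operations maps to one elementwise kernel launch over all rows, which is what decouples the update cost from the tracklet count.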

Algorithm 3

Pseudo-code of Fast.

Cost Matrix Since Fast maintains the cleanest processing flow and performs the association operation only once during one iteration, the quality of the cost matrix has a crucial impact on the tracking results. We incorporate the quality (score) of the boxes into the cost matrix to automatically determine the priority of association. Specifically, for a list of tracklets with N elements and a list of detections with M elements, where each element owns a box B with a corresponding score \({\mathcal {S}}\), the \(N \times M\) cost matrix \({\textbf{C}}\) is calculated as follows:

$$\begin{aligned} {\textbf{C}}_{ij} = IoU(B_i, B_j) \times {\mathcal {S}}_i \times {\mathcal {S}}_j, \end{aligned}$$
(18)

where i/j denotes the i-th/j-th element in the input tracklets/detections. Through this calculation, tracklets and detections with high scores are assigned first, those with low scores are assigned last, and the remaining ones receive medium priority. From another perspective, we achieve the effect of cascade association by score (as BYTE (Zhang et al., 2022) does) within a single association. In addition, the novel cost matrix can also achieve multi-object & multi-class tracking by simply shifting all boxes along the x-axis by the class id times the input image width before calculating the IoU. In implementation, we assign each element of the cost matrix to a CUDA kernel to improve computational efficiency. Besides, based on the prior obtained in the Height & Width Ratio analysis, we propose the HIoU metric to replace the original IoU in our cost matrix, which is calculated as follows:

$$\begin{aligned} HIoU(B_{\mathcal {T}}, B_{\mathcal {D}}) = \left\{ \begin{array}{ll} 0, &{} \frac{\min (h_{\mathcal {T}}, h_{\mathcal {D}})}{\max (h_{\mathcal {T}}, h_{\mathcal {D}})} < \lambda , \\ IoU(B_{\mathcal {T}}, B_{\mathcal {D}}), &{} \text {otherwise}. \end{array} \right. \end{aligned}$$
(19)

HIoU receives a detection box \(B_{\mathcal {D}}\) and a tracklet box \(B_{\mathcal {T}}\) and calculates the IoU between them if the ratio of their heights (\(h_{\mathcal {D}}\) and \(h_{\mathcal {T}}\)) is greater than or equal to the height ratio threshold \(\lambda \); otherwise, the result of HIoU is 0. HIoU can be used in pedestrian or traffic scenes to further improve performance.
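A hedged NumPy sketch of Eqs. (18) and (19) follows. It vectorizes the whole matrix at once, whereas the paper assigns one CUDA kernel per element, and it assumes corner-format boxes [x1, y1, x2, y2] for brevity rather than the paper's (u, v, w, h) representation; the function names are illustrative:

```python
import numpy as np

def iou_matrix(bt, bd):
    """Pairwise IoU between tracklet boxes bt (N, 4) and detection
    boxes bd (M, 4), each row [x1, y1, x2, y2]."""
    tl = np.maximum(bt[:, None, :2], bd[None, :, :2])   # intersection top-left
    br = np.minimum(bt[:, None, 2:], bd[None, :, 2:])   # intersection bottom-right
    wh = np.clip(br - tl, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_t = (bt[:, 2] - bt[:, 0]) * (bt[:, 3] - bt[:, 1])
    area_d = (bd[:, 2] - bd[:, 0]) * (bd[:, 3] - bd[:, 1])
    return inter / (area_t[:, None] + area_d[None, :] - inter + 1e-9)

def cost_matrix(bt, st, bd, sd, lam=None):
    """Score-weighted cost matrix of Eq. (18); if `lam` is given, the
    HIoU gate of Eq. (19) zeroes pairs whose height ratio is below it."""
    iou = iou_matrix(bt, bd)
    if lam is not None:
        ht = bt[:, 3] - bt[:, 1]
        hd = bd[:, 3] - bd[:, 1]
        ratio = (np.minimum(ht[:, None], hd[None, :])
                 / np.maximum(ht[:, None], hd[None, :]))
        iou = np.where(ratio < lam, 0.0, iou)
    return iou * st[:, None] * sd[None, :]
```

Because high-score pairs yield larger entries, a benefit-maximizing assignment naturally resolves them first, which is how the single association reproduces the score-cascade effect.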

Assignment In order for the tracker to be fully accelerated by the GPU, we implement the Auction Algorithm (Bertsekas, 1992b) for the asymmetric assignment problem. The naive Auction Algorithm models the assignment as an auction to solve the classical symmetric assignment problem, in which n persons and n objects must be matched on a one-to-one basis. There is a benefit \(a_{ij}\) (an entry of an \(n \times n\) benefit matrix) for matching person i with object j, and we want to assign persons to objects so as to maximize the total benefit. “Appendix A” describes the calculation and implementation of the naive Auction Algorithm in detail. Since the shape of the cost matrix is usually an arbitrary \(n \times m\), the Reverse Auction Algorithm is required, and we implement it via CUDA. Specifically, when n is less than or equal to m, we apply the naive Auction Algorithm to match the tracklets and detections; when n is greater than m, we first transpose the cost matrix (from \(n \times m\) to \(m \times n\)) and then apply the naive Auction Algorithm, mapping the results back to the original cost matrix.
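For reference, here is a pure-Python sketch of the naive forward auction for the symmetric case (Appendix A has the full details; in the paper's GPU version the bidding of all persons runs in parallel CUDA threads). The bidding increment `eps` trades speed for accuracy: the final total benefit is within n·eps of optimal.

```python
import numpy as np

def auction(benefit, eps=1e-3):
    """Naive forward Auction Algorithm (Bertsekas) for a square n x n
    benefit matrix; returns the person -> object assignment. Sketch of
    the serial logic that the paper parallelizes on GPU."""
    n = benefit.shape[0]
    prices = np.zeros(n)
    owner = -np.ones(n, dtype=int)     # object -> person
    assign = -np.ones(n, dtype=int)    # person -> object
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        values = benefit[i] - prices               # net value of each object
        j = int(np.argmax(values))
        best = values[j]
        values[j] = -np.inf
        second = values.max()
        prices[j] += best - second + eps           # bid increment
        if owner[j] >= 0:                          # outbid the current owner
            assign[owner[j]] = -1
            unassigned.append(owner[j])
        owner[j] = i
        assign[i] = j
    return assign
```

When the benefit gaps exceed n·eps (as with the score-weighted IoU entries in practice), the auction returns the same assignment as an exact solver such as LAPJV, which matches the accuracy parity reported in Table 4.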

Architecture Following the concise and generic tracker design concept, Fast requires only boxes with scores and class ids and performs a single association per iteration, allowing for high generality and efficiency. The input of Fast is a video sequence \({\mathcal {V}}\) and a detector \(Det(\cdot )\); the max lost threshold \(\tau \) and the tracking score threshold \(\epsilon \) (0.7 by default) are also set. The output of Fast is the tracklets \({\mathcal {T}}\) of the video, where each tracklet has attributes such as Box, Score, ClassID, TrackID, State (one of Track, Lost, or New), and EndFrame. As shown in the pseudo-code of Algorithm 3, for each frame \(f_t\) in the video \({\mathcal {V}}\), Fast first obtains the detections \({\mathcal {D}}_t\), including Box, Score, and ClassID, via the detector \(Det(\cdot )\). Then the motion estimation (the prediction phase of PKF) of the tracklets \({\mathcal {T}}\) is calculated (line 4). After associating the detections \({\mathcal {D}}_t\) and tracklets \({\mathcal {T}}\) via the cost matrix generated from them, Fast obtains four groups: the matched detections \({\mathcal {D}}_{m}\), the matched tracklets \({\mathcal {T}}_{m}\), the unmatched detections \({\mathcal {D}}_{u}\), and the unmatched tracklets \({\mathcal {T}}_{u}\) (lines 5 to 8). The detections in \({\mathcal {D}}_{m}\) are used to update the corresponding tracklets in \({\mathcal {T}}_{m}\) (lines 9 to 10): the update phase of PKF is performed, the State of the tracklets is set to Track, and the EndFrame is set to the current frame id \(f_{id}\). Each tracklet in \({\mathcal {T}}_{u}\) is removed if the difference between its EndFrame and \(f_{id}\) is greater than or equal to \(\tau \) or its State is New, following BYTE; otherwise, it is retained and its State is set to Lost (lines 12 to 18).
Next, each detection in \({\mathcal {D}}_{u}\) whose score is greater than or equal to \(\epsilon \) is initialized and added to \({\mathcal {T}}\) (lines 20 to 24). \(Init(\cdot )\) sets the State to New and the EndFrame to \(f_{id}\), and performs the initialization of PKF. The output of each frame is the updated tracklets \({\mathcal {T}}\), where tracklets in the Lost state are not output.
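The per-frame flow described above can be sketched as follows. This is a readability-oriented sketch, not the paper's implementation: dicts stand in for rows of the GPU 2D-array, and `associate`, `predict`, `update`, and `init` are injected stand-ins for the modules described earlier.

```python
def fast_step(tracks, dets, f_id, associate, predict, update, init,
              tau=30, eps_score=0.7):
    """One frame of Fast (cf. Algorithm 3). `tracks`/`dets` are lists
    of dicts with at least 'state', 'end_frame', and 'score' keys;
    the four callables are stand-ins for the GPU modules (assumption)."""
    predict(tracks)                                  # PKF prediction phase
    matches, unmatched_t, unmatched_d = associate(tracks, dets)
    for ti, di in matches:                           # matched: PKF update
        update(tracks[ti], dets[di])
        tracks[ti]["state"] = "Track"
        tracks[ti]["end_frame"] = f_id
    # Unmatched tracklets: drop if too old or still New, else mark Lost.
    drop = set()
    for ti in unmatched_t:
        t = tracks[ti]
        if f_id - t["end_frame"] >= tau or t["state"] == "New":
            drop.add(ti)
        else:
            t["state"] = "Lost"
    tracks[:] = [t for i, t in enumerate(tracks) if i not in drop]
    # Unmatched detections: start new tracklets above the score threshold.
    for di in unmatched_d:
        if dets[di]["score"] >= eps_score:
            tracks.append(init(dets[di], f_id))
    return [t for t in tracks if t["state"] != "Lost"]   # frame output
```

In Fast itself, each of these loops is a masked array operation over the 2D-array, so the whole step runs without per-tracklet Python iteration.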

5 Experiment

5.1 Settings

Datasets We evaluate Fast on the aforementioned MOT benchmarks, which cover general tracking scenes: MOT17 (Milan et al., 2016), MOT20 (Dendorfer et al., 2020), KITTI (Geiger et al., 2013), and DanceTrack (Sun et al., 2021). MOT17 and MOT20 focus on pedestrian tracking with mostly low-speed movements; however, MOT20 scenes are more crowded than those in MOT17 (139 vs. 33 objects on average per frame). KITTI targets traffic tracking with high-speed cars and pedestrians. DanceTrack is a recently proposed benchmark for tracking dancers in stage scenes; its dancers exhibit highly non-linear movements and share similar appearances, while severe occlusions and frequent crossovers occur. For the ablation studies, we follow previous works (Cao et al., 2022; Zhang et al., 2022) and use MOT17-val (see “Appendix B” for details) and the DanceTrack validation set (DanceTrack-val) to verify the effectiveness of our proposed modules, such as the non-uniform formulation and the cost matrix.

Metrics We employ HOTA (Luiten et al., 2021), IDF1, and MOTA (Milan et al., 2016) as the main metrics to evaluate various aspects of tracking performance. HOTA is a recently proposed metric that explicitly balances the effects of accurate detection, association, and localization. IDF1 evaluates identity preservation ability and focuses more on association performance. MOTA is computed based on FP, FN, and IDs. Given that the number of FP and FN is larger than IDs, MOTA focuses more on detection performance. Additionally, we report other metrics, such as DetA or AssA, and raw statistics like ID switch (IDs) or Fragments (Frag).

Implementations All experiments are conducted on a PC equipped with a GTX 1080Ti GPU and an i5-9500 CPU. Fast is implemented using CuPy (Okuta et al., 2017) in Python. CuPy is a powerful CUDA-based library for fast matrix operations on NVIDIA GPUs, offering nearly identical APIs to NumPy and supporting the loading of raw CUDA kernel functions written in C/C++. Our storage method uses the 2D array provided by CuPy; the cost matrix, PKF, and Auction Algorithm are implemented in CUDA C/C++ and called via CuPy. Following previous works (Zhang et al., 2022; Cao et al., 2022), for a fair comparison we directly employ the same detectors in FastTrack. For MOT17, MOT20, and DanceTrack, we use the publicly available YOLOX (Ge et al., 2021) detector weights from ByteTrack (Zhang et al., 2022); for KITTI (Geiger et al., 2013), we use the publicly available detections from the official PermaTrack (Tokmakov et al., 2021) release, following OCSORT (Cao et al., 2022). Methods using the same detector are placed at the bottom of each benchmark table, and linear interpolation (Zhang et al., 2022) is also employed in the benchmark comparisons.

Table 2 Ablation study on our proposed cost matrix
Table 3 Ablation study on HIoU\(_{\mathcal{D}\mathcal{T}}\) with different height ratio threshold \(\lambda \) on MOT17-val and DanceTrack-val
Table 4 Ablation study on the accuracy of different assignment algorithms

5.2 Ablation Study

All the ablation studies are conducted on MOT17-val and DanceTrack-val, with four metrics used to evaluate performance: HOTA (H\(\uparrow \)), MOTA (M\(\uparrow \)), IDF1 (I\(\uparrow \)), and IDs (ID\(\downarrow \)).

5.2.1 Association

Cost Matrix Table 2 shows the experimental results of comparing different cost matrix strategies in Fast. “Baseline” indicates the results of employing BYTE’s original score-based cascade association in Fast. The gains of IoU, IoU\(_{{\mathcal {D}}}\), and IoU\(_{\mathcal{D}\mathcal{T}}\) are incremental over the six metrics of the two datasets. Compared to “Baseline”, IoU\(_{\mathcal{D}\mathcal{T}}\) outperforms in all metrics despite its more concise assignment strategy.

HIoU As discussed in Sect. 3.2, compared with the pedestrians in MOT17, the dancers in DanceTrack have a greater range of motion and thus do not exhibit the height-invariance prior. The experimental results of HIoU\(_{\mathcal{D}\mathcal{T}}\) with different height ratio thresholds \(\lambda \) in Table 3 reflect this as well. HIoU with a 0.8 threshold on MOT17-val increases HOTA and IDF1 because it excludes potentially erroneous assignments; DanceTrack-val does not satisfy the prior, so the application of HIoU leads to a drop in all metrics. In later ablation studies, HIoU\(_{\mathcal{D}\mathcal{T}}\) with a 0.8 threshold is employed on MOT17-val, and IoU\(_{\mathcal{D}\mathcal{T}}\) is employed on DanceTrack-val.

Table 5 Ablation study on different states of KF

Assignment Table 4 presents the accuracy of the Auction and LAPJV algorithms. The metric values for both algorithms are nearly identical, with the Auction Algorithm slightly outperforming LAPJV by 0.15% and 0.26% on the HOTA and IDF1 metrics of the DanceTrack-val dataset, respectively. In conclusion, the accuracies of the two algorithms in the context of MOT can be considered equal. Therefore, it is functionally feasible for the Auction Algorithm to replace the conventional CPU-based linear assignment algorithm LAPJV in Fast.

5.2.2 Parallel Kalman Filter

State Estimate Table 5 shows the experimental results of different states of KF. Here we employ the original KF with different states as the motion estimation module of our Fast. It can be found that Fast has better performance on all metrics when \(\gamma \) and h are removed, which is also consistent with the previous analysis in Sect. 2.2, i.e., uniform modeling of variables \(\gamma \) and h is unreasonable and they should be removed from the state estimate of KF.

\(\Delta t\) Threshold Factor \(\xi \). The hyperparameter \(\xi \) is used to distinguish low-speed objects from high-speed objects, and as shown in Table 6, Fast with the best \(\xi \) performs better in HOTA and IDF1 in both datasets compared to the “Baseline”. Due to the diversity of image resolutions, object sizes and frame rates for different types of scenes, the value of \(\xi \) needs to be set specifically, and as can be seen from the experiments in Table 6, the value of \(\xi \) takes a range roughly between 0.05 and 0.09. Empirically, we suggest that the default value of \(\xi \) is 0.05, which can be appropriately increased to 0.07 or 0.08 for better tracking results when the overall motion of objects in the tracking scene is faster or the frame rate is lower.

Linear Smooth Weight \(\omega \) The introduction of an appropriate smoothing hyperparameter is crucial to reduce noise generated in the object motion, thus in Table 7, Fast with the best \(\omega \) performs better in the HOTA and IDF1 in both datasets compared to the “Baseline”. Since the noise differs in different types of scenes, the value of \(\omega \) also needs to be set specifically, and as can be seen from the experiments in Table 7, the value of \(\omega \) takes a range roughly between 0.85 and 0.70. Empirically, we suggest that the default value of \(\omega \) is 0.85, which can be appropriately reduced to 0.70 when the overall motion of objects in the tracking scene (such as stage scenes) is highly non-linear to obtain better tracking results.

Table 6 Ablation study on \(\Delta t\) threshold factor \(\xi \)
Table 7 Ablation study on linear smooth weight \(\omega \)
Table 8 Ablation study on the deceleration strategy
Table 9 Comparison with other state-of-the-art motion-based trackers
Fig. 8

Computational efficiency comparison between different modules and the corresponding baselines. The x-axis indicates the number of tracklets processed simultaneously; the y-axis indicates the average consumption time (in ms)

Deceleration Strategy Table 8 shows the experimental results of our deceleration strategy. For MOT17-val, there is little improvement over the “Baseline” in the four metrics due to the small sample size and the fact that the occlusion problem is not severe there. For DanceTrack-val, the sufficient data combined with the severe occlusion problem allows our strategy to improve HOTA and IDF1 by another 1% over the “Baseline”. Besides, stopping the lost object immediately does not show a significant advantage over the “Baseline” on either dataset.

5.2.3 Comparison with Other Trackers

Table 9 presents the experimental results of different motion-based trackers. Fast outperforms the second-best tracker by 1–2% in HOTA, MOTA, and IDF1 metrics on both datasets. Thanks to PKF’s ability to model non-uniform motion, Fast achieves a 3.4% higher HOTA score and a 3.8% higher IDF1 score in the DanceTrack-val dataset compared to the second-ranked OCSORT, while maintaining a nearly identical number of ID switches.

5.2.4 Module Efficiency

As illustrated in Fig. 8, we compare the time consumption of our proposed modules with their corresponding CPU baselines. To investigate the maximum computational efficiency of each module, we define the extreme case by setting the tracklet count to 1000. In addition, we define the typical case by setting the tracklet count to 200, serving as a representative of computational efficiency in realistic large-scale object tracking situations.

Fig. 9

Visualizations of Fast’s tracking results on each benchmark. Boxes of the same color represent the same object (Color figure online)

Table 10 Comparison in DanceTrack test set
Table 11 Comparison in KITTI test set
Table 12 Comparison under the “private detector” protocol in MOT17 test set
Table 13 Comparison under the “private detector” protocol in MOT20 test set

Storage As depicted in Fig. 8a, we compare the computational efficiency of our GPU 2D-array storage method to the baseline, which uses a list with instances. The average time consumption refers to the time taken to change the attribute State of all n tracklets from Lost to Track. Our method and the baseline method exhibit time complexities of O(1) and O(n), respectively. Under the typical case, our method performs 3.6\(\times \) faster, while under the extreme case, it performs 14\(\times \) faster than the baseline method.

Cost Matrix Figure 8b compares the computational efficiency of our proposed cost matrix method (which includes HIoU) to the baseline (cython_box). The average time consumption denotes the time required to create an \(n \times n\) cost matrix. Our method has a time complexity of O(1), whereas the baseline method has a time complexity of \(O(n^2)\). Our method outperforms the baseline by a factor of 6.2\(\times \) in the typical case and by 51\(\times \) in the extreme case.

Assignment Figure 8c showcases the computational efficiency of our implemented Auction Algorithm, contrasted with the baseline (LAPJV). The average time consumption is the time needed to solve an \(n \times n\) random matrix with all values between 0 and 1. The theoretical time complexity of LAPJV is \(O(n^3)\), while our method’s complexity is O(n) with a low slope. Owing to architectural differences between CPUs and GPUs, initiating kernel functions on GPUs requires additional time. This extra cost overshadows the computational advantage of our GPU method when the number of tracklets is less than 300. However, as the data scale increases, the parallel computing capability of the GPU renders this cost increasingly insignificant. Therefore, our GPU method’s processing speed surpasses that of the CPU method when the number of tracklets exceeds 300. Under the typical case, our method is nearly identical to the baseline, but it is 4\(\times \) faster in the extreme case.

Kalman Filter Figure 8d compares the computational efficiency of our PKF with the baseline KF. The average time consumption is the time taken to complete a predict-update loop for all n tracklets. The time complexities of our method and the baseline method are O(1) and O(n), respectively. Under the typical case, our method is 60\(\times \) faster than the baseline, and under the extreme case, it is 306\(\times \) faster. Furthermore, our PKF can process 10 million objects simultaneously with an average time of 0.2 ms per iteration when fully utilizing the GTX 1080Ti’s 11 G video memory, whereas the original KF takes about 600 s per iteration to process the same number of objects.

Discussion From the aforementioned analyses, it becomes clear that the time consumption of each crucial module, particularly the Kalman Filter, in conventional CPU-based trackers escalates with increasing input scale. This escalation negatively impacts the computational efficiency of large-scale object tracking and the stability of the MOT system, inducing processing speed to fluctuate with the input scale. In contrast, our proposed modules successfully disentangle the computational efficiency from the input scale. This separation leads to a novel GPU-based tracker paradigm that can efficiently manage large-scale object tracking while maintaining the stability of the MOT system.

5.3 Benchmark Results

We report Fast’s performance on the four benchmarks separately, and the visualized results of Fast on each benchmark are shown in Fig. 9.

5.3.1 DanceTrack

We report the DanceTrack test set results in Table 10 to evaluate Fast’s performance under challenging non-uniform motion. The hyperparameter settings during testing are the same as on DanceTrack-val, i.e., \(\xi \) is 0.07, \(\omega \) is 0.7, and HIoU is not employed in the cost matrix. Our proposed non-uniform formulation and deceleration strategy successfully model the diverse non-uniform motions of the dancers in stage scenes, and thus Fast achieves a new state-of-the-art on HOTA, MOTA, and IDF1. With the same detector (YOLOX), Fast is 1.7% higher than OCSORT and 10.1% higher than BYTE on HOTA.

5.3.2 KITTI

Table 11 shows the results of the comparison on the KITTI test set. Compared to the pedestrian scenes in MOT17-val, the traffic scenes in KITTI track both cars and pedestrians whose motion appears high-speed and linear from the perspective of trackers due to the low frame rate (10 vs. 30). Therefore, we set \(\xi \) appropriately higher at 0.08 to handle the high-speed motion, \(\omega \) still follows the default setting of 0.85, and HIoU is employed with \(\lambda \) set to 0.5 according to the statistical prior in Fig. 4 (KITTI Train Car). Following OCSORT, we utilize the detection results of PermaTrack (Tokmakov et al., 2021) for tracking both cars and pedestrians. Notably, Fast can track both pedestrians and cars at the same time via our proposed cost matrix. Fast surpasses PermaTrack and OCSORT on HOTA and MOTA for both Car and Pedestrian and improves pedestrian tracking performance (HOTA) to a new state-of-the-art level, i.e., 55.10%.

5.3.3 MOT Challenge

We report in Tables 12 and 13 the performance of Fast on the MOT17 and MOT20 test sets using the same YOLOX detector as ByteTrack and OCSORT. Following MOT17-val, \(\xi \) is 0.05 by default and \(\omega \) is 0.85 on both MOT17 and MOT20. HIoU is employed on both MOT17 and MOT20 with \(\lambda \) set to 0.8. Although receiving the same detection results, OCSORT discards the low-scoring detection results before starting tracking, following SORT, while Fast utilizes the score prior to automatically determine association priorities in our proposed cost matrix, which allows Fast to track more objects with low scores and thus achieve higher MOTA than OCSORT by 2.5% and 1.8% on MOT17 and MOT20, respectively. Meanwhile, Fast also achieves the highest HOTA and IDF1 on MOT17 under the same detector (YOLOX) and competitive HOTA on MOT20. Notably, on MOT20, Fast gains significant efficiency improvements with its revolutionary GPU paradigm, as shown in Fig. 2, i.e., it is 7\(\times \) faster than OCSORT.

5.3.4 Efficiency

Figure 2 illustrates the average processing time of different motion-based trackers on the four benchmarks. The comparison is conducted on a single PC equipped with an NVIDIA GTX 1080Ti and Intel Core i5-9500. The average number of objects per frame processed by the trackers on the four test sets are 6 (KITTI), 8 (DanceTrack), 33 (MOT17), and 139 (MOT20), respectively. As discussed in Sect. 5.2.4, the computational efficiency of our new paradigm, Fast, is independent of the input scale, with an average time consumption of 4 ms across all four benchmarks.

When the input scale is small (e.g., 6 on KITTI), the extra startup time of kernel functions offsets the computational advantage of Fast, resulting in Fast consuming more time than other CPU-based trackers (4 ms vs. 1 ms). As the input scale increases, the time consumption of the CPU-based trackers rapidly rises, while Fast’s time consumption remains constant at 4 ms. For instance, on MOT20, Fast stays at 4 ms per frame, while OCSORT and BYTE take 30 ms and 20 ms, respectively. Even on the embedded CUDA device Jetson AGX Xavier, Fast’s average time consumption on MOT20 is only 24 ms, while OCSORT and BYTE take 231 ms and 67 ms, respectively.

To further showcase Fast’s efficiency in the extreme case, we compare the time consumption of the four methods on MOT20 at different scales, as demonstrated in Table 14. We evaluate the trackers’ efficiency in large-scale object tracking scenarios by replicating each frame’s detection results N times on the MOT20 test set. Specifically, we shift the x-coordinate of the ith replicated result by \(i\times \)W pixels to the right (W being the frame width), while keeping the y-coordinate unchanged, thereby increasing the number of objects in each frame by N times. This simulation is reasonable because, while a single video stream may not contain 1000 tracklets per frame, it is typical to merge and track multiple streams simultaneously in real-world scenarios.
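The replication protocol amounts to tiling each frame's detections along the x-axis, which can be sketched as follows (corner-format boxes are an assumption for brevity):

```python
import numpy as np

def replicate_detections(boxes, n, frame_w):
    """Scale-up simulation of Sec. 5.3.4: return n side-by-side copies
    of the frame's boxes (copy 0 is the original), with copy i shifted
    right by i * frame_w so copies never overlap. boxes: (M, 4) rows of
    [x1, y1, x2, y2]."""
    copies = [boxes + np.array([i * frame_w, 0, i * frame_w, 0])
              for i in range(n)]
    return np.concatenate(copies, axis=0)
```

Keeping the y-coordinates unchanged preserves each copy's internal geometry, so every copy is as hard to associate as the original frame while the total object count grows linearly with n.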

Table 14 demonstrates that Fast holds a significant advantage over other trackers in large-scale object tracking scenarios. For instance, when N equals 7, Fast processes 29 ms per frame with an average of 973 objects per frame, whereas OCSORT and BYTE take 241 ms and 161 ms, respectively. Although the time consumption of each module in Fast remains independent of the input scale, the creation and destruction overhead of intermediate data variables (e.g., the cost matrix) in Fast increases as the input scale expands, resulting in greater time consumption for large-scale input. Additionally, the GTX 1080Ti also constrains Fast’s efficiency on large-scale input. When using the RTX 4090, Fast only requires 10 ms to process 973 objects per frame, making it 24\(\times \) faster than OCSORT and 16\(\times \) faster than BYTE.

The above comparison results highlight the remarkable efficiency of our new GPU-based tracker paradigm.

Table 14 Average time consumption (ms) of different trackers on MOT20 at different scales

6 Conclusion

In this paper, we propose the Parallel Kalman Filter (PKF) to model non-uniform motion via the proposed non-uniform formulation and to achieve a time complexity of O(1) via the proposed parallel computation method in large-scale object tracking scenarios. Further, based on PKF, we propose Fast, the first fully GPU-based tracker paradigm, which greatly improves tracking efficiency in large-scale object tracking, and FastTrack, the MOT system consisting of Fast and a general detector, allowing for high efficiency and generality. Within Fast, we introduce innovative GPU-based tracking modules, such as an efficient GPU 2D-array data structure, a novel cost matrix, a new association metric called HIoU, and a GPU implementation of the Auction Algorithm. The conducted experiments demonstrate that PKF attains high computational efficiency independent of the input scale, and that the other modules in Fast are also highly efficient compared to their CPU-based baselines. FastTrack demonstrates state-of-the-art performance on four public benchmarks and attains the highest speed in the large-scale tracking scenarios of MOT20.