1 Introduction

Multi-Object Tracking (MOT) predominantly follows the tracking-by-detection paradigm. An MOT system typically comprises a general detector (Ren et al., 2015; Ge et al., 2021) and a generic motion-based tracker (Zhang et al., 2022; Cao et al., 2022; Bewley et al., 2016).

Although the Kalman Filter (KF) is a crucial motion estimation module for many state-of-the-art trackers (Wang et al., 2020; Zhou et al., 2020; Zhang et al., 2021, 2022), it has limitations in motion modeling and computational efficiency when applied to MOT.

Concerning motion modeling, the KF assumes uniform object motion, which is unsuitable for the varied motion patterns found in general tracking scenes, as shown in Fig. 1. Although the Extended KF (Smith et al., 1962) and the Unscented KF (Julier & Uhlmann, 1997) were introduced to handle non-uniform motion through Taylor approximations, they are computationally complex and cannot estimate arbitrary non-uniform motion, such as the highly random motions of dancers in stage scenes. Additionally, the KF is sensitive to noise, leading to the escape problem (Cao et al., 2022): when an object is lost, its bounding box, continuously predicted by the KF without the supervision of observations, rapidly escapes along the current velocity direction, making the object difficult to retrace. For example, in Fig. 1, when an object is lost at a certain circular point, the box predicted by the KF rapidly escapes along the velocity direction.

Concerning computational efficiency, each tracklet is represented by a distinct KF, and all KFs are updated sequentially, so the total time consumption grows linearly with the number of input tracklets, i.e., a time complexity of O(n), where n denotes the total number of tracklets. This expense weakens tracking efficiency, especially in large-scale object tracking. For example, the CPU-based tracking algorithm OCSORT (Cao et al., 2022), despite improving the KF to attain state-of-the-art performance, falls short in computational efficiency: it experiences nearly a 30\(\times \) increase in time consumption when the number of input tracklets grows from 6 (as in KITTI) to 139 (as in MOT20), as illustrated in Fig. 2. Thus, an optimal tracker must strike a balance between tracking precision and computational efficiency.

Fig. 1

Illustration of object trajectories in typical tracking scenarios. Four prevalent motion patterns are displayed: Low-speed, High-speed, Static, and Non-linear. The bold colored lines represent the central coordinate trajectories of the objects, while the black arrow lines indicate the direction of the object’s velocity at specific circular points

Fig. 2

Efficiency and HOTA comparisons between our method and other CPU & motion-based trackers across various benchmarks. The number in parentheses represents the average number of tracklets per frame for the respective test set. In the case of KITTI, the two scores correspond to HOTA for the Car and Pedestrian categories, respectively

To address these issues, we introduce a novel Parallel Kalman Filter (PKF) that models non-uniform motion while achieving a time complexity of O(1).

In modeling, the importance of a suitable set of state variables for the KF cannot be overstated. We therefore revise the conventional eight-tuple state variables and replace them with a simplified four-tuple set that contains only the 2D coordinates of the object center and their corresponding velocities. This simplification reduces the computational load and aligns better with the assumption that objects move little between adjacent frames, enabling effective modeling of non-uniform motion. Taking cues from the fundamental motion equation \(V = S/T\), we propose a unique state transfer formulation, called the non-uniform formulation, based on the simplified state variables. It models non-uniform motion as uniform motion by transforming the time interval \(\Delta t\) from a constant into a variable related to displacement. Moreover, to address the escape problem, we incorporate a deceleration strategy into the control-input model of the proposed formulation.

In computation, reducing the computational load and increasing parallelism are the two main routes to higher efficiency. By simplifying the state variables and adhering to the strict matrix representation of our non-uniform formulation, we introduce a parallel computation method that transposes the computation graph of PKF from matrix form to quadratic form, significantly reducing the computational load and enabling parallel computation across distinct tracklets via CUDA. Consequently, the time consumption of PKF becomes independent of the input tracklet scale, i.e., O(1). Overall, within PKF, the simplified state variables serve as the cornerstone of both modeling and computation: the design of the non-uniform formulation facilitates parallel computing, and the parallel implementation in turn accelerates the non-uniform formulation across diverse tracklets.

Although PKF can achieve high tracking efficiency through CUDA acceleration, the other conventional modules of the tracker remain CPU-based, leading to a bottleneck in large-scale object tracking. To further improve tracking efficiency in large-scale object tracking scenarios, we introduce Fast, the first fully GPU-based tracker paradigm, based on PKF; and FastTrack, the MOT system composed of Fast and a general detector, offering high efficiency and generality. Within FastTrack, Fast only requires bounding boxes with scores and class ids to perform a single association during one iteration, allowing for enhanced efficiency and generality.

Within Fast, we propose corresponding GPU-based modules to replace the conventional CPU-based modules. We innovatively introduce a highly efficient GPU 2D-array data structure to manage tracklets instead of instances like most previous works (Zhang et al., 2022; Cao et al., 2022; Bewley et al., 2016), enabling efficient parallel access. Furthermore, we propose a novel cost matrix implemented in CUDA, capable of automatically determining association priorities based on scores within a single association. This novel cost matrix also facilitates multi-object and multi-class tracking by simply shifting all boxes along the x-axis by the distance of class id times the input image width before calculating the Intersection over Union (IoU). Additionally, we propose a new association metric, HIoU, to replace IoU when tracking pedestrian or traffic scenes. Lastly, we implement the Auction Algorithm (Bertsekas, 1992a) for the asymmetric assignment problem using CUDA for the first time, replacing conventional CPU-based linear assignment algorithms such as the Hungarian Algorithm (Kuhn, 1955) or LAPJV (Jonker and Volgenant, 1987).
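The class-id shift described above can be illustrated with a short sketch (a hedged NumPy stand-in for the CUDA kernel; `iou_matrix` and `shift_by_class` are illustrative names, not the paper's implementation). Shifting every box along the x-axis by class id times the image width guarantees zero IoU between boxes of different classes, so a single IoU computation covers all classes at once:

```python
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU for boxes in (x1, y1, x2, y2) format."""
    a = boxes_a[:, None, :]   # (N, 1, 4)
    b = boxes_b[None, :, :]   # (1, M, 4)
    ix1 = np.maximum(a[..., 0], b[..., 0])
    iy1 = np.maximum(a[..., 1], b[..., 1])
    ix2 = np.minimum(a[..., 2], b[..., 2])
    iy2 = np.minimum(a[..., 3], b[..., 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter + 1e-9)

def shift_by_class(boxes, class_ids, img_w):
    """Shift each box along x by class_id * img_w so that boxes of
    different classes can never overlap; one IoU call then handles
    multi-class association."""
    shifted = boxes.astype(np.float64).copy()
    shifted[:, [0, 2]] += class_ids[:, None] * img_w
    return shifted
```

With this trick, boxes at the same position but with different class ids yield an IoU of exactly zero, so cross-class matches are automatically excluded from the cost matrix.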

The conducted experiments demonstrate that the average time per iteration of PKF on a GTX 1080Ti is only 0.2 ms and is independent of the input scale. Based on PKF and the other proposed modules, Fast achieves a real-time efficiency of 250 FPS on a GTX 1080Ti and 42 FPS even on the embedded CUDA device Jetson AGX Xavier. The efficiency is unaffected by the number of tracklets even on the MOT20 dataset, with 139 objects per frame on average, which has never been achieved by conventional CPU-based trackers. As shown in Fig. 2, Fast is 7\(\times \) faster than the state-of-the-art CPU & motion-based tracker OCSORT in the large-scale tracking scenes of MOT20 and obtains state-of-the-art performance on four benchmarks, i.e., MOT17 (Milan et al., 2016), MOT20 (Dendorfer et al., 2020), KITTI (Geiger et al., 2013), and DanceTrack (Sun et al., 2021).

In summary, our work presents three significant contributions:

  • We propose a novel Parallel Kalman Filter (PKF) that models non-uniform motion and achieves a time complexity of O(1). PKF modifies the conventional state variables, proposes a non-uniform formulation, incorporates a deceleration strategy to tackle the escape problem, and leverages a parallel computation method to reduce computational load.

  • We introduce the first fully GPU-based tracker paradigm called Fast, which greatly improves tracking efficiency in large-scale object tracking scenarios; and FastTrack, the MOT system consisting of Fast and a general detector, allowing for high efficiency and generality. Within FastTrack, Fast only requires bounding boxes with scores and class ids to perform a single association during one iteration.

  • We propose innovative GPU-based modules within Fast to replace conventional CPU-based modules, such as a highly efficient GPU 2D-array data structure for managing tracklets, a novel cost matrix implemented in CUDA for automatic association priority determination, a novel association metric HIoU, and the first implementation of the Auction Algorithm via CUDA for the asymmetric assignment problem. These GPU-based modules contribute to the real-time efficiency and generality of FastTrack in large-scale object tracking scenarios.

2 Related Works

2.1 Tracking-by-Detection

Tracking-by-detection has become the dominant paradigm in the MOT task. This paradigm divides an MOT system into two separate parts, i.e., the detector and the tracker. In the basic case, the detector provides the tracker with detection results (bounding boxes with confidence scores and class ids) for each video frame, and the tracker uses motion estimation to achieve tracking. In recent years, with the rapid development of object detection, more general object detectors (Redmon and Farhadi, 2018; Bochkovskiy et al., 2020) have achieved both high recall and high precision. Consequently, numerous tracking methods (Lu et al., 2020; Peng et al., 2020; Zhou et al., 2020; Wu et al., 2021; Zhang et al., 2022) have started utilizing powerful detectors (e.g., RetinaNet (Lin et al., 2017), CenterNet (Zhou et al., 2019), or YOLOX (Ge et al., 2021)) to obtain superior tracking performance. It has become a trend to combine high-performance detectors with concise and generic motion-based trackers into MOT systems. For instance, SORT (Bewley et al., 2016) first employed Faster R-CNN (Ren et al., 2015) as its detector and the Kalman Filter (Kalman, 1960) as its motion estimation module, achieving state-of-the-art performance in 2016 with a simple and efficient tracker. Building on SORT, ByteTrack (Zhang et al., 2022) achieved state-of-the-art results on MOT17 and MOT20 by using the advanced detector YOLOX (Ge et al., 2021). Before ByteTrack, many methods employing RetinaNet or CenterNet opted to directly filter out low-score boxes (scores below 0.5) to eliminate most False Positive (FP) boxes and guarantee tracking performance, owing to the low precision of detectors at the time. However, the high recall and precision of YOLOX ensure that even low-score boxes are likely to be True Positive (TP) boxes. Therefore, ByteTrack achieves state-of-the-art performance while ensuring simplicity and efficiency by employing YOLOX and score-based cascade association.
In our approach, Fast, we take it a step further by exploiting the score, i.e., by fusing tracklet and detection scores into the cost matrix to automatically prioritize matches within a single association.

In addition, tracking is essentially a computationally intensive task, yet current trackers are primarily CPU-based and implemented with object-oriented programming. GPUs have not been well explored for tracker implementation due to the programming gap between GPU and CPU. In this paper, we propose the first fully GPU-based tracker paradigm, which significantly improves tracking efficiency in large-scale object tracking scenarios.

2.2 Kalman Filter

Introduction The Kalman Filter (Bishop et al., 2001) is a classical motion estimation algorithm consisting of two phases: prediction and update. In the prediction phase, the KF uses the previous state to estimate the current state. The update phase incorporates observations of the current state to provide a more accurate state estimate.

At each iteration, two variables are maintained for each tracked object: the state estimate \({\textbf{x}}\) and its posterior estimated error covariance matrix \({\textbf{P}}\). The prediction phase of the KF is characterized by the state-transition model \({\textbf{F}}\), the control-input model \({\textbf{B}}\) with the control vector \({\textbf{u}}\), and the covariance of the process noise \({\textbf{Q}}\). The update phase is described by the observation model \({\textbf{H}}\) and the covariance of the observation noise \({\textbf{R}}\).

At each time step t, the KF first predicts the state estimate \({\textbf{x}}_{t|t-1}\) and its covariance matrix \({\textbf{P}}_{t|t-1}\) using the following equations:

$$\begin{aligned}&{{\textbf{x}}}_{t|t-1} = {\textbf{F}}_t {{\textbf{x}}}_{t-1} + {\textbf{B}}_t {{\textbf{u}}}_{t}, \end{aligned}$$
(1a)
$$\begin{aligned}&{\textbf{P}}_{t|t-1} = {\textbf{F}}_t {\textbf{P}}_{t-1} {\textbf{F}}_t^\top + {\textbf{Q}}_t, \end{aligned}$$
(1b)

where Eq. 1a models the object motion with a state transfer formulation.

The KF then updates the state estimate and covariance matrix based on the observation \({\textbf{z}}_t\) to obtain more accurate estimates (\({{\textbf{x}}}_{t}\) and \({\textbf{P}}_{t}\)) using the following equations:

$$\begin{aligned}&{\textbf{S}}_t = {\textbf{H}}_t{\textbf{P}}_{t|t-1}{\textbf{H}}_t^\top +{\textbf{R}}_t, \end{aligned}$$
(2a)
$$\begin{aligned}&{\textbf{K}}_t = {\textbf{P}}_{t|t-1} {\textbf{H}}_t^\top {\textbf{S}}_t^{-1}, \end{aligned}$$
(2b)
$$\begin{aligned}&{{\textbf{x}}}_{t} = {{\textbf{x}}}_{t|t-1} + {\textbf{K}}_t ({\textbf{z}}_t - {\textbf{H}}_t{{\textbf{x}}}_{t|t-1}), \end{aligned}$$
(2c)
$$\begin{aligned}&{\textbf{P}}_{t} = {\textbf{P}}_{t|t-1} - {\textbf{K}}_t{\textbf{S}}_t{\textbf{K}}_t^\top , \end{aligned}$$
(2d)

where the Kalman gain is denoted by matrix \({\textbf{K}}\), and the system uncertainty is represented by matrix \({\textbf{S}}\), which is the projected \({\textbf{P}}\) in the measurement space. Additionally, Eq. 2d can also be expressed as:

$$\begin{aligned} {\textbf{P}}_{t} = ({\textbf{I}} - {\textbf{K}}_t{\textbf{H}}_t){\textbf{P}}_{t|t-1}, \end{aligned}$$
(3)

where the identity matrix is denoted by \({\textbf{I}}\); the proof of this equivalence is standard and can be found in common Kalman Filter references.
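For concreteness, the predict-update cycle of Eqs. 1–2 can be sketched as a minimal NumPy implementation (an illustrative textbook version, not the paper's code):

```python
import numpy as np

def kf_predict(x, P, F, Q, Bu=None):
    """Eq. (1a)-(1b): propagate the state estimate and its covariance."""
    x_pred = F @ x if Bu is None else F @ x + Bu
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """Eq. (2a)-(2d): correct the prediction with observation z."""
    S = H @ P_pred @ H.T + R             # innovation (system uncertainty)
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x = x_pred + K @ (z - H @ x_pred)    # corrected state
    P = P_pred - K @ S @ K.T             # corrected covariance
    return x, P
```

A constant-velocity 1D state \([u, {\dot{u}}]\) with \({\textbf{F}} = [[1, 1], [0, 1]]\) and \({\textbf{H}} = [1, 0]\) reproduces the uniform-motion behavior discussed in the text.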

The uniform motion assumption of the KF is restrictive, leading to the development of the Extended KF (EKF) (Smith et al., 1962) and the Unscented KF (UKF) (Julier & Uhlmann, 1997), which handle non-uniform motion via first-order Taylor linearization and the unscented transform, respectively. However, these methods still rely on the Gaussian approximation underlying the KF and cannot accurately estimate arbitrary non-uniform motions, such as the highly random motion of dancers. Particle filters (Gustafsson et al., 2002) address non-uniform motions through sampling-based posterior estimation, but at the cost of a substantial increase in computational complexity.

Fig. 3

IoU statistics on general tracking scenes. The top plot displays the videos from the train sets of MOT17, MOT20, and KITTI; the bottom plot exhibits the videos from the train set and the validation set of DanceTrack

Fig. 4

Height and width ratio statistics on general tracking scenes. The title of each plot indicates the dataset and category

Application In the context of the MOT task, SORT (Bewley et al., 2016) initially applies the KF to model objects with uniform motion, assuming that the inter-frame displacements of each object are approximately equal. DeepSORT (Wojke et al., 2017) builds on SORT by improving the representation of \({\textbf{x}}\). Subsequently, most of the related works (Wang et al., 2020; Zhou et al., 2020; Zhang et al., 2021, 2022) directly employ the same KF used in DeepSORT as their motion estimation module.

In DeepSORT, the KF’s state estimate \({\textbf{x}}\) is an eight-tuple, \({\textbf{x}} = [u, v, \gamma , h, {\dot{u}}, {\dot{v}}, {\dot{\gamma }}, {\dot{h}}]^\top \), where (u, v) represents the 2D coordinates of the object center, \(\gamma \) is the box aspect ratio, and h is the box height. The remaining four variables with dots indicate the corresponding velocities.

DeepSORT and SORT assume uniform motion for all tracked objects. Consequently, \({\textbf{B}}\) and \({\textbf{u}}\) are discarded in Eq. 1a, and the state-transition model \({\textbf{F}}\) becomes:

$$\begin{aligned} {\textbf{F}}_{t} = \begin{bmatrix} {\textbf{I}}_{4 \times 4} &{} \Delta t\, {\textbf{I}}_{4 \times 4}\\ {\textbf{0}}_{4\times 4} &{} {\textbf{I}}_{4 \times 4} \end{bmatrix}, \end{aligned}$$
(4)

where the time difference \(\Delta t\) between two steps is consistent (1 by default) throughout the iterations. The process noise \({\textbf{Q}}\) and the observation noise \({\textbf{R}}\) are defined as follows:

$$\begin{aligned}&{\textbf{Q}}_{t} = \textbf{diag}(\phi h^2, \phi h^2, 10^{-4}, \phi h^2, \psi h^2, \psi h^2, 10^{-10}, \psi h^2), \end{aligned}$$
(5a)
$$\begin{aligned}&{\textbf{R}}_{t} = \textbf{diag}(\phi h^2, \phi h^2, 10^{-2}, \phi h^2), \end{aligned}$$
(5b)

where \(\phi \) and \(\psi \) represent the position weight and the velocity weight, respectively. The observation model \({\textbf{H}}\) is given by:

$$\begin{aligned} {\textbf{H}}_{t} = \begin{bmatrix} {\textbf{I}}_{4 \times 4}&{\textbf{0}}_{4\times 4} \end{bmatrix}. \end{aligned}$$
(6)
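Putting Eqs. 4–6 together, the DeepSORT-style models can be sketched as follows (a hedged illustration; the default weight values are assumptions borrowed from common DeepSORT implementations, not necessarily the exact constants used here):

```python
import numpy as np

def deepsort_models(h, dt=1.0, phi=(1 / 20) ** 2, psi=(1 / 160) ** 2):
    """Eq. (4)-(6): uniform-motion models for the eight-tuple state
    [u, v, gamma, h, du, dv, dgamma, dh]. phi/psi are the position and
    velocity weights; defaults here are illustrative assumptions."""
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)                    # position += dt * velocity
    H = np.hstack([np.eye(4), np.zeros((4, 4))])  # observe (u, v, gamma, h)
    Q = np.diag([phi * h**2, phi * h**2, 1e-4, phi * h**2,
                 psi * h**2, psi * h**2, 1e-10, psi * h**2])
    R = np.diag([phi * h**2, phi * h**2, 1e-2, phi * h**2])
    return F, H, Q, R
```

Note how the aspect-ratio entries (\(10^{-4}\), \(10^{-10}\), \(10^{-2}\)) are fixed constants rather than scaled by the box height h.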

Limitations Nevertheless, there are several issues with the aforementioned KF application.

First, the variables \(\gamma \) and h should not be incorporated into the state estimate \({\textbf{x}}\). Under the assumption that the displacements of objects in adjacent frames are similar, \(\gamma \) and h remain approximately constant between adjacent frames and should not be modeled with uniform-velocity dynamics. To adhere to this assumption, \(\gamma \) and h need to be excluded from the state estimation.

Second, the matrix \({\textbf{F}}\) can only predict objects under uniform motion, which is unsuitable for most motion patterns, particularly for objects with pronounced non-linear motion, such as dancers in DanceTrack.

Third, the KF is sensitive to noise and thus susceptible to the escape problem when the tracked object is lost. By design, the KF operates as a predict-update loop, where the update phase supervises the predicted state estimate to correct noise. When the tracked object is lost, the KF executes only the prediction phase, so noise is continuously amplified (visible as the rapid escape of the predicted box), making the object difficult to retrace. Recently, OCSORT (Cao et al., 2022) introduced the Observation-centric Online Smoothing strategy to mitigate the noise accumulated during the lack of observations once a lost object is retraced. However, this strategy does not increase the probability of the object being retraced in the first place, and it is not GPU-friendly.

Fourth, KF is computationally demanding, with a time complexity of O(n), which implies that its efficiency drastically declines as the number of tracked objects increases.

In this paper, we present the Parallel Kalman Filter to address these limitations.

2.3 Association

Association is also a core aspect of the MOT task, which primarily involves calculating a cost matrix between tracklets and detections, and then matching them based on the cost matrix. Among all the cues, position information is the most generic. In contrast to appearance or feature information, it can be directly obtained from a general object detector without the need for additional feature extraction networks or modifications to the original detection network. As a result, numerous methods (Bewley et al., 2016; Wojke et al., 2017; Zhang et al., 2022) utilize IoU to compute the cost matrix. Following the cost matrix calculation, tracklets and detections are matched using an assignment strategy. This can be achieved through classical linear assignment problem solutions such as the Hungarian Algorithm (Kuhn, 1955) or LAPJV (Jonker and Volgenant, 1987). For instance, SORT employs the Hungarian Algorithm for single association; DeepSORT (Wojke et al., 2017) uses the Hungarian Algorithm for cascade association; ByteTrack utilizes LAPJV for cascade association. However, both the Hungarian Algorithm and LAPJV are CPU-based implementations. To implement a fully GPU-based tracker, we introduce another classical solution to the linear assignment problem called the Auction Algorithm (Bertsekas, 1992a), which can be implemented on a GPU. In this paper, we successfully leverage CUDA to implement the Auction Algorithm and utilize it as the assignment strategy for our tracker paradigm.
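A minimal forward Auction Algorithm for a square maximization assignment might look as follows (an illustrative CPU sketch of the Bertsekas scheme, not the paper's CUDA implementation; the asymmetric and parallel variants add bookkeeping on top of this loop):

```python
import numpy as np

def auction_assign(value, eps=1e-3):
    """Forward auction (Bertsekas, 1992) for a square maximization
    assignment: value[i, j] is the benefit of giving object j to person i.
    Returns owner, where owner[i] is the object assigned to person i."""
    n = value.shape[0]
    prices = np.zeros(n)
    owner = -np.ones(n, dtype=int)        # person -> object
    obj_owner = -np.ones(n, dtype=int)    # object -> person
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        net = value[i] - prices           # net benefit of each object to i
        j = int(np.argmax(net))
        best = net[j]
        net[j] = -np.inf
        second = net.max()                # second-best net benefit
        prices[j] += best - second + eps  # bid: raise the winner's price
        prev = obj_owner[j]
        if prev >= 0:                     # evict the previous owner
            owner[prev] = -1
            unassigned.append(prev)
        obj_owner[j] = i
        owner[i] = j
    return owner
```

With \(\epsilon > 0\) the prices increase strictly at each bid, which guarantees termination; the bidding steps of different persons are independent, which is what makes the algorithm amenable to a GPU implementation.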

Table 1 Basic information about four benchmarks

3 Numerical Statistics

To address the modeling issues of KF, we conduct statistical analyses of general tracking scenarios to elucidate the characteristics of object motion and summarize priors.

For the sake of convenience, four benchmarks are selected for our study, namely, MOT17 (Milan et al., 2016), MOT20 (Dendorfer et al., 2020), KITTI (Geiger et al., 2013), and DanceTrack (Sun et al., 2021), with the corresponding information presented in Table 1. MOT17 and MOT20 comprise pedestrian scenes. MOT17 contains a relatively small number of videos and scenes compared to the other datasets, while MOT20 increases the density of pedestrians and emphasizes occlusions between them. The pedestrian movements in MOT17 and MOT20 are quite regular (low-speed and nearly uniform), and they maintain distinguishable appearances. DanceTrack includes a large number of stage scenes. The similar-looking dancers in these stages move chaotically, and their movements differ significantly from frame to frame, posing considerable challenges for modeling non-uniform motion. KITTI is among the first large-scale MOT datasets for traffic scenes, focusing on tracking high-speed cars and pedestrians. In this section, we perform statistical analyses on the ground truths of the above four benchmark datasets, including the train sets of MOT17, MOT20, KITTI, and DanceTrack, as well as the validation set of DanceTrack.

3.1 Trajectory Overlap

As illustrated in Fig. 3, we count the IoU of the same trajectory in adjacent frames. We find that for low-speed objects (mostly in MOT17/20 and DanceTrack), the majority of IoU distributions lie between 0.7 and 1, indicating that low-speed objects can be easily associated between frames even without estimation. For high-speed objects (cars or pedestrians in KITTI), the uniform motion modeling of KF is crucial for their tracking because they predominantly exhibit near-linear motion, as demonstrated by the upper right car in Fig. 1.

Modeling non-uniform motion while maintaining the simplicity of KF’s state transfer formulation is challenging because we cannot predict the object’s motion pattern in advance or express the object’s motion pattern in a mathematical formulation. However, based on the aforementioned IoU statistics, we can suppress KF’s predicted displacement by determining whether the object is moving at low speed, thereby achieving non-uniform motion modeling.

Hence, the prior derived from Trajectory Overlap is that, compared to high-speed objects, low-speed objects exhibit denser IoU distributions with high scores in adjacent frames. This demonstrates that when tracking low-speed objects, KF does not need to perform aggressive state estimation because the probability of high overlap between the current state and the observed state is high. Conversely, KF needs to perform aggressive state estimation to increase the overlap probability with the observation for tracking high-speed objects.

3.2 Height and Width Ratio

As depicted in Fig. 4, we count the ratio of height and width of the same trajectory in adjacent frames. The calculation formula is as follows:

$$\begin{aligned} R = \frac{\min (x_{cur}, x_{next})}{\max (x_{cur}, x_{next})}, \end{aligned}$$
(7)

where R denotes the height/width ratio, and \(x_{cur}\) and \(x_{next}\) represent the height/width of the same trajectory in the current and next frames. It is evident that, compared to dancers in DanceTrack, pedestrians and cars in MOT17/20 and KITTI exhibit stronger height invariance (ratio distribution between 0.7 and 1) than width invariance (ratio distribution between 0.4 and 1). This implies that we can enhance the accuracy of the tracker by introducing a height invariance prior into the tracker design.

Therefore, the prior from Height & Width Ratio is that pedestrians and cars between adjacent frames exhibit strong height invariance, demonstrating that when the height ratio of two cars or pedestrians is small (e.g., less than 0.7), there is a high probability that these two objects do not belong to the same trajectory.
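The prior can be sketched as a simple gating test (illustrative helper names; the 0.7 threshold follows the ratio distribution discussed above):

```python
def hw_ratio(cur, nxt):
    """Eq. (7): similarity ratio of a height (or width) across frames."""
    return min(cur, nxt) / max(cur, nxt)

def same_trajectory_plausible(h_cur, h_next, thresh=0.7):
    """Height-invariance prior: a pair of boxes whose height ratio falls
    below the threshold is unlikely to belong to the same trajectory."""
    return hw_ratio(h_cur, h_next) >= thresh
```

Such a gate can be used to veto candidate matches before or during the cost-matrix computation.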

Fig. 5

Illustration of the concept of adaptive \(\Delta t\). The relative stability of the velocity V is achieved by combining \(\Delta t\) with the displacement S, i.e., \(T = S/V\)

4 Proposed Methods

4.1 Parallel Kalman Filter

Modeling As discussed in the Limitations part of Sect. 2.2, we remove the variables \(\gamma \) and h, so PKF defines the state \({\textbf{x}}\) as a four-tuple, i.e., \({\textbf{x}} = [u, v, {\dot{u}}, {\dot{v}}]^\top \), where (u, v) is the 2D coordinate of the object center and \({\dot{u}}\) and \({\dot{v}}\) are the corresponding velocities. Accordingly, the process noise \({\textbf{Q}}\), the observation noise \({\textbf{R}}\), and the observation model \({\textbf{H}}\) are reformulated as

$$\begin{aligned}&{\textbf{Q}}_t = \textbf{diag}(\phi h^2, \phi h^2, \psi h^2, \psi h^2),\end{aligned}$$
(8a)
$$\begin{aligned}&{\textbf{R}}_t = \textbf{diag}(\phi h^2, \phi h^2), \end{aligned}$$
(8b)
$$\begin{aligned}&{\textbf{H}}_{t} = \begin{bmatrix} {\textbf{I}}_{2 \times 2}&{\textbf{0}}_{2 \times 2} \end{bmatrix}, \end{aligned}$$
(8c)

where the weights \(\phi \) and \(\psi \) are \((\frac{1}{20})^2\) and \((\frac{1}{80})^2\), respectively. The simplified state variables allow for more effective modeling of non-uniform motion and reduce the computational complexity.

As discussed in Sect. 3.1, to model non-uniform motion, as shown in Fig. 5, we start from the basic motion equation \(V=S/T\) and model non-uniform motion as uniform motion by transforming \(\Delta t\) in Eq. 1a from a constant into an adaptive variable related to the displacement s, as Eq. 9 shows. Based on the prior of Trajectory Overlap, we introduce the \(\Delta t\) threshold factor \(\xi \) into \(\Delta t\) to distinguish high-speed objects from low-speed objects, with the aim of suppressing the estimated displacement of low-speed objects only. The variable \(\Delta t\) for \({\dot{u}}\) and \({\dot{v}}\) is defined as

$$\begin{aligned} \left\{ \begin{array}{ll} \Delta t_{{\dot{u}}_{t-1}} = \min \left( \xi , \frac{1}{\left| {\dot{u}}_{t-1} \right| }\right) {s}_{u_{t-1}},\\ \Delta t_{{\dot{v}}_{t-1}} = \min \left( \xi , \frac{1}{\left| {\dot{v}}_{t-1} \right| }\right) {s}_{v_{t-1}}, \end{array}\right. \end{aligned}$$
(9)

where the object is considered to be moving at a low speed when the absolute value of the velocity is less than \(\frac{1}{\xi }\) and the displacement s is expressed as follows:

$$\begin{aligned} \left\{ \begin{array}{ll} s_{u_{t-1}} = | u_{t-1} - u_{t-2} |, \\ s_{v_{t-1}} = | v_{t-1} - v_{t-2} |. \end{array}\right. \end{aligned}$$
(10)

In practice, s is usually smoothed to reduce noise via a linear smoothing weight \(\omega \):

$$\begin{aligned} \left\{ \begin{array}{ll} s_{u_{t-1}} = \omega | u_{t-1} - u_{t-2} | + (1 - \omega )s_{u_{t-2}},\\ s_{v_{t-1}} = \omega | v_{t-1} - v_{t-2} | + (1 - \omega )s_{v_{t-2}}. \end{array}\right. \end{aligned}$$
(11)
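Eqs. 9–11 can be sketched per axis as follows (a hedged illustration; the default \(\xi \) and \(\omega \) values are assumptions, not the paper's tuned hyperparameters):

```python
def adaptive_dt(vel_prev, s_prev, xi=0.05):
    """Eq. (9): dt = min(xi, 1/|v|) * s. Low-speed objects (|v| < 1/xi)
    take the xi branch, which suppresses the predicted displacement;
    high-speed objects recover near-uniform motion since v * (s/|v|) ~ s."""
    return min(xi, 1.0 / (abs(vel_prev) + 1e-9)) * s_prev

def smoothed_disp(pos_prev, pos_prev2, s_prev2, omega=0.5):
    """Eq. (10)-(11): exponentially smoothed per-axis displacement."""
    return omega * abs(pos_prev - pos_prev2) + (1 - omega) * s_prev2
```

For a fast object (\(|v| = 100\), \(s = 100\)) the predicted displacement \(v \cdot \Delta t\) stays close to s, while for a slow object (\(|v| = 2\)) it is scaled down by \(\xi \).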
Fig. 6

Illustration of the escape problem. a, b Display the predicted trajectories of the same bounding box by KF and PKF, respectively, after the tracked person gets lost. The red dashed box represents the box at the beginning of the lost trajectory; the red solid box represents the box after 15 consecutive prediction phases of KF or PKF. The box predicted by the KF escapes in the direction of the velocity with noise, while the PKF suppresses the noise and keeps the box around the dashed box so that the person can be retraced (Color figure online)

To solve the escape problem, we restore the control-input model in Eq. 1a to implement a deceleration strategy. As discussed in Sect. 2.2, when the tracked object is lost, the KF performs only the prediction phase, so the noise is continuously amplified, as shown in Fig. 6. We therefore gradually reduce the current velocity according to the degree of loss to suppress the accumulated noise. The deceleration ratio r is defined as

$$\begin{aligned} r = \frac{f_{id} - f_{end} - 1}{\tau }, \end{aligned}$$
(12)

where \(f_{id}\) is the current frame number and \(f_{end}\) is an attribute of the tracklet called EndFrame (see the Storage part of Sect. 4.2); \(\tau \) is the maximum lost threshold for removing tracklets that have been lost for too long, set to 30 by default. r is 0 while the tracklet is tracked and decelerates the velocity once the tracklet is lost. When an object is lost (i.e., \(f_{id} - f_{end} - 1 > 0\) in Eq. 12), the deceleration strategy can stop the object at an early stage of loss to increase the probability of its being retraced. The four variables in the non-uniform formulation (Eq. 1a) are therefore expressed as follows:

$$\begin{aligned} \left\{ \begin{array}{ll} u_{t|t-1} = u_{t-1} + {\dot{u}}_{t-1} \Delta t_{{\dot{u}}_{t-1}} (1 - \frac{r}{2}),\\ v_{t|t-1} = v_{t-1} + {\dot{v}}_{t-1} \Delta t_{{\dot{v}}_{t-1}} (1 - \frac{r}{2}),\\ {\dot{u}}_{t|t-1} = {\dot{u}}_{t-1} (1 - r),\\ {\dot{v}}_{t|t-1} = {\dot{v}}_{t-1} (1 - r). \end{array}\right. \end{aligned}$$
(13)
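Eqs. 12–13 can be sketched as follows (an illustrative per-axis version; `decel_ratio` and `pkf_predict_state` are hypothetical helper names, not the paper's code):

```python
def decel_ratio(f_id, f_end, tau=30):
    """Eq. (12): 0 while the tracklet is tracked (f_id = f_end + 1),
    growing linearly with the number of lost frames up to tau."""
    return (f_id - f_end - 1) / tau

def pkf_predict_state(u, v, du, dv, dt_u, dt_v, r):
    """Eq. (13): non-uniform prediction with the deceleration strategy.
    Positions advance by a damped displacement; velocities shrink by (1-r)."""
    u_pred = u + du * dt_u * (1 - r / 2)
    v_pred = v + dv * dt_v * (1 - r / 2)
    return u_pred, v_pred, du * (1 - r), dv * (1 - r)
```

After 15 lost frames with \(\tau = 30\), r reaches 0.5, so the velocity is halved at each prediction and the box quickly comes to rest near where the object was lost.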
Algorithm 1

Pseudo-code of PKF Prediction.

Algorithm 2

Pseudo-code of PKF Update.

Under the matrix representation, \({\textbf{F}}\) and \(\textbf{Bu}\) can be represented as

$$\begin{aligned} {\textbf{F}}_t = \begin{bmatrix} {\textbf{I}}_{2 \times 2} &{} \begin{bmatrix} \Delta t_{{\dot{u}}_{t-1}} &{} 0 \\ 0 &{} \Delta t_{{\dot{v}}_{t-1}} \end{bmatrix}\\ {\textbf{0}}_{2 \times 2} &{} {\textbf{I}}_{2 \times 2} \end{bmatrix}, \end{aligned}$$
(14)
$$\begin{aligned} {\textbf{B}}_t {\textbf{u}}_t = \begin{bmatrix} - {\dot{u}}_{t-1} \Delta t_{{\dot{u}}_{t-1}} \frac{r}{2} \\ - {\dot{v}}_{t-1} \Delta t_{{\dot{v}}_{t-1}} \frac{r}{2} \\ - {\dot{u}}_{t-1} r \\ - {\dot{v}}_{t-1} r \end{bmatrix}. \end{aligned}$$
(15)

Computation To attain an O(1) time complexity, we exploit our simplified state variables and the strict matrix representation of our non-uniform formulation. In this context, we introduce an innovative parallel computation method. This method transposes the computation graph of the PKF from a matrix to a quadratic form, capitalizing on the attributes of sparse matrices within the PKF. Consequently, we facilitate parallel computation across distinct objects employing CUDA.

Specifically, the conventional KF uses matrices to represent its variables (e.g., \({\textbf{F}}\), \({\textbf{P}}\), and \({\textbf{H}}\)), and the calculations are carried out by matrix multiplication. However, all the matrices involved in the KF are sparse with fixed nonzero positions, which means that much of the matrix-multiplication work is spent computing meaningless zeros. Separating out these zero-related computations is therefore critical for improving efficiency. Meanwhile, the separated computation reduces to a finite set of quadratic operations, which can easily be parallelized across different tracklets via CUDA. For example, the covariance matrix \({\textbf{P}}\) of PKF is represented as follows:

$$\begin{aligned} {\textbf{P}} = \begin{bmatrix} p_{1} &{} &{} p_{7} &{} \\ &{} p_{2} &{} &{} p_{8} \\ p_{5} &{} &{} p_{3} &{} \\ &{} p_{6} &{} &{} p_{4} \end{bmatrix}, \end{aligned}$$
(16)

where the positions without variables have a value of 0 at every time step, so we can reduce the computational load by eliminating these zero-valued positions, in both storage and computation. In storage, we define the covariance vector \({\textbf{p}}\) to equivalently represent the matrix \({\textbf{P}}\), i.e., \({\textbf{p}} = [p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8]\). We also define the other vectors involved in the PKF calculation: the state vector \(\varvec{{\hat{x}}} = [u, v, {\dot{u}}, {\dot{v}}]\), the observation vector \(\varvec{{\hat{z}}} = [{\bar{u}}, {\bar{v}}, {\bar{w}}, {\bar{h}}]\), and the property vector \({\textbf{o}} = [w, h, f_{end}, s'_u, s'_v, u', v']\), where \(f_{end}\) is the attribute of the tracklet called EndFrame, and \(s'_u, s'_v, u',\) and \(v'\) are the previous displacements and coordinates, respectively. Algorithms 1 and 2 present the calculation process of the prediction phase and the update phase of PKF, respectively. In Algorithm 2, \(s_1\) and \(s_2\) denote the elements on the main diagonal of the matrix \({\textbf{S}}\), i.e., \({\textbf{S}} = \textbf{diag}(s_1, s_2)\), and \(k_1\), \(k_2\), \(k_3\), and \(k_4\) denote the nonzero entries of the Kalman gain, i.e.,

$$\begin{aligned} {\textbf{K}} = \begin{bmatrix} k_{1} &{} \\ &{} k_{2} \\ k_{3} &{} \\ &{} k_{4} \end{bmatrix}. \end{aligned}$$
(17)

\(r_u\), \(r_v\), \(r_w\), and \(r_h\) are the residuals of u, v, w, and h between observation and estimation. For initialization, u, v in \(\varvec{{\hat{x}}}\) and w, h in \({\textbf{o}}\) are set to the corresponding values in the input detection result; \({\dot{u}}\), \({\dot{v}}\) in \(\varvec{{\hat{x}}}\) and \(s'_u\), \(s'_v\) in \({\textbf{o}}\) are set to 0; \(u'\), \(v'\) in \({\textbf{o}}\) are set to the values of u and v, respectively; and \(f_{end}\) is set to the current frame id \(f_{id}\). The vector \({\textbf{p}}\) is initialized to \([4\phi h^2, 4\phi h^2,100\psi h^2,100\psi h^2,0,0,0,0]\) following DeepSORT.

In the CUDA implementation, we employ two threads per object for the u-related and v-related computations, respectively. Each block owns 32 threads, and the number of blocks is set to \(\lceil \frac{n}{16} \rceil \), where n is the number of tracked objects.
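To make the quadratic-form computation concrete, the following is a minimal NumPy sketch of the PKF prediction phase operating on all tracklets at once (the actual implementation launches raw CUDA kernels through CuPy, and the process-noise terms `q_pos`/`q_vel` are illustrative assumptions, not values from the paper):

```python
import numpy as np

def pkf_predict(x, p, dt_u, dt_v, r, q_pos=1e-2, q_vel=1e-4):
    """Prediction phase of PKF in scalar ("quadratic") form.

    x : (n, 4) states [u, v, u_dot, v_dot]   (cf. Eqs. 14-15)
    p : (n, 8) covariance entries p1..p8     (sparse P of Eq. 16)
    dt_u, dt_v : (n,) per-object time steps; r : deceleration ratio.
    q_pos, q_vel are illustrative process-noise terms (assumption).
    """
    u, v, du, dv = x.T
    p1, p2, p3, p4, p5, p6, p7, p8 = p.T
    # State: x <- F x + B u (Eqs. 14-15): the position advances by a
    # half-decelerated step, the velocity is damped by (1 - r).
    u_n = u + du * dt_u * (1.0 - r / 2.0)
    v_n = v + dv * dt_v * (1.0 - r / 2.0)
    du_n = du * (1.0 - r)
    dv_n = dv * (1.0 - r)
    # Covariance: P <- F P F^T + Q, written out entry-by-entry for the
    # u-block [[p1, p7], [p5, p3]] and the v-block [[p2, p8], [p6, p4]],
    # so each tracklet reduces to a few independent scalar operations.
    p1_n = p1 + dt_u * (p5 + p7) + dt_u**2 * p3 + q_pos
    p7_n = p7 + dt_u * p3
    p5_n = p5 + dt_u * p3
    p3_n = p3 + q_vel
    p2_n = p2 + dt_v * (p6 + p8) + dt_v**2 * p4 + q_pos
    p8_n = p8 + dt_v * p4
    p6_n = p6 + dt_v * p4
    p4_n = p4 + q_vel
    return (np.stack([u_n, v_n, du_n, dv_n], axis=1),
            np.stack([p1_n, p2_n, p3_n, p4_n,
                      p5_n, p6_n, p7_n, p8_n], axis=1))
```

Because every output entry depends only on a handful of scalars from the same tracklet, the loop over tracklets disappears entirely on the GPU: each thread evaluates these expressions for one object.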

Fig. 7

Illustration of different storage structures for trackers. a 2D-array. b List with instances. The illustration shows only a portion of the components (e.g., Box or Score) of tracklets

4.2 Fast

Storage In previous works (Lu et al., 2020; Peng et al., 2020; Zhou et al., 2020; Wu et al., 2021; Zhang et al., 2022), tracklets are usually represented as instances and stored in a list, as shown in Fig. 7b. Since the program cannot access these instances in parallel, updating or accessing each tracklet executes with time complexity O(n). As shown in Fig. 7a, guided by Occam’s Razor, we abandon the instance representation and propose a GPU 2D-array to store tracklets. In functionality, the two storage schemes can perform exactly the same operations, such as modifying certain information or adding/removing certain tracklets; in efficiency, thanks to CUDA acceleration, parallel update or access operations can be performed on all tracklets with a time complexity of O(1). In Fast, each tracklet has the attributes Box (uvwh), Score, ClassID, TrackID, State (one of Track, Lost, or New), EndFrame, and the PKF-related variables (\(\varvec{{\hat{x}}}, {\textbf{p}},\) and \({\textbf{o}}\)).
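The storage idea can be sketched as follows with a NumPy 2D-array (CuPy exposes the same API on GPU); the column layout and the numeric State encoding are assumptions for the example, not the paper's exact schema:

```python
import numpy as np  # stand-in for CuPy: `import cupy as np` on GPU

# Illustrative column layout for the tracklet 2D-array (assumption: the
# paper lists the attributes but not their column order).
COLS = ["u", "v", "w", "h", "score", "class_id", "track_id", "state", "end_frame"]
TRACK, LOST, NEW = 0.0, 1.0, 2.0        # State encoding (illustrative)
STATE = COLS.index("state")

tracklets = np.zeros((4, len(COLS)), dtype=np.float32)
tracklets[:, STATE] = [LOST, TRACK, LOST, NEW]

# O(1)-depth parallel update: flip every Lost tracklet to Track at once,
# instead of looping over per-tracklet instances.
mask = tracklets[:, STATE] == LOST
tracklets[mask, STATE] = TRACK

# Removing tracklets is a boolean-mask copy; adding is a row append.
keep = tracklets[:, STATE] != NEW
tracklets = tracklets[keep]
```

On the GPU, each of these masked operations maps to one elementwise kernel launch over all rows, which is what decouples the update cost from the tracklet count.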

Algorithm 3

Pseudo-code of Fast.

Cost Matrix Since Fast maintains the cleanest processing flow and performs the association operation only once during one iteration, the quality of the cost matrix has a crucial impact on the tracking results. We incorporate the quality (score) of the boxes into the cost matrix to automatically determine the priority of association. Specifically, for a list of tracklets with N elements and a list of detections with M elements, where each element owns a box B with a corresponding score \({\mathcal {S}}\), the \(N \times M\) cost matrix \({\textbf{C}}\) is calculated as follows:

$$\begin{aligned} {\textbf{C}}_{ij} = IoU(B_i, B_j) \times {\mathcal {S}}_i \times {\mathcal {S}}_j, \end{aligned}$$
(18)

where i/j denotes the i-th/j-th element in the input tracklets/detections. Through this calculation, tracklets and detections with high scores are assigned first, those with low scores are assigned last, and the remaining ones receive medium priority. From another perspective, we achieve the effect of cascade association by score (as BYTE (Zhang et al., 2022) does) within a single association. In addition, the novel cost matrix can also achieve multi-object & multi-class tracking by simply shifting all boxes along the x-axis by the class id times the input image width before calculating the IoU. In implementation, we assign each element of the cost matrix to a CUDA kernel to improve computational efficiency. Besides, based on the prior obtained in the Height & Width Ratio analysis, we propose the HIoU metric to replace the original IoU in our cost matrix, which is calculated as follows:

$$\begin{aligned} HIoU(B_{\mathcal {T}}, B_{\mathcal {D}}) = \left\{ \begin{array}{ll} 0, &{} \frac{\min (h_{\mathcal {T}}, h_{\mathcal {D}})}{\max (h_{\mathcal {T}}, h_{\mathcal {D}})} < \lambda , \\ IoU(B_{\mathcal {T}}, B_{\mathcal {D}}), &{} \text {otherwise}. \end{array} \right. \end{aligned}$$
(19)

HIoU receives a detection box \(B_{\mathcal {D}}\) and a tracklet box \(B_{\mathcal {T}}\) and calculates the IoU between them if the ratio of their heights (\(h_{\mathcal {D}}\) and \(h_{\mathcal {T}}\)) is greater than or equal to the height ratio threshold \(\lambda \); otherwise, the result of HIoU is 0. HIoU can be used in pedestrian or traffic scenes to further improve performance.
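A hedged NumPy sketch of Eqs. (18) and (19) follows. It vectorizes the whole matrix at once, whereas the paper assigns one CUDA kernel per element, and it assumes corner-format boxes [x1, y1, x2, y2] for brevity rather than the paper's (u, v, w, h) representation; the function names are illustrative:

```python
import numpy as np

def iou_matrix(bt, bd):
    """Pairwise IoU between tracklet boxes bt (N, 4) and detection
    boxes bd (M, 4), each row [x1, y1, x2, y2]."""
    tl = np.maximum(bt[:, None, :2], bd[None, :, :2])   # intersection top-left
    br = np.minimum(bt[:, None, 2:], bd[None, :, 2:])   # intersection bottom-right
    wh = np.clip(br - tl, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_t = (bt[:, 2] - bt[:, 0]) * (bt[:, 3] - bt[:, 1])
    area_d = (bd[:, 2] - bd[:, 0]) * (bd[:, 3] - bd[:, 1])
    return inter / (area_t[:, None] + area_d[None, :] - inter + 1e-9)

def cost_matrix(bt, st, bd, sd, lam=None):
    """Score-weighted cost matrix of Eq. (18); if `lam` is given, the
    HIoU gate of Eq. (19) zeroes pairs whose height ratio is below it."""
    iou = iou_matrix(bt, bd)
    if lam is not None:
        ht = bt[:, 3] - bt[:, 1]
        hd = bd[:, 3] - bd[:, 1]
        ratio = (np.minimum(ht[:, None], hd[None, :])
                 / np.maximum(ht[:, None], hd[None, :]))
        iou = np.where(ratio < lam, 0.0, iou)
    return iou * st[:, None] * sd[None, :]
```

Because high-score pairs yield larger entries, a benefit-maximizing assignment naturally resolves them first, which is how the single association reproduces the score-cascade effect.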

Assignment In order for the tracker to be fully accelerated by the GPU, we implement the Auction Algorithm (Bertsekas, 1992b) for the asymmetric assignment problem. The naive Auction Algorithm models the assignment as an auction to solve the classical symmetric assignment problem, in which n persons and n objects must be matched on a one-to-one basis. There is a benefit \(a_{ij}\) (an entry of an \(n \times n\) benefit matrix) for matching person i with object j, and we want to assign persons to objects so as to maximize the total benefit. “Appendix A” describes the calculation and implementation of the naive Auction Algorithm in detail. Since the shape of the cost matrix is usually an arbitrary \(n \times m\), the Reverse Auction Algorithm is required, and we implement it via CUDA. Specifically, when n is less than or equal to m, we apply the naive Auction Algorithm to match the tracklets and detections; when n is greater than m, we first transpose the cost matrix (from \(n \times m\) to \(m \times n\)) and then apply the naive Auction Algorithm, mapping the results back to the original cost matrix.
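For reference, here is a pure-Python sketch of the naive forward auction for the symmetric case (Appendix A has the full details; in the paper's GPU version the bidding of all persons runs in parallel CUDA threads). The bidding increment `eps` trades speed for accuracy: the final total benefit is within n·eps of optimal.

```python
import numpy as np

def auction(benefit, eps=1e-3):
    """Naive forward Auction Algorithm (Bertsekas) for a square n x n
    benefit matrix; returns the person -> object assignment. Sketch of
    the serial logic that the paper parallelizes on GPU."""
    n = benefit.shape[0]
    prices = np.zeros(n)
    owner = -np.ones(n, dtype=int)     # object -> person
    assign = -np.ones(n, dtype=int)    # person -> object
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        values = benefit[i] - prices               # net value of each object
        j = int(np.argmax(values))
        best = values[j]
        values[j] = -np.inf
        second = values.max()
        prices[j] += best - second + eps           # bid increment
        if owner[j] >= 0:                          # outbid the current owner
            assign[owner[j]] = -1
            unassigned.append(owner[j])
        owner[j] = i
        assign[i] = j
    return assign
```

When the benefit gaps exceed n·eps (as with the score-weighted IoU entries in practice), the auction returns the same assignment as an exact solver such as LAPJV, which matches the accuracy parity reported in Table 4.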

Architecture Following the concise and generic tracker design concept, Fast requires only boxes with scores and class ids and performs a single association per iteration, allowing for high generality and efficiency. The input of Fast is a video sequence \({\mathcal {V}}\) and a detector \(Det(\cdot )\); the max lost threshold \(\tau \) and the tracking score threshold \(\epsilon \) (0.7 by default) are also set. The output of Fast is the tracklets \({\mathcal {T}}\) of the video, where each tracklet has attributes such as Box, Score, ClassID, TrackID, State (one of Track, Lost, or New), and EndFrame. As shown in the pseudo-code of Algorithm 3, for each frame \(f_t\) in the video \({\mathcal {V}}\), Fast first obtains the detections \({\mathcal {D}}_t\), including Box, Score, and ClassID, via the detector \(Det(\cdot )\). Then the motion estimation (the prediction phase of PKF) of the tracklets \({\mathcal {T}}\) is calculated (line 4). After associating the detections \({\mathcal {D}}_t\) and tracklets \({\mathcal {T}}\) via the cost matrix generated from them, Fast obtains four groups: the matched detections \({\mathcal {D}}_{m}\), the matched tracklets \({\mathcal {T}}_{m}\), the unmatched detections \({\mathcal {D}}_{u}\), and the unmatched tracklets \({\mathcal {T}}_{u}\) (lines 5 to 8). The detections in \({\mathcal {D}}_{m}\) are used to update the corresponding tracklets in \({\mathcal {T}}_{m}\) (lines 9 to 10): the update phase of PKF is performed, the State of the tracklets is set to Track, and the EndFrame is set to the current frame id \(f_{id}\). Each tracklet in \({\mathcal {T}}_{u}\) is removed if the difference between its EndFrame and \(f_{id}\) is greater than or equal to \(\tau \) or its State is New, following BYTE; otherwise, it is retained and its State is set to Lost (lines 12 to 18).
Next, each detection in \({\mathcal {D}}_{u}\) whose score is greater than or equal to \(\epsilon \) is initialized and added to \({\mathcal {T}}\) (lines 20 to 24). \(Init(\cdot )\) sets the State to New and the EndFrame to \(f_{id}\), and performs the initialization of PKF. The output of each frame is the updated tracklets \({\mathcal {T}}\), where tracklets in the Lost state are not output.
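The per-frame flow described above can be sketched as follows. This is a readability-oriented sketch, not the paper's implementation: dicts stand in for rows of the GPU 2D-array, and `associate`, `predict`, `update`, and `init` are injected stand-ins for the modules described earlier.

```python
def fast_step(tracks, dets, f_id, associate, predict, update, init,
              tau=30, eps_score=0.7):
    """One frame of Fast (cf. Algorithm 3). `tracks`/`dets` are lists
    of dicts with at least 'state', 'end_frame', and 'score' keys;
    the four callables are stand-ins for the GPU modules (assumption)."""
    predict(tracks)                                  # PKF prediction phase
    matches, unmatched_t, unmatched_d = associate(tracks, dets)
    for ti, di in matches:                           # matched: PKF update
        update(tracks[ti], dets[di])
        tracks[ti]["state"] = "Track"
        tracks[ti]["end_frame"] = f_id
    # Unmatched tracklets: drop if too old or still New, else mark Lost.
    drop = set()
    for ti in unmatched_t:
        t = tracks[ti]
        if f_id - t["end_frame"] >= tau or t["state"] == "New":
            drop.add(ti)
        else:
            t["state"] = "Lost"
    tracks[:] = [t for i, t in enumerate(tracks) if i not in drop]
    # Unmatched detections: start new tracklets above the score threshold.
    for di in unmatched_d:
        if dets[di]["score"] >= eps_score:
            tracks.append(init(dets[di], f_id))
    return [t for t in tracks if t["state"] != "Lost"]   # frame output
```

In Fast itself, each of these loops is a masked array operation over the 2D-array, so the whole step runs without per-tracklet Python iteration.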

5 Experiment

5.1 Settings

Datasets We evaluate Fast on the aforementioned MOT benchmarks, which cover general tracking scenes: MOT17 (Milan et al., 2016), MOT20 (Dendorfer et al., 2020), KITTI (Geiger et al., 2013), and DanceTrack (Sun et al., 2021). MOT17 and MOT20 focus on pedestrian tracking with mostly low-speed movements; however, MOT20 scenes are more crowded than those in MOT17 (139 vs. 33 objects on average per frame). KITTI targets traffic tracking with high-speed cars and pedestrians. DanceTrack is a recently proposed benchmark for tracking dancers in stage scenes; its dancers exhibit highly non-linear movements and share similar appearances, while severe occlusions and frequent crossovers occur. For the ablation studies, we follow previous works (Cao et al., 2022; Zhang et al., 2022) and use MOT17-val (see “Appendix B” for details) and the DanceTrack validation set (DanceTrack-val) to verify the effectiveness of our proposed modules, such as the non-uniform formulation and the cost matrix.

Metrics We employ HOTA (Luiten et al., 2021), IDF1, and MOTA (Milan et al., 2016) as the main metrics to evaluate various aspects of tracking performance. HOTA is a recently proposed metric that explicitly balances the effects of accurate detection, association, and localization. IDF1 evaluates identity preservation ability and focuses more on association performance. MOTA is computed based on FP, FN, and IDs. Given that the number of FP and FN is larger than IDs, MOTA focuses more on detection performance. Additionally, we report other metrics, such as DetA or AssA, and raw statistics like ID switch (IDs) or Fragments (Frag).

Implementations All experiments are conducted on a PC equipped with a GTX 1080Ti GPU and an i5-9500 CPU. Fast is implemented using CuPy (Okuta et al., 2017) in Python. CuPy is a powerful CUDA-based library for fast matrix operations on NVIDIA GPUs, offering nearly identical APIs to NumPy and supporting the loading of raw CUDA kernel functions written in C/C++. Our storage method uses the 2D array provided by CuPy; the cost matrix, PKF, and Auction Algorithm are implemented in CUDA C/C++ and called via CuPy. Following previous works (Zhang et al., 2022; Cao et al., 2022), for a fair comparison we directly employ the same detectors in FastTrack. For MOT17, MOT20, and DanceTrack, we use the publicly available YOLOX (Ge et al., 2021) detector weights from ByteTrack (Zhang et al., 2022); for KITTI (Geiger et al., 2013), we use the publicly available detections from the official PermaTrack (Tokmakov et al., 2021) release, following OCSORT (Cao et al., 2022). Methods using the same detector are placed at the bottom of each benchmark table, and linear interpolation (Zhang et al., 2022) is also employed in the benchmark comparisons.

Table 2 Ablation study on our proposed cost matrix
Table 3 Ablation study on HIoU\(_{\mathcal{D}\mathcal{T}}\) with different height ratio threshold \(\lambda \) on MOT17-val and DanceTrack-val
Table 4 Ablation study on the accuracy of different assignment algorithms

5.2 Ablation Study

All the ablation studies are conducted on MOT17-val and DanceTrack-val, with four metrics used to evaluate performance: HOTA (H\(\uparrow \)), MOTA (M\(\uparrow \)), IDF1 (I\(\uparrow \)), and IDs (ID\(\downarrow \)).

5.2.1 Association

Cost Matrix Table 2 shows the experimental results of comparing different cost matrix strategies in Fast. “Baseline” indicates the results of employing BYTE’s original score-based cascade association in Fast. The gains of IoU, IoU\(_{{\mathcal {D}}}\), and IoU\(_{\mathcal{D}\mathcal{T}}\) are incremental over the six metrics of the two datasets. Compared to “Baseline”, IoU\(_{\mathcal{D}\mathcal{T}}\) outperforms in all metrics despite its more concise assignment strategy.

HIoU As discussed in Sect. 3.2, compared with the pedestrians in MOT17, the dancers in DanceTrack have a greater range of motion and thus do not exhibit the height-invariance prior. The experimental results of HIoU\(_{\mathcal{D}\mathcal{T}}\) with different height ratio thresholds \(\lambda \) in Table 3 reflect this as well. HIoU with a 0.8 threshold on MOT17-val increases HOTA and IDF1 because it excludes potentially erroneous assignments; DanceTrack-val does not satisfy the prior, so the application of HIoU leads to a drop in all metrics. In later ablation studies, HIoU\(_{\mathcal{D}\mathcal{T}}\) with a 0.8 threshold is employed on MOT17-val, and IoU\(_{\mathcal{D}\mathcal{T}}\) is employed on DanceTrack-val.

Table 5 Ablation study on different states of KF

Assignment Table 4 presents the accuracy of the Auction and LAPJV algorithms. The metric values for both algorithms are nearly identical, with the Auction Algorithm slightly outperforming LAPJV by 0.15% and 0.26% on the HOTA and IDF1 metrics of the DanceTrack-val dataset, respectively. In conclusion, the accuracies of the two algorithms in the context of MOT can be considered equal. Therefore, it is functionally feasible for the Auction Algorithm to replace the conventional CPU-based linear assignment algorithm LAPJV in Fast.

5.2.2 Parallel Kalman Filter

State Estimate Table 5 shows the experimental results of different states of KF. Here we employ the original KF with different states as the motion estimation module of our Fast. It can be found that Fast has better performance on all metrics when \(\gamma \) and h are removed, which is also consistent with the previous analysis in Sect. 2.2, i.e., uniform modeling of variables \(\gamma \) and h is unreasonable and they should be removed from the state estimate of KF.

\(\Delta t\) Threshold Factor \(\xi \). The hyperparameter \(\xi \) is used to distinguish low-speed objects from high-speed objects, and as shown in Table 6, Fast with the best \(\xi \) performs better in HOTA and IDF1 in both datasets compared to the “Baseline”. Due to the diversity of image resolutions, object sizes and frame rates for different types of scenes, the value of \(\xi \) needs to be set specifically, and as can be seen from the experiments in Table 6, the value of \(\xi \) takes a range roughly between 0.05 and 0.09. Empirically, we suggest that the default value of \(\xi \) is 0.05, which can be appropriately increased to 0.07 or 0.08 for better tracking results when the overall motion of objects in the tracking scene is faster or the frame rate is lower.

Linear Smooth Weight \(\omega \) The introduction of an appropriate smoothing hyperparameter is crucial to reduce noise generated in the object motion, thus in Table 7, Fast with the best \(\omega \) performs better in the HOTA and IDF1 in both datasets compared to the “Baseline”. Since the noise differs in different types of scenes, the value of \(\omega \) also needs to be set specifically, and as can be seen from the experiments in Table 7, the value of \(\omega \) takes a range roughly between 0.85 and 0.70. Empirically, we suggest that the default value of \(\omega \) is 0.85, which can be appropriately reduced to 0.70 when the overall motion of objects in the tracking scene (such as stage scenes) is highly non-linear to obtain better tracking results.

Table 6 Ablation study on \(\Delta t\) threshold factor \(\xi \)
Table 7 Ablation study on linear smooth weight \(\omega \)
Table 8 Ablation study on the deceleration strategy
Table 9 Comparison with other state-of-the-art motion-based trackers
Fig. 8

Computational efficiency comparison between different modules and the corresponding baselines. The x-axis indicates the number of tracklets processed simultaneously; the y-axis indicates the average consumption time (in ms)

Deceleration Strategy Table 8 shows the experimental results of our deceleration strategy. For MOT17-val, there is little improvement over the “Baseline” in the four metrics due to the small sample size and the fact that the occlusion problem is not severe there. For DanceTrack-val, the sufficient data combined with the severe occlusion problem allows our strategy to improve HOTA and IDF1 by another 1% over the “Baseline”. Besides, stopping the lost object immediately does not show a significant advantage over the “Baseline” on either dataset.

5.2.3 Comparison with Other Trackers

Table 9 presents the experimental results of different motion-based trackers. Fast outperforms the second-best tracker by 1–2% in HOTA, MOTA, and IDF1 metrics on both datasets. Thanks to PKF’s ability to model non-uniform motion, Fast achieves a 3.4% higher HOTA score and a 3.8% higher IDF1 score in the DanceTrack-val dataset compared to the second-ranked OCSORT, while maintaining a nearly identical number of ID switches.

5.2.4 Module Efficiency

As illustrated in Fig. 8, we compare the time consumption of our proposed modules with their corresponding CPU baselines. To investigate the maximum computational efficiency of each module, we define the extreme case by setting the tracklet count to 1000. In addition, we define the typical case by setting the tracklet count to 200, serving as a representative of computational efficiency in realistic large-scale object tracking situations.

Fig. 9

Visualizations of Fast’s tracking results on each benchmark. Boxes of the same color represent the same object (Color figure online)

Table 10 Comparison in DanceTrack test set
Table 11 Comparison in KITTI test set
Table 12 Comparison under the “private detector” protocol in MOT17 test set
Table 13 Comparison under the “private detector” protocol in MOT20 test set

Storage As depicted in Fig. 8a, we compare the computational efficiency of our GPU 2D-array storage method to the baseline, which uses a list with instances. The average time consumption refers to the time taken to change the attribute State of all n tracklets from Lost to Track. Our method and the baseline method exhibit time complexities of O(1) and O(n), respectively. Under the typical case, our method performs 3.6\(\times \) faster, while under the extreme case, it performs 14\(\times \) faster than the baseline method.

Cost Matrix Figure 8b compares the computational efficiency of our proposed cost matrix method (which includes HIoU) to the baseline (cython_box). The average time consumption denotes the time required to create an \(n \times n\) cost matrix. Our method has a time complexity of O(1), whereas the baseline method has a time complexity of \(O(n^2)\). Our method outperforms the baseline by a factor of 6.2\(\times \) in the typical case and by 51\(\times \) in the extreme case.

Assignment Figure 8c showcases the computational efficiency of our implemented Auction Algorithm, contrasted with the baseline (LAPJV). The average time consumption is the time needed to solve an \(n \times n\) random matrix with all values between 0 and 1. The theoretical time complexity of LAPJV is \(O(n^3)\), while our method’s complexity is O(n) with a low slope. Owing to architectural differences between CPUs and GPUs, initiating kernel functions on GPUs requires additional time. This extra cost overshadows the computational advantage of our GPU method when the number of tracklets is less than 300. However, as the data scale increases, the parallel computing capability of the GPU renders this cost increasingly insignificant. Therefore, our GPU method’s processing speed surpasses that of the CPU method when the number of tracklets exceeds 300. Under the typical case, our method is nearly identical to the baseline, but it is 4\(\times \) faster in the extreme case.

Kalman Filter Figure 8d compares the computational efficiency of our PKF with the baseline KF. The average time consumption is the time taken to complete a predict-update loop for all n tracklets. The time complexities of our method and the baseline method are O(1) and O(n), respectively. Under the typical case, our method is 60\(\times \) faster than the baseline, and under the extreme case, it is 306\(\times \) faster. Furthermore, our PKF can process 10 million objects simultaneously with an average time of 0.2 ms per iteration when fully utilizing the GTX 1080Ti’s 11 G video memory, whereas the original KF takes about 600 s per iteration to process the same number of objects.

Discussion From the aforementioned analyses, it becomes clear that the time consumption of each crucial module, particularly the Kalman Filter, in conventional CPU-based trackers escalates with increasing input scale. This escalation negatively impacts the computational efficiency of large-scale object tracking and the stability of the MOT system, inducing processing speed to fluctuate with the input scale. In contrast, our proposed modules successfully disentangle the computational efficiency from the input scale. This separation leads to a novel GPU-based tracker paradigm that can efficiently manage large-scale object tracking while maintaining the stability of the MOT system.

5.3 Benchmark Results

We report Fast’s performance on the four benchmarks separately, and the visualized results of Fast on each benchmark are shown in Fig. 9.

5.3.1 DanceTrack

We report the DanceTrack test set results in Table 10 to evaluate Fast’s performance under challenging non-uniform motion. The hyperparameter settings during testing are the same as on DanceTrack-val, i.e., \(\xi \) is 0.07, \(\omega \) is 0.7, and HIoU is not employed in the cost matrix. Our proposed non-uniform formulation and deceleration strategy successfully model the diverse non-uniform motions of the dancers in stage scenes, and thus Fast achieves a new state-of-the-art on HOTA, MOTA, and IDF1. With the same detector (YOLOX), Fast is 1.7% higher than OCSORT and 10.1% higher than BYTE on HOTA.

5.3.2 KITTI

Table 11 shows the results of the comparison on the KITTI test set. Compared to the pedestrian scenes in MOT17-val, the traffic scenes in KITTI track both cars and pedestrians whose motion appears high-speed and linear from the perspective of trackers due to the low frame rate (10 vs. 30). Therefore, we set \(\xi \) appropriately higher at 0.08 to handle the high-speed motion, \(\omega \) still follows the default setting of 0.85, and HIoU is employed with \(\lambda \) set to 0.5 according to the statistical prior in Fig. 4 (KITTI Train Car). Following OCSORT, we utilize the detection results of PermaTrack (Tokmakov et al., 2021) for tracking both cars and pedestrians. Notably, Fast can track both pedestrians and cars at the same time via our proposed cost matrix. Fast surpasses PermaTrack and OCSORT on HOTA and MOTA for both Car and Pedestrian and improves pedestrian tracking performance (HOTA) to a new state-of-the-art level, i.e., 55.10%.

5.3.3 MOT Challenge

We report in Tables 12 and 13 the performance of Fast on the MOT17 and MOT20 test sets using the same YOLOX detector as ByteTrack and OCSORT. Following MOT17-val, \(\xi \) is 0.05 by default and \(\omega \) is 0.85 on both MOT17 and MOT20. HIoU is employed on both MOT17 and MOT20 with \(\lambda \) set to 0.8. Although receiving the same detection results, OCSORT discards the low-scoring detection results before starting tracking, following SORT, while Fast utilizes the score prior to automatically determine association priorities in our proposed cost matrix, which allows Fast to track more objects with low scores and thus achieve higher MOTA than OCSORT by 2.5% and 1.8% on MOT17 and MOT20, respectively. Meanwhile, Fast also achieves the highest HOTA and IDF1 on MOT17 under the same detector (YOLOX) and competitive HOTA on MOT20. Notably, on MOT20, Fast gains significant efficiency improvements with its revolutionary GPU paradigm, as shown in Fig. 2, i.e., it is 7\(\times \) faster than OCSORT.

5.3.4 Efficiency

Figure 2 illustrates the average processing time of different motion-based trackers on the four benchmarks. The comparison is conducted on a single PC equipped with an NVIDIA GTX 1080Ti and Intel Core i5-9500. The average number of objects per frame processed by the trackers on the four test sets are 6 (KITTI), 8 (DanceTrack), 33 (MOT17), and 139 (MOT20), respectively. As discussed in Sect. 5.2.4, the computational efficiency of our new paradigm, Fast, is independent of the input scale, with an average time consumption of 4 ms across all four benchmarks.

When the input scale is small (e.g., 6 on KITTI), the extra startup time of kernel functions offsets the computational advantage of Fast, resulting in Fast consuming more time than other CPU-based trackers (4 ms vs. 1 ms). As the input scale increases, the time consumption of the CPU-based trackers rapidly rises, while Fast’s time consumption remains constant at 4 ms. For instance, on MOT20, Fast stays at 4 ms per frame, while OCSORT and BYTE take 30 ms and 20 ms, respectively. Even on the embedded CUDA device Jetson AGX Xavier, Fast’s average time consumption on MOT20 is only 24 ms, while OCSORT and BYTE take 231 ms and 67 ms, respectively.

To further showcase Fast’s efficiency in the extreme case, we compare the time consumption of the four methods on MOT20 at different scales, as demonstrated in Table 14. We evaluate the trackers’ efficiency in large-scale object tracking scenarios by replicating each frame’s detection results N times on the MOT20 test set. Specifically, we shift the x-coordinate of the ith replicated result by \(i\times \)W pixels to the right (W being the frame width), while keeping the y-coordinate unchanged, thereby increasing the number of objects in each frame by N times. This simulation is reasonable because, while a single video stream may not contain 1000 tracklets per frame, it is typical to merge and track multiple streams simultaneously in real-world scenarios.
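The replication protocol amounts to tiling each frame's detections along the x-axis, which can be sketched as follows (corner-format boxes are an assumption for brevity):

```python
import numpy as np

def replicate_detections(boxes, n, frame_w):
    """Scale-up simulation of Sec. 5.3.4: return n side-by-side copies
    of the frame's boxes (copy 0 is the original), with copy i shifted
    right by i * frame_w so copies never overlap. boxes: (M, 4) rows of
    [x1, y1, x2, y2]."""
    copies = [boxes + np.array([i * frame_w, 0, i * frame_w, 0])
              for i in range(n)]
    return np.concatenate(copies, axis=0)
```

Keeping the y-coordinates unchanged preserves each copy's internal geometry, so every copy is as hard to associate as the original frame while the total object count grows linearly with n.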

Table 14 demonstrates that Fast holds a significant advantage over other trackers in large-scale object tracking scenarios. For instance, when N equals 7, Fast processes 29 ms per frame with an average of 973 objects per frame, whereas OCSORT and BYTE take 241 ms and 161 ms, respectively. Although the time consumption of each module in Fast remains independent of the input scale, the creation and destruction overhead of intermediate data variables (e.g., the cost matrix) in Fast increases as the input scale expands, resulting in greater time consumption for large-scale input. Additionally, the GTX 1080Ti also constrains Fast’s efficiency on large-scale input. When using the RTX 4090, Fast only requires 10 ms to process 973 objects per frame, making it 24\(\times \) faster than OCSORT and 16\(\times \) faster than BYTE.

The above comparison results highlight the remarkable efficiency of our new GPU-based tracker paradigm.

Table 14 Average time consumption (ms) of different trackers on MOT20 at different scales

6 Conclusion

In this paper, we propose the Parallel Kalman Filter (PKF) to model non-uniform motion via the proposed non-uniform formulation and to achieve a time complexity of O(1) via the proposed parallel computation method in large-scale object tracking scenarios. Further, based on PKF, we propose Fast, the first fully GPU-based tracker paradigm, which greatly improves tracking efficiency in large-scale object tracking, and FastTrack, the MOT system consisting of Fast and a general detector, allowing for high efficiency and generality. Within Fast, we introduce innovative GPU-based tracking modules, such as an efficient GPU 2D-array data structure, a novel cost matrix, a new association metric called HIoU, and a GPU implementation of the Auction Algorithm. The conducted experiments demonstrate that PKF attains high computational efficiency independent of the input scale, and that the other modules in Fast are also highly efficient compared to their CPU-based baselines. FastTrack demonstrates state-of-the-art performance on four public benchmarks and attains the highest speed in the large-scale tracking scenarios of MOT20.