1 Introduction

Multi-object tracking (MOT) is a challenging task in computer vision. Its aim is to simultaneously identify multiple objects and estimate their trajectories in cluttered scenes. MOT has a wide range of applications, from surveillance and traffic safety to driver assistance systems and robotics. Several challenges, such as occlusion, missed and false detections, camera motion in complex scenes, and objects with similar appearance, keep MOT a difficult problem [29].

Owing to advances in object detectors [13, 38], tracking-by-detection (TBD) methods have shown state-of-the-art performance in recent years [2, 3, 7, 20, 31, 34, 39, 41, 44]. TBD methods can be roughly divided into two classes: batch tracking and online tracking.

Batch tracking methods solve the data association problem in MOT using forward-backward information from the entire sequence [2, 7, 12, 31, 34]. They first build short tracklets by linking detections frame by frame; the short tracklets are then associated globally to form long trajectories. Many global association methods have been proposed in recent years, but batch tracking has two limitations. First, it requires the detections of future frames for the entire sequence. Using detections from the whole video demands enormous computation, because short tracklets must be linked iteratively to construct optimized trajectories. This iterative linking implies that tracklets may change their links at each iteration, which can make the identities of close targets ambiguous, especially for objects with similar appearance. A globally optimal matching in this case relies on pairwise point matching in consecutive frames, which may fail to find the correct matches because of ambiguities among competing candidates. Hence, the globally optimal matching obtained by iterative linking may change at each iteration when pairwise matches in consecutive frames are non-unique or incorrect [18, 37]. Second, batch tracking is unsuitable for time-critical applications because of the heavy computational burden of global optimization. Compared with batch methods, online tracking methods are better suited to real-time applications, since they build trajectories using only the detections from recent frames [3, 20, 39, 41, 43, 44]: there is no iterative optimization process, and tracking results are output on the fly from the detections available up to the present frame.
However, online methods are not robust under occlusion, in which case they may produce short, fragmented trajectories. Without iterative association, the data association in online MOT becomes inaccurate when the detections are only partly reliable, with possible false positives and missed detections. Hence, in terms of tracking accuracy, batch methods outperform online methods because future information and iterative association can be used to correct detection errors and tracking failures. In this paper, we focus on online MOT and aim to improve its performance.

In detection-based MOT, data association plays an important role in robust tracking. Both appearance and motion models are typically used to solve data association in online MOT [3, 20, 26, 39, 41,42,43,44]. Detection-based MOT achieves good performance in many situations, such as pedestrian or vehicle tracking, where the objects follow a simple motion model and their appearance helps to distinguish them from one another. However, most existing online MOT methods adopt a first- or second-order motion model to describe object states in the current frame [44], so the current object states depend heavily on only the previous one or two frames. This kind of state estimation performs well when an object is detected in consecutive frames. In addition, data association becomes increasingly complicated when multiple targets have similar-looking appearance, as shown in Fig. 1: it is difficult to distinguish the objects by color and shape (Fig. 1a, b). Moreover, when detections are unavailable for several frames due to occlusion or missed detection, the object states predicted by the motion model may be unreliable.

Fig. 1

A difficult scene with similar-looking multiple objects in two consecutive frames. Their color and size provide few cues for matching, but motion can help to distinguish the objects

To resolve the problems mentioned above, this paper proposes a novel online MOT method focusing on multiple objects with similar appearance. The framework of the proposed method is shown in Fig. 2. With the detections up to the present frame, data association in online MOT is solved by maximum a posteriori (MAP) estimation with trajectory estimation and detection reliability. The trajectory estimation, on one hand, is solved in a Bayesian framework with the number of involved frames selected adaptively. On the other hand, the detection reliability, computed from tracklet dynamics estimation and detection-prediction association over continuous sequences, is sequentially introduced into the trajectory estimation stage. The detection-prediction association in consecutive frames filters out unreliable associations between trajectories and detections according to a local associated motion constraint, which is built from the predicted object states and the detections using the Hankel matrix. Hankel-matrix-based object state prediction is beneficial for recovering short, fragmented tracklets in online MOT because it takes a long history of object states into consideration. In addition, the MAP framework allows the trajectory estimation and detection reliability to interact with each other in a sequential manner, which facilitates online multi-object tracking. Experimental results on a synthetic dataset and four publicly available challenging datasets confirm the superiority of the proposed method.

Fig. 2

The tracking framework of the proposed method

The main contributions of this paper are: (1) The data association problem in online MOT is solved by MAP estimation in a Bayesian framework using the previous trajectories and the current detection reliability. (2) The detection reliability prior is computed from tracklet dynamics estimation and detection-prediction association over continuous sequences. Through MAP estimation, the trajectory estimation and the detection reliability prior interact with each other in a sequential manner. (3) Hankel-matrix-based dynamic motion estimation is used to measure the association weights between the detections and the predicted object states. This estimation is beneficial for recovering short, fragmented tracklets and improves the correctness of the association between trajectories and detections.

2 Related works

Numerous multi-object tracking approaches have been proposed in recent years [2, 3, 7, 12, 20, 26, 31, 34, 39, 41,42,43,44]. In this section, we review the MOT methods most related to ours.

Data association plays an important role in robust tracking for detection-based MOT. Both appearance and motion models are typically used to solve data association. Several features, such as color histograms, HOG, Haar-like features, sparse features and deep features, have been designed to describe appearance changes. In [46], a multitask shared sparse regression framework is proposed to represent the input image at different levels. In [35], a CNN-based feature representation with an adaptive hedge method is proposed for constructing a robust appearance model of the object. In [19], Zhang et al. proposed a new object detection framework using high-level feature representation, and extended their work in [45] with high-level convolutional features and visually similar neighbors. In [9], Chen et al. proposed a robust object tracking method based on a subspace-learning appearance model with sparse feature representation. In [21], Hong et al. employed the Integrated Correlation Filter (ICF) to improve single-object tracking performance. In terms of motion cues, many MOT methods directly exploit the Kalman or particle filter to locate objects. These methods typically use first- or second-order motion models to predict object states, so the current states rely heavily on the previous one or two frames. Such simple motion models perform well over short durations but show limitations in sequences with long-term occlusion, complex motion or cluttered scenes. Data association methods such as JPDAF [17] and MHT [10, 36] have been proposed to link short tracklets into long trajectories. Since the search space grows exponentially with the number of frames, both JPDAF and MHT are less effective for long-term association.
To overcome this limitation, a variety of data association approaches have been developed that formulate pairwise association of detections in consecutive frames as an optimization task, based on the Hungarian algorithm [25], K-shortest paths (KSP) [5], linear programming [23], quadratic Boolean programming [27], Markov chain Monte Carlo (MCMC) [32] and the maximum-weight independent set [8].

However, pairwise data association methods consider only pairs of detections and assign pairwise inter-frame edge costs. These algorithms perform poorly when the appearance constraint between closely moving objects is weak. A motion-model constraint such as the Kalman or particle filter, which relies heavily on the motion information of the previous frame to predict the current object, likewise provides little information to distinguish objects with similar appearance. In addition, depending merely on the kinematic state of the previous one or two frames to predict the current motion state is insufficient. In [9], Chen et al. point out that the particle filter is an approximate nonlinear Bayesian filter, used to obtain a suboptimal solution of the posterior probability of the object state given the observations. In [11], Collins shows that higher-order motion constraints have a major effect on improving the quality of data association in MOT, especially when multiple objects have similar appearance. In [41], nonlinear motion patterns and a robust appearance model are learned for each object to better explain direction changes and construct more robust motion affinities between tracklets. In [14], both individual and mutual relation models are introduced to build a graph model, but the mutual relation model works only when the objects move in the same direction. In [42], a pairwise relative motion model is introduced as an additional term in a CRF energy function. More recently, the relative motion network proposed in [44] improves data association by utilizing the relative spatial constraints between objects. In [20], a structural motion constraint among objects is utilized to assist data association against unreliable detections in online MOT. Bae and Yoon [3] exploited a trajectory confidence constraint and incremental linear discriminant appearance learning to assist their two-step data association.
They then extended their work in [4] by introducing a track existence probability into data association. However, these methods exploit prior information in two separate stages, either in the detection stage or in the association stage. In addition, the pairwise motion constraints in those methods are built from the position information of no more than three consecutive frames, whereas occlusion or missed detection often lasts for more than three consecutive frames in practical scenes. In this paper, we propose a novel association-based multi-object tracking method that combines trajectory estimation and a detection prior to better enhance tracking performance. Both our work and [43] are online multi-object tracking methods based on maximum a posteriori estimation with sequential prior knowledge. The major differences between them are: (1) the way the detection prior is computed; and (2) how the detection prior is combined into the MAP estimation during online multi-object tracking. In our work, the detection reliability is sequentially introduced into the trajectory estimation stage using a Bayesian framework [22, 33], which differs from the prior in [43]. Our work uses the local associated motion constraint and association weights to refine the detections. The association weights are calculated by a Hankel-matrix-based dynamic motion model in which the number of involved frames is estimated, instead of manually fixing the order of the motion model. In addition, the MAP framework allows the trajectory estimation and the detection reliability prior to interact with each other in a sequential manner, which facilitates online multi-object tracking. In [43], by contrast, multi-object tracking is solved as two MAP estimation problems: object detection and trajectory-detection association. In their detection refinement stage, the posterior detection probability is computed by combining the observation likelihood function and the prior detection probability.
The prior detection probability in their work is computed under a spatio-temporal consistency assumption, using a Kalman filter to predict the object states. Based on these states, they build a density map with the position constraint of the object.

3 Online tracking with detection reliability under local associated motion constraint

3.1 Problem formulation

The essential problem in MOT is data association: matching the detections in one frame to a set of previously established trajectories. Let \( {\mathbb{X}}_t=\left\{{\mathrm{x}}_t^1,\cdots, {\mathrm{x}}_t^N\right\} \) and \( {\mathrm{\mathbb{Z}}}_t=\left\{{z}_t^1,\cdots, {z}_t^M\right\} \) be the sets of object detections and predicted object states at frame t, respectively. Denote the sets of detections, trajectories and predicted object states up to frame t as \( {\mathbb{X}}_{1:t} \), \( {\mathbb{T}}_{1:t} \) and ℤ1 : t, respectively. For online MOT, the trajectory \( {\mathbb{T}}_{1:t}^j \) of object j up to frame t can be represented as \( {\mathbb{T}}_{1:t}^j=\left\{{\mathrm{x}}_k^j|1\le {t}_s^j\le k\le {t}_e^j\le t\right\} \), where \( {t}_s^j \) and \( {t}_e^j \) are the start and end frames of the tracklet. The online MOT problem can then be solved within the Bayesian framework by maximizing the joint posterior probability over \( {\mathbb{X}}_{1:t} \) and \( {\mathbb{T}}_{1:t-1} \) given the predicted object states ℤ1 : t as follows:

$$ {\displaystyle \begin{array}{l}\left\langle {\mathbb{T}}_{1:t},{\mathbb{X}}_{1:t}\right\rangle =\underset{{\mathbb{T}}_{1:t-1},{\mathbb{X}}_{1:t}}{\arg \max }p\left({\mathbb{T}}_{1:t-1},{\mathbb{X}}_{1:t}|{\mathrm{\mathbb{Z}}}_{1:t}\right)\\ {}\kern2.4em =\underset{{\mathbb{T}}_{1:t-1},{\mathbb{X}}_{1:t}}{\arg \max}\underset{\mathrm{trajectory}\ \mathrm{estimation}}{\underbrace{p\left({\mathbb{T}}_{1:t-1}|{\mathbb{X}}_{1:t},{\mathrm{\mathbb{Z}}}_{1:t}\right)}}\underset{\mathrm{detection}\ \mathrm{reliability}\ \mathrm{estimation}}{\underbrace{p\left({\mathbb{X}}_{1:t}|{\mathrm{\mathbb{Z}}}_{1:t},{\Re}_t\right)}}\end{array}} $$
(1)

The first term is the trajectory estimation, used to generate the current trajectories \( {\mathbb{T}}_t \) conditioned on ℤt through pairwise associations between \( {\mathbb{T}}_{t-1} \) and \( {\mathbb{X}}_t \). The second term is the posterior probability for detection reliability estimation between \( {\mathbb{X}}_t \) and ℤt. Because of the huge number of possible combinations of \( {\mathbb{T}}_{1:t-1} \) and \( {\mathbb{X}}_{1:t} \), the space of possible trajectories grows exponentially over time, and it is often infeasible to optimize Eq. (1) exhaustively. Therefore, we decompose Eq. (1) into two estimation stages with the local associated motion constraint ℜt, which is described in detail in Section 3.2.

3.2 Local associated motion constraint

Since the object states in two consecutive frames should not change drastically, the detections in frame t are more likely to appear around the locations predicted from the existing trajectories using the tracklet dynamics estimation model, as will be shown later. A local associated motion constraint (LAMC, denoted ℜt) is therefore built to represent the affinity between detections and predicted object states, based on the following two constraints:

$$ {\displaystyle \begin{array}{l}\left\Vert {y}_{z_t^j}-{y}_{{\mathrm{x}}_t^i}\right\Vert <0.5\sqrt{{\left({w}_{z_t^j}\right)}^2+{\left({h}_{z_t^j}\right)}^2}\\ {}\exp \left(-\left(\frac{\left|{h}_{{\mathrm{x}}_t^i}-{h}_{z_t^j}\right|}{h_{{\mathrm{x}}_t^i}+{h}_{z_t^j}}+\frac{\left|{w}_{{\mathrm{x}}_t^i}-{w}_{z_t^j}\right|}{w_{{\mathrm{x}}_t^i}+{w}_{z_t^j}}\right)\right)>{\tau}_s\end{array}} $$
(2)

where \( {y}_{{\mathrm{x}}^i} \) and \( {y}_{z^j} \) are the positions of detection i and predicted object j, respectively, and (w, h) are the width and height of an object.

The first constraint in Eq. (2) is a location constraint: a detection is considered for tracking only if it is located close to the predicted object location. The second is a size constraint, reflecting the fact that the detected object and the predicted object should have similar sizes. We empirically set τs = 0.7. If the predicted object state and the detection satisfy both the location and size constraints in Eq. (2), the association assignment di, j = 1, which indicates that the i-th detection is associated with the j-th object; otherwise di, j = 0, meaning there is no association between xi and zj. We use the association constraint in Eq. (2) to filter out unreliable associations between detections and predictions, so the total number of possible assignments between ℤt and \( {\mathbb{X}}_t \) is reduced. Figure 3 illustrates the local associated motion constraint.
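A minimal sketch of this gating test, assuming each box is given as a centre position plus width and height (the `gate` helper and the dict layout are illustrative, not from the paper); absolute differences are used so the size similarity is symmetric:

```python
import math

def gate(det, pred, tau_s=0.7):
    """Check the location and size constraints of Eq. (2).

    det, pred: dicts with centre position (x, y) and box size (w, h).
    Returns True when the detection may be associated with the
    predicted object state (d_{i,j} = 1), False otherwise.
    """
    # Location constraint: the detection centre must lie within half
    # the predicted box diagonal of the predicted centre.
    dist = math.hypot(det["x"] - pred["x"], det["y"] - pred["y"])
    loc_ok = dist < 0.5 * math.hypot(pred["w"], pred["h"])

    # Size constraint: the two boxes must have similar width and height.
    size_sim = math.exp(-(abs(det["h"] - pred["h"]) / (det["h"] + pred["h"])
                          + abs(det["w"] - pred["w"]) / (det["w"] + pred["w"])))
    return loc_ok and size_sim > tau_s
```

A detection that drifts far from the prediction, or whose box size changes sharply, fails the gate and is never scored by the later likelihood terms.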

Fig. 3

An example illustrating the associated motion constraint among three objects in frames t − 1 and t. Each box color denotes a unique target ID. The association for each object state and detection is determined by Eq. (2). A solid arrow indicates that the object is associated with the detection (di, j = 1), while a dotted arrow denotes no association between the detection and the object state (di, j = 0). The thickness of a solid arrow represents the association strength between the detection and the object state, determined by the associated motion weight in Eq. (11)

When there are M trajectories in frame (t − 1) and N detections in frame t, the LAMC ℜt is defined as

$$ {\displaystyle \begin{array}{l}{\Re}_t={\cup}_{j=1}^M{\Re}_t^j\\ {}{\Re}_t^j=\left\{\left(i,j\right)|{d}_t^{i,j}=1,1\le i\le N\right\}\end{array}} $$
(3)

where the linked edges represent the affinities between object states and detections in frame t. The LAMC is built between zj and xi in \( {\Re}_t^j \) only if di, j = 1 at frame t. Since the affinities between objects and detections differ, the associated motion weight \( {\theta}_t^{\left(i,j\right)} \) is defined as follows:

$$ {\displaystyle \begin{array}{l}{\theta}_t^j=\left\{{\theta}_t^{\left(i,j\right)}|\left(i,j\right)\in {\Re}_t^j,1\le i\le N\right\}\\ {}\sum \limits_{\left(i,j\right)\in {\Re}_t^j}{\theta}_t^{\left(i,j\right)}=1\end{array}} $$
(4)

where the initial associated motion weights \( {\theta}_t^{\left(i,j\right)}=\frac{1}{\mid {\Re}_t^j\mid } \), and \( \mid {\Re}_t^j\mid \) is the cardinality of an association set.

3.3 Detection reliability under local associated motion constraint

With the assumption that each object state is independent, the posterior detection probability of detection \( {\mathrm{x}}_t^i \) given the predicted object state set ℤ1 : t in Eq. (1) under the local associated motion constraint is defined as follows:

$$ p\left({\mathrm{x}}_t^i|{\mathrm{\mathbb{Z}}}_{1:t},{\Re}_t^j\right)=\sum \limits_{\left(i,j\right)\in {\Re}_t^j}{\theta}_t^{\left(i,j\right)}p\left({\mathrm{\mathbb{Z}}}_t|{\mathrm{x}}_t^i,{\Re}_t^j\right)p\left({\mathrm{x}}_t^i|{\mathrm{\mathbb{Z}}}_{1:t-1},{\Re}_t^j\right) $$
(5)

In Eq. (5), the posterior detection probability takes the associated motion weights \( {\theta}_t^{\left(i,j\right)} \) into consideration. The prior detection probability \( p\left({\mathrm{x}}_t^i|{\mathrm{\mathbb{Z}}}_{1:t-1},{\Re}_t^j\right) \) is approximated by a recursive procedure following the sequential Bayesian approach [33] under the LAMC.

The observation likelihood \( p\left({\mathrm{\mathbb{Z}}}_t|{\mathrm{x}}_t^i,{\Re}_t^j\right) \) in Eq. (5) is the association probability between \( {z}_t^j \) and \( {\mathrm{x}}_t^i \) under LAMC, which is defined as follows:

$$ p\left({\mathrm{\mathbb{Z}}}_t|{\mathrm{x}}_t^i,{\Re}_t^j\right)={p}_0\left({E}_t^{\left(i,j\right)}\right)+\sum \limits_j{p}_j\left({E}_t^{\left(i,j\right)}\right)p\left({z}_t^j|{\mathrm{x}}_t^i,{\Re}_t^j\right) $$
(6)

where \( {p}_j\left({E}_t^{\left(i,j\right)}\right) \) represents the association probability between the j-th object state and the i-th detection, and \( {p}_0\left({E}_t^{\left(i,j\right)}\right) \) denotes the non-association probability. \( p\left({z}_t^j|{\mathrm{x}}_t^i,{\Re}_t^j\right) \) is the likelihood between \( {z}_t^j \) and \( {\mathrm{x}}_t^i \).

Similar to [44], the likelihood function is computed using appearance, shape and motion cues as follows:

$$ p\left({z}_t^j|{\mathrm{x}}_t^i,{\Re}_t^j\right)={p}_a\left({z}_t^j|{\mathrm{x}}_t^i\right){p}_s\left({z}_t^j|{\mathrm{x}}_t^i\right){p}_m\left({z}_t^j|{\mathrm{x}}_t^i,{\Re}_t^j\right) $$
(7)

where pa, ps and pm are appearance, size and motion similarity, respectively, which are defined as

$$ {\displaystyle \begin{array}{cc}{p}_a=\exp \left(-\sum \limits_{b=1}^B\sqrt{{\mathrm{H}}^b\left({z}_t^j\right){\mathrm{H}}^b\left({\mathrm{x}}_t^i\right)}\right)& \left(\mathrm{a}\right)\\ {}{p}_s=\exp \left(-\left(\frac{\left|{h}_{{\mathrm{x}}_t^i}-{h}_{z_t^j}\right|}{h_{{\mathrm{x}}_t^i}+{h}_{z_t^j}}+\frac{\left|{w}_{{\mathrm{x}}_t^i}-{w}_{z_t^j}\right|}{w_{{\mathrm{x}}_t^i}+{w}_{z_t^j}}\right)\right)& \left(\mathrm{b}\right)\\ {}{p}_m\left({z}_t^j|{\mathrm{x}}_t^i,{\Re}_t^j\right)=\frac{S\left({z}_t^j\right)\cap S\left({\mathrm{x}}_t^i\right)}{S\left({z}_t^j\right)\cup S\left({\mathrm{x}}_t^i\right)},\kern0.5em \left(i,j\right)\in {\Re}_t^j& \left(\mathrm{c}\right)\end{array}} $$
(8)

where \( {\mathrm{H}}^b\left({\mathrm{x}}_t^i\right) \) and \( {\mathrm{H}}^b\left({z}_t^j\right) \) are the color histograms of the i-th detection and the j-th predicted object state, respectively; b denotes the b-th bin and B is the number of bins. Here, we use B = 64 bins for the HSV color space. In the shape similarity of Eq. (8b), (hx, hz) and (wx, wz) are the heights and widths of detection x and object z. The motion similarity in Eq. (8c) is computed by the PASCAL overlap score [16], where S(•) is the area of \( {z}_t^j \) and \( {\mathrm{x}}_t^i \).
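The three similarity terms of Eqs. (7) and (8) can be sketched as follows, assuming corner-format boxes `(x1, y1, x2, y2)` and pre-computed, normalised colour histograms; the `likelihood` helper is an illustrative name, and Eq. (8a) is implemented as written above:

```python
import numpy as np

def likelihood(det, pred, hist_x, hist_z):
    """Sketch of the association likelihood of Eq. (7)."""
    # Eq. (8a): appearance similarity from HSV colour histograms.
    p_a = np.exp(-np.sum(np.sqrt(hist_z * hist_x)))

    # Eq. (8b): size similarity from box widths and heights.
    wx, hx = det[2] - det[0], det[3] - det[1]
    wz, hz = pred[2] - pred[0], pred[3] - pred[1]
    p_s = np.exp(-(abs(hx - hz) / (hx + hz) + abs(wx - wz) / (wx + wz)))

    # Eq. (8c): motion similarity as the PASCAL overlap (IoU) score.
    ix1, iy1 = max(det[0], pred[0]), max(det[1], pred[1])
    ix2, iy2 = min(det[2], pred[2]), min(det[3], pred[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = wx * hx + wz * hz - inter
    p_m = inter / union if union > 0 else 0.0

    return float(p_a * p_s * p_m)
```

Note that with identical normalised histograms the sum in Eq. (8a) equals 1, so p_a saturates at exp(−1) rather than 1; the product is still maximised by a perfect match.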

The associated weight \( {\theta}_t^{\left(i,j\right)} \) is calculated by the tracklet dynamics estimation proposed in [12]. Target motion can be modeled as a sequence of piecewise linear regressions, and the order of the regression can be estimated from the positions of the object in previous frames. Therefore, the trajectory of an object can be represented by an ordered sequence of dynamic measurements as follows:

$$ {y}_t=\sum \limits_{i=1}^m{a}_i{y}_{t-i},m\le l,t\ge s+m $$
(9)

where y is the sequence of positions of a trajectory, ai are the regression coefficients, l is the length of the trajectory, m is the order of the regression model and s is the start frame of the trajectory. According to [39], the order m of the regression model equals the rank of the corresponding Hankel matrix, \( m=\mathit{\operatorname{rank}}\left({H}_{{\mathbb{T}}_i}\right) \), where \( {H}_{{\mathbb{T}}_i} \) is the Hankel matrix with n ≥ m columns:

$$ {H}_{{\mathbb{T}}_i}=\left[\begin{array}{cccc}{y}_s& {y}_{s+1}& \cdots & {y}_{s+n-1}\\ {}{y}_{s+1}& {y}_{s+2}& \cdots & {y}_{s+n}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{y}_{t-n+1}& {y}_{t-n+2}& \cdots & {y}_t\end{array}\right] $$
(10)

where n = li − ⌈li/3⌉ + 1 and li = t − s + 1 is the length of tracklet \( {\mathbb{T}}_i \), which starts at frame s.

Then the associated motion weight \( {\theta}_t^{\left(i,j\right)} \) is defined as follows:

$$ {\theta}_t^{\left(i,j\right)}=\frac{\mathit{\operatorname{rank}}\left(H\left({\mathbb{T}}_i\right)\right)+\mathit{\operatorname{rank}}\left(H\left({\mathbb{T}}_j\right)\right)}{\mathit{\operatorname{rank}}\left(H\left({\mathbb{T}}_{ij}\right)\right)}-1 $$
(11)

where \( {\mathbb{T}}_{ij}=\left[{\mathbb{T}}_i,{\alpha}_i^j,{\mathbb{T}}_j\right] \) is the joint tracklet with gap \( {\alpha}_i^j \) between \( {\mathbb{T}}_i \) and \( {\mathbb{T}}_j \). If \( {\mathrm{x}}_t^i \) and \( {z}_t^j \) belong to the same trajectory, \( {\mathbb{T}}_{ij} \) can be approximated by a single, relatively low-order regression; otherwise, \( {\mathbb{T}}_{ij} \) requires a regression of higher order than that of either single tracklet.

The above dynamic motion model uses an m-th order sequence to predict the object states in the current frame. By using the Hankel matrix to estimate the order of the motion model, instead of fixing it manually as in many existing works, our strategy helps recover short, fragmented tracklets and significantly reduces errors in online MOT. This is because the m-th order dynamic motion model takes a long trajectory motion cue into consideration, rather than relying heavily on one or two frames as in previous works. Moreover, in online MOT the order of a trajectory is re-estimated as new data arrive. Therefore, the dynamic motion model used in this paper can reduce ambiguity when a target is undetected in one or more successive frames or when two detections are erroneously linked.
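The order estimation of Eqs. (9)-(11) can be sketched with NumPy as follows. The `hankel`, `motion_order` and `assoc_weight` helpers are illustrative names, and filling the gap between tracklets by linear interpolation is our assumption; the paper leaves the construction of the joint tracklet implicit:

```python
import numpy as np

def hankel(y, n):
    """Build the Hankel matrix of Eq. (10) with n columns from a 1-D
    position sequence y."""
    y = np.asarray(y, dtype=float)
    rows = len(y) - n + 1
    return np.stack([y[r:r + n] for r in range(rows)])

def motion_order(y, tol=1e-6):
    """Estimate the regression order m as the Hankel matrix rank."""
    n = len(y) - int(np.ceil(len(y) / 3)) + 1   # n = l - ceil(l/3) + 1
    return int(np.linalg.matrix_rank(hankel(y, n), tol=tol))

def assoc_weight(y_i, y_j, gap, tol=1e-6):
    """Eq. (11): rank-based weight for joining tracklets T_i and T_j."""
    # Fill the frame gap by linear interpolation (assumed strategy).
    fill = np.linspace(y_i[-1], y_j[0], gap + 2)[1:-1]
    y_ij = np.concatenate([y_i, fill, y_j])
    r_ij = motion_order(y_ij, tol)
    return (motion_order(y_i, tol) + motion_order(y_j, tol)) / r_ij - 1
```

For two constant-velocity fragments of the same track, each Hankel matrix has rank 2 and so does their joint matrix, giving weight (2 + 2)/2 − 1 = 1; unrelated fragments raise the joint rank and drive the weight toward 0.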

After obtaining the posterior detection probability, the associated detection-prediction pairs are determined by

$$ {C}_t^{i,j}=-\ln \left\{p\left({\mathrm{\mathbb{Z}}}_t|{\mathrm{x}}_t^i,{\Re}_t^j\right)\right\} $$
(12)

If \( {C}_t^{i,j}<\tau \), then \( {\mathrm{x}}_t^i \) is associated with \( {z}_t^j \), and the corresponding assignment index is \( {\gamma}_t^{i,j}=1 \). The association probability is defined as \( {p}_j\left({E}_t^{<i,j>}\right)=\frac{\gamma_t^{i,j}}{\mid {\Re}_t^j\mid } \) and the non-association probability as \( {p}_0\left({E}_t^{<i,j>}\right)=1-\sum \limits_{j=1}^{\mid {\Re}_t^j\mid }{p}_j\left({E}_t^{<i,j>}\right) \), where \( \mid {\Re}_t^j\mid \) denotes the number of detection-prediction pairs.
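A small sketch of this gating step, with an assumed threshold `tau` (the text does not fix τ at this point); `detection_prediction_probs` is an illustrative name:

```python
def detection_prediction_probs(costs, tau=3.0):
    """Gate the pairs of Eq. (12) and form the probabilities used in Eq. (6).

    costs: list of C_t^{i,j} values for the pairs in one association
    set R_t^j. Returns (list of p_j(E_t), p_0(E_t)).
    """
    gamma = [1 if c < tau else 0 for c in costs]  # assignment indices gamma
    n = len(costs)                                # |R_t^j|
    p_assoc = [g / n for g in gamma]              # p_j(E_t^{<i,j>})
    p_none = 1.0 - sum(p_assoc)                   # p_0(E_t^{<i,j>})
    return p_assoc, p_none
```

The non-association mass p_0 grows as more candidate pairs fail the cost gate, which later down-weights unreliable detections in the observation likelihood.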

3.4 Data association with detection reliability constraint

In online MOT, suppose we have found M trajectories \( {\mathbb{T}}_{t-1}={\left\{{\mathrm{T}}_{t-1}^j\right\}}_{j=1}^M \) in frame (t − 1) and N detections \( {\mathbb{X}}_t={\left\{{\mathrm{x}}_t^i\right\}}_{i=1}^N \) in frame t. The pairwise trajectory-detection association between \( {\mathbb{T}}_{t-1} \) and \( {\mathbb{X}}_t \) that generates the current trajectories \( {\mathbb{T}}_t \) is formulated by Bayes' rule as follows:

$$ p\left(<{\mathbb{T}}_t,{\mathbb{X}}_t>|{\mathbb{X}}_t,{\mathbb{T}}_{1:t-1}\right)=\frac{p\left({\mathbb{X}}_t|<{\mathbb{T}}_t,{\mathbb{X}}_t>,{\mathbb{T}}_{t-1}\right)p\left(<{\mathbb{T}}_t,{\mathbb{X}}_t>|{\mathbb{T}}_{t-1}\right)}{p\left({\mathbb{X}}_t|{\mathbb{T}}_{t-1}\right)} $$
(13)

where \( p\left(<{\mathbb{T}}_t,{\mathbb{X}}_t>|{\mathbb{X}}_t,{\mathbb{T}}_{1:t-1}\right) \) is the posterior association probability representing the assignment of detections to the existing trajectories. \( p\left({\mathbb{X}}_t|<{\mathbb{T}}_t,{\mathbb{X}}_t>,{\mathbb{T}}_{t-1}\right) \) is the observation likelihood between the detections \( {\mathbb{X}}_t \) and the trajectories \( {\mathbb{T}}_{t-1} \). \( p\left(<{\mathbb{T}}_t,{\mathbb{X}}_t>|{\mathbb{T}}_{t-1}\right) \) is the prior association probability, and \( p\left({\mathbb{X}}_t|{\mathbb{T}}_{t-1}\right) \) is the transition density, estimated by the dynamic motion model of the object.

The prior association probability \( p\left(<{\mathbb{T}}_t,{\mathbb{X}}_t>|{\mathbb{T}}_{t-1}\right) \) is computed from two cues. The first is the detection reliability, described in Section 3.3. The second is the trajectory confidence, which measures the reliability of an existing trajectory Tj as follows:

$$ conf\left({\mathrm{T}}^j\right)=\left(\frac{1}{l_j}\sum \limits_{t\in \left[{t}_s^j,{t}_e^j\right],{d}^{j,k}=1}{\Omega}_{t-1}^j\right)\times \exp \left(-\beta \cdot \frac{W}{l_j}\right) $$
(14)

where lj is the length of tracklet Tj, \( {\Omega}_{t-1}^j \) is the posterior association probability of trajectory Tj at frame (t − 1) computed by Eq. (13), \( W=t-{t}_s^j-{l}_j \) is the number of frames for which object j is missing, and β is a control parameter.
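Eq. (14) can be sketched as follows; the default `beta` is an assumed value, since the control parameter is left free in the text, and `trajectory_confidence` is an illustrative name:

```python
import math

def trajectory_confidence(post_probs, t, t_s, beta=1.0):
    """Eq. (14): confidence of a trajectory T^j.

    post_probs: the posterior association probabilities Omega^j for the
    frames where the object was actually associated (d^{j,k} = 1), so
    len(post_probs) is the tracklet length l_j.
    """
    l_j = len(post_probs)
    if l_j == 0:
        return 0.0
    w = (t - t_s) - l_j  # frames the object has been missing
    return (sum(post_probs) / l_j) * math.exp(-beta * w / l_j)
```

A long tracklet with consistently high posteriors and no missed frames keeps a confidence near 1, while gaps shrink it exponentially through the exp(−βW/lj) term.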

Combining the detection reliability and trajectory confidence, the prior association probability is approximated as follows:

$$ {\displaystyle \begin{array}{l}p\left(<{\mathbb{T}}_t,{\mathbb{X}}_t>|{\mathbb{T}}_{t-1}\right)=\prod \limits_{j=1}^Mp\left(<{\mathrm{T}}_t^j,{\mathrm{x}}^i>|{\mathbb{T}}_{t-1}\right)\\ {}p\left(<{\mathrm{T}}_t^j,{\mathrm{x}}^i>|{\mathbb{T}}_{t-1}\right)=\frac{\delta \left({\mathrm{x}}^i\right)}{\sum_{i=1}^N\delta \left({\mathrm{x}}^i\right)}\times conf\left({\mathrm{T}}^j\right)\end{array}} $$
(15)

where \( \delta \left({\mathrm{x}}^i\right)=p\left({\mathrm{x}}_t^i|{\mathrm{\mathbb{Z}}}_{1:t},{\Re}_t^j\right) \) is the posterior detection probability computed in Eq. (5).

Since the observation likelihood \( p\left({\mathbb{X}}_t|<{\mathbb{T}}_t,{\mathbb{X}}_t>,{\mathbb{T}}_{t-1}\right) \) in Eq. (13) is the probability of the detections being associated with the trajectories, we have:

$$ p\left({\mathbb{X}}_t|<{\mathbb{T}}_t,{\mathbb{X}}_t>,{\mathbb{T}}_{t-1}\right)=\underset{i=1}{\overset{N}{\Pi}}p\left({\mathrm{x}}^i|<{\mathrm{T}}_t^j,{\mathrm{x}}^i>,{\mathbb{T}}_{t-1}\right) $$
(16)

By considering the probability that a detection xi originates from an existing trajectory Tj or is a false positive, the likelihood \( p\left({\mathrm{x}}^i|<{\mathrm{T}}_t^j,{\mathrm{x}}^i>,{\mathbb{T}}_{t-1}\right) \) can be computed as follows:

$$ {\displaystyle \begin{array}{l}p\left({\mathrm{x}}^i|<{\mathrm{T}}_t^j,{\mathrm{x}}^i>,{\mathbb{T}}_{t-1}\right)=p\left({\mathrm{x}}^i,<{\mathrm{T}}_t^j,{\mathrm{x}}^i{>}_{i0}|<{\mathrm{T}}_t^j,{\mathrm{x}}^i>,{\mathbb{T}}_{t-1}\right)+\sum \limits_jp\left({\mathrm{x}}^i,<{\mathrm{T}}_t^j,{\mathrm{x}}^i{>}_{ij}|<{\mathrm{T}}_t^j,{\mathrm{x}}^i>,{\mathbb{T}}_{t-1}\right)\\ {}\kern5.279997em =p\left({\mathrm{x}}^i,<{\mathrm{T}}_t^j,{\mathrm{x}}^i{>}_{i0}|{\mathbb{T}}_{t-1}\right)+\sum \limits_jp\left({\mathrm{x}}^i,<{\mathrm{T}}_t^j,{\mathrm{x}}^i{>}_{ij}|{\mathbb{T}}_{t-1}\right)\end{array}} $$
(17)

where \( p\left({\mathrm{x}}^i,<{\mathrm{T}}_t^j,{\mathrm{x}}^i{>}_{i0}|{\mathbb{T}}_{t-1}\right) \) denotes the probability that none of the existing trajectories is associated with the i-th detection, and \( p\left({\mathrm{x}}^i,<{\mathrm{T}}_t^j,{\mathrm{x}}^i{>}_{ij}|{\mathbb{T}}_{t-1}\right) \) represents the probability that the i-th detection is associated with the j-th trajectory. Since the predicted object state of a trajectory in frame t is estimated from the existing tracklet, the pairwise detection-prediction association probabilities \( {p}_j\left({E}_t^{<i,j>}\right) \) and \( {p}_0\left({E}_t^{<i,j>}\right) \) computed in Section 3.3 are used to compute \( p\left({\mathrm{x}}^i,<{\mathrm{T}}_t^j,{\mathrm{x}}^i{>}_{i0}|{\mathbb{T}}_{t-1}\right) \) and \( p\left({\mathrm{x}}^i,<{\mathrm{T}}_t^j,{\mathrm{x}}^i{>}_{ij}|{\mathbb{T}}_{t-1}\right) \):

$$ {\displaystyle \begin{array}{l}p\left({\mathrm{x}}^i,\langle{\mathrm{T}}^j,{\mathrm{x}}^i\rangle_{ij}\,|\,{\mathbb{T}}_{t-1}\right)=p\left({\mathrm{x}}^i\,|\,\langle{\mathrm{T}}^j,{\mathrm{x}}^i\rangle_{ij},{\mathbb{T}}_{t-1}\right)p\left(\langle{\mathrm{T}}^j,{\mathrm{x}}^i\rangle_{ij}\,|\,{\mathbb{T}}_{t-1}\right)=p\left({\mathrm{x}}^i\,|\,{\mathrm{T}}^j\right){p}_j\left({E}_t\right){p}_{\Phi_{ij}}\\ {}p\left({\mathrm{x}}^i,\langle{\mathrm{T}}^j,{\mathrm{x}}^i\rangle_{i0}\,|\,{\mathbb{T}}_{t-1}\right)=p\left({\mathrm{x}}^i\,|\,\langle{\mathrm{T}}^j,{\mathrm{x}}^i\rangle_{i0},{\mathbb{T}}_{t-1}\right)p\left(\langle{\mathrm{T}}^j,{\mathrm{x}}^i\rangle_{i0}\,|\,{\mathbb{T}}_{t-1}\right)={p}_0\left({E}_t\right)\prod \limits_{j=1}^M\left(1-{p}_{\Phi_{ij}}\right)\end{array}} $$
(18)

where \( p\left({\mathrm{x}}^i\,|\,{\mathrm{T}}^j\right) \) is the association likelihood between \( {\mathrm{x}}^i \) and \( {\mathrm{T}}^j \), computed by Eq. (7) as \( p\left({\mathrm{x}}^i\,|\,{\mathrm{T}}^j\right)={p}_m\left({\mathrm{x}}^i\,|\,{\mathrm{T}}^j\right){p}_s\left({\mathrm{x}}^i\,|\,{\mathrm{T}}^j\right){p}_a\left({\mathrm{x}}^i\,|\,{\mathrm{T}}^j\right) \), and \( {p}_{\Phi_{ij}} \) is the prior association probability computed in Eq. (15). The observation likelihood \( p\left({\mathbb{X}}_t\,|\,\langle{\mathbb{T}}_t,{\mathbb{X}}_t\rangle,{\mathbb{T}}_{t-1}\right) \) in Eq. (16) can then be rewritten as:

$$ p\left({\mathbb{X}}_t\,|\,\langle{\mathbb{T}}_t,{\mathbb{X}}_t\rangle,{\mathbb{T}}_{t-1}\right)=\prod_{i=1}^{N}\left\{{p}_0\left({E}_t\right)\prod \limits_{j=1}^M\left(1-{p}_{\Phi_{ij}}\right)+\sum \limits_{j=1}^{M}p\left({\mathrm{x}}^i\,|\,{\mathrm{T}}^j\right){p}_j\left({E}_t\right){p}_{\Phi_{ij}}\right\} $$
(19)
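As a concrete illustration, the per-detection term of Eq. (19) can be evaluated numerically as follows. This is a minimal sketch with our own variable names, not the authors' code: `p_assoc`, `p_E`, `p0_E` and `p_phi` stand for \( p(\mathrm{x}^i|\mathrm{T}^j) \), \( p_j(E_t) \), \( p_0(E_t) \) and \( p_{\Phi_{ij}} \), respectively.

```python
import numpy as np

def observation_likelihood(p_assoc, p_E, p0_E, p_phi):
    """Per-detection observation likelihood of Eq. (19).

    p_assoc : (M,) association likelihoods p(x^i | T^j) for one detection
    p_E     : (M,) detection-prediction reliabilities p_j(E_t)
    p0_E    : scalar false-positive reliability p_0(E_t)
    p_phi   : (M,) prior association probabilities p_Phi_ij from Eq. (15)
    """
    # First term: x^i is a false positive, associated with no trajectory.
    miss_term = p0_E * np.prod(1.0 - p_phi)
    # Second term: x^i originates from one of the M existing trajectories.
    hit_term = np.sum(p_assoc * p_E * p_phi)
    return miss_term + hit_term
```

The full likelihood of Eq. (19) is the product of this quantity over all N detections.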

Finally, the data association problem in online MOT is solved by the Hungarian algorithm on the association cost matrix \( {S}_{M\times N}=-\ln \left\{p\left(\langle{\mathbb{T}}_t,{\mathbb{X}}_t\rangle\,|\,{\mathbb{X}}_t,{\mathbb{T}}_{t-1}\right)\right\} \), whose entries indicate the cost of associating detection xi with trajectory Tj. The optimal trajectory-detection pairs are determined by minimizing the total cost in SM × N. Given the association, the final results are obtained by solving the maximization problem in Eq. (1), after which the object states and confidences of the existing trajectories are updated. Non-associated detections are retained to initialize new potential trajectories, and a new trajectory is confirmed once it persists for five consecutive frames. Non-associated trajectories are terminated if they remain unassociated for five consecutive frames. The main steps of the proposed online multi-object tracking method are summarized in Algorithm 1.
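The association step can be sketched as follows, using SciPy's `linear_sum_assignment` as the Hungarian solver. The gating threshold `min_likelihood` and the helper name are our assumptions for illustration, not part of the authors' method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(likelihood, min_likelihood=1e-6):
    """Hungarian association on S = -ln(likelihood).

    likelihood : (M, N) matrix of trajectory-detection association
                 probabilities p(<T_t, X_t> | X_t, T_{t-1}).
    Returns matched (trajectory, detection) pairs plus the unmatched
    trajectory and detection indices.
    """
    cost = -np.log(np.maximum(likelihood, 1e-12))   # S_{MxN}
    rows, cols = linear_sum_assignment(cost)        # minimize total cost
    gate = -np.log(min_likelihood)                  # reject implausible pairs
    pairs = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= gate]
    matched_t = {r for r, _ in pairs}
    matched_d = {c for _, c in pairs}
    un_t = [r for r in range(likelihood.shape[0]) if r not in matched_t]
    un_d = [c for c in range(likelihood.shape[1]) if c not in matched_d]
    return pairs, un_t, un_d
```

Unmatched detections (`un_d`) would seed new potential trajectories, while trajectories in `un_t` count toward the five-frame termination rule described above.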


4 Experiments

In this section, to demonstrate the effectiveness of the proposed online MOT method, thirteen state-of-the-art multi-object tracking algorithms are compared with it. Five of them are online algorithms (RMOT [44], SCEA [20], CMOT [3], TSML [39], MDP [40]) and eight are batch methods (SMOT [12], CEM [31], KSP [5], DCT [2], DTLE [30], GOGA [34], JPDA_100 [18], MHT_DAM [24]). For fair comparison, we use the results reported in their papers or obtain results by running the source code provided by the authors with default parameters on the four public datasets. In addition, we evaluate the proposed method on a synthetic dataset to demonstrate its robustness against noise and missing detections.

4.1 Implementation details

The proposed online MOT algorithm is implemented in MATLAB on an Intel Core i7 8GHz PC; the average run time is about 15 fps without any code optimization or parallel programming. In the experiments, we empirically set τs = 0.7 in Eq. (2), τ = 2 in Eq. (12), and β = 2 in Eq. (14).

4.2 Synthetic data results

In our first test, we evaluate the robustness of our method against noise and missing observations. We construct synthetic data from three models: a) a constant velocity motion model, b) a constant acceleration motion model and c) a dynamic acceleration motion model, as shown in Eq. (20).

$$ \left[x,y\right]={\mathrm{a}}_3{t}^3+{\mathrm{a}}_2{t}^2+{\mathrm{a}}_1{t}^1+{\mathrm{a}}_0 $$
(20)

where a0 to a3 are random coefficients. In Eq. (20), we set a3 = a2 = 0 to form a linear motion model, set a3 = 0 to represent constant acceleration motion, and set a0 to a3 to non-zero random values for a dynamic acceleration motion model.

We then add zero-mean Gaussian noise with variance 0.3 to the observations. In our test, we generate a sequence of 45 frames and manually eliminate frames 31 to 35 to verify the robustness of the proposed method against missing observations.
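The synthetic-data generation described above can be sketched as follows. This is an illustrative reconstruction of the protocol, not the authors' MATLAB code; the function name, coefficient range and random seed are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def synth_track(model, n_frames=45, drop=range(30, 35)):
    """One 2-D track from Eq. (20): [x, y] = a3*t^3 + a2*t^2 + a1*t + a0.

    model: 'constant' (a3 = a2 = 0), 'acceleration' (a3 = 0)
           or 'dynamic' (all coefficients non-zero).
    drop:  zero-based frame indices to delete (frames 31-35, 1-based).
    """
    a = rng.uniform(-1, 1, size=(4, 2))      # a[k] = coefficients of t^k
    if model == 'constant':
        a[2:] = 0.0                          # constant velocity: a3 = a2 = 0
    elif model == 'acceleration':
        a[3] = 0.0                           # constant acceleration: a3 = 0
    t = np.arange(n_frames, dtype=float)[:, None]
    truth = a[3] * t**3 + a[2] * t**2 + a[1] * t + a[0]
    # Zero-mean Gaussian noise with variance 0.3 (std = sqrt(0.3)).
    obs = truth + rng.normal(0.0, np.sqrt(0.3), truth.shape)
    obs[list(drop)] = np.nan                 # simulate missing observations
    return truth, obs
```

The NaN rows mark the frames the tracklet dynamic estimation model must fill in.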

The synthetic data association results of our method are shown in Fig. 4, where data1 (magenta line) denotes the ground-truth trajectories, data2 (blue stars) the noisy observations, data3 (yellow solid line) the association results of the proposed method, and data4 (red downward triangles) the missing observations estimated by the tracklet dynamic estimation model. Fig. 4a-c shows that the proposed data association method effectively overcomes noise interference and accurately associates observations. In addition, the tracklet dynamic estimation model effectively estimates missing data, which improves fragmented-tracklet association.

Fig. 4

The association results for the synthetic dataset

4.3 Real-world datasets and evaluation metrics

For performance evaluation, four publicly available challenging datasets are used: the SMOT (Similar Multi-Object Tracking) dataset [12], the PETS2009 dataset [15], the TUD dataset [1] and the MOT 2015 dataset. The SMOT dataset is very challenging and specially designed for tracking multiple objects with similar appearance. It contains a variety of targets, both non-rigid and rigid: the number of objects ranges from 3 to 80 and the sequence length from 130 to 1285 frames. We adopt five SMOT sequences, Salmon, Juggling, Acrobats, Seagulls and Crowd, for evaluation. From the PETS2009 dataset, the widely used S2.L1 and S2.L2 sequences are included; they show outdoor surveillance scenes with many pedestrians, with 19 and 43 tracked objects and lengths ranging from 436 to 795 frames. From the TUD dataset we adopt the Campus, Crossing and Stadtmitte sequences; their main challenge is severe occlusion with a low viewpoint, with 7 to 12 tracked objects and lengths ranging from 71 to 201 frames. The MOT 2015 dataset is the latest MOT dataset; it contains 11 challenging sequences with occlusion, cluttered backgrounds, scale and shape changes, and both moving and stationary cameras, with sequence lengths from 187 to 1194 frames and 12 to 157 tracked objects. For fair comparison, we use publicly available detections: for the SMOT dataset the detections provided by [12], for the PETS2009 and TUD datasets the detections provided by [31], and for the MOT 2015 dataset the detections provided at https://motchallenge.net/data/2D_MOT_2015/.

We use the common CLEAR performance metrics [6], including MOTP, MOTA, FP, FN and IDS, for quantitative evaluation. Multiple object tracking precision (MOTP) evaluates the average overlap rate between true and estimated bounding boxes. Multiple object tracking accuracy (MOTA) combines false positives (FP), false negatives (FN) and identity switches (IDS). In addition, some metrics defined in [28] are used to evaluate MOT performance: the percentage of mostly tracked (MT), mostly lost (ML) and partially tracked (PT) trajectories, the number of fragmentations of ground-truth trajectories in the tracking result (FM), the percentage of ground-truth objects correctly matched (Recall), and the percentage of detection results correctly matched to targets (Precision). (↑ denotes that a higher score is better; ↓ denotes that lower is better.)
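The two headline scores can be illustrated with a short sketch. This is a simplified single-shot version under our own naming (the official CLEAR evaluation accumulates matches frame by frame), shown only to make the definitions concrete.

```python
def clear_mot(fp, fn, ids, num_gt, overlaps):
    """Simplified CLEAR-MOT scores from aggregate error counts.

    fp, fn, ids : total false positives, false negatives, identity switches
    num_gt      : total number of ground-truth objects over the sequence
    overlaps    : bounding-box overlap of every true-positive match
    """
    mota = 1.0 - (fp + fn + ids) / num_gt            # accuracy
    motp = sum(overlaps) / len(overlaps) if overlaps else 0.0  # precision
    return mota, motp
```

For example, 10 FP, 20 FN and 2 IDS over 100 ground-truth objects give MOTA = 0.68, and matches with overlaps 0.8 and 0.6 give MOTP = 0.7.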

4.4 Results and discussion

Tables 1, 2, 3 and 4 and Figs. 5, 6, 7, 8 and 9 show the quantitative results of the proposed method and the state-of-the-art tracking methods on the SMOT, PETS2009, TUD and MOT 2015 datasets, respectively. For all metrics, the best scores are shown in red and the batch multi-object tracking methods are marked with a star. Sample results from the SMOT, PETS2009, TUD and MOT 2015 datasets are shown in Figs. 5, 6, 7 and 8, and Fig. 9 plots the Recall, Precision, MT, PT, MOTA and MOTP scores for all videos of the four datasets.

Table 1 Performance comparison between state-of-the-art methods and ours on SMOT dataset
Fig. 5

Sample tracking results of the proposed method on SMOT dataset sequences. In each frame, objects are annotated with bounding boxes and ID labels in different colors

Fig. 6

Sample tracking results of the proposed method on PETS2009 dataset sequences. In each frame, objects are annotated with bounding boxes and ID labels in different colors

Fig. 7

Sample tracking results of the proposed method on TUD dataset sequences. In each frame, objects are annotated with bounding boxes and ID labels in different colors

Fig. 8

Sample tracking results of the proposed method on the 11 test video sequences of the MOT 2015 dataset. In each frame, objects are annotated with bounding boxes and ID labels in different colors

Results for SMOT dataset

The Salmon sequence shows three skiers racing down a slalom course with complex zig-zag motion. They frequently move close to each other, and one of the skiers leaves the field of view for a long time. The sequence also contains camera motion and frequent zooming. Because the tracklet dynamic estimation model used in this paper effectively estimates missing data, the proposed method achieves the best performance on all metrics except MOTA compared to the competing trackers.

The Juggling sequence is a very challenging scene in which a juggler performs three-ball alternating tricks with artistic motions. The combined motion of the balls, the juggler and the camera, together with the similar appearance of the balls, makes it hard even for a human to keep track of the balls. The SMOT method achieves the best result on this sequence due to its global iterative optimization and dynamic motion model. The proposed method achieves the second-best results on this challenging sequence.

The main difficulty in the Acrobats sequence is that the acrobats are dressed identically and line up in the air with occlusions. The proposed method achieves the best performance on all metrics except MOTA. The data association method used in this paper effectively compensates for the weak appearance cue by relying on the dynamic motion model to accurately associate observations. The SMOT tracker has the highest MOTA at 100%, which is 2.1% higher than that of the proposed method.

The Seagulls sequence shows an extremely difficult scene in which a flock of seagulls takes off at sea, with similar appearance, spatially close motion, frequent occlusion and a cluttered background. The CMOT tracker gives the best MOTA on this challenging sequence, 9.8% higher than the proposed method. The SMOT tracker achieves the best MOTP at 100%, 0.3% higher than our method.

The Crowd sequence is an over-crowded surveillance scene with frequent occlusions and close-moving pedestrians. The data association strategy and the tracklet dynamic estimation model used in this paper help to improve fragmented-tracklet association. Hence, our method achieves the highest MOTA and precision, with few FP, IDS and FM. The qualitative tracking results are shown in Fig. 5.

Results for PETS2009 dataset

PETS09-S2.L1 and PETS09-S2.L2 are the most widely used multi-pedestrian tracking sequences. The dataset provides multiple views from different camera angles; we use only the first view of each sequence in our experiments. The S2.L1 sequence is a moderately crowded scene in which pedestrians frequently change their motion directions. Many state-of-the-art methods, both batch and online, are included for fair comparison. Table 2 shows that batch tracking methods, such as CEM and DCT, achieve the best MOTA on this sequence. The proposed method achieves fairly good results in terms of six metrics: MOTA, MOTP, recall, precision, few ID switches and few fragments. The PETS09-S2.L2 sequence is a highly crowded scene with frequent pedestrian occlusion and illumination changes. Table 2 shows that the batch methods GOGA and DCT give the best precision and MOTP, respectively. Our method achieves high recall, high MOTP and a high ratio of MT with relatively few ID switches. Compared with the online tracking methods, the batch methods give more satisfactory results. This is because batch trackers can exploit the information of subsequent frames and, consequently, a globally iterative optimal association to effectively handle detection errors and tracking failures. However, as an online tracker that uses only detections up to the current frame, the proposed method achieves results comparable to the batch tracking methods. Some sample visual results of the proposed method on S2.L1 and S2.L2 are shown in Fig. 6.

Table 2 Performance comparison between state-of-the-art methods and ours on PETS09 dataset

Results for TUD dataset

The TUD-Campus, TUD-Crossing and TUD-Stadtmitte sequences are used to evaluate pedestrian tracking performance. The main challenges in these sequences are long occlusions and spatially close-moving targets filmed from a low viewpoint. Some sample tracking results of the proposed method are shown in Fig. 7. The quantitative results for all competing algorithms are shown in Table 3. Compared with the state-of-the-art methods, the proposed method significantly improves recall, precision, MOTA and MOTP on the TUD-Stadtmitte and TUD-Campus sequences. On two of the three sequences, our MOTA is higher than that of batch tracking methods such as CEM, DCT and KSP.

Table 3 Performance comparison between state-of-the-art methods and ours on TUD dataset

Results for MOT 2015 dataset

The MOT 2015 dataset is the latest MOT dataset. Its test set contains 11 challenging sequences with occlusion, cluttered backgrounds, large scale and shape changes, and both moving and stationary cameras. Table 4 compares our method with the state-of-the-art methods on the test sequences of the MOT 2015 dataset, and Fig. 8 shows some sample tracking results on the 11 test sequences. As shown in Table 4, our tracker achieves the second-best MOTA (32.6%) among the compared methods and the best IDS, even though it works in online mode. This good tracking performance demonstrates that the Hankel-matrix-based object state prediction and the local motion constraint used in our method are beneficial for recovering short fragmented tracklets and reducing ID switches in online MOT: a long history of object states is used to construct a dynamic motion model that is robust to occlusion, and the local associated motion constraint filters out unreliable associations between trajectories and detections.

Table 4 Performance comparison between state-of-the-art methods and ours on MOT 15 dataset
Fig. 9

Evaluation metric results for four public datasets

4.5 Run time performance

In this experiment, given the detection responses, we report the average execution speed of the proposed method and the compared trackers, averaged over 5 runs on all test sequences of the MOT 2015 dataset. The results are shown in Table 5 and Fig. 10. Speed is measured in frames per second (FPS). From Table 5 and Fig. 10, we can see that the proposed method performs well in run time, with results competitive with the state-of-the-art MOT methods; its average speed reaches 15.21 FPS. With further code optimization, the speed of the proposed method can be improved.
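A timing protocol of this kind can be sketched as below. This is an illustrative harness under our own naming, not the authors' benchmark code; `tracker` stands for any per-frame tracking callback.

```python
import time

def average_fps(tracker, sequences, runs=5):
    """Average frames-per-second of `tracker` over `runs` repetitions.

    tracker   : callable invoked once per frame
    sequences : list of frame sequences (each an iterable of frames)
    """
    total_frames = sum(len(seq) for seq in sequences) * runs
    start = time.perf_counter()
    for _ in range(runs):
        for seq in sequences:
            for frame in seq:
                tracker(frame)
    elapsed = time.perf_counter() - start
    return total_frames / elapsed  # frames per second
```

Averaging over several runs, as done for Table 5, smooths out timing jitter from the operating system and I/O.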

Table 5 Run time performance (FPS)
Fig. 10

Performance comparison between the proposed method and the state-of-the-art trackers. Each marker denotes a tracker's accuracy and speed measured in FPS; higher and further to the right is better

Online multi-object tracking plays an important role in numerous essential applications, such as visual surveillance, traffic safety, autonomous driving and navigation. We test the proposed method on four publicly available MOT datasets featuring frequent occlusion, cluttered backgrounds, large scale and shape changes, non-rigid and rigid objects, and both moving and stationary cameras. Although the proposed method shows good tracking performance on most test sequences, it still needs improvement before it can be used in real applications. The proposed method is online and runs at about 15 fps, which is still below the real-time requirement; further optimizing the code to speed up the run time is therefore part of our future work. Appearance and motion models are the two main cues used in MOT. In our work, we mainly focus on using the history of object states to construct a dynamic motion model of each object, which is then used to predict object states and estimate detection reliability during online tracking. We use only a color histogram to build the appearance model of the object; no extra information is used. In recent years, deep convolutional neural networks have shown impressive performance on many tasks, and features from deep convolutional layers are discriminative while preserving spatial and structural information. Hence, one direction of our further research is to introduce deep learning into our online MOT by learning a discriminative appearance model of the object. In addition, occlusion and mis-detection are common issues in MOT. In our work, we use the predicted object state to overcome occlusion and mis-detection; however, the prediction relies heavily on the dynamic motion model and does not consider the appearance cue of the object. In real applications such as urban traffic scenarios, heavy traffic and congestion often cause serious occlusion, and the detection responses provided by the detector then frequently contain mis-detections, which poses additional challenges for the proposed method in handling occlusion and accurately tracking the targets. Therefore, in our further research, we will pay more attention to occlusion analysis in order to better address the challenges caused by occlusions.

5 Conclusion

In this paper, an online detection-based multi-object tracking method is proposed. The proposed method splits the data association problem into two related optimized estimation steps, integrating trajectory estimation and detection reliability estimation into a unified framework. The trajectory-detection association pairs are obtained by sequentially introducing the previous trajectories and the current detection reliability. To further improve the correctness of the association between trajectories and detections, a tracklet dynamic estimation model and trajectory confidence are used to recover short fragmented tracklets and reduce the ambiguity caused by missing detections. In addition, with the local associated motion constraint, the detections are refined and unreliable associations between tracklets and detections are filtered out. Moreover, the MAP framework allows trajectory estimation and detection reliability estimation to be updated alternately in a sequential manner. Experimental comparisons with state-of-the-art multi-object tracking algorithms, including both batch and online trackers, verify the effectiveness of the proposed method. However, as a detection-based tracking method it relies heavily on the object detector, it has limitations under long-term occlusion, and its run time is still below the real-time requirement. Hence, in our future work we will focus on speeding up the proposed method and on occlusion analysis, in order to meet the needs of real applications and better address the challenges caused by occlusions.