
1 Introduction

Multi-object tracking (MOT) is one of the most fundamental computer vision tasks, aiming to generate trajectories for all objects of interest across video frames. It has attracted much attention because of its broad applications such as intelligent video analysis, autonomous driving and smart cities. Current MOT studies mainly adopt the “tracking-by-detection” strategy, which applies a detector to locate objects in each frame and associates objects across frames to generate object trajectories [5, 25, 31].

Fig. 1. Object locating with pose guiding. When only one kind of detection result is applied, the bounding boxes are mislabeled due to heavy occlusion. Object detection and pose estimation results can complement each other to locate objects correctly.

Despite the encouraging progress made in the past few years, there are two significant problems with the “tracking-by-detection” strategy. The first is that tracking results rely heavily on the quality of object detection, which by itself is hard to keep reliable across frames. Taking the tracking scenes in the MOT16 dataset as examples, in crowded scenes the bounding boxes of occluded objects produced by a single detection method are often unreliable, causing drift and ID switches in tracking, as shown in Fig. 1. To alleviate such issues, recent research [24] introduces object location information from an instance segmentation method to locate the tracking objects. In this paper, we combine the merits of multi-person pose estimation and object detection in a unified framework to introduce object joint point information. We use pedestrian joint points to assist in locating objects and to alleviate unreliable detection.

The second problem concerns similarity computation in MOT: we need to compare the currently detected object with a sequence of previous observations in the trajectory. Pedestrians are among the most commonly tracked objects in MOT, so person re-identification [16, 22] is widely used for similarity calculation; it must cope with challenging factors including occlusion, partial loss and pose variation [31], as shown in Fig. 1. To alleviate such issues, [7, 31] propose feature extraction networks that introduce an attention mechanism [27] to extract detection and tracklet appearance features. Additionally, inspired by [29], we introduce a self-attention mechanism that calculates self-attention maps for the detection image and the tracklet images, respectively. Moreover, our network is trained end-to-end, which reduces training complexity and yields more robust features.

The main contributions of this paper can be summarized as follows.

  1. A new detection strategy is proposed to combine object detection and pose estimation results. The strategy takes advantage of both object detection and pose estimation to handle unreliable detection in online MOT.

  2. We design a Dual Self-Attention Network (DSAN) that introduces the self-attention mechanism to allocate different attention values to each location in the object image and to exploit temporal self-attention features from the tracklet.

  3. Experimental results demonstrate that our tracker achieves competitive performance on the MOT benchmark datasets and state-of-the-art results on half of the metrics.

2 Related Work

In recent years there has been rapid progress in MOT, driven primarily by object detection strategies. Sanchez-Matilla et al. [20] exploited multiple detectors to improve detection performance in MOT. Chen et al. [5] combined detections with bounding boxes predicted by a Kalman filter as the tracking candidate set for quality evaluation and used different strategies for data association. Although these methods alleviate unreliable detection results, they still rely on a single kind of detection information and hence cannot effectively alleviate missed detections. Several works use location information from other tasks to determine the coordinates of the tracking candidates [6, 10, 13, 24]. Voigtlaender et al. [24] proposed the MOTS task and the TrackR-CNN network to merge segmentation and multi-object tracking; the network employs top-down segmentation information instead of detection information to locate objects. Nevertheless, the top-down object location information introduced in the above methods still depends on the quality of the object detection results [8, 24]. In contrast, we propose the Soft-Pose-NMS detection strategy, which introduces object joint point information from a bottom-up pose estimation method. The bottom-up object location information is not affected by object detection performance and provides additional object position information, and can thereby effectively improve object detection results in MOT.

For object feature extraction and similarity computation, Mahmoudi et al. [17] applied CNN-extracted appearance features along with position features to calculate more accurate similarity scores. Chu et al. [7] introduced a Spatial-Temporal Attention Mechanism (STAM) to handle tracking drift caused by occlusion and interaction among objects. Zhu et al. [31] proposed Dual Matching Attention Networks (DMAN) with both spatial and temporal attention mechanisms to perform tracklet data association. In this paper, we integrate both spatial and temporal self-attention mechanisms into the proposed MOT framework. Our framework differs from the state-of-the-art DMAN [31] in two respects. First, the spatial attention in DMAN is computed between the detection image and trajectory images. Since the attention map is affected by different trajectory images, it becomes unreliable when other objects appear in the trajectory image. In contrast, we exploit the image itself to generate the self-attention map, which proves more robust to inter-object occlusion and noisy detection. Second, DMAN must be trained in two separate steps, while our spatial and temporal self-attention maps can be trained end-to-end.

3 Proposed Method

Our online tracking framework consists of three tasks: object detection, similarity calculation and trajectory management. We first obtain all tracking objects with the proposed Soft-Pose-NMS detection strategy, which introduces object pose information. Then we use the Dual Self-Attention Network (DSAN) to extract features and compute the similarity score between the detection image and tracklet images. Finally, we update the tracking states of objects and trajectories.

3.1 Soft-Pose-NMS Object Detection Strategy

Given a new frame, we obtain the joint points of each object through the pose estimation network [15]. However, some of these joint points are abnormal, as shown in Fig. 2. Therefore, the Soft-Pose-NMS detection strategy is designed to generate accurate joint points-based bounding boxes from pose estimation results and to determine tracking candidates by screening the two types of bounding boxes. These bounding boxes are adopted to alleviate detection failures in crowded scenes.

First, we obtain the primary detection-based bounding box set \(PB_{det}\) with an object detection method. It is necessary to generate a sufficient number of detection bounding boxes so that accurate tracking bounding boxes can be obtained by filtering. Therefore, we set a low confidence threshold \(T_{detcon}\) to generate the detection-based bounding box set \(B_{det}\) from \(PB_{det}\).

Fig. 2. Bounding box results based on pose estimation. (a) shows a result missing part of the object joint points. (b) shows a result with abnormal joint points with large offsets. (c) shows a result with abnormal joint points with small offsets. Red and blue points are the clustering result of the object joint points, and \(w_{i}\) is the width of each point group.

Second, a primary joint points-based bounding box \(PB_{jpi}\) is generated by expanding the coordinates of its joint points. Here we define \(NP_{PBjpi}\) as the number of joint points and \(AR_{PBjpi}\) as the aspect ratio of \(PB_{jpi}\). The primary joint points-based bounding box set \(PB_{jp}\) is then defined as:

$$\begin{aligned} PB_{jp} = \left\{ PB_{jpi} \mid NP_{PBjpi}> T_{njp}\ \text {and}\ AR_{PBjpi} < T_{ratio} \right\} \end{aligned}$$
(1)

where \(T_{njp}\) is the threshold on the number of joint points and \(T_{ratio}\) is the threshold on the aspect ratio. We set \(T_{njp}\) = 8 and \(T_{ratio}\) = 0.6 to generate \(PB_{jp}\). However, coordinate shifting of joint points-based bounding boxes still exists in \(PB_{jp}\), as shown in Fig. 2(c). We observe that this shifting appears only on the abscissa. To deal with this joint point drift and obtain an exact width for each joint points-based bounding box, we first use a clustering algorithm to cluster the joint points of each bounding box \(PB_{jpi}\) in \(PB_{jp}\) into two point groups. Then we calculate the width ratio of the two point groups. We define \(w_{1}\) and \(w_{2}\) as the widths of the two point groups, as shown in Fig. 2(c), and \(R_{w}\) as the ratio of \(w_{1}\) to \(w_{2}\). The width of the ith joint points-based bounding box \(W_{PBjpi}\) is then generated by the following formula:

$$\begin{aligned} W_{PBjpi}={\left\{ \begin{array}{ll} w_{1} &{} R_{w}> Tw_{ratio} \\ w_{2} &{} R_{w}\le Tw_{ratio} \end{array}\right. } \end{aligned}$$
(2)

where \(Tw_{ratio}\) is the threshold of the width ratio. We analyse the positions of the drifted joint points and set \(Tw_{ratio}\) to 2.
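For concreteness, a minimal sketch of this width recalculation follows, assuming k-means clustering on the joint abscissas; the paper does not name the clustering algorithm, so that choice is ours.

```python
import numpy as np
from sklearn.cluster import KMeans

TW_RATIO = 2.0  # width-ratio threshold Tw_ratio from the paper

def recalc_box_width(joints_xy: np.ndarray) -> float:
    """Recompute a box width per Eq. (2) from (n, 2) joint coordinates."""
    # Cluster joints into two groups by abscissa (clustering choice assumed).
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(joints_xy[:, :1])
    xs1 = joints_xy[labels == 0, 0]
    xs2 = joints_xy[labels == 1, 0]
    w1 = xs1.max() - xs1.min()      # width of the first point group
    w2 = xs2.max() - xs2.min()      # width of the second point group
    r_w = w1 / max(w2, 1e-6)        # width ratio R_w = w1 / w2
    return w1 if r_w > TW_RATIO else w2
```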

After recalculating the width of each joint points-based bounding box, we obtain the final joint points-based bounding box set \(B_{jp}\). In order to combine it with the detection-based bounding boxes and to screen unreliable bounding boxes, we need to assign a reasonable confidence score to the ith joint points-based bounding box \(B_{jpi}\) in \(B_{jp}\). Directly using the average score of the joint points in \(B_{jpi}\) as its confidence would cause a confidence bias. Therefore, we propose a function that explicitly encodes the pose information of each joint point into the confidence value. It expands the total variance and pushes the score distributions of different pedestrians farther apart. The confidence of \(B_{jpi}\) is defined as:

$$\begin{aligned} CB_{jpi}=\frac{1}{n}\sum _{i=1}^{n}\tanh \frac{s_{i}}{\sigma } \end{aligned}$$
(3)

where \(CB_{jpi}\) is the confidence of the ith joint points-based bounding box \(B_{jpi}\), \(\sigma \) is a data-driven parameter that controls the degree of score suppression and \(s_{i}\) is the score of each joint point. The joint scores are averaged after the \(\tanh \) mapping to generate the confidence \(CB_{jpi}\) and the final joint points-based bounding box set \(B_{jp}\).
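A minimal sketch of this scoring function follows; the value of \(\sigma \) is a placeholder, since the paper describes it only as data-driven.

```python
import numpy as np

def box_confidence(joint_scores: np.ndarray, sigma: float = 1.0) -> float:
    """Eq. (3): average the tanh-suppressed joint scores of one box."""
    # sigma is data-driven in the paper; 1.0 here is a placeholder value.
    return float(np.mean(np.tanh(joint_scores / sigma)))
```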

The tracking object bounding box set \(B_{track}\) is then generated as follows. First, we fuse the detection-based bounding box set \(B_{det}\) and the joint points-based bounding box set \(B_{jp}\) into the candidate bounding box set \(B_{can}\) of the current frame. Second, we sort all bounding boxes by confidence and output the bounding box \(B_{max}\) with the maximum confidence as a tracking object. Then, we re-assign the confidence of the remaining bounding boxes as:

$$\begin{aligned} CB_{cani}={\left\{ \begin{array}{ll} CB_{cani} &{} IoU_{mi} < T_{IoU}\\ CB_{cani}(1-IoU_{mi}) &{} IoU_{mi} \ge T_{IoU} \end{array}\right. } \end{aligned}$$
(4)

where \(CB_{cani}\) indicates the confidence of the ith bounding box \(B_{cani}\) in the candidate bounding box set \(B_{can}\), \(IoU_{mi}\) indicates the IoU between \(B_{max}\) and \(B_{cani}\), and \(T_{IoU}\) indicates the IoU threshold. Finally, we delete the candidates whose confidence is less than the confidence threshold \(T_{con}\), and repeat this selection and re-weighting until \(B_{can}\) is empty.

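For concreteness, a minimal sketch of the full suppression loop follows, using the thresholds \(T_{IoU}\) = 0.95 and \(T_{con}\) = 0.5 reported in Sect. 4.1; the box format and IoU helper are our assumptions rather than a released implementation.

```python
import numpy as np

def iou(a, b):
    """Standard IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-6)

def soft_pose_nms(boxes, scores, t_iou=0.95, t_con=0.5):
    """Sketch of the Soft-Pose-NMS loop over the candidate set B_can."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        m = int(np.argmax(scores))          # B_max: highest confidence
        b_max = boxes.pop(m)
        scores.pop(m)
        keep.append(b_max)
        kept_boxes, kept_scores = [], []
        for b, s in zip(boxes, scores):
            o = iou(b_max, b)
            if o >= t_iou:                  # Eq. (4): decay the confidence
                s = s * (1.0 - o)
            if s >= t_con:                  # drop low-confidence candidates
                kept_boxes.append(b)
                kept_scores.append(s)
        boxes, scores = kept_boxes, kept_scores
    return keep
```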

3.2 Feature Extraction with Dual Self-Attention Network

Extracting discriminative appearance features is the critical component of calculating accurate similarity scores. The challenge is that object and tracklet images may undergo occlusion and noise in the tracking scene. To alleviate such issues, we design a Dual Self-Attention Network (DSAN) with self-attention mechanisms. Figure 3 illustrates the architecture of our network.

Fig. 3. The architecture of the proposed DSAN. It contains two branches that take an image of the tracking object bounding box and a sequence of object tracklet images as inputs. The network extracts the detection and tracklet self-attention feature maps and predicts, from the combined feature map \(X_{c}\), the probability that the detection and the tracklet are the same object.

In this work, we use DenseNet-121 [12] as the backbone network and introduce the self-attention mechanism to extract the tracking object and tracklet feature maps. The self-attention mechanism enlarges the receptive field and captures contextual information, which enables the network to pay more attention to the object area in the detection and tracklet images. We convolve the tracklet images along the temporal direction with a 3D convolutional layer to exploit the temporal features of the object. The self-attention map is applied to the feature maps from the last convolutional layer of the backbone to compute the self-attention feature map. We use the detection self-attention feature map \(X_{\alpha }\) and the tracklet self-attention feature map \(X_{\beta }\) for re-identification training, and the combined feature map \(X_{c}\) for training a binary classifier that predicts whether the detection and tracklet are the same object. Furthermore, we apply the similarity probability \(P_{same}\) predicted by the network to calculate the similarity score between the detection and the trajectory.
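As an illustration of the temporal branch, the following sketch assumes 1024-channel backbone feature maps and the 5-frame tracklet input from Sect. 4.1; the actual kernel and channel sizes are not specified in the paper.

```python
import torch
import torch.nn as nn

# Collapse the time axis of stacked tracklet feature maps with a 3D conv.
# Shapes are assumptions: 5 tracklet frames, 1024-channel 7x7 feature maps.
temporal_conv = nn.Conv3d(1024, 1024, kernel_size=(5, 1, 1))

tracklet_feats = torch.randn(1, 1024, 5, 7, 7)            # (batch, C, time, H, W)
temporal_feat = temporal_conv(tracklet_feats).squeeze(2)  # -> (1, 1024, 7, 7)
```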

To infer the self-attention maps of the detection and tracklet, we transform the backbone feature maps into a query feature map \(f_{q}\), a key feature map \(f_{k}\) and a value feature map \(f_{v}\). We then use \(f_{q}\) and \(f_{k}\) to calculate the attention map as follows:

$$\begin{aligned} \beta _{i,j}=\frac{\exp (S_{ij})}{\sum _{i=1}^{N}\exp (S_{ij})}, \quad S_{ij}=f_{q}(x_{i})^{T}f_{k}(x_{j}) \end{aligned}$$
(5)

where \(\beta _{i,j}\) indicates how much the ith position contributes to the response at the jth position. Then we weight \(f_{v}\) by \(\beta _{i,j}\) to obtain the self-attention masked feature map \(f_{org}^{att}\):

$$\begin{aligned} f_{org}^{att}(x_{j})=\sum _{i=1}^{N}\beta _{i,j}f_{v}(x_{i}) \end{aligned}$$
(6)

Additionally, we add the masked feature map \(f_{org}^{att}\) to the original backbone feature map \(f_{org}\). The final self-attention feature map \(f_{sa}\) is therefore given by:

$$\begin{aligned} f_{sa} = \theta f_{org}^{att} + f_{org} \end{aligned}$$
(7)

where \(\theta \) is a learnable scalar that gradually emphasizes the importance of the self-attention feature map.
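For reference, Eqs. (5)-(7) correspond to the following PyTorch sketch; the 1\(\times \)1 convolutions and the channel-reduction factor follow common self-attention practice [29] and are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Minimal sketch of Eqs. (5)-(7); channel sizes are assumptions."""
    def __init__(self, in_ch: int, reduction: int = 8):
        super().__init__()
        self.f_q = nn.Conv2d(in_ch, in_ch // reduction, 1)  # query transform
        self.f_k = nn.Conv2d(in_ch, in_ch // reduction, 1)  # key transform
        self.f_v = nn.Conv2d(in_ch, in_ch, 1)               # value transform
        self.theta = nn.Parameter(torch.zeros(1))           # learnable scalar

    def forward(self, f_org: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f_org.shape
        q = self.f_q(f_org).flatten(2)          # (b, c', N)
        k = self.f_k(f_org).flatten(2)          # (b, c', N)
        v = self.f_v(f_org).flatten(2)          # (b, c, N)
        s = torch.bmm(q.transpose(1, 2), k)     # S_ij, shape (b, N, N)
        beta = F.softmax(s, dim=1)              # Eq. (5): normalize over i
        f_att = torch.bmm(v, beta)              # Eq. (6): sum_i beta_ij v(x_i)
        f_att = f_att.view(b, c, h, w)
        return self.theta * f_att + f_org       # Eq. (7)
```

Initializing \(\theta \) to zero lets the module start as an identity mapping, which matches the described behaviour of gradually emphasizing the self-attention feature.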

The training of the feature maps in DSAN can be modelled as multi-task learning. The joint objective can be written as a weighted linear sum of losses:

$$\begin{aligned} L_{total} = \alpha L_{sig} + (1-\alpha ) L_{seq} + \beta L_{same} \end{aligned}$$
(8)

where \(L_{sig}\) and \(L_{seq}\) are used for re-id training and are calculated with the cross-entropy loss. \(L_{same}\) is used for binary classification training and is calculated with the contrastive loss. \(\alpha \) and \(\beta \) are loss weights. We utilize the ground-truth bounding boxes and object identities provided in the MOT16 training set to generate detection images and object trajectories for training the network.
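A minimal sketch of the joint objective in Eq. (8); the weight values below are placeholders, since the paper does not report them.

```python
def total_loss(l_sig, l_seq, l_same, alpha=0.5, beta=1.0):
    """Eq. (8): weighted sum of the two re-id losses and the binary loss."""
    # alpha and beta are loss weights; these defaults are placeholders.
    return alpha * l_sig + (1.0 - alpha) * l_seq + beta * l_same
```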

3.3 Data Association and Trajectory Management

For data association, we first calculate the similarity score between the detection and tracklet feature maps by the following formula:

$$\begin{aligned} S_{dt} = w_{1}\, dist(X_{\alpha }, X_{\beta }) + w_{2} P_{same} \end{aligned}$$
(9)

where \(w_{1}\) and \(w_{2}\) are similarity score weights and \(S_{dt}\) is the final similarity score between detection and tracklet. The tracker then builds an affinity matrix from the similarity scores and applies the Hungarian algorithm to it to associate detections with tracklets. Last, the tracker associates the remaining detections with unassociated tracklets based on the IoU between detections and tracklets, with a threshold \(T_{IoUa}\). For trajectory management, we initialize a trajectory for a detection that is not associated with any trajectory within the first \(T_{init}\) frames. Trajectories are terminated if they are not associated for \(T_{term}\) frames.
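As an illustration, a minimal sketch of this association step follows; the cosine-similarity choice for \(dist(\cdot ,\cdot )\), the weight values and the acceptance threshold are our assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(det_feats, trk_feats, p_same, w1=0.5, w2=0.5, s_min=0.4):
    """Eq. (9) plus Hungarian matching.

    det_feats: (D, F) detection features, trk_feats: (T, F) tracklet
    features, p_same: (D, T) network-predicted same-object probabilities.
    """
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    t = trk_feats / np.linalg.norm(trk_feats, axis=1, keepdims=True)
    affinity = w1 * (d @ t.T) + w2 * p_same        # Eq. (9), shape (D, T)
    rows, cols = linear_sum_assignment(-affinity)  # maximize total score
    return [(i, j) for i, j in zip(rows, cols) if affinity[i, j] >= s_min]
```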

4 Experiments

4.1 Implementation Details

To validate the effectiveness of the proposed online tracking approach, we design experiments on the popular MOT datasets MOT16 and MOT17 [18]. We employ PifPaf [15] to estimate the object pose information, and use the SDP [28] detection results officially provided by MOT16 and MOT17 as the object detection results. We set \(T_{IoU}\) = 0.95 and \(T_{con}\) = 0.5 for filtering repetitive bounding boxes to generate the tracking object set \(B_{track}\), and select 5 observations from the 20 most recent frames as the tracklet input for DSAN. We set \(T_{IoUa}\) = 0.7 for data association. For trajectory management, we set the threshold \(T_{init}\) = 3 for trajectory initialization and \(T_{term}\) = 10 for trajectory termination.
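For reference, these hyper-parameters can be collected into a single configuration sketch (the variable names are ours, not from a released codebase):

```python
# Hyper-parameters reported in Sect. 4.1, gathered as a config sketch.
CONFIG = dict(
    t_iou=0.95,       # NMS IoU threshold T_IoU
    t_con=0.5,        # confidence threshold T_con
    tracklet_len=5,   # observations sampled from the 20 most recent frames
    t_iou_assoc=0.7,  # association IoU threshold T_IoUa
    t_init=3,         # frames before trajectory initialization (T_init)
    t_term=10,        # frames before trajectory termination (T_term)
)
```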

4.2 Performance on MOT Benchmark Datasets

To measure the accuracy of tracking results, we adopt multiple metrics used in the MOT benchmark [2] to evaluate the proposed tracking method, including Multiple Object Tracking Accuracy (MOTA), ID F1 score (IDF1, the ratio of correct detections over the average number of ground-truth and computed detections), the ratio of Mostly Tracked objects (MT), the ratio of Mostly Lost objects (ML), the number of False Negatives (FN), the number of False Positives (FP), the number of ID Switches (IDS) and the number of fragments (Frag). Table 1 and Table 2 present the tracking performance on the MOT16 and MOT17 datasets, respectively.

Table 1. Tracking performance on the MOT16 dataset. The arrow after each metric indicates whether a higher (\(\uparrow \)) or lower (\(\downarrow \)) value is better.
Table 2. Tracking performance on the MOT17 dataset.

Quantitative results and comparisons with other tracking methods are shown in Table 1 and Table 2. As shown in Table 1, our tracking method achieves comparable MT, ML, FP and Frag scores and performs favourably against the state-of-the-art methods in terms of MOTA, IDF1, FN and IDs on the MOT16 dataset. Our tracker improves MOTA to 67.7 and IDF1 to 66.4, and reduces FN to 42494 and IDs to 334. Meanwhile, our tracker achieves the best IDF1 and IDs among both online and batch methods, demonstrating its merits in object identity matching and the stability of multi-object tracking. MOTA and FN correspond to object detection capability, so their improvement demonstrates the merits of our Soft-Pose-NMS detection strategy in object locating for MOT. Similarly, Table 2 shows that our tracker outperforms existing online trackers on half of the metrics and achieves the best IDF1, MT, IDs and Frag on the MOT17 dataset.

In addition, as shown in Table 1, our tracker has a high FP. There are two reasons for this. First, the detection strategy proposed in this paper combines object detection results and pose estimation results, which alleviates unreliable detection and recovers missing objects. Second, only moving pedestrians are annotated as tracking ground truth in MOT16 and MOT17, yet our detection strategy also detects and tracks small-scale pedestrians, occluded pedestrians, stationary pedestrians and other pedestrians who are not annotated as tracking objects. Therefore, our detection strategy leads to a high FP; a similar situation exists in [4, 5]. This phenomenon also reflects the effectiveness of the proposed detection strategy.

4.3 Ablation Studies

To verify the effectiveness of the proposed detection strategy and evaluate its contribution, we conduct ablation experiments on the MOT16 dataset with different object detection results. We choose Mask R-CNN [11] and SDP [28] as the bounding box-based object detection methods and PifPaf [15] as the pose estimation method. In addition, to exclude the disturbance of other factors, we use DeepSORT [25], a widely used MOT method, for tracking.

Table 3. Evaluation tracking results on MOT16 dataset with different detection method. Ours (M+P) indicates combining Mask R-CNN detection results and PifPaf pose estimation results. Ours (S+P) indicates combining SDP detection results and PifPaf pose estimation results.
Fig. 4. Visualization of pose-guided object locating results and self-attention maps.

The experimental results are shown in Table 3. The comparison between our detection strategy and the individual object detection and pose estimation methods confirms that our strategy performs best. It improves MOTA by 3.6, IDF1 by 3.5 and MT by 3.1% over the second-best detection method and effectively reduces FN, demonstrating its merits in locating objects. By combining object detection results and pose estimation results, our detection strategy can reduce unreliable detections and alleviate missing detections, as shown in Fig. 4(a).

Table 4. Evaluation results on MOT16 with different feature representations.

To demonstrate the contribution of the proposed DSAN in our method, we compare the representations learned by DSAN with those of PCB and DenseNet-121. We use the SDP [28] detection results, officially provided by MOT16, for tracking. The experimental results are shown in Table 4. The IDF1, IDs and MOTA of DSAN are better than those of the other methods. Our tracker improves MOTA to 65.7 and IDF1 to 68.7 and reduces IDs to 455, which demonstrates the effectiveness of our feature extraction network.

Figure 4(b) shows visualizations of the self-attention feature maps from DSAN. Each group in Fig. 4(b) consists of four images: the top row shows an image pair of the same object, while the bottom row presents the corresponding self-attention feature maps. Our self-attention feature maps focus explicitly on object regions and suppress noise and occlusion, which enhances the network's power to extract discriminative features.

5 Conclusions

This paper presents a detection strategy and a feature extraction network that improve two main components of most online trackers: detection and feature extraction. The tracker locates the joint points of objects with pose estimation results and then generates optimal object bounding boxes with the proposed Soft-Pose-NMS method, which also helps alleviate typical tracking difficulties such as occlusion and track drift. The tracker further learns discriminative self-attention maps from the MOT dataset with the self-attention mechanism to calculate more accurate similarity scores. Extensive experiments on the MOT Challenge datasets demonstrate that the proposed tracking framework achieves competitive performance.