1 Introduction

Multi-object tracking is an important research topic in computer vision, with a wide range of applications such as video surveillance, human-computer interaction, and autonomous driving. The main task of multi-object tracking is to associate the moving objects detected in a video sequence and plot the trajectory of each moving object, as shown in Fig. 1. These trajectories contribute to abnormal event reporting, traffic control and other applications.

Fig. 1. Vehicle detection and tracking on a highway

In practical highway scenes, most surveillance cameras are installed on the roadside, so vehicles often occlude each other during video surveillance. When a target vehicle reappears in the video after occlusion, it is likely to be recognized as a new target, which results in the loss of the original one. A simplified and efficient correlation algorithm is proposed in this paper. After calculating the degree of correlation between objects in different frames from their position, motion and color information, we match and merge objects with a high degree of correlation as one identified target.

The main contributions of this paper are listed below:

  • A vehicle location prediction model is proposed to accurately predict the location of a vehicle when it is occluded.

  • A simplified and efficient data association method based on multi-feature fusion is proposed.

  • The improved tracking algorithm is evaluated in experiments on real data sets.

2 Related Work

From the R-CNN series [1] to SSD [2] and the YOLO series [3], many achievements have appeared in object detection. As YOLOv3 is an end-to-end object detection method, it runs very fast [4, 5].

For video analysis, multi-object tracking has been a research focus; it consists of Detection-Free Tracking (DFT) and Detection-Based Tracking (DBT) [6]. Because the vehicles in the video are tracked continuously, DBT is adopted in this paper.

Based on DBT, many researchers have proposed different solutions [7,8,9]. However, these methods do not involve image features, so their results are not acceptable under occlusion.

CNNs have also been introduced for multi-object tracking [10,11,12,13]. Building on SORT [8], the methods of [14, 15] use a CNN to extract image features of the tracking target. Although a CNN provides more precision, it struggles to meet real-time requirements. Because performance on single-object tracking in complex scenarios has improved [16,17,18], [19] transforms the multi-object tracking problem into multiple single-object tracking problems and also achieves good performance.

For location revision, the Kalman filter [20] can filter out noise and interference and optimize the target state. However, the Kalman filter assumes Gaussian distributions and deals only with linear models.

Based on the above works, this paper proposes a multi-object tracking method that correlates color and location features. It predicts the location when the target is lost in a frame, and then applies the unscented Kalman filter (UKF) for calibration.

3 Data Association

Given an image sequence, we employ an object detection module to obtain the n detected objects in the t-th frame as \( S^{t} = \{ S_{1}^{t} ,S_{2}^{t} , \ldots ,S_{n}^{t} \} \). The set of all detected objects from time \( t_{s} \) to \( t_{e} \) is \( S^{{t_{s} :t_{e} }} = \left\{ {S_{1}^{{t_{s} :t_{e} }} ,S_{2}^{{t_{s} :t_{e} }} , \ldots ,S_{n}^{{t_{s} :t_{e} }} } \right\} \), where \( S_{i}^{{t_{s} :t_{e} }} = \left\langle {S_{i}^{{t_{s} }} ,S_{i}^{{t_{s} + 1}} , \ldots ,S_{i}^{{t_{e} }} } \right\rangle \left( {i \in n} \right) \) is the trajectory of the i-th object from \( t_{s} \) to \( t_{e} \). The purpose of multi-object tracking is to find the best state sequence of all objects, which can be modeled by MAP (maximum a posteriori) estimation over the conditional distribution of the states given all observed sequences:

$$ \bar{S}^{{t_{s} :t_{e} }} = \mathop {\arg \hbox{max} }\limits_{{S^{{t_{s} :t_{e} }} }} \text{P}\left( {S^{{t_{s} :t_{e} }} |O^{{t_{s} :t_{e} }} } \right) $$
(1)

For more accuracy, prediction on the object position according to the historical trajectory is necessary. Equations (2) and (3) show the prediction method.

$$ S_{i}^{t} = \left\langle {Loc_{{S_{i}^{t} }} ,MotionInfo_{{S_{i}^{t} }} ,Features_{{S_{i}^{t} }} } \right\rangle $$
(2)
$$ \left\{ {\begin{array}{*{20}l} {Loc_{{S_{i}^{t} }} = \left[ {\left( {x_{{S_{i}^{t} }} ,y_{{S_{i}^{t} }} } \right),\left( {\left( {x + w} \right)_{{S_{i}^{t} }} ,\left( {y + h} \right)_{{S_{i}^{t} }} } \right)} \right]} \hfill \\ {MotionInfo_{{S_{i}^{t} }} = \left[ {\bar{v}_{{S_{i}^{t} }} ,\bar{a}_{{S_{i}^{t} }} ,k} \right]} \hfill \\ {Features_{{S_{i}^{t} }} = \left[ {color_{{S_{i}^{t} }} ,type_{{S_{i}^{t} }} } \right]} \hfill \\ \end{array} } \right. $$
(3)

\( Loc_{{S_{i}^{t} }} \) is the location of the i-th object in the t-th frame, including the upper-left coordinates \( \left( {x_{{S_{i}^{t} }} ,y_{{S_{i}^{t} }} } \right) \) and the lower-right coordinates \( \left( {\left( {x + w} \right)_{{S_{i}^{t} }} ,\left( {y + h} \right)_{{S_{i}^{t} }} } \right) \); \( MotionInfo_{{S_{i}^{t} }} \) is the motion information of the i-th object in the t-th frame, including the average velocity \( \bar{v}_{{S_{i}^{t} }} \), the average acceleration \( \bar{a}_{{S_{i}^{t} }} \) and the motion direction k; \( Features_{{S_{i}^{t} }} \) is the characteristic information of the i-th object in the t-th frame, including the color features \( color_{{S_{i}^{t} }} \) and the object class \( type_{{S_{i}^{t} }} \).
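For concreteness, the per-frame state of Eqs. (2) and (3) can be held in a small data structure. The sketch below is purely illustrative; the field names mirror the notation, and the scalar types chosen for velocity, acceleration and direction are assumptions:

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class ObjectState:
    """State S_i^t of one object in frame t, mirroring Eqs. (2)-(3)."""
    top_left: Tuple[float, float]      # (x, y)
    bottom_right: Tuple[float, float]  # (x + w, y + h)
    avg_velocity: float                # v-bar
    avg_acceleration: float            # a-bar
    direction: float                   # motion direction k
    color: np.ndarray                  # color feature vector
    obj_type: str                      # detected object class

    def center(self) -> Tuple[float, float]:
        """Box center, convenient for motion prediction."""
        return ((self.top_left[0] + self.bottom_right[0]) / 2,
                (self.top_left[1] + self.bottom_right[1]) / 2)
```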

The Kalman filter is used to calibrate the position. However, the Kalman filter applies only to linear models and does not suit multi-object tracking. Therefore, the UKF is used, which performs statistical linearization via the unscented transformation. The UKF first selects a set of sample (sigma) points from the prior distribution, and uses linear regression of the non-linear function of the random variable to achieve higher accuracy.
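As a concrete sketch of this step, the snippet below uses the third-party filterpy library with a constant-velocity state \( [c_{x} ,c_{y} ,v_{x} ,v_{y} ] \) over the box center. The state vector and the sigma-point parameters are assumptions, since the paper does not specify them:

```python
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

dt = 1.0  # one frame per step


def fx(x, dt):
    """Constant-velocity transition for state [cx, cy, vx, vy]."""
    F = np.array([[1., 0., dt, 0.],
                  [0., 1., 0., dt],
                  [0., 0., 1., 0.],
                  [0., 0., 0., 1.]])
    return F @ x


def hx(x):
    """The detector observes only the box center (cx, cy)."""
    return x[:2]


points = MerweScaledSigmaPoints(n=4, alpha=0.1, beta=2.0, kappa=-1.0)
ukf = UnscentedKalmanFilter(dim_x=4, dim_z=2, dt=dt, fx=fx, hx=hx,
                            points=points)
ukf.x = np.array([100., 50., 0., 0.])  # initial center from first detection
ukf.P *= 10.

for z in [np.array([104., 51.]), np.array([108., 52.])]:  # detected centers
    ukf.predict()
    ukf.update(z)  # calibrated position is in ukf.x[:2]
```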

In this paper, three kinds of information are used for data matching: location information, IOU (Intersection over Union) and color features. Given the position information of \( S_{i}^{t - 1} \) and \( S_{i}^{t} \), we use the adjusted cosine similarity to calculate their position similarity. The position similarity \( L_{confidence} \) between \( S_{i}^{t - 1} \) and \( S_{i}^{t} \) is defined as (4).

$$ L_{confidence} = \frac{{\sum\limits_{k = 1}^{n} {\left( {Loc_{{S_{i}^{t} }} - \overline{Loc}_{{S_{i}^{t} }} } \right)_{k} \left( {Loc_{{S_{i}^{t - 1} }} - \overline{Loc}_{{S_{i}^{t - 1} }} } \right)_{k} } }}{{\sqrt {\sum\limits_{k = 1}^{n} {\left( {Loc_{{S_{i}^{t} }} - \overline{Loc}_{{S_{i}^{t} }} } \right)_{k}^{2} } } \sqrt {\sum\limits_{k = 1}^{n} {\left( {Loc_{{S_{i}^{t - 1} }} - \overline{Loc}_{{S_{i}^{t - 1} }} } \right)_{k}^{2} } } }} $$
(4)
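A direct implementation of Eq. (4); here the location vectors are taken to be the four corner coordinates [x1, y1, x2, y2] of Eq. (3), which is our reading since the paper does not enumerate the components:

```python
import numpy as np


def location_confidence(loc_prev, loc_curr):
    """Adjusted cosine similarity (Eq. 4): each coordinate vector is
    centered by its own mean before the cosine is taken."""
    a = np.asarray(loc_curr, dtype=float)
    b = np.asarray(loc_prev, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```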

The object association confidence between \( S_{i}^{t - 1} \) and all detections \( S^{t} \) is shown as:

$$ \text{Confidence} \left( {S_{i}^{t - 1} ,S^{t} } \right) = \left\{ {\text{c}_{1} \left( {S_{i}^{t - 1} ,S_{1}^{t} } \right),\text{c}_{2} \left( {S_{i}^{t - 1} ,S_{2}^{t} } \right), \ldots ,\text{c}_{n} \left( {S_{i}^{t - 1} ,S_{n}^{t} } \right)} \right\} $$
(5)

Thereby \( Loc_{{S_{i}^{t} }} \) is:

$$ Loc_{{S_{i}^{t} }} = Loc_{{S_{{\text{maxIndex} \left( {\text{Confidence} \left( {S_{i}^{t - 1} ,S^{t} } \right)} \right)}}^{t} }} $$
(6)
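Equations (5) and (6) pick, for each object in frame t-1, the frame-t detection with the highest fused confidence. A minimal greedy sketch of this association, assuming a hypothetical fused confidence function and a threshold parameter that the paper does not state:

```python
def associate(prev_objs, curr_objs, confidence, threshold=0.5):
    """Greedy association per Eqs. (5)-(6): each S_i^{t-1} is matched
    to the unmatched S_j^t with the highest fused confidence.
    `confidence` and `threshold` are assumed interfaces/parameters."""
    matches = []
    unmatched = set(range(len(curr_objs)))
    for i, prev in enumerate(prev_objs):
        if not unmatched:
            break
        # maxIndex(Confidence(S_i^{t-1}, S^t)), restricted to unmatched boxes
        j = max(unmatched, key=lambda k: confidence(prev, curr_objs[k]))
        if confidence(prev, curr_objs[j]) >= threshold:
            matches.append((i, j))
            unmatched.discard(j)
    return matches, sorted(unmatched)
```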

4 Experiments

Real highway surveillance videos are used as the experimental data set, and three different scenes are extracted: an ordinary road section, a section with frequent occlusion, and a congested section. Each video lasts about 2 min, giving a total of 11338 frames annotated in the MOT Challenge standard data set format.

The performance indicators of multi-object tracking reflect the accuracy of the predicted locations and the temporal consistency of the tracking algorithm. The evaluation indicators include [6]:

  • MOTA: combines false negatives, false positives and the mismatch rate (see the formula after this list);

  • MOTP: overlap between the estimated positions and the ground truth, averaged over the matches;

  • MT: percentage of ground-truth trajectories covered by the tracker output for more than 80% of their length;

  • ML: percentage of ground-truth trajectories covered by the tracker output for less than 20% of their length;

  • IDS: number of times a tracked trajectory changes its matched ground-truth identity.
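For reference, MOTA combines these error counts over all frames t in the standard MOT Challenge way:

$$ \text{MOTA} = 1 - \frac{{\sum\nolimits_{t} {\left( {FN_{t} + FP_{t} + IDSW_{t} } \right)} }}{{\sum\nolimits_{t} {GT_{t} } }} $$

where \( FN_{t} \), \( FP_{t} \) and \( IDSW_{t} \) are the false negatives, false positives and identity switches in frame t, and \( GT_{t} \) is the number of ground-truth objects in frame t.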

First, we implement the basic IOU algorithm to calculate the association of targets; a reference implementation is sketched below. The comparison is among IOU, SORT and IOU17. In the experiment, our method sets \( T_{minhits} \) (the shortest life length of a generated object) to 8 and \( T_{maxdp} \) to 30; the same settings are used for SORT. IOU17 sets \( T_{minhits} \) to 8 and \( \sigma_{iou} \) to 0.5. The object detection confidence threshold is set to 0.5, and the results are shown in Tables 1, 2 and 3.
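The basic IOU association scores the overlap of two boxes in the corner format of \( Loc \) in Eq. (3); a standard implementation:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as
    (x1, y1, x2, y2) corner coordinates."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```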

Table 1. Accuracy on the ordinary section
Table 2. Accuracy on the frequent-occlusion section
Table 3. Accuracy on the congested section

Tables 1, 2 and 3 show that the method proposed in this paper, with predicted positions and color features, outperforms the other methods. In the frequent-occlusion scenario, the position prediction greatly improves accuracy. In congested sections, color features help accuracy more than location prediction does. Our method does not predict the position when vehicles are crawling, because the deviation is large; however, the color feature does not change in slow-moving traffic, so in this case it significantly increases accuracy.
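The paper does not specify the color descriptor. One plausible realization, sketched here purely as an assumption, is an HSV hue-saturation histogram compared with OpenCV's correlation metric:

```python
import cv2
import numpy as np


def color_feature(patch_bgr):
    """HSV hue-saturation histogram of a vehicle patch (assumed descriptor)."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()


def color_similarity(f_prev, f_curr):
    """Correlation between two histograms, in [-1, 1]."""
    return float(cv2.compareHist(f_prev.astype(np.float32),
                                 f_curr.astype(np.float32),
                                 cv2.HISTCMP_CORREL))
```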

SORT [8] and IOU17 [9] only use the position information of the target to associate the data, without image features or position prediction. In the case of frequent congestion and occlusion, the vehicle cannot be tracked effectively because of the decrease in object detection accuracy and the occlusion of the tracking target.

5 Conclusions

Target loss happens when occlusion and other events occur in video surveillance. In this paper, we use linear regression to analyze the vehicle's historical trajectory, predict the position of the vehicle when it disappears from a video frame, and recognize the target when the vehicle appears again by the predicted position and color feature. Experiments show that the algorithm remains effective under occlusion and congestion.

This method only uses the color feature as the extracted image feature. Although the computation is light, deviation in the color similarity of targets exists. In future work, we will investigate shallow CNNs to extract image features and enhance performance in terms of efficiency and differentiation.