1 Introduction

Target tracking has become a popular research topic in the field of computer vision because of its wide application and great potential in areas such as intelligent surveillance, autonomous driving, human–computer interaction, and intelligent transportation. Before 2010, target tracking was mostly performed with classical algorithms such as particle filtering [10], Kalman filtering [32], mean shift [35], and the optical flow method. In 2010, Bolme et al. [3] applied the correlation filtering method to tracking; later, methods such as KCF [16], BACF [19], SRDCF [7], DSST [8], CACF [24], and Siamese-network-based trackers [2] were employed. In 2016, Bertinetto et al. [2] proposed a tracking method that combines a Siamese network in deep learning with correlation filtering and achieved great success. Since then, most deep-learning-based target tracking methods [4, 5, 17–20, 22, 29, 30, 36] have built on this approach. In 2018, Li et al. [21] proposed the SiamRPN method, which combines SiamFC with an RPN and abandons the traditional multiscale detection scheme. Wang et al. [29] proposed the SiamMask method, which adds a segmentation branch to the SiamRPN network so that the SiamFC and RPN components can be used within the same framework; tracking targets based on the segmentation results significantly improves accuracy relative to the SiamRPN method.

Although these methods have achieved excellent results, they focus more on the quality of the target's features and neglect the construction of motion models for target tracking. Even though good features can readily improve the performance of a tracker, target tracking faces problems such as occlusion, lighting changes, scale changes, deformation, and motion blur, and no method is available that extracts good features in every scenario. In the face of heavy occlusion, it is difficult for detection- or segmentation-based methods to extract sufficient features because the target is not visible in the field of view. In contrast, Kalman filtering can accurately predict the state of a target, even in the absence of sufficient target features, by learning from the target's states in past frames.

The major contributions of this study are as follows. Firstly, we propose the use of Kalman filtering to build motion models in combination with the SiamMask method to address the problem of missing tracked targets in complex environments resulting from the lack of accurate information on segmented target objects. Secondly, an elliptical fitting strategy is used to evaluate the angle and size of the rotating bounding boxes, and an attention mechanism is used to focus the model more on the contribution of the target subject area and to reduce the influence of the background.

The strengths of the proposed system are as follows. Firstly, we refine the results of the tracking model by combining it with the segmentation model, which allows the method to maintain high accuracy and robustness in a variety of complex environments. Secondly, by incorporating the Kalman filter, we propose a spatiotemporal motion model that effectively alleviates the negative impact of occlusion; as a result, the target can still be tracked for a short period even when sufficient effective appearance features cannot be extracted. Thirdly, we use an ellipse-fitting strategy to refine the final bounding box, which substantially improves the accuracy of the algorithm at minimal computational cost.

Section 1 focuses on a brief introduction to our approach. Section 2 introduces related work, and Section 3 describes our approach in detail, including the main structure and core modules of the algorithm. Section 4 compares our approach to other popular algorithms on two datasets, VOT2016 and VOT2018; the strengths and weaknesses of our method are analyzed, and we propose future directions to address the weaknesses. Finally, the full text is reviewed and a reasonable conclusion is provided in Section 5.

2 Related works

In this section, we briefly review recent research progress on Siamese networks in the field of target tracking. Bertinetto et al. [2] proposed the SiamFC method, which combined a Siamese network with correlation filtering for the first time and successfully applied it to target tracking. However, SiamFC adapts poorly to the environment, cannot handle changes in scale, and its accuracy and precision cannot meet the requirements of tracking in complex circumstances. SiamRPN was proposed by Li et al. [21]. This method introduces an RPN: by pre-setting multiple anchors, the position and size of the target in the current frame are determined through pre-trained classification and regression branches. Mask_RCNN [15] adds a branch to Faster RCNN [27] to segment the target instance while performing detection. SiamMask [29] follows the idea of Mask_RCNN and adds a segmentation branch to SiamRPN, maps the segmentation mask back to the original image, and uses the segmented object as the final tracking result to achieve real-time target tracking. The SiamMask method greatly improves tracking accuracy. However, because only the influence of positive samples on the tracking results is considered, SiamMask often incorrectly segments background regions with high similarity to the target as the target when intraclass interference or severe occlusion occurs, resulting in inaccurate tracking or even target loss.

The above methods all rely on network parameters trained entirely offline and employ almost no online learning strategy. However, the uncertainty of the tracked object and the complex, changeable tracking scene make it difficult for a pre-trained network to fully represent the changing target and the influence of the background in every video frame. A reasonable online learning strategy is therefore necessary.

Current popular methods mainly emphasize tracking with networks trained offline. Zhang et al. [36] proposed relying on spatiotemporal context information, modeling the temporal and spatial information of the tracking target through a Bayesian framework to obtain the correlation between the target and its surrounding features. The Kalman filter is based on a state transition equation and an observation equation; by combining these two Gaussian distributions, it obtains an optimal estimate and is widely used for linear filtering and prediction problems.

In this study, Kalman filtering is used to construct a motion model and to predict the state of the target when there are obvious deviations and errors in the tracking. Experiments demonstrate that this method is superior to the SiamMask algorithm when faced with occlusion and intraclass interference.

3 Our method

In this section, we describe our approach in detail. We divide the tracking system into three modules: a prediction module, a segmentation module, and a correction module. The prediction module efficiently predicts the state of the object in frames where it is heavily occluded or subject to intraclass interference. The segmentation module uses a Siamese network with a segmentation branch to efficiently segment the target object in each frame. The correction module uses an ellipse-fitting strategy to correct the final bounding box derived from the segmentation results. The main structure of the algorithm is shown in Fig. 1.

Fig. 1 Algorithm structure diagram

3.1 Prediction module

In target tracking, most existing methods focus only on how to extract high-quality features, while ignoring the spatiotemporal continuity of the tracked target. SiamMask [29] determines the final state of the target based on the segmentation results, but in several experiments we found that the segmentation branch easily confuses similar background objects with the tracked target when intraclass interference occurs. Kalman filtering does not depend on the quality of the extracted features but rather on the movement trend of the target over the spatiotemporal sequence. Given a set of video observations Yt, the observation can be expressed linearly in terms of the state variable Zt as

$$ {Y}_t={H}_t{Z}_t+{n}_t,\kern0.5em t=1,2,\dots, N, $$
(1)

where Ht is the observation matrix, nt is the observation noise with covariance matrix Rt, and Zt represents the state of the target at time t. The transfer of the target's state can be represented by the linear state transition equation

$$ {Z}_t={\varPhi}_{t,t-1}{Z}_{t-1}+{w}_t,\kern0.5em t=1,2,\dots, N, $$
(2)

where Φt, t − 1 is the state transition matrix and wt is the error of the state transition model, with covariance matrix Qt.

The Kalman filter operates in two phases, prediction and update, which comprise the following steps: state prediction, error covariance prediction, Kalman gain calculation, state update, and error covariance update. The state prediction equation is

$$ {\hat{Z}}_t^{-}={\Phi}_{t,t-1}{\hat{Z}}_{t-1}. $$
(3)

The covariance prediction equation is

$$ {P}_t^{-}={\Phi}_{t,t-1}{P}_{t-1}{\Phi}_{t,t-1}^T+{Q}_t. $$
(4)

The Kalman gain equation is

$$ {K}_t={P}_t^{-}{H}_t^T{\left[{H}_t{P}_t^{-}{H}_t^T+{R}_t\right]}^{-1}. $$
(5)

The state update equation is

$$ {\hat{Z}}_t={\hat{Z}}_t^{-}+{K}_t\left[{Y}_t-{H}_t{\hat{Z}}_t^{-}\right]. $$
(6)

The covariance update equation is

$$ {P}_t={P_t}^{-}-{K}_t{H}_t{P_t}^{-}. $$
(7)
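
The prediction and update steps above can be written compactly; the following is a minimal NumPy sketch of Eqs. (3)–(7), with function names of our own choosing rather than taken from any released implementation.

```python
import numpy as np

def kalman_predict(Z, P, Phi, Q):
    """Prediction step, Eqs. (3) and (4): prior state and prior covariance."""
    Z_prior = Phi @ Z
    P_prior = Phi @ P @ Phi.T + Q
    return Z_prior, P_prior

def kalman_update(Z_prior, P_prior, Y, H, R):
    """Update step, Eqs. (5)-(7): Kalman gain, posterior state, posterior covariance."""
    S = H @ P_prior @ H.T + R                      # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)           # Kalman gain, Eq. (5)
    Z_post = Z_prior + K @ (Y - H @ Z_prior)       # state update, Eq. (6)
    P_post = P_prior - K @ H @ P_prior             # covariance update, Eq. (7)
    return Z_post, P_post
```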

In this study, the center point of the target is modeled as a characteristic point X = [x, y]T undergoing uniformly accelerated motion, yielding the quadratic polynomial motion model

$$ {\displaystyle \begin{array}{c}{X}_t={X}_{t-1}+{v}_{t-1}\varDelta t+\frac{1}{2}{a}_{t-1}{\left(\varDelta t\right)}^2\\ {}{v}_t={v}_{t-1}+{a}_{t-1}\varDelta t\\ {}{a}_t={a}_{t-1}\end{array}}\Big\},{Z}_t=\left[\begin{array}{c}{X}_t\\ {}{v}_t\\ {}{a}_t\end{array}\right]. $$
(8)

The state transition matrix Φt, t − 1 and the observation matrix Ht are

$$ {\varPhi}_{t,t-1}=\left[\begin{array}{ccc}{I}_2& {I}_2\varDelta t& \frac{1}{2}{I}_2{\left(\varDelta t\right)}^2\\ {}{0}_2& {I}_2& {I}_2\varDelta t\\ {}{0}_2& {0}_2& {I}_2\end{array}\right],{H}_t=\left[{I}_2\kern0.5em {0}_2\kern0.5em {0}_2\right], $$
(9)

where I2 represents the 2 × 2 identity matrix and 02 represents the 2 × 2 zero matrix. Based on Eqs. (1) and (2), the Kalman filter can be used to accurately predict the state of the target. The final state St of the target in this study is determined by

$$ {S}_t=\begin{cases} {M}_t, & dist\le \sigma\ \text{and}\ score\ge \eta, \\ {Z}_t, & dist>\sigma\ \text{or}\ score<\eta, \end{cases} $$
(10)

where Mt is the target state obtained from the segmentation branch, Zt is the target state predicted by Kalman filtering, dist is the Euclidean distance between the target center in the previous state St − 1 and Mt, and score is the score of the 1 × 1 × 256 feature, which represents the similarity between the target and the candidate sample; the more similar the candidate and the template, the higher the score. Eq. (10) selects the more reasonable target state between Mt and Zt. If the target is severely occluded or out of view, all score values fall below η. However, if similar objects are present in the candidate area, the actual state of the target cannot be distinguished by the score value alone, so we assume by default that the target does not move over a large range between two consecutive frames. The rationale for this assumption is that the target state Mt obtained by segmentation may deviate substantially, and similar objects in the candidate area may be mistakenly identified as the tracking target. Through many experiments, we found that, when score < η, the video image usually shows the target occluded over a large area or leaving the field of view. In such a case, the original tracker still assumes that most of the target is visible in the field of view and is therefore forced to choose a background region with high similarity to the template as the target and continue tracking it. We derived suitable values for the parameters from numerous experiments and set σ = 100 and η = 0.9. At the same time, to ensure the stability of the tracker, we assume that dist remains within a certain range and that the target's trajectory does not exhibit large fluctuations; in our experiments, except for sequences with fast-moving objects, the target rarely moves a long distance between frames, and long-distance movement of the tracking result is usually caused by loss of the target. Therefore, to improve the robustness of the algorithm, we use Eq. (10) as the selection criterion.
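
Combining Eqs. (8)–(10), the sketch below builds the constant-acceleration matrices and applies the selection rule; it is a minimal illustration in Python, the helper names are our own, and only σ = 100 and η = 0.9 are taken from the values reported above.

```python
import numpy as np

def constant_acceleration_model(dt=1.0):
    """Build Phi and H of Eq. (9) for the state Z = [x, y, vx, vy, ax, ay]."""
    I2, O2 = np.eye(2), np.zeros((2, 2))
    Phi = np.block([[I2, dt * I2, 0.5 * dt**2 * I2],
                    [O2, I2,      dt * I2],
                    [O2, O2,      I2]])
    H = np.block([I2, O2, O2])
    return Phi, H

def select_state(mask_center, kf_center, prev_center, score, sigma=100.0, eta=0.9):
    """Selection rule of Eq. (10): keep the segmentation result M_t only when it is
    close to the previous center and the similarity score is high enough; otherwise
    fall back to the Kalman prediction Z_t."""
    dist = np.linalg.norm(np.asarray(mask_center) - np.asarray(prev_center))
    return mask_center if (dist <= sigma and score >= eta) else kf_center
```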

3.2 Segmentation module

We used SiamMask [29] as the segmentation module in this study. SiamMask uses an RPN [21] to compute classification scores and bounding boxes, so that each candidate window of the fully convolutional Siamese network encodes the information necessary to generate a pixel-level binary segmentation mask. Two inputs (a template T and a search region SR) pass through the same convolutional neural network fθ, and a depth-wise cross-correlation of the two feature maps is performed to obtain

$$ {g}_{\theta}\left(T, SR\right)={f}_{\theta }(T)\ast {f}_{\theta }(SR). $$
(11)
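
As an illustration of Eq. (11), the sketch below implements a depth-wise cross-correlation of the template and search-region feature maps using grouped convolution. It assumes a PyTorch setting; the function name and tensor shapes are our own choices and may differ from the original SiamMask code.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(template_feat, search_feat):
    """Correlate template features with search-region features (Eq. (11)).

    template_feat: (B, C, h, w) features f_theta(T)
    search_feat:   (B, C, H, W) features f_theta(SR), with H >= h and W >= w
    returns:       (B, C, H-h+1, W-w+1) response map g_theta(T, SR)
    """
    b, c, h, w = template_feat.shape
    # Treat each (batch, channel) pair as its own convolution group so that every
    # channel of the template correlates only with the same channel of the search area.
    search = search_feat.reshape(1, b * c, *search_feat.shape[2:])
    kernel = template_feat.reshape(b * c, 1, h, w)
    response = F.conv2d(search, kernel, groups=b * c)
    return response.reshape(b, c, *response.shape[2:])
```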

SiamMask uses a simple two-layer neural network hϕ with a learned parameter ϕ to predict a binary mask of size w × h. The predicted mask of the nth candidate window \( {g}_{\theta}^n\left(T, SR\right) \) is

$$ {m}_n={h}_{\phi}\left({g}_{\theta}^n\left(T, SR\right)\right). $$
(12)

From Fig. 1, we can see that there are three parallel branches: a classification branch, a regression branch, and a segmentation branch. The classification branch distinguishes the target from the background; it predicts a target score and a background score for each sample, and its loss function is denoted Lcls. The regression branch fine-tunes the candidate area to obtain the predicted position and bounding box size, and its loss function is denoted Lreg. The segmentation branch extracts the feature with the highest score in the feature map and decodes it to generate a binary segmentation mask; its loss function is denoted Lmask. The total loss function of the SiamMask method is therefore

$$ {L}_{3B}={\lambda}_1{L}_{\boldsymbol{mask}}+{\lambda}_2{L}_{reg}+{\lambda}_3{L}_{cls}, $$
(13)

where λ1, λ2, and λ3 are weighting parameters.
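
For reference, a minimal sketch of the mask head hϕ of Eq. (12) and the weighted three-branch loss of Eq. (13) is given below, assuming PyTorch; the channel widths, mask size, and default loss weights are placeholder assumptions, not values taken from this paper.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Two-layer head h_phi mapping a candidate-window feature to a w*h mask (Eq. (12))."""

    def __init__(self, in_channels=256, hidden=512, mask_size=63):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, mask_size * mask_size, kernel_size=1),
        )
        self.mask_size = mask_size

    def forward(self, row_feature):            # (B, C, 1, 1) candidate-window feature
        logits = self.head(row_feature)        # (B, mask_size**2, 1, 1)
        return logits.view(-1, 1, self.mask_size, self.mask_size)

def total_loss(l_mask, l_reg, l_cls, lam1=1.0, lam2=1.0, lam3=1.0):
    """Weighted three-branch loss of Eq. (13); the default weights are placeholders."""
    return lam1 * l_mask + lam2 * l_reg + lam3 * l_cls
```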

3.3 Correction module

After many experiments, we found that the segmentation results often do not perfectly separate the target from the background. The SiamMask method uses the smallest rectangular bounding box of the segmentation mask as the final result in the current frame, so even if the segmentation result contains only a small part of the background, that part can have a large impact on the final bounding box. In this study, an ellipse-fitting strategy is used to refine the rotated bounding box, so that the final result is biased toward the torso of the target, reducing the accuracy drop caused by inaccurate segmentation of small regions. An ellipse can be represented by a conic equation with the following constraint:

$$ {\displaystyle \begin{array}{c}F\left(i,j\right)=a{i}^2+ bij+c{j}^2+ di+ ej+f=0,\\ {}{b}^2-4 ac<0,\end{array}} $$
(14)

where a, b, c, d, e, and f are the coefficients of the ellipse and (i, j) is a point on the ellipse. Because the image needs to be rotated around the center of the ellipse, the following transfer matrix is used to calculate the coordinates of a transferred point in the original image:

$$ M=\left[\begin{array}{ccc}\cos \theta & \sin \theta & \left(1-\cos \theta \right){i}_{cen}-\sin \theta\, {j}_{cen}\\ {}-\sin \theta & \cos \theta & \sin \theta\, {i}_{cen}+\left(1-\cos \theta \right){j}_{cen}\end{array}\right], $$
(15)

where θ is the rotation angle and (icen, jcen) is the center point. If Maska is the set of all points in the segmentation mask, then Maskb, the point set of the segmentation mask after the transfer, is given by

$$ Mas{k}_b=M\ast \left[\begin{array}{c}i\\ {}j\\ {}1\end{array}\right],\forall \left(i,j\right)\in Mas{k}_a. $$
(16)

Here reca is the smallest axis-aligned bounding rectangle of the ellipse fitted to the target mask after rotation, and recmask is the smallest bounding rectangle of the rotated segmentation result. Their intersection recl is taken as the optimized bounding rectangle. recl is then rotated back to the original position according to the rotation angle θ, and the rotated rectangle reclθ is output as the final bounding box. Figure 2 shows the main flow of the calculations.
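
A rough sketch of this correction step (not the authors' exact implementation) is shown below, using OpenCV's fitEllipse and getRotationMatrix2D to realize Eqs. (14)–(16): fit an ellipse to the mask points, rotate the mask about the ellipse center by the ellipse angle, intersect the two axis-aligned rectangles, and rotate the result back.

```python
import cv2
import numpy as np

def refine_box(mask):
    """Refine the rotated bounding box of a binary mask with ellipse fitting.

    mask: (H, W) uint8 binary segmentation mask (non-zero = target).
    Returns a 4x2 array with the corners of the refined rotated box, or None.
    """
    points = cv2.findNonZero(mask)
    if points is None or len(points) < 5:            # fitEllipse needs >= 5 points
        return None

    # Fit the conic of Eq. (14) and read off the center, axis lengths, and angle.
    (cx, cy), (ax1, ax2), angle = cv2.fitEllipse(points)

    # Rotate the mask about the ellipse center (transfer matrix of Eq. (15));
    # the sign of the angle may need flipping depending on OpenCV's convention.
    M = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
    rotated = cv2.warpAffine(mask, M, (mask.shape[1], mask.shape[0]))
    rot_points = cv2.findNonZero(rotated)
    if rot_points is None:
        return None

    # rec_a: axis-aligned box of the rotated ellipse; rec_mask: box of the rotated mask.
    rec_a = (cx - ax1 / 2, cy - ax2 / 2, ax1, ax2)
    rec_mask = cv2.boundingRect(rot_points)

    # rec_l: intersection of rec_a and rec_mask.
    x1 = max(rec_a[0], rec_mask[0]); y1 = max(rec_a[1], rec_mask[1])
    x2 = min(rec_a[0] + rec_a[2], rec_mask[0] + rec_mask[2])
    y2 = min(rec_a[1] + rec_a[3], rec_mask[1] + rec_mask[3])
    if x2 <= x1 or y2 <= y1:
        return None

    # Rotate the corners of rec_l back by -theta to obtain the final rotated box.
    corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=np.float32)
    M_inv = cv2.getRotationMatrix2D((cx, cy), -angle, 1.0)
    return np.hstack([corners, np.ones((4, 1), dtype=np.float32)]) @ M_inv.T
```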

Fig. 2 Ellipse fitting strategy

4 Experiment

In this section, we evaluate the proposed improvements on the VOT2016 and VOT2018 datasets and compare them with a number of popular methods. The experimental results demonstrate that the proposed method achieves high accuracy and precision. To ensure a fair comparison, the SiamMask component used here has the same structure and parameters as in Wang et al. [29]. Our experiments were run on a computer with a Ryzen 7 4800H CPU, a GeForce GTX 1650Ti GPU, and 16 GB of memory, running Windows 10, with the algorithm implemented in Python.

4.1 Evaluation criteria

The evaluation indicators used in this study were the average overlap ratio, tracking length, failure rate, and robustness. The average overlap ratio is the intersection-over-union between the predicted target area and the ground-truth area; the larger this ratio, the smaller the error. The tracking length is the number of frames, from the start of tracking, for which the center-point error stays within an acceptable threshold. The failure rate is defined as follows: when the overlap ratio falls below a threshold, tracking is considered to have failed and the bounding box is reinitialized; the shorter the tracking length of each segment, the greater the failure rate. During the kth repeated run of the algorithm, the robustness on a video is calculated as

$$ {R}_s={e}^{-aM},\qquad M=\frac{F_1}{N}, $$
(17)

where M is the average number of failures, F1 is the total number of failures, N is the length of the video sequence, and a is a parameter. F(i, k) denotes the number of tracking failures recorded for the ith video during the kth run, with the tracker reinitialized five frames after each failure.
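
As a small illustration of Eq. (17), the following sketch computes the robustness score for one video from per-frame overlaps; the failure criterion and the value of the parameter a used here are assumptions for illustration only.

```python
import numpy as np

def robustness(overlaps, overlap_threshold=0.0, a=30.0):
    """Compute the robustness score R_s of Eq. (17) for one video.

    overlaps: per-frame overlap ratios between prediction and ground truth.
    A failure is counted whenever the overlap drops to the threshold or below.
    """
    overlaps = np.asarray(overlaps, dtype=float)
    failures = int(np.sum(overlaps <= overlap_threshold))   # F1: total failures
    m = failures / len(overlaps)                             # M = F1 / N
    return float(np.exp(-a * m))
```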

4.2 Experimental results

The proposed algorithm was tested and evaluated on the VOT2016 and VOT2018 datasets, and we compared the results with those of ECO [9], ECO_HC, VITAL [28], SiamMask, SiamRPN, TADT [11], and SiamCAR [13]. The videos were divided into nine categories, and the results reflect the performance of each algorithm in different scenarios. The experimental results demonstrate that the algorithm performs well when facing various challenges and that its stability is clearly stronger than that of the compared algorithms.

4.3 Analysis of experimental results

From Table 1, we can see that our method achieves good results under motion change, camera motion, and scale variation and ranks first in average video accuracy, which demonstrates that the algorithm is robust. The segmentation results can accurately segment the target in the face of motion-state changes, the Kalman filter can more accurately predict the target location and fine-tune the target state obtained from segmentation, and excellent results are achieved under scale variation. The ellipse-fitting strategy fine-tunes the segmentation results to achieve good accuracy. On the VOT2018 dataset (Table 2), the method performs well in the face of occlusion, being superior to the other comparison algorithms, but it performs worse than ECO, TADT, and VITAL on the VOT2016 dataset. These performance differences can be explained as follows: ECO uses more comprehensive features (CNN + HOG + CN) to cope with the occlusion problem of single-feature target tracking algorithms; VITAL uses a generative adversarial network that randomly generates numerous masks and retains those capturing the most robust target features, thereby augmenting the positive samples; and TADT uses pixel-level losses to guide channel selection. The accuracies of VITAL and TADT are significantly higher than that of our algorithm; however, compared with SiamMask, our algorithm still achieves an improvement of 0.05 in accuracy. From Tables 3 and 4, we can see that the proposed strategy achieves first place in terms of expected average overlap (EAO), overlap, and failure metrics, with a strong overall performance, outperforming SiamMask by almost 0.04, which indicates that the improvement to the algorithm is effective. Figures 3 and 6 show the A-R ranks and EAO metrics of the various algorithms for nine types of videos in VOT2016 and VOT2018. It can be seen that the robustness and accuracy of the algorithm are higher than those of the other algorithms in most of the challenges.

Table 1 VOT2016 accuracy comparison
Table 2 VOT2018 accuracy comparison
Table 3 Comparison of the algorithms on VOT2016: overlap, failure rate, and expected average overlap (EAO)
Table 4 Comparison of the algorithms on VOT2018: overlap, area under the curve, and expected average overlap (EAO)
Fig. 3 EAO comparison results (VOT2016)

From the expected overlap curves in Figs. 4 and 7, it can be seen that, unlike ECO and other methods, our algorithm's overlap does not decrease significantly as the number of video frames increases, because the method uses deep learning to extract features and deep features are more robust. The segmentation result is not easily affected by the target's previous motion state, and the selection strategy in Eq. (10) remains reasonable even for long video sequences, so robustness is still guaranteed across the video challenges. From the expected overlap scores in Figs. 5 and 8, we see that the algorithm is much stronger than the other algorithms in terms of average expected overlap score, indicating that our algorithm has high accuracy compared with the comparison algorithms. This strength can also be attributed to the segmentation branch and the ellipse-fitting strategy used to optimize the bounding box of the segmentation results.

Fig. 4 Expected overlap curve (VOT2016)

Fig. 5 Expected overlap score (VOT2016)

Figure 9 shows the real-time performance of this algorithm and the other algorithms over multiple video frames, in which the red rotated rectangle shows the result of our algorithm. It can be seen that the algorithm performs well across video frames with large time intervals, which demonstrates its excellent stability (Figs. 6, 7, 8 and 9).

Fig. 6 EAO comparison results (VOT2018)

Fig. 7 Expected overlap curve (VOT2018)

Fig. 8 Expected overlap score (VOT2018)

Fig. 9 Effect display diagram

4.4 Future outlook

Although the algorithm exhibited clear advantages in the experiments and remains robust in the face of various challenges, its performance is still unsatisfactory under illumination changes, as indicated in Table 1. In this situation the Kalman filter does not predict the target state accurately enough, which is why the results in this category are poor, and its contribution to the tracker's performance is limited. In the future, we will consider making full use of spatiotemporal context information, comparing information between successive frames to predict the current state of the target more accurately, and constructing a better tracker by combining this with detection or segmentation methods.

5 Conclusion

The performance of a tracker commonly degrades when it faces a heavily occluded target because effective target features cannot be extracted. In view of this, a spatiotemporal fusion approach to motion target tracking and segmentation is proposed in this study. Built on Siamese networks and a segmentation structure, the method uses a spatiotemporal motion model combined with Kalman filtering to mitigate the occlusion problem during tracking, extracting the motion characteristics of the tracked target along the time axis and building a motion model of the target over the time series. Because current target tracking methods neglect the importance of an online strategy, we propose using Kalman filtering to construct a motion model of the target and to reasonably predict its motion when the target is missing or heavily occluded for a short period. We use a segmentation network to separate the target from the background to achieve accurate tracking, and an ellipse-fitting strategy to correct the error caused by imprecise segmentation results and improve the tracker's accuracy. The experiments demonstrate that this method is feasible and achieves excellent results when compared with other algorithms. However, problems of insufficient segmentation accuracy and insufficient prediction accuracy remain. From [1, 6, 13, 23, 25, 31, 34], we can foresee that algorithms combining segmentation and tracking will become more pervasive in the future and that target tracking methods combining deep learning with traditional methods [12, 14, 18, 26, 33] have a bright future.