
1 Introduction

MOT in ordinary videos has received widespread attention, and many excellent results have been applied to various scenarios. MOT in remote sensing videos is even more critical in some applications: remote sensing satellites can easily observe large target regions, track vehicle flow, and support smart city transportation. MOT in remote sensing videos has therefore become an important research topic in remote sensing image processing. However, compared to ordinary videos, remote sensing videos pose many new challenges: 1) low discrimination between objects and the background; 2) high background noise; 3) small object areas; 4) lack of detailed features; and 5) cloud cover.

These challenges pose great difficulties for object tracking in remote sensing videos, so effective methods are required to overcome them. In this article, motivated by TraDeS [1], a method for MOT in ordinary videos, we propose a new MOT framework that can be applied to remote sensing videos for tracking tiny objects.

The main contributions of this paper are as follows:

1) We propose a multi-scale local cost volume mechanism that represents the motion offset of small objects more accurately than the baseline.

2) A head representing the direction of object motion is added to the network output heads. The detection branch and tracking branch of the output heads are connected to the un-enhanced current feature and the final enhanced feature, respectively.

Our proposed MOT method, based on MLCVNet, is designed to address these challenges in remote sensing videos. Compared to [1] on remote sensing videos, our MLCV module can match the same object between the previous and current frames more accurately while effectively reducing computation. Without introducing extra network branches with excessive computational requirements, as in DSFNet [2], the proposed method meets the real-time tracking needs of multiple objects in remote sensing videos while performing well. The detailed network architecture is discussed in Sect. 3.

2 Related Works

The concept of object tracking was proposed by N. Wax in the 1960s [3] and was applied to pedestrian tracking. Since then, the field of object tracking has received much attention from researchers, with new theories and research results constantly emerging. This article focuses on multi-object tracking, and the current research status is briefly reviewed below.

2.1 MOT in Ordinary Video

In the traditional object tracking framework, an appearance model is established to identify each object, encoding unique features that distinguish different objects and are used for subsequent association. Many MOT frameworks that have emerged in recent years are based on deep learning. They can be roughly divided into two types: tracking-by-detection (TBD) and joint detection and tracking (JDT) [4].

The appearance model of an object can be represented by different object attributes, including color, texture, gradient, motion, and optical flow, to identify the object uniquely. By extracting a class of features or joint features of the object, it is possible to distinguish the object from the background and thus distinguish different objects. Many traditional multi-object tracking frameworks fall into this category [5,6,7].

Compared to traditional methods, deep learning does not require manual feature extraction and can obtain richer feature representations, often achieving better results. DeepSort [8] improves the Sort [9] method by utilizing deep learning: in the object detection phase, a detection network detects the objects, which are then passed to a re-identification (ReID) appearance feature extraction network, followed by the tracking process. The MOTDT [10] framework fully exploits the advantages of deep neural networks to address prominent issues in TBD, such as unreliable detection and intra-class occlusion. The detection part of D&T [11] is based on the R-FCN fully convolutional network, while the tracking part incorporates correlation- and regression-based ideas from single-object tracking into the front-end detection framework to implement multi-object tracking.

In recent years, more and more research has leaned towards one-stage methods, which require only one network to accomplish object detection and appearance feature extraction simultaneously. JDE [12] proposes a network model that integrates the object detection and ReID tasks by incorporating the appearance ReID model into a one-shot detector. FairMOT [13] points out multiple imbalances in general anchor-based methods and proposes improvements; it is a tracking method based on the anchor-free feature extraction network DLA [14] that adds a ReID branch on top of the detection task. CenterTrack [15] is an improvement on CenterNet [16], adding a branch to the detection output that reflects the position movement vector of each object between two frames, thus implementing multi-object tracking in one network. TraDeS [1] proposes a new online joint detection and tracking model in which tracking cues assist the end-to-end detection task: it infers the object tracking offset from a cost volume and then uses it to propagate object features from a previous moment to improve the current frame's detection and segmentation.

2.2 MOT in Remote Sensing Video

Currently, some multi-object tracking frameworks based on remote sensing videos have been proposed. Du et al. [17] proposed a specific strategy for constructing a more robust tracker using a kernel correlation filtering (KCF) tracker and a three-frame differencing algorithm. Guo et al. [18] proposed a correlation filter Kalman filter (CFKF) tracker, which is a tracking algorithm based on a fast correlation filter (CF) for satellite video object tracking. Shao et al. [19] proposed a velocity correlation filter (VCF) algorithm to overcome the problem of insufficient brightness and color features of remote sensing video objects. Xuan et al. [20] proposed a new motion estimation (ME) algorithm based on the kernel correlation filtering (KCF) algorithm, which combines Kalman filtering and motion smoothing trajectory to reduce the boundary effects of the kernel correlation filtering algorithm.

He et al. [21] proposed a graph-based multi-task reasoning tracking framework, which models multi-object tracking as a graph feature information fusion process based on message inference. Xiao et al. [2] proposed a two-stream network that integrates object motion information and object appearance information, which the authors refer to as dynamic information and static information, respectively. It was originally used for object detection tasks in remote sensing videos, but its network can also be used for multi-object tracking.

3 Network Architecture

The overall architecture of MLCVNet consists of four main parts, as shown in Fig. 1. During training, there are three inputs: the current frame \({\mathbf{I}}^{t}\) at time t, the historical frame \({\mathbf{I}}^{t - \tau }\), and the heatmap \({\mathbf{P}}^{t - \tau }\) of the historical frame.

Fig. 1. The detailed network architecture: a DLA-34 backbone extracts three scales of feature maps \({\mathbf{f}}_{s}^{t}\) and \({\mathbf{f}}_{s}^{t - \tau }\) from the input frames \({\mathbf{I}}^{t}\) and \({\mathbf{I}}^{t - \tau }\), respectively. These feature maps are used in a correlation operation to produce the local cost volume \({\mathbf{C}}_{s}\), which is further processed using a template operation to obtain the offset matrix \({\mathbf{O}}_{s}\). The FE module extracts motion transformation features at three scales. The resulting fusion feature is combined with the current feature to obtain an enhanced feature map, which is connected to the output branches to produce the final outputs.

3.1 Multi-scale Local Cost Volume

Firstly, a DLA-34 network is used to extract multi-scale features from the images. The input size of the backbone is \(3 \times H_{i} \times W_{i}\). After passing through it, \({\mathbf{I}}^{t}\) and \({\mathbf{I}}^{t - \tau }\) yield three scales of feature maps, \({\mathbf{f}}_{s}^{t} \in {\mathbb{R}}^{{C_{s}^{f} \times H_{s}^{f} \times W_{s}^{f} }}\) and \({\mathbf{f}}_{s}^{t - \tau }\), respectively. The down-sampling ratios of the feature maps are 2, 4, and 8, denoted as s, where \(H_{s}^{f} = \frac{{H_{i} }}{s},W_{s}^{f} = \frac{{W_{i} }}{s}\).
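To make the feature shapes concrete, the sketch below uses a toy strided-convolution stand-in for the DLA-34 backbone; it is not the real network, and the channel widths are illustrative placeholders. It only shows how both input frames map to feature maps at strides 2, 4, and 8.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Stand-in for the DLA-34 backbone (not the real network): three strided
    stages returning feature maps at strides 2, 4 and 8, i.e. f_2, f_4 and f_8
    with H_s^f = H_i / s and W_s^f = W_i / s.  Channel widths are placeholders."""
    def __init__(self):
        super().__init__()
        self.stage2 = nn.Conv2d(3, 64, 3, stride=2, padding=1)
        self.stage4 = nn.Conv2d(64, 128, 3, stride=2, padding=1)
        self.stage8 = nn.Conv2d(128, 256, 3, stride=2, padding=1)

    def forward(self, x):
        f2 = self.stage2(x)   # stride 2
        f4 = self.stage4(f2)  # stride 4
        f8 = self.stage8(f4)  # stride 8
        return {2: f2, 4: f4, 8: f8}

# Both I^t and I^{t - tau} are passed through the same (shared) backbone.
backbone = ToyBackbone()
frame = torch.randn(1, 3, 512, 512)
feats = backbone(frame)
print({s: tuple(f.shape) for s, f in feats.items()})
# {2: (1, 64, 256, 256), 4: (1, 128, 128, 128), 8: (1, 256, 64, 64)}
```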

The second part is the MLCV module. The feature maps \({\mathbf{f}}_{s}^{t}\) and \({\mathbf{f}}_{s}^{t - \tau }\) at corresponding scales are first processed by a correlation operation to obtain a local cost volume \({\mathbf{C}}_{s} \in {\mathbb{R}}^{{H_{s}^{d} \times W_{s}^{d} \times H_{s}^{f} \times W_{s}^{f} }}\) between the current frame and the previous frame. Specifically, a correlation operation is performed at each pixel position of the feature map \({\mathbf{f}}_{s}^{t}\), using a correlation kernel \({\mathbf{K}}_{x,y}^{t}\) of size \(k \times k\) centered at the position \(\left( {x_{f} ,y_{f} } \right)\). A search window of size \(H_{s}^{d} \times W_{s}^{d}\) centered at the corresponding position of feature map \({\mathbf{f}}_{s}^{t - \tau }\) is then slid over and correlated, resulting in a vector \({\mathbf{C}}_{{x_{f} ,y_{f} }} \in {\mathbb{R}}^{{H_{s}^{d} \times W_{s}^{d} }}\) of length \(H_{s}^{d} \times W_{s}^{d}\), which stores the correlation values between the kernel of feature \({\mathbf{f}}_{s}^{t}\) and all kernels in the window of feature \({\mathbf{f}}_{s}^{t - \tau }\). This vector reflects the matching degree between the object in the current frame and its possible positions in the previous frame. The process is shown in Fig. 2.

Fig. 2. During the correlation operation, a kernel centered at position \((x,\,y)\) in feature map \({\mathbf{f}}^{t}\) is slid in a search window at the corresponding position in feature map \({\mathbf{f}}^{t - \tau }\). The kernel \({\mathbf{K}}_{x,y}^{t}\) correlates with every kernel \({\mathbf{K}}_{{x^{\prime } ,y^{\prime } }}^{t - \tau }\) in the search window, and each pair of kernels produces a value \({\mathbf{C}}_{{x,y,x^{\prime } ,y^{\prime } }}\).

If we ignore the difference in scales s, the correlation operation is the same at every scale. Each value \({\mathbf{C}}_{{x,y,x^{\prime } ,y^{\prime } }}\) in the vector is computed as the inner product of the two kernel vectors, as given by the following equation:

$$ {\mathbf{C}}_{{x,y,x^{\prime } ,y^{\prime } }} = {\mathbf{K}}_{x,y}^{t} \left( {\mathbf{K}}_{{x^{\prime } ,y^{\prime } }}^{t - \tau } \right)^{\top } $$
(1)

The maximum value in the vector indicates the highest matching degree between two kernels, because each kernel represents an object's partial features at the corresponding position in its frame; the highest matching value therefore means the two positions are most likely to contain the same object. The complete correlation operation is performed for all positions of the feature map \({\mathbf{f}}_{s}^{t}\), resulting in a local cost volume \({\mathbf{C}}\). The operation is expressed as:

$$ {\mathbf{C}} = {\text{Corr}} \left( {{\mathbf{f}}^{t} ,{\mathbf{f}}^{t - \tau } ,d^{\prime } ,k} \right) $$
(2)

where \(d^{\prime }\) represents the maximum displacement within the search window, \(d^{\prime } = \left\lfloor \frac{d}{2} \right\rfloor\) and \(d = 2d^{\prime } + 1\); k is the size of the correlation kernel, with a default value of 3.
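For illustration, a straightforward (unoptimized) PyTorch sketch of this local correlation is given below. It is an assumed implementation of Eqs. (1) and (2), not the authors' code; it drops the scale subscript s and assumes a square search window of size \(d \times d\).

```python
import torch
import torch.nn.functional as F

def local_cost_volume(f_t, f_prev, d_prime=3, k=3):
    """Sketch of the local correlation in Eq. (2): for every position of f^t,
    a k x k kernel is correlated with every k x k kernel inside a
    (2*d_prime + 1)^2 search window of f^{t-tau}.  Returns a tensor of shape
    (B, d*d, H, W) with d = 2*d_prime + 1, i.e. the local cost volume C."""
    B, C, H, W = f_t.shape
    d = 2 * d_prime + 1
    pad = k // 2
    # k x k kernels (patches) around every pixel of the current feature map
    patches_t = F.unfold(f_t, kernel_size=k, padding=pad)           # (B, C*k*k, H*W)

    costs = []
    for dy in range(-d_prime, d_prime + 1):
        for dx in range(-d_prime, d_prime + 1):
            # shift the previous feature map by (dy, dx), padding with zeros
            shifted = F.pad(f_prev, (d_prime, d_prime, d_prime, d_prime))
            shifted = shifted[:, :, d_prime + dy:d_prime + dy + H,
                                    d_prime + dx:d_prime + dx + W]
            patches_p = F.unfold(shifted, kernel_size=k, padding=pad)
            # inner product of corresponding kernel pairs, Eq. (1)
            costs.append((patches_t * patches_p).sum(dim=1))         # (B, H*W)
    return torch.stack(costs, dim=1).view(B, d * d, H, W)

# usage: C = local_cost_volume(f_t, f_prev, d_prime=3, k=3)  # 7 x 7 window
```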

The local cost volume C obtained by the correlation operation is a four-dimensional matrix with dimensions \(\left[ {1,H_{d} \times W_{d} ,H_{f} ,W_{f} } \right]\). It is reshaped into dimensions \(\left[ {1,H_{d} ,W_{d} ,H_{f} \times W_{f} } \right]\), and taking the maximum along the width-displacement and height-displacement dimensions yields the maximum cost volume values \({\mathbf{C}}_{H} \in {\mathbb{R}}^{{H_{f} \times W_{f} \times H_{d} }}\) and \({\mathbf{C}}_{W} \in {\mathbb{R}}^{{H_{f} \times W_{f} \times W_{d} }}\) of each pixel between feature map \({\mathbf{f}}^{t}\) and the corresponding search window of \({\mathbf{f}}^{t - \tau }\) in the height and width directions, respectively. Preset vertical and horizontal offset templates \({\mathbf{V}} \in {\mathbb{R}}^{{H_{f} \times W_{f} \times H_{d} }}\) and \({\mathbf{H}} \in {\mathbb{R}}^{{H_{f} \times W_{f} \times W_{d} }}\) are then used, whose vectors at position \((i,j)\) are denoted \({\mathbf{V}}_{i,j} \in {\mathbb{R}}^{{H_{d} }}\) and \({\mathbf{H}}_{i,j} \in {\mathbb{R}}^{{W_{d} }}\), respectively. By multiplying them with the soft-maxed cost volume values in the two directions, the position offset vector \({\mathbf{O}}_{i,j} = \left[ {{\mathbf{C}}_{i,j}^{H} {\mathbf{V}}_{i,j} ,{\mathbf{C}}_{i,j}^{W} {\mathbf{H}}_{i,j} } \right]^{{ \top }}\) with the maximum matching value from time t to \(t - \tau\) at position \((i,j)\) is obtained, and hence the tracking offset matrix \({\mathbf{O}} \in {\mathbb{R}}^{{H_{f} \times W_{f} \times 2}}\) of all pixels in the feature map.
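The template step can be sketched as follows, continuing from the `local_cost_volume` output above. The template values here are the raw candidate displacements in feature-map pixels; any additional scaling of the resulting offsets back to the input resolution is omitted and left as an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def tracking_offsets(C, d_prime=3):
    """Sketch of turning the local cost volume into the tracking offset map O.
    C has shape (B, d*d, H, W) with d = 2*d_prime + 1 and displacement ordering
    (dy-major) as produced by local_cost_volume().  The maximum cost along each
    direction is soft-maxed and multiplied by a template of candidate
    displacements, giving the expected offset per pixel."""
    B, _, H, W = C.shape
    d = 2 * d_prime + 1
    C = C.view(B, d, d, H, W)                     # (B, H_d, W_d, H, W)

    C_H = C.max(dim=2).values                     # max over width displacements  -> (B, H_d, H, W)
    C_W = C.max(dim=1).values                     # max over height displacements -> (B, W_d, H, W)

    # offset templates V (vertical) and H (horizontal): candidate displacements
    template = torch.arange(-d_prime, d_prime + 1, dtype=C.dtype, device=C.device)

    off_y = (F.softmax(C_H, dim=1) * template.view(1, d, 1, 1)).sum(dim=1)   # (B, H, W)
    off_x = (F.softmax(C_W, dim=1) * template.view(1, d, 1, 1)).sum(dim=1)   # (B, H, W)
    return torch.stack([off_y, off_x], dim=1)     # O: (B, 2, H, W)
```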

The third part is the feature enhancement (FE) module, which is a simplification of the MFW module in [1] and produces the enhanced feature \(\widetilde{{\mathbf{f}}}_{s}^{t}\). Unlike [1], our FE module works at multiple scales, yielding \(\widetilde{{\mathbf{f}}}_{s}^{t}\) at three scales, which are finally fused by the IDA module [14]. The final enhanced feature \(\widetilde{{\mathbf{f}}}^{t}\) is obtained as in Eq. 3.

$$ \widetilde{{\mathbf{f}}}^{t} = IDA\left( {\widetilde{{\mathbf{f}}}_{s}^{t} } \right),\quad s = 2,4,8 $$
(3)
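For orientation only, the sketch below is a very rough stand-in for this fusion step: it projects each enhanced scale to a common channel width, upsamples to the stride-2 resolution, and merges with 3x3 convolutions. The actual IDA module of [14] is structured differently; the channel counts and merge order here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleScaleFusion(nn.Module):
    """Rough stand-in for the IDA fusion in Eq. (3): project each scale to a
    common width, upsample everything to the stride-2 resolution and merge
    with 3x3 convolutions.  Not the original IDA module."""
    def __init__(self, in_channels=(64, 128, 256), out_channels=64):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.node = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, feats):            # feats: [f~_2, f~_4, f~_8], fine to coarse
        target = feats[0].shape[-2:]
        fused = self.proj[0](feats[0])
        for proj, f in zip(self.proj[1:], feats[1:]):
            up = F.interpolate(proj(f), size=target, mode="bilinear",
                               align_corners=False)
            fused = self.node(fused + up)
        return fused                     # final enhanced feature f~^t
```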

3.2 Motion Direction Head

The output heads include the detection and tracking branches. The detection branch is similar to that of [15], while the tracking branch contains a tracking offset head and a pos head. Each head of the detection branch applies a convolutional operation to the current feature \({\mathbf{f}}^{t}\) to predict the corresponding information. The tracking offset head and pos head are obtained by directly connecting a convolution to the final enhanced feature \(\widetilde{{\mathbf{f}}}^{t}\); they predict the position offset of the object and the motion direction of the object, respectively.

For the tracking branch, in order to learn the position offset of the object between t and \(t - \tau\) more accurately, some improvements have been made. In the heads of [1], the enhanced feature \(\widetilde{{\mathbf{f}}}^{t}\) is connected to both the detection branch and the tracking branch, which is detrimental to the learning of the tracking offset head and pos head in the tracking branch. In our framework, the enhanced feature is instead connected only to the tracking branch, ensuring that the MLCV and FE modules can better learn enhanced features for the object position offset, as sketched below.
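A minimal sketch of this head wiring follows. The head widths, the two-layer head design, and the box-head channel count are assumptions of the sketch; its purpose is only to show which feature each branch reads.

```python
import torch
import torch.nn as nn

class OutputHeads(nn.Module):
    """Sketch of the head wiring: detection heads (heatmap, box) read the
    un-enhanced current feature f^t, while the tracking heads (tracking
    offset, pos) read the final enhanced feature f~^t."""
    def __init__(self, c_in=64, num_classes=1):
        super().__init__()
        def head(out_c):
            return nn.Sequential(nn.Conv2d(c_in, 256, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(256, out_c, 1))
        self.heat = head(num_classes)     # detection: centre heatmap
        self.box = head(4)                # detection: box size / local offset
        self.tracking = head(2)           # tracking: offset between t and t - tau
        self.pos = head(8)                # tracking: 8-way motion direction

    def forward(self, f_t, f_enh):
        return {"heat": self.heat(f_t), "box": self.box(f_t),              # from f^t
                "tracking": self.tracking(f_enh), "pos": self.pos(f_enh)}  # from f~^t
```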

The pos head is used to predict the direction of object motion. Because remote sensing video is captured from a top-down perspective, the motion of an object in the video is equivalent to motion on a plane, and the enhanced feature hides the object's motion information within it. We therefore output the direction of object motion across frames as a head so that the network can learn the object's motion information accurately. Specifically, the ground plane is roughly divided into eight directions: up, down, left, right, upper right, lower right, lower left, and upper left; that is, each pixel is represented by a vector of length 8. If an object exists at that pixel position, the entry corresponding to the object's motion direction is set to 1, and the others are set to 0. As shown in Fig. 3, the object in the right-side figure has moved in the upper-right direction relative to the origin of the coordinate system compared to its position in the left figure.
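The quantization into eight directions can be sketched as follows. The exact index ordering is an assumption, except that index 1 corresponds to "upper right", which follows from the example ground-truth vector given further below.

```python
import math

# Hypothetical direction ordering; the paper only fixes index 1 = "upper right"
# via the example p^i = [0, 1, 0, 0, 0, 0, 0, 0].  The rest is an assumption.
DIRECTIONS = ["right", "upper right", "up", "upper left",
              "left", "lower left", "down", "lower right"]

def direction_one_hot(dx, dy):
    """Quantize a planar displacement (dx, dy) into one of 8 directions and
    return the length-8 one-hot vector used as the pos-head ground truth.
    Image coordinates: positive dy points downward, hence the sign flip."""
    angle = math.atan2(-dy, dx)                    # -dy so "up" is +90 degrees
    idx = int(round(angle / (math.pi / 4))) % 8    # 45-degree sectors
    one_hot = [0] * 8
    one_hot[idx] = 1
    return one_hot

# e.g. an object moving to the upper right (dx > 0, dy < 0 in image coordinates):
print(direction_one_hot(5, -5))   # -> [0, 1, 0, 0, 0, 0, 0, 0]
```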

Fig. 3. An example illustrating the ground truth of the pos head: the object in the right figure has moved in the upper-right direction relative to the origin of the coordinate system compared to its position in the left figure.

The size of the pos feature map is \(8 \times H_{o} \times W_{o}\), where \(H_{o} = H_{i} /2,W_{o} = W_{i} /2\). For a ground-truth object box \({\mathbf{b}}^{i} = \left( {x_{1}^{i} ,y_{1}^{i} ,x_{2}^{i} ,y_{2}^{i} } \right)\) in the image with center at position \(\left( {c_{x}^{i} ,c_{y}^{i} } \right)\), the vector at position \(\left( {\tilde{c}_{x}^{i} ,\tilde{c}_{y}^{i} } \right) = \left( {\left\lfloor {\frac{{c_{x}^{i} }}{2}} \right\rfloor ,\left\lfloor {\frac{{c_{y}^{i} }}{2}} \right\rfloor } \right)\) on the feature map represents the motion direction of object i. For example, in Fig. 3, the motion direction of the object is upper right, and the vector at the corresponding position in the ground-truth label of pos is \({\mathbf{p}}^{i} = [0,1,0,0,0,0,0,0]\). The loss function of the motion direction head uses the Mean Squared Error, as shown in Eq. 4.

$$ L_{pos} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\left( {\widehat{{\mathbf{p}}}^{i} - {\mathbf{p}}^{i} } \right)^{2} } $$
(4)

where N is the number of objects in the video frames.
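A sketch of the pos-head ground-truth construction and of Eq. (4) is given below. It reuses the `direction_one_hot()` helper sketched earlier, and it assumes that the loss is evaluated only at pixels containing an object centre (consistent with averaging over the N objects); the object motions `(dx, dy)` are assumed to be available from the annotations.

```python
import numpy as np
import torch

def build_pos_ground_truth(boxes, motions, H_i, W_i):
    """Sketch of the pos-head ground truth: an 8 x H_o x W_o map with
    H_o = H_i / 2 and W_o = W_i / 2.  For each object box (x1, y1, x2, y2)
    with planar motion (dx, dy), the length-8 one-hot direction vector is
    written at the down-sampled centre (floor(c_x / 2), floor(c_y / 2))."""
    H_o, W_o = H_i // 2, W_i // 2
    pos_gt = np.zeros((8, H_o, W_o), dtype=np.float32)
    for (x1, y1, x2, y2), (dx, dy) in zip(boxes, motions):
        c_x, c_y = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        cx_t, cy_t = int(c_x // 2), int(c_y // 2)          # floor(c / 2)
        pos_gt[:, cy_t, cx_t] = direction_one_hot(dx, dy)  # one-hot direction
    return pos_gt

def pos_loss(pred_pos, gt_pos):
    """Eq. (4) as a sketch: squared error between predicted and ground-truth
    direction vectors, averaged over the N object centres."""
    mask = gt_pos.sum(dim=0) > 0                           # (H_o, W_o), object centres
    sq_err = ((pred_pos - gt_pos) ** 2).sum(dim=0)         # per-pixel squared error
    N = mask.sum().clamp(min=1)
    return sq_err[mask].sum() / N
```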

The overall network loss consists of the two head branches together; the total loss function is obtained by summing the losses of the different branches with certain weights.

$$ L_{total} = w_{1} L_{heat} + w_{2} L_{box} + w_{3} L_{tr} + w_{4} L_{pos} $$
(5)

where \(w_{1} ,w_{2} ,w_{3} ,\,{\text{and}}\,w_{4}\) represent weight values for the four losses. \(L_{heat}\) and \(L_{box}\) are the detection losses as in [15]. \(L_{tr}\) is the tracking offset loss as in [1].

4 Experiments

4.1 Datasets and Implementation Details

To validate the performance of the proposed multi-object tracking framework, it is tested on a remote sensing video dataset. The dataset used in this paper is provided by DSFNet [2], which proposed a method for object detection in remote sensing images, but its dataset format also supports MOTChallenge-style multi-object tracking. The videos were captured by the Jilin-1 video satellite; the training set contains 72 videos and the test set contains 7 videos. During training, some image data augmentation techniques are applied, including flipping and color-space transformation. The experiments mainly focus on vehicle-like objects in the videos.

The specific experimental environments are shown in Table 1.

Table 1. Detail experimental environments.

To train MLCVNet, we utilized the Adam optimizer with a batch size of 8 and an initial learning rate of \(1.25 \times 10^{-4}\). The entire network was trained for 30 epochs before termination. In the MLCV module, we used a kernel size of \(3 \times 3\) when down-sampling by a factor of 2 and a search window size of \(7 \times 7\). For down-sampling by a factor of 4, the kernel size was set to \(3 \times 3\), and the search window size was set to \(5 \times 5\). When down-sampling by a factor of 8, we used a kernel size of \(1 \times 1\) and a search window size of \(3 \times 3\).
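The hyper-parameters above can be summarized in the small setup sketch below; the `model` here is only a placeholder module, not the actual MLCVNet implementation.

```python
import torch
import torch.nn as nn

# Optimizer/schedule sketch matching the reported hyper-parameters;
# the model below is a stand-in, not the actual MLCVNet.
model = nn.Conv2d(3, 64, 3, padding=1)                    # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1.25e-4)

num_epochs, batch_size = 30, 8

# Per-scale MLCV settings: stride s -> (correlation kernel k, search window d)
mlcv_config = {2: (3, 7), 4: (3, 5), 8: (1, 3)}
```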

In order to verify the effectiveness of the proposed method, several multi-object tracking frameworks with outstanding performance are selected for comparison, including CenterTrack [15], FairMOT [13], DSFNet [2], and TraDeS [1]. The experimental results are shown in Table 2.

Table 2. Experimental results of each tracking framework on the remote sensing video test set

Some frameworks that perform well in ordinary videos, such as CenterTrack, FairMOT, and TraDeS, rely mainly on object appearance features for learning, which leads to poor results in remote sensing videos because of their prominent characteristics. As noted in Sect. 2.2, DSFNet integrates object motion information and object appearance information (dynamic and static information, respectively) and was originally designed for object detection in remote sensing videos, though its network can also be used for multi-object tracking. It therefore performs well on the test set, but due to the added motion information branch and an output size consistent with the original image size, the computational cost of the entire network is very high, resulting in an FPS of only 2 during testing.

Several indicators of MLCVNet reach the best level, with the MOTA indicator reaching 51.0%, matching that of [2]. Because the output feature map of MLCVNet is down-sampled by a factor of 2 and no motion branch is introduced, the computational cost of the overall network is much smaller than that of [2], and its inference speed reaches 14 frames per second. The results are shown in Fig. 4, with one complete result image selected for each framework and the FPS indicator of each framework marked in the upper-left corner.

Fig. 4. Illustration of the FPS results for each framework run on the test set. (b) The FPS of DSFNet is only 2. (c) MLCVNet runs at 14 frames per second.

4.2 Ablation Studies

In order to prove the effectiveness of the three improvements proposed in this paper, ablation experiments are conducted on the original baseline TraDeS, TraDeS with the improved head connection (TraDeS*), the improved TraDeS with the added MLCV module (TraDeS*+MLCV), and the improved TraDeS with both the MLCV module and the pos head (MLCVNet). The results are shown in Table 3.

Table 3. Experimental results of ablation studies

From Table 3, it can be seen that both adding the MLCV module alone and adding the MLCV module together with the pos head improve accuracy over the baseline, which demonstrates the effectiveness of the MLCV module and the pos head.

The pos head is added to MLCVNet to learn the tracking information of the object more accurately, but it can also serve as an auxiliary branch for learning detection information, effectively helping the heatmap head and improving the confidence of the detected objects. To verify this idea, the confidence scores of all objects detected on the test-set videos were extracted before and after adding the pos head to the framework. The confidence values were then divided into intervals of 0.1 within the range of [0, 1), the number of detected objects within each interval was counted, and the results were plotted, as shown in Fig. 5.
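The binning itself is a simple histogram over the detection confidences; a minimal sketch is shown below, with the `confidences` array standing in for the scores collected from the test-set outputs.

```python
import numpy as np

# Sketch of the confidence-score analysis: bin all detection confidences into
# intervals of width 0.1 over [0, 1) and count detections per interval.
confidences = np.array([0.12, 0.34, 0.55, 0.61, 0.72, 0.88])   # placeholder values
bins = np.arange(0.0, 1.0 + 0.1, 0.1)                          # [0.0, 0.1, ..., 1.0]
counts, _ = np.histogram(confidences, bins=bins)
for lo, hi, c in zip(bins[:-1], bins[1:], counts):
    print(f"[{lo:.1f}, {hi:.1f}): {c}")
```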

Fig. 5. The change in the distribution of confidence scores of all detection results obtained when the two tracking frameworks are run on the test set before and after adding the pos head.

In Fig. 5, (a) shows the result of testing on DSFNet. After adding the pos branch, the number of low-confidence objects in DSFNet decreases significantly, and more detections with confidence scores above 0.5 are output by the framework. Figure 5(b) shows the result of testing on the framework proposed in this paper. Comparing MLCVNet with and without the pos branch, the number of low-confidence objects decreases slightly and the number of objects with scores above 0.3 increases. Although the overall effect is not as obvious as in (a), the distribution of confidence scores also shifts towards the high-score interval. From the analysis of Fig. 5(a) and (b), it can be concluded that the proposed pos branch effectively helps the network learn the detection information of objects.

5 Conclusion

In this paper, we propose a novel multi-object tracking framework for remote sensing videos based on the TraDeS framework. We make three improvements to address the prominent issues in remote sensing videos and achieve good performance in tracking tiny objects in terms of accuracy and real-time processing. Firstly, we improve the head connection of the framework, enabling the network to learn better tracking and detection information separately. Secondly, the MLCV module utilizes the kernel and local search window mechanism to extract the motion information of small objects in remote sensing videos more accurately. Lastly, we add a pos head to the tracking branch of the network’s output head to represent the direction of object motion, which helps the network to learn more accurate object detection information. The results of comparative and ablation experiments show that the proposed method is effective and achieves excellent performance in multi-object tracking for remote sensing videos.