1 Introduction

Object tracking is an essential task in computer vision [1,2,3]. In the past decades, deep convolutional networks have been successfully applied in many fields, especially in object tracking [4,5,6,7]. Visible light images have rich texture information and high contrast, which is useful for object tracking. However, under weak light conditions such as cloudy nights, or under low-visibility conditions such as aerosols, trackers based on visible light images may fail to function. Unlike visible light images, thermal infrared images, which mainly record the thermal radiation of objects, remain stable under weak light or low-visibility conditions [8, 9]. Infrared radiation can penetrate rain, fog, and snow. Nevertheless, infrared images lack the rich texture information of visible light images and have low contrast. Owing to the complementarity between infrared and visible light images, object tracking based on the fusion of the two has attracted more and more attention [10,11,12,13].

Existing methods for fusion tracking based on infrared and visible light images (so-called RGB-T fusion tracking) focus on supplementing thermal information to assist visible-light-based tracking [14,15,16]. They aim to compensate for the visible light image under deteriorated light conditions. RGB-T fusion can be divided into pixel-level fusion, feature-level fusion, and decision-level fusion [17].

Pixel-level fusion merges rigorously registered image pairs pixel by pixel and then performs object tracking on the merged images [17]. Pixel-level fusion is sensitive to noise and places high demands on image registration [18]. Decision-level fusion performs tracking separately on the RGB and thermal images and then aggregates the two tracking results (such as the position and size of the tracked object) to obtain the final result [19]. That is, the two images are processed individually and only their outputs are fed into the fusion algorithm. Unlike pixel-level fusion, decision-level fusion does not require obtaining pixel values at the same locations by interpolation. However, decision-level fusion pays little attention to the feature complementarity between infrared and visible light images, leading to unreliable tracking results that depend on a single modality [18].

Feature-level fusion extracts and fuses the features of RGB-T image pairs and then utilizes the fused features for tracking [20,21,22]. In this way, the tracking result does not rely too heavily on a single-modality tracker or on strictly registered image pairs [17, 18]. Although a spatial registration step is still necessary before feature extraction and fusion, feature-level fusion allows localization uncertainty, for instance due to misalignment of the image pairs, to be handled explicitly.

Fig. 1 Superimposed image of an infrared and visible light image pair. The weight of the visible light image is 0.6, and the weight of the infrared image is 0.4. The object in the yellow box, with rich texture, details, and color, is from the visible light image. The object in the red box, whose silhouette can only roughly be distinguished, is from the infrared image

To visually demonstrate the characteristics of RGB-T pairs, we linearly superimpose the infrared and visible light images, as shown in Fig. 1. On the one hand, the image pairs contain common features because the two images are shot simultaneously at the same place. On the other hand, since the two images are captured by cameras with different sensor types, the infrared image and the visible light image also have individual features: the visible light image is a high-resolution color image with rich textures and details, while the infrared image is monochrome, has low contrast, and lacks texture. Besides, because the different photosensitive chips run at different clock frequencies, even if the two cameras are registered in advance, the position of the same object in the two images is not necessarily the same. Such unregistered cases require the fusion tracker to be robust to a certain amount of misalignment during the fusion process.
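The superposition in Fig. 1 is a simple weighted blend. The following minimal sketch reproduces it, assuming the image pair is already registered and has identical dimensions; the file names are placeholders, not taken from the authors' code.

```python
# Linear superposition of a registered RGB-T pair, as in Fig. 1.
import cv2

rgb = cv2.imread("visible.png")                    # visible light image (placeholder path)
ir = cv2.imread("infrared.png")                    # infrared image, read as 3-channel BGR
overlay = cv2.addWeighted(rgb, 0.6, ir, 0.4, 0)    # 0.6 * visible + 0.4 * infrared
cv2.imwrite("superimposed.png", overlay)
```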

Based on the above analysis, a feature-level fusion network, the Siamese Infrared and Visible Light Fusion Network (SiamIVFN), is proposed to track objects in RGB-T image pairs. The feature fusion part of SiamIVFN is composed of two subnetworks: a complementary-feature-fusion network (CFFN) and a contribution-aggregation network (CAN). CFFN uses a two-stream convolution structure to extract and fuse the features of infrared and visible light images. In each layer of the two-stream convolution, coupled filters are designed to extract the common features of the image pairs. Considering that the features extracted by the shallow layers of the RGB-T image pairs are less similar across modalities than those extracted by the deep layers, we gradually increase the coupling rate with depth. The effectiveness of this coupling rate setting is demonstrated in the experiments.

Besides, because infrared and visible light images contribute differently to object tracking, further processing after feature fusion should be considered. CAN uses self-attention to adaptively calculate the contributions of infrared and visible light images under different visual conditions. Experiments show that SiamIVFN achieves the best performance among infrared and visible light fusion trackers.

Our contributions are summarized below:

  1. CFFN utilizes a two-stream convolutional network with filters of increasing coupling rates to extract the common and individual features. In addition, CFFN is robust against misalignment of the image pairs.

  2. CAN adaptively calculates the contributions of the infrared and visible light features, which makes SiamIVFN robust to various lighting conditions.

  3. SiamIVFN adopts a Siamese-framework-based fusion tracker for RGB-T fusion tracking, whose structure is straightforward. Therefore, SiamIVFN achieves real-time tracking (tracking speed of 147.6 FPS).

2 Related work

2.1 Visual object tracking

In object tracking, deep-learning-based trackers have achieved state-of-the-art performance on multiple public datasets owing to their powerful representation capabilities. At present, most deep-learning-based trackers adopt the Siamese network structure. SiamFC [4] used similarity learning to treat object tracking as a template matching problem. SiamFC is simple and fast. However, since SiamFC uses a multi-scale prediction method, it cannot handle drastic changes in object size. To solve this problem, SiamRPN [5] introduced the region proposal network (RPN). Researchers have further improved Siamese-network-based methods in data preprocessing [23], network structure [24], and multilayer feature fusion [6]. SiamFC++ [7] introduced the anchor-free concept into the Siamese network, improving both the speed and the accuracy of Siamese-framework-based trackers. Although visible light object tracking can achieve good results, it still cannot handle smoke, night, and other poor visual conditions due to the characteristics of RGB sensors.

2.2 RGB-T object tracking

Owing to the penetration ability of infrared sensors, they are often adopted to work together with RGB sensors. Therefore, RGB-T fusion tracking has recently attracted more and more attention. SiamFT [21] and DsiamFT [22] used Siamese networks to solve the RGB-T fusion tracking problem. They used different backbones to extract features from infrared and visible light image pairs, then merged the features and fed them into the tracking head. Due to the simple structure of Siamese-based methods, their tracking speed is fast. However, these fusion methods process the image pairs separately. They do not fully exploit the common features of infrared and visible light images, resulting in considerable feature redundancy and computational burden. Unlike methods that extract features with entirely separate backbones, MANet [20] and CANet [25] share a part of the convolution kernels (so-called coupled filters) to extract the common features of infrared and visible light images. However, when designing the convolution kernels at different depths, they do not fully consider the characteristics of the features extracted by layers of different depths. In this paper, we use different coupling rates in different layers. In addition to common feature extraction, an attention mechanism is utilized to exploit the individual features that reflect the characteristics of the two different sensors. The experimental results of LTDA [26] and CMPP [27] showed that the attention mechanism can largely improve tracking performance.

3 Our method

This section will introduce the proposed SiamIVFN. First, we summarize the overall structure of SiamIVFN and then introduce the structures of CFFN and CAN.

3.1 The architecture

Fig. 2 Illustration of the proposed SiamIVFN framework. The complementary-feature-fusion network (CFFN) is used to extract and fuse the features of RGB-T image pairs. The contribution-aggregation network (CAN) is utilized to adaptively calculate the contribution of different features for the tracking task. The tracking head is divided into two branches: classification and regression. Please refer to Sect. 4 for more details

The SiamIVFN network consists of three parts: CFFN, CAN, and the tracking head. The structure of SiamIVFN is illustrated in Fig. 2. In the online tracking process, given infrared and visible light video sequences, the tracker localizes the object in each frame. Unlike visible light images, infrared images lack detailed information such as color and texture. It is therefore necessary to use uncoupled filters to extract the individual features of infrared and visible light images separately. Since each infrared and visible light image pair captures the same scene simultaneously, the pair contains common features, such as semantics and contours. Wang [28] argued that coupled filters can extract common features. Li [29] adopted coupled filters for depth estimation and showed their effectiveness. Inspired by these works, this paper proposes a complementary-feature-fusion network (CFFN) to extract and fuse the features of infrared and visible light images.

Besides the common features, infrared and visible light images contain individual features that may contribute differently to tracking under different light conditions. In degraded light conditions such as fog and night, infrared images contribute more than visible light images to object tracking, while under normal lighting conditions, visible light images are more suitable than infrared images for detecting and tracking an object. Most existing fusion methods regard the contributions of infrared and visible light images as equal and often directly concatenate the features extracted from the two modalities. This paper proposes a contribution-aggregation network (CAN), which adaptively calculates the contribution of different features. CAN utilizes a self-attention module [30] to adaptively calculate the contributions of infrared and visible light images according to the light conditions.

3.2 Complementary-feature-fusion network

Fig. 3 Illustration of CFFN. CFFN is a two-stream convolutional network that extracts the features of infrared and visible light images, respectively. The two-stream convolutional network is equipped with coupled filters with different coupling rates

The details of CFFN are depicted in Fig. 3. CFFN adopts a two-stream convolution structure. The lower branch is the convolutional stream for infrared images, and the upper branch is the convolutional stream for visible light images. Unlike other two-stream networks, CFFN sets up filters with different coupling rates in each convolutional layer to learn the common features between infrared and visible light images. The overlapping yellow part in the figure indicates the filters coupled between the two streams. In this way, infrared and visible light images are mutually complementary: the features extracted from the infrared image supplement the stream designed for the visible light image, and, through the partially coupled filters, the features extracted from the visible light image supplement the stream designed for the infrared image. The uncoupled filters are designed to learn the individual features. The ratio of the number of coupled filters to the number of all filters is called the coupling rate:

$$\begin{aligned} R_{i}=\frac{k_{i}}{n_{i}}(i=1,2,3,4), \end{aligned}$$
(1)

where \(R_{i}\) is the coupling rate of the ith layer, \(k_{i}\) is the number of coupled filters in the ith convolutional layer, and \(n_{i}\) is the number of all filters in the ith convolutional layer. We set the coupling rates of the convolutional layers to 0.25, 0.5, and 0.75, so that the coupling rate increases as the convolutional layers go deeper. In Sect. 5, we use a grid search to demonstrate the effectiveness of this coupling rate setting. In CFFN, the filter parameters are updated through backpropagation. In each iteration, the uncoupled filters of the infrared and visible light streams are each updated once, while the coupled filters are updated twice. Therefore, assuming that the weights in the infrared stream are updated first and then the weights in the visible light stream, the filter weights are updated as follows:

$$\begin{aligned} w_{\text {RGB}}^{(i)}= & {} \left\{ \begin{array}{c} w_{\text {RGB}_{\text {uncoupled}}}^{(i-1)}+l\frac{\partial L}{\partial w_{\text {RGB}_{\text {uncoupled}}}^{(i-1)}}\\ w_{\text {coupled}}^{(2i-1)}+l\frac{\partial L}{\partial w_{\text {coupled}}^{(2i-1)}} \end{array}\right. , \end{aligned}$$
(2)
$$\begin{aligned} w_{\text {T}}^{(i)}= & {} \left\{ \begin{array}{c} w_{\text {T}_{\text {uncoupled}}}^{(i-1)}+l\frac{\partial L}{\partial w_{\text {T}_{\text {uncoupled}}}^{(i-1)}}\\ w_{\text {coupled}}^{(2i-2)}+l\frac{\partial L}{\partial w_{\text {coupled}}^{(2i-2)}} \end{array}\right. , \end{aligned}$$
(3)

where w denotes the parameters to be updated, (i) is the iteration number, l is the learning rate, and L is the loss function. The weights of the coupled filters are updated in the visible light stream as follows:

$$\begin{aligned} w_{\text {coupled }}^{(2i-1)}=w_{\text {coupled }}^{(2i-2)}+l\frac{\partial L}{\partial w_{\text {coupled }}^{(2i-2)}}. \end{aligned}$$
(4)

The weights of the coupled filters are updated in the infrared stream as follows:

$$\begin{aligned} w_{\text {coupled }}^{(2i-2)}=w_{\text {coupled }}^{(2i-3)}+l\frac{\partial L}{\partial w_{\text {coupled }}^{(2i-3)}}. \end{aligned}$$
(5)

In summary, a two-stream convolutional structure is designed in CFFN. Besides the individual features, the two-stream convolutional blocks can extract the common features through the coupled filters.
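To make the structure concrete, the following is a minimal PyTorch sketch of one partially coupled convolutional layer, assuming both streams have the same number of input and output channels; the class and attribute names are illustrative, not the authors' implementation.

```python
# One partially coupled convolutional layer of a CFFN-style two-stream network.
import torch
import torch.nn as nn

class PartiallyCoupledConv(nn.Module):
    def __init__(self, in_ch, out_ch, coupling_rate, kernel_size=3, padding=1):
        super().__init__()
        k = int(out_ch * coupling_rate)           # number of coupled filters k_i = R_i * n_i
        # Coupled filters: one weight tensor shared by both streams,
        # so it extracts the common features of the RGB-T pair.
        self.coupled = nn.Conv2d(in_ch, k, kernel_size, padding=padding)
        # Uncoupled filters: separate weights per stream for the individual features.
        self.rgb_only = nn.Conv2d(in_ch, out_ch - k, kernel_size, padding=padding)
        self.t_only = nn.Conv2d(in_ch, out_ch - k, kernel_size, padding=padding)

    def forward(self, x_rgb, x_t):
        f_rgb = torch.cat([self.coupled(x_rgb), self.rgb_only(x_rgb)], dim=1)
        f_t = torch.cat([self.coupled(x_t), self.t_only(x_t)], dim=1)
        return f_rgb, f_t

# Coupling rates increase with depth, e.g. 0.25, 0.5, 0.75 as described above.
```

Because the coupled weight tensor is shared between the two forward passes, automatic differentiation accumulates gradients from both streams, which plays a role similar to the twice-per-iteration update of the coupled filters in Eqs. (2)-(5).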

3.3 Contribution-aggregation network

Fig. 4 Illustration of CAN. CAN first concatenates the features from the RGB and infrared streams. Then, the concatenated features are compressed into a channel-wise vector and fed to two fully connected layers. The output is multiplied by the original features and finally added to the original features

After extracting the features of infrared and visible light images (typically with separate backbones), most existing fusion trackers directly concatenate the features and send them to the tracking head. However, different features contribute differently to object tracking, especially under varying light conditions. Inspired by SENet [31], this paper proposes CAN to adaptively calculate the contributions of the features, as shown in Fig. 4. The difference between CAN and SENet is that CAN adds a concatenation step. CAN first concatenates the features from the RGB and infrared streams. Then, CAN applies global average pooling to each channel to obtain a global feature \(g_{c}\):

$$\begin{aligned} g_{c}=\frac{1}{H\times W}\sum _{i=1}^{H}\sum _{j=1}^{W}x_{c}(i,j), \end{aligned}$$
(6)

where H and W are the height and width of the original feature \(x_{c}\), respectively. The global feature then passes through two fully connected layers to improve the generalization ability of CAN:

$$\begin{aligned} h_{c}=\beta \left( \alpha \left( g_{c}\right) \right) , \end{aligned}$$
(7)

where \(\alpha (\cdot )\) and \(\beta (\cdot )\) are two different fully connected layers. The learned feature vector \(h_{c}\) is multiplied by \(x_{c}\):

$$\begin{aligned} y_{c}=h_{c}\cdot x_{c}. \end{aligned}$$
(8)

Finally, the obtained feature \(y_{c}\) is added to the original feature to calculate the output \(z_{c}\) of CAN:

$$\begin{aligned} z_{c}=y_{c}+x_{c}. \end{aligned}$$
(9)

The whole procedure of CAN can be viewed as learning a weight coefficient for each channel through self-attention, so that the network, trained end-to-end, pays more attention to the channels critical for object tracking.
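The following is a minimal PyTorch sketch of CAN following Eqs. (6)-(9), assuming the concatenated feature has C channels; the reduction ratio of 16 and the ReLU between the two fully connected layers are assumptions borrowed from the SENet design, not details given in this paper.

```python
# Contribution-aggregation network (CAN) sketch: SE-style channel reweighting
# applied to the concatenated RGB and thermal features.
import torch
import torch.nn as nn

class CAN(nn.Module):
    def __init__(self, channels, reduction=16):        # reduction ratio is an assumption
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # Eq. (6): global average pooling
        self.fc = nn.Sequential(                        # Eq. (7): two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),                      # intermediate activation (assumption)
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f_rgb, f_t):
        x = torch.cat([f_rgb, f_t], dim=1)              # concatenate RGB and thermal features
        b, c, _, _ = x.shape
        h = self.fc(self.pool(x).view(b, c))            # channel-wise contribution vector h_c
        y = x * h.view(b, c, 1, 1)                      # Eq. (8): reweight each channel
        return y + x                                    # Eq. (9): residual connection
```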

4 Implementation details

This section introduces the training and online tracking processes. The tracking head is built based on SiamFC++ [7]. We train and test SiamIVFN on the PyTorch platform with an Intel i7-10700K CPU and an NVIDIA TITAN RTX GPU.

4.1 Training procedure

4.1.1 Pre-training

We use the GOT10K [32] and LaSOT [33] datasets to pre-train our network end-to-end. Since GOT10K and LaSOT are RGB-only datasets, they do not contain infrared images. We therefore generate grayscale images from the visible light images to train the coupled and uncoupled filters. The optimizer is stochastic gradient descent with momentum. The momentum is set to 0.9, and the weight decay is set to 0.0001. The learning rate follows a cosine decay strategy: the initial learning rate is 0.08, and the final learning rate is 1e-6.
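A sketch of these optimizer settings in PyTorch is given below; `model`, `train_one_epoch`, and the number of pre-training epochs are placeholders, since they are not specified above.

```python
# Pre-training optimizer: SGD with momentum 0.9, weight decay 1e-4,
# cosine learning-rate decay from 0.08 down to 1e-6.
import torch

num_epochs = 20  # placeholder: the number of pre-training epochs is not given in the text
optimizer = torch.optim.SGD(model.parameters(), lr=0.08,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs, eta_min=1e-6)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # placeholder training loop
    scheduler.step()
```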

4.1.2 Training

Based on the pre-trained network, we train the entire network on the RGB-T dataset. In the first ten epochs, CFFN is frozen to train CAN and the tracking head. In the second ten epochs, we unfreeze the uncoupled filters for infrared images in CFFN. In the third ten epochs, we unfreeze the coupled filters in CFFN. After the 40th epoch, we unfreeze the whole CFFN for training. Such gradual training accelerates the convergence of the network. To improve the discriminative ability of the network, we set the maximum frame index gap of a sample pair to 1000 and the ratio of the number of positive sample pairs to the number of negative sample pairs to 0.5. For optimization, we use Adam. The learning rate again follows cosine decay: the initial learning rate is 8e-5, and the final learning rate is 1e-6.
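The staged unfreezing schedule can be sketched as below, assuming the CFFN module exposes its thermal uncoupled filters and its coupled filters under the illustrative attribute names `t_only` and `coupled`; the real module layout may differ.

```python
# Staged unfreezing of CFFN during RGB-T fine-tuning (epochs as described above).
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def configure_cffn(model, epoch):
    if epoch < 10:                                # epochs 1-10: CFFN frozen, train CAN and head
        set_trainable(model.cffn, False)
    elif epoch < 20:                              # epochs 11-20: unfreeze thermal uncoupled filters
        set_trainable(model.cffn.t_only, True)
    elif epoch < 30:                              # epochs 21-30: also unfreeze the coupled filters
        set_trainable(model.cffn.coupled, True)
    elif epoch >= 40:                             # after the 40th epoch: unfreeze the whole CFFN
        set_trainable(model.cffn, True)
```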

4.2 Online tracking

In the online tracking process, the template RGB-T image pair and the RGB-T image pair to be searched are fed to CFFN. Then, CAN obtains the features of the template and the search area. After the two features are cross-correlated, a score map is computed for classification. According to Xu [7], directly using the score map for boundary selection can cause performance degradation. In this paper, we therefore adopt a quality estimation branch in addition to the classification branch. The classification branch uses the focal loss [34], and the quality estimation branch uses the center loss [35]. A \(1\times 1\) convolution weights the classification score and the quality estimation score to obtain the overall classification score. In the regression branch, to avoid manual interventions such as setting anchor points and thresholds, we adopt the anchor-free idea and directly predict the distances to the four sides of the bounding box from the corresponding position \((x,\ y)\). The regression branch uses the IOU loss [7].
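The score computation can be sketched as follows: the template features act as a correlation kernel over the search-area features, and a \(1\times 1\) convolution fuses the classification and quality scores. This is an illustrative reading of the head, not the authors' exact implementation; single-channel score maps and batch size 1 are assumptions.

```python
# Cross-correlation of template and search features, plus 1x1 score fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_correlation(search_feat, template_feat):
    # search_feat: (1, C, Hs, Ws); template_feat: (1, C, Ht, Wt).
    # The template features are used as a convolution kernel over the search features.
    return F.conv2d(search_feat, template_feat)

score_fusion = nn.Conv2d(2, 1, kernel_size=1)      # learned weighting of the two scores

def overall_classification_score(cls_score, quality_score):
    # cls_score, quality_score: (B, 1, H, W) maps from the two branches.
    return score_fusion(torch.cat([cls_score, quality_score], dim=1))
```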

5 Experiment

5.1 Evaluation dataset and evaluation metrics

We compare SiamIVFN with other tracking methods on two RGB-T tracking benchmark datasets. The GTOT [36] dataset has 15.8K frames, containing 50 spatially and temporally aligned RGB-T videos and seven annotated attributes. The RGBT234 [37] dataset has 234K frames, 234 aligned RGB-T videos, and twelve annotated attributes. Due to the significant differences in the size, quality, and data distribution of GTOT and RGBT234, we split each dataset into training and test sets separately. We divide GTOT into five parts, each containing ten videos. When experimenting on GTOT, we use four parts for training and one for testing, and conduct five separate experiments so that all GTOT videos are tested. When experimenting on RGBT234, we divide the dataset into nine parts, each containing 26 videos; eight parts are used for training and the remaining part for testing, and nine experiments are performed.

The precision rate (PR) and the success rate (SR) under one-pass evaluation (OPE) are used as evaluation metrics. PR is the percentage of frames whose distance between the output position and the ground truth position is within a threshold. We set the thresholds for the GTOT and RGBT234 datasets to 5 and 20 pixels, respectively. SR is the proportion of frames whose overlap ratio between the output bounding box and the ground truth bounding box is larger than a threshold. We use the area under the curve (AUC) to calculate the SR score.
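The two metrics can be computed as in the sketch below, assuming the per-frame center errors (in pixels) and overlap ratios (IoU) have already been extracted from the tracking results; the function names and the 21-point threshold grid for the success curve are assumptions.

```python
# PR at a fixed center-error threshold and SR as the area under the success curve.
import numpy as np

def precision_rate(center_errors, threshold):
    # Fraction of frames whose center distance to the ground truth is within the threshold
    # (5 pixels for GTOT, 20 pixels for RGBT234).
    return np.mean(np.asarray(center_errors) <= threshold)

def success_rate_auc(ious, thresholds=np.linspace(0, 1, 21)):
    # Success curve: fraction of frames whose IoU exceeds each overlap threshold;
    # the SR score is the area under this curve (mean over thresholds).
    curve = [np.mean(np.asarray(ious) > t) for t in thresholds]
    return np.mean(curve)
```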

5.2 Comparison with state-of-the-art trackers

Fig. 5 Overall performance compared with state-of-the-art trackers on RGBT234 (a) and GTOT (b)

We evaluate SiamIVFN on the GTOT and RGBT234 benchmarks and compare it with state-of-the-art RGB trackers (KCF [38], ECO [39], C-COT [40], MDNet [41], and SiameseFC [4]) and state-of-the-art fusion trackers (SGT [42], MANet [20], MACNet [43], DAPNet [44], and DAFNet [45]). The overall tracking performances are shown in Fig. 5. SiamIVFN outperforms the other trackers. Specifically, on the RGBT234 dataset, the PR/SR scores of SiamIVFN reach 81.1\(\%\)/63.2\(\%\), 1.5\(\%\)/8.8\(\%\) higher than the second-best method. On the GTOT dataset, the PR/SR scores of SiamIVFN reach 91.5\(\%\)/79.3\(\%\), 2.1\(\%\)/6.9\(\%\) higher than the second-best method. The experimental results demonstrate the effectiveness of the proposed SiamIVFN.

Table 1 RGBT234 dataset PR/SR scores based on attributes

To further analyze the performance of SiamIVFN, we separately calculate the PR/SR scores for each attribute of the RGBT234 dataset. The results are reported in Table 1. SiamIVFN achieves the highest scores on almost all attributes.

Besides the improvement in precision rate, the success rate of the proposed SiamIVFN is much higher than that of any other tracker (8.8\(\%\) higher than the second-best), especially under the challenges of low illumination (LI), low resolution (LR), background clutter (BC), partial occlusion (PO), and heavy occlusion (HO). This indicates that the proposed subnetworks CFFN and CAN can adaptively extract and fuse the features needed for successful object tracking. The two-stream convolutional structure and the channel-wise aggregation are simple and effective for RGB-T tracking tasks.

In the case of low illumination (LI), relying only on visible light for tracking leads to poor results. Since SiamIVFN integrates visible light and infrared images and uses the infrared information to supplement tracking, its success rate is 7.9\(\%\) higher than the second-best (DAFNet). In the case of background clutter (BC), the background of the infrared image is simple, so SiamIVFN exploits the individual features of the infrared image, and its success rate is 8.7\(\%\) higher than the second-best (DAFNet). In the cases of partial occlusion (PO) and heavy occlusion (HO), SiamIVFN extracts the common features and copes with a certain amount of misalignment, thereby increasing tracking robustness. For PO and HO, the success rate of SiamIVFN increases by 6.5\(\%\) and 12.8\(\%\) over the second-best (DAFNet for PO, MANet for HO).

5.3 Ablation study

Fig. 6 Comparison of visible light, infrared, and fusion tracking (a), and ablation experiment (b) on RGBT234

In this subsection, we compare the tracking performances of SiamIVFN (RGB), SiamIVFN (T), and SiamIVFN. SiamIVFN (RGB) and SiamIVFN (T) denote SiamIVFN relying solely on visible light images and solely on infrared images for tracking, respectively: in SiamIVFN (RGB), the infrared input is replaced by the visible light image, and in SiamIVFN (T), the visible light input is replaced by the infrared image. The tracking performance is shown in Fig. 6a. The results show that fusion tracking with SiamIVFN is significantly better than tracking relying solely on infrared images (by 12.3\(\%\)/12.6\(\%\)) or solely on visible light images (by 9.0\(\%\)/14.2\(\%\)).

To show the effect of the two subnetworks, CFFN and CAN, we remove each of them from SiamIVFN [denoted by SiamIVFN (No-CFFN) and SiamIVFN (No-CAN), respectively]. Comparative experiments are performed on the RGBT234 dataset with SiamFC++(RGBT) as the baseline. In SiamFC++(RGBT), the infrared image is used directly as a fourth channel concatenated to the RGB image, which is then tracked by SiamFC++. The results in Fig. 6b show that:

  1. Comparing SiamIVFN with SiamIVFN (No-CAN), the PR/SR score with CAN improves by 1.6\(\%\)/3.0\(\%\).

  2. Comparing SiamIVFN with SiamIVFN (No-CFFN), the PR/SR score with CFFN improves by 9.3\(\%\)/8.7\(\%\).

The coupling rates of the different layers in CFFN are hyperparameters of SiamIVFN. We arrange the coupling rates 0.25, 0.5, and 0.75 over the layers in different orders and compare the resulting networks. The tracking performance on RGBT234 is shown in Table 2. From Table 2, we find that the larger the coupling rate in the deep layers, the better the tracking performance. When the coupling rates are 0.25, 0.5, and 0.75, the tracker obtains the best performance. The features extracted by the shallow layers are individual features such as color and texture; these features differ considerably between infrared and visible light images, so a small coupling rate is appropriate. On the contrary, the common features, such as contours, extracted by the deep layers are relatively similar between infrared and visible light images, so a larger coupling rate is appropriate.

Table 2 Comparison of different coupling rates on RGBT234
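The grid search over coupling-rate arrangements can be sketched as below, assuming a helper `train_and_evaluate` that trains a tracker with the given per-layer coupling rates and returns its (PR, SR) on the held-out RGBT234 split; this helper is a placeholder, not part of any released code.

```python
# Grid search over arrangements of the coupling rates 0.25, 0.5, 0.75.
from itertools import permutations

results = {}
for rates in permutations([0.25, 0.5, 0.75]):
    results[rates] = train_and_evaluate(coupling_rates=rates)   # placeholder, returns (PR, SR)
best_rates = max(results, key=lambda r: results[r][1])          # select the arrangement by SR
```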

5.4 Qualitative performances

Fig. 7 Qualitative analysis of SiamIVFN and other trackers

To visually show the tracking performance of SiamIVFN, we select four sequences for comparison. Figure 7 shows the bounding boxes of SiamIVFN and other trackers (MANet, C-COT, SiamFC, and SiamFC++). To show the bounding boxes in one image, we linearly superimpose the infrared and visible light images. The red and yellow boxes mark the ground truth positions of the target as initially given in the infrared and visible light images, respectively.

The image pairs in the second and third columns are selected from nightthreepeople and woman6, whose backgrounds are complex. A complex background easily interferes with the classification score of the object, making it impossible to distinguish the foreground from the background correctly. The infrared image background, by contrast, is simple and easy to distinguish. CFFN extracts the individual features of the infrared and visible light images, improving the stability of the tracker in complex backgrounds through the infrared stream.

In manwithbasketball and twoperson (the first and fourth columns), a false object occludes the real object. When the false object passes by the real object and misalignment occurs, the infrared part of the real object falls within the visible light part of the false object, so the classification score of the false object becomes higher than that of the real object, causing the tracker to make an error. Since CFFN performs feature-level fusion, it can cope with slight misalignment.

Fig. 8 Visualization of the contribution vectors of CAN. The horizontal axis represents the 512 contribution values, and the vertical axis represents the frame number in the video sequence. Colors from cold to warm represent values from -1 to 1

To illustrate the effectiveness of CAN, we separately select 200 frames from the day and night sequences of the RGBT dataset and visualize the contribution vectors of some frames in Fig. 8. The first 256 contribution values are calculated from the visible light features, and the 257th-512th values are calculated from the infrared features. In the nighttime sequence beginhand, the infrared features have larger contribution weights (warm colors), whereas in the daytime sequence car, the contribution weights of the visible light features are relatively large. This means that CAN pays more attention to the features beneficial to the tracking task.
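A heatmap of this kind can be produced as in the sketch below, assuming the per-frame 512-dimensional contribution vectors have been stacked into an array of shape (num_frames, 512); the file name and the choice of colormap are placeholders.

```python
# Heatmap of per-frame contribution vectors (cf. Fig. 8).
import matplotlib.pyplot as plt
import numpy as np

contrib = np.load("contribution_vectors.npy")      # (num_frames, 512), values in [-1, 1]
plt.imshow(contrib, aspect="auto", cmap="coolwarm", vmin=-1, vmax=1)
plt.xlabel("channel (1-256: visible light, 257-512: infrared)")
plt.ylabel("frame index")
plt.colorbar()
plt.show()
```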

5.5 Efficiency analysis

Fig. 9 Speed comparison of various tracking methods

We compare the efficiency of SiamIVFN with that of other fusion tracking methods in Fig. 9. The speed of the proposed SiamIVFN greatly exceeds that of the other fusion methods: SiamIVFN reaches 147.6 FPS, 124.6 FPS faster than the second-best fusion tracker, DAFNet. In the design of SiamIVFN, we give priority to speed and adopt a Siamese-based structure for the tracking head. Besides, both CFFN and CAN are more concise than the backbones of other fusion tracking methods.

Based on all the experiments performed in this section, we conclude that:

  1. Compared with the visible light tracking methods (KCF, ECO, C-COT, MDNet, and SiameseFC) and the fusion tracking methods (MANet, MACNet, SGT, DAPNet, and DAFNet), SiamIVFN achieves the best PR/SR scores and the fastest tracking speed.

  2. The fusion method performs better than methods based on single-modality images, which shows the advantage of fusion.

  3. With the two-stream structure and coupled filters used in CFFN, SiamIVFN can separately extract individual features and common features.

  4. With CAN, which adaptively calculates the contributions of the infrared and visible light features, SiamIVFN is robust to various lighting conditions.

6 Conclusion

A novel RGB-T-image-based tracking method, called SiamIVFN, is proposed in this paper. SiamIVFN adaptively fuses the complementary information of infrared and visible light images to address object tracking under various light conditions. SiamIVFN mainly contains two subnetworks, CFFN and CAN. Owing to its two-stream convolutional structure, CFFN can extract both common and individual features from infrared and visible light image pairs. Through the coupled filters, CFFN treats infrared and visible light images as complements of each other, and the common features can be learned without additional computation. CFFN is a feature-level fusion network that can handle situations where the visible light and infrared images are not rigorously aligned. CAN is designed to adaptively compute the contributions of different features under various light conditions by learning the weight coefficient of each channel through self-attention. Experiments on two RGB-T tracking benchmark datasets demonstrate that SiamIVFN outperforms other state-of-the-art RGB-T tracking methods and reaches 147.6 FPS. In the future, we will try to adopt other advanced architectures to let the network dynamically change the coupling rate, and combine temporal information to improve tracking performance.