
1 Introduction

Target tracking using optical and infrared imagers has many applications, such as security monitoring and surveillance operations. Compared to radar [1, 2], multispectral [3, 4], and hyperspectral sensors [5, 6], optical and infrared imagers are low cost and easy to install and operate. In recent years, new tracking algorithms have emerged based on track-learn-detect [7], compressive sensing [8, 9], deep learning [10], and tracking by detection [11], among others. These algorithms have been shown to work well on benchmark data sets. However, the benchmark videos are of high resolution and high quality. In contrast, some realistic videos, such as the SENSIAC videos [12], are of low quality in terms of resolution and environmental conditions.

One objective of this research is to compare representative tracking algorithms from the literature using the SENSIAC data [12], which contain both optical and infrared videos at various ranges. We do not have a preference for any particular tracking algorithm. In fact, we also include the Kalman tracker [13,14,15], which is probably the oldest algorithm in the literature. Another objective is to examine whether deep learning approaches, given the recent attention they have received, are indeed better than conventional algorithms.

This paper is organized as follows. In Sect. 2, we briefly review the tracking algorithms in this study. Although the list of algorithms is not exhaustive, they are representative of many state-of-the-art algorithms in the literature. Section 3 presents an extensive comparative study using actual videos, in which two performance metrics were used to compare the different algorithms. Finally, some concluding remarks are given in Sect. 4.

2 Tracking Algorithms

The following approaches are by no means an exhaustive list of current methods. However, they are representative of target tracking methods developed in recent years. Some deep learning approaches were not included because our PCs do not have the necessary hardware or software to run them. We briefly outline the key ideas of each method below.

2.1 STAPLE Tracker [16]

For this algorithm, histogram of oriented gradients (HOG) features are extracted from the most recently estimated target location and used to update the models of the tracker. A template response is then calculated using the updated models and the features extracted from the next frame. To estimate the location of the target, a histogram response is needed in addition to the template response. The histogram response is calculated by updating the weights in the current frame; the per-pixel score is then computed in the next frame. This score and the previously computed weights are used to form an integral image and, ultimately, the histogram response. Together, the template and histogram responses allow the tracker to estimate the location of the target.
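The final step above can be illustrated with a minimal sketch (not the authors' code): a template response and a histogram response on the same grid are merged with a fixed weight, and the peak of the merged map gives the new target location. The merge weight and the function and array names below are assumptions for illustration only.

```python
import numpy as np

# Minimal sketch of the response-fusion step in STAPLE (not the authors' code).
# Feature extraction and model updates are assumed to happen elsewhere; the
# merge weight below is an illustrative assumption.

MERGE_FACTOR = 0.3  # weight given to the histogram response (assumed value)

def merged_response(template_response: np.ndarray,
                    histogram_response: np.ndarray) -> np.ndarray:
    """Linearly combine the correlation-filter (HOG template) response with the
    per-pixel histogram response; both maps share the same grid."""
    return (1.0 - MERGE_FACTOR) * template_response + MERGE_FACTOR * histogram_response

def estimate_location(response: np.ndarray, window_origin: np.ndarray) -> np.ndarray:
    """The new target centre is the peak of the merged response map, expressed
    relative to the top-left corner of the search window."""
    peak = np.unravel_index(np.argmax(response), response.shape)
    return window_origin + np.array(peak)
```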

The STAPLE tracker [16] is able to successfully track the target of interest until the end of a video when there is no occlusion. Even with a non-stationary camera, STAPLE [16] keeps a tight bounding box around the target, and the bounding box appears to scale with the target, although the change in scale is too small to be significant. In some cases the bounding box does not completely encase the target, but the tracker still follows the target even with only partial encasement. One major issue STAPLE [16] suffers from is occlusion: once the target becomes occluded, STAPLE [16] is unable to redetect it after it emerges from the occlusion. Overall, STAPLE [16] works well for targets that do not become occluded.

2.2 Long-Term Correlation Tracking (LCT) Tracker [17]

This algorithm starts with the initial bounding box and expands it to specify a search window. Features are then extracted from within the search window to estimate the target location. After the location has been computed, the scale of the bounding box is estimated. The tracker then checks whether the correct target is still being tracked. If it is not, the tracker performs redetection by generating candidate states and choosing the most likely one by comparing confidence scores. After redetection, the appearance and motion models are updated; this update is performed regardless of whether the redetection module was invoked. This cycle continues until the end of the video.
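The per-frame flow can be summarized in the following skeleton, assuming hypothetical filter and detector objects that stand in for the translation filter, scale estimator, long-term confidence filter, and redetection module; the confidence threshold is an assumed value.

```python
# Hypothetical single-frame skeleton of the LCT loop; the filter and detector
# objects are placeholders, not the released implementation.

REDETECTION_THRESHOLD = 0.15  # confidence below this triggers redetection (assumed)

def lct_step(frame, position, scale,
             translation_filter, scale_filter, long_term_filter, detector):
    # 1. Estimate the translation inside an enlarged search window.
    position = translation_filter.estimate(frame, position, scale)
    # 2. Estimate the new scale of the bounding box.
    scale = scale_filter.estimate(frame, position, scale)
    # 3. Check tracking confidence with the conservative long-term filter.
    confidence = long_term_filter.score(frame, position, scale)
    if confidence < REDETECTION_THRESHOLD:
        # 4. Redetection: evaluate candidate states, keep the most confident one.
        candidates = detector.propose(frame)  # list of (position, scale) candidates
        position, scale = max(candidates,
                              key=lambda c: long_term_filter.score(frame, c[0], c[1]))
    # 5. Update the appearance/motion models whether or not redetection ran.
    translation_filter.update(frame, position, scale)
    scale_filter.update(frame, position, scale)
    long_term_filter.update(frame, position, scale)
    return position, scale
```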

The LCT tracker [17] is able to successfully track the target of interest until the end of the video. The algorithm has proven to be quite robust in that it is able to handle occlusions and a non-stationary camera. Although the LCT [17] handles light to moderate occlusion, it is unsuccessful under heavy occlusion, such as when a target is under a heavy shadow that spans multiple frames. The LCT [17] fails in these cases because of the dramatic changes the shadows cause in the appearance model of the target. Another minor fault of this algorithm is the dynamic scaling of the bounding box: when the orientation of the target changes and the bounding box is rescaled, the box sometimes becomes too large and covers more area around the target than desired. Overall, the LCT [17] algorithm is robust and able to handle most cases of occlusion.

2.3 Fusion of STAPLE and LCT

The fusion of STAPLE [16] and LCT [17] merges the two algorithms into one program, which we implemented ourselves. The motivation for this merge is to combine the best features of the two algorithms and resolve the main issues each algorithm suffers from individually. As it turns out, the issues with STAPLE [16] can be resolved by the LCT [17], and the issues with the LCT [17] can be resolved by STAPLE [16]. The fusion tracker is able to successfully track the target of interest until the end of the video while keeping a tightly fitting bounding box around the target, including in cases with light to medium occlusion.

Figure 1 illustrates the fusion-based tracker. The fusion tracker works by running the STAPLE [16] and LCT [17] trackers simultaneously. Since the bounding box from the STAPLE [16] tracker gives a more desirable result, STAPLE [16] is used to report the location of the target, while the LCT [17] is used to detect occlusion and to perform redetection. Once an occlusion is detected, a flag is raised and the program waits for 5 frames to pass so that the target has time to emerge from the occlusion. Once the flag has been raised and the 5 frames have passed, STAPLE [16] is reset and initialized with the location information from the LCT [17]. The purpose of resetting STAPLE [16] is to clear the history of the appearance and motion models. This cycle continues until the end of the video.

Fig. 1. Fusion of STAPLE and LCT tracking algorithms.
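A minimal sketch of this control flow is given below, assuming hypothetical wrapper objects for the two trackers with init, step, and reset methods; the occlusion confidence threshold is an assumption, while the 5-frame wait and the reset of STAPLE from the LCT location follow the description above.

```python
# Illustrative sketch of the fusion logic; the tracker wrappers and their methods
# (init, step, reset) are hypothetical, and the occlusion threshold is assumed.

OCCLUSION_WAIT = 5  # frames to wait after occlusion is detected, as described above

def fusion_track(frames, init_box, staple, lct, occlusion_threshold=0.2):
    staple.init(frames[0], init_box)
    lct.init(frames[0], init_box)
    boxes, wait = [init_box], 0
    for frame in frames[1:]:
        staple_box = staple.step(frame)         # STAPLE supplies the reported box
        lct_box, confidence = lct.step(frame)   # LCT supplies an occlusion confidence
        if wait == 0 and confidence < occlusion_threshold:
            wait = OCCLUSION_WAIT               # occlusion flag raised
        elif wait > 0:
            wait -= 1
            if wait == 0:
                staple.reset(frame, lct_box)    # clear STAPLE's appearance/motion history
                staple_box = lct_box            # re-initialize from the LCT location
        boxes.append(staple_box)
    return boxes
```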

2.4 Kalman Tracker

Although the Kalman tracker is easy to understand, we could not find a good Kalman tracker implementation on the internet, so we implemented one ourselves. The Kalman tracker is able to successfully track a moving target at close range until the end of the video if there are no occlusions and the camera is stationary. We found that the Kalman tracker has issues when the target is stationary because it relies on motion to track successfully. Motion detection is performed only every ten frames to ensure that there are noticeable differences between the two frames being compared. Overall, the Kalman tracker works for close-range targets captured with a stationary camera.

Figure 2 illustrates the Kalman tracker. Given an initial position and velocity, a prediction of the next location is calculated for the first frame, and the same prediction is made for the frames between each 10-frame interval. Every 10 frames, the Kalman filter parameters are updated using the detected motion of the target; more specifically, the measurement residual and Kalman gain are computed to obtain a more accurate state estimate. This cycle continues until the end of the video.

Fig. 2. Kalman tracking algorithm.
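A minimal constant-velocity Kalman tracker consistent with this description is sketched below; the noise covariances and the motion-detection routine (here a user-supplied frame-differencing centroid function) are assumptions for illustration, not the exact settings of our implementation.

```python
import numpy as np

# Minimal constant-velocity Kalman tracker following the description above.
# Noise covariances and the motion-detection routine are assumed values.

dt = 1.0
F = np.array([[1, 0, dt, 0],     # state transition for state [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],      # only the position is measured
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)             # process noise (assumed)
R = 4.0 * np.eye(2)              # measurement noise (assumed)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    y = z - H @ x                             # measurement residual
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P

def track(frames, x0, detect_motion_centroid):
    """detect_motion_centroid(prev_frame, frame) returns the target centroid or None."""
    x, P = np.asarray(x0, float), np.eye(4)
    estimates = []
    for k, frame in enumerate(frames):
        x, P = predict(x, P)                  # predict the next location every frame
        if k > 0 and k % 10 == 0:             # measure motion only every 10 frames
            z = detect_motion_centroid(frames[k - 10], frame)
            if z is not None:
                x, P = update(x, P, np.asarray(z, float))
        estimates.append(x[:2].copy())
    return estimates
```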

2.5 Hierarchical Convolutional Features for Visual Tracking (CF Tracker)

This is a deep learning based tracker. As shown in Fig. 3, the CF tracker starts by cropping a search window from the first frame based on the initial position provided as input to the program. Once the search window has been established, convolutional features are extracted with spatial interpolation. A confidence score is then computed for each VGG net layer, and this score is used to estimate the target location in the next frame. Another region is then cropped from the frame using the newest estimate, and convolutional features are extracted with interpolation to update the correlation filters for each layer. This cycle is repeated until the end of the video.

Fig. 3. Deep learning tracker.
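A hypothetical per-frame outline of the CF tracker is given below; the layer names and weights, the feature extractor, the cropping and upsampling helpers, and the correlation-filter objects are placeholders rather than the released implementation.

```python
import numpy as np

# Hypothetical per-frame outline of the hierarchical convolutional-feature (CF)
# tracker; layer weights, feature extractor, crop/upsample helpers, and the
# correlation-filter objects are placeholders, not the released code.

LAYER_WEIGHTS = {"conv3": 0.25, "conv4": 0.5, "conv5": 1.0}  # assumed weighting

def peak_offset(response):
    """Offset of the response peak from the centre of the search window."""
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    cy, cx = np.array(response.shape) / 2.0
    return np.array([dy - cy, dx - cx])

def cf_step(frame, position, crop, vgg_features, upsample, filters):
    window = crop(frame, position)                  # search window around last estimate
    feats = vgg_features(window)                    # dict of feature maps per VGG layer
    response = None
    for layer, w in LAYER_WEIGHTS.items():
        r = filters[layer].respond(upsample(feats[layer]))   # per-layer confidence map
        response = w * r if response is None else response + w * r
    position = np.asarray(position, float) + peak_offset(response)
    new_feats = vgg_features(crop(frame, position))
    for layer in LAYER_WEIGHTS:
        filters[layer].update(upsample(new_feats[layer]))    # refresh correlation filters
    return position
```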

The CF tracker performs similarly to the STAPLE and LCT trackers. It is able to track the target until the end of the video in most cases. This tracker keeps a bounding box around the target even when the camera is not stationary, and the scaling of the bounding box is adaptive, much like the STAPLE bounding box; however, the bounding box size changes are too small to be significant. One issue with this tracker is that the computational time is quite long: for one video of approximately 1,875 frames, the tracker takes about 30 min to complete. Although the tracker has not been tested on videos with occlusion, it appears that it would not handle occlusion very well because its behavior is similar to that of STAPLE. Furthermore, the code does not include a function or algorithm for redetection.

3 Experiments

3.1 SENSIAC Database Description

All the tracking algorithms have been tested using the SENSIAC database [12], which contains different vehicle and human targets at multiple ranges. The videos were captured using both optical and mid-wave infrared (MWIR) cameras. This data set is available for purchase [12]. In this paper, we focus only on vehicles.

For vehicles, a total of nine targets were used; two other targets were excluded because not all scenarios were available for them. These targets vary in size from a pickup truck to a tank. For each target, there are a total of 18 scenarios, nine for daytime and nine for nighttime. These daytime and nighttime scenarios vary in the range from the target to the camera used to capture the video: the range starts at 1,000 m and ends at 5,000 m in intervals of 500 m. All the vehicles drive in a circular pattern at speeds specified in the ground truth files associated with each scenario. In total, there are 162 vehicle videos.

3.2 Vehicle Tracking Results

Although there are quite a few performance metrics for evaluating trackers in the literature, only two metrics were computed for each scenario, because only the ground-truth center locations of the vehicles are available in the database. The first is the distance precision (DP), which computes the fraction of frames in which the estimated location is within a given distance threshold of the ground-truth location. The second is the center location error (CLE), which computes the average distance between the ground-truth location and the estimated location. The following tables report the averages over all targets at a particular range for each tracking algorithm.
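Both metrics can be computed directly from the center locations, which is all the SENSIAC ground truth provides. A minimal implementation following these definitions is shown below (DP uses the 20-pixel threshold adopted in the tables).

```python
import numpy as np

# Direct implementation of the two metrics from centre locations only;
# gt and est are N x 2 arrays of (x, y) centres over N frames.

def center_location_error(gt, est):
    """Average Euclidean distance, in pixels, between ground-truth and estimated centres."""
    gt, est = np.asarray(gt, float), np.asarray(est, float)
    return float(np.mean(np.linalg.norm(gt - est, axis=1)))

def distance_precision(gt, est, threshold=20.0):
    """Fraction of frames whose centre error is within the threshold (20 pixels here)."""
    gt, est = np.asarray(gt, float), np.asarray(est, float)
    errors = np.linalg.norm(gt - est, axis=1)
    return float(np.mean(errors <= threshold))
```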

Optical Camera Results.

Table 1 summarizes the averaged CLE of the tracking results of the different algorithms at different ranges. It should be noted that there are a number of vehicles at each range. Smaller CLEs mean better performance. It can be seen that the Kalman tracker works well at ranges less than or equal to 2000 m, while STAPLE works well at ranges longer than 2000 m. Table 2 shows the averaged distance precision (DP) at a threshold of 20 pixels for all cases. Again, the Kalman tracker works well at short ranges and STAPLE works well at long ranges. Compared to the conventional trackers, the deep learning based tracker (CF) performs only moderately well at long ranges. The fusion approach works well only when there are occlusions, which are not present in the SENSIAC videos. In terms of computational speed, Kalman and STAPLE are the fastest, followed by the LCT, fusion, and CF trackers.

Table 1. Averaged Center Location Error (CLE) for all cases. Optical videos.
Table 2. Averaged Distance Precision (DP) at threshold of 20 pixels for all cases. Optical videos.

Figure 4 shows the averaged DP at three ranges. The trends are similar to what we observe in Tables 1 and 2.

Fig. 4. Average DP at various ranges. Optical videos.

Infrared Camera Results.

Unlike the optical data, which include only daytime videos, the infrared data include both daytime and nighttime videos. Table 3 shows the averaged CLE results for all cases. In both the daytime and nighttime cases, STAPLE performs quite well at all ranges, whereas the other algorithms do not perform as well. Similarly, Table 4 shows the averaged DP results for all cases. Again, STAPLE performs well in almost all cases. Figures 5, 6 and 7 plot the DP results at different ranges, and we can observe the same trends noted above. In terms of computational speed, Kalman and STAPLE are the fastest, followed by the LCT, fusion, and CF trackers.

Table 3. Averaged Center Location Error (CLE) for all cases. Infrared videos.
Table 4. Averaged Distance Precision (DP) at threshold of 20 pixels for all cases. Infrared videos.
Fig. 5. Average distance precision for videos at a range of 1000 m. Left: infrared videos at daytime; right: infrared videos at nighttime.

Fig. 6. Average distance precision for videos at a range of 3000 m. Left: infrared videos at daytime; right: infrared videos at nighttime.

Fig. 7. Average distance precision for videos at a range of 5000 m. Left: infrared videos at daytime; right: infrared videos at nighttime.

4 Conclusions

In this paper, we addressed target tracking for low quality videos, where the low quality is caused by long-range data acquisition as well as environmental conditions such as poor illumination and camera motion. Five representative trackers were used in our comparative study, and two performance metrics (center location error and distance precision) were used in our experiments. It was observed that the tracker known as STAPLE performed quite well, whereas the deep learning based tracker did not work as well as STAPLE. A somewhat surprising result is that the Kalman tracker also works well up to 2000 m for optical videos. It is our belief that the field of target tracking still needs a lot of research, including on deep learning based methods.