1 Introduction

Advances in machine learning and computer vision enable ecologists to extract key features from images, videos and other visual data, allowing them to quickly classify animal species and count individual animals (Tuia et al., 2022; Weinstein, 2018). Building on the visual information collected from large datasets, these scientists also aim to carry out further research on animal behavior analysis (Van der Zande et al., 2021; Norouzzadeh et al., 2018), which in turn facilitates biodiversity conservation (Mathis et al., 2018; Valletta et al., 2017).

Visual object tracking, one of the key technologies in computer vision, aims to consistently localize targets of interest in video sequences (Lu & Wang, 2019). It is a fundamental vision task in many real-world applications, including animal recognition and behavior analysis (Lopez-Marcano et al., 2021; Valletta et al., 2017; Risse et al., 2017; Haalck et al., 2020; Fukunaga et al., 2015; Ravoor & Sudarshan, 2020). Over the past decades, researchers in the computer vision community have studied visual object tracking intensively, producing two mainstream families of methods: correlation filter (CF) based trackers (Liu et al., 2021) and deep learning based trackers (Li et al., 2018). Alongside the development of tracking methods, tracking benchmarks play an important role in tracker evaluation. OTB50 (Wu et al., 2013) is widely regarded as the first such benchmark and has been adopted in the literature as a standard tool to evaluate the performance of different trackers. It contains 50 video sequences covering various kinds of general objects, including animals, vehicles, humans, body parts and toys. In Wu et al. (2015), the authors extended OTB50 to 100 video sequences, providing a much larger dataset for tracker evaluation. In the following years, several other benchmark datasets were released, including TC128 (Liang et al., 2015), NFS (Galoogahi et al., 2017), LaSOT (Fan et al., 2019), UAV123 (Mueller et al., 2016), GOT10K (Huang et al., 2021), AMTSet (Wang et al., 2022) and TOTB (Fan et al., 2021). These datasets have advanced the development of visual object tracking.

However, when it comes to wild animal tracking, existing benchmarks are limited for the following reasons. First, wild animals cover a wide range of targets, from land animals to marine animals and from insects to mammals, and existing benchmarks do not include enough wild animal sequences to evaluate a tracker's performance on them. Second, some existing benchmarks, such as GOT10K and LaSOT, include a number of animal sequences, but most show captive animals in zoos, which are less representative than animals captured in the wild. In addition, most sequences containing wild animals in those benchmarks are very short, which is insufficient for evaluating wild animal tracking. Third, the main challenges in wild animal tracking differ from those in general object tracking. According to our exploration, challenges such as camera motion, similar objects, deformation and rotation are very common in wild animal tracking. When analyzing groups of animals, challenges such as full or partial occlusion, out-of-view targets and rotation appear frequently, whereas for individual animal analysis, rotation and scale variation are most common.

In this paper, we build the first benchmark for wild animal tracking (WATB). WATB comprises 206 sequences and covers a wide range of animal categories. It contains over 203,000 frames in total, with an average sequence length of over 980 frames. All sequences are labelled with 13 challenge attributes and manually annotated with axis-aligned bounding boxes. To the best of our knowledge, WATB is the first benchmark dataset specifically built for research on wild animal tracking. Sample frames from WATB are shown in Fig. 2. We benchmark 38 state-of-the-art trackers to understand their performance and to provide baseline results for comparison in future research. A comprehensive analysis of the evaluation results demonstrates that wild animal tracking remains challenging for the computer vision community.

The main contributions of our work are summarized as follows:

  (1) We propose WATB, the first benchmark dataset dedicated to research on wild animal tracking and behavior analysis. WATB consists of 206 sequences with over 203,000 frames, making it large enough to support related research. It covers a wide range of wild animal species, including molluscs, coelenterates, mammals, birds, reptiles, amphibians, arthropods and fish.

  (2) We evaluate 38 state-of-the-art trackers, including deep learning based trackers, such as Siamese network and Transformer trackers, and correlation filter based trackers with hand-crafted and deep features. We provide in-depth analysis so that researchers have evaluation results to compare against in future research. We rank the evaluated trackers and identify the top three in tracking accuracy, which allows us to point out research prospects and development trends in wild animal tracking in light of the latest progress in tracking methods.

  (3) All the sequences, annotations, attributes, evaluation results and the toolkit of WATB are released at https://w-1995.github.io/. Researchers are encouraged to evaluate their work on WATB.

2 Related Works

2.1 Tracking Datasets

Tracking datasets are very important for the development of visual object tracking. There are two common ways to categorize existing benchmark datasets. In Fan et al. (2021), datasets are divided into two types, generic and specific, while in Valmadre et al. (2018) the authors divide them into long-term and short-term.

Generic datasets usually contain various objects collected from general scenes, such as pedestrians, human body parts, animals, vehicles and sports players. OTB50 (Wu et al., 2013) and OTB100 (Wu et al., 2015) are two standard generic datasets that are widely used to evaluate trackers in the literature. Both contain 11 challenge attributes and include both color and gray sequences. TC128 (Liang et al., 2015) is composed of 128 color sequences and focuses on testing the impact of color information on tracking performance. It also provides 11 challenge attributes for each video sequence. NFS (Galoogahi et al., 2017) aims to evaluate trackers on high-frame-rate videos; each video sequence is labelled with 9 attributes. More recently, large-scale datasets have been proposed for training deep tracking models. LaSOT (Fan et al., 2019) provides 1,400 sequences, with 1,120 for training and 280 for testing. It covers 70 object categories, including vehicles, persons, pets, livestock, human body parts, clothes, and captive and wild animals. Most of its animal videos are captured at zoos or homes, and few show wild animals. GOT-10K (Huang et al., 2021) provides about 10,000 video sequences, with 9,340 for training, 420 for testing and 180 for validation, but its average sequence length is only about 200 frames. OxUvA (Valmadre et al., 2018) mainly focuses on long-term tracking evaluation with 366 video sequences.

A specific dataset is usually constructed to evaluate tracking performance for a particular goal; our WATB is such a dataset, dedicated to wild animal tracking. AMTset (Wang et al., 2022) pays special attention to abrupt motion tracking, supplying 50 video sequences and a new evaluation metric. UAV123 (Mueller et al., 2016) focuses on tracking objects captured by unmanned aerial vehicles (UAVs); it includes 123 video sequences covering several object categories, including vehicles, persons and buildings. In Fan et al. (2021), the authors proposed the first dataset for transparent object tracking (TOTB). Li et al. (2019) propose the first RGB-thermal object tracking benchmark, while in Xiong et al. (2020) the first dataset for hyperspectral video object tracking is proposed. Wang et al. propose WAMI-226, a benchmark dataset for object tracking in wide area motion imagery (WAMI), which focuses on the challenges induced by low resolution, low contrast and low frame rate. Several datasets have also been developed for face tracking: 300-VW (Shen et al., 2015) is composed of 100 videos developed for facial landmark tracking, and MobiFace (Lin et al., 2019) consists of 80 sequences specially developed for face tracking in mobile scenes. In Risse et al. (2017), the authors build a Wildlife Animal Tracking (WAT) dataset comprising only 12 videos (11 animal videos and 1 table tennis video) covering a small range of wild animals: 1 mammal video, 1 bird video and 9 arthropod videos. WAT pays special attention to small target animals, such as ants, dung beetles and woodlice, and covers common tracking challenges including occlusion, illumination variation, similar objects and clutter. Compared to WAT, our proposed WATB contains many more animal videos and covers a much wider range of animal species (as shown in Fig. 1).

2.2 Tracking Methods

In general, two main branches of tracking methods occupy the community: correlation filter based and deep learning based methods. The former is known for its high efficiency while maintaining good accuracy and robustness; the latter is characterized by high accuracy and robustness under various challenging conditions. We briefly review both below.

Correlation filter based tracking methods: MOSSE (Bolme et al., 2010) is the first CF-based tracking method and opened a new research direction, showing both high tracking accuracy and high speed. Following MOSSE, KCF greatly improved tracking performance by learning a kernelized CF on HOG features (Henriques et al., 2014). Staple (Bertinetto et al., 2016) incorporates HOG and color features to improve on KCF, while SAMF (Li & Zhu, 2014) focuses on scale estimation in CF based trackers. To suppress the troublesome boundary effects, SRDCF (Danelljan et al., 2015) adopts spatial weight coefficients to penalize samples far from the object center. BACF (Kiani Galoogahi et al., 2017) enlarges the sampling area and draws real negative samples from the background, which greatly improves sample quality and suppresses the boundary effect. STRCF (Li et al., 2018) introduces temporal regularization into SRDCF to further suppress the boundary effect. In Dai et al. (2019), ASRCF improves SRDCF by introducing adaptive spatial regularization that adapts to object appearance variation; it learns a location filter and a scale filter using HOG and deep features. GFSDCF (Xu et al., 2019, 2021), exploiting the spatial consistency of multi-channel features, performs group feature selection across the spatial and channel dimensions, learns correlation filters with spatial-channel group sparsity, and compresses features along both dimensions. Li et al. (2020) propose to remove the cosine window from spatially regularized CF trackers and replace it with binary and Gaussian-shaped mask functions.
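To make the CF formulation concrete, the following Python sketch implements the core of a MOSSE-style filter: the filter is learned in the Fourier domain from training patches and a Gaussian-shaped target response, and detection reduces to an element-wise multiplication followed by an inverse FFT. This is a minimal illustration; real trackers add feature extraction, cosine windowing and online updates.

```python
import numpy as np

def mosse_train(patches, target_response, reg=1e-3):
    """Learn a MOSSE-style correlation filter in the Fourier domain.

    patches: list of HxW grayscale training patches (augmented crops)
    target_response: HxW desired Gaussian response peaked on the target
    """
    G = np.fft.fft2(target_response)
    A = np.zeros_like(G)           # numerator:   sum of G * conj(F)
    B = np.zeros_like(G)           # denominator: sum of F * conj(F)
    for p in patches:
        F = np.fft.fft2(p)
        A += G * np.conj(F)
        B += F * np.conj(F)
    return A / (B + reg)           # filter H* in the Fourier domain

def mosse_detect(H_conj, patch):
    """Correlate the learned filter with a new search patch and
    return the (row, col) location of the response peak."""
    response = np.real(np.fft.ifft2(H_conj * np.fft.fft2(patch)))
    return np.unravel_index(np.argmax(response), response.shape)
```

The FFT-based formulation is what gives CF trackers their speed: both training and detection cost only a few FFTs per frame.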

Deep learning based tracking methods: The power of deep neural networks has brought the object tracking community to a new peak. Convolutional Neural Networks (CNNs) (Nam & Han, 2016; Wang et al., 2015) and Siamese networks (Bertinetto et al., 2016; Li et al., 2018; Guo et al., 2020) are two popular deep tracking frameworks. MDNet (Nam & Han, 2016) uses large-scale annotated video sets to train a CNN for general feature representation. Wang et al. (2015) explore the representation ability of features from different CNN layers; feature maps from different layers are selected and refined to obtain more discriminative features. SiamFC (Bertinetto et al., 2016) is the first Siamese network based tracker and treats tracking as a similarity learning problem: a Siamese network is trained offline to localize an exemplar region within a larger search region. SiamRPN (Li et al., 2018) is composed of a Siamese subnetwork for feature extraction and a region proposal subnetwork for predicting object position and confidence. SiamRPN++ (Li et al., 2019) adopts ResNet50 as the backbone; it largely alleviates the loss of translation invariance caused by padding and introduces a deep network into the Siamese framework, bringing significant improvements in accuracy. SiamBAN (Chen et al., 2020) is similar to SiamRPN++, but introduces dilated convolution, which enlarges the receptive field and improves tracking performance. SiamCAR (Guo et al., 2020) is similar to SiamBAN, differing in its centerness branch that better localizes the object center. SiamAttn (Yu, 2020) uses a deformable Siamese attention network and a region refinement module to boost tracking performance.
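The similarity-learning idea behind SiamFC can be summarized in a few lines: both crops are embedded by a shared backbone, and the template embedding is used as a correlation kernel over the search embedding. The sketch below assumes a PyTorch backbone; the feature sizes in the comments follow the original SiamFC configuration and are purely illustrative.

```python
import torch
import torch.nn.functional as F

def siamese_response(backbone, template_img, search_img):
    """SiamFC-style similarity map (minimal sketch, batch size 1)."""
    z = backbone(template_img)   # e.g. 1 x C x 6 x 6   template features
    x = backbone(search_img)     # e.g. 1 x C x 22 x 22 search features
    # Cross-correlation == conv2d with the template features as kernel.
    response = F.conv2d(x, z)    # 1 x 1 x 17 x 17 score map
    return response

# The peak of `response` gives the target's displacement inside the
# search region, scaled back by the backbone's total stride.
```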

3 Wild Animal Tracking Benchmark (WATB)

3.1 Video Collection

We aim to collect video sequences containing wild animals moving naturally. Documentaries about wild animals generally contain plenty of footage of moving animals. Initially, we intended to select sequences from existing general object tracking benchmarks, but these public benchmarks contain very few wild animals. We therefore collected a large number of raw video sequences from the Internet, which were then filtered and edited manually, keeping those suitable for tracking analysis. Extraneous frames were cut out so that each video focuses on a specific wild animal of interest. In addition, we clipped some sequences to construct particular tracking challenges. For example, abrupt motion, induced by rapid camera motion or camera switching, is a common challenge in documentaries; we removed some frames so that the wild animal moves a larger distance between two adjacent frames, as sketched below.
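This frame-dropping step can be done with a few lines of OpenCV. The sketch below is illustrative only: the file paths and the keep-every-k policy are our assumptions, not a released tool.

```python
import cv2

def drop_frames(src_path, dst_path, keep_every=3):
    """Simulate abrupt motion by keeping only every k-th frame, so the
    target moves a larger distance between consecutive frames."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"),
                          fps, (w, h))
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % keep_every == 0:   # keep one frame out of every k
            out.write(frame)
        i += 1
    cap.release()
    out.release()
```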

We also aim to cover a wide range of wild animal species, including molluscs, coelenterates, mammals, birds, reptiles, amphibians, arthropods and fish, living on land and at sea, in forests, grasslands and mountains. We name each selected sequence after the species of its target animal. Since we are not professional zoologists, we use Baidu image recognition to help identify the wild animals in the sequences: we select sample frames containing the target animal and crop the complete animal region as input to the recognition service. The results are carefully checked to make sure the recognized animal matches the one in the video sequence. For some animals in crowded scenes, for example AcanthopagrusSchlegelii2 and PomadasysArgenteus2, the recognition results are not accurate enough, and we had to try several times, cropping different regions of a frame, to obtain satisfactory results. We eventually collected 206 sequences to form WATB, which is sufficient to evaluate a tracker's performance on wild animal tracking. A summary of WATB and a comparison with WAT are given in Table 1.

Table 1 Summary of statistics of the proposed WATB
Fig. 1 Distribution of animals in WATB

3.2 Animal Categories

Among the 206 sequences in WATB, 29 contain sea animals, 102 contain land animals and 75 contain flying animals (shown in Fig. 1a). Most sequences in WATB contain birds and mammals, because the raw videos are mainly taken from documentaries, which focus heavily on these groups. There are 27 sequences containing fish, all of which are marine animals. In Fig. 1b, we plot the distribution of animal species in WATB. The distribution appears unbalanced, but it is worth noting that there are several standards for categorizing animals in zoology, and we use a coarse one to divide the 206 sequences into several branches. The mammal sequences could also be divided into several branches, but detailed categorization is beyond the scope of this work. Since it is difficult to capture wild animal videos ourselves, existing documentaries are the main source of WATB, and we find that wild animals from Africa are the favorite subjects of most documentaries. In addition, some of the bird sequences come from social media and were taken by amateur photography enthusiasts.

3.3 Annotation

For annotation, we follow the standard of Fan et al. (2019) to provide consistent bounding box annotations for most sequences. For a given video sequence, the initial target animal is specified in the first frame. When the target animal appears in a frame, the labeler draws a bounding box tightly enclosing every visible part of the animal; when it is fully occluded or out of view, the labeler assigns an absent label to that frame. However, to build certain challenge attributes in some sequences, we do not label the whole body of the target animal, because labelling the whole body would not reflect those attributes. For example, for the GoldenEagle sequence in Fig. 2, we annotate only the head of the bird. For animals with long tails or legs, arthropods for example, annotating all visible parts would include too much background, so we annotate only the main body and omit the protruding legs or tail. For instance, in Fig. 2 (row 3, column 4), the right leg of the mantis is excluded from the bounding box. Fig. 3 shows the distribution of bounding box sizes (target animal scale) in WATB, plotting the width and height (in pixels) of the target animal in the first frame of each sequence. The widths of the animals fall approximately in the interval [30, 210], while the heights fall in [30, 150].
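For illustration, annotations following this LaSOT-style standard can be loaded as below. The file names and exact layout (one comma-separated x, y, w, h box per line plus a per-frame absent flag) are assumptions based on that standard, not a specification of the released WATB format.

```python
import numpy as np

def load_annotations(gt_file, absent_file=None):
    """Load per-frame ground truth for one sequence (hypothetical layout:
    LaSOT-style 'x,y,w,h' per line, optional 0/1 absent flag per frame
    for full-occlusion / out-of-view frames)."""
    boxes = np.loadtxt(gt_file, delimiter=",")        # N x 4 boxes
    if absent_file is not None:
        absent = np.loadtxt(absent_file, dtype=int)   # N absent flags
    else:
        absent = np.zeros(len(boxes), dtype=int)
    return boxes, absent
```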

Fig. 2 Sample frames from the WATB benchmark

To ensure annotation quality, we organized a labelling team and an inspection team, each with five members. The labelling team first annotates a sequence and sends it to the inspection team. The inspection members then check all annotated frames, and any potentially inappropriate annotation is discussed and sent back to the labelling team for refinement. Following these principles and quality control measures, we obtained the final WATB benchmark dataset. Sample frames selected from WATB are shown in Fig. 2.

3.4 Attributes

To enable researchers to conduct further in-depth analysis of tracker performance on wild animal tracking, we label each video sequence with 13 challenge attributes: (1) illumination variation (IV), (2) out-of-plane rotation (OPR), (3) in-plane rotation (IPR), (4) deformation (DEF), (5) fast motion (FM), (6) scale variation (SV), (7) camera motion (CM), (8) out of view (OV), (9) partial occlusion (POC), (10) full occlusion (FOC), (11) low resolution (LR), (12) similar object (SO), and (13) motion blur (MB). The definitions of these attributes are listed in Table 2 and are similar to those in OTB100 and LaSOT. Figure 4 shows the distribution of the challenge attributes in WATB; the most common challenges in wild animal tracking are clearly scale variation, out-of-plane rotation and partial occlusion. The behaviors of the animals in the collected sequences influence the attributes, which is directly reflected in documentaries. For example, many animals tend to rotate their heads or bodies out of the image plane, so the OPR attribute appears in such sequences.

Fig. 3 Distribution of bounding box sizes (object scale) in WATB. The diameter of each bubble represents the area (width\(\times \)height) of the corresponding bounding box

4 Evaluation

4.1 Evaluated Trackers

To evaluate existing trackers on WATB, we test 38 tracking methods, including CF and deep learning based trackers. Among CF based trackers, we select methods with hand-crafted features, such as BACF (Kiani Galoogahi et al., 2017), STRCF (Li et al., 2018) and SRDCF (Danelljan et al., 2015), and methods with deep features, such as ASRCF (Dai et al., 2019), CF2 (Ma et al., 2015) and GFSDCF (Xu et al., 2019). Among deep learning based trackers, we select recent popular Siamese network based trackers, such as SiamBAN (Chen et al., 2020), SiamCAR (Guo et al., 2020), SiamAttn (Yu, 2020), SiamGAT (Guo et al., 2021) and STMTrack (Fu et al., 2021), and Transformer trackers, such as Stark (Yan et al., 2021), TransT (Chen et al., 2021) and HiFT (Cao et al., 2021). All tested trackers are listed in Table 3. We run the code released by the respective authors for all trackers.

Table 2 Description of the attributes in WATB

4.2 Evaluation Methodology

We use the standard one-pass evaluation (OPE) protocol, as in Fan et al. (2021) and Fan et al. (2019), to compare the trackers using the precision, normalized precision and success measures.

The precision measure is based on the Euclidean distance d between the estimated object center \((x_{est},y_{est})\) and the center of the ground-truth bounding box \((x_{gt},y_{gt})\):

$$\begin{aligned} d = \sqrt{(x_{est}-x_{gt})^2+(y_{est}-y_{gt})^2} \end{aligned}$$
(1)

A distance threshold \(\alpha _d\) is first chosen to decide whether a track in a frame is successful (for example, 20 pixels). In a given frame, if \(d\le \alpha _d\), the track is deemed successful, meaning the tracker has captured the target animal. Then, for a tracker, the number of successful tracks \(F_{suc}\) divided by the total number of frames in the animal sequence \(F_{all}\) gives the precision score P, which is used to rank the trackers.

$$\begin{aligned} P = \frac{F_{suc}}{F_{all}} \end{aligned}$$
(2)

As the threshold \(\alpha _d\) varies over the interval [0, 50], a precision curve is formed. On this plot, the trackers are ranked by the value of P at the distance threshold \(\alpha _d=20\).
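A minimal implementation of Eqs. (1) and (2), including the curve over the threshold interval [0, 50], might look as follows (centers are N x 2 arrays of per-frame (x, y) coordinates).

```python
import numpy as np

def precision_score(est_centers, gt_centers, threshold=20):
    """Fraction of frames whose center error d (Eq. 1) is within
    the distance threshold (Eq. 2)."""
    d = np.linalg.norm(est_centers - gt_centers, axis=1)
    return np.mean(d <= threshold)

def precision_curve(est_centers, gt_centers, thresholds=np.arange(0, 51)):
    """Precision P over the threshold interval [0, 50] used for the plot;
    trackers are ranked by the value at threshold 20."""
    return np.array([precision_score(est_centers, gt_centers, t)
                     for t in thresholds])
```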

Fig. 4 Sequence numbers of different challenge attributes

The normalized precision measure is adopted because the precision measure is easily influenced by object size. It normalizes the precision measure by the size of the ground-truth bounding box: the distance between the estimated object center \((x_{est},y_{est})\) and the ground-truth center \((x_{gt},y_{gt})\) is computed as follows.

$$\begin{aligned} d_n = \sqrt{\left( \frac{x_{est}}{w}-\frac{x_{gt}}{w}\right) ^2+\left( \frac{y_{est}}{h}-\frac{y_{gt}}{h}\right) ^2} \end{aligned}$$
(3)

where w and h are the width and height of the target animal as defined by the ground-truth bounding box. We then set a normalized distance threshold \(\alpha _n\) to decide whether a track is successful; the normalized precision score \(P_n\) is defined in the same way as P. To generate a normalized precision curve, \(\alpha _n\) varies over the interval [0, 0.5], and the trackers are ranked by the value of \(P_n\) at \(\alpha _n=0.2\).
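The normalized variant (Eq. 3) differs only in dividing each coordinate error by the ground-truth box size before thresholding, e.g.:

```python
import numpy as np

def normalized_precision_score(est_centers, gt_centers, gt_sizes,
                               threshold=0.2):
    """Eq. (3): center error normalized by the ground-truth box size
    (gt_sizes is an N x 2 array of (w, h)), thresholded at alpha_n."""
    d_n = np.linalg.norm((est_centers - gt_centers) / gt_sizes, axis=1)
    return np.mean(d_n <= threshold)
```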

Table 3 Evaluated trackers in our experiment

As for the success measure, it is computed from the intersection over union (IoU), where the intersection is the overlap between the estimated target region \(R_{est}\) and the ground-truth bounding box \(R_{gt}\), and the union is the union of \(R_{est}\) and \(R_{gt}\).

$$\begin{aligned} IoU = \frac{R_{est}\cap R_{gt}}{R_{est}\cup R_{gt}} \end{aligned}$$
(4)

A threshold \(\alpha _{IoU}\) is again needed to determine a successful track. The success score S is then obtained by dividing the number of successful tracks by the total number of frames. As the threshold \(\alpha _{IoU}\) varies over the interval [0, 1], a success plot is generated, and the trackers on this plot are ranked by the area under the curve (AUC) score.
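A sketch of the success measure: per-frame IoU (Eq. 4) for axis-aligned (x, y, w, h) boxes, the success curve over [0, 1], and the AUC score used for ranking.

```python
import numpy as np

def iou(boxes_a, boxes_b):
    """Eq. (4): per-frame IoU between N x 4 arrays of (x, y, w, h) boxes."""
    ax1, ay1 = boxes_a[:, 0], boxes_a[:, 1]
    ax2, ay2 = ax1 + boxes_a[:, 2], ay1 + boxes_a[:, 3]
    bx1, by1 = boxes_b[:, 0], boxes_b[:, 1]
    bx2, by2 = bx1 + boxes_b[:, 2], by1 + boxes_b[:, 3]
    iw = np.maximum(0, np.minimum(ax2, bx2) - np.maximum(ax1, bx1))
    ih = np.maximum(0, np.minimum(ay2, by2) - np.maximum(ay1, by1))
    inter = iw * ih
    union = boxes_a[:, 2] * boxes_a[:, 3] \
          + boxes_b[:, 2] * boxes_b[:, 3] - inter
    return inter / union

def success_auc(est_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Success rate at each IoU threshold; the AUC ranking score is the
    mean of the curve."""
    overlaps = iou(est_boxes, gt_boxes)
    curve = np.array([np.mean(overlaps > t) for t in thresholds])
    return curve, curve.mean()
```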

4.3 Quantitative Results

4.3.1 Overall Performance

We extensively test 38 trackers on WATB. Each tracker is run as released, without any modification or re-training. We compute the three measures above for all tested trackers and draw the corresponding plots, shown in Fig. 5, to compare their performance.

From Fig. 5, it is clear that Stark obtains the highest precision score (0.580), normalized precision score (0.577) and AUC score (0.512). In the precision plot (Fig. 5a), STMTrack takes second place (0.556) and TransT third (0.550). Dimp50 and Dimp18 occupy the fourth and fifth places with precision scores of 0.545 and 0.540, respectively, followed by SiamAttn (0.539) and SiamBAN (0.538). The other four Siamese network based trackers and ATOM come after the top seven. The best CF based tracker is GFSDCF, with a precision score of 0.479, which is better than SiamFC (0.423). The remaining two deep learning based trackers, SiamGAT and HiFT, do not achieve good precision scores, since SiamGAT (0.344) focuses on aspect-ratio change while HiFT (0.288) is tailored to UAV object tracking.

In the normalized precision plot (Fig. 5b), the top six change slightly. STMTrack remains in second place with a score of 0.553, followed by TransT (0.547). SiamAttn moves up to fourth place with 0.539, which is 0.001 higher than Dimp50 (0.538); SiamBAN (0.531) and Dimp18 (0.530) follow. The best CF based tracker, GFSDCF (0.472), takes 13th place and outperforms several deep learning based trackers, including SiamFC (0.400) and SiamGAT (0.333). The second best CF based tracker is ASRCF (0.384), followed by ARCF (0.348) and CF2 (0.348). In the success plot, the top five do not change much: TransT (0.492) moves up to second place, followed by STMTrack (0.486), Dimp50 (0.472) and Dimp18 (0.465). SiamBAN, SiamAttn and ATOM take the sixth to eighth places with AUC scores of 0.455, 0.455 and 0.450, respectively, outperforming SiamRPN++ (0.428), SiamCAR (0.423), SiamRPN (0.413) and SiamMask (0.395). The best CF based tracker is still GFSDCF (0.373), which outperforms SiamFC (0.331), and the second best CF based tracker is ASRCF (0.324). SiamGAT (0.322) ranks much higher in the success plot than in the precision plot, surpassing all CF based trackers except GFSDCF and ASRCF.

Fig. 5 Overall performance of the trackers on WATB using precision, normalized precision and success measures

4.3.2 Attribute-Based Performance

To reveal the strengths and limitations of existing trackers on wild animal tracking, we provide attribute-based evaluation over the 13 challenge attributes of WATB. Figure 6 shows the attribute-based performance plots, which demonstrate that Stark is the best tracker on almost all 13 attributes, except similar object and illumination variation; STMTrack takes first place on the similar object attribute. A per-attribute score is obtained by averaging a tracker's per-sequence results over the sequences carrying that attribute, as sketched below.
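A sketch of this per-attribute aggregation (the data structures are illustrative, not the released toolkit's API):

```python
import numpy as np

def attribute_auc(per_seq_auc, seq_attributes, attribute):
    """Average a tracker's per-sequence AUC over the subset of sequences
    labelled with one challenge attribute.

    per_seq_auc:    dict, sequence name -> AUC score
    seq_attributes: dict, sequence name -> set of tags, e.g. {'SV', 'OPR'}
    attribute:      tag to filter by, e.g. 'POC'
    """
    scores = [auc for name, auc in per_seq_auc.items()
              if attribute in seq_attributes[name]]
    return np.mean(scores)
```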

The most common challenge in WATB is scale variation, on which Stark obtains the best AUC score (0.504), outperforming the second best, TransT (0.478), by 2.6%. STMTrack, Dimp50 and Dimp18 take the next three places. The best two CF based trackers are GFSDCF and ASRCF, with AUC scores of 0.333 and 0.280, respectively. Out-of-plane rotation is the second most common challenge in wild animal tracking; here the top five trackers are almost the same as for scale variation, with TransT (0.470) and STMTrack (0.471) swapping places. GFSDCF, ASRCF, CFWCR and CF2 are the best four CF based trackers.

For partial occlusion, the top five trackers are the same as for scale variation, but the AUC scores drop severely, demonstrating that partial occlusion remains a major challenge in wild animal tracking. It is worth noting that GFSDCF (0.318) and ASRCF (0.275) are the best two CF based trackers, and ASRCF outperforms SiamFC (0.270) and HiFT (0.244) by 0.5% and 3.1%, respectively. On the fast motion attribute, the overall AUC scores also drop severely compared to scale variation, meaning that fast motion is another major challenge; it is commonly induced by camera switching, sudden dynamic changes and fast-moving animals. Stark shows the best performance, with an AUC score of 0.465, outperforming the second best, TransT (0.423), by a large margin of 4.2%. Dimp50 takes third place with an AUC score of 0.401, which is 0.7% higher than STMTrack and 2.2% lower than TransT (0.423). The AUC score of the best CF based tracker, GFSDCF, is only 0.287, 17.8% lower than Stark. Similar conclusions can be drawn for the out of view attribute.

On the full occlusion attribute, the AUC scores of all trackers decrease severely; the best tracker, Stark, reaches only 0.385, showing that full occlusion is very challenging in wild animal tracking. The literature confirms that full occlusion is also challenging in general object tracking. For similar object (background clutter), STMTrack takes first place with an AUC score of 0.480, followed by TransT (0.478) and SiamAttn (0.470), while Stark (0.469) falls to fourth place. On the illumination variation attribute, STMTrack takes second place with an AUC score of 0.448, which is 1.3% lower than Stark (0.461), and TransT (0.445) is third, exceeding Dimp18 and Dimp50. All the deep learning based trackers exceed the CF based trackers except the best two of the latter, i.e. GFSDCF (0.328) and ASRCF (0.288).

The tracker ranking in Fig. 6 clearly shows that deep learning based trackers are more effective at handling these challenge attributes in wild animal tracking, while CF based trackers, with either deep or hand-crafted features, are less capable of doing so.

Fig. 6 Attribute-based performance of the trackers on WATB using the success measure

The normalized precision scores of the evaluated trackers over the 13 attributes are given in Table 4. The results of deep learning based trackers and CF based trackers are listed in separate parts of the table. For deep learning based trackers, the best three results are shown in red, blue and green, respectively, while the best three CF based trackers are shown in italics using the same colors. The scores drop severely on the FOC attribute, confirming that FOC is very challenging in wild animal tracking; a similar conclusion can be drawn from Fig. 6. Stark gets the highest score on almost all attributes except SO and IV, while GFSDCF is the best CF based tracker on every attribute. Among CF based trackers, the top-ranked ones, including GFSDCF, ASRCF, CF2 and ECO, combine deep and hand-crafted features to enhance tracking performance, and their scores are better than those of CF based trackers with hand-crafted features only. ASRCF is the runner-up among CF based trackers on most attributes except CM. Most CF based trackers with only hand-crafted features cannot obtain satisfactory results, demonstrating that deep features have better representation ability for wild animal tracking. We give further analysis in Sect. 4.4.

Table 4 Normalized precision score of different trackers over 13 challenge attributes (Color table online)

4.3.3 Species-Based Performance

As shown in Fig. 1b, there are 8 animal species groups in WATB. In this section, we conduct a species-based evaluation on WATB. The success plots are shown in Fig. 7. We omit the success plots for Amphibians, Reptiles, Molluscs and Coelenterates because there are very few sequences of these species (fewer than 4 each), though all results are released on our project website. We therefore discuss only the other four groups, i.e. Birds, Fish, Mammals and Arthropods.

For the Birds sequences, Stark (0.487), TransT (0.476) and Dimp50 (0.474) occupy the top three places, the first two being Transformer-based trackers. The AUC scores of the next five deep learning based trackers, SiamBAN, Dimp18, ATOM, SiamAttn and STMTrack, are all above 0.450. GFSDCF (0.411), ASRCF (0.352) and CFWCR (0.348) are the best three CF based trackers on these sequences; the trackers ranked below GFSDCF all score below 0.4. On the Fish sequences, STMTrack (0.550) wins, followed by TransT (0.533) and SiamAttn (0.505). The overall AUC scores on the Fish sequences are clearly much higher than on the Birds sequences, which suggests that the Birds sequences contain more challenge attributes. For the Mammals sequences, Stark (0.525) returns to first place, leading the runner-up STMTrack by a large margin (3.5%); the trackers ranked below SiamRPN (0.401), including all CF based trackers, score below 0.400. There are only 5 Arthropods sequences in WATB, far fewer than for the three groups above. Stark clearly occupies first place, leading the runner-up TransT by a very large margin (16.7%). This success plot also shows that all trackers except Stark and TransT have AUC values below 0.4, demonstrating that arthropod tracking is very challenging, though the advanced Transformer-based trackers show promising potential to overcome this problem.

Fig. 7 Species-based performance of the trackers on WATB using the success measure

Table 5 shows the normalized precision scores of the trackers over the 8 animal species groups in WATB, with the best three scores marked as in Table 4. We do not mark the results for Molluscs and Coelenterates, considering that there are very few sequences of these two groups. For the six marked groups, the best methods are all Transformer-based trackers, which again demonstrates their great potential for tracking different animal species. For Arthropods, Stark gets the best score (0.637), but the scores of the other 37 trackers, especially the CF based ones, drop severely compared to Stark; it is very difficult for CF based trackers to accurately track arthropod targets. This result again confirms the challenge of tracking arthropods and the potential of Transformers to conquer it. On the other five groups, GFSDCF shows moderate performance compared to the corresponding best three deep learning based trackers, and leads the other CF based trackers by a large margin, which can be attributed to its use of deep features, as discussed in Sect. 4.3.2.

Table 5 Normalized precision score of different trackers over 8 animal species in WATB (Color table online)

4.4 Deep and Hand-Crafted Features in WATB

As claimed in the previous section, CF based trackers with hand-crafted features handle the challenges of wild animal tracking less well than those with deep features. In this section, we conduct an ablation study using the GFSDCF tracker (Xu et al., 2019).

As the baseline, GFSDCF uses both deep features from ResNet50 and hand-crafted features, namely HOG and CN. The HOG feature is widely adopted in CF based trackers, and the CN feature has proven better than other color features for object tracking. The results, shown in Fig. 8 and Table 6, indicate that a single hand-crafted feature, HOG or CN, cannot achieve good tracking performance, but combining them boosts performance by a large margin, with all three measures increasing remarkably. When deep features from ResNet50 are incorporated, the precision score increases by 10.7%, the normalized precision score by 12.9% and the AUC score by 7.4%. This clearly demonstrates that deep features can greatly improve tracking performance on WATB. Compared to a single hand-crafted feature, combining multiple hand-crafted features also boosts performance to some extent.
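To make the feature variants concrete, the sketch below extracts a HOG descriptor and mid-level ResNet50 activations for an image patch. The layer choice, HOG cell sizes and preprocessing are our assumptions for illustration and may differ from the exact GFSDCF configuration; the CN color-name feature is omitted for brevity.

```python
import torch
from skimage.feature import hog
from torchvision.models import resnet50

def handcrafted_features(patch_gray):
    """HOG descriptor of a grayscale patch (cell/block sizes are
    illustrative, not the exact GFSDCF settings)."""
    return hog(patch_gray, orientations=9, pixels_per_cell=(4, 4),
               cells_per_block=(2, 2), feature_vector=True)

def deep_features(patch_rgb_tensor):
    """Mid-level ResNet50 activations as deep features (a sketch; the
    layers and preprocessing used by GFSDCF may differ).

    patch_rgb_tensor: 1 x 3 x H x W tensor, ImageNet-normalized."""
    net = resnet50(weights="IMAGENET1K_V1").eval()
    with torch.no_grad():
        x = net.conv1(patch_rgb_tensor)
        x = net.maxpool(net.relu(net.bn1(x)))
        x = net.layer3(net.layer2(net.layer1(x)))
    return x    # 1 x 1024 x H/16 x W/16 feature map
```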

Fig. 8 Plots of GFSDCF on WATB using hand-crafted and deep features

Table 6 Ablation study of deep and hand-crafted features in WATB for GFSDCF

4.5 Qualitative Evaluation

For qualitative analysis of the existing trackers, we visualize the tracking results of six representative trackers: Stark, TransT, SiamBAN, GFSDCF, ASRCF and STRCF, covering Transformer trackers (Stark and TransT), a Siamese network based tracker (SiamBAN), CF based trackers with deep and hand-crafted features (GFSDCF and ASRCF), and a CF based tracker with hand-crafted features only (STRCF). We select seven typical animal sequences: BabyLion, Boar, Clownfish4, Crane, Gazelle1, Lynxlynx and Nightingale. Each sequence contains several challenge attributes and a different animal. The results are shown in Fig. 9.

In the BabyLion sequence, the target animal undergoes frequent rotation and partial occlusion, and similar targets appear to distract the trackers. When severe rotations occur (\(\#\)0243), Stark, TransT and SiamBAN accurately capture the target's position and scale, while the CF based trackers cannot adapt to the scale changes. After a long period of partial occlusion (\(\#\)0988), Stark is the first to recover from failure and capture the target animal accurately, followed by TransT (\(\#\)1137), while the other trackers drift to similar animals. The Boar sequence covers several challenge attributes, including motion blur, abrupt motion, out of view, scale variation and out-of-plane rotation. The results show that the CF based trackers cannot adapt well to the scale variations induced by frequent rotations, while the three deep learning based trackers handle these challenges well. The Clownfish4 and Crane sequences show some failure cases of Stark when the target animals frequently interact with similar animals amid frequent occlusions, scale variations and deformations; however, Stark successfully recovers from such failures and handles these accompanying challenges better than the others. In frame \(\#\)1196 of the Crane sequence, all trackers are distracted from the target animal, and a similar conclusion can be drawn for the Gazelle1 sequence. The Lynxlynx sequence is very challenging, containing severe deformation, scale variation and motion blur, and the target animal also jumps suddenly, changing its direction and shape. We can see that GFSDCF and STRCF drift after the first jump (\(\#\)0149), while after the third (\(\#\)0179) and fourth (\(\#\)0211) jumps, Stark can still track the animal stably. The Nightingale sequence contains severe abrupt motion induced by camera switching, with the target animal undergoing large positional changes. In the consecutive frames (\(\#\)0194-\(\#\)0197) where the target bird suddenly changes position by a large distance, all trackers fail to capture it, but Stark quickly recovers from the failure and recaptures the bird; TransT recovers at frame \(\#\)0224.

From the above qualitative analysis, we conclude that deep learning based trackers handle the various challenges of wild animal tracking better than CF based trackers. Some challenges, for example abrupt motion induced by camera switching, frequent interactions with similar animals, and frequent deformation, scale variation and rotation, cannot be handled well by CF based trackers. There are also failure cases for the best deep learning based trackers (Stark and TransT), which tells us that wild animal tracking is far from resolved.

Fig. 9 Qualitative evaluation results

Table 7 List of abbreviations and their full names

5 Discussion

The main contribution of this paper is to provide the community with a benchmark for evaluating trackers and to boost research on wild animal tracking. We have constructed WATB and released all resources on the project website. WATB consists of 206 video sequences covering the kinds of animals that commonly appear in documentaries. According to zoologists, there are about 1.5 million known animal species on Earth, so WATB will need to be extended to cover more species; finding videos of all wild animals seems impossible, but we will supplement WATB with absent species in future work. In addition, we do not provide a training set in WATB for deep learning trackers, because its 206 sequences are very few compared to existing large-scale training sets: the training set of GOT-10K contains more than 9,300 sequences, LaSOT has more than 1,100 training sequences, and TrackingNet has more than 30,000. Most existing deep learning based tracking models are trained on these sets. We will consider building a training set for wild animal tracking in future work.

According to our evaluation results, the strongest tracker on WATB is Stark (Yan et al., 2021), but its normalized precision and success AUC scores (0.577, 0.512) are much lower than those on the LaSOT dataset (0.770, 0.671; Yan et al., 2021). Similarly, STMTrack's AUC score on OTB100 is 0.719 (Fu et al., 2021), much higher than on WATB (0.486). This tells us that wild animal tracking is very challenging and far from resolved. The performance drop of these tracking algorithms on WATB has various causes, including missing training data, characteristic physical movements and constraints (e.g., biped, tetrapod, polyped locomotion), irregular or homogeneous skin texture, and limited resolution caused by video compression. A series of innovations is therefore required, including wild animal appearance modeling, more robust and efficient deep network designs, and wild animal datasets for training. Researchers from the computer vision community are encouraged to take part in the development of wild animal tracking.

6 Conclusion

In this paper, we introduce WATB, the first benchmark for wild animal tracking. WATB includes more than 200 video sequences with over 203,000 frames, covering a wide range of animal species. To the best of our knowledge, WATB is the first benchmark released in the computer vision community for evaluating object tracking methods on wild animals. We describe the construction of WATB in detail to explain the principles behind it. To understand how existing tracking methods perform on WATB, we extensively evaluate 38 popular trackers, covering emerging Transformer based trackers, Siamese network based trackers, and traditional CF based trackers with hand-crafted features, deep features and both. The quantitative and qualitative results demonstrate that Transformer based trackers handle the various challenging factors in wild animal tracking better than Siamese network based and CF based trackers, and also show that wild animal tracking is far from resolved. We believe WATB will ignite interest in wild animal tracking and consequently promote the development of wild animal behavior analysis and wildlife preservation.

7 Supplementary Information

The abbreviations used in this paper and their corresponding full names are shown in Table 7.