1 Introduction

Visual object tracking, or simply object tracking, is the process of maintaining an estimate of a specific object’s (or set of objects’) position(s) in a video sequence. This is closely related to the problem of video object detection [1], in which the task is to localize target object(s) in each image frame of a video sequence. With object tracking, however, an additional task is to predict a trajectory of the detected object(s). In many cases, object detection is a sub-task of visual tracking. The simplest case of visual tracking, single object tracking (SOT), considers the problem of tracking a single object in a video stream. In most cases, this task can be accomplished effectively by simply detecting the target object in each video frame [2]. Multiple object tracking (MOT) is a more complex problem involving the tracking of many objects simultaneously. Because of the complexity of MOT tasks, additional algorithms are often utilized to enhance robustness.

Important applications of object tracking include video surveillance [3], sports broadcasts [4], civil security applications [5], human–machine interaction [6], augmented reality [7], robotics and autonomous driving [8].

Visual appearance is the most important characteristic of physical objects that enables—in both biological and machine cognition—the effective recognition of different objects. Appearance modeling is aimed at encoding functional representations of visual features of objects that preserve their meaning under different viewing conditions. This is considered the most important task of the visual tracking problem [9, 10]. The main task in robust appearance modeling is to extract useful visual information from training images that are invariant under different real-world phenomena (e.g., varying illumination, scale changes, occlusions and deformations). The learned visual representations are then used to aid detection and tracking, thus making it possible to accurately track objects regardless of variations in object or scene appearance.

Object tracking settings are usually highly dynamic in nature, with constantly changing object appearances and environmental conditions. The typical tracking setting is characterized by complicating factors such as object interactions, camera motion, cluttered backgrounds, non-uniform illumination, motion blur, changing object scales, occlusions, varying view angles, nonlinear object deformations, and changing scene conditions. Under these circumstances, a target object model captured under particular conditions may be incapable of representing the object in subsequent frames when the viewing conditions change.

1.1 Related works

Given the practical importance of visual tracking, a large number of surveys have been conducted on different aspects of tracking. Most of these surveys are dedicated to either classical machine learning approaches (e.g., [4, 11,12,13,14,15]) or deep learning-based tracking techniques (e.g., [16,17,18,19,20]), while a few others (e.g., [21, 22]) deal with both classical and deep learning approaches. Many surveys treat visual tracking techniques from the perspective of a given taxonomy defined according to various criteria [18,19,20, 22]. For instance, Abbass et al. [16] classified tracking algorithms into methods that employ generative or discriminative models and techniques that utilize a combination of both approaches. They then presented an elaborate discussion of deep learning-based trackers under these broad methodological themes. Li et al. [20] introduced a taxonomy on the basis of network structure, function and training and presented a detailed description of deep learning-based trackers from the point of view of the proposed taxonomy. Similarly, in [19] Xu et al. categorized trackers into three groups, namely, deep network embedding-, description enhancement-, and end-to-end-based trackers. They further presented a detailed discussion on object tracking architectures and training methods for deep convolutional neural network (DCNN)- and recurrent neural network (RNN)-based trackers. Fiaz et al. [22] focused on techniques for tracking objects in noisy images. They classified visual tracking methods into correlation filter- and noncorrelation filter-based approaches and provided an extensive treatment of the common techniques in each of the categories based on the general architectures and tracking procedures.

Other works treat object tracking methods based on their constituent components (e.g., [15, 21]) or the main sub-tasks [12, 14, 17] in the tracking pipeline. Notably, [15, 21] presented deep learning-based visual trackers based on their key components and discussed extensively the application of deep learning methods in each component. In [15], Luo et al. classified MOT algorithms according to three different criteria: initialization method, image processing approach and output type. They then presented a generalized object tracking pipeline and the essential components of MOT models and, for each component, discussed the common issues and implementation details. Sugirtha and Sridevi [23] focused on the various stages of video object detection and tracking. [21] focuses exclusively on tracking-by-detection frameworks and the application of different deep learning techniques in the various sub-tasks of tracking.

Several surveys [4, 21, 24,25,26,27] deal with tracking issues in specific domains. These include animal tracking [25, 28], human tracking in specific contexts (e.g., in football games [4, 24]), football tracking [26], vehicle tracking [28, 29], pedestrian tracking [21, 24], and both vehicle and pedestrian tracking [27].

Datasets, evaluation metrics and extensive analysis of the performance of different trackers are presented in [16,17,18, 20, 22, 24]. In addition to these surveys, the performance results of many state-of-the-art trackers are presented in the reports of annual object tracking competitions—notably, the Visual Object Tracking (VOT) for SOT trackers [30,31,32], and the Multiple Object Tracking (MOT) challenges [33].

Despite the importance of appearance modeling in visual tracking, only a few surveys [11, 12] are dedicated solely to appearance modeling. However, even these surveys focus exclusively on classical approaches to appearance modeling. To date, no single work has covered deep learning-based approaches to appearance modeling in sufficient detail. We propose this survey to address this gap.

1.2 Scope and outline of study

In view of the issues that have already been tackled by previous survey papers, we limit the scope of this review to studying deep learning-based robust appearance modeling techniques. We specifically focus on special deep neural network topologies and auxiliary strategies that are employed in conjunction with classical deep CNNs for invariant representation of visual appearance features. The techniques are aimed at improving the robustness of object tracking models in general settings. In addition, we discuss common evaluation metrics and present quantitative performance results on several state-of-the-art visual trackers.

The paper is structured as follows. Section 1 provides a general background to the problem of object tracking and highlights the importance of appearance modeling in visual tracking. It also explores related surveys of deep learning approaches to object tracking and outlines the main differences from the current work. Section 2 presents a general framework of visual tracking and the various subtasks involved in the tracking process. In Sects. 3 to 7, we conduct a thorough survey of state-of-the-art deep learning approaches for encoding robust appearance features for object detection and tracking tasks. Section 8 presents common datasets, evaluation methods and performance results of the surveyed approaches. In Sect. 9, we summarize and discuss the major issues of object detection and tracking algorithms. Section 10 explores potential developments and directions for future research. Finally, in Sect. 11, we conclude by recapping the main issues discussed in this work.

Fig. 1 General structure and workflow of object tracking algorithms

2 Appearance modeling in tracking

In this section, we present a generic structure of an object tracker in the context of deep learning and summarize general approaches to appearance modeling based on deep learning techniques.

2.1 General framework of object tracking models

We present a generalized architecture of object tracking models and briefly describe its components. We utilize the conceptual framework for object tracking proposed by Wang et al. in [10]. Per this framework, a tracker is essentially made up of a number of distinct components, each performing a different function: motion model, feature extractor, observation model, model updater, and ensemble post-processor [10]. With some modifications, we represent this generic architecture in the context of deep learning-based visual tracking in Fig. 1.

The appearance model encodes invariant representation of visual features, while the motion model estimates the location of the target object in subsequent frames. As shown in the diagram, the extracted features are used to build both appearance and motion models, which together form the basis for the observation model used to make predictions about target locations. In a deep learning setting, the observation model may be a neural network sub-model that aggregates the outputs of the appearance and motion models. An often critical component of most online trackers is the model updater. It performs periodic updates to allow temporal context of the video sequence to be incorporated in the tracking process.

There may also be an ensemble post-processor [10] (which we term the Auxiliary Module) for performing additional functions such as fusing the predictions of several tracklets in cases where multiple observations are made about the same object(s) (see Fig. 1). In particular, data association and affinity computation [17] are common tasks that provide additional information which can be used to compensate for detection errors and help to localize target instances or recover missing observations. Other post-processing tasks may include the removal of false detections or the interpolation of trajectories in case of discontinuities (e.g., due to occlusions) [34, 35].
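
As a minimal illustration of how these components interact, the following sketch outlines a per-frame tracking loop in the spirit of Fig. 1. The class and component interfaces are hypothetical and serve only to make the data flow explicit; they do not correspond to any particular tracker in the literature.

```python
class GenericTracker:
    """Skeleton of the generic tracking pipeline of Fig. 1 (hypothetical interfaces)."""

    def __init__(self, feature_extractor, appearance_model, motion_model,
                 observation_model, model_updater, auxiliary_module=None):
        self.feature_extractor = feature_extractor  # e.g., a pre-trained CNN backbone
        self.appearance_model = appearance_model    # invariant target appearance encoding
        self.motion_model = motion_model            # predicts a search region per frame
        self.observation_model = observation_model  # fuses appearance and motion cues
        self.model_updater = model_updater          # periodic online update of the models
        self.auxiliary_module = auxiliary_module    # optional post-processing (association, fusion)

    def track(self, frames, init_box):
        state, trajectory = init_box, [init_box]
        for t, frame in enumerate(frames):
            search_region = self.motion_model(state)                  # where to look next
            features = self.feature_extractor(frame, search_region)
            candidates = self.observation_model(self.appearance_model(features), search_region)
            state = max(candidates, key=lambda c: c[1])[0]            # best-scoring (box, score) pair
            if self.auxiliary_module is not None:
                state = self.auxiliary_module(state, trajectory)      # e.g., trajectory smoothing
            self.model_updater(self.appearance_model, features, state, t)
            trajectory.append(state)
        return trajectory
```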

2.2 Overview of common deep learning approaches to appearance modeling

Invariably, the first step of object tracking involves learning an appearance model for the objects to be tracked. This requires extracting a compact set of invariant image features, based on which the tracking can be performed. We present the most common approaches to deep learning-based appearance modeling in the following sub-sections.

2.2.1 Classification-based deep CNN trackers

The simplest deep learning-based tracking approaches utilize deep convolutional neural networks as binary classifiers, where the main tracking task consists in distinguishing between the target object and the background in each video frame. In general, feature extraction takes place in the initial CNN layers, while the classification process is performed in the last layers of the CNN model (e.g., [36,37,38,39]), but it can also be performed in a separate machine learning model (e.g., in [40, 41]). Support vector machines (SVMs) are particularly popular in this regard [40,41,42,43]. The described trackers are essentially end-to-end deep networks that directly predict the presence of target objects in the video frames under consideration. Some works [44] propose training CNN classifiers online to perform tracking. However, since the amount of training data that can be obtained online is naturally small, online training approaches are subject to severe overfitting. To overcome this limitation, approaches [36, 41, 45] have been proposed to train CNN models offline with external images or videos. Typically, to extract useful features, many approaches utilize off-the-shelf deep CNN models that have been pre-trained on large-scale datasets. Because of the domain shift problem [46], it is often necessary to fine-tune these models using data from the target domain. In [45], for instance, Wang et al. performed offline training on large-scale image datasets and then fine-tuned the model online. [41] utilized pre-trained CNN models and performed online learning using an SVM.
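
The following sketch illustrates this pattern under stated assumptions: a pre-trained backbone (here ResNet-18 from torchvision, chosen purely for illustration) serves as a fixed feature extractor, and a linear SVM is trained online to separate target crops from background crops. Candidate sampling and the motion model are omitted.

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF
from sklearn.svm import SVC

# Pre-trained backbone used as a fixed feature extractor (no fine-tuning in this sketch).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classification head, keep 512-d features
backbone.eval()

@torch.no_grad()
def crop_features(frame, box):
    """Deep features of an image crop; `frame` is an HxWx3 uint8 array, `box` is (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    crop = TF.to_tensor(frame[y0:y1, x0:x1]).unsqueeze(0)
    crop = TF.resize(crop, [224, 224])
    return backbone(crop).squeeze(0).numpy()

def fit_target_classifier(frame, pos_boxes, neg_boxes):
    """Train a binary SVM to distinguish target crops (label 1) from background crops (label 0)."""
    X = [crop_features(frame, b) for b in pos_boxes + neg_boxes]
    y = [1] * len(pos_boxes) + [0] * len(neg_boxes)
    return SVC(kernel="linear", probability=True).fit(X, y)

def locate_target(clf, frame, candidate_boxes):
    """Score candidate boxes in a new frame; the most target-like box is the new estimate."""
    scores = [clf.predict_proba([crop_features(frame, b)])[0, 1] for b in candidate_boxes]
    return candidate_boxes[max(range(len(scores)), key=scores.__getitem__)]
```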

The main advantage of classification-based tracking approaches is the simplicity of the problem formulation and the ability to work seamlessly with large-scale datasets using pre-trained image classification models. However, because of this simplicity, they are often limited to the SOT task or to less challenging MOT scenarios.

2.2.2 Correlation filter-based trackers

Correlation filter (CF) [47] approaches have been widely used in deep learning-based tracking [48,49,50,51,52,53]. Correlation filters operate on appearance features extracted by CNN models and use cross-correlation to associate and locate target objects. The technique translates complex time-domain operations into simple, element-wise multiplications in the Fourier domain. Because of this simplicity, computational efficiency and high performance, correlation filter-based methods have become one of the most popular approaches for matching and locating target objects.
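
As a minimal illustration of this idea, the sketch below trains a single-channel, MOSSE-style filter in closed form and evaluates it via element-wise multiplication in the Fourier domain. Real CF trackers operate on multi-channel CNN feature maps with additional regularization and online updates; the toy feature map and Gaussian label here are assumptions made purely for illustration.

```python
import numpy as np

def train_correlation_filter(feature_map, gaussian_target, reg=1e-3):
    """Closed-form filter in the Fourier domain: H* = (G . conj(F)) / (F . conj(F) + reg)."""
    F = np.fft.fft2(feature_map)
    G = np.fft.fft2(gaussian_target)
    return (G * np.conj(F)) / (F * np.conj(F) + reg)

def correlation_response(filter_hat, feature_map):
    """Cross-correlation becomes element-wise multiplication in the Fourier domain."""
    F = np.fft.fft2(feature_map)
    response = np.real(np.fft.ifft2(filter_hat * F))
    return np.unravel_index(np.argmax(response), response.shape)  # peak = target location

# Toy usage: a Gaussian label centred on the target position in the training patch.
h, w = 64, 64
yy, xx = np.mgrid[0:h, 0:w]
label = np.exp(-((yy - h // 2) ** 2 + (xx - w // 2) ** 2) / (2 * 3.0 ** 2))
patch = np.random.rand(h, w)                   # stand-in for one CNN feature channel
H = train_correlation_filter(patch, label)
print(correlation_response(H, patch))          # ~ (32, 32) on the training patch
```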

Fig. 2 A generalized tracking-by-detection-based appearance modeling framework for robust visual tracking. It incorporates a two-stage detection scheme and data association sub-models as the main components. Data association primarily involves re-identification and affinity matching. As depicted here, different techniques are utilized to encode robust features for detection and to extract invariant features from the detected bounding boxes for re-identification

2.2.3 Tracking-by-detection approaches

Currently, the overwhelming majority of deep learning-based tracking algorithms are based on so-called tracking-by-detection approaches. They perform tracking in two stages: detection and association. This involves first localizing target objects with object detectors in the initial frame and then finding correspondences between the initial detections and the detections in each subsequent frame. Such a decoupled formulation of the tracking problem makes it possible to tackle each of the two tasks (object detection and temporal association) separately through different robust appearance modeling techniques. A detailed scheme of this framework is shown in Fig. 2. We describe the important tasks below.

(a) Detection. The first step in tracking is usually to initialize the detector with a bounding box that describes the current location of the target. This can be accomplished manually or automatically [15]. For automatic initialization, bounding box proposals for probable target locations are generated by pre-trained object detectors. Many approaches utilize standard CNN-based object detectors such as Faster R-CNN (e.g., in [54]), SSD (e.g., in [55]) and YOLO (e.g., in [56]). Since two-stage detection frameworks such as [54] are generally more robust than their one-stage counterparts [57] like SSD [55] and YOLO [56], they are more commonly used in applications where robust performance is critical and computational efficiency is not a major concern. Two-stage detectors (shown in the diagram in Fig. 2) compute region proposals and align the encompassed features in the first stage and then predict their categories in the second stage. In contrast, one-stage detectors classify features straightaway in a single stage. While standard object detection pipelines are commonly used for the detection task, many recent approaches [56] have proposed to augment these detectors with additional robust appearance models or to utilize custom detection models (e.g., [58,59,60]) for robust object detection. Automatic target initialization requires that arbitrary targets in the initial frame be accurately detected and, in the case of MOT, appropriately assigned identifiers. However, owing to the complexity of real-world tracking settings, detections may be poor for arbitrary objects. To alleviate this problem, many approaches utilize advanced appearance modeling techniques to enhance detection accuracy and robustness. This makes it possible to detect target objects more effectively at the initialization stage, as well as to perform re-identification (Re-ID) and re-detection in subsequent frames regardless of appearance variations.

(b) Re-identification. For each of the generated bounding boxes, visual features are extracted for use by a re-identification sub-network. In general, the regions within the detector bounding boxes are taken as positive training samples, while regions outside the bounding boxes are considered negative training data. Thus, for each object, there usually exists only one positive target sample and potentially infinitely many negative ones. To solve this sample imbalance problem, some authors [61] have proposed to sample several positive examples around the vicinity of each bounding box. However, this degrades the quality of positive samples and ultimately contributes to poor performance. State-of-the-art approaches tackle the data imbalance problem by utilizing advanced appearance modeling techniques that encode invariant representations of visual features from the single accurate positive sample generated by the detector. While both detection and re-identification need good features for robust performance, they typically utilize different kinds of features [62]. The detector performs inference at the object level (i.e., using high-level semantic features obtained from deeper layers), while re-identification operates on invariant, low-level features from shallower layers that capture intra-class variations. Thus, it is common to adopt two different sets of robust feature representation schemes for detection and re-identification.

(c) Auxiliary tasks. In many state-of-the-art tracking algorithms, especially in MOT, additional subtasks such as affinity computation are frequently used to improve tracking performance in challenging situations. Several different techniques [63,64,65,66] have been proposed to enhance data association or compute affinity for matching candidate objects with target instances. In the literature, some of the most popular techniques include Bayesian methods (e.g., [63]), deep reinforcement learning (e.g., [64]), the Hungarian algorithm (e.g., [66]), particle filters (e.g., [67, 145]) and linear programming (e.g., [65]). Most recently, a number of authors [68, 69] have proposed replacing these heuristic data association techniques with differentiable neural network sub-models.
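
The following sketch ties the detection and association stages together in a minimal frame-by-frame loop. It uses IoU as the affinity measure and the Hungarian algorithm (scipy's linear_sum_assignment) for matching; the detector interface, the IoU threshold and the omission of appearance (Re-ID) affinity and track management (births/deaths) are simplifying assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """Match existing tracks to new detections by maximizing total IoU affinity."""
    if not tracks or not detections:
        return [], list(range(len(detections)))
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)                 # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_threshold]
    matched_dets = {c for _, c in matches}
    unmatched = [c for c in range(len(detections)) if c not in matched_dets]
    return matches, unmatched

def track_video(frames, detector):
    """Tracking-by-detection: detect objects in every frame, then link them across frames."""
    tracks, next_id, output = {}, 0, []
    for frame in frames:
        detections = detector(frame)                         # e.g., boxes from Faster R-CNN
        ids = list(tracks.keys())
        matches, unmatched = associate([tracks[i] for i in ids], detections)
        for r, c in matches:                                  # matched detections keep their identity
            tracks[ids[r]] = detections[c]
        for c in unmatched:                                   # unmatched detections start new tracks
            tracks[next_id] = detections[c]
            next_id += 1
        output.append(dict(tracks))
    return output
```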

As a result of recent advances in robust visual feature embedding techniques, a number of authors [70, 71] have proposed using detections alone to accomplish object tracking. These approaches formulate the tracking problem as a frame-to-frame re-identification task. For instance, in [70], Bergmann et al. proposed a detector-only tracking approach that outperformed more complex models in a range of multiple object tracking tasks on standard benchmarks. In this case, the re-identification model was trained offline and employed to perform detections during the tracking process. However, Jia et al. [72] suggest that these approaches may be weak against adversarial attacks. Other recent approaches [62, 73,74,75,76] have suggested jointly performing detection and tracking as a one-step process so that each task can benefit from the other. For instance, in [73] Feichtenhofer et al. applied detection and tracking as complementary processes for better performance: trajectory predictions are used to refine detections and vice versa.

2.2.4 Advanced deep learning-based appearance modeling techniques

As outlined above, classical deep learning techniques are inadequate for appearance modeling in complex domains. To overcome this limitation, several lines of work have been proposed. In the following sections, we explore these approaches in detail using the taxonomy depicted in Fig. 3. These advanced appearance modeling techniques facilitate invariant feature representation that enables accurate and robust detection and re-identification.

Fig. 3 Taxonomy of advanced deep learning-based appearance modeling methods discussed in this paper

3 Data-centric approaches

One of the most important factors that accounts for the astounding success of deep learning approaches in machine vision tasks is the availability of large and rich annotated training data. However, visual tracking tasks usually involve dealing with arbitrary objects in an online manner, where the possibility of obtaining relevant training data in sufficient quantity is severely limited. This limitation often results in relatively poor generalization performance of deep learning methods in object tracking tasks as compared to other machine vision settings like object classification. Many authors have proposed to alleviate this problem by utilizing various techniques to generate large and diverse training data that cover all possible appearance conditions.

3.1 Manual data augmentation

An important problem in many practical machine vision applications is the class imbalance problem [77, 78], a situation in which training data is excessively skewed towards particular categories. In object tracking settings, this usually manifests as a relative scarcity of positive instances compared to negative ones [79, 80], which presents enormous difficulties for creating appearance models that are robust against different viewing conditions. One way to address the problem is to employ manual data augmentation techniques [81, 82]. These approaches focus on manually generating more diverse positive samples that capture all possible appearance variations in the particular setting. In [81], Bhat et al. exploited different data augmentation strategies in which positive samples are manually created to improve the robustness of the resulting model in object tracking tasks. Approaches utilizing synthetically generated data have also been suggested [83,84,85] to provide diverse positive samples for improved generalization performance. Augmenting training data with negative samples has also been shown to be effective in visual tracking. For instance, in [79], Zhu et al. proposed to improve the discrimination of targets from the semantic background (i.e., other objects in the scene) by introducing hard negative samples into the training data through data augmentation.
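
A minimal sketch of this idea is shown below, using torchvision transforms to enlarge the set of positive samples from a single target crop. The specific transforms and their parameters are illustrative assumptions chosen to mimic common appearance changes (illumination, viewpoint, blur, partial occlusion); they are not the augmentation strategies of [81] or [82].

```python
import torchvision.transforms as T

# Hand-designed augmentations approximating common appearance changes in tracking:
# illumination shifts (ColorJitter), viewpoint/scale changes (RandomAffine),
# motion blur (GaussianBlur) and partial occlusion (RandomErasing).
positive_augment = T.Compose([
    T.ToTensor(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.3),
    T.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.8, 1.2)),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    T.RandomErasing(p=0.5, scale=(0.02, 0.2)),
])

def augment_positive_samples(target_crop, n_samples=16):
    """Generate several augmented views of the single positive (target) crop."""
    return [positive_augment(target_crop) for _ in range(n_samples)]
```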

Although manual augmentation techniques have successfully been used to improve the robustness of deep learning models in many machine vision domains, they have a limited scope of application in visual tracking. The main reason for this limitation is that in many visual tracking tasks, target objects are not usually known a priori; the appearance details are determined online only upon initialization, making it challenging to apply manual augmentation in the tracking process. In addition, the process of creating new samples using manual data augmentation techniques such as [81, 82] is notoriously time-consuming and can only be carried out by an expert with extensive knowledge of the end application domain. Moreover, in many cases, the manually created data may not be semantically rich and meaningful enough to capture complex appearance variations in real-world settings. This can lead to poor performance in practical applications. These issues are addressed by generative modeling techniques that perform automatic data augmentation.

3.2 Generative modeling

A recent trend is to employ deep learning algorithms to automatically generate relevant training data that extends and diversifies the original data. The main idea of generative modeling is to automatically create “artificial” data that contain the same predictive features as the tracked instance. Generative methods are desirable both for their ease of implementation and for their scope of application; models based on them are generally invariant under more diverse transformations of the target appearance, including complex nonlinear transformations that cannot be generated manually.

Fig. 4 Generalized architecture of a Generative Adversarial Network (GAN). The generator takes random noise as input and transforms it into image samples. The discriminator computes the classification loss and propagates it back to the generator

3.2.1 Automatic data generation based on Generative Adversarial Networks

The most popular class of approaches [80, 86, 87] for generating training data in object tracking domains is based on generative adversarial network (GAN) [88] architectures. A GAN approximates the distribution of the input data and generates new samples from that distribution, thereby overcoming problems of sample scarcity and data imbalance. A GAN is a composite neural network made up of a generator and a discriminator that are designed to compete with each other (Fig. 4). Usually, the discriminator is simply a standard CNN classifier whose task is to distinguish generated images from real ones. The generator’s goal, on the other hand, is to produce data that is as realistic as possible, making it difficult for the discriminator to tell generated samples from real ones.

A repeated process of generation and discrimination is carried out until convergence, when the generator learns to synthesize data that is so close to the input sample that the discriminator is unable to distinguish between the real and generated data.
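
A compact sketch of this adversarial training procedure is given below. The fully connected generator and discriminator, the patch size and the hyperparameters are illustrative assumptions; GAN-based trackers typically operate on convolutional features and use more elaborate architectures.

```python
import torch
import torch.nn as nn

latent_dim, patch = 64, 32  # generate 3 x 32 x 32 patches (flattened) from 64-d noise

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 3 * patch * patch), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(3 * patch * patch, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),  # real/fake logit
)

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def gan_step(real_patches):
    """One adversarial update: D learns to separate real from fake, G learns to fool D."""
    b = real_patches.size(0)
    fake = generator(torch.randn(b, latent_dim))

    # Discriminator update.
    d_loss = bce(discriminator(real_patches), torch.ones(b, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: maximize D's belief that fakes are real.
    g_loss = bce(discriminator(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Toy usage: real_patches would be flattened target crops in [-1, 1].
print(gan_step(torch.rand(8, 3 * patch * patch) * 2 - 1))
```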

Fig. 5 Generative adversarial network (GAN)-based appearance modeling approach proposed in [89]. It utilizes a sample-level data generation sub-model based on the conventional GAN architecture and a feature-level generation sub-model that diversifies features by occlusion masking

In many machine vision settings, the goal of generative modeling is to generate artificial samples that look as realistic as possible. In contrast, common implementations [80, 90,91,92] of GANs in object tracking domains are designed to accomplish feature-level generation. This typically consists in first generating an output mask from convolutional features and then using it to alter the output features of training images in a way that produces artificial variations, which are subsequently learned through adversarial training. In [90], Yin et al. proposed a GAN-based tracker that generates random masks adversarially with the help of cropped images placed around input image samples. The masks are then used to produce richer appearance variations that are learned by the model. [91] employs a CNN classifier that leverages an attention mechanism to enhance the robustness of the network in [90] against appearance drift.

Most of the recent GAN-based approaches (e.g., [80, 92]) additionally exploit strategies to select a subset of features out of the generated samples, namely those that are most robust with respect to the given context. The goal is to improve performance by retaining only the most robust features of the tracked instance, which can then be used to train a final classifier. In [92], Javanmardi et al. argued that randomly masking out features to produce appearance variations, as implemented in [90], for example, may lead to the loss of useful information. To address this problem, they proposed to generate an adaptive mask that aligns the most informative features of local image regions of the most recent scenes with those of earlier target images. In [80], the authors proposed a tracker that augments positive samples through adversarial learning. They incorporated a generator-discriminator pair into a conventional CNN architecture, specifically a VGG-M model [93]. The generator produces masks which are used to adaptively mask out the input convolutional features of positive samples. This procedure yields multiple output features corresponding to different appearance changes. The discriminator is then trained to be robust to these visual appearance variations.
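
The following sketch captures the core of this feature-level masking idea in the spirit of [80]: a small mask generator proposes which regions of a convolutional feature map to suppress, the classifier is trained on the masked features, and the generator is trained to make classification as hard as possible. The feature dimensions, architectures and optimizers are assumptions for illustration only.

```python
import torch
import torch.nn as nn

C, H, W = 256, 7, 7  # assumed shape of the convolutional features of one sample

mask_generator = nn.Sequential(      # proposes a spatial mask from the feature map
    nn.Conv2d(C, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
)
classifier = nn.Sequential(          # target-vs-background classifier on (masked) features
    nn.Flatten(), nn.Linear(C * H * W, 1),
)
opt_c = torch.optim.Adam(classifier.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(mask_generator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def adversarial_masking_step(features, labels):
    """The classifier minimizes the loss on masked features; the mask generator maximizes it,
    i.e. it learns to occlude the most discriminative regions (hard appearance variations)."""
    # Classifier update on features masked by the (frozen) generator.
    with torch.no_grad():
        mask = mask_generator(features)
    cls_loss = bce(classifier(features * (1.0 - mask)), labels)
    opt_c.zero_grad(); cls_loss.backward(); opt_c.step()

    # Generator update: make the masked features as hard to classify as possible.
    mask = mask_generator(features)
    gen_loss = -bce(classifier(features * (1.0 - mask)), labels)
    opt_g.zero_grad(); gen_loss.backward(); opt_g.step()
    return cls_loss.item(), -gen_loss.item()

# Toy usage with random features and labels (1 = target, 0 = background).
print(adversarial_masking_step(torch.randn(8, C, H, W), torch.randint(0, 2, (8, 1)).float()))
```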

There are a number of GAN-based approaches (e.g., [89, 94,95,96]) that formulate the tracking problem as a similarity learning problem. To provide robustness to more diverse tracking problems, Han et al. [89] utilized two separate GAN modules to handle sample- and feature-level generation (Fig. 5). First, a sample GAN (SGAN) model generates diverse training samples which are then fed into a feature GAN (FGAN) that learns to generate diverse features for different appearance conditions such as deformations, occlusions and motion blur.

3.2.2 Other generative modeling methods for automatic data augmentation

Although GANs remain the predominant approach to generative modeling, the use of other generative modeling techniques for robust image feature generation has been growing over the years. Researchers have explored a number of related techniques to improve the quality of feature representations and generalizability. Most notably, approaches based on autoencoders [100,101,102] and variational autoencoders (VAEs) [95, 98, 99, 103] have demonstrated good performance. To address overfitting problems arising from small training data, Liu et al. [102] employed an autoencoder sub-network to impose a constraint on the loss function. In [98], Kim et al. used a conventional variational autoencoder (VAE) to learn rich spatial information about objects, demonstrating the use of VAEs for generating rich appearance features for tracking. In [99], Lin et al. used a custom variational autoencoder consisting of three encoder branches to extract visual features at different semantic levels for video object segmentation and tracking. The extracted visual features are used to enhance the robustness of Mask R-CNN segmentation during tracking. The branches capture different levels of abstraction: the shallowest branch is sensitive to simple, localized image features such as lines and their orientations, while the responses of the deeper branches are more complex, abstract and position-independent, an organization reminiscent of the hierarchical processing in the visual cortex. Methods have also been developed that combine different generative schemes to produce better appearance features. For example, Wang et al. [95] proposed a generative modeling technique using the earlier developed Siamese Instance Search Tracker (SINT) [104] as a backbone model. Their approach uses two different subnetworks: a Positive Sample Generation Network (PSGN) based on the VAE architecture to generate and augment positive samples, and a so-called Hard Positive Transformation Network (HPTN) based on a deep Q-network to create occlusion and deformation patterns that can be learned by the discriminator. The final component, a Siamese network, is used to infer the similarity between the target sample initialized in the first frame and candidate samples in subsequent frames. Common generative modeling-based trackers and their constituent components are summarized in Table 1.
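
To make the VAE component concrete, the sketch below shows a minimal variational autoencoder over flattened image patches together with the standard evidence lower bound loss. The layer sizes and patch dimensions are assumptions for illustration and do not reproduce the architectures in [98] or [99].

```python
import torch
import torch.nn as nn

class PatchVAE(nn.Module):
    """Minimal VAE over flattened image patches (illustrative sizes)."""
    def __init__(self, in_dim=3 * 32 * 32, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, z_dim), nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    """Reconstruction term plus KL divergence to the standard normal prior."""
    rec = nn.functional.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

# After training, new positive samples can be drawn by decoding z ~ N(0, I):
vae = PatchVAE()
samples = vae.dec(torch.randn(4, 32))   # four synthetic 32x32 patches (flattened)
```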

Table 1 Representative generative modeling-based trackers and their construction

3.2.3 Feature hallucination techniques

In contrast to the aforementioned methods such as [80, 90,91,92, 94,95,96], which aim to improve robustness by generating feature masks to increase the diversity of training data, some of the more recent generative modeling approaches, known as hallucination methods (e.g., [97, 105, 107, 108]), aim to directly transfer different visual phenomena from training data to unseen data, thereby generating novel views. The concept of hallucination is motivated by the ability of humans to imagine new visual contexts from observations [97, 105, 106, 108]. The main idea is to learn image transformations from exemplar images and then apply this knowledge to unseen object classes in novel contexts. These techniques therefore make it possible to learn robust visual feature representations that can be applied across multiple domains and tasks. They generally utilize an encoder-decoder scheme in which the encoder learns transferable image transformations from pairs of exemplar images of the same class (e.g., different poses, scales and illumination conditions), and the decoder learns to apply these transformations to new categories. For instance, in [90] Wu et al. proposed to generate new image samples using an encoder-decoder network based on what they termed an Adversarial Hallucinator (AH). The hallucinator generates transformed images which are then used to train CNN classifiers. In addition, they incorporated a so-called selective deformation transfer (SDT) sub-model to select and transfer the most relevant transformations to unseen contexts. In [106], Wei et al. proposed a re-identification framework, PTGAN, that uses a GAN to transfer persons in labeled datasets to novel styles (i.e., appearance conditions such as different backgrounds, illuminations and view angles) while preserving the features that define the identity of the persons. Amirkhani et al. [109] employ a visual style transfer technique to compose a new training dataset from an existing one and combine the two to obtain larger and more diverse data for training object trackers. The various data augmentation methods described in this section are summarized in Table 2.

Table 2 Summary of the major data augmentation approaches

4 Compositional part modeling

A part model of an object is understood as the set of simple geometric primitives that provides a meaningful representation of that object. The rationale for this approach is that appearance variations of object parts are generally much less drastic than the possible variations of the object as a whole. Hence, simpler models and smaller datasets can be used to obtain robust models effectively. Many different approaches are used to encode compositional parts as information priors in deep learning pipelines (Fig. 6). In general, object classes are represented as mixtures of parts, with each part representing specific appearance instances such as different viewpoints [110, 111], size variations [112], pose instances [113] or occlusion extents [114]. In many tracking applications (e.g., [115, 116]), compositional part models serve to enhance the robustness of object detectors. The main strength of compositional parts is their ability to handle complex transformations such as nonlinear deformations and significantly occluded objects, even when trained without transformed examples [114, 117]. Two broad strategies of part-based approaches can be identified: approaches that explicitly formulate part models as representation priors and those based on deeply learned parts. In the first family of approaches, object parts are manually modeled independently before some algorithm, usually a machine learning model, is used for feature classification. In the second case, part-level representations are learned directly end-to-end from deep CNN feature maps.

Fig. 6 Taxonomy of part modeling approaches based on representing compositional parts as information priors in deep learning pipelines

4.1 Part models as representation priors in deep CNNs

A large number of approaches [118,119,120,121,122,123] propose to explicitly model compositional parts as representation priors in object detection and tracking pipelines. These approaches usually treat feature learning as a two-step process: building informative, invariant mid-level features as vectors of compositional parts, and using deep CNN models to learn robust representations for these parts. The simplest approaches to compositional part modeling utilize natural images that are artificially divided into grids or smaller patches [119, 121, 122, 124, 125]. In [122], for example, Tian et al. proposed a part-based pedestrian detection technique utilizing a pool of human body parts defined over a rectangular human body grid and then trained a CNN classifier to learn relevant features for each of these parts by sliding filters over the entire grid. Another common method for compositional part modeling is to segment training images on the basis of low-level pixel properties into superpixels [126, 127]. This approach is based on the intuition that pixels sharing common visual characteristics in a given region may represent a unique semantic context. Superpixels are commonly defined by clustering algorithms [128]. However, newer approaches [129,130,131] have proposed learning superpixels end-to-end with deep neural networks.
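
A minimal sketch of the simplest, grid-based decomposition is given below: a CNN feature map is divided into a regular grid and each cell is pooled into one part-level descriptor. The grid size and pooling choice are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def grid_parts(feature_map, grid=3):
    """Split a CNN feature map (C, H, W) into a grid x grid set of part-level descriptors
    by average-pooling each cell, yielding one vector per compositional part."""
    pooled = F.adaptive_avg_pool2d(feature_map, (grid, grid))
    return pooled.flatten(1).t()                    # (grid * grid, C): one descriptor per part

parts = grid_parts(torch.randn(256, 28, 28))        # e.g., 9 part descriptors of dimension 256
print(parts.shape)                                   # torch.Size([9, 256])
```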

More sophisticated compositional part modeling techniques such as [110, 111, 117, 132,133,134] encode additional information such as spatial dependencies among constituent parts. To handle object deformations, for example, deformable part models (DPMs) [135, 136] encode deformations from part displacements. DPMs are often used to support the object detection sub-task, where they help to encode robust features in region-based CNN detection models [115, 116]. For instance, in [137] Ouyang et al. used deformable part models to generate region proposals containing deformable object parts. A dense subgraph discovery (DSD)-based filter is then used to select the most useful region proposals.

Richer part-based methods model the structural features of an object based on its constituent parts and their spatial relationships [133]. In this regard, the structural information of objects in images is represented using simple sub-entities that are themselves described by even simpler entities. The most advanced part models (e.g., [138,139,140,141]) are typically described by hierarchical graph structures in the form of nodes and links, which encode more detailed information about the spatial properties of the constituent parts, including local interactions. In [139], for instance, Wang et al. proposed an appearance model for object tracking using a graph-based architecture consisting of multiple CNNs to encode the visual features of local parts. The learned features are then fused using a regularization framework. Similarly, in [138], Nam et al. employed separate CNN sub-networks in a hierarchical, tree-like arrangement to model the appearance of different parts. In their implementation, the edges of the structure characterize the structural relationships that exist among the different parts (represented by the different CNN sub-networks). To simplify the representation, some graph-based approaches (e.g., [142, 143]) utilize superpixel information to segment images into parts which are then defined as graph elements.

Despite the aforementioned advantages of using information priors in the form of compositional parts, the approach has a number of significant drawbacks. First, object tracking based on parts results in the loss of high-level information, thereby reducing performance in some cases. Second, building rich part models is usually a labor-intensive and time-consuming process. Another difficulty when using explicit part models as representation priors is the inability of human experts to manually identify parts that are optimal for visual recognition tasks. In view of these limitations, several authors propose to learn part representations automatically in an end-to-end manner.

4.2 Deeply learned quasi-compositional part representations from mid-level CNN features

In [144,145,146] it was shown that in deep convolutional neural networks, part-level information is present in the intermediate layers and that extracting features from these layers can provide contextual hierarchy in object representations. This concept has two main advantages. First, it generally requires no additional model parameters, since the mid-level features are mined from existing layers of the network. Second, the need to adapt filters or to exploit complex network structures for learning invariance is eliminated, providing a simpler approach to appearance modeling. Inspired by this finding, a large number of recent approaches [50, 147,148,149,150,151,152,153,154] design end-to-end deep CNN models that learn quasi-part representations directly from image-level data. These methods unify the processes of part modeling and feature representation by jointly extracting part-level features from deep CNN layers and learning suitable representations from the extracted parts. In [50], Ma et al. used features from early CNN layers to encode more nuanced spatial details while employing the last activation layer to capture object semantics. Many approaches employ special strategies such as dedicated compositional part filters [153, 154], unsupervised clustering [155, 156], special activations [157,158,159] or pooling techniques [153] in selected CNN layers to learn high-level compositional parts. For instance, to overcome the limitations of conventional pooling techniques like average pooling and max pooling in encoding part-level information, Ouyang and Wang [160] proposed a part-based CNN model that incorporates a deformation layer between the fully connected layer and the last convolutional layer to capture part deformations. Ouyang et al. [153] extended this concept by introducing deformation (def-) pooling, which is designed to replace conventional pooling layers at multiple locations within a deep CNN.
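
The sketch below illustrates how mid-level and high-level features can be tapped from an existing pre-trained backbone without adding parameters, in the spirit of [50]. The choice of ResNet-18 and of the layer names is an assumption made purely for illustration.

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights
from torchvision.models.feature_extraction import create_feature_extractor

backbone = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

# Tap both an intermediate layer (finer spatial detail, part-level cues) and the last
# convolutional stage (object-level semantics).
extractor = create_feature_extractor(backbone, return_nodes={
    "layer2": "mid_level",   # 128-channel maps with richer spatial detail
    "layer4": "semantic",    # 512-channel maps encoding high-level semantics
})

with torch.no_grad():
    feats = extractor(torch.randn(1, 3, 224, 224))
print({k: tuple(v.shape) for k, v in feats.items()})
# e.g. {'mid_level': (1, 128, 28, 28), 'semantic': (1, 512, 7, 7)}
```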

More recently, advanced compositional part modeling approaches (e.g., [151, 153, 154, 161, 162]) that utilize complex network architectures consisting of several independent sub-networks have emerged. For instance, Wu et al. [162] propose an approach for robust visual tracking using multiple deep learning sub-networks to separately observe different sub-regions of the input frames. Each sub-model is designed to learn specific local features from a target sub-region. Qi et al. [148] employ several independent CNN trackers to learn mid-level spatial features from different convolutional layers. The predictions of these trackers are then adaptively fused by means of an online decision-theoretic learning approach using the Hedge algorithm. An overall high-performance tracker is obtained from the weighted sum of the predictions of all trackers. Yang et al. [154] proposed to integrate multiple CNN-based compositional part extraction modules, called P-CNN, into different layers of pre-trained CNN models (AlexNet [163] and VGG19 [93]). The P-CNN utilizes part filters which are optimized to select part-level descriptors from the feature maps of designated convolution layers (i.e., layers to which P-CNN modules have been attached). In [151] Mordan et al. introduced the “Deformable Part-based Fully Convolutional Network (DP-FCN)”, which utilizes a fully convolutional network (FCN) [152] together with a number of custom extensions for part-level feature learning. The fully convolutional network is responsible for extracting task-specific features of each image class into feature maps. In addition, a deformable part-based region-of-interest (RoI) pooling layer encodes part-level representations of the resulting feature maps. The deformable RoI pooling layer partitions the image-level feature maps into \(n \times n\) region proposals (i.e., square grids) and performs alignment of parts. The final extension, at the end of the whole structure, consists of two separate network branches that perform semantic classification and deformation-aware localization by exploiting the effects of part displacements. [153] proposed a deep CNN architecture that jointly learns object deformations and part-level feature representations, and additionally incorporates context information. The approach was implemented using the ZFNet architecture (proposed in [164]) as the CNN base model, with additional branches consisting of part-level kernels and classification sub-networks. By changing the configuration of this CNN, different detectors are obtained, leading to variability and hence better generalization performance in specific situations. In addition, the approach further enhances generalization by allowing deformable parts to be shared among different object categories.

While deeply learned compositional parts from CNN layers can provide better generalization in unseen domains [147], they are typically less transparent than their explicit model counterparts and ultimately suffer from the black-box syndrome [165] commonly encountered in deep neural networks. Another limitation of compositional part modeling in general is that the approach is not suitable for objects without distinct parts. Also, non-rigid object parts can exhibit many different shape and form variations that completely diverge from the learned representations, thereby making it difficult for the approach to work well. Because of these limitations, in some scenarios they may be more prone to catastrophic failures than traditional part-based models designed explicitly to account for anticipated conditions. The main approaches to modeling compositional parts in the context of object detection and tracking are captured in Table 3.

5 Similarity learning approaches

When tracking objects using deep learning methods, the network is required to learn very reliable visual features that remain stable under many different conditions. In this case, the deep learning model relies on learning invariant visual features from large datasets and then makes predictions by matching corresponding features in candidate images to the previously learned representations. Since in most tracking applications the target appearance is captured only in the initial frame, it is often not possible to obtain sufficiently rich features for tracking. Many traditional deep learning approaches tackle this problem by training offline on large-scale datasets before fine-tuning online on the specific visual tracking task. However, this often requires performing parameter updates online using gradient descent, which is computationally expensive and generally too slow for most practical applications. The second option is to combine classical algorithms such as particle filters [166] and HoG-like features [167] with CNNs, or to utilize specialized deep learning architectures (e.g., [95]) to encode robust object appearance. These techniques are often more complex, highly specific and require more prior knowledge about the target domain. All these considerations have led to the widespread use of similarity learning algorithms [168, 169]. Similarity learning trackers are typically offline trackers in that they learn the similarity embedding completely offline using available datasets that are similar to the target domain.

Table 3 A summary of compositional part modeling methods and their major characteristics

5.1 General principles of similarity learning

Similarity learning approaches to appearance modeling differ from conventional deep learning methods in that they do not directly learn visual features for each object instance or category. Instead, they learn a function that predicts the similarity of input images. The decision boundary is defined by a similarity measure [170], which can be independently computed as a distance metric [171, 172] or learned directly from input images [66, 104, 173] using a neural network. In place of the usual prediction error-based loss functions employed in traditional CNNs, similarity learning methods use special loss functions such as the contrastive loss [174] to force semantically similar image samples to be embedded in close proximity while pushing dissimilar images apart. Another important task in similarity learning is to minimize intra-class differences between objects while, at the same time, maximizing interclass differences. One major challenge with distance metrics is defining the right distance threshold, which must be large enough to include all intra-class appearance variations but small enough to exclude interclass appearance differences. Deeply learned similarity metrics solve this problem, but they are often not transparent and may be subject to higher error rates when trained on insufficiently large data. To further enhance robustness, some approaches impose temporal constraints (e.g., [115]) or additional spatial constraints (e.g., [175, 176]) on the definition of similarity metrics. The main idea in [175] and [176] consists in dividing images into sub-regions and then learning similarity measures for corresponding regions independently before combining the individual metrics into a global similarity metric. Once a similarity function is learned, the tracking process involves initializing the target object in the first frame and then performing an exhaustive search in subsequent frames to locate the most probable region within the search area that might contain the target. Thus, re-identification in the context of similarity learning consists in finding a candidate region with the minimum distance within the threshold specified by the metric. The rest of this section explores common similarity learning approaches categorized into different network topologies and similarity embedding mechanisms.
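
The contrastive loss mentioned above can be written compactly as in the sketch below; the margin value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Pull embeddings of same-class pairs together, push different-class pairs at least
    `margin` apart (same_class is 1 for similar pairs, 0 for dissimilar pairs)."""
    d = F.pairwise_distance(emb_a, emb_b)
    return torch.mean(same_class * d.pow(2) +
                      (1 - same_class) * torch.clamp(margin - d, min=0).pow(2))
```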

5.2 Single-stream similarity networks

The simplest similarity learning approaches are based on single-stream networks [9, 177,178,179]. They typically consist of deep convolutional neural network architectures that employ a contrastive loss at the end of the deeper layers to learn the similarity embedding. In [9], Moujahid et al. proposed a single-stream similarity embedding network that uses a soft cosine similarity metric to compute similarity. During tracking, the approach samples candidate locations around the initialized target and computes the similarity for each candidate region. The region with the highest score is taken as the new target location. A major limitation of the method is that the model needs to make an assumption about the probable location of the target; for this purpose, a motion model is employed. In [179] Ning et al. proposed a single-stream similarity network which employs a contrastive loss layer to implicitly learn the similarity from sample targets and background images selected by RoI layers. Despite the simplicity of single-stream networks and their structural closeness to traditional deep CNN architectures, the current literature emphasizes the use of more complex topologies such as two-stream and multi-stream networks for enhanced similarity encoding.

5.3 Two-stream Siamese networks

In recent years, visual tracking approaches using pairwise, deep similarity learning architectures based on two-stream and multi-stream networks have become very popular in many machine vision domains [180]. In particular, the Siamese network [181, 182]—a two-stream network architecture—is currently the most popular visual tracking approach for solving most SOT problems. Their success in SOT is evidenced by the results of the annual Visual Object Tracking (VOT) Challenge, where the top-performing short-term trackers in recent years [30,31,32] have mostly been Siamese-based architectures.

Fig. 7 General structure of Siamese network

A generalized architecture of the Siamese network is shown in Fig. 7. It consists of two identical CNN branches with shared parameters. The network is trained by feeding the two branches pairs of similar (i.e., objects of the same class) and dissimilar (objects belonging to different classes) images. The features extracted by the two branches are compared and fused by means of a contrastive loss mechanism whose goal is to learn a similarity function that correctly predicts object similarity given any pair of images. During tracking, one of the branches is fed with the initialized target (i.e., an image patch containing the object), while the other branch takes as input a search area encompassing the whole scene or part of it. Essentially, the search for candidate objects consists in shifting the exemplar patch over the entire search area while computing the similarity at each location. An extensive review of the Siamese architecture and its applications is presented by Chicco in [180].

In one of the pioneering works, Tao et al. [104] proposed the Siamese Instance Search Tracker (SINT), based on a conventional two-stream Siamese framework, which employed the radius sampling method proposed in [183] to sample candidate objects for tracking. In [184], Bertinetto et al. introduced SiamFC, which employs a dedicated cross-correlation layer on top of the Siamese branches. In this case, the search for candidate targets during tracking is reduced to computing the cross-correlation between the target patch and the search patch. Similar to [184], CFNet [185] utilizes a cross-correlation layer to estimate similarity; but in contrast to SiamFC, CFNet additionally employs a correlation filter unit as a differentiable CNN module in the template image branch of the Siamese framework to help learn varying appearance cues. GOTURN [186], on the other hand, employs a Siamese framework to learn target appearance features while applying fully connected CNN layers to fuse the extracted features. [187] proposed to use a region proposal network (RPN) on top of a traditional Siamese architecture to perform object detection. Zhu et al. [79] extended the SiamRPN model by proposing DA-SiamRPN, which incorporates a so-called distractor-aware sub-module to transfer learned representations of semantic negative object interactions in complex scenes to the online tracking process. To handle out-of-view and full occlusion problems in long-term tracking, they also proposed a strategy to incrementally expand the search area to provide a global view so that a lost object can be recovered (through re-detection) once it reappears.
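
The sketch below illustrates the SiamFC-style cross-correlation step [184]: both inputs pass through the same backbone, and the exemplar embedding is slid over the search-region embedding to yield a dense similarity map. The small convolutional backbone and input sizes are placeholders, not the actual SiamFC architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder shared backbone; SiamFC uses a fully convolutional AlexNet-like network.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(),
    nn.Conv2d(64, 128, 5, stride=2), nn.ReLU(),
    nn.Conv2d(128, 256, 3), nn.ReLU(),
)

def siamese_response(exemplar, search):
    """Embed both inputs with the shared backbone, then slide the exemplar embedding
    over the search embedding via cross-correlation to obtain a similarity map."""
    z = backbone(exemplar)                  # (1, C, hz, wz) template embedding
    x = backbone(search)                    # (1, C, hx, wx) search-region embedding
    return F.conv2d(x, z)                   # (1, 1, hx-hz+1, wx-wz+1) score map

score = siamese_response(torch.randn(1, 3, 127, 127), torch.randn(1, 3, 255, 255))
peak = torch.nonzero(score[0, 0] == score.max())[0]   # location of the best match
print(score.shape, peak)
```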

Some Siamese-based approaches propose to fuse features of different abstraction levels from multiple CNN layers [188] or learn low- and high-level features in separate Siamese networks [189, 190] before combining the results for inference. In [189], He et al. proposed a special Siamese framework consisting of a double two-stream network structure. The network is made up of an appearance branch that extracts invariant visual features from shallower layers and a semantic branch that exploits deeper features to encode high-level semantic representation. The similarity scores for the two branches are computed separately in the training phase before being combined to obtain a final similarity result during tracking. The appearance and semantic branches are aimed at enhancing the network’s discriminative and generalization abilities, respectively.

More radical modifications of the standard Siamese architecture have also been proposed. Notably, Zagoruyko and Komodakis [191] investigated a number of new Siamese network architectures, including a so-called pseudo-Siamese network. While Siamese architectures employ two identical CNN streams with shared weights, the pseudo-Siamese architecture proposed in [191] employs two streams with unshared weights. According to the authors, this allows more parameters to be adjusted during training. The authors further extended this concept with the introduction of a so-called 2-channel network, which operates with completely uncoupled two-stream networks. From the results of their studies, the performance of these different models seems to depend strongly on the specific application scenario. Despite their promise, these approaches have not yet been fully exploited in object tracking domains.

Fig. 8 Structure of triplet and quadruplet networks with their respective losses

5.4 Multi-stream similarity networks

Multi-stream networks are a special type of Siamese architecture that typically employs three (triplet networks) or four (quadruplet networks) CNN branches to learn image similarity. Multi-stream models provide more advanced feature embedding mechanisms than two-stream Siamese networks.

(a) Triplet trackers. Triplet networks [192,193,194,195] (Fig. 8a) are made up of three identical neural networks with shared parameters and are trained using three groups of input samples at a time: a target instance P, a positive sample from the target class A, known as the anchor or reference, and a negative sample N (i.e., a sample from a different class). Generally, a triplet network uses a triplet loss function [192] to learn similarity (Fig. 8b). The idea is to minimize the distance dp between the target P and the reference A and to maximize the distance dn between the negative N and the target P. During inference, the objective is to determine whether the input image at the anchor channel is closer to the reference or to the negative sample. Thus, training with the triplet loss allows similarity to be compared in relative terms rather than simply determining the absolute correspondence of two input images. In this way, more expressive visual features are extracted compared to two-stream architectures [194].
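
In code, the triplet objective described above can be written as follows, mirroring the distances dp and dn; the margin value is an illustrative assumption, and PyTorch also provides a built-in nn.TripletMarginLoss for the standard formulation.

```python
import torch
import torch.nn.functional as F

def triplet_loss(target, reference, negative, margin=0.5):
    """L = max(0, dp - dn + margin): minimize dp = d(target P, reference A) while
    maximizing dn = d(negative N, target P), up to the margin."""
    dp = F.pairwise_distance(target, reference)
    dn = F.pairwise_distance(negative, target)
    return torch.clamp(dp - dn + margin, min=0).mean()

# Toy usage on 128-d embeddings produced by the three shared-weight branches.
P, A, N = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
print(triplet_loss(P, A, N).item())
```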

(b) Quadruplet network trackers. Most quadruplet network trackers [196,197,198] employ a quadruplet loss for similarity learning. For instance, Chen et al. [199] and Dike and Zhou [200] propose quadruplet networks with a quadruplet loss that jointly learns similarity using the entire scene (search area) in addition to the three patches used in triplet network architectures. The quadruplet network (Fig. 8c) samples four images: a positive image P representing the target object; an anchor or reference image A, which is also a positive sample (i.e., an instance of the target object); and a pair of dissimilar images N1 and N2, representing two negative instances, that differ from both A and P. The quadruplet loss (see Fig. 8d) involves minimizing the distance dp between the positive sample P and the reference image A, maximizing the distance dn1 between the negative instance N1 and the reference A, and maximizing the distance dn2 between the two negative samples N2 and N1. Although conventional quadruplet networks use a quadruplet loss, some newer approaches combine different losses [196, 198]. In [198], Zhang proposed a quadruplet network with shared weights trained using a multi-task loss, a combination of a pairwise (i.e., contrastive) loss and a triplet loss. The pairwise loss learns the similarity between an exemplar patch (reference image) and a search area (candidate image), while the triplet loss compares positive and negative instances against the reference image. By using these losses in combination, the relationships among the input samples are better exploited for robust representation. Similarly, Dong et al. [196] proposed a four-stream network and introduced a special loss function combining a pairwise loss and a triplet loss within the same quadruplet network architecture.
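A corresponding sketch of a quadruplet-style loss, following the distance relations described above (minimize dp, enlarge dn1 and dn2), is given below; the margin values and the exact weighting are assumptions and do not reproduce any specific published formulation.

```python
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, positive, neg1, neg2, m1=1.0, m2=0.5):
    """Quadruplet loss in the spirit described above: pull the positive towards
    the anchor, push the first negative away from the anchor, and push the pair
    of negatives apart. Margins m1 and m2 are illustrative assumptions."""
    d_p  = F.pairwise_distance(anchor, positive)   # anchor-positive distance
    d_n1 = F.pairwise_distance(anchor, neg1)       # anchor-negative distance
    d_n2 = F.pairwise_distance(neg1, neg2)         # distance between the two negatives
    return (F.relu(d_p - d_n1 + m1) + F.relu(d_p - d_n2 + m2)).mean()

# toy usage with 128-dimensional embeddings
a, p, n1, n2 = (torch.randn(4, 128) for _ in range(4))
loss = quadruplet_loss(a, p, n1, n2)
```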

Table 4 A summary of compositional part modeling methods and their characteristics

5.5 Approaches to online learning with similarity models

A significant limitation of conventional similarity learning approaches is that the similarity embedding is learned completely offline and is generally fixed—further updates are often not possible once the model is deployed online. The visual appearance changes inherent in most tracking scenarios, especially in long-term tracking tasks, make it challenging to achieve robust performance with these models. Consequently, to enhance robustness in complex scenarios, some approaches incorporate robust motion models to complement predictions [201]. Another common solution is to embed correlation filters (CF) into the Siamese network (e.g., in [185]) to handle appearance variations online. Recently, several online learning mechanisms [185, 193, 202, 203] have been proposed that allow Siamese networks to update learned appearance embeddings during the tracking process. The approach in [203] uses an LSTM-based neural network to determine when updates are required and then performs updates by modifying the appearance features stored in external memory. In [193], Liu et al. extended the SiamFC model proposed in [184] from a two-stream network to a three-stream network in which the third stream is used for online model updates, while the other two streams are used in the usual way to learn similarity embeddings. In addition, the network includes a Faster R-CNN-based detector, known as the localization network, that allows it to re-establish a lost target. Similarly, Shi et al. [204] use a triplet network extension to improve both SiamFC [184] and SiamCAR [205] through online model updates. Siamese networks are also increasingly being used in MOT as part of more complex architectures to perform specific tasks in the tracking pipeline—for example, feature extraction [65, 206, 207], data association [208] or affinity computation [209, 210]. The important properties, topologies and operating principles of similarity learning models are presented in Table 4.

6 Memory and attention mechanisms

An emerging trend in visual appearance modeling for object tracking tasks is the increasing use of memory and attention to improve performance. The concept of attention [211] is based on selective processing of input signals to enhance robustness and efficiency. Since different features have different discrimination and generalization abilities [212], utilizing all visual features with equal priority for visual tasks such as tracking is inefficient and may produce sub-optimal results. Visual attention [213,214,215] provides a mechanism to adaptively select and process the most semantically useful features for a given task while at the same time ensuring compactness and efficiency of representation. On the other hand, memory [203, 216, 217] endows the model with the ability to preserve learned representations over time. Memory (e.g., [218]) and attention mechanisms (e.g., [219, 220]) have also been proposed as a means of incorporating context to enrich visual representation in object detection and tracking tasks. Chen and Gupta [218] proposed the Spatial Memory Network (SMN) to characterize contextual relationships among objects in images. Li et al. [219] proposed to model global and scene-level contexts using an Attention to Context Convolutional Neural Network (ACCNN). Most attentional networks are implemented using feedback architectures such as RNNs, and by virtue of their feedback arrangements, RNNs are also naturally endowed with memory. Beyond this natural connection, memory and attention often play complementary roles in machine vision tasks. In particular, since memory capacity is often limited, attention can enable selective storage of relevant information; conversely, recall of stored information can leverage attention for fast and efficient retrieval.

6.1 Attention in visual tracking

The attention mechanism works by adaptively re-weighting network parameters so as to prioritize the most relevant features or areas of interest for subsequent processing. The original work on visual attention—proposed in [211]—uses attention to enhance the computational efficiency and, at the same time, increase the robustness of deep learning models in classification tasks. Attending to specific object locations in large scenes can also enhance visual search in challenging object detection tasks, as demonstrated with impressive results in [221]. Attention mechanisms [222,223,224] have recently been widely used to develop robust models for online trackers, as they are able to adapt trackers to visual appearance changes of target objects over long time periods. Kahoú et al. [222], for example, implemented an attention mechanism using an RNN-based framework that performs spatial “glimpses” on relevant and informative regions of a scene. For target localization, the model uses a binary classification module to classify image features at the various locations. In [222], Kosiorek et al. utilized both spatial and feature attention mechanisms to allow a deep learning network to search the right regions of a scene as well as select the features that are most relevant for the tracking task at hand.

Recently, approaches based on modified RNN architectures such as Long Short-Term Memory networks (LSTMs) [225, 226] and Gated Recurrent Units (GRUs) [207, 227] have been introduced. They allow deeper models to process longer video sequences without the effects of vanishing gradients. In [227], two GRUs were used within a Recurrent Autoregressive Network to separately learn visual appearance and motion models. Instead of conventional recurrent networks based on RNNs, an increasingly large number of works [121, 213, 228, 229] propose to use special CNN configurations to learn different types of attention. For instance, Stollenga et al. [230] implemented an attention mechanism using special feedback arrangements constructed on the basis of Maxout networks [231]. In their approach, the synaptic weights of the feedback connections are learned with reinforcement learning techniques so as to enable the tracking model to adapt its convolutional filters to the important features present in the input images. In [59], Chu et al. proposed to use a spatial graph transformer for learning attention.

More recent works (e.g., [59, 232,233,234,235, 237]) have explored deep neural networks based on transformer [236] architectures as an alternative way of encoding attention in visual tracking models. In contrast to RNN-based attention models, which use feedback in a recurrent network topology to process information sequentially, the transformer employs feedforward attention blocks within an encoder-decoder structure. Transformers can process larger amounts of data in parallel and model relatively longer-range dependencies. This allows them to learn the inherent interdependencies between entities in different parts of an image and thus model the global context of the underlying scene. TrTr [232], for instance, incorporates transformer units within an encoder-decoder network that utilizes self- and cross-attention mechanisms to model contextual relationships between template and search image features in a single object tracking framework. TransTrack [235] proposes a transformer-based query-key method for multiple object tracking that is capable of effectively detecting and tracking new objects that appear in the scene during the tracking process. It employs two decoders—one for object detection and the other for propagating object features to the following frame—and a single encoder for learning robust feature maps through attention. The feature maps serve as input queries (object and track queries) for the decoders. That is, one decoder predicts bounding box detections using the object query, while the other estimates the current locations of features from previous frames with the help of the track query. This allows the model to identify new objects that were not previously present in the scene. Trackformer [237] uses a single encoder to learn both object and track queries and matches tracks entirely using self- and cross-attention operations. Approaches based on transformer architectures are presently among the most impressive visual tracking models.
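As an illustration of how attention is applied in such trackers, the following sketch shows a single block that applies self-attention to flattened search-region features and cross-attention against template features; the layer sizes and single-block structure are simplifying assumptions and not the architecture of any particular tracker cited above.

```python
import torch
import torch.nn as nn

class TemplateSearchCrossAttention(nn.Module):
    """Minimal sketch of self- plus cross-attention between template and
    search-region features, in the spirit of transformer-based trackers."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, search_tokens, template_tokens):
        # search_tokens:   (B, N_s, dim) flattened search-region features
        # template_tokens: (B, N_t, dim) flattened template features
        s, _ = self.self_attn(search_tokens, search_tokens, search_tokens)
        s = self.norm1(search_tokens + s)
        # queries come from the search region, keys/values from the template
        c, _ = self.cross_attn(s, template_tokens, template_tokens)
        return self.norm2(s + c)

# toy usage: 484 search tokens attended against 36 template tokens
attn = TemplateSearchCrossAttention()
fused = attn(torch.randn(2, 484, 256), torch.randn(2, 36, 256))  # -> (2, 484, 256)
```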

6.2 Long-term memory in visual tracking

The memory in RNN-based approaches (e.g., [203, 217]) does not provide long-term storage, as these models do not contain actual memory (i.e., storage). To address long-term storage needs, some authors (e.g., [215, 238, 239]) have proposed various techniques to enhance the information storage capacity of deep learning models. Chanho et al. [215], for example, proposed to increase the information storage capacity of conventional LSTM methods using a Bilinear LSTM. Chen et al. [238] proposed a dedicated memory mechanism, referred to as Long Range Memory (LRM), to cache previously extracted local and global features as intermediate features for re-use by later frames. However, the ability to retain information over long periods requires actual storage resources, which are absent in purely neural network-based approaches. A number of works [203, 240, 241] have therefore proposed explicit memory with reading and writing capabilities to deal with visual appearance variations over long periods of time. With this approach, the storage capacity of deep neural networks can easily be enlarged by increasing the size of the external memory. In [203], Yang and Chan proposed Dynamic Memory Networks to overcome the low capacity of LSTM-based approaches. Instead of keeping object appearance information as weight parameters in deep neural networks, the proposed approach stores visual feature information in external memory and retrieves relevant appearance details as needed. Appearance changes are handled by updating the stored information in memory; because the method uses external memory, long-term appearance variations can be retained. The approach employs an LSTM to control the writing and reading of information into and from memory. In addition, a spatial attention mechanism is used to direct the LSTM input to the probable locations of the relevant target. In [240], Deng et al. proposed an external memory to store features extracted from detections (i.e., features located within the bounding boxes) in a video sequence to be subsequently combined with features from later video frames.
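The following toy sketch illustrates the general read/write mechanism of an external appearance memory with soft, attention-based addressing; the slot-replacement policy and dimensions are illustrative assumptions and not the mechanism of any specific work cited above.

```python
import torch
import torch.nn.functional as F

class ExternalAppearanceMemory:
    """Toy external memory for target appearance features: attention-based
    read and a simple write that overwrites the stalest slot."""

    def __init__(self, slots=8, dim=256):
        self.keys = torch.zeros(slots, dim)
        self.values = torch.zeros(slots, dim)
        self.age = torch.zeros(slots)

    def read(self, query):                              # query: (dim,)
        weights = F.softmax(self.keys @ query, dim=0)   # soft addressing over slots
        self.age += 1
        return weights @ self.values                    # weighted sum of stored features

    def write(self, key, value):
        slot = torch.argmax(self.age)                   # replace the least-recently written slot
        self.keys[slot], self.values[slot] = key, value
        self.age[slot] = 0

# toy usage
mem = ExternalAppearanceMemory()
mem.write(torch.randn(256), torch.randn(256))
feat = mem.read(torch.randn(256))   # (256,) blended appearance feature
```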

7 Approaches for learning spatial transformations

A prevalent problem in object tracking settings is the apparent variation in objects’ visual appearances emanating from phenomena such as non-rigid deformations, changes in object proximity and camera view angles, rotations and pose variations. These changes, in turn, result in geometrically transformed objects in the captured images, making it difficult to adequately encode the object’s appearance in all possible contexts with a single appearance model. To address this problem, one promising class of approaches [242,243,244,245,246,247,248,249,250,251,252,253,254,255,256] seeks to embed additional convolutional or pooling layers as independent, dedicated differentiable units in deep CNNs to explicitly learn geometric transformations. The most well-known methods in this class are those proposed in [257] and [258]. As shown in Table 5, these approaches can broadly be categorized into three groups [259]: methods that address (1) affine transformations, (2) general (including arbitrary and nonlinear) transformations, and (3) specific (or single) transformations.

7.1 Approaches to modeling affine transformations

Many spatial transformation modeling approaches [242,243,244,245,246,247,248,249,250,251,252,253] specifically target affine transformations. In [257], Jaderberg et al. proposed the spatial transformer network (STN), which embeds a differentiable module, called the spatial transformer, to learn the parameters of affine transformations of a target object. The learned transformation parameters are then used to generate new sampling kernels, which are applied to extract features from the input data. Approaches based on spatial transformers have become very popular in many machine vision tasks, including detection and tracking [245,246,247,248]. In most implementations, the spatial transformers are embedded in base CNN classification models or placed on top of detection heads to align input images to canonical views. For instance, Qian et al. [251] proposed a method for detecting heavily deformed pedestrians in fish-eye camera views. Because of the lack of wide field of view (FoV) pedestrian detection datasets, they first transformed canonical images into fish-eye views by means of a so-called Projective Model Transformer (PMT) and then utilized a so-called Oriented Spatial Transformer Network (OSTN), consisting of a pair of STNs, to learn fish-eye image transformations. Spatial transformers have also been employed to help generate positive samples in different poses for adversarial training [260, 261]. In [253], Li et al. used an STN to learn localization information for latent compositional parts in a pedestrian re-identification framework. Luo et al. [252] (Fig. 9) combined STN and re-identification modules in a similarity learning framework for robust person re-identification. The STN learns affine transformation parameters and is thus able to accurately sample the holistic image patches that best match target (partial) persons in distorted and cropped images. Similar to the STN-based models, Xie et al. [243] proposed to incorporate a custom affine transformation manifold in a Faster R-CNN object detection model in order to learn geometric transformations of target objects and to adapt and align detection bounding boxes to the object shape. The bounding box alignment makes it possible to better capture spatial features in the effective area of the tracked object. To encode possible deformations, three different kernel sizes are used for RoI pooling. Additionally, a multi-task loss simultaneously optimizes the robustness and accuracy of detections.
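The following minimal sketch illustrates the core spatial transformer mechanism: a small localization head predicts affine parameters that are then used to resample the input feature map; the simplified localization network is an assumption made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Minimal spatial-transformer-style module: a localization head predicts
    a 2x3 affine matrix used to warp (resample) the input feature map."""

    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(channels * 16, 6),
        )
        # initialize to the identity transform so training starts stable
        nn.init.zeros_(self.loc[-1].weight)
        self.loc[-1].bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, x):                       # x: (B, C, H, W)
        theta = self.loc(x).view(-1, 2, 3)      # predicted affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

# toy usage on a 256-channel feature map
stn = SpatialTransformer(channels=256)
warped = stn(torch.randn(2, 256, 24, 24))   # -> (2, 256, 24, 24), resampled
```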

Fig. 9 Spatial Transformer Network (STN)-based person re-identification framework—STNReID [252]. The approach employs an STN in a Siamese network configuration to perform re-identification of persons in cropped and severely warped images

7.2 Approaches to modeling nonlinear transformations

Approaches such as [243] and the STN-based methods [242, 243, 245,246,247,248, 250,251,252] employ explicit geometric transformation operations to learn spatial appearance variations. As a result, they cannot effectively handle complex, non-analytical transformations. To overcome this shortcoming, Dai et al. [258] introduced the deformable convolutional network (DCN), a technique that allows arbitrary nonlinear geometric transformations to be learned. The approach embeds a module that allows arbitrary deformations to be applied to the sampling kernels of its convolutional and RoI pooling layers. When incorporated into a standard CNN, these deformable kernels can be applied to input features to learn geometric transformations. Following the original work in [258], several works utilizing the method for better visual feature encoding in object detection and tracking tasks have been proposed [62, 133, 248, 254,255,256, 262]. For instance, Cao and Chen [256] proposed the Deformable Convolution Network Tracker (DCT), which uses deformable convolution modules in multiple CNN branches dedicated to different domains. In [62], it was shown that deformable convolutions can help align re-identification features with detections, thereby significantly improving the accuracy and robustness of tracking. In contrast to the above approach of learning spatial deformations by adaptively changing the shape of convolutional kernels, Johnander et al. [263] proposed to encode target transformations by composing filters as linear combinations of smaller filters.
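As an illustration of the deformable convolution idea, the sketch below uses the DeformConv2d operator available in torchvision: a plain convolution predicts per-location sampling offsets, which the deformable convolution then uses to sample the input at irregular positions; the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """Sketch of a deformable-convolution block: an ordinary conv predicts
    (dx, dy) offsets for every kernel element at every output location."""

    def __init__(self, in_ch=64, out_ch=64, k=3):
        super().__init__()
        # 2 offsets (dx, dy) per kernel element, per output location
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        offsets = self.offset_pred(x)
        return self.deform_conv(x, offsets)

# toy usage
block = DeformableBlock()
out = block(torch.randn(1, 64, 32, 32))   # -> (1, 64, 32, 32)
```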

Based on the observation [264] that expanding the receptive field improves generalization to spatial transformations, some approaches [253, 265,266,267] propose to expand the receptive field by replacing the CNN’s conventional dense convolutions with dilated or atrous convolutions [268, 269]. For instance, in [265] Chen et al. built a visual tracker that uses a ResNet-50 backbone with dilated convolutions within a Siamese network structure to learn robust appearance features for tracking. Similar to [265], Jiang et al. [266] employed dilated convolutions in a Hybrid Dilated Convolution (HDC) [270] arrangement to learn rich feature hierarchies. Zhang et al. [271] proposed an irregular atrous convolution scheme to further enhance feature representation in object tracking tasks.
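For illustration, the snippet below stacks 3x3 convolutions with increasing dilation rates, which enlarges the receptive field without additional parameters per layer or loss of resolution; the particular rates follow a hybrid-dilation-style pattern and are assumptions.

```python
import torch.nn as nn

# Stacking convolutions with increasing dilation rates (1, 2, 5) enlarges the
# receptive field while keeping the spatial resolution (padding == dilation for
# 3x3 kernels). The channel width and rates are illustrative choices only.
dilated_stack = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1, dilation=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, padding=5, dilation=5),
    nn.ReLU(inplace=True),
)
```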

Table 5 Approaches to tackling geometric transformations in visual tracking settings

7.3 Approaches to modeling single transformations

In contrast to the techniques considered in Sects. 7.1 and 7.2, which model general (affine and nonlinear) transformations, a common line of work aims to encode specific geometric transformations by applying predefined transformations in a pre-processing step (e.g., [272,273,274,275,276]) or by using multi-scale features [277, 278] before using CNN layers to learn these transformations. These techniques are mostly incorporated into standard backbone feature extraction and object detection models such as VGG [93]. They are commonly designed to encode rotations [279], scale variations [277, 280], and perspective distortions [281, 282]. Multi-scale methods are arguably the most common of these techniques. In [280], Szegedy et al. proposed the use of differently sized convolutional filters to extract multi-scale features from input images. Fang et al. [277] employ a spatial arrangement of filters to encode features of varying sizes. Other approaches, for example [283, 284], adopt special pooling mechanisms to dynamically adjust the scales of visual features. Even though methods in this category are less general than other geometric transformation techniques, they still find widespread use in object detection and tracking applications due to their low computational overheads.
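A minimal Inception-style sketch of the multi-scale filter idea is shown below; parallel branches with different kernel sizes are concatenated channel-wise, and the branch widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Inception-style sketch: parallel branches with different kernel sizes
    capture features at several scales and are concatenated channel-wise."""

    def __init__(self, in_ch=64, branch_ch=32):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

# toy usage
block = MultiScaleBlock()
feats = block(torch.randn(1, 64, 32, 32))   # -> (1, 96, 32, 32)
```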

Table 6 Common object tracking datasets

8 Datasets, evaluation metrics and performance results of state-of-the-art object trackers

This section presents the common datasets, evaluation metrics and performance results of the state-of-the-art visual trackers surveyed in this work. We focus on datasets for which quantitative performance results are available for many of the surveyed approaches; correspondingly, in the presentation of performance results, we pay less attention to approaches that have not been evaluated on popular datasets. We also restrict the discussion to the subset of metrics for which several results on the selected datasets are available. Nonetheless, for each dataset, the selected metrics are the most important ones and are broad enough to adequately characterize the performance of visual trackers.

8.1 Datasets

To allow the training and evaluation of object tracking models, a large number of video datasets [30,31,32, 286,287,288,289,290,291,292,293,294,295, 298, 299] have been compiled. The videos in these datasets are typically captured under challenging conditions such as varying illumination and scale, occlusion, blur, background clutter, deformation, and in-plane and out-of-plane rotations. This allows researchers to train robust trackers and evaluate their ability to handle different real-world situations. The major features of common object tracking datasets are summarized in Table 6. In addition to these dedicated object tracking datasets, visual tracking models that rely on tracking-by-detection methods may utilize large-scale video object detection datasets such as the ImageNet VID dataset [301] and the YouTube-BoundingBoxes dataset [302]. In the following paragraphs, we present a brief description of some of the most important visual tracking datasets.

(a) SOT datasets: The large-scale datasets used for training single object trackers include the Visual Object Tracking (VOT) family of datasets—VOT15 through VOT20 [30,31,32, 287,288,289]; the Object Tracking Benchmark (OTB) line of datasets—OTB-50 [303] and OTB-100 [286]; Need for Speed (NfS) [294]; UAV123 [295]; GOT-10k [297]; LaSOT [298]; and TrackingNet [299]. The VOT, LaSOT, GOT-10k and OTB lines of datasets are the most popular for training and evaluating SOT algorithms. Some SOT datasets focus on narrow application domains such as people tracking in video surveillance scenarios (e.g., [32]) and vehicle tracking (e.g., [295]). There are also many SOT datasets (e.g., LaSOT [298], GOT-10k [297] and TC-128 [296]) that aim to capture generic objects and scenes. The OTB-100 dataset, for example, contains one hundred (100) challenging labeled video snippets with a general focus. TC-128 has 128 labeled video clips covering a large diversity of object categories captured under different conditions; it particularly focuses on object and scene color variations. The VOT family of datasets and the OTB-100 [286] place a strong emphasis on human tracking.

(b) MOT datasets: Existing multiple object tracking datasets are typically domain-specific, with many dealing with pedestrian or vehicle tracking. The most popular are the MOT series [290,291,292]. Through several iterations, from MOT15 to MOT20, a large number of benchmark sequences have been collected through the MOT Challenges. To date, a total of 44 video snippets totaling about 36,000 seconds of streaming content [292] are available through the MOTChallenge. The latest MOT dataset, MOT20 [292], contains 8 new video sequences (4 training and 4 test sets). All of the MOT Challenge datasets deal with pedestrian detection and tracking. The KITTI object detection dataset [293] is another popular dataset used for training and evaluating multiple object tracking models; it is intended for vehicle and pedestrian detection and tracking and contains a total of 50 short videos, 21 of which are for training and the remaining 29 for testing. Wen et al. recently introduced a new dataset, the UA-DETRAC dataset [285], for vehicle tracking.

Table 7 Results of surveyed state-of-the-art trackers on the Visual Object Tracking (VOT) datasets—VOT15, VOT16 and VOT17 datasets
Table 8 Results of surveyed state-of-the-art visual trackers on the Visual Object Tracking (VOT) datasets—VOT18, VOT19 and VOT20
Table 9 Results of surveyed state-of-the-art trackers on other popular SOT datasets—TrackingNet, GOT-10K and LaSOT
Table 10 Results of surveyed state-of-the-art trackers on MOT17 dataset
Table 11 Results of surveyed state-of-the-art trackers on MOT20 dataset

8.2 Evaluation metrics

Many performance benchmarks and evaluation metrics have been proposed to quantitatively assess the quality of object tracking algorithms and validate their use in different situations. They also allow researchers to compare the performance of different models. Typically, different datasets or families of datasets provide different evaluation protocols and metrics. We briefly introduce the metrics used to compare visual trackers explored in this paper, and refer the reader to appropriate sources for more detailed information on the specific metrics.

(a) SOT metrics: In this work, we present performance results for VOT15 through VOT20, as well as for the TrackingNet, LaSOT and GOT-10k datasets. We briefly describe the most important metrics used on these datasets for performance evaluation; further details are presented in the original works [30,31,32, 287,288,289, 297,298,299]. The most important evaluation metrics provided by the VOT family of datasets are accuracy (A), robustness (R) and the expected average overlap (EAO). Accuracy describes the preciseness of target localization, that is, how well the estimated bounding box of a tracked object matches the ground-truth bounding box. The metric is given as a fractional number computed as the ratio of successfully tracked frames to the total number of frames in the given video sequence, where a successful track is one whose region overlap exceeds a predetermined threshold value. Robustness, also called the failure score, is the number of times a tracker loses its target and needs re-initialization. Expected average overlap is a composite metric that characterizes the combined effect of the robustness and accuracy measures. For GOT-10k, we report results for the average overlap (AO) and success rate (SR) scores; the success rates are measured at overlap thresholds of 0.75 (SR0.75) and 0.5 (SR0.5). TrackingNet uses precision (P), normalized precision (Pnorm) and success (S) to quantitatively measure the performance of trackers. Precision measures the distance error, in pixels, between the center positions of the ground-truth and predicted bounding boxes of the target object in each frame; it is usually reported as the percentage of frames in which this deviation is within a given limit. With normalized precision, the raw precision values are normalized to account for the influence of different image sizes or resolutions; in this case, the distance errors are measured relative to the image size. Success is computed from the region overlap ratio (i.e., the Intersection over Union, or IoU) between the predicted and ground-truth bounding boxes. Again, a threshold value is set above which a track is considered successful; the default threshold is usually 0.5, and the percentage of frames whose region overlap ratios exceed 0.5 gives the success score of the particular model. LaSOT, similar to the TrackingNet dataset, provides precision and normalized precision for evaluation. Another important metric is the area under the curve (AUC), which is obtained by varying the overlap threshold between 0 and 1 and computing the success score at each threshold over the entire sequence; the average of these success scores gives the AUC score.
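For illustration, the snippet below computes the success, AUC and precision scores described above from per-frame bounding boxes; the threshold sampling and the 20-pixel precision limit are common conventions assumed here, and individual benchmarks may use slightly different protocols.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two [x, y, w, h] boxes."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_and_auc(pred_boxes, gt_boxes, threshold=0.5):
    """Success at a fixed overlap threshold plus the AUC obtained by sweeping
    the threshold from 0 to 1 (a common approximation of the described metric)."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    success = (overlaps > threshold).mean()
    thresholds = np.linspace(0, 1, 21)
    auc = np.mean([(overlaps > t).mean() for t in thresholds])
    return success, auc

def precision(pred_boxes, gt_boxes, limit=20):
    """Fraction of frames whose center-location error is within `limit` pixels."""
    errors = [np.hypot((p[0] + p[2] / 2) - (g[0] + g[2] / 2),
                       (p[1] + p[3] / 2) - (g[1] + g[3] / 2))
              for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(np.array(errors) <= limit))
```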

(b) MOT metrics: For evaluating the performance of multiple object tracking algorithms, the most commonly used metrics are the multiple object tracking accuracy (MOTA) and the related multiple object tracking precision (MOTP). MOTA is computed from three main quantities: missed tracks (false negatives, FN), false positives (FP), and identifier assignment errors (i.e., identity switches). The identity switch metric (IDS) measures the total number of times the IDs of correctly tracked objects are erroneously changed. Since the number of missed tracks is usually several orders of magnitude higher than the number of false positives, FN scores greatly influence the overall MOTA scores. BYTE, recently proposed by Zhang et al. [317], aims to mitigate this challenge by grouping detections into high- and low-confidence predictions. The high-confidence bounding box detections are first matched with tracklets; all tracklets that remain unmatched are then associated with detections from the low-confidence group. This differs from the common approach in which low-confidence detections below a given threshold are simply rejected. The new method significantly reduces false negatives and enhances the overall tracking performance. Based on track coverage, tracking success can be categorized as mostly tracked (MT), for tracks with tracking success of 80% and above; mostly lost (ML), with success not exceeding 20%; and partially tracked (PT), with success between 20% and 80%. The MOTP metric measures the localization accuracy of tracked objects. Other notable evaluation metrics commonly used in visual tracking models include false alarm per frame (FAF) and fragmentation (Frag). FAF is calculated as the number of false positive instances detected in each frame. Frag counts the number of times a tracker loses a tracked instance in an earlier frame and re-establishes (i.e., re-detects) it in a later frame. Most MOT datasets come with standard public detections that can be used in the detection stage to ensure a fair comparison of different approaches; researchers may nevertheless use private or custom detectors on these datasets. For a detailed overview of the various evaluation protocols and metrics, readers can refer to [290] and [292].
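The standard CLEAR MOT formulation of MOTA can be summarized as follows; the numbers in the usage example are purely illustrative.

```python
def mota(fn, fp, ids, num_gt):
    """MOTA = 1 - (FN + FP + IDS) / number of ground-truth objects,
    with counts summed over all frames (standard CLEAR MOT formulation)."""
    return 1.0 - (fn + fp + ids) / num_gt

# toy example: 900 misses, 300 false positives, 45 identity switches
# over 10,000 ground-truth object instances -> MOTA of about 0.876
print(mota(fn=900, fp=300, ids=45, num_gt=10_000))
```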

8.3 Quantitative performance results of visual trackers

In Tables 7, 8, 9, 10 and 11, we present quantitative performance results of the surveyed trackers on selected large-scale visual tracking datasets. For metrics marked with an up arrow (\(\uparrow \)), higher numerical values are better, while the down arrow (\(\downarrow \)) indicates metrics for which lower numerical values are better. As already mentioned, we selected datasets that have been widely used to evaluate many of the surveyed approaches. Tables 7, 8 and 9 present results for SOT methods, while Tables 10 and 11 capture results on MOT datasets.

Tables 7 and 8 present results on the popular VOT family of datasets. The evaluation metrics used are the expected average overlap (EAO), accuracy (A) and robustness (R). These metrics are briefly described in Sect. 8.2; the reader may refer to [287] and [288] for further details on the computation procedures. Results on other popular SOT datasets, namely TrackingNet, GOT-10k and LaSOT, are presented in Table 9. For the TrackingNet dataset, results are presented in terms of success, precision (Prec.) and normalized precision (Pnorm). The GOT-10k results are based on the average overlap (AO) and the success rates at the 0.5 and 0.75 overlap thresholds \((\mathrm{SR}_{0.5}\) and \(\mathrm{SR}_{0.75})\). For LaSOT, precision (P), normalized precision (Pnorm) and area under curve (AUC) are used. Details on the calculation of these metrics are available in [297, 299] and [298].

In Tables 10 and 11, we present results for multiple object tracking methods on MOT17 and MOT20, respectively. The metrics used here are the multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), identification F1 score (IDF1), mostly tracked (MT), mostly lost (ML), false positives (FP), false negatives (FN) and identity switches (IDS). We refer interested readers to [291] and [320] for a detailed discussion of these metrics. In cases where the same model has been tested using both public and private detections, we provide results for both.

9 Summary and discussion

In Sects. 3 to 7, we reviewed the main deep learning approaches for enhancing the robustness of appearance models in object detection and tracking tasks. The reviewed techniques address different issues: sample efficiency, geometric transformations, object deformations, occlusions, complex backgrounds, and object interactions. Each technique approaches the problem of robust feature extraction and representation differently, offering different trade-offs between generalization to general or specific appearance changes, computational efficiency, model adaptability and sample efficiency. Section 8 presents the common datasets and evaluation metrics, as well as results of the surveyed object tracking models on some of the popular datasets. A broad summary of the common features, main rationale, architectures and limitations of the most important approaches covered in this work is given in Table 12.

Currently, methods based on similarity learning, especially two-stream Siamese architectures, are the most common techniques owing to their simplicity, computational efficiency and suitability for few-shot learning. However, phenomena such as occlusions, background clutter and object interactions, which are common in many complex MOT environments and long-term tracking scenarios, limit their scope of application. In these scenarios, similarity learning approaches are often used in conjunction with other techniques in more complex pipelines. Problems such as occlusions and complex background clutter are most effectively handled using compositional part modeling techniques, which treat the appearance model as a composition of spatially related entities. However, constructing models in this way is very time-consuming. A newer trend is to automatically learn compositional parts from input samples, but this is often challenging in practical tracking applications since it requires training with a large corpus of relevant data. GAN-based approaches have been proposed to address the problems of data scarcity and severe data imbalance by generating appropriate samples during training. Unlike general machine vision tasks, which mostly deal with sample-level generation, adversarial learning in object tracking contexts typically involves feature-level generation.

Table 12 Summary of robust appearance modeling approaches, their strengths and weaknesses as well as the common deep learning architectures used for their implementation

While extending training datasets with GANs has proven to be an effective way to learn invariant features that are robust to different appearance conditions, these models are generally harder to train and, in some situations, convergence may be unattainable. There is also a lack of reliable empirical performance metrics for assessing the quality of GAN-generated data. Moreover, GANs introduce additional computational overhead, which hampers their suitability for real-time applications. Attention-based models provide a good balance of efficiency and robust performance. Unlike conventional approaches to visual recognition, in which entire input images are processed with equal “attention” and, as a result, both useful and irrelevant features of the object and scene are learned, models using attention mechanisms process only the most informative image segments necessary for the particular task. This greatly reduces computational costs and increases detection efficiency while maintaining invariance to image transformations. In addition, the inclusion of memory in attention models allows long-term appearance characteristics to be preserved for future use. Another way to improve the robustness of deep learning-based appearance modeling is to integrate specialized CNN modules that explicitly model spatial transformations. These modules are differentiable and can seamlessly be incorporated into standard CNN models such as the Faster R-CNN framework (e.g., as in [247, 321]) and trained end-to-end without modifying the structure of the base model. Such techniques provide a fast and reliable means of encoding robust appearance models that generalize well under various conditions. However, difficulties arise when applying them in general settings because of their narrowly defined formulation: they focus mainly on spatial transformations, so photometric effects (e.g., random noise, shadows, reflections and illumination variability) can greatly reduce their effectiveness.

10 Future research directions

A recent trend in object tracking is the development of detection and tracking techniques [95, 98, 102, 148, 250, 253, 267, 322] that combine different approaches dedicated to specific tasks into complex models in order to overcome the limitations of the individual approaches. Indeed, many of the approaches surveyed utilize two or more fundamental methodologies to ensure more accurate and robust detection and tracking performance. The resulting hybrid architectures consist of a set of dedicated sub-systems for feature representation that combine various mechanisms such as GANs, part models, visual attention and similarity learning. For example, [253] employs a complex tracker that utilizes a wide range of techniques: multi-scale kernels to encode scale variations; dilated convolutions to increase the receptive field; quasi-compositional parts mined from multiple convolutional layers; a spatial transformer network (STN) to learn affine transformations of latent compositional parts; and additional spatial constraints to better encode visual features. Similarly, [267] utilized a deep CNN configuration that involves an STN, a GRU and atrous convolutional layers. Zhang et al. [250] proposed a Siamese framework within which an STN is employed to learn affine transformations of compositional parts for robust tracking. Lee et al. [323] introduced a memory model into a Siamese framework to enable long-term tracking. In [322], an attention mechanism is used to extract robust features from compositional part models.

Some of these hybrid models require sophisticated fusion algorithms as well as refinement methods. In [267], a GRU is used to fuse the different features produced by the model components. In [324], a softmax-based fusion mechanism is proposed for aggregating low-level features; in addition, a high-level spatial feature fusion is used to combine features from different components, including the softmax fusion output and channel and spatial attention sub-modules. The techniques for fusing hybrid models are still at an early stage of development; hence, there is a lot of room for better fusion strategies that harness the strengths of individual approaches in a unified framework. The most promising application of future hybrid trackers would be generic object tracking algorithms that generalize across multiple domains.

The main directions envisaged for future work include the following.

  • Robust feature transfer: More effective techniques for transferring useful features from existing large-scale datasets to novel visual contexts and challenging application settings would be highly beneficial and would compensate for the difficulty of creating large-scale tracking datasets.

  • Generic appearance models for tracking in open domains: Many practical tracking scenarios are characterized by openness, where arbitrary objects can appear in and disappear from the scene. Most current appearance models, however, work in specific, closed environments in which the number of object categories is known and fixed. Learning robust generic appearance models in open environments remains a relatively unexplored direction.

  • More advanced hybrid fusion methods: More sophisticated “hybridization” techniques that rely on both low- and high-level context information, as well as advanced decision-making capabilities for aggregating visual features, would significantly improve the robustness and reliability of appearance models. Such fusion methods could allow multiple, diverse inference engines to be modeled as computational primitives within deep learning frameworks and fused to produce predictions that are consistent with high-level real-world contexts.

  • The use of automated machine learning (AutoML) techniques: The emerging area of Automated Machine Learning (AutoML) [325], especially Neural Architecture Search (NAS) [326,327,328], has already produced impressive deep learning models for many visual recognition problems, but it remains under-explored in visual tracking. An important direction for future research is the exploitation of these techniques to develop more advanced detectors and trackers; the configuration of such machine-generated frameworks could differ fundamentally from existing architectures.

11 Conclusion

Appearance modeling is the most important task in visual object tracking. It is generally accomplished by encoding visual features extracted from sample data of the target objects as sets of invariant feature vectors and subsequently making inferences based on the encoded representations. In this paper, we extensively survey the most important deep learning techniques for learning robust visual representations for object detection and tracking. The main motivations, key functional principles, implementation issues and application scenarios of these algorithms are thoroughly discussed. In addition, common datasets, performance evaluation metrics and quantitative results of the state-of-the-art models surveyed in this paper are presented.

As noted earlier in the survey, owing to the enormous complexity of real-world visual tracking scenarios, there is still a lot of room for improving the robustness and accuracy of appearance models in challenging detection and tracking tasks. State-of-the-art deep learning techniques still fare poorly in visual tracking compared to other machine vision tasks. Nevertheless, with the wide diversity of approaches at their disposal, developers and researchers have considerable leverage and flexibility in developing appearance models that meet the requirements of specific applications. One of the main tasks for developers will be to identify the most suitable approaches for each application scenario and to adaptively fuse appropriate models for optimum performance.