Introduction

One of the most challenging computer vision problems is visual tracking. Applications include motion analysis, video surveillance, and advanced driver assistance systems [1, 2]. Visual tracking refers to the problem of estimating a target's motion from a sequence of images. Owing to advances in computer technology, it has become one of the most actively discussed topics in computer vision [1]. Its applications include surveillance, autonomous navigation, and medical diagnosis [2, 3].

Because of the unavoidable appearance changes in an open environment, traditional tracking techniques that start with fixed models of the target typically fail [4]. These changes can be attributed to properties of the object itself, such as non-rigid structure and pose variations, and to properties of the environment, such as illumination changes, camera motion, viewpoint, camera scale, and occlusion. Adaptive techniques that gradually alter a target's representation over time have been developed to handle these changes successfully [4].

Researchers consider visual representation and appearance modeling to deal with visual tracking difficulties. Visual representation refers to how information is communicated through images, graphs, charts, and other visual media. Visual representations are essential for communicating complex ideas and data quickly and efficiently, as they allow people to process large amounts of information more effectively than text alone. One common type of visual representation is the infographic, which combines images and text to present information in a visually appealing way. Other examples include diagrams, maps, and flowcharts. Effective visual representations are designed to be easy to read and understand, and they often use color, size, and other design elements to convey important information [5].

Appearance modeling is a technique used to create realistic digital representations of objects, people, or environments. It involves capturing data about the appearance of an object or environment, such as its texture, color, and lighting, and using those data to create a 3D model. Appearance modeling is used in a variety of applications, from video game development to product design. One common use is virtual try-on technology, which allows customers to see how clothing or other products will look on them before making a purchase. Appearance modeling can also create realistic simulations of real-world environments, such as cityscapes or natural landscapes, for use in movies or video games [6, 7].

As mentioned earlier, one of the primary challenges limiting the efficiency of real-time visual tracking algorithms is the absence of appropriate appearance models [1, 8]. Because their models are fixed, conventional template-matching tracking techniques are unable to adjust to appearance changes. As a result, dynamic templates built on online learning are used to describe how a target's appearance varies under changes in pose and lighting. The tracking framework incorporates the online learning method to update the target's appearance model flexibly in response to appearance changes [9].
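As a concrete, deliberately simplified illustration of such an online update, the sketch below blends a stored template with the patch observed in the current frame using a running average. The blending rate `alpha` and the random stand-in patches are illustrative assumptions, not part of any specific tracker discussed in this survey.

```python
import numpy as np

def update_template(template, new_patch, alpha=0.05):
    """Running-average update: slowly blend the stored appearance
    template with the patch observed in the current frame."""
    return (1.0 - alpha) * template + alpha * new_patch

# Toy usage: a 32x32 grayscale template adapting over ten frames.
template = np.zeros((32, 32), dtype=np.float32)
for _ in range(10):
    observed = np.random.rand(32, 32).astype(np.float32)  # stand-in for a cropped target patch
    template = update_template(template, observed)
```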

Visual tracking techniques are often divided into two classes: generative and discriminative. Generative tracking techniques build a model that depicts the appearance of the target object; the tracking problem is then stated as finding the candidate whose appearance is closest to that model. Discriminative tracking approaches, in contrast, seek to distinguish the target object from the background and generally outperform generative techniques in both accuracy and speed [1, 10].

Therefore, this study focuses on discriminative online learning-based methods for appearance modeling. An extensive investigation of the methods that have emerged over the years is presented, allowing the reader to appreciate the historical development of this field. The main contributions of this study are as follows:

  1. Reviewing several online learning-based discriminative techniques for appearance modeling.

  2. Providing a critical examination of current discriminative online learning-based approaches and discussing their benefits and drawbacks.

  3. Discussing the effectiveness of online learning strategies for appearance modeling in visual tracking.

The remainder of this paper is organized as follows. The "Review of online learning modeling methods" section discusses appearance modeling, covering visual representation and online learning, and investigates the discriminative online learning-based methods. Results and discussion are presented in the "Discussions and analysis" section. Finally, the "Conclusion" section concludes this study.

Review of online learning modeling methods

Because of target appearance variation, online learning modeling is the most important module in visual tracking. This module generally consists of two components [6]: visual target representation and online learning of the model. Visual target representation focuses on leveraging various visual features to represent the target in images. Online training focuses on creating a model of the target and updating it under appearance variation in order to recognize the target in subsequent frames. Because it is the first stage of online learning modeling, visual representation is only briefly discussed in the following parts. As stated earlier, this work focuses on discriminative-based techniques for online learning modeling.

Visual target representation

Visual tracking requires representing the target in the image, describing it using visual features, and tracking it in subsequent images. Because the effectiveness of this description substantially impacts the entire tracking process, visual target representation is a crucial task in the appearance modeling module. Moreover, such a description is not always given to the tracker beforehand; it may need to be created in real time using both prior knowledge and unknown information, or through an online model-creation process [11]. As a result, choosing the right features for visual target representation and description is essential for visual tracking. This study only briefly introduces visual representation because it underlies any appearance modeling method.
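For illustration, the sketch below extracts a HOG descriptor from a target patch as one possible visual representation. The patch is a random stand-in, and the choice of HOG and its parameters is an assumption made purely for this example (scikit-image is assumed to be available).

```python
import numpy as np
from skimage.feature import hog

# Random stand-in for a 64x64 grayscale target patch cropped from the frame.
patch = np.random.rand(64, 64)

# HOG summarizes local gradient orientations and is reasonably robust to small
# illumination and pose changes, making it a common target descriptor.
descriptor = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), feature_vector=True)
print(descriptor.shape)  # a fixed-length feature vector describing the target
```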

Discriminative-based appearance models

In the dynamic, long-running process of visual tracking, having a good target description for its representation in the scene is not enough to deal with target appearance changes, because the representation by itself does not adapt to appearance variations. These variations can arise from illumination changes, pose variations, geometric transformations of the target, and so on. To handle such variations, a target model must be generated and incrementally updated so that it adjusts and adapts to the new circumstances [6, 12]. This study focuses on discriminative online learning for appearance modeling. Figure 1 shows online discriminative-based appearance modeling methods.

Fig. 1 Representative online discriminative-based appearance modeling methods

Discriminative-based appearance models treat tracking as a binary classification that separates foreground from background regions; that is, they aim to discriminate target from non-target regions. They adopt highly discriminative and informative features for visual target tracking. Using online-learned classification functions, these models can predict the target and non-target regions of a scene, and the online learning procedure gradually adapts the visual features so that the tracked object can be identified against a complex background.
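A minimal sketch of this idea, assuming a linear classifier updated incrementally with scikit-learn's `SGDClassifier.partial_fit`, is shown below. The feature extractor, the sampling of target and background patches, and the learning rate are illustrative placeholders rather than a specific published tracker.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def features(patch):
    """Placeholder feature extractor: flattened, L2-normalized intensities."""
    v = patch.astype(np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

# Online linear classifier separating target (label 1) from background (label 0).
clf = SGDClassifier(loss="hinge", learning_rate="constant", eta0=0.01)

first_update = True
for frame in range(5):
    # In a real tracker these come from the estimated target box and from
    # boxes sampled around it; here they are random stand-ins.
    target_patch = np.random.rand(16, 16)
    background_patches = [np.random.rand(16, 16) for _ in range(8)]

    X = np.vstack([features(target_patch)] + [features(p) for p in background_patches])
    y = np.array([1] + [0] * len(background_patches))

    if first_update:
        clf.partial_fit(X, y, classes=np.array([0, 1]))
        first_update = False
    else:
        clf.partial_fit(X, y)  # incremental update of the appearance model
```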

Boosting-based discriminative appearance models

Due to their strong discriminative learning capacity, boosting-based discriminative appearance models (BDAMs) are currently widely used in visual target tracking [6]. More precisely, self-learning boosting-based models first train a classifier using data from previous frames and then use the learned classifier to assess potential target regions in the current frame [6]. The following sections discuss the categories in detail.
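The self-learning loop can be sketched as follows. Because scikit-learn's `AdaBoostClassifier` has no incremental update, this example mimics online self-learning by retraining on an accumulated sample buffer; the feature dimensions, candidate counts, and self-labeling rule are illustrative assumptions rather than the method of any particular reference.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

# Frame 0: seed the classifier with labeled target (1) / background (0) features.
X_buf = rng.normal(size=(20, 64))
y_buf = np.array([1] * 10 + [0] * 10)
clf = AdaBoostClassifier(n_estimators=25).fit(X_buf, y_buf)

for frame in range(1, 4):
    candidates = rng.normal(size=(30, 64))                     # features of candidate windows
    best = int(np.argmax(clf.decision_function(candidates)))   # assess candidates with the learned classifier

    # Self-learning: treat the winning window as the new positive and the
    # remaining windows as negatives, then retrain on the grown buffer.
    y_self = np.zeros(len(candidates), dtype=int)
    y_self[best] = 1
    X_buf = np.vstack([X_buf, candidates])
    y_buf = np.concatenate([y_buf, y_self])
    clf = AdaBoostClassifier(n_estimators=25).fit(X_buf, y_buf)
```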

Self-learning single-instance boosting-based models

Various computer vision applications, including target detection and tracking, are based on online boosting models [13]. To clearly describe the self-learning single-instance models, we categorize them in Table 1.

Table 1 Review of self-learning single-instance boosting-based models
Co-learning single-instance BDAMs

As stated in [6], the "model drift" issue affects self-learning boosting-based models. To address this issue, additional models based on semi-supervised learning methods are used for visual object tracking. Grabner et al. [14] proposed a semi-supervised algorithm that uses online boosting, as illustrated in Fig. 2.

Fig. 2 A sample of a typical co-learning problem
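The sketch below conveys the spirit of such co-learning: a classifier trained once on the labeled first frame acts as a fixed prior that supplies confident pseudo-labels to an online classifier, which limits drift. It is a simplified stand-in for semi-supervised online boosting [14]; the features, thresholds, and classifier choices are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)

# Fixed "prior" classifier trained once on the labeled first frame.
X_labeled = rng.normal(size=(40, 32))
y_labeled = np.array([1] * 20 + [0] * 20)
prior = GradientBoostingClassifier(n_estimators=30).fit(X_labeled, y_labeled)

# Online classifier that keeps being updated, but only with pseudo-labels the
# fixed prior is confident about, so the first-frame supervision is never lost.
online = SGDClassifier(loss="hinge")
online.partial_fit(X_labeled, y_labeled, classes=np.array([0, 1]))

for frame in range(3):
    unlabeled = rng.normal(size=(25, 32))              # patches from the new frame
    pseudo = prior.predict(unlabeled)                  # prior constrains the update
    confidence = np.abs(prior.decision_function(unlabeled))
    keep = confidence > np.median(confidence)          # discard uncertain pseudo-labels
    online.partial_fit(unlabeled[keep], pseudo[keep])
```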

Multi-instance boosting-based models

Multiple-instance learning is utilized for target tracking in order to address the underlying uncertainty of object localization, as shown in Fig. 3. Multi-instance boosting-based models can be broadly categorized into self-learning and co-learning classes. Table 2 describes these categories in detail.

Fig. 3 Illustration of single-instance versus multi-instance learning

Table 2 Review of multi-instance boosting-based models
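The central ingredient of multi-instance tracking is scoring a bag of candidate patches rather than a single patch. The sketch below uses a Noisy-OR combination of instance scores, with an online linear classifier as a stand-in for a boosted instance model; the features and bag sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(2)

# Instance-level linear classifier (stand-in for a boosted instance model).
clf = SGDClassifier(loss="hinge")
X_init = rng.normal(size=(20, 16))
y_init = np.array([1] * 10 + [0] * 10)
clf.partial_fit(X_init, y_init, classes=np.array([0, 1]))

def bag_probability(clf, bag):
    """Noisy-OR bag score: the bag is positive if at least one of its
    instances looks like the target."""
    scores = clf.decision_function(bag)
    p = 1.0 / (1.0 + np.exp(-scores))   # squash margins into pseudo-probabilities
    return 1.0 - np.prod(1.0 - p)

# A "positive bag": patches sampled around the estimated target location.
positive_bag = rng.normal(size=(9, 16))
print(bag_probability(clf, positive_bag))
```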

SVM-based discriminative appearance models (SDAMs)

SDAMs aim to develop maximum-margin discriminative SVM classifiers to increase inter-class separability. SDAMs have a high discriminative capacity because they can find and retain informative samples as support vectors for object/non-object classification. Designing resilient SDAMs requires careful kernel selection and efficient kernel computation. According to the learning methods applied, SDAMs are commonly divided into self-learning and co-learning SDAMs. SVM-based discriminative appearance models are reviewed in Table 3.

Table 3 Review of SVM-based discriminative appearance models
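A minimal sketch of the SDAM idea is given below: an SVM is retrained on a buffer that keeps only the previous support vectors plus newly self-labeled samples, which bounds memory while retaining the informative samples. The linear kernel, buffer policy, and random features are assumptions made for illustration, not a specific published tracker.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Labeled target (1) / background (0) features from the first frame.
X_buf = rng.normal(size=(40, 32))
y_buf = np.array([1] * 20 + [0] * 20)
svm = SVC(kernel="linear", C=1.0).fit(X_buf, y_buf)

for frame in range(3):
    # New self-labeled samples around the tracked position (random stand-ins).
    X_new = rng.normal(size=(10, 32))
    y_new = (svm.decision_function(X_new) > 0).astype(int)

    # Keep only the informative samples: previous support vectors plus the new
    # data, which bounds memory while preserving the decision boundary.
    X_buf = np.vstack([svm.support_vectors_, X_new])
    y_buf = np.concatenate([y_buf[svm.support_], y_new])
    svm = SVC(kernel="linear", C=1.0).fit(X_buf, y_buf)
```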

Randomized learning-based models

Randomized learning-based techniques make it possible to build a randomized or diversified classifier ensemble that models the target appearance through random input and feature selection. They are effective in real-time systems, have little computational cost, and can be extended to multi-class learning problems. Additionally, they allow parallel processing on multi-core and GPU-based platforms, which can significantly cut processing time. However, because of their random feature selection, their tracking performance can suffer in scenes with large target appearance variations. Several randomized learning-based models have been put forth for visual tracking, such as online random forests [15] and random naive Bayes classifiers [16]. Table 4 presents some existing randomized learning-based models.

Table 4 Review of randomized learning-based appearance models
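The sketch below illustrates the randomized-ensemble idea: each member is trained on a bootstrap sample and a random feature subset, and predictions are combined by majority vote. The tree depth, ensemble size, and random data are illustrative; real trackers such as online random forests [15] update the ensemble incrementally rather than training it in batch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 48))                 # stand-in target/background features
y = np.array([1] * 30 + [0] * 30)

# Randomized ensemble: each tree sees a bootstrap sample and a random feature
# subset, keeping the members diverse, cheap to train, and easy to parallelize.
n_trees, n_feats = 10, 12
ensemble = []
for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))                  # bootstrap rows
    cols = rng.choice(X.shape[1], size=n_feats, replace=False)   # random feature subset
    tree = DecisionTreeClassifier(max_depth=4).fit(X[np.ix_(rows, cols)], y[rows])
    ensemble.append((tree, cols))

def predict(ensemble, x):
    """Majority vote over the randomized trees."""
    votes = [tree.predict(x[None, cols])[0] for tree, cols in ensemble]
    return int(np.mean(votes) >= 0.5)

print(predict(ensemble, rng.normal(size=48)))
```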

Discriminant analysis-based models

Discriminant analysis-based models are a class of algorithms used in visual tracking to handle appearance variations of the target object being tracked. Appearance variations occur due to changes in lighting conditions, pose, scale, and occlusion. The basic idea behind these models is to learn a discriminative function that distinguishes the target object from the background and other objects in the scene. This function is learned from training data consisting of samples of the target object under different appearance variations [6]. The following sections discuss these branches in detail.
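As a simple illustration of such a discriminative function, the sketch below fits linear discriminant analysis (LDA) to separate target features from background features. The Gaussian stand-in features and class means are assumptions made only for this example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)

# Stand-in features: target patches under several appearance variations (1)
# versus background patches (0), drawn from shifted Gaussians for illustration.
X_target = rng.normal(loc=1.0, size=(30, 24))
X_background = rng.normal(loc=0.0, size=(30, 24))
X = np.vstack([X_target, X_background])
y = np.array([1] * 30 + [0] * 30)

# LDA learns a projection that maximizes between-class scatter relative to
# within-class scatter, i.e., a discriminative function separating the target
# from the background.
lda = LinearDiscriminantAnalysis().fit(X, y)

candidate = rng.normal(loc=0.9, size=(1, 24))    # feature of a candidate window
print(lda.predict(candidate), lda.decision_function(candidate))
```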

Conventional discriminant analysis models

Table 5 presents a review of conventional discriminant analysis models.

Table 5 Review of conventional discriminant analysis models
Graph-driven discriminant analysis models

Recent discriminant analysis models utilize graph-based learning techniques for visual target tracking. Typically, these graph-driven models are categorized into graph-embedding-based and graph-transductive-based methods. Table 6 presents a review of graph-driven discriminant analysis models.

Table 6 Review of graph-driven discriminant analysis models

Codebook learning-based models

These models rely on the concept of a codebook: a dictionary of visual patterns learned from the target object under different appearance variations. The basic idea is to represent the target object as a bag of visual words, where each word corresponds to a visual pattern in the codebook. The bag-of-visual-words histogram is then used as a feature vector to track the target object.

The codebook is learned from training data consisting of samples of the target object under different appearance variations. It can be learned using unsupervised techniques such as k-means clustering or supervised techniques such as support vector machines (SVMs) [17].
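A minimal sketch of the unsupervised variant, assuming k-means for codebook construction, is shown below: local descriptors are quantized to their nearest visual word and pooled into a normalized histogram that serves as the tracking feature. The descriptor dimensionality and codebook size are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)

# Local descriptors collected from the target under different appearance
# variations (random stand-ins for, e.g., small patch features).
descriptors = rng.normal(size=(500, 16))

# Learn a codebook of 32 visual words with k-means.
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(descriptors)

def bag_of_words(codebook, patch_descriptors):
    """Quantize each descriptor to its nearest word and pool the assignments
    into a normalized histogram used as the tracking feature vector."""
    words = codebook.predict(patch_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

candidate_descriptors = rng.normal(size=(40, 16))  # descriptors from a candidate region
print(bag_of_words(codebook, candidate_descriptors))
```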

Discussions and analysis

Discriminative-based appearance models primarily focus on how to separate the data of the various target classes. The key challenge with these models is determining whether the provided model is properly defined. Incremental learning, i.e., updating the model of the target's visual representation during the tracking process, is introduced to make the model more effective. With this technique, the foreground and background can be separated efficiently. However, because background regions may resemble the target class, these models remain susceptible to appearance fluctuations and distractions.

As noted above, discriminative-based appearance models cast tracking as a binary classification of foreground versus background, separating target and non-target regions in an image with highly discriminative and informative visual features. Because the visual features are incrementally updated during online learning to represent the target against a complicated background, these models can achieve effective and efficient predictive performance [6].

In conclusion, generative approaches consider only the object's appearance and ignore background information. Discriminative approaches, in contrast, compute a boundary that separates the object from its surroundings by considering information about both the object and the background [35]. In the visual tracking community, the number of tracked objects and the average position error are used to evaluate tracking outcomes objectively [36, 37]. As a result, given enough samples, discriminative approaches perform better than generative ones.

Conclusion

This work concentrated on online learning modeling, a vital process for appearance modeling that is mostly employed for visual tracking. Appearance modeling, one of the key components of visual tracking systems, consists of two main parts: visual representation and statistical modeling. Both are covered thoroughly in this paper because they substantially impact the outcome of moving-object identification, which is crucial for visual tracking systems. This work emphasizes discriminative online learning-based methods and reviews them thoroughly using highly regarded, peer-reviewed literature.

Additionally, a critical analysis was conducted to discuss the benefits and drawbacks of current approaches. To further examine appearance modeling in visual tracking, the fusion of generative and discriminative online learning could be investigated, and deep learning-based approaches could be adopted to improve online learning performance.