1 Introduction

Video surveillance systems have become essential tools for monitoring public and private spaces and identifying possible threats, thereby protecting people and assets. To alleviate the burden on surveillance personnel and reduce human error, smart video surveillance approaches aim to automate tasks that involve observing actions, behaviors, or events that present a risk [109]. The advantages of this strategy over conventional video surveillance systems include the ability to prevent incidents by analyzing suspicious behaviors, video analytics capabilities that can be used for forensics and content-based retrieval, and the identification of objects and actions of interest to monitor [54]. Recent advances in smart video surveillance are being made through the use of machine learning models.

Traditionally, machine learning models are trained using a static set of data that represents, or tries to generalize, all future examples presented to these models. However, real-world environments are non-stationary, i.e., continually changing and evolving. Therefore, past data tends, over time, to stop describing the current context, and the performance of prediction models trained on that data degrades. This phenomenon is described as concept drift [137], and, in recent years, techniques to deal with it have been developed. One possible approach is to incorporate new information in streams so that, as soon as new data becomes available, it is used to update the model [55, 154]. Another approach is to detect when the drift occurs and then perform adaptation by retraining the model or switching to another one [27, 148].

Dealing with concept drift in predictive models used in surveillance video streams presents a set of challenges that are specific to this kind of application due to factors and constraints such as: (a) ideally, surveillance systems run endlessly, which implies large volumes of data being generated constantly; (b) surveillance cameras, especially those installed in outdoor environments, are in non-controlled environments where illumination and the characteristics of the scenario can change gradually or drastically; (c) although smart surveillance can serve as a post-analysis tool, to enable authorities or security personnel to take action, real-time computation is a concern.

Several comprehensive reviews and surveys of concept drift adaptation have been published in recent years [8, 28, 29, 41, 42, 68, 93]. However, none of them addresses the additional challenges that surveillance video data imposes. Notably, in [41] the authors categorize the types of concept drift and present the main drift detection techniques, proposing a general drift adaptation taxonomy. That work is a comprehensive introduction to the problem of concept drift and, as such, does not provide details about any particular machine learning problem. In [8], the focus is placed on exploring feature drift, which occurs when a subset of features becomes irrelevant to the learned concept. In [93], in addition to an extensive literature overview of concept drift, the authors present real-world and synthetic datasets used to evaluate drift detection methods. However, no video dataset is presented, nor is there any analysis of which features and algorithms are being used to deal with this type of data. In [28] and [42], the authors also present concept drift datasets, but none of them is a video dataset. More specifically, in [28] the authors mention some concept drift applications such as forecasting, recommendation systems, and energy demand prediction. They briefly consider the challenges that high-dimensional unstructured data (e.g., images) presents, but do not elaborate on how to deal with them. In [42], the authors propose a new unsupervised drift detection algorithm and compare its performance with other popular ones, evaluating it on low-dimensional datasets.

None of these reviews and, to the best of our knowledge, no other previous work focuses on surveillance video applications or on how to deal with complex video data in the presence of concept drift; instead, they center attention on datasets whose dimensionality, i.e., number of features, is relatively low in comparison with video data.

Dealing with concept drift in surveillance video streams introduces additional complexities. In this context, a fast and continuous flow produces a mutable, large volume of data. Data characteristics are constantly changing due to illumination, weather, complex interactions, and scene changes, especially in non-controlled settings such as outdoor environments. Moreover, this type of data also presents challenges imposed by the high number of dimensions, scales, and spatio-temporal relations between video frames.

The goal of this systematic review is to comprehensively analyze the state of the art of recent approaches to deal with concept drift in the context of surveillance video streams. This systematic review aims to answer the following research questions:

  • What are the existing methods and techniques to deal with concept drift in surveillance videos?

  • What feature descriptors and machine learning algorithms are used by such methods and techniques?

  • Which methods and techniques can be used in real-time?

  • What are the datasets and evaluation metrics used?

The contribution of this work is a comprehensive analysis of aspects involving concept drift adaptation in surveillance contexts, including: (a) concept drift adaptation and the proposal of a new classification to describe the adaptation process; (b) the relation between learning strategies and computer vision tasks; (c) features and machine learning techniques used in surveillance contexts; (d) challenges imposed by real-time processing (e.g., computing capacity, data volume); (e) the datasets employed; and (f) the metrics used to evaluate the effectiveness of the proposed approaches.

In Section 2, we present the theoretical background and introduce concepts related to concept drift adaptation and learning settings. Section 3 describes the method used to conduct the systematic review. In Section 4, concept drift adaptation methods are outlined. In Section 5, the relation between learning settings and computer vision tasks is presented. In Section 6, we present the features and machine learning algorithms employed by the works. In Section 7, the characteristics of the employed real-time approaches are described. In Section 8, we report the datasets and metrics used to evaluate computer vision algorithms in the context of surveillance. In Section 9, we make our considerations about the results. Lastly, the overall conclusion and future research directions are given in Section 10.

2 Theoretical background

In this section, the main concepts around concept drift adaptation in surveillance are presented.

2.1 Concept drift

In a machine learning context, learning from examples, or acquiring a concept, is what makes it feasible to generate a mathematical model that can make predictions or classifications based on feature points presented to it earlier. Concept drift [137] is a phenomenon that occurs when the context changes in such a way that the learned concept no longer holds. In other words, the real world presents contexts that are hidden from the model [159].

An example of concept drift in surveillance systems can be illustrated with a model created to detect anomalies in images from a video camera. Suppose that there were no images of rainy or snowy days in the dataset during the training phase. However, at the inference phase, images from outdoor cameras facing diverse weather conditions may cause the model to wrongly classify a rainy scene as an anomaly when it is actually a new context (Fig. 1).

Fig. 1

Source: Office of Information Technology of USP (STI-USP)

Example of concept drift in surveillance. The image to the left corresponds to the initial concept. Concerning the image to the right, due to severe rains, most of the area is flooded, which changed the characteristics of the data.

A formal definition of concept drift, as given in [41], between the times \({t}_{0}\) and \({t}_{1}\) is presented in Eq. 1.

$$\exists X:{p}_{{t}_{0}}(X,y)\ne {p}_{{t}_{1}}(X,y) \tag{1}$$

where \({p}_{{t}_{0}}\) represents the joint-distribution at the time \({t}_{0}\), between the set of input variables \(X\) and the output variable \(y\).

According to [41], there are two categories of concept drift:

  • Real concept drift: occurs when the data distribution \(p(y|X)\) changes so that the prediction capacity is affected. This category of drift can happen with or without changes in \(p(X)\).

  • Virtual concept drift: occurs when the data distribution \(p(X)\) changes without altering \(p(y|X)\). It is also the case when \(p(y|X)\) is not available for inference.
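As a reading aid, the two categories can be related to Eq. 1 through the product rule of probability. The factorization below is standard and is not an additional result of [41]; the subscript \(t\) simply marks the time at which each distribution is taken.

$$p_{t}(X,y)=p_{t}(y|X)\,p_{t}(X)$$

Under this reading, a change in the first factor between \({t}_{0}\) and \({t}_{1}\) corresponds to real concept drift, whereas a change in the second factor alone corresponds to virtual concept drift.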

2.1.1 Concept drift detection

Detecting concept drift can be achieved by using change detection algorithms. Usually, the metrics monitored are the classification error or the accuracy returned by a machine learning model. The methods to detect drift can be roughly divided into sequential analysis, statistical process control, and distribution-based [41].

Sequential analysis methods continuously use the most recent observations to evaluate if the mean of the input data significantly deviates from an allowed value. CUSUM [113] and Page-Hinkley [113] are algorithms that belong to this category.
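To make the idea of sequential analysis concrete, the following is a minimal sketch of a Page-Hinkley style test for an upward shift in the mean of a stream. The parameter names and default values (`delta`, `threshold`) and the synthetic stream are assumptions chosen for illustration; they are not taken from [113] or from any reviewed study.

```python
class PageHinkley:
    """Minimal Page-Hinkley style test for an upward shift in a stream's mean."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change (assumed value)
        self.threshold = threshold  # alarm threshold (assumed value)
        self.n = 0                  # number of observations seen so far
        self.mean = 0.0             # running mean of the observations
        self.cum = 0.0              # cumulative deviation from the running mean
        self.min_cum = 0.0          # smallest cumulative deviation seen so far

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


# Synthetic stream whose mean jumps from 0 to 1 at index 50.
stream = [0.0] * 50 + [1.0] * 50
detector = PageHinkley()
alarms = [i for i, x in enumerate(stream) if detector.update(x)]
```

In this sketch the detector is not reset after an alarm, so every observation after the first alarm is also flagged; a deployed detector would normally be reset once the model has been adapted.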

Statistical process control algorithms keep track of the statistical properties of the data distribution. The probability of error is also evaluated by considering a prediction and its true label. This type of method also defines distinct levels of change, such as warning and drift levels. Examples of this type of method are DDM [40] and EDDM [5].
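The warning and drift levels described above can be sketched as follows. This is a simplified monitor in the spirit of DDM, not a faithful reproduction of [40]: the two- and three-standard-deviation levels, the `min_samples` warm-up, and the use of strict inequalities are assumptions made for this example.

```python
import math


class ErrorRateMonitor:
    """Simplified DDM-like monitor over a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples  # warm-up length (assumed value)
        self.n = 0
        self.p = 0.0                    # running error rate
        self.p_min = math.inf           # error rate at the best point so far
        self.s_min = math.inf           # its standard deviation

    def update(self, error):
        """error is 1 for a misclassification, 0 otherwise."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s   # remember the best state
        if self.p + s > self.p_min + 3.0 * self.s_min:
            return "drift"
        if self.p + s > self.p_min + 2.0 * self.s_min:
            return "warning"
        return "stable"


# Synthetic errors: 10% error rate for 100 predictions, then 50%.
errors = [1 if i % 10 == 9 else 0 for i in range(100)]
errors += [1 if i % 2 == 1 else 0 for i in range(100)]
monitor = ErrorRateMonitor()
states = [monitor.update(e) for e in errors]
```

The warning level gives the system time to start collecting recent examples before the drift level triggers the actual adaptation.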

Distribution-based methods aim to compare two data distribution windows (i.e., subsets of the data over time). These distributions include a reference window and a window containing the most recent data. Both windows are then compared using statistical tests to infer whether a change was introduced or not. ADWIN [13] and VFDTc [39] are methods that lie in this category.
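A minimal sketch of the two-window comparison is given below. It compares the means of a fixed reference window and a fixed recent window using a Hoeffding-style bound for values in [0, 1]. This is not ADWIN itself (which resizes its window adaptively) nor VFDTc; the function name `windows_differ` and the confidence parameter `delta` are assumptions introduced for illustration.

```python
import math


def windows_differ(reference, recent, delta=0.01):
    """Return True if the means of two windows of values in [0, 1] differ
    by more than a Hoeffding-style bound at confidence 1 - delta."""
    n0, n1 = len(reference), len(recent)
    m = 1.0 / (1.0 / n0 + 1.0 / n1)            # effective sample size
    eps = math.sqrt(math.log(4.0 / delta) / (2.0 * m))
    mean0 = sum(reference) / n0
    mean1 = sum(recent) / n1
    return abs(mean0 - mean1) > eps


reference = [0.1] * 200       # e.g., per-frame error indicators or scores
same = [0.1] * 200            # recent window drawn from the same concept
shifted = [0.6] * 200         # recent window after a change
unchanged_flag = windows_differ(reference, same)
changed_flag = windows_differ(reference, shifted)
```

Any two-sample statistical test could replace the bound used here; the sketch only shows where the reference and recent windows enter the decision.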

2.2 Learning setting

Based on the learning objective, the learning setting can be classified into supervised, unsupervised, and semi-supervised.

Supervised learning

It is employed when, for every set of input variables, there is one or multiple known target variables. According to [41], real concept drift can only be verified in a supervised learning setting because it is possible to measure the discrepancy between a prediction and its ground truth.

Unsupervised learning

As opposed to supervised learning, in this type of learning setting, the examples do not need to have an annotated target variable, i.e., ground truth. The input variables themselves are used to model the output. Labeling videos is a time-consuming task since, usually, each second of a video produces 25 to 30 frames. Therefore, unsupervised learning proves to be an advantageous approach. However, the accuracy of unsupervised algorithms tends to be lower than that of supervised ones, given that knowing the target variable beforehand provides a clearer objective [19, 64].

Semi-supervised learning

A combination of supervised and unsupervised learning, semi-supervised learning requires the annotation of a subset of the training examples, which reduces the workload associated with the labeling process. An example of a problem that can potentially be addressed with this setting is anomaly detection. In this task, only video segments identified as normal are annotated and used to train a model that learns what normal video clips are like. During inference time, it is possible to tell how much a video clip deviates from the normal and then classify it as normal or abnormal.
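The anomaly detection example above can be sketched as follows, assuming each video clip has already been reduced to a single scalar feature (for instance, an average motion magnitude). The feature, the class name `NormalityModel`, and the threshold of three standard deviations are hypothetical choices for illustration only.

```python
import statistics


class NormalityModel:
    """Learns what 'normal' looks like from normal clips only."""

    def fit(self, normal_features):
        # Only clips annotated as normal are used for training.
        self.mean = statistics.mean(normal_features)
        self.std = statistics.stdev(normal_features)
        return self

    def deviation(self, feature):
        """How far a clip deviates from normal, in standard deviations."""
        return abs(feature - self.mean) / self.std

    def is_abnormal(self, feature, threshold=3.0):
        return self.deviation(feature) > threshold


model = NormalityModel().fit([1.0, 1.1, 0.9, 1.05, 0.95])
calm_clip_flag = model.is_abnormal(1.02)
busy_clip_flag = model.is_abnormal(2.0)
```

The reviewed studies use far richer representations than a single scalar, but the structure is the same: fit on normal data, then threshold a deviation score at inference time.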

2.3 Knowledge acquisition

In order to cope with concept drift, machine learning models need to be updated with new concepts, i.e., newly acquired knowledge must be added to the existing model. Three different strategies of knowledge acquisition are found in the literature: batch, incremental, and active learning.

Batch learning

A naive approach to adding new knowledge to a model is to train the model from scratch. This approach is known as batch learning [35] and requires all training examples to be present before the training process starts. The time taken to re-train the whole model is a factor that can make the adaptation process a time-consuming step and, therefore, not ideal for surveillance scenarios where timely actions are needed.

Incremental/Online learning

This type of learning allows continuous integration of knowledge into an existing model. It naturally fits within non-stationary environments where the context is constantly changing. In the literature, incremental and online learning are at times defined separately [44], but also used interchangeably [114, 159]. In this review, the terms incremental and online will refer to algorithms that can gradually add new information to an existing model.
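As a minimal illustration of gradually adding information to an existing model, the sketch below keeps one running centroid per class and updates it one example at a time, without revisiting earlier data. The nearest-centroid classifier and the name `IncrementalCentroids` are assumptions chosen for brevity; they are not a method from the reviewed studies.

```python
class IncrementalCentroids:
    """Nearest-centroid classifier updated one example at a time."""

    def __init__(self):
        self.centroids = {}  # label -> (count, mean feature vector)

    def partial_fit(self, x, label):
        count, mean = self.centroids.get(label, (0, [0.0] * len(x)))
        count += 1
        # Running-mean update: no past examples need to be stored.
        mean = [m + (xi - m) / count for m, xi in zip(mean, x)]
        self.centroids[label] = (count, mean)

    def predict(self, x):
        def sq_dist(mean):
            return sum((xi - m) ** 2 for xi, m in zip(x, mean))
        return min(self.centroids, key=lambda lbl: sq_dist(self.centroids[lbl][1]))


clf = IncrementalCentroids()
clf.partial_fit([0.0, 0.0], "day")
clf.partial_fit([1.0, 1.0], "day")
clf.partial_fit([10.0, 10.0], "night")
# A previously unseen context can be added later without retraining.
clf.partial_fit([5.0, 5.0], "rain")
```

Contrast this with batch learning, where adding the "rain" examples would require retraining on the full dataset.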

Active learning

Active learning [138] comprises strategies for obtaining annotations from an oracle (e.g., a human) by selecting only the most relevant instances, i.e., the ones classified with the most uncertainty. In this way, the model benefits from newly annotated examples that can be used for training and adaptation to new contexts, and the oracle has to label fewer instances, thus saving time and effort.
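A minimal sketch of uncertainty-based selection for a binary classifier is shown below. It assumes the model outputs a probability for the positive class and that a fixed labeling `budget` is available; both the function name and the budget mechanism are illustrative assumptions.

```python
def select_for_labeling(probabilities, budget):
    """Return the indices of the `budget` instances whose predicted
    positive-class probability is closest to 0.5 (most uncertain)."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]


# Predicted probabilities for five unlabeled instances.
probs = [0.95, 0.52, 0.10, 0.45, 0.80]
to_label = select_for_labeling(probs, budget=2)
```

Only the selected instances are sent to the oracle; the remaining ones are either discarded or kept unlabeled.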

2.4 Computer vision tasks in surveillance

Computer vision [10] involves techniques to analyze and interpret images. Video streams are a rich source of analysis where several computer vision tasks can be performed. In the context of video surveillance, these tasks aim to reveal potential risks to protect people and assets.

Anomaly detection

Anomaly detection [122, 134, 161] is the task of telling apart abnormal events from normal ones in a dataset. In videos, the detection of anomalies can provide information on where the anomalies are (spatial information, i.e., coordinates inside a frame) and also when the anomalies happen (temporal information) without necessarily indicating their spatial location.

Activity recognition and localization

Activity recognition and localization [59, 128, 153] are tasks that learn predetermined activities from the input data. In surveillance videos, examples of activities are car crashes, fire, robbery, violence, and trespassing. These tasks can be employed when the objective is clear, e.g., for a camera installed on a highway, it is possible to have a model specialized in car crashes.

Image classification

Image classification [73, 77, 92] can also be used in videos. It aims to process each video frame and then classify it with respect to a target variable. An example of image classification in a surveillance context is gun detection.

Object detection

Object detection [24, 124, 162] is a task where, in addition to knowing what an object is, it is also relevant to find the location of that object in the image. The location of an object is usually given by a set of coordinates that represent a bounding box around that object.

Re-identification

Re-identification (Re-ID) [53, 63, 69] is a task that aims to match objects in different frames and, in this way, re-identify specific objects. It can be used in biometric systems and also to identify suspects in video surveillance feeds.

3 Research method

The systematic review was conducted based on [67], which defines the phases of planning, conducting, and reporting.

The search string was defined as: (video* OR camera* OR visual OR feed) AND ("incremental" OR "continual learning" OR "active learning" OR "continuous learning" OR "online learning" OR "on-line learning" OR "adaptive learning" OR "drift" OR "concept shift" OR "dataset shift" OR "covariate shift" OR "non-stationary" OR "nonstationary") AND ("surveillance" OR "analytics" OR "security") AND NOT (tracking OR education OR classroom OR student).

The search was performed in the following scientific databases: IEEE Xplore, ACM Digital Library, and Scopus. In order to select the studies, the inclusion and exclusion criteria were defined as presented in Table 1.

Table 1 Inclusion (I) and exclusion (E) criteria adopted in the Systematic Review

To be included, a study must fulfill all the inclusion criteria and cannot fulfill any of the exclusion criteria.

Note that, for the exclusion criterion E1, only machine learning approaches were considered because concept drift itself is a term defined in a machine learning context [137, 159]. As for the exclusion criterion E2, the term drift, in an object tracking context, has a meaning that differs from drift in concept drift. Drift in object tracking, as defined by [156], occurs when the tracking fails because a significant part of the tracked object is no longer in the updated template. Concept drift is more general and does not focus on single objects in an image, but rather on global characteristics that can change and impact a model’s performance.

As shown in Fig. 2, a total of 627 papers were found during the searching phase. After removing 125 duplicates, 44 studies were selected based on the inclusion and exclusion criteria defined earlier.

Fig. 2

Systematic review process

Two studies were included manually due to their relevance to this review. Neither study explicitly mentions surveillance, but both propose relevant methods to deal with concept drift in videos. The first one is [27], where the authors propose a Convolutional Neural Network (CNN) [78] that can be re-trained only at specific layers that were potentially affected by concept drift. The other one is [81], where the authors propose a Support Vector Machine (SVM) [15] model that can learn incrementally but also forget irrelevant information. The overall information extracted from the papers is shown in Table 2.

Table 2 General information of the papers included in the systematic review

4 Concept drift adaptation

Concept drift adaptation methods can be divided into the ones that can be adapted in a continuous way and the ones that can be adapted in an informed manner. Based on the approaches to deal with drift analyzed in the papers of this review, we propose a new classification of how concept drift adaptation takes place, illustrated in Fig. 3. This classification is different from previous ones [41, 93] as it brings the dimension of active learning and draws the relationship between adaptation types and knowledge acquisition strategies.

Fig. 3

Classification of concept drift adaptation

Continuous adaptation methods use the newly and continually arriving input data to gradually update the model. In contrast, informed methods keep track of concept drifts, and when one occurs, that information is used to update the model.

Informed adaptation methods can have multiple strategies with respect to how machine learning models are updated once concept drift is detected. To generalize, as shown in Fig. 3, we further divided the adaptation of informed methods into two categories. The first is model selection/ensemble, which means that, in the presence of concept drift, either a new model will be generated or selected from a pool of existing models, or an ensemble strategy will be used with the existing or newly produced models. The second category is model retraining, which can happen either globally, by replacing the existing model, or locally, by adapting only parts of the current model.

Continuous adaptation methods rely on incremental learning to seamlessly add new instances to the already existing model (Fig. 3). These new instances can be acquired using active learning, which selects only the most informative instances to be labeled by a human oracle, or without a specific strategy for acquiring knowledge, which can be called passive learning. In Table 2, we present how the studies in this review are classified concerning their adaptation method.

Informed Adaptation

Of the 46 papers in the review, only 9% employed informed adaptation. In [112], the authors use a drift detection method to train new models when a drift is identified. In [148], the drift is detected by comparing the similarity between known data points and newly added ones. Upon drift detection, an algorithm selects a new model from a pool of existing models or trains a new one from scratch. That new model provides better accuracy than the one previously used. In [104], drift detection is used as an alert mechanism applied to road traffic. The method also "forgets" old concepts, a technique called decremental learning by the authors, which is able to drop concepts that are no longer relevant. In [27], the authors use an adapted version of CUSUM as the drift detection method. When drift is detected, only the affected layers of the CNN are retrained, while the rest is left untouched.
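The model selection branch of informed adaptation can be sketched as follows. The sketch assumes that a drift detector has already fired and that a small window of recent labeled examples is available; the function names, the `min_accuracy` threshold, and the representation of models as plain callables are hypothetical and do not reproduce the procedure of any specific study.

```python
def accuracy(model, window):
    """Fraction of (x, y) pairs in the window that the model predicts correctly."""
    return sum(model(x) == y for x, y in window) / len(window)


def adapt_on_drift(pool, recent_window, train_new, min_accuracy=0.8):
    """Pick the best existing model for the recent data, or train a new one."""
    best = max(pool, key=lambda m: accuracy(m, recent_window))
    if accuracy(best, recent_window) >= min_accuracy:
        return best
    new_model = train_new(recent_window)   # fall back to training from scratch
    pool.append(new_model)                 # keep it for recurring contexts
    return new_model


# Two toy threshold classifiers stored in the pool.
low_threshold = lambda x: x > 0.5
high_threshold = lambda x: x > 2.0
pool = [low_threshold, high_threshold]

window = [(1.0, False), (3.0, True), (2.5, True), (0.2, False)]
selected = adapt_on_drift(pool, window, train_new=lambda w: None)
```

Keeping newly trained models in the pool is one way to handle recurring contexts, such as day and night scenes, without retraining each time.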

Continuous Adaptation

Continuous adaptation methods correspond to 91% of the analyzed papers. Among these works, the combination of incremental and active learning is used in 21% of the papers, whereas incremental learning with passive learning is used in 79% of the studies.

In [149] and [6], the authors propose ways to automate and simplify the labeling process to make active learning feasible in surveillance. A visual-interactive labeling strategy is proposed by [51], where model-based suggestions and visual cues are combined to ease the labeling process for users. In [55], contextual information is obtained from the newly arriving data to improve the selection of informative instances for subsequent human labeling; once the instances are appropriately labeled, the new examples are incrementally added to the model. As in [51], the authors of [55] use an algorithm to select the most informative examples for user annotation. In [139], the authors use the output of object detection algorithms to select objects detected with low confidence to be double-checked (labeled) by humans. Similarly, in [18], unlabeled images are supplied to an object detection algorithm, and the detections with the lowest confidence with respect to a threshold are used to retrain the model. The authors call this approach semi-supervised active learning; it can also receive feedback from an oracle.

In [105], an active learning strategy called human-in-the-loop is employed, where human feedback is required whenever an anomaly is detected. If the detection is identified as a false positive, this information is then used to incrementally update the model, indicating that this example should be treated as normal. In [4], although an active learning approach is proposed, there are no specific details of how it is handled.

The remaining studies employ an incremental approach, as seen in Table 2. The algorithms used by each one of the studies are presented in Section 6. The machine learning models employed are capable of incrementally aggregating new unseen information to an existing model.

5 Learning settings and computer vision tasks

Concerning the learning settings used by the studies analyzed in this review, 56% (26 studies) are supervised, 22% (10 studies) are unsupervised, and 22% (10 studies) are semi-supervised (Table 2). The relation between computer vision tasks and learning settings is shown in Fig. 4, where a larger circle represents a greater number of studies using a learning setting to perform the corresponding computer vision task.

Fig. 4

Computer vision tasks versus learning settings

It is possible to see that the activity recognition task is employed exclusively in a supervised learning context. Although more commonly applied in a supervised setting, image classification, object detection, and re-ID are also used in settings where not all labels are available. In re-ID tasks, the unsupervised setting can be achieved by measuring similarities between one or more frames.

In object detection, even though some techniques are regarded by the authors as semi-supervised or unsupervised approaches, labeled examples are still needed, so we classified them as supervised approaches. The difference lies in how these labels are acquired, which can be by using another pre-trained object detection model [139, 148] or by employing an algorithm to automatically extract these labels, such as regions that are moving between frames [108].

Among the studies of this review, anomaly detection was only performed in a supervised setup in [4] and was used in a semi-supervised setting roughly as often as in an unsupervised one. Although this task can be performed in a supervised learning setting, this is unusual since knowing all possible anomalies beforehand is not feasible. Also, the definition of anomaly itself is ambiguous. In other words, it is impossible to know beforehand all events that constitute abnormal and normal activities in every context [30], e.g., a person running in a public square is usually not an anomaly, whereas a person running inside a bank commonly would be.

6 Features and models

Representing videos and images through features is a research topic on its own. Visual features like HOG, Haar Cascades, SIFT, Optical Flow, and 3D spatial gradients have been used extensively for the last few decades [140]. They are considered handcrafted features because they rely on previous knowledge and pre-assumptions about the input data. Recently, CNN models have been used with the reasoning that features can be learned rather than handcrafted, achieving, in this way, better representations than features designed in a manual fashion.

We consider studies that use CNN features to be cases where the attributes are learned (or self-learned). Among the studies in this review (Table 2), none published before 2018 employed learned features. In [106], the authors extract bounding boxes using a CNN but do not use the CNN feature vectors. However, after 2018, 67% of the papers (22 out of 33) used learned features.

In recent years, we have also seen approaches that combine handcrafted and learned features to achieve more robust feature representations. Some studies use features provided by other algorithms or systems. For instance, in [30], the authors use the output bounding box coordinates from YOLO [125] along with Optical Flow features.

From Table 2, 52% of the studies use handcrafted features, 37% use self-learned ones, and 11% employ both feature types simultaneously. From the studies using self-learned features, 59% also employ CNNs as the machine learning model, while the other 41% use CNNs as a feature extractor only.

Regarding machine learning models, many distinct algorithms and architectures have been used. Although each one presents its peculiarities, the models were divided into general categories. Figure 5 shows the number of times each category has been used and Fig. 6 shows the distribution of learned and handcrafted feature types among the model categories.

Fig. 5

Use of model categories among the papers

Fig. 6

Feature types used in each one of the model categories. Some models employ both types

Neural Networks

Neural Networks [99] are used in 33% of the papers in this review. This technique uses nodes that are interconnected by weight vectors to learn a prediction function based on the training data. Among papers that employed this machine learning algorithm, 90% used CNNs, a type of neural network primarily created to be used when the inputs are images.

A common issue that arises when training neural networks continuously is a phenomenon known as catastrophic forgetting [98], which can be roughly described as the tendency that a neural network has to lose old information learned previously as new examples are presented to the model. In [149], a CNN is used only to improve labeling effort, but not in classification itself. In [139], active learning is employed to reduce training time and improve the quality of object detection. In [105], the authors employ a spatio-temporal autoencoder using a Long Short-Term Memory (LSTM) [58] neural network with convolutional layers. Similarly to [139], active learning is used in order to improve classification accuracy.

In [147], the authors evaluated incremental training in a CNN using four different approaches: (a) Fine-Tuning: retraining only a subset of layers of the network; (b) Knowledge distillation: transferring knowledge from a large network to a smaller one; (c) Learning without Forgetting (LwF): extending knowledge distillation, it combines a cross-entropy loss to learn new tasks with a distillation loss to keep old ones; and (d) Joint-Training: retraining the network from scratch as new examples are added to the existing training set. As discussed by the authors, Joint-Training produced the best accuracy on old data, avoiding the catastrophic forgetting issue. However, it takes, in general, 1.3 times longer to train than the other strategies. Therefore, LwF presents itself as a middle ground between accuracy and training time, since it preserves performance on old tasks 11% more than Fine-Tuning and does not require training from scratch like Joint-Training.
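The combination of losses described for LwF can be sketched for a single training example as follows. This is an illustrative reading of the description above, not the implementation evaluated in [147]: the argument names, the `weight` balancing the two terms, and the `temperature` used to soften the old-task outputs are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def lwf_loss(new_logits, new_label, old_head_logits, recorded_old_logits,
             weight=1.0, temperature=2.0):
    """Cross-entropy on the new task plus a distillation term that keeps the
    old-task outputs close to those recorded before the update."""
    one_hot = [1.0 if i == new_label else 0.0 for i in range(len(new_logits))]
    new_task = cross_entropy(one_hot, softmax(new_logits))
    distill = cross_entropy(softmax(recorded_old_logits, temperature),
                            softmax(old_head_logits, temperature))
    return new_task + weight * distill


# Old-task outputs preserved versus old-task outputs that have drifted away.
preserved = lwf_loss([0.0, 3.0], 1, [2.0, 0.0], [2.0, 0.0])
forgotten = lwf_loss([0.0, 3.0], 1, [0.0, 2.0], [2.0, 0.0])
```

The distillation term penalizes the second case more heavily, which is how the old tasks are protected without storing the old training data.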

In [72], the authors combine three networks with distinct purposes (pre-trained prediction, continuous-updating prediction, and weight estimation) to predict future frames in video sequences. In [118], in the context of anomaly detection, a Recursive Neural Network [46] architecture is employed. Then, the estimation is obtained based on autoregression and the moving average of regression errors. In [66], new classes of objects are incrementally aggregated by a Fast R-CNN [45] model, an approach that addresses the concept drift problem indirectly.

In [4], a supervised approach to anomaly detection is presented. The authors perform the preprocessing step of removing the background and then extracting features using both an Optical Flow algorithm and a CNN, which are then input to the LSTM network. In [18], the authors suggest a face mask detection algorithm based on a CNN known as Single Shot Detector (SSD) [82]. The modification made in the original SSD algorithm aimed to improve the accuracy and involved changes including the loss function and the aspect ratio of the network layers.

In [107], cross-domain adaptation (a technique that allows a model trained in one context to be used successfully in another context without the need for retraining) is addressed from an object detection perspective. To achieve this, the authors employ a Domain Transfer Module, which consists of a two-layer CNN. Once the model is trained, it is able to incrementally add new knowledge using data drawn from the joint representation of previous targets. In [76], the authors deploy a CNN-based object detector with two detection phases: an initial detection performed directly at the camera devices, and a post-processing phase at a server. In order to update the models, the authors claim to have developed a domain adaptation mechanism that can either receive new information from user feedback or automatically update the domain information related to the spatial and location features.

Among all the works that implemented neural networks, only 20% (3 papers) did not employ CNNs. In [32], the authors put forward a framework to adapt anomaly detection incrementally and continually. The approach involves extracting information from frames, such as bounding boxes, spatial information, and distances, as input features to a Recursive Neural Network [152]. In this way, it is argued that the model can be trained continually, thus avoiding the catastrophic forgetting issue. In [31], apparently a continuation of the previous work, the authors propose a framework that, in addition to learning continually, is able to implement cross-domain adaptability and few-shot learning (the ability to achieve generalization using a relatively small set of representative data). To achieve this, visual information from activities, such as object bounding boxes, motion, and poses, is transformed into semantic features using the Word2Vec [103] algorithm, i.e., complex activities can be turned into phrases such as “person walking on the sidewalk”. Because this type of feature is more general and simpler than images, it makes cross-domain adaptability and few-shot learning more feasible. Finally, in [112] an incremental learning neural network called Probabilistic Fuzzy ARTMAP (PFAM) [79] is employed. This network provides probabilistic prediction scores based on the categorization of the feature space.

Support vector machines

The SVM [15] algorithm is employed in 17% of the analyzed studies. SVM is a classification algorithm that aims to find the decision boundary that presents an optimal margin between classes. In [81], in addition to an incremental approach, the authors also implemented a decremental strategy using sliding windows, making the algorithm capable of removing patterns that are considered obsolete. In [155], an SVM-based algorithm is proposed to make pedestrian recognition adaptive to environmental changes. This method modifies the regularization terms to incrementally construct and update its appearance model. In [143], for an action recognition task, the authors used a Structural SVM algorithm applied to short video segments, assuming that the prediction scores of interactions increase over time. In [55], [70], and [154], the SVM algorithms are capable of making incremental updates to the model to protect it against concept drift. In [86] and [87], the authors propose a face recognition system that uses an ensemble of SVMs that is self-trained (i.e., the predictions returned by the classifier are used as labels) in an incremental fashion.

Probabilistic

Probabilistic models have been used as classification algorithms in 11% of the analyzed papers. These models build probability distributions over the training data, so that, when an example is presented, the model can report the probability that the example belongs to a specific class.

In [11], the authors present a framework to obtain activity patterns from surveillance videos. In this framework, trajectory classification and anomaly detection are performed using sequential Monte Carlo techniques. In [94], a combination of handcrafted and learned features is employed to generate a Bayesian fusion model, where the last step is to use a learning-to-rank-based mutual promotion procedure to incrementally update the classifiers based on the newly acquired unlabeled data. In [3], the employed method generates Hidden Markov Models (HMM) [120]. To work incrementally, the likelihood is computed for each incoming video window. When a window matches a class, a distinct HMM is trained using this data and is used to update the already trained HMM through a weighted average.

In [75], a Gaussian Mixture Model (GMM) is used to detect anomalies in video scenes, where the mixture represents the distribution of abnormal and normal events. The Mahalanobis distance is computed to compare new feature vectors with the mean of the distribution. In [16], features are extracted using variational autoencoder models, and a novelty/anomaly classification is performed using the Markov Jump Particle Filter. Whenever new events are detected, new autoencoder models are deployed.
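The distance test mentioned for [75] can be illustrated with a minimal Python sketch. It assumes a single Gaussian component with a diagonal covariance matrix, which is a simplification made here for brevity and is not necessarily the configuration used in the cited work; the function names and the threshold value are hypothetical.

```python
import math


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under an assumed diagonal covariance.

    `var` holds the per-feature variances of the Gaussian component.
    """
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def is_anomalous(x, mean, var, threshold=3.0):
    # `threshold` is an illustrative cut-off, not a value taken from the literature.
    return mahalanobis_diag(x, mean, var) > threshold
```

A feature vector that lies far from the mean of the distribution, relative to its spread, yields a large distance and is flagged as abnormal.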

Clustering

Clustering techniques have been used by 11% of the papers. Clustering techniques aim to divide the input data set into groups. Consequently, when a new example is presented, the algorithm is able to predict to which group that instance belongs. This type of algorithm is used in an unsupervised learning setup.

In [106] and [104], the authors use an algorithm named Incremental Knowledge Acquiring and Self-Learning (IKASL) [25], which is based on Growing Self-Organizing Map (GSOM) [1]. The algorithm divides the input set into pathways that can be used in video surveillance tasks. In [80], an algorithm named Online Weighted Clustering is employed in anomaly detection, aiming to model recent events and assign large weights to clusters representing normal events. [121] employ a clustering algorithm to update pose patterns from video cameras dynamically. These extracted patterns seek to improve the labeling process by having more credible images selected for pseudo pose evaluation.

Distance based

Used in 11% of the papers in this review, distance-based algorithms use a measure of distance between instances to make inferences. Among these papers, 66% use k-Nearest Neighbors (kNN) [141]. kNN is a common choice for adaptive methods because no training is required, which makes adaptation faster than with algorithms such as SVM and neural networks. The inference time, however, is proportional to the number of instances stored by the model. In [17], a measure learning algorithm based on statistical probability is employed to incrementally re-identify targets. In [61], the Mahalanobis [97] distance is computed in conjunction with the Null Foley-Sammon Transform.

Addressing object detection, in [126], the authors propose the generation of candidate bounding boxes using a modified Haar Cascades algorithm; these candidate boxes are then used as features along with CNN visual characteristics. These feature vectors serve as input to the subsequent classification step, which uses a nearest mean classifier. Through this algorithm, new classes can be added incrementally and there is no need to store all of the training examples.
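The two properties highlighted above (incremental addition of classes and no storage of training examples) follow directly from how a nearest mean classifier works. The sketch below is a generic Python illustration under our own assumptions; it is not the implementation of [126], and the class and method names are hypothetical.

```python
class NearestMeanClassifier:
    """Keeps one running mean per class; raw training examples are not stored."""

    def __init__(self):
        self.means = {}   # label -> list with the per-feature mean
        self.counts = {}  # label -> number of examples seen so far

    def partial_fit(self, x, label):
        if label not in self.means:
            # A previously unseen class is added on the fly.
            self.means[label] = list(x)
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        mean = self.means[label]
        for i, xi in enumerate(x):
            mean[i] += (xi - mean[i]) / n  # incremental mean update

    def predict(self, x):
        def squared_distance(mean):
            return sum((xi - mi) ** 2 for xi, mi in zip(x, mean))

        # The predicted class is the one whose mean is closest to x.
        return min(self.means, key=lambda label: squared_distance(self.means[label]))
```

Memory grows with the number of classes rather than with the number of examples, which is the reason this family of classifiers suits long-running streams.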

Other models

Among other algorithms, decision tree [119], a tree-based algorithm that derives decision rules from the training data, was used by 9% of the studies. In [83], the authors use data from sensors to build an incremental model for activity recognition using a swarm decision table. One disadvantage is that people must wear these sensors, which is not always feasible due to constraints such as cost and ease of use. In [108], to perform object detection in videos, the authors automatically label objects based on moving regions and then use these labels to train a decision tree-based model using a co-training strategy for classifier grids. [51] employ a random forest algorithm to incorporate human-assigned labels in an active learning setting. In [47], the authors employ a model called Nearest Class Mean Forest (NCMF) to recognize emotions in images. The NCMF model differs from a random forest, among other characteristics, because only a random subset of available classes is considered in each node.

Ensemble techniques (the combination of predictions of multiple machine learning models) were employed by 4% of the papers. In [71], as new models are generated, old ones are progressively forgotten using a weighting strategy. In [151], an Adaboost algorithm [38] is used to make a global decision by joining a set of weak classifiers. As discussed in [93], ensembles are useful in the case of recurring drift because old models can be simply reused instead of re-trained, which results in a significant saving in computing time.

Lastly, sparse-coding [96], a technique that aims to create sparse linear combinations of basis vectors, has been used in [20] (representing 2% of the papers), where incremental updates are made to existing dictionaries, and the sparse-coding method is used to classify video segments as normal or abnormal.

7 Real-time processing

To prevent damage to people or assets promptly, real-time detection and adaptation are desired capabilities of automated surveillance systems. Of all the studies analyzed, 19% (9 papers) reported achieving real-time capacity. To measure the speed of video processing methods, the prevalent Frames per Second (FPS) measure is employed. In Table 3, we summarize the information regarding the studies that employed real-time techniques.

Table 3 Papers using real-time processing approaches

A rate higher than 23 FPS is considered real-time as it is the standard recording rate for cameras and video feeds. If a technique is not able to process frames as they arrive, the processing is delayed.
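The delay mentioned above can be made concrete with a small Python sketch. It assumes constant camera and processing rates, which is a simplification; the function name and parameters are illustrative.

```python
def backlog_after(seconds, camera_fps, processing_fps):
    """Frames still waiting to be processed after `seconds`, assuming constant rates."""
    arrived = camera_fps * seconds
    processed = min(arrived, processing_fps * seconds)
    return arrived - processed
```

For instance, under these assumptions a method that runs at 20 FPS on a 24 FPS feed is 240 frames (10 seconds of video) behind after one minute, whereas a method that runs at 30 FPS accumulates no backlog.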

To make real-time detection possible, we witness a larger use of GPU architectures. In fact, 67% of the papers claiming to use real-time approaches rely on GPU processing [18, 30, 76, 105, 148, 154]. All of these works were published after 2018 and make use of neural networks to obtain the input features.

In [154], considering the steps of feature extraction, dimensionality reduction, and inference, the authors report the speed of 25 FPS for one surveillance video feed. In [30], using pre-trained deep learning models for feature extraction and a kNN model for the testing phase, a speed of 32 FPS is reported. However, it is important to note that kNN needs to compute distances between an incoming example and all of the examples at a specific partition of the training set at inference time. Over time, the number of training examples tends to increase, resulting in performance degradation. The authors also point out that the time for the feature extraction phase can be reduced if a GPU with more computing power is used or if a faster but less accurate version of the deep learning extractors is used.

In [148], whenever concept drift is identified, an object detection model is selected or generated. The baseline model achieves the rate of 24 FPS, while the lightest model architecture yields 144 FPS. Model selection depends on a clustering algorithm, and as stated by the authors, the speed tends to decrease over time as more clusters are created.

In [105], the employed CNN uses eight consecutive frames as input, with 224 × 224 pixels each. The experiment yielded a processing rate of approximately 27 FPS. In [76], the infrastructure takes advantage of cloud computing servers with allocated GPUs. Each camera also has an embedded client that communicates with the server performing object detection. In [18], the proposed neural network, although less accurate than the state-of-the-art model, presents itself as a fast alternative.

We also see less computationally expensive approaches. The remaining 33% of papers rely on regular CPU processing [20, 80, 108]. In [80], a clustering algorithm is used to classify video clips as normal or abnormal, and the reported speed was 30 FPS. In the experiment, the video frames are resized from 158×238 pixels to 120×160, which requires less computational capacity. Similarly to [30] and [148], the clustering method suffers from speed degradation as new clusters are created. In [20], the spatio-temporal features are generated for every five consecutive frames, and then the dimensions are reduced using PCA. The authors report achieving a processing rate of 25 FPS. In [108], the author states that the implemented system runs at a rate of 24 FPS but does not give details about the type of hardware used.

Lastly, we did not consider the work presented in [149] as being a real-time surveillance method but rather a framework to reduce human effort because the authors employ an already functional real-time application to reduce the annotation to mouse clicks.

8 Datasets and evaluation metrics

8.1 Datasets

In this section, we expose the challenges that handling video data brings, and the characteristics of the datasets used in this review, including the number of frames, resolution, task type, scenes, and annotation type. We also explain why there is still a need for datasets that are proper for concept drift detection.

One of the main characteristics of video surveillance data is that the velocity of data generation is usually high (continuous surveillance video feeds). In addition to that, the number of features is also larger in comparison to other data sources (e.g., tabular data, text, audio). For instance, a single frame of a video with a resolution as low as 416 × 416 pixels, using the RGB color system, has 519,168 features (one value per pixel and color channel). At a rate of 30 FPS, one minute of that video has 1,800 frames. Furthermore, in [28], the authors mention the challenges faced when dealing with unstructured data in concept drift. In general, video data is multi-dimensional, multi-scale, has spatial relations between frames, and can have an undefined number of labels in each frame.
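The figures in this example can be checked by counting one value per pixel and per color channel:

$$416 \times 416 \times 3 = 519{,}168 \ \text{values per frame}, \qquad 30 \times 60 = 1{,}800 \ \text{frames per minute}$$

Consequently, assuming uncompressed frames, one minute of such a video corresponds to

$$1{,}800 \times 519{,}168 = 934{,}502{,}400$$

values, which illustrates why the volume of video data is a concern for drift adaptation methods.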

In [93], several datasets used in concept drift studies are presented. Most of these datasets contain structured data, such as weather and sensor data. There is also text data, which is unstructured, but as explained earlier, video data imposes different challenges, such as spatial relationships between frames and input size. The largest dataset in terms of the number of attributes has 287,034 features and only 10,983 instances.

The datasets used by the studies in this review are presented in Table 4. None of these datasets has concept drift labels, and none presents significant changes in illumination, weather, or camera movements; hence, they are not made specifically for detecting concept drift in videos. Instead, they are intended to be used for particular computer vision tasks (activity recognition, anomaly detection, object detection, image classification, or re-ID). Anomalies (or outliers) cannot be considered concept drifts but rather ephemeral changes. Therefore, the anomaly detection datasets analyzed in this review are not suitable for detecting concept drift. Apart from occasional variations, the characteristics of the input data do not change in a way that affects the output variable and degrades the prediction capacity of the models. In [30], the authors use an eight-hour-long YouTube video where it begins to rain at some point, and that rain changes the characteristics of the input data. However, the dataset is not annotated; thus, although the authors present a comprehensive analysis regarding adaptation, it is not clear when the concept drift starts.

Table 4 Datasets used by studies in this review

8.2 Metrics

Although computer vision tasks of different natures have been analyzed, the metrics used to evaluate the outcome of the algorithms can be summarized.

Considering the analyzed studies, 49% (22 papers) have used accuracy or accuracy-based metrics. We consider the recognition rate and detection rate metrics to be accuracy-based, as they are simply accuracy multiplied by 100.

Precision and recall are informative when classes are imbalanced [133], e.g., in anomaly detection, where anomalies represent only a small subset of the data. Of the studies, 15% use precision and recall, and 60% of these also use the F-measure, the harmonic mean of precision and recall, which summarizes both metrics in a single number.
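To illustrate how these metrics are obtained, the following minimal Python sketch computes precision, recall, and the F-measure from binary labels. The function name and the convention of returning 0 when a ratio is undefined are assumptions made here for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F-measure for the class marked as `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Undefined ratios are reported as 0.0 (an illustrative convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

For a set with 95 normal and 5 anomalous clips, a model that always predicts "normal" reaches an accuracy of 0.95, while its precision, recall, and F-measure for the anomalous class are all 0 under this convention.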

In addition, 32% of the papers (15 studies) use ROC AUC. The Equal Error Rate (EER) is a metric that can be used along with ROC AUC and that summarizes the trade-off between false positives and false negatives. A lower EER represents a more accurate system. The combined use of ROC AUC and EER represents 10% of the articles reviewed.

Commonly used in object detection, Average Precision (AP), or mean Average Precision (mAP), is a metric that takes into account the precision at different recall levels. It summarizes the shape of the Precision-Recall curve. In its respective equation at Table ??, the definition given by [34] is used. AP is defined as the mean precision at a set of eleven equally spaced recall levels \(\{0, 0.1, 0.2, \ldots, 1\}\). The precision at a recall level \(r\) is interpolated by taking the maximum precision observed at any recall level greater than or equal to \(r\):

$$p_{\mathrm{interp}}(r)=\max_{\widetilde{r}:\,\widetilde{r}\geq r}p(\widetilde{r})$$
(2)

where \(p(\widetilde{r})\) is the observed precision at the recall level \(\widetilde{r}\). We observe that 18% of all the studies, and 50% of the studies where the task is object detection, used AP.
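Combining the eleven recall levels with the interpolated precision, the eleven-point AP described above can be written as

$$\mathrm{AP}=\frac{1}{11}\sum_{r\in\{0,0.1,\ldots,1\}}p_{\mathrm{interp}}(r)$$

A minimal Python sketch of this computation is given below. It assumes that the observed (recall, precision) pairs have already been obtained by sweeping the detection threshold; the function name is illustrative, and the interpolated precision is taken as 0 when no observation reaches a given recall level.

```python
def eleven_point_ap(recalls, precisions):
    """Eleven-point interpolated AP from observed (recall, precision) pairs."""
    total = 0.0
    for i in range(11):
        r = i / 10
        # Interpolated precision: highest precision at any recall >= r (0 if none).
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11
```

As a usage example, a detector observed at the points (recall 0.5, precision 1.0) and (recall 1.0, precision 0.5) has an interpolated precision of 1.0 at the six levels from 0 to 0.5 and of 0.5 at the remaining five, which gives an AP of 8.5/11.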

In Fig. 7, we present the relation between computer vision tasks and the metrics chosen to evaluate them in this review. It is possible to notice that accuracy and ROC AUC are the most commonly used metrics. Also, the task of object detection presents a strong relation with the AP metric. Some metrics were not used along with some tasks, e.g., ROC AUC was not used as an evaluation metric in any object detection paper.

Fig. 7

Relation among tasks and metrics. For simplification purposes, we grouped all accuracy-based metrics (recognition rate, detection rate and matching rate) under the label “Accuracy”

9 Discussion

The compilation of the works evaluated in this review allowed us to delineate research potentials, limits, and challenges concerning concept drift adaptation in video-based surveillance regarding four dimensions: adaptation types; features and algorithms; datasets and metrics; and practical aspects. Therefore, our discussion has been structured to present the characteristics of each one of the four defined dimensions.

9.1 Concept drift adaptation

Concerning the first dimension, concept drift adaptation, we explore the implications derived from the choice of adaptation type, adaptation velocity and weighting, and concept drift awareness.

As observed in this systematic review, most concept drift adaptation methods used in surveillance contexts in recent years rely on continuous rather than informed adaptation, which is also pointed out in [41]. While continuous adaptation approaches can constantly aggregate new information into a machine learning model, one disadvantage is that the explicability is sacrificed. In other words, it is not possible to know when a drift occurs.

Furthermore, in incremental approaches, a strategy for how fast adaptation happens must be defined, which introduces a trade-off between robustness in the presence of noise and adaptation pace. The more weight is given to new examples, the faster adaptation to new concepts happens. However, the model becomes more prone to be affected by noise. On the other hand, when less weight (or no weight) is given to new examples, adaptation occurs more slowly, but the model becomes more robust to noise. We name this behavior the robustness-pace trade-off, a concept introduced in this work.
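The trade-off can be illustrated with a deliberately simple model: a scalar estimate updated by an exponentially weighted average. This is a sketch under our own assumptions (a one-dimensional stream and a fixed weight), and the function name is hypothetical.

```python
def ewma_track(stream, weight):
    """Exponentially weighted estimate; `weight` is the importance of each new example."""
    estimate = stream[0]
    history = [estimate]
    for x in stream[1:]:
        estimate = (1 - weight) * estimate + weight * x
        history.append(estimate)
    return history


# A concept change from 0 to 10: the larger weight adapts faster.
change = [0.0] * 5 + [10.0] * 5
fast_on_change = ewma_track(change, weight=0.5)
slow_on_change = ewma_track(change, weight=0.1)

# A single noisy example of value 100: the larger weight is disturbed more.
spike = [0.0] * 5 + [100.0] + [0.0] * 4
fast_on_spike = ewma_track(spike, weight=0.5)
slow_on_spike = ewma_track(spike, weight=0.1)
```

With a weight of 0.5, the estimate is close to the new concept after five examples but jumps halfway towards the noisy example; with a weight of 0.1, the noisy example has a much smaller effect, at the cost of a slower adaptation.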

When all the examples are stored and used for training, besides memory concerns (storage is not unlimited), adaptation to new concepts tends to be slower. In incremental approaches where the data is not stored and single example instances are used to update the model, old concepts are completely forgotten over time, which can be an undesirable feature in contexts where recurring drifts occur.

There are also approaches where specific windows, i.e., sets of sequential training examples, are kept in memory. Larger windows are slower to adapt to new concepts, and smaller windows are faster. In all the incremental approaches, weighting strategies can be applied to make adaptation faster or slower, considering the robustness-pace trade-off explained earlier.
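The effect of the window size can be sketched in Python with a model that is simply the mean of the examples kept in memory. The class name and the scalar stream are assumptions made for illustration.

```python
from collections import deque


class WindowMean:
    """Mean over the most recent `size` examples; older examples are forgotten."""

    def __init__(self, size):
        self.window = deque(maxlen=size)

    def update(self, x):
        self.window.append(x)  # the oldest example drops out once the window is full
        return sum(self.window) / len(self.window)


stream = [0.0] * 10 + [10.0] * 3  # the concept changes from 0 to 10
small, large = WindowMean(3), WindowMean(10)
for x in stream:
    small_estimate = small.update(x)
    large_estimate = large.update(x)
```

Three examples after the change, the small window already reflects only the new concept (an estimate of 10), while the large window still mixes both concepts (an estimate of 3).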

Informed adaptation provides information about when the concept drift happened, a knowledge that, in surveillance contexts, can also be used to trigger alerts to the security personnel, informing them that a context change has occurred. In addition to that, even though space is still a constraint, knowing when a concept drift happens eases the process of getting rid of information that is no longer required. The disadvantages of the informed approach are: a) it is highly dependent on the drift detection algorithm; b) when the model needs to be trained globally, the time spent in adaptation can be long (depending on the learning algorithm and dataset size), which affects the timeliness needed in surveillance contexts.

The development of techniques that can circumvent the disadvantages of incremental and informed concept drift adaptation is a topic that deserves more attention. For instance, drift detection could be done at the same time as an incremental training pipeline is running. Thus, it would be possible to know at which moment the concept changed.
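One possible shape of such a combination is sketched below in Python. It is only an illustration under strong assumptions: the model is a scalar exponentially weighted estimate, and drift is flagged when the mean error of a recent window exceeds a multiple of a baseline error. All names, parameters, and default values are hypothetical and do not correspond to any method in the reviewed papers.

```python
from collections import deque


def train_and_monitor(stream, weight=0.2, window=5, factor=3.0):
    """Incrementally update a scalar model and return the indices where drift is flagged."""
    estimate = stream[0]
    recent = deque(maxlen=window)  # most recent absolute errors
    baseline = None                # mean error of the first full window of a concept
    drift_points = []
    for t, x in enumerate(stream[1:], start=1):
        error = abs(x - estimate)  # evaluate on the example before learning from it
        recent.append(error)
        if len(recent) == window:
            mean_error = sum(recent) / window
            if baseline is None:
                baseline = max(mean_error, 1e-6)
            elif mean_error > factor * baseline:
                drift_points.append(t)
                recent.clear()
                baseline = None    # re-estimate the baseline for the new concept
        estimate = (1 - weight) * estimate + weight * x  # incremental update
    return drift_points
```

In this sketch, the model never stops learning, yet the returned indices record when the concept changed, which is the information that purely continuous approaches lack.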

9.2 Features and algorithms

In this section, we address the second dimension, features and algorithms, discussing aspects involving feature representation and extraction, handcrafted versus learned features, incremental versus continuous learning, supervised versus unsupervised concept drift detection, and active versus passive learning.

Feature representation and extraction directly impact a model’s effectiveness [26]; hence, they play a decisive role in pattern recognition. As presented in this review, in recent years, we have been witnessing a shift from the use of handcrafted features to learned ones and also the combination of both (Section 6). The use of learned features demands, in general, higher computing costs. Consequently, it might not be feasible in cases where more powerful computing architectures (e.g., GPUs) are not available.

Although CNNs have been widely used for feature extraction, 50% of the studies used them exclusively for that purpose and did not employ CNNs as the classifier algorithm. The use of CNNs solely as feature extractors is potentially due to the fact that this type of neural network tends to take a longer time to train than other algorithms, such as SVM [55, 154], clustering algorithms [121], and probabilistic algorithms [94]. Also, as discussed in [30], neural networks suffer from the catastrophic forgetting issue, which causes the performance to degrade over time. To overcome this issue, [27] and [147] suggested approaches to, respectively, retrain only layers affected by drift and employ learning distillation. However, neither technique completely solves the forgetting and long training time issues.

As for the other categories of models, even though the training process can be done incrementally and is usually faster, an effective storage management strategy needs to be defined to cope with the robustness-pace trade-off. In learning settings where no examples are discarded from memory, the training time increases as more examples are added. Similarly, for distance-based algorithms such as kNN, the inference time grows as new instances are aggregated to the model.

Works that combine learned and handcrafted features do so as a means to explore different characteristics of the input data, thus, improving generalization and accuracy. Besides, another justification for this hybrid approach is reducing computing time. This is the case in [31], where the authors use Optical Flow features along with CNN ones and then transform them into semantic representations that have fewer dimensions than the original data.

Regarding machine learning algorithms used to cope with concept drift, in [28], a review on learning in non-stationary environments, the authors mention that among the continuous adaptation methods, decision trees were one of the most popular algorithms when considering non-ensemble approaches. This differs from the analysis made in recent years, as decision tree-based models represent 9% of the total. This is potentially due to the increasing adoption of neural network approaches since the study was published.

Although methods to detect real concept drift rely on annotated instances being evaluated in an existing model [41], the amount of generated data makes the labeling process time-consuming and expensive in surveillance contexts. Thus, detecting drift from a supervised learning approach becomes a less attractive solution. Concept drift detection based on data distribution [93] has been explored as a viable solution [104, 148]. In this case, concept drift is detected by analyzing the dataset itself, which does not require labeled examples.

In [148], the video frames are represented in a lower-dimensional space using a combination of an adversarial autoencoder and a GAN [48]. Then, a clustering algorithm is used to detect concept drift. However, as mentioned earlier, the clustering method tends to get slower and take up more memory space as more instances are aggregated to the model. Similarly, in [104], the authors use a clustering method that brings the same disadvantages.

More research can be done towards approaches that, considering the high dimensionality and high volume of video data, explore the use of established concept drift detection techniques (Section 2.1.1) that can work with less memory and computation. For instance, dimensionality reduction techniques such as PCA or autoencoders can be employed to summarize the data distribution, and the reconstruction error of new data points can then be fed to a drift detection algorithm such as EDDM or Page-Hinkley.
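The last step of this suggestion can be sketched in Python as follows. The sketch implements the Page-Hinkley test for an increase in the mean of a stream and feeds it with reconstruction errors that are assumed to have been computed elsewhere (e.g., by PCA or an autoencoder); the error values, parameter defaults, and names are illustrative assumptions.

```python
class PageHinkley:
    """Page-Hinkley test for an increase in the mean of a stream of values."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm threshold
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0       # cumulative deviation from the running mean
        self.minimum = 0.0          # smallest cumulative deviation seen so far

    def update(self, value):
        """Consume one value and return True if drift is signalled."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cumulative += value - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


# Hypothetical reconstruction errors: low for the known scene, higher after a change.
errors = [0.1] * 50 + [1.0] * 50
detector = PageHinkley()
alarms = [i for i, e in enumerate(errors) if detector.update(e)]
first_alarm = alarms[0] if alarms else None
```

Only a few scalars are kept in memory regardless of the length of the stream, which is what makes this kind of detector attractive for continuous video feeds.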

Active learning aims to enable the acquisition of labels by having an active oracle available while the training process occurs. Nevertheless, acquiring information in this way presents challenges, such as: a) trusting the oracle’s labels is not always possible since the quality of these labels can drop over time, and an open research problem is to make machine learning models capable of evaluating the quality of these labels [138]; b) how to interact with oracles in a way that reduces their effort and optimizes the quality of the labels. General protocols, frameworks, and tools could be developed to this end.

Obtaining labeled information automatically is preferable, but not always achievable. In [108], the authors use background subtraction to extract moving regions and automatically annotate them. Nevertheless, this approach is not extensible to tasks such as activity recognition, where a label is still needed in order to inform which type of activity is taking place.

9.3 Datasets and metrics

The use of datasets and metrics is the third dimension. Regarding this dimension, we discuss the issue of the lack of annotated datasets for concept drift detection, as well as the employment of more distinctive evaluation metrics.

Regarding the datasets used by studies in this review, it was possible to conclude that there is no dataset made specifically to detect drift in surveillance contexts. Beyond the lack of annotations of when drifts occur, the datasets do not present significant changes at the scenes. Illumination, weather, and structural scene changes are some of the phenomena that frequently happen in surveillance contexts and are missing from such datasets. In [93], the authors present several datasets employed in concept drift detection. Although the datasets presented by them do not provide explicit concept drift annotations as well, they do contain drifts that are usual to their respective context (sensor, weather, spam, etc.), a factor not present in the datasets analyzed in this review. Hence, surveillance video datasets where concept drifts occur could be developed and published.

Concerning the metrics employed to assess the quality of the classifiers, accuracy has been the most used one. Despite being straightforward to compute and understand, using accuracy alone is problematic since it does not handle well problems of imbalanced class distribution, with the minority class being less favored than the majority one [60]. Metrics such as ROC AUC and F-measure, for general classification tasks, and average precision, for object detection, are more distinctive and robust measures than accuracy. Thus, more works using metrics that are more distinctive than accuracy can be explored in the subject of video surveillance drift adaptation.

9.4 Practical aspects

Finally, with regard to the fourth and last dimension, we explore practical aspects regarding machine learning and concept drift adaptation in surveillance contexts, including multi-camera environments, real-time processing, data storage, handling sensitive and personal data, and machine learning frameworks for video-based surveillance data.

In large outdoor environments, or even in places with more than one room, a common scenario for video surveillance is to deploy multiple cameras in order to monitor different locations or viewpoints. In this context, considering real-time detection, inference has to be performed in parallel instead of sequentially for each camera. From this practical point of view, real-time approaches using less computational power are preferable. Most of the papers that claim real-time capacity rely on GPU processing, which can potentially use all the GPU units of the computer to perform a single inference. In addition to the cost of having several of these powerful machines, configuration (e.g., deploying new machines) and smart allocation of resources to prevent idle cameras from wasting computational power are other concerns.

Data storage is another issue. Data from surveillance feeds and closed-circuit television (CCTV) are usually generated ceaselessly at a greater velocity than they can be analyzed. Hence, techniques and protocols to process data in distributed ways and also discard it when it is no longer needed could be explored and employed. Additionally, as surveillance video data usually contain sensitive and personal data, data security and privacy are other aspects to consider and have been gaining more research attention over the last few years [163].

There is also a need for frameworks and tools not only to analyze surveillance video data but also to manage and cope with concept drift. This could be done by providing a set of concept drift techniques that can be used, compared, and extended along with techniques to perform computer vision tasks (e.g., object detection, activity recognition, anomaly detection) using traditional and deep learning techniques that can be reused and shared.

10 Conclusion

The main contributions of this work are the delineation of the limitations and research opportunities involving methods, techniques, and strategies to deal with concept drift in surveillance, as well as practical aspects involving computing resources and real-time processing; and the proposal of a new classification of concept drift adaptation methods. This classification differs from previous ones as it establishes a relationship between adaptation types and knowledge acquisition strategies, and includes active learning, a relevant technique to acquire new concepts in the presence of drift.

The results show that much more attention has been given to methods that adapt to new concepts in a continuous way rather than in an informed one, and, although blind adaptation in non-stationary environments has advantages, the information on when the concept drift occurred is not available. The continuous adaptation methods include approaches using incremental learning and active learning, while the informed adaptation settings include model selection, model creation, and retraining, which can be done locally or globally.

The relation between computer vision learning tasks and learning settings (supervised, unsupervised, and semi-supervised) used in surveillance was explored, and while tasks such as activity recognition were only performed in a supervised setting, other tasks like anomaly detection were usually done in unsupervised or semi-supervised settings. In such settings, real concept drift cannot be verified because ground truth annotations are not available. Therefore, techniques that explore virtual concept drift must be employed.

Regarding features and machine learning algorithms, even though we witness an increase in the adoption of learned features over handcrafted ones, traditional methods, such as SVMs and clustering algorithms, tend to be employed more than modern deep learning strategies due to the time taken for adaptation combined with the phenomenon of catastrophic forgetting, usually present in neural networks.

This literature review will help researchers of areas related to machine learning in surveillance to have a comprehensive vision of how the phenomenon of concept drift, in the context of video surveillance, has been handled in recent years, serving as a foundation for other research works. As video surveillance is crucial to improve the security of public and private spaces, the real-world impact of this work is enabling the understanding of alternatives to deal with concept drift, consequently, improving the overall performance of learning methods in this specific scenario, which, as outlined before, presents inherent characteristics and additional complexities over other use cases.

Future research directions include more exploration of informed concept drift adaptation approaches for surveillance, the creation of datasets crafted for non-stationary surveillance environments, the investigation of strategies for data management for continuous surveillance video streams, and the combination of CNNs and traditional approaches.