1 Introduction

In the context of thermal video security monitoring, the sensors responsible for quantifying the observed infrared radiation as a thermograph can be split into two groups: sensors that produce relative thermographs and sensors that produce absolute thermographs. Absolute thermographs map the observed radiation directly to temperature, whereas relative thermographs produce observations relative to the “coldest” and “warmest” radiation in the scene. In security monitoring contexts the absolute temperature readings produced by an absolute thermograph are not necessary and can even suppress thermal details when observing a thermally uniform environment. Furthermore, absolute thermal cameras are considerably more expensive than their relative counterparts.

When performing image recognition tasks, the visual appearance of objects and their surroundings is very important, and in an outdoor context subject to changes in temperature, weather and solar radiation, among others, this appearance changes quite drastically. This is further compounded by societal factors, such as the recent pandemic introducing mandatory face masks. The phenomenon is known as “Concept Drift”: the objects themselves remain the same, but the concept, as observed through its representation, changes. While in theory it could be possible to collect a dataset large enough to encompass all weather conditions, the actors within the context, usually people, also dress and act differently over time. Furthermore, the cost of producing such a dataset would be extensive, as potentially years’ worth of data would have to be annotated. Typically, deployed object detectors start from a pretrained baseline and have to be retrained when the observed context drifts too far from the training context. The reliability of such a system is questionable, as deployed algorithms tend not to have a way to quantify their performance during deployment, and extra data would have to be routinely annotated to verify that the system is still performing as expected. To address this issue and foster more research into the long-term reliability of deployed learning-based object detectors, a benchmark for quantifying the impact of concept drift could greatly benefit the field.

The ECCV 2022 ChaLearn LAP Seasons in Drift Challenge proposes a setting for evaluating the impact of concept drift on a month-by-month basis and in a weighted manner. The problem of concept drift is exacerbated by limited training data, particularly when the distribution of visual appearance in the data is similar. To explore the consistency of performance of object detection algorithms across varied levels of concept drift, an extended set of frames spanning several months was annotated. The challenge attracted a total of 184 participants across its different tracks, who made a total of 691 submissions over the different challenge stages and tracks, successfully establishing a benchmark for thermal concept drift. Top-winning solutions outperformed the baseline by a large margin following distinct strategies, detailed in Sect. 4.

The rest of the paper is organized as follows. In Sect. 2 we present the related work. The challenge design, which includes a short description of the adopted dataset, the evaluation protocol and the baseline, is detailed in Sect. 3. Challenge results and top-winning solutions are discussed in Sect. 4. Finally, conclusions and suggestions for future research directions are drawn in Sect. 5.

2 Related Work

Popular thermal detection and segmentation datasets, such as KAIST [13] and FLIR-ADAS [24], provide paired thermal and visible images. A large part of academic research has therefore focused on leveraging multi-modal input [10, 16, 29, 30] or on using the aligned visible/thermal pairs for unsupervised domain adaptation between the visible and thermal domains [7, 10, 25, 28]. Approaches that leverage the multi-modal input directly typically use siamese-style networks to perform modality-specific feature extraction and subsequently combine the information with a learned fusion scheme [16, 25, 29]; alternatively, simple concatenation or addition is performed after initial feature extraction [10, 30]. In contrast, a network can be optimized to be domain agnostic: HeatNet [25] and DANNet [28] leverage an adversarial approach to guide the network to extract domain-agnostic features.
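To make these fusion strategies concrete, the following is a minimal PyTorch sketch of a two-stream design with simple channel concatenation after modality-specific feature extraction; the encoder and head modules are generic placeholders of our own rather than a specific published architecture.

```python
import torch
import torch.nn as nn


class TwoStreamFusion(nn.Module):
    """Minimal sketch of multi-modal fusion: modality-specific encoders
    for the visible and thermal inputs, fused by channel concatenation.
    The encoders and detection head are generic placeholders."""

    def __init__(self, encoder_rgb: nn.Module, encoder_thermal: nn.Module,
                 head: nn.Module):
        super().__init__()
        self.encoder_rgb = encoder_rgb
        self.encoder_thermal = encoder_thermal
        self.head = head  # operates on the fused feature map

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        f_rgb = self.encoder_rgb(rgb)                 # modality-specific features
        f_thermal = self.encoder_thermal(thermal)
        fused = torch.cat([f_rgb, f_thermal], dim=1)  # simple fusion by concatenation
        return self.head(fused)
```

Learned fusion schemes replace the concatenation with a trainable module (e.g., attention or gating over the two feature maps), but the overall two-stream structure remains the same.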

It has been shown that in security monitoring contexts the fusion of visible and thermal images outperforms either modality alone [14, 17]; however, real-world camera setups tend to use a single sensor. While thermal cameras are robust to changes in weather and lighting conditions, they still struggle with the change of visual appearance of objects due to changes in scene temperature [6, 8, 9, 15, 17]. Early work [9] leveraged edges to highlight objects, making detection robust to this variation as long as the relative contrast between objects and their surroundings remained consistent. Recent studies leverage research from the visible imaging domain and apply it directly to the thermal domain [6, 17]. Thermal-specific detection methods have been a rarity until recently, when it was shown that contextual information is important for increasing robustness to day/night variation in thermal-only object detection [15, 23]. By conditioning the latent representation with an auxiliary day/night classification head, both day and night accuracy can be significantly increased [15]. A similar increase in performance can be obtained with a combination of a shallow feature extractor and residual FPN-style connections [8]. Notably, these residual connections are only leveraged during training to enforce the learning of discriminative features throughout the network; they serve no purpose during inference and can therefore be removed.
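As an illustration of the conditioning idea (a simplified sketch, not the exact formulation of [15]), an auxiliary head can predict a day/night probability from the backbone features and use it to interpolate between two learned channel-wise scalings of the latent representation; the module below is hypothetical and only meant to convey the mechanism.

```python
import torch
import torch.nn as nn


class DayNightConditioning(nn.Module):
    """Sketch of latent conditioning via an auxiliary day/night head:
    the predicted probability blends two learned per-channel scalings
    that modulate the backbone feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1), nn.Sigmoid())      # p(day)
        self.scale_day = nn.Parameter(torch.ones(channels))
        self.scale_night = nn.Parameter(torch.ones(channels))

    def forward(self, feats: torch.Tensor):
        p_day = self.classifier(feats)                  # (B, 1)
        scale = (p_day * self.scale_day
                 + (1.0 - p_day) * self.scale_night)    # (B, C)
        conditioned = feats * scale[:, :, None, None]   # channel-wise modulation
        return conditioned, p_day  # p_day is supervised with a BCE loss
```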

3 Challenge Design

The ECCV 2022 Seasons in Drift ChallengeFootnote 1 aimed to spotlight the problem of concept drift in a security monitoring context, highlight the challenges and limitations of existing methods, and provide a direction for future research. The challenge used an extension of the LTD Dataset [21], which consists of thermal footage spanning multiple seasons, detailed in Sect. 3.1. The challenge was split into 3 tracks associated with thermal object detection, each with the same evaluation criteria and data but varying the amount of training data as well as the time span it covers, as detailed next.

  • Track 1 - Detection at day level: Train on data from a single predefined day and evaluate concept drift across timeFootnote 2. The selected day is the 13th of February 2020, the coldest day in the recorded data; since the relative thermal appearance of objects varies the least in cold environments, this serves as our starting point.

  • Track 2 - Detection at week level: Train on data from a single predefined week and evaluate concept drift across timeFootnote 3. The selected week is the 13th - 20th of February 2020 (i.e., expanding from our starting point).

  • Track 3 - Detection at month level: Train on data from a single predefined month and evaluate concept drift across timeFootnote 4. The selected month is the entire month of February.

The training data is chosen by selecting the coldest day and the data surrounding it, as cold environments introduce the least amount of concept drift. Each track aims at evaluating how robust a given detection method is to concept drift by training on limited data from a specific time period (a day, week or month in February) and evaluating performance across time, validating and testing on months of unseen data (Jan., Mar., Apr., May, Jun., Jul., Aug. and Sep.). The February data is only present in the training set, and the remaining months are equally split between validation and test.

Each track is composed of two phases, i.e., a development and a test phase. During the development phase, public training data was released and participants needed to submit their predictions with respect to a validation set. At the test (final) phase, participants needed to submit their results with respect to the test data, which was released just a few days before the end of the challenge. Participants were ranked, at the end of the challenge, using the test data. It is important to note that this competition involved the submission of results (and not code). Therefore, participants were required to share their code and trained models after the end of the challenge so that the organizers could reproduce the results submitted at the test phase, in a “code verification stage”. At the end of the challenge, top-ranked methods that passed the code verification stage were considered valid submissions.

3.1 The Dataset

The dataset used in the challenge is an extension of the Long-Term Thermal Imaging [21] dataset. It spans 188 days in the period from the 14th of May 2020 to the 30th of April 2021, with a total of 1689 2-minute clips sampled at 1 fps and associated bounding box annotations for 4 classes (Human, Bicycle, Motorcycle, Vehicle). The dataset includes data from all hours of the day in a wide array of weather conditions, overlooking the harborfront of Aalborg, Denmark. It depicts the drastic changes in appearance of the objects of interest, as well as of the scene itself, over time in a static security monitoring context, enabling the development of robust algorithms for real-world deployment. Figure 1 illustrates the camera setup and two annotated frames of the dataset, obtained at different time intervals.

Fig. 1. Illustration of the camera setup (a) and two annotated frames of the dataset, captured at different time intervals (b-c).

For a detailed explanation of the dataset's weather conditions, an overview can be found in the original dataset paper [21]. As for the extended annotations provided with this challenge, the distribution of classes is heavily skewed towards the classes that are most commonly observed in the context. As can be seen in Table 1, the total number of occurrences of each class is heavily skewed towards the Person class. Furthermore, as can be seen in Fig. 2, each class follows roughly the same trend in terms of the density with which it occurs: while the most common count of a given object in a single image is 1 for all classes, the range of occurrences is greater for the Person category.

Fig. 2. Histogram of object density across the dataset: density of objects (x-axis) and occurrences (y-axis).

The camera used for recording the dataset was elevated above the observed area, and objects often appear very distant from the camera; in combination with the resolution of the camera, most objects therefore appear very small in the image (see Fig. 1). Table 1 summarizes the number of objects from each class in each size category. Size is classified using the same scheme as the COCO dataset [19], where objects with \(area < 32^2\), \( 32^2< area < 96^2\) and \(area > 96^2\) are considered small, medium and large, respectively. The density of object sizes is also illustrated in Fig. 3, where it can be seen more clearly that the vast majority of objects fall within the small category. This holds true for the Person, Bicycle and Motorcycle classes, whereas the Vehicle class covers all size categories more evenly. This is a result of larger vehicles only being allowed to drive in the area closest to the camera.
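As a minimal sketch of how this size scheme is applied (the function name and box format are our own), a box can be assigned a category from its pixel dimensions using the COCO area thresholds:

```python
def coco_size_category(box_width: float, box_height: float) -> str:
    """Assign a COCO-style size category from a box's pixel dimensions:
    small (< 32^2), medium (32^2 to 96^2) or large (> 96^2)."""
    area = box_width * box_height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```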

Fig. 3. Illustration of object size (height\(\times \)width, in pixels) across the dataset. The white outlines separate the areas that would be labeled as small, medium and large following the COCO standard.

Table 1. Object frequency observed for each COCO-style size category.

3.2 Evaluation Protocol

The challenge followed the COCO evaluationFootnote 5 scheme for mAP. The primary metric is mAP averaged across 10 different IoU thresholds (ranging from 0.5 to 0.95 in 0.05 increments). This is calculated for each month in the validation/test set, and the model is then ranked based on a weighted average over the months (more distant months having a larger weight, as more concept drift is present), referred to as \(mAP_w\) in the analysis of the results (Table 2). The evaluation is performed with the official COCO evaluation toolsFootnote 6.
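For illustration, the sketch below evaluates one month of detections with the official pycocotools API and combines the monthly scores with a weighted average; the per-month weights shown are placeholders chosen for this example, as the exact challenge weights are not reproduced here.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval


def monthly_map(gt_json: str, det_json: str) -> float:
    """COCO mAP@[.5:.95] for one month of ground truth and detections."""
    coco_gt = COCO(gt_json)
    coco_dt = coco_gt.loadRes(det_json)
    evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()
    return evaluator.stats[0]  # AP @ IoU=0.50:0.95, all areas


# Hypothetical weights: months further from the training month count more.
MONTH_WEIGHTS = {"jan": 1, "mar": 1, "apr": 2, "may": 3,
                 "jun": 4, "jul": 5, "aug": 6, "sep": 7}


def weighted_map(per_month_map: dict) -> float:
    """Weighted average of monthly mAP values (the challenge's mAP_w)."""
    total = sum(MONTH_WEIGHTS[m] for m in per_month_map)
    return sum(v * MONTH_WEIGHTS[m] for m, v in per_month_map.items()) / total
```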

3.3 The Baseline

The baseline is a YOLOv5 with the default configuration from the UltralyticsFootnote 7 repository, including augmentations. It was trained with a batch size of 64 for 300 epochs, with an input image size of 384\(\times \)288, and the best-performing model was chosen. Naturally, the labels were converted to the normalized YOLO format ([cls] [c\(_x\)] [c\(_y\)] [w] [h]) for both training and evaluation. For submission on the Codalab platform, they were converted back to ([cls] [tl\(_x\)] [tl\(_y\)] [br\(_x\)] [br\(_y\)]) coordinates. The models were all trained on the same machine with 2\(\times \) Nvidia RTX 3090 GPUs, using multi-GPU training via the PyTorch distributed learning module.
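As a small sketch of the label conversion used for submission (assuming the 384\(\times \)288 frame resolution mentioned above), a normalized YOLO box can be mapped back to absolute corner coordinates as follows:

```python
def yolo_to_corners(cx: float, cy: float, w: float, h: float,
                    img_w: int = 384, img_h: int = 288):
    """Convert a normalized YOLO box (center x/y, width, height in [0, 1])
    to absolute top-left / bottom-right pixel coordinates."""
    tl_x = (cx - w / 2.0) * img_w
    tl_y = (cy - h / 2.0) * img_h
    br_x = (cx + w / 2.0) * img_w
    br_y = (cy + h / 2.0) * img_h
    return tl_x, tl_y, br_x, br_y
```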

4 Challenge Results and Winning Methods

The challenge ran from the 25th of April 2022 to the 24th of June 2022 through CodalabFootnote 8 [22], a powerful open-source framework for running competitions that involve result or code submission. It attracted a total of 184 registered participants: 82, 52 and 50 on tracks 1, 2 and 3, respectively. During the development phase we received 267 submissions from 17 active teams on track 1, 117 submissions from 6 teams on track 2, and 96 submissions from 4 teams on track 3. At the test (final) phase, we received 84 submissions from 23 active teams on track 1, 55 submissions from 22 teams on track 2, and 72 submissions from 24 teams on track 3. The reduction in the number of submissions from the development to the test phase is explained by the fact that the maximum number of submissions per participant at the final phase was limited to 3, to minimize the chance of participants improving their results by trial and error.

Table 2. Codalab leaderboards\(^*\) at the test (final) phase.

4.1 The Leaderboard

The leaderboards at the test phase for the different tracks are shown in Table 2. Note that we only show the top-5 solutions (per track), in addition to the baseline results. Top solutions that passed the “code verification stage” are highlighted in bold. The full leaderboard of each track can be found on the respective Codalab competition webpage.

As expected, Table 2 shows that overall better results are obtained with more training data: a model trained at the month level is overall more accurate than the same model trained at the week level, which is in turn more accurate than the one trained at the day level. However, the performance improvement when training at the month level (compared to the week level) is smaller than that obtained when training at the week level (compared to the day level), particularly when a large shift in time is observed (e.g., from Jun. to Sep.), suggesting that the increase in training data from week to month level may have a small impact when large shifts are present. This was also observed by Team heboyong (described in Sect. 4.3), which reported having used only week-level data to train their model on Tracks 2 and 3, based on the observation that using more data did not improve the final result. This raises an interesting point: even for winning approaches, the variation in the training data is much more important than its amount. A further analysis of what causes the loss of mAP across time is presented in Sect. 4.4.

Table 3 shows some general information about the top winning approaches. As can be seen from Table 3, common strategies employed by the top-winning solutions are the use of pre-trained models combined with data augmentation. Next, we briefly introduce the top-winning solutions that passed the code verification stage, based on the information provided by the authors. For detailed information, we refer the reader to the associated fact sheets, available for download on the challenge webpage (see footnote 1). Two participants (i.e., Team GroundTruth and Team heboyong) ranked best on all tracks. Each participant applied the same method on all tracks, but trained at day, week or month level, as detailed next.

Table 3. General information about the top winning approaches.

4.2 Top-1: Team GroundTruth

Team GroundTruth proposed to take advantage of temporal and contextual information to improve object detection performance. Based on Scaled-YOLOv4 [26], they first perform sparse sampling at the input. The best sampling setting is defined based on experiments with different sampling methods (i.e., average sampling, random sampling, and active sampling). Mosaic [1] data augmentation is then used to improve the detector's recognition ability and robustness to small objects. To obtain a more accurate and robust model at the inference stage, they adopt Model Soups [27] for model integration, given the results obtained by Scaled-YOLOv4p6 and Scaled-YOLOv4p7 detectors trained with different hyperparameters, also combined with horizontal-flip data augmentation to further improve the detection performance. Given a video sequence of region proposals and their corresponding class scores, Seq-NMS [12] associates bounding boxes in adjacent frames using a simple overlap criterion; it then selects boxes to maximize a sequence score. Those boxes are used to suppress overlapping boxes in their respective frames and are subsequently re-scored to boost weaker detections. Seq-NMS [12] is thus applied as post-processing to further improve the performance. An overview of the proposed pipeline is illustrated in Fig. 4.
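As a minimal sketch of the Model Soups idea (not the team's exact implementation), a uniform soup simply averages the parameters of several checkpoints that share the same architecture but were trained with different hyperparameters:

```python
import torch


def uniform_soup(checkpoint_paths):
    """Average the parameters of fine-tuned checkpoints sharing the same
    architecture (a uniform 'model soup'); assumes each file stores a
    plain state_dict."""
    n = len(checkpoint_paths)
    soup = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.float() / n for k, v in state.items()}
        else:
            for k, v in state.items():
                soup[k] += v.float() / n
    return soup  # load with model.load_state_dict(soup)
```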

Fig. 4. Top-1 winning solution pipeline: Team GroundTruth.

4.3 Top-2: Team Heboyong

Team heboyong employed Cascade R-CNN [4], a two-stage object detection algorithm, as the main architecture, with a Swin Transformer [20] backbone. According to the authors, the Swin Transformer gives better results than other CNN-based backbones, and CBNetV2 [18] is used to enhance it and further improve accuracy. MMDetection [5] is adopted as the main framework. During training, only 30% of the training data is randomly sampled, to reduce overfitting, combined with different data augmentation methods such as large-scale jitter, random crop, MixUp [31], Albu augmentation [3] and CopyPaste [11]. At the inference stage, they use Soft-NMS [2] and flip augmentation to further enhance the results. An overview of the proposed pipeline is illustrated in Fig. 5. They also reported that they did not adequately address the long-tail problem caused by the extreme sparsity of the bicycle and motorcycle categories, which resulted in low mAP for these two categories.
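As a brief sketch of the Gaussian Soft-NMS [2] post-processing (a simplified reference implementation, not the team's exact code), overlapping detections have their scores decayed rather than being removed outright:

```python
import numpy as np


def soft_nms(boxes: np.ndarray, scores: np.ndarray,
             sigma: float = 0.5, score_thresh: float = 0.001) -> list:
    """Gaussian Soft-NMS: decay the scores of boxes overlapping the
    currently selected box instead of discarding them.
    boxes: (N, 4) array in (x1, y1, x2, y2) format; scores: (N,)."""
    scores = scores.astype(float)
    remaining = list(range(len(scores)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        keep.append(best)
        remaining.remove(best)
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= np.exp(-(overlap ** 2) / sigma)  # Gaussian decay
        remaining = [i for i in remaining if scores[i] >= score_thresh]
    return keep  # indices of kept boxes, in order of selection


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```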

4.4 What Challenges the Models the Most?

In this section we analyze the performance of the baseline, Team GroundTruth's and Team heboyong's models on the test set. In particular, we inspect the performance of each model with regard to temperature, humidity, object area and object density. Temperature and humidity are chosen because these two factors were found to have the highest correlation with visual concept drift [21]. Additionally, because of the uneven distribution of object densities across the dataset, the impact of object density is also investigated.

Impact of Temperature. As can be observed in Fig. 6, the performance of the models degrades as the temperature increases. This is expected, as the available training data was picked from the coldest month, so warmer scenes are not properly represented in the training data; as mentioned in Sect. 3, this was done deliberately, since temperature is one of the most impactful factors of concept drift in thermal images [21]. The baseline shows severe degradation compared to Team GroundTruth and Team heboyong, although performance degrades consistently for all models. Interestingly, Team heboyong's method is distinctly more sensitive to concept drift with the smaller training sets, while the winning solution seems to perform consistently regardless of the amount of training data.
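A simple way to produce such a breakdown (a sketch assuming per-frame temperature metadata is available, as provided with the dataset [21]) is to bucket frames by temperature and evaluate each bucket separately:

```python
from collections import defaultdict


def bin_frames_by_temperature(frame_temps: dict, bin_width: float = 5.0) -> dict:
    """Group frame ids into temperature bins (e.g. 5 degC wide) so that
    detection metrics can be computed per bin, as in Fig. 6.
    frame_temps maps frame_id -> temperature in degrees Celsius."""
    bins = defaultdict(list)
    for frame_id, temp in frame_temps.items():
        lower = (temp // bin_width) * bin_width
        bins[(lower, lower + bin_width)].append(frame_id)
    return bins
```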

Fig. 5. Top-2 winning solution pipeline: Team heboyong.

Impact of Humidity. According to the original dataset paper [21], humidity is one of the most impactful factors of concept drift, as it tends to correlate positively with the different types of weather. This leads to a quite interesting observation, which can be made across all tracks: as shown in Fig. 7, the mAP of the detectors increases with humidity on all tracks. This could be because higher humidity tends to correlate with the level of rain clouds, which would explain partially cloudy conditions being more difficult for the detectors, as the visual appearance of the image is less uniform.

Impact of Object Size. As expected, the models converge towards fitting bounding boxes to the most dominant object size of the training data (see Table 1). As shown in Fig. 8, the models obtain very good performance on the most common object sizes and struggle with objects as they increase in size and rarity. In this case the participants see strong improvements over the baseline and also manage to become more robust towards the rarer cases. As can also be observed in the figure, this problem is increasingly alleviated as the amount of training data grows.

Fig. 6. Overview of performance with samples separated with regard to the temperature recorded for the given frame.

Fig. 7. Overview of performance with samples separated with regard to the humidity recorded for the given frame.

Impact of Object Density. As shown in Fig. 2, the density of objects for the majority of the images is towards the lower end, so one would expect the detectors' mAP to degrade when a scene becomes more crowded and the individual objects become more difficult to detect due to occlusions. However, what is observed (Fig. 9) is that the mAP of the highlighted methods remains consistent as density increases, while the performance across densities also correlates with the amount of training data.

Fig. 8. Overview of performance with samples separated with regard to the size of the objects' bounding boxes.

Fig. 9. Overview of performance with samples separated with regard to the object density of the frame.

5 Conclusions

The Seasons in Drift challenge attracted over 180 participants, who made 480 submissions during the validation phase and 211 submissions for the test set and a potential place on the final leaderboard. Although measuring the impact of thermal concept drift on detection performance in a security monitoring context is a very understudied problem, participation was high. Many of the participants managed to beat the proposed baseline by quite a large margin, especially with limited training data, and achieved solutions that degrade far less under drift than the baseline does. Although great improvements can be observed, the problem of concept drift still negatively affects the performance of the participating methods. Interestingly, while the winner and Team heboyong use different architectures, the impact of concept drift seems to transcend the choice of state-of-the-art object detector. This lends merit to investigating methods that condition layers of the network on the input image, introducing an avenue for the model to learn an adaptive approach, as opposed to learning a generalized model specific to the thermal conditions of the training context. As can be observed in Figs. 8 and 9, the size of the observed objects seems to be a more challenging factor than the density in which they occur. Detection of small objects is a known and well-documented problem and, despite the nature of thermal cameras, persists as an issue in the thermal domain. Further research could be done to learn more scale-invariant object detectors or to rely entirely on methods other than an RPN or anchors to produce object proposals.