
1 Introduction

Anomaly detection is the task of identifying abnormal behaviour in data. These unexpected occurrences are usually related to critical events, such as machine failure [8], intrusion detection [19] or medical applications [31]. Thus, detecting anomalies in time allows us to save money, preserve privacy and save lives.

Because anomalies are, by definition, rare events, obtaining labels (especially anomalous ones) is often expensive, unethical, or simply time-consuming. Hence, anomaly detection is usually tackled from an unsupervised perspective [10, 12]. However, it has been shown in the literature that providing limited, but specific labels to the model can have a large impact on its performance [35, 45]. Therefore, one can implement active learning strategies to collect labels strategically, such as those in regions where the model has high uncertainty [1, 11, 24].

However, sometimes it can be challenging to provide a correct label for a given instance. For example, when labeling abnormal water usage, it may happen that some normal behaviour (e.g., system maintenance) is infrequent and the user presumes it is anomalous and labels it as such [44]. More generally, an instance’s label may be ambiguous, and different annotators may label it in different ways (e.g., crowdsourcing). When reconciling these inconsistencies to get a hard decision, selecting the correct label may be a difficult task [21, 39]. A solution to this problem is to relax our request by allowing the user to provide a soft label (i.e., a probability). Thus, one asks how likely it is that an instance is anomalous. Previous work has shown that this relaxation increases performance, especially in highly imbalanced data sets [26, 43].

Unfortunately, soft labels that reflect the inherent label probability are hard to collect [9, 15]. For example, a user may be overly confident and annotate a slightly excessive usage of water as having a very high probability of being anomalous. Similarly, in crowdsourcing, a group of users may be affected by a biased selection of instances that ends up producing inaccurate probabilities for some specific instances [25]. Thus, asking for a user to provide soft labels often results in examples that are annotated with noisy probabilities. This can have a negative effect on the detector’s performance as using incorrect soft labels at training time affects its ability to make accurate predictions at test time. For example, overly high (low) probabilities would make the model sensitive to producing false positives (negatives). Therefore, accounting for the (possible) noise both during training and inference is an important problem.

Additionally, we require a method that has both an unsupervised and supervised component. Many, but not all, anomalies are non-repetitive events. These anomalies are best detected by unsupervised anomaly detectors. However, these unsupervised detectors have difficulties detecting anomalies that look similar to normal instances or might detect some normal behavior as anomalous. Labels can help distinguish these last two cases. Thus, we want to make predictions such that (1) we fall back to unsupervised scores if instances are distant from labeled training data and (2) the instances that are closer to the labeled data receive a score that is mostly based on the soft labels.

Therefore, we fill this gap in the literature by proposing SLADe (Soft Label Anomaly Detector), the first semi-supervised anomaly detector that learns from noisy soft labels using active learning. Initially, it uses an unsupervised anomaly detector as an indication of how anomalous instances are (prior knowledge). Then, it sets up an active learning loop that (1) measures the uncertainty inherent in dealing with noisy soft labels, (2) uses this uncertainty to decide which noisy soft labels to collect, and (3) learns from such labels by training a Gaussian process to model the deviation between the given soft labels and the unsupervised scores. Finally, at inference time, SLADe removes the noise from the soft labels by averaging the GP’s prediction over a Gaussian surface. By summing this average with the unsupervised score, SLADe computes the probability that a test instance is anomalous.

2 Background and Notation

We assume a d-dimensional instance space \(\mathcal {X} \subseteq \mathbb {R}^d\) and a binary output space \(\mathcal {Y}=\{0,1\}\) where 1 denotes the anomaly class. Moreover, we assume that we are given an unlabeled dataset \(U = \{x_i|x_i \in \mathcal {X}\}_{i=1}^{N}\) of size N, an initially empty (soft) labeled dataset L, and a label budget \(B \in \mathbb {N}\) that indicates how many (soft) labels the user is willing to provide. We now review the necessary background on anomaly detection and Gaussian processes.

2.1 Anomaly Detection

In unsupervised anomaly detection, the goal is to learn a function \(s:\mathcal {X} \rightarrow \mathbb {R}\) that assigns real-valued anomaly scores to any instance in \(\mathcal {X}\) where, without loss of generality, we assume that higher scores represent more anomalous instances. Unsupervised detectors are trained by making assumptions about what constitutes an anomaly, which typically results in defining how anomalies are dissimilar to normal instances. For example, Isolation Forest (IForest) [22] assumes that anomalies can be easily isolated when randomly splitting the instance space, and assigns anomaly scores inversely proportional to the number of splits needed to isolate an instance. The k-NN outlier detector (kNNO) [2] assumes that anomalies are far away from normals with respect to some notion of distance, and uses the distance to the k-th nearest neighbor as the anomaly score.

A practical issue is how to convert an anomaly score into a hard prediction [32]. One way to do this is to use the contamination factor \(\gamma \in [0, 1]\), which is the fraction of anomalies in a dataset [33, 34]. Using \(\gamma \) one can define a threshold \(\lambda \) so that a fraction \(\gamma \) of the training data receives an anomaly score greater than \(\lambda \). For an unseen test instance \(x_t\),

$$\begin{aligned} y(x_t) = {\left\{ \begin{array}{ll} 0 &{} s(x_t) \le \lambda \\ 1 &{} s(x_t) > \lambda \,. \end{array}\right. } \end{aligned}$$
(1)
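For illustration, a minimal sketch of this thresholding step is given below, assuming anomaly scores where higher means more anomalous; the helper names are ours, not from the paper.

```python
import numpy as np

def fit_threshold(train_scores: np.ndarray, gamma: float) -> float:
    """Return lambda such that a fraction gamma of the training
    instances receives an anomaly score greater than lambda."""
    # (1 - gamma)-quantile of the anomaly scores (higher = more anomalous).
    return np.quantile(train_scores, 1.0 - gamma)

def hard_predict(test_scores: np.ndarray, lam: float) -> np.ndarray:
    """Eq. (1): predict 1 (anomaly) iff the score exceeds the threshold."""
    return (test_scores > lam).astype(int)

# Hypothetical usage with scores from any unsupervised detector:
# lam = fit_threshold(s_train, gamma=0.05)
# y_hat = hard_predict(s_test, lam)
```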

Recently, there has been increasing recognition that incorporating strategically chosen labeled instances is important for improving the performance of anomaly detectors [35, 45]. Active learning (AL) is commonly used to select which instances to label [17, 41]. At a high level, it is possible to distinguish among three approaches to AL [24]: uncertainty-based strategies select the unlabeled samples with the highest uncertainty [11], diversity-based strategies maximize the diversity among the labeled training data [1], and combined strategies integrate the advantages of these two [6]. The first category is widely used due to its simplicity and strong performance. Starting with an unlabeled dataset U and an empty (soft) labeled dataset L, a detector is learned in an unsupervised manner. Then, the following steps are repeated until a given label budget is exhausted. First, a human annotator is queried to provide a (soft) label for the strategically chosen instances. In uncertainty sampling, one approach is to use the probabilistic gap \(|P(Y = 1|x) - P(Y = 0|x)|\), where smaller gaps indicate higher uncertainty (see the sketch below). Second, the queried instances and their (soft) labels are added to L and the model is retrained on this newly expanded dataset.
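As a small illustrative sketch of uncertainty sampling with the probabilistic gap, assuming any model that outputs \(P(Y=1|x)\) for the unlabeled pool (the helper name is hypothetical):

```python
import numpy as np

def query_by_probabilistic_gap(proba_anomaly: np.ndarray, n_queries: int = 1) -> np.ndarray:
    """Select the unlabeled instances with the smallest gap
    |P(Y=1|x) - P(Y=0|x)| = |2 P(Y=1|x) - 1| (highest uncertainty)."""
    gap = np.abs(2.0 * proba_anomaly - 1.0)
    return np.argsort(gap)[:n_queries]  # indices into the unlabeled pool

# Hypothetical usage:
# proba = model.predict_proba(X_unlabeled)[:, 1]
# query_idx = query_by_probabilistic_gap(proba, n_queries=5)
```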

2.2 Gaussian Processes

A Gaussian process (GP) is a collection of random variables over the instance space, such that any finite subset of them have a joint Gaussian distribution [37]. Roughly speaking, a GP can be seen as a distribution over functions \(f:\mathcal {X} \rightarrow \mathbb {R}\) such that for any \(x, x' \in \mathcal {X}\)

$$\begin{aligned} f(x) \sim \mathcal{G}\mathcal{P}(m(x), \mathcal {K}(x, x')), \end{aligned}$$

where \(m :\mathcal {X} \rightarrow \mathbb {R}\) is called the mean function, and \(\mathcal {K}:\mathcal {X} \times \mathcal {X} \rightarrow \mathbb {R}\) is the covariance function (otherwise known as the kernel). The Gaussian process is completely characterized by these two functions m and \(\mathcal {K}\), which define

$$\begin{aligned} \mathbb {E}[f(x)] = m(x) \quad \text {and} \quad \textsc {Cov}[f(x),f(x')] = \mathcal {K}(x,x'). \end{aligned}$$

Picking an appropriate prior mean and kernel enables encoding prior beliefs of the data-generating process into the model. More importantly, the GP fully relies on these prior beliefs to make predictions for an unseen instance that falls in a region far from any training instance. Given a training set of pairs \(\mathcal {R} = \{(x_i, r_i)\}_{i=1}^{|\mathcal {R}|}\), where \(r_i \in \mathbb {R}\), the posterior distribution of a GP for any \(x, x' \in \mathcal {X}\) is

$$\begin{aligned} \begin{aligned} f | \mathcal {R}&\sim \mathcal{G}\mathcal{P}(m_{\mathcal {R}}, \mathcal {K}_{\mathcal {R}}) \\ m_{\mathcal {R}}(x)&= m(x) + \varSigma _{x, X} \left( \varSigma _{X, X}\right) ^{-1}(\textbf{r} - m(X))\\ \mathcal {K}_{\mathcal {R}}(x, x')&= \mathcal {K}(x, x') - \varSigma _{x, X} \left( \varSigma _{X, X}\right) ^{-1}\varSigma _{X, x'}\,, \end{aligned} \end{aligned}$$
(2)

where the elements of \(\varSigma _{a,b}\) depend on the kernel, \((\varSigma _{a,b})_{i,j} = \mathcal {K}(a_i, b_j)\), which makes \(\varSigma _{X, X}\) the training-training covariance matrix and \(\varSigma _{x, X}\), \(\varSigma _{X, x'}\), respectively, the \(1 \times |\mathcal {R}|\) and \(|\mathcal {R}| \times 1\) covariance vectors. Note that the posterior covariance is never larger than the prior covariance, because the subtracted term is a nonnegative (positive semi-definite) quadratic form.

Given a test set \(T = \{x_t\}_{t=1}^{|T|}\), the GP predicts a posterior multivariate normal distribution (|T|-dimensional) \(\mathcal {N}(m_{\mathcal {R}}(T), \mathcal {K}_{\mathcal {R}}(T,T))\). Note that each individual instance has a Gaussian marginal distribution that can be used for instance-wise predictions. In practice, one can derive the final prediction from this distribution either by taking a sample (Bayesian perspective) or by extracting the mean (frequentist perspective). In this work, we use the latter.
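For concreteness, the posterior of Eq. 2 can be computed directly; the sketch below assumes a zero prior mean and an RBF kernel, and adds a small jitter term for numerical stability (an implementation detail not discussed above).

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 * length_scale^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def gp_posterior(X_train, r, X_test, length_scale=1.0, jitter=1e-8):
    """Posterior mean and covariance of Eq. (2), assuming m(x) = 0."""
    K_XX = rbf_kernel(X_train, X_train, length_scale) + jitter * np.eye(len(X_train))
    K_sX = rbf_kernel(X_test, X_train, length_scale)   # Sigma_{x, X}
    K_ss = rbf_kernel(X_test, X_test, length_scale)    # prior covariance
    alpha = np.linalg.solve(K_XX, r)                   # (Sigma_{X,X})^{-1} r, since m(X) = 0
    mean = K_sX @ alpha                                # m_R(x)
    cov = K_ss - K_sX @ np.linalg.solve(K_XX, K_sX.T)  # K_R(x, x')
    return mean, cov
```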

3 SLADe

Our goal is to learn a model to estimate the probability that an instance is anomalous in an active learning setting where a user provides soft labels. Starting from an unlabeled dataset \(U = \{x_n|x_n \in \mathcal {X}\}_{n=1}^N\), an empty soft labeled dataset L, and a label budget B, the algorithm can iteratively query instance \(x \in U\). However, instead of receiving its exact label, the user provides a real value \(p \in [0,1]\) indicating the probability that the instance belongs to the anomaly class.

Designing an approach to learn in this setting has three key challenges. First, we need an informative unsupervised score about what is and is not likely to be anomalous. This allows the model to output probabilities even in regions where no soft labels are given. Second, we need a way to combine the weak supervision provided by the soft labels with this unsupervised score such that (1) we fall back to the initial scores if instances are distant from labeled training data and (2) the instances that are closer to the soft labeled data in L receive a score that is mostly based on those labels. Third, we need to explicitly model the uncertainty that is inherent when working with soft labels.

We address these challenges by combining unsupervised anomaly detection with a Gaussian process. Intuitively, the anomaly detector will provide an informative prior for the GP. A key question is what the GP should model. One choice would be to have it directly model the soft labels. However, because the labels are uncertain and noisy, we want to decouple the noise arising from the soft labels and the uncertainty of unsupervised scores. Therefore, we model the deviation of the soft labels from the unsupervised prior. When making a prediction, we propose a novel way to combine the estimated deviation and the unsupervised score in a noise-robust way. Next, we describe our training and inference procedures in more detail.

3.1 Training

SLADe constructs the informative prior by taking a completely unsupervised approach. First, SLADe trains an unsupervised anomaly detector on U that can compute an anomaly score for any instance \(x \in \mathcal {X}\), denoted as s(x). SLADe is detector agnostic, and we will discuss possible choices in the experimental evaluation. Second, we want to learn the deviation of the soft labels from these scores. However, working with the raw scores is not possible because scores provided by different unsupervised models have different meanings. Moreover, anomaly scores often cannot be interpreted as probabilities (e.g., kNNO assigns a distance) and thus, in this form, they cannot be compared with soft labels (i.e., probabilities). Therefore, we apply the linear unification transformation (i.e., min-max normalisation) [18]

$$\begin{aligned} \tilde{s}(x) = \frac{s(x) - \min (\textbf{s})}{\max (\textbf{s}) - \min (\textbf{s})} \end{aligned}$$

to map anomaly scores into [0, 1], where \(\textbf{s} = \{s_1, \dots , s_N\}\) are the anomaly scores for U. We opt for linear unification because we do not want to impose strong assumptions on the unsupervised scores (which, serving as a prior, should remain flexible [46]).

Our GP models the deviation between the user-provided soft labels and these prior probabilities and it is initialized as \(g_0 \sim \mathcal{G}\mathcal{P}(0, \mathcal {K})\). The posterior GP is then defined as

$$\begin{aligned} g_0|L_0 \sim \mathcal{G}\mathcal{P}(m_{L_0}, \mathcal {K}_{L_0}), \end{aligned}$$

where \(L_0 = \{(x_j, p_j - \tilde{s}(x_j)) :(x_j, p_j)\in L\}\) denotes a dataset containing the difference between the soft labels (i.e., \(p_j\)) and the unified unsupervised scores of the training data in L. To gather soft labeled training data and train the GP, we run an active learning loop. Given a label budget B, we repeat the following steps until our label budget is exhausted. (1) We query the instance \(x_* \in U\) where the model is the most uncertain. Quantifying uncertainty requires assigning a prediction to each instance in U. By combining the unsupervised prior \(\tilde{s}\) with the GP’s mean \(m_{L_0}\), we obtain a first probability estimate:

$$\begin{aligned} P_1(Y = 1| x, L) = \tilde{s}(x) + m_{L_0}(x)\,. \end{aligned}$$
(3)

Model uncertainty can arise for two reasons: making weak predictions (\(\approx 0.5\)) and a lack of labeled instances in certain regions of the instance space. To capture both types of uncertainty, we use Kapoor et al. [16]’s strategy to query labels for

$$\underset{x_* \in U}{argmin}\frac{|0.5 - P_1(Y=1|x_*, L)|}{\sqrt{\mathcal {K}_{L_0}(x_*, x_*)}}\,.$$

This formula assigns low scores if (a) the posterior probability is close to 0.5 (small numerator) or (b) the instance is far from the labeled instances and hence has a high prediction variance (large denominator). (2) Finally, SLADe updates \(L = L \cup \{(x_*, p_*)\}\) and \(U = U \setminus \{x_*\}\), and the posterior \(g_0|L_0\) is retrained on the newly obtained soft labels.
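The following sketch illustrates one possible implementation of this training loop using scikit-learn components (IsolationForest as the unsupervised prior and a GaussianProcessRegressor with a Matérn kernel on the deviations). It is a schematic illustration under these assumptions, not the authors' reference implementation; `ask_soft_label` stands in for the human annotator.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def train_slade(U, budget, ask_soft_label, random_state=0):
    """Schematic SLADe training loop (hypothetical helper).
    `ask_soft_label(x)` returns a (possibly noisy) probability in [0, 1]."""
    # 1) Unsupervised prior: IForest scores, unified to [0, 1] via min-max.
    iforest = IsolationForest(random_state=random_state).fit(U)
    s = -iforest.score_samples(U)                      # higher = more anomalous
    s_tilde = (s - s.min()) / (s.max() - s.min())

    # 2) GP over the deviation between soft labels and the unified prior.
    gp = GaussianProcessRegressor(kernel=Matern(nu=0.5))
    labeled_idx, deviations = [], []

    for _ in range(budget):
        pool = [i for i in range(len(U)) if i not in labeled_idx]
        if labeled_idx:                                # posterior over deviations
            mean, std = gp.predict(U[pool], return_std=True)
        else:                                          # prior: m = 0, unit variance
            mean, std = np.zeros(len(pool)), np.ones(len(pool))
        # Eq. (3); clipping to [0, 1] is an extra safeguard of this sketch.
        p1 = np.clip(s_tilde[pool] + mean, 0.0, 1.0)
        # Kapoor et al.-style criterion: small gap to 0.5 and/or high variance.
        crit = np.abs(0.5 - p1) / np.maximum(std, 1e-12)
        q_idx = pool[int(np.argmin(crit))]

        p = ask_soft_label(U[q_idx])                   # noisy soft label in [0, 1]
        labeled_idx.append(q_idx)
        deviations.append(p - s_tilde[q_idx])          # target of the GP
        gp.fit(U[labeled_idx], np.array(deviations))
    return iforest, gp, s_tilde, labeled_idx
```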

3.2 Inference

Given an unseen test instance \(x_t\) and a set of soft labels L, computing the posterior probability \(P(Y=1|x_t, L)\) is challenging for the following reason. An initial estimate of the posterior probability can be obtained via Eq. 3. However, this probability is heavily affected by noisy soft labels. Because the GP (without an observation-noise term) interpolates its training targets, the model predicts the exact soft labels for each soft-labeled training instance. Consequently, if \(x_t\) is in close proximity to a noisy soft label, the predicted posterior probability is affected by this noise.

We propose to mitigate the effect of noisy labels as follows. We distinguish between two types of test instances: (1) those that are far from the training data and (2) those that have many training instances nearby. Since the unsupervised anomaly scores already capture the proximity to other data points, we can use them as a measure without introducing any new assumptions (i.e., high anomaly scores represent distant instances). For the first type of test instance, there is no reason to try to correct the noise: such instances are far from the training data and will thus not be influenced by it. The second type, on the other hand, is influenced by label noise. We cope with this problem by smoothing the estimated deviation over a Gaussian surface that has \(x_t\) as its center and a given variance \(\sigma ^2_t\). Formally,

$$\begin{aligned} P_2(Y = 1|x_t, L) = \tilde{s}(x_t) + \mathbb {E}_{V \sim \mathcal {N}(x_t, \sigma ^2_t)}[m_{L_0}(V)], \end{aligned}$$
(4)

where V is a normally distributed random variable. Using the surrounding instances forces the model to rely on more soft labels when computing the posterior probability, which averages out the negative effect that noise has on the model. \(\sigma _t\) depends on \(x_t\): we define it as one-third of the radius of a hypersphere centered at \(x_t\) that captures \(q\%\) of the instances in U. Thus, for every test instance, we average over the same number of training instances. We then formalize our final probability estimate as

$$\begin{aligned} \hat{P}(Y = 1 | x_t, L) = {\left\{ \begin{array}{ll} P_1(Y = 1| x_t, L) &{} s(x_t) > \lambda \\ P_2(Y = 1| x_t, L) &{} s(x_t) \le \lambda \,, \end{array}\right. } \end{aligned}$$
(5)

where \(\lambda \) denotes the anomaly score threshold as defined in Eq. 1. A hard prediction is obtained by setting a threshold, typically 0.5, on the probability estimates.
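A schematic sketch of this inference procedure is given below, reusing the objects from the training sketch (and `s_train`, the raw training anomaly scores). It approximates the expectation in Eq. 4 by Monte Carlo sampling, which is one possible estimator and an assumption on our part.

```python
import numpy as np

def predict_slade(x_t, iforest, gp, U, s_train, lam, q=2.0,
                  n_samples=100, rng=None):
    """Schematic SLADe inference for one test instance (hypothetical helper),
    combining Eq. (3)/(4) via the score-threshold switch of Eq. (5)."""
    if rng is None:
        rng = np.random.default_rng(0)
    s_t = -iforest.score_samples(x_t[None, :])[0]
    s_tilde_t = (s_t - s_train.min()) / (s_train.max() - s_train.min())

    if s_t > lam:                                      # far from the data: Eq. (3)
        return s_tilde_t + gp.predict(x_t[None, :])[0]

    # sigma_t: one-third of the radius capturing q% of the training instances.
    dists = np.sort(np.linalg.norm(U - x_t, axis=1))
    k = max(1, int(np.ceil(q / 100.0 * len(U))))
    sigma_t = dists[k - 1] / 3.0

    # Monte Carlo estimate of E_{V ~ N(x_t, sigma_t^2 I)}[m_{L0}(V)].
    V = rng.normal(loc=x_t, scale=sigma_t, size=(n_samples, x_t.shape[0]))
    smoothed_dev = gp.predict(V).mean()
    return s_tilde_t + smoothed_dev                    # Eq. (4)
```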

4 Experiments

We address the following two research questions: Q1: How do the methods compare under various noise regimes? Q2: How sensitive is SLADe to the choice of its hyperparameters?

4.1 Experimental Setup

Methods. We compare SLADe against four baselines. Conceptually, these can be divided into two groups. The first group learns directly from probabilistic labels: GP [31] simply uses a Gaussian process to model the soft labels without including the unsupervised prior, while P-SVM [20] uses a Support Vector Machine (SVM) with class labels that are weighted by the given soft labels. The second group cannot operate directly on the soft labels. Therefore, we convert them to hard labels by flipping a weighted coin and then apply traditional semi-supervised models. SSDO [44] is a propagation-based detector that uses the distance to hard labels to assign anomaly scores. HIF [23] is a semi-supervised variant of the widely used unsupervised Isolation Forest [22] that improves its anomaly scores by adding the distance to the anomalous hard labels.

Data. We evaluate our method and the baselines on 21 benchmark datasets that are widely used in the anomaly detection literature [4, 12]. These datasets vary in size, number of features, and proportion of anomalies. To limit the computational cost of the experiments, we subsample each dataset to at most 5000 instances while keeping the same proportion of normals and anomalies. See Table 1 for the characteristics of the datasets.

Table 1. Characteristics (full size, subsampled size, number of features d, contamination factor \(\gamma \)) of the 21 benchmark datasets used for the experiments.

Setup. Our setup can be divided into three parts: (1) generating the ground-truth soft labels, (2) introducing the noise, and (3) evaluating the methods.

The first part requires modeling the human annotator: given an instance x, a soft label p indicates the proportion of anomalous labels that we would obtain if we queried x multiple times. Moreover, similar instances are likely to obtain similar probabilities. We model this aspect by training a Random Forest with low depth (\(= 4\)) on the original dataset and using it to compute the soft labels as class probabilities. The low depth guarantees that the Random Forest does not push all probabilities to the extremes (0 or 1) but assigns smooth values over [0, 1].

In the second part, we introduce noise into the soft labels. We use a standard transformation [7] that changes the label p into \(1-p\) for a fixed percentage of the soft labels. The noisy instances are picked uniformly at random. The percentage of swapped labels is the noise level of the dataset.
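A minimal sketch of these two steps, assuming scikit-learn's RandomForestClassifier as the annotator model and a uniformly random choice of which soft labels to flip (the helper name and details are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_noisy_soft_labels(X, y, noise_level, random_state=0):
    """Sketch of the setup: a shallow Random Forest provides ground-truth
    soft labels, then a fraction `noise_level` of them is flipped to 1 - p."""
    rf = RandomForestClassifier(max_depth=4, random_state=random_state).fit(X, y)
    p = rf.predict_proba(X)[:, list(rf.classes_).index(1)]  # P(anomaly | x)

    rng = np.random.default_rng(random_state)
    n_noisy = int(noise_level * len(p))
    noisy_idx = rng.choice(len(p), size=n_noisy, replace=False)
    p_noisy = p.copy()
    p_noisy[noisy_idx] = 1.0 - p_noisy[noisy_idx]            # flip p -> 1 - p
    return p, p_noisy
```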

Finally, for each of the 21 datasets, we run the following experiment: (i) We randomly split the dataset into \(80\%\) training and \(20\%\) test sets; (ii) We compute the ground-truth soft labels and add the given level of noise to the training soft labels; (iii) We run the active learning loop with a label budget \(B = 60\%\) of the training set size N, split into 12 rounds of \(5\%\) each. We choose a label budget of 60% for completeness reasons. All baseline methods also employ uncertainty sampling. (iv) We evaluate the Area Under the Receiver Operating Characteristic curve (AUROC) [14] of each method at every iteration of the loop. Because the test set also has soft labels, we sample a hard label to make the evaluation consistent within our probabilistic setting. To average out the randomness introduced by sampling labels, we repeat the active learning loop 20 times. All four steps are then repeated five times, for a total of \(5\times 20 \times 21 = 2100\) experiments.

Hyperparameters.

SLADe has three hyperparameters. We choose IForest [22] as the unsupervised method. We use the Matérn kernel with \(\nu = \frac{1}{2}\) in the GP as it is widely used in the literature [36]. Moreover, we optimize the length scale hyperparameter of the Gaussian process by maximizing the log marginal likelihood [37]. Finally, we set \(q = 2\). SSDO uses the same prior model as SLADe and the default values for \(\alpha \) and k. HIF has two hyperparameters, \(\alpha _1\) and \(\alpha _2\); since the paper does not suggest any values, we set both to 0.5, which weights the different parts of the score equally. P-SVM uses an RBF kernel with the default parameters [20]. Finally, GP relies on a Gaussian process with the same hyperparameters as SLADe.
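As an illustration, this GP configuration could look as follows with scikit-learn, where the length scale is optimized by maximizing the log marginal likelihood (the bounds and the number of restarts are our assumptions):

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Matern kernel with nu = 1/2; scikit-learn's default optimizer tunes the
# length scale by maximizing the log marginal likelihood during fit().
kernel = Matern(length_scale=1.0, length_scale_bounds=(1e-2, 1e2), nu=0.5)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
```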

4.2 Experimental Results

Q1. Comparing the Methods.

We want to evaluate SLADe on two aspects: (1) its robustness against noise and (2) its ability to rank anomalies. Therefore, we compare SLADe against the baselines on three different noise levels and compare both their noise-robustness and performance at different label percentages.

First, we compare SLADe against the baselines at each label frequency of the active learning loop under the three noise levels (\(0\%\), \(10\%\), \(20\%\)). For this task, we plot the learning curve, which shows the label percentage (as a proportion of the dataset’s size) on the x-axis and the methods’ AUROC on the y-axis. Figure 1 shows the results on five representative datasets, while the Supplement includes the plots for all the remaining datasets. Regardless of the noise, SLADe clearly outperforms all the baselines on Shuttle (left plot), while it performs similarly to the baselines on Pima and Heart (second and third plots). On the other hand, on Page and Iono (two rightmost plots), SLADe obtains competitive AUROC values with no noise present while outperforming all the baselines at the higher noise levels (\(10\%\) and \(20\%\)). Overall, the major strength of SLADe is its ability to improve its performance when acquiring (possibly noisy) soft labels: on Shuttle, SLADe’s learning curve is steeper than all the baselines’ for all noise levels. On the other hand, looking at Page and Iono, all methods’ learning curves are flat, but SLADe’s does not deteriorate as sharply as the baselines’ when introducing higher noise levels.

Second, we dive deeper into the noise-robustness of the methods. We aggregate the results on a per-dataset basis and measure how each method’s performance decreases when moving from a setting with no noise to a setting with (a) \(10\%\) and (b) \(20\%\) noise. Figure 2 reports the methods’ mean AUROC drop aggregated over all label percentages for the two scenarios. The star (cross) markers indicate the mean AUROC with no noise (with the given level of noise), while the length of the segment indicates how robust each model is against noise: the shorter the segment, the smaller the change in AUROC and the more robust the model. The results show that SLADe obtains the lowest drop in performance, or one within 0.01 of the lowest, on 13 out of 21 datasets when the noise goes from \(0\%\) to \(10\%\), and on six datasets when the noise increases to \(20\%\). Unsurprisingly, the second-best baseline is HIF, which is naturally noise-robust because it only leverages the anomalous labels to assign scores, which hides the negative effect of noisy normal labels provided by the user. In fact, HIF obtains the lowest drop in performance on six datasets under \(10\%\) noise and nine datasets under \(20\%\) noise. Furthermore, GP is the method most affected by the noise: because it only learns from the given soft labels, incorrect probabilities have a strong impact on the surrounding test instances.

Table 2. Wins (W), Draws (D), and Losses (L) of SLADe against each baseline in terms of average AUROC per dataset, for each label percentage, under \(20\%\) of noise. A draw means that the absolute difference in AUROC is \(\le 0.01\).
Fig. 1. Learning curves for all methods on five representative datasets for three different noise levels (\(0\%\), \(10\%\), \(20\%\)). On the x-axis we vary the label percentage, while on the y-axis we report the average AUROC (higher is better).

Fig. 2. Comparison on all 21 datasets between the methods’ mean AUROC when moving from a clean setting to \(10\%\) (top) and \(20\%\) (bottom) of noise. The AUROC is aggregated over all percentages of labels. For every dataset and method, the star/cross marker indicates the AUROC with no noise/given level of noise. The length of the segment quantifies the drop in AUROC when introducing noise (shorter is more resistant).

Finally, because our task is to develop a noise-resistant model, we zoom in on the high-noise scenario (\(20\%\)) and analyze how often SLADe outperforms each baseline. Table 2 shows the number of times (out of 21) SLADe’s average AUROC is higher than (Win), within a margin of 0.01 of (Draw), or lower than (Loss) that of the baselines at every label percentage. For any label percentage, SLADe never loses more than six times against any baseline. As expected, SLADe outperforms HIF more often at higher label percentages because HIF only uses positive labels. Moreover, against GP, SLADe wins more often in the lower label percentage settings (which are more realistic in active learning) because SLADe needs less data to learn effectively.

Q2. Sensitivity Analysis.

We evaluate the effect of varying SLADe’s three hyperparameters: the unsupervised anomaly detector, the GP’s kernel, and the percentage of training instances inside the hypersphere, q, used to fix the noise at inference time. We assume a default level of noise equal to \(10\%\) and vary one hyperparameter at a time while keeping the other two as specified in Sect. 4.1. We subsample the datasets to at most 500 instances for computational reasons.

Table 3 shows SLADe’s AUROC averaged over all datasets for different label percentages when using Isolation Forest (IForest) [22], One-Class SVM (OCSVM) [42], Local Outlier Factor (LOF) [13], and the k-NN outlier detector (kNNO) [2] as the unsupervised detector that assigns the anomaly scores. SLADe seems robust to the selected anomaly detector, as all approaches perform similarly. There are small differences for the three lowest label budgets, where using IForest offers some performance gains. This happens because IForest assigns better rankings to the anomalies, as also confirmed by [12]. A poor unsupervised model will thus require a certain number of labels before it is able to accurately detect anomalies. Therefore, selecting the right unsupervised model remains an important decision.

Table 3. AUROC (avg ± std) of SLADe for different unsupervised detectors.

Table 4 shows the AUROC averaged over all datasets for different label percentages when using four variants of the Matérn kernel [36] as the covariance function of the GP. We vary its hyperparameter \(\nu \in \{\frac{1}{2}, \frac{3}{2}, \frac{5}{2}, +\infty \}\), where \(\nu = +\infty \) corresponds to the Radial Basis Function (RBF) kernel [3]. The results show that SLADe performs best for \(\nu = \frac{1}{2}\), in agreement with the existing literature on Gaussian processes [36]. Unsurprisingly, SLADe’s performance deteriorates when increasing \(\nu \): because \(\nu \) controls the smoothness of the GP’s kernel (i.e., its differentiability), high values of \(\nu \) encode the assumption that the class probability function is smooth, which does not hold in several real-world datasets. Moreover, the effect of changing \(\nu \) increases with the number of soft labels, with the gap against \(\nu =+\infty \) exceeding 0.06 AUROC at \(60\%\) of soft labels.

Table 5 shows the AUROC averaged over all datasets for varying label budgets with \(q \in \{0.5, 1, 2, 5, 10\}\). The results show that this hyperparameter has a negligible impact on SLADe’s performance. Therefore, we set q’s default value to 2, an in-between value that avoids both averaging over too many instances, which might slightly decrease the performance when there is little noise, and averaging over almost no instances, which would make the model too sensitive to noise.

Table 4. AUROC (avg ± std) of SLADe for different values of the Matérn kernel’s hyperparameter \(\nu \).

5 Related Work

There is, to our knowledge, no work that tackles learning from active noisy soft labels in anomaly detection. However, three related research lines exist that are of interest, of which the first two relate to traditional binary classification tasks.

Learning from Soft Labels. The literature on learning from soft labels consists of three common approaches: ranking methods, regression methods and traditional methods adapted for soft labels. (1) Ranking methods solve a constrained optimization problem where the constraints are pairwise rankings between the soft labels [26, 27, 38]. (2) Regression methods use soft labels as target values in their learning mechanism [31]. (3) Probabilistic Support Vector Machines (P-SVM) use soft labels to micro-steer the obtained margin [20, 28]. Empirical evaluation [26] shows that this third category performs best. However, in Sect. 4.2 we showed that SLADe outperforms P-SVM.

Learning from Noisy Hard Labels. Existing work on models that are designed to be noise-robust mostly takes a supervised approach [5, 7, 48]. These methods make strong assumptions that do not hold in our setting: for instance, [48] requires access to a correctly labeled subset of data, which we do not have. A strictly weaker assumption is the availability of a large set of noisy data [5]. Adapting these methods to small sets of noisy labels is non-trivial.

Weakly Supervised Models. Some existing literature in anomaly detection deals with weak supervision. For example, some semi-supervised methods need access only to a small set of clean labels [29, 30, 40, 47]. However, it is unclear how to extend them to deal with soft labels.

Table 5. AUROC (avg ± std) of SLADe for different values of q (\(\%\) of training instances inside the hypersphere).

6 Conclusion

This paper tackled the challenge of learning a model that estimates the probability of an instance being anomalous in an active learning setting where the user provides noisy soft labels. The soft labels indicate the probability that an instance belongs to the anomaly class. The key challenges were how to (1) obtain an initial indication of how likely instances are to be anomalous without having access to labels, (2) combine the obtained soft labels with the initial unsupervised scores, (3) model the uncertainty when learning from soft labels, and (4) develop a noise-robust approach that smooths out the noisy probabilities. We proposed SLADe, the first semi-supervised anomaly detector that leverages noisy soft labels by (1) computing anomaly scores with an unsupervised anomaly detector and (2) correcting these scores by modeling their deviation from the given soft labels through a GP. In the active learning loop, it queries the most informative instances by quantifying the model uncertainty that arises from (a) receiving weak soft labels (i.e., close to 0.5) and (b) the lack of labels. Finally, at inference time, it smooths out the noise by averaging the GP prediction over a Gaussian surface with adaptive variance. Experimentally, on 21 datasets, we showed that SLADe is noise-robust and performs better than several baselines in the majority of cases.

Ethical Statement.

In general, work on anomaly detection is beneficial to society. In many applications, it is important to detect anomalies in due time as they are often related to critical events, such as machine failure [8], intrusion detection [19] or medical applications [31]. Being able to detect anomalies in time thus allows us to save money, preserve privacy and save lives. However, the use of anomaly detection and soft labels in certain settings raises some ethical concerns that need to be considered. One of the primary concerns is the potential for discrimination against minorities. As anomaly detection techniques are designed to identify instances that deviate from “normal behavior”, it is possible that someone with malicious intentions misuses anomaly detectors to discriminate against specific groups by labeling their behavior as “anomalous”. Another ethical consideration relates to the potential violation of privacy that may result from failing to detect anomalies in particular applications. For example, in intrusion detection, the failure to detect anomalous hacker activity could compromise people’s privacy. Finally, traditional labeling approaches for anomaly detection usually involve an expert. Collecting soft labels instead of hard labels, however, allows for the use of multiple cheaper, non-expert annotators instead of a single domain expert. While this may lower the cost of labeling data, it raises ethical concerns regarding the exploitation of cheap labor and the potential for unfair practices.