1 Introduction

Abnormal data might be more interesting to study than the prevalent patterns [1, 18, 76]. Data anomalies, e.g., among measurements or observations, may simply represent errors (called outliers), but they can also indicate interesting phenomena (called novelties), such as new incidents or faults of a system, intrusions into a computer network, fraud in credit card transactions or even over-expressed genes of living things.Footnote 1

Recently, the real-time detection of anomalies in streaming data has gained increasing attention [7, 55] as it allows raising alerts, predicting faults, and detecting intrusions and threats across industries. However, analyzing a sequence of samplesFootnote 2 arriving over time imposes unique constraints and challenges on machine learning models. In contrast to batch detectors, the full dataset is not available in advance and online detectors must learn incrementally as samples arrive [17]. In that respect, the order rather than the timestamp of the samples in a data stream is an important feature that models should take into account [5, 41]. Additional constraints on the detection setting may be imposed in practice. For instance, the high velocity of streams leaves little opportunity for labeling samples by experts. Also, a thorough tuning of several hyper-parametersFootnote 3 is challenging, especially when data characteristics evolve over time [30]. For these reasons, in this paper we focus on unsupervised methods for detecting point anomalies and leave out of the scope of our study (semi-)supervised methods for shallow [11] or deep anomaly detection [48]. In this respect, the detection of range-based (i.e., subsequence or collective) anomalies based on temporal dependencies rather than independent abnormal samples [12, 19, 21, 33, 61] is left as future work.

Existing unsupervised methods for detecting anomalies in multivariate datasets bypass the need for labeled samples by exploiting different anomalousness criteria [1, 18, 36, 68, 76, 78] based on the divergence of statistical distributions, distance thresholds, density variance from nearest neighbors, ease of isolation, etc. Indeed, recent online detection methods belong to different algorithmic families: (i) proximity-based detectors, either distance-threshold-based like MCOD [40] and CPOD [64] or nearest-neighbor-based like LEAP [16], inspired by \(\hbox {KNN}_W\) [51]; (ii) density-based detectors, either on the full feature space like STARE [72] and their offline counterpart LOF [14] or on feature subspaces like RS-Hash [57], tracing its roots back to HICS [38]; (iii) tree-based detectors such as HST [60] and RRCF [32], inspired by IF [43] or OCRF [29]; and (iv) projection-based detectors such as LODA [49] or XSTREAM [45], featuring both online and offline versions.

Anomaly detection has been an active area of research over the past decades. Besides well-known surveys on anomaly detection approaches and methods [1, 18, 19, 21, 33, 36, 68, 76], various empirical studies have experimentally evaluated the effectiveness or the efficiency of detectors [3, 4, 15, 20, 24, 28, 47, 62]. However, the majority of the aforementioned works focus on offline detectors. Online anomaly detection was listed in the future perspectives of the overview presented in [19], while [33] includes only the first steps in making density-based detectors, like LOF [14], incremental. Regarding empirical studies, [63] focuses exclusively on the efficiency of online proximity-based detectors while [20] compares stream clustering algorithms [58] for anomaly detection. To the best of our knowledge, no previous work has compared, qualitatively or quantitatively, over the same multi-dimensional datasets, distance-based (MCOD, CPOD), KNN-based (LEAP, \(\hbox {KNN}_W\)) and density-based detectors (STARE, RS-Hash, LOF) with tree-based (HST/F, RRCF, IF, OCRF) and projection-based detectors (XSTREAM, LODA). Additionally, reported experiments seldom examine the tension between the effectiveness and the efficiency of the detection algorithms. To this end, we optimally tune the hyper-parameters of detectors per dataset rather than rely on the default configurations recommended by their inventors. Last, previous meta-learning analyses of detectors [66, 75] do not consider meta-features related to anomalies visible only in a subset of the dataset feature space.

In this context, several questions remain unanswered regarding the performance of online anomaly detectors over data streams. First, previous studies do not assess the reliability of detectors’ effectiveness against a random classifier, nor do they highlight the dataset characteristics (e.g., number of features) that make them perform randomly. Second, they do not indicate when online detectors can approximate the effectiveness of offline detectors and under which conditions (e.g., number of features irrelevant to the anomalies, anomaly ratio). Third, they do not indicate which sketch strategies and update primitives of detectors (e.g., micro-clusters, random trees, histogram or chain-based density estimators) are best suited to detect anomalies that are only visible within a feature subspace of a dataset. Fourth, they do not analyze the trade-offs between the effectiveness and the efficiency of detectors belonging to different algorithmic families. Last, they do not highlight the characteristics of datasets that make an online algorithm capable of outperforming all others, for instance, the statistically significant correlations between the relative performance of the best performing detectors and meta-features such as the number of samples or features in a dataset, the skewness of feature values, or the distance between the clusters of abnormal and normal samples. In summary, we make the following contributions:

  • Large Selection of Online Detectors from Different Algorithmic Families: In Sect. 2, we introduce the nine online anomaly detectors included in our testbed, namely, MCOD [40], CPOD [64], LEAP [16], STARE [72], RS-Hash [57], HST [60], RRCF [32], LODA [49] and XSTREAM [45] (for their offline counterparts, readers are referred to Appendix 1). We detail their scoring functions, model creation and update primitives, as well as the involved analytical complexities. To ensure a common ground of comparison, we implemented a variation of HST with a forgetting mechanism similar to RRCF, called HSTF, as well as a continuous scoring function for MCOD, CPOD and LEAP instead of their binary outcome.

  • Fair Evaluation Environment for Online and Offline Anomaly Detectors over Multivariate Data: In Sect. 3, we describe the characteristics of abnormal and normal samples (e.g., anomaly ratio, dimensionality) in the twenty-four real and the five synthetic datasets included in our testbed that are widely used in previous empirical studies [15, 24, 25, 63, 68]. We additionally consider in Sect. 5 the recently proposed benchmark Exathlon [37] for explainable anomaly detection over repeated executions of two different Spark streaming applications, containing five different types of anomalies. To fairly compare the performance of detectors, we consider as evaluation metrics both the Area Under the ROC Curve (AUC) and the Average Precision (AP) and explain their differences under edge cases. These metrics are computed for each algorithm under optimal evaluation conditions per dataset (for optimal hyper-parameter values per dataset and the sensitivity of algorithms to tuning, readers are referred to Appendix 2).

  • Thorough Evaluation of Detectors’ Effectiveness: In Sect. 4, we analyze the AUC and Mean AP of detectors over the 24 real datasets of our testbed (details are given in Appendix 3) in order to reveal interesting patterns involving specific meta-features (i.e., Number of Samples/Features, Anomaly to Normal distance). In particular, we assess the reliability of the decisions made by the detectors w.r.t. a random classifier, and rank in a statistically significant way online and offline detectors according to their performance. Our analysis sheds light on how well online detectors approximate the performance of offline detectors.

  • Robustness of Detectors Against Increasing Dimensionality: In Sect. 6, we assess the robustness of online and offline detectors against increasing data and anomaly subspace dimensionality using twenty synthetic datasets. Specifically, we investigate whether a particular algorithmic family (e.g., proximity, tree or projection-based) is able to discover anomalies that are only visible in a small subset of features (i.e., a subspace).

  • Efficiency of Detectors and Trade-offs: In Sect. 7, we report the execution time of training and updating detectors’ models over the 24 real datasets of our testbed. We investigate the trade-off between the update time and the effectiveness of the 9 online detectors contrasted to the best overall performing detector.

  • Meta-learning Analysis of Leading Detectors: In Sect. 8, we investigate which of the meta-features are statistically correlated with the relative effectiveness of the best performing detector (detailed results per meta-feature are given in Appendix 4). This meta-analysis helps us to assess whether the best overall performing detector will also excel in a given dataset.

Finally, conclusions and future work are discussed in Sect. 10. Detailed experimental results, as well as all employed hyper-parameter values are given in Appendices 2, 3 and 4.

2 Detection Algorithms

In this section, we introduce the online anomaly detection algorithms considered in our experimental evaluation, which belong to the distance-based, KNN-based, density-based, tree-based or projection-based algorithmic families. The offline counterparts of these detectors are described in Appendix 1.

In contrast to offline detectors that model and score samples in one batch, online detectors continuously update their models for incrementally detecting anomalies in successive windows of samples. The total number of samples in a window is the window size, while the number of samples by which a window is shifted over the data stream is the window slide. Windows that overlap as they slide over a data stream are called sliding; otherwise, they are called tumbling (i.e., window size = window slide), as illustrated in Fig. 1.
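For illustration, the following minimal Python sketch generates count-based sliding and tumbling windows from a stream; the function and variable names are ours and not taken from any of the evaluated implementations:

```python
from typing import Any, Iterable, Iterator, List

def windows(stream: Iterable[Any], size: int, slide: int) -> Iterator[List[Any]]:
    """Yield count-based windows: sliding if slide < size, tumbling if slide == size."""
    buffer: List[Any] = []
    for sample in stream:
        buffer.append(sample)
        if len(buffer) == size:
            yield list(buffer)
            buffer = buffer[slide:]   # drop the oldest `slide` samples; the rest overlap

# Window size 4 with slide 2 (sliding) vs. slide 4 (tumbling).
stream = list(range(1, 11))
print(list(windows(stream, size=4, slide=2)))  # overlapping windows
print(list(windows(stream, size=4, slide=4)))  # disjoint windows
```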

Our testbed includes state-of-the-art online anomaly detection algorithms with publicly available implementations, as reported in the literature. CPOD [64] was included because it outperforms other distance-based detectors (such as MCOD [40], which is considered to be among the best detectors) in terms of runtime and memory usage.Footnote 4 HST [60] exhibits the best efficiency among tree-based detectors, while RRCF [32] is the most effective detector of the same family. We additionally included two projection-based detectors: LODA [49], exhibiting a very low runtime and memory footprint, and XSTREAM [45], outperforming other detectors on high-dimensional datasets in terms of effectiveness. LEAP [16] has been shown to be three orders of magnitude faster than state-of-the-art KNN methods. STARE [72] outperforms other popular density-based detectors ([46, 50]) in terms of execution time, achieving comparable or higher accuracy. Finally, RS-Hash [57] has been shown to be more efficient and effective than other subspace detectors like HiCS [38]. To the best of our knowledge, no previous study has compared both the effectiveness and efficiency of proximity-based, tree-based and projection-based detectors under stream and batch processing settings. In the next sub-sections, we provide the main operations and primitives of each detection algorithm, both theoretically and through a running example, and detail their training, update, forgetting and anomaly reporting procedures.

2.1 Micro cluster outlier detection (MCOD)

MCOD is a distance-based detector [40] that models neighboring regions of samples in a stream as micro-clusters (MC). MCOD requires tuning three hyper-parameters: a distance threshold R, a neighbor count threshold K and a distance metric \( dist (\cdot , \cdot )\) (e.g., Euclidean). Given a dataset D, MCOD identifies a sample p as a distance-based anomaly if it has fewer than K neighbors within distance R; otherwise, p is considered normal.

MCOD builds a set of micro-clusters (MC) to assess the normality of samples in every window. An MC is composed of at least K + 1 samples and is centered on one sample. A sample belongs to at most one MC and the distance of any sample from the center of its MC is at most R/2. According to the triangle inequality in the metric space, the distance between every pair of samples in an MC is smaller than R. Therefore, every sample in an MC is considered normal. Samples that cannot be clustered, i.e., potential anomalies, are inserted into a list called PD. The contents of PD are processed as new samples in a window arrive and they can either be normal, if \( R \text {-Neigh} (p) \ge K\) (see Eq. 2), or abnormal otherwise. \( R \text {-Neigh} (p) \) denotes the number of neighbors of a sample p within radius R. In the following, we list the building blocks of MCOD:

Training Phase MCOD does not have a training phase since its model operates over a single window by computing pairwise distances.

Model Update A new sample p can either: (i) be inserted into its nearest MC (see Eq. 1) if the distance from the center (\( mcc \)) of that MC is \(\le \) R/2; or (ii) form a new MC if it has at least K neighbors in the PD list within a distance \(\le R/2\); or (iii) be inserted into PD.

Forgetting Mechanism MCOD forgets all samples that have been processed in the current window before processing the next window. A forgotten sample can: (i) dissolve an MC if the MC is left with fewer than K + 1 samples; or (ii) be removed from the PD list.

Anomaly Report After processing new and forgotten samples, every sample with fewer than K neighbors is reported as an anomaly.

$$\begin{aligned}{} & {} mcc _{p}^{\star } = {\mathop {{{\,\textrm{argmin}\,}}}\limits _{ mcc }}\, dist(p, mcc) . \end{aligned}$$
(1)
$$\begin{aligned}{} & {} R \text {-Neigh} (p) = \vert \{p' \vert dist(p, p ') < R \}\vert \end{aligned}$$
(2)
Fig. 1 Example of two sliding (left) and tumbling (right) windows in the time-varying feature space

Fig. 2 MCOD running on the sliding windows of Fig. 1a in the feature space with neighbor count threshold \(K=3\) and radius threshold R

A running example inspired by the MCOD paper [40] is presented in Fig. 2 over the two sliding windows depicted in Fig. 1a. Samples have two features, namely {F1, F2}. Using the first window (Fig. 2a), MCOD builds a micro-cluster \( MC_1 \) containing the samples \(p_1, p_2, p_3, p_6\), which are considered normal. The PD list contains the samples that do not belong to \( MC_1 \): \(p_4, p_5, p_7\). Sample \(p_4\) is normal as it has \(K=3\) neighbors within distance R. Samples \(p_5, p_7\) are anomalies since they have fewer than \(K=3\) neighbors within distance R. Next, all samples of the first window are forgotten, so \( MC_1 \) is dissolved and PD is emptied. MCOD then processes the second window (Fig. 2b) and forms the micro-cluster \( MC_2 \). Observe that \(p_7\) is now a normal sample, while \(p_6\) is an anomaly because its preceding neighbors have been forgotten.

MCOD originally provides a binary outcome: the predicate \( MCOD_l \) depicted in Eq. 3 holds for normal samples and fails for abnormal ones. However, to homogenize the comparison of the outcome with other algorithms, we need a continuous scoring function of samples. We therefore use the function \( MCOD_s \) that gives a score s depicted in Eq. 4. Intuitively, the more a sample lies in a sparse region, i.e., it has few or no neighbors within distance R, or it is far away from its nearest MC, the higher it should be scored. For instance, \(p_7\) and \(p_3\) in Fig. 2a will, respectively, get the highest and lowest score. Also, \(p_5\) has a higher score than \(p_4\), which has a higher score than \(p_6\).

$$\begin{aligned}{} & {} MCOD_l(p) = dist(p, mcc_p^\star ) < \frac{R}{2} \vee \text {R-Neigh}(p) \ge K \nonumber \\ \end{aligned}$$
(3)
$$\begin{aligned}{} & {} MCOD_s(p) = \frac{1}{R\text {-Neigh}(p)} \cdot dist(p, mcc_p^\star ) \end{aligned}$$
(4)
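For illustration, a minimal Python sketch of the continuous scoring function of Eq. 4 is given below; it assumes that the micro-cluster centers and the contents of the current window are maintained elsewhere, and the guard against an empty neighborhood is our own addition:

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]

def r_neigh(p: Point, window: List[Point], R: float) -> int:
    """Eq. 2: number of neighbors of p within radius R in the current window."""
    return sum(1 for q in window if q is not p and math.dist(p, q) < R)

def mcod_score(p: Point, window: List[Point], mc_centers: List[Point], R: float) -> float:
    """Eq. 4: distance to the nearest micro-cluster center divided by the
    neighbor count; higher values indicate more anomalous samples."""
    nearest = min(math.dist(p, c) for c in mc_centers)   # dist(p, mcc_p*)
    neighbors = max(r_neigh(p, window, R), 1)            # assumption: avoid division by zero
    return nearest / neighbors

# A sample far from the only micro-cluster center and without close neighbors
# receives a high score.
window = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(mcod_score((5.0, 5.0), window, mc_centers=[(0.05, 0.05)], R=1.0))
```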

The main advantage of MCOD is that it effectively prunes pairwise distance computations, providing a more efficient neighbor search through the creation of micro-clusters. Every sample that remains in an MC when the window slides is considered normal without further checks. Thus, to classify a new sample it suffices to compute distances only w.r.t. the centers of the MCs and the samples in the PD list. On average, MCOD has time complexity \(\Theta ((1-c) w \log ((1-c) w)+K \log (K))\) and linear space complexity \(O(c w+(1-c) K w)\) [63], where c is the fraction of samples that belong to micro-clusters, K is the number of nearest neighbors and w is the window size. However, depending on the data characteristics, i.e., if the data are very sparse, no clusters may be formed, resulting in \(c=0\). Thus, in the worst case, MCOD has linearithmic time complexity \(O(w \log (w) + K \log (K))\). The hyper-parameters of MCOD require careful tuning (see Fig. 17 in Appendix 2). Specifically, a very low R and high K values on a dataset with several sparse areas may lead MCOD to create very few micro-clusters, identifying most samples as anomalies. The aforementioned behavior will also degrade the efficiency of MCOD as it must compute the distances with respect to more samples in the PD list, which is inefficient for neighbor searches. In the opposite case, with a very high R and low K values in a dataset having several dense areas, MCOD may identify most samples as normal and create a large number of micro-clusters.

2.2 Core point-based outlier detection (CPOD)

CPOD is a distance-based detector [64] that models neighboring regions of samples in a stream using core points. CPOD requires tuning two hyper-parameters: a distance threshold R and a neighbor count threshold K. Given a dataset D, CPOD identifies a sample p as a distance-based anomaly if it has fewer than K neighbors within distance R; otherwise, p is considered normal.

CPOD shares the same definition of anomalies as MCOD, but it optimizes the neighbor search by looking at the close neighborhood of a few special samples called core points rather than the neighborhood of the nearest micro-cluster center. The same technique also addresses a limitation of MCOD when every pair of points has a distance greater than R/2. In this degenerate case, no micro-cluster is formed, resulting in a quadratic neighbor search and poor efficiency, especially over streams. A core point is a special sample that supports multi-distance indexing as it stores its Euclidean distances to other samples in multiple ranges within each slide. It permits both quickly identifying normal samples and reducing the neighbor search space for anomaly candidates. A core point has the following two properties: (i) the distance between any pair of core points is greater than R; (ii) each sample \(p \in D\) is linked to at least one core point c, having distance less than R from c. Note that in contrast to the micro-cluster centers of MCOD, core points are not required to have at least K neighbors to be formed. Each core point c stores every neighbor sample p in a map E for different radius values \(k \in \{0,1,2,3\}\), ranging from R/2 to 2R:

\( E_k(c) = \{p \in D \vert ~ kR/2 < dist(c,p) \le (k+1)R/2\}.\) To calculate the neighbors of a sample p within distance R, CPOD leverages the distance from its corresponding core point(s) to reduce the computations, matching exactly one of the four cases below, where \(c^\star \) is the closest core point to p in the core set C:

  1. \(\bigcup _{k = 0,1,2}E_k(c^\star )\), if \( dist (c^\star ,p) \le R/2\);
  2. \(\bigcup _{k = 0,1,2,3}E_k(c^\star )\), if \(R/2 < dist (c^\star ,p) \le R\);
  3. \(\bigcup _{c_i \in C}\bigcup _{k = 0,1}E_k(c_i)\), if \(R < dist (c^\star , p) \le 2R\);
  4. \(\emptyset \), if \(2R < dist(c_i, p), \forall c_i \in C\).

From the aforementioned cases, exactly one can hold for a particular sample p. To better grasp the neighbor search procedure, we explain the first case, where \( dist (c^\star ,p) \le R/2\). CPOD operates via a prefilter approach, called minimal probing. First, it automatically considers the samples within R/2 (\(k=0\) in \(E_k\)) as neighbors of p due to the triangle inequality. If the neighbors within R/2 are fewer than the neighbor count threshold, then k is increased to \(k=1\) and the search is expanded to the range (R/2, R]. If the threshold is satisfied, the search stops and p is declared normal; otherwise, the search continues to higher ranges. The same logic is followed for the remaining cases. In the following, we report the building blocks of CPOD:

Training Phase CPOD does not have a training phase since its model computes the core points for every new slide, starting from the first window.

Model Update A new sample p can either: (i) be linked to exactly one core point, if \( dist (c^\star , p) \le R\); (ii) be linked to multiple core points, if \(R < dist (c^\star ,p) \le 2R\); (iii) form a new core point; or (iv) not be linked to any core point, if \(dist(c_i,p) > 2R, \forall c_i \in C\).

Forgetting Mechanism With a new slide, CPOD forgets all the expired samples, i.e., samples that are just removed from the current window, before processing the next window slide. The neighbor counts of the active samples, i.e., samples that remain in the current window, are decreased and the expired samples are removed from the E maps of their core points.

Anomaly Report CPOD spots samples as anomaly candidates if they have a distance \(\le R\) from their core points and fewer than K neighbors, expanding the neighbor search to higher ranges. Every sample with fewer than K neighbors after CPOD’s expanded search is reported as an anomaly.

We now present a running example of CPOD based on Fig. 2, which we used to explain MCOD. For this example with \(K=3\), we consider \(c_2, c_4\) to be core points in the first slide, having the same values as \(p_2\) and \(p_4\), respectively. Thus, the neighbors of each core point within distance [0, 2R] are \(E(c_2): \{p_2, p_1, p_6, p_4, p_5\}\), while \(E(c_4)\) contains all samples, where E denotes all the neighbors in the different ranges. Every sample in the solid circle falls under the first case of the neighbor search procedure presented above; it has distance \(\le R/2\) from \(c_2\) and will be immediately identified as normal, having 3 neighbors. Sample \(p_5\) also falls under the first case; since there are not enough neighbors within distance \(\le R/2\), it is considered an anomaly candidate and the search expands to higher radius ranges from its closest core point \(c_4\). Expanding the search to R, no new neighbors are found, so the search stops and \(p_5\) is reported as an anomaly. The sample \(p_7\) falls under the third case, having distance \(R < dist(c_4, p_7) \le 2R\) from its closest core point \(c_4\), so CPOD searches for neighbors within radius R of every core point, labeling \(p_7\) as an anomaly since only \(p_5\) is found as its neighbor. In the next slide, considering only \(c_{10}\) as a core point taking the values of \(p_{10}\), CPOD labels \(p_7\) as an anomaly (unlike MCOD). This is because \(R < dist(c_{10}, p_7) \le 2R\), thus the neighbor search is performed within radius R of each core point, resulting in samples \(p_{12}, p_8, p_9\) as possible neighbors and missing \(p_{11}\), for which \(dist(c_{10}, p_{11})>R\). From the explored samples, only \(p_8, p_9\) are neighbors of \(p_7\), so it is labeled as an anomaly. Moreover, \(p_6\) is also an anomaly since \( dist (c_{10}, p_6) >2R\), thus it has no neighbors within radius R. The remaining samples are labeled as normal.

CPOD originally provides a binary anomaly outcome. However, our evaluation metrics require an ordered outcome and thus we report the score of a sample p as the inverse of its neighbor count, \( CPOD_s(p) = 1 /( \vert N_{CPOD}(p) \vert + \epsilon )\). Note that higher scores denote greater anomalousness.

CPOD has linear time complexity \(O(N_c~w + N_f~N_r)\), where \(N_c\) is the total number of core points, w is the number of samples in a window, \(N_f\) is the number of anomaly candidates and \(N_r\) is the neighbor search time for these candidates. Note that in a very sparse dataset with many isolated samples, i.e., where many samples have distance \(> R\) and \(\le 2R\) from every core point, the neighbor search will probe each core point, searching within its R/2 and R radii. This could result in an overhead if many core points have been formed. Therefore, R and K should be carefully selected.
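For illustration, a minimal Python sketch of CPOD's minimal probing for the first case (\( dist (c^\star ,p) \le R/2\)) is given below; it assumes the per-core range map \(E_k\) is maintained elsewhere, and all names are ours:

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, ...]

def ring_index(d: float, R: float) -> int:
    """Ring k such that kR/2 < d <= (k+1)R/2, capped at k = 3 (i.e., 2R)."""
    return min(3, max(0, math.ceil(d / (R / 2)) - 1))

def build_range_map(core: Point, samples: List[Point], R: float) -> Dict[int, List[Point]]:
    """E_k(c): the samples within 2R of a core point, grouped by distance ring."""
    E: Dict[int, List[Point]] = {0: [], 1: [], 2: [], 3: []}
    for p in samples:
        d = math.dist(core, p)
        if 0 < d <= 2 * R:
            E[ring_index(d, R)].append(p)
    return E

def probe_neighbors(p: Point, E: Dict[int, List[Point]], R: float, K: int) -> int:
    """Minimal probing when dist(core, p) <= R/2: expand the search ring by
    ring and stop as soon as K neighbors within distance R are confirmed."""
    count = 0
    for k in (0, 1, 2):                  # rings beyond 3R/2 cannot contain R-neighbors of p
        for q in E[k]:
            if q is not p and math.dist(p, q) <= R:
                count += 1
                if count >= K:
                    return count         # p is normal; stop early
    return count                         # fewer than K neighbors: anomaly candidate
```

The per-ring grouping is what allows the search to stop before examining all samples linked to a core point.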

2.3 Lifespan-aware probing (LEAP)

LEAP is a distance-based detector [16] that encompasses two different anomaly semantics, namely distance-threshold-based and nearest-neighbor-based.Footnote 5 LEAP requires tuning two hyper-parameters: a distance threshold R and a neighbor count threshold K. Given a dataset D, LEAP identifies a sample p as a distance-based anomaly if it has fewer than K neighbors within distance R; otherwise, p is considered normal.

LEAP shares the same definition of anomalies as MCOD, but it mitigates expensive range queries with adequate indexing: rather than storing all samples of a window in the same index structure, a separate smaller index is maintained for each slide. For a given window, a slide index \(s_i\) references a sample that appears in sliding window i and not in sliding window \(i+1\). For instance, for a window of size \(w=9\) and slide \(s=3\), three index structures are created because three sliding windows cover the samples of that window. LEAP maintains an evidence list evi for every sample p: \(p.evi = \bigcup _{i=t,..t-s} \{q \in s_i \vert dist(p, q) \le R\} \), containing the neighbors of p in each slide in reverse chronological order, starting from the newest slide \(s_t\) down to \(s_{t-s}\), where s is the slide size. LEAP adopts the minimal probing principle, i.e., if at least K neighbors have been found by the current slide, the search stops. LEAP also maintains a trigger list \(tr = \{q \in w \vert q.evi \setminus s_{t-s} \ne q.evi\}\), containing the samples that are going to lose neighbors in the next window slide. Subsequently, we report the building blocks of LEAP:

Training Phase LEAP does not have a training phase since it performs the neighbor search for every new slide, starting from the first window.

Model Update A new sample p leads to re-probing the neighbors of each sample in the tr list, potentially allowing samples to be removed from it.

Forgetting Mechanism When a slide expires: (i) its index is discarded; (ii) the expired neighbors in each p.evi list are removed; and (iii) the samples in the tr list are re-evaluated.

Anomaly Report For a sample p, the newest samples are examined first and, in case the neighbors found are fewer than K, the probing continues. If p has fewer than K neighbors after every slide of the window has been probed, p is reported as an anomaly.

We now present a running example of LEAP based on Fig. 2, which we used to explain MCOD. For this example with \(K = 3\), \(w=7\) and slide \(s=5\), LEAP will create two indexes \(s_1\) and \(s_2\), respectively indexing the samples \(p_1, \ldots p_5\) and the samples \(p_6, p_7\). LEAP will build an evidence list for each sample. For \(p_1\), it will start searching for neighbors in \(s_2\) and then in \(s_1\), resulting in \(p_1.evi = \{p_6, p_2, p_3\}\). For the first window, \(p_7\) and \(p_5\) are reported as anomalies and the remaining samples as normal. The samples belonging to the trigger list in the first window are \(tr = \{p_6, p_7\}\) as they will lose neighbors in the next slide. When the window slides, the samples in the trigger list will be evaluated first. Now, \(p_6\) is labeled as an anomaly while \(p_7\) as normal, considering its succeeding neighbors. For this window, each sample inside the solid circle is considered normal and the anomalies are the samples \(p_6, p_{10}, p_{12}\).

LEAP originally provides a binary anomaly outcome. However, our evaluation metrics require an ordered outcome and thus we report the score of a sample p as the inverse of its neighbor count within radius R: \( LEAP_s(p) = 1/R\text {-Neighbors}(p)\). Note that higher scores denote greater anomalousness.

In the worst case, i.e., when the minimal probing principle fails, LEAP requires quadratic time complexity \(O(w^2)\), where w is the number of samples in a window. However, the authors mention that more advanced data structures can be utilized to reduce the neighbor search complexity.
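For illustration, a minimal Python sketch of LEAP's probing is given below; it builds the evidence list slide by slide in reverse chronological order and stops as soon as K neighbors are found (minimal probing). The names are ours and the slide indexes are plain lists rather than the index structures used by LEAP:

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]

def probe(p: Point, slides: List[List[Point]], R: float, K: int) -> Tuple[List[Point], bool]:
    """Return p's evidence list and an anomaly flag."""
    evidence: List[Point] = []
    for slide in reversed(slides):          # newest slide first (minimal probing)
        for q in slide:
            if q is not p and math.dist(p, q) <= R:
                evidence.append(q)
                if len(evidence) >= K:
                    return evidence, False  # enough evidence: p is normal
    return evidence, True                   # fewer than K neighbors: anomaly

# Two slides of 2-d samples and K = 2 required neighbors.
slides = [[(0.0, 0.0), (0.2, 0.1)], [(0.1, 0.0), (5.0, 5.0)]]
print(probe((0.05, 0.05), slides, R=0.5, K=2))   # normal
print(probe((5.0, 5.0), slides, R=0.5, K=2))     # anomaly
```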

2.4 Half space trees (HST)

HST [60] is a tree-based anomaly detector that learns a sketch of a data stream using an ensemble of half-space trees (HS-Trees). An HS-Tree is a full binary tree in which all leaves are at the same depth. HST requires tuning two hyper-parameters: the maximum depth h of each HS-Tree and the number of HS-Trees T of the ensemble.

Each half-space tree is built using a random perturbation of the original feature space, called a workspace, where an internal node of a tree represents a selected feature. HST selects a feature randomly and uniformly. Then, the splitting value is taken at the half-way point of the work range of the selected feature. The work range (wr) is defined in Eq. 5 and differs per feature F. Note that \(v_F\) is a random value between the minimum and the maximum value of a feature, \(v_F \in [F_{min}, F_{max}]\). Note that HST uses all the window samples to construct the trees, in contrast to batch tree-based detectors such as IF [43] and OCRF [29] that rely on bootstrapping to induce diversity among the constructed trees. HST ensures this diversification by selecting the value \(v_F\) randomly. The \(v_F\) value lets HST construct wide value ranges, which is a crude estimate of the range of the unseen samples. We should also stress that HST assumes that the data are scaled such that the values of features are bounded in [0, 1]. However, such normalization would have altered the nature of other tree-based algorithms such as RRCF [32] that utilize the different value ranges to select the most prominent split feature at each step. Therefore, to ensure that the algorithms are compared on an equal basis, we modified the work range function (see Eq. 6) so that the data range does not have to be restricted.

$$\begin{aligned} wr(F) = v_F \pm 2 \cdot max (v_F, 1-v_F) \end{aligned}$$
(5)
$$\begin{aligned} wr'(F) = v_F \pm 2 \cdot max (v_F - F_{min}, F_{max}-v_F) \end{aligned}$$
(6)
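For illustration, a minimal Python sketch contrasting the original work range of Eq. 5, which presumes features scaled to [0, 1], with the modified work range of Eq. 6 used in our testbed; the names are ours:

```python
import random

def work_range_original(v: float) -> tuple:
    """Eq. 5: assumes the feature values are bounded in [0, 1]."""
    delta = 2 * max(v, 1 - v)
    return v - delta, v + delta

def work_range_modified(v: float, f_min: float, f_max: float) -> tuple:
    """Eq. 6: uses the observed feature range instead of [0, 1]."""
    delta = 2 * max(v - f_min, f_max - v)
    return v - delta, v + delta

# A feature observed in [0, 5] with a randomly picked pivot value v_F.
v = random.uniform(0, 5)
print(work_range_original(v))        # misleading when the data are not scaled to [0, 1]
print(work_range_modified(v, 0, 5))  # spans the actual value range
```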

HST assigns an anomaly score to each sample. The score of a sample p in a specific tree \(t \in HS-Trees \) is defined as:

$$\begin{aligned} score (p, t) = Node.r \cdot 2^ Node.h , \end{aligned}$$
(7)

where \( Node.r \) and \( Node.h \) are, respectively, the mass and the depth of the leaf node that p falls into in a tree t. The lower the score a sample obtains, the more anomalous it is considered. Then, HST assigns a total score to each sample p, which is the sum of the scores obtained from the constituent trees \( HS-Trees \):

$$\begin{aligned} HST (p) = \sum _{t \in HS-Trees } score (p, t). \end{aligned}$$
(8)
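For illustration, a minimal Python sketch of Eqs. 7 and 8 is given below; it routes a sample through each half-space tree until a leaf is reached, and the node structure is our own simplification:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    feature: Optional[int] = None     # split feature index (None for a leaf)
    split: float = 0.0                # split value (mid-point of the work range)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    mass: int = 0                     # reference-window samples that reached this leaf
    depth: int = 0

def score_in_tree(p: Sequence[float], node: Node) -> float:
    """Eq. 7: route p to its leaf and return mass * 2^depth."""
    while node.feature is not None:
        node = node.left if p[node.feature] < node.split else node.right
    return node.mass * 2 ** node.depth

def hst_score(p: Sequence[float], trees: Sequence[Node]) -> float:
    """Eq. 8: sum of the per-tree scores; lower totals are more anomalous."""
    return sum(score_in_tree(p, t) for t in trees)

# A single tree of depth 1 splitting feature 0 at 2.5.
tree = Node(feature=0, split=2.5,
            left=Node(mass=5, depth=1), right=Node(mass=0, depth=1))
print(hst_score((1.0, 0.0), [tree]))   # 10: dense partition
print(hst_score((3.0, 0.0), [tree]))   # 0: empty partition, i.e., more anomalous
```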

A significant limitation of HST is that as new samples are processed, the mass profiles of the corresponding leaf nodes can only increase; in other words, HST can never forget. To overcome this limitation, we have implemented an HST variation with a simple forgetting mechanism inspired by RRCF [32], noted as HSTF. Subsequently, we list the building blocks of HST/F:

Training Phase During training, HST builds an ensemble of half-space trees that learns the sketch of the data stream. The mass profile of a leaf is the number of samples that end up in that subspace. The lower the mass, the sparser the region is considered. The structure of the trees formed during training remains unaltered during the update phase and only the mass profiles are updated.

Model Update During updating, HST uses two alternating tumbling windows: a latest window, which is not full yet, and a reference window preceding the latest window. The mass profile of the reference window is computed and samples in the latest window falling into low-mass leaves (partitions) are considered anomalous. When the latest window is full, the mass profile of a leaf is updated by adding the number of samples that fell into that partition.

Forgetting Mechanism Our HSTF variation forgets the samples of the oldest windows by decreasing the mass profiles of the corresponding leaves. The number of samples that the model must remember is specified by a forgetting threshold f, i.e., a third hyper-parameter w.r.t. the original HST. The lower the value of f, the faster a high mass profile may become sparse. As an old sample is deleted only after the insertion of a new sample, f should be a multiple of the window size w. Therefore, when f is exceeded, the mass profiles of the leaves corresponding to the w oldest samples (i.e., those not updated by the latest window) in each tree are decreased by one.

Fig. 3 HST partitions (left) and constructed model (right) when running on the tumbling windows of Fig. 1b in the feature space with \(T=1\) trees and max height \(h=2\)

Anomaly Report As it relies on tumbling windows, HST/F reports the anomaly scores for each sample after entirely processing the latest window according to Eq. 8. A running example of HST/F is illustrated in Fig. 3. In this example, we assume that the max height is \(h=2\) and only one tree (\(T=1\)) is built. First, the reference window is constructed on the first six samples (see Fig. 3a), leading to 4 partitions based on a random splitting value \(s \in wr '(F)\) of a randomly selected feature F at each step. For the second window (see Fig. 3b), we assess the abnormality of each new sample based on the mass of the partition that it falls into, computed from the preceding (reference) window. Therefore, sorting the six samples of the second window in descending score value order (indicating increasing anomalousness) yields: \(\langle p_{10}, \{p_8, p_{11}, p_9, p_{12}\}, p_{7} \rangle \), where \(\{\cdot \}\) indicates a tie. Sample \(p_7\) is the most abnormal sample because it falls into partition \(P_1\) with 0 mass. After the scores of the new samples have been computed, the latest window becomes the reference window and the mass of each partition is updated accordingly. An interesting case is the sample \(p_{10}\), which is the most normal among the six samples in the second window because it falls into partition P4 with mass 4 (see Fig. 3). The aforementioned behavior shows that if a plethora of samples are concentrated in a partition at the beginning of the stream but very few samples fall into that partition as the stream evolves, HST will keep assigning high (i.e., normal) scores to those samples, leading to potential false negatives. The suggested forgetting mechanism of HSTF can reduce this effect by decreasing the mass profiles of such partitions.

HST requires linear time \(O(T (2^{h + 1} - 1))\) for model constructionFootnote 6 and linear time \(O(Thw)\) for model update, where w is the window size. In the worst case, each sample may end up in a different leaf. Thus, all points in a window may update different mass profiles through different tree traversals. Therefore, the complexities are amortized constant once h, T and w are set. Note that the forgetting threshold does not change the complexity of the original algorithm.

2.5 Robust random cut forest (RRCF)

RRCF is a tree-based detector [32] used by the AWS Data Analytics EngineFootnote 7 that learns a sketch of a data stream using an ensemble of Robust Random Cut Trees (RRCT). An RRCT is a full binary tree used to calculate the collusive displacement (CoDisp) of a sample. CoDisp measures the differential effect of adding/removing a particular sample to/from an RRCT. RRCF requires tuning three hyper-parameters: the maximum number of samples Max Samples that are used to build a tree during training; a forgetting threshold f on the number of leaf nodes retained after updates; and the number of trees T of the ensemble. RRCF differs from HST in three aspects: (i) it prioritizes features with a higher value range; (ii) it uses a forgetting mechanism to delete old samples; and (iii) anomalies are reported instantly, i.e., before the current window has been processed completely. Subsequently, we list the building blocks of RRCF:

Training Phase RRCF trains the trees of the ensemble by subsampling without replacement from a few initial sliding windows; Max Samples samples are used to build each tree. An internal tree node represents a splitting feature that is selected proportionally to its normalized value range. Features with larger value ranges may contain extreme values and, therefore, anomalies. Each internal node has a splitting value which is selected randomly and uniformly from the range of the selected feature. As with HST, the splitting value essentially partitions samples into smaller subspaces. Every internal node keeps a bounding box that stores the value range of the feature at a specific depth. A leaf node contains a sample along with its arrival time in the stream and the number of replicas in case many samples end up in the same leaf. The construction of a tree stops when every sample in the training set is isolated from the remainder of the training data, i.e., falls in a leaf.

Model Update RRCF updates the trees incrementally using sliding windows. When a new sample p traverses the internal nodes of a tree, if the feature values of p exceed the bounding box of the last internal node in the path, then a new node is built; otherwise, p ends up in the same leaf as another sample, increasing its replica counter.

Forgetting Mechanism RRCF provides a time-decaying mechanism to forget old samples. When the number of leaves exceeds the forgetting threshold f, the oldest samples per insertion time are deleted and the tree is restructured accordingly.

Anomaly Report After the insertion of a new sample into the model, its anomaly score is immediately computed, unlike HST, which requires processing an entire window. RRCF uses an anomalousness criterion called collusive displacement (CoDisp). To compute CoDisp, the displacement of a node \(n_p\) that sample p traversed through in a tree \(t \in \) RRC-Trees is computed as:

$$\begin{aligned} Disp(n_p,t) = \frac{\text {number of samples beneath } sibling_{n_p} }{\text {number of samples beneath } {n_p}}. \end{aligned}$$

CoDisp extends the notion of Disp by accounting for duplicates and near-duplicates, called colluders, that can mask the presence of anomalies. Given the path of nodes P starting from a leaf node l up to the node before the root r of a tree \(t \in \text {RRC-Trees}\), the CoDisp of p is computed as the average maximal displacement over the traversal path of p across all trees: \(1/T \sum _t max (\{ Disp (n_i,t) \vert n_i \in P\}). \)

Intuitively, CoDisp measures the change in the model complexity incurred by the insertion or deletion of p. The model complexity here can be represented as the sum of depths for all samples in the tree. Therefore, a tree-based anomaly is defined as a sample that significantly increases the depth for a set of samples, when it is included in the tree.
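For illustration, a minimal Python sketch of the CoDisp computation over a single tree is given below; it assumes that every node stores its subtree size and a parent pointer, and the names are ours:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    size: int = 1                     # samples beneath (and including) this node
    parent: Optional["Node"] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def codisp_in_tree(leaf: Node) -> float:
    """Maximal displacement Disp(n) = |sibling subtree| / |own subtree| along
    the path from the leaf up to the node just below the root."""
    best, node = 0.0, leaf
    while node.parent is not None:
        parent = node.parent
        sibling = parent.left if parent.right is node else parent.right
        best = max(best, sibling.size / node.size)
        node = parent
    return best

def codisp(leaves: List[Node]) -> float:
    """CoDisp of a sample: average of its per-tree maximal displacements."""
    return sum(codisp_in_tree(leaf) for leaf in leaves) / len(leaves)

# A root whose children are an isolated leaf and a leaf holding 4 near-duplicates.
root = Node(size=5)
lone = Node(size=1, parent=root)
dense = Node(size=4, parent=root)
root.left, root.right = lone, dense
print(codisp([lone]))    # 4.0: removing the lone leaf displaces 4 samples
print(codisp([dense]))   # 0.25
```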

Fig. 4 RRCF model built on the sliding windows of Fig. 1a in the feature space with \(T=1\) and Max Samples = 5

A running example of RRCF is illustrated in Fig. 4. The model is initially built (training phase) using the first sliding window in Fig. 4a. We assign to each node a unique id \(n_i\). Every internal node represents the selected feature along with its bounding box (value range). The feature F1 is selected at the first step as it has a higher value range ([0, 5]) than F2 ([0, 3]); we depict the splitting value on the edges of each node. When the first window is finished, each sample is isolated in a leaf along with a replica counter (rep). The scoring and update procedures are performed using the next sliding window in Fig. 4b. To keep a neat tree visualization, we present the model up to sample \(p_{10}\). Given that a sample is scored only when it is inserted into the tree, altering its structure, we compute the CoDisp analytically only for \(p_{10}\). Observe that \(p_{10}\) exceeds the bounding box of F2 and thus a new node is created at depth 3. Therefore, we compute the Disp for the nodes \(n_{11}, n_{10}\) and \(n_5\), resulting in the following values: \(\{3/1, 1/4, 5/5\}\); the CoDisp is the maximum value, which is 3. We also report the scores of the remaining samples upon their insertion: \(p_6\)=0.5, \(p_7\)=1.33, \(p_8\)=1, \(p_9\)=0.8. Compared to Disp, CoDisp captures the effect of a deletion at the tree level instead of the leaf level. This helps recognize clustered anomalies, i.e., a sample masked by its anomalous neighborhood.

RRCF requires linear time \(O(T (2 n - 1))\)Footnote 8 to construct a forest and logarithmic updating time \(O(T \log (n))\) [32]. In the worst case, each sample may end up in a different leaf at the maximal depth, which requires updating the entire subtree up to the root. Computing CoDisp and updating the tree structure for every sample in a window of size w requires linearithmic time \(O(T w \log (n))\).

2.6 Lightweight online detector of anomalies (LODA)

LODA [49] is a projection-based detector that constructs an ensemble of k one-dimensional histogram density estimators using sparse random projections. The significant advantage of LODA is that it is hyper-parameter free. Specifically, the number of histograms k can be estimated by measuring the reduction of variance after adding another histogram [59] and the number of bins b can be estimated via the method of Birgé and Rozenholc [10]. LODA can operate in batch (noted as L-B for brevity) or online mode (noted as L-S for brevity) that continuously updates the histograms as the stream evolves. In batch mode, the two hyper-parameters are robustly estimated using all available samples while in online mode, hyper-parameters are estimated using only the training samples. Subsequently, we list the building blocks of LODA in streaming mode (L-S).

Training Phase During training, L-S constructs a one-dimensional histogram j as follows. First, a projection vector \(w_j\) is built with coefficients \(\sim N(0, \textbf{1}_d)\), where all but \(\sqrt{d}\) of them, selected uniformly at random, are replaced with zeros. The nonzero coefficients denote a subspace of features used to build the histogram j. Second, L-S relies on online histograms [8] to approximate the distribution of data by using a set of pairs \(H_j = \{(z_{1j}, m_{1j}), \ldots ,(z_{bj}, m_{bj})\}\), where \(z_{ij} = w_{j}^{T} x_i\) is the projection of the i-th sample in the j-th histogram and \(m_{ij}\) is the total number of samples falling into the same bin. When the number of bins exceeds the estimated threshold b, the two nearest projections \(z_{ij}\), \(z_{lj}\) are merged to form the new pair \((\frac{z_{ij}\cdot m_{ij} + z_{lj}\cdot m_{lj}}{m_{ij} + m_{lj}}, m_{ij} + m_{lj})\). Finally, two additional pairs for the minimum and maximum projections are added: \(H_j \leftarrow H_j \cup \{(z_{min}, 0), (z_{max}, 0)\}\).

Model Update L-S operates in tumbling windows using one of the following modes: (a) under the continuous histograms mode [8], each sample is first scored and then inserted into the histograms; (b) under the two alternating histograms mode, samples are inserted into the model after all of them have been scored, i.e., upon the completion of the new window, similarly to HST [60]. The updating process is the same as for building the initial histograms, while their minimum and maximum bounds may be updated by new samples. Note that the number of histograms does not change during updates.

Forgetting Mechanism Similar to the original implementation of HST, L-S does not employ a forgetting mechanism. Therefore, the frequency of the bins can only increase as the stream evolves.

Anomaly Report To compute the score of a sample \(x_{i'}\), the projection \(z_{i'j}\) for a histogram j is computed and the two nearest projections (if they exist) \(z_{ij}< z_{i'j} < z_{i+1j}\) are used to calculate the anomaly score:

$$\begin{aligned} \hat{p}(x_{i'}) = \frac{1}{k} \sum _{j=1}^k \frac{z_{ij}\cdot m_{ij} + z_{i+1j}\cdot m_{i+1j}}{2 M_j (z_{i+1j} - z_{ij})}, \end{aligned}$$
(9)

where \(M_j = \sum _{i=1}^b m_{ij}\). If a sample falls in a sparse region, it receives a lower score indicating more anomalousness. Note that if \(\not \exists ~ i: z_{ij}< z_{i'j} < z_{i+1j}\), the score cannot be computed.
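For illustration, a minimal Python sketch of the density estimate of Eq. 9 is given below; it represents each online histogram as a sorted list of (projection, count) pairs and, as an assumption, simply skips histograms in which the projection falls outside the min/max bounds:

```python
from typing import List, Optional, Sequence, Tuple

Histogram = List[Tuple[float, float]]   # sorted (projection z, count m) pairs

def loda_density(x: Sequence[float],
                 projections: List[Sequence[float]],
                 histograms: List[Histogram]) -> Optional[float]:
    """Eq. 9: average over the k histograms of the interpolated density around
    the projection of x; lower values indicate more anomalous samples."""
    densities = []
    for w, hist in zip(projections, histograms):
        z_x = sum(wi * xi for wi, xi in zip(w, x))   # sparse projection w^T x
        M = sum(m for _, m in hist)
        for (z_i, m_i), (z_next, m_next) in zip(hist, hist[1:]):
            if z_i < z_x < z_next:                   # the two surrounding bins
                densities.append((z_i * m_i + z_next * m_next)
                                 / (2 * M * (z_next - z_i)))
                break
    return sum(densities) / len(densities) if densities else None

# k = 1 histogram over feature 0, with bins at projections 0, 1.6 and 3.
hist = [(0.0, 1), (1.6, 2), (3.0, 4)]
print(loda_density((0.5, 9.9), projections=[(1.0, 0.0)], histograms=[hist]))
```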

Fig. 5 LODA-streaming model trained on tumbling window 1 of Fig. 1b in the feature space with \(b=3\) bins and \(k=2\) histograms

A running example of L-S trained on the first six samples of tumbling window 1 of Fig. 1b is depicted in Fig. 5. L-S estimates the number of bins \(b=3\) (not considering the min/max bins) and the number of histograms \(k=2\). Since there are two histograms, there are two one-dimensional random projections. For \(H_1\), the feature F1 is selected with projection vector \(w_1 = [0.5, 0]\) and for \(H_2\), F2 is selected with projection vector \(w_2 = [0, 1]\). Then the samples of the second window of Fig. 1b arrive (see Fig. 3b for the value ranges). The first sample to be scored is \(p_7 = (3,0)\).Footnote 9 However, the projections of \(p_7\) are \(\langle 0,0 \rangle \) in \(H_1\) and \(\langle 0,3 \rangle \) in \(H_2\), which are outside the min/max bounds, and thus its anomaly score cannot be computed. Then, \(p_7\) is inserted, creating the bins \(b'_{ H_1} = (0, 1)\) and \(b'_{H_2} = (3, 1)\). As the number of bins is now \(4 > 3\) (not considering the min/max bins), the \(b_{1, H_1}\) and \(b_{2, H_1}\) are merged into \(b_{12, H_1} = (1.6, 2)\) and the \(b_{2, H_2}\) and \(b_{3, H_2}\) are merged into \(b_{23, H_2} = (1.3, 3)\). Finally, the min/max bounds are updated: \(b_{ min , H_1} = (0,0)\) for \(H_1\) and \(b_{ max , H_2} = (3,0)\) for \(H_2\). The next sample is \(p_8 = (3.5, 1)\), projected as \(\langle 0.5,0 \rangle \) in \(H_1\) and \(\langle 0,3.5 \rangle \) in \(H_2\). Therefore, it can only be scored in \(H_1\), being between \(b'_{H_1}=0< 0.5 < 1.6=b_{12, H_1}\). Thus \(\hat{p}_8 = (0 \cdot 1 + 1.6 \cdot 2) / (2\cdot 7\cdot (1.6-0)) \approx 0.14\).

Given k histograms with b bins each, the time complexity of L-S is \(O(Nk(d^{-\frac{1}{2}} + b))\) for training, where N is the number of samples, and O(wk) for updating and scoring, where w is the window size.

2.7 XSTREAM

XSTREAM [45] is a projection-based detector that constructs an ensemble of half-space chains (HSC) serving as density estimators. Unlike all the previous detectors, it is able to cope with feature-evolving data streams. XSTREAM requires tuning three hyper-parameters: the number of random projections K, the number of HSC M and the depth D of the HSC. XSTREAM can operate in batch (noted as X-B for brevity) or online mode (noted as X-S for brevity). Both rely on sparse random projections of samples over a subset of features. In each random projection, 1/3 of the features are set as nonzero, weighted by \(\{-\sqrt{3/K}, \sqrt{3/K}\}\) with equal probability. Then, each sample p from \(\mathbb {R}^d\) is projected to \(\mathbb {R}^K\) to obtain a new projected sample \(\langle r_1^T p, \ldots , r_K^T p\rangle \), where d denotes the dimensionality of the dataset and \(r_i\) a random projection. Subsequently, we list the building blocks of X-S.

Training Phase During training, a fraction of the samples is kept to estimate the value range \(\Delta _{F'_i} = ( max _{F'_i} - min _{F'_i}) / 2\) of each constructed feature \(F'_i \in \mathbf {F'} = \{F'_1,\ldots , F'_K\}\). At each level of an HSC, a feature \(F'_i\) is selected randomly with replacement from \(\mathbf {F'}\). The value of the \(F'_i\) selected at level l in the bin vector \(\textbf{z}_l\) of p is computed using the following equations:

$$\begin{aligned} z_{F'_i, l}={\left\{ \begin{array}{ll} (p_{F'_i} + s_{F'_i}) / \Delta _{F'_i}, &{} \text {if }o(F'_i, l) = 1.\\ (2z_{F'_i} - s_{F'_i}) / \Delta _{F'_i}, &{} \text {if }o(F'_i, l) > 1. \end{array}\right. } \end{aligned}$$
(10)

where \(p_{F'_i}\) is the value of \(F'_i\) of a projected sample p, \(s_{F'_i}\) is a random shift drawn from \( Uniform (0, \Delta _{F'_i})\) and \(o(F'_i, l)\) is the number of times the feature has been sampled up to the l-th level. The bin vector at level l is computed as \(\textbf{z}_l = \{z_{F'_i, l} \vert F'_i \in F'\}\). To maintain the bin counts at every level for a chain c, X-S relies on a count-min-sketch (cms) structure, noted as \(\textbf{H}_c = \{H_{c,l} \vert l=1\ldots D\}\). For a specific level l of an HSC c, \(H_{c,l}\) is updated as:

$$\begin{aligned} H_{c,l}={\left\{ \begin{array}{ll} H_{c,l}[\lfloor {\textbf{z}_l} \rfloor ] +1, &{} \text {if }\lfloor {\textbf{z}_l} \rfloor \in H_{c,l}.\\ 1, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(11)

Model Update Like HST [60], X-S operates with two alternating tumbling windows. Two cms structures are maintained, \(\textbf{H}_ ref \) and \(\textbf{H}_ cur \), for the reference and the current window, respectively. The samples of the current window are scored using the bin counts of \(\textbf{H}_ ref \), and the bin counts of \(\textbf{H}_ cur \) are updated as in training. When all samples of the current window have been scored, \(\textbf{H}_ ref \) is replaced by \(\textbf{H}_ cur \) and the counts of \(\textbf{H}_ cur \) are set to zero. This technique lets X-S handle drifts in the data distribution between two consecutive windows.

Forgetting Mechanism When all samples of the current window are inserted into \(\textbf{H}_ cur \), the bin counts learned in the reference window are replaced by the ones of the current window. Therefore, whenever the window slides, X-S forgets together all samples of the reference window.

Anomaly Report The anomaly score of a projected sample is the minimum depth-weighted bin count across all levels of a chain, averaged across all HSC: \(\frac{1}{M} \sum _{c \in C} min _l~ 2^lH_{c,l}[\lfloor {\textbf{z}_l} \rfloor ]\), where C is the set of the M chains. The intuition behind the scoring function is to measure the anomalousness of a sample across the different feature granularities and report the score that corresponds to the lowest density at which this sample is located.
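For illustration, a minimal Python sketch of the chain score is given below; as a simplification, the per-level counters \(H_{c,l}\) are plain dictionaries rather than count-min-sketch structures, and all names are ours:

```python
from typing import Dict, List, Tuple

BinKey = Tuple[int, ...]                 # floored (discretized) bin vector at one level
Chain = List[Dict[BinKey, int]]          # one count table per level l = 1..D

def chain_score(bins: List[BinKey], chain: Chain) -> float:
    """Minimum depth-weighted count 2^l * H_{c,l}[floor(z_l)] over the levels."""
    return min((2 ** l) * chain[l - 1].get(z, 0)
               for l, z in enumerate(bins, start=1))

def xstream_score(bins_per_chain: List[List[BinKey]], chains: List[Chain]) -> float:
    """Average of the per-chain scores over the M chains; lower = more anomalous."""
    return sum(chain_score(b, c) for b, c in zip(bins_per_chain, chains)) / len(chains)

# One chain (M = 1) of depth D = 3; the sample hits an empty bin at the deepest level.
chain = [{(0, 1): 5}, {(0, 3): 3}, {(1, 6): 0}]
print(xstream_score([[(0, 1), (0, 3), (1, 6)]], [chain]))   # min(2*5, 4*3, 8*0) = 0
```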

Fig. 6 XSTREAM model trained on tumbling window 1 of Fig. 1b in the feature space with depth \(D=3\), \(M=1\) chain and \(K=2\) projections. For simplicity we assume that the projections are identical to the original feature values and the random shift \(\textbf{s} = (0,0)\) for each feature

A running example is illustrated in Fig. 6. The first tumbling window of samples \(\{p_1, \ldots ,p_6\}\) (see Fig. 1b) serves as the reference window and is used to build the initial cms structures \(H_{1,l}\) for \(l=1,\ldots 3\) (\(M=1\) and \(D=3\)). First, samples are projected using two random projections \(r_1, r_2\): \(F'_1\) is obtained with \(r_1 = [1, 0]\), using only the values of \(F_1\), and \(F'_2\) with \(r_2 = [0, 1]\), using only the values of \(F_2\). The halved value range \(\Delta _{F'_2}, \Delta _{F'_1}\) of each feature is 1: \(\varvec{\Delta } = (1,1)\). Subsequently, we compute the bin vector of sample \(p_6\) across the chain levels. At \(l=1\), the feature F1 is selected and \(p_6\) has the bin vector \(\textbf{z}_1 = (0, 4.7)\). At level 2, \(F'_2\) is selected, leading to \(\textbf{z}_2 = (1, 4.7)\), and at the last level \(F'_1\) is selected again, yielding \(\textbf{z}_3 = (1, 9.4)\). Therefore, the discretized bin vectors are \(\textbf{z}_1 = (0, 4), \textbf{z}_2 = (1, 4), \textbf{z}_3 = (1, 9)\). Since there is a distribution shift in the second tumbling window, all its samples get a zero score, which is the lowest possible density. For the projected sample \(p_1\), the bin counts are 3 (\(l=1\)), 2 (\(l=2\)) and 1 (\(l=3\)), and its anomaly score is 1.

X-S has linear time complexity O(NKmDM) to construct the cms structure, where N is the number of training samples, K is the number of projections, m is the number of fixed-size hash tables to approximate the bin counts, and M is the number of HSC of depth D. The time complexity to update the cms structure is O(wKmDM) with w being the window size.

2.8 Randomized subspace hashing (RS-Hash)

RS-Hash [57] is a density-based detector that operates in subspaces. It constructs an ensemble of histograms on feature subspaces, serving as density estimators as in XSTREAM and LODA. RS-Hash requires tuning three hyper-parameters: the number of hash functions h, the sub-sample size s and the number of repetitions m.

The main idea is to repeatedly construct grid-based histograms on sub-samples and combine the obtained scores in an ensemble fashion. Each histogram is built on a sparse, randomly chosen subspace of the original feature space. The dimensionality r of a subspace is sampled uniformly at random from \((1+0.5 \cdot log_{max(2,1/f)}(s), log_{max(2,1/f)}(s))\), where f is a locality parameter sampled uniformly at random from \((1/\sqrt{s}, 1-1/\sqrt{s})\). Unlike XSTREAM and LODA, RS-Hash assumes that the r features have equal weight and the histograms are constructed on the original sample values of these features rather than their inner product with the selected subspace. After selecting the subspace features, each sample is normalized using min–max normalization and histograms are constructed using a count-min-sketch (cms) structure, as in XSTREAM. In total, h histograms are built for a particular subspace. The process is repeated m times, with a different sub-sample hashed on a different subspace. Subsequently, we report the building blocks of RS-Hash:

Training Phase During training, a fraction of the samples is kept to construct the initial histograms. We denote by \(H_{ij}\) the histogram j of a sub-sample i, which uses a particular hash function. To maintain the bin counts, RS-Hash leverages the cms structure that hashes the values of a sample p as in Eq. 11 of XSTREAM. The difference is that the values of p are normalized but not projected as in XSTREAM.

Model Update When a new sample p arrives, RS-Hash first scores p and then the histograms’ counts are updated.

Forgetting Mechanism Whenever the window slides, RS-Hash forgets all the expired samples by reducing the counts of the corresponding hash buckets.Footnote 10

Anomaly Report To compute the score of a new sample p, the unused features of a subspace are encoded as -1 and the remaining ones are normalized. The anomaly score is the minimum bin count across the histograms of a sub-sample, averaged over the different sub-samples:

$$\begin{aligned} Score(p) = \frac{1}{m} \sum _{i=1}^m log_2( min_{j} H_{ij}[p] +1). \end{aligned}$$
(12)
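For illustration, a minimal Python sketch of Eq. 12 is given below; as a simplification, each of the h hash tables of a sub-sample is a plain dictionary mapping a bucket key to a count, and all names are ours:

```python
import math
from typing import Dict, Hashable, List

def rs_hash_score(keys: List[List[Hashable]],
                  tables: List[List[Dict[Hashable, int]]]) -> float:
    """Eq. 12: for each sub-sample i, take the minimum count of the sample's
    bucket over its h hash tables, then average log2(count + 1) over the m
    sub-samples; lower scores indicate more anomalous samples."""
    total = 0.0
    for sub_keys, sub_tables in zip(keys, tables):   # one entry per sub-sample i
        min_count = min(t.get(k, 0) for k, t in zip(sub_keys, sub_tables))
        total += math.log2(min_count + 1)
    return total / len(tables)

# m = 1 sub-sample with h = 2 hash tables; the sample hashes to an empty bucket
# in the first table, so its minimum count is 0 and its score is 0.
tables = [[{"b1": 3, "b2": 1}, {"b1": 2, "b3": 1}]]
print(rs_hash_score(keys=[["b3", "b3"]], tables=tables))   # 0.0
```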

We reuse the running example of LODA in Fig. 5, where the first window of Fig. 1b is used for training RS-Hash. We set \(h=2\), resulting in two histograms, and \(m=1\). We select four samples uniformly at random, \(p_1, p_3, p_5, p_6\), and thus obtain the histograms \(H_{11}=\{b_1:[p_1, p_3, p_6], b_2:[p_5], b_3:[]\}\) and \(H_{12}= \{b_1:[p_1, p_3], b_2:[p_6], b_3:[p_5]\}\), where \(b_i\) denotes bucket i; the buckets may differ due to the different hash functions. After training, the samples of window 2 of Fig. 1b arrive. We assume that the first sample \(p_7\) falls into \(b_3\) in both hash tables. The minimum count is 0 (the empty bucket \(b_3\) of \(H_{11}\)), so \(p_7\) receives a score of 0 (see Eq. 12). Then, it increases the count of \(b_3\) by 1 in both hash tables. The scoring continues analogously for the remaining samples. Recall that each sample is first scored and then the cms structure is updated. After each sample is scored, the forgetting mechanism is activated, reducing the counts in the corresponding buckets.

RS-Hash has linear time complexity for training O(shm) including the cms construction, and O(whm) for updating and scoring, where s is the sub-sample size, h the number of hash functions, m the repetitions and w is the window size.

2.9 STAtionary REgion skipping (STARE)

STARE [72] is a density-based detector that identifies the top-n local anomalies in sliding windows. STARE requires tuning three hyper-parameters: the number of nearest kernel centers \(\theta _k\), the diagonal length \(\theta _R\) of a grid cell and the error allowance threshold \(\gamma \).

STARE relies on kernel density estimation (KDE) to compute the density around each sample. In contrast to other KDE-based methods, STARE does not globally update the samples’ densities for every window slide. Instead, it optimizes density estimation based on the observation that data distributions in many regions hardly change across window slides, a notion called stationary region skipping. STARE operates in three phases: (i) Data distribution approximation, (ii) Cumulative net-change-based skip and (iii) Top-n anomaly detection. In the first phase, STARE divides the space into d-dimensional grid cells of diagonal length \(\theta _R\), where d is the data dimensionality. Each grid cell c has a kernel center that represents the samples that fall into the cell, while empty cells are not considered. STARE stores a weight distribution grid \(\mathbb {G}\) with the number of samples in each c. In the second phase, STARE examines the changes in \(\mathbb {G}\), denoted as the net-weight distribution grid \(\Delta \mathbb {G}\), between the current and the previous slide to avoid updating the local densities of samples in stationary regions. Note that some regions may change only slightly, e.g., only one sample is removed from a cell; in this case, STARE relies on a threshold \(\gamma \) to skip the updates of slightly changed regions. Higher \(\gamma \) valuesFootnote 11 increase the error in density estimation, sacrificing accuracy for efficiency. In the last phase, STARE first determines the candidate cells that are guaranteed to contain the top-n anomalies; to do so, density bounds are computed for each cell. Then, STARE performs point-level detection within the candidate cells. Subsequently, we report the building blocks of STARE:

Training Phase STARE uses the first window to partition the data space into grid cells, to store their centers and to compute the densities for all samples.

Model Update Each new sample is indexed to a grid cell, updating its sample count and therefore its weight distribution.

Forgetting Mechanism When a slide expires, the weight distribution of each affected grid cell is updated by reducing its sample count.

Anomaly Report The density of a sample p is calculated as the weighted average density over the kernel centers near to p: \(\mathcal {D}(p) = \sum _{i=1}^{\theta _k} \frac{w_i}{\sum _{j=1}^{\theta _k} w_j} \prod _{l=1}^d \mathcal {K}_{h^l}( dist (p^l, kc_i^l))\), where the sum ranges over the \(\theta _k\) nearest kernel centers \(kc_i\) of p, \(w_i\) is the weight (sample count) of \(kc_i\), and \(\mathcal {K}_{h^l}\) is the kernel function with bandwidth \(h^l\) applied, in a univariate fashion, to the distance between p and \(kc_i\) in dimension l. The score of a sample p is given by \(\mathcal {S}(p) = (\mu - \mathcal {D}(p)) / \sigma \), where \(\mu , \sigma \) are the mean and standard deviation of the local densities at the \(\theta _k\) nearest kernel centers of p. The anomaly score ranges from \(-\infty \) to \(+\infty \), where high values indicate lower density, i.e., more anomalousness.
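For illustration, the following Python sketch computes \(\mathcal {D}(p)\) and \(\mathcal {S}(p)\) as defined above. It assumes a Gaussian kernel and per-dimension bandwidths (both assumptions of ours); the original STARE implementation makes its own choices and additionally exploits the density bounds and stationary region skipping described earlier.

```python
import numpy as np

def gaussian_kernel(u, h):
    # Univariate Gaussian kernel with bandwidth h (an assumption; other kernels are possible)
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def stare_density(p, centers, weights, bandwidths, theta_k):
    """Weighted average density of sample p over its theta_k nearest kernel centers."""
    dists = np.linalg.norm(centers - p, axis=1)
    idx = np.argsort(dists)[:theta_k]                    # theta_k nearest kernel centers
    w = weights[idx] / weights[idx].sum()                # normalized cell weights (sample counts)
    per_center = np.prod(gaussian_kernel(np.abs(centers[idx] - p), bandwidths), axis=1)
    return float(np.sum(w * per_center))

def stare_score(p, centers, weights, bandwidths, theta_k):
    """(mu - D(p)) / sigma, with mu/sigma over densities at the theta_k nearest kernel centers."""
    dists = np.linalg.norm(centers - p, axis=1)
    idx = np.argsort(dists)[:theta_k]
    local = [stare_density(c, centers, weights, bandwidths, theta_k) for c in centers[idx]]
    mu, sigma = np.mean(local), np.std(local) + 1e-12    # guard against zero std
    return (mu - stare_density(p, centers, weights, bandwidths, theta_k)) / sigma
```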

A running example is depicted in Fig. 7. In the first slide of Fig. 7a, STARE forms the non-empty grid cells and computes the density for each sample. In this window, samples are ranked in decreasing order of their score as follows: \(p_4> p_5> p_7> p_6> p_2> p_1 > p_3\). The first two samples have a greater score than \(p_7\) as they fall near a dense region; thus they are assessed to be anomalous. In the next slide of Fig. 7b the samples of grid cells \(g_1, g_3\) expire and two new cells, \(g_5, g_6\), are formed. Assuming \(\theta _k = 2\), only the two nearest kernel centers are examined for each sample. The \(\gamma \) threshold requires at least four samples to expire in a cell before the samples’ densities in that cell are re-computed. In this window, the weight distribution has changed from the previous one, affecting the cells \(g_2\) and \(g_4\); thus the net-change mechanism is activated. However, since only three samples expired in \(g_1\) (the nearest kernel to \(g_2\)), the density of \(p_6\) will not be updated, nor will that of \(p_7\). The new ranking becomes \(p_{6}> p_{10}> p_{12}> p_7> p_8> p_9 > p_{11}\). Observe that \(p_7\) should now have received the highest score, as it lies slightly far from a dense area, while \(p_6\) should have received a lower score, as it lies in a sparse area far from dense areas after the update. Of course, with a different choice of \(\gamma \) this behavior can be avoided.

Fig. 7
figure 7

STARE running example on sliding windows of Fig. 1a. A square is a grid cell \(g_i\) with a kernel center (red cross), representing specific samples

STARE has time complexity \(O(w + N^2_G)\) for training in the first window and \(O(w + rN^2_G)\) for updating in subsequent slides, where w is the window size, \(N_G\) is the number of non-empty grid cells and r is the ratio of changed grid cells between two consecutive windows.

Table 1 Qualitative comparison of online detectors

2.10 Summary

Table 1 summarizes the main characteristics of the online detectors included in our benchmark. Online detectors are essentially incremental versions of offline detectors that assess anomalousness of samples using similar criteria. According to their authors, HST and RRCF are tree-based online detectors inspired by IF. MCOD [40] and CPOD [64] are distance-threshold based, while LEAP [16] is a nearest-neighbor-based detector inspired by KNN\(_W\). STARE [72] is a density-based detector on the full feature space like LOF [14], while RS-Hash [57], working on feature subspaces, traces its roots back to HICS [38]. Offline detectors (detailed in Appendix 1) essentially constitute the baselines for the effectiveness of online detectors.

Regarding the time complexity for the model update of the employed online detectors, according to Table 1, they are divided into linear (XSTREAM, LODA, RS-HASH, CPOD, HST/F), linearithmic (MCOD, RRCF) and quadratic (LEAP, STARE) in the worst case. Note that some of the reported complexities may differ from the analytical complexities reported by other works such as MCOD’s [63]. This is because we took the data characteristics into account. For instance, in MCOD, if each pair of points has a distance greater than R/2, i.e., the data are sparse, or the hyper-parameter K is set to a large value, then no micro-cluster will be formed, resulting in a linearithmic neighbor search. In CPOD, if the data are sparse, many cores will be formed; if many samples are far from their core, in range (R, 2R], then each core will be explored, resulting in computational overhead. In LEAP, when the minimal probing principle fails, the algorithm has quadratic complexity, which can be reduced with advanced data structures. In STARE, if the data distributions between many consecutive windows differ more than the predefined threshold \(\gamma \), the algorithm will have quadratic complexity, as no region will be stationary. The data characteristics will thus determine the ranking of the algorithms w.r.t. execution time.

3 Experimental environment

Our experimental evaluation relies on datasets widely used in previous empirical studies [15, 24, 25, 63, 68]. These set-based datasets contaminated with point anomalies [18] exhibit different characteristics of abnormal and normal samples (e.g., anomaly ratio, dimensionality) and are useful for an unbiased comparison of offline and online algorithms over all possible orders of arrival of samples in a data stream. We additionally consider the recently proposed Exathlon [37] for explainable anomaly detection over time series, which overcomes several limitations of previously used benchmarks for temporal data [69]. These sequence datasets contaminated with interval anomalies [19] can be used to evaluate online detectors only for the specific order implied by the timestamps of their samples.

We have implemented in Java the tree-based online algorithms (HST/F and RRCF) and integrated in our testbed the original Java implementations of MCOD,Footnote 12 LEAP,Footnote 13 STARE,Footnote 14 and CPOD,Footnote 15 as well as the C++ (online) implementation of XSTREAM [45]Footnote 16 and the Matlab (online)/Python (offline) implementations of LODA [49].Footnote 17 We also extended the Python implementation of RS-Hash available at github.Footnote 18 We finally rely on third-party implementations in Python of KNN\(_W\), LOF and IF from the scikit-learn library version 0.21.2Footnote 19 and the original implementation of OCRF [29].Footnote 20 We used Java Version 1.8, Python Version 3.7.4, and Scikit Learn Version 0.21.3. Experiments were conducted on a 2-core Intel i7-7500U at 2.7 GHz with 8 GB of RAM on Windows 10. The platform of the experimental evaluation environment, as well as the scripts used for the analysis of the results, can be found in the GitHub repository: https://github.com/droubo/meta-level-analysis-of-anomaly-detectors.

3.1 Datasets

The bulk of our evaluation lies on real data which in their majority are contaminated with synthetically generated anomalies. To experimentally compare detectors w.r.t. particular factors such as anomaly ratio and dimensionality, window size and speed, etc., we used representative datasets with both synthetic abnormal and normal data.

In unlabeled datasets, point anomalies are synthetically inserted using different methods. In datasets used in classification problems, the samples belonging to the minority class are considered anomalies. In datasets used in clustering problems, anomalies are samples inserted far away from regions of high density. In other datasets, anomalies are implanted by simply adding noise to the values of their features. Anomalies are additionally characterized as subspace anomalies when they are visible only in a subset of the dataset’s feature space, and as fullspace anomalies otherwise. Note that our experimental evaluation highlights the behavior of detectors for subspace anomalies, which have been less studied in the literature.

To simulate a stream of samples from a batch dataset (real or synthetic), we generate a sequence of windows of a given size, as sketched below. To guarantee a smooth detection difficulty across all windows, their content is obtained by shuffling normal and abnormal samples. Moreover, anomalies are stratified across windows using a step related to the anomaly ratio of the dataset. Thus, we guarantee that if \( step \, \le \, window \, size \), then each window will contain a similar number of anomalies. We finally partition windows into training and testing sequences. In this way, we can run a given detector over a specific dataset under multiple possible sample orderings and report its average effectiveness.
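The sketch below (Python, a simplified illustration of our stream generation rather than the exact testbed code) shuffles the normal samples and interleaves an anomaly every step positions, so that every window of the simulated stream receives a similar number of anomalies:

```python
import numpy as np

def simulate_stream(X_normal, X_anomalies, window_size=128, seed=0):
    """Shuffle normal samples and insert anomalies every `step` positions (stratification)."""
    rng = np.random.default_rng(seed)
    normals = list(X_normal[rng.permutation(len(X_normal))])
    anomalies = list(X_anomalies[rng.permutation(len(X_anomalies))])
    n_total = len(normals) + len(anomalies)
    step = max(1, n_total // max(1, len(anomalies)))  # spacing derived from the anomaly ratio
    stream, labels = [], []
    for pos in range(n_total):
        take_anomaly = (pos % step == 0 and anomalies) or not normals
        if take_anomaly:
            stream.append(anomalies.pop()); labels.append(1)
        else:
            stream.append(normals.pop()); labels.append(0)
    # split the interleaved stream into consecutive windows of the requested size
    windows = [np.array(stream[i:i + window_size]) for i in range(0, n_total, window_size)]
    window_labels = [np.array(labels[i:i + window_size]) for i in range(0, n_total, window_size)]
    return windows, window_labels
```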

To select the window size for the experiments concerning point anomalies, we relied on prior work on point anomaly detection [32, 40, 49, 60]. Instead of selecting a fixed window size in advance as in [49, 60], we examined three sizes comprising 64, 128 and 256 samples. Regarding the algorithms that operate using sliding windows, we used stride=1 [32]. To ensure a fair comparison of detectors, we investigated the window size that favors the majority of the detectors. In Fig. 16 (see Appendix 2), we found that a window of 128 samples results in better median MAP performance for the majority of the detectors. As a matter of fact, for many detectors the MAP using the three window sizes is almost identical, due to the stratification of anomalies performed in each window per dataset. We should stress that for AUC we observed the same trend. Hence, in subsequent experiments concerning point anomalies, we decided to use a window of 128 samples.

3.1.1 Real datasets

Table 2 depicts the characteristics of the real-world and synthetic datasets used in our experiments. Real-world datasets have originally been introduced in the UCI [26] and OpenML [67] repositories for multi-class classification. In our comparison, we use the unnormalized versions of these datasetsFootnote 21 that are made available for the anomaly detection problem by GLOSS [65], ODDS, DATAHUB, DAMI [15] and XSTREAM [45]. In the majority of these datasets, the anomaly ratio is low (no more than 9%), with the exception of Electricity in which anomalies and normal samples are almost balanced. It is also worth describing how anomalies are implanted in GLOSS. The authors randomly pick 5% of the samples, transforming each one into an anomaly by replacing a randomly picked feature subset with the values of a sample from a different class; the size of each feature subset is chosen uniformly from \([2, \max (2, 0.1 \cdot d))\), where d is the dataset’s dimensionality. The aforementioned anomaly generation process results in scattered fullspace anomalies.

Table 2 Real datasets with fullspace anomalies and Synthetic datasets with subspace anomalies

3.1.2 Synthetic datasets

Table 2 depicts the five synthetic datasets we used in our experiments. These datasets have been introduced to evaluate the HiCS subspace anomaly detectorFootnote 22 and exhibit a high variability in the number of features (20D-100D) while the number of samples is constant (875). They are normalized and contaminated with subspace anomalies of varying dimensionality (2D, 3D, 4D, 5D) as indicated by the subscripts of their names.

The peculiarity of the HiCS datasets is that they are contaminated with point anomalies that lie far away from dense regions of normal samples with highly correlated features. LOF has been applied exhaustively on each of the 2D, 3D, 4D and 5D feature subspaces to score subspace anomalies. In our experiments, we focus on the 100D HiCS dataset and removed anomalies that are visible in more than one subspace. Then, we generated five dataset variations with 20, 40, 60, 80 and 100 dimensions that contain exactly the same subspace anomalies of a given dimensionality (2D, 3D, 4D or 5D). The objective is to increase the ratio of irrelevant features as we increase the dimensionality of the datasets. In the 20D HiCS dataset, all features belong to at least one of the subspaces in which anomalies are visible, i.e., they are relevant to the subspace anomalies of the dataset. As the same anomalies are contained in all datasets, the 100D dataset has 20 relevant and 80 irrelevant features. We finally included 18 anomalies per dataset, leading to an anomaly ratio of 2%.

3.1.3 Exathlon time series

The Exathlon benchmark [37] contains repeated executions of 10 different Spark streaming applications (2.3 million samples in total). The dataset is split across 93 different files, called traces, each recording a Spark streaming application run (2,283 features); traces are grouped by application (9 traces per application on average). Out of these traces, we have focused on 34, called disturbed traces, that contain both normal and abnormal samples. They are split into 5 categories based on the type of anomalies they containFootnote 23 in realistic system health monitoring settings: (i) bursty input traces, (ii) bursty input until crash, (iii) stalled input traces, (iv) CPU contention and (v) process failure traces. Unlike the previously described datasets where anomalies are scattered throughout all windows, here we are interested in evaluating the effectiveness of online detectors in spotting point anomalies that occur in a specific time frame, called range-based anomalies. We should stress that in these datasets the anomalies are produced in specific time-frames under the same conditions, and therefore they are clustered. In our experiments, we have used sequence data acquired from only 2 applications (5 datasets in total, one for each type of anomaly), as experiments with time series acquired from all applications proved to be very time-consuming.

3.1.4 Dataset profiling

To be able to explain why the effectiveness of detectors differs even on the same dataset, meta-features are extracted from the real and synthetic datasets using the Python library PyMFE [6]. In addition to the general and statistical meta-features explored in previous meta-learning works [66, 75], we introduce meta-features related to both fullspace and subspace anomalies.

Table 3 Set of statistical and general dataset meta-features

More precisely, our meta-level analysis of experimental results explores the following sub-categories of meta-features (see Table 3): (1) General, reporting the number of samples/features and the anomaly ratio of datasets; (2) Fullspace, characterizing anomalies in the entire feature space of datasets, such as the distance between the centers of mass of the abnormal and normal classes, the number of pairwise correlated features, the number of features whose values are normally distributed, or the difficulty in separating abnormal from normal samples, called Anomaly To Normal Distance (AND); (3) Subspace, indicating the existence of anomalies in subspaces of the features of the dataset. In the case of subspace anomalies, relevant features have a higher correlation with the target variable (i.e., the anomaly class) in comparison to irrelevant ones, according to different test statistics like Wilks’ [42], Pillai’s, Roy’s [54] and Lawley–Hotelling; (4) Value space, characterizing the skewness of the different feature distributions by taking the average/std value of a statistic such as the maximum/minimum.
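As an illustration, the general and statistical meta-features can be obtained with a few lines of PyMFE; the snippet below assumes the MFE interface of recent PyMFE releases, while our custom fullspace/subspace meta-features (e.g., AND or the multivariate test statistics) are computed by separate scripts in the repository:

```python
import numpy as np
from pymfe.mfe import MFE  # PyMFE [6]; interface assumed from recent releases

def extract_meta_features(X, y):
    """General and statistical meta-features of one dataset.
    X: samples x features array, y: binary labels (1 = anomaly, 0 = normal)."""
    mfe = MFE(groups=["general", "statistical"])
    mfe.fit(X, y)
    names, values = mfe.extract()
    meta = dict(zip(names, values))
    # an explanatory meta-feature requiring knowledge of the target: the anomaly ratio
    meta["anomaly_ratio"] = float(np.mean(y))
    return meta
```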

In the sequel, we detail the AND meta-feature introduced in this work:

$$\begin{aligned} DistCM = \frac{\sum _{i=1}^{n} \left( med (N_i) - med (A_i)\right) }{n} \end{aligned}$$
(13)
$$\begin{aligned} AND = \frac{ DistCM - AvgMAD }{ AvgMed }, \end{aligned}$$
(14)

where n is the number of features, \( AvgMAD \) is the average of the features’ Median Absolute Deviations, and \( AvgMed \) is the average of the features’ median values. Equation 14 reveals the normalized difference between the distance separating the normal and abnormal centers of mass and the average MAD. The intuition behind this formula is how far anomalies lie from normal samples w.r.t. the average Median Absolute Deviation of the distribution. As we can see in Fig. 8, the lower the value of \( AND \), the more difficult it is to separate abnormal from normal samples. The above meta-features can be distinguished as predictive or explanatory. The former can be computed from the dataset without knowing the target (e.g., dataset dimensionality) while the latter require knowledge of the target (e.g., anomaly ratio).
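A direct transcription of Eqs. 13 and 14 in Python (our own sketch; it assumes that medians and MADs are computed per feature over the whole dataset):

```python
import numpy as np

def anomaly_to_normal_distance(X, y):
    """AND meta-feature (Eqs. 13-14). X: samples x features, y: 1 = anomaly, 0 = normal."""
    A, N = X[y == 1], X[y == 0]
    n = X.shape[1]
    # Eq. 13: average per-feature distance between the medians of normal and abnormal samples
    dist_cm = np.sum(np.median(N, axis=0) - np.median(A, axis=0)) / n
    # average Median Absolute Deviation and average median over all features
    med = np.median(X, axis=0)
    avg_mad = np.mean(np.median(np.abs(X - med), axis=0))
    avg_med = np.mean(med)
    # Eq. 14: difference between DistCM and the average MAD, normalized by the average median
    return (dist_cm - avg_mad) / avg_med
```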

3.2 Evaluation protocols and metrics

We are interested in assessing the effectiveness of anomaly detectors in separating abnormal from normal samples on the basis of the real-valued scores they produce. Unlike offline detectors working with one large window, online detectors incrementally model and score samples in a stream in several windows.

To ensure a common ground for comparing detectors’ effectiveness across all datasets of our testbed, we rely on widely used metrics for evaluating the detection of top-k point anomalies [15, 32, 45, 49, 72]. Range-based metrics capturing the positional overlap of the discovered anomalous subsequences with the ground truth [61] are valuable when evaluating shallow [11] or deep [48] anomaly detection methods over time series and are left as future work.

We consider an evaluation protocol inspired by Forward Chaining Cross Validation [9] used in time series analytics. It essentially computes an evaluation metric per window in the test partition. This evaluation protocol captures the variance of detectors’ effectiveness across all windows as their model gets updated in a streaming fashion. The effectiveness of score-based detectors can be assessed by two metrics: the Area Under the Receiver Operating Characteristics Curve (ROC AUC) and the Average Precision (AP) [74].
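A minimal sketch of this per-window protocol in Python (the score/update detector interface is hypothetical; our testbed wraps each detector's native implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_per_window(detector, test_windows, test_labels):
    """Score each test window with the current model, record AUC/AP, then update the model."""
    aucs, aps = [], []
    for X_w, y_w in zip(test_windows, test_labels):
        scores = detector.score(X_w)          # hypothetical interface: anomaly score per sample
        if len(np.unique(y_w)) == 2:          # both classes are needed to compute the metrics
            aucs.append(roc_auc_score(y_w, scores))
            aps.append(average_precision_score(y_w, scores))
        detector.update(X_w)                  # incremental model update with the window samples
    # averaging AP over the windows yields the Mean Average Precision (MAP) defined below
    return float(np.mean(aucs)), float(np.mean(aps))
```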

The ROC curve is a 2D plot representing the trade-off between the false positive rate FPR (on the x-axis) and the true positive rate TPR (on the y-axis) for different score thresholds, and AUC is the area under this curve. The higher the AUC ROC, the greater the probability that a detector correctly separates the samples into abnormal and normal classes. A value of 0.5 indicates random classification and a value of 1.0 indicates perfect classification.

Fig. 8
figure 8

High Anomaly to Normal Distance (AND) value indicates readily separated anomalies

AP is used to measure explicitly whether anomalies obtain a higher score than normal samples. The higher the AP, the less overlap exists between the score distributions of the abnormal and normal classes. A value of 0.0 (1.0) indicates that all normal samples (anomalies) scored higher than the true anomalies (normal samples). In contrast to AUC, the value that indicates random classification varies per dataset. In fact, it is equal to the percentage of the positive class [56], which in our case is the anomaly ratio of each dataset. More formally, AP is defined using Precision at k (P@k) as follows (both definitions are illustrated in the sketch after the list):

  • Precision at k (P@k): given a dataset D consisting of n items and a set of anomalies \(A \subset D\), P@k is defined as the proportion of the true anomalies to the top-k potential anomalies ranked by the detection method: \(P@k = \frac{\vert \{x \in A\vert rank(x) \le k\}\vert }{k}.\)

  • Average precision (AP): instead of evaluating the precision individually, this measurement refers to the mean of precision scores over all possible positions: \(\frac{1}{\vert A\vert } \sum _{k=1}^{n} P@k \cdot \mathbbm {1}[x_k \in A].\)
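The two definitions translate directly into the following Python sketch (illustrative only; in the experiments they are applied per window as described next):

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """P@k: fraction of true anomalies among the k highest-scored samples."""
    order = np.argsort(scores)[::-1]             # rank samples by decreasing anomaly score
    return np.sum(labels[order[:k]]) / k

def average_precision(scores, labels):
    """AP: mean of P@k over the ranks k at which a true anomaly is retrieved."""
    order = np.argsort(scores)[::-1]
    ranked_labels = labels[order]
    hits = np.where(ranked_labels == 1)[0] + 1   # 1-based ranks of the true anomalies
    return np.mean([np.sum(ranked_labels[:k]) / k for k in hits])

# Toy usage: three anomalies (label 1) among six samples
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7, 0.2])
labels = np.array([1,   0,   1,   0,   0,   1  ])
print(precision_at_k(scores, labels, 3), average_precision(scores, labels))  # 0.667, 0.867
```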

Given that our stream generation method stratifies anomalies per window, we set k equal to the window size for online detectors and equal to the total number of samples for offline ones. Statistical methods could alternatively be used to estimate k [71]. In this way, we guarantee that each window contains at least one ground truth anomaly (i.e., \(A \ne \emptyset \) in every window). We should stress that, for an online anomaly detector, the final AUC and AP are computed on each window; thus the final metric value is the average over the constituent windows, leading to the Mean Average Precision (MAP).

The two metrics may yield different results for a given order of samples. In order to demonstrate two extreme cases, we consider the example of Fig. 9. Case A exemplifies a low AP despite a high AUC ROC. When the majority of the normal samples (negative class) are scored in the lowest positions, AUC penalizes the detector that produced such an ordering less than AP does. Case B exemplifies the opposite case: there is a higher confidence in the positive class at the higher rankings. AP can be more enlightening, as we are interested in ranking anomalies higher than normal samples, given that AUC ROC on highly imbalanced datasets (i.e., with low anomaly ratio) can be misleading [13, 22, 44, 56].

In order to rank the detectors according to their MAP or AUC scores, we used the AutoRank library in Python [35], which also statistically compares the ranks of the detectors using the nonparametric Friedman test, as well as the post hoc Nemenyi test for pairwise comparisons.
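A sketch of this ranking step with AutoRank (the scores shown are hypothetical and only illustrate the expected input format, i.e., one row per dataset and one column per detector; keyword arguments may differ slightly across AutoRank versions):

```python
import pandas as pd
from autorank import autorank, create_report, plot_stats  # AutoRank [35]

# Hypothetical MAP scores: rows are datasets, columns are detectors.
scores = pd.DataFrame(
    {"X-S":  [0.71, 0.55, 0.63, 0.48, 0.66, 0.59],
     "MCOD": [0.60, 0.52, 0.58, 0.45, 0.61, 0.54],
     "HST":  [0.58, 0.50, 0.61, 0.44, 0.57, 0.52]},
    index=[f"dataset_{i}" for i in range(1, 7)],
)

# AutoRank compares the average ranks (Friedman test with a post hoc
# Nemenyi test in the non-parametric case) at the chosen significance level.
result = autorank(scores, alpha=0.05, order="descending")
create_report(result)   # textual summary of the statistical comparison
plot_stats(result)      # critical-distance diagram such as the ones in Fig. 11
```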

Fig. 9
figure 9

Exemplifying the differences of detectors’ scoring functions in AUC ROC vs AP

4 Effectiveness of detectors

The objective of this series of experiments is to answer three main questions: (a) what is the reliability of decisions made by anomaly detectors compared to a random classifier? (b) how can online and offline detectors be ranked according to their performance on real datasets? and (c) to what extent do online detectors approximate the performance of offline detectors in identifying sub/full space anomalies?

To answer these questions, we benchmark ten online detectors {HST, HSTF, RRCF, MCOD, XSTREAM (X-S), LODA (L-S), RS-HASH, STARE, LEAP, CPOD} and six offline detectors {LOF, KNN\(_W\), iForest (IF), XSTREAM (X-B), OCRF, LODA (L-B)} using twenty-four real datasets (see Sect. 3.1). The AUC and MAP scores of online and offline detectors per dataset are reported in Tables 16 and 15, respectively (see Appendix 3). These scores are obtained by averaging over 30 independent runs (with different sampled data streams) for HST/F, RRCF, X-S, and L-S (non-deterministic detectors) and 1 run for MCOD, RS-HASH, STARE, LEAP, CPOD (deterministic detectors). For the optimal hyper-parameters of each detector per dataset, readers are referred to Appendix 2. After preliminary experiments with a subset of our datasets, we chose 128 as the optimal window size for the vast majority of the detectors (only RS-HASH is favored by larger windows, see Fig. 16). Varying window sizes per algorithm and dataset would complicate a statistically sound comparison of algorithms over the two performance metrics (MAP and AUC) across all datasets of our testbed. In the sequel, we present a synoptic overview of the main findings after analyzing the AUC and MAP scores.

4.1 Random detection

In a first step, we are interested in assessing the reliability of the performance exhibited by the different online and offline detectors on real datasets. To this end, we compare their AUC scores on real datasets (see Table 16) with the performance of the Random Classifier described in Sect. 3.2. In particular, we are looking for dataset characteristics that make detectors perform randomly. Figure 10 depicts the CDF and PMF plots of the distribution of the number of detectors that exhibit similar performance to the Random Classifier per dataset, i.e., with AUC ROC \( < 0.6\). The AUC ROC essentially represents the probability of a random anomaly being scored higher than a normal sample, thus providing a natural comparison with the performance of the random binary classifier: when AUC=0.5, the classifier is not able to distinguish between positive and negative class points. As we can see, 9 out of 16 anomaly detectors are close to the Random Classifier in half of the datasets (see the dotted red line). This observation clearly indicates the serious limitations of existing unsupervised detectors. Interestingly, on datasets that contain implanted anomalies, 11.3 detectors on average fail to outperform the random classifier, whereas on datasets where the minority class is used as the anomaly class, 7.6 detectors fail on average. This highlights that implanted anomalies are more challenging for the detectors than anomalies simulated by the minority class. This is due to the fact that in implanted datasets, anomalies are scattered in the full feature space and their detection requires knowledge of the data distribution drawn from the entire dataset rather than from individual windows. As we can observe in Table 15 in Appendix 3, batch detectors outperform streaming detectors in five out of six datasets with implanted anomalies. On the only dataset with real anomalies (Electricity), 12 detectors fail, but this is mainly attributed to the high anomaly ratio of this dataset (57.5%). In fact, the number of detectors that exhibit random performance is correlated with the anomaly ratio of the datasets, with a p value of 0.039 and a correlation value of 0.422. Consequently, the higher the anomaly ratio, the more likely it is for the detectors to behave randomly. LOF and KNN are the detectors that fail in the fewest datasets (7 and 8 datasets, respectively). Among online detectors, X-S and MCOD have the least random behavior (both fail on 11 datasets out of 24). It is also worth mentioning that no algorithmic family seems to fail in more datasets than the rest.

Fig. 10
figure 10

CDF and PMF of online and offline detectors with a performance comparable to the Random Classifier: More than half of the detectors fail in 50% of the datasets

To capture the degree of difficulty caused by the different types of anomalies contaminating each dataset, we consider the meta-feature Anomaly To Normal Distance (AND) described in Sect. 3.1.4. Over the 14 datasets with the lowest AND values, 11.3 detectors fail on average. Over the remaining 10 datasets with the highest AND values, 6.2 detectors fail on average. Clearly, the higher the distance of abnormal from normal samples w.r.t. the average median absolute deviation, the fewer the detectors scoring close to the Random Classifier. Another meta-feature significantly correlated with the number of detectors exhibiting a performance close to the Random Classifier is the number of distinct highly correlated pairs of features, with a p value of 0.005 and a correlation value of -0.546. The more correlated pairs of features in a dataset, the fewer detectors perform similarly to the Random Classifier.

4.2 Ranking online detectors

We are turning now our attention to the performance comparison of online anomaly detectors.

Table 4 summarizes the total number of wins of the online detectors along with the average difference of their MAP (AUC ROC) score from the winning detector in each dataset. According to the MAP metric, four detectors (X-S, MCOD, L-S and RS-HASH) tie in the first place, each winning in 4 out of 24 datasets. LEAP follows with 3 wins, while HST and HSTF have 2 wins each. CPOD and STARE do not achieve any win. In the datasets where it is outperformed by others, HST exhibits a performance very close to the leader (compared to the rest of the detectors), with a difference of 15.8% on average. Despite its forgetting mechanism, HSTF has a higher average difference from the leader (18.7%) compared to HST, while having the same number of wins. RS-Hash, STARE and LEAP together win in 29% of the datasets, but they exhibit the highest average differences (22.5%-28.8%) from the leader in the remaining ones. Note that 3 of the X-S wins are observed in datasets (3 out of 6) with fewer than 800 samples. Also note that projection-based detectors (X-S and L-S) lead in 4 out of 6 datasets that contain implanted anomalies (MCOD wins in the rest).

Table 4 Number of online detectors’ wins and average difference from the winner (ADW) on real datasets: X-S dominates in both metrics

Table 4 also summarizes the total number of wins of each online detector based on the AUC ROC scores. According to this metric, X-S gets 5 more wins, while L-S and MCOD have 3 and 2 fewer wins, respectively, compared to MAP. STARE obtains its first win compared to MAP. We observe that X-S achieves 4 more wins in datasets where most detectors exhibit similar performance to the Random Classifier. In this case, the performance difference between the first and the second leading detector becomes significant compared to MAP. Also, X-S wins in 5 out of 8 datasets with more than 10,000 samples. In other datasets, we observe that according to AUC ROC detectors may not perform as well as when considering MAP (see Sect. 3.2). More precisely, L-S on InternetAds and X-S on Pima succeed in scoring more anomalies higher than the other detectors, but fail at the lower rankings.

To determine if there are any significant differences between the average ranks of the detectors, we use the nonparametric Friedman test [23]. We reject, with a significance level of 5%, the null hypothesis that all detectors’ performances are equal. Next, we use the post hoc Nemenyi test in order to compare the detectors in pairs. There is a significant difference when the difference between the average ranks of two detectors is higher than a critical distance (CD) of 2.57 (for 10 detectors on 24 datasets at a significance level of 0.05). Figure 11a depicts the statistically significant ranking of the detectors’ performance in the 24 datasets of our testbed according to their MAP. X-S is ranked first, followed by MCOD, while having no statistically significant difference with any of the remaining detectors. HST/F, MCOD and RS-HASH have similar average ranks (around 5). HST surpasses HSTF due to its lower average performance difference from the leader, despite having the same number of wins. L-S is ranked sixth, followed by RRCF. STARE, CPOD and LEAP are the last three detectors, with very small differences in their average ranks.

This overall performance picture does not drastically change when we consider AUC ROC. As we can see in Fig. 11b, X-S now has a statistically significant difference with the last 4 detectors (CPOD, LEAP, STARE, RRCF). X-S’s strong performance on AUC ROC indicates that it consistently ranks most normal samples lower than anomalies. The following 5 detectors (L-S, HST/F, MCOD and RS-HASH) are ranked close to each other, which can be mainly attributed to the class imbalance of the datasets, as AUC ROC provides less information compared to MAP (see Sect. 3.2). L-S is now ranked three places higher compared to MAP, as it becomes the third best performing detector, leaving RS-HASH in the sixth place.

In a nutshell, X-S exhibits the best overall performance when considering both metrics, having a statistically significant difference with half of the detectors using AUC ROC. L-S performs better w.r.t. AUC ROC than MAP, while the additional forgetting mechanism on HST (i.e., HSTF) does not significantly boost its performance. The rest of the detectors remain relatively stable w.r.t. both metrics. We should note that the optimized variants of MCOD, like CPOD or LEAP, are systematically ranked in the last 2 positions w.r.t. both metrics. Regarding density-based detectors, RS-Hash outperforms STARE but both underperform compared to the distance-based MCOD.

Fig. 11
figure 11

Online detectors’ ranking: X-S dominates on both MAP and AUC

4.3 Online vs offline detectors

As a last question in this section, we study to what extent online detectors approximate the performance of offline detectors on real datasets. In essence, we are interested in comparing the AUC or AP performance of detection models continuously updated over several small windows with that of batch models built on one big window.

Again we rely on the nonparametric Friedman test [23] and we reject, with a significance level of 5%, the null hypothesis that all detectors’ performances are equal. Using the post hoc Nemenyi test, there is a significant difference when the difference between the average ranks of two detectors is higher than a critical distance (CD) of 4.05 (for 16 detectors on 24 datasets at a significance level of 0.05). Figure 12a depicts the rank of each detector according to their MAP scores, and Table 5 provides the number of wins per detector along with the average difference from the leading detector in each dataset.

In the following, we contrast the ranks of online and offline detectors belonging to the same family, namely tree-based (HST/F and RRCF vs IF and OCRF), distance/KNN-based (MCOD, CPOD and LEAP vs KNN), density-based (STARE and RS-HASH vs LOF), as well as the stream and batch versions of the two projection-based detectors XSTREAM and LODA.

4.3.1 Tree-based detectors

As we can see in Fig. 12a, IF is the best ranked tree-based detector, and fifth overall. Although IF is ranked first among tree-based detectors, its rank differences with HST(F) and RRCF are less than 0.5 and 1.5, respectively. Surprisingly enough, tree-based online detectors not only approximate well the effectiveness of offline ones but, as in the cases of HST(F) and RRCF, may even outperform offline detectors like OCRF. Despite its more informed decisions regarding the splitting criteria in trees, OCRF exhibits the worst performance. This behavior can be attributed to the fact that, to ensure a fair comparison with online detectors, offline detectors are also trained with both normal samples and anomalies (unlike in the original OCRF paper [29]).

It is important to notice that there is no statistically significant difference between any of the tree-based detectors. This picture does not change based on AUC ROC either. As we can see in Table 5, IF, OCRF and HST get the most wins among tree-based detectors (2 wins each), while HST exhibits the lowest average difference from the leader. The rest of the detectors (HSTF, IF, RRCF) have a similar average difference from the leader, with OCRF having the highest (31.4%).

Fig. 12
figure 12

Detectors’ ranking using MAP and AUC ROC: Online detectors seemingly approximate the performance of offline detectors

Table 5 Number of wins based on MAP of online vs offline detectors on real datasets and their average difference from winner (ADW): LOF achieves the most wins

4.3.2 Distance/KNN and density-based detectors

As we can see in Fig. 12, the offline detectors KNN and LOF are ranked in the first 3 places based on both AUC ROC and MAP. Their rank difference with MCOD is close to 2. As a matter of fact, MCOD and LOF achieve the highest number of wins with 4 each, followed by LEAP and KNN with 2 and 1 wins (see Table 5). After MCOD, RS-HASH is the second best ranked online detector, while CPOD, STARE and LEAP are consistently ranked in the 3-4 lowest places, achieving on average the highest performance difference from the leading detector in each dataset (see Table 5). Furthermore, RS-Hash and MCOD fail (\(AUC < 0.6\)) in fewer datasets compared to CPOD, LEAP and STARE. KNN and LOF show a statistically significant difference to CPOD, LEAP and STARE based on AUC ROC and to CPOD and LEAP based on MAP. It is worth mentioning that LOF wins in all three datasets with the highest number of features (isolet, letter-recognition, InternetAds) with an AUC above 0.75, whereas every online KNN and density-based detector performs randomly on them.

4.3.3 Projection-based detectors

As we can see in Fig. 12a, the batch version of XSTREAM (X-B) is the best performing detector, while its streaming version is placed fourth. There is no statistically significant difference between the performance of the two XSTREAM versions (i.e., their rank difference is close to 1). X-B succeeds in winning in 2 datasets, while X-S does not achieve any win and has a 2.5% lower average difference from the corresponding leader (see Table 5). Interestingly enough, X-S is ranked third and X-B fourth based on AUC ROC (Fig. 12b). In a nutshell, X-S not only approximates well the performance of X-B (and outperforms it according to AUC ROC), but also outperforms almost every other detector in our benchmark (besides LOF and KNN).

Fig. 13
figure 13

Score distribution of HST/F, RRCF and RS-HASH for the 5 datasets from Exathlon benchmark. The red highlighted areas indicate the anomaly range according to the ground truth

Table 6 MAP and AUC ROC scores of online detectors over the 5 time series datasets (1_4_1000000_80, 1_5_1000000_86, 1_2_100000_68, 6_1_500000_65, 6_3_200000_76) contained in Exathlon benchmark

As we can see in Fig. 12a, the streaming (L-S) and batch (L-B) versions of LODA are ranked closely (only RRCF separates them). L-S wins in 4 datasets with a lower average difference (28.0%) from the leader, while L-B is systematically outperformed, with the highest average difference (32.1%) from the leader in each dataset (see Table 5). Clearly, L-S outperforms L-B w.r.t. both metrics.

4.3.4 Discussion

The previous experiments demonstrate that online detectors can effectively approximate the performance of offline ones (e.g., X-S, MCOD and RS-HASH vs KNN and LOF, or HST/F vs IF) and in some metrics even outperform them (i.e., X-S vs X-B). Distance/KNN and density-based offline detectors are ranked high (top 3), while the optimized MCOD variations, CPOD and LEAP, are ranked in the lowest places. Furthermore, projection-based detectors (XSTREAM and LODA) consistently outperform tree-based ones (HST/F, RRCF and OCRF) in either batch or stream mode.

5 Detection of range-based anomalies

In this section we evaluate the performance of online detectors (X-S, L-S, HST/F, RRCF, MCOD, RS-HASH, LEAP, CPOD, STARE) over 5 time series datasets containing a different type of range-based anomalies (see Sect. 3.1.3): (a) \(1\_2\_100000\_68\) contains anomalies caused by bursty input until crash; (b) \(1\_4\_1000000\_80\) contains anomalies caused by CPU contention; (c) \(1\_5\_1000000\_86\) contains anomalies caused by a process failure; (d) \(6\_1\_500000\_65\) contains anomalies caused by bursty input without resulting in a crash; and (e) \(6\_3\_200000\_76\) contains anomalies caused by stalled input from the sender. As in our previous experiments, we thoroughly tuned the hyper-parameters (see Table 11 in Appendix 2) of each detector.

For range-based anomalies, the selection of the window size is more challenging than for point anomalies. This is because the anomaly ranges (and thus the number of anomalies) vary from one window to the other per dataset. In other words, one cannot stratify the datasets contaminated with range-based anomalies as the order of the samples matters. For this reason, we included the window size as a hyper-parameter of each detection algorithm in our cross-validation protocol. In Table 11 (see Appendix 2), we report, per dataset, the window size (64, 128 or 256) that led to the highest AUC for each detector with a performance higher than the random classifier.

Table 6 depicts the MAP and AUC ROC scores of online detectors on the aforementioned datasets. We can easily observe that the majority of detectors (X-S, L-S, MCOD, CPOD, LEAP, STARE) exhibit a random behavior in all 5 datasets. For this reason, we focus on the remaining 4 detectors (HST/F, RRCF, RS-HASH). Notably, HSTF and RRCF are the only two detectors employing a tunable forgetting mechanism as new samples appear. HST/F outperforms all other detectors in the 5 datasets, while exhibiting a close to random performance on the \(6\_3\_200000\_76\) dataset. The addition of the forgetting mechanism on HST slightly improves its performance in 4 out of 5 datasets, as it is able to forget past points when detecting range-based anomalies. We should stress that RS-HASH is more effective than X-S due to its forgetting policy, i.e., it gradually reduces the histogram counts rather than assigning zero counts to all bins when the window slides, as in X-S. Tunable and gradual forgetting mechanisms prove to be more effective for detecting range-based anomalies spanning several time-frames. Moreover, the range-based anomalies of Exathlon are clustered in specific time-frames; thus, after a few frames, a global forgetting mechanism adjusts to the anomalous samples, considering them as normal. The dataset containing anomalies caused by stalled input (\(6\_3\_200000\_76\)) proves to be the most difficult for all these 4 detectors, which all exhibit a random performance on it. As a matter of fact, this is the dataset containing two different interval anomalies, while the application returns to its normal state after a period of abnormal behavior. On the other hand, all 4 detectors achieve the highest scores on the \(1\_2\_100000\_68\) dataset, where anomalies are caused by an increase in the input rate of the sender until the application crashed.

Figure 13 depicts the scores per sample for each of the 4 detectors (HST/F, RRCF and RS-HASH). RS-HASH has a similar performance across 3 out of 5 datasets, where the scores of samples increase as time passes. As we can see in Fig. 13a, HST/F and RRCF are able to give high scores to anomalies from the beginning of the bursty input, where they start to appear. As time passes, the scores of anomalies get closer to those of the normal samples observed before the bursty inputs begin. The forgetting mechanism of HST results in a more immediate drop in the scores’ range. The same behavior can be observed in the \(1\_5\_1000000\_86\) dataset, where the scores of normal samples, after the anomalies have appeared, get smaller over time. Last but not least, the datasets \(1\_4\_1000000\_80\), \(6\_1\_500000\_65\) and \(6\_3\_200000\_76\), which do not result in an application crash, prove to be the most challenging for all detectors. As shown in Fig. 13c–e, RRCF and HST/F are not able to detect the existence of abnormal samples and maintain a high score over a large period of time. This behavior is due to the fact that anomalies appearing in more than 40 consecutive windows allow detectors to adjust to the distribution change. As the application returns to its normal state, normal samples are identified as anomalies until the detectors adjust again to the new distribution change. As a result, data streams processed in a high number of consecutive windows featuring anomalies that do not correspond to an application crash prove to be the most challenging for all online detectors.

Unlike for point anomalies, the vast majority of detectors exhibit a random performance on range-based anomalies. Furthermore, larger intervals of anomalies challenge all detectors more (see Fig. 13c). HST/F proved to be the best overall detector in this experiment, exhibiting the highest AUC ROC and MAP scores.

6 Robustness of detectors

In this experiment we assess the robustness of online {HST, HSTF, RRCF, MCOD, XSTREAM (X-S), LODA (L-S), RS-HASH, STARE, CPOD, LEAP} and offline {LOF, KNN\(_W\), iForest (IF), XSTREAM (X-B), LODA (L-B)} detectors against increasing data and subspace anomaly dimensionality, i.e., anomalies hidden in subspaces with more features. To keep the anomaly ratio constant, we use twenty synthetic datasets derived from HiCS (see Sect. 3.1.2) as follows: for each of the five datasets containing 20 up to 100 features, namely {HiCS20, HiCS40, HiCS60, HiCS80, HiCS100}, we generate four variants contaminated with the same number of 2-d, 3-d, 4-d and 5-d subspace anomalies. Thus, to assess the impact of an increasing number of irrelevant features on the detection of subspace anomalies of a given dimensionality, each dataset has four sub-versions, one per subspace anomaly dimensionality (2-d, 3-d, 4-d and 5-d).

Fig. 14
figure 14

MAP scores over increasing data and subspace dimensionality on synthetic datasets

As we can see in Fig. 14a, in synthetic datasets, projection-based detectors (X-S, L-S) outperform (in terms of MAP) all other online detectors in higher dimensions (80-d and 100-d). MCOD is up to one order of magnitude more effective than tree-based detectors (HST/F, RRCF) in 20 and 40 dimensions, while in higher dimensions all detectors exhibit a similar effectiveness. CPOD and LEAP seem to perform better than MCOD as data dimensionality increases: in some cases (60-d) they are even twice as effective as MCOD. Moreover, the performance of RS-HASH, which is a subspace detector, drops as the number of features increases. HST/F, RRCF and STARE exhibit a similar performance across all dimensions. As expected, distance and density-based detectors like MCOD, CPOD and LEAP underperform as data dimensionality increases. This is due to the fact that the Euclidean distance becomes less effective in higher dimensions [70]. We finally observe that the implemented forgetting mechanism makes HSTF perform slightly better than HST in detecting subspace anomalies.

Overall, with the exception of distance-based detectors, online detectors exhibit a robust behavior under increasing data dimensionality. This behavior can be attributed to various reasons. First, the random projections enable X-S and L-S to effectively reduce data dimensionality while approximately preserving the distances between samples. This mechanism seems to be less affected by the increasing number of features irrelevant to the subspace anomalies of the datasets, compared to the random feature splitting employed by HST/F and RRCF. Second, the notion of neighborhood in MCOD, CPOD, and LEAP becomes less meaningful in higher dimensions [2]. Third, in our benchmark the successive windows consumed by online detectors are fed with stratified anomalies over shuffled normal samples (see Sect. 3.1). It is known that processing data in sub-samples reduces both anomaly swamping and masking effects [43, 77], which may occur when we increase the dataset dimensionality with irrelevant features.
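The distance-preservation argument can be illustrated with scikit-learn's SparseRandomProjection (a generic illustration of sparse random projections, not the projection code embedded in XSTREAM or LODA):

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import SparseRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(875, 100))      # size and dimensionality comparable to HiCS100

# Project onto 20 sparse random directions; pairwise distances are preserved in expectation.
proj = SparseRandomProjection(n_components=20, random_state=0)
X_low = proj.fit_transform(X)

mask = ~np.eye(len(X), dtype=bool)   # ignore the zero self-distances
ratio = pairwise_distances(X_low)[mask] / pairwise_distances(X)[mask]
print(f"low-dim/high-dim distance ratio: mean={ratio.mean():.2f}, std={ratio.std():.2f}")
```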

In Fig. 14b, we can observe that, with the exception of OCRF, the effectiveness of offline detectors decreases while increasing data dimensionality. Clearly, in high-dimensional spaces, all pairs of samples become almost equidistant (distance concentration), and distance and density-based detectors like \(\hbox {KNN}_W\) and LOF struggle to separate abnormal from normal samples. In the experiment reported in [78] this effect has been observed at 100-d, while in our experiments it starts earlier, at 40-d. This is due to the fact that the HiCS datasets are contaminated with subspace instead of fullspace anomalies.

IF fails to isolate subspace anomalies, since by increasing the number of features which are irrelevant to the anomalies, the noise in the tree structures constructed by uniformly sampling features increases. In the experiment reported in [53] this effect has been observed with the addition of 30 irrelevant features, while in our experiments it starts earlier, even with the addition of 20 irrelevant features, due to the contamination of the HiCS datasets with subspace anomalies. This is not the case for OCRF, which is the most robust offline detector against increasing data dimensionality. OCRF actually succeeds in obtaining better scores in higher dimensions (80-d, 100-d) compared to lower ones. This is due to the fact that OCRF relies on an informed choice of split features, which leads to more consistent splits across increasing dimensions. Also, by increasing the dimensions, the volume of the data goes up, which helps OCRF to perform more accurate splits due to its one-class Gini index splitting criterion.

Table 7 Mean and std. of training and update time (in seconds) of online detectors per window over all datasets: X-S is the fastest detector in terms of update time whereas RRCF is the slowest

We now turn our attention to the impact of subspace anomaly dimensionality on the effectiveness of detectors. Figure 14c illustrates the average MAP of online detectors across all HiCS datasets for subspace anomalies of 2-d, 3-d, 4-d and 5-d. In general, all online detectors exhibit a robust behavior against increasing subspace dimensionality. L-S achieves the highest scores on 3-d, 4-d and 5-d subspace anomalies with a performance drop in 2-d. LEAP, CPOD and X-S exhibit higher (M)AP scores for 2-d subspace anomalies, while having a relatively stable performance in higher dimensions. Surprisingly enough, the dedicated subspace anomaly detector RS-HASH does not excel in the HiCS datasets. RS-HASH achieves a better performance on 2-d and 5-d subspace anomalies, while having a lower performance on 3-d and 4-d. MCOD exhibits a better performance than tree-based detectors (HST/F, RRCF) due to the fact that it re-computes the actual (L2) distance between samples in sliding windows but also updates both the content and the number of micro-clusters (i.e., deleting old ones and inserting new ones).

Tree-based online detectors exhibit a similar performance across all anomaly subspaces. RRCF is slightly better than HST/F, as it updates the tree structure of its model with feature subsets that are more relevant to the subspaces of the anomalies contained in a window, instead of only the mass profiles of immutable feature partitions. Clearly, the activation of a forgetting mechanism allows HSTF to better capture 2-d and 3-d subspace anomalies than HST. However, both face difficulties in finding 4-d and 5-d subspace anomalies. Surprisingly enough, the performance of RRCF improves for 4-d subspace anomalies (in HiCS20). The effectiveness of MCOD, HST/F, RRCF and LODA remains robust against increasing subspace dimensionality. As we can see from Fig. 14d, offline detectors exhibit a similar behavior, with ups and downs due to the non-deterministic nature of all detectors besides KNN.

7 Efficiency of detectors

In this section we turn our attention to the efficiency of online detectors. We focus on measuring the CPU time of detectors, while their memory footprint is left as future work. Execution times are measured directly in the native language of the respective implementation (C++ for XSTREAM, Matlab for LODA, Python for RS-HASH and Java for the rest). Compared to the analytical complexities reported in Table 1, the detectors’ runtime depends not only on the execution speed of the programming language used, but also on the optimizations implemented at code level, as well as on the values of the hyper-parameters w.r.t. the data distribution in consecutive windows. The actual runtime of all detectors on real datasets lies between two extremes: X-S as the fastest and RRCF as the slowest detector. For this reason, we rely more on the ranking of the detectors’ efficiency across all datasets of our testbed than on their absolute runtimes per dataset. This is particularly useful when studying the trade-off between the effectiveness and efficiency of the online detectors {X-S, L-S, HST, HSTF, RRCF, MCOD, RS-HASH, STARE, LEAP, CPOD} on real datasets (see Sect. 3.1.1).

Table 7 depicts the mean runtime required for training and updating the model of each detector across all datasets of our testbed. As training time, we report the total time for model construction, while as update time, we report the average time per window for model update and anomaly detection. To provide a common ground for comparison, the window size is the same for all detectors running on the same dataset. This choice is justified by the results of a preliminary experiment on a subset of the datasets, indicating that a window size of 128 seems to be optimal for most detectors besides HSTF and MCOD (see Fig. 16 in Appendix 2). Only the number of windows varies per dataset. Datasets on the x-axis are ranked according to their dimensionality.

The runtime required to construct the initial model of HST/F is one order of magnitude smaller than that of RRCF, mostly in high-dimensional datasets. This is due to the fact that HST/F uses a maximum depth (hyper-parameter) to construct small (perfect) binary trees instead of large (full) binary trees, as in the case of RRCF. Similarly, the model updates of HST/F cost one order of magnitude less than those of RRCF thanks to the constant cost of counter updates in mass profiles, as opposed to the tree re-structuring of RRCF. HSTF’s runtime is slightly greater than HST’s due to the counter decrements for forgetting old samples.

X-S exhibits a stable training time as the dimensionality increases thanks to its sparse random projections, so models are trained on a similar number of projections across all datasets. X-S is also the fastest detector w.r.t. the update time, as its updates have constant complexity (see Table 1) and its implementation benefits from several code-level optimizations. As expected, the two optimized variants of MCOD, LEAP and CPOD, exhibit faster update times (up to \(30\%\) faster) compared to MCOD. Despite the fact that LEAP has quadratic time complexity in the worst case, it does not update micro-clusters (MCOD) or core points (CPOD) when the window slides; in combination with the minimal probing that is frequently activated, LEAP achieves a faster execution time. RS-HASH and STARE are slightly slower, with similar update times. RS-HASH runs on Python which is generally slower compared to Java. STARE’s update times are highly dependent on the values of its hyper-parameters. Furthermore, we observe that HST/F and RRCF run slower on the {Forestcover, http, InternetAds} datasets. This is due to the fact that the maximum height of their binary trees is proportional to the volume of the dataset (\(volume = \#samples * \#features\)). In fact, the higher the volume, the higher the tree depth and therefore the more CPU time is needed to construct and update the trees. In this respect, L-S relies on one of the simplest and fastest update procedures: it creates k one-dimensional histograms using sparse random projectionsFootnote 24 and each histogram is then updated by projecting the training sample onto a vector and updating the corresponding histogram bin. However, L-S’s runtime is penalized by the execution speed of the Matlab code.

Figure 15 illustrates the Pareto frontier capturing the trade-off between the update time (per window) and the MAP of the ten online detectors {X-S, L-S, HST, HSTF, RRCF, RS-HASH, STARE, LEAP, CPOD, MCOD}. A sample belonging to the Pareto set indicates that, in the specific dataset the sample represents, the respective detector is not dominated by any other detector, i.e., no other detector is both faster and more effective on that dataset. Note that not all datasets necessarily contribute a sample to the Pareto set.
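For completeness, the sketch below shows how such a Pareto set can be computed; the (update time, MAP) triples are hypothetical and only illustrate the dominance test (lower update time and higher MAP are better):

```python
def pareto_frontier(points):
    """points: list of (update_time, map_score, label).
    A point is Pareto-optimal if no other point is at least as good in both criteria
    and strictly better in at least one of them."""
    frontier = []
    for t, m, label in points:
        dominated = any(
            (t2 <= t and m2 >= m) and (t2 < t or m2 > m)
            for t2, m2, _ in points
        )
        if not dominated:
            frontier.append((t, m, label))
    return frontier

# Hypothetical (update time in s, MAP, detector@dataset) triples, for illustration only
pts = [(0.02, 0.71, "X-S@ann-thyroid"), (0.15, 0.65, "HST@magic-telescope"),
       (0.40, 0.73, "MCOD@wilt"), (0.05, 0.40, "LEAP@smtp")]
print(pareto_frontier(pts))   # keeps the non-dominated (time, MAP) pairs
```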

Fig. 15
figure 15

Pareto frontier of MAP and Update time (per window) for each detector per dataset: X-S dominates in most datasets in terms of both effectiveness and efficiency

We can easily observe that 5 out of the 9 samples of the Pareto set belong to X-S. Moreover, 5 more samples related to X-S lie very close to the Pareto frontier. This fact confirms the leading performance of X-S in terms of both effectiveness and efficiency. The remaining 4 samples of the Pareto set belong to STARE, CPOD, LEAP and HST. LEAP and CPOD dominate in the dataset with the highest anomaly ratio (Electricity) compared to other detectors with low update times. Note that one of X-S’s samples also concerns the Electricity dataset, as it is the fastest detector with a comparable effectiveness w.r.t. the others in this dataset. LEAP dominates in SMTP in terms of efficiency, although its effectiveness is the worst of all. Last but not least, HST dominates in magic-telescope, achieving the highest MAP score w.r.t. detectors with comparable update times.

8 Meta-learning of XSTREAM’s performance

The experiments in the previous sections showed that X-S is the most effective detector w.r.t. both the AUC and MAP metrics, while its implementation exhibits the fastest update time. For this reason, we investigate which of the datasets’ meta-features discussed in Sect. 3.1.4 are statistically correlated (using Spearman correlation) with the drop or increase of X-S’s efficacy relative to the rest of the online anomaly detectors. Our meta-learning analysis sheds light on the dataset characteristics that boost X-S’s effectiveness or that make detectors exhibit a similar performance.
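The following Python sketch outlines this meta-level analysis (the column names and the input frames are our own assumptions; the actual analysis scripts are available in the repository):

```python
import pandas as pd
from scipy.stats import spearmanr

def correlate_meta_features(meta_df, perf_df, alpha=0.05):
    """meta_df: datasets x meta-features; perf_df: columns 'xs' and 'second_best' holding the
    AUC (or MAP) of X-S and of the runner-up online detector per dataset (assumed names)."""
    ratio = perf_df["xs"] / perf_df["second_best"]
    rows = []
    for mf in meta_df.columns:
        rho, p = spearmanr(meta_df[mf], ratio)
        if p < alpha:                       # keep only statistically significant correlations
            rows.append({
                "meta_feature": mf, "rho_ratio": rho, "p_value": p,
                "rho_xs": spearmanr(meta_df[mf], perf_df["xs"])[0],
                "rho_2nd": spearmanr(meta_df[mf], perf_df["second_best"])[0],
            })
    return pd.DataFrame(rows)
```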

Table 8 Statistically significant correlations of meta-features with the performance of X-S and of the 2\(^{nd}\) best detector, as well as their ratio: The scale variance and non-normality of features, as well as the existence of subspace anomalies, allow X-S to outperform the rest of the detectors

Table 8 reports every meta-feature found to have a statistically significant correlation (i.e., at least \( p value <0.05\)) with the ratio between the best (i.e., X-S) and the second best online detector (or the best online detector in datasets where X-S loses) in terms of AUC and MAP. The first column of the table reports the statistically significant correlation values between the meta-features and the ratio between X-S and the second best online detector per dataset. The second and third columns report the correlations of these meta-features with the raw scores of X-S and of the second best online detector, respectively.

The number of features (G2) is the only general meta-feature with a negative ratio correlation. As the data dimensionality increases, the difference between X-S’s performance and that of the second best online detector tends to decrease. This correlation can be attributed to the fact that in high-dimensional datasets all online detectors score similarly to the Random Classifier. On the other hand, the performance ratio of X-S with the second best online detector is positively correlated with the number of samples (G1). Meta-features (F1-F14) related to the variance of the scale of feature values are also positively correlated with X-S’s performance ratio. As we can see in the second and third columns of Table 8, these meta-features are more correlated with the raw performance of X-S than with that of the second best detector. Hence, when the ratio increases, this is mainly due to an increase in X-S’s performance.

Meta-features such as F6 (SD Max Value), F8 (SD Median Value) and F7 (Mean MEAN Value) are also correlated with the relative performance of X-S. As discussed in Sect. 2.7, its scoring function aims to assess anomalousness across different granularities of feature values. As a result, the diversity of feature scales boosts X-S’s performance compared to the rest of the detectors. In particular, X-S proves to effectively maintain pairwise distances of samples even in datasets whose features exhibit significantly different value scales.
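
As an illustration of how scale-variance meta-features of this kind can be computed (the naming below is ours and may differ from the exact definitions of F6-F8), consider:

```python
import numpy as np

def scale_variance_meta_features(X: np.ndarray) -> dict:
    """X: (n_samples, n_features). Summarize how much the feature scales differ."""
    return {
        "sd_max_value": float(np.std(X.max(axis=0))),             # F6-like: std of per-feature maxima
        "mean_mean_value": float(np.mean(X.mean(axis=0))),        # F7-like: mean of per-feature means
        "sd_median_value": float(np.std(np.median(X, axis=0))),   # F8-like: std of per-feature medians
    }
```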

We observe a strong negative correlation of X-S with the Nr. of Statistical Outliers: X-S’s performance decreases w.r.t. the second best detector as the number of statistical outliers increases. This can be attributed to the fact that datasets with a higher number of statistical outliers are usually of high dimensionality, where, as explained previously, X-S’s performance drops.

The ratio between X-S and the second best performing detector is correlated with the meta-features capturing the existence of subspace anomalies (S1, S2, S3, S4, S5). There is no significant performance degradation of X-S in the presence of subspace anomalies, in contrast to the rest of the online detectors, which exhibit a higher correlation. This indicates that the density-estimation mechanism of X-S (and L-S) is more robust to subspace anomalies (on real datasets) than the mechanisms adopted by the other detectors.

Three meta-features are found to be correlated with the relative performance of X-S w.r.t. MAP (Mean/SD Kurtosis, SD Skewness). As the second best online detector is negatively correlated with those meta-features, we observe a negative correlation with X-S’s performance ratio. Higher kurtosis and skewness values reveal denser clusters of samples. To avoid splitting those dense clusters into different bins, X-S relies on a random shifting mechanism; still, X-S’s performance decreases for higher kurtosis and skewness compared to the rest of the detectors.
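
The intuition behind the shifting mechanism can be illustrated as follows (a toy Python example, not X-S’s actual implementation): a dense cluster that straddles a fixed bin boundary is usually kept in a single bin once the grid is shifted by a random offset.

```python
import numpy as np

rng = np.random.default_rng(1)
cluster = rng.normal(loc=2.0, scale=0.01, size=1000)   # dense cluster straddling the boundary at 2.0
bin_width = 1.0
shift = rng.uniform(0.0, bin_width)                    # random offset of the bin grid

fixed_bins = np.floor(cluster / bin_width).astype(int)              # grid anchored at 0, 1, 2, ...
shifted_bins = np.floor((cluster + shift) / bin_width).astype(int)  # randomly shifted grid

# The fixed grid typically splits the cluster into two bins; the shifted grid usually keeps it in one
print(len(np.unique(fixed_bins)), len(np.unique(shifted_bins)))
```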

The pairwise correlations of X-S with the online detectors are given in Table 17 of Appendix 4. The relative performance of X-S against the two variations of MCOD (CPOD and LEAP) and RS-HASH is positively correlated with the meta-features capturing the variance of features’ values (F1-F14). Clearly, feature diversity positively affects X-S’s performance. Furthermore, the relative performance of X-S improves as the number of samples increases compared to RS-HASH and STARE. On the other hand, X-S under-performs on datasets of high dimensionality and anomaly ratio compared to those density-based detectors. Recall that in high-dimensional datasets the majority of detectors exhibit a random behavior. No correlations are observed between X-S and HST, HSTF, as well as RRCF.

9 Lessons learned

In this section we report the main conclusions drawn from the experiments in our work.

The majority of online and offline detectors classify samples in more than half of the datasets with a performance comparable to the Random Classifier. As expected (see Sect. 4.1), detectors typically exhibit a good performance when the abnormal and normal classes are well separated (high AND values), even in datasets of high dimensionality (\(>100\)). However, this is not always the case in real datasets contaminated with implanted anomalies. Unlike distance/KNN and density-based detectors that calculate sample distances over all features, stochastic detectors are favored by the presence of many correlated feature pairs (i.e., multicollinearity), as they consider fewer, more informative features w.r.t. the anomalous class. Unfortunately, the random projections of XSTREAM and LODA become less effective in most datasets of high dimensionality (\(\ge 257\), see Table 16), where both detectors exhibit a performance close to the random classifier.
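
For reference, a simple way to quantify such multicollinearity as a meta-feature (the 0.9 threshold is our own choice, not necessarily the one used in the study) is the fraction of feature pairs with a high absolute Pearson correlation:

```python
import numpy as np

def correlated_pair_ratio(X: np.ndarray, threshold: float = 0.9) -> float:
    """Fraction of feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = np.corrcoef(X, rowvar=False)         # (d, d) correlation matrix over the features
    iu = np.triu_indices(corr.shape[0], k=1)    # upper triangle: each pair counted once
    return float(np.mean(np.abs(corr[iu]) > threshold))
```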

Online detectors approximate well, and in some cases outperform, their offline counterparts. As shown in Sect. 4.3, a distance-based detector like MCOD is ranked very close to KNN and LOF, while a tree-based detector like HST approximates well the performance of IF, dominating the other member of this algorithmic family, namely OCRF. Clearly, offline distance/KNN and tree-based detectors are favored by feeding the data into a single large window; KNN and LOF have more information at their disposal regarding the neighborhood of samples, while IF is able to better profile normal samples and thus mitigate the effect of anomaly swamping [43]. Finally, the online version of XSTREAM not only outperforms all other online detectors but is also more effective than its offline version.

Tunable and gradual forgetting mechanisms proved to be more effective for range-based anomalies spanning several time-frames. As shown in Sect. 3.1.3, only HST/F, RRCF and RS-HASH exhibit a performance better than the random classifier. RRCF and HSTF have a tunable forgetting mechanism and RS-HASH a gradual one. These mechanisms are essential for detecting anomalies over longer ranges, in contrast to global forgetting mechanisms such as XSTREAM’s, i.e., resetting all bin counts to zero when the window slides. This is due to the fact that range-based anomalies are clustered; global forgetting therefore makes the algorithm rapidly adapt to the new data distribution of the anomalies, assigning them scores close to those of normal samples.
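
A minimal sketch of the two forgetting styles discussed above (our own simplification; the gradual mechanism is modelled here as an exponential decay of counts, which is an assumption rather than RS-HASH’s exact procedure) contrasts a global reset of bin counts with a tunable decay:

```python
def global_forgetting(bins: dict) -> None:
    """XSTREAM-style global reset: drop all history when the window slides."""
    for b in bins:
        bins[b] = 0

def gradual_forgetting(bins: dict, decay: float = 0.9) -> None:
    """Gradual, tunable forgetting: old counts fade instead of being discarded at once."""
    for b in bins:
        bins[b] *= decay
```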

Online anomaly detectors exhibit poor performance on the identification of subspace anomalies. As shown in Sect. 6, the majority of online detectors are robust against an increasing subspace or data dimensionality. The best performing strategy was the random projection performed by L-S. Unlike fullspace anomalies contained in real datasets, anomalies hidden in feature subspaces proved to be a great challenge for all online detectors, as the highest MAP value was 0.16. This observation highlights the need for online detectors that are more effective at identifying subspace anomalies.

XSTREAM exhibits the best trade-off between the efficiency and the effectiveness of continuously updated detection models. As shown in Sect. 7, XSTREAM proves to be both faster w.r.t. update time and more effective in the majority of the datasets. The different trade-offs achieved by online detectors can be explained by how well their model (bins or histograms over random projections, mass profiles in trees, micro-clusters, core points, etc.) summarizes the data distribution of a stream, as well as by how effective the incremental model updates are across successive windows. Finally, we should stress that LODA and RS-HASH share the same updating principles as XSTREAM, and the difference in their absolute runtimes is mostly related to the programming language and the code-level optimizations. On the other hand, detectors such as LEAP, CPOD, MCOD and STARE seem to run under best-case conditions on the datasets of our testbed, which lets them bypass costly neighbor searches by applying the minimal probing principle (LEAP), exploring only cluster centers (MCOD) or cores (CPOD), or skipping stationary regions (STARE).

The non-normality and scale variance of data features favor XSTREAM over the rest of the detectors on real datasets contaminated with point anomalies. As shown in Sect. 8, when the scale of the feature values exhibits a high dispersion (e.g., a high std. of mean values), the difference in AUC performance between XSTREAM and the second best detector is amplified (positive ratio correlation) in favor of XSTREAM (higher raw correlation coefficient than the second best detector). In case of features exhibiting non-normal patterns, such as high skewness and kurtosis, XSTREAM’s MAP performance decreases (negative raw correlation) but is less affected than that of the second best detector. Both distributional characteristics imply the existence of dense clusters, concentrated either on one side of the distribution or around the peak in case of a leptokurtic distribution. It is important that these dense clusters are not separated by the detectors’ partitioning mechanism. XSTREAM handles this issue more effectively than other detectors by employing a shifting mechanism that strives to assign clustered samples to the same bins.

10 Summary and future work

In this work, we conducted a large-scale experiment to address several open questions in the literature. We compared the effectiveness and efficiency of 6 online anomaly detectors along with 5 offline counterparts over 24 real datasets and 1 synthetic dataset with 20 variations. None of the previous empirical studies [3, 4, 15, 24, 28, 47, 62] compared the performance of online detectors over a fairly complete collection of the datasets used to evaluate the original algorithms. We should note that [63] focuses exclusively on the efficiency of online proximity-based algorithms without contrasting the effectiveness of algorithms belonging to different families (proximity- vs tree- vs projection-based). Unlike the aforementioned works, the variations of the synthetic dataset used in our work make it possible to evaluate the behavior of detectors on anomalies visible exclusively in subsets of features.

To fairly compare online and offline algorithms, we relied on both the ROC AUC and MAP evaluation metrics, which highlight different aspects of their underlying anomalousness scoring functions. In this respect, we used an evaluation protocol inspired by time series, namely Forward Chain Cross Validation [9]. Rather than using the default hyper-parameters provided by the original algorithms, as in the majority of the empirical studies, we searched for optimal hyper-parameter values per dataset. To reveal valuable insights regarding detectors’ behavior, we followed a principled methodology to analyze the raw results of our experiments. In particular, we performed a meta-learning analysis of different performance aspects of the detectors (e.g., randomness of decisions, trend of outperforming behavior) w.r.t. concrete characteristics of the datasets (e.g., ratio of correlated features, feature normality, value range dispersion).
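
For clarity, the following sketch shows the generic forward-chaining protocol (not the authors’ exact splitting code): each fold trains on all samples seen so far and evaluates on the next block, so the temporal order of the stream is never violated.

```python
def forward_chain_splits(n_samples: int, n_folds: int):
    """Yield (train_indices, test_indices) pairs over an ordered stream."""
    fold = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = list(range(0, k * fold))              # everything seen up to this fold
        test = list(range(k * fold, (k + 1) * fold))  # the next block of the stream
        yield train, test

# Usage: 4 folds over 1000 time-ordered samples
for train, test in forward_chain_splits(1000, 4):
    print(len(train), len(test))   # 200/200, 400/200, 600/200, 800/200
```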

As future work, we are interested in contrasting the performance of the ’shallow’ detection methods studied in this work with deep learning algorithms [48], especially for high-dimensional datasets that favor a random behavior of detectors. Moreover, we would like to investigate how the behavior of detectors is affected by the anomaly generation method, i.e., whether anomalies are clustered or scattered across normal samples in the datasets.