1 Introduction

Abnormal data might be more interesting to study than the prevalent patterns [1, 18, 76]. Data anomalies, e.g., among measurements or observations, may simply represent errors (called outliers), but they can also indicate interesting phenomena (called novelties), such as new incidents or faults of a system, intrusions into a computer network, fraud in credit card transactions or even over-expressed genes of living things.Footnote 1

Recently, the real-time detection of anomalies in streaming data has gained increasing attention [7, 55] as it allows raising alerts, predicting faults, and detecting intrusions and threats across industries. However, analyzing a sequence of samplesFootnote 2 arriving over time imposes unique constraints and challenges on machine learning models. In contrast to batch detectors, the full dataset is not available in advance and online detectors must learn incrementally as samples arrive [17]. In that respect, the order rather than the timestamp of the samples in a data stream is an important feature that models should take into account [5, 41]. Additional constraints on the detection setting may be imposed in practice. For instance, the high velocity of streams leaves little opportunity for labeling samples by experts. Also, a thorough tuning of several hyper-parametersFootnote 3 is challenging, especially when data characteristics evolve over time [30]. For these reasons, in this paper we focus on unsupervised methods for detecting point anomalies and leave out of the scope of our study (semi-)supervised methods for shallow [11] or deep anomaly detection [48]. In this respect, the detection of range-based (i.e., subsequence or collective) anomalies based on temporal dependencies rather than independent abnormal samples [12, 19, 21, 33, 61] is left as future work.

Existing unsupervised methods for detecting anomalies in multivariate datasets bypass the need for labeled samples by exploiting different anomalousness criteria [1, 18, 36, 68, 76, 78] based on the divergence of statistical distributions, distance thresholds, density variance from nearest neighbors, ease of isolation, etc. Indeed, recent online detection methods belong to different algorithmic families: (i) proximity-based detectors, either distance-threshold-based like MCOD [40] and CPOD [64] or nearest-neighbor-based like LEAP [16], inspired by \(\hbox {KNN}_W\) [51]; (ii) density-based detectors, either on the full feature space like STARE [72] and their offline counterpart LOF [14] or on feature subspaces like RS-Hash [57], tracing its roots back to HICS [38]; (iii) tree-based detectors such as HST [60] and RRCF [32], inspired by IF [43] or OCRF [29]; and (iv) projection-based detectors such as LODA [49] or XSTREAM [45], featuring both online and offline versions.

Anomaly detection has been an active area of research over the past decades. Besides well-known surveys on anomaly detection approaches and methods [1, 18, 19, 21, 33, 36, 68, 76], various empirical studies have experimentally evaluated the effectiveness or the efficiency of detectors [3, 4, 15, 20, 24, 28, 47, 62]. However, the majority of the aforementioned works focus on offline detectors. Online anomaly detection was listed in the future perspectives of the overview presented in [19], while [33] includes only the first steps in making density-based detectors, like LOF [14], incremental. Regarding empirical studies, [63] focuses exclusively on the efficiency of online proximity-based detectors while [20] compares stream clustering algorithms [58] for anomaly detection. To the best of our knowledge, no previous work has compared, qualitatively or quantitatively, over the same multi-dimensional datasets, distance-based (MCOD, CPOD), KNN-based (LEAP, \(\hbox {KNN}_W\)) and density-based detectors (STARE, RS-Hash, LOF) with tree-based (HST/F, RRCF, IF, OCRF) and projection-based detectors (XSTREAM, LODA). Additionally, reported experiments seldom examine the tension between the effectiveness and the efficiency of the detection algorithms. To this end, we optimally tune the hyper-parameters of detectors per dataset rather than rely on the default configurations recommended by their inventors. Last, previous meta-learning analyses of detectors [66, 75] do not consider meta-features related to anomalies visible only in a subset of the dataset feature space.

In this context, several questions remain unanswered regarding the performance of online anomaly detectors over data streams. First, previous studies do not assess the reliability of detectors’ effectiveness against a random classifier, nor do they highlight the dataset characteristics (e.g., number of features) that make them perform randomly. Second, they do not indicate when online detectors can approximate the effectiveness of offline detectors and under which conditions (e.g., number of features irrelevant to the anomalies, anomaly ratio). Third, they do not indicate which sketch strategies and update primitives of detectors (e.g., micro-clusters, random trees, histogram or chain-based density estimators) are best suited to detect anomalies that are only visible within a feature subspace of a dataset. Fourth, they do not analyze the trade-offs between the effectiveness and the efficiency of detectors belonging to different algorithmic families. Last, they do not highlight the characteristics of datasets that make an online algorithm capable of outperforming all others, for instance, the statistically significant correlations between the relative performance of the best performing detectors and meta-features such as the number of samples or features in a dataset, the skewness of feature values, or the distance between the clusters of abnormal and normal samples. In summary, we make the following contributions:

  • Large Selection of Online Detectors from Different Algorithmic Families: In Sect. 2, we introduce the nine online anomaly detectors included in our testbed, namely, MCOD [40], CPOD [64], LEAP [16], STARE [72], RS-Hash [57], HST [60], RRCF [32], LODA [49] and XSTREAM [45] (for their offline counterparts, readers are referred to Appendix 1). We detail their scoring functions, model creation and update primitives, as well as the involved analytical complexities. To ensure a common ground of comparison, we implemented a variation of HST with a forgetting mechanism similar to RRCF, called HSTF, as well as a continuous scoring function for MCOD, CPOD and LEAP instead of their binary outcome.

  • Fair Evaluation Environment for Online and Offline Anomaly Detectors over Multivariate Data: In Sect. 3, we describe the characteristics of abnormal and normal samples (e.g., anomaly ratio, dimensionality) in the twenty-four real and the five synthetic datasets included in our testbed that are widely used in previous empirical studies [15, 24, 25, 63, 68]. We additionally consider in Sect. 5 the recently proposed benchmark Exathlon [37] for explainable anomaly detection over repeated executions of two different Spark streaming applications, containing five different types of anomalies. To fairly compare the performance of detectors, we consider as evaluation metrics both the Area Under the ROC Curve (AUC) and the Average Precision (AP) and explain their differences under edge cases. These metrics are computed for each algorithm under optimal evaluation conditions per dataset (for optimal hyper-parameter values per dataset and the sensitivity of algorithms to tuning, readers are referred to Appendix 2).

  • Thorough Evaluation of Detectors’ Effectiveness: In Sect. 4, we analyze the AUC and Mean AP of detectors over the 24 real datasets of our testbed (details are given in Appendix 3) in order to reveal interesting patterns involving specific meta-features (i.e., Number of Samples/Features, Anomaly to Normal distance). In particular, we assess the reliability of the decisions made by the detectors w.r.t. a random classifier, and rank in a statistically significant way online and offline detectors according to their performance. Our analysis sheds light on how well online detectors approximate the performance of offline detectors.

  • Robustness of Detectors Against Increasing Dimensionality: In Sect. 6, we assess the robustness of online and offline detectors against increasing data and anomaly subspace dimensionality using twenty synthetic datasets. Specifically, we investigate whether a particular algorithmic family (e.g., proximity, tree or projection-based) is able to discover anomalies that are only visible in a small subset of features (i.e., a subspace).

  • Efficiency of Detectors and Trade-offs: In Sect. 7, we report the execution time of training and updating detectors’ models over the 24 real datasets of our testbed. We investigate the trade-off between the update time and the effectiveness of the 9 online detectors contrasted to the best overall performing detector.

  • Meta-learning Analysis of Leading Detectors: In Sect. 8, we investigate which of the meta-features are statistically correlated with the relative effectiveness of the best performing detector (detailed results per meta-feature are given in Appendix 4). This meta-analysis helps us to assess whether the best overall performing detector will also excel in a given dataset.

Finally, conclusions and future work are discussed in Sect. 10. Detailed experimental results, as well as all employed hyper-parameter values are given in Appendices 2, 3 and 4.

2 Detection Algorithms

In this section, we introduce the online anomaly detection algorithms considered in our experimental evaluation, which belong to the distance-based, KNN-based, density-based, tree-based or projection-based algorithmic families. The offline counterparts of these detectors are described in Appendix 1.

In contrast to offline detectors that model and score samples in one batch, online detectors continuously update their models for incrementally detecting anomalies in successive windows of samples. The total number of samples in a window is the window size, while the number of samples by which a window is shifted over the data stream is the window slide. Windows that overlap as they slide over a data stream are called sliding; otherwise, they are called tumbling (i.e., window size = window slide), as illustrated in Fig. 1.
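For illustration, the following minimal Python sketch generates count-based sliding and tumbling windows from a stream; the function and variable names are ours and not taken from any of the evaluated implementations:

```python
from typing import Any, Iterable, Iterator, List

def windows(stream: Iterable[Any], size: int, slide: int) -> Iterator[List[Any]]:
    """Yield count-based windows: sliding if slide < size, tumbling if slide == size."""
    buffer: List[Any] = []
    for sample in stream:
        buffer.append(sample)
        if len(buffer) == size:
            yield list(buffer)
            buffer = buffer[slide:]   # drop the oldest `slide` samples; the rest overlap

# Window size 4 with slide 2 (sliding) vs. slide 4 (tumbling).
stream = list(range(1, 11))
print(list(windows(stream, size=4, slide=2)))  # overlapping windows
print(list(windows(stream, size=4, slide=4)))  # disjoint windows
```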

Our testbed includes state-of-the-art online anomaly detection algorithms with publicly available implementations, as reported in the literature. CPOD [64] was included because it outperforms other distance-based detectors (such as MCOD [40], which is considered to be among the best detectors) in terms of runtime and memory usage.Footnote 4 HST [60] exhibits the best efficiency among tree-based detectors, while RRCF [32] is the most effective detector of the same family. We additionally included two projection-based detectors: LODA [49], exhibiting a very low runtime and memory footprint, and XSTREAM [45], outperforming other detectors on high-dimensional datasets in terms of effectiveness. LEAP [16] has been shown to be three orders of magnitude faster than state-of-the-art KNN methods. STARE [72] outperforms other popular density-based detectors ([46, 50]) in terms of execution time, achieving comparable or higher accuracy. Finally, RS-Hash [57] has been shown to be more efficient and effective than other subspace detectors like HiCS [38]. To the best of our knowledge, no previous study has compared both the effectiveness and efficiency of proximity-based, tree-based and projection-based detectors under stream and batch processing settings. In the next sub-sections, we provide the main operations and primitives of each detection algorithm, both theoretically and through a running example, and detail their training, update, forgetting and anomaly reporting procedures.

2.1 Micro cluster outlier detection (MCOD)

MCOD is a distance-based detector [40] that models neighboring regions of samples in a stream as micro-clusters (MC). MCOD requires tuning three hyper-parameters: a distance threshold R, a neighbor count threshold K and a distance metric \( dist (\cdot , \cdot )\) (e.g., Euclidean). Given a dataset D, MCOD identifies a sample p as a distance-based anomaly if it has fewer than K neighbors within distance R; otherwise, p is considered normal.

MCOD builds a set of micro-clusters (MC) to assess the normality of samples in every window. An MC is composed of at least K + 1 samples and is centered on one sample. A sample belongs to at most one MC and the distance of any sample from the center of its MC is at most R/2. According to the triangle inequality in the metric space, the distance between every pair of samples in an MC is smaller than R. Therefore, every sample in an MC is considered normal. Samples that cannot be clustered, i.e., potential anomalies, are inserted into a list called PD. The contents of PD are processed as new samples in a window arrive and they can either be normal, if \( R \text {-Neigh} (p) \ge K\) (see Eq. 2), or abnormal otherwise. \( R \text {-Neigh} (p) \) denotes the number of neighbors of a sample p within radius R. In the following, we list the building blocks of MCOD:

Training Phase MCOD does not have a training phase since its model operates over a single window by computing pairwise distances.

Model Update A new sample p can either: (i) be inserted into its nearest MC (see Eq. 1) if the distance from the center (\( mcc \)) of that MC is \(\le \) R/2; or (ii) form a new MC if it has at least K neighbors in the PD list within a distance \(\le R/2\); or (iii) be inserted into PD.

Forgetting Mechanism MCOD forgets all samples that have been processed in the current window before processing the next window. A forgotten sample can: (i) dissolve an MC if the MC is left with fewer than K + 1 samples; or (ii) be removed from the PD list.

Anomaly Report After processing new and forgotten samples, every sample with fewer than K neighbors is reported as an anomaly.

$$\begin{aligned}{} & {} mcc _{p}^{\star } = {\mathop {{{\,\textrm{argmin}\,}}}\limits _{ mcc }}\, dist(p, mcc) . \end{aligned}$$
(1)
$$\begin{aligned}{} & {} R \text {-Neigh} (p) = \vert \{p' \vert dist(p, p ') < R \}\vert \end{aligned}$$
(2)
Fig. 1 Example of two sliding (left) and tumbling (right) windows in the time-varying feature space

Fig. 2 MCOD running on the sliding windows of Fig. 1a in the feature space with neighbor count threshold \(K=3\) and radius threshold R

A running example inspired by the MCOD paper [40] is presented in Fig. 2 over the two sliding windows depicted in Fig. 1a. Samples have two features, namely {F1, F2}. Using the first window (Fig. 2a), MCOD builds a micro-cluster \( MC_1 \) containing the samples \(p_1, p_2, p_3, p_6\), which are considered normal. The PD list contains the samples that do not belong to \( MC_1 \): \(p_4, p_5, p_7\). Sample \(p_4\) is normal as it has \(K=3\) neighbors within distance R. Samples \(p_5, p_7\) are anomalies since they have fewer than \(K=3\) neighbors within distance R. Next, all samples of the first window are forgotten, so \( MC_1 \) is dissolved and PD is emptied. MCOD then processes the second window (Fig. 2b) and forms the micro-cluster \( MC_2 \). Observe that \(p_7\) is now a normal sample, while \(p_6\) is an anomaly because its preceding neighbors have been forgotten.

MCOD originally provides a binary outcome: the predicate \( MCOD_l \) depicted in Eq. 3 holds for normal samples and fails for abnormal ones. However, to homogenize the comparison of the outcome with other algorithms, we need a continuous scoring function of samples. We therefore use the function \( MCOD_s \) that gives a score s depicted in Eq. 4. Intuitively, the more a sample lies in a sparse region, i.e., it has few or no neighbors within distance R, or it is far away from its nearest MC, the higher it should be scored. For instance, \(p_7\) and \(p_3\) in Fig. 2a will, respectively, get the highest and lowest score. Also, \(p_5\) has a higher score than \(p_4\), which has a higher score than \(p_6\).

$$\begin{aligned}{} & {} MCOD_l(p) = dist(p, mcc_p^\star ) < \frac{R}{2} \vee \text {R-Neigh}(p) \ge K \nonumber \\ \end{aligned}$$
(3)
$$\begin{aligned}{} & {} MCOD_s(p) = \frac{1}{R\text {-Neigh}(p)} \cdot dist(p, mcc_p^\star ) \end{aligned}$$
(4)
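For illustration, a minimal Python sketch of the continuous scoring function of Eq. 4 is given below; it assumes that the micro-cluster centers and the contents of the current window are maintained elsewhere, and the guard against an empty neighborhood is our own addition:

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]

def r_neigh(p: Point, window: List[Point], R: float) -> int:
    """Eq. 2: number of neighbors of p within radius R in the current window."""
    return sum(1 for q in window if q is not p and math.dist(p, q) < R)

def mcod_score(p: Point, window: List[Point], mc_centers: List[Point], R: float) -> float:
    """Eq. 4: distance to the nearest micro-cluster center divided by the
    neighbor count; higher values indicate more anomalous samples."""
    nearest = min(math.dist(p, c) for c in mc_centers)   # dist(p, mcc_p*)
    neighbors = max(r_neigh(p, window, R), 1)            # assumption: avoid division by zero
    return nearest / neighbors

# A sample far from the only micro-cluster center and without close neighbors
# receives a high score.
window = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(mcod_score((5.0, 5.0), window, mc_centers=[(0.05, 0.05)], R=1.0))
```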

The main advantage of MCOD is that it effectively prunes pairwise distance computations, providing a more efficient neighbor search through the creation of micro-clusters. Every sample that remains in an MC when the window slides is considered normal without further checks. Thus, to classify a new sample it suffices to compute distances only w.r.t. the centers of the MCs and the samples in the PD list. On average, MCOD has time complexity \(\Theta ((1-c) w \log ((1-c) w)+K \log (K))\) and linear space complexity \(O(c w+(1-c) K w)\) [63], where c is the fraction of samples that belong to micro-clusters, K is the number of nearest neighbors and w is the window size. However, depending on the data characteristics, i.e., if the data are very sparse, no clusters may be formed, resulting in \(c=0\). Thus, in the worst case, MCOD has linearithmic time complexity \(O(w \log (w) + K \log (K))\). The hyper-parameters of MCOD require careful tuning (see Fig. 17 in Appendix 2). Specifically, a very low R and high K values on a dataset with several sparse areas may lead MCOD to create very few micro-clusters, identifying most samples as anomalies. The aforementioned behavior will also degrade the efficiency of MCOD as it must compute the distances with respect to more samples in the PD list, which is inefficient for neighbor searches. In the opposite case, with a very high R and low K values in a dataset having several dense areas, MCOD may identify most samples as normal and create a large number of micro-clusters.

2.2 Core point-based outlier detection (CPOD)

CPOD is a distance-based detector [64] that models neighboring regions of samples in a stream using core points. CPOD requires tuning two hyper-parameters: a distance threshold R and a neighbor count threshold K. Given a dataset D, CPOD identifies a sample p as a distance-based anomaly if it has fewer than K neighbors within distance R; otherwise, p is considered normal.

CPOD shares the same definition of anomalies as MCOD, but it optimizes the neighbor search by looking at the close neighborhood of a few special samples called core points rather than the neighborhood of the nearest micro-cluster center. The same technique also addresses a limitation of MCOD when every pair of points has a distance greater than R/2. In this degenerate case, no micro-cluster is formed, resulting in a quadratic neighbor search and poor efficiency, especially over streams. A core point is a special sample that supports multi-distance indexing as it stores its Euclidean distances to other samples in multiple ranges within each slide. It permits both quickly identifying normal samples and reducing the neighbor search space for anomaly candidates. A core point has the following two properties: (i) the distance between any pair of core points is greater than R; (ii) each sample \(p \in D\) is linked to at least one core point c, having distance less than R from c. Note that in contrast to the micro-cluster centers of MCOD, core points are not required to have at least K neighbors to be formed. Each core point c stores every neighbor sample p in a map E for different radius values \(k \in \{0,1,2,3\}\), ranging from R/2 to 2R:

\( E_k(c) = \{p \in D \vert ~ kR/2 < dist(c,p) \le (k+1)R/2\}.\) To calculate the neighbors of a sample p within distance R, CPOD leverages the distance from its corresponding core point(s) to reduce the computations, matching exactly one of the four cases below, where \(c^\star \) is the closest core point to p in the core set C:

  1. \(\bigcup _{k = 0,1,2}E_k(c^\star )\), if \( dist (c^\star ,p) \le R/2\);
  2. \(\bigcup _{k = 0,1,2,3}E_k(c^\star )\), if \(R/2 < dist (c^\star ,p) \le R\);
  3. \(\bigcup _{c_i \in C}\bigcup _{k = 0,1}E_k(c_i)\), if \(R < dist (c^\star , p) \le 2R\);
  4. \(\emptyset \), if \(2R < dist(c_i, p), \forall c_i \in C\).

From the aforementioned cases, exactly one can hold for a particular sample p. To better grasp the neighbor search procedure, we explain the first case, where \( dist (c^\star ,p) \le R/2\). CPOD operates via a prefilter approach, called minimal probing. First, it automatically considers the samples within R/2 (\(k=0\) in \(E_k\)) as neighbors of p due to the triangle inequality. If the neighbors within R/2 are fewer than the neighbor count threshold, then k is increased to \(k=1\) and the search is expanded to the range (R/2, R]. If the threshold is satisfied, the search stops and p is declared normal; otherwise, the search continues to higher ranges. The same logic is followed for the remaining cases. In the following, we report the building blocks of CPOD:

Training Phase CPOD does not have a training phase since its model computes the core points for every new slide, starting from the first window.

Model Update A new sample p can either: (i) be linked to exactly one core point, if \( dist (c^\star , p) \le R\); (ii) be linked to multiple core points, if \(R < dist (c^\star ,p) \le 2R\); (iii) form a new core point; or (iv) not be linked to any core point, if \(dist(c_i,p) > 2R, \forall c_i \in C\).

Forgetting Mechanism With a new slide, CPOD forgets all the expired samples, i.e., samples that are just removed from the current window, before processing the next window slide. The neighbor counts of the active samples, i.e., samples that remain in the current window, are decreased and the expired samples are removed from the E maps of their core points.

Anomaly Report CPOD spots samples as anomaly candidates if they have a distance \(\le R\) from their core points and fewer than K neighbors, expanding the neighbor search to higher ranges. Every sample with fewer than K neighbors after CPOD’s expanded search is reported as an anomaly.

We now present a running example of CPOD based on Fig. 2, which we used to explain MCOD. For this example with \(K=3\), we consider \(c_2, c_4\) to be core points in the first slide, having the same values as \(p_2\) and \(p_4\), respectively. Thus, the neighbors of each core point within distance [0, 2R] are \(E(c_2): \{p_2, p_1, p_6, p_4, p_5\}\), while \(E(c_4)\) contains all samples, where E denotes all the neighbors in the different ranges. Every sample in the solid circle falls under the first case of the neighbor search procedure presented above; it has distance \(\le R/2\) from \(c_2\) and will be immediately identified as normal, having 3 neighbors. Sample \(p_5\) also falls under the first case; since there are not enough neighbors within distance \(\le R/2\), it is considered an anomaly candidate and the search expands to higher radius ranges from its closest core point \(c_4\). Expanding the search to R, no new neighbors are found, so the search stops and \(p_5\) is reported as an anomaly. The sample \(p_7\) falls under the third case, having distance \(R < dist(c_4, p_7) \le 2R\) from its closest core point \(c_4\), so CPOD searches for neighbors within radius R of every core point, labeling \(p_7\) as an anomaly since only \(p_5\) is found as its neighbor. In the next slide, considering only \(c_{10}\) as a core point taking the values of \(p_{10}\), CPOD labels \(p_7\) as an anomaly (unlike MCOD). This is because \(R < dist(c_{10}, p_7) \le 2R\), thus the neighbor search is performed within radius R of each core point, resulting in samples \(p_{12}, p_8, p_9\) as possible neighbors and missing \(p_{11}\), for which \(dist(c_{10}, p_{11})>R\). From the explored samples, only \(p_8, p_9\) are neighbors of \(p_7\), so it is labeled as an anomaly. Moreover, \(p_6\) is also an anomaly since \( dist (c_{10}, p_6) >2R\), thus it has no neighbors within radius R. The remaining samples are labeled as normal.

CPOD originally provides a binary anomaly outcome. However, our evaluation metrics require an ordered outcome and thus we report the score of a sample p as the inverse of its neighbor count, \( CPOD_s(p) = 1 /( \vert N_{CPOD}(p) \vert + \epsilon )\). Note that higher scores denote greater anomalousness.

CPOD has linear time complexity \(O(N_c~w + N_f~N_r)\), where \(N_c\) is the total number of core points, w is the number of samples in a window, \(N_f\) is the number of anomaly candidates and \(N_r\) is the neighbor search time for these candidates. Note that in a very sparse dataset with many isolated samples, i.e., where many samples have distance \(> R\) and \(\le 2R\) from every core point, the neighbor search will probe each core point, searching within its R/2 and R radii. This could result in an overhead if many core points have been formed. Therefore, R and K should be carefully selected.
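For illustration, a minimal Python sketch of CPOD's minimal probing for the first case (\( dist (c^\star ,p) \le R/2\)) is given below; it assumes the per-core range map \(E_k\) is maintained elsewhere, and all names are ours:

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, ...]

def ring_index(d: float, R: float) -> int:
    """Ring k such that kR/2 < d <= (k+1)R/2, capped at k = 3 (i.e., 2R)."""
    return min(3, max(0, math.ceil(d / (R / 2)) - 1))

def build_range_map(core: Point, samples: List[Point], R: float) -> Dict[int, List[Point]]:
    """E_k(c): the samples within 2R of a core point, grouped by distance ring."""
    E: Dict[int, List[Point]] = {0: [], 1: [], 2: [], 3: []}
    for p in samples:
        d = math.dist(core, p)
        if 0 < d <= 2 * R:
            E[ring_index(d, R)].append(p)
    return E

def probe_neighbors(p: Point, E: Dict[int, List[Point]], R: float, K: int) -> int:
    """Minimal probing when dist(core, p) <= R/2: expand the search ring by
    ring and stop as soon as K neighbors within distance R are confirmed."""
    count = 0
    for k in (0, 1, 2):                  # rings beyond 3R/2 cannot contain R-neighbors of p
        for q in E[k]:
            if q is not p and math.dist(p, q) <= R:
                count += 1
                if count >= K:
                    return count         # p is normal; stop early
    return count                         # fewer than K neighbors: anomaly candidate
```

The per-ring grouping is what allows the search to stop before examining all samples linked to a core point.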

2.3 Lifespan-aware probing (LEAP)

LEAP is a distance-based detector [16] that encompasses two different anomaly semantics, namely distance-threshold-based and nearest-neighbor-based.Footnote 5 LEAP requires tuning two hyper-parameters: a distance threshold R and a neighbor count threshold K. Given a dataset D, LEAP identifies a sample p as a distance-based anomaly if it has fewer than K neighbors within distance R; otherwise, p is considered normal.

LEAP shares the same definition of anomalies as MCOD, but it mitigates expensive range queries with adequate indexing: rather than storing all samples of a window in the same index structure, a separate smaller index is maintained for each slide. For a given window, a slide index \(s_i\) references a sample that appears in sliding window i and not in sliding window \(i+1\). For instance, for a window of size \(w=9\) and slide \(s=3\), three index structures are created because three sliding windows cover the samples of that window. LEAP maintains an evidence list evi for every sample p: \(p.evi = \bigcup _{i=t,..t-s} \{q \in s_i \vert dist(p, q) \le R\} \), containing the neighbors of p in each slide in reverse chronological order, starting from the newest slide \(s_t\) down to \(s_{t-s}\), where s is the slide size. LEAP adopts the minimal probing principle, i.e., if at least K neighbors have been found by the current slide, the search stops. LEAP also maintains a trigger list \(tr = \{q \in w \vert q.evi \setminus s_{t-s} \ne q.evi\}\), containing the samples that are going to lose neighbors in the next window slide. Subsequently, we report the building blocks of LEAP:

Training Phase LEAP does not have a training phase since it performs the neighbor search for every new slide, starting from the first window.

Model Update A new sample p leads to re-probing the neighbors of each sample in the tr list, potentially allowing samples to be removed from it.

Forgetting Mechanism When a slide expires: (i) its index is discarded; (ii) the expired neighbors in each p.evi list are removed; and (iii) the samples in the tr list are re-evaluated.

Anomaly Report For a sample p, the newest samples are examined first and, in case the neighbors found are fewer than K, the probing continues. If p has fewer than K neighbors after every slide of the window has been probed, p is reported as an anomaly.

We now present a running example of LEAP based on Fig. 2, which we used to explain MCOD. For this example with \(K = 3\), \(w=7\) and slide \(s=5\), LEAP will create two indexes \(s_1\) and \(s_2\), respectively indexing the samples \(p_1, \ldots p_5\) and the samples \(p_6, p_7\). LEAP will build an evidence list for each sample. For \(p_1\), it will start searching for neighbors in \(s_2\) and then in \(s_1\), resulting in \(p_1.evi = \{p_6, p_2, p_3\}\). For the first window, \(p_7\) and \(p_5\) are reported as anomalies and the remaining samples as normal. The samples belonging to the trigger list in the first window are \(tr = \{p_6, p_7\}\) as they will lose neighbors in the next slide. When the window slides, the samples in the trigger list will be evaluated first. Now, \(p_6\) is labeled as an anomaly while \(p_7\) as normal, considering its succeeding neighbors. For this window, each sample inside the solid circle is considered normal and the anomalies are the samples \(p_6, p_{10}, p_{12}\).

LEAP originally provides a binary anomaly outcome. However, our evaluation metrics require an ordered outcome and thus we report the score of a sample p as the inverse of its neighbor count within radius R: \( LEAP_s(p) = 1/R\text {-Neighbors}(p)\). Note that higher scores denote greater anomalousness.

In the worst case, i.e., when the minimal probing principle fails, LEAP requires quadratic time complexity \(O(w^2)\), where w is the number of samples in a window. However, the authors mention that more advanced data structures can be utilized to reduce the neighbor search complexity.
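For illustration, a minimal Python sketch of LEAP's probing is given below; it builds the evidence list slide by slide in reverse chronological order and stops as soon as K neighbors are found (minimal probing). The names are ours and the slide indexes are plain lists rather than the index structures used by LEAP:

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]

def probe(p: Point, slides: List[List[Point]], R: float, K: int) -> Tuple[List[Point], bool]:
    """Return p's evidence list and an anomaly flag."""
    evidence: List[Point] = []
    for slide in reversed(slides):          # newest slide first (minimal probing)
        for q in slide:
            if q is not p and math.dist(p, q) <= R:
                evidence.append(q)
                if len(evidence) >= K:
                    return evidence, False  # enough evidence: p is normal
    return evidence, True                   # fewer than K neighbors: anomaly

# Two slides of 2-d samples and K = 2 required neighbors.
slides = [[(0.0, 0.0), (0.2, 0.1)], [(0.1, 0.0), (5.0, 5.0)]]
print(probe((0.05, 0.05), slides, R=0.5, K=2))   # normal
print(probe((5.0, 5.0), slides, R=0.5, K=2))     # anomaly
```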

2.4 Half space trees (HST)

HST [60] is a tree-based anomaly detector that learns a sketch of a data stream using an ensemble of half-space trees (HS-Trees). An HS-Tree is a full binary tree in which all leaves are at the same depth. HST requires tuning two hyper-parameters: the maximum depth h of each HS-Tree and the number of HS-Trees T of the ensemble.

Each half-space tree is built using a random perturbation of the original feature space, called a workspace, where an internal node of a tree represents a selected feature. HST selects a feature randomly and uniformly. Then, the splitting value is taken at the half-way point of the work range of the selected feature. The work range (wr) is defined in Eq. 5 and differs per feature F. Note that \(v_F\) is a random value between the minimum and the maximum value of a feature, \(v_F \in [F_{min}, F_{max}]\). Note that HST uses all the window samples to construct the trees, in contrast to batch tree-based detectors such as IF [43] and OCRF [29] that rely on bootstrapping to induce diversity among the constructed trees. HST ensures this diversification by selecting the value \(v_F\) randomly. The \(v_F\) value lets HST construct wide value ranges, which is a crude estimate of the range of the unseen samples. We should also stress that HST assumes that the data are scaled such that the values of features are bounded in [0, 1]. However, such normalization would have altered the nature of other tree-based algorithms such as RRCF [32] that utilize the different value ranges to select the most prominent split feature at each step. Therefore, to ensure that the algorithms are compared on an equal basis, we modified the work range function (see Eq. 6) so that the data range does not have to be restricted.

$$\begin{aligned} wr(F) = v_F \pm 2 \cdot max (v_F, 1-v_F) \end{aligned}$$
(5)
$$\begin{aligned} wr'(F) = v_F \pm 2 \cdot max (v_F - F_{min}, F_{max}-v_F) \end{aligned}$$
(6)
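For illustration, a minimal Python sketch contrasting the original work range of Eq. 5, which presumes features scaled to [0, 1], with the modified work range of Eq. 6 used in our testbed; the names are ours:

```python
import random

def work_range_original(v: float) -> tuple:
    """Eq. 5: assumes the feature values are bounded in [0, 1]."""
    delta = 2 * max(v, 1 - v)
    return v - delta, v + delta

def work_range_modified(v: float, f_min: float, f_max: float) -> tuple:
    """Eq. 6: uses the observed feature range instead of [0, 1]."""
    delta = 2 * max(v - f_min, f_max - v)
    return v - delta, v + delta

# A feature observed in [0, 5] with a randomly picked pivot value v_F.
v = random.uniform(0, 5)
print(work_range_original(v))        # misleading when the data are not scaled to [0, 1]
print(work_range_modified(v, 0, 5))  # spans the actual value range
```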

HST assigns an anomaly score to each sample. The score of a sample p in a specific tree \(t \in HS-Trees \) is defined as:

$$\begin{aligned} score (p, t) = Node.r \cdot 2^ Node.h , \end{aligned}$$
(7)

where \( Node.r \) and \( Node.h \) are, respectively, the mass and the depth of the leaf node that p falls into in a tree t. The lower the score a sample obtains, the more anomalous it is considered. Then, HST assigns a total score to each sample p, which is the sum of the scores obtained from the constituent trees \( HS-Trees \):

$$\begin{aligned} HST (p) = \sum _{t \in HS-Trees } score (p, t). \end{aligned}$$
(8)
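For illustration, a minimal Python sketch of Eqs. 7 and 8 is given below; it routes a sample through each half-space tree until a leaf is reached, and the node structure is our own simplification:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    feature: Optional[int] = None     # split feature index (None for a leaf)
    split: float = 0.0                # split value (mid-point of the work range)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    mass: int = 0                     # reference-window samples that reached this leaf
    depth: int = 0

def score_in_tree(p: Sequence[float], node: Node) -> float:
    """Eq. 7: route p to its leaf and return mass * 2^depth."""
    while node.feature is not None:
        node = node.left if p[node.feature] < node.split else node.right
    return node.mass * 2 ** node.depth

def hst_score(p: Sequence[float], trees: Sequence[Node]) -> float:
    """Eq. 8: sum of the per-tree scores; lower totals are more anomalous."""
    return sum(score_in_tree(p, t) for t in trees)

# A single tree of depth 1 splitting feature 0 at 2.5.
tree = Node(feature=0, split=2.5,
            left=Node(mass=5, depth=1), right=Node(mass=0, depth=1))
print(hst_score((1.0, 0.0), [tree]))   # 10: dense partition
print(hst_score((3.0, 0.0), [tree]))   # 0: empty partition, i.e., more anomalous
```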

A significant limitation of HST is that as new samples are processed, the mass profiles of the corresponding leaf nodes can only increase; in other words, HST can never forget. To overcome this limitation, we have implemented an HST variation with a simple forgetting mechanism inspired by RRCF [32], noted as HSTF. Subsequently, we list the building blocks of HST/F:

Training Phase During training, HST builds an ensemble of half-space trees that learns the sketch of the data stream. The mass profile of a leaf is the number of samples that end up in that subspace. The lower the mass, the sparser the region is considered. The structure of the trees formed during training remains unaltered during the update phase and only the mass profiles are updated.

Model Update During updating, HST uses two alternating tumbling windows: a latest window, which is not full yet, and a reference window preceding the latest window. The mass profile of the reference window is computed and samples in the latest window falling into low-mass leaves (partitions) are considered anomalous. When the latest window is full, the mass profile of a leaf is updated by adding the number of samples that fell into that partition.

Forgetting Mechanism Our HSTF variation forgets the samples of the oldest windows by decreasing the mass profiles of the corresponding leaves. The number of samples that the model must remember is specified by a forgetting threshold f, i.e., a third hyper-parameter w.r.t. the original HST. The lower the value of f, the faster a high mass profile may become sparse. As an old sample is deleted only after the insertion of a new sample, f should be a multiple of the window size w. Therefore, when f is exceeded, the mass profiles of the leaves corresponding to the w oldest samples (i.e., those not updated by the latest window) in each tree are decreased by one.

Fig. 3 HST partitions (left) and constructed model (right) when running on the tumbling windows of Fig. 1b in the feature space with \(T=1\) trees and max height \(h=2\)

Anomaly Report As it relies on tumbling windows, HST/F reports the anomaly scores for each sample after entirely processing the latest window according to Eq. 8. A running example of HST/F is illustrated in Fig. 3. In this example, we assume that the max height is \(h=2\) and only one tree (\(T=1\)) is built. First, the reference window is constructed on the first six samples (see Fig. 3a), leading to 4 partitions based on a random splitting value \(s \in wr '(F)\) of a randomly selected feature F at each step. For the second window (see Fig. 3b), we assess the abnormality of each new sample based on the mass of the partition that it falls into, computed from the preceding (reference) window. Therefore, sorting the six samples of the second window in descending score value order (indicating increasing anomalousness) yields: \(\langle p_{10}, \{p_8, p_{11}, p_9, p_{12}\}, p_{7} \rangle \), where \(\{\cdot \}\) indicates a tie. Sample \(p_7\) is the most abnormal sample because it falls into partition \(P_1\) with 0 mass. After the scores of the new samples have been computed, the latest window becomes the reference window and the mass of each partition is updated accordingly. An interesting case is the sample \(p_{10}\), which is the most normal among the six samples in the second window because it falls into partition P4 with mass 4 (see Fig. 3). The aforementioned behavior shows that if a plethora of samples are concentrated in a partition at the beginning of the stream but very few samples fall into that partition as the stream evolves, HST will keep assigning high (i.e., normal) scores to those samples, leading to potential false negatives. The suggested forgetting mechanism of HSTF can reduce this effect by decreasing the mass profiles of such partitions.

HST requires linear time \(O(T (2^{h + 1} - 1))\) for model constructionFootnote 6 and linear time \(O(Thw)\) for model update, where w is the window size. In the worst case, each sample may end up in a different leaf. Thus, all points in a window may update different mass profiles through different tree traversals. Therefore, the complexities are amortized constant once h, T and w are set. Note that the forgetting threshold does not change the complexity of the original algorithm.

2.5 Robust random cut forest (RRCF)

RRCF is a tree-based detector [32] used by the AWS Data Analytics EngineFootnote 7 that learns a sketch of a data stream using an ensemble of Robust Random Cut Trees (RRCT). An RRCT is a full binary tree used to calculate the collusive displacement (CoDisp) of a sample. CoDisp measures the differential effect of adding/removing a particular sample to/from an RRCT. RRCF requires tuning three hyper-parameters: the maximum number of samples Max Samples that are used to build a tree during training; a forgetting threshold f on the number of leaf nodes retained after updates; and the number of trees T of the ensemble. RRCF differs from HST in three aspects: (i) it prioritizes features with a higher value range; (ii) it uses a forgetting mechanism to delete old samples; and (iii) anomalies are reported instantly, i.e., before the current window has been processed completely. Subsequently, we list the building blocks of RRCF:

Training Phase RRCF trains the trees of the ensemble by subsampling without replacement from a few initial sliding windows; Max Samples samples are used to build each tree. An internal tree node represents a splitting feature that is selected proportionally to its normalized value range. Features with larger value ranges may contain extreme values and, therefore, anomalies. Each internal node has a splitting value which is selected randomly and uniformly from the range of the selected feature. As with HST, the splitting value essentially partitions samples into smaller subspaces. Every internal node keeps a bounding box that stores the value range of the feature at a specific depth. A leaf node contains a sample along with its arrival time in the stream and the number of replicas in case many samples end up in the same leaf. The construction of a tree stops when every sample in the training set is isolated from the remainder of the training data, i.e., falls in a leaf.

Model Update RRCF updates the trees incrementally using sliding windows. When a new sample p traverses the internal nodes of a tree, if the feature values of p exceed the bounding box of the last internal node in the path, then a new node is built; otherwise, p ends up in the same leaf as another sample, increasing its replica counter.

Forgetting Mechanism RRCF provides a time-decaying mechanism to forget old samples. When the number of leaves exceeds the forgetting threshold f, the oldest samples per insertion time are deleted and the tree is restructured accordingly.

Anomaly Report After the insertion of a new sample into the model, its anomaly score is immediately computed, unlike HST, which requires processing an entire window. RRCF uses an anomalousness criterion called collusive displacement (CoDisp). To compute CoDisp, the displacement of a node \(n_p\) that sample p traversed through in a tree \(t \in \) RRC-Trees is computed as:

$$\begin{aligned} Disp(n_p,t) = \frac{\text {number of samples beneath } sibling_{n_p} }{\text {number of samples beneath } {n_p}}. \end{aligned}$$

CoDisp extends the notion of Disp by accounting for duplicates and near-duplicates, called colluders, that can mask the presence of anomalies. Given the path of nodes P starting from a leaf node l up to the node before the root r of a tree \(t \in \text {RRC-Trees}\), the CoDisp of p is computed as the average maximal displacement over the traversal path of p across all trees: \(1/T \sum _t max (\{ Disp (n_i,t) \vert n_i \in P\}). \)

Intuitively, CoDisp measures the change in the model complexity incurred by the insertion or deletion of p. The model complexity here can be represented as the sum of depths for all samples in the tree. Therefore, a tree-based anomaly is defined as a sample that significantly increases the depth for a set of samples, when it is included in the tree.
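For illustration, a minimal Python sketch of the CoDisp computation over a single tree is given below; it assumes that every node stores its subtree size and a parent pointer, and the names are ours:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    size: int = 1                     # samples beneath (and including) this node
    parent: Optional["Node"] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def codisp_in_tree(leaf: Node) -> float:
    """Maximal displacement Disp(n) = |sibling subtree| / |own subtree| along
    the path from the leaf up to the node just below the root."""
    best, node = 0.0, leaf
    while node.parent is not None:
        parent = node.parent
        sibling = parent.left if parent.right is node else parent.right
        best = max(best, sibling.size / node.size)
        node = parent
    return best

def codisp(leaves: List[Node]) -> float:
    """CoDisp of a sample: average of its per-tree maximal displacements."""
    return sum(codisp_in_tree(leaf) for leaf in leaves) / len(leaves)

# A root whose children are an isolated leaf and a leaf holding 4 near-duplicates.
root = Node(size=5)
lone = Node(size=1, parent=root)
dense = Node(size=4, parent=root)
root.left, root.right = lone, dense
print(codisp([lone]))    # 4.0: removing the lone leaf displaces 4 samples
print(codisp([dense]))   # 0.25
```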

Fig. 4 RRCF model built on the sliding windows of Fig. 1a in the feature space with \(T=1\) and Max Samples = 5

A running example of RRCF is illustrated in Fig. 4. The model is initially built (training phase) using the first sliding window in Fig. 4a. We assign to each node a unique id \(n_i\). Every internal node represents the selected feature along with its bounding box (value range). The feature F1 is selected at the first step as it has a higher value range ([0, 5]) than F2 ([0, 3]); we depict the splitting value on the edges of each node. When the first window is finished, each sample is isolated in a leaf along with a replica counter (rep). The scoring and update procedures are performed using the next sliding window in Fig. 4b. To keep a neat tree visualization, we present the model up to sample \(p_{10}\). Given that a sample is scored only when it is inserted into the tree, altering its structure, we compute the CoDisp analytically only for \(p_{10}\). Observe that \(p_{10}\) exceeds the bounding box of F2 and thus a new node is created at depth 3. Therefore, we compute the Disp for the nodes \(n_{11}, n_{10}\) and \(n_5\), resulting in the following values: \(\{3/1, 1/4, 5/5\}\); the CoDisp is the maximum value, which is 3. We also report the scores of the remaining samples upon their insertion: \(p_6\)=0.5, \(p_7\)=1.33, \(p_8\)=1, \(p_9\)=0.8. Compared to Disp, CoDisp captures the effect of a deletion at the tree level instead of the leaf level. This helps recognize clustered anomalies, i.e., a sample masked by its anomalous neighborhood.

RRCF requires linear time \(O(T (2 n - 1))\)Footnote 8 to construct a forest and logarithmic updating time \(O(T \log (n))\) [32]. In the worst case, each sample may end up in a different leaf at the maximal depth, which requires updating the entire subtree up to the root. Computing CoDisp and updating the tree structure for every sample in a window of size w requires linearithmic time \(O(T w \log (n))\).

2.6 Lightweight online detector of anomalies (LODA)

LODA [49] is a projection-based detector that constructs an ensemble of k one-dimensional histogram density estimators using sparse random projections. The significant advantage of LODA is that it is hyper-parameter free. Specifically, the number of histograms k can be estimated by measuring the reduction of variance after adding another histogram [59] and the number of bins b can be estimated via the method of Birgé and Rozenholc [10]. LODA can operate in batch (noted as L-B for brevity) or online mode (noted as L-S for brevity) that continuously updates the histograms as the stream evolves. In batch mode, the two hyper-parameters are robustly estimated using all available samples while in online mode, hyper-parameters are estimated using only the training samples. Subsequently, we list the building blocks of LODA in streaming mode (L-S).

Training Phase During training, L-S constructs a one-dimensional histogram j as follows. First, a projection vector \(w_j\) is built with coefficients \(\sim N(0, \textbf{1}_d)\), where all but \(\sqrt{d}\) of them, selected uniformly at random, are replaced with zeros. The nonzero coefficients denote a subspace of features used to build the histogram j. Second, L-S relies on online histograms [8] to approximate the distribution of data by using a set of pairs \(H_j = \{(z_{1j}, m_{1j}), \ldots ,(z_{bj}, m_{bj})\}\), where \(z_{ij} = w_{j}^{T} x_i\) is the projection of the i-th sample in the j-th histogram and \(m_{ij}\) is the total number of samples falling into the same bin. When the number of bins exceeds the estimated threshold b, the two nearest projections \(z_{ij}\), \(z_{lj}\) are merged to form the new pair \((\frac{z_{ij}\cdot m_{ij} + z_{lj}\cdot m_{lj}}{m_{ij} + m_{lj}}, m_{ij} + m_{lj})\). Finally, two additional pairs for the minimum and maximum projections are added: \(H_j \leftarrow H_j \cup \{(z_{min}, 0), (z_{max}, 0)\}\).

Model Update L-S operates in tumbling windows using one of the following modes: (a) under the continuous histograms mode [8], each sample is first scored and then inserted into the histograms; (b) under the two alternating histograms mode, samples are inserted into the model after all of them have been scored, i.e., upon the completion of the new window, similarly to HST [60]. The updating process is the same as for building the initial histograms, while their minimum and maximum bounds may be updated by new samples. Note that the number of histograms does not change during updates.

Forgetting Mechanism Similar to the original implementation of HST, L-S does not employ a forgetting mechanism. Therefore, the frequency of the bins can only increase as the stream evolves.

Anomaly Report To compute the score of a sample \(x_{i'}\), the projection \(z_{i'j}\) for a histogram j is computed and the two nearest projections (if they exist) \(z_{ij}< z_{i'j} < z_{i+1j}\) are used to calculate the anomaly score:

$$\begin{aligned} \hat{p}(x_{i'}) = \frac{1}{k} \sum _{j=1}^k \frac{z_{ij}\cdot m_{ij} + z_{i+1j}\cdot m_{i+1j}}{2 M_j (z_{i+1j} - z_{ij})}, \end{aligned}$$
(9)

where \(M_j = \sum _{i=1}^b m_{ij}\). If a sample falls in a sparse region, it receives a lower score indicating more anomalousness. Note that if \(\not \exists ~ i: z_{ij}< z_{i'j} < z_{i+1j}\), the score cannot be computed.
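For illustration, a minimal Python sketch of the density estimate of Eq. 9 is given below; it represents each online histogram as a sorted list of (projection, count) pairs and, as an assumption, simply skips histograms in which the projection falls outside the min/max bounds:

```python
from typing import List, Optional, Sequence, Tuple

Histogram = List[Tuple[float, float]]   # sorted (projection z, count m) pairs

def loda_density(x: Sequence[float],
                 projections: List[Sequence[float]],
                 histograms: List[Histogram]) -> Optional[float]:
    """Eq. 9: average over the k histograms of the interpolated density around
    the projection of x; lower values indicate more anomalous samples."""
    densities = []
    for w, hist in zip(projections, histograms):
        z_x = sum(wi * xi for wi, xi in zip(w, x))   # sparse projection w^T x
        M = sum(m for _, m in hist)
        for (z_i, m_i), (z_next, m_next) in zip(hist, hist[1:]):
            if z_i < z_x < z_next:                   # the two surrounding bins
                densities.append((z_i * m_i + z_next * m_next)
                                 / (2 * M * (z_next - z_i)))
                break
    return sum(densities) / len(densities) if densities else None

# k = 1 histogram over feature 0, with bins at projections 0, 1.6 and 3.
hist = [(0.0, 1), (1.6, 2), (3.0, 4)]
print(loda_density((0.5, 9.9), projections=[(1.0, 0.0)], histograms=[hist]))
```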

Fig. 5 LODA-streaming model trained on tumbling window 1 of Fig. 1b in the feature space with \(b=3\) bins and \(k=2\) histograms

A running example of L-S trained on the first six samples of tumbling window 1 of Fig. 1b is depicted in Fig. 5. L-S estimates the number of bins \(b=3\) (not considering the min/max bins) and the number of histograms \(k=2\). Since there are two histograms, there are two one-dimensional random projections. For \(H_1\), the feature F1 is selected with projection vector \(w_1 = [0.5, 0]\) and for \(H_2\), F2 is selected with projection vector \(w_2 = [0, 1]\). Then the samples of the second window of Fig. 1b arrive (see Fig. 3b for the value ranges). The first sample to be scored is \(p_7 = (3,0)\).Footnote 9 However, the projections of \(p_7\) are \(\langle 0,0 \rangle \) in \(H_1\) and \(\langle 0,3 \rangle \) in \(H_2\), which are outside the min/max bounds, and thus its anomaly score cannot be computed. Then, \(p_7\) is inserted, creating the bins \(b'_{ H_1} = (0, 1)\) and \(b'_{H_2} = (3, 1)\). As the number of bins is now \(4 > 3\) (not considering the min/max bins), the \(b_{1, H_1}\) and \(b_{2, H_1}\) are merged into \(b_{12, H_1} = (1.6, 2)\) and the \(b_{2, H_2}\) and \(b_{3, H_2}\) are merged into \(b_{23, H_2} = (1.3, 3)\). Finally, the min/max bounds are updated: \(b_{ min , H_1} = (0,0)\) for \(H_1\) and \(b_{ max , H_2} = (3,0)\) for \(H_2\). The next sample is \(p_8 = (3.5, 1)\), projected as \(\langle 0.5,0 \rangle \) in \(H_1\) and \(\langle 0,3.5 \rangle \) in \(H_2\). Therefore, it can only be scored in \(H_1\), being between \(b'_{H_1}=0< 0.5 < 1.6=b_{12, H_1}\). Thus \(\hat{p}_8 = (0 \cdot 1 + 1.6 \cdot 2) / (2\cdot 7\cdot (1.6-0)) \approx 0.14\).

Given k histograms with b bins each, the time complexity of L-S is \(O(Nk(d^{-\frac{1}{2}} + b))\) for training, where N is the number of samples, and O(wk) for updating and scoring, where w is the window size.

2.7 XSTREAM

XSTREAM [45] is a projection-based detector that constructs an ensemble of half-space chains (HSC) serving as density estimators. Unlike all the previous detectors, it is able to cope with feature-evolving data streams. XSTREAM requires tuning three hyper-parameters: the number of random projections K, the number of HSC M and the depth D of the HSC. XSTREAM can operate in batch (noted as X-B for brevity) or online mode (noted as X-S for brevity). Both rely on sparse random projections of samples over a subset of features. In each random projection, 1/3 of the features are set as nonzero, weighted by \(\{-\sqrt{3/K}, \sqrt{3/K}\}\) with equal probability. Then, each sample p from \(\mathbb {R}^d\) is projected to \(\mathbb {R}^K\) to obtain a new projected sample \(\langle r_1^T p, \ldots , r_K^T p\rangle \), where d denotes the dimensionality of the dataset and \(r_i\) a random projection. Subsequently, we list the building blocks of X-S.

Training Phase During training, a fraction of the samples is kept to estimate the value range \(\Delta _{F'_i} = ( max _{F'_i} - min _{F'_i}) / 2\) of each constructed feature \(F'_i \in \mathbf {F'} = \{F'_1,\ldots , F'_K\}\). At each level of an HSC, a feature \(F'_i\) is selected randomly with replacement from \(\mathbf {F'}\). The value of the \(F'_i\) selected at level l in the bin vector \(\textbf{z}_l\) of p is computed using the following equations:

$$\begin{aligned} z_{F'_i, l}={\left\{ \begin{array}{ll} (p_{F'_i} + s_{F'_i}) / \Delta _{F'_i}, &{} \text {if }o(F'_i, l) = 1.\\ (2z_{F'_i} - s_{F'_i}) / \Delta _{F'_i}, &{} \text {if }o(F'_i, l) > 1. \end{array}\right. } \end{aligned}$$
(10)

where \(p_{F'_i}\) is the value of \(F'_i\) of a projected sample p, \(s_{F'_i}\) is a random shift drawn from \( Uniform (0, \Delta _{F'_i})\) and \(o(F'_i, l)\) is the number of times the feature has been sampled up to the l-th level. The bin vector at level l is computed as \(\textbf{z}_l = \{z_{F'_i, l} \vert F'_i \in F'\}\). To maintain the bin counts at every level for a chain c, X-S relies on a count-min-sketch (cms) structure, noted as \(\textbf{H}_c = \{H_{c,l} \vert l=1\ldots D\}\). For a specific level l of an HSC c, \(H_{c,l}\) is updated as:

$$\begin{aligned} H_{c,l}={\left\{ \begin{array}{ll} H_{c,l}[\lfloor {\textbf{z}_l} \rfloor ] +1, &{} \text {if }\lfloor {\textbf{z}_l} \rfloor \in H_{c,l}.\\ 1, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(11)

Model Update Like HST [60], X-S operates with two alternating tumbling windows. Two cms structures are maintained, \(\textbf{H}_ ref \) and \(\textbf{H}_ cur \), for the reference and the current window, respectively. The samples of the current window are scored using the bin counts of \(\textbf{H}_ ref \), and the bin counts of \(\textbf{H}_ cur \) are updated as in training. When all samples of the current window have been scored, \(\textbf{H}_ ref \) is replaced by \(\textbf{H}_ cur \) and the counts of \(\textbf{H}_ cur \) are set to zero. This technique lets X-S handle drifts in the data distribution between two consecutive windows.

Forgetting Mechanism When all samples of the current window are inserted into \(\textbf{H}_ cur \), the bin counts learned in the reference window are replaced by the ones of the current window. Therefore, whenever the window slides, X-S forgets together all samples of the reference window.

Anomaly Report The anomaly score of a projected sample is the minimum depth-weighted bin count across all levels of a chain, averaged across all HSC: \(\frac{1}{M} \sum _{c \in C} min _l~ 2^lH_{c,l}[\lfloor {\textbf{z}_l} \rfloor ]\), where C is the set of the M chains. The intuition behind the scoring function is to measure the anomalousness of a sample across the different feature granularities and report the score that corresponds to the lowest density at which this sample is located.
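For illustration, a minimal Python sketch of the chain score is given below; as a simplification, the per-level counters \(H_{c,l}\) are plain dictionaries rather than count-min-sketch structures, and all names are ours:

```python
from typing import Dict, List, Tuple

BinKey = Tuple[int, ...]                 # floored (discretized) bin vector at one level
Chain = List[Dict[BinKey, int]]          # one count table per level l = 1..D

def chain_score(bins: List[BinKey], chain: Chain) -> float:
    """Minimum depth-weighted count 2^l * H_{c,l}[floor(z_l)] over the levels."""
    return min((2 ** l) * chain[l - 1].get(z, 0)
               for l, z in enumerate(bins, start=1))

def xstream_score(bins_per_chain: List[List[BinKey]], chains: List[Chain]) -> float:
    """Average of the per-chain scores over the M chains; lower = more anomalous."""
    return sum(chain_score(b, c) for b, c in zip(bins_per_chain, chains)) / len(chains)

# One chain (M = 1) of depth D = 3; the sample hits an empty bin at the deepest level.
chain = [{(0, 1): 5}, {(0, 3): 3}, {(1, 6): 0}]
print(xstream_score([[(0, 1), (0, 3), (1, 6)]], [chain]))   # min(2*5, 4*3, 8*0) = 0
```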

Fig. 6 XSTREAM model trained on tumbling window 1 of Fig. 1b in the feature space with depth \(D=3\), \(M=1\) chain and \(K=2\) projections. For simplicity we assume that the projections are identical to the original feature values and the random shift \(\textbf{s} = (0,0)\) for each feature

A running example is illustrated in Fig. 6. The first tumbling window of samples \(\{p_1, \ldots ,p_6\}\) (see Fig. 1b) serves as the reference window and is used to build the initial cms structures \(H_{1,l}\) for \(l=1,\ldots 3\) (\(M=1\) and \(D=3\)). First, samples are projected using two random projections \(r_1, r_2\): \(F'_1\) is obtained with \(r_1 = [1, 0]\), using only the values of \(F_1\), and \(F'_2\) with \(r_2 = [0, 1]\), using only the values of \(F_2\). The halved value range \(\Delta _{F'_2}, \Delta _{F'_1}\) of each feature is 1: \(\varvec{\Delta } = (1,1)\). Subsequently, we compute the bin vector of sample \(p_6\) across the chain levels. At \(l=1\), the feature F1 is selected and \(p_6\) has the bin vector \(\textbf{z}_1 = (0, 4.7)\). At level 2, \(F'_2\) is selected, leading to \(\textbf{z}_2 = (1, 4.7)\), and at the last level \(F'_1\) is selected again, yielding \(\textbf{z}_3 = (1, 9.4)\). Therefore, the discretized bin vectors are \(\textbf{z}_1 = (0, 4), \textbf{z}_2 = (1, 4), \textbf{z}_3 = (1, 9)\). Since there is a distribution shift in the second tumbling window, all its samples get a zero score, which is the lowest possible density. For the projected sample \(p_1\), the bin counts are 3 (\(l=1\)), 2 (\(l=2\)) and 1 (\(l=3\)), and its anomaly score is 1.

X-S has linear time complexity O(NKmDM) to construct the cms structure, where N is the number of training samples, K is the number of projections, m is the number of fixed-size hash tables to approximate the bin counts, and M is the number of HSC of depth D. The time complexity to update the cms structure is O(wKmDM) with w being the window size.

2.8 Randomized subspace hashing (RS-Hash)

RS-Hash [57] is a density-based detector that operates in subspaces. It constructs an ensemble of histograms on feature subspaces, serving as density estimators as in XSTREAM and LODA. RS-Hash requires tuning three hyper-parameters: the number of hash functions h, the sub-sample size s and the number of repetitions m.

The main idea is to repeatedly construct grid-based histograms on sub-samples and combine the obtained scores in an ensemble fashion. Each histogram is built on a sparse, randomly chosen subspace of the original feature space. The dimensionality r of a subspace is sampled uniformly at random from \((1+0.5 \cdot log_{max(2,1/f)}(s), log_{max(2,1/f)}(s))\), where f is a locality parameter sampled uniformly at random from \((1/\sqrt{s}, 1-1/\sqrt{s})\). Unlike XSTREAM and LODA, RS-Hash assumes that the r features have equal weight and the histograms are constructed on the original sample values of these features rather than their inner product with the selected subspace. After selecting the subspace features, each sample is normalized using min–max normalization and histograms are constructed using a count-min-sketch (cms) structure, as in XSTREAM. In total, h histograms are built for a particular subspace. The process is repeated m times, with a different sub-sample hashed on a different subspace. Subsequently, we report the building blocks of RS-Hash:

Training Phase During training, a fraction of the samples is kept to construct the initial histograms. We denote by \(H_{ij}\) the histogram j of a sub-sample i, which uses a particular hash function. To maintain the bin counts, RS-Hash leverages the cms structure that hashes the values of a sample p as in Eq. 11 of XSTREAM. The difference is that the values of p are normalized but not projected as in XSTREAM.

Model Update When a new sample p arrives, RS-Hash first scores p and then the histograms’ counts are updated.

Forgetting Mechanism Whenever the window slides, RS-Hash forgets all the expired samples by reducing the counts of the corresponding hash buckets.Footnote 10

Anomaly Report To compute the score of a new sample p, the unused features of a subspace are encoded as -1 and the remaining ones are normalized. The anomaly score is the minimum bin count across the histograms of a sub-sample, averaged over the different sub-samples:

$$\begin{aligned} Score(p) = \frac{1}{m} \sum _{i=1}^m log_2( min_{j} H_{ij}[p] +1). \end{aligned}$$
(12)
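For illustration, a minimal Python sketch of Eq. 12 is given below; as a simplification, each of the h hash tables of a sub-sample is a plain dictionary mapping a bucket key to a count, and all names are ours:

```python
import math
from typing import Dict, Hashable, List

def rs_hash_score(keys: List[List[Hashable]],
                  tables: List[List[Dict[Hashable, int]]]) -> float:
    """Eq. 12: for each sub-sample i, take the minimum count of the sample's
    bucket over its h hash tables, then average log2(count + 1) over the m
    sub-samples; lower scores indicate more anomalous samples."""
    total = 0.0
    for sub_keys, sub_tables in zip(keys, tables):   # one entry per sub-sample i
        min_count = min(t.get(k, 0) for k, t in zip(sub_keys, sub_tables))
        total += math.log2(min_count + 1)
    return total / len(tables)

# m = 1 sub-sample with h = 2 hash tables; the sample hashes to an empty bucket
# in the first table, so its minimum count is 0 and its score is 0.
tables = [[{"b1": 3, "b2": 1}, {"b1": 2, "b3": 1}]]
print(rs_hash_score(keys=[["b3", "b3"]], tables=tables))   # 0.0
```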

We reuse the running example of LODA in Fig. 5, where the first window of Fig. 1b is used for training RS-Hash. We set \(h=2\), resulting in two histograms, and \(m=1\). We select four samples uniformly at random, \(p_1, p_3, p_5, p_6\), and thus obtain the histograms \(H_{11}=\{b_1:[p_1, p_3, p_6], b_2:[p_5], b_3:[]\}\) and \(H_{12}= \{b_1:[p_1, p_3], b_2:[p_6], b_3:[p_5]\}\), where \(b_i\) denotes bucket i; the buckets may differ due to the different hash functions. After training, the samples of window 2 of Fig. 1b arrive. We assume that the first sample \(p_7\) falls into \(b_3\) in both hash tables. The minimum count is 0 (the empty bucket \(b_3\) of \(H_{11}\)), so \(p_7\) receives a score of 0 (see Eq. 12). Then, it increases the count of \(b_3\) by 1 in both hash tables. The scoring continues analogously for the remaining samples. Recall that each sample is first scored and then the cms structure is updated. After each sample is scored, the forgetting mechanism is activated, reducing the counts in the corresponding buckets.

RS-Hash has linear time complexity for training O(shm) including the cms construction, and O(whm) for updating and scoring, where s is the sub-sample size, h the number of hash functions, m the repetitions and w is the window size.

2.9 STAtionary REgion skipping (STARE)

STARE [72] is a density-based detector that identifies the top-n local anomalies in sliding windows. STARE requires tuning three hyper-parameters: the number of nearest kernel centers \(\theta _k\), the diagonal length \(\theta _R\) of a grid cell and the error allowance threshold \(\gamma \).

STARE relies on kernel density estimation (KDE) to compute the density around each sample. In contrast to other KDE-based methods, STARE does not globally update the samples’ densities for every window slide. Instead, it optimizes density estimation based on the observation that data distributions in many regions hardly change across window slides, a notion called stationary region skipping. STARE operates in three phases: (i) Data distribution approximation, (ii) Cumulative net-change-based skip and (iii) Top-n anomaly detection. In the first phase, STARE divides the space into d-dimensional grid cells of diagonal length \(\theta _R\), where d is the data dimensionality. Each grid cell c has a kernel center that represents the samples that fall into the cell, while empty cells are not considered. STARE stores a weight distribution grid \(\mathbb {G}\) with the number of samples in each c. In the second phase, STARE examines the changes in \(\mathbb {G}\), denoted as the net-weight distribution grid \(\Delta \mathbb {G}\), between the current and the previous slide to avoid updating the local densities of samples in stationary regions. Note that some regions may change only slightly, e.g., only one sample is removed from a cell; in this case, STARE relies on a threshold \(\gamma \) to skip the updates of slightly changed regions. Higher \(\gamma \) valuesFootnote 11 increase the error in density estimation, sacrificing accuracy for efficiency. In the last phase, STARE first determines the candidate cells that are guaranteed to contain the top-n anomalies; to do so, density bounds are computed for each cell. Then, STARE performs point-level detection within the candidate cells. Subsequently, we report the building blocks of STARE:

Training Phase STARE uses the first window to partition the data space into grid cells, to store their centers and to compute the densities for all samples.

Model Update Each new sample is indexed to a grid cell, updating its sample count and therefore its weight distribution.

Forgetting Mechanism When a slide expires, the weight distribution of each affected grid cell is updated by reducing its sample count.

Anomaly Report The density of a sample p is calculated as the weighted average density over the kernel centers near to p: \(\mathcal {D}(p) = \sum _{i=1}^{\theta _k} \frac{w_i}{\sum _{j=1}^{\theta _k} w_j} \prod _{l=1}^d \mathcal {K}_{h^l}( dist (p^l, kc_i^l))\), where the sum ranges over the \(\theta _k\) nearest kernel centers \(kc_i\) of p, \(w_i\) is the weight (sample count) of \(kc_i\), and \(\mathcal {K}_{h^l}\) is the kernel function with bandwidth \(h^l\) applied, in a univariate fashion, to the distance between p and \(kc_i\) in dimension l. The score of a sample p is given by \(\mathcal {S}(p) = (\mu - \mathcal {D}(p)) / \sigma \), where \(\mu , \sigma \) are the mean and standard deviation of the local densities at the \(\theta _k\) nearest kernel centers of p. The anomaly score ranges from \(-\infty \) to \(+\infty \), where high values indicate lower density, i.e., more anomalousness.
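For illustration, the following Python sketch computes \(\mathcal {D}(p)\) and \(\mathcal {S}(p)\) as defined above. It assumes a Gaussian kernel and per-dimension bandwidths (both assumptions of ours); the original STARE implementation makes its own choices and additionally exploits the density bounds and stationary region skipping described earlier.

```python
import numpy as np

def gaussian_kernel(u, h):
    # Univariate Gaussian kernel with bandwidth h (an assumption; other kernels are possible)
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def stare_density(p, centers, weights, bandwidths, theta_k):
    """Weighted average density of sample p over its theta_k nearest kernel centers."""
    dists = np.linalg.norm(centers - p, axis=1)
    idx = np.argsort(dists)[:theta_k]                    # theta_k nearest kernel centers
    w = weights[idx] / weights[idx].sum()                # normalized cell weights (sample counts)
    per_center = np.prod(gaussian_kernel(np.abs(centers[idx] - p), bandwidths), axis=1)
    return float(np.sum(w * per_center))

def stare_score(p, centers, weights, bandwidths, theta_k):
    """(mu - D(p)) / sigma, with mu/sigma over densities at the theta_k nearest kernel centers."""
    dists = np.linalg.norm(centers - p, axis=1)
    idx = np.argsort(dists)[:theta_k]
    local = [stare_density(c, centers, weights, bandwidths, theta_k) for c in centers[idx]]
    mu, sigma = np.mean(local), np.std(local) + 1e-12    # guard against zero std
    return (mu - stare_density(p, centers, weights, bandwidths, theta_k)) / sigma
```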

A running example is depicted in Fig. 7. In the first slide of Fig. 7a, STARE forms the non-empty grid cells and computes the density for each sample. In this window, samples are ranked in decreasing order of their score as follows: \(p_4> p_5> p_7> p_6> p_2> p_1 > p_3\). The first two samples have a greater score than \(p_7\) as they fall near a dense region; thus they are assessed to be anomalous. In the next slide of Fig. 7b the samples of grid cells \(g_1, g_3\) expire and two new cells, \(g_5, g_6\), are formed. Assuming \(\theta _k = 2\), only the two nearest kernel centers are examined for each sample. The \(\gamma \) threshold requires at least four samples to expire in a cell before the samples’ densities in that cell are re-computed. In this window, the weight distribution has changed from the previous one, affecting the cells \(g_2\) and \(g_4\); thus the net-change mechanism is activated. However, since only three samples expired in \(g_1\) (the nearest kernel to \(g_2\)), the density of \(p_6\) will not be updated, nor will that of \(p_7\). The new ranking becomes \(p_{6}> p_{10}> p_{12}> p_7> p_8> p_9 > p_{11}\). Observe that \(p_7\) should now have received the highest score, as it lies slightly far from a dense area, while \(p_6\) should have received a lower score, as it lies in a sparse area far from dense areas after the update. Of course, with a different choice of \(\gamma \) this behavior can be avoided.

Fig. 7
figure 7

STARE running example on sliding windows of Fig. 1a. A square is a grid cell \(g_i\) with a kernel center (red cross), representing specific samples

STARE has time complexity \(O(w + N^2_G)\) for training in the first window and \(O(w + rN^2_G)\) for updating in subsequent slides, where w is the window size, \(N_G\) is the number of non-empty grid cells and r is the ratio of changed grid cells between two consecutive windows.

Table 1 Qualitative comparison of online detectors

2.10 Summary

Table 1 summarizes the main characteristics of the online detectors included in our benchmark. Online detectors are essentially incremental versions of offline detectors that assess anomalousness of samples using similar criteria. According to their authors, HST and RRCF are tree-based online detectors inspired by IF. MCOD [40] and CPOD [64] are distance-threshold based, while LEAP [16] is a nearest-neighbor-based detector inspired by KNN\(_W\). STARE [72] is a density-based detector on the full feature space like LOF [14], while RS-Hash [57], working on feature subspaces, traces its roots back to HICS [38]. Offline detectors (detailed in Appendix 1) essentially constitute the baselines for the effectiveness of online detectors.

Regarding the time complexity for the model update of the employed online detectors, according to Table 1, they are divided into linear (XSTREAM, LODA, RS-HASH, CPOD, HST/F), linearithmic (MCOD, RRCF) and quadratic (LEAP, STARE) in the worst case. Note that some of the reported complexities may differ from the analytical complexities reported by other works such as MCOD’s [63]. This is because we took the data characteristics into account. For instance, in MCOD, if each pair of points has a distance greater than R/2, i.e., the data are sparse, or the hyper-parameter K is set to a large value, then no micro-cluster will be formed, resulting in a linearithmic neighbor search. In CPOD, if the data are sparse, many cores will be formed; if many samples are far from their core, in range (R, 2R], then each core will be explored, resulting in computational overhead. In LEAP, when the minimal probing principle fails, the algorithm has quadratic complexity, which can be reduced with advanced data structures. In STARE, if the data distributions between many consecutive windows differ more than the predefined threshold \(\gamma \), the algorithm will have quadratic complexity, as no region will be stationary. The data characteristics will thus determine the ranking of the algorithms w.r.t. execution time.

3 Experimental environment

Our experimental evaluation relies on datasets widely used in previous empirical studies [15, 24, 25, 63, 68]. These set-based datasets contaminated with point anomalies [18] exhibit different characteristics of abnormal and normal samples (e.g., anomaly ratio, dimensionality) and are useful for an unbiased comparison of offline and online algorithms over all possible orders of arrival of samples in a data stream. We additionally consider the recently proposed Exathlon [37] for explainable anomaly detection over time series, which overcomes several limitations of previously used benchmarks for temporal data [69]. These sequence datasets contaminated with interval anomalies [19] can be used to evaluate online detectors only for the specific order implied by the timestamps of their samples.

We have implemented in Java the tree-based online algorithms (HST/F and RRCF) and integrated in our testbed the original Java implementations of MCOD,Footnote 12 LEAP,Footnote 13 STARE,Footnote 14 and CPOD,Footnote 15 as well as the C++ (online) implementation of XSTREAM [45]Footnote 16 and the Matlab (online)/Python (offline) implementations of LODA [49].Footnote 17 We also extended the Python implementation of RS-Hash available at github.Footnote 18 We finally rely on third-party implementations in Python of KNN\(_W\), LOF and IF from the scikit-learn library version 0.21.2Footnote 19 and the original implementation of OCRF [29].Footnote 20 We used Java Version 1.8, Python Version 3.7.4, and Scikit Learn Version 0.21.3. Experiments were conducted on a 2-core Intel i7-7500U at 2.7 GHz with 8 GB of RAM on Windows 10. The platform of the experimental evaluation environment, as well as the scripts used for the analysis of the results, can be found in the GitHub repository: https://github.com/droubo/meta-level-analysis-of-anomaly-detectors.

3.1 Datasets

The bulk of our evaluation lies on real data which in their majority are contaminated with synthetically generated anomalies. To experimentally compare detectors w.r.t. particular factors such as anomaly ratio and dimensionality, window size and speed, etc., we used representative datasets with both synthetic abnormal and normal data.

In unlabeled datasets, point anomalies are synthetically inserted using different methods. In datasets used in classification problems, the samples belonging to the minority class are considered anomalies. In datasets used in clustering problems, anomalies are samples inserted far away from regions of high density. In other datasets, anomalies are implanted by simply adding noise to the values of their features. Anomalies are additionally characterized as subspace anomalies when they are visible only in a subset of the dataset’s feature space, and as fullspace anomalies otherwise. Note that our experimental evaluation highlights the behavior of detectors for subspace anomalies, which have been less studied in the literature.

To simulate a stream of samples from a batch dataset (real or synthetic), we generate a sequence of windows of a given size, as sketched below. To guarantee a smooth detection difficulty across all windows, their content is obtained by shuffling normal and abnormal samples. Moreover, anomalies are stratified across windows using a step related to the anomaly ratio of the dataset. Thus, we guarantee that if \( step \, \le \, window \, size \), then each window will contain a similar number of anomalies. We finally partition windows into training and testing sequences. In this way, we can run a given detector over a specific dataset under multiple possible sample orderings and report its average effectiveness.
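The sketch below (Python, a simplified illustration of our stream generation rather than the exact testbed code) shuffles the normal samples and interleaves an anomaly every step positions, so that every window of the simulated stream receives a similar number of anomalies:

```python
import numpy as np

def simulate_stream(X_normal, X_anomalies, window_size=128, seed=0):
    """Shuffle normal samples and insert anomalies every `step` positions (stratification)."""
    rng = np.random.default_rng(seed)
    normals = list(X_normal[rng.permutation(len(X_normal))])
    anomalies = list(X_anomalies[rng.permutation(len(X_anomalies))])
    n_total = len(normals) + len(anomalies)
    step = max(1, n_total // max(1, len(anomalies)))  # spacing derived from the anomaly ratio
    stream, labels = [], []
    for pos in range(n_total):
        take_anomaly = (pos % step == 0 and anomalies) or not normals
        if take_anomaly:
            stream.append(anomalies.pop()); labels.append(1)
        else:
            stream.append(normals.pop()); labels.append(0)
    # split the interleaved stream into consecutive windows of the requested size
    windows = [np.array(stream[i:i + window_size]) for i in range(0, n_total, window_size)]
    window_labels = [np.array(labels[i:i + window_size]) for i in range(0, n_total, window_size)]
    return windows, window_labels
```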

To select the window size for the experiments concerning point anomalies, we relied on prior work on point anomaly detection [32, 40, 49, 60]. Instead of selecting a fixed window size in advance as in [49, 60], we examined three sizes comprising 64, 128 and 256 samples. Regarding the algorithms that operate using sliding windows, we used stride=1 [32]. To ensure a fair comparison of detectors, we investigated the window size that favors the majority of the detectors. In Fig. 16 (see Appendix 2), we found that a window of 128 samples results in better median MAP performance for the majority of the detectors. As a matter of fact, for many detectors the MAP using the three window sizes is almost identical, due to the stratification of anomalies performed in each window per dataset. We should stress that for AUC we observed the same trend. Hence, in subsequent experiments concerning point anomalies, we decided to use a window of 128 samples.

3.1.1 Real datasets

Table 2 depicts the characteristics of the real-world and synthetic datasets used in our experiments. Real-world datasets have originally been introduced in the UCI [26] and OpenML [67] repositories for multi-class classification. In our comparison, we use the unnormalized versions of these datasetsFootnote 21 that are made available for the anomaly detection problem by GLOSS [65], ODDS, DATAHUB, DAMI [15] and XSTREAM [45]. In the majority of these datasets, the anomaly ratio is low (no more than 9%), with the exception of Electricity in which anomalies and normal samples are almost balanced. It is also worth describing how anomalies are implanted in GLOSS. The authors randomly pick 5% of the samples, transforming each one into an anomaly by replacing a randomly picked feature subset with the values of a sample from a different class; the size of each feature subset is chosen uniformly from \([2, \max (2, 0.1 \cdot d))\), where d is the dataset’s dimensionality. The aforementioned anomaly generation process results in scattered fullspace anomalies.

Table 2 Real datasets with fullspace anomalies and Synthetic datasets with subspace anomalies

3.1.2 Synthetic datasets

Table 2 depicts the five synthetic datasets we used in our experiments. These datasets have been introduced to evaluate the HiCS subspace anomaly detectorFootnote 22 and exhibit a high variability in the number of features (20D-100D) while the number of samples is constant (875). They are normalized and contaminated with subspace anomalies of varying dimensionality (2D, 3D, 4D, 5D) as indicated by the subscripts of their names.

The peculiarity of the HiCS datasets is that they are contaminated with point anomalies that lie far away from dense regions of normal samples with highly correlated features. LOF has been applied exhaustively on each of the 2D, 3D, 4D and 5D feature subspaces to score subspace anomalies. In our experiments, we focus on the 100D HiCS dataset and removed anomalies that are visible in more than one subspace. Then, we generated five dataset variations with 20, 40, 60, 80 and 100 dimensions that contain exactly the same subspace anomalies of a given dimensionality (2D, 3D, 4D or 5D). The objective is to increase the ratio of irrelevant features as we increase the dimensionality of the datasets. In the 20D HiCS dataset, all features belong to at least one of the subspaces in which anomalies are visible, i.e., they are relevant to the subspace anomalies of the dataset. As the same anomalies are contained in all datasets, the 100D dataset has 20 relevant and 80 irrelevant features. We finally included 18 anomalies per dataset, leading to an anomaly ratio of 2%.

3.1.3 Exathlon time series

The Exathlon benchmark [37] contains repeated executions of 10 different Spark streaming applications (2.3 million samples in total). The dataset is split across 93 different files, called traces, each recording a Spark streaming application run (2,283 features); traces are grouped by application (9 traces per application on average). Out of these traces, we have focused on 34, called disturbed traces, that contain both normal and abnormal samples. They are split into 5 categories based on the type of anomalies they containFootnote 23 in realistic system health monitoring settings: (i) bursty input traces, (ii) bursty input until crash, (iii) stalled input traces, (iv) CPU contention and (v) process failure traces. Unlike the previously described datasets where anomalies are scattered throughout all windows, here we are interested in evaluating the effectiveness of online detectors in spotting point anomalies that occur in a specific time frame, called range-based anomalies. We should stress that in these datasets the anomalies are produced in specific time-frames under the same conditions, and therefore they are clustered. In our experiments, we have used sequence data acquired from only 2 applications (5 datasets in total, one for each type of anomaly), as experiments with time series acquired from all applications proved to be very time-consuming.

3.1.4 Dataset profiling

To be able to explain why the effectiveness of detectors differs even on the same dataset, meta-features are extracted from the real and synthetic datasets using the Python library PyMFE [6]. In addition to the general and statistical meta-features explored in previous meta-learning works [66, 75], we introduce meta-features related to both fullspace and subspace anomalies.

Table 3 Set of statistical and general dataset meta-features

More precisely, our meta-level analysis of experimental results explores the following sub-categories of meta-features (see Table 3): (1) General, reporting the number of samples/features and the anomaly ratio of datasets; (2) Fullspace, characterizing anomalies in the entire feature space of datasets, such as the distance between the centers of mass of the abnormal and normal classes, the number of pairwise correlated features, the number of features whose values are normally distributed, or the difficulty in separating abnormal from normal samples, called Anomaly To Normal Distance (AND); (3) Subspace, indicating the existence of anomalies in subspaces of the features of the dataset. In the case of subspace anomalies, relevant features have a higher correlation with the target variable (i.e., the anomaly class) in comparison to irrelevant ones, according to different test statistics like Wilks’ [42], Pillai’s, Roy’s [54] and Lawley–Hotelling; (4) Value space, characterizing the skewness of the different feature distributions by taking the average/std value of a statistic such as the maximum/minimum.
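As an illustration, the general and statistical meta-features can be obtained with a few lines of PyMFE; the snippet below assumes the MFE interface of recent PyMFE releases, while our custom fullspace/subspace meta-features (e.g., AND or the multivariate test statistics) are computed by separate scripts in the repository:

```python
import numpy as np
from pymfe.mfe import MFE  # PyMFE [6]; interface assumed from recent releases

def extract_meta_features(X, y):
    """General and statistical meta-features of one dataset.
    X: samples x features array, y: binary labels (1 = anomaly, 0 = normal)."""
    mfe = MFE(groups=["general", "statistical"])
    mfe.fit(X, y)
    names, values = mfe.extract()
    meta = dict(zip(names, values))
    # an explanatory meta-feature requiring knowledge of the target: the anomaly ratio
    meta["anomaly_ratio"] = float(np.mean(y))
    return meta
```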

In the sequel, we detail the AND meta-feature introduced in this work:

$$\begin{aligned} DistCM = \frac{\sum _{i=1}^{n} \left( med (N_i) - med (A_i)\right) }{n} \end{aligned}$$
(13)
$$\begin{aligned} AND = \frac{ DistCM - AvgMAD }{ AvgMed }, \end{aligned}$$
(14)

where n is the number of features, \( AvgMAD \) is the average of the features’ Median Absolute Deviations, and \( AvgMed \) is the average of the features’ median values. Equation 14 reveals the normalized difference between the distance separating the normal and abnormal centers of mass and the average MAD. The intuition behind this formula is how far anomalies lie from normal samples w.r.t. the average Median Absolute Deviation of the distribution. As we can see in Fig. 8, the lower the value of \( AND \), the more difficult it is to separate abnormal from normal samples. The above meta-features can be distinguished as predictive or explanatory. The former can be computed from the dataset without knowing the target (e.g., dataset dimensionality) while the latter require knowledge of the target (e.g., anomaly ratio).
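A direct transcription of Eqs. 13 and 14 in Python (our own sketch; it assumes that medians and MADs are computed per feature over the whole dataset):

```python
import numpy as np

def anomaly_to_normal_distance(X, y):
    """AND meta-feature (Eqs. 13-14). X: samples x features, y: 1 = anomaly, 0 = normal."""
    A, N = X[y == 1], X[y == 0]
    n = X.shape[1]
    # Eq. 13: average per-feature distance between the medians of normal and abnormal samples
    dist_cm = np.sum(np.median(N, axis=0) - np.median(A, axis=0)) / n
    # average Median Absolute Deviation and average median over all features
    med = np.median(X, axis=0)
    avg_mad = np.mean(np.median(np.abs(X - med), axis=0))
    avg_med = np.mean(med)
    # Eq. 14: difference between DistCM and the average MAD, normalized by the average median
    return (dist_cm - avg_mad) / avg_med
```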

3.2 Evaluation protocols and metrics

We are interested in assessing the effectiveness of anomaly detectors in separating abnormal from normal samples on the basis of the real-valued scores they produce. Unlike offline detectors working with one large window, online detectors incrementally model and score samples in a stream in several windows.

To ensure a common ground for comparing detectors’ effectiveness across all datasets of our testbed, we rely on widely used metrics for evaluating the detection of top-k point anomalies [15, 32, 45, 49, 72]. Range-based metrics capturing the positional overlap of the discovered anomalous subsequences with the ground truth [61] are valuable when evaluating shallow [11] or deep [48] anomaly detection methods over time series and are left as future work.

We consider an evaluation protocol inspired by Forward Chaining Cross Validation [9] used in time series analytics. It essentially computes an evaluation metric per window in the test partition. This evaluation protocol captures the variance of detectors’ effectiveness across all windows as their model gets updated in a streaming fashion. The effectiveness of score-based detectors can be assessed by two metrics: the Area Under the Receiver Operating Characteristics Curve (ROC AUC) and the Average Precision (AP) [74].
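A minimal sketch of this per-window protocol in Python (the score/update detector interface is hypothetical; our testbed wraps each detector's native implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_per_window(detector, test_windows, test_labels):
    """Score each test window with the current model, record AUC/AP, then update the model."""
    aucs, aps = [], []
    for X_w, y_w in zip(test_windows, test_labels):
        scores = detector.score(X_w)          # hypothetical interface: anomaly score per sample
        if len(np.unique(y_w)) == 2:          # both classes are needed to compute the metrics
            aucs.append(roc_auc_score(y_w, scores))
            aps.append(average_precision_score(y_w, scores))
        detector.update(X_w)                  # incremental model update with the window samples
    # averaging AP over the windows yields the Mean Average Precision (MAP) defined below
    return float(np.mean(aucs)), float(np.mean(aps))
```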

The ROC curve is a 2D plot representing the trade-off between the false positive rate FPR (on the x-axis) and the true positive rate TPR (on the y-axis) for different score thresholds, and AUC is the area under this curve. The higher the AUC ROC, the greater the probability that a detector correctly separates the samples into abnormal and normal classes. A value of 0.5 indicates random classification and a value of 1.0 indicates perfect classification.

Fig. 8
figure 8

High Anomaly to Normal Distance (AND) value indicates readily separated anomalies

AP is used to measure explicitly whether anomalies obtain a higher score than normal samples. The higher the AP, the less overlap exists between the score distributions of the abnormal and normal classes. A value of 0.0 (1.0) indicates that all normal samples (anomalies) scored higher than the true anomalies (normal samples). In contrast to AUC, the value that indicates random classification varies per dataset. In fact, it is equal to the percentage of the positive class [56], which in our case is the anomaly ratio of each dataset. More formally, AP is defined using Precision at k (P@k) as follows (both definitions are illustrated in the sketch after the list):

  • Precision at k (P@k): given a dataset D consisting of n items and a set of anomalies \(A \subset D\), P@k is defined as the proportion of the true anomalies to the top-k potential anomalies ranked by the detection method: \(P@k = \frac{\vert \{x \in A\vert rank(x) \le k\}\vert }{k}.\)

  • Average precision (AP): instead of evaluating the precision individually, this measurement refers to the mean of precision scores over all possible positions: \(\frac{1}{\vert A\vert } \sum _{k=1}^{n} P@k \cdot \mathbbm {1}[x_k \in A].\)
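The two definitions translate directly into the following Python sketch (illustrative only; in the experiments they are applied per window as described next):

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """P@k: fraction of true anomalies among the k highest-scored samples."""
    order = np.argsort(scores)[::-1]             # rank samples by decreasing anomaly score
    return np.sum(labels[order[:k]]) / k

def average_precision(scores, labels):
    """AP: mean of P@k over the ranks k at which a true anomaly is retrieved."""
    order = np.argsort(scores)[::-1]
    ranked_labels = labels[order]
    hits = np.where(ranked_labels == 1)[0] + 1   # 1-based ranks of the true anomalies
    return np.mean([np.sum(ranked_labels[:k]) / k for k in hits])

# Toy usage: three anomalies (label 1) among six samples
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7, 0.2])
labels = np.array([1,   0,   1,   0,   0,   1  ])
print(precision_at_k(scores, labels, 3), average_precision(scores, labels))  # 0.667, 0.867
```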

Given that our stream generation method stratifies anomalies per window, we set k equal to the window size for online detectors and equal to the total number of samples for offline ones. Statistical methods could alternatively be used to estimate k [71]. In this way, we guarantee that each window contains at least one ground truth anomaly (i.e., \(A \ne \emptyset \) in every window). We should stress that, for an online anomaly detector, the final AUC and AP are computed on each window; thus the final metric value is the average over the constituent windows, leading to the Mean Average Precision (MAP).

The two metrics may yield different results for a given order of samples. In order to demonstrate two extreme cases, we consider the example of Fig. 9. Case A exemplifies a low AP despite a high AUC ROC. When the majority of the normal samples (negative class) are scored in the lowest positions, AUC penalizes the detector that produced such an ordering less than AP does. Case B exemplifies the opposite case: there is a higher confidence in the positive class at the higher rankings. AP can be more enlightening, as we are interested in ranking anomalies higher than normal samples, given that AUC ROC on highly imbalanced datasets (i.e., with low anomaly ratio) can be misleading [13, 22, 44, 56].

In order to rank the detectors according to their MAP or AUC scores, we used the AutoRank library in Python [35], which also statistically compares the ranks of the detectors using the nonparametric Friedman test, as well as the post hoc Nemenyi test for pairwise comparisons.
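A sketch of this ranking step with AutoRank (the scores shown are hypothetical and only illustrate the expected input format, i.e., one row per dataset and one column per detector; keyword arguments may differ slightly across AutoRank versions):

```python
import pandas as pd
from autorank import autorank, create_report, plot_stats  # AutoRank [35]

# Hypothetical MAP scores: rows are datasets, columns are detectors.
scores = pd.DataFrame(
    {"X-S":  [0.71, 0.55, 0.63, 0.48, 0.66, 0.59],
     "MCOD": [0.60, 0.52, 0.58, 0.45, 0.61, 0.54],
     "HST":  [0.58, 0.50, 0.61, 0.44, 0.57, 0.52]},
    index=[f"dataset_{i}" for i in range(1, 7)],
)

# AutoRank compares the average ranks (Friedman test with a post hoc
# Nemenyi test in the non-parametric case) at the chosen significance level.
result = autorank(scores, alpha=0.05, order="descending")
create_report(result)   # textual summary of the statistical comparison
plot_stats(result)      # critical-distance diagram such as the ones in Fig. 11
```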

Fig. 9
figure 9

Exemplifying the differences of detectors’ scoring functions in AUC ROC vs AP

4 Effectiveness of detectors

The objective of this series of experiments is to answer three main questions: (a) what is the reliability of decisions made by anomaly detectors compared to a random classifier? (b) how can online and offline detectors be ranked according to their performance on real datasets? and (c) to what extent do online detectors approximate the performance of offline detectors in identifying sub/full space anomalies?

To answer these questions, we benchmark ten online detectors {HST, HSTF, RRCF, MCOD, XSTREAM (X-S), LODA (L-S), RS-HASH, STARE, LEAP, CPOD} and six offline detectors {LOF, KNN\(_W\), iForest (IF), XSTREAM (X-B), OCRF, LODA (L-B)} using twenty-four real datasets (see Sect. 3.1). The AUC and MAP scores of online and offline detectors per dataset are reported in Tables 16 and 15, respectively (see Appendix 3). These scores are obtained by averaging over 30 independent runs (with different sampled data streams) for HST/F, RRCF, X-S, and L-S (non-deterministic detectors) and 1 run for MCOD, RS-HASH, STARE, LEAP, CPOD (deterministic detectors). For the optimal hyper-parameters of each detector per dataset, readers are referred to Appendix 2. After preliminary experiments with a subset of our datasets, we chose 128 as the optimal window size for the vast majority of the detectors (only RS-HASH is favored by larger windows, see Fig. 16). Varying window sizes per algorithm and dataset would complicate a statistically sound comparison of algorithms over the two performance metrics (MAP and AUC) across all datasets of our testbed. In the sequel, we present a synoptic overview of the main findings after analyzing the AUC and MAP scores.

4.1 Random detection

In a first step, we are interested in assessing the reliability of the performance exhibited by the different online and offline detectors on real datasets. To this end, we compare their AUC scores on real datasets (see Table 16) with the performance of the Random Classifier described in Sect. 3.2. In particular, we are looking for dataset characteristics that make detectors perform randomly. Figure 10 depicts the CDF and PMF plots of the distribution of the number of detectors that exhibit similar performance to the Random Classifier per dataset, i.e., with AUC ROC \( < 0.6\). The AUC ROC essentially represents the probability of a random anomaly being scored higher than a normal sample, thus providing a natural comparison with the performance of the random binary classifier: when AUC=0.5, the classifier is not able to distinguish between positive and negative class points. As we can see, 9 out of 16 anomaly detectors are close to the Random Classifier in half of the datasets (see the dotted red line). This observation clearly indicates the serious limitations of existing unsupervised detectors. Interestingly, on datasets that contain implanted anomalies, 11.3 detectors on average fail to outperform the random classifier, whereas on datasets where the minority class is used as the anomaly class, 7.6 detectors fail on average. This highlights that implanted anomalies are more challenging for the detectors than anomalies simulated by the minority class. This is due to the fact that in implanted datasets, anomalies are scattered in the full feature space and their detection requires knowledge of the data distribution drawn from the entire dataset rather than from individual windows. As we can observe in Table 15 in Appendix 3, batch detectors outperform streaming detectors in five out of six datasets with implanted anomalies. On the only dataset with real anomalies (Electricity), 12 detectors fail, but this is mainly attributed to the high anomaly ratio of this dataset (57.5%). In fact, the number of detectors that exhibit random performance is correlated with the anomaly ratio of the datasets, with a p value of 0.039 and a correlation value of 0.422. Consequently, the higher the anomaly ratio, the more likely it is for the detectors to behave randomly. LOF and KNN are the detectors that fail in the fewest datasets (7 and 8 datasets, respectively). Among online detectors, X-S and MCOD have the least random behavior (both fail on 11 datasets out of 24). It is also worth mentioning that no algorithmic family seems to fail in more datasets than the rest.

Fig. 10
figure 10

CDF and PMF of online and offline detectors with a performance comparable to the Random Classifier: More than half of the detectors fail in 50% of the datasets

To capture the degree of difficulty caused by the different types of anomalies contaminating each dataset, we consider the meta-feature Anomaly To Normal Distance (AND) described in Sect. 3.1.4. Over the 14 datasets with the lowest AND values, 11.3 detectors fail on average. Over the remaining 10 datasets with the highest AND values, 6.2 detectors fail on average. Clearly, the higher the distance of abnormal from normal samples w.r.t. the average median absolute deviation, the fewer the detectors scoring close to the Random Classifier. Another meta-feature significantly correlated with the number of detectors exhibiting a performance close to the Random Classifier is the number of distinct highly correlated pairs of features, with a p value of 0.005 and a correlation value of -0.546. The more correlated pairs of features in a dataset, the fewer detectors perform similarly to the Random Classifier.

4.2 Ranking online detectors

We are turning now our attention to the performance comparison of online anomaly detectors.

Table 4 summarizes the total number of wins of the online detectors along with the average difference of their MAP (AUC ROC) score from the winning detector in each dataset. According to the MAP metric, four detectors (X-S, MCOD, L-S and RS-HASH) tie in the first place, each winning in 4 out of 24 datasets. LEAP follows with 3 wins, while HST and HSTF have 2 wins each. CPOD and STARE do not achieve any win. In the datasets where it is outperformed by others, HST exhibits a performance very close to the leader (compared to the rest of the detectors), with a difference of 15.8% on average. Despite its forgetting mechanism, HSTF has a higher average difference from the leader (18.7%) compared to HST, while having the same number of wins. RS-Hash, STARE and LEAP together win in 29% of the datasets, but they exhibit the highest average differences (22.5%-28.8%) from the leader in the remaining ones. Note that 3 of the X-S wins are observed in datasets (3 out of 6) with fewer than 800 samples. Also note that projection-based detectors (X-S and L-S) lead in 4 out of 6 datasets that contain implanted anomalies (MCOD wins in the rest).

Table 4 Number of online detectors’ wins and average difference from the winner (ADW) on real datasets: X-S dominates in both metrics

Table 4 also summarizes the total number of wins of each online detector based on the AUC ROC scores. According to this metric, X-S gets 5 more wins, while L-S and MCOD have 3 and 2 fewer wins, respectively, compared to MAP. STARE obtains its first win compared to MAP. We observe that X-S achieves 4 more wins in datasets where most detectors exhibit similar performance to the Random Classifier. In this case, the performance difference between the first and the second leading detector becomes significant compared to MAP. Also, X-S wins in 5 out of 8 datasets with more than 10,000 samples. In other datasets, we observe that according to AUC ROC detectors may not perform as well as when considering MAP (see Sect. 3.2). More precisely, L-S on InternetAds and X-S on Pima succeed in scoring more anomalies higher than the other detectors, but fail at the lower rankings.

To determine if there are any significant differences between the average ranks of the detectors, we use the nonparametric Friedman test [23]. We reject, with a significance level of 5%, the null hypothesis that all detectors’ performances are equal. Next, we use the post hoc Nemenyi test in order to compare the detectors in pairs. There is a significant difference when the difference between the average ranks of two detectors is higher than a critical distance (CD) of 2.57 (for 10 detectors on 24 datasets at a significance level of 0.05). Figure 11a depicts the statistically significant ranking of the detectors’ performance in the 24 datasets of our testbed according to their MAP. X-S is ranked first, followed by MCOD, while having no statistically significant difference with any of the remaining detectors. HST/F, MCOD and RS-HASH have similar average ranks (around 5). HST surpasses HSTF due to its lower average performance difference from the leader, despite having the same number of wins. L-S is ranked sixth, followed by RRCF. STARE, CPOD and LEAP are the last three detectors, with very small differences in their average ranks.

This overall performance picture does not drastically change when we consider AUC ROC. As we can see in Fig. 11b, X-S now has a statistically significant difference with the last 4 detectors (CPOD, LEAP, STARE, RRCF). X-S’s strong performance on AUC ROC indicates that it consistently ranks most normal samples lower than anomalies. The following 5 detectors (L-S, HST/F, MCOD and RS-HASH) are ranked close to each other, which can be mainly attributed to the class imbalance of the datasets, as AUC ROC provides less information compared to MAP (see Sect. 3.2). L-S is now ranked three places higher compared to MAP, as it becomes the third best performing detector, leaving RS-HASH in the sixth place.

In a nutshell, X-S exhibits the best overall performance when considering both metrics, having a statistically significant difference with half of the detectors using AUC ROC. L-S performs better w.r.t. AUC ROC than MAP, while the additional forgetting mechanism on HST (i.e., HSTF) does not significantly boost its performance. The rest of the detectors remain relatively stable w.r.t. both metrics. We should note that the optimized variants of MCOD, like CPOD or LEAP, are systematically ranked in the last 2 positions w.r.t. both metrics. Regarding density-based detectors, RS-Hash outperforms STARE but both underperform compared to the distance-based MCOD.

Fig. 11
figure 11

Online detectors’ ranking: X-S dominates on both MAP and AUC

4.3 Online vs offline detectors

As a last question in this section, we study to what extent online detectors approximate the performance of offline detectors on real datasets. In essence, we are interested in comparing the AUC or AP performance of detection models continuously updated over several small windows with that of batch models built on one big window.

Again we rely on the nonparametric Friedman test [23] and we reject, with a significance level of 5%, the null hypothesis that all detectors’ performances are equal. Using the post hoc Nemenyi test, there is a significant difference when the difference between the average ranks of two detectors is higher than a critical distance (CD) of 4.05 (for 16 detectors on 24 datasets at a significance level of 0.05). Figure 12a depicts the rank of each detector according to their MAP scores, and Table 5 provides the number of wins per detector along with the average difference from the leading detector in each dataset.

In the following, we contrast the ranks of online and offline detectors belonging to the same family, namely tree-based (HST/F and RRCF vs IF and OCRF), distance/KNN-based (MCOD, CPOD and LEAP vs KNN), density-based (STARE and RS-HASH vs LOF), as well as the stream and batch versions of the two projection-based detectors XSTREAM and LODA.

4.3.1 Tree-based detectors

As we can see in Fig. 12a, IF is the best ranked tree-based detector, and fifth overall. Although IF is ranked first among tree-based detectors, its rank differences with HST(F) and RRCF are less than 0.5 and 1.5, respectively. Surprisingly enough, tree-based online detectors not only approximate well the effectiveness of offline ones but, as in the cases of HST(F) and RRCF, may even outperform offline detectors like OCRF. Despite its more informed decisions regarding the splitting criteria in trees, OCRF exhibits the worst performance. This behavior can be attributed to the fact that, to ensure a fair comparison with online detectors, offline detectors are also trained with both normal samples and anomalies (unlike in the original OCRF paper [29]).

It is important to notice that there is no statistically significant difference between any of the tree-based detectors. This picture does not change based on AUC ROC either. As we can see in Table 5, IF, OCRF and HST get the most wins among tree-based detectors (2 wins each), while HST exhibits the lowest average difference from the leader. The rest of the detectors (HSTF, IF, RRCF) have a similar average difference from the leader, with OCRF having the highest (31.4%).

Fig. 12
figure 12

Detectors’ ranking using MAP and AUC ROC: Online detectors seemingly approximate the performance of offline detectors

Table 5 Number of wins based on MAP of online vs offline detectors on real datasets and their average difference from winner (ADW): LOF achieves the most wins

4.3.2 Distance/KNN and density-based detectors

As we can see in Fig. 12, the offline detectors KNN and LOF are ranked in the first 3 places based on both AUC ROC and MAP. Their rank difference with MCOD is close to 2. As a matter of fact, MCOD and LOF achieve the highest number of wins with 4 each, followed by LEAP and KNN with 2 and 1 wins (see Table 5). After MCOD, RS-HASH is the second best ranked online detector, while CPOD, STARE and LEAP are consistently ranked in the 3-4 lowest places, achieving on average the highest performance difference from the leading detector in each dataset (see Table 5). Furthermore, RS-Hash and MCOD fail (\(AUC < 0.6\)) in fewer datasets compared to CPOD, LEAP and STARE. KNN and LOF show a statistically significant difference to CPOD, LEAP and STARE based on AUC ROC and to CPOD and LEAP based on MAP. It is worth mentioning that LOF wins in all three datasets with the highest number of features (isolet, letter-recognition, InternetAds) with an AUC above 0.75, whereas every online KNN and density-based detector performs randomly on them.

4.3.3 Projection-based detectors

As we can see in Fig. 12a, the batch version of XSTREAM (X-B) is the best performing detector, while its streaming version is placed fourth. There is no statistically significant difference between the performance of the two XSTREAM versions (i.e., their rank difference is close to 1). X-B succeeds in winning in 2 datasets, while X-S does not achieve any win and has a 2.5% lower average difference from the corresponding leader (see Table 5). Interestingly enough, X-S is ranked third and X-B fourth based on AUC ROC (Fig. 12b). In a nutshell, X-S not only approximates well the performance of X-B (and outperforms it according to AUC ROC), but also outperforms almost every other detector in our benchmark (besides LOF and KNN).

Fig. 13
figure 13

Score distribution of HST/F, RRCF and RS-HASH for the 5 datasets from Exathlon benchmark. The red highlighted areas indicate the anomaly range according to the ground truth

Table 6 MAP and AUC ROC scores of online detectors over the 5 time series datasets (1_4_1000000_80, 1_5_1000000_86, 1_2_100000_68, 6_1_500000_65, 6_3_200000_76) contained in Exathlon benchmark

As we can see in Fig. 12a, the streaming (L-S) and batch (L-B) versions of LODA are ranked closely (only RRCF separates them). L-S wins in 4 datasets with a lower average difference (28.0%) from the leader, while L-B is systematically outperformed, with the highest average difference (32.1%) from the leader in each dataset (see Table 5). Clearly, L-S outperforms L-B w.r.t. both metrics.

4.3.4 Discussion

The previous experiments demonstrate that online detectors can effectively approximate the performance of offline ones (e.g., X-S, MCOD and RS-HASH vs KNN and LOF, or HST/F vs IF) and in some metrics even outperform them (i.e., X-S vs X-B). Distance/KNN and density-based offline detectors are ranked high (top 3), while the optimized MCOD variations, CPOD and LEAP, are ranked in the lowest places. Furthermore, projection-based detectors (XSTREAM and LODA) consistently outperform tree-based ones (HST/F, RRCF and OCRF) in either batch or stream mode.

5 Detection of range-based anomalies

In this section we evaluate the performance of online detectors (X-S, L-S, HST/F, RRCF, MCOD, RS-HASH, LEAP, CPOD, STARE) over 5 time series datasets containing a different type of range-based anomalies (see Sect. 3.1.3): (a) \(1\_2\_100000\_68\) contains anomalies caused by bursty input until crash; (b) \(1\_4\_1000000\_80\) contains anomalies caused by CPU contention; (c) \(1\_5\_1000000\_86\) contains anomalies caused by a process failure; (d) \(6\_1\_500000\_65\) contains anomalies caused by bursty input without resulting in a crash; and (e) \(6\_3\_200000\_76\) contains anomalies caused by stalled input from the sender. As in our previous experiments, we thoroughly tuned the hyper-parameters (see Table 11 in Appendix 2) of each detector.

For range-based anomalies, the selection of the window size is more challenging than for point anomalies. This is because the anomaly ranges (and thus the number of anomalies) vary from one window to the other per dataset. In other words, one cannot stratify the datasets contaminated with range-based anomalies as the order of the samples matters. For this reason, we included the window size as a hyper-parameter of each detection algorithm in our cross-validation protocol. In Table 11 (see Appendix 2), we report, per dataset, the window size (64, 128 or 256) that led to the highest AUC for each detector with a performance higher than the random classifier.

Table 6 depicts the MAP and AUC ROC scores of online detectors on the aforementioned datasets. We can easily observe that the majority of detectors (X-S, L-S, MCOD, CPOD, LEAP, STARE) exhibit a random behavior in all 5 datasets. For this reason, we focus on the remaining 4 detectors (HST/F, RRCF, RS-HASH). Notably, HSTF and RRCF are the only two detectors employing a tunable forgetting mechanism as new samples appear. HST/F outperforms all other detectors in the 5 datasets, while exhibiting a close to random performance on the \(6\_3\_200000\_76\) dataset. The addition of the forgetting mechanism on HST slightly improves its performance in 4 out of 5 datasets, as it is able to forget past points when detecting range-based anomalies. We should stress that RS-HASH is more effective than X-S due to its forgetting policy, i.e., it gradually reduces the histogram counts rather than assigning zero counts to all bins when the window slides, as in X-S. Tunable and gradual forgetting mechanisms prove to be more effective for detecting range-based anomalies spanning several time-frames. Moreover, the range-based anomalies of Exathlon are clustered in specific time-frames; thus, after a few frames, a global forgetting mechanism adjusts to the anomalous samples, considering them as normal. The dataset containing anomalies caused by stalled input (\(6\_3\_200000\_76\)) proves to be the most difficult for all these 4 detectors, which all exhibit a random performance on it. As a matter of fact, this is the dataset containing two different interval anomalies, while the application returns to its normal state after a period of abnormal behavior. On the other hand, all 4 detectors achieve the highest scores on the \(1\_2\_100000\_68\) dataset, where anomalies are caused by an increase in the input rate of the sender until the application crashed.

Figure 13 depicts the scores per sample for each of the 4 detectors (HST/F, RRCF and RS-HASH). RS-HASH has a similar performance across 3 out of 5 datasets, where the scores of samples increase as time passes. As we can see in Fig. 13a, HST/F and RRCF are able to give high scores to anomalies from the beginning of the bursty input, where they start to appear. As time passes, the scores of anomalies get closer to those of the normal samples observed before the bursty inputs begin. The forgetting mechanism of HST results in a more immediate drop in the scores’ range. The same behavior can be observed in the \(1\_5\_1000000\_86\) dataset, where the scores of normal samples, after the anomalies have appeared, get smaller over time. Last but not least, the datasets \(1\_4\_1000000\_80\), \(6\_1\_500000\_65\) and \(6\_3\_200000\_76\), which do not result in an application crash, prove to be the most challenging for all detectors. As shown in Fig. 13c–e, RRCF and HST/F are not able to detect the existence of abnormal samples and maintain a high score over a large period of time. This behavior is due to the fact that anomalies appearing in more than 40 consecutive windows allow detectors to adjust to the distribution change. As the application returns to its normal state, normal samples are identified as anomalies until the detectors adjust again to the new distribution change. As a result, data streams processed in a high number of consecutive windows featuring anomalies that do not correspond to an application crash prove to be the most challenging for all online detectors.

Unlike for point anomalies, the vast majority of detectors exhibit a random performance on range-based anomalies. Furthermore, larger intervals of anomalies challenge all detectors more (see Fig. 13c). HST/F proved to be the best overall detector in this experiment, exhibiting the highest AUC ROC and MAP scores.

6 Robustness of detectors

In this experiment we assess the robustness of online {HST, HSTF, RRCF, MCOD, XSTREAM (X-S), LODA (L-S), RS-HASH, STARE, CPOD, LEAP} and offline {LOF, KNN\(_W\), iForest (IF), XSTREAM (X-B), LODA (L-B)} detectors against increasing data and subspace anomaly dimensionality, i.e., anomalies hidden in subspaces with more features. To keep the anomaly ratio constant, we use twenty synthetic datasets derived from HiCS (see Sect. 3.1.2) as follows: for each of the five datasets containing 20 up to 100 features, namely {HiCS20, HiCS40, HiCS60, HiCS80, HiCS100}, we generate four variants contaminated with the same number of 2-d, 3-d, 4-d and 5-d subspace anomalies. Thus, to assess the impact of an increasing number of irrelevant features on the detection of subspace anomalies of a given dimensionality, each dataset has four sub-versions, one per subspace anomaly dimensionality (2-d, 3-d, 4-d and 5-d).

Fig. 14
figure 14

MAP scores over increasing data and subspace dimensionality on synthetic datasets

As we can see in Fig. 14a, in synthetic datasets, projection-based detectors (X-S, L-S) outperform (in terms of MAP) all other online detectors in higher dimensions (80-d and 100-d). MCOD is up to one order of magnitude more effective than tree-based detectors (HST/F, RRCF) in 20 and 40 dimensions, while in higher dimensions all detectors exhibit a similar effectiveness. CPOD and LEAP seem to perform better than MCOD as data dimensionality increases: in some cases (60-d) they are even twice as effective as MCOD. Moreover, the performance of RS-HASH, which is a subspace detector, drops as the number of features increases. HST/F, RRCF and STARE exhibit a similar performance across all dimensions. As expected, distance and density-based detectors like MCOD, CPOD and LEAP underperform as data dimensionality increases. This is due to the fact that the Euclidean distance becomes less effective in higher dimensions [70]. We finally observe that the implemented forgetting mechanism makes HSTF perform slightly better than HST in detecting subspace anomalies.

Overall, with the exception of distance-based detectors, online detectors exhibit a robust behavior under increasing data dimensionality. This behavior can be attributed to various reasons. First, the random projections enable X-S and L-S to effectively reduce data dimensionality while approximately preserving the distances between samples. This mechanism seems to be less affected by the increasing number of features irrelevant to the subspace anomalies of the datasets, compared to the random feature splitting employed by HST/F and RRCF. Second, the notion of neighborhood in MCOD, CPOD, and LEAP becomes less meaningful in higher dimensions [2]. Third, in our benchmark the successive windows consumed by online detectors are fed with stratified anomalies over shuffled normal samples (see Sect. 3.1). It is known that processing data in sub-samples reduces both anomaly swamping and masking effects [43, 77], which may occur when we increase the dataset dimensionality with irrelevant features.
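The distance-preservation argument can be illustrated with scikit-learn's SparseRandomProjection (a generic illustration of sparse random projections, not the projection code embedded in XSTREAM or LODA):

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import SparseRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(875, 100))      # size and dimensionality comparable to HiCS100

# Project onto 20 sparse random directions; pairwise distances are preserved in expectation.
proj = SparseRandomProjection(n_components=20, random_state=0)
X_low = proj.fit_transform(X)

mask = ~np.eye(len(X), dtype=bool)   # ignore the zero self-distances
ratio = pairwise_distances(X_low)[mask] / pairwise_distances(X)[mask]
print(f"low-dim/high-dim distance ratio: mean={ratio.mean():.2f}, std={ratio.std():.2f}")
```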

In Fig. 14b, we can observe that, with the exception of OCRF, the effectiveness of offline detectors decreases while increasing data dimensionality. Clearly, in high-dimensional spaces, all pairs of samples become almost equidistant (distance concentration), and distance and density-based detectors like \(\hbox {KNN}_W\) and LOF struggle to separate abnormal from normal samples. In the experiment reported in [78] this effect has been observed at 100-d, while in our experiments it starts earlier, at 40-d. This is due to the fact that the HiCS datasets are contaminated with subspace instead of fullspace anomalies.

IF fails to isolate subspace anomalies, since by increasing the number of features which are irrelevant to the anomalies, the noise in the tree structures constructed by uniformly sampling features increases. In the experiment reported in [53] this effect has been observed with the addition of 30 irrelevant features, while in our experiments it starts earlier, even with the addition of 20 irrelevant features, due to the contamination of the HiCS datasets with subspace anomalies. This is not the case for OCRF, which is the most robust offline detector against increasing data dimensionality. OCRF actually succeeds in obtaining better scores in higher dimensions (80-d, 100-d) compared to lower ones. This is due to the fact that OCRF relies on an informed choice of split features, which leads to more consistent splits across increasing dimensions. Also, by increasing the dimensions, the volume of the data goes up, which helps OCRF to perform more accurate splits due to its one-class Gini index splitting criterion.

Table 7 Mean and std. of training and update time (in seconds) of online detectors per window over all datasets: X-S is the fastest detector in terms of update time whereas RRCF is the slowest

We now turn our attention to the impact of subspace anomaly dimensionality on the effectiveness of detectors. Figure 14c illustrates the average MAP of online detectors across all HiCS datasets for subspace anomalies of 2-d, 3-d, 4-d and 5-d. In general, all online detectors exhibit a robust behavior against increasing subspace dimensionality. L-S achieves the highest scores on 3-d, 4-d and 5-d subspace anomalies with a performance drop in 2-d. LEAP, CPOD and X-S exhibit higher (M)AP scores for 2-d subspace anomalies, while having a relatively stable performance in higher dimensions. Surprisingly enough, the dedicated subspace anomaly detector RS-HASH does not excel in the HiCS datasets. RS-HASH achieves a better performance on 2-d and 5-d subspace anomalies, while having a lower performance on 3-d and 4-d. MCOD exhibits a better performance than tree-based detectors (HST/F, RRCF) due to the fact that it re-computes the actual (L2) distance between samples in sliding windows but also updates both the content and the number of micro-clusters (i.e., deleting old ones and inserting new ones).

Tree-based online detectors exhibit a similar performance across all anomaly subspaces. RRCF is slightly better than HST/F, as it updates the tree structure of its model with feature subsets that are more relevant to the subspaces of the anomalies contained in a window, instead of only the mass profiles of immutable feature partitions. Clearly, the activation of a forgetting mechanism allows HSTF to better capture 2-d and 3-d subspace anomalies than HST. However, both face difficulties in finding 4-d and 5-d subspace anomalies. Surprisingly enough, the performance of RRCF improves for 4-d subspace anomalies (in HiCS20). The effectiveness of MCOD, HST/F, RRCF and LODA remains robust against increasing subspace dimensionality. As we can see from Fig. 14d, offline detectors exhibit a similar behavior, with ups and downs due to the non-deterministic nature of all detectors besides KNN.

7 Efficiency of detectors

In this section we turn our attention to the efficiency of online detectors. We focus on measuring the CPU time of detectors, while their memory footprint is left as future work. Execution times are measured directly in the native language of the respective implementation (C++ for XSTREAM, Matlab for LODA, Python for RS-HASH and Java for the rest). Compared to the analytical complexities reported in Table 1, the detectors’ runtime depends not only on the execution speed of the programming language used, but also on the optimizations implemented at code level, as well as on the values of the hyper-parameters w.r.t. the data distribution in consecutive windows. The actual runtime of all detectors on real datasets lies between two extremes: X-S as the fastest and RRCF as the slowest detector. For this reason, we rely more on the ranking of the detectors’ efficiency across all datasets of our testbed than on their absolute runtimes per dataset. This is particularly useful when studying the trade-off between the effectiveness and efficiency of the online detectors {X-S, L-S, HST, HSTF, RRCF, MCOD, RS-HASH, STARE, LEAP, CPOD} on real datasets (see Sect. 3.1.1).

Table 7 depicts the mean runtime required for training and updating the model of each detector across all datasets of our testbed. As training time, we report the total time for model construction, while as update time, we report the average time per window for model update and anomaly detection. To provide a common ground for comparison, the window size is the same for all detectors running on the same dataset. This choice is justified by the results of a preliminary experiment on a subset of the datasets, indicating that a window size of 128 seems to be optimal for most detectors besides HSTF and MCOD (see Fig. 16 in Appendix 2). Only the number of windows varies per dataset. Datasets on the x-axis are ranked according to their dimensionality.

The runtime required to construct the initial model of HST/F is one order of magnitude smaller than that of RRCF, mostly in high-dimensional datasets. This is due to the fact that HST/F uses a maximum depth (hyper-parameter) to construct small (perfect) binary trees instead of large (full) binary trees, as in the case of RRCF. Similarly, the model updates of HST/F cost one order of magnitude less than those of RRCF thanks to the constant cost of counter updates in mass profiles, as opposed to the tree re-structuring of RRCF. HSTF’s runtime is slightly greater than HST’s due to the counter decrements for forgetting old samples.

X-S exhibits a stable training time as the dimensionality increases thanks to its sparse random projections, so models are trained on a similar number of projections across all datasets. X-S is also the fastest detector w.r.t. the update time, as its updates have constant complexity (see Table 1) and its implementation benefits from several code-level optimizations. As expected, the two optimized variants of MCOD, LEAP and CPOD, exhibit faster update times (up to \(30\%\) faster) compared to MCOD. Despite the fact that LEAP has quadratic time complexity in the worst case, it does not update micro-clusters (MCOD) or core points (CPOD) when the window slides; in combination with the minimal probing that is frequently activated, LEAP achieves a faster execution time. RS-HASH and STARE are slightly slower, with similar update times. RS-HASH runs on Python which is generally slower compared to Java. STARE’s update times are highly dependent on the values of its hyper-parameters. Furthermore, we observe that HST/F and RRCF run slower on the {Forestcover, http, InternetAds} datasets. This is due to the fact that the maximum height of their binary trees is proportional to the volume of the dataset (\(volume = \#samples * \#features\)). In fact, the higher the volume, the higher the tree depth and therefore the more CPU time is needed to construct and update the trees. In this respect, L-S relies on one of the simplest and fastest update procedures: it creates k one-dimensional histograms using sparse random projectionsFootnote 24 and each histogram is then updated by projecting the training sample onto a vector and updating the corresponding histogram bin. However, L-S’s runtime is penalized by the execution speed of the Matlab code.

Figure 15 illustrates the Pareto frontier capturing the trade-off between the update time (per window) and the MAP of the ten online detectors {X-S, L-S, HST, HSTF, RRCF, RS-HASH, STARE, LEAP, CPOD, MCOD}. A sample belonging to the Pareto set indicates that, in the specific dataset the sample represents, the respective detector is not dominated by any other detector, i.e., no other detector is both faster and more effective on that dataset. Note that not all datasets necessarily contribute a sample to the Pareto set.
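For completeness, the sketch below shows how such a Pareto set can be computed; the (update time, MAP) triples are hypothetical and only illustrate the dominance test (lower update time and higher MAP are better):

```python
def pareto_frontier(points):
    """points: list of (update_time, map_score, label).
    A point is Pareto-optimal if no other point is at least as good in both criteria
    and strictly better in at least one of them."""
    frontier = []
    for t, m, label in points:
        dominated = any(
            (t2 <= t and m2 >= m) and (t2 < t or m2 > m)
            for t2, m2, _ in points
        )
        if not dominated:
            frontier.append((t, m, label))
    return frontier

# Hypothetical (update time in s, MAP, detector@dataset) triples, for illustration only
pts = [(0.02, 0.71, "X-S@ann-thyroid"), (0.15, 0.65, "HST@magic-telescope"),
       (0.40, 0.73, "MCOD@wilt"), (0.05, 0.40, "LEAP@smtp")]
print(pareto_frontier(pts))   # keeps the non-dominated (time, MAP) pairs
```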

Fig. 15
figure 15

Pareto frontier of MAP and Update time (per window) for each detector per dataset: X-S dominates in most datasets in terms of both effectiveness and efficiency

We can easily observe that 5 out of the 9 samples of the Pareto set belong to X-S. Moreover, 5 more samples related to X-S lie very close to the Pareto frontier. This fact confirms the leading performance of X-S in terms of both effectiveness and efficiency. The remaining 4 samples of the Pareto set belong to STARE, CPOD, LEAP and HST. LEAP and CPOD dominate in the dataset with the highest anomaly ratio (Electricity) compared to other detectors with low update times. Note that one of X-S’s samples also concerns the Electricity dataset, as it is the fastest detector with a comparable effectiveness w.r.t. the others in this dataset. LEAP dominates in SMTP in terms of efficiency, although its effectiveness is the worst of all. Last but not least, HST dominates in magic-telescope, achieving the highest MAP score w.r.t. detectors with comparable update times.

8 Meta-learning of XSTREAM’s performance

The experiments in the previous sections showed that X-S is the most effective detector w.r.t. both the AUC and MAP metrics, while its implementation exhibits the fastest update time. For this reason, we investigate which of the datasets’ meta-features discussed in Sect. 3.1.4 are statistically correlated (using Spearman correlation) with the drop or increase of X-S’s efficacy relative to the rest of the online anomaly detectors. Our meta-learning analysis sheds light on the dataset characteristics that boost X-S’s effectiveness or that make detectors exhibit a similar performance.
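The following Python sketch outlines this meta-level analysis (the column names and the input frames are our own assumptions; the actual analysis scripts are available in the repository):

```python
import pandas as pd
from scipy.stats import spearmanr

def correlate_meta_features(meta_df, perf_df, alpha=0.05):
    """meta_df: datasets x meta-features; perf_df: columns 'xs' and 'second_best' holding the
    AUC (or MAP) of X-S and of the runner-up online detector per dataset (assumed names)."""
    ratio = perf_df["xs"] / perf_df["second_best"]
    rows = []
    for mf in meta_df.columns:
        rho, p = spearmanr(meta_df[mf], ratio)
        if p < alpha:                       # keep only statistically significant correlations
            rows.append({
                "meta_feature": mf, "rho_ratio": rho, "p_value": p,
                "rho_xs": spearmanr(meta_df[mf], perf_df["xs"])[0],
                "rho_2nd": spearmanr(meta_df[mf], perf_df["second_best"])[0],
            })
    return pd.DataFrame(rows)
```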

Table 8 Statistically significant correlations of meta-features with the performance of X-S and of the 2\(^{nd}\) best detector, as well as their ratio: The scale variance and non-normality of features, as well as the existence of subspace anomalies, allow X-S to outperform the rest of the detectors

Table 8 reports every meta-feature found to have a statistically significant correlation (i.e., at least \( p value <0.05\)) with the ratio between the best (i.e., X-S) and the second best online detector (or the best online detector in datasets where X-S loses) in terms of AUC and MAP. The first column of the table reports the statistically significant correlation values between the meta-features and the ratio between X-S and the second best online detector per dataset. The second and third columns report the correlations of these meta-features with the raw scores of X-S and of the second best online detector, respectively.

The number of features (G2) is the only general meta-feature with a negative ratio correlation. As the data dimensionality increases, the difference between X-S’s performance and that of the second best online detector tends to decrease. This correlation can be attributed to the fact that in high-dimensional datasets all online detectors score similarly to the Random Classifier. On the other hand, the performance ratio of X-S with the second best online detector is positively correlated with the number of samples (G1). Meta-features (F1-F14) related to the variance of the scale of feature values are also positively correlated with X-S’s performance ratio. As we can see in the second and third columns of Table 8, these meta-features are more correlated with the raw performance of X-S than with that of the second best detector. Hence, when the ratio increases, this is mainly due to an increase in X-S’s performance.

Meta-features such as F6 (SD Max Value), F8 (SD Median Value) and F7 (Mean MEAN Value) are also correlated with the relative performance of X-S. As discussed in Sect. 2.7, its scoring function aims to assess anomalousness across different granularities of feature values. As a result, the diversity of feature scales boosts X-S’s performance compared to the rest of the detectors. In particular, X-S proves to effectively maintain pairwise distances of samples even in datasets whose features exhibit significantly different value scales.
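
As an illustration of how scale-variance meta-features of this kind can be computed (the naming below is ours and may differ from the exact definitions of F6-F8), consider:

```python
import numpy as np

def scale_variance_meta_features(X: np.ndarray) -> dict:
    """X: (n_samples, n_features). Summarize how much the feature scales differ."""
    return {
        "sd_max_value": float(np.std(X.max(axis=0))),             # F6-like: std of per-feature maxima
        "mean_mean_value": float(np.mean(X.mean(axis=0))),        # F7-like: mean of per-feature means
        "sd_median_value": float(np.std(np.median(X, axis=0))),   # F8-like: std of per-feature medians
    }
```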

We observe a strong negative correlation of X-S with the Nr. of Statistical Outliers: X-S’s performance decreases w.r.t. the second best detector as the number of statistical outliers increases. This can be attributed to the fact that datasets with a higher number of statistical outliers are usually of high dimensionality, where, as explained previously, X-S’s performance drops.

The ratio between X-S and the second best performing detector is correlated with the meta-features capturing the existence of subspace anomalies (S1, S2, S3, S4, S5). There is no significant performance degradation of X-S in the presence of subspace anomalies, in contrast to the rest of the online detectors, which exhibit a higher correlation. This indicates that the density-estimation mechanism of X-S (and L-S) is more robust to subspace anomalies (on real datasets) than the mechanisms adopted by the other detectors.

Three meta-features are found to be correlated with the relative performance of X-S w.r.t. MAP (Mean/SD Kurtosis, SD Skewness). As the second best online detector is negatively correlated with those meta-features, we observe a negative correlation with X-S’s performance ratio. Higher kurtosis and skewness values reveal denser clusters of samples. To avoid splitting those dense clusters into different bins, X-S relies on a random shifting mechanism; still, X-S’s performance decreases for higher kurtosis and skewness compared to the rest of the detectors.
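
The intuition behind the shifting mechanism can be illustrated as follows (a toy Python example, not X-S’s actual implementation): a dense cluster that straddles a fixed bin boundary is usually kept in a single bin once the grid is shifted by a random offset.

```python
import numpy as np

rng = np.random.default_rng(1)
cluster = rng.normal(loc=2.0, scale=0.01, size=1000)   # dense cluster straddling the boundary at 2.0
bin_width = 1.0
shift = rng.uniform(0.0, bin_width)                    # random offset of the bin grid

fixed_bins = np.floor(cluster / bin_width).astype(int)              # grid anchored at 0, 1, 2, ...
shifted_bins = np.floor((cluster + shift) / bin_width).astype(int)  # randomly shifted grid

# The fixed grid typically splits the cluster into two bins; the shifted grid usually keeps it in one
print(len(np.unique(fixed_bins)), len(np.unique(shifted_bins)))
```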

The pairwise correlations of X-S with the online detectors are given in Table 17 of Appendix 4. The relative performance of X-S against the two variations of MCOD (CPOD and LEAP) and RS-HASH is positively correlated with the meta-features capturing the variance of features’ values (F1-F14). Clearly, feature diversity positively affects X-S’s performance. Furthermore, the relative performance of X-S improves as the number of samples increases compared to RS-HASH and STARE. On the other hand, X-S under-performs on datasets of high dimensionality and anomaly ratio compared to those density-based detectors. Recall that in high-dimensional datasets the majority of detectors exhibit a random behavior. No correlations are observed between X-S and HST, HSTF, as well as RRCF.

9 Lessons learned

In this section we report the main conclusions drawn from the experiments in our work.

The majority of online and offline detectors classify samples in more than half of the datasets with a performance comparable to the Random Classifier. As expected (see Sect. 4.1), detectors typically exhibit a good performance when the abnormal and normal classes are well separated (high AND values), even in datasets of high dimensionality (\(>100\)). However, this is not always the case in real datasets contaminated with implanted anomalies. Unlike distance/KNN and density-based detectors that calculate sample distances over all features, stochastic detectors are favored by the presence of many correlated feature pairs (i.e., multicollinearity), as they consider fewer, more informative features w.r.t. the anomalous class. Unfortunately, the random projections of XSTREAM and LODA become less effective in most datasets of high dimensionality (\(\ge 257\), see Table 16), where both detectors exhibit a performance close to the random classifier.
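
For reference, a simple way to quantify such multicollinearity as a meta-feature (the 0.9 threshold is our own choice, not necessarily the one used in the study) is the fraction of feature pairs with a high absolute Pearson correlation:

```python
import numpy as np

def correlated_pair_ratio(X: np.ndarray, threshold: float = 0.9) -> float:
    """Fraction of feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = np.corrcoef(X, rowvar=False)         # (d, d) correlation matrix over the features
    iu = np.triu_indices(corr.shape[0], k=1)    # upper triangle: each pair counted once
    return float(np.mean(np.abs(corr[iu]) > threshold))
```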

Online detectors approximate well, and in some cases outperform, their offline counterparts. As shown in Sect. 4.3, a distance-based detector like MCOD is ranked very close to KNN and LOF, while a tree-based detector like HST approximates well the performance of IF, dominating the other member of this algorithmic family, namely OCRF. Clearly, offline distance/KNN and tree-based detectors are favored by feeding the data into a single large window; KNN and LOF have more information at their disposal regarding the neighborhood of samples, while IF is able to better profile normal samples and thus mitigate the effect of anomaly swamping [43]. Finally, the online version of XSTREAM not only outperforms all other online detectors but is also more effective than its offline version.

Tunable and gradual forgetting mechanisms proved to be more effective for range-based anomalies spanning several time-frames. As shown in Sect. 3.1.3, only HST/F, RRCF and RS-HASH exhibit a performance better than the random classifier. RRCF and HSTF have a tunable forgetting mechanism and RS-HASH a gradual one. These mechanisms are essential for detecting anomalies over longer ranges, in contrast to global forgetting mechanisms such as XSTREAM’s, i.e., resetting all bin counts to zero when the window slides. This is due to the fact that range-based anomalies are clustered; global forgetting therefore makes the algorithm rapidly adapt to the new data distribution of the anomalies, assigning them scores close to those of normal samples.
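
A minimal sketch of the two forgetting styles discussed above (our own simplification; the gradual mechanism is modelled here as an exponential decay of counts, which is an assumption rather than RS-HASH’s exact procedure) contrasts a global reset of bin counts with a tunable decay:

```python
def global_forgetting(bins: dict) -> None:
    """XSTREAM-style global reset: drop all history when the window slides."""
    for b in bins:
        bins[b] = 0

def gradual_forgetting(bins: dict, decay: float = 0.9) -> None:
    """Gradual, tunable forgetting: old counts fade instead of being discarded at once."""
    for b in bins:
        bins[b] *= decay
```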

Online anomaly detectors exhibit poor performance on the identification of subspace anomalies. As shown in Sect. 6, the majority of online detectors are robust against an increasing subspace or data dimensionality. The best performing strategy was the random projection performed by L-S. Unlike fullspace anomalies contained in real datasets, anomalies hidden in feature subspaces proved to be a great challenge for all online detectors, as the highest MAP value was 0.16. This observation highlights the need for online detectors that are more effective at identifying subspace anomalies.

XSTREAM exhibits the best trade-off between the efficiency and the effectiveness of continuously updated detection models. As shown in Sect. 7, XSTREAM proves to be both faster w.r.t. update time and more effective in the majority of the datasets. The different trade-offs achieved by online detectors can be explained by how well their model (bins or histograms over random projections, mass profiles in trees, micro-clusters, core points, etc.) summarizes the data distribution of a stream, as well as by how effective the incremental model updates are across successive windows. Finally, we should stress that LODA and RS-HASH share the same updating principles as XSTREAM, and the difference in their absolute runtimes is mostly related to the programming language and the code-level optimizations. On the other hand, detectors such as LEAP, CPOD, MCOD and STARE seem to run under best-case conditions on the datasets of our testbed, which lets them bypass costly neighbor searches by applying the minimal probing principle (LEAP), exploring only cluster centers (MCOD) or cores (CPOD), or skipping stationary regions (STARE).

The non-normality and scale variance of data features favor XSTREAM over the rest of the detectors on real datasets contaminated with point anomalies. As shown in Sect. 8, when the scale of the feature values exhibits a high dispersion (e.g., a high std. of mean values), the difference in AUC performance between XSTREAM and the second best detector is amplified (positive ratio correlation) in favor of XSTREAM (higher raw correlation coefficient than the second best detector). In case of features exhibiting non-normal patterns, such as high skewness and kurtosis, XSTREAM’s MAP performance decreases (negative raw correlation) but is less affected than that of the second best detector. Both distributional characteristics imply the existence of dense clusters, concentrated either on one side of the distribution or around the peak in case of a leptokurtic distribution. It is important that these dense clusters are not separated by the detectors’ partitioning mechanism. XSTREAM handles this issue more effectively than other detectors by employing a shifting mechanism that strives to assign clustered samples to the same bins.

10 Summary and future work

In this work, we conducted a large-scale experiment to address several open questions in the literature. We compared the effectiveness and efficiency of 6 online anomaly detectors along with 5 offline counterparts over 24 real datasets and 1 synthetic dataset with 20 variations. None of the previous empirical studies [3, 4, 15, 24, 28, 47, 62] compared the performance of online detectors over a fairly complete collection of the datasets used to evaluate the original algorithms. We should note that [63] focuses exclusively on the efficiency of online proximity-based algorithms without contrasting the effectiveness of algorithms belonging to different families (proximity- vs tree- vs projection-based). Unlike the aforementioned works, the variations of the synthetic dataset used in our work make it possible to evaluate the behavior of detectors on anomalies visible exclusively in subsets of features.

To fairly compare online and offline algorithms, we relied on both the ROC AUC and MAP evaluation metrics, which highlight different aspects of their underlying anomalousness scoring functions. In this respect, we used an evaluation protocol inspired by time series, namely Forward Chain Cross Validation [9]. Rather than using the default hyper-parameters provided by the original algorithms, as in the majority of the empirical studies, we searched for optimal hyper-parameter values per dataset. To reveal valuable insights regarding detectors’ behavior, we followed a principled methodology to analyze the raw results of our experiments. In particular, we performed a meta-learning analysis of different performance aspects of the detectors (e.g., randomness of decisions, trend of outperforming behavior) w.r.t. concrete characteristics of the datasets (e.g., ratio of correlated features, feature normality, value range dispersion).
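
For clarity, the following sketch shows the generic forward-chaining protocol (not the authors’ exact splitting code): each fold trains on all samples seen so far and evaluates on the next block, so the temporal order of the stream is never violated.

```python
def forward_chain_splits(n_samples: int, n_folds: int):
    """Yield (train_indices, test_indices) pairs over an ordered stream."""
    fold = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = list(range(0, k * fold))              # everything seen up to this fold
        test = list(range(k * fold, (k + 1) * fold))  # the next block of the stream
        yield train, test

# Usage: 4 folds over 1000 time-ordered samples
for train, test in forward_chain_splits(1000, 4):
    print(len(train), len(test))   # 200/200, 400/200, 600/200, 800/200
```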

As future work, we are interested in contrasting the performance of the ’shallow’ detection methods studied in this work with deep learning algorithms [48], especially for high-dimensional datasets that favor a random behavior of detectors. Moreover, we would like to investigate how the behavior of detectors is affected by the anomaly generation method, i.e., whether anomalies are clustered or scattered across normal samples in the datasets.