
1 Introduction

Anomaly Detection (AD) is a machine learning task of automatically identifying anomalous data instances using algorithms. Anomalies (also referred to as outliers) are data instances that are significantly different from most other data, raising the suspicion that they were generated by a mechanism different from the normal or expected one. AD has many applications, such as intrusion detection in computer networks, fraud detection in banking, and the detection of illegal activities (e.g., drug trafficking, money laundering) in national intelligence/security. In the literature, AD problems have been solved using three learning approaches [8, 12]: (i) Supervised learning: A classification model is learned using training instances from both normal and anomalous classes to make predictions for test data; (ii) Unsupervised learning: Given data instances (which may contain anomalies) are ranked directly based on some outlier scores, i.e., no training is involved; and (iii) Semi-supervised learning: A profile of normal/expected behaviour is learned from labelled training samples of normal data only, and test data are ranked based on how well they comply with the learned profile of normal data.

Regardless of the learning approach used, existing AD methods have limitations that restrict their wide applicability in practice. Supervised methods have the following major issues [12]: (i) it can be very expensive or even impossible to obtain labelled anomalous training samples in many real-world applications; (ii) even when possible, anomalies are extremely rare, resulting in a class imbalance problem; and (iii) a few known anomalies are not enough to generalise the characteristics of all possible anomalous patterns, because anomalies can be anywhere in the feature space. Though techniques such as minority class (anomaly) oversampling, majority class (normal) under-sampling, and algorithmic adjustments [18] are used to alleviate these issues, their effectiveness in practice is limited, because they assume that unseen/future anomalies are generated from the same distribution as previously seen/observed anomalies. This is often not the case in practice; new anomalies can be very different from previously seen anomalies.

Un/semi-supervised approaches do not require labelled anomalous training samples. Unsupervised approaches do not require training samples at all. Assuming anomalies are few and different, they use distance/density-based scores to rank the given data (which may contain anomalies) directly. They may perform poorly when this assumption does not hold, i.e., when there are far too many anomalies [8, 12]. Semi-supervised approaches do not make such an assumption. Because the vast majority of observed data are normal, normal training data can be obtained easily. Thus, we focus on the semi-supervised AD approach in this paper.

Most existing un/semi-supervised AD methods assume that data have numeric features. However, in many real-world applications, data have both numeric (e.g., age, height) and categorical (e.g., gender, nationality) features. The common practice is to convert categorical features into numeric features using techniques such as one-hot encoding [15]. Each categorical label (e.g., Australian for nationality) is converted into a binary feature with the value of 1 (if the nationality is Australian) or 0 (otherwise) and treated as a numeric feature. A categorical feature with n possible values is thus converted into n binary numeric features, of which exactly one has the value of 1 for each instance. Because of this, the original categorical features and the original numeric features contribute differently to AD models, which can degrade the performance of AD methods. There are also methods proposed for categorical data only [26]; numeric features can be converted into categorical features through discretisation [15]. However, most AD methods developed for categorical data have high computational complexity, limiting their use in large real-world datasets.
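As a small illustration of this conversion (our own toy example, not tied to any dataset used in this paper), the following Python snippet one-hot encodes a categorical feature with three labels:

```python
# Toy one-hot encoding of a categorical feature (nationality).
nationalities = ["Australian", "Nepali", "Australian", "Indian"]
labels = sorted(set(nationalities))          # ['Australian', 'Indian', 'Nepali']
one_hot = [[1 if v == lab else 0 for lab in labels] for v in nationalities]
# one_hot == [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
# Each instance has exactly one 1 among the three new binary "numeric" features.
```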

Most existing AD methods examine data instances globally, i.e., w.r.t. the entire dataset. They can detect global anomalies that exhibit anomalous characteristics in the entire dataset. However, they cannot detect local anomalies that appear to be normal when examined globally but exhibit anomalous characteristics w.r.t. their local neighbourhood (i.e., in the local context). For example, in Fig. 1(a), \(a_4\) and \(a_5\) have significantly lower density than the normal cluster \(C_1\) in their neighbourhood, but have the same density as many instances in the normal cluster \(C_2\). Most existing methods fail to detect them as anomalies. Real-world data have complex structures, and instances may exhibit characteristics that look normal from a global perspective but suspicious in their local contexts. There are some methods that examine anomalies w.r.t. their localities (e.g., [5, 11]), but they are limited to numeric data only.

To summarise, most existing AD methods do not work well in practical applications due to the following three main issues:

  • Lack of sufficient examples of known anomalies: It is not possible to have a good representative sample of known anomalies to generalise characteristics of all possible anomalies.

  • Global view of anomalies: Data instances often exhibit suspicious characteristics w.r.t. their neighbourhood (in the local context) while appearing to be normal in the global context.

  • Limitations in handling mixed types of data features: Most real-world applications have numeric and categorical features, but most existing methods cannot handle mixed types of features well.

In this paper, we present a simple idea to address the above-mentioned issues and introduce a new semi-supervised AD method. Instead of using one global model, we propose to partition the data space into many regions and build an AD model in each region using only the data falling in that region, i.e., many local AD models are built. To make a prediction for a test instance, the AD model learned on the region where it falls is used. Instead of relying on a local region from a single partitioning of the space, we propose to create multiple partitions of the data space and use an ensemble of local models learned on local regions from each partition. This exploits the benefits of ensemble learning to consider sufficient locality around the test instance. Though there are not many AD models that work well with mixed data, there are classifiers, such as the traditional Decision Tree (DT) [22], that can handle numeric and categorical features directly. We use DTs in local regions for AD without using labelled anomalies by adding synthetic data. Our results show that an Ensemble of Local Decision Trees (ELDT) produces better and more consistent detection results compared to popular state-of-the-art AD methods, particularly in datasets with mixed types of features.

2 Related Work

In the semi-supervised approach, a model is learned from a training set D of N instances belonging to the normal class only and evaluated on a test set Q, which is a mixture of normal and anomalous data. Let \(\mathbf{x}\) be a data instance represented as an M-dimensional vector \(\langle x_1, x_2, \cdots , x_M \rangle \), where each component represents its value of a feature that can be either numeric \(x_i \in \mathbb {R}\) (\(\mathbb {R}\) is the real domain) or categorical \(x_i \in \{v_{i_1}, v_{i_2}, \cdots , v_{i_w}\}\) (where \(v_{i_j}\) is a label out of w possible labels for feature i). Let \(F=\{A_1, A_2, \cdots , A_M\}\) be the set of data features, also called attributes.

In this section, we review prior work related to this paper, including AD methods for numeric and categorical data, and ensemble approaches for AD.

2.1 Methods for Numeric Data

Because anomalies are few and different, they are expected to have feature values that are significantly different from those of most data and to lie in low-density regions. Most methods for numeric data therefore use distance/density-based anomaly scores to rank data according to their degree of outlying behaviour, e.g., Nearest Neighbour (NN) or Support Vector (SV) based methods.

In the NN-based methods, the anomaly score of \(\mathbf{x}\in Q\) is estimated based on the distances to its kNNs in D, where k is a user-defined neighbourhood parameter. Local Outlier Factor (LOF) [11] and the \(k^{th}\) NN distance [6] are the most widely used NN-based methods. Being different from normal instances, anomalies are expected to have larger distances to their kNNs than normal instances. These methods require computing the distances of \(\mathbf{x}\) to all instances in D, which can be computationally expensive when D is large. Though the nearest neighbour search can be sped up by using indexing schemes such as the k-d tree [7], their effectiveness reduces as the number of dimensions increases, and they become ineffective in high-dimensional problems [19]. Sugiyama and Borgwardt (2013) [25] showed that a nearest neighbour search in a small subset \(\mathcal {D}\subset D\) (\(|\mathcal {D}| =\psi \ll N\)) is sufficient. They proposed a simple but very fast anomaly detector called Sp, where the anomaly score of \(\mathbf{x}\) is its distance to its nearest neighbour (1NN) in \(\mathcal {D}\). It has been shown that Sp with \(\psi \) as small as 25 produces results competitive with LOF but runs several orders of magnitude faster [25].
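To make the Sp idea concrete, here is a minimal Python sketch (our own illustration, not the authors' implementation): the anomaly score of a test instance is its Euclidean distance to its nearest neighbour in a random subsample of \(\psi\) training instances. The function and variable names are ours.

```python
import numpy as np

def sp_scores(D, Q, psi=25, seed=None):
    """Sp anomaly scores: distance of each test instance in Q to its 1NN
    in a random subsample of size psi drawn from the training data D."""
    rng = np.random.default_rng(seed)
    sample = D[rng.choice(len(D), size=min(psi, len(D)), replace=False)]
    # Pairwise Euclidean distances between test instances and the subsample.
    dists = np.linalg.norm(Q[:, None, :] - sample[None, :, :], axis=2)
    return dists.min(axis=1)   # larger distance => more anomalous

# Example usage with synthetic numeric data.
rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 5))                      # normal training data
Q = np.vstack([rng.normal(size=(95, 5)),            # mostly normal test data
               rng.normal(5, 1, size=(5, 5))])      # a few shifted anomalies
scores = sp_scores(D, Q, psi=25, seed=1)
```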

The SV-based methods define a boundary around normal (expected) data and identify a set of data instances lying on the boundary, called Support Vectors (SVs). They compute the pairwise similarities of data using a kernel function; the Gaussian kernel, which is based on Euclidean distance, is a popular choice. In the testing phase, the anomaly score of \(\mathbf{x}\in Q\) is estimated based on its kernel similarities with the SVs. One-Class Support Vector Machine (OCSVM) [23] and Support Vector Data Description (SVDD) [27] are widely used methods in this class. The training process is computationally expensive for large D because of the pairwise similarity calculations.
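For comparison, a hedged usage example of one-class SVM scoring with scikit-learn's OneClassSVM (an off-the-shelf implementation, not the code used in the experiments of this paper); the parameter values below are purely illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
D = rng.normal(size=(500, 5))                       # normal training data
Q = np.vstack([rng.normal(size=(45, 5)),            # mostly normal test data
               rng.normal(6, 1, size=(5, 5))])      # a few anomalies

# Gaussian (RBF) kernel; nu roughly bounds the fraction of support vectors.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(D)

# Lower decision_function values indicate more anomalous instances.
anomaly_scores = -ocsvm.decision_function(Q)
```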

2.2 Methods for Categorical Data

Despite the widespread prevalence of categorical/qualitative data in real-world applications, AD in categorical data has not received much attention in the research community [26]. The common practice is to convert categorical features into numeric features and use methods designed for numeric data. There are some methods proposed in the literature specifically for categorical data, based on frequencies of categorical labels, information theory and data compression/encoding [26]. They are computationally expensive to run on large datasets and often do not perform better than converting categorical data into numeric data and using methods designed for numeric data [4].

He et al. (2005) [17] proposed a method for categorical data based on frequent patterns. The intuition is that an instance is more likely to be an anomaly if it contains few or none of the frequent patterns. Akoglu et al. (2012) [2] proposed a pattern-based compression technique called COMPREX. The intuition is that the higher the cost of encoding \(\mathbf{x}\), the more likely it is to be an anomaly. Aryal et al. (2016) [4] revisited the Simple Probabilistic Anomaly Detection (SPAD) method, where the multi-dimensional probability is estimated as the product of one-dimensional probabilities, and showed that it works quite well compared to more complex state-of-the-art methods, such as LOF and One-Class SVM, in datasets with categorical-only and mixed types of features. It uses the frequencies of categorical labels in each feature individually, assuming features are independent of each other. Most of these methods for categorical data, except SPAD, have high time and/or space complexities, limiting their use to small and low-dimensional datasets. SPAD is simple and arguably the fastest AD method.
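To illustrate the core idea of SPAD-style scoring (a product of per-feature probabilities estimated from training frequencies), here is a minimal Python sketch; the simple +1 smoothing used below is our own simplification and may differ from the exact estimator in [4]:

```python
import numpy as np
from collections import Counter

def spad_fit(D):
    """Per-feature label counts from categorical training data D
    (a list of equal-length tuples of labels)."""
    m = len(D[0])
    counts = [Counter(row[i] for row in D) for i in range(m)]
    return counts, len(D)

def spad_score(x, counts, n):
    """Anomaly score: negative sum of log per-feature probabilities,
    with +1 smoothing so unseen labels get a small non-zero probability."""
    logp = 0.0
    for i, v in enumerate(x):
        p = (counts[i].get(v, 0) + 1) / (n + len(counts[i]) + 1)
        logp += np.log(p)
    return -logp   # higher score => more anomalous

D = [("red", "small"), ("red", "small"), ("blue", "small"), ("red", "large")]
counts, n = spad_fit(D)
print(spad_score(("red", "small"), counts, n))   # low score (normal-looking)
print(spad_score(("green", "huge"), counts, n))  # high score (anomalous-looking)
```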

2.3 Ensemble Approaches

To solve a given task, ensemble methods build multiple models by applying an algorithm to different subsets of the given data (data sampling or feature sampling) or by using different parameter settings of the algorithm [13]. The final decision of the ensemble is an aggregation of the decisions of its individual models. The main idea is that the models are different and make different errors, so that they compensate for each other's weaknesses and produce better overall performance than any individual model. Ensemble learning has been widely studied for classification problems, and various frameworks have been proposed that can be used with different base classifiers [9, 10, 28]. However, the use of ensemble learning to solve the AD problem is rather limited [1]. Ensemble-based AD methods build multiple models using subsamples of data and/or subsets of features, e.g., Lazarevic and Kumar (2005) [20] and Zimek et al. (2013) [29] used LOF on random subsets of features (i.e., subspaces) and data (i.e., subsamples), respectively. AD techniques such as iForest [21] and usfAD [3] use a collection of random trees to partition the data space using small subsamples of data until the instances are isolated; each tree uses a small subset of features. The main idea of these methods is that anomalies are expected to be isolated early in the trees and thus lie in leaves at shallow depths. They run very fast as they do not require pairwise distance calculations. All these ensemble-based methods assume that data have numeric features only; most of them are not applicable to data with categorical-only or mixed features. Also, because many of them build multiple global models, they cannot detect local anomalies.
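As an illustration of this subsample-based random-tree idea, a hedged usage example with scikit-learn's IsolationForest (a library implementation, not the one evaluated in Sect. 4); the parameter values follow iForest's commonly used defaults (100 trees, subsamples of size 256):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
D = rng.normal(size=(5000, 8))                      # normal (numeric) training data
Q = np.vstack([rng.normal(size=(90, 8)),
               rng.normal(6, 1, size=(10, 8))])     # test data with a few anomalies

# 100 random trees, each built on a random subsample of 256 instances.
iforest = IsolationForest(n_estimators=100, max_samples=256, random_state=0).fit(D)

# score_samples returns higher values for normal data; negate for anomaly scores.
anomaly_scores = -iforest.score_samples(Q)
```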

3 Our Proposal: An Ensemble of Local Decision Trees

To address the three limitations of existing AD methods in practical real-world applications discussed in Sect. 1, we develop a new ensemble learning framework for anomaly detection based on the idea of Feating [28]. First, we explain Feating for classification as used by Ting et al. (2011) [28] (Subsect. 3.1) and then discuss how we adapt it for anomaly detection (Subsect. 3.2).

3.1 Feating for Classification

Feature-Subspace Aggregating (Feating) [28] is an ensemble framework developed for classification that uses an ensemble of local models. It is a feature bagging approach, i.e., ensemble learning using subsets of features of fixed size \(m < M\) (m-dimensional subspaces). In each subspace \(S\subset F\) with \(|S|=m\), rather than building a global model trained on the entire training set, it first partitions the subspace using a tree structure called a "Level Tree" (LT). At each node of the tree, the space is partitioned using one of the m features in S. Each feature in S is used only once in the tree, resulting in a maximum tree height of m. LTs can handle both numeric and categorical features. For a numeric feature, the space is divided into two regions by a cut-point selected in the same manner as in an ordinary decision tree [22], based on information gain. For a categorical feature with w possible values, the space is partitioned into w regions, one for each categorical label. Further partitioning of a node stops when the node is pure (i.e., all its instances belong to the same class), has fewer than minPts data instances, or reaches the maximum height of m. In each impure leaf node with more than minPts samples, a classifier is learned from the training samples falling in the node only, i.e., a Local Model (LM) is built. For the remaining leaves (pure or with fewer than minPts instances), class probabilities are recorded based on the training samples they contain. In the testing phase, a test instance is traversed from the root to a leaf in each LT. If an LM was built in that leaf, its predicted class probabilities are used; otherwise, the recorded class probabilities are used. The final prediction is based on the aggregated class probabilities from multiple LTs. The enumerated version of Feating builds \({M\atopwithdelims ()m}\) LTs, which has a large space complexity, making it infeasible in problems with large M (high-dimensional applications). To overcome this issue, Ting et al. (2011) introduced a randomised version, where only \(t\ll {M\atopwithdelims ()m}\) random subspaces of size m are used. It greatly improves the time and space requirements without any significant compromise in accuracy.
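The following Python sketch illustrates the Level Tree partitioning described above; it is our own simplification (not the Feating code of [28]): the feature order is fixed in advance, numeric features are split at the cut-point maximising information gain, categorical features are split into one branch per label, and growth stops at purity, at fewer than minPts instances, or when all m features have been used. Details such as routing test instances with unseen categorical values are omitted.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_numeric_cut(values, labels):
    """Cut-point maximising information gain, as in an ordinary decision tree."""
    xs = sorted(set(values))
    best_cut, best_gain = None, -1.0
    for cut in ((a + b) / 2 for a, b in zip(xs, xs[1:])):
        left = [l for v, l in zip(values, labels) if v <= cut]
        right = [l for v, l in zip(values, labels) if v > cut]
        gain = entropy(labels) - (len(left) * entropy(left)
                                  + len(right) * entropy(right)) / len(labels)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut

def build_level_tree(X, y, features, is_categorical, min_pts):
    """Partition the subspace using one feature per level (fixed order is our
    simplification); leaves keep the local training data for later use."""
    if not features or len(X) < min_pts or len(set(y)) == 1:
        return {"leaf": True, "X": X, "y": y}
    f, rest = features[0], features[1:]
    if is_categorical[f]:
        branches = {v: [i for i, x in enumerate(X) if x[f] == v]
                    for v in {x[f] for x in X}}             # one branch per label
        node = {"leaf": False, "feature": f, "cut": None}
    else:
        cut = best_numeric_cut([x[f] for x in X], y)
        if cut is None:                                     # single value: skip feature
            return build_level_tree(X, y, rest, is_categorical, min_pts)
        branches = {"<=": [i for i, x in enumerate(X) if x[f] <= cut],
                    ">":  [i for i, x in enumerate(X) if x[f] > cut]}
        node = {"leaf": False, "feature": f, "cut": cut}
    node["children"] = {k: build_level_tree([X[i] for i in idx], [y[i] for i in idx],
                                            rest, is_categorical, min_pts)
                        for k, idx in branches.items()}
    return node
```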

Fig. 1. An example dataset and the definition of local regions. (a) \(C_1\) and \(C_2\) are clusters of normal data, whereas \(a_1, \cdots , a_5\) are anomalies. (b) Note that only half of the normal data (red points) are used in the training process as "+ve" class samples. Blue points, which are uniformly generated synthetic points, are "-ve" class samples. \(R_1\), \(R_2\), \(R_3\) and \(R_4\) are local regions created by a Level Tree. Note that local classifiers are built in regions \(R_1\) and \(R_4\) only. (Color figure online)

Fig. 2. An example of a Level Tree in a 3-dimensional subspace \(S=\{A_1, A_2, A_3\}\). Note that \(A_1\) is a categorical feature with three possible values (\(a_1,b_1,c_1\)), and \(A_2\) and \(A_3\) are numeric features.

3.2 Feating for Anomaly Detection

We propose the following adjustments to use Feating in semi-supervised AD, where there are no labelled anomalies. The LT building process uses class information, but in this case we do not have labelled anomalies; we are given D, which is a set of normal data only. To build each LT, we propose to treat the given data D as the "+ve" class and add the same number of synthetic points as the "-ve" class, as done by Shi and Horvath (2006) [24] to use Random Forest in unsupervised problems. The value of a synthetic point in each feature is selected uniformly at random from the range of possible values, i.e., for a numeric feature, values are selected uniformly at random within the range defined by instances in D, and for a categorical feature, values are selected randomly from the possible labels. With the samples from the "+ve" and "-ve" classes, an LT can be built exactly in the same way as for the classification task. An example of a space partitioned into local regions is shown in Fig. 1(b). Note that we add new "-ve" class samples for each LT. It is possible to have two LTs using the same subset of m features if \(t > {M\atopwithdelims ()m}\), but the two LTs will still be different because of the new set of "-ve" class samples.
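A minimal sketch of how the uniformly distributed synthetic "-ve" class described above could be generated; the column-wise data representation and the function name are our own assumptions:

```python
import numpy as np

def synthetic_negatives(D, categorical, n_points=None, seed=None):
    """Generate synthetic '-ve' class points: for numeric features, sample
    uniformly within the observed [min, max] range of D; for categorical
    features, sample uniformly from the observed labels.
    D is a list of feature columns; `categorical` flags categorical columns."""
    rng = np.random.default_rng(seed)
    n_points = n_points or len(D[0])      # default: as many negatives as positives
    columns = []
    for col, is_cat in zip(D, categorical):
        col = np.asarray(col)
        if is_cat:
            columns.append(rng.choice(np.unique(col), size=n_points))
        else:
            columns.append(rng.uniform(col.min(), col.max(), size=n_points))
    return columns

# Example: one numeric feature (age) and one categorical feature (nationality).
age = [23.0, 31.0, 45.0, 52.0]
nationality = ["AU", "AU", "NZ", "AU"]
neg_age, neg_nat = synthetic_negatives([age, nationality], [False, True], seed=0)
```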

Algorithm 1.

Once the space is partitioned into regions, the idea is to build a classifier in each region to separate the "+ve" class (given training data, which are normal) from the "-ve" class (synthetic data). Anomalies are expected to have low probabilities of belonging to the "+ve" class (normal data). A local classifier is built in each leaf with more than minPts samples from the "+ve" class, and the current class probabilities are recorded in the other leaves. To ensure a balanced class distribution when building a classifier in a local region, we first remove all "-ve" class samples (synthetic points) and then add the same number of new synthetic points ("-ve" class) as there are "+ve" class samples in the region. The synthetic points are sampled uniformly at random from the range of possible values within the region. With equal numbers of "+ve" class and (newly added) "-ve" class samples, a local classifier is built. We use a Decision Tree (DT) [22], which can handle numeric and categorical features directly, as the local classifier. An example of a Level Tree in a 3-dimensional subspace is provided in Fig. 2, and the procedures to build LTs and local DTs are provided in Algorithms 1, 2 and 3. In the testing phase, the anomaly score of a test instance \(\mathbf{x}\) is estimated as the aggregated \(P(+ve|\mathbf{x})\) over the t local models. Anomalies are expected to have lower aggregated scores than normal data. We call the proposed method the "Ensemble of Local Decision Trees (ELDT)".

Algorithm 2.
Algorithm 3.
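To tie the pieces together, a hedged sketch of the testing phase: each test instance is routed to a leaf in every Level Tree, the leaf supplies \(P(+ve|\mathbf{x})\) either from its local DT or from its recorded class probability, and the t values are averaged; lower averages indicate anomalies. The function route_to_leaf and the leaf fields used below are placeholders for the procedures in Algorithms 1-3, not the authors' code:

```python
import numpy as np

def eldt_score(x, level_trees, route_to_leaf):
    """Average P(+ve | x) over the t Level Trees; low values flag anomalies.
    `route_to_leaf(tree, x)` is assumed to return the leaf that x falls into.
    A leaf either holds a fitted local decision tree under 'model' (trained on
    the region's "+ve" samples plus region-bounded synthetic "-ve" samples)
    or a recorded probability of the "+ve" class under 'p_pos'."""
    probs = []
    for tree in level_trees:
        leaf = route_to_leaf(tree, x)
        if leaf.get("model") is not None:
            # Any classifier exposing a predict_proba-style interface could
            # stand in here; 'pos_index' is the column of the "+ve" class.
            probs.append(leaf["model"].predict_proba([x])[0][leaf["pos_index"]])
        else:
            probs.append(leaf["p_pos"])
    return float(np.mean(probs))
```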

The ensemble of local DTs based on the idea of Feating addresses the three limitations of existing AD approaches in practical problems discussed in Sect. 1. It does not require labelled anomalies. It examines anomalies with respect to their local context, or locality, defined by multiple local regions, which is useful for detecting local anomalies. By using DTs, which can handle categorical and/or mixed features directly, in the local regions, it works well with categorical and mixed data.

4 Experimental Results

In this section, we present the results of experiments conducted to evaluate the performance of ELDT. The three parameters of ELDT were set to the following defaults: \(minPts=\lfloor \log _2(|D|)\rfloor +1\) (note that D is the normal training data), \(m=\lfloor \log _2(|F|)\rfloor +1\) (note that F is the set of features of D), and \(t=100\). We compared the performance of ELDT with Bagging using DT (Bag.DT) [9] and Random Forest (RF) [10], where each model in the ensemble is a global DT for the entire data space. They also used D as the "+ve" class and the same number of synthetic points as the "-ve" class, as done when building Level Trees in ELDT; each tree in the ensemble has different "-ve" class samples to ensure diversity between trees. We considered the following three state-of-the-art AD methods as the main baselines:

  1. iForest [21]: It is an ensemble-based AD method. It uses a collection of t random trees, where each tree \(T_i\) is constructed from a small random subsample of data \({\mathcal D}_i \subset D\), \(|{\mathcal D}_i|=\psi \) (=256 by default). The idea is to isolate each instance in \({\mathcal D}_i\). Anomalies are expected to have shorter average path lengths over the collection of random trees. It produces good results and runs very fast. It works only with numeric features, so categorical features are converted into numeric features using one-hot encoding. It is unable to detect local anomalies [5].

  2. LOF [11]: It is the most widely used AD method based on kNN search (\(k=\lfloor \sqrt{N}\rfloor \) by default). It compares the density of a test instance with the average density of its kNNs, examining anomalies w.r.t. their locality defined by the kNNs; it is thus an existing local model-based AD method. It is also designed mainly for numeric data, so categorical features have to be converted into numeric features first. It is computationally very expensive when D is large.

  3. SPAD [4]: It is a simple probabilistic AD method, where the multi-dimensional probability is estimated as the product of one-dimensional probabilities, assuming features are independent. It works with discrete or categorical data. Numeric features are converted into categorical features through equal-width discretisation [15] with the number of bins b (=\(\lfloor \log _2(N)\rfloor + 1\) by default). Despite its simplicity, it has been shown to perform better than more complex methods such as LOF, One-Class SVM and iForest [4].

We used 10 benchmark datasets with categorical-only, mixed (categorical and numeric) and numeric-only features. The characteristics of the datasets in terms of data size, dimensionality (numeric and categorical) and proportion of anomalies are provided in Table 1. Most of these datasets are from the UCI Machine Learning Repository [14]. All methods were implemented in Java using the WEKA platform [16]. We used the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) as the performance evaluation metric. We conducted 10 trials with different train (D) and test (Q) sets and report the average AUC over the 10 runs.

Table 1. Characteristics of datasets. #Inst: data size, #Feat: number of features, #NFeat: number of numeric features, #CFeat: number of categorical features, anomaly%: percentage of anomalies.

The average AUC results of the contending methods are provided in Table 2. The results show that ELDT produced the best results overall, with an average AUC of 0.918 and an average rank of 2.0 over the 10 datasets used. It had the best AUC in four out of the 10 datasets, followed by Bag.DT in three datasets, RF and SPAD in two datasets each, and iForest and LOF in only one dataset each. The closest contender in terms of consistent performance across datasets is SPAD, with an average AUC of 0.864 and an average rank of 3.3. Though Bag.DT produced the best results in three datasets, it performed the worst in four other datasets, whereas ELDT was ranked second in three datasets, third in two datasets and fourth in the remaining dataset. These results show that ELDT produced more consistent results across datasets with numeric-only, categorical-only and mixed attributes, and across those with local and global anomalies. Among the three baselines, SPAD has the best overall performance; this is consistent with the results reported by the authors in [4].

The runtime results on the five largest datasets (those with more than 10,000 instances) are presented in Table 3. These results show that ELDT ran slower than all contenders except LOF, but its runtimes were within the same order of magnitude as theirs. It was at least one order of magnitude faster than LOF, and two orders of magnitude faster in the largest dataset.

Table 2. Average AUC over 10 runs. The best performance in each dataset is highlighted in bold.
Table 3. Average runtime (in seconds) over 10 runs on the five largest datasets (those with more than 10,000 instances).

4.1 Sensitivity of Parameters

In this section, we present the results of experiments conducted to assess the sensitivity of the three parameters, m (the size of subspaces, which determines the maximum height of Level Trees), minPts (the minimum number of points required at a leaf node to build a local AD model) and t (the ensemble size), on the performance of ELDT. We varied one parameter at a time, setting the other two parameters to their default values. For these experiments, we used two datasets, Annthyroid and Mnist. The results are presented in Figs. 3 and 4. They show that the performance of ELDT can be improved by setting m and minPts properly. In terms of t, the higher the better: in both datasets, performance improved as t was increased and then started to flatten. Increasing t also increases time and space complexities linearly; therefore, there has to be a trade-off between performance and complexity.

Fig. 3. Annthyroid: Effect of parameters in ELDT

Fig. 4. Mnist: Effect of parameters in ELDT

5 Conclusions and Future Work

In this paper, we presented a simple idea to address the three main limitations of existing Anomaly Detection (AD) methods in practical applications: (i) lack of sufficient examples of known anomalies; (ii) inability to detect local anomalies; and (iii) inability to handle mixed attributes well. Instead of using one global model, we proposed to partition the data space into many regions and build an AD model in each region using only the data falling in that region, i.e., many local AD models are built. To make a prediction for a test instance, the AD model learned on the region where it falls is used. Instead of relying on a local region from a single partitioning of the space, we proposed to create multiple partitions of the data space and use an ensemble of local models learned on local regions from each partition. This exploits the benefits of ensemble learning to consider sufficient locality around the test instance. We used the idea of Feating to partition the data space. Though there are not many AD models that work well with mixed data, there are classifiers, such as the traditional Decision Tree (DT), that can handle numeric and categorical features directly. We used DTs in local regions for AD without using labelled anomalies by adding synthetic data. We presented a new AD method called the Ensemble of Local Decision Trees (ELDT). Our results show that ELDT produces better and more consistent detection results compared to popular state-of-the-art AD methods, particularly in datasets with mixed types of features.

Our results suggest that an ensemble of local AD models produces better results than a single global model. AD algorithms that can handle categorical and numeric features directly, without any conversion, produce better results than methods designed for only one type of feature, which require all features to be converted into the supported type. AD problems can be converted into classification problems by adding uniformly distributed synthetic points so that classification algorithms can be used. Our results indicate that this is a very promising line of research to investigate further towards developing a flexible and robust AD framework for practical use. It can lead to a general ensemble learning framework for AD, where different space partitioning techniques can be used to define local regions and any classifier or AD algorithm can be used in the local regions. In this paper, we presented one simple variant of this framework. In future work, we would like to focus on: (i) using other classifiers (e.g., Naive Bayes, kNN, SVM, Neural Networks) and AD methods (e.g., LOF, SPAD, One-Class SVM) as local models; and (ii) investigating different implementations of space partitioning using trees (e.g., Feating), grids, nearest neighbours, etc.