Keywords

1 Introduction

Chandola and Kumar [4] define anomaly detection as “the problem of finding patterns in data that do not conform to expected behavior”. Although anomaly detection is essentially a binary classification problem (i.e. instances are classified as either normal or anomalous), in anomaly detection scenarios the classes are highly imbalanced—there is very limited, or sometimes no access to anomalous instances during training, although there is usually an abundance of normal instances. Anomaly detection approaches that do not use anomalous instances during training can be classified as semi-supervised machine learning approaches, and are often referred to as one-class classifiers [12]. Anomaly detection approaches have been used in a variety of applications including credit scoring [11], intrusion detection [2], forensics [13], medical applications [14], and computer network security [27].

The main contribution of this paper is to adapt the Radial Basis Function (RBF) network [3] for one-class classification through a novel cost function. We have named the resultant one-class neural network the Elliptical Basis Function Data Descriptor (EBFDD) network. Coupled with our novel cost function, the EBFDD network is a semi-supervised one-class classification approach that utilizes elliptical kernels to learn sophisticated decision boundaries, while retaining the advantages of a shallow network, which include: easy retraining, interpretability, reduced data requirements, and shorter training time.

The remainder of the paper is structured as follows: Sect. 2 reviews related work, and briefly explains the motivation behind the EBFDD network approach. In Sect. 3 the EBFDD network approach is explained in detail. Section 4 describes the design of an evaluation experiment that compares the performance of the EBFDD network to state of the art algorithms across a number of benchmark datasets. Section 5 presents the results of the experiment, which then discusses their implications. Finally, Sect. 6 concludes the paper and suggests directions for future work.

2 Related Work

In this section we first describe some of the common approaches to anomaly detection. Then we describe the standard RBF network before explaining how it has been adapted for anomaly detection in the EBFDD network.

2.1 Common Approaches to Anomaly Detection

Semi-supervised machine learning approaches to anomaly detection are dominated by a family of algorithms that are modifications of the Support Vector Machine (SVM) algorithm [24], designed to work with only examples of a single class: One-Class SVM (OCSVM) [23]. In fact Khan and Madden [12] go so far as to say that one-class classification algorithms and methods should be divided into OCSVMs and non-OCSVMs.

The idea underpinning OCSVM is quite similar to the standard SVM method, and the kernel trick is still a key part of the OCSVM for the transformation of the input space into a feature space of higher dimensionality. The OCSVM approach finds a hyper-plane that separates all of the normal data points in a training set from the origin, while maximizing the distance between the origin and this hyper-plane. This results in a binary function, whose output is 1 for the regions of the input space belonging to the normal data, and −1 anywhere else. The main disadvantage of the OCSVM is the assumption that the anomalous data instances are concentrated around the origin [12]. Variations of the OCSVM approach include the Support Vector Machine Data Description (SVDD) [26], which uses hyper-spheres rather than hyper-planes to achieve separation. Interestingly, the authors in [22] propose the idea of a deep SVDD, where the SVDD is mixed with deep learning to accomplish anomaly detection.

On the non-OCSVM side, Auto-Encoder networks (AENs) [7], and all their variations have been increasingly used for anomaly detection [29,30,31]. An AEN does not learn a discriminative model, but rather a generative model of the input data. After transforming the input data into a representation with reduced dimensionality, it learns to reconstruct the original input data from the dimensionally-reduced representation. The error between the input and the reconstruction of the input, referred to as the reconstruction error, can be used as an anomaly detection signal—normal instances should be accurately reproduced leading to low reconstruction error, while anomalous instances should be poorly reproduced leading to large reconstruction errors. The AEN is trained on normal data only and learns features that could best reconstruct the normal data.

Isolation Forest (iForest) [28] is another interesting non-OCSVM approach to anomaly detection. The iForest tries to isolate individual data points in the training set by splitting the space randomly and repeatedly. The intuition behind this approach is that less splits should be required to isolate anomalous instances. An anomaly score can be calculated based on the number of splits required to isolate a data point.

Gaussian Mixture Models (GMMs) [1] are also used for anomaly detection. GMMs assume that normal data is generated by a collection of \(H\) Gaussians, which are placed randomly in the input space in the beginning of training. Then using the Expectation Maximization (EM) [1], the means and covariance matrices of the Gaussians are learned such that the likelihood of observing the training data is maximised. During testing, one can measure the likelihood of the test data, and if it is below a certain threshold it could be labeled as anomalous, and normal otherwise.

2.2 Radial Basis Function Networks

A Radial Basis Function (RBF) network [3] is a local-representation learning technique, which divides the input space among local kernels. For every input data point, depending on where in the input space it appears, a fraction of these locally-tuned kernel units get activated. It is as if these local units have divided the input space among themselves and each one takes responsibility for a subspace. The idea of locality, inherently implies the need for a distance function that measures the similarity between a given input data instance \(X\), with dimensionality \(D\), and the center, \(\mu _h\), of every kernel unit \(h\). The common choice for this measure is the Euclidean distance, \(\parallel \) X \(-\,\mu _h\)\(\parallel \). The response function for these local units, should have a maximum when \(X = \mu _h\), and decrease as \(X\) and \(\mu _h\) get less similar. The most commonly used response function for the RBF units is the Gaussian function. Not only have the RBF networks been used for traditional classification [18] and regression [25] problems, they have also been applied to anomaly detection tasks, which we will explore below.

RBF networks have been applied to anomaly detection in two main ways. First, if examples of both normal and anomalous data are available during training, the standard binary/multi-class classification RBF networks can be used with modifications. For example, in [19], a hybrid optimisation algorithm based on RBF networks, which combines gradient descent with quantum-behaved particle swarm optimisation is used to train the RBF network for anomaly detection. Similarly, in [21], an RBF network is trained for the task of intrusion detection. The model keeps adding hidden units to the architecture until a certain performance goal is met.

In the second approach to using RBF networks for anomaly detection, only examples of the normal class are used at training time. This can be achieved by modifying the dataset or modifying the algorithm. As an example of the first type, in [20] an augmented training set is composed using the instances belonging to the normal class as random proxy anomalous patterns and used to train an anomaly detection model for time series data. As an example of the second type, in [17], the RBF network algorithm is combined with the Support Vector Data Descriptor (SVDD) [26] algorithm, resulting in a hybrid model. The hidden layer of the RBF network is used as a feature extractor and the outputs for the subsequent transformed feature space are then used as inputs to the SVDD algorithm with a linear kernel.

The EBFDD network approach modifies the cost function used to train an RBF network to ensure that it learns a compact set of Gaussian kernels that concentrate around the normal region of the input space, covering all of it, while excluding other regions. During testing, for a given input data, \(X\), the output of the model would be high, if \(X\) belongs to the normal region, and low, otherwise.

3 The Elliptical Basis Function Data Descriptor (EBFDD) Network

The EBFDD network is a semi-supervised one-class classifier, which is based on the Radial Basis Function (RBF) network [3]. Figure 1 illustrates the architecture of the EBFDD network. This is a shallow network containing one hidden layer and one output layer with a single node. There are \(H\) hidden nodes in the hidden layer, each of which uses a Gaussian activation function. The activation, \(p_h(X)\), of the \(h^{th}\) Gaussian node is defined as:

$$\begin{aligned} p_h(X) = exp\left[ -\frac{1}{2}\left( X - \mu _h\right) ^T{\varSigma _h}^{-1}\left( X - \mu _h\right) \right] \end{aligned}$$
(1)

where \(X\) is the input vector of length \(D\); \(x_d\) is the value of the \(d^{th}\) dimension of \(X\); and the parameters \(\mu _h\) and \(\varSigma _h\) are the mean vector (which is \(D\)-dimensional) and the covariance matrix (which is a \(D \times D\) matrix) of the \(h^{th}\) Gaussian kernel.

Fig. 1.
figure 1

The Architecture of the EBFDD network.

The outputs of the nodes in the hidden layer are connected to a single output node via a weight vector \(W\), where \(w_h\) is the weight parameter connecting the \(h^{th}\) hidden node to the output node. In the output node the modified hyperbolic tangent, \(tanh\), activation function proposed in [15] is used as it avoids saturation:

$$\begin{aligned} y = 1.7159 \times tanh\left( \frac{2z(X)}{3}\right) \end{aligned}$$
(2)

where \(z(X)\) is the weighted sum of the outputs of the hidden layer, when \(X\) is the input vector:

$$\begin{aligned} z(X) = \sum _{h=1}^{H} w_h\times p_h\left( X\right) \end{aligned}$$
(3)

The intuition behind the EBFDD network is that, it can be trained to learn a compact set of elliptical kernels that gather around the region in the input space where the normal data resides. This means that a trained model will output a high value for any normal query data point that falls within this region, and a low value for any anomalous query data point that falls outside this region. Thresholding this value will allow accurate anomaly detection.

EBFDD training begins with a pre-training phase that provides an initial set of positions for the \(H\) Gaussian kernels at the hidden layer. This phase uses the \(k\)-means [1] clustering algorithm, using Euclidean distance, applied to the training dataset. The resulting cluster centres are used to initialise the Gaussian kernel centres, \(\mu _h\), and covariance matrices, \(\varSigma _h\), for each of the \(H\) hidden nodes. Initially these kernels are radial and assume equal variance in all directions. The number of hidden nodes in the network, \(H\), (which is also the number of clusters found by k-means) is a hyper-parameter set before training starts.

The parameter values in the network (\(\mu _h\), \(\varSigma _h\), and \(w_h\)) for each node in the network are then optimised using mini-batch gradient descent [1] (mini-batch sizes are always set to \(32\)) and the back-propagation of error algorithm [1]. The cost function to be minimised in this process is:

$$\begin{aligned} E = \frac{1}{2}\left[ \left( 1 - y\right) ^2 + \beta R_\varSigma + \lambda R_W\right] \end{aligned}$$
(4)

where \(y\) is the output of the network; \(R_\varSigma \) and \(R_W\) are regularisation terms defined below in Eq. (5) and Eq. (6); and \(\beta \) and \(\lambda \) are hyper-parameters that control the influence of the regularisation terms.

This cost function is a weighted sum of three main terms that ensure the network learns a compact set of Gaussians that cover just the normal training data. The first term in this summation, \((1 - y)^2\) encourages the network training process to learn a model that outputs a value as close as possible to 1 for instances belonging to the normal class.

The second term in the cost function, \(R(\varSigma )\), regularises the variances of the Gaussian kernels in the hidden layer of the network. This term introduces the optimisation criterion of having the most compact set of Gaussians possible to represent the normal data. As the size of a Gaussian ellipsoid is dictated by the diagonal elements (i.e., the variances) of its covariance matrix, off-diagonal elements are excluded from the regularisation. \(R_\varSigma \) is defined as:

$$\begin{aligned} R_\varSigma = \sum \limits _{h=1}^{H}\sum \limits _{d=1}^{D}\left( \varSigma _{h}\left[ d, d\right] \right) ^2 \end{aligned}$$
(5)

which is the squared L-2 norm [7] of the variances (i.e., the diagonal elements in each of the \(H\) Gaussian covariance matrices). Here, \(D\) denotes the dimensionality of the input space and \(\varSigma _{h}\left[ d, d\right] \) refers to the \(d^{th}\) diagonal element of the convariance matrix for the \(h^{th}\) hidden node.

The third term in the cost function, \(R_W\), is the squared L-2 norm of the weight vector, \(W\), connecting the hidden layer nodes to the output node. This discourages the weights from becoming so large that they would actually ignore the outputs from the hidden nodes. It also makes the EBFDD network robust to outliers in the training set [7]. \(R_W\) is defined as:

$$\begin{aligned} R_W = \sum \limits _{h=1}^{H}{w_h^2} \end{aligned}$$
(6)

where \(w_h\) is the weight connecting the \(h^{th}\) hidden node to the output node.

We argue that a gradient descent training process based on the minimisation of the cost function in Eq. (4) will find a compact set of Gaussians, whose collective output is still high for the normal region of the input space and low, anywhere else (i.e., where we believe the anomalies would appear).

Based on the application of the back-propagation of error algorithm using gradient descent, the following update rules for learning the parameters of the EBFDD network have been derived in Eq. (7), Eq. (8) and Eq. (9) (the update rules are described for a single training instance for ease of reading, but are easily expandable to a mini-batch scenario by summing across the instances in the mini-batch). Below, we have derived the learning rules for all 3 learnable parameters of the EBFDD network. The general structure of these learning rules is \(a = a - \eta \times \frac{\partial E}{\partial a}\), where \(a\) is the learnable parameter, and \(E\) is the cost function defined in Eq. (4). Each of the weights, \(w_h\), connecting a hidden unit to the output unit is updated using:

$$\begin{aligned} w_h&= w_h -\eta \times \frac{\partial E}{\partial w_h} \nonumber \\&= w_h -\eta \left[ \frac{\partial \frac{1}{2}(1 - y)^2}{\partial w_h} + \frac{\partial \frac{1}{2}\lambda R_W}{\partial w_h} \right] \nonumber \\&= w_h -\eta \left[ \frac{\partial \frac{1}{2}(1 - y)^2}{\partial y} \times \frac{\partial y}{\partial z_h(X)} \times \frac{\partial z_h(X)}{\partial w_h} + \frac{\partial \frac{1}{2}\lambda R_W}{\partial w_h} \right] \nonumber \\&= w_h - \eta [(y - 1) \times [1.1439(1 - tanh(\frac{2z_h(X)}{3})^2)] \times p_h(X) + \lambda w_h] \end{aligned}$$
(7)

where \(\eta \) is the learning rate, and \(z_h(X)\) is defined as in Eq. 3 (although we omit the summation because only the \(h^{th}\) hidden unit is relevant). All other terms are defined as before. Each of the kernel centres, \(\mu _h\), is updated using:

$$\begin{aligned} \mu _h&= \mu _h -\eta \times \frac{\partial E}{\partial \mu _h} \nonumber \\&= \mu _h -\eta \left[ \frac{\partial \frac{1}{2}(1 - y)^2}{\partial \mu _h} \right] \nonumber \\&= \mu _h -\eta \left[ \frac{\partial \frac{1}{2}(1 - y)^2}{\partial y} \times \frac{\partial y}{\partial z_h(X)} \times \frac{\partial z_h(X)}{\partial p_h(X)} \times \frac{\partial p_h(X)}{\partial \mu _h} \right] ~~~~~~~~~~~~~~~~ \nonumber \\&= \mu _h - \eta [(y - 1) \times [1.1439(1 - tanh(\frac{2z_h(X)}{3})^2)] \times w_h \\&~~~~~~~~~~~~~~ \times [p_h(X) \varSigma _h^{-1} (X - \mu _h)]] \nonumber \end{aligned}$$
(8)

And for the covariance matrix of the \(h^{th}\) Gaussian, \(\varSigma _h\), the learning rule is:

$$\begin{aligned} \varSigma _h&= \varSigma _h -\eta \times \frac{\partial E}{\partial \varSigma _h} \nonumber \\&= \varSigma _h -\eta \times \left[ \frac{\partial \frac{1}{2}(1 - y)^2}{\partial \varSigma _h} + \frac{\partial \frac{1}{2}\beta R_\varSigma }{\partial \varSigma _h} \right] \nonumber \\&= \varSigma _h -\eta \times \left[ \frac{\partial \frac{1}{2}(1 - y)^2}{\partial y} \times \frac{\partial y}{\partial z_h(X)} \times \frac{\partial z_h(X)}{\partial p_h(X)} \times \frac{\partial p_h(X)}{\partial \varSigma _h} + \frac{\partial \frac{1}{2}\beta R_\varSigma }{\partial \varSigma _h} \right] \nonumber ~~~\\&= \varSigma _h -\eta \times [(y - 1) \times [1.1439(1 - tanh(\frac{2z_h(X)}{3})^2)] \times w_h \\&~~~~~~~~~~~~~~ \times [p_h(X)(-\varSigma _h^{-T}(X - \mu _h)(X - \mu _h)^T\varSigma _h^{-T})] + \beta Diag(\varSigma _h)] \nonumber \end{aligned}$$
(9)

where the function \(Diag()\) diagonalises the covariance matrix \(\varSigma _h\) by turning all non-diagonal elements to 0.

When Eq. 9 is applied, the resulting covariance matrix my not be invertible. This is a problem, as the inverse of the covariance matrices are used in both forward and backward propagation in the network. In such cases, we replace the covariance matrix generated with its closest positive semi-definite matrix found using the method proposed by Higham [8].

As explained earlier, the output of the EBFDD network, \(y\), needs to be thresholded for the anomaly detection to take place. Any of the common thresholding schemes used with other anomaly detection algorithms [4] can be used with EBFDD networks. The thresholding method is not the focus of this paper as we have used the Area Under the Curve (AUC) of the ROC curve as the metric of evaluation. This gives us a good estimate of the performance of the algorithms assuming that we have a method of choosing the best threshold on the outputs for each one of the algorithms used in our experiments.

Fig. 2.
figure 2

Considering images of digits 0 and 1 from the MNIST dataset as class normal instances for training and using the first two principal components, starting from left to right, we have the visualizations of the transformed training data after applying PCA in red circles, the Gaussians after K-means pre-training, trained EBFDD network, and the decision surface of the trained EBFDD network. (Color figure online)

To illustrate the behaviour of an EBFDD network we present a simple example based on the MNIST dataset of handwritten digits [16], and we have used 10 Gaussians in this illustration. We have also used images of digits 0 and 1 as the normal class. So that the behaviour of the network can be visualised, the dimensionality of the input data is reduced to 2 using Principal Component Analysis (PCA) [1]. Figure 2a shows the training dataset of normal instances (images of digits 0 and 1). Figure 2b shows the output of the k-means algorithm after the pre-training phase, which provides the EBFDD network with an initial set of well positioned Gaussian kernels. Figure 2c shows the compact set of kernels covering the normal region after the EBFDD network is trained. Finally, Fig. 2d shows the decision surface that has been learned, where brighter colors correspond to higher outputs by the EBFDD network.

4 Experimental Method

This section describes the design of an experiment conducted to measure the performance of the EBFDD network and compare it with a number of state of the art anomaly detection approaches. In these experiments, we have trained EBFDD networks with 2 kernel options: (1) Elliptical Gaussian kernels and (2) Radial Gaussian Kernels. In the case of the latter, we are hoping that a less computationally expensive kernel, the radial kernel, would bring us an advantage. The state of the art algorithms in our experiments are the One-Class SVM (OCSVM) models [23], Auto-Encoders (AEN) [7], Gaussian Mixture Models (GMM) [1], and the Isolation Forests (iForest) [28]). These algorithms have performed anomaly detection tasks on a selection of datasets using only normal examples during training. The remainder of this section describes the datasets and performance metrics used in the experiments.

4.1 Benchmark Datasets and Anomaly Detection Scenarios

Following Emmott, et al. [5] we use fully labelled classification datasets to simulate anomaly detection scenarios so that performance metrics can be calculated. For each dataset used, where possible, we have considered different anomaly detection scenarios, where we consider different normal and anomalous classes to add more variety into our experiments.

We have chosen a random subset of the datasets used in [5] for our experimentsFootnote 1. All of these datasets come from the UCI repositoryFootnote 2. From these datasets a number of different anomaly detection scenarios can be generated. These scenarios can be divided into two main groups:

  • Binary Classification Datasets: In this case, the instances in each dataset belong to either of 2 classes. The normal class is selected based on the recommendations in [5], and only examples from this class are used for training.

  • Multi-Class Classification Datasets: In this case, we have followed 3 approaches. First, is the one-vs-all approach where we select instances in a particular class as normal and treat all other classes as anomalous. Only normal instances are used during training. This is interesting, because it explores scenarios where anomalous instances might come from a variety of different distributions, but a compact distribution defines the normal class. So, if a dataset has \(K\) classes, we define \(K\) one-vs-all experiments. The second approach is the all-vs-one approach, where we choose instances of one class as anomalous and the instances of the other classes as normal. Only normal instances are used during training. This method explores scenarios where we might have a variety of definitions of the normal data coming from different distributions but a well-defined compact distribution generating the anomalous instances. Similarly, if a dataset has \(K\) classes, we define \(K\) all-vs-one experiments. The last type are more difficult scenarios recommended by Emmott, et al. [5] where an analysis is performed to find the most challenging partition of classes, in terms of separability, into normal and anomalous groups for the datasets they use in their experiments. We use these partitions to define the third set of scenarios in our experiments.

Table 1. The datasets used in our experiments

Table 1 summarises the datasets that have been used in our experiments. The number of rows, classes, and features in each one, as well as the number of anomaly detection scenarios extracted are shown. For all datasets feature values have been normalised to [0, 1] using range normalisation.

4.2 Experimental Method and Performance Metrics

For each algorithm, dataset and scenario combination we use an experimental design that is similar to bootstrapping [10]. We extract an 80% sample (with no replacement) from the whole normal portion of a dataset to use to train a model. The remaining 20% is then mixed with all the anomalous data to constitute the test set. This is repeated 10 times and average performance results are reported. For every algorithm a grid search is performed to find the best set of hyper-parameters and the results of the best performing set of hyper-parameters are reported. All of our code and scripts used in our experiments are available on GitHubFootnote 3.

The tunable hyper-parameters used to generate these results, for each algorithm across all the datasets and scenarios, are presented in Table 2. It is important to mention that the values for \(H\) in the hyper-parameter grid search are determined by the dimensionality of each dataset, and in our experiments they do not exceed the dimensionality of the input. The range tested for \(H\) starts with 1 and then increases in steps of 5 all the way up to the dimensionality of the data, \(D\). In the case of the Page Blocks Classification, and Magic Gamma Telescope datasets we reduce the step size to 2, and for the Skin Segmentation dataset the step size of 1, as the dimensionality of these datasets is rather low. As a result, the EBFDD/RBFDD, and AEN always tend to compress the input data in their hidden representation and similarly, the number of Gaussians in the GMM is also bounded by the dimensionality of the input, \(D\). However, in the case of the Isolation Forest, we have chosen the same number of estimators across all the datasets. In the case of OCSVM, \(\nu \) is the upper bound on the fraction of training errors and a lower bound of the fraction of support vectors, and \(\gamma \) is the kernel coefficient.

Table 2. Hyper-parameter ranges for each algorithm

The area under the ROC curve [9] has been chosen as the evaluation metric in these experiments. This allows us to avoid defining a specific thresholding function to use with each model, but still measures the ability of a model to distinguish between normal and anomalous classes. The ROC curves are generated from the raw output of each model type.

Fig. 3.
figure 3

The average rank of the algorithms across the experiments. On each box plot the median ranks are shown via the horizontal orange lines and the mean ranks are shown using green triangles. (Color figure online)

Fig. 4.
figure 4

Friedman’s aligned test results where the statistical significance is \(\alpha = 0.05\)

5 Results and Discussion

The performances of each algorithm on each anomaly detection scenario for each dataset were compared and ranked. The distribution of the ranks for each algorithm for all anomaly detection scenarios are shown in Fig. 3a, and the average rank scores are summarised in Table 3Footnote 4. The average ranks in Table 3 show that the EBFDD network has the lowest average rank over all experiments. It is also clear from Fig. 3a that the ranks of EBFDD have low variance indicating its consistent strong performance.

Table 3. Average rank across all experiments

Figure 3 also shows the distribution of ranks of the different approaches on the one-vs-all and all-vs-one scenarios (distributions for the other scenarios are not included as there are too few results to generate meaningful distributions). While the overall pattern is largely the same for these subsets as for the overall results, it is worth noting that there is a marked difference in the performance of the AEN models on the one-vs-all scenarios—performance is relatively poor—and on the all-vs-one scenarios—performance is quite good.

A Friedman test [6] with a significance level of \(\alpha = 0.05\) was performed on the rank data. This showed a statistically significant difference in the performance of the different algorithms and so, following the recommendations of [6], a post-hoc Friedman aligned rank test, with \(\alpha = 0.05\), was performed to further investigate the pairwise differences between the performance of the algorithms. Figure 4 summarises the results of these tests (tests were performed for all scenarios and independently for the one-vs-all and all-vs-one scenarios)Footnote 5.

In all cases the performance of a group of the best performing algorithms cannot be separated in a statistically significant way. This group includes EBFDD, RBFDD, GMM, and (except for the one-vs-all scenario subset) AEN. The poorer performances of the OCSVM and iForest approaches are statistically significantly different to the performance of the other approaches. The poor performance of OCSVM is somewhat surprising, however, it is in line with previous benchmarks, for example [5].

Both the EBFDD and RBFDD algorithms accomplish their training through minimizing our proposed cost function in Eq. (4). We believe that the EBFDD out-performs RBFDD because of the more flexible nature of the decision boundaries that it is able to learn because of the use of elliptical kernels.

It is worth noting, however, that the cost of the good performance of the EBFDD algorithm is the very large number of parameters that must be learned in this model. For an EBFDD network, as the full covariance matrices are learnable, the number of trainable parameters is quite high:

$$\begin{aligned} (H\times D) + \left( H\times \frac{D(D+1)}{2}\right) + H \end{aligned}$$
(10)

where H is the number of hidden nodes, and D is the dimensionality of the input. Moreover, if we consider radial kernels for the EBFDD networks, i.e. an RBFDD network, the number of trainable parameters is reduced to:

$$\begin{aligned} (H\times D) + (H\times D) + H \end{aligned}$$
(11)

This means that training EBFDD networks can take longer than training equivalently sized RBFDD networks.

There is a minor difference in the number of trainable parameters in a GMM compared to an EBFDD network. As the weights used to combine distributions in a GMM sum to 1, for \(H\) Gaussians only \((H - 1)\) weights need to be learned as the last weight can be computed by subtracting the sum of those weights from 1. As a result, the number of trainable parameters in a GMM is equal to:

$$\begin{aligned} (H\times D) + \left( H\times \frac{D(D+1)}{2}\right) + (H-1) \end{aligned}$$
(12)

In addition the number of trainable parameters for the AEN is:

$$\begin{aligned} (H\times D) + H \end{aligned}$$
(13)

It is not possible to talk about the number of trainable parameters for the iForest and OCSVM models in a meaningful way and so these are not compared. Moreover, as the implementations of algorithms used in these experiments come from different packages with differing levels of optimisation we do not believe that detailed comparisons of training times are appropriate.

As observed in Table 3, the empirical results show that the added complexity of the EBFDD network has made it a stronger anomaly detector compared to its simpler version that is the RBFDD network. In addition, even though the EBFDD network and GMM have almost the same number of trainable parameters, it seems that the optimisation of the proposed cost function in the EBFDD network, which utilises gradient descent, allows it to find better solutions than the expectation maximisation approach used in a GMM.

6 Conclusions and Future Work

This paper presents a novel cost function, whose minimisation can adapt the Radial Basis Function (RBF) network into a one-class classifier. We have named the resultant anomaly detector the Elliptical Basis Function Data Descriptor (EBFDD) network. EBFDD utilises elliptical kernels that can elongate and rotate to allow it to learn sophisticated decision surfaces. An evaluation experiment conducted using a large set of datasets compared the EBFDD network with state of the art anomaly detection algorithms. Although statistical significance is not shown in all cases, the empirical results show that the EBFDD network has a better overall performance across all the experiments.

In future work, we plan to add recurrent connections to the EBFDD network architecture to allow contextual anomalies within streams to be identified, as well as the point anomalies identified by the current architecture. Moreover, we would like to investigate the idea of building a deep architecture where the EBFDD network and a deep network would be trained in an end to end fashion, the motivation being that the backpropagation signal for training the EBFDD network might push the deep network into learning better features for the Gaussian kernels of the EBFDD network to work with. An alternative would be to train the deep architecture separately and use the extracted features to train the EBFDD network, and this will also be explored. Finally, we will investigate how EBFDD can be adapted to handle concept drift scenarios in which the characteristics of what constitutes normal change over time.