1 Introduction

Anomaly detection is a fundamental data mining task whose aim is to isolate samples in a dataset that are suspected of being generated by a distribution different from that of the rest of the data. Anomalies arise for many reasons, such as mechanical faults, fraudulent behavior, human errors, instrument errors, or simply natural deviations in populations.

Depending on the composition of the dataset, anomaly detection settings can be classified as supervised, semi-supervised, and unsupervised [1, 14]. In the supervised setting the training data are labeled as normal or abnormal and the goal is to build a classifier. The difference with standard classification problems is that abnormal data form a rare class. In the semi-supervised setting, the training set is composed of both labelled and unlabelled data. A special case of this setting is one-class classification, in which the training set is composed only of normal class items. In the unsupervised setting the goal is to detect outliers in an input dataset by assigning a score or anomaly degree to each object. Several statistical, data mining and machine learning approaches have been proposed to detect anomalies, namely, statistical-based [11, 15], distance-based [6, 9, 10, 24], density-based [12, 21], reverse nearest neighbor-based [4, 5, 19, 25], SVM-based [30, 33], and many others [1, 14].

In recent years, deep learning-based methods for anomaly detection [13, 17, 27] have shown great performance. Auto encoder (AE) based anomaly detection [3, 13, 20] consists in training an AE to reconstruct a set of examples and then detecting as anomalies those data that show a large reconstruction error. Variational auto encoders (VAE) arise as a variant of standard auto encoders designed for generative purposes [23]. The key idea of VAEs is to encode each example as a normal distribution over the latent space and to regularize the loss by maximizing the similarity of these distributions with the standard normal one. Due to their similarity to standard auto encoders, VAEs have also been used to detect anomalies. However, it has been noticed that VAEs share with standard AEs the problem that they generalize so well that they can also reconstruct anomalies well [3, 7, 8, 13, 22, 32]. Generative Adversarial Networks (GAN) [18] are another tool for generative purposes, aiming at learning an unknown distribution by means of an adversarial process involving a discriminator, which outputs the probability for an observation to be generated by the unknown distribution, and a generator, which maps points coming from a standard distribution to points belonging to the unknown one. GANs have also been applied with success to the anomaly detection task [2, 16, 29, 31, 34].

Some authors [26, 28] have recently observed that all the above mentioned deep learning-based anomaly detection methods are not designed to directly discover anomalies: their main task is data reconstruction (AE and VAE) or data generation (GAN), and anomaly detection is a collateral result. They introduce new methods, called \(\textit{Deep-SVDD}\) and \(\textit{Deep-SAD}\), that combine the expressive power of deep neural networks with a loss inspired by SVM-based methods and specifically designed for anomaly detection. These methods address the one-class and (semi-)supervised settings, but we argue that they do not apply very naturally to the unsupervised setting; we therefore introduce \(\textit{Deep-UAD}\), a new unsupervised method that deeply modifies the architectures in [26, 28]. In particular we build a new training paradigm for the network in [26] that involves an AE trained alternately with the network, the two exchanging the information they obtain during training. This is done by modifying the losses of both the network and the AE. The proposed approach shows noticeable improvements in detection performance over both the standard approach in [26, 28] and the baseline shallow methods.

The rest of the paper is organized as follows. Section 2 discusses related work with particular emphasis on \(\textit{Deep-SVDD}\) and \(\textit{Deep-SAD}\). Section 3 introduces the \(\textit{Deep-UAD}\) unsupervised anomaly detection algorithm. Section 4 illustrates experimental results. Finally, Sect. 5 concludes the work.

2 Preliminaries

In this section we review auto encoders and \(\textit{Deep-SVDD}\), which are exploited by our technique as basic components and suitably modified for our purposes.

Auto Encoder. An auto encoder (AE) is a neural network architecture successfully employed for anomaly detection [20]. It aims at providing a reconstruction of the input by exploiting a dimensionality reduction step (the encoder \(\phi _W\)) followed by a step mapping back from the compressed space (the latent space) to the original space (the decoder \(\psi _{W'}\)). Its ability to detect anomalies relies on the observation that regularities should be better compressed and, hopefully, better reconstructed [20]. The AE loss is \(\mathcal {E}(x)=\Vert x-\hat{x} \Vert _2^2\), where \(\hat{x}=\psi _{W'}(\phi _W(x))\), and coincides with the standard reconstruction error.
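As an illustration, the following is a minimal PyTorch sketch of an AE whose reconstruction error \(\mathcal {E}(x)\) is used as a per-sample anomaly score; layer sizes are hypothetical and this is not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """Minimal fully-connected auto encoder: encoder phi_W followed by decoder psi_W'."""
    def __init__(self, d=784, k=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, k))
        self.decoder = nn.Sequential(nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, d))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_error(model, x):
    """Anomaly score E(x) = ||x - x_hat||_2^2, computed for each sample in the batch."""
    with torch.no_grad():
        x_hat = model(x)
    return ((x - x_hat) ** 2).sum(dim=1)  # one score per sample
```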

One-class SVM. Before discussing \(\textit{Deep-SVDD}\) some preliminary notions about One-Class SVM (OC-SVM) [30] are needed. The original OC-SVM method is designed for the one-class setting and has the objective of finding the hyperplane in a feature space that best separates the mapped data from the origin. Given the data \(\left\{ x_1,\dots ,x_n\right\} \subseteq X\), it is defined by the following optimization problem

$$\begin{aligned}&\min _{\textbf{w},\rho ,\xi _i}\frac{1}{2}\Vert \textbf{w} \Vert _F^2-\rho +\frac{1}{\nu n}\sum _{i=1}^n\xi _i\\ \text {s. t. }&\left\langle \phi \left( x_i\right) ,\textbf{w}\right\rangle \ge \rho -\xi _i, \\&\xi _i\ge 0, \quad i=1,\dots ,n \end{aligned}$$

where \(\textbf{w}\in F\) and \(\rho \) define the separating hyperplane \(\left\langle \phi \left( x\right) ,\textbf{w}\right\rangle =\rho \), \(\xi _i\) are slack variables and \(\nu \in \left( 0,1\right] \) is a trade-off hyperparameter. The points in the test set are labelled as normal if they are mapped on the side of the hyperplane far from the origin and anomalous otherwise. Related to OC-SVM, Support Vector Data Description (SVDD) [33] is a method that aims at enclosing the input data in a hypersphere of minimum radius. The corresponding optimization problem is

$$\begin{aligned}&\min _{R,\textbf{c},\xi _i}R^2+\frac{1}{\nu n}\sum _{i=1}^n\xi _i\\ \text {s. t. }&\Vert \phi \left( x_i\right) -\textbf{c} \Vert ^2\le R^2+\xi _i, \\&\xi _i\ge 0, \quad i=1,\dots ,n \end{aligned}$$

where \(R>0\) and \(\textbf{c}\) are the radius and the center of the hypersphere, and again \(\xi _i\) are slack variables and \(\nu \in \left( 0,1\right] \) is a trade-off hyperparameter.
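Shallow baselines of this kind are readily available; below is a minimal sketch using scikit-learn's OneClassSVM (an RBF-kernel OC-SVM), with synthetic data and hyperparameters chosen purely for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))                      # assumed mostly normal data
X_test = np.vstack([rng.normal(size=(50, 10)),
                    rng.normal(loc=5.0, size=(5, 10))])   # a few synthetic anomalies

# nu plays the role of the trade-off hyperparameter in the formulation above
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
oc_svm.fit(X_train)

# decision_function > 0: mapped on the "normal" side; < 0: anomalous
scores = -oc_svm.decision_function(X_test)  # higher score = more anomalous
```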

Deep-SVDD. In [26], the authors apply the same idea expressed in SVDD, i.e. enclosing the data in a hypersphere, but perform the mapping into the feature space with a deep neural network. In particular, let \(\phi _W:X\rightarrow F\) be the mapping obtained with a neural network with weights \(W=\left[ W_1,\dots ,W_L\right] \) (\(W_l\) being the weights of layer \(l\in \left\{ 1,\dots ,L\right\} \)) from the input space \(X\subseteq \mathbb {R}^d\) to the output space \(F\subseteq \mathbb {R}^k\), with \(k<d\). The loss of the network is given by

$$\begin{aligned} \mathcal {L}=\frac{1}{n}\sum _{i=1}^n\Vert \phi _W\left( x_i\right) -\textbf{c} \Vert _2^2+\frac{\lambda }{2}\sum _{l=1}^L\Vert W_l \Vert _F^2, \end{aligned}$$
(1)

where the first term forces the network representation \(\phi _W(x)\) to stay close to the center \(\textbf{c}\) of the hypersphere and the second term is a weight decay regularizer with hyperparameter \(\lambda >0\). This loss is used in a one-class anomaly detection setting to map the training set (composed only of normal items) as close as possible to the center \(\textbf{c}\), so that in the testing phase the network is less able to map the anomalies close to \(\textbf{c}\). For this reason, the anomaly score of a point x is defined as the distance of its network representation from the center: \(\mathcal {S}(x)=\Vert \phi _W(x)-\textbf{c} \Vert _2^2.\)
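A minimal sketch of loss (1) and of the score \(\mathcal {S}(x)\) is given below, assuming a PyTorch module phi playing the role of \(\phi _W\); the weight-decay term is typically delegated to the optimizer rather than coded explicitly.

```python
import torch

def deep_svdd_loss(phi, x, c):
    """First term of loss (1): mean squared distance of phi_W(x) from the center c.
    The term (lambda/2) * sum ||W_l||_F^2 is usually obtained by passing
    weight_decay=lambda to the optimizer (e.g. torch.optim.Adam)."""
    z = phi(x)                                  # network representation phi_W(x)
    return ((z - c) ** 2).sum(dim=1).mean()

def anomaly_score(phi, x, c):
    """S(x) = ||phi_W(x) - c||_2^2, one value per sample."""
    with torch.no_grad():
        return ((phi(x) - c) ** 2).sum(dim=1)
```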

The center \(\textbf{c}\) is not a trainable parameter and is fixed before the training by means of an AE whose encoder has the same structure as the network \(\phi \) and shares its weights W, while the decoder is symmetric to the encoder, so that the latent space of the AE coincides with the space F. The training set is given in input to this AE, which is trained with the standard loss, and subsequently the center is defined as \(\textbf{c}=\frac{1}{n}\sum _{i=1}^n\phi _W(x_i)\), that is the mean of the latent representations of all the points in the training set. The same architecture has been applied in [28] for the task of semi-supervised anomaly detection with the following natural adaptation of the loss

$$\begin{aligned} \small \mathcal {L}=\frac{1}{n+m}\sum _{i=1}^n\Vert \phi _W\left( x_i\right) -\textbf{c} \Vert _2^2+\frac{\eta }{n+m}\sum _{i=1}^m\left( \Vert \phi _W\left( \tilde{x}_i\right) -\textbf{c} \Vert _2^2\right) ^{\tilde{y}_i}+\frac{\lambda }{2}\sum _{l=1}^L\Vert W_l \Vert _F^2, \end{aligned}$$
(2)

where \(\tilde{x}_i\) are the m labeled data with labels \(\tilde{y}_i\) and \(\eta \) is a hyperparameter handling the trade-off between the contributions of labelled and unlabelled data. Let us observe that data labelled as normal (\(\tilde{y}_i=+1\)) are treated in the usual way, which means that they are forced to be mapped close to \(\textbf{c}\), while for the anomalies (\(\tilde{y}_i=-1\)) the contribution is inverted and they are forced to stay as far as possible from \(\textbf{c}\).
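A possible sketch of loss (2) on a mini-batch, again delegating weight decay to the optimizer and assuming labels in \(\{+1,-1\}\), is the following; the small epsilon is an assumption added for numerical stability, not part of the original formulation.

```python
import torch

def deep_sad_loss(phi, x_unlab, x_lab, y_lab, c, eta=1.0):
    """Unlabelled and labelled terms of loss (2); y_lab holds +1 (normal) or -1 (anomaly)."""
    n, m = x_unlab.shape[0], x_lab.shape[0]
    d_unlab = ((phi(x_unlab) - c) ** 2).sum(dim=1)     # ||phi_W(x_i) - c||^2
    d_lab = ((phi(x_lab) - c) ** 2).sum(dim=1)
    eps = 1e-6                                         # assumed safeguard against division by zero
    lab_term = (d_lab + eps) ** y_lab.float()          # exponent y inverts the term for anomalies
    return (d_unlab.sum() + eta * lab_term.sum()) / (n + m)
```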

It is important to observe that (2) is designed to also consider unlabelled examples. An extreme case occurs when \(m=0\), that is when all the training data are unlabelled. This scenario is similar to the unsupervised setting, but there is a substantial difference: in one case the objective is to detect anomalies in a test set, in the other the anomalies have to be detected among the same data used for the training phase. In this case the losses (2) and (1) coincide, which means that, even though loss (1) was originally designed to deal only with normal class items, it can be used in settings that involve unlabeled anomalies in the training phase, and thus it is applicable also to the unsupervised setting.

3 Method

In this section we discuss \(\textit{Deep-UAD}\), the technique proposed in this paper.

A one-class technique like \(\textit{Deep-SVDD}\) builds a model of the normal class from the input data, assuming they contain no anomalies, and then classifies the data of a test set. In particular, \(\textit{Deep-SVDD}\) tends to map all the input data close to the center and, hence, in the unsupervised setting this technique may fail to correctly separate normal and anomalous samples.

\(\textit{Deep-UAD}\) tackles this issue by providing the network with information about the anomaly degree of each sample, in order to force the network to bring normal data close to the center and to keep anomalies far from it. This is accomplished by exploiting an AE that provides a level of anomaly suspiciousness. Thus, the proposed architecture consists of two components, a neural network \(\textit{Deep-UAD}_{NET}\) and an auto encoder \(\textit{Deep-UAD}_{AE}\); \(\textit{Deep-UAD}_{NET}\) has the same structure as the network of \(\textit{Deep-SVDD}\), and hence can be described by the same mapping function \(\phi _W\); it is forced to map the data badly reconstructed by \(\textit{Deep-UAD}_{AE}\), namely those more suspected of being anomalous, far away from the center and, conversely, to map the data suggested as normal by \(\textit{Deep-UAD}_{AE}\) close to the center. Technically, this is done by introducing the novel loss

$$\begin{aligned} \mathcal {L}_{\text {NET}}=\frac{1}{n}\sum _{i=1}^n\frac{1}{\mathcal {E}\left( x_i\right) }\Vert \phi _W\left( x_i\right) -\textbf{c} \Vert _2^2+\frac{\lambda }{2}\sum _{l=1}^L\Vert W_l \Vert _F^2. \end{aligned}$$
(3)
Fig. 1. Diagram of the \(\textit{Deep-UAD}\) cooperative strategy: the network \(\textit{Deep-UAD}_{NET}\) and the auto encoder \(\textit{Deep-UAD}_{AE}\) refine their ability to find anomalies by sharing the encoder weights W and passing each other the information of their own scores.

It is inspired by Eq. (1), which is modified by inserting the term \(\frac{1}{\mathcal {E}\left( x_i\right) }\), inversely related to the probability for \(x_i\) to be an anomaly according to the AE, used as a weight to control how important it is that the network representation of \(x_i\) is mapped close to \(\textbf{c}\). In particular, the smaller \(\mathcal {E}(x_i)\) is, namely \(x_i\) is probably not an anomaly according to \(\textit{Deep-UAD}_{AE}\), the higher the weight and thus the greater the advantage the network obtains by mapping \(x_i\) close to the center; conversely, if \(\mathcal {E}\left( x_i\right) \) is large, so that \(x_i\) is suspected to be an anomaly by the AE, the weight is small and the network gains little by bringing the representation of \(x_i\) close to the center.
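In code, the weighted term of loss (3) could be sketched as follows (a PyTorch sketch; the reconstruction errors \(\mathcal {E}(x_i)\) are assumed to have been computed beforehand with the AE, and the small epsilon is an added numerical safeguard).

```python
import torch

def deep_uad_net_loss(phi, x, c, recon_errors):
    """Loss (3) without the weight-decay term: each distance ||phi_W(x_i) - c||^2
    is weighted by 1 / E(x_i), so well-reconstructed (likely normal) points are
    pulled toward c more strongly than badly reconstructed (suspicious) ones."""
    d = ((phi(x) - c) ** 2).sum(dim=1)
    weights = 1.0 / (recon_errors + 1e-6)  # assumed epsilon for numerical stability
    return (weights * d).mean()
```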

The strategy of \(\textit{Deep-UAD}\) consists in a preliminary phase where the AE, without information from the network \(\textit{Deep-UAD}_{NET}\), is trained with the standard loss, the center of the hypersphere is computed and the reconstruction error \(\mathcal {E}(x_i)\) is evaluated for each sample. Subsequently, two phases are iteratively executed: during the first one, the network \(\textit{Deep-UAD}_{NET}\) is trained with the loss (3) for a certain number of epochs and the score \(\mathcal {S}(x_i)\) is computed; during the second, \(\textit{Deep-UAD}_{AE}\) is trained for some epochs with the novel loss

$$\begin{aligned} \mathcal {L}_{\text {AE}}=\sum _{i=1}^n\frac{1}{\mathcal {S}(x_i)}\Vert x_i-\hat{x_i} \Vert _2^2. \end{aligned}$$
(4)

The purpose of \(\mathcal {S}(x_i)\) is similar to that of \(\mathcal {E}(x_i)\) in (3): it weights the contribution of each point \(x_i\) according to the results obtained by the network. The idea of \(\textit{Deep-UAD}\) is that the score obtained from one component improves the training of the other; the final anomaly score output is \(\mathcal {S}(x_i)\).
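A high-level sketch of the alternating procedure is shown below; names and the batch-wise organization are one possible reading of the scheme described above, and the sharing of the encoder weights W between \(\textit{Deep-UAD}_{AE}\) and \(\textit{Deep-UAD}_{NET}\) (Fig. 1) is omitted for brevity.

```python
import torch

def train_deep_uad(ae, net, data_loader, c, n_iterations=10, epochs_per_phase=25,
                   opt_ae=None, opt_net=None):
    """Alternate training of Deep-UAD_AE and Deep-UAD_NET.
    ae(x) returns the reconstruction x_hat; net(x) returns phi_W(x);
    data_loader yields batches of input samples x."""
    eps = 1e-6  # assumed safeguard against division by zero
    for _ in range(n_iterations):
        # Phase 1: train the network with loss (3), weighting by 1 / E(x_i)
        for _ in range(epochs_per_phase):
            for x in data_loader:
                with torch.no_grad():
                    E = ((x - ae(x)) ** 2).sum(dim=1)        # reconstruction errors
                d = ((net(x) - c) ** 2).sum(dim=1)            # distances from the center
                loss_net = (d / (E + eps)).mean()
                opt_net.zero_grad(); loss_net.backward(); opt_net.step()
        # Phase 2: train the AE with loss (4), weighting by 1 / S(x_i)
        for _ in range(epochs_per_phase):
            for x in data_loader:
                with torch.no_grad():
                    S = ((net(x) - c) ** 2).sum(dim=1)        # network scores
                rec = ((x - ae(x)) ** 2).sum(dim=1)
                loss_ae = (rec / (S + eps)).sum()
                opt_ae.zero_grad(); loss_ae.backward(); opt_ae.step()
    # Final anomaly scores S(x_i), in data_loader order
    with torch.no_grad():
        return torch.cat([((net(x) - c) ** 2).sum(dim=1) for x in data_loader])
```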

4 Experimental Results

In this section we report experiments conducted to study the behavior of the proposed method. We focus on three main aspects, namely (i) the impact of the dimension of the output space on the performances, (ii) the analysis of the cooperative process as the iterations proceed, (iii) the comparison with other methods, with specific emphasis on \(\textit{Deep-SVDD}\). In our experiments we consider two standard benchmark datasets composed of grayscale images, MNIST and Fashion-MNIST. They are both composed of \(28\times 28\) pixel images divided into 10 classes, thus, in order to adapt them for anomaly detection, we adopt a one-vs-all policy, i.e. we consider one class as normal and all the others as anomalous. For each class, we create a dataset composed of all the examples of the selected class as normal items and s randomly selected examples from each of the other classes as anomalies.
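A sketch of this one-vs-all construction using torchvision's MNIST loader is given below; the function name and the convention of labelling anomalies with 1 are illustrative, not part of the original protocol beyond what is described above.

```python
import numpy as np
from torchvision import datasets

def one_vs_all(normal_class=0, s=10, seed=0):
    """Build an unsupervised anomaly detection dataset from MNIST: all samples of
    `normal_class` plus s randomly selected samples from each of the other classes."""
    mnist = datasets.MNIST(root="./data", train=True, download=True)
    X = mnist.data.numpy().reshape(-1, 28 * 28) / 255.0
    y = mnist.targets.numpy()
    rng = np.random.default_rng(seed)
    normal_idx = np.where(y == normal_class)[0]
    anomaly_idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=s, replace=False)
        for c in range(10) if c != normal_class
    ])
    data = np.vstack([X[normal_idx], X[anomaly_idx]])
    labels = np.concatenate([np.zeros(len(normal_idx)), np.ones(len(anomaly_idx))])  # 1 = anomaly
    return data, labels
```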

Sensitivity Analysis on the Dimension k of the Output Space. In this section, our aim is to determine how the dimension of the output space F impacts the behavior of both our method and the original \(\textit{Deep-SVDD}\) algorithm. To do this, we consider the MNIST dataset in the one-vs-all setting and, for each class, we train both models with k varying in the interval [8, 64].

Fig. 2. MNIST dataset (\(s=10\)): AUCs of \(\textit{Deep-UAD}\) and \(\textit{Deep-SVDD}\) varying the dimension of the final space.

Figure 2, which reports the results over 5 runs, shows that for both \(\textit{Deep-UAD}\) (in red) and \(\textit{Deep-SVDD}\) (in black) the trend is increasing, which means that a low-dimensional space F is not sufficient, in either case, to separate the anomalies from the normal examples. Moreover, it is important to point out that the performances achieved by our method are better than the ones obtained by \(\textit{Deep-SVDD}\) for almost every class and every value of k.

Analysis of the Iterative Process. \(\textit{Deep-UAD}\) is based on an iterative process in which the network \(\textit{Deep-UAD}_{NET}\) and the auto encoder \(\textit{Deep-UAD}_{AE}\) share information; because of this, it is crucial to investigate how the number of iterations affects the performances of both architectures. We do this by considering the MNIST and Fashion-MNIST datasets in the one-vs-all setting, performing 5 runs for each class and computing the AUC at each iteration. In each iteration both the network and the AE are trained for 25 epochs.

Figure 3 reports the trends of the two architectures. As we can see, they are always non-decreasing, which means that both architectures take advantage of the cooperative strategy. As for \(\textit{Deep-UAD}_{NET}\), which is the component that outputs the score of \(\textit{Deep-UAD}\), the trend becomes substantially stable and constant, sometimes from the very first iteration (as for class 0 of MNIST and class Sandal of Fashion-MNIST) and other times after a slightly larger number of iterations (like classes 2 and 7 of MNIST). This means that the number of iterations is not a hard parameter to fix, since about ten iterations always guarantee a score of \(\textit{Deep-UAD}\) close to the best possible and an improvement over \(\textit{Deep-SVDD}\).

Fig. 3. MNIST and Fashion-MNIST datasets (\(s=10\)): AUCs of \(\textit{Deep-UAD}\) and AE varying the iterations of the method.

Moreover, \(\textit{Deep-UAD}_{AE}\) improves its performance as the iterations proceed. This fact is crucial for the behavior of the whole process: it means that the information provided to \(\textit{Deep-UAD}_{NET}\) by \(\textit{Deep-UAD}_{AE}\) becomes better at every iteration, and thus \(\textit{Deep-UAD}\) succeeds in mapping the anomalies away from the center better than \(\textit{Deep-SVDD}\), in which this information is missing and several anomalies go undetected because they lie closer to the center than some normal samples, with a consequent worsening of the AUC.

Table 1. AUC of \(\textit{Deep-UAD}\) and competitors on MNIST and Fashion-MNIST with \(s=10\) on the left and \(s=100\) on the right.

Comparison with Competitors. Finally, in this last section, we compare the results of \(\textit{Deep-UAD}\) with those of competitors on MNIST and Fashion-MNIST. The methods taken into account are Isolation Forest (IF) and k-Nearest Neighbor as shallow algorithms, and \(\textit{Deep-SVDD}\) and the Deep Convolutional auto encoder (DCAE) as deep learning methods. To ensure a fair comparison, both \(\textit{Deep-SVDD}\) and DCAE have the same structure as \(\textit{Deep-UAD}\) and for both of them, as well as for our method, we fix \(k=64\) according to the results of the first experiment.
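For the shallow baselines, a hedged sketch of how IF and a k-NN distance score can be run with scikit-learn is shown below; the hyperparameters are illustrative and not necessarily those used in our experiments.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score

def shallow_baselines(X, labels, k=5, seed=0):
    """X: samples as rows; labels: 1 for anomalies, 0 for normal points."""
    # Isolation Forest: negate score_samples so that higher score = more anomalous
    iforest = IsolationForest(random_state=seed).fit(X)
    if_scores = -iforest.score_samples(X)
    # k-NN: distance to the k-th nearest neighbor as anomaly score
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own neighbor
    dist, _ = nn.kneighbors(X)
    knn_scores = dist[:, -1]
    return roc_auc_score(labels, if_scores), roc_auc_score(labels, knn_scores)
```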

Table 1 reports the results for both datasets with \(s=10\) and \(s=100\). We can see that for almost all classes \(\textit{Deep-UAD}\) performs better than all the considered competitors, and in certain cases the margins are large. In particular, in the direct comparison with \(\textit{Deep-SVDD}\), the technique that inspires our method, \(\textit{Deep-UAD}\) always wins, meaning that the cooperative work of the network and the AE succeeds in improving the ability to isolate anomalies.

5 Conclusions

In this work we presented \(\textit{Deep-UAD}\), a deep learning approach for unsupervised anomaly detection. It is based on an alternate and cooperative training of an AE and a neural network aiming at mapping the data close to a fixed center in the output space. Experimental results show that \(\textit{Deep-UAD}\) achieves good performance and that the strategy of alternate training brings benefits to both the neural network and the AE, improving their ability to isolate anomalies.

In the future our main goals are to investigate the application of a similar cooperative alternate strategy to more complex neural architectures, to study possible modifications to the discussed method that may help in improving performance, and to test our algorithm on datasets of different size and nature.