
1 Introduction

Anomalies are generally defined as behaviors or events that deviate from the majority of normal situations; they are rare but extremely harmful. Accurate detection of anomalies is therefore essential in many environments, including fraud detection in finance [1], disease detection in clinical medicine [2], and web intrusion detection in network security [3].

Many traditional anomaly detection algorithms are based on unsupervised learning [4,5,6] or observe only the normal class [7,8,9]. They usually assume that no anomalies are observed during training and thus forgo the chance to exploit abnormal information. Consequently, a series of anomaly detection algorithms has recently emerged that trains models on a large number of unlabeled samples together with a few observed anomalies [10,11,12]. This setting is more in line with real application scenarios: it compensates for the lack of supervision in unsupervised algorithms while reducing the burden of label collection in supervised schemes. Nevertheless, most of these methods directly treat the unlabeled samples as a normal set, which may be unreasonable for datasets whose unlabeled set contains a non-negligible number of anomalies. The core problem is how to leverage the unlabeled samples to enhance the anomaly detection model.

Semi-supervised learning [13, 14] is a natural choice for anomaly detection because it thoroughly mines unlabeled samples. Some previous semi-supervised anomaly detection works [15, 16] simply applied unsupervised algorithms to the unlabeled samples. In this paper, we instead employ a pseudo-label algorithm on the unlabeled set to find a set of pseudo anomalies.

Meta pseudo labels (MPL) [17] adopts the idea of meta-learning, in which the teacher network is continuously adjusted to reduce confirmation bias using the feedback of the student network on the labeled samples. Inspired by MPL, we introduce a Meta-Pseudo-Label Anomaly Detection (MPAD) method in this paper. MPAD exploits the feedback of the student network on pseudo anomalies to influence the update of the teacher network. The meta pseudo anomalies (MPAs) then generated by the teacher network not only carry less confirmation bias but also help the student network generalize better to the test set. In our implementation, we withhold a fixed validation set to judge the detection performance of the student network, and the difference in performance is in turn treated as a reward or punishment during the training of the teacher network.

The major contributions of this paper are summarized as follows:

  • We propose an anomaly detection framework with partially observed anomalies that employs a pseudo-label algorithm to increase the quantity and quality of observed anomalies, thereby improving the accuracy of the anomaly detection model;

  • The feedback of the student model is used to correct the update direction of the teacher network, so that the teacher network can generate more beneficial pseudo anomalies;

  • Extensive experiments on datasets from five different fields show that the proposed MPAD framework outperforms five popular state-of-the-art algorithms.

2 Related Work

2.1 Anomaly Detection Methods

Traditional anomaly detection algorithms mainly follow the unsupervised setting and cannot take advantage of existing anomaly information. Settings similar to ours appear in semi-supervised or weakly-supervised anomaly detection. One class of semi-supervised methods assumes that only normal samples are available when building a model; the classic algorithms are OCSVM [7] and deep support vector data description (SVDD) [8]. As they only learn patterns of the normal category, any pattern that differs from the normal ones is considered an anomaly. This approach reduces the risk of overfitting to observed anomalies, but it generally assumes that data are similar within a class and is mostly applicable to situations with a large number of positive samples. Another class of semi-supervised methods presumes that a small number of labeled normal samples and anomalies are available in addition to unlabeled ones, e.g., DeepSAD [15] and the method in [16], both based on SVDD. Generally speaking, these models outperform unsupervised algorithms thanks to the presence of supervised information. Some works [10,11,12,19,20] share the same detection setting as our MPAD and focus on a small number of observable anomalies plus unlabeled samples. Yet most of them assume that the unlabeled samples are normal, whereas our model extracts reliable pseudo anomalies from the unlabeled samples to increase the utilization of supervised information.

2.2 Semi-supervised Methods

At present, semi-supervised algorithms [13, 14] are mainly based on consistency, pseudo labels, or hybrids of the two. Consistency algorithms rest on the assumption that different representations of the same sample should yield the same result on downstream tasks; many of them rely on rich data augmentation. Pseudo-label methods have no such requirement. The meta pseudo labels [17] method uses the results of a student network on the labeled samples as feedback to a teacher network, reducing the confirmation bias of the pseudo labels. To the best of our knowledge, there is currently no anomaly detection algorithm based on partially observed anomalies that uses pseudo-label algorithms. We propose the general MPAD framework based on MPL, which can employ any network structure as the teacher and the student and is compatible with various types of data.

3 Methods

3.1 Preliminaries

We follow the setting in which anomalies are partially observed. Notationally, the dataset is represented by \(\mathcal {D}=\{\mathcal {D}_{\textrm{L}}, \mathcal {D}_{\textrm{U}}\}\), where \(\mathcal {D}_{\textrm{L}}\) denotes the set of partially known anomalies and \(\mathcal {D}_{\textrm{U}}\) is the unlabeled set, in which normal samples far outnumber anomalies. \(\mathcal {D}_{\textrm{L}} = \{(x_1,y_1),\cdots ,(x_K,y_K)\}\) and \(\mathcal {D}_{\textrm{U}} = \{x_{K+1},\cdots ,x_{K+N}\}\), where \(x_i \in \mathcal {X} = \mathbb {R}^{d}\) and \(y_i = 1\) with \(y_i \in \mathcal {Y} = \{0,1\}\). Additionally, we follow the usual description of models in pseudo-label algorithms: teacher models that provide pseudo labels are denoted T with parameters \(\theta _{T}\), and student models that consume pseudo labels, which in this paper are the anomaly detection models, are denoted S with parameters \(\theta _{S}\). We aim to train an anomaly detection model on the dataset \(\mathcal {D}\) and evaluate it on the test set.
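To make the notation concrete, the following is a minimal synthetic sketch of this data setting; the feature dimension, set sizes, contamination rate, and all variable names are illustrative assumptions, not the paper's actual data.

```python
# D_L holds K observed anomalies (y = 1); D_U holds N unlabeled samples,
# mostly normal but with a small fraction of hidden anomalies.
import numpy as np

rng = np.random.default_rng(42)
d, K, N = 16, 30, 2000                 # feature dim, labeled anomalies, unlabeled size

X_L = rng.normal(3.0, 1.0, (K, d))     # observed anomalies, y_i = 1
y_L = np.ones(K, dtype=np.int64)

X_U = rng.normal(0.0, 1.0, (N, d))     # unlabeled set, assumed mostly normal
hidden = rng.choice(N, size=int(0.02 * N), replace=False)
X_U[hidden] += 3.0                     # ~2% hidden anomalies (the "noise" of Sect. 4)
```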

3.2 Meta Pseudo Anomaly Detection Scheme

We first introduce the basic pseudo-label algorithm, which obtains a class probability for each sample from a neural network and derives the hard pseudo label \(y^{\textrm{PL}}\) via a threshold \(\lambda \):

$$\begin{aligned} y^{\textrm{PL}} = {\mathbb {1}}\left[ T(x_u;\theta _{T})\geqslant \lambda \right] , \end{aligned}$$
(1)

in which \(x_u \in \mathcal {D}_{\textrm{U}}\) and \(T(x_u;\theta _{T})\) is the probability, output by the teacher network, that \(x_u\) belongs to a particular class. The same formula generates pseudo anomalies when applied to anomaly detection.
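As a concrete illustration, a minimal PyTorch sketch of Eq. (1) might look as follows, assuming the teacher outputs two class logits (as in the implementation of Sect. 4.1); the function name and the threshold value are illustrative.

```python
import torch

def hard_pseudo_anomalies(teacher, x_u, lam=0.95):
    """Eq. (1): y_PL = 1[T(x_u; theta_T) >= lam], keeping only pseudo anomalies."""
    with torch.no_grad():
        p_anom = torch.softmax(teacher(x_u), dim=1)[:, 1]  # anomaly-class probability
    mask = p_anom >= lam                                   # confident predictions only
    return x_u[mask], p_anom[mask]
```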

This simplest pseudo-label method already works in anomaly detection, but the performance of the student detection model is limited by the accuracy of the pseudo labels produced by the teacher network. To improve this accuracy, we bring the idea of meta pseudo labels (MPL) to pseudo-anomaly generation, yielding what we call meta pseudo anomalies (MPA).

We first use the teacher network to generate pseudo anomalies following Eq. (1). Here the teacher network follows the self-training schedule, i.e., it executes two steps in a loop: (1) train a classifier on the already labeled dataset, treating the unlabeled set as normal; (2) use the trained classifier to label the unlabeled data and add those with high prediction confidence to the labeled set. Based on these pseudo anomalies, the optimization objective \(\theta ^{\textrm{MPA}}_{S}\) of the student network is:

$$\begin{aligned} \theta ^{\textrm{MPA}}_{S} = \mathop {\textrm{argmin}}\limits _{\theta _{S}} \, {\mathcal {L}_{\textrm{CN}} \Big (S\big ( \left[ x_u,x_l,\textrm{MPA} \right] ;\theta _{S} \big ),y \Big )}, \end{aligned}$$
(2)

where \(x_l \in \mathcal {D}_{\textrm{L}}\) are labeled anomalies and the MPA are added to this set during training. \(\mathcal {L}_{\textrm{CN}}\) is the loss of the student network given in Eq. (10), and y collects the true labels together with the pseudo labels. So far this is a standard pseudo-label algorithm based on self-training.
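A minimal sketch of this student objective, under the assumption that `dev_loss` implements \(\mathcal {L}_{\textrm{CN}}\) of Eq. (10) (Sect. 3.3) and that the unlabeled batch is treated as normal; all names are illustrative.

```python
import torch

def student_loss(student, x_u, x_l, x_mpa, dev_loss):
    """Eq. (2): train the student on [x_u, x_l, MPA] with y = 0 for unlabeled
    samples (treated as normal) and y = 1 for observed and pseudo anomalies."""
    x = torch.cat([x_u, x_l, x_mpa], dim=0)
    y = torch.cat([torch.zeros(len(x_u)),
                   torch.ones(len(x_l) + len(x_mpa))])
    return dev_loss(student(x).squeeze(1), y)
```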

Since the ultimate purpose of the student network is to generalize well to test data, we expect the teacher network to generate pseudo anomalies that serve this goal. To do so, we hold out part of the data from \(\mathcal {D}\), denoted \(\mathcal {D}_{\textrm{V}}\). Because \(\textrm{MPA}\) is generated from \(\theta _{T}\) as in Eq. (1), the optimized student parameters can be seen as a function of \(\theta _{T}\), which we write as \(\theta ^{\textrm{MPA}}_{S}(\theta _{T})\). The overall goal is to minimize the loss of the student network on \(\mathcal {D}_{\textrm{V}}\):

$$\begin{aligned} \min _{\theta _{S}, \theta _{T}} {\mathcal {L}_{\textrm{CN}}\Big (S\big (\mathcal {D}_{\textrm{V}};\theta _{S}^{\textrm{MPA}}(\theta _{T})\big ),y_{v}\Big )}. \end{aligned}$$
(3)

We expect that this objective will correct the update direction of the teacher network and further improve the performance of the detection network.

The objective involves both sets of parameters at once, so they cannot be updated directly by computing a single derivative. Following meta-learning, we update the two parameter sets step by step: \(\theta _T\) and \(\theta _S\) are updated alternately, and each update takes only one step along the gradient direction rather than jumping to the current optimum, since that optimum is only a local optimum of the objective function. \(\theta _S\) is first updated one step to \(\theta _{S}^{'}\):

$$\begin{aligned} \theta _{S}^{'} = \theta _{S} - \eta _{S} \nabla _{\theta _{S}} \mathcal {L}_{\textrm{CN}} (\theta _T,\theta _S). \end{aligned}$$
(4)

The contrastive loss above is computed with the MPA generated by \({\theta }_{T}\). \(\theta _T\) is then updated leveraging the updated student network \(\theta _{S}^{'}\):

$$\begin{aligned} \theta _{T}^{'} = \theta _{T} - \eta _{T} \nabla _{\theta _{T}} \mathcal {L}_{T}(\theta _{S}^{'}). \end{aligned}$$
(5)

We denote the objective function as H and split the derivative into a product of two derivatives:

$$\begin{aligned} \begin{aligned} \frac{\partial {H}}{\partial {\theta _{T}}}&= \frac{\partial {H}}{\partial {\theta ^{'}_{S}}}\cdot \frac{\partial {{\theta ^{'}_{S}}}}{\partial {\theta _{T}}} = h \cdot \frac{\partial {\mathcal {L}_{\textrm{CE}}\big (\hat{y}_{u},T(x_u;\theta _T)\big )}}{\partial {\theta _{T}}}, \end{aligned} \end{aligned}$$
(6)

where \(\hat{y}_{u}\) are the pseudo labels and \(h = \mathcal {L}_{\textrm{CN}}(\theta _{S})-\mathcal {L}_{\textrm{CN}}(\theta ^{'}_{S})\) follows from a first-order Taylor expansion. Both contrastive losses are computed on the validation set. The second factor in the last equality is the gradient of the cross-entropy loss between the teacher network's output and the pseudo labels. In addition, we also train the teacher network with a loss on the labeled samples. The total loss is as follows:

$$\begin{aligned} \mathcal {L}_{T} = \mathcal {L}_{\textrm{CE}}\big (T(x_l;\theta _{T}),y_l\big ) + \big (\mathcal {L}_{\textrm{CN}}(\theta _{S})-\mathcal {L}_{\textrm{CN}}(\theta ^{'}_{S})\big ) \times \mathcal {L}_{\textrm{CE}}\big (T(x_u;\theta _T),\hat{y}_u\big ). \end{aligned}$$
(7)

Here we assume the unlabeled set to be normal and compute the standard cross-entropy loss on it together with the labeled anomalies as the first term above.
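Putting Eqs. (4)-(7) together, one alternating update could be sketched as below. This reuses the illustrative helpers `hard_pseudo_anomalies` and `student_loss` from the sketches above and the `dev_loss` of Sect. 3.3; optimizer and variable names are assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def mpad_step(teacher, student, opt_T, opt_S, x_u, x_l, y_l,
              x_val, y_val, lam=0.95):
    # Student loss on the held-out validation set before the update: L_CN(theta_S).
    loss_before = dev_loss(student(x_val).squeeze(1), y_val).detach()

    # Eq. (4): one student step on data augmented with meta pseudo anomalies.
    x_mpa, _ = hard_pseudo_anomalies(teacher, x_u, lam)
    opt_S.zero_grad()
    student_loss(student, x_u, x_l, x_mpa, dev_loss).backward()
    opt_S.step()

    # Feedback h = L_CN(theta_S) - L_CN(theta_S'), both on the validation set.
    loss_after = dev_loss(student(x_val).squeeze(1), y_val).detach()
    h = loss_before - loss_after

    # Eq. (7): supervised CE (labeled anomalies + unlabeled treated as normal)
    # plus the h-weighted CE between the teacher and its own pseudo labels.
    x_sup = torch.cat([x_l, x_u])
    y_sup = torch.cat([y_l, torch.zeros(len(x_u), dtype=torch.long)])
    y_hat = (torch.softmax(teacher(x_u), dim=1)[:, 1] >= lam).long()
    loss_T = (F.cross_entropy(teacher(x_sup), y_sup)
              + h * F.cross_entropy(teacher(x_u), y_hat))
    opt_T.zero_grad()
    loss_T.backward()
    opt_T.step()
```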

3.3 Student Anomaly Learner

We choose DevNet [11] as the student model; it is itself designed for a small number of observed anomalies, makes efficient use of them, and its performance on most datasets tends to increase as more anomalies are observed. Its main principles are as follows. First, L anomaly scores \(r_i\) of normal samples are sampled from a standard Gaussian prior, and their mean is used as the reference score \(\mu _r\) of the normal points:

$$\begin{aligned} \mu _r = \frac{1}{L}\sum _{i=1}^L r_i, r_i \sim \mathcal {N}(\mu =0, \sigma =1). \end{aligned}$$
(8)

Then a z-score measures the deviation between the anomaly score \(z_i\) of a training sample and the reference score,

$$\begin{aligned} \textrm{dev}(z_i) = \frac{z_i - \mu _r}{\sigma _r}. \end{aligned}$$
(9)

Finally, the contrastive loss \(\mathcal {L}_{\textrm{CN}}\) increases the distance between abnormal points and the reference score while reducing the gap between normal points and the reference score:

$$\begin{aligned} \mathcal {L}_{\textrm{CN}} = (1-y_i)\cdot |\textrm{dev}(z_i)|+y_i\cdot \textrm{max}\big (0,\delta -\textrm{dev}(z_i)\big ). \end{aligned}$$
(10)
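A minimal sketch of this deviation loss follows; the sample count L = 5000 and margin \(\delta = 5\) follow common DevNet choices and are assumptions here.

```python
import torch

def dev_loss(scores, y, L=5000, delta=5.0):
    # Eq. (8): reference scores drawn from the standard Gaussian prior N(0, 1).
    r = torch.randn(L)
    mu_r, sigma_r = r.mean(), r.std()
    # Eq. (9): z-score deviation of each anomaly score z_i from the reference.
    dev = (scores - mu_r) / sigma_r
    # Eq. (10): |dev| pulls normal scores (y = 0) toward mu_r; the hinge pushes
    # anomaly scores (y = 1) at least delta standard deviations above it.
    return ((1 - y) * dev.abs() + y * torch.clamp(delta - dev, min=0)).mean()
```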

3.4 Total Flow of MPAD

We first initialize the teacher network and the student network with the observed anomalies and the unlabeled samples, respectively. During training, batches are drawn from the initial dataset with a one-to-one ratio of normal to abnormal samples. Within a batch, the supervised loss of the teacher network is calculated. At the same time, a one-step update is made to the student network, and its loss on the validation set (\(\mathcal {L}_{\textrm{CN}}(\theta _{S})\)) is recorded. The teacher network then generates pseudo anomalies (MPA) according to the given probability threshold \(\mathcal {P}^{\textrm{MPA}}\). We add these MPAs to the observed anomaly set to update the student network a second time, and again record its loss on the validation set (\(\mathcal {L}_{\textrm{CN}}(\theta ^{'}_{S})\)). Finally, the teacher network is updated using the deviation between the two student losses together with its own loss on the labeled set, as sketched below.
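The flow above could be orchestrated roughly as follows, reusing the `mpad_step` sketch from Sect. 3.2; `one_to_one_batches`, the data variables, and `P_MPA` are assumed, illustrative names.

```python
# Outer training loop: each batch pairs unlabeled samples with labeled
# anomalies in a 1:1 ratio and performs one alternating teacher/student update.
for epoch in range(num_epochs):
    for x_u, x_l, y_l in one_to_one_batches(D_U, D_L):
        mpad_step(teacher, student, opt_T, opt_S,
                  x_u, x_l, y_l, x_val, y_val, lam=P_MPA)
```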

4 Experiments

4.1 Experimental Settings

Datasets. We evaluate the proposed MPAD on five public datasets covering different fields.

Census is a dataset of the US Census from 1994 and 1995 that includes 500 variables related to demographics and employment. The very few people with an income above $50,000 are regarded as anomalies.

Campaign comes from a telemarketing campaign of a Portuguese bank. It contains 62 attributes covering customer information and economic activity. The small number of users who subscribed to the banking product are labeled as anomalies.

Thyroid was established to study whether patients have hypothyroidism. There are three categories: normal, hyperfunctioning, and dysfunctional. We merge the latter two categories as anomalies.

Arrhythmia is a dataset for studying arrhythmia that contains information about patients' physical conditions and heart rates. Records are classified into one normal ECG class and 15 types of arrhythmia; we combine the arrhythmia classes as anomalies.

Pima is a diabetes research dataset of Pima Indian women from the UCI repository. We label those with diabetes as anomalies.

Baselines

  • DevNet [11]: learns anomaly scores directly rather than improving representations. It designs a reference score for normal samples according to a prior distribution and uses a contrastive loss to separate the anomaly scores of normal and abnormal samples. It is an end-to-end anomaly detection algorithm based on partially observed anomalies and is also the student model of our MPAD.

  • DeepSAD [15]: is a semi-supervised version of SVDD that builds models with both unlabeled and labeled data. It places normal samples close to the center of a hypersphere while mapping abnormal samples far from that center according to the label information, improving on SVDD.

  • SS-DGM [21]: is a semi-supervised deep generative model that combines a discriminative model of latent features with a generative semi-supervised model. We follow the setting of SS-DGM in [15] and apply it to anomaly detection.

  • OCSVM [7]: is a classic one-class anomaly detection model that uses only normal samples for training. It builds a hyperplane to separate samples, maximizing the margin between positive and negative samples.

  • iForest [18]: is an efficient unsupervised anomaly detection model that isolates anomalies by recursively partitioning on feature values.

Implementation Details. We apply an MLP with one hidden layer as the teacher network in the implementation of MPAD. The student model (DevNet) has the same architecture, except that the teacher outputs a two-dimensional vector while DevNet outputs a one-dimensional vector, so that their different losses can be computed. The hidden layer has 64 neurons. The teacher and student networks are optimized with SGD and Adam, with learning rates of 0.03 and 0.001, respectively. All algorithms use an 8:2 train/test split with random state 42. DeepSAD, SS-DGM, and the proposed MPAD are implemented in PyTorch, while iForest and OCSVM use scikit-learn.
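For concreteness, the two networks and optimizers described above might be set up as below; the ReLU activation and the feature dimension are assumptions.

```python
import torch.nn as nn
import torch.optim as optim

def mlp(in_dim, out_dim, hidden=64):
    # One hidden layer with 64 neurons, as described above.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

d = 16                                # feature dimension (dataset-dependent)
teacher = mlp(d, 2)                   # two logits, trained with cross-entropy
student = mlp(d, 1)                   # scalar anomaly score, trained with Eq. (10)
opt_T = optim.SGD(teacher.parameters(), lr=0.03)
opt_S = optim.Adam(student.parameters(), lr=0.001)
```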

Metrics. AUC-ROC: the area under the curve with the false positive rate on the abscissa and the true positive rate on the ordinate. It is a comprehensive criterion representing the expected generalization performance of a model across operating points. If one curve completely encloses another, the former model performs better, so the area under the curve summarizes a model's quality well. In anomaly detection, it tends to reflect the ability to recognize the normal class because of the extreme class imbalance.

AUC-PR: the area under the curve drawn with the recall of the positive samples on the abscissa and the precision on the ordinate. It attends only to the positive samples (anomalies). As with AUC-ROC, if one curve encloses another, the enclosing model is better able to achieve high recall and precision simultaneously. Since anomaly detection focuses on the anomaly category, we care more about AUC-PR than AUC-ROC.
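Both metrics can be computed with scikit-learn, treating anomalies as the positive class; here AUC-PR is reported via average precision, a standard estimator of the area under the PR curve, and variable names are illustrative.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# y_test in {0, 1} with anomalies as the positive class; scores_test are
# the student's anomaly scores on the test set.
auc_roc = roc_auc_score(y_test, scores_test)
auc_pr = average_precision_score(y_test, scores_test)
```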

Table 1. The performance w.r.t. AUC-ROC and AUC-PR among the proposed MPAD and the baselines on five tabular datasets with 30 labeled anomalies and 2% noise injection for training. The best performance for each dataset is boldfaced.

4.2 Effectiveness Results

The number of available anomalies in this comparison is 30, and the noise ratio of the training set is 0.02. MPAD and all baselines use their best-performing hyperparameters, and the optimal performance is reported in Table 1. The proposed MPAD achieves the best AUC-PR on all five datasets and the best AUC-ROC on three. Its AUC-PR exceeds the best baseline on each dataset by 12.2%, 7.6%, 8%, 5.3%, and 4.7%, respectively, illustrating the advance achieved by our algorithm. In general, the unsupervised iForest and the one-class OCSVM do not perform as well as the first three baselines. As DeepSAD and SS-DGM are also semi-supervised, they are second only to MPAD on the Thyroid dataset, while DevNet takes second place on the remaining datasets. SS-DGM shows higher AUC-ROC on Arrhythmia and Pima, indicating better recognition of normal samples.

Fig. 1. AUC-PR w.r.t. the number of labeled anomalies on five datasets.

4.3 Data Efficiency Study

This experiment tests how the performance of each algorithm changes as the number of observed anomalies increases. The noise ratio is fixed at 0.02, and the number of observed anomalies is varied over 5, 15, 30, 60, and 120. Figure 1 shows that MPAD consistently maintains a high AUC-PR on the Census, Campaign, and Arrhythmia datasets, and its results on Pima trend clearly upward as observed anomalies increase. On Thyroid, MPAD stays at a high level when few anomalies are visible, but its performance declines as their number grows. The number of visible anomalies affects the initialization of the teacher network; this result suggests that some visible anomalies overlap with normal samples, degrading the teacher and, in turn, the student. OCSVM and iForest are unaffected by the number of visible anomalies: they hold a certain advantage when visible anomalies are few, but they cannot exploit this information and lose their edge as the number of anomalies increases.

5 Conclusion

We introduce pseudo-label algorithms to anomaly detection with partially observed anomalies. Unlabeled data is thereby used judiciously, and valuable pseudo anomalies can be extracted to assist in building anomaly detection models. Most importantly, the proposed meta-pseudo-anomaly generation procedure makes the teacher and student networks update alternately, with the teacher guided by both supervised information and the student's feedback. Comprehensive experiments show that the pseudo anomalies generated in this way are better than general pseudo labels, and that our framework outperforms other state-of-the-art anomaly detection methods on five public datasets.