Abstract
This paper introduces the notion of learning from contradictions (a.k.a. Universum learning) for deep one class classification problems. We formalize this notion for the widely adopted one class large-margin loss (Schölkopf et al. in Neural Comput 13(7):1443–1471), and propose the deep one class classification using contradictions (DOC\(^3\)) algorithm. We show that learning from contradictions incurs lower generalization error by comparing the empirical Rademacher complexity of DOC\(^3\) against its traditional inductive learning counterpart. Further, our proposed ‘learning from contradictions’ is a generic learning setting and can complement other advanced learning settings. To illustrate this, we extend the adversarial learning based DROCC-LF (Goyal et al. in International conference on machine learning, PMLR, 2020) algorithm under this new setting. Our empirical results demonstrate the efficacy of DOC\(^3\) and its extensions compared to popular baseline algorithms on several benchmark and real-life data sets.
1 Introduction
Anomaly detection (AD) is one of the most widely researched problems in the machine learning community (Chandola et al., 2009). In its basic form, the task of AD involves discerning patterns in data that do not conform to expected ‘normal’ behavior. These non-conforming patterns are referred to as anomalies or outliers. Anomaly detection problems manifest in several real-life forms, such as defect detection in manufacturing lines, intrusion detection for cyber security, or pathology detection for medical diagnosis. There are several mechanisms to handle anomaly detection problems, viz., parametric or non-parametric statistical modeling, spectral based modeling, or classification based modeling (Chandola et al., 2009). Of these, the classification based approach has been widely adopted in the literature (Scholkopf et al., 2002; Tax & Duin, 2004; Tan et al., 2016; Cherkassky & Mulier, 2007). One specific classification based formulation which has gained wide adoption is one class classification (Scholkopf et al., 2002; Tax & Duin, 2004), where we design a parametric model to estimate the support of the ‘normal’ class distribution. The estimated model is then used to detect ‘unseen’ abnormal samples.
With the recent success of deep learning based approaches for different machine learning problems, there has been a surge in research adopting deep learning for one class problems (Ruff et al., 2021; Pang et al., 2020; Chalapathy & Chawla, 2019). However, most of these works adopt an inductive learning setting. This makes the underlying model estimation data hungry, and such models perform poorly for applications with limited training data availability, like medical diagnosis, industrial defect detection, etc. The learning from contradictions paradigm (popularly known as Universum learning) has been shown to be particularly effective for problems with limited training data availability (Vapnik, 2006; Sinz et al., 2008; Weston et al., 2006; Chen & Zhang, 2009; Cherkassky et al., 2011; Shen et al., 2012; Dhar & Cherkassky, 2015; Zhang & LeCun, 2017; Xiao et al., 2021). However, it has mostly been limited to binary or multi class problems. In this paradigm, along with the labeled training data we are also given a set of unlabeled contradictory (a.k.a. universum) samples. These universum samples belong to the same application domain as the training data, but are known not to belong to any of the classes. The rationale behind this setting comes from the fact that even though obtaining labels is very difficult, obtaining such additional unlabeled samples is relatively easy. These unlabeled universum samples act as contradictions and should not be explained by the estimated decision rule. Adopting this setting for one class problems is not straightforward. A major conceptual problem is that one class model estimation represents unsupervised learning, where the notion of contradiction needs to be properly redefined. In this paper, we make the following contributions:
1. Definition We introduce the notion of ‘Learning from contradictions’ for one class problems (Definition 2).
2. Formulation We analyze the popular one class hinge loss (Schölkopf et al., 2001), and extend it under universum settings to propose the Deep One Class Classification using Contradictions DOC\(^3\) algorithm. Further, our proposed ‘learning from contradictions’ is a generic learning setting and can complement other advanced learning settings. To illustrate this, we extend the adversarial learning based DROCC-LF (Goyal et al., 2020) algorithm under universum settings and call it DROCC-LF (univ) (see Algo. 1).
3. Generalization error We analyze the generalization performance of one class formulations under inductive and universum settings using Rademacher complexity based bounds, and show that learning under the universum setting can provide improved generalization compared to its inductive counterpart.
4. Empirical results Finally, we provide an exhaustive set of empirical results on several tabular and image datasets in support of our approach.
2 One class learning under inductive settings
First we introduce the widely adopted inductive learning setting used for one class problems (Scholkopf et al., 2002; Cherkassky & Mulier, 2007).
Definition 1
(Inductive setting) Given i.i.d training samples from a single class \({\mathcal {T}}=({\textbf{x}}_i, \; y_i = +1)_{i=1}^n \sim {\mathcal {D}}_{{\mathcal {X}}\vert {\mathcal {Y}} = +1}^n\), with \({\textbf{x}} \in {\mathcal {X}} \subseteq \Re ^d\) and \(y \in {\mathcal {Y}} = \{-1,+1 \}\); estimate a hypothesis \(h^*:{\mathcal {X}} \rightarrow {\mathcal {Y}}\) from a hypothesis class \({\mathcal {H}}\) which minimizes,

$$\begin{aligned} h^* = \underset{h \in {\mathcal {H}}}{\text {argmin}} \; {\mathbb {E}}_{{\mathcal {D}}_{{\mathcal {T}}}}\big [{\textbf{1}}(h({\textbf{x}}) \ne y)\big ] \end{aligned}$$
(1)

where,
\({\mathcal {D}}_{{\mathcal {T}}}\) is the training distribution (consisting of both classes)
\({\mathcal {D}}_{{\mathcal {X}} \vert {\mathcal {Y}} = +1}\) is the class-conditional distribution,
\({\textbf{1}}(\cdot )\) is the indicator function, and
\({\mathbb {E}}_{{\mathcal {D}}_{{\mathcal {T}}}}(\cdot )\) is the expectation under training distribution.
Note that the underlying data generation process assumes a two class problem, of which samples from only one class are available during training. The overall goal is to estimate a model which minimizes the error on future test data, containing samples from both the normal (\(y = +1\)) and abnormal (\(y = -1\)) classes. A typical example is AI driven visual inspection of product defects in a manufacturing line, where images or videos of non-defective products are available in abundance. The goal is to detect ‘defective’ (abnormal/anomalous) products through visual inspection in manufacturing lines (Bergmann et al., 2019; Weimer et al., 2016). A popular loss function used in such settings is the \(\nu \)-SVM loss (Schölkopf et al., 2001),

$$\begin{aligned} \underset{{\textbf{w}}, \varvec{\xi }, \rho }{\text {min}} \quad \frac{1}{2}\vert \vert {\textbf{w}}\vert \vert _2^2 + \frac{1}{\nu n} \sum _{i=1}^n \xi _i - \rho \quad \text {s.t.} \;\; {\textbf{w}}^\top \phi ({\textbf{x}}_i) \ge \rho - \xi _i, \;\; \xi _i \ge 0, \;\; i = 1, \ldots , n \end{aligned}$$
(2)
where \(\nu \in (0,1]\) is a user-defined parameter which controls the margin errors \(\sum _i \xi _i\) and the sizes of the geometric \(\frac{1}{\vert \vert {\textbf{w}}\vert \vert }\) and functional \(\rho \) margins. \(\phi (\cdot ): {\mathcal {X}} \rightarrow {\mathcal {G}}\) is a feature map. Typical examples include an empirical kernel map (see Definition 2.15 in Scholkopf et al. (2002)) or a map induced by a deep learning network (Goodfellow et al., 2016). The final decision function is given as, \(h({\textbf{x}}) = \left\{ \begin{array}{l l} +1;\quad \text {if} \; {\textbf{w}}^\top \phi ({\textbf{x}}) \ge \rho \\ -1;\quad \text {else} \end{array}\right. \). Note that recent works like Ruff et al. (2018) extend a different loss function which uses a ball to explain the support of the data distribution, following Tax and Duin (2004). As discussed in Schölkopf et al. (2001), most of the time these two formulations yield equivalent decision functions. For example, with kernel machines \({\textbf{K}}({\textbf{x}},{\textbf{x}}^{\prime }) = \phi ({\textbf{x}})^\top \phi ({\textbf{x}}^\prime )\) depending solely on \({\textbf{x}} - {\textbf{x}}^{\prime }\) (like RBF kernels), the two formulations are the same. Hence, most of the improvements discussed in this work translate to such alternate formulations. In this paper however, we solve the following one class Hinge Loss,

$$\begin{aligned} \underset{{\textbf{w}}}{\text {min}} \quad \frac{1}{2}\vert \vert {\textbf{w}}\vert \vert _2^2 + C \sum _{i=1}^n [1-{\textbf{w}}^\top \phi ({\textbf{x}}_i)]_+ \end{aligned}$$
(3)
to estimate the decision function \(f({\textbf{x}}) = {\textbf{w}}^\top \phi ({\textbf{x}})\) and use the decision rule, \(h({\textbf{x}}) = \left\{ \begin{array}{l l} +1;\quad \text {if} \; f({\textbf{x}}) \ge 1 \\ -1;\quad \text {else} \end{array}\right. \). Here, the user-defined parameter C controls the trade-off between explaining the training samples (through small margin error \(\sum _{i=1}^n \xi _i \)), and the margin size (through \(\vert \vert {\textbf{w}}\vert \vert _2^2\)), which in turn controls the generalization error. For deep learning architectures we optimize over all the model parameters and equivalently regularize the entire matrix norm \(\vert \vert {\textbf{W}}\vert \vert _F^2\), see Goyal et al. (2020), Ruff et al. (2018). Note that we solve the one class Hinge loss (3) for two main reasons,
- First, it has the advantage that \(L_T({\textbf{w}},\phi (\{{\textbf{x}}\}_{i=1}^n)) = \sum _{i=1}^n [1-{\textbf{w}}^T\phi ({\textbf{x}}_i)]_+ \) exhibits the same form as the traditional hinge loss used for binary classification problems (Vapnik, 2006) and can be easily solved using existing software packages (Paszke et al., 2019; Abadi et al., 2016; Pedregosa et al., 2011). Throughout the paper we refer to (3) using underlying deep architectures as the Deep One Class DOC (Hinge) formulation.
- Second, solving Eq. (3) also provides the solution for Eq. (2). This connection follows from Proposition 1.
Proposition 1
(Connection between Eq. (2) and Eq. (3))
1. Any solution \({\textbf{w}}\) of Eq. (3) also solves Eq. (2) with \(\nu = \frac{1}{Cn \delta }\); where \(\delta > 0\) is a scalar that depends on the solution of Eq. (3). Further, this solution \((\hat{\mathbf {{{w}}}}, \rho )\) of Eq. (2) is given as \(\hat{\mathbf {{{w}}}} = {\textbf{w}}\delta , \quad \rho = \delta \).
2. The decision function obtained through solving Eq. (3), i.e., \({\textbf{w}}^\top \phi ({\textbf{x}}) - 1 = 0\), coincides with the decision function \(\hat{\mathbf {{{w}}}}^\top \phi ({\textbf{x}}) - \rho = 0\) obtained by solving Eq. (2) using the solution discussed above.
All proofs are provided in “Appendix”.
3 One class learning using contradictions a.k.a. Universum learning
3.1 Problem formulation
Learning from contradictions or Universum learning was introduced in Vapnik (2006) for binary classification problems to incorporate a priori knowledge about admissible data samples. For example, if the goal of learning is to discriminate between handwritten digits ‘5’ and ‘8’, one can introduce additional knowledge in the form of other handwritten letters ‘a’, ‘b’, ‘c’, ‘d’, \(\ldots \), ‘z’. These examples from the Universum contain certain information about the handwriting styles of authors, but they cannot be assigned to either of the two classes (5 or 8). Further, these Universum samples do not have the same distribution as the labeled training samples. In this work we introduce the notion of ‘Learning from Contradictions’ for one class problems. Similar to the inductive setting (Definition 1), the goal here is also to minimize the generalization error on future test data containing both normal (\(y=+1\)) and abnormal (\(y=-1\)) samples. Here however, during training, in addition to the samples from the normal class \(({\textbf{x}}_i, y_i = +1)_{i=1}^n\), we are also provided with universum (contradictory) samples, which are known not to belong to either of the (normal or abnormal) classes of interest. A practical use-case is automated visual inspection based anomaly detection in manufacturing lines. Here the target is to identify the defects in a specific product type (say ‘screws’ in Fig. 1). For this case, the images from other product types in the manufacturing line act as universum samples. Note that such universum samples belong to the same application domain (i.e. visual inspection data), but do not represent either of the classes of interest (normal or anomalous screws). This setting is formalized as,
Definition 2
(Learning from contradictions a.k.a Universum setting) Given i.i.d training samples \({\mathcal {T}}=({\textbf{x}}_i,y_i = +1)_{i=1}^n \sim {\mathcal {D}}_{{\mathcal {X}}\vert {\mathcal {Y}} = +1}^n\), with \({\textbf{x}} \in {\mathcal {X}} \subseteq \Re ^d\) and \(y \in {\mathcal {Y}} =\{-1,+1\}\) and additional m universum samples \({\mathcal {U}} = ({\textbf{x}}_{i^\prime }^{*})_{i^\prime =1}^m \sim {\mathcal {D}}_{{\mathcal {U}}}\) with \({\textbf{x}}^{*} \in {\mathcal {X}}_{U}^* \subseteq \Re ^d\), estimate \(h^*:{\mathcal {X}} \rightarrow {\mathcal {Y}}\) from hypothesis class \({\mathcal {H}}\) which, in addition to Eq. (1), obtains maximum contradiction on universum samples i.e. maximizes the following probability for \({\textbf{x}}^* \in {\mathcal {X}}_{U}^*\),
\({\mathcal {D}}_{{\mathcal {U}}}\) is the universum distribution,
\({\mathbb {P}}_{{\mathcal {D}}_{{\mathcal {U}}}}(\cdot )\) is probability under universum distribution,
\({\mathbb {E}}_{{\mathcal {D}}_{{\mathcal {U}}}}(\cdot )\) is the expectation under the universum distribution, and
\({\mathcal {X}}_{U}^{*}\) is the domain of the universum data.
Learning using contradictions under the Universum setting has the dual goal of minimizing the generalization error in Eq. (1) while maximizing the contradiction on universum samples in Eq. (4). The following proposition provides guidelines on how this can be achieved for the one class hinge loss in Eq. (3).
Proposition 2
For the one class hinge loss in Eq. (3), maximum contradiction on universum samples \({\textbf{x}}^* \in {\mathcal {X}}_U^*\) can be achieved when,

$$\begin{aligned} {\textbf{w}}^\top \phi ({\textbf{x}}^*) - 1 = 0, \quad \forall {\textbf{x}}^* \in {\mathcal {X}}_U^* \end{aligned}$$
(5)
That is, we need the universum samples to lie on the decision boundary. This motivates the following one class loss using contradictions (under Universum settings), where we relax the constraint in Eq. (5) by introducing a \(\Delta \)-insensitive loss similar to Weston et al. (2006), Dhar et al. (2019) and solve,

$$\begin{aligned} \underset{{\textbf{w}}}{\text {min}} \quad \frac{1}{2}\vert \vert {\textbf{w}}\vert \vert _2^2 + C \underbrace{\sum _{i=1}^n [1-{\textbf{w}}^\top \phi ({\textbf{x}}_i)]_+}_{L_T} + C_U \underbrace{\sum _{i^\prime =1}^m \big [\vert {\textbf{w}}^\top \phi ({\textbf{x}}_{i^\prime }^{*}) - 1\vert - \Delta \big ]_+}_{L_U} \end{aligned}$$
(6)
Here, \([x]_+ = \text {max}(0,x)\). Further, the interplay between \(C\) and \(C_U\) controls the trade-off between explaining the training samples using \(L_T\) versus maximizing the contradiction on Universum samples using \(L_U\). For \(C_U = 0\) or \(\Delta \rightarrow \infty \), Eq. (6) transforms to Eq. (3). For deep learning models, we optimize Eq. (6) over all the model parameters and refer to it as Deep One Class Classification using Contradictions (DOC\(^3\)).
3.2 Analysis of generalization error bound
Next we provide theoretical justification in support of Universum learning. We argue that learning under universum settings using DOC\(^3\) can provide improved generalization compared to its inductive counterpart DOC (Hinge). For this, we first derive a generic form of the generalization error bound for one class learning using the Rademacher complexity capacity measure in Theorem 1.
Theorem 1
(Generalization error bound) Let \({\mathcal {F}}\) be the class of functions from which the decision function \(f({\textbf{x}})\) in Eq. (3) and (6) are estimated. Let \(R_{f,1} = \{ {\textbf{x}}: f({\textbf{x}}) \ge 1 \}\) be the induced decision region. Then, with probability at least \(1-\eta \), \(\eta \in [0,1]\), over any independent draw of the random sample \({\mathcal {T}}=({\textbf{x}}_i,y_i = +1)_{i=1}^n \sim {\mathcal {D}}_{{\mathcal {X}}\vert {\mathcal {Y}} = +1}^n\), for any \(\kappa > 0\) we have,
where \(\quad \xi _i = [1-f({\textbf{x}}_i)]_+\); \(R_{f,\theta } = \{{\textbf{x}}: f({\textbf{x}}) \ge \theta \} \)
\(\hat{{\mathcal {R}}}_n({\mathcal {F}}) = {\mathbb {E}}_{\sigma }[\underset{f \in {\mathcal {F}}}{\text {sup}} \vert \frac{2}{n} \sum _{i=1}^n \sigma _i f({\textbf{x}}_i)\vert \Big \vert ({\textbf{x}}_i)_{i=1}^n] \)
\(\sigma _i\) are independent uniform \(\{ \pm 1 \}\)-valued random variables, a.k.a. Rademacher variables.
Theorem 1 is agnostic of model parameterization and holds for any popularly adopted kernel machine or deep learning architecture. Similar to Theorem 7 in Schölkopf et al. (2001), Theorem 1 gives a probabilistic guarantee that new points lie in a larger region \(R_{f,1-\kappa }\). Here, we rather use the Empirical Rademacher Complexity (ERC) \(\hat{{\mathcal {R}}}_n({\mathcal {F}})\) as the capacity measure of the hypothesis class, instead of the covering number. Additionally, our bound does not contain a \(\frac{1}{\kappa ^2}\) term as in Schölkopf et al. (2001), and only has the scaling factor of \(\frac{1}{\kappa }\). As seen from Theorem 1 above, it is preferable to use a hypothesis class \({\mathcal {F}}\) with smaller ERC \(\hat{{\mathcal {R}}}_n({\mathcal {F}})\). Next we compare the ERC of the hypothesis classes induced by the formulations in Eq. (3) versus Eq. (6).
Theorem 2
(Empirical Rademacher complexity) For the hypothesis classes induced by the formulations,

- Equation (3): \({\mathcal {F}}_{\text {ind}} = \{ f: {\textbf{x}} \rightarrow {\textbf{w}}^{\top } \phi ({\textbf{x}}) \Big \vert \vert \vert {\textbf{w}}\vert \vert _2^2 \le \Lambda ^2 \}\)
- Equation (6): \({\mathcal {F}}_{\text {univ}} = \{ f: {\textbf{x}} \rightarrow {\textbf{w}}^{\top } \phi ({\textbf{x}}) \Big \vert \vert \vert {\textbf{w}}\vert \vert _2^2 \le \Lambda ^2; \vert {\textbf{w}}^{\top } \phi (\mathbf {x^*}) -1\vert \le \Delta \;, \; \forall {\textbf{x}}^* \in {\mathcal {X}}_U^*\}\)
The following holds,

1. \(\hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {ind}}) \ge \hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {univ}})\)
2. Further, for any fixed mapping \(\phi (\cdot )\), \(\forall \gamma \ge 0\) we have,
   (a) \(\hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {ind}}) \le \frac{2\Lambda }{n} \sqrt{\sum \limits _{i=1}^n \vert \vert {\textbf{z}}_i\vert \vert ^2}\); where \({\textbf{z}} = \phi ({\textbf{x}})\)
   (b) \(\hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {univ}}) \le \frac{2\Lambda }{n} \sqrt{\sum \limits _{i=1}^n \vert \vert {\textbf{z}}_i\vert \vert ^2} \, \underset{\gamma \ge 0}{\text {min}} \; K(\gamma ) \big [1 - \varvec{\Sigma }(\gamma )\big ] ^{\frac{1}{2}}\)
where \(\otimes \) denotes the Kronecker product and \(tr\) the matrix trace.
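Part 2(a) of Theorem 2 above admits a small numerical sanity check: for the linear class \({\mathcal {F}}_{\text {ind}}\), Cauchy–Schwarz gives the closed-form supremum \(\underset{\vert \vert {\textbf{w}}\vert \vert \le \Lambda }{\text {sup}} \vert \frac{2}{n}\sum _i \sigma _i {\textbf{w}}^\top {\textbf{z}}_i\vert = \frac{2\Lambda }{n}\vert \vert \sum _i \sigma _i {\textbf{z}}_i\vert \vert _2\), so the ERC can be Monte Carlo estimated and compared against the bound. The sketch below is illustrative only, with randomly generated stand-in features and hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, Lam = 50, 5, 1.0                 # hypothetical sample size, feature dim, norm bound
Z = rng.normal(size=(n, d))            # stand-in features z_i = phi(x_i)

def erc_monte_carlo(Z, Lam, n_draws=4000):
    # For F_ind = {x -> w^T z : ||w||_2 <= Lam}, Cauchy-Schwarz gives
    #   sup_w |(2/n) sum_i s_i w^T z_i| = (2*Lam/n) * ||sum_i s_i z_i||_2,
    # so the ERC is the mean of this quantity over Rademacher draws s.
    s = rng.choice([-1.0, 1.0], size=(n_draws, Z.shape[0]))
    return np.mean(2.0 * Lam / Z.shape[0] * np.linalg.norm(s @ Z, axis=1))

erc = erc_monte_carlo(Z, Lam)
bound = 2.0 * Lam / n * np.sqrt((Z ** 2).sum())   # R.H.S of Theorem 2, part 2(a)
assert erc <= bound                               # holds by Jensen's inequality
```

The gap between the estimate and the bound is exactly the slack introduced by Jensen's inequality in the proof.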
Note that several recent works (Neyshabur et al., 2015; Sokolic et al., 2016; Cortes et al., 2017) derive the ERC of the function class induced by an underlying neural architecture. In this analysis however, we fix the feature map and analyze how the loss function in Eq. (6) reduces the function class capacity compared to Eq. (3). This simplifies our analysis and focuses on the effect of the proposed new loss in Eq. (6) under the universum setting. As seen from Theorem 2 (1), the function class induced under the universum setting (using contradictions) exhibits lower ERC compared to that under inductive settings. A more explicit characterization of the ERC is provided in part (2). Setting \(\gamma = 0\) in (b) recovers the R.H.S of (a); hence the R.H.S in (b) is never larger than in (a). Further note that \(\varvec{\Sigma }(\gamma )\) in Eq. (9) has the form of a correlation matrix between the training and universum samples in the feature space. In fact, we have \(\Sigma (\infty ) = \underset{\gamma \rightarrow \infty }{\text {lim}} \Sigma (\gamma ) = \frac{ tr(VZ^{\top }ZV^{\top })}{ tr(Z^{\top }Z) \; tr(VV^{\top })}\). This shows that, for a fixed number of universum samples m and fixed \(\Delta \), the effect of the DOC\(^3\) algorithm is influenced by the correlation between training and universum samples in the feature space. Loosely speaking, the DOC\(^3\) algorithm searches for a solution which, in addition to reducing the margin errors \(\xi _i\), also minimizes this correlation, and by doing so minimizes the generalization error. Similar conclusions have been empirically derived for binary and multiclass problems in Weston et al. (2006), Chapelle et al. (2008), Cherkassky et al. (2011) and Dhar et al. (2019). Here, we provide the theoretical reasoning for one class problems. Further, we confirm these theoretical findings in our results (Sect. 5.3.3).
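The limiting correlation term \(\Sigma (\infty )\) above can be computed directly from the feature matrices. A minimal numpy sketch; the matrix layout (rows of Z and V holding the training and universum features, respectively) is an assumption for illustration:

```python
import numpy as np

def sigma_inf(Z, V):
    """Sigma(inf) = tr(V Z^T Z V^T) / (tr(Z^T Z) * tr(V V^T)), where rows of
    Z (n x d) hold training features and rows of V (m x d) hold universum
    features (an assumed layout, not specified in the text)."""
    num = np.trace(V @ Z.T @ Z @ V.T)
    den = np.trace(Z.T @ Z) * np.trace(V @ V.T)
    return num / den

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 8))   # hypothetical training features
V = rng.normal(size=(20, 8))   # hypothetical universum features

# tr(V Z^T Z V^T) = ||Z V^T||_F^2 <= tr(Z^T Z) tr(V V^T) by Cauchy-Schwarz,
# so the value behaves like a squared correlation in [0, 1].
assert 0.0 <= sigma_inf(Z, V) <= 1.0
```

A smaller value indicates training and universum samples that are less correlated in the feature space, which per the discussion above is the regime where DOC\(^3\) tightens the bound the most.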
3.3 Algorithm implementation
A limitation in solving Eq. (6) is handling the absolute term in \(L_U\). In this paper we adopt an approach similar to that used in Weston et al. (2006), Dhar et al. (2019) and simplify this by re-writing \(L_U\) as a sum of two hinge functions. To do this, for every universum sample \({\textbf{x}}_{i^\prime }^{*}\) we create two artificial samples, \(({\textbf{x}}_{i^\prime }^{*},y_{i^\prime 1}^{*}=1), ({\textbf{x}}_{i^\prime }^{*},y_{i^\prime 2}^{*}=-1)\), and re-write,

$$\begin{aligned} L_U = \sum _{i^\prime =1}^m \big [\epsilon _1 - f({\textbf{x}}_{i^\prime }^{*})\big ]_+ + \big [\epsilon _2 + f({\textbf{x}}_{i^\prime }^{*})\big ]_+ \end{aligned}$$
where, \(\epsilon _1 = 1 - \Delta \) and \(\epsilon _2 = -1 - \Delta \). Now, the universum loss is the sum of two hinge functions with \(\epsilon _1, \epsilon _2\) margins, and can be solved using standard deep learning libraries (Paszke et al., 2019; Abadi et al., 2016; Pedregosa et al., 2011).
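The rewrite above can be checked numerically: the \(\Delta \)-insensitive term \([\vert f({\textbf{x}}^*) - 1\vert - \Delta ]_+\) from Eq. (6) equals the sum of the two \(\epsilon _1, \epsilon _2\)-margin hinges. A small pure-numpy sketch over hypothetical decision-function values:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def universum_loss_abs(f, delta):
    # Delta-insensitive universum loss from Eq. (6): [ |f(x*) - 1| - Delta ]_+
    return relu(np.abs(f - 1.0) - delta)

def universum_loss_two_hinges(f, delta):
    # Same loss written as two hinges over the artificial samples
    # (x*, y=+1) and (x*, y=-1), with eps1 = 1 - Delta, eps2 = -1 - Delta.
    eps1, eps2 = 1.0 - delta, -1.0 - delta
    return relu(eps1 - f) + relu(f + eps2)

f = np.linspace(-3.0, 3.0, 101)       # hypothetical decision-function values
for delta in (0.0, 0.1, 0.5):
    assert np.allclose(universum_loss_abs(f, delta),
                       universum_loss_two_hinges(f, delta))
```

The two hinges are never simultaneously active for \(\Delta \ge 0\) (that would require \(f < 1-\Delta \) and \(f > 1+\Delta \) at once), which is why the decomposition is exact.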
4 Existing approaches and related works
Most research in Anomaly Detection (AD) can be broadly categorized as adopting either traditional (shallow) or the more modern deep learning based approaches. Traditional approaches generally adopt parametric or non-parametric statistical modeling, spectral based modeling, or classification based modeling (Chandola et al., 2009). Typical examples include PCA based methods (Jolliffe, 2002; Hoffmann, 2007), proximity based methods (Knorr et al., 2000; Ramaswamy et al., 2000), tree-based methods like Isolation Forest (IF) (Liu et al., 2008), or classification based OC-SVM (Schölkopf et al., 2001) and Support Vector Data Description (SVDD) (Tax & Duin, 2004). These techniques provide good performance for an optimally tuned feature map. However, for complex domains like vision or speech, where designing optimal feature maps is non-trivial, such approaches perform sub-optimally. A detailed survey of these approaches is available in Chandola et al. (2009).
In contrast, for the modern deep learning based approaches, extracting the optimal feature map is imbibed in the learning process. Broadly, there are three main sub-categories of deep learning based AD. First, the Deep Auto Encoder and its variants like DCAE (Masci et al., 2011; Makhzani & Frey, 2014) or ITAE (Huang et al., 2019). Here, the aim is to build an embedding where the normal samples are correctly reconstructed while the anomalous samples exhibit high reconstruction error. The second type of approach adopts Generative Adversarial Network (GAN) based techniques like AnoGAN (Schlegl et al., 2017), GANomaly (Akcay et al., 2018), EGBAD (Zenati et al., 2018), CBiGAN (Carrara et al., 2020) etc. These approaches typically focus on generating additional samples which follow a similar distribution as the training data, followed by designing an anomaly score to discriminate between normal versus anomalous samples. Finally, the third category consists of the more recent one class classification based approaches like Deep SVDD (Ruff et al., 2018), DROCC (Goyal et al., 2020) etc. These approaches solve a one class loss function catered for deep architectures. All the above approaches however adopt an unsupervised inductive learning setting. There is a newer class of classification based paradigms which adopt semi or self supervised formulations. Typical examples include GOAD (Bergman & Hoshen, 2020), SSAD (Ruff et al., 2019), ESAD (Huang et al., 2020) etc. However, such approaches use fundamentally different problem settings (like a multi class problem for GOAD), or make different assumptions on the additional data available.
Learning with disjoint auxiliary (DA) data: A recently popularized learning setting assumes the availability of additional auxiliary data which is disjoint from the test set. The underlying assumption is that these auxiliary samples may or may not follow the same distribution as the test data. This idea was first introduced in Dhar (2014) (see Sect. 4.3) and misconstrued as Universum learning. Note that the notion of universum samples was originally introduced to act as contradictions to the concept classes in the test set (Vapnik, 2006). The above assumption does not adhere to this notion and violates the true essence of Universum learning. This setting has recently been used to propose ‘outlier exposure’ in Hendrycks et al. (2018) and its variants (Ruff et al., 2021; Goyal et al., 2020). A more advanced variation of this setup generates the anomalous samples through perturbation (Cai & Fan, 2022) or through distribution-shifting transformations (Tack et al., 2020) and uses contrastive losses. Our learning from contradictions setting is different from the above methods in the following aspects,
- (Problem setting) The problem setting is different. While the above setting only assumes auxiliary data disjoint from the test data’s concept classes (‘normal’ and ‘anomalous’ samples), Universum learning follows a different assumption: the concept classes of the universum data are different from both the normal as well as anomalous samples. This assumption is quintessential for proving Prop. 2, which in turn provides the optimality constraint on the decision function (in Eq. 5). Prop. 2 is not possible in the DA setting.
- (Formulation) The difference in problem setting is also clear from the formulations. For example, the formulations proposed under the disjoint auxiliary setting, like Dhar (2014), Outlier Exposure (OE) (Hendrycks et al., 2018), DROCC-LF (OE) (Goyal et al., 2020) etc., only use the relation between the in-lier training data and the additional auxiliary data. No information on the relation between the auxiliary data and the anomalous samples in the test set is encoded in the loss function. In essence, such approaches control the complexity of the hypothesis class by constraining the space in which ‘normal’ samples can lie. In contrast, Universum learning assumes different concept classes for Universum versus both normal and anomalous (test) samples. This information is encoded through the proof of Prop. 2. The Universum setting controls the complexity of the hypothesis class by constraining the space in which both ‘normal’ and ‘anomalous’ samples can lie.
In short, Universum learning adopts a different learning paradigm (see Definition 2) compared to the ‘disjoint auxiliary data’ settings. Different from the existing ‘disjoint auxiliary’ based loss functions in Dhar (2014), OE (Hendrycks et al., 2018), DROCC-LF (OE) (Goyal et al., 2020) etc., the Universum samples (in Eq. (6)) implicitly contradict the unseen anomalous test samples. A more pedagogical explanation of the differences between these settings, with examples, is provided in “Appendix C.1”. However, like the DA/OE settings, Universum learning can complement other advanced learning settings. To highlight this, we extend the adversarial based DROCC-LF algorithm under the universum setting in Algorithm 1 and compare its performance against its OE based extension DROCC-LF (OE) (introduced in Goyal et al. (2020)). Here, for DROCC-LF (univ) we replace the binary cross entropy loss used in Goyal et al. (2020) with the universum loss in Eq. (6) (see step 3 in Algo. 1). We use the same notations as Goyal et al. (2020).
Algorithm 1 DROCC-LF (univ)

Input: Training (normal) samples \({\mathcal {T}}=({\textbf{x}}_i,y_i = +1)_{i=1}^n\) and Universum samples \({\mathcal {U}} = ({\textbf{x}}_{i^\prime }^{*})_{i^\prime =1}^m\).
Parameters: Radius r, \(\lambda \ge 0\), \(\mu \ge 0\), step-size \(\eta \), number of gradient steps \(m_g\), number of initial training steps \(n_0\).
Initial steps: For \(B = 1, \ldots , n_0\)
  \(X_T\), \(X_U\): batch of training and universum samples
  \(\theta = \theta - \nabla \Big (\sum \limits _{{\textbf{x}}_i \in X_T} L_T(f({\textbf{x}}_i)) + \sum \limits _{{\textbf{x}}_{i^\prime }^{*} \in X_U} L_U(f({\textbf{x}}_{i^\prime }^{*})) \Big ) \)
DROCC steps: For \(B = n_0, \ldots , n_0 + N\)
  \(X_T\): batch of normal training inputs (\(y=+1\))
  \(\forall x \in X_T: h \sim {\mathcal {N}}(0, I_{d})\)
  Adversarial search: For \(i = 1, \ldots , m_g\)
    1. \(L_T(h) = L_T(f(x + h), -1)\)
    2. \(h = h + \eta \frac{\nabla _h L_T(h)}{\Vert \nabla _h L_T(h) \Vert }\)
    3. \(h =\) Projection given by Prop. 1 in Goyal et al. (2020)
  \(\ell ^{itr} = \lambda \Vert {\textbf{w}} \Vert ^2 + \sum \limits _{{\textbf{x}}_i \in X_T} L_T(f({\textbf{x}}_i)) + \sum \limits _{{\textbf{x}}_{i^\prime }^{*} \in X_U} L_U(f({\textbf{x}}_{i^\prime }^{*})) +\mu L_T(f(x + h), -1) \)
  \(\theta = \theta - \nabla \ell ^{itr}\)
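The adversarial search inner loop of Algorithm 1 can be sketched for a toy linear scorer \(f({\textbf{x}}) = {\textbf{w}}^\top {\textbf{x}}\). This is an illustrative sketch, not the authors' implementation: the exact projection (Prop. 1 of Goyal et al. (2020)) is simplified here to rescaling \(h\) onto the radius-\(r\) sphere.

```python
import numpy as np

rng = np.random.default_rng(2)

def adversarial_search(x, w, r, eta=0.1, m_g=5):
    """Inner loop of Algorithm 1, sketched for a linear scorer f(x) = w^T x.
    The hinge loss for the anomalous label (y = -1) is [1 + f(x + h)]_+, whose
    gradient w.r.t. h is w whenever the hinge is active."""
    h = rng.normal(size=x.shape)                  # h ~ N(0, I_d)
    for _ in range(m_g):
        active = 1.0 + w @ (x + h) > 0            # is the hinge active?
        grad = w if active else np.zeros_like(w)
        g_norm = np.linalg.norm(grad)
        if g_norm > 0:
            h = h + eta * grad / g_norm           # normalized ascent step
        h = r * h / np.linalg.norm(h)             # simplified projection (assumption)
    return h

x = rng.normal(size=4)   # hypothetical input and weights
w = rng.normal(size=4)
h = adversarial_search(x, w, r=2.0)
assert np.isclose(np.linalg.norm(h), 2.0)         # h lies on the radius-r sphere
```

In the actual algorithm, \(f\) is the network, the gradient comes from autograd, and the projection uses the Mahalanobis-distance construction of Goyal et al. (2020).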
5 Empirical results
5.1 Standard benchmark on tabular datasets from Goyal et al. (2020)
First we provide the results on several tabular datasets used in Goyal et al. (2020). The datasets involve standard anomaly detection problems, described below:
- Abalone used in Das et al. (2018): Here the task is to predict the age of abalone using several physical measurements like rings, sex, length, diameter, height, weight, etc. For this problem, classes 3 and 21 are anomalies, and classes 8, 9 and 10 serve as normal samples.
- Arrhythmia used in Zong et al. (2018): Here the task is to identify the arrhythmic samples using the ECG features. We follow the same data set preparation as Zong et al. (2018).
- Thyroid used in Zong et al. (2018): The goal is to predict if a patient is hypothyroid based on their medical history. We follow the same data set preparation as Zong et al. (2018).
For all the above data sets we use the data set preparation codes provided in Goyal et al. (2020). These codes provide the data preprocessing and partitioning scheme used in previous works. We follow the same experiment setup and network architecture as in Goyal et al. (2020). We use the same baseline methods as used in Goyal et al. (2020), and also provide the results of the recent approach PLAD (Cai & Fan, 2022), proposed to stabilize the DROCC baseline.
Table 1 provides the results of DOC\(^3\) over 10 random partitions of the data set. In each partition, we create training/test data as used in Goyal et al. (2020). Note however, different from Goyal et al. (2020), we scale the data to the range \([-1, +1]\). In addition, here we generate uniform noise in the range \([-1,+1]\) and use that as universum/contradiction samples. As seen from Table 1, DOC\(^3\) outperforms all existing approaches (except the adversarial based DROCC (Goyal et al., 2020) and PLAD for the Thyroid data); and significantly improves (by > 5–15%) upon the state-of-the-art results for the Arrhythmia and Abalone data. The optimal model parameters used for the results are provided in “Appendix B.1” (Table 7) for reproducibility. Note that throughout the paper we fix \(\Delta = 0\). For all our experiments we see minimal improvements from tuning the \(\Delta \) parameter. This is also discussed in our ablation studies in “Appendix C.1.2”.
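The uniform-noise universum described above is straightforward to generate. A minimal numpy sketch; the sample count and feature dimension are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

def uniform_universum(m, d):
    # Universum/contradiction samples for tabular data: uniform noise in
    # [-1, +1], matching the range the training features are scaled to.
    return rng.uniform(low=-1.0, high=1.0, size=(m, d))

U = uniform_universum(m=100, d=8)     # hypothetical sizes
assert U.shape == (100, 8)
assert U.min() >= -1.0 and U.max() <= 1.0
```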
5.2 Standard image benchmark datasets
5.2.1 CIFAR-10
For our next set of experiments we use the standard image benchmark CIFAR-10 dataset (Ruff et al., 2018; Goyal et al., 2020). The data consists of 32 × 32 colour images of 10 mutually exclusive classes, with 6000 images per class. The underlying task involves one-vs-rest anomaly detection, where we build a one class classifier for each class and evaluate it on the test data for all 10 classes. Note that this data does not have any naturally occurring universum (contradiction) samples (following Def. 2). So, we generate synthetic universum samples by randomly drawing the pixel values \(\sim {\mathcal {N}}(\mu ,\sigma )\), with \(\mu = 0, \; \sigma = 1\); where \({\mathcal {N}}\) is the normal distribution (see Fig. 2). The idea of generating synthetic universum (contradiction) samples has previously been studied for binary (Weston et al., 2006; Cherkassky et al., 2011; Sinz et al., 2008), multiclass (Zhang & LeCun, 2017; Dhar et al., 2019) and regression (Dhar & Cherkassky, 2017) problems. In this paper we use a similar mechanism for one class problems. Note that for the one-vs-rest AD problem, the generated universum samples do not belong to either the ‘+1’ (normal) or ‘-1’ (anomalous) class used during testing (see Def. 2). The data is scaled to the range \([-1,+1]\).
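The synthetic universum images described above can be generated in a few lines of numpy. Whether the noise is clipped to the input range is not stated in the text, so the clipping below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_universum_images(m, shape=(3, 32, 32), mu=0.0, sigma=1.0):
    # Synthetic universum samples for CIFAR-10: i.i.d. pixel noise ~ N(mu, sigma).
    # Clipping to [-1, +1] (the range the real images are scaled to) is an
    # assumption, not stated in the text.
    u = rng.normal(loc=mu, scale=sigma, size=(m,) + shape)
    return np.clip(u, -1.0, 1.0)

U = gaussian_universum_images(16)
assert U.shape == (16, 3, 32, 32)
assert U.min() >= -1.0 and U.max() <= 1.0
```

Such noise images trivially satisfy Def. 2: they belong to the image domain but to neither the normal nor the anomalous CIFAR-10 classes.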
For this set of experiments we adopt a LeNet-like architecture used in Ruff et al. (2018), Goyal et al. (2020). The detailed architecture specifics are provided in “Appendix B.1.1”. Note that this paper focuses on the design and analysis of the DOC\(^3\) loss (Eq. 6). Hence, rather than adopting a state-of-the-art network architecture optimized for the specific dataset, we take a systematic approach to isolate the effectiveness of the proposed loss by using a basic LeNet architecture similar to Ruff et al. (2018), Goyal et al. (2020). This avoids secondary generalization effects encoded in more advanced architectures. To that end, the approaches in Ruff et al. (2018), Goyal et al. (2020) and DOC (Hinge in Eq. (3)) serve as the main baselines. In addition, for a more thorough comparison we also provide the results for DOC extended under disjoint auxiliary (DA), a.k.a. Outlier Exposure (OE), settings. For that, we treat the additional universum samples as belonging to the negative class following Goyal et al. (2020).
Table 2 provides the average ± standard deviation of the AUC under the ROC curve over 10 runs of the experiment. Here, we report the results of the best performing DOC (Hinge in (3)) model selected over the range of parameters \(\lambda = 1/2C = [1.0, 0.5]\), and those for DOC\(^3\) over the range of parameters \(\lambda = 1/2C = [0.1, 0.05], C_{U}/C = [1.0, 0.5]\). We fix \(\Delta = 0\). A more detailed discussion on model selection and the selected model parameters is provided in “Appendix B.3” for reproducibility. Note, however, that our results for the DROCC algorithm are different from those reported in Goyal et al. (2020). Re-running the code provided in Goyal et al. (2020) did not reproduce the results reported in the paper (especially for ‘Ship’). Moreover, their current implementation normalizes the data using mean \(\mu = (0.4914, 0.4822, 0.4465)\) and standard deviation \(\sigma = (0.247, 0.243, 0.261)\). These values are calculated using the data from all the classes, which is not available when training on a single class. To avoid such inconsistencies we instead normalize using mean \( \mu = (0.5, 0.5, 0.5)\) and standard deviation \(\sigma =(0.5, 0.5, 0.5)\). Such scaling does not need a priori information of the other classes’ pixel values, and scales the data to the range \([-1,+1]\). Detailed discussions on reproducing the results of the deep learning algorithms Deep-SVDD (Ruff et al., 2018) and DROCC (Goyal et al., 2020) are provided in “Appendix C.2” (see Tables 18, 19 and 20). As seen from Table 2, DOC\(^3\) (using the noise universum) provides a significant improvement of \(\sim \) 5–15% (and up to \(30 \%\) for ‘Bird’) over its inductive counterpart (DOC). In addition, DOC\(^3\) in most cases outperforms DOC (DA/OE). This illustrates the advantage of extending Anomaly Detection problems following Def. 2 in accordance with Prop. 2.
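The class-agnostic normalization described above can be sketched as follows (a minimal illustration; the paper's actual preprocessing pipeline may differ in details):

```python
import numpy as np

def normalize_to_pm1(img_uint8):
    """Class-agnostic normalization: scale pixels to [0, 1], then apply
    mean 0.5 / std 0.5 per channel, mapping the data onto [-1, +1]
    without needing per-class pixel statistics (a sketch of the scaling
    described in the text)."""
    x = img_uint8.astype(np.float32) / 255.0
    return (x - 0.5) / 0.5

img = np.array([0, 128, 255], dtype=np.uint8)
print(normalize_to_pm1(img))  # maps 0 -> -1, 255 -> +1
```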
Next, we show the effectiveness of extending the advanced adversarial based DROCC-LF method under universum settings over the OE based setting used in Goyal et al. (2020). The major difference is that the auxiliary data now serves as universum samples and the loss function follows (6) (see Algo. 1). For DROCC-LF (OE) we use the same implementation as in Goyal et al. (2020). Additionally, we replace the ReLU operator \([x]_{+}\) with the softplus operator in the loss functions.
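The softplus-for-ReLU substitution on the hinge term can be sketched as follows (a minimal numerical illustration; the actual DROCC-LF loss has additional terms not shown here):

```python
import numpy as np

def relu_hinge(scores):
    """One-class hinge term [1 - f(x)]_+ using the ReLU operator."""
    return np.maximum(0.0, 1.0 - scores)

def softplus_hinge(scores):
    """Smooth surrogate: replace [z]_+ with softplus(z) = log(1 + e^z),
    as done for the loss functions in the text (numerically stable form)."""
    z = 1.0 - scores
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

s = np.array([-0.5, 0.0, 2.0])
print(relu_hinge(s))  # [1.5 1.  0. ]
```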
For our experiments, we adopt the same LeNet architecture used in Ruff et al. (2018), Goyal et al. (2020) (see “Appendix B.1.1”, Fig. 4). Finally, we repeat the experiments over 10 runs and report the best AUC over the range of parameters recommended in Goyal et al. (2020) (Sect. 5). That is, learning rate = \(10^{-4}\), and radius \(r\) (of the order of \(\sqrt{d}\)) in \(\{ 8.0, 16.0, 32.0\}\). For both methods we use Adam, and fix the number of ascent steps to 10, batch size to 256, and total epochs to 350. The remaining parameters are set to default values.
Table 2 provides the average ± standard deviation of the AUC under the ROC curve over 10 runs of the experiment. We also provide the results for the standard DROCC (without any auxiliary data) (Goyal et al., 2020) and the more recent PLAD (an adversarial approach introduced to improve DROCC) (Cai & Fan, 2022) as baselines. As seen from Table 2, DROCC-LF (univ) significantly outperforms the DROCC-LF (OE) method, by up to 30% (‘Dog’) in some cases. Further, DROCC-LF (univ) outperforms the baseline algorithms for all cases except ‘Cat’. The final optimal parameters selected for the different classes are provided in “Appendix B.3”.
5.2.2 Fashion-MNIST (F-MNIST)
For our next set of experiments we use another standard image benchmark dataset, F-MNIST (Xiao et al., 2017). The data consists of \(28 \times 28\) grayscale images from Zalando’s fashion product database, comprising 10 mutually exclusive classes (product lines) with 60,000 training and 10,000 test samples. The underlying task is one-vs-rest anomaly detection, where we build a one class classifier for each class and evaluate it on the test data for all 10 classes. As before, this data does not have any naturally occurring universum (contradiction) samples (following Def. 2). So, we use synthetically generated universum samples drawn from \({\mathcal {N}}(\mu ,\sigma )\), with \(\mu = 0, \; \sigma = 1\); where \({\mathcal {N}}\) is the normal distribution. The data is scaled to the range \([-1,+1]\).
We adopt the same network and experiment set-up used in Cai and Fan (2022) (see Fig. 5 in “Appendix B.1.2”). As before, we provide the results for DOC and its extension under disjoint auxiliary (DA)/Outlier Exposure (OE) settings. We also provide the baseline results from Cai and Fan (2022). Table 3 provides the average ± standard deviation of the AUC under the ROC curve over 10 runs of the experiment. A more detailed discussion on model selection and the selected model parameters is provided in “Appendix B.4” (see Tables 12, 13 and 14) for reproducibility.
As seen from Table 3, DOC\(^3\) (using the noise universum) provides a significant improvement of \(\sim \) 5–20% over its inductive counterpart (DOC). In addition, DOC\(^3\) in most cases outperforms DOC (DA/OE). This further consolidates the advantage of extending Anomaly Detection problems under universum settings.
We also illustrate the effectiveness of extending the advanced adversarial based DROCC-LF method under universum settings over the OE based setting used in Goyal et al. (2020). We provide the results for the DROCC algorithm as a baseline. In addition, we also provide the results of the recent PLAD (Cai & Fan, 2022) algorithm, which adopts a generative adversarial learning approach proposed to stabilize DROCC. Note, however, that the reported results are from our re-run of the PLAD algorithm. We found several issues with the code implementation and report these discrepancies in “Appendix C.3”.
As seen from Table 3, DROCC-LF (univ) outperforms the DROCC-LF (OE) method and beats the baseline algorithms for all the classes. The final optimal parameters selected for the different classes are provided in “Appendix B.4” (see Table 15).
5.3 Visual inspection using real-life MV-Tec AD data
For our final set of experiments we tackle the more realistic problem of visual inspection based anomaly detection in manufacturing lines. With the recent advancements in deep learning technologies, there has been increased interest in automating manufacturing lines and adopting AI driven solutions for automated visual inspection of product defects (Bergmann et al., 2019; Huang & Pan, 2015). One popular benchmark data set for such problems is the MV-Tec AD data set (Bergmann et al., 2019).
5.3.1 Data set and experiment setup
The MV-Tec AD data set contains 5354 high-resolution color images of different industrial object and texture categories. For each category it contains normal (defect-free) images used for training. The test data contains both normal and anomalous (defective) product images. The anomalies manifest themselves in the form of over 70 different types of defects such as scratches, dents, contamination, and various other structural changes. The goal in this paper is to build one class image-level classifiers for the texture categories (see Table 4). We use the original data scale of [0, 1]. Further, to simplify the problem we resize all the images to \(64 \times 64\) pixels. Note that, for the current analysis we only use the texture classes containing RGB images.
For this problem we have naturally occurring universum (contradiction) samples in the form of the objects’ images or other texture types. That is, for the goal of building a one class classifier for ‘carpet’, all the ‘other textures’ (leather, tile, wood) or the ‘objects’ (bottle, cable, capsule, hazelnut, metal nut, pill, transistor) available in the dataset can serve as universum (contradiction) samples. This is in line with the problem setting in Def. 2, where such samples are neither ‘normal’ nor ‘anomalous’ (defective) carpet samples. For our experiments, we use three types of universum:
-
Noise: Similar to previous experiments we generate random noise as universum samples. Here, since the data is already scaled to the range [0, 1], we generate \(64 \times 64\) dimension images where the pixel values are drawn from the uniform distribution \({\mathcal {U}}(0,1)\).
-
Objects: This type of universum contains all the images in the object categories with RGB pixels viz. bottle, cable, capsule, hazelnut, metal nut, pill, transistor. Note that, we include both the normal as well as the defective samples for these objects.
-
Other Textures: Here we use the remaining texture images as universum. That is, if the goal is building a one class classifier for ‘carpet’ we use the images from the other ‘textures’ (leather, tile, wood) as universum. We include both the normal as well as the defective samples in the universum set.
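The ‘Noise’ universum type above can be sketched as follows (a minimal illustration; the other two types are simply loaded from the corresponding images in the dataset itself):

```python
import numpy as np

def mvtec_noise_universum(n, seed=0):
    """'Noise'-type universum for MV-Tec: 64x64 RGB images with pixels
    drawn from U(0, 1), matching the original [0, 1] data scale. The
    'objects' and 'other textures' universum types are instead loaded
    directly from the dataset (both normal and defective samples)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, 1.0, size=(n, 64, 64, 3)).astype(np.float32)

u = mvtec_noise_universum(8)
print(u.shape)  # (8, 64, 64, 3)
```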
As before, we adopt a LeNet-like architecture (schematic representation in Fig. 3, details in “Appendix B.1.3”, Fig. 6). Note that there have been a few recent works proposing advanced architectures to achieve state-of-the-art performance on this data (Carrara et al., 2020; Huang et al., 2019). However, the main focus here is to isolate the effectiveness of DOC\(^3\), and hence we mainly compare against the DOC and DOC(OE) baselines using a simple LeNet network. Since our baselines DOC and DOC(OE) using LeNet have not been previously reported on this data, as a sanity check we also add the results in Massoli et al. (2020) for a comparison against different classes of algorithms. Also, we adopt a slight modification to our loss function: rather than using the ReLU function \([x]_+\) in Eqs. (3) and (6) for the training samples, we use a softplus operator. We see improved results using this modification. Note that softplus is a dominating surrogate loss over ReLU, and hence Theorem 1 still holds.
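The domination property that lets Theorem 1 carry over to the softplus loss can be checked numerically (a quick sanity check of the mathematical claim, not part of the paper's experiments):

```python
import numpy as np

# Softplus log(1 + e^z) dominates the ReLU [z]_+ pointwise, so a bound
# derived for the ReLU-based loss (Theorem 1) transfers to the
# softplus-based loss. A quick numerical check of the domination:
z = np.linspace(-10.0, 10.0, 2001)
softplus = np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))  # stable form
relu = np.maximum(z, 0.0)
print(bool(np.all(softplus >= relu)))  # True
```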
5.3.2 Performance comparison results
Table 5 provides the results over 10 runs of our experiments. We provide the average ± standard deviation of the AUC values for the DOC, DOC (DA/OE) and DOC\(^3\) algorithms. In addition, we also provide the best AUC obtained for each algorithm over these 10 runs. Additional details on model selection and the optimal hyperparameters are provided in “Appendix B.5”. As seen in Table 5, the DOC\(^3\) algorithm provides significant improvement over DOC. Depending on the type of universum, typical improvements reach \(> 50 \%\). In addition, DOC\(^3\) provides consistent improvements over the DOC (DA/OE) algorithm. In all, these results further consolidate the utility of DOC\(^3\) under the universum setting (Def. 2). Separately, Table 5 also provides the baseline results available in Massoli et al. (2020). Note that these results are obtained using advanced network architectures adapted to the MVTec data, and are not averaged over multiple runs. Hence, we compare these results with the best AUC obtained for DOC, DOC (DA/OE) and DOC\(^3\) over 10 runs. As seen from Table 5, DOC\(^3\) improves upon the ‘carpet’ and ‘leather’ results using the ‘objects’ universum. Further, it achieves comparable performance for the ‘Wood’ and ‘Tile’ textures using the ‘Noise’ and ‘Obj.’ universum respectively. Achieving improved performance over the baseline algorithms, even using a basic LeNet architecture, is a strongly positive result for the proposed DOC\(^3\) algorithm.
5.3.3 Understanding DOC\(^3\) performance using Theorem (2)
For our final set of experiments we try to understand the workings of the DOC\(^3\) algorithm in connection with the correlation \(\Sigma (\infty )\) (in Theorem 2). Table 6 reports the correlation values for the training and universum samples using the ‘RAW’ pixel, ‘DOC’ and DOC\(^3\) solutions’ feature maps. For the feature map we use the CNN features shown in Fig. 3. Also, the DOC\(^3\) solutions represent the model estimated using the training data (in column 1) and the respective universum data (in column 2). As seen from the results, the DOC solution exhibits high correlation \(\Sigma (\infty )\) between the training and universum samples. In essence, the DOC solution sees the training and universum samples similarly. This is not desirable, as the universum samples follow a different distribution than the training samples. On the contrary, DOC\(^3\) provides a solution where the correlation between the training and universum samples is significantly reduced. This is in line with Theorem 2’s analysis (Sect. 3.2), where we argued that DOC\(^3\) searches for a solution with low \(\Sigma (\infty )\) between the training and universum samples (in feature space). By doing so it ensures a lower ERC and improved generalization compared to DOC (confirmed empirically in Table 5). Another interesting observation: for the ‘other textures’ universum type, whose raw pixel correlation with the training data is already high (\(\sim 0.9\)), DOC\(^3\) provides limited improvement. Such universum types are too similar to the training data, and act as ‘bad’ contradictions.
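A diagnostic of this kind can be sketched as follows. Note that the correlation used here (mean cosine similarity between feature vectors) is our reading for illustration; the paper's exact \(\Sigma (\infty )\) is the one defined with Theorem 2 and may differ:

```python
import numpy as np

def cross_correlation(train_feats, univ_feats):
    """Diagnostic in the spirit of Table 6: mean cosine similarity
    between unit-normalized training and universum feature vectors.
    This is one plausible reading of the correlation Sigma(inf); the
    paper's exact definition accompanies Theorem 2 and may differ."""
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    u = univ_feats / np.linalg.norm(univ_feats, axis=1, keepdims=True)
    return float(np.mean(t @ u.T))

rng = np.random.default_rng(0)
f_train = rng.normal(size=(50, 128))  # e.g. CNN feature maps of training images
f_univ = rng.normal(size=(40, 128))   # feature maps of universum images
print(cross_correlation(f_train, f_univ))  # near 0 for independent random features
```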
6 Future research
Broadly, there are two major future research directions.
Model selection This is a generic issue for any (unsupervised) one class based anomaly detection formulation, and is further complicated by the non-convex loss landscape of deep learning problems. For DOC\(^3\) we simplify model selection by fixing \(\Delta = 0\) and optimally tuning \(C_U\). However, the success of DOC\(^3\) heavily depends on careful tuning of its hyperparameters. In the absence of any validation set containing both ‘normal’ and ‘anomalous’ samples, we follow the current norm of reporting the best model’s results over a small subset of hyperparameters. But this is far from practical. We believe our Theorem 1 provides a good framework for bound based model selection. This, in conjunction with Theorem 2 and the recent works on ERC for deep architectures (Neyshabur et al., 2015; Sokolic et al., 2016), may provide better mechanisms for model selection and yield optimal models.
Selecting ‘good’ universum samples The effectiveness of DOC\(^3\) also depends on the type of universum used. Our analysis in Sect. 5.3.3 provides some initial insights into the workings of DOC\(^3\), and into how to loosely identify ‘bad’ contradictions. Additional analysis, possibly in line with the Histogram of Projections (HOP) technique introduced in Cherkassky et al. (2011), Dhar et al. (2019), is needed to improve our understanding of ‘good’ universum samples. This is an open research problem.
7 Conclusions
This paper introduces the notion of learning from contradictions for deep one class classification and proposes the DOC\(^3\) algorithm. By deriving its Empirical Rademacher Complexity (ERC), DOC\(^3\) is shown to provide improved generalization over DOC, its inductive counterpart. We empirically show the effectiveness of the proposed formulation, and connect the results to our theoretical analysis. Finally, we also discuss the limitations and future research directions.
Data availability
All data used is publicly available and appropriately referenced in the paper.
Notes
All code is available at: https://github.com/sauptikdhar/DOC3
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., & Isard, M. (2016). Tensorflow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16) (pp. 265–283).
Akcay, S., Atapour-Abarghouei, A., & Breckon, T. P. (2018). Ganomaly: Semi-supervised anomaly detection via adversarial training. In Asian conference on computer vision (pp. 622–637). Springer.
Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.
Bergman, L., & Hoshen, Y. (2020). Classification-based anomaly detection for general data. arXiv preprint arXiv:2005.02359
Bergmann, P., Fauser, M., Sattlegger, D., & Steger, C. (2019). Mvtec ad—A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9592–9600).
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). Lof: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on management of data (pp. 93–104).
Cai, J., & Fan, J. (2022). Perturbation learning based anomaly detection. In A.H. Oh, A. Agarwal, D. Belgrave, & K. Cho (Eds.), Advances in neural information processing systems. https://openreview.net/forum?id=-Xdts90bWZ3
Caron, M., Bojanowski, P., Joulin, A., & Douze, M. (2018). Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision (ECCV) (pp. 132–149).
Carrara, F., Amato, G., Brombin, L., Falchi, F., & Gennaro, C. (2020). Combining gans and autoencoders for efficient anomaly detection. arXiv preprint arXiv:2011.08102
Chalapathy, R., & Chawla, S. (2019). Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 1–58.
Chang, C.-C., & Lin, C.-J. (2001). Training v-support vector classifiers: Theory and algorithms. Neural Computation, 13(9), 2119–2147.
Chapelle, O., Agarwal, A., Sinz, F. H., & Schölkopf, B. (2008). An analysis of inference with the universum. In Advances in neural information processing systems (pp. 1369–1376).
Chen, S., & Zhang, C. (2009). Selecting informative universum sample for semi-supervised learning. In IJCAI (pp. 1016–1021).
Cherkassky, V., Dhar, S., & Dai, W. (2011). Practical conditions for effectiveness of the universum learning. IEEE Transactions on Neural Networks, 22(8), 1241–1255.
Cherkassky, V., & Mulier, F. M. (2007). Learning from data: Concepts, theory, and methods. Wiley.
Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., & Yang, S. (2017). Adanet: Adaptive structural learning of artificial neural networks. In International conference on machine learning (pp. 874–883). PMLR.
Das, S., Islam, M. R., Jayakodi, N. K., & Doppa, J. R. (2018). Active anomaly detection via ensembles. arXiv preprint arXiv:1809.06477
Dhar, S. (2014). Analysis and extensions of universum learning.
Dhar, S., & Cherkassky, V. (2017). Universum learning for svm regression. In 2017 International joint conference on neural networks (IJCNN) (pp. 3641–3648). IEEE.
Dhar, S., Cherkassky, V., & Shah, M. (2019). Multiclass learning from contradictions. In Advances in neural information processing systems (pp. 8400–8410).
Dhar, S., & Cherkassky, V. (2015). Development and evaluation of cost-sensitive universum-SVM. IEEE Transactions on Cybernetics, 45(4), 806–818.
Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). MIT Press.
Goyal, S., Raghunathan, A., Jain, M., Simhadri, H. V., & Jain, P. (2020). Drocc: Deep robust one-class classification. In International conference on machine learning (pp. 3711–3721). PMLR.
Hendrycks, D., Mazeika, M., & Dietterich, T. (2018). Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606
Hoffmann, H. (2007). Kernel PCA for novelty detection. Pattern Recognition, 40(3), 863–874.
Huang, C., Cao, J., Ye, F., Li, M., Zhang, Y., & Lu, C. (2019). Inverse-transform autoencoder for anomaly detection. arXiv preprint arXiv:1911.10676
Huang, C., Ye, F., Cao, J., Li, M., Zhang, Y., & Lu, C. (2019). Attribute restoration framework for anomaly detection. arXiv preprint arXiv:1911.10676
Huang, C., Ye, F., Zhang, Y., Wang, Y.-F., & Tian, Q.: Esad: End-to-end deep semi-supervised anomaly detection. arXiv preprint arXiv:2012.04905
Huang, S.-H., & Pan, Y.-C. (2015). Automated visual inspection in the semiconductor industry: A survey. Computers in Industry, 66, 1–10.
Jolliffe, I. T. (2002). Principal component analysis. Springer.
Knorr, E. M., Ng, R. T., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. The VLDB Journal, 8(3), 237–253.
Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation forest. In 2008 Eighth IEEE international conference on data mining (pp. 413–422). IEEE.
Makhzani, A., & Frey, B. (2014). Winner-take-all autoencoders. arXiv preprint arXiv:1409.2752
Masci, J., Meier, U., Cireşan, D., & Schmidhuber, J. (2011). Stacked convolutional auto-encoders for hierarchical feature extraction. In International conference on artificial neural networks (pp. 52–59). Springer.
Massoli, F. V., Falchi, F., Kantarci, A., Akti, Ş., Ekenel, H. K., & Amato, G. (2020). Mocca: Multi-layer one-class classification for anomaly detection. arXiv preprint arXiv:2012.12111
Neyshabur, B., Tomioka, R., & Srebro, N. (2015). Norm-based capacity control in neural networks. In Conference on learning theory (pp. 1376–1401).
Pang, G., Shen, C., Cao, L., & Hengel, A. V. D. (2020). Deep learning for anomaly detection: A review. arXiv preprint arXiv:2007.02500
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
Patel, R., & Toda, M. (1979). Trace inequalities involving Hermitian matrices. Linear Algebra and its Applications, 23, 13–20.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., & Dubourg, V. (2011). Scikit-learn: Machine learning in python. The Journal of machine Learning research, 12, 2825–2830.
Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on management of data (pp. 427–438)
Rosenberg, D. S., & Bartlett, P. L. (2007). The rademacher complexity of co-regularized kernel classes. In Artificial intelligence and statistics (pp. 396–403).
Ruff, L., Kauffmann, J. R., Vandermeulen, R. A., Montavon, G., Samek, W., Kloft, M., Dietterich, T. G., & Müller, K.-R. (2021). A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE.
Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Müller, E., & Kloft, M. (2018). Deep one-class classification. In International conference on machine learning (pp. 4393–4402).
Ruff, L., Vandermeulen, R. A., Görnitz, N., Binder, A., Müller, E., Müller, K.-R., & Kloft, M. (2019). Deep semi-supervised anomaly detection. arXiv preprint arXiv:1906.02694
Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., & Langs, G. (2017). Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging (pp. 146–157). Springer.
Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., & Platt, J. (1999). Support vector method for novelty detection. In Proceedings of the 12th international conference on neural information processing systems.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.
Scholkopf, B., Smola, A. J., Bach, F., et al. (2002). Learning with Kernels: Support vector machines, regularization, optimization, and beyond. MIT Press.
Shawe-Taylor, J., Cristianini, N., et al. (2004). Kernel methods for pattern analysis. Cambridge University Press.
Shen, C., Wang, P., Shen, F., & Wang, H. (2012). Uboost: Boosting with the universum. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 825–832.
Sinz, F., Chapelle, O., Agarwal, A., & Schölkopf, B. (2008). An analysis of inference with the universum. In Advances in neural information processing systems (Vol. 20, pp. 1369–1376). Curran.
Sokolic, J., Giryes, R., Sapiro, G., & Rodrigues, M. R. (2016). Lessons from the Rademacher complexity for deep learning.
Tack, J., Mo, S., Jeong, J., & Shin, J. (2020). Csi: Novelty detection via contrastive learning on distributionally shifted instances. Advances in Neural Information Processing Systems, 33, 11839–11852.
Tan, P.-N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining. Pearson Education India.
Tax, D. M., & Duin, R. P. (2004). Support vector data description. Machine Learning, 54(1), 45–66.
Vapnik, V. (2006). Estimation of dependences based on empirical data (information science and statistics). Springer.
Weimer, D., Scholz-Reiter, B., & Shpitalni, M. (2016). Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection. CIRP Annals, 65(1), 417–420.
Weston, J., Collobert, R., Sinz, F., Bottou, L., & Vapnik, V. (2006). Inference with the universum. In Proceedings of the 23rd international conference on machine learning (pp. 1009–1016). ACM.
Xiao, Y., Feng, J., & Liu, B. (2021). A new transductive learning method with universum data. Applied Intelligence, 1–13.
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747
Zenati, H., Foo, C. S., Lecouat, B., Manek, G., & Chandrasekhar, V. R. (2018). Efficient gan-based anomaly detection. arXiv preprint arXiv:1802.06222
Zhang, X., & LeCun, Y. (2017). Universum prescription: Regularization using unlabeled data. In AAAI (pp. 2907–2913).
Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., Cho, D., & Chen, H. (2018). Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International conference on learning representations. https://openreview.net/forum?id=BJJLHbb0
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
A detailed list of contributions is provided below. Conceptualization: SD; Data curation: BGT; Formal analysis: SD; Funding acquisition: NA; Investigation: SD, BGT; Methodology: SD, BGT; Project administration: SD; Resources: SD, BGT; Software: BGT, SD; Supervision: SD; Validation: BGT; Visualization: SD, BGT; Writing (original draft): SD; Writing (review and editing): SD, BGT.
Corresponding author
Ethics declarations
Conflict of interest
Not applicable.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Code availability
All our code is available at https://github.com/sauptikdhar/DOC3
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Proofs
1.1 A.1 Proof of Proposition 1
Part 1 A slightly different version of this proposition is analyzed in Proposition 8.2 of Scholkopf et al. (2002) and Chang and Lin (2001). Here, we provide a different version of the connection between the solutions of (3) and (2). This is achieved through analyzing the KKT systems of the formulations. We start with the formulation (3). Note that, (3) is the same as solving,
The Lagrangian is given as,
\({\mathcal {L}}({\textbf{w}},\xi ,\alpha ,\beta ) = \frac{1}{2} \vert \vert {\textbf{w}} \vert \vert ^2 + C \sum _{i=1}^n \xi _i - \sum _{i=1}^n \beta _i \xi _i -\sum _{i=1}^n \alpha _i[{\textbf{w}}^\top \phi ({\textbf{x}}_i) -1 + \xi _i]\)
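Setting the partial derivatives of this Lagrangian to zero gives the stationarity part of the KKT system (a standard derivation from the Lagrangian above, shown here for readability):

```latex
\frac{\partial {\mathcal {L}}}{\partial {\textbf{w}}} = 0
  \;\Rightarrow\; {\textbf{w}} = \sum_{i=1}^{n} \alpha_i \,\phi({\textbf{x}}_i),
\qquad
\frac{\partial {\mathcal {L}}}{\partial \xi_i} = 0
  \;\Rightarrow\; C = \alpha_i + \beta_i \quad \forall i .
```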
KKT System
Complementary Slackness,
Constraints,
Define \( \delta = \frac{1}{\sum _i \alpha _i}\) and re-write the equations (A2)–(A7) by scaling with \(\delta > 0\) as, \(\hat{\mathbf {{{w}}}} = {\textbf{w}}\delta ; {\hat{\alpha _i}} = \alpha _i\delta ; \, {\hat{\beta _i}} = \beta _i\delta ; \, {\hat{\xi _i}} = \xi _i \delta \). (\(\delta> 0; \because \exists i\, \text {s.t.} \, \alpha _i >0 \text { and } \forall i, \alpha _i \ge 0)\)
This gives, Transformed KKT System
Complementary Slackness,
Constraints,
Note that, the transformed KKT system (A8)–(A12) solves (2) with \(\nu = \frac{1}{Cn\delta }; \rho = \delta \) (compare with the KKT of (2)).
Part 2 For the solution to (2) obtained from Proposition 1 (i) the decision rule can be given as,
\(\square \)
1.2 A.2 Proof of Proposition 2
Note that for this proof we need to accommodate a case where a sample may not belong to either of the two classes \(\{-1, +1\}\). For this we rather analyze a different decision rule than (3).
Define,
This gives,
Since the events are mutually exclusive we have,
The maximum can be achieved when \({\textbf{w}}^\top \phi ({\textbf{x}}) -1 = 0\) \(\square \)
1.3 A.3 Proof of Theorem 1
Define, \(R_{f,\theta } = \{{\textbf{x}}: f({\textbf{x}}) \ge \theta \}\). This gives,
where, \(H(x,\theta ) = \left\{ \begin{array}{l l} 0;\quad \text {if} \; x \ge \theta \\ 1;\quad \text {else} \\ \end{array}\right. \). For the rest of the proof we drop the subscripts, as they are clear from context. To bound the R.H.S. of (A13) we follow a similar approach of bounding a dominating function (see Theorem 4.17 in Shawe-Taylor et al. (2004)). Here we define,
Note that \(A(x)\) is \(\frac{1}{\kappa }\)-Lipschitz. Further, \(H(f({\textbf{x}}), 1-\kappa ) \le A(f({\textbf{x}}))\). This gives \({\mathbb {E}}[H(f({\textbf{x}}),1-\kappa )-1] \le {\mathbb {E}}[A(f({\textbf{x}}))-1] \). Hence with probability \(1-\eta \), \(\forall f \in {\mathcal {F}}\) the following holds (see Theorem 4.9 in Shawe-Taylor et al. (2004)); where \(\mathbb {{\hat{E}}}\) denotes the empirical estimate of the expectation operator.
From Th. 4.15 (Shawe-Taylor et al., 2004)
where, \(\xi _i = [1-f({\textbf{x}}_i)]_{+}\). Using (A13), we get the final form of Theorem (1). \(\square \)
1.4 A.4 Proof of Theorem 2
Part 1: It is clear that \({\mathcal {F}}_{\text {univ}} \subseteq {\mathcal {F}}_{\text {ind}}\). This ensures \(\mathcal {{\hat{R}}}_n({\mathcal {F}}_{\text {univ}}) \le \mathcal {{\hat{R}}}_n({\mathcal {F}}_{\text {ind}})\) (following Theorem 4.15 (i) in Shawe-Taylor et al. (2004)). \(\square \)
Part 2(a): This follows from standard analysis (see Theorem 4.12 (Shawe-Taylor et al., 2004) or Lemma 22 in Bartlett and Mendelson (2002)).
\(\square \)
Part 2(b): Define
Note that the constraint on all \({\textbf{x}}^* \in {\mathcal {X}}_{U}^*\) implies the constraint on the \(m\) samples. Now, let us analyze the constraint \( \vert {\textbf{w}}^{\top }{\textbf{u}}_j - 1 \vert \le \Delta \). This implies \({\textbf{w}}^{\top }{\textbf{u}}_j - 1 \le \Delta \) and \(1- {\textbf{w}}^{\top }{\textbf{u}}_j \le \Delta \) (simultaneously). However, only one of the constraints is active. Hence, we re-write the constraint as, \(\forall j \;; \; \begin{bmatrix} {\textbf{w}}^{\top }{\textbf{u}}_j \\ {\textbf{w}}^{\top }(-{\textbf{u}}_j) \end{bmatrix} \le \begin{bmatrix} \Delta + 1 \\ \Delta - 1 \end{bmatrix}\).
Next define a mapping where we concatenate the reflected space, i.e. \(\psi :\phi ({\textbf{x}}^*) \rightarrow \begin{bmatrix} \phi ({\textbf{x}}^*)^\top \\ -\phi ({\textbf{x}}^*)^\top \end{bmatrix}\) and rewrite \(V = \psi \Big ( [\phi ({\textbf{x}}_j^*)]_{j=1}^m \Big ) = \begin{bmatrix} \phi ({\textbf{x}}_1^*)^\top \\ \phi ({\textbf{x}}_2^*)^\top \\ \vdots \\ \phi ({\textbf{x}}_m^*)^\top \\ -\phi ({\textbf{x}}_1^*)^\top \\ -\phi ({\textbf{x}}_2^*)^\top \\ \vdots \\ -\phi ({\textbf{x}}_m^*)^\top \\ \end{bmatrix}\). This can be compactly re-written as, \(V = \begin{bmatrix} 1\\ -1 \end{bmatrix} \otimes \begin{bmatrix} ({\textbf{u}}_1)^T\\ \vdots \\ ({\textbf{u}}_{m})^T \end{bmatrix}\). This results in the overall constraint in (A14) becoming,
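The Kronecker-product rewriting above can be verified numerically (a quick sanity check with a random feature matrix, not part of the paper):

```python
import numpy as np

# Sanity check of the compact rewriting: stacking the universum feature
# matrix U on top of its reflection -U equals the Kronecker product
# [1, -1]^T (kron) U. U here is random, purely for illustration.
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3))  # m = 4 universum samples, 3 features
V_stacked = np.vstack([U, -U])
V_kron = np.kron(np.array([[1.0], [-1.0]]), U)
print(bool(np.allclose(V_stacked, V_kron)))  # True
```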
where, .
In essence, for each constraint in (A14) we create twice as many constraints, covering both the original and the reflected space, to handle the absolute value.
Now,
The last line follows since the element-wise constraint is relaxed to an \( \vert \vert \cdot \vert \vert _2^2\) constraint.
Next, from (A15) and assuming a fixed mapping \(\phi (\cdot )\), for the given training data \(Z = \begin{bmatrix} ({\textbf{z}}_1)^\top \\ \vdots \\ ({\textbf{z}}_n)^\top \\ \end{bmatrix} = \begin{bmatrix}\phi ({\textbf{x}}_1)^\top \\ \vdots \\ \phi ({\textbf{x}}_n)^\top \\ \end{bmatrix}\) we have,
Hence \(\forall \gamma \ge 0\) and \(\Gamma = \Lambda ^2+ 2\gamma m(\Delta ^2 + 1)\) we have,
The (in)-equalities follow:

a. From symmetry, \({\textbf{w}} \in {\mathcal {W}}_{USVM} \Rightarrow -{\textbf{w}} \in {\mathcal {W}}_{USVM}\); hence we drop the absolute value from the definition. For simplicity we also drop the conditional term, which is clear from context.

b. Since the conditions \(\begin{array}{c} \vert \vert {\textbf{w}} \vert \vert ^2 \le \Lambda ^2 \\ ({\textbf{w}}^{\top }V^{\top }V{\textbf{w}}) \le 2m[\Delta ^2 + 1] \end{array} \Rightarrow \vert \vert {\textbf{w}} \vert \vert ^2 + \gamma ({\textbf{w}}^{\top }V^{\top }V{\textbf{w}}) \le \Gamma \quad \forall \gamma \ge 0\).

c. From the stationary point of the constraint; a similar approach was previously used in Rosenberg and Bartlett (2007).

d. Since the Rademacher variables \(\sigma _i\) are drawn uniformly over \(\{-1, +1\}\), the cross-terms \(\sigma _i\sigma _j\) cancel under the expectation \({\mathbb {E}}_{\sigma }\).

e. Using the Sherman-Morrison-Woodbury formula.

f. From matrix inequality II in Patel and Toda (1979).
\(\square \)
Appendix B Reproducibility
B.1 Network architectures
B.1.1 LeNet architecture for CIFAR-10 experiments
For CIFAR-10 we use the same architecture (Fig. 4) as used in Goyal et al. (2020).
B.1.2 LeNet architecture for FMNIST experiments
For FMNIST we use the same architecture as described in Cai and Fan (2022); see Fig. 5. Note, however, that there are discrepancies between the official code https://openreview.net/forum?id=-Xdts90bWZ3 and the description in the paper. We adopt the description in the paper, since the implementation is inconsistent with general deep learning theory. For additional details on these issues please see “Appendix C.3”.
B.1.3 LeNet architecture for MVTec experiments
For MVTec-AD there have been a few recent works proposing advanced architectures to achieve state-of-the-art performance on this data (Carrara et al., 2020; Huang et al., 2019). However, the main goal of our experiment is to illustrate the effectiveness of universum over inductive learning for one class problems. Hence, we stick to a simple LeNet architecture shown in Fig. 6.
Finally, for both the above architectures we use bias = False for convolution operations and set \(\epsilon = 10^{-4}\), Affine = False for BatchNorm. Additionally, we use a leaky ReLU activation after every max-pool operation.
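As a concrete illustration, the conventions above (bias-free convolutions, BatchNorm with \(\epsilon = 10^{-4}\) and Affine = False, leaky ReLU after each max-pool) can be sketched in PyTorch as below. The channel counts and kernel size are illustrative placeholders, not the exact layers of Figs. 4, 5 or 6:

```python
import torch
import torch.nn as nn

# One LeNet-style block following the stated conventions:
# convolution without bias, BatchNorm with eps=1e-4 and affine=False,
# and a leaky ReLU placed after the max-pool operation.
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2, bias=False),
    nn.BatchNorm2d(32, eps=1e-4, affine=False),
    nn.MaxPool2d(2),
    nn.LeakyReLU(),
)

x = torch.randn(8, 3, 32, 32)   # a batch of CIFAR-10-sized inputs
out = block(x)                  # spatial size halves after the max-pool
```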
B.2 Model parameters for Table 1
The optimal model parameters for the tabular datasets are provided in Table 7.
B.3 Model parameters for Table 2 (CIFAR-10)
B.3.1 DOC and DOC\(^3\) model parameters used in Table 2
There are several hyper-parameters to be tuned for DOC and DOC\(^3\). To simplify our analysis we fix a few of these parameters following prior research.
- Unlike previous works (Ruff et al., 2018; Goyal et al., 2020), we uniformly use an SGD optimizer with batch_size = 256. Although training for each class represents a completely different problem, we adopt this to maintain consistency and to isolate the effect of the optimizer on DOC versus DOC\(^3\) performance.

- For DOC we fix the total number of gradient-update iterations to 300, except for the ‘Dog’ and ‘Truck’ classes, where we use 400 and 50 respectively. For DOC\(^3\) we fix it to 350. This is in the same range as Ruff et al. (2018), and hence incurs similar computational complexity as the baseline DOC and DROCC algorithms.

- Finally, for DOC\(^3\) we fix \(\Delta = 0\).
With the above hyperparameters fixed, the selected values of the remaining hyperparameters for DOC and DOC\(^3\) are provided in Tables 8 and 9, respectively.
B.3.2 DOC (DA/OE) model parameters in Table 2
Next, we provide the optimal model parameters for the DOC (DA/OE) setting in Table 10. For DOC (DA/OE), following Goyal et al. (2020), we introduce the universum samples as the negative class in a standard binary hinge loss. The explicit form of this loss is also discussed in “Appendix C.1.1”, Eq. (C1). Here we set \(C^{+} = C^{-} = 1\).
B.3.3 Model parameters for DROCC-LF under OE versus Universum setting
For DROCC-LF (OE) we use the same implementation as in Goyal et al. (2020). For DROCC-LF under the universum setting we replace the binary cross entropy loss used in Goyal et al. (2020) with the universum loss in (6) (see Algo. 1). Here we use the same notation as in Goyal et al. (2020). Further, as in Sect. 5.3.1, we replace the ReLU operator \([x]_{+}\) with the softplus operator in the loss functions.
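The softplus replacement above can be sketched as follows; this is an illustrative numpy snippet (not the authors' implementation), showing that softplus is a smooth upper bound of the hinge operator:

```python
import numpy as np

def hinge(x):
    """The ReLU-style operator [x]_+ = max(0, x)."""
    return np.maximum(0.0, x)

def softplus(x):
    """Smooth surrogate log(1 + exp(x)), computed stably via logaddexp."""
    return np.logaddexp(0.0, x)

# margin terms 1 - f(x) for a few illustrative scores f(x)
margins = 1.0 - np.array([-2.0, 0.0, 3.0])
# softplus upper-bounds the hinge and is differentiable everywhere,
# which avoids the non-smooth kink of [x]_+ at zero
smooth_loss = softplus(margins)
```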
We adopt the same LeNet architecture used in Ruff et al. (2018), Goyal et al. (2020) (see Fig. 4). Finally, we run the experiments over 10 runs and report the best AUC over the range of parameters recommended in Goyal et al. (2020) (Sect. 5): learning rate = \(10^{-4}\), and radius \(r\) (on the order of \(\sqrt{d}\)) in \(\{ 8.0, 16.0, 32.0\}\). For both methods we use Adam and fix the number of ascent steps to 10, the batch size to 256, and the total epochs to 350. The remaining parameters are set to their default values. The final optimal parameters selected for the different classes are provided in Table 11.
Caveat(s): We found a few caveats while running the DROCC-LF experiments. One major caveat is that the gradient ascent steps are prone to instabilities. Note that the DROCC-LF algorithm (Algo. 2 in Goyal et al. (2020)) scales the perturbation direction (h) by the norm of the gradient vector. This results in severe gradient explosion, and appropriate measures must be taken to alleviate this issue. Another major caveat is that the additional gradient ascent updates result in high computational complexity. For example, for the experiments presented in this paper a typical DROCC-LF run (350 epochs) takes \(\sim 10^4\) secs compared to \(\sim 10^3\) secs without the adversarial updates. The system configuration used here is:
- CPU = AMD Ryzen 9 5950X 16 Core.
- RAM = 32 GB.
- GPU = NVIDIA GeForce RTX 3080.
- CUDA = 11.4.
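One simple measure against the gradient-explosion caveat noted above is to cap the norm of the scaled perturbation before applying the ascent step. The sketch below is one such measure under our assumptions; the helper `clamp_norm` and the threshold `max_norm` are hypothetical, not part of the original DROCC-LF algorithm:

```python
import numpy as np

def clamp_norm(h, max_norm=10.0):
    """Rescale perturbation h so its l2 norm never exceeds max_norm."""
    norm = np.linalg.norm(h)
    if norm > max_norm:
        h = h * (max_norm / norm)
    return h

# a perturbation with norm 50 would destabilize the ascent step;
# after clamping its norm is capped at max_norm
h_big = clamp_norm(np.array([30.0, 40.0]))
# a small perturbation passes through unchanged
h_small = clamp_norm(np.array([1.0, 0.0]))
```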
B.4 Model parameters for Table 3 (FMNIST)
B.4.1 DOC and DOC\(^3\) model parameters used in Table 3
The optimal model parameters for DOC, DOC\(^3\) and DOC (DA/OE) are provided in Tables 12, 13, and 14, respectively.
B.4.2 Model parameters for DROCC-LF under OE versus Universum setting for F-MNIST data
The optimal model parameters are provided in Table 15.
B.5 Model parameters for Table 5 (MVTec-AD)
Here we provide the optimal model parameters selected and used to reproduce the DOC and DOC\(^3\) results in Tables 5 and 6. For this set of experiments we use the Adam optimizer with batch_size = 100. Further, to simplify model selection we fix the total number of iterations to 1000 and \(\Delta = 0\). The optimal model parameters for DOC and DOC\(^3\) are provided in Table 16. Finally, we also provide the optimal hyperparameters for the DOC (DA/OE) algorithm in Table 17.
Appendix C Additional experiments and results
C.1 Comparisons of disjoint auxiliary (or outlier exposure) versus Universum settings
In this section we highlight the differences between the universum setting and the ‘Disjoint Auxiliary data’ setting used in Dhar (2014) (see Sect. 4.3) and in Hendrycks et al. (2018), Goyal et al. (2020), etc. As discussed in Sect. 4, a major difference is the assumption that the universum samples act as contradictions to the unseen anomalous class (see Definition (2)). Methods using the ‘Disjoint Auxiliary’ setting do not use this assumption and formulate a loss function which only contradicts the ‘normal’ class. Such approaches have also been called ‘Supervised OE’ in Ruff et al. (2021) or ‘Limited Negatives’ in Goyal et al. (2020). Here we take a more pedagogical approach to highlight these differences. For simplicity we use a binary classifier as an exemplar of the ‘Disjoint Auxiliary’ setting. That is, we build a binary classifier with ‘\(+1\)’ (normal) and ‘\(-1\)’ (contradiction, a.k.a. universum) samples. Note that such an approach is philosophically inconsistent with Def. 2, where the universum samples are assumed to follow a distribution different from both the normal (‘+1’) and anomalous (‘-1’) classes. Using the universum samples as the (‘-1’) class violates the assumption that the universum follows a different distribution than the anomalous class. To further confirm our theoretical analysis we provide a simple synthetic example in Appendix C.1.1.
C.1.1 Synthetic experiment
For our synthetic example, we use data generated from the normal distribution \({\mathcal {N}}(\mu ,\sigma )\). For illustration we use,
- Normal Class (+1): \(\mu = (1.0, 1.0)\), \(\sigma = (0.25,1.0)\).
- Anomaly Class (-1): \(\mu = (0.25, 1.0)\), \(\sigma = (0.25,1.0)\).
- Contradictions: \(\mu = (0.75, 6.0)\), \(\sigma = (0.25,1.0)\).
Additionally we use,
- No. of Training samples (+1 class) = 10.
- No. of Test samples (+1, -1) class = 1000 per class.
- No. of Universum samples = 1000.
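The synthetic data described above can be generated, for example, as follows (the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# means and std. deviations follow the specification above
normal  = rng.normal([1.0, 1.0],  [0.25, 1.0], size=(10, 2))    # training (+1)
anomaly = rng.normal([0.25, 1.0], [0.25, 1.0], size=(1000, 2))  # test (-1)
univ    = rng.normal([0.75, 6.0], [0.25, 1.0], size=(1000, 2))  # contradictions
```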
Note that in the above synthetic example the discriminative power is mostly contained in the 1\(^{st}\) dimension. ‘Good’ universum samples can incorporate this additional information by contradicting the 2\(^{nd}\) dimension while estimating the decision rule. This is seen in Fig. 7, which provides the decision boundaries obtained under the inductive (3) versus universum (6) settings using only linear parameterization. Under linear parameterization the formulations reduce to standard SVM formulations, so we refer to them as one class SVM and one class U-SVM respectively. Finally, we also provide the decision boundary of a binary SVM, which serves as a representative of the DA/OE-extension (Cherkassky & Mulier, 2007).
Here, \([x]_+ = \text {max}(0,x) \). For the binary SVM we use the universum as the (-1) class and adopt a cost-sensitive formulation with cost ratio \(\frac{C^+}{C^-} = \frac{\# univ}{\# train} = \frac{1000}{10}\) to handle the class imbalance.
As seen from Fig. 7, using the binary formulation in this universum setting does not correctly capture the information available through the contradiction samples. That is, discriminating between normal and contradiction samples does not provide a good classifier for normal versus anomaly classification. The one class SVM, although it correctly classifies the positive samples (TP = 100%), does not perform well on future test samples. Using the universum samples, we can incorporate the additional information that the decision boundary should align along the vertical axis to achieve maximal contradiction (following Prop. 2). Doing so improves the test performance over the one class SVM solution.
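The cost-sensitive binary SVM baseline can be sketched with scikit-learn as below; `LinearSVC` and the `class_weight` encoding of the cost ratio are our illustrative choices, not necessarily the exact solver used for Fig. 7:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal([1.0, 1.0], [0.25, 1.0], size=(10, 2))     # normal (+1)
X_univ = rng.normal([0.75, 6.0], [0.25, 1.0], size=(1000, 2))   # universum as (-1)

X = np.vstack([X_train, X_univ])
y = np.hstack([np.ones(10), -np.ones(1000)])

# cost ratio C+/C- = #univ/#train = 1000/10 = 100 to handle class imbalance
clf = LinearSVC(class_weight={1: 100.0, -1: 1.0}, max_iter=10000).fit(X, y)
```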
C.1.2 Ablation study hyperparameters
The DOC\(^3\) algorithm introduces mainly two additional hyperparameters, \(C_U\) and \(\Delta \), compared to its inductive counterpart. The success of such an advanced technique depends on careful tuning of these hyperparameters. In this section we perform an ablation study of the \(\frac{C_U}{C}\) ratio and the \(\Delta \) hyperparameter. For brevity we present the results for the CIFAR-10 data; analyses using the F-MNIST and MVTec-AD data lead to similar conclusions.
Figures 8 and 9 provide the average ± std. deviation of the AUC values over 10 experiment runs for varying \(\frac{C_U}{C}\) ratios and \(\Delta \) values respectively. The experiment follows the same setting as in Sect. 5.3.1, and all the other model parameters are set to their optimal values reported in Table 16. As seen from the figures, the model performance varies significantly for different \(\frac{C_U}{C}\) values (specifically for automobile, deer, dog, frog, etc.). On the contrary, the DOC\(^3\) performance is relatively stable for varying \(\Delta \) values (see Fig. 9). Similar behavior is seen for the other datasets. In line with this analysis, throughout the paper we fix \(\Delta = 0\) and follow the current norm of reporting the best model’s results over a small subset of hyperparameters. This is, however, far from practical, and motivates advanced mechanisms for optimal selection of this hyperparameter, which remains an open research topic. From our prior experiments, we found that \(C_U/C\) in the range [0.01, 2.0] provides reasonable performance in practice.
C.2 Re-run of deep-SVDD (Ruff et al., 2018) and DROCC (Goyal et al., 2020) CIFAR-10 Results
C.2.1 Deep one class classification deep-SVDD results
For Deep-SVDD, our run produces very similar results except for the ‘Frog’ and ‘Dog’ classes (see Tables 18 and 19), where the differences are not significant. Hence, we report the results as presented in the original paper.
C.2.2 Deep robust one class classification (DROCC) results
In Table 20, we report the results of our run with two different scalings. For the ‘all-class’ scale we use the scale from the original DROCC paper (Goyal et al., 2020), i.e. \(\mu = (0.4914,0.4822, 0.4465)\) and standard deviation \(\sigma = (0.247, 0.243, 0.261)\). Note that this scale is computed using the pixel values of all the classes, which in general is not available when training a one class classifier. Alternatively, the ‘no-prior’ scale uses \( \mu = (0.5, 0.5, 0.5)\) and standard deviation \(\sigma =(0.5, 0.5, 0.5)\), and does not need additional information from the other classes’ pixel values. We do not see a significant difference between these two scales, although for the ‘ship’ class our re-runs differ significantly from the results reported in the paper. We report the results of our re-run using the ‘no-prior’ scale in Table 2.
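The two scalings discussed above amount to the following per-channel normalization (a minimal numpy sketch; the helper `normalize` is our own illustration):

```python
import numpy as np

def normalize(img, mean, std):
    """Per-channel scaling of an H x W x 3 image with values in [0, 1]."""
    return (img - np.asarray(mean)) / np.asarray(std)

# 'all-class' scale from Goyal et al. (2020), computed over all CIFAR-10
# classes -- information not available when training a one class classifier
allclass = dict(mean=(0.4914, 0.4822, 0.4465), std=(0.247, 0.243, 0.261))

# 'no-prior' scale, requiring no knowledge of the other classes
noprior = dict(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))

# a mid-gray image maps exactly to zero under the 'no-prior' scale
scaled = normalize(np.full((2, 2, 3), 0.5), **noprior)
```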
C.3 Re-run of PLAD (Cai & Fan, 2022) results on F-MNIST data
This section provides a detailed discussion of our re-run of the PLAD experiments on FMNIST data. PLAD (Cai & Fan, 2022) is a recent adversarial based work which improves upon DROCC (Goyal et al., 2020). We found several caveats and discrepancies in the PLAD implementation https://openreview.net/forum?id=-Xdts90bWZ3, highlighted below:
1. The network architecture discussed in the paper and the implementation differ. The implementation does not use any non-linear (ReLU) operators between the final linear layers. Without non-linear operators, a cascade of affine operations can simply be replaced by a single affine operation. In our implementation we correct this and use ReLU operators between the linear layers.

2. The classes are scaled differently. The ‘T-shirt’ and ‘Trouser’ classes use a generic mean = [0.5], std = [0.5] scale, while the other classes are scaled to their respective means and stds. Further, as previously discussed for the DROCC re-runs, class-specific normalization is not practical for one-class problems, as we cannot assume knowledge of class labels a priori during training or testing. Throughout our paper we use a generic mean = [0.5], std = [0.5] scale.

3. Finally, the results provided in Cai and Fan (2022) are the best AUCs obtained during the entire training process (and not at the final stopping criterion). While such an approach may still be practical for supervised learning, where intermediate validation AUCs can guide selection of the optimal model, it is impossible to select an intermediate best model during training for one-class (unsupervised) learning problems. For this work, Table 3 provides the avg ± std. of the test AUCs obtained at the final step of training. For interested readers, Table 21 also provides the avg ± std. of the best test AUCs obtained during the training process, as used in Cai and Fan (2022).
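The architectural discrepancy noted above is easy to verify numerically: without non-linear operators, two stacked affine layers collapse into a single affine layer. A minimal numpy demonstration (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))           # a batch of 5 inputs of dim 8

# two affine (linear) layers applied in cascade, with no non-linearity
W1, b1 = rng.standard_normal((8, 16)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((16, 4)), rng.standard_normal(4)
h_cascade = (x @ W1 + b1) @ W2 + b2

# the equivalent single affine layer: W = W1 W2, b = b1 W2 + b2
W, b = W1 @ W2, b1 @ W2 + b2
h_single = x @ W + b
```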
Dhar, S., Gonzalez-Torres, B. DOC\(^3\): deep one class classification using contradictions. Mach Learn 113, 5109–5150 (2024). https://doi.org/10.1007/s10994-023-06362-5