1 Introduction

Anomaly detection (AD) is one of the most widely researched problems in the machine learning community (Chandola et al., 2009). In its basic form, the task of AD involves discerning patterns in data that do not conform to expected ‘normal’ behavior. These non-conforming patterns are referred to as anomalies or outliers. Anomaly detection problems manifest in several real-life forms, such as defect detection in manufacturing lines, intrusion detection for cyber security, or pathology detection for medical diagnosis. There are several mechanisms to handle anomaly detection problems, viz., parametric or non-parametric statistical modeling, spectral based modeling, or classification based modeling (Chandola et al., 2009). Of these, the classification based approach has been widely adopted in the literature (Scholkopf et al., 2002; Tax & Duin, 2004; Tan et al., 2016; Cherkassky & Mulier, 2007). One specific classification based formulation which has gained huge adoption is one class classification (Scholkopf et al., 2002; Tax & Duin, 2004), where we design a parametric model to estimate the support of the ‘normal’ class distribution. The estimated model is then used to detect ‘unseen’ abnormal samples.

With the recent success of deep learning based approaches for different machine learning problems, there has been a surge in research adopting deep learning for one class problems (Ruff et al., 2021; Pang et al., 2020; Chalapathy & Chawla, 2019). However, most of these works adopt an inductive learning setting. This makes the underlying model estimation data hungry, and such models perform poorly for applications with limited training data availability, like medical diagnosis, industrial defect detection, etc. The learning from contradictions paradigm (popularly known as Universum learning) has been shown to be particularly effective for problems with limited training data availability (Vapnik, 2006; Sinz et al., 2008; Weston et al., 2006; Chen & Zhang, 2009; Cherkassky et al., 2011; Shen et al., 2012; Dhar & Cherkassky, 2015; Zhang & LeCun, 2017; Xiao et al., 2021). However, it has mostly been limited to binary or multi class problems. In this paradigm, along with the labeled training data we are also given a set of unlabeled contradictory (a.k.a. universum) samples. These universum samples belong to the same application domain as the training data, but are known not to belong to any of the classes. The rationale behind this setting comes from the fact that even though obtaining labels is very difficult, obtaining such additional unlabeled samples is relatively easy. These unlabeled universum samples act as contradictions and should not be explained by the estimated decision rule. Adapting this to one class problems is not straightforward. A major conceptual problem is that one class model estimation represents unsupervised learning, where the notion of contradiction needs to be redefined properly. In this paper, we make the following contributions:

  1. Definition: We introduce the notion of ‘Learning from contradictions’ for one class problems (Definition 2).

  2. Formulation: We analyze the popular one class hinge loss (Schölkopf et al., 2001), and extend it under universum settings to propose the Deep One Class Classification using Contradictions (DOC\(^3\)) algorithm. Further, our proposed ‘learning from contradictions’ is a generic learning setting and can complement other advanced learning settings. To illustrate this, we extend the adversarial learning based DROCC-LF (Goyal et al., 2020) algorithm under universum settings and call it DROCC-LF (univ) (see Algo. 1).

  3. Generalization error: We analyze the generalization performance of one class formulations under inductive and universum settings using Rademacher complexity based bounds, and show that learning under the universum setting can provide improved generalization compared to its inductive counterpart.

  4. Empirical results: Finally, we provide an exhaustive set of empirical results on several tabular and image datasets in support of our approach.

2 One class learning under inductive settings

First we introduce the widely adopted inductive learning setting used for one class problems (Scholkopf et al., 2002; Cherkassky & Mulier, 2007).

Definition 1

(Inductive setting) Given i.i.d. training samples from a single class \({\mathcal {T}}=({\textbf{x}}_i, \; y_i = +1)_{i=1}^n \sim {\mathcal {D}}_{{\mathcal {X}}\vert {\mathcal {Y}} = +1}^n\), with \({\textbf{x}} \in {\mathcal {X}} \subseteq \Re ^d\) and \(y \in {\mathcal {Y}} = \{-1,+1 \}\); estimate a hypothesis \(h^*:{\mathcal {X}} \rightarrow {\mathcal {Y}}\) from a hypothesis class \({\mathcal {H}}\) which minimizes,

$$\begin{aligned} \underset{h \in {\mathcal {H}}}{\text {inf}} \; {\mathbb {E}}_{{\mathcal {D}}_{{\mathcal {T}}}}[{\textbf{1}}_{y \ne h({\textbf{x}})}] \end{aligned}$$
(1)

\({\mathcal {D}}_{{\mathcal {T}}}\) is the training distribution (consisting of both classes)

\({\mathcal {D}}_{{\mathcal {X}} \vert {\mathcal {Y}} = +1}\) is the class conditional distribution

\({\textbf{1}}(\cdot )\) is the indicator function, and

\({\mathbb {E}}_{{\mathcal {D}}_{{\mathcal {T}}}}(\cdot )\) is the expectation under training distribution.

Note that the underlying data generation process assumes a two class problem, of which samples from only one class are available during training. The overall goal is to estimate a model which minimizes the error on future test data containing samples from both the normal (\(y = +1\)) and abnormal (\(y = -1\)) classes. A typical example is AI driven visual inspection of product defects in a manufacturing line, where images or videos of non-defective products are available in abundance and the goal is to detect ‘defective’ (abnormal/anomalous) products (Bergmann et al., 2019; Weimer et al., 2016). A popular loss function used in such settings is the \(\nu \)-SVM loss (Schölkopf et al., 2001),

$$\begin{aligned}&\underset{{\textbf{w}},\varvec{\xi }, \rho }{\text {min}}\quad \frac{1}{2}\vert \vert {\textbf{w}}\vert \vert _2^2 \; + \; \frac{1}{\nu n} \sum _{i=1}^n \xi _i -\rho{} & {} \nonumber \\&\quad \text {s.t.} \quad \quad {\textbf{w}}^\top \phi ({\textbf{x}}_i) \ge \rho -\xi _i, \quad \xi _i \ge 0; \quad \forall \; i = 1 \ldots n{} & {} \end{aligned}$$
(2)

where \(\nu \in (0,1]\) is a user-defined parameter which controls the margin errors \(\sum _i \xi _i\) and the size of the geometric \(\frac{1}{\vert \vert {\textbf{w}}\vert \vert }\) and functional \(\rho \) margins. \(\phi (\cdot ): {\mathcal {X}} \rightarrow {\mathcal {G}}\) is a feature map. Typical examples include an empirical kernel map (see Definition 2.15 in Scholkopf et al. (2002)) or a map induced by a deep learning network (Goodfellow et al., 2016). The final decision function is given as,   \(h({\textbf{x}}) = \left\{ \begin{array}{l l} +1;\quad \text {if} \; {\textbf{w}}^\top \phi ({\textbf{x}}) \ge \rho \\ -1;\quad \text {else} \end{array}\right. \). Note that recent works like Ruff et al. (2018) extend a different loss function which uses a ball to explain the support of the data distribution, following (Tax & Duin, 2004). As discussed in Schölkopf et al. (2001), these two formulations most often yield equivalent decision functions. For example, with kernel machines where \({\textbf{K}}({\textbf{x}},{\textbf{x}}^{\prime }) = \phi ({\textbf{x}})^\top \phi ({\textbf{x}}^\prime )\) depends solely on \({\textbf{x}} - {\textbf{x}}^{\prime }\) (like RBF kernels), the two formulations are the same. Hence, most of the improvements discussed in this work translate to such alternate formulations. In this paper, however, we solve the following one class Hinge Loss,

$$\begin{aligned}&\underset{{\textbf{w}}}{\text {min}}\quad \frac{1}{2}\vert \vert {\textbf{w}}\vert \vert _2^2 \; + \; C \; L_T({\textbf{w}},\{\phi ({\textbf{x}}_i)\}_{i=1}^n){} & {} \nonumber \\&\quad \text {s.t} \quad \quad L_T({\textbf{w}},\{\phi ({\textbf{x}}_i)\}_{i=1}^n) = \sum _{i=1}^n [1-{\textbf{w}}^\top \phi ({\textbf{x}}_i)]_+ \; ; \; [x]_+ = \text {max}(0,x){} & {} \end{aligned}$$
(3)

to estimate the decision function \(f({\textbf{x}}) = {\textbf{w}}^\top \phi ({\textbf{x}})\) and use the decision rule, \(h({\textbf{x}}) = \left\{ \begin{array}{l l} +1;\quad \text {if} \; f({\textbf{x}}) \ge 1 \\ -1;\quad \text {else} \end{array}\right. \). Here, the user-defined parameter C controls the trade-off between explaining the training samples (through a small margin error \(\sum _{i=1}^n \xi _i \)) and the margin size (through \(\vert \vert {\textbf{w}}\vert \vert _2^2\)), which in turn controls the generalization error. For deep learning architectures we optimize over all the model parameters and equivalently regularize the entire matrix norm \(\vert \vert {\textbf{W}}\vert \vert _F^2\); see Goyal et al. (2020), Ruff et al. (2018). Note that we solve the one class Hinge loss (3) for two main reasons,

  • First, it has the advantage that \(L_T({\textbf{w}},\phi (\{{\textbf{x}}\}_{i=1}^n)) = \sum _{i=1}^n [1-{\textbf{w}}^\top \phi ({\textbf{x}}_i)]_+ \) exhibits the same form as the traditional hinge loss used for binary classification problems (Vapnik, 2006) and can be easily solved using existing software packages (Paszke et al., 2019; Abadi et al., 2016; Pedregosa et al., 2011). Throughout the paper we refer to (3) with an underlying deep architecture as the Deep One Class DOC (Hinge) formulation.

  • Second, solving Eq. (3) also provides the solution for Eq. (2). This connection follows from Proposition 1.
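For concreteness, the DOC (Hinge) objective in Eq. (3) and its decision rule can be sketched with a fixed feature map as follows. This is a minimal NumPy sketch; the function names and the pre-computed feature matrix `Z` (rows \(z_i = \phi(x_i)\)) are illustrative, not the authors' implementation.

```python
import numpy as np

def one_class_hinge(w, Z, C=1.0):
    """DOC (Hinge) objective of Eq. (3): 0.5*||w||^2 + C * sum_i [1 - w^T z_i]_+ .
    Z is an (n, d) matrix of feature-mapped training samples z_i = phi(x_i)."""
    margins = 1.0 - Z @ w                 # 1 - w^T phi(x_i) for each sample
    L_T = np.maximum(0.0, margins).sum()  # hinge over the normal samples
    return 0.5 * np.dot(w, w) + C * L_T

def decide(w, Z):
    """Decision rule h(x): +1 if f(x) = w^T phi(x) >= 1, else -1."""
    return np.where(Z @ w >= 1.0, 1, -1)
```

Only samples falling inside the margin (\({\textbf{w}}^\top \phi({\textbf{x}}_i) < 1\)) contribute to the loss, mirroring the slack variables \(\xi_i\) in Eq. (2).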

Proposition 1

Connection between Eq. (2) and Eq. (3)

  1. Any solution \({\textbf{w}}\) of Eq. (3) also solves Eq. (2) with \(\nu = \frac{1}{Cn \delta }\), where \(\delta > 0\) is a scalar that depends on the solution of Eq. (3). Further, this solution \((\hat{\mathbf {{{w}}}}, \rho )\) of Eq. (2) is given as \(\hat{\mathbf {{{w}}}} = {\textbf{w}}\delta , \quad \rho = \delta \).

  2. The decision function obtained by solving Eq. (3), i.e., \({\textbf{w}}^\top \phi ({\textbf{x}}) - 1 = 0\), coincides with the decision function \(\hat{\mathbf {{{w}}}}^\top \phi ({\textbf{x}}) - \rho = 0\) obtained by solving Eq. (2) using the solution discussed above.

All proofs are provided in “Appendix”.
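The scaling relation in Proposition 1 is easy to check numerically. The sketch below assumes an illustrative \(\delta\) and a feature vector lying on the Eq. (3) boundary; it only verifies that, under the stated mapping, the two decision boundaries coincide.

```python
import numpy as np

# Suppose w solves Eq. (3); delta here is an illustrative positive scalar.
delta = 0.4
w = np.array([2.0, -1.0])
w_hat, rho = delta * w, delta      # mapped solution of Eq. (2) per Proposition 1

# Any feature vector z with w^T z = 1 (the Eq. (3) boundary) also satisfies
# w_hat^T z = rho (the Eq. (2) boundary), so the decision functions coincide.
z = np.array([0.5, 0.0])           # chosen so that w^T z = 1
assert np.isclose(w @ z, 1.0)
assert np.isclose(w_hat @ z, rho)
```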

3 One class learning using contradictions a.k.a Universum learning

3.1 Problem formulation

Fig. 1: Visual inspection of anomalous screws in a manufacturing line (Bergmann et al., 2019). Images of the other products act as universum samples. Such images are neither normal-screw nor anomalous-screw images and act as contradictions.

Learning from contradictions, or Universum learning, was introduced in Vapnik (2006) for binary classification problems to incorporate a priori knowledge about admissible data samples. For example, if the goal of learning is to discriminate between handwritten digits ‘5’ and ‘8’, one can introduce additional knowledge in the form of other handwritten letters ‘a’, ‘b’, ‘c’, ‘d’, \(\ldots \) ‘z’. These examples from the Universum contain certain information about the handwriting styles of authors, but they cannot be assigned to either of the two classes (5 or 8). Further, these Universum samples do not have the same distribution as the labeled training samples. In this work we introduce the notion of ‘Learning from Contradictions’ for one class problems. Similar to the inductive setting (Definition 1), the goal here is also to minimize the generalization error on future test data containing both normal (\(y=+1\)) and abnormal (\(y=-1\)) samples. Here, however, during training, in addition to the samples from the normal class \(({\textbf{x}}_i, y_i = +1)_{i=1}^n\), we are also provided with universum (contradictory) samples, which are known not to belong to either of the (normal or abnormal) classes of interest. A practical use-case is automated visual inspection based anomaly detection in manufacturing lines. Here the target is to identify defects in a specific product type (say ‘screws’ in Fig. 1). In this case, images from other product types in the manufacturing line act as universum samples. Note that such universum samples belong to the same application domain (i.e. visual inspection data), but do not represent either of the classes of normal screws or anomalous screws. This setting is formalized as,

Definition 2

(Learning from contradictions a.k.a Universum setting) Given i.i.d training samples \({\mathcal {T}}=({\textbf{x}}_i,y_i = +1)_{i=1}^n \sim {\mathcal {D}}_{{\mathcal {X}}\vert {\mathcal {Y}} = +1}^n\), with \({\textbf{x}} \in {\mathcal {X}} \subseteq \Re ^d\) and \(y \in {\mathcal {Y}} =\{-1,+1\}\) and additional m universum samples \({\mathcal {U}} = ({\textbf{x}}_{i^\prime }^{*})_{i^\prime =1}^m \sim {\mathcal {D}}_{{\mathcal {U}}}\) with \({\textbf{x}}^{*} \in {\mathcal {X}}_{U}^* \subseteq \Re ^d\), estimate \(h^*:{\mathcal {X}} \rightarrow {\mathcal {Y}}\) from hypothesis class \({\mathcal {H}}\) which, in addition to Eq. (1), obtains maximum contradiction on universum samples i.e. maximizes the following probability for \({\textbf{x}}^* \in {\mathcal {X}}_{U}^*\),

$$\begin{aligned}&\underset{h \in {\mathcal {H}}}{\text {sup}} \; {\mathbb {P}}_{{\mathcal {D}}_{{\mathcal {U}}}}[h({\textbf{x}}^* )\notin {\mathcal {Y}}] = \underset{h \in {\mathcal {H}}}{\text {sup}} \; {\mathbb {E}}_{{\mathcal {D}}_{{\mathcal {U}}}}[{\textbf{1}}_{ \lbrace \bigcap \limits _{y \in {\mathcal {Y}}} h({\textbf{x}}^*) \ne y \rbrace }]{} & {} \end{aligned}$$
(4)

\({\mathcal {D}}_{{\mathcal {U}}}\) is the universum distribution,

\({\mathbb {P}}_{{\mathcal {D}}_{{\mathcal {U}}}}(\cdot )\) is probability under universum distribution,

\({\mathbb {E}}_{{\mathcal {D}}_{{\mathcal {U}}}}(\cdot )\) is the expectation under universum distribution, \({\mathcal {X}}_{U}^{*}\) is the domain of universum data.

Learning using contradictions under the Universum setting has the dual goal of minimizing the generalization error in Eq. (1) while maximizing the contradiction on universum samples in Eq. (4). The following proposition provides guidelines on how this can be achieved for the one class hinge loss in Eq. (3).

Proposition 2

For the one class hinge loss in Eq. (3), maximum contradiction on universum samples \({\textbf{x}}^* \in {\mathcal {X}}_U^*\) can be achieved when,

$$\begin{aligned} \vert {\textbf{w}}^\top \phi ({\textbf{x}}^{*}) - 1\vert = 0 \end{aligned}$$
(5)

That is, we need the universum samples to lie on the decision boundary. This motivates the following one class loss using contradictions (under Universum settings), where we relax the constraint in Eq. (5) by introducing a \(\Delta \)-insensitive loss similar to Weston et al. (2006), Dhar et al. (2019) and solve,

$$\begin{aligned}&\underset{{\textbf{w}}}{\text {min}}\; \frac{1}{2}\vert \vert {\textbf{w}}\vert \vert _2^2 + C \; L_T({\textbf{w}},\phi (\{{\textbf{x}}_i\}_{i=1}^n)) + C_U \; L_U({\textbf{w}},\phi (\{{\textbf{x}}_{i^\prime }^*\}_{i^\prime =1}^m)){} & {} \nonumber \\&\text {s.t.} \quad L_T({\textbf{w}},\phi (\{{\textbf{x}}\}_{i=1}^n)) = \sum _{i=1}^n [1-{\textbf{w}}^\top \phi ({\textbf{x}}_i)]_+{} & {} \nonumber \\&\quad \quad \; L_U({\textbf{w}},\phi (\{{\textbf{x}}_{i^\prime }^*\}_{i^\prime =1}^m)) = \sum _{i^\prime =1}^m [\vert 1-{\textbf{w}}^\top \phi ({\textbf{x}}_{i^\prime }^*)\vert -\Delta ]_+{} & {} \end{aligned}$$
(6)

Here, \([x]_+ = \text {max}(0,x)\). Further, the interplay between \(C\) and \(C_U\) controls the trade-off between explaining the training samples using \(L_T\) versus maximizing the contradiction on Universum samples using \(L_U\). For \(C_U = 0\) or \(\Delta \rightarrow \infty \), Eq. (6) reduces to Eq. (3). For deep learning models, we optimize Eq. (6) over all the model parameters and refer to it as Deep One Class Classification using Contradictions (DOC\(^3\)).
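Under the same fixed-feature-map convention, the DOC\(^3\) objective in Eq. (6) can be sketched as follows. This is a NumPy sketch with illustrative names; in practice the feature map and \({\textbf{w}}\) would be learned jointly by a deep network.

```python
import numpy as np

def doc3_loss(w, Z, Zu, C=1.0, C_U=1.0, delta=0.0):
    """DOC^3 objective of Eq. (6). Z: (n, d) feature-mapped training samples,
    Zu: (m, d) feature-mapped universum samples. Names are illustrative."""
    L_T = np.maximum(0.0, 1.0 - Z @ w).sum()                   # hinge on normal class
    L_U = np.maximum(0.0, np.abs(1.0 - Zu @ w) - delta).sum()  # Delta-insensitive universum loss
    return 0.5 * np.dot(w, w) + C * L_T + C_U * L_U
```

Setting `C_U=0` makes the universum term vanish, recovering the inductive loss of Eq. (3); universum samples incur no loss exactly when they lie within \(\Delta\) of the decision boundary \(f({\textbf{x}}^*) = 1\).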

3.2 Analysis of generalization error bound

Next we provide theoretical justification in support of Universum learning. We argue that learning under universum settings using DOC\(^3\) can provide improved generalization compared to its inductive counterpart DOC (Hinge). For this, we first derive a generic form of the generalization error bound for one class learning using the Rademacher complexity capacity measure in Theorem 1.

Theorem 1

(Generalization error bound) Let \({\mathcal {F}}\) be the class of functions from which the decision functions \(f({\textbf{x}})\) in Eqs. (3) and (6) are estimated. Let \(R_{f,1} = \{ {\textbf{x}}: f({\textbf{x}}) \ge 1 \}\) be the induced decision region. Then, with probability at least \(1-\eta \) with \(\eta \in (0,1]\), over any independent draw of the random sample \({\mathcal {T}}=({\textbf{x}}_i,y_i = +1)_{i=1}^n \sim {\mathcal {D}}_{{\mathcal {T}}\vert {\mathcal {Y}} = +1}^n\), for any \(\kappa > 0\) we have,

$$\begin{aligned}&{\mathbb {P}}_{{\mathcal {D}}_{{\mathcal {T}}\vert {\mathcal {Y}} = +1}}({\textbf{x}} \notin R_{f,1-\kappa }) \; \le \; \frac{1}{\kappa n} \sum _{i=1}^n \xi _i + \frac{2}{\kappa } \hat{{\mathcal {R}}}_n({\mathcal {F}}) + 3 \sqrt{\frac{\ln \frac{2}{\eta }}{2n}}{} & {} \end{aligned}$$
(7)

where \(\xi _i = [1-f({\textbf{x}}_i)]_+\);   \(R_{f,\theta } = \{{\textbf{x}}: f({\textbf{x}}) \ge \theta \} \)

\(\hat{{\mathcal {R}}}_n({\mathcal {F}}) = {\mathbb {E}}_{\sigma }[\underset{f \in {\mathcal {F}}}{\text {sup}} \vert \frac{2}{n} \sum _{i=1}^n \sigma _i f({\textbf{x}}_i)\vert \Big \vert ({\textbf{x}}_i)_{i=1}^n] \)

\(\sigma _i\) = independent uniform \(\{ \pm 1 \}\)-valued random variables, a.k.a. Rademacher variables.

Theorem 1 is agnostic of the model parameterization and holds for any popularly adopted kernel machine or deep learning architecture. Similar to Theorem 7 in Schölkopf et al. (2001), Theorem 1 gives a probabilistic guarantee that new points lie in the larger region \(R_{f,1-\kappa }\). Here, we instead use the Empirical Rademacher Complexity (ERC) \(\hat{{\mathcal {R}}}_n({\mathcal {F}})\) as the capacity measure of the hypothesis class, rather than the covering number. Additionally, our bound does not contain a \(\frac{1}{\kappa ^2}\) term as in Schölkopf et al. (2001), and only has a scaling factor of \(\frac{1}{\kappa }\). As seen from Theorem 1 above, it is preferable to use a hypothesis class \({\mathcal {F}}\) with smaller ERC \(\hat{{\mathcal {R}}}_n({\mathcal {F}})\). Next we compare the ERC of the hypothesis classes induced by the formulations in Eq. (3) versus Eq. (6).
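For the linear hypothesis class \(\{f({\textbf{x}}) = {\textbf{w}}^\top {\textbf{z}} : \vert \vert {\textbf{w}}\vert \vert \le \Lambda \}\), the supremum inside the ERC has the closed form \(\frac{2\Lambda }{n}\vert \vert \sum _i \sigma _i {\textbf{z}}_i\vert \vert \), so the ERC itself can be estimated by Monte Carlo over the Rademacher variables. A small sketch (the function name and defaults are illustrative):

```python
import numpy as np

def erc_linear(Z, Lam=1.0, n_draws=2000, seed=0):
    """Monte Carlo estimate of the ERC of {x -> w^T z : ||w|| <= Lam}.
    For this class, sup_w |(2/n) sum_i sigma_i w^T z_i| = (2*Lam/n)*||sum_i sigma_i z_i||."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    sups = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)          # Rademacher variables
        sups.append(2.0 * Lam / n * np.linalg.norm(sigma @ Z))
    return float(np.mean(sups))
```

By Jensen's inequality, the estimate stays below the closed-form upper bound \(\frac{2\Lambda }{n}\sqrt{\sum _i \vert \vert {\textbf{z}}_i\vert \vert ^2}\) of Theorem 2 (2a).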

Theorem 2

(Empirical Rademacher complexity). For the hypothesis class induced by the formulations,

  • Equation (3): \({\mathcal {F}}_{\text {ind}} = \{ f: {\textbf{x}} \rightarrow {\textbf{w}}^{\top } \phi ({\textbf{x}}) \Big \vert \vert \vert {\textbf{w}}\vert \vert _2^2 \le \Lambda ^2 \}\)

  • Equation (6): \({\mathcal {F}}_{\text {univ}} = \{ f: {\textbf{x}} \rightarrow {\textbf{w}}^{\top } \phi ({\textbf{x}}) \Big \vert \vert \vert {\textbf{w}}\vert \vert _2^2 \le \Lambda ^2; \vert {\textbf{w}}^{\top } \phi (\mathbf {x^*}) -1\vert \le \Delta \;, \; \forall x^* \in {\mathcal {X}}_U^*\}\)

The following holds,

  1. \(\hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {ind}}) \ge \hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {univ}})\)

  2. Further, for any fixed mapping \(\phi (\cdot )\), \(\forall \gamma \ge 0\) we have,

    (a) \(\hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {ind}}) \le \frac{2\Lambda }{n} \sqrt{\sum \limits _{i=1}^n \vert \vert {\textbf{z}}_i\vert \vert ^2}\);  where \({\textbf{z}} = \phi ({\textbf{x}})\)

    (b) \(\hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {univ}}) \le \frac{2\Lambda }{n} \sqrt{\sum \limits _{i=1}^n \vert \vert {\textbf{z}}_i\vert \vert ^2} \, \underset{\gamma \ge 0}{\text {min}} \; K(\gamma ) \big [1 - \varvec{\Sigma }(\gamma )\big ] ^{\frac{1}{2}}\)

$$\begin{aligned} \text {where }&K(\gamma ) = \big [1+\frac{2\gamma m (\Delta ^2+1)}{\Lambda ^2}\big ]^{\frac{1}{2}}{} & {} \end{aligned}$$
(8)
$$\begin{aligned}&\varvec{\Sigma }(\gamma ) = \gamma \frac{ tr(VZ^{\top }ZV^{\top })}{ \big [tr(Z^{\top }Z) \big ] \; \big [tr(I+\gamma VV^{\top }) \big ]}{} & {} \nonumber \\&Z = \begin{bmatrix} ({\textbf{z}}_1)^T\\ \vdots \\ ({\textbf{z}}_{n})^T \end{bmatrix} \; and \; V = \begin{bmatrix} 1\\ -1 \end{bmatrix} \otimes \begin{bmatrix} ({\textbf{u}}_1)^T\\ \vdots \\ ({\textbf{u}}_{m})^T \end{bmatrix} ; \quad {\textbf{u}} = \phi ({\textbf{x}}^*); \quad {\textbf{x}}^* \in {\mathcal {X}}_{U}^{*}{} & {} \end{aligned}$$
(9)

\(\otimes = \text {Kronecker Product}, \quad tr = \text {Matrix Trace} \)

Note that several recent works (Neyshabur et al., 2015; Sokolic et al., 2016; Cortes et al., 2017) derive the ERC of the function class induced by an underlying neural architecture. In this analysis, however, we fix the feature map and analyze how the loss function in Eq. (6) reduces the function class capacity compared to Eq. (3). This simplifies our analysis and focuses on the effect of the proposed new loss in Eq. (6) under the universum setting. As seen from Theorem 2 (1), the function class induced under the universum setting (using contradictions) exhibits a lower ERC compared to that under the inductive setting. A more explicit characterization of the ERC is provided in part (2). Setting \(\gamma = 0\) in (b), we recover the same R.H.S. as in (a); hence the R.H.S. in (b) is never larger than in (a). Further note that \(\varvec{\Sigma }(\gamma )\) in Eq. (9) has the form of a correlation measure between the training and universum samples in the feature space. In fact, we have \(\Sigma (\infty ) = \underset{\gamma \rightarrow \infty }{\text {lim}} \Sigma (\gamma ) = \frac{ tr(VZ^{\top }ZV^{\top })}{ tr(Z^{\top }Z) \; tr(VV^{\top })}\). This shows that, for a fixed number of universum samples m and fixed \(\Delta \), the effect of the DOC\(^3\) algorithm is influenced by the correlation between the training and universum samples in the feature space. Loosely speaking, the DOC\(^3\) algorithm searches for a solution which, in addition to reducing the margin errors \(\xi _i\), also minimizes this correlation; and by doing so minimizes the generalization error. Similar conclusions have been derived empirically for binary and multiclass problems in Weston et al. (2006), Chapelle et al. (2008), Cherkassky et al. (2011) and Dhar et al. (2019). Here, we provide the theoretical reasoning for one class problems. Further, we confirm these theoretical findings in our results (Sect. 5.3.3).
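The limiting quantity \(\Sigma (\infty )\) above is straightforward to compute from a feature-mapped training matrix Z and universum matrix U (a sketch; the function name is illustrative):

```python
import numpy as np

def sigma_inf(Z, U):
    """Sigma(inf) = tr(V Z^T Z V^T) / (tr(Z^T Z) * tr(V V^T)) from Theorem 2,
    with V = [1; -1] (x) U built from the feature-mapped universum samples u = phi(x*)."""
    V = np.kron(np.array([[1.0], [-1.0]]), U)   # (2m, d) Kronecker stacking of Eq. (9)
    num = np.trace(V @ Z.T @ Z @ V.T)
    den = np.trace(Z.T @ Z) * np.trace(V @ V.T)
    return num / den
```

Universum features orthogonal to the training features give \(\Sigma (\infty ) = 0\), while perfectly aligned features give values near 1, matching the correlation interpretation above.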

3.3 Algorithm implementation

A limitation in solving Eq. (6) is handling the absolute-value term in \(L_U\). In this paper we adopt an approach similar to that used in Weston et al. (2006), Dhar et al. (2019) and simplify this by re-writing \(L_U\) as a sum of two hinge functions. To do this, for every universum sample \({\textbf{x}}_{i^\prime }^{*}\) we create two artificial samples, \(({\textbf{x}}_{i^\prime }^{*},y_{i^\prime 1}^{*}=1), ({\textbf{x}}_{i^\prime }^{*},y_{i^\prime 2}^{*}=-1)\), and re-write,

$$\begin{aligned}&L_U = \sum _{i^\prime =1}^m [\vert 1-{\textbf{w}}^\top \phi ({\textbf{x}}_{i^\prime }^*)\vert -\Delta ]_+{} & {} \nonumber \\&\quad = \sum _{i^\prime =1}^m \Big ( [\epsilon _1-y_{i^\prime 1}^{*}{\textbf{w}}^\top \phi ({\textbf{x}}_{i^\prime }^*)]_+ + [\epsilon _2-y_{i^\prime 2}^{*}{\textbf{w}}^\top \phi ({\textbf{x}}_{i^\prime }^*)]_+ \Big ){} & {} \end{aligned}$$
(10)

where \(\epsilon _1 = 1 - \Delta \) and \(\epsilon _2 = -1 - \Delta \). Now, the universum loss is the sum of two hinge functions with \(\epsilon _1\)- and \(\epsilon _2\)-margins, and can be solved using standard deep learning libraries (Paszke et al., 2019; Abadi et al., 2016; Pedregosa et al., 2011).
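The equivalence between the \(\Delta \)-insensitive form of \(L_U\) in Eq. (6) and the two-hinge re-writing in Eq. (10) can be verified numerically on the raw scores \(f({\textbf{x}}^*) = {\textbf{w}}^\top \phi ({\textbf{x}}^*)\). A NumPy sketch with illustrative names:

```python
import numpy as np

def univ_loss_abs(scores, delta=0.25):
    """L_U of Eq. (6): sum_i [ |1 - f(x*_i)| - delta ]_+ over universum scores."""
    return np.maximum(0.0, np.abs(1.0 - scores) - delta).sum()

def univ_loss_two_hinges(scores, delta=0.25):
    """Equivalent form of Eq. (10): two hinges with margins eps1 = 1 - delta and
    eps2 = -1 - delta, using the artificial labels y* = +1 and y* = -1."""
    eps1, eps2 = 1.0 - delta, -1.0 - delta
    return (np.maximum(0.0, eps1 - scores) + np.maximum(0.0, eps2 + scores)).sum()
```

For any score, at most one of the two hinges is active, and their sum reproduces the \(\Delta \)-insensitive penalty exactly.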

4 Existing approaches and related works

Most research in Anomaly Detection (AD) can be broadly categorized as adopting either traditional (shallow) or the more modern deep learning based approaches. Traditional approaches generally adopt parametric or non-parametric statistical modeling, spectral based modeling, or classification based modeling (Chandola et al., 2009). Typical examples include PCA based methods (Jolliffe, 2002; Hoffmann, 2007), proximity based methods (Knorr et al., 2000; Ramaswamy et al., 2000), tree-based methods like Isolation Forest (IF) (Liu et al., 2008), or classification based OC-SVM (Schölkopf et al., 2001), Support Vector Data Description (SVDD) (Tax & Duin, 2004), etc. These techniques provide good performance for an optimally tuned feature map. However, for complex domains like vision or speech, where designing optimal feature maps is non-trivial, such approaches perform sub-optimally. A detailed survey of these approaches is available in Chandola et al. (2009).

In contrast, for the modern deep learning based approaches, extracting the optimal feature map is imbibed in the learning process. Broadly, there are three main sub-categories of deep learning based AD. First, the Deep Auto Encoder and its variants like DCAE (Masci et al., 2011; Makhzani & Frey, 2014) or ITAE (Huang et al., 2019), etc. Here, the aim is to build an embedding where the normal samples are correctly reconstructed while the anomalous samples exhibit high reconstruction error. The second type of approach adopts Generative Adversarial Network (GAN)-based techniques like AnoGAN (Schlegl et al., 2017), GANomaly (Akcay et al., 2018), EGBAD (Zenati et al., 2018), CBiGAN (Carrara et al., 2020), etc. These approaches typically focus on generating additional samples which follow a similar distribution as the training data. This is followed by designing an anomaly score to discriminate between normal and anomalous samples. Finally, the third category consists of the more recent one class classification based approaches like Deep SVDD (Ruff et al., 2018), DROCC (Goyal et al., 2020), etc. These approaches solve a one class loss function tailored to deep architectures. All of the above approaches, however, adopt an unsupervised inductive learning setting. There is a newer class of classification based paradigms which adopt semi- or self-supervised formulations. Typical examples include GOAD (Bergman & Hoshen, 2020), SSAD (Ruff et al., 2019), ESAD (Huang et al., 2020), etc. However, such approaches use fundamentally different problem settings (like a multi class problem for GOAD), or have different assumptions about the additional data available.

Learning with disjoint auxiliary (DA) data: A recently popularized learning setting assumes the availability of additional auxiliary data which is disjoint from the test set. The underlying assumption is that these auxiliary samples may or may not follow the same distribution as the test data. This idea was first introduced in Dhar (2014) (see Sect. 4.3) and misconstrued as Universum learning. Note that the notion of universum samples was originally introduced to act as contradictions to the concept classes in the test set (Vapnik, 2006). The above assumption does not adhere to this notion and violates the true essence of Universum learning. This setting has recently been used to propose ‘outlier exposure’ in Hendrycks et al. (2018) and its variants (Ruff et al., 2021; Goyal et al., 2020). A more advanced variation of this setup generates the anomalous samples through perturbation (Cai & Fan, 2022) or through distribution-shifting transformations (Tack et al., 2020) and uses contrastive losses. Our learning from contradictions setting differs from the above methods in the following aspects,

  • (Problem setting) The problem setting is different. While the above setting only assumes auxiliary data disjoint from the test data's concept classes (‘normal’ and ‘anomalous’ samples), Universum learning follows a different assumption: the concept classes of the universum data are different from both the normal as well as the anomalous samples. This assumption is quintessential for proving Prop. 2, which in turn provides the optimality constraint on the decision function (in Eq. 5). Prop. 2 is not possible in the DA setting.

  • (Formulation) The difference in problem setting is also clear from the formulations. For example, the formulations proposed under the disjoint auxiliary setting, like (Dhar, 2014), Outlier Exposure (OE) (Hendrycks et al., 2018), DROCC-LF (OE) (Goyal et al., 2020), etc., only use the relation between the in-lier training data and the additional auxiliary data. No information about the relation between the auxiliary data and the anomalous samples in the test set is encoded in the loss function. In essence, such approaches control the complexity of the hypothesis class by constraining the space in which ‘normal’ samples can lie. In contrast, Universum learning assumes different concept classes for the Universum versus both the normal and anomalous (test) samples. This information is encoded through the proof of Prop. 2. The Universum setting controls the complexity of the hypothesis class by constraining the space in which both ‘normal’ and ‘anomalous’ samples can lie.

In short, Universum learning adopts a different learning paradigm (see Definition 2) compared to the ‘disjoint auxiliary data’ setting. Different from the existing ‘disjoint auxiliary’ based loss functions in Dhar (2014), OE (Hendrycks et al., 2018), DROCC-LF (OE) (Goyal et al., 2020), etc., the Universum samples (in Eq. (6)) implicitly contradict the unseen anomalous test samples. A more pedagogical explanation of the differences between these settings, with examples, is provided in “Appendix C.1”. However, similar to the DA/OE settings, Universum learning can complement other advanced learning settings. To highlight this, we extend the adversarial based DROCC-LF algorithm under the universum setting in Algorithm 1 and compare its performance against its OE based extension DROCC-LF (OE) (introduced in Goyal et al. (2020)). Here, for DROCC-LF (univ) we replace the binary cross entropy loss used in Goyal et al. (2020) with the universum loss in Eq. (6) (see step 3 in Algo. 1). We use the same notation as Goyal et al. (2020).

Algorithm 1 DROCC-LF (univ)

Input: Training (normal) samples \({\mathcal {T}}=({\textbf{x}}_i,y_i = +1)_{i=1}^n\) and Universum samples \({\mathcal {U}} = ({\textbf{x}}_{i^\prime }^{*})_{i^\prime =1}^m\).

Parameters: Radius r, \(\lambda \ge 0\), \(\mu \ge 0\), step-size \(\eta \), number of gradient steps \(m_g\), number of initial training steps \(n_0\).

Initial steps: For \(B = 1, \ldots n_0\)

   Batch of training (\(X_T\)) and universum (\(X_U\)) samples

   \(\theta = \theta - \nabla \Big (\sum \limits _{\begin{array}{c} {\textbf{x}}_i \in X_T \end{array}} L_T(f({\textbf{x}}_i)) + \sum \limits _{\begin{array}{c} {\textbf{x}}_{i^\prime }^{*} \in X_U \end{array}} L_U(f({\textbf{x}}_{i^\prime }^{*})) \Big ) \)

DROCC steps: For \(B = n_0, \ldots n_0 + N\)

   \(X_T\): Batch of normal training inputs (\(y=+1\))

   \(\forall x \in X_T: h \sim {\mathcal {N}}(0, I_{d})\)

Adversarial search: For \(i = 1, \ldots m_g\)

   1. \(L_T(h) = L_T(f(x + h), -1)\)

   2. \(h = h + \eta \frac{\nabla _h L_T(h)}{\Vert \nabla _h L_T(h) \Vert }\)

   3. \(h =\) Projection given by Prop.1 in Goyal et al. (2020)

\(\ell ^{itr} = \lambda \Vert {\textbf{w}} \Vert ^2 + \sum \limits _{\begin{array}{c} {\textbf{x}}_i \in X_T \end{array}} L_T(f({\textbf{x}}_i)) + \sum \limits _{\begin{array}{c} {\textbf{x}}_{i^\prime }^{*} \in X_U \end{array}} L_U(f({\textbf{x}}_{i^\prime }^{*})) +\mu L_T(f(x + h), -1) \)

\(\theta = \theta - \nabla \ell ^{itr}\)
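A heavily simplified, runnable sketch of one DROCC-LF (univ) iteration for a linear model \(f({\textbf{x}}) = {\textbf{w}}^\top {\textbf{x}}\) is given below. The subgradients are written out by hand, and the projection of Prop. 1 in Goyal et al. (2020) is replaced by a simple norm clipping; all names and defaults are illustrative, not the authors' implementation.

```python
import numpy as np

def drocc_lf_univ_step(w, X_T, X_U, r=1.0, mu=1.0, lam=1e-3, eta=0.1, m_g=5,
                       lr=0.01, delta=0.0, rng=None):
    """One simplified DROCC-LF (univ) step for f(x) = w^T x (illustrative sketch).
    X_T: batch of normal samples; X_U: batch of universum samples."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Adversarial search: ascend the negative-class hinge [1 + f(x + h)]_+ in h.
    h = rng.normal(size=X_T.shape)
    for _ in range(m_g):
        active = (1.0 + (X_T + h) @ w > 0).astype(float)[:, None]
        g = active * w                                  # subgradient w.r.t. h
        h = h + eta * g / (np.linalg.norm(g) + 1e-12)   # normalized ascent step
        norms = np.linalg.norm(h, axis=1, keepdims=True)
        h = h * np.minimum(1.0, r / (norms + 1e-12))    # simplified projection ||h|| <= r
    # Subgradient of l_itr = lam*||w||^2 + L_T + L_U + mu * L_T(f(x + h), -1).
    gT = -((1.0 - X_T @ w > 0).astype(float)[:, None] * X_T).sum(axis=0)
    s = X_U @ w
    gU = ((((s - 1.0) > delta).astype(float)
           - ((1.0 - s) > delta).astype(float))[:, None] * X_U).sum(axis=0)
    gA = mu * ((1.0 + (X_T + h) @ w > 0).astype(float)[:, None] * (X_T + h)).sum(axis=0)
    return w - lr * (2.0 * lam * w + gT + gU + gA)
```

The universum subgradient `gU` follows the two-hinge decomposition of Eq. (10): each universum sample pushes the score \(f({\textbf{x}}^*)\) toward the \(\Delta \)-insensitive band around the decision boundary.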

5 Empirical results

5.1 Standard benchmark on tabular datasets from Goyal et al. (2020)

First we provide results on several tabular datasets used in Goyal et al. (2020). The datasets involve standard anomaly detection problems, described below,

  • Abalone, used in Das et al. (2018): Here the task is to predict the age of abalone using several physical measurements like rings, sex, length, diameter, height, weight, etc. For this problem, classes 3 and 21 are anomalies, and classes 8, 9 and 10 serve as normal samples.

  • Arrhythmia used in Zong et al. (2018): Here the task is to identify arrhythmic samples using ECG features. We follow the same dataset preparation as Zong et al. (2018).

  • Thyroid used in Zong et al. (2018): The goal is to predict if a patient is hypothyroid based on his/her medical history. We follow the same dataset preparation as Zong et al. (2018).

For all the above datasets we use the dataset preparation code provided in Goyal et al. (2020), which implements the preprocessing and partitioning scheme used in previous works. We follow the same experiment setup and network architecture as in Goyal et al. (2020). We use the same baseline methods as in Goyal et al. (2020), and also provide results for the recent approach PLAD (Cai & Fan, 2022), which was proposed to stabilize the DROCC baseline.

Table 1 provides the results of DOC\(^3\) over 10 random partitions of the dataset. In each partition, we create the training/test data as in Goyal et al. (2020). Note, however, that unlike Goyal et al. (2020), we scale the data to the range \([-1, +1]\). In addition, we generate uniform noise in the range \([-1,+1]\) and use it as universum/contradiction samples. As seen from Table 1, DOC\(^3\) outperforms all existing approaches (except the adversarial DROCC (Goyal et al., 2020) and PLAD for the Thyroid data), and significantly improves (by 5–15%) upon the state-of-the-art results for the Arrhythmia and Abalone data. The optimal model parameters used for these results are provided in “Appendix B.1” (Table 7) for reproducibility. Note that throughout the paper we fix \(\Delta = 0\); for all our experiments we saw minimal improvement from tuning the \(\Delta \) parameter. This is also discussed in our ablation studies in “Appendix C.1.2”.
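The uniform-noise universum described above takes only a few lines of numpy to generate. This is an illustrative sketch: the feature dimension `d` and sample count `m` are placeholder choices, not values from the paper, and the training features are assumed to be already scaled to \([-1, +1]\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 100  # placeholder feature dimension and universum size
# Uniform noise in [-1, +1], matching the scaled range of the training features
X_univ = rng.uniform(-1.0, 1.0, size=(m, d))
```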

Table 1 F1-score ± standard deviation for one-vs-all anomaly detection on Thyroid, Arrhythmia, and Abalone datasets

5.2 Standard image benchmark datasets

5.2.1 CIFAR-10

Fig. 2: Random noise Universum (contradictions)

For our next set of experiments we use the standard image benchmark CIFAR-10 dataset (Ruff et al., 2018; Goyal et al., 2020). The data consists of \(32 \times 32\) colour images of 10 mutually exclusive classes, with 6000 images per class. The underlying task is one-vs-rest anomaly detection, where we build a one class classifier for each class and evaluate it on the test data of all 10 classes. Note that this dataset does not have any naturally occurring universum (contradiction) samples (following Def. 2). Hence, we create synthetic universum samples by randomly generating pixel values \(\sim {\mathcal {N}}(\mu ,\sigma )\), with \(\mu = 0, \; \sigma = 1\), where \({\mathcal {N}}\) is the normal distribution (see Fig. 2). The idea of generating synthetic universum (contradiction) samples has previously been studied for binary (Weston et al., 2006; Cherkassky et al., 2011; Sinz et al., 2008), multiclass (Zhang & LeCun, 2017; Dhar et al., 2019) and regression (Dhar & Cherkassky, 2017) problems. In this paper we use a similar mechanism for one class problems. Note that for the one-vs-rest AD problem, the generated universum samples belong to neither the ‘+1’ (normal) nor the ‘-1’ (anomalous) class used during testing (see Def. 2). The data is scaled to the range \([-1,+1]\).
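The synthetic universum generation above can be sketched as follows; the batch size `m` is an illustrative choice, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 16  # illustrative number of universum "images"
# 32x32 RGB noise images with pixels drawn i.i.d. from N(mu=0, sigma=1)
X_univ = rng.normal(loc=0.0, scale=1.0, size=(m, 3, 32, 32)).astype(np.float32)
```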

For this set of experiments we adopt a LeNet-like architecture used in Ruff et al. (2018), Goyal et al. (2020). The detailed architecture specifics are provided in “Appendix B.1.1”. Note that this paper focuses on the design and analysis of the DOC\(^3\) loss (Eq. 6). Rather than adopting a state-of-the-art network architecture optimized for the specific dataset, we take a systematic approach to isolate the effectiveness of the proposed loss by using a basic LeNet architecture similar to Ruff et al. (2018), Goyal et al. (2020). This avoids secondary generalization effects encoded in more advanced architectures. To that end, the approaches in Ruff et al. (2018), Goyal et al. (2020) and DOC (Hinge in Eq. (3)) serve as the main baselines. In addition, for a more thorough comparison, we also provide results for DOC extended under the disjoint auxiliary (DA), a.k.a. Outlier Exposure (OE), setting. For that, we treat the additional universum samples as belonging to the negative class, following Goyal et al. (2020).

Table 2 Average AUC (with standard deviation) for one-vs-rest anomaly detection on CIFAR-10

Table 2 provides the average ± standard deviation of the AUC under the ROC curve over 10 runs of the experiment. Here, we report the results of the best performing DOC (Hinge in (3)) model selected over the range of parameters \(\lambda = 1/2C = [1.0, 0.5]\), and that of DOC\(^3\) over the range of parameters \(\lambda = 1/2C = [0.1, 0.05], C_{U}/C = [1.0, 0.5]\). We fix \(\Delta = 0\). A more detailed discussion on model selection and the selected model parameters is provided in “Appendix B.3” for reproducibility. Note, however, that our results for the DROCC algorithm differ from those reported in Goyal et al. (2020): re-running the code provided in Goyal et al. (2020) did not reproduce the results reported in that paper (especially for ‘Ship’). Moreover, their implementation normalizes the data using mean \(\mu = (0.4914, 0.4822, 0.4465)\) and standard deviation \(\sigma = (0.247, 0.243, 0.261)\). These values are calculated using the data from all the classes, which is not available when training on a single class. To avoid such inconsistencies, we instead normalize using mean \(\mu = (0.5, 0.5, 0.5)\) and standard deviation \(\sigma = (0.5, 0.5, 0.5)\). Such scaling does not require a priori information about the other classes’ pixel values, and maps the data to the range \([-1,+1]\). Detailed discussions on reproducing the results of the deep learning algorithms Deep-SVDD (Ruff et al., 2018) and DROCC (Goyal et al., 2020) are provided in “Appendix C.2” (see Tables 18, 19 and 20). As seen from Table 2, DOC\(^3\) (using the noise universum) provides a significant improvement of \(\sim \) 5–15% (and up to \(30 \%\) for ‘Bird’) over its inductive counterpart (DOC). In addition, DOC\(^3\) in most cases outperforms DOC (DA/OE). This illustrates the advantage of extending anomaly detection problems following Def. 2, in accordance with Prop. 2.
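The class-agnostic normalization argued for above has a simple closed form: with per-channel mean 0.5 and standard deviation 0.5, pixel intensities in \([0,1]\) are mapped linearly to \([-1,+1]\) without any statistics from the (unavailable) other classes. A minimal check:

```python
import numpy as np

def normalize(x, mean=0.5, std=0.5):
    # (x - 0.5) / 0.5 maps [0, 1] linearly onto [-1, +1]
    return (x - mean) / std

x = np.linspace(0.0, 1.0, 5)  # pixel intensities in [0, 1]
z = normalize(x)
```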

Next, we show the effectiveness of extending the advanced adversarial DROCC-LF method under the universum setting over the OE based setting used in Goyal et al. (2020). The major difference is that the auxiliary data now serves as universum samples and the loss function follows Eq. (6) (see Algo. 1). For DROCC-LF (OE) we use the same implementation as in Goyal et al. (2020). Additionally, we replace the relu operator \([x]_{+}\) with the softplus operator in the loss functions.

For our experiments, we adopt the same LeNet architecture used in Ruff et al. (2018), Goyal et al. (2020) (see “Appendix B.1.1”, Fig. 4). We run the experiments over 10 runs and report the best AUC over the range of parameters recommended in Goyal et al. (2020) (Sect. 5), i.e., learning rate \(= 10^{-4}\), and radius \(r\) (in the range of \(\sqrt{d}\)) in \(\{ 8.0, 16.0, 32.0\}\). For both methods we use Adam, and fix the number of ascent steps to 10, the batch size to 256, and the total number of epochs to 350. The remaining parameters are set to their default values.

Table 2 provides the average ± standard deviation of the AUC under the ROC curve over 10 runs of the experiment. We also provide results for the standard DROCC (without any auxiliary data) (Goyal et al., 2020) and the more recent PLAD (an adversarial approach introduced to improve DROCC) (Cai & Fan, 2022) as baselines. As seen from Table 2, DROCC-LF (univ) significantly outperforms the DROCC-LF (OE) method, by up to 30% (‘Dog’) in some cases. Further, DROCC-LF (univ) outperforms the baseline algorithms for all cases except ‘Cat’. The final optimal parameters selected for the different classes are provided in “Appendix B.3”.

5.2.2 Fashion-MNIST (F-MNIST)

For our next set of experiments we use another standard image benchmark dataset, F-MNIST (Xiao et al., 2017). The data consists of \(28 \times 28\) grayscale images from Zalando’s fashion product database, comprising 10 mutually exclusive classes (product lines) with 60,000 training and 10,000 test samples. The underlying task is one-vs-rest anomaly detection, where we build a one class classifier for each class and evaluate it on the test data of all 10 classes. As before, this dataset does not have any naturally occurring universum (contradiction) samples (following Def. 2). Hence, we use synthetically generated universum samples drawn from \({\mathcal {N}}(\mu ,\sigma )\), with \(\mu = 0, \; \sigma = 1\), where \({\mathcal {N}}\) is the normal distribution. The data is scaled to the range \([-1,+1]\).

We adopt the same network and experiment set-up used in Cai and Fan (2022) (see Fig. 5 in “Appendix B.1.2”). As before, we provide the results for DOC and its extension under disjoint auxiliary (DA)/Outlier Exposure (OE) settings. We also provide the baseline results from Cai and Fan (2022). Table 3 provides the average ± standard deviation of the AUC under the ROC curve over 10 runs of the experiment. A more detailed discussion on model selection and the selected model parameters is provided in “Appendix B.4” (see Tables 12, 13 and 14) for reproducibility.

As seen from Table 3, DOC\(^3\) (using the noise universum) provides a significant improvement of \(\sim \) 5–20% over its inductive counterpart (DOC). In addition, DOC\(^3\) in most cases outperforms DOC (DA/OE). This further consolidates the advantage of extending anomaly detection problems under the universum setting.

Table 3 Average AUC (with standard deviation) for one-vs-rest anomaly detection on F-MNIST

We also illustrate the effectiveness of extending the advanced adversarial DROCC-LF method under the universum setting over the OE based setting used in Goyal et al. (2020). We provide the results of the DROCC algorithm as a baseline. In addition, we also provide results for the recent PLAD (Cai & Fan, 2022) algorithm, which adopts a generative adversarial learning approach proposed to stabilize DROCC. Note, however, that the reported results are from our re-run of the PLAD algorithm; we found several caveats in its code implementation and report these discrepancies in “Appendix C.3”.

As seen from Table 3, DROCC-LF (univ) outperforms the DROCC-LF (OE) method and beats the baseline algorithms for all the classes. The final optimal parameters selected for the different classes are provided in “Appendix B.4” (see Table 15).

5.3 Visual inspection using real-life MV-Tec AD data

For our final set of experiments we tackle the more realistic problem of visual inspection based anomaly detection in manufacturing lines. With the recent advancements in deep learning technologies, there has been increased interest in automating manufacturing lines by adopting AI driven solutions that provide automated visual inspection of product defects (Bergmann et al., 2019; Huang & Pan, 2015). One popular benchmark dataset for such problems is the MV-Tec AD dataset (Bergmann et al., 2019).

5.3.1 Data set and experiment setup

The MV-Tec AD dataset contains 5354 high-resolution color images of different industrial object and texture categories. For each category it contains normal (defect-free) images used for training. The test data contains both normal and anomalous (defective) product images. The anomalies manifest themselves as over 70 different types of defects, such as scratches, dents, contamination, and various other structural changes. The goal in this paper is to build one class image-level classifiers for the texture categories (see Table 4). We use the original data scale of \([0,1]\). Further, to simplify the problem we resize all images to \(64 \times 64\) pixels. Note that, for the current analysis, we only use the texture classes containing RGB images.

For this problem we have naturally occurring universum (contradiction) samples in the form of the object images or the other texture types. That is, for the goal of building a one class classifier for ‘carpet’, all the ‘other textures’ (leather, tile, wood) or the ‘objects’ (bottle, cable, capsule, hazelnut, metal nut, pill, transistor) available in the dataset can serve as universum (contradiction) samples. This is in line with the problem setting in Def. 2, where such samples are neither ‘normal’ nor ‘anomalous’ (defective) carpet samples. For our experiments, we use three types of universum:

  • Noise: Similar to the previous experiments, we generate random noise as universum samples. Since the data is already scaled to the range \([0,1]\), we generate \(64 \times 64\) images whose pixel values are drawn from a uniform distribution \({\mathcal {U}}(0,1)\).

  • Objects: This type of universum contains all the images in the object categories with RGB pixels, viz. bottle, cable, capsule, hazelnut, metal nut, pill, transistor. Note that we include both the normal and the defective samples for these objects.

  • Other Textures: Here we use the remaining texture images as universum. That is, if the goal is to build a one class classifier for ‘carpet’, we use the images from the other ‘textures’ (leather, tile, wood) as universum. We include both the normal and the defective samples in the universum set.
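Of the three universum types above, the ‘Noise’ type is fully synthetic and can be sketched in a few lines; the sample count `m` below is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(7)
m = 8  # illustrative number of noise universum images
# 64x64 RGB images with pixels drawn from U(0, 1), matching the data's [0, 1] scale
X_univ = rng.uniform(0.0, 1.0, size=(m, 3, 64, 64)).astype(np.float32)
```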

As before, we adopt a LeNet-like architecture (schematic representation in Fig. 3, details in “Appendix B.1.3”, Fig. 6). Note that there have been a few recent works proposing advanced architectures to achieve state-of-the-art performance on this data (Carrara et al., 2020; Huang et al., 2019). However, the main focus here is to isolate the effectiveness of DOC\(^3\), and hence we mainly compare against the DOC and DOC (OE) baselines using a simple LeNet network. Since our baselines DOC and DOC (OE) using LeNet have not previously been reported on this data, as a sanity check we also include the results from Massoli et al. (2020) for comparison with different classes of algorithms. Also, we adopt a slight modification to our loss function: rather than using the relu function \([x]_+\) in Eqs. (3) and (6) for the training samples, we use a softplus operator. We see improved results with this modification. Note that softplus is a dominating surrogate loss over relu, and hence Theorem 1 still holds.
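The domination claim for softplus is easy to verify numerically: \(\log (1 + e^x) \ge \max (x, 0)\) for every real \(x\), so any upper bound on the softplus loss also bounds the relu loss, and the generalization bound of Theorem 1 carries over. A quick numpy check:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softplus(x):
    # log(1 + e^x), computed stably via log1p
    return np.log1p(np.exp(x))

# softplus dominates relu pointwise over a dense grid
x = np.linspace(-10.0, 10.0, 1001)
dominates = np.all(softplus(x) >= relu(x))
```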

Table 4 MVTec-AD dataset

5.3.2 Performance comparison results

Table 5 AUC for MVTec-AD (Texture) data

Table 5 provides the results over 10 runs of our experiments. We provide the average ± standard deviation of the AUC values for the DOC, DOC (DA/OE) and DOC\(^3\) algorithms. In addition, we also provide the best AUC obtained for each algorithm over these 10 runs. Additional details on model selection and the optimal hyperparameters are provided in “Appendix B.5”. As seen in Table 5, the DOC\(^3\) algorithm provides a significant improvement over DOC; depending on the type of universum, typical improvements range up to \(> 50 \%\). In addition, DOC\(^3\) provides consistent improvements over the DOC (DA/OE) algorithm. In all, these results further consolidate the utility of DOC\(^3\) under the universum setting (Def. 2). Separately, Table 5 also provides the baseline results available in Massoli et al. (2020). Note that these results are obtained using advanced network architectures adapted for the MVTec data, and are not averaged over multiple runs. Hence, we compare them with the best AUC obtained for DOC, DOC (DA/OE) and DOC\(^3\) over 10 runs. As seen from Table 5, DOC\(^3\) improves upon the ‘Carpet’ and ‘Leather’ results using the ‘Objects’ universum. Further, it achieves comparable performance for the ‘Wood’ and ‘Tile’ textures using the ‘Noise’ and ‘Objects’ universum, respectively. Achieving improved performance over the baseline algorithms, even with a basic LeNet architecture, reflects very positively on the proposed DOC\(^3\) algorithm.

5.3.3 Understanding DOC\(^3\) performance using Theorem (2)

For our final set of experiments we try to understand the workings of the DOC\(^3\) algorithm in connection with the correlation \(\Sigma (\infty )\) (in Theorem 2). Table 6 reports the correlation values between the training and universum samples using the ‘RAW’ pixels and the DOC and DOC\(^3\) solutions’ feature maps. For the feature map we use the CNN features shown in Fig. 3. The DOC\(^3\) solutions represent the models estimated using the training data (in column 1) and the respective universum data (in column 2). As seen from the results, the DOC solution yields a high correlation \(\Sigma (\infty )\) between the training and universum samples; in essence, the DOC solution sees the training and universum samples similarly. This is not desirable, as the universum samples follow a different distribution than the training samples. On the contrary, DOC\(^3\) provides a solution where the correlation between the training and universum samples is significantly reduced. This is in line with Theorem 2’s analysis (Sect. 3.2), where we argued that DOC\(^3\) searches for a solution with low \(\Sigma (\infty )\) between the training and universum samples (in feature space), and by doing so ensures a lower ERC and improved generalization compared to DOC (confirmed empirically in Table 5). Another interesting observation is that for the ‘other textures’ universum type, which has high raw pixel correlation values (\(\sim 0.9\)), DOC\(^3\) provides only limited improvement. Such universum types are too similar to the training data, and act as ‘bad’ contradictions.
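To make the discussion concrete, the sketch below computes a simple proxy for the training/universum correlation: the largest absolute cosine similarity between any pair of training and universum feature vectors. This is only an illustrative stand-in; the exact definition of \(\Sigma (\infty )\) is given with Theorem 2 (Sect. 3.2) and is not reproduced here.

```python
import numpy as np

def max_abs_cosine(F_train, F_univ, eps=1e-12):
    """Largest |cosine similarity| between rows of two feature matrices.

    F_train: (n, k) training feature map, F_univ: (m, k) universum feature map.
    """
    A = F_train / (np.linalg.norm(F_train, axis=1, keepdims=True) + eps)
    B = F_univ / (np.linalg.norm(F_univ, axis=1, keepdims=True) + eps)
    return float(np.abs(A @ B.T).max())

# Identical feature maps -> correlation ~1 (the undesirable DOC regime);
# orthogonal feature maps -> correlation ~0 (the regime DOC^3 seeks).
F = np.array([[1.0, 0.0], [0.0, 1.0]])
high = max_abs_cosine(F, F)
low = max_abs_cosine(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
```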

Fig. 3: Schematic representation of the network used for the MVTec-AD results in Table 5

Table 6 Average ± standard deviation of correlation \(\Sigma (\infty )\) between training and universum over 10 runs

6 Future research

Broadly, there are two major future research directions.

Model selection This is a generic issue for any (unsupervised) one class based anomaly detection formulation, and it is further complicated by the non-convex loss landscape of deep learning problems. For DOC\(^3\) we simplify model selection by fixing \(\Delta = 0\) and optimally tuning \(C_U\). However, the success of DOC\(^3\) heavily depends on careful tuning of its hyperparameters. In the absence of a validation set containing both ‘normal’ and ‘anomalous’ samples, we follow the current norm of reporting the best model’s results over a small subset of hyperparameters; but this is far from practical. We believe our Theorem 1 provides a good framework for bound based model selection. This, in conjunction with Theorem 2 and the recent works on ERC for deep architectures (Neyshabur et al., 2015; Sokolic et al., 2016), may provide better mechanisms for model selection and yield optimal models.

Selecting ‘good’ universum samples The effectiveness of DOC\(^3\) also depends on the type of universum used. Our analysis in Sect. 5.3.3 provides some initial insights into the workings of DOC\(^3\) and how to loosely identify ‘bad’ contradictions. Additional analysis, possibly in line with the Histogram of Projections (HOP) technique introduced in Cherkassky et al. (2011), Dhar et al. (2019), is needed to improve our understanding of ‘good’ universum samples. This is an open research problem.

7 Conclusions

This paper introduces the notion of learning from contradictions for deep one class classification and proposes the DOC\(^3\) algorithm. By deriving its Empirical Rademacher Complexity (ERC), DOC\(^3\) is shown to provide improved generalization over DOC, its inductive counterpart. We empirically demonstrate the effectiveness of the proposed formulation and connect the results to our theoretical analysis. Finally, we also discuss the limitations and future research directions.