1 Introduction

Anomaly detection (AD) is one of the most widely researched problems in the machine learning community (Chandola et al., 2009). In its basic form, the task of AD involves discerning patterns in data that do not conform to expected ‘normal’ behavior. These non-conforming patterns are referred to as anomalies or outliers. Anomaly detection problems manifest in several real-life forms, such as defect detection in manufacturing lines, intrusion detection for cyber security, or pathology detection for medical diagnosis. There are several mechanisms to handle anomaly detection problems, viz., parametric or non-parametric statistical modeling, spectral based modeling, or classification based modeling (Chandola et al., 2009). Of these, the classification based approach has been widely adopted in the literature (Scholkopf et al., 2002; Tax & Duin, 2004; Tan et al., 2016; Cherkassky & Mulier, 2007). One specific classification based formulation which has gained huge adoption is one class classification (Scholkopf et al., 2002; Tax & Duin, 2004), where we design a parametric model to estimate the support of the ‘normal’ class distribution. The estimated model is then used to detect ‘unseen’ abnormal samples.

With the recent success of deep learning based approaches for different machine learning problems, there has been a surge in research adopting deep learning for one class problems (Ruff et al., 2021; Pang et al., 2020; Chalapathy & Chawla, 2019). However, most of these works adopt an inductive learning setting. This makes the underlying model estimation data hungry, and such models perform poorly for applications with limited training data availability, like medical diagnosis, industrial defect detection, etc. The learning from contradictions paradigm (popularly known as Universum learning) has been shown to be particularly effective for problems with limited training data availability (Vapnik, 2006; Sinz et al., 2008; Weston et al., 2006; Chen & Zhang, 2009; Cherkassky et al., 2011; Shen et al., 2012; Dhar & Cherkassky, 2015; Zhang & LeCun, 2017; Xiao et al., 2021). However, it has mostly been limited to binary or multi class problems. In this paradigm, along with the labeled training data we are also given a set of unlabeled contradictory (a.k.a. universum) samples. These universum samples belong to the same application domain as the training data, but are known not to belong to any of the classes. The rationale behind this setting comes from the fact that even though obtaining labels is very difficult, obtaining such additional unlabeled samples is relatively easy. These unlabeled universum samples act as contradictions and should not be explained by the estimated decision rule. Adapting this to one class problems is not straightforward. A major conceptual problem is that one class model estimation represents unsupervised learning, where the notion of contradiction needs to be redefined properly. In this paper, we make the following contributions:

  1. Definition: We introduce the notion of ‘Learning from contradictions’ for one class problems (Definition 2).

  2. Formulation: We analyze the popular one class hinge loss (Schölkopf et al., 2001), and extend it under universum settings to propose the Deep One Class Classification using Contradictions (DOC\(^3\)) algorithm. Further, our proposed ‘learning from contradictions’ is a generic learning setting and can complement other advanced learning settings. To illustrate this, we extend the adversarial learning based DROCC-LF (Goyal et al., 2020) algorithm under universum settings and call it DROCC-LF (univ) (see Algo. 1).

  3. Generalization error: We analyze the generalization performance of one class formulations under inductive and universum settings using Rademacher complexity based bounds, and show that learning under the universum setting can provide improved generalization compared to its inductive counterpart.

  4. Empirical results: Finally, we provide an exhaustive set of empirical results on several tabular and image datasets in support of our approach.

2 One class learning under inductive settings

First we introduce the widely adopted inductive learning setting used for one class problems (Scholkopf et al., 2002; Cherkassky & Mulier, 2007).

Definition 1

(Inductive setting) Given i.i.d. training samples from a single class \({\mathcal {T}}=({\textbf{x}}_i, \; y_i = +1)_{i=1}^n \sim {\mathcal {D}}_{{\mathcal {X}}\vert {\mathcal {Y}} = +1}^n\), with \({\textbf{x}} \in {\mathcal {X}} \subseteq \Re ^d\) and \(y \in {\mathcal {Y}} = \{-1,+1 \}\); estimate a hypothesis \(h^*:{\mathcal {X}} \rightarrow {\mathcal {Y}}\) from a hypothesis class \({\mathcal {H}}\) which minimizes,

$$\begin{aligned} \underset{h \in {\mathcal {H}}}{\text {inf}} \; {\mathbb {E}}_{{\mathcal {D}}_{{\mathcal {T}}}}[{\textbf{1}}_{y \ne h({\textbf{x}})}] \end{aligned}$$
(1)

\({\mathcal {D}}_{{\mathcal {T}}}\) is the training distribution (consisting of both classes)

\({\mathcal {D}}_{{\mathcal {X}} \vert {\mathcal {Y}} = +1}\) is the class conditional distribution

\({\textbf{1}}(\cdot )\) is the indicator function, and

\({\mathbb {E}}_{{\mathcal {D}}_{{\mathcal {T}}}}(\cdot )\) is the expectation under training distribution.

Note that the underlying data generation process assumes a two class problem, of which samples from only one class are available during training. The overall goal is to estimate a model which minimizes the error on future test data containing samples from both the normal (\(y = +1\)) and abnormal (\(y = -1\)) classes. A typical example is AI driven visual inspection of product defects in a manufacturing line, where images or videos of non-defective products are available in abundance and the goal is to detect ‘defective’ (abnormal/anomalous) products (Bergmann et al., 2019; Weimer et al., 2016). A popular loss function used in such settings is the \(\nu \)-SVM loss (Schölkopf et al., 2001),

$$\begin{aligned}&\underset{{\textbf{w}},\varvec{\xi }, \rho }{\text {min}}\quad \frac{1}{2}\vert \vert {\textbf{w}}\vert \vert _2^2 \; + \; \frac{1}{\nu n} \sum _{i=1}^n \xi _i -\rho{} & {} \nonumber \\&\quad \text {s.t.} \quad \quad {\textbf{w}}^\top \phi ({\textbf{x}}_i) \ge \rho -\xi _i, \quad \xi _i \ge 0; \quad \forall \; i = 1 \ldots n{} & {} \end{aligned}$$
(2)

where \(\nu \in (0,1]\) is a user-defined parameter which controls the margin errors \(\sum _i \xi _i\) and the size of the geometric \(\frac{1}{\vert \vert {\textbf{w}}\vert \vert }\) and functional \(\rho \) margins. \(\phi (\cdot ): {\mathcal {X}} \rightarrow {\mathcal {G}}\) is a feature map. Typical examples include an empirical kernel map (see Definition 2.15 in Scholkopf et al. (2002)) or a map induced by a deep learning network (Goodfellow et al., 2016). The final decision function is given as,   \(h({\textbf{x}}) = \left\{ \begin{array}{l l} +1;\quad \text {if} \; {\textbf{w}}^\top \phi ({\textbf{x}}) \ge \rho \\ -1;\quad \text {else} \end{array}\right. \). Note that recent works like Ruff et al. (2018) extend a different loss function which uses a ball to explain the support of the data distribution, following (Tax & Duin, 2004). As discussed in Schölkopf et al. (2001), these two formulations most often yield equivalent decision functions. For example, with kernel machines where \({\textbf{K}}({\textbf{x}},{\textbf{x}}^{\prime }) = \phi ({\textbf{x}})^\top \phi ({\textbf{x}}^\prime )\) depends solely on \({\textbf{x}} - {\textbf{x}}^{\prime }\) (like RBF kernels), the two formulations are the same. Hence, most of the improvements discussed in this work translate to such alternate formulations. In this paper, however, we solve the following one class Hinge Loss,

$$\begin{aligned}&\underset{{\textbf{w}}}{\text {min}}\quad \frac{1}{2}\vert \vert {\textbf{w}}\vert \vert _2^2 \; + \; C \; L_T({\textbf{w}},\{\phi ({\textbf{x}}_i)\}_{i=1}^n){} & {} \nonumber \\&\quad \text {s.t} \quad \quad L_T({\textbf{w}},\{\phi ({\textbf{x}}_i)\}_{i=1}^n) = \sum _{i=1}^n [1-{\textbf{w}}^\top \phi ({\textbf{x}}_i)]_+ \; ; \; [x]_+ = \text {max}(0,x){} & {} \end{aligned}$$
(3)

to estimate the decision function \(f({\textbf{x}}) = {\textbf{w}}^\top \phi ({\textbf{x}})\) and use the decision rule, \(h({\textbf{x}}) = \left\{ \begin{array}{l l} +1;\quad \text {if} \; f({\textbf{x}}) \ge 1 \\ -1;\quad \text {else} \end{array}\right. \). Here, the user-defined parameter C controls the trade-off between explaining the training samples (through a small margin error \(\sum _{i=1}^n \xi _i \)) and the margin size (through \(\vert \vert {\textbf{w}}\vert \vert _2^2\)), which in turn controls the generalization error. For deep learning architectures we optimize over all the model parameters and equivalently regularize the entire matrix norm \(\vert \vert {\textbf{W}}\vert \vert _F^2\); see Goyal et al. (2020), Ruff et al. (2018). Note that we solve the one class Hinge loss (3) for two main reasons,

  • First, it has the advantage that \(L_T({\textbf{w}},\phi (\{{\textbf{x}}\}_{i=1}^n)) = \sum _{i=1}^n [1-{\textbf{w}}^\top \phi ({\textbf{x}}_i)]_+ \) exhibits the same form as the traditional hinge loss used for binary classification problems (Vapnik, 2006) and can be easily solved using existing software packages (Paszke et al., 2019; Abadi et al., 2016; Pedregosa et al., 2011). Throughout the paper we refer to (3) with an underlying deep architecture as the Deep One Class DOC (Hinge) formulation.

  • Second, solving Eq. (3) also provides the solution for Eq. (2). This connection follows from Proposition 1.
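For concreteness, the DOC (Hinge) objective in Eq. (3) and its decision rule can be sketched with a fixed feature map as follows. This is a minimal NumPy sketch; the function names and the pre-computed feature matrix `Z` (rows \(z_i = \phi(x_i)\)) are illustrative, not the authors' implementation.

```python
import numpy as np

def one_class_hinge(w, Z, C=1.0):
    """DOC (Hinge) objective of Eq. (3): 0.5*||w||^2 + C * sum_i [1 - w^T z_i]_+ .
    Z is an (n, d) matrix of feature-mapped training samples z_i = phi(x_i)."""
    margins = 1.0 - Z @ w                 # 1 - w^T phi(x_i) for each sample
    L_T = np.maximum(0.0, margins).sum()  # hinge over the normal samples
    return 0.5 * np.dot(w, w) + C * L_T

def decide(w, Z):
    """Decision rule h(x): +1 if f(x) = w^T phi(x) >= 1, else -1."""
    return np.where(Z @ w >= 1.0, 1, -1)
```

Only samples falling inside the margin (\({\textbf{w}}^\top \phi({\textbf{x}}_i) < 1\)) contribute to the loss, mirroring the slack variables \(\xi_i\) in Eq. (2).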

Proposition 1

Connection between Eq. (2) and Eq. (3)

  1. Any solution \({\textbf{w}}\) of Eq. (3) also solves Eq. (2) with \(\nu = \frac{1}{Cn \delta }\), where \(\delta > 0\) is a scalar that depends on the solution of Eq. (3). Further, this solution \((\hat{\mathbf {{{w}}}}, \rho )\) of Eq. (2) is given as \(\hat{\mathbf {{{w}}}} = {\textbf{w}}\delta , \quad \rho = \delta \).

  2. The decision function obtained by solving Eq. (3), i.e., \({\textbf{w}}^\top \phi ({\textbf{x}}) - 1 = 0\), coincides with the decision function \(\hat{\mathbf {{{w}}}}^\top \phi ({\textbf{x}}) - \rho = 0\) obtained by solving Eq. (2) using the solution discussed above.

All proofs are provided in “Appendix”.
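The scaling relation in Proposition 1 is easy to check numerically. The sketch below assumes an illustrative \(\delta\) and a feature vector lying on the Eq. (3) boundary; it only verifies that, under the stated mapping, the two decision boundaries coincide.

```python
import numpy as np

# Suppose w solves Eq. (3); delta here is an illustrative positive scalar.
delta = 0.4
w = np.array([2.0, -1.0])
w_hat, rho = delta * w, delta      # mapped solution of Eq. (2) per Proposition 1

# Any feature vector z with w^T z = 1 (the Eq. (3) boundary) also satisfies
# w_hat^T z = rho (the Eq. (2) boundary), so the decision functions coincide.
z = np.array([0.5, 0.0])           # chosen so that w^T z = 1
assert np.isclose(w @ z, 1.0)
assert np.isclose(w_hat @ z, rho)
```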

3 One class learning using contradictions a.k.a Universum learning

3.1 Problem formulation

Fig. 1: Visual inspection of anomalous screws in a manufacturing line (Bergmann et al., 2019). Images of the other products act as universum samples. Such images are neither normal-screw nor anomalous-screw images and act as contradictions.

Learning from contradictions, or Universum learning, was introduced in Vapnik (2006) for binary classification problems to incorporate a priori knowledge about admissible data samples. For example, if the goal of learning is to discriminate between handwritten digits ‘5’ and ‘8’, one can introduce additional knowledge in the form of other handwritten letters ‘a’, ‘b’, ‘c’, ‘d’, \(\ldots \) ‘z’. These examples from the Universum contain certain information about the handwriting styles of authors, but they cannot be assigned to either of the two classes (5 or 8). Further, these Universum samples do not have the same distribution as the labeled training samples. In this work we introduce the notion of ‘Learning from Contradictions’ for one class problems. Similar to the inductive setting (Definition 1), the goal here is also to minimize the generalization error on future test data containing both normal (\(y=+1\)) and abnormal (\(y=-1\)) samples. Here, however, during training, in addition to the samples from the normal class \(({\textbf{x}}_i, y_i = +1)_{i=1}^n\), we are also provided with universum (contradictory) samples, which are known not to belong to either of the (normal or abnormal) classes of interest. A practical use-case is automated visual inspection based anomaly detection in manufacturing lines. Here the target is to identify defects in a specific product type (say ‘screws’ in Fig. 1). In this case, images from other product types in the manufacturing line act as universum samples. Note that such universum samples belong to the same application domain (i.e. visual inspection data), but do not represent either of the classes of normal screws or anomalous screws. This setting is formalized as,

Definition 2

(Learning from contradictions a.k.a Universum setting) Given i.i.d training samples \({\mathcal {T}}=({\textbf{x}}_i,y_i = +1)_{i=1}^n \sim {\mathcal {D}}_{{\mathcal {X}}\vert {\mathcal {Y}} = +1}^n\), with \({\textbf{x}} \in {\mathcal {X}} \subseteq \Re ^d\) and \(y \in {\mathcal {Y}} =\{-1,+1\}\) and additional m universum samples \({\mathcal {U}} = ({\textbf{x}}_{i^\prime }^{*})_{i^\prime =1}^m \sim {\mathcal {D}}_{{\mathcal {U}}}\) with \({\textbf{x}}^{*} \in {\mathcal {X}}_{U}^* \subseteq \Re ^d\), estimate \(h^*:{\mathcal {X}} \rightarrow {\mathcal {Y}}\) from hypothesis class \({\mathcal {H}}\) which, in addition to Eq. (1), obtains maximum contradiction on universum samples i.e. maximizes the following probability for \({\textbf{x}}^* \in {\mathcal {X}}_{U}^*\),

$$\begin{aligned}&\underset{h \in {\mathcal {H}}}{\text {sup}} \; {\mathbb {P}}_{{\mathcal {D}}_{{\mathcal {U}}}}[h({\textbf{x}}^* )\notin {\mathcal {Y}}] = \underset{h \in {\mathcal {H}}}{\text {sup}} \; {\mathbb {E}}_{{\mathcal {D}}_{{\mathcal {U}}}}[{\textbf{1}}_{ \lbrace \bigcap \limits _{y \in {\mathcal {Y}}} h({\textbf{x}}^*) \ne y \rbrace }]{} & {} \end{aligned}$$
(4)

\({\mathcal {D}}_{{\mathcal {U}}}\) is the universum distribution,

\({\mathbb {P}}_{{\mathcal {D}}_{{\mathcal {U}}}}(\cdot )\) is probability under universum distribution,

\({\mathbb {E}}_{{\mathcal {D}}_{{\mathcal {U}}}}(\cdot )\) is the expectation under universum distribution, \({\mathcal {X}}_{U}^{*}\) is the domain of universum data.

Learning using contradictions under the Universum setting has the dual goal of minimizing the generalization error in Eq. (1) while maximizing the contradiction on universum samples in Eq. (4). The following proposition provides guidelines on how this can be achieved for the one class hinge loss in Eq. (3).

Proposition 2

For the one class hinge loss in Eq. (3), maximum contradiction on universum samples \({\textbf{x}}^* \in {\mathcal {X}}_U^*\) can be achieved when,

$$\begin{aligned} \vert {\textbf{w}}^\top \phi ({\textbf{x}}^{*}) - 1\vert = 0 \end{aligned}$$
(5)

That is, we need the universum samples to lie on the decision boundary. This motivates the following one class loss using contradictions (under Universum settings), where we relax the constraint in Eq. (5) by introducing a \(\Delta \)-insensitive loss similar to Weston et al. (2006), Dhar et al. (2019) and solve,

$$\begin{aligned}&\underset{{\textbf{w}}}{\text {min}}\; \frac{1}{2}\vert \vert {\textbf{w}}\vert \vert _2^2 + C \; L_T({\textbf{w}},\phi (\{{\textbf{x}}_i\}_{i=1}^n)) + C_U \; L_U({\textbf{w}},\phi (\{{\textbf{x}}_{i^\prime }^*\}_{i^\prime =1}^m)){} & {} \nonumber \\&\text {s.t.} \quad L_T({\textbf{w}},\phi (\{{\textbf{x}}\}_{i=1}^n)) = \sum _{i=1}^n [1-{\textbf{w}}^\top \phi ({\textbf{x}}_i)]_+{} & {} \nonumber \\&\quad \quad \; L_U({\textbf{w}},\phi (\{{\textbf{x}}_{i^\prime }^*\}_{i^\prime =1}^m)) = \sum _{i^\prime =1}^m [\vert 1-{\textbf{w}}^\top \phi ({\textbf{x}}_{i^\prime }^*)\vert -\Delta ]_+{} & {} \end{aligned}$$
(6)

Here, \([x]_+ = \text {max}(0,x)\). Further, the interplay between \(C\) and \(C_U\) controls the trade-off between explaining the training samples using \(L_T\) versus maximizing the contradiction on Universum samples using \(L_U\). For \(C_U = 0\) or \(\Delta \rightarrow \infty \), Eq. (6) reduces to Eq. (3). For deep learning models, we optimize Eq. (6) over all the model parameters and refer to it as Deep One Class Classification using Contradictions (DOC\(^3\)).
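Under the same fixed-feature-map convention, the DOC\(^3\) objective in Eq. (6) can be sketched as follows. This is a NumPy sketch with illustrative names; in practice the feature map and \({\textbf{w}}\) would be learned jointly by a deep network.

```python
import numpy as np

def doc3_loss(w, Z, Zu, C=1.0, C_U=1.0, delta=0.0):
    """DOC^3 objective of Eq. (6). Z: (n, d) feature-mapped training samples,
    Zu: (m, d) feature-mapped universum samples. Names are illustrative."""
    L_T = np.maximum(0.0, 1.0 - Z @ w).sum()                   # hinge on normal class
    L_U = np.maximum(0.0, np.abs(1.0 - Zu @ w) - delta).sum()  # Delta-insensitive universum loss
    return 0.5 * np.dot(w, w) + C * L_T + C_U * L_U
```

Setting `C_U=0` makes the universum term vanish, recovering the inductive loss of Eq. (3); universum samples incur no loss exactly when they lie within \(\Delta\) of the decision boundary \(f({\textbf{x}}^*) = 1\).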

3.2 Analysis of generalization error bound

Next we provide theoretical justification in support of Universum learning. We argue that learning under universum settings using DOC\(^3\) can provide improved generalization compared to its inductive counterpart DOC (Hinge). For this, we first derive a generic form of the generalization error bound for one class learning using the Rademacher complexity capacity measure in Theorem 1.

Theorem 1

(Generalization error bound) Let \({\mathcal {F}}\) be the class of functions from which the decision functions \(f({\textbf{x}})\) in Eqs. (3) and (6) are estimated. Let \(R_{f,1} = \{ {\textbf{x}}: f({\textbf{x}}) \ge 1 \}\) be the induced decision region. Then, with probability at least \(1-\eta \) with \(\eta \in (0,1]\), over any independent draw of the random sample \({\mathcal {T}}=({\textbf{x}}_i,y_i = +1)_{i=1}^n \sim {\mathcal {D}}_{{\mathcal {T}}\vert {\mathcal {Y}} = +1}^n\), for any \(\kappa > 0\) we have,

$$\begin{aligned}&{\mathbb {P}}_{{\mathcal {D}}_{{\mathcal {T}}\vert {\mathcal {Y}} = +1}}({\textbf{x}} \notin R_{f,1-\kappa }) \; \le \; \frac{1}{\kappa n} \sum _{i=1}^n \xi _i + \frac{2}{\kappa } \hat{{\mathcal {R}}}_n({\mathcal {F}}) + 3 \sqrt{\frac{\ln \frac{2}{\eta }}{2n}}{} & {} \end{aligned}$$
(7)

where \(\xi _i = [1-f({\textbf{x}}_i)]_+\);   \(R_{f,\theta } = \{{\textbf{x}}: f({\textbf{x}}) \ge \theta \} \)

\(\hat{{\mathcal {R}}}_n({\mathcal {F}}) = {\mathbb {E}}_{\sigma }[\underset{f \in {\mathcal {F}}}{\text {sup}} \vert \frac{2}{n} \sum _{i=1}^n \sigma _i f({\textbf{x}}_i)\vert \Big \vert ({\textbf{x}}_i)_{i=1}^n] \)

\(\sigma _i\) = independent uniform \(\{ \pm 1 \}\)-valued random variables, a.k.a. Rademacher variables.

Theorem 1 is agnostic of the model parameterization and holds for any popularly adopted kernel machine or deep learning architecture. Similar to Theorem 7 in Schölkopf et al. (2001), Theorem 1 gives a probabilistic guarantee that new points lie in the larger region \(R_{f,1-\kappa }\). Here, we instead use the Empirical Rademacher Complexity (ERC) \(\hat{{\mathcal {R}}}_n({\mathcal {F}})\) as the capacity measure of the hypothesis class, rather than the covering number. Additionally, our bound does not contain a \(\frac{1}{\kappa ^2}\) term as in Schölkopf et al. (2001), and only has a scaling factor of \(\frac{1}{\kappa }\). As seen from Theorem 1 above, it is preferable to use a hypothesis class \({\mathcal {F}}\) with smaller ERC \(\hat{{\mathcal {R}}}_n({\mathcal {F}})\). Next we compare the ERC of the hypothesis classes induced by the formulations in Eq. (3) versus Eq. (6).
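For the linear hypothesis class \(\{f({\textbf{x}}) = {\textbf{w}}^\top {\textbf{z}} : \vert \vert {\textbf{w}}\vert \vert \le \Lambda \}\), the supremum inside the ERC has the closed form \(\frac{2\Lambda }{n}\vert \vert \sum _i \sigma _i {\textbf{z}}_i\vert \vert \), so the ERC itself can be estimated by Monte Carlo over the Rademacher variables. A small sketch (the function name and defaults are illustrative):

```python
import numpy as np

def erc_linear(Z, Lam=1.0, n_draws=2000, seed=0):
    """Monte Carlo estimate of the ERC of {x -> w^T z : ||w|| <= Lam}.
    For this class, sup_w |(2/n) sum_i sigma_i w^T z_i| = (2*Lam/n)*||sum_i sigma_i z_i||."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    sups = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)          # Rademacher variables
        sups.append(2.0 * Lam / n * np.linalg.norm(sigma @ Z))
    return float(np.mean(sups))
```

By Jensen's inequality, the estimate stays below the closed-form upper bound \(\frac{2\Lambda }{n}\sqrt{\sum _i \vert \vert {\textbf{z}}_i\vert \vert ^2}\) of Theorem 2 (2a).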

Theorem 2

(Empirical Rademacher complexity). For the hypothesis class induced by the formulations,

  • Equation (3): \({\mathcal {F}}_{\text {ind}} = \{ f: {\textbf{x}} \rightarrow {\textbf{w}}^{\top } \phi ({\textbf{x}}) \Big \vert \vert \vert {\textbf{w}}\vert \vert _2^2 \le \Lambda ^2 \}\)

  • Equation (6): \({\mathcal {F}}_{\text {univ}} = \{ f: {\textbf{x}} \rightarrow {\textbf{w}}^{\top } \phi ({\textbf{x}}) \Big \vert \vert \vert {\textbf{w}}\vert \vert _2^2 \le \Lambda ^2; \vert {\textbf{w}}^{\top } \phi (\mathbf {x^*}) -1\vert \le \Delta \;, \; \forall x^* \in {\mathcal {X}}_U^*\}\)

The following holds,

  1. \(\hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {ind}}) \ge \hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {univ}})\)

  2. Further, for any fixed mapping \(\phi (\cdot )\), \(\forall \gamma \ge 0\) we have,

    (a) \(\hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {ind}}) \le \frac{2\Lambda }{n} \sqrt{\sum \limits _{i=1}^n \vert \vert {\textbf{z}}_i\vert \vert ^2}\);  where \({\textbf{z}} = \phi ({\textbf{x}})\)

    (b) \(\hat{{\mathcal {R}}}_n({\mathcal {F}}_{\text {univ}}) \le \frac{2\Lambda }{n} \sqrt{\sum \limits _{i=1}^n \vert \vert {\textbf{z}}_i\vert \vert ^2} \, \underset{\gamma \ge 0}{\text {min}} \; K(\gamma ) \big [1 - \varvec{\Sigma }(\gamma )\big ] ^{\frac{1}{2}}\)

$$\begin{aligned} \text {where }&K(\gamma ) = \big [1+\frac{2\gamma m (\Delta ^2+1)}{\Lambda ^2}\big ]^{\frac{1}{2}}{} & {} \end{aligned}$$
(8)
$$\begin{aligned}&\varvec{\Sigma }(\gamma ) = \gamma \frac{ tr(VZ^{\top }ZV^{\top })}{ \big [tr(Z^{\top }Z) \big ] \; \big [tr(I+\gamma VV^{\top }) \big ]}{} & {} \nonumber \\&Z = \begin{bmatrix} ({\textbf{z}}_1)^T\\ \vdots \\ ({\textbf{z}}_{n})^T \end{bmatrix} \; and \; V = \begin{bmatrix} 1\\ -1 \end{bmatrix} \otimes \begin{bmatrix} ({\textbf{u}}_1)^T\\ \vdots \\ ({\textbf{u}}_{m})^T \end{bmatrix} ; \quad {\textbf{u}} = \phi ({\textbf{x}}^*); \quad {\textbf{x}}^* \in {\mathcal {X}}_{U}^{*}{} & {} \end{aligned}$$
(9)

\(\otimes = \text {Kronecker Product}, \quad tr = \text {Matrix Trace} \)

Note that several recent works (Neyshabur et al., 2015; Sokolic et al., 2016; Cortes et al., 2017) derive the ERC of the function class induced by an underlying neural architecture. In this analysis, however, we fix the feature map and analyze how the loss function in Eq. (6) reduces the function class capacity compared to Eq. (3). This simplifies our analysis and focuses on the effect of the proposed new loss in Eq. (6) under the universum setting. As seen from Theorem 2 (1), the function class induced under the universum setting (using contradictions) exhibits a lower ERC compared to that under the inductive setting. A more explicit characterization of the ERC is provided in part (2). Setting \(\gamma = 0\) in (b), we recover the same R.H.S. as in (a); hence the R.H.S. in (b) is never larger than in (a). Further note that \(\varvec{\Sigma }(\gamma )\) in Eq. (9) has the form of a correlation measure between the training and universum samples in the feature space. In fact, we have \(\Sigma (\infty ) = \underset{\gamma \rightarrow \infty }{\text {lim}} \Sigma (\gamma ) = \frac{ tr(VZ^{\top }ZV^{\top })}{ tr(Z^{\top }Z) \; tr(VV^{\top })}\). This shows that, for a fixed number of universum samples m and fixed \(\Delta \), the effect of the DOC\(^3\) algorithm is influenced by the correlation between the training and universum samples in the feature space. Loosely speaking, the DOC\(^3\) algorithm searches for a solution which, in addition to reducing the margin errors \(\xi _i\), also minimizes this correlation; and by doing so minimizes the generalization error. Similar conclusions have been derived empirically for binary and multiclass problems in Weston et al. (2006), Chapelle et al. (2008), Cherkassky et al. (2011) and Dhar et al. (2019). Here, we provide the theoretical reasoning for one class problems. Further, we confirm these theoretical findings in our results (Sect. 5.3.3).
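The limiting quantity \(\Sigma (\infty )\) above is straightforward to compute from a feature-mapped training matrix Z and universum matrix U (a sketch; the function name is illustrative):

```python
import numpy as np

def sigma_inf(Z, U):
    """Sigma(inf) = tr(V Z^T Z V^T) / (tr(Z^T Z) * tr(V V^T)) from Theorem 2,
    with V = [1; -1] (x) U built from the feature-mapped universum samples u = phi(x*)."""
    V = np.kron(np.array([[1.0], [-1.0]]), U)   # (2m, d) Kronecker stacking of Eq. (9)
    num = np.trace(V @ Z.T @ Z @ V.T)
    den = np.trace(Z.T @ Z) * np.trace(V @ V.T)
    return num / den
```

Universum features orthogonal to the training features give \(\Sigma (\infty ) = 0\), while perfectly aligned features give values near 1, matching the correlation interpretation above.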

3.3 Algorithm implementation

A limitation in solving Eq. (6) is handling the absolute-value term in \(L_U\). In this paper we adopt an approach similar to that used in Weston et al. (2006), Dhar et al. (2019) and simplify this by re-writing \(L_U\) as a sum of two hinge functions. To do this, for every universum sample \({\textbf{x}}_{i^\prime }^{*}\) we create two artificial samples, \(({\textbf{x}}_{i^\prime }^{*},y_{i^\prime 1}^{*}=1), ({\textbf{x}}_{i^\prime }^{*},y_{i^\prime 2}^{*}=-1)\), and re-write,

$$\begin{aligned}&L_U = \sum _{i^\prime =1}^m [\vert 1-{\textbf{w}}^\top \phi ({\textbf{x}}_{i^\prime }^*)\vert -\Delta ]_+{} & {} \nonumber \\&\quad = \sum _{i^\prime =1}^m \Big ( [\epsilon _1-y_{i^\prime 1}^{*}{\textbf{w}}^\top \phi ({\textbf{x}}_{i^\prime }^*)]_+ + [\epsilon _2-y_{i^\prime 2}^{*}{\textbf{w}}^\top \phi ({\textbf{x}}_{i^\prime }^*)]_+ \Big ){} & {} \end{aligned}$$
(10)

where \(\epsilon _1 = 1 - \Delta \) and \(\epsilon _2 = -1 - \Delta \). Now, the universum loss is the sum of two hinge functions with \(\epsilon _1\)- and \(\epsilon _2\)-margins, and can be solved using standard deep learning libraries (Paszke et al., 2019; Abadi et al., 2016; Pedregosa et al., 2011).
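The equivalence between the \(\Delta \)-insensitive form of \(L_U\) in Eq. (6) and the two-hinge re-writing in Eq. (10) can be verified numerically on the raw scores \(f({\textbf{x}}^*) = {\textbf{w}}^\top \phi ({\textbf{x}}^*)\). A NumPy sketch with illustrative names:

```python
import numpy as np

def univ_loss_abs(scores, delta=0.25):
    """L_U of Eq. (6): sum_i [ |1 - f(x*_i)| - delta ]_+ over universum scores."""
    return np.maximum(0.0, np.abs(1.0 - scores) - delta).sum()

def univ_loss_two_hinges(scores, delta=0.25):
    """Equivalent form of Eq. (10): two hinges with margins eps1 = 1 - delta and
    eps2 = -1 - delta, using the artificial labels y* = +1 and y* = -1."""
    eps1, eps2 = 1.0 - delta, -1.0 - delta
    return (np.maximum(0.0, eps1 - scores) + np.maximum(0.0, eps2 + scores)).sum()
```

For any score, at most one of the two hinges is active, and their sum reproduces the \(\Delta \)-insensitive penalty exactly.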

4 Existing approaches and related works

Most research in Anomaly Detection (AD) can be broadly categorized as adopting either traditional (shallow) or the more modern deep learning based approaches. Traditional approaches generally adopt parametric or non-parametric statistical modeling, spectral based modeling, or classification based modeling (Chandola et al., 2009). Typical examples include PCA based methods (Jolliffe, 2002; Hoffmann, 2007), proximity based methods (Knorr et al., 2000; Ramaswamy et al., 2000), tree-based methods like Isolation Forest (IF) (Liu et al., 2008), or classification based OC-SVM (Schölkopf et al., 2001), Support Vector Data Description (SVDD) (Tax & Duin, 2004), etc. These techniques provide good performance for an optimally tuned feature map. However, for complex domains like vision or speech, where designing optimal feature maps is non-trivial, such approaches perform sub-optimally. A detailed survey of these approaches is available in Chandola et al. (2009).

In contrast, for the modern deep learning based approaches, extracting the optimal feature map is imbibed in the learning process. Broadly, there are three main sub-categories of deep learning based AD. First, the Deep Auto Encoder and its variants like DCAE (Masci et al., 2011; Makhzani & Frey, 2014) or ITAE (Huang et al., 2019), etc. Here, the aim is to build an embedding where the normal samples are correctly reconstructed while the anomalous samples exhibit high reconstruction error. The second type of approach adopts Generative Adversarial Network (GAN)-based techniques like AnoGAN (Schlegl et al., 2017), GANomaly (Akcay et al., 2018), EGBAD (Zenati et al., 2018), CBiGAN (Carrara et al., 2020), etc. These approaches typically focus on generating additional samples which follow a similar distribution as the training data. This is followed by designing an anomaly score to discriminate between normal and anomalous samples. Finally, the third category consists of the more recent one class classification based approaches like Deep SVDD (Ruff et al., 2018), DROCC (Goyal et al., 2020), etc. These approaches solve a one class loss function tailored to deep architectures. All of the above approaches, however, adopt an unsupervised inductive learning setting. There is a newer class of classification based paradigms which adopt semi- or self-supervised formulations. Typical examples include GOAD (Bergman & Hoshen, 2020), SSAD (Ruff et al., 2019), ESAD (Huang et al., 2020), etc. However, such approaches use fundamentally different problem settings (like a multi class problem for GOAD), or have different assumptions about the additional data available.

Learning with disjoint auxiliary (DA) data: A recently popularized learning setting assumes the availability of additional auxiliary data which is disjoint from the test set. The underlying assumption is that these auxiliary samples may or may not follow the same distribution as the test data. This idea was first introduced in Dhar (2014) (see Sect. 4.3) and misconstrued as Universum learning. Note that the notion of universum samples was originally introduced to act as contradictions to the concept classes in the test set (Vapnik, 2006). The above assumption does not adhere to this notion and violates the true essence of Universum learning. This setting has recently been used to propose ‘outlier exposure’ in Hendrycks et al. (2018) and its variants (Ruff et al., 2021; Goyal et al., 2020). A more advanced variation of this setup generates the anomalous samples through perturbation (Cai & Fan, 2022) or through distribution-shifting transformations (Tack et al., 2020) and uses contrastive losses. Our learning from contradictions setting differs from the above methods in the following aspects,

  • (Problem setting) The problem setting is different. While the above setting only assumes auxiliary data disjoint from the test data's concept classes (‘normal’ and ‘anomalous’ samples), Universum learning follows a different assumption: the concept classes of the universum data are different from both the normal as well as the anomalous samples. This assumption is quintessential for proving Prop. 2, which in turn provides the optimality constraint on the decision function (in Eq. 5). Prop. 2 is not possible in the DA setting.

  • (Formulation) The difference in problem setting is also clear from the formulations. For example, the formulations proposed under the disjoint auxiliary setting, like (Dhar, 2014), Outlier Exposure (OE) (Hendrycks et al., 2018), DROCC-LF (OE) (Goyal et al., 2020), etc., only use the relation between the in-lier training data and the additional auxiliary data. No information about the relation between the auxiliary data and the anomalous samples in the test set is encoded in the loss function. In essence, such approaches control the complexity of the hypothesis class by constraining the space in which ‘normal’ samples can lie. In contrast, Universum learning assumes different concept classes for the Universum versus both the normal and anomalous (test) samples. This information is encoded through the proof of Prop. 2. The Universum setting controls the complexity of the hypothesis class by constraining the space in which both ‘normal’ and ‘anomalous’ samples can lie.

In short, Universum learning adopts a different learning paradigm (see Definition 2) compared to the ‘disjoint auxiliary data’ setting. Different from the existing ‘disjoint auxiliary’ based loss functions in Dhar (2014), OE (Hendrycks et al., 2018), DROCC-LF (OE) (Goyal et al., 2020), etc., the Universum samples (in Eq. (6)) implicitly contradict the unseen anomalous test samples. A more pedagogical explanation of the differences between these settings, with examples, is provided in “Appendix C.1”. However, similar to the DA/OE settings, Universum learning can complement other advanced learning settings. To highlight this, we extend the adversarial based DROCC-LF algorithm under the universum setting in Algorithm 1 and compare its performance against its OE based extension DROCC-LF (OE) (introduced in Goyal et al. (2020)). Here, for DROCC-LF (univ) we replace the binary cross entropy loss used in Goyal et al. (2020) with the universum loss in Eq. (6) (see step 3 in Algo. 1). We use the same notation as Goyal et al. (2020).

Algorithm 1 DROCC-LF (univ)

Input: Training (normal) samples \({\mathcal {T}}=({\textbf{x}}_i,y_i = +1)_{i=1}^n\) and Universum samples \({\mathcal {U}} = ({\textbf{x}}_{i^\prime }^{*})_{i^\prime =1}^m\).

Parameters: Radius r, \(\lambda \ge 0\), \(\mu \ge 0\), step-size \(\eta \), number of gradient steps \(m_g\), number of initial training steps \(n_0\).

Initial steps: For \(B = 1, \ldots n_0\)

   Batch of training (\(X_T\)) and universum (\(X_U\)) samples

   \(\theta = \theta - \nabla \Big (\sum \limits _{\begin{array}{c} {\textbf{x}}_i \in X_T \end{array}} L_T(f({\textbf{x}}_i)) + \sum \limits _{\begin{array}{c} {\textbf{x}}_{i^\prime }^{*} \in X_U \end{array}} L_U(f({\textbf{x}}_{i^\prime }^{*})) \Big ) \)

DROCC steps: For \(B = n_0, \ldots n_0 + N\)

   \(X_T\): Batch of normal training inputs (\(y=+1\))

   \(\forall x \in X_T: h \sim {\mathcal {N}}(0, I_{d})\)

Adversarial search: For \(i = 1, \ldots m_g\)

   1. \(L_T(h) = L_T(f(x + h), -1)\)

   2. \(h = h + \eta \frac{\nabla _h L_T(h)}{\Vert \nabla _h L_T(h) \Vert }\)

   3. \(h =\) Projection given by Prop.1 in Goyal et al. (2020)

\(\ell ^{itr} = \lambda \Vert {\textbf{w}} \Vert ^2 + \sum \limits _{\begin{array}{c} {\textbf{x}}_i \in X_T \end{array}} L_T(f({\textbf{x}}_i)) + \sum \limits _{\begin{array}{c} {\textbf{x}}_{i^\prime }^{*} \in X_U \end{array}} L_U(f({\textbf{x}}_{i^\prime }^{*})) +\mu L_T(f(x + h), -1) \)

\(\theta = \theta - \nabla \ell ^{itr}\)
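A heavily simplified, runnable sketch of one DROCC-LF (univ) iteration for a linear model \(f({\textbf{x}}) = {\textbf{w}}^\top {\textbf{x}}\) is given below. The subgradients are written out by hand, and the projection of Prop. 1 in Goyal et al. (2020) is replaced by a simple norm clipping; all names and defaults are illustrative, not the authors' implementation.

```python
import numpy as np

def drocc_lf_univ_step(w, X_T, X_U, r=1.0, mu=1.0, lam=1e-3, eta=0.1, m_g=5,
                       lr=0.01, delta=0.0, rng=None):
    """One simplified DROCC-LF (univ) step for f(x) = w^T x (illustrative sketch).
    X_T: batch of normal samples; X_U: batch of universum samples."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Adversarial search: ascend the negative-class hinge [1 + f(x + h)]_+ in h.
    h = rng.normal(size=X_T.shape)
    for _ in range(m_g):
        active = (1.0 + (X_T + h) @ w > 0).astype(float)[:, None]
        g = active * w                                  # subgradient w.r.t. h
        h = h + eta * g / (np.linalg.norm(g) + 1e-12)   # normalized ascent step
        norms = np.linalg.norm(h, axis=1, keepdims=True)
        h = h * np.minimum(1.0, r / (norms + 1e-12))    # simplified projection ||h|| <= r
    # Subgradient of l_itr = lam*||w||^2 + L_T + L_U + mu * L_T(f(x + h), -1).
    gT = -((1.0 - X_T @ w > 0).astype(float)[:, None] * X_T).sum(axis=0)
    s = X_U @ w
    gU = ((((s - 1.0) > delta).astype(float)
           - ((1.0 - s) > delta).astype(float))[:, None] * X_U).sum(axis=0)
    gA = mu * ((1.0 + (X_T + h) @ w > 0).astype(float)[:, None] * (X_T + h)).sum(axis=0)
    return w - lr * (2.0 * lam * w + gT + gU + gA)
```

The universum subgradient `gU` follows the two-hinge decomposition of Eq. (10): each universum sample pushes the score \(f({\textbf{x}}^*)\) toward the \(\Delta \)-insensitive band around the decision boundary.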

5 Empirical results

5.1 Standard benchmark on tabular datasets from Goyal et al. (2020)

First we provide results on several tabular datasets used in Goyal et al. (2020). The datasets involve standard anomaly detection problems, described below,

  • Abalone, used in Das et al. (2018): Here the task is to predict the age of abalone using several physical measurements like rings, sex, length, diameter, height, weight, etc. For this problem, classes 3 and 21 are anomalies, and classes 8, 9 and 10 serve as normal samples.

  • Arrhythmia used in Zong et al. (2018): Here the task is to identify arrhythmic samples using ECG features. We follow the same dataset preparation as Zong et al. (2018).

  • Thyroid used in Zong et al. (2018): The goal is to predict if a patient is hypothyroid based on his/her medical history. We follow the same dataset preparation as Zong et al. (2018).

For all the above datasets we use the dataset preparation code provided in Goyal et al. (2020), which implements the preprocessing and partitioning scheme used in previous works. We follow the same experiment setup and network architecture as in Goyal et al. (2020). We use the same baseline methods as in Goyal et al. (2020), and also provide results for the recent approach PLAD (Cai & Fan, 2022), which was proposed to stabilize the DROCC baseline.

Table 1 provides the results of DOC\(^3\) over 10 random partitions of the dataset. In each partition, we create the training/test data as in Goyal et al. (2020). Note, however, that unlike Goyal et al. (2020), we scale the data to the range \([-1, +1]\). In addition, we generate uniform noise in the range \([-1,+1]\) and use it as universum/contradiction samples. As seen from Table 1, DOC\(^3\) outperforms all existing approaches (except the adversarial DROCC (Goyal et al., 2020) and PLAD for the Thyroid data), and significantly improves (by 5–15%) upon the state-of-the-art results for the Arrhythmia and Abalone data. The optimal model parameters used for these results are provided in “Appendix B.1” (Table 7) for reproducibility. Note that throughout the paper we fix \(\Delta = 0\); for all our experiments we saw minimal improvement from tuning the \(\Delta \) parameter. This is also discussed in our ablation studies in “Appendix C.1.2”.
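The uniform-noise universum described above takes only a few lines of numpy to generate. This is an illustrative sketch: the feature dimension `d` and sample count `m` are placeholder choices, not values from the paper, and the training features are assumed to be already scaled to \([-1, +1]\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 100  # placeholder feature dimension and universum size
# Uniform noise in [-1, +1], matching the scaled range of the training features
X_univ = rng.uniform(-1.0, 1.0, size=(m, d))
```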

Table 1 F1-score ± standard deviation for one-vs-all anomaly detection on Thyroid, Arrhythmia, and Abalone datasets

5.2 Standard image benchmark datasets

5.2.1 CIFAR-10

Fig. 2: Random noise Universum (contradictions)

For our next set of experiments we use the standard image benchmark CIFAR-10 dataset (Ruff et al., 2018; Goyal et al., 2020). The data consists of \(32 \times 32\) colour images of 10 mutually exclusive classes, with 6000 images per class. The underlying task is one-vs-rest anomaly detection, where we build a one class classifier for each class and evaluate it on the test data of all 10 classes. Note that this dataset does not have any naturally occurring universum (contradiction) samples (following Def. 2). Hence, we create synthetic universum samples by randomly generating pixel values \(\sim {\mathcal {N}}(\mu ,\sigma )\), with \(\mu = 0, \; \sigma = 1\), where \({\mathcal {N}}\) is the normal distribution (see Fig. 2). The idea of generating synthetic universum (contradiction) samples has previously been studied for binary (Weston et al., 2006; Cherkassky et al., 2011; Sinz et al., 2008), multiclass (Zhang & LeCun, 2017; Dhar et al., 2019) and regression (Dhar & Cherkassky, 2017) problems. In this paper we use a similar mechanism for one class problems. Note that for the one-vs-rest AD problem, the generated universum samples belong to neither the ‘+1’ (normal) nor the ‘-1’ (anomalous) class used during testing (see Def. 2). The data is scaled to the range \([-1,+1]\).
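The synthetic universum generation above can be sketched as follows; the batch size `m` is an illustrative choice, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 16  # illustrative number of universum "images"
# 32x32 RGB noise images with pixels drawn i.i.d. from N(mu=0, sigma=1)
X_univ = rng.normal(loc=0.0, scale=1.0, size=(m, 3, 32, 32)).astype(np.float32)
```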

For this set of experiments we adopt a LeNet-like architecture used in Ruff et al. (2018), Goyal et al. (2020). The detailed architecture specifics are provided in “Appendix B.1.1”. Note that this paper focuses on the design and analysis of the DOC\(^3\) loss (Eq. 6). Rather than adopting a state-of-the-art network architecture optimized for the specific dataset, we take a systematic approach to isolate the effectiveness of the proposed loss by using a basic LeNet architecture similar to Ruff et al. (2018), Goyal et al. (2020). This avoids secondary generalization effects encoded in more advanced architectures. To that end, the approaches in Ruff et al. (2018), Goyal et al. (2020) and DOC (Hinge in Eq. (3)) serve as the main baselines. In addition, for a more thorough comparison, we also provide results for DOC extended under the disjoint auxiliary (DA), a.k.a. Outlier Exposure (OE), setting. For that, we treat the additional universum samples as belonging to the negative class, following Goyal et al. (2020).

Table 2 Average AUC (with standard deviation) for one-vs-rest anomaly detection on CIFAR-10

Table 2 provides the average ± standard deviation of the AUC under the ROC curve over 10 runs of the experiment. Here, we report the results of the best performing DOC (Hinge in (3)) model selected over the range of parameters \(\lambda = 1/2C = [1.0, 0.5]\), and that of DOC\(^3\) over the range of parameters \(\lambda = 1/2C = [0.1, 0.05], C_{U}/C = [1.0, 0.5]\). We fix \(\Delta = 0\). A more detailed discussion on model selection and the selected model parameters is provided in “Appendix B.3” for reproducibility. Note, however, that our results for the DROCC algorithm differ from those reported in Goyal et al. (2020): re-running the code provided in Goyal et al. (2020) did not reproduce the results reported in that paper (especially for ‘Ship’). Moreover, their implementation normalizes the data using mean \(\mu = (0.4914, 0.4822, 0.4465)\) and standard deviation \(\sigma = (0.247, 0.243, 0.261)\). These values are calculated using the data from all the classes, which is not available when training on a single class. To avoid such inconsistencies, we instead normalize using mean \(\mu = (0.5, 0.5, 0.5)\) and standard deviation \(\sigma = (0.5, 0.5, 0.5)\). Such scaling does not require a priori information about the other classes’ pixel values, and maps the data to the range \([-1,+1]\). Detailed discussions on reproducing the results of the deep learning algorithms Deep-SVDD (Ruff et al., 2018) and DROCC (Goyal et al., 2020) are provided in “Appendix C.2” (see Tables 18, 19 and 20). As seen from Table 2, DOC\(^3\) (using the noise universum) provides a significant improvement of \(\sim \) 5–15% (and up to \(30 \%\) for ‘Bird’) over its inductive counterpart (DOC). In addition, DOC\(^3\) in most cases outperforms DOC (DA/OE). This illustrates the advantage of extending anomaly detection problems following Def. 2, in accordance with Prop. 2.
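The class-agnostic normalization argued for above has a simple closed form: with per-channel mean 0.5 and standard deviation 0.5, pixel intensities in \([0,1]\) are mapped linearly to \([-1,+1]\) without any statistics from the (unavailable) other classes. A minimal check:

```python
import numpy as np

def normalize(x, mean=0.5, std=0.5):
    # (x - 0.5) / 0.5 maps [0, 1] linearly onto [-1, +1]
    return (x - mean) / std

x = np.linspace(0.0, 1.0, 5)  # pixel intensities in [0, 1]
z = normalize(x)
```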

Next, we show the effectiveness of extending the advanced adversarial DROCC-LF method under the universum setting over the OE based setting used in Goyal et al. (2020). The major difference is that the auxiliary data now serves as universum samples and the loss function follows Eq. (6) (see Algo. 1). For DROCC-LF (OE) we use the same implementation as in Goyal et al. (2020). Additionally, we replace the relu operator \([x]_{+}\) with the softplus operator in the loss functions.

For our experiments, we adopt the same LeNet architecture used in Ruff et al. (2018), Goyal et al. (2020) (see “Appendix B.1.1”, Fig. 4). We run the experiments over 10 runs and report the best AUC over the range of parameters recommended in Goyal et al. (2020) (Sect. 5), i.e., learning rate \(= 10^{-4}\), and radius \(r\) (in the range of \(\sqrt{d}\)) in \(\{ 8.0, 16.0, 32.0\}\). For both methods we use Adam, and fix the number of ascent steps to 10, the batch size to 256, and the total number of epochs to 350. The remaining parameters are set to their default values.

Table 2 provides the average ± standard deviation of the AUC under the ROC curve over 10 runs of the experiment. We also provide results for the standard DROCC (without any auxiliary data) (Goyal et al., 2020) and the more recent PLAD (an adversarial approach introduced to improve DROCC) (Cai & Fan, 2022) as baselines. As seen from Table 2, DROCC-LF (univ) significantly outperforms the DROCC-LF (OE) method, by up to 30% (‘Dog’) in some cases. Further, DROCC-LF (univ) outperforms the baseline algorithms for all cases except ‘Cat’. The final optimal parameters selected for the different classes are provided in “Appendix B.3”.

5.2.2 Fashion-MNIST (F-MNIST)

For our next set of experiments we use another standard image benchmark dataset, F-MNIST (Xiao et al., 2017). The data consists of \(28 \times 28\) grayscale images from Zalando’s fashion product database, comprising 10 mutually exclusive classes (product lines) with 60,000 training and 10,000 test samples. The underlying task is one-vs-rest anomaly detection, where we build a one class classifier for each class and evaluate it on the test data of all 10 classes. As before, this dataset does not have any naturally occurring universum (contradiction) samples (following Def. 2). Hence, we use synthetically generated universum samples drawn from \({\mathcal {N}}(\mu ,\sigma )\), with \(\mu = 0, \; \sigma = 1\), where \({\mathcal {N}}\) is the normal distribution. The data is scaled to the range \([-1,+1]\).

We adopt the same network and experiment set-up used in Cai and Fan (2022) (see Fig. 5 in “Appendix B.1.2”). As before, we provide the results for DOC and its extension under disjoint auxiliary (DA)/Outlier Exposure (OE) settings. We also provide the baseline results from Cai and Fan (2022). Table 3 provides the average ± standard deviation of the AUC under the ROC curve over 10 runs of the experiment. A more detailed discussion on model selection and the selected model parameters is provided in “Appendix B.4” (see Tables 12, 13 and 14) for reproducibility.

As seen from Table 3, DOC\(^3\) (using the noise universum) provides a significant improvement of \(\sim \) 5–20% over its inductive counterpart (DOC). In addition, DOC\(^3\) in most cases outperforms DOC (DA/OE). This further consolidates the advantage of extending anomaly detection problems under the universum setting.

Table 3 Average AUC (with standard deviation) for one-vs-rest anomaly detection on F-MNIST

We also illustrate the effectiveness of extending the advanced adversarial DROCC-LF method under the universum setting over the OE based setting used in Goyal et al. (2020). We provide the results of the DROCC algorithm as a baseline. In addition, we also provide results for the recent PLAD (Cai & Fan, 2022) algorithm, which adopts a generative adversarial learning approach proposed to stabilize DROCC. Note, however, that the reported results are from our re-run of the PLAD algorithm; we found several caveats in its code implementation and report these discrepancies in “Appendix C.3”.

As seen from Table 3, DROCC-LF (univ) outperforms the DROCC-LF (OE) method and beats the baseline algorithms for all the classes. The final optimal parameters selected for the different classes are provided in “Appendix B.4” (see Table 15).

5.3 Visual inspection using real-life MV-Tec AD data

For our final set of experiments we tackle the more realistic problem of visual inspection based anomaly detection in manufacturing lines. With the recent advancements in deep learning technologies, there has been increased interest in automating manufacturing lines by adopting AI driven solutions that provide automated visual inspection of product defects (Bergmann et al., 2019; Huang & Pan, 2015). One popular benchmark dataset for such problems is the MV-Tec AD dataset (Bergmann et al., 2019).

5.3.1 Data set and experiment setup

The MV-Tec AD dataset contains 5354 high-resolution color images of different industrial object and texture categories. For each category it contains normal (defect-free) images used for training. The test data contains both normal and anomalous (defective) product images. The anomalies manifest themselves as over 70 different types of defects, such as scratches, dents, contamination, and various other structural changes. The goal in this paper is to build one class image-level classifiers for the texture categories (see Table 4). We use the original data scale of \([0,1]\). Further, to simplify the problem we resize all images to \(64 \times 64\) pixels. Note that, for the current analysis, we only use the texture classes containing RGB images.

For this problem we have naturally occurring universum (contradiction) samples in the form of the object images or the other texture types. That is, for the goal of building a one class classifier for ‘carpet’, all the ‘other textures’ (leather, tile, wood) or the ‘objects’ (bottle, cable, capsule, hazelnut, metal nut, pill, transistor) available in the dataset can serve as universum (contradiction) samples. This is in line with the problem setting in Def. 2, where such samples are neither ‘normal’ nor ‘anomalous’ (defective) carpet samples. For our experiments, we use three types of universum:

  • Noise: Similar to the previous experiments, we generate random noise as universum samples. Since the data is already scaled to the range \([0,1]\), we generate \(64 \times 64\) images whose pixel values are drawn from a uniform distribution \({\mathcal {U}}(0,1)\).

  • Objects: This type of universum contains all the images in the object categories with RGB pixels, viz. bottle, cable, capsule, hazelnut, metal nut, pill, transistor. Note that we include both the normal and the defective samples for these objects.

  • Other Textures: Here we use the remaining texture images as universum. That is, if the goal is to build a one class classifier for ‘carpet’, we use the images from the other ‘textures’ (leather, tile, wood) as universum. We include both the normal and the defective samples in the universum set.
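Of the three universum types above, the ‘Noise’ type is fully synthetic and can be sketched in a few lines; the sample count `m` below is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(7)
m = 8  # illustrative number of noise universum images
# 64x64 RGB images with pixels drawn from U(0, 1), matching the data's [0, 1] scale
X_univ = rng.uniform(0.0, 1.0, size=(m, 3, 64, 64)).astype(np.float32)
```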

As before, we adopt a LeNet-like architecture (schematic representation in Fig. 3, details in “Appendix B.1.3”, Fig. 6). Note that there have been a few recent works proposing advanced architectures to achieve state-of-the-art performance on this data (Carrara et al., 2020; Huang et al., 2019). However, the main focus here is to isolate the effectiveness of DOC\(^3\), and hence we mainly compare against the DOC and DOC (OE) baselines using a simple LeNet network. Since our baselines DOC and DOC (OE) using LeNet have not previously been reported on this data, as a sanity check we also include the results from Massoli et al. (2020) for comparison with different classes of algorithms. Also, we adopt a slight modification to our loss function: rather than using the relu function \([x]_+\) in Eqs. (3) and (6) for the training samples, we use a softplus operator. We see improved results with this modification. Note that softplus is a dominating surrogate loss over relu, and hence Theorem 1 still holds.
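The domination claim for softplus is easy to verify numerically: \(\log (1 + e^x) \ge \max (x, 0)\) for every real \(x\), so any upper bound on the softplus loss also bounds the relu loss, and the generalization bound of Theorem 1 carries over. A quick numpy check:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softplus(x):
    # log(1 + e^x), computed stably via log1p
    return np.log1p(np.exp(x))

# softplus dominates relu pointwise over a dense grid
x = np.linspace(-10.0, 10.0, 1001)
dominates = np.all(softplus(x) >= relu(x))
```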

Table 4 MVTec-AD dataset

5.3.2 Performance comparison results

Table 5 AUC for MVTec-AD (Texture) data

Table 5 provides the results over 10 runs of our experiments. We provide the average ± standard deviation of the AUC values for the DOC, DOC (DA/OE) and DOC\(^3\) algorithms. In addition, we also provide the best AUC obtained for each algorithm over these 10 runs. Additional details on model selection and the optimal hyperparameters are provided in “Appendix B.5”. As seen in Table 5, the DOC\(^3\) algorithm provides a significant improvement over DOC; depending on the type of universum, typical improvements range up to \(> 50 \%\). In addition, DOC\(^3\) provides consistent improvements over the DOC (DA/OE) algorithm. In all, these results further consolidate the utility of DOC\(^3\) under the universum setting (Def. 2). Separately, Table 5 also provides the baseline results available in Massoli et al. (2020). Note that these results are obtained using advanced network architectures adapted for the MVTec data, and are not averaged over multiple runs. Hence, we compare them with the best AUC obtained for DOC, DOC (DA/OE) and DOC\(^3\) over 10 runs. As seen from Table 5, DOC\(^3\) improves upon the ‘Carpet’ and ‘Leather’ results using the ‘Objects’ universum. Further, it achieves comparable performance for the ‘Wood’ and ‘Tile’ textures using the ‘Noise’ and ‘Objects’ universum, respectively. Achieving improved performance over the baseline algorithms, even with a basic LeNet architecture, reflects very positively on the proposed DOC\(^3\) algorithm.

5.3.3 Understanding DOC\(^3\) performance using Theorem (2)

For our final set of experiments we try to understand the workings of the DOC\(^3\) algorithm in connection with the correlation \(\Sigma (\infty )\) (in Theorem 2). Table 6 reports the correlation values between the training and universum samples using the ‘RAW’ pixels and the DOC and DOC\(^3\) solutions’ feature maps. For the feature map we use the CNN features shown in Fig. 3. The DOC\(^3\) solutions represent the models estimated using the training data (in column 1) and the respective universum data (in column 2). As seen from the results, the DOC solution yields a high correlation \(\Sigma (\infty )\) between the training and universum samples; in essence, the DOC solution sees the training and universum samples similarly. This is not desirable, as the universum samples follow a different distribution than the training samples. On the contrary, DOC\(^3\) provides a solution where the correlation between the training and universum samples is significantly reduced. This is in line with Theorem 2’s analysis (Sect. 3.2), where we argued that DOC\(^3\) searches for a solution with low \(\Sigma (\infty )\) between the training and universum samples (in feature space), and by doing so ensures a lower ERC and improved generalization compared to DOC (confirmed empirically in Table 5). Another interesting observation is that for the ‘other textures’ universum type, which has high raw pixel correlation values (\(\sim 0.9\)), DOC\(^3\) provides only limited improvement. Such universum types are too similar to the training data, and act as ‘bad’ contradictions.
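To make the discussion concrete, the sketch below computes a simple proxy for the training/universum correlation: the largest absolute cosine similarity between any pair of training and universum feature vectors. This is only an illustrative stand-in; the exact definition of \(\Sigma (\infty )\) is given with Theorem 2 (Sect. 3.2) and is not reproduced here.

```python
import numpy as np

def max_abs_cosine(F_train, F_univ, eps=1e-12):
    """Largest |cosine similarity| between rows of two feature matrices.

    F_train: (n, k) training feature map, F_univ: (m, k) universum feature map.
    """
    A = F_train / (np.linalg.norm(F_train, axis=1, keepdims=True) + eps)
    B = F_univ / (np.linalg.norm(F_univ, axis=1, keepdims=True) + eps)
    return float(np.abs(A @ B.T).max())

# Identical feature maps -> correlation ~1 (the undesirable DOC regime);
# orthogonal feature maps -> correlation ~0 (the regime DOC^3 seeks).
F = np.array([[1.0, 0.0], [0.0, 1.0]])
high = max_abs_cosine(F, F)
low = max_abs_cosine(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
```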

Fig. 3: Schematic representation of the network used for the MVTec-AD results in Table 5

Table 6 Average ± standard deviation of correlation \(\Sigma (\infty )\) between training and universum over 10 runs

6 Future research

Broadly, there are two major future research directions.

Model selection This is a generic issue for any (unsupervised) one class based anomaly detection formulation, and it is further complicated by the non-convex loss landscape of deep learning problems. For DOC\(^3\) we simplify model selection by fixing \(\Delta = 0\) and optimally tuning \(C_U\). However, the success of DOC\(^3\) heavily depends on careful tuning of its hyperparameters. In the absence of a validation set containing both ‘normal’ and ‘anomalous’ samples, we follow the current norm of reporting the best model’s results over a small subset of hyperparameters; but this is far from practical. We believe our Theorem 1 provides a good framework for bound based model selection. This, in conjunction with Theorem 2 and the recent works on ERC for deep architectures (Neyshabur et al., 2015; Sokolic et al., 2016), may provide better mechanisms for model selection and yield optimal models.

Selecting ‘good’ universum samples The effectiveness of DOC\(^3\) also depends on the type of universum used. Our analysis in Sect. 5.3.3 provides some initial insights into the workings of DOC\(^3\) and how to loosely identify ‘bad’ contradictions. Additional analysis, possibly in line with the Histogram of Projections (HOP) technique introduced in Cherkassky et al. (2011), Dhar et al. (2019), is needed to improve our understanding of ‘good’ universum samples. This is an open research problem.

7 Conclusions

This paper introduces the notion of learning from contradictions for deep one class classification and proposes the DOC\(^3\) algorithm. By deriving its Empirical Rademacher Complexity (ERC), DOC\(^3\) is shown to provide improved generalization over DOC, its inductive counterpart. We empirically demonstrate the effectiveness of the proposed formulation and connect the results to our theoretical analysis. Finally, we also discuss the limitations and future research directions.