
1 Introduction

Deep convolutional neural networks (CNNs) are one of the basic architectures in deep learning, and they have achieved great success in modern computer vision tasks. However, the over-confidence issue on out-of-distribution (OOD) data has long plagued CNNs and seriously harms their generalization performance. Previous research has shown that neural networks generalize well when the test data is drawn i.i.d. from the same distribution as the training data, i.e., the in-distribution (ID) data. However, when deep learning models are deployed in an open world scenario, the input samples can be OOD data and therefore should be handled cautiously.

Generally, there are two major challenges in improving the robustness of models: adversarial examples and OOD examples. As pointed out in [10], adding very small perturbations to the input can fool a well-trained classification net; these modified inputs are the so-called adversarial examples. Another problem is how to detect OOD examples that are drawn far away from the training data. Trained neural networks often produce very high confidence on such OOD samples, the so-called over-confidence issue [28], which has raised concerns for AI safety [4] in many applications. As shown in Fig. 1 (a), a trained ResNet18 is used to extract features from the MNIST dataset, and the blue points indicate feature representations of ID data. It can be seen that almost the whole feature space is assigned a high confidence score, while the ID data concentrates densely in a few narrow regions.

Fig. 1.

Over-confidence issue in typical classification nets. (a): A ResNet18 trained on MNIST. The number of neurons in its penultimate layer is set to 2 for feature visualization. The blue points are feature representations of ID data. The background color represents the confidence score given by the ResNet18. The region far from the blue points gets a high confidence score. (b): Classification on two Gaussian distributions with an MLP. The colored points are training data. The classification net gives OOD regions high confidence, which is abnormal. (c): Boundary aware learning (BAL) gives ID regions much higher confidence than OOD regions. More visualization results are shown in the Appendix Fig. 7. (Color figure online)

Previous studies have proposed different approaches for detecting OOD samples to improve the robustness of classifiers. In [12], a max-softmax method is proposed for identifying OOD samples. Further, in ODIN [25], temperature scaling and input pre-processing are introduced to improve the confidence scores of ID samples. In [38], convolutional prototype learning is proposed for image classification, showing effectiveness in OOD detection and class-incremental learning. In [7], it is pointed out that the outputs of softmax cannot actually represent the confidence of a neural net, and thus a separate branch is added for independent confidence estimation. All these previous works have brought different perspectives and inspirations for solving open world recognition tasks. However, these methods pay limited attention to the learning of OOD features, which is a key factor in OOD detection. Neural networks can better detect OOD samples if they are supervised with trivial and hard OOD information, which is why we argue that OOD feature learning is important for OOD uncertainty estimation.

In this paper, we attribute the poor OOD detection performance of traditional classification networks to the fact that they cannot perceive the boundary of ID data due to the lack of OOD supervision, as illustrated in Fig. 1 (a) and (b). Consequently, this paper focuses on how to generate synthetic OOD information that supervises the learning of classifiers. The key idea of our proposed boundary aware learning (BAL) is to use a generator to gradually produce synthetic OOD features, from trivial to hard. At the same time, a discriminator is trained to distinguish ID and OOD features. Powered by this adversarial training phase, the discriminator can well separate ID and OOD features. The key contributions of this work can be summarized as follows:

  • A boundary aware learning framework is proposed for improving the rejection ability of neural networks while maintaining the classification performance. BAL can easily be combined with mainstream CNN architectures.

  • We use a GAN to learn the distribution of OOD features adaptively, step by step, without introducing any assumptions about the distribution of ID features. In addition, we propose an efficient method called the Representation Sampling Module (RSM) to sample synthetic hard OOD features.

  • We test the proposed BAL on several datasets with different CNN architectures. The results suggest that BAL significantly improves OOD detection performance, achieving state-of-the-art results and allowing more robust classification in the open world scenario.

2 Related Work

OOD Detection with Softmax-Based Scores. In [12], a baseline approach for detecting OOD inputs named max-softmax is proposed, and the metrics for evaluating OOD detectors are properly defined. Following this, and inspired by [10], ODIN [25] and generalized ODIN [15] improve the detection ability of max-softmax using temperature scaling, input pre-processing, and confidence decomposition. The studies in [3, 24] argue that the feature maps from the penultimate layer of neural networks are not suitable for detecting outliers; instead, they use features from a well-chosen layer and adopt metrics such as Euclidean distance, Mahalanobis distance, and OSVM [34]. In [7], a separate branch is used for confidence regression, since the outputs of softmax cannot well represent the confidence of neural networks. More recently, GradNorm [17] finds that the magnitude of gradients is higher for ID data than for OOD data, making it informative for OOD detection. In [26], an energy score derived from discriminative models is used for OOD detection, which also brings some improvement.

OOD Detection with Synthetic Data. These methods usually use ID samples to generate fake OOD samples and then train a \((C+1)\)-way classifier, which can improve the rejection ability of neural nets. [35] treats OOD samples as two types: samples that are close to but outside the ID manifold, and samples that lie on the ID boundary. This work uses a Variational AutoEncoder [33] to generate such data for training. In [23], the authors argue that samples lying on the boundary of the ID manifold can be treated as OOD samples, and they use a GAN [9] to generate these data. The proposed joint training of a confident classifier and an adversarial generator inspires our work. However, the methods mentioned above are only suitable for small toy datasets, and the joint training method harms the classification performance of neural nets. Further, the study in [6] points out that an AutoEncoder can reconstruct ID samples with much less error than OOD examples, allowing more effective detection when reconstruction error is taken into consideration. Very recently, VOS [8] introduces OOD detection into object detection tasks, and its main focus is still OOD feature generation. In these previous works, the features of each category from the penultimate layer of a CNN are assumed to follow a multivariate Gaussian distribution. We argue and verify that this assumption is not reasonable. Our proposed BAL uses a GAN to learn the OOD distribution adaptively without making such assumptions, and the experimental results show that BAL significantly outperforms methods based on the Gaussian assumption.

Improving Detection Robustness with Model Ensembles. In [21], the authors randomly initialize different parameters for neural networks, and the bagging sampling method is used to generate training data. Similarly, in [31], features from different layers of neural networks are used to identify OOD samples; the higher-order Gram matrices defined in this work yield better OOD detection performance. More recently, [32] converts the labels of training data into different word embeddings using GloVe [29] and FastText [18] as supervision to gain diversity and redundancy; this semantic structure improves the robustness of neural networks.

OOD Detection with Auxiliary Supervision. In [30], the authors argue that the likelihood score is heavily affected by population-level background statistics, and thus they propose a likelihood ratio method to deal with background and semantic targets in image data. In [14], the study finds that self-supervision can benefit the robustness of recognition tasks in a variety of ways. In [40], a residual flow method is proposed for learning the distribution of the feature space of a pre-trained deep neural network, which can help to detect OOD examples. The latest work in [36] divides OOD samples into near-OOD and far-OOD samples, and argues that contrastive learning can capture much richer features, improving performance in detecting near-OOD samples. In [13], the authors use auxiliary datasets serving as OOD data to improve the anomaly detection ability of neural networks. Generally, these methods use some prior information to supervise the learning of the OOD detector.

3 Preliminaries

Problem Statement. This work considers the problem of separating ID and OOD samples. Suppose \(P_{in}\) and \(P_{out}\) are the distributions of ID and OOD data, and \(X=\{x_1,x_2,...,x_N\}\) are images randomly sampled from these two distributions. The task is to assign lower confidence to images \(x_i\) sampled from \(P_{out}\) and higher confidence to those from \(P_{in}\). Typically, OOD detection can be formulated as a binary classification problem: with a chosen threshold \(\gamma \) and confidence score S(x), an input is judged as OOD if \(S(x) < \gamma \) and as ID otherwise. Figure 2 (a) shows that traditional classifiers cannot capture the OOD uncertainty and, as a result, produce over-confident predictions on OOD data. Figure 2 (b) shows an ideal case where ID data gets a higher score than OOD data. Methods that aim to boost the performance of OOD detection must not explicitly use any data labeled as OOD.
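As a concrete illustration, the following minimal Python sketch implements this decision rule; choosing \(\gamma \) so that 95% of ID validation samples are retained is one common convention rather than a requirement, and all names here are illustrative:

```python
import numpy as np

def choose_gamma(id_scores: np.ndarray, tpr: float = 0.95) -> float:
    """Pick gamma so that `tpr` of ID validation samples score above it.

    Only ID data is needed, respecting the constraint that no data
    labeled as OOD may be used.
    """
    return float(np.quantile(id_scores, 1.0 - tpr))

def is_ood(scores: np.ndarray, gamma: float) -> np.ndarray:
    """Binary decision rule: an input is flagged as OOD when S(x) < gamma."""
    return scores < gamma
```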

Fig. 2.

Confusion between ID and OOD data. (a): In typical classifiers, ID and OOD data are confused, and both get very high confidence scores. (b): With the proposed BAL, OOD data is assigned much lower confidence, allowing more effective OOD detection.

Methodology. For a given image x, its corresponding feature representation f can be obtained from the penultimate layer of a pre-trained neural network. Based on the law of total probability, we have:

$$\begin{aligned} \begin{aligned} P(w|f)=P(w|f\in \mathcal {M}_f)\cdot P(f\in \mathcal {M}_f|f)\\ + P(w|f\notin \mathcal {M}_f)\cdot P(f\notin \mathcal {M}_f|f) \end{aligned} \end{aligned}$$
(1)

where w is the category label of ID data, and \(\mathcal {M}_f\) represents the manifold of ID features. Typical neural networks have no access to OOD data; therefore, the softmax output is actually the conditional probability under the assumption that the inputs are ID data, i.e., \(P(w|f\in \mathcal {M}_f)\). Empirically, since OOD data has quite different semantic meanings from ID data, it is reasonable to approximate \(P(w|f\notin \mathcal {M}_f)\) as 0. Then, we have:

$$\begin{aligned} P(w|f)\approx P(w|f\in \mathcal {M}_f)\cdot P(f\in \mathcal {M}_f|f) \end{aligned}$$
(2)

This tells us that the posterior can be approximated as the product of the output of the pre-trained classifier and the probability that f belongs to \(\mathcal {M}_f\). The proposed BAL aims to estimate \(P(f\in \mathcal {M}_f|f)\) using features from the penultimate layer of the pre-trained CNN.
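In code, the inference step implied by Eq. (2) might look like the following sketch; conditioning the discriminator on the predicted class label is an assumption of this illustration, and the function names are placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def bal_score(head, discriminator, f):
    """Approximate the posterior of Eq. (2) for a batch of features f.

    `head` is the frozen top classification layer of the pre-trained net;
    `discriminator` returns D(f; c) in [0, 1]. Conditioning D on the
    predicted class is an assumption of this sketch.
    """
    probs = F.softmax(head(f), dim=1)             # P(w | f in M_f)
    conf, pred = probs.max(dim=1)                 # max-softmax and predicted class
    in_dist = discriminator(f, pred).squeeze(-1)  # P(f in M_f | f)
    return conf * in_dist                         # confidence score S(x)
```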

4 Boundary Aware Learning

Fig. 3.

The proposed BAL framework. The ID features are extracted from the pre-trained classifier. The trivial OOD features are uniformly sampled in feature space. The hard OOD features are generated using the FGSM method. All features except ID features are treated as OOD when training the discriminator. \(\mathcal {M}_f\) is the manifold of ID features. REM, RSM and RDM are the representation extraction module, representation sampling module and representation discrimination module respectively.

The proposed boundary aware learning framework contains three modules, as illustrated in Fig. 3. These modules handle the following problems: (I) Representation Extraction Module (REM): how to generate trivial OOD features to supervise the learning of the conditional discriminator; (II) Representation Sampling Module (RSM): how to generate synthetic hard OOD features to enhance the discrimination ability of the conditional discriminator step by step; (III) Representation Discrimination Module (RDM): how to make the conditional discriminator aware of the boundary of ID features.

4.1 Representation Extraction Module (REM)

This module handles the problem of how to generate trivial synthetic OOD features. As in prior works, we use the outputs of the penultimate layer of the CNN to represent the input images. In the following, \(\mathcal {H}\) and h denote the pre-trained classification net with and without the top classification layer, and \(\theta \) denotes the pre-trained weights. Formally, the feature f of an input image x is:

$$\begin{aligned} f=h(x;\theta ) \end{aligned}$$
(3)

During training, an image x and its corresponding label c are sampled from dataset \(\mathcal {X}\). We obtain an ID feature-label pair \(\left\langle f,c \right\rangle \) with Eq. (3). To generate trivial synthetic OOD features, we sample data uniformly in feature space. Given a batch of features \(\{f_1,f_2,f_3,...,f_k\}\), where each feature vector \(f_i\) has length m, we first calculate the minimal and maximal bounds in the m-dimensional space that contain all features within this batch. For \(j\in \{1,2,3,...,m\}\), we have:

$$\begin{aligned} R_{\min }^{(j)}=\min _{1\le i\le k} f_{i}^{(j)}, \quad R_{\max }^{(j)}=\max _{1\le i\le k} f_{i}^{(j)} \end{aligned}$$
(4)

Consequently, the batch-wise lower and upper bound of feature vectors are obtained as follows:

$$\begin{aligned} a=(R_{\min }^{(1)}, R_{\min }^{(2)},..., R_{\min }^{(m)})^T, \quad b=(R_{\max }^{(1)}, R_{\max }^{(2)},..., R_{\max }^{(m)})^T \end{aligned}$$
(5)

We use \(\mathbb {U}(a,b)\) to denote a batch-wise uniform distribution in feature space. A feature \(\hat{f}\) randomly sampled from \(\mathbb {U}(a,b)\) is treated as a negative sample with a randomly generated label \(\hat{c}\). The negative pair is expressed as \(\left\langle \hat{f},\hat{c} \right\rangle \). We give two reasons for uniform sampling: (a) It cannot be guaranteed that features from the penultimate layer of a CNN follow a multivariate Gaussian distribution, whether in a low-dimensional or a higher-dimensional feature space. To verify this, we set the penultimate layer of the CNN to have two or three neurons for feature visualization; the results shown in Fig. 4 indicate that this assumption is unreasonable. (b) ID features are densely distributed in a few narrow regions, which means most samples drawn by uniform sampling are OOD data. Conflicts may happen when \(\hat{f}\) is close to ID features and \(\hat{c}\) happens to match \(\hat{f}\); the RDM deals with these conflicts.
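The following minimal sketch illustrates this trivial OOD sampling, assuming the batch of penultimate-layer features arrives as a \(k \times m\) tensor; all names are illustrative:

```python
import torch

def sample_trivial_ood(features: torch.Tensor, num_classes: int):
    """Draw trivial OOD features from the batch-wise uniform distribution U(a, b).

    `features` is a (k, m) batch from the penultimate layer; a and b are
    the per-dimension bounds of Eqs. (4)-(5).
    """
    a = features.min(dim=0).values                   # Eq. (4), R_min
    b = features.max(dim=0).values                   # Eq. (4), R_max
    f_hat = a + (b - a) * torch.rand_like(features)  # f_hat ~ U(a, b)
    c_hat = torch.randint(num_classes, (features.size(0),),
                          device=features.device)    # random labels c_hat
    return f_hat, c_hat
```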

Fig. 4.

Feature distribution in the penultimate layer of a CNN. Left: Classification on MNIST with ResNet18; the penultimate layer has 2 neurons for visualization. Right: Same as the left, but the penultimate layer has 3 neurons. There is a large deviation between the distribution of ID features and a multivariate Gaussian. Moreover, it is clear that ID features are densely distributed in a few narrow regions of feature space.

4.2 Representation Sampling Module (RSM)

This module is used for generating hard OOD features. For a noise vector z sampled from the normal distribution \(P_z\), its corresponding synthetic ID feature f is obtained by \(G(z;c)\), where c is a conditional label. Since the generator G is trained to generate ID data, the feature f is much closer to ID than to OOD. With the Fast Gradient Sign Method (FGSM) [10], we push the feature f towards the boundary of the ID manifold, where it gets a much lower score from the discriminator:

$$\begin{aligned} \tilde{f}&=f-\epsilon \frac{\partial D(f;c)}{\partial f}\approx f-\epsilon \, \textrm{sgn}(\frac{\partial D(f;c)}{\partial f}) \end{aligned}$$
(6)
$$\begin{aligned} \tilde{z}&=z-\epsilon \frac{\partial D(f;c)}{\partial z}=z-\epsilon \frac{\partial D(G(z;c);c)}{\partial G(z;c)}\frac{\partial G(z;c)}{\partial z} \end{aligned}$$
(7)

where \(\tilde{f}\) represents an OOD feature that scatters in the low-density area of the ID feature distribution \(P_f\), and \(\tilde{z}\) can be used to generate OOD features via \(G(\tilde{z};c)\). In particular, we set \(\epsilon \) to be a random variable following a Gaussian distribution to improve the diversity of sampling. \(\left\langle \tilde{f},\tilde{c} \right\rangle \) is treated as a hard OOD feature pair because its quality grows with the adversarial training process.
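A sketch of this FGSM-style step (Eq. 6) is given below; the Gaussian parameters used for \(\epsilon \) here are illustrative placeholders rather than the settings used in our experiments:

```python
import torch

def sample_hard_ood(G, D, z, c, eps_mean=0.1, eps_std=0.02):
    """Push synthetic ID-like features toward the ID boundary (Eq. 6).

    G and D are the conditional generator and discriminator; eps is drawn
    from a Gaussian as in the paper, with illustrative mean/std values.
    """
    f = G(z, c)                                    # synthetic ID-like feature
    grad = torch.autograd.grad(D(f, c).sum(), f)[0]
    eps = torch.randn(f.size(0), 1, device=f.device) * eps_std + eps_mean
    f_tilde = (f - eps * grad.sign()).detach()     # step against D's gradient
    return f_tilde
```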

4.3 Representation Discrimination Module (RDM)

This module aims to make the discriminator aware of the boundary of ID features. The generator with FGSM is used to generate hard OOD representations, while the discriminator is used to separate ID and OOD features. The noise vector z is sampled from a normal distribution \(P_z\), and the features of training images from REM follow a distribution \(P_f\). To make the discriminator learn the boundary of ID data, we propose the shuffle loss and the uniform loss. The shuffle loss makes the discriminator aware of the category of each ID cluster in feature space, and the uniform loss makes the discriminator aware of the boundary of each ID feature cluster.

Shuffle Loss. From each batch of training data, we obtain feature-label pairs \(\left\langle f,c \right\rangle \). In a conditional GAN, these \(\left\langle f,c \right\rangle \) pairs are treated as positive samples. With a shuffle function \(T(\cdot )\), the positive pair \(\left\langle f,c \right\rangle \) is transformed into a negative pair \(\left\langle f,\tilde{c} \right\rangle \), where \(\tilde{c}=T(c)\ne c\) is a label mismatched with the feature f. The discriminator is expected to identify these mismatched pairs as OOD data so that it becomes aware of category labels; the resulting classification loss is the so-called shuffle loss:

$$\begin{aligned} L_s = \mathbb {E}_{P_f} (\log D(f;T(c)) -\log D(f;c)) \end{aligned}$$
(8)
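A minimal sketch of Eq. (8) follows, where the shuffle function \(T(\cdot )\) is realized as a random nonzero class offset; any permutation with \(T(c)\ne c\) would serve equally well:

```python
import torch

def shuffle_loss(D, f, c, num_classes: int):
    """Shuffle loss L_s (Eq. 8): mismatched labels should score as OOD.

    T(c) is realized as a random nonzero class offset so that T(c) != c.
    """
    eps = 1e-6                                # numerical safety inside the logs
    offset = torch.randint(1, num_classes, c.shape, device=c.device)
    c_shuffled = (c + offset) % num_classes   # T(c), guaranteed != c
    return (torch.log(D(f, c_shuffled) + eps)
            - torch.log(D(f, c) + eps)).mean()
```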

Uniform Loss. We obtain the positive pair \(\left\langle f,c \right\rangle \) and the negative pair \(\left\langle \hat{f},\hat{c} \right\rangle \) from REM. As mentioned before, conflicts may happen when \(\hat{f}\) is close to some ID feature cluster and the randomly generated label \(\hat{c}\) does match it. To tackle this issue, we strengthen the discriminator's memory of the positive pair \(\left\langle f,c \right\rangle \) while weakening that of the negative pair \(\left\langle \hat{f},\hat{c} \right\rangle \). We force the discriminator to maximize \(D(f;c)\) to remember positive pairs; meanwhile, a hyperparameter \(\lambda _c\) is used to mitigate the negative effects of conflicts. The uniform loss is defined as follows:

$$\begin{aligned} L_u = \lambda _c \cdot \mathbb {E}_{P_{\mathbb {U}}} \log D(\hat{f};\hat{c}) - \mathbb {E}_{P_f}\log D(f;c) \end{aligned}$$
(9)

In addition, the hard OOD features from RSM introduce no conflicts, and they are treated as negative OOD pairs when calculating the uniform loss for training the discriminator. Formally, the loss function \(L_d\) of the conditional discriminator is formulated as below:

$$\begin{aligned} L_t&=-\mathbb {E}_{P_f}\log D(f;c)-\mathbb {E}_{P_z} \log (1-D(G(z;c);c)) \end{aligned}$$
(10)
$$\begin{aligned} L_d&= L_t + L_s + L_u \end{aligned}$$
(11)

where \(L_t\) is the discriminator loss of a vanilla conditional GAN. A well-trained discriminator is a binary classifier separating ID and OOD features. When training the generator, we add a regularization term to accelerate convergence. The loss function of the generator is written as:

$$\begin{aligned} L_g = \mathbb {E}_{P_z} \log (1-D(G(z;c);c))+ \lambda (\min _{f_c\in \mathcal {M}_c}||f_c-G(z;c)||_1) \end{aligned}$$
(12)

where \(||\cdot ||_1\) indicates the L1 norm, \(\mathcal {M}_c\) is the set of ID features with label c, and \(\lambda \) is a balancing hyperparameter. The regularization term reduces the difference between synthetic features and real ones. We set \(\lambda \) to 0.01 in our experiments. When training the generator, the label c is generated randomly.
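Putting Eqs. (8)-(12) together, one training step might look like the sketch below. Pairing each hard OOD feature with its generation label c, and leaving the \(\lambda _c\) discount off the hard OOD term, reflect one reading of the text; `shuffle_loss` refers to the helper sketched above:

```python
import torch

EPS = 1e-6  # numerical safety inside the logs

def discriminator_loss(D, G, f, c, f_hat, c_hat, f_tilde, z,
                       num_classes, lambda_c=0.7):
    """Total discriminator loss L_d = L_t + L_s + L_u (Eq. 11)."""
    # L_t: vanilla conditional-GAN discriminator loss (Eq. 10)
    l_t = -(torch.log(D(f, c) + EPS).mean()
            + torch.log(1 - D(G(z, c), c) + EPS).mean())
    # L_s: shuffle loss (Eq. 8), sketched earlier
    l_s = shuffle_loss(D, f, c, num_classes)
    # L_u: uniform loss (Eq. 9), with hard OOD features as extra negatives
    l_u = (lambda_c * torch.log(D(f_hat, c_hat) + EPS).mean()
           - torch.log(D(f, c) + EPS).mean()
           + torch.log(D(f_tilde, c) + EPS).mean())
    return l_t + l_s + l_u

def generator_loss(D, G, z, c, id_feats_by_class, lam=0.01):
    """Generator loss L_g with the nearest-ID L1 regularizer (Eq. 12)."""
    fake = G(z, c)
    adv = torch.log(1 - D(fake, c) + EPS).mean()
    # L1 distance to the nearest real ID feature of the same class
    reg = torch.stack([
        (id_feats_by_class[int(c[i])] - fake[i]).abs().sum(dim=1).min()
        for i in range(fake.size(0))
    ]).mean()
    return adv + lam * reg
```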

Generally, the BAL framework only trains the conditional GAN, keeping the pre-trained classification net unchanged. The confidence score output by the trained discriminator is treated as \(P(f\in \mathcal {M}_f|f)\). Based on Eq. (2), the approximate posterior is formulated as the product of the outputs of the pre-trained classification net and the discriminator. The training and inference pipeline is shown in Algorithm 1. Code is available at: https://github.com/ForeverPs/BAL

Algorithm 1. Training and inference pipeline of BAL.

5 Experiments

In this section, we validate the proposed BAL on several image classification datasets and neural net architectures. The experimental setup is described in Sect. 5.1 and Sect. 5.2, evaluation metrics are detailed in Appendix Sect. 6.10, and the ablation study is described in Sect. 5.3. We report the main results and metrics in Sect. 5.4. Visualization of synthetic OOD data is given in Sect. 5.5.

5.1 Dataset

MNIST [22]: A database of handwritten digits in 10 categories, with a training set of 60k examples and a test set of 10k examples.

Fashion-MNIST [37]: A dataset of grayscale images of fashion products from 10 categories, with a training set of 60k images and a test set of 10k images.

Omniglot [20]: A dataset containing 1623 different handwritten characters from 50 different alphabets. In this work, we treat Omniglot as OOD data.

CIFAR-10 and CIFAR-100 [19]: The former contains 60k colour images in 10 classes, with 6k images per class. The latter also contains 60k images, but in 100 classes with 600 images per class.

TinyImageNet [5]: A dataset of 120k colour images in 200 classes, with 600 images per class.

SVHN [27] and LSUN [39]: The former contains colour images of street view house numbers. The latter is a large-scale scene understanding dataset.

5.2 Experimental Setup

Softmax Baseline. ResNet [11] and DenseNet [16] are used as backbones, trained with the Adam optimizer using cross-entropy loss for a total of 300 epochs. Images from MNIST, Fashion-MNIST and Omniglot are resized to 28 \(\times \) 28 with a single channel. Other datasets are resized to 32 \(\times \) 32 with RGB channels. For MNIST, Fashion-MNIST and Omniglot, ResNet18 is used as the feature extractor. For all other datasets, ResNet34 and DenseNet-BC with 100 layers are used for feature extraction.

GCPL. We use the distance-based cross-entropy loss and prototype loss from [38]. The hyperparameter \(\lambda \) (the weight of the prototype loss) is set to 0.01.

ODIN and Generalized ODIN. Parameters (T, \(\epsilon \)) are provided in Table 7.

AEC. This method uses reconstruction error to detect outliers. We reproduce it following the details in [6]. See Appendix Fig. 7 for more details.

5.3 Ablation Study

Ablation on Proposed Loss Functions. We compare the different loss functions proposed in BAL. Specifically, we use DenseNet-BC as the feature extractor, with CIFAR-10 as ID data and TinyImageNet as OOD data. We consider four combinations of the proposed loss functions: \(L_t\), \(L_t+L_s\), \(L_t+L_u\), and \(L_t+L_s+L_u\). The details of these loss functions can be found in Eqs. (8-10). For the uniform loss \(L_u\), we set the hyperparameter \(\lambda _c\) to 0.7. The results are summarized in Table 1, where BAL with both the shuffle loss and the uniform loss outperforms the alternative combinations. Compared to max-softmax, BAL reduces FPR95 by up to 36.1%.
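Since the evaluation metrics are detailed in the Appendix, a minimal sketch of FPR95 (the metric quoted above) is given here for reference; it assumes higher scores mean more ID-like:

```python
import numpy as np

def fpr_at_95_tpr(id_scores: np.ndarray, ood_scores: np.ndarray) -> float:
    """FPR95: the fraction of OOD samples still accepted as ID when the
    threshold gamma is set so that 95% of ID samples are kept."""
    gamma = np.quantile(id_scores, 0.05)   # 95% of ID scores lie above gamma
    return float((ood_scores >= gamma).mean())
```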

Table 1. Ablation on different combinations of loss functions. All networks are trained with the training set of CIFAR-10, and no OOD data is used. \(\lambda _c\) in the uniform loss \(L_u\) is set to 0.7. It can be seen that the proposed shuffle loss and uniform loss enhance the ability to detect outliers.

Ablation on \(\lambda _c\) in the Uniform Loss. We test the sensitivity of \(\lambda _c\) in Eq. (9). CIFAR-10 and TinyImageNet are set as ID and OOD data respectively, and DenseNet-BC is used as the backbone. The ablation results in Table 2 show that as \(\lambda _c\) increases, the AUPR\(_{out}\) of the neural network increases synchronously, which means the classifier becomes aware of more OOD data. In particular, setting \(\lambda _c\) to 0.7 yields better performance on both ID data and OOD detection.

Table 2. Ablation on parameter \(\lambda _c\). All networks are trained with the training set of CIFAR-10, and no OOD data is used. In the following experiments, \(\lambda _c\) is set to 0.7 unless otherwise specified.

Ablation on OOD Synthesis Sampling Methods. We consider different methods of sampling trivial OOD features. As described in Sect. 4.1, the distribution of features in a convolutional layer is usually assumed to follow a multivariate Gaussian distribution, and the low-density area of each category is therefore treated as the OOD region. We argue this assumption is not reasonable for the following reasons: (I) From Fig. 1 (a) and Fig. 4, we can see that in low-dimensional feature space, the conditional distribution of each category deviates greatly from a multivariate Gaussian distribution; (II) In high-dimensional space, the distribution of ID features is extremely sparse, so it is hard to accurately estimate the probability density of the assumed Gaussian distribution; (III) It is costly to calculate the mean vector \(\mu \) and covariance matrix \(\varSigma \) of a multivariate Gaussian distribution in high-dimensional feature space; (IV) Sampling is inefficient, since the probability density needs to be calculated for each synthetic sample. Without introducing any strong assumptions about the ID features, we verify that naive uniform sampling together with a GAN framework can model the OOD feature distribution effectively. We again use CIFAR-10 and TinyImageNet as ID and OOD data, and compare uniform sampling with Gaussian sampling in feature space. The dimensionality of the features is controlled by setting different numbers of neurons in the penultimate fully connected layer. The ablation results are shown in Table 3. It is clear that BAL with uniform sampling outperforms Gaussian sampling in both low- and high-dimensional spaces.

Table 3. Ablation on BAL with different sampling methods. The values in the table are AUROC. Both uniform and Gaussian sampling are performed within the BAL framework.
Table 4. Detecting OOD samples on MNIST, Fashion-MNIST and Omniglot with ResNet18. We use the mixture of the other two datasets as OOD samples.

5.4 Detection Results

We detail the main experimental results on several datasets with ResNet18, ResNet34, and DenseNet-BC. For CIFAR-10, CIFAR-100, and SVHN, we use pre-trained ResNet-34 and DenseNet-BC models; for MNIST, Fashion-MNIST, and Omniglot, we train ResNet18 from scratch.

Results on MNIST, Fashion-MNIST, and Omniglot. We evaluate the effects of BAL in two groups. In the first group, MNIST is the ID data, and the mixture of Fashion-MNIST and Omniglot is the OOD data. In the second group, Fashion-MNIST is the ID data while MNIST and Omniglot are the OOD data. For simplicity, Cls Acc and Det Err are used to denote Classification Accuracy and Detection Error. For ODIN, the temperature (T) and magnitude (\(\epsilon \)) are 10 and 5e-4 respectively. The results summarized in Table 4 show that BAL is effective on image classification benchmarks; in particular, BAL reduces FPR95 by up to 24.1% compared with ODIN in the second group.

Results on CIFAR-10, CIFAR-100, and SVHN. We consider a comprehensive set of experimental settings in this part to test the generalization ability of BAL. The ResNet-34 and DenseNet-BC models pre-trained on CIFAR-10, CIFAR-100 and SVHN come from [1]. The main results on image classification tasks are summarized in Table 5, where BAL demonstrates superior performance compared with mainstream methods under different experimental settings. The optimal temperature (T) and magnitude (\(\epsilon \)) are searched for ODIN in each group. Specifically, BAL reports a decline in FPR95 of up to 13.9% compared with Generalized ODIN.

Table 5. Main OOD detection results. We use C-10, C-100, TIN, D-BC and R-34 to represent CIFAR-10, CIFAR-100, TinyImageNet, DenseNet-BC and ResNet-34.

5.5 Visualization of Trivial and Hard OOD Features

We show visualization results for the trivial OOD features from uniform sampling and the hard OOD features from the generator via FGSM in Fig. 5. We set the training data to be two Gaussian distributions with dimensionality \(m=2\), and use a three-layer MLP as the classifier. The discriminator and generator use only fully connected layers. In the adversarial training process, we sample data uniformly in the raw data space, since the dimensionality of the raw data is fairly low. The other training details are the same as the pipeline shown in Algorithm 1. We also report classification results on dogs vs. cats [2], where images from ImageNet are treated as OOD data. The top-1 classification results of BAL and the softmax baseline are given in Fig. 6.

Fig. 5.

Synthetic OOD features in raw data space. When the dimensionality of the raw data space is high, we instead perform sampling in feature space, as shown in Algorithm 1.

Fig. 6.

OOD detection in the open world scenario. Two columns on the left: classification results on ID data. Two columns on the right: classification results on OOD images from ImageNet. Green: max-softmax baseline. Pink: the proposed BAL. The threshold for distinguishing ID and OOD is set to 0.60. It is shown that BAL reduces the false positives among classification results. The image with macarons is a failure case where BAL misclassifies it as a dog. (Color figure online)

6 Conclusion

In this paper, we propose BAL to learn the distribution of OOD features adaptively. No strong assumptions about the ID features are introduced: a simple uniform sampling method combined with a GAN framework progressively generates OOD features of very high quality. BAL is shown to generalize well across different datasets and architectures, and experimental results on image classification benchmarks demonstrate state-of-the-art performance. The ablation study also shows that BAL is stable under different parameter settings.