1 Introduction

Out-of-distribution (OOD) detection [49], i.e. identifying data samples that do not belong to the training distribution, is a task that is receiving an increasing amount of attention in the domain of deep learning [32, 33, 4, 16, 6, 19, 46, 45, 48, 22, 31, 41, 49, 50, 39]. The task is often motivated by safety-critical applications, such as healthcare and autonomous driving, where there may be a large cost associated with sending a prediction on OOD data downstream.

However, in spite of a plethora of existing research, there is generally a lack of focus in the literature on the specific motivation behind OOD detection, other than that it is often performed as part of the pipeline of another primary task, e.g. image classification. As such, the task is evaluated in isolation and formulated as binary classification between in-distribution (ID) and OOD data. In this work we consider the question: why exactly do we want to perform OOD detection during deployment? We focus on the problem setting where the primary objective is classification, and we are motivated to detect and then reject OOD data, as predictions on those samples will incur a cost. That is to say, the task is selective classification [5, 8] where OOD data has polluted the input samples. Kim et al. [27] term this problem setting unknown detection. However, we prefer to use Selective Classification in the presence of Out-of-Distribution data (SCOD), as we would like to emphasise the downstream classifier as the objective, and will refer to the task as such in the remainder of the paper.

The key difference between this problem setting and OOD detection is that both OOD data and incorrect predictions on ID data will incur a cost [27]. It does not matter if we reject an ID sample if it would be incorrectly classified anyway. As such we can view the task as separating correctly predicted ID samples (ID✓) from misclassified ID samples (ID✗) and OOD samples. This reveals a potential blind spot in designing approaches solely for OOD detection, as the cost of ID misclassifications is ignored. The key contributions of this work are:

  1. Building on initial results from [27] that show poor SCOD performance for existing methods designed for OOD detection, we provide novel insight into the behaviour of different post-hoc (after-training) detection methods for the task of SCOD. Improved OOD detection often comes directly at the expense of SCOD performance. Moreover, the relative SCOD performance of different methods varies with the proportion of OOD data found in the test distribution, the relative cost of accepting ID✗ vs OOD, as well as the distribution from which the OOD data samples are drawn.

  2. We propose a novel method targeting SCOD, Softmax Information Retaining Combination (SIRC), that aims to improve the OOD|ID✓ separation of softmax-based methods, whilst retaining their ability to identify ID✗. It consistently outperforms or matches the baseline maximum softmax probability (MSP) approach over a wide variety of OOD datasets and convolutional neural network (CNN) architectures, unlike existing OOD detection methods.

2 Preliminaries

Neural Network Classifier. For a K-class classification problem we learn the parameters \(\boldsymbol{\theta }\) of a discriminative model \(P(y|\boldsymbol{x};\boldsymbol{\theta })\) over labels \(y \in \mathcal Y = \{\omega _k\}_{k=1}^K\) given inputs \(\boldsymbol{x} \in \mathcal X = \mathbb R^D\), using finite training dataset \(\mathcal D_\text {tr} = \{y^{(n)},\boldsymbol{x}^{(n)}\}_{n=1}^{N}\) sampled independently from true joint data distribution \(p_\text {tr}(y,\boldsymbol{x})\). This is done in order to make predictions \(\hat{y}\) given new inputs \(\boldsymbol{x}^* \sim p_\text {tr}(\boldsymbol{x})\) with unknown labels,

$$\begin{aligned} \hat{y} = f(\boldsymbol{x}^*) = \mathop {\mathrm {arg\,max}}\limits _\omega P(\omega |\boldsymbol{x}^*;\boldsymbol{\theta })~, \end{aligned}$$
(1)

where f refers to the classifier function. In our case, the parameters \(\boldsymbol{\theta }\) belong to a deep neural network with categorical softmax output \(\boldsymbol{\pi }\in [0,1]^K\),

$$\begin{aligned} P(\omega _i|\boldsymbol{x};\boldsymbol{\theta }) = \pi _i(\boldsymbol{x};\boldsymbol{\theta }) = \exp v_i(\boldsymbol{x})/\sum _{k=1}^K \exp v_k(\boldsymbol{x})~, \end{aligned}$$
(2)

where the logits \(\boldsymbol{v} = \boldsymbol{W} \boldsymbol{z} + \boldsymbol{b} \quad (\in \mathbb R^K)\) are the output of the final fully-connected layer with weights \(\boldsymbol{W} \in \mathbb R^{K\times L}\), bias \(\boldsymbol{b} \in \mathbb R^K\), and final hidden layer features \(\boldsymbol{z} \in \mathbb R^L\) as inputs. Typically \(\boldsymbol{\theta }\) are learnt by minimising the cross entropy loss, such that the model approximates the true conditional distribution \(P_\text {tr}(y|\boldsymbol{x})\),

$$\begin{aligned} \mathcal L_\text {CE}(\boldsymbol{\theta })&= -\frac{1}{N}\sum _{n=1}^{N}\sum _{k=1}^K \delta (y^{(n)}, \omega _k)\log P(\omega _k|\boldsymbol{x}^{(n)};\boldsymbol{\theta }) \\&\approx -\mathbb E_{p_\text {tr}(\boldsymbol{x})}\left[ \sum _{k=1}^K P_\text {tr}(\omega _k|\boldsymbol{x})\log P(\omega _k|\boldsymbol{x};\boldsymbol{\theta })\right] = \mathbb {E}_{p_\text {tr}}\left[ KL\left[ P_\text {tr}||P_{\boldsymbol{\theta }}\right] \right] + A~, \nonumber \end{aligned}$$
(3)

where \(\delta (\cdot ,\cdot )\) is the Kronecker delta, A is a constant with respect to \(\boldsymbol{\theta }\) and KL\([\cdot ||\cdot ]\) is the Kullback-Leibler divergence.
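As a concrete illustration, the sketch below sets up the classifier of Eqs. 1–3 in PyTorch: a final linear layer over features \(\boldsymbol{z}\), the softmax output \(\boldsymbol{\pi }\), the argmax prediction and the cross-entropy objective. The dimensions, toy inputs and random weights are illustrative placeholders rather than the models used in our experiments.

```python
# Minimal sketch of the classifier in Eqs. (1)-(3): a final linear layer over
# features z producing logits v, a softmax over the logits, and the argmax
# prediction. Shapes and the toy batch are illustrative assumptions.
import torch
import torch.nn.functional as F

K, L = 200, 2048                      # number of classes, feature dimension
W = torch.randn(K, L) * 0.01          # final-layer weights (Eq. 2 context)
b = torch.zeros(K)                    # final-layer bias

def classify(z):
    """z: (N, L) final hidden-layer features -> logits, softmax, prediction."""
    v = z @ W.T + b                   # logits v = Wz + b, shape (N, K)
    pi = F.softmax(v, dim=-1)         # categorical softmax output (Eq. 2)
    y_hat = pi.argmax(dim=-1)         # prediction (Eq. 1)
    return v, pi, y_hat

# Cross-entropy training objective (Eq. 3) on a toy batch.
z = torch.randn(8, L)
y = torch.randint(0, K, (8,))
v, pi, y_hat = classify(z)
loss = F.cross_entropy(v, y)          # equals -mean log P(y|x; theta)
```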

Selective Classification. A selective classifier [5] can be formulated as a pair of functions, the aforementioned classifier \(f(\boldsymbol{x})\) (in our case given by Eq. 1) that produces a prediction \(\hat{y}\), and a binary rejection function

$$\begin{aligned} g(\boldsymbol{x};t) = {\left\{ \begin{array}{ll} 0\text { (reject prediction)}, &{}\text {if }S(\boldsymbol{x}) < t\\ 1\text { (accept prediction)}, &{}\text {if }S(\boldsymbol{x}) \ge t~, \end{array}\right. } \end{aligned}$$
(4)

where t is an operating threshold and S is a scoring function which is typically a measure of predictive confidence (or \(-S\) measures uncertainty). Intuitively, a selective classifier chooses to reject if it is uncertain about a prediction.
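For clarity, a minimal sketch of the rejection mechanism of Eq. 4 is given below; `score` stands in for any confidence score S, and the function name and shapes are illustrative assumptions.

```python
# Sketch of the selective classifier (f, g) of Eq. (4): accept the prediction
# when the confidence score S(x) clears the threshold t, reject otherwise.
import torch

def selective_predict(logits, score, t):
    """logits: (N, K); score: (N,) confidence S(x); t: operating threshold.
    Returns predictions and a boolean accept mask (True means g = 1, accept)."""
    y_hat = logits.argmax(dim=-1)     # f(x), Eq. (1)
    accept = score >= t               # g(x; t), Eq. (4)
    return y_hat, accept
```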

Problem Setting. We consider a scenario where, during deployment, classifier inputs \(\boldsymbol{x}^*\) may be drawn from either the training distribution \(p_\text {tr}(\boldsymbol{x})\) (ID) or another distribution \(p_\text {OOD}(\boldsymbol{x})\) (OOD). That is to say,

$$\begin{aligned} \boldsymbol{x}^* \sim p_\text {mix}(\boldsymbol{x}), \quad p_\text {mix}(\boldsymbol{x}) = \alpha p_\text {tr}(\boldsymbol{x}) + (1-\alpha )p_\text {OOD}(\boldsymbol{x})~, \end{aligned}$$
(5)

where \(\alpha \in [0,1]\) reflects the proportion of ID to OOD data found in the wild. Here “Out-of-Distribution” inputs are defined as those drawn from a distribution with label space that does not intersect with the training label space \(\mathcal Y\) [49]. For example, an image of a car is considered OOD for a CNN classifier trained to discriminate between different types of pets.
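The following sketch illustrates the sampling process of Eq. 5, with `id_data` and `ood_data` as placeholder tensors standing in for the two distributions; it is an illustrative assumption rather than the exact procedure used in our experiments (see Sect. 5.3).

```python
# Illustrative sketch of the deployment-time input distribution of Eq. (5):
# each test input is ID with probability alpha and OOD otherwise. The two
# "datasets" here are placeholder (N, D) tensors.
import torch

def sample_p_mix(id_data, ood_data, alpha, n):
    """Draw n inputs from p_mix = alpha * p_tr + (1 - alpha) * p_OOD."""
    from_id = torch.rand(n) < alpha                    # Bernoulli(alpha) per sample
    id_idx = torch.randint(0, len(id_data), (n,))
    ood_idx = torch.randint(0, len(ood_data), (n,))
    x = torch.where(from_id.view(-1, 1), id_data[id_idx], ood_data[ood_idx])
    return x, from_id                                  # inputs and ID/OOD indicators
```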

We now define the predictive loss on an accepted sample as

$$\begin{aligned} \mathcal L_\text {pred}(f(\boldsymbol{x})) = {\left\{ \begin{array}{ll} 0, &{}\text {if }\boldsymbol{x}\text { is ID and }f(\boldsymbol{x})\text { is correct (ID✓)}\\ \beta , &{}\text {if }\boldsymbol{x}\text { is ID and }f(\boldsymbol{x})\text { is incorrect (ID✗)}\\ 1-\beta , &{}\text {if }\boldsymbol{x}\text { is OOD}~, \end{array}\right. } \end{aligned}$$

(6)

where \(\beta \in [0,1]\), and define the selective risk as in [8],

$$\begin{aligned} R(f,g;t) = \frac{\mathbb E_{p_\text {mix}(\boldsymbol{x})}[g(\boldsymbol{x};t)\mathcal L_\text {pred}(f(\boldsymbol{x}))]}{\mathbb E_{p_\text {mix}(\boldsymbol{x})}[g(\boldsymbol{x};t)]}~, \end{aligned}$$
(7)

which is the average loss of the accepted samples. We are only concerned with the relative cost of ID✗ and OOD samples, so we use a single parameter \(\beta \).
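A sketch of the empirical selective risk is given below, using the per-sample loss of Eq. 6 (0 for ID✓, \(\beta \) for ID✗, \(1-\beta \) for OOD); the array-based interface and function name are illustrative assumptions.

```python
# Sketch of the empirical selective risk of Eq. (7), with the per-sample loss
# of Eq. (6). Inputs are illustrative numpy arrays over a mixed ID/OOD stream.
import numpy as np

def selective_risk(score, t, is_ood, is_correct, beta=0.5):
    """score: (N,) confidence S(x); is_ood/is_correct: (N,) booleans
    (is_correct is only meaningful where is_ood is False)."""
    accept = score >= t                               # g(x; t)
    loss = np.where(is_ood, 1.0 - beta,               # OOD cost
                    np.where(is_correct, 0.0, beta))  # ID correct / ID wrong cost
    if accept.sum() == 0:
        return 0.0                                    # nothing accepted
    return (loss * accept).sum() / accept.sum()       # Eq. (7)
```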

The objective is to find a classifier and rejection function \((f,g)\) that minimise \(R(f,g;t)\) for some given setting of t. We focus on comparing post-hoc (after-training) methods in this work, where g or equivalently S is varied with f fixed. This removes confounding factors that may arise from the interactions of different training-based and post-hoc methods, as they can often be freely combined.

In practice, both \(\alpha \) and \(\beta \) will depend on the deployment scenario. However, whilst \(\beta \) can be set freely by the practitioner, \(\alpha \) is outside of the practitioner’s control and their knowledge of it is likely to be very limited.

Fig. 1.

Illustrative sketch showing how SCOD differs from OOD detection. Densities of OOD samples, misclassifications (ID✗) and correct predictions (ID✓) are shown with respect to confidence score S. For OOD detection the aim is to separate OOD|ID✗, ID✓, whilst for SCOD the data is grouped as OOD, ID✗ |ID✓.

It is worth contrasting the SCOD problem setting with OOD detection. SCOD aims to separate OOD, ID✗ |ID✓, whilst for OOD detection the data is grouped as OOD|ID✗, ID✓ (see Fig. 1). We note that previous works [26, 34, 35, 38, 41] refer to different types of predictive uncertainty, namely aleatoric and epistemic. The former arises from uncertainty inherent in the data (i.e. the true conditional distribution \(P_\text {tr}(y|\boldsymbol{x})\)) and as such is irreducible, whilst the latter can be reduced by having the model learn from additional data. Typically, it is argued that it is useful to distinguish these types of uncertainty at prediction time. For example, epistemic uncertainty should be an indicator of whether a test input \(\boldsymbol{x}^*\) is OOD, whilst aleatoric uncertainty should reflect the level of class ambiguity of an ID input. An interesting result within our problem setting is that the conflation of these different types of uncertainty may not be an issue, since there is no need to separate ID✗ from OOD: both should be rejected.

3 OOD Detectors Applied to SCOD

As the explicit objective of OOD detection differs from that of SCOD, it is of interest to understand how existing detection methods behave for SCOD. Previous work [27] has empirically shown that some existing OOD detection approaches perform worse than the MSP baseline on SCOD, and in this section we shed additional light on why this is the case.

Fig. 2.

Illustrations of how a detection method can improve over a baseline. For OOD detection we can either have OOD distributed further away from ID✓, or ID✗ distributed closer to ID✓. For SCOD we want both OOD and ID✗ to be further away from ID✓. Thus, we can see how improving OOD detection may in fact be at odds with SCOD.

Improving Performance: OOD Detection vs SCOD. In order to build an intuition, we can consider, qualitatively, how detection methods can improve performance over a baseline, with respect to the distributions of OOD and ID✗ relative to ID✓. This is illustrated in Fig. 2. For OOD detection the objective is to better separate the distributions of ID and OOD data. Thus, we can either find a confidence score S that, compared to the baseline, has OOD distributed further away from ID✓, and/or has ID✗ distributed closer to ID✓. In comparison, for SCOD, we want both OOD and ID✗ to be distributed further away from ID✓ than the baseline. Thus there is a conflict between the two tasks as, for ID✗, the desired behaviour of confidence score S will be different.

Existing Approaches Sacrifice SCOD by Conflating ID✓ and ID✗. Considering post-hoc methods, the baseline confidence score S used is Maximum Softmax Probability (MSP) [16]. Improvements in OOD detection are often achieved by moving away from the softmax \(\boldsymbol{\pi }\) in order to better capture the differences between ID and OOD data. Energy [33] and Max Logit [14] consider the logits \(\boldsymbol{v}\) directly, whereas the Mahalanobis detector [31] and DDU [38] build generative models using Gaussians over the features \(\boldsymbol{z}\). ViM [48] and Gradnorm [21] incorporate class-agnostic, feature-based information into their scores.

Recall that typically a neural network classifier learns a model \(P(y|\boldsymbol{x};\boldsymbol{\theta })\) to approximate the true conditional distribution \(P_\text {tr}(y|\boldsymbol{x})\) of the training data (Eqs. 2, 3). As such, scores S extracted from the softmax outputs \(\boldsymbol{\pi }\) should best reflect how likely a prediction on ID data is going to be correct or not (and this is indeed the case in our experiments in Sect. 5). As the above (post-hoc) OOD detection approaches all involve moving away from the modelled \(P(y|\boldsymbol{x};\boldsymbol{\theta })\), we would expect worse separation between ID✗ and ID✓ even if overall OOD is better distinguished from ID. Figure 3 shows empirically how well different types of data are separated using MSP (\(\pi _\text {max}\)) and Energy (\(\log \sum _k\exp v_k\)), by plotting false positive rate (FPR) against true positive rate (TPR). Lower FPR indicates better separation of the negative class away from the positive class. Although Energy has better OOD detection performance compared to MSP, this is actually because the separation between ID✗ and ID✓ is much less for Energy, whilst the behaviour of OOD relative to ID✓ is not meaningfully different to the MSP baseline. Therefore, SCOD performance for Energy is worse in this case. Another way of looking at it would be that for OOD detection, MSP does worse as it conflates ID with OOD, however, this doesn’t harm SCOD performance as much, as those ID samples are mostly incorrect anyway. The ID dataset is ImageNet-200 [27], OOD dataset is iNaturalist [22] and the model is ResNet-50 [13].
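For reference, the two scores compared in Fig. 3 can be computed from a single forward pass as sketched below; the function names are ours and the shapes are assumed for illustration.

```python
# Sketch of the two post-hoc scores compared in Fig. 3: MSP uses the softmax
# output, Energy works on the logits directly. Higher score = more confident.
import torch
import torch.nn.functional as F

def msp_score(logits):
    """Maximum softmax probability, pi_max [16]."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

def energy_score(logits):
    """Energy score, log sum_k exp v_k [33], computed on the logits."""
    return torch.logsumexp(logits, dim=-1)
```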

Fig. 3.

Left: False positive rate (FPR) of OOD samples plotted against true positive rate (TPR) of ID samples. Energy performs better (lower) for OOD detection relative to the MSP baseline. Right: FPR of ID✗ and OOD samples against TPR of ID✓. Energy is worse than the baseline at separating ID✗ |ID✓ and no better for OOD|ID✓, meaning it is worse for SCOD. Energy's improved OOD detection performance arises from pushing ID✗ closer to ID✓. The ID dataset is ImageNet-200, OOD dataset is iNaturalist and the model is ResNet-50.

4 Targeting SCOD – Retaining Softmax Information

We would now like to develop an approach that is tailored to the task of SCOD. We have discussed how we expect softmax-based methods, such as MSP, to perform best for distinguishing ID✗ from ID✓, and how existing approaches for OOD detection improve over the baseline, in part, by sacrificing this. As such, to improve over the baseline for SCOD, we will aim to retain the ability to separate ID✗ from ID✓ whilst increasing the separation between OOD and ID✓.

Combining Confidence Scores. Inspired by Gradnorm [21] and ViM [48] we consider the combination of two different confidence scores \(S_1, S_2\). We shall consider \(S_1\) our primary score, which we wish to augment by incorporating \(S_2\). For \(S_1\) we investigate scores that are strong for selective classification on ID data, but are also capable of detecting OOD data – MSP and (the negative of) softmax entropy, \((-)\mathcal H[\boldsymbol{\pi }]\). For \(S_2\), the score should be useful in addition to \(S_1\) in determining whether data is OOD or not. We should consider scores that capture different information about OOD data from the post-softmax \(S_1\) if we want to improve OOD|ID✓. We choose to examine the \(l_1\)-norm of the feature vector \(||\boldsymbol{z}||_1\) from [21] and the negative of the Residual score \(-||\boldsymbol{z}^{P^\bot }||_2\) from [48], as these scores capture class-agnostic information at the feature level. Note that although \(||\boldsymbol{z}||_1\) and Residual have previously been shown to be useful for OOD detection in [21, 48], we do not expect them to be useful for identifying misclassifications. They are separate from the classification layer defined by \((\boldsymbol{W},\boldsymbol{b})\), so they are far removed from the categorical \(P(y|\boldsymbol{x};\boldsymbol{\theta })\) modelled by the softmax.
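The candidate scores discussed above can be sketched as follows. The function names are ours; the Residual score additionally requires the principal subspace of the ID training features, whose estimation (described in [48]) is not reproduced here.

```python
# Sketch of the S1 / S2 candidates: softmax-based S1 (MSP or negative entropy)
# and class-agnostic, feature-based S2 (the l1-norm of z from [21]; the
# Residual score of [48] is only indicated, its subspace is assumed given).
import torch
import torch.nn.functional as F

def s1_msp(logits):
    return F.softmax(logits, dim=-1).max(dim=-1).values

def s1_neg_entropy(logits):
    log_pi = F.log_softmax(logits, dim=-1)
    return (log_pi.exp() * log_pi).sum(dim=-1)        # -H[pi]

def s2_feature_l1(z):
    return z.abs().sum(dim=-1)                        # ||z||_1 [21]

def s2_neg_residual(z, P_perp):
    """Negative Residual score [48]: -||z^{P_perp}||_2, where P_perp is a basis
    of the subspace orthogonal to the principal subspace of the ID training
    features (its estimation is not shown here)."""
    return -(z @ P_perp).norm(dim=-1)
```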

Softmax Information Retaining Combination (SIRC). We want to create a combined confidence score \(C(S_1,S_2)\) that retains \(S_1\)’s ability to distinguish ID✗ |ID✓ but is also able to incorporate \(S_2\) in order to augment OOD|ID✓. We develop our approach based on the following set of assumptions:

  • \(S_1\) will be higher for ID✓ and lower for ID✗ and OOD.

  • \(S_1\) is bounded by maximum value \(S_1^\text {max}\).

  • \(S_2\) is unable to distinguish ID✗ |ID✓, but is lower for OOD compared to ID.

  • \(S_2\) is useful in addition to \(S_1\) for separating OOD|ID.

We propose to combine \(S_1\) and \(S_2\) using

$$\begin{aligned} C(S_1,S_2) = -(S^{\max }_1-S_1)\left( 1+\exp (-b[S_2-a])\right) ~, \end{aligned}$$
(8)

where \(a, b\) are parameters chosen by the practitioner. The idea is for the accept/reject decision boundary of C to be in the shape of a sigmoid on the \((S_1,S_2)\)-plane (see Fig. 4). As such, the behaviour of only using the softmax-based \(S_1\) is recovered for ID✗ |ID✓ as \(S_2\) increases, since the decision boundary tends to a vertical line. However, \(S_2\) is considered increasingly important as it decreases, allowing for improved OOD|ID✓. We term this approach Softmax Information Retaining Combination (SIRC).

The parameters \(a, b\) allow the method to be adjusted to different distributional properties of \(S_2\). Rearranging Eq. 8,

$$\begin{aligned} S_1 = S_1^\text {max} + C/[1+\exp (-b[S_2-a])]~, \end{aligned}$$
(9)

we see that a controls the vertical placement of the sigmoid, and b the sensitivity of the sigmoid to \(S_2\). We use the empirical mean and standard deviation of \(S_2\), \(\mu _{S_2}, \sigma _{S_2}\), on ID data (training or validation) to set the parameters. We choose \(a = \mu _{S_2}-3\sigma _{S_2}\) so the centre of the sigmoid is below the ID distribution of \(S_2\), and we set \(b=1/\sigma _{S_2}\) to match the ID variation of \(S_2\). Note that other parameter settings are possible, and practitioners are free to tune \(a, b\) however they see fit (on ID data), but we find the above approach to be empirically effective.
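A minimal sketch of SIRC with this parameter setting is given below. The function names and tensor-based interface are illustrative assumptions; for MSP \(S_1^\text {max}=1\), whilst for negative entropy the supremum is 0.

```python
# Sketch of SIRC (Eq. 8) with a = mean(S2) - 3*std(S2) and b = 1/std(S2),
# estimated on ID (training or validation) data. Inputs are per-sample scores.
import torch

def fit_sirc_params(s2_id):
    """Set a, b from the empirical ID statistics of S2."""
    mu, sigma = s2_id.mean(), s2_id.std()
    return mu - 3.0 * sigma, 1.0 / sigma              # a, b

def sirc(s1, s2, a, b, s1_max=1.0):
    """Combined confidence C(S1, S2) of Eq. (8).
    s1_max = 1.0 for MSP; 0.0 for negative softmax entropy."""
    return -(s1_max - s1) * (1.0 + torch.exp(-b * (s2 - a)))

# Usage sketch: combine MSP with the feature l1-norm (score functions as in
# the earlier sketch), fitting a, b on ID training features z_train.
# a, b = fit_sirc_params(s2_feature_l1(z_train))
# c = sirc(s1_msp(logits_test), s2_feature_l1(z_test), a, b, s1_max=1.0)
```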

Figure 4 compares different methods of combination by plotting ID✓, ID✗ and OOD data densities on the \((S_1,S_2)\)-plane. Other than SIRC, we consider the combination methods used in ViM, \(C=S_1 + cS_2\), where c is a user-set parameter, and in Gradnorm, \(C=S_1 S_2\). The overlaid contours of C represent decision boundaries for values of t. We see that the linear decision boundary of \(C=S_1 + cS_2\) must trade off significant performance in ID✗ |ID✓ in order to gain OOD|ID✓ (through varying c), whilst \(C=S_1 S_2\) sacrifices the ability to separate ID✗ |ID✓ well for higher values of \(S_1\). We also note that \(C=S_1S_2\) is not robust to different ID means of \(S_2\). For example, arbitrarily adding a constant D to \(S_2\) will completely change the behaviour of the combined score. On the other hand, SIRC is designed to be robust to this sort of variation between different \(S_2\). Figure 4 also shows an alternative parameter setting for SIRC, where a is lower and b is higher. Here more of the behaviour of only using \(S_1\) is preserved, but \(S_2\) contributes less. It is also empirically observable that the assumption that \(S_2\) (in this case \(||\boldsymbol{z}||_1\)) is not useful for distinguishing ID✓ from ID✗ holds; in practice this can be verified on ID validation data when selecting \(S_2\).

We also note that although we have chosen specific \(S_1,S_2\) in this work, SIRC can be applied to any S that satisfy the above assumptions. As such it has the potential to improve beyond the results we present, given better individual S.

Fig. 4.

Comparison of different methods of combining confidence scores \(S_1,S_2\) for SCOD. ID✓, ID✗ and OOD distributions are displayed using kernel density estimate contours. Graded contours for the different combination methods are then overlaid (lighter means higher combined score). We see that our method, SIRC (centre right), is able to better retain ID✗ |ID✓ whilst improving OOD|ID✓. An alternate parameter setting for SIRC, with a stricter adherence to \(S_1\), is also shown (far right). The ID dataset is ImageNet-200, the OOD dataset iNaturalist and the model ResNet-50. SIRC parameters are found using ID training data; the plotted distributions are test data.

5 Experimental Results

We present experiments across a range of CNN architectures and ImageNet-scale OOD datasets. Extended results can be found in the supplemental material.

Data, Models and Training. For our ID dataset we use ImageNet-200 [27], which contains a subset of 200 ImageNet-1k [43] classes. It has separate training, validation and test sets. We use a variety of OOD datasets for our evaluation that display a wide range of semantics and difficulty in being identified. Near-ImageNet-200 (Near-IN-200) [27] is constructed from the remaining ImageNet-1k classes that are semantically similar to ImageNet-200, so it is especially challenging to detect. Caltech-45 [27] is a subset of the Caltech-256 [12] dataset with non-overlapping classes to ImageNet-200. Openimage-O [48] is a subset of the Open Images V3 [29] dataset selected to be OOD with respect to ImageNet-1k. iNaturalist [22] and Textures [48] are similarly constructed from their respective source datasets [47, 2]. Colorectal [25] is a collection of histological images of human colorectal cancer, whilst Colonoscopy is a dataset of frames taken from colonoscopic video of gastrointestinal lesions [36]. Noise is a dataset of square images where the resolution, contrast and pixel values are randomly generated (for details see the supplemental material). Finally, ImageNet-O [18] is a dataset OOD to ImageNet-1k that is adversarially constructed using a trained ResNet. Note that we exclude a number of OOD datasets from [27] and [22] after discovering that they contain ID examples.

We train ResNet-50 [13], DenseNet-121 [20] and MobileNetV2 [44] using hyperparameters based around standard ImageNet settings. Full training details can be found in the supplemental material. For each architecture we train 5 models independently using random seeds \(\{1,\dots ,5\}\) and report the mean result over the runs. The supplemental material contains results on single pre-trained ImageNet-1k models, BiT ResNetV2-101 [28] and PyTorch DenseNet-121.

Detection Methods for SCOD. We consider four variations of SIRC using the components {MSP, \(\mathcal H\)} \(\times \) {\(||\boldsymbol{z}||_1\), Residual}, as well as the components individually. We additionally evaluate various existing post-hoc methods: MSP [16], Energy [33], ViM [48] and Gradnorm [21]. For SIRC and ViM we use the full ID train set to determine parameters. Results for additional approaches, as well as further details pertaining to the methods, can be found in the supplemental material.

5.1 Evaluation Metrics

For evaluating different scoring functions S for the SCOD problem setting we consider a number of metrics. Arrows (\(\uparrow \downarrow \)) indicate whether higher/lower is better. (For illustrations and additional metrics see the supplemental material.)

  • Area Under the Risk-Recall curve (AURR)\(\downarrow \). We consider how empirical risk (Eq. 7) varies with recall of ID✓, and aggregate performance over different t by calculating the area under the curve. As recall is only measured over ID✓, the base accuracy of f is not properly taken into account. Thus, this metric is only suitable for comparing different g with f fixed. To give an illustrative example, an \((f,g)\) pair where the classifier f is only able to produce a single correct prediction will have perfect AURR as long as S assigns that correct prediction the highest confidence (lowest uncertainty) score. Note that results for the AURC metric [27, 10] can be found in the supplemental material, although we omit them from the main paper as they are not notably different from AURR.

  • Risk@Recall=0.95 (Risk@95)\(\downarrow \). Since a rejection threshold t must be selected at deployment, we also consider a particular setting of t such that 95% of ID✓ is recalled. In practice, the corresponding value of t could be found on a labelled ID validation set before deployment, without the use of any OOD data. It is worth noting that differences tend to be greater for this metric between different S as it operates around the tail of the positive class.

  • Area Under the ROC Curve (AUROC)\(\uparrow \). Since we are interested in rejecting both ID✗ and OOD, we can consider ID✓ as the positive class, and ID✗, OOD as separate negative classes. Then we can evaluate the AUROC of OOD|ID✓ and ID✗ |ID✓ independently. The AUROC for a specific value of \(\alpha \) would then be a weighted average of the two different AUROCs. This is not a direct measure of risk, but it does measure the separation between different empirical distributions. Note that, for reasons similar to those for AURR, this metric is only valid for fixed f.

  • False Positive Rate@Recall=0.95 (FPR@95)\(\downarrow \). FPR@95 is similar to AUROC, but is taken at a specific t. It measures the proportion of the negative class accepted when the recall of the positive class (or true positive rate) is 0.95. (A code sketch of these threshold-based metrics is given below the list.)
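A brief sketch of the threshold-based metrics (Risk@95 and FPR@95) is given below; it reuses `selective_risk` from the earlier sketch in Sect. 2, and the array-based interface and function names are illustrative assumptions.

```python
# Sketch of the threshold-based metrics: the threshold t is set so that 95% of
# the positive class (ID correct) is recalled, then risk / FPR are read off.
import numpy as np

def threshold_at_recall(score_pos, recall=0.95):
    """Threshold t such that approximately `recall` of the positives score >= t."""
    return np.quantile(score_pos, 1.0 - recall)

def fpr_at_recall(score_pos, score_neg, recall=0.95):
    t = threshold_at_recall(score_pos, recall)
    return (score_neg >= t).mean()                    # proportion of negatives accepted

def risk_at_recall(score, is_ood, is_correct, beta=0.5, recall=0.95):
    pos = score[~is_ood & is_correct]                 # ID-correct scores
    t = threshold_at_recall(pos, recall)
    return selective_risk(score, t, is_ood, is_correct, beta)  # earlier sketch
```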

5.2 Separation of ID✗ |ID✓ and OOD|ID✓ Independently

Table 1. %AUROC and %FPR@95 with ID✓ as the positive class, considering ID✗ and each OOD dataset separately. Full results are for ResNet-50 trained on ImageNet-200. We show abridged results for MobileNetV2 and DenseNet-121. Bold indicates best performance, underline 2nd or 3rd best and we show the mean over models from 5 independent training runs. Variants of SIRC are shown as tuples of their components (\(S_1\),\(S_2\)). We also show error rate on ID data. SIRC is able to consistently match or improve over \(S_1\) for OOD|ID✓, at a negligible cost to ID✗ |ID✓. Existing OOD detection methods are significantly worse for ID✗ |ID✓ and inconsistent at improving OOD|ID✓.

Table 1 shows %AUROC and %FPR@95 with ID✓ as the positive class and ID✗, OOD independently as different negative classes (see Sect. 5.1). In general, we see that SIRC, compared to \(S_1\), is able to improve OOD|ID✓ whilst incurring only a small (\(<0.2\) %AUROC) reduction in the ability to distinguish ID✗ |ID✓, across all 3 architectures. On the other hand, non-softmax methods designed for OOD detection show poor ability to identify ID✗, with performance ranging from \(\sim 8\) %AUROC worse than MSP down to \(\sim 50\%\) AUROC (random guessing). Furthermore, they cannot consistently outperform the baseline when separating OOD|ID✓, in line with the discussion in Sect. 3.

SIRC is Robust to Weak \(\boldsymbol{S_2}\). Although for the majority of OOD datasets SIRC is able to outperform \(S_1\), this is not always the case. For these latter instances, we can see that \(S_2\) individually is not useful, e.g. for ResNet-50 on Colonoscopy, Residual performs worse than random guessing. However, in cases like this the performance is still close to that of \(S_1\). As \(S_2\) will tend to be higher for these OOD datasets, the behaviour is like that for ID✗ |ID, with the decision boundaries close to vertical (see Fig. 4). As such SIRC is robust to \(S_2\) performing poorly, but is able to improve on \(S_1\) when \(S_2\) is of use. In comparison, ViM, which linearly combines Energy and Residual, is much more sensitive to when the latter stumbles. On Colonoscopy ViM has \(\sim 30\) worse %FPR@95 compared to Energy, whereas SIRC (\(-\mathcal H\), Res.) loses \(<1\%\) compared to \(-\mathcal H\).

OOD Detection Methods Are Inconsistent over Different Data. The performance of existing methods for OOD detection relative to the MSP baseline varies considerably from dataset to dataset. For example, even though ViM is able to perform very well on Textures, Noise and ImageNet-O (>50 better %FPR@95 on Noise), it does worse than the baseline on most other OOD datasets (>20 worse %FPR@95 for Near-ImageNet-200 and iNaturalist). This suggests that the inductive biases incorporated, and assumptions made, when designing existing OOD detection methods may prevent them from generalising across a wider variety of OOD data. In contrast, SIRC more consistently, albeit modestly, improves over the baseline, due to its aforementioned robustness.

Fig. 5.

AURR\(\downarrow \) and Risk@95\(\downarrow \) (\(\times 10^2\)) for different methods as \(\alpha \) and \(\beta \) vary (Eqs. 5, 6) on a mixture of all the OOD data. We also split the OOD data into qualitatively “Close” and “Far” subsets (Sect. 5.3). For high \(\alpha , \beta \), where ID✗ dominates in the risk, the MSP baseline is the best. As \(\alpha , \beta \) decrease, increasing the effect of OOD data, other methods improve relative to the baseline. SIRC is able to most consistently improve over the baseline. OOD detection methods perform better on “Far” OOD. The ID dataset is ImageNet-200, the model ResNet-50. We show the mean over 5 independent training runs. We multiply all values by \(10^2\) for readability.

5.3 Varying the Importance of OOD Data Through \(\alpha \) and \(\beta \)

At deployment, there will be a specific ratio of ID:OOD data exposed to the model. Thus, it is of interest to investigate the risk over different values of \(\alpha \) (Eq. 5). Similarly, an incorrect ID prediction may or may not be more costly than a prediction on OOD data, so we investigate different values of \(\beta \) (Eq. 6). Figure 5 shows how AURR and Risk@95 are affected as \(\alpha \) and \(\beta \) are varied independently (with the other fixed to 0.5). We use the full test set of ImageNet-200, pool OOD datasets together, and randomly sample different quantities of OOD data in order to achieve different values of \(\alpha \). We use 3 different groupings of OOD data: All, “Close” {Near-ImageNet-200, Caltech-45, Openimage-O, iNaturalist} and “Far” {Textures, Colonoscopy, Colorectal, Noise}. These groupings are based on relative qualitative semantic difference to the ID dataset (see supplemental material for example images from each dataset). Although the grouping is not formal, it serves to illustrate OOD data-dependent differences in SCOD performance.

Relative Performance of Methods Changes with \(\boldsymbol{\alpha }\) and \(\boldsymbol{\beta }\). At high \(\alpha \) and \(\beta \), where ID✗ dominates the risk, the MSP baseline performs best. However, as \(\alpha \) and \(\beta \) are decreased, and OOD data is introduced, we see that other methods improve relative to the baseline. There may be a crossover after which the ability to better distinguish OOD|ID✓ allows a method to surpass the baseline. Thus, which method to choose for deployment will depend on the practitioner’s setting of \(\beta \) and (if they have any knowledge of it at all) of \(\alpha \).

SIRC Most Consistently Improves over the Baseline. SIRC \((-\mathcal H, \text {Res.})\) is able to outperform the baseline most consistently over the different scenarios and settings of \(\alpha , \beta \), only doing worse for ID✗-dominated cases (\(\alpha , \beta \) close to 1). This is because SIRC has close to baseline ID✗ |ID✓ performance and is superior for OOD|ID✓. In comparison, ViM and Energy, which conflate ID✗ and ID✓, are worse than the baseline for most (if not all) values of \(\alpha , \beta \). Their behaviour on the different groupings of data illustrates how these methods may be biased towards different OOD datasets, as they significantly outperform the baseline at lower \(\alpha \) for the “Far” grouping, but always do worse on “Close” OOD data.

Fig. 6.

The change in %FPR@95\(\downarrow \) relative to the MSP baseline for different methods. Different data classes are shown negative|positive. Although OOD detection methods are able to improve OOD|ID, they do so mainly at the expense of ID✗ |ID✓ rather than by improving OOD|ID✓. SIRC is able to improve OOD|ID✓ with minimal loss to ID✗ |ID✓, alongside modest improvements for OOD|ID. Results for OOD are averaged over all OOD datasets. The ID dataset is ImageNet-200 and the model ResNet-50.

5.4 Comparison Between SCOD and OOD Detection

Figure 6 shows the difference in %FPR@95 relative to the MSP baseline for different combinations of negative|positive data classes (ID✗ |ID✓, OOD|ID✓, OOD|ID), where OOD results are averaged over all datasets and training runs. In line with the discussion in Sect. 3, we observe that the non-softmax OOD detection methods are able to improve over the baseline for OOD|ID, but this comes mostly at the cost of inferior ID✗ |ID✓ rather than due to better OOD|ID✓, so they will do worse for SCOD. SIRC on the other hand is able to retain much more ID✗ |ID✓ performance whilst improving on OOD|ID✓, allowing it to have better OOD detection and SCOD performance compared to the baseline.

6 Related Work

There is extensive existing research into OOD detection, a survey of which can be found in [49]. To improve over the MSP baseline in [16], early post-hoc approaches, primarily experimenting on CIFAR-scale data, such as ODIN [32], Mahalanobis [31] and Energy [33], explore how to extract non-softmax information from a trained network. More recent work has moved to larger-scale image datasets [22, 14]. Gradnorm [21], although motivated by the information in gradients, at its core combines information from the softmax and features together. Similarly, ViM [48] combines Energy with the class-agnostic Residual score. ReAct [45] aims to improve logit/softmax-based scores by clamping the magnitude of final-layer features. There are also many training-based approaches. Outlier Exposure [17] explores training networks to be uncertain on “known” existing OOD data, whilst VOS [4] instead generates virtual outliers during training for this purpose. [19, 46] propose that the network explicitly learn a scaling factor for the logits to improve softmax behaviour. There also exists a line of research that explores the use of generative models, \(p(\boldsymbol{x};\boldsymbol{\theta })\), for OOD detection [1, 50, 42, 39]; however, these approaches are completely separate from classification.

Selective classification, or misclassification detection, has also been investigated in deep learning scenarios. Initially examined in [8, 16], the task has since been addressed by a number of approaches that target the classifier f through novel training losses and/or architectural adjustments [37, 3, 9]. Post-hoc approaches are fewer. DOCTOR [11] provides theoretical justification for using the \(l_2\)-norm of the softmax output \(||\boldsymbol{\pi }||_2\) as a confidence score for detecting misclassifications; however, we find its behaviour similar to that of MSP and \(\mathcal H\) (see supplemental material).

There also exist general approaches for uncertainty estimation that are then evaluated using the above tasks, e.g. Bayesian Neural Networks [23], MC-Dropout [7], Deep Ensembles [30], Dirichlet Networks [34, 35] and DDU [38].

The two works closest to ours are [24] and [27]. [24] investigates selective classification under covariate shift for the natural language processing task of question answering. In the case of covariate shift, valid predictions can still be produced on the shifted data, which by our definition is not possible for OOD data (see Sect. 2). Thus the problem setting there is different to our work. We remark that it would be of interest to extend this work to investigate selective classification with covariate shift for tasks in computer vision. [27] introduces the idea that ID✗ and OOD data should be rejected together and investigates the performance of a range of existing approaches. They examine both training and post-hoc methods (comparing different f and g) on SCOD (which they term unknown detection), as well as misclassification detection and OOD detection. They do not provide a novel approach targeting SCOD, and consider a single setting of \((\alpha , \beta )\), where \(\alpha \) is not specified and \(\beta = 0.5\).

7 Concluding Remarks

In this work, we consider the performance of existing methods for OOD detection on selective classification with out-of-distribution data (SCOD). We show how their improved OOD detection vs the MSP baseline often comes at the cost of inferior SCOD performance. Furthermore, we find their performance is inconsistent over different OOD datasets. In order to improve SCOD performance over the baseline, we develop SIRC. Our approach aims to retain information, which is useful for detecting misclassifications, from a softmax-based confidence score, whilst incorporating additional information useful for identifying OOD samples. Experiments show that SIRC consistently matches or improves over the baseline approach for a wide range of datasets, CNN architectures and problem scenarios.