1 Introduction

Out-of-distribution (OOD) detection [49], i.e. identifying data samples that do not belong to the training distribution, is a task that is receiving an increasing amount of attention in the domain of deep learning [32, 33, 4, 16, 6, 19, 46, 45, 48, 22, 31, 41, 49, 50, 39]. The task is often motivated by safety-critical applications, such as healthcare and autonomous driving, where there may be a large cost associated with sending a prediction on OOD data downstream.

However, in spite of a plethora of existing research, there is generally a lack of focus in the literature on the specific motivation behind OOD detection, other than that it is often performed as part of the pipeline of another primary task, e.g. image classification. As such, the task is evaluated in isolation and formulated as binary classification between in-distribution (ID) and OOD data. In this work we consider the question: why exactly do we want to perform OOD detection during deployment? We focus on the problem setting where the primary objective is classification, and we are motivated to detect and then reject OOD data, as predictions on those samples will incur a cost. That is to say, the task is selective classification [5, 8] where OOD data has polluted the input samples. Kim et al. [27] term this problem setting unknown detection. However, we prefer to use Selective Classification in the presence of Out-of-Distribution data (SCOD), as we would like to emphasise the downstream classifier as the objective, and will refer to the task as such in the remainder of the paper.

The key difference between this problem setting and OOD detection is that both OOD data and incorrect predictions on ID data will incur a cost [27]. It does not matter if we reject an ID sample if it would be incorrectly classified anyway. As such we can view the task as separating correctly predicted ID samples (ID✓) from misclassified ID samples (ID✗) and OOD samples. This reveals a potential blind spot in designing approaches solely for OOD detection, as the cost of ID misclassifications is ignored. The key contributions of this work are:

  1. Building on initial results from [27] that show poor SCOD performance for existing methods designed for OOD detection, we provide novel insight into the behaviour of different post-hoc (after-training) detection methods for the task of SCOD. Improved OOD detection often comes directly at the expense of SCOD performance. Moreover, the relative SCOD performance of different methods varies with the proportion of OOD data found in the test distribution, the relative cost of accepting ID✗ vs OOD, as well as the distribution from which the OOD data samples are drawn.

  2. We propose a novel method targeting SCOD, Softmax Information Retaining Combination (SIRC), that aims to improve the OOD|ID✓ separation of softmax-based methods, whilst retaining their ability to identify ID✗. It consistently outperforms or matches the baseline maximum softmax probability (MSP) approach over a wide variety of OOD datasets and convolutional neural network (CNN) architectures, unlike existing OOD detection methods.

2 Preliminaries

Neural Network Classifier. For a K-class classification problem we learn the parameters \(\boldsymbol{\theta }\) of a discriminative model \(P(y|\boldsymbol{x};\boldsymbol{\theta })\) over labels \(y \in \mathcal Y = \{\omega _k\}_{k=1}^K\) given inputs \(\boldsymbol{x} \in \mathcal X = \mathbb R^D\), using finite training dataset \(\mathcal D_\text {tr} = \{y^{(n)},\boldsymbol{x}^{(n)}\}_{n=1}^{N}\) sampled independently from true joint data distribution \(p_\text {tr}(y,\boldsymbol{x})\). This is done in order to make predictions \(\hat{y}\) given new inputs \(\boldsymbol{x}^* \sim p_\text {tr}(\boldsymbol{x})\) with unknown labels,

$$\begin{aligned} \hat{y} = f(\boldsymbol{x}^*) = \mathop {\mathrm {arg\,max}}\limits _\omega P(\omega |\boldsymbol{x}^*;\boldsymbol{\theta })~, \end{aligned}$$
(1)

where f refers to the classifier function. In our case, the parameters \(\boldsymbol{\theta }\) belong to a deep neural network with categorical softmax output \(\boldsymbol{\pi }\in [0,1]^K\),

$$\begin{aligned} P(\omega _i|\boldsymbol{x};\boldsymbol{\theta }) = \pi _i(\boldsymbol{x};\boldsymbol{\theta }) = \exp v_i(\boldsymbol{x})/\sum _{k=1}^K \exp v_k(\boldsymbol{x})~, \end{aligned}$$
(2)

where the logits \(\boldsymbol{v} = \boldsymbol{W} \boldsymbol{z} + \boldsymbol{b} \quad (\in \mathbb R^K)\) are the output of the final fully-connected layer with weights \(\boldsymbol{W} \in \mathbb R^{K\times L}\), bias \(\boldsymbol{b} \in \mathbb R^K\), and final hidden layer features \(\boldsymbol{z} \in \mathbb R^L\) as inputs. Typically \(\boldsymbol{\theta }\) are learnt by minimising the cross entropy loss, such that the model approximates the true conditional distribution \(P_\text {tr}(y|\boldsymbol{x})\),

$$\begin{aligned} \mathcal L_\text {CE}(\boldsymbol{\theta })&= -\frac{1}{N}\sum _{n=1}^{N}\sum _{k=1}^K \delta (y^{(n)}, \omega _k)\log P(\omega _k|\boldsymbol{x}^{(n)};\boldsymbol{\theta }) \\&\approx -\mathbb E_{p_\text {tr}(\boldsymbol{x})}\left[ \sum _{k=1}^K P_\text {tr}(\omega _k|\boldsymbol{x})\log P(\omega _k|\boldsymbol{x};\boldsymbol{\theta })\right] = \mathbb {E}_{p_\text {tr}}\left[ KL\left[ P_\text {tr}||P_{\boldsymbol{\theta }}\right] \right] + A~, \nonumber \end{aligned}$$
(3)

where \(\delta (\cdot ,\cdot )\) is the Kronecker delta, A is a constant with respect to \(\boldsymbol{\theta }\) and KL\([\cdot ||\cdot ]\) is the Kullback-Leibler divergence.
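As a concrete illustration, the sketch below sets up the classifier of Eqs. 1–3 in PyTorch: a final linear layer over features \(\boldsymbol{z}\), the softmax output \(\boldsymbol{\pi }\), the argmax prediction and the cross-entropy objective. The dimensions, toy inputs and random weights are illustrative placeholders rather than the models used in our experiments.

```python
# Minimal sketch of the classifier in Eqs. (1)-(3): a final linear layer over
# features z producing logits v, a softmax over the logits, and the argmax
# prediction. Shapes and the toy batch are illustrative assumptions.
import torch
import torch.nn.functional as F

K, L = 200, 2048                      # number of classes, feature dimension
W = torch.randn(K, L) * 0.01          # final-layer weights (Eq. 2 context)
b = torch.zeros(K)                    # final-layer bias

def classify(z):
    """z: (N, L) final hidden-layer features -> logits, softmax, prediction."""
    v = z @ W.T + b                   # logits v = Wz + b, shape (N, K)
    pi = F.softmax(v, dim=-1)         # categorical softmax output (Eq. 2)
    y_hat = pi.argmax(dim=-1)         # prediction (Eq. 1)
    return v, pi, y_hat

# Cross-entropy training objective (Eq. 3) on a toy batch.
z = torch.randn(8, L)
y = torch.randint(0, K, (8,))
v, pi, y_hat = classify(z)
loss = F.cross_entropy(v, y)          # equals -mean log P(y|x; theta)
```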

Selective Classification. A selective classifier [5] can be formulated as a pair of functions, the aforementioned classifier \(f(\boldsymbol{x})\) (in our case given by Eq. 1) that produces a prediction \(\hat{y}\), and a binary rejection function

$$\begin{aligned} g(\boldsymbol{x};t) = {\left\{ \begin{array}{ll} 0\text { (reject prediction)}, &{}\text {if }S(\boldsymbol{x}) < t\\ 1\text { (accept prediction)}, &{}\text {if }S(\boldsymbol{x}) \ge t~, \end{array}\right. } \end{aligned}$$
(4)

where t is an operating threshold and S is a scoring function which is typically a measure of predictive confidence (or \(-S\) measures uncertainty). Intuitively, a selective classifier chooses to reject if it is uncertain about a prediction.
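For clarity, a minimal sketch of the rejection mechanism of Eq. 4 is given below; `score` stands in for any confidence score S, and the function name and shapes are illustrative assumptions.

```python
# Sketch of the selective classifier (f, g) of Eq. (4): accept the prediction
# when the confidence score S(x) clears the threshold t, reject otherwise.
import torch

def selective_predict(logits, score, t):
    """logits: (N, K); score: (N,) confidence S(x); t: operating threshold.
    Returns predictions and a boolean accept mask (True means g = 1, accept)."""
    y_hat = logits.argmax(dim=-1)     # f(x), Eq. (1)
    accept = score >= t               # g(x; t), Eq. (4)
    return y_hat, accept
```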

Problem Setting. We consider a scenario where, during deployment, classifier inputs \(\boldsymbol{x}^*\) may be drawn from either the training distribution \(p_\text {tr}(\boldsymbol{x})\) (ID) or another distribution \(p_\text {OOD}(\boldsymbol{x})\) (OOD). That is to say,

$$\begin{aligned} \boldsymbol{x}^* \sim p_\text {mix}(\boldsymbol{x}), \quad p_\text {mix}(\boldsymbol{x}) = \alpha p_\text {tr}(\boldsymbol{x}) + (1-\alpha )p_\text {OOD}(\boldsymbol{x})~, \end{aligned}$$
(5)

where \(\alpha \in [0,1]\) reflects the proportion of ID to OOD data found in the wild. Here “Out-of-Distribution” inputs are defined as those drawn from a distribution with label space that does not intersect with the training label space \(\mathcal Y\) [49]. For example, an image of a car is considered OOD for a CNN classifier trained to discriminate between different types of pets.
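The following sketch illustrates the sampling process of Eq. 5, with `id_data` and `ood_data` as placeholder tensors standing in for the two distributions; it is an illustrative assumption rather than the exact procedure used in our experiments (see Sect. 5.3).

```python
# Illustrative sketch of the deployment-time input distribution of Eq. (5):
# each test input is ID with probability alpha and OOD otherwise. The two
# "datasets" here are placeholder (N, D) tensors.
import torch

def sample_p_mix(id_data, ood_data, alpha, n):
    """Draw n inputs from p_mix = alpha * p_tr + (1 - alpha) * p_OOD."""
    from_id = torch.rand(n) < alpha                    # Bernoulli(alpha) per sample
    id_idx = torch.randint(0, len(id_data), (n,))
    ood_idx = torch.randint(0, len(ood_data), (n,))
    x = torch.where(from_id.view(-1, 1), id_data[id_idx], ood_data[ood_idx])
    return x, from_id                                  # inputs and ID/OOD indicators
```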

We now define the predictive loss on an accepted sample as

$$\begin{aligned} \mathcal L_\text {pred}(f(\boldsymbol{x})) = {\left\{ \begin{array}{ll} 0, &{}\text {if }\boldsymbol{x}\text { is ID and }f(\boldsymbol{x})\text { is correct (ID✓)}\\ \beta , &{}\text {if }\boldsymbol{x}\text { is ID and }f(\boldsymbol{x})\text { is incorrect (ID✗)}\\ 1-\beta , &{}\text {if }\boldsymbol{x}\text { is OOD}~, \end{array}\right. } \end{aligned}$$

(6)

where \(\beta \in [0,1]\), and define the selective risk as in [8],

$$\begin{aligned} R(f,g;t) = \frac{\mathbb E_{p_\text {mix}(\boldsymbol{x})}[g(\boldsymbol{x};t)\mathcal L_\text {pred}(f(\boldsymbol{x}))]}{\mathbb E_{p_\text {mix}(\boldsymbol{x})}[g(\boldsymbol{x};t)]}~, \end{aligned}$$
(7)

which is the average loss of the accepted samples. We are only concerned with the relative cost of ID✗ and OOD samples, so we use a single parameter \(\beta \).
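A sketch of the empirical selective risk is given below, using the per-sample loss of Eq. 6 (0 for ID✓, \(\beta \) for ID✗, \(1-\beta \) for OOD); the array-based interface and function name are illustrative assumptions.

```python
# Sketch of the empirical selective risk of Eq. (7), with the per-sample loss
# of Eq. (6). Inputs are illustrative numpy arrays over a mixed ID/OOD stream.
import numpy as np

def selective_risk(score, t, is_ood, is_correct, beta=0.5):
    """score: (N,) confidence S(x); is_ood/is_correct: (N,) booleans
    (is_correct is only meaningful where is_ood is False)."""
    accept = score >= t                               # g(x; t)
    loss = np.where(is_ood, 1.0 - beta,               # OOD cost
                    np.where(is_correct, 0.0, beta))  # ID correct / ID wrong cost
    if accept.sum() == 0:
        return 0.0                                    # nothing accepted
    return (loss * accept).sum() / accept.sum()       # Eq. (7)
```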

The objective is to find a classifier and rejection function \((f,g)\) that minimise \(R(f,g;t)\) for some given setting of t. We focus on comparing post-hoc (after-training) methods in this work, where g or equivalently S is varied with f fixed. This removes confounding factors that may arise from the interactions of different training-based and post-hoc methods, as they can often be freely combined.

In practice, both \(\alpha \) and \(\beta \) will depend on the deployment scenario. However, whilst \(\beta \) can be set freely by the practitioner, \(\alpha \) is outside of the practitioner’s control and their knowledge of it is likely to be very limited.

Fig. 1.

Illustrative sketch showing how SCOD differs from OOD detection. Densities of OOD samples, misclassifications (ID✗) and correct predictions (ID✓) are shown with respect to confidence score S. For OOD detection the aim is to separate OOD|ID✗, ID✓, whilst for SCOD the data is grouped as OOD, ID✗ |ID✓.

It is worth contrasting the SCOD problem setting with OOD detection. SCOD aims to separate OOD, ID✗ |ID✓, whilst for OOD detection the data is grouped as OOD|ID✗, ID✓ (see Fig. 1). We note that previous works [26, 34, 35, 38, 41] refer to different types of predictive uncertainty, namely aleatoric and epistemic. The former arises from uncertainty inherent in the data (i.e. the true conditional distribution \(P_\text {tr}(y|\boldsymbol{x})\)) and as such is irreducible, whilst the latter can be reduced by having the model learn from additional data. Typically, it is argued that it is useful to distinguish these types of uncertainty at prediction time. For example, epistemic uncertainty should be an indicator of whether a test input \(\boldsymbol{x}^*\) is OOD, whilst aleatoric uncertainty should reflect the level of class ambiguity of an ID input. An interesting result within our problem setting is that the conflation of these different types of uncertainty may not be an issue, since there is no need to separate ID✗ from OOD: both should be rejected.

3 OOD Detectors Applied to SCOD

As the explicit objective of OOD detection differs from that of SCOD, it is of interest to understand how existing detection methods behave for SCOD. Previous work [27] has empirically shown that some existing OOD detection approaches perform worse than the MSP baseline on SCOD, and in this section we shed additional light on why this is the case.

Fig. 2.

Illustrations of how a detection method can improve over a baseline. For OOD detection we can either have OOD distributed further away from ID✓, or ID✗ distributed closer to ID✓. For SCOD we want both OOD and ID✗ to be further away from ID✓. Thus, we can see how improving OOD detection may in fact be at odds with SCOD.

Improving Performance: OOD Detection vs SCOD. In order to build an intuition, we can consider, qualitatively, how detection methods can improve performance over a baseline, with respect to the distributions of OOD and ID✗ relative to ID✓. This is illustrated in Fig. 2. For OOD detection the objective is to better separate the distributions of ID and OOD data. Thus, we can either find a confidence score S that, compared to the baseline, has OOD distributed further away from ID✓, and/or has ID✗ distributed closer to ID✓. In comparison, for SCOD, we want both OOD and ID✗ to be distributed further away from ID✓ than the baseline. Thus there is a conflict between the two tasks as, for ID✗, the desired behaviour of confidence score S will be different.

Existing Approaches Sacrifice SCOD by Conflating ID✓ and ID✗. Considering post-hoc methods, the baseline confidence score S used is Maximum Softmax Probability (MSP) [16]. Improvements in OOD detection are often achieved by moving away from the softmax \(\boldsymbol{\pi }\) in order to better capture the differences between ID and OOD data. Energy [33] and Max Logit [14] consider the logits \(\boldsymbol{v}\) directly, whereas the Mahalanobis detector [31] and DDU [38] build generative models using Gaussians over the features \(\boldsymbol{z}\). ViM [48] and Gradnorm [21] incorporate class-agnostic, feature-based information into their scores.

Recall that typically a neural network classifier learns a model \(P(y|\boldsymbol{x};\boldsymbol{\theta })\) to approximate the true conditional distribution \(P_\text {tr}(y|\boldsymbol{x})\) of the training data (Eqs. 2, 3). As such, scores S extracted from the softmax outputs \(\boldsymbol{\pi }\) should best reflect how likely a prediction on ID data is going to be correct or not (and this is indeed the case in our experiments in Sect. 5). As the above (post-hoc) OOD detection approaches all involve moving away from the modelled \(P(y|\boldsymbol{x};\boldsymbol{\theta })\), we would expect worse separation between ID✗ and ID✓ even if overall OOD is better distinguished from ID. Figure 3 shows empirically how well different types of data are separated using MSP (\(\pi _\text {max}\)) and Energy (\(\log \sum _k\exp v_k\)), by plotting false positive rate (FPR) against true positive rate (TPR). Lower FPR indicates better separation of the negative class away from the positive class. Although Energy has better OOD detection performance compared to MSP, this is actually because the separation between ID✗ and ID✓ is much less for Energy, whilst the behaviour of OOD relative to ID✓ is not meaningfully different to the MSP baseline. Therefore, SCOD performance for Energy is worse in this case. Another way of looking at it would be that for OOD detection, MSP does worse as it conflates ID with OOD, however, this doesn’t harm SCOD performance as much, as those ID samples are mostly incorrect anyway. The ID dataset is ImageNet-200 [27], OOD dataset is iNaturalist [22] and the model is ResNet-50 [13].
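For reference, the two scores compared in Fig. 3 can be computed from a single forward pass as sketched below; the function names are ours and the shapes are assumed for illustration.

```python
# Sketch of the two post-hoc scores compared in Fig. 3: MSP uses the softmax
# output, Energy works on the logits directly. Higher score = more confident.
import torch
import torch.nn.functional as F

def msp_score(logits):
    """Maximum softmax probability, pi_max [16]."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

def energy_score(logits):
    """Energy score, log sum_k exp v_k [33], computed on the logits."""
    return torch.logsumexp(logits, dim=-1)
```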

Fig. 3.

Left: False positive rate (FPR) of OOD samples plotted against true positive rate (TPR) of ID samples. Energy performs better (lower) for OOD detection relative to the MSP baseline. Right: FPR of ID✗ and OOD samples against TPR of ID✓. Energy is worse than the baseline at separating ID✗ |ID✓ and no better for OOD|ID✓, meaning it is worse for SCOD. Energy's improved OOD detection performance arises from pushing ID✗ closer to ID✓. The ID dataset is ImageNet-200, OOD dataset is iNaturalist and the model is ResNet-50.

4 Targeting SCOD – Retaining Softmax Information

We would now like to develop an approach that is tailored to the task of SCOD. We have discussed how we expect softmax-based methods, such as MSP, to perform best for distinguishing ID✗ from ID✓, and how existing approaches for OOD detection improve over the baseline, in part, by sacrificing this. As such, to improve over the baseline for SCOD, we will aim to retain the ability to separate ID✗ from ID✓ whilst increasing the separation between OOD and ID✓.

Combining Confidence Scores. Inspired by Gradnorm [21] and ViM [48] we consider the combination of two different confidence scores \(S_1, S_2\). We shall consider \(S_1\) our primary score, which we wish to augment by incorporating \(S_2\). For \(S_1\) we investigate scores that are strong for selective classification on ID data, but are also capable of detecting OOD data – MSP and (the negative of) softmax entropy, \((-)\mathcal H[\boldsymbol{\pi }]\). For \(S_2\), the score should be useful in addition to \(S_1\) in determining whether data is OOD or not. We should consider scores that capture different information about OOD data from the post-softmax \(S_1\) if we want to improve OOD|ID✓. We choose to examine the \(l_1\)-norm of the feature vector \(||\boldsymbol{z}||_1\) from [21] and the negative of the Residual score \(-||\boldsymbol{z}^{P^\bot }||_2\) from [48], as these scores capture class-agnostic information at the feature level. Note that although \(||\boldsymbol{z}||_1\) and Residual have previously been shown to be useful for OOD detection in [21, 48], we do not expect them to be useful for identifying misclassifications. They are separate from the classification layer defined by \((\boldsymbol{W},\boldsymbol{b})\), so they are far removed from the categorical \(P(y|\boldsymbol{x};\boldsymbol{\theta })\) modelled by the softmax.
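The candidate scores discussed above can be sketched as follows. The function names are ours; the Residual score additionally requires the principal subspace of the ID training features, whose estimation (described in [48]) is not reproduced here.

```python
# Sketch of the S1 / S2 candidates: softmax-based S1 (MSP or negative entropy)
# and class-agnostic, feature-based S2 (the l1-norm of z from [21]; the
# Residual score of [48] is only indicated, its subspace is assumed given).
import torch
import torch.nn.functional as F

def s1_msp(logits):
    return F.softmax(logits, dim=-1).max(dim=-1).values

def s1_neg_entropy(logits):
    log_pi = F.log_softmax(logits, dim=-1)
    return (log_pi.exp() * log_pi).sum(dim=-1)        # -H[pi]

def s2_feature_l1(z):
    return z.abs().sum(dim=-1)                        # ||z||_1 [21]

def s2_neg_residual(z, P_perp):
    """Negative Residual score [48]: -||z^{P_perp}||_2, where P_perp is a basis
    of the subspace orthogonal to the principal subspace of the ID training
    features (its estimation is not shown here)."""
    return -(z @ P_perp).norm(dim=-1)
```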

Softmax Information Retaining Combination (SIRC). We want to create a combined confidence score \(C(S_1,S_2)\) that retains \(S_1\)’s ability to distinguish ID✗ |ID✓ but is also able to incorporate \(S_2\) in order to augment OOD|ID✓. We develop our approach based on the following set of assumptions:

  • \(S_1\) will be higher for ID✓ and lower for ID✗ and OOD.

  • \(S_1\) is bounded by maximum value \(S_1^\text {max}\).

  • \(S_2\) is unable to distinguish ID✗ |ID✓, but is lower for OOD compared to ID.

  • \(S_2\) is useful in addition to \(S_1\) for separating OOD|ID.

We propose to combine \(S_1\) and \(S_2\) using

$$\begin{aligned} C(S_1,S_2) = -(S^{\max }_1-S_1)\left( 1+\exp (-b[S_2-a])\right) ~, \end{aligned}$$
(8)

where \(a, b\) are parameters chosen by the practitioner. The idea is for the accept/reject decision boundary of C to be in the shape of a sigmoid on the \((S_1,S_2)\)-plane (see Fig. 4). As such, the behaviour of only using the softmax-based \(S_1\) is recovered for ID✗ |ID✓ as \(S_2\) increases, since the decision boundary tends to a vertical line. However, \(S_2\) is considered increasingly important as it decreases, allowing for improved OOD|ID✓. We term this approach Softmax Information Retaining Combination (SIRC).

The parameters \(a, b\) allow the method to be adjusted to different distributional properties of \(S_2\). Rearranging Eq. 8,

$$\begin{aligned} S_1 = S_1^\text {max} + C/[1+\exp (-b[S_2-a])]~, \end{aligned}$$
(9)

we see that a controls the vertical placement of the sigmoid, and b the sensitivity of the sigmoid to \(S_2\). We use the empirical mean and standard deviation of \(S_2\), \(\mu _{S_2}, \sigma _{S_2}\), on ID data (training or validation) to set the parameters. We choose \(a = \mu _{S_2}-3\sigma _{S_2}\) so the centre of the sigmoid is below the ID distribution of \(S_2\), and we set \(b=1/\sigma _{S_2}\) to match the ID variation of \(S_2\). Note that other parameter settings are possible, and practitioners are free to tune \(a, b\) however they see fit (on ID data), but we find the above approach to be empirically effective.
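A minimal sketch of SIRC with this parameter setting is given below. The function names and tensor-based interface are illustrative assumptions; for MSP \(S_1^\text {max}=1\), whilst for negative entropy the supremum is 0.

```python
# Sketch of SIRC (Eq. 8) with a = mean(S2) - 3*std(S2) and b = 1/std(S2),
# estimated on ID (training or validation) data. Inputs are per-sample scores.
import torch

def fit_sirc_params(s2_id):
    """Set a, b from the empirical ID statistics of S2."""
    mu, sigma = s2_id.mean(), s2_id.std()
    return mu - 3.0 * sigma, 1.0 / sigma              # a, b

def sirc(s1, s2, a, b, s1_max=1.0):
    """Combined confidence C(S1, S2) of Eq. (8).
    s1_max = 1.0 for MSP; 0.0 for negative softmax entropy."""
    return -(s1_max - s1) * (1.0 + torch.exp(-b * (s2 - a)))

# Usage sketch: combine MSP with the feature l1-norm (score functions as in
# the earlier sketch), fitting a, b on ID training features z_train.
# a, b = fit_sirc_params(s2_feature_l1(z_train))
# c = sirc(s1_msp(logits_test), s2_feature_l1(z_test), a, b, s1_max=1.0)
```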

Figure 4 compares different methods of combination by plotting ID✓, ID✗ and OOD data densities on the \((S_1,S_2)\)-plane. Other than SIRC, we consider the combination methods used in ViM, \(C=S_1 + cS_2\), where c is a user-set parameter, and in Gradnorm, \(C=S_1 S_2\). The overlaid contours of C represent decision boundaries for values of t. We see that the linear decision boundary of \(C=S_1 + cS_2\) must trade off significant performance in ID✗ |ID✓ in order to gain OOD|ID✓ (through varying c), whilst \(C=S_1 S_2\) sacrifices the ability to separate ID✗ |ID✓ well for higher values of \(S_1\). We also note that \(C=S_1S_2\) is not robust to different ID means of \(S_2\). For example, arbitrarily adding a constant D to \(S_2\) will completely change the behaviour of the combined score. On the other hand, SIRC is designed to be robust to this sort of variation between different \(S_2\). Figure 4 also shows an alternative parameter setting for SIRC, where a is lower and b is higher. Here more of the behaviour of only using \(S_1\) is preserved, but \(S_2\) contributes less. It is also empirically observable that the assumption that \(S_2\) (in this case \(||\boldsymbol{z}||_1\)) is not useful for distinguishing ID✓ from ID✗ holds; in practice this can be verified on ID validation data when selecting \(S_2\).

We also note that although we have chosen specific \(S_1,S_2\) in this work, SIRC can be applied to any S that satisfy the above assumptions. As such it has the potential to improve beyond the results we present, given better individual S.

Fig. 4.

Comparison of different methods of combining confidence scores \(S_1,S_2\) for SCOD. ID✓, ID✗ and OOD distributions are displayed using kernel density estimate contours. Graded contours for the different combination methods are then overlaid (lighter means higher combined score). We see that our method, SIRC (centre right), is able to better retain ID✗ |ID✓ whilst improving OOD|ID✓. An alternate parameter setting for SIRC, with a stricter adherence to \(S_1\), is also shown (far right). The ID dataset is ImageNet-200, the OOD dataset iNaturalist and the model ResNet-50. SIRC parameters are found using ID training data; the plotted distributions are test data.

5 Experimental Results

We present experiments across a range of CNN architectures and ImageNet-scale OOD datasets. Extended results can be found in the supplemental material.

Data, Models and Training. For our ID dataset we use ImageNet-200 [27], which contains a subset of 200 ImageNet-1k [43] classes. It has separate training, validation and test sets. We use a variety of OOD datasets for our evaluation that display a wide range of semantics and difficulty in being identified. Near-ImageNet-200 (Near-IN-200) [27] is constructed from the remaining ImageNet-1k classes that are semantically similar to ImageNet-200, so it is especially challenging to detect. Caltech-45 [27] is a subset of the Caltech-256 [12] dataset with non-overlapping classes to ImageNet-200. Openimage-O [48] is a subset of the Open Images V3 [29] dataset selected to be OOD with respect to ImageNet-1k. iNaturalist [22] and Textures [48] are similarly constructed from their respective source datasets [47, 2]. Colorectal [25] is a collection of histological images of human colorectal cancer, whilst Colonoscopy is a dataset of frames taken from colonoscopic video of gastrointestinal lesions [36]. Noise is a dataset of square images where the resolution, contrast and pixel values are randomly generated (for details see the supplemental material). Finally, ImageNet-O [18] is a dataset OOD to ImageNet-1k that is adversarially constructed using a trained ResNet. Note that we exclude a number of OOD datasets from [27] and [22] after discovering that they contain ID examples.

We train ResNet-50 [13], DenseNet-121 [20] and MobileNetV2 [44] using hyperparameters based around standard ImageNet settings. Full training details can be found in the supplemental material. For each architecture we train 5 models independently using random seeds \(\{1,\dots ,5\}\) and report the mean result over the runs. The supplemental material contains results on single pre-trained ImageNet-1k models, BiT ResNetV2-101 [28] and PyTorch DenseNet-121.

Detection Methods for SCOD. We consider four variations of SIRC using the components {MSP, \(\mathcal H\)} \(\times \) {\(||\boldsymbol{z}||_1\), Residual}, as well as the components individually. We additionally evaluate various existing post-hoc methods: MSP [16], Energy [33], ViM [48] and Gradnorm [21]. For SIRC and ViM we use the full ID train set to determine parameters. Results for additional approaches, as well as further details pertaining to the methods, can be found in the supplemental material.

5.1 Evaluation Metrics

For evaluating different scoring functions S for the SCOD problem setting we consider a number of metrics. Arrows (\(\uparrow \downarrow \)) indicate whether higher/lower is better. (For illustrations and additional metrics see the supplemental material.)

  • Area Under the Risk-Recall curve (AURR)\(\downarrow \). We consider how empirical risk (Eq. 7) varies with recall of ID✓, and aggregate performance over different t by calculating the area under the curve. As recall is only measured over ID✓, the base accuracy of f is not properly taken into account. Thus, this metric is only suitable for comparing different g with f fixed. To give an illustrative example, an \((f,g)\) pair where the classifier f is only able to produce a single correct prediction will have perfect AURR as long as S assigns that correct prediction the highest confidence (lowest uncertainty) score. Note that results for the AURC metric [27, 10] can be found in the supplemental material, although we omit them from the main paper as they are not notably different from AURR.

  • Risk@Recall=0.95 (Risk@95)\(\downarrow \). Since a rejection threshold t must be selected at deployment, we also consider a particular setting of t such that 95% of ID✓ is recalled. In practice, the corresponding value of t could be found on a labelled ID validation set before deployment, without the use of any OOD data. It is worth noting that differences tend to be greater for this metric between different S as it operates around the tail of the positive class.

  • Area Under the ROC Curve (AUROC)\(\uparrow \). Since we are interested in rejecting both ID✗ and OOD, we can consider ID✓ as the positive class, and ID✗, OOD as separate negative classes. Then we can evaluate the AUROC of OOD|ID✓ and ID✗ |ID✓ independently. The AUROC for a specific value of \(\alpha \) would then be a weighted average of the two different AUROCs. This is not a direct measure of risk, but it does measure the separation between different empirical distributions. Note that, for reasons similar to those for AURR, this metric is only valid for fixed f.

  • False Positive Rate@Recall=0.95 (FPR@95)\(\downarrow \). FPR@95 is similar to AUROC, but is taken at a specific t. It measures the proportion of the negative class accepted when the recall of the positive class (or true positive rate) is 0.95. (A code sketch of these threshold-based metrics is given below the list.)
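A brief sketch of the threshold-based metrics (Risk@95 and FPR@95) is given below; it reuses `selective_risk` from the earlier sketch in Sect. 2, and the array-based interface and function names are illustrative assumptions.

```python
# Sketch of the threshold-based metrics: the threshold t is set so that 95% of
# the positive class (ID correct) is recalled, then risk / FPR are read off.
import numpy as np

def threshold_at_recall(score_pos, recall=0.95):
    """Threshold t such that approximately `recall` of the positives score >= t."""
    return np.quantile(score_pos, 1.0 - recall)

def fpr_at_recall(score_pos, score_neg, recall=0.95):
    t = threshold_at_recall(score_pos, recall)
    return (score_neg >= t).mean()                    # proportion of negatives accepted

def risk_at_recall(score, is_ood, is_correct, beta=0.5, recall=0.95):
    pos = score[~is_ood & is_correct]                 # ID-correct scores
    t = threshold_at_recall(pos, recall)
    return selective_risk(score, t, is_ood, is_correct, beta)  # earlier sketch
```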

5.2 Separation of ID✗ |ID✓ and OOD|ID✓ Independently

Table 1. %AUROC and %FPR@95 with ID✓ as the positive class, considering ID✗ and each OOD dataset separately. Full results are for ResNet-50 trained on ImageNet-200. We show abridged results for MobileNetV2 and DenseNet-121. Bold indicates best performance, underline 2nd or 3rd best and we show the mean over models from 5 independent training runs. Variants of SIRC are shown as tuples of their components (\(S_1\),\(S_2\)). We also show error rate on ID data. SIRC is able to consistently match or improve over \(S_1\) for OOD|ID✓, at a negligible cost to ID✗ |ID✓. Existing OOD detection methods are significantly worse for ID✗ |ID✓ and inconsistent at improving OOD|ID✓.

Table 1 shows %AUROC and %FPR@95 with ID✓ as the positive class and ID✗, OOD independently as different negative classes (see Sect. 5.1). In general, we see that SIRC, compared to \(S_1\), is able to improve OOD|ID✓ whilst incurring only a small (\(<0.2\) %AUROC) reduction in the ability to distinguish ID✗ |ID✓, across all 3 architectures. On the other hand, non-softmax methods designed for OOD detection show poor ability to identify ID✗, with performance ranging from \(\sim 8\) %AUROC worse than MSP down to \(\sim 50\%\) AUROC (random guessing). Furthermore, they cannot consistently outperform the baseline when separating OOD|ID✓, in line with the discussion in Sect. 3.

SIRC is Robust to Weak \(\boldsymbol{S_2}\). Although for the majority of OOD datasets SIRC is able to outperform \(S_1\), this is not always the case. For these latter instances, we can see that \(S_2\) individually is not useful, e.g. for ResNet-50 on Colonoscopy, Residual performs worse than random guessing. However, in cases like this the performance is still close to that of \(S_1\). As \(S_2\) will tend to be higher for these OOD datasets, the behaviour is like that for ID✗ |ID, with the decision boundaries close to vertical (see Fig. 4). As such SIRC is robust to \(S_2\) performing poorly, but is able to improve on \(S_1\) when \(S_2\) is of use. In comparison, ViM, which linearly combines Energy and Residual, is much more sensitive to when the latter stumbles. On Colonoscopy ViM has \(\sim 30\) worse %FPR@95 compared to Energy, whereas SIRC (\(-\mathcal H\), Res.) loses \(<1\%\) compared to \(-\mathcal H\).

OOD Detection Methods Are Inconsistent over Different Data. The performance of existing methods for OOD detection relative to the MSP baseline varies considerably from dataset to dataset. For example, even though ViM is able to perform very well on Textures, Noise and ImageNet-O (>50 better %FPR@95 on Noise), it does worse than the baseline on most other OOD datasets (>20 worse %FPR@95 for Near-ImageNet-200 and iNaturalist). This suggests that the inductive biases incorporated, and assumptions made, when designing existing OOD detection methods may prevent them from generalising across a wider variety of OOD data. In contrast, SIRC more consistently, albeit modestly, improves over the baseline, due to its aforementioned robustness.

Fig. 5.

AURR\(\downarrow \) and Risk@95\(\downarrow \) (\(\times 10^2\)) for different methods as \(\alpha \) and \(\beta \) vary (Eqs. 5, 6) on a mixture of all the OOD data. We also split the OOD data into qualitatively “Close” and “Far” subsets (Sect. 5.3). For high \(\alpha , \beta \), where ID✗ dominates in the risk, the MSP baseline is the best. As \(\alpha , \beta \) decrease, increasing the effect of OOD data, other methods improve relative to the baseline. SIRC is able to most consistently improve over the baseline. OOD detection methods perform better on “Far” OOD. The ID dataset is ImageNet-200, the model ResNet-50. We show the mean over 5 independent training runs. We multiply all values by \(10^2\) for readability.

5.3 Varying the Importance of OOD Data Through \(\alpha \) and \(\beta \)

At deployment, there will be a specific ratio of ID:OOD data exposed to the model. Thus, it is of interest to investigate the risk over different values of \(\alpha \) (Eq. 5). Similarly, an incorrect ID prediction may or may not be more costly than a prediction on OOD data, so we investigate different values of \(\beta \) (Eq. 6). Figure 5 shows how AURR and Risk@95 are affected as \(\alpha \) and \(\beta \) are varied independently (with the other fixed to 0.5). We use the full test set of ImageNet-200, pool OOD datasets together, and randomly sample different quantities of OOD data in order to achieve different values of \(\alpha \). We use 3 different groupings of OOD data: All, “Close” {Near-ImageNet-200, Caltech-45, Openimage-O, iNaturalist} and “Far” {Textures, Colonoscopy, Colorectal, Noise}. These groupings are based on relative qualitative semantic difference to the ID dataset (see supplemental material for example images from each dataset). Although the grouping is not formal, it serves to illustrate OOD data-dependent differences in SCOD performance.

Relative Performance of Methods Changes with \(\boldsymbol{\alpha }\) and \(\boldsymbol{\beta }\). At high \(\alpha \) and \(\beta \), where ID✗ dominates the risk, the MSP baseline performs best. However, as \(\alpha \) and \(\beta \) are decreased, and OOD data is introduced, we see that other methods improve relative to the baseline. There may be a crossover after which the ability to better distinguish OOD|ID✓ allows a method to surpass the baseline. Thus, which method to choose for deployment will depend on the practitioner’s setting of \(\beta \) and (if they have any knowledge of it at all) of \(\alpha \).

SIRC Most Consistently Improves over the Baseline. SIRC \((-\mathcal H, \text {Res.})\) is able to outperform the baseline most consistently over the different scenarios and settings of \(\alpha , \beta \), only doing worse for ID✗-dominated cases (\(\alpha , \beta \) close to 1). This is because SIRC has close to baseline ID✗ |ID✓ performance and is superior for OOD|ID✓. In comparison, ViM and Energy, which conflate ID✗ and ID✓, are worse than the baseline for most (if not all) values of \(\alpha , \beta \). Their behaviour on the different groupings of data illustrates how these methods may be biased towards different OOD datasets, as they significantly outperform the baseline at lower \(\alpha \) for the “Far” grouping, but always do worse on “Close” OOD data.

Fig. 6.

The change in %FPR@95\(\downarrow \) relative to the MSP baseline for different methods. Different data classes are shown negative|positive. Although OOD detection methods are able to improve OOD|ID, they do so mainly at the expense of ID✗ |ID✓ rather than by improving OOD|ID✓. SIRC is able to improve OOD|ID✓ with minimal loss to ID✗ |ID✓, alongside modest improvements for OOD|ID. Results for OOD are averaged over all OOD datasets. The ID dataset is ImageNet-200 and the model ResNet-50.

5.4 Comparison Between SCOD and OOD Detection

Figure 6 shows the difference in %FPR@95 relative to the MSP baseline for different combinations of negative|positive data classes (ID✗ |ID✓, OOD|ID✓, OOD|ID), where OOD results are averaged over all datasets and training runs. In line with the discussion in Sect. 3, we observe that the non-softmax OOD detection methods are able to improve over the baseline for OOD|ID, but this comes mostly at the cost of inferior ID✗ |ID✓ rather than due to better OOD|ID✓, so they will do worse for SCOD. SIRC on the other hand is able to retain much more ID✗ |ID✓ performance whilst improving on OOD|ID✓, allowing it to have better OOD detection and SCOD performance compared to the baseline.

6 Related Work

There is extensive existing research into OOD detection, a survey of which can be found in [49]. To improve over the MSP baseline in [16], early post-hoc approaches, primarily experimenting on CIFAR-scale data, such as ODIN [32], Mahalanobis [31] and Energy [33], explore how to extract non-softmax information from a trained network. More recent work has moved to larger-scale image datasets [22, 14]. Gradnorm [21], although motivated by the information in gradients, at its core combines information from the softmax and features together. Similarly, ViM [48] combines Energy with the class-agnostic Residual score. ReAct [45] aims to improve logit/softmax-based scores by clamping the magnitude of final-layer features. There are also many training-based approaches. Outlier Exposure [17] explores training networks to be uncertain on “known” existing OOD data, whilst VOS [4] instead generates virtual outliers during training for this purpose. [19, 46] propose that the network explicitly learn a scaling factor for the logits to improve softmax behaviour. There also exists a line of research that explores the use of generative models, \(p(\boldsymbol{x};\boldsymbol{\theta })\), for OOD detection [1, 50, 42, 39]; however, these approaches are completely separate from classification.

Selective classification, or misclassification detection, has also been investigated in deep learning scenarios. Initially examined in [8, 16], the task has since been addressed by a number of approaches that target the classifier f through novel training losses and/or architectural adjustments [37, 3, 9]. Post-hoc approaches are fewer. DOCTOR [11] provides theoretical justification for using the \(l_2\)-norm of the softmax output \(||\boldsymbol{\pi }||_2\) as a confidence score for detecting misclassifications; however, we find its behaviour similar to that of MSP and \(\mathcal H\) (see supplemental material).

There also exist general approaches for uncertainty estimation that are then evaluated using the above tasks, e.g. Bayesian Neural Networks [23], MC-Dropout [7], Deep Ensembles [30], Dirichlet Networks [34, 35] and DDU [38].

The two works closest to ours are [24] and [27]. [24] investigates selective classification under covariate shift for the natural language processing task of question answering. In the case of covariate shift, valid predictions can still be produced on the shifted data, which by our definition is not possible for OOD data (see Sect. 2). Thus the problem setting there is different to our work. We remark that it would be of interest to extend this work to investigate selective classification with covariate shift for tasks in computer vision. [27] introduces the idea that ID✗ and OOD data should be rejected together and investigates the performance of a range of existing approaches. They examine both training and post-hoc methods (comparing different f and g) on SCOD (which they term unknown detection), as well as misclassification detection and OOD detection. They do not provide a novel approach targeting SCOD, and consider a single setting of \((\alpha , \beta )\), where \(\alpha \) is not specified and \(\beta = 0.5\).

7 Concluding Remarks

In this work, we consider the performance of existing methods for OOD detection on selective classification with out-of-distribution data (SCOD). We show how their improved OOD detection vs the MSP baseline often comes at the cost of inferior SCOD performance. Furthermore, we find their performance is inconsistent over different OOD datasets. In order to improve SCOD performance over the baseline, we develop SIRC. Our approach aims to retain information, which is useful for detecting misclassifications, from a softmax-based confidence score, whilst incorporating additional information useful for identifying OOD samples. Experiments show that SIRC consistently matches or improves over the baseline approach for a wide range of datasets, CNN architectures and problem scenarios.