1 Introduction

Uncertainty estimation and robustness are essential for deploying Deep Neural Networks (DNNs) in real-world systems with different levels of autonomy, ranging from simple driving assistance functions to fully autonomous vehicles. In addition to excellent predictive performance, DNNs are expected to address different types of uncertainty (noisy, ambiguous or out-of-distribution samples, distribution shift, etc.), while ensuring real-time computational performance. These key and challenging requirements have stimulated numerous solutions and research directions, leading to significant progress in this area [10, 27, 43, 45, 57, 77]. Yet, the best performing approaches are computationally expensive [45], while faster variants struggle to disentangle different types of uncertainty [23, 52, 56].

We study a promising new line of methods, termed deterministic uncertainty methods (DUMs) [65], that has recently emerged for estimating uncertainty in a computationally efficient manner from a single forward pass [3, 48, 56, 64, 74]. To quantify uncertainty, these methods rely on statistical or geometrical properties of the hidden features of the DNN. While appealing for their good Out-of-Distribution (OOD) uncertainty estimation at low computational cost, they have been used mainly for classification tasks, and their specific regularization is often unstable when training deeper DNNs [62]. We therefore propose a new DUM technique, based on a discriminative latent space, that improves both scalability and flexibility. We achieve this while still following the DUM principle of learning a sensitive and smooth representation that mirrors the input distribution well, although without directly enforcing a Lipschitz constraint.

Our DUM, dubbed Latent Discriminant deterministic Uncertainty (LDU), is based on a DNN imbued with a set of prototypes over its latent representations. These prototypes act like a memory that allows the network to analyze features from new images in light of the “knowledge” acquired from the training data. Various forms of prototypes have been studied for anomaly detection in the past [29], often taking the shape of a dictionary of representative features. Instead, LDU learns the optimal prototypes, such that the distances to these prototypes improve both accuracy and uncertainty prediction. Indeed, to train LDU, we introduce a confidence-based loss that learns to predict the error of the DNN given the data. ConfidNet [15] and SLURP [81] have shown that an auxiliary network can be trained to predict uncertainty, at the cost of a more complex training pipeline and more inference steps. LDU is lighter, faster and needs only a single forward pass. It can be used as a pluggable learning layer on top of DNNs. We demonstrate that LDU avoids feature collapse and can be applied to multiple computer vision tasks. In addition, LDU improves the prediction accuracy over the baseline DNN without LDU.

Contributions. To summarize, our contributions are as follows: (1) LDU (Latent Discriminant deterministic Uncertainty): an efficient and scalable DUM approach for uncertainty quantification. (2) A study of LDU’s properties against feature collapse. (3) Evaluations of LDU on a range of computer vision tasks and settings (image classification, semantic segmentation, depth estimation) and the implementation of a set of powerful baselines to further encourage research in this area.

2 Related Work

In this section, we review related work from two perspectives: uncertainty quantification algorithms applied to computer vision tasks, and prototype learning in DNNs. In Table 1, we list various uncertainty quantification algorithms according to the computer vision tasks they address.

Table 1. Summary of the uncertainty estimation methods applied to the specific computer vision tasks.

2.1 Uncertainty Estimation for Computer Vision Tasks

Uncertainty for Image Classification and Semantic Segmentation. Quantifying uncertainty for classification and semantic segmentation can be done with Bayesian Neural Networks (BNNs) [10, 20, 37, 78], which estimate the posterior distribution of the DNN weights to marginalize the likelihood distribution at inference time. These approaches achieve good performance on image classification, but they do not scale well to semantic segmentation. Deep Ensembles [45] achieve state-of-the-art performance on various tasks, yet are computationally costly in both training and inference. Some techniques learn a confidence score as uncertainty [15], but struggle without sufficient negative samples to learn from. MC-Dropout [27] is a generic and easy-to-deploy approach, but its uncertainty is not always reliable [60] and it requires multiple forward passes. Deterministic Uncertainty Methods (DUMs) [2, 3, 48, 56, 64, 74, 79] are new strategies for quantifying epistemic uncertainty in DNNs with a single forward pass. Yet, except for MIR [64], to the best of our knowledge none of these techniques works on semantic segmentation.

Uncertainty for 1D/2D Regression. Regression in computer vision comprises monocular depth estimation [9, 46], optical flow estimation [71, 72], and pose estimation [12, 69]. One solution for quantifying uncertainty consists in formalizing the output of a DNN as a parametric distribution and training the DNN to estimate its parameters [43, 59]. Multi-hypothesis DNNs [40] consider the output to be a Gaussian distribution and focus on optical flow. Some techniques estimate a confidence score for regression with an auxiliary DNN [63, 81]. Deep Ensembles [45] for regression consider that each DNN outputs the parameters of a Gaussian distribution, which together form a mixture of Gaussians. Sampling-based methods [27, 54] simply apply dropout or perturbations to some layers at test time to quantify uncertainty, but their computational cost remains high compared to a single forward pass. Some DUMs [3, 64] also address regression tasks: DUE [3] is applied to a 1D regression task and MIR [64] to monocular depth estimation.

2.2 Prototype Learning in DNNs

Prototype-based learning approaches were introduced on traditional handcrafted features [47] and have recently been applied to DNNs as well, for more robust predictions [13, 29, 76, 80]. The center loss [76] helps DNNs build more discriminative features by compacting intra-class features and dispersing inter-class ones. Based on this principle, Convolutional Prototype Learning (CPL) [80] with a prototype loss also improves the intra-class compactness of the latent features. Chen et al. [13] bound the unknown classes by learning reciprocal points for better open-set recognition. Similar to [67, 75], MemAE [29] learns a memory slot of prototypes to strengthen the reconstruction error of anomalies during reconstruction. These prototype-based methods are well suited for classification but are rarely used in semantic segmentation and regression tasks.

3 Latent Discriminant Deterministic Uncertainty (LDU)

3.1 DUM Preliminaries

DUMs arise as a promising line of research for estimating epistemic uncertainty in conventional DNNs in a computationally efficient manner and from a single forward pass. DUM approaches generally focus on learning useful and informative hidden representations of a model [2, 3, 48, 56, 64, 79] by considering that the distribution of the hidden representations should be representative of the input distribution. Most conventional models suffer from the feature collapse problem [74], whereby OOD samples are mapped to feature representations similar to those of in-distribution samples, thus hindering OOD detection from these representations. DUMs address this issue through various regularization strategies that constrain the hidden representations to mimic distances from the input space. In practice, this amounts to striking a balance between sensitivity (when the input changes, the feature representation should also change) and smoothness (a small change in the input cannot generate major shifts in the feature representation) of the model. To this end, most methods enforce constraints over the Lipschitz constant of the DNN [48, 53, 74].

Formally, we define \(f_{\boldsymbol{\mathbf {\omega }}}(\cdot )\), a DNN with trainable parameters \(\boldsymbol{\mathbf {\omega }}\), and an input sample \(\textbf{x}\) from a set of images \(\mathcal {X}\). Our DNN \(f_{\boldsymbol{\mathbf {\omega }}}\) is composed of two main blocks: a feature extractor \(h_{\boldsymbol{\mathbf {\omega }}}\) and a head \(g_{\boldsymbol{\mathbf {\omega }}}\), such that \(f_{\boldsymbol{\mathbf {\omega }}}(\textbf{x})=(g_{\boldsymbol{\mathbf {\omega }}}\circ h_{\boldsymbol{\mathbf {\omega }}}) (\textbf{x})\). The feature extractor \(h_{\boldsymbol{\mathbf {\omega }}}\) computes a latent representation of \(\textbf{x}\), while \(g_{\boldsymbol{\mathbf {\omega }}}\) is the final layer that takes \(h_{\boldsymbol{\mathbf {\omega }}} (\textbf{x})\) as input and outputs the logits of \(\textbf{x}\). The bi-Lipschitz condition implies that for any pair of inputs \(\textbf{x}_1\) and \(\textbf{x}_2\) from \(\mathcal {X}\):

$$\begin{aligned} L_1 \Vert \textbf{x}_1 -\textbf{x}_2\Vert \le \Vert h_{\boldsymbol{\mathbf {\omega }}}(\textbf{x}_1) - h_{\boldsymbol{\mathbf {\omega }}}(\textbf{x}_2)\Vert \le L_2 \Vert \textbf{x}_1 -\textbf{x}_2\Vert \end{aligned}$$
(1)

where \(L_1\) and \(L_2\) are positive, bounded Lipschitz constants with \(0<L_1<1<L_2\). The upper bound enforces smoothness, an important condition for the robustness of a DNN, as it prevents over-sensitivity to perturbations in the input space of \(\textbf{x}\), i.e., the pixel space. The lower bound deals with sensitivity and strives to preserve distances in the latent space as mappings of distances from the input space, i.e., it prevents representations from being too smooth, thus avoiding feature collapse. Liu et al. [48] argue that for residual DNNs [33], \(f_{\boldsymbol{\mathbf {\omega }}}\) can be made bi-Lipschitz by forcing its residuals to be Lipschitz with sub-unitary Lipschitz constants.

There are different approaches for imposing the bi-Lipschitz constraint on a DNN; we describe the most commonly used ones in recent works [4, 7, 30, 55]. Wasserstein GAN [4] enforces the Lipschitz constraint by clipping the weights; however, this turns out to be prone to either vanishing or exploding gradients if the clipping threshold is not carefully tuned [30]. An alternative solution from GAN optimization is the gradient penalty [30], in practice an additional loss term that regularizes the \(L_2\) norm of the Jacobian of the weight matrices of the DNN; however, this can also lead to instabilities [48, 56] and slower training [56]. Spectral Normalization [7, 55] brings better stability and training speed, but on the downside it supports only a fixed, pre-defined input size, in the same manner as fully connected layers. For computer vision tasks such as semantic segmentation, typically performed on high-resolution images, constraining the input size is a strong limitation. Moreover, Postels et al. [64] argue that, in addition to the architectural constraints, these strategies for avoiding feature collapse risk overfitting epistemic uncertainty to the task of OOD detection. This motivates us to seek a new DUM strategy that does not require the network to comply with the Lipschitz constraint. The recent MIR approach [64] advances an alternative regularization strategy that adds a decoder branch to the network, forcing the intermediate activations to better cover and represent the input space. However, for high-resolution images, reconstruction is a challenging task and the network can over-focus on potentially uninformative details at the cost of global information. We detail our strategy below.
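For concreteness, the snippet below is a minimal PyTorch sketch of two of these Lipschitz-enforcing mechanisms, spectral normalization and WGAN-style weight clipping; the layer sizes and the clipping threshold are illustrative assumptions, not settings from any of the cited works.

```python
import torch
import torch.nn as nn

# Spectral normalization [7, 55]: rescales the weight matrix by its largest
# singular value, bounding the layer's Lipschitz constant by 1.
sn_layer = nn.utils.spectral_norm(nn.Linear(128, 128))

# Weight clipping [4]: crude Lipschitz control that clamps all weights
# (typically after each optimizer step); the threshold c is delicate to tune.
clip_layer = nn.Linear(128, 128)
c = 0.01
with torch.no_grad():
    for p in clip_layer.parameters():
        p.clamp_(-c, c)
```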

3.2 Discriminant Latent Space

An informative latent representation should project similar data samples close together and dissimilar ones far apart. Yet, it has long been known that in high-dimensional spaces the Euclidean distance and other related p-norms are very poor indicators of sample similarity, as most samples are nearly equally far from each other [1, 6]. At the same time, the samples of interest are often not uniformly distributed and may be projected, by means of a learned transform, onto a lower-dimensional manifold, namely the latent representation space.

Instead of focusing on preserving the potentially uninformative distances of the input space, we rather deal with distances in the lower-dimensional latent space. To this end, we propose to use a distinction maximization (DM) layer [49], which has recently been considered as a replacement for the last layer to produce better uncertainty estimates, in particular for OOD detection [49, 61]. In a DM layer, the units of the classification layer are seen as representative class prototypes, and the classification prediction is computed by analyzing the localization of the input sample w.r.t. all class prototypes, as indicated by the negative Euclidean distance. A similar idea has been considered in the few-shot learning literature, where DM layers are known as cosine classifiers [28, 66, 70]. In contrast to all these approaches, which use DM as the last layer for classification, we employ it as a hidden layer over latent representations; more specifically, we insert DM in the pre-logit layer. We argue that this better guides learning and preserves the discriminative properties of the latent representations, compared to placing DM as the last layer, where the weights are more specialized for the classification decision than for feature representation. This layer integrates easily into the architecture without impacting the training pipeline.

Formally, we denote \(\textbf{z}\in \mathbb {R}^n\) the latent representation of dimension n of \(\textbf{x}\), i.e., \(\textbf{z}=h_{\boldsymbol{\mathbf {\omega }}}(\textbf{x})\), given as input to the DM layer. Given a set \(\textbf{p}_{\boldsymbol{\mathbf {\omega }}}=\{\textbf{p}_i\}_{i=1}^m\) of m trainable vectors \(\textbf{p}_i \in \mathbb {R}^n\), we define the DM layer as follows:

$$\begin{aligned} \text{ DM}_{p}(\textbf{z}) =\begin{bmatrix} -\Vert \textbf{z}-\textbf{p}_1\Vert , \ldots , -\Vert \textbf{z}-\textbf{p}_m\Vert \end{bmatrix}^\top \end{aligned}$$
(2)

The \(L_2\) distance considered in the DM layer is not bounded, so when DM is used as an intermediate layer, relying on it could cause instability during training. In our approach, we instead use the cosine similarity \(S_c(\cdot , \cdot )\). Our DM layer now reads:

$$\begin{aligned} \text{ DM}_{p}(\textbf{z}) =\begin{bmatrix} S_c(\textbf{z},\textbf{p}_1), \ldots , S_c(\textbf{z},\textbf{p}_m) \end{bmatrix}^\top \end{aligned}$$
(3)

The vectors \(\textbf{p}_i\) can be seen as a set of prototypes in the latent space that help better place an input sample in the learned representation space, using these prototypes as references. This contrasts with prior works where DM is the last layer and the prototypes represent canonical representations of samples belonging to a class [49, 70]. Since the DM layer is hidden here, we can afford an arbitrary number of prototypes, which can define a richer latent mapping through a finer coverage of the representation space. The DM layer learns the set of weights \(\{\textbf{p}_i\}_{i=1}^m\) such that the cosine similarity between \(\textbf{z}\) and the prototypes is optimal for a given task.

We apply distinction maximization on this hidden representation and subsequently use the exponential as activation function, since it sharpens similarity values and thus facilitates the alignment of data embeddings to the corresponding prototypes in the latent space. Finally, we apply a last fully connected layer for classification on this embedding. Our DNN (see Fig. 1) can be written as:

$$\begin{aligned} f_{\boldsymbol{\mathbf {\omega }}}(\textbf{x})=\left[ g_{\boldsymbol{\mathbf {\omega }}}\circ (\text{ exp }(-\text{ DM}_{p}(h_{\boldsymbol{\mathbf {\omega }}}))) \right] (\textbf{x}) \end{aligned}$$
(4)

We can see from Eq. (4) that the prototype weights \(\textbf{p}_i\) are optimized jointly with the other DNN parameters. We argue that the \(\textbf{p}_i\) act as indicators for analyzing and emphasizing patterns in the latent representation prior to the classification prediction in the final layers.
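The following is a minimal PyTorch sketch of Eqs. (2)–(4) as we read them; the class names (DMLayer, LDUHead) and the random initialization of the prototypes are our own assumptions, and the released implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DMLayer(nn.Module):
    """Distinction maximization over latent features: cosine similarity
    between a feature z and each of the m trainable prototypes (Eq. 3)."""
    def __init__(self, num_prototypes: int, latent_dim: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, latent_dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, n) -> similarities: (batch, m)
        z = F.normalize(z, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        return z @ p.t()

class LDUHead(nn.Module):
    """Pre-logit DM layer, exponential activation, linear classifier (Eq. 4)."""
    def __init__(self, num_prototypes: int, latent_dim: int, num_classes: int):
        super().__init__()
        self.dm = DMLayer(num_prototypes, latent_dim)
        self.classifier = nn.Linear(num_prototypes, num_classes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        emb = torch.exp(-self.dm(z))   # exp(-DM_p(z))
        return self.classifier(emb)    # logits g_w(x)
```

Swapping the pre-logit layer of an existing classifier for such a head is then enough to obtain the architecture of Fig. 1.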

Fig. 1. Overview of LDU: the DNN learns a discriminative latent space thanks to learnable prototypes \(\textbf{p}_{\boldsymbol{\mathbf {\omega }}}\). The backbone computes a feature vector \(\textbf{z}\) for an input \(\textbf{x}\), and the DM layer matches it against the prototypes. The computed similarities, reflecting the position of \(\textbf{z}\) in the learned feature space, are subsequently processed by the classification layer and the uncertainty estimation layer. The dashed arrows point to the loss functions optimized for training LDU.

3.3 LDU Optimization

Given a DNN \(f_{\boldsymbol{\mathbf {\omega }}}\), we usually optimize its parameters to minimize a loss \(\mathcal {L}^{\text{ Task }}\). This can lead to prototypes specialized for solving that task which do not encapsulate uncertainty-relevant properties. Hence, we propose to tie the prototypes to uncertainty: first, by avoiding the collapse of all prototypes onto a single one; second, by constraining the latent representation \(\text{ DM}_{p}(h_{\boldsymbol{\mathbf {\omega }}})\) not to rely on a single prototype only; finally, by optimizing an MLP \(g_{\boldsymbol{\mathbf {\omega }}}^{\text{ unc }}\) on top of the latent representation \(\text{ DM}_{p}(h_{\boldsymbol{\mathbf {\omega }}})\) such that its output provides meaningful information for uncertainty estimation.

First, we add a loss to force the prototypes to be dissimilar:

$$\mathcal {L}^{\text{ Dis }} = -\sum _{i<j} \Vert \textbf{p}_i - \textbf{p}_j\Vert .$$

Then, we add a second loss that encourages the latent representation to stay close to several prototypes. We achieve this with an entropy-like loss:

$$ \mathcal {L}^{\text{ Entrop }} = \sum _{i=1}^m \sigma (\text{ DM}_{p}(h_{\boldsymbol{\mathbf {\omega }}}))_i \cdot \log (\sigma (\text{ DM}_{p}(h_{\boldsymbol{\mathbf {\omega }}}))_i), $$

where \(\sigma \) is the softmax function and the subscript i denotes the i-th coefficient of the resulting vector, the sum running over the m prototypes. Different from per-class prototypes [13, 76, 80], we obtain more discriminative features by increasing the distance between prototypes and enlarging the dispersion of the features associated with different prototypes.

We propose to train \(g_{\boldsymbol{\mathbf {\omega }}}^{\text{ unc }}\) to predict the error of the DNN, which relates the prototypes to uncertainty. Formally, given an input \(\textbf{x}\), its ground truth y (a scalar, or a vector in the case of regression) and its loss \(\mathcal {L}^{\text{ Task }}(g_{\boldsymbol{\mathbf {\omega }}}(\textbf{x}),y)\), we train \(g_{\boldsymbol{\mathbf {\omega }}}^{\text{ unc }}\) by minimizing:

$$\mathcal {L}^{\text{ Unc }} =\text{ BCE }(\left[ g_{\boldsymbol{\mathbf {\omega }}}^{\text{ unc }} \circ (\text{ exp }(-\text{ DM}_{p}(h_{\boldsymbol{\mathbf {\omega }}}))) \right] (\textbf{x}), \mathcal {L}^{\text{ Task }}(g_{\boldsymbol{\mathbf {\omega }}}(\textbf{x}),y)),$$

after normalizing \(\mathcal {L}^{\text{ Task }}(g_{\boldsymbol{\mathbf {\omega }}}(\textbf{x}),y)\) over the mini-batch such that its maximum value equals one and its minimum equals zero. BCE stands for the binary cross-entropy, which we empirically found to perform better than common alternatives such as the mean squared error and the absolute error.

Combined, these losses yield a DNN that can predict uncertainty, avoids feature collapse and can even improve prediction accuracy. To summarize, the following loss \(\mathcal {L}^{\text{ total }}\) is optimized to train a DNN containing a DM layer:

$$\begin{aligned} \mathcal {L}^{\text{ total }} = \mathcal {L}^{\text{ Task }} + \lambda (\mathcal {L}^{\text{ Entrop }} + \mathcal {L}^{\text{ Dis }} + \mathcal {L}^{\text{ Unc }}) \end{aligned}$$
(5)

where \(\lambda \) is a hyper-parameter weighting the auxiliary losses.
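A sketch of how the three auxiliary terms of Eq. (5) can be computed on one mini-batch is given below; the helper names are hypothetical, and applying the BCE on logits (rather than on sigmoid outputs) is our assumption.

```python
import torch
import torch.nn.functional as F

def ldu_losses(sim, prototypes, task_loss_per_sample, unc_head):
    """Auxiliary LDU losses of Eq. (5).

    sim:                  DM_p(h_w(x)), shape (batch, m)
    prototypes:           (m, n) trainable prototype matrix
    task_loss_per_sample: per-sample task loss, shape (batch,)
    unc_head:             MLP g_unc taking exp(-sim) as input, one output
    """
    # L_Dis: push prototypes apart (negative sum of pairwise distances)
    l_dis = -torch.pdist(prototypes).sum()

    # L_Entrop: negative entropy of softmax(DM_p); minimizing it keeps the
    # representation close to several prototypes rather than a single one
    probs = F.softmax(sim, dim=-1)
    l_entrop = (probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()

    # L_Unc: BCE between the confidence head output and the per-sample task
    # loss, min-max normalized over the mini-batch to [0, 1]
    target = task_loss_per_sample.detach()
    target = (target - target.min()) / (target.max() - target.min() + 1e-12)
    pred = unc_head(torch.exp(-sim)).squeeze(-1)
    l_unc = F.binary_cross_entropy_with_logits(pred, target)

    return l_dis, l_entrop, l_unc

# total = task_loss + lam * (l_entrop + l_dis + l_unc)   # Eq. (5)
```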

Fig. 2. PCA 2D projection of the pre-logit features of a standard MLP (left) and of a DM-MLP (right) trained on the two moons dataset. Blue and red points indicate the features of data points of the two classes respectively. The representations of the MLP overlap between the two classes, making the network prone to feature collapse, unlike the DM-MLP. (Color figure online)

3.4 Addressing Feature Collapse

In order to illustrate the feature collapse problem, we consider a toy example on the two moons dataset. We train two MLPs with two hidden layers of 17 neurons each. One MLP additionally integrates our proposed DM layer and is denoted DM-MLP, while the standard architecture is denoted MLP. The two networks reach the same classification performance, about 99% accuracy. We perform PCA on the pre-logit latent space of both networks after training and visualize the projections in Fig. 2. We observe feature collapse: the MLP assigns strongly correlated feature representations to both classes, which can lead to unreliable uncertainty estimates, whereas our DM layer disentangles the latent space better. Note that, since the networks reach the same test accuracy, feature collapse cannot be detected from accuracy alone.
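This toy experiment can be reproduced along the following lines; the number of prototypes, learning rate and number of training steps are our assumptions (the paper only specifies the two hidden layers of 17 neurons).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA

X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

class TwoMoonsNet(nn.Module):
    """Two hidden layers of 17 neurons, optionally with a pre-logit DM layer."""
    def __init__(self, use_dm: bool, m: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(2, 17), nn.ReLU(),
                                      nn.Linear(17, 17), nn.ReLU())
        self.use_dm = use_dm
        if use_dm:
            self.prototypes = nn.Parameter(torch.randn(m, 17))
        self.head = nn.Linear(m if use_dm else 17, 2)

    def pre_logits(self, x):
        z = self.backbone(x)
        if self.use_dm:  # cosine similarity to prototypes, then exp(-.)
            sim = F.normalize(z, dim=-1) @ F.normalize(self.prototypes, dim=-1).t()
            z = torch.exp(-sim)
        return z

    def forward(self, x):
        return self.head(self.pre_logits(x))

for use_dm in (False, True):
    torch.manual_seed(0)
    net = TwoMoonsNet(use_dm)
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(500):  # full-batch training on the toy set
        opt.zero_grad()
        F.cross_entropy(net(X), y).backward()
        opt.step()
    feats = net.pre_logits(X).detach().numpy()
    proj = PCA(n_components=2).fit_transform(feats)  # plot, colored by y
```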

We note that our LDU layer is a Lipschitz function, hence \(\Vert \text{ exp }(-\text{ DM}_{p}(\textbf{z}_1)) -\text{ exp }(-\text{ DM}_{p}(\textbf{z}_2)) \Vert \le k \Vert \textbf{z}_1-\textbf{z}_2 \Vert \) with \(k \in \mathbb {R}^+\). However, \(h_{\boldsymbol{\mathbf {\omega }}}\) is not necessarily a Lipschitz function, so we cannot guarantee that its features do not entangle ID and OOD data. Yet, using a distance function in the DNN [50, 53] can allow it to learn to separate the two data distributions better, as illustrated in Fig. 2.

Fig. 3. Confidence scores on the two moons dataset after the first training stage (on original data, left) and after the second stage (on synthesized outliers, right). Orange and blue points are sampled from the two classes; green points are OOD. Yellow areas indicate high confidence, blue areas uncertainty. On the left, the uncertain area lies between the two classes, yielding a confidence score related to aleatoric uncertainty. On the right, the uncertain area surrounds the dataset, yielding a confidence score related to epistemic uncertainty.

Most DUM methods aim for bi-Lipschitz DNNs with small Lipschitz constants. Yet, this is sub-optimal according to concentration theory. Indeed, let \(\textbf{X}\) be a random vector of dimension d drawn from a normal distribution \(\mathcal {N}(0,\sigma ^2 I_d)\), with \(I_d\) the identity matrix of size d, and let \(f: \mathbb {R}^d \rightarrow \mathbb {R}\) be a Lipschitz function with Lipschitz constant K. Concentration theory ([11], p. 125) stipulates that \( \mathcal {P}(|f(\textbf{X})-\mathbb {E}(f(\textbf{X}))|>t )\le 2 \exp (-\frac{t^2}{2K^2\sigma ^2}) \) for all \(t>0\). Hence, the smaller K is, the more the data concentrate around their mean, increasing feature collapse. It is therefore desirable to have a Lipschitz function that brings similar data close, while at the same time keeping dissimilar data apart.

3.5 LDU and Epistemic/Aleatoric Uncertainty

We are interested in capturing two types of uncertainty with our DNN: aleatoric and epistemic uncertainty [17, 43]. Aleatoric uncertainty is related to the inherent noise and ambiguity in data or in annotations and is often called irreducible uncertainty as it does not fade away with more data in the training set. Epistemic uncertainty is related to the model, the learning process and the amount of training data. Since it can be reduced with more training data, it is also called reducible uncertainty. Disentangling these two sources of uncertainty is generally non-trivial [56] and ensemble methods are usually superior [23, 52].

Optional Training with Synthesized Outliers. Due to limited training data, and when the penalty enforced by \(\mathcal {L}^{\text{ Task }}\) is too small, the loss term \(\mathcal {L}^{\text{ Unc }}\) may in some circumstances drive the DNN to overfit the aleatoric uncertainty. Although we did not encounter this behavior on the computer vision tasks, given the dataset sizes, it might occur on more specific data. Among other potential solutions, we propose one relying on synthesized outliers, which we illustrate on the two moons dataset. Specifically, we add noise to the data, similarly to [19, 51], and introduce an optional step for training \(g_{\boldsymbol{\mathbf {\omega }}}^{\text{ unc }}\) on these new samples. We consider a two-stage training scheme: in the first stage we train on the data without noise, and in the second we optimize only the parameters of \(g_{\boldsymbol{\mathbf {\omega }}}^{\text{ unc }}\) on the synthesized outliers. Note that for vision tasks this optional stage would require an adequate OOD synthesizer [8, 19], which is beyond the scope of this paper; we applied it only on the toy dataset. In Fig. 3 we assess the uncertainty estimation performance of this model on the two moons dataset: the confidence score relates to aleatoric uncertainty after the first training stage, and to the epistemic uncertainty of the model after the second.
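One possible realization of this second stage is sketched below, assuming a frozen embedding function and treating the noisy samples as maximal-error targets for the BCE; the paper does not specify the noise scale or the target labeling, so both are assumptions here.

```python
import torch
import torch.nn.functional as F

def train_unc_on_outliers(embed_fn, unc_head, X, noise_scale=0.3,
                          steps=200, lr=1e-3):
    """Second training stage: only g_unc is updated, on noisy copies of the
    training data treated as high-error (low-confidence) samples.

    embed_fn: frozen mapping x -> exp(-DM_p(h_w(x)))
    unc_head: the MLP g_unc (the only module left trainable)
    """
    opt = torch.optim.Adam(unc_head.parameters(), lr=lr)
    for _ in range(steps):
        outliers = X + noise_scale * torch.randn_like(X)  # synthesized outliers
        with torch.no_grad():
            emb = embed_fn(outliers)                      # backbone stays frozen
        pred = unc_head(emb).squeeze(-1)
        target = torch.ones_like(pred)     # assumption: outliers = maximal error
        loss = F.binary_cross_entropy_with_logits(pred, target)
        opt.zero_grad(); loss.backward(); opt.step()
```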

Distinguishing between the two sources of uncertainty is essential for various practical applications, such as active learning, monitoring and OOD detection. In the following, we propose a strategy for computing each type of uncertainty.

Aleatoric Uncertainty. For estimating aleatoric uncertainty in classification, the maximum class probability (MCP) [36] is a common strategy. The intuition is that a lower MCP means higher entropy, i.e., potential confusion of the classifier regarding the most likely class of the image. We use this criterion for aleatoric uncertainty in classification and semantic segmentation, while for the regression task we use \(g_{\boldsymbol{\mathbf {\omega }}}^{\text{ unc }}\) as confidence score.

Epistemic Uncertainty. To estimate epistemic uncertainty, we analyzed the latent representations of the DM layer followed by the exponential activation and found that their maximum value models uncertainty well: the position of a feature w.r.t. the learned prototypes carries information about the proximity of the current sample to in-distribution features. Still, we propose to use the output of \(g_{\boldsymbol{\mathbf {\omega }}}^{\text{ unc }}\) as confidence score, since this criterion is trained for this purpose.
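Both scores can thus be read from a single forward pass. The sketch below reflects our reading of these two criteria; the sign convention for the epistemic score (one minus the predicted error) is an assumption.

```python
import torch
import torch.nn.functional as F

def confidence_scores(logits: torch.Tensor, unc_out: torch.Tensor):
    """Test-time confidence scores as we read Sect. 3.5.

    logits:  classification logits g_w(x), shape (batch, classes)
    unc_out: raw output of g_unc, shape (batch, 1)
    """
    aleatoric = F.softmax(logits, dim=-1).max(dim=-1).values  # MCP [36]
    epistemic = 1.0 - torch.sigmoid(unc_out).squeeze(-1)      # 1 - predicted error
    return aleatoric, epistemic
```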

4 Experiments

One major interest of our technique is that it can be seamlessly applied to any computer vision task, be it classification or regression. We therefore evaluate the quality of uncertainty quantification of different techniques on three major tasks: image classification, semantic segmentation and monocular depth estimation. For all three tasks, we compare our technique against MC-Dropout [27] and Deep Ensembles [45]. For image classification, we also compare against relevant DUM techniques, namely DDU [56], DUQ [74], DUE [3], MIR [64] and SNGP [48].

We evaluate the predictive performance in terms of accuracy for image classification, mIoU [22] for semantic segmentation, and the metrics first introduced in [21] and used in many subsequent works for monocular depth estimation. For image classification and semantic segmentation, we also evaluate the quality of the confidence scores provided by the DNNs via the following metrics: Expected Calibration Error (ECE) [31], AUROC [34, 36] and AUPR [34, 36]. Note that the ECE we use is the confidence ECE defined in [31]. To better evaluate uncertainty quantification on monocular depth estimation, we use the Area Under the Sparsification Error (AUSE), namely AUSE-RMSE and AUSE-AbsRel, similarly to [32, 63, 81].
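For reference, a sparsification-based AUSE can be computed along the following lines; normalization details vary across implementations [32, 63, 81], and this sketch shows a plain absolute-error variant.

```python
import numpy as np

def ause(per_pixel_error: np.ndarray, uncertainty: np.ndarray, steps: int = 100):
    """Area Under the Sparsification Error (absolute-error variant).

    Repeatedly remove the most uncertain fraction of pixels and measure the
    mean error of the rest; the oracle does the same using the true error as
    ranking. AUSE is the area between the two sparsification curves.
    """
    order_unc = np.argsort(-uncertainty)      # most uncertain first
    order_orc = np.argsort(-per_pixel_error)  # oracle: largest error first
    n = len(per_pixel_error)
    spars, oracle = [], []
    for i in range(steps):
        keep = n - int(n * i / steps)         # pixels kept at this fraction
        spars.append(per_pixel_error[order_unc[-keep:]].mean())
        oracle.append(per_pixel_error[order_orc[-keep:]].mean())
    diff = np.array(spars) - np.array(oracle)
    return np.trapz(diff, dx=1.0 / steps)
```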

We run all methods ourselves in similar settings, using publicly available code and hyper-parameters for the related methods. In the following tables, top-2 results are highlighted in color.

Table 2. Comparative results for image classification. We evaluate in-domain classification on CIFAR-10 and out-of-distribution detection with SVHN. Results are averaged over three seeds.
Fig. 4. Illustration of the different confidence scores on one image of MUAD. Note that the classes train, bicycle, stand food and the animals are OOD.

4.1 Classification Experiments

To evaluate uncertainty quantification for image classification, we adopt a standard protocol: training on CIFAR-10 [44] and using SVHN [58] as OOD data [25, 48, 65]. We use ResNet18 [33] as the architecture for all methods. Note that for all DNNs, including Deep Ensembles, we average results over three random seeds for statistical relevance. We follow the corresponding protocol for all DUM techniques (except LDU); for Deep Ensembles, MCP, and LDU, we use the same protocol. Please refer to the appendix for implementation details of LDU. The performance of the different algorithms is shown in Table 2: LDU achieves state-of-the-art performance on CIFAR-10. We note that LDU’s OOD detection performance improves with the number of prototypes, which can be linked to the fact that more prototypes can model more complex distributions. Ablation studies on the sensitivity to \(\lambda \) and on the impact of the different losses are provided in the appendix.

4.2 Semantic Segmentation Experiments

Our semantic segmentation study consists of three experiments. The first one is on a new synthetic dataset, MUAD [26]. It comprises a training set and a test set without OOD classes and without adverse weather conditions, which we denote the normal set. MUAD contains three more test sets, denoted OOD set, low adv. set and high adv. set, which contain respectively images with OOD pixels but no adverse weather, images with OOD pixels and weak adverse weather, and images with OOD pixels and strong adverse weather. The second experiment evaluates segmentation accuracy and uncertainty quality on Cityscapes [16] and Cityscapes-C [24, 41, 68] to assess performance under distribution shift. Finally, we analyze OOD detection performance on the BDD Anomaly dataset [34], whose test set contains objects unseen during training. We detail the experimental protocol of all datasets in the appendix.

Table 3. Comparative results for semantic segmentation on MUAD.
Table 4. Comparative results for semantic segmentation on Cityscapes and Cityscapes-C.
Table 5. Comparative results obtained on the OOD detection task on BDD Anomaly [34] with PSPNet (ResNet50).
Table 6. Comparative results for monocular depth estimation on KITTI eigen-split validation set.

We train a DeepLabV3+ [14] network with a ResNet50 encoder [33] on MUAD. Table 3 lists the results of the different uncertainty techniques. For this task, we found that enforcing the Lipschitz constraint (see Baseline (MCP) lipz.) has a significant impact. Fig. 4 shows a qualitative example of typical uncertainty maps computed on MUAD images.

Similarly to [24, 41], we assess predictive uncertainty and robustness under distribution shift using Cityscapes-C, a corrupted version of Cityscapes with perturbations of varying intensity. We generate Cityscapes-C ourselves from the original Cityscapes images using the code of Hendrycks et al. [35]. Following [35], we apply the following perturbations: Gaussian noise, shot noise, impulse noise, defocus blur, frosted glass blur, motion blur, zoom blur, snow, frost, fog, brightness, contrast, elastic, pixelate and JPEG, each scaled with five levels of strength. We train a DeepLabV3+ [14] with a ResNet50 encoder [33] on Cityscapes. Results in Table 4 show that LDU closely trails the much more costly Deep Ensembles [45] in accuracy (mIoU score), while making better calibrated predictions (ECE score).
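For illustration, such corrupted variants can be generated with the `imagecorruptions` package, one packaging of the code of [35]; the loading of real Cityscapes frames is omitted here and the array below is just a placeholder at Cityscapes' native resolution.

```python
import numpy as np
from imagecorruptions import corrupt, get_corruption_names

# A Cityscapes frame as an (H, W, 3) uint8 array; actual loading omitted.
image = np.zeros((1024, 2048, 3), dtype=np.uint8)

for name in get_corruption_names():      # the 15 common corruptions of [35]
    for severity in range(1, 6):         # five levels of strength
        corrupted = corrupt(image, corruption_name=name, severity=severity)
        # save `corrupted` to build the Cityscapes-C evaluation set
```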

To assess epistemic uncertainty quantification on real data, we use PSPNet [82] with a ResNet50 backbone, following the experimental protocol of [34]. BDD Anomaly is a subset of the BDD dataset, composed of 6688 street scenes for training and 361 for testing. The training set contains 17 classes, and the test set contains these 17 classes plus 2 OOD classes. Results in Table 5 show again that the performance of LDU is close to that of Deep Ensembles.

4.3 Monocular Depth Experiments

We set up our experiments on the KITTI dataset [73] with the Eigen split training and validation sets [21] to evaluate and compare predicted depth accuracy and uncertainty quality. We train BTS [46] with a DenseNet161 backbone [39], using the default BTS training settings (number of epochs, weight decay, batch size) for all uncertainty estimation techniques applied to this backbone.

By default, the BTS baseline does not output uncertainty. Similarly to [40, 43], we can construct a DNN to output the parameters of a parametric distribution (e.g., the mean and variance of a Gaussian) and optimize it by maximizing the log-likelihood. We denote the result single predictive uncertainty (Single-PU). We also train Deep Ensembles [45] with three DNNs, as well as MC-Dropout [27] with eight forward passes. Without extra DNNs or training procedures, we also apply Infer-noise [54], which injects Gaussian noise layers into the trained BTS baseline and performs eight forward passes to predict the uncertainty.

We have also implemented LDU on the BTS model, noting however that, in the monocular depth estimation setting and in agreement with previous works [18], the definition of OOD is fundamentally different from that of the tasks in the prior experiments. Our objective is thus to investigate whether LDU is robust, improves prediction accuracy and still performs well for aleatoric uncertainty estimation. Table 6 lists the depth and uncertainty estimation results on KITTI. Across different settings of \(\# \textbf{p}\) and \(\lambda \), LDU is virtually aligned with the current state of the art, while being significantly lighter computationally (see also Table 7). More ablation results on the influence of \(\# \textbf{p}\) and \(\lambda \) can be found in the supplementary material.

Table 7. Comparative results for training (forward+backward) and inference wall-clock timings and number of parameters for evaluated methods. Timings are computed per image and averaged over 100 images.

5 Discussions and Conclusions

Discussions. In Table 7 we compare the computational cost of LDU and related methods. For each approach we measure the training (forward+backward) and inference time per image on an NVIDIA RTX 3090Ti and report the corresponding number of parameters. Training and inference wall-clock timings are averaged over 100 training and validation images. We use the same backbones as in Sect. 4.2 and Sect. 4.3 for semantic segmentation and monocular depth estimation respectively. The runtime of LDU is almost the same as that of the baseline model (a standard single-forward network). This underpins the efficiency of our approach at inference, a particularly important requirement for practical applications.

Conclusions. In this work, we propose a simple way to modify a DNN to better estimate its predictive uncertainty. These minimal changes consist in optimizing a set of latent prototypes to learn to quantify the uncertainty by analyzing the position of an input sample in this prototype space. We perform extensive experiments and show that LDU can outperform state-of-the-art DUMs in most tasks and reach results comparable to Deep Ensembles with a significant advantage in terms of computational efficiency and memory requirements.

As for current state-of-the-art methods, a limitation of our proposed LDU is that, despite the empirical improvements in uncertainty quantification, it does not provide theoretical guarantees on the correctness of the predicted uncertainty. Our perspectives concern further exploration and improvement of the regularization strategies that LDU introduces on the latent feature representation, which would allow us to bound the model error while preserving high performance on the main task.