
1 Introduction

Uncertainty estimation (UE) has recently become a very active area of research in deep learning. Neural networks are usually treated as black boxes and are, in general, prone to overconfidence [9, 15]. Uncertainty estimation methods aim to overcome this drawback by identifying potentially erroneous predictions. This is especially important for error-critical applications such as medical diagnostics [4] or autonomous driving [8]. Another important application of uncertainty estimation is active learning [34]: the majority of sampling criteria in active learning are based on uncertainty estimates, which makes high-quality uncertainty estimation essential.

There are several main approaches to uncertainty estimation for deep neural networks. Bayesian neural networks (BNNs), and variational inference in particular, represent a natural way to estimate uncertainty due to the availability of well-defined posteriors, but they can be prohibitively slow for large-scale applications. Using dropout at the inference stage was shown to be a good and efficient approximation to BNNs [9, 10]. Ensembles of independently trained models [25] achieve state-of-the-art performance in many tasks requiring uncertainty estimation [37]. Recently, forcing the models in an ensemble to be more diverse was shown to improve results even further [22]. The drawback of ensembles is the need to train and use multiple models, which requires additional resources, i.e., more memory to store the models and more computing power for training.

In this work, we aim to develop a new approach to dropout-based uncertainty estimation. Neural networks usually contain many highly correlated neurons, which results in slow convergence of estimates based on the standard uniform sampling in dropout layers. We propose to estimate the correlations between neurons from the data and to sample the most diverse neurons in order to improve the convergence of the estimates and, as a result, the quality of uncertainty estimates. As a particular realization of this general idea, we suggest sampling dropout masks using the machinery of determinantal point processes (DPP) [24], which are known to give diverse samples.

We summarize the main contributions of the paper as follows:

  • We propose two DPP-based sampling methods for neural networks with dropout. Our approach requires training only a single model and adds only a small overhead at the inference stage compared to plain MC dropout.

  • We compare different dropout-based approaches for uncertainty estimation in an extensive series of experiments on real-world regression and classification datasets. The results show superior performance of the proposed DPP-based approaches.

  • Importantly, the proposed methods provide high-quality uncertainty estimates even for a very small number of stochastic passes through the network, thus opening the possibility of significantly speeding up the inference stage.

The rest of the paper is organized as follows. Section 2 introduces the proposed method for DPP-based sampling from neural networks with dropout. In Sect. 3, we show the efficiency of the proposed approach in the problem of uncertainty estimation. Section 4 gives an overview of the related work on uncertainty estimation for neural networks. Section 5 concludes the study and highlights some directions for future work.

2 Methods

2.1 Neural Networks with Dropout as Implicit Ensembles

We start by considering a standard fully connected layer in a neural network

$$\begin{aligned} \textstyle {S_{i}^{h} = \sum _{j = 1}^{N_{h - 1}} w_{ij}^{h} O_{j}^{h-1}, ~~ i = 1, \dots , N_{h},} \end{aligned}$$
(1)

where \(O_{i}^{h} = \sigma \bigl (S_{i}^{h}\bigr )\) is the output of the \(i\)-th neuron of the \(h\)-th layer of the neural network, given by a non-linear transformation \(\sigma (\cdot )\) of the corresponding pre-activation \(S_{i}^{h}\). Applying dropout to the neurons results in the following formula for the pre-activations:

$$\begin{aligned} \textstyle {S_{i}^{h} = \sum _{j = 1}^{N_{h - 1}} \frac{1}{1 - p} m_{j}^{h} w_{ij}^{h} O_{j}^{h-1}, ~~ i = 1, \dots , N_{h},} \end{aligned}$$
(2)

where \(m_{j}^{h}\) are Bernoulli random variables that take the value \(0\) with probability \(p\). The outputs \(O_{i}^{h}\) of the \(h\)-th layer are still computed as \(O_{i}^{h} = \sigma \bigl (S_{i}^{h}\bigr )\). Note that if the input of the neural network is denoted by \(\textbf{x}\), then the output of every layer is a function of \(\textbf{x}\), i.e., \(O_{i}^{h} = O_{i}^{h}(\textbf{x})\).

Let us denote the vector of dropout weights \(m_{j}^{h}\) for the \(h\)-th layer by \(\textbf{m}_h = (m_{1}^{h}, \dots , m_{N_h}^{h})^{\textrm{T}}\) and the full set of dropout weights by \(\textbf{M}= (\textbf{m}_1, \dots , \textbf{m}_K)\). Thus, any neural network \(\hat{f}(\textbf{x})\) with dropout layers essentially has two sets of parameters: the full set of learnable weights \(\textbf{W}\) and the set of dropout weights \(\textbf{M}\):

$$\begin{aligned} \hat{f}(\textbf{x}) = \hat{f}(\textbf{x} \mid \textbf{W}, \textbf{M}). \end{aligned}$$

Consider a neural network with dropout trained on some dataset, yielding weight estimates \(\hat{\textbf{W}}\). The dropout weights \(\textbf{M}\) can then be considered free parameters that need to be selected at inference time:

$$\begin{aligned} \hat{f}(\textbf{x} \mid \textbf{M}) = \hat{f}(\textbf{x} \mid \hat{\textbf{W}}, \textbf{M}). \end{aligned}$$

The originally proposed [18] and currently standard choice is to take \(\hat{\textbf{M}} = (1 - p) \cdot \textbf{E}\), where \(\textbf{E}\) is the all-ones matrix of the corresponding shape. This choice gives the fixed function \(\hat{f}(\textbf{x} \mid \hat{\textbf{M}})\), which is known to perform reasonably well in practice. The main intuition behind this choice is the replacement of the stochastic pre-activations \(S_{i}^{h}\) given by (2) with their expectations, which are exactly equal to (1).

Recently, dropout was proposed to be viewed as a variational approximation in a specially chosen Bayesian model, see [10]. Within this approach, one samples \(T\) i.i.d. realizations \(\textbf{M}_1, \ldots , \textbf{M}_T \sim Bernoulli(1 - p)\) and computes the approximate posterior mean and variance

$$\begin{aligned} \bar{f}_T(\textbf{x}) = \frac{1}{T} \sum _{i = 1}^{T} \hat{f}(\textbf{x} \mid \textbf{M}_i), \quad \bar{\sigma }_T^2(\textbf{x}) = \frac{1}{T} \sum _{i = 1}^{T} \bigl (\hat{f}(\textbf{x} \mid \textbf{M}_i) - \bar{f}_T(\textbf{x})\bigr )^2. \end{aligned}$$

The approximate posterior variance \(\bar{\sigma }_T^2(\textbf{x})\) is a natural choice for the uncertainty estimate and was successfully used in a variety of applications such as out-of-distribution detection [40] and active learning [11].
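For illustration, the computation of \(\bar{f}_T(\textbf{x})\) and \(\bar{\sigma }_T^2(\textbf{x})\) can be sketched in PyTorch as follows. This is a minimal sketch of plain MC dropout, not the code used in our experiments, and `mc_dropout_predict` is a hypothetical helper name.

```python
# Minimal MC dropout sketch (illustrative, not the experimental code):
# dropout layers are kept stochastic at test time, T forward passes are made,
# and the per-input mean and variance of the outputs are returned.
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, T: int = 100):
    model.eval()
    for module in model.modules():            # re-enable only the dropout layers
        if isinstance(module, nn.Dropout):
            module.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(T)])   # (T, batch, out)
    return preds.mean(dim=0), preds.var(dim=0)
```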

In this paper, we suggest a different approach, namely we treat \(\hat{f}(\textbf{x} \mid \textbf{M})\) as an ensemble of models indexed by dropout masks \(\textbf{M}\). Such a view allows us to decouple inference from training and pose an intuitive question: what set of masks \(\textbf{M}_1, \ldots , \textbf{M}_T\) should one choose in order to obtain the best uncertainty estimate \(\bar{\sigma }_T^2(\textbf{x})\)?

Importantly, here we do not limit the selection of masks to samples from the standard dropout distribution, which, in principle, should allow us to obtain better estimates. However, the design of the mask selection procedure is a non-trivial problem, which we discuss in detail below.

Remark 1

The standard approach in the literature is to consider an ensemble of models trained on different subsets of the dataset or simply from different random initializations, giving a set of parameter estimates \(\hat{\textbf{W}}_1, \ldots , \hat{\textbf{W}}_T\) and the corresponding approximations \(\hat{f}(\textbf{x} \mid \hat{\textbf{W}}_i, \hat{\textbf{M}}), ~ i = 1, \ldots , T\). Similarly, one can compute the variance \(\bar{\sigma }_T^2(\textbf{x})\), which was shown to be a reasonable uncertainty estimate in practice [5, 36]. The main drawback of this approach is the need to train and store \(T\) different models, which may be very costly in terms of both computation and storage.

2.2 Data-Driven Mask Generation Under General Sampling Distributions

In practice, many neurons in a network are highly correlated. Consider, for example, the correlation matrix of neurons in a hidden layer of a fully-connected neural network trained on a regression dataset (see Fig. 1a). The correlation matrix, computed on the test set, clearly shows groups of highly correlated neurons. Sampling masks for this layer uniformly at random may result in a high variance of the pre-activations (2). As a result, the estimates for the whole network may require a significant number of samples (stochastic passes through the NN) \(T\) to converge. We illustrate this behaviour in Fig. 1b, where several hundred plain MC dropout samples are required for the log-likelihood values to converge. A larger number of samples clearly improves the log-likelihood, yet may impose a computational cost too large for real-world applications. However, one may expect that knowledge about the correlations between neurons can help to sample more diverse neurons and improve the estimates.

Fig. 1. (a) Correlation matrix \(C\) between the outputs of the neurons in a hidden layer of the NN trained on the naval propulsion dataset. (b) For the same dataset, the log-likelihood computed via MC dropout increases with the number of stochastic passes \(T\); more than 100 samples are needed to reach convergence.

In what follows, we consider the probabilistic generation of masks \(\textbf{m}_h\) from some distribution \(P^{(h)}\) with possibly non-i.i.d. components. Similarly to the case of dropout, we suggest using an unbiased estimate of the layer-wise mean; our main motivation is to approximately preserve the average performance of the trained network. The construction of the unbiased estimator is non-trivial and is given by the celebrated Horvitz-Thompson (HT) estimator [19]:

$$\begin{aligned} \textstyle {S_{i}^{h} = \sum _{j = 1}^{N_{h - 1}} \frac{1}{\pi _j^h} m_{j}^{h} w_{ij}^{h} O_{j}^{h-1}, ~~ i = 1, \dots , N_{h},} \end{aligned}$$
(3)

where \(\pi _j^h\) is the marginal probability of value \(1\) for the random variable \(m_{j}^{h}\).
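As an illustration, a NumPy sketch of the HT-corrected pre-activations (3) for a single layer is given below; the mask and the marginal probabilities are assumed to come from whatever sampling distribution \(P^{(h)}\) is chosen.

```python
# Horvitz-Thompson correction (3) for one layer (illustrative sketch):
# each kept output of the previous layer is rescaled by the inverse of its
# marginal keep probability pi_j, so the pre-activations are unbiased
# estimates of the deterministic sum (1).
import numpy as np

def ht_preactivations(W, O_prev, mask, pi):
    """W: (N_h, N_{h-1}) weights; O_prev: (N_{h-1},) previous-layer outputs;
    mask: (N_{h-1},) 0/1 dropout mask; pi: (N_{h-1},) marginal keep probs."""
    return W @ (mask / pi * O_prev)
```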

2.3 Diversity Sampling Approaches

Let us consider the \(h\)-th hidden layer of a neural network with dropout. Assume that we have access to the correlations

$$\begin{aligned} C_{ij}^{(h)} = \textrm{corr}_{\textbf{x}} \bigl \{O_{i}^{h}(\textbf{x}), O_{j}^{h}(\textbf{x})\bigr \}, ~ i, j = 1, \dots , N_h. \end{aligned}$$

In practice, we compute empirical correlations on a set of points that represents the data distribution well enough. As a result, we obtain the correlation matrix \(C^{(h)} \in \mathbb {R}^{N_h \times N_h}\) between the neurons of the \(h\)-th hidden layer. Below we discuss several approaches to sampling neurons such that the correlation between the sampled neurons is as small as possible. We note that in any of the approaches described below, one may use the covariance matrix \(K^{(h)}\) instead of the correlation matrix \(C^{(h)}\). The properties of the methods significantly depend on this choice, and we empirically evaluate both variants in the experiments.
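A minimal sketch of this estimation step is given below; it assumes that the activations of the \(h\)-th layer have already been collected on a reference set of points (the collection itself is model-specific).

```python
# Estimating the neuron correlation matrix C^(h) from collected activations.
import numpy as np

def neuron_correlations(activations: np.ndarray) -> np.ndarray:
    """activations: (n_points, N_h) outputs O^h(x) of layer h on a reference set."""
    C = np.corrcoef(activations, rowvar=False)   # (N_h, N_h) correlations
    return np.nan_to_num(C)                      # guard against constant neurons
```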

Leverage Score Sampling. A basic approach to non-uniform sampling of rows and columns of kernel matrices is the so-called leverage score sampling [1]. In this approach, the neurons are sampled independently with different probabilities \(\pi _j^{h}\):

$$\begin{aligned} \pi _j^{h} \sim \ell _{\lambda }^{(h)}(j) = \Bigl [C^{(h)} \bigl (C^{(h)} + \lambda I\bigr )^{-1}\Bigr ]_{jj}, ~ j = 1, \dots , N_h, \end{aligned}$$

where the quantities \(\ell _{\lambda }^{(h)}(j)\) are called leverage scores. This approach causes neurons from large, highly correlated clusters to be sampled less frequently. In Sect. 3, we show that leverage score sampling indeed yields better uncertainty estimates for out-of-distribution data in regression tasks compared to MC dropout; however, its performance for in-domain data is even inferior to uniform sampling, which motivates the more elaborate approach proposed below.
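As an illustration, here is a minimal sketch of the leverage score sampler under one concrete reading of the scheme: keep probabilities proportional to the ridge leverage scores, rescaled so that \((1 - p) N_h\) neurons are kept on average. The rescaling is our illustrative choice and is not prescribed by the description above.

```python
# Leverage score mask sampling (illustrative sketch; the rescaling of the
# keep probabilities to match (1 - p) * N_h expected neurons is an assumption).
import numpy as np

def leverage_score_mask(C, p=0.5, lam=1.0, rng=None):
    rng = rng or np.random.default_rng()
    N = C.shape[0]
    scores = np.diag(np.linalg.solve(C + lam * np.eye(N), C))   # l_lambda(j)
    keep_probs = np.clip(scores / scores.sum() * (1 - p) * N, 0.0, 1.0)
    mask = (rng.random(N) < keep_probs).astype(float)
    return mask, keep_probs        # mask m^h and marginals pi^h for (3)
```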

Sampling with Determinantal Point Processes. Determinantal point processes (DPPs) [24] are probability distributions over configurations of points that encode diversity through a kernel function. They were introduced in [28] for the needs of statistical physics and have since been used in a number of ML applications, see [24] for an overview. A DPP can be seen as a probabilistic analogue of the MaxVol algorithm [14] for finding a maximal-volume submatrix.

We use the correlation matrix \(C^{(h)}\) as the likelihood kernel of the DPP. Then, for a set \(S\) of selected neurons under the mask distribution \(\textbf{m}_h \sim DPP\bigl (C^{(h)}\bigr )\), we obtain

$$\begin{aligned} \mathbb {P}[\textbf{m}_h = S] = \frac{\det \bigl [C^{(h)}_S\bigr ]}{\det \bigl [C^{(h)} + I\bigr ]}, ~~ h = 1, \dots , K, \end{aligned}$$

where \(C_S^{(h)} = \Bigl [C_{ij}^{(h)}, ~ i, j \in S\Bigr ]\), i.e., a square submatrix of \(C^{(h)}\) obtained by keeping only rows and columns indexed by \(S\).

To better understand the DPP, let us return to the correlation matrix depicted in Fig. 1a. The probability that the DPP includes highly correlated neurons in the sample \(S\) is low, as in this case the corresponding determinant \(\det C^{(h)}_S\) will be small. Thus, the DPP tends to sample neurons from different clusters, increasing the overall diversity.

From a computational point of view, DPP sampling requires \(O(N_{h}^3)\) operations per sample. This is quite expensive but completely viable even for modern large networks, which usually have up to 1024 neurons in their fully-connected layers. Importantly, the masks can be precomputed once and then reused at the inference stage for every test sample with no additional overhead. Also, the computations in the last fully-connected layers with dropout usually take only a few percent of the total computational budget in ImageNet-size networks. Therefore, the computational overhead caused by DPP sampling does not have a significant impact on the inference time.
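A minimal sketch of the mask precomputation is shown below. It assumes the DPPy package [13] and its `FiniteDPP` interface; the exact call signatures may differ slightly between DPPy versions, so this should be read as a sketch rather than verbatim usage.

```python
# Precomputing DPP dropout masks for one layer (a sketch assuming DPPy [13]).
import numpy as np
from dppy.finite_dpps import FiniteDPP

def dpp_masks(C, T=100):
    """C: (N_h, N_h) likelihood kernel; returns T binary masks and HT marginals."""
    N = C.shape[0]
    dpp = FiniteDPP('likelihood', L=C)
    masks = np.zeros((T, N))
    for t in range(T):
        kept = dpp.sample_exact()              # indices of kept neurons
        masks[t, kept] = 1.0
    pi = np.diag(np.linalg.solve(C + np.eye(N), C))  # pi_j = [C (C + I)^{-1}]_{jj}
    return masks, pi
```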

K-DPP. The k-DPP [24] is a variation of the DPP conditioned to produce samples of fixed size \(|S| = k\). At the cost of introducing an additional parameter, it allows us to tune the sampling procedure, as the choice of \(k\) has a significant influence on the result. In this work, for the \(h\)-th layer we use \(k^{(h)} = (1 - p) N_h\), so that the number of neurons in the sample equals the mean number of neurons kept by MC dropout. In the case of the k-DPP, the computation of the marginal probabilities \(\pi _j^{h}\) for the HT estimator (3) is non-trivial and requires a separate optimization procedure, see the details in [2].
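A k-DPP variant of the same sketch is given below. Instead of the optimization procedure of [2], the marginal probabilities \(\pi _j^{h}\) are estimated here simply as empirical keep frequencies over the precomputed masks; this shortcut is an illustrative simplification, not the procedure used in the paper.

```python
# Precomputing k-DPP masks with crudely estimated marginals (illustrative only;
# the paper relies on [2] for the marginal probabilities instead).
import numpy as np
from dppy.finite_dpps import FiniteDPP

def kdpp_masks(C, p=0.5, T=100):
    N = C.shape[0]
    k = max(1, int(round((1 - p) * N)))        # fixed sample size k^(h)
    dpp = FiniteDPP('likelihood', L=C)
    masks = np.zeros((T, N))
    for t in range(T):
        kept = dpp.sample_exact_k_dpp(size=k)
        masks[t, kept] = 1.0
    pi = masks.mean(axis=0).clip(min=1e-6)     # Monte-Carlo estimate of pi_j
    return masks, pi
```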

2.4 Diversification for Uncertainty Estimation in Classification

For regression, the variance of the prediction is a standard uncertainty measure. However, uncertainty estimation for classification is, in some sense, more challenging than for regression, as there is no obvious candidate for the uncertainty measure.

Let us define the average probability for the class prediction by ensemble members \(\bar{p}_T(y = c \mid \textbf{x}) = \frac{1}{T} \sum _{i = 1}^{T} p(y = c \mid \textbf{x}, \textbf{M}_{i})\). The standard uncertainty measure usually considered in the literature is

$$\begin{aligned} s(\textbf{x}) = 1 - \max \limits _c ~ \bar{p}_T(y = c \mid \textbf{x}), \end{aligned}$$

which is based solely on the mean probabilities predicted by the ensemble. While it provides good results in practice [3, 37], it does not use information about the variation of predictions across ensemble members.

In our work, we consider the BALD [20] uncertainty measure and combine it with the different sampling schemes considered above. BALD is the mutual information between the outputs and the model parameters:

$$\begin{aligned} \textstyle {I(\textbf{x}) = H(\textbf{x}) - \frac{1}{T} \sum _{c = 1}^{C} \sum _{i=1}^{T} -p(y = c \mid \textbf{x}, \textbf{M}_{i}) \log \bigl (p(y = c \mid \textbf{x}, \textbf{M}_{i})\bigr ),} \end{aligned}$$

where \(H(\textbf{x}) = -\sum _{c = 1}^{C} \bar{p}_T(y = c \mid \textbf{x}) \log \bigl (\bar{p}_T(y = c \mid \textbf{x})\bigr )\) is the entropy of the ensemble mean. Importantly, BALD values are directly linked to the diversity of the ensemble members and are therefore well suited for combination with our approach.
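For concreteness, the BALD score for a single input can be computed from the stacked per-pass class probabilities as in the following minimal NumPy sketch.

```python
# BALD from T stochastic passes for one input: probs has shape (T, n_classes)
# and holds p(y = c | x, M_i) for each pass i and class c.
import numpy as np

def bald(probs: np.ndarray, eps: float = 1e-12) -> float:
    mean_p = probs.mean(axis=0)
    entropy_of_mean = -np.sum(mean_p * np.log(mean_p + eps))            # H(x)
    mean_member_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return entropy_of_mean - mean_member_entropy                        # I(x)
```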

3 Experiments

3.1 Uncertainty Estimation for Regression

Models and Metrics. For the experiments, we consider MC dropout as a baseline and all the proposed UE methods discussed in Sect. 2.3: leverage score sampling, DPP and k-DPP. We present the results for leverage score sampling and DPP based on the correlation matrix and for k-DPP based on the covariance matrix, as these choices give consistently better results than the alternatives. For leverage score sampling we deliberately choose \(\lambda = 1\) so that it works with effectively the same matrix as the DPP-based methods. All the regression models were trained with RMSE as the loss function. We used feed-forward NNs with 3 hidden layers (128-128-64 neurons) and the leaky ReLU activation function [27]. For DPP-based methods, we use the DPPy implementation provided in [13].

We note that we do not compare with fully Bayesian approaches, as we focus on solutions applicable to standard dropout-based models without changing the model architecture or the training procedure. Following [17, 22], we use the log-likelihood of a Gaussian distribution, with mean and variance computed by the different methods, as a quality measure.

On top of single models, we also consider a straightforward ensemble approach with NNs trained exactly the same way as the single models but from different random initializations. Our experiments show that uncertainty estimates based on ensembling of networks without sampling in the individual networks do not work well for the considered regression datasets.

Experiments on Regression Datasets. Similarly to [22], we run a series of experiments on various regression datasets, see Table 1 for the full list. We start with in-domain uncertainty estimation: for each dataset, a random 50% of the points were used for training and the other 50% for testing. The log-likelihood values are averaged over the test set. Multiple experiments are performed via 5 random train-test splits, 2-fold cross-validation and 5 runs of the training procedure for every model (resulting in 50 average log-likelihood values contributing to each boxplot). Uncertainty estimates were computed for different numbers of stochastic passes \(T\,=\,10, 30\) and \(100\) for every model.

Table 1. Summary of the UCI datasets used in experiments, see [7].

We show the resulting distributions of log-likelihood values for each dataset in Fig. 2. We observe that either DPP or k-DPP always shows the best results. Most importantly, DPP works very well already for a small number of stochastic passes \(T = 10\) and consistently has low variance, which is extremely important for practical usage.

Fig. 2. Log-likelihood metric across various UCI datasets for NN UE models with different numbers of stochastic passes \(T = 10, 30, 100\). DPP and k-DPP give better results compared to the other methods, with DPP working well already for \(T = 10\) and consistently showing lower variance.

We also performed an experiment with out-of-distribution (OOD) data. To generate OOD data, we pick a random feature and split the data into a train set and an OOD set by the median value of this feature. The experiments were run for 5 different splits. For OOD data, good uncertainty estimates should on average be higher than for in-domain data. Table 2 reports, for the concrete dataset, the percentages of OOD points with UE values higher than the \(\alpha \) percentile of the UE distribution on the training data (\(\alpha = 80\%, 90\%, 95\%\)). The resulting numbers should be taken with a significant grain of salt due to their high variance, but DPP and k-DPP still show the best results in terms of average values.
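The metric reported in Table 2 can be sketched as follows; `ood_detection_rate` is a hypothetical helper name introduced here only for illustration.

```python
# Percentage of OOD points whose uncertainty exceeds the alpha-th percentile
# of the training-set uncertainty distribution (the metric of Table 2).
import numpy as np

def ood_detection_rate(ue_train, ue_ood, alpha=90):
    threshold = np.percentile(ue_train, alpha)
    return 100.0 * np.mean(np.asarray(ue_ood) > threshold)
```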

Table 2. Percentages of OOD points with UE values higher than the specified percentile of the UE distribution on the training data, for the concrete dataset. DPP and k-DPP show the best results based on average values (top-2 average values are in bold). For all methods, \(T = 100\).

3.2 Uncertainty Estimation for Classification

Data, Models and Metrics. In this section, we aim to show the applicability of the proposed methods to classification tasks. We take BALD [20] as the uncertainty estimate. We consider three datasets: MNIST, a toy dataset of handwritten digits [26]; CIFAR-10, a 10-class image dataset of simple objects [23]; and ImageNet [6], a large-scale image classification dataset. Importantly, for MNIST we use only 500 training samples, since otherwise the models would reach very high accuracy and in-domain uncertainty estimation would not be informative. For CIFAR-10 we use 50,000 samples for training and 10,000 for testing. For the MNIST dataset, we use a simple convolutional neural network with two convolutional layers, max-pooling and two fully connected layers. For CIFAR-10 we use a more powerful network with 6 convolutional layers and batch normalization. Finally, for ImageNet we use the pre-trained ResNet-18 neural network [16] from PyTorch [33]. Dropout with rate \(p = 0.5\) is used before the last fully-connected layer in all cases. \(T = 100\) stochastic passes were made for every model. The experiments are repeated three times with different seeds for the models.

Experimental Results. For in-domain uncertainty estimation, the results are presented via the UE-accuracy curve, see Fig. 3: samples with lower uncertainty should be classified with higher average accuracy. It can be clearly seen that DPP significantly outperforms all the competitors on every dataset. We emphasize that the superiority of DPP is especially strong for ImageNet, where, according to our experiments, the usage of DPP required only a 2% computational overhead compared to MC dropout.
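The UE-accuracy curve itself can be computed as sketched below: samples are sorted by increasing uncertainty and the accuracy is evaluated on growing prefixes of this ordering (a minimal illustration with hypothetical function and argument names).

```python
# UE-accuracy curve: accuracy on the fraction of most-certain samples.
import numpy as np

def ue_accuracy_curve(uncertainty, correct):
    """uncertainty: (n,) UE values; correct: (n,) 0/1 prediction correctness."""
    order = np.argsort(uncertainty)                      # most certain first
    correct_sorted = np.asarray(correct, dtype=float)[order]
    counts = np.arange(1, len(order) + 1)
    return counts / len(order), np.cumsum(correct_sorted) / counts
```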

Fig. 3. UE-accuracy curve (the higher the curve, the better). Samples with low uncertainty are selected first, so a good uncertainty estimate yields higher accuracy on them.

We also consider the detection of out-of-distribution samples, which is one of the important problems for uncertainty estimation. As OOD samples we use fashion-MNIST [41] and SVHN [32] images for MNIST and CIFAR-10, respectively. We use the count-vs-uncertainty curve; for a good uncertainty estimation method, only few OOD points should receive low uncertainty. The results are presented in Fig. 4. We see that the DPP-based approach detects the OOD samples better on both considered datasets.

Fig. 4. Count-vs-uncertainty curve for out-of-distribution data (the lower the curve, the better).

4 Related Work

Dropout [18, 38] has emerged in recent years as a technique to prevent overfitting in deep and overparametrized neural networks. Over the years, it has received theoretical interpretations as an averaged ensembling technique [38], a Bernoulli realization of a corresponding Bayesian neural network [10] and a latent variable model [30]. It was shown in [9, 31] that using dropout at the prediction stage (i.e., performing stochastic forward passes of the test samples through the network, also referred to as MC dropout) leads to unbiased Monte-Carlo estimates of the mean and variance of the corresponding Bayesian neural network trained using variational inference. These uncertainty estimates were shown to be efficient in different scenarios [9, 39].

Training an ensemble of models and estimating uncertainty by their disagreement is another common approach [25]. It was shown that even with a few models in an ensemble one can obtain robust and well-calibrated results [5], outperforming MC dropout in active learning and error detection. The main disadvantage of ensembles is the necessity to train multiple model instances. This issue was addressed in recent works [12, 21, 29], which consider different strategies for speeding up ensemble construction. Recently, it was shown that improving the diversity of ensemble members improves the quality of the resulting uncertainty estimates [22]. We also mention recent works which thoroughly investigate in-domain [3] and out-of-domain [37] uncertainty estimation in classification for the case of the maximum probability uncertainty estimate.

5 Conclusions

We have proposed a new approach that strengthens dropout-based uncertainty estimation for neural networks. Instead of randomly sampling the dropout masks at the inference stage, we sample special sets of diverse neurons via determinantal point processes, which utilize the information about the correlations of neurons in the inner layers. Numerical experiments on a wide range of regression and classification tasks show that uncertainty estimates based on our approach outperform MC dropout and other baselines by a significant margin. Combining dropout-based inference with ensembling of several models allows one to further improve the quality of the proposed uncertainty estimates and achieve state-of-the-art performance. From the practical perspective, our method is simple to implement, as it does not require any modifications to the neural network architecture or the training process. Importantly, the proposed uncertainty estimates have high quality even for a small number of stochastic passes through the network, making the inference stage even faster in practice.

We expect that the proposed methods of dropout mask sampling may also be used at the training stage, leading to more robust and efficient models. Another compelling direction for further research is approximate DPP sampling, which may increase the sampling speed of the proposed approaches, making them more production-friendly, as in [35].

The code reproducing the experiments is available on GitHub (see Footnote 1).