
1 Introduction

1.1 Motivation

Consider an ML problem where several models have been trained on a given dataset. It is then required to know their performance, in terms of some performance measure, on the population of testers. This is not only for the sake of assessing each of them, but also to be able to select the best model among them. These different models could even be different instances of the same ML algorithm with different parameter values (e.g., a KNN with different values of K), where it is required to choose the best value for the current problem. The performance on the population of testers is called the true performance, because it is the performance on the whole population, not on a subset of it.

If the underlying probability distribution of the testers is known, e.g., from a priori knowledge about the nature of the problem, the true performance can be calculated mathematically. One of the first attempts in this direction was that of Fukunaga [14], who assumed the data follow a multinormal distribution to find a closed-form expression for the error rate of a binary classification rule. An alternative to mathematical calculation is simulating a very large dataset from the assumed distribution, from which a very accurate estimate of the true performance can be obtained.

The early work of Fukunaga was inspiring, from the theoretical point of view, for the early pattern recognition and machine learning community to understand important theoretical properties and concepts. However, for real-life applications it is very unusual that the assumption of multinormality, or any other distributional assumption, holds. In these situations, which are called nonparametric, or distribution-free, it is impossible either to derive the true performance in closed form or to estimate it using a very large simulated dataset. Instead, the true performance must be estimated from a single testing dataset (testers). The way we obtain such a testing dataset defines two major paradigms, discussed next.

In Paradigm I, we only have one dataset \(\textbf{t}\), usually called the design or construction dataset, from which we have to construct a training dataset \(\textbf{tr}\) and a testing dataset \(\textbf{ts}\), such that \(\textbf{t}=\textbf{tr}\cup \textbf{ts}\). Otherwise, training and testing on the same dataset \(\textbf{t}\) would provide a very optimistic estimate of the performance measure. This splitting is performed iteratively using one of the resampling techniques, e.g., the jackknife, the bootstrap, or cross-validation. In each resampling iteration we get a different pair of training and testing datasets, on which the algorithm is trained and tested, respectively. The results from these different iterations are then compiled, as defined by the resampling method, to provide a single estimate of the performance measure. It is obvious that the estimate obtained from any of these methods will vary with varying the design dataset \(\textbf{t}\). This chapter is dedicated to reviewing this paradigm, its different estimators, and the variance estimation of these estimators.

It is worth mentioning that fatal fallacies are committed by practitioners when using this paradigm. For example, a very common mistake is to use the whole dataset \(\textbf{t}\) to learn some statistical properties of the different classes of the classification problem, mistakenly calling this a data preprocessing step, to use these properties to construct a classifier, and then to exclude this step from the resampling mechanism afterwards. Although the correct way of performing preprocessing is explained in textbooks (see, e.g., Hastie et al. [20], Sect. 7.10.2), we still see this mistake on several occasions in both academia and industry.

In Paradigm II, it is required, or even mandated (e.g., in several public-policy-making or regulatory settings), to maintain what might be called the traditional data hygiene of two independent datasets: the design dataset \({\textbf{t}}\), and a final testing dataset \({\textbf{TS}}\), a sequestered dataset that has never been available to the design procedure and is used only for one-time final testing. Assessing an ML algorithm on an independent testing dataset is as simple as applying the estimators of the performance measure of interest (Sect. 1.2) to the testing dataset. However, the estimator will then have two sources of variability, the design and the testing datasets. The mathematical details of this paradigm and the estimation of this variance are discussed in Yousef et al. [34] and Chen et al. [4], and are not reviewed in the present chapter.

Although it may seem very safe to use this testing paradigm, some practitioners abuse it as well. One common mistake is to test several models on this sequestered testing set and then analyze the relative estimated performances. Accordingly, these models are redesigned to improve their performance on the testing set! Worse still is iterating this process several times, which effectively turns the independent sequestered testing dataset into part of the training dataset, indirectly through the human mental parsing of the results, which acts as feedback that guides the redesign process.

Nowadays, it is almost the default in the field of ML to leverage both paradigms in the task of model assessment and selection. The available dataset is initially split into two datasets:

  1.

    the design dataset \(\textbf{t}\), from which the ML algorithm is designed. This is conducted via one of the resampling methods of Paradigm I explained above. Usually, several algorithms are used, and several parameter values are examined for each algorithm. Then, the model with the best performance is chosen.

  2.

    the sequestered testing dataset \(\textbf{TS}\), on which the final model chosen in Paradigm I is assessed once and only once, without redesign. This is the final estimate of the performance measure that should be reported, along with an estimate of its variance.

It is worth mentioning that there is a convention in the field to call the dataset \(\textbf{ts}\) that is split from the design dataset \(\textbf{t}\) during the resampling process a validation dataset rather than a testing dataset, to reserve the word testing for the final testing dataset \(\textbf{TS}\) of Paradigm II. However, in some applications the converse is adopted; i.e., \(\textbf{ts}\) is called a testing dataset and \(\textbf{TS}\) is called a validation dataset. To avoid ambiguity, any notation and expression should be defined clearly within its context.

What is introduced above is valid for any ML problem, whether it is regression or classification, and for any performance measure, whether it is the error rate \({\textrm{Err}}\), AUC, or any other. However, we emphasize below two very important issues.

(1) The true performance, whose estimation we have discussed so far in this introduction, is itself a random variable whose randomness arises from the randomness of the training dataset, as was explained in the previous chapter. Had we changed the training dataset, the true performance would change. For example, and without loss of generality (WLOG) but for the sake of illustration, suppose the whole design dataset \(\textbf{t}\) is used as a training dataset \(\textbf{tr}\) and we are interested in the AUC as a performance measure. Then, as was explained in the previous chapter, we should be interested in the following:

  1.

    \({\textrm{AUC}}_{{\textbf{t}}} \): the true performance conditional on a particular training dataset \({\textbf{t}}\) of a specified size n.

  2.

    \(\text {E}_{{\textbf{t}}} {\textrm{AUC}}_{{\textbf{t}}} \): the expectation of true performance over the population of training datasets of the same size n.

  3.

    \(\text {Var}_{\textbf{t}}{\textrm{AUC}}_{{\textbf{t}}}\): the variance of the true performance over the population of training datasets of the same size n.

(2) Regarding the meaning and utility of the performance measure, we emphasize the importance of the ROC curve and its AUC as a summary measure [2, 18, 19], where the former is a manifestation of the trade-off between the two types of error of any binary classification rule. We always advocate the use of the ROC or its AUC since they are prevalence independent; i.e., they do not depend on a particular chosen threshold, class prior probability, or misclassification costs. Adopting a performance measure that is prevalence dependent, e.g., the overall accuracy or its many different versions, can provide a misleading measure of the classification power of the algorithm, especially in problems that involve, for instance, unbalanced data (different class sizes). Therefore, the present chapter assumes familiarity with the ROC and its AUC, at the level provided in the previous chapter. However, for the sake of completeness, all notation is tersely summarized in the following subsection.

1.2 Notation

Consider the binary classification problem, where a classification rule \(\eta \) gives a score of h(x) for the predictor x, and classifies it to one of the two classes by comparing this score h(x) to a chosen threshold th. The observation x belongs to one of the two classes with distributions \(F_i\), \(i=1,2\). The two error components of this rule (\(e_1\), or the false negative fraction (FNF), and \(e_2\) or the false positive fraction (FPF)), along with the risk, are given as follows:

$$\begin{aligned} {\textrm{FNF}}&= e_{1} =\int \limits _{-\infty }^{th}{f_{h}}\left( {h(x)|\omega _{1}}\right) {dh(x)},\end{aligned}$$
(1a)
$$\begin{aligned} {\textrm{FPF}}&= e_{2} =\int \limits _{th}^{\infty }{f_{h}}\left( {h(x)|\omega _{2}}\right) {dh(x)},\end{aligned}$$
(1b)
$$\begin{aligned} {\textrm{R}}&=c_{12}P_{1}e_{1}+c_{21}P_{2}e_{2}. \end{aligned}$$
(1c)

The cost \(c_{ij},\ i,j=1,2\), is the cost of classifying an observation as belonging to class j when it belongs to class i; \(c_{ii}= 0\), which means there is no cost for correct classification; and \(P_i\) is the prior probability of class \(i, i=1,2\). The risk (1c) is called the “error rate” \({\textrm{Err}}\), or probability of misclassification (PMC), when putting \(c_{12}=c_{21}=1\), which is called the 0-1 cost, or loss.

The receiver operating characteristics (ROC) curve is a plot of the true positive fraction (TPF), which is \(1-{\textrm{FNF}}\), versus the FPF. Then the area under the curve (AUC) is given by:

$$\begin{aligned} {\textrm{AUC}}&=\int \limits _{0}^{1}{{\textrm{TPF}}~d({\textrm{FPF}})}.\end{aligned}$$
(2a)
$$\begin{aligned}&=\Pr \bigl [ h(x)|\omega _{2} < h(x)|\omega _{1}\bigr ], \end{aligned}$$
(2b)

which expresses how the classifier scores for class \(\omega _1\) are stochastically larger than those of class \(\omega _2\).

If the distributions \(F_1\) and \(F_2\) are not known, a setup that is called nonparametric or distribution-free, any performance measure can be estimated only numerically from a given dataset, called the testing dataset. This is regardless of the testing paradigm, i.e., whether this testing dataset is obtained by simulation, resampling, or sequestering. This is done by assigning equal probability mass to each observation:

$$\begin{aligned} \hat{F}:\text {mass}~\frac{1}{n}~\text {on}\,~t_{i},~i=1,\ldots ,n, \end{aligned}$$
(3)

where n is the size of the testing dataset. Lemma 1 shows that this is the maximum likelihood estimator (MLE) of the distribution F.

In this case the performance measures (1) can be obtained as follows.

$$\begin{aligned} \widehat{{\textrm{FNF}}}&=\widehat{e_{1}} = \frac{1}{n_{1}}\sum \limits _{i=1}^{n_{1}}I_{h(x_{i}|\omega _{1})<th}\end{aligned}$$
(4a)
$$\begin{aligned} \widehat{{\textrm{FPF}}}&= \widehat{e_{2}} = \frac{1}{n_{2}}\sum \limits _{i=1}^{n_{2}}I_{h(x_{i}|\omega _{2})>th}\end{aligned}$$
(4b)
$$\begin{aligned} \widehat{{\textrm{R}}(\eta )}&=\frac{1}{n}\left( {c_{12}\,\widehat{e_{1}}\,n_{1}+c_{21}\,\widehat{e_{2}}\,n_{2}}\right) . \end{aligned}$$
(4c)

The indicator function \(I_{cond}\) equals 1 or 0 when the Boolean expression cond is true or false, respectively. The values \(n_{1}\) and \(n_{2}\) are the numbers of observations in the two classes, respectively, and \(\widehat{P_{1}}=n_{1}/n\) and \(\widehat{P_{2}}=n_{2}/n\) are the estimated a priori probabilities of the two classes, so that (4c) is the plug-in version of (1c).
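To make (4) concrete, here is a minimal sketch in Python with numpy; the scores, threshold, and costs are hypothetical placeholders:

```python
import numpy as np

# Hypothetical classifier scores; higher scores indicate class omega_1.
scores_1 = np.array([0.9, 0.7, 0.4, 0.8])  # h(x | omega_1), n1 = 4
scores_2 = np.array([0.2, 0.5, 0.1])       # h(x | omega_2), n2 = 3
th = 0.45                                  # chosen threshold

e1_hat = np.mean(scores_1 < th)            # FNF estimate, Eq. (4a)
e2_hat = np.mean(scores_2 > th)            # FPF estimate, Eq. (4b)

n1, n2 = len(scores_1), len(scores_2)
c12 = c21 = 1.0                            # 0-1 cost
risk_hat = (c12 * e1_hat * n1 + c21 * e2_hat * n2) / (n1 + n2)  # Eq. (4c)
print(e1_hat, e2_hat, risk_hat)            # 0.25, 0.333..., 0.2857...
```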

As the two components \({\textrm{TPF}}\) and \({\textrm{FPF}}\) define a single operating point on the ROC, the two components \(\widehat{{\textrm{TPF}}} (=1-\widehat{{\textrm{FNF}}})\) and \(\widehat{{\textrm{FPF}}}\) give one point on the empirical (estimated) ROC curve. To draw the complete curve in the nonparametric situation, the classifier’s score is calculated for each point of the available dataset. Then all possible thresholds are considered in turn, i.e., the threshold values between every two successive scores. At each threshold value a point on the ROC curve is calculated. Then the AUC (2a) can be estimated from the empirical ROC curve using the trapezoidal rule:

$$\begin{aligned} \widehat{{\textrm{AUC}}}=\frac{1}{2}\sum \limits _{i=2}^{n_{th}}{\left( {{\textrm{FPF}}_{i}-{\textrm{FPF}}_{i-1}}\right) ({\textrm{TPF}}_{i}+{\textrm{TPF}}_{i-1})},\end{aligned}$$
(5)

where \(n_{th}\) is the number of threshold values taken over the dataset. By plotting the empirical ROC curve, it is easy to see that (5) is the same as the Mann-Whitney statistic—which is another form of the Wilcoxon rank-sum test [15, Chap. 4]—defined by:

$$\begin{aligned} \widehat{{\textrm{AUC}}}=&\frac{1}{n_{1}n_{2}}\sum \limits _{j=1}^{n_{2}}{\sum \limits _{i=1}^{n_{1}}{\psi \left( {h\left( {x_{i}|\omega _{1}}\right) ,h\left( {x_{j}|\omega _{2}}\right) }\right) }},\end{aligned}$$
(6a)
$$\begin{aligned}&\,\, \psi (a,b)=\left\{ \begin{array}{ccc} 1 &{} &{} a>b\\ 1/2 &{} &{} a=b\\ 0 &{} &{} a<b \end{array}\right. . \end{aligned}$$
(6b)

It is interesting, as well, to know from the theory of U-statistics [25] that the estimator (6) is the uniform minimum variance unbiased estimator (UMVUE) for the probability (2b) under the distribution (3).

All the estimators given above have the nice property of converging to their corresponding population definitions, (1) and (2), as the size of the testing set goes to infinity. It is worth mentioning that each of the error estimators \(\hat{e}_1\) and \(\hat{e}_2\) in (4) is called a one-sample statistic, because its kernel \(I_{(\cdot )}\) requires only one observation, from either distribution. However, the AUC estimator in (6) is a two-sample statistic, since its kernel \(\psi (\cdot ,\cdot )\) requires two observations, one from each distribution. This is a fundamental difference between the two estimators (statistics), which will be treated and explained carefully in the present chapter.
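As a minimal illustration of the Mann-Whitney form (6) in Python with numpy (the score vectors are hypothetical), the empirical AUC is simply the average of the kernel \(\psi \) over all between-class pairs, and it coincides with the trapezoidal value (5):

```python
import numpy as np

def auc_mann_whitney(scores_1, scores_2):
    """Empirical AUC, Eq. (6): average of psi over all between-class pairs."""
    a = np.asarray(scores_1, dtype=float)[:, None]  # class-1 scores as a column
    b = np.asarray(scores_2, dtype=float)[None, :]  # class-2 scores as a row
    psi = (a > b) + 0.5 * (a == b)                  # kernel of Eq. (6b)
    return psi.mean()

# Hypothetical scores; the trapezoidal rule (5) applied to the empirical
# ROC built from the same scores gives the identical value.
print(auc_mann_whitney([0.9, 0.7, 0.4, 0.8], [0.2, 0.5, 0.1]))  # 11/12
```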

1.3 Roadmap

The rest of this chapter is organized as follows. Section 2 paves the road for the chapter by reviewing the nonparametric estimators of the mean and variance of one-sample statistics, including the preliminaries of the bootstrap and the influence function. This section is a very concise review, mainly of the work done in Hampel [16], Efron and Tibshirani [11], and Huber [21]. Section 3 switches gears and reviews the nonparametric estimators of the mean and variance of a special kind of statistic, the error rate of classification rules. This section is a concise review of the work done mainly in Efron [8] and Efron and Tibshirani [13]. Section 4 explains how the nonparametric estimators of the error rate, a one-sample statistic, can be extended to estimate the AUC, a two-sample statistic. It does so by providing theoretical parallelism between the two sets of estimators and showing that the extension is rigorous and not just an ad hoc application. Section 6 concludes the chapter and provides a discussion and advice for practitioners.

2 Nonparametric Methods for Estimating the Bias and the Variance of a Statistic

Consider a statistic s that is a function of a dataset \(\textbf{x}:\{x_{i},\,i=1,\ldots ,n\}\), where \(x_{i}\overset{i.i.d}{\sim }F\). The statistic s is now a random variable and its variability comes from the variability of \(x_{i}\). Suppose that this statistic is used to estimate a real-valued parameter \(\theta =f\left( F\right) \). Then \(\hat{\theta }=s\left( \textbf{x}\right) \) has expected value \(\text {E}\,\,{{s\!\left( \textbf{x}\right) }}\) and variance \(\text {Var}\,\,s\!\left( \textbf{x}\right) \). The mean squared error (MSE) of the estimator \(\hat{\theta }\) is defined as:

$$\begin{aligned} {\textrm{MSE}}(\hat{\theta })=\text {E}\left[ {\hat{\theta }-\theta }\right] ^{2}.\end{aligned}$$
(7)

The root mean squared error (RMS) has the same units and the same scale as the original parameter \(\theta \), and hence is more intuitive. The bias of the estimator \(\hat{\theta }=s\left( \textbf{x}\right) \) is defined as the difference between the expectation of the estimator and the true value of the parameter, i.e.,

$$\begin{aligned} {{\,\textrm{bias}\,}}_{F}\left( \hat{\theta }\right) =\text {E}_{F} s\left( \textbf{x}\right) - \theta . \end{aligned}$$
(8)

It is well known that the MSE in (7) can be decomposed as:

$$\begin{aligned} {\textrm{MSE}}(\hat{\theta })={{\,\textrm{bias}\,}}_{F}^{2}\left( \hat{\theta }\right) +\text {Var}_F \hat{\theta }.\end{aligned}$$
(9)

A critical question is whether the bias and variance of the statistic s in (9) can be estimated from the available dataset.

Fig. 1 Bootstrap mechanism: B bootstrap replicates are drawn (by sampling with replacement) from the original sample, and the statistic is calculated from each replicate; the original sample itself provides the plug-in estimate of the statistic of interest. (The idea behind this figure first appeared in [11, Fig. 6.1, p. 48])

2.1 Bootstrap Estimate

The bootstrap was introduced by Efron [5] to estimate the standard error of a statistic. The bootstrap mechanism is implemented by treating the current dataset \(\textbf{x}\) as a representation of the population distribution F; i.e., approximating the distribution F by the MLE defined in (3). Then B bootstrap samples are drawn from that empirical distribution. Each bootstrap replicate is of size n, the same size as \(\textbf{x}\), and is obtained by sampling with replacement. Hence, in a bootstrap replicate some case \(x_{i}\), in general, will appear more than once at the expense of another \(x_{j}\) that will not appear. The original dataset is now treated as the population, and the replicates are treated as samples from it. This situation is illustrated in Fig. 1. Each of these bootstrap replicates is denoted by \(\textbf{x}^{*b},\ b=1,\ldots ,B\), and the corresponding bootstrap replications of the statistic \(\hat{\theta }=s(\textbf{x})\) itself are given by:

$$\begin{aligned} \hat{\theta }^{*b}=s(\textbf{x}^{*b}),\quad b=1,\ldots ,B. \end{aligned}$$
(10)

The bootstrap estimates of the bias and the standard error are defined by:

$$\begin{aligned}&\quad {{\,\textrm{bias}\,}}_{B}(\hat{\theta })=\hat{\theta }^{*}-\hat{\theta },\end{aligned}$$
(11)
$$\begin{aligned} \widehat{\text {SE}}_B=&\left[ \frac{1}{(B-1)} {\sum \limits _{b=1}^{B}{\left[ {\hat{\theta }^{*b}-\hat{\theta }^{*}}\right] ^{2}}}\right] ^{1/2},\end{aligned}$$
(12)
$$\begin{aligned}&\quad \hat{\theta }^{*}=\frac{1}{B}\sum \limits _{b=1}^{B}{\hat{\theta }^{*b}}. \end{aligned}$$
(13)

Whether estimating the bias or the standard error, the larger the number of bootstraps B, the closer the estimate is to its asymptotic value, i.e.,

$$\begin{aligned} \lim _{B\rightarrow \infty }\widehat{\text {SE}}_B(\hat{\theta }^{*})=\text {SE}_{\hat{F}}(\hat{\theta }^{*}). \end{aligned}$$
(14)

For more details and some examples the reader is referred to [11, Chap. 6, 7, and 10].
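A minimal sketch of (10)–(13) in Python with numpy; the data and the statistic are placeholders:

```python
import numpy as np

def bootstrap_bias_se(x, stat, B=200, seed=0):
    """Bootstrap estimates of the bias (11) and standard error (12)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    # B bootstrap replications of the statistic, Eq. (10)
    reps = np.array([stat(x[rng.integers(0, n, n)]) for _ in range(B)])
    theta_star = reps.mean()                                  # Eq. (13)
    bias = theta_star - stat(x)                               # Eq. (11)
    se = np.sqrt(np.sum((reps - theta_star) ** 2) / (B - 1))  # Eq. (12)
    return bias, se

# Example: bias and SE of the sample mean of hypothetical data.
x = np.random.default_rng(1).normal(size=30)
print(bootstrap_bias_se(x, np.mean, B=500))
```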

2.2 Jackknife Estimate

Instead of replicating from the original dataset, a new set \(\textbf{x} _{(i)}\) is created by removing the case \(x_{i}\) from the dataset. Then the jackknife samples are defined by:

$$\begin{aligned} \textbf{x}_{\left( i\right) }=(x_{1},\ldots ,x_{i-1},x_{i+1},\ldots ,x_{n}),\quad i=1,\ldots ,n, \end{aligned}$$
(15)

and the n-jackknife replications of the statistic \(\hat{\theta }\) are:

$$\begin{aligned} \hat{\theta }_{(i)}=s(\textbf{x}_{\left( i\right) }),\quad i=1,\ldots ,n. \end{aligned}$$
(16)

The jackknife estimates of bias and standard error are defined by:

$$\begin{aligned}&\widehat{{{\,\textrm{bias}\,}}}_{J}=(n-1)(\hat{\theta }^J-\hat{\theta }),\end{aligned}$$
(17)
$$\begin{aligned} \widehat{\text {SE}}_J&=\left[ {\frac{n-1}{n}\sum \limits _{i=1}^{n}{(\hat{\theta }_{\left( i\right) }-\hat{\theta }^J)^{2}}}\right] ^{1/2},\end{aligned}$$
(18)
$$\begin{aligned}&\quad \hat{\theta }^J=\frac{1}{n}\sum \limits _{i=1}^{n}{\hat{\theta }_{\left( i\right) }}. \end{aligned}$$
(19)

For the motivation behind the factors \((n-1)\) in (17) and \((n-1)/n\) in (18), see [11, Chap. 11]. The jackknife estimate of variance is discussed in detail in Efron [6] and Efron and Stein [10].
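For comparison, a sketch of the jackknife counterpart (16)–(19) in the same style:

```python
import numpy as np

def jackknife_bias_se(x, stat):
    """Jackknife estimates of the bias (17) and standard error (18)."""
    x = np.asarray(x)
    n = len(x)
    # The n leave-one-out replications of the statistic, Eq. (16)
    reps = np.array([stat(np.delete(x, i)) for i in range(n)])
    theta_j = reps.mean()                                      # Eq. (19)
    bias = (n - 1) * (theta_j - stat(x))                       # Eq. (17)
    se = np.sqrt((n - 1) / n * np.sum((reps - theta_j) ** 2))  # Eq. (18)
    return bias, se

x = np.random.default_rng(1).normal(size=30)
print(jackknife_bias_se(x, np.mean))  # for the mean, the bias is essentially 0
```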

2.3 Bootstrap Versus Jackknife

Usually, up to 200 bootstraps are required to yield acceptable bootstrap estimates (in special situations, like estimating the uncertainty in classifier performance, it may take up to thousands of bootstraps). Hence, the statistic \(\hat{\theta }\) must be calculated the same number of times B as well. The jackknife, by contrast, requires only n calculations, as shown in (16). Thus, if the sample size is smaller than the required number of bootstraps, the jackknife is more economical in terms of computational cost.

In terms of accuracy, the jackknife can be seen as an approximation to the bootstrap when estimating the standard error of a statistic [11, Chap. 20]. Thus, if the statistic is linear, they give almost the same result (the bootstrap gives the jackknife estimate multiplied by \([(n-1)/n]^{1/2}\)). A statistic \(s(\textbf{x})\) is said to be linear if:

$$\begin{aligned} s(\textbf{x})=\mu +\frac{1}{n}\sum \limits _{i=1}^{n}{\alpha (x_{i})}, \end{aligned}$$
(20)

where \(\mu \) is a constant and \(\alpha (\cdot )\) is a function. This can also be viewed as having one data point at a time in the argument of the function \(\alpha \). Similarly, the jackknife can be seen as an approximation to the bootstrap when estimating the bias. If the statistic is quadratic, they almost agree, except for a normalizing factor. A statistic \(s(\textbf{x})\) is quadratic if:

$$\begin{aligned} s(\textbf{x})=\mu +\frac{1}{n}\sum \limits _{1\le i\le n}{\alpha (x_{i})+\frac{1}{n^{2}}\sum \limits _{1\le i<j\le n}{\beta (x_{i},x_{j})}}. \end{aligned}$$
(21)

An in-depth treatment of the bootstrap and jackknife, and their relation to each other, in mathematical detail is provided by Efron [7, Chaps. 1–5].

If the statistic is not smooth, the jackknife will fail. Informally speaking, a statistic is said to be smooth if a small change in the data leads to a small change in the statistic. An example of a non-smooth statistic is the median. If the sample cases are ranked and the median is calculated, it will not change when a sample case changes unless this sample case bypasses the median value. By the same argument, an example of a smooth statistic is the sample mean.
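A tiny numeric illustration of this point, on hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
for delta in (0.1, 1.0, 3.0):
    y = x.copy()
    y[-1] -= delta  # perturb a single sample case downward
    # The mean responds smoothly (it always moves by -delta/5); the median
    # does not move at all until the perturbed case bypasses the median
    # value, at which point it jumps.
    print(delta, np.mean(y) - np.mean(x), np.median(y) - np.median(x))
```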

2.4 Influence Function, Infinitesimal Jackknife, and Estimate of Variance

The infinitesimal jackknife was introduced by Jaeckel [22]. The concept of the influence curve was introduced later by Hampel [16]. In the present context and for pedagogical purposes, the influence curve will be explained before the infinitesimal jackknife, since the former can be understood as the basis for the latter.

Following Hampel [16], let \(\mathfrak {R}\) be the real line and s be a real-valued functional defined on the distribution F, which is defined on \(\mathfrak {R}\). The distribution F can be perturbed by adding some probability measure (mass) on a point x. This should be balanced by a decrement in F elsewhere, resulting in a new probability distribution \(G_{\varepsilon ,x}\) defined by:

$$\begin{aligned} G_{\varepsilon ,x}=(1-\varepsilon )F+\varepsilon \delta _{x},~x\in \mathfrak {R}. \end{aligned}$$
(22)

Then, the influence curve \(IC_{s,F}(\cdot )\) is defined by:

$$\begin{aligned} IC_{s,F}(x)=\lim _{\varepsilon \rightarrow 0^{+}}\frac{s\left( \left( 1-\varepsilon \right) F+\varepsilon \delta _{x}\right) -s\left( F\right) }{\varepsilon }. \end{aligned}$$
(23)

It should be noted that F does not have to be a discrete distribution. A simple example of applying the influence curve concept is to consider the expectation \(s=\int {x~dF(x)}=\mu \). In that case \(s\left( (1-\varepsilon )F+\varepsilon \delta _{x}\right) =(1-\varepsilon )\mu +\varepsilon x\), and substituting back in (23) gives:

$$\begin{aligned} IC_{s,F}(x)=x-\mu . \end{aligned}$$
(24)

The meaning of this formula is the following: the rate of change of the functional s with the probability measure at a point x is \(x-\mu \). This is how the point x influences the functional s. The influence curve can be used to linearly approximate a functional s, along with its variance, which is similar to taking up to only the first-order term in a Taylor series expansion (Appendix 7.2).
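A numerical check of (23) and (24) under the empirical distribution of a hypothetical sample (Python with numpy):

```python
import numpy as np

x = np.random.default_rng(1).normal(loc=2.0, size=1000)
mu_hat = x.mean()  # the mean functional s evaluated at F-hat

def s_perturbed(point, eps):
    # s((1 - eps) * F-hat + eps * delta_point) for the mean functional
    return (1 - eps) * mu_hat + eps * point

eps = 1e-6
for point in (-1.0, 2.0, 5.0):
    ic = (s_perturbed(point, eps) - mu_hat) / eps  # finite-eps version of (23)
    print(point, ic, point - mu_hat)               # the two columns agree, Eq. (24)
```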

It is important to state here that s should be a functional in \(\hat{F}\), which is an approximation to F, as was initially assumed in (23). If, for example, the value of the statistic s changes when every sample case \(x_{i}\) is duplicated, i.e., repeated twice, then it is not a functional statistic. An example of a functional statistic is the biased version of the variance estimate \(\varSigma _{i}(x_{i}-\bar{x})^{2}/n\), while the unbiased version \(\varSigma _{i}(x_{i}-\bar{x})^{2}/(n-1)\) is not a functional statistic. Generally, any approximation \(s(\hat{F})\) to the functional s(F), obtained by approximating F by the MLE \(\hat{F}\), is obviously functional. In such a case the statistic \(s(\hat{F})\) is called the plug-in estimate of the functional s(F). Moreover, the influence function (IF) method for variance estimation is applicable only to those functional statistics whose derivative (73) exists. If that derivative exists, the statistic is called a smooth statistic; i.e., a small change in the dataset leads to a small change in the statistic. For instance, although the median is a functional statistic, in the sense that duplicating every sample case results in the same value of the median, it is not smooth, as described at the end of Sect. 2.3. A key reference for the IF is Hampel [17]. Appendix 7.2 shows an interesting connection to the jackknife estimate.

3 Nonparametric Methods for Estimating the Error Rate of a Classification Rule

The review provided in this section is a terse summary of the main work in Efron [8, 11, 13]. In the previous section the statistic, or generally speaking the functional, was a function of just one dataset. For a non-fixed design, i.e., when the predictors of the testing set do not have to be the same as the predictors of the training dataset, a slight clarification of the previous notation is needed. The classification rule trained on the training dataset \(\textbf{t}\) will be denoted by \(\eta _{\textbf{t}}\). Any new observation that does not belong to \(\textbf{t}\) will be denoted by \(t_{0}=(x_0,y_0)\). Therefore, the classification loss is given by \(L(y_{0},\eta _{\textbf{t}}(x_{0}))\). Any performance measure conditional on that training dataset will be similarly subscripted by \(\textbf{t}\); hence the risk and the error rate (1) will be denoted by \({\textrm{R}}_{\textbf{t}}\) and \({\textrm{Err}}_{\textbf{t}}\), respectively. In the sequel, for simplicity and WLOG, the 0-1 loss function will be used. In such a case the conditional error rate is given by:

$$\begin{aligned} {\textrm{Err}}_{\textbf{t}}=\text {E}_{0F}L\left( y_{0},\eta _{\textbf{t}}\left( x_{0}\right) \right) ,\quad \left( x_{0},y_{0}\right) \sim F. \end{aligned}$$
(25)

The expectation \(\text {E}_{0F}\) is subscripted so as to emphasize that it is taken over the observations \(t_{0}\notin \textbf{t}\). If the performance is measured in terms of the error rate and we are interested in the mean performance, not the conditional one, then it is given by:

$$\begin{aligned} {\textrm{Err}}=\text {E}_{\textbf{t}}{\textrm{Err}}_{\textbf{t}}. \end{aligned}$$
(26)

The expectation \(\text {E}_{\textbf{t}}\) is the expectation over the training dataset \(\textbf{t}\), which would be the same if we had written \(\text {E}_{F}\); for notation clarity the former is chosen.

Consider a classification rule \(\eta _{\textbf{t}}\) already trained on a training dataset \(\textbf{t}\). A natural next question is: given that there is just a single dataset available, how can this dataset be used in assessing the classifier performance as well? Said differently, how should one estimate, using only the available dataset, the true performance of a classification rule in predicting new observations, i.e., observations different from those on which the rule was trained? In this section, we review the principal methods in the literature for estimating both the true error rate (25) and its mean (26).

3.1 Apparent Error

The apparent error is the error of the fitted model when it is tested on the same training data. Of course it is downward biased with respect to the true error rate since it results from testing on the same information used in training [9]. The apparent error is defined by:

$$\begin{aligned} \overline{{\textrm{Err}}}_{\textbf{t}}&=\text {E}_{\hat{F}}L(y,\eta _{\textbf{t}}(x)),\quad (x,y)\in \textbf{t}\end{aligned}$$
(27a)
$$\begin{aligned}&=\frac{1}{n}\left[ \sum \limits _{i=1}^{n_{1}}I_{\hat{h}_{\textbf{t}}(x_{i}|\omega _{1})<th}+\sum \limits _{j=1}^{n_{2}}I_{\hat{h}_{\textbf{t}}(x_{j}|\omega _{2})>th}\right] . \end{aligned}$$
(27b)

Overfitting a classifier to minimize the apparent error is not the goal. The goal is to minimize the true error rate (25) or its mean (26).

3.2 Cross Validation (CV)

The basic concept of CV, as a resampling approach, has been proposed in different articles since the mid-1930s. The concept simply rests on splitting the data into two parts; the first part is used in design (or training) without any involvement of the second part. Then the second part is used to test the designed procedure; this tests how the designed procedure will behave on new datasets. Stone [28] is a key reference for CV that proposes different criteria for optimization.

CV can be used to assess the prediction error of a model or in model selection. The true error rate in (25) is the expected error rate for a classification rule if tested on the population, conditional on a particular training dataset \(\textbf{t}\). This performance measure can be approximated by the leave-one-out CV (LOOCV) by:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{cv1}}=\frac{1}{n}\sum \limits _{i=1}^{n}{L}\left( {y_{i},\eta _{\textbf{t}^{(i)}}(x_{i})}\right) ,~~~(x_{i},y_{i})\in \textbf{t}. \end{aligned}$$
(28)

This is done by training the classification rule on the dataset \(\textbf{t}^{\left( i\right) }\) that does not include the case \(t_{i}\), then testing the trained rule on that omitted case. This proceeds in “round-robin” fashion until all cases have contributed, one at a time, to the error rate. There is a hidden assumption in this mechanism: the training dataset \(\textbf{t}\) will not change very much by omitting a single case. Therefore, testing on the omitted observations one at a time amounts to testing approximately the same trained rule on n new cases, all different from each other and different from those the classifier has been trained on. Besides the LOOCV, there are other versions named K-fold (or leave-n/K-out). In such versions the whole dataset is split into K roughly equal-sized subsets, each of which contains approximately n/K observations. The classifier is trained on \(K-1\) subsets and tested on the left-out one; hence we have K iterations. It is clear that the LOOCV is a special case of the K-fold CV, where \(K=n\).
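A minimal sketch of the K-fold mechanism (with LOOCV as the special case \(K=n\)) in Python with numpy, using a simple nearest-centroid rule as a stand-in for any classification rule; the data, the rule, and all function names are illustrative placeholders:

```python
import numpy as np

def fit_centroids(X, y):
    """Train: one centroid per class (a stand-in for any classification rule)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_centroids(model, X):
    classes = np.array(sorted(model))
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return classes[d.argmin(axis=0)]

def kfold_error(X, y, K, seed=0):
    """K-fold CV estimate of the error rate; K = n gives the LOOCV, Eq. (28)."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    losses = []
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        model = fit_centroids(X[train], y[train])
        losses.extend(predict_centroids(model, X[test]) != y[test])
    return np.mean(losses)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(1.5, 1.0, (20, 2))])
y = np.repeat([0, 1], 20)
print(kfold_error(X, y, K=5), kfold_error(X, y, K=len(y)))  # 5-fold vs LOOCV
```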

It is of interest to assess this estimator to see whether it estimates the conditional true error rate \({\textrm{Err}}_{\textbf{t}}\) with small MSE, \(\text {E}\left[ {\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{cv1}}-{\textrm{Err}}_{\textbf{t}}}\right] ^{2}\), as it was designed to do, or not. Many simulation results, e.g., Efron [8], show that there is only a very weak correlation between the CV estimator \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{cv1}}\) and the conditional true error rate \({\textrm{Err}}_{\textbf{t}}\). This issue is discussed in mathematical detail in the excellent paper by Zhang [35]. The other estimators that are based on resampling as well, and that will be reviewed below, are shown to have this same attribute. This very interesting (and perhaps surprising) result means the following: whether an estimator is designed to estimate the conditional performance or the mean performance, it indeed estimates the latter, because of the weak correlation with the former.

3.3 Bootstrap Methods for Error Rate Estimation

The prediction error in (25) is a function of the training dataset \(\textbf{t}\) and the testing population F. Bootstrap estimation can be implemented here by treating the empirical distribution \(\hat{F}\) as an approximation to the actual population distribution F. By replicating from that distribution one can simulate many training datasets \(\textbf{t}^{*b},~b=1,\ldots ,B\). On every replicated training dataset the classifier is trained and then tested on the original dataset \(\textbf{t}\). This is the simple bootstrap (SB) estimator [11, Sect. 17.6], defined formally by:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{SB}}=\text {E}_{*}\sum \limits _{i=1}^{n}{L(y_{i},\eta _{\textbf{t}^{*}}(x_{i}))/n,\quad \hat{F}\rightarrow \textbf{t}^{*}}. \end{aligned}$$
(29)

It should be noted that this estimator no longer estimates the true error rate (25) because the expectation taken over the bootstraps mimics an expectation taken over the population of trainers, i.e., it is not conditional on a particular training dataset. Rather, the estimator (29) estimates the expected performance of the classifier \(\text {E}_{F}{\textrm{Err}}_{\textbf{t}}\). For a finite number of bootstraps, the expectation (29) can be approximated by:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{SB}}=\frac{1}{B}\sum \limits _{b=1}^{B}{\sum \limits _{i=1}^{n}{L}}\left( {{y_{i},\eta _{\textbf{t}^{*b}}(x_{i})}}\right) /n. \end{aligned}$$
(30)

3.3.1 Leave-One-Out Bootstrap (LOOB)

The previous estimator is obviously biased since the original dataset \(\textbf{t}\) used for testing includes part of the training data in every bootstrap replicate. Efron [8] proposed that, after training the classifier on every bootstrap replicate, it is tested on those cases in the set \(\textbf{t}\) that are not included in the training; this concept can be developed as follows. Equation (30) can be rewritten by interchanging the order of the double summation to give:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{SB}}=\frac{1}{n}\sum \limits _{i=1}^{n}{\sum \limits _{b=1}^{B}{L}}\left( {{y_{i},\eta _{\textbf{t}^{*b}}(x_{i})}}\right) \Big /B. \end{aligned}$$
(31)

This equation is formally identical to (30) but it expresses a different mechanism for evaluating the same quantity. It says that, for a given point, the average performance over the bootstrap replicates is calculated; then this performance is averaged over all the n cases. Now, if every case \(t_{i}\) is tested only from those bootstraps that did not include it in the training, a slight modification of the previous expression yields the leave-one-out bootstrap (LOOB) estimator:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}=\frac{1}{n}\sum \limits _{i=1}^{n}\left[ {\sum \limits _{b=1}^{B}{I_{i}^{b}L}}\left( {{y_{i},\eta _{\textbf{t}^{*b}}(x_{i})}}\right) \Bigg / \sum \limits _{{b}^{\prime }=1}^{B}{I_{i}^{{b}^{\prime }}} \right] , \end{aligned}$$
(32)

where the indicator function \(I_{i}^{b}\) equals one when the case \(t_{i}\) is not included in the training replicate b, and zero otherwise. Efron and Tibshirani [13] emphasized a critical point about the difference between this bootstrap estimator and the LOOCV. The CV tests on a given sample case \(t_{i}\), having been trained just once on the remaining dataset. By contrast, the LOOB tests on a given sample case \(t_{i}\) using a large number of classifiers that result from a large number of bootstrap replicates that do not contain that sample. This results in a smoothed cross-validation-like estimator. We explained and elaborated on this smoothness property in Yousef [30].
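A minimal sketch of (32), assuming user-supplied training and prediction routines (for instance, the hypothetical nearest-centroid pair from the CV sketch in Sect. 3.2):

```python
import numpy as np

def loob_error(X, y, fit, predict, B=200, seed=0):
    """Leave-one-out bootstrap estimate of the error rate, Eq. (32)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    loss_sum = np.zeros(n)   # accumulated loss of each case over bootstraps
    out_count = np.zeros(n)  # number of bootstraps excluding each case (I_i^b)
    for _ in range(B):
        idx = rng.integers(0, n, n)            # bootstrap replicate t*b
        out = np.setdiff1d(np.arange(n), idx)  # cases left out of t*b
        if out.size == 0:
            continue
        model = fit(X[idx], y[idx])
        loss_sum[out] += predict(model, X[out]) != y[out]
        out_count[out] += 1
    keep = out_count > 0  # cases left out by at least one bootstrap
    return np.mean(loss_sum[keep] / out_count[keep])
```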

3.3.2 The Refined Bootstrap (RB)

The SB and the LOOB, from their definitions, appear designed to estimate the mean true error rate (26) of a classifier. For estimating the true conditional error rate of a classifier, conditional on a particular training dataset, Efron [8] proposed to correct the downward-biased estimator \(\overline{{\textrm{Err}}}_{\textbf{t}}\). Since the true error rate \({\textrm{Err}}_{\textbf{t}}\) can be written as \(\overline{{\textrm{Err}}}_{\textbf{t}}+({\textrm{Err}}_{\textbf{t}}-\overline{{\textrm{Err}}}_{\textbf{t}})\), it can be approximated by \(\overline{{\textrm{Err}}}_{\textbf{t}}+\text {E}_{F}({\textrm{Err}}_{\textbf{t}}-\overline{{\textrm{Err}}}_{\textbf{t}})\). The term \(({\textrm{Err}}_{\textbf{t}}-\overline{{\textrm{Err}}}_{\textbf{t}})\) is called the optimism. The expectation of the optimism can be approximated over the bootstrap population. Finally, the refined bootstrap approach, as named in Efron and Tibshirani [11, Sect. 17.6], gives the estimator:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{RB}}=\overline{{\textrm{Err}}}_{\textbf{t}}+\text {E}_{*}({\textrm{Err}}_{\textbf{t}*}(\hat{F})-\overline{{\textrm{Err}}}_{\textbf{t}*}), \end{aligned}$$
(33)

where \({\textrm{Err}}_{\textbf{t}*}(\hat{F})\) represents the error rate obtained from training the classifier on all bootstrap replicates \(\textbf{t}^{*}\) and testing on the empirical distribution \(\hat{F}\). This can be approximated for a limited number of bootstraps by:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{RB}}=\overline{{\textrm{Err}}}_{\textbf{t}}+\frac{1}{B}\sum \limits _{b=1}^{B}\left[ {\sum \limits _{i=1}^{n}{L}}\left( {{y_{i},\eta _{\textbf{t}^{*b}}(x_{i})}}\right) {/n-\sum \limits _{i=1}^{n}{L}}\left( {{y_{ib}^{*},\eta _{\textbf{t}^{*b}}(x_{ib}^{*})}}\right) {/n}\right] . \end{aligned}$$
(34)

3.3.3 The 0.632 Bootstrap

If the concept used in developing the LOOB estimator, i.e., testing on cases not included in training, is used again in estimating the optimism described above, this gives the 0.632 bootstrap estimator. Since the probability of including a case \(t_{i}\) in the bootstrap \(\textbf{t}^{*b}\) is given by:

$$\begin{aligned} \Pr (t_{i} \in \textbf{t}^{*b})&=1-(1-1/n)^{n} \approx 1-e^{-1}=0.632, \end{aligned}$$
(35)

the effective number of sample cases contributing to a bootstrap replicate is approximately 0.632 of the size of the training dataset. Efron [8] introduced the concept of a distance between a point and a sample in terms of a probability. Having trained on a bootstrap replicate, testing on those cases in the original dataset not included in the bootstrap replicate amounts to testing on a set far from the training one, i.e., the bootstrap replicate, because every sample case in the testing set has zero probability of belonging to the training dataset. This is one reason why the LOOB is an upward-biased estimator. Efron [8] showed roughly that:

$$\begin{aligned} \text {E}_{F}\left[ {{\textrm{Err}}_{\textbf{t}}-\overline{{\textrm{Err}}}_{\textbf{t}}}\right] \approx 0.632\,\text {E}_{F}\left[ {\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}-\overline{{\textrm{Err}}}_{\textbf{t}}}\right] . \end{aligned}$$
(36)

Substituting back in (33) gives the 0.632 estimator:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{(0.632)}}=0.368\,\overline{{\textrm{Err}}}_{\textbf{t} }+0.632\,\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}. \end{aligned}$$
(37)

The proof of the above results can be found in Efron [8] and Efron and Tibshirani [11, Sect. 6].

The motivation behind this estimator, as stated earlier, is to correct the downward-biased apparent error by adding a piece of the upward-biased LOOB estimator. However, an increase in variance should be expected as a result of adding this piece of the relatively variable apparent error. Moreover, this new estimator is no longer smooth, since the apparent error itself is unsmooth.
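Before moving on, the inclusion probability (35), on which the 0.632 weighting rests, is easy to check by simulation; a minimal sketch (n and B are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 10000
# Fraction of bootstrap replicates (size n, drawn with replacement) that
# contain a fixed case, here the case with index 0.
hits = np.mean([np.any(rng.integers(0, n, n) == 0) for _ in range(B)])
print(hits, 1 - (1 - 1 / n) ** n, 1 - np.exp(-1))  # all approximately 0.632
```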

3.3.4 The 0.632+ Bootstrap Estimator

The 0.632 estimator reduces the bias of the apparent error. But for over-trained classifiers, i.e., those whose apparent error tends to be zero, the 0.632 estimator is still downward biased. Breiman et al. [3] provided the example of an overfitted rule, like 1NN, whose apparent error is zero. If, however, the class labels are assigned randomly to the predictors, the true error rate will obviously be 0.5, whereas substituting in (37) gives an estimate of \(0.632\times 0.5=0.316\). To account for this bias for such over-fitted classifiers, Efron and Tibshirani [13] defined the no-information error rate \(\gamma \) by:

$$\begin{aligned} \gamma =\text {E}_{0F_{ind}} {L}\left( {y_{0},\eta _{\textbf{t}}(x_{0})}\right) , \end{aligned}$$
(38)

where \(F_{ind}\) means that \(x_{0}\) and \(y_{0}\) are distributed marginally as F but they are independent. Or said differently, the label is assigned randomly to the predictor. Then for a training sample \(\textbf{t}\), \(\gamma \) can be estimated by:

$$\begin{aligned} \hat{\gamma }= \frac{1}{{n^{2}}}\sum \limits _{i=1}^{n}{\sum \limits _{j=1}^{n}{L}}\left( {{y_{i},\eta _{\textbf{t}}(x_{j})}}\right) . \end{aligned}$$
(39)

This means that the n predictors have been permuted with the n responses to produce \(n^{2}\) non-informative cases. In the special case of binary classification, let \(\hat{p}_{1}\) be the proportion of the responses \(y_{i}\) belonging to class 1, and let \(\hat{q}_{1}\) be the proportion of the predictions \(\eta _{\textbf{t}}(x_{i})\) assigned to class 1. Then (39) reduces to:

$$\begin{aligned} \hat{\gamma }=\hat{p}_{1}(1-\hat{q}_{1})+(1-\hat{p}_{1})\hat{q}_{1}. \end{aligned}$$
(40)

Also define the relative overfitting rate:

$$\begin{aligned} \hat{R}=\frac{\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}-\overline{{\textrm{Err}}}_{\textbf{t}}}{\hat{\gamma }-\overline{{\textrm{Err}}}_{\textbf{t}}}. \end{aligned}$$
(41)

Efron and Tibshirani [13] showed that the bias of the 0.632 estimator for the case of over-fitted classifiers is alleviated by using a renormalized version of that estimator:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{(0.632+)}}&=(1-\hat{w})\overline{{\textrm{Err}}}_{\textbf{t}}+\hat{w}\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }},\end{aligned}$$
(42a)
$$\begin{aligned} \hat{w}&=\frac{0.632}{1-0.368\hat{R}}. \end{aligned}$$
(42b)

It is useful to express the 0.632+ estimator in terms of its predecessor, the 0.632 estimator. Combining (37), (40), and (41) then substituting in (42a) yields:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( {0.632+}\right) }}=\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( {0.632}\right) }}+(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}-\overline{{\textrm{Err}}}_{\textbf{t}})\frac{0.368\cdot 0.632\cdot \hat{R}}{1-0.368\hat{R}}. \end{aligned}$$
(43)

Efron and Tibshirani [13] consider the possibility that \(\hat{R}\) lies outside the interval \(\left[ {0,1}\right] \). This leads to their proposal of defining:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }\prime }&=\min (\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }},\hat{\gamma }),\end{aligned}$$
(44)
$$\begin{aligned} {\hat{R}}^{\prime }&=\left\{ {\begin{array}{lll} (\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}-\overline{{\textrm{Err}}}_{\textbf{t}})/(\hat{\gamma }-\overline{{\textrm{Err}}}_{\textbf{t}}) &{} &{} \overline{{\textrm{Err}}}_{\textbf{t}}<\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}<\hat{\gamma }\\ 0 &{} &{} \text {otherwise} \end{array}}\right. , \end{aligned}$$
(45)

to obtain a modification to (43) that finally becomes:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{\left( {0.632+}\right) }=\widehat{{\textrm{Err}}}_{\textbf{t}}^{\left( {0.632}\right) }+(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }\prime }-\overline{{\textrm{Err}}}_{\textbf{t}})\frac{0.368\cdot 0.632\cdot {\hat{R}}^{\prime }}{1-0.368{\hat{R}}^{\prime }}. \end{aligned}$$
(46)
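Putting (37), (40), (44), (45), and (46) together, a minimal sketch (the function name err_632_plus is ours, and the input error rates and class proportions are hypothetical):

```python
def err_632_plus(app_err, loob_err, p1_hat, q1_hat):
    """The 0.632+ estimator assembled from Eqs. (37), (40), (44)-(46)."""
    gamma = p1_hat * (1 - q1_hat) + (1 - p1_hat) * q1_hat  # Eq. (40)
    err632 = 0.368 * app_err + 0.632 * loob_err            # Eq. (37)
    loob_prime = min(loob_err, gamma)                      # Eq. (44)
    if app_err < loob_err < gamma:                         # Eq. (45)
        r_prime = (loob_err - app_err) / (gamma - app_err)
    else:
        r_prime = 0.0
    return err632 + (loob_prime - app_err) * (
        0.368 * 0.632 * r_prime / (1 - 0.368 * r_prime))   # Eq. (46)

# Over-fitted scenario: apparent error 0, LOOB error 0.45, balanced classes
# so that gamma = 0.5. The plain 0.632 estimate is about 0.284; the 0.632+
# correction pulls it up to about 0.43, much closer to the truth.
print(err_632_plus(0.0, 0.45, 0.5, 0.5))
```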

3.4 Estimating the Standard Error of Error Rate Estimators

Several resampling methods have been reviewed above: the CV, 0.632, and 0.632+ estimators estimate the conditional error rate of a classification rule, conditional on the training dataset, while the LOOB estimates the mean error rate, where the expectation is taken over the population of training datasets. Regardless of what an estimator is designed to estimate, it is still a function of the current dataset \(\textbf{t}\), i.e., it is a random variable. If, e.g., the LOOB estimator \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}\) is considered, it estimates a constant real-valued parameter \(\text {E}_{0F}\text {E}_{F}L(y_{0},\eta _{\textbf{t}}(x_{0}))\), with the expectations taken over all the trainers and then over all the testers, respectively; this is the overall mean error rate. Yet, \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}\) is a random variable whose variability comes from the finite size of the available dataset. If the classifier were trained and tested on a very large number of observations, this would approximate training and testing on the entire population, and the variability would shrink to zero. This also applies to any performance measure other than the error rate. So, we are now interested in estimating \(\text {Var}_{\textbf{t}}\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}\), the variance of the estimator, not \(\text {Var}_{\textbf{t}}{\textrm{Err}}_{\textbf{t}}\), the variance of the true performance.

The next question then is, having estimated the mean performance of a classifier: what is the associated uncertainty of this estimate? Said differently: can an estimate of the variance of this estimator be obtained from the same dataset? Efron and Tibshirani [13] proposed the use of the IF method (Sect. 2.4) to estimate the uncertainty (variability) in \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}\). The reader is alerted that estimators that incorporate a piece of the apparent error are not suitable for the IF method; such estimators are not smooth because the apparent error itself is not smooth.

By recalling the definitions of Sect. 2.4, \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}\) is now the statistic \(s(\hat{F})\). To simplify notation, the error \(L(y_{i},\eta _{\textbf{t}^{*b}}(x_{i}))\) may be denoted by \(L_{i}^{b}\), and define the following notation:

$$\begin{aligned} l_{\cdot }^{b}=\frac{1}{n}\sum \limits _{i=1}^{n}{I_{i}^{b}L_{i}^{b}}.\end{aligned}$$
(47)

Also, define \(N_{i}^{b}\) to be the number of times the case \(t_{i}\) is included in the bootstrap b. Then, it has been proven in Efron and Tibshirani [12] that the IF of such an estimator is given by:

$$\begin{aligned} \left. {\frac{\partial s(\hat{F}_{\varepsilon ,i})}{\partial \varepsilon } }\right| _{\varepsilon =0}=(2+\frac{1}{n-1})(\hat{E}_{i}-\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }})+\frac{n\sum \nolimits _{b=1} ^{B}{(N_{i}^{b}-\bar{N}_{i})\,l_{\cdot }^{b}}}{\sum \nolimits _{b=1}^{B}{I_{i}^{b}}}. \end{aligned}$$
(48)

Here \(\hat{E}_{i}=\sum \nolimits _{b}I_{i}^{b}L_{i}^{b}\big /\sum \nolimits _{b}I_{i}^{b}\) is the inner average of (32), i.e., the LOOB error of the case \(t_{i}\), and \(\bar{N}_{i}=\sum \nolimits _{b}N_{i}^{b}/B\). Combining (78) and (48) gives an estimate of the uncertainty in \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}\).

4 Nonparametric Methods for Estimating the AUC of a Classification Rule

In the present section, we extend the study carried out in Efron [8] and Efron and Tibshirani [13], summarized in Sect. 3, to construct nonparametric estimators for the AUC (a two-sample statistic) analogous to those of the error rate (a one-sample statistic). Although some previous experimental comparative studies [26, 27, 32] compared some of these resampling-based AUC estimators, in particular the 0.632 versions, there was no theoretical justification for using these estimators for the AUC. We provide here a full account of the different versions of the bootstrap estimators reviewed in Sect. 3 and show how they can be formally extended to estimate the AUC.

4.1 Construction of Nonparametric Estimators for AUC

Before switching to the AUC, some more elaboration on Sect. 3 is needed. The SB estimator (29) can be rewritten as:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{SB}}=\text {E}_{*}\text {E}_{\widehat{F}}\left[ {L(\eta _{\textbf{t}^{*} }(x),y)|\textbf{t}^{*}} \right] . \end{aligned}$$
(49)

Since there would be some observation overlap between \(\textbf{t}\) and \(\textbf{t}^{*}\), this approach suffers from an obvious bias, as discussed in that section. This was the motivation behind interchanging the expectations and defining the LOOB (Sect. 3.3.1). Alternatively, we could have kept the order of the expectations but tested only on those observations in \(\textbf{t}\) that do not appear in the bootstrap replication \(\textbf{t}^{*}\), i.e., on the distribution \(\hat{F}^{\left( *\right) }\). The parenthesis notation \(\left( *\right) \) refers to excluding from \(\widehat{F}\), in the testing stage, the training cases \(\textbf{t}^{*}\) that were generated by the bootstrap replication. We call the resulting estimator \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( *\right) }}\), which we define formally by:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( *\right) }}&=\text {E}_{*}\text {E}_{\hat{F}^{\left( *\right) }}\left[ {L(\eta _{\textbf{t}^{*}}(x),y)|\textbf{t}^{*} }\right] \end{aligned}$$
(50)

We can give the inner expectation the notation \({\textrm{Err}}_{\textbf{t}^{*b}}(\widehat{F}^{\left( *\right) })\), and rewrite the estimator as:

$$\begin{aligned} \widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( *\right) }}&= \text {E}_{*}{{\textrm{Err}}_{\textbf{t}^{*b}}(\widehat{F}^{\left( *\right) })}\end{aligned}$$
(51a)
$$\begin{aligned}&=\frac{1}{B}\sum \limits _{b=1}^{B}\left[ \sum \limits _{i=1}^{N}{I_{i}^{b}L(\eta _{\textbf{t}^{*b}}(x_{i}),y_{i}) \Big / \sum \limits _{i'=1}^{N}{I_{i'}^b}}\right] , \end{aligned}$$
(51b)

where the indicator \(I_{i}^{b}\) equals one if the observation \(t_{i}\) is excluded from the bootstrap replication \(\textbf{t}^{*b}\), and equals zero otherwise. The inner expectation in (50) is taken over those observations not included in the bootstrap replication \(\textbf{t}^{*}\), whereas the outer expectation is taken over all the bootstrap replications.

Analogously to Sect. 3, and to what has been introduced above, we can define several bootstrap estimators for the AUC. The starting point is the SB estimator, which can be defined as:

$$\begin{aligned} \widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{SB}}&=\text {E}_{*}{{\textrm{AUC}}_{\textbf{t}^{*}}(\widehat{F})},\quad \widehat{F}\rightarrow \textbf{t}^{*} \end{aligned}$$
(52a)
$$\begin{aligned}&=\text {E}_{*}\left[ {\frac{1}{n_{1}n_{2}}\sum \limits _{j=1}^{n_{2}}{\sum \limits _{i=1}^{n_{1}}{\psi (\hat{h}_{\textbf{t}^{*}}(x_{i}),\hat{h}_{\textbf{t}^{*}}(x_{j}))}}}\right] ,\quad x_{i}\in \omega _{1}, x_{j}\in \omega _{2}. \end{aligned}$$
(52b)

This averages the Mann-Whitney statistic over the bootstraps, where \({\textrm{AUC}}_{\textbf{t}^{*}}(\widehat{F})\) refers to the AUC obtained from training the classifier on the bootstrap replicate \(\textbf{t}^{*}\) and testing it on the empirical distribution \(\widehat{F}\). In the approach used here, the bootstrap replicate \(\textbf{t}^{*}\) preserves the ratio between \(n_{1}\) and \(n_{2}\), which is called stratification. That is, the training sample \(\textbf{t}\) is treated as \(\textbf{t}=\textbf{t}_{1}\cup \textbf{t}_{2},\ \textbf{t}_{1}\in \omega _{1},\ \textbf{t}_{2}\in \omega _{2}\); then \(n_{1}\) cases are replicated from the first-class sample and \(n_{2}\) cases are replicated from the second-class sample to produce \(\textbf{t}_{1}^{*}\) and \(\textbf{t}_{2}^{*}\) respectively, where \(\textbf{t}^{*}=\textbf{t}_{1}^{*}\cup \textbf{t}_{2}^{*}\). This was not needed when the performance measure was the error rate, since the error rate does not operate simultaneously on two different sets of observations as the Mann-Whitney statistic does (in U-statistic theory [25], the error rate and the Mann-Whitney statistic are called one-sample and two-sample statistics, respectively). The expectation (52a) is approximated by averaging over a finite number of bootstraps:

$$\begin{aligned} \widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{SB}}=\frac{1}{B}\sum \limits _{b=1}^{B}{{\textrm{AUC}}_{\textbf{t}^{*b}}(\widehat{F})}. \end{aligned}$$
(53)

The same motivation behind the estimator (32) can be applied here, i.e., testing only on those cases in \(\textbf{t}\) that are not included in the training dataset \(\textbf{t}^{*b}\), in order to reduce the bias. This can be carried out in (53) without interchanging the summation order. The new estimator is named \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{(*)}}\), where the parenthesis notation \(\left( *\right) \) refers to the exclusion, in the testing stage, of the training cases that were generated from the bootstrap replication. Formally, we define this as:

$$\begin{aligned} \widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}&=\text {E}_{*}{{\textrm{AUC}}_{\textbf{t}^{*b}}(\widehat{F}^{\left( *\right) })}\end{aligned}$$
(54a)
$$\begin{aligned}&=\frac{1}{B}\sum \limits _{b=1}^{B}\left[ {\sum \limits _{j=1}^{n_{2}} {\sum \limits _{i=1}^{n_{1}}{\psi ({{{\hat{h}_{\textbf{t}^{*b}}(x_{i}),\hat{h}_{\textbf{t}^{*b}}(x_{j}))}}}I_{i}^{b}I_{j}^{b}}}} \Big / \sum _{i^{\prime } =1}^{n_{1}}I_{i^{\prime }}^{b}\sum _{j^{\prime }=1}^{n_{2}}I_{j^{\prime }} ^{b} \right] . \end{aligned}$$
(54b)

The RB and 0.632 estimators can be introduced here in the same way they were introduced for the true error rate (Sects. 3.3.2 and 3.3.3):

$$\begin{aligned} \widehat{{\textrm{AUC}}}_{\textbf{t}}^{RB}=\overline{{\textrm{AUC}}}_{\textbf{t}}+\text {E}_{*} \left[ {\textrm{AUC}}_{\textbf{t}*}(\widehat{F})-\overline{{\textrm{AUC}}}_{\textbf{t}*}\right] . \end{aligned}$$
(55)

Then, if testing is carried out on cases excluded from the bootstraps, analogously to the 0.632 estimator of the error rate, this gives rise to the 0.632 estimator of the AUC:

$$\begin{aligned} \widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{(0.632)}}=0.368\,\overline{{\textrm{AUC}}}_{\textbf{t} }+0.632\,\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}. \end{aligned}$$
(56)

It should be noted that this estimator is designed to estimate the true AUC of a classifier trained on the dataset \(\textbf{t}\) (the classifier performance conditional on the training dataset \(\textbf{t}\)). This is in contrast to the estimator (54), which estimates the mean performance of the classifier (the expectation, over the population of training datasets, of the conditional performance).

The 0.632+ estimator \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{(0.632+)}}\) develops from \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{(0.632)}}\) in the same way as \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{(0.632+)}}\) developed from \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{(0.632)}}\) in Sect. 3.3.4. There are two modifications to the details. The first regards the no-information error rate \(\gamma \): it can be proven that the no-information AUC is given by \(\gamma _{{\textrm{AUC}}} = 0.5\) (Lemma 2). The second regards the definitions (44) and (45), which should be modified to accommodate the AUC. The new definitions are given by:

$$\begin{aligned} \widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( {0.632+}\right) }}&=\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( {0.632}\right) }}+(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }\prime }-\overline{{\textrm{AUC}}}_{\textbf{t}})\frac{0.368\cdot 0.632\cdot {\hat{R}}^{\prime }}{1-0.368{\hat{R}}^{\prime }},\end{aligned}$$
(57a)
$$\begin{aligned} \widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }\prime }&=\max \bigl ( \widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }},\gamma _{{\textrm{AUC}}}\bigr ),\end{aligned}$$
(57b)
$$\begin{aligned} {\hat{R}}^{\prime }&=\left\{ \begin{array}{ccc} \frac{(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}-\overline{{\textrm{AUC}}}_{\textbf{t}})}{(\gamma _{{\textrm{AUC}}}-\overline{{\textrm{AUC}}}_{\textbf{t}})} &{} &{} \text {if}~\overline{{\textrm{AUC}}}_{\textbf{t}}>\widehat{{\textrm{AUC}}}_{\textbf{t}}^{\left( *\right) }>\gamma _{{\textrm{AUC}}}\\ 0 &{} &{} \text {otherwise} \end{array}\right. . \end{aligned}$$
(57c)
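As a sketch, the chain (56)-(57) reduces to simple arithmetic once the apparent AUC \(\overline{{\textrm{AUC}}}_{\textbf{t}}\) and \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}\) are in hand; the function below is our own illustrative coding of these equations, not a reference implementation.

```python
def auc_632plus(auc_apparent, auc_star_val, gamma=0.5):
    # Eqs. (56)-(57): auc_apparent is the apparent AUC (train and test on t),
    # auc_star_val is AUC^(*), and gamma is the no-information AUC (Lemma 2).
    auc_632 = 0.368 * auc_apparent + 0.632 * auc_star_val        # Eq. (56)
    auc_star_prime = max(auc_star_val, gamma)                    # Eq. (57b)
    if auc_apparent > auc_star_val > gamma:                      # Eq. (57c)
        r_hat = (auc_star_val - auc_apparent) / (gamma - auc_apparent)
    else:
        r_hat = 0.0
    # Eq. (57a): pull AUC^(0.632) back toward AUC^(*)' according to the
    # relative overfitting rate r_hat.
    return auc_632 + (auc_star_prime - auc_apparent) * \
        (0.368 * 0.632 * r_hat) / (1.0 - 0.368 * r_hat)
```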

To this point, we have constructed the AUC nonparametric estimators analogous to those of the error rate. Some of them, mainly the 0.632+ estimator, have the least bias [13]. However, none of these estimators is “smooth”, so none is eligible for variance estimation via, e.g., the IF method (Sects. 2.4 and 3.4). The only estimators that may seem smooth are the star versions \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( *\right) }}\) and \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}\). However, their inner components \({{\textrm{Err}}_{\textbf{t}^{*b}}(\widehat{F}^{\left( *\right) })}\) and \({{\textrm{AUC}}_{\textbf{t}^{*b}}(\widehat{F}^{\left( *\right) })}\) are unsmooth themselves, because the classifier is trained on just one dataset. Applying the influence function requires distributing the differential operator \(\partial /\partial \varepsilon \) of the IF over the summation, where it is confronted by these unsmooth components.

4.2 The Leave-Pair-Out Bootstrap (LPOB) \(\widehat{{\textrm{AUC}}}^{_{\left( {1,1}\right) }}\), Its Smoothness and Variance Estimation

The above discussion suggests introducing an analogue of \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}\) for measuring performance in terms of the AUC. This estimator is motivated from (52a) in the same way the estimator \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}\) was motivated from (31). The SB estimator (52a) can be rewritten as:

$$\begin{aligned} \widehat{{\textrm{AUC}}}_{\textbf{t}}^{SB}&=\frac{1}{n_{1}n_{2}}\sum \limits _{j=1}^{n_{2}}{\sum \limits _{i=1}^{n_{1}}{\text {E}_{*} {\psi (\hat{h}_{\textbf{t}^{*}}(x_{i}),\hat{h}_{\textbf{t}^{*}}(x_{j}))}}}\end{aligned}$$
(58)
$$\begin{aligned}&=\frac{1}{n_{1}n_{2}}\sum \limits _{j=1}^{n_{2}}{\sum \limits _{i=1}^{n_{1}}{\sum \limits _{b=1}^{B}\left[ {\psi (\hat{h}_{\textbf{t}^{*b}}(x_{i}),\hat{h}_{\textbf{t}^{*b}}(x_{j})){\Big /B}}\right] }}. \end{aligned}$$
(59)

In words, the procedure is to select a pair (one observation from each class) and calculate, for that pair, the mean of the Mann-Whitney kernel over many bootstrap replications of training. Then, average over all possible pairs. This procedure is optimistically biased because sometimes the testers are the same as the trainers. To eliminate that bias, the inner bootstrap expectation should be taken only over those bootstrap replications that do not include the pair \((t_{i},t_{j})\) in the training. Under that constraint, the estimator (58) becomes the leave-pair-out bootstrap (LPOB) estimator:

$$\begin{aligned} \widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( {1,1}\right) }}&=\frac{1}{n_{1}n_{2}}\sum \limits _{j=1}^{n_{2}}{\sum \limits _{i=1}^{n_{1}}{\widehat{{\textrm{AUC}}}_{i,j}} },\end{aligned}$$
(60a)
$$\begin{aligned} \widehat{{\textrm{AUC}}}_{i,j}&=\sum \limits _{b=1}^{B}{I_{j}^{b}I_{i}^{b}\psi (\hat{h}_{\textbf{t}^{*b}}(x_{i}),\hat{h}_{\textbf{t}^{*b}}(x_{j})) \Big / \sum \limits _{{b}^{\prime }=1}^{B}{I_{j}^{{b}^{\prime }}I_{i}^{{b}^{\prime }}} }. \end{aligned}$$
(60b)
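A sketch of (60a)-(60b), reusing `psi` and the hypothetical `train_and_score` factory from the sketch of (54b) above: for every pair we accumulate the kernel over the replications that exclude both members, then average over pairs.

```python
def auc_lpob(x1, x2, train_and_score, B=100, rng=None):
    # Eqs. (60a)-(60b): for every pair (i, j), average the kernel over the
    # replications whose training sample contains neither case, then
    # average over all n1*n2 pairs.
    rng = np.random.default_rng(rng)
    n1, n2 = len(x1), len(x2)
    num = np.zeros((n1, n2))    # running sum of I_i^b I_j^b psi(...)
    den = np.zeros((n1, n2))    # running sum of I_i^b I_j^b
    for _ in range(B):
        idx1 = rng.integers(0, n1, n1)
        idx2 = rng.integers(0, n2, n2)
        I1 = np.ones(n1, bool); I1[idx1] = False  # I_i^b = 1 iff i not drawn
        I2 = np.ones(n2, bool); I2[idx2] = False
        h = train_and_score(x1[idx1], x2[idx2])
        k = psi(h(x1)[:, None], h(x2)[None, :])   # kernel for all pairs
        mask = I1[:, None] & I2[None, :]
        num += k * mask
        den += mask
    # With B around 100, each pair is excluded from roughly B*exp(-2)
    # replications, so den > 0 holds almost surely for every pair.
    return (num[den > 0] / den[den > 0]).mean()   # Eq. (60a)
```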

The two estimators \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}\) and \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( {1,1}\right) }}\) produce very similar results; this is expected since they both estimate the same thing, i.e., the mean AUC. However, the inner component \(\widehat{{\textrm{AUC}}}_{i,j}\) of the estimator \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( {1,1}\right) }}\) also enjoys the smoothness property of \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}\).

4.3 Estimating the Standard Error of AUC Estimators

The only smooth nonparametric estimator for the AUC so far is the LPOB estimator (60). Yousef et al. [33] discuss how to extend the IF approach used for estimating the uncertainty of the error rate estimator (Sect. 3.4) to estimate the uncertainty of this estimator. Interested readers are referred there for the mathematical details and for experimental results showing that the IF method provides an almost unbiased estimate of the standard error of the LPOB estimator.

5 Illustrative Numerical Examples

5.1 Error Rate Estimation

Efron [8] and Efron and Tibshirani [13] provide comparisons of their proposed estimators (discussed in Sect. 3). They ran many simulations considering a variety of classifiers and data distributions, as well as real datasets. They assessed the estimators in terms of the RMS, the root of the experimental MSE:

$$\begin{aligned} {\textrm{MSE}}&=\text {E}_{MC}( {\widehat{{\textrm{Err}}}_{\textbf{t}}-{\textrm{Err}}_{\textbf{t}}})^{2}\end{aligned}$$
(61a)
$$\begin{aligned}&=\frac{1}{G}\sum \limits _{g=1}^{G}{( {\widehat{{\textrm{Err}}}_{\textbf{t}_{g}}-{\textrm{Err}}_{\textbf{t}_{g}}}) ^{2}}, \end{aligned}$$
(61b)

where \(\widehat{{\textrm{Err}}}_{\textbf{t}_{g}}\) is the estimator (any estimator) conditional on a training dataset \(\textbf{t}_{g}\), and \({\textrm{Err}}_{\textbf{t}_{g}}\) is the true prediction error conditional on the same training dataset. The number of MC trials G in their experiments was 200. The following statement is quoted from Efron and Tibshirani [13]:

The results vary considerably from experiment to experiment, but in terms of RMS error the 0.632+ rule is an overall winner.

This conclusion was reached without stating the criterion for deciding the overall winner. It was apparent from their results that the \(0.632+\) rule is the winner in terms of bias, as it was designed to be. We calculated the average of the RMS of every estimator across all the 24 experiments they ran; Table 1 displays these averages. The estimators \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}\) and \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{(0.632+)}}\) are quite comparable to each other, with only a 2.5% increase in the average RMS of the former. We will show in Sect. 5.2 that the AUC estimators exhibit the same behavior, with a magnified difference between the two estimators.
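For concreteness, the RMS of (61) is just the root of the average squared deviation over the G MC training sets; a small helper (our own sketch) would be:

```python
import numpy as np

def rms(est, true_cond):
    # Eq. (61): est[g] and true_cond[g] are the estimate and the true
    # conditional performance for the g-th MC training set t_g.
    est = np.asarray(est, float)
    true_cond = np.asarray(true_cond, float)
    return np.sqrt(np.mean((est - true_cond) ** 2))
```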

Table 1 Average of RMS error of each estimator over the 24 experiments run by Efron and Tibshirani [13]. The estimator \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{\left( 1\right) }\) is second only to the estimator \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{\left( 0.632+\right) }\), with only a 2.5% increase in RMS
Fig. 2 Comparison of the three bootstrap estimators, \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}\), \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( 0.632\right) }}\), and \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( 0.632+\right) }}\), along with the true and apparent AUC, plotted against \(1/n\) for the 5-feature predictor. The \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}\) is downward biased, while the \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( 0.632\right) }}\) over-corrects that bias; \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( 0.632+\right) }}\) is an almost unbiased version of the \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( 0.632\right) }}\). The figure first appeared in Yousef et al. [32]

Table 2 Comparison of the different bootstrap-based estimators of the \({\textrm{AUC}}\). They are comparable to each other in the RMS sense, \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( .632+\right) }}\) is almost unbiased, and all are weakly correlated with the true conditional performance \({\textrm{AUC}}_{\textbf{t}}\)

5.2 AUC Estimation

We carried out different experiments to compare the three bootstrap-based estimators \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}\), \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( .632\right) }}\), and \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( .632+\right) }}\), considering different dimensionalities, parameter values, and training set sizes. All experiments provided consistent and similar results. In this section we illustrate the results for dimensionality \(p=5\), for multinormal 2-class data with \(\varSigma _1 = \varSigma _2 = \textbf{I}\), \(\mu _1 = \textbf{0}\), and \(\mu _2 = c \textbf{1}\), where c is a parameter that adjusts the Mahalanobis distance \(\varDelta =\left[ (\mu _{1}-\mu _{2})'\varSigma ^{-1}(\mu _{1}-\mu _{2})\right] ^{1/2} = c\sqrt{p}\). We adjust c to keep a reasonable inter-class separation of \(\varDelta = 0.8\). Once the classifier is trained, it is tested on a pseudo-infinite test set, here 1000 cases per class, to obtain a very good approximation to the true AUC of the classifier trained on this very training dataset; this is called a single realization or a Monte-Carlo (MC) trial. Many realizations of the training datasets, with the same n, are generated over the MC simulation to study the mean and variance of the AUC of the Bayes classifier under this training set size. The number of MC trials is 1000 and the number of bootstraps is 100.

It is apparent from Fig. 2 that the \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}\) is downward biased. This is the natural opposite of the upward bias observed in Efron and Tibshirani [13], where the performance measure was the true error rate, a measure of incorrectness, by contrast with the true AUC, a measure of correctness. The \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( {.632}\right) }}\) is designed as a correction for \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}\); the figure shows that it corrects for that bias but with an over-shoot. The remaining bias is almost removed by the estimator \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( {.632+}\right) }}\). The \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( {.632}\right) }}\) estimator can be seen as an attempt to balance between the two extreme biased estimators, \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}\) and \(\overline{{\textrm{AUC}}}_{\textbf{t}}\). However, the component \(\overline{{\textrm{AUC}}}_{\textbf{t}}\) inherent in both \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( {0.632+}\right) }}\) and \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( {0.632}\right) }}\) is expected to increase the variance of these two estimators, which may compensate for the decrease in bias. Therefore, we assess all estimators in terms of the RMS, the root of the MSE defined in (61), and report the results in Table 2. In addition, we average the RMS of these estimators over the 10 experiments of Table 2 and list the averages in Table 3. It is evident that the \(0.632+\) is marginally the overall winner, with only a 9% decrease in RMS compared to the \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}\) estimator. This agrees with the corresponding result obtained for the error rate estimators and reported in Table 1.
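For concreteness, one MC realization of this design can be generated as in the sketch below (our own illustrative code, not the original simulation script). Since \(\varSigma =\textbf{I}\) and \(\mu _2-\mu _1=c\textbf{1}\), the Mahalanobis distance is \(\varDelta =c\sqrt{p}\), so \(c=\varDelta /\sqrt{p}\approx 0.358\) for \(p=5\) and \(\varDelta =0.8\).

```python
import numpy as np

def draw_training_set(n, p=5, delta=0.8, rng=None):
    # One MC realization of the 2-class multinormal data of this section:
    # mu1 = 0, mu2 = c*1, Sigma = I, with c set so that the Mahalanobis
    # distance equals delta (c = delta / sqrt(p)).
    rng = np.random.default_rng(rng)
    c = delta / np.sqrt(p)
    x1 = rng.standard_normal((n, p))        # class 1: N(0, I)
    x2 = rng.standard_normal((n, p)) + c    # class 2: N(c*1, I)
    return x1, x2
```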

Table 3 Average of RMS error of each estimator over the 10 experiments displayed in Table 2. The estimator \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{\left( *\right) }\) is second only to \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{\left( 0.632+\right) }\), with only a 9% increase in RMS

In addition to the RMS, Table 2 compares the estimators in terms of the RMS around the mean (\({\textrm{RMS}}_{AM}\)): the root of the mean squared difference between an estimate and the mean performance (the mean over all possible training sets), instead of the conditional performance (conditional on a particular training set). The motivation is as follows. The estimators \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( *\right) }}\) and \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( 1,1\right) }}\) seem, at least from their formalization, to estimate the mean AUC of the classifier (they are the analogues of \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( *\right) }}\) and \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( 1\right) }}\)). However, the basic motivation for the \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( .{632}\right) }}\) and \(\widehat{{\textrm{AUC}}}_{\textbf{t}}^{_{\left( .{632+}\right) }}\) is to estimate the AUC conditional on the given dataset \(\textbf{t}\) (they are the analogues of \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( .{632}\right) }}\) and \(\widehat{{\textrm{Err}}}_{\textbf{t}}^{_{\left( .{632+}\right) }}\)). Nevertheless, as mentioned in Efron and Tibshirani [13] and detailed in Zhang [35], the CV, which shares its basic resampling ingredient with the bootstrap-based estimators, is weakly correlated with the true performance on a sample-by-sample basis. This means that no estimator has an advantage in estimating the conditional performance. Section 5.3 elaborates more on this phenomenon.

Fig. 3 The lack of correlation (or the weak correlation) between the bootstrap-based estimators and the true conditional performance. Every line connects the true performance of the classifier trained on a dataset \(\textbf{t}_{i}\) with the corresponding estimated value; the figure represents 15 trials of the 1000 MC trials, and the mean is indicated. Two nearby values of true performance may correspond to two widely separated estimates on different sides of the mean

5.3 Components of Variance and Weak Correlation

Many simulation results, e.g., Efron [8], Efron and Tibshirani [13], show that there is only a weak correlation between the CV estimator and the conditional true error rate \({\textrm{Err}}_{\textbf{t}}\). This issue is discussed in mathematical detail in the excellent paper by Zhang [35], which concludes that the CV estimator should not be used to estimate the true error rate of a classification rule conditional on a particular training dataset. The other estimators discussed in the present chapter share this attribute, since they have the same resampling ingredient as the CV estimator and, “we would guess, for any other estimate of conditional prediction error” (Sect. 7.12, [20]). We provide a simple mathematical elaboration as follows. Denote the true performance of the classification rule conditional on the training set \(\textbf{t}\) (whether \({\textrm{Err}}_{\textbf{t}}\), \({\textrm{AUC}}_{\textbf{t}}\), or any other performance measure) by \(S_{\textbf{t}}\), the unconditional performance by \(\text {E}_{\textbf{t}}S_{\textbf{t}}\), and an estimator of either of them by \(\widehat{S}_{\textbf{t}}\). For easier notation we can unambiguously drop the subscript \(\textbf{t}\) and decompose the MSE as

$$\begin{aligned} {\textrm{MSE}}(\widehat{S}, S)&= \text {E}(\widehat{S} - S)^2 \end{aligned}$$
(62a)
$$\begin{aligned}&= \text {E}(\widehat{S} - \text {E}S)^2 + \text {Var}(S) - 2\text {Cov}(\widehat{S},S). \end{aligned}$$
(62b)

Then, normalizing by the product of the standard deviations, we get:

$$\begin{aligned} \frac{{\textrm{MSE}}(\widehat{S}, S)}{\sigma _S \sigma _{\widehat{S}}}&= \frac{{\textrm{MSE}}(\widehat{S}, \text {E}S)}{\sigma _S \sigma _{\widehat{S}}} + \frac{\sigma _S}{\sigma _{\widehat{S}}} - 2\rho _{\widehat{S}S}. \end{aligned}$$
(63)

This equation relates four crucial components to each other:

  • \({\textrm{MSE}}(\widehat{S}, S)\bigl /\sigma _S \sigma _{\widehat{S}}\), the normalized MSE of \(\widehat{S}\), if we see it as an estimator of the conditional performance S.

  • \({\textrm{MSE}}(\widehat{S}, \text {E}S)\bigl /\sigma _S \sigma _{\widehat{S}}\), the normalized MSE of \(\widehat{S}\), if we see it as an estimator of the expected performance \(\text {E}S\) (and therefore called MSE around the mean).

  • \(\sigma _S \bigl / \sigma _{\widehat{S}}\), the standard deviation ratio between S and \(\widehat{S}\).

  • \(\rho _{\widehat{S}S}\), the correlation coefficient between S and \(\widehat{S}\).

From (63), an estimator \(\widehat{S}\) is a better candidate for estimating S than for estimating \(\text {E}S\) if its \({\textrm{MSE}}(\widehat{S}, S)\) is less than its \({\textrm{MSE}}(\widehat{S}, \text {E}S)\). For that, the correlation coefficient \(\rho _{\widehat{S}S}\) has the responsibility of being high enough to cancel \(\sigma _S \bigl / \sigma _{\widehat{S}}\) and a portion of \({\textrm{MSE}}(\widehat{S}, \text {E}S)\). Unfortunately, this is not the case, as we illustrate experimentally in Table 2, which provides all quantities of the decomposition (63). It is obvious from the values that \({\textrm{RMS}}(\widehat{S}, S)\) and \({\textrm{RMS}}(\widehat{S}, \text {E}S)\) are very close to each other, because the quantity \(\sigma _S \bigl / \sigma _{\widehat{S}} - 2\rho _{\widehat{S}S} \simeq 0.413 - 2 \times 0.290 = -0.167\) (on average over the 10 experiments shown in the table). Moreover, in some cases, e.g., the first experiment, it is as close to zero as \(-0.052\). The correlation between \(\widehat{S}\) and S is too weak to cast \(\widehat{S}\) as an estimator of S, although it is designed to estimate it! For further illustration, Fig. 3 visualizes the components of Eq. (63) and the numbers in Table 2. The figure shows 15 realizations of the 1000 MC trials of the same experiment above. On the right are the true values of S when training on these 15 different training sets; on the left are the corresponding 15 estimated values of \(\widehat{S}\). Each line links a true value to its corresponding estimate. The figure shows that two nearby true values of S are likely to have two widely separated estimates \(\widehat{S}\) on different sides of the mean, which visually illustrates the lack of correlation (or the weak correlation) between the estimators and the true conditional performance.
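The identity (62b), and hence the decomposition (63), can be checked numerically on the MC outputs. The helper below is a sketch under the assumption that the arrays `S_hat` and `S` hold, for each of the G trials, the estimate and the true conditional performance; the identity is exact when the biased (divide-by-G) variance and covariance are used.

```python
import numpy as np

def check_decomposition(S_hat, S):
    # S_hat[g], S[g]: estimate and true conditional performance, trial g.
    S_hat, S = np.asarray(S_hat, float), np.asarray(S, float)
    mse_cond = np.mean((S_hat - S) ** 2)          # MSE(S_hat, S)
    mse_mean = np.mean((S_hat - S.mean()) ** 2)   # MSE(S_hat, E S)
    cov = np.cov(S_hat, S, bias=True)[0, 1]
    # Eq. (62b), exact with the biased variance/covariance:
    assert np.isclose(mse_cond, mse_mean + S.var() - 2.0 * cov)
    norm = S.std() * S_hat.std()
    # the four components of Eq. (63)
    return (mse_cond / norm, mse_mean / norm,
            S.std() / S_hat.std(), cov / norm)
```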

5.4 Two Competing Classifiers

If the assessment problem is to compare two classifiers, rather than to assess an individual performance, then the measure to be used is either the conditional difference

$$\begin{aligned} \varDelta _{\textbf{t}}={\textrm{AUC}}_{1_{\textbf{t}}}-{\textrm{AUC}}_{2_{\textbf{t}}}, \end{aligned}$$
(64)

or the mean, unconditional, difference

$$\begin{aligned} \varDelta =\text {E}\,\varDelta _{\textbf{t}}=\text {E}\left[ {\textrm{AUC}}_{1_{\textbf{t}}}-{\textrm{AUC}}_{2_{\textbf{t}}}\right] , \end{aligned}$$
(65)

where we define them for the AUC just for illustration; the treatment of other measures is identical. There is then nothing new in the estimation task: it is merely the difference between the performance estimates of the two classifiers, i.e.,

$$\begin{aligned} \widehat{\varDelta }=\widehat{\text {E}\,{\textrm{AUC}}_{1_{\textbf{t}}}}-\widehat{\text {E}\,{\textrm{AUC}}_{2_{\textbf{t}}}}, \end{aligned}$$
(66)

where each of the two estimators in (66) can be obtained by any estimator. A natural candidate, from the point of view of the present chapter, is the LPOB estimator \(\widehat{{\textrm{AUC}}}^{_{\left( 1,1\right) }}\), because of both the smoothness and the weak-correlation issues discussed so far.

How, then, can we estimate the uncertainty (variance) of \(\widehat{\varDelta }\)? This is very similar to estimating the variance of \(\widehat{\text {E}\,{\textrm{AUC}}_{\textbf{t}}}\); there is nothing new in estimating \(\text {Var}\,\widehat{\varDelta }\). It is obtained by replacing \(\widehat{{\textrm{AUC}}}^{_{\left( 1,1\right) }}\), in Yousef et al. [33], by the statistic \(\widehat{\varDelta }\) in (66). For demonstration, typical values are given in Table 4 for comparing the linear and quadratic discriminants, where the training set size per class is 20 and the number of features is 4.
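As a sketch, (66) with the LPOB estimator plugged in for each classifier could look as follows; `train_1` and `train_2` stand for hypothetical training factories of the same form as `train_and_score` above (e.g., LDA and QDA). Passing the same seed makes the two estimates share the bootstrap replications, a paired design that is our own choice here rather than a requirement of the method.

```python
def delta_hat_lpob(x1, x2, train_1, train_2, B=100, seed=0):
    # Eq. (66) with the LPOB estimator (60) plugged in for each classifier.
    return (auc_lpob(x1, x2, train_1, B=B, rng=seed)
            - auc_lpob(x1, x2, train_2, B=B, rng=seed))
```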

Table 4 Estimating the uncertainty in the estimator that estimates the difference in performance of two competing classifiers, the LDA and the QDA. The quantity M represents \({\textrm{AUC}}_1\) for LDA, \({\textrm{AUC}}_2\) for QDA, and \(\varDelta \) for the difference

6 Discussion and Conclusion

In this chapter, the very important topic of the assessment of ML algorithms is reviewed, with an emphasis on the nonparametric assessment of classification rules. The topic is crucial to many fields and applications, in particular cyberphysical security, where ML algorithms are almost ubiquitous. We started by reviewing the basic nonparametric methods for estimating the bias and variance of a statistic. Then, we reviewed the basic resampling-based methods for estimating the error rate of a classification rule. Departing from that, we extended these estimators from estimating the error rate (a one-sample statistic) to estimating the AUC (a two-sample statistic). This extension is theoretically justified, not just an ad hoc application. Among these estimators, we identified those that are smooth and hence eligible for estimating their standard error using the IF method.

It was interesting to see, throughout the chapter, the connections among the different resampling-based estimators. It is worth mentioning that, in addition to the conventional K-fold CV, there are other versions and variants, which are usually used in an ad hoc way by many practitioners. The formalization of these versions and variants, the mathematical connections among them, and their connection to the bootstrap-based estimators can all be established in the same spirit and approach followed in the present chapter. However, many of them are unsmooth, except possibly the repeated CV, which is partially smooth and suitable for the IF method [30, 31].

With this rich variety of estimators, a practitioner may legitimately wonder about the “optimal” estimator (in terms of any optimality criterion) that should be systematically used. There are three aspects on which we can base our comparison: accuracy, uncertainty estimation, and computational efficiency.

In terms of accuracy, it is surprising to learn that, from the few comparative studies available in the literature, there is no overall winner among these estimators. All of them have comparable accuracy, measured in terms of RMS, with a slight superiority of the 0.632+ bootstrap estimator. In addition, and most importantly, all estimators have a weak correlation with the true conditional performance (e.g., \({\textrm{Err}}_{\textbf{t}}\), the conditional error rate, or \({\textrm{AUC}}_{\textbf{t}}\), the conditional AUC), a phenomenon that leaves them eligible only for estimating the mean true performance (e.g., \(\text {E}_{\textbf{t}}{\textrm{Err}}_{\textbf{t}}\) or \(\text {E}_{\textbf{t}}{\textrm{AUC}}_{\textbf{t}}\)), where the mean is taken over the population of training datasets as explained throughout the chapter. Said differently, the performance estimate that a practitioner obtains using, e.g., the CV, is not an estimate of the performance of this very trained ML algorithm; rather, it is an estimate of the mean performance of this algorithm had we trained it on all possible training datasets of the same size! We quote from [20, Sect. 7.12]:

This phenomenon also occurs for bootstrap estimates of error, and we would guess, for any other estimate of conditional prediction error.

In terms of the variance estimation of these estimators (not the estimation of the variance of the algorithm itself), only a few of them are smooth and candidates for a sophisticated method like the IF. The ordinary K-fold CV is not among those! Rather, only the computationally expensive version of it, the repeated CV, is partially smooth as mentioned above.

In terms of the computational aspects, the bootstrap-based estimators are computationally expensive. Compared to the conventional K-fold CV, which requires only K iterations of training and testing, they require hundreds of bootstrap replications. Because the majority of recent ML applications involve both massive datasets and complex algorithms, including DNNs, which are very computationally expensive, the CV may be more practical than the bootstrap-based estimators. However, in some other fields, e.g., cyberphysical security, many applications produce tabular (structured) data. Tabular data are more suitable for the traditional, less computationally expensive, ML algorithms. Therefore, serious practitioners in these fields and applications may need to keep all of these estimators in their toolbox. Moreover, it would be valuable to see a future benchmark that compiles these estimators, along with different datasets from a wide range of applications, into a single comprehensive comparative study.