1 Introduction

Although the Vapnik–Chervonenkis (VC) learning theory [15, 16, 18–20] has been justly acclaimed as a major conceptual breakthrough, applying its essential theorems to practical problems often yields very loose bounds. In the case of the pattern recognition problem, the theorems provide distribution-independent uniform bounds on the deviation between the expected classification error and the empirical classification error. Their derivation reveals many possible causes for their poor quantitative performance:

(i) Practical data distributions may lead to smaller deviations than the worst possible data distribution.

(ii) Uniform bounds hold for all possible classification functions. Better bounds may hold when one restricts the analysis to functions that perform well on plausible training sets.

(iii) A symmetrization lemma translates the main combinatorial result into a bound on the deviation between expected and empirical errors. This lemma is a conservative inequality.

(iv) The combinatorial characterization of the Vapnik–Chervonenkis capacity is a conservative upper bound.

(v) The union bound \(P(\cup _iA_i)\le \sum _iP(A_i)\) constitutes a critical step of the proof. This bound could be reasonably accurate if the events were independent events with low probability. Nothing guarantees that this is the case.

An apparently different class of bounds, sometimes called sample compression bounds, often provides much more realistic estimates of the generalization error. Such bounds predate the VC theory: for instance, it was mentioned in Paphos that Chervonenkis knew that the expected error of the generalized portrait algorithm is roughly bounded by the fraction of support vectors found in the training set [17, 21]. This bound depends on the number of support vectors, an empirical quantity measured a posteriori.

The purpose of this contribution is to explore the gap between these two styles of bounds using only simple mathematics and a simple empirical case study. This simplicity results from an apparently bold step: instead of assuming that the examples are independently drawn from an unknown distribution, we will reason on random partitions of an arbitrary dataset into equally sized training and test sets. Deviation estimates then result from the combinatorial argument that forms the central part of the traditional Vapnik–Chervonenkis proofs. Avoiding the symmetrization lemma (see point (iii) above) also provides a simple way to obtain data- and algorithm-dependent bounds (points (i) and (ii)) and to define empirical data- and algorithm-dependent capacities (point (iv)) [3, 4, 23]. The union bound (point (v) above) then remains the main source of quantitative problems.

Although refined bounding techniques have been proposed to address all these issues [5–8, 12, 13], their sophistication often obscures their connection to practical reality. We believe that the simple framework described in this contribution provides useful intuitions.

The following discussion is organized as follows. After presenting the random split paradigm, we explain how to easily derive bounds in the style of Vapnik–Chervonenkis and make them take into account the specificities of the data distribution and of the learning algorithm. We then estimate the performance of these bounds on a simple case study and show that more refinements are necessary to obtain a bound with a reasonable amount of computation.

2 Setup

Let Q(z, w) be a loss function that measures the correctness on example z of the answer produced by a learning machine parameterized by \(w\in \mathcal{F}\). In this paper we only consider the case of binary loss functions that take the value one if the answer is wrong and zero if it is correct. For instance, in the case of a pattern recognition system, each example z is a pair (x, y) composed of a pattern x and a class label y. Given a classifier \(f_w(x)\) parametrized by w, the loss function Q(z, w) is zero when \(f_w(x)=y\) and is one otherwise.

Let S be a set of \(2\ell \) labeled examples \(z_1,\cdots ,z_{2\ell }\). There are \(C_{2\ell }^{\ell }\) ways to split this set into equally sized training and test sets, \(S_1\) and \(S_2\), each containing \(\ell \) examples. For each choice of a training set \(S_1\) and a test set \(S_2\), and for each value of the parameter w, we define the training error \(\nu _1\), the test error \(\nu _2\), and the total error \(\nu \) as:

$$ \nu _1(w) = \frac{1}{\ell } \sum _{z \in S_1} Q(z,w),~~~~~~ \nu _2(w) = \frac{1}{\ell } \sum _{z \in S_2} Q(z,w), $$
$$ \nu (w) = \frac{1}{2\ell } \sum _{z \in S} Q(z,w). $$
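As a concrete illustration, the following sketch (Python with NumPy; the array `losses` is a hypothetical stand-in for the binary vector \((Q(z_1,w),\dots ,Q(z_{2\ell },w))\) of a fixed parameter w) computes these three quantities for one random split:

```python
import numpy as np

rng = np.random.default_rng(0)
ell = 500
# Hypothetical misclassification vector (Q(z_1,w), ..., Q(z_{2 ell}, w)).
losses = rng.integers(0, 2, size=2 * ell)

# One of the C(2 ell, ell) equally likely splits into S_1 and S_2.
perm = rng.permutation(2 * ell)
train_idx, test_idx = perm[:ell], perm[ell:]

nu1 = losses[train_idx].mean()   # training error nu_1(w)
nu2 = losses[test_idx].mean()    # test error     nu_2(w)
nu = losses.mean()               # total error    nu(w) = (nu_1 + nu_2) / 2
```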

Consider a deterministic learning algorithm \(\mathcal{A}\) that processes the training set \(S_1\) and produces a parameter \({w^{S_1}}\). This parameter value usually performs well on the training set \(S_1\) and we hope that it will also perform well on the test set \(S_2\). For instance, the empirical risk minimization principle suggests designing an algorithm that minimizes \(\nu _1(w)\) in the hope of ensuring that \(\nu _2({w^{S_1}})\) is small.

All results presented here concern the distribution of the deviation between the training error \(\nu _1({w^{S_1}})\) and the test error \(\nu _2({w^{S_1}})\) when one considers all possible splits \(S_1\cup S_2\) of the dataset S and obtains \({w^{S_1}}\) by running the learning algorithm \(\mathcal{A}\),

$$\begin{aligned} \Pr \left\{ ~ \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \epsilon ~ \right\} . \end{aligned}$$
(9.1)

The notation \(\Pr (\mathcal{H})\) denotes the ratio of the number of splits for which condition \(\mathcal{H}\) is true over the total number \(C_{2\ell }^{ \ell }\) of possible splits \(S_1\cup S_2\) of the dataset S. We use this notation instead of the traditional probability notation to emphasize the purely combinatorial nature of this problem.

We argue that the real-life behavior of learning algorithms is well characterized by the tail of this distribution. Thousands of machine learning papers are in fact supported by experimental studies that follow the same protocol: randomly separating out test data, applying the learning algorithm to the remaining data, and assessing its performance on the test data. A good test set performance is widely accepted as convincing evidence supporting the use of a specific learning algorithm for a specific learning problem. Bounding the tail of the distribution (9.1) provides equally strong evidence.

In contrast, traditional statistical approaches to the learning problem assume that the training examples are drawn independently from an unknown distribution. The expected error \(\mathbb {E}(Q(z,{w^{S_1}}))\) then represents the future performance of the system on new examples drawn from this same distribution. Bounding the difference between the training error and the expected error provides a stronger guarantee because the assumed existence of the ground truth distribution provides a means to reason about the algorithm’s performance on arbitrarily large training sets. Consider for instance a binary classification algorithm that relies on a polynomial discriminant function whose degree grows linearly with the number of training examples. Running such an algorithm on a training set \(S_1\) of sufficiently small size \(\ell \) could conceivably give acceptable performance on the test set \(S_2\) of the same size. However this acceptable performance does not guarantee that running the algorithm on all \(2\ell \) available examples would not overfit.

Avoiding the ground truth assumption is attractive for philosophical reasons. Although epistemology frequently relies on the idea that the world is ruled by simple universal truths waiting to be uncovered, it can be argued that the only thing that is available to us for sure is a finite set of examples. From this point of view, the ground truth distribution is a metaphysical concept because there is no statistical test to determine whether or not our dataset is composed of independent and identically distributed examples and no hope to identify their distribution.

Avoiding the ground truth assumption is also attractive for technical reasons. Working with the combinatorial distribution (9.1) affords simple ways to derive tail bounds that leverage specific properties of the data or of the learning algorithm.

3 Misclassification Vectors

For each value of the parameter w, the loss function Q(z, w) maps the full set of examples S onto a binary vector \(q(w)=(Q(z_1,w),\dots ,Q(z_{2\ell },w))\) of length \(2\ell \) that we shall call the misclassification vector. The ordering of its coefficients does not depend on the dataset split: the ith component of q(w) indicates whether the learning system parametrized by w processes the example \(z_i\) incorrectly, regardless of whether \(z_i\) belongs to the training set or the test set.

The misclassification vector q(w) encapsulates all that we need to know about the performance of the system parametrized by vector w. Parameter values that lead to the same misclassification vector will also lead to the same total error, training error, and test error. Therefore we often write them as \(\nu (q)\), \(\nu _1(q)\), and \(\nu _2(q)\) instead of \(\nu (w)\), \(\nu _1(w)\), and \(\nu _2(w)\).

The function \(\eta (q,\epsilon )=\Pr \left\{ \left| \nu _2(q)-\nu _1(q)\right| >\epsilon \right\} \) does not depend on the ordering of the coefficients in the misclassification vector q because all possible splits are considered and because the quantities \(\nu _1(q)\) and \(\nu _2(q)\) do not depend on the ordering of the coefficients within each subset. We therefore write \(\eta (q,\epsilon )=\eta (\ell ,\nu (q),\epsilon )\).

Consider an urn containing \(2\nu \ell \) red marbles and \(2(1-\nu )\ell \) white marbles. Out of the \(C_{2\ell }^{\ell }\) possible ways to draw \(\ell \) marbles without replacement, there are exactly \(C_{2\nu \ell }^{k} C_{2(1-\nu )\ell }^{\ell -k}\) ways to draw exactly k red marbles. The analytic expression of \(\eta (\ell ,\nu ,\epsilon )\) is obtained by summing this quantity for all values of k that ensure that the difference between the number k of red marbles drawn from the urn and the number \(2\nu \ell -k\) of red marbles left in the urn exceeds \(\ell \epsilon \):

$$\begin{aligned} \eta (\ell ,\nu ,\epsilon ) = \frac{1}{C_{2\ell }^{\ell }} ~~ \sum _{2|\nu \ell -k|>\ell \epsilon } ~ C_{2\nu \ell }^{k} \, C_{2(1-\nu )\ell }^{\ell -k}. \end{aligned}$$

There are efficient numerical methods for computing this hypergeometric tail  [14].
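For instance, the tail can be evaluated with standard numerical libraries. The following sketch (Python with SciPy; the function name `eta` and the illustrative argument values are ours) mirrors the urn argument above:

```python
import numpy as np
from scipy.stats import hypergeom

def eta(ell, nu, eps):
    """Tail probability eta(ell, nu, eps) of the urn model described above."""
    m = int(round(2 * nu * ell))       # number of errors (red marbles) in S
    rv = hypergeom(2 * ell, m, ell)    # population 2*ell, m successes, ell draws
    k = np.arange(max(0, m - ell), min(m, ell) + 1)
    tail = 2 * np.abs(nu * ell - k) > ell * eps
    return rv.pmf(k[tail]).sum()

print(eta(500, 0.10, 0.05))            # illustrative values of ell, nu, eps
```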

Since the function \(\eta (\ell ,\nu ,\epsilon )\) is monotonically decreasing with \(\epsilon \), we define the inverse function \(\epsilon (\ell ,\nu ,\eta )\) and write

$$\begin{aligned} \forall q \quad \Pr \left\{ ~ \left| \nu _2(q)-\nu _1(q)\right| > \epsilon (\ell ,\nu (q),\eta ) ~ \right\} ~ = ~ \eta . \end{aligned}$$
(9.2)

Although there is no known analytic form for the inverse function \(\epsilon (\ell ,\nu ,\eta )\), its exact values can be directly read from a table of its inverse \(\eta (\ell ,\nu ,\epsilon )\). This function is also well described by relatively accurate bounds and approximations such as those derived by Vapnik and Chervonenkis [15, inequality A5, p. 176]:

$$\begin{aligned} \epsilon (\ell ,\nu ,\eta )&\le \sqrt{4 \left( \nu +\frac{1}{2\ell }\right) \left( 1-\nu +\frac{1}{2\ell }\right) \,\frac{\log (2/\eta )}{\ell +1} } \end{aligned}$$
(9.3)
$$\begin{aligned}&\approx \sqrt{\frac{4\,\nu (1-\nu )\,\log (2/\eta )}{\ell } }. \end{aligned}$$
(9.4)
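Concretely, one can tabulate \(\eta (\ell ,\nu ,\epsilon )\) on a grid of \(\epsilon \) values, read off the inverse, and compare it with approximation (9.4). A sketch (Python, reusing the `eta` function defined above; the function name and grid resolution are ours):

```python
import numpy as np

def eps_inverse(ell, nu, target_eta):
    """Smallest eps on a 1/ell grid such that eta(ell, nu, eps) <= target_eta."""
    grid = np.arange(0, ell + 1) / ell      # |nu_2 - nu_1| never exceeds 1
    values = np.array([eta(ell, nu, e) for e in grid])
    return grid[values <= target_eta].min()

ell, nu, target = 500, 0.10, 0.05
exact = eps_inverse(ell, nu, target)
approx = np.sqrt(4 * nu * (1 - nu) * np.log(2 / target) / ell)   # Eq. (9.4)
print(exact, approx)
```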

4 Data- and Algorithm-Independent Bounds

Let \({\varDelta }_\mathcal{F}(S)= \{ q(w) \,:\, w\in \mathcal{F}\}\) be the set of misclassification vectors associated with all potential values of the parameter w. Bounds on the deviation (9.1) are then derived from the following chain of inequalities:

$$\begin{aligned} \Pr&\left\{ ~ \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \epsilon (\ell ,\nu ({w^{S_1}}),\eta ) ~ \right\} \nonumber \\&= ~ \Pr \left\{ ~ \left| \nu _2({q^{S_1}})-\nu _1({q^{S_1}})\right| > \epsilon (\ell ,\nu ({q^{S_1}}),\eta ) ~ \right\} \nonumber \\&\le ~ \Pr \big \{~ \exists q\in {\varDelta }_\mathcal{F}(S) ~:~ |\nu _2(q)-\nu _1(q)|>\epsilon (\ell ,\nu (q),\eta ) ~\big \}\nonumber \\&\le \sum _{q\in {\varDelta }_\mathcal{F}(S)} \Pr \left\{ ~ \left| \nu _2(q)-\nu _1(q)\right| > \epsilon (\ell ,\nu (q),\eta ) ~ \right\} \nonumber \\&= ~ \eta \, {{\mathrm{Card}}}\,{\varDelta }_\mathcal{F}(S). \end{aligned}$$
(9.5)

The first inequality above majorizes (9.1) by a uniform bound. The second inequality is an application of the union bound \(\Pr (A\cup B)\le \Pr (A)+\Pr (B)\), and the final result is obtained by applying Eq. (9.2).

Traditional data- and algorithm-independent deviation bounds control \(\epsilon (\ell ,\nu ,\eta )\) using the more convenient expression (9.3) and then invoke the landmark combinatorial lemma of Vapnik and Chervonenkis [18, theorem 1], which states that \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{F}(S)\) is either equal to \(2^{2\ell }\) or smaller than \((2\ell e/h)^h\) for some positive integer h that does not depend on the data S and is now called the VC dimension of the family of indicator functions \(\{\,z\,\mapsto \,Q(z,w)~:~w\in \mathcal{F}\,\}\). Simple algebraic manipulations then yield data- and algorithm-independent bounds for both the absolute and the relative deviation:

$$\begin{aligned} \Pr \left\{ ~ \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \sqrt{\frac{h(1+\log \frac{\ell }{h}) -\log \frac{\eta }{2}}{\ell -1}} ~\right\}&\le \eta ,\\ \Pr \left\{ ~ \frac{\left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| }{\sqrt{\nu ({w^{S_1}})+\frac{1}{2\ell }}} > 2\sqrt{\frac{h(1+\log \frac{\ell }{h})-\log \frac{\eta }{2}}{\ell }} ~\right\}&\le \eta . \end{aligned}$$

5 Data- and Algorithm-Dependent Bounds

There are several obvious ways to make these bounds tighter. Instead of using the bound (9.3), we can tabulate the exact values of \(\epsilon (\ell ,\nu ,\eta )\) as suggested in Sect. 9.3. Instead of bounding \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{F}(S)\), we can design empirical procedures to measure its value [3, 22]. The only remaining causes of inaccuracy are then the two inequalities appearing in the derivation (9.5), namely the uniform bound and the union bound.

The first source of concern is the majorization of the error deviation by a uniform bound. Many elements of \({\varDelta }_\mathcal{F}(S)\) are misclassification vectors that no reasonable learning algorithm would produce. Realistic learning algorithms tend to produce solutions that perform well on the training examples, and they often incorporate critical capacity control mechanisms. For instance, one can argue that multilayer network training often achieves good performance because its poor optimization algorithm is unable to find solutions far away from the initial point. All these aspects severely restrict the set of misclassification vectors.

Therefore, instead of considering the set \({\varDelta }_\mathcal{F}(S)\) of the misclassification vectors associated with all potential parameters \(w\in \mathcal{F}\), we can consider the set \({\varDelta }_\mathcal{A}(S)\) of the misclassification vectors associated with the parameters produced by applying algorithm \(\mathcal{A}\) to all training sets \(S_1\) extracted from dataset S:

$$ {\varDelta }_\mathcal{A}(S) = \big \{ ~ q(\mathcal{A}(S_1)) ~|~ S_1\subset S ~\wedge ~{{\mathrm{Card}}}(S_1)=\ell ~ \big \}. $$
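For a very small dataset, this set can be enumerated directly by running the algorithm on every split. The sketch below (Python; `fit` and `Q` are hypothetical stand-ins for the learning algorithm \(\mathcal{A}\) and the loss function) makes the definition concrete, although the \(C_{2\ell }^{\ell }\) splits quickly render such an enumeration infeasible:

```python
from itertools import combinations

def misclassification_set(S, fit, Q):
    """Enumerate Delta_A(S) for a small dataset S of even size 2*ell."""
    n = len(S)
    vectors = set()
    for train_idx in combinations(range(n), n // 2):
        w = fit([S[i] for i in train_idx])    # w^{S_1} = A(S_1)
        q = tuple(Q(z, w) for z in S)         # misclassification vector q(w)
        vectors.add(q)
    return vectors                            # its cardinality enters the bound below
```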

Replicating the derivation (9.5) leads to a data- and algorithm-dependent deviation bound,

$$\begin{aligned} \Pr \left\{ ~ \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \epsilon (\ell ,\nu ({w^{S_1}}),\eta ) ~ \right\} \le \eta \, {{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S). \end{aligned}$$
(9.6)

The second source of concern is the union bound which, in (9.5), majorizes the probability of the union of K events \(A_1, \dots , A_K\) of probability \(\eta \) by the sum \(K\eta \) of their probabilities. Let us tentatively assume that the events \(A_i\) can be considered pairwise independent. We can then write

$$\begin{aligned} K\eta - \Pr \left( \cup _k A_k\right) ~\le ~ \sum _{i<j}\Pr (A_i\cap A_j) ~\approx ~ \frac{K^2}{2} \eta ^2 \end{aligned}$$
(9.7)

and show that the majorization error is much smaller than \(K\eta \). The deviation bound (9.6) is likely to be quite accurate if this assumption holds. Whether this is true will be clarified in Sect. 9.7.
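A quick numerical check (Python; illustrative values only) shows the size of this majorization error when the K events are fully independent, in which case \(\Pr (\cup _kA_k)=1-(1-\eta )^K\):

```python
import numpy as np

eta_val, K = 1e-4, 100
p_union = 1 - (1 - eta_val) ** K     # exact for K independent events
slack = K * eta_val - p_union        # left-hand side of (9.7)
print(slack, K ** 2 * eta_val ** 2 / 2, K * eta_val)
# slack is close to K^2 eta^2 / 2 = 5e-05 and much smaller than K eta = 1e-2
```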

6 Empirical Study

In order to illustrate the performance of bound (9.6), we report on a simple experimental study using 1,000 examples from the MNIST handwritten digit recognition dataset [2]. The classifier is the convolutional network Lenet5 described in [10] and containing 60,000 adjustable parameters. Training is performed using mean square error back-propagation with learning rates periodically adjusted by estimating the diagonal of the Hessian matrix [11]. This case study should be viewed as a typical example of multilayer neural network training technology using a proven implementation. In particular, this learning algorithm should not be seen as a simple empirical risk minimization algorithm because the cost function is nonconvex and because the first-order nature of the algorithm favors solutions that are relatively close to the initial conditions.

We train this classifier on 1,000 random splits of the examples into equally sized training and test sets containing \(\ell =500\) examples each. We always use the same weight initialization. The observed median training error, median test error, and median relative deviation are, respectively,

$$ {{\mathrm{Median}}}\big [\nu _1({w^{S_1}})\big ]~\approx ~0.075,\qquad {{\mathrm{Median}}}\big [\nu _2({w^{S_1}})\big ]~\approx ~0.14, $$
$$ {{\mathrm{Median}}}\left[ \frac{|\nu _2({w^{S_1}})-\nu _1({w^{S_1}})|}{\sqrt{\nu ({w^{S_1}})(1-\nu ({w^{S_1}}))}}\right] ~\approx ~0.21. $$

The median deviation can also be estimated by setting the right-hand side of (9.6) to 0.5 and using the approximation (9.4),

$$\begin{aligned} {{\mathrm{Median}}}\left[ ~ \frac{|\nu _2({w^{S_1}})-\nu _1({w^{S_1}})|}{\sqrt{\nu ({w^{S_1}})(1-\nu ({w^{S_1}}))}} ~-~ 2\,\sqrt{\frac{\log (4\,{{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S))}{\ell }} ~\right] ~ \mathop {\approx }\limits ^{?}~0. \end{aligned}$$
(9.8)
Fig. 9.1  Bounds on the median relative deviation (top) and median test error \(\nu _2\) (bottom) as a function of \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\). The dotted line indicates the observed values

Figure 9.1 (top plot) shows how the bound on the relative deviation (9.8) depends on the value \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\). Figure 9.1 (bottom) plots a corresponding bound on the median test error \(\nu _2\), obtained by setting the training error \(\nu _1=0.075\) and numerically solving (9.8) for \(\nu _2\) with \(\nu =(\nu _1+\nu _2)/2\). Both plots show that \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\) must be as low as 62 for the bounds to match empirical observations. However these plots also show that values as large as \(10^8\) still provide reasonable estimates.
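The numerical solve behind the bottom plot can be sketched as follows (Python with SciPy; the function name, the choice of root finder, and the reliance on approximation (9.4) are ours):

```python
import numpy as np
from scipy.optimize import brentq

def nu2_bound(nu1, card, ell=500):
    """Test error nu2 at which (9.8) holds with equality, given nu1 and Card Delta_A(S)."""
    rhs = 2 * np.sqrt(np.log(4 * card) / ell)        # uses approximation (9.4)
    def gap(nu2):
        nu = (nu1 + nu2) / 2
        return (nu2 - nu1) / np.sqrt(nu * (1 - nu)) - rhs
    return brentq(gap, nu1 + 1e-9, 2 - nu1 - 1e-9)   # nu(1-nu) > 0 requires nu2 < 2 - nu1

for card in (62, 10 ** 8):                            # candidate values of Card Delta_A(S)
    print(card, nu2_bound(0.075, card))
```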

In contrast, the VC dimension of such a large multilayer neural network clearly exceeds the total number of examples, so that \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{F}(S)=2^{2\ell }\approx 10^{301}\), leading to a vacuous bound on the median test error, \(\nu _2\le 1.25\).

We can attempt to directly measure \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\) by counting the number \(\mathcal {N}_0(t)\) of distinct misclassification vectors seen after training the classifier on t random splits. Such an attempt was unsuccessful because we lacked the computing resources to process a large enough number of splits. We stopped after processing 18,000 random splits and producing 18,000 distinct misclassification vectors. Birthday problem considerations [1] show that \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)>10^8\) with confidence greater than \(80\,\%\). As illustrated in Fig. 9.1, even such large values of \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\) can still lead to reasonable estimates, within a factor of two of the observed deviations.
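A back-of-the-envelope version of this birthday-problem computation (Python; it assumes roughly uniform independent draws over the support, which makes the no-collision probability at most \(\exp (-t(t-1)/2N)\) for a support of size N):

```python
import numpy as np

t = 18_000                             # distinct vectors observed, no collision
for N in (1e7, 1e8, 1e9):
    p_no_collision = np.exp(-t * (t - 1) / (2 * N))
    print(f"N = {N:.0e}: P(no collision) <= {p_no_collision:.3f}")
# Around N = 1e8 this probability is about 0.20, hence the ~80% confidence.
```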

Since directly counting \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\) is computationally too expensive, we must design simpler empirical procedures to characterize the properties of the set \({\varDelta }_\mathcal{A}(S)\) of misclassification vectors.

7 Coverings

The solid curve in Fig. 9.2 shows the histogram of the Hamming distances measured between the misclassification vectors associated with pairs of random splits. This histogram reveals a very peaky distribution, and the location of the peak can be accurately determined by processing a moderate number of pairs. All our misclassification vectors appear to lie at or around Hamming distance 75 from each other.

Fig. 9.2  Histogram of Hamming distances between misclassification vectors. The solid curve shows the histogram of the Hamming distances separating random pairs of misclassification vectors. The dashed curve shows what this histogram would have been if the coefficients of the misclassification vectors were independently sampled from a Bernoulli distribution

It is well known that the distribution of the Hamming distance separating two d-dimensional binary vectors follows a very peaky distribution centered on \(2dp(1-p)\) where p is the probability of nonzero coefficients [9]. The dashed curve in Fig. 9.2 represents the histogram obtained by randomly shuffling the coefficients of each misclassification vector before computing the Hamming distances. This curve verifies the theoretical prediction with a peak centered at \(4\,\ell \,\nu (1-\nu )\approx 180\). The actual misclassification vectors \(q({w^{S_1}})\) therefore appear considerably less dispersed than random binary vectors. This observation invalidates the independence assumption that could have given us confidence in the accuracy of the union bound (9.7).
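The comparison of Fig. 9.2 can be reproduced along the following lines (Python sketch; `qs` stands for a \(T\times 2\ell \) binary array collecting the misclassification vectors of T random splits, here replaced by placeholder data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder data: replace with the (T, 2*ell) array of observed misclassification vectors.
qs = rng.integers(0, 2, size=(1000, 2 * 500))

def pairwise_hamming(vectors, n_pairs=20_000):
    """Hamming distances between randomly chosen pairs of rows."""
    T = len(vectors)
    i = rng.integers(0, T, size=n_pairs)
    j = rng.integers(0, T, size=n_pairs)
    keep = i != j
    return (vectors[i[keep]] != vectors[j[keep]]).sum(axis=1)

observed = pairwise_hamming(qs)                         # solid curve of Fig. 9.2
shuffled = np.apply_along_axis(rng.permutation, 1, qs)  # shuffle each vector's coefficients
baseline = pairwise_hamming(shuffled)                   # dashed curve, peak near 4*ell*nu*(1-nu)
```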

This peaky Hamming distance distribution suggests we should characterize the set \({\varDelta }_\mathcal{A}(S)\) of misclassification vectors using covering numbers. Let \(B_c(q)\) represent a Hamming ball of radius c centered on q. The covering number \(\mathcal {N}_c({\varDelta })\) is the smallest number of Hamming balls of radius c necessary to cover the set \({\varDelta }\):

$$ \mathcal {N}_c({\varDelta }) ~=~ \min _{C\subseteq {\varDelta }} ~ {{\mathrm{Card}}}(C) \quad {\text { such that}}\quad {\varDelta }\subseteq \mathop {\cup }_{q\in C} B_c(q). $$

Let us consider an arbitrary split of the dataset into training and test sets and assume that there exists \(q^\prime \in B_c(q)\) such that \(|\nu _2(q^\prime )-\nu _1(q^\prime )|>\varepsilon \). Since q and \(q^\prime \) differ in at most c coefficients, and since changing a single coefficient alters either \(\nu _1\) or \(\nu _2\) by at most \(1/\ell \), we obtain \(|\nu _2(q)-\nu _1(q)|>\varepsilon -c/\ell \).

Combining this observation with (9.2) gives

$$ \forall q \quad \Pr \left\{ ~ \exists q^\prime \in B_c(q) ~ : ~ \left| \nu _2(q^\prime )-\nu _1(q^\prime )\right| > \frac{c}{\ell } + \epsilon (\ell ,\nu (q),\eta ) ~ \right\} ~ = ~ \eta , $$

and a chain of inequalities similar to (9.5) gives

$$\begin{aligned} \Pr \left\{ ~ \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \frac{c}{\ell } + \epsilon (\ell ,\nu ({w^{S_1}}),\eta ) ~ \right\} ~ \le ~ \eta \, \mathcal {N}_c({\varDelta }_\mathcal{A}(S)). \end{aligned}$$
Fig. 9.3  Empirical covering sizes. Each curve plots how many Hamming balls (of radii ranging from 40 to 100) are needed to cover the misclassification vectors obtained using the number of splits specified on the X axis. These curves should reach the corresponding covering number when the number of splits increases to infinity

Fig. 9.4  Covering-based bounds on the median test error \(\nu _2({q^{S_1}})\) as a function of the Hamming ball radius c. The dotted line indicates the observed median test error

We construct coverings with the following greedy algorithm. Let \(q_1,q_2,\dots \) be the misclassification vectors associated with successive random splits of our dataset. We construct a covering \(C_t\) of the first t vectors using the following recursive procedure: if \(q_t\) belongs to one of the Hamming balls centered on an element of \(C_{t-1}\), we set \(C_t=C_{t-1}\), otherwise we set \(C_{t}=C_{t-1}\cup \{q_t\}\).
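A direct implementation of this greedy procedure is straightforward (Python sketch; `qs` again denotes the stream of misclassification vectors \(q_1,q_2,\dots \) collected from successive splits):

```python
import numpy as np

def greedy_covering(qs, radius):
    """Greedy covering C_t and the empirical covering sizes Card(C_t) for t = 1, 2, ..."""
    centers = []                      # the current covering C_t
    sizes = []                        # Card(C_t) after processing q_t
    for q in qs:
        if not any((q != c).sum() <= radius for c in centers):
            centers.append(q)         # q_t starts a new Hamming ball
        sizes.append(len(centers))
    return centers, sizes

# Example: centers, sizes = greedy_covering(qs, radius=75); plotting `sizes`
# against t gives one of the curves of Fig. 9.3.
```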

This empirical covering size \(\mathcal {N}_c(t)={{\mathrm{Card}}}(C_t)\) should converge to an upper bound on \(\mathcal {N}_c(\varDelta _\mathcal{A}(S))\) when t increases. Figure 9.3 plots the empirical covering sizes for several values of the Hamming ball radius c. When the radius is smaller than the peak of the Hamming distance histogram, this convergence cannot be observed in practice. When the radius is larger than the peak, \(\mathcal {N}_c(t)\) converges to a small value.

In the intermediate regime, the empirical covering size appears to converge but its limit is hard to determine. We can work around this difficulty by writing

$$\begin{aligned} \Pr \left\{ \, \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \frac{c}{\ell } + \epsilon (\ell ,\nu ({w^{S_1}}),\eta ) \, \right\} \, \le \, \eta \, \mathcal {N}_c(T) + \Pr (R_T), \end{aligned}$$
(9.9)

where \(R_t\subseteq {\varDelta }_\mathcal{A}(S)\) denotes the set of misclassification vectors that are not covered by any of the Hamming balls centered on the elements of \(C_t\). Let \(q_{t+1},\dots ,q_{t+s}\) denote the longest sequence of misclassification vectors such that \(C_{t+s}=C_{t}\). None of these vectors belongs to \(R_t\). Since they are obtained by considering random splits independent of the t previous random splits, the probability that none of these vectors belongs to \(R_t\) is \((1-\Pr (R_t))^s\). We can therefore write with confidence \(1-\varepsilon \) that

$$ \Pr (R_T) \le \Pr (R_t) \le 1-\root s \of {\varepsilon } \le -\log (\varepsilon )/s. $$
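Indeed, if we had \(\Pr (R_t)>1-\root s \of {\varepsilon }\), the probability of observing no element of \(R_t\) in the s additional splits would be smaller than \(\varepsilon \); the last inequality then follows from \(1-e^{-x}\le x\):

$$ 1-\root s \of {\varepsilon } ~=~ 1-e^{\log (\varepsilon )/s} ~\le ~ -\frac{\log (\varepsilon )}{s}. $$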

Empirical covering sizes \(\mathcal {N}_c(T)\) were collected for \(T=\mathrm{10{,}000}\). They range from \(\mathcal {N}_{120}(\mathrm{10{,}000})=1\) to \(\mathcal {N}_{50}(\mathrm{10{,}000})=\mathrm{3{,}317}\). We cannot ensure that \(\Pr (R_T)\) is small enough when \(c<50\).

Setting the right-hand side of (9.9) to 0.5, using approximation (9.4), and solving for \(\nu _2({w^{S_1}})\) yields a bound on the median test error. Figure 9.4 plots this bound as a function of the Hamming ball radius c. Although their empirical accuracy is far from ideal, these covering-based bounds are within a factor of two of the observed test error. This is clearly better than the vacuous bounds usually afforded by the data- and algorithm-independent bounding technique.

Fig. 9.5  Empirical variance of the loss function. Only a fraction of the examples \(z_i\) have losses \(Q(z_i,{w^{S_1}})\) that vary from one split to the next. The other examples are either always correctly classified or always misclassified

8 Discussion

There is still a significant gap separating these empirical bounds from the observed values. Certainly the most interesting effect revealed by our empirical study is the low dispersion of the misclassification vectors (Fig. 9.2) because it implies that the union bound is very inaccurate. Although relying on empirical covering numbers should in principle reduce the negative impact of this low dispersion, Dudley’s chaining technique [6, 13] is a much more refined way to improve on the union bound. Vorontsov’s recent work [24] is therefore very interesting because it leverages a more refined characterization of the distribution of misclassification vectors in a manner related to Dudley’s chaining.

It is also interesting to investigate the cause of the low dispersion of the misclassification vectors. The observed Hamming distance histogram (Fig. 9.2) looks strikingly like the Hamming distance histogram separating random binary vectors of lower dimensionality. Could it be that only a subset of the examples are responsible for the misclassification vector variations? This would mean that most of the examples are always correctly recognized (or misrecognized when their label is incorrect) regardless of the dataset split. This hypothesis is confirmed by Fig. 9.5 which plots the observed variance of the loss \(Q(z_i,{w^{S_1}})\) for all examples \(z_i\) ordered by decreasing variance. This observation is interesting because it establishes a connection with sample compression bounds: the only examples that matter are those that switch from being correctly classified to being misclassified when one changes how the data is split into training and test sets. The connection between capacity and compression therefore appears to be a manifestation of the subtleties of the union bound.

Finally, one of the main criticisms of the approach outlined in this paper is its computational requirement. Why spend time characterizing the set of misclassification vectors to produce a mediocre bound on the test error while a fraction of this time is sufficient to compute the test error itself? This is a valid criticism of this work as an empirical measuring technique. However this work also has value because it helps us understand the mathematical subtleties of learning. Measuring and understanding are two equally important aspects of the scientific approach.