1 Introduction

Although the Vapnik–Chervonenkis (VC) learning theory [15, 16, 18–20] has been justly acclaimed as a major conceptual breakthrough, applying its essential theorems to practical problems often yields very loose bounds. In the case of the pattern recognition problem, the theorems provide distribution-independent uniform bounds on the deviation between the expected classification error and the empirical classification error. Their derivation reveals many possible causes for their poor quantitative performance:

(i) Practical data distributions may lead to smaller deviations than the worst possible data distribution.

(ii) Uniform bounds hold for all possible classification functions. Better bounds may hold when one restricts the analysis to functions that perform well on plausible training sets.

(iii) A symmetrization lemma translates the main combinatorial result into a bound on the deviation between expected and empirical errors. This lemma is a conservative inequality.

(iv) The combinatorial characterization of the Vapnik–Chervonenkis capacity is a conservative upper bound.

(v) The union bound \(P(\cup _iA_i)\le \sum _iP(A_i)\) constitutes a critical step of the proof. This bound could be reasonably accurate if the events were independent events with low probability. Nothing guarantees that this is the case.

An apparently different class of bounds, sometimes called sample compression bounds, often provides much more realistic estimates of the generalization error. Such bounds predate the VC theory: for instance, it was mentioned in Paphos that Chervonenkis knew that the expected error of the generalized portrait algorithm is roughly bounded by the fraction of support vectors found in the training set [17, 21]. This bound depends on the number of support vectors, an empirical quantity measured a posteriori.

The purpose of this contribution is to explore the gap between these two styles of bounds using only simple mathematics and a simple empirical case study. This simplicity results from an apparently bold step: instead of assuming that the examples are independently drawn from an unknown distribution, we will reason on random partitions of an arbitrary dataset into equally sized training and test sets. Deviation estimates then result from the combinatorial argument that forms the central part of the traditional Vapnik–Chervonenkis proofs. Avoiding the symmetrization lemma (see point (iii) above) also provides a simple way to obtain data- and algorithm-dependent bounds (points (i) and (ii)) and to define empirical data- and algorithm-dependent capacities (point (iv)) [3, 4, 23]. The union bound (point (v) above) then remains the main source of quantitative problems.

Although refined bounding techniques have been proposed to address all these issues [5–8, 12, 13], their sophistication often obscures their connection to practical reality. We believe that the simple framework described in this contribution provides useful intuitions.

The following discussion is organized as follows. After presenting the random split paradigm, we explain how to easily derive bounds in the style of Vapnik–Chervonenkis and make them take into account the specificities of the data distribution and of the learning algorithm. We then estimate the performance of these bounds on a simple case study and show that more refinements are necessary to obtain a bound with a reasonable amount of computation.

2 Setup

Let Q(z, w) be a loss function that measures the correctness on example z of the answer produced by a learning machine parameterized by \(w\in \mathcal{F}\). In this paper we only consider the case of binary loss functions that take the value one if the answer is wrong and zero if it is correct. For instance, in the case of a pattern recognition system, each example z is a pair (x, y) composed of a pattern x and a class label y. Given a classifier \(f_w(x)\) parametrized by w, the loss function Q(z, w) is zero when \(f_w(x)=y\) and is one otherwise.

Let S be a set of \(2\ell \) labeled examples \(z_1,\cdots ,z_{2\ell }\). There are \(C_{2\ell }^{\ell }\) ways to split this set into equally sized training and test sets, \(S_1\) and \(S_2\), each containing \(\ell \) examples. For each choice of a training set \(S_1\) and a test set \(S_2\), and for each value of the parameter w, we define the training error \(\nu _1\), the test error \(\nu _2\), and the total error \(\nu \) as:

$$ \nu _1(w) = \frac{1}{\ell } \sum _{z \in S_1} Q(z,w),~~~~~~ \nu _2(w) = \frac{1}{\ell } \sum _{z \in S_2} Q(z,w), $$
$$ \nu (w) = \frac{1}{2\ell } \sum _{z \in S} Q(z,w). $$
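As a concrete illustration, the following sketch (Python with NumPy; the array `losses` is a hypothetical stand-in for the binary vector \((Q(z_1,w),\dots ,Q(z_{2\ell },w))\) of a fixed parameter w) computes these three quantities for one random split:

```python
import numpy as np

rng = np.random.default_rng(0)
ell = 500
# Hypothetical misclassification vector (Q(z_1,w), ..., Q(z_{2 ell}, w)).
losses = rng.integers(0, 2, size=2 * ell)

# One of the C(2 ell, ell) equally likely splits into S_1 and S_2.
perm = rng.permutation(2 * ell)
train_idx, test_idx = perm[:ell], perm[ell:]

nu1 = losses[train_idx].mean()   # training error nu_1(w)
nu2 = losses[test_idx].mean()    # test error     nu_2(w)
nu = losses.mean()               # total error    nu(w) = (nu_1 + nu_2) / 2
```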

Consider a deterministic learning algorithm \(\mathcal{A}\) that processes the training set \(S_1\) and produces a parameter \({w^{S_1}}\). This parameter value usually performs well on the training set \(S_1\) and we hope that it will also perform well on the test set \(S_2\). For instance, the empirical risk minimization principle suggests designing an algorithm that minimizes \(\nu _1(w)\) in the hope of ensuring that \(\nu _2({w^{S_1}})\) is small.

All results presented here concern the distribution of the deviation between the training error \(\nu _1({w^{S_1}})\) and the test error \(\nu _2({w^{S_1}})\) when one considers all possible splits \(S_1\cup S_2\) of the dataset S and obtains \({w^{S_1}}\) by running the learning algorithm \(\mathcal{A}\),

$$\begin{aligned} \Pr \left\{ ~ \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \epsilon ~ \right\} . \end{aligned}$$
(9.1)

The notation \(\Pr (\mathcal{H})\) denotes the ratio of the number of splits for which condition \(\mathcal{H}\) is true over the total number \(C_{2\ell }^{ \ell }\) of possible splits \(S_1\cup S_2\) of the dataset S. We use this notation instead of the traditional probability notation to emphasize the purely combinatorial nature of this problem.

We argue that the real-life behavior of learning algorithms is well characterized by the tail of this distribution. Thousands of machine learning papers are in fact supported by experimental studies that follow the same protocol: randomly separating out test data, applying the learning algorithm to the remaining data, and assessing its performance on the test data. A good test set performance is widely accepted as convincing evidence supporting the use of a specific learning algorithm for a specific learning problem. Bounding the tail of the distribution (9.1) provides equally strong evidence.

In contrast, traditional statistical approaches to the learning problem assume that the training examples are drawn independently from an unknown distribution. The expected error \(\mathbb {E}(Q(z,{w^{S_1}}))\) then represents the future performance of the system on new examples drawn from this same distribution. Bounding the difference between the training error and the expected error provides a stronger guarantee because the assumed existence of the ground truth distribution provides a means to reason about the algorithm’s performance on arbitrarily large training sets. Consider for instance a binary classification algorithm that relies on a polynomial discriminant function whose degree grows linearly with the number of training examples. Running such an algorithm on a training set \(S_1\) of sufficiently small size \(\ell \) could conceivably give acceptable performance on the test set \(S_2\) of the same size. However this acceptable performance does not guarantee that running the algorithm on all \(2\ell \) available examples would not overfit.

Avoiding the ground truth assumption is attractive for philosophical reasons. Although epistemology frequently relies on the idea that the world is ruled by simple universal truths waiting to be uncovered, it can be argued that the only thing that is available to us for sure is a finite set of examples. From this point of view, the ground truth distribution is a metaphysical concept because there is no statistical test to determine whether or not our dataset is composed of independent and identically distributed examples and no hope to identify their distribution.

Avoiding the ground truth assumption is also attractive for technical reasons. Working with the combinatorial distribution (9.1) affords simple ways to derive tail bounds that leverage specific properties of the data or of the learning algorithm.

3 Misclassification Vectors

For each value of the parameter w, the loss function Q(z, w) maps the full set of examples S onto a binary vector \(q(w)=(Q(z_1,w),\dots ,Q(z_{2\ell },w))\) of length \(2\ell \) that we shall call the misclassification vector. The ordering of its coefficients does not depend on the dataset split: the ith component of q(w) indicates whether the learning system parametrized by w processes the example \(z_i\) incorrectly, regardless of whether \(z_i\) belongs to the training set or the test set.

The misclassification vector q(w) encapsulates all that we need to know about the performance of the system parametrized by vector w. Parameter values that lead to the same misclassification vector will also lead to the same total error, training error, and test error. Therefore we often write them as \(\nu (q)\), \(\nu _1(q)\), and \(\nu _2(q)\) instead of \(\nu (w)\), \(\nu _1(w)\), and \(\nu _2(w)\).

The function \(\eta (q,\epsilon )=\Pr \left\{ \left| \nu _2(q)-\nu _1(q)\right| >\epsilon \right\} \) does not depend on the ordering of the coefficients in the misclassification vector q because all possible splits are considered and because the quantities \(\nu _1(q)\) and \(\nu _2(q)\) do not depend on the ordering of the coefficients within each subset. We therefore write \(\eta (q,\epsilon )=\eta (\ell ,\nu (q),\epsilon )\).

Consider an urn containing \(2\nu \ell \) red marbles and \(2(1-\nu )\ell \) white marbles. Out of the \(C_{2\ell }^{\ell }\) possible ways to draw \(\ell \) marbles without replacement, there are exactly \(C_{2\nu \ell }^{k} C_{2(1-\nu )\ell }^{\ell -k}\) ways to draw exactly k red marbles. The analytic expression of \(\eta (\ell ,\nu ,\epsilon )\) is obtained by summing this quantity for all values of k that ensure that the difference between the number k of red marbles drawn from the urn and the number \(2\nu \ell -k\) of red marbles left in the urn exceeds \(\ell \epsilon \):

$$\begin{aligned} \eta (\ell ,\nu ,\epsilon ) = \frac{1}{C_{2\ell }^{\ell }} ~~ \sum _{2|\nu \ell -k|>\ell \epsilon } ~ C_{2\nu \ell }^{k} \, C_{2(1-\nu )\ell }^{\ell -k}. \end{aligned}$$

There are efficient numerical methods for computing this hypergeometric tail  [14].
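For instance, the tail can be evaluated with standard numerical libraries. The following sketch (Python with SciPy; the function name `eta` and the illustrative argument values are ours) mirrors the urn argument above:

```python
import numpy as np
from scipy.stats import hypergeom

def eta(ell, nu, eps):
    """Tail probability eta(ell, nu, eps) of the urn model described above."""
    m = int(round(2 * nu * ell))       # number of errors (red marbles) in S
    rv = hypergeom(2 * ell, m, ell)    # population 2*ell, m successes, ell draws
    k = np.arange(max(0, m - ell), min(m, ell) + 1)
    tail = 2 * np.abs(nu * ell - k) > ell * eps
    return rv.pmf(k[tail]).sum()

print(eta(500, 0.10, 0.05))            # illustrative values of ell, nu, eps
```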

Since the function \(\eta (\ell ,\nu ,\epsilon )\) is monotonically decreasing with \(\epsilon \), we define the inverse function \(\epsilon (\ell ,\nu ,\eta )\) and write

$$\begin{aligned} \forall q \quad \Pr \left\{ ~ \left| \nu _2(q)-\nu _1(q)\right| > \epsilon (\ell ,\nu (q),\eta ) ~ \right\} ~ = ~ \eta . \end{aligned}$$
(9.2)

Although there is no known analytic form for the inverse function \(\epsilon (\ell ,\nu ,\eta )\), its exact values can be directly read from a table of its inverse \(\eta (\ell ,\nu ,\epsilon )\). This function is also well described by relatively accurate bounds and approximations such as those derived by Vapnik and Chervonenkis [15, inequality A5, p. 176]:

$$\begin{aligned} \epsilon (\ell ,\nu ,\eta )&\le \sqrt{4 \left( \nu +\frac{1}{2\ell }\right) \left( 1-\nu +\frac{1}{2\ell }\right) \,\frac{\log (2/\eta )}{\ell +1} } \end{aligned}$$
(9.3)
$$\begin{aligned}&\approx \sqrt{\frac{4\,\nu (1-\nu )\,\log (2/\eta )}{\ell } }. \end{aligned}$$
(9.4)
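Concretely, one can tabulate \(\eta (\ell ,\nu ,\epsilon )\) on a grid of \(\epsilon \) values, read off the inverse, and compare it with approximation (9.4). A sketch (Python, reusing the `eta` function defined above; the function name and grid resolution are ours):

```python
import numpy as np

def eps_inverse(ell, nu, target_eta):
    """Smallest eps on a 1/ell grid such that eta(ell, nu, eps) <= target_eta."""
    grid = np.arange(0, ell + 1) / ell      # |nu_2 - nu_1| never exceeds 1
    values = np.array([eta(ell, nu, e) for e in grid])
    return grid[values <= target_eta].min()

ell, nu, target = 500, 0.10, 0.05
exact = eps_inverse(ell, nu, target)
approx = np.sqrt(4 * nu * (1 - nu) * np.log(2 / target) / ell)   # Eq. (9.4)
print(exact, approx)
```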

4 Data- and Algorithm-Independent Bounds

Let \({\varDelta }_\mathcal{F}(S)= \{ q(w) \,:\, w\in \mathcal{F}\}\) be the set of misclassification vectors associated with all potential values of the parameter w. Bounds on the deviation (9.1) are then derived from the following chain of inequalities:

$$\begin{aligned} \Pr&\left\{ ~ \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \epsilon (\ell ,\nu ({w^{S_1}}),\eta ) ~ \right\} \nonumber \\&= ~ \Pr \left\{ ~ \left| \nu _2({q^{S_1}})-\nu _1({q^{S_1}})\right| > \epsilon (\ell ,\nu ({q^{S_1}}),\eta ) ~ \right\} \nonumber \\&\le ~ \Pr \big \{~ \exists q\in {\varDelta }_\mathcal{F}(S) ~:~ |\nu _2(q)-\nu _1(q)|>\epsilon (\ell ,\nu (q),\eta ) ~\big \}\nonumber \\&\le \sum _{q\in {\varDelta }_\mathcal{F}(S)} \Pr \left\{ ~ \left| \nu _2(q)-\nu _1(q)\right| > \epsilon (\ell ,\nu (q),\eta ) ~ \right\} \nonumber \\&= ~ \eta \, {{\mathrm{Card}}}\,{\varDelta }_\mathcal{F}(S). \end{aligned}$$
(9.5)

The first inequality above majorizes (9.1) by a uniform bound. The second inequality is an application of the union bound \(\Pr (A\cup B)\le \Pr (A)+\Pr (B)\), and the final result is obtained by applying Eq. (9.2).

Traditional data- and algorithm-independent deviation bounds control \(\epsilon (\ell ,\nu ,\eta )\) using the more convenient expression (9.3) and then invoke the landmark combinatorial lemma of Vapnik and Chervonenkis [18, theorem 1], which states that \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{F}(S)\) is either equal to \(2^{2\ell }\) or smaller than \((2\ell e/h)^h\) for some positive integer h that does not depend on the data S and is now called the VC dimension of the family of indicator functions \(\{\,z\,\mapsto \,Q(z,w)~:~w\in \mathcal{F}\,\}\). Simple algebraic manipulations then yield data- and algorithm-independent bounds for both the absolute and the relative deviation:

$$\begin{aligned} \Pr \left\{ ~ \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \sqrt{\frac{h(1+\log \frac{\ell }{h}) -\log \frac{\eta }{2}}{\ell -1}} ~\right\}&\le \eta ,\\ \Pr \left\{ ~ \frac{\left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| }{\sqrt{\nu ({w^{S_1}})+\frac{1}{2\ell }}} > 2\sqrt{\frac{h(1+\log \frac{\ell }{h})-\log \frac{\eta }{2}}{\ell }} ~\right\}&\le \eta . \end{aligned}$$

5 Data- and Algorithm-Dependent Bounds

There are several obvious ways to make these bounds tighter. Instead of using the bound (9.3), we can tabulate the exact values of \(\epsilon (\ell ,\nu ,\eta )\) as suggested in Sect. 9.3. Instead of bounding \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{F}(S)\), we can design empirical procedures to measure its value [3, 22]. The only remaining causes of inaccuracy are then the two inequalities appearing in the derivation (9.5), namely the uniform bound and the union bound.

The first source of concern is the majorization of the error deviation by a uniform bound. Many elements of \({\varDelta }_\mathcal{F}(S)\) are misclassification vectors that no reasonable learning algorithm would produce. Realistic learning algorithms tend to produce solutions that perform well on the training examples, and they often incorporate critical capacity control mechanisms. For instance, one can argue that multilayer network training often achieves good performance because its poor optimization algorithm is unable to find solutions far away from the initial point. All these aspects severely restrict the set of misclassification vectors.

Therefore, instead of considering the set \({\varDelta }_\mathcal{F}(S)\) of the misclassification vectors associated with all potential parameters \(w\in \mathcal{F}\), we can consider the set \({\varDelta }_\mathcal{A}(S)\) of the misclassification vectors associated with the parameters produced by applying algorithm \(\mathcal{A}\) to all training sets \(S_1\) extracted from dataset S:

$$ {\varDelta }_\mathcal{A}(S) = \big \{ ~ q(\mathcal{A}(S_1)) ~|~ S_1\subset S ~\wedge ~{{\mathrm{Card}}}(S_1)=\ell ~ \big \}. $$
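For a very small dataset, this set can be enumerated directly by running the algorithm on every split. The sketch below (Python; `fit` and `Q` are hypothetical stand-ins for the learning algorithm \(\mathcal{A}\) and the loss function) makes the definition concrete, although the \(C_{2\ell }^{\ell }\) splits quickly render such an enumeration infeasible:

```python
from itertools import combinations

def misclassification_set(S, fit, Q):
    """Enumerate Delta_A(S) for a small dataset S of even size 2*ell."""
    n = len(S)
    vectors = set()
    for train_idx in combinations(range(n), n // 2):
        w = fit([S[i] for i in train_idx])    # w^{S_1} = A(S_1)
        q = tuple(Q(z, w) for z in S)         # misclassification vector q(w)
        vectors.add(q)
    return vectors                            # its cardinality enters the bound below
```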

Replicating the derivation (9.5) leads to a data- and algorithm-dependent deviation bound,

$$\begin{aligned} \Pr \left\{ ~ \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \epsilon (\ell ,\nu ({w^{S_1}}),\eta ) ~ \right\} \le \eta \, {{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S). \end{aligned}$$
(9.6)

The second source of concern is the union bound which, in (9.5), majorizes the probability of the union of K events \(A_1, \dots , A_K\) of probability \(\eta \) by the sum \(K\eta \) of their probabilities. Let us tentatively assume that the events \(A_i\) can be considered pairwise independent. We can then write

$$\begin{aligned} K\eta - \Pr \left( \cup _k A_k\right) ~\le ~ \sum _{i<j}\Pr (A_i\cap A_j) ~\approx ~ \frac{K^2}{2} \eta ^2 \end{aligned}$$
(9.7)

and show that the majorization error is much smaller than \(K\eta \). The deviation bound (9.6) is likely to be quite accurate if this assumption holds. Whether this is true will be clarified in Sect. 9.7.
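A quick numerical check (Python; illustrative values only) shows the size of this majorization error when the K events are fully independent, in which case \(\Pr (\cup _kA_k)=1-(1-\eta )^K\):

```python
import numpy as np

eta_val, K = 1e-4, 100
p_union = 1 - (1 - eta_val) ** K     # exact for K independent events
slack = K * eta_val - p_union        # left-hand side of (9.7)
print(slack, K ** 2 * eta_val ** 2 / 2, K * eta_val)
# slack is close to K^2 eta^2 / 2 = 5e-05 and much smaller than K eta = 1e-2
```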

6 Empirical Study

In order to illustrate the performance of bound (9.6), we report on a simple experimental study using 1,000 examples from the MNIST handwritten digit recognition dataset [2]. The classifier is the convolutional network Lenet5 described in [10] and containing 60,000 adjustable parameters. Training is performed using mean square error back-propagation with learning rates periodically adjusted by estimating the diagonal of the Hessian matrix [11]. This case study should be viewed as a typical example of multilayer neural network training technology using a proven implementation. In particular, this learning algorithm should not be seen as a simple empirical risk minimization algorithm because the cost function is nonconvex and because the first-order nature of the algorithm favors solutions that are relatively close to the initial conditions.

We train this classifier on 1,000 random splits of the examples into equally sized training and test sets containing \(\ell =500\) examples each. We always use the same weight initialization. The observed median training error, median test error, and median relative deviation are, respectively,

$$ {{\mathrm{Median}}}\big [\nu _1({w^{S_1}})\big ]~\approx ~0.075,\qquad {{\mathrm{Median}}}\big [\nu _2({w^{S_1}})\big ]~\approx ~0.14, $$
$$ {{\mathrm{Median}}}\left[ \frac{|\nu _2({w^{S_1}})-\nu _1({w^{S_1}})|}{\sqrt{\nu ({w^{S_1}})(1-\nu ({w^{S_1}}))}}\right] ~\approx ~0.21. $$

The median deviation can also be estimated by setting the right-hand side of (9.6) to 0.5 and using the approximation (9.4),

$$\begin{aligned} {{\mathrm{Median}}}\left[ ~ \frac{|\nu _2({w^{S_1}})-\nu _1({w^{S_1}})|}{\sqrt{\nu ({w^{S_1}})(1-\nu ({w^{S_1}}))}} ~-~ 2\,\sqrt{\frac{\log (4\,{{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S))}{\ell }} ~\right] ~ \mathop {\approx }\limits ^{?}~0. \end{aligned}$$
(9.8)
Fig. 9.1  Bounds on the median relative deviation (top) and median test error \(\nu _2\) (bottom) as a function of \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\). The dotted line indicates the observed values

Figure 9.1 (top plot) shows how the bound on the relative deviation (9.8) depends on the value \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\). Figure 9.1 (bottom) plots a corresponding bound on the median test error \(\nu _2\), obtained by setting the training error \(\nu _1=0.075\) and numerically solving (9.8) for \(\nu _2\) with \(\nu =(\nu _1+\nu _2)/2\). Both plots show that \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\) must be as low as 62 for the bounds to match empirical observations. However these plots also show that values as large as \(10^8\) still provide reasonable estimates.
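The numerical solve behind the bottom plot can be sketched as follows (Python with SciPy; the function name, the choice of root finder, and the reliance on approximation (9.4) are ours):

```python
import numpy as np
from scipy.optimize import brentq

def nu2_bound(nu1, card, ell=500):
    """Test error nu2 at which (9.8) holds with equality, given nu1 and Card Delta_A(S)."""
    rhs = 2 * np.sqrt(np.log(4 * card) / ell)        # uses approximation (9.4)
    def gap(nu2):
        nu = (nu1 + nu2) / 2
        return (nu2 - nu1) / np.sqrt(nu * (1 - nu)) - rhs
    return brentq(gap, nu1 + 1e-9, 2 - nu1 - 1e-9)   # nu(1-nu) > 0 requires nu2 < 2 - nu1

for card in (62, 10 ** 8):                            # candidate values of Card Delta_A(S)
    print(card, nu2_bound(0.075, card))
```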

In contrast, the VC dimension of such a large multilayer neural network clearly exceeds the total number of examples, so that \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{F}(S)=2^{2\ell }\approx 10^{301}\), leading to a vacuous bound on the median test error, \(\nu _2\le 1.25\).

We can attempt to directly measure \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\) by counting the number \(\mathcal {N}_0(t)\) of distinct misclassification vectors seen after training the classifier on t random splits. Such an attempt was unsuccessful because we lacked the computing resources to process a large enough number of splits. We stopped after processing 18,000 random splits and producing 18,000 distinct misclassification vectors. Birthday problem considerations [1] show that \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)>10^8\) with confidence greater than \(80\,\%\). As illustrated in Fig. 9.1, even such large values of \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\) can still lead to reasonable estimates, within a factor of two of the observed deviations.
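A back-of-the-envelope version of this birthday-problem computation (Python; it assumes roughly uniform independent draws over the support, which makes the no-collision probability at most \(\exp (-t(t-1)/2N)\) for a support of size N):

```python
import numpy as np

t = 18_000                             # distinct vectors observed, no collision
for N in (1e7, 1e8, 1e9):
    p_no_collision = np.exp(-t * (t - 1) / (2 * N))
    print(f"N = {N:.0e}: P(no collision) <= {p_no_collision:.3f}")
# Around N = 1e8 this probability is about 0.20, hence the ~80% confidence.
```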

Since directly counting \({{\mathrm{Card}}}\,{\varDelta }_\mathcal{A}(S)\) is computationally too expensive, we must design simpler empirical procedures to characterize the properties of the set \({\varDelta }_\mathcal{A}(S)\) of misclassification vectors.

7 Coverings

The solid curve in Fig. 9.2 shows the histogram of the Hamming distances measured between the misclassification vectors associated with pairs of random splits. This histogram reveals a very peaky distribution, and the location of the peak can be accurately determined by processing a moderate number of pairs. All our misclassification vectors appear to lie at or around Hamming distance 75 from each other.

Fig. 9.2  Histogram of Hamming distances between misclassification vectors. The solid curve shows the histogram of the Hamming distances separating random pairs of misclassification vectors. The dashed curve shows what this histogram would have been if the coefficients of the misclassification vectors were independently sampled from a Bernoulli distribution

It is well known that the distribution of the Hamming distance separating two d-dimensional binary vectors follows a very peaky distribution centered on \(2dp(1-p)\) where p is the probability of nonzero coefficients [9]. The dashed curve in Fig. 9.2 represents the histogram obtained by randomly shuffling the coefficients of each misclassification vector before computing the Hamming distances. This curve verifies the theoretical prediction with a peak centered at \(4\,\ell \,\nu (1-\nu )\approx 180\). The actual misclassification vectors \(q({w^{S_1}})\) therefore appear considerably less dispersed than random binary vectors. This observation invalidates the independence assumption that could have given us confidence in the accuracy of the union bound (9.7).
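The comparison of Fig. 9.2 can be reproduced along the following lines (Python sketch; `qs` stands for a \(T\times 2\ell \) binary array collecting the misclassification vectors of T random splits, here replaced by placeholder data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder data: replace with the (T, 2*ell) array of observed misclassification vectors.
qs = rng.integers(0, 2, size=(1000, 2 * 500))

def pairwise_hamming(vectors, n_pairs=20_000):
    """Hamming distances between randomly chosen pairs of rows."""
    T = len(vectors)
    i = rng.integers(0, T, size=n_pairs)
    j = rng.integers(0, T, size=n_pairs)
    keep = i != j
    return (vectors[i[keep]] != vectors[j[keep]]).sum(axis=1)

observed = pairwise_hamming(qs)                         # solid curve of Fig. 9.2
shuffled = np.apply_along_axis(rng.permutation, 1, qs)  # shuffle each vector's coefficients
baseline = pairwise_hamming(shuffled)                   # dashed curve, peak near 4*ell*nu*(1-nu)
```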

This peaky Hamming distance distribution suggests we should characterize the set \({\varDelta }_\mathcal{A}(S)\) of misclassification vectors using covering numbers. Let \(B_c(q)\) represent a Hamming ball of radius c centered on q. The covering number \(\mathcal {N}_c({\varDelta })\) is the smallest number of Hamming balls of radius c necessary to cover the set \({\varDelta }\):

$$ \mathcal {N}_c({\varDelta }) ~=~ \min _{C\subseteq {\varDelta }} ~ {{\mathrm{Card}}}(C) \quad {\text { such that}}\quad {\varDelta }\subseteq \mathop {\cup }_{q\in C} B_c(q). $$

Let us consider an arbitrary split of the dataset into training and test sets and assume that there exists \(q^\prime \in B_c(q)\) such that \(|\nu _2(q^\prime )-\nu _1(q^\prime )|>\varepsilon \). Since q and \(q^\prime \) differ in at most c coefficients, and since changing a single coefficient alters either \(\nu _1\) or \(\nu _2\) by at most \(1/\ell \), we obtain \(|\nu _2(q)-\nu _1(q)|>\varepsilon -c/\ell \).

Combining this observation with (9.2) gives

$$ \forall q \quad \Pr \left\{ ~ \exists q^\prime \in B_c(q) ~ : ~ \left| \nu _2(q^\prime )-\nu _1(q^\prime )\right| > \frac{c}{\ell } + \epsilon (\ell ,\nu (q),\eta ) ~ \right\} ~ = ~ \eta , $$

and a chain of inequalities similar to (9.5) gives

$$\begin{aligned} \Pr \left\{ ~ \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \frac{c}{\ell } + \epsilon (\ell ,\nu ({w^{S_1}}),\eta ) ~ \right\} ~ \le ~ \eta \, \mathcal {N}_c({\varDelta }_\mathcal{A}(S)). \end{aligned}$$
Fig. 9.3  Empirical covering sizes. Each curve plots how many Hamming balls (of radii ranging from 40 to 100) are needed to cover the misclassification vectors obtained using the number of splits specified on the X axis. These curves should reach the corresponding covering number when the number of splits increases to infinity

Fig. 9.4  Covering-based bounds on the median test error \(\nu _2({q^{S_1}})\) as a function of the Hamming ball radius c. The dotted line indicates the observed median test error

We construct coverings with the following greedy algorithm. Let \(q_1,q_2,\dots \) be the misclassification vectors associated with successive random splits of our dataset. We construct a covering \(C_t\) of the first t vectors using the following recursive procedure: if \(q_t\) belongs to one of the Hamming balls centered on an element of \(C_{t-1}\), we set \(C_t=C_{t-1}\), otherwise we set \(C_{t}=C_{t-1}\cup \{q_t\}\).
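A direct implementation of this greedy procedure is straightforward (Python sketch; `qs` again denotes the stream of misclassification vectors \(q_1,q_2,\dots \) collected from successive splits):

```python
import numpy as np

def greedy_covering(qs, radius):
    """Greedy covering C_t and the empirical covering sizes Card(C_t) for t = 1, 2, ..."""
    centers = []                      # the current covering C_t
    sizes = []                        # Card(C_t) after processing q_t
    for q in qs:
        if not any((q != c).sum() <= radius for c in centers):
            centers.append(q)         # q_t starts a new Hamming ball
        sizes.append(len(centers))
    return centers, sizes

# Example: centers, sizes = greedy_covering(qs, radius=75); plotting `sizes`
# against t gives one of the curves of Fig. 9.3.
```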

This empirical covering size \(\mathcal {N}_c(t)={{\mathrm{Card}}}(C_t)\) should converge to an upper bound on \(\mathcal {N}_c(\varDelta _\mathcal{A}(S))\) when t increases. Figure 9.3 plots the empirical covering sizes for several values of the Hamming ball radius c. When the radius is smaller than the peak of the Hamming distance histogram, this convergence cannot be observed in practice. When the radius is larger than the peak, \(\mathcal {N}_c(t)\) converges to a small value.

In the intermediate regime, the empirical covering size appears to converge but its limit is hard to determine. We can work around this difficulty by writing

$$\begin{aligned} \Pr \left\{ \, \left| \nu _2({w^{S_1}})-\nu _1({w^{S_1}})\right| > \frac{c}{\ell } + \epsilon (\ell ,\nu ({w^{S_1}}),\eta ) \, \right\} \, \le \, \eta \, \mathcal {N}_c(T) + \Pr (R_T), \end{aligned}$$
(9.9)

where \(R_t\subseteq {\varDelta }_\mathcal{A}(S)\) denotes the set of misclassification vectors that are not covered by any of the Hamming balls centered on the elements of \(C_t\). Let \(q_{t+1},\dots ,q_{t+s}\) denote the longest sequence of misclassification vectors such that \(C_{t+s}=C_{t}\). None of these vectors belongs to \(R_t\). Since they are obtained by considering random splits independent of the t previous random splits, the probability that none of these vectors belongs to \(R_t\) is \((1-\Pr (R_t))^s\). We can therefore write with confidence \(1-\varepsilon \) that

$$ \Pr (R_T) \le \Pr (R_t) \le 1-\root s \of {\varepsilon } \le -\log (\varepsilon )/s. $$
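Indeed, if we had \(\Pr (R_t)>1-\root s \of {\varepsilon }\), the probability of observing no element of \(R_t\) in the s additional splits would be smaller than \(\varepsilon \); the last inequality then follows from \(1-e^{-x}\le x\):

$$ 1-\root s \of {\varepsilon } ~=~ 1-e^{\log (\varepsilon )/s} ~\le ~ -\frac{\log (\varepsilon )}{s}. $$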

Empirical covering sizes \(\mathcal {N}_c(T)\) were collected for \(T=\mathrm{10{,}000}\). They range from \(\mathcal {N}_{120}(\mathrm{10{,}000})=1\) to \(\mathcal {N}_{50}(\mathrm{10{,}000})=\mathrm{3{,}317}\). We cannot ensure that \(\Pr (R_T)\) is small enough when \(c<50\).

Setting the right-hand side of (9.9) to 0.5, using approximation (9.4), and solving for \(\nu _2({w^{S_1}})\) yields a bound on the median test error. Figure 9.4 plots this bound as a function of the Hamming ball radius c. Although their empirical accuracy is far from ideal, these covering-based bounds are within a factor of two of the observed test error. This is clearly better than the vacuous bounds usually afforded by the data- and algorithm-independent bounding technique.

Fig. 9.5  Empirical variance of the loss function. Only a fraction of the examples \(z_i\) have losses \(Q(z_i,{w^{S_1}})\) that vary from one split to the next. The other examples are either always correctly classified or always misclassified

8 Discussion

There is still a significant gap separating these empirical bounds from the observed values. Certainly the most interesting effect revealed by our empirical study is the low dispersion of the misclassification vectors (Fig. 9.2) because it implies that the union bound is very inaccurate. Although relying on empirical covering numbers should in principle reduce the negative impact of this low dispersion, Dudley’s chaining technique [6, 13] is a much more refined way to improve on the union bound. Vorontsov’s recent work [24] is therefore very interesting because it leverages a more refined characterization of the distribution of misclassification vectors in a manner related to Dudley’s chaining.

It is also interesting to investigate the cause of the low dispersion of the misclassification vectors. The observed Hamming distance histogram (Fig. 9.2) looks strikingly like the Hamming distance histogram separating random binary vectors of lower dimensionality. Could it be that only a subset of the examples are responsible for the misclassification vector variations? This would mean that most of the examples are always correctly recognized (or misrecognized when their label is incorrect) regardless of the dataset split. This hypothesis is confirmed by Fig. 9.5 which plots the observed variance of the loss \(Q(z_i,{w^{S_1}})\) for all examples \(z_i\) ordered by decreasing variance. This observation is interesting because it establishes a connection with sample compression bounds: the only examples that matter are those that switch from being correctly classified to being misclassified when one changes how the data is split into training and test sets. The connection between capacity and compression therefore appears to be a manifestation of the subtleties of the union bound.

Finally, one of the main criticisms of the approach outlined in this paper is its computational requirement. Why spend time characterizing the set of misclassification vectors to produce a mediocre bound on the test error while a fraction of this time is sufficient to compute the test error itself? This is a valid criticism of this work as an empirical measuring technique. However this work also has value because it helps us understand the mathematical subtleties of learning. Measuring and understanding are two equally important aspects of the scientific approach.