Abstract
We present in this contribution a synthesis of Seeger’s (PAC-Bayesian generalization error bounds for Gaussian process classification, 2002) and our own (Catoni, PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, 2007) approaches to PAC-Bayes inequalities for 0–1 loss functions. We apply it to supervised classification, and more specifically to the proof of new margin bounds for support vector machines, in the spirit of the bounds established by Langford and Shawe-Taylor (Advances in Neural Information Processing Systems, 2002) and McAllester (Learning Theory and Kernel Machines, COLT 2003).
1 PAC-Bayes Bounds for 0–1 Loss Functions
In this section, we are given some i.i.d. sample \((W_i)_{i=1}^n \in \mathcal {W}^n\), where \(\mathcal {W}\) is a measurable space, and some binary measurable loss function \(L : \mathcal {W} \times \varTheta \rightarrow \{0,1\}\), where \(\varTheta \) is a measurable parameter space. Our aim is to minimize with respect to \(\theta \in \varTheta \) the expected loss
\[ L(\mathbb {P}, \theta ) \overset{\text {def}}{=} \int L(w, \theta ) \, \mathrm {d}\mathbb {P}(w), \]
where \(\mathbb {P}\) is the marginal distribution of the observed sample \((W_i)_{i=1}^n\). More precisely, assuming that \(\mathbb {P}\) is unknown, we would like to find an estimator \(\widehat{\theta }(W_{1:\, n})\) depending on the observed sample \(W_{1:\, n} \overset{\text {def}}{=} (W_i)_{i=1}^n\) such that the excess risk
\[ L\bigl( \mathbb {P}, \widehat{\theta } \bigr) - \inf _{\theta \in \varTheta } L(\mathbb {P}, \theta ) \]
is small. The previous quantity is random, since \(\widehat{\theta }\) depends on the random sample \(W_{1:\, n}\). Therefore its size can be understood in different ways. Here we will focus on the deviations of the excess risk. Accordingly, we will look for estimators providing a small risk with a probability close to one.
A typical example of such a problem is provided by supervised classification. In this setting \(\mathcal {W} = \mathcal {X} \times \mathcal {Y}\), where \(\mathcal {Y}\) is a finite set, \(W_i = (X_i, Y_i)\), where \((X_i,Y_i)\) are input-output pairs, a family of measurable classification rules \(\bigl \{ f_{\theta } : \mathcal {X} \rightarrow \mathcal {Y}; \theta \in \varTheta \bigr \}\) is considered, and the loss function \(L(w,\theta )\) is defined as the classification error
\[ L(w, \theta ) = \mathbbm {1}\bigl( f_{\theta }(x) \ne y \bigr), \qquad w = (x, y). \]
Accordingly the aim is to minimize the expected classification error
\[ L(\mathbb {P}, \theta ) = \mathbb {P}\bigl( f_{\theta }(X) \ne Y \bigr), \]
given a sample \((X_i, Y_i)_{i=1}^n\) of observations.
The point of view presented here is a synthesis of the approaches of [2, 8].
1.1 Deviation Bounds for Sums of Bernoulli Random Variables
Given some parameter \(\lambda \in \mathbb {R}\), let us consider the (normalized) \(\log \)-Laplace transform of the Bernoulli distribution:
\[ \varPhi _{\lambda }(p) = - \frac{1}{\lambda } \log \Bigl[ 1 - \bigl( 1 - \exp (-\lambda ) \bigr) \, p \Bigr], \qquad p \in [0,1]. \]
Let us also consider the Kullback–Leibler divergence of two Bernoulli distributions
\[ K(q, p) = q \log \biggl( \frac{q}{p} \biggr) + (1 - q) \log \biggl( \frac{1 - q}{1 - p} \biggr), \qquad q, p \in [0, 1]. \]
In the sequel \(\overline{\mathbb {P}}\) will be the empirical measure
\[ \overline{\mathbb {P}} = \frac{1}{n} \sum _{i=1}^{n} \delta _{W_i} \]
of an i.i.d. sample \((W_i)_{i=1}^n\) drawn from \(\mathbb {P}^{\otimes n} \in \mathcal {M}_+^1(\mathcal {W}^n)\) (the set of probability measures on \(\mathcal {W}^n\)). We will use a short notation for integrals, putting for any \(\rho , \pi \in \mathcal {M}_+^1(\varTheta )\) and any integrable function \(f \in \mathbb {L}_1 \bigl ( \mathcal {W} \times \varTheta ^2, \mathbb {P} \otimes \pi \otimes \rho \bigr )\)
\[ f(\mathbb {P}, \pi , \rho ) = \int f(w, \theta , \theta ') \, \mathrm {d}\mathbb {P}(w) \, \mathrm {d}\pi (\theta ) \, \mathrm {d}\rho (\theta '), \]
so that for instance \(\displaystyle L(\mathbb {P},\rho ) = \int L(w,\theta ) \, \mathrm {d}\mathbb {P}(w) \mathrm {d}\rho (\theta )\).
Let us first recall Chernoff’s bound.
Proposition 20.1
For any fixed value of the parameter \(\theta \in \varTheta \), the identity
\[ \int \exp \Bigl[ n \lambda \Bigl( \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \theta ) \bigr] - L\bigl( \overline{\mathbb {P}}, \theta \bigr) \Bigr) \Bigr] \, \mathrm {d}\mathbb {P}^{\otimes n} = 1, \qquad \lambda > 0, \]
shows that with probability at least \(1 - \epsilon \),
\[ L(\mathbb {P}, \theta ) \le B_+\biggl( L\bigl( \overline{\mathbb {P}}, \theta \bigr), \frac{\log (\epsilon ^{-1})}{n} \biggr), \]
where
\[ B_+(q, \delta ) \overset{\text {def}}{=} \sup \bigl\{ p \in [0, 1] \, : \, K(q, p) \le \delta \bigr\}. \]
Moreover
In the same way, the identity
shows that with probability at least \(1 - \epsilon \)
where
and
Before proving this proposition, let us mention some important identities.
Proposition 20.2
For any probability measures \(\pi \) and \(\rho \) defined on the same measurable space such that \(\mathcal {K}(\rho , \pi ) < \infty \), and any bounded measurable function h, let us define the transformed probability measure \(\pi _{\exp (h)} \ll \pi \) by its density
\[ \frac{\mathrm {d}\pi _{\exp (h)}}{\mathrm {d}\pi } = \frac{\exp (h)}{Z}, \]
where \(Z = \int \exp (h) \, \mathrm {d}\pi \). Moreover, let us introduce the notation
\[ \operatorname {Var}_{\pi }(h) \overset{\text {def}}{=} \int \biggl( h - \int h \, \mathrm {d}\pi \biggr)^{\! 2} \mathrm {d}\pi . \]
The expectations with respect to \(\rho \) and \(\pi \) of h and the \(\log \)-Laplace transform of h are linked by the identities
\[ \log \int \exp (h) \, \mathrm {d}\pi = \int h \, \mathrm {d}\rho - \mathcal {K}(\rho , \pi ) + \mathcal {K}\bigl( \rho , \pi _{\exp (h)} \bigr), \tag{20.1} \]
\[ \log \int \exp (h) \, \mathrm {d}\pi = \int h \, \mathrm {d}\pi + \int _0^1 (1 - \alpha ) \operatorname {Var}_{\pi _{\exp (\alpha h)}}(h) \, \mathrm {d}\alpha . \tag{20.2} \]
Proof
The first identity is a straightforward consequence of the definitions of \(\pi _{\exp (h)}\) and of the Kullback–Leibler divergence function. The second one is the Taylor expansion of order one with integral remainder of the function
\[ f(\alpha ) = \log \int \exp (\alpha h) \, \mathrm {d}\pi , \qquad \alpha \in [0, 1], \]
which says that \(f(1) = f(0) + f'(0) + {\textstyle \int }_0^1(1 - \alpha ) f''(\alpha ) \, \mathrm {d}\alpha \). \(\square \)
Exercise 20.1
Prove that \(f \in \mathcal {C}^{\infty }\). Hint: write
\[ \int h^k \exp (\alpha ' h) \, \mathrm {d}\pi - \int h^k \exp (\alpha h) \, \mathrm {d}\pi = \int _{\alpha }^{\alpha '} \! \int h^{k+1} \exp (\beta h) \, \mathrm {d}\pi \, \mathrm {d}\beta , \]
use Fubini’s theorem to show that \(\alpha \mapsto \int h^k \exp (\alpha h) \, \mathrm {d}\pi \) belongs to \(\mathcal {C}^1\) and compute its derivative. \(\square \)
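As a sanity check, identity (20.1) can be verified numerically on a finite probability space. The following sketch (an arbitrary five-point space, arbitrary measures \(\pi , \rho \) and an arbitrary bounded function h, all hypothetical test data) only illustrates the identity; it is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite space of 5 atoms: pi and rho are arbitrary probability vectors,
# h an arbitrary bounded function (hypothetical test data).
pi = rng.dirichlet(np.ones(5))
rho = rng.dirichlet(np.ones(5))
h = rng.normal(size=5)

def kl(a, b):
    """Kullback-Leibler divergence K(a, b) of two probability vectors."""
    return float(np.sum(a * np.log(a / b)))

Z = float(np.sum(pi * np.exp(h)))   # normalizing constant of pi_exp(h)
pi_h = pi * np.exp(h) / Z           # the transformed measure pi_exp(h)

lhs = np.log(Z)
rhs = float(np.sum(rho * h)) - kl(rho, pi) + kl(rho, pi_h)
assert abs(lhs - rhs) < 1e-12       # identity (20.1) holds exactly
```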
Let us now come to the proof of Proposition 20.1. Chernoff’s inequality reads
\[ \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \theta ) \bigr] \le L\bigl( \overline{\mathbb {P}}, \theta \bigr) + \frac{\log (\epsilon ^{-1})}{n \lambda }, \]
where the inequality holds with probability at least \(1 - \epsilon \). Since the left-hand side is non-random, it can be optimized in \(\lambda \), giving
\[ \sup _{\lambda \in \mathbb {R}_+} \biggl( \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \theta ) \bigr] - \frac{\log (\epsilon ^{-1})}{n \lambda } \biggr) \le L\bigl( \overline{\mathbb {P}}, \theta \bigr). \]
Exercise 20.2
Prove this statement in more detail. For any integer \(k > 1\), consider the event
\[ A_k = \Bigl\{ \sup _{\lambda \in [k^{-1}, \, k]} F(\lambda ) > L\bigl( \overline{\mathbb {P}}, \theta \bigr) \Bigr\}, \]
where \(\displaystyle F(\lambda ) = \varPhi _{\lambda } \bigl [ L ( \mathbb {P}, \theta ) \bigr ] - \frac{\log (\epsilon ^{-1})}{n \lambda }\). Show that \(\mathbb {P}^{\otimes n} (A_k) \le \epsilon \) by choosing some suitable value of \(\lambda \). Remark that \(A_k \subset A_{k+1}\) and conclude that \(\mathbb {P}^{\otimes n} \bigl ( \cup _k A_k \bigr ) \le \epsilon \). \(\square \)
Since
\(B_+(q, \delta ) \le 1\).
Applying Eq. (20.1) to Bernoulli distributions gives
\[ \lambda \, \varPhi _{\lambda }(p) = \lambda q + K(q, p) - K(q, p_{\lambda }), \]
where
\[ p_{\lambda } = \frac{p \exp (-\lambda )}{1 - p \bigl( 1 - \exp (-\lambda ) \bigr)}. \]
This shows that
\[ \sup _{\lambda \in \mathbb {R}_+} \lambda \Bigl( \varPhi _{\lambda }(p) - q \Bigr) = K(q, p), \qquad 0 \le q \le p < 1, \]
because when \(q \le p < 1\) then \(\displaystyle \lambda = \log \biggl ( \frac{q^{-1} - 1}{p^{-1} - 1} \biggr ) \in \mathbb {R}_+\), \(q = p_{\lambda }\) and therefore \(K(q,p_{\lambda }) = 0\).
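This tilting property is easy to check numerically; in the following one-liner the values of p and q are hypothetical, and \(p_{\lambda }\) is taken in the form reconstructed above.

```python
import math

p, q = 0.3, 0.1  # any 0 < q <= p < 1 (hypothetical values)
lam = math.log((1 / q - 1) / (1 / p - 1))                    # lambda as in the text
p_lam = p * math.exp(-lam) / (1 - p * (1 - math.exp(-lam)))  # tilted parameter
assert lam >= 0 and abs(p_lam - q) < 1e-12                   # indeed q = p_lambda
```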
Let us remark now that \(\displaystyle \frac{\partial ^2}{\partial x^2} K(x, p) = x^{-1}(1-x)^{-1}\). Thus if \(p \ge q \ge 1/2\), then
so that if \(K(q,p) \le \delta \), then
Now if \(q \le 1/2\) and \(p \ge q\) then
so that if \(K(q,p) \le \delta \), then
implying that
On the other hand,
thus if \(K(q,p) = \delta \) with \(p > q\), then
implying that
Exercise 20.3
The second part of Proposition 20.1 is proved in the same way and left as an exercise. \(\square \)
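In practice, the bound of Proposition 20.1 is computed by inverting the divergence K numerically. Assuming, consistently with the reconstruction above, that \(B_+(q, \delta )\) is the largest \(p \ge q\) such that \(K(q, p) \le \delta \), bisection applies because \(p \mapsto K(q, p)\) is non-decreasing on [q, 1]. A minimal sketch:

```python
import math

def kl_bernoulli(q, p):
    """K(q, p) for Bernoulli distributions, with the convention 0 log 0 = 0."""
    p = min(max(p, 1e-15), 1 - 1e-15)
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out

def b_plus(q, delta, tol=1e-10):
    """Largest p >= q with K(q, p) <= delta (K(q, .) is non-decreasing on [q, 1])."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q, mid) <= delta:
            lo = mid
        else:
            hi = mid
    return lo

# Example: empirical error 0.1 on n = 1000 samples, confidence level 1 - 0.05.
n, eps = 1000, 0.05
print(b_plus(0.1, math.log(1 / eps) / n))  # ~0.125, the Chernoff upper bound
```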
1.2 PAC-Bayes Bounds
We are now going to make Proposition 20.1 uniform with respect to \(\theta \). The PAC-Bayes approach to this [3, 5–7] is to randomize \(\theta \), so we will now consider joint distributions on \((W_{1:\, n}, \theta )\), where the distribution of \(W_{1:\, n}\) is still \(\mathbb {P}^{\otimes n}\) and the conditional distribution of \(\theta \) given the sample is given by some transition probability kernel \(\rho : \mathcal {W}^n \rightarrow \mathcal {M}_+^1(\varTheta )\), called in this context a posterior distribution.\(^{1}\) This posterior distribution \(\rho \) will be compared with a prior (meaning non-random) probability measure \(\pi \in \mathcal {M}_+^1(\varTheta )\).
Proposition 20.3
Let us introduce the notation
For any prior probability measure \(\pi \in \mathcal {M}_+^1(\varTheta )\) and any \(\lambda \in \mathbb {R}_+\),
\[ \int \exp \biggl[ \sup _{\rho \in \mathcal {M}_+^1(\varTheta )} n \lambda \Bigl( \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \rho ) \bigr] - L\bigl( \overline{\mathbb {P}}, \rho \bigr) \Bigr) - \mathcal {K}(\rho , \pi ) \biggr] \, \mathrm {d}\mathbb {P}^{\otimes n} \le 1, \tag{20.3} \]
and therefore for any finite set \(\varLambda \subset \mathbb {R}_+\), with probability at least \(1 - \epsilon \), for any \(\rho \in \mathcal {M}_+^1(\varTheta )\),
\[ \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \rho ) \bigr] \le L\bigl( \overline{\mathbb {P}}, \rho \bigr) + \frac{\mathcal {K}(\rho , \pi ) + \log \bigl( |\varLambda | \, \epsilon ^{-1} \bigr)}{n \lambda }, \qquad \lambda \in \varLambda . \]
Proof
The exponential moment inequality (20.3) is a consequence of Eq. (20.1), showing that
\[ \sup _{\rho \in \mathcal {M}_+^1(\varTheta )} \biggl( \int h \, \mathrm {d}\rho - \mathcal {K}(\rho , \pi ) \biggr) = \log \int \exp (h) \, \mathrm {d}\pi , \]
and of the fact that \(\varPhi _{\lambda }\) is convex, showing that
\[ \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \rho ) \bigr] \le \int \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \theta ) \bigr] \, \mathrm {d}\rho (\theta ). \]
The deviation inequality follows as usual. \(\square \)
We cannot take the infimum on \(\lambda \in \mathbb {R}_+\) as in Proposition 20.1, because we can no longer cast our deviation inequality in such a way that \(\lambda \) appears on some non-random side of the inequality. Nevertheless, we can get a more explicit bound from some specific choice of the set \(\varLambda \).
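Taking the deviation inequality in the reconstructed form above, the bound is evaluated in practice by inverting \(\varPhi _{\lambda }\) (which is explicit: \(\varPhi _{\lambda }^{-1}(q) = (1 - e^{-\lambda q})/(1 - e^{-\lambda })\)) and minimizing over the finite grid \(\varLambda \). The numbers below (empirical randomized loss, Kullback–Leibler term, sample size) are hypothetical, and the exact constants depend on the statement of Proposition 20.3; this is only a sketch of the \(\lambda \)-grid trade-off.

```python
import math

def phi_inv(lam, q):
    """Inverse of Phi_lambda: the p in [0, 1] such that Phi_lambda(p) = q."""
    return min(1.0, (1 - math.exp(-lam * q)) / (1 - math.exp(-lam)))

# Hypothetical numbers: empirical randomized loss, KL term, sample size, confidence.
q_bar, kl_term, n, eps = 0.10, 5.0, 1000, 0.05

grid = [2.0 ** (-k) for k in range(10)]  # a geometric grid of tempering parameters

bound = min(
    phi_inv(lam, q_bar + (kl_term + math.log(len(grid) / eps)) / (n * lam))
    for lam in grid
)
print(bound)  # roughly 0.15 with these hypothetical numbers
```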
Proposition 20.4
Let us define the least increasing upper bound of the variance of a Bernoulli distribution of parameter \(p \in [0,1]\) as
\[ \overline{v}(p) = \sup _{p' \in [0, p]} p' (1 - p') = {\left\{ \begin{array}{ll} p (1 - p), &{} p \le 1/2, \\ 1/4, &{} p \ge 1/2. \end{array}\right. } \]
Let us choose some positive integer parameter m and let us put
With probability at least \(1 - \epsilon \), for any \(\rho \in \mathcal {M}_+^1(\varTheta )\),
where
Moreover, as soon as \(n \ge 5\),
so that with probability at least \(1 - \epsilon \), for any \(\rho \in \mathcal {M}_+^1(\varTheta )\),
Proof
Let us put
According to Eq. (20.2) applied to Bernoulli distributions, for any \(\lambda \in \varLambda \),
Moreover, as \(p_{\alpha } \le p\),
As \(\overline{v}(p) \le 1/4\) and \(\displaystyle \delta \ge \frac{\log \bigl [ (m+1)/\epsilon \bigr ]}{n}\),
Therefore either \(\lambda _{\min } \le \widehat{\lambda } \le 1\), or \(\widehat{\lambda } > 1\). Let us consider these two cases separately.
If \(\lambda _{\min } = \min \varLambda \le \widehat{\lambda } \le \max \varLambda = 1\), then \(\log \bigl ( \widehat{\lambda } \bigr ) \) is at distance at most t / m from some \(\log \bigl ( \lambda \bigr )\) where \(\lambda \in \varLambda \), because \(\log (\varLambda )\) is a grid with constant steps of size 2t / m. Thus
If moreover \(q \le 1/2\), then \(\overline{v}(p) \le p(1-q)\), so that we obtain a quadratic inequality in p, whose solution is bounded by
If on the contrary \(q \ge 1/2\), then \(\overline{v}(p) = \overline{v}(q) = 1/4\) and
so that in both cases
Let us now consider the case when \(\widehat{\lambda } > 1\). In this case \(\overline{v}(p) < 2 \delta \), so that
In conclusion, applying Proposition 20.3 we see that with probability at least \(1 - \epsilon \), for any posterior distribution \(\rho \),
which is precisely the statement to be proved.
In the special case when \(m = \lfloor \log (n)^2 \rfloor - 1 \ge \log (n)^2 - 2\),
as soon as the last inequality holds, that is as soon as \(n \ge \exp (\sqrt{2}) \simeq 4.11\) to make \(\log (n)^2 - 2\) positive and
which holds true for any \(n \ge 5\), as can be checked numerically. \(\square \)
2 Linear Classification and Support Vector Machines
In this section we are going to consider more specifically the case of linear binary classification. In this setting \(\mathcal {W} = \mathcal {X} \times \mathcal {Y} = \mathbb {R}^d \times \{-1, + 1\}\), \(w = (x,y)\), where \(x \in \mathbb {R}^d\) and \(y \in \{-1, +1\}\), \(\varTheta = \mathbb {R}^d\), and
\[ L(w, \theta ) = \mathbbm {1}\bigl( y \, \langle \theta , x \rangle \le 0 \bigr). \]
We will follow the approach presented in [4, 5].
Although we will stick in this presentation to the case when \(\mathcal {X}\) is a vector space of finite dimension, the results also apply to support vector machines [9–11], where the pattern space is some arbitrary space mapped to a Hilbert space \(\mathcal {H}\) by some implicit mapping \(\varPsi : \mathcal {X} \rightarrow \mathcal {H}\), \(\varTheta = \mathcal {H}\) and \(L(w,\theta ) = \mathbbm {1} \bigl ( \langle \theta , \varPsi (x) \rangle y \le 0 \bigr )\). It turns out that classification algorithms do not need to manipulate \(\mathcal {H}\) itself, but only to compute scalar products of the form \(k(x_1, x_2) = \langle \varPsi (x_1), \varPsi (x_2) \rangle \), defining a symmetric positive kernel k on the original pattern space \(\mathcal {X}\). The converse is also true: any positive symmetric kernel k can be represented as a scalar product in some mapped Hilbert space (this is the Moore–Aronszajn theorem). Often-used kernels on \(\mathbb {R}^d\) are
In the following, we will work in \(\mathbb {R}^d\), which covers only the case when \(\dim \mathcal {H} < \infty \), but extensions are possible.
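The display listing these kernels is missing from this version of the text; two standard examples, presumably of the kind intended here, are the polynomial kernel \(k(x_1, x_2) = \bigl( \langle x_1, x_2 \rangle + c \bigr)^s\) and the Gaussian kernel \(k(x_1, x_2) = \exp \bigl( - ||x_1 - x_2 ||^2 / (2 \sigma ^2) \bigr)\). A minimal sketch building the corresponding Gram matrices:

```python
import numpy as np

def polynomial_kernel(X1, X2, c=1.0, s=3):
    """Polynomial kernel (<x1, x2> + c)^s."""
    return (X1 @ X2.T + c) ** s

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x1 - x2||^2 / (2 sigma^2))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(5, 3))  # hypothetical patterns
K = gaussian_kernel(X, X)
# As guaranteed by the Moore-Aronszajn theorem, the Gram matrix is symmetric
# positive semi-definite:
assert np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() > -1e-10
```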
After [4, 5], let us consider as prior probability measure \(\pi \) the centered Gaussian measure with covariance \(\beta ^{-1} {{\mathrm{\mathbf {Id}}}}\), so that
\[ \frac{\mathrm {d}\pi }{\mathrm {d}\theta } = \biggl( \frac{\beta }{2 \pi } \biggr)^{\! d/2} \exp \biggl( - \frac{\beta \, ||\theta ||^2}{2} \biggr). \]
Let us also consider the function
\[ \varphi (z) = \frac{1}{\sqrt{2 \pi }} \int _z^{\infty } \exp \bigl( - u^2 / 2 \bigr) \, \mathrm {d}u, \qquad z \in \mathbb {R}. \]
Let \(\pi _{\theta }\) be the measure \(\pi \) shifted by \(\theta \), defined by the identity
\[ \int f(\theta ') \, \mathrm {d}\pi _{\theta }(\theta ') = \int f(\theta + \theta ') \, \mathrm {d}\pi (\theta '), \qquad f \in \mathbb {L}_1(\pi _{\theta }). \]
In this case
\[ \mathcal {K}(\pi _{\theta }, \pi ) = \frac{\beta \, ||\theta ||^2}{2}, \]
and
\[ L(w, \pi _{\theta }) = \varphi \biggl( \sqrt{\beta } \, \frac{y \, \langle \theta , x \rangle }{||x ||} \biggr). \]
Thus the randomized loss function has an explicit expression: randomization replaces the indicator function of the negative real line by a smooth approximation. As we are ultimately interested in \(L(w, \theta )\), we will shift things a little bit, considering along with the classification error function L some error with margin
\[ M(w, \theta ) = \mathbbm {1}\bigl( y \, \langle \theta , x \rangle \le ||x ||\bigr). \]
Unlike \(L(w, \theta )\) which is independent of the norm of \(\theta \), the margin error \(M(w, \theta )\) depends on \(||\theta ||\), counting a classification error each time x is at distance less than \(||x ||/ ||\theta ||\) from the boundary \(\{ x' \,:\, \langle \theta , x' \rangle = 0 \}\), so that the error with margin region is the complement of the open cone \(\bigl \{ x \in \mathbb {R}^d \, ; \, y \langle \theta , x \rangle > ||x ||\bigr \}\).
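The Gaussian randomization formulas above are easy to confirm by simulation. The sketch below (with hypothetical data and parameters) compares Monte Carlo estimates of \(L(w, \pi _{\theta })\) and \(M(w, \pi _{\theta })\) with the closed forms in \(\varphi \), using the fact that \(y \langle \theta ', x \rangle \) is Gaussian with mean \(y \langle \theta , x \rangle \) and variance \(\beta ^{-1} ||x ||^2\) when \(\theta ' \sim \pi _{\theta }\).

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def phi(z):
    """Standard Gaussian upper tail probability."""
    return 0.5 * math.erfc(z / math.sqrt(2))

d, beta = 4, 9.0                    # hypothetical dimension and precision
x, y = rng.normal(size=d), 1.0      # hypothetical pattern and label
theta = rng.normal(size=d)

# Draw theta' ~ N(theta, beta^{-1} Id) and average the 0-1 losses.
thetas = theta + rng.normal(size=(200_000, d)) / math.sqrt(beta)
scores = y * (thetas @ x)
L_mc = np.mean(scores <= 0)                    # randomized classification error
M_mc = np.mean(scores <= np.linalg.norm(x))    # randomized margin error

z = math.sqrt(beta) * y * np.dot(theta, x) / np.linalg.norm(x)
print(L_mc, phi(z))                    # closed form phi(sqrt(beta) y<theta,x>/||x||)
print(M_mc, phi(z - math.sqrt(beta)))  # closed form phi(sqrt(beta)(y<theta,x>/||x|| - 1))
```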
Let us compute the randomized margin error
\[ M(w, \pi _{\theta }) = \varphi \biggl[ \sqrt{\beta } \biggl( \frac{y \, \langle \theta , x \rangle }{||x ||} - 1 \biggr) \biggr]. \]
It satisfies the inequality
Applying previous results we obtain
Proposition 20.5
With probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\),
where
the bound B being defined by Eq. (20.4).
We can now minimize this empirical upper bound to define an estimator. Let us consider some estimator \(\widehat{\theta }\) such that
\[ C_1\bigl( \widehat{\theta } \bigr) \le \inf _{\theta \in \mathbb {R}^d} C_1(\theta ) + \zeta . \]
Then for any fixed parameter \({\theta _{\star }}\), \(C_1\bigl( \widehat{\theta } \bigr) \le C_1({\theta _{\star }}) + \zeta \). On the other hand, with probability at least \(1 - \epsilon \)
Indeed
because \(p \mapsto - \varPhi _{- \lambda }(p)\) is convex. As a consequence
Proposition 20.6
With probability at least \(1 - 2 \epsilon \),
It is also possible to state a result in terms of empirical margins. Indeed
Thus with probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\),
where
However, \(C_1\) and \(C_2\) are non-convex criteria, and faster minimization algorithms are available for the usual SVM loss function, for which we are now going to derive some generalization bounds. Indeed, let us choose some positive radius R and let us put \(||x ||_{R} = \max \bigl \{ R, ||x ||\bigr \}\), so that in the case when \(||x ||\le R\), \(||x ||_{R} = R\).
To check that this is true, consider the functions
Let us remark that they are both non-increasing, that f is convex on the interval \(z \in \bigl ] ||x ||, \infty \bigr [\) (because \(\varphi \) is convex on \(\mathbb {R}_+\)), and that \(\sup f = \sup \varphi = 1\). Since \(||x ||_R \ge ||x ||\), for any \(z \in \, ] \! - \infty , ||x ||]\), \( g(z) \ge 1 \ge f(z)\). Moreover, \( g(2 ||x ||_R) = \varphi (\sqrt{\beta }) \ge \varphi \bigl [ \sqrt{\beta } \bigl ( 2 ||x ||^{-1} ||x ||_R - 1 \bigr ) \bigr ] = f(2 ||x ||_R)\). Since on the interval \(\bigl [||x ||, 2 ||x ||_R \bigr ]\) the function g is linear, the function f is convex, and g is not smaller than f at the two ends, this proves that g is not smaller than f on the whole interval. Finally, on the interval \(z \in \bigl [ 2 ||x ||_R, + \infty \bigr [\), the function g is constant and the function f is decreasing, so that on this interval also g is not smaller than f, and this ends the proof of (20.6), since the three intervals on which \(g \ge f\) cover the whole real line.
Using the upper bounds (20.6) and (20.5), and Proposition 20.3, we obtain
Proposition 20.7
With probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\),
where
Let us assume now that the patterns x are in a ball, so that \(||x ||\le R\) almost surely. In this case \(||x ||_R = R\) almost surely. Let us remark that \(L(\mathbb {P}, \theta ) = L(\mathbb {P}, 2 R \, \theta )\), and let us make the previous result uniform in \(\beta \in \Xi \). This leads to
Proposition 20.8
Let us assume that \(||x ||\le R\) almost surely. With probability at least \(1 - \epsilon \), for all \(\theta \in \mathbb {R}^d\),
where
and
The loss function \(C_4(\lambda , \theta )\) is the most-employed learning criterion for support vector machines, and is called the box constraint. It is convex in \(\theta \). There are fast algorithms to compute \(\inf _{\theta } C_4(\lambda , \theta )\) for any fixed values of \(\lambda \) and \(\beta \). Here we get an empirical criterion which could also be used to optimize the values of \(\lambda \) and \(\beta \), that is to optimize the strength of the regularizing factor \(\displaystyle \frac{\beta R^2 ||\theta ||^2}{n \lambda }\).
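The display defining \(C_4\) is missing from this version; assuming, as the regularizing factor written above indicates, a criterion of the form \(\frac{1}{n} \sum _{i=1}^n \bigl( 1 - y_i \langle \theta , x_i \rangle \bigr)_+ + \frac{\beta R^2 ||\theta ||^2}{n \lambda }\) (the exact scaling of the hinge term may differ), here is a minimal subgradient-descent sketch of its minimization in \(\theta \); in practice dedicated SVM solvers are used instead.

```python
import numpy as np

def c4(theta, X, y, beta, lam, R):
    """Hinge loss plus ridge term -- a plausible reading of the criterion C_4;
    the exact constants of the missing display may differ."""
    n = len(y)
    hinge = np.maximum(0.0, 1.0 - y * (X @ theta)).mean()
    return hinge + beta * R**2 * np.dot(theta, theta) / (n * lam)

def minimize_c4(X, y, beta=1.0, lam=1.0, steps=2000, lr=0.5):
    """Subgradient descent on the convex criterion above."""
    n, d = X.shape
    R = np.linalg.norm(X, axis=1).max()
    theta = np.zeros(d)
    for t in range(1, steps + 1):
        active = y * (X @ theta) < 1                       # margin violators
        g = -(X[active] * y[active, None]).sum(axis=0) / n # hinge subgradient
        g = g + 2 * beta * R**2 * theta / (n * lam)        # ridge gradient
        theta -= (lr / np.sqrt(t)) * g
    return theta, c4(theta, X, y, beta, lam, R)

# Toy linearly separable data (hypothetical).
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=200)
X = y[:, None] * np.array([2.0, 2.0]) + rng.normal(size=(200, 2))
theta, value = minimize_c4(X, y)
print(value)
```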
Here \(||\theta ||^{-1}\) can be interpreted as the margin width, that is the minimal distance of x from the separating hyperplane \( \{ x' : \langle \theta , x' \rangle = 0 \}\) beyond which the error term \(\bigl ( 1 - y \langle \theta , x \rangle \bigr )_+\) vanishes (for data x that are on the right side of the separating hyperplane). The speed of convergence depends on \(R^2 ||\theta ||^2 / n\). For this reason, \(R^2 ||\theta ||^2\), the square of the ratio between the radius of the ball containing the data and the margin, plays the role of the dimension. The bound does not depend on d, showing that with separating hyperplanes and more generally support vector machines, we can get low error rates while choosing to represent the data in a reproducing kernel Hilbert space with a large, or even infinite, dimension.
So far, we have considered only linear hyperplanes and data centered around 0. Nevertheless, this also covers affine hyperplanes and data contained in a not necessarily centered ball, through a change of coordinates. More precisely, the previous proposition has the following corollary:
Corollary 20.1
Assume that almost surely \(||x - c ||\le R\), for some \(c \in \mathbb {R}^d\) and \(R \in \mathbb {R}_+\). With probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\), any \(\gamma \in \mathbb {R}\) such that \(\displaystyle \min _{i=1, \dots , n} \langle \theta , x_i \rangle \le \gamma \le \max _{i=1, \dots , n} \langle \theta , x_i \rangle \),
where
Proof
Let us apply the previous result to \(x' = (x-c, R)\), and \(\theta ' = \bigl [ \theta , R^{-1} \bigl ( \langle \theta , c \rangle - \gamma \bigr ) \bigr ]\). We get that \(||x' ||^2 \le 2 R^2\) and \(||\theta ' ||^2 \le 2 ||\theta ||^2\), because almost surely \( -||\theta ||R \le {{\mathrm{ess}}}\inf \langle \theta , x - c \rangle \le \gamma - \langle \theta , c \rangle \le {{\mathrm{ess}}}\sup \langle \theta , x - c \rangle \le ||\theta ||R\), so that almost surely, for the allowed values of \(\gamma \), \(\bigl ( \langle \theta , c \rangle - \gamma \bigr )^2 \le R^2 ||\theta ||^2\). This proves that \(C_4(\beta , \lambda , \theta ') \le C_5(\beta , \lambda , \theta , \gamma )\), as required to deduce the corollary from the previous proposition. \(\square \)
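The change of coordinates used in this proof can be checked numerically; in the following sketch, \(c\), \(\theta \), \(x\) and \(\gamma \) are hypothetical values satisfying the hypotheses of the corollary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, R = 3, 2.0
c, theta = rng.normal(size=d), rng.normal(size=d)
u = rng.normal(size=d)
x = c + 0.8 * R * u / np.linalg.norm(u)   # a point with ||x - c|| <= R
gamma = float(np.dot(theta, x))           # an allowed threshold value

x_p = np.append(x - c, R)                                   # x' = (x - c, R)
theta_p = np.append(theta, (np.dot(theta, c) - gamma) / R)  # theta'

# The affine decision function becomes linear in the lifted coordinates:
assert np.isclose(np.dot(theta_p, x_p), np.dot(theta, x) - gamma)
# And the norms satisfy the bounds used in the proof:
assert np.dot(x_p, x_p) <= 2 * R**2 + 1e-12
assert np.dot(theta_p, theta_p) <= 2 * np.dot(theta, theta) + 1e-12
```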
Notes
- 1.
We will assume that \(\rho \) is a regular conditional probability kernel, meaning that for any measurable set A the map \((w_1, \dots , w_n) \mapsto \rho (w_1, \dots , w_n)(A)\) is assumed to be measurable. We will also assume that the \(\sigma \)-algebra we consider on \(\varTheta \) is generated by a countable family of subsets. See [1] (p. 50) for more details.
References
Catoni, O.: Statistical Learning Theory and Stochastic Optimization, Lectures on Probability Theory and Statistics, École d’Été de Probabilités de Saint-Flour XXXI—2001. Lecture Notes in Mathematics, vol. 1851. Springer, Berlin (2004)
Catoni, O.: PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. IMS Lecture Notes Monograph Series, vol. 56. Institute of Mathematical Statistics, Beachwood (2007)
Germain, P., Lacasse, A., Laviolette, F., Marchand, M.: PAC-Bayesian learning of linear classifiers. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML’09, pp. 353–360. ACM, New York (2009)
Langford, J., Shawe-Taylor, J.: PAC-Bayes & margins. In: Advances in Neural Information Processing Systems, pp. 423–430. MIT Press, Cambridge (2002)
McAllester, D.: Simplified PAC-Bayesian margin bounds. In: Schölkopf, B., Warmuth, M.K. (eds.) Learning Theory and Kernel Machines, COLT 2003. Lecture Notes in Artificial Intelligence, vol. 2777, pp. 203–215. Springer, Berlin (2003)
McAllester, D.A.: PAC-Bayesian model averaging. In: Proceedings of the 12th Annual Conference on Computational Learning Theory, pp. 164–170. ACM, New York (1999)
McAllester, D.A.: PAC-Bayesian stochastic model selection. Mach. Learn. 51(1), 5–21 (2003)
Seeger, M.: PAC-Bayesian generalization error bounds for Gaussian process classification. Informatics report series EDI-INF-RR-0094, Division of Informatics, University of Edinburgh. http://www.inf.ed.ac.uk/publications/online/0094.pdf (2002)
Vapnik, V.: Estimation of Dependences Based on Empirical Data. Springer, Berlin (1982)
Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16(2), 264–280 (1971) (This volume, Chap. 3)