Abstract
We present in this contribution a synthesis of Seeger’s (PAC-Bayesian generalization error bounds for Gaussian process classification, 2002) and our own (Catoni, PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, 2007) approaches to PAC-Bayes inequalities for 0–1 loss functions. We apply it to supervised classification, and more specifically to the proof of new margin bounds for support vector machines, in the spirit of the bounds established by Langford and Shawe-Taylor (Advances in Neural Information Processing Systems, 2002) and McAllester (Learning Theory and Kernel Machines, COLT 2003).
1 PAC-Bayes Bounds for 0–1 Loss Functions
In this section, we are given some i.i.d. sample \((W_i)_{i=1}^n \in \mathcal {W}^n\), where \(\mathcal {W}\) is a measurable space, and some binary measurable loss function \(L : \mathcal {W} \times \varTheta \rightarrow \{0,1\}\), where \(\varTheta \) is a measurable parameter space. Our aim is to minimize with respect to \(\theta \in \varTheta \) the expected loss
\[ L(\mathbb {P}, \theta ) \overset{\text {def}}{=} \int L(w, \theta ) \, \mathrm {d}\mathbb {P}(w), \]
where \(\mathbb {P}\) is the marginal distribution of the observed sample \((W_i)_{i=1}^n\). More precisely, assuming that \(\mathbb {P}\) is unknown, we would like to find an estimator \(\widehat{\theta }(W_{1:\, n})\) depending on the observed sample \(W_{1:\, n} \overset{\text {def}}{=} (W_i)_{i=1}^n\) such that the excess risk
\[ L\bigl( \mathbb {P}, \widehat{\theta } \bigr) - \inf _{\theta \in \varTheta } L(\mathbb {P}, \theta ) \]
is small. The previous quantity is random, since \(\widehat{\theta }\) depends on the random sample \(W_{1:\, n}\). Therefore its size can be understood in different ways. Here we will focus on the deviations of the excess risk. Accordingly, we will look for estimators providing a small risk with a probability close to one.
A typical example of such a problem is provided by supervised classification. In this setting \(\mathcal {W} = \mathcal {X} \times \mathcal {Y}\), where \(\mathcal {Y}\) is a finite set, \(W_i = (X_i, Y_i)\), where \((X_i,Y_i)\) are input-output pairs, a family of measurable classification rules \(\bigl \{ f_{\theta } : \mathcal {X} \rightarrow \mathcal {Y}; \theta \in \varTheta \bigr \}\) is considered, and the loss function \(L(w,\theta )\) is defined as the classification error
\[ L(w, \theta ) = \mathbbm {1}\bigl( f_{\theta }(x) \ne y \bigr), \qquad w = (x, y). \]
Accordingly the aim is to minimize the expected classification error
\[ L(\mathbb {P}, \theta ) = \mathbb {P}\bigl( f_{\theta }(X) \ne Y \bigr), \]
given a sample \((X_i, Y_i)_{i=1}^n\) of observations.
The point of view presented here is a synthesis of the approaches of [2, 8].
1.1 Deviation Bounds for Sums of Bernoulli Random Variables
Given some parameter \(\lambda \in \mathbb {R}\), let us consider the (normalized) \(\log \)-Laplace transform of the Bernoulli distribution:
\[ \varPhi _{\lambda }(p) = - \frac{1}{\lambda } \log \Bigl[ 1 - \bigl( 1 - \exp (-\lambda ) \bigr) \, p \Bigr], \qquad p \in [0,1]. \]
Let us also consider the Kullback–Leibler divergence of two Bernoulli distributions
\[ K(q, p) = q \log \biggl( \frac{q}{p} \biggr) + (1 - q) \log \biggl( \frac{1 - q}{1 - p} \biggr), \qquad q, p \in [0, 1]. \]
In the sequel \(\overline{\mathbb {P}}\) will be the empirical measure
\[ \overline{\mathbb {P}} = \frac{1}{n} \sum _{i=1}^{n} \delta _{W_i} \]
of an i.i.d. sample \((W_i)_{i=1}^n\) drawn from \(\mathbb {P}^{\otimes n} \in \mathcal {M}_+^1(\mathcal {W}^n)\) (the set of probability measures on \(\mathcal {W}^n\)). We will use a short notation for integrals, putting for any \(\rho , \pi \in \mathcal {M}_+^1(\varTheta )\) and any integrable function \(f \in \mathbb {L}_1 \bigl ( \mathcal {W} \times \varTheta ^2, \mathbb {P} \otimes \pi \otimes \rho \bigr )\)
\[ f(\mathbb {P}, \pi , \rho ) = \int f(w, \theta , \theta ') \, \mathrm {d}\mathbb {P}(w) \, \mathrm {d}\pi (\theta ) \, \mathrm {d}\rho (\theta '), \]
so that for instance \(\displaystyle L(\mathbb {P},\rho ) = \int L(w,\theta ) \, \mathrm {d}\mathbb {P}(w) \mathrm {d}\rho (\theta )\).
Let us first recall Chernoff’s bound.
Proposition 20.1
For any fixed value of the parameter \(\theta \in \varTheta \), the identity
\[ \int \exp \Bigl[ n \lambda \Bigl( \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \theta ) \bigr] - L\bigl( \overline{\mathbb {P}}, \theta \bigr) \Bigr) \Bigr] \, \mathrm {d}\mathbb {P}^{\otimes n} = 1, \qquad \lambda > 0, \]
shows that with probability at least \(1 - \epsilon \),
\[ L(\mathbb {P}, \theta ) \le B_+\biggl( L\bigl( \overline{\mathbb {P}}, \theta \bigr), \frac{\log (\epsilon ^{-1})}{n} \biggr), \]
where
\[ B_+(q, \delta ) \overset{\text {def}}{=} \sup \bigl\{ p \in [0, 1] \, : \, K(q, p) \le \delta \bigr\}. \]
Moreover
In the same way, the identity
shows that with probability at least \(1 - \epsilon \)
where
and
Before proving this proposition, let us mention some important identities.
Proposition 20.2
For any probability measures \(\pi \) and \(\rho \) defined on the same measurable space such that \(\mathcal {K}(\rho , \pi ) < \infty \), and any bounded measurable function h, let us define the transformed probability measure \(\pi _{\exp (h)} \ll \pi \) by its density
\[ \frac{\mathrm {d}\pi _{\exp (h)}}{\mathrm {d}\pi } = \frac{\exp (h)}{Z}, \]
where \(Z = \int \exp (h) \, \mathrm {d}\pi \). Moreover, let us introduce the notation
\[ \operatorname {Var}_{\pi }(h) \overset{\text {def}}{=} \int \biggl( h - \int h \, \mathrm {d}\pi \biggr)^{\! 2} \mathrm {d}\pi . \]
The expectations with respect to \(\rho \) and \(\pi \) of h and the \(\log \)-Laplace transform of h are linked by the identities
\[ \log \int \exp (h) \, \mathrm {d}\pi = \int h \, \mathrm {d}\rho - \mathcal {K}(\rho , \pi ) + \mathcal {K}\bigl( \rho , \pi _{\exp (h)} \bigr), \tag{20.1} \]
\[ \log \int \exp (h) \, \mathrm {d}\pi = \int h \, \mathrm {d}\pi + \int _0^1 (1 - \alpha ) \operatorname {Var}_{\pi _{\exp (\alpha h)}}(h) \, \mathrm {d}\alpha . \tag{20.2} \]
Proof
The first identity is a straightforward consequence of the definitions of \(\pi _{\exp (h)}\) and of the Kullback–Leibler divergence function. The second one is the Taylor expansion of order one with integral remainder of the function
\[ f(\alpha ) = \log \int \exp (\alpha h) \, \mathrm {d}\pi , \qquad \alpha \in [0, 1], \]
which says that \(f(1) = f(0) + f'(0) + {\textstyle \int }_0^1(1 - \alpha ) f''(\alpha ) \, \mathrm {d}\alpha \). \(\square \)
Exercise 20.1
Prove that \(f \in \mathcal {C}^{\infty }\). Hint: write
\[ \int h^k \exp (\alpha ' h) \, \mathrm {d}\pi - \int h^k \exp (\alpha h) \, \mathrm {d}\pi = \int _{\alpha }^{\alpha '} \! \int h^{k+1} \exp (\beta h) \, \mathrm {d}\pi \, \mathrm {d}\beta , \]
use Fubini’s theorem to show that \(\alpha \mapsto \int h^k \exp (\alpha h) \, \mathrm {d}\pi \) belongs to \(\mathcal {C}^1\) and compute its derivative. \(\square \)
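As a sanity check, identity (20.1) can be verified numerically on a finite probability space. The following sketch (an arbitrary five-point space, arbitrary measures \(\pi , \rho \) and an arbitrary bounded function h, all hypothetical test data) only illustrates the identity; it is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite space of 5 atoms: pi and rho are arbitrary probability vectors,
# h an arbitrary bounded function (hypothetical test data).
pi = rng.dirichlet(np.ones(5))
rho = rng.dirichlet(np.ones(5))
h = rng.normal(size=5)

def kl(a, b):
    """Kullback-Leibler divergence K(a, b) of two probability vectors."""
    return float(np.sum(a * np.log(a / b)))

Z = float(np.sum(pi * np.exp(h)))   # normalizing constant of pi_exp(h)
pi_h = pi * np.exp(h) / Z           # the transformed measure pi_exp(h)

lhs = np.log(Z)
rhs = float(np.sum(rho * h)) - kl(rho, pi) + kl(rho, pi_h)
assert abs(lhs - rhs) < 1e-12       # identity (20.1) holds exactly
```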
Let us now come to the proof of Proposition 20.1. Chernoff’s inequality reads
\[ \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \theta ) \bigr] \le L\bigl( \overline{\mathbb {P}}, \theta \bigr) + \frac{\log (\epsilon ^{-1})}{n \lambda }, \]
where the inequality holds with probability at least \(1 - \epsilon \). Since the left-hand side is non-random, it can be optimized in \(\lambda \), giving
\[ \sup _{\lambda \in \mathbb {R}_+} \biggl( \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \theta ) \bigr] - \frac{\log (\epsilon ^{-1})}{n \lambda } \biggr) \le L\bigl( \overline{\mathbb {P}}, \theta \bigr). \]
Exercise 20.2
Prove this statement in more detail. For any integer \(k > 1\), consider the event
\[ A_k = \Bigl\{ \sup _{\lambda \in [k^{-1}, \, k]} F(\lambda ) > L\bigl( \overline{\mathbb {P}}, \theta \bigr) \Bigr\}, \]
where \(\displaystyle F(\lambda ) = \varPhi _{\lambda } \bigl [ L ( \mathbb {P}, \theta ) \bigr ] - \frac{\log (\epsilon ^{-1})}{n \lambda }\). Show that \(\mathbb {P}^{\otimes n} (A_k) \le \epsilon \) by choosing some suitable value of \(\lambda \). Remark that \(A_k \subset A_{k+1}\) and conclude that \(\mathbb {P}^{\otimes n} \bigl ( \cup _k A_k \bigr ) \le \epsilon \). \(\square \)
Since
\(B_+(q, \delta ) \le 1\).
Applying Eq. (20.1) to Bernoulli distributions gives
\[ \lambda \, \varPhi _{\lambda }(p) = \lambda q + K(q, p) - K(q, p_{\lambda }), \]
where
\[ p_{\lambda } = \frac{p \exp (-\lambda )}{1 - p \bigl( 1 - \exp (-\lambda ) \bigr)}. \]
This shows that
\[ \sup _{\lambda \in \mathbb {R}_+} \lambda \Bigl( \varPhi _{\lambda }(p) - q \Bigr) = K(q, p), \qquad 0 \le q \le p < 1, \]
because when \(q \le p < 1\) then \(\displaystyle \lambda = \log \biggl ( \frac{q^{-1} - 1}{p^{-1} - 1} \biggr ) \in \mathbb {R}_+\), \(q = p_{\lambda }\) and therefore \(K(q,p_{\lambda }) = 0\).
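This tilting property is easy to check numerically; in the following one-liner the values of p and q are hypothetical, and \(p_{\lambda }\) is taken in the form reconstructed above.

```python
import math

p, q = 0.3, 0.1  # any 0 < q <= p < 1 (hypothetical values)
lam = math.log((1 / q - 1) / (1 / p - 1))                    # lambda as in the text
p_lam = p * math.exp(-lam) / (1 - p * (1 - math.exp(-lam)))  # tilted parameter
assert lam >= 0 and abs(p_lam - q) < 1e-12                   # indeed q = p_lambda
```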
Let us remark now that \(\displaystyle \frac{\partial ^2}{\partial x^2} K(x, p) = x^{-1}(1-x)^{-1}\). Thus if \(p \ge q \ge 1/2\), then
so that if \(K(q,p) \le \delta \), then
Now if \(q \le 1/2\) and \(p \ge q\) then
so that if \(K(q,p) \le \delta \), then
implying that
On the other hand,
thus if \(K(q,p) = \delta \) with \(p > q\), then
implying that
Exercise 20.3
The second part of Proposition 20.1 is proved in the same way and left as an exercise. \(\square \)
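In practice, the bound of Proposition 20.1 is computed by inverting the divergence K numerically. Assuming, consistently with the reconstruction above, that \(B_+(q, \delta )\) is the largest \(p \ge q\) such that \(K(q, p) \le \delta \), bisection applies because \(p \mapsto K(q, p)\) is non-decreasing on [q, 1]. A minimal sketch:

```python
import math

def kl_bernoulli(q, p):
    """K(q, p) for Bernoulli distributions, with the convention 0 log 0 = 0."""
    p = min(max(p, 1e-15), 1 - 1e-15)
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out

def b_plus(q, delta, tol=1e-10):
    """Largest p >= q with K(q, p) <= delta (K(q, .) is non-decreasing on [q, 1])."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q, mid) <= delta:
            lo = mid
        else:
            hi = mid
    return lo

# Example: empirical error 0.1 on n = 1000 samples, confidence level 1 - 0.05.
n, eps = 1000, 0.05
print(b_plus(0.1, math.log(1 / eps) / n))  # ~0.125, the Chernoff upper bound
```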
1.2 PAC-Bayes Bounds
We are now going to make Proposition 20.1 uniform with respect to \(\theta \). The PAC-Bayes approach to this [3, 5–7] is to randomize \(\theta \), so we will now consider joint distributions on \((W_{1:\, n}, \theta )\), where the distribution of \(W_{1:\, n}\) is still \(\mathbb {P}^{\otimes n}\) and the conditional distribution of \(\theta \) given the sample is given by some transition probability kernel \(\rho : \mathcal {W}^n \rightarrow \mathcal {M}_+^1(\varTheta )\), called in this context a posterior distribution.\(^{1}\) This posterior distribution \(\rho \) will be compared with a prior (meaning non-random) probability measure \(\pi \in \mathcal {M}_+^1(\varTheta )\).
Proposition 20.3
Let us introduce the notation
For any prior probability measure \(\pi \in \mathcal {M}_+^1(\varTheta )\) and any \(\lambda \in \mathbb {R}_+\),
\[ \int \exp \biggl[ \sup _{\rho \in \mathcal {M}_+^1(\varTheta )} n \lambda \Bigl( \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \rho ) \bigr] - L\bigl( \overline{\mathbb {P}}, \rho \bigr) \Bigr) - \mathcal {K}(\rho , \pi ) \biggr] \, \mathrm {d}\mathbb {P}^{\otimes n} \le 1, \tag{20.3} \]
and therefore for any finite set \(\varLambda \subset \mathbb {R}_+\), with probability at least \(1 - \epsilon \), for any \(\rho \in \mathcal {M}_+^1(\varTheta )\),
\[ \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \rho ) \bigr] \le L\bigl( \overline{\mathbb {P}}, \rho \bigr) + \frac{\mathcal {K}(\rho , \pi ) + \log \bigl( |\varLambda | \, \epsilon ^{-1} \bigr)}{n \lambda }, \qquad \lambda \in \varLambda . \]
Proof
The exponential moment inequality (20.3) is a consequence of Eq. (20.1), showing that
\[ \sup _{\rho \in \mathcal {M}_+^1(\varTheta )} \biggl( \int h \, \mathrm {d}\rho - \mathcal {K}(\rho , \pi ) \biggr) = \log \int \exp (h) \, \mathrm {d}\pi , \]
and of the fact that \(\varPhi _{\lambda }\) is convex, showing that
\[ \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \rho ) \bigr] \le \int \varPhi _{\lambda } \bigl[ L(\mathbb {P}, \theta ) \bigr] \, \mathrm {d}\rho (\theta ). \]
The deviation inequality follows as usual. \(\square \)
We cannot take the infimum on \(\lambda \in \mathbb {R}_+\) as in Proposition 20.1, because we can no longer cast our deviation inequality in such a way that \(\lambda \) appears on some non-random side of the inequality. Nevertheless, we can get a more explicit bound from some specific choice of the set \(\varLambda \).
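Taking the deviation inequality in the reconstructed form above, the bound is evaluated in practice by inverting \(\varPhi _{\lambda }\) (which is explicit: \(\varPhi _{\lambda }^{-1}(q) = (1 - e^{-\lambda q})/(1 - e^{-\lambda })\)) and minimizing over the finite grid \(\varLambda \). The numbers below (empirical randomized loss, Kullback–Leibler term, sample size) are hypothetical, and the exact constants depend on the statement of Proposition 20.3; this is only a sketch of the \(\lambda \)-grid trade-off.

```python
import math

def phi_inv(lam, q):
    """Inverse of Phi_lambda: the p in [0, 1] such that Phi_lambda(p) = q."""
    return min(1.0, (1 - math.exp(-lam * q)) / (1 - math.exp(-lam)))

# Hypothetical numbers: empirical randomized loss, KL term, sample size, confidence.
q_bar, kl_term, n, eps = 0.10, 5.0, 1000, 0.05

grid = [2.0 ** (-k) for k in range(10)]  # a geometric grid of tempering parameters

bound = min(
    phi_inv(lam, q_bar + (kl_term + math.log(len(grid) / eps)) / (n * lam))
    for lam in grid
)
print(bound)  # roughly 0.15 with these hypothetical numbers
```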
Proposition 20.4
Let us define the least increasing upper bound of the variance of a Bernoulli distribution of parameter \(p \in [0,1]\) as
\[ \overline{v}(p) = \sup _{p' \in [0, p]} p' (1 - p') = {\left\{ \begin{array}{ll} p (1 - p), &{} p \le 1/2, \\ 1/4, &{} p \ge 1/2. \end{array}\right. } \]
Let us choose some positive integer parameter m and let us put
With probability at least \(1 - \epsilon \), for any \(\rho \in \mathcal {M}_+^1(\varTheta )\),
where
Moreover, as soon as \(n \ge 5\),
so that with probability at least \(1 - \epsilon \), for any \(\rho \in \mathcal {M}_+^1(\varTheta )\),
Proof
Let us put
According to Eq. (20.2) applied to Bernoulli distributions, for any \(\lambda \in \varLambda \),
Moreover, as \(p_{\alpha } \le p\),
As \(\overline{v}(p) \le 1/4\) and \(\displaystyle \delta \ge \frac{\log \bigl [ (m+1)/\epsilon \bigr ]}{n}\),
Therefore either \(\lambda _{\min } \le \widehat{\lambda } \le 1\), or \(\widehat{\lambda } > 1\). Let us consider these two cases separately.
If \(\lambda _{\min } = \min \varLambda \le \widehat{\lambda } \le \max \varLambda = 1\), then \(\log \bigl ( \widehat{\lambda } \bigr ) \) is at distance at most t / m from some \(\log \bigl ( \lambda \bigr )\) where \(\lambda \in \varLambda \), because \(\log (\varLambda )\) is a grid with constant steps of size 2t / m. Thus
If moreover \(q \le 1/2\), then \(\overline{v}(p) \le p(1-q)\), so that we obtain a quadratic inequality in p, whose solution is bounded by
If on the contrary \(q \ge 1/2\), then \(\overline{v}(p) = \overline{v}(q) = 1/4\) and
so that in both cases
Let us now consider the case when \(\widehat{\lambda } > 1\). In this case \(\overline{v}(p) < 2 \delta \), so that
In conclusion, applying Proposition 20.3 we see that with probability at least \(1 - \epsilon \), for any posterior distribution \(\rho \),
which is precisely the statement to be proved.
In the special case when \(m = \lfloor \log (n)^2 \rfloor - 1 \ge \log (n)^2 - 2\),
as soon as the last inequality holds, that is as soon as \(n \ge \exp (\sqrt{2}) \simeq 4.11\) to make \(\log (n)^2 - 2\) positive and
which holds true for any \(n \ge 5\), as can be checked numerically. \(\square \)
2 Linear Classification and Support Vector Machines
In this section we are going to consider more specifically the case of linear binary classification. In this setting \(\mathcal {W} = \mathcal {X} \times \mathcal {Y} = \mathbb {R}^d \times \{-1, + 1\}\), \(w = (x,y)\), where \(x \in \mathbb {R}^d\) and \(y \in \{-1, +1\}\), \(\varTheta = \mathbb {R}^d\), and
\[ L(w, \theta ) = \mathbbm {1}\bigl( y \, \langle \theta , x \rangle \le 0 \bigr). \]
We will follow the approach presented in [4, 5].
Although we will stick in this presentation to the case when \(\mathcal {X}\) is a vector space of finite dimension, the results also apply to support vector machines [9–11], where the pattern space is some arbitrary space mapped to a Hilbert space \(\mathcal {H}\) by some implicit mapping \(\varPsi : \mathcal {X} \rightarrow \mathcal {H}\), \(\varTheta = \mathcal {H}\) and \(L(w,\theta ) = \mathbbm {1} \bigl ( \langle \theta , \varPsi (x) \rangle y \le 0 \bigr )\). It turns out that classification algorithms do not need to manipulate \(\mathcal {H}\) itself, but only to compute scalar products of the form \(k(x_1, x_2) = \langle \varPsi (x_1), \varPsi (x_2) \rangle \), defining a symmetric positive kernel k on the original pattern space \(\mathcal {X}\). The converse is also true: any positive symmetric kernel k can be represented as a scalar product in some mapped Hilbert space (this is the Moore–Aronszajn theorem). Often-used kernels on \(\mathbb {R}^d\) are
In the following, we will work in \(\mathbb {R}^d\), which covers only the case when \(\dim \mathcal {H} < \infty \), but extensions are possible.
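The display listing these kernels is missing from this version of the text; two standard examples, presumably of the kind intended here, are the polynomial kernel \(k(x_1, x_2) = \bigl( \langle x_1, x_2 \rangle + c \bigr)^s\) and the Gaussian kernel \(k(x_1, x_2) = \exp \bigl( - ||x_1 - x_2 ||^2 / (2 \sigma ^2) \bigr)\). A minimal sketch building the corresponding Gram matrices:

```python
import numpy as np

def polynomial_kernel(X1, X2, c=1.0, s=3):
    """Polynomial kernel (<x1, x2> + c)^s."""
    return (X1 @ X2.T + c) ** s

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x1 - x2||^2 / (2 sigma^2))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(5, 3))  # hypothetical patterns
K = gaussian_kernel(X, X)
# As guaranteed by the Moore-Aronszajn theorem, the Gram matrix is symmetric
# positive semi-definite:
assert np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() > -1e-10
```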
After [4, 5], let us consider as prior probability measure \(\pi \) the centered Gaussian measure with covariance \(\beta ^{-1} {{\mathrm{\mathbf {Id}}}}\), so that
\[ \frac{\mathrm {d}\pi }{\mathrm {d}\theta } = \biggl( \frac{\beta }{2 \pi } \biggr)^{\! d/2} \exp \biggl( - \frac{\beta \, ||\theta ||^2}{2} \biggr). \]
Let us also consider the function
\[ \varphi (z) = \frac{1}{\sqrt{2 \pi }} \int _z^{\infty } \exp \bigl( - u^2 / 2 \bigr) \, \mathrm {d}u, \qquad z \in \mathbb {R}. \]
Let \(\pi _{\theta }\) be the measure \(\pi \) shifted by \(\theta \), defined by the identity
\[ \int f(\theta ') \, \mathrm {d}\pi _{\theta }(\theta ') = \int f(\theta + \theta ') \, \mathrm {d}\pi (\theta '), \qquad f \in \mathbb {L}_1(\pi _{\theta }). \]
In this case
\[ \mathcal {K}(\pi _{\theta }, \pi ) = \frac{\beta \, ||\theta ||^2}{2}, \]
and
\[ L(w, \pi _{\theta }) = \varphi \biggl( \sqrt{\beta } \, \frac{y \, \langle \theta , x \rangle }{||x ||} \biggr). \]
Thus the randomized loss function has an explicit expression: randomization replaces the indicator function of the negative real line by a smooth approximation. As we are ultimately interested in \(L(w, \theta )\), we will shift things a little bit, considering along with the classification error function L some error with margin
\[ M(w, \theta ) = \mathbbm {1}\bigl( y \, \langle \theta , x \rangle \le ||x ||\bigr). \]
Unlike \(L(w, \theta )\) which is independent of the norm of \(\theta \), the margin error \(M(w, \theta )\) depends on \(||\theta ||\), counting a classification error each time x is at distance less than \(||x ||/ ||\theta ||\) from the boundary \(\{ x' \,:\, \langle \theta , x' \rangle = 0 \}\), so that the error with margin region is the complement of the open cone \(\bigl \{ x \in \mathbb {R}^d \, ; \, y \langle \theta , x \rangle > ||x ||\bigr \}\).
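The Gaussian randomization formulas above are easy to confirm by simulation. The sketch below (with hypothetical data and parameters) compares Monte Carlo estimates of \(L(w, \pi _{\theta })\) and \(M(w, \pi _{\theta })\) with the closed forms in \(\varphi \), using the fact that \(y \langle \theta ', x \rangle \) is Gaussian with mean \(y \langle \theta , x \rangle \) and variance \(\beta ^{-1} ||x ||^2\) when \(\theta ' \sim \pi _{\theta }\).

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def phi(z):
    """Standard Gaussian upper tail probability."""
    return 0.5 * math.erfc(z / math.sqrt(2))

d, beta = 4, 9.0                    # hypothetical dimension and precision
x, y = rng.normal(size=d), 1.0      # hypothetical pattern and label
theta = rng.normal(size=d)

# Draw theta' ~ N(theta, beta^{-1} Id) and average the 0-1 losses.
thetas = theta + rng.normal(size=(200_000, d)) / math.sqrt(beta)
scores = y * (thetas @ x)
L_mc = np.mean(scores <= 0)                    # randomized classification error
M_mc = np.mean(scores <= np.linalg.norm(x))    # randomized margin error

z = math.sqrt(beta) * y * np.dot(theta, x) / np.linalg.norm(x)
print(L_mc, phi(z))                    # closed form phi(sqrt(beta) y<theta,x>/||x||)
print(M_mc, phi(z - math.sqrt(beta)))  # closed form phi(sqrt(beta)(y<theta,x>/||x|| - 1))
```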
Let us compute the randomized margin error
\[ M(w, \pi _{\theta }) = \varphi \biggl[ \sqrt{\beta } \biggl( \frac{y \, \langle \theta , x \rangle }{||x ||} - 1 \biggr) \biggr]. \]
It satisfies the inequality
Applying previous results we obtain
Proposition 20.5
With probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\),
where
the bound B being defined by Eq. (20.4).
We can now minimize this empirical upper bound to define an estimator. Let us consider some estimator \(\widehat{\theta }\) such that
\[ C_1\bigl( \widehat{\theta } \bigr) \le \inf _{\theta \in \mathbb {R}^d} C_1(\theta ) + \zeta . \]
Then for any fixed parameter \({\theta _{\star }}\), \(C_1\bigl( \widehat{\theta } \bigr) \le C_1({\theta _{\star }}) + \zeta \). On the other hand, with probability at least \(1 - \epsilon \)
Indeed
because \(p \mapsto - \varPhi _{- \lambda }(p)\) is convex. As a consequence
Proposition 20.6
With probability at least \(1 - 2 \epsilon \),
It is also possible to state a result in terms of empirical margins. Indeed
Thus with probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\),
where
However, \(C_1\) and \(C_2\) are non-convex criteria, and faster minimization algorithms are available for the usual SVM loss function, for which we are now going to derive some generalization bounds. Indeed, let us choose some positive radius R and let us put \(||x ||_{R} = \max \bigl \{ R, ||x ||\bigr \}\), so that in the case when \(||x ||\le R\), \(||x ||_{R} = R\).
To check that this is true, consider the functions
Let us remark that they are both non-increasing, that f is convex on the interval \(z \in \bigl ] ||x ||, \infty \bigr [\) (because \(\varphi \) is convex on \(\mathbb {R}_+\)), and that \(\sup f = \sup \varphi = 1\). Since \(||x ||_R \ge ||x ||\), for any \(z \in \, ] \! - \infty , ||x ||]\), \( g(z) \ge 1 \ge f(z)\). Moreover, \( g(2 ||x ||_R) = \varphi (\sqrt{\beta }) \ge \varphi \bigl [ \sqrt{\beta } \bigl ( 2 ||x ||^{-1} ||x ||_R - 1 \bigr ) \bigr ] = f(2 ||x ||_R)\). Since on the interval \(\bigl [||x ||, 2 ||x ||_R \bigr ]\) the function g is linear, the function f is convex, and g is not smaller than f at the two ends, this proves that g is not smaller than f on the whole interval. Finally, on the interval \(z \in \bigl [ 2 ||x ||_R, + \infty \bigr [\), the function g is constant and the function f is decreasing, so that on this interval also g is not smaller than f, and this ends the proof of (20.6), since the three intervals on which \(g \ge f\) cover the whole real line.
Using the upper bounds (20.6) and (20.5), and Proposition 20.3, we obtain
Proposition 20.7
With probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\),
where
Let us assume now that the patterns x are in a ball, so that \(||x ||\le R\) almost surely. In this case \(||x ||_R = R\) almost surely. Let us remark that \(L(\mathbb {P}, \theta ) = L(\mathbb {P}, 2 R \, \theta )\), and let us make the previous result uniform in \(\beta \in \Xi \). This leads to
Proposition 20.8
Let us assume that \(||x ||\le R\) almost surely. With probability at least \(1 - \epsilon \), for all \(\theta \in \mathbb {R}^d\),
where
and
The loss function \(C_4(\lambda , \theta )\) is the most-employed learning criterion for support vector machines, and is called the box constraint. It is convex in \(\theta \). There are fast algorithms to compute \(\inf _{\theta } C_4(\lambda , \theta )\) for any fixed values of \(\lambda \) and \(\beta \). Here we get an empirical criterion which could also be used to optimize the values of \(\lambda \) and \(\beta \), that is to optimize the strength of the regularizing factor \(\displaystyle \frac{\beta R^2 ||\theta ||^2}{n \lambda }\).
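The display defining \(C_4\) is missing from this version; assuming, as the regularizing factor written above indicates, a criterion of the form \(\frac{1}{n} \sum _{i=1}^n \bigl( 1 - y_i \langle \theta , x_i \rangle \bigr)_+ + \frac{\beta R^2 ||\theta ||^2}{n \lambda }\) (the exact scaling of the hinge term may differ), here is a minimal subgradient-descent sketch of its minimization in \(\theta \); in practice dedicated SVM solvers are used instead.

```python
import numpy as np

def c4(theta, X, y, beta, lam, R):
    """Hinge loss plus ridge term -- a plausible reading of the criterion C_4;
    the exact constants of the missing display may differ."""
    n = len(y)
    hinge = np.maximum(0.0, 1.0 - y * (X @ theta)).mean()
    return hinge + beta * R**2 * np.dot(theta, theta) / (n * lam)

def minimize_c4(X, y, beta=1.0, lam=1.0, steps=2000, lr=0.5):
    """Subgradient descent on the convex criterion above."""
    n, d = X.shape
    R = np.linalg.norm(X, axis=1).max()
    theta = np.zeros(d)
    for t in range(1, steps + 1):
        active = y * (X @ theta) < 1                       # margin violators
        g = -(X[active] * y[active, None]).sum(axis=0) / n # hinge subgradient
        g = g + 2 * beta * R**2 * theta / (n * lam)        # ridge gradient
        theta -= (lr / np.sqrt(t)) * g
    return theta, c4(theta, X, y, beta, lam, R)

# Toy linearly separable data (hypothetical).
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=200)
X = y[:, None] * np.array([2.0, 2.0]) + rng.normal(size=(200, 2))
theta, value = minimize_c4(X, y)
print(value)
```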
Here \(||\theta ||^{-1}\) can be interpreted as the margin width, that is the minimal distance of x from the separating hyperplane \( \{ x' : \langle \theta , x' \rangle = 0 \}\) beyond which the error term \(\bigl ( 1 - y \langle \theta , x \rangle \bigr )_+\) vanishes (for data x that are on the right side of the separating hyperplane). The speed of convergence depends on \(R^2 ||\theta ||^2 / n\). For this reason, \(R^2 ||\theta ||^2\), the square of the ratio between the radius of the ball containing the data and the margin, plays the role of the dimension. The bound does not depend on d, showing that with separating hyperplanes and more generally support vector machines, we can get low error rates while choosing to represent the data in a reproducing kernel Hilbert space with a large, or even infinite, dimension.
So far, we have considered only linear hyperplanes and data centered around 0. Nevertheless, this also covers affine hyperplanes and data contained in a not necessarily centered ball, through a change of coordinates. More precisely, the previous proposition has the following corollary:
Corollary 20.1
Assume that almost surely \(||x - c ||\le R\), for some \(c \in \mathbb {R}^d\) and \(R \in \mathbb {R}_+\). With probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\), any \(\gamma \in \mathbb {R}\) such that \(\displaystyle \min _{i=1, \dots , n} \langle \theta , x_i \rangle \le \gamma \le \max _{i=1, \dots , n} \langle \theta , x_i \rangle \),
where
Proof
Let us apply the previous result to \(x' = (x-c, R)\), and \(\theta ' = \bigl [ \theta , R^{-1} \bigl ( \langle \theta , c \rangle - \gamma \bigr ) \bigr ]\). We get that \(||x' ||^2 \le 2 R^2\) and \(||\theta ' ||^2 \le 2 ||\theta ||^2\), because almost surely \( -||\theta ||R \le {{\mathrm{ess}}}\inf \langle \theta , x - c \rangle \le \gamma - \langle \theta , c \rangle \le {{\mathrm{ess}}}\sup \langle \theta , x - c \rangle \le ||\theta ||R\), so that almost surely, for the allowed values of \(\gamma \), \(\bigl ( \langle \theta , c \rangle - \gamma \bigr )^2 \le R^2 ||\theta ||^2\). This proves that \(C_4(\beta , \lambda , \theta ') \le C_5(\beta , \lambda , \theta , \gamma )\), as required to deduce the corollary from the previous proposition. \(\square \)
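The change of coordinates used in this proof can be checked numerically; in the following sketch, \(c\), \(\theta \), \(x\) and \(\gamma \) are hypothetical values satisfying the hypotheses of the corollary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, R = 3, 2.0
c, theta = rng.normal(size=d), rng.normal(size=d)
u = rng.normal(size=d)
x = c + 0.8 * R * u / np.linalg.norm(u)   # a point with ||x - c|| <= R
gamma = float(np.dot(theta, x))           # an allowed threshold value

x_p = np.append(x - c, R)                                   # x' = (x - c, R)
theta_p = np.append(theta, (np.dot(theta, c) - gamma) / R)  # theta'

# The affine decision function becomes linear in the lifted coordinates:
assert np.isclose(np.dot(theta_p, x_p), np.dot(theta, x) - gamma)
# And the norms satisfy the bounds used in the proof:
assert np.dot(x_p, x_p) <= 2 * R**2 + 1e-12
assert np.dot(theta_p, theta_p) <= 2 * np.dot(theta, theta) + 1e-12
```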
Notes
- 1.
We will assume that \(\rho \) is a regular conditional probability kernel, meaning that for any measurable set A the map \((w_1, \dots , w_n) \mapsto \rho (w_1, \dots , w_n)(A)\) is assumed to be measurable. We will also assume that the \(\sigma \)-algebra we consider on \(\varTheta \) is generated by a countable family of subsets. See [1] (p. 50) for more details.
References
Catoni, O.: Statistical Learning Theory and Stochastic Optimization, Lectures on Probability Theory and Statistics, École d’Été de Probabilités de Saint-Flour XXXI—2001. Lecture Notes in Mathematics, vol. 1851. Springer, Berlin (2004)
Catoni, O.: PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. IMS Lecture Notes Monograph Series, vol. 56. Institute of Mathematical Statistics, Beachwood (2007)
Germain, P., Lacasse, A., Laviolette, F., Marchand, M.: PAC-Bayesian learning of linear classifiers. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML’09, pp. 353–360. ACM, New York (2009)
Langford, J., Shawe-Taylor, J.: PAC-Bayes & margins. In: Advances in Neural Information Processing Systems, pp. 423–430. MIT Press, Cambridge (2002)
McAllester, D.: Simplified PAC-Bayesian margin bounds. In: Schölkopf, B., Warmuth, M.K. (eds.) Learning Theory and Kernel Machines, COLT 2003. Lecture Notes in Artificial Intelligence, vol. 2777, pp. 203–215. Springer, Berlin (2003)
McAllester, D.A.: PAC-Bayesian model averaging. In: Proceedings of the 12th Annual Conference on Computational Learning Theory, pp. 164–170. ACM, New York (1999)
McAllester, D.A.: PAC-Bayesian stochastic model selection. Mach. Learn. 51(1), 5–21 (2003)
Seeger, M.: PAC-Bayesian generalization error bounds for Gaussian process classification. Informatics report series EDI-INF-RR-0094, Division of Informatics, University of Edinburgh. http://www.inf.ed.ac.uk/publications/online/0094.pdf (2002)
Vapnik, V.: Estimation of Dependences Based on Empirical Data. Springer, Berlin (1982)
Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16(2), 264–280 (1971) (This volume, Chap. 3)