
1 PAC-Bayes Bounds for 0–1 Loss Functions

In this section, we are given some i.i.d. sample \((W_i)_{i=1}^n \in \mathcal {W}^n\), where \(\mathcal {W}\) is a measurable space, and some binary measurable loss function \(L : \mathcal {W} \times \varTheta \rightarrow \{0,1\}\), where \(\varTheta \) is a measurable parameter space. Our aim is to minimize with respect to \(\theta \in \varTheta \) the expected loss

$$\int L(w, \theta ) \, \mathrm {d}\mathbb {P}(w),$$

where \(\mathbb {P}\) is the common marginal distribution of the observations \(W_i\). More precisely, assuming that \(\mathbb {P}\) is unknown, we would like to find an estimator \(\widehat{\theta }(W_{1:\, n})\) depending on the observed sample \(W_{1:\, n} \overset{\text {def}}{=} (W_i)_{i=1}^n\) such that the excess risk

$$ \int L(w, \widehat{\theta }) \, \mathrm {d}\mathbb {P}(w) - \inf _{\theta \in \varTheta } \int L(w, \theta ) \, \mathrm {d}\mathbb {P}(w) $$

is small. The previous quantity is random, since \(\widehat{\theta }\) depends on the random sample \(W_{1:\, n}\). Therefore its size can be understood in different ways. Here we will focus on the deviations of the excess risk. Accordingly, we will look for estimators providing a small risk with a probability close to one.

A typical example of such a problem is provided by supervised classification. In this setting \(\mathcal {W} = \mathcal {X} \times \mathcal {Y}\), where \(\mathcal {Y}\) is a finite set, \(W_i = (X_i, Y_i)\), where \((X_i,Y_i)\) are input-output pairs, a family of measurable classification rules \(\bigl \{ f_{\theta } : \mathcal {X} \rightarrow \mathcal {Y}; \theta \in \varTheta \bigr \}\) is considered, and the loss function \(L(w,\theta )\) is defined as the classification error

$$ L\bigl [ (x,y), \theta \bigr ] = \mathbbm {1} \bigl [ f_{\theta }(x) \ne y \bigr ]. $$

Accordingly the aim is to minimize the expected classification error

$$ \mathbb {P}_{X,Y} \bigl [ f_\theta (X) \ne Y \bigr ] $$

given a sample \((X_i, Y_i)_{i=1}^n\) of observations.

The point of view presented here is a synthesis of the approaches of [2, 8].

1.1 Deviation Bounds for Sums of Bernoulli Random Variables

Given some parameter \(\lambda \in \mathbb {R}\), let us consider the (normalized) \(\log \)-Laplace transform of the Bernoulli distribution:

$$ \varPhi _{\lambda }(p) \overset{\mathrm{def}}{=} - \frac{1}{\lambda } \log \bigl [ 1 - p + p \, \exp ( - \lambda ) \bigr ]. $$

Let us also consider the Kullback–Leibler divergence of two Bernoulli distributions

$$ K(q,p) \overset{\mathrm{def}}{=} q \, \log \biggl ( \frac{q}{p} \biggr ) + (1 - q) \log \biggl ( \frac{1 - q}{1 - p} \biggr ). $$
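These two functions can be evaluated directly. As a small illustration (not part of the original text), here is a Python transcription of the two definitions; the names `Phi` and `kl_bernoulli` are ours.

```python
import math

def Phi(lam, p):
    """Normalized log-Laplace transform of the Bernoulli distribution:
    Phi_lambda(p) = -(1/lambda) * log(1 - p + p * exp(-lambda))."""
    return -math.log(1.0 - p + p * math.exp(-lam)) / lam

def kl_bernoulli(q, p):
    """Kullback-Leibler divergence K(q, p) between Bernoulli(q) and Bernoulli(p),
    with the convention 0 * log 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)

# For small lambda, Phi_lambda(p) stays close to p; K(q, p) vanishes only at p = q.
print(Phi(0.01, 0.3))          # approximately 0.299
print(kl_bernoulli(0.3, 0.5))  # approximately 0.082
```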

In the sequel \(\overline{\mathbb {P}}\) will be the empirical measure

$$ \overline{\mathbb {P}} = \frac{1}{n} \sum _{i=1}^n \delta _{W_i} $$

of an i.i.d. sample \((W_i)_{i=1}^n\) drawn from \(\mathbb {P}^{\otimes n} \in \mathcal {M}_+^1(\mathcal {W}^n)\) (the set of probability measures on \(\mathcal {W}^n\)). We will use a short notation for integrals, putting for any \(\rho , \pi \in \mathcal {M}_+^1(\varTheta )\) and any integrable function \(f \in \mathbb {L}_1 \bigl ( \mathcal {W} \times \varTheta ^2, \mathbb {P} \otimes \pi \otimes \rho \bigr )\)

$$ f(\mathbb {P}, \rho , \pi ) = \int f(w,\theta , \theta ') \, \mathrm {d}\mathbb {P}(w) \, \mathrm {d}\rho (\theta ) \, \mathrm {d}\pi (\theta '), $$

so that for instance \(\displaystyle L(\mathbb {P},\rho ) = \int L(w,\theta ) \, \mathrm {d}\mathbb {P}(w) \mathrm {d}\rho (\theta )\).

Let us first recall Chernoff’s bound.

Proposition 20.1

For any fixed value of the parameter \(\theta \in \varTheta \), the identity

$$ \int \exp \bigl [ - n \lambda L(\overline{\mathbb {P}}, \theta ) \bigr ] \, \mathrm {d}\mathbb {P}^{\otimes n} = \exp \Bigl \{ - n \lambda \varPhi _{\lambda } \bigl [ L(\mathbb {P}, \theta ) \bigr ] \Bigr \} $$

shows that with probability at least \(1 - \epsilon \),

$$\begin{aligned} L(\mathbb {P}, \theta )&\le B_+ \bigl [ L(\overline{\mathbb {P}}, \theta ), \log (\epsilon ^{-1})/n \bigr ], \end{aligned}$$

where

$$\begin{aligned} B_+(q, \delta )&= \inf _{\lambda \in {\mathbb {R}}_+} \varPhi _{\lambda }^{-1} \biggl ( q + \frac{\delta }{\lambda } \biggr ) \\&= \sup \Bigl \{ p \in [0,1] \,:\, K (q ,p) \le \delta \Bigr \}, \qquad q \in [0,1], ~ \delta \in \mathbb {R}_+. \end{aligned}$$

Moreover

$$ - \delta q \le B_+(q,\delta ) - q - \sqrt{2 \delta q(1-q)} \le 2 \delta (1 - q). $$

In the same way, the identity

$$ \int \exp \bigl [ n \lambda L(\overline{\mathbb {P}}, \theta ) \bigr ] d \mathbb {P}^{\otimes n} = \exp \Bigl \{ n \lambda \varPhi _{-\lambda } \bigl [ L(\mathbb {P}, \theta ) \bigr ] \Bigr \} $$

shows that with probability at least \(1 - \epsilon \)

$$\begin{aligned} L(\overline{\mathbb {P}}, \theta )&\le B_- \bigl [ L(\mathbb {P}, \theta ), \log (\epsilon ^{-1})/n \bigr ], \end{aligned}$$

where

$$\begin{aligned} B_-(q, \delta )&= \inf _{\lambda \in {\mathbb {R}}_+} \varPhi _{- \lambda } (q) + \frac{\delta }{\lambda } \\&= \sup \Bigl \{ p \in [0,1] \, : \, K(p,q) \le \delta \Bigr \}, \qquad q \in [0,1], \delta \in \mathbb {R}_+, \end{aligned}$$

and

$$ - \delta q \le B_-(q, \delta ) - q - \sqrt{2 \delta q (1-q)} \le 2 \delta (1 - q). $$
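Both bounds can be computed by a one-dimensional bisection, since the Kullback–Leibler divergence is increasing in its second (respectively first) argument on \([q, 1]\). The following sketch is our own illustration (the helper names and the values of q and \(\delta \) are arbitrary); it also checks numerically the two-sided approximation stated above.

```python
import math

def kl_bernoulli(q, p):
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)

def B_plus(q, delta, tol=1e-10):
    """sup{ p in [0,1] : K(q, p) <= delta }, by bisection (K(q, .) increases on [q, 1])."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q, mid) <= delta:
            lo = mid
        else:
            hi = mid
    return lo

def B_minus(q, delta, tol=1e-10):
    """sup{ p in [0,1] : K(p, q) <= delta }, by bisection (K(., q) increases on [q, 1])."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(mid, q) <= delta:
            lo = mid
        else:
            hi = mid
    return lo

q, delta = 0.2, 0.01
b = B_plus(q, delta)
# Check of the stated approximation:
#   -delta*q <= B_+(q, delta) - q - sqrt(2*delta*q*(1-q)) <= 2*delta*(1-q)
gap = b - q - math.sqrt(2.0 * delta * q * (1.0 - q))
print(b, gap, -delta * q <= gap <= 2.0 * delta * (1.0 - q))   # last value is True
```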

Before proving this proposition, let us mention some important identities.

Proposition 20.2

For any probability measures \(\pi \) and \(\rho \) defined on the same measurable space such that \(\mathcal {K}(\rho , \pi ) < \infty \), and any bounded measurable function h, let us define the transformed probability measure \(\pi _{\exp (h)} \ll \pi \) by its density

$$ \frac{\mathrm {d}\pi _{\exp (h)}}{\mathrm {d}\pi } = \frac{\exp (h)}{Z}, $$

where \(Z = \int \exp (h) \, \mathrm {d}\pi \). Moreover, let us introduce the notation

$$ {{\mathrm{\mathbf {Var}}}}\bigl ( h \, \mathrm {d}\pi \bigr ) = {\textstyle \int }\bigl ( h - {\textstyle \int }h \, \mathrm {d}\pi \bigr )^2 \, \mathrm {d}\pi . $$

The expectations with respect to \(\rho \) and \(\pi \) of h and the \(\log \)-Laplace transform of h are linked by the identities

$$\begin{aligned} {\textstyle \int }h \, \mathrm {d}\rho - \mathcal {K}(\rho , \pi )&+ \mathcal {K}(\rho , \pi _{\exp (h)}) = \log \bigl [ {\textstyle \int }\exp (h) \, \mathrm {d}\pi \bigr ] \end{aligned}$$
(20.1)
$$\begin{aligned}&= {\textstyle \int }h \, \mathrm {d}\pi + {\textstyle \int }_{0}^{1} ( 1 - \alpha ) {{\mathrm{\mathbf {Var}}}}\bigl [ h \, \mathrm {d}\pi _{\exp (\alpha h)} \bigr ] \, \mathrm {d}\alpha . \end{aligned}$$
(20.2)

Proof

The first identity is a straightforward consequence of the definitions of \(\pi _{\exp (h)}\) and of the Kullback–Leibler divergence function. The second one is the Taylor expansion of order one with integral remainder of the function

$$ f(\alpha ) = \log \bigl [ {\textstyle \int }\exp (\alpha h) \, \mathrm {d}\pi \bigr ], $$

which says that \(f(1) = f(0) + f'(0) + {\textstyle \int }_0^1(1 - \alpha ) f''(\alpha ) \, \mathrm {d}\alpha \).   \(\square \)
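Both identities are easy to verify numerically on a finite space, which may help build intuition. The following check is purely illustrative (the five-point space, the random choice of \(\pi \), \(\rho \) and h, and the discretization of the integral in \(\alpha \) are ours): identity (20.1) holds up to rounding, identity (20.2) up to the discretization error.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 5-point space: pi, rho are probability vectors, h a bounded function.
pi = rng.random(5);  pi /= pi.sum()
rho = rng.random(5); rho /= rho.sum()
h = rng.normal(size=5)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

Z = float(np.sum(pi * np.exp(h)))
pi_exp_h = pi * np.exp(h) / Z            # the transformed measure pi_{exp(h)}

# Identity (20.1): int h drho - K(rho, pi) + K(rho, pi_exp(h)) = log Z.
lhs = float(np.sum(h * rho)) - kl(rho, pi) + kl(rho, pi_exp_h)
print(abs(lhs - np.log(Z)) < 1e-10)      # True

# Identity (20.2), checked with a midpoint rule on alpha in [0, 1].
m = 2000
alphas = (np.arange(m) + 0.5) / m
acc = 0.0
for a in alphas:
    pa = pi * np.exp(a * h); pa /= pa.sum()
    mean = float(np.sum(h * pa))
    acc += (1.0 - a) * float(np.sum((h - mean) ** 2 * pa)) / m
print(abs(float(np.sum(h * pi)) + acc - np.log(Z)) < 1e-5)    # True
```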

Exercise 20.1

Prove that \(f \in \mathcal {C}^{\infty }\). Hint: write

$$ h^k \exp (\alpha h) = h^k + \int _0^\alpha h^{k+1} \exp ( \gamma h) \, \mathrm {d}\gamma , $$

use Fubini’s theorem to show that \(\alpha \mapsto \int h^k \exp (\alpha h) \, \mathrm {d}\pi \) belongs to \(\mathcal {C}^1\) and compute its derivative.   \(\square \)

Let us come now to the proof of Proposition 20.1. Chernoff’s inequality reads

$$ \varPhi _{\lambda } \bigl [ L(\mathbb {P}, \theta ) \bigr ] - \frac{\log (\epsilon ^{-1})}{n \lambda } \le L(\overline{\mathbb {P}}, \theta ), $$

where the inequality holds with probability at least \(1 - \epsilon \). Since the left-hand side is non-random, it can be optimized in \(\lambda \), giving

$$ L(\mathbb {P}, \theta ) \le B_+ \bigl [ L(\overline{\mathbb {P}}, \theta ), \log (\epsilon ^{-1})/n \bigr ]. $$

Exercise 20.2

Prove this statement in more detail. For any integer \(k > 1\), consider the event

$$ A_k = \Bigl \{ \sup _{\lambda \in \mathbb {R}_+} F(\lambda ) - k^{-1} > L(\overline{\mathbb {P}} , \theta ) \Bigr \}, $$

where \(\displaystyle F(\lambda ) = \varPhi _{\lambda } \bigl [ L ( \mathbb {P}, \theta ) \bigr ] - \frac{\log (\epsilon ^{-1})}{n \lambda }\). Show that \(\mathbb {P}^{\otimes n} (A_k) \le \epsilon \) by choosing some suitable value of \(\lambda \). Remark that \(A_k \subset A_{k+1}\) and conclude that \(\mathbb {P}^{\otimes n} \bigl ( \cup _k A_k \bigr ) \le \epsilon \).   \(\square \)

Since

$$ \lim _{\lambda \rightarrow + \infty } \varPhi _{\lambda }^{-1} \biggl ( q + \frac{\delta }{\lambda } \biggr ) = \lim _{\lambda \rightarrow + \infty } \frac{1 - \exp (- \lambda q - \delta )}{1 - \exp (- \lambda )} \le 1, $$

\(B_+(q, \delta ) \le 1\).

Applying Eq. (20.1) to Bernoulli distributions gives

$$ \lambda \varPhi _{\lambda } (p) = \lambda q + K(q,p) - K(q, p_{\lambda }) $$

where

$$ p_{\lambda } = \frac{p}{p + (1-p) \exp (\lambda )}. $$

This shows that

$$\begin{aligned} B_+(q,\delta )&= \sup \Bigl \{ p \in [0,1] \, : \, \varPhi _{\lambda }(p) \le q + \frac{\delta }{\lambda }, \; \lambda \in \mathbb {R}_+ \Bigr \} \\&= \sup \Bigl \{ p \in [q,1[ \, : \, K(q,p) \le \delta + K(q, p_{\lambda }), \lambda \in \mathbb {R}_+ \Bigr \} \\&= \sup \Bigl \{ p \in [q,1[ \, : \, K(q,p) \le \delta \Bigr \} \\&= \sup \Bigl \{ p \in [0,1] \, : \, K(q,p) \le \delta \Bigr \}, \end{aligned}$$

because when \(q \le p < 1\) then \(\displaystyle \lambda = \log \biggl ( \frac{q^{-1} - 1}{p^{-1} - 1} \biggr ) \in \mathbb {R}_+\), \(q = p_{\lambda }\) and therefore \(K(q,p_{\lambda }) = 0\).

Let us remark now that \(\displaystyle \frac{\partial ^2}{\partial x^2} K(x, p) = x^{-1}(1-x)^{-1}\). Thus if \(p \ge q \ge 1/2\), then

$$ K(q,p) \ge \frac{(p-q)^2}{2 q(1-q)}, $$

so that if \(K(q,p) \le \delta \), then

$$ p \le q + \sqrt{2 \delta q(1-q)}. $$

Now if \(q \le 1/2\) and \(p \ge q\) then

$$ K(q,p) \ge \left. {\left\{ \begin{array}{ll} \displaystyle \frac{(p-q)^2}{2 p(1-p)}, &{} p \le 1/2 \\ \displaystyle 2 (p-q)^2, &{} p \ge 1/2 \end{array}\right. } \right\} \ge \frac{(p-q)^2}{2p(1-q)}, $$

so that if \(K(q,p) \le \delta \), then

$$ (p-q)^2 \le 2 \delta p(1-q), $$

implying that

$$ p-q \le \delta (1 -q) + \sqrt{ 2 \delta q(1-q) + \delta ^2(1-q)^2} \le \sqrt{2 \delta q(1-q)} + 2 \delta (1 - q). $$

On the other hand,

$$ K(q,p) \le \frac{(p-q)^2}{2 \min \{ q(1-q), p(1-p) \}} \le \frac{(p-q)^2}{2 q(1-p)}, $$

thus if \(K(q,p) = \delta \) with \(p > q\), then

$$ (p-q)^2 \ge 2 \delta q (1 - p), $$

implying that

$$ p-q \ge - \delta q + \sqrt{ 2 \delta q(1-q) + \delta ^2 q^2} \ge \sqrt{2 \delta q (1 - q)} - \delta q. $$

Exercise 20.3

The second part of Proposition 20.1 is proved in the same way and left as an exercise.   \(\square \)

1.2 PAC-Bayes Bounds

We are now going to make Proposition 20.1 uniform with respect to \(\theta \). The PAC-Bayes approach to this [3, 5–7] is to randomize \(\theta \), so we will now consider joint distributions on \((W_{1:\, n}, \theta )\), where the distribution of \(W_{1:\, n}\) is still \(\mathbb {P}^{\otimes n}\) and the conditional distribution of \(\theta \) given the sample is given by some transition probability kernel \(\rho : \mathcal {W}^n \rightarrow \mathcal {M}_+^1(\varTheta )\), called in this context a posterior distribution. This posterior distribution \(\rho \) will be compared with a prior (meaning non-random) probability measure \(\pi \in \mathcal {M}_+^1(\varTheta )\).

Proposition 20.3

Let us introduce the notation

$$ B_{\varLambda } (q, \delta ) = \inf _{\lambda \in \varLambda } \varPhi _{\lambda }^{-1} \biggl ( q + \frac{\delta }{\lambda } \biggr ). $$

For any prior probability measure \(\pi \in \mathcal {M}_+^1(\varTheta )\) and any \(\lambda \in \mathbb {R}_+\),

$$\begin{aligned} \int \exp \biggl [ \sup _{\rho \in \mathcal {M}_+^1(\varTheta )} n \lambda \Bigl \{ \varPhi _{\lambda }\bigl [ L(\mathbb {P}, \rho ) \bigr ] - L(\overline{\mathbb {P}}, \rho ) \Bigr \} - \mathcal {K}(\rho , \pi ) \biggr ] \, \mathrm {d}\mathbb {P}^{\otimes n} \le 1, \end{aligned}$$
(20.3)

and therefore for any finite set \(\varLambda \subset \mathbb {R}_+\), with probability at least \(1 - \epsilon \), for any \(\rho \in \mathcal {M}_+^1(\varTheta )\),

$$ L(\mathbb {P}, \rho ) \le B_{\varLambda }\Biggl (L(\overline{\mathbb {P}}, \rho ), \frac{\mathcal {K}(\rho ,\pi ) + \log \bigl ( |\varLambda |/ \epsilon \bigr )}{n} \Biggr ). $$

Proof

The exponential moment inequality (20.3) is a consequence of Eq. (20.1), showing that

$$ \exp \Biggl \{ \sup _{\rho \in \mathcal {M}_+^1(\varTheta )} n \lambda \int \Bigl \{ \varPhi _{\lambda } \bigl [ L(\mathbb {P}, \theta ) \bigr ] - L(\overline{\mathbb {P}}, \theta ) \Bigr \} \, \mathrm {d}\rho (\theta ) - \mathcal {K}(\rho , \pi ) \Biggr \} \\ \qquad \qquad \qquad \qquad \qquad \quad \le \int \exp \biggl [ n \lambda \Bigl \{ \varPhi _{\lambda }\bigl [ L(\mathbb {P}, \theta ) \bigr ] - L(\overline{\mathbb {P}}, \theta ) \Bigr \} \biggr ] \, \mathrm {d}\pi (\theta ), $$

and of the fact that \(\varPhi _{\lambda }\) is convex, showing that

$$ \varPhi _{\lambda } \bigl [ L(\mathbb {P}, \rho ) \bigr ] \le \int \varPhi _{\lambda } \bigl [ L(\mathbb {P}, \theta ) \bigr ] \, \mathrm {d}\rho (\theta ). $$

The deviation inequality follows as usual.    \(\square \)

We cannot take the infimum on \(\lambda \in \mathbb {R}_+\) as in Proposition 20.1, because we can no longer cast our deviation inequality in such a way that \(\lambda \) appears on some non-random side of the inequality. Nevertheless, we can get a more explicit bound from some specific choice of the set \(\varLambda \).
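To make the mechanics of this bound concrete, here is a small numerical sketch on a finite parameter set: one-dimensional threshold classifiers with a uniform prior and a Gibbs posterior. All the choices below (the data-generating process, the grid of thresholds, the inverse temperature c, the grid \(\varLambda \) and the function names) are illustrative and not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1-D data and threshold classifiers f_theta(x) = sign(x - theta).
n = 500
X = rng.normal(size=n)
Y = np.where(X + 0.3 * rng.normal(size=n) > 0.0, 1, -1)

thetas = np.linspace(-2.0, 2.0, 201)              # finite parameter grid Theta
losses = np.array([np.mean(np.sign(X - t) != Y) for t in thetas])   # L(P_bar, theta)

pi = np.full(len(thetas), 1.0 / len(thetas))      # uniform prior
# A Gibbs posterior rho ~ pi * exp(-c * n * empirical loss) (one possible choice).
c = 2.0
w = pi * np.exp(-c * n * (losses - losses.min()))
rho = w / w.sum()

L_emp = float(np.sum(rho * losses))               # L(P_bar, rho)
mask = rho > 0
K_rho_pi = float(np.sum(rho[mask] * np.log(rho[mask] / pi[mask])))  # K(rho, pi)

eps = 0.05
Lambda = [2.0 ** -k for k in range(1, 11)]        # finite grid of lambda values
delta = (K_rho_pi + np.log(len(Lambda) / eps)) / n

def phi_inv(lam, x):
    # Inverse of Phi_lambda for the Bernoulli log-Laplace transform, clipped to [0, 1].
    return min(1.0, (1.0 - np.exp(-lam * x)) / (1.0 - np.exp(-lam)))

bound = min(phi_inv(l, L_emp + delta / l) for l in Lambda)
print(f"empirical randomized error {L_emp:.3f}, PAC-Bayes bound {bound:.3f}")
```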

Proposition 20.4

Let us define the least increasing upper bound of the variance of a Bernoulli distribution of parameter \(p \in [0,1]\) as

$$ \overline{v}(p) = {\left\{ \begin{array}{ll} p(1-p), &{} p \le 1/2,\\ 1/4, &{} \text {otherwise.} \end{array}\right. } $$

Let us choose some positive integer parameter m and let us put

$$ t = \frac{1}{4} \log \Biggl ( \frac{ n}{8 \log \bigl [ (m+1) / \epsilon \bigr ]} \Biggr ). $$

With probability at least \(1 - \epsilon \), for any \(\rho \in \mathcal {M}_+^1(\varTheta )\),

$$ L(\mathbb {P}, \rho ) \le L(\overline{\mathbb {P}}, \rho ) + B_m \bigl [ L(\overline{\mathbb {P}}, \rho ), \mathcal {K}(\rho , \pi ), \epsilon \bigr ], $$

where

$$\begin{aligned} B_m \bigl ( q, e, \epsilon \bigr )&= \max \Biggl \{ \sqrt{\frac{ 2 \overline{v}(q) \bigl \{ e + \log \bigl [ (m+1)/\epsilon \bigr ]\bigr \}}{n}} \cosh \bigl ( t/m \bigr ) \\&\qquad + \frac{2 (1 - q) \bigl \{ e + \log \bigl [ (m+1)/\epsilon \bigr ] \bigr \}}{n} \cosh (t/m)^2,\\&\qquad \qquad \frac{2 \bigl \{ e + \log \bigl [ (m+1)/\epsilon \bigr ] \bigr \}}{n} \Biggr \}\\&\le \sqrt{\frac{ 2 \overline{v}(q) \bigl \{ e + \log \bigl [ (m+1)/\epsilon \bigr ]\bigr \}}{n}} \cosh \bigl ( t/m \bigr ) \\&\qquad \qquad + \frac{2 \bigl \{ e + \log \bigl [ (m+1)/\epsilon \bigr ] \bigr \}}{n} \cosh (t/m)^2. \end{aligned}$$

Moreover, as soon as \(n \ge 5\),

$$\begin{aligned}&B_{\lfloor \log (n)^2 \rfloor - 1}(q,e, \epsilon ) \le B(q,e, \epsilon ) \overset{\mathrm{def}}{=} \nonumber \\&\qquad \qquad \sqrt{\frac{2 \overline{v}(q) \bigl \{ e + \log \bigl [ \log (n)^2/\epsilon \bigr ] \bigr \}}{n}} \cosh \bigl [ \log (n)^{-1} \bigr ] \nonumber \\&\qquad \qquad \qquad \qquad \quad + \frac{ 2 \bigl \{ e + \log \bigl [ \log (n)^2/\epsilon \bigr ] \bigr \}}{n} \cosh \bigl [ \log (n)^{-1} \bigr ]^2, \end{aligned}$$
(20.4)

so that with probability at least \(1 - \epsilon \), for any \(\rho \in \mathcal {M}_+^1(\varTheta )\),

$$\begin{aligned}&L(\mathbb {P}, \rho ) \le L(\overline{\mathbb {P}}, \rho )\\&\qquad \qquad + \sqrt{\frac{2 \overline{v}\bigl [ L(\overline{\mathbb {P}}, \rho ) \bigr ] \Bigl \{ \mathcal {K}(\rho , \pi ) + \log \bigl [ \log (n)^2 / \epsilon \bigr ] \Bigr \}}{n}} \cosh \bigl [ \log (n)^{-1} \bigr ] \\&\qquad \qquad \qquad \qquad \quad + \frac{2 \Bigl \{ \mathcal {K}(\rho , \pi ) + \log \bigl [ \log (n)^2 / \epsilon \bigr ] \Bigr \}}{n} \cosh \bigl [ \log (n)^{-1} \bigr ]^2. \end{aligned}$$

Proof

Let us put

$$\begin{aligned} q&= L(\overline{\mathbb {P}},\rho ), \\ \delta&= \frac{ \mathcal {K}(\rho , \pi ) + \log \bigl [ (m+1)/\epsilon \bigr ]}{n}, \\ \lambda _{\min }&= \sqrt{\frac{8 \log \bigl [ (m+1)/\epsilon \bigr ] }{n}}, \\ \varLambda&= \Bigl \{ \lambda _{\min }^{1 - k/m}, k=0, \dots , m \Bigr \}, \\ p&= B_{\varLambda }(q, \delta ) = \inf _{\lambda \in \varLambda } \varPhi _{\lambda }^{-1} \bigg ( q + \frac{\delta }{\lambda } \biggr ), \\ \widehat{\lambda }&= \sqrt{\frac{2 \delta }{\overline{v}(p)}}. \end{aligned}$$

According to Eq. (20.2) applied to Bernoulli distributions, for any \(\lambda \in \varLambda \),

$$ \varPhi _{\lambda }(p) = p - \frac{1}{\lambda } \int _{0}^{\lambda } (\lambda - \alpha ) p_{\alpha } (1 - p_{\alpha }) \, \mathrm {d}\alpha \le q + \frac{ \delta }{\lambda }. $$

Moreover, as \(p_{\alpha } \le p\),

$$ p - q \le \inf _{\lambda \in \varLambda } \frac{\lambda \overline{v}(p)}{2} + \frac{\delta }{\lambda } = \inf _{\lambda \in \varLambda } \sqrt{2 \delta \overline{v}(p)} \cosh \biggl [ \log \biggl ( \frac{\widehat{\lambda }}{\lambda } \biggr ) \biggr ]. $$

As \(\overline{v}(p) \le 1/4\) and \(\displaystyle \delta \ge \frac{\log \bigl [ (m+1)/\epsilon \bigr ]}{n}\),

$$\sqrt{\frac{2 \delta }{\overline{v}(p)}} = \widehat{\lambda } \ge \lambda _{\min } = \sqrt{\frac{8 \log \bigl [ (m+1)/\epsilon \bigr ]}{n}}. $$

Therefore either \(\lambda _{\min } \le \widehat{\lambda } \le 1\), or \(\widehat{\lambda } > 1\). Let us consider these two cases separately.

If \(\lambda _{\min } = \min \varLambda \le \widehat{\lambda } \le \max \varLambda = 1\), then \(\log \bigl ( \widehat{\lambda } \bigr ) \) is at distance at most t / m from some \(\log \bigl ( \lambda \bigr )\) where \(\lambda \in \varLambda \), because \(\log (\varLambda )\) is a grid with constant steps of size 2t / m. Thus

$$ p - q \le \sqrt{2 \delta \overline{v}(p)} \cosh \bigl ( t / m \bigr ). $$

If moreover \(q \le 1/2\), then \(\overline{v}(p) \le p(1-q)\), so that we obtain a quadratic inequality in p, whose solution is bounded by

$$ p \le q + \sqrt{2 \delta q(1-q)} \cosh \bigl ( t/m \bigr ) + 2 \delta (1 - q) \cosh \bigl ( t / m \bigr )^2. $$

If on the contrary \(q \ge 1/2\), then \(\overline{v}(p) = \overline{v}(q) = 1/4\) and

$$ p \le q + \sqrt{2 \delta \overline{v}(q)} \cosh \bigl ( t/m \bigr ), $$

so that in both cases

$$\begin{aligned} p - q \le \sqrt{2 \delta \overline{v}(q)} \cosh (t/m) + 2 \delta ( 1- q) \cosh \bigl ( t / m \bigr )^2. \end{aligned}$$

Let us now consider the case when \(\widehat{\lambda } > 1\). In this case \(\overline{v}(p) < 2 \delta \), so that

$$ p-q \le \frac{\overline{v}(p)}{2} + \delta \le 2 \delta . $$

In conclusion, applying Proposition 20.3 we see that with probability at least \(1 - \epsilon \), for any posterior distribution \(\rho \),

$$ L(\mathbb {P}, \rho ) \le p \le q + \max \Bigl \{ 2 \delta , \sqrt{2 \delta \overline{v}(q)} \cosh \bigl ( t / m \bigr ) + 2 \delta (1 - q) \cosh \bigl ( t / m \bigr )^2 \Bigr \}, $$

which is precisely the statement to be proved.

In the special case when \(m = \lfloor \log (n)^2 \rfloor - 1 \ge \log (n)^2 - 2\),

$$ \frac{t}{m} \le \frac{1}{4 \bigl [ \log (n)^2 - 2 \bigr ]} \log \Biggl ( \frac{n}{8\log \bigl [ \log (n)^2 - 1 \bigr ]} \Biggr ) \le \log (n)^{-1} $$

as soon as the last inequality holds, that is, as soon as \(n \ge \exp (\sqrt{2}) \simeq 4.11\), which makes \(\log (n)^2 - 2\) positive, and

$$ 3 \log (n)^2 - 8 + \log (n) \log \Bigl \{ 8 \log \bigl [ \log (n)^2 - 1 \bigr ] \Bigr \} \ge 0, $$

which holds true for any \(n \ge 5\), as can be checked numerically.    \(\square \)
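The explicit bound B(q, e, \(\epsilon \)) of Eq. (20.4) is straightforward to evaluate. The helper below is our own transcription (the example values are arbitrary); it returns the deviation term that Proposition 20.4 adds to the empirical randomized error.

```python
import math

def B_explicit(q, e, eps, n):
    """Deviation term B(q, e, eps) of Eq. (20.4); requires n >= 5.
    q   : empirical randomized error L(P_bar, rho)
    e   : Kullback-Leibler divergence K(rho, pi)
    eps : confidence level."""
    v_bar = q * (1.0 - q) if q <= 0.5 else 0.25      # least increasing variance bound
    d = e + math.log(math.log(n) ** 2 / eps)
    c = math.cosh(1.0 / math.log(n))
    return math.sqrt(2.0 * v_bar * d / n) * c + 2.0 * d * c ** 2 / n

# Example: n = 10000, empirical error 0.1, K(rho, pi) = 5, eps = 0.05:
# the deviation term is roughly 0.018.
print(B_explicit(0.1, 5.0, 0.05, 10_000))
```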

2 Linear Classification and Support Vector Machines

In this section we are going to consider more specifically the case of linear binary classification. In this setting \(\mathcal {W} = \mathcal {X} \times \mathcal {Y} = \mathbb {R}^d \times \{-1, + 1\}\), \(w = (x,y)\), where \(x \in \mathbb {R}^d\) and \(y \in \{-1, +1\}\), \(\varTheta = \mathbb {R}^d\), and

$$ L(w, \theta ) = \mathbbm {1} \bigl [ \langle \theta , x \rangle y \le 0 \bigr ]. $$

We will follow the approach presented in [4, 5].

Although we will stick in this presentation to the case when \(\mathcal {X}\) is a vector space of finite dimension, the results also apply to support vector machines [9–11], where the pattern space is some arbitrary space mapped to a Hilbert space \(\mathcal {H}\) by some implicit mapping \(\varPsi : \mathcal {X} \rightarrow \mathcal {H}\), \(\varTheta = \mathcal {H}\) and \(L(w,\theta ) = \mathbbm {1} \bigl ( \langle \theta , \varPsi (x) \rangle y \le 0 \bigr )\). It turns out that classification algorithms do not need to manipulate \(\mathcal {H}\) itself, but only to compute scalar products of the form \(k(x_1, x_2) = \langle \varPsi (x_1), \varPsi (x_2) \rangle \), defining a symmetric positive kernel k on the original pattern space \(\mathcal {X}\). The converse is also true: any positive symmetric kernel k can be represented as a scalar product in some mapped Hilbert space (this is the Moore–Aronszajn theorem). Often-used kernels on \(\mathbb {R}^d\) are

$$\begin{aligned} k(x_1, x_2)&= \bigl ( 1 + \langle x_1, x_2 \rangle \bigr )^s, \text { for which } \dim \mathcal {H} < \infty , \\ k(x_1, x_2)&= \exp \bigl ( - ||x_1 - x_2 ||^2 \bigr ), \text { for which } \dim \mathcal {H} = + \infty . \end{aligned}$$

In the following, we will work in \(\mathbb {R}^d\), which covers only the case when \(\dim \mathcal {H} < \infty \), but extensions are possible.
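As a small illustration of the last remark (our own sketch, with arbitrary sample points), the Gram matrix of a symmetric positive kernel computed on any finite set of patterns is positive semi-definite, which can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))              # 20 arbitrary patterns in R^3

def gaussian_kernel(x1, x2):
    """Gaussian kernel exp(-||x1 - x2||^2); the associated feature space is infinite-dimensional."""
    return np.exp(-np.sum((x1 - x2) ** 2))

def poly_kernel(x1, x2, s=3):
    """Polynomial kernel (1 + <x1, x2>)^s; the associated feature space has finite dimension."""
    return (1.0 + np.dot(x1, x2)) ** s

# A symmetric positive kernel yields a positive semi-definite Gram matrix.
for k in (gaussian_kernel, poly_kernel):
    G = np.array([[k(x1, x2) for x2 in X] for x1 in X])
    print(k.__name__, np.linalg.eigvalsh(G).min() >= -1e-8)   # True, True
```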

After [4, 5], let us consider as prior probability measure \(\pi \) the centered Gaussian measure with covariance \(\beta ^{-1} {{\mathrm{\mathbf {Id}}}}\), so that

$$ \frac{\mathrm {d}\pi }{\mathrm {d}\theta } (\theta ) = \left( \frac{\beta }{2 \pi } \right) ^{d/2} \exp \biggl ( - \frac{\beta ||\theta ||^2}{2} \biggr ). $$

Let us also consider the function

$$\begin{aligned} \varphi (x)&= \frac{1}{\sqrt{2\pi }} \int _{x}^{+ \infty } \exp \bigl ( - t^2/2 \bigr ) \, \mathrm {d}t, \qquad x \in \mathbb {R} \\&\le \min \Bigl \{ \frac{1}{x \sqrt{2 \pi }}, \frac{1}{2} \Bigr \} \exp \left( - \frac{x^2}{2} \right) , \quad x \in \mathbb {R}_+. \end{aligned}$$

Let \(\pi _{\theta }\) be the measure \(\pi \) shifted by \(\theta \), defined by the identity

$$ \int h(\theta ') \, \mathrm {d}\pi _{\theta }(\theta ') = \int h(\theta + \theta ') \, \mathrm {d}\pi (\theta '). $$

In this case

$$ \mathcal {K}(\pi _{\theta }, \pi ) = \frac{\beta }{2} ||\theta ||^2, $$

and

$$ L(w, \pi _{\theta }) = \varphi \bigl [ \sqrt{\beta } y ||x ||^{-1} \langle \theta , x \rangle \bigr ]. $$

Thus the randomized loss function has an explicit expression: randomization replaces the indicator function of the negative real line by a smooth approximation. As we are ultimately interested in \(L(w, \theta )\), we will shift things a little and consider, along with the classification error function L, the following error with margin

$$ M(w, \theta ) = \mathbbm {1} \bigl [ y ||x ||^{-1} \langle \theta , x \rangle \le 1 \bigr ]. $$

Unlike \(L(w, \theta )\) which is independent of the norm of \(\theta \), the margin error \(M(w, \theta )\) depends on \(||\theta ||\), counting a classification error each time x is at distance less than \(||x ||/ ||\theta ||\) from the boundary \(\{ x' \,:\, \langle \theta , x' \rangle = 0 \}\), so that the error with margin region is the complement of the open cone \(\bigl \{ x \in \mathbb {R}^d \, ; \, y \langle \theta , x \rangle > ||x ||\bigr \}\).

Let us compute the randomized margin error

$$ M(w, \pi _{\theta }) = \varphi \Bigl \{ \sqrt{\beta } \bigl [ y ||x ||^{-1} \langle \theta , x \rangle - 1 \bigr ] \Bigr \}. $$

It satisfies the inequality

$$\begin{aligned} M(w, \pi _{\theta }) \ge \varphi \bigl ( - \sqrt{\beta } \, \bigr ) L(w, \theta ) = \bigl [ 1 - \varphi \bigl (\sqrt{\beta } \, \bigr ) \bigr ] L(w, \theta ). \end{aligned}$$
(20.5)

Applying previous results we obtain

Proposition 20.5

With probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\),

$$ L(\mathbb {P}, \theta ) \le \bigl [ 1 - \varphi (\sqrt{\beta }) \bigr ]^{-1} M(\mathbb {P}, \pi _{\theta }) \le C_1(\theta ), $$

where

$$ C_1(\theta ) = \bigl [ 1 - \varphi \bigl ( \sqrt{\beta } \, \bigr ) \bigr ]^{-1} \Biggl [ M(\,\overline{\mathbb {P}}, \pi _{\theta }) + B \Biggl ( M(\,\overline{\mathbb {P}}, \pi _{\theta }), \frac{\beta ||\theta ||^2}{2}, \epsilon \Biggr ) \Biggr ], $$

the bound B being defined by Eq. (20.4).
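The bound \(C_1(\theta )\) can be computed directly from the data, since the randomized margin error has the closed form given above. The sketch below is a rough illustration on synthetic data; the helper names, the data-generating process and the values of \(\beta \) and \(\epsilon \) are our own choices.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def varphi(x):
    """Gaussian upper tail varphi(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def B_explicit(q, e, eps, n):
    # Deviation term of Eq. (20.4); see the sketch after Proposition 20.4.
    v_bar = q * (1.0 - q) if q <= 0.5 else 0.25
    d = e + math.log(math.log(n) ** 2 / eps)
    c = math.cosh(1.0 / math.log(n))
    return math.sqrt(2.0 * v_bar * d / n) * c + 2.0 * d * c ** 2 / n

def C1(theta, X, Y, beta, eps):
    """Empirical bound C_1(theta) of Proposition 20.5 (illustrative transcription)."""
    n = len(Y)
    margins = Y * (X @ theta) / np.linalg.norm(X, axis=1)     # y <theta, x> / ||x||
    M_emp = float(np.mean([varphi(math.sqrt(beta) * (m - 1.0)) for m in margins]))
    e = 0.5 * beta * float(theta @ theta)                     # K(pi_theta, pi)
    return (M_emp + B_explicit(M_emp, e, eps, n)) / (1.0 - varphi(math.sqrt(beta)))

# Synthetic, roughly linearly separable data.
n, d = 1000, 3
X = rng.normal(size=(n, d))
Y = np.where(X @ np.array([1.0, -1.0, 0.5]) > 0.0, 1.0, -1.0)
theta = np.array([2.0, -2.0, 1.0])
print(C1(theta, X, Y, beta=4.0, eps=0.05))
```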

We can now minimize this empirical upper bound to define an estimator. Let us consider some estimator \(\widehat{\theta }\) such that

$$ C_1(\widehat{\theta }) \le \inf _{\theta \in \mathbb {R}^d} C_1(\theta ) + \zeta . $$

Then for any fixed parameter \({\theta _{\star }}\), \(C_1(\widehat{\theta }) \le C_1({\theta _{\star }}) + \zeta \). On the other hand, with probability at least \(1 - \epsilon \)

$$ M(\overline{\mathbb {P}},\pi _{{\theta _{\star }}}) \le B_- \biggl ( M(\mathbb {P}, \pi _{{\theta _{\star }}}), \frac{\log (\epsilon ^{-1})}{n} \biggr ). $$

Indeed

$$\begin{aligned} \int \exp&\Bigl \{ n \lambda \Bigl [ M(\overline{\mathbb {P}}, \pi _{{\theta _{\star }}}) - \varPhi _{- \lambda }\bigl [ M(\mathbb {P}, \pi _{{\theta _{\star }}}) \bigr ] \Bigr ] \Bigr \} \, \mathrm {d}\mathbb {P}^{ \otimes n} \\&\le \int \exp \biggl \{ n \lambda \int \Bigl \{ M(\overline{\mathbb {P}}, \theta ) - \varPhi _{- \lambda }\bigl [ M(\mathbb {P}, \theta ) \bigr ] \Bigr \} \, \mathrm {d}\pi _{{\theta _{\star }}}(\theta ) \biggr \} \, \mathrm {d}\mathbb {P}^{\otimes n} \le 1, \end{aligned}$$

because \(p \mapsto - \varPhi _{- \lambda }(p)\) is convex. As a consequence

Proposition 20.6

With probability at least \(1 - 2 \epsilon \),

$$\begin{aligned}&L(\mathbb {P}, \widehat{\theta }) \le \inf _{{\theta _{\star }}\in \varTheta } \bigl [ 1 - \varphi \bigl ( \sqrt{ \beta } \, \bigr ) \bigr ]^{-1} \biggl [ B_-\biggl ( M(\mathbb {P}, \pi _{\theta _{\star }}), \frac{\log (\epsilon ^{-1})}{n} \biggr ) \\&\qquad \qquad + B \biggl ( B_-\biggl ( M(\mathbb {P}, \pi _{\theta _{\star }}), \frac{\log (\epsilon ^{-1})}{n} \biggr ), \frac{\beta ||{\theta _{\star }}||^2}{2}, \epsilon \biggr ) \biggr ] + \zeta . \end{aligned}$$

It is also possible to state a result in terms of empirical margins. Indeed

$$ M(w,\pi _{\theta }) \le M(w, \theta / 2) + \varphi (\sqrt{\beta }). $$

Thus with probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\),

$$ L(\mathbb {P}, \theta ) \le C_2(\theta ), $$

where

$$ C_2(\theta ) = \bigl [ 1 - \varphi \bigl ( \sqrt{\beta } \, \bigr ) \bigr ]^{-1} \biggl [ M(\overline{\mathbb {P}}, \theta /2) + \varphi \bigl ( \sqrt{\beta } \, \bigr ) + B \biggl ( M(\overline{\mathbb {P}}, \theta /2) + \varphi \bigl ( \sqrt{\beta } \, \bigr ), \frac{\beta ||\theta ||^2 }{2}, \epsilon \biggr ) \biggr ]. $$

However, \(C_1\) and \(C_2\) are non-convex criteria, and faster minimization algorithms are available for the usual SVM loss function, for which we are now going to derive some generalization bounds. To do so, let us choose some positive radius R and put \(||x ||_{R} = \max \bigl \{ R, ||x ||\bigr \}\), so that \(||x ||_{R} = R\) whenever \(||x ||\le R\).

$$\begin{aligned} M(w, \pi _{\theta }) = \varphi \bigl [ \sqrt{\beta } \bigl (y ||x ||^{-1} \langle \theta , x \rangle - 1 \bigr ) \bigr ] \le \bigl ( 2 - y ||x ||_{R}^{-1} \langle \theta , x \rangle \bigr )_+ + \varphi (\sqrt{\beta }). \end{aligned}$$
(20.6)

To check that this is true, consider the functions

$$\begin{aligned} f(z)&= \varphi \bigl [ \sqrt{\beta } \bigl ( ||x ||^{-1} z - 1 \bigr ) \bigr ], \\ g(z)&= \bigl ( 2 - ||x ||_R^{-1} z \bigr )_+ + \varphi (\sqrt{\beta }), \qquad z \in \mathbb {R}. \end{aligned}$$

Let us remark that they are both non-increasing, that f is convex on the interval \(z \in \bigl [ ||x ||, + \infty \bigr [ \) (because \(\varphi \) is convex on \(\mathbb {R}_+\)), and that \(\sup f = \sup \varphi = 1\). Since \(||x ||_R \ge ||x ||\), for any \(z \in \, ] \! - \infty , ||x ||]\), \( g(z) \ge 1 \ge f(z)\). Moreover, \( g(2 ||x ||_R) = \varphi (\sqrt{\beta }) \ge \varphi \bigl [ \sqrt{\beta } \bigl ( 2 ||x ||^{-1} ||x ||_R - 1 \bigr ) \bigr ] = f(2 ||x ||_R)\). Since on the interval \(\bigl [||x ||, 2 ||x ||_R \bigr ]\) the function g is linear, the function f is convex, and g is not smaller than f at the two ends, this proves that g is not smaller than f on the whole interval. Finally, on the interval \(z \in \bigl [ 2 ||x ||_R, + \infty \bigr [\), the function g is constant and the function f is decreasing, so that on this interval also g is not smaller than f, and this ends the proof of (20.6), since the three intervals on which \(g \ge f\) cover the whole real line.
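The inequality can also be sanity-checked numerically by comparing f and g on a grid of values of \(z = y \langle \theta , x \rangle \); the check below (with arbitrary values of \(\beta \), \(||x ||\) and R of our choosing) is of course no substitute for the proof.

```python
import math
import numpy as np

def varphi(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def check_20_6(beta, norm_x, R, z_lo=-10.0, z_hi=10.0, m=100_001):
    """Grid check that g(z) >= f(z), i.e. inequality (20.6), for z = y <theta, x>."""
    norm_x_R = max(R, norm_x)
    zs = np.linspace(z_lo, z_hi, m)
    f = np.array([varphi(math.sqrt(beta) * (z / norm_x - 1.0)) for z in zs])
    g = np.maximum(2.0 - zs / norm_x_R, 0.0) + varphi(math.sqrt(beta))
    return bool(np.all(g >= f - 1e-12))

print(check_20_6(beta=1.0, norm_x=0.7, R=1.0))   # True: case ||x|| <= R
print(check_20_6(beta=4.0, norm_x=2.5, R=1.0))   # True: case ||x|| > R
```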

Using the upper bounds (20.6) and (20.5), and Proposition 20.3, we obtain

Proposition 20.7

With probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\),

$$\begin{aligned} L(\mathbb {P}, \theta )&\le \bigl [ 1 - \varphi \bigl ( \sqrt{\beta } \, \bigr ) \bigr ]^{-1} B_{\varLambda } \biggl ( \int \bigl ( 2 - y ||x ||_R^{-1} \langle \theta , x \rangle \bigr )_+ d \overline{\mathbb {P}}(x,y) + \varphi (\sqrt{\beta }), \\&\qquad \qquad \qquad \frac{\beta ||\theta ||^2 + 2 \log \bigl ( |\varLambda |/ \epsilon \bigr )}{2n} \biggr ) \\&= \bigl [ 1 - \varphi \bigl ( \sqrt{\beta } \bigr ) \bigr ]^{-1} \inf _{\lambda \in \varLambda } \varPhi _{\lambda }^{-1} \biggl [ C_3(\lambda , \theta ) + \varphi \bigl ( \sqrt{\beta } \, \bigr ) + \frac{\log \bigl ( |\varLambda |/ \epsilon \bigr )}{ n \lambda } \biggr ], \end{aligned}$$

where

$$ C_3(\lambda , \theta ) = \int \bigl ( 2 - y ||x ||_R^{-1} \langle \theta , x \rangle \bigr )_+ \, \mathrm {d}\overline{\mathbb {P}}(x,y) + \frac{\beta ||\theta ||^2}{2n \lambda }. $$

Let us assume now that the patterns x lie in a ball of radius R, so that \(||x ||\le R\) almost surely; in this case \(||x ||_{R} = R\) almost surely. Let us remark that \(L(\mathbb {P}, \theta ) = L(\mathbb {P}, 2 R \, \theta )\), and let us make the previous result uniform in \(\beta \) ranging over some finite set \(\Xi \subset \mathbb {R}_+\). This leads to

Proposition 20.8

Let us assume that \(||x ||\le R\) almost surely. With probability at least \(1 - \epsilon \), for all \(\theta \in \mathbb {R}^d\),

$$\begin{aligned} L(\mathbb {P}, \theta ) \le \inf _{\beta \in \Xi } \big [ 1 - \varphi (\sqrt{\beta }) \bigr ]^{-1} \inf _{\lambda \in \varLambda } \varPhi _{\lambda }^{-1}&\biggl [ 2 C_4 \bigl (\beta , \lambda , \theta \bigr ) \\&\qquad + \varphi (\sqrt{\beta }) + \frac{\log \bigl ( |\Xi |\, |\varLambda |/ \epsilon )}{n \lambda } \biggr ], \end{aligned}$$

where

$$ C_4 (\beta , \lambda , \theta ) = \frac{1}{2} \, C_3(\lambda , 2 R \, \theta ) = \int \bigl ( 1 - y \langle \theta , x \rangle \bigr )_+ \, \mathrm {d}\overline{\mathbb {P}}(x, y) + \frac{\beta R^2 ||\theta ||^2}{n \lambda }, $$

and

$$ \varPhi _{\lambda }^{-1}(q) = \frac{1 - \exp (- \lambda q)}{1 - \exp (- \lambda )} \le \frac{q}{\displaystyle 1 - \frac{\lambda }{2}}. $$

The loss function \(C_4(\beta , \lambda , \theta )\) is the most widely used learning criterion for support vector machines, and is called the box constraint. It is convex in \(\theta \), and there are fast algorithms to compute \(\inf _{\theta } C_4(\beta , \lambda , \theta )\) for any fixed values of \(\lambda \) and \(\beta \). Here we get an empirical criterion which could also be used to optimize the values of \(\lambda \) and \(\beta \), that is, to optimize the strength of the regularizing factor \(\displaystyle \frac{\beta R^2 ||\theta ||^2}{n \lambda }\).

Here \(||\theta ||^{-1}\) can be interpreted as the margin width, that is the minimal distance of x from the separating hyperplane \( \{ x' : \langle \theta , x' \rangle = 0 \}\) beyond which the error term \(\bigl ( 1 - y \langle \theta , x \rangle \bigr )_+\) vanishes (for data x that are on the right side of the separating hyperplane). The speed of convergence depends on \(R^2 ||\theta ||^2 / n\). For this reason, \(R^2 ||\theta ||^2\), the square of the ratio between the radius of the ball containing the data and the margin, plays the role of the dimension. The bound does not depend on d, showing that with separating hyperplanes and more generally support vector machines, we can get low error rates while choosing to represent the data in a reproducing kernel Hilbert space with a large, or even infinite, dimension.
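As an illustration of this discussion, here is a bare-bones subgradient descent on the box-constraint criterion \(C_4\) for synthetic data lying in a ball of radius R. This is a sketch of ours, not one of the fast specialized SVM solvers alluded to above; the data, the step size and the number of iterations are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: n points in a ball of radius R, labelled by a noisy hyperplane.
n, d, R = 400, 5, 1.0
X = rng.normal(size=(n, d))
X = R * X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), R)   # enforce ||x|| <= R
w_true = rng.normal(size=d)
Y = np.where(X @ w_true + 0.2 * rng.normal(size=n) > 0.0, 1.0, -1.0)

def C4(theta, beta, lam):
    """Box-constraint criterion C_4(beta, lambda, theta): hinge loss + ridge term."""
    hinge = float(np.maximum(1.0 - Y * (X @ theta), 0.0).mean())
    return hinge + beta * R ** 2 * float(theta @ theta) / (n * lam)

def minimize_C4(beta, lam, steps=5000, lr=0.05):
    """Plain subgradient descent on the convex criterion C_4."""
    theta = np.zeros(d)
    for _ in range(steps):
        active = Y * (X @ theta) < 1.0                 # points where the hinge is active
        grad = -(Y[active][:, None] * X[active]).sum(axis=0) / n
        grad += 2.0 * beta * R ** 2 * theta / (n * lam)
        theta = theta - lr * grad
    return theta

theta_hat = minimize_C4(beta=1.0, lam=0.5)
print("C_4 at the approximate minimizer:", C4(theta_hat, beta=1.0, lam=0.5))
```

The resulting \(\widehat{\theta }\) can then be plugged into the bound of Proposition 20.8, evaluated over finite grids \(\varLambda \) and \(\Xi \), which gives a data-dependent way of comparing different regularization strengths.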

So far we have considered only hyperplanes passing through the origin and data centered around 0. However, affine hyperplanes and data contained in a not necessarily centered ball are also covered, through a change of coordinates. More precisely, the previous proposition has the following corollary:

Corollary 20.1

Assume that almost surely \(||x - c ||\le R\), for some \(c \in \mathbb {R}^d\) and \(R \in \mathbb {R}_+\). With probability at least \(1 - \epsilon \), for any \(\theta \in \mathbb {R}^d\), any \(\gamma \in \mathbb {R}\) such that \(\displaystyle \min _{i=1, \dots , n} \langle \theta , x_i \rangle \le \gamma \le \max _{i=1, \dots , n} \langle \theta , x_i \rangle \),

$$\begin{aligned}&\int \mathbbm {1} \bigl [ y \bigl ( \langle \theta , x \rangle - \gamma \bigr ) \le 0 \bigr ] \, \mathrm {d}\mathbb {P}(x,y) \le \inf _{\beta \in \Xi } \bigl [ 1 - \varphi (\sqrt{\beta }) \bigr ]^{-1}\\&\qquad \qquad \qquad \inf _{\lambda \in \varLambda } \varPhi _{\lambda }^{-1} \biggl [ 2 C_5(\beta , \lambda , \theta , \gamma ) + \varphi (\sqrt{\beta }) + \frac{ \log \bigl ( |\Xi |\, |\varLambda |/ \epsilon \bigr )}{n \lambda } \biggr ], \end{aligned}$$

where

$$ C_5(\beta , \lambda , \theta , \gamma ) = \int \bigl [ 1 - y \bigl ( \langle \theta , x \rangle - \gamma \bigr ) \bigr ]_+ \, \mathrm {d}\overline{\mathbb {P}}(x,y) + \frac{4 \beta R^2 ||\theta ||^2}{n \lambda }. $$

Proof

Let us apply Proposition 20.8 to the transformed patterns \(x' = (x-c, R)\) and to \(\theta ' = \bigl [ \theta , R^{-1} \bigl ( \langle \theta , c \rangle - \gamma \bigr ) \bigr ]\), noting that \(\langle \theta ', x' \rangle = \langle \theta , x \rangle - \gamma \). We get that \(||x' ||^2 = ||x - c ||^2 + R^2 \le 2 R^2\), so that the transformed patterns lie in a ball of radius \(\sqrt{2} R\), and \(||\theta ' ||^2 \le 2 ||\theta ||^2\), because almost surely \( -||\theta ||R \le {{\mathrm{ess}}}\inf \langle \theta , x - c \rangle \le \gamma - \langle \theta , c \rangle \le {{\mathrm{ess}}}\sup \langle \theta , x - c \rangle \le ||\theta ||R\), so that, for the allowed values of \(\gamma \), \(\bigl ( \langle \theta , c \rangle - \gamma \bigr )^2 \le R^2 ||\theta ||^2\). This proves that \(C_4(\beta , \lambda , \theta ')\), computed for the transformed data, is at most \(C_5(\beta , \lambda , \theta , \gamma )\), as required to deduce the corollary from Proposition 20.8.   \(\square \)
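The change of variables used in this proof is easy to sanity-check numerically; the toy verification below (with arbitrary c, \(\theta \), \(\gamma \) and x of our choosing) confirms the two identities on which the argument relies.

```python
import numpy as np

rng = np.random.default_rng(3)

d, R = 4, 2.0
c = rng.normal(size=d)          # center of the ball
theta = rng.normal(size=d)
gamma = 0.3
x = rng.normal(size=d)          # any pattern

x_prime = np.concatenate([x - c, [R]])
theta_prime = np.concatenate([theta, [(theta @ c - gamma) / R]])

# <theta', x'> = <theta, x> - gamma  and  ||x'||^2 = ||x - c||^2 + R^2.
print(np.isclose(theta_prime @ x_prime, theta @ x - gamma))        # True
print(np.isclose(x_prime @ x_prime, (x - c) @ (x - c) + R ** 2))   # True
```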