1 Introduction

In statistical learning theory, PAC-Bayesian theory (Shawe-Taylor & Williamson, 1997; McAllester, 1998) provides a powerful framework for analyzing the generalization ability of machine learning models such as linear classifiers (Germain et al., 2009), SVMs (Ambroladze et al., 2006), or neural networks (Dziugaite & Roy, 2017; Pérez-Ortiz et al., 2021). In the PAC-Bayesian theory, the machine learning models are considered randomized (or stochastic), i.e., a model is sampled from a posterior probability distribution for each prediction. The analysis of such a randomized classifier usually takes the form of bounds on the average risk with respect to a posterior distribution learned given a learning sample and a chosen prior distribution defined over a set of hypotheses. Note that the prior distribution can encode an a priori belief on the set of hypotheses; if we have no such belief, it can be set to a non-informative distribution, such as the uniform distribution. While such bounds are very effective for analyzing randomized/stochastic classifiers, the vast majority of machine learning methods nevertheless need guarantees on deterministic models. In this case, a derandomization step is required to turn a bound that originally holds for randomized/stochastic models into a bound on the risk of a deterministic model. Different forms of derandomization have been introduced in the literature for specific settings. Among them, Langford and Shawe-Taylor (2002) proposed a derandomization for Gaussian posteriors over linear classifiers: thanks to the Gaussian symmetry, a bound on the risk of the maximum a posteriori (deterministic) classifier is obtainable from the bound on the average risk of the randomized classifier. Also relying on Gaussian posteriors, Letarte et al. (2019) derived a PAC-Bayesian bound for a very specific deterministic network architecture using sign functions as activations; this approach has been further extended by Biggs and Guedj (2021, 2022). Another line of work derandomizes neural networks (Neyshabur et al., 2018; Nagarajan & Kolter, 2019). While technically different, these works start from PAC-Bayesian guarantees on the randomized classifier and use an “output perturbation” bound to convert a guarantee for a random classifier into one for the mean classifier. These works highlight the need for a general framework for the derandomization of classic PAC-Bayesian bounds.

In this paper, we focus on another kind of derandomization, sometimes referred to as disintegration of the PAC-Bayesian bound, first proposed by Catoni (2007, Th.1.2.7) and Blanchard and Fleuret (2007): instead of bounding the average risk of a randomized classifier with respect to the posterior distribution, disintegrated PAC-Bayesian bounds upper-bound the risk of a single classifier sampled from the posterior distribution. Despite their relevance for derandomizing PAC-Bayesian bounds, these kinds of bounds have received little attention in the literature; notably, we can cite the recent work of Rivasplata et al. (2020, Th.1(i)), who derived a general disintegrated PAC-Bayesian theorem. It is important to note that these bounds have never been used in practice. Driven by practical machine learning purposes, our objective is thus twofold. We derive new tight and usable disintegrated PAC-Bayesian bounds (i) that directly derandomize any classifier without any additional step and with almost no impact on the guarantee, and (ii) that can be easily optimized to learn classifiers with strong guarantees. To achieve this objective, our contribution consists in providing a new general disintegration framework based on the Rényi divergence (in Theorem 2), allowing us to meet the practical goal of efficient learning. From the theoretical standpoint, due to the Rényi divergence term, our bound is expected to be looser than the one of Rivasplata et al. (2020, Th.1(i)), in which the divergence term is “disintegrated”, i.e., depends on the sampled hypothesis only. However, as we show in our experimental evaluation on neural networks, their “disintegrated” term is, in practice, subject to high variance, making their bound harder to optimize. In contrast, our Rényi divergence term does not depend on the sampled hypothesis, which avoids this source of variance. Our bound then has the main advantage of leading to a more stable learning algorithm with better empirical results. In addition, we derive a new theoretical result in the form of an information-theoretic bound, giving new insights into disintegration procedures.

The rest of the paper is organized as follows. Section 2 introduces the notations we follow and recalls some basics on generalization bounds. In Sect. 3, we derive our main contribution: new disintegrated PAC-Bayesian bounds. Then, we illustrate the practical usefulness of this disintegration on deterministic neural networks in Sects. 4 and 5. Before concluding in Sect. 7, we discuss in Sect. 6 another point of view on the disintegration through an information-theoretic bound. For readability, we defer the proofs of our theoretical results to the Appendix.

2 Setting and basics

2.1 General notations

We denote by \({\mathcal {M}}({\mathcal {A}})\) the set of probability densities on the measurable space \(({\mathcal {A}}, \Sigma _{{\mathcal {A}}})\) with respect to a reference measure, where \(\Sigma _{{\mathcal {A}}}\) is the \(\sigma\)-algebra on the set \({\mathcal {A}}\). In this paper, we consider supervised classification tasks with \({\mathcal {X}}\) the input space, \({\mathcal {Y}}\) the label set, and \({\mathcal {D}}\in {\mathcal {M}}({\mathcal {X}}{\times }{\mathcal {Y}})\) an unknown data distribution on \({\mathcal {X}}{\times } {\mathcal {Y}}{=}{\mathcal {Z}}\). An example is denoted by \(z{=} ({\textbf{x}},y)\! \in\, {\mathcal {Z}}\), and the learning sample \({\mathcal {S}}{=} \{z_i\}_{i=1}^{m}\) consists of m examples drawn i.i.d. from \({\mathcal {D}}\); the distribution of such an m-sample is \({\mathcal {D}}^m\in {\mathcal {M}}({\mathcal {Z}}^m)\). We consider a hypothesis set \({\mathcal {H}}\) of functions \(h\!:\!{\mathcal {X}}{\rightarrow } {\mathcal {Y}}\). The learner aims to find \(h\!\in\,{\mathcal {H}}\) that assigns a label y to an input \({\textbf{x}}\) as accurately as possible. Given an example z and a hypothesis h, we assess the quality of the prediction of h with a loss function \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } [0, 1]\) evaluating to which extent the prediction is accurate. Given a loss function \(\ell\), the true risk \({R}_{{\mathcal {D}}}(h)\) of a hypothesis h on the distribution \({\mathcal {D}}\) and its empirical counterpart, the empirical risk \({R}_{{\mathcal {S}}}(h)\) estimated on \({\mathcal {S}}\), are defined as

$$\begin{aligned} {R}_{{\mathcal {D}}}(h) \triangleq {\mathbb {E}}_{z\sim {\mathcal {D}}}\ell (h, z)\,, \quad \text { and } \quad {R}_{{\mathcal {S}}}(h) \triangleq \frac{1}{m}\sum _{i=1}^{m} \ell (h, z_{i})\,. \end{aligned}$$

Then, the learner wants to find the hypothesis h from \({\mathcal {H}}\) that minimizes \({R}_{{\mathcal {D}}}(h)\). However, we cannot compute \({R}_{{\mathcal {D}}}(h)\) since \({\mathcal {D}}\) is unknown. In practice, one can work under the Empirical Risk Minimization principle (erm) that looks for a hypothesis minimizing \({R}_{{\mathcal {S}}}(h)\). Generalization guarantees over unseen data from \({\mathcal {D}}\) can be obtained by quantifying how well the empirical risk \({R}_{{\mathcal {S}}}(h)\) estimates \({R}_{{\mathcal {D}}}(h)\). Statistical machine learning theory (see, e.g., Vapnik, 2000) studies the conditions of consistency and convergence of erm towards the true risk. This kind of result is called a generalization bound, often referred to as a PAC (Probably Approximately Correct) bound (Valiant, 1984), and takes the form:

$$\begin{aligned} {\mathbb {P}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m}} \Big [ \big \vert {R}_{{\mathcal {D}}}(h) -{R}_{{\mathcal {S}}}(h) \big \vert \le \varepsilon \big ( \tfrac{1}{\delta }, \tfrac{1}{m}\big ) \Big ] \ge 1-\delta . \end{aligned}$$

Put into words, with high probability (at least \(1{-}\delta\)) on the random choice of the learning sample \({\mathcal {S}}\), good generalization guarantees are obtained when the deviation between the true risk \({R}_{{\mathcal {D}}}(h)\) and its empirical estimate \({R}_{{\mathcal {S}}}(h)\) is low, i.e., \(\varepsilon \big (\tfrac{1}{\delta }, \tfrac{1}{m}\big )\) should be as small as possible. The function \(\varepsilon\) depends mainly on two quantities: (i) the number of examples m for statistical precision, and (ii) the confidence parameter \(\delta\). We now recall three classical types of bounds, focusing on the PAC-Bayesian theory at the heart of our contribution. By abuse of notation, in the following, we use the function \(\varepsilon\) for the different frameworks presented: we consider an additional argument of \(\varepsilon\) to pinpoint the differences between the frameworks.

2.2 Uniform convergence bound

A first classical type of generalization bound, referred to as a Uniform Convergence bound, is based on a measure of the complexity of the set \({\mathcal {H}}\) (such as the VC-dimension or the Rademacher complexity) and holds for all the hypotheses of \({\mathcal {H}}\). This type of bound takes the form:

$$\begin{aligned} {\mathbb {P}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m}}\left[ \, \sup _{h\in {\mathcal {H}}}\left| {R}_{{\mathcal {D}}}(h)-{R}_{{\mathcal {S}}}(h)\right| \le \varepsilon \big (\tfrac{1}{\delta }, \tfrac{1}{m},{\mathcal {H}}\big ) \right] \ge 1-\delta . \end{aligned}$$

Due to \(\sup _{h\in {\mathcal {H}}}\), this bound can be seen as a worst-case analysis. Indeed, it means that the bound \(\left| {R}_{{\mathcal {D}}}(h)-{R}_{{\mathcal {S}}}(h)\right| \le \varepsilon \big (\tfrac{1}{\delta }, \tfrac{1}{m},{\mathcal {H}}\big )\) holds with high probability for all \(h\!\in\,{\mathcal {H}}\), including the best but also the worst. This worst-case analysis makes it hard to obtain a non-vacuous bound, i.e., one with \(\varepsilon (\frac{1}{\delta }, \frac{1}{m}, {\mathcal {H}})<1\). Note that the ability of such bounds to explain the generalization of deep learning has been recently challenged (Nagarajan & Kolter, 2019b).

2.3 Algorithmic-dependent bounds

A potential drawback of the Uniform Convergence bounds is that they are independent of the learning algorithm, i.e., they do not take into account the way the hypothesis space is explored. To tackle this issue, algorithmic-dependent bounds have been proposed to take advantage of some particularities of the learning algorithm, such as its uniform stability (Bousquet & Elisseeff, 2002) or robustness (Xu & Mannor, 2012). In this case, the bounds obtained hold for a single hypothesis \(h_{L({\mathcal {S}})}\), the one learned with the algorithm L from the learning sample \({\mathcal {S}}\). The form of such bounds is:

$$\begin{aligned} {\mathbb {P}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m}} \Big [ \left| {R}_{{\mathcal {D}}}(h_{L({\mathcal {S}})}){-}{R}_{{\mathcal {S}}}(h_{L({\mathcal {S}})})\right| \le \varepsilon \big (\tfrac{1}{\delta }, \tfrac{1}{m}, L \big ) \Big ]\ge 1-\delta . \end{aligned}$$

For example, this approach has been used by Hardt et al. (2016) to derive generalization bounds for hypotheses learned by stochastic gradient descent.

2.4 PAC-Bayesian bound

This paper leverages PAC-Bayesian bounds that stand in the PAC framework but borrow inspiration from the Bayesian probabilistic view that deals with randomness and uncertainty in machine learning (McAllester, 1998). In the PAC-Bayesian setting, we consider a prior distribution \({\mathcal {P}}\!\in\,{\mathcal {M}}^{*}({\mathcal {H}})\subseteq {\mathcal {M}}({\mathcal {H}})\) on \({\mathcal {H}}\), with \({\mathcal {M}}^{*}({\mathcal {H}})\) the set of strictly positive probability densities. This distribution encodes an a priori belief on \({\mathcal {H}}\) before observing the learning sample \({\mathcal {S}}\). Then, given \({\mathcal {S}}\) and the prior \({\mathcal {P}}\), we learn a posterior distribution \({\mathcal {Q}}\!\in\,{\mathcal {M}}({\mathcal {H}})\). In this case, the bounds take the form:

$$\begin{aligned} {\mathbb {P}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m}}\!\Big [\forall {\mathcal {Q}}\in {\mathcal {M}}({\mathcal {H}}),\quad {\mathbb {E}}_{h\sim {\mathcal {Q}}}\!\left| {R}_{{\mathcal {D}}}(h){-}{R}_{{\mathcal {S}}}(h)\right| {\le }\,\varepsilon \big (\tfrac{1}{\delta }{,}\tfrac{1}{m}{,}{\mathcal {Q}}\big ) \Big ]\ge 1-\delta . \end{aligned}$$

A key notion is that the function \(\varepsilon ()\) upper-bounds a \({\mathcal {Q}}\)-weighted expectation over the risks of all classifiers in \({\mathcal {H}}\). Hence, it upper-bounds the risk of a randomized classifier. Such a randomized classifier can be described as follows: to predict the label of an input \({\textbf{x}}\in {\mathcal {X}}\), (i) a hypothesis \(h\in {\mathcal {H}}\) is sampled from \({\mathcal {Q}}\) and (ii) the classifier predicts the label given by \(h({\textbf{x}})\).

We recall below the classical PAC-Bayesian bounds in a general form as proposed by Germain et al. (2009); Bégin et al. (2016). The idea is to express the bound in terms of a generic function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\) that is meant to capture the deviation between the true and the empirical risks, instead of deriving a theorem by settling on a specific measure of deviation such as \(\vert {R}_{{\mathcal {D}}}(h){-}{R}_{{\mathcal {S}}}(h)\vert\). Note that Theorem 1 is expressed in a slightly different form than the original ones; we prove Theorem 1 in Appendix A for the sake of completeness.

Theorem 1

(General PAC-Bayes bounds) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\) on \({\mathcal {H}}\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any \(\delta \in (0,1]\) we have

$$\begin{aligned}&\underbrace{{\mathbb {P}}_{{\mathcal {S}}{\sim }{\mathcal {D}}^{m}}\,\left(\,\begin{array}{l} \forall {\mathcal {Q}}\in {\mathcal {M}}({\mathcal {H}}),\\ {\displaystyle {\mathbb {E}}_{h\sim {\mathcal {Q}}}\!\ln (\phi (h,\!{\mathcal {S}})) \le \textrm{KL}( {\mathcal {Q}}\Vert {\mathcal {P}})\,+\! \ln\,\left[ \frac{1}{\delta }\!{\mathbb {E}}_{{\mathcal {S}}{\sim }{\mathcal {D}}^{m\!}}{\mathbb {E}}_{h{\sim }{\mathcal {P}}}\phi (h,\!{\mathcal {S}})\right] } \end{array}\,\right)\,\ge\, 1{-}\delta }_{\text {(Germain et al., 2009)}}, \end{aligned}$$
(1)
$$\begin{aligned}&\text{ and }\nonumber \\&\underbrace{{\mathbb {P}}_{{\mathcal {S}}{\sim }{\mathcal {D}}^{m}}\,\left(\,\begin{array}{l} \forall {\mathcal {Q}}\in {\mathcal {M}}({\mathcal {H}}),\\ {\displaystyle \tfrac{\alpha }{\alpha {-}1}\! \ln\, \left[ {\mathbb {E}}_{h{\sim }{\mathcal {Q}}}\!\phi (h,\!{\mathcal {S}})\right] \le D_{\alpha }({\mathcal {Q}}\Vert {\mathcal {P}})\,+\! \ln\,\left[ \frac{1}{\delta }\!{\mathbb {E}}_{{\mathcal {S}}{\sim }{\mathcal {D}}^{m\!}}{\mathbb {E}}_{h{\sim }{\mathcal {P}}}\!\phi (h,\!{\mathcal {S}})^\frac{\alpha }{\alpha {-}1}\right]\,} \end{array} \right)\, \ge\, 1{-}\delta }_{\text {(Bégin et al., 2016)}}, \end{aligned}$$
(2)

with \(\textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}}){\triangleq } { {\displaystyle {\mathbb {E}}_{h{\sim }{\mathcal {Q}}}}} \ln \tfrac{{\mathcal {Q}}(h)}{{\mathcal {P}}(h)}\) the Kullback-Leibler (KL-)divergence between \({\mathcal {Q}}\) and \({\mathcal {P}}\), and \(D_{\alpha }({\mathcal {Q}}\Vert {\mathcal {P}}) {\triangleq } \frac{1}{\alpha {-}1}\ln\,\left[ {{\displaystyle {\mathbb {E}}_{h{\sim }{\mathcal {P}}}}}\!\left[\,\frac{ {\mathcal {Q}}(h)}{{\mathcal {P}}(h)}\right] ^{\!\alpha }\right]\) the Rényi divergence between \({\mathcal {Q}}\) and \({\mathcal {P}}\) \((\alpha {>}1)\).

Note that Eq. (2) is more general than Eq. (1). Indeed, the latter is obtained from the former by the three following steps: (i) substituting \(\phi (h, {\mathcal {S}})\) by \(\phi (h, {\mathcal {S}})^{\frac{\alpha -1}{\alpha }}\) in Eq. (2), (ii) applying Jensen’s inequality in order to move the expectation over \({\mathcal {Q}}\) in front of the logarithm, and (iii) taking the limit when \(\alpha\) tends to 1. Note also that the original bound statements of Germain et al. (2009); Bégin et al. (2016) are recovered by choosing a convex function \(\Delta : [0,1]^2 {\rightarrow } {\mathbb {R}}\) that captures a deviation between the true risk \({R}_{{\mathcal {D}}}(h)\) and the empirical risk \({R}_{{\mathcal {S}}}(h)\). Then, two steps are required: (i) setting \(\phi (h,\!{\mathcal {S}}){=}\exp (m\Delta ({R}_{{\mathcal {S}}}(h), {R}_{{\mathcal {D}}}(h)))\) in Eq. (1), or \(\phi (h,\!{\mathcal {S}}){=}\Delta ({R}_{{\mathcal {S}}}(h), {R}_{{\mathcal {D}}}(h))\) in Eq. (2), and then (ii) applying Jensen’s inequality on the left-hand side of the inequality. In fact, our proofs follow the exact same steps as those of Germain et al. (2009, Th.2.1) and Bégin et al. (2016, Th.9), but instead of starting from \(\Delta ({R}_{{\mathcal {S}}}(h), {R}_{{\mathcal {D}}}(h))\), we consider the slightly more general expression \(\phi (h, {\mathcal {S}})\) from the beginning.
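To make steps (i)–(iii) explicit, substituting \(\phi (h,{\mathcal {S}})^{\frac{\alpha -1}{\alpha }}\) into Eq. (2) and applying Jensen’s inequality to its left-hand side gives

$$\begin{aligned} {\mathbb {E}}_{h\sim {\mathcal {Q}}}\ln \phi (h,{\mathcal {S}}) = \frac{\alpha }{\alpha -1}{\mathbb {E}}_{h\sim {\mathcal {Q}}}\ln \left[ \phi (h,{\mathcal {S}})^{\frac{\alpha -1}{\alpha }}\right] \le \frac{\alpha }{\alpha -1}\ln \left[ {\mathbb {E}}_{h\sim {\mathcal {Q}}}\,\phi (h,{\mathcal {S}})^{\frac{\alpha -1}{\alpha }}\right] \le D_{\alpha }({\mathcal {Q}}\Vert {\mathcal {P}}) + \ln \left[ \frac{1}{\delta }{\mathbb {E}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m}}{\mathbb {E}}_{h\sim {\mathcal {P}}}\,\phi (h,{\mathcal {S}})\right] , \end{aligned}$$

since \(\big (\phi ^{\frac{\alpha -1}{\alpha }}\big )^{\frac{\alpha }{\alpha -1}}{=}\phi\); taking the limit \(\alpha \rightarrow 1^{+}\), where \(D_{\alpha }({\mathcal {Q}}\Vert {\mathcal {P}})\rightarrow \textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}})\), then recovers Eq. (1).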

The advantage of Theorem 1 is that it can be used as a starting point for deriving different forms of bounds. For instance, for a loss function \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } [0, 1]\) with \(\phi (h,\!{\mathcal {S}}){=} \exp \left( m\Delta ({R}_{{\mathcal {S}}}(h), {R}_{{\mathcal {D}}}(h))\right)\) and \(\Delta ({R}_{{\mathcal {S}}}(h), {R}_{{\mathcal {D}}}(h))=2[{R}_{{\mathcal {S}}}(h){-}{R}_{{\mathcal {D}}}(h)]^2\) we retrieve from Eq.  (1) the bound proposed by McAllester (1998):

$$\begin{aligned}&{\mathbb {P}}_{{\mathcal {S}}{\sim }{\mathcal {D}}^{m}}\,\left( \forall {\mathcal {Q}},\ \left| {\mathbb {E}}_{{h\sim {\mathcal {Q}}}}{R}_{{\mathcal {S}}}(h)-{\mathbb {E}}_{{h\sim {\mathcal {Q}}}}{R}_{{\mathcal {D}}}(h)\right| \le \sqrt{ \frac{\textrm{KL}( {\mathcal {Q}}\Vert {\mathcal {P}})+\ln \tfrac{2\sqrt{m}}{\delta }}{2m}} \right)\, \ge\, 1{-}\delta \\ \Longrightarrow&{\mathbb {P}}_{{\mathcal {S}}{\sim }{\mathcal {D}}^{m}}\,\left( \forall {\mathcal {Q}},\ {\mathbb {E}}_{{h\sim {\mathcal {Q}}}}{R}_{{\mathcal {D}}}(h) \le {\mathbb {E}}_{{h\sim {\mathcal {Q}}}}{R}_{{\mathcal {S}}}(h) + \sqrt{ \frac{\textrm{KL}( {\mathcal {Q}}\Vert {\mathcal {P}})+\ln \tfrac{2\sqrt{m}}{\delta }}{2m}} \right)\, \ge\, 1{-}\delta . \end{aligned}$$

This bound illustrates the trade-off between the average empirical risk and \(\textstyle \varepsilon \big (\tfrac{1}{\delta }{,}\tfrac{1}{m}{,}{\mathcal {Q}}\big ) = \sqrt{ \frac{1}{2m} (\textrm{KL}( {\mathcal {Q}}\Vert {\mathcal {P}})+\ln \tfrac{2\sqrt{m}}{\delta })}\). More precisely, the higher m is, the lower \(\textstyle \varepsilon \big (\tfrac{1}{\delta }{,}\tfrac{1}{m}{,}{\mathcal {Q}}\big )\) is, and therefore the smaller the difference between the true risk \({\mathbb {E}}_{h{\sim }{\mathcal {Q}}}{R}_{{\mathcal {D}}}(h)\) and the empirical risk \({\mathbb {E}}_{{h\sim {\mathcal {Q}}}}{R}_{{\mathcal {S}}}(h)\).
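To give a rough numerical feel for this complexity term, here is a minimal sketch (the values of m, \(\textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}})\), and \(\delta\) are arbitrary examples):

```python
from math import log, sqrt

def mcallester_epsilon(kl_qp, m, delta):
    # Complexity term of McAllester's bound:
    # sqrt((KL(Q||P) + ln(2*sqrt(m)/delta)) / (2*m))
    return sqrt((kl_qp + log(2.0 * sqrt(m) / delta)) / (2.0 * m))

# The term shrinks as the sample size m grows.
for m in (1_000, 10_000, 100_000):
    print(m, round(mcallester_epsilon(kl_qp=5.0, m=m, delta=0.05), 4))
```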

Another example, leading to a slightly tighter but less interpretable bound, is the bound of Seeger (2002) and Maurer (2004), which we retrieve with \(\phi (h,\!{\mathcal {S}}){=} \exp \left( m\,\Delta ({R}_{{\mathcal {S}}}(h),{R}_{{\mathcal {D}}}(h))\right)\) and \(\Delta ({R}_{{\mathcal {S}}}(h),{R}_{{\mathcal {D}}}(h))=\textrm{kl}[{R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h)]\):

$$\begin{aligned} {\mathbb {P}}_{{\mathcal {S}}{\sim }{\mathcal {D}}^{m}}\,\left( \forall {\mathcal {Q}},\ {\mathbb {E}}_{{h\sim {\mathcal {Q}}}} \textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\le \frac{\textrm{KL}( {\mathcal {Q}}\Vert {\mathcal {P}})+\ln \tfrac{2\sqrt{m}}{\delta }}{m} \right)\, \ge\, 1{-}\delta , \end{aligned}$$
(3)

where

$$\begin{aligned} \textrm{kl}(q\Vert p)= q\ln \tfrac{q}{p}{+}(1{-}q)\ln \tfrac{1{-}q}{1{-}p} \end{aligned}$$
(4)

is the KL divergence between two Bernoulli distributions of parameters q and p.
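Bounds like Eq. (3) control \(\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\) rather than \({R}_{{\mathcal {D}}}(h)\) directly; a bound on the true risk is then obtained by inverting \(\textrm{kl}\) in its second argument. Below is a minimal sketch of this standard inversion by bisection (the function names are ours):

```python
from math import log

def kl_bernoulli(q, p, eps=1e-12):
    # kl(q||p) between Bernoulli(q) and Bernoulli(p), Eq. (4),
    # with a small guard against log(0).
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

def kl_inverse(q, bound, tol=1e-9):
    # Largest p >= q such that kl(q||p) <= bound, found by bisection;
    # if kl(R_S(h)||R_D(h)) <= bound, then R_D(h) <= kl_inverse(R_S(h), bound).
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(q, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# e.g., an empirical risk of 0.05 and a right-hand side of 0.01:
print(kl_inverse(0.05, 0.01))  # ~0.087
```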

Such PAC-Bayesian bounds are known to be tight (e.g., Pérez-Ortiz et al. (2021); Zantedeschi et al. (2021)), but they hold for a randomized classifier by nature (due to the expectation on \({\mathcal {H}}\)). A key issue for usual machine learning tasks is then the derandomization of the PAC-Bayesian bounds to obtain a guarantee for a deterministic classifier instead of a randomized one (by removing the expectation on \({\mathcal {H}}\)). In some cases, this derandomization results from the structure of the hypotheses, such as for randomized linear classifiers that can be directly expressed as one deterministic linear classifier (Germain et al., 2009). However, in other cases, the derandomization is much more complex and specific to the class of hypotheses, such as for neural networks (e.g., Neyshabur et al. (2018), Nagarajan and Kolter (2019b, Ap. J), Biggs and Guedj (2022)).

The next section states our main contribution, which is a general derandomization framework (based on the Rényi divergence) for disintegrating PAC-Bayesian bounds into a bound for a single hypothesis from \({\mathcal {H}}\).

3 Disintegrated PAC-Bayesian theorems

3.1 Form of a disintegrated PAC-Bayes bound

First, we recall another kind of bound introduced by Blanchard and Fleuret (2007) and Catoni (2007, Th.1.2.7) and referred to as the disintegrated PAC-Bayesian bound. Its form is:

$$\begin{aligned} {\mathbb {P}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m},\, h\sim {\mathcal {Q}}_{{\mathcal {S}}}}\ \Big ( \left| {R}_{{\mathcal {D}}}(h)-{R}_{{\mathcal {S}}}(h)\right| \le \varepsilon \big (\tfrac{1}{\delta }, \tfrac{1}{m},{\mathcal {Q}}_{{\mathcal {S}}}\big ) \Big )\ge 1-\delta , \end{aligned}$$
(5)

where \({\mathcal {Q}}_{{\mathcal {S}}}{\triangleq }A({\mathcal {S}}, {\mathcal {P}})\) with \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\) a deterministic algorithm chosen a priori which (i) takes a learning sample \({\mathcal {S}}\!\in\,{\mathcal {Z}}^{m}\) and a prior distribution \({\mathcal {P}}\) as inputs, and (ii) outputs a data-dependent distribution \({\mathcal {Q}}_{{\mathcal {S}}}{\triangleq }A({\mathcal {S}}, {\mathcal {P}})\) from the set \({\mathcal {M}}({\mathcal {H}})\) of all possible probability densities on \({\mathcal {H}}\). Concretely, this kind of generalization bound allows one to derandomize the usual PAC-Bayes bounds as follows. Instead of considering a bound holding for all the posterior distributions on \({\mathcal {H}}\) as usually done in PAC-Bayes (the “\(\,\forall {\mathcal {Q}}\,\)” in Theorem 1), we consider only the posterior distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) obtained through a deterministic algorithm A taking the learning sample \({\mathcal {S}}\) and the prior \({\mathcal {P}}\) as inputs. Then, the above bound holds for a unique hypothesis \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\) instead of the randomized classifier: the individual risks are no longer averaged with respect to \({\mathcal {Q}}_{{\mathcal {S}}}\); this is the PAC-Bayesian bound disintegration. The dependence in probability on \({\mathcal {Q}}_{{\mathcal {S}}}\) means that the bound is valid with probability at least \(1{-}\delta\) over the random choice of the learning sample \({\mathcal {S}}{\sim }{\mathcal {D}}^m\) and the hypothesis \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\). Under this principle, we introduce in Theorems 2 and 4 below two new general disintegrated PAC-Bayesian bounds. A key asset of our results is that the bounds are instantiable to specific settings as for the “classical” PAC-Bayesian bounds (e.g., with i.i.d./non-i.i.d. data, unbounded losses, etc.): to instantiate the bound, one has to instantiate the function \(\phi\). Note that, except for our bound and the one of Rivasplata et al. (2020, Th.1(i)), the disintegrated bounds of the literature, introduced by Blanchard and Fleuret (2007) and Catoni (2007, Th.1.2.7), do not depend on such a general function \(\phi\). With an appropriate instantiation, we obtain an easily optimizable bound, leading to a self-bounding algorithm (Freund, 1998) with theoretical guarantees. As an illustration of the usefulness of our results, we provide, in Sect. 4, such an instantiation for neural networks.

3.2 Disintegrated PAC-Bayesian bounds with the Rényi divergence

3.2.1 Our main contribution: a general disintegrated bound

In the same spirit as Eq. (2), our main result, stated in Theorem 2, is a general bound involving the Rényi divergence \(D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) of order \(\alpha\,>\!1\).

Theorem 2

(General Disintegrated PAC-Bayes Bound) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any \(\alpha\,>\!1\), for any \(\delta \in (0,1]\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have

$$\begin{aligned} &{\mathbb {P}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m},h\sim {\mathcal {Q}}_{{\mathcal {S}}}}\!\Bigg (\!\frac{\alpha }{\alpha {-}1}\ln \left( \phi (h,\!{\mathcal {S}})\right) \\&\quad \le \ {\frac{2\alpha {-}1}{\alpha {-}1}}\ln \frac{2}{\delta } +D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}){+} \ln \left[ {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\left( \phi (h'\!, {\mathcal {S}}')^{\frac{\alpha }{\alpha {-}1}}\right) \right]\,\Bigg )\,\ge\, 1{-}\delta , \end{aligned}$$

where \({\mathcal {Q}}_{{\mathcal {S}}}{\triangleq }A({\mathcal {S}}, {\mathcal {P}})\) is output by the deterministic algorithm A.

Proof

(Proof sketch (see Appendix B for details)) Recall that \({\mathcal {Q}}_{{\mathcal {S}}}\) is obtained with the algorithm \(A({\mathcal {S}}, {\mathcal {P}})\). Applying Markov’s inequality on \(\phi (h,\!{\mathcal {S}})\) with the random variable h and using Hölder’s inequality to introduce \(D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\), we have, with probability at least \(1{-}\tfrac{\delta }{2}\) on \({\mathcal {S}}\!\sim\, {\mathcal {D}}^m\) and \(h\!\sim\,{\mathcal {Q}}_{{\mathcal {S}}}\),

$$\begin{aligned} \frac{\alpha }{\alpha {-}1}\ln \left[ \phi (h,\!{\mathcal {S}})\right] \ {}&\le \ \frac{\alpha }{\alpha {-}1}\ln \left[ \frac{2}{\delta }{\mathbb {E}}_{h'{\sim } {\mathcal {Q}}_{{\mathcal {S}}}}\phi (h'\!, {\mathcal {S}})\right] \\&\le \ D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}) + \frac{\alpha }{\alpha {-}1}\ln \frac{2}{\delta } +\ln \left[ {\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\left( \phi (h'\!, {\mathcal {S}})^{\frac{\alpha }{\alpha -1}}\right) \right] . \end{aligned}$$

Applying Markov’s inequality a second time, now with respect to the random variable \({\mathcal {S}}\), we have, with probability at least \(1{-}\tfrac{\delta }{2}\) on \({\mathcal {S}}\!\sim\, {\mathcal {D}}^m\) and \(h\!\sim\,{\mathcal {Q}}_{{\mathcal {S}}}\),

$$\begin{aligned} \ln \left[ {\mathbb {E}}_{h'{\sim }{\mathcal {P}}} \left( \phi (h'\!, {\mathcal {S}})^{\frac{\alpha }{\alpha {-}1}}\right) \right] \le \ln \left[ \frac{2}{\delta }{\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\left( \phi (h'\!, {\mathcal {S}}')^{\frac{\alpha }{\alpha {-}1}}\right) \right] . \end{aligned}$$

Lastly, we combine the two bounds with a union-bound argument. \(\square\)
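For concreteness, the Hölder step above is a change of measure from \({\mathcal {Q}}_{{\mathcal {S}}}\) to \({\mathcal {P}}\): with the conjugate exponents \(\alpha\) and \(\frac{\alpha }{\alpha -1}\),

$$\begin{aligned} {\mathbb {E}}_{h'\sim {\mathcal {Q}}_{{\mathcal {S}}}}\phi (h'\!,{\mathcal {S}}) = {\mathbb {E}}_{h'\sim {\mathcal {P}}}\left[ \frac{{\mathcal {Q}}_{{\mathcal {S}}}(h')}{{\mathcal {P}}(h')}\,\phi (h'\!,{\mathcal {S}})\right] \le \left( {\mathbb {E}}_{h'\sim {\mathcal {P}}}\left[ \frac{{\mathcal {Q}}_{{\mathcal {S}}}(h')}{{\mathcal {P}}(h')}\right] ^{\alpha }\right) ^{\frac{1}{\alpha }}\left( {\mathbb {E}}_{h'\sim {\mathcal {P}}}\,\phi (h'\!,{\mathcal {S}})^{\frac{\alpha }{\alpha -1}}\right) ^{\frac{\alpha -1}{\alpha }}, \end{aligned}$$

and taking \(\frac{\alpha }{\alpha -1}\ln\) of both sides produces exactly the terms \(D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) and \(\ln \big [{\mathbb {E}}_{h'\sim {\mathcal {P}}}\,\phi (h'\!,{\mathcal {S}})^{\frac{\alpha }{\alpha -1}}\big ]\) appearing in the second line of the proof sketch.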

As for the general classical PAC-Bayesian bounds (Theorem 1), the above theorem can be seen as the starting point of the derivation of generalization bounds depending on the choice of the function \(\phi\), as done in Corollary 6 in Sect. 4.1; this property makes it the main result of our paper.

In its proof, Hölder’s inequality is used differently than in the proofs of classic PAC-Bayes bounds. Indeed, in Bégin et al. (2016, Th. 8), the change of measure based on Hölder’s inequality is key for deriving a bound that holds for all posteriors \({\mathcal {Q}}\) with high probability, while our bound holds for a unique posterior \({\mathcal {Q}}_{{\mathcal {S}}}\) dependent on the sample \({\mathcal {S}}\) and the prior \({\mathcal {P}}\). In fact, we use Hölder’s inequality to introduce a prior \({\mathcal {P}}\) independent of \({\mathcal {S}}\): a crucial point for our bound instantiated in Corollary 6.

Compared to Eq. (2), our bound involves the term \({\frac{2\alpha {-}1}{\alpha {-}1}}\ln \frac{2}{\delta }\) instead of \(\ln \frac{1}{\delta }\), i.e., an additional constant of \({\frac{2\alpha {-}1}{\alpha {-}1}}\ln \frac{2}{\delta }-\ln \frac{1}{\delta }=\ln 2{+}\frac{\alpha }{\alpha {-}1}\ln \tfrac{2}{\delta }\). When \(\alpha =2\), this constant equals \(\ln \frac{8}{\delta ^2}\), which turns out to be a reasonable cost to “derandomize” a bound into a disintegrated one, as typical choices for \(\phi (h,\!{\mathcal {S}})\) will make the constant’s imprint on the bound value decay with m. Indeed, like the bounds of Theorem 1, the bound of Theorem 2 tightens as m increases, provided that \(\phi (h,{\mathcal {S}})\) is chosen wisely. For instance, by setting \(\phi (h,\!{\mathcal {S}})=\exp (\tfrac{\alpha -1}{\alpha }m\,\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h)))\) with \(\textrm{kl}(\cdot \Vert \cdot )\) defined by Eq. (4), the bound depends on m and converges as m increases (see Sect. 4). Moreover, the tightness of the bound also depends on the deviation between \({\mathcal {Q}}_{{\mathcal {S}}}\) and \({\mathcal {P}}\), which makes the bound tightest when \({\mathcal {Q}}_{{\mathcal {S}}}={\mathcal {P}}\).

We instantiate below Theorem 2 in the limit cases \(\alpha {\rightarrow }1^+\) and \(\alpha {\rightarrow }{+}{\infty }\), showing that the bound converges in both regimes.

Corollary 3

Under the assumptions of Theorem 2, when \(\alpha {\rightarrow }1^+\), we have

$$\begin{aligned} {\mathbb {P}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m},h\sim {\mathcal {Q}}_{{\mathcal {S}}}}\!\Bigg ( \ln \phi (h{,}{\mathcal {S}}) \le \ln \frac{2}{\delta } + \ln \left[ {{\,\textrm{esssup}\,}}_{{\mathcal {S}}'\in {\mathcal {Z}}^m, h'\in {\mathcal {H}}}\phi (h'{,} {\mathcal {S}}')\right] \Bigg )\!\ge\, 1{-}\delta , \end{aligned}$$

when \(\alpha {\rightarrow }+\infty\), we have

$$\begin{aligned} {\mathbb {P}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m},h\sim {\mathcal {Q}}_{{\mathcal {S}}}}\!\Bigg (\ln \phi (h{,} {\mathcal {S}})\le \ln {\displaystyle {{\,\textrm{esssup}\,}}_{h'\in {\mathcal {H}}}}\,\frac{{\mathcal {Q}}_{{\mathcal {S}}}(h')}{{\mathcal {P}}(h')}{+}\ln\,\left[ \frac{4}{\delta ^2} {\displaystyle {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\phi (h'{,}{\mathcal {S}}')}\right] \Bigg )\!\ge\, 1{-}\delta , \end{aligned}$$

where \({{\,\textrm{esssup}\,}}\) is the essential supremum, i.e., the smallest upper bound that holds up to sets of probability zero:

$$\begin{aligned}&{{\,\textrm{esssup}\,}}_{{\mathcal {S}}'\in {\mathcal {Z}}^m, h'\in {\mathcal {H}}}\phi (h'{,} {\mathcal {S}}')=\inf \left\{ \tau \in {\mathbb {R}}, {\mathbb {P}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m},h\sim {\mathcal {Q}}_{{\mathcal {S}}}}\!\Big [ \phi (h{,} {\mathcal {S}})> \tau \Big ] = 0\right\} ,\\ \text {and}\quad&{{\,\textrm{esssup}\,}}_{h'\in {\mathcal {H}}}\,\frac{{\mathcal {Q}}_{{\mathcal {S}}}(h')}{{\mathcal {P}}(h')}=\inf \left\{ \tau \in {\mathbb {R}}, {\mathbb {P}}_{h\sim {\mathcal {Q}}_{{\mathcal {S}}}}\!\Big [ \tfrac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)} > \tau \Big ] = 0\right\} . \end{aligned}$$

This corollary illustrates that the parameter \(\alpha\) controls the trade-off between the Rényi divergence \(D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) and \(\ln \left[ {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\phi (h'\!, {\mathcal {S}}')^{\frac{\alpha }{\alpha {-}1}}\right]\). Indeed, when \(\alpha {\rightarrow }1^+\), the Rényi divergence vanishes while the other term converges toward \(\ln\,\left[ {{\,\textrm{esssup}\,}}_{{\mathcal {S}}'\in {\mathcal {Z}}^m, h'\in {\mathcal {H}}}\phi (h'\!, {\mathcal {S}}')\right]\), roughly speaking the maximal possible value of the second term. On the other hand, when \(\alpha {\rightarrow }{+}\infty\), the Rényi divergence increases and converges toward \(\ln {{\,\textrm{esssup}\,}}_{h'\in {\mathcal {H}}}\frac{{\mathcal {Q}}_{{\mathcal {S}}}(h')}{{\mathcal {P}}(h')}\) while the other term decreases toward \(\ln\,\left[ {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\phi (h'{,}{\mathcal {S}}')\right]\).

3.2.2 Comparison with the bound of Rivasplata et al. (2020)

For the sake of comparison, we recall in Eq. (6) the bound proposed by Rivasplata et al. (2020, Th.1(i)), which is more general than the bounds of Blanchard and Fleuret (2007) and Catoni (2007, Th.1.2.7):

$$\begin{aligned} {\mathbb {P}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m}, h\sim {\mathcal {Q}}_{{\mathcal {S}}}}\Bigg (\! \ln (\phi (h,\!{\mathcal {S}})) \le \ln \frac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)} +\ln \left( \frac{1}{\delta }\displaystyle {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\phi (h'{,} {\mathcal {S}}')\right)\,\Bigg ) \ge 1{-}\delta . \end{aligned}$$
(6)

The term \(\ln\,\tfrac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}\) (also involved in Catoni (2007); Blanchard and Fleuret (2007)) can be seen as a “disintegrated KL divergence” depending only on the sampled \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\). In contrast, our bound involves the Rényi divergence \(D_\alpha ({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) between the prior \({\mathcal {P}}\) and the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\), meaning that only one term of our bound depends on the sampled hypothesis (the risk): the divergence value is the same whatever the hypothesis. Our bound is expected to be looser because of the Rényi divergence (see van Erven & Harremoës, 2014) and the dependence on \(\delta\) (which is worse than in Eq. (6)). Nevertheless, our divergence term is the main advantage of our bound. Indeed, as confirmed by our experiments (Sect. 5), our bound with \(D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) makes the learning procedure (in our self-bounding algorithm) more stable and efficient than the optimization of Eq. (6), whose term \(\ln \tfrac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}\) is subject to high variance.

3.2.3 A parameterizable general disintegrated bound

In the PAC-Bayesian literature, parametrized bounds have been introduced (e.g., Catoni (2007); Thiemann et al. (2017)) to control the trade-off between the empirical risk and the divergence along with the additional term. For the sake of completeness, we follow a similar approach and now provide a parametrized version of our disintegrated Rényi divergence-based bound, enlarging its practical scope.

Theorem 4

(Parametrizable Disintegrated PAC-Bayes Bound) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any \(\delta \in (0,1]\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have

$$\begin{aligned}&{\mathbb {P}}_{\begin{array}{c} {\mathcal {S}}\sim {\mathcal {D}}^{m},\\ h\sim {\mathcal {Q}}_{{\mathcal {S}}} \end{array}}\,\Bigg (\!\forall \lambda {>}0,\,\ln \left( \phi (h,\!{\mathcal {S}})\right) {\le } \ln\,\bigg [ \frac{\lambda }{2}\displaystyle e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})}{+} \frac{8}{2\lambda \delta ^3}{\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\!\Big [\phi (h'\!, {\mathcal {S}}')^2\Big ] \bigg ]\!\Bigg )\ge 1{-}\delta , \end{aligned}$$

where \({\mathcal {Q}}_{{\mathcal {S}}}{\triangleq }A({\mathcal {S}},{\mathcal {P}})\) is output by the deterministic algorithm A.

Note that \(e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})}\) is closely related to the \(\chi ^2\)-distance. Indeed, we have \(\chi ^2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}) \triangleq {\mathbb {E}}_{h\sim {\mathcal {P}}}\left[ \tfrac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}\right] ^2\,{-}1 = e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})} {-}1\). An asset of Theorem 4 is the parameter \(\lambda\) controlling the trade-off between the exponentiated Rényi divergence \(e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})}\) and \(\frac{1}{\delta ^3}{{\mathbb {E}}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{{\mathbb {E}}}_{h'{\sim }{\mathcal {P}}}\phi (h'\!, {\mathcal {S}}')^2\). Our bound is valid for all \(\lambda\,>\!0\); thus, from a practical point of view, we can learn/tune the parameter \(\lambda\) to minimize the bound and control the possible numerical instability due to \(e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})}\). Indeed, if \(D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) is large, the numerical computation can lead to an infinite value due to finite-precision arithmetic. It is important to notice that, as for other parametrized bounds (e.g., Thiemann et al., 2017), there exists a closed-form solution for the optimal parameter \(\lambda\) (for a fixed \({\mathcal {P}}\) and \({\mathcal {Q}}_{{\mathcal {S}}}\)); the solution is derived in Proposition 5 and shows that the optimal bound of Theorem 4 corresponds to the bound of Theorem 2.

Proposition 5

For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\) on \({\mathcal {H}}\), for any \(\delta {\in }(0,1]\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), let

$$\begin{aligned}&\lambda ^* {=} {{\,\textrm{argmin}\,}}_{\lambda >0} \, \ln\, \left[ \frac{\lambda }{2} e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})}\!+\! \frac{\displaystyle {\mathbb {E}}_{{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}} {\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\,\left[ 8\phi (h'\!, {\mathcal {S}}')^2\right] }{2\lambda \delta ^3} \right] ,\\ \text{ then, } \text{ we } \text{ have }\quad&\overbrace{2\ln\,\left[ \frac{\lambda ^*}{2}e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})}\!+\! {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim } {\mathcal {P}}} \left( \frac{8\phi (h'\!, {\mathcal {S}}')^2}{2\lambda ^*\delta ^3}\right) \right] }^{\text {Bound of Theorem 4}} \\&= \ \underbrace{D_{2}({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}) + \ln \left[ {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim } {\mathcal {P}}} \left( \frac{8\phi (h'\!, {\mathcal {S}}')^{2}}{\delta ^3}\right) \right] }_{\text {Bound of Theorem 2 with } \alpha =2}, \end{aligned}$$

where \(\displaystyle \lambda ^* = \sqrt{\frac{{\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{{\mathbb {E}}}_{{h'{\sim }{\mathcal {P}}}}\left[ 8\phi (h'\!, {\mathcal {S}}')^2\right] }{\delta ^3 \exp (D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}))}}\).

Put into words, the optimal \(\lambda ^*\) makes the bound of Theorem 4 coincide with the bound of Theorem 2 instantiated with \(\alpha =2\).
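This equality is easy to check numerically; here is a minimal sketch where an arbitrary value stands in for the (in practice unknown) moment term \({\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}[8\phi (h'\!,{\mathcal {S}}')^2]\):

```python
from math import exp, log, sqrt

def theorem4_bound(lam, d2, E8phi2, delta):
    # Twice the right-hand side of Theorem 4 (both sides of
    # Proposition 5 bound 2*ln(phi(h, S))):
    # 2 * ln(lam/2 * e^{D_2} + E[8 phi^2] / (2 * lam * delta^3))
    return 2.0 * log(lam / 2.0 * exp(d2) + E8phi2 / (2.0 * lam * delta**3))

d2, E8phi2, delta = 1.5, 4.0, 0.05              # arbitrary example values
lam_star = sqrt(E8phi2 / (delta**3 * exp(d2)))  # closed-form optimum

lhs = theorem4_bound(lam_star, d2, E8phi2, delta)
rhs = d2 + log(E8phi2 / delta**3)               # Theorem 2 with alpha = 2
print(lhs, rhs)                                 # identical up to rounding
```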

4 The disintegration in action

So far, we have introduced theoretical results to derandomize PAC-Bayesian bounds through a disintegration approach. Indeed, the disintegration allows us to obtain a bound for a unique model sampled from the distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) instead of a bound on the averaged risk of the models. In this section, we illustrate the instantiation and the usefulness of Theorem 2 on neural networks, in comparison with the classical PAC-Bayesian bounds.

4.1 Specialization to neural network classifiers

We consider Neural Networks (NN) parametrized by a weight vector \({\textbf{w}}\!\in\,{\mathbb {R}}^{d}\) in the overparametrized regime, i.e., \(d\!\gg\,m\). We aim to learn the weights of the NN leading to the lowest true risk. Practitioners usually proceed by epochs and obtain one “intermediate” NN after each epoch. Then, they select the “intermediate” NN associated with the lowest validation risk. We propose translating this practice into our PAC-Bayesian setting by considering one prior per epoch. Given T epochs, we hence have T priors \({\textbf {P}}{=}\{{\mathcal {P}}_t\}_{t=1}^T\), where \(\forall t\!\in\,\{1,\ldots ,T\}, {\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2\textbf{I}_{d})\) is a Gaussian distribution centered at \({\textbf{v}}_t\) (the weights associated with the t-th “intermediate” NN) with covariance matrix \(\sigma ^2\textbf{I}_{d}\) (where \(\textbf{I}_{d}\) is the \(d{\times }d\)-dimensional identity matrix). Assuming the T priors are learned from a set \({\mathcal {S}}_{\text {prior}}\) such that \({\mathcal {S}}_{\text {prior}} \cap {\mathcal {S}}{=} \emptyset\), Corollaries 6 and 7 will guide us in learning a posterior \({\mathcal {Q}}_{{\mathcal {S}}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\) from a prior \({\mathcal {P}}\in {\textbf {P}}\) by minimizing the empirical risk on \({\mathcal {S}}\) (we give more details on the procedure after the forthcoming corollaries). Note that considering Gaussian distributions has the advantage of simplifying the expression of the KL divergence, and thus is commonly used in the PAC-Bayesian literature for neural networks (e.g., Dziugaite & Roy, 2017; Letarte et al., 2019; Zhou et al., 2019).

Corollary 6 below instantiates Theorem 2 in this neural network setting. Then, for the sake of comparison, Corollary 7 instantiates other disintegrated bounds from the literature; more precisely, Eq. (7) corresponds to Rivasplata et al. (2020)’s bound of Eq. (6), Eq. (8) to Blanchard and Fleuret (2007)’s one, and Eq. (9) to Catoni (2007)’s one.

Corollary 6

For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any set \({\textbf {P}}=\{{\mathcal {P}}_1,\dots , {\mathcal {P}}_T\}\) of T priors on \({\mathcal {H}}\) where \({\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_d)\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}) {\rightarrow } {\mathcal {M}}({\mathcal {H}})\), for any loss \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } [0, 1]\), for any \(\delta {\in }(0,1]\), we have

$$\begin{aligned} {\mathbb {P}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m}, h\sim {\mathcal {Q}}_{{\mathcal {S}}}}\Bigg (\forall&{\mathcal {P}}_t\in {\textbf {P}},\ \textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\le \frac{1}{m}\left[ \frac{\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}}{\sigma ^2}+\ln \frac{16T\sqrt{m}}{\delta ^3}\right]\,\Bigg )\ge 1{-}\delta , \end{aligned}$$

where \(\textrm{kl}(a\Vert b) = a\ln \tfrac{a}{b}+(1{-}a)\ln \tfrac{1-a}{1-b}\), \({\mathcal {Q}}_{{\mathcal {S}}}= {\mathcal {N}}({\textbf{w}}, \sigma ^2{{\textbf{I}}}_d)\), and the hypothesis \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\) is parametrized by \({\textbf{w}}{+}\varvec{\epsilon }\) with \(\varvec{\epsilon }{\sim }{\mathcal {N}}({{\textbf{0}}}, \sigma ^2\textbf{I}_{d})\).
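In practice, the right-hand side of Corollary 6 is direct to evaluate from the learned weight vectors, and the kl inversion of Sect. 2.4 turns it into a true-risk bound. A minimal sketch (function names and example values are ours) could be:

```python
import numpy as np
from math import log, sqrt

def kl_bernoulli(q, p, eps=1e-12):
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

def kl_inverse(q, bound, tol=1e-9):
    # Bisection for the largest p with kl(q||p) <= bound.
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if kl_bernoulli(q, mid) <= bound else (lo, mid)
    return lo

def corollary6_bound(emp_risk, w, v_t, sigma2, m, T, delta):
    # kl(R_S(h) || R_D(h)) <= (||w - v_t||^2 / sigma^2
    #                          + ln(16 T sqrt(m) / delta^3)) / m
    rhs = (np.sum((w - v_t) ** 2) / sigma2 + log(16 * T * sqrt(m) / delta**3)) / m
    return kl_inverse(emp_risk, rhs)

# Arbitrary example: posterior mean close to one of the prior means.
rng = np.random.default_rng(0)
w = rng.normal(size=1000)
v_t = w + 0.001 * rng.normal(size=1000)
print(corollary6_bound(0.05, w, v_t, sigma2=1e-3, m=30_000, T=10, delta=0.05))
```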

Corollary 7

For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any set \({\mathcal {H}}\), for any set \({\textbf {P}}=\{{\mathcal {P}}_1,\dots ,{\mathcal {P}}_T\}\) of T priors on \({\mathcal {H}}\) where \({\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_d)\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), for any loss \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } \{0, 1\}\), for any \(\delta {\in }(0,1]\), with probability at least \(1{-}\delta\) over the learning sample \({\mathcal {S}}{\sim }{\mathcal {D}}^{m}\) and the hypothesis \(h{\sim } {\mathcal {Q}}_{{\mathcal {S}}}\) parametrized by \({\textbf{w}}{+}\varvec{\epsilon }\), we have \(\forall {\mathcal {P}}_t\in {\textbf {P}}\)

$$\begin{aligned}&\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\!\le \, \frac{1}{m}\! \Bigg [\! \frac{\Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert ^2_{2} {-}\Vert \varvec{\epsilon }\Vert ^2_{2}}{2\sigma ^2} {+} \ln\, \frac{2T\!\sqrt{m}}{\delta }\!\Bigg ]\!,\! \end{aligned}$$
(7)
$$\begin{aligned} \forall b\!\in\,{\textbf{B}},\quad&\textrm{kl}_{+}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\!\le \frac{1}{m}\, \Bigg [ \frac{b{+}1}{b}\!\left[ \frac{\Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert ^2_{2} {-}\Vert \varvec{\epsilon }\Vert ^2_{2}}{2\sigma ^2}\right] _{\!+}\!{+} \ln\, \frac{(b{+}1)T\vert {\textbf{B}}\vert }{\delta } \Bigg ]\!,\! \end{aligned}$$
(8)
$$\begin{aligned} \forall c\!\in\,{\textbf{C}},\quad&{R}_{{\mathcal {D}}}(h)\,\le \, \frac{\displaystyle 1{-}\exp \left( {\displaystyle \,{-}c{R}_{{\mathcal {S}}}(h) {-}\frac{1}{m}\,\left[\, \frac{\Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert ^2_{2} {-}\Vert \varvec{\epsilon }\Vert ^2_{2}}{2\sigma ^2} {+} \ln\,\frac{T\vert {\textbf{C}}\vert }{\delta }\!\right]\,}\right) }{1{-}e^{{-}c}}\!,\! \end{aligned}$$
(9)

with \(\left[ x\right] _{+}\!{=}\max (x,0)\), and \(\textrm{kl}_{+}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h)){=}\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\) if \({R}_{{\mathcal {S}}}(h){<}{R}_{{\mathcal {D}}}(h)\) and 0 otherwise. Moreover, \(\varvec{\epsilon }{\sim }{\mathcal {N}}({{\textbf{0}}}, \sigma ^2\textbf{I}_{d})\) is a Gaussian noise such that \({\textbf{w}}{+}\varvec{\epsilon }\) are the weights of \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\) with \({\mathcal {Q}}_{{\mathcal {S}}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\), and \({\textbf{C}}\), \({\textbf{B}}\) are two sets of hyperparameters fixed a priori.

Like the parameter \(\lambda\) of Theorem 4, \(c\!\in\,{\textbf{C}}\) is a hyperparameter that controls a trade-off between the empirical risk \({R}_{{\mathcal {S}}}(h)\) and the term

$$\frac{1}{m}\,\left[\, \frac{\Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert ^2_{2} {-}\Vert \varvec{\epsilon }\Vert ^2_{2}}{2\sigma ^2} {+} \ln\,\frac{T\vert {\textbf{C}}\vert }{\delta }\!\right] .$$

Besides, the parameter \(b\!\in\,{\textbf{B}}\) controls the tightness of the bound. In general, these parameters can be tuned to minimize the bounds of Eq. (8) and Eq. (9); however, there is no closed-form solution for the minimum of these equations. Consequently, our experimental protocol requires minimizing the bounds by gradient descent for each \(b\!\in\,{\textbf{B}}\), respectively \(c\!\in\,{\textbf{C}}\), in order to learn the distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) leading to the lowest bound value. To obtain a tight bound, the divergence between one prior \({\mathcal {P}}_t\!\in\,{\textbf {P}}\) and \({\mathcal {Q}}_{{\mathcal {S}}}\) must be low, i.e., \(\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^2\) (or \(\Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert _{2}^2{-}\Vert \varvec{\epsilon }\Vert ^2_{2}\)) has to be small. One solution is to split the learning sample into two non-overlapping subsets \({\mathcal {S}}_{\text {prior}}\) and \({\mathcal {S}}\), where \({\mathcal {S}}_{\text {prior}}\) is used to learn the prior, while \({\mathcal {S}}\) is used both to learn the posterior and compute the bound. Hence, if we “pre-learn” a good enough prior \({\mathcal {P}}_t\!\in\,{\textbf {P}}\) from \({\mathcal {S}}_{\text {prior}}\), then we can expect to have a low \(\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}\).

[Training Method (algorithm figure): 1) learn T priors \({\mathcal {P}}_1,\dots ,{\mathcal {P}}_T\) from \({\mathcal {S}}_{\text {prior}}\) (one per epoch) and select the prior \({\mathcal {P}}\in {\textbf {P}}\) by early stopping on \({\mathcal {S}}\); 2) learn the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\) from \({\mathcal {S}}\) and \({\mathcal {P}}\) by minimizing the bound.]

At first sight, the selection of the prior weights with \({\mathcal {S}}\) by early stopping may appear to be “cheating”. However, this procedure can be seen as: 1) first constructing \({\textbf {P}}\) from the T “intermediate” NNs learned after each epoch from \({\mathcal {S}}_{\text {prior}}\), then 2) optimizing the bound with the prior that leads to the best risk on \({\mathcal {S}}\). This gives a statistically valid result: since Corollary 6 is valid for every \({\mathcal {P}}_t\!\in\,{\textbf {P}}\), we can select the one we want, in particular the one minimizing \({R}_{{\mathcal {S}}}(h)\) for a sampled \(h\sim {\mathcal {P}}_t\). This heuristic makes sense: it allows us to detect if a prior is concentrated around hypotheses that potentially overfit the learning sample \({\mathcal {S}}_{\text {prior}}\). Usually, practitioners consider this “best” prior as the final NN. In our case, the advantage is that we refine this “best” prior with \({\mathcal {S}}\) to learn the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\). Note that Pérez-Ortiz et al. (2021) have already introduced tight generalization bounds with data-dependent priors for (non-derandomized) stochastic NNs. Nevertheless, the weights of a stochastic NN are, by definition, sampled from the posterior distribution \({\mathcal {Q}}\) for each prediction. In that sense, it is important to mention that stochastic NNs differ from derandomized NNs, where only one model is sampled from \({\mathcal {Q}}_{{\mathcal {S}}}\). Moreover, our training method to learn the prior differs greatly since 1) we learn T NNs (i.e., T priors) instead of only one, and 2) we fix the variance of the Gaussian in the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\). Note that, as illustrated in Sect. 5, fixing the variance is not restrictive: the advantage is that it simplifies the expression of the KL divergence while keeping the bounds tight. To the best of our knowledge, our training method for the prior is new.

4.2 A note about stochastic neural networks

Due to its stochastic nature, PAC-Bayesian theory has been widely explored to study stochastic NNs (e.g., Langford and Caruana (2001); Dziugaite and Roy (2017, 2018); Zhou et al. (2019); Pérez-Ortiz et al. (2021)). In Corollary 8 below, we instantiate the bound of Eq. (1) for stochastic NNs to empirically compare the stochastic and the deterministic NNs associated with the same prior and posterior distributions. We recall that, in this paper, a deterministic NN is a single h sampled from the posterior distribution \({\mathcal {Q}}_{{\mathcal {S}}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\) output by the algorithm A. This means that for each example, the label prediction is performed by the same deterministic NN: the one parametrized by the weights \({\textbf{w}}+\varvec{\epsilon }\!\in\, {\mathbb {R}}^d\). Conversely, the stochastic NN associated with a posterior distribution \({\mathcal {Q}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\) predicts the label of a given example by (i) first sampling h according to \({\mathcal {Q}}\), (ii) then returning the label predicted by h. Thus, the risk of the stochastic NN is the expected risk value \({\mathbb {E}}_{h{\sim }{\mathcal {Q}}}{R}_{{\mathcal {D}}}(h)\), where the expectation is taken over all h sampled from \({\mathcal {Q}}\). We compute the empirical risk of the stochastic NN from a Monte Carlo approximation: (i) we sample n weight vectors, and (ii) we average the risk over the n associated NNs; we denote by \({\mathcal {Q}}^n\) the distribution of such an n-sample. In this context, we obtain the following PAC-Bayesian bound.
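A minimal sketch of this Monte Carlo approximation (the `risk_of` callable and the toy example are placeholders of ours):

```python
import numpy as np

def stochastic_risk_mc(risk_of, w, sigma2, n, seed=0):
    # Monte Carlo estimate of E_{h~Q} R_S(h) for Q = N(w, sigma^2 I):
    # sample n weight vectors and average their empirical risks.
    rng = np.random.default_rng(seed)
    risks = [risk_of(w + np.sqrt(sigma2) * rng.standard_normal(w.shape))
             for _ in range(n)]
    return float(np.mean(risks))

# Toy placeholder risk in [0, 1]: squared distance to some "good" weights.
w_good = np.zeros(50)
risk_of = lambda weights: min(1.0, float(np.mean((weights - w_good) ** 2)))
print(stochastic_risk_mc(risk_of, w=0.1 * np.ones(50), sigma2=1e-4, n=400))
```

Eq. (11) of Corollary 8 below quantifies how far such an n-sample average can be from the exact expectation \({\mathbb {E}}_{h{\sim }{\mathcal {Q}}}{R}_{{\mathcal {S}}}(h)\).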

Corollary 8

For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any \({\mathcal {H}}\), for any set \({\textbf {P}}=\{{\mathcal {P}}_1,\dots , {\mathcal {P}}_T\}\) of T priors on \({\mathcal {H}}\) where \({\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_d)\), for any loss \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } \{0, 1\}\), for any \(\delta {\in }(0,1]\), with probability at least \(1{-}\delta\) over \({\mathcal {S}}{\sim }{\mathcal {D}}^m\) and \(\{h_1,\dots ,h_n\}{\sim }{\mathcal {Q}}^n\), we have simultaneously \(\forall {\mathcal {P}}_t\!\in\,{\textbf {P}},\)

$$\begin{aligned}&\textrm{kl}\!\left( {\mathbb {E}}_{h{\sim }{\mathcal {Q}}}\,{R}_{{\mathcal {S}}}(h)\Vert\, {\mathbb {E}}_{h{\sim }{\mathcal {Q}}}\,{R}_{{\mathcal {D}}}(h)\!\right)\, {\le } \frac{1}{m}\,\left[ \frac{\Vert {\textbf{w}}\!{-}{\textbf{v}}_t\Vert _{2}^{2}}{2\sigma ^2} {+}\ln\,\frac{4T\sqrt{m}}{\delta }\right]\,,\! \end{aligned}$$
(10)
$$\begin{aligned} \,\text{ and } \quad&\textrm{kl}\left( {\frac{1}{n}\sum _{i=1}^{n}\!{R}_{{\mathcal {S}}}(h_i)}\Vert {\mathbb {E}}_{h{\sim }{\mathcal {Q}}}\!{R}_{{\mathcal {S}}}(h)\right) \le \frac{1}{n} \ln \frac{4}{\delta }, \end{aligned}$$
(11)

where \({\mathcal {Q}}={\mathcal {N}}({\textbf{w}}, \sigma ^2{{\textbf{I}}}_d)\) and the hypothesis h sampled from \({\mathcal {Q}}\) is parametrized by \({\textbf{w}}+\varvec{\epsilon }\) with \(\varvec{\epsilon }\sim {\mathcal {N}}({{\textbf{0}}}, \sigma ^2{{\textbf{I}}}_d)\).

This result has two key features that make it a suitable baseline for a fair comparison between disintegrated and classical PAC-Bayesian bounds, thus between deterministic and stochastic NNs. On the one hand, it involves the same terms as Corollary 6. On the other hand, it is close to the bound of Pérez-Ortiz et al. (2021, Sec. 6.2), since (i) we adapt the KL divergence to our setting (i.e., \(\textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}}){=}\tfrac{1}{2\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _2^2\)), and (ii) the bound holds for the T priors thanks to a union-bound argument.

5 Experiments with neural networks

The source code of our experiments is available at https://github.com/paulviallard/MLJ-Disintegrated-PB. We used the PyTorch framework (Paszke et al., 2019).

In this section, we do not seek state-of-the-art performance; in fact, we have a threefold objective: (a) we check if \(50\%/50\%\) is a good choice for splitting the original train set into \(({\mathcal {S}}_{\text {prior}}, {\mathcal {S}})\) (which is the most common split in the PAC-Bayesian literature (Germain et al., 2009; Pérez-Ortiz et al., 2021)); (b) we highlight that our disintegrated bound associated with the deterministic NN is tighter than the randomized bound associated with the stochastic NN (Corollary 8); (c) we show that our disintegrated bound (Corollary 6) is tighter and more stable than the ones based on Rivasplata et al. (2020), Blanchard and Fleuret (2007) and Catoni (2007) (Corollary 7).

5.1 Training method

We follow our Training Method (Sect. 4.1), in which we integrate the direct minimization of all the bounds. We refer to the training method based on the minimization of our bound in Corollary 6 as ours, to the one based on Eq. (7) as rivasplata, to the one based on Eq. (8) as blanchard, and to the one based on Eq. (9) as catoni. stochastic denotes the PAC-Bayesian bound with the prior and posterior distributions obtained from ours. To optimize the bounds with gradient descent, we replace the non-differentiable 0-1 loss with a surrogate: the bounded cross-entropy loss (Dziugaite & Roy, 2018). We made this replacement since cross-entropy minimization works well in practice for neural networks (Goodfellow et al., 2016) and because this loss is bounded between 0 and 1, which is required for the \(\textrm{kl}()\) function. The bounded cross-entropy is defined in a multiclass setting with \(y\!\in\,\{1,2,\ldots \}\) by \(\ell (h, ({\textbf{x}}, y)) {=} -\frac{1}{Z}\ln (\Phi (h({\textbf{x}})[y]))\!\in\, [0, 1]\), where \(h({\textbf{x}})[y]\) is the y-th output of the NN, and \(\forall p\!\in\,[0, 1], \Phi (p) {=} e^{-Z}{+}(1{-}2e^{-Z})p\) (we set \(Z{=}4\), the default parameter of Dziugaite and Roy (2018)). To learn a good enough prior \({\mathcal {P}}\!\in\,{\textbf {P}}\) and the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\), we run our Training Method with two stochastic gradient descent-based algorithms \(A_{\text {prior}}\) and A. Note that the randomness in the stochastic gradient descent algorithm is fixed so that the algorithms are deterministic. In phase 1), algorithm \(A_{\text {prior}}\) learns from \({\mathcal {S}}_{\text {prior}}\) the T priors \({\mathcal {P}}_1,\dots ,{\mathcal {P}}_T\!\in\,{\textbf {P}}\) (i.e., during T epochs) by minimizing the bounded cross-entropy loss. In other words, at the end of epoch t, the weights \({\textbf{v}}_t\) of the classifier are used to define the prior \({\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_{d})\). Then, the best prior \({\mathcal {P}}\!\in\,{\textbf {P}}\) is selected by early stopping on \({\mathcal {S}}\). In phase 2), given \({\mathcal {S}}\) and \({\mathcal {P}}\), algorithm A integrates the direct optimization of the bounds with the bounded cross-entropy loss.
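A minimal PyTorch sketch of this surrogate loss (with the y-th output taken to be a softmax probability, which is our assumption):

```python
import math
import torch

def bounded_cross_entropy(logits, y, Z=4.0):
    # l(h, (x, y)) = -(1/Z) * ln(Phi(p_y)) with
    # Phi(p) = e^{-Z} + (1 - 2 e^{-Z}) p and p_y the (assumed) softmax
    # probability of the correct class; values lie in [0, 1].
    p_y = torch.softmax(logits, dim=1).gather(1, y.unsqueeze(1)).squeeze(1)
    phi = math.exp(-Z) + (1.0 - 2.0 * math.exp(-Z)) * p_y
    return -torch.log(phi) / Z

# Usage on a random batch of 8 examples and 10 classes:
logits, y = torch.randn(8, 10), torch.randint(0, 10, (8,))
print(bounded_cross_entropy(logits, y).mean())
```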

5.2 Optimization procedure in algorithms A and \(A_{\text {prior}}\)

Let \(\varvec{\omega }\) be the mean vector, used as NN weights, of the Gaussian distribution that we optimize. In algorithms A and \(A_{\text {prior}}\), we use the Adam optimizer (Kingma & Ba, 2015), and we sample a noise \({\varvec{\epsilon }}\!\sim\, {\mathcal {N}}(\textbf{0}, \sigma ^2\textbf{I}_{d})\) at each iteration of the optimizer. Then, we forward the examples of the mini-batch through the NN parametrized by the weights \(\varvec{\omega }{+}\varvec{\epsilon }\), and we update \(\varvec{\omega }\) according to the bounded cross-entropy loss. Note that during phase 1), at the end of each epoch t, \({\mathcal {P}}_{t} {=} {\mathcal {N}}(\varvec{\omega }, \sigma ^2{{\textbf{I}}}_{d}) {=} {\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_{d})\), and at the end of phase 2) we have \({\mathcal {Q}}_{{\mathcal {S}}}{=} {\mathcal {N}}(\varvec{\omega }, \sigma ^2{{\textbf{I}}}_{d}) {=} {\mathcal {N}}({\textbf{w}}, \sigma ^2{{\textbf{I}}}_{d})\).
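For concreteness, one iteration of this procedure could look as follows in PyTorch (a sketch under the reparametrization described above; all identifiers are illustrative assumptions, not the authors' code):

```python
import torch

def noisy_step(model, optimizer, x, y, sigma2, loss_fn):
    eps = []
    with torch.no_grad():
        for p in model.parameters():             # perturb: omega -> omega + eps
            e = torch.randn_like(p) * sigma2 ** 0.5
            p.add_(e)
            eps.append(e)
    loss = loss_fn(model(x), y).mean()            # forward with weights omega + eps
    optimizer.zero_grad()
    loss.backward()                               # grad w.r.t. omega + eps equals
                                                  # grad w.r.t. omega (eps is constant)
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)                             # restore the mean vector omega
    optimizer.step()                              # Adam update on omega
    return loss.item()
```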

5.3 Experimental setting

5.3.1 Datasets

We perform our experimental study on three datasets: MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky, 2009). We divide each original train set into two independent subsets \({\mathcal {S}}_{\text {prior}}\) of size \(m_{\text {prior}}\) and \({\mathcal {S}}\) of size m with varying split ratios defined as \(\tfrac{m_{\text {prior}}}{m+m_{\text {prior}}}\in \{0, .1, .2, .3, .4, .5, .6, .7, .8, .9\}\). The test sets denoted by \({\mathcal {T}}\) remain the original ones.
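A minimal sketch of such a split for MNIST follows; the use of random_split with a fixed generator is our choice for reproducibility, not necessarily the authors' implementation:

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

train = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
ratio = 0.5                                  # split ratio m_prior / (m + m_prior)
m_prior = int(ratio * len(train))
S_prior, S = random_split(train, [m_prior, len(train) - m_prior],
                          generator=torch.Generator().manual_seed(42))
```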

5.3.2 Models

For the (Fashion-)MNIST datasets, we train a variant of the All Convolutional Network (Springenberg et al., 2015). The model is a convolutional network with three hidden layers and 96 channels. We use \(5\times 5\) convolutions with a padding of size 1 and a stride of size 1 everywhere, except on the second convolution where we use a stride of size 2. We apply Leaky ReLU activations after each convolution. Lastly, we use a global average pooling of size \(8\times 8\) to obtain the desired output size. The weights are initialized with the Xavier Normal initializer (Glorot & Bengio, 2010), while each bias of size l is initialized uniformly between \(-1/{\sqrt{l}}\) and \(1/\sqrt{l}\). A sketch of this architecture is given below.
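One plausible PyTorch reading of this description is sketched below; the final \(1\times 1\) convolution mapping the 96 channels to the 10 classes is our assumption, since the text specifies only the hidden layers and the pooling:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=5, stride=1, padding=1),   # 28x28 -> 26x26
    nn.LeakyReLU(),
    nn.Conv2d(96, 96, kernel_size=5, stride=2, padding=1),  # 26x26 -> 12x12
    nn.LeakyReLU(),
    nn.Conv2d(96, 96, kernel_size=5, stride=1, padding=1),  # 12x12 -> 10x10
    nn.LeakyReLU(),
    nn.Conv2d(96, 10, kernel_size=1),                       # to 10 classes (our assumption)
    nn.AvgPool2d(kernel_size=8),                            # global average pooling
    nn.Flatten(),                                           # logits of size 10
)
```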

For the CIFAR-10 dataset, we train a ResNet-20 network, i.e., a ResNet network from He et al. (2016) with 20 layers. The weights are initialized with Kaiming Normal initializer (He et al., 2015) and each bias of size l is initialized uniformly between \(-1/{\sqrt{l}}\) and \(1/\sqrt{l}\).

5.3.3 Optimization

For the (Fashion-)MNIST datasets, we learn the parameters of our prior distributions \({\mathcal {P}}_1,\dots ,{\mathcal {P}}_T\) using the Adam optimizer for \(T=10\) epochs with a learning rate of \(10^{-3}\) and a batch size of 32 (the other parameters of Adam are left at their default values). The parameters of the posterior distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) are learned for one epoch with the same batch size and optimizer (except that the learning rate is either \(10^{-4}\) or \(10^{-6}\)). For the CIFAR-10 dataset, the parameters of the priors \({\mathcal {P}}_1,\dots ,{\mathcal {P}}_T\) are learned for \(T=100\) epochs, and those of the posterior distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) for 10 epochs, with a batch size of 32 and the Adam optimizer as well; the learning rate to learn the prior for CIFAR-10 is \(10^{-2}\).

5.3.4 Bounds

For blanchard’s bounds, the set of hyperparameters is defined as \({\textbf{B}}{=}\{ b{\in }{\mathbb {N}}\;\vert \; b{=}\sqrt{x},\ (x{+}1){\le }2\sqrt{m} \}\), i.e., such that blanchard’s bounds can be tighter than rivasplata’s. We fix the set of hyperparameters for catoni as \({\textbf{C}}{=}\left\{ 10^{k} \,\vert\, k{\in }\{-3, -2, \dots , +3\}\right\}\). We try different values of \(\sigma ^2 {\in } \{10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}\) associated with the disintegrated KL divergence \(\ln \frac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}= \frac{1}{2\sigma ^2}\left[ \Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert ^2_{2} {-}\Vert \varvec{\epsilon }\Vert ^2_{2}\right]\), the “normal” Rényi divergence \(D_2({\mathcal {Q}}\Vert {\mathcal {P}}){=}\tfrac{1}{\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}\), and the KL divergence \(\textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}}){=}\tfrac{1}{2\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}\); these closed forms are sketched in code below. For all the figures, the values are averaged over 400 deterministic NNs sampled from \({\mathcal {Q}}_{{\mathcal {S}}}\) (the standard deviation is small and presented in Appendix K). We additionally report as stochastic (Corollary 8) the randomized bound value and the KL divergence \(\textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}}){=}\tfrac{1}{2\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}\) associated with the model learned by ours, meaning that \(n{=}400\) and that the test risk reported for ours also corresponds to the risk of the stochastic NN approximated with these 400 NNs.
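The three closed forms above are straightforward to compute for isotropic Gaussians; here is a sketch (variable names are ours; w, v_t, and eps are flattened weight vectors):

```python
import torch

def divergences(w, v_t, eps, sigma2):
    sq = lambda u: (u ** 2).sum()
    dis_kl = (sq(w + eps - v_t) - sq(eps)) / (2 * sigma2)  # ln Q_S(h)/P(h) at h = w + eps
    renyi2 = sq(w - v_t) / sigma2                          # D_2(Q_S || P)
    kl     = sq(w - v_t) / (2 * sigma2)                    # KL(Q_S || P)
    return dis_kl, renyi2, kl
```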

5.4 Results

Fig. 1: Evolution of the bound values in terms of the split ratio. The x-axis represents the different split ratios, and the y-axis represents the bound values obtained after their optimization using our Training Method. Each row corresponds to a given variance \(\sigma ^2\), and each column corresponds to a dataset (MNIST, Fashion-MNIST, or CIFAR-10). In this figure, we consider a learning rate of \(10^{-6}\).

Fig. 2: Evolution of the bound values in terms of the split ratio. The x-axis represents the different split ratios, and the y-axis represents the bound values obtained after their optimization using our Training Method. Each row corresponds to a given variance \(\sigma ^2\), and each column corresponds to a dataset (MNIST, Fashion-MNIST, or CIFAR-10). In this figure, we consider a learning rate of \(10^{-4}\).

5.4.1 Analysis of the influence of the split ratio between \({\mathcal {S}}_{\text {prior}}\) and \({\mathcal {S}}\)

Figures 1 and 2 study the evolution of the bound values after optimizing the bounds with our Training Method for different parameters. Specifically, the split ratio of the original train set varies from 0.1 to 0.9 (0.1 means that \(m_{\text {prior}}= 0.1(m+m_{\text {prior}})\)), for four variance values \(\sigma ^2\) and two learning rates (\(10^{-6}\) and \(10^{-4}\)). For the sake of readability, we present the detailed results for a split ratio of 0 in Table 1. We first remark that the behavior differs between the two learning rates. On the one hand, for lr=\(10^{-6}\), the mean bound values are close to each other, which is not surprising since the disintegrated KL divergences and the Rényi divergences are close to zero (see Tables 2, 3, 4, 5, 6, 7, 8, 9, 10). Moreover, for MNIST and Fashion-MNIST, there is a trade-off between learning a good prior with \({\mathcal {S}}_{\text {prior}}\) and minimizing a generalization bound with \({\mathcal {S}}\). In this case, the split ratio 0.5 appears to be a good choice to obtain a tight (disintegrated) PAC-Bayesian bound. This ratio is widely used in the PAC-Bayesian literature (see, e.g., in the context of linear classifiers (Germain et al., 2009), majority votes (Zantedeschi et al., 2021), and neural networks (Letarte et al., 2019; Pérez-Ortiz et al., 2021)). On the other hand, when lr=\(10^{-4}\), the mean values of the bounds introduced in the literature (i.e., blanchard, catoni, and rivasplata) tend to increase as the split ratio increases, while the mean values of our bound remain low. Indeed, m decreases as the split ratio increases, which drastically increases the bound value when the disintegrated KL divergence is high. We explain further below why the disintegrated KL divergence can become high for the disintegrated bounds of the literature. To do so, we now restrict our study to a split ratio of 0.5 in order to (i) compare the tightness of the bounds and (ii) understand why the disintegrated bounds of the literature diverge.

5.4.2 Comparison between disintegrated and “classic” bounds

We first compare the “classic” PAC-Bayesian bound (Corollary 8) and our disintegrated PAC-Bayesian bound (Corollary 6). To do so, we fix the variance \(\sigma ^2{=}10^{-3}\) (along with a split ratio of 0.5). We report in Fig. 3 the mean bound values associated with ours (i.e., the Training Method that minimizes our bound) and stochastic (we recall that stochastic is the PAC-Bayesian bound of Corollary 8 on the model learned by ours). Notably, ours leads to tighter bounds than the randomized stochastic, even though the two empirical risks are the same and the KL divergence is smaller than the Rényi one. This looseness is due to the unavoidable sampling according to \({\mathcal {Q}}\) in the randomized PAC-Bayesian bound of Corollary 8 (the higher n, the tighter the bound). Thus, using a disintegrated PAC-Bayesian bound avoids sampling a large number of NNs to obtain a low risk. This confirms that our framework makes sense for practical purposes and has a great advantage in terms of time complexity when computing the bounds.

Fig. 3: The values of the PAC-Bayes bound (Corollary 8) and of the disintegrated bound (Corollary 6) for learning rates of \(10^{-4}\) and \(10^{-6}\) and a split ratio of 0.5. The y-axis shows the values of the bounds (the hatched bar for ours (Corollary 6) and the white bar for stochastic (Corollary 8)) and the test risks \({R}_{{\mathcal {T}}}(h)\) (gray shaded bar). We also report the values of the empirical risk \({R}_{{\mathcal {S}}}(h)\), the Rényi divergence (associated with ours’ bound), and the KL divergence (associated with stochastic’s bound).

Fig. 4: The values of the bounds (hatched bars) and the test risks (colored bars) for Corollary 6 (“ours”) and Corollary 7 (“catoni”, “rivasplata”, and “blanchard”) in two different settings, i.e., with learning rates of \(10^{-6}\) and \(10^{-4}\) and a split ratio of 0.5. We also plot the values of the bounds (the dashed lines) and the test risks (the dotted lines) before executing Step 2) of our Training Method. The y-axis shows the values of the bounds and the test risks \({R}_{{\mathcal {T}}}(h)\). The empirical risk \({R}_{{\mathcal {S}}}(h)\) is given above each bar, followed by the mean value of the divergence (standard deviations are also given for the disintegrated bounds of the literature).

5.4.3 Analysis of the tightness of the disintegrated bounds

We now compare the tightness of the different disintegrated PAC-Bayesian bounds (i.e., our bound and the ones of the literature). We study, as before, the case where the split ratio is 0.5 and the variance \(\sigma ^2=10^{-3}\). We report in Fig. 4, for ours, rivasplata, blanchard, and catoni, the mean bound values and the mean test risk \({R}_{{\mathcal {T}}}(h)\) before (i.e., with the prior \({\mathcal {P}}\)) and after applying Step 2) (i.e., with the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\)). Moreover, we report above the bars the mean train risks \({R}_{{\mathcal {S}}}(h)\) and the mean/standard deviation of the divergence values obtained after Step 2), i.e., the Rényi divergence \(D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}){=}\tfrac{1}{\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}\) for ours and the disintegrated KL divergence \(\ln \frac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}{=}\tfrac{1}{2\sigma ^2}\left[ \Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert ^2_{2} {-}\Vert \varvec{\epsilon }\Vert ^2_{2}\right]\) for the others. First of all, we observe two different behaviors for lr=\(10^{-4}\) and lr=\(10^{-6}\). For lr=\(10^{-6}\), the bound values are close to each other, as are the empirical risks and the divergences (which are close to 0). In Fig. 4, we observe that the bound values and the test risks are close to the ones associated with the prior distribution because the divergence is close to 0. This is probably due to the fact that the learning rate is too small, implying that the bounds are barely optimized. With the higher learning rate lr=\(10^{-4}\), our bound remains tight while the disintegrated bounds of the literature are looser. In contrast to these bounds, ours is improved after performing Step 2) of our Training Method; for the bounds of the literature, the value of the disintegrated KL divergence is large, making them looser after executing Step 2). We now investigate the reasons for the divergence of these bounds by looking at the influence of the variance \(\sigma ^2\).

Fig. 5: Evolution of the mean bound values (plain lines) in terms of the variance \(\sigma ^2\) after optimizing the bounds with our Training Method, along with the mean bound values (dashed lines) obtained before executing Step 2) of our Training Method. The variance is represented on the x-axis and the bound values on the y-axis. Each row corresponds to a given learning rate (\(10^{-6}\) or \(10^{-4}\)), and each column corresponds to a dataset (MNIST, Fashion-MNIST, or CIFAR-10). The split ratio considered is 0.5.

5.4.4 Analysis of the influence of the variance

Given a split ratio of 0.5 and lr\(\in \{10^{-6}, 10^{-4}\}\), we report in Fig. 5 the evolution of the bound values associated with ours, rivasplata, blanchard, and catoni when the variance varies from \(10^{-6}\) to \(10^{-3}\). The important point is that ours behaves differently than rivasplata, blanchard, and catoni. Indeed, for both learning rates, when \(\sigma ^2\) decreases, the value of our bound remains low, while the others increase drastically due to the explosion of the disintegrated KL divergence term (see Table 6 in Appendix K for more details). Concretely, the disintegrated KL divergence in Corollary 7 involves the noise \(\varvec{\epsilon }\) through \(\frac{1}{2\sigma ^2}\left[ \Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert _{2}^{2}{-}\Vert \varvec{\epsilon }\Vert ^2_{2}\right]\), whereas our divergence is \(\frac{1}{\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}\) (without noise). Hence, the noise \(\varvec{\epsilon }\) sampled during the optimization procedure influences the disintegrated KL divergence, making it prone to high variations during training (thus depending on \(\sigma ^2\)). To illustrate the difference during the optimization, we focus on the objective functions (detailed in Appendix I) of Corollaries 6 and 7 (Eq. (7)). Roughly speaking, the divergence in Corollary 6 does not depend on the sampled hypothesis h (with weights \(\varvec{\omega }+\varvec{\epsilon }\)), while the divergence of Eq. (7) does. Consequently, the derivatives are less dependent on h for Corollary 6 than for Eq. (7). To see this, we study the gradient with respect to the current mean vector \(\varvec{\omega }\). On the one hand, the gradient \(\frac{\partial {R}_{{\mathcal {S}}}(h)}{\partial \varvec{\omega }}\) of the risk w.r.t. \(\varvec{\omega }\) is the same for both bounds; hence, the phenomenon cannot come from this derivative. On the other hand, the gradients of the divergence in Eq. (7) and Corollary 6 are respectively

$$\begin{aligned} \frac{\partial }{\partial \varvec{\omega }}\left[ \frac{1}{m}\left( \frac{\left\| \varvec{\omega }{+}\varvec{\epsilon }{-}{\textbf{v}}_t\right\| ^2_{2}{-}\left\| \varvec{\epsilon }\right\| ^2_{2}}{2\sigma ^2}\right) \right]&= \frac{\partial }{\partial \varvec{\omega }}\left[ \frac{1}{2m\sigma ^2}\Vert \varvec{\omega }{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert _2^2 \right] = \frac{1}{m\sigma ^2}\left( \varvec{\omega }{+}\varvec{\epsilon }{-}{\textbf{v}}_t\right) = \diamondsuit ,\\ \text {and}\qquad \frac{\partial }{\partial \varvec{\omega }}\left[ \frac{1}{m}\left( \frac{\Vert \varvec{\omega }{-}{\textbf{v}}_t\Vert _{2}^{2}}{\sigma ^2}\right) \right]&= \frac{\partial }{\partial \varvec{\omega }}\left[ \frac{1}{m\sigma ^2}\Vert \varvec{\omega }{-}{\textbf{v}}_t\Vert _2^2 \right] = \frac{2}{m\sigma ^2}\left( \varvec{\omega }{-}{\textbf{v}}_t\right) = \heartsuit . \end{aligned}$$

From the two derivatives, we deduce that \(\diamondsuit = \frac{1}{2}\heartsuit + \frac{1}{m\sigma ^2}\varvec{\epsilon }\). Hence, each gradient step of the disintegrated KL divergence involves the noise term \(\frac{1}{m\sigma ^2}\varvec{\epsilon }\sim {\mathcal {N}}(\textbf{0}, \frac{1}{m^{2}\sigma ^{2}}{{\textbf{I}}}_d)\), whose variance is high for a small m and a small \(\sigma ^2\). This randomness causes the disintegrated KL divergence \(\frac{1}{2\sigma ^2}\left[ \left\| \varvec{\omega }{+}\varvec{\epsilon }{-}{\textbf{v}}_t\right\| ^2_{2}{-}\left\| \varvec{\epsilon }\right\| ^2_{2}\right]\) to be larger when \(\sigma ^2\) decreases since (i) the squared norms are divided by \(2m\sigma ^2\) in the objective and (ii) the deviation between \(\varvec{\omega }\) and \({\textbf{v}}_t\) increases. In conclusion, this makes the objective function (i.e., the bound) subject to high variations during the optimization, implying higher final bound values. Thus, the Rényi divergence has a valuable asset over the disintegrated KL divergence since it does not depend on the sampled noise \(\varvec{\epsilon }\).
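The relation \(\diamondsuit = \frac{1}{2}\heartsuit + \frac{1}{m\sigma ^2}\varvec{\epsilon }\) can be checked numerically; here is a self-contained sketch (the dimension, sample size, and variance are arbitrary assumptions):

```python
import torch

torch.manual_seed(0)
d, m, sigma2 = 10, 1000, 1e-3
omega = torch.randn(d, requires_grad=True)
v_t = torch.randn(d)
eps = torch.randn(d) * sigma2 ** 0.5              # eps ~ N(0, sigma2 I_d)

# diamond: gradient of the scaled disintegrated KL divergence
dis_kl = (((omega + eps - v_t) ** 2).sum() - (eps ** 2).sum()) / (2 * m * sigma2)
diamond, = torch.autograd.grad(dis_kl, omega)

# heart: gradient of the scaled Renyi divergence (noise-free)
omega2 = omega.detach().requires_grad_()
renyi = ((omega2 - v_t) ** 2).sum() / (m * sigma2)
heart, = torch.autograd.grad(renyi, omega2)

# each gradient step of the disintegrated KL carries the noise eps / (m sigma2)
assert torch.allclose(diamond, heart / 2 + eps / (m * sigma2), atol=1e-5)
```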

5.4.5 Take-home message from the experiments

To summarize, our experiments show that our disintegrated bound is, in practice, tighter than the ones of the literature. This tightness allows us to precisely bound the true risk \({R}_{{\mathcal {D}}}(h)\) (or the test risk \({R}_{{\mathcal {T}}}(h)\)); thus, model selection from the disintegrated bound is effective. Moreover, we show that our bound is more easily optimizable than the others, mainly because their disintegrated KL divergence depends on the sampled hypothesis h with weights \(\varvec{\omega }{+}\varvec{\epsilon }\). Indeed, the gradients of the disintegrated KL divergence with respect to \(\varvec{\omega }\) include the noise \(\varvec{\epsilon }\), making the gradient noisy (especially with a “high” learning rate and a small variance \(\sigma ^2\)).

6 Toward information-theoretic bounds

Before concluding, we discuss another interpretation of the disintegration procedure through Theorem 9 below. Actually, the Rényi divergence between \({\mathcal {P}}\) and \({\mathcal {Q}}\) is sensitive to the choice of the learning sample \({\mathcal {S}}\): when the posterior \({\mathcal {Q}}\) learned from \({\mathcal {S}}\) differs greatly from the prior \({\mathcal {P}}\), the divergence is high. To avoid such behavior, we consider Sibson’s mutual information (Verdú, 2015), which is a measure of dependence between the random variables \({\mathcal {S}}\!\in\,{\mathcal {Z}}^m\) and \(h\!\in\,{\mathcal {H}}\). It involves an expectation over all the learning samples of a given size m and is defined for a given \(\alpha {>}1\) by

$$\begin{aligned} I_{\alpha }(h{;}{\mathcal {S}})&\triangleq \min _{{\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})} \frac{1}{\alpha {-}1}\!\ln\,\left[ {\mathbb {E}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m}}{\mathbb {E}}_{h\sim {\mathcal {P}}}\!\left[\,\frac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}\!\right] ^{\alpha }\right] . \end{aligned}$$

The higher \(I_{\alpha }(h{;}{\mathcal {S}})\), the higher the correlation is, meaning that the sampling of h is highly dependent on the choice of \({\mathcal {S}}\). This measure has two interesting properties: it generalizes the mutual information (Verdú, 2015), and it can be related to the Rényi divergence. Indeed, let \(\rho (h, {\mathcal {S}}){=} {\mathcal {Q}}_{{\mathcal {S}}}(h){\mathcal {D}}^{m}({\mathcal {S}})\), resp. \(\pi (h, {\mathcal {S}}){=} {\mathcal {P}}(h){\mathcal {D}}^{m}({\mathcal {S}})\), be the probability of sampling both \({\mathcal {S}}{\sim }{\mathcal {D}}^m\) and \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\), resp. \({\mathcal {S}}{\sim }{\mathcal {D}}^m\) and \(h{\sim }{\mathcal {P}}\). Then we can write:

$$\begin{aligned} I_{\alpha }(h{;}{\mathcal {S}})&=\,\min _{{\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})} \frac{1}{\alpha {-}1}\!\ln\,\Bigg [\!{\mathbb {E}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m}}{\mathbb {E}}_{h\sim {\mathcal {P}}}\!\left[\,\frac{{\mathcal {Q}}_{{\mathcal {S}}}(h){\mathcal {D}}^{m}({\mathcal {S}})}{{\mathcal {P}}(h){\mathcal {D}}^{m}({\mathcal {S}})}\!\right] ^{\alpha }\,\Bigg ]\nonumber \\&=\,\min _{{\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})} D_{\alpha }(\rho \Vert \pi ). \end{aligned}$$
(12)

From Verdú (2015), the optimal prior \({\mathcal {P}}^*\) minimizing Eq. (12) is a distribution-dependent prior:

$$\begin{aligned} \displaystyle {\mathcal {P}}^*(h)=\frac{\left[ {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathcal {Q}}_{{\mathcal {S}}'}(h)^{\alpha }\right] ^{\frac{1}{\alpha }}}{{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\tfrac{1}{{\mathcal {P}}(h')}\left[ {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathcal {Q}}_{{\mathcal {S}}'}(h')^{\alpha }\right] ^{\frac{1}{\alpha }}}. \end{aligned}$$

This leads to an information-theoretic generalization bound.

Theorem 9

(Disintegrated Information-Theoretic Bound) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any measurable function \(\phi\,:\!{\mathcal {H}}{\times } {\mathcal {Z}}^{m}{\rightarrow }{\mathbb {R}}_{+}^{*}\), for any \(\alpha\,>\!1\), for any \(\delta \in (0,1]\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have

$$\begin{aligned} {\mathbb {P}}_{\begin{array}{c} {\mathcal {S}}\sim {\mathcal {D}}^{m},\\ h\sim {\mathcal {Q}}_{{\mathcal {S}}} \end{array}}\,\left( \frac{\alpha }{\alpha {-}1}\!\ln\,\left( \phi (h,\!{\mathcal {S}})\right) \le I_{\alpha }(h'{;} {\mathcal {S}}')\!+\! \ln \left[ \frac{1}{\delta ^{\frac{\alpha }{\alpha {-}1}}}\!{\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim } {\mathcal {P}}^*}\,\left[ \phi (h'\!, {\mathcal {S}}')^{\frac{\alpha }{\alpha -1}}\right] \right] \right) \ge 1{-}\delta . \end{aligned}$$

Note that Esposito, Gastpar, and Issa (2020, Cor.4) introduced a bound based on Sibson’s mutual information but, as discussed in Appendix J, Theorem 9 leads to a tighter bound. From a theoretical viewpoint, Theorem 9 carries a different philosophy than the disintegrated PAC-Bayes bounds. Indeed, in Theorems 2 and 4, given \({\mathcal {S}}\), the Rényi divergence \(D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) suggests that the learned posterior \({\mathcal {Q}}_{{\mathcal {S}}}\) should be close enough to the prior \({\mathcal {P}}\) to get a low bound. In Theorem 9, in contrast, Sibson’s mutual information \(I_{\alpha }(h'{;} {\mathcal {S}}')\) suggests that the random variable h must not be too correlated with \({\mathcal {S}}\). However, the bound of Theorem 9 is not computable in practice, notably because of the expectation over learning samples drawn from the unknown distribution \({\mathcal {D}}\) in \(I_{\alpha }\). An exciting line of future work is to study how Theorem 9 can be exploited in practice.

7 Conclusion and future works

We provide a new and general disintegrated PAC-Bayesian bound (Theorem 2) in the family of Eq. (5), i.e., when the derandomization step consists in (i) learning a posterior distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) on the set of classifiers (given an algorithm, a learning sample \({\mathcal {S}}\), and a prior distribution \({\mathcal {P}}\)) and (ii) sampling a hypothesis h from this posterior \({\mathcal {Q}}_{{\mathcal {S}}}\). While our bound can be looser than the ones of Rivasplata et al. (2010), Blanchard and Fleuret (2007), and Catoni (2007), it provides nice opportunities for learning deterministic classifiers. Indeed, our bound can be used not only to study the theoretical guarantees of deterministic classifiers but also to derive self-bounding algorithms (based on the bound optimization) that are more stable and efficient than the ones we obtain from the bounds of the literature. Concretely, the bounds of Rivasplata et al. (2010), Blanchard and Fleuret (2007), and Catoni (2007) depend on two terms related to the drawn classifier, the risk and the “disintegrated KL divergence”, whereas in our bound the (Rényi) divergence term depends only on the distributions over the hypothesis set, implying that the divergence remains the same whichever classifier is drawn. In this sense, our bound is more stable: in practice, a learning algorithm minimizing it selects a better hypothesis than one minimizing the bounds of Rivasplata et al. (2010), Blanchard and Fleuret (2007), and Catoni (2007). We have illustrated the interest of our bound on neural networks, but our result could be instantiated in other well-known settings such as linear classifiers (Germain et al., 2009) or the majority vote classifier (Zantedeschi et al., 2021).

Our general framework opens the way to the study of other machine learning settings by exploiting existing randomized PAC-Bayesian theorems, for example, for Domain Adaptation (Germain et al., 2020), Adversarial Robustness (Viallard et al., 2021), or Transductive Learning (Bégin et al., 2014).

Despite being an important step towards the practical use of PAC-Bayes guarantees, our disintegrated bounds arguably have a drawback: we sample a hypothesis from a distribution instead of obtaining a bound valid for all possible hypotheses, as uniform convergence bounds do. While uniform convergence bounds can be vacuous (Nagarajan & Kolter, 2019b), they hold (with high probability on the choice of the learning sample) for all hypotheses, including the one with the best guarantee (i.e., the one minimizing the bound). In the case of disintegrated bounds, we learn a distribution on the hypothesis set and then sample a hypothesis according to this distribution; hence, there is a small probability (i.e., less than \(\delta\)) of sampling a bad hypothesis. An interesting research direction is to compare disintegrated and uniform convergence bounds to understand in which cases disintegrated bounds are preferable. Knowing that there are connections between (agnostic) PAC-learnability and uniform convergence (see, e.g., Shalev-Shwartz and Ben-David (2014)), we believe that defining a new notion of PAC-learnability that better fits the disintegrated framework could help provide such a comparison.

This Appendix is structured as follows. We give the proofs of Theorem 1, Theorem 2, Corollary 3, Theorem 4, Proposition 5, Corollary 6, Corollary 7, and Corollary 8 in Appendix A, Appendix B, Appendix C, Appendix D, Appendix E, Appendix F, Appendix G, and Appendix H, respectively. We also discuss the minimization and the evaluation of the bounds introduced in the different corollaries in Appendix I. Additionally, Appendix J is devoted to Theorem 9. Appendix K provides an exhaustive list of numerical results.