Abstract
PAC-Bayesian bounds are known to be tight and informative when studying the generalization ability of randomized classifiers. However, they require a loose and costly derandomization step when applied to some families of deterministic models such as neural networks. As an alternative to this step, we introduce new PAC-Bayesian generalization bounds whose originality is to be disintegrated, i.e., they provide guarantees for one single hypothesis instead of the usual averaged analysis. Our bounds are easily optimizable and can be used to design learning algorithms. We illustrate this behavior on neural networks, and we show a significant practical improvement over the state-of-the-art framework.
1 Introduction
In statistical learning theory, PAC-Bayesian theory (Shawe-Taylor & Williamson, 1997; McAllester, 1998) provides a powerful framework for analyzing the generalization ability of machine learning models such as linear classifiers (Germain et al., 2009), SVM (Ambroladze et al., 2006), or neural networks (Dziugaite & Roy, 2017; Pérez-Ortiz et al., 2021). In the PAC-Bayesian theory, the machine learning models are considered randomized (or stochastic), i.e., a model is sampled from a posterior probability distribution for each prediction. The analysis of such a randomized classifier usually takes the form of bounds on the average risk with respect to a learned posterior distribution given a learning sample and a chosen prior distribution defined over a set of hypotheses. Note that the prior distribution can encode an a priori belief on the set of hypotheses, or if we have no belief, it can be set to a non-informative distribution, such as the uniform distribution. While such bounds are very effective for analyzing randomized/stochastic classifiers, the vast majority of machine learning methods nevertheless need guarantees on deterministic models. In this case, a derandomization step of the bound is required to get a bound on the risk of the deterministic model. In general, the derandomization step consists in obtaining a bound on the risk of a deterministic model from a bound that is originally for randomized/stochastic models. Different forms of derandomization have been introduced in the literature for specific settings. Among them, Langford and Shawe-Taylor (2002) proposed a derandomization for Gaussian posteriors over linear classifiers: thanks to the Gaussian symmetry, a bound on the risk of the maximum a posteriori (deterministic) classifier is obtainable from the bound on the average risk of the randomized classifier. Also relying on Gaussian posteriors, Letarte et al. 
(2019) derived a PAC-Bayesian bound for a very specific deterministic network architecture using sign functions as activations; this approach has been further extended by Biggs and Guedj (2021, 2022). Another line of work derandomizes neural networks (Neyshabur et al., 2018; Nagarajan & Kolter, 2019). While technically different, these works start from PAC-Bayesian guarantees on the randomized classifier and use an “output perturbation” bound to convert a guarantee from a random classifier to the mean classifier. All these efforts highlight the need for a general framework for the derandomization of classic PAC-Bayesian bounds.
In this paper, we focus on another kind of derandomization, sometimes referred to as disintegration of the PAC-Bayesian bound, and first proposed by Catoni (2007, Th.1.2.7) and Blanchard and Fleuret (2007): instead of bounding the average risk of a randomized classifier with respect to the posterior distribution, the disintegrated PAC-Bayesian bounds upper-bound the risk of a single classifier sampled from the posterior distribution. Despite their relevance for derandomizing PAC-Bayesian bounds, these bounds have received little study in the literature; a notable exception is the recent work of Rivasplata et al. (2020, Th.1(i)), who derived a general disintegrated PAC-Bayesian theorem. It is important to note that these bounds have never been used in practice. Driven by machine learning practical purposes, our objective is thus twofold. We derive new tight and usable disintegrated PAC-Bayesian bounds (i) that directly derandomize any classifier without any additional step and with almost no impact on the guarantee, and (ii) that can be easily optimized to learn classifiers with strong guarantees. To achieve this objective, our contribution consists in providing a new general disintegration framework based on the Rényi divergence (in Theorem 2), allowing us to meet the practical goal of efficient learning. From the theoretical standpoint, due to the Rényi divergence term, our bound is expected to be looser than the one of Rivasplata et al. (2020, Th.1(i)), in which the divergence term is “disintegrated”, i.e., it depends on the sampled hypothesis only. However, as we show in our experimental evaluation on neural networks, their “disintegrated” term is, in practice, subject to high variance, making their bound harder to optimize. Our Rényi divergence term, in contrast, does not depend on the sampled hypothesis and thus avoids this variance. Our bound then has the main advantage of leading to a more stable learning algorithm with better empirical results. 
In addition, we derive a new theoretical result in the form of an information-theoretic bound, giving new insights into disintegration procedures.
The rest of the paper is organized as follows. Section 2 introduces the notations we follow and recalls some basics on generalization bounds. In Sect. 3, we derive our main contribution relying on disintegrated PAC-Bayesian bounds. Then, we instantiate our bounds on neural networks in Sect. 4 and illustrate the practical usefulness of this disintegration on deterministic neural networks in Sect. 5. Before concluding in Sect. 7, we discuss in Sect. 6 another point of view on the disintegration through an information-theoretic bound. For readability, we defer the proofs of our theoretical results to the Appendix.
2 Setting and basics
2.1 General notations
We denote by \({\mathcal {M}}({\mathcal {A}})\) the set of probability densities on the measurable space \(({\mathcal {A}}, \Sigma _{{\mathcal {A}}})\) with respect to a reference measure, where \(\Sigma _{{\mathcal {A}}}\) is the \(\sigma\)-algebra on the set \({\mathcal {A}}\). In this paper, we consider supervised classification tasks with \({\mathcal {X}}\) the input space, \({\mathcal {Y}}\) the label set, and \({\mathcal {D}}\in {\mathcal {M}}({\mathcal {X}}{\times }{\mathcal {Y}})\) an unknown data distribution on \({\mathcal {X}}{\times } {\mathcal {Y}}{=}{\mathcal {Z}}\). An example is denoted by \(z{=} ({\textbf{x}},y)\! \in\, {\mathcal {Z}}\), and the learning sample \({\mathcal {S}}{=} \{z_i\}_{i=1}^{m}\) is constituted by m examples drawn i.i.d. from \({\mathcal {D}}\); the distribution of such an m-sample being \({\mathcal {D}}^m\in {\mathcal {M}}({\mathcal {Z}}^m)\). We consider a hypothesis set \({\mathcal {H}}\) of functions \(h\!:\!{\mathcal {X}}{\rightarrow } {\mathcal {Y}}\). The learner aims to find \(h\!\in\,{\mathcal {H}}\) that assigns a label y to an input \({\textbf{x}}\) as accurately as possible. Given an example z and a hypothesis h, we assess the quality of the prediction of h with a loss function \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } [0, 1]\) evaluating to which extent the prediction is accurate. Given a loss function \(\ell\), the true risk \({R}_{{\mathcal {D}}}(h)\) of a hypothesis h on the distribution \({\mathcal {D}}\) and its empirical counterpart, the empirical risk, \({R}_{{\mathcal {S}}}(h)\) estimated on \({\mathcal {S}}\) are defined as \({R}_{{\mathcal {D}}}(h) \triangleq {\mathbb {E}}_{z\sim {\mathcal {D}}}\, \ell (h, z)\) and \({R}_{{\mathcal {S}}}(h) \triangleq \frac{1}{m}\sum _{i=1}^{m} \ell (h, z_i)\).
Then, the learner wants to find the hypothesis h from \({\mathcal {H}}\) that minimizes \({R}_{{\mathcal {D}}}(h)\). However, we cannot compute \({R}_{{\mathcal {D}}}(h)\) since \({\mathcal {D}}\) is unknown. In practice, one could work under the Empirical Risk Minimization principle (erm) that looks for a hypothesis minimizing \({R}_{{\mathcal {S}}}(h)\). Generalization guarantees over unseen data from \({\mathcal {D}}\) can be obtained by quantifying how much the empirical risk \({R}_{{\mathcal {S}}}(h)\) is a good estimate of \({R}_{{\mathcal {D}}}(h)\). Statistical machine learning theory (see, e.g., Vapnik, 2000) studies the conditions of consistency and convergence of erm towards the true risk. This kind of result is called generalization bound, often referred to as PAC (Probably Approximately Correct) bound (Valiant, 1984), and takes the form:
Put into words, with high probability (at least \(1{-}\delta\)) on the random choice of the learning sample \({\mathcal {S}}\), good generalization guarantees are obtained when the deviation between the true risk \({R}_{{\mathcal {D}}}(h)\) and its empirical estimate \({R}_{{\mathcal {S}}}(h)\) is low, i.e., \(\varepsilon \big (\tfrac{1}{\delta }, \tfrac{1}{m}\big )\) should be as small as possible. The function \(\varepsilon\) depends mainly on two quantities: (i) the number of examples m for statistical precision, and (ii) the confidence parameter \(\delta\). We now recall three classical bounds while focusing on the PAC-Bayesian theory at the heart of our contribution. By abuse of notation, in the following, we use the function \(\varepsilon\) for the different presented frameworks: we consider an additional argument of \(\varepsilon\) to pinpoint the differences between the frameworks.
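To make these quantities concrete, the following minimal sketch (the toy task, names, and threshold hypothesis are our illustrative choices, not from the paper) computes the empirical risk \({R}_{{\mathcal {S}}}(h)\) under the 0-1 loss; the true risk \({R}_{{\mathcal {D}}}(h)\) would be the expectation of this loss under \({\mathcal {D}}\).

```python
import numpy as np

def zero_one_loss(h, z):
    """0-1 loss: 1 if hypothesis h mislabels the example z = (x, y), else 0."""
    x, y = z
    return float(h(x) != y)

def empirical_risk(h, sample):
    """R_S(h): average loss of h over the m examples of the learning sample S."""
    return sum(zero_one_loss(h, z) for z in sample) / len(sample)

# Toy 1-D task: the label is the sign of the input; the hypothesis
# thresholds at 0.1 instead of 0, so it errs exactly on x in [0, 0.1).
rng = np.random.default_rng(0)
sample = [(x, int(x >= 0.0)) for x in rng.uniform(-1, 1, size=1000)]
h = lambda x: int(x >= 0.1)

print(empirical_risk(h, sample))  # close to 0.05, the mass of [0, 0.1)
```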
2.2 Uniform convergence bound
A first classical type of generalization bounds is referred to as Uniform Convergence bounds based on a measure of complexity of the set \({\mathcal {H}}\) (such as the VC-dimension or the Rademacher complexity) and hold for all the hypotheses of \({\mathcal {H}}\). This type of bound takes the form:
Due to \(\sup _{h\in {\mathcal {H}}}\), this bound can be seen as a worst-case analysis. Indeed, it means that the bound \(\left| {R}_{{\mathcal {D}}}(h)-{R}_{{\mathcal {S}}}(h)\right| \le \varepsilon \big (\tfrac{1}{\delta }, \tfrac{1}{m},{\mathcal {H}}\big )\) holds with high probability for all \(h\!\in\,{\mathcal {H}}\), including the best but also the worst. This worst-case analysis makes it hard to obtain a non-vacuous bound, i.e., a bound with \(\varepsilon (\frac{1}{\delta }, \frac{1}{m}, {\mathcal {H}})<1\). Note that the ability of such bounds to explain the generalization of deep learning has been recently challenged (Nagarajan & Kolter, 2019b).
2.3 Algorithmic-dependent bounds
A potential drawback of the Uniform Convergence bounds is that they are independent of the learning algorithm, i.e., they do not take into account the way the hypothesis space is explored. To tackle this issue, algorithmic-dependent bounds have been proposed to take advantage of some particularities of the learning algorithm, such as its uniform stability (Bousquet & Elisseeff, 2002) or robustness (Xu & Mannor, 2012). In this case, the bounds obtained hold for a single hypothesis \(h_{L({\mathcal {S}})}\), the one learned with the algorithm L from the learning sample \({\mathcal {S}}\). The form of such bounds is:
For example, this approach has been used by Hardt et al. (2016) to derive generalization bounds for hypotheses learned by stochastic gradient descent.
2.4 PAC-Bayesian bound
This paper leverages PAC-Bayesian bounds that stand in the PAC framework but borrows inspiration from the Bayesian probabilistic view that deals with randomness and uncertainty in machine learning (McAllester, 1998). In the PAC-Bayesian setting, we consider a prior distribution \({\mathcal {P}}\!\in\,{\mathcal {M}}^{*}({\mathcal {H}})\subseteq {\mathcal {M}}({\mathcal {H}})\) on \({\mathcal {H}}\), with \({\mathcal {M}}^{*}({\mathcal {H}})\) the set of strictly positive probability densities. This distribution encodes an a priori belief on \({\mathcal {H}}\) before observing the learning sample \({\mathcal {S}}\). Then, given \({\mathcal {S}}\) and the prior \({\mathcal {P}}\), we learn a posterior distribution \({\mathcal {Q}}\!\in\,{\mathcal {M}}({\mathcal {H}})\). In this case, the bounds take the form:
A key notion is that the function \(\varepsilon ()\) upper-bounds a \({\mathcal {Q}}\)-weighted expectation over the risks of all classifiers in \({\mathcal {H}}\). Hence, it upper-bounds the risk of a randomized classifier. Such a randomized classifier can be described as follows: to predict the label of an input \({\textbf{x}}\in {\mathcal {X}}\), (i) a hypothesis \(h\in {\mathcal {H}}\) is sampled from \({\mathcal {Q}}\) and (ii) the classifier predicts the label given by \(h({\textbf{x}})\).
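The prediction scheme of this randomized (Gibbs) classifier can be sketched as follows; the discrete toy posterior and all names below are our illustrative choices.

```python
import numpy as np

def gibbs_predict(x, hypotheses, q, rng):
    """Randomized classifier: (i) sample h ~ Q, (ii) predict h(x)."""
    h = hypotheses[rng.choice(len(hypotheses), p=q)]
    return h(x)

# Three threshold classifiers on the reals and a posterior Q over them.
hypotheses = [lambda x: int(x >= -0.1),
              lambda x: int(x >= 0.0),
              lambda x: int(x >= 0.1)]
q = [0.2, 0.5, 0.3]
rng = np.random.default_rng(1)

# Repeated predictions on the same input may differ, since a new
# hypothesis is sampled for each prediction.
print([gibbs_predict(0.05, hypotheses, q, rng) for _ in range(10)])
```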
We recall below the classical PAC-Bayesian bounds in a general form as proposed by Germain et al. (2009) and Bégin et al. (2016). The idea is to express the bound in terms of a generic function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\) that is meant to capture the deviation between the true and the empirical risks, instead of deriving a theorem by settling on a specific measure of deviation such as \(\vert {R}_{{\mathcal {D}}}(h){-}{R}_{{\mathcal {S}}}(h)\vert\). Note that Theorem 1 is expressed in a slightly different form than the original ones; we prove Theorem 1 in Appendix A for the sake of completeness.
Theorem 1
(General PAC-Bayes bounds) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\) on \({\mathcal {H}}\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any \(\delta \in (0,1]\) we have
with \(\textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}}){\triangleq } { {\displaystyle {\mathbb {E}}_{h{\sim }{\mathcal {Q}}}}} \ln \tfrac{{\mathcal {Q}}(h)}{{\mathcal {P}}(h)}\) the Kullback-Leibler (KL-)divergence between \({\mathcal {Q}}\) and \({\mathcal {P}}\), and \(D_{\alpha }({\mathcal {Q}}\Vert {\mathcal {P}}) {\triangleq } \frac{1}{\alpha {-}1}\ln\,\left[ {{\displaystyle {\mathbb {E}}_{h{\sim }{\mathcal {P}}}}}\!\left[\,\frac{ {\mathcal {Q}}(h)}{{\mathcal {P}}(h)}\right] ^{\!\alpha }\right]\) the Rényi divergence between \({\mathcal {Q}}\) and \({\mathcal {P}}\) \((\alpha {>}1)\).
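For discrete distributions, both divergences of Theorem 1 can be computed directly from their definitions; a small sketch (the example distributions are ours):

```python
import numpy as np

def kl_divergence(q, p):
    """KL(Q||P) = E_{h~Q} ln(Q(h)/P(h))."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log(q / p)))

def renyi_divergence(q, p, alpha):
    """D_alpha(Q||P) = (1/(alpha-1)) ln E_{h~P} [Q(h)/P(h)]^alpha, alpha > 1."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.log(np.sum(p * (q / p) ** alpha)) / (alpha - 1))

q, p = [0.7, 0.2, 0.1], [1 / 3, 1 / 3, 1 / 3]
# The Rényi divergence upper-bounds the KL divergence (its alpha -> 1 limit).
print(kl_divergence(q, p))        # ≈ 0.297
print(renyi_divergence(q, p, 2))  # ≈ 0.482
```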
Note that Eq. (2) is more general than Eq. (1). Indeed, the latter is obtained from the former by the three following steps: (i) substituting \(\phi (h, {\mathcal {S}})\) by \(\phi (h, {\mathcal {S}})^{\frac{\alpha -1}{\alpha }}\) in Eq. (2), (ii) applying Jensen’s inequality in order to move the expectation over \({\mathcal {Q}}\) in front of the logarithm, and (iii) taking the limit when \(\alpha\) tends to 1. Note also that the original bound statements of Germain et al. (2009) and Bégin et al. (2016) are recovered by choosing a convex function \(\Delta : [0,1]^2 {\rightarrow } {\mathbb {R}}\) that captures a deviation between the true risk \({R}_{{\mathcal {D}}}(h)\) and the empirical risk \({R}_{{\mathcal {S}}}(h)\). Then, two steps are required: (i) setting \(\phi (h,\!{\mathcal {S}}){=}\exp (m\Delta ({R}_{{\mathcal {S}}}(h), {R}_{{\mathcal {D}}}(h)))\) in Eq. (1), or \(\phi (h,\!{\mathcal {S}}){=}\Delta ({R}_{{\mathcal {S}}}(h), {R}_{{\mathcal {D}}}(h))\) in Eq. (2), and then (ii) applying Jensen’s inequality on the left-hand side of the inequality. In fact, our proofs follow the exact same steps as those of Germain et al. (2009, Th.2.1) and Bégin et al. (2016, Th.9), but instead of starting from \(\Delta ({R}_{{\mathcal {S}}}(h), {R}_{{\mathcal {D}}}(h))\), we consider the slightly more general expression \(\phi (h, {\mathcal {S}})\) from the beginning.
The advantage of Theorem 1 is that it can be used as a starting point for deriving different forms of bounds. For instance, for a loss function \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } [0, 1]\) with \(\phi (h,\!{\mathcal {S}}){=} \exp \left( m\Delta ({R}_{{\mathcal {S}}}(h), {R}_{{\mathcal {D}}}(h))\right)\) and \(\Delta ({R}_{{\mathcal {S}}}(h), {R}_{{\mathcal {D}}}(h))=2[{R}_{{\mathcal {S}}}(h){-}{R}_{{\mathcal {D}}}(h)]^2\) we retrieve from Eq. (1) the bound proposed by McAllester (1998):
This bound illustrates the trade-off between the average empirical risk and \(\textstyle \varepsilon \big (\tfrac{1}{\delta }{,}\tfrac{1}{m}{,}{\mathcal {Q}}\big ) = \sqrt{ \frac{1}{2m} (\textrm{KL}( {\mathcal {Q}}\Vert {\mathcal {P}})+\ln \tfrac{2\sqrt{m}}{\delta })}\). More precisely, the higher m is, the lower \(\textstyle \varepsilon \big (\tfrac{1}{\delta }{,}\tfrac{1}{m}{,}{\mathcal {Q}}\big )\) is, and therefore the smaller the guaranteed deviation between the true risk \({\mathbb {E}}_{h{\sim }{\mathcal {Q}}}{R}_{{\mathcal {D}}}(h)\) and the empirical risk \({\mathbb {E}}_{{h\sim {\mathcal {Q}}}}{R}_{{\mathcal {S}}}(h)\).
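The complexity term of this bound is easy to evaluate numerically; a minimal sketch (the divergence value 5.0 below is an arbitrary illustration, not from the paper):

```python
import math

def mcallester_epsilon(kl_qp, m, delta):
    """epsilon(1/delta, 1/m, Q) = sqrt((KL(Q||P) + ln(2 sqrt(m)/delta)) / (2m))."""
    return math.sqrt((kl_qp + math.log(2 * math.sqrt(m) / delta)) / (2 * m))

# For a fixed divergence, the guaranteed deviation shrinks as m grows.
for m in (100, 10_000, 1_000_000):
    print(m, mcallester_epsilon(kl_qp=5.0, m=m, delta=0.05))
```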
Another example, leading to a slightly tighter but less interpretable bound, is the bound of Seeger (2002) and Maurer (2004), that we retrieve with \(\phi (h,\!{\mathcal {S}}){=} \exp \left( m\,\Delta ({R}_{{\mathcal {S}}}(h),{R}_{{\mathcal {D}}}(h))\right)\) and \(\Delta ({R}_{{\mathcal {S}}}(h),{R}_{{\mathcal {D}}}(h))=\textrm{kl}[{R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h)]\):
where
is the KL divergence between two Bernoulli distributions of parameters q and p.
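In practice, a bound stated in terms of this binary kl is made explicit by numerically inverting it; the bisection inversion below is a standard trick (not spelled out in the text), and the numeric values are our illustration:

```python
import math

def binary_kl(q, p):
    """kl(q||p): KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, bound, tol=1e-9):
    """Largest p in [q, 1] such that kl(q||p) <= bound, found by bisection
    (kl(q||.) is increasing on [q, 1])."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(q, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Empirical risk 0.1 with a right-hand side of 0.05: the kl-inverse bounds
# the true risk, and is tighter than Pinsker's 0.1 + sqrt(0.05 / 2).
print(kl_inverse(0.1, 0.05))
```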
Such PAC-Bayesian bounds are known to be tight (e.g., Pérez-Ortiz et al. (2021); Zantedeschi et al. (2021)), but they hold for a randomized classifier by nature (due to the expectation on \({\mathcal {H}}\)). A key issue for usual machine learning tasks is then the derandomization of the PAC-Bayesian bounds to obtain a guarantee for a deterministic classifier instead of a randomized one (by removing the expectation on \({\mathcal {H}}\)). In some cases, this derandomization results from the structure of the hypotheses, such as for randomized linear classifiers that can be directly expressed as one deterministic linear classifier (Germain et al., 2009). However, in other cases, the derandomization is much more complex and specific to the class of hypotheses, such as for neural networks (e.g., Neyshabur et al. (2018), Nagarajan and Kolter (2019b, Ap. J), Biggs and Guedj (2022)).
The next section states our main contribution, which is a general derandomization framework (based on the Rényi divergence) for disintegrating PAC-Bayesian bounds into a bound for a single hypothesis from \({\mathcal {H}}\).
3 Disintegrated PAC-Bayesian theorems
3.1 Form of a disintegrated PAC-Bayes bound
First, we recall another kind of bound introduced by Blanchard and Fleuret (2007) and Catoni (2007, Th.1.2.7) and referred to as the disintegrated PAC-Bayesian bound. Its form is:
where \({\mathcal {Q}}_{{\mathcal {S}}}{\triangleq }A({\mathcal {S}}, {\mathcal {P}})\) with \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\) a deterministic algorithm chosen a priori which (i) takes a learning sample \({\mathcal {S}}\!\in\,{\mathcal {Z}}^{m}\) and a prior distribution \({\mathcal {P}}\) as inputs, and (ii) outputs a data-dependent distribution \({\mathcal {Q}}_{{\mathcal {S}}}{\triangleq }A({\mathcal {S}}, {\mathcal {P}})\) from the set \({\mathcal {M}}({\mathcal {H}})\) of all possible probability densities on \({\mathcal {H}}\). Concretely, this kind of generalization bound allows one to derandomize the usual PAC-Bayes bounds as follows. Instead of considering a bound holding for all the posterior distributions on \({\mathcal {H}}\) as usually done in PAC-Bayes (the “\(\,\forall {\mathcal {Q}}\,\)” in Theorem 1), we consider only the posterior distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) obtained through a deterministic algorithm A taking the learning sample \({\mathcal {S}}\) and the prior \({\mathcal {P}}\) as inputs. Then, the above bound holds for a unique hypothesis \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\) instead of the randomized classifier: the individual risks are no longer averaged with respect to \({\mathcal {Q}}_{{\mathcal {S}}}\); this is the PAC-Bayesian bound disintegration. The dependence in probability on \({\mathcal {Q}}_{{\mathcal {S}}}\) means that the bound is valid with probability at least \(1{-}\delta\) over the random choice of the learning sample \({\mathcal {S}}{\sim }{\mathcal {D}}^m\) and the hypothesis \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\). Under this principle, we introduce in Theorems 2 and 4 below two new general disintegrated PAC-Bayesian bounds. A key asset of our results is that the bounds are instantiable to specific settings as for the “classical” PAC-Bayesian bounds (e.g., with i.i.d./non-i.i.d. 
data, unbounded losses, etc.): to instantiate the bound, one has to instantiate the function \(\phi\). Note that, apart from our bound and the one of Rivasplata et al. (2020, Th.1(i)), the disintegrated bounds of the literature introduced by Blanchard and Fleuret (2007) and Catoni (2007, Th.1.2.7) do not depend on such a general function \(\phi\). With an appropriate instantiation, we obtain an easily optimizable bound, leading to a self-bounding algorithm (Freund, 1998) with theoretical guarantees. As an illustration of the usefulness of our results, we provide, in Sect. 4, such an instantiation for neural networks.
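To fix ideas, here is a minimal sketch of the disintegration protocol on a toy linear model; the data, the surrogate loss, and the concrete algorithm A below are our illustrative choices, not the paper's. A deterministic algorithm A maps \(({\mathcal {S}}, {\mathcal {P}})\) to a Gaussian posterior \({\mathcal {Q}}_{{\mathcal {S}}}\), and the guarantee then holds for the single hypothesis sampled from \({\mathcal {Q}}_{{\mathcal {S}}}\).

```python
import numpy as np

def algorithm_A(sample, prior_mean, lr=0.1, epochs=50):
    """Deterministic algorithm A(S, P): full-batch gradient descent on the
    logistic loss, started at the prior mean; returns the mean of the
    Gaussian posterior Q_S = N(w, sigma^2 I)."""
    X = np.stack([x for x, _ in sample])
    y = np.array([1.0 if label == 1 else -1.0 for _, label in sample])
    w = prior_mean.copy()
    for _ in range(epochs):
        margins = y * (X @ w)
        grad = -((y / (1.0 + np.exp(margins))) @ X) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels = (X @ np.array([1.0, -1.0]) > 0).astype(int)
sample = list(zip(X, labels))

sigma = 0.1
w = algorithm_A(sample, prior_mean=np.zeros(2))  # Q_S = N(w, sigma^2 I)
h = rng.normal(loc=w, scale=sigma)               # one sampled hypothesis h ~ Q_S
accuracy = np.mean((X @ h > 0).astype(int) == labels)
print(accuracy)  # the disintegrated bound holds for this single h
```

Note that A must be deterministic given \(({\mathcal {S}}, {\mathcal {P}})\); only the final draw \(h \sim {\mathcal {Q}}_{{\mathcal {S}}}\) is random, which is exactly the randomness the disintegrated bound accounts for.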
3.2 Disintegrated PAC-Bayesian bounds with the Rényi divergence
3.2.1 Our main contribution: a general disintegrated bound
In the same spirit as Eq. (2), our main result, stated in Theorem 2, is a general bound involving the Rényi divergence \(D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) of order \(\alpha\,>\!1\).
Theorem 2
(General Disintegrated PAC-Bayes Bound) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any \(\alpha\,>\!1\), for any \(\delta \in (0,1]\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have
where \({\mathcal {Q}}_{{\mathcal {S}}}{\triangleq }A({\mathcal {S}}, {\mathcal {P}})\) is output by the deterministic algorithm A.
Proof
(Proof sketch (see Appendix B for details)) Recall that \({\mathcal {Q}}_{{\mathcal {S}}}\) is obtained with the algorithm \(A({\mathcal {S}}, {\mathcal {P}})\). Applying Markov’s inequality on \(\phi (h,\!{\mathcal {S}})\) with the random variable h and using Hölder’s inequality to introduce \(D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\), we have, with probability at least \(1{-}\tfrac{\delta }{2}\) on \({\mathcal {S}}\!\sim\, {\mathcal {D}}^m\) and \(h\!\sim\,{\mathcal {Q}}_{{\mathcal {S}}}\),
By applying again Markov’s inequality on \(\phi (h,\!{\mathcal {S}})\) with the random variable \({\mathcal {S}}\), we have, with probability at least \(1{-}\tfrac{\delta }{2}\) on \({\mathcal {S}}\!\sim\, {\mathcal {D}}^m\) and \(h\!\sim\,{\mathcal {Q}}_{{\mathcal {S}}}\),
Lastly, we combine the two bounds with a union-bound argument. \(\square\)
As for the general classical PAC-Bayesian bounds (Theorem 1), the above theorem can be seen as the starting point of the derivation of generalization bounds depending on the choice of the function \(\phi\), as done in Corollary 6 in Sect. 4.1; this property makes it the main result of our paper.
In its proof, Hölder’s inequality is used differently than in classic PAC-Bayes proofs. Indeed, in Bégin et al. (2016, Th. 8), the change of measure based on Hölder’s inequality is key for deriving a bound that holds for all posteriors \({\mathcal {Q}}\) with high probability, while our bound holds for a unique posterior \({\mathcal {Q}}_{{\mathcal {S}}}\) dependent on the sample \({\mathcal {S}}\) and the prior \({\mathcal {P}}\). In fact, we use Hölder’s inequality to introduce a prior \({\mathcal {P}}\) independent of \({\mathcal {S}}\): a crucial point for our bound instantiated in Corollary 6.
Compared to Eq. (2), our bound involves the term \({\frac{2\alpha {-}1}{\alpha {-}1}}\ln \frac{2}{\delta }\) instead of \(\ln \frac{1}{\delta }\), that is, an additional constant value of \({\frac{2\alpha {-}1}{\alpha {-}1}}\ln \frac{2}{\delta }-\ln \frac{1}{\delta }=\ln 2{+}\frac{\alpha }{\alpha {-}1}\ln \tfrac{2}{\delta }\). When \(\alpha =2\), this constant equals \(\ln \frac{8}{\delta ^2}\), which turns out to be a reasonable cost to “derandomize” a bound into a disintegrated one, as typical choices for \(\phi (h,\!{\mathcal {S}})\) will make the imprint of this constant on the bound value decay with m. In other words, the bound of Theorem 2 tightens as m increases, provided that \(\phi (h,{\mathcal {S}})\) is chosen wisely. For instance, by setting \(\phi (h,\!{\mathcal {S}})=\exp (\tfrac{\alpha -1}{\alpha }m\,\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h)))\) with \(\textrm{kl}(\cdot \Vert \cdot )\) defined by Eq. (4), the bound depends on m and converges as m increases (see Sect. 4). Moreover, the tightness of the bound also depends on the deviation between \({\mathcal {Q}}_{{\mathcal {S}}}\) and \({\mathcal {P}}\): the divergence term is minimized, making the bound tighter, when \({\mathcal {Q}}_{{\mathcal {S}}}={\mathcal {P}}\).
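The additional derandomization cost is easy to evaluate; a small numeric check of the identity above and of its decay once scaled by m (as happens for typical choices of \(\phi\)):

```python
import math

def derandomization_cost(alpha, delta):
    """((2a-1)/(a-1)) ln(2/d) - ln(1/d) = ln 2 + (a/(a-1)) ln(2/d)."""
    lhs = (2 * alpha - 1) / (alpha - 1) * math.log(2 / delta) - math.log(1 / delta)
    rhs = math.log(2) + alpha / (alpha - 1) * math.log(2 / delta)
    assert abs(lhs - rhs) < 1e-12  # the two expressions coincide
    return rhs

cost = derandomization_cost(alpha=2.0, delta=0.05)
print(cost, math.log(8 / 0.05 ** 2))  # for alpha = 2, both ≈ ln(8 / delta^2)
# Scaled by 1/m, the cost vanishes as the sample grows.
for m in (100, 10_000):
    print(m, cost / m)
```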
We instantiate below Theorem 2 in the limit cases \(\alpha {\rightarrow }1^+\) and \(\alpha {\rightarrow }{+}\infty\), showing how the bound behaves at both extremes.
Corollary 3
Under the assumptions of Theorem 2, when \(\alpha {\rightarrow }1^+\), we have
when \(\alpha {\rightarrow }+\infty\), we have
where \({{\,\textrm{esssup}\,}}\) is the essential supremum defined as the supremum on a set with non-zero probability measures, i.e.,
This corollary illustrates that the parameter \(\alpha\) controls the trade-off between the Rényi divergence \(D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) and \(\ln \left[ {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\phi (h'\!, {\mathcal {S}}')^{\frac{\alpha }{\alpha {-}1}}\right]\). Indeed, when \(\alpha {\rightarrow }1^+\), the Rényi divergence vanishes while the other term converges toward \(\ln\,\left[ {{\,\textrm{esssup}\,}}_{{\mathcal {S}}'\in {\mathcal {Z}}, h'\in {\mathcal {H}}}\phi (h'\!, {\mathcal {S}}')\right]\), roughly speaking the maximal value possible for the second term. On the other hand, when \(\alpha {\rightarrow }{+}\infty\), the Rényi divergence increases and converges toward \(\ln {{\,\textrm{esssup}\,}}_{h'\in {\mathcal {H}}}\frac{{\mathcal {Q}}_{{\mathcal {S}}}(h')}{{\mathcal {P}}(h')}\) and the other term decreases toward \(\ln\,\left[ {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\phi (h'{,}{\mathcal {S}}')\right]\).
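This trade-off can be observed numerically on a toy discrete pair (Q, P) (our example): \(D_{\alpha }\) increases with \(\alpha\), from the KL divergence toward \(\ln {{\,\textrm{esssup}\,}}_{h}\frac{{\mathcal {Q}}(h)}{{\mathcal {P}}(h)}\).

```python
import numpy as np

def renyi_divergence(q, p, alpha):
    """D_alpha(Q||P) for discrete Q, P and alpha > 1."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.log(np.sum(p * (q / p) ** alpha)) / (alpha - 1))

q, p = np.array([0.7, 0.2, 0.1]), np.array([1 / 3, 1 / 3, 1 / 3])
for alpha in (1.1, 2.0, 10.0, 100.0):
    print(alpha, renyi_divergence(q, p, alpha))
# The values increase with alpha and approach the log of the maximal
# density ratio, ln(0.7 * 3) ≈ 0.742.
print(np.log(np.max(q / p)))
```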
3.2.2 Comparison with the bound of Rivasplata et al. (2020)
For the sake of comparison, we recall in Eq. (6) the bound proposed by Rivasplata et al. (2020, Th.1(i)), which is more general than the bounds of Blanchard and Fleuret (2007) and Catoni (2007, Th.1.2.7):
The term \(\ln\,\tfrac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}\) (also involved in Catoni (2007); Blanchard and Fleuret (2007)) can be seen as a “disintegrated KL divergence” depending only on the sampled \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\). In contrast, our bound involves the Rényi divergence \(D_\alpha ({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) between the prior \({\mathcal {P}}\) and the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\), meaning that only one term of our bound depends on the sampled hypothesis (the risk): the divergence value is the same whatever the hypothesis. Our bound is expected to be looser because of the Rényi divergence (see van Erven & Harremoës, 2014) and its dependence on \(\delta\) (which is worse than in Eq. 6). Nevertheless, our divergence term is the main advantage of our bound. Indeed, as confirmed by our experiments (Sect. 5), our bound with \(D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) makes the learning procedure (in our self-bounding algorithm) more stable and efficient compared to the optimization of Eq. (6) with \(\ln \tfrac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}\), which is subject to high variance.
3.2.3 A parameterizable general disintegrated bound
In the PAC-Bayesian literature, parametrized bounds have been introduced (e.g., Catoni (2007); Thiemann et al. (2017)) to control the trade-off between the empirical risk and the divergence along with the additional term. For the sake of completeness, we follow a similar approach and now provide a parametrized version of our disintegrated Rényi divergence-based bound, enlarging its practical scope.
Theorem 4
(Parametrizable Disintegrated PAC-Bayes Bound) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any \(\delta \in (0,1]\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have
where \({\mathcal {Q}}_{{\mathcal {S}}}{\triangleq }A({\mathcal {S}},{\mathcal {P}})\) is output by the deterministic algorithm A.
Note that \(e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})}\) is closely related to the \(\chi ^2\)-distance. Indeed, we have \(\chi ^2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}) \triangleq {\mathbb {E}}_{h\sim {\mathcal {P}}}\left[ \tfrac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}\right] ^2\,{-}1 = e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})} {-}1\). An asset of Theorem 4 is the parameter \(\lambda\) controlling the trade-off between the exponentiated Rényi divergence \(e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})}\) and \(\frac{1}{\delta ^3}{{\mathbb {E}}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{{\mathbb {E}}}_{h'{\sim }{\mathcal {P}}}\phi (h'\!, {\mathcal {S}}')^2\). Our bound is valid for all \(\lambda\,>\!0\); thus, from a practical point of view, we can learn/tune the parameter \(\lambda\) to minimize the bound and control the possible numerical instability due to \(e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})}\). Indeed, if \(D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) is large, the numerical computation can lead to an infinite value due to finite-precision arithmetic. It is important to notice that, as for other parametrized bounds (e.g., Thiemann et al., 2017), there exists a closed-form expression for the optimal parameter \(\lambda\) (for a fixed \({\mathcal {P}}\) and \({\mathcal {Q}}_{{\mathcal {S}}}\)); this expression is derived in Proposition 5 and shows that the optimal bound of Theorem 4 corresponds to the bound of Theorem 2.
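The relation between \(e^{D_2}\) and the \(\chi ^2\)-distance can be checked numerically on discrete distributions (our toy example):

```python
import numpy as np

def renyi_2(q, p):
    """D_2(Q||P) = ln E_{h~P} [Q(h)/P(h)]^2."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.log(np.sum(p * (q / p) ** 2)))

def chi_square(q, p):
    """chi^2(Q||P) = E_{h~P} [Q(h)/P(h)]^2 - 1."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(p * (q / p) ** 2) - 1.0)

q, p = [0.7, 0.2, 0.1], [0.3, 0.4, 0.3]
print(chi_square(q, p))             # the two quantities coincide:
print(np.exp(renyi_2(q, p)) - 1.0)  # chi^2(Q||P) = e^{D_2(Q||P)} - 1
```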
Proposition 5
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\) on \({\mathcal {H}}\), for any \(\delta {\in }(0,1]\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), let
where \(\displaystyle \lambda ^* = \sqrt{\frac{{\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{{\mathbb {E}}}_{{h'{\sim }{\mathcal {P}}}}\left[ 8\phi (h'\!, {\mathcal {S}}')^2\right] }{\delta ^3 \exp (D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}))}}\).
Put into words: the optimal \(\lambda ^*\) gives the same bound for Theorem 2 and Theorem 4.
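As an illustration, the closed form of Proposition 5 is straightforward to evaluate once the two quantities it involves have been estimated; the sketch below uses purely illustrative values, not taken from our experiments:

```python
import math

def optimal_lambda(expected_phi_sq, renyi_d2, delta):
    """Closed-form lambda* of Proposition 5.

    expected_phi_sq : estimate of E_{S'~D^m} E_{h'~P}[phi(h', S')^2]
    renyi_d2        : Renyi divergence D_2(Q_S || P)
    delta           : confidence parameter in (0, 1]
    """
    return math.sqrt(8.0 * expected_phi_sq / (delta ** 3 * math.exp(renyi_d2)))

# Illustrative values only.
lam = optimal_lambda(expected_phi_sq=2.0, renyi_d2=0.5, delta=0.05)
```

In practice \(\lambda ^*\) shrinks as the Rényi divergence grows, which is exactly the mechanism that tempers the instability discussed above.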
4 The disintegration in action
So far, we have introduced theoretical results to derandomize PAC-Bayesian bounds through a disintegration approach. Indeed, the disintegration allows us to obtain a bound for a unique model sampled from the distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) instead of a bound on the averaged risk of the models. In this section, we illustrate the instantiation and usefulness of Theorem 2 on neural networks, in comparison with classical PAC-Bayesian bounds.
4.1 Specialization to neural network classifiers
We consider Neural Networks (NN) parametrized by a weight vector \({\textbf{w}}\!\in\,{\mathbb {R}}^{d}\) and overparametrized, i.e., \(d\!\gg\,m\). We aim to learn the weights of the NN leading to the lowest true risk. Practitioners usually proceed by epochsFootnote 7 and obtain one “intermediate” NN after each epoch. Then, they select the “intermediate” NN associated with the lowest validation risk. We propose translating this practice into our PAC-Bayesian setting by considering one prior per epoch. Given T epochs, we hence have T priors \({\textbf {P}}{=}\{{\mathcal {P}}_t\}_{t=1}^T\), where \(\forall t\!\in\,\{1,\ldots ,T\}, {\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2\textbf{I}_{d})\) is a Gaussian distribution centered at \({\textbf{v}}_t\) (the weights associated with the t-th “intermediate” NN) with a covariance matrix of \(\sigma ^2\textbf{I}_{d}\) (where \(\textbf{I}_{d}\) is the \(d{\times }d\)-dimensional identity matrix). Assuming the T priors are learned from a set \({\mathcal {S}}_{\text {prior}}\) such that \({\mathcal {S}}_{\text {prior}} \bigcap {\mathcal {S}}{=} \emptyset\), then Corollaries 6 and 7 will guide us to learn a posterior \({\mathcal {Q}}_{{\mathcal {S}}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\) from a prior \({\mathcal {P}}\in {\textbf {P}}\) minimizing the empirical risk on \({\mathcal {S}}\) (we give more details on the procedure after the forthcoming corollaries). Note that considering Gaussian distributions has the advantage of simplifying the expression of the KL divergence, and thus is commonly used in the PAC-Bayesian literature for neural networks (e.g., Dziugaite & Roy, 2017; Letarte et al., 2019; Zhou, Veitch, Austern, Adams, & Orbanz, 2019).Footnote 8
Corollary 6 below instantiates Theorem 2 to this neural networks setting. Then, for the sake of comparison, Corollary 7 instantiates other disintegrated bounds from the literature; more precisely, Eq. (7) corresponds to Rivasplata et al. (2010)’s bound of Eq. (6), Eq. (8) to Blanchard and Fleuret (2007)’s one, and Eq. (9) to Catoni (2007)’s one.
Corollary 6
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any set \({\textbf {P}}=\{{\mathcal {P}}_1,\dots , {\mathcal {P}}_T\}\) of T priors on \({\mathcal {H}}\) where \({\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_d)\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}) {\rightarrow } {\mathcal {M}}({\mathcal {H}})\), for any loss \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } [0, 1]\), for any \(\delta {\in }(0,1]\), we have
where \(\textrm{kl}(a\Vert b) = a\ln \tfrac{a}{b}+(1{-}a)\ln \tfrac{1-a}{1-b}\), \({\mathcal {Q}}_{{\mathcal {S}}}= {\mathcal {N}}({\textbf{w}}, \sigma ^2{{\textbf{I}}}_d)\), and the hypothesis \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\) is parametrized by \({\textbf{w}}{+}\varvec{\epsilon }\).
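To read an upper bound on the true risk \({R}_{{\mathcal {D}}}(h)\) off a bound of the form \(\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h)) \le c\), one inverts the binary kl, a standard step that can be performed by bisection since \(\textrm{kl}(a\Vert b)\) is increasing in \(b\) for \(b \ge a\). A minimal sketch (the function names are ours):

```python
import math

def binary_kl(a, b):
    """kl(a||b) = a ln(a/b) + (1-a) ln((1-a)/(1-b)), clamped for stability."""
    eps = 1e-12
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def kl_inverse_upper(a, c, tol=1e-9):
    """Largest b in [a, 1) with kl(a||b) <= c, found by bisection.
    Turns kl(R_S(h)||R_D(h)) <= c into an upper bound on R_D(h)."""
    lo, hi = a, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_kl(a, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo
```

For instance, an empirical risk of 0.1 with a right-hand side of 0.05 yields a true-risk bound of roughly 0.22.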
Corollary 7
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any set \({\mathcal {H}}\), for any set \({\textbf {P}}=\{{\mathcal {P}}_1,\dots ,{\mathcal {P}}_T\}\) of T priors on \({\mathcal {H}}\) where \({\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_d)\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), for any loss \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } \{0, 1\}\), for any \(\delta {\in }(0,1]\), with probability at least \(1{-}\delta\) over the learning sample \({\mathcal {S}}{\sim }{\mathcal {D}}^{m}\) and the hypothesis \(h{\sim } {\mathcal {Q}}_{{\mathcal {S}}}\) parametrized by \({\textbf{w}}{+}\varvec{\epsilon }\), we have \(\forall {\mathcal {P}}_t\in {\textbf {P}}\)
with \(\left[ x\right] _{+}\!{=}\max (x,0)\), and \(\textrm{kl}_{+}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h)){=}\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\) if \({R}_{{\mathcal {S}}}(h){<}{R}_{{\mathcal {D}}}(h)\) and 0 otherwise. Moreover, \(\varvec{\epsilon }{\sim }{\mathcal {N}}({{\textbf{0}}}, \sigma ^2\textbf{I}_{d})\) is a Gaussian noise such that \({\textbf{w}}{+}\varvec{\epsilon }\) are the weights of \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\) with \({\mathcal {Q}}_{{\mathcal {S}}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\), and \({\textbf{C}}\), \({\textbf{B}}\) are two sets of hyperparameters fixed a priori.
Like the parameter \(\lambda\) of Theorem 4, \(c\!\in\,{\textbf{C}}\) is a hyperparameter that controls a trade-off between the empirical risk \({R}_{{\mathcal {S}}}(h)\) and the term
Besides, the parameter \(b\!\in\,{\textbf{B}}\) controls the tightness of the bound. In general, these parameters can be tuned to minimize the bounds of Eq. (8) and Eq. (9); however, there is no closed-form expression for their minimizers. Consequently, our experimental protocol requires minimizing the bounds by gradient descent for each \(b\!\in\,{\textbf{B}}\) (respectively \(c\!\in\,{\textbf{C}}\)) in order to learn the distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) leading to the lowest bound value. To obtain a tight bound, the divergence between one prior \({\mathcal {P}}_t\!\in\,{\textbf {P}}\) and \({\mathcal {Q}}_{{\mathcal {S}}}\) must be low, i.e., \(\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^2\) (or \(\Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert _{2}^2{-}\Vert \varvec{\epsilon }\Vert ^2_{2}\)) has to be small. One solution is to split the learning sample into two non-overlapping subsets \({\mathcal {S}}_{\text {prior}}\) and \({\mathcal {S}}\), where \({\mathcal {S}}_{\text {prior}}\) is used to learn the prior, while \({\mathcal {S}}\) is used both to learn the posterior and compute the bound. Hence, if we “pre-learn” a good enough prior \({\mathcal {P}}_t\!\in\,{\textbf {P}}\) from \({\mathcal {S}}_{\text {prior}}\), then we can expect a low \(\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}\).
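The per-hyperparameter minimization described above amounts to a simple grid search; the sketch below abstracts one full gradient-descent run as a hypothetical `train_and_bound` callable (not a function from our code base):

```python
def best_hyperparameter(grid, train_and_bound):
    """For each value in the grid (e.g., B or C), run one bound
    minimization and keep the value yielding the tightest bound."""
    best_b, best_val = None, float("inf")
    for b in grid:
        val = train_and_bound(b)  # one gradient-descent minimization
        if val < best_val:
            best_b, best_val = b, val
    return best_b, best_val

# Toy stand-in: pretend the final bound is quadratic in the hyperparameter.
b, val = best_hyperparameter([0.1, 1.0, 10.0, 100.0],
                             lambda c: (c - 10.0) ** 2)
```

The same loop serves for \({\textbf{B}}\) and \({\textbf{C}}\); only the grid changes.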
At first sight, the selection of the prior weights with \({\mathcal {S}}\) by early stopping may appear to be “cheating”. However, this procedure can be seen as: 1) first constructing \({\textbf {P}}\) from the T “intermediate” NNs learned after each epoch from \({\mathcal {S}}_{\text {prior}}\), then 2) optimizing the bound with the prior that leads to the best risk on \({\mathcal {S}}\). This gives a statistically valid result: since Corollary 6 is valid for every \({\mathcal {P}}_t\!\in\,{\textbf {P}}\), we can select the one we want, in particular the one minimizing \({R}_{{\mathcal {S}}}(h)\) for a sampled \(h\sim {\mathcal {P}}_t\). This heuristic makes sense: it allows us to detect if a prior is concentrated around hypotheses that potentially overfit the learning sample \({\mathcal {S}}_{\text {prior}}\). Usually, practitioners consider this “best” prior as the final NN. In our case, the advantage is that we refine this “best” prior with \({\mathcal {S}}\) to learn the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\). Note that Pérez-Ortiz et al. (2021) have already introduced tight generalization bounds with data-dependent priors for—non-derandomized—stochastic NNs.Footnote 9 Nevertheless, the weights of the stochastic NNs are, by definition, sampled from the posterior distribution \({\mathcal {Q}}\) for each prediction. In that sense, it is important to mention that stochastic NNs differ from derandomized NNs where only one model is sampled from \({\mathcal {Q}}_{{\mathcal {S}}}\). Moreover, our training method to learn the prior differs greatly since 1) we learn T NNs (i.e., T priors) instead of only one, and 2) we fix the variance of the Gaussian in the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\). Note that, as illustrated in Sect. 5, fixing the variance is not restrictive: the advantage is that it simplifies the expression of the KL divergence while keeping the bounds tight. To the best of our knowledge, our training method for the prior is new.
4.2 A note about stochastic neural networks
Due to its stochastic nature, PAC-Bayesian theory has been explored to study stochastic NNs (e.g., Langford and Caruana (2001); Dziugaite and Roy (2017, 2018); Zhou et al. (2019); Pérez-Ortiz et al. (2021)). In Corollary 8 below, we instantiate the bound of Eq. (1) for stochastic NNs to empirically compare the stochastic and the deterministic NNs associated to the same prior and posterior distributions. We recall that, in this paper, a deterministic NN is a single h sampled from the posterior distribution \({\mathcal {Q}}_{{\mathcal {S}}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\) output by the algorithm A. This means that for each example, the label prediction is performed by the same deterministic NN: the one parametrized by the weights \({\textbf{w}}+\varvec{\epsilon }\!\in\, {\mathbb {R}}^d\). Conversely, the stochastic NN associated with a posterior distribution \({\mathcal {Q}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\) predicts the label of a given example by (i) first sampling h according to \({\mathcal {Q}}\), (ii) then returning the label predicted by h. Thus, the risk of the stochastic NN is the expected risk value \({\mathbb {E}}_{h{\sim }{\mathcal {Q}}}{R}_{{\mathcal {D}}}(h)\), where the expectation is taken over all h sampled from \({\mathcal {Q}}\). We compute the empirical risk of the stochastic NN from a Monte Carlo approximation: (i) we sample n weight vectors, and (ii) we average the risk over the n associated NNs; we denote by \({\mathcal {Q}}^n\) the distribution of such n-sample. In this context, we obtain the following PAC-Bayesian bound.
Corollary 8
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any \({\mathcal {H}}\), for any set \({\textbf {P}}=\{{\mathcal {P}}_1,\dots , {\mathcal {P}}_T\}\) of T priors on \({\mathcal {H}}\) where \({\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_d)\), for any loss \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } \{0, 1\}\), for any \(\delta {\in }(0,1]\), with probability at least \(1{-}\delta\) over \({\mathcal {S}}{\sim }{\mathcal {D}}^m\) and \(\{h_1,\dots ,h_n\}{\sim }{\mathcal {Q}}^n\), we have simultaneously \(\forall {\mathcal {P}}_t\!\in\,{\textbf {P}},\)
where \({\mathcal {Q}}={\mathcal {N}}({\textbf{w}}, \sigma ^2{{\textbf{I}}}_d)\) and the hypothesis h sampled from \({\mathcal {Q}}\) is parametrized by \({\textbf{w}}+\varvec{\epsilon }\) with \(\varvec{\epsilon }\sim {\mathcal {N}}({{\textbf{0}}}, \sigma ^2{{\textbf{I}}}_d)\).
This result has two key features that make it a suitable baseline for a fair comparison between disintegrated and classical PAC-Bayesian bounds, and thus between deterministic and stochastic NNs. On the one hand, it involves the same terms as Corollary 6. On the other hand, it is close to the bound of Pérez-Ortiz et al. (2021, Sec. 6.2), since (i) we adapt the KL divergence to our setting (i.e., \(\textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}}){=}\tfrac{1}{2\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _2^2\)), and (ii) the bound holds for T priors thanks to a union-bound argument.
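The Monte Carlo approximation of the stochastic risk described in Sect. 4.2 can be sketched as follows; for simplicity, the sampled hypothesis is a linear classifier rather than an NN, which changes nothing in the sampling scheme:

```python
import random

def stochastic_risk_mc(w, sigma2, data, n=400, seed=0):
    """Monte Carlo estimate of E_{h~Q} R(h) for Q = N(w, sigma^2 I).

    Toy stand-in: h predicts sign(<h, x>); the paper's hypotheses are
    NNs, but the two-step sampling scheme is identical.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # (i) sample one weight vector from the posterior
        h = [wi + rng.gauss(0.0, sigma2 ** 0.5) for wi in w]
        # (ii) 0-1 risk of the sampled deterministic classifier
        errs = 0
        for x, y in data:
            pred = 1 if sum(hi * xi for hi, xi in zip(h, x)) >= 0 else -1
            errs += (pred != y)
        total += errs / len(data)
    return total / n  # average over the n sampled classifiers
```

The larger \(n\), the better the approximation of \({\mathbb {E}}_{h{\sim }{\mathcal {Q}}}{R}_{{\mathcal {D}}}(h)\), but each extra sample costs one full forward pass over the data.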
5 Experiments with neural networks
The source code of our experiments is available at https://github.com/paulviallard/MLJ-Disintegrated-PB. We used the PyTorch framework (Paszke et al., 2019).
In this section, we do not seek state-of-the-art performance; in fact, we have a threefold objective: (a) we check if \(50\%/50\%\) is a good choice for splitting the original train set into \(({\mathcal {S}}_{\text {prior}}, {\mathcal {S}})\) (which is the most common split in the PAC-Bayesian literature (Germain et al., 2009; Pérez-Ortiz et al., 2021)); (b) we highlight that our disintegrated bound associated with the deterministic NN is tighter than the randomized bound associated with the stochastic NN (Corollary 8); (c) we show that our disintegrated bound (Corollary 6) is tighter and more stable than the ones based on Rivasplata et al. (2010), Blanchard and Fleuret (2007) and Catoni (2007) (Corollary 7).
5.1 Training method
We follow our Training Method (Sect. 4.1) in which we integrate the direct minimization of all the bounds. We refer to the training method based on the minimization of our bound in Corollary 6 as ours, to the one based on Eq. (7) as rivasplata, to the one based on Eq. (8) as blanchard, and to the one based on Eq. (9) as catoni. stochastic denotes the PAC-Bayesian bound with the prior and posterior distributions obtained from ours. To optimize the bound with gradient descent, we replace the non-differentiable 0-1 loss with a surrogate: the bounded cross-entropy loss (Dziugaite & Roy, 2018). We made this replacement since cross-entropy minimization works well in practice for neural networks (Goodfellow et al., 2016) and because this loss is bounded between 0 and 1, which is required for the \(\textrm{kl}()\) function. The cross-entropy is defined in a multiclass setting with \(y\!\in\,\{1,2,\ldots \}\) by \(\ell (h, ({\textbf{x}}, y)) {=} -\frac{1}{Z}\ln (\Phi (h({\textbf{x}})[y]))\!\in\, [0, 1]\) where \(h({\textbf{x}})[y]\) is the y-th output of the NN, and \(\forall p\!\in\,[0, 1], \Phi (p) {=} e^{-Z}{+}(1{-}2e^{-Z})p\) (we set \(Z{=}4\), the default parameter of Dziugaite and Roy (2018)). That being said, to learn a good enough prior \({\mathcal {P}}\!\in\,{\textbf {P}}\) and the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\), we run our Training Method with two stochastic gradient descent-based algorithms \(A_{\text {prior}}\) and A. Note that the randomness in the stochastic gradient descent algorithm is fixed to have deterministic algorithms. In phase 1), algorithm \(A_{\text {prior}}\) learns from \({\mathcal {S}}_{\text {prior}}\) the T priors \({\mathcal {P}}_1,\dots ,{\mathcal {P}}_T\!\in\,{\textbf {P}}\) (i.e., during T epochs) by minimizing the bounded cross-entropy loss.
In other words, at the end of the epoch t, the weights \({\textbf{w}}_t\) of the classifier are used to define the prior \({\mathcal {P}}_t={\mathcal {N}}({\textbf{w}}_t, \sigma ^2{{\textbf{I}}}_{d})\). Then, the best prior \({\mathcal {P}}\!\in\,{\textbf {P}}\) is selected by early stopping on \({\mathcal {S}}\). In phase 2), given \({\mathcal {S}}\) and \({\mathcal {P}}\), algorithm A integrates the direct optimization of the bounds with the bounded cross-entropy loss.
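The bounded cross-entropy surrogate defined above can be sketched as follows, assuming the NN outputs \(h({\textbf{x}})\) are softmax probabilities and using the default \(Z{=}4\):

```python
import math

Z = 4.0  # default parameter of Dziugaite & Roy (2018)

def bounded_cross_entropy(probs, y):
    """Bounded cross-entropy: -(1/Z) ln(Phi(p_y)), always in [0, 1].

    probs : class probabilities output by the NN (assumed softmaxed)
    y     : index of the correct class
    """
    p = probs[y]
    phi = math.exp(-Z) + (1.0 - 2.0 * math.exp(-Z)) * p  # Phi(p)
    return -math.log(phi) / Z
```

By construction \(\Phi (0) = e^{-Z}\) and \(\Phi (1) = 1 - e^{-Z}\), so the loss equals exactly 1 when the correct class receives zero probability and stays strictly positive but small when it receives probability 1, keeping the loss in \([0,1]\) as required by \(\textrm{kl}()\).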
5.2 Optimization procedure in algorithms A and \(A_{\text {prior}}\)
Let \(\varvec{\omega }\) be the mean vector of a Gaussian distribution used as NN weights that we are optimizing. In algorithms A and \(A_{\text {prior}}\), we use the Adam optimizer (Kingma & Ba, 2015), and we sample a noise \({\varvec{\epsilon }}\!\sim\, {\mathcal {N}}(\textbf{0}, \sigma ^2\textbf{I}_{d})\) at each iteration of the optimizer. Then, we forward the examples of the mini-batch to the NN parametrized by the weights \(\varvec{\omega }{+}\varvec{\epsilon }\), and we update \(\varvec{\omega }\) according to the bounded cross-entropy loss. Note that during phase 1), at the end of each epoch t, \({\mathcal {P}}_{t} {=} {\mathcal {N}}(\varvec{\omega }, \sigma ^2{{\textbf{I}}}_{d}) {=} {\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_{d})\) and finally at the end of phase 2) we have \({\mathcal {Q}}_{{\mathcal {S}}}{=} {\mathcal {N}}(\varvec{\omega }, \sigma ^2{{\textbf{I}}}_{d}) {=} {\mathcal {N}}({\textbf{w}}, \sigma ^2{{\textbf{I}}}_{d})\).
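The iteration above can be sketched as follows, with plain SGD standing in for Adam and a toy quadratic loss standing in for the bounded cross-entropy (both substitutions are ours, for illustration only):

```python
import random

def perturbed_sgd_step(omega, grad_fn, sigma2, lr, rng):
    """One optimizer iteration of Sect. 5.2: sample eps ~ N(0, sigma^2 I),
    evaluate the loss gradient at the perturbed weights omega + eps,
    then update omega (plain SGD here; the paper uses Adam)."""
    eps = [rng.gauss(0.0, sigma2 ** 0.5) for _ in omega]
    perturbed = [w + e for w, e in zip(omega, eps)]
    g = grad_fn(perturbed)  # gradient of the surrogate loss at omega + eps
    return [w - lr * gi for w, gi in zip(omega, g)]

# Toy run: minimize ||omega - target||^2 under weight noise.
rng = random.Random(0)
target = [1.0, -2.0]
omega = [0.0, 0.0]
grad_fn = lambda w: [2.0 * (wi - ti) for wi, ti in zip(w, target)]
for _ in range(500):
    omega = perturbed_sgd_step(omega, grad_fn, sigma2=1e-4, lr=0.05, rng=rng)
```

The key point is that the gradient is always evaluated at the sampled weights \(\varvec{\omega }{+}\varvec{\epsilon }\), while the update is applied to the mean \(\varvec{\omega }\).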
5.3 Experimental setting
5.3.1 Datasets
We perform our experimental study on three datasets: MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky, 2009). We divide each original train set into two independent subsets \({\mathcal {S}}_{\text {prior}}\) of size \(m_{\text {prior}}\) and \({\mathcal {S}}\) of size m with varying split ratios defined as \(\tfrac{m_{\text {prior}}}{m+m_{\text {prior}}}\in \{0, .1, .2, .3, .4, .5, .6, .7, .8, .9\}\). The test sets denoted by \({\mathcal {T}}\) remain the original ones.
5.3.2 Models
For the (Fashion-)MNIST datasets, we train a variant of the All Convolutional Network (Springenberg et al., 2015). The model is a 3-hidden layers convolutional network with 96 channels. We use \(5\times 5\) convolutions with a padding of size 1, and a stride of size 1 everywhere except on the second convolution where we use a stride of size 2. We adopt the Leaky ReLU activation functions after each convolution. Lastly, we use a global average pooling of size \(8\times 8\) to obtain the desired output size. Furthermore, the weights are initialized with Xavier Normal initializer (Glorot & Bengio, 2010) while each bias of size l is initialized uniformly between \(-1/{\sqrt{l}}\) and \(1/\sqrt{l}\).
For the CIFAR-10 dataset, we train a ResNet-20 network, i.e., a ResNet network from He et al. (2016) with 20 layers. The weights are initialized with Kaiming Normal initializer (He et al., 2015) and each bias of size l is initialized uniformly between \(-1/{\sqrt{l}}\) and \(1/\sqrt{l}\).
5.3.3 Optimization
For the (Fashion-)MNIST datasets, we learn the parameters of our prior distributions \({\mathcal {P}}_1,\dots ,{\mathcal {P}}_T\) by using Adam optimizer for \(T=10\) epochs with a learning rate of \(10^{-3}\) and a batch size of 32 (the other parameters of Adam are left by default). Moreover, the parameters of the posterior distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) are learned for one epoch with the same batch size and optimizer (except that the learning rate is either \(10^{-4}\) or \(10^{-6}\)). For the CIFAR-10 dataset, the parameters of the priors \({\mathcal {P}}_1,\dots ,{\mathcal {P}}_T\) are learned for \(T=100\) epochs, and the posterior distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) for 10 epochs with a batch size of 32 by using Adam optimizer as well. Additionally, the learning rate to learn the prior for CIFAR-10 is \(10^{-2}\).
5.3.4 Bounds
For blanchard’s bounds, the set of hyperparameters is defined as \({\textbf{B}}{=}\{ b{\in }{\mathbb {N}}\;\vert \; b{=}\sqrt{x},\ (x{+}1){\le }2\sqrt{m} \}\), i.e., such that blanchard’s bounds can be tighter than rivasplata’s ones. We fixed the set of hyperparameters for catoni as \({\textbf{C}}{=}\left\{ 10^{k} \vert k{\in }\{-3, -2, \dots , +3\}\right\}\). We try different values for \(\sigma ^2 {\in } \{10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}\) associated with the disintegrated KL divergence \(\ln \frac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}= \frac{1}{2\sigma ^2}(\Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert ^2_{2} {-}\Vert \varvec{\epsilon }\Vert ^2_{2})\), the “normal” Rényi divergence \(D_2({\mathcal {Q}}\Vert {\mathcal {P}}){=}\tfrac{1}{\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}\) and the KL divergence \(\textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}}){=}\tfrac{1}{2\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}\). For all the figures, the values are averaged over 400 deterministic NNs sampled from \({\mathcal {Q}}_{{\mathcal {S}}}\) (the standard deviation is small and presented in Appendix K). We additionally report as stochastic (Corollary 8) the randomized bound value and KL divergence \(\textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}}){=}\tfrac{1}{2\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}\) associated with the model learned by ours, meaning that \(n{=}400\) and that the test risk reported for ours also corresponds to the risk of the stochastic NN approximated with these 400 NNs.
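For isotropic Gaussians with a shared variance, the three divergences above reduce to the stated closed forms; a minimal sketch:

```python
def sq_norm(v):
    return sum(x * x for x in v)

def disintegrated_kl(w, v, eps, sigma2):
    """ln(Q_S(h)/P(h)) for isotropic Gaussians, h parametrized by w + eps."""
    d = [wi + ei - vi for wi, ei, vi in zip(w, eps, v)]
    return (sq_norm(d) - sq_norm(eps)) / (2.0 * sigma2)

def renyi_d2(w, v, sigma2):
    """D_2(Q_S || P) = ||w - v||^2 / sigma^2 (independent of the noise)."""
    return sq_norm([wi - vi for wi, vi in zip(w, v)]) / sigma2

def kl_gauss(w, v, sigma2):
    """KL(Q_S || P) = ||w - v||^2 / (2 sigma^2)."""
    return sq_norm([wi - vi for wi, vi in zip(w, v)]) / (2.0 * sigma2)
```

Note that with a shared variance the Rényi divergence is exactly twice the KL divergence, and only the disintegrated KL depends on the sampled noise \(\varvec{\epsilon }\).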
5.4 Results
5.4.1 Analysis of the influence of the split ratio between \({\mathcal {S}}_{\text {prior}}\) and \({\mathcal {S}}\)
Figures 1 and 2 study the evolution of the bound values after optimizing the bounds with our Training Method for different parameters. Specifically, the split ratio of the original train set varies from 0.1 to 0.9 (0.1 means that \(m_{\text {prior}}= 0.1(m+m_{\text {prior}})\)), for four variance values \(\sigma ^2\) and the two learning rates (\(10^{-6}\) and \(10^{-4}\)). For the sake of readability, we present detailed results when the split ratio is 0 in Table 1. We first remark that the behavior is different for the two learning rates. On the one hand, for lr=\(10^{-6}\), the mean bound values are close to each other, which is not surprising since the disintegrated KL divergences and the Rényi divergences are close to zero (see Tables 2, 3, 4, 5, 6, 7, 8, 9, 10). Moreover, for MNIST and Fashion-MNIST, there is a trade-off between learning a good prior with \({\mathcal {S}}_{\text {prior}}\) and minimizing a generalization bound with \({\mathcal {S}}\). In this case, the split ratio 0.5 appears to be a good choice to obtain a tight (disintegrated) PAC-Bayesian bound. This ratio is widely used in the PAC-Bayesian literature (see, e.g., in the context of linear classifiers (Germain et al., 2009), majority votes (Zantedeschi et al., 2021), and neural networks (Letarte et al., 2019; Pérez-Ortiz et al., 2021)). On the other hand, when lr=\(10^{-4}\), the mean bound values of the bounds from the literature (i.e., blanchard, catoni, and rivasplata) tend to increase as the split ratio increases, while the mean bound values of our bound remain low. Indeed, m decreases as the split ratio increases, which drastically increases the bound value when the disintegrated KL divergence is high. We further explain below why the disintegrated KL divergence can become high for the disintegrated bounds of the literature.
To do so, we will now restrict our study to a split ratio of 0.5 in order to (i) compare the tightness of the bounds and (ii) understand why the disintegrated bounds of the literature diverge.
5.4.2 Comparison between disintegrated and “classic” bounds
We first compare the “classic” PAC-Bayesian bound (Corollary 8) and our disintegrated PAC-Bayesian bound (Corollary 6). To do so, we fix the variance \(\sigma ^2{=}10^{-3}\) (along with a split ratio of 0.5). We report in Fig. 3 the mean bound values associated with ours (i.e., the Training Method that minimizes our bound) and stochastic (we recall that stochastic is the PAC-Bayesian bound of Corollary 8 on the model learned by ours). ours leads to tighter bounds than the randomized stochastic one, even though the two empirical risks are the same and the KL divergence is smaller than the Rényi divergence. The looseness of stochastic is due to the unavoidable sampling according to \({\mathcal {Q}}\) in the randomized PAC-Bayesian bound of Corollary 8 (the higher n, the tighter the bound). Thus, using a disintegrated PAC-Bayesian bound avoids sampling a large number of NNs to obtain a low risk. This confirms that our framework makes sense for practical purposes and has a great advantage in terms of time complexity when computing the bounds.
5.4.3 Analysis of the tightness of the disintegrated bounds
We now compare the tightness of the different disintegrated PAC-Bayesian bounds (i.e., our bound and the ones in the literature). We study, as before, the case where the split ratio is 0.5 and the variance \(\sigma ^2=10^{-3}\). We report in Fig. 4, for ours, rivasplata, blanchard and catoni, the mean bound values and the mean test risk \({R}_{{\mathcal {T}}}(h)\) before (i.e., with the prior \({\mathcal {P}}\)) and after applying Step 2) (i.e., with the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\)). Moreover, we report above the bars the mean train risks \({R}_{{\mathcal {S}}}(h)\) and the mean/standard deviation of the divergence values obtained after Step 2), i.e., the Rényi divergence \(D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}){=}\tfrac{1}{\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}\) for ours and the disintegrated KL divergence \(\ln \frac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}{=}\tfrac{1}{2\sigma ^2}\left[ \Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert ^2_{2} {-}\Vert \varvec{\epsilon }\Vert ^2_{2}\right]\) for the others. First of all, we observe two different behaviors for lr=\(10^{-4}\) and lr=\(10^{-6}\). For lr=\(10^{-6}\), the bound values are close to each other, as are the empirical risks and the divergences (which are close to 0). In Fig. 4, we observe that the bound values and the test risks are close to those associated with the prior distribution because the divergence is close to 0. This is probably because the learning rate is too small, implying that the bounds are barely optimized. With the higher learning rate lr=\(10^{-4}\), we observe that our bound remains tight while the disintegrated bounds of the literature are looser. Indeed, our bound improves after performing Step 2) of our Training Method, whereas for the bounds of the literature, the value of the disintegrated KL divergence is large, making the bounds looser after executing Step 2).
We now investigate the reasons for the divergence of the bounds by looking at the influence of the variance \(\sigma ^2\).
5.4.4 Analysis of the influence of the variance
Given a split ratio of 0.5 and lr\(\in \{10^{-6}, 10^{-4}\}\), we report in Fig. 5 the evolution of the bound values associated with ours, rivasplata, blanchard, and catoni when the variance varies from \(10^{-6}\) to \(10^{-3}\). First of all, the important point is that ours behaves differently than rivasplata, blanchard, and catoni. Indeed, for both learning rates, when \(\sigma ^2\) decreases, the value of our bound remains low, while the others increase drastically due to the explosion of the disintegrated KL divergence term (see Table 6 in Appendix K for more details). Concretely, the disintegrated KL divergence in Corollary 7 involves the noise \(\varvec{\epsilon }\) through \(\frac{1}{2\sigma ^2}\left( \Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}_t\Vert _{2}^2{-}\Vert \varvec{\epsilon }\Vert ^2_{2}\right)\), compared to our divergence which is \(\frac{1}{\sigma ^2}\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^2\) (without noise). Hence, the noise \(\varvec{\epsilon }\) sampled during the optimization procedure influences the disintegrated KL divergence, making it prone to high variations during training (thus depending on \(\sigma ^2\)). To illustrate the difference during the optimization, we focus on the objective function (detailed in Appendix I) of Corollaries 6 and 7 (Eq. 7). Roughly speaking, the divergence in Corollary 6 does not depend on the sampled hypothesis h (with weights \(\varvec{\omega }+\varvec{\epsilon }\)), while the divergence of Eq. (7) does. In consequence, the derivatives are less dependent on h for Corollary 6 than for Eq. (7). To be convinced of this, we propose to study the gradient with respect to the current mean vector \(\varvec{\omega }\). On the one hand, the gradient \(\frac{\partial {R}_{{\mathcal {S}}}(h)}{\partial \varvec{\omega }}\) of the risk w.r.t. \(\varvec{\omega }\) is the same for both bounds; hence, the phenomenon cannot come from this derivative. On the other hand, the gradients of the divergence in Eq. (7) and Corollary 6 are respectively
From the two derivatives, we deduce that \(\diamondsuit = \frac{1}{2}\heartsuit + \frac{1}{m\sigma ^2}\varvec{\epsilon }\). Hence, each gradient step involves a noise in the gradient of the disintegrated KL divergence \(\frac{1}{m\sigma ^2}\varvec{\epsilon }\sim {\mathcal {N}}(\textbf{0}, \frac{1}{m}{{\textbf{I}}}_d)\), which is high for a small m. This randomness causes the disintegrated KL divergence \(\frac{1}{2\sigma ^2}\left( \left\| \varvec{\omega }{+}\varvec{\epsilon }{-}{\textbf{v}}_t\right\| ^2_{2}{-}\left\| \varvec{\epsilon }\right\| ^2_{2}\right)\) to be larger when \(\sigma ^2\) decreases since (i) the divergence is divided by \(2m\sigma ^2\) and (ii) the deviation between \(\varvec{\omega }\) and \({\textbf{v}}_t\) increases. In conclusion, this makes the objective function (i.e., the bound) subject to high variations during the optimization, implying higher final bound values. Thus, the Rényi divergence has a valuable asset over the disintegrated KL divergence since it does not depend on the sampled noise \(\varvec{\epsilon }\).
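The identity \(\diamondsuit = \frac{1}{2}\heartsuit + \frac{1}{m\sigma ^2}\varvec{\epsilon }\) can be checked numerically; the sketch below assumes, consistently with the derivation above, that each divergence term enters the objective scaled by \(\frac{1}{m}\):

```python
def grad_disintegrated(omega, eps, v, sigma2, m):
    # diamond: gradient w.r.t. omega of (1/m)(||omega+eps-v||^2 - ||eps||^2)/(2 sigma^2)
    return [(wi + ei - vi) / (m * sigma2) for wi, ei, vi in zip(omega, eps, v)]

def grad_renyi(omega, v, sigma2, m):
    # heart: gradient w.r.t. omega of (1/m) ||omega - v||^2 / sigma^2
    return [2.0 * (wi - vi) / (m * sigma2) for wi, vi in zip(omega, v)]

omega, v, eps = [0.3, -0.1], [0.0, 0.2], [0.05, -0.02]
sigma2, m = 1e-3, 100
diamond = grad_disintegrated(omega, eps, v, sigma2, m)
heart = grad_renyi(omega, v, sigma2, m)
# reconstruct diamond from heart plus the noise term eps/(m sigma^2)
recon = [0.5 * h + e / (m * sigma2) for h, e in zip(heart, eps)]
```

The residual noise term \(\frac{1}{m\sigma ^2}\varvec{\epsilon }\) is exactly the component absent from our Rényi-divergence gradient.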
5.4.5 Take-home message from the experiments
To summarize, our experiments show that our disintegrated bound is, in practice, tighter than the ones in the literature. This tightness allows us to precisely bound the true risk \({R}_{{\mathcal {D}}}(h)\) (or the test risk \({R}_{{\mathcal {T}}}(h)\)); thus, model selection from the disintegrated bound is effective. Moreover, we show that our bound is more easily optimizable than the others. This is mainly because the disintegrated KL divergence of the bounds from the literature depends on the sampled hypothesis h with weights \(\varvec{\omega }{+}\varvec{\epsilon }\). Indeed, the gradients of this disintegrated KL divergence with respect to \(\varvec{\omega }\) include the noise \(\varvec{\epsilon }\), making the gradient noisy (especially with a “high” learning rate and a small variance \(\sigma ^2\)).
6 Toward information-theoretic bounds
Before concluding, we discuss another interpretation of the disintegration procedure through Theorem 9 below. Actually, the Rényi divergence between \({\mathcal {P}}\) and \({\mathcal {Q}}\) is sensitive to the choice of the learning sample \({\mathcal {S}}\): when the posterior \({\mathcal {Q}}\) learned from \({\mathcal {S}}\) differs greatly from the prior \({\mathcal {P}}\), the divergence is high. To avoid such behavior, we consider Sibson’s mutual information (Verdú, 2015), which is a measure of dependence between the random variables \({\mathcal {S}}\!\in\,{\mathcal {Z}}^m\) and \(h\!\in\,{\mathcal {H}}\). It involves an expectation over all the learning samples of a given size m and is defined for a given \(\alpha {>}1\) by
The higher \(I_{\alpha }(h{;}{\mathcal {S}})\), the higher the correlation is, meaning that the sampling of h is highly dependent on the choice of \({\mathcal {S}}\). This measure has two interesting properties: it generalizes the mutual information (Verdú, 2015), and it can be related to the Rényi divergence. Indeed, let \(\rho (h, {\mathcal {S}}){=} {\mathcal {Q}}_{{\mathcal {S}}}(h){\mathcal {D}}^{m}({\mathcal {S}})\), resp. \(\pi (h, {\mathcal {S}}){=} {\mathcal {P}}(h){\mathcal {D}}^{m}({\mathcal {S}})\), be the probability of sampling both \({\mathcal {S}}{\sim }{\mathcal {D}}^m\) and \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\), resp. \({\mathcal {S}}{\sim }{\mathcal {D}}^m\) and \(h{\sim }{\mathcal {P}}\). Then we can write:
Following Verdú (2015), the optimal prior \({\mathcal {P}}^*\) minimizing Eq. (12) is a distribution-dependent prior:
This leads to an information-theoretic generalization bound.Footnote 12
Theorem 9
(Disintegrated Information-Theoretic Bound) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any measurable function \(\phi\,:\!{\mathcal {H}}{\times } {\mathcal {Z}}^{m}{\rightarrow }{\mathbb {R}}_{+}^{*}\), for any \(\alpha\,>\!1\), for any \(\delta \in (0,1]\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have
Note that Esposito, Gastpar, and Issa (2020, Cor.4) introduced a bound based on Sibson’s mutual information, but, as discussed in Appendix J, Theorem 9 leads to a tighter bound. From a theoretical view, Theorem 9 brings a different philosophy than the disintegrated PAC-Bayes bounds. Indeed, in Theorems 2 and 4, given \({\mathcal {S}}\), the Rényi divergence \(D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\) suggests that the learned posterior \({\mathcal {Q}}_{{\mathcal {S}}}\) should be close enough to the prior \({\mathcal {P}}\) to get a low bound. In Theorem 9, by contrast, Sibson’s mutual information \(I_{\alpha }(h'; {\mathcal {S}}')\) suggests that the random variable h must not be too strongly correlated with \({\mathcal {S}}\). However, the bound of Theorem 9 is not computable in practice, notably due to the expectation over samples from the unknown distribution \({\mathcal {D}}\) in \(I_{\alpha }\). An exciting line of future work is to study how Theorem 9 can be used in practice.
7 Conclusion and future works
We provide a new and general disintegrated PAC-Bayesian bound (Theorem 2) in the family of Eq. (5), i.e., when the derandomization step consists in (i) learning a posterior distribution \({\mathcal {Q}}_{{\mathcal {S}}}\) on the set of classifiers (given an algorithm, a learning sample \({\mathcal {S}}\), and a prior distribution \({\mathcal {P}}\)) and (ii) sampling a hypothesis h from this posterior \({\mathcal {Q}}_{{\mathcal {S}}}\). While our bound can be looser than those of Rivasplata et al. (2020); Blanchard and Fleuret (2007); Catoni (2007), it offers attractive opportunities for learning deterministic classifiers. Indeed, our bound can be used not only to study the theoretical guarantees of deterministic classifiers but also to derive self-bounding algorithms (based on the bound optimization) that are more stable and efficient than those obtained from the bounds of the literature. Concretely, the bounds of Rivasplata et al. (2020); Blanchard and Fleuret (2007); Catoni (2007) depend on two terms related to the drawn classifier: the risk and the “disintegrated KL divergence”, while in our bound the (Rényi) divergence term depends on the hypothesis set, implying that the divergence remains the same regardless of which classifier is drawn. In this sense, our bound is more stable, and a learning algorithm minimizing it tends, in practice, to select a better hypothesis than with the bounds of Rivasplata et al. (2020); Blanchard and Fleuret (2007); Catoni (2007). We have illustrated the interest of our bound on neural networks, but our result could be instantiated in other well-known settings such as linear classifiers (Germain et al., 2009) or the majority vote classifier (Zantedeschi et al., 2021).
Our general framework opens the way to the study of other machine learning settings by exploiting existing randomized PAC-Bayesian theorems, for example, for Domain Adaptation (Germain et al., 2020), Adversarial Robustness (Viallard et al., 2021) or Transductive Learning (Bégin et al., 2014).
Despite being an important step towards the practical use of PAC-Bayes guarantees, our disintegrated bounds arguably have a drawback: we sample a hypothesis from a distribution instead of obtaining a bound that holds for all possible hypotheses, as uniform convergence bounds do. While uniform convergence bounds can be vacuous (Nagarajan & Kolter, 2019b), they hold (with high probability on the choice of the learning sample) for all hypotheses, including the one with the best guarantee (i.e., the one minimizing the bound). In the case of disintegrated bounds, we learn a distribution on the hypothesis set and then sample a hypothesis according to this distribution. Hence, there is a small probability (i.e., less than \(\delta\)) of sampling a bad hypothesis. An interesting research direction is to compare disintegrated and uniform convergence bounds in order to understand in which cases disintegrated bounds are preferable. Knowing that there are connections between (agnostic) PAC-learnability and uniform convergence (see, e.g., Shalev-Shwartz and Ben-David (2014)), we believe that defining a new notion of PAC-learnability, better suited to the disintegrated framework, could help to provide such a comparison.
This Appendix is structured as follows. We give the proofs of Theorem 1, Theorem 2, Corollary 3, Theorem 4, Proposition 5, Corollary 6, Corollary 7, and Corollary 8 in Appendix A, Appendix B, Appendix C, Appendix D, Appendix E, Appendix F, Appendix G, and Appendix H, respectively. We also discuss the minimization and the evaluation of the bounds introduced in the different corollaries in Appendix I. Additionally, Appendix J is devoted to Theorem 9. Appendix K provides an exhaustive list of numerical results.
Data availability
Not applicable.
Code availability
The code is available on Github at https://github.com/paulviallard/MLJ-Disintegrated-PB.
Notes
The measure considered for \(({\mathcal {A}}, \Sigma _{{\mathcal {A}}})\) is usually the Lebesgue or the counting measure.
The risk of the randomized classifier \({\mathbb {E}}_{{h\sim {\mathcal {Q}}}}{R}_{{\mathcal {D}}}(h)\) is sometimes referred to as the Gibbs risk in the PAC-Bayes literature.
We refer the reader to the proof sketches given by Figure 1 of Bégin et al. (2016) for more insights.
A self-bounding algorithm minimizes a generalization bound to obtain a model with a generalization guarantee.
We say that the KL divergence is “disintegrated” since the log term is not averaged in contrast to the KL divergence.
One epoch corresponds to one pass of the entire learning set during the optimization process.
Stochastic NNs were introduced in the PAC-Bayesian literature by Langford and Caruana (2001).
The details of the optimization and the evaluation of the bounds are described in Appendix I.
We provide a mutual information-based bound in Appendix J.
References
Alquier, P. (2021). User-friendly introduction to PAC-Bayes bounds. CoRR, abs/2110.11216.
Ambroladze, A., Parrado-Hernández, E., & Shawe-Taylor, J. (2006). Tighter PAC-Bayes bounds. Advances in neural information processing systems (NIPS) (pp. 9–16). MIT Press.
Bégin, L., Germain, P., Laviolette, F., & Roy, J. (2014). PAC-Bayesian theory for transductive learning. In: International conference on artificial intelligence and statistics (AISTATS) (Vol. 33, pp. 105–113). JMLR.org.
Bégin, L., Germain, P., Laviolette, F., & Roy, J. (2016). PAC-Bayesian bounds based on the Rényi divergence. In: International conference on artificial intelligence and statistics (AISTATS) (Vol. 51, pp. 435–444). JMLR.org.
Biggs, F., & Guedj, B. (2021). Differentiable PAC-Bayes objectives with partially aggregated neural networks. Entropy, 23(10), 1280.
Biggs, F., & Guedj, B. (2022). On margins and derandomisation in PAC-Bayes. International conference on artificial intelligence and statistics (AISTATS) (Vol. 151, pp. 3709–3731). PMLR.
Blanchard, G., & Fleuret, F. (2007). Occam’s hammer. In: Annual conference on learning theory (COLT) (Vol. 4539, pp. 112–126). Springer.
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499–526.
Catoni, O. (2007). PAC-Bayesian supervised classification: The thermodynamics of statistical learning. CoRR, abs/0712.0248.
Dziugaite, G.K., & Roy, D. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In: Conference on uncertainty in artificial intelligence (UAI). AUAI Press.
Dziugaite, G.K., & Roy, D. (2018). Data-dependent PAC-Bayes priors via differential privacy. Advances in neural information processing systems (NeurIPS) (pp. 8440–8450).
Esposito, A.R., Gastpar, M., & Issa, I. (2020). Robust generalization via \(\alpha\)-mutual information. CoRR, abs/2001.06399.
Freund, Y. (1998). Self bounding learning algorithms. Annual conference on computational learning theory (COLT) (pp. 247–258). ACM.
Germain, P., Habrard, A., Laviolette, F., & Morvant, E. (2020). PAC-Bayes and domain adaptation. Neurocomputing, 379, 379–397.
Germain, P., Lacasse, A., Laviolette, F., & Marchand, M. (2009). PAC-Bayesian learning of linear classifiers. In: Annual international conference on machine learning (ICML) (Vol. 382, pp. 353–360). ACM.
Gil, M., Alajaji, F., & Linder, T. (2013). Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences, 249, 124–131.
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In: International conference on artificial intelligence and statistics (AISTATS) (Vol. 9, pp. 249–256). JMLR.org.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Guedj, B. (2019). A primer on PAC-Bayesian learning. CoRR, abs/1901.05353.
Hardt, M., Recht, B., & Singer, Y. (2016). Train faster, generalize better: Stability of stochastic gradient descent. In: International conference on machine learning (ICML) (Vol. 48, pp. 1225–1234). JMLR.org.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: IEEE international conference on computer vision (ICCV) (pp. 1026–1034). IEEE Computer Society.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778). IEEE Computer Society.
Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In: International conference on learning representations (ICLR).
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images (Unpublished master’s thesis). University of Toronto.
Langford, J., & Caruana, R. (2001). (Not) bounding the true error. Advances in neural information processing systems (NIPS) (pp. 809–816). MIT Press.
Langford, J., & Shawe-Taylor, J. (2002). PAC-Bayes & margins. Advances in neural information processing systems (NIPS) (pp. 423–430). MIT Press.
LeCun, Y., Cortes, C., & Burges, C. (1998). The MNIST dataset of handwritten digits. Retrieved from http://yann.lecun.com/exdb/mnist/
Letarte, G., Germain, P., Guedj, B., & Laviolette, F. (2019). Dichotomize and generalize: PAC-Bayesian binary activated deep neural networks. Advances in neural information processing systems (NeurIPS) (pp. 6869–6879).
Lever, G., Laviolette, F., & Shawe-Taylor, J. (2013). Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 473, 4–28.
Maurer, A. (2004). A note on the PAC Bayesian theorem. CoRR, cs.LG/0411099.
McAllester, D. (1998). Some PAC-Bayesian theorems. In: Annual conference on computational learning theory (COLT) (pp. 230–234). ACM.
Nagarajan, V., & Kolter, Z. (2019a). Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. International conference on learning representations (ICLR): OpenReview.net.
Nagarajan, V., & Kolter, Z. (2019b). Uniform convergence may be unable to explain generalization in deep learning. Advances in neural information processing systems (NeurIPS) (pp. 11611–11622).
Neyshabur, B., Bhojanapalli, S., & Srebro, N. (2018). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. International conference on learning representations (ICLR): OpenReview.net.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems (NeurIPS) (pp. 8024–8035).
Pérez-Ortiz, M., Rivasplata, O., Shawe-Taylor, J., & Szepesvári, C. (2021). Tighter risk certificates for neural networks. Journal of Machine Learning Research, 22, 227:1–227:40.
Reeb, D., Doerr, A., Gerwinn, S., & Rakitsch, B. (2018). Learning gaussian processes by minimizing PAC-Bayesian generalization bounds. Advances in neural information processing systems (NeurIPS) (pp. 3341–3351).
Rivasplata, O., Kuzborskij, I., Szepesvári, C., & Shawe-Taylor, J. (2020). PAC-Bayes analysis beyond the usual bounds. Advances in neural information processing systems (NeurIPS).
Seeger, M. (2002). PAC-Bayesian generalisation error bounds for gaussian process classification. Journal of Machine Learning Research, 3, 233–269.
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning - from theory to algorithms. Cambridge University Press.
Shawe-Taylor, J., & Williamson, R. (1997). A PAC analysis of a bayesian estimator. In: Annual conference on computational learning theory (COLT) (pp. 2–9). ACM.
Springenberg, J.T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). Striving for simplicity: The all convolutional net. In: International conference on learning representations (ICLR).
Thiemann, N., Igel, C., Wintenberger, O., & Seldin, Y. (2017). A strongly quasiconvex PAC-Bayesian bound. In: International conference on algorithmic learning theory (ALT) (Vol. 76, pp. 466–492). PMLR.
Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
van Erven, T., & Harremoës, P. (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7), 3797–3820.
Vapnik, V. (2000). The nature of statistical learning theory. Springer.
Verdú, S. (2015). \(\alpha\)-mutual information. Information theory and applications workshop (ITA) (pp. 1–6). IEEE.
Viallard, P., Vidot, G., Habrard, A., & Morvant, E. (2021). A PAC-Bayes analysis of adversarial robustness. Advances in neural information processing systems (NeurIPS) (pp. 14421–14433).
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747.
Xu, H., & Mannor, S. (2012). Robustness and generalization. Machine Learning, 86(3), 391–423.
Zantedeschi, V., Viallard, P., Morvant, E., Emonet, R., Habrard, A., Germain, P., & Guedj, B. (2021). Learning stochastic majority votes by minimizing a PAC-Bayes generalization bound. Advances in neural information processing systems (NeurIPS) (pp. 455–467).
Zhou, W., Veitch, V., Austern, M., Adams, R., & Orbanz, P. (2019). Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. International conference on learning representations (ICLR): OpenReview.net.
Acknowledgements
This work was partially funded by the French ANR Project APRIORI ANR-18-CE23-0015. Pascal Germain is supported by the Canada CIFAR AI Chair Program, and the NSERC Discovery Grant RGPIN-2020-07223. We would like to thank the reviewers for their valuable comments and their suggestions to improve the paper.
Funding
This work was partially funded by the French ANR Project APRIORI ANR-18-CE23-0015. Pascal Germain is supported by the Canada CIFAR AI Chair Program, and the NSERC Discovery Grant RGPIN-2020-07223.
Author information
Authors and Affiliations
Contributions
Conceptualization: PV, EM, PG, PAH; Formal analysis and investigation: PV; Software: PV; Writing—original draft preparation: PV; Writing—review and editing: PG, PAH; Funding acquisition: EM, PG, PAH; Supervision: EM, PG, PAH.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Krzysztof Dembczynski and Emilie Devijver.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proof of Theorem 1
Theorem 1
(General PAC-Bayes bounds) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\) on \({\mathcal {H}}\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any \(\delta \in (0,1]\) we have
with \(\textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}}){\triangleq } {{\displaystyle {\mathbb {E}}_{h{\sim }{\mathcal {Q}}}}} \ln \tfrac{{\mathcal {Q}}(h)}{{\mathcal {P}}(h)}\) the Kullback-Leibler (KL-)divergence between \({\mathcal {Q}}\) and \({\mathcal {P}}\), and \(D_{\alpha }({\mathcal {Q}}\Vert {\mathcal {P}}) {\triangleq } \frac{1}{\alpha {-}1}\ln\,\left[ {{\displaystyle {\mathbb {E}}_{h{\sim }{\mathcal {P}}}}}\!\left[\,\frac{ {\mathcal {Q}}(h)}{{\mathcal {P}}(h)}\right] ^{\!\alpha }\right]\) the Rényi divergence between \({\mathcal {Q}}\) and \({\mathcal {P}}\) \((\alpha {>}1)\).
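As a quick numerical sanity check on these two definitions (an illustrative sketch with our own toy distributions, not part of the paper), the following code computes both divergences for discrete distributions; since the Rényi divergence is non-decreasing in \(\alpha\) and recovers the KL divergence as \(\alpha \rightarrow 1^+\), we expect \(D_{\alpha }({\mathcal {Q}}\Vert {\mathcal {P}})\ge \textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}})\) for \(\alpha >1\).

```python
import math

def kl_div(q, p):
    # KL(Q||P) = E_{h~Q}[ln(Q(h)/P(h))] for discrete distributions.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def renyi_div(q, p, alpha):
    # D_alpha(Q||P) = 1/(alpha-1) * ln E_{h~P}[(Q(h)/P(h))^alpha].
    s = sum(pi * (qi / pi) ** alpha for qi, pi in zip(q, p))
    return math.log(s) / (alpha - 1.0)

q, p = [0.7, 0.2, 0.1], [1 / 3, 1 / 3, 1 / 3]
d_kl = kl_div(q, p)          # KL divergence to a uniform prior
d_2 = renyi_div(q, p, 2.0)   # Rényi divergence at alpha = 2
```

Taking \(\alpha\) close to 1 makes the Rényi divergence approach the KL divergence, which is the limiting case used in Corollary 3.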
Proof
By the Donsker-Varadhan’s variational formula (see e.g., Bégin et al., 2016, Lemma 3), we have
By Markov’s inequality and taking the logarithm of both sides, we have
By merging Eqs. (A1) and (A2), we obtain Eq. (1).
The proof of Eq. (2) is similar to that of Eq. (1). Indeed, from the Rényi change of measure (see e.g., Bégin et al., 2016, Theorem 8), we have
By Markov’s inequality and taking the logarithm of both sides, we have
By merging Eqs. (A3) and (A4), we obtain Eq. (2). \(\square\)
Appendix B: Proof of Theorem 2
Theorem 2
(General Disintegrated PAC-Bayes Bound) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any \(\alpha\,>\!1\), for any \(\delta \in (0,1]\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have
where \({\mathcal {Q}}_{{\mathcal {S}}}{\triangleq }A({\mathcal {S}}, {\mathcal {P}})\) is output by the deterministic algorithm A.
Proof
For any sample \({\mathcal {S}}\in {\mathcal {Z}}^m\), prior \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\) and deterministic algorithm A fixed a priori, let \({\mathcal {Q}}_{{\mathcal {S}}}=A({\mathcal {S}}, {\mathcal {P}})\) be the distribution output by the algorithm A. Note that \(\phi (h,\!{\mathcal {S}})\) is a strictly positive random variable. Hence, from Markov’s inequality, we have
Taking the expectation over \({\mathcal {S}}\sim {\mathcal {D}}^{m}\) on both sides of the inequality gives
Since both sides of the inequality are strictly positive, we can take the logarithm and multiply by \(\frac{\alpha }{\alpha -1}>0\) to obtain
We develop the right-hand side of the inequality and take the expectation over hypotheses drawn from the prior distribution \({\mathcal {P}}\). We have, for any prior \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\),
Remark that \(\frac{1}{r}+\frac{1}{s}=1\) with \(r=\alpha\) and \(s=\frac{\alpha }{\alpha -1}\). Hence, we can apply Hölder’s inequality:
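For readability, Hölder's inequality in the form used at this step (with conjugate exponents \(r=\alpha\) and \(s=\frac{\alpha }{\alpha -1}\), applied to the change of measure from \({\mathcal {Q}}_{{\mathcal {S}}}\) to \({\mathcal {P}}\)) reads as follows; this is our restatement of the standard inequality under the notation of Theorem 1:

```latex
\mathbb{E}_{h'\sim\mathcal{P}}\!\left[\frac{\mathcal{Q}_{\mathcal{S}}(h')}{\mathcal{P}(h')}\,\phi(h'\!,\mathcal{S})\right]
\;\le\;
\left(\mathbb{E}_{h'\sim\mathcal{P}}\!\left[\frac{\mathcal{Q}_{\mathcal{S}}(h')}{\mathcal{P}(h')}\right]^{\alpha}\right)^{\!\frac{1}{\alpha}}
\left(\mathbb{E}_{h'\sim\mathcal{P}}\,\phi(h'\!,\mathcal{S})^{\frac{\alpha}{\alpha-1}}\right)^{\!\frac{\alpha-1}{\alpha}} .
```

The first factor equals \(\exp (\frac{\alpha -1}{\alpha }D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}))\) by definition of the Rényi divergence, which is how the divergence term enters the bound.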
Then, since both sides of the inequality are strictly positive, we take the logarithm, add \(\ln (\tfrac{2}{\delta })\), and multiply both sides by \(\frac{\alpha }{\alpha -1}>0\) to obtain
From this inequality, we can deduce that
Given a prior \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\), note that \({\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\phi (h'\!, {\mathcal {S}})^{\frac{\alpha }{\alpha -1}}\) is a strictly positive random variable. Hence, we apply Markov’s inequality to obtain
Since the inequality does not depend on the random variable \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\), we have
Since both sides of the inequality are strictly positive, we take the logarithm of both sides and add \(\frac{\alpha }{\alpha -1}\ln \frac{2}{\delta }\) to obtain
Combining Equations (B5) and (B6) with a union bound gives us the desired result. \(\square\)
Appendix C: Proof of Corollary 3
Corollary 3
Under the assumptions of Theorem 2, when \(\alpha {\rightarrow }1^+\), we have
when \(\alpha {\rightarrow }+\infty\), we have
where \({{\,\textrm{esssup}\,}}\) is the essential supremum, i.e., the smallest upper bound holding outside a set of zero probability measure, i.e.,
Proof
Starting from Theorem 2 and rearranging, we have
Then, we prove the cases \(\alpha \rightarrow 1^+\) and \(\alpha \rightarrow +\infty\) separately.
When \(\alpha \rightarrow 1\).
First, we have \(\lim _{\alpha \rightarrow 1^+}\frac{2\alpha {-}1}{\alpha }\ln \frac{2}{\delta } = \ln \frac{2}{\delta }\) and \(\lim _{\alpha \rightarrow 1^+}\frac{\alpha -1}{\alpha }D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}) = 0\).
Furthermore, note that
is the \(L^{\frac{\alpha }{\alpha {-}1}}\)-norm of the function \(\phi : {\mathcal {H}}\times {\mathcal {Z}}^m \rightarrow {\mathbb {R}}_{+}^{*}\), where \(\lim _{\alpha \rightarrow 1^+} \Vert \phi \Vert _{\frac{\alpha }{\alpha {-}1}} = \lim _{\alpha '\rightarrow +\infty } \Vert \phi \Vert _{\alpha '}\) (since \(\lim _{\alpha \rightarrow 1^+}\frac{\alpha }{\alpha {-}1} = +\infty\)). Then, it is well known that
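Concretely, the standard fact invoked is the limit of \(L^{p}\)-norms under the probability measure defining the norm (here the joint draw of \(h'\sim {\mathcal {P}}\) and \({\mathcal {S}}'\sim {\mathcal {D}}^{m}\)); we restate it here for completeness:

```latex
\lim_{p\to+\infty}\Vert\phi\Vert_{p}
\;=\; \Vert\phi\Vert_{\infty}
\;\triangleq\; \operatorname*{ess\,sup}_{h'\sim\mathcal{P},\;\mathcal{S}'\sim\mathcal{D}^{m}} \phi(h'\!,\mathcal{S}') .
```

It is applied with \(p=\frac{\alpha }{\alpha -1}\rightarrow +\infty\) as \(\alpha \rightarrow 1^+\), which produces the essential supremum appearing in Corollary 3.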
Hence, we have
Finally, we can deduce that
When \(\alpha \rightarrow +\infty\).
First, we have \(\lim _{\alpha \rightarrow +\infty }{\frac{2\alpha {-}1}{\alpha }}\ln \frac{2}{\delta } = \ln \frac{2}{\delta }\left[ 2 -\lim _{\alpha \rightarrow +\infty }\frac{1}{\alpha } \right] = 2\ln \frac{2}{\delta }= \ln \frac{4}{\delta ^2}\) and \(\lim _{\alpha \rightarrow +\infty } \Vert \phi \Vert _{\frac{\alpha }{\alpha {-}1}} = \lim _{\alpha '\rightarrow 1} \Vert \phi \Vert _{\alpha '} = \Vert \phi \Vert _1\) (since \(\lim _{\alpha \rightarrow +\infty }\frac{\alpha }{\alpha -1} = \lim _{\alpha \rightarrow +\infty } \frac{1}{1-\frac{1}{\alpha }}=1\)). Hence, we have
Moreover, by rearranging the terms in \(\frac{\alpha {-}1}{\alpha }D_{\alpha }({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\), we have
where \(\Vert \gamma \Vert _{\alpha }\) is the \(L^{\alpha }\)-norm of the function \(\gamma\) defined as \(\gamma (h)=\tfrac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}\). We have
Finally, we can deduce that
\(\square\)
Appendix D: Proof of Theorem 4
For the sake of completeness, we first prove an upper bound on \(\sqrt{ab}\) (see, e.g., Thiemann et al., 2017).
Lemma 10
For any \(a>0, b>0\), we have
Proof
Let \(f(\lambda ) = \left( \tfrac{a}{\lambda }+\lambda b \right)\). The first derivative of f w.r.t. \(\lambda\) is
Moreover, from the derivative we deduce that \(\frac{\partial f}{\partial \lambda }(\lambda ) < 0 \iff \lambda \in (0, \sqrt{\frac{a}{b}})\), \(\frac{\partial f}{\partial \lambda }(\lambda )> 0 \iff \lambda > \sqrt{\frac{a}{b}}\), and \(\frac{\partial f}{\partial \lambda }(\lambda ) = 0 \iff \lambda = \sqrt{\frac{a}{b}}\). This implies that f is strictly decreasing on \((0, \sqrt{\frac{a}{b}})\), strictly increasing for \(\lambda > \sqrt{\frac{a}{b}}\), and admits a unique minimum at \(\lambda ^* = \sqrt{\frac{a}{b}}\). Additionally, \(f(\lambda ^*)=2\sqrt{ab}\), which proves the claim. \(\square\)
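As a quick numerical illustration of Lemma 10 (a sketch with arbitrary toy values of our choosing), the inequality \(2\sqrt{ab}\le \frac{a}{\lambda }+\lambda b\) holds for every \(\lambda >0\), with equality exactly at \(\lambda ^*=\sqrt{a/b}\):

```python
import math

def f(a, b, lam):
    # The function f(lambda) = a/lambda + lambda*b from the proof of Lemma 10.
    return a / lam + lam * b

a, b = 3.0, 5.0
lam_star = math.sqrt(a / b)

# At the minimizer lambda*, f attains exactly 2*sqrt(a*b).
gap_at_min = f(a, b, lam_star) - 2.0 * math.sqrt(a * b)

# Any other lambda gives a strictly larger value (strict convexity of f).
slack = [f(a, b, lam) - 2.0 * math.sqrt(a * b) for lam in (0.1, 0.5, 1.0, 2.0)]
```

This is the elementary inequality that Theorem 4 exploits: replacing the square root by the \(\lambda\)-parametrized upper bound makes the resulting bound linear in the two terms and hence easier to optimize.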
We can now prove Theorem 4 with Lemma 10.
Theorem 4
(Parametrizable Disintegrated PAC-Bayes Bound) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any \(\delta \in (0,1]\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have
where \({\mathcal {Q}}_{{\mathcal {S}}}{\triangleq }A({\mathcal {S}},{\mathcal {P}})\) is output by the deterministic algorithm A.
Proof
The proof is similar to the one of Theorem 2. Since \(\phi (h,\!{\mathcal {S}})\) is a strictly positive random variable, from Markov’s inequality, we have
Taking the expectation over \({\mathcal {S}}\sim {\mathcal {D}}^{m}\) on both sides of the inequality gives
Using Lemma 10 with \(a=\tfrac{4}{\delta ^2}\phi (h'\!, {\mathcal {S}})^2\) and \(b=\tfrac{ {\mathcal {Q}}_{{\mathcal {S}}}(h')^2}{{\mathcal {P}}(h')^2}\), we have for all prior \({\mathcal {P}}{\in }{\mathcal {M}}^{*}({\mathcal {H}})\)
Then, since both sides of the inequality are strictly positive, we take the logarithm to obtain
Hence, we can deduce that
Given a prior \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\), note that \({\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\phi (h'\!, {\mathcal {S}})^{2}\) is a strictly positive random variable. Hence, we apply Markov’s inequality:
Since the inequality does not depend on the random variable \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\), we have
Additionally, note that multiplying by \(\frac{4}{2\lambda \delta ^2}>0\), adding \(\frac{\lambda }{2}\exp (D_2( {\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}))\), and taking the logarithm of both sides of the inequality results in the same indicator function. Indeed,
Hence, we can deduce that
Combining Equations (D7) and (D8) with a union bound gives us the desired result. \(\square\)
Appendix E: Proof of Proposition 5
Proposition 5
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\) on \({\mathcal {H}}\), for any \(\delta {\in }(0,1]\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), let
where \(\displaystyle \lambda ^* = \sqrt{\frac{{\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{{\mathbb {E}}}_{{h'{\sim }{\mathcal {P}}}}\left[ 8\phi (h'\!, {\mathcal {S}}')^2\right] }{\delta ^3 \exp (D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}))}}\).
Put into words: the optimal \(\lambda ^*\) gives the same bound for Theorem 2 and Theorem 4.
Proof
We consider the right-hand side of the inequality of Theorem 4 (which is strictly positive): we have
Since \(\ln\) is a strictly increasing function, we have
Then, we apply Lemma 10 by taking \(a = \frac{8}{2\delta ^3}{\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\left[ \phi (h', {\mathcal {S}}')^2\right]\) and \(b=\frac{1}{2}e^{D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})}\) to obtain \(\lambda ^*=\sqrt{\frac{a}{b}}= \sqrt{\frac{{\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{{\mathbb {E}}}_{{h'{\sim }{\mathcal {P}}}}\left[ 8\phi (h'\!, {\mathcal {S}}')^2\right] }{\delta ^3 \exp (D_2({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}))}}\). Finally, by substituting \(\lambda ^*\) into Eq. (E9), we obtain
which is the desired result. \(\square\)
Appendix F: Proof of Corollary 6
We introduce Theorem 2’, which takes into account a set of priors \({\textbf {P}}\) while Theorem 2 handles a unique prior \({\mathcal {P}}\).
Theorem 2’ For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any set \({\textbf {P}}{=}\{{\mathcal {P}}_t\}_{t=1}^T\) of T priors \({\mathcal {P}}_t\in {\mathcal {M}}^{*}({\mathcal {H}})\), for any measurable function \(\phi\,:\! {\mathcal {H}}{\times }{\mathcal {Z}}^{m}{\rightarrow } {\mathbb {R}}_{+}^{*}\), for any \(\alpha\,>\!1\), for any \(\delta \in (0,1]\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}{\times }{\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have
where \({\mathcal {Q}}_{{\mathcal {S}}}{\triangleq }A({\mathcal {S}}, {\mathcal {P}})\) is output by the deterministic algorithm A.
Proof
The proof is mainly the same as Theorem 2. Indeed, we first derive the same equation as Eq. (B5), we have
Then, we apply Markov’s inequality (as in Theorem 2) T times, once for each of the T priors \({\mathcal {P}}_t\) in \({\textbf {P}}\), but with confidence \(\frac{\delta }{2T}\) instead of \(\tfrac{\delta }{2}\); we have
Finally, combining the \(T+1\) bounds with a union bound gives us the desired result. \(\square\)
We now prove Corollary 6 from Theorem 2’.
Corollary 6
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any set \({\textbf {P}}=\{{\mathcal {P}}_1,\dots , {\mathcal {P}}_T\}\) of T priors on \({\mathcal {H}}\) where \({\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_d)\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}) {\rightarrow } {\mathcal {M}}({\mathcal {H}})\), for any loss \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } [0, 1]\), for any \(\delta {\in }(0,1]\), we have
where \(\textrm{kl}(a\Vert b) = a\ln \tfrac{a}{b}+(1{-}a)\ln \tfrac{1-a}{1-b}\), \({\mathcal {Q}}_{{\mathcal {S}}}= {\mathcal {N}}({\textbf{w}}, \sigma ^2{{\textbf{I}}}_d)\), and the hypothesis \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\) is parametrized by \({\textbf{w}}{+}\varvec{\epsilon }\).
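Since the bound controls \(\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\) rather than \({R}_{{\mathcal {D}}}(h)\) directly, obtaining a numerical risk bound requires inverting \(\textrm{kl}\) in its second argument. Below is a minimal bisection sketch for this inversion (our own illustration, not the implementation used in the experiments; see Appendix I for the evaluation details):

```python
import math

def kl_bin(a, b):
    # Binary KL divergence kl(a||b) as defined above, with clipping for safety.
    eps = 1e-12
    a, b = min(max(a, eps), 1 - eps), min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def kl_inverse(emp_risk, bound, iters=100):
    """Largest r in [emp_risk, 1) such that kl(emp_risk || r) <= bound,
    found by bisection (kl(a||.) is increasing on [a, 1)); this r is the
    resulting upper bound on the true risk R_D(h)."""
    lo, hi = emp_risk, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_bin(emp_risk, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# E.g., empirical risk 0.05 and a right-hand side of 0.02 for kl(.||.).
risk_bound = kl_inverse(0.05, 0.02)
```

The monotonicity of \(\textrm{kl}(a\Vert \cdot )\) on \([a,1)\) is what makes the bisection valid.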
Proof
We instantiate Theorem 2’ with \(\phi (h,\!{\mathcal {S}})=\exp\,\left[ \tfrac{\alpha -1}{\alpha }m\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\right]\) and \(\alpha =2\). We have with probability at least \(1-\delta\) over \({\mathcal {S}}\sim {\mathcal {D}}^m\) and \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\), for all prior \({\mathcal {P}}_t\!\in\, {\textbf {P}}\)
From Maurer (2004), we upper-bound \({\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^m}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}_t} e^{m\textrm{kl}({R}_{{\mathcal {S}}'}(h')\Vert {R}_{{\mathcal {D}}}(h'))}\) by \(2\sqrt{m}\) for each prior \({\mathcal {P}}_t\). Hence, we have, for all priors \({\mathcal {P}}_t\!\in\, {\textbf {P}}\),
Additionally, the Rényi divergence \(D_{2}({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}_t)\) between two multivariate Gaussians \({\mathcal {Q}}_{{\mathcal {S}}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\) and \({\mathcal {P}}_t{=}{\mathcal {N}}({\textbf{v}}_t, \sigma ^2\textbf{I}_{d})\) is well known: its closed-form expression is \(D_{2}({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}}_t){=}\frac{\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}}{\sigma ^2}\) (see, for example, Gil et al., 2013). \(\square\)
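As a sanity check of this closed form (an illustrative sketch with our own toy values, in one dimension, i.e., \(d=1\)), the definition \(D_{2}({\mathcal {Q}}\Vert {\mathcal {P}})=\ln {\mathbb {E}}_{x\sim {\mathcal {P}}}[({\mathcal {Q}}(x)/{\mathcal {P}}(x))^{2}]=\ln \int {\mathcal {Q}}(x)^{2}/{\mathcal {P}}(x)\,dx\) can be integrated numerically and compared against \((w-v)^{2}/\sigma ^{2}\):

```python
import math

def d2_closed_form(w, v, sigma):
    # Closed form for D_2(N(w, s^2) || N(v, s^2)) in one dimension.
    return (w - v) ** 2 / sigma ** 2

def d2_numeric(w, v, sigma, lo=-20.0, hi=20.0, n=200000):
    # Midpoint-rule approximation of ln ∫ Q(x)^2 / P(x) dx, computed in
    # log-space so that the Gaussian tails underflow harmlessly to zero.
    dx = (hi - lo) / n
    log_norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        log_integrand = (-(x - w) ** 2 / sigma ** 2
                         + (x - v) ** 2 / (2.0 * sigma ** 2) - log_norm)
        total += math.exp(log_integrand) * dx
    return math.log(total)
```

With \(w=1\), \(v=0\), \(\sigma =1\), both computations give \(D_{2}=1\), matching the closed form used in the proof.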
Appendix G: Proof of Corollary 7
We first prove the following lemma in order to prove Corollary 7.
Lemma 11
If \({\mathcal {Q}}_{{\mathcal {S}}}={\mathcal {N}}({\textbf{w}}, \sigma ^2{{\textbf{I}}}_d)\) and \({\mathcal {P}}= {\mathcal {N}}({\textbf{v}}, \sigma ^2{{\textbf{I}}}_{d})\), we have
where \(\varvec{\epsilon }{\sim }{\mathcal {N}}({{\textbf{0}}}, \sigma ^2\textbf{I}_{d})\) is a Gaussian noise such that \({\textbf{w}}{+}\varvec{\epsilon }\) are the weights of \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\) with \({\mathcal {Q}}_{{\mathcal {S}}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\).
Proof
The probability density functions of \({\mathcal {Q}}_{{\mathcal {S}}}\) and \({\mathcal {P}}\) for \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\) (with the weights \({\textbf{w}}{+}\varvec{\epsilon }\)) can be rewritten as
We can derive a closed-form expression of \(\ln\,\left[ \frac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}(h)}\right]\). Indeed, we have
\(\square\)
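To illustrate Lemma 11 numerically (a sketch with arbitrary toy vectors of our choosing), the log-density ratio between the two isotropic Gaussians evaluated at \(h={\textbf{w}}{+}\varvec{\epsilon }\) reduces to \(\frac{\Vert {\textbf{w}}{+}\varvec{\epsilon }{-}{\textbf{v}}\Vert _{2}^{2}-\Vert \varvec{\epsilon }\Vert _{2}^{2}}{2\sigma ^{2}}\), since the normalization constants of the two densities are identical and cancel:

```python
def sq_norm(u):
    # Squared Euclidean norm of a vector given as a list of floats.
    return sum(ui * ui for ui in u)

def log_ratio_direct(w, v, eps, sigma):
    # ln Q(h) - ln P(h) from the isotropic Gaussian log-densities at h = w + eps
    # (the shared normalization constant cancels, so it is omitted).
    h = [wi + ei for wi, ei in zip(w, eps)]
    log_q = -sq_norm([hi - wi for hi, wi in zip(h, w)]) / (2 * sigma ** 2)
    log_p = -sq_norm([hi - vi for hi, vi in zip(h, v)]) / (2 * sigma ** 2)
    return log_q - log_p

def log_ratio_closed(w, v, eps, sigma):
    # Closed form of Lemma 11: (||w + eps - v||^2 - ||eps||^2) / (2 sigma^2).
    wev = [wi + ei - vi for wi, ei, vi in zip(w, eps, v)]
    return (sq_norm(wev) - sq_norm(eps)) / (2 * sigma ** 2)

w, v, eps, sigma = [1.0, -2.0], [0.5, 0.0], [0.3, 0.1], 0.7
direct = log_ratio_direct(w, v, eps, sigma)
closed = log_ratio_closed(w, v, eps, sigma)
```

This is the quantity that replaces the "disintegrated KL divergence" when instantiating the bounds of Corollary 7 with Gaussian posteriors and priors.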
We can now prove Corollary 7.
Corollary 7
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any set \({\mathcal {H}}\), for any set \({\textbf {P}}=\{{\mathcal {P}}_1,\dots ,{\mathcal {P}}_T\}\) of T priors on \({\mathcal {H}}\) where \({\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_d)\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), for any loss \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } \{0, 1\}\), for any \(\delta {\in }(0,1]\), with probability at least \(1{-}\delta\) over the learning sample \({\mathcal {S}}{\sim }{\mathcal {D}}^{m}\) and the hypothesis \(h{\sim } {\mathcal {Q}}_{{\mathcal {S}}}\) parametrized by \({\textbf{w}}{+}\varvec{\epsilon }\), we have \(\forall {\mathcal {P}}_t\in {\textbf {P}}\)
with \(\left[ x\right] _{+}\!{=}\max (x,0)\), and \(\textrm{kl}_{+}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h)){=}\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\) if \({R}_{{\mathcal {S}}}(h){<}{R}_{{\mathcal {D}}}(h)\) and 0 otherwise. Moreover, \(\varvec{\epsilon }{\sim }{\mathcal {N}}({{\textbf{0}}}, \sigma ^2\textbf{I}_{d})\) is a Gaussian noise such that \({\textbf{w}}{+}\varvec{\epsilon }\) are the weights of \(h{\sim }{\mathcal {Q}}_{{\mathcal {S}}}\) with \({\mathcal {Q}}_{{\mathcal {S}}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\), and \({\textbf{C}}\), \({\textbf{B}}\) are two sets of hyperparameters fixed a priori.
Proof
We will prove the three bounds separately.
Equation (7). We instantiate Theorem 1(i) of Rivasplata et al. (2020) with \(\phi (h,\!{\mathcal {S}})=\exp\,\left[ m\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\right]\), applying the theorem T times, once for each prior \({\mathcal {P}}_t\in {\textbf {P}}\) (with a confidence \(\frac{\delta }{T}\) instead of \(\delta\)). Hence, for each prior \({\mathcal {P}}_t\in {\textbf {P}}\), we have with probability at least \(1-\frac{\delta }{T}\) over the random choice of \({\mathcal {S}}\sim {\mathcal {D}}^m\) and \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\)
From Maurer (2004), we upper-bound \({\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^m}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}_t} e^{m\textrm{kl}({R}_{{\mathcal {S}}'}(h')\Vert {R}_{{\mathcal {D}}}(h'))}\) by \(2\sqrt{m}\), and we rewrite the disintegrated KL divergence using Lemma 11. Finally, a union-bound argument gives the claim.
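Maurer's bound can be checked exactly in the Bernoulli case: writing the expectation over \({\mathcal {S}}'\) as a binomial sum, the factor \(e^{m\textrm{kl}}\) cancels the dependence on the Bernoulli parameter p, and the sum can be evaluated directly. A small sketch (the bound \(2\sqrt{m}\) is usually stated for \(m\ge 8\); function names are our own):

```python
import math

def binary_kl(q, p):
    """kl(q||p) for Bernoulli parameters, with the convention 0 ln 0 = 0."""
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out

def maurer_moment(m):
    """Exact E[exp(m kl(R_S || p))] when R_S is the mean of m Bernoulli(p) draws.

    Each term C(m,k) p^k (1-p)^(m-k) exp(m kl(k/m || p)) simplifies to
    C(m,k) (k/m)^k ((m-k)/m)^(m-k), so the result does not depend on p
    (Python evaluates 0.0**0 as 1.0, matching the 0 ln 0 = 0 convention)."""
    return sum(math.comb(m, k) * (k / m)**k * ((m - k) / m)**(m - k)
               for k in range(m + 1))

# the direct computation agrees for a small m and an arbitrary p
m, p = 20, 0.3
direct = sum(math.comb(m, k) * p**k * (1 - p)**(m - k)
             * math.exp(m * binary_kl(k / m, p)) for k in range(m + 1))
assert math.isclose(direct, maurer_moment(m), rel_tol=1e-6)

for m in (8, 50, 200):
    assert maurer_moment(m) <= 2 * math.sqrt(m)
```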
Equation (8). We apply \(T\vert {\textbf{B}}\vert\) times Proposition 3.1 of Blanchard and Fleuret (2007) with a confidence \(\frac{\delta }{T\vert {\textbf{B}}\vert }\) instead of \(\delta\). For each prior \({\mathcal {P}}_t\in {\textbf {P}}\) and hyperparameter \(b\in {\textbf{B}}\), we have with probability at least \(1-\frac{\delta }{T\vert {\textbf{B}}\vert }\) over the random choice of \({\mathcal {S}}\sim {\mathcal {D}}^m\) and \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\)
From Lemma 11 and a union-bound argument, we obtain the claim.
Equation (9). We apply \(T\vert {\textbf{C}}\vert\) times Theorem 1.2.7 of Catoni (2007) with a confidence \(\tfrac{\delta }{T\vert {\textbf{C}}\vert }\) instead of \(\delta\). For each prior \({\mathcal {P}}_{t}\in {\textbf {P}}\) and hyperparameter \(c\in {\textbf{C}}\), we have with probability at least \(1-\tfrac{\delta }{T\vert {\textbf{C}}\vert }\) over the random choice of \({\mathcal {S}}\sim {\mathcal {D}}^m\) and \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\)
From Lemma 11 and a union-bound argument, we obtain the claim. \(\square\)
Appendix H: Proof of Corollary 8
Corollary 8
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any \({\mathcal {H}}\), for any set \({\textbf {P}}=\{{\mathcal {P}}_1,\dots , {\mathcal {P}}_T\}\) of T priors on \({\mathcal {H}}\) where \({\mathcal {P}}_t={\mathcal {N}}({\textbf{v}}_t, \sigma ^2{{\textbf{I}}}_d)\), for any loss \(\ell\,:\! {\mathcal {H}}{\times } {\mathcal {Z}}{\rightarrow } \{0, 1\}\), for any \(\delta {\in }(0,1]\), with probability at least \(1{-}\delta\) over \({\mathcal {S}}{\sim }{\mathcal {D}}^m\) and \(\{h_1,\dots ,h_n\}{\sim }{\mathcal {Q}}^n\), we have simultaneously \(\forall {\mathcal {P}}_t\!\in\,{\textbf {P}},\)
where \({\mathcal {Q}}={\mathcal {N}}({\textbf{w}}, \sigma ^2{{\textbf{I}}}_d)\) and the hypothesis h sampled from \({\mathcal {Q}}\) is parametrized by \({\textbf{w}}+\varvec{\epsilon }\) with \(\varvec{\epsilon }\sim {\mathcal {N}}({{\textbf{0}}}, \sigma ^2{{\textbf{I}}}_d)\).
Proof
We instantiate Eq. (3) (applying Jensen’s inequality to the left-hand side of the inequality) for each prior \({\mathcal {P}}_t\), with \({\mathcal {Q}}{=}{\mathcal {N}}({\textbf{w}}, \sigma ^2\textbf{I}_{d})\) and \({\mathcal {P}}_t{=}{\mathcal {N}}({\textbf{v}}_t, \sigma ^2\textbf{I}_{d})\), and with a confidence \(\tfrac{\delta }{2T}\) instead of \(\delta\). Indeed, for each prior \({\mathcal {P}}_t\), with probability at least \(1{-}\tfrac{\delta }{2T}\) over the random choice of \({\mathcal {S}}\sim {\mathcal {D}}^m\), we have for all posteriors \({\mathcal {Q}}\) on \({\mathcal {H}}\),
Note that the closed-form solution of the KL divergence between the Gaussian distributions \({\mathcal {Q}}\) and \({\mathcal {P}}_t\) is well known: we have \(\textrm{KL}({\mathcal {Q}}\Vert {\mathcal {P}}_t){=}\frac{\Vert {\textbf{w}}{-}{\textbf{v}}_t\Vert _{2}^{2}}{2\sigma ^2}\). Then, by applying a union-bound argument over the T bounds obtained with the T priors \({\mathcal {P}}_t\), we have with probability at least \(1{-}\frac{\delta }{2}\) over the random choice of \({\mathcal {S}}\sim {\mathcal {D}}^m\), for all priors \({\mathcal {P}}_t\in {\textbf {P}}\) and all posteriors \({\mathcal {Q}}\)
Additionally, Eq. (11) is obtained by a direct application of Theorem 2.2 of Dziugaite and Roy (2017) (with confidence \(\frac{\delta }{2}\) instead of \(\delta\)). Finally, a union bound over the two bounds in Equations (10) and (11) gives the claimed result. \(\square\)
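The closed-form KL used above can be cross-checked by Monte Carlo: for two isotropic Gaussians with a shared variance, the normalizing constants cancel in the log-ratio. A minimal sketch (dimensions, seed, sample count, and tolerance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 4, 0.5                    # hypothetical dimension and std
w = rng.normal(size=d)               # posterior mean
v = rng.normal(size=d)               # prior mean

closed = np.sum((w - v)**2) / (2 * sigma**2)     # KL(Q || P_t) closed form

# Monte Carlo estimate of E_{h~Q}[ln Q(h) - ln P_t(h)]; the Gaussian
# normalizers cancel because both covariances equal sigma^2 I.
h = w + rng.normal(scale=sigma, size=(200_000, d))
log_ratio = (np.sum((h - v)**2, axis=1) - np.sum((h - w)**2, axis=1)) / (2 * sigma**2)
assert abs(log_ratio.mean() - closed) < 0.05
```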
Appendix I: Evaluation and minimization of the bounds of Corollaries 6, 7, 8
This appendix presents more details on the optimization and the evaluation of the bounds.
I.1 Evaluation of the bounds
Note that, except for Eq. (9), the bounds upper-bound a generalization gap rather than the true risk directly. Hence, to evaluate the bounds of the corollaries (except for Eq. (9)), we use the inverse of the binary \(\textrm{kl}\) divergence, defined as
where q is typically the empirical risk and \(\psi\) is the PAC-Bayesian bound. The function \(\textrm{kl}^{-1} (q \vert \psi )\) outputs the largest true risk p for which the inequality \(\textrm{kl}(q\Vert p) \le \psi\) holds. We can instantiate p, q and \(\psi\) for the different corollaries: indeed, for all \({\mathcal {P}}_t\in {\textbf {P}}\) we have
Hence, \(\textrm{kl}^{-1}\) must be evaluated to obtain the value of the upper bound on \({R}_{{\mathcal {D}}}(h)\) or \({\mathbb {E}}_{h\sim {\mathcal {Q}}}{R}_{{\mathcal {D}}}(h)\); the evaluation of \(\textrm{kl}^{-1}(q\vert \psi )\) is performed with the bisection method. From this formulation of the bounds, minimizing the true risk p amounts to minimizing the function \(\textrm{kl}^{-1} (q \vert \psi )\). To do so, Reeb et al. (2018) introduced an analytical expression of the derivative of \(\textrm{kl}^{-1}\) with respect to the empirical risk q and the PAC-Bayesian bound \(\psi\). The two partial derivatives are defined as follows:
Note that these partial derivatives require the evaluation of \(\textrm{kl}^{-1}(q\vert \psi )\) for a given empirical risk q and PAC-Bayesian bound \(\psi\). Then, by computing the derivatives of q and \(\psi\) with respect to the parameters and using the chain rule, a library such as PyTorch (Paszke et al., 2019) can automatically compute the derivatives of \(\textrm{kl}^{-1}\) with respect to the parameters.
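A minimal implementation of \(\textrm{kl}^{-1}\) by bisection might look as follows (function names, example values, and tolerances are our own):

```python
import math

def binary_kl(q, p):
    """kl(q||p) between Bernoulli distributions, with the convention 0 ln 0 = 0."""
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out

def kl_inverse(q, psi, tol=1e-9):
    """kl^{-1}(q | psi): largest p in [q, 1) with kl(q||p) <= psi, by bisection.

    kl(q||.) is increasing on [q, 1) and tends to infinity as p -> 1,
    so the bisection is well defined for any psi >= 0."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) <= psi:
            lo = mid
        else:
            hi = mid
    return lo

p = kl_inverse(0.1, 0.05)   # e.g. empirical risk q = 0.1, bound value psi = 0.05
assert 0.1 < p < 1.0 and abs(binary_kl(0.1, p) - 0.05) < 1e-6
```

The partial derivatives mentioned above can then be obtained by implicit differentiation of the identity \(\textrm{kl}(q\Vert p)=\psi\) at \(p=\textrm{kl}^{-1}(q\vert \psi )\), which is how autodiff frameworks can be made to backpropagate through the bisection.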
I.2 Optimization of the bounds
The optimization of the bounds associated with the corollaries is presented in Algorithm 2. This algorithm is divided into two steps: 1) optimizing and choosing the prior \({\mathcal {P}}\) (Lines 6 to 28); and 2) optimizing the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\) (Lines 32 to 39).
In step 1), the prior \({\mathcal {P}}_t\) is obtained after epoch \(t\in \{1,\dots ,T\}\) (Line 16) by updating \(\varvec{\omega }\) (parameterizing the prior \({\mathcal {P}}_t\)) with a mini-batch gradient descent algorithm. For each epoch t and each mini-batch \({\mathcal {U}}\subseteq {\mathcal {S}}_{\text {prior}}\) (Lines 8 and 11), we sample a hypothesis h parameterized by \(\varvec{\omega }+\varvec{\epsilon }\) (Lines 12 and 13) and update \(\varvec{\omega }\) by gradient descent to minimize the risk \({R}_{{\mathcal {U}}}(h)\) (Line 14).
After each epoch t, the prior \({\mathcal {P}}\) is selected by early stopping on the learning sample \({\mathcal {S}}\). We first estimate the risk on \({\mathcal {S}}\) (Lines 19 to 23) by sampling \(h\sim {\mathcal {P}}_t\) (Lines 20 and 21) and computing the losses for each mini-batch \({\mathcal {U}}\). Then, we select the prior \({\mathcal {P}}_t\) if it minimizes the risk (Lines 24 to 27).
Given the prior \({\mathcal {P}}\), we learn a posterior \({\mathcal {Q}}_{{\mathcal {S}}}\) in step 2) during \(T'\) epochs. For each epoch and each mini-batch \({\mathcal {U}}\subseteq {\mathcal {S}}\), we sample a hypothesis h associated with the weights \(\varvec{\omega }+\varvec{\epsilon }\) (Lines 38 and 39). At each iteration, the algorithm updates the weights \(\varvec{\omega }\) (Line 39) by optimizing
Note that, as stated in Sect. 5.3.3, \(T'=1\) for MNIST and FashionMNIST while \(T'=10\) for CIFAR-10, with a batch size of 32. Additionally, the loss in the risk \({R}_{{\mathcal {U}}}(h)\) is the bounded cross-entropy loss \(\ell (h, ({\textbf{x}}, y)) {=} -\frac{1}{Z}\ln (\Phi (h({\textbf{x}})[y]))\) of Dziugaite and Roy (2018). The weights \(\varvec{\omega }\) are updated with the Adam optimizer (Kingma & Ba, 2015). Concerning the optimization of the hyperparameters \(c\in {\textbf{C}}\) and \(b\in {\textbf{B}}\) for Equations (8) and (9), we (a) initialize \(b\in {\textbf{B}}\) or \(c\in {\textbf{C}}\) with the value that performs best on the first mini-batch and (b) optimize the hyperparameter by gradient descent. To evaluate Equations (8) and (9), we take the \(b\in {\textbf{B}}\) and \(c\in {\textbf{C}}\) that lead to the tightest bound.
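As an illustration of the noise-injection step \(\varvec{\omega }+\varvec{\epsilon }\), here is a minimal NumPy stand-in for the inner loop (a linear model with a logistic surrogate loss instead of the bounded cross-entropy, full-batch updates, and hypothetical constants; the actual experiments use PyTorch and Adam):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, sigma, lr, epochs = 200, 10, 0.1, 0.5, 200   # hypothetical constants
X = rng.normal(size=(m, d))
y = np.sign(X @ rng.normal(size=d))                # separable labels in {-1, +1}

omega = np.zeros(d)                                # mean of the posterior Q_S
for _ in range(epochs):
    eps = rng.normal(scale=sigma, size=d)          # fresh Gaussian noise each step
    h = omega + eps                                # sampled hypothesis weights
    margins = y * (X @ h)
    # gradient of the logistic surrogate loss; the gradient w.r.t. omega
    # equals the gradient w.r.t. h since h = omega + eps
    grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0)
    omega -= lr * grad

empirical_risk = np.mean(np.sign(X @ omega) != y)
assert empirical_risk <= 0.1
```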
Appendix J: About Theorem 9
This section is devoted to (i) the proof of a bound that is easier to interpret than Theorem 9, (ii) the proof of Theorem 9 and (iii) a discussion about Theorem 9.
J.1 A bound easier to interpret
Since the mutual information is a well-known quantity, a bound based on it is more interpretable than one based on Sibson’s mutual information. Hence, we propose a mutual-information-based bound in Theorem 13. To prove this theorem, we first need Lemma 12.
Lemma 12
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any measurable function \(\phi :{\mathcal {H}}\times {\mathcal {Z}}^{m}\rightarrow [1, +\infty [\), for any \(\delta \in (0,1]\), for any deterministic algorithm \(A:{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have
Proof
By developing \({\mathbb {E}}_{{\mathcal {S}}\sim {\mathcal {D}}^m}{\mathbb {E}}_{h\sim {\mathcal {Q}}_{{\mathcal {S}}}}\ln \phi (h, {\mathcal {S}})\), we have for all prior \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\)
From Jensen’s inequality, we have for all prior \({\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})\)
Since we assume in this case that \(\phi (h, {\mathcal {S}}) \ge 1\) for all \(h\in {\mathcal {H}}\) and \({\mathcal {S}}\in {\mathcal {Z}}^m\), we have \(\ln \phi (h, {\mathcal {S}}) \ge 0\); we can apply Markov’s inequality to obtain
Then, from Equations (J14) and (J15), we can deduce the stated result. \(\square\)
We are now ready to prove Theorem 13.
Theorem 13
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any measurable function \(\phi :{\mathcal {H}}\times {\mathcal {Z}}^{m}\rightarrow [1, +\infty [\), for any \(\delta \in (0,1]\), for any deterministic algorithm \(A:{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have
where \({\mathcal {P}}^*\) is defined such that \({\mathcal {P}}^*(h)={\mathbb {E}}_{{\mathcal {S}}\sim {\mathcal {D}}^m}{\mathcal {Q}}_{{\mathcal {S}}}(h)\) and \(I(h{;} {\mathcal {S}}) = \min _{{\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})}{\mathbb {E}}_{{\mathcal {S}}\sim {\mathcal {D}}^m}\textrm{KL}({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\).
Proof
Note that the mutual information is defined by \(I(h{;} {\mathcal {S}}) = \min _{{\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})}{\mathbb {E}}_{{\mathcal {S}}\sim {\mathcal {D}}^m}\textrm{KL}({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\). Hence, to prove Theorem 13, we have to instantiate Lemma 12 with the optimal prior, i.e., the prior \({\mathcal {P}}\) which minimizes \({\mathbb {E}}_{{\mathcal {S}}\sim {\mathcal {D}}^m}\textrm{KL}({\mathcal {Q}}_{{\mathcal {S}}}\Vert {\mathcal {P}})\). The optimal prior is well known (see, e.g., Catoni, 2007; Lever, Laviolette, & Shawe-Taylor, 2013): for the sake of completeness, we derive it. First, we have
Hence,
where \({\mathcal {P}}^*(h) = {\mathbb {E}}_{{\mathcal {S}}'\sim {\mathcal {D}}^m}{\mathcal {Q}}_{{\mathcal {S}}'}(h)\). Note that \({\mathcal {P}}^*\) is defined from the data distribution \({\mathcal {D}}\), hence, \({\mathcal {P}}^*\) is a valid prior when instantiating Lemma 12 with \({\mathcal {P}}^*\). Then, we have with probability at least \(1{-}\delta\) over \({\mathcal {S}}\sim {\mathcal {D}}^m\) and \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\)
\(\square\)
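The optimality of \({\mathcal {P}}^*(h)={\mathbb {E}}_{{\mathcal {S}}\sim {\mathcal {D}}^m}{\mathcal {Q}}_{{\mathcal {S}}}(h)\) can be illustrated on a finite hypothesis space, where the expectations become finite sums. A small sketch (sizes and seed arbitrary, \({\mathcal {D}}^m\) taken uniform over a finite set of samples):

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_hyp = 6, 4                             # finite stand-ins for Z^m and H
Q = rng.dirichlet(np.ones(n_hyp), size=n_samples)   # one posterior Q_S per sample S

def avg_kl(P):
    """E_S KL(Q_S || P), with S uniform over the finite set of samples;
    its minimum over P is the mutual information I(h; S)."""
    return np.mean(np.sum(Q * np.log(Q / P), axis=1))

P_star = Q.mean(axis=0)                 # candidate optimum: P*(h) = E_S Q_S(h)
for _ in range(1000):
    P = rng.dirichlet(np.ones(n_hyp))   # any other prior
    assert avg_kl(P_star) <= avg_kl(P) + 1e-12
```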
Note that this bound is looser than Theorem 9, which is based on Sibson’s mutual information. For example, when we instantiate this bound with \(\phi (h,{\mathcal {S}})=\exp \left[ m\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\right]\), the bound is multiplied by \(\frac{1}{\delta m}\), while the bound of Theorem 9 is only multiplied by \(\frac{1}{m}\) (at the cost of an additive term \(\frac{1}{m}\ln \frac{1}{\delta }\), which is small even for small m).
J.2 Proof of Theorem 9
We first introduce Lemma 14 in order to prove Theorem 9.
Lemma 14
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any prior distribution \({\mathcal {P}}\) on \({\mathcal {H}}\), for any measurable function \(\phi :{\mathcal {H}}\times {\mathcal {Z}}^{m}\rightarrow {\mathbb {R}}_{+}^{*}\), for any \(\alpha >1\), for any \(\delta \in (0,1]\), for any deterministic algorithm \(A:{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have
where \(\rho (h, {\mathcal {S}}){=} {\mathcal {Q}}_{{\mathcal {S}}}(h){\mathcal {D}}^{m}({\mathcal {S}})\); \(\pi (h, {\mathcal {S}}){=} {\mathcal {P}}(h){\mathcal {D}}^{m}({\mathcal {S}})\).
Proof
Note that \(\phi (h,\!{\mathcal {S}})\) is a non-negative random variable. From Markov’s inequality, we have
Then, since both sides of the inequality are strictly positive, we take the logarithm of both sides and multiply by \(\frac{\alpha }{\alpha -1}>0\) to obtain
We develop the right-hand side of the inequality in the indicator function and make the expectation of the hypothesis over the distribution \({\mathcal {P}}\) appear. We have for all priors \({\mathcal {P}}{\in }{\mathcal {M}}^{*}({\mathcal {H}})\),
Then, since \(\tfrac{1}{r}+\tfrac{1}{s}=1\) with \(r{=}\alpha\) and \(s{=}\frac{\alpha }{\alpha -1}\), Hölder’s inequality gives
Since both sides of the inequality are positive, we take the logarithm; then we add \(\ln (\tfrac{1}{\delta })\) and multiply both sides by \(\frac{\alpha }{\alpha -1}>0\). We have
Hence, we can deduce that
where, by definition, we have \(D_{\alpha }(\rho \Vert \pi )=\frac{1}{\alpha {-}1}\!\ln\,\left( {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim } {\mathcal {P}}}\!\left( \left[\,\frac{{\mathcal {Q}}_{{\mathcal {S}}'}(h')}{{\mathcal {P}}(h')}\!\right] ^{\alpha }\right) \right)\). \(\square\)
From Lemma 14, we prove Theorem 9.
Theorem 9
(Disintegrated Information-Theoretic Bound) For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any measurable function \(\phi\,:\!{\mathcal {H}}{\times } {\mathcal {Z}}^{m}{\rightarrow }{\mathbb {R}}_{+}^{*}\), for any \(\alpha\,>\!1\), for any \(\delta \in (0,1]\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), we have
Proof
Note that Sibson’s mutual information is defined as \(I_{\alpha }(h{;}{\mathcal {S}})=\min _{{\mathcal {P}}\in {\mathcal {M}}^{*}({\mathcal {H}})}D_{\alpha }(\rho \Vert \pi )\). Hence, to prove Theorem 9, we instantiate Lemma 14 with the optimal prior, i.e., the prior \({\mathcal {P}}\) that minimizes \(D_{\alpha }(\rho \Vert \pi )\). This optimal prior has a closed-form solution (Verdú, 2015); for the sake of completeness, we derive it. First, we have
where \({\mathcal {P}}^*(h)=\left[\,\tfrac{\left[ {\mathbb {E}}_{{\mathcal {S}}\sim {\mathcal {D}}^{m}}\left( {\mathcal {Q}}_{{\mathcal {S}}}(h)^{\alpha }\right) \right] ^{\frac{1}{\alpha }}}{{\mathbb {E}}_{h'{\sim }{\mathcal {P}}}\tfrac{1}{{\mathcal {P}}(h')}\left[ {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}\left( {\mathcal {Q}}_{{\mathcal {S}}'}(h')^{\alpha }\right) \right] ^{\frac{1}{\alpha }}}\!\right]\).
From these equalities and using the fact that \(D_{\alpha }({\mathcal {P}}^*\Vert {\mathcal {P}})\) is minimal (i.e., equal to zero) when \({\mathcal {P}}^*={\mathcal {P}}\), we can deduce that
Note that \({\mathcal {P}}^*\) is defined from the data distribution \({\mathcal {D}}\), hence, \({\mathcal {P}}^*\) is a valid prior when instantiating Lemma 14 with \({\mathcal {P}}^*\). Then, we have with probability at least \(1{-}\delta\) over \({\mathcal {S}}\sim {\mathcal {D}}^m\) and \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\)
where \(\pi ^*(h, {\mathcal {S}})={\mathcal {P}}^*(h){\mathcal {D}}^{m}({\mathcal {S}})\). \(\square\)
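Similarly, the closed-form optimal prior for \(D_{\alpha }\) can be checked on a finite hypothesis space. The sketch below (sizes, seed, and \(\alpha\) arbitrary, \({\mathcal {D}}^m\) uniform over a finite set of samples) verifies that \({\mathcal {P}}^*(h)\propto \left[ {\mathbb {E}}_{{\mathcal {S}}\sim {\mathcal {D}}^m}{\mathcal {Q}}_{{\mathcal {S}}}(h)^{\alpha }\right] ^{1/\alpha }\) minimizes the Rényi divergence term:

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_hyp, alpha = 6, 4, 2.0                 # finite stand-ins, arbitrary alpha > 1
Q = rng.dirichlet(np.ones(n_hyp), size=n_samples)   # one posterior Q_S per sample S

def renyi_term(P):
    """D_alpha(rho || pi) = (1/(alpha-1)) ln E_S E_{h~P} (Q_S(h)/P(h))^alpha,
    with S uniform; note E_{h~P} (Q_S(h)/P(h))^alpha = sum_h Q_S(h)^alpha P(h)^(1-alpha)."""
    inner = np.mean(np.sum(Q**alpha * P**(1 - alpha), axis=1))
    return np.log(inner) / (alpha - 1)

t = np.mean(Q**alpha, axis=0)**(1 / alpha)
P_star = t / t.sum()                    # P*(h) proportional to [E_S Q_S(h)^alpha]^(1/alpha)
for _ in range(1000):
    P = rng.dirichlet(np.ones(n_hyp))   # any other prior
    assert renyi_term(P_star) <= renyi_term(P) + 1e-12
```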
J.3 About Theorem 9
For the sake of comparison, we introduce the following corollary of Theorem 9.
Corollary 15
Under the assumptions of Theorem 9, when \(\alpha {\rightarrow }1^+\), with probability at least \(1{-}\delta\) we have
When \(\alpha {\rightarrow }+\infty\), with probability at least \(1{-}\delta\) we have
Proof
The proof is similar to that of Corollary 3. Starting from Theorem 9 and rearranging, we have
Then, we prove the cases \(\alpha \rightarrow 1^+\) and \(\alpha \rightarrow +\infty\) separately.
When \(\alpha \rightarrow 1\). First, we have \(\lim _{\alpha \rightarrow 1^+}\frac{\alpha {-}1}{\alpha }I_{\alpha }(h'; {\mathcal {S}}') = 0\). Furthermore, note that
is the \(L^{\frac{\alpha }{\alpha {-}1}}\)-norm of the function \(\phi : {\mathcal {H}}\times {\mathcal {Z}}^m \rightarrow {\mathbb {R}}_{+}^{*}\), where \(\lim _{\alpha \rightarrow 1} \Vert \phi \Vert _{\frac{\alpha }{\alpha {-}1}} = \lim _{\alpha '\rightarrow +\infty } \Vert \phi \Vert _{\alpha '}\) (since we have \(\lim _{\alpha \rightarrow 1^+}\frac{\alpha }{\alpha {-}1} = (\lim _{\alpha \rightarrow 1}\alpha )(\lim _{\alpha \rightarrow 1}\frac{1}{\alpha {-}1}) = +\infty\)). Then, it is well known that
Hence, we have
Finally, we can deduce that
When \(\alpha \rightarrow +\infty\).
First, we have \(\lim _{\alpha \rightarrow +\infty } \Vert \phi \Vert _{\frac{\alpha }{\alpha {-}1}} = \lim _{\alpha '\rightarrow 1} \Vert \phi \Vert _{\alpha '} = \Vert \phi \Vert _1\). Hence, we have
Moreover, by rearranging the terms in \(\frac{\alpha {-}1}{\alpha }I_{\alpha }(h'; {\mathcal {S}}')\), we have
where \(\Vert \gamma \Vert _{\alpha }\) is the \(L^{\alpha }\)-norm of the function \(\gamma\) defined as \(\gamma (h)=\tfrac{{\mathcal {Q}}_{{\mathcal {S}}}(h)}{{\mathcal {P}}^*(h)}\). We have
Finally, we can deduce that
\(\square\)
As for Theorem 2, this corollary illustrates a trade-off, controlled by \(\alpha\), between Sibson’s mutual information \(I_{\alpha }(h'; {\mathcal {S}}')\) and the term \(\ln\,\left( {\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^{m}}{\mathbb {E}}_{h'{\sim } {\mathcal {P}}}\left( \phi (h'\!, {\mathcal {S}}')^{\frac{\alpha }{\alpha -1}}\right)\,\right)\).
Furthermore, Esposito et al. (2020, Cor.4) introduced a bound involving Sibson’s mutual information. Their bound holds with probability at least \(1{-}\delta\) over \({\mathcal {S}}\sim {\mathcal {D}}^m\) and \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\):
Hence, we compare Eq. (J16) with the equations of the following corollary.
Corollary 16
For any distribution \({\mathcal {D}}\) on \({\mathcal {Z}}\), for any hypothesis set \({\mathcal {H}}\), for any \(\alpha\,>\!1\), for any \(\delta \in (0,1]\), for any algorithm \(A\!:\!{\mathcal {Z}}^{m}\times {\mathcal {M}}^{*}({\mathcal {H}}){\rightarrow } {\mathcal {M}}({\mathcal {H}})\), with probability at least \(1{-}\delta\) over \({\mathcal {S}}\sim {\mathcal {D}}^m\) and \(h\sim {\mathcal {Q}}_{{\mathcal {S}}}\), we have
Proof
First, we instantiate Theorem 9 with \(\phi (h,\!{\mathcal {S}})=\exp\,\left[ \tfrac{\alpha -1}{\alpha }m\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\right]\); by rearranging the terms, we have
Then, from Maurer (2004), we upper-bound \({\mathbb {E}}_{{\mathcal {S}}'{\sim }{\mathcal {D}}^m}{\mathbb {E}}_{h'{\sim }{\mathcal {P}}} e^{m\textrm{kl}({R}_{{\mathcal {S}}'}(h')\Vert {R}_{{\mathcal {D}}}(h'))}\) by \(2\sqrt{m}\) to obtain Eq. (J17). Finally, to obtain Eq. (J18), we apply Pinsker’s inequality, i.e., \(2({R}_{{\mathcal {S}}}(h){-}{R}_{{\mathcal {D}}}(h))^2\le \textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h))\), to Eq. (J17). \(\square\)
Equation (J18) is slightly looser than Eq. (J16) since it involves an extra term of \(\tfrac{1}{m}\ln \sqrt{m}\). However, Eq. (J17) is tighter than Eq. (J16) when \(\textrm{kl}({R}_{{\mathcal {S}}}(h)\Vert {R}_{{\mathcal {D}}}(h)){-}2({R}_{{\mathcal {S}}}(h){-}{R}_{{\mathcal {D}}}(h))^2 \ge \tfrac{1}{m}\ln \sqrt{m}\) (which becomes more frequent as m grows).
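Pinsker's inequality and the comparison condition above can be checked numerically; a small sketch (grid and example values arbitrary):

```python
import math

def binary_kl(q, p):
    """kl(q||p) for q, p in the open interval (0, 1)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

# Pinsker's inequality: kl(q||p) >= 2 (q - p)^2, checked on a grid
grid = [i / 100 for i in range(1, 100)]
for q in grid:
    for p in grid:
        assert binary_kl(q, p) >= 2 * (q - p)**2 - 1e-12

# the condition under which Eq. (J17) is tighter, kl - 2(q-p)^2 >= ln(sqrt(m))/m,
# illustrated with m = 1000 and the (arbitrary) pair q = 0.1, p = 0.3
m, q, p = 1000, 0.1, 0.3
assert binary_kl(q, p) - 2 * (q - p)**2 >= math.log(math.sqrt(m)) / m
```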
Appendix K: Results presented in Section 5
This appendix presents the details of the results of Sect. 5. Tables 2, 3, 4, 5, 6, 7, 8, 9, 10 report empirical results for split ratios going from 0.0 to 0.9 presented in Figs. 1 to 5. More precisely, we report the test risk \({R}_{{\mathcal {T}}}(h)\), the empirical risk \({R}_{{\mathcal {S}}}(h)\), the bound value (Bnd), and the divergence value associated with the network h sampled from the posterior \({\mathcal {Q}}_{{\mathcal {S}}}\) for each learning rate, variance, dataset, and bound type. Tables 11, 12, 13 report the performances of the prior before applying Step 2) outlined in Figs. 4 and 5. In particular, we report the test risk \({R}_{{\mathcal {T}}}(h)\), the empirical risk \({R}_{{\mathcal {S}}}(h)\), the bound values of Corollary 6 and Equations (7), (8), (9) for each split ratio and variance.
Note that for the split 0.0, since Step 1) is skipped, the prior distribution \({\mathcal {P}}\) is only initialized as described in Sect. 5.3.2. In this case, \(T=1\) since we have only one prior. To perform the same number of epochs as for the other splits, we run 11 epochs (instead of 1) for MNIST and Fashion-MNIST and 110 epochs (instead of 10) for CIFAR-10 during Step 2). The other parameters are unchanged.
Viallard, P., Germain, P., Habrard, A. et al. A general framework for the practical disintegration of PAC-Bayesian bounds. Mach Learn 113, 519–604 (2024). https://doi.org/10.1007/s10994-023-06391-0