1 Introduction

James (Jacob) Bernoulli (1713), in Part IV of his book entitled "The Art of Conjecturing," derived what he called the ‘golden theorem’ (theorema aureum). This theorem was particularly influential for subsequent developments in both probability theory (especially limit theorems) and statistical inference (frequentist vs. Bayesian inference); see Hald (1998), Gorroochurn (2012). Since then, the golden theorem has become a topic of recurring disputes relating to its importance, interpretation and implications for inference, which are motivated by several of its unique features, including (i) Bernoulli’s own motivation and interpretation, (ii) its direct link to the numerical example he used to illustrate it, (iii) its inferential interpretation in terms of inverse versus direct inference, and (iv) its interpretation and implications for its finite sample (\(n < \infty\)) and asymptotic (\(n \to \infty\)) renderings.

In an attempt to narrow the scope of the discussion, the paper focuses on Diaconis and Skyrms (2018), which summarizes a widely-held perspective on the golden theorem as follows:

“Bernoulli’s motivation for his golden theorem was the determination of chance from empirical data.” (p. 64).

“What does it mean to determine chances a posteriori from frequencies? The question is, given the data—the number of trials and the relative frequencies of success in those trials—what is the probability that the chances fall within a certain interval? It is evident that this is not the problem that Bernoulli solved. He solved an inference from chances to frequencies, not the inverse problem from frequencies to chances. The inverse problem had to wait for Thomas Bayes.” (p. 65).

“Bernoulli argued that he had shown that with a large enough number of trials, it will be morally certain that relative frequency would be (approximately) equal to chance. But if frequency equals chance, then chance equals frequency. So, the argument goes, we have solved the problem of inference from frequency to chance. This is Bernoulli’s swindle. Try to make it precise and it falls apart.” (p. 65).

“To be explicit, Bernoulli’s conditional probabilities are probabilities about frequencies given chances, rather than probabilities about chances given frequencies.” (p. 66).

It is important to note at the outset that Bernoulli (1713) viewed \(\theta = {\mathbb{P}}\left( {X = 1} \right)\) as probability a priori (chances) and \(\overline{x}_{n} = \frac{1}{n}\sum\nolimits_{k = 1}^{n} {x_{k} } ,\) based on binary data \({\mathbf{x}}_{0} : = \left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right)\), as probability a posteriori (relative frequencies), which should not be conflated with the modern Bayesian interpretation of these terms.

Diaconis and Skyrms (2018) also argue that current p value testers routinely perpetrate Bernoulli’s swindle by conflating \(P(H_{0} |D)\) with \(P(D|H_{0} )\): “The untutored think they are getting the probability of effectiveness given the data, while they are being given conditional probabilities going in the opposite direction.” (p. 67).

The above quotations include several different but interrelated claims:

(a) Bernoulli (1713) solved the problem of inference ‘from probability \(\theta\) to frequency \(\overline{x}_{n}\)’, but the inverse problem was addressed by Bayes (1764), because:

(b) Bernoulli committed a swindle by identifying the probability (\(\theta\)) with the relative frequency (\(\overline{x}_{n}\)) as a result of conflating ‘direct’ inference based on \(f({\mathbf{x}}_{0} |\theta )\) with ‘inverse’ inference based on \(f(\theta |{\mathbf{x}}_{0} )\), and thus:

(c) the golden theorem does not justify an approximate confidence interval for \(\theta\), and

(d) the same swindle permeates current frequentist testing whose error probabilities fail to distinguish between \(P(H_{0} |{\mathbf{x}})\) and \(P({\mathbf{x}}|H_{0} )\).

Claims and criticisms similar to (a)–(d) are repeated by most Bayesian statistics textbooks (O’Hagan, 1994; Robert, 2007), as well as philosophy of science books on ‘probability and evidence’ (Howson & Urbach, 2006; Sober, 2008).

Viewing Bernoulli’s (1713) golden theorem retrospectively in the context of modern model-based [\({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\)] frequentist inference, the claims in (a)–(d) are called into question as grounded in misconceptions. Their interrogation brings out several broader foundational problems that are particularly relevant for the current discussions on the replication crisis and the trustworthiness of empirical evidence, including:

[i] misapplying/misconstruing limit theorems (as \(n \to \infty )\) in inference,

[ii] misinterpreting the p value, the type I and II error probabilities and the power as conditional on \(H_{0}\) or \(H_{1}\),

[iii] the alleged ‘swindle’ is a special case of a well-known unwarranted claim, \(\hat{\theta }\left( {{\mathbf{x}}_{0} } \right) \simeq \theta^{*}\) for \(n < \infty\), where \(\theta^{*}\) denotes the true value of \(\theta\) and \(\hat{\theta }\left( {{\mathbf{x}}_{0} } \right)\) is the estimate corresponding to an optimal estimator \(\hat{\theta }\left( {\mathbf{X}} \right)\) of \(\theta ,\) a claim routinely committed by effect size users, and not by frequentist testers, and

[iv] the direct versus inverse inference criticism is not just misplaced; it is motivated by a misguided attempt to justify a dubious shortcut in vindicating Bayes’ formula by reimagining the distribution of the sample \(f\left( {{\mathbf{x}};\theta } \right),\) \({\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\) as conditional on \(\theta\), i.e. \(f({\mathbf{x}}|\theta ),\) \({\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\) which is meaningless in frequentist statistics; see Spanos (2010).

2 Statistical induction

2.1 Induction by enumeration

The problem of induction boils down to justifying an inference from particular instances to potential realizations (generalizations), or from past to future instances. Hume (1748) argued that no rational justification of induction based on experience can be invoked, since the argument that ‘a regularity that has held in the past will or must continue to hold in the future’ is circular and question-begging: it presupposes a belief in the ‘uniformity of nature’ that has no rational defence in reason. Instead, it reflects a custom of the mind, or habit. Hume’s stance has bedeviled philosophy of science ever since; see Henderson (2020).

Induction by enumeration: if \(\left( {m/n} \right)\) is the relative frequency of event \(A\) from a sample of \(n\) realizations, infer that:

$$P\left( A \right) \simeq \frac{m}{n},$$
(1)

i.e. the ‘long-run’ relative frequency is \(\left( {m/n} \right)\); see Salmon (1967), p. 50.

This is widely viewed in philosophy of science as the quintessential form of statistical induction, with von Mises’s (1928) frequentist interpretation of probability providing the link between the empirical relative frequencies \(\left( {m/n} \right) = \frac{1}{n}\sum\nolimits_{k = 1}^{n} {x_{k} }\) and the corresponding mathematical probability \(P\left( A \right)\) using the notion of a collective: an infinite sequence of outcomes \(\left\{ {x_{k} } \right\}_{k = 1}^{\infty } ,\) \(x_{k} = \left\{ {\begin{array}{*{20}l} 0 \hfill & {{\text{not}}\;A} \hfill \\ 1 \hfill & A \hfill \\ \end{array} } \right.\), such that \({\text{lim}}_{n \to \infty } \left( {\frac{1}{n}\sum\nolimits_{k = 1}^{n} {x_{k} } } \right) = P\left( A \right),\) with this limit being invariant to place selections, i.e. \({\text{lim}}_{n \to \infty } \left( {\frac{1}{n}\sum\nolimits_{k = 1}^{n} \varphi \left( {x_{k} } \right)} \right) = P\left( A \right)\), where \(\varphi \left( . \right)\) is a mapping defining admissible place-selection sub-sequences \(\left\{ {\varphi \left( {x_{k} } \right)} \right\}_{k = 1}^{\infty } .\)

Hacking (1965), p. 261, questions Salmon’s claim: “Reichenbach equated induction with acceptance of a certain estimator, the straight rule: If m of the n observed A are B, estimate the long-run frequency of B among A as \(m/n\). Salmon and Reichenbach maintain that if long-run frequencies exist, the straight rule for estimating long-run frequencies is to be preferred to any rival estimator. Other propositions are needed to complete their vindication of induction, but only this one concerns us. Salmon claims to have proved it. This is more interesting than mere academic vindications of induction; practical statisticians need good criteria for choosing among estimators, and, if Salmon were right, he would have very largely solved their problems, which are much more pressing than Hume’s.”

The key feature of inductive inference is that it is ampliative in the sense that it goes beyond the observed data \(\left( {m/n} \right)\) to the unknown \(\theta = {\mathbb{P}}\left( A \right)\), enhancing our knowledge about the underlying set-up that gave rise to the observed data. As argued in the sequel, when this claim is viewed in the context of model-based induction where \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) provides the inductive premises of inference, Hacking is right to question Salmon’s claim since (1) is a special case of a more general unwarranted claim:

$$\hat{\theta }_{n} \left( {{\mathbf{x}}_{0} } \right) \simeq \theta^{*} ,\quad {\text{for a large enough }}n < \infty ,$$
(2)

when \(\hat{\theta }_{n} \left( {\mathbf{X}} \right)\) is an ‘optimal’ estimator of the unknown true parameter \(\theta^{*}\); note that (1) assumes the simple Bernoulli model in (5). Viewing Hacking’s “Other propositions needed to complete their vindication of induction” in the context of \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) in (5), these include (i) the validity of the inductive premises [Independent and Identically Distributed (IID)] for data \({\mathbf{x}}_{0} ,\) which ensures the reliability of inference, as well as (ii) the optimality of the estimator \(\hat{\theta }_{n} \left( {\mathbf{X}} \right) = \frac{1}{n}\sum\nolimits_{k = 1}^{n} {X_{k} }\), which secures the effectiveness of the inference. The reliability and effectiveness of inference lie at the core of inductive (statistical) inference: how we learn from data about phenomena of interest.

2.2 Model-based frequentist inference

Fisher (1922) recast Pearson’s descriptive statistics into model-based induction that revolves around the concept of a prespecified parametric statistical model, generically defined by:

$${\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right) = \left\{ {f\left( {{\mathbf{x}};{{\varvec{\uptheta}}}} \right),\,{{\varvec{\uptheta}}} \in \Theta \subset {\mathbb{R}}^{m} } \right\},\;\;{\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\;\;n > m,$$
(3)

where \(f\left( {{\mathbf{x}};{{\varvec{\uptheta}}}} \right), {\mathbf{x}} \in {\mathbb{R}}_{X}^{n}\) denotes the joint distribution of the sample \({\mathbf{X}}: = \left( {X_{1} , \ldots ,X_{n} } \right),\) \({\mathbb{R}}_{X}^{n}\) denotes the sample space and \(\Theta\) the parameter space, specifying (explicitly) the inductive premises of inference. The revolutionary nature of Fisher’s recasting stems from the fact that \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) aims to describe the stochastic mechanism that gave rise to data \({\mathbf{x}}_{0}\), and not to summarize/describe \({\mathbf{x}}_{0}\), thus transforming descriptive statistics into statistical induction.

Example 1

Consider the simple Normal model:

$$X_{t} \sim {\text{NIID}}\left( {\mu ,\sigma^{2} } \right),\;{\mkern 1mu} x_{t} \in {\mathbb{R}},\;E\left( {X_{t} } \right) = \mu \in {\mathbb{R}},\;Var\left( {X_{t} } \right) = \sigma^{2} > 0,\quad t \in {\mathbb{N}}: = \left( {1,2, \ldots ,n, \ldots } \right),$$
(4)

where ‘\({\text{NIID}}\)’ stands for Normal, Independent, and Identically Distributed (IID), and for simplicity we assume that \(\sigma^{2}\) is known.

Example 2

Consider the simple Bernoulli model, specified by:

$$X_{k} \sim {\text{BerIID}}\left( {\theta ,\theta \left( {1 - \theta } \right)} \right),\;x_{k} = 0,1,\;E\left( {X_{k} } \right) = \theta ,\quad 0 < \theta < 1,\;k \in {\mathbb{N}},$$
(5)

where ‘\({\text{Ber}}\)’ denotes the ‘Bernoulli distribution’ with \(\theta = {\mathbb{P}}\left( {X_{k} = 1} \right)\).

The primary objective of frequentist inference is to use the statistical information, as summarized by \(f\left( {{\mathbf{x}};{{\varvec{\uptheta}}}} \right), {\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\) in conjunction with data \({\mathbf{x}}_{0}\), to narrow down \(\Theta\) as much as possible, ideally, to a single point \({{\varvec{\uptheta}}}^{*}\)—the ‘true’ value of \({{\varvec{\uptheta}}}\) in \(\Theta\)—which is shorthand for saying that the generating mechanism \({\mathcal{M}}^{*} \left( {\mathbf{x}} \right) = \left\{ {f\left( {{\mathbf{x}};{{\varvec{\uptheta}}}^{*} } \right)} \right\}, {\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\) could have generated data \({\mathbf{x}}_{0}\); see Spanos and Mayo (2015).

The evaluation of the effectiveness (optimality) of an inference procedure is calibrated in terms of the relevant error probabilities that revolve around the sampling distribution, \(f\left( {y_{n} ;{{\varvec{\uptheta}}}} \right),\;\forall y_{n} \in {\mathbb{R}}\), of a statistic (estimator, test, predictor) \(Y_{n} = h\left( {X_{1} ,X_{2} , \ldots ,X_{n} } \right)\) derived via:

$$F_{n} (y) = {\mathbb{P}}(Y_{n} \le y) = \underbrace {{\int {\int \cdots } \int {} }}_{{\left\{ {{\mathbf{x}}:\;h\left( {\mathbf{x}} \right) \le y} \right\}}}f\left( {{\mathbf{x}};\,{{\varvec{\uptheta}}}} \right)d{\mathbf{x}},\quad \forall y \in {\mathbb{R}}.$$
(6)

The parameter θ is viewed as an unknown constant, and the values of \({{\varvec{\uptheta}}}\) used in (6) to derive the sampling distribution \(f\left( {y_{n} ;{{\varvec{\uptheta}}}} \right), \;\forall y_{n} \in {\mathbb{R}}\), are always prespecified, based on two different forms of reasoning:

(i) factual (estimation and prediction): presuming that θ = θ*, whatever that value happens to be in \({{\varvec{\Theta}}}\), and

(ii) hypothetical (hypothesis testing): various hypothetical scenarios based on \({{\varvec{\uptheta}}}\) taking different prespecified values under \(H_{0}\): \({{\varvec{\uptheta}}} \in {{\varvec{\Theta}}}_{0}\) (presuming that \({{\varvec{\uptheta}}} \in {{\varvec{\Theta}}}_{0} )\) versus \(H_{1}\): \({{\varvec{\uptheta}}} \in {{\varvec{\Theta}}}_{1}\) (presuming that \({{\varvec{\uptheta}}} \in {{\varvec{\Theta}}}_{1} ),\) where \({{\varvec{\Theta}}}_{0} \cup {{\varvec{\Theta}}}_{1} = {{\varvec{\Theta}}}, {{\varvec{\Theta}}}_{0} \cap {{\varvec{\Theta}}}_{1} = \emptyset ;\) see Spanos (2019), p. 576. Note that neither form of reasoning involves conditioning on \({{\varvec{\uptheta}}},\) since the latter makes no mathematical or logical sense; see Sect. 2.5 for further discussion.
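To illustrate the mechanics of (6), the following minimal simulation sketch (not part of the original text; all numerical values are illustrative choices) derives the sampling distribution of the sample mean under a prespecified parameter value, as factual or hypothetical reasoning requires:

```python
# Simulation sketch of (6): the sampling distribution of a statistic
# Y_n = h(X_1,...,X_n) is derived under a prespecified parameter value
# (factual or hypothetical reasoning); all values here are illustrative.
import numpy as np

rng = np.random.default_rng(seed=0)
mu_presumed, sigma, n, N = 1.0, 2.0, 25, 200_000  # presumed mu, known sigma

x = rng.normal(mu_presumed, sigma, size=(N, n))   # N samples of size n
y = x.mean(axis=1)                                # Y_n = h(X): sample mean
print(y.mean(), y.var())                          # ~ mu_presumed, sigma^2/n
```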

It is important to emphasize that the reliability and effectiveness of statistical inference depend crucially on statistical adequacy: the validity of the probabilistic assumptions comprising the prespecified \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\). For Example 1, the invoked assumptions are NIID, and their validity should be evaluated using mis-specification (M-S) testing before any inference is drawn; see Spanos (2018). When any of these assumptions are invalid for data \({\mathbf{x}}_{0} ,\) the actual error probabilities associated with the invoked inference procedures are likely to be very different from the nominal ones (assumed on the basis of \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\)). Applying a .05 significance level test when the actual type I error probability (due to statistical misspecification) is closer to .9 will lead that inference astray; see Spanos and McGuirk (2001).

Example 1

(continued). For the simple Normal model in (4):

$$\begin{aligned} \left( {\text{i}} \right)\;\overline{X}_{n} &= \frac{1}{n}\sum\limits_{i = 1}^{n} {X_{i} } \sim {\text{N}}\left( {\mu ,\frac{{\sigma^{2} }}{n}} \right),\\ \left( {{\text{ii}}} \right)\;s^{2} & = \frac{1}{n - 1}\sum\limits_{i = 1}^{n} {\left( {X_{i} - \overline{X}_{n} } \right)^{2} },\quad \frac{{\left( {n - 1} \right)s^{2} }}{{\sigma^{2} }} \sim \chi^{2} \left( {n - 1} \right), \end{aligned}$$
(7)

and (iii) \(\overline{X}_{n}\) is independent of \(s^{2}\); together (i)–(iii) imply that (Lehmann & Romano, 2005, p. 156):

$$\tau \left( {{\mathbf{X}};\mu } \right) = \frac{{\sqrt n \left( {\overline{X}_{n} - \mu } \right)}}{s}\sim \;{\text{St}}\left( {n - 1} \right),$$
(8)

where \({\text{St}}\left( {n - 1} \right)\) denotes the Student’s t distribution with \(\left( {n - 1} \right)\) degrees of freedom. What is not obvious is how to interpret (8), since it is not apparent why \(E\left( {\tau \left( {{\mathbf{X}};\mu } \right)} \right) = 0\). A simple answer is that it follows from the fact that \(\overline{X}_{n}\) is an unbiased estimator of \(\mu\), i.e. \(E\left( {\overline{X}_{n} } \right) = \mu^{*} .\) Using this unbiasedness in conjunction with the independence in (iii), one can show (Williams, 2001, p. 101) that under factual reasoning:

$$E\left( {\frac{{\sqrt n \left( {\overline{X}_{n} - \mu } \right)}}{s}} \right)\mathop = \limits^{{\mu = \mu^{*} }} E\left( {\overline{X}_{n} - \mu^{*} } \right) \cdot E\left( {\sqrt n /s} \right) = 0,\;{\text{since}}\;0 < E\left( {\sqrt n /s} \right) < \infty .$$

Hence, a more transparent way to specify (8) is:

$$\tau \left( {{\mathbf{X}};\mu^{*} } \right) = \frac{{\sqrt n \left( {\overline{X}_{n} - \mu^{*} } \right)}}{s}\mathop \sim \limits^{{\mu = \mu^{*} }} {\text{St}}\left( {n - 1} \right),$$
(9)

despite the cumbersome notation that overuses ‘*’ to elucidate it.
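As a minimal numerical check (illustrative values, not from the original), the distributional claim in (8)–(9) can be verified by simulating the pivot under \(\mu = \mu^{*}\) and comparing its empirical quantiles with those of \({\text{St}}\left( {n - 1} \right)\):

```python
# Monte Carlo check that tau(X; mu*) in (9) is St(n-1) under NIID and
# mu = mu*; mu_star, sigma, n and N are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
mu_star, sigma, n, N = 1.0, 2.0, 10, 200_000

x = rng.normal(mu_star, sigma, size=(N, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)                     # s as in (7)(ii)
tau = np.sqrt(n) * (xbar - mu_star) / s       # the pivot in (9)

for q in (0.05, 0.50, 0.95):                  # empirical vs St(n-1) quantiles
    print(q, np.quantile(tau, q), stats.t.ppf(q, df=n - 1))
```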

It is interesting to note that when the von Mises ‘collective’ \(\left\{ {x_{k} } \right\}_{k = 1}^{\infty }\) is viewed from the model-based (\({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\)) perspective, it becomes clear that an infinite realization of an IID Bernoulli process \(\left\{ {X_{t} , t \in {\mathbb{N}}} \right\}\) is a non-operational concept. What operationalizes the idea behind the collective is to view the data \({\mathbf{x}}_{0} = \left\{ {x_{k} } \right\}_{k = 1}^{n}\) as its initial segment, which constitutes a realization of the sample \({\mathbf{X}}\); see Spanos (2013a).

2.3 Estimation (point and interval)

For estimation and prediction purposes the underlying reasoning is factual.

Example 1

(continued). For the simple Normal model in (4) with \(\sigma^{2}\) known, the Maximum Likelihood (ML) estimator of \(\mu\) is \(\hat{\theta }_{ML} \left( {\mathbf{X}} \right) = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {X_{i} } .\) Its optimality revolves around its sampling distribution evaluated using factual reasoning:

$$\hat{\theta }_{ML} \left( {\mathbf{X}} \right)\mathop \sim \limits^{{\mu = \mu^{*} }} {\text{N}}\left( {\mu^{*} ,\frac{{\sigma^{2} }}{n}} \right).$$
(10)

Moreover, \(\hat{\theta }_{ML} \left( {\mathbf{X}} \right)\) is unbiased, sufficient, fully efficient, and strongly consistent; note that these properties hold only when the model assumptions ‘NIID’ are valid!

As Fisher (1922) points out, the statistics literature until the 1920s conflated the sample \({\mathbf{X}}: = \left( {X_{1} ,X_{2} , \ldots ,X_{n} } \right)\) with the sample realization \({\mathbf{x}}_{0}\) (the observed data), as well as the estimator \(\hat{\theta }\left( {\mathbf{X}} \right),\) the estimate \(\hat{\theta }\left( {{\mathbf{x}}_{0} } \right)\) and the unknown parameter \(\theta\).

What is often insufficiently appreciated by the effect size literature (Cohen, 1988) is that an optimal (consistent, unbiased, fully efficient, sufficient) estimator \(\hat{\theta }_{n} \left( {\mathbf{X}} \right)\) of \(\theta\) does not justify the inferential claim in (2).

Example 1

(continued). The ML estimator \(\hat{\theta }_{ML} \left( {\mathbf{X}} \right) = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {X_{i} }\) of \(\mu\) enjoys all optimal properties, but that does not underwrite the claim \(\hat{\theta }_{ML} \left( {{\mathbf{x}}_{0} } \right) \simeq \mu^{*} ,\) since \(\hat{\theta }_{ML} \left( {{\mathbf{x}}_{0} } \right)\) represents a single value from the range of possible values of \(\hat{\theta }_{ML} \left( {\mathbf{X}} \right)\) associated with its sampling distribution \(f\left( {\hat{\theta }_{ML} \left( {\mathbf{x}} \right);\theta^{*} } \right), {\mathbf{x}} \in {\mathbb{R}}^{n} ,\) as in (10). What (10) implies is that \(Var\left( {\hat{\theta }_{ML} \left( {\mathbf{X}} \right)} \right) = \frac{{\sigma^{2} }}{n}\) decreases to zero as \(n \to \infty .\) Invoking the strong consistency of \(\hat{\theta }_{ML} \left( {\mathbf{X}} \right)\) does not address the problem, since \({\mathbb{P}}\left( {{\text{lim}}_{n \to \infty } \hat{\theta }_{ML} \left( {\mathbf{X}} \right) = \theta^{*} } \right) = 1\) pertains to what happens at the limit (\(n = \infty\)), and not at any \(n < \infty ;\) see Spanos (2013a). That is, as \(n\) increases \(f\left( {\hat{\theta }_{ML} \left( {\mathbf{x}} \right);\theta^{*} } \right)\) concentrates around \(\theta^{*} ,\) but it is defined over an unknown interval for any \(n < \infty .\) As shown in Sect. 5.1, this interval can be approximated using bounds provided by the Law of Iterated Logarithm; see Billingsley (1995).
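The point can be made concrete with a small sketch (illustrative values): even for very large \(n\), a single estimate \(\hat{\theta }_{ML} \left( {{\mathbf{x}}_{0} } \right)\) fluctuates around \(\mu^{*}\) with standard deviation \(\sigma /\sqrt n > 0\), so \(\hat{\theta }_{ML} \left( {{\mathbf{x}}_{0} } \right) \simeq \mu^{*}\) is never guaranteed for \(n < \infty\):

```python
# One realization of the ML estimator per n: the estimate is a single
# draw from N(mu*, sigma^2/n), not (approximately) mu* itself.
import numpy as np

rng = np.random.default_rng(seed=2)
mu_star, sigma = 0.5, 1.0
for n in (100, 10_000, 1_000_000):
    estimate = rng.normal(mu_star, sigma, size=n).mean()   # one x0 of size n
    print(n, estimate, abs(estimate - mu_star), sigma / np.sqrt(n))
```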

The unwarranted inferential claim in (2) was a primary motivation for Neyman (1937) to go beyond point estimation to propose the method of Confidence Intervals (CIs) that takes into consideration the uncertainty that relates to the point estimate as described by its sampling distribution \(f\left( {\hat{\theta }_{ML} \left( {\mathbf{x}} \right);\theta^{*} } \right),\;{\mathbf{x}} \in {\mathbb{R}}_{X}^{n}\).

Example 1

(continued). For (4), the \(\left( {1 - \alpha } \right)\) CI takes the form:

$${\mathbb{P}}\left( {\overline{X}_{n} - c_{{\frac{\alpha }{2}}} \left( {\frac{s}{\sqrt n }} \right) \le \mu < \overline{X}_{n} + c_{{\frac{\alpha }{2}}} \left( {\frac{s}{\sqrt n }} \right);\;\mu = \mu^{*} } \right) = 1 - \alpha ,$$
(11)

where \(c_{{\frac{\alpha }{2}}}\) is derived from the distribution of \(\tau \left( {{\mathbf{X}};\mu^{*} } \right)\) in (8). Having said that, it should be emphasized that the observed CI, \(\left( {\overline{x}_{n} - c_{{\frac{\alpha }{2}}} \left( {\frac{s}{\sqrt n }} \right) \le \mu < \overline{x}_{n} + c_{{\frac{\alpha }{2}}} \left( {\frac{s}{\sqrt n }} \right)} \right),\) where \(\overline{x}_{n}\) is the estimate of \(\mu ,\) cannot be assigned the probability \(\left( {1 - \alpha } \right)\) post-data; it either includes or excludes \(\mu^{*} ,\) but it is invariably unknown which one holds. The length of the observed CI does, however, provide some additional information about the uncertainty relating to the estimate \(\overline{x}_{n}\).
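A simulation sketch (illustrative values) of the factual interpretation of (11): the coverage probability \(\left( {1 - \alpha } \right)\) is a pre-data property of the CI procedure over repeated samples, while each observed CI either includes \(\mu^{*}\) or it does not:

```python
# Coverage of the CI procedure in (11) over N replications; each observed
# interval either covers mu* or not, but the procedure covers ~ (1-alpha).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
mu_star, sigma, n, N, alpha = 0.0, 1.0, 20, 100_000, 0.05
c = stats.t.ppf(1 - alpha / 2, df=n - 1)         # c_{alpha/2} from (8)

x = rng.normal(mu_star, sigma, size=(N, n))
xbar, s = x.mean(axis=1), x.std(axis=1, ddof=1)
half = c * s / np.sqrt(n)
covered = (xbar - half <= mu_star) & (mu_star < xbar + half)
print(covered.mean())                            # ~ 0.95
```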

2.4 Neyman–Pearson (N–P) testing

Example 1

(continued). Consider testing the hypotheses:

$$H_{0} :\mu \le \mu_{0} \;{\text{vs}}{.}\;H_{1} :\mu > \mu_{0} ,$$
(12)

where the framing of \(H_{0}\) and \(H_{1}\) constitutes a partition of \({\mathbb{R}}\). For statistical inference purposes, all values of \(\mu\) are of interest, irrespective of whether only a few values are of substantive interest. Using hypothetical reasoning one can evaluate the sampling distribution of \(\tau \left( {\mathbf{X}} \right) = \frac{{\sqrt n \left( {\overline{X}_{n} - \mu_{0} } \right)}}{s}\) under \(H_{0}\) and \(H_{1}\) yielding:

$$\left( {\text{i}} \right)\;\tau \left( {\mathbf{X}} \right) = \frac{{\sqrt n \left( {\overline{X}_{n} - \mu_{0} } \right)}}{s}\mathop \sim \limits^{{\mu = \mu_{0} }} {\text{St}}\left( {n - 1} \right),\quad \left( {{\text{ii}}} \right)\;\tau \left( {\mathbf{X}} \right) = \frac{{\sqrt n \left( {\overline{X}_{n} - \mu_{0} } \right)}}{s}\mathop \sim \limits^{{\mu = \mu_{1} }} {\text{St}}\left( {\delta_{1} ;n - 1} \right),$$
(13)

where \(\delta_{1} = \frac{{\sqrt n \left( {\mu_{1} - \mu_{0} } \right)}}{\sigma },\) for \(\mu_{1} > \mu_{0}\), is the noncentrality parameter.

More generally, N–P testing is based on hypothetical reasoning using prespecified values of \(\mu\) that could ‘approximate closely’ \(\mu^{*}\), in the sense that the difference \(\left| {\left| {\mu^{*} - \mu_{0} } \right|} \right|,\) where \(\left| {\left| . \right|} \right|\) denotes a distance function (norm), is statistically insignificant/significant (negligible/substantial). The primary role of the error probabilities is to operationalize the concepts of ‘statistically significant/insignificant’ as it relates to \(\left| {\left| {\mu^{*} - \mu_{0} } \right|} \right|\). The test statistic \(\tau \left( {\mathbf{X}} \right)\) reflects the difference \(\left| {\left| {\mu^{*} - \mu_{0} } \right|} \right|\), in the sense that (i) \(\mu^{*}\) is replaced by its best estimator, and (ii) \(\tau \left( {\mathbf{X}} \right)\) increases monotonically with this distance. For instance, the test \(T_{\alpha }\) in (14) uses \(\tau \left( {\mathbf{X}} \right) = \left[ {\sqrt n \left( {\overline{X}_{n} - \mu_{0} } \right)/s} \right],\) a standardized distance between \(\overline{X}_{n}\) (best estimator of \(\mu^{*}\)) and \(\mu_{0}\).

For the hypotheses in (12), an \(\alpha\)-significance level Uniformly Most Powerful (UMP) test is defined by:

$$T_{\alpha } : = \{ \tau \left( {\mathbf{X}} \right) = \frac{{\sqrt n \left( {\overline{X}_{n} - \mu_{0} } \right)}}{s},\;C_{1} \left( \alpha \right) = \{ {\mathbf{x}}:\tau \left( {\mathbf{x}} \right) > c_{\alpha } \} \} ,$$
(14)

see Lehmann and Romano (2005, p. 58). The type I error probability and the p value are evaluated using (i) in (13):

$${\mathbb{P}}(\tau \left( {\mathbf{X}} \right) > c_{\alpha } ;{\mkern 1mu} \mu = \mu_{0} ) = \alpha ,\;{\mathbb{P}}(\tau \left( {\mathbf{X}} \right) > \tau \left( {{\mathbf{x}}_{0} } \right);{\mkern 1mu} \mu = \mu_{0} ) = p\left( {{\mathbf{x}}_{0} } \right).$$
(15)

The power of \(T_{\alpha }\) is evaluated using (ii) in (13):

$${\mathcal{P}}\left( {\mu_{1} } \right) = {\mathbb{P}}\left( {\tau \left( {\mathbf{X}} \right) > c_{\alpha } ;\mu = \mu_{1} } \right),\;{\text{for all}}\;\mu_{1} > \mu_{0} .$$
(16)

The power of a test measures its generic (for any \({\mathbf{x}} \in {\mathbb{R}}^{n}\)) capacity to detect discrepancies from \(H_{0} .\) As argued next, none of the above error probabilities (type I, type II, power, p value) is conditional on values of \(\mu\). Hence the use of the notation ‘;’ instead of ‘|’ to separate the observable random variable \(\tau \left( {\mathbf{X}} \right)\) from the unknown (and unobservable) constant \(\mu\), to avoid confusion.
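As a numerical sketch of (16) (with illustrative values for \(\mu_{0} ,\;\mu_{1} ,\;\sigma ,\;n\)), the power of \(T_{\alpha }\) can be evaluated using the noncentral Student’s t in (13)(ii):

```python
# Power P(mu_1) in (16), evaluated via the noncentral St(delta_1; n-1)
# in (13)(ii); mu0, sigma, n and the mu1 grid are illustrative.
import numpy as np
from scipy import stats

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
c_alpha = stats.t.ppf(1 - alpha, df=n - 1)     # threshold c_alpha in (14)

for mu1 in (0.1, 0.3, 0.5):
    delta1 = np.sqrt(n) * (mu1 - mu0) / sigma  # noncentrality parameter
    print(mu1, stats.nct.sf(c_alpha, df=n - 1, nc=delta1))
```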

Particularly important for the current discussions on replicability are two crucial preconditions proposed by Neyman and Pearson (1933), which relate to the framing of \(H_{0}\) and \(H_{1}\) so as to secure the effectiveness of N–P testing: [i] \(H_{0}\) and \(H_{1}\) should constitute a partition of \(\Theta\), framed in a way that renders [ii] the type I error probability the more serious of the two, ensuring that \(H_{1}\) includes the potential range of values around \(\theta^{*} .\) Precondition [i] is needed to eliminate the scenario where \(\theta^{*}\) lies outside \({{\varvec{\Theta}}}_{0} \cup {{\varvec{\Theta}}}_{1} ,\) and [ii] to ensure that the test has power where it is needed for effective learning from data.

Example 2

(continued). For the simple Bernoulli model let the framing be:

$$H_{0} :\theta = \theta_{0} \;{\text{vs}}{.}\;H_{1} :\theta > \theta_{0} ,$$
(17)

and consider the case where \(\theta_{0} = .5,\; n = 20,\; \overline{x}_{n} = .2.\) The UMP N–P test for the hypotheses in (17), \({\text{T}}_{\alpha } : = \{ d\left( {\mathbf{X}} \right) = \frac{{\sqrt n \left( {\overline{X}_{n} - \theta_{0} } \right)}}{{\sqrt {\theta_{0} \left( {1 - \theta_{0} } \right)} }}\), \(C_{1} \left( \alpha \right) = \{ {\mathbf{x}}\): \(d\left( {\mathbf{x}} \right) > c_{\alpha } \} \},\) with \(\alpha = .05\) yields \(d\left( {{\mathbf{x}}_{0} } \right) = - 2.683,\) which indicates that the relevant range of values for \(\theta^{*}\) lies outside \({{\varvec{\Theta}}}_{0} \cup {{\varvec{\Theta}}}_{1}\). \(d\left( {{\mathbf{x}}_{0} } \right) = - 2.683\) gives rise to ‘accept \(H_{0}\)’ with a p value \(p\left( {{\mathbf{x}}_{0} } \right) = .996!\) This absurd result stems from the ill-chosen framing in (17), which disregards both N–P preconditions [i]–[ii] and ensures that the (implicit) power of this test in detecting the relevant discrepancies \(\left( {\theta - \theta_{0} } \right) < 0\) is less than \(\alpha\). Such absurd testing results are easily preventable by adhering to the N–P preconditions.

Hence, when no reliable information about the potential range of values for \(\theta^{*}\) is available, the more appropriate N–P framing is a two-sided partition of \(\Theta\):

$$H_{0} :\theta = \theta_{0} \;{\text{vs}}{.}\;H_{1} :\theta \ne \theta_{0} .$$
(18)

When such information is available, the appropriate framing is one-sided (directional), with \(H_{1}\) framed to include the relevant range of values for \(\theta^{*} .\) In the above example, the framing would be \(H_{0}\): \(\theta \ge \theta_{0}\) versus \(H_{1}\): \(\theta < \theta_{0} ,\) which would have rejected \(H_{0}\) with a p value \(p\left( {{\mathbf{x}}_{0} } \right) = .004!\)
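The arithmetic behind this example can be reproduced in a few lines (a sketch, using the Normal approximation to the distribution of \(d\left( {\mathbf{X}} \right)\)):

```python
# Example 2: theta0=.5, n=20, xbar=.2; d(x0) and the p values under the
# two alternative framings, using the Normal approximation for d(X).
import numpy as np
from scipy import stats

theta0, n, xbar = 0.5, 20, 0.2
d = np.sqrt(n) * (xbar - theta0) / np.sqrt(theta0 * (1 - theta0))
print(d)                      # -2.683
print(stats.norm.sf(d))       # p value for H1: theta > theta0, ~ .996
print(stats.norm.cdf(d))      # p value for H1: theta < theta0, ~ .004
```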

Regrettably, such ill-chosen framings of \(H_{0}\) and \(H_{1}\) are routinely used to (misleadingly) criticize N–P testing as inherently problematic, when in fact the framing in (17) runs afoul of one or both preconditions [i]–[ii]!

2.5 Error probabilities cannot be conditional on \(\theta\)

To shed light on why conditioning on \(\theta\) makes no formal or logical sense in frequentist inference, one needs to return to the basic axiomatic approach (Kolmogorov, 1933), where probability theory is erected on a probability space \(\left( {S,\Im ,{\mathbb{P}}\left( . \right)} \right),\) with \(S\) denoting the set of all (logically) possible distinct outcomes, \(\Im\) the set of all events (\(A \subset S\)) of interest and related events, which enjoys the mathematical structure of a sigma (\(\sigma\))-field (\(\Im\) is closed under the set-theoretic operations of union, intersection, and complementation), and \({\mathbb{P}}\left( . \right)\): \(\Im \to [0,1]\) assigning probabilities to events (elements) in \(\Im .\) Kolmogorov (1933, p. v) points out that the concept of a \(\sigma\)-field played a key role in the axiomatization of probability through Lebesgue’s measure theory (Shiryaev, 2016, p. 187). Random variables are defined relative to \(\Im\) in the sense that a function \(X\left( . \right)\): \(S \to {\mathbb{R}}\) is said to be a random variable if its pre-image \(X^{ - 1} \left( {( - \infty ,x]} \right) = \left\{ {s \in S:X\left( s \right) \le x} \right\}\) defines an event in \(\Im\) for every \(x \in {\mathbb{R}}\), ensuring that \(X\) defines a subset of events \(\sigma \left( X \right)\) of \(\Im ,\) known as the minimal \(\sigma\)-field generated by \(X.\)

To make the case that error probabilities are conditional on \(\theta\), one would need to demonstrate the mathematical meaning of \(f(h\left( {\mathbf{x}} \right)|\theta ),\) for any statistic \(h\left( {\mathbf{X}} \right),\) which would be defined by (Williams, 2001, p. 258):

$$f(h\left( {\mathbf{x}} \right)|\theta = \vartheta ) = \frac{{f\left( {h\left( {\mathbf{x}} \right),\theta = \vartheta } \right)}}{f\left( \vartheta \right)},\quad \forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n}$$
(19)

for a particular value \(\vartheta\) in \(\Theta\). Given that, in frequentist inference, \(\theta\) is not an event or a random variable defined relative to the \(\sigma\)-field \(\Im\) of the probability space \(\left( {S,\Im ,{\mathbb{P}}\left( . \right)} \right)\) underlying \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\), (19) makes no mathematical sense. That is, (19) does not exist as a probabilistic concept since there is no well-defined joint distribution \(f\left( {{\mathbf{x}},\theta } \right)\) to determine the numerator \(f\left( {{\mathbf{x}},\theta = \vartheta } \right)\), or the denominator \(f\left( \vartheta \right) = \mathop \smallint \limits_{{{\mathbf{x}} \in {\mathbb{R}}_{X}^{n} }} f\left( {{\mathbf{x}},\vartheta } \right)d{\mathbf{x}}\). This is not just a matter of ‘inept’ terminology, but a crucial issue concerning the non-existence of the two concepts \(f({\mathbf{x}}|\theta = \vartheta ),\;\forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n}\) and \(f(\theta |{\mathbf{X}} = {\mathbf{x}}_{0} ),\;\forall \theta \in {\Theta }\), in the context of frequentist inference. Even when viewed at a more intuitive level, factual (presuming that \(\theta = \theta^{*} )\) and hypothetical (presuming that \(\theta = \theta_{0} )\) reasoning do not entail probabilistic conditioning, since the latter pertains to ‘information that an event \(A\) in \(\Im\) has occurred’. Hence, invoking the misleading set phrase ‘given \(H_{0}\)’ as bespeaking mathematical conditioning is unwarranted. What makes mathematical and logical sense is to define \(f\left( {h\left( {\mathbf{x}} \right);\theta = \vartheta } \right),\;\forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\) for prespecified values of \(\theta\) and derive it via (6) using factual or hypothetical reasoning.

As a counter-argument to the above case, one might hazard the counter-claim that \(\theta\) can be transformed into a special random variable that relates to two events \(A = \{ \theta\): \(\theta = \theta_{0} \}\) and \(\overline{A} = \{ \theta\): \(\theta \ne \theta_{0} \} ,\) with the relevant \(\sigma\)-field of interest being \({\mathcal{F}} = \left\{ {S,\emptyset ,A,\overline{A}} \right\},\) and \({\mathbb{P}}\left( A \right) = 0,\) or \({\mathbb{P}}\left( A \right) = 1.\) Regrettably, this idea crumbles instantly since the two random variables \(X\) and \(\theta\) can only be related as in (19) when they are both defined on the same probability space, \(\left( {S,\Im ,{\mathbb{P}}\left( . \right)} \right)\), whose \(\sigma\)-field \(\Im\) is required to include all possible unions, intersections, and complementations of all the events relating to both! Worse, the mapping \(\theta \left( s \right) = \theta_{0}\) for all \(s \in S\) defines a degenerate (constant) random variable which, by construction, is independent of every other random variable \(X\) defined on \(\left( {S,\Im ,{\mathbb{P}}\left( . \right)} \right)\) (Renyi, 1970, p. 201), i.e., there is no joint \({\mathbb{P}}\left( {x,\theta } \right)\) or conditional \({\mathbb{P}}(x|\theta )\) probability to be had.

More astounding is the impossibility of constructing a \(\sigma\)-field \(\Im\) that includes all the joint events associated with \(\theta\) and \(X\) even when \(\theta\) is a proper random variable with its own prior distribution \(\pi \left( \theta \right),\; \forall \theta \in {\Theta }\). That is, this problem lies dormant at the very foundation of Bayesian statistics. The traditional derivation of Bayes theorem circumnavigates this problem by reimagining the frequentist distribution of the sample \(f\left( {{\mathbf{x}};\theta } \right)\) as (somehow) conditional on \(\theta ,\) i.e. \(f({\mathbf{x}}|\theta ),\) \(\forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n} .\) This finessing enables Bayesians to define, without any intellectual effort, the (contrived) joint distribution by \(f\left( {{\mathbf{x}},\theta } \right) = f({\mathbf{x}}|\theta ) \cdot \pi \left( \theta \right),\) \(\forall \theta \in \Theta ,\;\forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ;\) see Sect. 6.2.

3 Bernoulli’s golden theorem in retrospect

Assuming the simple Bernoulli model in (5), Bernoulli’s golden theorem asserts:

$${\mathbb{P}}\left( {\left| {\overline{X}_{n} - \theta } \right| < \varepsilon } \right) \ge 1 - \delta ,\;{\text{for}}\;\varepsilon > 0,\;\delta > 0,\;{\text{and all}}\;n \ge N.$$
(20)

The retrospective view of this theorem is guided by Le Cam’s (1986) perspective on limit theorems encapsulated by the following quotation: “… limit theorems ‘as \(n\) tends to infinity’ are logically devoid of content about what happens at any particular \(n\). All they can do is suggest certain approaches whose performance must then be checked on the case at hand. Unfortunately, the approximation bounds we could get were too often too crude and cumbersome to be of any practical use.” (p. xiv).

3.1 Bernoulli’s law of large numbers

The most pivotal way Bernoulli’s golden theorem influenced probability and statistical inference arose from its implications as \(n \to \infty\) (Hald, 1998, 2007). When placed in the context of model-based frequentist inference, the statistical model underlying the result is the simple Bernoulli model.

3.1.1 Bernoulli’s WLLN

For a Bernoulli IID process \(\left\{ {X_{k} , k \in {\mathbb{N}}} \right\}\) in (5):

$$\mathop {\lim }\limits_{n \to \infty } {\mathbb{P}}\left( {\left| {\overline{X}_{n} - \theta } \right| < \varepsilon } \right) = 1,\;{\text{for}}\;\varepsilon > 0,$$
(21)

where \(\overline{X}_{n} = \frac{1}{n}\sum\nolimits_{k = 1}^{n} {X_{k} } .\) (21) follows from (20) and (25) since \(\delta = \frac{{\theta \left( {1 - \theta } \right)}}{{\varepsilon^{2} n}}\to _{n \to \infty } 0\); see Billingsley (1995), p. 5.

The result in (21) provided the first formal justification for the frequentist interpretation of probability of an event \(A\) as the limit of the ‘stable long-run relative frequency’ \(\overline{x}_{n} = \frac{1}{n}\sum\nolimits_{k = 1}^{n} {x_{k} } ,\) in the context of the statistical model \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) in (5). This was the first limit theorem, known as the Weak Law of Large Numbers (WLLN). Almost two centuries later, Bernoulli’s WLLN was strengthened by Borel in 1909 in the form of a Strong Law of Large Numbers (SLLN).

3.1.2 Borel’s SLLN

For an IID Bernoulli process \(\left\{ {X_{k} , k \in {\mathbb{N}}} \right\}\) in (5):

$${\mathbb{P}}\left( {\mathop {{\text{lim}}}\limits_{n \to \infty } \overline{X}_{n} = \theta } \right) = 1.$$
(22)

That is, as \(n \to \infty\) the process \(\{ \overline{X}_{n} \}_{n = 1}^{\infty }\) converges to \(\theta = E\left( {X_{k} } \right)\) with probability one, or almost surely (a.s.); see Billingsley (1995), p. 8.

3.1.3 Probabilistic versus mathematical convergence

It is important to distinguish the above forms of probabilistic convergence in (21)–(22) from the mathematical convergence invoked by von Mises (1928):

$$\mathop {{\text{lim}}}\limits_{n \to \infty } \overline{x}_{n} = \theta ,$$
(23)

where \(\overline{x}_{n}\) denotes the values of \(\overline{X}_{n} ,{ }\) since neither (21) nor (22) entails (23). As argued by Williams (2001), p. 25, any attempt to make rigorous the mathematical convergence \({\text{lim}}_{n \to \infty } \overline{x}_{n} = \theta\) is ill-fated for purely mathematical reasons which can only be circumvented using measure theory. Historically, the line between probabilistic and mathematical convergence \({\text{lim}}_{n \to \infty } \overline{x}_{n} = \theta\) was blurred by von Mises’s (1928) notion of a collective, which was defined in terms of infinite realizations \(\{ x_{k} \}_{k = 1}^{\infty }\) whose partial sums \(\{ \overline{x}_{n} \}_{n = 1}^{\infty }\) converge to \(\theta\). This has led to a widespread confusion between probabilities and relative frequencies that lingers on today, by misidentifying the frequentist interpretation of probability with the long-run metaphor; see Spanos (2013a).

3.2 Bernoulli’s golden theorem versus his numerical example

From today’s perspective, Bernoulli’s golden theorem amounts to a finite sample approximation of the WLLN in (21). Bernoulli (1713) derived the Binomial (Bin) distribution for \(\sum\nolimits_{k = 1}^{n} {X_{k} }\) using the homonymous expansion in his discussion of proposition 12 of Part I. He used this result in Part IV to derive the first finite sample ‘sampling distribution’ of the sum:

$$Y: = n\overline{X}_{n} = \sum\limits_{k = 1}^{n} {X_{k} } \sim {\text{Bin}}\left( {n\theta ,{\mkern 1mu} n\theta \left( {1 - \theta } \right);n} \right).$$
(24)

In retrospect, his derivation of (20) was based on approximating the Binomial tail areas, which today are more easily bounded using Chebyshev’s inequality:

$${\mathbb{P}}\left( {\left| {\overline{X}_{n} - \theta } \right| \ge \varepsilon } \right) \le \frac{{\theta \left( {1 - \theta } \right)}}{{\varepsilon^{2} n}},$$
(25)

which implies that, since \(\theta \left( {1 - \theta } \right) \le \frac{1}{4}\), (20) holds with \(\delta = {\mathbb{P}}\left( {\left| {\overline{X}_{n} - \theta } \right| \ge \varepsilon } \right)\) for any:

$$n \ge N = \frac{{\theta \left( {1 - \theta } \right)}}{{\varepsilon^{2} \delta }} \le \frac{1}{{4\varepsilon^{2} \delta }} .$$
(26)

3.2.1 Bernoulli’s example

In discussing the golden theorem’s interpretation and implications for inference, it is important to distinguish between the generic results in (24)–(26) and Bernoulli’s numerical example based on \(\theta^{*} = .6,\) \(\varepsilon = .2\) and \(\delta = .001\), since the example has often been misinterpreted.

Using Bernoulli’s numerical example, (26) implies that \(N = [4(.2)^{2} \left( {.001} \right)]^{ - 1} = 6250.\) That is, for any \(n \ge 6250\) the lower and upper bounds, \(\left( {\overline{X}_{n} - \varepsilon } \right)\) and \(\left( {\overline{X}_{n} + \varepsilon } \right),\) respectively, will include (overlay) the true value of \(\theta ,\) say \(\theta^{*}\), with probability \(\left( {1 - \delta } \right) = .999.\)

It is worth noting that Bernoulli’s (1713) bound for \(\delta\) was much less accurate than (26), yielding \(N = 25550\), “… because of two crude approximations. First, he requires that the basic inequality holds for each tail separately, instead of their sum. … Second, he uses the arithmetic approximation for the tail probability instead of the geometric one.” (Hald, 2007, p. 14). It is also important to bring out the fact that the lower bound for (20) yielding \(N = 6250\) does not use the information that \(\theta^{*} = .6.\) Naturally, when this information is used, \(\theta \left( {1 - \theta } \right) = .24\), yielding a smaller \(N = \left( {.24} \right)[(.2)^{2} \left( {.001} \right)]^{ - 1} = 6000\).

In light of the above comments, one should separate the golden theorem from the numerical example Bernoulli used to illustrate it. His illustration is no different in substance from demonstrating the golden theorem today using simulation or an analytical calculation for particular values of \(\theta , \varepsilon , \delta ,\) and \(N.\) The simulation can be used to establish the relevant tail areas empirically based on a large number (say \(N = 10{,}000\)) of sample realizations \({\mathbf{x}}_{i} ,\; i = 1,2, \ldots ,N,\) of size \(n\). Hence, it would be a mistake from today’s perspective to view Bernoulli’s theorem as (somehow) tainted by his use of the information \(\theta^{*} = .6\) to illustrate it, since such information is irrelevant for the theorem in (20) to hold.
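For completeness, the sample size arithmetic in (26) for Bernoulli’s example can be verified directly (a minimal sketch):

```python
# N from (26) for epsilon=.2, delta=.001: the distribution-free bound
# theta(1-theta) <= 1/4 gives 6250; using theta*=.6 gives 6000.
eps, delta = 0.2, 0.001
N_bound = 1 / (4 * eps**2 * delta)        # 6250.0
N_known = 0.6 * 0.4 / (eps**2 * delta)    # 6000.0
print(N_bound, N_known)
```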

3.3 Revisiting Bernoulli’s alleged swindle

Influenced by the legal tradition of his time, Bernoulli (1713) understood the magnitude of probability \({\mathbb{P}}\left( A \right)\) as degrees of certainty along a graduated spectrum of belief ranging from total ignorance (\({\mathbb{P}}\left( A \right) \simeq 0\)) to firm conviction (\({\mathbb{P}}\left( A \right) \simeq 1\)) or moral certainty: “something is morally certain if its probability comes so close to complete certainty that the difference cannot be perceived.” (p. 315). In his numerical example, an event (conjecture) A is morally certain when \({\mathbb{P}}\left( A \right) = .999.\)

3.3.1 Bernoulli’s alleged swindle

Diaconis and Skyrms (2018) argue that Bernoulli committed a ‘swindle’ by viewing his golden theorem in (20), in conjunction with his notion of moral certainty, to infer:

$$\overline{x}_{n} \simeq \theta^{*} ,\;{\text{for}}\;n \ge N,$$
(27)

where \(\overline{x}_{n}\) denotes the observed value of \(\overline{X}_{n} = \frac{1}{n}\sum\nolimits_{k = 1}^{n} {X_{k} }\).

A retrospective view suggests that (27) is just a special case of the unwarranted claim in (2), with \(\hat{\theta }\left( {{\mathbf{x}}_{0} } \right) = \overline{x}_{n}\), potentially stemming from misinterpreting (21) as entailing (23); see Spanos (2013b). The claim in (27), to the extent it persists today, stems primarily from misconstruing the long-run metaphor that aims to conceptualize the link between relative frequencies and probabilities. In the context of model-based frequentist inference, probabilities are not identified with relative frequencies; rather, probabilities are evidenced by stable relative frequencies based on a statistically adequate \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\); see Spanos (2013a). As argued by Hacking (1980): “Probability in this sense [frequentist] does not mean ‘relative frequency’, but probabilities are typically manifested by stable frequencies.” (p. 150). ‘Typically’ refers to the particular data \({\mathbf{x}}_{0}\) being a typical realization of the prespecified \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\), or equivalently, the probabilistic assumptions comprising \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) being valid for \({\mathbf{x}}_{0}\). Hence, (21) is justified on empirical, and not on a priori (rational defence in reason), grounds.

The intuition underlying Bernoulli’s golden theorem could be illustrated in terms of relative frequencies (proportions) as follows: assuming the IID assumptions of \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) in (5) are valid, for large enough \(n,\) say \(n \ge 6250,\) at most a proportion \(\delta = .001\) of the N = 10,000 sample realizations \({\mathbf{x}}_{i} : = \left( {x_{1i} ,x_{2i} , \ldots ,x_{ni} } \right), \;i = 1,2, \ldots ,N\), is likely to exhibit errors (fluctuations around \(\theta^{*} )\) outside the band \(\left| {\overline{x}_{n} - \theta^{*} } \right| < \varepsilon .\) Borel’s SLLN states that under the same conditions, for \(n \ge 6250\) no sample realization is likely to exhibit errors outside \(\left| {\overline{x}_{n} - \theta^{*} } \right| < \varepsilon .\) It is important to view this as a heuristic explanation of the theorems (21)–(22) where probabilities are manifested by the relative frequencies; see Spanos (2013a).
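This heuristic can be reproduced by simulation (a sketch; the seed and the number of replications are arbitrary illustrative choices):

```python
# Simulate N = 10,000 realizations of size n = 6250 from Bernoulli(.6)
# and count the proportion of sample means outside |xbar - theta*| < eps;
# Chebyshev guarantees this proportion is at most delta = .001.
import numpy as np

rng = np.random.default_rng(seed=4)
theta_star, eps, n, N = 0.6, 0.2, 6250, 10_000
xbars = rng.binomial(n, theta_star, size=N) / n       # N sample means
print(np.mean(np.abs(xbars - theta_star) >= eps))     # ~ 0.0
```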

Is Bernoulli (1713) guilty of the swindle alleged by Diaconis and Skyrms (2018)? A retrospective case can be made that the combination of his numerical example and his notion of ‘moral certainty’ is likely to have misled modern readers into conflating the heuristic illustration with the theorem in (21).

3.4 The golden theorem and approximate CIs

As argued above, the statistical adequacy of \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) in (5) is critical for the golden theorem in (20), as well as (21)–(22), to hold. A crucial difference between Bernoulli’s and Borel’s Law of Large Numbers (LLN) and subsequent variants is that the inductive premises underlying (21) and (22), \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) in (5), include an explicit distribution assumption that can be used to simulate the underlying sampling distribution of \(\sum\nolimits_{k = 1}^{n} {X_{k} }\) in (24), as shown in Fig. 1, where the Binomial is approximated closely by the Normal distribution.

Fig. 1 Bin(nθ*, nθ*(1 − θ*); n) versus N(nθ*, nθ*(1 − θ*)), θ* = .6, n = 100

Historically, almost all subsequent extensions (generalizations) of the original limit theorems (LLN, CLT) replaced the explicit distribution assumption with indirect ones (e.g. the existence of certain moments); see Billingsley (1995).

In light of that, the golden theorem in (20) can be used in conjunction with the sampling distribution in (24) to derive an approximate frequentist CI:

$${\mathbb{P}}\left( {\overline{X}_{n} - \varepsilon < \theta \le \overline{X}_{n} + \varepsilon ;\;\theta = \theta^{*} } \right) = \left( {1 - \alpha } \right),$$
(28)

where \(\varepsilon = c_{{\frac{\alpha }{2}}} \sqrt {\overline{X}_{n} \left( {1 - \overline{X}_{n} } \right)/n} ,\) and \(c_{{\frac{\alpha }{2}}}\) relates to the Normal approximation in Fig. 1. Hence, contrary to the Diaconis and Skyrms (2018) claim, Bernoulli did answer the question: “what is the probability that the chances [i.e. \(\theta^{*} = {\mathbb{P}}\left( {X = 1} \right)\)] fall within a certain interval?”, in the sense that the CI in (28) overlays \(\theta^{*}\) with probability \(\left( {1 - \alpha } \right)\), and not the inverse probability interval \({\mathbb{P}}( {\theta - \varepsilon < {\overline{x}_{n} \le \theta + \varepsilon } | \theta } ).\)

That is, the legitimacy of the approximate CI in (28) stems from (24) and the fact that \(\delta\) does not depend on \(\theta^{*} .\) Indeed, Laplace (1812) was the first to put forward a similar interval based on direct probabilities; see Hald (2007), p. 5. Dempster (1966) argues that the golden theorem can be viewed as a forerunner of Neyman-type CIs. What is even more interesting is that (28) can be sharpened considerably by replacing the \(\left( {1 - \delta } \right)\) bound with the tail areas of (24).

Example 2

(continued). Using Chebyshev’s inequality with \(n = 2500\) and \(\varepsilon = .1\) implies \(\delta = [4\left( {2500} \right)(.1)^{2} ]^{ - 1} = .01\), so that one can deduce the approximate .99 CI:

$${\mathbb{P}}\left( {\left| {\overline{X}_{n} - \theta } \right| \le \varepsilon } \right) \ge .99.$$
(29)

On the other hand, when \(Z = \frac{{\sum\nolimits_{k = 1}^{n} {X_{k} } - n\theta }}{{\sqrt {n\theta \left( {1 - \theta } \right)} }}\) is used to approximate the Binomial with the Normal distribution (de Moivre, 1738), shown in Fig. 1 for \(n = 100, \;\theta^{*} = .6\), the finite sample .99 CI in (28) requires only \(n = 166\) since \(\sqrt n \left( {.1} \right)/\sqrt {.25} = 2.576 \Rightarrow n = 166\), and thus:

$${\mathbb{P}}\left( {\left| {\overline{X}_{n} - \theta } \right| < .1} \right) \ge \int\limits_{ - 2.576}^{2.576} {\left( {\frac{{\exp \left( { - .5z^{2} } \right)}}{{\sqrt {2\pi } }}} \right)} \,dz = .99.$$
(30)

The sizeable reduction of the required sample size \(n\), from 2500 to 166, illustrates Le Cam’s point about approximation bounds that are “too crude and cumbersome to be of any practical use”; such differences are typical when crude probabilistic bounds are compared with finite sample approximations; see Spanos (2019).
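The \(n = 166\) claim can also be checked against the exact Binomial in (24) (a sketch, using the worst case \(\theta = .5\)):

```python
# Exact check of P(|Xbar_n - theta| < .1) at n = 166, theta = .5:
# the event corresponds to 66.4 < Y < 99.6, i.e. 67 <= Y <= 99.
from scipy import stats

n, theta = 166, 0.5
p = stats.binom.cdf(99, n, theta) - stats.binom.cdf(66, n, theta)
print(p)  # ~ .99
```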

3.5 Bernoulli and direct versus inverse inference

As argued above, the alleged Bernoulli’s swindle in (27) is a special case of the more general unwarranted claim in (2). This calls into question the traditional argument articulated by Diaconis and Skyrms (2018) that the source of the swindle stems from conflating \(f({\mathbf{x}}|\theta )\) with \(f(\theta |{\mathbf{x}})\). Let us unpack this claim.

Regrettably, Bernoulli’s use of the true \(\theta^{*} = .6\) in his numerical example has generated confusion in the literature about legitimate and illegitimate interpretations of the golden theorem, as well as whether the probability in (20) is direct (frequentist) or inverse (Bayesian). As shown above, the lower bound \(\left( {1 - \delta } \right)\) in (20) need not rely on knowing \(\theta^{*}\) since \(\delta = \frac{{\theta \left( {1 - \theta } \right)}}{{\varepsilon^{2} n}} \le \frac{1}{{4\varepsilon^{2} n}}.\) Also, it is not obvious what the claim by Diaconis and Skyrms (2018): “He solved an inference from chances to frequencies” (p. 65) refers to. Why?

To begin with, the probabilistic assignment \(P( {\theta - \varepsilon < {\overline{x}_{n} \le \theta + \varepsilon } | \theta } ) \simeq 1\) is meaningless in frequentist inference since there is no random variable involved to justify the assignment \(P\left( . \right);\) \(\overline{x}_{n}\), \(\theta\) and \(\varepsilon\) are known constants.

Second, it is not obvious what the inferential claim: ‘assuming \(\theta\) is known, for a given \(\varepsilon > 0\) there is a large enough \(n\) such that \({\mathbb{P}}\big( {\theta - \varepsilon < {\overline{X}_{n} \le \theta + \varepsilon } \big| \theta } \big) \simeq 1\)’ could (possibly) mean in frequentist statistics, since the golden theorem pertains to a particular value of \(\theta ,\) i.e. \(\theta^{*} .\) When \(\theta = \theta^{*}\) is known, the underlying generating mechanism \({\mathcal{M}}^{*} \left( {\mathbf{x}} \right) = \{ f\left( {{\mathbf{x}};\theta^{*} } \right)\} ,\;{\mathbf{x}} \in \{ 0,1\}^{n} ,\) is fully known for any \(n\); see Fig. 1 for \(n = 100,\; \theta^{*} = .6.\) That is, one can just use:

$$f\left( {{\mathbf{x}};\theta^{*} = .6} \right) = (.6)^{{\sum\nolimits_{k = 1}^{n} {x_{k} } }} (.4)^{{\sum\nolimits_{k = 1}^{n} {\left( {1 - x_{k} } \right)} }} = (.6)^{y} (.4)^{{\left( {n - y} \right)}} ,\quad y = 0,1, \ldots ,n,$$
(31)

where \(y = \sum\nolimits_{k = 1}^{n} {x_{k} }\), to evaluate the exact probabilities for different \(Y = y\) as in Table 1.

Table 1 Probabilities relating to Fig. 1
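As a sketch of how such probabilities are obtained from the fully known mechanism, the exact Binomial in (24) can be evaluated directly for selected values of \(y\) (the grid below is an illustrative choice):

```python
# Exact Bin(n=100, theta*=.6) probabilities for selected y, echoing
# the probabilities plotted in Fig. 1.
from scipy import stats

n, theta_star = 100, 0.6
for y in (50, 55, 60, 65, 70):
    print(y, stats.binom.pmf(y, n, theta_star))
```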

Given that the primary objective of frequentist inference is to learn from data \({\mathbf{x}}_{0}\) about \(\theta^{*} ,\) when \(\theta^{*}\) is known no statistical inference is called for or warranted. The notion that one can use \(\theta = \theta^{*}\) to infer something about \({\mathbf{x}}_{0}\) is nonsensical since there is no statistical inference to be had; there is no uncertainty about \(\theta^{*}\). Indeed, one can use \({\mathcal{M}}^{*} \left( {\mathbf{x}} \right)\) to evaluate the probabilities associated with any values of \(Y = \sum\nolimits_{k = 1}^{n} {X_{k} }\) of substantive interest beyond \({\mathbf{x}}_{0} ,\) including predicting future values of \(X_{t}\). Moreover, since neither \(f(\theta |{\mathbf{x}}_{0} ),\) nor \(f({\mathbf{x}}_{0} |\theta ),\) exist in frequentist inference (Sect. 2.5), (20) cannot (possibly) be susceptible to the charge of conflating direct with inverse inference.

On the other hand, when \(\theta\) is assumed to be a random variable, as in Bayesian statistics, the probabilistic statement \({\text{Pr}}(\overline{x}_{n} - \varepsilon < \theta \le \overline{x}_{n} + \varepsilon |{\mathbf{x}}_{0} ) \simeq 1\) stems from the posterior distribution, \(\pi (\theta |{\mathbf{x}}_{0} ) \propto f\left( {{\mathbf{x}}_{0} ;\theta } \right) \cdot \pi \left( \theta \right)\), \(\theta \in \left( {0,1} \right)\). This, however, does not render the frequentist interpretation of (20) in the context of \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) in (5) problematic in any logical or mathematical sense.

4 Revisiting the direct versus inverse inference

4.1 Bayesian deformation of the p value?

The question that naturally arises at this stage is: what is the merit of the Bayesian charge that frequentists often confuse \(f(\theta |{\mathbf{x}}_{0} )\) with \(f({\mathbf{x}}_{0} |\theta )\) when neither exists in that context and what does that imply for frequentist testing in particular?

In a section entitled "Bernoulli swindle and hypothesis testing", Diaconis and Skyrms (2018, p. 67) argue: “Suppose a drug company runs randomized trials on a new drug. The drug is either effective or not. You would like to know the probability that it is effective given the data. The drug company computes the probability that one would get the result in the data or better, given that the drug is ineffective, and gets a very small number. … To those who do not understand statistics, this is an invitation to Bernoulli’s swindle. It is "morally impossible" to get this value if the drug is ineffective. Therefore the drug is effective.”

A Bayesian practitioner would wholeheartedly agree with the sentence in italics, since probability refers to his/her degrees of belief; but why do the authors presume that this claim has any meaning in frequentist testing, where the drug does not have a "probability of being effective", whether or not given the data? As argued below, N–P testing results can provide reliable evidence ‘whether the drug is effective or not’ when appropriately interpreted using the post-data severity evaluation to establish the warranted discrepancy \(\gamma\) from the null value; see also Mayo and Spanos (2011).

The above quotation echoes Cohen’s (1994) more direct calumny: “When one tests \(H_{0}\), one is finding the probability that the data (\(D\)) could have arisen if \(H_{0}\) were true, \(P(D|H_{0} )\). If that probability is small, then it can be concluded that if \(H_{0}\) is true, then \(D\) is unlikely. Now, what really is at issue, what is always the real issue, is the probability that \(H_{0}\) is true, given the data, \(P(H_{0} |D)\), the inverse probability.” (p. 998).

Numerous papers in the replication literature (Wasserstein et al., 2019) treat the claim:

$$P(H_{0} |D) \ne P(D|H_{0} ),$$
(32)

as self-evident, and proceed to admonish frequentist testing. As argued in Sect. 2.5, when (32) is properly defined, it takes the form in (19), which does not exist in frequentist inference. Why the confusion? The unwarranted claim in (32) pertains to any two events \(A\) and \(B\), for which the relevant formula:

$${\mathbb{P}}\left( {A \cap B} \right) = {\mathbb{P}}(A|B) \cdot {\mathbb{P}}\left( B \right) = {\mathbb{P}}(B|A) \cdot {\mathbb{P}}\left( A \right)$$
(33)

implies that \({\mathbb{P}}(B|A) \ne {\mathbb{P}}(A|B)\) unless \({\mathbb{P}}\left( A \right) = {\mathbb{P}}\left( B \right)\). What is insufficiently appreciated is that (33) involves observable events \(A\) and \(B\) in \(\Im\); see Spanos (2010). Calling \(B\) a hypothesis (\(H_{0}\)) and \(A\) data (\(D\)) does not render (32) a legitimate claim in the context of \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) since \(H_{0}\): \(\theta = \theta_{0}\) cannot be an event in \(\Im\); see Sect. 2.5.
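A tiny numerical illustration of this asymmetry, with hypothetical event probabilities:

```python
# Illustration of (33): P(A|B) differs from P(B|A) whenever P(A) != P(B).
p_A, p_B, p_AB = 0.5, 0.2, 0.1       # hypothetical P(A), P(B), P(A and B)
print("P(A|B) =", p_AB / p_B)        # 0.5
print("P(B|A) =", p_AB / p_A)        # 0.2
```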

4.2 From accept/reject \({\varvec{H}}_{0}\) to an evidential interpretation

After a tongue-in-cheek ‘praise’ of Fisher for avoiding ‘Bernoulli’s swindle’ by proposing “… a methodology and a story about why that is what you want”, Diaconis and Skyrms (2018) take the praise back by claiming: “But it is not what you want, is it? You want the probability of effectiveness given the data.” (p. 68). Instead of allowing frequentists to articulate what they really want and trying to understand their underlying reasoning, they pronounce that "what you really want is a posterior probability from \(f(\theta |{\mathbf{x}}_{0} ), \;\forall \theta \in {\Theta }\)".

Fisher’s (1925) significance testing driven by the p value was recast into an optimal theory of hypothesis testing by Neyman and Pearson (1933), where the type I and II error probabilities (and power) are used to calibrate the pre-data capacity of the test to detect different discrepancies from \(H_{0}\); see Spanos (2006). Unfortunately, neither account has provided a cogent evidential interpretation of the testing results. Mayo and Spanos (2006) proposed such an evidential interpretation based on a post-data evaluation of the testing results that outputs the discrepancy \(\gamma\) from \(H_{0}\) warranted with high probability by test \(T_{\alpha }\) and data \({\mathbf{x}}_{0}\). What is different from previous attempts at providing an evidential interpretation is that error probabilities are viewed and interpreted in the context of the particular statistical set-up:

$$\left[ a \right]\;{\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right),\quad \left[ b \right]\;H_{0} :\theta \in \Theta_{0} \;\;{\text{vs}}.\;\;H_{1} :\theta \in \Theta_{1} ,\quad \left[ c \right]\;T_{\alpha } : = \left\{ {d\left( {\mathbf{X}} \right),C_{1} \left( \alpha \right)} \right\},\quad \left[ d \right]\;{\text{data}}\;{\mathbf{x}}_{0} ,$$
(34)

which includes the validity of the assumptions comprising \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) vis-à-vis data \({\mathbf{x}}_{0} ,\) the framing of \(H_{0}\) and \(H_{1}\) as a partition of \({\Theta }\), and the sample size \(n\). What is important to emphasize is that the discrepancy \(\gamma\) from \(H_{0}\) warranted by \(T_{\alpha }\) and \({\mathbf{x}}_{0}\), with high probability, provides a more reliable testing-based effect size, which is not vulnerable to the alleged Bernoulli swindle since it does not invoke the unwarranted claim in (2); see Spanos (2013b, 2021).

Contrary to the claim by Diaconis and Skyrms (2018), a frequentist tester agrees with their comment that: “… the p values are only part of the story. There is the power of the test ….” (p. 116). Indeed, from the post-data severity perspective (Mayo & Spanos, 2011) \(p\left( {{\mathbf{x}}_{0} } \right) < \alpha\) indicates the presence of ‘some’ discrepancy \(\gamma ,\) but provides no information about its magnitude since (i) the underlying distribution for \(p\left( {{\mathbf{x}}_{0} } \right)\) is evaluated only under \(H_{0} ,\) and (ii) \(p\left( {{\mathbf{x}}_{0} } \right)\) is vulnerable to the large n problem (e.g. high power). Both problems are addressed using the severity evaluation that takes into account the statistical context in (34), including the power, or equivalently the ‘sensitivity’ of the test: “By increasing the size of the experiment, we can render it more sensitive, meaning by this that it will allow the detection of … a quantitative smaller departure from the null hypothesis.” (Fisher, 1925, pp. 21–22).
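To illustrate how such a post-data severity evaluation outputs a warranted discrepancy, the sketch below computes \(SEV(\theta > \theta_{1})\) for a one-sided test in the simple Bernoulli model, using a Normal approximation to the sampling distribution of \(\overline{X}_{n}\); the null value and data are hypothetical, not drawn from the paper:

```python
# Minimal severity sketch for H0: theta <= theta0 vs H1: theta > theta0 in the
# Bernoulli model, after a rejection of H0. SEV(theta > theta1) is evaluated
# as P(Xbar <= observed xbar; theta = theta1), via a Normal approximation.
from math import sqrt
from scipy.stats import norm

n, xbar, theta0 = 100, 0.62, 0.5     # hypothetical data and null value
for gamma in (0.02, 0.05, 0.10, 0.15):
    theta1 = theta0 + gamma          # discrepancy gamma from the null value
    se = sqrt(theta1 * (1 - theta1) / n)
    sev = norm.cdf((xbar - theta1) / se)
    print(f"gamma = {gamma:.2f}: SEV(theta > {theta1:.2f}) = {sev:.3f}")
```

The output traces how the warranted discrepancy shrinks as the severity requirement is raised, which is information a bare \(p\left( {{\mathbf{x}}_{0} } \right) < \alpha\) cannot supply.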

Regrettably, ‘untutored’ practitioners accept the misleading claims (a)–(d) by Diaconis and Skyrms (2018) in the introduction at face value, in concert with similarly erroneous testimonials from Bayesian textbooks, which include:

(e) Ignore the statistical context in (34), because only \({\mathbf{x}}_{0}\) has any bearing on the evidence for or against \(H_{0}\) since Bayesian inference is data specific. This feature has been lionized by Bayesians in the form of the likelihood principle, which asserts that for inference purposes \({\mathbf{x}}_{0}\) is the only relevant value of \({\mathbf{X}}\); see Berger and Wolpert (1988).

(f) Accept the unwarranted claim that the p value conflates \(P(H_{0} |D)\) with \(P(D|H_{0} )\) and disparage frequentist testers for conflating the two; see Nickerson (2000).

(g) Keep reminding practitioners that ‘what they really want’ in terms of inference is the conditional probability of different values of \(\theta\) given \({\mathbf{x}}_{0}\), i.e. the posterior probability based on \(f(\theta |{\mathbf{x}}_{0} ), \;\forall \theta \in {\Theta }.\)

Arguably, the erroneous claims and recommendations (a)–(g) have contributed a great deal to the misuse/abuse and misinterpretation of the p value in particular, and of frequentist inference results more generally. Adding to this list:

(h) the confusion between the false positive/negative rates in medical diagnostic screening and the type I/II error probabilities that permeates the discussion in the replication crisis (Spanos, 2021), and

(i) a statistically misspecified \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\), i.e. one whose assumptions are invalid for data \({\mathbf{x}}_{0}\).

Taken together, (a)–(i) provide a much better explanation of why a sizeable percentage of the empirical evidence published in scientific journals is untrustworthy.

5 Bernoulli’s alleged swindle and effect sizes

Bernoulli’s distinction between chances, referring to \(\theta = {\mathbb{P}}\left( {X = 1} \right),\) and \(\overline{x}_{n}\) as probability a posteriori, referring to relative frequencies, is important because \(\theta\) is rarely a probability in the context of a statistical model \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\); the Bernoulli distribution is an exception. As argued below, the inferential claim in (27) is unwarranted, not because Bernoulli conflated \(f({\mathbf{x}}_{0} |\theta )\) with \(f(\theta |{\mathbf{x}}_{0} ),\) but because (27) is an instance of (2).

5.1 An unwarranted claim: \(\hat{\theta }_{n} \left( {{\mathbf{x}}_{0} } \right) \simeq \theta^{*}\) for a large enough \(n\)

As argued in Sect. 2.3, the Law of Large Numbers (LLN) (weak or strong) does not justify the claim \(\hat{\theta }_{n} \left( {{\mathbf{x}}_{0} } \right) \simeq \theta^{*} ,\) for a large enough \(n,\) since the LLN pertains only to what happens at the limit (\(n = \infty\)). What would it take to find a statistic, say \(h\left( {\mathbf{X}} \right),\) that would justify the claim \(h\left( {{\mathbf{x}}_{0} } \right) \simeq \theta^{*}\)? For that one needs to invoke another limit theorem, known as the Law of the Iterated Logarithm (LIL), that quantifies the LLN fluctuations of \(\hat{\theta }_{n} \left( {\mathbf{X}} \right)\) around \(\theta^{*}\), as described by its sampling distribution \(f\left( {\hat{\theta }_{n} \left( {\mathbf{x}} \right);\theta } \right),\; {\mathbf{x}} \in {\mathbb{R}}^{n}\), using upper and lower bounds.

As an aside, it is important to note that limit theorems, such as the LLN and the LIL, revolve around a specific statistic, \(\overline{X}_{n} = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {X_{i} } ,\) but their results can be easily extended to more general statistics \(h\left( {\mathbf{X}} \right);\) see Spanos (2019), ch. 9.

To implement the LIL, however, one would need to generate additional sample information in the form of N faithful replicas of the original data \({\mathbf{x}}_{0}\), i.e. ones that exhibit the same chance regularity patterns as \({\mathbf{x}}_{0}\), say \({\mathbf{x}}_{1} ,{\mathbf{x}}_{2} , \ldots ,{\mathbf{x}}_{N} ,\) using simulation or bootstrapping (resampling). These replicas are used to evaluate N estimates \(\hat{\theta }_{n} \left( {{\mathbf{x}}_{i} } \right),\) \(i = 1,2, \ldots ,N,\) of \(\theta ,\) whose (smoothed) histogram approximates the empirical distribution, say \(\hat{f}_{N} \left( {\hat{\theta }\left( {{\mathbf{x}}_{1} ,{\mathbf{x}}_{2} , \ldots ,{\mathbf{x}}_{N} } \right);\theta } \right);\) the empirical counterpart of the sampling distribution \(f\left( {\hat{\theta }\left( {\mathbf{x}} \right);\theta } \right),\; {\mathbf{x}} \in {\mathbb{R}}^{n}\). Although no single \(\hat{\theta }_{n} \left( {{\mathbf{x}}_{i} } \right)\) approximates \(\theta^{*}\) unless by happenstance, the overall average of these N estimates provides a ‘close enough’ approximation:

$$\overline{{\hat{\theta }}}_{N} \left( {{\mathbf{x}}_{1} ,{\mathbf{x}}_{2} , \ldots ,{\mathbf{x}}_{N} } \right) = \frac{1}{N}\sum\limits_{i = 1}^{N} {\hat{\theta }_{n} \left( {{\mathbf{x}}_{i} } \right)} \simeq \theta^{*} ,\quad {\text{for a large enough }}N.$$
(35)

The LIL quantifies ‘close enough’ by providing bounds for the approximation error \(\left| {\frac{1}{N}\sum\nolimits_{i = 1}^{N} {\hat{\theta }_{n} } \left( {{\mathbf{x}}_{i} } \right) - \theta^{*} } \right| < \varepsilon\) (Billingsley, 1995, p. 153):

$$\left( {1 \pm \varepsilon } \right)\left( {\frac{1}{N}\sqrt {2N\ln \ln \left( N \right)} } \right),\quad {\text{for any }}\varepsilon > 0.$$
(36)

For instance, when \(N = 20{,}000\), (36) yields \(\left( {1 \pm \varepsilon } \right)\left( {.015} \right),\) ensuring first-decimal approximation accuracy, whereas for \(N = 100\) the bound is much less tight: \(\left( {1 \pm \varepsilon } \right)\left( {.175} \right).\)
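Both quoted bounds can be checked directly from (36):

```python
# Reproducing the LIL bounds in (36): (1/N) * sqrt(2 * N * lnln(N)).
from math import log, sqrt

for N in (100, 20_000):
    print(f"N = {N:6d}: bound = {sqrt(2 * N * log(log(N))) / N:.3f}")
    # N =    100: bound = 0.175
    # N =  20000: bound = 0.015
```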

In practice, the histogram in Fig. 1 can be replicated using simple bootstrapping (Efron & Tibshirani, 1993), provided the validity of the IID assumptions for data \({\mathbf{x}}_{0}\) has been established using comprehensive misspecification testing; see Spanos (2018). This qualification is particularly crucial because any departures from the IID assumptions will render the bootstrap replications unfaithful replicas (ones exhibiting different chance regularities than \({\mathbf{x}}_{0}\)), and the ensuing empirical sampling distribution and its summary statistics will be unreliable; see Spanos (2019), p. 463.
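A minimal sketch of such a bootstrap replication, presuming the IID assumptions for \({\mathbf{x}}_{0}\) have been validated; the generating value \(\theta^{*} = .6\) and the sizes \(n\), \(N\) are illustrative:

```python
# Minimal bootstrap sketch (Efron & Tibshirani, 1993): N resampled replicas of
# x0 yield N estimates of theta whose histogram approximates the empirical
# sampling distribution underlying (35).
import numpy as np

rng = np.random.default_rng(0)
n, N = 100, 20_000
x0 = rng.binomial(1, 0.6, size=n)            # stand-in for the observed data

estimates = np.array([rng.choice(x0, size=n, replace=True).mean()
                      for _ in range(N)])    # theta-hat for each replica
print("average of the N bootstrap estimates:", estimates.mean().round(4))
# Note: this average centers on theta-hat(x0); its closeness to theta* rests
# on the replicas being faithful to the chance regularities of x0.
```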

It is important to emphasize that the approximation in (35) is not equivalent to using an enlarged data set \({\mathbf{X}}_{0}\) with sample size nN to estimate \(\theta\) and invoking consistency to claim \(\hat{\theta }_{nN} \left( {{\mathbf{X}}_{0} } \right) \simeq \theta^{*} .\) What is different in (35) is that the LIL bounds in (36) depend crucially on the averaging of the N estimates, which shortens the range of values of the sampling distribution \(\hat{f}_{N} \left( {\hat{\theta }\left( {{\mathbf{x}}_{1} ,{\mathbf{x}}_{2} , \ldots ,{\mathbf{x}}_{N} } \right);\theta } \right)\) as opposed to that of \(\hat{f}\left( {\hat{\theta }_{nN} \left( {\mathbf{x}} \right);\theta } \right)\). That is, the LLN does justify \(\hat{\theta }_{nN} \left( {{\mathbf{X}}_{0} } \right) \to \theta^{*}\) (in probability or almost surely), as \(nN \to \infty ,\) but it cannot provide bounds for the approximation error \(\left| {\hat{\theta }_{nN} \left( {{\mathbf{X}}_{0} } \right) - \theta^{*} } \right|;\) otherwise the LIL would have been redundant!

It should also be noted that Bernoulli’s LLN in the context of (5) can be somewhat misleading for the general case of an arbitrary consistent estimator \(\hat{\theta }_{n} \left( {\mathbf{X}} \right) \to \theta^{*}\) as \(n \to \infty .\) As mentioned in Sect. 3.4, it constitutes a special case where the invoked probabilistic assumptions include a direct (explicit) distributional assumption (Bernoulli), ensuring that (a) the finite-sample distribution of \(\overline{X}_{n} = \frac{1}{n}\sum\nolimits_{k = 1}^{n} {X_{k} }\) is known, as in (24), and (b) \(Var\left( {\overline{X}_{n} } \right) = \left( {\theta \left( {1 - \theta } \right)/n} \right)\) is bounded above by \(\left( {1/4n} \right)\) since \(\theta \left( {1 - \theta } \right) \le \left( {1/4} \right)\). This is not the case with more general limit theorems since they usually rely on indirect distributional assumptions, such as the existence of the first few moments; see Spanos (2019), ch. 9.
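For completeness, the bound in (b) follows by completing the square:

$$\theta \left( {1 - \theta } \right) = \frac{1}{4} - \left( {\theta - \frac{1}{2}} \right)^{2} \le \frac{1}{4},\quad {\text{so that}}\;Var\left( {\overline{X}_{n} } \right) = \frac{{\theta \left( {1 - \theta } \right)}}{n} \le \frac{1}{4n}.$$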

5.2 Estimation-based effect sizes

The approximation in (35) has important implications for the replication crisis as it relates to estimation-based effect sizes. Usually, effect sizes are point estimates of a function of one or more parameters of \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right);\) see Cohen (1988), Ellis (2010). For instance, in the case of testing the difference between two means, the estimation-based effect size, known as Cohen’s \(d = \left[ {\left( {\overline{x}_{n} - \overline{y}_{n} } \right)/s} \right],\) is nothing more than a point estimate \(\hat{\theta }_{n} \left( {{\mathbf{z}}_{0} } \right) = \left[ {\left( {\overline{x}_{n} - \overline{y}_{n} } \right)/s} \right]\) of the unknown parameter \(\theta = \left[ {\left( {\mu_{1} - \mu_{2} } \right)/\sigma } \right]\). This suggests that such estimation-based effect sizes constitute instances of the unwarranted claim (2).
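A minimal sketch of Cohen's \(d\) as such a point estimate; the two samples and the pooled standard deviation (for equal sample sizes) are illustrative assumptions:

```python
# Cohen's d = (xbar - ybar)/s as a point estimate of (mu1 - mu2)/sigma.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.5, 1.0, size=200)     # sample with mu1 = 0.5 (illustrative)
y = rng.normal(0.0, 1.0, size=200)     # sample with mu2 = 0.0 (illustrative)
s = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)   # pooled SD, equal n
d = (x.mean() - y.mean()) / s
print(f"Cohen's d = {d:.3f}")          # a point estimate, subject to (2)
```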

This is important for the current discussions on replicability since numerous recent papers (Nosek & Lakens, 2014) replicate published results to compare the point estimates \(\hat{\theta }_{n} \left( {{\mathbf{z}}_{0} } \right)\) of two or more studies, drawing inferences relating to the replicability and the trustworthiness of their evidence. Given that \(\hat{\theta }_{n} \left( {{\mathbf{x}}_{0} } \right) \simeq \theta^{*}\) is unwarranted, this strategy is likely to give rise to highly misleading results for the replicators. The above discussion questions the reliability of conclusions of the form ‘for a particular published study (i) the statistical significance is replicated based on observed CIs, but (ii) the effect size \(\theta = \left[ {\left( {\mu_{1} - \mu_{2} } \right)/\sigma } \right],\) measured by Cohen’s \(d,\) is smaller/bigger than the original’. Since particular point estimates depend crucially on the sample size \(n,\) as well as on the statistical adequacy of \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right),\) estimates based on different sample sizes or statistically misspecified models will give rise to highly misleading replication results.

A case can be made that a more reliable way to evaluate the replicability of studies is to compare the discrepancies from a null value warranted by an optimal test and data \({\mathbf{z}}_{0}\) stemming from the post-data severity evaluation of the testing results that takes fully into account the statistical context in (34); see Spanos (2021).

6 Bayes’ theorem and direct versus inverse inference

The traditional interpretation of Bernoulli’s golden theorem, as summarized by Diaconis and Skyrms (2018) in the introduction, has been that his inferential claim \(\overline{x}_{n} \simeq \theta^{*} ,\) for \(n \ge N,\) is not just unwarranted, but the problem he posed did not have a legitimate frequentist answer. His answer is based on conflating two different conditional densities \(f({\mathbf{x}}_{0} |{{\varvec{\uptheta}}})\) and \(f({{\varvec{\uptheta}}}|{\mathbf{x}}_{0} )\). Instead, his inferential problem was solved by Bayes (1764) who introduced the distinction between the two densities. As argued in Sect. 2.5, conditioning on the unknown and unobservable constant \({{\varvec{\uptheta}}}\) is both mathematically and logically meaningless in model-based frequentist inference; neither density exists. Despite this obvious mathematical fact, Bayesians have convinced many frequentists that the distribution of the sample, \(f\left( {{\mathbf{x}};{{\varvec{\uptheta}}}} \right), \;{\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\) can be (legitimately) reimagined as \(f({\mathbf{x}}|{{\varvec{\uptheta}}}), \;{\mathbf{x}} \in {\mathbb{R}}_{X}^{n}\), giving rise to a reinterpreted likelihood function \(L({{\varvec{\uptheta}}}|{\mathbf{x}}_{0} ) \propto f({\mathbf{x}}_{0} |{{\varvec{\uptheta}}}),\) \(\forall {{\varvec{\uptheta}}} \in {\Theta }\), as well as a transposed conditioning, to define \(f(\theta |{\mathbf{x}}_{0} ), \;\forall \theta \in {\Theta }\), when neither makes sense in frequentist statistics. Why? The short answer is that it allows Bayesians to use a dubious crosscut to render Bayes’ rule easier to define, justify and apply. Let us unpack this claim in finer detail.

6.1 Revisiting the traditional Bayes’ rule

According to Ghosh et al. (2006), Bayes’ rule takes the form:

$$\pi ({{\varvec{\uptheta}}}|{\mathbf{x}}) = \frac{{f({\mathbf{x}}|{{\varvec{\uptheta}}}) \cdot \pi \left( {{\varvec{\uptheta}}} \right)}}{{\int_{{{{\varvec{\uptheta}}} \in \Theta }} {f\left( {{\mathbf{x}}|{{\varvec{\uptheta}}}} \right)} \cdot \pi \left( {{\varvec{\uptheta}}} \right){\mathbf{d\theta }}}},$$
(37)

“where \(\pi \left( {{\varvec{\uptheta}}} \right)\) is the prior density function and \(f({\mathbf{x}}|{{\varvec{\uptheta}}})\) is the density of \({\mathbf{X}},\) interpreted as the conditional density of \({\mathbf{X}}\) given \({{\varvec{\uptheta}}}\). The numerator is the joint density of \({{\varvec{\uptheta}}}\) and \({\mathbf{X}}\) and the denominator is the marginal density of \({\mathbf{X}}.\)” (p. 31).

The formula in (37) and the Ghosh et al. (2006) description of its components are both misleading. To reveal the flaws, compare (37) with a more accurate definition of Bayes’ rule that includes the needed quantifier:

$$\pi \left( {{{\varvec{\uptheta}}}|{\mathbf{x}}_{0} } \right) = \frac{{f\left( {{\mathbf{x}}_{0} |{{\varvec{\uptheta}}}} \right) \cdot \pi \left( {{\varvec{\uptheta}}} \right)}}{{\int_{{{{\varvec{\uptheta}}} \in \Theta }} f \left( {{\mathbf{x}}_{0} |{{\varvec{\uptheta}}}} \right) \cdot \pi \left( {{\varvec{\uptheta}}} \right){\mathbf{d\theta }}}},\quad \forall {{\varvec{\uptheta}}} \in \Theta ,$$
(38)

for \(f\left( {{\mathbf{x}}_{0} } \right) = \int\limits_{{{{\varvec{\uptheta}}} \in \Theta }} f ({\mathbf{x}}_{0} |{{\varvec{\uptheta}}}) \cdot \pi \left( {{\varvec{\uptheta}}} \right){\mathbf{d}}{\varvec{\uptheta}} > 0,\) where data \({\mathbf{x}}_{0}\) represents a point in the sample space \({\mathbb{R}}_{X}^{n} .\) When (37) is compared to (38), the obvious differences are that in (37) the subscript \(0\) of \({\mathbf{x}}_{0}\) and the quantifier \(\forall {{\varvec{\uptheta}}} \in \Theta\) are missing, rendering the description of its components problematic in so far as:

[i] \(f({\mathbf{x}}_{0} |{{\varvec{\uptheta}}})\) is not the conditional density of \({\mathbf{X}}\) given \({{\varvec{\uptheta}}}\); it is an amalgam of different conditional densities with a fixed \({\mathbf{x}}_{0}\) and varying values of \({{\varvec{\uptheta}}}\) in \(\Theta \subset {\mathbb{R}}^{m} , \;n > m.\) Besides, the conditional density of \({\mathbf{X}}\) given \({{\varvec{\uptheta}}}\) requires the quantifier \(\forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\) not \(\forall {{\varvec{\uptheta}}} \in \Theta\).

[ii] The product \(f({\mathbf{x}}_{0} |{{\varvec{\uptheta}}}) \cdot \pi \left( {{\varvec{\uptheta}}} \right),\) \(\forall {{\varvec{\uptheta}}} \in \Theta ,\) is not the joint density of \({{\varvec{\uptheta}}}\) and \({\mathbf{X}},\) because \(f\left( {{\mathbf{x}},{{\varvec{\uptheta}}}} \right)\) would require a double quantifier \(\forall {{\varvec{\uptheta}}} \in \Theta , \quad \forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n}\) with a generic \({\mathbf{x}} \in {\mathbb{R}}_{X}^{n} .\)

[iii] \(\mathop \smallint \limits_{{{{\varvec{\uptheta}}} \in \Theta }} f({\mathbf{x}}_{0} |{{\varvec{\uptheta}}}) \cdot \pi \left( {{\varvec{\uptheta}}} \right){\mathbf{d\theta }} = f\left( {{\mathbf{x}}_{0} } \right)\) is a scaling factor and not the marginal density of \({\mathbf{X}}\), which is defined by \(f\left( {\mathbf{x}} \right),\quad \forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n} .\)
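To see what [i]–[iii] amount to computationally, the following sketch evaluates (38) for the Bernoulli case on a grid of \(\theta\) values: the 'likelihood' column is an amalgam of values \(f({\mathbf{x}}_{0};\theta)\) across \(\theta\), and the denominator is a scaling factor, not a marginal density. The data and the uniform prior are illustrative:

```python
# Grid evaluation of (38) for the Bernoulli case. The 'likelihood' is the
# amalgam f(x0; theta) over theta in Theta ([i]); the normalizer f(x0) is a
# scaling factor ([iii]). Data (n, y) and the uniform prior are illustrative.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)          # grid over Theta = (0, 1)
n, y = 20, 12                                   # y successes observed in x0
likelihood = theta**y * (1 - theta)**(n - y)    # f(x0; theta), for all theta
prior = np.ones_like(theta)                     # uniform pi(theta)
unnorm = likelihood * prior
posterior = unnorm / (unnorm.sum() * (theta[1] - theta[0]))  # scale by f(x0)
print("posterior mode:", round(float(theta[np.argmax(posterior)]), 3))  # ~ 0.6
```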

When one points out the flaws [i]–[iii] in the above quotation from Ghosh et al. (2006), the reply is often framed in terms of ‘sloppy language and clumsy notation’. The problem is that this interpretation is typical of Bayesian textbooks more generally; see Lindley (1965), p. 118, O’Hagan (1994), p. 4, and Robert (2007), pp. 8–9, inter alia. It would be equally misplaced to dismiss [i]–[iii] with a ‘we all know that …’ type of exculpation, as if they merely restated the obvious, because the problem is more fundamental and has to do with Bayesians (purposely) reimagining the distribution of the sample \(f\left( {{\mathbf{x}};{{\varvec{\uptheta}}}} \right),\;{\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\) as conditional on \({{\varvec{\uptheta}}},\) \(f({\mathbf{x}}|{{\varvec{\uptheta}}}), \;{\mathbf{x}} \in {\mathbb{R}}_{X}^{n}\). Why?

6.1.1 Bayes’ foundational problem

Given that in Bayesian inference, \({\mathbf{X}}\) and \({{\varvec{\uptheta}}}\) are viewed as random variables (vectors), they are both functions defined on the same probability space \(\left( {S,\Im ,{\mathbb{P}}\left( . \right)} \right)\) underlying the relevant \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) based on events:

$$A_{x} : = {\mathbf{X}}^{ - } \left( { - \infty ,{\mathbf{x}}} \right] \in \Im ,\;\forall {\mathbf{x}} \in {\mathbb{R}}^{n} ,\;B_{\vartheta } : = {{\varvec{\uptheta}}}^{ - } \left( { - \infty ,\vartheta } \right] \in \Im ,\quad \forall \vartheta \in \Theta ,$$

where \(Z^{ - } \left( . \right)\) denotes the pre-image of \(Z\left( . \right)\). Since \({\mathbf{X}}\) is observable and represents real-world events (data), but \({{\varvec{\uptheta}}}\) is unobservable and denotes degrees of belief, the foundational problem that arises is how one is supposed to conceptualize and construct the joint density:

$$f\left( {{\mathbf{x}},\vartheta } \right),\;\,\forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\;\,\forall \vartheta \in \Theta ,$$
(39)

by assigning probabilities to the overlapping events \(A_{x} \cap B_{\vartheta } \ne \emptyset\) aiming to blend coherently the observable (\({\mathbf{X}}\)) with the unobservable (\({{\varvec{\uptheta}}}\)) worlds. If one were to imagine that such a task is (somehow) achievable, then Bayesian inference would be reduced to a simple deductive formula:

$$f(\vartheta |{\mathbf{x}}_{0} ) = \frac{{f\left( {{\mathbf{x}}_{0} ,\vartheta } \right)}}{{f\left( {{\mathbf{x}}_{0} } \right)}},\quad \forall \vartheta \in \Theta .$$
(40)

The key difference between (40) and (38) is that \(f\left( {{\mathbf{x}}_{0} ,{{\varvec{\uptheta}}}} \right)\) is replaced by \(f({\mathbf{x}}_{0} |{{\varvec{\uptheta}}}) \cdot \pi \left( {{\varvec{\uptheta}}} \right),\) where \(\pi \left( {{\varvec{\uptheta}}} \right)\) is chosen independently of \(f\left( {{\mathbf{x}},\vartheta } \right)\) instead of using \(f\left( \vartheta \right) = \mathop \smallint \limits_{{{\mathbf{x}} \in {\mathbb{R}}_{X}^{n} }} f\left( {{\mathbf{x}},\vartheta } \right){\mathbf{dx}}\).

The traditional perspective on Bayesian statistics, however, ignores the above foundational conundrum and defines Bayes’ rule using a dubious crosscut to evade the intellectually taxing task of defining (39). Instead of choosing \(f\left( {{\mathbf{x}},{{\varvec{\uptheta}}}} \right)\), which would determine both \(f({\mathbf{x}}|{{\varvec{\uptheta}}})\) and \(f\left( {{\varvec{\uptheta}}} \right)\), Bayesian statistics selects \(f({\mathbf{x}}|{{\varvec{\uptheta}}})\) and \(\pi \left( {{\varvec{\uptheta}}} \right)\) separately and defines a (contrived) joint distribution via (Gelman et al., 2004, p. 7):

$$f\left( {{\mathbf{x}},{{\varvec{\uptheta}}}} \right) = f({\mathbf{x}}|{{\varvec{\uptheta}}}) \cdot \pi \left( {{\varvec{\uptheta}}} \right),\quad \forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\quad \forall {{\varvec{\uptheta}}} \in \Theta .$$
(41)

This conveniently evades the mammoth conundrum of bridging the gap between the real world of data and the mathematical world of prior probabilities, as pointed out by Le Cam (1977):

“(2) It [Bayesian statistics] confuses ‘theories’ about nature with ‘facts’, and makes no provision for the construction of models. (3) It applies brutally to propositions about theories or models of physical phenomena the same simplified logic which every one of us uses ordinarily for ‘events’. … (5) The theory blends in the same barrel all forms of uncertainty and treats them all alike.” (p. 134).

To be more specific, after reimagining \(f\left( {{\mathbf{x}};{{\varvec{\uptheta}}}} \right)\) as \(f({\mathbf{x}}|{{\varvec{\uptheta}}}),\) the second step involves invoking the multiplication rule for density functions, which takes the form:

$$f\left( {{\mathbf{x}},{{\varvec{\uptheta}}}} \right) = f\left( {{{\varvec{\uptheta}}}|{\mathbf{x}}} \right) \cdot f\left( {\mathbf{x}} \right) = f\left( {{\mathbf{x}}|{{\varvec{\uptheta}}}} \right) \cdot f\left( {{\varvec{\uptheta}}} \right),\quad \forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\quad \forall {{\varvec{\uptheta}}} \in \Theta .$$
(42)

The third step mistakenly evaluates (42) at \({\mathbf{X}} = {\mathbf{x}}_{0}\):

$$f\left( {{\mathbf{x}}_{0} ,{{\varvec{\uptheta}}}} \right) = f({{\varvec{\uptheta}}}|{\mathbf{x}}_{0} ) \cdot f\left( {{\mathbf{x}}_{0} } \right) = f({\mathbf{x}}_{0} |{{\varvec{\uptheta}}}) \cdot f\left( {{\varvec{\uptheta}}} \right),\quad \forall {{\varvec{\uptheta}}} \in \Theta ,$$
(43)

by ignoring the fact that \(f({{\varvec{\uptheta}}}|{\mathbf{x}}_{0} ) \cdot f\left( {{\mathbf{x}}_{0} } \right) \ne f({\mathbf{x}}_{0} |{{\varvec{\uptheta}}}) \cdot f\left( {{\varvec{\uptheta}}} \right),\) since the multiplication rule in (42) holds only when both quantifiers are attached; unlike the rule for simple events in (33), random variables always define more than one simple event in \(\Im\). To derive (38), the erroneously derived (43) is then solved for \(f({{\varvec{\uptheta}}}|{\mathbf{x}}_{0} )\), eliminating \(f\left( {{\mathbf{x}}_{0} ,{{\varvec{\uptheta}}}} \right)\) via the sleight of hand in step three, which hides the misapplication of (42) as if it were (33).

This sleight of hand suggests that one way to render the above Ghosh et al. (2006) interpretation of the components of Bayes’ rule formally correct is to add both quantifiers:

$$f({{\varvec{\uptheta}}}|{\mathbf{x}}) = \frac{{f({\mathbf{x}}|{{\varvec{\uptheta}}}) \cdot \pi \left( {{\varvec{\uptheta}}} \right)}}{{\int_{{{{\varvec{\uptheta}}} \in \Theta }} f ({\mathbf{x}}|{{\varvec{\uptheta}}}) \cdot \pi \left( {{\varvec{\uptheta}}} \right){\mathbf{d\theta }}}},\quad \forall {{\varvec{\uptheta}}} \in \Theta ,\;\forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n} .$$
(44)

This accurately describes the above quotation from Ghosh et al. (2006), but it has two unusual features:

(i) \(f({{\varvec{\uptheta}}}|{\mathbf{x}})\) is essentially a simple reparametrization of the contrived \(f\left( {{\mathbf{x}},{{\varvec{\uptheta}}}} \right),\) and

(ii) the presence of the quantifier \(\forall {\mathbf{x}} \in {\mathbb{R}}_{X}^{n}\) belies the Likelihood Principle: for inference purposes, the only relevant sample information pertaining to \({{\varvec{\uptheta}}}\) is contained in \({\mathbf{x}}_{0}\) via the likelihood function \(L({{\varvec{\uptheta}}}|{\mathbf{x}}_{0} ) \propto f({\mathbf{x}}_{0} |{{\varvec{\uptheta}}}), \;\forall {{\varvec{\uptheta}}} \in \Theta .\) Moreover, if two sample realizations \({\mathbf{x}}_{0}\) and \({\mathbf{y}}_{0}\) give rise to proportional likelihoods, \(L({{\varvec{\uptheta}}}|{\mathbf{x}}_{0} ) = cL({{\varvec{\uptheta}}}|{\mathbf{y}}_{0} )\) for some \(c > 0,\) they contain the same information about \({{\varvec{\uptheta}}}\) (Berger & Wolpert, 1988, p. 19).

In light of the above discussion, Bayesian statisticians need to choose between a formally correct Bayes’ rule as in (38), forsaking the misleading interpretation associated with (37), or the formula in (44), giving up the likelihood principle. A third, more practical, choice is to do away with the interpretation of the various components in (37), as per Ghosh et al. (2006), and view it as an updating formula whose interpretation is deemed irrelevant. All three choices, however, leave unresolved the key conundrum of bridging the gap between the real world of data and the mathematical world of prior probability. The questionable crosscut in (41) can be defended on pragmatic grounds, but using it to admonish frequentists for conflating \(f({{\varvec{\uptheta}}}|{\mathbf{x}}_{0} )\) with \(f({\mathbf{x}}_{0} |{{\varvec{\uptheta}}})\) is absurd, since neither density makes sense in that context.

6.2 Alternative ways to ‘learn from data’

In frequentist inference, bridging the gap between the real world of data \({\mathbf{x}}_{0}\) and the mathematical world of probability constitutes the essence of statistical induction: learning from data \({\mathbf{x}}_{0}\) about \({\mathcal{M}}^{*} \left( {\mathbf{x}} \right) = \left\{ {f\left( {{\mathbf{x}};{{\varvec{\uptheta}}}^{*} } \right), \;{\mathbf{x}} \in {\mathbb{R}}_{X}^{n} } \right\}\) that gave rise to data \({\mathbf{x}}_{0} .\) This is achieved by first securing the statistical adequacy of \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) vis-à-vis data \({\mathbf{x}}_{0} ,\) including the constancy of its parameters \({{\varvec{\uptheta}}},\) and then proceeding to use effective (optimal) procedures at the inference facet (point and interval estimation, testing and prediction), knowing that this ensures both the reliability and effectiveness of inference as well as the trustworthiness of the ensuing evidence; see Spanos (2013a).

In contrast, learning from data in Bayesian inference takes the form of revising the prior probability \(\pi \left( {{\varvec{\uptheta}}} \right),\) \(\forall {{\varvec{\uptheta}}} \in \Theta ,\) in light of data \({\mathbf{x}}_{0} ,\) to yield the posterior probability \(\pi ({{\varvec{\uptheta}}}|{\mathbf{x}}_{0} ) \propto f\left( {{\mathbf{x}}_{0} ;{{\varvec{\uptheta}}}} \right) \cdot \pi \left( {{\varvec{\uptheta}}} \right),\) \(\forall {{\varvec{\uptheta}}} \in \Theta\). What is not so obvious is the nature and meaning of the inferential claim that accompanies the revised rankings of \({{\varvec{\uptheta}}}\) by \(\pi ({{\varvec{\uptheta}}}|{\mathbf{x}}_{0} )\). Is the highest-ranked \({{\varvec{\uptheta}}}\) value, say \({{\varvec{\uptheta}}}^{\dag } ,\) the one approximating \({{\varvec{\uptheta}}}^{*}\) best in some sense? There is no decipherable answer to that question in the Bayesian literature, since the notion of a ‘true value’ \({{\varvec{\uptheta}}}^{*}\) is not well-defined when \({{\varvec{\uptheta}}}\) is a random variable (vector); no single number can characterize a non-degenerate random variable. Looking at Bayesian statistics textbooks, the pragmatic answer seems to be: it depends on the choice of ‘a loss function’; see Ghosh et al. (2006), Robert (2007), inter alia. What does expected loss, based on information other than data \({\mathbf{x}}_{0}\) and the statistical model \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right),\) have to do with learning from data \({\mathbf{x}}_{0}\) about \({\mathcal{M}}^{*} \left( {\mathbf{x}} \right)?\) The answer to this question highlights incisively a key difference between the frequentist and Bayesian approaches to inference, as well as what ‘learning from data’ amounts to in the context of the two approaches; see Spanos (2017). In that sense, Bernoulli’s (1713) inferential problem was not addressed by Bayes (1764), as often claimed. Bayes recast Bernoulli’s inference problem by viewing \({{\varvec{\uptheta}}}\) as a latent random vector and offered an alternative way to learn from data \({\mathbf{x}}_{0}\) about observable phenomena of interest.

7 Conclusions

Viewing Bernoulli’s (1713) golden theorem retrospectively in the context of model-based frequentist inference that revolves around a parametric statistical model, \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right),\) the following claims were called into question: (a) Bernoulli solved the problem of “an inference from chances to frequencies”, and thus (b) the golden theorem does not justify an approximate CI for \(\theta\), (c) Bernoulli’s ‘swindle’ in identifying probability with relative frequency stems from his conflating \(f({\mathbf{x}}_{0} |\theta )\) with \(f(\theta |{\mathbf{x}}_{0} ),\) and (d) the same swindle is routinely perpetrated by p value significance testers. In interrogating these claims, the paper argued that they are grounded in misconceptions that raise several broader foundational problems relating to the current replication crisis.

The main conclusions are: (i) Frequentist error probabilities are not conditional on hypotheses framed in terms of \(\theta\). They are attached to the inference procedure itself to calibrate its effectiveness, and they are grounded in the relevant sampling distribution of a statistic (estimator, test, predictor), which is evaluated via (6) under factual (presuming that \(\theta = \theta^{*}\), in estimation and prediction) or hypothetical (presuming that \(\theta = \theta_{0}\) or \(\theta = \theta_{1}\), in testing) reasoning. (ii) The alleged Bernoulli’s swindle is an instance of the unwarranted claim \(\hat{\theta }_{n} \left( {{\mathbf{x}}_{0} } \right) \simeq \theta^{*} ,\) for a large enough n, that also undermines point-estimation measures, such as the straight rule and the estimation-based effect sizes. (iii) More reliable measures of the ‘scientific effect’ can be evaluated using testing-based discrepancies warranted by data \({\mathbf{x}}_{0}\); see Spanos (2021). (iv) The direct versus inverse inference problem is a contrived issue that gives rise to misplaced criticisms of model-based frequentist inference, since neither distribution \(f({\mathbf{x}}_{0}|\theta )\) nor \(f(\theta |{\mathbf{x}}_{0} )\) exists in that context. A case is made that (v) this criticism is motivated by a misguided attempt to justify a dubious crosscut in deriving Bayes’ rule by reimagining the distribution of the sample \(f\left( {{\mathbf{x}};\theta } \right),\) \({\mathbf{x}} \in {\mathbb{R}}_{X}^{n} ,\) as conditional on \(\theta\). (vi) The reliability and precision of inferences depend solely on the approximate validity of the probabilistic assumptions comprising \({\mathcal{M}}_{{{\varvec{\uptheta}}}} \left( {\mathbf{x}} \right)\) for the particular data \({\mathbf{x}}_{0} ,\) and nothing else; any attempt to invoke limit theorems (as \(n \to \infty\)) is misplaced. (vii) Bayes (1764) did not address Bernoulli’s (1713) inference problem. Instead, he recast the original problem by viewing \({{\varvec{\uptheta}}}\) as a latent random vector and proposed a very different way to learn from data \({\mathbf{x}}_{0} .\) (viii) Bayesians should consider the dormant foundational problems arising from the choice of a prior as it relates to the dubious crosscut and the erroneous use of the multiplication rule for random variables in (42)–(43) in defining the contrived joint distribution in (41).