1 Introduction

This note is split into two parts: the first (Sect. 2) deals with conditional probability from a general point of view, while the second (Sects. 3 and 4) highlights some consequences of adopting an alternative notion of conditional probability.

Let us call SN the standard notion of conditional probability (i.e., regular conditional distributions) and AN the alternative notion quoted above. Roughly speaking, AN is obtained from SN by giving up the measurability constraint and adding a properness condition. As one might expect, this has both advantages and disadvantages. One major drawback of AN is that essential uniqueness is lost. This is certainly disappointing, but possibly not so crucial in the subjective view of probability. As to the advantages, AN allows one to overcome various paradoxes occurring with SN. This is because, thanks to properness, one actually conditions on events (and not on sub-\(\sigma \)-fields, as happens under SN).

Finally, among the possible consequences of AN, we focus on those related to Bayesian statistics, exchangeability and compatibility.

2 Conditional probability

In the sequel, \((\Omega ,{\mathcal {A}},P)\) is a probability space, \({\mathcal {G}}\subset {\mathcal {A}}\) a sub-\(\sigma \)-field, and

$$\begin{aligned} Q=\{Q(\omega ):\omega \in \Omega \} \end{aligned}$$

a collection of probability measures on \({\mathcal {A}}\). We denote by \(Q(\omega ,A)\) the value of \(Q(\omega )\) at \(A\in {\mathcal {A}}\). Also, \(\sigma (Q)\) is the \(\sigma \)-field on \(\Omega \) generated by the maps \(\omega \mapsto Q(\omega ,A)\) for all \(A\in {\mathcal {A}}\).

In this notation, Q is a regular conditional distribution (r.c.d.) given \({\mathcal {G}}\) if

(a):

\(\sigma (Q)\subset {\mathcal {G}}\);

(b):

\(P(A\cap B)=\int _BQ(\omega ,A)\,P(\mathrm{d}\omega )\) for all \(A\in {\mathcal {A}}\) and \(B\in {\mathcal {G}}\).

An r.c.d. can fail to exist. However, it exists and is a.s. unique under reasonable conditions, such as \({\mathcal {A}}\) countably generated and P perfect; see, e.g., Jirina (1954).

(We recall that P is perfect if, for each \({\mathcal {A}}\)-measurable \(f:\Omega \rightarrow {\mathbb {R}}\), there is \(I\in {\mathcal {B}}({\mathbb {R}})\) such that \(I\subset f(\Omega )\) and \(P(f\in I)=1\). If \(\Omega \) is separable metric and \({\mathcal {A}}={\mathcal {B}}(\Omega )\), perfectness is equivalent to tightness.)

This is the standard notion of conditional probability, based on Kolmogorov’s axioms and adopted almost universally. Indeed, apart from rare exceptions, a conditional probability is meant as an r.c.d.

Using r.c.d.’s, however, one is conditioning on a \(\sigma \)-field and not on a specific event. What does this mean? What is the information provided by a \(\sigma \)-field? According to the usual naive interpretation, the information provided by \({\mathcal {G}}\) is:

  1. (*)

    For each event \(B\in {\mathcal {G}}\), it is known whether \(B\) is true or false.

Attaching interpretation (*) to r.c.d.’s is quite dangerous.

Example 1

(Continuous time processes) Let \(X=\{X_t:t\ge 0\}\) be a real-valued process on \((\Omega ,{\mathcal {A}},P)\), adapted to a filtration \(\{{\mathcal {F}}_t:t\ge 0\}\), and let \({\mathcal {X}}\) be the set of all functions from \([0,\infty )\) into \({\mathbb {R}}\). Define \({\mathcal {N}}=\{A\in {\mathcal {A}}:P(A)=0\}\) and suppose that

$$\begin{aligned} {\mathcal {N}}\subset {\mathcal {F}}_0\quad \text {and}\quad \{X=x\}\in {\mathcal {N}}\quad \text { for each }x\in {\mathcal {X}}. \end{aligned}$$

Although quite common for continuous-time processes, the above assumption conflicts with (*). In fact, since \(\{X=x\}\in {\mathcal {F}}_0\) for each \(x\in {\mathcal {X}}\), interpretation (*) would imply that the actual X-path is already known at time 0. See also (Berti and Rigo 2008, Example 3).

Example 2

(Borel–Kolmogorov paradoxes) Let X and Y be random variables on \((\Omega ,{\mathcal {A}},P)\) such that \(\{X=x\}=\{Y=y\}\) for some x and y. Using r.c.d.’s, the conditional probability given \(X=x\) is taken to be \(P(\cdot \mid X=x)=Q_X(\omega )\), where \(Q_X\) is an r.c.d. given \(\sigma (X)\) and \(\omega \in \Omega \) is such that \(X(\omega )=x\). Similarly, \(P(\cdot \mid Y=y)=Q_Y(\omega )\) where \(Q_Y\) is an r.c.d. given \(\sigma (Y)\) and \(Y(\omega )=y\). But since X and Y are different, it may be that \(P(\cdot \mid X=x)\ne P(\cdot \mid Y=y)\) even if \(\{X=x\}=\{Y=y\}\).
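The paradox in Example 2 is easy to see numerically. The following Monte Carlo sketch is a hypothetical illustration (the choice of X and Y as i.i.d. standard normals, the seed and the tolerance \(\varepsilon \) are all assumptions, not taken from the text): it describes the event \(\{X=Y\}\) in two ways, as the limit of \(\{|Y-X|<\varepsilon \}\) and as the limit of \(\{|Y/X-1|<\varepsilon \}=\{|Y-X|<\varepsilon |X|\}\), and the two limiting conditional laws of X disagree.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)
eps = 0.01

# The same event {X = Y}, described via two different random variables.
a = x[np.abs(y - x) < eps]              # limit of {|Y - X| < eps}
b = x[np.abs(y - x) < eps * np.abs(x)]  # limit of {|Y/X - 1| < eps}

# The two limiting conditional laws of X disagree: N(0, 1/2) (variance 1/2)
# versus a law with density proportional to |x| exp(-x^2) (variance 1).
print(np.var(a), np.var(b))
```

As \(\varepsilon \) shrinks, the first empirical variance approaches 1/2 and the second approaches 1, although both computations claim to give "the conditional law of X given \(X=Y\)".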

Example 3

(Properness) For interpretation (*) to make sense, Q should be everywhere proper, in the sense that

$$\begin{aligned} Q(\omega )=\delta _\omega \text { on }{\mathcal {G}}\text { for each }\omega \in \Omega . \end{aligned}$$

In that case,

$$\begin{aligned} B=\{\omega \in \Omega :Q(\omega ,B)=1\}\in \sigma (Q)\quad \text {for each }B\in {\mathcal {G}}, \end{aligned}$$

so that \({\mathcal {G}}=\sigma (Q)\). Also, \(\sigma (Q)\) is countably generated whenever \({\mathcal {A}}\) is countably generated. Thus, Q fails to be everywhere proper if \({\mathcal {A}}\) is countably generated, but \({\mathcal {G}}\) is not. A weaker notion of properness is

$$\begin{aligned} Q(\omega )=\delta _\omega \text { on }{\mathcal {G}}\text { for each }\omega \in B_0 \end{aligned}$$
(1)

where \(B_0\in {\mathcal {G}}\) and \(P(B_0)=1\). But even condition (1) typically fails unless \({\mathcal {G}}\) is countably generated. In fact, condition (1) holds if and only if \({\mathcal {G}}\cap B_0\) is countably generated for some \(B_0\in {\mathcal {G}}\) with \(P(B_0)=1\); see Berti and Rigo (2007).

A seminal paper on properness is Blackwell and Dubins (1975). Other related references are Berti and Rigo (2007), Berti and Rigo (2008), Maitra and Ramakrishnan (1988).

To make interpretation (*) effective, the notion of r.c.d. is to be modified. We first recall that, for each \(\omega \in \Omega \), the \({\mathcal {G}}\)-atom including \(\omega \) is

$$\begin{aligned} H(\omega )=\bigcap _{\omega \in B\in {\mathcal {G}}}B. \end{aligned}$$

We also let

$$\begin{aligned} \Pi =\{H(\omega ):\omega \in \Omega \}. \end{aligned}$$

Note that \(\Pi \) is a partition of \(\Omega \) and each element of \({\mathcal {G}}\) is a union of elements of \(\Pi \).
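When \({\mathcal {G}}\) is generated by finitely many sets, the \({\mathcal {G}}\)-atoms can be computed directly: two points lie in the same atom exactly when they have the same membership pattern in every generator. A minimal sketch on a hypothetical eight-point \(\Omega \) (the generating sets are arbitrary choices):

```python
# Toy sample space and two generating sets for G (both hypothetical).
omega_space = set(range(8))
generators = [{0, 1, 2, 3}, {2, 3, 4, 5}]

def atom(w):
    """G-atom H(w): the points sharing w's membership pattern in each generator."""
    pattern = [w in B for B in generators]
    return frozenset(v for v in omega_space
                     if [v in B for B in generators] == pattern)

# Pi is a partition of Omega, and every set in G is a union of atoms.
partition = {atom(w) for w in omega_space}
print(sorted(sorted(a) for a in partition))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```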

Say that Q is a strategy given \({\mathcal {G}}\), or a \({\mathcal {G}}\)-strategy, if

(\(\mathrm{a}^*\)):

\(Q(x)=Q(y)\) whenever \(x,\,y\in \Omega \) and \(H(x)=H(y)\);

(\(\mathrm{b}^*\)):

There is a probability measure \({\widehat{P}}\) on \(\sigma (Q)\) such that

$$\begin{aligned} P(A)=\int Q(\omega ,A)\,{\widehat{P}}(\mathrm{d}\omega )\quad \text {for all }A\in {\mathcal {A}}; \end{aligned}$$
(\(\mathrm{c}\)):

Q is everywhere proper, i.e., \(Q(\omega )=\delta _\omega \) on \({\mathcal {G}}\) for each \(\omega \in \Omega \).

The above notion of \({\mathcal {G}}\)-strategy is inspired by Blackwell and Dubins (1975), while the term “strategy” is borrowed from Dubins (1975).
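In the finite case, conditions (\(a^*\)), (\(b^*\)) and (\(c\)) can be checked by hand. A minimal sketch, assuming a hypothetical six-point \(\Omega \) whose \({\mathcal {G}}\)-atoms are all non-null, with \(Q(\omega )=P(\cdot \mid H(\omega ))\) and \({\widehat{P}}=P\) (in this discrete setting the strategy is also an r.c.d.):

```python
from fractions import Fraction as F

# Hypothetical six-point space, probability P, and G-atoms Pi.
omega_space = range(6)
P = {0: F(1,6), 1: F(1,6), 2: F(1,12), 3: F(1,4), 4: F(1,6), 5: F(1,6)}
Pi = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]

def H(w):
    return next(a for a in Pi if w in a)       # the G-atom containing w

def Q(w, A):
    """Q(w) = P(. | H(w)), well defined since every atom is non-null."""
    h = H(w)
    return sum(P[v] for v in A & h) / sum(P[v] for v in h)

A = frozenset({1, 3, 5})
# (a*): Q(x) = Q(y) whenever H(x) = H(y).
assert all(Q(x, A) == Q(y, A) for h in Pi for x in h for y in h)
# (c): Q(w) = delta_w on G (checked on two G-sets, i.e. unions of atoms).
assert all(Q(w, B) == (1 if w in B else 0)
           for w in omega_space for B in [Pi[0], Pi[1] | Pi[2]])
# (b*): P(A) equals the integral of Q(., A) against the mixing measure P.
assert sum(Q(w, A) * P[w] for w in omega_space) == sum(P[v] for v in A)
print("Q is a G-strategy")
```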

Some obvious properties of \({\mathcal {G}}\)-strategies are collected in the next lemma.

Lemma 4

Let Q be a \({\mathcal {G}}\)-strategy. Then, \({\mathcal {G}}\subset \sigma (Q)\) and

$$\begin{aligned} P(A\cap B)=\int _BQ(\omega ,A)\,{\widehat{P}}(\mathrm{d}\omega )\quad \text {for all }A\in {\mathcal {A}}\text { and }B\in {\mathcal {G}}. \end{aligned}$$

In particular, \({\widehat{P}}=P\) on \({\mathcal {G}}\). Moreover, \({\widehat{P}}=P\text { on }{\mathcal {A}}\cap \sigma (Q)\) provided \(\Pi \subset {\mathcal {G}}\).

Proof

Let \(B\in {\mathcal {G}}\). By (c), \(B=\{\omega :Q(\omega ,B)=1\}\in \sigma (Q)\). Further,

$$\begin{aligned} P(A\cap B)=\int Q(\omega ,A\cap B)\,{\widehat{P}}(\mathrm{d}\omega )=\int _BQ(\omega ,A)\,{\widehat{P}}(\mathrm{d}\omega )\quad \text {for each }A\in {\mathcal {A}}, \end{aligned}$$

where the first equality is by (\(b^*\)) and the second by (c). Finally, suppose \(\Pi \subset {\mathcal {G}}\) and fix \(A\in {\mathcal {A}}\cap \sigma (Q)\). By (\(a^*\)), each element of \(\sigma (Q)\) is a union of \({\mathcal {G}}\)-atoms. Hence,

$$\begin{aligned} Q(\omega ,A)\ge Q(\omega ,H(\omega ))=\delta _\omega (H(\omega ))=1\quad \text {whenever }\omega \in A. \end{aligned}$$

Similarly, \(Q(\omega ,A)=0\) if \(\omega \notin A\). Therefore,

$$\begin{aligned} P(A)=\int Q(\omega ,A)\,{\widehat{P}}(\mathrm{d}\omega )=\int 1_A(\omega )\,{\widehat{P}}(\mathrm{d}\omega )={\widehat{P}}(A). \end{aligned}$$

\(\square \)

By Lemma 4, a \({\mathcal {G}}\)-strategy Q satisfies condition (b) whenever \(\sigma (Q)\subset {\mathcal {G}}\). Generally, however, \(\omega \mapsto Q(\omega ,A)\) is not \({\mathcal {G}}\)-measurable (or even \({\mathcal {A}}\)-measurable) and cannot be integrated against P. This is the reason why a mixing measure \({\widehat{P}}\) is involved in condition (\(b^*\)).

Condition (\(a^*\)) is a weaker version of (a) (it is in fact a consequence of (a)). Roughly speaking, the motivation for (\(a^*\)) is that, conditionally on \({\mathcal {G}}\), one is actually observing an element of the partition \(\Pi \) rather than a point of \(\Omega \). Thus, x and y provide the same information if \(H(x)=H(y)\).

Essentially, a \({\mathcal {G}}\)-strategy depends on \({\mathcal {G}}\) only through its atoms. In particular, if \({\mathcal {G}}\) includes the singletons, then \(\Pi =\{\{\omega \}:\omega \in \Omega \}\subset {\mathcal {G}}\) and the only \({\mathcal {G}}\)-strategy is \(Q(\omega )=\delta _\omega \) on \({\mathcal {A}}\) for all \(\omega \in \Omega \). As an example, take \({\mathcal {G}}=\{A\in {\mathcal {A}}:P(A)\in \{0,1\}\}\) and suppose that \(\{\omega \}\in {\mathcal {A}}\) and \(P(\{\omega \})=0\) for every \(\omega \in \Omega \). Then, \({\mathcal {G}}\) includes the singletons, so that \(Q(\omega )=\delta _\omega \) is the only \({\mathcal {G}}\)-strategy, while an r.c.d. given \({\mathcal {G}}\) is \(Q(\omega )=P\). As a further example, take another sub-\(\sigma \)-field \({\mathcal {F}}\subset {\mathcal {A}}\). If \({\mathcal {F}}\) has the same atoms as \({\mathcal {G}}\) and \(\Pi \subset {\mathcal {F}}\cap {\mathcal {G}}\), then Q is an \({\mathcal {F}}\)-strategy if and only if it is a \({\mathcal {G}}\)-strategy.

A last remark is that \({\mathcal {G}}\)-strategies are not uniquely determined by P. In particular, they are not essentially unique. This is technically a drawback, as well as a major difference from r.c.d.’s. However, in the subjective view of probability, non-uniqueness is possibly not so crucial. In a sense, just as the choice of P is subjective, the choice of Q (once P is given) can be seen as a subjective act as well.

Let us turn now to existence issues. Recall that \((\Omega ,{\mathcal {A}})\) is a standard space if \(\Omega \) is a Borel subset of a Polish space and \({\mathcal {A}}={\mathcal {B}}(\Omega )\). For an r.c.d. given \({\mathcal {G}}\) to exist, it suffices that \((\Omega ,{\mathcal {A}})\) is a standard space. Instead, for a \({\mathcal {G}}\)-strategy to exist, one needs conditions on both \((\Omega ,{\mathcal {A}})\) and \({\mathcal {G}}\). The next statement is a translation of some results from Berti and Rigo (1999), Berti and Rigo (2002) concerning existence of disintegrations.

Theorem 5

Let

$$\begin{aligned} G=\bigl \{(x,y)\in \Omega \times \Omega :H(x)=H(y)\bigr \}. \end{aligned}$$

There is a \({\mathcal {G}}\)-strategy provided \((\Omega ,{\mathcal {A}})\) is a standard space and at least one of the following conditions is satisfied:

  1. (i)

    G is a co-analytic subset of \(\Omega \times \Omega \);

  2. (ii)

    G is an analytic subset of \(\Omega \times \Omega \) and all but countably many elements of \(\Pi \) are \(F_\sigma \) or \(G_\delta \).

Proof

In view of (Berti and Rigo 1999, Theorem 2) and (Berti and Rigo 2002, Theorem 8), under (i) or (ii), P admits a \(\sigma \)-additive disintegration on the partition \(\Pi \). This means that, under (i) or (ii), there is a pair \((\alpha ,\beta )\) such that:

  • \(\alpha (\cdot \mid H)\) is a probability measure on \(\sigma ({\mathcal {A}}\cup \Pi )\) such that \(\alpha (H|H)=1\) for each \(H\in \Pi \);

  • \(\beta \) is a probability measure on \(\sigma (\alpha )\), where \(\sigma (\alpha )\) is the \(\sigma \)-field over \(\Pi \) generated by the maps \(H\mapsto \alpha (A|H)\) for all \(A\in {\mathcal {A}}\);

  • \(P(A)=\int _\Pi \alpha (A|H)\,\beta (dH)\) for all \(A\in {\mathcal {A}}\).

Given such \((\alpha ,\beta )\), to obtain a \({\mathcal {G}}\)-strategy, it suffices to let

$$\begin{aligned} Q(\omega ,A)=\alpha \bigl (A\mid H(\omega )\bigr )\quad \text {for all }\omega \in \Omega \text { and }A\in {\mathcal {A}}. \end{aligned}$$

In fact, Q meets (\(a^*\)) and (c) (to check (c), just recall that each member of \({\mathcal {G}}\) is a union of elements of \(\Pi \)). To prove (\(b^*\)), for each \(S\subset \Pi \), denote by \(S^*\) the subset of \(\Omega \) obtained as the union of the elements of S. Then, \(\sigma (Q)=\{S^*:S\in \sigma (\alpha )\}\). Thus, letting \({\widehat{P}}(S^*)=\beta (S)\), one trivially obtains

$$\begin{aligned} P(A)=\int _\Pi \alpha (A|H)\,\beta (dH)=\int _\Omega Q(\omega ,A)\,{\widehat{P}}(\mathrm{d}\omega )\quad \text {for all }A\in {\mathcal {A}}. \end{aligned}$$

\(\square \)

Theorem 5 implies that a \({\mathcal {G}}\)-strategy exists whenever \((\Omega ,{\mathcal {A}})\) is a standard space and G is a Borel subset of \(\Omega \times \Omega \). This happens in several meaningful situations, including the cases where \({\mathcal {G}}\) is a tail or a symmetric sub-\(\sigma \)-field. In these cases, thus, a \({\mathcal {G}}\)-strategy is available while a proper r.c.d. fails to exist in general; see Blackwell and Dubins (1975) and Example 7.

To close this section, it would be nice to exhibit an example where a \({\mathcal {G}}\)-strategy fails to exist. If \(\Pi \subset {\mathcal {G}}\) and \((\Omega ,{\mathcal {A}})\) is a standard space, however, such an example is not available under the usual axioms of set theory (so-called ZFC set theory). Take in fact \(\Omega =[0,1]\), \({\mathcal {A}}={\mathcal {B}}([0,1])\), and consider the assertion:

“For every Borel partition \(\Psi \) of [0, 1], the Lebesgue measure on \({\mathcal {A}}\) admits a strategy given \(\sigma (\Psi )\)”.

Then, as shown by Dubins and Prikry (1995, Theorem 2), such an assertion is undecidable in ZFC, in the sense that the assertion and its negation are both consistent with ZFC.

Incidentally, as regards existence and nonexistence of \({\mathcal {G}}\)-strategies, things are quite different in a finitely additive framework; see, e.g., Dubins (1975) and Prikry and Sudderth (1982).

3 Bayesian statistical inference

Let \(({\mathcal {X}},{\mathcal {E}})\) and \((\Theta ,{\mathcal {F}})\) be measurable spaces to be regarded, respectively, as the sample space and the parameter space. For the sake of simplicity, the \({\mathcal {E}}\)-atoms are assumed to be the singletons. A statistical model is a measurable collection

$$\begin{aligned} {\mathcal {P}}=\{P_\theta :\theta \in \Theta \} \end{aligned}$$

of probability measures on \({\mathcal {E}}\), where measurability means that \(\theta \mapsto P_\theta (A)\) is \({\mathcal {F}}\)-measurable for each \(A\in {\mathcal {E}}\). A prior is a probability measure on \({\mathcal {F}}\).

Roughly speaking, the problem is to make inference on the parameter \(\theta \) given the data x. To this end, in the notation of Sect. 2, one lets

$$\begin{aligned} (\Omega ,{\mathcal {A}})=({\mathcal {X}}\times \Theta ,\,{\mathcal {E}}\otimes {\mathcal {F}}) \end{aligned}$$

and takes \({\mathcal {G}}\) to be the sub-\(\sigma \)-field of \({\mathcal {A}}\) generated by the data, namely

$$\begin{aligned} {\mathcal {G}}=\{A\times \Theta :A\in {\mathcal {E}}\}. \end{aligned}$$

Since the \({\mathcal {E}}\)-atoms are the singletons, the partition of \(\Omega \) in the \({\mathcal {G}}\)-atoms is

$$\begin{aligned} \Pi =\bigl \{\{x\}\times \Theta :x\in {\mathcal {X}}\bigr \}. \end{aligned}$$

Also, given a statistical model \({\mathcal {P}}\) and a prior \(\nu \), the reference probability measure P on \({\mathcal {A}}\) is

$$\begin{aligned} P(C)=\int \int 1_C(x,\theta )\,P_\theta (\mathrm{d}x)\,\nu (\mathrm{d}\theta )\quad \text {for all }C\in {\mathcal {A}}. \end{aligned}$$

In this framework, a posterior is a conditional probability for P given \({\mathcal {G}}\). Thus, technically, how to define a posterior depends on the adopted notion of conditional probability. Let

$$\begin{aligned} {\mathcal {Q}}=\{Q_x:x\in {\mathcal {X}}\} \end{aligned}$$

be a collection of probability measures on \({\mathcal {F}}\), and let \(\sigma ({\mathcal {Q}})\) be the \(\sigma \)-field over \({\mathcal {X}}\) generated by the maps \(x\mapsto Q_x(B)\) for all \(B\in {\mathcal {F}}\).

As noted in Sect. 2, a conditional probability is usually meant as an r.c.d. In that case, \({\mathcal {Q}}\) is a posterior provided

$$\begin{aligned} \sigma ({\mathcal {Q}})\subset {\mathcal {E}}\quad \text {and}\quad P(C)=\int \int 1_C(x,\theta )\,Q_x(\mathrm{d}\theta )\,m_\nu (\mathrm{d}x)\quad \text {for all }C\in {\mathcal {A}} \end{aligned}$$

where

$$\begin{aligned} m_\nu (A)=P(A\times \Theta )=\int P_\theta (A)\,\nu (\mathrm{d}\theta )\quad \text {for all }A\in {\mathcal {E}}. \end{aligned}$$

Instead, if a conditional probability is meant as a strategy, \({\mathcal {Q}}\) is a posterior whenever

$$\begin{aligned} P(C)=\int \int 1_C(x,\theta )\,Q_x(\mathrm{d}\theta )\,m(\mathrm{d}x),\quad C\in {\mathcal {A}}, \end{aligned}$$
(2)

where m is any probability measure on \(\sigma ({\mathcal {Q}})\). Note that Lemma 4 yields \(m=m_\nu \) on \({\mathcal {E}}\), so that m is actually an extension of \(m_\nu \).
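In a discrete setting, equation (2) can be verified by direct computation, with \(m=m_\nu \). A minimal sketch (the model, the prior and all numbers are hypothetical choices):

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical finite sample and parameter spaces, model and prior.
X, Theta = [0, 1, 2], ['a', 'b']
P_model = {'a': {0: F(1,2), 1: F(1,4), 2: F(1,4)},
           'b': {0: F(1,6), 1: F(1,3), 2: F(1,2)}}
nu = {'a': F(2,3), 'b': F(1,3)}

# Marginal m_nu(x) = int P_theta(x) nu(d theta), posterior via Bayes' rule.
m = {x: sum(P_model[t][x] * nu[t] for t in Theta) for x in X}
Q = {x: {t: P_model[t][x] * nu[t] / m[x] for t in Theta} for x in X}

# Equation (2), checked on the singletons C = {(x, theta)}: additivity
# then gives it for every C in the product sigma-field.
for x, t in product(X, Theta):
    assert Q[x][t] * m[x] == P_model[t][x] * nu[t]
print("Q is a posterior; e.g. Q_0 =", Q[0])
```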

Therefore, the class of posteriors becomes larger if conditional probabilities are meant as strategies rather than as r.c.d.’s. Indeed, for \({\mathcal {Q}}\) to be a posterior, equation (2) is enough and no measurability constraints are imposed on \({\mathcal {Q}}\). This fact has some consequences.

In the next result, a posterior is actually meant as a \({\mathcal {G}}\)-strategy, namely a collection \({\mathcal {Q}}=\{Q_x:x\in {\mathcal {X}}\}\) of probability measures on \({\mathcal {F}}\) satisfying equation (2) for some m.

Theorem 6

Let \({\mathcal {P}}\) be a statistical model, \(\nu \) a prior probability on \({\mathcal {F}}\) and \(Y:({\mathcal {X}},{\mathcal {E}})\rightarrow ({\mathcal {Y}},{\mathcal {H}})\) a measurable map. Suppose:

  • card\(\,({\mathcal {E}})\le \,\)card\(\,({\mathbb {R}})\), card\(\,({\mathcal {F}})\le \,\)card\(\,({\mathbb {R}})\), and \({\mathcal {H}}\) is countably generated and includes the singletons;

  • \(P_\theta \) is a perfect probability measure such that \(P_\theta (Y=y)=0\) for all \(\theta \in \Theta \) and \(y\in {\mathcal {Y}}\).

Then, there is a posterior \({\mathcal {Q}}=\{Q_x:x\in {\mathcal {X}}\}\) such that

$$\begin{aligned} x_1,\,x_2\in {\mathcal {X}}\quad \text {and}\quad Y(x_1)=Y(x_2)\quad \Rightarrow \quad Q_{x_1}=Q_{x_2}. \end{aligned}$$

Proof

Two known facts are to be recalled. Let \((D,{\mathcal {D}},\mu )\) be any probability space.

  1. (j)

    If \({\mathcal {D}}\) is countably generated, \(\mu \) is perfect and \(\mu (F)=0\) for each \({\mathcal {D}}\)-atom F, then the collection of \({\mathcal {D}}\)-atoms has the cardinality of the continuum; see (Berti and Rigo 1996, Lemma 2.3);

  2. (jj)

    Let \(\Gamma \) be a class of probability measures on \({\mathcal {D}}\) and \(\Sigma \) the \(\sigma \)-field over \(\Gamma \) generated by the maps \(\gamma \mapsto \gamma (D)\) for all \(D\in {\mathcal {D}}\). Suppose \(\mu (D)=\int _\Gamma \gamma (D)\,\beta (d\gamma )\) for all \(D\in {\mathcal {D}}\), where \(\beta \) is a finitely additive probability on \(\Sigma \). Then, \(\beta \) is \(\sigma \)-additive provided each \(\gamma \in \Gamma \) is 0-1-valued; see Theorem 11 and Example 15 of Berti and Rigo (2018).

Next, recall that \((\Omega ,{\mathcal {A}})=({\mathcal {X}}\times \Theta ,\,{\mathcal {E}}\otimes {\mathcal {F}})\) and define

$$\begin{aligned} {\mathcal {V}}&=\{C\in {\mathcal {A}}:P(C)>0\}\quad \text {and}\\ L(C)&=\bigl \{y\in {\mathcal {Y}}:(x,\theta )\in C\text { for some }(x,\theta )\in \Omega \text { with }Y(x)=y\bigr \}\quad \text {where }C\subset \Omega . \end{aligned}$$

The proof is split into two parts: first, we prove the theorem under the assumption

$$\begin{aligned} \text {card}\,(L(C))\ge \,\text {card}\,({\mathbb {R}})\quad \text {for all }C\in {\mathcal {V}}, \end{aligned}$$
(3)

and then, we show that (3) is actually true.

Suppose condition (3) holds. Then, since card\(\,({\mathcal {A}})\le \,\text {card}\,({\mathbb {R}})\), one obtains

$$\begin{aligned} \text {card}\,(L(C))\ge \,\text {card}\,({\mathcal {A}})\ge \,\text {card}\,({\mathcal {V}})\quad \text {for all }C\in {\mathcal {V}}. \end{aligned}$$

Hence, there is an injective map \(f:{\mathcal {V}}\rightarrow {\mathcal {Y}}\) such that \(f(C)\in L(C)\) for each \(C\in {\mathcal {V}}\); see (Berti and Rigo 1996, Lemma 2.1). For each \(y\in {\mathcal {Y}}\), select a probability measure \(U_y\) on \({\mathcal {F}}\) as follows: If y is not in the range of f, define \(U_y=\delta _{\theta _0}\) where \(\theta _0\in \Theta \) is arbitrary. Otherwise, if \(y=f(C)\) for some (unique) \(C\in {\mathcal {V}}\), take \((x,\theta )\in C\) with \(Y(x)=y\) and set \(U_y=\delta _{\theta }\). For \(x\in {\mathcal {X}}\), define also

$$\begin{aligned} Q_x=U_{Y(x)}\quad \text {and}\quad T_x(C)=Q_x\{\theta \in \Theta :(x,\theta )\in C\}\quad \quad \text {for all }C\in {\mathcal {A}}. \end{aligned}$$

Then, \(T_x\) is a probability measure on \({\mathcal {A}}\) such that \(T_x\bigl (\{x\}\times \Theta \bigr )=1\). Further,

$$\begin{aligned} \text {for each }C\in {\mathcal {V}},\text {there is }x\in {\mathcal {X}}\text { such that }T_x(C)=1. \end{aligned}$$

By (Berti and Rigo 1996, Lemma 2.2) and the above condition, there is a finitely additive probability \(m_0\) on the power set of \({\mathcal {X}}\) such that

$$\begin{aligned} \int \int 1_C(x,\theta )\,Q_x(\mathrm{d}\theta )\,m_0(\mathrm{d}x)=\int T_x(C)\,m_0(\mathrm{d}x)=P(C)\quad \text {for all }C\in {\mathcal {A}}. \end{aligned}$$

Let \({\mathcal {Q}}=\{Q_x:x\in {\mathcal {X}}\}\) and let m be the restriction of \(m_0\) on \(\sigma ({\mathcal {Q}})\). Then, m is \(\sigma \)-additive because of (jj). Therefore, \({\mathcal {Q}}\) is a posterior such that \(Q_{x_1}=Q_{x_2}\) whenever \(Y(x_1)=Y(x_2)\). This concludes the first part of the proof.

Finally, we prove (3). It suffices to show that \(P(C)=0\) whenever \(C\in {\mathcal {A}}\) and \(\text {card}\,(L(C))<\,\text {card}\,({\mathbb {R}})\). Fix one such C and take \(A\in {\mathcal {E}}\) with

$$\begin{aligned} A\subset \bigcup _{y\in L(C)}\bigl \{x\in {\mathcal {X}}:Y(x)=y\bigr \}. \end{aligned}$$

Let \({\mathcal {D}}=A\cap \sigma (Y)=\{A\cap B:B\in \sigma (Y)\}\). Since \({\mathcal {H}}\) is countably generated, \(\sigma (Y)\) is countably generated, which in turn implies that \({\mathcal {D}}\) is countably generated. Toward a contradiction, suppose \(P_\theta (A)>0\) for some \(\theta \in \Theta \). Then, one can define

$$\begin{aligned} \mu (A\cap B)=\frac{P_\theta (A\cap B)}{P_\theta (A)}\quad \text {for all }B\in \sigma (Y). \end{aligned}$$

Since \(P_\theta \) is perfect, \(\mu \) is a perfect probability measure on \({\mathcal {D}}\). Each atom F of \({\mathcal {D}}\) is of the form \(F=A\cap \{Y=y\}\) for some y, and

$$\begin{aligned} \mu (F)=\frac{P_\theta (A\cap \{Y=y\})}{P_\theta (A)}=0. \end{aligned}$$

In view of (j), the set of \({\mathcal {D}}\)-atoms has the cardinality of the continuum, so that

$$\begin{aligned} \text {card}\,(L(C))\ge \,\text {card}\,\{y\in {\mathcal {Y}}:\{Y=y\}\cap A\ne \emptyset \} =\text {card}\,\{{\mathcal {D}}\text {-atoms}\}=\text {card}\,({\mathbb {R}}). \end{aligned}$$

This is a contradiction, since \(\text {card}\,(L(C))<\,\text {card}\,({\mathbb {R}})\). Hence, \(P_\theta (A)=0\) for all \(\theta \). To conclude the proof, just note that

$$\begin{aligned} \{x:(x,\theta )\in C\}\subset \bigcup _{y\in L(C)}\bigl \{x\in {\mathcal {X}}:Y(x)=y\bigr \}\quad \text {for all }\theta . \end{aligned}$$

It follows that

$$\begin{aligned} P(C)=\int P_\theta \{x:(x,\theta )\in C\}\,\nu (\mathrm{d}\theta )=0. \end{aligned}$$

Hence, (3) holds and this concludes the proof. \(\square \)

Theorem 6 improves (Berti and Rigo 1996, Theorem 3.1) where the probability m involved in equation (2) is only finitely additive.

In the subjective framework, Theorem 6 has a nice interpretation in terms of sufficiency; see Berti and Rigo (1996). In fact, think of Y as a statistic. Also, given a posterior \({\mathcal {Q}}\), say that Y is sufficient for \({\mathcal {Q}}\) if \(Q_{x_1}=Q_{x_2}\) whenever \(Y(x_1)=Y(x_2)\). Then, Theorem 6 essentially states that, for any prior \(\nu \) and any statistic Y, there is a posterior \({\mathcal {Q}}\) which makes Y sufficient, provided only that \(P_\theta (Y=y)=0\) for all \(\theta \) and y. This seems in line with both the substantial meaning of sufficiency and the subjective view of probability. Indeed, the assessment of \({\mathcal {Q}}\) can be split into two steps: First, the inferrer selects a partition of \({\mathcal {X}}\), by grouping those samples which, in his/her view, have the same inferential content. This step precisely amounts to the choice of a sufficient statistic Y. Subsequently, a probability law on \({\mathcal {F}}\) is attached to every element of the partition. If no such element has positive probability under the statistical model, Theorem 6 implies that at least one posterior \({\mathcal {Q}}\) is consistent with this procedure.
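The discrete case below illustrates the familiar mechanism behind this notion of sufficiency (it is not Theorem 6 itself, whose point is precisely the case \(P_\theta (Y=y)=0\), where a posterior making Y sufficient must be built as a strategy). A sketch with a hypothetical Bernoulli model and a three-point prior, where \(Y(x)=\sum _ix_i\):

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical Bernoulli(theta) model for n = 3 binary observations,
# with a three-point prior nu on theta.
thetas = [F(1,4), F(1,2), F(3,4)]
nu = {t: F(1,3) for t in thetas}
n = 3

def posterior(x):
    """Posterior of theta given the sample x, via Bayes' rule."""
    w = {t: t**sum(x) * (1 - t)**(n - sum(x)) * nu[t] for t in thetas}
    z = sum(w.values())
    return {t: w[t] / z for t in thetas}

# Y(x) = sum(x) is sufficient: samples with the same sum share one posterior.
for x1, x2 in product(product([0, 1], repeat=n), repeat=2):
    if sum(x1) == sum(x2):
        assert posterior(x1) == posterior(x2)
print("the posterior depends on x only through Y(x) = sum(x)")
```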

In addition to sufficiency, another intriguing point is whether improper priors can be recovered when posteriors are regarded as strategies. This issue is actually connected to compatibility. Thus, improper priors are postponed to Example 9.

4 Further consequences

In principle, in every framework where conditional probability plays a role, things are quite different according to whether conditional probability is meant as an r.c.d. or as a strategy. In Bayesian statistics, for instance, Theorem 6 would not be available if a posterior were regarded as an r.c.d. This section is in the spirit of the previous one; namely, the different behaviors of r.c.d.’s and strategies are compared in some special situations. Needless to say, many other analogous examples could be given.

Example 7

(Exchangeability) Let \((\Omega ,{\mathcal {A}})=({\mathcal {X}}^\infty ,{\mathcal {E}}^\infty )\) where \(({\mathcal {X}},{\mathcal {E}})\) is a standard space. To each \(n\in {\mathbb {N}}\) and each permutation \((\pi _1,\ldots ,\pi _n)\) of \((1,\ldots ,n)\), we can associate a function \(f:\Omega \rightarrow \Omega \) defined by

$$\begin{aligned} f(x_1,x_2,\ldots )=(x_{\pi _1},\ldots ,x_{\pi _n},x_{n+1},x_{n+2},\ldots )\quad \quad \text {for all }(x_1,x_2,\ldots )\in \Omega . \end{aligned}$$

Let F denote the class of all such functions, for all \(n\in {\mathbb {N}}\) and all permutations of \((1,\ldots ,n)\). A probability measure P on \({\mathcal {A}}\) is exchangeable if \(P\circ f^{-1}=P\) for all \(f\in F\). The symmetric sub-\(\sigma \)-field is

$$\begin{aligned} {\mathcal {G}}=\{A\in {\mathcal {A}}:f^{-1}(A)=A\text { for all }f\in F\}. \end{aligned}$$

Note that the \({\mathcal {G}}\)-atoms can be written as

$$\begin{aligned} H(\omega )=\{f(\omega ):f\in F\}\quad \quad \text {for all }\omega \in \Omega . \end{aligned}$$

Suppose P is exchangeable and \(P(\Delta )=0\), where \(\Delta =\{(x,x,\ldots ):x\in {\mathcal {X}}\}\) is the diagonal. Since \(({\mathcal {X}},{\mathcal {E}})\) is a standard space, there is an r.c.d. \(Q^*\) for P given \({\mathcal {G}}\). Since P is exchangeable, by de Finetti’s theorem, \(Q^*(\omega )\) is an i.i.d. probability measure on \({\mathcal {A}}={\mathcal {E}}^\infty \) for almost all \(\omega \in \Omega \). Now, an i.i.d. probability measure vanishes on singletons unless it is degenerate. Since \(P(\Delta )=0\) and \(H(\omega )\) is countable, it follows that

$$\begin{aligned} Q^*(\omega ,\,H(\omega ))=0\quad \quad \text {for almost all }\omega \in \Omega . \end{aligned}$$

On the other hand, because of Theorem 5, P also admits a \({\mathcal {G}}\)-strategy Q. By definition, Q satisfies

$$\begin{aligned} Q\left( \omega ,\,H(\omega )\right) =1\quad \quad \text {for all }\omega \in \Omega . \end{aligned}$$

Therefore, \(Q(\omega )\) and \(Q^*(\omega )\) are actually mutually singular for almost all \(\omega \in \Omega \). Another curious fact is that \(Q^*(\omega )\) can be shown to be \(\{0,1\}\)-valued on \({\mathcal {G}}\), despite \(Q^*\bigl (\omega ,\,H(\omega )\bigr )=0\), for almost all \(\omega \in \Omega \); see Berti and Rigo (2008).
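A concrete exchangeable law is provided by the Pólya urn, which by de Finetti’s theorem is a mixture of i.i.d. Bernoulli laws. The sketch below (the urn composition and the sequence length are arbitrary choices, not taken from the text) checks permutation invariance of the first four coordinates:

```python
from fractions import Fraction as F
from itertools import permutations, product

def polya_prob(seq, red=1, black=1):
    """P(X_1 = s_1, ..., X_k = s_k) under a Polya urn started with
    `red` red and `black` black balls (1 = red draw, 0 = black draw)."""
    p = F(1)
    for s in seq:
        p *= F(red if s == 1 else black, red + black)
        if s == 1:
            red += 1
        else:
            black += 1
    return p

# Exchangeability: the law of (X_1, ..., X_4) is permutation invariant.
for seq in product([0, 1], repeat=4):
    assert all(polya_prob(pi) == polya_prob(seq) for pi in permutations(seq))
print("P(1,1,0,0) =", polya_prob((1, 1, 0, 0)))  # 1/30
```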

Example 8

(Compatibility) Let \((\Omega ,{\mathcal {A}})\) be a measurable space, \({\mathcal {G}}_i\subset {\mathcal {A}}\) a sub-\(\sigma \)-field and \(Q_i=\{Q_i(\omega ):\omega \in \Omega \}\) a collection of probability measures on \({\mathcal {A}}\), where \(i=1,2\). Generally speaking, \(Q_1\) and \(Q_2\) are compatible if there is a probability measure P on \({\mathcal {A}}\) which admits \(Q_1\) and \(Q_2\) as conditional probabilities given \({\mathcal {G}}_1\) and \({\mathcal {G}}_2\), respectively; see, e.g., Berti et al. (2014) and references therein. Once again, this general idea can be realized differently according to the selected notion of conditional probability.

If conditional probabilities are meant as r.c.d.’s, a necessary condition for compatibility is \(\sigma (Q_1)\subset {\mathcal {G}}_1\) and \(\sigma (Q_2)\subset {\mathcal {G}}_2\). In that case, \(Q_1\) and \(Q_2\) are compatible if there is a probability measure P on \({\mathcal {A}}\) such that

$$\begin{aligned} P(A\cap B)=\int _B Q_i(\omega ,A)\,P(\mathrm{d}\omega )\quad \quad \text {whenever }i=1,2,\,\,B\in {\mathcal {G}}_i\text { and }A\in {\mathcal {A}}. \end{aligned}$$
(4)

If conditional probabilities are meant as strategies, the necessary condition for compatibility turns into

$$\begin{aligned} Q_i(\omega )=Q_i(\upsilon )\text { if }H_i(\omega )=H_i(\upsilon )\quad \text {and}\quad Q_i(\omega )=\delta _\omega \text { on }{\mathcal {G}}_i \end{aligned}$$

for \(i=1,2\) and all \(\omega ,\,\upsilon \in \Omega \), where \(H_i(\omega )\) is the \({\mathcal {G}}_i\)-atom including \(\omega \). Under such condition, \(Q_1\) and \(Q_2\) are compatible whenever

$$\begin{aligned} \int Q_1(\omega ,A)\,{\widehat{P}}_1(\mathrm{d}\omega )=\int Q_2(\omega ,A)\,{\widehat{P}}_2(\mathrm{d}\omega ),\quad \quad A\in {\mathcal {A}}, \end{aligned}$$
(5)

for some probability measures \({\widehat{P}}_1\) on \(\sigma (Q_1)\) and \({\widehat{P}}_2\) on \(\sigma (Q_2)\).

Condition (5) looks intriguing and possibly easier than (4) to work with.

A weaker version of (5) is obtained allowing \({\widehat{P}}_1\) and \({\widehat{P}}_2\) to be finitely additive probabilities. In that case, compatibility essentially reduces to a notion of consistency, introduced in Lane and Sudderth (1983) for Bayesian statistical inference; see also Heath and Sudderth (1978).
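In the finite two-variable special case (conditioning on each coordinate of a pair), condition (5) reduces to the ordinary compatibility of two conditional distributions and can be checked by direct computation. A minimal sketch with a hypothetical joint law on \(\{0,1\}^2\):

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical joint law P on {0,1}^2 and its coordinate marginals.
P = {(0, 0): F(1,8), (0, 1): F(3,8), (1, 0): F(1,4), (1, 1): F(1,4)}
Px = {x: P[x, 0] + P[x, 1] for x in [0, 1]}
Py = {y: P[0, y] + P[1, y] for y in [0, 1]}

# The two conditionals: Q1 = law of Y given X, Q2 = law of X given Y.
Q1 = {x: {y: P[x, y] / Px[x] for y in [0, 1]} for x in [0, 1]}
Q2 = {y: {x: P[x, y] / Py[y] for x in [0, 1]} for y in [0, 1]}

# Compatibility in the spirit of (5): mixing Q1 against Px and Q2 against
# Py yields one and the same joint law, namely P itself.
for x, y in product([0, 1], repeat=2):
    assert Q1[x][y] * Px[x] == Q2[y][x] * Py[y] == P[x, y]
print("Q1 and Q2 are compatible, with common joint law P")
```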

Example 9

(Improper priors) We adopt the notation and the assumptions of Sect. 3. In addition, we assume that the model \({\mathcal {P}}=\{P_\theta :\theta \in \Theta \}\) is dominated, namely \(P_\theta (\mathrm{d}x)=f(x,\theta )\,\lambda (\mathrm{d}x)\) for all \(\theta \in \Theta \), where \(\lambda \) is a \(\sigma \)-finite measure on \({\mathcal {E}}\) and f is a nonnegative measurable function on \((\Omega ,{\mathcal {A}})=({\mathcal {X}}\times \Theta ,{\mathcal {E}}\otimes {\mathcal {F}})\). An improper prior is a \(\sigma \)-finite measure \(\gamma \) on \({\mathcal {F}}\) such that \(\gamma (\Theta )=\infty \). Let

$$\begin{aligned} \psi (x)=\int f(x,\theta )\,\gamma (\mathrm{d}\theta )\quad \quad \text {for all }x\in {\mathcal {X}}. \end{aligned}$$

A standard practice is to fix an improper prior \(\gamma \) and to let

$$\begin{aligned} Q_x(\mathrm{d}\theta )=\frac{f(x,\theta )}{\psi (x)}\,\gamma (\mathrm{d}\theta )\quad \quad \text {whenever }\psi (x)\in (0,\infty ). \end{aligned}$$
(6)

Notice that no prior probability on \({\mathcal {F}}\) has been selected. In the sequel, we assume \(\psi (x)\in (0,\infty )\) for all \(x\in {\mathcal {X}}\), and we let \({\mathcal {Q}}=\{Q_x:x\in {\mathcal {X}}\}\) with \(Q_x\) given by (6).
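For a concrete instance of (6), consider the normal location model \(f(x,\theta )=\phi (x-\theta )\), with \(\phi \) the standard normal density and \(\gamma \) equal to Lebesgue measure on \({\mathbb {R}}\) (the usual flat improper prior). Then \(\psi \equiv 1\) and \(Q_x\) is the N(x, 1) law. The numerical sketch below (the grid, the sample point and the tolerances are arbitrary choices) checks this by quadrature:

```python
import numpy as np

def phi(z):
    """Standard normal density."""
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

theta = np.linspace(-30.0, 30.0, 200_001)   # quadrature grid for Theta = R
d = theta[1] - theta[0]

x = 1.7                                     # an arbitrary observed sample
psi = np.sum(phi(x - theta)) * d            # psi(x) = int f(x,t) gamma(dt)
post = phi(x - theta) / psi                 # density of Q_x, as in (6)

mean = np.sum(theta * post) * d
var = np.sum((theta - mean)**2 * post) * d
print(psi, mean, var)                       # psi ~ 1, Q_x ~ N(x, 1)
```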

Define

$$\begin{aligned} {\mathcal {G}}_1=\bigl \{A\times \Theta :A\in {\mathcal {E}}\bigr \}\quad \text {and}\quad {\mathcal {G}}_2=\bigl \{{\mathcal {X}}\times B:B\in {\mathcal {F}}\bigr \}. \end{aligned}$$

For \(C\in {\mathcal {A}}\) and \(\omega =(x,\theta )\in \Omega \), define also

$$\begin{aligned} T_1(\omega )(C)=Q_x\{t\in \Theta :(x,t)\in C\}\quad \text {and}\quad T_2(\omega )(C)=P_\theta \{z\in {\mathcal {X}}:(z,\theta )\in C\}. \end{aligned}$$

Then, \({\mathcal {G}}_1\) and \({\mathcal {G}}_2\) are sub-\(\sigma \)-fields of \({\mathcal {A}}\) while \(T_1(\omega )\) and \(T_2(\omega )\) are probability measures on \({\mathcal {A}}\). We say that \({\mathcal {P}}\) and \({\mathcal {Q}}\) are compatible to mean that \({\mathcal {T}}_1\) and \({\mathcal {T}}_2\) are compatible, where \({\mathcal {T}}_i=\{T_i(\omega ):\omega \in \Omega \}\) for \(i=1,2\).

From the point of view of probability theory, using \({\mathcal {Q}}\) as a posterior makes sense only if \({\mathcal {P}}\) and \({\mathcal {Q}}\) are compatible; see also (Berti et al. 2014, Example 3). However, measurability of f implies \(\sigma ({\mathcal {T}}_i)\subset {\mathcal {G}}_i\) for \(i=1,2\). Therefore, as regards compatibility of \({\mathcal {P}}\) and \({\mathcal {Q}}\), using r.c.d.’s or strategies is equivalent. The situation is slightly different if the assumption \(0<\psi <\infty \) is dropped.

On the other hand, in most real problems, \({\mathcal {P}}\) and \({\mathcal {Q}}\) fail to be compatible. To obtain compatibility, and thus to make improper priors admissible, one must resort to finitely additive probabilities; see Heath and Sudderth (1978), Heath and Sudderth (1989) and Lane and Sudderth (1983).