Probability theory, grounded in Kolmogorov’s axioms and the general foundations of measure theory, is an essential tool in the quantitative mathematical treatment of uncertainty. Of course, probability is not the only framework for the discussion of uncertainty: there is also the paradigm of interval analysis, and intermediate paradigms such as Dempster–Shafer theory, as discussed in Section 2.8 and Chapter 5.

This chapter serves as a review, without detailed proof, of concepts from measure and probability theory that will be used in the rest of the text. Like Chapter 3, this chapter is intended as a review of material that should be understood as a prerequisite before proceeding; to an extent, Chapters 2 and 3 are interdependent and so can (and should) be read in parallel with one another.

2.1 Measure and Probability Spaces

The basic objects of measure and probability theory are sample spaces, which are abstract sets; we distinguish certain subsets of these sample spaces as being ‘measurable’, and assign to each of them a numerical notion of ‘size’. In probability theory, this size will always be a real number between 0 and 1, but more general values are possible, and indeed useful.

Definition 2.1.

A measurable space is a pair \((\mathcal{X},\mathcal{F})\), where

  1. (a)

    \(\mathcal{X}\) is a set, called the sample space; and

  2. (b)

    \(\mathcal{F}\) is a \(\sigma\) -algebra on \(\mathcal{X}\), i.e. a collection of subsets of \(\mathcal{X}\) containing \(\varnothing \) and closed under countable applications of the operations of union, intersection and complementation relative to \(\mathcal{X}\); elements of \(\mathcal{F}\) are called measurable sets or events.

Example 2.2.

  1. (a)

    On any set \(\mathcal{X}\), there is a trivial \(\sigma\) -algebra in which the only measurable sets are the empty set \(\varnothing \) and the whole space \(\mathcal{X}\).

  2. (b)

    On any set \(\mathcal{X}\), there is also the power set \(\sigma\) -algebra in which every subset of \(\mathcal{X}\) is measurable. It is a fact of life that this \(\sigma\)-algebra contains too many measurable sets to be useful for most applications in analysis and probability.

  3. (c)

    When \(\mathcal{X}\) is a topological — or, better yet, metric or normed — space, it is common to take \(\mathcal{F}\) to be the Borel \(\sigma\) -algebra \(\mathcal{B}(\mathcal{X})\), the smallest \(\sigma\)-algebra on \(\mathcal{X}\) so that every open set (and hence also every closed set) is measurable.

Definition 2.3.

  1. (a)

    A signed measure (or charge) on a measurable space \((\mathcal{X},\mathcal{F})\) is a function \(\mu: \mathcal{F}\rightarrow \mathbb{R} \cup \{\pm \infty \}\) that takes at most one of the two infinite values, has \(\mu (\varnothing ) = 0\), and, whenever \(E_{1},E_{2},\ldots \in \mathcal{F}\) are pairwise disjoint with union \(E \in \mathcal{F}\), then \(\mu (E) =\sum _{n\in \mathbb{N}}\mu (E_{n})\). In the case that μ(E) is finite, we require that the series \(\sum _{n\in \mathbb{N}}\mu (E_{n})\) converges absolutely to μ(E).

  2. (b)

    A measure is a signed measure that does not take negative values.

  3. (c)

    A probability measure is a measure such that \(\mu (\mathcal{X}) = 1\).

The triple \((\mathcal{X},\mathcal{F},\mu )\) is called a signed measure space, measure space, or probability space as appropriate. The sets of all signed measures, measures, and probability measures on \((\mathcal{X},\mathcal{F})\) are denoted \(\mathcal{M}_{\pm }(\mathcal{X},\mathcal{F})\), \(\mathcal{M}_{+}(\mathcal{X},\mathcal{F})\), and \(\mathcal{M}_{1}(\mathcal{X},\mathcal{F})\) respectively.

Example 2.4.

  1. (a)

    The trivial measure can be defined on any set \(\mathcal{X}\) and \(\sigma\)-algebra: \(\tau (E):= 0\) for every \(E \in \mathcal{F}\).

  2. (b)

    The unit Dirac measure at \(a \in \mathcal{X}\) can also be defined on any set \(\mathcal{X}\) and \(\sigma\)-algebra:

    $$\displaystyle{\delta _{a}(E):= \left \{\begin{array}{@{}l@{\quad }l@{}} 1,\quad &\mbox{ if $a \in E$, $E \in \mathcal{F}$,}\\ 0,\quad &\mbox{ if $a\notin E$, $E \in \mathcal{F}$.}\end{array} \right.}$$
  3. (c)

    Similarly, we can define counting measure:

    $$\displaystyle{\kappa (E):= \left \{\begin{array}{@{}l@{\quad }l@{}} n, \quad &\mbox{ if $E \in \mathcal{F}$ is a finite set with exactly $n$ elements,}\\ +\infty,\quad &\mbox{ if $E \in \mathcal{F}$ is an infinite set.}\end{array} \right.}$$
  4. (d)

    Lebesgue measure on \(\mathbb{R}^{n}\) is the unique measure on \(\mathbb{R}^{n}\) (equipped with its Borel \(\sigma\)-algebra \(\mathcal{B}(\mathbb{R}^{n})\), generated by the Euclidean open balls) that assigns to every rectangle its n-dimensional volume in the ordinary sense. To be more precise, Lebesgue measure is actually defined on the completion \(\mathcal{B}_{0}(\mathbb{R}^{n})\) of \(\mathcal{B}(\mathbb{R}^{n})\), which is a larger \(\sigma\)-algebra than \(\mathcal{B}(\mathbb{R}^{n})\). The rigorous construction of Lebesgue measure is a non-trivial undertaking.

  5. (e)

    Signed measures/charges arise naturally in the modelling of distributions with positive and negative values, e.g. μ(E) = the net electrical charge within some measurable region \(E \subseteq \mathbb{R}^{3}\). They also arise naturally as differences of non-negative measures: see Theorem 2.24 later on.
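
To make these definitions concrete, here is a minimal computational sketch (not taken from the text; plain Python with illustrative values) of the trivial, Dirac, and counting measures of Example 2.4, represented as functions acting on finite sets:

```python
# Illustrative sketch (not from the text): the measures of Example 2.4,
# represented as Python functions acting on finite sets of sample points.

def trivial_measure(E):
    """tau(E) := 0 for every measurable set E."""
    return 0.0

def dirac_measure(a):
    """Return the unit Dirac measure delta_a as a function of sets."""
    return lambda E: 1.0 if a in E else 0.0

def counting_measure(E):
    """kappa(E) := number of elements of E (finite sets only, here)."""
    return float(len(E))

delta_2 = dirac_measure(2)
print(trivial_measure({1, 2, 3}))   # 0.0
print(delta_2({1, 2, 3}))           # 1.0
print(delta_2({4, 5}))              # 0.0
print(counting_measure({1, 2, 3}))  # 3.0
```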

Remark 2.5.

Probability theorists usually denote the sample space of a probability space by Ω; PDE theorists often use the same letter to denote a domain in \(\mathbb{R}^{n}\) on which a partial differential equation is to be solved. In UQ, where the worlds of probability and PDE theory often collide, the possibility of confusion is clear. Therefore, this book will tend to use \(\varTheta\) for a probability space and \(\mathcal{X}\) for a more general measurable space, which may happen to be the spatial domain for some PDE.

Definition 2.6.

Let \((\mathcal{X},\mathcal{F},\mu )\) be a measure space.

  1. (a)

    If \(N \subseteq \mathcal{X}\) is a subset of a measurable set \(E \in \mathcal{F}\) such that μ(E) = 0, then N is called a μ-null set.

  2. (b)

    If the set of \(x \in \mathcal{X}\) for which some property P(x) does not hold is μ-null, then P is said to hold μ-almost everywhere (or, when μ is a probability measure, μ-almost surely).

  3. (c)

    If every μ-null set is in fact an \(\mathcal{F}\)-measurable set, then the measure space \((\mathcal{X},\mathcal{F},\mu )\) is said to be complete.

Example 2.7.

Let \((\mathcal{X},\mathcal{F},\mu )\) be a measure space, and let \(f: \mathcal{X} \rightarrow \mathbb{R}\) be some function. If f(x) ≥ t for μ-almost every \(x \in \mathcal{X}\), then t is an essential lower bound for f; the greatest such t is called the essential infimum of f:

$$\displaystyle{\mathop{\mathrm{ess\,inf}}f:=\sup \left \{t \in \mathbb{R}\,\vert \,\mbox{ $f \geq t$ $\mu$-almost everywhere}\right \}.}$$

Similarly, if f(x) ≤ t for μ-almost every \(x \in \mathcal{X}\), then t is an essential upper bound for f; the least such t is called the essential supremum of f:

$$\displaystyle{\mathop{\mathrm{ess\,sup}}f:=\inf \left \{t \in \mathbb{R}\,\vert \,\mbox{ $f \leq t$ $\mu$-almost everywhere}\right \}.}$$

It is so common in measure and probability theory to need to refer to the set of all points \(x \in \mathcal{X}\) such that some property P(x) holds true that an abbreviated notation has been adopted: simply [P]. Thus, for example, if \(f: \mathcal{X} \rightarrow \mathbb{R}\) is some function, then

$$\displaystyle{[f \leq t]:=\{ x \in \mathcal{X}\mid f(x) \leq t\}.}$$

As noted above, when the sample space is a topological space, it is usual to use the Borel \(\sigma\)-algebra (i.e. the smallest \(\sigma\)-algebra that contains all the open sets); measures on the Borel \(\sigma\)-algebra are called Borel measures. Unless noted otherwise, this is the convention followed here.

Definition 2.8.

The support of a measure μ defined on a topological space \(\mathcal{X}\) is

$$\displaystyle{\mathop{\mathrm{supp}}\nolimits (\mu ):=\bigcap \{ F \subseteq \mathcal{X}\mid \mbox{ $F$ is closed and $\mu (\mathcal{X}\setminus F) = 0$}\}.}$$

That is, \(\mathop{\mathrm{supp}}\nolimits (\mu )\) is the smallest closed subset of \(\mathcal{X}\) that has full μ-measure. Equivalently, \(\mathop{\mathrm{supp}}\nolimits (\mu )\) is the complement of the union of all open sets of μ-measure zero, or the set of all points \(x \in \mathcal{X}\) for which every neighbourhood of x has strictly positive μ-measure.

Especially in Chapter 14, we shall need to consider the set of all probability measures defined on a measurable space. \(\mathcal{M}_{1}(\mathcal{X})\) is often called the probability simplex on \(\mathcal{X}\). The motivation for this terminology comes from the case in which \(\mathcal{X} =\{ 1,\ldots,n\}\) is a finite set equipped with the power set \(\sigma\)-algebra, which is the same as the Borel \(\sigma\)-algebra for the discrete topology on \(\mathcal{X}\). In this case, functions \(f: \mathcal{X} \rightarrow \mathbb{R}\) are in bijection with column vectors

$$\displaystyle{\left [\begin{array}{*{10}c} f(1)\\ \vdots\\ f(n) \end{array} \right ]}$$

and probability measures μ on the power set of \(\mathcal{X}\) are in bijection with the (n − 1)-dimensional set of row vectors

$$\displaystyle{\left [\begin{array}{*{10}c} \mu (\{1\})&\cdots &\mu (\{n\}) \end{array} \right ]}$$

such that μ({i}) ≥ 0 for all \(i \in \{ 1,\ldots,n\}\) and \(\sum _{i=1}^{n}\mu (\{i\}) = 1\). As illustrated in Figure 2.1, the set of such μ is the (n − 1)-dimensional simplex in \(\mathbb{R}^{n}\) that is the convex hull of the n points \(\delta _{1},\ldots,\delta _{n}\),

$$\displaystyle{\delta _{i} = \left [\begin{array}{*{10}c} 0&\cdots &0&1&0&\cdots &0 \end{array} \right ],}$$

with 1 in the ith column. Looking ahead, the expected value of f under μ (to be defined properly in Section 2.3) is exactly the matrix product:

$$\displaystyle{\mathbb{E}_{\mu }[f] =\sum _{ i=1}^{n}\mu (\{i\})f(i) =\langle \mu \mathop{ \vert }f\rangle = \left [\begin{array}{*{10}c} \mu (\{1\})&\cdots &\mu (\{n\}) \end{array} \right ]\left [\begin{array}{*{10}c} f(1)\\ \vdots \\ f(n) \end{array} \right ].}$$
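
As a quick numerical illustration of this identity (a minimal sketch assuming NumPy; the particular weights and function values are illustrative, not canonical):

```python
# A small numerical check of E_mu[f] as a row-vector / column-vector product
# on the finite sample space {1, ..., n}; the numbers below are illustrative.
import numpy as np

mu = np.array([0.2, 0.5, 0.3])   # a probability measure: mu({i}) >= 0, sums to 1
f = np.array([1.0, -2.0, 4.0])   # a function f: {1, 2, 3} -> R as a column vector

expectation = mu @ f             # <mu | f> = sum_i mu({i}) f(i)
print(expectation)               # 0.2*1 - 0.5*2 + 0.3*4 = 0.4
```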

It is useful to keep in mind this geometric picture of \(\mathcal{M}_{1}(\mathcal{X})\) in addition to the algebraic and analytical properties of any given \(\mu \in \mathcal{M}_{1}(\mathcal{X})\).

Fig. 2.1 The probability simplex \(\mathcal{M}_{1}(\{1,2,3\})\), drawn as the triangle spanned by the unit Dirac masses \(\delta_{i}\), \(i \in \{1, 2, 3\}\), in the vector space of signed measures on \(\{1, 2, 3\}\).

As poetically highlighted by Sir Michael Atiyah (2004, Paper 160, p. 7):

“Algebra is the offer made by the devil to the mathematician. The devil says: ‘I will give you this powerful machine, it will answer any question you like. All you need to do is give me your soul: give up geometry and you will have this marvellous machine.’ ”

Or, as is traditionally but perhaps apocryphally said to have been inscribed over the entrance to Plato’s Academy:

\(\mathrm{A}\Gamma \mathrm{E}\Omega \mathrm{METPHTO}\Sigma \) \(\mathrm{MH}\Delta \mathrm{EI}\Sigma \) \(\mathrm{EI}\Sigma \mathrm{IT}\Omega \)

In a sense that will be made precise in Chapter 14, for any ‘nice’ space \(\mathcal{X}\), \(\mathcal{M}_{1}(\mathcal{X})\) is the simplex spanned by the collection of unit Dirac measures \(\{\delta _{x}\mid x \in \mathcal{X}\}\). Given a bounded, measurable function \(f: \mathcal{X} \rightarrow \mathbb{R}\) and \(c \in \mathbb{R}\),

$$\displaystyle{\{\mu \in \mathcal{M}(\mathcal{X})\mid \mathbb{E}_{\mu }[f] \leq c\}}$$

is a half-space of \(\mathcal{M}(\mathcal{X})\), and so a set of the form

$$\displaystyle{\{\mu \in \mathcal{M}_{1}(\mathcal{X})\mid \mathbb{E}_{\mu }[f_{1}] \leq c_{1},\ldots, \mathbb{E}_{\mu }[f_{m}] \leq c_{m}\}}$$

can be thought of as a polytope of probability measures.

One operation on probability measures that must frequently be performed in UQ applications is conditioning, i.e. forming a new probability measure \(\mu (\cdot \vert B)\) out of an old one μ by restricting attention to subsets of a measurable set B. Conditioning is the operation of supposing that B has happened, and examining the consequently updated probabilities for other measurable events.

Definition 2.9.

If \((\varTheta,\mathcal{F},\mu )\) is a probability space and \(B \in \mathcal{F}\) has μ(B) > 0, then the conditional probability measure \(\mu (\cdot \vert B)\) on \((\varTheta,\mathcal{F})\) is defined by

$$\displaystyle{\mu (E\vert B):= \frac{\mu (E \cap B)} {\mu (B)} \quad \mbox{ for $E \in \mathcal{F}$.}}$$

The following theorem on conditional probabilities is fundamental to subjective (Bayesian) probability and statistics (q.v. Section 2.8):

Theorem 2.10 (Bayes’ rule).

If \((\varTheta,\mathcal{F},\mu )\) is a probability space and \(A,B \in \mathcal{F}\) have μ(A),μ(B) > 0, then

$$\displaystyle{\mu (A\vert B) = \frac{\mu (B\vert A)\mu (A)} {\mu (B)}.}$$

Both the definition of conditional probability and Bayes’ rule can be extended to much more general contexts (including cases in which μ(B) = 0) using advanced tools such as regular conditional probabilities and the disintegration theorem. In Bayesian settings, μ(A) represents the ‘prior’ probability of some event A, and μ(A | B) its ‘posterior’ probability, having observed some additional data B.
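
The following minimal sketch (plain Python; the sample space and events are illustrative) checks Definition 2.9 and Bayes’ rule on a small finite probability space:

```python
# Hedged sketch (not from the text): conditional probability (Definition 2.9)
# and Bayes' rule (Theorem 2.10) on the finite sample space {0, ..., 5} with
# the uniform probability measure; the events A and B are illustrative.
mu = {i: 1.0 / 6.0 for i in range(6)}        # uniform probability measure

A = {0, 1, 2}                                # an event
B = {0, 2, 4}                                # the conditioning event, mu(B) > 0

def prob(E):
    return sum(mu[i] for i in E)

def cond(E, B):
    """mu(E | B) := mu(E & B) / mu(B), requires mu(B) > 0."""
    return prob(E & B) / prob(B)

lhs = cond(A, B)                             # mu(A | B)
rhs = cond(B, A) * prob(A) / prob(B)         # Bayes' rule: mu(B | A) mu(A) / mu(B)
print(lhs, rhs)                              # both 2/3
```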

2.2 Random Variables and Stochastic Processes

Definition 2.11.

Let \((\mathcal{X},\mathcal{F})\) and \((\mathcal{Y},\mathcal{G})\) be measurable spaces. A function \(f: \mathcal{X} \rightarrow \mathcal{Y}\) generates a \(\sigma\)-algebra on \(\mathcal{X}\) by

$$\displaystyle{\sigma (f):=\sigma {\bigl (\{ [f \in E]\mid E \in \mathcal{G}\}\bigr )},}$$

and f is called a measurable function if \(\sigma (f) \subseteq \mathcal{F}\). That is, f is measurable if the pre-image f −1(E) of every \(\mathcal{G}\)-measurable subset E of \(\mathcal{Y}\) is an \(\mathcal{F}\)-measurable subset of \(\mathcal{X}\). A measurable function whose domain is a probability space is usually called a random variable.

Remark 2.12.

Note that if \(\mathcal{F}\) is the power set of \(\mathcal{X}\), or if \(\mathcal{G}\) is the trivial \(\sigma\)-algebra \(\{\varnothing,\mathcal{Y}\}\), then every function \(f: \mathcal{X} \rightarrow \mathcal{Y}\) is measurable. At the opposite extreme, if \(\mathcal{F}\) is the trivial \(\sigma\)-algebra \(\{\varnothing,\mathcal{X}\}\), then the only measurable functions \(f: \mathcal{X} \rightarrow \mathcal{Y}\) are the constant functions. Thus, in some sense, the sizes of the \(\sigma\)-algebras used to define measurability provide a notion of how well- or ill-behaved the measurable functions are.

Definition 2.13.

A measurable function \(f: \mathcal{X} \rightarrow \mathcal{Y}\) from a measure space \((\mathcal{X},\mathcal{F},\mu )\) to a measurable space \((\mathcal{Y},\mathcal{G})\) defines a measure \(f_{{\ast}}\mu\) on \((\mathcal{Y},\mathcal{G})\), called the push-forward of μ by f, by

$$\displaystyle{(f_{{\ast}}\mu )(E):=\mu {\bigl ( [f \in E]\bigr )},\quad \mbox{ for $E \in \mathcal{G}$.}}$$

When μ is a probability measure, \(f_{{\ast}}\mu\) is called the distribution or law of the random variable f.
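
For a finite sample space, the push-forward can be computed by summing μ over pre-images; the following sketch (plain Python, illustrative values) does exactly this:

```python
# Illustrative sketch: the push-forward (law) of a random variable on a finite
# probability space, computed by summing mu over pre-images (Definition 2.13).
from collections import defaultdict

mu = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}   # probability measure on Theta = {0,1,2,3}
f = lambda theta: theta % 2                  # random variable f: Theta -> {0, 1}

law = defaultdict(float)
for theta, weight in mu.items():
    law[f(theta)] += weight                  # (f_* mu)({y}) = mu([f = y])

print(dict(law))                             # {0: 0.5, 1: 0.5}
```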

Definition 2.14.

Let S be any set and let \((\varTheta,\mathcal{F},\mu )\) be a probability space. A function \(U: S\times \varTheta \rightarrow \mathcal{X}\) such that each \(U(s, \cdot )\) is a random variable is called an \(\mathcal{X}\) -valued stochastic process on S.

Whereas measurability questions for a single random variable are discussed in terms of a single \(\sigma\)-algebra, measurability questions for stochastic processes are discussed in terms of families of \(\sigma\)-algebras; when the indexing set S is linearly ordered, e.g. by the natural numbers, or by a continuous parameter such as time, these families of \(\sigma\)-algebras are increasing in the following sense:

Definition 2.15.

  1. (a)

    A filtration of a \(\sigma\)-algebra \(\mathcal{F}\) is a family \(\mathcal{F}_{\bullet } =\{ \mathcal{F}_{i}\mid i \in I\}\) of sub-\(\sigma\)-algebras of \(\mathcal{F}\), indexed by an ordered set I, such that

    $$\displaystyle{i \leq j\mbox{ in }I\Rightarrow\mathcal{F}_{i} \subseteq \mathcal{F}_{j}.}$$
  2. (b)

    The natural filtration associated with a stochastic process \(U: I\times \varTheta \rightarrow \mathcal{X}\) is the filtration \(\mathcal{F}_{\bullet }^{U}\) defined by

    $$\displaystyle{\mathcal{F}_{i}^{U}:=\sigma {\bigl (\{ U(j, \cdot )^{-1}(E) \subseteq \varTheta \mid E \subseteq \mathcal{X}\mbox{ is measurable and }j \leq i\}\bigr )}.}$$
  3. (c)

    A stochastic process U is adapted to a filtration \(\mathcal{F}_{\bullet }\) if \(\mathcal{F}_{i}^{U} \subseteq \mathcal{F}_{i}\) for each i ∈ I.

Measurability and adaptedness are important properties of stochastic processes, and loosely correspond to certain questions being ‘answerable’ or ‘decidable’ with respect to the information contained in a given \(\sigma\)-algebra. For instance, if the event [X ∈ E] is not \(\mathcal{F}\)-measurable, then it does not even make sense to ask about the probability μ [X ∈ E]. For another example, suppose that some stream of observed data is modelled as a stochastic process Y, and it is necessary to make some decision U(t) at each time t. It is common sense to require that the decision stochastic process be \(\mathcal{F}_{\bullet }^{Y }\)-adapted, since the decision U(t) must be made on the basis of the observations Y (s), s ≤ t, not on observations from any future time.

2.3 Lebesgue Integration

Integration of a measurable function with respect to a (signed or non-negative) measure is referred to as Lebesgue integration. Despite the many technical details that must be checked in the construction of the Lebesgue integral, it remains the integral of choice for most mathematical and probabilistic applications because it extends the simple Riemann integral of functions of a single real variable, can handle worse singularities than the Riemann integral, has better convergence properties, and also naturally captures the notion of an expected value in probability theory. The issue of numerical evaluation of integrals — a vital one in UQ applications — will be addressed separately in Chapter 9.

The construction of the Lebesgue integral is accomplished in three steps: first, the integral is defined for simple functions, which are analogous to step functions from elementary calculus, except that their plateaus are not intervals in \(\mathbb{R}\) but measurable events in the sample space.

Definition 2.16.

Let \((\mathcal{X},\mathcal{F},\mu )\) be a measure space. The indicator function \(\mathbb{I}_{E}\) of a set \(E \in \mathcal{F}\) is the measurable function defined by

$$\displaystyle{\mathbb{I}_{E}(x):= \left \{\begin{array}{@{}l@{\quad }l@{}} 1,\quad &\mbox{ if $x \in E$}\\ 0,\quad &\mbox{ if $x\notin E$.} \end{array} \right.}$$

A function \(f: \mathcal{X} \rightarrow \mathbb{K}\) is called simple if

$$\displaystyle{f =\sum _{ i=1}^{n}\alpha _{ i}\mathbb{I}_{E_{i}}}$$

for some scalars \(\alpha _{1},\ldots,\alpha _{n} \in \mathbb{K}\) and some pairwise disjoint measurable sets \(E_{1},\ldots,E_{n} \in \mathcal{F}\) with μ(E i ) finite for \(i = 1,\ldots,n\). The Lebesgue integral of a simple function \(f:=\sum _{ i=1}^{n}\alpha _{i}\mathbb{I}_{E_{i}}\) is defined to be

$$\displaystyle{\int _{\mathcal{X}}f\,\mathrm{d}\mu:=\sum _{ i=1}^{n}\alpha _{ i}\mu (E_{i}).}$$
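
A minimal numerical sketch of this definition (plain Python; the measure used is the counting measure of Example 2.4, and the coefficients and sets are illustrative):

```python
# Minimal sketch of Definition 2.16: the Lebesgue integral of a simple function
# f = sum_i alpha_i * indicator(E_i) is sum_i alpha_i * mu(E_i).  The measure
# here is counting measure on a finite set; all values are illustrative.
alphas = [2.0, -1.0, 0.5]
events = [{0, 1}, {2}, {3, 4, 5}]            # pairwise disjoint measurable sets

def mu(E):                                    # counting measure
    return float(len(E))

integral = sum(a * mu(E) for a, E in zip(alphas, events))
print(integral)                               # 2*2 - 1*1 + 0.5*3 = 4.5
```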

In the second step, the integral of a non-negative measurable function is defined through approximation from below by the integrals of simple functions:

Definition 2.17.

Let \((\mathcal{X},\mathcal{F},\mu )\) be a measure space and let \(f: \mathcal{X} \rightarrow [0,+\infty ]\) be a measurable function. The Lebesgue integral of f is defined to be

$$\displaystyle{\int _{\mathcal{X}}f\,\mathrm{d}\mu:=\sup \left \{\int _{\mathcal{X}}\phi \,\mathrm{d}\mu \,\vert \,\begin{array}{c} \mbox{ $\phi: \mathcal{X} \rightarrow \mathbb{R}$ is a simple function, and}\\ \mbox{ $0 \leq \phi (x) \leq f(x)$ for $\mu $-almost all $x \in \mathcal{X}$} \end{array} \right \}\mbox{.}}$$

Finally, the integral of a real- or complex-valued function is defined through integration of positive and negative real and imaginary parts, with care being taken to avoid the undefined expression ‘\(\infty -\infty \)’:

Definition 2.18.

Let \((\mathcal{X},\mathcal{F},\mu )\) be a measure space and let \(f: \mathcal{X} \rightarrow \mathbb{R}\) be a measurable function. The Lebesgue integral of f is defined to be

$$\displaystyle{\int _{\mathcal{X}}f\,\mathrm{d}\mu:=\int _{\mathcal{X}}f_{+}\,\mathrm{d}\mu -\int _{\mathcal{X}}f_{-}\,\mathrm{d}\mu }$$

where \(f_{\pm }:=\max \{\pm f,0\}\) are the positive and negative parts of f, provided that at least one of the integrals on the right-hand side is finite. The integral of a complex-valued measurable function \(f: \mathcal{X} \rightarrow \mathbb{C}\) is defined to be

$$\displaystyle{\int _{\mathcal{X}}f\,\mathrm{d}\mu:=\int _{\mathcal{X}}(\mathop{\mathrm{Re}}f)\,\mathrm{d}\mu + i\int _{\mathcal{X}}(\mathop{\mathrm{Im}}f)\,\mathrm{d}\mu \mbox{.}}$$

The Lebesgue integral satisfies all the natural requirements for a useful notion of integration: integration is a linear function of the integrand, integrals are additive over disjoint domains of integration, and in the case \(\mathcal{X} = \mathbb{R}\) every Riemann-integrable function is Lebesgue integrable. However, one of the chief attractions of the Lebesgue integral over other notions of integration is that, subject to a simple domination condition, pointwise convergence of integrands is enough to ensure convergence of integral values:

Theorem 2.19 (Dominated convergence theorem).

Let \((\mathcal{X},\mathcal{F},\mu )\) be a measure space and let \(f_{n}: \mathcal{X} \rightarrow \mathbb{K}\) be a measurable function for each n ∈ ℕ. If \(f: \mathcal{X} \rightarrow \mathbb{K}\) is such that \(\lim _{n\rightarrow \infty }f_{n}(x) = f(x)\) for every \(x \in \mathcal{X}\) and there is a measurable function \(g: \mathcal{X} \rightarrow [0,\infty ]\) such that \(\int _{\mathcal{X}}\vert g\vert \,\mathrm{d}\mu\) is finite and |f n (x)|≤ g(x) for all \(x \in \mathcal{X}\) and all large enough n ∈ ℕ, then

$$\displaystyle{\int _{\mathcal{X}}f\,\mathrm{d}\mu =\lim _{n\rightarrow \infty }\int _{\mathcal{X}}f_{n}\,\mathrm{d}\mu.}$$

Furthermore, if the measure space is complete, then the conditions on pointwise convergence and pointwise domination of f n (x) can be relaxed to hold μ-almost everywhere.

As alluded to earlier, the Lebesgue integral is the standard one in probability theory, and is used to define the mean or expected value of a random variable:

Definition 2.20.

When \((\varTheta,\mathcal{F},\mu )\) is a probability space and \(X: \varTheta \rightarrow \mathbb{K}\) is a random variable, it is conventional to write \(\mathbb{E}_{\mu }[X]\) for \(\int _{\varTheta }X(\theta )\,\mathrm{d}\mu (\theta )\) and to call \(\mathbb{E}_{\mu }[X]\) the expected value or expectation of X. Also,

$$\displaystyle{\mathbb{V}_{\mu }[X]:= \mathbb{E}_{\mu }\big[\big\vert X - \mathbb{E}_{\mu }[X]\big\vert ^{2}\big] \equiv \mathbb{E}_{\mu }[\vert X\vert ^{2}] -\vert \mathbb{E}_{\mu }[X]\vert ^{2}}$$

is called the variance of X. If X is a \(\mathbb{K}^{d}\)-valued random variable, then \(\mathbb{E}_{\mu }[X]\), if it exists, is an element of \(\mathbb{K}^{d}\), and

$$\displaystyle\begin{array}{rcl} C&:=& \mathbb{E}_{\mu }{\bigl [(X - \mathbb{E}_{\mu }[X])(X - \mathbb{E}_{\mu }[X])^{{\ast}}\bigr ]}\in \mathbb{K}^{d\times d} {}\\ \mbox{ i.e. }C_{ij}&:=& \mathbb{E}_{\mu }{\Bigl [(X_{i} - \mathbb{E}_{\mu }[X_{i}])\overline{(X_{j} - \mathbb{E}_{\mu }[X_{j}])}\Bigr ]} \in \mathbb{K} {}\\ \end{array}$$

is the covariance matrix of X.
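
The following Monte Carlo sketch (assuming NumPy; the distribution and sample size are illustrative choices) estimates the mean, variance and covariance matrix just defined:

```python
# Hedged Monte Carlo sketch of Definition 2.20: empirical mean, variance and
# covariance matrix of an R^2-valued random variable, using NumPy.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=n)

mean = X.mean(axis=0)                             # approximates E_mu[X]
centred = X - mean
cov = centred.T @ centred / n                     # approximates C_ij
var_X1 = np.mean(np.abs(centred[:, 0]) ** 2)      # approximates V_mu[X_1]

print(mean)      # close to [1, -2]
print(cov)       # close to [[2, 0.5], [0.5, 1]]
print(var_X1)    # close to 2
```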

Spaces of Lebesgue-integrable functions are ubiquitous in analysis and probability theory:

Definition 2.21.

Let \((\mathcal{X},\mathcal{F},\mu )\) be a measure space. For \(1 \leq p \leq \infty \), the L p space (or Lebesgue space) is defined by

$$\displaystyle{L^{p}(\mathcal{X},\mu; \mathbb{K}):=\{ f: \mathcal{X} \rightarrow \mathbb{K}\mid \mbox{ $f$ is measurable and $\|f\|_{ L^{p}(\mu )}$ is finite}\}.}$$

For \(1 \leq p < \infty \), the norm is defined by the integral expression

$$\displaystyle{ \|f\|_{L^{p}(\mu )}:= \left (\int _{\mathcal{X}}\vert f(x)\vert ^{p}\,\mathrm{d}\mu (x)\right )^{1/p}\mbox{;} }$$
(2.1)

for \(p = \infty \), the norm is defined by the essential supremum (cf. Example 2.7)

$$\displaystyle\begin{array}{rcl} \|f\|_{L^{\infty }(\mu )}& \:=& \mathop{\mathrm{ess\,sup}}_{x\in \mathcal{X}}\vert f(x)\vert \\ &\phantom{:}=& \inf \left \{\|g\|_{\infty }\,\vert \,\mbox{ $g: \mathcal{X} \rightarrow \mathbb{K}$, $f = g$ $\mu$-almost everywhere}\right \} \\ & \phantom{:}=& \inf \left \{t \geq 0\,\vert \,\vert f\vert \leq t\mbox{ $\mu $-almost everywhere}\right \}\mbox{.} {}\end{array}$$
(2.2)

To be more precise, \(L^{p}(\mathcal{X},\mu; \mathbb{K})\) is the set of equivalence classes of such functions, where functions that differ only on a set of μ-measure zero are identified.

When \((\varTheta,\mathcal{F},\mu )\) is a probability space, we have the containments

$$\displaystyle{1 \leq p \leq q \leq \infty \Rightarrow L^{p}(\varTheta,\mu; \mathbb{R}) \supseteq L^{q}(\varTheta,\mu; \mathbb{R}).}$$

Thus, random variables in higher-order Lebesgue spaces are ‘better behaved’ than those in lower-order ones. As a simple example of this slogan, the following inequality shows that the \(L^{p}\)-norm of a random variable X provides control on the probability that X deviates strongly from its mean value:

Theorem 2.22 (Chebyshev’s inequality).

Let \(X \in L^{p}(\varTheta,\mu; \mathbb{K})\) , \(1 \leq p < \infty \) , be a random variable. Then, for all t ≥ 0,

$$\displaystyle{ \mathbb{P}_{\mu }{\bigl [\vert X - \mathbb{E}_{\mu }[X]\vert \geq t\bigr ]} \leq t^{-p}\mathbb{E}_{\mu }{\bigl [\vert X - \mathbb{E}_{\mu }[X]\vert ^{p}\bigr ]}. }$$
(2.3)

(The case p = 1 is also known as Markov’s inequality.) It is natural to ask if (2.3) is the best inequality of this type given the stated assumptions on X, and this is a question that will be addressed in Chapter 14, and specifically Example 14.18.
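
A simple Monte Carlo check of the bound (2.3) can be carried out by sampling (assuming NumPy; the distribution, p and t below are illustrative choices):

```python
# Hedged numerical check of the centred Chebyshev inequality (2.3) by Monte
# Carlo sampling; the distribution and parameters below are illustrative.
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=1_000_000)    # a random variable in every L^p

p, t = 2, 1.5
deviation = np.abs(X - X.mean())
lhs = np.mean(deviation >= t)                     # P[|X - E[X]| >= t]
rhs = np.mean(deviation ** p) / t ** p            # t^{-p} E[|X - E[X]|^p]

print(lhs, rhs, lhs <= rhs)                       # the bound holds, up to sampling error
```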

Integration of Vector-Valued Functions. Lebesgue integration of functions that take values in \(\mathbb{R}^{n}\) can be handled componentwise, as indeed was done above for complex-valued integrands. However, many UQ problems concern random fields, i.e. random variables with values in infinite-dimensional spaces of functions. For definiteness, consider a function f defined on a measure space \((\mathcal{X},\mathcal{F},\mu )\) taking values in a Banach space \(\mathcal{V}\). There are two ways to proceed, and they are in general inequivalent:

  1. (a)

    The strong integral or Bochner integral of f is defined by integrating simple \(\mathcal{V}\)-valued functions as in the construction of the Lebesgue integral, and then defining

    $$\displaystyle{\int _{\mathcal{X}}f\,\mathrm{d}\mu:=\lim _{n\rightarrow \infty }\int _{\mathcal{X}}\phi _{n}\,\mathrm{d}\mu }$$

    whenever \((\phi _{n})_{n\in \mathbb{N}}\) is a sequence of simple functions such that the (scalar-valued) Lebesgue integral \(\int _{\mathcal{X}}\|f -\phi _{n}\|\,\mathrm{d}\mu\) converges to 0 as \(n \rightarrow \infty \). It transpires that f is Bochner integrable if and only if \(\|f\|\) is Lebesgue integrable. The Bochner integral satisfies a version of the Dominated Convergence Theorem, but there are some subtleties concerning the Radon–Nikodým theorem.

  2. (b)

    The weak integral or Pettis integral of f is defined using duality: \(\int _{\mathcal{X}}f\,\mathrm{d}\mu\) is defined to be an element \(v \in \mathcal{V}\) such that

    $$\displaystyle{\langle \ell\mathop{\vert }v\rangle =\int _{\mathcal{X}}\langle \ell\mathop{\vert }f(x)\rangle \,\mathrm{d}\mu (x)\quad \mbox{ for all $\ell\in \mathcal{V}'$.}}$$

    Since this is a weaker integrability criterion, there are naturally more Pettis-integrable functions than Bochner-integrable ones, but the Pettis integral has deficiencies such as the space of Pettis-integrable functions being incomplete, the existence of a Pettis-integrable function \(f: [0,1] \rightarrow \mathcal{V}\) such that \(F(t):=\int _{[0,t]}f(\tau )\,\mathrm{d}\tau\) is not differentiable (Kadets, 1994), and so on.

2.4 Decomposition and Total Variation of Signed Measures

If a good mental model for a non-negative measure is a distribution of mass, then a good mental model for a signed measure is a distribution of electrical charge. A natural question to ask is whether every distribution of charge can be decomposed into regions of purely positive and purely negative charge, and hence whether it can be written as the difference of two non-negative distributions, with one supported entirely on the positive set and the other on the negative set. The answer is provided by the Hahn and Jordan decomposition theorems.

Definition 2.23.

Two non-negative measures μ and ν on a measurable space \((\mathcal{X},\mathcal{F})\) are said to be mutually singular, denoted μ ⊥ ν, if there exists \(E \in \mathcal{F}\) such that \(\mu (E) =\nu (\mathcal{X}\setminus E) = 0\).

Theorem 2.24 (Hahn–Jordan decomposition).

Let μ be a signed measure on a measurable space \((\mathcal{X},\mathcal{F})\).

  1. (a)

    Hahn decomposition: there exist sets \(P,N \in \mathcal{F}\) such that \(P \cup N = \mathcal{X}\) , \(P \cap N = \varnothing \) , and

    $$\displaystyle\begin{array}{rcl} & & \mbox{ for all measurable $E \subseteq P$,}\quad \mu (E) \geq 0, {}\\ & & \mbox{ for all measurable $E \subseteq N$,}\quad \mu (E) \leq 0. {}\\ \end{array}$$

    This decomposition is essentially unique in the sense that if P′ and N′ also satisfy these conditions, then every measurable subset of the symmetric differences \(P \bigtriangleup P'\) and \(N \bigtriangleup N'\) is of μ-measure zero.

  2. (b)

Jordan decomposition: there are unique mutually singular non-negative measures \(\mu_{+}\) and \(\mu_{-}\) on \((\mathcal{X},\mathcal{F})\), at least one of which is a finite measure, such that \(\mu = \mu_{+} - \mu_{-}\); indeed, for all \(E \in \mathcal{F}\),

    $$\displaystyle\begin{array}{rcl} \mu _{+}(E)& =& \mu (E \cap P), {}\\ \mu _{-}(E)& =& -\mu (E \cap N). {}\\ \end{array}$$

From a probabilistic perspective, the main importance of signed measures and their Hahn and Jordan decompositions is that they provide a useful notion of distance between probability measures:

Definition 2.25.

Let μ be a signed measure on a measurable space \((\mathcal{X},\mathcal{F})\), with Jordan decomposition \(\mu = \mu_{+} - \mu_{-}\). The associated total variation measure is the non-negative measure \(\vert \mu \vert:=\mu _{+} +\mu _{-}\). The total variation of μ is \(\|\mu \|_{\text{TV}}:= \vert \mu \vert (\mathcal{X})\).

Remark 2.26.

  1. (a)

    As the notation \(\|\mu \|_{\text{TV}}\) suggests, \(\|\cdot \|_{\text{TV}}\) is a norm on the space \(\mathcal{M}_{\pm }(\mathcal{X},\mathcal{F})\) of signed measures on \((\mathcal{X},\mathcal{F})\).

  2. (b)

    The total variation measure can be equivalently defined using measurable partitions:

$$\displaystyle{\vert \mu \vert (E) =\sup \left \{\sum _{i=1}^{n}\vert \mu (E_{ i})\vert \,\vert \,\begin{array}{c} \mbox{ $n \in \mathbb{N}_{0}$, $E_{1},\ldots,E_{n} \in \mathcal{F}$ pairwise disjoint,} \\ \mbox{ and $E = E_{1} \cup \ldots \cup E_{n}$}\end{array} \right \}.}$$
  3. (c)

    The total variation distance between two probability measures μ and ν (i.e. the total variation norm of their difference) can thus be characterized as

    $$\displaystyle{ d_{\text{TV}}(\mu,\nu ) \equiv \|\mu -\nu \|_{\text{TV}} = 2\sup {\bigl \{\vert \mu (E) -\nu (E)\vert \,\big\vert \,E \in \mathcal{F}\bigr \}}, }$$
    (2.4)

    i.e. twice the greatest absolute difference in the two probability values that μ and ν assign to any measurable event E.
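
For measures on a finite set, both characterizations of the total variation distance can be checked directly; the following sketch (plain Python, illustrative weights) does so by brute force over all events E:

```python
# Illustrative sketch of (2.4) for measures on a finite set: the total variation
# distance equals the sum of |mu - nu| over points, and also twice the largest
# discrepancy over events E (checked here by brute force over all subsets).
from itertools import chain, combinations

points = [0, 1, 2]
mu = {0: 0.5, 1: 0.3, 2: 0.2}
nu = {0: 0.2, 1: 0.3, 2: 0.5}

d_tv = sum(abs(mu[x] - nu[x]) for x in points)     # |mu - nu|(X)

subsets = chain.from_iterable(combinations(points, r) for r in range(len(points) + 1))
sup_diff = max(abs(sum(mu[x] for x in E) - sum(nu[x] for x in E)) for E in subsets)

print(d_tv, 2 * sup_diff)                          # both 0.6
```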

2.5 The Radon–Nikodým Theorem and Densities

Let \((\mathcal{X},\mathcal{F},\mu )\) be a measure space and let \(\rho: \mathcal{X} \rightarrow [0,+\infty ]\) be a measurable function. The operation

$$\displaystyle{ \nu: E\mapsto \int _{E}\rho (x)\,\mathrm{d}\mu (x) }$$
(2.5)

defines a measure ν on \((\mathcal{X},\mathcal{F})\). It is natural to ask whether every measure ν on \((\mathcal{X},\mathcal{F})\) can be expressed in this way. A moment’s thought reveals that the answer, in general, is no: there is no such function ρ that will make (2.5) hold when μ and ν are Lebesgue measure and a unit Dirac measure (or vice versa) on \(\mathbb{R}\).

Definition 2.27.

Let μ and ν be measures on a measurable space \((\mathcal{X},\mathcal{F})\). If, for \(E \in \mathcal{F}\), ν(E) = 0 whenever μ(E) = 0, then ν is said to be absolutely continuous with respect to μ, denoted \(\nu \ll \mu\). If \(\nu \ll \mu \ll \nu\), then μ and ν are said to be equivalent, and this is denoted μ ≈ ν.

Definition 2.28.

A measure space \((\mathcal{X},\mathcal{F},\mu )\) is said to be \(\sigma\) -finite if \(\mathcal{X}\) can be expressed as a countable union of \(\mathcal{F}\)-measurable sets, each of finite μ-measure.

Theorem 2.29 (Radon–Nikodým).

Suppose that μ and ν are \(\sigma\) -finite measures on a measurable space \((\mathcal{X},\mathcal{F})\) and that \(\nu \ll \mu\) . Then there exists a measurable function \(\rho: \mathcal{X} \rightarrow [0,\infty ]\) such that, for all measurable functions \(f: \mathcal{X} \rightarrow \mathbb{R}\) and all \(E \in \mathcal{F}\) ,

$$\displaystyle{\int _{E}f\,\mathrm{d}\nu =\int _{E}f\rho \,\mathrm{d}\mu }$$

whenever either integral exists. Furthermore, any two functions ρ with this property are equal μ-almost everywhere.

The function ρ in the Radon–Nikodým theorem is called the Radon–Nikodým derivative of ν with respect to μ, and the suggestive notation \(\rho = \frac{\mathrm{d}\nu } {\mathrm{d}\mu }\) is often used. In probability theory, when ν is a probability measure, \(\frac{\mathrm{d}\nu } {\mathrm{d}\mu }\) is called the probability density function (PDF) of ν (or any ν-distributed random variable) with respect to μ. Radon–Nikodým derivatives behave very much like the derivatives of elementary calculus:

Theorem 2.30 (Chain rule).

Suppose that μ, ν and π are \(\sigma\) -finite measures on a measurable space \((\mathcal{X},\mathcal{F})\) and that \(\pi \ll \nu \ll \mu\) . Then \(\pi \ll \mu\) and

$$\displaystyle{\frac{\mathrm{d}\pi } {\mathrm{d}\mu } = \frac{\mathrm{d}\pi } {\mathrm{d}\nu } \frac{\mathrm{d}\nu } {\mathrm{d}\mu }\quad \mbox{ $\mu $-almost everywhere.}}$$
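
On \(\mathbb{R}\), densities with respect to Lebesgue measure make the chain rule easy to check pointwise; the sketch below (assuming NumPy and SciPy; the three mutually equivalent Gaussian measures are illustrative choices) does this for \(\pi \ll \nu \ll \mu\):

```python
# Hedged sketch: Radon-Nikodym derivatives of Gaussian measures on R with
# respect to Lebesgue measure and to each other, and a pointwise check of the
# chain rule d(pi)/d(mu) = (d(pi)/d(nu)) * (d(nu)/d(mu)); SciPy supplies the densities.
import numpy as np
from scipy.stats import norm

mu = norm(loc=0.0, scale=2.0)     # three mutually absolutely continuous measures,
nu = norm(loc=1.0, scale=1.5)     # each given by a density w.r.t. Lebesgue measure
pi = norm(loc=-0.5, scale=1.0)

x = np.linspace(-3.0, 3.0, 7)
dpi_dmu = pi.pdf(x) / mu.pdf(x)                  # d(pi)/d(mu)
chain = (pi.pdf(x) / nu.pdf(x)) * (nu.pdf(x) / mu.pdf(x))

print(np.allclose(dpi_dmu, chain))               # True
```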

Remark 2.31.

The Radon–Nikodým theorem also holds for a signed measure ν and a non-negative measure μ, but in this case the absolute continuity condition is that the total variation measure | ν | satisfies \(\vert \nu \vert \ll \mu\), and of course the density ρ is no longer required to be a non-negative function.

2.6 Product Measures and Independence

The previous section considered one way of making new measures from old ones, namely by re-weighting them using a locally integrable density function. By way of contrast, this section considers another way of making new measures from old, namely forming a product measure. Geometrically speaking, forming the product of two measures is analogous to the way in which ‘area’ in the plane arises as the product of two ‘length’ measures. Products of measures also arise naturally in probability theory, since they are the distributions of mutually independent random variables.

Definition 2.32.

Let \((\varTheta,\mathcal{F},\mu )\) be a probability space.

  1. (a)

    Two measurable sets (events) \(E_{1},E_{2} \in \mathcal{F}\) are said to be independent if \(\mu (E_{1} \cap E_{2}) =\mu (E_{1})\mu (E_{2})\).

  2. (b)

    Two sub-\(\sigma\)-algebras \(\mathcal{G}_{1}\) and \(\mathcal{G}_{2}\) of \(\mathcal{F}\) are said to be independent if E 1 and E 2 are independent events whenever \(E_{1} \in \mathcal{G}_{1}\) and \(E_{2} \in \mathcal{G}_{2}\).

  3. (c)

    Two measurable functions (random variables) \(X: \varTheta \rightarrow \mathcal{X}\) and \(Y: \varTheta \rightarrow \mathcal{Y}\) are said to be independent if the \(\sigma\)-algebras generated by X and Y are independent.

Definition 2.33.

Let \((\mathcal{X},\mathcal{F},\mu )\) and \((\mathcal{Y},\mathcal{G},\nu )\) be \(\sigma\)-finite measure spaces. The product \(\sigma\) -algebra \(\mathcal{F}\otimes \mathcal{G}\) is the \(\sigma\)-algebra on \(\mathcal{X}\times \mathcal{Y}\) that is generated by the measurable rectangles, i.e. the smallest \(\sigma\)-algebra for which all the products

$$\displaystyle{F \times G,\quad F \in \mathcal{F},G \in \mathcal{G},}$$

are measurable sets. The product measure \(\mu \otimes \nu: \mathcal{F}\otimes \mathcal{G}\rightarrow [0,+\infty ]\) is the measure such that

$$\displaystyle{(\mu \otimes \nu )(F \times G) =\mu (F)\nu (G),\quad \mbox{ for all }F \in \mathcal{F},G \in \mathcal{G}.}$$

In the other direction, given a measure on a product space, we can consider the measures induced on the factor spaces:

Definition 2.34.

Let \((\mathcal{X}\times \mathcal{Y},\mathcal{F},\mu )\) be a measure space and suppose that the factor space \(\mathcal{X}\) is equipped with a \(\sigma\)-algebra such that the projection \(\varPi _{\mathcal{X}}: (x,y)\mapsto x\) is a measurable function. Then the marginal measure \(\mu _{\mathcal{X}}\) is the measure on \(\mathcal{X}\) defined by

$$\displaystyle{\mu _{\mathcal{X}}(E):={\bigl ( (\varPi _{\mathcal{X}})_{{\ast}}\mu \bigr )}(E) =\mu (E \times \mathcal{Y}).}$$

The marginal measure \(\mu _{\mathcal{Y}}\) on \(\mathcal{Y}\) is defined similarly.

Theorem 2.35.

Let X = (X 1 ,X 2 ) be a random variable taking values in a product space \(\mathcal{X} = \mathcal{X}_{1} \times \mathcal{X}_{2}\) . Let μ be the (joint) distribution of X, and μ i the (marginal) distribution of X i for i = 1,2. Then X 1 and X 2 are independent random variables if and only if μ = μ 1 ⊗μ 2 .

The important property of integration with respect to a product measure, and hence taking expected values of independent random variables, is that it can be performed by iterated integration:

Theorem 2.36 (Fubini–Tonelli).

Let \((\mathcal{X},\mathcal{F},\mu )\) and \((\mathcal{Y},\mathcal{G},\nu )\) be \(\sigma\) -finite measure spaces, and let \(f: \mathcal{X}\times \mathcal{Y}\rightarrow [0,+\infty ]\) be measurable. Then, of the following three integrals, if one exists in \([0,\infty ]\) , then all three exist and are equal:

$$\displaystyle{\int _{\mathcal{X}}\int _{\mathcal{Y}}f(x,y)\,\mathrm{d}\nu (y)\,\mathrm{d}\mu (x),\quad \int _{\mathcal{Y}}\int _{\mathcal{X}}f(x,y)\,\mathrm{d}\mu (x)\,\mathrm{d}\nu (y),}$$
$$\displaystyle{\mbox{ and }\int _{\mathcal{X}\times \mathcal{Y}}f(x,y)\,\mathrm{d}(\mu \otimes \nu )(x,y).}$$
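
A minimal sketch of the theorem for finite measures on finite sets, where all three integrals reduce to weighted sums (assuming NumPy; the weights and the function f are illustrative):

```python
# Illustrative sketch of the Fubini-Tonelli theorem for a product of two finite
# measures: iterated sums in either order agree with the sum over the product.
import numpy as np

mu = np.array([0.5, 1.5])                # a finite measure on X = {0, 1}
nu = np.array([2.0, 0.5, 1.0])           # a finite measure on Y = {0, 1, 2}
f = np.array([[1.0, 2.0, 3.0],
              [0.5, 0.0, 4.0]])          # f(x, y) >= 0 on X x Y

int_x_then_y = np.sum(nu * np.sum(mu[:, None] * f, axis=0))
int_y_then_x = np.sum(mu * np.sum(nu[None, :] * f, axis=1))
int_product = np.sum(np.outer(mu, nu) * f)

print(int_x_then_y, int_y_then_x, int_product)   # all equal
```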

Infinite product measures (or, put another way, infinite sequences of independent random variables) have some interesting extreme properties. Informally, the following result says that any property of a sequence of independent random variables that is independent of any finite subcollection (i.e. depends only on the ‘infinite tail’ of the sequence) must be almost surely true or almost surely false:

Theorem 2.37 (Kolmogorov zero-one law).

Let \((X_{n})_{n\in \mathbb{N}}\) be a sequence of independent random variables defined over a probability space \((\varTheta,\mathcal{F},\mu )\) , and let \(\mathcal{F}_{n}:=\sigma (X_{n})\) . For each n ∈ ℕ, let \(\mathcal{G}_{n}:=\sigma {\bigl (\bigcup _{k\geq n}\mathcal{F}_{k}\bigr )}\) , and let

$$\displaystyle{\mathcal{T}:=\bigcap _{n\in \mathbb{N}}\mathcal{G}_{n} =\bigcap _{n\in \mathbb{N}}\sigma (X_{n},X_{n+1},\ldots ) \subseteq \mathcal{F}}$$

be the so-called tail \(\sigma\)-algebra . Then, for every \(E \in \mathcal{T}\) , μ(E) ∈{ 0,1}.

Thus, for example, it is impossible to have a sequence of real-valued random variables \((X_{n})_{n\in \mathbb{N}}\) such that \(\lim _{n\rightarrow \infty }X_{n}\) exists with probability \(\frac{1} {2}\); either the sequence converges with probability one, or else with probability one it has no limit at all. There are many other zero-one laws in probability and statistics: one that will come up later in the study of Monte Carlo averages is Kesten’s theorem (Theorem 9.17).

2.7 Gaussian Measures

An important class of probability measures and random variables is the class of Gaussians, also known as normal distributions. For many practical problems, especially those that are linear or nearly so, Gaussian measures can serve as appropriate descriptions of uncertainty; even in the nonlinear situation, the Gaussian picture can be an appropriate approximation, though not always. In either case, a significant attraction of Gaussian measures is that many operations on them (e.g. conditioning) can be performed using elementary linear algebra.

On a theoretical level, Gaussian measures are particularly important because, unlike Lebesgue measure, they are well defined on infinite-dimensional spaces, such as function spaces. In \(\mathbb{R}^{d}\), Lebesgue measure is characterized up to normalization as the unique Borel measure that is simultaneously

  • locally finite, i.e. every point of \(\mathbb{R}^{d}\) has an open neighbourhood of finite Lebesgue measure;

  • strictly positive, i.e. every open subset of \(\mathbb{R}^{d}\) has strictly positive Lebesgue measure; and

  • translation invariant, i.e. \(\lambda (x + E) =\lambda (E)\) for all \(x \in \mathbb{R}^{d}\) and measurable \(E \subseteq \mathbb{R}^{d}\).

In addition, Lebesgue measure is \(\sigma\)-finite. However, the following theorem shows that there can be nothing like an infinite-dimensional Lebesgue measure:

Theorem 2.38.

Let μ be a Borel measure on an infinite-dimensional Banach space \(\mathcal{V}\) , and, for \(v \in \mathcal{V}\) , let \(T_{v}: \mathcal{V}\rightarrow \mathcal{V}\) be the translation map \(T_{v}(x):= v + x\).

  1. (a)

    If μ is locally finite and invariant under all translations, then μ is the trivial (zero) measure.

  2. (b)

    If μ is \(\sigma\) -finite and quasi-invariant under all translations (i.e.  \((T_{v})_{{\ast}}\mu\) is equivalent to μ), then μ is the trivial (zero) measure.

Gaussian measures on \(\mathbb{R}^{d}\) are defined using a Radon–Nikodým derivative with respect to Lebesgue measure. To save space, when P is a self-adjoint and positive-definite matrix or operator on a Hilbert space (see Section 3.3), write

$$\displaystyle\begin{array}{rcl} \langle x,y\rangle _{P}&:=& \langle x,Py\rangle \equiv \langle P^{1/2}x,P^{1/2}y\rangle, {}\\ \|x\|_{P}&:=& \sqrt{\langle x, x\rangle _{P}} \equiv \| P^{1/2}x\| {}\\ \end{array}$$

for the new inner product and norm induced by P.

Definition 2.39.

Let \(m \in \mathbb{R}^{d}\) and let \(C \in \mathbb{R}^{d\times d}\) be symmetric and positive definite. The Gaussian measure with mean m and covariance C is denoted \(\mathcal{N}(m,C)\) and defined by

$$\displaystyle\begin{array}{rcl} \mathcal{N}(m,C)(E)&:=& \frac{1} {\sqrt{\det C}\sqrt{2\pi }^{d}}\int _{E}\exp \left (-\frac{(x - m) \cdot C^{-1}(x - m)} {2} \right )\mathrm{d}x {}\\ &:=& \frac{1} {\sqrt{\det C}\sqrt{2\pi }^{d}}\int _{E}\exp \left (-\frac{1} {2}\|x - m\|_{C^{-1}}^{2}\right )\mathrm{d}x {}\\ \end{array}$$

for each measurable set \(E \subseteq \mathbb{R}^{d}\). The Gaussian measure \(\gamma:= \mathcal{N}(0,I)\) is called the standard Gaussian measure. A Dirac measure δ m can be considered as a degenerate Gaussian measure on \(\mathbb{R}\), one with variance equal to zero.
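
The following sketch (assuming NumPy and SciPy; the mean and covariance are illustrative) evaluates the density appearing in Definition 2.39 directly and compares it with scipy.stats.multivariate_normal:

```python
# Hedged sketch: the Gaussian density of Definition 2.39 written out explicitly
# and compared against scipy.stats.multivariate_normal for a 2-d example.
import numpy as np
from scipy.stats import multivariate_normal

m = np.array([0.0, 1.0])
C = np.array([[2.0, 0.3], [0.3, 0.5]])
d = len(m)

def gaussian_density(x):
    z = x - m
    quad = z @ np.linalg.solve(C, z)                       # (x - m) . C^{-1} (x - m)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(C) * (2.0 * np.pi) ** d)

x = np.array([0.7, 0.2])
print(gaussian_density(x), multivariate_normal(mean=m, cov=C).pdf(x))  # agree
```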

A non-degenerate Gaussian measure is a strictly positive probability measure on \(\mathbb{R}^{d}\), i.e. it assigns strictly positive mass to every open subset of \(\mathbb{R}^{d}\); however, unlike Lebesgue measure, it is not translation invariant:

Lemma 2.40 (Cameron–Martin formula).

Let \(\mu = \mathcal{N}(m,C)\) be a Gaussian measure on \(\mathbb{R}^{d}\) . Then the push-forward \((T_{v})_{{\ast}}\mu\) of μ by translation by any \(v \in \mathbb{R}^{d}\) , i.e.  \(\mathcal{N}(m + v,C)\) , is equivalent to \(\mathcal{N}(m,C)\) and

$$\displaystyle{\frac{\mathrm{d}(T_{v})_{{\ast}}\mu } {\mathrm{d}\mu } (x) =\exp \left (\langle v,x - m\rangle _{C^{-1}} -\frac{1} {2}\|v\|_{C^{-1}}^{2}\right ),}$$

i.e., for every integrable function f,

$$\displaystyle{\int _{\mathbb{R}^{d}}f(x + v)\,\mathrm{d}\mu (x) =\int _{\mathbb{R}^{d}}f(x)\exp \left (\langle v,x - m\rangle _{C^{-1}} -\frac{1} {2}\|v\|_{C^{-1}}^{2}\right )\mathrm{d}\mu (x).}$$
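
A Monte Carlo check of this change-of-variables identity in \(\mathbb{R}^{2}\) (assuming NumPy; the mean, covariance, shift v and test function f are all illustrative choices):

```python
# Hedged Monte Carlo check of the finite-dimensional Cameron-Martin formula
# (Lemma 2.40): E[f(X + v)] should equal
# E[f(X) * exp(<v, X - m>_{C^{-1}} - ||v||^2_{C^{-1}} / 2)] for X ~ N(m, C).
import numpy as np

rng = np.random.default_rng(2)
m = np.array([1.0, 0.0])
C = np.array([[1.0, 0.4], [0.4, 2.0]])
v = np.array([0.5, -0.3])
Cinv = np.linalg.inv(C)

X = rng.multivariate_normal(m, C, size=500_000)
f = lambda x: np.cos(x[:, 0]) + 0.1 * x[:, 1] ** 2   # an illustrative integrand

lhs = np.mean(f(X + v))
weights = np.exp((X - m) @ Cinv @ v - 0.5 * v @ Cinv @ v)
rhs = np.mean(f(X) * weights)

print(lhs, rhs)    # agree up to Monte Carlo error
```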

It is easily verified that the push-forward of \(\mathcal{N}(m,C)\) by any linear functional \(\ell: \mathbb{R}^{d} \rightarrow \mathbb{R}\) is a Gaussian measure on \(\mathbb{R}\), and this is taken as the defining property of a general Gaussian measure for settings in which, by Theorem 2.38, there may not be a Lebesgue measure with respect to which densities can be taken:

Definition 2.41.

A Borel measure μ on a normed vector space \(\mathcal{V}\) is said to be a (non-degenerate) Gaussian measure if, for every continuous linear functional \(\ell: \mathcal{V}\rightarrow \mathbb{R}\), the push-forward measure \(\ell_{{\ast}}\mu\) is a (non-degenerate) Gaussian measure on \(\mathbb{R}\). Equivalently, μ is Gaussian if, for every linear map \(T: \mathcal{V}\rightarrow \mathbb{R}^{d}\), \(T_{{\ast}}\mu = \mathcal{N}(m_{T},C_{T})\) for some \(m_{T} \in \mathbb{R}^{d}\) and some symmetric positive-definite \(C_{T} \in \mathbb{R}^{d\times d}\).

Definition 2.42.

Let μ be a probability measure on a Banach space \(\mathcal{V}\). An element \(m_{\mu } \in \mathcal{V}\) is called the mean of μ if

$$\displaystyle{\int _{\mathcal{V}}\langle \ell\mathop{\vert }x - m_{\mu }\rangle \,\mathrm{d}\mu (x) = 0\mbox{ for all $\ell\in \mathcal{V}'$,}}$$

so that \(\int _{\mathcal{V}}x\,\mathrm{d}\mu (x) = m_{\mu }\) in the sense of a Pettis integral. If m μ  = 0, then μ is said to be centred. The covariance operator is the self-adjoint (i.e. conjugate-symmetric) operator \(C_{\mu }: \mathcal{V}'\times \mathcal{V}'\rightarrow \mathbb{K}\) defined by

$$\displaystyle{C_{\mu }(k,\ell) =\int _{\mathcal{V}}\langle k\mathop{\vert }x - m_{\mu }\rangle \overline{\langle \ell\mathop{\vert }x - m_{\mu }\rangle }\,\mathrm{d}\mu (x)\mbox{ for all $k,\ell\in \mathcal{V}'$.}}$$

We often abuse notation and write \(C_{\mu }: \mathcal{V}'\rightarrow \mathcal{V}''\) for the operator defined by

$$\displaystyle{\langle C_{\mu }k\mathop{\vert }\ell\rangle:= C_{\mu }(k,\ell).}$$

In the case that \(\mathcal{V} = \mathcal{H}\) is a Hilbert space, it is usual to employ the Riesz representation theorem to identify \(\mathcal{H}\) with \(\mathcal{H}'\) and \(\mathcal{H}''\) and hence treat C μ as a linear operator from \(\mathcal{H}\) into itself. The inverse of C μ , if it exists, is called the precision operator of μ.

The covariance operator of a Gaussian measure is closely connected to its non-degeneracy:

Theorem 2.43 (Vakhania, 1975).

Let μ be a Gaussian measure on a separable, reflexive Banach space \(\mathcal{V}\) with mean \(m_{\mu } \in \mathcal{V}\) and covariance operator \(C_{\mu }: \mathcal{V}'\rightarrow \mathcal{V}\) . Then the support of μ is the affine subspace of \(\mathcal{V}\) that is the translation by the mean of the closure of the range of the covariance operator, i.e.

$$\displaystyle{\mathop{\mathrm{supp}}\nolimits (\mu ) = m_{\mu } + \overline{C_{\mu }\mathcal{V}'}.}$$

Corollary 2.44.

For a Gaussian measure μ on a separable, reflexive Banach space \(\mathcal{V}\) , the following are equivalent:

  1. (a)

    μ is non-degenerate;

  2. (b)

    \(C_{\mu }: \mathcal{V}'\rightarrow \mathcal{V}\) is one-to-one;

  3. (c)

    \(\overline{C_{\mu }\mathcal{V}'} = \mathcal{V}\) .

Example 2.45.

Consider a Gaussian random variable \(X = (X_{1},X_{2}) \sim \mu\) taking values in \(\mathbb{R}^{2}\). Suppose that the mean and covariance of X (or, equivalently, μ) are, in the usual basis of \(\mathbb{R}^{2}\),

$$\displaystyle{m = \left [\begin{array}{*{10}c} 0\\ 1 \end{array} \right ]\quad C = \left [\begin{array}{*{10}c} 1&0\\ 0 &0 \end{array} \right ].}$$

Then X = (Z, 1), where \(Z \sim \mathcal{N}(0,1)\) is a standard Gaussian random variable on \(\mathbb{R}\); the values of X all lie on the affine line \(L:=\{ (x_{1},x_{2}) \in \mathbb{R}^{2}\mid x_{2} = 1\}\). Indeed, Vakhania’s theorem says that

$$\displaystyle{\mathop{\mathrm{supp}}\nolimits (\mu ) = m+\overline{C(\mathbb{R}^{2})} = \left [\begin{array}{*{10}c} 0\\ 1 \end{array} \right ]+\left \{\left [\begin{array}{*{10}c} x_{1} \\ 0 \end{array} \right ]\,\vert \,x_{1} \in \mathbb{R}\right \} = L.}$$

Gaussian measures can also be identified by reference to their Fourier transforms:

Theorem 2.46.

A probability measure μ on \(\mathcal{V}\) is a Gaussian measure if and only if its Fourier transform \(\widehat{\mu }: \mathcal{V}'\rightarrow \mathbb{C}\) satisfies

$$\displaystyle{\hat{\mu }(\ell):=\int _{\mathcal{V}}e^{i\langle \ell\mathop{\vert }x\rangle }\,\mathrm{d}\mu (x) =\exp \left (i\langle \ell\mathop{\vert }m\rangle -\frac{Q(\ell)} {2} \right )\quad \mbox{ for all $\ell\in \mathcal{V}'$},}$$

for some \(m \in \mathcal{V}\) and some positive-definite quadratic form Q on \(\mathcal{V}'\) . Indeed, m is the mean of μ and Q(ℓ) = C μ (ℓ,ℓ). Furthermore, if two Gaussian measures μ and ν have the same mean and covariance operator, then μ = ν.

Not only does a Gaussian measure have a well-defined mean and variance, it in fact has moments of all orders:

Theorem 2.47 (Fernique, 1970).

Let μ be a centred Gaussian measure on a separable Banach space \(\mathcal{V}\) . Then there exists α > 0 such that

$$\displaystyle{\int _{\mathcal{V}}\exp (\alpha \|x\|^{2})\,\mathrm{d}\mu (x) < +\infty.}$$

A fortiori, μ has moments of all orders: for all k ≥ 0,

$$\displaystyle{\int _{\mathcal{V}}\|x\|^{k}\,\mathrm{d}\mu (x) < +\infty.}$$

The covariance operator of a Gaussian measure on a Hilbert space \(\mathcal{H}\) is a self-adjoint operator from \(\mathcal{H}\) into itself. A classification of exactly which self-adjoint operators on \(\mathcal{H}\) can be Gaussian covariance operators is provided by the next result, Sazonov’s theorem:

Definition 2.48.

Let \(K: \mathcal{H}\rightarrow \mathcal{H}\) be a linear operator on a separable Hilbert space \(\mathcal{H}\).

  1. (a)

    K is said to be compact if it has a singular value decomposition, i.e. if there exist finite or countably infinite orthonormal sequences (u n ) and (v n ) in \(\mathcal{H}\) and a sequence of non-negative reals \((\sigma _{n})\) such that

    $$\displaystyle{K =\sum _{n}\sigma _{n}\langle v_{n}, \cdot \rangle u_{n},}$$

    with \(\lim _{n\rightarrow \infty }\sigma _{n} = 0\) if the sequences are infinite.

  2. (b)

    K is said to be trace class or nuclear if \(\sum _{n}\sigma _{n}\) is finite, and Hilbert–Schmidt or nuclear of order 2 if \(\sum _{n}\sigma _{n}^{2}\) is finite.

  3. (c)

    If K is trace class, then its trace is defined to be

    $$\displaystyle{\mathop{\mathrm{tr}}\nolimits (K):=\sum _{n}\langle e_{n},Ke_{n}\rangle }$$

    for any orthonormal basis (e n ) of \(\mathcal{H}\), and (by Lidskiĭ’s theorem) this equals the sum of the eigenvalues of K, counted with multiplicity.

Theorem 2.49 (Sazonov, 1958).

Let μ be a centred Gaussian measure on a separable Hilbert space \(\mathcal{H}\) . Then \(C_{\mu }: \mathcal{H}\rightarrow \mathcal{H}\) is trace class and

$$\displaystyle{\mathop{\mathrm{tr}}\nolimits (C_{\mu }) =\int _{\mathcal{H}}\|x\|^{2}\,\mathrm{d}\mu (x).}$$

Conversely, if \(K: \mathcal{H}\rightarrow \mathcal{H}\) is positive, self-adjoint and of trace class, then there is a Gaussian measure μ on \(\mathcal{H}\) such that C μ = K.

Sazonov’s theorem is often stated in terms of the square root \(C_{\mu }^{1/2}\) of C μ : \(C_{\mu }^{1/2}\) is Hilbert–Schmidt, i.e. has square-summable singular values \((\sigma _{n})_{n\in \mathbb{N}}\).

As noted above, even finite-dimensional Gaussian measures are not invariant under translations, and the change-of-measure formula is given by Lemma 2.40. In the infinite-dimensional setting, it is not even true that translation produces a new measure that has a density with respect to the old one. This phenomenon leads to an important object associated with any Gaussian measure, its Cameron–Martin space:

Definition 2.50.

Let \(\mu = \mathcal{N}(m,C)\) be a Gaussian measure on a Banach space \(\mathcal{V}\). The Cameron–Martin space is the Hilbert space \(\mathcal{H}_{\mu }\) defined equivalently by:

  • \(\mathcal{H}_{\mu }\) is the completion of

    $$\displaystyle{\left \{h \in \mathcal{V}\,\vert \,\mbox{ for some }h^{{\ast}}\in \mathcal{V}',C(h^{{\ast}}, \cdot ) =\langle \cdot \mathop{\vert }h\rangle \right \}}$$

    with respect to the inner product \(\langle h,k\rangle _{\mu }:= C(h^{{\ast}},k^{{\ast}})\).

  • \(\mathcal{H}_{\mu }\) is the completion of the range of the covariance operator \(C: \mathcal{V}'\rightarrow \mathcal{V}\) with respect to this inner product (cf. the closure with respect to the norm in \(\mathcal{V}\) in Theorem 2.43).

  • If \(\mathcal{V}\) is Hilbert, then \(\mathcal{H}_{\mu }\) is the completion of \(\mathop{\mathrm{ran}}\nolimits C^{1/2}\) with the inner product \(\langle h,k\rangle _{C^{-1}}:=\langle C^{-1/2}h,C^{-1/2}k\rangle _{\mathcal{V}}\).

  • \(\mathcal{H}_{\mu }\) is the set of all \(v \in \mathcal{V}\) such that \((T_{v})_{{\ast}}\mu \approx \mu\), with

    $$\displaystyle{\frac{\mathrm{d}(T_{v})_{{\ast}}\mu } {\mathrm{d}\mu } (x) =\exp \left (\langle v,x\rangle _{C^{-1}} -\frac{\|v\|_{C^{-1}}^{2}} {2} \right )}$$

    as in Lemma 2.40.

  • \(\mathcal{H}_{\mu }\) is the intersection of all linear subspaces of \(\mathcal{V}\) that have full μ-measure.

By Theorem 2.38, if μ is any probability measure (Gaussian or otherwise) on an infinite-dimensional space \(\mathcal{V}\), then we certainly cannot have \(\mathcal{H}_{\mu } = \mathcal{V}\). In fact, one should think of \(\mathcal{H}_{\mu }\) as being a very small subspace of \(\mathcal{V}\): if \(\mathcal{H}_{\mu }\) is infinite dimensional, then \(\mu (\mathcal{H}_{\mu }) = 0\). Also, infinite-dimensional spaces have the extreme property that Gaussian measures on such spaces are either equivalent or mutually singular — there is no middle ground in the way that Lebesgue measure on [0, 1] has a density with respect to Lebesgue measure on \(\mathbb{R}\) but is not equivalent to it.

Theorem 2.51 (Feldman–Hájek).

Let μ, ν be Gaussian probability measures on a normed vector space \(\mathcal{V}\) . Then either

  • μ and ν are equivalent, i.e.  \(\mu (E) = 0\;\Longleftrightarrow\;\nu (E) = 0\) , and hence each has a strictly positive density with respect to the other; or

  • μ and ν are mutually singular, i.e. there exists E such that μ(E) = 0 and ν(E) = 1, and so neither μ nor ν can have a density with respect to the other.

Furthermore, equivalence holds if and only if

  1. (a)

    \(\mathop{\mathrm{ran}}\nolimits C_{\mu }^{1/2} =\mathop{ \mathrm{ran}}\nolimits C_{\nu }^{1/2}\) ;

  2. (b)

    \(m_{\mu } - m_{\nu } \in \mathop{\mathrm{ran}}\nolimits C_{\mu }^{1/2} =\mathop{ \mathrm{ran}}\nolimits C_{\nu }^{1/2}\) ; and

  3. (c)

    \(T:= (C_{\mu }^{-1/2}C_{\nu }^{1/2})(C_{\mu }^{-1/2}C_{\nu }^{1/2})^{{\ast}}- I\) is Hilbert–Schmidt in \(\mathop{\mathrm{ran}}\nolimits C_{\mu }^{1/2}\) .

The Cameron–Martin and Feldman–Hájek theorems show that translation by any vector not in the Cameron–Martin space \(\mathcal{H}_{\mu }\subseteq \mathcal{V}\) produces a new measure that is mutually singular with respect to the old one. It turns out that dilation by a non-unitary constant also destroys equivalence:

Proposition 2.52.

Let μ be a centred Gaussian measure on a separable real Banach space \(\mathcal{V}\) such that \(\dim \mathcal{H}_{\mu } = \infty \) . For \(c \in \mathbb{R}\) , let \(D_{c}: \mathcal{V}\rightarrow \mathcal{V}\) be the dilation map \(D_{c}(x):= cx\) . Then \((D_{c})_{{\ast}}\mu\) is equivalent to μ if and only if c ∈{±1}, and \((D_{c})_{{\ast}}\mu\) and μ are mutually singular otherwise.

Remark 2.53.

There is another attractive viewpoint on Gaussian measures on Hilbert spaces, namely that draws from a Gaussian measure \(\mathcal{N}(m,C)\) on a Hilbert space are the same as draws from random series of the form

$$\displaystyle{m +\sum _{k\in \mathbb{N}}\sqrt{\lambda _{k}}\xi _{k}\psi _{k},}$$

where \(\{\psi _{k}\}_{k\in \mathbb{N}}\) are orthonormal eigenvectors for the covariance operator C, \(\{\lambda _{k}\}_{k\in \mathbb{N}}\) are the corresponding eigenvalues, and \(\{\xi _{k}\}_{k\in \mathbb{N}}\) are independent draws from the standard normal distribution \(\mathcal{N}(0,1)\) on \(\mathbb{R}\). This point of view will be revisited in more detail in Section 11.1 in the context of Karhunen–Loève expansions of Gaussian and Besov measures.
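
The following sketch (assuming NumPy) draws one approximate sample from such a series; as an illustrative covariance operator it uses the one whose eigenpairs \(\lambda_{k} = (\pi(k - \tfrac{1}{2}))^{-2}\), \(\psi_{k}(t) = \sqrt{2}\sin(\pi(k - \tfrac{1}{2})t)\) give the Karhunen–Loève expansion of Brownian motion on [0, 1], and the truncation level is an arbitrary choice:

```python
# Hedged sketch of the random-series viewpoint of Remark 2.53: a truncated sample
# of a centred Gaussian measure on L^2([0,1]) via m + sum_k sqrt(lambda_k) xi_k psi_k.
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 501)
K = 200                                              # truncation level of the series

k = np.arange(1, K + 1)
lam = 1.0 / (np.pi * (k - 0.5)) ** 2                 # eigenvalues of the covariance operator
psi = np.sqrt(2.0) * np.sin(np.outer(np.pi * (k - 0.5), t))   # orthonormal eigenfunctions
xi = rng.standard_normal(K)                          # independent N(0, 1) draws

sample_path = np.sqrt(lam) @ (xi[:, None] * psi)     # one approximate draw, with m = 0
print(sample_path.shape)                             # (501,)
```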

The conditioning properties of Gaussian measures can easily be expressed using an elementary construction from linear algebra, the Schur complement. This result will be very useful in Chapters 6, 7, and 13.

Theorem 2.54 (Conditioning of Gaussian measures).

Let \(\mathcal{H} = \mathcal{H}_{1} \oplus \mathcal{H}_{2}\) be a direct sum of separable Hilbert spaces. Let \(X = (X_{1},X_{2}) \sim \mu\) be an \(\mathcal{H}\)-valued Gaussian random variable with mean \(m = (m_{1},m_{2})\) and positive-definite covariance operator C. For i, j = 1, 2, let

$$\displaystyle{ C_{ij}(k_{i},k_{j}):= \mathbb{E}_{\mu }{\Bigl [\langle k_{i},X_{i} - m_{i}\rangle \overline{\langle k_{j},X_{j} - m_{j}\rangle }\Bigr ]} }$$
(2.6)

for all \(k_{i} \in \mathcal{H}_{i}\), \(k_{j} \in \mathcal{H}_{j}\), so that C is decomposed in block form as

$$\displaystyle{ C = \left [\begin{array}{*{10}c} C_{11} & C_{12} \\ C_{21} & C_{22} \end{array} \right ]; }$$
(2.7)

in particular, the marginal distribution of \(X_{i}\) is \(\mathcal{N}(m_{i},C_{ii})\), and \(C_{21} = C_{12}^{{\ast}}\). Then \(C_{22}\) is invertible and, for each \(x_{2} \in \mathcal{H}_{2}\), the conditional distribution of \(X_{1}\) given \(X_{2} = x_{2}\) is Gaussian:

$$\displaystyle{ (X_{1}\vert X_{2} = x_{2}) \sim \mathcal{N}{\bigl (m_{1} + C_{12}C_{22}^{-1}(x_{ 2} - m_{2}),C_{11} - C_{12}C_{22}^{-1}C_{ 21}\bigr )}. }$$
(2.8)

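In the finite-dimensional case, the conditional mean and covariance in (2.8) can be computed directly. A minimal Python/NumPy sketch, with an illustrative three-dimensional Gaussian split as \(\mathcal{H}_{1} \oplus \mathcal{H}_{2}\) with \(\dim \mathcal{H}_{1} = 2\) and \(\dim \mathcal{H}_{2} = 1\) (the numerical values are arbitrary):

```python
import numpy as np

def condition_gaussian(m1, m2, C11, C12, C22, x2):
    """Conditional law (X1 | X2 = x2) = N(m1 + C12 C22^{-1} (x2 - m2), C11 - C12 C22^{-1} C21),
    as in (2.8), for the real finite-dimensional case (so C21 = C12^T)."""
    K = C12 @ np.linalg.inv(C22)                 # the 'gain' C12 C22^{-1}
    return m1 + K @ (x2 - m2), C11 - K @ C12.T

m = np.array([0.0, 1.0, -1.0])
C = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])
cond_mean, cond_cov = condition_gaussian(m[:2], m[2:], C[:2, :2], C[:2, 2:], C[2:, 2:],
                                         x2=np.array([0.5]))
```
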
2.8 Interpretations of Probability

It is worth noting that the above discussions are purely mathematical: a probability measure is an abstract algebraic–analytic object with no necessary connection to everyday notions of chance or probability. The question of what interpretation of probability to adopt, i.e. what practical meaning to ascribe to probability measures, is a question of philosophy and mathematical modelling. The two main points of view are the frequentist and Bayesian perspectives. To a frequentist, the probability μ(E) of an event E is the relative frequency of occurrence of the event E in the limit of infinitely many independent but identical trials; to a Bayesian, μ(E) is a numerical representation of one’s degree of belief in the truth of a proposition E. The frequentist’s point of view is objective; the Bayesian’s is subjective; both use the same mathematical machinery of probability measures to describe the properties of the function μ.

Frequentists are careful to distinguish between parts of their analyses that are fixed and deterministic versus those that have a probabilistic character. However, for a Bayesian, any uncertainty can be described in terms of a suitable probability measure. In particular, one’s beliefs about some unknown \(\theta\) (taking values in a space \(\varTheta\)) in advance of observing data are summarized by a prior probability measure π on \(\varTheta\). The other ingredient of a Bayesian analysis is a likelihood function, which is up to normalization a conditional probability: given any observed datum y, \(L(y\vert \theta )\) is the likelihood of observing y if the parameter value \(\theta\) were the truth. A Bayesian’s belief about \(\theta\) given the prior π and the observed datum y is the posterior probability measure \(\pi (\cdot \vert y)\) on \(\varTheta\), which is just the conditional probability

$$\displaystyle{\pi (\theta \vert y) = \frac{L(y\vert \theta )\pi (\theta )} {\mathbb{E}_{\pi }[L(y\vert \theta )]} = \frac{L(y\vert \theta )\pi (\theta )} {\int _{\varTheta }L(y\vert \theta )\,\mathrm{d}\pi (\theta )}}$$

or, written in a way that generalizes better to infinite-dimensional \(\varTheta\), we have a density/Radon–Nikodým derivative

$$\displaystyle{\frac{\mathrm{d}\pi (\cdot \vert y)} {\mathrm{d}\pi } (\theta ) \propto L(y\vert \theta ).}$$

Both of the previous two equations are referred to as Bayes’ rule; at this stage, they are informal applications of the standard Bayes’ rule (Theorem 2.10) for events A and B of non-zero probability.
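
Read as a recipe, the Radon–Nikodým form of Bayes’ rule says: reweight draws from the prior by the likelihood and renormalize. A minimal self-normalized importance-sampling sketch in Python, in which the prior \(\mathcal{N}(0,2^{2})\), the single observed datum, and the Gaussian likelihood are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(0.0, 2.0, size=100_000)   # draws from an illustrative prior pi = N(0, 2^2)
y = 1.3                                      # an illustrative observed datum
L = np.exp(-0.5 * (y - theta) ** 2)          # likelihood L(y|theta) for the model y | theta ~ N(theta, 1)
w = L / L.sum()                              # normalized weights, proportional to d pi(.|y) / d pi
posterior_mean = np.sum(w * theta)           # posterior expectation E[theta | y], approximately
```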

Example 2.55.

Parameter estimation provides a good example of the philosophical difference between frequentist and subjectivist uses of probability. Suppose that \(X_{1},\ldots,X_{n}\) are n independent and identically distributed observations of some random variable X, which is distributed according to the normal distribution \(\mathcal{N}(\theta,1)\) of mean \(\theta\) and variance 1. We set our frequentist and Bayesian statisticians the challenge of estimating \(\theta\) from the data \(d:= (X_{1},\ldots,X_{n})\).

  1. (a)

    To the frequentist, \(\theta\) is a well-defined real number that happens to be unknown. This number can be estimated using the estimator

    $$\displaystyle{\widehat{\theta }_{n}:= \frac{1} {n}\sum _{i=1}^{n}X_{ i},}$$

    which is a random variable. It makes sense to say that \(\widehat{\theta }_{n}\) is close to \(\theta\) with high probability, and hence to give a confidence interval for \(\theta\), but \(\theta\) itself does not have a distribution.

  2. (b)

    To the Bayesian, \(\theta\) is a random variable, and its distribution in advance of seeing the data is encoded in a prior π. Upon seeing the data and conditioning upon it using Bayes’ rule, the distribution of the parameter is the posterior distribution \(\pi (\theta \vert d)\). The posterior encodes everything that is known about \(\theta\) in view of the prior π, the likelihood \(L(y\vert \theta ) \propto e^{-\vert y-\theta \vert ^{2}/2 }\), and the data d, although this information may be summarized by a single number such as the maximum a posteriori estimator

    $$\displaystyle{\widehat{\theta }^{\mathrm{MAP}}:=\mathop{ \mathrm{arg\,max}}_{\theta \in \mathbb{R}}\pi (\theta \vert d)}$$

    or the maximum likelihood estimator

    $$\displaystyle{\widehat{\theta }^{\mathrm{MLE}}:=\mathop{ \mathrm{arg\,max}}_{\theta \in \mathbb{R}}L(d\vert \theta ).}$$

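For the model of Example 2.55, if the Bayesian adopts a conjugate Gaussian prior \(\pi = \mathcal{N}(\theta _{0},\tau ^{2})\) (a choice made here purely for illustration), then the posterior is again Gaussian, and both estimators are available in closed form; a minimal Python sketch comparing them:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, n = 0.7, 50
x = rng.normal(theta_true, 1.0, size=n)        # data X_1, ..., X_n ~ N(theta, 1)

# Frequentist estimator: the sample mean (which is also the MLE for this model)
theta_mle = x.mean()

# Bayesian: with prior N(theta0, tau^2), the posterior for theta is N(mu_post, 1 / prec_post)
theta0, tau2 = 0.0, 1.0                        # illustrative prior hyperparameters
prec_post = 1.0 / tau2 + n                     # posterior precision
mu_post = (theta0 / tau2 + n * x.mean()) / prec_post
theta_map = mu_post                            # for a Gaussian posterior, the MAP estimator is its mean
```

As the amount of data n grows, the posterior mean (and hence the MAP estimator) is pulled towards the sample mean, and the influence of the prior fades.
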
The Bayesian perspective can be seen as the natural extension of classical Aristotelian bivalent (i.e. true-or-false) logic to propositions of uncertain truth value. This point of view is underwritten by Cox’s theorem (Cox, 1946, 1961), which asserts that any ‘natural’ extension of Aristotelian logic to [0, 1]-valued truth values is probabilistic, and specifically Bayesian, although the ‘naturality’ of the hypotheses has been challenged by, e.g., Halpern (1999a,b).

It is also worth noting that there is a significant community that, in addition to being frequentist or Bayesian, asserts that selecting a single probability measure is too precise a description of uncertainty. These ‘imprecise probabilists’ count such distinguished figures as George Boole and John Maynard Keynes among their ranks, and would prefer to say that \(\frac{1} {2} - 2^{-100} \leq \mathbb{P}[\mathrm{heads}] \leq \frac{1} {2} + 2^{-100}\) than commit themselves to the assertion that \(\mathbb{P}[\mathrm{heads}] = \frac{1} {2}\); imprecise probabilists would argue that the former assertion can be verified, to a prescribed level of confidence, in finite time, whereas the latter cannot. Techniques like the use of lower and upper probabilities (or interval probabilities) are popular in this community, including sophisticated generalizations like Dempster–Shafer theory; one can also consider feasible sets of probability measures, which is the approach taken in Chapter 14.

2.9 Bibliography

The book of Gordon (1994) is mostly a text on the gauge integral, but its first chapters provide an excellent condensed introduction to measure theory and Lebesgue integration. Capiński and Kopp (2004) is a clear, readable and self-contained introductory text confined mainly to Lebesgue integration on \(\mathbb{R}\) (and later \(\mathbb{R}^{n}\)), including material on L p spaces and the Radon–Nikodým theorem. Another excellent text on measure and probability theory is the monograph of Billingsley (1995). Readers who prefer to learn mathematics through counterexamples rather than theorems may wish to consult the books of Romano and Siegel (1986) and Stoyanov (1987). The disintegration theorem, alluded to at the end of Section 2.1, can be found in Ambrosio et al. (2008, Section 5.3) and Dellacherie and Meyer (1978, Section III-70).

The Bochner integral was introduced by Bochner (1933); recent texts on the topic include those of Diestel and Uhl (1977) and Mikusiński (1978). For detailed treatment of the Pettis integral, see Talagrand (1984). Further discussion of the relationship between tensor products and spaces of vector-valued integrable functions can be found in the book of Ryan (2002).

Bourbaki (2004) contains a treatment of measure theory from a functional-analytic perspective. The presentation is focussed on Radon measures on locally compact spaces, which is advantageous in terms of regularity but leads to an approach to measurable functions that is cumbersome, particularly from the viewpoint of probability theory. All the standard warnings about Bourbaki texts apply: the presentation is comprehensive but often forbiddingly austere, and so it is perhaps better as a reference text than a learning tool.

Chapters 7 and 8 of the book of Smith (2014) compare and contrast the frequentist and Bayesian perspectives on parameter estimation in the context of UQ. The origins of imprecise probability lie in treatises like those of Boole (1854) and Keynes (1921). More recent foundations and expositions for imprecise probability have been put forward by Walley (1991), Kuznetsov (1991), Weichselberger (2000), and by Dempster (1967) and Shafer (1976).

A general introduction to the theory of Gaussian measures is the book of Bogachev (1998); a complementary viewpoint, in terms of Gaussian stochastic processes, is presented by Rasmussen and Williams (2006).

The non-existence of an infinite-dimensional Lebesgue measure, and related results, can be found in the lectures of Yamasaki (1985, Part B, Chapter 1, Section 5). The Feldman–Hájek dichotomy (Theorem 2.51) was proved independently by Feldman (1958) and Hájek (1958), and can also be found in the book of Da Prato and Zabczyk (1992, Theorem 2.23).

2.10 Exercises

Exercise 2.1.

Let X be any \(\mathbb{C}^{n}\)-valued random variable with mean \(m \in \mathbb{C}^{n}\) and covariance matrix

$$\displaystyle{C:= \mathbb{E}{\bigl [(X - m)(X - m)^{{\ast}}\bigr ]}\in \mathbb{C}^{n\times n}.}$$
  1. (a)

    Show that C is conjugate-symmetric and positive semi-definite. For what collection of vectors in \(\mathbb{C}^{n}\) is C the Gram matrix?

  2. (b)

    Show that if the support of X is all of \(\mathbb{C}^{n}\), then C is positive definite. Hint: suppose that C has non-trivial kernel, and construct an open half-space H of \(\mathbb{C}^{n}\) such that \(X\notin H\) almost surely.

Exercise 2.2.

Let X be any random variable taking values in a Hilbert space \(\mathcal{H}\), with mean \(m \in \mathcal{H}\) and covariance operator \(C: \mathcal{H}\times \mathcal{H}\rightarrow \mathbb{C}\) defined by

$$\displaystyle{C(h,k):= \mathbb{E}{\Bigl [\langle h,X - m\rangle \overline{\langle k,X - m\rangle }\Bigr ]}}$$

for h, \(k \in \mathcal{H}\). Show that C is conjugate-symmetric and positive semi-definite. Show also that if there is no subspace \(S \subseteq \mathcal{H}\) with \(\dim S \geq 1\) such that \(X \perp S\) with probability one, then C is positive definite.

Exercise 2.3.

Prove the finite-dimensional Cameron–Martin formula of Lemma 2.40. That is, let \(\mu = \mathcal{N}(m,C)\) be a Gaussian measure on \(\mathbb{R}^{d}\) and let \(v \in \mathbb{R}^{d}\), and show that the push-forward of μ by translation by v, namely \(\mathcal{N}(m + v,C)\), is equivalent to μ and

$$\displaystyle{\frac{\mathrm{d}(T_{v})_{{\ast}}\mu } {\mathrm{d}\mu } (x) =\exp \left (\langle v,x - m\rangle _{C^{-1}} -\frac{1} {2}\|v\|_{C^{-1}}^{2}\right ),}$$

i.e., for every integrable function f,

$$\displaystyle{\int _{\mathbb{R}^{d}}f(x + v)\,\mathrm{d}\mu (x) =\int _{\mathbb{R}^{d}}f(x)\exp \left (\langle v,x - m\rangle _{C^{-1}} -\frac{1} {2}\|v\|_{C^{-1}}^{2}\right )\mathrm{d}\mu (x).}$$

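A Monte Carlo sanity check of this identity on \(\mathbb{R}^{d}\) (not a proof, of course; the covariance, shift and test function below are arbitrary illustrative choices) can be coded in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 200_000
m = np.array([1.0, -0.5, 0.2])
A = rng.normal(size=(d, d))
C = A @ A.T + np.eye(d)                        # an arbitrary symmetric positive-definite covariance
v = np.array([0.3, 0.1, -0.2])                 # the shift vector
Cinv = np.linalg.inv(C)
f = lambda x: np.sin(x @ np.ones(d))           # an arbitrary bounded test function

X = rng.multivariate_normal(m, C, size=N)
lhs = f(X + v).mean()                          # int f(x + v) dmu(x)
density = np.exp((X - m) @ Cinv @ v - 0.5 * v @ Cinv @ v)
rhs = (f(X) * density).mean()                  # int f(x) exp(<v, x - m>_{C^{-1}} - |v|^2_{C^{-1}}/2) dmu(x)
# lhs and rhs agree up to Monte Carlo error.
```
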
Exercise 2.4.

Let \(T: \mathcal{H}\rightarrow \mathcal{K}\) be a bounded linear map between Hilbert spaces \(\mathcal{H}\) and \(\mathcal{K}\), with adjoint \(T^{{\ast}}: \mathcal{K}\rightarrow \mathcal{H}\), and let \(\mu = \mathcal{N}(m,C)\) be a Gaussian measure on \(\mathcal{H}\). Show that the push-forward measure \(T_{{\ast}}\mu\) is a Gaussian measure on \(\mathcal{K}\), and that \(T_{{\ast}}\mu = \mathcal{N}(Tm,TCT^{{\ast}})\).

Exercise 2.5.

For i = 1, 2, let \(X_{i} \sim \mathcal{N}(m_{i},C_{i})\) be independent Gaussian random variables taking values in Hilbert spaces \(\mathcal{H}_{i}\), and let \(T_{i}: \mathcal{H}_{i} \rightarrow \mathcal{K}\) be a bounded linear map taking values in another Hilbert space \(\mathcal{K}\), with adjoint \(T_{i}^{{\ast}}: \mathcal{K}\rightarrow \mathcal{H}_{i}\). Show that \(T_{1}X_{1} + T_{2}X_{2}\) is a Gaussian random variable in \(\mathcal{K}\) with

$$\displaystyle{T_{1}X_{1} + T_{2}X_{2} \sim \mathcal{N}{\bigl (T_{1}m_{1} + T_{2}m_{2},T_{1}C_{1}T_{1}^{{\ast}} + T_{ 2}C_{2}T_{2}^{{\ast}}\bigr )}.}$$

Give an example to show that the independence assumption is necessary.

Exercise 2.6.

Let \(\mathcal{H}\) and \(\mathcal{K}\) be Hilbert spaces. Suppose that \(A: \mathcal{H}\rightarrow \mathcal{H}\) and \(C: \mathcal{K}\rightarrow \mathcal{K}\) are self-adjoint and positive definite, that \(B: \mathcal{H}\rightarrow \mathcal{K}\), and that \(D: \mathcal{K}\rightarrow \mathcal{K}\) is self-adjoint and positive semi-definite. Show that the operator from \(\mathcal{H}\oplus \mathcal{K}\) to itself given in block form by

$$\displaystyle{\left [\begin{array}{*{10}c} A + B^{{\ast}}CB &-B^{{\ast}}C \\ -CB &C + D \end{array} \right ]}$$

is self-adjoint and positive-definite.

Exercise 2.7 (Inversion lemma).

Let \(\mathcal{H}\) and \(\mathcal{K}\) be Hilbert spaces, and let \(A: \mathcal{H}\rightarrow \mathcal{H}\), \(B: \mathcal{K}\rightarrow \mathcal{H}\), \(C: \mathcal{H}\rightarrow \mathcal{K}\), and \(D: \mathcal{K}\rightarrow \mathcal{K}\) be linear maps. Define \(M: \mathcal{H}\oplus \mathcal{K}\rightarrow \mathcal{H}\oplus \mathcal{K}\) in block form by

$$\displaystyle{M = \left [\begin{array}{*{10}c} A&B\\ C &D \end{array} \right ].}$$

Show that if A, D, \(A - BD^{-1}C\) and \(D - CA^{-1}B\) are all non-singular, then

$$\displaystyle{M^{-1} = \left [\begin{array}{*{10}c} A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -(D - CA^{-1}B)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1} \end{array} \right ]}$$

and

$$\displaystyle{M^{-1} = \left [\begin{array}{*{10}c} (A - BD^{-1}C)^{-1} & -(A - BD^{-1}C)^{-1}BD^{-1} \\ -D^{-1}C(A - BD^{-1}C)^{-1} & D^{-1} + D^{-1}C(A - BD^{-1}C)^{-1}BD^{-1} \end{array} \right ].}$$

Hence derive the Woodbury formula

$$\displaystyle{ (A + BD^{-1}C)^{-1} = A^{-1} - A^{-1}B(D + CA^{-1}B)^{-1}CA^{-1}. }$$
(2.9)

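These identities are easy to check numerically for small, generic matrices before attempting the algebra. A minimal Python/NumPy sketch (the blocks below are arbitrary, scaled so that all of the required inverses safely exist):

```python
import numpy as np

rng = np.random.default_rng(0)
nH, nK = 4, 3
A = 5.0 * np.eye(nH) + rng.normal(size=(nH, nH))
D = 5.0 * np.eye(nK) + rng.normal(size=(nK, nK))
B = 0.3 * rng.normal(size=(nH, nK))
Cm = 0.3 * rng.normal(size=(nK, nH))            # the block called C in the exercise

M = np.block([[A, B], [Cm, D]])
Ainv = np.linalg.inv(A)
Sinv = np.linalg.inv(D - Cm @ Ainv @ B)         # inverse of the Schur complement D - C A^{-1} B
Minv = np.block([[Ainv + Ainv @ B @ Sinv @ Cm @ Ainv, -Ainv @ B @ Sinv],
                 [-Sinv @ Cm @ Ainv,                   Sinv]])
assert np.allclose(Minv @ M, np.eye(nH + nK))   # first block-inverse formula

# Woodbury formula (2.9)
lhs = np.linalg.inv(A + B @ np.linalg.inv(D) @ Cm)
rhs = Ainv - Ainv @ B @ np.linalg.inv(D + Cm @ Ainv @ B) @ Cm @ Ainv
assert np.allclose(lhs, rhs)
```
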
Exercise 2.8.

Exercise 2.7 has a natural interpretation in terms of the conditioning of Gaussian random variables. Let \((X,Y ) \sim \mathcal{N}(m,C)\) be jointly Gaussian, where, in block form,

$$\displaystyle{m = \left [\begin{array}{*{10}c} m_{1} \\ m_{2} \end{array} \right ],\quad C = \left [\begin{array}{*{10}c} C_{11} & C_{12} \\ C_{12}^{{\ast}}&C_{22} \end{array} \right ],}$$

and C is self-adjoint and positive definite.

  1. (a)

    Show that C 11 and C 22 are self-adjoint and positive-definite.

  2. (b)

    Show that the Schur complement S defined by \(S:= C_{11} - C_{12}C_{22}^{-1}C_{12}^{{\ast}}\) is self-adjoint and positive definite, and

    $$\displaystyle{C^{-1} = \left [\begin{array}{*{10}c} S^{-1} & -S^{-1}C_{12}C_{22}^{-1} \\ -C_{22}^{-1}C_{12}^{{\ast}}S^{-1} & C_{22}^{-1} + C_{22}^{-1}C_{12}^{{\ast}}S^{-1}C_{12}C_{22}^{-1}\end{array} \right ].}$$
  3. (c)

    Hence prove Theorem 2.54, that the conditional distribution of X given that Y = y is Gaussian:

    $$\displaystyle{(X\vert Y = y) \sim \mathcal{N}{\bigl (m_{1} + C_{12}C_{22}^{-1}(y - m_{ 2}),S\bigr )}.}$$