
The preceding chapter showed that by using the first two moments of a multivariate distribution (the mean and the covariance matrix), a lot of information on the relationship between the variables can be made available. Only basic statistical theory was used to derive tests of independence or of linear relationships. In this chapter we give an introduction to the basic probability tools useful in statistical multivariate analysis.

Means and covariances share many interesting and useful properties, but they represent only part of the information on a multivariate distribution. Section 4.1 presents the basic probability tools used to describe a multivariate random variable, including marginal and conditional distributions and the concept of independence. In Section 4.2, basic properties of means and covariances (marginal and conditional ones) are derived.

Since many statistical procedures rely on transformations of a multivariate random variable, Section 4.3 presents the basic techniques needed to derive the distribution of transformations, with special emphasis on linear transformations. As an important example of a multivariate random variable, Section 4.4 defines the multinormal distribution. It will be analysed in more detail in Chapter 5 along with most of its “companion” distributions that are useful in making multivariate statistical inferences.

The normal distribution plays a central role in statistics because it can be viewed as an approximation and limit of many other distributions. The basic justification relies on the central limit theorem presented in Section 4.5. We present this central theorem in the framework of sampling theory. A useful extension of this theorem is also given: it provides an approximate distribution for transformations of asymptotically normal variables. The increasing power of computers today makes it possible to consider alternative approximate sampling distributions. These are based on resampling techniques and are suitable for many general situations. Section 4.8 gives an introduction to the ideas behind bootstrap approximations.

1 Distribution and Density Function

Let X=(X 1,X 2,…,X p ) be a random vector. The cumulative distribution function (cdf) of X is defined by

$$F(x) = \mathrm {P}(X\le x)=\mathrm {P}(X_1\le x_1, X_2\le x_2,\ldots ,X_p\le x_p). $$

For continuous X, a nonnegative probability density function (pdf) f exists such that

$$F(x) = \int ^{x}_{-\infty }f (u)d{u}.$$
(4.1)

Note that

$$\int ^{\infty}_{-\infty } f(u)\,d{u} =1.$$

Most of the integrals appearing below are multidimensional. For instance, \(\int_{-\infty}^{x} f(u) du\) means \(\int_{-\infty}^{x_{p}} \cdots \int_{-\infty}^{x_{1}} f(u_{1},\ldots,u_{p}) du_{1} \cdots du_{p}\). Note also that the cdf F is differentiable with

$$f(x) = \frac{\partial^p F(x)}{\partial x_1 \cdots \partial x_p}.$$

For discrete X, the values of this random variable are concentrated on a countable or finite set of points \(\{c_j\}_{j\in J}\); the probability of events of the form {X∈D} can then be computed as

$$\mathrm {P}(X\in D)=\sum _{\{j:c_j\in D\}} \mathrm {P}(X=c_j). $$

If we partition X as X=(X 1,X 2) with \(X_{1}\in \mathbb {R}^{k}\) and \(X_{2}\in \mathbb {R}^{p-k}\), then the function

$$F_{X_{1}}(x_1)=\mathrm {P}(X_1\le x_1)=F(x_{11},\ldots ,x_{1k},\infty ,\ldots, \infty)$$
(4.2)

is called the marginal cdf. F=F(x) is called the joint cdf. For continuous X the marginal pdf can be computed from the joint density by “integrating out” the variable not of interest.

$$f_{X_{1}}(x_1) = \int ^\infty _{-\infty }f(x_1,x_2) dx_2.$$
(4.3)

The conditional pdf of X 2 given X 1=x 1 is given as

$$f(x_2\mid x_1) = \frac{f(x_1,x_2) }{f_{X_{1}}(x_1)}\cdotp $$
(4.4)

Example 4.1

Consider the pdf

$$\everymath{\displaystyle}f(x_1,x_2) = \left \{\begin{array}{l@{\quad}l}\frac{1}{2} x_1 + \frac{3}{2} x_2 & 0\le x_1, x_2\le 1,\\0 & \mbox{otherwise.}\end{array} \right.$$

f(x 1,x 2) is a density since

$$\int f(x_1,x_2) dx_1 dx_2 = \frac{1}{2} \left[\frac{x^2_1}{2}\right]^1_0 + \frac{3}{2} \left[\frac{x^2_2}{2} \right]^1_0 = \frac{1}{4} + \frac{3}{4} = 1.$$

The marginal densities are

$$f_{X_{1}}(x_1) = \int_0^1 f(x_1,x_2)\,dx_2 = \frac{1}{2}x_1 + \frac{3}{4}\quad\mbox{and}\quad f_{X_{2}}(x_2) = \int_0^1 f(x_1,x_2)\,dx_1 = \frac{3}{2}x_2 + \frac{1}{4}.$$

The conditional densities are therefore

$$f(x_2\mid x_1) = \frac{\frac{1 }{2 }x_1+\frac{3 }{2}x_2 }{\frac{1 }{2 }x_1+\frac{3 }{4} }\quad\mbox{and}\quad f(x_1\mid x_2) = \frac{\frac{1 }{2 }x_1+\frac{3 }{2}x_2 }{\frac{3 }{2 }x_2+\frac{1 }{4} }\cdotp $$

Note that these conditional pdf’s are nonlinear in x 1 and x 2 although the joint pdf has a simple (linear) structure.
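The following short Python sketch (not part of the original text; it assumes NumPy and SciPy are available) numerically checks the statements of Example 4.1: the joint pdf integrates to one and integrating out x 2 reproduces the marginal ½x 1+¾.

```python
import numpy as np
from scipy import integrate

def f(x1, x2):
    """Joint pdf of Example 4.1 on the unit square."""
    return 0.5 * x1 + 1.5 * x2

# total mass (dblquad integrates func(y, x), so x2 is passed first)
total, _ = integrate.dblquad(lambda x2, x1: f(x1, x2), 0, 1,
                             lambda x1: 0, lambda x1: 1)

# marginal of X1 by integrating out x2; compare with 0.5*x1 + 0.75
x1_grid = np.linspace(0, 1, 5)
marg = [integrate.quad(lambda x2: f(x1, x2), 0, 1)[0] for x1 in x1_grid]

print(round(total, 6))                            # 1.0
print(np.allclose(marg, 0.5 * x1_grid + 0.75))    # True
```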

Independence of two random variables is defined as follows.

Definition 4.1

X 1 and X 2 are independent iff \(f(x) = f(x_{1},x_{2}) = f_{X_{1}}(x_{1}) f_{X_{2}}(x_{2})\).

That is, X 1 and X 2 are independent if the conditional pdf’s are equal to the marginal densities, i.e., \(f(x_{1} \mid x_{2}) = f_{X_{1}}(x_{1}) \) and \(f(x_{2} \mid x_{1}) = f_{X_{2}}(x_{2}) \). Independence can be interpreted as follows: knowing X 2=x 2 does not change the probability assessments on X 1, and conversely.

Different joint pdf’s may have the same marginal pdf’s.

Example 4.2

Consider the pdf’s

$$f(x_1,x_2)=1, \quad 0<x_1,x_2<1, $$

and

$$f(x_1,x_2)=1+\alpha (2x_1-1)(2x_2-1), \quad 0<x_1, \ x_2<1,\ -1\le \alpha \le 1.$$

We compute in both cases the marginal pdf’s as

$$f_{X_{1}}(x_1)=1, \qquad f_{X_{2}}(x_2)=1.$$

Indeed

$$\int ^1_0\bigl\{1+\alpha (2x_1-1)(2x_2-1)\bigr\}\,dx_2=1+\alpha(2x_1-1)\bigl[x^2_2-x_2\bigr]^1_0=1.$$

Hence we obtain identical marginals from different joint distributions.

Let us study the concept of independence using the bank notes example. Consider the variables X 4 (lower inner frame) and X 5 (upper inner frame). From Chapter 3, we already know that they have significant correlation, so they are almost surely not independent. Kernel estimates of the marginal densities, \(\widehat{f}_{X_{4}}\) and \(\widehat{f}_{X_{5}}\), are given in Figure 4.1. In Figure 4.2 (left) we show the product of these two densities. The kernel density technique was presented in Section 1.3. If X 4 and X 5 are independent, this product \(\widehat{f}_{X_{4}}\cdot \widehat{f}_{X_{5}}\) should be roughly equal to \(\widehat{f}(x_{4},x_{5})\), the estimate of the joint density of (X 4,X 5). Comparing the two graphs in Figure 4.2 reveals that the two densities are different. The two variables X 4 and X 5 are therefore not independent.

Fig. 4.1 Univariate estimates of the density of X 4 (left) and X 5 (right) of the bank notes  MVAdenbank2

Fig. 4.2 The product of univariate density estimates (left) and the joint density estimate (right) for X 4 (left) and X 5 of the bank notes  MVAdenbank3

An elegant concept of connecting marginals with joint cdfs is given by copulae. Copulae are important in Value-at-Risk calculations and are an essential tool in quantitative finance (Härdle, Hautsch and Overbeck, 2009).

For simplicity of presentation we concentrate on the p=2 dimensional case. A 2-dimensional copula is a function C: [0,1]2→[0,1] with the following properties:

  • For every u∈[0,1]: C(0,u)=C(u,0)=0.

  • For every u∈[0,1]: C(u,1)=u and C(1,u)=u.

  • For every (u 1,u 2),(v 1,v 2)∈[0,1]×[0,1] with u 1≤v 1 and u 2≤v 2:

    $$C(v_1,v_2) - C(v_1,u_2) - C(u_1,v_2) + C(u_1,u_2) \ge 0 \, .$$

The usage of the name “copula” for the function C is explained by the following theorem.

Theorem 4.1

(Sklar’s theorem)

Let F be a joint distribution function with marginal distribution functions \(F_{X_{1}}\) and \(F_{X_{2}}\). Then a copula C exists with

$$ F(x_1,x_2) = C\{ F_{X_1}(x_1),F_{X_2}(x_2)\}$$
(4.5)

for every \(x_{1},x_{2} \in \mathbb {R}\). If \(F_{X_{1}}\) and \(F_{X_{2}}\) are continuous, then C is unique. On the other hand, if C is a copula and \(F_{X_{1}}\) and \(F_{X_{2}}\) are distribution functions, then the function F defined by (4.5) is a joint distribution function with marginals \(F_{X_{1}}\) and \(F_{X_{2}}\).

With Sklar’s Theorem, the use of the name “copula” becomes obvious. It was chosen to describe “a function that links a multidimensional distribution to its one-dimensional margins” and appeared in the mathematical literature for the first time in Sklar (1959).

Example 4.3

The structure of independence implies that the product of the distribution functions \(F_{X_{1}}\) and \(F_{X_{2}}\) equals their joint distribution function F,

$$ F(x_1,x_2) = F_{X_1}(x_1) \cdot F_{X_2}(x_2).$$
(4.6)

Thus, we obtain the independence copula C=Π from

$$\Pi(u_1,\dots,u_n)=\prod_{i=1}^n u_i .$$

Theorem 4.2

Let X 1 and X 2 be random variables with continuous distribution functions \(F_{X_{1}}\) and \(F_{X_{2}}\) and the joint distribution function F. Then X 1 and X 2 are independent if and only if \(C_{X_{1}, X_{2}} = \Pi\).

Proof

From Sklar’s Theorem we know that there exists a unique copula C with

$$ \mathrm {P}(X_1 \le x_1, X_2 \le x_2) = F(x_1,x_2) =C\{F_{X_1}(x_1),F_{X_2}(x_2)\} .$$
(4.7)

Independence can be seen using (4.5) for the joint distribution function F and the definition of Π,

$$ F(x_1,x_2) = C\{F_{X_1}(x_1),F_{X_2}(x_2)\} = F_{X_1}(x_1) F_{X_2}(x_2) .$$
(4.8)

 □

Example 4.4

The Gumbel-Hougaard family of copulae (Nelsen, 1999) is given by the function

$$ C_{\theta}(u, v) = \exp \left[ - \left\{ (-\log u)^{\theta}+ (-\log v)^{\theta} \right\}^{1 / \theta} \right] .$$
(4.9)

The parameter θ may take all values in the interval [1,∞). The Gumbel-Hougaard copulae are suited to describe bivariate extreme value distributions.

For θ=1, the expression (4.9) reduces to the product copula, i.e., C 1(u,v)=Π(u,v)=uv. For θ→∞ one finds for the Gumbel-Hougaard copula:

$$C_{\theta}(u,v) {\longrightarrow}\min(u,v) = M(u,v),$$

where the function M is also a copula such that C(u,v)≤M(u,v) for arbitrary copula C. The copula M is called the Fréchet-Hoeffding upper bound.

Similarly, we obtain the Fréchet-Hoeffding lower bound W(u,v)=max(u+v−1,0) which satisfies W(u,v)≤C(u,v) for any other copula C.
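A minimal Python sketch (an illustration added here, not part of the text) evaluates the Gumbel-Hougaard copula (4.9) and checks the two limiting cases mentioned above: θ=1 gives the product copula Π and large θ approaches the Fréchet-Hoeffding upper bound M.

```python
import numpy as np

def gumbel_hougaard(u, v, theta):
    """Gumbel-Hougaard copula C_theta(u, v) for theta >= 1, cf. (4.9)."""
    return np.exp(-((-np.log(u)) ** theta + (-np.log(v)) ** theta) ** (1.0 / theta))

u, v = 0.3, 0.7
print(gumbel_hougaard(u, v, 1.0), u * v)        # theta = 1: the product copula Pi
print(gumbel_hougaard(u, v, 50.0), min(u, v))   # large theta: close to M(u, v)
```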


2 Moments and Characteristic Functions

2.1 Moments—Expectation and Covariance Matrix

If X is a random vector with density f(x) then the expectation of X is

$$\mathop {\mbox {\sf E}}X = \left ( \begin{array}{c} \mathop {\mbox {\sf E}}X_1\\ \vdots\\ \mathop {\mbox {\sf E}}X_p \end{array} \right )= \int x f(x)dx= \left ( \begin{array}{c} \int x_1 f(x)dx\\ \vdots\\\int x_p f(x)dx \end{array} \right )= \mu.$$
(4.10)

Accordingly, the expectation of a matrix of random elements has to be understood component by component. The operation of forming expectations is linear:

$$ \mathop {\mbox {\sf E}}\left (\alpha X+\beta Y \right ) = \alpha \mathop {\mbox {\sf E}}X +\beta \mathop {\mbox {\sf E}}Y.$$
(4.11)

If \({\mathcal{A}}(q \times p)\) is a matrix of real numbers, we have:

$$\mathop {\mbox {\sf E}}({\mathcal{A}}X) = {\mathcal{A}} \mathop {\mbox {\sf E}}X.$$
(4.12)

When X and Y are independent,

$$\mathop {\mbox {\sf E}}(XY^{\top}) = \mathop {\mbox {\sf E}}X \mathop {\mbox {\sf E}}Y^{\top}.$$
(4.13)

The matrix

$$\mathop {\mbox {\sf Var}}(X) = \Sigma =\mathop {\mbox {\sf E}}(X-\mu )(X-\mu )^{\top}$$
(4.14)

is the (theoretical) covariance matrix. We write for a vector X with mean vector μ and covariance matrix Σ,

$$X\sim (\mu ,\Sigma ).$$
(4.15)

The (p×q) matrix

$$\Sigma_{XY} = \mathop {\mbox {\sf Cov}}(X,Y)=\mathop {\mbox {\sf E}}(X-\mu )(Y-\nu )^{\top}$$
(4.16)

is the covariance matrix of X∼(μ,Σ XX ) and Y∼(ν,Σ YY ). Note that \(\Sigma_{XY} = \Sigma^{\top}_{YX}\) and that \(Z={X \choose Y}\) has covariance \(\Sigma_{ZZ}=\left({\Sigma_{XX} \atop \Sigma_{YX}}\ {\Sigma_{XY}\atop \Sigma_{YY}}\right)\). From

$$\mathop {\mbox {\sf Cov}}(X,Y) = \mathop {\mbox {\sf E}}(XY^{\top}) - \mu\nu^{\top}=\mathop {\mbox {\sf E}}(XY^{\top}) - \mathop {\mbox {\sf E}}X \mathop {\mbox {\sf E}}Y^{\top}$$
(4.17)

it follows that \(\mathop {\mbox {\sf Cov}}(X,Y)=0\) in the case where X and Y are independent. We often say that \(\mu = \mathop {\mbox {\sf E}}(X)\) is the first order moment of X and that \(\mathop {\mbox {\sf E}}(XX^{\top})\) provides the second order moments of X:

$$\mathop {\mbox {\sf E}}(XX^{\top}) = \{ \mathop {\mbox {\sf E}}(X_iX_j) \}, \quad\mbox{for } i=1,\ldots,p \mbox{ and } j=1,\ldots,p.$$
(4.18)

2.2 Properties of the Covariance Matrix \(\Sigma=\mathop {\mbox {\sf Var}}(X)\)

$$\Sigma = (\sigma_{X_iX_j}),\qquad \sigma_{X_iX_j}=\mathop {\mbox {\sf Cov}}(X_i,X_j),\qquad \sigma_{X_iX_i}=\mathop {\mbox {\sf Var}}(X_i).$$
(4.19)
$$\Sigma = \mathop {\mbox {\sf E}}(XX^{\top}) - \mu\mu^{\top}.$$
(4.20)
$$\Sigma \ge 0.$$
(4.21)

2.3 Properties of Variances and Covariances

$$\mathop {\mbox {\sf Var}}(a^{\top}X) = a^{\top}\mathop {\mbox {\sf Var}}(X)\,a = \sum_{i,j} a_i a_j\,\sigma_{X_iX_j}.$$
(4.22)
$$\mathop {\mbox {\sf Var}}({\mathcal{A}}X+b) = {\mathcal{A}}\mathop {\mbox {\sf Var}}(X)\,{\mathcal{A}}^{\top}.$$
(4.23)
$$\mathop {\mbox {\sf Cov}}(X+Y,Z) = \mathop {\mbox {\sf Cov}}(X,Z)+\mathop {\mbox {\sf Cov}}(Y,Z).$$
(4.24)
$$\mathop {\mbox {\sf Var}}(X+Y) = \mathop {\mbox {\sf Var}}(X)+\mathop {\mbox {\sf Cov}}(X,Y)+\mathop {\mbox {\sf Cov}}(Y,X)+\mathop {\mbox {\sf Var}}(Y).$$
(4.25)
$$\mathop {\mbox {\sf Cov}}({\mathcal{A}}X,{\mathcal{B}}Y) = {\mathcal{A}}\mathop {\mbox {\sf Cov}}(X,Y)\,{\mathcal{B}}^{\top}.$$
(4.26)
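The linearity properties (4.12) and (4.23) are easy to check by simulation. The following Python sketch (added for illustration; the particular μ, Σ, \({\mathcal{A}}\) and b are arbitrary choices, not from the text) compares empirical moments of a linearly transformed sample with the theoretical values.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are observations of X

A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])                       # (q x p) matrix, q = 2
b = np.array([5.0, -1.0])
Y = X @ A.T + b                                        # Y = A X + b, applied row-wise

print(Y.mean(axis=0), A @ mu + b)                      # E(AX + b) = A mu + b
print(np.cov(Y, rowvar=False))                         # approx A Sigma A^T
print(A @ Sigma @ A.T)
```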

Let us compute these quantities for a specific joint density.

Example 4.5

Consider the pdf of Example 4.1. The mean vector is

$$\mu = {\mu_1 \choose \mu_2},\qquad \mu_1 = \int_0^1 x_1\left(\frac{1}{2}x_1+\frac{3}{4}\right)dx_1 = \frac{13}{24},\qquad \mu_2 = \int_0^1 x_2\left(\frac{3}{2}x_2+\frac{1}{4}\right)dx_2 = \frac{5}{8}.$$

The elements of the covariance matrix are

$$\begin{array}{rcl}\sigma_{X_1X_1} &=& \mathop {\mbox {\sf E}}(X_1^2)-\mu_1^2 = \frac{3}{8}-\left(\frac{13}{24}\right)^2 = \frac{47}{576},\\[2mm]\sigma_{X_2X_2} &=& \mathop {\mbox {\sf E}}(X_2^2)-\mu_2^2 = \frac{11}{24}-\left(\frac{5}{8}\right)^2 = \frac{13}{192},\\[2mm]\sigma_{X_1X_2} &=& \mathop {\mbox {\sf E}}(X_1X_2)-\mu_1\mu_2 = \frac{1}{3}-\frac{13}{24}\cdot\frac{5}{8} = -\frac{1}{192}.\end{array}$$

Hence the covariance matrix is

$$\Sigma = \left( \begin{array}{c@{\quad}c} 0.0815 & -0.0052 \\-0.0052 & 0.0677 \end{array} \right). $$
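These values can be checked by a small Monte Carlo experiment. The sketch below (added here as an illustration; the rejection sampler is one possible choice, not the book's quantlet) draws from the pdf of Example 4.1 and reports the empirical mean vector and covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_f(n):
    """Rejection sampling from f(x1, x2) = 0.5*x1 + 1.5*x2 on [0,1]^2 (max value 2)."""
    out = []
    while len(out) < n:
        x = rng.uniform(size=(n, 2))
        u = rng.uniform(0.0, 2.0, size=n)
        out.extend(x[u < 0.5 * x[:, 0] + 1.5 * x[:, 1]])
    return np.array(out[:n])

x = sample_f(500_000)
print(x.mean(axis=0))             # approx (13/24, 5/8) = (0.542, 0.625)
print(np.cov(x, rowvar=False))    # approx [[0.082, -0.005], [-0.005, 0.068]]
```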

2.4 Conditional Expectations

The conditional expectations are

$$ \mathop {\mbox {\sf E}}(X_2\mid x_1) = \int x_2f(x_2\mid x_1)\;dx_2\quad\mbox{and}\quad \mathop {\mbox {\sf E}}(X_1\mid x_2) = \int x_1f(x_1\mid x_2)\;dx_1.$$
(4.27)

\(\mathop {\mbox {\sf E}}(X_{2}|x_{1})\) represents the location parameter of the conditional pdf of X 2 given that X 1=x 1. In the same way, we can define \(\mathop {\mbox {\sf Var}}(X_{2}|X_{1}=x_{1})\) as a measure of the dispersion of X 2 given that X 1=x 1. We have from (4.20) that

$$\mathop {\mbox {\sf Var}}(X_2|X_1=x_1) = \mathop {\mbox {\sf E}}(X_2\: X_2^{\top}|X_1=x_1) - \mathop {\mbox {\sf E}}(X_2|X_1=x_1) \,\mathop {\mbox {\sf E}}(X_2^{\top}|X_1=x_1). $$

Using the conditional covariance matrix, the conditional correlations may be defined as:

$$\rho_{X_{2}\: X_{3}|X_1=x_1} = \frac{\mathop {\mbox {\sf Cov}}(X_{2}, X_{3}|X_1=x_1)}{\sqrt{\mathop {\mbox {\sf Var}}(X_{2}|X_1=x_1)\, \mathop {\mbox {\sf Var}}(X_{3}|X_1=x_1)}}. $$

These conditional correlations are known as partial correlations between X 2 and X 3, conditioned on X 1 being equal to x 1.

Example 4.6

Consider the following pdf

$$f(x_1,x_2,x_3)=\frac{2}{3}(x_1+x_2+x_3)\quad \mbox{where } 0< x_1,x_2,x_3< 1.$$

Note that the pdf is symmetric in x 1,x 2 and x 3 which facilitates the computations. For instance,

$$\everymath{\displaystyle}\begin{array}{rcl@{\quad}l}f(x_1,x_2)&=&\frac{2}{3}\biggl(x_1+x_2+\frac{1}{2}\biggr)& 0< x_1,x_2< 1\\[8pt]f(x_1)&=&\frac{2}{3}(x_1+1) & 0< x_1 < 1\end{array}$$

and the other marginals are similar. We also have

$$\begin{array}{rcl@{\quad}l}f(x_1,x_2|x_3)&=&\displaystyle\frac{x_1+x_2+x_3}{x_3+1}, & 0< x_1,x_2< 1\\f(x_1|x_3)&=&\displaystyle\frac{x_1+x_3+\frac{1}{2}}{x_3+1}, & 0< x_1< 1.\end{array}$$

It is easy to compute the following moments:

$$\mathop {\mbox {\sf E}}(X_i)=\frac{5}{9};\qquad \mathop {\mbox {\sf E}}(X_i^2)=\frac{7}{18};\qquad \mathop {\mbox {\sf E}}(X_iX_j)=\frac{11}{36}\quad \left(i\not= j \mbox{ and }i,j =1,2,3\right)$$

and

$$\mathop {\mbox {\sf E}}(X_1X_2|X_3=x_3)=\frac{1}{12}\left(\frac{3x_3+4}{x_3+1}\right).$$

Note that the conditional means of X 1 and of X 2, given X 3=x 3, are not linear in x 3. From these moments we obtain:

$$\Sigma =\left(\begin{array}{r@{\quad}r@{\quad}r}\frac{13}{162}&-\frac{1}{324} &-\frac{1}{324}\\-\frac{1}{324} &\frac{13}{162}&-\frac{1}{324}\\-\frac{1}{324}&-\frac{1}{324}&\frac{13}{162}\end{array}\right)\quad \mbox{in particular}\quad \rho_{X_1X_2}=-\frac{1}{26} \approx -0.0385.$$

The conditional covariance matrix of X 1 and X 2, given X 3=x 3 is

$$\mathop {\mbox {\sf Var}}\left({X_1 \choose X_2}\mid X_3=x_3\right)=\left(\begin{array}{l@{\quad}l}\frac{12x_3^2+24x_3+11}{144(x_3+1)^2} & \frac{-1}{144(x_3+1)^2}\\[4pt]\frac{-1}{144(x_3+1)^2} & \frac{12x_3^2+24x_3+11}{144(x_3+1)^2}\end{array}\right).$$

In particular, the partial correlation between X 1 and X 2, given that X 3 is fixed at x 3, is given by \(\rho _{X_{1}X_{2}|X_{3}=x_{3}}=-\frac{1}{12x_{3}^{2}+24x_{3}+11}\) which ranges from −0.0909 to −0.0213 when x 3 goes from 0 to 1. Therefore, in this example, the partial correlation may be larger or smaller than the simple correlation, depending on the value of the condition X 3=x 3.

Example 4.7

Consider the following joint pdf

$$f(x_1,x_2,x_3)= 2x_2(x_1+x_3);\quad 0< x_1,x_2,x_3 < 1.$$

Note the symmetry of x 1 and x 3 in the pdf and that X 2 is independent of (X 1,X 3). It immediately follows that

$$f_{X_2}(x_2) = 2x_2,\qquad f_{X_1,X_3}(x_1,x_3)=x_1+x_3,\qquad f_{X_1}(x_1)=x_1+\frac{1}{2},\qquad f_{X_3}(x_3)=x_3+\frac{1}{2}.$$

Simple computations lead to

$$\mathop {\mbox {\sf E}}(X)=\left(\begin{array}{c}\frac{7}{12}\\[3mm]\frac{2}{3}\\[3mm]\frac{7}{12}\end{array}\right)\quad\mbox{and} \quad \Sigma = \left(\begin{array}{r@{\quad}r@{\quad}r}\frac{11}{144} & 0 & -\frac{1}{144}\\0 & \frac{1}{18} & 0\\-\frac{1}{144} & 0 & \frac{11}{144}\end{array}\right).$$

Let us analyze the conditional distribution of (X 1,X 2) given X 3=x 3. We have

$$\everymath{\displaystyle}\begin{array}{rcl@{\quad}l}f(x_1,x_2|x_3) &=& \frac{4(x_1+x_3)x_2}{2x_3+1} & 0 < x_1,x_2 < 1\\f(x_1|x_3) &=& 2 \left( \frac{x_1+x_3}{2x_3+1} \right) & 0 < x_1 < 1\\f(x_2|x_3) &=& f(x_2)= 2x_2 & 0 < x_2 < 1\end{array}$$

so that again X 1 and X 2 are independent conditional on X 3=x 3. In this case

$$\mathop {\mbox {\sf Var}}\left({X_1 \choose X_2}\mid X_3=x_3\right)=\left(\begin{array}{c@{\quad}c}\frac{6x_3^2+6x_3+1}{18(2x_3+1)^2} & 0\\[4pt]0 & \frac{1}{18}\end{array}\right).$$

2.5 Properties of Conditional Expectations

Since \(\mathop {\mbox {\sf E}}(X_{2}|X_{1}=x_{1})\) is a function of x 1, say h(x 1), we can define the random variable \(h(X_{1}) = \mathop {\mbox {\sf E}}(X_{2}|X_{1})\). The same can be done when defining the random variable \(\mathop {\mbox {\sf Var}}(X_{2}|X_{1})\). These two random variables share some interesting properties:

$$\mathop {\mbox {\sf E}}\{\mathop {\mbox {\sf E}}(X_2|X_1)\} = \mathop {\mbox {\sf E}}(X_2).$$
(4.28)
$$\mathop {\mbox {\sf Var}}(X_2) = \mathop {\mbox {\sf E}}\{\mathop {\mbox {\sf Var}}(X_2|X_1)\} + \mathop {\mbox {\sf Var}}\{\mathop {\mbox {\sf E}}(X_2|X_1)\}.$$
(4.29)

Example 4.8

Consider the following pdf

$$f(x_1,x_2)=2e^{-\frac{x_2}{x_1}} ;\quad 0< x_1 < 1,\ x_2 >0.$$

It is easy to show that

$$f(x_1)=2x_1\quad \mbox{for } 0<x_1<1 ;\qquad \mathop {\mbox {\sf E}}(X_1)=\frac{2}{3}\quad \mbox{and}\quad \mathop {\mbox {\sf Var}}(X_1)=\frac{1}{18}$$
$$f(x_2|x_1)=\frac{1}{x_1}e^{-\frac{x_2}{x_1}}\quad \mbox{for } x_2>0;\qquad \mathop {\mbox {\sf E}}(X_2|X_1)=X_1\quad \mbox{and}\quad \mathop {\mbox {\sf Var}}(X_2|X_1)=X_1^2.$$

Without explicitly computing f(x 2), we can obtain:

$$\mathop {\mbox {\sf E}}(X_2) = \mathop {\mbox {\sf E}}\{\mathop {\mbox {\sf E}}(X_2|X_1)\} = \mathop {\mbox {\sf E}}(X_1)=\frac{2}{3}$$
$$\mathop {\mbox {\sf Var}}(X_2) = \mathop {\mbox {\sf E}}\{\mathop {\mbox {\sf Var}}(X_2|X_1)\} + \mathop {\mbox {\sf Var}}\{\mathop {\mbox {\sf E}}(X_2|X_1)\} = \mathop {\mbox {\sf E}}(X_1^2)+\mathop {\mbox {\sf Var}}(X_1)=\frac{1}{2}+\frac{1}{18}=\frac{10}{18}.$$

The conditional expectation \(\mathop {\mbox {\sf E}}(X_{2}|X_{1})\) viewed as a function h(X 1) of X 1 (known as the regression function of X 2 on X 1), can be interpreted as a conditional approximation of X 2 by a function of X 1. The error term of the approximation is then given by:

$$U = X_2 - \mathop {\mbox {\sf E}}(X_2|X_1). $$

Theorem 4.3

Let \(X_{1} \in \mathbb {R}^{k}\) and \(X_{2} \in \mathbb {R}^{p-k}\) and \(U = X_{2} - \mathop {\mbox {\sf E}}(X_{2}|X_{1})\). Then we have:

  1. \(\mathop {\mbox {\sf E}}(U) = 0\)

  2. \(\mathop {\mbox {\sf E}}(X_{2}|X_{1})\) is the best approximation of X 2 by a function h(X 1) of X 1 where \(h:\; \mathbb {R}^{k} \longrightarrow \mathbb {R}^{p-k}\). “Best” is the minimum mean squared error (MSE), where

    $$MSE(h) = \mathop {\mbox {\sf E}}[\{X_2 - h(X_1)\}^{\top} \, \{X_2 - h(X_1)\}].$$
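Properties (4.28), (4.29) and Theorem 4.3 can be illustrated with the model of Example 4.8. The Python sketch below (an illustration added here, not part of the text) simulates X 1 with density 2x 1 and X 2|X 1 exponential with mean X 1, checks the iterated-expectation identities, and shows that the regression function has smaller MSE than the best constant predictor.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Example 4.8: X1 has density 2*x1 on (0,1)  (so X1 = sqrt(U) with U uniform),
# and X2 | X1 = x1 is exponential with mean x1: E(X2|X1) = X1, Var(X2|X1) = X1^2
x1 = np.sqrt(rng.uniform(size=n))
x2 = rng.exponential(scale=x1)

# (4.28) and (4.29): E(X2) = E{E(X2|X1)}, Var(X2) = E{Var(X2|X1)} + Var{E(X2|X1)}
print(x2.mean(), 2 / 3)                 # both approx 0.667
print(x2.var(), 1 / 2 + 1 / 18)         # both approx 0.556

# Theorem 4.3: h(X1) = E(X2|X1) = X1 has a smaller MSE than the best constant
print(np.mean((x2 - x1) ** 2) < np.mean((x2 - x2.mean()) ** 2))   # True
```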

2.6 Characteristic Functions

The characteristic function (cf) of a random vector \(X\in \mathbb {R}^{p}\) (respectively its density f(x)) is defined as

$$\varphi_X(t) = \mathop {\mbox {\sf E}}(e^{\mathbf{i}t^{\top}X})= \int e^{\mathbf{i}t^{\top}x}f(x)\;dx,\quad t \in \mathbb {R}^p, $$

where \(\mathbf{i}\) is the complex unit: \(\mathbf{i}^{2} = -1\). The cf has the following properties:

$$\varphi_X(0) = 1\quad \mbox{and}\quad |\varphi_X(t)| \le 1.$$
(4.30)

If φ is absolutely integrable, i.e., the integral \(\int_{-\infty}^{\infty}|\varphi(x)| dx\) exists and is finite, then

$$f(x) = \frac{1}{(2\pi)^p} \int^\infty_{-\infty}e^{-\mathbf{i}t^{\top}x}\varphi_X(t)\;dt.$$
(4.31)

If X=(X 1,X 2,…,X p ), then for t=(t 1,t 2,…,t p )

$$\varphi_{X_1}(t_1) = \varphi_X(t_1,0,\ldots,0),\quad\ldots,\quad \varphi_{X_p}(t_p) = \varphi_X(0,\ldots,0,t_{p}).\ $$
(4.32)

If X 1,…,X p are independent random variables, then for t=(t 1,t 2,…,t p )

$$\varphi_X(t) = \varphi_{X_1}(t_1)\cdotp\ldots\cdotp\varphi_{X_p}(t_p). $$
(4.33)

If X 1,…,X p are independent random variables, then for \(t\in \mathbb {R}\)

$$\varphi_{X_{1}+ \cdots +X_{p}}(t) = \varphi_{X_1}(t)\cdotp\ldots\cdotp\varphi_{X_p}(t).$$
(4.34)

The characteristic function can recover all the cross-product moments of any order: ∀j k ≥0,k=1,…,p and for t=(t 1,…,t p ) we have

$$\mathop {\mbox {\sf E}}\left( X_1^{j_1}\cdotp\ldots\cdotp X_p^{j_p} \right) =\frac{1}{\mathbf{i}^{j_1 + \cdots + j_p} }\left [\frac{\partial^{\,j_1+\cdots+j_p} \varphi_X(t) }{\partial t_1^{j_1} \cdots \partial t_p^{j_p} } \right]_{t=0}.$$
(4.35)

Example 4.9

The cf of the density in Example 4.5 is given by

Example 4.10

Suppose \(X\in \mathbb {R}^{1}\) follows the density of the standard normal distribution

$$f_{X}(x) = \frac{1}{\sqrt{2\pi}} \exp \left(-\frac{x^2}{2}\right) $$

(see Section 4.4) then the cf can be computed via

$$\varphi_X(t) = \int \frac{1}{\sqrt{2\pi}}\, e^{\mathbf{i}tx}\exp\left(-\frac{x^2}{2}\right)dx= \exp\left(-\frac{t^2}{2}\right)\int \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(x-\mathbf{i}t)^2}{2}\right\}dx= \exp\left(-\frac{t^2}{2}\right),$$

since \(\mathbf{i}^{2}=-1\) and \(\int \frac{1}{\sqrt{2\pi}} \exp \bigl\{-\frac {(x-\mathbf{i}t)^{2}}{2}\bigr\}\,dx=1\).
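The cf of the standard normal can also be checked empirically by averaging \(e^{\mathbf{i}tX}\) over a simulated sample. A minimal Python sketch (added for illustration, with an arbitrary grid of t values):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(200_000)
t = np.array([-2.0, -1.0, 0.0, 0.5, 1.5])

cf_mc = np.exp(1j * np.outer(t, x)).mean(axis=1)   # Monte Carlo estimate of E(exp(i t X))
cf_exact = np.exp(-t ** 2 / 2)                     # phi_X(t) = exp(-t^2 / 2)
print(np.round(np.abs(cf_mc - cf_exact), 3))       # all entries close to 0
```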

A variety of distributional characteristics can be computed from φ X (t). The standard normal distribution has a very simple cf, as was seen in Example 4.10. Deviations from normal covariance structures can be measured by the deviations from the cf (or characteristics of it). In Table 4.1 we give an overview of the cf’s for a variety of distributions.

Table 4.1 Characteristic functions for some common distributions

Theorem 4.4

(Cramer-Wold)

The distribution of \(X\in \mathbb {R}^{p}\) is completely determined by the set of all (one-dimensional) distributions of \(t^{\top}X\) where \(t\in \mathbb {R}^{p}\).

This theorem says that we can determine the distribution of X in \(\mathbb {R}^{p}\) by specifying all of the one-dimensional distributions of the linear combinations

$$\sum^p_{j=1} t_jX_j = t^{\top}X,\quad t = (t_{1},t_{2},\ldots,t_{p})^{\top}. $$

2.7 Cumulant Functions

Moments m k =∫x k f(x)dx often help in describing distributional characteristics. The normal distribution in d=1 dimension is completely characterised by its standard normal density f=φ and the moment parameters are μ=m 1 and \(\sigma^{2}=m_{2}-m_{1}^{2}\). Another helpful class of parameters are the cumulants or semi-invariants of a distribution. In order to simplify notation we concentrate here on the one-dimensional (d=1) case.

For a given one dimensional random variable X with density f and finite moments of order k the characteristic function \(\varphi_{X}(t)=\mathop {\mbox {\sf E}}(e^{\mathbf{ i}tX})\) has the derivative

$$\frac{1}{\mathbf{ i}^j} \left[ \frac{\partial^j \log \left\{\varphi_X(t)\right\}}{\partial t^j }\right]_{t=0} = \kappa_j,\quad j=1,\ldots,k.$$

The values κ j are called cumulants or semi-invariants since κ j does not change (for j>1) under a shift transformation XX+a. The cumulants are natural parameters for dimension reduction methods, in particular the Projection Pursuit method (see Section 19.2).

The relationship between the first k moments m 1,…,m k and the cumulants is given by

(4.36)

Example 4.11

Suppose that k=1, then formula (4.36) above yields

$$\kappa_1=m_1.$$

For k=2 we obtain

$$\kappa_2=m_2-m_1^2,$$

which is the variance.

For k=3 we have to calculate

$$\kappa_3 =\left|\begin{array}{c@{\quad}c@{\quad}c}m_1 & 1 & 0\\m_2 & m_1 & 1\\m_3&m_2&2m_1\\\end{array}\right|.$$

Calculating the determinant we have:

$$\kappa_3 = m_3 - 3m_1m_2 + 2m_1^3.$$
(4.37)

Similarly one calculates

$$ \kappa_4=m_4-4m_3m_1-3m_2^2+12m_2m_1^2-6m_1^4.$$
(4.38)

The same type of process is used to find the moments from the cumulants:

$$ \everymath{\displaystyle}\begin{array}{rcl}m_1 & = & \kappa_1\\[2mm]m_2 & = & \kappa_2+\kappa_1^2\\[2mm]m_3 & = & \kappa_3 + 3\kappa_2\kappa_1 + \kappa_1^3\\[2mm]m_4 & = & \kappa_4 + 4\kappa_3\kappa_1+3\kappa_2^2+6\kappa_2\kappa_1^2 +\kappa_1^4.\end{array}$$
(4.39)

A very simple relationship can be observed between the semi-invariants and the central moments \(\mu_{k}=\mathop {\mbox {\sf E}}(X-\mu)^{k}\), where μ=m 1 as defined before. In fact, κ 2=μ 2, κ 3=μ 3 and \(\kappa_{4}=\mu_{4}-3\mu_{2}^{2}\).

Skewness γ 3 and kurtosis γ 4 are defined as:

$$ \everymath{\displaystyle}\begin{array}{rcl}\gamma_3 & =& \mathop {\mbox {\sf E}}(X-\mu)^3/\sigma^3\\[2mm]\gamma_4 & =& \mathop {\mbox {\sf E}}(X-\mu)^4/\sigma^4.\end{array}$$
(4.40)

The skewness and kurtosis determine the shape of one-dimensional distributions. The skewness of a normal distribution is 0 and the kurtosis equals 3. The relation of these parameters to the cumulants is given by:

$$\gamma_3 = \frac{\kappa_3}{\kappa_2^{3/2}},\qquad \gamma_4=\frac{\kappa_4}{\kappa_2^{2}}+3.$$
(4.41)

From (4.39) and Example 4.11

(4.42)

These relations will be used later in Section 19.2 on Projection Pursuit to determine deviations from normality.
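The moment-to-cumulant formulas (4.37) and (4.38) are straightforward to apply to sample moments. The Python sketch below (an added illustration; the Exp(1) test case is an arbitrary choice) compares the resulting skewness and kurtosis with SciPy's estimators.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=1_000_000)    # a skewed, leptokurtic test case

m1, m2, m3, m4 = (np.mean(x ** k) for k in (1, 2, 3, 4))
k2 = m2 - m1 ** 2
k3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3                                      # (4.37)
k4 = m4 - 4 * m3 * m1 - 3 * m2 ** 2 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4    # (4.38)

# skewness and kurtosis from the cumulants, compared with scipy's estimators
print(k3 / k2 ** 1.5, stats.skew(x))                       # both approx 2 for Exp(1)
print(k4 / k2 ** 2 + 3, stats.kurtosis(x, fisher=False))   # both approx 9 for Exp(1)
```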


3 Transformations

Suppose that X has pdf f X (x). What is the pdf of Y=3X? Or if X=(X 1,X 2,X 3), what is the pdf of

$$Y = \left( \begin{array}{c} 3X_1 \\ X_1-4X_2 \\ X_3 \end{array} \right) ?$$

This is a special case of asking for the pdf of Y when

$$ X = u(Y)$$
(4.43)

for a one-to-one transformation u: \(\mathbb {R}^{p} \rightarrow \mathbb {R}^{p}\). Define the Jacobian of u as

$${\mathcal{J}} = \left( \frac{\partial x_i}{\partial y_j} \right)= \left( \frac{\partial u_i(y)}{\partial y_j} \right)$$

and let \(\mathop{\rm{abs}}(|{\mathcal{J}}|)\) be the absolute value of the determinant of this Jacobian. The pdf of Y is given by

$$f_Y(y) = \mathop{\rm{abs}}(|{\mathcal{J}}|) \cdot f_X\{u(y)\}.$$
(4.44)

Using this we can answer the introductory questions, namely

$$(x_1, \ldots, x_p)^{\top} = u(y_1, \ldots, y_p) = \frac{1}{3}(y_1, \ldots, y_p)^{\top} $$

with

$${\mathcal{J}} = \left( \begin{array}{c@{\quad}c@{\quad}c}\frac{1}{3} & & 0 \\& \ddots & \\0 & & \frac{1}{3} \end{array} \right) $$

and hence \(\mathop{\rm{abs}}(|{\mathcal{J}}|) = ( \frac{1}{3} )^{p}\). So the pdf of Y is \(\frac{1}{3^{p}} f_{X} ( \frac{y}{3})\).

This introductory example is a special case of

$$Y = {\mathcal{A}}X + b,\quad \mbox{where ${\mathcal{A}}$ is nonsingular}.$$

The inverse transformation is

$$X = {\mathcal{A}}^{-1}(Y-b). $$

Therefore

$${\mathcal{J}} = {\mathcal{A}}^{-1}, $$

and hence

$$ f_Y(y) = \mathop{\rm{abs}}(|{\mathcal{A}}|^{-1})f_X\{{\mathcal{A}}^{-1}(y-b)\}.$$
(4.45)

Example 4.12

Consider \(X=(X_{1},X_{2})\in \mathbb {R}^{2}\) with density f X (x)=f X (x 1,x 2),

$${\mathcal{A}} = \left( \begin{array}{r@{\quad}r} 1 & 1 \\ 1 & -1 \end{array}\right), \qquad b = \left( 0 \atop 0 \right).$$

Then

$$Y = {\mathcal{A}}X + b= \left( \begin{array}{c} X_1+X_2 \\ X_1-X_2 \end{array}\right) $$

and

$$|{\mathcal{A}}| = -2,\qquad \mathop{\rm{abs}}(|{\mathcal{A}}|^{-1}) = \frac{1}{2},\qquad {\mathcal{A}}^{-1} =-\frac{1}{2}\left( \begin{array}{r@{\quad}r} -1 & -1 \\-1 & 1 \end{array} \right).$$

Hence

$$f_Y(y) = \frac{1}{2}\, f_X\left\{\frac{1}{2}(y_1+y_2),\ \frac{1}{2}(y_1-y_2)\right\}.$$
(4.46)
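Formula (4.45) can be verified numerically. The sketch below (added here; it chooses X 1, X 2 i.i.d. standard normal and an arbitrary evaluation point, neither of which is prescribed by the text) compares a histogram-type density estimate of Y with the transformation formula.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Example 4.12 with X1, X2 i.i.d. N(0,1), so f_X(x1, x2) = phi(x1) * phi(x2)
A = np.array([[1.0, 1.0],
              [1.0, -1.0]])
X = rng.standard_normal((1_000_000, 2))
Y = X @ A.T                                   # Y = A X  (b = 0)

def f_Y(y1, y2):
    """Density of Y from (4.45): abs(|A|)^{-1} * f_X(A^{-1} y)."""
    x1, x2 = np.linalg.solve(A, [y1, y2])
    return abs(1.0 / np.linalg.det(A)) * stats.norm.pdf(x1) * stats.norm.pdf(x2)

# histogram-type estimate of the density of Y at y0 versus the formula
y0, h = np.array([0.5, -0.3]), 0.2
inside = np.all(np.abs(Y - y0) < h / 2, axis=1)
print(inside.mean() / h ** 2, f_Y(0.5, -0.3))   # both approx 0.073
```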

Example 4.13

Consider \(X\in \mathbb {R}^{1}\) with density f X (x) and Y=exp(X). According to (4.43) x=u(y)=log(y) and hence the Jacobian is

$${{\mathcal{J}}}=\frac{dx}{dy}=\frac{1}{y}.$$

The pdf of Y is therefore:

$$f_Y(y)=\frac{1}{y}f_X\{\log(y)\}.$$

4 The Multinormal Distribution

The multinormal distribution with mean μ and covariance Σ>0 has the density

$$ f(x) = |2 \pi \Sigma |^{-1/2} \exp \left \{ -\frac{1}{2}(x- \mu)^{\top} \Sigma^{-1}(x- \mu) \right \}. $$
(4.47)

We write XN p (μ,Σ).

How is this multinormal distribution with mean μ and covariance Σ related to the multivariate standard normal \(N_{p}(0,{{\mathcal{I}}}_{p}) \)? Through a linear transformation using the results of Section 4.3, as shown in the next theorem.

Theorem 4.5

Let XN p (μ,Σ) and Y−1/2(Xμ) (Mahalanobis transformation). Then

$$Y\sim N_p(0,{\mathcal{I}}_p),$$

i.e., the elements \(Y_{j}\in \mathbb {R}\) are independent, one-dimensional N(0,1) variables.

Proof

Note that \((X-\mu)^{\top}\Sigma^{-1}(X-\mu)=Y^{\top}Y\). Application of (4.45) gives \({\mathcal{J}} = \Sigma^{1/2}\), hence

$$f_Y(y) = (2 \pi)^{-p/2} \exp \left(-\frac{1}{2}y^{\top}y \right)$$
(4.48)

which is by (4.47) the pdf of a \(N_{p}(0,{\mathcal{I}}_{p})\). □

Note that the above Mahalanobis transformation yields in fact a random variable Y=(Y 1,…,Y p ) composed of independent one-dimensional Y j ∼N 1(0,1) since

$$f_Y(y) = (2\pi)^{-p/2}\exp\left(-\frac{1}{2}y^{\top}y\right)=\prod_{j=1}^{p}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y_j^2}{2}\right)=\prod_{j=1}^{p}f_{Y_j}(y_j).$$

Here each \(f_{Y_{j}}(y)\) is a standard normal density \(\frac{1}{\sqrt{2\pi}}\exp (-\frac{y^{2}}{2} ) \). From this it is clear that \(\mathop {\mbox {\sf E}}(Y)=0\) and \(\mathop {\mbox {\sf Var}}(Y)= {\mathcal{I}}_{p}\).

How can we create N p (μ,Σ) variables on the basis of \(N_{p}(0,{\mathcal{I}}_{p})\) variables? We use the inverse linear transformation

$$X = \Sigma^{1/2}Y + \mu. $$
(4.49)

Using (4.11) and (4.23) we can also check that \(\mathop {\mbox {\sf E}}(X)= \mu\) and \(\mathop {\mbox {\sf Var}}(X) = \Sigma\). The following theorem is useful because it presents the distribution of a variable after it has been linearly transformed. The proof is left as an exercise.
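A minimal Python sketch of this construction (added for illustration; the particular μ and Σ are taken from the contour example of Figure 4.3, and the symmetric square root is computed from the spectral decomposition):

```python
import numpy as np

rng = np.random.default_rng(6)

mu = np.array([3.0, 2.0])
Sigma = np.array([[1.0, -1.5],
                  [-1.5, 4.0]])

# symmetric square root Sigma^{1/2} from the spectral decomposition
lam, G = np.linalg.eigh(Sigma)
Sigma_half = G @ np.diag(np.sqrt(lam)) @ G.T

Y = rng.standard_normal((500_000, 2))    # Y ~ N_2(0, I_2)
X = Y @ Sigma_half + mu                  # X = Sigma^{1/2} Y + mu, cf. (4.49)

print(X.mean(axis=0))                    # approx mu
print(np.cov(X, rowvar=False))           # approx Sigma
```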

Theorem 4.6

Let XN p (μ,Σ) and \({\mathcal{A}}(p\times p),\; c \in \mathbb {R}^{p}\), where \({\mathcal{A}}\) is nonsingular. Then \(Y = {\mathcal{A}} X +c\) is again a p-variate Normal, i.e.,

$$ Y \sim N_p ( {\mathcal{A}} \mu +c, {\mathcal{A}} \Sigma {\mathcal{A}}^{\top}).$$
(4.50)

4.1 Geometry of the N p (μ,Σ) Distribution

From (4.47) we see that the density of the N p (μ,Σ) distribution is constant on ellipsoids of the form

$$(x-\mu )^{\top} \Sigma^{-1}(x-\mu) = d^2. $$
(4.51)

Example 4.14

Figure 4.3 shows the contour ellipses of a two-dimensional normal distribution. Note that these contour ellipses are the iso-distance curves (2.34) from the mean of this normal distribution corresponding to the metric Σ−1.

Fig. 4.3 Scatterplot of a normal sample and contour ellipses for \(\mu=\left(3\atop 2\right)\) and \(\Sigma=\left({1\atop -1.5}\ {-1.5\atop 4}\right)\)  MVAcontnorm

According to Theorem 2.7 in Section 2.6 the half-lengths of the axes in the contour ellipsoid are \(\sqrt{d^{2} \lambda_{i}}\) where λ i are the eigenvalues of Σ. If Σ is a diagonal matrix, the rectangle circumscribing the contour ellipse has sides with length \(2d\sigma_i\) and is thus naturally proportional to the standard deviations of X i (i=1,2).

The distribution of the quadratic form in (4.51) is given in the next theorem.

Theorem 4.7

If XN p (μ,Σ), then the variable U=(Xμ)Σ−1(Xμ) has a \(\chi^{2}_{p}\) distribution.

Theorem 4.8

The characteristic function (cf) of a multinormal N p (μ,Σ) is given by

$$\varphi_X(t) = \exp\biggl(\mathbf{i} t^{\top} \mu -\frac{1}{2}t^{\top}\Sigma t\biggr).$$
(4.52)

We can check Theorem 4.8 by transforming the cf back:

since

Note that if \(Y\sim N_{p}(0,{\mathcal{I}}_{p})\) (e.g., the Mahalanobis-transform), then

$$\varphi_Y(t) = \exp\left(-\frac{1}{2}t^{\top}t\right)=\prod_{j=1}^{p}\exp\left(-\frac{1}{2}t_j^2\right)=\varphi_{Y_1}(t_1)\cdotp\ldots\cdotp\varphi_{Y_p}(t_p),$$

which is consistent with (4.33).

4.2 Singular Normal Distribution

Suppose that we have \(\mathop {\rm {rank}}(\Sigma ) = k < p \), where p is the dimension of X. We define the (singular) density of X with the aid of the G-Inverse \(\Sigma^{-}\) of Σ,

$$f(x) = \frac{(2\pi)^{-k/2}}{(\lambda_1 \cdots \lambda_k)^{1/2}}\exp \left \{ -\frac{1}{2} (x-\mu)^{\top} \Sigma^{-} (x-\mu) \right \} $$
(4.53)

where

  1. x lies on the hyperplane \({\mathcal{N}}^{\top} (x-\mu) = 0 \) with \({\mathcal{N}} (p \times (p-k)) : {\mathcal{N}}^{\top} \Sigma = 0 \) and \({\mathcal{N}}^{\top} {\mathcal{N}} = {\mathcal{I}}_{p-k} \).

  2. \(\Sigma^{-}\) is the G-Inverse of Σ, and λ 1,…,λ k are the nonzero eigenvalues of Σ.

What is the connection to a multinormal with k-dimensions? If

$$ Y \sim N_k (0, \Lambda_1)\quad \mbox{and}\quad \Lambda_1 = \mathop {\rm {diag}}(\lambda_1,\ldots, \lambda_k),$$
(4.54)

then an orthogonal matrix \({\mathcal{B}} (p \times k) \) with \({\mathcal{B}} ^{\top} {\mathcal{B}}= {\mathcal{I}}_{k}\) exists such that \(X = {\mathcal{B}} Y + \mu \), where X has a singular pdf of the form (4.53).

4.3 Gaussian Copula

In Examples 4.3 and 4.4 we have introduced copulae. Another important copula is the Gaussian or normal copula,

$$ C_{\rho}(u, v) =\int_{- \infty}^{\Phi_1^{-1}(u)}\int_{- \infty}^{\Phi_2^{-1}(v)} f_\rho(x_1,x_2) d x_2 d x_1,$$
(4.55)

see Embrechts, McNeil and Straumann (1999). In (4.55), f ρ denotes the bivariate normal density function with correlation ρ for n=2. The functions Φ1 and Φ2 in (4.55) refer to the corresponding one-dimensional standard normal cdfs of the margins.

In the case of vanishing correlation, ρ=0, the Gaussian copula becomes

$$C_{0}(u, v) =\int_{- \infty}^{\Phi_1^{-1}(u)} f_{X_1}(x_1)\,d x_1\int_{- \infty}^{\Phi_2^{-1}(v)} f_{X_2}(x_2)\,d x_2= u\, v = \Pi(u,v),$$

i.e., the independence copula.
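Since (4.55) is the bivariate normal cdf evaluated at the normal quantiles, it can be computed directly. The Python sketch below (added here; it assumes a reasonably recent SciPy whose multivariate_normal object provides a cdf method, and the helper name gaussian_copula is ours) checks the ρ=0 case.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula(u, v, rho):
    """C_rho(u, v) = Phi_2{Phi^{-1}(u), Phi^{-1}(v); rho}, cf. (4.55)."""
    biv = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    return biv.cdf([norm.ppf(u), norm.ppf(v)])

u, v = 0.3, 0.7
print(gaussian_copula(u, v, 0.0), u * v)   # rho = 0: reduces to the product u*v
print(gaussian_copula(u, v, 0.8))          # positive dependence: larger than u*v
```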


5 Sampling Distributions and Limit Theorems

In multivariate statistics, we observe the values of a multivariate random variable X and obtain a sample \(\{x_{i}\}_{i=1}^{n}\), as described in Chapter 3. Under random sampling, these observations are considered to be realisations of a sequence of i.i.d. random variables X 1,…,X n , where each X i is a p-variate random variable which replicates the parent or population random variable X. Some notational confusion is hard to avoid: X i is not the ith component of X, but rather the ith replicate of the p-variate random variable X which provides the ith observation x i of our sample.

For a given random sample X 1,…,X n , the idea of statistical inference is to analyse the properties of the population variable X. This is typically done by analysing some characteristic θ of its distribution, like the mean, covariance matrix, etc. Statistical inference in a multivariate setup is considered in more detail in Chapters 6 and 7.

Inference can often be performed using some observable function of the sample X 1,…,X n , i.e., a statistic. Examples of such statistics were given in Chapter 3: the sample mean \(\bar{x}\) and the sample covariance matrix \({\mathcal{S}}\). To get an idea of the relationship between a statistic and the corresponding population characteristic, one has to derive the sampling distribution of the statistic. The next example gives some insight into the relation of \((\overline{x}, {\mathcal{S}})\) to (μ,Σ).

Example 4.15

Consider an iid sample of n random vectors \(X_{i} \in \mathbb {R}^{p}\) where \(\mathop {\mbox {\sf E}}(X_{i})=\mu\) and \(\mathop {\mbox {\sf Var}}(X_{i}) = \Sigma\). The sample mean \(\bar{x}\) and the covariance matrix \({\mathcal{S}}\) have already been defined in Section 3.3. It is easy to prove the following results

$$\mathop {\mbox {\sf E}}(\bar{x}) = \mu,\qquad \mathop {\mbox {\sf Var}}(\bar{x}) = \frac{1}{n}\,\Sigma,\qquad \mathop {\mbox {\sf E}}({\mathcal{S}}) = \frac{n-1}{n}\,\Sigma.$$

This shows in particular that \({\mathcal{S}}\) is a biased estimator of Σ. By contrast, \({\mathcal{S}}_{u} = \frac{n}{n-1}{\mathcal{S}}\) is an unbiased estimator of Σ.

Statistical inference often requires more than just the mean and/or the variance of a statistic. We need the sampling distribution of the statistics to derive confidence intervals or to define rejection regions in hypothesis testing for a given significance level. Theorem 4.9 gives the distribution of the sample mean for a multinormal population.

Theorem 4.9

Let X 1,…,X n be i.i.d. with X i  ∼ N p (μ,Σ). Then \(\bar{x} \,{\sim}\, N_{p}(\mu,n^{-1}\Sigma)\).

Proof

\(\bar{x}=n^{-1}\sum_{i=1}^{n} X_{i}\) is a linear combination of independent normal variables, so it has a normal distribution (see Chapter 5). The mean and the covariance matrix were given in the preceding example. □

With multivariate statistics, the sampling distributions of the statistics are often more difficult to derive than in the preceding Theorem. In addition they might be so complicated that approximations have to be used. These approximations are provided by limit theorems. Since they are based on asymptotic limits, the approximations are only valid when the sample size is large enough. In spite of this restriction, they make complicated situations rather simple. The following central limit theorem shows that even if the parent distribution is not normal, when the sample size n is large, the sample mean \(\bar{x}\) has an approximate normal distribution.

Theorem 4.10

(Central Limit Theorem (CLT))

Let X 1,X 2,…,X n be i.i.d. with X i ∼(μ,Σ). Then the distribution of \(\displaystyle \sqrt{n} (\overline{x} - \mu ) \) is asymptotically N p (0,Σ), i.e.,

$$\sqrt{n} (\overline{x} - \mu) \stackrel{\mathcal{L}}{\longrightarrow}N_p (0, \Sigma) \quad \mbox{as } n \longrightarrow \infty. $$

The symbol “\(\stackrel{\mathcal{L}}{\longrightarrow}\)” denotes convergence in distribution which means that the distribution function of the random vector \(\sqrt{n}(\bar{x}-\mu)\) converges to the distribution function of N p (0,Σ).

Example 4.16

Assume that X 1,…,X n are i.i.d. and that they have Bernoulli distributions where \(p=\frac{1}{2}\) (this means that \(P(X_{i}=1)=\frac{1}{2},\;P(X_{i}=0)=\frac{1}{2})\). Then \(\mu=p=\frac{1}{2}\) and \(\Sigma=p(1-p)=\frac{1}{4}\). Hence,

$$\sqrt{n} \left(\overline{x} - \frac{1}{2}\right) \stackrel{\mathcal{L}}{\longrightarrow}N_{1} \left(0,\frac{1}{4}\right) \quad \mbox{as } n \longrightarrow \infty. $$

The results are shown in Figure 4.4 for varying sample sizes.
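A quick simulation (a Python sketch added here, separate from the MVAcltbern quantlet referenced in the figure) makes the convergence visible by comparing the empirical distribution of \(\sqrt{n}(\overline{x}-\frac{1}{2})\) with the limiting N(0, ¼) at one evaluation point.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def clt_sample(n, replications=20_000):
    """Replications of sqrt(n) * (xbar - 1/2) for Bernoulli(1/2) samples of size n."""
    x = rng.integers(0, 2, size=(replications, n))
    return np.sqrt(n) * (x.mean(axis=1) - 0.5)

for n in (5, 35, 200):
    z = clt_sample(n)
    # compare the empirical cdf at z = 0.5 with the N(0, 1/4) cdf
    print(n, np.mean(z <= 0.5), stats.norm.cdf(0.5, scale=0.5))
```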

Fig. 4.4 The CLT for Bernoulli distributed random variables. Sample size n=5 (up) and n=35 (down)  MVAcltbern

Example 4.17

Now consider a two-dimensional random sample X 1,…,X n that is i.i.d. and created from two independent Bernoulli distributions with p=0.5. The joint distribution is given by \(P(X_{i}=(0,0)^{\top}) = \frac{1}{4}\), \(P(X_{i}=(0,1)^{\top}) = \frac{1}{4}\), \(P(X_{i}=(1,0)^{\top}) = \frac{1}{4}\), \(P(X_{i}=(1,1)^{\top}) = \frac{1}{4}\). Here we have

$$\sqrt{n} \left\{ \bar{x}- {\frac{1}{2} \choose \frac{1}{2}} \right\}\stackrel{\mathcal{L}}{\longrightarrow} N_{2} \left( {0 \choose 0}, \left(\begin{array}{c@{\quad}c}\frac{1}{4}& 0\\ 0 &\frac{1}{4}\end{array} \right)\right) \quad \mbox{as } n \longrightarrow \infty. $$

Figure 4.5 displays the estimated two-dimensional density for different sample sizes.

Fig. 4.5 The CLT in the two-dimensional case. Sample size n=5 (up) and n=85 (down)  MVAcltbern2

The asymptotic normal distribution is often used to construct confidence intervals for the unknown parameters. A confidence interval at the level 1−α, α∈(0,1), is an interval that covers the true parameter with probability 1−α:

$$P(\theta \in [\widehat{\theta}_{l} , \widehat{\theta}_{u}]) = 1 - \alpha,$$

where θ denotes the (unknown) parameter and \(\widehat{\theta}_{l}\) and \(\widehat{\theta}_{u}\) are the lower and upper confidence bounds respectively.

Example 4.18

Consider the i.i.d. random variables X 1,…,X n with X i ∼(μ,σ 2) and σ 2 known. Since we have \(\sqrt{n}(\bar{x}-\mu)\stackrel{\mathcal{L}}{\rightarrow} N(0,\sigma^{2})\) from the CLT, it follows that

$$P\biggl(-u_{1-\alpha/2} \le \sqrt{n}\frac{(\bar{x}-\mu)}{\sigma} \le u_{1-\alpha/2}\biggr)\longrightarrow 1 - \alpha,\quad \mbox{as } n \longrightarrow \infty $$

where u 1−α/2 denotes the (1−α/2)-quantile of the standard normal distribution. Hence the interval

$$\left[\bar{x}-\frac{\sigma}{\sqrt{n}}\, u_{1-\alpha/2},\,\bar{x}+\frac{\sigma}{\sqrt{n}}\, u_{1-\alpha/2}\right]$$

is an approximate (1−α)-confidence interval for μ.

But what can we do if we do not know the variance σ 2? The following corollary gives the answer.

Corollary 4.1

If \(\widehat{\Sigma}\) is a consistent estimate for Σ, then the CLT still holds, namely

$$\sqrt{n}\;\widehat{\Sigma}^{-1/2} (\bar{x}-\mu) \stackrel{\mathcal{L}}{\longrightarrow}{N}_p(0,{\mathcal{I}})\quad \mbox{as } n \longrightarrow \infty . $$

Example 4.19

Consider the i.i.d. random variables X 1,…,X n with X i ∼(μ,σ 2), and now with an unknown variance σ 2. From Corollary 4.1 using \(\widehat{\sigma}^{2} = \frac{1}{n} \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}\) we obtain

$$\sqrt{n}\left( \frac{\bar{x}-\mu}{\widehat{\sigma}} \right) \mathrel{\mathop{\longrightarrow}\limits_{}^{\mathcal{L}}} N(0,1) \quad \mbox{as } n \longrightarrow \infty . $$

Hence we can construct an approximate (1−α)-confidence interval for μ using the variance estimate \(\widehat{\sigma}^{2}\):

$$C_{1-\alpha} = \left[\bar{x}-\frac{\widehat{\sigma}}{\sqrt{n}}\, u_{1-\alpha/2},\,\bar{x}+\frac{\widehat{\sigma}}{\sqrt{n}}\, u_{1-\alpha/2}\right].$$

Note that by the CLT

$$P(\mu \in C_{1-\alpha}) \longrightarrow 1 - \alpha \quad \mbox{as } n \longrightarrow \infty . $$

Remark 4.1

One may wonder how large should n be in practice to provide reasonable approximations. There is no definite answer to this question: it mainly depends on the problem at hand (the shape of the distribution of the X i and the dimension of X i ). If the X i are normally distributed, the normality of \(\bar{x}\) is achieved from n=1. In most situations, however, the approximation is valid in one-dimensional problems for n larger than, say, 50.

5.1 Transformation of Statistics

Often in practical problems, one is interested in a function of parameters for which one has an asymptotically normal statistic. Suppose for instance that we are interested in a cost function depending on the mean μ of the process: \(f(\mu)=\mu^{\top} {\mathcal{A}}\mu\) where \({\mathcal{A}}>0\) is given. To estimate μ we use the asymptotically normal statistic \(\bar{x}\). The question is: how does \(f(\bar{x})\) behave? More generally, what happens to a statistic t that is asymptotically normal when we transform it by a function f(t)? The answer is given by the following theorem.

Theorem 4.11

If \(\sqrt{n} (t - \mu) \stackrel{\mathcal{L}}{\longrightarrow}N_{p}(0,\Sigma) \) and if \(f = (f_{1}, \ldots, f_{q})^{\top} : \mathbb {R}^{p} \to \mathbb {R}^{q} \) are real valued functions which are differentiable at \(\mu \in \mathbb {R}^{p}\), then f(t) is asymptotically normal with mean f(μ) and covariance \({\mathcal{D}}^{\top} \Sigma {\mathcal{D}}\), i.e.,

$$\sqrt{n} \{f(t) - f(\mu)\} \stackrel{\mathcal{L}}{\longrightarrow}N_q(0,{\mathcal{D}}^{\top}\Sigma {\mathcal{D}} ) \quad \mbox{for } n\longrightarrow \infty,$$
(4.56)

where

$${\mathcal{D}} = \left .\left( \frac{\partial f_j}{\partial t_i}\right)(t)\right |_{t = \mu} $$

is the (p×q) matrix of all partial derivatives.

Example 4.20

We are interested in seeing how \(f(\bar{x})=\bar{x}^{\top} {\mathcal{A}}\bar{x}\) behaves asymptotically with respect to the quadratic cost function of \(\mu , f(\mu)=\mu^{\top} {\mathcal{A}}\mu\), where \({\mathcal{A}}>0\).

$$D=\left.\frac{\partial f(\bar{x})}{\partial \bar{x}}\right|_{\bar{x}=\mu}=2{\mathcal{A}}\mu.$$

By Theorem 4.11 we have

$$\sqrt{n}(\bar{x}^{\top} {\mathcal{A}}\bar{x}-\mu^{\top} {\mathcal{A}}\mu) \mathrel{\mathop{\longrightarrow}\limits_{}^{\mathcal{L}}} N_1\ (0,4\mu^{\top} {\mathcal{A}}\Sigma {\mathcal{A}}\mu).$$

Example 4.21

Suppose

$$X_i \sim (\mu, \Sigma); \quad \mu = {0\choose 0}, \quad \Sigma = \left( \begin{array}{c@{\quad}c} 1 & 0.5 \\ 0.5 & 1 \end{array} \right),\quad p = 2.$$

We have by the CLT (Theorem 4.10) for n→∞ that

$$\sqrt{n} (\overline{x} - \mu) \mathrel{\mathop{\longrightarrow}\limits_{}^{\mathcal{L}}}N(0, \Sigma).$$

Suppose that we would like to compute the distribution of \({\overline{x}_1^2 - \overline{x}_2 \choose \overline{x}_1 + 3\overline{x}_2}\). According to Theorem 4.11 we have to consider f=(f 1,f 2) with

$$f_1(x_1,x_2) = x_1^2 - x_2, \qquad f_2(x_1,x_2) = x_1 + 3x_2,\qquad q = 2. $$

Given this \(f(\mu) = {0 \choose 0} \) and

$${\mathcal{D}} = (d_{ij}), \quad d_{ij}= \left( \left . \frac{\partial f_j }{\partial x_i}\right) \right |_{x = \mu}= \left.\left(\begin{array}{l@{\quad}l} 2x_{1}&1\\-1&3 \end{array}\right)\right |_{x=0}. $$

Thus

$${\mathcal{D}} = \left( \begin{array}{r@{\quad}r} 0 & 1 \\ -1 & 3 \end{array}\right). $$

The covariance is

$$\begin{array}{cccccccc}\left( \begin{array}{r@{\quad}r} 0 & -1 \\ 1 & 3 \end{array} \right) &\left( \begin{array}{r@{\quad}r} 1 & \frac{1}{2} \\ \frac{1}{2} & 1 \end{array} \right) &\left( \begin{array}{r@{\quad}r} 0 & 1 \\ -1 & 3 \end{array} \right) &= &\left( \begin{array}{r@{\quad}r} 0 & -1 \\ 1 & 3 \end{array} \right) &\left( \begin{array}{r@{\quad}r} -\frac{1}{2} & \frac{5}{2} \\ -1 &\frac{7}{2} \end{array} \right) &= &\left( \begin{array}{r@{\quad}c} 1 & -\frac{7}{2} \\ -\frac{7}{2} & 13 \end{array} \right) \\[8pt]{\mathcal{D}}^{\top} & \Sigma & {\mathcal{D}} & & {\mathcal{D}}^{\top} & \Sigma {\mathcal{D}}& &{\mathcal{D}}^{\top}\Sigma {\mathcal{D}}\end{array}, $$

which yields

$$\sqrt{n}\left(\begin{array}{c} \overline{x}_{1}^2 - \overline{x}_{2}\\\overline{x}_{1} + 3 \overline{x}_{2} \end{array} \right) \mathrel{\mathop{\longrightarrow}\limits_{}^{\mathcal{L}}}N_{2}\left( {0\choose 0}, \left( \begin{array}{r@{\quad}c} 1 & -\frac{7}{2} \\-\frac{7}{2} & 13 \end{array} \right) \right).$$
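The delta-method result of Example 4.21 is easy to check by simulation. The sketch below (added for illustration; sample size and number of replications are arbitrary choices) compares the empirical covariance of \(\sqrt{n}\{f(\bar{x})-f(\mu)\}\) with \({\mathcal{D}}^{\top}\Sigma {\mathcal{D}}\).

```python
import numpy as np

rng = np.random.default_rng(8)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
n, reps = 200, 20_000

X = rng.multivariate_normal(mu, Sigma, size=(reps, n))   # reps samples of size n
xbar = X.mean(axis=1)

# f(xbar) = (xbar1^2 - xbar2, xbar1 + 3*xbar2); here f(mu) = (0, 0)
f_val = np.column_stack([xbar[:, 0] ** 2 - xbar[:, 1],
                         xbar[:, 0] + 3 * xbar[:, 1]])
Z = np.sqrt(n) * f_val

print(np.cov(Z, rowvar=False))   # approx [[1, -3.5], [-3.5, 13]] = D^T Sigma D
```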

Example 4.22

Let us continue the previous example by adding one more component to the function f. Since q=3>p=2, we might expect a singular normal distribution. Consider f=(f 1,f 2,f 3) with

$$f_1(x_1,x_2) = x_1^2 - x_2, \qquad f_2(x_1,x_2) = x_1 + 3x_2, \qquad f_3 = x_2^3, \qquad q = 3. $$

From this we have that

$${\mathcal{D}} = \left( \begin{array}{r@{\quad}r@{\quad}r} 0 & 1 & 0 \\ -1 & 3 & 0\end{array}\right) \quad \mbox{and thus}\quad {\mathcal{D}}^{\top}\Sigma {\mathcal{D}}= \left( \begin{array}{r@{\quad}r@{\quad}r} 1 & -\frac{7}{2} & 0 \\-\frac{7}{2} & 13 & 0 \\ 0 & 0 & 0 \end{array} \right). $$

The limit is in fact a singular normal distribution!


6 Heavy-Tailed Distributions

Heavy-tailed distributions were first introduced by the Italian-born Swiss economist Pareto and extensively studied by Paul Lévy. Although in the beginning these distributions were mainly studied theoretically, nowadays they have found many applications in areas as diverse as finance, medicine, seismology and structural engineering. More concretely, they have been used to model returns of assets in financial markets, stream flow in hydrology, precipitation and hurricane damage in meteorology, earthquake prediction in seismology, pollution, material strength, teletraffic and many others.

A distribution is called heavy-tailed if it has higher probability density in its tail area compared with a normal distribution with the same mean μ and variance σ 2. Figure 4.6 demonstrates the differences between the pdf curves of a standard Gaussian distribution and a Cauchy distribution with location parameter μ=0 and scale parameter σ=1. The graphic shows that the probability density of the Cauchy distribution is much higher than that of the Gaussian in the tail part, while in the area around the centre, the probability density of the Cauchy distribution is much lower.

Fig. 4.6 Comparison of the pdf of a standard Gaussian (blue) and a Cauchy distribution (red) with location parameter 0 and scale parameter 1  MVAgausscauchy

Fig. 4.7 pdf (left) and cdf (right) of \(\mathit{GH}\) (λ=0.5), \(\mathit{HYP}\) and \(\mathit{NIG}\) with α=1, β=0, δ=1, μ=0  MVAghdis

Fig. 4.8 pdf (left) and cdf (right) of t-distribution with different degrees of freedom (t3 stands for t-distribution with degree of freedom 3)  MVAtdis

Fig. 4.9 pdf (left) and cdf (right) of Laplace distribution with zero mean and different scale parameters (L1 stands for Laplace distribution with θ=1)  MVAlaplacedis

In terms of kurtosis, a heavy-tailed distribution has kurtosis greater than 3 (see formula (4.40)), which is called leptokurtic, in contrast to mesokurtic (kurtosis=3) and platykurtic (kurtosis<3) distributions. Since univariate heavy-tailed distributions serve as the basis for their multivariate counterparts and their density properties have proved useful even in multivariate cases, we start by introducing some univariate heavy-tailed distributions. Then we move on to analyse their multivariate counterparts and their tail behaviour.

6.1 Generalised Hyperbolic Distribution

The generalised hyperbolic distribution was introduced by Barndorff-Nielsen and at first applied to model grain size distributions of wind blown sands. Today one of its most important uses is in stock price modelling and market risk measurement. The name of the distribution is derived from the fact that its log-density forms a hyperbola, while the log-density of the normal distribution is a parabola.

The density of a one-dimensional generalised hyperbolic (GH) distribution for \(x\in \mathbb{R}\) is

$$f_{\mathit{GH}}(x;\lambda,\alpha,\beta,\delta,\mu)=\frac{(\sqrt{\alpha^2-\beta^2}/\delta)^{\lambda}}{\sqrt{2\pi}\,K_\lambda(\delta\sqrt{\alpha^2-\beta^2})}\cdot\frac{K_{\lambda-\frac{1}{2}}\bigl\{\alpha\sqrt{\delta^2+(x-\mu)^2}\bigr\}}{\bigl\{\sqrt{\delta^2+(x-\mu)^2}/\alpha\bigr\}^{\frac{1}{2}-\lambda}}\, e^{\beta(x-\mu)}$$
(4.57)

where K λ is a modified Bessel function of the third kind with index λ

$$K_\lambda(x) =\frac{1}{2}\int_0^\infty y^{\lambda-1}e^{-\frac{x}{2}(y+y^{-1})}dy.$$
(4.58)

The domain of variation of the parameters is \(\mu \in \mathbb{R}\) and

$$\begin{array}{l@{\quad}l}\delta\ge 0,\ |\beta|<\alpha & \mbox{if } \lambda>0,\\ \delta>0,\ |\beta|<\alpha & \mbox{if } \lambda=0,\\ \delta>0,\ |\beta|\le\alpha & \mbox{if } \lambda<0.\end{array}$$

The generalised hyperbolic distribution has the following mean and variance

$$\mathop {\mbox {\sf E}}(X)=\mu+\frac{\delta\beta}{\sqrt{\alpha^2-\beta^2}}\,\frac{K_{\lambda+1}(\zeta)}{K_{\lambda}(\zeta)}$$
(4.59)
$$\mathop {\mbox {\sf Var}}(X)=\delta^2\left[\frac{K_{\lambda+1}(\zeta)}{\zeta K_{\lambda}(\zeta)}+\frac{\beta^2}{\alpha^2-\beta^2}\left\{\frac{K_{\lambda+2}(\zeta)}{K_{\lambda}(\zeta)}-\frac{K^2_{\lambda+1}(\zeta)}{K^2_{\lambda}(\zeta)}\right\}\right],\qquad \zeta=\delta\sqrt{\alpha^2-\beta^2},$$
(4.60)

where μ and δ play important roles in the density’s location and scale respectively. With specific values of λ, we obtain different sub-classes of GH such as hyperbolic (HYP) or normal-inverse Gaussian (NIG) distribution.

For λ=1 we obtain the hyperbolic distributions (HYP)

$$f_{\mathit{HYP}}(x;\alpha,\beta,\delta,\mu)=\frac{\sqrt{\alpha^2-\beta^2}}{2\alpha\delta K_1(\delta\sqrt{\alpha^2-\beta^2})}e^{\{-\alpha\sqrt{\delta^2+(x-\mu)^2}+\beta(x-\mu)\}}$$
(4.61)

where \(x,\mu \in \mathbb{R}, \delta\geq0\) and |β|<α.

For λ=−1/2 we obtain the normal-inverse Gaussian distribution (NIG)

$$f_{\mathit{NIG}}(x;\alpha,\beta,\delta,\mu)=\frac{\alpha\delta}{\pi}\frac{K_1\bigl(\alpha\sqrt{\delta^2+(x-\mu)^2}\bigr)}{\sqrt{\delta^2+(x-\mu)^2}}e^{\{\delta\sqrt{\alpha^2-\beta^2}+\beta(x-\mu)\}}.$$
(4.62)

6.2 Student’s t-distribution

The t-distribution was first analysed by Gosset (1908). He published his results under his pseudonym “Student” by request of his employer. Let X be a normally distributed random variable with mean μ and variance σ 2, and Y be the random variable such that Y 2/σ 2 has a chi-square distribution with n degrees of freedom. Assume that X and Y are independent, then

$$t \stackrel{\mathrm{def}}{=} \frac{X\sqrt{n}}{Y}$$
(4.63)

is distributed as Student’s t with n degrees of freedom. The t-distribution has the following density function

$$f_t(x;n)=\frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\Gamma(\frac{n}{2})}\biggl(1+\frac{x^2}{n}\biggr)^{-\frac{n+1}{2}}$$
(4.64)

where n is the number of degrees of freedom, −∞<x<∞, and Γ is the gamma function, e.g. Giri (1996),

$$\Gamma(\alpha)=\int_0^\infty x^{\alpha-1}e^{-x}dx.$$
(4.65)

The mean, variance, skewness, and kurtosis of Student's t-distribution (n>4) are:

$$\mathop {\mbox {\sf E}}(X)=0,\qquad \mathop {\mbox {\sf Var}}(X)=\frac{n}{n-2},\qquad \mbox{Skewness}=0,\qquad \mbox{Kurtosis}=3+\frac{6}{n-4}.$$

The t-distribution is symmetric around 0, which is consistent with the fact that its mean is 0 and skewness is also 0.

Student’s t-distribution approaches the normal distribution as n increases, since

$$\lim_{n\rightarrow\infty}f_t(x;n)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}.$$
(4.66)

In practice the t-distribution is widely used, but its flexibility of modelling is restricted because of the integer-valued tail index.

In the tail area of the t-distribution, the density is proportional to \(|x|^{-(n+1)}\). In Figure 4.13 we compare the tail behaviour of the t-distribution for different degrees of freedom. With a higher degree of freedom, the t-distribution decays faster.
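The polynomial tail decay is easy to quantify with standard distribution functions. A short Python sketch (added here; the threshold 4 is an arbitrary choice):

```python
from scipy import stats

# two-sided tail probability P(|X| > 4) for t-distributions and the standard normal:
# the tails get thinner as the degrees of freedom grow
for df in (1, 3, 9, 45):
    print("t, df =", df, 2 * stats.t.sf(4, df))
print("N(0,1)", 2 * stats.norm.sf(4))
```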

6.3 Laplace Distribution

The univariate Laplace distribution with mean zero was introduced by Laplace (1774). The Laplace distribution can be defined as the distribution of differences between two independent variates with identical exponential distributions. Therefore it is also called the double exponential distribution.

The Laplace distribution with mean μ and scale parameter θ has the pdf

$$f_{\mathit{Laplace}}(x;\mu,\theta)=\frac{1}{2\theta}e^{-\frac{|x-\mu|}{\theta}}$$
(4.67)

and the cdf

$$F_{\mathit{Laplace}}(x;\mu,\theta)=\frac{1}{2}\big\{1+\mathit{sign}(x-\mu)(1-e^{-\frac{|x-\mu|}{\theta}})\big\},$$
(4.68)

where \(\mathit{sign}\) is the sign function. The mean, variance, skewness, and kurtosis of the Laplace distribution are

$$\mathop {\mbox {\sf E}}(X)=\mu,\qquad \mathop {\mbox {\sf Var}}(X)=2\theta^2,\qquad \mbox{Skewness}=0,\qquad \mbox{Kurtosis}=6.$$

With mean 0 and θ=1, we obtain the standard Laplace distribution

$$f_{\mathit{Laplace}}(x)=\frac{1}{2}e^{-|x|}$$
(4.69)
$$F_{\mathit{Laplace}}(x)=\left\{\begin{array}{l@{\quad}l}\frac{1}{2}e^{x} & x<0\\[2pt]1-\frac{1}{2}e^{-x} & x\ge 0.\end{array}\right.$$
(4.70)

6.4 Cauchy Distribution

The Cauchy distribution is motivated by the following example.

Example 4.23

A gangster has just robbed a bank. As he runs to a point s meters away from the wall of the bank, a policeman reaches the crime scene. The robber turns back and starts to shoot but he is such a poor shooter that the angle of his fire (marked in Figure 4.10 as α) is uniformly distributed. The bullets hit the wall at distance x (from the centre). Obviously the distribution of x, the random variable where the bullet hits the wall, is of vital knowledge to the policeman in order to identify the location of the gangster. (Should the policeman calculate the mean or the median of the observed bullet hits x i ?)

Fig. 4.10 Introduction to the Cauchy distribution - robber vs. policeman

Fig. 4.11 pdf (left) and cdf (right) of Cauchy distribution with m=0 and different scale parameters (C1 stands for Cauchy distribution with s=1)  MVAcauchy

Fig. 4.12 pdf (left) and cdf (right) of a Gaussian mixture (Example 4.24)  MVAmixture

Fig. 4.13 Tail comparison of t-distribution, pdf (left) and approximation (right)  MVAtdistail

Since α is uniformly distributed:

$$f(\alpha) = \frac{1}{\pi} \,\textbf{\textit{I}}(\alpha \in [-\pi/2, \pi/2])$$

and the distance at which the bullet hits the wall is \(x = s\tan(\alpha)\), so that \(\alpha=\arctan\left(\frac{x}{s}\right)\).

For a small interval dα, the probability is given by

$$f(\alpha)\,d\alpha = \frac{1}{\pi}\,d\alpha$$

with

$$d\alpha = \frac{s}{s^2+x^2}\,dx.$$

So the pdf of x can be written as:

$$f(x) = \frac{1}{\pi}\,\frac{s}{s^2+x^2}.$$

The general formula for the pdf and cdf of the Cauchy distribution is

$$f_{C}(x;m,s)=\frac{1}{\pi s}\,\frac{1}{1+\left(\frac{x-m}{s}\right)^{2}}$$
(4.71)
$$F_{C}(x;m,s)=\frac{1}{2}+\frac{1}{\pi}\arctan\left(\frac{x-m}{s}\right),$$
(4.72)

where m and s are the location and scale parameter respectively. The case in the above example where m=0 and s=1 is called the standard Cauchy distribution with pdf and cdf as follows,

$$f_{C}(x)=\frac{1}{\pi(1+x^{2})}$$
(4.73)
$$F_{C}(x)=\frac{1}{2}+\frac{1}{\pi}\arctan(x).$$
(4.74)

The mean, variance, skewness and kurtosis of the Cauchy distribution are all undefined, since the corresponding moment integrals do not converge. But it has a mode and a median, both equal to the location parameter m.
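This answers the policeman's question: he should use the median, not the mean, of the observed hits. A short Python simulation (added for illustration; sample sizes are arbitrary) shows that the running mean of Cauchy observations does not settle down, while the median converges to m.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.standard_cauchy(1_000_000)      # bullet hits for m = 0, s = 1

for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(n, np.mean(x[:n]), np.median(x[:n]))
# the running mean keeps jumping (no finite mean exists),
# while the median settles near the location parameter m = 0
```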

6.5 Mixture Model

Mixture modelling concerns modelling a statistical distribution by a mixture (or weighted sum) of different distributions. For many choices of component density functions, the mixture model can approximate any continuous density to arbitrary accuracy, provided that the number of component density functions is sufficiently large and the parameters of the model are chosen correctly. The pdf of a mixture distribution consisting of L component distributions can be written as:

$$f(x)=\sum_{l=1}^L w_l p_l(x)$$
(4.75)

under the constraints:

$$0\le w_l\le 1,\qquad \sum_{l=1}^{L} w_l=1,\qquad \int p_l(x)\,dx=1,$$

where p l (x) is the pdf of the l’th component and w l is its weight. The mean, variance, skewness and kurtosis of a mixture are

(4.76)
(4.77)
(4.78)
(4.79)

where μ l ,σ l ,SK l and K l are, respectively, the mean, variance, skewness and kurtosis of the l’th component distribution.

Mixture models are ubiquitous in virtually every facet of statistical analysis, machine learning and data mining. For data sets comprising continuous variables, the most common approach involves mixture distributions having Gaussian components.

The pdf for a Gaussian mixture is:

$$f_{\mathit{GM}}(x)=\sum_{l=1}^L \frac{w_l}{\sqrt{2\pi}\sigma_l}e^{-\frac{(x-\mu_l)^2}{2\sigma_l^2}}.$$
(4.80)

For a Gaussian mixture consisting of Gaussian distributions with mean 0, this can be simplified to:

$$f_{\mathit{GM}}(x)=\sum_{l=1}^L \frac{w_l}{\sqrt{2\pi}\sigma_l}e^{-\frac{x^2}{2\sigma_l^2}},$$
(4.81)

with variance, skewness and kurtosis

$$\mathop {\mbox {\sf Var}}(X)=\sum_{l=1}^{L} w_l\sigma_l^2$$
(4.82)
$$\mbox{Skewness}=0$$
(4.83)
$$\mbox{Kurtosis}=3\,\frac{\sum_{l=1}^{L} w_l\sigma_l^4}{\bigl(\sum_{l=1}^{L} w_l\sigma_l^2\bigr)^{2}}.$$
(4.84)

Example 4.24

Consider a Gaussian Mixture which is 80% N(0,1) and 20% N(0,9). The pdf of N(0,1) and N(0,9) are

$$f_1(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\qquad\mbox{and}\qquad f_2(x)=\frac{1}{3\sqrt{2\pi}}e^{-\frac{x^2}{18}},$$

so the pdf of the Gaussian Mixture is

$$f_{\mathit{GM}}(x) = \frac{1}{5\sqrt{2\pi}}\biggl(4e^{-\frac{x^2}{2}}+\frac{1}{3}e^{-\frac{x^2}{18}}\biggr).$$

Notice that the Gaussian Mixture is not a Gaussian distribution:

$$\mathop {\mbox {\sf E}}(X)=0,\qquad \mathop {\mbox {\sf Var}}(X)=0.8\cdot 1+0.2\cdot 9=2.6,\qquad \mbox{Kurtosis}=3\,\frac{0.8\cdot 1+0.2\cdot 81}{2.6^{2}}\approx 7.54.$$

The kurtosis of this Gaussian mixture is higher than 3.
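A quick Monte Carlo check of Example 4.24 (a Python sketch added here, not part of the text) samples from the 80%/20% mixture and estimates its variance and kurtosis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 1_000_000

# 80% N(0,1) and 20% N(0,9): draw the component label, then the observation
label = rng.uniform(size=n) < 0.8
x = np.where(label, rng.normal(0.0, 1.0, n), rng.normal(0.0, 3.0, n))

print(x.var())                              # approx 0.8*1 + 0.2*9 = 2.6
print(stats.kurtosis(x, fisher=False))      # approx 3*(0.8*1 + 0.2*81)/2.6**2 = 7.54
```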

A summary of the basic statistics is given in Table 4.2.

Table 4.2 basic statistics of t, Laplace and Cauchy distribution
Table 4.3 basic statistics of GH distribution and mixture model

6.6 Multivariate Generalised Hyperbolic Distribution

The multivariate Generalised Hyperbolic Distribution (\(\mathit{GH}_{d}\)) has the following pdf

(4.85)
(4.86)

and characteristic function

(4.87)

These parameters have the following domain of variation:

$$\begin{array}{l@{\quad}l}\lambda \in \mathbb{R}, & \beta,\mu \in \mathbb{R}^d\\\delta >0, & \alpha>\beta^{\top}\Delta\beta\\\Delta \in \mathbb{R}^{d \times d} & \textrm{positive definite matrix}\\|\Delta|=1.\end{array}$$

For \(\lambda = \frac{d+1}{2}\) we obtain the multivariate hyperbolic (HYP) distribution; for \(\lambda = -\frac{1}{2}\) we get the multivariate normal inverse Gaussian (NIG) distribution.

Blæsild and Jensen (1981) introduced a second parameterization (ζ,Π,Σ), where

(4.88)
(4.89)
(4.90)

The mean and variance of X∼\(\mathit{GH}_d\) are

(4.91)
(4.92)

where

(4.93)
(4.94)

Theorem 4.12

Suppose that X is a d-dimensional variate distributed according to the generalised hyperbolic distribution GH d . Let (X 1,X 2) be a partitioning of X, let r and k denote the dimensions of X 1 and X 2, respectively, and let (β 1,β 2) and (μ 1,μ 2) be similar partitions of β and μ, let

$$\Delta =\left(\begin{array}{l@{\quad}l}\Delta_{11} & \Delta_{12}\\\Delta_{21} & \Delta_{22}\\\end{array}\right)$$
(4.95)

be a partition of Δ such that Δ11 is an r×r matrix. Then one has the following:

  1. The distribution of X 1 is the r-dimensional generalised hyperbolic distribution, \(GH_{r}(\lambda^{*},\alpha^{*},\beta^{*},\delta^{*},\mu^{*})\), where

  2. The conditional distribution of X 2 given X 1=x 1 is the k-dimensional generalised hyperbolic distribution \(GH_{k}(\tilde{\lambda},\tilde{\alpha},\tilde{\beta},\tilde{\delta},\tilde{\mu},\tilde{\Delta})\), where

  3. Let Y=XA+B be a regular affine transformation of X and let ||A|| denote the absolute value of the determinant of A. The distribution of Y is the d-dimensional generalised hyperbolic distribution \(GH_{d}(\lambda^{+},\alpha^{+},\beta^{+},\delta^{+},\mu^{+},\Delta^{+})\), where

6.7 Multivariate t-distribution

If X and Y are independent and distributed as N p (μ,Σ) and \({\mathcal{X}}^{2}_{n}\) respectively, and \(X\sqrt{n/Y}=t-\mu\), then the pdf of t is given by

$$f_t(t;n,\Sigma,\mu)=\frac{\Gamma\left\{(n+p)/2\right\}}{\Gamma(n/2)n^{p/2}\pi^{p/2}\left|\Sigma\right|^{1/2}\{1+\frac{1}{n}(t-\mu)^{\top}{\Sigma}^{-1}(t-\mu)\}^{(n+p)/2}}.$$
(4.96)

The distribution of t is the noncentral t-distribution with n degrees of freedom and the noncentrality parameter μ, Giri (1996).
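The stochastic representation above lends itself to simulation; a minimal Python sketch, assuming the central case X∼N_p(0,Σ) and using a hypothetical helper rmvt, might look as follows:

```python
import numpy as np

# Simulate t = mu + X * sqrt(n / Y) with X ~ N_p(0, Sigma) and Y ~ chi^2_n
# drawn independently (central multivariate t with n degrees of freedom).
rng = np.random.default_rng(1)

def rmvt(size, mu, Sigma, n):
    p = len(mu)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=size)
    Y = rng.chisquare(n, size=size)
    return mu + X * np.sqrt(n / Y)[:, None]

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
sample = rmvt(10_000, mu, Sigma, n=5)
print(sample.mean(axis=0))   # close to mu (for n > 1)
print(np.cov(sample.T))      # close to n/(n-2) * Sigma (for n > 2)
```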

6.8 Multivariate Laplace Distribution

Let g and G be the pdf and cdf of a d-dimensional Gaussian distribution N d (0,Σ). The pdf and cdf of a multivariate Laplace distribution can then be written as

(4.97)
(4.98)

The pdf can also be written as

(4.99)

where \(\lambda = \frac{2-d}{2}\) and K λ (x) is the modified Bessel function of the third kind

$$K_\lambda(x)=\frac{1}{2}\bigg(\frac{x}{2}\bigg)^\lambda \int_0^\infty t^{-\lambda-1}e^{-t-\frac{x^2}{4t}}dt,\quad x>0.$$
(4.100)
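The integral representation (4.100) can be checked numerically against scipy.special.kv, SciPy's implementation of K_λ (listed there under the equivalent name "modified Bessel function of the second kind"); a possible sketch:

```python
import numpy as np
from scipy import integrate, special

# Evaluate K_lambda(x) via the integral representation (4.100) and compare
# with scipy.special.kv.
def K_integral(lam, x):
    integrand = lambda t: t**(-lam - 1.0) * np.exp(-t - x**2 / (4.0 * t))
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return 0.5 * (x / 2.0)**lam * value

lam, x = -0.5, 2.0              # e.g. lambda = (2 - d)/2 with d = 3
print(K_integral(lam, x))       # the two values agree
print(special.kv(lam, x))
```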

The multivariate Laplace distribution has mean and variance

(4.101)
(4.102)

6.9 Multivariate Mixture Model

A multivariate mixture model comprises multivariate component distributions; for example, the pdf of a multivariate Gaussian mixture can be written as

$$f(x)=\sum_{l=1}^{L}\frac{w_l}{|2\pi\Sigma_l|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_l)^{\top}\Sigma_l^{-1}(x-\mu_l)}.$$
(4.103)
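A small sketch, with made-up weights, means and covariance matrices, evaluates the mixture density (4.103) with SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Evaluate a two-component bivariate Gaussian mixture density at a point.
weights = [0.7, 0.3]
means = [np.zeros(2), np.array([2.0, 2.0])]
covs = [np.eye(2), np.array([[2.0, 0.5],
                             [0.5, 1.0]])]

def mixture_pdf(x):
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(weights, means, covs))

print(mixture_pdf(np.array([1.0, 1.0])))
```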

6.10 Generalised Hyperbolic Distribution

The tails of the GH distribution decay at an exponential rate:

$$f_{\mathit{GH}}(x;\lambda,\alpha,\beta,\delta,\mu=0)\sim x^{\lambda-1}e^{-(\alpha-\beta)x}\qquad \textrm{as} \quad x\rightarrow \infty .$$
(4.104)

Figure 4.14 illustrates the tail behaviour of GH distributions with different values of λ for α=1, β=0, δ=1, μ=0. It is clear that among the four distributions, the GH distribution with λ=1.5 decays most slowly, while the NIG distribution decays fastest.

Fig. 4.14

Tail comparison of GH distribution (pdf)  MVAghdistail

In Figure 4.15 four distributions, and especially their tail behaviour, are compared following Chen, Härdle and Jeong (2008). In order to keep these distributions comparable, we set the means to 0 and standardised the variances to 1. Furthermore we used one important subclass of the GH distribution: the normal inverse Gaussian (NIG) distribution with \(\lambda = -\frac{1}{2}\) introduced above. On the left panel, the complete forms of these distributions are shown. The Cauchy distribution (dots) has the lowest peak and the fattest tails; in other words, it is the flattest distribution. Although the NIG distribution has the highest peak, its tails decay second fastest, which is displayed more clearly on the right panel.

Fig. 4.15

Graphical comparison of the NIG distribution (line), standard normal distribution  MVAghadatail

7 Copulae

The cumulative distribution function (cdf) of a 2-dimensional vector \(\left(X_{1},X_{2}\right)\) is given by

$$F(x_1,x_2)=\mathrm{P}(X_1\le x_1, X_2\le x_2).$$
(4.105)

For the case that X 1 and X 2 are independent, their joint cumulative distribution function F(x 1,x 2) can be written as a product of their 1-dimensional marginals:

$$F(x_1,x_2)=F_{X_1}(x_1)\,F_{X_2}(x_2).$$
(4.106)

But how can we model the dependence of X 1 and X 2? Most people would suggest linear correlation. However, correlation is an appropriate measure of dependence only when the random variables have an elliptical or spherical distribution, a class which includes the multivariate normal distribution. Although the terms “correlation” and “dependency” are often used interchangeably, correlation is actually a rather imperfect measure of dependency, and there are many circumstances where correlation should not be used.

Copulae represent an elegant concept of connecting marginals with joint cumulative distribution functions. Copulae are functions that join or “couple” multivariate distribution functions to their 1-dimensional marginal distribution functions. Let us consider a d-dimensional vector X=(X 1,…,X d ). Using copulae, the marginal distribution functions \(F_{X_{i}} (i=1,\ldots,d)\) can be separately modelled from their dependence structure and then coupled together to form the multivariate distribution F X . Copula functions have a long history in probability theory and statistics. Their application in finance is very recent. Copulae are important in Value-at-Risk calculations and constitute an essential tool in quantitative finance (Härdle et al. (2009)).

First let us concentrate on the 2-dimensional case; we will then extend this concept to the d-dimensional case, for a random variable in \(\mathbb{R}^{d}\) with d≥1. To be able to define a copula function, we first need to introduce the concepts of the F-volume of a rectangle, a 2-increasing function and a grounded function.

Let U 1 and U 2 be two sets in \(\mathbb{\overline{R}}=\mathbb{R}\cup \{+\infty\} \cup \{-\infty\}\) and consider the function \(F :U_{1} \times U_{2} \longrightarrow \mathbb{\overline{R}}\).

Definition 4.2

The F-volume of a rectangle B=[x 1,x 2]×[y 1,y 2]⊂U 1×U 2 is defined as:

$$V_F(B)=F(x_2,y_2)-F(x_2,y_1)-F(x_1,y_2)+F(x_1,y_1).$$
(4.107)

Definition 4.3

F is said to be a 2-increasing function if for every B=[x 1,x 2]×[y 1,y 2]⊂U 1×U 2,

$$V_F(B)\ge 0.$$
(4.108)

Remark 4.2

Note that being a 2-increasing function neither implies nor is implied by being increasing in each argument.

The following lemmas (Nelsen, 1999) will be very useful later for establishing the continuity of copulae.

Lemma 4.1

Let U 1 and U 2 be non-empty sets in \(\mathbb{\overline{R}}\) and let \(F :U_{1} \times U_{2} \longrightarrow \mathbb{\overline{R}}\) be a two-increasing function. Let x 1, x 2 be in U 1 with x 1x 2, and y 1, y 2 be in U 2 with y 1y 2. Then the function tF(t,y 2)−F(t,y 1) is non-decreasing on U 1 and the function tF(x 2,t)−F(x 1,t) is non-decreasing on U 2.

Definition 4.4

If U 1 and U 2 have a smallest element minU 1 and minU 2 respectively, then we say that a function \(F :U_{1} \times U_{2} \longrightarrow \mathbb{R}\) is grounded if:

$$F(x,\min U_2)=0\quad \textrm{for all } x\in U_1,$$
(4.109)
$$F(\min U_1,y)=0\quad \textrm{for all } y\in U_2.$$
(4.110)

In the following, we will refer to this definition of a cdf.

Definition 4.5

A cdf is a function from \(\mathbb{\overline{R}}^{2} \mapsto \left[0,1\right]\) which

  i) is grounded.

  ii) is 2-increasing.

  iii) satisfies \(F\left(\infty,\infty\right)=1\).

Lemma 4.2

Let U 1 and U 2 be non-empty sets in \(\mathbb{\overline{R}}\) and let \(F :U_{1} \times U_{2} \longrightarrow \mathbb{\overline{R}}\) be a grounded two-increasing function. Then F is non-decreasing in each argument.

Definition 4.6

If U 1 and U 2 have a greatest element maxU 1 and maxU 2 respectively, then we say that a function \(F :U_{1} \times U_{2} \longrightarrow \mathbb{R}\) has margins and that the margins of F are given by:

$$F(x)=F(x,\max U_2)\quad \textrm{for all } x\in U_1,$$
(4.111)
$$F(y)=F(\max U_1,y)\quad \textrm{for all } y\in U_2.$$
(4.112)

Lemma 4.3

Let U 1 and U 2 be non-empty sets in \(\mathbb{\overline{R}}\) and let \(F :U_{1} \times U_{2} \longrightarrow \mathbb{\overline{R}}\) be a grounded two-increasing function which has margins. Let (x 1,y 1), (x 2,y 2) ∈ U 1×U 2. Then

$$|F(x_2,y_2)-F(x_1,y_1)|\le |F(x_2)-F(x_1)|+|F(y_2)-F(y_1)|.$$
(4.113)

Definition 4.7

A two-dimensional copula is a function C defined on the unit square I 2=I×I with I=[0,1] such that

  i) for every u,v∈I: C(u,0)=C(0,v)=0, i.e. C is grounded.

  ii) for every u 1,u 2,v 1,v 2∈I with u 1≤u 2 and v 1≤v 2:

    $$C(u_2,v_2)-C(u_2,v_1)-C(u_1,v_2)+C(u_1,v_1)\ge 0,$$
    (4.114)

    i.e. C is 2-increasing.

  iii) for every u,v∈I: C(u,1)=u and C(1,v)=v.

Informally, a copula is a joint distribution function defined on the unit square \(\left[0,1\right]^{2}\) which has uniform marginals. That means that if \(F_{X_{1}}(x_{1})\) and \(F_{X_{2}}(x_{2})\) are univariate distribution functions, then \(C\{F_{X_{1}}(x_{1}),F_{X_{2}}(x_{2})\}\) is a 2-dimensional distribution function with marginals \(F_{X_{1}}(x_{1})\) and \(F_{X_{2}}(x_{2})\).

Example 4.25

The functions max(u+v−1,0), uv, min(u,v) can be easily checked to be copula functions. They are called respectively the minimum, product and maximum copula.

Example 4.26

Consider the function

$$C^{\mathit{Gauss}}(u,v)=\Phi_{\rho}\{\Phi_1^{-1}(u),\Phi_2^{-1}(v)\}=\int_{-\infty}^{\Phi_1^{-1}(u)}\int_{-\infty}^{\Phi_2^{-1}(v)}f_{\rho}(x_1,x_2)\,dx_2\,dx_1,$$
(4.115)

where Φ ρ is the joint 2-dimensional standard normal distribution function with correlation coefficient ρ, while Φ1 and Φ2 refer to standard normal cdfs and

$$f_{\rho}(x_1,x_2)=\frac{1}{2\pi\sqrt{1-\rho^2}}\exp\biggl\{-\frac{x_1^2-2\rho x_1x_2+x_2^2}{2(1-\rho^2)}\biggr\}$$
(4.116)

denotes the bivariate normal pdf.

It is easy to see that \(C^{\mathit{Gauss}}\) is a copula, the so-called Gaussian or normal copula, since it is 2-increasing and

$$C^{\mathit{Gauss}}(u,0)=C^{\mathit{Gauss}}(0,v)=0,$$
(4.117)
$$C^{\mathit{Gauss}}(u,1)=u\quad\textrm{and}\quad C^{\mathit{Gauss}}(1,v)=v.$$
(4.118)
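A numerical sketch (the helper gaussian_copula is hypothetical) evaluates \(C^{\mathit{Gauss}}\) via SciPy's bivariate normal cdf and checks the boundary conditions approximately:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Gaussian copula C(u, v) = Phi_rho(Phi^{-1}(u), Phi^{-1}(v)).
def gaussian_copula(u, v, rho):
    cov = [[1.0, rho], [rho, 1.0]]
    return multivariate_normal.cdf([norm.ppf(u), norm.ppf(v)],
                                   mean=[0.0, 0.0], cov=cov)

rho = 0.5
print(gaussian_copula(0.3, 0.7, rho))        # a value in (0, 1)
print(gaussian_copula(0.3, 1 - 1e-12, rho))  # approximately u = 0.3
print(gaussian_copula(1e-12, 0.7, rho))      # approximately 0 (grounded)
```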

A simple and useful way to represent the graph of a copula is the contour diagram, that is, the graph of its level sets: the sets in I 2 given by C(u,v)=constant. In Figures 4.16 and 4.17 we present the contour diagrams of the Gumbel-Hougaard copula (Example 4.4) for different values of the copula parameter θ.

Fig. 4.16

Surface plot of the Gumbel-Hougaard copula, θ=3  MVAghsurface

Fig. 4.17

Contour plots of the Gumbel-Hougaard copula  MVAghcontour

For θ=1 the Gumbel-Hougaard copula reduces to the product copula, i.e.

$$C_1(u,v)=\Pi(u,v)=uv.$$
(4.119)

For θ→∞, one finds for the Gumbel-Hougaard copula:

$$C_{\theta}(u,v)\longrightarrow \min(u,v)=M(u,v),$$
(4.120)

where M is also a copula such that C(u,v)≤M(u,v) for an arbitrary copula C. The copula M is called the Fréchet-Hoeffding upper bound.

The two-dimensional function W(u,v)=max(u+v−1,0) defines a copula with W(u,v)≤C(u,v) for any other copula C. W is called the Fréchet-Hoeffding lower bound.
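These bounds are easy to verify numerically; the sketch below assumes the standard form of the Gumbel-Hougaard copula, \(C_\theta(u,v)=\exp[-\{(-\log u)^\theta+(-\log v)^\theta\}^{1/\theta}]\), and checks W≤C_θ≤M on a grid:

```python
import numpy as np

# Gumbel-Hougaard copula and the Frechet-Hoeffding bounds W and M.
def gumbel_hougaard(u, v, theta):
    return np.exp(-((-np.log(u))**theta + (-np.log(v))**theta)**(1.0 / theta))

def W(u, v):
    return np.maximum(u + v - 1.0, 0.0)   # lower bound

def M(u, v):
    return np.minimum(u, v)               # upper bound

grid = np.linspace(0.01, 0.99, 50)
U, V = np.meshgrid(grid, grid)
C = gumbel_hougaard(U, V, theta=3.0)
print(np.all((W(U, V) <= C) & (C <= M(U, V))))          # True
print(np.allclose(gumbel_hougaard(U, V, 1.0), U * V))   # theta = 1: product copula
```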

In Figure 4.18 we show an example of Gumbel-Hougaard copula sampling for fixed parameters σ 1=1, σ 2=1 and θ=3.

Fig. 4.18

10000-sample output for σ 1=1, σ 2=1, θ=3  MVAsample1000

One can demonstrate the so-called Fréchet-Hoeffding inequality, which we have already used in Example 1.3, and which states that each copula function is bounded by the minimum and maximum one:

$$W(u,v)=\max(u+v-1,0)\le C(u,v)\le \min(u,v)=M(u,v).$$
(4.121)

The full relationship between copulae and joint cdfs is given by Sklar's theorem.

Example 4.27

Let us verify that the Gaussian copula satisfies Sklar’s theorem in both directions. On the one side, let

$$F(x_1,x_2)=\Phi_{\rho}(x_1,x_2)$$
(4.122)

be a 2-dimensional normal distribution function with standard normal cdf’s \(F_{X_{1}}(x_{1})\) and \(F_{X_{2}}(x_{2})\). Since \(F_{X_{1}}(x_{1})\) and \(F_{X_{2}}(x_{2})\) are continuous, a unique copula C exists such that for all \((x_1, x_2)\in \mathbb{\overline{R}}^{2}\) the 2-dimensional distribution function can be written as a copula in \(F_{X_{1}}(x_{1})\) and \(F_{X_{2}}(x_{2})\):

$$F(x_1,x_2)=C\{F_{X_1}(x_1),F_{X_2}(x_2)\}.$$
(4.123)

The Gaussian copula satisfies the above equality; therefore it is the unique copula mentioned in Sklar’s theorem. This proves that the Gaussian copula, together with Gaussian marginals, gives the two-dimensional normal distribution.

Conversely, if C is a copula and \(F_{X_{1}}\) and \(F_{X_{2}}\) are standard normal distribution functions, then

$$F(x_1,x_2)=C\{F_{X_1}(x_1),F_{X_2}(x_2)\}$$
(4.124)

is evidently a joint (two-dimensional) distribution function. Its margins are

$$F(x_1,\infty)=C\{F_{X_1}(x_1),F_{X_2}(\infty)\}=C\{F_{X_1}(x_1),1\}=F_{X_1}(x_1),$$
(4.125)
$$F(\infty,x_2)=C\{F_{X_1}(\infty),F_{X_2}(x_2)\}=C\{1,F_{X_2}(x_2)\}=F_{X_2}(x_2).$$
(4.126)

The following theorem shows one attractive feature of the copula representation of dependence: the dependence structure described by a copula is invariant under increasing and continuous transformations of the marginal distributions.

Theorem 4.13

If \(\left(X_{1},X_{2}\right)\) have copula C and g 1,g 2 are two continuous, increasing functions, then \(\left\{g_{1}\left(X_{1}\right),g_{2} \left(X_{2} \right)\right\}\) have the copula C, too.

Example 4.28

Independence implies that the product of the cdf’s \(F_{X_{1}}\) and \(F_{X_{2}}\) equals the joint distribution function F, i.e.:

$$F(x_1,x_2)=F_{X_1}(x_1)\,F_{X_2}(x_2).$$
(4.127)

Thus, we obtain the independence or product copula C=Π(u,v)=uv.

While it is easily understood how a product copula describes an independence relationship, the converse is also true. Namely, the joint distribution function of two independent random variables can be interpreted as a product copula. This concept is formalised in the following theorem:

Theorem 4.14

Let X 1 and X 2 be random variables with continuous distribution functions \(F_{X_{1}}\) and \(F_{X_{2}}\) and the joint distribution function F. Then X 1 and X 2 are independent if and only if \(C_{X_{1},X_{2}}=\Pi\).

Example 4.29

Let us consider the Gaussian copula for the case ρ=0, i.e. vanishing correlation. In this case the Gaussian copula becomes

$$C^{\mathit{Gauss}}(u,v)=\int_{-\infty}^{\Phi_1^{-1}(u)}\varphi(x_1)\,dx_1\int_{-\infty}^{\Phi_2^{-1}(v)}\varphi(x_2)\,dx_2=uv=\Pi(u,v).$$
(4.128)

The following theorem, which follows directly from Lemma 4.3, establishes the continuity of copulae.

Theorem 4.15

Let C be a copula. Then for any u 1,v 1,u 2,v 2∈I it holds that

$$|C(u_2,v_2)-C(u_1,v_1)|\le |u_2-u_1|+|v_2-v_1| .$$
(4.129)

From (4.129) it follows that every copula C is uniformly continuous on its domain.

A further important property of copulae concerns the partial derivatives of a copula with respect to its variables:

Theorem 4.16

Let C(u,v) be a copula. For any u∈I, the partial derivative \(\frac {\partial C(u, v)}{\partial v}\) exists for almost all v∈I. For such u and v one has:

$$0\le \frac{\partial C(u,v)}{\partial v}\le 1 .$$
(4.130)

The analogous statement is true for the partial derivative \(\frac {\partial C(u, v)}{\partial u}\):

$$0\le \frac{\partial C(u,v)}{\partial u}\le 1 .$$
(4.131)

Moreover, the functions

$$u\mapsto \frac{\partial C(u,v)}{\partial v}\quad\textrm{and}\quad v\mapsto \frac{\partial C(u,v)}{\partial u}$$

are defined and non-decreasing almost everywhere on I.

Until now, we have considered copulae only in a 2-dimensional setting. Let us now extend this concept to the d-dimensional case, for a random variable in \(\mathbb{R}^{d}\) with d≥1.

Let U 1,U 2,…,U d be non-empty sets in \(\mathbb{\overline{R}}\) and consider the function \(F : U_{1} \times U_{2} \times \cdots\times U_{d}\longrightarrow \mathbb{\overline{R}}\). For a=(a 1,a 2,…,a d ) and b=(b 1,b 2,…,b d ) with a≤b (i.e. a k ≤b k for all k) let B=[a,b]=[a 1,b 1]×[a 2,b 2]×⋯×[a d ,b d ] be the d-box with vertices c=(c 1,c 2,…,c d ). It is obvious that each c k is either equal to a k or to b k .

Definition 4.8

The F-volume of a d-box B=[a,b]=[a 1,b 1]×[a 2,b 2]×⋯×[a d ,b d ]⊂U 1×U 2×⋯×U d is defined as follows:

$$V_F(B)=\sum_{c}\mathit{sign}(c)\,F(c),$$
(4.132)

where the sum is taken over all vertices c of B and \(\mathit{sign}(c)=1\) if c k =a k for an even number of k’s, and \(\mathit{sign}(c)=-1\) if c k =a k for an odd number of k’s.

Example 4.30

For the case d=3, the F-volume of a 3-box B=[a,b]=[x 1,x 2]×[y 1,y 2]×[z 1,z 2] is defined as:

$$\begin{aligned}V_F(B)={}&F(x_2,y_2,z_2)-F(x_1,y_2,z_2)-F(x_2,y_1,z_2)-F(x_2,y_2,z_1)\\&{}+F(x_1,y_1,z_2)+F(x_1,y_2,z_1)+F(x_2,y_1,z_1)-F(x_1,y_1,z_1).\end{aligned}$$

Definition 4.9

F is said to be a d-increasing function if for all d-boxes B with vertices in U 1×U 2×⋯×U d :

$$V_F(B)\ge 0.$$
(4.133)

Definition 4.10

If U 1,U 2,…,U d have a smallest element minU 1,minU 2,…,minU d respectively, then we say that a function \(F :U_{1} \times U_{2} \times\cdots\times U_{d} \longrightarrow \mathbb{\overline{R}}\) is grounded if:

$$F(x_1,x_2,\ldots,x_d)=0$$
(4.134)

for all \((x_1,\ldots,x_d)\in U_1\times\cdots\times U_d\) such that x k =minU k for at least one k.

The lemmas, which we presented for the 2-dimensional case, have analogous multivariate versions, see Nelsen (1999).

Definition 4.11

A d-dimensional copula (or d-copula) is a function C defined on the unit d-cube I d=I×I×⋯×I such that

  i) for every u∈I d : C(u)=0, if at least one coordinate of u is equal to 0; i.e. C is grounded.

  ii) for every a,b∈I d with a≤b:

    $$V_C([a,b])\ge 0,$$
    (4.135)

    i.e. C is d-increasing.

  iii) for every u∈I d : C(u)=u k , if all coordinates of u are 1 except u k .

Analogously to the 2-dimensional setting, let us state Sklar’s theorem for the d-dimensional case.

Theorem 4.17

(Sklar’s theorem in d-dimensional case)

Let F be a d-dimensional distribution function with marginal distribution functions \(F_{X_{1}},F_{X_{2}},\ldots,F_{X_{d}}\). Then a d-copula C exists such that for all \((x_1,\ldots,x_d)\in \mathbb{\overline{R}}^{d}\):

$$F(x_1,x_2,\ldots,x_d)=C\{F_{X_1}(x_1),F_{X_2}(x_2),\ldots,F_{X_d}(x_d)\}.$$
(4.136)

Moreover, if \(F_{X_{1}},F_{X_{2}},\ldots,F_{X_{d}}\) are continuous then C is unique. Otherwise C is uniquely determined on the Cartesian product \(Im(F_{X_{1}})\times Im(F_{X_{2}})\times\cdots\times Im(F_{X_{d}})\).

Conversely, if C is a copula and \(F_{X_{1}}, F_{X_{2}},\ldots,F_{X_{d}} \) are distribution functions then F defined by (4.136) is a d-dimensional distribution function with marginals \(F_{X_{1}},F_{X_{2}},\ldots, F_{X_{d}}\).

In order to illustrate the d-copulae we present the following examples:

Example 4.31

Let Φ denote the univariate standard normal distribution function and ΦΣ,d the d-dimensional standard normal distribution function with correlation matrix Σ. Then the function

$$C^{\mathit{Gauss}}_{\Sigma}(u_1,\ldots,u_d)=\Phi_{\Sigma,d}\{\Phi^{-1}(u_1),\ldots,\Phi^{-1}(u_d)\}$$
(4.137)

is the d-dimensional Gaussian or normal copula with correlation matrix Σ. The function

(4.138)

is a copula density function. The copula dependence parameter α is the collection of all unknown correlation coefficients in Σ. If α≠0, then the corresponding normal copula allows one to generate joint symmetric dependence. However, it is not possible to model tail dependence, i.e. joint extreme events have zero probability.
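A short sketch, with an arbitrarily chosen correlation matrix, draws from a 3-dimensional Gaussian copula by transforming multivariate normal draws with Φ:

```python
import numpy as np
from scipy.stats import norm

# Sample from a 3-dimensional Gaussian copula with correlation matrix Sigma.
rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])

z = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)
u = norm.cdf(z)                       # each margin is uniform on (0, 1)
print(u.min(axis=0), u.max(axis=0))   # all values lie strictly in (0, 1)
print(np.corrcoef(z, rowvar=False))   # close to Sigma
```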

Example 4.32

Let us consider the following function:

$$C_{\theta}(u_1,\ldots,u_d)=\exp\Bigl[-\Bigl\{\textstyle\sum_{i=1}^{d}(-\log u_i)^{\theta}\Bigr\}^{1/\theta}\Bigr],\qquad \theta\ge 1.$$
(4.139)

One recognises this function as the d-dimensional Gumbel-Hougaard copula. Unlike the Gaussian copula, the copula (4.139) can generate upper tail dependence.

Example 4.33

As in the 2-dimensional setting, let us consider the d-dimensional Gumbel-Hougaard copula for the case θ=1. In this case the Gumbel-Hougaard copula reduces to the d-dimensional product copula, i.e.

$$C_1(u)=\prod_{i=1}^{d}u_i=\Pi^d(u).$$
(4.140)

The extension of the 2-dimensional copula M, which one obtains from the d-dimensional Gumbel-Hougaard copula for θ→∞, is denoted M d(u):

$$M^d(u)=\min(u_1,\ldots,u_d).$$
(4.141)

The d-dimensional function

$$W^d(u)=\max\Bigl(\textstyle\sum_{i=1}^{d}u_i-d+1,\ 0\Bigr)$$
(4.142)

satisfies W d(u)≤C(u) for any d-dimensional copula function C(u). W d(u) is the Fréchet-Hoeffding lower bound in the d-dimensional case.

The functions M d and Πd are d-copulae for all d≥2, whereas the function W d fails to be a d-copula for any d>2 (Nelsen, 1999). However, the d-dimensional version of the Fréchet-Hoeffding inequality can be written as follows:

$$W^d(u)\le C(u)\le M^d(u).$$
(4.143)

As we have already mentioned, copula functions have been widely applied in empirical finance.


8 Bootstrap

Recall that we need large sample sizes for the critical values obtained from the CLT to be good approximations. Here “large” means n>50 for one-dimensional data. How can we construct confidence intervals in the case of smaller sample sizes? One way is to use a method called the Bootstrap. The Bootstrap algorithm uses the data twice:

  1. estimate the parameter of interest,

  2. simulate from an estimated distribution to approximate the asymptotic distribution of the statistic of interest.

In detail, the bootstrap works as follows. Consider the observations x 1,…,x n of the sample X 1,…,X n and compute the empirical distribution function (edf) F n . In the case of one-dimensional data

$$ F_{n}(x) = \frac{1}{n} \sum_{i=1}^n \,\textbf{\textit{I}}(X_{i}\le x).$$
(4.144)

This is a step function which is constant between neighboring data points.
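A minimal sketch of the edf as a callable step function:

```python
import numpy as np

# Empirical distribution function (4.144) as a callable step function.
def edf(sample):
    x_sorted = np.sort(sample)
    n = len(x_sorted)
    def F_n(x):
        # fraction of observations less than or equal to x
        return np.searchsorted(x_sorted, x, side="right") / n
    return F_n

rng = np.random.default_rng(3)
data = rng.standard_normal(100)
F_n = edf(data)
print(F_n(0.0))   # close to 0.5 for standard normal data
```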

Example 4.34

Suppose that we have n=100 standard normal N(0,1) data points X i , i=1,…,n. The cdf of X is \(\Phi(x) = \int^{x}_{-\infty} \varphi(u) du\) and is shown in Figure 4.19 as the thin, solid line. The empirical distribution function (edf) is displayed as a thick step function line. Figure 4.20 shows the same setup for n=1000 observations.

Fig. 4.19

The standard normal cdf (thick line) and the empirical distribution function (thin line) for n=100  MVAedfnormal

Fig. 4.20

The standard normal cdf (thick line) and the empirical distribution function (thin line) for n=1000  MVAedfnormal

Now draw with replacement a new sample from this empirical distribution. That is, we sample with replacement \(n^{\ast}\) observations \(X_{1}^{\ast}, \ldots,X_{n^{\ast}}^{\ast}\) from the original sample. This is called a Bootstrap sample. Usually one takes \(n^{\ast}=n\).

Since we sample with replacement, a single observation from the original sample may appear several times in the Bootstrap sample. For instance, if the original sample consists of the three observations x 1,x 2,x 3, then a Bootstrap sample might look like \(X_{1}^{*}=x_{3}\), \(X_{2}^{*}=x_{2}\), \(X_{3}^{*}=x_{3}\). Computationally, we find the Bootstrap sample by using a uniform random number generator to draw from the indices 1,2,…,n of the original samples.

The Bootstrap observations are drawn randomly from the empirical distribution, i.e., the probability for each original observation to be selected into the Bootstrap sample is 1/n for each draw. It is easy to compute that

$$\mathop{\mbox{E} _{F_{n}}} (X_{i}^\ast) = \frac{1}{n}\sum_{i=1}^{n}x_{i}=\,\bar{x}.$$

Hence, under the edf \(F_{n}\), the expected value of a bootstrap observation is the mean \(\bar{x}\) of the original sample x 1,…,x n . The same holds for the variance, i.e.,

$$\mathop{\mbox{Var}_{F_{n}}}(X_{i}^\ast) = \widehat{\sigma}^2, $$

where \(\widehat{\sigma}^{2} = n^{-1} \sum (x_{i} - \bar{x})^{2}\). The cdf of the bootstrap observations is defined as in (4.144). Figure 4.21 shows the cdf of the n=100 original observations as a solid line and two bootstrap cdf’s as thin lines.

Fig. 4.21

The cdf F n (thick line) and two bootstrap cdf‘s \(F_{n}^{*}\) (thin lines)  MVAedfbootstrap

The CLT holds for the bootstrap sample. Analogously to Corollary 4.1 we have the following corollary.

Corollary 4.2

If \(X_{1}^{\ast}, \ldots, X_{n}^{\ast}\) is a bootstrap sample from X 1,…,X n , then the distribution of

$$\sqrt{n} \left( \frac{\bar{x}^\ast-\bar{x}}{\widehat{\sigma}^\ast}\right) $$

also becomes N(0,1) asymptotically, where \(\overline{x}^{\ast}= n^{-1} \sum_{i=1}^{n} X_{i}^{\ast}\) and \((\widehat{\sigma}^{\ast})^{2} = n^{-1} \sum_{i=1}^{n} (X_{i}^{\ast}-\bar{x}^{\ast})^{2}\).

How do we find a confidence interval for μ using the Bootstrap method? Recall that the quantile u 1−α/2 might be bad for small sample sizes because the true distribution of \(\sqrt{n}( \frac{\bar{x}-\mu}{\widehat{\sigma}})\) might be far away from the limit distribution N(0,1). The Bootstrap idea enables us to “simulate” this distribution by computing \(\sqrt{n} ( \frac{\bar{x}^{\ast}-\bar{x}}{\widehat{\sigma}^{\ast}} )\) for many Bootstrap samples. In this way we can estimate an empirical (1−α/2)-quantile \(u_{1-\alpha/2}^{\ast}\). The bootstrap improved confidence interval is then

$$C_{1-\alpha}^\ast = \left[\bar{x}-\frac{\widehat{\sigma}}{\sqrt{n}}\,u_{1-\alpha/2}^\ast,\, \bar{x}+\frac{\widehat{\sigma}}{\sqrt{n}}\,u_{1-\alpha/2}^\ast \right]. $$

By Corollary 4.2 we have

$$P(\mu \in C_{1-\alpha}^\ast) \longrightarrow 1 - \alpha \quad \mbox{as } n \rightarrow \infty, $$

but with an improved speed of convergence, see Hall (1992).
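The whole procedure can be sketched in a few lines, here with a made-up small skewed sample (this is only an illustration, not the MVA quantlet):

```python
import numpy as np

# Bootstrap-improved confidence interval for the mean, following the recipe
# above: resample, recompute the studentised statistic, and use its empirical
# (1 - alpha/2)-quantile instead of the normal quantile.
rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=30)   # a small, skewed sample
n, B, alpha = len(x), 5000, 0.05

x_bar = x.mean()
sigma_hat = x.std()                       # 1/n convention, as in the text

t_star = np.empty(B)
for b in range(B):
    xs = rng.choice(x, size=n, replace=True)          # bootstrap sample
    t_star[b] = np.sqrt(n) * (xs.mean() - x_bar) / xs.std()

u_star = np.quantile(t_star, 1 - alpha / 2)           # empirical quantile
ci = (x_bar - sigma_hat / np.sqrt(n) * u_star,
      x_bar + sigma_hat / np.sqrt(n) * u_star)
print(ci)
```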


9 Exercises

Exercise 4.1

Assume that the random vector Y has the following normal distribution: \(Y \sim N_{p}(0,{\mathcal{I}})\). Transform it according to (4.49) to create XN(μ,Σ) with mean μ=(3,2) and . How would you implement the resulting formula on a computer?

Exercise 4.2

Prove Theorem 4.7 using Theorem 4.5.

Exercise 4.3

Suppose that X has mean zero and covariance . Let Y=X 1+X 2. Write Y as a linear transformation, i.e., find the transformation matrix \({\mathcal{A}}\). Then compute \(\mathit{Var}(Y)\) via (4.26). Can you obtain the result in another fashion?

Exercise 4.4

Calculate the mean and the variance of the estimate \(\hat{\beta}\) in (3.50).

Exercise 4.5

Compute the conditional moments E(X 2x 1) and E(X 1x 2) for the pdf of Example 4.5.

Exercise 4.6

Prove the relation (4.28).

Exercise 4.7

Prove the relation (4.29). Hint: Note that

$$\mathop {\mbox {\sf Var}}(E(X_2|X_1)) = E\{E(X_2|X_1)\, E(X_2^{\top}|X_1)\} - E(X_2)\, E(X_2^{\top})$$

and that

$$E(\mathop {\mbox {\sf Var}}(X_2|X_1)) = E[E(X_2X_2^{\top}|X_1) - E(X_2|X_1) \, E(X_2^{\top}|X_1)].$$

Exercise 4.8

Compute (4.46) for the pdf of Example 4.5.

Exercise 4.9

Show that

$$\everymath{\displaystyle}f_Y(y)=\left\{\begin{array}{l@{\quad}l}\frac{1}{2} y_1 - \frac{1}{4} y_2 & 0\leq y_1\leq 2, \ |y_2|\leq 1-|1-y_1| \\0 & \mbox{otherwise}\end{array}\right.$$

is a pdf.

Exercise 4.10

Compute (4.46) for a two-dimensional standard normal distribution. Show that the transformed random variables Y 1 and Y 2 are independent. Give a geometrical interpretation of this result based on iso-distance curves.

Exercise 4.11

Consider the Cauchy distribution which has no moment, so that the CLT cannot be applied. Simulate the distribution of \(\overline{x}\) (for different n’s). What can you expect for n→∞?

Hint: The Cauchy distribution can be simulated by the quotient of two independent standard normally distributed random variables.

Exercise 4.12

A European car company has tested a new model and reports the consumption of petrol (X 1) and oil (X 2). The expected consumption of petrol is 8 liters per 100 km (μ 1) and the expected consumption of oil is 1 liter per 10,000 km (μ 2). The measured consumption of petrol is 8.1 liters per 100 km (\(\overline{x}_{1}\)) and the measured consumption of oil is 1.1 liters per 10,000 km (\(\overline{x}_{2}\)). The asymptotic distribution of is .

For the American market the basic measuring units are miles (1 mile ≈ 1.6 km) and gallons (1 gallon ≈ 3.8 liter). The consumptions of petrol (Y 1) and oil (Y 2) are usually reported in miles per gallon. Can you express \(\overline{y}_{1}\) and \(\overline{y}_{2}\) in terms of \(\overline{x}_{1}\) and \(\overline{x}_{2}\)? Recompute the asymptotic distribution for the American market.

Exercise 4.13

Consider the pdf \(f(x_{1},x_{2})=e^{-(x_{1}+x_{2})}, x_{1},x_{2}>0\) and let U 1=X 1+X 2 and U 2=X 1X 2. Compute f(u 1,u 2).

Exercise 4.14

Consider the pdf‘s

$$\begin{array}{rcl@{\quad}l}f(x_1,x_2)&=& 4x_1x_2e^{-x_1^2} & x_1,x_2>0,\\f(x_1,x_2)&=& 1 & 0<x_1,x_2<1 \mbox{ and } x_1+x_2<1\\f(x_1,x_2)&=&\displaystyle\frac{1}{2}e^{-x_1} & x_1>|x_2|.\end{array}$$

For each of these pdf’s compute \(E(X), \mathop {\mbox {\sf Var}}(X), E(X_{1}|X_{2}), E(X_{2}|X_{1}), V(X_{1}|X_{2})\) and V(X 2|X 1).

Exercise 4.15

Consider the pdf \(f(x_{1},x_{2})=\frac{3}{2}x_{1}^{-\frac{1}{2}},\ 0<x_{1}<x_{2}<1\). Compute P(X 1<0.25),P(X 2<0.25) and P(X 2<0.25|X 1<0.25).

Exercise 4.16

Consider the pdf \(f(x_{1},x_{2})=\frac{1}{2\pi}\), 0<x 1<2π, 0<x 2<1. Let \(U_{1}=\sin X_{1}\sqrt{-2\log X_{2}}\) and \(U_{2}=\cos X_{1}\sqrt{-2\log X_{2}}\). Compute f(u 1,u 2).

Exercise 4.17

Consider f(x 1,x 2,x 3)=k(x 1+x 2 x 3); 0<x 1,x 2,x 3<1.

  a) Determine k so that f is a valid pdf of (X 1,X 2,X 3)=X.

  b) Compute the (3×3) matrix Σ X .

  c) Compute the (2×2) matrix of the conditional variance of (X 2,X 3) given X 1=x 1.

Exercise 4.18

Let .

  a) Represent the contour ellipses for \(a=0;\ -\frac{1}{2};\ +\frac{1}{2};\ 1\).

  b) For \(a=\frac{1}{2}\) find the regions of X centred on μ which cover the area of the true parameter with probability 0.90 and 0.95.

Exercise 4.19

Consider the pdf

$$f(x_1,x_2)=\frac{1}{8x_2}e^{-(\frac{x_1}{2x_2}+\frac{x_2}{4})}\quad x_1,x_2>0.$$

Compute f(x 2) and f(x 1|x 2). Also give the best approximation of X 1 by a function of X 2. Compute the variance of the error of the approximation.

Exercise 4.20

Prove Theorem 4.6.