Abstract
The preceding chapter showed that by using the first two moments of a multivariate distribution (the mean and the covariance matrix), a great deal of information on the relationship between the variables can be made available. Only basic statistical theory was used to derive tests of independence or of linear relationships. In this chapter we give an introduction to the basic probability tools useful in statistical multivariate analysis.
Means and covariances share many interesting and useful properties, but they represent only part of the information on a multivariate distribution. Section 4.1 presents the basic probability tools used to describe a multivariate random variable, including marginal and conditional distributions and the concept of independence. In Section 4.2, basic properties on means and covariances (marginal and conditional ones) are derived.
Since many statistical procedures rely on transformations of a multivariate random variable, Section 4.3 proposes the basic techniques needed to derive the distribution of transformations with a special emphasis on linear transforms. As an important example of a multivariate random variable, Section 4.4 defines the multinormal distribution. It will be analysed in more detail in Chapter 5 along with most of its “companion” distributions that are useful in making multivariate statistical inferences.
The normal distribution plays a central role in statistics because it can be viewed as an approximation and limit of many other distributions. The basic justification relies on the central limit theorem presented in Section 4.5. We present this central theorem in the framework of sampling theory. A useful extension of this theorem is also given: it provides an approximate distribution for transformations of asymptotically normal variables. The increasing power of computers today makes it possible to consider alternative approximate sampling distributions. These are based on resampling techniques and are suitable for many general situations. Section 4.8 gives an introduction to the ideas behind bootstrap approximations.
1 Distribution and Density Function
Let X=(X 1,X 2,…,X p )⊤ be a random vector. The cumulative distribution function (cdf) of X is defined by
For continuous X, there exists a nonnegative probability density function (pdf) f, such that
Note that
Most of the integrals appearing below are multidimensional. For instance, \(\int_{-\infty}^{x} f(u) du\) means \(\int_{-\infty}^{x_{p}} \cdots \int_{-\infty}^{x_{1}} f(u_{1},\ldots,u_{p}) du_{1} \cdots du_{p}\). Note also that the cdf F is differentiable with
For discrete X, the values of this random variable are concentrated on a countable or finite set of points \(\{c_{j}\}_{j\in J}\); the probability of events of the form {X∈D} can then be computed as
If we partition X as X=(X 1,X 2)⊤ with \(X_{1}\in \mathbb {R}^{k}\) and \(X_{2}\in \mathbb {R}^{p-k}\), then the function
is called the marginal cdf. F=F(x) is called the joint cdf. For continuous X the marginal pdf can be computed from the joint density by “integrating out” the variable not of interest.
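As a small computational sketch of this "integrating out" (the joint density \(f(x_{1},x_{2})=x_{1}+x_{2}\) on the unit square is chosen purely for illustration and is not one of the examples of this chapter):

```python
# Minimal sketch: obtain the marginal pdf of X1 by "integrating out" x2
# from an illustrative joint density f(x1, x2) = x1 + x2 on the unit square.
from scipy import integrate

def f(x1, x2):
    return x1 + x2                                 # a valid pdf on [0,1]^2

def f_X1(x1):
    val, _ = integrate.quad(lambda x2: f(x1, x2), 0, 1)
    return val                                     # analytically: x1 + 1/2

print(f_X1(0.3))                                   # 0.8
print(integrate.quad(f_X1, 0, 1)[0])               # 1.0: the marginal integrates to one
```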
The conditional pdf of X 2 given X 1=x 1 is given as
Example 4.1
Consider the pdf
f(x 1,x 2) is a density since
The marginal densities are
The conditional densities are therefore
Note that these conditional pdf’s are nonlinear in x 1 and x 2 although the joint pdf has a simple (linear) structure.
Independence of two random variables is defined as follows.
Definition 4.1
X 1 and X 2 are independent iff \(f(x) = f(x_{1},x_{2}) = f_{X_{1}}(x_{1}) f_{X_{2}}(x_{2})\).
That is, X 1 and X 2 are independent if the conditional pdf’s are equal to the marginal densities, i.e., \(f(x_{1} \mid x_{2}) = f_{X_{1}}(x_{1}) \) and \(f(x_{2} \mid x_{1}) = f_{X_{2}}(x_{2}) \). Independence can be interpreted as follows: knowing X 2=x 2 does not change the probability assessments on X 1, and conversely.
Different joint pdf’s may have the same marginal pdf’s.
Example 4.2
Consider the pdf’s
and
We compute in both cases the marginal pdf’s as
Indeed
Hence we obtain identical marginals from different joint distributions.
Let us study the concept of independence using the bank notes example. Consider the variables X 4 (lower inner frame) and X 5 (upper inner frame). From Chapter 3, we already know that they have significant correlation, so they are almost surely not independent. Kernel estimates of the marginal densities, \(\widehat{f}_{X_{4}}\) and \(\widehat{f}_{X_{5}}\), are given in Figure 4.1. In Figure 4.2 (left) we show the product of these two densities. The kernel density technique was presented in Section 1.3. If X 4 and X 5 are independent, this product \(\widehat{f}_{X_{4}}\cdot \widehat{f}_{X_{5}}\) should be roughly equal to \(\widehat{f}(x_{4},x_{5})\), the estimate of the joint density of (X 4,X 5). Comparing the two graphs in Figure 4.2 reveals that the two densities are different. The two variables X 4 and X 5 are therefore not independent.
An elegant concept of connecting marginals with joint cdfs is given by copulae. Copulae are important in Value-at-Risk calculations and are an essential tool in quantitative finance (Härdle, Hautsch and Overbeck, 2009).
For simplicity of presentation we concentrate on the p=2 dimensional case. A 2-dimensional copula is a function C: [0,1]2→[0,1] with the following properties:
-
For every u∈[0,1]: C(0,u)=C(u,0)=0.
-
For every u∈[0,1]: C(u,1)=u and C(1,u)=u.
-
For every (u 1,u 2),(v 1,v 2)∈[0,1]×[0,1] with u 1≤v 1 and u 2≤v 2:
$$C(v_1,v_2) - C(v_1,u_2) - C(u_1,v_2) + C(u_1,u_2) \ge 0 \, .$$
The usage of the name “copula” for the function C is explained by the following theorem.
Theorem 4.1
(Sklar’s theorem)
Let F be a joint distribution function with marginal distribution functions \(F_{X_{1}}\) and \(F_{X_{2}}\). Then a copula C exists with
for every \(x_{1},x_{2} \in \mathbb {R}\). If \(F_{X_{1}}\) and \(F_{X_{2}}\) are continuous, then C is unique. On the other hand, if C is a copula and \(F_{X_{1}}\) and \(F_{X_{2}}\) are distribution functions, then the function F defined by (4.5) is a joint distribution function with marginals \(F_{X_{1}}\) and \(F_{X_{2}}\).
With Sklar’s Theorem, the use of the name “copula” becomes obvious. It was chosen to describe “a function that links a multidimensional distribution to its one-dimensional margins” and appeared in the mathematical literature for the first time in Sklar (1959).
Example 4.3
The structure of independence implies that the product of the distribution functions \(F_{X_{1}}\) and \(F_{X_{2}}\) equals their joint distribution function F,
Thus, we obtain the independence copula C=Π from
Theorem 4.2
Let X 1 and X 2 be random variables with continuous distribution functions \(F_{X_{1}}\) and \(F_{X_{2}}\) and the joint distribution function F. Then X 1 and X 2 are independent if and only if \(C_{X_{1}, X_{2}} = \Pi\).
Proof
From Sklar’s Theorem we know that there exists a unique copula C with
Independence can be seen using (4.5) for the joint distribution function F and the definition of Π,
□
Example 4.4
The Gumbel-Hougaard family of copulae (Nelsen, 1999) is given by the function
The parameter θ may take all values in the interval [1,∞). The Gumbel-Hougaard copulae are suited to describe bivariate extreme value distributions.
For θ=1, the expression (4.9) reduces to the product copula, i.e., C 1(u,v)=Π(u,v)=u v. For θ→∞ one finds for the Gumbel-Hougaard copula:
where the function M is also a copula such that C(u,v)≤M(u,v) for an arbitrary copula C. The copula M is called the Fréchet-Hoeffding upper bound.
Similarly, we obtain the Fréchet-Hoeffding lower bound W(u,v)=max(u+v−1,0) which satisfies W(u,v)≤C(u,v) for any other copula C.
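The following sketch evaluates the Gumbel-Hougaard copula numerically, assuming the standard closed form \(C_{\theta}(u,v)=\exp [-\{(-\log u)^{\theta}+(-\log v)^{\theta}\}^{1/\theta}]\) to which (4.9) refers, and checks the two limiting cases as well as the Fréchet-Hoeffding bounds:

```python
# Sketch of the Gumbel-Hougaard copula (standard closed form, cf. (4.9)):
# C_theta(u, v) = exp(-[(-log u)^theta + (-log v)^theta]^(1/theta))
import numpy as np

def gumbel_hougaard(u, v, theta):
    return np.exp(-((-np.log(u))**theta + (-np.log(v))**theta)**(1.0 / theta))

u, v = 0.3, 0.7
print(gumbel_hougaard(u, v, 1.0))         # theta = 1: product copula u*v = 0.21
print(gumbel_hougaard(u, v, 50.0))        # large theta: close to M(u, v) = min(u, v) = 0.3

# Frechet-Hoeffding bounds W <= C <= M on a grid
grid = np.linspace(0.01, 0.99, 25)
U, V = np.meshgrid(grid, grid)
C = gumbel_hougaard(U, V, 3.0)
W = np.maximum(U + V - 1.0, 0.0)
M = np.minimum(U, V)
print(np.all(W <= C) and np.all(C <= M))  # True
```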
2 Moments and Characteristic Functions
2.1 Moments—Expectation and Covariance Matrix
If X is a random vector with density f(x) then the expectation of X is
Accordingly, the expectation of a matrix of random elements has to be understood component by component. The operation of forming expectations is linear:
If \({\mathcal{A}}(q \times p)\) is a matrix of real numbers, we have:
When X and Y are independent,
The matrix
is the (theoretical) covariance matrix. We write for a vector X with mean vector μ and covariance matrix Σ,
The (p×q) matrix
is the covariance matrix of X∼(μ,Σ XX ) and Y∼(ν,Σ YY ). Note that \(\Sigma_{XY} = \Sigma^{\top}_{YX}\) and that the stacked vector \((X^{\top},Y^{\top})^{\top}\) has a covariance matrix with blocks \(\Sigma_{XX}\), \(\Sigma_{XY}\), \(\Sigma_{YX}\) and \(\Sigma_{YY}\). From
it follows that \(\mathop {\mbox {\sf Cov}}(X,Y)=0\) in the case where X and Y are independent. We often say that \(\mu = \mathop {\mbox {\sf E}}(X)\) is the first order moment of X and that \(\mathop {\mbox {\sf E}}(XX^{\top})\) provides the second order moments of X:
2.2 Properties of the Covariance Matrix \(\Sigma=\mathop {\mbox {\sf Var}}(X)\)
2.3 Properties of Variances and Covariances
Let us compute these quantities for a specific joint density.
Example 4.5
Consider the pdf of Example 4.1. The mean vector is
The elements of the covariance matrix are
Hence the covariance matrix is
2.4 Conditional Expectations
The conditional expectations are
\(\mathop {\mbox {\sf E}}(X_{2}|x_{1})\) represents the location parameter of the conditional pdf of X 2 given that X 1=x 1. In the same way, we can define \(\mathop {\mbox {\sf Var}}(X_{2}|X_{1}=x_{1})\) as a measure of the dispersion of X 2 given that X 1=x 1. We have from (4.20) that
Using the conditional covariance matrix, the conditional correlations may be defined as:
These conditional correlations are known as partial correlations between X 2 and X 3, conditioned on X 1 being equal to x 1.
Example 4.6
Consider the following pdf
Note that the pdf is symmetric in x 1,x 2 and x 3 which facilitates the computations. For instance,
and the other marginals are similar. We also have
It is easy to compute the following moments:
and
Note that the conditional means of X 1 and of X 2, given X 3=x 3, are not linear in x 3. From these moments we obtain:
The conditional covariance matrix of X 1 and X 2, given X 3=x 3 is
In particular, the partial correlation between X 1 and X 2, given that X 3 is fixed at x 3, is given by \(\rho _{X_{1}X_{2}|X_{3}=x_{3}}=-\frac{1}{12x_{3}^{2}+24x_{3}+11}\) which ranges from −0.0909 to −0.0213 when x 3 goes from 0 to 1. Therefore, in this example, the partial correlation may be larger or smaller than the simple correlation, depending on the value of the condition X 3=x 3.
Example 4.7
Consider the following joint pdf
Note the symmetry of x 1 and x 3 in the pdf and that X 2 is independent of (X 1,X 3). It immediately follows that
Simple computations lead to
Let us analyze the conditional distribution of (X 1,X 2) given X 3=x 3. We have
so that again X 1 and X 2 are independent conditional on X 3=x 3. In this case
2.5 Properties of Conditional Expectations
Since \(\mathop {\mbox {\sf E}}(X_{2}|X_{1}=x_{1})\) is a function of x 1, say h(x 1), we can define the random variable \(h(X_{1}) = \mathop {\mbox {\sf E}}(X_{2}|X_{1})\). The same can be done when defining the random variable \(\mathop {\mbox {\sf Var}}(X_{2}|X_{1})\). These two random variables share some interesting properties:
Example 4.8
Consider the following pdf
It is easy to show that
Without explicitly computing f(x 2), we can obtain:
The conditional expectation \(\mathop {\mbox {\sf E}}(X_{2}|X_{1})\) viewed as a function h(X 1) of X 1 (known as the regression function of X 2 on X 1), can be interpreted as a conditional approximation of X 2 by a function of X 1. The error term of the approximation is then given by:
Theorem 4.3
Let \(X_{1} \in \mathbb {R}^{k}\) and \(X_{2} \in \mathbb {R}^{p-k}\) and \(U = X_{2} - \mathop {\mbox {\sf E}}(X_{2}|X_{1})\). Then we have:
-
(1)
\(\mathop {\mbox {\sf E}}(U) = 0\)
-
(2)
\(\mathop {\mbox {\sf E}}(X_{2}|X_{1})\) is the best approximation of X 2 by a function h(X 1) of X 1, where \(h:\; \mathbb {R}^{k} \longrightarrow \mathbb {R}^{p-k}\) (a numerical sketch follows the theorem). Here “best” is meant in the sense of minimum mean squared error (MSE), where
$$MSE(h) = \mathop {\mbox {\sf E}}[\{X_2 - h(X_1)\}^{\top} \, \{X_2 - h(X_1)\}].$$
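A Monte Carlo sketch of this theorem, under the assumption (made only for the example) that (X 1,X 2) is standardised bivariate normal with correlation ρ, in which case \(\mathop{\mbox{\sf E}}(X_{2}|X_{1})=\rho X_{1}\):

```python
# Sketch: E(X2 | X1) is the best approximation of X2 in the MSE sense.
# For standardised bivariate normal (X1, X2) with correlation rho, E(X2 | X1) = rho * X1.
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.6, 200_000
x1 = rng.standard_normal(n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)

mse_cond = np.mean((x2 - rho * x1)**2)       # regression function E(X2|X1): MSE = 1 - rho^2
mse_other = np.mean((x2 - 0.9 * x1)**2)      # some other predictor h(X1) = 0.9 * X1
print(mse_cond, mse_other)                   # the conditional mean has the smaller MSE
```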
2.6 Characteristic Functions
The characteristic function (cf) of a random vector \(X\in \mathbb {R}^{p}\) (respectively its density f(x)) is defined as
where \(\mathbf{i}\) is the complex unit: \(\mathbf{i}^{2} = -1\). The cf has the following properties:
If φ is absolutely integrable, i.e., the integral \(\int_{-\infty}^{\infty}|\varphi(t)|\, dt\) exists and is finite, then
If X=(X 1,X 2,…,X p )⊤, then for t=(t 1,t 2,…,t p )⊤
If X 1,…,X p are independent random variables, then for t=(t 1,t 2,…,t p )⊤
If X 1,…,X p are independent random variables, then for \(t\in \mathbb {R}\)
The characteristic function can recover all the cross-product moments of any order: ∀j k ≥0,k=1,…,p and for t=(t 1,…,t p )⊤ we have
Example 4.9
The cf of the density in Example 4.5 is given by
Example 4.10
Suppose \(X\in \mathbb {R}^{1}\) follows the density of the standard normal distribution
(see Section 4.4) then the cf can be computed via
since \(\mathbf{i}^{2}=-1\) and \(\int \frac{1}{\sqrt{2\pi}} \exp \bigl\{-\frac {(x-\mathbf{i}t)^{2}}{2}\bigr\}\,dx=1\).
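A quick numerical check of this cf, integrating \(\varphi(x)\,e^{\mathbf{i}tx}\) directly:

```python
# Sketch: check numerically that the cf of N(0,1) is exp(-t^2/2),
# i.e. E(e^{itX}) computed by numerical integration of phi(x) * e^{itx}.
import numpy as np
from scipy import integrate
from scipy.stats import norm

def cf_numeric(t):
    real, _ = integrate.quad(lambda x: np.cos(t * x) * norm.pdf(x), -np.inf, np.inf)
    imag, _ = integrate.quad(lambda x: np.sin(t * x) * norm.pdf(x), -np.inf, np.inf)
    return real + 1j * imag

for t in (0.0, 0.5, 1.0, 2.0):
    print(t, cf_numeric(t), np.exp(-t**2 / 2))   # imaginary part is ~0 by symmetry
```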
A variety of distributional characteristics can be computed from φ X (t). The standard normal distribution has a very simple cf, as was seen in Example 4.10. Deviations from normal covariance structures can be measured by the deviations from the cf (or characteristics of it). In Table 4.1 we give an overview of the cf’s for a variety of distributions.
Theorem 4.4
(Cramer-Wold)
The distribution of \(X\in \mathbb {R}^{p}\) is completely determined by the set of all (one-dimensional) distributions of t ⊤ X where \(t\in \mathbb {R}^{p}\).
This theorem says that we can determine the distribution of X in \(\mathbb {R}^{p}\) by specifying all of the one-dimensional distributions of the linear combinations
2.7 Cumulant Functions
Moments \(m_{k}=\int x^{k} f(x)\,dx\) often help in describing distributional characteristics. The normal distribution in d=1 dimension is completely characterised by its standard normal density f=φ and the moment parameters are μ=m 1 and \(\sigma^{2}=m_{2}-m_{1}^{2}\). Another helpful class of parameters are the cumulants or semi-invariants of a distribution. In order to simplify notation we concentrate here on the one-dimensional (d=1) case.
For a given one-dimensional random variable X with density f and finite moments of order k, the characteristic function \(\varphi_{X}(t)=\mathop {\mbox {\sf E}}(e^{\mathbf{ i}tX})\) has the derivative
The values κ j are called cumulants or semi-invariants since κ j does not change (for j>1) under a shift transformation X↦X+a. The cumulants are natural parameters for dimension reduction methods, in particular the Projection Pursuit method (see Section 19.2).
The relationship between the first k moments m 1,…,m k and the cumulants is given by
Example 4.11
Suppose that k=1, then formula (4.36) above yields
For k=2 we obtain
For k=3 we have to calculate
Calculating the determinant we have:
Similarly one calculates
The same type of process is used to find the moments from the cumulants:
A very simple relationship can be observed between the semi-invariants and the central moments \(\mu_{k}=\mathop {\mbox {\sf E}}(X-\mu)^{k}\), where μ=m 1 as defined before. In fact, κ 2=μ 2, κ 3=μ 3 and \(\kappa_{4}=\mu_{4}-3\mu_{2}^{2}\).
Skewness γ 3 and kurtosis γ 4 are defined as:
The skewness and kurtosis determine the shape of one-dimensional distributions. The skewness of a normal distribution is 0 and the kurtosis equals 3. The relation of these parameters to the cumulants is given by:
From (4.39) and Example 4.11
These relations will be used later in Section 19.2 on Projection Pursuit to determine deviations from normality.
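The cumulants of a sample can be estimated by k-statistics; the following sketch (with purely illustrative simulated data) computes them with scipy and recovers skewness and kurtosis through the standard relations \(\gamma_{3}=\kappa_{3}/\kappa_{2}^{3/2}\) and \(\gamma_{4}=\kappa_{4}/\kappa_{2}^{2}+3\):

```python
# Sketch: estimate the first four cumulants with k-statistics and derive
# skewness and kurtosis from them (using kappa_2 = mu_2, kappa_3 = mu_3,
# kappa_4 = mu_4 - 3*mu_2^2, as stated above).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=100_000)    # a skewed, leptokurtic sample

k = [stats.kstat(x, n) for n in (1, 2, 3, 4)]   # unbiased cumulant estimates
skewness = k[2] / k[1]**1.5                     # gamma_3 = kappa_3 / kappa_2^{3/2}
kurtosis = k[3] / k[1]**2 + 3                   # gamma_4 = kappa_4 / kappa_2^2 + 3
print(skewness, kurtosis)                       # for Exp(1): approximately 2 and 9
```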
3 Transformations
Suppose that X has pdf f X (x). What is the pdf of Y=3X? Or if X=(X 1,X 2,X 3)⊤, what is the pdf of
This is a special case of asking for the pdf of Y when
for a one-to-one transformation u: \(\mathbb {R}^{p} \rightarrow \mathbb {R}^{p}\). Define the Jacobian of u as
and let \(\mathop{\rm{abs}}(|{\mathcal{J}}|)\) be the absolute value of the determinant of this Jacobian. The pdf of Y is given by
Using this we can answer the introductory questions, namely
with
and hence \(\mathop{\rm{abs}}(|{\mathcal{J}}|) = ( \frac{1}{3} )^{p}\). So the pdf of Y is \(\frac{1}{3^{p}} f_{X} ( \frac{y}{3})\).
This introductory example is a special case of
The inverse transformation is
Therefore
and hence
Example 4.12
Consider \(X=(X_{1},X_{2})\in \mathbb {R}^{2}\) with density f X (x)=f X (x 1,x 2),
Then
and
Hence
Example 4.13
Consider \(X\in \mathbb {R}^{1}\) with density f X (x) and Y=exp(X). According to (4.43) x=u(y)=log(y) and hence the Jacobian is
The pdf of Y is therefore:
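A Monte Carlo check of this transformed density, assuming for the sketch that X is standard normal (so that Y is lognormal):

```python
# Sketch: if Y = exp(X) then f_Y(y) = f_X(log y) * 1/y  (Jacobian 1/y).
# Check with X ~ N(0,1), so that Y is lognormal.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y = np.exp(rng.standard_normal(500_000))

edges = np.linspace(0.1, 4.0, 40)
hist, _ = np.histogram(y, bins=edges, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
f_y = norm.pdf(np.log(centres)) / centres       # transformed density
print(np.max(np.abs(hist - f_y)))               # small Monte Carlo discrepancy
```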
4 The Multinormal Distribution
The multinormal distribution with mean μ and covariance Σ>0 has the density
We write X∼N p (μ,Σ).
How is this multinormal distribution with mean μ and covariance Σ related to the multivariate standard normal \(N_{p}(0,{{\mathcal{I}}}_{p}) \)? Through a linear transformation using the results of Section 4.3, as shown in the next theorem.
Theorem 4.5
Let X∼N p (μ,Σ) and Y=Σ−1/2(X−μ) (Mahalanobis transformation). Then
i.e., the elements \(Y_{j}\in \mathbb {R}\) are independent, one-dimensional N(0,1) variables.
Proof
Note that (X−μ)⊤Σ−1(X−μ)=Y ⊤ Y. Application of (4.45) gives \({\mathcal{J}} = \Sigma^{1/2}\), hence
which is by (4.47) the pdf of a \(N_{p}(0,{\mathcal{I}}_{p})\). □
Note that the above Mahalanobis transformation yields in fact a random variable Y=(Y 1,…,Y p )⊤ composed of independent one-dimensional Y j ∼N 1(0,1) since
Here each \(f_{Y_{j}}(y)\) is a standard normal density \(\frac{1}{\sqrt{2\pi}}\exp (-\frac{y^{2}}{2} ) \). From this it is clear that \(\mathop {\mbox {\sf E}}(Y)=0\) and \(\mathop {\mbox {\sf Var}}(Y)= {\mathcal{I}}_{p}\).
How can we create N p (μ,Σ) variables on the basis of \(N_{p}(0,{\mathcal{I}}_{p})\) variables? We use the inverse linear transformation
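A computational sketch of this inverse transformation \(X=\Sigma^{1/2}Y+\mu\) (the covariance matrix below is purely illustrative; a Cholesky factor of Σ would serve equally well, and this is also the idea behind Exercise 4.1):

```python
# Sketch: generate N_p(mu, Sigma) variables from independent N(0,1) variables
# via X = Sigma^{1/2} Y + mu (spectral square root of Sigma).
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([3.0, 2.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])                  # an illustrative covariance matrix

eigval, eigvec = np.linalg.eigh(Sigma)
Sigma_half = eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T

Y = rng.standard_normal((100_000, 2))           # rows are i.i.d. N_2(0, I)
X = Y @ Sigma_half + mu                         # valid since Sigma_half is symmetric

print(X.mean(axis=0))                           # approximately mu
print(np.cov(X, rowvar=False))                  # approximately Sigma
```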
Using (4.11) and (4.23) we can also check that \(\mathop {\mbox {\sf E}}(X)= \mu\) and \(\mathop {\mbox {\sf Var}}(X) = \Sigma\). The following theorem is useful because it presents the distribution of a variable after it has been linearly transformed. The proof is left as an exercise.
Theorem 4.6
Let X∼N p (μ,Σ) and \({\mathcal{A}}(p\times p),\; c \in \mathbb {R}^{p}\), where \({\mathcal{A}}\) is nonsingular. Then \(Y = {\mathcal{A}} X +c\) is again a p-variate Normal, i.e.,
4.1 Geometry of the N p (μ,Σ) Distribution
From (4.47) we see that the density of the N p (μ,Σ) distribution is constant on ellipsoids of the form
Example 4.14
Figure 4.3 shows the contour ellipses of a two-dimensional normal distribution. Note that these contour ellipses are the iso-distance curves (2.34) from the mean of this normal distribution corresponding to the metric Σ−1.
According to Theorem 2.7 in Section 2.6 the half-lengths of the axes in the contour ellipsoid are \(\sqrt{d^{2} \lambda_{i}}\) where λ i are the eigenvalues of Σ. If Σ is a diagonal matrix, the rectangle circumscribing the contour ellipse has sides with length 2dσ i and is thus naturally proportional to the standard deviations of X i (i=1,2).
The distribution of the quadratic form in (4.51) is given in the next theorem.
Theorem 4.7
If X∼N p (μ,Σ), then the variable U=(X−μ)⊤Σ−1(X−μ) has a \(\chi^{2}_{p}\) distribution.
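Combining this theorem with the geometry of Example 4.14, a short sketch computes the axes of the contour ellipse covering a given probability (the covariance matrix is again purely illustrative):

```python
# Sketch: the half-lengths of the axes of the contour ellipse
# {x : (x - mu)^T Sigma^{-1} (x - mu) = d^2} are sqrt(d^2 * lambda_i).
import numpy as np
from scipy.stats import chi2

Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])                  # illustrative covariance matrix
eigval, eigvec = np.linalg.eigh(Sigma)

d2 = chi2.ppf(0.95, df=2)                       # by Theorem 4.7 this ellipse covers 95%
half_lengths = np.sqrt(d2 * eigval)
print(half_lengths)                             # axis half-lengths
print(eigvec)                                   # axis directions (eigenvectors of Sigma)
```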
Theorem 4.8
The characteristic function (cf) of a multinormal N p (μ,Σ) is given by
We can check Theorem 4.8 by transforming the cf back:
since
Note that if \(Y\sim N_{p}(0,{\mathcal{I}}_{p})\) (e.g., the Mahalanobis-transform), then
which is consistent with (4.33).
4.2 Singular Normal Distribution
Suppose that we have \(\mathop {\rm {rank}}(\Sigma ) = k < p \), where p is the dimension of X. We define the (singular) density of X with the aid of the G-Inverse Σ− of Σ,
where
-
(1)
x lies on the hyperplane \({\mathcal{N}}^{\top} (x-\mu) = 0 \) with \({\mathcal{N}} (p \times (p-k)) : {\mathcal{N}}^{\top} \Sigma = 0 \) and \({\mathcal{N}}^{\top} {\mathcal{N}} = {\mathcal{I}}_{p-k} \).
-
(2)
Σ− is the G-Inverse of Σ, and λ 1,…,λ k are the nonzero eigenvalues of Σ.
What is the connection to a multinormal with k-dimensions? If
then there exists an orthogonal matrix \({\mathcal{B}} (p \times k) \) with \({\mathcal{B}} ^{\top} {\mathcal{B}}= {\mathcal{I}}_{k}\) such that \(X = {\mathcal{B}} Y + \mu \), where X has a singular pdf of the form (4.53).
4.3 Gaussian Copula
In Examples 4.3 and 4.4 we have introduced copulae. Another important copula is the Gaussian or normal copula,
see Embrechts, McNeil and Straumann (1999). In (4.55), f ρ denotes the bivariate normal density function with correlation ρ for n=2. The functions Φ1 and Φ2 in (4.55) refer to the corresponding one-dimensional standard normal cdfs of the margins.
In the case of vanishing correlation, ρ=0, the Gaussian copula becomes
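A numerical sketch of the Gaussian copula, using the equivalent representation \(C_{\rho}(u,v)=\Phi_{\rho}\{\Phi^{-1}(u),\Phi^{-1}(v)\}\) and scipy's multivariate normal cdf, confirms that ρ=0 yields the product copula:

```python
# Sketch: Gaussian copula via the representation
# C_rho(u, v) = Phi_rho(Phi^{-1}(u), Phi^{-1}(v)).
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula(u, v, rho):
    cov = [[1.0, rho], [rho, 1.0]]
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([norm.ppf(u), norm.ppf(v)])

u, v = 0.3, 0.7
print(gaussian_copula(u, v, 0.0), u * v)        # rho = 0 gives the product copula
print(gaussian_copula(u, v, 0.8))               # positive dependence: larger than u*v
```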
5 Sampling Distributions and Limit Theorems
In multivariate statistics, we observe the values of a multivariate random variable X and obtain a sample \(\{x_{i}\}_{i=1}^{n}\), as described in Chapter 3. Under random sampling, these observations are considered to be realisations of a sequence of i.i.d. random variables X 1,…,X n , where each X i is a p-variate random variable which replicates the parent or population random variable X. Some notational confusion is hard to avoid: X i is not the ith component of X, but rather the ith replicate of the p-variate random variable X which provides the ith observation x i of our sample.
For a given random sample X 1,…,X n , the idea of statistical inference is to analyse the properties of the population variable X. This is typically done by analysing some characteristic θ of its distribution, like the mean, covariance matrix, etc. Statistical inference in a multivariate setup is considered in more detail in Chapters 6 and 7.
Inference can often be performed using some observable function of the sample X 1,…,X n , i.e., a statistic. Examples of such statistics were given in Chapter 3: the sample mean \(\bar{x}\) and the sample covariance matrix \({\mathcal{S}}\). To get an idea of the relationship between a statistic and the corresponding population characteristic, one has to derive the sampling distribution of the statistic. The next example gives some insight into the relation of \((\overline{x}, S)\) to (μ,Σ).
Example 4.15
Consider an i.i.d. sample of n random vectors \(X_{i} \in \mathbb {R}^{p}\) where \(\mathop {\mbox {\sf E}}(X_{i})=\mu\) and \(\mathop {\mbox {\sf Var}}(X_{i}) = \Sigma\). The sample mean \(\bar{x}\) and the covariance matrix \({\mathcal{S}}\) have already been defined in Section 3.3. It is easy to prove the following results:
This shows in particular that \({\mathcal{S}}\) is a biased estimator of Σ. By contrast, \({\mathcal{S}}_{u} = \frac{n}{n-1}{\mathcal{S}}\) is an unbiased estimator of Σ.
Statistical inference often requires more than just the mean and/or the variance of a statistic. We need the sampling distribution of the statistics to derive confidence intervals or to define rejection regions in hypothesis testing for a given significance level. Theorem 4.9 gives the distribution of the sample mean for a multinormal population.
Theorem 4.9
Let X 1,…,X n be i.i.d. with X i ∼ N p (μ,Σ). Then \(\bar{x} \,{\sim}\, N_{p}(\mu,n^{-1}\Sigma)\).
Proof
\(\bar{x}=n^{-1}\sum_{i=1}^{n} X_{i}\) is a linear combination of independent normal variables, so it has a normal distribution (see Chapter 5). The mean and the covariance matrix were given in the preceding example. □
With multivariate statistics, the sampling distributions of the statistics are often more difficult to derive than in the preceding Theorem. In addition they might be so complicated that approximations have to be used. These approximations are provided by limit theorems. Since they are based on asymptotic limits, the approximations are only valid when the sample size is large enough. In spite of this restriction, they make complicated situations rather simple. The following central limit theorem shows that even if the parent distribution is not normal, when the sample size n is large, the sample mean \(\bar{x}\) has an approximate normal distribution.
Theorem 4.10
(Central Limit Theorem (CLT))
Let X 1,X 2,…,X n be i.i.d. with X i ∼(μ,Σ). Then the distribution of \(\displaystyle \sqrt{n} (\overline{x} - \mu ) \) is asymptotically N p (0,Σ), i.e.,
The symbol “\(\stackrel{\mathcal{L}}{\longrightarrow}\)” denotes convergence in distribution which means that the distribution function of the random vector \(\sqrt{n}(\bar{x}-\mu)\) converges to the distribution function of N p (0,Σ).
Example 4.16
Assume that X 1,…,X n are i.i.d. and that they have Bernoulli distributions where \(p=\frac{1}{2}\) (this means that \(P(X_{i}=1)=\frac{1}{2},\;P(X_{i}=0)=\frac{1}{2})\). Then \(\mu=p=\frac{1}{2}\) and \(\Sigma=p(1-p)=\frac{1}{4}\). Hence,
The results are shown in Figure 4.4 for varying sample sizes.
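A simulation sketch of the kind that underlies Figure 4.4:

```python
# Sketch: CLT for Bernoulli(1/2) data; sqrt(n)*(xbar - 0.5) is asymptotically N(0, 1/4).
import numpy as np

rng = np.random.default_rng(5)
for n in (5, 35, 100):
    x = rng.binomial(1, 0.5, size=(10_000, n))        # 10000 samples of size n
    z = np.sqrt(n) * (x.mean(axis=1) - 0.5)
    print(n, round(z.mean(), 3), round(z.var(), 3))   # mean -> 0, variance -> 0.25
```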
Example 4.17
Now consider a two-dimensional random sample X 1,…,X n that is i.i.d. and created from two independent Bernoulli distributions with p=0.5. The joint distribution is given by \(P(X_{i}=(0,0)^{\top}) = \frac{1}{4}\), \(P(X_{i}=(0,1)^{\top}) = \frac{1}{4}\), \(P(X_{i}=(1,0)^{\top}) = \frac{1}{4}\), \(P(X_{i}=(1,1)^{\top}) = \frac{1}{4}\). Here we have
Figure 4.5 displays the estimated two-dimensional density for different sample sizes.
The asymptotic normal distribution is often used to construct confidence intervals for the unknown parameters. A confidence interval at the level 1−α, α∈(0,1), is an interval that covers the true parameter with probability 1−α:
where θ denotes the (unknown) parameter and \(\widehat{\theta}_{l}\) and \(\widehat{\theta}_{u}\) are the lower and upper confidence bounds respectively.
Example 4.18
Consider the i.i.d. random variables X 1,…,X n with X i ∼(μ,σ 2) and σ 2 known. Since we have \(\sqrt{n}(\bar{x}-\mu)\stackrel{\mathcal{L}}{\rightarrow} N(0,\sigma^{2})\) from the CLT, it follows that
where u 1−α/2 denotes the (1−α/2)-quantile of the standard normal distribution. Hence the interval
is an approximate (1−α)-confidence interval for μ.
But what can we do if we do not know the variance σ 2? The following corollary gives the answer.
Corollary 4.1
If \(\widehat{\Sigma}\) is a consistent estimate for Σ, then the CLT still holds, namely
Example 4.19
Consider the i.i.d. random variables X 1,…,X n with X i ∼(μ,σ 2), and now with an unknown variance σ 2. From Corollary 4.1 using \(\widehat{\sigma}^{2} = \frac{1}{n} \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}\) we obtain
Hence we can construct an approximate (1−α)-confidence interval for μ using the variance estimate \(\widehat{\sigma}^{2}\):
Note that by the CLT
Remark 4.1
One may wonder how large n should be in practice to provide reasonable approximations. There is no definite answer to this question: it mainly depends on the problem at hand (the shape of the distribution of the X i and the dimension of X i ). If the X i are normally distributed, \(\bar{x}\) is exactly normal already from n=1. In most situations, however, the approximation is valid in one-dimensional problems for n larger than, say, 50.
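A sketch of the approximate confidence interval of Examples 4.18 and 4.19, \(\bar{x}\pm u_{1-\alpha/2}\,\widehat{\sigma}/\sqrt{n}\), with the variance estimated from purely illustrative data:

```python
# Sketch: approximate (1 - alpha) confidence interval for mu based on the CLT,
# xbar +/- u_{1-alpha/2} * sigma_hat / sqrt(n), with the variance estimated from the data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
x = rng.exponential(scale=2.0, size=200)        # illustrative data with true mu = 2
alpha = 0.05

xbar = x.mean()
sigma_hat = x.std(ddof=0)                       # sigma_hat^2 = (1/n) * sum (x_i - xbar)^2
u = norm.ppf(1 - alpha / 2)
half = u * sigma_hat / np.sqrt(len(x))
print(xbar - half, xbar + half)                 # approximate 95% interval for mu
```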
5.1 Transformation of Statistics
Often in practical problems, one is interested in a function of parameters for which one has an asymptotically normal statistic. Suppose for instance that we are interested in a cost function depending on the mean μ of the process: \(f(\mu)=\mu^{\top} {\mathcal{A}}\mu\) where \({\mathcal{A}}>0\) is given. To estimate μ we use the asymptotically normal statistic \(\bar{x}\). The question is: how does \(f(\bar{x})\) behave? More generally, what happens to a statistic t that is asymptotically normal when we transform it by a function f(t)? The answer is given by the following theorem.
Theorem 4.11
If \(\sqrt{n} (t - \mu) \stackrel{\mathcal{L}}{\longrightarrow}N_{p}(0,\Sigma) \) and if \(f = (f_{1}, \ldots, f_{q})^{\top} : \mathbb {R}^{p} \to \mathbb {R}^{q} \) are real valued functions which are differentiable at \(\mu \in \mathbb {R}^{p}\), then f(t) is asymptotically normal with mean f(μ) and covariance \({\mathcal{D}}^{\top} \Sigma {\mathcal{D}}\), i.e.,
where
is the (p×q) matrix of all partial derivatives.
Example 4.20
We are interested in seeing how \(f(\bar{x})=\bar{x}^{\top} {\mathcal{A}}\bar{x}\) behaves asymptotically with respect to the quadratic cost function of \(\mu , f(\mu)=\mu^{\top} {\mathcal{A}}\mu\), where \({\mathcal{A}}>0\).
By Theorem 4.11 we have
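For a symmetric \({\mathcal{A}}\) the gradient of f at μ is \(2{\mathcal{A}}\mu\), so the asymptotic variance is \(4\mu^{\top}{\mathcal{A}}\Sigma{\mathcal{A}}\mu\); the following Monte Carlo sketch (with illustrative choices of μ, Σ and \({\mathcal{A}}\)) checks this:

```python
# Sketch: delta method for f(xbar) = xbar^T A xbar.  The gradient of f at mu is
# 2*A*mu (A symmetric), so the asymptotic variance is 4 * mu^T A Sigma A mu.
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])                       # A > 0, illustrative

n, reps = 400, 20_000
samples = rng.multivariate_normal(mu, Sigma, size=(reps, n))
xbar = samples.mean(axis=1)                      # shape (reps, 2)
f = np.einsum('ri,ij,rj->r', xbar, A, xbar)      # xbar^T A xbar for each replication

z = np.sqrt(n) * (f - mu @ A @ mu)
print(z.var(), 4 * mu @ A @ Sigma @ A @ mu)      # simulated vs. theoretical variance
```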
Example 4.21
Suppose
We have by the CLT (Theorem 4.10) for n→∞ that
Suppose that we would like to compute the distribution of . According to Theorem 4.11 we have to consider f=(f 1,f 2)⊤ with
Given this \(f(\mu) = {0 \choose 0} \) and
Thus
The covariance is
which yields
Example 4.22
Let us continue the previous example by adding one more component to the function f. Since q=3>p=2, we might expect a singular normal distribution. Consider f=(f 1,f 2,f 3)⊤ with
From this we have that
The limit is in fact a singular normal distribution!
6 Heavy-Tailed Distributions
Heavy-tailed distributions were first introduced by the Italian-born Swiss economist Pareto and extensively studied by Paul Lévy. Although in the beginning these distributions were mainly studied theoretically, nowadays they have found many applications in areas as diverse as finance, medicine, seismology and structural engineering. More concretely, they have been used to model returns of assets in financial markets, stream flow in hydrology, precipitation and hurricane damage in meteorology, earthquake prediction in seismology, pollution, material strength, teletraffic and many other phenomena.
A distribution is called heavy-tailed if it has higher probability density in its tail area than a normal distribution with the same mean μ and variance σ 2. Figure 4.6 demonstrates the differences between the pdf curves of a standard Gaussian distribution and a Cauchy distribution with location parameter μ=0 and scale parameter σ=1. The graphic shows that the probability density of the Cauchy distribution is much higher than that of the Gaussian in the tail part, while in the area around the centre, the probability density of the Cauchy distribution is much lower.
In terms of kurtosis, a heavy-tailed distribution has kurtosis greater than 3 (see formula (4.40)); such a distribution is called leptokurtic, in contrast to mesokurtic (kurtosis=3) and platykurtic (kurtosis<3) distributions. Since univariate heavy-tailed distributions serve as the basis for their multivariate counterparts, and their density properties have proved useful even in multivariate cases, we start by introducing some univariate heavy-tailed distributions. We then move on to analyse their multivariate counterparts and their tail behaviour.
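A small numerical comparison of the tail probabilities underlying Figure 4.6 (standard normal versus standard Cauchy):

```python
# Sketch: tail probabilities P(|X| > c) for the standard normal and the
# standard Cauchy distribution, illustrating the heavy tails of the Cauchy.
from scipy.stats import norm, cauchy

for c in (2, 3, 5, 10):
    print(c, 2 * norm.sf(c), 2 * cauchy.sf(c))   # the Cauchy tail decays only like 1/c
```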
6.1 Generalised Hyperbolic Distribution
The generalised hyperbolic distribution was introduced by Barndorff-Nielsen and at first applied to model grain size distributions of wind-blown sand. Today one of its most important uses is in stock price modelling and market risk measurement. The name of the distribution is derived from the fact that its log-density forms a hyperbola, while the log-density of the normal distribution is a parabola.
The density of a one-dimensional generalised hyperbolic (GH) distribution for \(x\in \mathbb{R}\) is
where K λ is a modified Bessel function of the third kind with index λ
The domain of variation of the parameters is \(\mu \in \mathbb{R}\) and
The generalised hyperbolic distribution has the following mean and variance
where μ and δ play important roles in the density’s location and scale respectively. With specific values of λ, we obtain different sub-classes of GH such as hyperbolic (HYP) or normal-inverse Gaussian (NIG) distribution.
For λ=1 we obtain the hyperbolic distributions (HYP)
where \(x,\mu \in \mathbb{R}, \delta\geq0\) and |β|<α.
For λ=−1/2 we obtain the normal-inverse Gaussian distribution (NIG)
6.2 Student’s t-distribution
The t-distribution was first analysed by Gosset (1908). He published his results under the pseudonym “Student” at the request of his employer. Let X be a normally distributed random variable with mean μ and variance σ 2, and Y be the random variable such that Y 2/σ 2 has a chi-square distribution with n degrees of freedom. Assume that X and Y are independent, then
is distributed as Student’s t with n degrees of freedom. The t-distribution has the following density function
where n is the number of degrees of freedom, −∞<x<∞, and Γ is the gamma function, e.g. Giri (1996),
The mean, variance, skewness, and kurtosis of Student’s t-distribution (n>4) are:
The t-distribution is symmetric around 0, which is consistent with the fact that its mean is 0 and skewness is also 0.
Student’s t-distribution approaches the normal distribution as n increases, since
In practice the t-distribution is widely used, but its flexibility of modelling is restricted because of the integer-valued tail index.
In the tail area of the t-distribution, the density is proportional to |x|−(n+1). Figure 4.13 compares the tail behaviour of the t-distribution for different degrees of freedom: the higher the degrees of freedom, the faster the t-distribution decays.
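A sketch of the tail comparison of Figure 4.13 using scipy:

```python
# Sketch: tail probabilities of the t-distribution for different degrees of
# freedom; a higher degree of freedom means faster decay (closer to the normal).
from scipy.stats import t, norm

for df in (1, 3, 10, 30):
    print(df, t.sf(5, df))      # P(T > 5) shrinks as df grows
print('normal', norm.sf(5))
```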
6.3 Laplace Distribution
The univariate Laplace distribution with mean zero was introduced by Laplace (1774). The Laplace distribution can be defined as the distribution of differences between two independent variates with identical exponential distributions. Therefore it is also called the double exponential distribution.
The Laplace distribution with mean μ and scale parameter θ has the pdf
and the cdf
where \(\mathit{sign}\) is the sign function. The mean, variance, skewness, and kurtosis of the Laplace distribution are
With mean 0 and θ=1, we obtain the standard Laplace distribution
6.4 Cauchy Distribution
The Cauchy distribution is motivated by the following example.
Example 4.23
A gangster has just robbed a bank. As he runs to a point s meters away from the wall of the bank, a policeman reaches the crime scene. The robber turns back and starts to shoot, but he is such a poor shooter that the angle of his fire (marked in Figure 4.10 as α) is uniformly distributed. The bullets hit the wall at distance x (from the centre). Obviously the distribution of x, the random variable describing where the bullet hits the wall, is vital knowledge for the policeman, who wants to identify the location of the gangster. (Should the policeman calculate the mean or the median of the observed bullet hits x i ?)
Since α is uniformly distributed:
and
For a small interval dα, the probability is given by
with
So the pdf of x can be written as:
The general formula for the pdf and cdf of the Cauchy distribution is
where m and s are location and scale parameter respectively. The case in the above example where m=0 and s=1 is called the standard Cauchy distribution with pdf and cdf as following,
The mean, variance, skewness and kurtosis of the Cauchy distribution are all undefined, since the corresponding moment integrals diverge. It does, however, have a mode and a median, both equal to the location parameter m.
6.5 Mixture Model
Mixture modelling concerns modelling a statistical distribution by a mixture (or weighted sum) of different distributions. For many choices of component density functions, the mixture model can approximate any continuous density to arbitrary accuracy, provided that the number of component density functions is sufficiently large and the parameters of the model are chosen correctly. The pdf of a mixture of n distributions can be written as:
under the constraints:
where \(p_{l}(x)\) is the pdf of the lth component and \(w_{l}\) is its weight. The mean, variance, skewness and kurtosis of a mixture are
where μ l ,σ l ,SK l and K l are, respectively, the mean, variance, skewness and kurtosis of the lth component distribution.
Mixture models are ubiquitous in virtually every facet of statistical analysis, machine learning and data mining. For data sets comprising continuous variables, the most common approach involves mixture distributions having Gaussian components.
The pdf for a Gaussian mixture is:
For a Gaussian mixture consisting of Gaussian distributions with mean 0, this can be simplified to:
with variance, skewness and kurtosis
Example 4.24
Consider a Gaussian Mixture which is 80% N(0,1) and 20% N(0,9). The pdf of N(0,1) and N(0,9) are
so the pdf of the Gaussian Mixture is
Notice that the Gaussian Mixture is not a Gaussian distribution:
The kurtosis of this Gaussian mixture is higher than 3.
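A quick check of this example by simulation (analytically, the mixture has variance 0.8⋅1+0.2⋅9=2.6 and fourth moment 0.8⋅3+0.2⋅243=51, hence kurtosis 51/2.6²≈7.54):

```python
# Sketch: kurtosis of the Gaussian mixture 0.8*N(0,1) + 0.2*N(0,9).
# Analytically: Var = 0.8*1 + 0.2*9 = 2.6 and E X^4 = 0.8*3 + 0.2*3*81 = 51,
# so the kurtosis is 51 / 2.6^2, approximately 7.54 > 3.
import numpy as np

rng = np.random.default_rng(8)
n = 1_000_000
comp = rng.random(n) < 0.8                       # choose the component
x = np.where(comp, rng.normal(0, 1, n), rng.normal(0, 3, n))

kurt = np.mean(x**4) / np.mean(x**2)**2          # mean is 0 by construction
print(kurt)                                      # approximately 7.54
```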
A summary of the basic statistics is given in Table 4.2.
6.6 Multivariate Generalised Hyperbolic Distribution
The multivariate Generalised Hyperbolic Distribution (\(\mathit{GH}_{d}\)) has the following pdf
and characteristic function
These parameters have the following domain of variation:
For \(\lambda = \frac{d+1}{2}\) we obtain the multivariate hyperbolic (HYP) distribution; for \(\lambda = -\frac{1}{2}\) we get the multivariate normal inverse Gaussian (NIG) distribution.
Blæsild and Jensen (1981) introduced a second parameterization (ζ,Π,Σ), where
The mean and variance of X∼GH d are
where
Theorem 4.12
Suppose that X is a d-dimensional variate distributed according to the generalised hyperbolic distribution GH d . Let (X 1,X 2) be a partitioning of X, let r and k denote the dimensions of X 1 and X 2, respectively, and let (β 1,β 2) and (μ 1,μ 2) be similar partitions of β and μ, let
be a partition of Δ such that Δ11 is an r×r matrix. Then one has the following:
-
1.
The distribution of X 1 is the r-dimensional generalised hyperbolic distribution, GH r (λ ∗,α ∗,β ∗,δ ∗,μ ∗,Δ∗), where
-
2.
The conditional distribution of X 2 given X 1=x 1 is the k-dimensional generalised hyperbolic distribution \(GH_{k}(\tilde{\lambda},\tilde{\alpha},\tilde{\beta},\tilde{\delta},\tilde{\mu},\tilde{\Delta})\), where
-
3.
Let Y=XA+B be a regular affine transformation of X and let ||A|| denote the absolute value of the determinant of A. The distribution of Y is the d-dimensional generalised hyperbolic distribution GH d (λ +,α +,β +,δ +,μ +,Δ+),where
6.7 Multivariate t-distribution
If X and Y are independent and distributed as N p (μ,Σ) and \(\chi^{2}_{n}\) respectively, and \(X\sqrt{n/Y}=t-\mu\), then the pdf of t is given by
The distribution of t is the noncentral t-distribution with n degrees of freedom and noncentrality parameter μ; see Giri (1996).
6.8 Multivariate Laplace Distribution
Let g and G be the pdf and cdf of a d-dimensional Gaussian distribution N d (0,Σ). The pdf and cdf of a multivariate Laplace distribution can then be written as
the pdf can also be described as
where \(\lambda = \frac{2-d}{2}\) and K λ (x) is the modified Bessel function of the third kind
The multivariate Laplace distribution has mean and variance
6.9 Multivariate Mixture Model
A multivariate mixture model comprises multivariate component distributions; e.g. the pdf of a multivariate Gaussian mixture can be written as
6.10 Generalised Hyperbolic Distribution
The tails of the GH distribution decay at an exponential rate:
Figure 4.14 illustrates the tail behaviour of GH distributions for different values of λ with α=1,β=0,δ=1,μ=0. It is clear that among the four distributions, the GH distribution with λ=1.5 has the slowest decay, while the NIG decays fastest.
In Figure 4.15 (Chen, Härdle and Jeong, 2008), four distributions, and in particular their tail behaviour, are compared. In order to keep these distributions comparable, we set the means to 0 and standardised the variances to 1. Furthermore we used one important subclass of the GH distribution: the normal-inverse Gaussian (NIG) distribution with \(\lambda = -\frac{1}{2}\) introduced above. On the left panel, the complete forms of these distributions are shown. The Cauchy distribution (dots) has the lowest peak and the fattest tails; in other words, it has the flattest distribution. The NIG distribution decays second fastest in the tails although it has the highest peak, which is displayed more clearly on the right panel.
7 Copulae
The cumulative distribution function (cdf) of a 2-dimensional vector \(\left(X_{1},X_{2}\right)\) is given by
For the case that X 1 and X 2 are independent, their joint cumulative distribution function F(x 1,x 2) can be written as a product of their 1-dimensional marginals:
But how can we model the dependence of X 1 and X 2? Most people would suggest linear correlation. Correlation is, however, an appropriate measure of dependence only when the random variables have an elliptical or spherical distribution, a class which includes the multivariate normal distribution. Although the terms “correlation” and “dependency” are often used interchangeably, correlation is actually a rather imperfect measure of dependency, and there are many circumstances where correlation should not be used.
Copulae represent an elegant concept of connecting marginals with joint cumulative distribution functions. Copulae are functions that join or “couple” multivariate distribution functions to their 1-dimensional marginal distribution functions. Let us consider a d-dimensional vector X=(X 1,…,X d )⊤. Using copulae, the marginal distribution functions \(F_{X_{i}} (i=1,\ldots,d)\) can be modelled separately from their dependence structure and then coupled together to form the multivariate distribution F X . Copula functions have a long history in probability theory and statistics. Their application in finance is very recent. Copulae are important in Value-at-Risk calculations and constitute an essential tool in quantitative finance (Härdle et al., 2009).
First let us concentrate on the 2-dimensional case; we will then extend this concept to the d-dimensional case, for a random variable in \(\mathbb{R}^{d}\) with d≥1. To be able to define a copula function, we first need the concepts of the F-volume of a rectangle, a 2-increasing function and a grounded function.
Let U 1 and U 2 be two sets in \(\mathbb{\overline{R}}=\mathbb{R}\cup \{+\infty\} \cup \{-\infty\}\) and consider the function \(F :U_{1} \times U_{2} \longrightarrow \mathbb{\overline{R}}\).
Definition 4.2
The F-volume of a rectangle B=[x 1,x 2]×[y 1,y 2]⊂U 1×U 2 is defined as:
Definition 4.3
F is said to be a 2-increasing function if for every B=[x 1,x 2]×[y 1,y 2]⊂U 1×U 2,
Remark 4.2
Note that being a 2-increasing function neither implies nor is implied by being increasing in each argument.
The following lemmas (Nelsen, 1999) will be very useful later for establishing the continuity of copulae.
Lemma 4.1
Let U 1 and U 2 be non-empty sets in \(\mathbb{\overline{R}}\) and let \(F :U_{1} \times U_{2} \longrightarrow \mathbb{\overline{R}}\) be a two-increasing function. Let x 1, x 2 be in U 1 with x 1≤x 2, and y 1, y 2 be in U 2 with y 1≤y 2. Then the function t↦F(t,y 2)−F(t,y 1) is non-decreasing on U 1 and the function t↦F(x 2,t)−F(x 1,t) is non-decreasing on U 2.
Definition 4.4
If U 1 and U 2 have a smallest element minU 1 and minU 2 respectively, then we say that a function \(F :U_{1} \times U_{2} \longrightarrow \mathbb{R}\) is grounded if:
In the following, we will refer to this definition of a cdf.
Definition 4.5
A cdf is a function from \(\mathbb{\overline{R}}^{2} \mapsto \left[0,1\right]\) which
-
i)
is grounded.
-
ii)
is 2-increasing.
-
iii)
satisfies \(F\left(\infty,\infty\right)=1\).
Lemma 4.2
Let U 1 and U 2 be non-empty sets in \(\mathbb{\overline{R}}\) and let \(F :U_{1} \times U_{2} \longrightarrow \mathbb{\overline{R}}\) be a grounded two-increasing function. Then F is non-decreasing in each argument.
Definition 4.6
If U 1 and U 2 have a greatest element maxU 1 and maxU 2 respectively, then we say that a function \(F :U_{1} \times U_{2} \longrightarrow \mathbb{R}\) has margins and that the margins of F are given by:
Lemma 4.3
Let U 1 and U 2 be non-empty sets in \(\mathbb{\overline{R}}\) and let \(F :U_{1} \times U_{2} \longrightarrow \mathbb{\overline{R}}\) be a grounded two-increasing function which has margins. Let (x 1,y 1), (x 2,y 2) ∈ U 1×U 2. Then
Definition 4.7
A two-dimensional copula is a function C defined on the unit square I 2=I×I with I=[0,1] such that
-
i)
for every u,v∈I holds: C(u,0)=C(0,v)=0, i.e. C is grounded.
-
ii)
for every u 1,u 2,v 1,v 2∈I with u 1≤u 2 and v 1≤v 2 holds:
(4.114), i.e. C is 2-increasing.
-
iii)
for every u∈I holds C(u,1)=u and C(1,v)=v.
Informally, a copula is a joint distribution function defined on the unit square \(\left[0,1\right]^{2}\) which has uniform marginals. That means that if \(F_{X_{1}}(x_{1})\) and \(F_{X_{2}}(x_{2})\) are univariate distribution functions, then \(C\{F_{X_{1}}(x_{1}),F_{X_{2}}(x_{2})\}\) is a 2-dimensional distribution function with marginals \(F_{X_{1}}(x_{1})\) and \(F_{X_{2}}(x_{2})\).
Example 4.25
The functions max(u+v−1,0), uv, min(u,v) can be easily checked to be copula functions. They are called respectively the minimum, product and maximum copula.
Example 4.26
Consider the function
where Φ ρ is the joint 2-dimensional standard normal distribution function with correlation coefficient ρ, while Φ1 and Φ2 refer to standard normal cdfs and
denotes the bivariate normal pdf.
It is easy to see that \(C^{\mathit{Gauss}}\) is a copula, the so-called Gaussian or normal copula, since it is 2-increasing and
A simple and useful way to represent the graph of a copula is the contour diagram, that is, the graph of its level sets, the sets in I 2 given by C(u,v)=constant. In Figures 4.16–4.17 we present the contour diagrams of the Gumbel-Hougaard copula (Example 4.4) for different values of the copula parameter θ.
For θ=1 the Gumbel-Hougaard copula reduces to the product copula, i.e.
For θ→∞, one finds for the Gumbel-Hougaard copula:
where M is also a copula such that C(u,v)≤M(u,v) for an arbitrary copula C. The copula M is called the Fréchet-Hoeffding upper bound.
The two-dimensional function W(u,v)=max(u+v−1,0) defines a copula with W(u,v)≤C(u,v) for any other copula C. W is called the Fréchet-Hoeffding lower bound.
In Figure 4.18 we show an example of Gumbel-Hougaard copula sampling for fixed parameters σ 1=1, σ 2=1 and θ=3.
One can demonstrate the so-called Fréchet-Hoeffding inequality, which we have already used in Example 1.3, and which states that each copula function is bounded by the minimum and maximum one:
The full relationship between a copula and the joint cdf is given by Sklar’s theorem.
Example 4.27
Let us verify that the Gaussian copula satisfies Sklar’s theorem in both directions. On the one side, let
be a 2-dimensional normal distribution function with standard normal cdf’s \(F_{X_{1}}(x_{1})\) and \(F_{X_{2}}(x_{2})\). Since \(F_{X_{1}}(x_{1})\) and \(F_{X_{2}}(x_{2})\) are continuous, a unique copula C exists such that for all \((x_{1},x_{2}) \in \mathbb{\overline{R}}^{2}\) the 2-dimensional distribution function can be written as a copula in \(F_{X_{1}}(x_{1})\) and \(F_{X_{2}}(x_{2})\):
The Gaussian copula satisfies the above equality, therefore it is the unique copula mentioned in Sklar’s theorem. This proves that the Gaussian copula, together with Gaussian marginals, gives the two-dimensional normal distribution.
Conversely, if C is a copula and \(F_{X_{1}}\) and \(F_{X_{2}}\) are standard normal distribution functions, then
is evidently a joint (two-dimensional) distribution function. Its margins are
The following proposition shows one attractive feature of the copula representation of dependence, i.e. that the dependence structure described by a copula is invariant under increasing and continuous transformations of the marginal distributions.
Theorem 4.13
If \(\left(X_{1},X_{2}\right)\) have copula C and g 1, g 2 are two continuous, increasing functions, then \(\left\{g_{1}\left(X_{1}\right),g_{2} \left(X_{2} \right)\right\}\) also have the copula C.
Example 4.28
Independence implies that the product of the cdf’s \(F_{X_{1}}\) and \(F_{X_{2}}\) equals the joint distribution function F, i.e.:
Thus, we obtain the independence or product copula C=Π(u,v)=uv.
While it is easily understood how a product copula describes an independence relationship, the converse is also true. Namely, the joint distribution function of two independent random variables can be interpreted as a product copula. This concept is formalised in the following theorem:
Theorem 4.14
Let X 1 and X 2 be random variables with continuous distribution functions \(F_{X_{1}}\) and \(F_{X_{2}}\) and the joint distribution function F. Then X 1 and X 2 are independent if and only if \(C_{X_{1},X_{2}}=\Pi\).
Example 4.29
Let us consider the Gaussian copula for the case ρ=0, i.e. vanishing correlation. In this case the Gaussian copula becomes
The following theorem, which follows directly from Lemma 4.3, establishes the continuity of copulae.
Theorem 4.15
Let C be a copula. Then for any u 1,v 1,u 2,v 2 ∈ I holds
From (4.129) it follows that every copula C is uniformly continuous on its domain.
A further important property of copulae concerns the partial derivatives of a copula with respect to its variables:
Theorem 4.16
Let C(u,v) be a copula. For any u ∈ I, the partial derivative \(\frac {\partial C(u, v)}{\partial v}\) exists for almost all u ∈ I. For such u and v one has:
The analogous statement is true for the partial derivative \(\frac {\partial C(u, v)}{\partial u}\):
Moreover, the functions
are defined and non-increasing almost everywhere on I.
Until now, we have considered copulae only in a 2-dimensional setting. Let us now extend this concept to the d-dimensional case, for a random variable in \(\mathbb{R}^{d}\) with d≥1.
Let U 1,U 2,…,U d be non-empty sets in \(\mathbb{\overline{R}}\) and consider the function \(F : U_{1} \times U_{2} \times \cdots\times U_{d}\longrightarrow \mathbb{\overline{R}}\). For a=(a 1,a 2,…,a d ) and b=(b 1,b 2,…,b d ) with a≤b (i.e. a k ≤b k for all k) let B=[a,b]=[a 1,b 1]×[a 2,b 2]×⋯×[a d ,b d ] be the d-box with vertices c=(c 1,c 2,…,c d ). It is obvious that each c k is either equal to a k or to b k .
Definition 4.8
The F-volume of a d-box B=[a,b]=[a 1,b 1]×[a 2,b 2]×⋯×[a d ,b d ]⊂U 1×U 2×⋯×U d is defined as follows:
where \(\mathit{sign}(c)=1\) if c k =a k for an even number of k’s, and \(\mathit{sign}(c)=-1\) if c k =a k for an odd number of k’s.
Example 4.30
For the case d=3, the F-volume of a 3-box B=[a,b]=[x 1,x 2]×[y 1,y 2]×[z 1,z 2] is defined as:
Definition 4.9
F is said to be a d-increasing function if for all d-boxes B with vertices in U 1×U 2×⋯×U d holds:
Definition 4.10
If U 1,U 2,…,U d have a smallest element minU 1,minU 2,…,minU d respectively, then we say that a function \(F :U_{1} \times U_{2} \times\cdots\times U_{d} \longrightarrow \mathbb{\overline{R}}\) is grounded if:
such that x k =minU k for at least one k.
The lemmas presented for the 2-dimensional case have analogous multivariate versions; see Nelsen (1999).
Definition 4.11
A d-dimensional copula (or d-copula) is a function C defined on the unit d-cube I d=I×I×⋯×I such that
-
i)
for every u∈I d holds: C(u)=0, if at least one coordinate of u is equal to 0; i.e. C is grounded.
-
ii)
for every a,b∈I d with a≤b holds:
(4.135), i.e. C is d-increasing.
-
iii)
for every u∈I d holds: C(u)=u k , if all coordinates of u are 1 except u k .
Analogously to the 2-dimensional setting, let us state Sklar’s theorem for the d-dimensional case.
Theorem 4.17
(Sklar’s theorem in d-dimensional case)
Let F be a d-dimensional distribution function with marginal distribution functions \(F_{X_{1}},F_{X_{2}},\ldots,F_{X_{d}}\). Then a d-copula C exists such that for all x 1,…,x d \(\in \mathbb{\overline{R}}^{d}\):
Moreover, if \(F_{X_{1}},F_{X_{2}},\ldots,F_{X_{d}}\) are continuous then C is unique. Otherwise C is uniquely determined on the Cartesian product \(Im(F_{X_{1}})\times Im(F_{X_{2}})\times\cdots\times Im(F_{X_{d}})\).
Conversely, if C is a copula and \(F_{X_{1}}, F_{X_{2}},\ldots,F_{X_{d}} \) are distribution functions then F defined by (4.136) is a d-dimensional distribution function with marginals \(F_{X_{1}},F_{X_{2}},\ldots, F_{X_{d}}\).
In order to illustrate the d-copulae we present the following examples:
Example 4.31
Let Φ denote the univariate standard normal distribution function and ΦΣ,d the d-dimensional standard normal distribution function with correlation matrix Σ. Then the function
is the d-dimensional Gaussian or normal copula with correlation matrix Σ. The function
is a copula density function. The copula dependence parameter α is the collection of all unknown correlation coefficients in Σ. If α≠0, then the corresponding normal copula allows one to generate joint symmetric dependence. However, it cannot model tail dependence, i.e. joint extreme events have zero probability.
Example 4.32
Let us consider the following function
One recognises this function as the d-dimensional Gumbel-Hougaard copula function. Unlike the Gaussian copula, the copula (4.139) can generate upper tail dependence.
Example 4.33
As in the 2-dimensional setting, let us consider the d-dimensional Gumbel-Hougaard copula for the case θ=1. In this case the Gumbel-Hougaard copula reduces to the d-dimensional product copula, i.e.
The extension of the 2-dimensional copula M, which one obtains from the d-dimensional Gumbel-Hougaard copula for θ→∞, is denoted M d(u):
The d-dimensional function
defines a copula with W(u)≤C(u) for any other d-dimensional copula function C(u). W d(u) is the Fréchet-Hoeffding lower bound in the d-dimensional case.
The functions M d and Πd are d-copulae for all d≥2, whereas the function W d fails to be a d-copula for any d>2 (Nelsen, 1999). However, the d-dimensional version of the Fréchet-Hoeffding inequality can be written as follows:
As we have already mentioned, copula functions have been widely applied in empirical finance.
8 Bootstrap
Recall that we need large sample sizes in order to sufficiently approximate the critical values computable by the CLT. Here large means n>50 for one-dimensional data. How can we construct confidence intervals in the case of smaller sample sizes? One way is to use a method called the Bootstrap. The Bootstrap algorithm uses the data twice:
-
1.
estimate the parameter of interest,
-
2.
simulate from an estimated distribution to approximate the asymptotic distribution of the statistics of interest.
In detail, the bootstrap works as follows. Consider the observations x 1,…,x n of the sample X 1,…,X n and estimate the empirical distribution function (edf) F n . In the case of one-dimensional data
This is a step function which is constant between neighboring data points.
Example 4.34
Suppose that we have n=100 standard normal N(0,1) data points X i , i=1,…,n. The cdf of X is \(\Phi(x) = \int^{x}_{-\infty} \varphi(u) du\) and is shown in Figure 4.19 as the thin, solid line. The empirical distribution function (edf) is displayed as a thick step function line. Figure 4.20 shows the same setup for n=1000 observations.
Now draw with replacement a new sample from this empirical distribution. That is, we sample with replacement n ∗ observations \(X_{1}^{\ast}, \ldots,X_{n^{\ast}}^{\ast}\) from the original sample. This is called a Bootstrap sample. Usually one takes n ∗=n.
Since we sample with replacement, a single observation from the original sample may appear several times in the Bootstrap sample. For instance, if the original sample consists of the three observations x 1,x 2,x 3, then a Bootstrap sample might look like \(X_{1}^{*}=x_{3}\), \(X_{2}^{*}=x_{2}\), \(X_{3}^{*}=x_{3}\). Computationally, we find the Bootstrap sample by using a uniform random number generator to draw from the indices 1,2,…,n of the original samples.
The Bootstrap observations are drawn randomly from the empirical distribution, i.e., the probability for each original observation to be selected into the Bootstrap sample is 1/n for each draw. It is easy to compute that
This is the expected value of a bootstrap draw given that its cdf is the edf of the original sample x 1,…,x n , i.e. it equals the sample mean. The same holds for the variance, i.e.,
where \(\widehat{\sigma}^{2} = n^{-1} \sum (x_{i} - \bar{x})^{2}\). The cdf of the bootstrap observations is defined as in (4.144). Figure 4.21 shows the cdf of the n=100 original observations as a solid line and two bootstrap cdf’s as thin lines.
The CLT holds for the bootstrap sample. Analogously to Corollary 4.1 we have the following corollary.
Corollary 4.2
If \(X_{1}^{\ast}, \ldots, X_{n}^{\ast}\) is a bootstrap sample from X 1,…,X n , then the distribution of
also becomes N(0,1) asymptotically, where \(\overline{x}^{\ast}= n^{-1} \sum_{i=1}^{n} X_{i}^{\ast}\) and \((\widehat{\sigma}^{\ast})^{2} = n^{-1} \sum_{i=1}^{n} (X_{i}^{\ast}-\bar{x}^{\ast})^{2}\).
How do we find a confidence interval for μ using the Bootstrap method? Recall that the quantile u 1−α/2 might be bad for small sample sizes because the true distribution of \(\sqrt{n}( \frac{\bar{x}-\mu}{\widehat{\sigma}})\) might be far away from the limit distribution N(0,1). The Bootstrap idea enables us to “simulate” this distribution by computing \(\sqrt{n} ( \frac{\bar{x}^{\ast}-\bar{x}}{\widehat{\sigma}^{\ast}} )\) for many Bootstrap samples. In this way we can estimate an empirical (1−α/2)-quantile \(u_{1-\alpha/2}^{\ast}\). The bootstrap improved confidence interval is then
By Corollary 4.2 we have
but with an improved speed of convergence, see Hall (1992).
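A minimal sketch of this bootstrap confidence interval (the data are purely illustrative):

```python
# Sketch of the bootstrap confidence interval described above: simulate the
# distribution of sqrt(n)*(xbar* - xbar)/sigma_hat* over many bootstrap samples,
# estimate its (1 - alpha/2)-quantile u*, and use xbar +/- u* * sigma_hat / sqrt(n).
import numpy as np

rng = np.random.default_rng(9)
x = rng.exponential(scale=2.0, size=30)          # small illustrative sample
n, B, alpha = len(x), 2000, 0.05

xbar, sigma_hat = x.mean(), x.std(ddof=0)
t_star = np.empty(B)
for b in range(B):
    xs = rng.choice(x, size=n, replace=True)     # bootstrap sample (indices drawn with replacement)
    t_star[b] = np.sqrt(n) * (xs.mean() - xbar) / xs.std(ddof=0)

u_star = np.quantile(t_star, 1 - alpha / 2)      # empirical (1 - alpha/2)-quantile
print(xbar - u_star * sigma_hat / np.sqrt(n),
      xbar + u_star * sigma_hat / np.sqrt(n))    # bootstrap-improved interval for mu
```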
9 Exercises
Exercise 4.1
Assume that the random vector Y has the following normal distribution: \(Y \sim N_{p}(0,{\mathcal{I}})\). Transform it according to (4.49) to create X∼N(μ,Σ) with mean μ=(3,2)⊤ and . How would you implement the resulting formula on a computer?
Exercise 4.2
Prove Theorem 4.7 using Theorem 4.5.
Exercise 4.3
Suppose that X has mean zero and covariance . Let Y=X 1+X 2. Write Y as a linear transformation, i.e., find the transformation matrix \({\mathcal{A}}\). Then compute \(\mathit{Var}(Y)\) via (4.26). Can you obtain the result in another fashion?
Exercise 4.4
Calculate the mean and the variance of the estimate \(\hat{\beta}\) in (3.50).
Exercise 4.5
Compute the conditional moments E(X 2∣x 1) and E(X 1∣x 2) for the pdf of Example 4.5.
Exercise 4.6
Prove the relation (4.28).
Exercise 4.7
Prove the relation (4.29). Hint: Note that
and that
Exercise 4.8
Compute (4.46) for the pdf of Example 4.5.
Exercise 4.9
Show that
is a pdf.
Exercise 4.10
Compute (4.46) for a two-dimensional standard normal distribution. Show that the transformed random variables Y 1 and Y 2 are independent. Give a geometrical interpretation of this result based on iso-distance curves.
Exercise 4.11
Consider the Cauchy distribution, which has no moments, so that the CLT cannot be applied. Simulate the distribution of \(\overline{x}\) (for different n’s). What can you expect for n→∞?
Hint: The Cauchy distribution can be simulated by the quotient of two independent standard normally distributed random variables.
Exercise 4.12
A European car company has tested a new model and reports the consumption of petrol (X 1) and oil (X 2). The expected consumption of petrol is 8 liters per 100 km (μ 1) and the expected consumption of oil is 1 liter per 10,000 km (μ 2). The measured consumption of petrol is 8.1 liters per 100 km (\(\overline{x}_{1}\)) and the measured consumption of oil is 1.1 liters per 10,000 km (\(\overline{x}_{2}\)). The asymptotic distribution of is .
For the American market the basic measuring units are miles (1 mile ≈ 1.6 km) and gallons (1 gallon ≈ 3.8 liter). The consumptions of petrol (Y 1) and oil (Y 2) are usually reported in miles per gallon. Can you express \(\overline{y}_{1}\) and \(\overline{y}_{2}\) in terms of \(\overline{x}_{1}\) and \(\overline{x}_{2}\)? Recompute the asymptotic distribution for the American market.
Exercise 4.13
Consider the pdf \(f(x_{1},x_{2})=e^{-(x_{1}+x_{2})}, x_{1},x_{2}>0\) and let U 1=X 1+X 2 and U 2=X 1−X 2. Compute f(u 1,u 2).
Exercise 4.14
Consider the pdf’s
For each of these pdf’s compute \(E(X), \mathop {\mbox {\sf Var}}(X), E(X_{1}|X_{2}), E(X_{2}|X_{1}), V(X_{1}|X_{2})\) and V(X 2|X 1).
Exercise 4.15
Consider the pdf \(f(x_{1},x_{2})=\frac{3}{2}x_{1}^{-\frac{1}{2}},\ 0<x_{1}<x_{2}<1\). Compute P(X 1<0.25),P(X 2<0.25) and P(X 2<0.25|X 1<0.25).
Exercise 4.16
Consider the pdf \(f(x_{1},x_{2})=\frac{1}{2\pi}\), 0<x 1<2π, 0<x 2<1. Let \(U_{1}=\sin X_{1}\sqrt{-2\log X_{2}}\) and \(U_{2}=\cos X_{1}\sqrt{-2\log X_{2}}\). Compute f(u 1,u 2).
Exercise 4.17
Consider f(x 1,x 2,x 3)=k(x 1+x 2 x 3); 0<x 1,x 2,x 3<1.
-
a)
Determine k so that f is a valid pdf of (X 1,X 2,X 3)=X.
-
b)
Compute the (3×3) matrix Σ X .
-
c)
Compute the (2×2) matrix of the conditional variance of (X 2,X 3) given X 1=x 1.
Exercise 4.18
Let .
-
a)
Represent the contour ellipses for \(a=0;\ -\frac{1}{2};\ +\frac{1}{2};\ 1\).
-
b)
For \(a=\frac{1}{2}\) find the regions of X centred on μ which cover the area of the true parameter with probability 0.90 and 0.95.
Exercise 4.19
Consider the pdf
Compute f(x 2) and f(x 1|x 2). Also give the best approximation of X 1 by a function of X 2. Compute the variance of the error of the approximation.
Exercise 4.20
Prove Theorem 4.6.
Bibliography
Blæsild, P. and Jensen, J. (1981). Multivariate distributions of hyperbolic type, in Statistical Distributions in Scientific Work – Proceedings of the NATO Advanced Study Institute held at the Università degli studi di Trieste, Vol. 4, pp. 45–66.
Chen, Y., Härdle, W. and Jeong, S.-O. (2008). Nonparametric risk management with generalized hyperbolic distributions, Journal of the American Statistical Association 103: 910–923.
Embrechts, P., McNeil, A. and Straumann, D. (1999). Correlation and dependence in risk management: Properties and pitfalls. Preprint ETH Zürich.
Giri, N. C. (1996). Multivariate Statistical Analysis, Marcel Dekker, New York.
Gosset, W. S. (1908). The probable error of a mean, Biometrika 6: 1–25.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion, Statistical Series, Springer, New York.
Härdle, W., Hautsch, N. and Overbeck, L. (2009). Applied Quantitative Finance, 2nd edition, Springer, Heidelberg.
Laplace, P.-S. (1774). Mémoire sur la probabilité des causes par les événements, Savants étranges 6: 621–656.
Nelsen, R. B. (1999). An Introduction to Copulas, Springer, New York.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges, Publ. Inst. Statist. Univ. Paris 8, pp. 229–231.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this chapter
Härdle, W.K., Simar, L. (2012). Multivariate Distributions. In: Applied Multivariate Statistical Analysis. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17229-8_4
DOI: https://doi.org/10.1007/978-3-642-17229-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17228-1
Online ISBN: 978-3-642-17229-8