
In this chapter we discuss quantities that one may anticipate for individual random variables or their functions—with respect to the probability distributions of these variables—after multiple repetitions of random experiments: they are known as expected values or expectations of random variables. The most important such quantity is the average value, which is the expected value in the basic, narrowest sense of the word; further below we also discuss other expected values in the broader sense.

1 Expected (Average, Mean) Value

The expected value of a discrete random variable X, which can assume the values \(x_i\) (\(i=1,2,\ldots \)), is computed by weighting (multiplying) each of these values by the probability \(P(X=x_i)=f_X(x_i)\) that this particular value turns up in a large number of trials (see (2.13)), and then summing all such products:

$$\begin{aligned} \overline{X} = E[X] = \sum _{i=1}^n x_i \, P(X=x_i). \end{aligned}$$
(4.1)

The average is denoted by E or by a line across the random variable (or its function) being averaged. Both E[X] and \(\overline{X}\), as well as the frequently used symbol \(\mu _X\) imply the “averaging operation” performed on the variable X. (We emphasize this because we occasionally also use the slightly misleading expression “expected value of a distribution”: what usually changes in random processes is the value of a variable, not its distribution!) In Chaps. 4–6 the symbols

$$\begin{aligned} E[X], \quad \overline{X}, \quad \mu _X, \end{aligned}$$
(4.2)

signify one and the same thing, while in Chaps. 7–10 the symbols \(\overline{X}\) and \(\overline{x}\) will denote the average value of a sample and \(E[\bullet ]\) will be used strictly as the expected value. The only symbol that really would not make any sense is E[x].

It cannot hurt to recall the formula for the center of mass of a one-dimensional system of point-like masses with a total mass \(M=\sum _{i=1}^n m_i\):

$$ x_\mathrm {cm} = \displaystyle {\sum _{i=1}^n x_i m_i \over \sum _{i=1}^n m_i} = \sum _{i=1}^n x_i {m_i\over M}. $$

If all probabilities in (4.1) are equal, \(P(X=x_i)=1/n\), we get a simple expression for the usual arithmetic average

$$ \overline{x} = {1\over n} \sum _{i=1}^n x_i. $$

The expected value of a continuous random variable X is obtained by replacing the sum by the integral and integrating the product of the variable value x and the corresponding probability density over the whole definition domain,

$$\begin{aligned} \overline{X} = E[X] = \int _{-\infty }^\infty x\,f_X(x) \,\mathrm {d}x. \end{aligned}$$
(4.3)

(Beware: this expected value may not exist for certain types of densities \(f_X\).) The analogy from mechanics is again the center of mass of a three-dimensional inhomogeneous body, which is calculated by integrating the product of the position vector with the position-dependent density over the whole volume:

$$ {{\varvec{r}}}_\mathrm {cm} = \overline{{\varvec{r}}} = {1\over m} \int _V {{\varvec{r}}} \, \mathrm {d}m = {1\over m} \int _V {{\varvec{r}}} \rho ({{\varvec{r}}}) \, \mathrm {d}^3{{\varvec{r}}}. $$

Example

In a casino we indulge in a game of dice with the following rules for each throw: 2 spots—win; 4 spots—win; 6 spots—lose; 1, 3 or 5 spots—neither win nor lose. Any number of spots \(x_i\) is equally probable, \(P(X=x_i)={1\over 6}\), so the expected value of our earnings is

If the casino wishes to profit from this game, the participation fee should be at least this much.    \(\triangleleft \)

2 Median

The median of a random variable X (discrete or continuous) is the value \(x = \mathrm {med}[X]\), for which

$$\begin{aligned} P(X<x) \le \textstyle {1\over 2} \quad \mathrm {and} \quad P(X>x) \le \textstyle {1\over 2}. \end{aligned}$$
(4.4)

For a continuous variable X the inequalities become equalities,

$$ P(X<x) = P(X>x) = \textstyle {1\over 2} \quad \Longleftrightarrow \quad \mathrm {med}[X] = F_X^{-1}(1/2), $$

as it is always possible to find the value of x that splits the area under the probability density curve in two halves: the probabilities that X assumes a value above or below the median, respectively, are exactly \(50\%\).

The median of a discrete variable X sometimes cannot be determined uniquely, since the discrete nature of its distribution may cause the inequalities in (4.4) to be fulfilled simultaneously for many different x. For example, consider a discrete distribution with probability function \(f_X(x)=1/2^x\), where \(x=1,2,\ldots \) We see that the conditions (4.4) are fulfilled for any value \(1 \le x \le 2\). In such cases the median is defined as the central point of the interval on which the assignment is ambiguous—in the present example we therefore set it to \(\mathrm {med}[X] = 1.5\).

Fig. 4.1  [Left] Probability density \(f_X\) (see (4.5)) with its average, median and mode (maximum). [Right] Maxwell distribution with its mode (“most probable velocity”), average velocity and the root-mean-square velocity. See also Fig. 3.4 (left)

Example

A continuous random variable has the probability density

$$\begin{aligned} f_X(x) = \left\{ \begin{array}{lcl} \displaystyle {4x(9-x^2)\over 81} &{};&{}\quad 0 \le x \le 3, \\ 0 &{};&{}\quad \mathrm {elsewhere}, \end{array} \right. \end{aligned}$$
(4.5)

shown in Fig. 4.1 (left). Find the mode (location of maximum), median and the average (mean) of this distribution!

The mode is obtained by differentiating and setting the result to zero:

$$ {\mathrm {d}f_X\over \mathrm {d}x}\biggl \vert _{X_\mathrm {max}} = {36 - 12 X_\mathrm {max}^2\over 81} = 0 \quad \Longrightarrow \quad X_\mathrm {max} = \sqrt{3}\approx 1.73. $$

The median \(\mathrm {med}[X]\equiv a\) must split the area under the curve of \(f_X\) into two parts of \(1/2\) each, \(P\left( X < a \right) = P\left( X > a \right) \), thus

$$ P\left( X < a \right) = {4\over 81}\int _0^a x \bigl ( 9-x^2 \bigr )\,\mathrm {d}x = {4\over 81}\left( {9a^2\over 2} - {a^4\over 4} \right) \equiv {1\over 2}. $$

This results in the equation \(2a^4 - 36 a^2 + 81 = 0\), which is quadratic in \(a^2\) and has the two solutions \(a^2 = 9(1\pm \sqrt{2}/2)\). Only the solution with the negative sign is acceptable, as it is the only one that falls within the [0, 3] domain:

$$ \mathrm {med}[X] = \sqrt{a^2} = \sqrt{9(1-\sqrt{2}/2)} \approx 1.62. $$

The average is calculated by using the definition (4.3),

$$ \overline{X} = \int _0^3 x\,f_X(x)\,\mathrm {d}x = {4\over 81}\int _0^3 x^2\bigl ( 9-x^2 \bigr )\,\mathrm {d}x = {4\over 81}\left. \left( 3x^3 - {x^5\over 5} \right) \right| _0^3 \approx 1.60. $$

All three values are shown in Fig. 4.1 (left).    \(\triangleleft \)
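These three values can also be checked numerically; the following sketch uses generic SciPy optimization and integration routines on the density (4.5):

```python
import numpy as np
from scipy import integrate, optimize

f = lambda x: 4 * x * (9 - x**2) / 81          # density (4.5) on [0, 3]

# Mode: maximum of the density
mode = optimize.minimize_scalar(lambda x: -f(x), bounds=(0, 3), method="bounded").x

# Median: solve F(a) = 1/2, with the distribution function obtained by integration
F = lambda a: integrate.quad(f, 0, a)[0]
median = optimize.brentq(lambda a: F(a) - 0.5, 0, 3)

# Mean: definition (4.3), restricted to the domain [0, 3]
mean = integrate.quad(lambda x: x * f(x), 0, 3)[0]

print(mode, median, mean)                      # ~1.73, ~1.62, ~1.60
```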

3 Quantiles

The value \(x_\alpha \) of a random variable, below which a given fraction \(\alpha \) of all events is found after numerous trials, is called the \(\alpha \)th quantile of its distribution (lat. quantum, “how much”). For a continuous probability distribution this means that the integral of the probability density from \(-\infty \) to \(x_\alpha \) equals \(\alpha \) (Fig. 4.2). For example, the 0.50th quantile of the standardized normal distribution is \(x_{0.50} = 0\), while its 0.9985th quantile is \(x_{0.9985} \approx 3\), see (3.13).

Fig. 4.2  Definition of the quantile of a continuous distribution. The integral of the density \(f_X(x)\) from \(-\infty \) (or the lowest edge of its domain) to \(x=x_\alpha \) equals \(\alpha \). The figure shows the density \(f_X(x) = {21\over 32}\bigl (x-{1\over 2}\bigr )^2\bigl ({5\over 2}-x\bigr )^5\), \(0.5 \le x \le 2.5\), the corresponding distribution function, and the 90th percentile (\(\alpha = 0.90\)), which is \(x_\alpha = 1.58\)

Fig. 4.3  [Left] Daily sales of fiction books as a function of sales rank. [Right] Daily earnings as a function of sales rank. In the book segment the online giant earns \(50\%\) by selling books with sales ranks above \(\mathrm {med}[R] \approx 53\), while the average sales rank is \(\overline{r} \approx 135\)

To express the \(\alpha \)th quantile all values \(0 \le \alpha \le 1\) are allowed, but several related terms are in wide use for specific values of \(\alpha \): integer values (in percent) express percentiles, the tenths of the whole range of \(\alpha \) are delimited by deciles and the fourths by quartiles: \(x_{0.20}\) defines the 20th percentile or the second decile of a distribution, while \(x_{0.25}\) and \(x_{0.75}\) are its first and third quartiles. Hence, \(x_{0.50}\) carries no less than five names: it is the 0.50th quantile, the 50th percentile, the second quartile, the fifth decile and—the median. The difference \(x_{0.75}-x_{0.25}\) is called the inter-quartile range (IQR). The interval \([x_{0.25},x_{0.75}]\) contains half of all values; a quarter of them reside to its left and a quarter to its right.
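For a distribution available in a statistics library, quantiles are obtained by inverting the distribution function; a minimal sketch for the standardized normal distribution, using SciPy's ppf (the inverse of \(F_X\)):

```python
from scipy import stats

z = stats.norm()                       # standardized normal distribution N(0, 1)

print(z.ppf(0.50))                     # median: 0.0
print(z.ppf(0.9985))                   # ~2.97, the 0.9985th quantile, cf. (3.13)

q1, q3 = z.ppf([0.25, 0.75])           # first and third quartiles
print(q1, q3, q3 - q1)                 # -0.674, 0.674, IQR ~ 1.349
```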

Example

Fig. 4.3 (left) shows the daily sales of fiction books from the 1000 bestseller list (sales rank r) of the Amazon online bookstore in a certain time period. (Note the log-log scale: in linear scale the distribution has a sharp peak at \(r=1\) and a rapidly dropping tail, so it mostly occupies the region around the origin.)

To study the sales dynamics such discrete distributions are often approximated by continuous Pareto distributions (3.16). For many markets in the past, the “Pareto 80/20 principle” seemed to apply, stating that a relatively small fraction (\({\approx }20\%\)) of products (in our case best-selling books) brings the most (\({\approx } 80\%\)) profit. Figure 4.3 (right) shows the daily earnings as a function of sales rank, as well as the median, average rank, and the sales rank up to which Amazon earns \(80\%\) of the money: the latter is 234 (of 1000), neatly corresponding with the Pareto “principle”. Still, it is obvious from the graph that the Pareto distribution under-estimates the actual sales at high ranks r. Analyses show [1, 2] that the distribution n(r) has become flatter over the years, meaning that more and more profit is being squeezed from the ever increasing tail; see also [3].    \(\triangleleft \)

4 Expected Values of Functions of Random Variables

The simplest functions of random variables are the sum \(X+Y\) of two variables and the linear combination \(aX+b\), where a and b are arbitrary real constants. Since the expected value of a continuous random variable, E[X], is defined by an integral, the expected values \(E[X+Y]\) and \(E[aX+b]\) inherit all properties of the integral, in particular linearity. (A similar conclusion follows in the discrete case, where we are dealing with sums.) Therefore, for both continuous and discrete random variables it holds that

$$\begin{aligned} E[X+Y] = E[X] + E[Y], \end{aligned}$$
(4.6)

as well as

$$ E[X_1 + X_2 + \cdots + X_n] = \sum _{i=1}^n E[X_i] $$

and

$$ E[aX+b] = aE[X] + b. $$

One needs to be slightly more careful in computing the expected values of more general functions of random variables. Suppose that X is a discrete random variable with probability distribution (probability function) \(f_X\). Then \(Y=g(X)\) is also a random variable and its probability function is

$$ f_Y(y) = P(Y=y) = \sum _{\{x|g(x)=y\}} P(X=x) = \sum _{\{x|g(x)=y\}} f_X(x). $$

If X takes the values \(x_1,x_2,\ldots ,x_n\) and Y takes the values \(y_1,y_2,\ldots ,y_m\) (\(m \le n\)), we have

$$\begin{aligned}E[Y]= & {} y_1 f_Y(y_1) + y_2 f_Y(y_2) + \cdots + y_m f_Y(y_m) \\= & {} g(x_1)f_X(x_1) + g(x_2)f_X(x_2) + \cdots + g(x_n)f_X(x_n) = E\bigl [ g(X) \bigr ], \end{aligned}$$

hence

$$\begin{aligned} \overline{g(X)} = E\bigl [ g(X) \bigr ] = \sum _{i=1}^n g(x_i) f_X(x_i). \end{aligned}$$
(4.7)

If X is a continuous random variable, we just need to replace the sum by the integral and the probability function by the probability density:

$$\begin{aligned} \overline{g(X)} = E\bigl [ g(X) \bigr ] = \int _{-\infty }^\infty g(x) f_X(x) \, \mathrm {d}x. \end{aligned}$$
(4.8)

This is a good spot to comment on a very popular approximation that can be an ugly mistake or a good short-cut to a solution: it is the approximation

$$\begin{aligned} g\bigl ( \overline{X} \bigr ) \approx \overline{g(X)}. \end{aligned}$$
(4.9)

The trick works well if the density \(f_X\) of X is a sharp, strongly peaked function, and not so well otherwise. Regardless of this, however, for any convex function g, Jensen’s inequality holds true:

$$\begin{aligned} g\bigl ( \overline{X} \bigr ) \le \overline{ g(X) }, \end{aligned}$$
(4.10)

that is,

$$ g\left( \int x\,f_X(x) \,\mathrm {d}x \right) \le \int g(x) f_X(x) \,\mathrm {d}x. $$
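A quick Monte Carlo illustration of (4.9) and (4.10), assuming (arbitrarily) an exponentially distributed X and the convex function \(g(x)=x^2\):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)   # X ~ Exp(1), so E[X] = 1, E[X^2] = 2

g = np.square                                    # convex function g(x) = x^2
print(g(x.mean()))                               # g(E[X]) ~ 1.0
print(g(x).mean())                               # E[g(X)] ~ 2.0

# Jensen: g(E[X]) <= E[g(X)].  The short-cut (4.9) fails badly here because
# the exponential density is broad rather than sharply peaked.
```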

4.1 Probability Densities in Quantum Mechanics

As physicists, we ceaselessly calculate expected values of the form (4.8) in any field related to statistical or quantum mechanics. We say: the expected value of an operator \({\widehat{\mathcal{{O}}}}\) in a certain state of a quantum-mechanical system (for example, ground state of the hydrogen atom) described by the wave-function \(\psi \), is

$$ \overline{\mathcal{O}} = \int _\Omega \psi ^*({{\varvec{r}}}) \widehat{\mathcal{O}}({{\varvec{r}}}) \psi ({{\varvec{r}}}) \,\mathrm {d}V. $$

The operator \(\widehat{\mathcal{O}}\) acts on the right part of the integrand, \(\psi \), then the result is multiplied from the left by its complex conjugate \(\psi ^*\), and integrated over the whole domain. If \(\widehat{\mathcal{O}}\) is multiplicative, for example \(\widehat{\mathcal{O}}({{\varvec{r}}}) = z\)—in this case we obtain the expectation value of the third Cartesian component of the electron’s position vector in the hydrogen atom—we are computing just

$$\begin{aligned} \overline{\mathcal{O}} = \int _\Omega \widehat{\mathcal{O}}({{\varvec{r}}}) \underbrace{\left| \psi ({{\varvec{r}}}) \right| ^2}_{\displaystyle {\rho ({{\varvec{r}}})}} \,\mathrm {d}V, \end{aligned}$$
(4.11)

which is the integral of a product of two scalar functions, the second of which, \(\rho ({{\varvec{r}}})\), is nothing but the probability density of (4.8).

Example

An electron moving in the electric field of a lead nucleus is described by the function

$$ \psi (r) = {1\over \sqrt{\pi }} r_\mathrm {B}^{-3/2} \mathrm {e}^{-r/r_\mathrm {B}}, $$

where \(r_\mathrm {B} \approx 6.46\times 10^{-13}\,\mathrm {m}\). The nucleus may be imagined as a positively charged sphere with radius \(7\times 10^{-15}\,\mathrm {m}\). How much time does the electron “spend” in the nucleus, i. e. what is the probability that it resides within a sphere of radius R? All we are looking for is the expected value of the operator \(\widehat{\mathcal{O}}({{\varvec{r}}}) = 1\) in (4.11); due to angular symmetry the volume element is simply \(\mathrm {d}V = 4\pi r^2 \,\mathrm {d}r\), thus

$$ P = \int _0^R |\psi (r)|^2 \, 4\pi r^2 \,\mathrm {d}r \approx 1.67\times 10^{-6}. $$

An almost identical result is obtained by assuming that \(\psi \) is practically constant on the interval [0, R], which is reasonable, since \(R \ll r_\mathrm {B}\). In this case we obtain \(P = (1/\pi )r_\mathrm {B}^{-3} (4\pi R^3/3) = (4/3) (R/r_\mathrm {B})^3 \approx 1.69\times 10^{-6}\).    \(\triangleleft \)
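The integral can be checked numerically; a short sketch with the values quoted above:

```python
import numpy as np
from scipy import integrate

rB = 6.46e-13                                   # m
R = 7e-15                                       # m, nuclear radius

psi2 = lambda r: np.exp(-2 * r / rB) / (np.pi * rB**3)       # |psi(r)|^2
P, _ = integrate.quad(lambda r: psi2(r) * 4 * np.pi * r**2, 0, R)

print(P)                                        # ~1.67e-6, exact integral
print((4 / 3) * (R / rB)**3)                    # ~1.69e-6, constant-psi approximation
```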

5 Variance and Effective Deviation

Computing the expected value of a random variable X tells us something about where within its domain its values will approximately land after many repetitions of the corresponding random experiment. Now we are also interested in the variation (scattering) of the values around their average \(E[X] = \overline{X}\). A measure of this scattering is the variance, defined as

$$ \mathrm {var}[X] = E\bigl [ (X- E[X])^2 \bigr ] = \overline{(X-\overline{X})^2}. $$

A large variance means a large scatter around the average and vice-versa. The positive square root of the variance,

$$ \sigma _X = \sqrt{\mathrm {var}[X]}, $$

is known as effective or standard deviation—in particular with the normal distribution on our minds. In the following we shall also make use of the relation

$$\begin{aligned} \mathrm {var}[aX+b] = a^2 \, \mathrm {var}[X]. \end{aligned}$$
(4.12)

(Prove it as an exercise.) If X is a discrete random variable, which takes the values \(x_1,x_2,\ldots ,x_n\) and has the probability function \(f_X\), its variance is

$$\begin{aligned} \sigma _X^2 = \sum _{i=1}^n \bigl ( x_i - \overline{X} \bigr )^2 f_X(x_i). \end{aligned}$$
(4.13)

In the case that all probabilities are equal, \(f_X(x_i)=1/n\), the variance is

$$\begin{aligned} \sigma _X^2 = {1\over n} \sum _{i=1}^n \bigl ( x_i - \overline{X} \bigr )^2. \end{aligned}$$
(4.14)

Note the factor \(1/n\)—not \(1/(n-1)\), as one often encounters—as it will acquire an important role in random samples in Chap. 7.

If X is a continuous random variable with the probability density \(f_X\), its variance is

$$\begin{aligned} \sigma _X^2 = \int _{-\infty }^\infty \bigl ( x - \overline{X} \bigr )^2 f_X(x) \, \mathrm {d}x. \end{aligned}$$
(4.15)

It can be shown that, regardless of the distribution obeyed by any (continuous or discrete) random variable X, it holds that

$$ P \bigl ( | X-\overline{X} | \ge a \bigr ) \le {\sigma _X^2\over a^2} $$

for any constant \(a > 0\); this is known as the Chebyshev inequality. It can also be formulated in terms of the slightly tighter constraints due to Cantelli,

$$\begin{aligned} P\bigl ( X \ge \overline{X} + a \bigr ) \le {\sigma _X^2\over \sigma _X^2 + a^2}, \qquad P\bigl ( X \le \overline{X} - a \bigr ) \le {\sigma _X^2\over \sigma _X^2 + a^2}. \end{aligned}$$
(4.16)

We may resort to this tool if we know only the expected value of the random variable, \(\overline{X}\), and its variance, \(\sigma _X^2\), but not the functional form of its distribution. In such cases we can still calculate the upper limits for probabilities of the form (4.16).

Example

Suppose that the measured noise voltage at the output of a circuit has an average of \(\overline{U}=200\,\mathrm {mV}\) and variance \(\sigma _U^2 = (80\,\mathrm {mV})^2\). The probability that the noise exceeds \(300\,\mathrm {mV}\) (i.e. rises more than \(\Delta U = 100\,\mathrm {mV}\) above the average) can be bounded from above as \(P \bigl ( U \ge \overline{U} + {\Delta U} \bigr ) \le \sigma _U^2 \bigl / \bigl ( \sigma _U^2 + ({\Delta U})^2 \bigr ) \approx 0.39\).    \(\triangleleft \)
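The same numbers in code, together with the exact tail probability one would obtain if the noise happened to be Gaussian (an extra assumption, used here only to show how loose the distribution-free bound can be):

```python
from scipy import stats

mu, sigma, dU = 200.0, 80.0, 100.0               # mV

cantelli = sigma**2 / (sigma**2 + dU**2)         # one-sided bound (4.16): ~0.39
chebyshev = sigma**2 / dU**2                     # Chebyshev bound:        0.64
gauss_tail = stats.norm(mu, sigma).sf(mu + dU)   # exact, IF U were normal: ~0.106

print(cantelli, chebyshev, gauss_tail)
```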

6 Complex Random Variables

A particular linear combination of real random variables X and Y is the complex random variable

$$ Z = X + \mathrm {i}\,Y. $$

Its distribution function at \(z = x + \mathrm {i}\,y\) is defined as

$$ F_Z(z) = P(X \le x, Y \le y) = F_{X,Y}(x,y), $$

where \(F_{X,Y}(x,y)\) is the distribution function of the pair—more precisely, the random vector (XY). The expected value of the variable Z is defined as

$$ E[Z] = E[X] + \mathrm {i}\,E[Y]. $$

Computing the expectation values of complex random variables is an additive and homogeneous operation: for arbitrary \(Z_1\) and \(Z_2\) it holds that

$$ E[Z_1 + Z_2] = E[Z_1] + E[Z_2], $$

while for an arbitrary complex constant \(c = a + \mathrm {i}\,b\) we have

$$ E[cZ] = cE[Z]. $$

The variance of a complex random variable is defined as

$$ \mathrm {var}[Z] = E\Bigl [ \bigl \vert Z - E[Z] \bigr \vert ^2 \Bigr ]. $$

A short calculation—do it!—shows that it is equal to the sum of the variances of its components,

$$ \mathrm {var}[Z] = \mathrm {var}[X] + \mathrm {var}[Y]. $$
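A numerical check of this identity, with arbitrarily chosen distributions for X and Y:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=1_000_000)     # var[X] = 4
y = rng.uniform(0.0, 3.0, size=1_000_000)    # var[Y] = 9/12 = 0.75

z = x + 1j * y                               # complex random variable Z = X + iY
var_z = np.mean(np.abs(z - z.mean())**2)

print(var_z, x.var() + y.var())              # both ~4.75
```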

The complex random variables \(Z_1 = X_1 + \mathrm {i}\,Y_1\) and \(Z_2 = X_2 + \mathrm {i}\,Y_2\) are mutually independent if the random vectors \((X_1,Y_1)\) and \((X_2,Y_2)\) are independent. (A generalization is at hand: complex random variables \(Z_k = X_k + \mathrm {i}\,Y_k\) (\(k=1,2,\ldots ,n\)) are mutually independent if the same applies to the random vectors \((X_k,Y_k)\).) If \(Z_1\) and \(Z_2\) are independent and possess expected values, their product possesses one as well, and it holds that

$$ E[Z_1 Z_2] = E[Z_1]E[Z_2]. $$

7 Moments

The average (mean) and the variance are two special cases of expected values in the broader sense called moments: the pth raw or algebraic moment \(M_p'\) of a random variable X is defined as the expected value of its pth power, that is, \(M_p' = E[X^p]\):

$$\begin{aligned} \begin{aligned} M_p'&=\sum _{i=1}^n x_i^p f_X(x_i) \quad (\mathrm {discrete~case}), \\ M_p'&=\int _{-\infty }^\infty x^p f_X(x) \, \mathrm {d}x \quad (\mathrm {continuous~case}). \end{aligned} \end{aligned}$$
(4.17)

Frequently we also require central moments, defined with respect to the corresponding average value of the variable, that is, \(M_p = E\bigl [\bigl (X-\overline{X}\bigr )^p\bigr ]\):

$$\begin{aligned}M_p= & {} \sum _{i=1}^n \left( x_i - \overline{X} \right) ^p f_X(x_i) \quad (\mathrm {discrete~case}), \\ M_p= & {} \int _{-\infty }^\infty \left( x - \overline{X} \right) ^p f_X(x) \, \mathrm {d}x \quad (\mathrm {continuous~case}). \end{aligned}$$

From here we read off \(M_0'=1\) (normalization of probability distribution), \(M_1' = \overline{X}\) and \(M_2 = \sigma _X^2\). The following relations (check them as an exercise) also hold:

$$\begin{aligned}M_2= & {} M_2' - \overline{X}^2 = \overline{X^2} - \overline{X}^2, \\ M_3= & {} M_3' - 3M_2'\overline{X} + 2\overline{X}^3, \\ M_4= & {} M_4' - 4M_3'\overline{X} + 6M_2'\overline{X}^2 - 3\overline{X}^4. \end{aligned}$$

In addition to the first (average) and second moment (variance) only the third and fourth central moment are in everyday use. The third central moment, divided by the third power of its effective deviation,

$$\begin{aligned} \rho = {M_3\over \sigma ^3}, \end{aligned}$$
(4.18)

is called the coefficient of skewness or simply skewness. The coefficient \(\rho \) measures the asymmetry of the distribution around its average: \(\rho < 0\) means that the distribution has a relatively longer tail to the left of the average value (Fig. 4.4 (left)), while \(\rho > 0\) implies a more pronounced tail to its right (Fig. 4.4 (center)).

Fig. 4.4  [Left] A distribution with negative skewness: the tail protruding to the left of the average value is more pronounced than the one sticking to the right. [Center] A distribution with positive skewness. [Right] Examples of distributions with positive (thick full curve) and negative excess kurtosis (thick dashed curve) with respect to the normal distribution (thin full curve)

Table 4.1 Properties of select continuous distributions: average (mean) value, median, mode, variance, skewness (\(M_3/\sigma ^3 = \rho \)) and kurtosis (\(M_4/\sigma ^4 = \varepsilon + 3\))

The fourth central moment, divided by the square of the variance,

$$ {M_4\over \sigma ^4}, $$

is known as kurtosis and tells us something about the “sharpness” or “bluntness” of the distribution. For the normal distribution we have \(M_4/\sigma ^4 = 3\), so we sometimes prefer to specify the quantity

$$\begin{aligned} \varepsilon = {M_4\over \sigma ^4} - 3, \end{aligned}$$
(4.19)

called the excess kurtosis: \(\varepsilon > 0\) indicates that the distribution is “sharper” than the normal (more prominent peak, faster falling tails), while \(\varepsilon < 0\) implies a “blunter” distribution (less pronounced peak, stronger tails), see Fig. 4.4 (right).

The properties of the most important continuous distributions—average value, median, mode (location of maximum), variance, skewness (\(\rho \)) and kurtosis (\(\varepsilon +3\))—are listed in Table 4.1. See also Appendices B.2 and B.3, where we shall learn how to “automate” the calculation of moments by using generating and characteristic functions.
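Entries of this kind can be cross-checked with a statistics library; the sketch below prints mean, variance, skewness and excess kurtosis for a few standard distributions (SciPy reports kurtosis as the excess \(\varepsilon \)):

```python
from scipy import stats

for name, dist in [("normal", stats.norm()),
                   ("exponential", stats.expon()),
                   ("uniform", stats.uniform())]:
    mean, var, skew, ex_kurt = dist.stats(moments="mvsk")
    print(f"{name:12s} {float(mean):7.3f} {float(var):7.3f} "
          f"{float(skew):7.3f} {float(ex_kurt):7.3f}")
# normal:      skew 0, excess kurtosis 0
# exponential: skew 2, excess kurtosis 6
# uniform:     skew 0, excess kurtosis -1.2
```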

Example

We are interested in the mode (“most probable velocity”), average velocity and the average velocity squared of \(\mathrm {N}_2\) gas molecules (molar mass \(M=28\,\mathrm {kg}/\mathrm {kmol}\), mass of single molecule \(m=M/N_\mathrm {A}\)) at temperature \(T=303\,\mathrm {K}\). The velocity distribution of the molecules is given by the Maxwell distribution (3.15), whose maximum (mode) is determined by \(\mathrm {d}f_V/\mathrm {d}v=0\), hence

$$ \left. \left( 2v - v^2{m\over 2k_\mathrm {B} T}\,2v \right) \right| _{V_\mathrm {max}} = 0 \quad \Longrightarrow \quad V_\mathrm {max} = \sqrt{2k_\mathrm {B} T\over m} \approx 423\,\mathrm {m/s}. $$

The average value and the square root of the average velocity squared (“root-mean-square velocity”) are computed from (4.3) and (4.17) with \(p=2\):

$$\begin{aligned}\overline{V}= & {} \int _0^\infty v f_V(v)\,\mathrm {d}v = \sqrt{8k_\mathrm {B} T\over \pi m} = \sqrt{4\over \pi }\,V_\mathrm {max} \approx 478\,\mathrm {m/s}, \\ \sqrt{\overline{\textstyle {V^2}}}= & {} \left( \int _0^\infty v^2 f_V(v)\,\mathrm {d}v\right) ^{1/2} = \sqrt{3k_\mathrm {B} T\over m} = \sqrt{3\over 2}\,V_\mathrm {max} \approx 518\,\mathrm {m/s}, \end{aligned}$$

where we have used \(\int _0^\infty z^3\exp (-z^2)\,\mathrm {d}z = 1/2\) and \(\int _0^\infty z^4\exp (-z^2)\,\mathrm {d}z = 3\sqrt{\pi }/8\). These three famous quantities are shown in Fig. 4.1 (right).    \(\triangleleft \)
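The three velocities in code, directly from the formulas above (SI units):

```python
import numpy as np

kB, NA = 1.380649e-23, 6.02214076e23     # J/K, 1/mol
T, M = 303.0, 28e-3                      # K, kg/mol for N2
m = M / NA                               # mass of a single molecule

v_max = np.sqrt(2 * kB * T / m)          # mode, "most probable velocity"
v_avg = np.sqrt(8 * kB * T / (np.pi * m))
v_rms = np.sqrt(3 * kB * T / m)

print(v_max, v_avg, v_rms)               # ~423, ~478, ~518 m/s
```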

7.1 Moments of the Cauchy Distribution

The Cauchy distribution \(f_X(x)=(1/\pi )/(1+x^2)\) drops off so slowly at \(x\rightarrow \pm \infty \) that its moments (average, variance, and so on) do not exist. For this reason its domain is frequently restricted to a narrower interval \([-x_\mathrm {max},x_\mathrm {max}]\):

$$ g_X(x) = {f_X(x)\over \int _{-x_\mathrm {max}}^{x_\mathrm {max}} f_X(x')\,\mathrm {d}x'} = {1\over 2\,\arctan x_\mathrm {max}} {1\over 1+x^2}. $$

This is particularly popular in nuclear physics where the Breit–Wigner description of the shape of the resonance peak in its tails—see Fig. 3.6 (right)—is no longer adequate due to the presence of neighboring resonances or background. With the truncated density \(g_X\) both the average and the variance are well defined:

$$\begin{aligned}E[X]= & {} {1\over 2\,\arctan x_\mathrm {max}} \int _{-x_\mathrm {max}}^{x_\mathrm {max}} {x\over 1+x^2} \,\mathrm {d}x = 0,\\ \mathrm {var}[X]= & {} {1\over 2\,\arctan x_\mathrm {max}} \int _{-x_\mathrm {max}}^{x_\mathrm {max}} {x^2\over 1+x^2} \,\mathrm {d}x = {x_\mathrm {max}\over \arctan x_\mathrm {max}} - 1. \end{aligned}$$
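A numerical check of these two results, with an arbitrary cut-off \(x_\mathrm {max}=5\):

```python
import numpy as np
from scipy import integrate

x_max = 5.0
norm = 2 * np.arctan(x_max)
g = lambda x: 1 / (norm * (1 + x**2))            # truncated Cauchy density

mean, _ = integrate.quad(lambda x: x * g(x), -x_max, x_max)
var, _ = integrate.quad(lambda x: x**2 * g(x), -x_max, x_max)

print(mean)                                      # ~0
print(var, x_max / np.arctan(x_max) - 1)         # both ~2.64
```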

Narrowing down the domain is a special case of a larger class of “distortions” of probability distributions used to describe, for example, non-ideal outcomes of a process or imperfect efficiencies for analyzing particles in a detector. If individual events are detected under different conditions, the ideal probability density, \(f_X\), must be weighted by the detection efficiency:

$$ g_X(x) = { \int _{\Omega _y} f_X(x) P(y|x) \varepsilon (x,y)\,\mathrm {d}y \over \int _{\Omega _{x'}} \int _{\Omega _y} f_X(x') P(y|x') \varepsilon (x',y)\,\mathrm {d}x' \, \mathrm {d}y }, $$

where y is an auxiliary variable over which the averaging is being performed, and \(\varepsilon (x,y)\) is the probability density for the event being detected near \(X=x\) and \(Y=y\). An introduction to such weighted averaging procedures can be found in Sect. 8.5 of [4].

8 Two- and d-dimensional Generalizations

Let the continuous random variables X and Y be distributed according to the joint probability density \(f_{X,Y}(x,y)\). In this case the expected values of the individual variables can be calculated by the obvious generalization of (4.3) to two dimensions. The density \(f_{X,Y}\) is multiplied by the variable whose expected value we are about to compute, while the other variable is left untouched:

$$\begin{aligned}\overline{X} = \mu _X = E[X]= & {} \int _{-\infty }^\infty \int _{-\infty }^\infty x\,f_{X,Y}(x,y) \, \mathrm {d}x \, \mathrm {d}y, \\ \overline{Y} = \mu _Y = E[Y]= & {} \int _{-\infty }^\infty \int _{-\infty }^\infty y\,f_{X,Y}(x,y) \, \mathrm {d}x \, \mathrm {d}y. \end{aligned}$$

In the discrete case the extension to two variables requires a generalization of (4.1):

$$ E[X] = \sum _{i=1}^n \sum _{j=1}^m x_i \, f_{X,Y}(x_i,y_j), \qquad E[Y] = \sum _{i=1}^n \sum _{j=1}^m y_j \, f_{X,Y}(x_i,y_j). $$

By analogy to (4.15) and (4.13) we also compute the variances of variables in the continuous case,

$$\begin{aligned}\sigma _X^2 = E\bigl [ (X-\mu _X)^2 \bigr ]= & {} \int _{-\infty }^\infty \int _{-\infty }^\infty (x-\mu _X)^2 f_{X,Y}(x,y) \, \mathrm {d}x \, \mathrm {d}y, \\ \sigma _Y^2 = E\bigl [ (Y-\mu _Y)^2 \bigr ]= & {} \int _{-\infty }^\infty \int _{-\infty }^\infty (y-\mu _Y)^2 f_{X,Y}(x,y) \, \mathrm {d}x \, \mathrm {d}y, \end{aligned}$$

and the variances in the discrete case,

$$\begin{aligned}E\bigl [ (X-\mu _X)^2 \bigr ]= & {} \sum _{i=1}^n \sum _{j=1}^m ( x_i - \mu _X)^2 f_{X,Y}(x_i,y_j), \\ E\bigl [ (Y-\mu _Y)^2 \bigr ]= & {} \sum _{i=1}^n \sum _{j=1}^m ( y_j - \mu _Y)^2 f_{X,Y}(x_i,y_j). \end{aligned}$$

Henceforth we only give equations pertaining to continuous variables. The corresponding expressions for discrete variables are obtained, as usual, by replacing the probability densities \(f_{X,Y}(x,y)\) by the probability [mass] functions \(f_{X,Y}(x_i,y_j)=P(X=x_i, Y=y_j)\), and integrals by sums.

Since now two variables are at hand, we can define yet a third version of the double integral (or the double sum) in which the variables enter bilinearly—the so-called mixed moment known as the covariance of X and Y:

$$ \sigma _{XY} = \mathrm {cov}[X,Y] = E\bigl [ (X-\mu _X)(Y-\mu _Y) \bigr ] = \int \limits _{-\infty }^\infty \int \limits _{-\infty }^\infty (x-\mu _X)(y-\mu _Y) f_{X,Y}(x,y) \, \mathrm {d}x \, \mathrm {d}y. $$

One immediately sees that

$$ \mathrm {cov}[aX,bY] = ab \, \mathrm {cov}[X,Y] $$

for arbitrary constants a and b, as well as

$$\begin{aligned}\sigma _{XY}= & {} E\bigl [ (X-\mu _X)(Y-\mu _Y) \bigr ] = E\bigl [ XY - \mu _X Y - \mu _Y X + \mu _X\mu _Y \bigr ] \\= & {} E[XY] - \mu _X \underbrace{E[Y]}_{\displaystyle {\mu _Y}} - \mu _Y \underbrace{E[X]}_{\displaystyle {\mu _X}} + \mu _X\mu _Y = E[XY] - \mu _X\mu _Y. \end{aligned}$$

Therefore, if X and Y are mutually independent, then by definition (2.25) one also has \(E[XY] = E[X]E[Y] = \mu _X\mu _Y\), and then

$$ \sigma _{XY} = 0. $$

(The covariance of independent variables equals zero.) For a later discussion of measurement uncertainties the following relation between the variance and covariance of two variables is important:

$$\begin{aligned} \mathrm {var}[X\pm Y]&=\iint \bigl ( (x-\mu _X) \pm (y-\mu _Y) \bigr )^2 f_{X,Y}(x,y) \, \mathrm {d}x \, \mathrm {d}y \nonumber \\&=\iint (x-\mu _X)^2 f_{X,Y}(x,y) \, \mathrm {d}x \, \mathrm {d}y +\iint (y-\mu _Y)^2 f_{X,Y}(x,y) \, \mathrm {d}x \, \mathrm {d}y \nonumber \\&\quad \pm 2 \iint (x-\mu _X)(y-\mu _Y) f_{X,Y}(x,y) \, \mathrm {d}x \, \mathrm {d}y \nonumber \\&=\mathrm {var}[X] + \mathrm {var}[Y] \pm 2 \, \mathrm {cov}[X,Y]. \end{aligned}$$
(4.20)

In other words,

$$ \sigma _{X\pm Y}^2 = \sigma _{X}^2 + \sigma _{Y}^2 \pm 2 \sigma _{XY}. $$
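A Monte Carlo illustration of this relation for two correlated variables (the construction of Y below is an arbitrary choice that makes \(\mathrm {cov}[X,Y]=0.6\)):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1_000_000)
y = 0.6 * x + 0.8 * rng.normal(size=1_000_000)       # var[Y] = 1, cov[X,Y] = 0.6

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, ddof=0)[0, 1]
print(lhs, rhs)                                       # both ~3.2
```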

By using the covariance and both effective deviations we define Pearson’s coefficient of linear correlation (also called the linear correlation coefficient)

$$\begin{aligned} \rho _{XY} = {\sigma _{XY}\over \sigma _X \sigma _Y}, \quad -1 \le \rho _{XY} \le 1. \end{aligned}$$
(4.21)

It is easy to confirm the allowed range of \(\rho _{XY}\) given above. Being the expected value of a square, the expression \(E[ ( \lambda (X-\mu _X) - (Y-\mu _Y) )^2 ]\) is non-negative for any \(\lambda \in {\mathbb {R}}\). Let us expand it:

$$ \lambda ^2 \underbrace{E \bigl [ (X-\mu _X)^2 \bigr ]}_{\displaystyle {\sigma _X^2}} -2\lambda \underbrace{E \bigl [ (X-\mu _X)(Y-\mu _Y) \bigr ]}_{\displaystyle {\sigma _{XY}}} + \underbrace{E \bigl [ (Y-\mu _Y)^2 \bigr ]}_{\displaystyle {\sigma _Y^2}} \ge 0. $$

The left side of this inequality is a real polynomial of second degree, \(a\lambda ^2 + b\lambda + c\), with coefficients \(a=\sigma _X^2\), \(b=-2\sigma _{XY}\), \(c=\sigma _Y^2\), which is non-negative everywhere, so it can have at most one real zero. This implies that its discriminant cannot be positive, \(b^2 - 4ac \le 0\). This tells us that \(4\sigma _{XY}^2 -4\sigma _X^2\sigma _Y^2 \le 0\) or \(|\sigma _{XY}/(\sigma _X\sigma _Y)| \le 1\), which is precisely (4.21).

The generalization of (4.20) to the sum of (not necessarily independent) random variables \(X_1,X_2,\ldots ,X_n\) is

$$ \mathrm {var}[X_1 + X_2 + \cdots + X_n] = \sum _{i=1}^n \mathrm {var}[X_i] + 2 \sum _{i=1}^n \sum _{j=i+1}^n \mathrm {cov}[X_i,X_j]. $$

If the variables \(X_1,X_2,\ldots ,X_n\) are mutually independent, this expression reduces to

$$\begin{aligned} \mathrm {var}[X_1 + X_2 + \cdots + X_n] = \sum _{i=1}^n \mathrm {var}[X_i]. \end{aligned}$$
(4.22)

Example

Many sticks of length 1 are broken at two random locations. What is the average length of the central pieces? Each stick breaks at \(0< x_1 < 1 \) and \(0< x_2 < 1\), where the values \(x_1\) and \(x_2\) are uniformly distributed over the interval [0, 1], and one can have either \(x_1 < x_2\) or \(x_1 > x_2\). What we are seeking, then, is the expected value of the variable \(L=|X_2-X_1|\) (with values l) with respect to the joint probability density \(f_{X_1,X_2}(x_1,x_2)=1\):

$$ \overline{L} = \int \limits _0^1\int \limits _0^1 \bigl \vert x_2 - x_1 \bigr \vert \, \mathrm {d}x_1 \,\mathrm {d}x_2 = \int \limits _0^1 \mathrm {d}x_2 \int \limits _0^{x_2} (x_2-x_1)\,\mathrm {d}x_1 + \int \limits _0^1 \mathrm {d}x_2 \int \limits _{x_2}^1 (x_1-x_2)\,\mathrm {d}x_1 = {1\over 3}. $$
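A quick Monte Carlo confirmation of this value:

```python
import numpy as np

rng = np.random.default_rng(42)
x1, x2 = rng.uniform(size=(2, 1_000_000))    # two uniform break points per stick
print(np.mean(np.abs(x2 - x1)))              # ~0.333 = 1/3
```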

 


How would the result change if the probability that the stick breaks linearly increases from 0 at the origin to 1 at the opposite edge?    \(\triangleleft \)

Example

Let the continuous random variables X and Y both be normally distributed, with averages \(\mu _X\) and \(\mu _Y\) and variances \(\sigma _X^2\) and \(\sigma _Y^2\). What is their joint probability density if X and Y are independent, and what are their joint and conditional densities in the dependent case, with correlation coefficient \(\rho _{XY} = \rho \)?

If X and Y are independent, their joint probability density—by (2.25)—is simply the product of the corresponding one-dimensional densities:

$$ f_{X,Y}(x,y) = f_X(x)\,f_Y(y) = {1\over \sqrt{2\pi }\sigma _X} \exp \left[ -{(x-\mu _X)^2\over 2\sigma _X^2} \right] {1\over \sqrt{2\pi }\sigma _Y} \exp \left[ -{(y-\mu _Y)^2\over 2\sigma _Y^2} \right] . $$

The curves of constant values of \(f_{X,Y}\) in the (xy) plane are untilted ellipses in general (\(\sigma _X \ne \sigma _Y\)), and circles in the special case \(\sigma _X = \sigma _Y\). At any rate \(\rho =0\) for such a distribution. A two-dimensional normal distribution of dependent (and therefore correlated) variables is described by the probability density

$$ f_{X,Y}(x,y) = {1\over 2\pi \sigma _X\sigma _Y\sqrt{1-\rho ^2}} \exp \left\{ \! - {1\over 1-\rho ^2} \left[ {x^{\prime 2}\over 2\sigma _X^2} -2\rho {x'y'\over \sqrt{2}\sigma _X\sqrt{2}\sigma _Y} +{y^{\prime 2}\over 2\sigma _Y^2} \right] \right\} , $$

where we have denoted \(x' = x-\mu _X\) and \(y' = y-\mu _Y\). This distribution can not be factorized as \(f_{X,Y}(x,y)=f_X(x)f_Y(y)\), and its curves of constant values are tilted ellipses; for parameters \(\mu _X=10\), \(\mu _Y=0\), \(\sigma _X=\sigma _Y=1\) and \(\rho =0.8\) they are shown in Fig. 4.5 (left).

Fig. 4.5  [Left] Joint probability density of two dependent, normally distributed random variables X and Y with averages \(\mu _X=10\) and \(\mu _Y=0\), variances \(\sigma _X^2=\sigma _Y^2=1\) and linear correlation coefficient \(\rho =0.8\). [Right] Conditional probability density \(f_{X|Y}(x|y)\)

Conditional probability densities \(f_{X|Y}(x|y)\) and \(f_{Y|X}(y|x)\) can be computed by using (2.26) and (2.27). Let us treat the first case, the other one is obtained by simply replacing \(x \leftrightarrow y\), \(\mu _X \leftrightarrow \mu _Y\) and \(\sigma _X \leftrightarrow \sigma _Y\) at appropriate locations:

$$ f_{X|Y}(x|y) \!=\! {f_{X,Y}(x,y)\over f_Y(y)} \!=\! {1\over \sqrt{2\pi }\sigma _X\sqrt{1-\rho ^2}} \exp \left\{ -{1\over 2(1-\rho ^2)\sigma _X^2}\left[ x' - \rho {\sigma _X\over \sigma _Y}y' \right] ^2 \right\} . $$

This conditional probability density is shown in Fig. 4.5 (right). By comparing it to definition (3.7) we infer that the random variable X|Y is distributed as

$$ X|Y \sim N \left( E[X] + \rho {\sigma _X\over \sigma _Y} \bigl ( Y-\mu _Y \bigr ), \, \bigl ( 1-\rho ^2 \bigr ) \sigma _X^2 \right) , $$

a feature also seen in the plot: the width of the band does not depend on y.    \(\triangleleft \)
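The conditional result can be verified by sampling; the sketch below draws from the bivariate normal distribution with the parameters of Fig. 4.5 and inspects the slice of events with Y close to an arbitrarily chosen value \(y_0=1\):

```python
import numpy as np

mu = np.array([10.0, 0.0])                       # mu_X, mu_Y
rho, sx, sy = 0.8, 1.0, 1.0
cov = np.array([[sx**2, rho * sx * sy],
                [rho * sx * sy, sy**2]])

rng = np.random.default_rng(7)
x, y = rng.multivariate_normal(mu, cov, size=2_000_000).T

y0 = 1.0
sel = np.abs(y - y0) < 0.01                      # narrow slice around Y = y0
print(x[sel].mean())                             # ~10.8 = mu_X + rho*(sx/sy)*(y0 - mu_Y)
print(x[sel].var())                              # ~0.36 = (1 - rho^2)*sx^2
```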

8.1 Multivariate Normal Distribution

This appears to be a good place to generalize the normal distribution of two variables (the so-called binormal or bivariate normal distribution) to d dimensions. We are dealing with a vector random variable

$$ {{\varvec{X}}} = \bigl ( X_1, X_2, \ldots , X_d \bigr )^\mathrm {T} \in {\mathbb {R}}^d $$

and its average

$$ E[{{\varvec{X}}}] = \bigl ( E[X_1], E[X_2], \ldots , E[X_d] \bigr )^\mathrm {T} = \bigl ( \mu _1, \mu _2, \ldots , \mu _d \bigr )^\mathrm {T} = {{\varvec{\mu }}}. $$

We construct the \(d\times d\) covariance matrix \(\Sigma \) with the matrix elements

$$ \Sigma _{ij} = \mathrm {cov}[X_i, X_j], \quad i,j=1,2,\ldots ,d. $$

The covariance matrix is symmetric and always at least positive semi-definite; it is strictly positive definite if none of the variables \(X_i\) is a linear combination of the others. The probability density of the multivariate normal distribution (compare it to its one-dimensional counterpart (3.10)) is then

$$\begin{aligned} f_{{\varvec{X}}}({{\varvec{x}}}; {{\varvec{\mu }}}, \Sigma ) = (2\pi )^{-d/2} \bigl (\det \Sigma \bigr )^{-1/2} \, \exp \left\{ -{1\over 2} ({{\varvec{x}}}-{{\varvec{\mu }}})^\mathrm {T} \, \Sigma ^{-1} \, ({{\varvec{x}}}-{{\varvec{\mu }}}) \right\} . \end{aligned}$$
(4.23)

If \(d=2\) as in the previous Example, we have simply \({{\varvec{X}}}=(X_1,X_2)^\mathrm {T} \rightarrow (X,Y)^\mathrm {T}\) and \({{\varvec{\mu }}}=(\mu _1,\mu _2)^\mathrm {T} \rightarrow (\mu _X,\mu _Y)^\mathrm {T}\), while the covariance matrix is

$$ \Sigma = \left( \begin{array}{cc} \sigma _X^2 &{} \sigma _{XY} \\ \sigma _{XY} &{} \sigma _Y^2 \\ \end{array} \right) . $$
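Evaluating (4.23) explicitly and comparing it with a library implementation is a useful consistency check; a sketch for the two-dimensional case above, with an arbitrary evaluation point:

```python
import numpy as np
from scipy import stats

mu = np.array([10.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])                    # sigma_X = sigma_Y = 1, sigma_XY = 0.8
x = np.array([10.5, -0.3])

d = len(mu)
diff = x - mu
f = (2 * np.pi)**(-d / 2) / np.sqrt(np.linalg.det(Sigma)) \
    * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(f)
print(stats.multivariate_normal(mu, Sigma).pdf(x))    # identical value
```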

8.2 Correlation Does Not Imply Causality

A vanishing correlation coefficient of X and Y does not mean that these variables are stochastically independent: for each density \(f_{X,Y}\) that is an even function of the deviations \(x-\mu _X\) and \(y-\mu _Y\), one has \(\rho _{XY}=0\). In other words, \(\rho _{XY}=0\) is just a necessary, but not sufficient condition for independence: see bottom part of Fig. 7.8 which illustrates the correlation in the case of finite samples.

Even though one observes a correlation in a pair of variables (sets of values, measurements, phenomena) this does not necessarily mean that there is a direct causal relation between them: correlation does not imply causality. When we observe an apparent dependence between two correlated quantities, often a third factor is involved, common to both X and Y. Example: the sales of ice-cream and the number of shark attacks at the beach are certainly correlated, but there is no causal relation between the two. (Does your purchase of three scoops of ice-cream instead of one triple your chances of being bitten by a shark?) The common factor of tempting scoops and aggressiveness of sharks is a hot summer day, when people wish to cool off in the water and sharks prefer to dwell near the shore.

Besides, one should be aware that correlation and causality are concepts originating in completely different worlds: the former is a statement on the basis of probability theory, while the latter signifies a strictly physical phenomenon, whose background is time and the causal connection between the present and past events.

9 Propagation of Errors

If we knew how to generalize (4.20) to an arbitrary function of an arbitrary number of variables, we would be able to answer the important question of error propagation. But what do we mean by “error of random variable”? In the introductory chapters we learned that each measurement of a quantity represents a single realization of a random variable whose value fluctuates statistically. Such a random deviation from its expected value is called the statistical uncertainty or “error”. By studying the propagation of errors we wish to find out how the uncertainties of a given set of variables translate into the uncertainty of a function of these variables. A typical example is the determination of the thermal power released on a resistor from the corresponding voltage drop: if the uncertainty of the voltage measurement is \(\Delta U\) and the resistance R is known to an accuracy of no more than \(\Delta R\), what is the uncertainty of the calculated power \(P = U^2/R\)?

Let \(X_1,X_2,\ldots ,X_n\) be real random variables with expected values \(\mu _1,\mu _2,\ldots ,\mu _n\), which we arrange as vectors

$$ {{\varvec{X}}} = (X_1,X_2,\ldots ,X_n)^\mathrm {T} $$

and

$$ {{\varvec{\mu }}} = (\mu _1,\mu _2,\ldots ,\mu _n)^\mathrm {T}, $$

just as in Sect. 4.8.1. Let \(Y=Y({{\varvec{X}}})\) be an arbitrary function of these variables which, of course, is also a random variable. Assume that the covariances of all \((X_i,X_j)\) pairs are known. We would like to estimate the variance of the variable Y. In the vicinity of \({{\varvec{\mu }}}\) we expand Y in a Taylor series in \({{\varvec{X}}}\) up to the linear term,

$$ Y({{\varvec{X}}}) \approx Y({{\varvec{\mu }}}) + \sum _{i=1}^n (X_i-\mu _i) {\partial Y\over \partial X_i}\biggl \vert _{{{\varvec{X}}}={{\varvec{\mu }}}}, $$

and resort to the approximation \(E[ Y({{\varvec{X}}}) ] \approx Y({{\varvec{\mu }}})\) (see (4.9) and (4.10)) to compute the variance. It follows that

$$\begin{aligned} \mathrm {var}[ Y({{\varvec{X}}}) ]= & {} E \Bigl [ \Bigl ( Y({{\varvec{X}}}) - E\bigl [ Y({{\varvec{X}}}) \bigr ] \Bigr )^2 \Bigr ] \approx E \Bigl [ \Bigl ( Y({{\varvec{X}}}) - Y({{\varvec{\mu }}}) \Bigr )^2 \Bigr ] \nonumber \\\approx & {} \sum _{i=1}^n\sum _{j=1}^n \left( {\partial Y\over \partial X_i}{\partial Y\over \partial X_j} \right) _{{{\varvec{X}}}={{\varvec{\mu }}}} \Sigma _{ij}, \end{aligned}$$
(4.24)

where

$$ \Sigma _{ij} = E\bigl [ (X_i-\mu _i)(X_j-\mu _j) \bigr ] = \mathrm {cov}\bigl [ X_i, X_j \bigr ] $$

is the covariance matrix of the variables \(X_i\): its diagonal terms are the variances of the individual variables, \(\mathrm {var}[X_i] = \sigma _{X_i}^2\), while the non-diagonal ones (\(i\ne j\)) are the covariances \(\mathrm {cov}[X_i,X_j]\). Formula (4.24) is what we have been looking for: it tells us—within the specified approximations—how the “errors” in \({{\varvec{X}}}\) map to the “errors” in Y. If \(X_i\) are mutually independent, we have \(\mathrm {cov}[X_i,X_j]=0\) for \(i\ne j\) and the formula simplifies to

$$\begin{aligned} \mathrm {var}[ Y({{\varvec{X}}}) ] \approx \sum _{i=1}^n \left( {\partial Y\over \partial X_i} \right) _{{{\varvec{X}}}={{\varvec{\mu }}}}^2 \mathrm {var}[ X_i ]. \end{aligned}$$
(4.25)

Example

Let \(X_1\) and \(X_2\) be independent continuous random variables with the mean values \(\mu _1\) and \(\mu _2\) and variances \(\sigma _1^2\) and \(\sigma _2^2\). We are interested in the variance \(\sigma _Y^2\) of their ratio \(Y=X_1/X_2\). Since \(X_1\) and \(X_2\) are independent, we may apply formula (4.25). We need the derivatives

$$ \left( {\partial Y\over \partial X_1} \right) _{{{\varvec{X}}}={{\varvec{\mu }}}} = {1\over \mu _2}, \qquad \left( {\partial Y\over \partial X_2} \right) _{{{\varvec{X}}}={{\varvec{\mu }}}} = -{\mu _1\over \mu _2^2}. $$

Therefore

$$ \sigma _Y^2 \approx \left( 1\over \mu _2\right) ^2 \sigma _1^2 + \left( \mu _1\over \mu _2^2\right) ^2 \sigma _2^2 = {1\over \mu _2^4} \left[ \mu _2^2 \sigma _1^2 + \mu _1^2 \sigma _2^2 \right] $$

or

$$ {\sigma _Y^2\over \mu _Y^2} \approx {\sigma _1^2\over \mu _1^2} + {\sigma _2^2\over \mu _2^2}, $$

where \(\mu _Y = E[Y] = \mu _1/\mu _2\).    \(\triangleleft \)
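A Monte Carlo check of this formula, assuming (arbitrarily) normally distributed \(X_1\) and \(X_2\) with small relative errors, so that the linearization is justified:

```python
import numpy as np

rng = np.random.default_rng(4)
mu1, s1 = 10.0, 0.2
mu2, s2 = 4.0, 0.1

y = rng.normal(mu1, s1, 1_000_000) / rng.normal(mu2, s2, 1_000_000)

print(y.std() / y.mean())                  # ~0.032
print(np.hypot(s1 / mu1, s2 / mu2))        # 0.032, from the propagation formula
```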

Example

Let X and Y be independent random variables with the expected values \(\mu _X\) and \(\mu _Y\) and variances \(\sigma _X^2\) and \(\sigma _Y^2\) (with respective “uncertainties of measurements” \(\sigma _X\) and \(\sigma _Y\)). What is the variance \(\sigma _Z^2\) of the product of their powers,

$$ Z = X^m Y^n? $$

(This is a generalization of the function from the previous example to arbitrary powers m and n.) By formula (4.25) we again obtain

$$ \sigma _Z^2 \approx \left( m X^{m-1} Y^n \right) ^2_{\textstyle {X=\mu _X\atop Y=\mu _Y}} \sigma _X^2 + \left( n X^m Y^{n-1} \right) ^2_{\textstyle {X=\mu _X\atop Y=\mu _Y}} \sigma _Y^2. $$

Thus

$$ \left( {\sigma _Z\over \mu _Z} \right) ^2 \approx {m^2 \mu _X^{2(m-1)} \mu _Y^{2n} \over \mu _X^{2m} \mu _Y^{2n}} \, \sigma _X^2 + {n^2 \mu _X^{2m} \mu _Y^{2(n-1)} \over \mu _X^{2m} \mu _Y^{2n}} \, \sigma _Y^2 = m^2 \left( {\sigma _X \over \mu _X} \right) ^2 + n^2 \left( {\sigma _Y \over \mu _Y} \right) ^2, $$

where we have denoted \(\mu _Z = \mu _X^m \mu _Y^n\).    \(\triangleleft \)

9.1 Multiple Functions and Transformation of the Covariance Matrix

Let us now discuss the case of multiple scalar functions \(Y_1,Y_2,\ldots ,Y_m\), which all depend on variables \({{\varvec{X}}}\),

$$ Y_k = Y_k(X_1,X_2,\ldots ,X_n) = Y_k({{\varvec{X}}}), \quad k=1,2,\ldots ,m. $$

We arrange the function values in the vector \({{\varvec{Y}}} = (Y_1,Y_2,\ldots ,Y_m)^\mathrm {T}\) and retrace the steps from the beginning of this section. We neglect all higher order terms in the Taylor expansion

$$ Y_k({{\varvec{X}}}) = Y_k({{\varvec{\mu }}}) + \sum _{i=1}^n (X_i-\mu _i) {\partial Y_k \over \partial X_i}\biggl \vert _{{{\varvec{X}}}={{\varvec{\mu }}}} + \cdots , \quad k=1,2,\ldots ,m, $$

and take into account that \(E[ Y_k({{\varvec{X}}}) ] \approx Y_k({{\varvec{\mu }}})\). Instead of (4.24) we now obtain a relation between the covariance matrix of variable \({{\varvec{X}}}\) and the covariance matrix of the variables \({{\varvec{Y}}}\),

$$\begin{aligned}\Sigma _{kl}({{\varvec{Y}}})\approx & {} E\Bigl [ \Bigl ( Y_k({{\varvec{X}}}) - Y_k({{\varvec{\mu }}}) \Bigr ) \Bigl ( Y_l({{\varvec{X}}}) - Y_l({{\varvec{\mu }}}) \Bigr ) \Bigr ] \\\approx & {} \sum _{i=1}^n\sum _{j=1}^n \left( {\partial Y_k\over \partial X_i}{\partial Y_l\over \partial X_j} \right) _{{{\varvec{X}}}={{\varvec{\mu }}}} \underbrace{E\bigl [ (X_i-\mu _i)(X_j-\mu _j) \bigr ]}_{\displaystyle \Sigma _{ij}({{\varvec{X}}})}. \end{aligned}$$

This relation becomes even more transparent if we write the Taylor expansion as

$$ {{\varvec{Y}}}({{\varvec{X}}}) = {{\varvec{Y}}}({{\varvec{\mu }}}) + D \, ({{\varvec{X}}}-{{\varvec{\mu }}}) + \cdots , $$

where \({{\varvec{X}}}\) and \({{\varvec{Y}}}\) are n- and m-dimensional vectors, respectively, while D is an \(m\times n\) matrix embodying the linear part of the expansion, namely

$$\begin{aligned} D_{ki} = \left( {\partial Y_k\over \partial X_i} \right) _{{{\varvec{X}}}={{\varvec{\mu }}}}, \end{aligned}$$
(4.26)

Hence

$$ \Sigma _{kl}({{\varvec{Y}}}) \approx \sum _{i=1}^n\sum _{j=1}^n D_{ki} \, \Sigma _{ij}({{\varvec{X}}}) \, D_{lj}, \quad k,l=1,2,\ldots ,m, $$

or, in brief,

$$\begin{aligned} \Sigma ({{\varvec{Y}}}) \approx D \Sigma ({{\varvec{X}}}) D^\mathrm {T}. \end{aligned}$$
(4.27)

The propagation of errors in higher dimensions can therefore be seen as a transformation of the covariance matrix. The variances \(\sigma _{Y_k}^2\) of the variables \(Y_k\) are the diagonal matrix elements of \(\Sigma ({{\varvec{Y}}})\). In general they pick up terms from all elements of \(\Sigma ({{\varvec{X}}})\), even the non-diagonal ones, since

$$\begin{aligned} \Sigma _{kk}({{\varvec{Y}}}) \approx \sum _{i=1}^n \sum _{j=1}^n \left( {\partial Y_k\over \partial X_i}{\partial Y_k\over \partial X_j} \right) _{{{\varvec{X}}}={{\varvec{\mu }}}} \Sigma _{ij}({{\varvec{X}}}). \end{aligned}$$
(4.28)

But if the variables \(X_i\) are mutually independent, only diagonal elements of \(\Sigma ({{\varvec{X}}})\) contribute to the right-hand side of the above equation, yielding

$$\begin{aligned} \sigma _{Y_k}^2 = \sum _{i=1}^n \left( {\partial Y_k\over \partial X_i} \right) ^2_{{{\varvec{X}}}={{\varvec{\mu }}}} \sigma _{X_i}^2. \end{aligned}$$
(4.29)

Equations (4.28) and (4.29) are multi-dimensional equivalents of (4.24) and (4.25). Note that the non-diagonal elements of \(\Sigma ({{\varvec{Y}}})\) may be non-zero even though \(X_i\) are mutually independent! You can find an example of how to use these equations in the case of a measurement of the momentum of a particle in Problem 4.10.6.

10 Problems

10.1 Expected Device Failure Time

A computer disk is controlled by five circuits (\(i=1,2,3,4,5\)). The time until an irreparable failure in each circuit is exponentially distributed, with individual time constants \(\lambda _i\). The disk as a whole works if circuits 1, 2 and 3, circuits 3, 4 and 5, or, obviously, all five circuits work simultaneously. What is the expected time of disk failure?

The probability that the ith circuit is not broken until time t (the probability that its failure time is larger than t) decreases exponentially and equals \(\mathrm {e}^{-\lambda _i t}\). Three key events determine whether the disk still works:

$$ \begin{array}{lclcl} \mathrm {event~}A &{}:&{} \mathrm {circuits~} 1 \mathrm {~and~} 2 \mathrm {~fail~after~time~} t &{}:&{} P(A) = \mathrm {e}^{-(\lambda _1+\lambda _2)t}, \\ \mathrm {event~}B &{}:&{} \mathrm {circuit~} 3 \mathrm {~fails~after~time~} t &{}:&{} P(B) = \mathrm {e}^{-\lambda _3 t}, \\ \mathrm {event~}C &{}:&{} \mathrm {circuits~} 4 \mathrm {~and~} 5 \mathrm {~fail~after~time~} t &{}:&{} P(C) = \mathrm {e}^{-(\lambda _4+\lambda _5)t}. \end{array} $$

The disk operates as long as the event \((A \cap B) \cup (B \cap C)\) occurs. Since the circuits fail independently, the probability that the disk still operates after time t is therefore

$$ P(t) = P(A)P(B) + P(B)P(C) - P(A)P(B)P(C) = \mathrm {e}^{-(\lambda _1+\lambda _2+\lambda _3)t} + \mathrm {e}^{-(\lambda _3+\lambda _4+\lambda _5)t} - \mathrm {e}^{-(\lambda _1+\lambda _2+\lambda _3+\lambda _4+\lambda _5)t}. $$

This is not yet our answer, since the expression still contains time! We are looking for the expected value of the failure time, where we should recall that the appropriate probability density is \(-P'(t)\) (see (3.4)), hence

$$\begin{aligned}\overline{T}= & {} \int _0^\infty t \left[ (\lambda _1+\lambda _2+\lambda _3) \mathrm {e}^{-(\lambda _1+\lambda _2+\lambda _3)t} +(\lambda _3+\lambda _4+\lambda _5) \mathrm {e}^{-(\lambda _3+\lambda _4+\lambda _5)t} \right. \\&\qquad \quad \left. -(\lambda _1+\lambda _2+\lambda _3+\lambda _4+\lambda _5) \mathrm {e}^{-(\lambda _1+\lambda _2+\lambda _3+\lambda _4+\lambda _5)t} \right] \,\mathrm {d}t \\= & {} {1\over \lambda _1+\lambda _2+\lambda _3} + {1\over \lambda _3+\lambda _4+\lambda _5} - {1\over \lambda _1+\lambda _2+\lambda _3+\lambda _4+\lambda _5}. \end{aligned}$$
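The result is easy to check by simulation; the failure rates below are arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([1.0, 2.0, 0.5, 1.5, 1.0])            # assumed rates lambda_1..lambda_5

t = rng.exponential(1 / lam, size=(1_000_000, 5))    # failure times of the five circuits
t_disk = np.maximum(t[:, :3].min(axis=1),            # branch 1-2-3 works until its first failure
                    t[:, 2:].min(axis=1))            # branch 3-4-5 likewise; the disk dies when both die

print(t_disk.mean())
s123, s345, s_all = lam[:3].sum(), lam[2:].sum(), lam.sum()
print(1 / s123 + 1 / s345 - 1 / s_all)               # analytic result, ~0.452
```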

10.2 Covariance of Continuous Random Variables

(Adapted from [5], Example 4.56.) Calculate the linear correlation coefficient of continuous random variables X and Y distributed according to the joint probability density

$$ f_{X,Y}(x,y) = 2 \, \mathrm {e}^{-x} \mathrm {e}^{-y} H(y) H (x-y), \quad -\infty< x,y < \infty , $$

where H is the Heaviside function (see (2.8)).

The linear correlation coefficient \(\rho _{XY}\) of variables X and Y (see (4.21)) is equal to the ratio of covariance \(\sigma _{XY}\) to the product of their effective deviations \(\sigma _X\) and \(\sigma _Y\). First we need to calculate the expected value of the product XY,

$$\begin{aligned}E[XY] = \overline{XY}= & {} \int _{-\infty }^\infty \int _{-\infty }^\infty xy\,f_{X,Y}(x,y) \, \mathrm {d}x \,\mathrm {d}y \\= & {} 2 \int _{-\infty }^\infty \int _{-\infty }^\infty xy\,\mathrm {e}^{-x} \mathrm {e}^{-y} H(y) H (x-y) \, \mathrm {d}x \,\mathrm {d}y \\= & {} 2 \int _0^\infty x\,\mathrm {e}^{-x} \left[ \int _0^x y\,\mathrm {e}^{-y}\,\mathrm {d}y \right] \mathrm {d}x \\= & {} 2 \int _0^\infty x\,\mathrm {e}^{-x} \Bigl [ 1 - (1+x)\mathrm {e}^{-x} \Bigr ] \mathrm {d}x = \ldots = 1, \end{aligned}$$

then the expected values of X, Y, \(X^2\) and \(Y^2\),

$$\begin{aligned}E[X] = \overline{X}= & {} \int _{-\infty }^\infty \int _{-\infty }^\infty x\,f_{X,Y}(x,y) \, \mathrm {d}x \,\mathrm {d}y = {3\over 2}, \\ E[Y] = \overline{Y}= & {} \int _{-\infty }^\infty \int _{-\infty }^\infty y\,f_{X,Y}(x,y) \, \mathrm {d}x \,\mathrm {d}y = {1\over 2}, \\ E[X^2] = \overline{X^2}= & {} \int _{-\infty }^\infty \int _{-\infty }^\infty x^2\,f_{X,Y}(x,y) \, \mathrm {d}x \,\mathrm {d}y = {7\over 2}, \\ E[Y^2] = \overline{Y^2}= & {} \int _{-\infty }^\infty \int _{-\infty }^\infty y^2\,f_{X,Y}(x,y) \, \mathrm {d}x \,\mathrm {d}y = {1\over 2}. \end{aligned}$$

It follows that

$$ \sigma _X = \sqrt{\,\overline{X^2} - \overline{X}^2} \approx 1.118, \qquad \sigma _Y = \sqrt{\,\overline{Y^2} - \overline{Y}^2} = 0.5, $$

hence

$$ \mathrm {cov}[X,Y] = \sigma _{XY} = \overline{XY} - \overline{X}\,\overline{Y} = 1 - {3\over 2}{1\over 2} = {1\over 4} $$

and

$$ \rho _{XY} = {\sigma _{XY}\over \sigma _X\sigma _Y} \approx 0.447. $$
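The result can also be checked by sampling. Noting that under this density Y is exponentially distributed with rate 2 while \(X-Y\) is, independently, exponential with rate 1 (a factorization one can read off from \(f_{X,Y}\)), a Monte Carlo sketch is:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000_000
y = rng.exponential(scale=0.5, size=n)      # Y ~ Exp(2): marginal density 2 e^{-2y}
x = y + rng.exponential(scale=1.0, size=n)  # X - Y ~ Exp(1), independent of Y

print(np.corrcoef(x, y)[0, 1])              # ~0.447
```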

10.3 Conditional Expected Values of Two-Dimensional Distributions

Let us return to the Example on p. 49 involving two random variables, distributed according to the joint probability density

$$ f_{X,Y}(x,y) = \left\{ \begin{array}{rcl} 8xy &{};&{}\quad 0 \le x\le 1, 0 \le y \le x, \\ 0 &{};&{}\quad \mathrm {elsewhere}. \end{array} \right. $$

Find the conditional expected value of the variable Y, given \(X=x\), and the conditional expected value of the variable X, given \(Y=y\)!

We have already calculated the conditional densities \(f_{X|Y}(x|y)\) and \(f_{Y|X}(y|x)\) in (2.28) and (2.29), so the conditional expected value equals

$$ E\bigl [Y|X=x\bigr ] = \int _{-\infty }^\infty y \, f_{Y|X}(y|x) \,\mathrm {d}y = \int _0^x y \, {2y\over x^2} \,\mathrm {d}y = {2x\over 3}, $$

and the conditional expected value is

$$ E\bigl [X|Y=y\bigr ] = \int _{-\infty }^\infty x \, f_{X|Y}(x|y) \,\mathrm {d}x = \int _y^1 x \, {2x\over 1-y^2} \,\mathrm {d}x = {2(1-y^3)\over 3(1-y^2)} = {2(1+y+y^2)\over 3(1+y)}. $$
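The two integrals can also be left to a computer algebra system; a sketch with SymPy, using the conditional densities quoted from (2.28) and (2.29):

```python
import sympy as sp

x, y = sp.symbols("x y", positive=True)

f_y_given_x = 2 * y / x**2            # valid for 0 <= y <= x
f_x_given_y = 2 * x / (1 - y**2)      # valid for y <= x <= 1

print(sp.integrate(y * f_y_given_x, (y, 0, x)))                 # 2*x/3
print(sp.simplify(sp.integrate(x * f_x_given_y, (x, y, 1))))    # 2*(y**2 + y + 1)/(3*(y + 1))
```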

10.4 Expected Values of Hyper- and Hypo-exponential Variables

Calculate the expected value, the second moment and the variance of continuous random variables, distributed according to the hyper-exponential (see (3.26)) and hypo-exponential distribution (see (3.28)).

The hyper-exponential distribution, which describes a mixture (superposition) of k independent phases of a parallel process, whose ith phase proceeds with probability \(P_i\) and time constant \(\lambda _i = 1/\tau _i\), is defined by the probability density

$$\begin{aligned} f_X(x) = \sum _{i=1}^k P_i\,f_{X_i}(x) = \sum _{i=1}^k P_i \lambda _i \, \mathrm {e}^{-\lambda _i x}, \quad x \ge 0, \end{aligned}$$
(4.30)

where \(0 \le P_i \le 1\) and \(\sum _{i=1}^k P_i = 1\). The expected value of a hyper-exponentially distributed variable X is

$$\begin{aligned} \overline{X} = E[X] = \int _0^\infty x\,f_X(x) \,\mathrm {d}x = \sum _{i=1}^k P_i \int _0^\infty \lambda _i x \, \mathrm {e}^{-\lambda _i x} \, \mathrm {d}x = \sum _{i=1}^k {P_i\over \lambda _i}, \end{aligned}$$
(4.31)

and its second moment is

$$ \overline{X^2} = E[X^2] = \int _0^\infty x^2 f_X(x) \,\mathrm {d}x = \sum _{i=1}^k P_i \int _0^\infty \lambda _i x^2 \, \mathrm {e}^{-\lambda _i x} \, \mathrm {d}x = 2 \sum _{i=1}^k {P_i\over \lambda _i^2}. $$

Its variance is therefore

$$\begin{aligned} \mathrm {var}[X] = \sigma _X^2 = E[X^2] - E[X]^2 = 2 \sum _{i=1}^k {P_i\over \lambda _i^2} -\left( \sum _{i=1}^k {P_i\over \lambda _i} \right) ^2. \end{aligned}$$
(4.32)

While \(\sigma _X / \overline{X} = (1/\lambda )/(1/\lambda ) = 1\) holds true for the usual single-exponential distribution, its hyper-exponential generalization always has \(\sigma _X / \overline{X} > 1\), except when all \(\lambda _i\) are equal: this inequality is the origin of the root “hyper” in its name.

The hypo-exponential distribution describes the distribution of the sum of k (\(k\ge 2\)) independent continuous random variables \(X_i\), in which each term separately is distributed exponentially with parameter \(\lambda _i\). The sum variable \(X = \sum _{i=1}^k X_i\) has the probability density

$$\begin{aligned} f_X(x) = \sum _{i=1}^k \alpha _i \lambda _i \, \mathrm {e}^{-\lambda _i x}, \end{aligned}$$
(4.33)

where

$$ \alpha _i = \prod _{{j=1\atop j\ne i}}^k {\lambda _j \over \lambda _j-\lambda _i}, \quad i=1,2,\ldots ,k. $$

By comparing (4.33) to (4.30) one might conclude that the coefficients \(\alpha _i\) represent the probabilities \(P_i\) for the realization of the ith random variable, but we are dealing with a serial process here: all indices i come into play—see Fig. 3.13! On the other hand, one can exploit the analytic structure of expressions (4.31) and (4.32): one simply needs to replace all \(P_i\) by \(\alpha _i\). By a slightly tedious calculation (or by exploiting the linearity of \(E[\cdot ]\) and using formula (4.20)) we obtain very simple expressions for the average and variance:

$$ E[X] = \overline{X} = \sum _{i=1}^k {1\over \lambda _i}, \quad \mathrm {var}[X] = \sigma _X^2 = \sum _{i=1}^k {1\over \lambda _i^2}. $$

It is easy to see—Pythagoras’s theorem comes in handy—that one always has \(\sigma _X / \overline{X} < 1\). The root “hypo” in the name of the distribution expresses precisely this property.
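Both claims are easy to illustrate by simulation, with arbitrary example rates \(\lambda _i\) and weights \(P_i\):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = np.array([1.0, 2.0, 5.0])                  # example rates
P = np.array([0.2, 0.5, 0.3])                    # mixing probabilities, sum to 1
n = 1_000_000

# Hyper-exponential: choose a phase with probability P_i, then draw Exp(lambda_i)
phase = rng.choice(len(lam), size=n, p=P)
x_hyper = rng.exponential(scale=1 / lam[phase])
print(x_hyper.std() / x_hyper.mean())            # > 1

# Hypo-exponential: sum of independent Exp(lambda_i) variables
x_hypo = rng.exponential(scale=1 / lam, size=(n, len(lam))).sum(axis=1)
print(x_hypo.std() / x_hypo.mean())              # < 1
```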

10.5 Gaussian Noise in an Electric Circuit

The noise in electric circuits is frequently of Gaussian nature. Assume that the noise (random variable X) is normally distributed, with average \(\overline{X} = 0\,\mathrm {V}\) and variance \(\sigma _X^2 = 10^{-8}\,\mathrm {V}^2\). Calculate the probability that the noise exceeds the value \(10^{-4}\,\mathrm {V}\) and the probability that its value lies in the interval between \(-2\cdot 10^{-4}\,\mathrm {V}\) and \(10^{-4}\,\mathrm {V}\)! What is the probability that the noise exceeds \(10^{-4}\,\mathrm {V}\), given that it is positive? Calculate the expected value of |X|.

It is worthwhile to convert the variable \(X \sim N(\overline{X},\sigma _X^2)\) to the standardized form

$$ Z = {X-\overline{X}\over \sigma _X} = {X-0\,\mathrm {V}\over 10^{-4}\,\mathrm {V}} = 10^4 X, $$

so that \(Z \sim N(0,1)\). The required probabilities are then

$$ P\bigl (X> 10^{-4}\,\mathrm {V}\bigr ) = P(Z > 1) = 0.5 - \int _0^1 f_Z(z)\,\mathrm {d}z \approx 0.5 - 0.3413 = 0.1587 $$

and

$$\begin{aligned} P\bigl (-2\cdot 10^{-4}\,\mathrm {V}< X< 10^{-4}\,\mathrm {V}\bigr ) &= P(-2< Z< 1) = P(0 \le Z< 1) + P(0 \le Z < 2) \\ &= \int _0^1 f_Z(z)\,\mathrm {d}z + \int _0^2 f_Z(z)\,\mathrm {d}z \approx 0.3413 + 0.4772 = 0.8185, \end{aligned}$$

where the probability density \(f_Z\) is given by (3.10). We have read off the numerical values of the integrals from Table D.1.

The required conditional probability is

$$\begin{aligned} P\bigl (X> 10^{-4}\,\mathrm {V} \,|\, X> 0\,\mathrm {V}\bigr ) &= P(Z> 1 \,|\, Z> 0) = { P(Z> 1 \cap Z> 0) \over P(Z> 0)} \\ &= { P(Z> 1) \over P(Z> 0) } = { P(Z > 1) \over 0.5 } \approx 0.3174. \end{aligned}$$

Since \(Z = 10^4 X\), we also have \(E[ |Z| ] = E\bigl [ 10^4 |X| \bigr ] = 10^4 E[ |X| ]\), so we need to compute

$$ E[ |Z| ] = \int \limits _{-\infty }^\infty |z|\,f_Z(z) \,\mathrm {d}z = 2 \int \limits _0^\infty z\,f_Z(z) \,\mathrm {d}z = {2\over \sqrt{2\pi }} \int \limits _0^\infty z \,\mathrm {e}^{-z^2/2} \,\mathrm {d}z = -\sqrt{2\over \pi } \Bigl [\mathrm {e}^{-z^2/2}\Bigr ]_0^\infty = \sqrt{2\over \pi } $$

and revert to the old variable, hence \(E[ |X| ] = 10^{-4}\sqrt{2/\pi }\,\mathrm {V} \approx 0.80\cdot 10^{-4}\,\mathrm {V}\).
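For readers who prefer to bypass the tables, the same four results can be reproduced with scipy.stats.norm; the snippet below is a minimal cross-check, with \(\sigma _X = 10^{-4}\,\mathrm {V}\) taken from the problem statement:

```python
from math import sqrt, pi
from scipy.stats import norm

sigma = 1e-4                               # standard deviation of the noise in V

p_exceed   = norm.sf(1e-4, loc=0.0, scale=sigma)       # P(X > 1e-4 V)      ~ 0.1587
p_interval = norm.cdf(1e-4, scale=sigma) \
           - norm.cdf(-2e-4, scale=sigma)               # P(-2e-4 < X < 1e-4) ~ 0.8185
p_cond     = p_exceed / 0.5                              # P(X > 1e-4 | X > 0) ~ 0.3174
e_abs      = sigma * sqrt(2/pi)                          # E[|X|]              ~ 8.0e-5 V

print(p_exceed, p_interval, p_cond, e_abs)
```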

10.6 Error Propagation in a Measurement of the Momentum Vector \(\star \)

We are measuring the time t in which a non-relativistic particle of mass m and momentum p traverses a distance L (that is, \(t=L/v=mL/p\), so the directly measured quantity is proportional to \(1/p\)), as well as the polar angle \(\theta \) of the vector \(\mathbf {p}\) with respect to the z-axis and its azimuthal angle \(\phi \). Suppose that we have measured the average values \(1/p = 5\,(\mathrm {GeV}/c)^{-1}\), \(\theta = 75^\circ \) and \(\phi = 110^\circ \), but all measurements carry one-percent uncertainties \(\Delta (1/p) \equiv \sigma _p = 0.05\,(\mathrm {GeV}/c)^{-1}\), \(\Delta \theta \equiv \sigma _\theta = 0.75^\circ \) and \(\Delta \phi \equiv \sigma _\phi = 1.1^\circ \), which are uncorrelated. Determine the uncertainties of the quantities

$$ p_x = p \sin \theta \cos \phi , \quad p_y = p \sin \theta \sin \phi , \quad p_z = p \cos \theta ! $$

In the notation of Sect. 4.9 we are dealing with the variables

$$ X_1 = 1/p, \quad X_2 = \theta , \quad X_3 = \phi , $$

with the averages \(\mu _1 = 5\,(\mathrm {GeV}/c)^{-1}\), \(\mu _2 = 75^\circ \) and \(\mu _3 = 110^\circ \). The corresponding covariance matrix (with the angular uncertainties expressed in radians and the units omitted for clarity) is

$$ \Sigma ({{\varvec{X}}}) = \left( \begin{array}{ccc} \sigma _p^2 &{} 0 &{} 0 \\ 0 &{} \sigma _\theta ^2 &{} 0 \\ 0 &{} 0 &{} \sigma _\phi ^2 \end{array} \right) \approx \left( \begin{array}{ccc} 0.0025 &{} 0 &{} 0 \\ 0 &{} 0.000171 &{} 0 \\ 0 &{} 0 &{} 0.000369 \end{array} \right) . $$

We need to calculate the covariance matrix of the variables

$$ Y_1 = p_x = {1\over X_1} \sin X_2 \cos X_3, \quad Y_2 = p_y = {1\over X_1} \sin X_2 \sin X_3, \quad Y_3 = p_z = {1\over X_1} \cos X_2, $$

and we need the derivatives (4.26) to do that:

$$ \begin{array}{lll} \displaystyle {\partial Y_1\over \partial X_1} = -{1\over X_1^2} \sin X_2 \cos X_3, \quad &{} \displaystyle {\partial Y_1\over \partial X_2} = {1\over X_1} \cos X_2 \cos X_3, \quad &{} \displaystyle {\partial Y_1\over \partial X_3} = -{1\over X_1} \sin X_2 \sin X_3, \\ \displaystyle {\partial Y_2\over \partial X_1} = -{1\over X_1^2} \sin X_2 \sin X_3, \quad &{} \displaystyle {\partial Y_2\over \partial X_2} = {1\over X_1} \cos X_2 \sin X_3, \quad &{} \displaystyle {\partial Y_2\over \partial X_3} = {1\over X_1} \sin X_2 \cos X_3, \\ \displaystyle {\partial Y_3\over \partial X_1} = -{1\over X_1^2} \cos X_2, \quad &{} \displaystyle {\partial Y_3\over \partial X_2} = -{1\over X_1} \sin X_2, \quad &{} \displaystyle {\partial Y_3\over \partial X_3} = 0. \end{array} $$

When these expressions are arranged in the \(3\times 3\) matrix D, (4.27) immediately yields

$$ \Sigma ({{\varvec{Y}}}) = D \Sigma ({{\varvec{X}}}) D^\mathrm {T} = \left( \begin{array}{lll} \sigma _{p_x}^2 &{} \sigma _{p_x p_y} &{} \sigma _{p_x p_z} \\ \sigma _{p_y p_x} &{} \sigma _{p_y}^2 &{} \sigma _{p_y p_z} \\ \sigma _{p_z p_x} &{} \sigma _{p_z p_y} &{} \sigma _{p_z}^2 \\ \end{array} \right) \approx 10^{-7} \left( \begin{array}{rrr} 126.4 &{} 30.74 &{} 2.440 \\ 30.74 &{} 53.10 &{} -6.704 \\ 2.440 &{} -6.704 &{} 66.63 \end{array} \right) . $$

The uncertainties of \(p_x\), \(p_y\) and \(p_z\) (in units of \(\mathrm {GeV}/c\)) then become

$$ \sigma _{p_x} = \sqrt{\Sigma _{11}({{\varvec{Y}}})} \approx 0.00355, \quad \sigma _{p_y} = \sqrt{\Sigma _{22}({{\varvec{Y}}})} \approx 0.00230, \quad \sigma _{p_z} = \sqrt{\Sigma _{33}({{\varvec{Y}}})} \approx 0.00258. $$

The propagation of the one-percent errors on the variables \(1/p\), \(\theta \) and \(\phi \) has therefore resulted in errors of more than one percent on the variables \(p_x\), \(p_y\) and \(p_z\):

$$\begin{aligned} p_x &= (-0.0661 \pm 0.0036) \,\mathrm {GeV}/c = -0.0661\,(1\pm 0.054) \,\mathrm {GeV}/c, \\ p_y &= (0.182 \pm 0.0023) \,\mathrm {GeV}/c = 0.182\,(1\pm 0.013) \,\mathrm {GeV}/c, \\ p_z &= (0.0518 \pm 0.0026) \,\mathrm {GeV}/c = 0.0518\,(1\pm 0.050) \,\mathrm {GeV}/c. \end{aligned}$$

The relative errors of \(p_x\) and of \(p_z = p \cos \theta \) have increased dramatically, from 1% to roughly 5%. A feeling for why this happens with \(p_z\) can be acquired by simple differentiation, \(\mathrm {d}p_z = \mathrm {d}p \cos \theta - p \sin \theta \,\mathrm {d}\theta \), or

$$ {\Delta p_z\over p\cos \theta } = {\Delta p\over p} - {\sin \theta \over \cos \theta }\, {\Delta \theta }. $$

The average value of \(\theta \) is not very far from \(90^\circ \), where \(\sin \theta \approx 1\) and \(\cos \theta \approx 0\). Any error \(\Delta \theta \) in this neighborhood, no matter how small, is amplified by the large factor \(\tan \theta \), which even diverges as \(\theta \rightarrow \pi /2\).
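A quick back-of-the-envelope check (not part of the original solution) confirms this: the \(\tan \theta \) term alone already accounts for nearly the entire 5% relative error of \(p_z\).

```python
import numpy as np

dtheta = np.radians(0.75)                 # the 1% uncertainty of theta, in radians
print(np.tan(np.radians(75.0)) * dtheta)  # ~ 0.049, i.e. ~5% relative error in p_z
```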

In addition, the covariances \(\sigma _{p_x p_y} = \sigma _{p_y p_x}\), \(\sigma _{p_x p_z} = \sigma _{p_z p_x}\) and \(\sigma _{p_y p_z} = \sigma _{p_z p_y}\) are all non-zero, and the corresponding correlation coefficients are

$$ \rho _{p_x p_y} = {\sigma _{p_x p_y} \over \sigma _{p_x} \sigma _{p_y}} \approx 0.375, \quad \rho _{p_x p_z} = {\sigma _{p_x p_z} \over \sigma _{p_x} \sigma _{p_z}} \approx 0.027, \quad \rho _{p_y p_z} = {\sigma _{p_y p_z} \over \sigma _{p_y} \sigma _{p_z}} \approx -0.113. $$
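The whole propagation \(\Sigma ({{\varvec{Y}}}) = D \Sigma ({{\varvec{X}}}) D^\mathrm {T}\) fits into a few lines of code. The sketch below is a numerical illustration only: it rebuilds the Jacobian from the derivatives listed above (with the input values of this example) and reproduces the covariance matrix, the uncertainties and the correlation coefficients to the quoted precision.

```python
import numpy as np

mu    = np.array([5.0, np.radians(75.0), np.radians(110.0)])   # 1/p, theta, phi
sigma = np.array([0.05, np.radians(0.75), np.radians(1.1)])    # 1% uncertainties
Sigma_X = np.diag(sigma**2)

inv_p, th, ph = mu
p = 1.0 / inv_p

# Jacobian D_ij = dY_i/dX_j at the average values (rows: p_x, p_y, p_z)
D = np.array([
    [-p**2*np.sin(th)*np.cos(ph),  p*np.cos(th)*np.cos(ph), -p*np.sin(th)*np.sin(ph)],
    [-p**2*np.sin(th)*np.sin(ph),  p*np.cos(th)*np.sin(ph),  p*np.sin(th)*np.cos(ph)],
    [-p**2*np.cos(th),            -p*np.sin(th),             0.0],
])

Sigma_Y = D @ Sigma_X @ D.T
print(Sigma_Y)                        # ~ 1e-7 * [[126.4, 30.7, 2.4], ...]

s = np.sqrt(np.diag(Sigma_Y))
print(s)                              # ~ [0.00355, 0.00230, 0.00258] GeV/c
print(Sigma_Y / np.outer(s, s))       # off-diagonals ~ 0.375, 0.027, -0.113
```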