Abstract
Finding expected values of distributions is one of the main tasks of any probabilistic analysis. The expected value in the narrower sense of the average (mean), which is a measure of distribution location, is introduced first, followed by the related concepts of the median and distribution quantiles. Expected values of functions of random variables are presented, as well as the variance as the primary measure of the distribution scale. The discussion is extended to moments of distributions (skewness, kurtosis), as well as to two- and d-dimensional generalizations. Finally, propagation of errors is analyzed.
In this chapter we discuss quantities that one may anticipate for individual random variables or their functions—with respect to the probability distributions of these variables—after multiple repetitions of random experiments: they are known as expected values or expectations of random variables. The most important such quantity is the average value, which is the expected value in the basic, narrowest sense of the word; further below we also discuss other expected values in the broader sense.
1 Expected (Average, Mean) Value
The expected value of a discrete random variable X, which can assume the values \(x_i\) (\(i=1,2,\ldots \)), is computed by weighting (multiplying) each of these values by the probability \(P(X=x_i)=f_X(x_i)\) that this particular value turns up in a large number of trials (see (2.13)), and then summing all such products:
The average is denoted by E or by a line across the random variable (or its function) being averaged. Both E[X] and \(\overline{X}\), as well as the frequently used symbol \(\mu _X\) imply the “averaging operation” performed on the variable X. (We emphasize this because we occasionally also use the slightly misleading expression “expected value of a distribution”: what usually changes in random processes is the value of a variable, not its distribution!) In Chaps. 4–6 the symbols
signify one and the same thing, while in Chaps. 7–10 the symbols \(\overline{X}\) and \(\overline{x}\) will denote the average value of a sample and \(E[\bullet ]\) will be used strictly for the expected value. The only symbol that would make no sense at all is E[x].
It cannot hurt to recall the formula for the center of mass of a one-dimensional system of point-like masses with total mass \(M=\sum _{i=1}^n m_i\):
If all probabilities in (4.1) are equal, we get a simple expression for the usual arithmetic average
The expected value of a continuous random variable X is obtained by replacing the sum by the integral and integrating the product of the variable value x and the corresponding probability density over the whole definition domain,
(Beware: this expected value may not exist for certain types of densities \(f_X\).) The analogy from mechanics is again the center of mass of a three-dimensional inhomogeneous body, which is calculated by integrating the product of the position vector with the position-dependent density over the whole volume:
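Definitions (4.1) and (4.3) are easy to check numerically. The sketch below computes the expected value of a fair die throw and, by a midpoint-rule integral, the mean of the density \(f_X(x)=2x\) on [0, 1]; both the die and this particular density are our illustrative choices, not taken from the text.

```python
# Discrete case (4.1): E[X] = sum_i x_i f_X(x_i); here a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1.0 / 6.0] * 6
E_discrete = sum(x * p for x, p in zip(values, probs))   # 3.5

# Continuous case (4.3): E[X] = integral of x f_X(x) dx, approximated by the
# midpoint rule; f(x) = 2x on [0, 1] is an illustrative density whose exact
# mean is 2/3.
def f(x):
    return 2.0 * x

n = 100_000
dx = 1.0 / n
E_continuous = 0.0
for i in range(n):
    x = (i + 0.5) * dx
    E_continuous += x * f(x) * dx
```

The midpoint rule is accurate to \(O(\mathrm {d}x^2)\) here, so the numerical mean agrees with 2/3 to many digits.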
Example
In a casino we indulge in a game of dice with the following rules for each throw: 2 spots—win ; 4 spots—win ; 6 spots—lose ; 1 spot, 3 spots or 5 spots—neither win nor lose. Any number of spots \(x_i\) is equally probable, \(P(X=x_i)={1\over 6}\), so the expected value of our earnings is
If the casino wishes to profit from this game, the participation fee should be at least this much. \(\triangleleft \)
2 Median
The median of a random variable X (discrete or continuous) is the value \(x = \mathrm {med}[X]\), for which
For a continuous variable X the inequalities become equalities,
as it is always possible to find the value of x that splits the area under the probability density curve in two halves: the probabilities that X assumes a value above or below the median, respectively, are exactly \(50\%\).
The median of a discrete variable X cannot always be determined uniquely, since the discrete nature of its distribution may cause the inequalities in (4.4) to be fulfilled simultaneously for many different x. For example, consider a discrete distribution with probability function \(f_X(x)=1/2^x\), where \(x=1,2,\ldots \) We see that \(P(X<x) = P(X>x) = \textstyle {1\over 2}\) holds for any value \(1< x < 2\). In such cases the median is defined as the central point of the interval on which the assignment is ambiguous—in the present example we therefore set \(\mathrm {med}[X] = 1.5\).
Example
A continuous random variable has the probability density
shown in Fig. 4.1 (left). Find the mode (location of maximum), median and the average (mean) of this distribution!
The mode is obtained by differentiating and setting the result to zero:
The median \(\mathrm {med}[X]\equiv a\) must split the area under the curve of \(f_X\) into two parts of 1/2 each, thus
This results in the quadratic equation \(2a^4 - 36 a^2 + 81 = 0\) with two solutions, \(a^2 = 9(1\pm \sqrt{2}/2)\). Only the solution with the negative sign is acceptable as it is the only one that falls within the [0, 3] domain:
The average is calculated by using the definition (4.3),
All three values are shown in Fig. 4.1 (left). \(\triangleleft \)
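The density itself is not reproduced above, but the quoted domain [0, 3] and the quartic \(2a^4 - 36 a^2 + 81 = 0\) are consistent with \(f_X(x) = (4/81)\,x\,(9-x^2)\); the following numerical cross-check assumes this reconstructed form.

```python
import math

# Assumed density, reconstructed from the quoted domain [0, 3] and the
# quartic 2a^4 - 36a^2 + 81 = 0:  f(x) = (4/81) x (9 - x^2).
def f(x):
    return 4.0 / 81.0 * x * (9.0 - x * x)

# Mode: f'(x) = (4/81)(9 - 3x^2) = 0  ->  x = sqrt(3).
mode = math.sqrt(3.0)

# Median: the admissible root of 2a^4 - 36a^2 + 81 = 0 within [0, 3].
median = math.sqrt(9.0 * (1.0 - math.sqrt(2.0) / 2.0))

# Mean: E[X] = integral of x f(x) over [0, 3] (midpoint rule); exactly 1.6.
n = 100_000
dx = 3.0 / n
mean = 0.0
for i in range(n):
    x = (i + 0.5) * dx
    mean += x * f(x) * dx
```

Note the ordering mean < median < mode for this left-skewed density, in agreement with Fig. 4.1 (left).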
3 Quantiles
The value of a random variable below which a given fraction \(\alpha \) of all events is found after numerous trials is called the \(\alpha \)th quantile \(x_\alpha \) of its distribution (lat. quantum, “how much”). For a continuous probability distribution this means that the integral of the probability density from \(-\infty \) to \(x_\alpha \) equals \(\alpha \) (Fig. 4.2). For example, the 0.50th quantile of the standardized normal distribution is \(x_{0.50} = 0\), while its 0.9985th quantile is \(x_{0.9985} \approx 3\), see (3.13).
To express the \(\alpha \)th quantile all values \(0 \le \alpha \le 1\) are allowed, but several related terms are in wide use for specific values of \(\alpha \): integer values (in percent) define percentiles, the tenths of the whole range of \(\alpha \) are delimited by deciles and the fourths by quartiles: \(x_{0.20}\) defines the 20th percentile or the second decile of a distribution, while \(x_{0.25}\) and \(x_{0.75}\) set the limits of its first and third quartile. Hence \(x_{0.50}\) carries no fewer than five names: it is the 0.50th quantile, the 50th percentile, the second quartile, the fifth decile and—the median. The difference \(x_{0.75}-x_{0.25}\) is called the inter-quartile range (IQR). The interval \([x_{0.25},x_{0.75}]\) contains half of all values; a quarter of them reside to its left and a quarter to its right.
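The normal-distribution quantiles quoted above can be reproduced with Python's statistics.NormalDist; this is a sketch using the probabilities from the text plus the quartiles and IQR.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# The 0.50th quantile (the median) of the standardized normal distribution:
median = std_normal.inv_cdf(0.50)            # 0.0

# Its 0.9985th quantile, roughly 3 ("three sigma"):
q_9985 = std_normal.inv_cdf(0.9985)

# Quartiles and the inter-quartile range (IQR):
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1                                # about 1.349 for N(0, 1)
```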
Example
Fig. 4.3 (left) shows the daily sales of fiction books from the 1000 bestseller list (sales rank r) of the Amazon online bookstore in a certain time period. (Note the log-log scale: in linear scale the distribution has a sharp peak at \(r=1\) and a rapidly dropping tail, so it mostly occupies the region around the origin.)
To study the sales dynamics such discrete distributions are often approximated by continuous Pareto distributions (3.16). For many markets in the past, the “Pareto 80/20 principle” seemed to apply, stating that a relatively small fraction (\({\approx }20\%\)) of products (in our case best-selling books) brings most (\({\approx } 80\%\)) of the profit. Figure 4.3 (right) shows the daily earnings as a function of sales rank, as well as the median, average rank, and the sales rank up to which Amazon earns \(80\%\) of the money: the latter is 234 (of 1000), neatly corresponding with the Pareto “principle”. Still, it is obvious from the graph that the Pareto distribution underestimates the actual sales at high ranks r. Analyses show [1, 2] that the distribution n(r) has become flatter over the years, meaning that more and more profit is being squeezed from the ever increasing tail; see also [3]. \(\triangleleft \)
4 Expected Values of Functions of Random Variables
The simplest functions of random variables are the sum \(X+Y\) of two variables and the linear combination \(aX+b\), where a and b are arbitrary real constants. Since the expected value of a continuous random variable, E[X], is defined by an integral, the expected values of \(E[X+Y]\) and \(E[aX+b]\) inherit all properties of the integral, in particular linearity. (A similar conclusion follows in the discrete case where we are dealing with sums.) Therefore, for both continuous and discrete random variables it holds that
as well as
and
One needs to be slightly more careful in computing the expected values of more general functions of random variables. Suppose that X is a discrete random variable with probability distribution (probability function) \(f_X\). Then \(Y=g(X)\) is also a random variable and its probability function is
If X takes the values \(x_1,x_2,\ldots ,x_n\) and Y takes the values \(y_1,y_2,\ldots ,y_m\) (\(m \le n\)), we have
hence
If X is a continuous random variable, we just need to replace the sum by the integral and the probability function by the probability density:
This is a good spot to comment on a very popular approximation that can be an ugly mistake or a good short-cut to a solution: it is the approximation
The trick works well if the density \(f_X\) of X is a sharp, strongly peaked function, and not so well otherwise. Regardless of this, however, for any convex function g, Jensen’s inequality holds true:
that is,
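A minimal numerical illustration of Jensen's inequality, with the convex function \(g(x)=x^2\) and a uniform discrete X; both choices are ours, made purely for illustration.

```python
# Convex g(x) = x^2, with X uniform on {0, 1, 2, 3}.
values = [0, 1, 2, 3]
p = 1.0 / len(values)

E_X = sum(x * p for x in values)               # E[X]   = 1.5
E_gX = sum(x * x * p for x in values)          # E[g(X)] = 3.5
g_EX = E_X ** 2                                # g(E[X]) = 2.25

# Jensen: E[g(X)] >= g(E[X]) for convex g; here 3.5 >= 2.25.
```

The gap between \(E[g(X)]\) and \(g(E[X])\) is exactly what the popular approximation above neglects; it shrinks as the distribution becomes more sharply peaked.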
4.1 Probability Densities in Quantum Mechanics
As physicists, we ceaselessly calculate expected values of the form (4.8) in any field related to statistical or quantum mechanics. We say: the expected value of an operator \({\widehat{\mathcal{{O}}}}\) in a certain state of a quantum-mechanical system (for example, ground state of the hydrogen atom) described by the wave-function \(\psi \), is
The operator \(\widehat{\mathcal{O}}\) acts on the right part of the integrand, \(\psi \), then the result is multiplied from the left by its complex conjugate \(\psi ^*\), and integrated over the whole domain. If \(\widehat{\mathcal{O}}\) is multiplicative, for example \(\widehat{\mathcal{O}}({{\varvec{r}}}) = z\)—in this case we obtain the expectation value of the third Cartesian component of the electron’s position vector in the hydrogen atom—we are computing just
which is the integral of a product of two scalar functions, the second of which, \(\rho ({{\varvec{r}}})\), is nothing but the probability density of (4.8).
Example
An electron moving in the electric field of a lead nucleus is described by the function
where \(r_\mathrm {B} \approx 6.46\times 10^{-13}\,\mathrm {m}\). The nucleus may be imagined as a positively charged sphere with radius \(7\times 10^{-15}\,\mathrm {m}\). How much time does the electron “spend” in the nucleus, i.e. what is the probability that it resides within a sphere of radius R? All we are looking for is the expected value of the operator \(\widehat{\mathcal{O}}({{\varvec{r}}}) = 1\) in (4.11); due to angular symmetry the volume element is simply \(\mathrm {d}V = 4\pi r^2 \,\mathrm {d}r\), thus
An almost identical result is obtained by assuming that \(\psi \) is practically constant on the interval [0, R], which is reasonable, since \(R \ll r_\mathrm {B}\). In this case we obtain \(P = (1/\pi )r_\mathrm {B}^{-3} (4\pi R^3/3) = (4/3) (R/r_\mathrm {B})^3 \approx 1.69\times 10^{-6}\). \(\triangleleft \)
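Assuming the usual 1s ground-state form \(|\psi ({{\varvec{r}}})|^2 = (1/\pi )\,r_\mathrm {B}^{-3}\,\mathrm {e}^{-2r/r_\mathrm {B}}\), which is consistent with the constant-\(\psi \) approximation quoted at the end of the example, the probability can be cross-checked numerically:

```python
import math

r_B = 6.46e-13      # m, scaled "Bohr radius" for the lead nucleus (from the text)
R = 7e-15           # m, nuclear radius (from the text)

# Assumed ground-state probability density:
# |psi(r)|^2 = (1/pi) r_B^-3 exp(-2 r / r_B)
def psi_sq(r):
    return math.exp(-2.0 * r / r_B) / (math.pi * r_B ** 3)

# P = integral_0^R |psi|^2 * 4 pi r^2 dr, midpoint rule.
n = 20_000
dr = R / n
P = sum(psi_sq((i + 0.5) * dr) * 4.0 * math.pi * ((i + 0.5) * dr) ** 2 * dr
        for i in range(n))

# Constant-psi approximation: P ~ (4/3) (R / r_B)^3 ~ 1.69e-6.
P_approx = (4.0 / 3.0) * (R / r_B) ** 3
```

Since \(R \ll r_\mathrm {B}\), the exponential barely deviates from 1 inside the nucleus, so the two results differ by under two percent.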
5 Variance and Effective Deviation
Computing the expected value of a random variable X tells us something about where within its domain its values will approximately land after many repetitions of the corresponding random experiment. Now we are also interested in the variation (scattering) of the values around their average \(E[X] = \overline{X}\). A measure of this scattering is the variance, defined as
A large variance means a large scatter around the average and vice-versa. The positive square root of the variance,
is known as effective or standard deviation—in particular with the normal distribution on our minds. In the following we shall also make use of the relation
(Prove it as an exercise.) If X is a discrete random variable, which takes the values \(x_1,x_2,\ldots ,x_n\) and has the probability function \(f_X\), its variance is
In the case that all probabilities are equal, \(f_X(x_i)=1/n\), the variance is
Note the factor 1/n—not \(1/(n-1)\), as one often encounters—as it will acquire an important role in random samples in Chap. 7.
If X is a continuous random variable with the probability density \(f_X\), its variance is
It can be shown that, regardless of the distribution obeyed by any (continuous or discrete) random variable X, it holds that
for any constant \(a > 0\), which is known as the Chebyshev inequality. It can also be formulated in terms of the slightly tighter Cantelli’s constraints
We may resort to this tool if we know only the expected value of the random variable, \(\overline{X}\), and its variance, \(\sigma _X^2\), but not the functional form of its distribution. In such cases we can still calculate the upper limits for probabilities of the form (4.16).
Example
Suppose that the measured noise voltage at the output of a circuit has an average of \(\overline{U}=200\,\mathrm {mV}\) and variance \(\sigma _U^2 = (80\,\mathrm {mV})^2\). The probability that the noise exceeds \(300\,\mathrm {mV}\) (i.e. rises more than \(\Delta U = 100\,\mathrm {mV}\) above the average) can be bounded from above as \(P \bigl ( U \ge \overline{U} + {\Delta U} \bigr ) \le \sigma _U^2 \bigl / \bigl ( \sigma _U^2 + ({\Delta U})^2 \bigr ) \approx 0.39\). \(\triangleleft \)
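The bound in this example is a one-line computation; the sketch below also evaluates the weaker two-sided Chebyshev bound for comparison.

```python
mean_U = 200.0      # mV
sigma_U = 80.0      # mV
delta_U = 100.0     # mV (excess over the average, i.e. threshold 300 mV)

# One-sided Cantelli bound: P(U >= mean + d) <= sigma^2 / (sigma^2 + d^2)
cantelli = sigma_U ** 2 / (sigma_U ** 2 + delta_U ** 2)   # ~ 0.39

# Two-sided Chebyshev bound: P(|U - mean| >= d) <= sigma^2 / d^2
chebyshev = sigma_U ** 2 / delta_U ** 2                   # 0.64
```

Both are distribution-free upper limits; if the noise were known to be, say, normal, the actual exceedance probability would be far smaller.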
6 Complex Random Variables
A particular linear combination of real random variables X and Y is the complex random variable
Its distribution function at \(z = x + \mathrm {i}\,y\) is defined as
where \(F_{X,Y}(x,y)\) is the distribution function of the pair—more precisely, the random vector (X, Y). The expected value of the variable Z is defined as
Computing the expectation values of complex random variables is an additive and homogeneous operation: for arbitrary \(Z_1\) and \(Z_2\) it holds that
while for an arbitrary complex constant \(c = a + \mathrm {i}\,b\) we have
The variance of a complex random variable is defined as
A short calculation—do it!—shows that it is equal to the sum of the variances of its components,
The complex random variables \(Z_1 = X_1 + \mathrm {i}\,Y_1\) and \(Z_2 = X_2 + \mathrm {i}\,Y_2\) are mutually independent if the random vectors \((X_1,X_2)\) and \((Y_1,Y_2)\) are independent. (A generalization is at hand: complex random variables \(Z_k = X_k + \mathrm {i}\,Y_k\) (\(k=1,2,\ldots ,n)\) are mutually independent if the same applies to random vectors \((X_k,Y_k)\).) If \(Z_1\) and \(Z_2\) are independent and possess expected values, their product also possesses one, and it holds that
7 Moments
The average (mean) and the variance are two special cases of expected values in the broader sense called moments: the pth raw or algebraic moment \(M_p'\) of a random variable X is defined as the expected value of its pth power, that is, \(M_p' = E[X^p]\):
Frequently we also require central moments, defined with respect to the corresponding average value of the variable, that is, \(M_p = E\bigl [\bigl (X-\overline{X}\bigr )^p\bigr ]\):
From here we read off \(M_0'=1\) (normalization of probability distribution), \(M_1' = \overline{X}\) and \(M_2 = \sigma _X^2\). The following relations (check them as an exercise) also hold:
In addition to the first (average) and second moment (variance) only the third and fourth central moment are in everyday use. The third central moment, divided by the third power of its effective deviation,
is called the coefficient of skewness or simply skewness. The coefficient \(\rho \) measures the asymmetry of the distribution around its average: \(\rho < 0\) means that the distribution has a relatively longer tail to the left of the average value (Fig. 4.4 (left)), while \(\rho > 0\) implies a more pronounced tail to its right (Fig. 4.4 (center)).
The fourth central moment, divided by the square of the variance,
is known as kurtosis and tells us something about the “sharpness” or “bluntness” of the distribution. For the normal distribution we have \(M_4/\sigma ^4 = 3\), so we sometimes prefer to specify the quantity
called the excess kurtosis: \(\varepsilon > 0\) indicates that the distribution is “sharper” than the normal (more prominent peak, faster falling tails), while \(\varepsilon < 0\) implies a “blunter” distribution (less pronounced peak, stronger tails), see Fig. 4.4 (right).
The properties of the most important continuous distributions—average value, median, mode (location of maximum), variance, skewness (\(\rho \)) and kurtosis (\(\varepsilon +3\))—are listed in Table 4.1. See also Appendices B.2 and B.3, where we shall learn how to “automate” the calculation of moments by using generating and characteristic functions.
Example
We are interested in the mode (“most probable velocity”), average velocity and the average velocity squared of \(\mathrm {N}_2\) gas molecules (molar mass \(M=28\,\mathrm {kg}/\mathrm {kmol}\), mass of single molecule \(m=M/N_\mathrm {A}\)) at temperature \(T=303\,\mathrm {K}\). The velocity distribution of the molecules is given by the Maxwell distribution (3.15), whose maximum (mode) is determined by \(\mathrm {d}f_V/\mathrm {d}v=0\), hence
The average value and the square root of the average velocity squared (“root-mean-square velocity”) are computed from (4.3) and (4.17) with \(p=2\):
where we have used \(\int _0^\infty z^3\exp (-z^2)\,\mathrm {d}z = 1/2\) and \(\int _0^\infty z^4\exp (-z^2)\,\mathrm {d}z = 3\sqrt{\pi }/8\). These three famous quantities are shown in Fig. 4.1 (right). \(\triangleleft \)
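The three velocities follow from the standard Maxwell-distribution results \(v_\mathrm {p} = \sqrt{2k_\mathrm {B}T/m}\), \(\overline{v} = \sqrt{8k_\mathrm {B}T/(\pi m)}\) and \(v_\mathrm {rms} = \sqrt{3k_\mathrm {B}T/m}\); a numerical sketch for \(\mathrm {N}_2\) at 303 K:

```python
import math

k_B = 1.380649e-23      # J/K, Boltzmann constant
N_A = 6.02214076e23     # 1/mol, Avogadro constant
M = 28.0e-3             # kg/mol, molar mass of N2
T = 303.0               # K
m = M / N_A             # mass of a single molecule

v_mode = math.sqrt(2.0 * k_B * T / m)               # most probable velocity
v_mean = math.sqrt(8.0 * k_B * T / (math.pi * m))   # average velocity
v_rms  = math.sqrt(3.0 * k_B * T / m)               # root-mean-square velocity
```

The fixed ordering \(v_\mathrm {p}< \overline{v} < v_\mathrm {rms}\) (ratios \(1 : \sqrt{4/\pi } : \sqrt{3/2}\)) is visible in Fig. 4.1 (right).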
7.1 Moments of the Cauchy Distribution
The Cauchy distribution \(f_X(x)=(1/\pi )/(1+x^2)\) drops off so slowly as \(x\rightarrow \pm \infty \) that its moments (average, variance, and so on) do not exist. For this reason its domain is frequently restricted to a narrower interval \([-x_\mathrm {max},x_\mathrm {max}]\):
This is particularly popular in nuclear physics where the Breit–Wigner description of the shape of the resonance peak in its tails—see Fig. 3.6 (right)—is no longer adequate due to the presence of neighboring resonances or background. With the truncated density \(g_X\) both the average and the variance are well defined:
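For the truncated density \(g_X\) the mean vanishes by symmetry, and a short calculation gives \(\mathrm {var}[X] = x_\mathrm {max}/\arctan x_\mathrm {max} - 1\); the sketch below cross-checks this numerically for \(x_\mathrm {max}=5\), a cutoff value of our own choosing.

```python
import math

x_max = 5.0   # truncation point (illustrative choice)

# Normalization of the truncated Cauchy density on [-x_max, x_max]:
norm = (2.0 / math.pi) * math.atan(x_max)

def g(x):
    # truncated density g_X(x)
    return (1.0 / math.pi) / (1.0 + x * x) / norm

# Mean and variance by midpoint-rule integration.
n = 100_000
dx = 2.0 * x_max / n
mean = 0.0
var = 0.0
for i in range(n):
    x = -x_max + (i + 0.5) * dx
    mean += x * g(x) * dx
    var += x * x * g(x) * dx

# Closed form: var[X] = x_max / arctan(x_max) - 1  (mean is 0 by symmetry).
var_exact = x_max / math.atan(x_max) - 1.0
```

Note that the variance grows without bound as \(x_\mathrm {max}\rightarrow \infty \), recovering the non-existence of the moments of the full Cauchy distribution.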
Narrowing down the domain is a special case of a larger class of “distortions” of probability distributions used to describe, for example, non-ideal outcomes of a process or imperfect efficiencies for analyzing particles in a detector. If individual events are detected under different conditions, the ideal probability density, \(f_X\), must be weighted by the detection efficiency:
where y is an auxiliary variable over which the averaging is being performed, and \(\varepsilon (x,y)\) is the probability density for the event being detected near \(X=x\) and \(Y=y\). An introduction to such weighted averaging procedures can be found in Sect. 8.5 of [4].
8 Two- and d-dimensional Generalizations
Let the continuous random variables X and Y be distributed according to the joint probability density \(f_{X,Y}(x,y)\). In this case the expected values of the individual variables can be calculated by the obvious generalization of (4.3) to two dimensions. The density \(f_{X,Y}\) is weighted by the variable whose expected value we wish to compute, while the other is left untouched:
In the discrete case the extension to two variables requires a generalization of (4.1):
By analogy to (4.15) and (4.13) we also compute the variances of variables in the continuous case,
and the variances in the discrete case,
Henceforth we only give equations pertaining to continuous variables. The corresponding expressions for discrete variables are obtained, as usual, by replacing the probability densities \(f_{X,Y}(x,y)\) by the probability [mass] functions \(f_{X,Y}(x_i,y_j)=P(X=x_i, Y=y_j)\), and integrals by sums.
Since now two variables are at hand, we can define yet a third version of the double integral (or the double sum) in which the variables enter bilinearly—the so-called mixed moment known as the covariance of X and Y:
One immediately sees that
for arbitrary constants a and b, as well as
Therefore, if X and Y are mutually independent, then by definition (2.25) one also has \(E[XY] = E[X]E[Y] = \mu _X\mu _Y\), and then
(The covariance of independent variables equals zero.) For a later discussion of measurement uncertainties the following relation between the variance and covariance of two variables is important:
In other words,
By using the covariance and both effective deviations we define the Pearson’s coefficient of linear correlation (also linear correlation coefficient )
It is easy to confirm the allowed range of \(\rho _{XY}\) given above. Being a square, the expression \(E[ ( \lambda (X-\mu _X) - (Y-\mu _Y) )^2 ]\) is non-negative for any \(\lambda \in {\mathbb {R}}\). Let us expand it:
The left side of this inequality is a real second-degree polynomial \(a\lambda ^2 + b\lambda + c\) with coefficients \(a=\sigma _X^2\), \(b=-2\sigma _{XY}\), \(c=\sigma _Y^2\), which is non-negative everywhere, so it can have at most one real zero. This implies that its discriminant cannot be positive, \(b^2 - 4ac \le 0\). This tells us that \(4\sigma _{XY}^2 -4\sigma _X^2\sigma _Y^2 \le 0\) or \(|\sigma _{XY}/(\sigma _X\sigma _Y)| \le 1\), which is precisely (4.21).
The generalization of (4.20) to the sum of (not necessarily independent) random variables \(X_1,X_2,\ldots ,X_n\) is
If the variables \(X_1,X_2,\ldots ,X_n\) are mutually independent, this expression reduces to
Example
Many sticks of length 1 are broken at two random locations. What is the average length of the central pieces? Each stick breaks at \(0< x_1 < 1 \) and \(0< x_2 < 1\), where the values \(x_1\) and \(x_2\) are uniformly distributed over the interval [0, 1], but one can have either \(x_1 < x_2\) or \(x_1 > x_2\). What we are seeking, then, is the expected value of the variable \(L=|X_2-X_1|\) (with values l) with respect to the probability density \(f_{X_1,X_2}(x_1,x_2)=1\):
How would the result change if the probability that the stick breaks linearly increases from 0 at the origin to 1 at the opposite edge? \(\triangleleft \)
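The exact answer of the double integral is \(E[L] = 1/3\); a seeded Monte Carlo sketch (our own cross-check) confirms it.

```python
import random

# Monte Carlo estimate of E[|X2 - X1|] for two independent uniform break
# points on [0, 1]; the exact expected central-piece length is 1/3.
random.seed(12345)
n = 200_000
total = 0.0
for _ in range(n):
    x1 = random.random()
    x2 = random.random()
    total += abs(x2 - x1)
mean_length = total / n
```

With \(n = 2\times 10^5\) samples the statistical error is of order \(5\times 10^{-4}\), comfortably resolving the value 1/3.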
Example
Let the continuous random variables X and Y both be normally distributed, with averages \(\mu _X\) and \(\mu _Y\) and variances \(\sigma _X^2\) and \(\sigma _Y^2\). What is their joint probability density if X and Y are independent, and what are their joint and conditional densities in the dependent case, with correlation coefficient \(\rho _{XY} = \rho \)?
If X and Y are independent, their joint probability density—by (2.25)—is simply the product of the corresponding one-dimensional densities:
The curves of constant values of \(f_{X,Y}\) in the (x, y) plane are untilted ellipses in general (\(\sigma _X \ne \sigma _Y\)), and circles in the special case \(\sigma _X = \sigma _Y\). At any rate \(\rho =0\) for such a distribution. A two-dimensional normal distribution of dependent (and therefore correlated) variables is described by the probability density
where we have denoted \(x' = x-\mu _X\) and \(y' = y-\mu _Y\). This distribution cannot be factorized as \(f_{X,Y}(x,y)=f_X(x)f_Y(y)\), and its curves of constant values are tilted ellipses; for parameters \(\mu _X=10\), \(\mu _Y=0\), \(\sigma _X=\sigma _Y=1\) and \(\rho =0.8\) they are shown in Fig. 4.5 (left).
Conditional probability densities \(f_{X|Y}(x|y)\) and \(f_{Y|X}(y|x)\) can be computed by using (2.26) and (2.27). Let us treat the first case, the other one is obtained by simply replacing \(x \leftrightarrow y\), \(\mu _X \leftrightarrow \mu _Y\) and \(\sigma _X \leftrightarrow \sigma _Y\) at appropriate locations:
This conditional probability density is shown in Fig. 4.5 (right). By comparing it to definition (3.7) we infer that the random variable X|Y is distributed as
a feature also seen in the plot: the width of the band does not depend on y. \(\triangleleft \)
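A sampling sketch with the quoted parameters (\(\mu _X=10\), \(\mu _Y=0\), \(\sigma _X=\sigma _Y=1\), \(\rho =0.8\)) confirms both the correlation coefficient and the y-independent conditional width \(\sigma _X\sqrt{1-\rho ^2}\). The construction of X from two independent standard normals is a standard trick, not spelled out in the text.

```python
import math
import random

random.seed(1)
mu_X, mu_Y = 10.0, 0.0
sigma_X, sigma_Y, rho = 1.0, 1.0, 0.8

# Build correlated normal pairs from independent standard normals z1, z2.
n = 100_000
xs, ys = [], []
for _ in range(n):
    z1 = random.gauss(0.0, 1.0)
    z2 = random.gauss(0.0, 1.0)
    y = mu_Y + sigma_Y * z1
    x = mu_X + sigma_X * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    xs.append(x)
    ys.append(y)

# Sample correlation coefficient, expected close to rho = 0.8.
mx = sum(xs) / n
my = sum(ys) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
sx = math.sqrt(sum((a - mx) ** 2 for a in xs) / n)
sy = math.sqrt(sum((b - my) ** 2 for b in ys) / n)
r = sxy / (sx * sy)

# The conditional X|Y=y has standard deviation sigma_X * sqrt(1 - rho^2),
# independent of y -- the constant band width seen in Fig. 4.5 (right).
cond_sigma = sigma_X * math.sqrt(1.0 - rho * rho)   # 0.6
```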
8.1 Multivariate Normal Distribution
This appears to be a good place to generalize the normal distribution of two variables (the so-called binormal or bivariate normal distribution) to d dimensions. We are dealing with a vector random variable
and its average
We construct the \(d\times d\) covariance matrix \(\Sigma \) with the matrix elements
The covariance matrix is symmetric and at least positive semi-definite. It can even be strictly positive definite if none of the variables \(X_i\) is a linear combination of the others. The probability density of the multivariate normal distribution (compare it to its one-dimensional counterpart (3.10)) is then
If \(d=2\) as in the previous Example, we have simply \({{\varvec{X}}}=(X_1,X_2)^\mathrm {T} \rightarrow (X,Y)^\mathrm {T}\) and \({{\varvec{\mu }}}=(\mu _1,\mu _2)^\mathrm {T} \rightarrow (\mu _X,\mu _Y)^\mathrm {T}\), while the covariance matrix is
8.2 Correlation Does Not Imply Causality
A vanishing correlation coefficient of X and Y does not mean that these variables are stochastically independent: for each density \(f_{X,Y}\) that is an even function of the deviations \(x-\mu _X\) and \(y-\mu _Y\), one has \(\rho _{XY}=0\). In other words, \(\rho _{XY}=0\) is just a necessary, but not sufficient condition for independence: see bottom part of Fig. 7.8 which illustrates the correlation in the case of finite samples.
Even though one observes a correlation in a pair of variables (sets of values, measurements, phenomena) this does not necessarily mean that there is a direct causal relation between them: correlation does not imply causality. When we observe an apparent dependence between two correlated quantities, often a third factor is involved, common to both X and Y. Example: the sales of ice-cream and the number of shark attacks at the beach are certainly correlated, but there is no causal relation between the two. (Does your purchase of three scoops of ice-cream instead of one triple your chances of being bitten by a shark?) The common factor of tempting scoops and aggressiveness of sharks is a hot summer day, when people wish to cool off in the water and sharks prefer to dwell near the shore.
Besides, one should be aware that correlation and causality are concepts originating in completely different worlds: the former is a statement on the basis of probability theory, while the latter signifies a strictly physical phenomenon, whose background is time and the causal connection between the present and past events.
9 Propagation of Errors
If we knew how to generalize (4.20) to an arbitrary function of an arbitrary number of variables, we would be able to answer the important question of error propagation. But what do we mean by “error of random variable”? In the introductory chapters we learned that each measurement of a quantity represents a single realization of a random variable whose value fluctuates statistically. Such a random deviation from its expected value is called the statistical uncertainty or “error”. By studying the propagation of errors we wish to find out how the uncertainties of a given set of variables translate into the uncertainty of a function of these variables. A typical example is the determination of the thermal power released on a resistor from the corresponding voltage drop: if the uncertainty of the voltage measurement is \(\Delta U\) and the resistance R is known to an accuracy of no more than \(\Delta R\), what is the uncertainty of the calculated power \(P = U^2/R\)?
Let \(X_1,X_2,\ldots ,X_n\) be real random variables with expected values \(\mu _1,\mu _2,\ldots ,\mu _n\), which we arrange as vectors
and
just as in Sect. 4.8.1. Let \(Y=Y({{\varvec{X}}})\) be an arbitrary function of these variables which, of course, is also a random variable. Assume that the covariances of all \((X_i,X_j)\) pairs are known. We would like to estimate the variance of the variable Y. In the vicinity of \({{\varvec{\mu }}}\) we expand Y in a Taylor series in \({{\varvec{X}}}\) up to the linear term,
and resort to the approximation \(E[ Y({{\varvec{X}}}) ] \approx Y({{\varvec{\mu }}})\) (see (4.9) and (4.10)) to compute the variance. It follows that
where
is the covariance matrix of the variables \(X_i\): its diagonal terms are the variances of the individual variables, \(\mathrm {var}[X_i] = \sigma _{X_i}^2\), while the non-diagonal ones (\(i\ne j\)) are the covariances \(\mathrm {cov}[X_i,X_j]\). Formula (4.24) is what we have been looking for: it tells us—within the specified approximations—how the “errors” in \({{\varvec{X}}}\) map to the “errors” in Y. If \(X_i\) are mutually independent, we have \(\mathrm {cov}[X_i,X_j]=0\) for \(i\ne j\) and the formula simplifies to
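Applied to the thermal-power question posed at the beginning of this section, \(P=U^2/R\) with independent U and R, the simplified formula gives \((\sigma _P/P)^2 = (2\sigma _U/U)^2 + (\sigma _R/R)^2\); a sketch with illustrative values of our own:

```python
import math

# Voltage and resistance with independent uncertainties (illustrative values).
U, sigma_U = 10.0, 0.1       # V
R, sigma_R = 100.0, 1.0      # ohm

P = U ** 2 / R               # 1 W

# sigma_P^2 = (dP/dU)^2 sigma_U^2 + (dP/dR)^2 sigma_R^2
dP_dU = 2.0 * U / R
dP_dR = -U ** 2 / R ** 2
sigma_P = math.sqrt((dP_dU * sigma_U) ** 2 + (dP_dR * sigma_R) ** 2)

# Equivalent relative form: (sigma_P/P)^2 = (2 sigma_U/U)^2 + (sigma_R/R)^2
rel = math.sqrt((2.0 * sigma_U / U) ** 2 + (sigma_R / R) ** 2)
```

The factor 2 multiplying the relative voltage uncertainty comes directly from the power of U in \(P = U^2/R\): relative errors pick up the exponents as weights.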
Example
Let \(X_1\) and \(X_2\) be independent continuous random variables with the mean values \(\mu _1\) and \(\mu _2\) and variances \(\sigma _1^2\) and \(\sigma _2^2\). We are interested in the variance \(\sigma _Y^2\) of their ratio \(Y=X_1/X_2\). Since \(X_1\) and \(X_2\) are independent, we may apply formula (4.25). We need the derivatives
Therefore
or
where \(\mu _Y = E[Y] = \mu _1/\mu _2\). \(\triangleleft \)
Example
Let X and Y be independent random variables with the expected values \(\mu _X\) and \(\mu _Y\) and variances \(\sigma _X^2\) and \(\sigma _Y^2\) (with respective “uncertainties of measurements” \(\sigma _X\) and \(\sigma _Y\)). What is the variance \(\sigma _Z^2\) of the product of their powers,
(This is a generalization of the function from the previous example to arbitrary powers m and n.) By formula (4.25) we again obtain
Thus
where we have denoted \(\mu _Z = \mu _X^m \mu _Y^n\). \(\triangleleft \)
9.1 Multiple Functions and Transformation of the Covariance Matrix
Let us now discuss the case of multiple scalar functions \(Y_1,Y_2,\ldots ,Y_m\), which all depend on variables \({{\varvec{X}}}\),
We arrange the function values in the vector \({{\varvec{Y}}} = (Y_1,Y_2,\ldots ,Y_m)^\mathrm {T}\) and retrace the steps from the beginning of this section. We neglect all higher order terms in the Taylor expansion
and take into account that \(E[ Y_k({{\varvec{X}}}) ] \approx Y_k({{\varvec{\mu }}})\). Instead of (4.24) we now obtain a relation between the covariance matrix of variable \({{\varvec{X}}}\) and the covariance matrix of the variables \({{\varvec{Y}}}\),
This relation becomes even more transparent if we write the Taylor expansion as
where \({{\varvec{X}}}\) and \({{\varvec{Y}}}\) are n- and m-dimensional vectors, respectively, while D is an \(m\times n\) matrix embodying the linear part of the expansion, namely
Hence
or, in brief,
The propagation of errors in higher dimensions can therefore be seen as a transformation of the covariance matrix. The variances \(\sigma _{Y_k}^2\) of the variables \(Y_k\) are the diagonal matrix elements of \(\Sigma ({{\varvec{Y}}})\). In general they pick up terms from all elements of \(\Sigma ({{\varvec{X}}})\), even the non-diagonal ones, since
But if the variables \(X_i\) are mutually independent, only diagonal elements of \(\Sigma ({{\varvec{X}}})\) contribute to the right-hand side of the above equation, yielding
Equations (4.28) and (4.29) are multi-dimensional equivalents of (4.24) and (4.25). Note that the non-diagonal elements of \(\Sigma ({{\varvec{Y}}})\) may be non-zero even though \(X_i\) are mutually independent! You can find an example of how to use these equations in the case of a measurement of the momentum of a particle in Problem 4.10.6.
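A minimal sketch of the transformation \(\Sigma ({{\varvec{Y}}}) = D\,\Sigma ({{\varvec{X}}})\,D^\mathrm {T}\) for the illustrative pair \(Y_1 = X_1+X_2\), \(Y_2 = X_1-X_2\) (our choice of functions and of the input covariances). It also shows the point just made: the off-diagonal element of \(\Sigma ({{\varvec{Y}}})\) equals \(\sigma _1^2-\sigma _2^2\) and is non-zero even when \(X_1\) and \(X_2\) are uncorrelated.

```python
# Covariance-matrix transformation Sigma_Y = D Sigma_X D^T, in pure Python.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

sigma1_sq, sigma2_sq, cov12 = 1.0, 4.0, 0.5
Sigma_X = [[sigma1_sq, cov12],
           [cov12, sigma2_sq]]

# Jacobian of (Y1, Y2) = (X1 + X2, X1 - X2) with respect to (X1, X2):
D = [[1.0, 1.0],
     [1.0, -1.0]]

Sigma_Y = matmul(matmul(D, Sigma_X), transpose(D))

# Diagonal:     var[Y1] = s1^2 + s2^2 + 2 cov12,  var[Y2] = s1^2 + s2^2 - 2 cov12
# Off-diagonal: cov[Y1, Y2] = s1^2 - s2^2   (non-zero even if cov12 = 0!)
```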
10 Problems
10.1 Expected Device Failure Time
A computer disk is controlled by five circuits (\(i=1,2,3,4,5\)). The time until an irreparable failure in each circuit is exponentially distributed, with individual time constants \(\lambda _i\). The disk as a whole works if circuits 1, 2 and 3, circuits 3, 4 and 5, or, obviously, all five circuits work simultaneously. What is the expected time of disk failure?
The probability that the ith element is not broken until time t (the probability that its failure time exceeds t) decreases exponentially and equals \(\mathrm {e}^{-\lambda _i t}\). For the disk's operation, three key events are responsible:
$$A = \{ \text{circuits 1, 2, 3 work} \}, \quad B = \{ \text{circuits 3, 4, 5 work} \}, \quad A \cap B = \{ \text{all five circuits work} \}.$$
The disk operates as long as \(A \cup B\) occurs. The probability that the disk still operates after time t is therefore
$$P(t) = P(A) + P(B) - P(A \cap B) = \mathrm {e}^{-(\lambda _1+\lambda _2+\lambda _3)t} + \mathrm {e}^{-(\lambda _3+\lambda _4+\lambda _5)t} - \mathrm {e}^{-(\lambda _1+\lambda _2+\lambda _3+\lambda _4+\lambda _5)t}.$$
This is not yet our answer, since the expression still contains the time! We are looking for the expected value of the failure time, where we should recall that the appropriate probability density is \(-P'(t)\) (see (3.4)), hence
$$E[T] = \int _0^\infty t\,\bigl ( -P'(t) \bigr )\,\mathrm {d}t = \int _0^\infty P(t)\,\mathrm {d}t = \frac{1}{\lambda _1+\lambda _2+\lambda _3} + \frac{1}{\lambda _3+\lambda _4+\lambda _5} - \frac{1}{\lambda _1+\lambda _2+\lambda _3+\lambda _4+\lambda _5}.$$
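The closed-form expectation can be cross-checked by simulation. The sketch below assumes hypothetical time constants \(\lambda = (1,2,3,4,5)\) and compares the formula with a direct Monte Carlo estimate: the disk fails only when both branches (circuits 1, 2, 3 and circuits 3, 4, 5) have failed.

```python
import random

lam = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical time constants lambda_i

# Closed-form result from the inclusion-exclusion argument:
exact = (1.0 / (lam[0] + lam[1] + lam[2])
         + 1.0 / (lam[2] + lam[3] + lam[4])
         - 1.0 / sum(lam))

# Monte Carlo cross-check: the failure time of branch {1,2,3} is the
# minimum of its three lifetimes; the disk survives until the later branch dies.
random.seed(1)
n = 200_000
total = 0.0
for _ in range(n):
    t = [random.expovariate(l) for l in lam]
    total += max(min(t[0], t[1], t[2]), min(t[2], t[3], t[4]))
mc = total / n
```

With these rates the exact value is \(1/6 + 1/12 - 1/15 = 11/60\), and the simulated mean agrees to within statistical fluctuations.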
10.2 Covariance of Continuous Random Variables
(Adapted from [5], Example 4.56.) Calculate the linear correlation coefficient of continuous random variables X and Y distributed according to the joint probability density
where H is the Heaviside function (see (2.8)).
The linear correlation coefficient \(\rho _{XY}\) of variables X and Y (see (4.21)) is equal to the ratio of covariance \(\sigma _{XY}\) to the product of their effective deviations \(\sigma _X\) and \(\sigma _Y\). First we need to calculate the expected value of the product XY,
then the expected values of X, Y, \(X^2\) and \(Y^2\),
It follows that
hence
and
10.3 Conditional Expected Values of Two-Dimensional Distributions
Let us return to the Example on p. 49 involving two random variables, distributed according to the joint probability density
Find the conditional expected value of the variable Y, given \(X=x\), and the conditional expected value of the variable X, given \(Y=y\)!
We have already calculated the conditional densities \(f_{X|Y}(x|y)\) and \(f_{Y|X}(y|x)\) in (2.28) and (2.29), so the conditional expected value \(E[\,Y \,|\, X = x\,]\) equals
and the conditional expected value \(E[\,X \,|\, Y = y\,]\) is
10.4 Expected Values of Hyper- and Hypo-exponential Variables
Calculate the expected value, the second moment and the variance of continuous random variables, distributed according to the hyper-exponential (see (3.26)) and hypo-exponential distribution (see (3.28)).
The hyper-exponential distribution, which describes a mixture (superposition) of k independent phases of a parallel process, whose ith phase proceeds with probability \(P_i\) and time constant \(\lambda _i = 1/\tau _i\), is defined by the probability density
$$f_X(x) = \sum _{i=1}^k P_i \lambda _i \mathrm {e}^{-\lambda _i x}, \qquad x \ge 0,$$
where \(0 \le P_i \le 1\) and \(\sum _{i=1}^k P_i = 1\). The expected value of a hyper-exponentially distributed variable X is
$$E[X] = \sum _{i=1}^k \frac{P_i}{\lambda _i}$$
and its second moment is
$$E[X^2] = \sum _{i=1}^k \frac{2 P_i}{\lambda _i^2}.$$
Its variance is therefore
$$\sigma _X^2 = E[X^2] - E[X]^2 = 2 \sum _{i=1}^k \frac{P_i}{\lambda _i^2} - \left( \sum _{i=1}^k \frac{P_i}{\lambda _i} \right) ^{\!2}.$$
While \(\sigma _X / \overline{X} = \lambda /\lambda = 1\) holds true for the usual single-exponential distribution, its hyper-exponential generalization always has \(\sigma _X / \overline{X} > 1\), except when all \(\lambda _i\) are equal: this inequality is the origin of the root “hyper” in its name.
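The inequality \(\sigma _X / \overline{X} > 1\) is easy to check numerically. The sketch below assumes hypothetical values of \(P_i\) and \(\lambda _i\) for a two-phase mixture:

```python
import math

P = [0.3, 0.7]     # hypothetical phase probabilities, summing to 1
lam = [1.0, 5.0]   # hypothetical rates lambda_i = 1/tau_i

mean = sum(p / l for p, l in zip(P, lam))              # E[X]
second = sum(2.0 * p / l**2 for p, l in zip(P, lam))   # E[X^2]
var = second - mean**2                                 # sigma_X^2
ratio = math.sqrt(var) / mean   # coefficient of variation; exceeds 1 here
```

Repeating the computation with all \(\lambda _i\) equal collapses the mixture to a single exponential and drives the ratio back to 1.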
The hypo-exponential distribution describes the distribution of the sum of k (\(k\ge 2\)) independent continuous random variables \(X_i\), in which each term separately is distributed exponentially with parameter \(\lambda _i\) (all \(\lambda _i\) distinct). The sum variable \(X = \sum _{i=1}^k X_i\) has the probability density
$$f_X(x) = \sum _{i=1}^k \alpha _i \lambda _i \mathrm {e}^{-\lambda _i x}, \qquad x \ge 0,$$
where
$$\alpha _i = \prod _{j=1,\, j \ne i}^k \frac{\lambda _j}{\lambda _j - \lambda _i}.$$
By comparing (4.33) to (4.30) one might conclude that the coefficients \(\alpha _i\) represent the probabilities \(P_i\) for the realization of the ith random variable, but we are dealing with a serial process here: all indices i come into play (see Fig. 3.13)! On the other hand, one can exploit the analytic structure of expressions (4.31) and (4.32): one simply replaces all \(P_i\) by \(\alpha _i\). By a slightly tedious calculation (or by exploiting the linearity of \(E[\cdot ]\) and using formula (4.20)) we obtain very simple expressions for the average and the variance:
$$E[X] = \sum _{i=1}^k \frac{1}{\lambda _i}, \qquad \sigma _X^2 = \sum _{i=1}^k \frac{1}{\lambda _i^2}.$$
It is easy to see—Pythagoras’s theorem comes in handy—that one always has \(\sigma _X / \overline{X} < 1\). The root “hypo” in the name of the distribution expresses precisely this property.
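The corresponding check for the hypo-exponential case, again with hypothetical rates: since the phases are independent, their means and variances simply add, and the ratio stays below 1.

```python
import math

lam = [1.0, 2.0, 4.0]   # hypothetical rates of the k = 3 serial phases

mean = sum(1.0 / l for l in lam)     # E[X]: the phase means add up
var = sum(1.0 / l**2 for l in lam)   # variances of independent terms add up
ratio = math.sqrt(var) / mean        # always < 1 for k >= 2
```

The inequality is exactly the Pythagorean observation in the text: the Euclidean norm of the vector \((1/\lambda _1,\ldots ,1/\lambda _k)\) is strictly smaller than the sum of its (positive) components.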
10.5 Gaussian Noise in an Electric Circuit
The noise in electric circuits is frequently of Gaussian nature. Assume that the noise (random variable X) is normally distributed, with average \(\overline{X} = 0\,\mathrm {V}\) and variance \(\sigma _X^2 = 10^{-8}\,\mathrm {V}^2\). Calculate the probability that the noise exceeds the value \(10^{-4}\,\mathrm {V}\) and the probability that its value lies in the interval between \(-2\cdot 10^{-4}\,\mathrm {V}\) and \(10^{-4}\,\mathrm {V}\)! What is the probability that the noise exceeds \(10^{-4}\,\mathrm {V}\), given that it is positive? Calculate the expected value of |X|.
It is worthwhile to convert the variable \(X \sim N(\overline{X},\sigma _X^2)\) to the standardized form
$$Z = \frac{X - \overline{X}}{\sigma _X} = \frac{X}{10^{-4}\,\mathrm {V}},$$
so that \(Z \sim N(0,1)\). The required probabilities are then
$$P\bigl ( X > 10^{-4}\,\mathrm {V} \bigr ) = P(Z > 1) = \int _1^\infty f_Z(z)\,\mathrm {d}z \approx 0.1587$$
and
$$P\bigl ( -2\cdot 10^{-4}\,\mathrm {V}< X < 10^{-4}\,\mathrm {V} \bigr ) = P(-2< Z < 1) = \int _{-2}^1 f_Z(z)\,\mathrm {d}z \approx 0.8186,$$
where the probability density \(f_Z\) is given by (3.10). We have read off the numerical values of the integrals from Table D.1.
The required conditional probability is
$$P\bigl ( X > 10^{-4}\,\mathrm {V} \,\big |\, X > 0 \bigr ) = \frac{P(Z > 1)}{P(Z > 0)} \approx \frac{0.1587}{0.5} \approx 0.3173.$$
Since \(Z = 10^4 X\), we also have \(E[ |Z| ] = E\bigl [ 10^4 |X| \bigr ] = 10^4 E[ |X| ]\), so we need to compute
$$E\bigl [ |Z| \bigr ] = \int _{-\infty }^\infty |z|\,f_Z(z)\,\mathrm {d}z = 2 \int _0^\infty z\,f_Z(z)\,\mathrm {d}z = \sqrt{\frac{2}{\pi }}$$
and revert to the old variable, hence \(E[ |X| ] = 10^{-4}\sqrt{2/\pi }\,\mathrm {V}\).
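The values read off Table D.1 can also be reproduced with the standard error function. The sketch below recomputes all four results of this problem, with \(\Phi \) denoting the standard normal distribution function:

```python
import math

sigma = 1e-4   # V, since sigma_X^2 = 1e-8 V^2 and the mean is 0 V

def Phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_exceed = 1.0 - Phi(1.0)                   # P(X > 1e-4 V) = P(Z > 1)
p_interval = Phi(1.0) - Phi(-2.0)           # P(-2e-4 V < X < 1e-4 V)
p_cond = p_exceed / (1.0 - Phi(0.0))        # P(X > 1e-4 V | X > 0)
e_abs_X = sigma * math.sqrt(2.0 / math.pi)  # E[|X|] in volts
```

The four numbers agree with the tabulated values 0.1587, 0.8186, 0.3173 and \(10^{-4}\sqrt{2/\pi }\,\mathrm {V}\) to the quoted precision.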
10.6 Error Propagation in a Measurement of the Momentum Vector \(\star \)
We are measuring the time t in which a non-relativistic particle of mass m and momentum p traverses a distance L (that is, \(t=L/v=mL/p\)), and the spherical angles \(\theta \) and \(\phi \) of the vector \(\mathbf {p}\) relative to the z-axis. Suppose that we have measured the average values \(1/p = 5\,(\mathrm {GeV}/c)^{-1}\), \(\theta = 75^\circ \) and \(\phi = 110^\circ \), but all measurements contain one-percent uncertainties \(\Delta (1/p) \equiv \sigma _p = 0.05\,(\mathrm {GeV}/c)^{-1}\), \(\Delta \theta \equiv \sigma _\theta = 0.75^\circ \) and \(\Delta \phi \equiv \sigma _\phi = 1.1^\circ \), which are uncorrelated. Determine the uncertainties of the quantities
$$p_x = p \sin \theta \cos \phi , \qquad p_y = p \sin \theta \sin \phi , \qquad p_z = p \cos \theta .$$
In the notation of Sect. 4.9 we are dealing with the variables
$$X_1 = 1/p, \qquad X_2 = \theta , \qquad X_3 = \phi ,$$
with the averages \(\mu _1 = 5\,(\mathrm {GeV}/c)^{-1}\), \(\mu _2 = 75^\circ \) and \(\mu _3 = 110^\circ \). The corresponding covariance matrix (omitting the units for clarity and expressing the angles in radians) is
$$\Sigma ({{\varvec{X}}}) = \mathrm {diag}\bigl ( \sigma _1^2,\, \sigma _2^2,\, \sigma _3^2 \bigr ) = \mathrm {diag}\bigl ( 0.05^2,\, 0.0131^2,\, 0.0192^2 \bigr ).$$
We need to calculate the covariance matrix of the variables
$$Y_1 = p_x, \qquad Y_2 = p_y, \qquad Y_3 = p_z,$$
and we need the derivatives (4.26) to do that:
$$\frac{\partial p_x}{\partial X_1} = -p^2 \sin \theta \cos \phi , \qquad \frac{\partial p_x}{\partial X_2} = p \cos \theta \cos \phi , \qquad \frac{\partial p_x}{\partial X_3} = -p \sin \theta \sin \phi ,$$
$$\frac{\partial p_y}{\partial X_1} = -p^2 \sin \theta \sin \phi , \qquad \frac{\partial p_y}{\partial X_2} = p \cos \theta \sin \phi , \qquad \frac{\partial p_y}{\partial X_3} = p \sin \theta \cos \phi ,$$
$$\frac{\partial p_z}{\partial X_1} = -p^2 \cos \theta , \qquad \frac{\partial p_z}{\partial X_2} = -p \sin \theta , \qquad \frac{\partial p_z}{\partial X_3} = 0.$$
When these expressions are arranged in the \(3\times 3\) matrix D, (4.27) immediately yields
The uncertainties of \(p_x\), \(p_y\) and \(p_z\) then become
The propagation of the one-percent errors on the variables \(1/p\), \(\theta \) and \(\phi \) has therefore resulted in errors larger than one percent on the variables \(p_x\), \(p_y\) and \(p_z\):
The error of \(p_x\) and of \(p_z = p \cos \theta \) has increased dramatically. A feeling for why this happens to \(p_z\) can be acquired by simple differentiation, \(\mathrm {d}p_z = \mathrm {d}p \cos \theta - p \sin \theta \,\mathrm {d}\theta \), or, in relative terms,
$$\frac{\mathrm {d}p_z}{p_z} = \frac{\mathrm {d}p}{p} - \tan \theta \,\mathrm {d}\theta .$$
The average value of \(\theta \) is not very far from \(90^\circ \), where \(\sin \theta \approx 1\) and \(\cos \theta \approx 0\). Any error \(\Delta \theta \) in this neighborhood, no matter how small, is amplified by the large factor \(\tan \theta \), which even diverges as \(\theta \rightarrow \pi /2\).
In addition, the covariances \(\sigma _{p_x p_y} = \sigma _{p_y p_x}\), \(\sigma _{p_x p_z} = \sigma _{p_z p_x}\) and \(\sigma _{p_y p_z} = \sigma _{p_z p_y}\) are all non-zero, and the corresponding correlation coefficients are
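As a numerical cross-check of this problem, the whole propagation can be redone with a finite-difference Jacobian in place of the analytic derivatives (4.26); only the measured averages and uncertainties from the problem statement are assumed.

```python
import numpy as np

# Measured inputs X = (1/p, theta, phi), with the angles in radians.
mu = np.array([5.0, np.radians(75.0), np.radians(110.0)])
sig = np.array([0.05, np.radians(0.75), np.radians(1.1)])
Sx = np.diag(sig**2)   # uncorrelated inputs: diagonal Sigma(X)

def Y(x):
    """Map (1/p, theta, phi) to the momentum components (p_x, p_y, p_z)."""
    inv_p, th, ph = x
    p = 1.0 / inv_p
    return np.array([p * np.sin(th) * np.cos(ph),
                     p * np.sin(th) * np.sin(ph),
                     p * np.cos(th)])

def jacobian(f, x, h=1e-6):
    """Jacobian by central differences, in place of the analytic derivatives."""
    J = np.empty((f(x).size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        J[:, i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return J

D = jacobian(Y, mu)
Sy = D @ Sx @ D.T                           # Sigma(Y) ~ D Sigma(X) D^T
rel = np.sqrt(np.diag(Sy)) / np.abs(Y(mu))  # relative uncertainties
```

The relative errors on \(p_x\) and \(p_z\) indeed come out several times larger than one percent, while the error on \(p_y\) grows only mildly, in line with the \(\tan \theta \) and \(\tan \phi \) amplification discussed above.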
Notes
- 1.
A function is defined to be convex if the line segment between any two points on the graph of the function lies on or above the graph.
- 2.
A random vector \({{\varvec{X}}}=(X_1,X_2,\ldots ,X_m)\) with a distribution function \(F_{{\varvec{X}}}(x_1,x_2,\ldots ,x_m)\) and a random vector \({{\varvec{Y}}}=(Y_1,Y_2,\ldots ,Y_n)\) with a distribution function \(F_{{\varvec{Y}}}(y_1,y_2,\ldots ,y_n)\) are mutually independent if \(F_{{{\varvec{X}}},{{\varvec{Y}}}}(x_1,x_2,\ldots ,x_m,y_1,y_2,\ldots ,y_n) = F_{{\varvec{X}}}(x_1,x_2,\ldots ,x_m)F_{{\varvec{Y}}}(y_1,y_2,\ldots ,y_n)\). This is an obvious generalization of (2.20) and (2.24).
References
E. Brynjolfsson, Y.J. Hu, D. Simester, Goodbye Pareto principle, hello long tail: the effect of search costs on the concentration of product sales. Manage. Sci. 57, 1373 (2011)
E. Brynjolfsson, Y.J. Hu, M.D. Smith, The longer tail: the changing shape of Amazon’s sales distribution curve. http://dx.doi.org/10.2139/ssrn.1679991. 20 Sep 2010
C. Anderson, The Long Tail: Why the Future of Business is Selling Less of More (Hyperion, New York, 2006)
F. James, Statistical Methods in Experimental Physics, 2nd edn. (World Scientific, Singapore, 2010)
Y. Viniotis, Probability and Random Processes for Electrical Engineers (WCB McGraw-Hill, Singapore, 1998)
© 2016 Springer International Publishing Switzerland

Širca, S. (2016). Expected Values. In: Probability for Physicists. Graduate Texts in Physics. Springer, Cham. https://doi.org/10.1007/978-3-319-31611-6_4

Print ISBN: 978-3-319-31609-3. Online ISBN: 978-3-319-31611-6