1 Summary of Basic Notions of Probability Theory

In this chapter we summarize the most important notions and facts of probability theory that are necessary for an elaboration of our topic. In the present summary, we will apply the more specific mathematical concepts and facts – mainly measure theory and analysis – only to the necessary extent while, however, maintaining mathematical precision.

Random Event We consider experiments whose outcomes are uncertain, meaning that the totality of the circumstances that are or can be taken into account does not determine the outcome of the experiment. The set of all possible outcomes is called a sample space. We define random events (events for short) as certain sets of outcomes (subsets of the sample space). It is assumed that the set of events is closed under countable set operations, and we assign probabilities to events only; the probabilities characterize quantitatively the degree of uncertainty. Henceforth countable means finite or countably infinite.

Denote the sample space by \(\Omega =\{ \omega \}\). If \(\Omega \) is countable, then the space \(\Omega \) is called discrete. In a mathematical approach, events can be defined as subsets \(A \subset \Omega \) of the possible outcomes \(\Omega \) having the properties (σ-algebra properties) defined subsequently.

A given event A occurs in the course of an experiment if the outcome of the experiment belongs to the given event, that is, if the outcome ω satisfies ω ∈ A. An event is called simple if it contains only one outcome ω. It is always assumed that the whole set \(\Omega \) and the empty set \(\varnothing \) are events; they are called the certain event and the impossible event, respectively.

Operation with Events; Notion of σ-Algebra Let A and B be two events. The union \(A \cup B\) of A and B is defined as an event consisting of all elements \(\omega \in \Omega \) belonging to either event A or B, i.e., \(A \cup B = \left \{\omega :\ \omega \in A\ \text{ or}\ \omega \in B\right \}\).

The intersection (product) \(A \cap B\ (AB)\) of events A and B is defined as an event consisting of all elements \(\omega \in \Omega \) belonging to both A and B, i.e.,

$$A \cap B = \left \{\omega :\ \omega \in A\text{ and }\omega \in B\right \}.$$

The difference \(A\setminus B\), which is not a symmetric operation, is defined as the set of all elements \(\omega \in \Omega \) belonging to event A but not to event B, i.e.,

$$A\setminus B = \left \{\omega :\ \omega \in A\text{ and }\omega \notin B\right \}.$$

A complementary event \(\overline{A}\) of A is defined as a set of all elements \(\omega \in \Omega \) that does not belong to A, i.e.,

$$\overline{A} = \Omega \setminus A.$$

If \(A \cap B = \varnothing \), then the sets A and B are said to be disjoint or mutually exclusive.

Note that the operations \(\cup \) and \(\cap \) satisfy the associative, commutative, and distributive properties

$$(A \cup B) \cup C = A \cup (B \cup C),\ \ \ \ \text{ and}\ \ \ \ (A \cap B) \cap C = A \cap (B \cap C),$$
$$A \cup B = B \cup A,\ \ \ \ \text{ and}\ \ \ \ A \cap B = B \cap A,$$
$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C),\ \ \ \text{ and}\ \ \ A \cup (B \cap C) = (A \cup B) \cap (A \cup C).$$

The De Morgan identities also hold for the operations union, intersection, and complementation of events:

$$\overline{A \cup B} = \overline{A} \cap \overline{B},\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \overline{A \cap B} = \overline{A} \cup \overline{B}.$$
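These identities are easy to check mechanically on a finite sample space. The following Python sketch (not part of the original text; the die-roll space and the events are chosen only for illustration) verifies them with built-in set operations:

```python
Omega = set(range(1, 7))          # outcomes of a single die roll
A, B, C = {2, 4, 6}, {4, 5, 6}, {3, 4, 5}
complement = lambda E: Omega - E  # complementary event, Omega \ E

assert complement(A | B) == complement(A) & complement(B)   # De Morgan
assert complement(A & B) == complement(A) | complement(B)   # De Morgan
assert A | (B & C) == (A | B) & (A | C)                     # distributivity
print("De Morgan and distributive identities verified on", sorted(Omega))
```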

Using the preceding definitions, we can now define the notion of a σ-algebra of events.

Definition 1.1.

Let \(\Omega \) be a nonempty (abstract) set, and let \(\mathcal{A}\) be a certain family of subsets of the set \(\Omega \) satisfying the following conditions:

  1. (1)

    \(\Omega \in \mathcal{A}\).

  2. (2)

    If \(A \in \mathcal{A}\), then \(\overline{A} \in \mathcal{A}\).

  3. (3)

    If \({A}_{1},{A}_{2},\ldots \in \mathcal{A}\) is a countable sequence of elements, then

$$ \bigcup \limits_{i=1}^{\infty }{A}_{ i} \in \mathcal{A}.$$

The family \(\mathcal{A}\) of subsets of the set \(\Omega \) satisfying conditions (1)–(3) is called a σ-algebra. The elements of \(\mathcal{A}\) are called random events, or simply events.

Comment 1.2.

The pair \((\Omega ,\mathcal{A})\) is usually called a measurable space, which forms the general mathematical basis of the notion of probability.

Probability Space, Kolmogorov Axioms of Probability Theory Let \(\Omega \) be a nonempty sample set, and let \(\mathcal{A}\) be a given σ-algebra of subsets of \(\Omega \), i.e., the pair \((\Omega ,\mathcal{A})\) is a measurable space. A nonnegative number \(\mathbf{P}\left (A\right )\) is assigned to every event A of the σ-algebra \(\mathcal{A}\), satisfying the following axioms.

A1.:

\(0 \leq \mathbf{P}\left (A\right ) \leq 1\), \(A \in \mathcal{A}\).

A2.:

\(\mathbf{P}\left (\Omega \right ) = 1\).

A3.:

If the events \({A}_{i} \in \mathcal{A}\), \(i = 1,2,\ldots \), are disjoint (i.e., \({A}_{i}{A}_{j} = \varnothing ,\ i\neq j\)), then

$$\mathbf{P}\left ( \bigcup \limits_{i=1}^{\infty }{A}_{ i}\right ) = \sum \limits_{i=1}^{\infty }\mathbf{P}\left ({A}_{ i}\right ).$$

The number \(\mathbf{P}\left (A\right )\) is called the probability of event A; axioms A1, A2, and A3 are called the Kolmogorov axioms; and the triplet \((\Omega ,\mathcal{A},\mathbf{P})\) is called a probability space. As usual, axiom A3 is called the σ-additivity property of probability. The probability space completely characterizes a random experiment.

Comment 1.3.

In the measure-theoretic context of probability theory, the function \(\mathbf{P}\) defined on \(\mathcal{A}\) is called a probability measure. Conditions A1–A3 ensure that \(\mathbf{P}\) is a nonnegative, σ-additive, and normed [ \(\mathbf{P}\left (\Omega \right ) = 1\) ] set function on \(\mathcal{A}\) , i.e., a normed measure on \(\mathcal{A}\) . Our discussion basically does not require the direct use of measure theory, but some assertions cited in this work essentially depend on it.

Main Properties of Probability Let \((\Omega ,\mathcal{A},\mathbf{P})\) be a probability space. The following properties of probability are valid for all probability spaces.

Elementary properties:

  1. (a)

    The probability of an impossible event is zero, i.e.,

    \(\mathbf{P}\left (\varnothing \right ) = 0\).

  2. (b)

    \(\ \ \mathbf{P}\left (\overline{A}\right ) = 1 -\mathbf{P}\left (A\right )\) for all \(A \in \mathcal{A}\).

  3. (c)

    If the relationship \(A \subseteq B\) is satisfied for given events \(A,B \in \mathcal{A}\), then

    \(\mathbf{P}\left (A\right ) \leq \mathbf{P}\left (B\right )\),

    \(\mathbf{P}\left (B\setminus A\right ) = \mathbf{P}\left (B\right ) -\mathbf{P}\left (A\right )\).

Definition 1.4.

A collection \(\{{A}_{i},\ i \in I\}\) of countably many events is called a complete system of events if the events A i ,  i ∈ I, are disjoint (i.e., \({A}_{i} \cap {A}_{j} = \varnothing \) if \(i\neq j\), i, j ∈ I) and \( \bigcup \limits_{i\in I}{A}_{i} = \Omega \).

Comment 1.5.

If the collection of events {A i , i ∈ I} forms a complete system of events, then

$$\mathbf{P}\left ( \bigcup \limits_{i\in I}{A}_{i}\right ) = 1.$$

Probability of Sum of Events, Poincaré Formula For any events A and B it is true that

$$\mathbf{P}\left (A \cup B\right ) = \mathbf{P}\left (A\right ) + \mathbf{P}\left (B\right ) -\mathbf{P}\left (AB\right ).$$

Using this relation, a more general formula, called the Poincaré formula, can be proved. Let n be a positive integer; then, for any events \({A}_{1},{A}_{2},\ldots ,{A}_{n} \in \mathcal{A}\),

$$\mathbf{P}\left ({A}_{1} + \ldots + {A}_{n}\right ) = \sum \limits_{k=1}^{n}{(-1)}^{k-1}{S}_{ k}^{(n)},$$

where \({S}_{k}^{(n)} = \sum \limits_{1\leq {i}_{1}<\ldots <{i}_{k}\leq n}\mathbf{P}\left ({A}_{{i}_{1}}\ldots {A}_{{i}_{k}}\right )\).
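The formula is easy to verify numerically. The following Python sketch (events on a small uniform space, invented for illustration) compares \(\mathbf{P}\left ({A}_{1} \cup {A}_{2} \cup {A}_{3}\right )\) with the inclusion–exclusion sum:

```python
from itertools import combinations

Omega = set(range(20))                       # uniform space of 20 outcomes
events = [set(range(0, 10)), set(range(5, 15)), {0, 5, 10, 15}]
P = lambda E: len(E) / len(Omega)            # uniform probability measure

lhs = P(set().union(*events))                # P(A1 u A2 u A3)
rhs = sum((-1) ** (k - 1)
          * sum(P(set.intersection(*c)) for c in combinations(events, k))
          for k in range(1, len(events) + 1))
assert abs(lhs - rhs) < 1e-12
print(f"P(union) = {lhs}, inclusion-exclusion sum = {rhs}")
```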

Subadditive Property of Probability For any countable set of events \(\left \{{A}_{i},\ i \in I\right \}\) the inequality

$$\mathbf{P}\left ( \bigcup \limits_{i\in I}{A}_{i}\right ) \leq \sum \limits_{i\in I}\mathbf{P}\left ({A}_{i}\right )$$

is true.

Continuity Properties of Probability Continuity properties of probability hold for monotone sequences of events; each of these properties is equivalent to axiom A3. A sequence of events \({A}_{1},{A}_{2},\ldots \) is called monotonically increasing (resp. decreasing) if \({A}_{1} \subset {A}_{2} \subset \ldots \) (resp. \({A}_{1} \supset {A}_{2} \supset \ldots \)).

Theorem 1.6.

If the sequence of events \({A}_{1},{A}_{2},\ldots \) is monotonically decreasing, then

$$\mathbf{P}\left ( \bigcap \limits_{i=1}^{\infty }{A}_{ i}\right ) =\mathop{\lim }\limits_{ n \rightarrow \infty }\mathbf{P}\left ({A}_{n}\right ).$$

If the sequence of events \({A}_{1},{A}_{2},\ldots \) is monotonically increasing, then

$$\mathbf{P}\left ( \bigcup \limits_{i=1}^{\infty }{A}_{ i}\right ) =\mathop{\lim }\limits_{ n \rightarrow \infty }\mathbf{P}\left ({A}_{n}\right ).$$

Conditional Probability and Its Properties, Independence of Events In practice, the following natural question arises: if we know that event B occurs (i.e., the outcome is in \(B \in \mathcal{A}\)), what is the probability that the outcome is in \(A \in \mathcal{A}\)? In other words, how does the occurrence of an event B influence the occurrence of another event A? This effect is characterized by the notion of conditional probability \(\mathbf{P}\left (A\vert B\right )\), defined as follows.

Definition 1.7.

Let A and B be two events, and assume that \(\mathbf{P}\left (B\right ) > 0\). The quantity

$$\mathbf{P}\left (A\vert B\right ) = \mathbf{P}\left (AB\right )/\mathbf{P}\left (B\right )$$

is called the conditional probability of A  given B.

It is easy to verify that the conditional probability possesses the following properties:

  1. 1.

    \(0 \leq \mathbf{P}\left (A\vert B\right ) \leq 1\).

  2. 2.

    \(\mathbf{P}\left (B\vert B\right ) = 1\).

  3. 3.

    If the events \({A}_{1},{A}_{2},\ldots \) are disjoint, then

    $$\mathbf{P}\left ( \sum \limits_{i=1}^{\infty }{A}_{ i}\vert B\right ) = \sum \limits_{i=1}^{\infty }\mathbf{P}\left ({A}_{ i}\vert B\right ).$$
  4. 4.

    The definition of conditional probability \(\mathbf{P}\left (A\vert B\right ) = \mathbf{P}\left (AB\right )/\mathbf{P}\left (B\right )\) is equivalent to the so-called theorem of multiplication

    $$\mathbf{P}\left (AB\right ) = \mathbf{P}\left (A\vert B\right )\mathbf{P}\left (B\right )\text{ and}\ \mathbf{P}\left (AB\right ) = \mathbf{P}\left (B\vert A\right )\mathbf{P}\left (A\right ).$$

Note that these equations are valid in the cases \(\mathbf{P}\left (B\right ) = 0\) and \(\mathbf{P}\left (A\right ) = 0\) as well.

One of the most important concepts of probability theory, the independence of events, is defined as follows.

Definition 1.8.

We say that events A and B are independent if the equation

$$\mathbf{P}\left (AB\right ) = \mathbf{P}\left (A\right )\mathbf{P}\left (B\right )$$

is satisfied.

Comment 1.9.

If A and B are independent events and \(\mathbf{P}\left (B\right ) > 0\) , then the conditional probability \(\mathbf{P}\left (A\vert B\right )\) does not depend on event B since

$$\mathbf{P}\left (A\vert B\right ) = \frac{\mathbf{P}\left (AB\right )} {\mathbf{P}\left (B\right )} = \frac{\mathbf{P}\left (A\right )\mathbf{P}\left (B\right )} {\mathbf{P}\left (B\right )} = \mathbf{P}\left (A\right ).$$

This relation means that knowing that an event B occurs does not change the probability of another event A.

The notion of independence of an arbitrary collection A i ,  i ∈ I of events is defined as follows.

Definition 1.10.

A given collection of events A i ,  i ∈ I is said to be mutually independent (independent for short) if, having chosen from among them any finite number of events, the probability of the product of the chosen events equals the product of the probabilities of the chosen events. In other words, if \(\{{i}_{1},\ldots ,{i}_{k}\}\) is any finite set of distinct indices from I, then one has

$$\mathbf{P}\left ({A}_{{i}_{1}} \cap \text{ \ldots } \cap {A}_{{i}_{k}}\right ) = \mathbf{P}\left ({A}_{{i}_{1}}\right )\ldots \mathbf{P}\left ({A}_{{i}_{k}}\right ).$$

This notion of independence is stricter than pairwise independence: it is easy to construct an example in which the events are pairwise independent but not mutually independent.

Example 1.11.

We roll two dice and denote the pair of results by

$$({\omega }_{1},{\omega }_{2}) \in \Omega =\{ (i,j),1 \leq i,j \leq 6\}.$$

The number of elements of the set \(\Omega \) is \(\left \vert \Omega \right \vert = 36\), and we assume that the dice are standard, that is, \(P\left \{({\omega }_{1},{\omega }_{2})\right \} = 1/36\) for every \(({\omega }_{1},{\omega }_{2}) \in \Omega \). Events A 1, A 2, and A 3 are defined as follows:

$$\qquad \qquad \begin{array}{l} {A}_{1} =\{ \text{ the result of the first die is even}\}, \\ {A}_{2} =\{ \text{ the result of the second die is odd}\}, \\ {A}_{3} =\{ \text{ both the first and second dice are odd or both of them are even}\}.\end{array}$$

We check that events A 1, A 2, and A 3 are pairwise independent, but they are not (mutually) independent. It is clear that

$$\begin{array}{rcl}{ A}_{1}& =& \{(2,1),\ldots ,(2,6),(4,1),\ldots ,(4,6),(6,1),\ldots ,(6,6)\}, \\ {A}_{2}& =& \{(1,1),\ldots ,(6,1),(1,3),\ldots ,(6,3),(1,5),\ldots ,(6,5)\}, \\ {A}_{3}& =& \{(1,1),(1,3),(1,5),(2,2),(2,4),(2,6),(3,1),(3,3), \\ & & (3,5),\ldots ,(6,2),(6,4),(6,6)\}, \\ \end{array}$$

thus

$$\vert {A}_{1}\vert \ = 3 \cdot 6 = 18,\ \ \vert {A}_{2}\vert \ = 6 \cdot 3 = 18,\ \ \vert {A}_{3}\vert \ = 6 \cdot 3 = 18.$$

We have, then, \(\mathbf{P}\left ({A}_{i}\right ) = \frac{1} {2},\ i = 1,2,3\), and the relations

$$\mathbf{P}\left ({A}_{i}{A}_{j}\right ) = \frac{1} {4} = \mathbf{P}\left ({A}_{i}\right )\mathbf{P}\left ({A}_{j}\right ),\ 1 \leq i,j \leq 3,\ i\neq j,$$

which means events A 1, A 2, and A 3 are pairwise independent. On the other hand,

$$\mathbf{P}\left ({A}_{1}{A}_{2}{A}_{3}\right ) = 0\neq \frac{1} {8} = \mathbf{P}\left ({A}_{1}\right )\mathbf{P}\left ({A}_{2}\right )\mathbf{P}\left ({A}_{3}\right );$$

consequently, the mutual independence of events A 1, A 2, and A 3 does not follow from their pairwise independence.
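The computations of this example are easy to check by brute-force enumeration. The following Python sketch (illustration only, not from the original text) reproduces them:

```python
from itertools import product

Omega = list(product(range(1, 7), repeat=2))      # all 36 outcomes
P = lambda E: len(E) / len(Omega)

A1 = [w for w in Omega if w[0] % 2 == 0]          # first die even
A2 = [w for w in Omega if w[1] % 2 == 1]          # second die odd
A3 = [w for w in Omega if w[0] % 2 == w[1] % 2]   # same parity

for X, Y in [(A1, A2), (A1, A3), (A2, A3)]:       # pairwise independence
    assert P([w for w in X if w in Y]) == P(X) * P(Y)

triple = [w for w in A1 if w in A2 and w in A3]   # A1 A2 A3 is empty
print("P(A1 A2 A3) =", P(triple), "but P(A1)P(A2)P(A3) =", P(A1) * P(A2) * P(A3))
```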

Formula of Total Probability, Bayes’ Rule Using the theorem of multiplication for conditional probability, we can easily derive the following two theorems. Although the two theorems are not complicated, they are quite effective tools in various considerations.

Theorem 1.12 ( Formula of total probability). 

Let the sequence {A i , i ∈ I} be a complete system of events with \(\mathbf{P}\left ({A}_{i}\right ) > 0,\ i \in I\) ; then for all events B

$$\mathbf{P}\left (B\right ) = \sum \limits_{i\in I}\mathbf{P}\left (B\vert {A}_{i}\right )\mathbf{P}\left ({A}_{i}\right )$$

is true.

Theorem 1.13 ( Bayes’ rule). 

Under the conditions of the preceding theorem, the following relation holds for all indices n ∈ I:

$$\mathbf{P}\left ({A}_{n}\vert B\right ) = \frac{\mathbf{P}\left (B\vert {A}_{n}\right )\mathbf{P}\left ({A}_{n}\right )} { \sum \limits_{i\in I}\mathbf{P}\left (B\vert {A}_{i}\right )\mathbf{P}\left ({A}_{i}\right )}.$$
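A small numeric illustration of the two theorems may be useful (a Python sketch; the priors and conditional probabilities below are invented):

```python
priors = [0.5, 0.3, 0.2]           # P(A_i) for a complete system of events
likelihoods = [0.01, 0.02, 0.05]   # P(B | A_i)

P_B = sum(l * p for l, p in zip(likelihoods, priors))            # total probability
posteriors = [l * p / P_B for l, p in zip(likelihoods, priors)]  # Bayes' rule

print(f"P(B) = {P_B:.4f}")                        # 0.0210
print("P(A_n | B):", [round(q, 4) for q in posteriors])
assert abs(sum(posteriors) - 1.0) < 1e-12         # posteriors form a distribution
```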

Concept of Random Variables Let \((\Omega ,\mathcal{A},\mathbf{P})\) be a fixed probability space. A random experiment usually results in some kind of value: the occurrence of a simple event ω yields a value X(ω). Different values might belong to different simple events; however, the function X(ω) must possess a specific property. We must answer such basic questions as, for example, what is the probability that the result of the experiment is smaller than a given value x? Since probabilities have been defined only for events (elements of the set \(\mathcal{A}\)), we may speak of the probability of the set \(\{\omega : X(\omega ) \leq x\}\) only if this set is an event, i.e., if it belongs to the σ-algebra \(\mathcal{A}\):

$$\{\omega : X(\omega ) \leq x\} \in \mathcal{A}.$$

This fact led to one of the most important notions of probability theory.

Definition 1.14.

The real-valued function \(X : \Omega \rightarrow \mathbb{R}\) is called a random variable if the relationship

$$\{\omega : X(\omega ) \leq x\} \in \mathcal{A}$$

is valid for all real numbers \(x \in \mathbb{R}\). A function satisfying this condition is called \(\mathcal{A}\)-measurable.

A property of random variables should be mentioned here. Denote by \(\mathcal{B} = {\mathcal{B}}_{1}\) the minimal σ-algebra containing all intervals of \(\mathbb{R}\); the elements of \(\mathcal{B}\) are called the Borel sets of \(\mathbb{R}\). If X is \(\mathcal{A}\)-measurable, then for every Borel set D of \(\mathbb{R}\) the set {ω : X(ω) ∈ D} is also an element of \(\mathcal{A}\), i.e., {ω : X(ω) ∈ D} is an event. Thus the probability \({\mathbf{P}}_{X}\left [D\right ] = \mathbf{P}\left (\{\omega : X(\omega ) \in D\}\right )\), and in particular \(\mathbf{P}\left (\left \{\omega : X(\omega ) \leq x\right \}\right )\), is well defined. An important special case of random variables are the so-called indicator variables, defined as follows. Let \(A \in \mathcal{A}\) be an event, and introduce the random variable \({\mathcal{I}}_{\left \{A\right \}}\):

$${\mathcal{I}}_{\left \{A\right \}} = {\mathcal{I}}_{\left \{A\right \}}(\omega ) = \left \{\begin{array}{c} 1,\text{ if }\omega \in A,\\ 0, \text{ if } \omega \notin A. \end{array} \right.$$

Distribution Function Let X = X(ω) be a random variable; then the probability \(\mathbf{P}\left (X \leq x\right )\), \(x \in \mathbb{R}\), is well defined.

Definition 1.15.

The function \({F}_{X}(x) = \mathbf{P}\left (X \leq x\right )\), defined for all real numbers \(x \in \mathbb{R}\), is called the cumulative distribution function (CDF) of random variable X.

Note that the CDF F X and the induced measure \({\mathbf{P}}_{X}\) determine each other mutually and unambiguously. It is also clear that if the real line \(\mathbb{R}\) is chosen as a new sample space and the σ-algebra \(\mathcal{B}\) of Borel sets is taken as the σ-algebra of events, then the triplet \((\mathbb{R},\mathcal{B},{\mathbf{P}}_{X})\) determines a new probability space, where \({\mathbf{P}}_{X}\) is referred to as the probability measure induced by the random variable X.

The CDF F X has the following properties.

  1. (1)

    At every point \(-\infty < {x}_{0} < \infty \) of the real line, the function F X (x) is continuous from the right, that is,

    $$\mathop{\lim }\limits_{x \rightarrow {x}_{0} + 0}{F}_{X}(x) = {F}_{X}({x}_{0}).$$
  2. (2)

    The function \({F}_{X}(x),\ -\infty < x < \infty \) is a monotonically increasing function of the variable x, that is, for all \(-\infty < x < y < \infty \) the inequality \({F}_{X}(x) \leq {F}_{X}(y)\) holds.

  3. (3)

    The limiting values of the function F X (x) exist under the conditions \(x \rightarrow -\infty \) and \(x \rightarrow \infty \) as follows:

    $$\mathop{\lim }\limits_{x \rightarrow -\infty }{F}_{X}(x) = 0\ \ \ \text{ and }\ \ \mathop{\lim }\limits_{x \rightarrow \infty }{F}_{X}(x) = 1.$$
  4. (4)

    The set of discontinuity points of the function F X (x), that is, the set of points \(x \in \mathbb{R}\) for which \({F}_{X}(x)\neq {F}_{X}(x - 0)\), is countable.

Comment 1.16.

It should be noted in connection with the definition of the CDF that the literature is not consistent. The use of \({F}_{X}(x) = \mathbf{P}\left (X < x\right ),\ -\infty < x < \infty \) as a CDF is also widely applied. The only difference between the two definitions lies within property (1) (see preceding discussion), which means that in the latter case the CDF is continuous from the left and not from the right, but all the other properties remain the same. It is also clear that if the CDF is continuous in all \(x \in \mathbb{R}\) , then there is no difference between the two definitions.

Comment 1.17.

From a practical point of view, it is sometimes useful to allow the CDF F X of random variable X not to satisfy property (3) (see preceding discussion), which means that, instead, one or both of the relations \(\mathop{\lim }\limits_{x \rightarrow -\infty }{F}_{X}(x) > 0\) and \(\mathop{\lim }\limits_{x \rightarrow \infty }{F}_{X}(x) < 1\) hold. In this case \(\mathbf{P}\left (\left \vert X\right \vert < \infty \right ) < 1\) , and random variable X is said to have a defective distribution function.

Let a and b be two arbitrary real numbers for which \(-\infty < a < b < \infty \); then we can determine the probability of some frequently occurring events with the use of the CDF of X as follows:

$$\begin{array}{rcl} \mathbf{P}\left (X = a\right )& =& {F}_{X}(a) - {F}_{X}(a - 0), \\ \mathbf{P}\left (a < X < b\right )& =& {F}_{X}(b - 0) - {F}_{X}(a), \\ \mathbf{P}\left (a \leq X < b\right )& =& {F}_{X}(b - 0) - {F}_{X}(a - 0), \\ \mathbf{P}\left (a < X \leq b\right )& =& {F}_{X}(b) - {F}_{X}(a), \\ \mathbf{P}\left (a \leq X \leq b\right )& =& {F}_{X}(b) - {F}_{X}(a - 0).\end{array}$$

These equations also determine the connection between the CDF F X and the distribution \({\mathbf{P}}_{X}\) for special Borel sets of the real line.
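As a quick computational illustration (a Python sketch, not part of the original text; the mixed CDF below is invented), the interval probabilities above can be evaluated directly from a CDF, with \(F_X(x - 0)\) approximated by a left limit:

```python
import math

def F(x):                             # invented mixed CDF: atom of 0.3 at x = 0
    return 0.0 if x < 0 else 0.3 + 0.7 * (1.0 - math.exp(-x))

F_left = lambda x, h=1e-9: F(x - h)   # numerical stand-in for F(x - 0)

a, b = 0.0, 1.0
print("P(X = a)       =", F(a) - F_left(a))   # the jump at a, here 0.3
print("P(a < X <= b)  =", F(b) - F(a))
print("P(a <= X <= b) =", F(b) - F_left(a))
```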

Discrete and Continuous Distribution, Density Function In practice we distinguish two important types of distributions, the so-called discrete and continuous distributions. There is also a third type, the so-called singular distribution, whose CDF is continuous everywhere and whose derivative (with respect to the Lebesgue measure) equals 0 almost everywhere; we will not consider this type. This classification follows from the Jordan decomposition theorem for monotone functions: an arbitrary CDF F can always be decomposed into the sum of three components – a monotonically increasing absolutely continuous function, a step function with a finite or countably infinite set of jumps (this part corresponds to a discrete distribution), and a singular function.

Definition 1.18.

Random variable X is discrete or has a discrete distribution if there is a finite or countably infinite set of values \(\left \{{x}_{k},\ k \in I\right \}\) such that \( \sum \limits_{k\in I}{p}_{k} = 1\), where \({p}_{k} = \mathbf{P}\left (X = {x}_{k}\right ),\ k \in I\). The associated function

$${f}_{X}(x) = \left \{\begin{array}{c} {p}_{k},\text{ if }x = {x}_{k},\ k \in I, \\ 0,\text{ if }x\neq {x}_{k},\ k \in I, \end{array} \right.\ \ \ x \in \mathbb{R},$$

is termed a probability density function (PDF) or probability mass function (PMF).

It is easy to see that if random variable X is discrete with possible values \(\left \{{x}_{k},k = 0,1,\ldots \right \}\) and with distribution \(\left \{{p}_{k},k = 0,1,\ldots \right \}\), then the relationship between the CDF F X and the PMF can be given as

$${F}_{X}(x) = \sum \limits_{{x}_{k}\leq x}{p}_{k},\ -\infty < x < \infty.$$

Definition 1.19.

A random variable X is continuous or has a continuous distribution if there exists a nonnegative integrable function \({f}_{X}(x),\ -\infty < x < \infty \) such that for all real numbers a and b , \(-\infty < a < b < \infty \),

$${F}_{X}(b) - {F}_{X}(a) =\int\limits_{a}^{b}{f}_{ X}(x)\mathrm{d}x$$

holds. The function f X (x) is called the PDF of random variable X, or simply the density function of X.

Comment 1.20.

It is clear that

$${F}_{X}(x) =\int\limits_{-\infty }^{x}{f}_{ X}(u)\mathrm{d}u,\ -\infty < x < \infty ,$$

and it is also true that the PDF is not uniquely defined, since if we take, instead of f X (u), the function f X (u) + g(u), where the function g(u) is nonnegative, integrable, and satisfies \(\int\limits_{-\infty }^{x}g(u)\mathrm{d}u = 0\) for all x , then the function f X (u) + g(u) is also a PDF of random variable X, which can naturally differ from the original f X .

An arbitrary PDF f X (x) is nonnegative and integrable,

$$\int\limits_{-\infty }^{\infty }{f}_{ X}(x)\mathrm{d}x = 1,$$

and almost everywhere in \(\mathbb{R}\) (with respect to the Lebesgue measure) the equation \({F}_{X}^{{\prime}}(x) = {f}_{X}(x)\) is true.

Distribution of a Function of a Random Variable Let X = X(ω) be a random variable, let \(h(x),\ x \in \mathbb{R}\), be a real-valued function, and define Y = h(X). The equation Y = h(X) determines a random variable if for all \(y \in \mathbb{R}\) the set \(\left \{\omega : Y (\omega ) = h(X(\omega )) \leq y\right \}\) is an event, that is, an element of the σ-algebra \(\mathcal{A}\). If h is a continuous function or, more generally, a Borel-measurable function (h is Borel measurable if for all x the relation \(\left \{u : h(u) \leq x\right \} \in \mathcal{B}\) holds), then Y = h(X) is a random variable. The question is how the CDF and the density function (if the latter exists) of random variable Y can be determined. In general,

$${F}_{Y }(y) = \mathbf{P}\left (Y \leq y\right ) = \mathbf{P}\left (h(X) \leq y\right ) ={ \mathbf{P}}_{X}\left [\left \{x : h(x) \leq y\right \}\right ],\ -\infty < y < \infty.$$

If h is a strictly monotonically increasing function, then this formula can be given in a simpler form. Let us denote by h  − 1 the inverse function of h, which in this case must exist. Then

$${F}_{Y }(y) = \mathbf{P}\left (h(X) \leq y\right ) = \mathbf{P}\left (X \leq {h}^{-1}(y)\right ) = {F}_{ X}({h}^{-1}(y)),\ -\infty < y < \infty.$$

If h is a strictly monotonically decreasing function, then

$${F}_{Y }(y) = \mathbf{P}\left (h(X) \leq y\right ) = \mathbf{P}\left (X \geq {h}^{-1}(y)\right ) = 1-{F}_{ X}({h}^{-1}(y)-0),\ -\infty < y < \infty.$$

With these relations, a formula can be given for the PDF of Y in special cases.

Theorem 1.21.

Let us suppose that random variable X has a PDF f X and that h is a strictly monotone, differentiable real function. Then

$${f}_{Y }(y) = {f}_{X}({h}^{-1}(y))\left \vert \frac{\mathrm{d}} {\mathrm{d}y}{h}^{-1}(y)\right \vert ,\ -\infty < y < \infty.$$

Comment 1.22.

If h is a linear function, that is, \(h(x) = ax + b,\ a\neq 0\) , and X has a PDF f X , then the random variable Y = h(X) also has a PDF and the formula

$${f}_{Y }(y) = \frac{1} {\left \vert a\right \vert }{f}_{X}\left (\frac{y - b} {a} \right ),\ -\infty < y < \infty ,$$

is true.
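A Monte Carlo check of the linear case may be instructive (an illustrative Python sketch; X is taken standard normal and a, b are arbitrary). For a > 0 we must have \({F}_{Y }(y) = {F}_{X}((y - b)/a)\):

```python
import math
import random

random.seed(1)
a, b = 2.0, 1.0                                   # h(x) = a x + b with a > 0
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # N(0,1) CDF

ys = [a * random.gauss(0.0, 1.0) + b for _ in range(100_000)]
y0 = 2.0
empirical = sum(y <= y0 for y in ys) / len(ys)    # estimate of F_Y(y0)
exact = Phi((y0 - b) / a)                         # F_X(h^{-1}(y0))
print(f"empirical F_Y({y0}) = {empirical:.4f}, F_X((y0-b)/a) = {exact:.4f}")
```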

Joint Distribution and Density Function of Random Variables, Marginal Distributions In the majority of problems arising in practice, we have not one but several random variables, and we examine the probability of events where random variables simultaneously satisfy certain conditions.

Let \((\Omega ,\mathcal{A},\mathbf{P})\) be a probability space, and let there be two random variables X and Y on that space. The joint statistical behavior of the two random variables can be determined by a joint CDF. We should note that the joint analysis of the random variables X and Y corresponds to the examination of two-dimensional random vector variables such as (X, Y ) that have random variable coordinates.

Definition 1.23.

The function

$${F}_{XY }(x,y) = \mathbf{P}\left (X \leq x,Y \leq y\right ),\ \ -\infty < x,y < \infty ,$$

is called the joint CDF of random variables X and Y.

From a practical point of view, the two most important types of distributions are the discrete and the continuous ones, as in the one-dimensional case.

Definition 1.24.

The joint distribution of random variables X and Y is called discrete; in other words, the random vector (X, Y ) has a discrete distribution if the random variables X and Y are discrete. If we denote the values of random variables X and Y by \(\left \{{x}_{i},\ i \in I\right \}\) and \(\left \{{y}_{j},\ j \in J\right \}\), respectively, and set \({p}_{i,j} = \mathbf{P}\left (X = {x}_{i},Y = {y}_{j}\right )\), then the function

$${f}_{X,Y }(x,y) = \left \{\begin{array}{c} {p}_{i,j},\text{ if }x = {x}_{i},\ y = {y}_{j},\ i \in I,\ j \in J, \\ 0,\text{ otherwise,} \end{array} \right.\ \ \ x,y \in \mathbb{R},$$

is called a joint PMF or joint PDF.

It is clear that in the discrete case the joint distribution function is

$${F}_{XY }(x,y) = \sum \limits_{{x}_{i}\leq x,\ {y}_{j}\leq y}{p}_{ij}.$$

The case of a joint continuous distribution is analogous to the discrete one.

Definition 1.25.

The joint distribution of random variables X and Y is called continuous; in other words, the random vector (X, Y ) has a continuous distribution if there exists a nonnegative, real-valued integrable function on the plane f XY (x, y),  \(-\infty < x,y < \infty \), for which the relation

$${F}_{XY }(x,y) =\int\limits_{-\infty }^{x}\int\limits_{-\infty }^{y}{f}_{ XY }(u,v)\mathrm{d}u\mathrm{d}v$$

holds for all \(-\infty < x,y < \infty \).

Definition 1.26.

If F XY denotes the joint CDF of random variables X and Y , then the CDFs

$${F}_{X}(x) =\mathop{\lim }\limits_{ y \rightarrow \infty }{F}_{XY }(x,y),$$
$${F}_{Y }(y) =\mathop{\lim }\limits_{ x \rightarrow \infty }{F}_{XY }(x,y)$$

are called marginal distribution functions.

It is not difficult to see that marginal distribution functions do not determine the joint CDF. It is also clear that if a joint PDF f XY (x, y) of random variables X and Y exists, then marginal PDFs can be given in the form

$$\begin{array}{rcl}{ f}_{X}(x)& =& \int\limits_{-\infty }^{\infty }{f}_{ XY }(x,y)\mathrm{d}y,\ -\infty < x < \infty , \\ {f}_{Y }(y)& =& \int\limits_{-\infty }^{\infty }{f}_{ XY }(x,y)\mathrm{d}x,\ -\infty < y < \infty.\end{array}$$

If there are more than two random variables \({X}_{1},\ldots ,{X}_{n},\) n ≥ 3, i.e., in the case of an n-dimensional random vector \(({X}_{1},\ldots ,{X}_{n})\), then the definitions of joint distribution function and density functions can be given analogously to the case of two random variables, so there is no essential difference. We will return to this question when we introduce the concept of stochastic processes.

Conditional Distributions Let A be an arbitrary event, with P(A) > 0, and X an arbitrary random variable. Using the notion of conditional probability, we can define the conditional distribution of random variable X given event A as the function

$${F}_{X}(x\vert A) = \mathbf{P}\left (X \leq x\vert A\right ),\ x \in \mathbb{R}.$$

The function F X (x | A) has all the properties of a distribution function mentioned previously.

The function \({f}_{X}(x\vert A)\) is called the conditional density function of random variable X given event A if a nonnegative integrable function f X (x | A) exists for which the equation

$${F}_{X}(x\vert A) =\int\limits_{-\infty }^{x}{f}_{ X}(u\vert A)\mathrm{d}u,\ -\infty < x < \infty ,$$

holds.

The analog of the formula of total probability for the distribution function F X (x) can be easily proved. If the sequence of events \({A}_{1},{A}_{2},\ldots \) is a complete system of events with the property \(\mathbf{P}\left ({A}_{i}\right ) > 0,\ i = 1,2,\ldots \), then

$${F}_{X}(x) = \sum \limits_{i=1}^{\infty }{F}_{ X}(x\vert {A}_{i})\mathbf{P}\left ({A}_{i}\right ),\ -\infty < x < \infty.$$

A similar relation holds for the conditional PDFs \({f}_{X}(x\vert {A}_{i}),\ i \geq 1\), if they exist:

$${f}_{X}(x) = \sum \limits_{i=1}^{\infty }{f}_{ X}(x\vert {A}_{i})\mathbf{P}\left ({A}_{i}\right ),\ -\infty < x < \infty.$$

A different approach is required to define the conditional distribution function F X | Y (x | y) of random variable X given Y = y , where Y is another random variable. The difficulty is that if a random variable Y has a continuous distribution function, then the probability of the event \(\left \{Y = y\right \}\) equals zero, and therefore the conditional distribution function F X | Y (x | y) cannot be defined with the help of the notion of conditional probability. In this case the conditional distribution function F X | Y (x | y) is defined as follows:

$${F}_{X\vert Y }(x\vert y) =\mathop{\lim }\limits_{ \Delta y \rightarrow +0}\mathbf{P}\left (X \leq x\vert y \leq Y < y + \Delta y\right )$$

if the limit exists.

Let us assume that the joint density function f XY (x, y) of random variables X and Y exists. In such a case random variable X has the conditional CDF F X | Y (x | y) and conditional PDF f X | Y (x | y) given Y = y. If a joint PDF exists and f Y (y) > 0, then it is not difficult to see that the following relation holds:

$$\begin{array}{rcl}{ F}_{X\vert Y }(x\vert y)& =& \mathop{\lim }\limits_{\Delta y \rightarrow +0}\mathbf{P}\left (X \leq x\vert y \leq Y < y + \Delta y\right ) \\ & =& \mathop{\lim }\limits_{\Delta y \rightarrow +0}\frac{\mathbf{P}\left (X \leq x,y \leq Y < y + \Delta y\right )} {\mathbf{P}\left (y \leq Y < y + \Delta y\right )} \\ & =& \mathop{\lim }\limits_{\Delta y \rightarrow +0}\frac{\frac{{F}_{XY }(x,y+\Delta y)-{F}_{XY }(x,y)} {\Delta y} } {\frac{{F}_{Y }(y+\Delta y)-{F}_{Y }(y)} {\Delta y} } = \frac{1} {{f}_{Y }(y)}\ \frac{\partial } {\partial y}{F}_{XY }(x,y).\end{array}$$

From this relation we get the conditional PDF f X | Y (x | y) as follows:

$${f}_{X\vert Y }(x\vert y) = \frac{\partial } {\partial x}{F}_{X\vert Y }(x\vert y) = \frac{1} {{f}_{Y }(y)} \frac{{\partial }^{2}} {\partial x\partial y}{F}_{XY }(x,y) = \frac{{f}_{XY }(x,y)} {{f}_{Y }(y)}.$$
(1.1)
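A worked instance of formula (1.1) may be helpful (the joint density below is chosen purely for illustration). Take \({f}_{XY }(x,y) = x + y\) for \(0 \leq x,y \leq 1\) and zero otherwise; then

$${f}_{Y }(y) =\int\limits_{0}^{1}(x + y)\,\mathrm{d}x = \frac{1} {2} + y,\qquad {f}_{X\vert Y }(x\vert y) = \frac{{f}_{XY }(x,y)} {{f}_{Y }(y)} = \frac{x + y} {\frac{1} {2} + y},\ \ 0 \leq x,y \leq 1.$$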

Independence of Random Variables Let X and Y be two random variables. Let F XY (x, y) be the joint distribution function of X and Y , and let F X (x) and F Y (y) be the marginal distribution functions.

Definition 1.27.

Random variables X and Y are called independent of each other, or just independent, if the identity

$${F}_{XY }(x,y) = {F}_{X}(x){F}_{Y }(y)$$

holds for any x, y, \(-\infty < x,y < \infty \).

In other words, random variables X and Y are independent if and only if the joint distribution function of X and Y equals the product of their marginal distribution functions.

The definition of independence of two random variables can be easily generalized to the case where an arbitrary collection of random variables \(\left \{{X}_{i},\ i \in I\right \}\) is given, analogously to the notion of the independence of events.

Definition 1.28.

A collection of random variables \(\left \{{X}_{i},\ i \in I\right \}\) is called mutuallyindependent (or just independent), if for any choice of a finite number of elements \({X}_{{i}_{1}},\ldots ,{X}_{{i}_{n}}\) the relation

$${F}_{{X}_{{i}_{ 1}},\ldots ,{X}_{{i}_{n}}}({x}_{1},\ldots ,{x}_{n}) = {F}_{{X}_{{i}_{ 1}}}({x}_{1}) \cdot \ldots \cdot {F}_{{X}_{{i}_{n}}}({x}_{n}),\ {x}_{1},\ldots ,{x}_{n} \in \mathbb{R}$$

holds.

Note that from the pairwise independence of random variables \(\left \{{X}_{i},\ i \in I\right \}\), which means that the condition

$${F}_{{X}_{{i}_{ 1}},{X}_{{i}_{2}}}({x}_{1},{x}_{2}) = {F}_{{X}_{{i}_{ 1}}}({x}_{1}){F}_{{X}_{{i}_{ 2}}}({x}_{2}),\ {x}_{1},{x}_{2} \in \mathbb{R},\ {i}_{1},{i}_{2} \in I,$$

is satisfied, mutual independence does not follow.

Example 1.29.

Consider Example 1.11 given earlier and preserve the notation. Denote by \({X}_{i} = {\mathcal{I}}_{\left \{{A}_{i}\right \}}\) the indicator variables of the events  A i ,  i = 1, 2, 3. Then we can verify that random variables X 1, X 2, and X 3 are pairwise independent, but they do not satisfy mutual independence. The pairwise independence of random variables X i can be easily proved. Since the events \({A}_{1},{A}_{2},{A}_{3}\) are pairwise independent and

$$\{{X}_{i} = 1\} = {A}_{i}\text{ and }\{{X}_{i} = 0\} ={ \overline{A}}_{i},$$

then, using the relation proved in Example 1.11, we obtain for \(i\neq j\)

$$\begin{array}{rcl} \mathbf{P}\left ({X}_{i} = 1,{X}_{j} = 1\right ) = \mathbf{P}\left ({A}_{i}{A}_{j}\right ) = \mathbf{P}\left ({A}_{i}\right )\mathbf{P}\left ({A}_{j}\right )& =& \frac{1} {4}, \\ \mathbf{P}\left ({X}_{i} = 1,{X}_{j} = 0\right ) = \mathbf{P}\left ({A}_{i}{\overline{A}}_{j}\right ) = \mathbf{P}\left ({A}_{i}\right )\mathbf{P}\left ({\overline{A}}_{j}\right )& =& \frac{1} {4}, \\ \mathbf{P}\left ({X}_{i} = 0,{X}_{j} = 0\right ) = \mathbf{P}\left ({\overline{A}}_{i}{\overline{A}}_{j}\right ) = \mathbf{P}\left ({\overline{A}}_{i}\right )\mathbf{P}\left ({\overline{A}}_{j}\right )& =& \frac{1} {4}, \\ \end{array}$$

while, for example,

$$\begin{array}{rcl} & & \mathbf{P}\left ({X}_{1} = 1,{X}_{2} = 1,{X}_{3} = 1\right ) = \mathbf{P}\left ({A}_{1}{A}_{2}{A}_{3}\right ) = 0\neq \frac{1} {8} \\ & & \quad = \mathbf{P}\left ({A}_{1}\right )\mathbf{P}\left ({A}_{2}\right )\mathbf{P}\left ({A}_{3}\right ) = \mathbf{P}\left ({X}_{1} = 1\right )\mathbf{P}\left ({X}_{2} = 1\right )\mathbf{P}\left ({X}_{3} = 1\right ).\end{array}$$

Consider how we can characterize the notion of independence for two random variables in the discrete and continuous cases (if more than two random variables are given, then we may proceed in a similar manner).

Firstly, let us assume that the sets of values of discrete random variables X and Y are \(\left \{{x}_{i},i \geq 0\right \}\) and \(\left \{{y}_{j},j \geq 0\right \}\), respectively. If we denote the joint and marginal distributions of X and Y by

$$\begin{array}{rcl} & & \left \{{p}_{ij} = \mathbf{P}\left (X = {x}_{i},Y = {y}_{j}\right ),\ i,j \geq 0\right \},\left \{{q}_{i} = \mathbf{P}\left (X = {x}_{i}\right ),\ i \geq 0\right \}, \\ & & \quad \mbox{ and}\left \{{r}_{j} = \mathbf{P}\left (Y = {y}_{j}\right ),\ j \geq 0\right \}, \\ \end{array}$$

then the following assertion holds. Random variables X and Y are independent if and only if

$${p}_{ij} = {q}_{i}{r}_{j},\ i,j \geq 0.$$

Now assume that random variables X and Y have joint density f XY (x, y) and marginal densities f X (x) and f Y (y). In this case, random variables X and Y are independent if and only if the joint PDF takes a product form, that is,

$${f}_{XY }(x,y) = {f}_{X}(x){f}_{Y }(y),\ -\infty < x,y < \infty.$$

Convolution of Distributions Let X and Y be independent random variables with distribution functions F X (x) and F Y (y), respectively, and let us consider the distribution of the random variable Z = X + Y.

Definition 1.30.

The distribution (CDF, PDF) of the random variable Z = X + Y is called the convolution of the distributions (CDFs, PDFs) of X and Y, and the equations expressing the relations among them are called convolution formulas.

Definition 1.31.

Let \({X}_{1},{X}_{2},\ldots \) be independent identically distributed random variables with common CDF F X . The CDF \({F}_{X}^{{_\ast}n}\) of the sum \({Z}_{n} = {X}_{1} + \ldots + {X}_{n}\ (n \geq 1)\) is uniquely determined by F X and is called the n -fold convolution of the CDF F X .

Note that the CDF F Z (z) of the random variable Z = X + Y , which is called the convolution of CDFs F X (x) and F Y (y), can be given in the general form

$${F}_{Z}(z) = \mathbf{P}\left (Z \leq z\right ) = \mathbf{P}\left (X + Y \leq z\right ) =\int\limits_{-\infty }^{\infty }{F}_{ X}(z - y)\ \mathrm{d}{F}_{Y }(y).$$

This formula takes a simpler form in cases where the discrete random variables X and Y take only integer values, or where the PDFs f X (x) and f Y (y) of X and Y exist.

Let X and Y be independent discrete random variables taking values in \(\{0,\pm 1,\pm 2,\ldots \}\) with probabilities \(\left \{{q}_{i} = \mathbf{P}\left (X = i\right )\right \}\) and \(\left \{{r}_{j} = \mathbf{P}\left (Y = j\right )\right \}\), respectively. Then the random variable Z = X + Y takes values in \(\{0,\pm 1,\pm 2,\ldots \}\), and its distribution \(\left \{{s}_{k} = \mathbf{P}\left (Z = k\right )\right \}\) satisfies the identity

$${s}_{k} = \sum \limits_{n=-\infty }^{\infty }{q}_{ k-n}{r}_{n},\ k = 0,\pm 1,\pm 2,\ldots \,.$$

If the independent random variables X and Y have a continuous distribution with the PDFs f X (x) and f Y (y), respectively, then random variable Z is continuous and its PDF f Z (z) can be given in the integral form

$${f}_{Z}(z) =\int\limits_{-\infty }^{\infty }{f}_{ X}(z - y){f}_{Y }(y)\mathrm{d}y.$$
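For nonnegative integer-valued X and Y, the discrete convolution formula is a finite computation. The following sketch (a Python illustration with invented distributions) evaluates it via numpy.convolve:

```python
import numpy as np

q = np.array([0.2, 0.5, 0.3])      # P(X = i) for i = 0, 1, 2
r = np.array([0.6, 0.4])           # P(Y = j) for j = 0, 1
s = np.convolve(q, r)              # s_k = sum_n q_{k-n} r_n, k = 0, ..., 3

print("P(Z = k):", s)              # [0.12 0.38 0.38 0.12]
assert abs(s.sum() - 1.0) < 1e-12  # the convolution is again a distribution
```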

Mixture of Distributions Let \({F}_{1}(x),\ldots ,{F}_{n}(x)\) be a given collection of CDFs, and let \({a}_{1},\ldots ,{a}_{n}\) be nonnegative numbers with the sum \({a}_{1} + \ldots + {a}_{n} = 1\). The function

$$F(x) = {a}_{1}{F}_{1}(x) + \ldots + {a}_{n}{F}_{n}(x),\ -\infty < x < \infty ,$$

is called a mixture of the CDFs \({F}_{1}(x),\ldots ,{F}_{n}(x)\) with weights \({a}_{1},\ldots ,{a}_{n}\).

Comment 1.32.

Any CDF can be given as a mixture of discrete, continuous, and singular CDFs, where the weights can also take a value of 0.

Clearly, the function F(x) possesses all the properties of CDFs; therefore it is also a CDF. In practice, the modeling of mixture distributions plays a basic role in stochastic simulation methods. A simple way to model mixture distributions is as follows.

Let us assume that the random variables \({X}_{1},\ldots ,{X}_{n}\) with distribution functions \({F}_{1}(x),\ldots ,{F}_{n}(x)\) can be modeled. Let Y be a random variable taking values in \(\{1,\ldots ,n\}\) and independent of \({X}_{1},\ldots ,{X}_{n}\). Assume that Y has a distribution \(P(Y = i) = {a}_{i},\ 1 \leq i \leq n\) (\({a}_{i} \geq 0,\ {a}_{1} + \ldots + {a}_{n} = 1\)). Let us define random variable Z as follows:

$$Z = \sum \limits_{i=1}^{n}{\mathcal{I}}_{\left \{ Y =i\right \}}{X}_{i},$$

where \({\mathcal{I}}_{\left \{\cdot \right \}}\) denotes the indicator variable. Then the CDF of random variable Z equals F(z).

Proof.

Using the formula of total probability, we have the relation

$$\mathbf{P}\left (Z \leq z\right ) = \sum \limits_{i=1}^{n}\mathbf{P}\left (Z \leq z\vert Y = i\right )\mathbf{P}\left (Y = i\right ) = \sum \limits_{i=1}^{n}\mathbf{P}\left ({X}_{ i} \leq z\right ){a}_{i} = F(z).$$

 □ 
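This construction translates directly into simulation. The following Python sketch (exponential components and weights invented for illustration) samples Z by first drawing Y and then the corresponding X i , and compares the empirical CDF with the mixture CDF:

```python
import math
import random

random.seed(2)
weights = [0.3, 0.7]                     # a_1, a_2
rates = [1.0, 5.0]                       # rates of the exponential components

def sample_Z():
    i = random.choices(range(len(weights)), weights=weights)[0]  # draw Y
    return random.expovariate(rates[i])                          # return X_i

zs = [sample_Z() for _ in range(100_000)]
z0 = 0.5
empirical = sum(z <= z0 for z in zs) / len(zs)
mixture_cdf = sum(a * (1.0 - math.exp(-lam * z0))
                  for a, lam in zip(weights, rates))
print(f"empirical F_Z({z0}) = {empirical:.4f}, F(z0) = {mixture_cdf:.4f}")
```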

Concept and Properties of Expectation A random variable is completely characterized in a statistical sense by its CDF. To specify a distribution function F(x), one needs to determine its values for all \(x \in \mathbb{R}\), which is not possible in many cases. Fortunately, there is often no need to do so, because it suffices to give a few quantities that characterize the CDF in some respect relevant to the practical problem at hand. One of the most important such quantities is the expectation, which we define in general form; the definitions for discrete and continuous distributions follow as special cases.

Definition 1.33.

Let X be a random variable, and let F X (x) be its CDF. The expected value (or mean value) of random variable X is defined as

$$\mathbf{E}\left (X\right ) =\int\limits_{-\infty }^{\infty }x\mathrm{d}{F}_{ X}(x)$$

if the expectation exists.

Note that the finite expected value \(\mathbf{E}\left (X\right )\) exists if and only if \(\int\limits_{-\infty }^{\infty }\vert x\vert \mathrm{d}{F}_{X}(x) < \infty \). It is conventional to denote the expected value of the random variable X by μ X .

Expected Value of Discrete and Continuous Random Variables Let X be a discrete valued random variable with countable values \(\left \{{x}_{i},i \in I\right \}\) and with probabilities \(\left \{{p}_{i} = \mathbf{P}\left (X = {x}_{i}\right ),i \in I\right \}\). The finite expected value \(\mathbf{E}\left (X\right )\) of random variable X exists and equals

$$\mathbf{E}\left (X\right ) = \sum \limits_{i\in I}{p}_{i}{x}_{i}$$

if and only if the sum is absolutely convergent, that is, \(\sum\limits_{i\in I}{p}_{i}\left \vert {x}_{i}\right \vert < \infty \). In the case of continuous random variables, the expected value can also be given in a simple form. Let f X (x) be the PDF of a random variable X. If the condition \(\int\limits_{-\infty }^{\infty }\left \vert x\right \vert {f}_{X}(x)\mathrm{d}x < \infty \) holds (i.e., the integral is absolutely convergent), then the finite expected value of X exists and can be given as

$$\mathbf{E}\left (X\right ) =\int\limits_{-\infty }^{\infty }x{f}_{ X}(x)\mathrm{d}x.$$

From a practical point of view, it is generally enough to treat the discrete and continuous cases together. Let X be a random variable that has a mixed CDF with discrete and continuous components F 1(x) and F 2(x), respectively, and with weights a 1 and a 2, that is,

$$F(x) = {a}_{1}{F}_{1}(x) + {a}_{2}{F}_{2}(x),\ {a}_{1},{a}_{2} \geq 0,\ {a}_{1} + {a}_{2} = 1.$$

Assume that the set of discontinuities of F 1(x) is \(\left \{{x}_{i},\ i \in I\right \}\) and denote \({p}_{i} = {F}_{1}({x}_{i}) - {F}_{1}({x}_{i}-),i \in I\). In addition, we assume that the continuous CDF F 2(x) has the PDF f(x). Then the expected value of random variable X is determined as follows:

$$\mathbf{E}\left (X\right ) = {a}_{1} \sum \limits_{i\in I}{p}_{i}{x}_{i} + {a}_{2}\int\limits_{-\infty }^{\infty }xf(x)\mathrm{d}x$$

if the series and the integral on the right-hand side of the last formula are absolutely convergent. The expected values related to special and different CDFs will be given later in this chapter.

The operation of expectation can be interpreted as a functional

$$\mathbf{E} : X \rightarrow \mathbf{E}\left (X\right )$$

that assigns a real value to the given random variable. We enumerate the basic properties of this functional as follows.

  1. 1.

    If random variable X is bounded, i.e., if there are constants x 1 and x 2 for which the inequality \({x}_{1} \leq X \leq {x}_{2}\) holds, then

    $${x}_{1} \leq \mathbf{E}\left (X\right ) \leq {x}_{2}.$$

    If random variable X is nonnegative and the expected value \(\mathbf{E}\left (X\right )\) exists, then

    $$\mathbf{E}\left (X\right ) \geq 0.$$
  2. 2.

    Let us assume that the expected value \(\mathbf{E}\left (X\right )\) exists; then the expected value of random variable cX exists for an arbitrary given constant c, and the identity

    $$\mathbf{E}\left (cX\right ) = c\mathbf{E}\left (X\right )$$

    is true.

  3. 3.

    If random variable X satisfies the condition \(\mathbf{P}\left (X = c\right ) = 1\), then

    $$\mathbf{E}\left (X\right ) = c.$$
  4. 4.

    If the expected values of random variables X and Y exist, then the sum X + Y has an expected value, and the equality

    $$\mathbf{E}\left (X + Y \right ) = \mathbf{E}\left (X\right ) + \mathbf{E}\left (Y \right )$$

    holds. This relation can usually be interpreted in such a way that the operation of expectation on the space of random variables is an additive functional.

  5. 5.

    The preceding properties can be expressed in a more general form. If there are finite expected values of random variables \({X}_{1},\ldots ,{X}_{n}\) and \({c}_{1},\ldots ,{c}_{n}\) are constants, then the equality

    $$\mathbf{E}\left ({c}_{1}{X}_{1} + \ldots + {c}_{n}{X}_{n}\right ) = {c}_{1}\mathbf{E}\left ({X}_{1}\right ) + \ldots + {c}_{n}\mathbf{E}\left ({X}_{n}\right )$$

    holds. This property means that the functional \(\mathbf{E}\left (\cdot \right )\) is linear.

  6. 6.

    Let X and Y be independent random variables with finite expected value. Then the expected value of the product of random variables \(X \cdot Y\) exists and equals the product of expected values, i.e., the equality

    $$\mathbf{E}\left (XY \right ) = \mathbf{E}\left (X\right ) \cdot \mathbf{E}\left (Y \right )$$

    is true.

Expectation of Functions of Random Variables, Moments and Properties Let X be a discrete random variable with finite or countable values \(\left \{{x}_{i},i \in I\right \}\) and with distribution \(\left \{{p}_{i},i \in I\right \}\). Let \(h(x),\ x \in \mathbb{R}\), be a real-valued function for which the expected value of the random variable Y = h(X) exists; then the equality

$$\mathbf{E}\left (Y \right ) = \mathbf{E}\left (h(X)\right ) = \sum \limits_{i\in I}{p}_{i}h({x}_{i})$$

holds.

If the continuous random variable X has a PDF f X (x) and the expected value of the random variable Y = h(X) exists, then the expected value of Y can be given in the form

$$\mathbf{E}\left (Y \right ) =\int\limits_{-\infty }^{\infty }h(x){f}_{ X}(x)\mathrm{d}x.$$

When the expected value of a function of several random variables (a function of a random vector) is investigated, results analogous to the one-dimensional case can be obtained. We give the formulas for the two-dimensional case only. Let X and Y be two random variables, and let us assume that the expected value of the random variable Z = h(X, Y ) exists. With the appropriate notation, used earlier, for the cases of discrete and continuous distributions, the expected value of random variable Z can be given in the forms

$$\mathbf{E}\left (Z\right ) = \sum \limits_{i\in I} \sum \limits_{j\in J}h({x}_{i},{y}_{j})\mathbf{P}\left (X = {x}_{i},Y = {y}_{j}\right ),$$
$$\mathbf{E}\left (Z\right ) =\int\limits_{-\infty }^{\infty }\int\limits_{-\infty }^{\infty }h(x,y){f}_{ XY }(x,y)\mathrm{d}x\mathrm{d}y.$$

Consider the important case where h is a power function, i.e., for a given positive integer k, \(h(x) = {x}^{k}\). Assume that the expected value of X k exists. Then the quantity

$${\mu }_{k} = \mathbf{E}\left ({X}^{k}\right ),\ k = 1,2,\ldots ,$$

is called the kth moment of random variable X. The first moment \(\mu = {\mu }_{1} = \mathbf{E}\left (X\right )\) is simply the expected value of X, and the frequently used second moment is \({\mu }_{2} = \mathbf{E}\left ({X}^{2}\right )\).

Theorem 1.34.

Let j and k be integers for which \(1 \leq j \leq k\) . If the kth moment of random variable X exists, then the jth moment also exists.

Proof.

From the existence of the kth moment it follows that \(\mathbf{E}\left (\vert X{\vert }^{k}\right ) < \infty \). Since k ∕ j ≥ 1, the function \({x}^{k/j},\ x \geq 0\), is convex, and by the use of Jensen’s inequality we get the relation

$${\left [\mathbf{E}\left (\vert X{\vert }^{j}\right )\right ]}^{k/j} \leq \mathbf{E}\left ({\left (\vert X{\vert }^{j}\right )}^{k/j}\right ) = \mathbf{E}\left (\vert X{\vert }^{k}\right ) < \infty.$$

 □ 

The kth central moment \(\mathbf{E}\left ({(X -\mathbf{E}\left (X\right ))}^{k}\right )\) is also used in practice; it is defined as the kth moment of the random variable centered at the first moment (expected value). The kth central moment \(\mathbf{E}\left ({(X -\mathbf{E}\left (X\right ))}^{k}\right )\) can be expressed by the noncentral moments \({\mu }_{i},1 \leq i \leq k\) of random variable X as follows:

$$\begin{array}{rcl} \mathbf{E}\left ({(X -\mathbf{E}\left (X\right ))}^{k}\right ) = \mathbf{E}\left (\sum\limits_{i=0}^{k}\left ({ k \atop i} \right ){X}^{i}{(-\mathbf{E}\left (X\right ))}^{k-i}\right )& & \\ =\sum\limits_{i=0}^{k}\left ({ k \atop i} \right )\mathbf{E}\left ({X}^{i}\right ){(-\mathbf{E}\left (X\right ))}^{k-i}.& & \\ \end{array}$$

In the course of a random experiment, the observed values fluctuate around the expected value. One of the most significant characteristics of the magnitude of these fluctuations is the variance. Assume that the second moment of random variable X is finite. Then the quantity

$$\mathbf{Var}\left (X\right ) = \mathbf{E}\left ({(X -\mathbf{E}\left (X\right ))}^{2}\right )$$

is called the variance of random variable X. The standard deviation of a random variable X is the square root of its variance:

$$\mathbf{D}\left (X\right ) = \sqrt{\mathbf{E} \left ({(X - \mathbf{E} \left (X\right ) )}^{2 } \right )}.$$

It is clear that the variance of X can be given with the help of the first and second moments as follows:

$$\begin{array}{rcl}{ \mathbf{D}}^{2}\left (X\right )& =& \mathbf{Var}\left (X\right ) = \mathbf{E}\left ({(X -\mathbf{E}\left (X\right ))}^{2}\right ) = \mathbf{E}\left ({X}^{2}\right ) - 2\mathbf{E}\left (X\right ) \cdot \mathbf{E}\left (X\right ) + {\left (\mathbf{E}\left (X\right )\right )}^{2} \\ & =& \mathbf{E}\left ({X}^{2}\right ) - {(\mathbf{E}\left (X\right ))}^{2} = {\mu }_{ 2} - {\mu }^{2}.\end{array}$$

It is conventional to denote the variance of the random variable X by \({\sigma }_{X}^{2} ={ \mathbf{D}}^{2}\left (X\right )\).

It should be noted that the variance of a random variable exists if and only if its second moment is finite. In addition, from the last identity it follows that an upper estimate can be given for the variance:

$${\mathbf{D}}^{2}\left (X\right ) \leq \mathbf{E}\left ({X}^{2}\right ).$$

It can also be seen that for every constant c the relation

$$\mathbf{E}\left ({(X - c)}^{2}\right ) = \mathbf{E}\left ({\left [(X -\mathbf{E}\left (X\right )) + (\mathbf{E}\left (X\right ) - c)\right ]}^{2}\right ) ={ \mathbf{D}}^{2}\left (X\right ) + {(\mathbf{E}\left (X\right ) - c)}^{2}$$

holds, which is analogous to the Steiner formula, well known in the field of mechanics.

As an important consequence of this identity, we have the following result: the second moment \(\mathbf{E}\left ({(X - c)}^{2}\right )\) takes the minimal value for the constant \(c = \mathbf{E}\left (X\right )\).
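A quick numeric check of this Steiner-type identity may be useful (a Python sketch with invented sample data; the empirical distribution plays the role of the law of X):

```python
xs = [1.0, 2.0, 2.0, 3.0, 7.0]                 # invented sample values
n = len(xs)
mean = sum(xs) / n                             # plays the role of E(X)
var = sum((x - mean) ** 2 for x in xs) / n     # plays the role of D^2(X)

for c in (0.0, mean, 5.0):
    lhs = sum((x - c) ** 2 for x in xs) / n    # E((X - c)^2)
    assert abs(lhs - (var + (mean - c) ** 2)) < 1e-9
print(f"E((X - c)^2) is minimal at c = E(X) = {mean}, where it equals {var}")
```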

We will now mention some frequently used properties of variance.

  1. 1.

    If the variance of random variable X exists, then for all constants a and b the identity

    $${\mathbf{D}}^{2}\left (aX + b\right ) = {a}^{2}{\mathbf{D}}^{2}\left (X\right )$$

    is true.

  2. 2.

    Let \({X}_{1},\ldots ,{X}_{n}\) be independent random variables with finite variance; then

    $${ \mathbf{D}}^{2}\left ({X}_{ 1} + \ldots + {X}_{n}\right ) ={ \mathbf{D}}^{2}\left ({X}_{ 1}\right ) + \ldots +{ \mathbf{D}}^{2}\left ({X}_{ n}\right ).$$
    (1.2)

The independence of the random variables appearing in formula (1.2) is not actually required: the identity remains true if, instead of independence, we assume only that the random variables \({X}_{1},\ldots ,{X}_{n}\) are uncorrelated. The notion of correlation will be defined later. If \({X}_{1},\ldots ,{X}_{n}\) are independent and identically distributed random variables with finite variance σ2, then

$${\mathbf{D}}^{2}\left ({X}_{ 1} + \ldots + {X}_{n}\right ) ={ \mathbf{D}}^{2}\left ({X}_{ 1}\right ) + \ldots +{ \mathbf{D}}^{2}\left ({X}_{ n}\right ) = n{\sigma }^{2},$$

from which

$$\mathbf{D}\left ({X}_{1} + \ldots + {X}_{n}\right ) = \sigma \sqrt{n}$$

follows.

In the literature on queueing theory, the notion of relative variance \({\mathbf{CV}\left (X\right )}^{2}\) is applied, which is defined as

$${\mathbf{CV}\left (X\right )}^{2} = \frac{{\mathbf{D}}^{2}\left (X\right )} {{\mathbf{E}\left (\left \vert X\right \vert \right )}^{2}}.$$

Its square root \(\mathbf{CV}\left (X\right ) = \mathbf{D}\left (X\right )/\mathbf{E}\left (\left \vert X\right \vert \right )\) is called the coefficient of variation, which serves as a normalized measure of the variability of a distribution. The following relations hold:

$$\begin{array}{ll} \text{ Exponential distribution: } &\qquad CV = 1, \\ \text{ Hyperexponential distribution: }&\qquad CV > 1, \\ \text{ Erlang distribution: } &\qquad CV < 1.\end{array}$$

Markov and Chebyshev Inequalities The role of the Markov and Chebyshev inequalities is significant, not only because they provide information concerning distributions with the help of expected value and variance but because they are also effective tools for proving certain results.

Theorem 1.35 ( Markov inequality). 

If the expected value of a nonnegative random variable X exists, then the following inequality is true for any constant \(\epsilon > 0\) :

$$\mathbf{P}\left (X \geq \epsilon \right ) \leq \frac{\mathbf{E}\left (X\right )} {\epsilon }.$$

Proof.

For an arbitrary positive constant \(\epsilon > 0\) we have the relation

$$\mathbf{E}\left (X\right ) \geq \mathbf{E}\left (X{\mathcal{I}}_{\left \{X\geq \epsilon \right \}}\right ) \geq \epsilon \mathbf{E}\left ({\mathcal{I}}_{\left \{X\geq \epsilon \right \}}\right ) = \epsilon \mathbf{P}\left (X \geq \epsilon \right ),$$

from which the Markov inequality immediately follows. □ 

Theorem 1.36 ( Chebyshev inequality). 

If the variance of random variable X is finite, then for any constant \(\epsilon > 0\) the inequality

$$\mathbf{P}\left (\left \vert X -\mathbf{E}\left (X\right )\right \vert \geq \epsilon \right ) \leq \frac{{\mathbf{D}}^{2}\left (X\right )} {{\epsilon }^{2}}$$

holds.

Proof.

Using the Markov inequality for a constant \(\epsilon > 0\) and for the random variable \({(X -\mathbf{E}\left (X\right ))}^{2}\) we find that

$$\mathbf{P}\left (\left \vert X -\mathbf{E}\left (X\right )\right \vert \geq \epsilon \right ) = \mathbf{P}\left ({\left (X -\mathbf{E}\left (X\right )\right )}^{2} \geq {\epsilon }^{2}\right ) \leq \frac{\mathbf{E}\left ({\left (X -\mathbf{E}\left (X\right )\right )}^{2}\right )} {{\epsilon }^{2}} = \frac{{\mathbf{D}}^{2}\left (X\right )} {{\epsilon }^{2}} ,$$

from which the assertion of the theorem follows. □ 

Comment 1.37.

Let X be a random variable. If h(x) is a convex function and \(\mathbf{E}\left (h(X)\right )\) exists, then the Jensen inequality \(\mathbf{E}\left (h(X)\right ) \geq h(\mathbf{E}\left (X\right ))\) is true. Using this inequality we can obtain some other relations, similar to the case of the Markov inequality.

Example 1.38.

As a simple application of the Chebyshev inequality, let us consider the average \(({X}_{1} + \ldots + {X}_{n})/n\), where the random variables \({X}_{1},\ldots ,{X}_{n}\) are independent and identically distributed with finite second moment. Denote their common expected value and variance by \(\mu \) and \({\sigma }^{2}\), respectively. Using the property (1.2) of variance and the Chebyshev inequality with \(n\epsilon \) in place of \(\epsilon \), we get the inequality

$$\begin{array}{rcl} \mathbf{P}\left (\left \vert {X}_{1} + \ldots + {X}_{n} - n\mu \right \vert \geq n\epsilon \right )& =& \mathbf{P}\left ({\left ({X}_{1} + \ldots + {X}_{n} - n\mu \right )}^{2} \geq {n}^{2}{\epsilon }^{2}\right ) \\ & \leq & \frac{n{\sigma }^{2}} {{(n\epsilon )}^{2}} = \frac{{\sigma }^{2}} {n{\epsilon }^{2}}; \\ \end{array}$$

then

$$\mathbf{P}\left (\left \vert \frac{{X}_{1} + \ldots + {X}_{n}} {n} - \mu \right \vert \geq \epsilon \right ) \leq \frac{{\sigma }^{2}} {n{\epsilon }^{2}}.$$

As a consequence of the last inequality, for every fixed positive constant \(\epsilon \) the probability \(\mathbf{P}\left (\left \vert \frac{{X}_{1}+\ldots +{X}_{n}} {n} - \mu \right \vert \geq \epsilon \right )\) tends to 0 as n goes to infinity. This assertion is known as the weak law of large numbers.
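The bound in the last inequality can be checked numerically. The following Python sketch (an added illustration; the choice of i.i.d. Exp(1) variables is an arbitrary assumption) compares the empirical deviation probability with the Chebyshev bound \({\sigma }^{2}/(n{\epsilon }^{2})\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 1.0           # mean and variance of Exp(1)
eps, n, trials = 0.1, 500, 20_000

# 'trials' independent copies of the average of n i.i.d. Exp(1) variables
averages = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

empirical = np.mean(np.abs(averages - mu) >= eps)
bound = sigma2 / (n * eps**2)

print(f"empirical P(|average - mu| >= {eps}) ~ {empirical:.4f}")
print(f"Chebyshev bound sigma^2/(n*eps^2)  = {bound:.4f}")
# The bound is valid but usually far from tight.
```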

Generating and Characteristic Functions So far, certain quantities characterizing the distribution of random variables have been provided. Now we consider transformations of distributions under which the distribution and the transformed function uniquely determine each other. The investigated transformations provide effective tools for determining, for instance, distributions and moments and for proving limit theorems.

Definition 1.39.

Let X be a random variable taking values in \(\left \{0,1,\ldots \right \}\), with probabilities \({p}_{0},{p}_{1},\ldots \). Then the power series

$${G}_{X}(z) = \mathbf{E}\left ({z}^{X}\right ) = \sum \limits_{i=0}^{\infty }{p}_{ i}{z}^{i}$$

is convergent for all \(z \in [-1,1]\), and the function \({G}_{X}(z)\) is called the probability generating function (or just generating function) of the discrete random variable X.

In engineering practice, the power series defining the generating function is used on a more general domain than the interval \(\left [-1,1\right ]\): it is considered on the closed complex unit disk \(z \in \mathbb{C},\ \left \vert z\right \vert \leq 1\), and is then usually called the z-transform of the distribution \(\left \{{p}_{i},\ i = 0,1,\ldots \right \}\). This notion is also applied if the transformation is made to an arbitrary sequence of real numbers instead of a distribution.

It should be noted that \(\left \vert {G}_{X}(z)\right \vert \leq 1\) for \(z \in \mathbb{C},\ \left \vert z\right \vert \leq 1\), and the function \({G}_{X}(z)\) is infinitely differentiable on the open unit disk \(\left \vert z\right \vert < 1\) of the complex plane, where the kth derivative of \({G}_{X}(z)\) equals the sum of the kth derivatives of the terms of the series.

It is clear that

$${p}_{k} = {G}_{X}^{(k)}(0)/k!,\ k = 0,1,\ldots \,.$$

This formula makes it possible to compute the distribution if the generating function is given. It is also true that if the left-hand derivatives \({G}_{X}^{{\prime}}(1-)\) and \({G}_{X}^{{\prime\prime}}(1-)\) exist at z = 1, then the first and second moments of random variable X can be computed as follows:

$$\mathbf{E}\left (X\right ) = {G}_{X}^{{\prime}}(1-)\text{ and }\mathbf{E}\left ({X}^{2}\right ) ={ \left.{\left (z{G}_{ X}^{{\prime}}(z)\right )}^{{\prime}}\right \vert }_{ z=1} = {G}_{X}^{{\prime\prime}}(1-)+{G}_{ X}^{{\prime}}(1-).$$

From this we can obtain the variance of X as follows:

$${\mathbf{D}}^{2}\left (X\right ) = {G}_{ X}^{{\prime\prime}}(1-) + {G}_{ X}^{{\prime}}(1-) - {({G}_{ X}^{{\prime}}(1-))}^{2}.$$

It can also be verified that if the nth left-hand derivative of the generating function \({G}_{X}(z)\) exists at z = 1, then

$$\begin{array}{rcl} \mathbf{E}\left (X(X - 1)\ldots (X - m + 1)\right )& =& \sum \limits_{k=m}^{\infty }k(k - 1)\ldots (k - m + 1){p}_{ k} \\ & =& {G}_{X}^{(m)}(1-),\ 1 \leq m \leq n.\end{array}$$

Computing the expected values on the left-hand side of these identities, we obtain linear equations between the moments \({\mu }_{k} = \mathbf{E}({X}^{k}),\ 1 \leq k \leq m\), and the derivatives \({G}_{X}^{(m)}(1-)\) for all \(1 \leq m \leq n\). The moments \({\mu }_{m},\ m = 1,2,\ldots ,n\), can be determined in succession with the help of the derivatives \({G}_{X}^{(k)}(1-),\ 1 \leq k \leq m\). The special cases m = 1, 2 give the preceding formulas for the first and second moments.
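As a small symbolic check (an added sketch; the Poisson case is an assumed example, anticipating the distribution introduced in Sect. 2.1), the moments of a Poisson(λ) variable can be recovered from the derivatives of its generating function \(G(z) = {\mathrm{e}}^{\lambda (z-1)}\):

```python
import sympy as sp

z, lam = sp.symbols('z lambda', positive=True)
G = sp.exp(lam * (z - 1))          # generating function of Poisson(lambda)

G1 = sp.diff(G, z, 1).subs(z, 1)   # G'(1-)  = E(X)
G2 = sp.diff(G, z, 2).subs(z, 1)   # G''(1-) = E(X(X-1))

EX  = sp.simplify(G1)              # lambda
EX2 = sp.simplify(G2 + G1)         # E(X^2) = G''(1-) + G'(1-)
Var = sp.simplify(EX2 - EX**2)     # D^2(X)

print(EX, EX2, Var)                # lambda, lambda**2 + lambda, lambda
```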

Characteristic Function

Definition 1.40.

The complex valued function

$${\varphi }_{X}(s) = \mathbf{E}\left ({\mathrm{e}}^{isX}\right ) = \mathbf{E}\left (\cos (sX)\right ) + i\mathbf{E}\left (\sin (sX)\right ),\ s \in \mathbb{R},$$

is called the characteristic function of random variable X, where \(i = \sqrt{-1}\).

Note that a characteristic function can be rewritten in the form

$${\varphi }_{X}(s) =\int\limits_{-\infty }^{\infty }{\mathrm{e}}^{isx}\mathrm{d}{F}_{ X}(x),$$

which is the well-known Fourier–Stieltjes transform of the CDF \({F}_{X}(x)\). Using conventional notation, in the discrete and continuous cases we have

$${\varphi }_{X}(s) = \sum \limits_{k=0}^{\infty }{p}_{ k}{\mathrm{e}}^{is{x}_{k} }\text{ , and }\ {\varphi }_{X}(s) =\int\limits_{-\infty }^{\infty }{\mathrm{e}}^{isx}{f}_{ X}(x)\mathrm{d}x.$$

The characteristic function and the CDFs determine each other uniquely. Now some important properties of characteristic functions will be enumerated.

  1. The characteristic function is real valued if and only if the distribution of X is symmetric.

  2. If the kth moment \(\mathbf{E}\left ({X}^{k}\right )\) exists, then the kth derivative of \({\varphi }_{X}\) exists at the point 0 and

    $$\mathbf{E}\left ({X}^{k}\right ) = \frac{{\varphi }_{X}^{(k)}(0)} {{i}^{k}}.$$
  3. If the derivative \({\varphi }_{X}^{(2k)}(0)\) is finite for a positive integer k, then the moment \(\mathbf{E}\left ({X}^{2k}\right )\) exists. Note that from the existence of the finite derivative \({\varphi }_{X}^{(2k+1)}(0)\) only the existence of the finite moment \(\mathbf{E}\left ({X}^{2k}\right )\) follows.

  4. Let \({X}_{1},\ldots ,{X}_{n}\) be independent random variables; then the characteristic function of the sum \({X}_{1} + \ldots + {X}_{n}\) equals the product of the characteristic functions of the random variables \({X}_{i}\), that is,

    $$\begin{array}{rcl}{ \varphi }_{{X}_{1}+\ldots +{X}_{n}}(s)& =& \mathbf{E}\left ({\mathrm{e}}^{is({X}_{1}+\ldots +{X}_{n})}\right ) = \mathbf{E}\left ({\mathrm{e}}^{is{X}_{1} }\ldots {\mathrm{e}}^{is{X}_{n} }\right ) \\ & =& \mathbf{E}\left ({\mathrm{e}}^{is{X}_{1} }\right ) \cdot \ldots \cdot \mathbf{E}\left ({\mathrm{e}}^{is{X}_{n} }\right ) = {\varphi }_{{X}_{1}}(s)\ldots {\varphi }_{{X}_{n}}(s).\end{array}$$

Note that property 4 plays an important role in the limit theorems of probability theory.
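Property 4 is easy to test by Monte Carlo (an added sketch; the Exp(1) and U(0,1) pair is an arbitrary assumption): the empirical characteristic function of a sum of independent samples should match the product of the individual ones.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 100_000, 0.7                      # sample size and a test point s

X = rng.exponential(scale=1.0, size=n)   # Exp(1)
Y = rng.uniform(0.0, 1.0, size=n)        # U(0,1), independent of X

def phi(sample, s):
    # empirical characteristic function E(e^{isX})
    return np.mean(np.exp(1j * s * sample))

print(phi(X + Y, s))                     # characteristic function of the sum
print(phi(X, s) * phi(Y, s))             # product; agrees up to sampling error
```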

Laplace–Stieltjes and Laplace Transforms If the Laplace–Stieltjes and Laplace transforms are used instead of the CDFs, many practical problems can be solved much more easily, and the results can often be given in a more compact form. Let X be a nonnegative random variable with the CDF F(x) (F(0) = 0). Then the function of a real or, in general, complex variable

$${F}^{\sim }(s) = \mathbf{E}\left ({\mathrm{e}}^{-sX}\right ) =\int\limits_{0}^{\infty }{\mathrm{e}}^{-sx}\mathrm{d}F(x),\ \mathrm{Re}s \geq 0,\ {F}^{\sim }(0) = 1$$

is called the Laplace–Stieltjes transform of the CDF F. Since \(\left \vert {\mathrm{e}}^{-sX}\right \vert \leq 1\) if \(\mathrm{Re}s \geq 0\), the function \({F}^{\sim }(s)\) is well defined. If f is a PDF, then the function

$${f}^{{_\ast}}(s) =\int\limits_{0}^{\infty }{\mathrm{e}}^{-sx}f(x)\mathrm{d}x,\ \mathrm{Re}s \geq 0,$$

is called the Laplace transform of the function f. These notations will be used even if the functions F and f do not possess the necessary properties of a CDF and a PDF, provided that \({F}^{\sim }(s)\) and \({f}^{{_\ast}}(s)\) are well defined. If f is the PDF related to the CDF F, then the equality

$${F}^{\sim }(s) = {f}^{{_\ast}}(s) = s{F}^{{_\ast}}(s)$$
(1.3)

holds.

Proof.

It is clear that

$${F}^{\sim }(s) =\int\limits_{0}^{\infty }{\mathrm{e}}^{-sx}\mathrm{d}F(x) =\int\limits_{0}^{\infty }{\mathrm{e}}^{-sx}f(x)\mathrm{d}x = {f}^{{_\ast}}(s),$$

and integrating by parts we have

$${F}^{\sim }(s) =\int\limits_{0}^{\infty }{\mathrm{e}}^{-sx}\mathrm{d}F(x) =\int\limits_{0}^{\infty }s{\mathrm{e}}^{-sx}F(x)\mathrm{d}x = s{F}^{{_\ast}}(s).$$

 □ 
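Equality (1.3) can be verified symbolically for a concrete case (an added sketch; the Exp(λ) distribution is an assumed example, anticipating Sect. 2.2):

```python
import sympy as sp

x, s, lam = sp.symbols('x s lambda', positive=True)

F = 1 - sp.exp(-lam * x)      # CDF of Exp(lambda)
f = sp.diff(F, x)             # its PDF

f_star = sp.integrate(sp.exp(-s * x) * f, (x, 0, sp.oo))  # Laplace transform of f
F_star = sp.integrate(sp.exp(-s * x) * F, (x, 0, sp.oo))  # Laplace transform of F

print(sp.simplify(f_star))        # lambda/(lambda + s)
print(sp.simplify(s * F_star))    # lambda/(lambda + s), i.e., F~(s) = f*(s) = s F*(s)
```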

Since Eq. (1.3) connects the two introduced transforms, it is enough to consider the Laplace–Stieltjes transform only and to enumerate its main properties.

  (a) \({F}^{\sim }(s),\ \mathrm{Re}s \geq 0\), is a continuous function and \(0 \leq \left \vert {F}^{\sim }(s)\right \vert \leq 1,\ \mathrm{Re}s \geq 0\).

  (b) \({F}_{aX+b}^{\sim }(s) ={ \mathrm{e}}^{-bs}{F}^{\sim }(as)\).

  (c) For all positive integers k

    $${(-1)}^{k}{F{}^{\sim }}^{(k)}(s) =\int\limits_{0}^{\infty }{x}^{k}{\mathrm{e}}^{-sx}\mathrm{d}F(x),\ \ \mathrm{Re}s > 0.$$

    If the kth moment \({\mu }_{k} = \mathbf{E}\left ({X}^{k}\right )\) exists, then \({\mu }_{k} = {(-1)}^{k}{F{}^{\sim }}^{(k)}(0)\).

  (d) If the nonnegative random variables X and Y are independent, then

    $${F}_{X+Y }^{\sim }(s) = {F}_{ X}^{\sim }(s){F}_{ Y }^{\sim }(s).$$
  (e) For all continuity points x of the CDF F the inversion formula

    $$F(x) =\mathop{\lim }\limits_{ a \rightarrow \infty } \sum \limits_{n\leq ax}{(-1)}^{n}{F{}^{\sim }}^{(n)}(a)\frac{{a}^{n}} {n!}$$

    is true.

Covariance and Correlation Let X and Y be two random variables with finite variances \({\sigma }_{X}^{2}\) and \({\sigma }_{Y }^{2}\), respectively. The covariance between the pair of random variables (X, Y ) is defined as

$$\mathrm{cov}(X,Y ) = \mathbf{E}\left ((X -\mathbf{E}\left (X\right ))(Y -\mathbf{E}\left (Y \right ))\right ).$$

The covariance can be rewritten in the simple computational form

$$\mathrm{cov}(X,Y ) = \mathbf{E}\left (XY \right ) -\mathbf{E}\left (X\right )\mathbf{E}\left (Y \right ).$$

If the variances \({\sigma }_{X}^{2}\) and \({\sigma }_{Y }^{2}\) satisfy the conditions \(\mathbf{D}\left (X\right ) > 0,\ \mathbf{D}\left (Y \right ) > 0\), then the quantity

$$\mathrm{corr}(X,Y ) =\mathrm{ cov}\left (\frac{X -\mathbf{E}\left (X\right )} {\mathbf{D}\left (X\right )} , \frac{Y -\mathbf{E}\left (Y \right )} {\mathbf{D}\left (Y \right )} \right ) = \frac{\mathrm{cov}(X,Y )} {\mathbf{D}\left (X\right )\mathbf{D}\left (Y \right )}$$

is called the correlation between the pair of random variables (X, Y ).

Correlation can be used as a measure of the dependence between random variables. It is always true that

$$-1 \leq \mathrm{ corr}(X,Y ) \leq 1,$$

provided that the variances of random variables X and Y are finite and nonzero.

Proof.

By the Cauchy–Schwarz inequality, for all random variables U and V with finite second moments

$${(\mathbf{E}\left (UV \right ))}^{2} \leq \mathbf{E}\left ({U}^{2}\right )\mathbf{E}\left ({V }^{2}\right ),$$

therefore

$${(\mathrm{cov}(X,Y ))}^{2} \leq \mathbf{E}\left ({(X -\mathbf{E}\left (X\right ))}^{2}\right )\mathbf{E}\left ({(Y -\mathbf{E}\left (Y \right ))}^{2}\right ) ={ \mathbf{D}}^{2}\left (X\right ){\mathbf{D}}^{2}\left (Y \right ),$$

from which the inequality \(\left \vert \mathrm{corr}(X,Y )\right \vert \leq 1\) immediately follows. □ 

It can also be proved that the equality \(\left \vert \mathrm{corr}(X,Y )\right \vert = 1\) holds if and only if a linear relation exists between random variables X and Y with probability 1, that is, there are two constants a and b for which \(\mathbf{P}\left (Y = aX + b\right ) = 1\).

Both covariance and correlation play essential roles in multivariate statistical analysis. Let \(X ={ \left ({X}_{1},\ldots ,{X}_{n}\right )}^{\mathrm{T}}\) be a column vector whose n elements \({X}_{1},\ldots ,{X}_{n}\) are random variables. Here it should be noted that in probability theory and statistics usually column vectors are applied, but in queueing theory row vectors are used if Markov processes are considered. We define

$$\mathbf{E}\left (\mathbf{X}\right ) ={ \left (\mathbf{E}\left ({X}_{1}\right ),\ldots ,\mathbf{E}\left ({X}_{n}\right )\right )}^{\mathrm{T}},$$

provided that the expected values of components exist. The upper index T denotes the transpose of vectors or matrices. Similarly, if a matrix \(W = \left ({W}_{ij}\right ) \in {\mathbb{R}}^{k\times m}\) is given whose elements W ij are random variables of finite expected values, then we define

$$\mathbf{E}\left (W\right ) = \left (\mathbf{E}\left ({W}_{ij}\right )\right ),\ 1 \leq i \leq k,\ 1 \leq j \leq m.$$

If the variances of components of a random vector \(X ={ \left ({X}_{1},\ldots ,{X}_{k}\right )}^{\mathrm{T}}\) are finite, then the matrix

$$R = \mathbf{E}\left (\left (X -\mathbf{E}\left (X\right )\right ){\left (X -\mathbf{E}\left (X\right )\right )}^{\mathrm{T}}\right )$$
(1.4)

is called a covariance matrix of X. It can be seen that the (i, j) entries of matrix R are \({R}_{ij} =\mathrm{ cov}({X}_{i},{X}_{j}),\) which are the covariances between the random variables X i and X j .

The covariance matrix can also be defined when the components of X are complex-valued random variables, by replacing \({\left (X -\mathbf{E}\left (X\right )\right )}^{\mathrm{T}}\) in definition (1.4) with the conjugate transpose \({\left (X -\mathbf{E}\left (X\right )\right )}^{{_\ast}\mathrm{T}}\).

An important property of a covariance matrix R is that it is nonnegative definite, i.e., for all real k-dimensional column vectors \(z = {({z}_{1},\ldots ,{z}_{k})}^{\mathrm{T}}\) the inequality

$${z}^{\mathrm{T}}Rz \geq 0$$

holds (in the complex case \({z}^{\mathrm{T}}\) is replaced by the conjugate transpose \({z}^{{_\ast}\mathrm{T}}\)).

The matrix \(r = ({r}_{i,j})\) with components \({r}_{i,j} =\mathrm{ corr}({X}_{i},{X}_{j}),\ 1 \leq i,j \leq k\), is called a correlation matrix of random vector X.
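For a numerical feel (an added sketch; the linear construction of the components is an arbitrary assumption), the following Python code estimates a covariance and a correlation matrix and checks nonnegative definiteness:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

Z = rng.standard_normal((n, 3))
X = np.empty_like(Z)
X[:, 0] = Z[:, 0]
X[:, 1] = 0.8 * Z[:, 0] + 0.6 * Z[:, 1]   # corr(X1, X2) = 0.8 by construction
X[:, 2] = 2.0 * Z[:, 2] + 1.0             # uncorrelated with X1 and X2

R = np.cov(X, rowvar=False)        # sample covariance matrix, entries cov(Xi, Xj)
r = np.corrcoef(X, rowvar=False)   # sample correlation matrix, entries in [-1, 1]

print(np.round(R, 2))
print(np.round(r, 2))
print(np.linalg.eigvalsh(R) >= -1e-12)    # all eigenvalues nonnegative
```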

Conditional Expectation and Its Properties The notion of conditional expectation is defined with the help of results of set theory and measure theory. We present the general concept together with its important properties and illustrate the most important special cases.

Let \((\Omega ,\mathcal{A},\mathbf{P})\) be a fixed probability space, and let X be a random variable whose expected value exists. Let \(\mathcal{C}\) be an arbitrary sub-σ-algebra of \(\mathcal{A}\). We wish to define the conditional expectation \(Z = \mathbf{E}\left (X\vert \mathcal{C}\right )\) of X given \(\mathcal{C}\) as a \(\mathcal{C}\)-measurable random variable that satisfies the condition \(\mathbf{E}\left (\mathbf{E}\left (X\vert \mathcal{C}\right ){\mathcal{I}}_{\left \{C\right \}}\right ) = \mathbf{E}\left (X{\mathcal{I}}_{\left \{C\right \}}\right )\) for all \(C \in \mathcal{C}\). As a consequence of the Radon–Nikodym theorem, such a random variable Z exists and is unique with probability 1.

Definition 1.41.

Random variable Z is called the conditional expectation of X given σ-algebra \(\mathcal{C}\) if the following conditions hold:

  (a) Z is a \(\mathcal{C}\)-measurable random variable.

  (b) \(\mathbf{E}\left (\mathbf{E}\left (X\vert \mathcal{C}\right ){\mathcal{I}}_{\left \{C\right \}}\right ) = \mathbf{E}\left (X{\mathcal{I}}_{\left \{C\right \}}\right )\) for all \(C \in \mathcal{C}\).

Definition 1.42.

Let \(A \in \mathcal{A}\) be an event. The random variable \(\mathbf{P}\left (A\vert \mathcal{C}\right ) = \mathbf{E}\left ({\mathcal{I}}_{\left \{A\right \}}\vert \mathcal{C}\right )\) is called the conditional probability of event A given σ-algebra \(\mathcal{C}\).

Important Properties of Conditional Expectation Let \(\mathcal{C}\), \({\mathcal{C}}_{1}\), and \({\mathcal{C}}_{2}\) be sub-σ-algebras of \(\mathcal{A}\), and let X, X 1, and X 2 be random variables with finite expected values. Then the following relations hold with probability 1:

  1. \(\mathbf{E}\left (\mathbf{E}\left (X\vert \mathcal{C}\right )\right ) = \mathbf{E}\left (X\right ).\)

  2. \(\mathbf{E}\left (cX\vert \mathcal{C}\right ) = c\mathbf{E}\left (X\vert \mathcal{C}\right )\) for any constant c.

  3. If \({\mathcal{C}}_{0} =\{ \oslash ,\Omega \}\) is the trivial σ-algebra, then \(\mathbf{E}\left (X\vert {\mathcal{C}}_{0}\right ) = \mathbf{E}\left (X\right ).\)

  4. If \({\mathcal{C}}_{1}{\subset \mathcal{C}}_{2}\), then \(\mathbf{E}\left (\mathbf{E}\left (X\vert {\mathcal{C}}_{1}\right )\vert {\mathcal{C}}_{2}\right ) = \mathbf{E}\left (\mathbf{E}\left (X\vert {\mathcal{C}}_{2}\right )\vert {\mathcal{C}}_{1}\right ) = \mathbf{E}\left (X\vert {\mathcal{C}}_{1}\right ).\)

  5. If random variable X does not depend on the σ-algebra \(\mathcal{C}\), i.e., if for all Borel sets \(D \in \mathcal{B}\) and for all events \(A \in \mathcal{C}\) the equality \(\mathbf{P}\left (X \in D,A\right ) = \mathbf{P}\left (X \in D\right )\mathbf{P}\left (A\right )\) holds, then \(\mathbf{E}\left (X\vert \mathcal{C}\right ) = \mathbf{E}\left (X\right )\).

  6. \(\mathbf{E}\left ({X}_{1} + {X}_{2}\vert \mathcal{C}\right ) = \mathbf{E}\left ({X}_{1}\vert \mathcal{C}\right ) + \mathbf{E}\left ({X}_{2}\vert \mathcal{C}\right ).\)

  7. If the random variable \({X}_{1}\) is \(\mathcal{C}\)-measurable, then \(\mathbf{E}\left ({X}_{1}{X}_{2}\vert \mathcal{C}\right ) = {X}_{1}\mathbf{E}\left ({X}_{2}\vert \mathcal{C}\right ).\)

Definition 1.43.

Let Y be a random variable, and denote by \({\mathcal{C}}_{Y }\) the σ-algebra generated by random variable Y, i.e., let \({\mathcal{C}}_{Y }\) be the minimal sub-σ-algebra of \(\mathcal{A}\) for which Y is \({\mathcal{C}}_{Y }\)-measurable. The random variable \(\mathbf{E}\left (X\vert Y \right ) = \mathbf{E}\left (X\vert {\mathcal{C}}_{Y }\right )\) is called the conditional expectation of X given random variable Y.

Main Properties of Conditional Expectation First, consider the case where random variable Y is discrete and takes values in the set \(\mathcal{Y} =\{ {y}_{1},\ldots ,{y}_{n}\}\) with \(\mathbf{P}\left (Y = {y}_{i}\right ) > 0,\ 1 \leq i \leq n\). We then define the events \({C}_{i} =\{ Y = {y}_{i}\},\ 1 \leq i \leq n\). It is clear that the collection of events \(\{{C}_{1},\ldots ,{C}_{n}\}\) forms a complete system of events, i.e., they are mutually exclusive, \(\mathbf{P}\left ({C}_{i}\right ) > 0,\ 1 \leq i \leq n\), and \(\mathbf{P}\left ({C}_{1}\right ) + \ldots + \mathbf{P}\left ({C}_{n}\right ) = 1\). The σ-algebra \({\mathcal{C}}_{Y } = \sigma ({C}_{1},\ldots ,{C}_{n}) \subset \mathcal{A}\), which is generated by random variable Y, consists of all unions of the events \({C}_{1},\ldots ,{C}_{n}\), together with the empty set. Note that here we can write “algebra” instead of “σ-algebra” because the set \(\{{C}_{1},\ldots ,{C}_{n}\}\) is finite. Since the events \({C}_{i}\) have positive probability, the conditional expectations

$$\mathbf{E}\left (X\vert {C}_{i}\right ) = \frac{\mathbf{E}\left (X{\mathcal{I}}_{\left \{{C}_{i}\right \}}\right )} {\mathbf{P}\left ({C}_{i}\right )}$$

are well defined.

Theorem 1.44.

The conditional expectation \(\mathbf{E}\left (X\vert {\mathcal{C}}_{Y }\right )\) satisfies the relation

$$\mathbf{E}\left (X\vert {\mathcal{C}}_{Y }\right ) = \mathbf{E}\left (X\vert {\mathcal{C}}_{Y }\right )(\omega ) = \sum \limits_{k=1}^{n}\mathbf{E}\left (X\vert {C}_{ k}\right ){\mathcal{I}}_{\left \{{C}_{k}\right \}}\text{ with probability }1.$$
(1.5)

Note that Eq. (1.5) can also be rewritten in the form

$$\mathbf{E}\left (X\vert Y \right ) = \mathbf{E}\left (X\vert Y \right )(\omega ) = \sum \limits_{k=1}^{n}\mathbf{E}\left (X\vert Y = {y}_{ k}\right ){\mathcal{I}}_{\left \{Y ={y}_{k}\right \}}.$$
(1.6)

Proof.

Since the relation

$$\{\mathbf{E}\left (X\vert {\mathcal{C}}_{Y }\right ) \leq x\} = \cup \{{C}_{i} :\ \mathbf{E}\left (X\vert {C}_{i}\right ) \leq x\} \in {\mathcal{C}}_{Y }$$

holds for all \(x \in \mathbb{R}\), the random variable \(\mathbf{E}\left (X\vert {\mathcal{C}}_{Y }\right )\) is \({\mathcal{C}}_{Y }\)-measurable. On the other hand, if \(C \in {\mathcal{C}}_{Y },\ C\neq \varnothing \), then \(C = \cup \{{C}_{i} :\ i \in K\}\) holds with an appropriately chosen set of indices \(K \subset \{ 1,\ldots ,n\}\), and we obtain

$$\begin{array}{rcl} \mathbf{E}\left (\mathbf{E}\left (X\vert {\mathcal{C}}_{Y }\right ){\mathcal{I}}_{\left \{C\right \}}\right )& =& \mathbf{E}\left ( \sum \limits_{k\in K}\mathbf{E}\left (X\vert {C}_{k}\right ){\mathcal{I}}_{\left \{{C}_{k}\right \}}\right ) \\ & =& \sum \limits_{k\in K}\mathbf{E}\left (X\vert {C}_{k}\right )\mathbf{P}\left ({C}_{k}\right ) = \sum \limits_{k\in K}\mathbf{E}\left (X{\mathcal{I}}_{\left \{{C}_{k}\right \}}\right ) = \mathbf{E}\left (X{\mathcal{I}}_{\left \{C\right \}}\right ).\end{array}$$

If \(C = \varnothing \), then \(\mathbf{E}\left (\mathbf{E}\left (X\vert {\mathcal{C}}_{Y }\right ){\mathcal{I}}_{\left \{C\right \}}\right ) = \mathbf{E}\left (X{\mathcal{I}}_{\left \{C\right \}}\right ) = 0\). Thus we have proved that the random variable defined in (1.5) satisfies all the required properties of conditional expectation. □ 

Comment 1.45.

From expression (1.6) the following relation can be obtained:

$$\mathbf{E}\left (X\right ) = \mathbf{E}\left (\mathbf{E}\left (X\vert Y \right )\right ) =\int\limits_{-\infty }^{\infty }\mathbf{E}\left (X\vert Y = y\right )\mathrm{d}{F}_{ Y }(y).$$
(1.7)

This relation remains valid if, instead of the finite set \(\mathcal{Y} =\{ {y}_{1},\ldots ,{y}_{n}\}\) , we choose a countably infinite set \(\mathcal{Y} =\{ {y}_{i},\ i \in I\}\) for which \(\mathbf{P}\left (Y = {y}_{i}\right ) > 0\) , i ∈ I.

Comment 1.46.

Define the function g by the relation

$$g(y) = \left \{\begin{array}{cc} \mathbf{E}\left (X\vert Y = {y}_{k}\right ),&\text{ if }y = {y}_{k}\text{ for an index k,} \\ 0, & \text{ otherwise.} \end{array} \right.$$
(1.8)

Then, using formula (1.6), the conditional expectation of X given Y can be obtained with the help of the function g as follows:

$$\mathbf{E}\left (X\vert Y \right ) = g(Y )$$
(1.9)

with probability 1.
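For a discrete Y the regression function g can be estimated simply by group averages; the following sketch (an added illustration with an assumed model in which \(\mathbf{E}\left (X\vert Y = y\right ) = y\)) also checks property 1, \(\mathbf{E}\left (\mathbf{E}\left (X\vert Y \right )\right ) = \mathbf{E}\left (X\right )\):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

Y = rng.integers(1, 4, size=n)        # discrete Y with values in {1, 2, 3}
X = Y + rng.standard_normal(n)        # X depends on Y, so E(X | Y = y) = y

# estimate of the regression function g(y) = E(X | Y = y) by group averages
g = {y: X[Y == y].mean() for y in (1, 2, 3)}
print(g)                              # close to {1: 1.0, 2: 2.0, 3: 3.0}

# E(X|Y) is the random variable g(Y); averaging it recovers E(X)
EX_tower = sum(g[y] * np.mean(Y == y) for y in (1, 2, 3))
print(EX_tower, X.mean())             # both close to E(X) = 2
```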

Continuous Random Variables (X,Y ) Consider a pair of random variables (X, Y ) having joint density \({f}_{X,Y }(x,y)\) and marginal densities \({f}_{X}(x)\) and \({f}_{Y }(y)\), respectively. Then the conditional density \({f}_{X\vert Y }(x\vert y)\) exists and, according to (1.1), can be defined as

$${f}_{X\vert Y }(x\vert y) = \left \{\begin{array}{cc} \frac{{f}_{XY }(x,y)} {{f}_{Y }(y)} ,&\mbox{ if }{f}_{Y }(y) > 0, \\ 0, & \mbox{ otherwise }. \end{array} \right.$$

Define \(g(y) = \mathbf{E}\left (X\vert Y = y\right ) =\int\limits_{-\infty }^{\infty }x{f}_{X\vert Y }(x\vert y)\mathrm{d}x\). Then the conditional expectation of X given Y can be determined with probability 1 as follows:

$$\mathbf{E}\left (X\vert Y \right ) = g(Y ),$$

and so we can define

$$\mathbf{E}\left (X\vert Y = y\right ) = g(y).$$

Proof.

It is clear that g(Y ) is a \({\mathcal{C}}_{Y }\)-measurable random variable; therefore, it is enough to prove that the equality

$$\mathbf{E}\left (\mathbf{E}\left (X\vert Y \right ){\mathcal{I}}_{\left \{Y \in D\right \}}\right ) = \mathbf{E}\left (X{\mathcal{I}}_{\left \{Y \in D\right \}}\right )$$

holds for all Borel sets D of the real line. It is not difficult to see that

$$\begin{array}{rcl} \mathbf{E}\left (\mathbf{E}\left (X\vert Y \right ){\mathcal{I}}_{\left \{Y \in D\right \}}\right )& =& \mathbf{E}\left (g(Y ){\mathcal{I}}_{\left \{Y \in D\right \}}\right ) \\ & =& \int\limits_{D}\int\limits_{-\infty }^{\infty }x\ \frac{{f}_{XY }(x,y)} {{f}_{Y }(y)} \ {f}_{Y }(y)\mathrm{d}x\mathrm{d}y \\ & =& \int\limits_{D}\int\limits_{-\infty }^{\infty }x{f}_{ XY }(x,y)\mathrm{d}x\mathrm{d}y \\ \end{array}$$

and, on the other hand,

$$\mathbf{E}\left (X{\mathcal{I}}_{\left \{Y \in D\right \}}\right ) =\int\limits_{D}\int\limits_{-\infty }^{\infty }x{f}_{ XY }(x,y)\mathrm{d}x\mathrm{d}y.$$

 □ 

Comment 1.47.

In the case where a pair of random variables has a joint normal distribution, the conditional expectation \(\mathbf{E}\left (X\vert Y \right )\) is a linear function of random variable Y with probability 1, that is, the regression function g is a linear function and the relation

$$\mathbf{E}\left (X\vert Y \right ) = \mathbf{E}\left (X\right ) + \frac{\mathrm{cov}(X,Y )} {{\mathbf{D}}^{2}\left (Y \right )} (Y -\mathbf{E}\left (Y \right ))$$

holds.

General Case By the definition of conditional expectation, \(\mathbf{E}\left (X\vert Y \right )\) is \({\mathcal{C}}_{Y }\)-measurable; therefore, there is a Borel-measurable function g such that \(\mathbf{E}\left (X\vert Y \right )\) can be given with probability 1 in the form

$$\mathbf{E}\left (X\vert Y \right ) = g(Y ).$$
(1.10)

This relation makes it possible to give the conditional expectation \(\mathbf{E}\left (X\vert Y = y\right )\) as the function

$$\mathbf{E}\left (X\vert Y = y\right ) = g(y),$$

which is called a regression function. It is clear that the regression function is not necessarily unique and is determined on a Borel set of the real line D satisfying the condition \(\mathbf{P}\left (Y \in D\right ) = 1\).

Comment 1.48.

Let X and Y be two random variables. Assume that X has finite variance. Consider the quadratic distance \(\mathbf{E}\left ({\left [X - h(Y )\right ]}^{2}\right )\) over the set \({\mathcal{H}}_{Y }\) of all Borel-measurable functions h for which h(Y) has finite variance. Then the assertion

$$\min \left \{\mathbf{E}\left ({\left [X - h(Y )\right ]}^{2}\right ) :\ h \in {\mathcal{H}}_{ Y }\right \} = \mathbf{E}\left ({\left [X - g(Y )\right ]}^{2}\right )$$

holds. This relation implies that the best approximation of X by Borel-measurable functions of Y in quadratic mean is the regression \(\mathbf{E}\left (X\vert Y \right ) = g(Y )\).

Formula of Total Expected Value A useful formula can be given to compute the expected value of random variable X if the regression function \(\mathbf{E}\left (X\vert Y = y\right )\) can be determined.

Making use of relation 1 given as a general property of conditional expectation and Eq. (1.10), it is clear that

$$\begin{array}{rcl} \mathbf{E}\left (X\right )& =& \mathbf{E}\left (\mathbf{E}\left (X\vert Y \right )\right ) = \mathbf{E}\left (g(Y )\right ) \\ & =& \int\limits_{-\infty }^{\infty }g(y)\mathrm{d}{F}_{ Y }(y) =\int\limits_{-\infty }^{\infty }\mathbf{E}\left (X\vert Y = y\right )\mathrm{d}{F}_{ Y }(y).\end{array}$$

From this relation we have the so-called formula of total expected value. If random variable Y has discrete or continuous distributions, then we have the formulas

$$\mathbf{E}\left (X\right ) = \sum \limits_{i\in I}\mathbf{E}\left (X\vert Y = {y}_{i}\right )\mathbf{P}\left (Y = {y}_{i}\right )$$

and

$$\mathbf{E}\left (X\right ) =\int\limits_{-\infty }^{\infty }\mathbf{E}\left (X\vert Y = y\right ){f}_{ Y }(y)\mathrm{d}y.$$
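The continuous form of the formula can be checked symbolically (an added sketch under an assumed hierarchical model: given Y = y the variable X is Poisson(y), so \(\mathbf{E}\left (X\vert Y = y\right ) = y\), and Y is Exp(1)):

```python
import sympy as sp

y = sp.symbols('y', positive=True)

f_Y = sp.exp(-y)   # PDF of Y ~ Exp(1)
g = y              # regression function E(X | Y = y) for X | Y = y ~ Poisson(y)

EX = sp.integrate(g * f_Y, (y, 0, sp.oo))
print(EX)          # 1, i.e., E(X) = E(E(X|Y)) = E(Y) = 1
```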

2 Frequently Used Discrete and Continuous Distributions

In this part we consider some frequently used distributions and give their definitions and important characteristics. In addition to the formal description of the distributions, we give appropriate mathematical models that lead to a given distribution. If the distribution function of a random variable is given as a function \({F}_{X}(x;{a}_{1},\ldots ,{a}_{n})\) depending on a positive integer n and constants \({a}_{1},\ldots ,{a}_{n}\), then \({a}_{1},\ldots ,{a}_{n}\) are called the parameters of the distribution function \({F}_{X}\).

2.1 Discrete Distributions

Bernoulli Distribution  \(Be(p),\ 0 \leq p \leq 1\). The distribution of a random variable X with values \(\left \{0,1\right \}\) is called a Bernoulli distribution if

$${p}_{k} = \mathbf{P}\left (X = k\right ) = \left \{\begin{array}{c} \ \ \ \ p,\ \ \ \ \text{ if }k = 1,\\ 1 - p,\ \ \text{ if } k =0. \end{array} \right.$$
$$\begin{array}{ll} \text{ Expected value and variance: } &\mathbf{E}\left (X\right ) = p,\ \ {\mathbf{D}}^{2}\left (X\right ) = p(1 - p); \\ \text{ Generating function: }&1 - p + pz; \\ \text{ Characteristic function: } &1 - p + p{\mathrm{e}}^{it}.\end{array}$$

Example.

Let X be the number of heads appearing in one toss of a coin, where

$$p = \mathbf{P}\left (\text{ head appearing in a toss}\right ).$$

Then X has a Be(p) distribution.

Binomial Distribution B(n, p). The distribution of a discrete random variable X with values \(\{0,1,\ldots ,n\}\) is called binomial with the parameters n and p,  0 < p < 1, if its PDF is

$${p}_{k} = \mathbf{P}\left (X = k\right ) = \left ({ n \atop k} \right ){p}^{k}{(1 - p)}^{n-k},\ \ k = 0,1,\ldots ,n.$$
$$\begin{array}{cc} \text{ Expected value and variance: } &\mathbf{E}\left (X\right ) = np,\ \ \ {\mathbf{D}}^{2}\left (X\right ) = np(1 - p); \\ \text{ Generating function: }& G(z) = {(pz + (1 - p))}^{n}; \\ \text{ Characteristic function: } & \varphi (t) = {(1 + p({\mathrm{e}}^{it} - 1))}^{n}.\end{array}$$

Example.

Consider an experiment in which we observe whether an event A with probability \(p = \mathbf{P}\left (A\right ),\ 0 < p < 1\), occurs (success) or not (failure). Repeating the experiment n times independently, define random variable X as the frequency of event A. Then the random variable has a B(n, p) PDF.

Note that if the Be(p) random variables \({X}_{1},\ldots ,{X}_{n}\) are independent, then the random variable \(X = {X}_{1} + \ldots + {X}_{n}\) has a B(n, p) distribution.

Polynomial Distribution The PDF of a random vector \(X = {({X}_{1},\ldots ,{X}_{k})}^{\mathrm{T}}\) taking values in the set \(\{({n}_{1},\ldots ,{n}_{k}) :\ {n}_{i} \geq 0,\ {n}_{1} + \ldots + {n}_{k} = n\}\) is called polynomial with the parameters n and \({p}_{1},\ldots ,{p}_{k}\ ({p}_{i} > 0,\ {p}_{1} + \ldots + {p}_{k} = 1)\) if X has a PDF

$${p}_{{n}_{1},\ldots ,{n}_{k}} = \mathbf{P}\left ({X}_{1} = {n}_{1},\ldots ,{X}_{k} = {n}_{k}\right ) = \frac{n!} {{n}_{1}!\ldots {n}_{k}!}{p}_{1}^{{n}_{1} }\ldots {p}_{k}^{{n}_{k} }.$$

Note that each coordinate variable \({X}_{i}\) of random vector X has a \(B(n,{p}_{i})\) binomial distribution whose expected value and variance are \(n{p}_{i}\) and \(n{p}_{i}(1 - {p}_{i})\).

$$\begin{array}{lc} \text{ Expected value: } & \mathbf{E}\left (X\right ) = {(n{p}_{1},\ldots ,n{p}_{k})}^{\mathrm{T}}; \\ \text{ Covariance matrix: } &R = {({R}_{ij})}_{1\leq i,j\leq k}\text{ , where }{R}_{ij} = \left \{\begin{array}{c} n{p}_{i}(1 - {p}_{i}),\text{ if }i = j, \\ \ \ -n{p}_{i}{p}_{j},\ \ \ \ \ \text{ if }i\neq j; \end{array} \right. \\ \text{ Characteristic function: }& \varphi ({t}_{1},\ldots ,{t}_{k}) = {({p}_{1}{\mathrm{e}}^{i{t}_{1}} + \ldots + {p}_{k}{\mathrm{e}}^{i{t}_{k}})}^{n}.\end{array}$$

Example.

Let \({A}_{1},\ldots ,{A}_{k}\) be k disjoint events for which \({p}_{i} = \mathbf{P}({A}_{i}) > 0,\ {p}_{1} + \ldots + {p}_{k} = 1.\) Consider an experiment with possible outcomes \({A}_{1},\ldots ,{A}_{k}\) and repeat it n times independently. Denote by X i the frequency of event A i in the series of n observations. Then the distribution of X is polynomial with the parameters n and \(\ {p}_{1},\ldots ,{p}_{k}\).

Geometric Distribution The distribution of a random variable X taking values in \(\left \{1,2,\ldots \right \}\) is called geometric with the parameter p,  0 < p < 1, if its PDF is

$${p}_{k} = \mathbf{P}\left (X = k\right ) = {(1 - p)}^{k-1}p,\ k = 1,2,\ldots \,.$$
$$\begin{array}{ll} \text{ Expected value and variance: } &\mathbf{E}\left (X\right ) = \frac{1} {p},\ \ \ {\mathbf{D}}^{2}\left (X\right ) = \frac{1-p} {{p}^{2}} ; \\ \text{ Generating function: }&G(z) = \frac{pz} {1-(1-p)z}; \\ \text{ Characteristic function: } &\varphi (t) = \frac{p} {1-(1-p){\mathrm{e}}^{it}}.\end{array}$$

Theorem 1.49.

If X has a geometric distribution, then X has a so-called memoryless property, that is, for all nonnegative integers i, j the following relation holds:

$$\mathbf{P}\left (X > i + j\vert X > i\right ) = \mathbf{P}\left (X > j\right ).$$

Proof.

It is easy to verify that for k ≥ 0

$$\begin{array}{rcl} \mathbf{P}\left (X > k\right )& =& \sum \limits_{\mathcal{l}=k+1}^{\infty }\mathbf{P}\left (X = \mathcal{l}\right ) = \sum \limits_{\mathcal{l}=k+1}^{\infty }{(1 - p)}^{\mathcal{l}-1}p \\ & =& {(1 - p)}^{k}p \sum \limits_{\mathcal{l}=0}^{\infty }{(1 - p)}^{\mathcal{l}} = {(1 - p)}^{k}\text{ ;} \\ \end{array}$$

therefore,

$$\begin{array}{rcl} \mathbf{P}\left (X > i + j\vert X > i\right )& =& \frac{\mathbf{P}\left (X > i + j,X > i\right )} {\mathbf{P}\left (X > i\right )} \\ & =& \frac{\mathbf{P}\left (X > i + j\right )} {\mathbf{P}\left (X > i\right )} \\ & =& \frac{{(1 - p)}^{i+j}} {{(1 - p)}^{i}} = {(1 - p)}^{j} = \mathbf{P}\left (X > j\right ),\ \ j = 0,1,\ldots \,.\end{array}$$

 □ 

Note that a geometric distribution is sometimes defined on the set \(\left \{0,1,2,\ldots \right \}\) instead of \(\left \{1,2,\ldots \right \}\); in this case, the PDF is determined by

$${p}_{k} = {(1 - p)}^{k}p,\ k = 0,1,2,\ldots \,.$$

Example.

Consider a sequence of independent experiments and observe in each step whether an event A, p = P(A) > 0, occurs (success) or not (failure). If the event occurs first in the kth step, then define the random variable as X = k. In other words, let X be the number of Bernoulli trials up to and including the first success. Then random variable X has a geometric distribution with the parameter p.
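The memoryless property of Theorem 1.49 can be observed in simulation (an added sketch; the parameter choices below are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 0.3, 1_000_000
X = rng.geometric(p, size=n)       # values in {1, 2, ...}, P(X = k) = (1-p)^(k-1) p

i, j = 4, 3
lhs = np.mean(X[X > i] > i + j)    # P(X > i + j | X > i)
rhs = np.mean(X > j)               # P(X > j)
print(f"{lhs:.4f} ~ {rhs:.4f} ~ {(1 - p)**j:.4f}")   # all close to (1-p)^j
```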

Negative Binomial Distribution The distribution of random variable X taking values in \(\left \{r,r + 1,\ldots \right \}\) is called a negative binomial distribution with the parameters (p, r),  0 < p < 1, r a positive integer, if

$${p}_{k} = \mathbf{P}\left (X = k + r\right ) = \left ({ r + k - 1 \atop k} \right ){(1 - p)}^{k}{p}^{r},\ \ k = 0,1,\ldots \,.$$
$$\begin{array}{cc} \text{ Expected value and variance: } &\mathbf{E}\left (X\right ) = r\frac{1} {p},\ \ \ {\mathbf{D}}^{2}\left (X\right ) = r\frac{1-p} {{p}^{2}} ; \\ \text{ Generating function: }& G(z) ={ \left ( \frac{pz} {1-(1-p)z}\right )}^{r}; \\ \text{ Characteristic function: } & \varphi (t) ={ \left ( \frac{p{\mathrm{e}}^{it}} {1-(1-p){\mathrm{e}}^{it}}\right )}^{r}.\end{array}$$

Example.

Let p,  0 < p < 1, and the positive integer r be two given constants. Suppose that we are given a coin that has a probability p of coming up heads. Toss the coin repeatedly until the rth head appears and define by X the number of tosses. Then random variable X has a negative binomial distribution with parameters (p, r).

Note that from this example it immediately follows that X has a geometric distribution with the parameter p when r = 1.

Poisson Distribution The PDF of a random variable X is called a Poisson distribution with the parameter λ (λ > 0) if X takes values in \(\left \{0,1,\ldots \right \}\) and

$${p}_{k} = \mathbf{P}\left (X = k\right ) = \frac{{\lambda }^{k}} {k!}{ \mathrm{e}}^{-\lambda },\ \ k = 0,1,\ldots \,.$$
$$\begin{array}{cc} \text{ Expected value and variance: } &\mathbf{E}\left (X\right ) = \lambda ,\ \ \ {\mathbf{D}}^{2}\left (X\right ) = \lambda ; \\ \text{ Generating function: }& G(z) ={ \mathrm{e}}^{\lambda (z-1)}; \\ \text{ Characteristic function: } & \varphi (t) ={ \mathrm{e}}^{\lambda ({\mathrm{e}}^{it}-1) }.\end{array}$$

The following theorem establishes that a binomial distribution can be approximated with a Poisson distribution with the parameter λ when the parameters (p, n) of the binomial distribution satisfy the condition \(np \rightarrow \lambda \), \(n \rightarrow \infty \).

Theorem 1.50.

Consider a binomial distribution with the parameter \(\left (p,n\right )\) . Assume that for a fixed constant λ, λ > 0, the convergence \(np \rightarrow \lambda \), \(n \rightarrow \infty \) , holds; then the limit of probabilities satisfies the relation

$$\left ({ n \atop k} \right ){p}^{k}{(1 - p)}^{n-k} \rightarrow \frac{{\lambda }^{k}} {k!}{ \mathrm{e}}^{-\lambda },\ \ k = 0,1,\ldots \,.$$

Proof.

For any fixed integer k ≥ 0 we have

$$\left ({ n \atop k} \right ){p}^{k}{(1 - p)}^{n-k} = \frac{(np)((n - 1)p)\ldots ((n - k + 1)p)} {k!}{ \mathrm{e}}^{(n-k)\log (1-p)}.$$

Since \(np \rightarrow \lambda \) as \(n \rightarrow \infty \), we have \(p \rightarrow 0\), and we obtain

$$\frac{(np)((n - 1)p)\ldots ((n - k + 1)p)} {1 \cdot 2 \cdot \ldots \cdot k} \rightarrow \frac{{\lambda }^{k}} {k!} ,\ \ np \rightarrow \lambda.$$

On the other hand, if \(p \rightarrow 0\), then we get the asymptotic relation log(1 − p) =  − p + o(p). Consequently,

$$(n - k)\log (1 - p) = -(n - k)(p + o(p)) \rightarrow -\lambda ,\ \ \ np \rightarrow \lambda ,\ n \rightarrow \infty ;$$

therefore, using the last two asymptotic relations, the assertion of the theorem immediately follows. □ 
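The speed of this approximation can be inspected numerically (an added sketch; the choice λ = 2 is an arbitrary assumption):

```python
from math import comb, exp, factorial

lam = 2.0
for n in (10, 100, 1000):
    p = lam / n   # so that np = lambda exactly
    binom = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(6)]
    poisson = [lam**k / factorial(k) * exp(-lam) for k in range(6)]
    err = max(abs(b - q) for b, q in zip(binom, poisson))
    print(f"n = {n:5d}: max |binomial - Poisson| over k <= 5 is {err:.5f}")
```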

2.2 Continuous Distributions

Uniform Distribution Let a, b (a < b) be two real numbers. The distribution of random variable X is called uniform on the interval (a, b) if its PDF is given by

$$f(x) = \left \{\begin{array}{c} \frac{1} {b-a},\text{ if }x \in (a,b), \\ \ \ 0,\ \ \ \text{ if }x\notin (a,b). \end{array} \right.$$
$$\begin{array}{cc} \text{ Expected value and variance: } &\mathbf{E}\left (X\right ) = \frac{a+b} {2} ,\ \ \ {\mathbf{D}}^{2}\left (X\right ) = \frac{{(b-a)}^{2}} {12} ; \\ \text{ Characteristic function: }& \varphi (t) = \frac{1} {b-a} \frac{{\mathrm{e}}^{itb}-{\mathrm{e}}^{ita}} {it}.\end{array}$$

Note that if X has a uniform distribution on the interval (a, b), then the random variable \(Y = \frac{X-a} {b-a}\) is distributed uniformly on the interval (0, 1).

Exponential Distribution Exp(λ),  λ > 0. The distribution of a random variable X is called exponential with the parameter λ,  λ > 0, if its PDF is

$$f(x) = \left \{\begin{array}{c} \lambda {\mathrm{e}}^{-\lambda x},\text{ if }x > 0, \\ \ \ \ \ 0,\ \text{ if }x \leq 0. \end{array} \right.$$
$$\begin{array}{cc} \text{ Expected value and variance: } &\mathbf{E}\left (X\right ) = \frac{1} {\lambda },\ \ \ {\mathbf{D}}^{2}\left (X\right ) = \frac{1} {{\lambda }^{2}} ; \\ \text{ Characteristic function: }& \varphi (t) = \frac{\lambda } {\lambda -it}.\end{array}$$

The Laplace and Laplace–Stieltjes transforms of the density and distribution function of an Exp(λ) distribution are determined as

$$\mathbf{E}\left ({\mathrm{e}}^{-sX}\right ) = {f}^{{_\ast}}(s) = {F}^{\sim }(s) = \frac{\lambda } {s + \lambda }.$$

The exponential distribution, similarly to the geometric distribution, has the memoryless property.

Theorem 1.51.

For arbitrary constants t,s > 0 the relation

$$\mathbf{P}\left (X > t + s\vert X > t\right ) = \mathbf{P}\left (X > s\right )$$

holds.

Proof.

It is clear that

$$\begin{array}{rcl} \mathbf{P}\left (X > t + s\vert X > t\right )& =& \frac{\mathbf{P}\left (X > t + s,X > t\right )} {\mathbf{P}\left (X > t\right )} \\ & =& \frac{\mathbf{P}\left (X > t + s\right )} {\mathbf{P}\left (X > t\right )} = \frac{{\mathrm{e}}^{-\lambda (t+s)}} {{\mathrm{e}}^{-\lambda t}} ={ \mathrm{e}}^{-\lambda s}.\end{array}$$

 □ 

Hyperexponential Distribution Let the distribution of random variable X be a mixture of exponential distributions with the parameters \({\lambda }_{1},\ldots ,{\lambda }_{n}\) and with weights \({a}_{1},\ldots ,{a}_{n}\ ({a}_{k} > 0,\ {a}_{1} + \ldots + {a}_{n} = 1)\). Then the PDF

$$f(x) = \left \{\begin{array}{c} \sum \limits_{k=1}^{n}{a}_{k}{\lambda }_{k}{\mathrm{e}}^{-{\lambda }_{k}x},\text{ if }x > 0, \\ \ \ \ \ \ \ \ \ 0,\ \text{ if }x \leq 0, \end{array} \right.$$

of random variable X is called hyperexponential.

$$\begin{array}{lc} \text{ Expected value and variance: } &\mathbf{E}\left (X\right )= \sum \limits_{k=1}^{n}\frac{{a}_{k}} {{\lambda }_{k}},\ \ \ {\mathbf{D}}^{2}\left (X\right )=2 \sum \limits_{k=1}^{n} \frac{{a}_{k}} {{\lambda }_{k}^{2}} -{\left ( \sum \limits_{k=1}^{n}\frac{{a}_{k}} {{\lambda }_{k}}\right )}^{2}; \\ \text{ Characteristic function: }& \varphi (t) = \sum \limits_{k=1}^{n}{a}_{k} \frac{{\lambda }_{k}} {{\lambda }_{k}-it}.\end{array}$$

Denote by \(\Gamma (x) =\int\limits_{0}^{\infty }{y}^{x-1}{\mathrm{e}}^{-y}\mathrm{d}y,\ x > 0,\) the well-known gamma function \(\Gamma \) of analysis, which is necessary for the definition of the gamma distribution.

Gamma Distribution Gamma(α, λ),  α, λ > 0.

The distribution of a random variable X is called a gamma distribution with the parameters α, λ > 0, if its PDF is

$$f(x) = \left \{\begin{array}{c} \frac{{\lambda }^{\alpha }} {\Gamma (\alpha )}{x}^{\alpha -1}{\mathrm{e}}^{-\lambda x},\text{ if }x > 0, \\ \ \ \ \ \ \ \ \ \ \ \ 0,\ \ \ \ \ \ \text{ if }x \leq 0. \end{array} \right.$$
$$\begin{array}{cc} \text{ Expected value and variance: } &\mathbf{E}\left (X\right ) = \frac{\alpha } {\lambda },\ \ \ {\mathbf{D}}^{2}\left (X\right ) = \frac{\alpha } {{\lambda }^{2}} ; \\ \text{ Characteristic function: }& \varphi (t) ={ \left ( \frac{\lambda } {\lambda -it}\right )}^{\alpha }.\end{array}$$

Comment 1.52.

A gamma distribution with the parameters α = n, λ = nμ, where n is a positive integer, is called an Erlang distribution.

Comment 1.53.

If the independent identically distributed random variables \({X}_{1},{X}_{2},\ldots \) have an exponential distribution with the parameter λ, then the distribution of the sum \(Z = {X}_{1} + \ldots + {X}_{n}\) is a gamma distribution with the parameter (n,λ). This relation is easy to see because the characteristic function of an exponential distribution with the parameter λ is (1 − it∕λ) −1 ; then the characteristic function of its nth convolution power is (1 − it∕λ) −n , which equals the characteristic function of a Gamma(n,λ) distribution.

Beta Distribution Beta(a, b),  a, b > 0. The distribution of random variable X is called a beta distribution if its PDF is

$$f(x) = \left \{\begin{array}{c} \frac{\Gamma (a+b)} {\Gamma (a)\Gamma (b)}{x}^{a-1}{(1 - x)}^{b-1},\text{ if }x \in (0,1), \\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0,\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{ if }x\notin (0,1). \end{array} \right.$$
$$\begin{array}{cc} \text{ Expected value and variance: }&\mathbf{E}\left (X\right ) = \frac{a} {a+b},\ \ \ {\mathbf{D}}^{2}\left (X\right ) = \frac{ab} {{(a+b)}^{2}(a+b+1)}; \\ \text{ Characteristic function in the}\ & \\ \text{ form of power series: } & \varphi (t) = \frac{\Gamma (a+b)} {\Gamma (a)} \sum \limits_{k=0}^{\infty }\frac{{(it)}^{k}} {k!} \frac{\Gamma (a+k)} {\Gamma (a+b+k)}.\end{array}$$

Gaussian (Also Called Normal) Distribution \(N(\mu ,\sigma ),\ -\infty < \mu < \infty ,\ 0 < \sigma < \infty \). The distribution of random variable X is called Gaussian with the parameters (μ,σ) if it has a PDF

$$f(x) = \frac{1} {\sqrt{2\pi }\sigma }{\mathrm{e}}^{-{(x-\mu )}^{2}/2{\sigma }^{2} },\ \ -\infty < x < \infty.$$
$$\begin{array}{cc} \text{ Expected value and variance: } &\mu = \mathbf{E}\left (X\right )\text{ and }{\sigma }^{2} ={ \mathbf{D}}^{2}\left (X\right ); \\ \text{ Characteristic function: }& \varphi (t) =\exp \left \{i\mu t -\frac{{\sigma }^{2}} {2} {t}^{2}\right \}.\end{array}$$

The N(0, 1) distribution is usually called a standard Gaussian or standard normal distribution, and its PDF is equal to

$$f(x) = \frac{1} {\sqrt{2\pi }}{\mathrm{e}}^{-{x}^{2}/2 },\ \ -\infty < x < \infty.$$

It is easy to verify that if a random variable has an N(μ, σ) distribution, then the centered and linearly normed random variable Y = (X − μ) ∕ σ has a standard Gaussian distribution.

Multidimensional Gaussian (Normal) Distribution N(μ, R) Let \(Z = ({Z}_{1},\ldots ,{Z}_{n})\) be an n-dimensional random vector whose coordinates \({Z}_{1},\ldots ,{Z}_{n}\) are independent and have a standard N(0, 1) Gaussian distribution. Let \(\mathbf{V} \in {\mathbb{R}}^{m\times n}\) be an (m × n) matrix and \(\mu = {({\mu }_{1},\ldots ,{\mu }_{m})}^{\mathrm{T}} \in {\mathbb{R}}^{m}\) an m-dimensional vector. Then the distribution of the m-dimensional random vector X defined by the equation \(\mathbf{X} = \mathbf{V}\mathbf{Z} + \mu \) is called an m-dimensional Gaussian distribution. Expected value and covariance matrix:

$$\mathbf{E}\left (\mathbf{X}\right ) = {\mu }_{\mathbf{X}} = \mu \ \ \ \ \ \ \ \ \ \text{ and}\ \ \ \ \ \ \ \ \ \ \ {\mathbf{D}}^{2}\left (\mathbf{X}\right ) ={ \mathbf{R}}_{\mathbf{ X}} = \mathbf{E}\left ((\mathbf{X} - \mu ){(\mathbf{X} - \mu )}^{\mathrm{T}}\right ) = \mathbf{V}{\mathbf{V}}^{\mathrm{T}};$$

Characteristic function:

$$\varphi (\mathbf{t}) =\exp \left \{i{\mathbf{t}}^{\mathrm{T}}\mu -\frac{1} {2}{\mathbf{t}}^{\mathrm{T}}{\mathbf{R}}_{\mathbf{ X}}\mathbf{t}\right \},\text{ where }\mathbf{t} = {({t}_{1},\ldots ,{t}_{m})}^{\mathrm{T}} \in {\mathbb{R}}^{m}.$$

If V is a nonsingular square matrix (m = n and \(\det \mathbf{V}\neq 0\)), then the random vector X has a density of the form

$${f}_{\mathbf{X}}(\mathbf{x}) = \frac{1} {{(2\pi )}^{n/2}{(\det {\mathbf{R}}_{X})}^{1/2}}\exp \left \{-\frac{1} {2}{(\mathbf{x-}\mu )}^{\mathrm{T}}{\mathbf{R}}_{ X}^{-1}\mathbf{(x-}\mu )\right \},\ \ \mathbf{x} = {({x}_{ 1},\ldots ,{x}_{n})}^{\mathrm{T}} \in {\mathbb{R}}^{n}.$$
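The defining construction X = VZ + μ translates directly into a sampling routine; the following sketch (an added illustration; the matrix V and vector μ are arbitrary assumptions) also confirms that the sample covariance approaches VV^T:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 100_000

V = np.array([[1.0, 0.0],
              [0.8, 0.6]])        # an arbitrary nonsingular 2x2 matrix
mu = np.array([1.0, -2.0])

Z = rng.standard_normal((n, 2))   # i.i.d. N(0,1) coordinates
X = Z @ V.T + mu                  # X = V Z + mu, applied row by row

print(np.round(X.mean(axis=0), 3))           # close to mu
print(np.round(np.cov(X, rowvar=False), 3))  # close to V V^T
print(V @ V.T)
```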

Example.

If the random vector \(\mathbf{X} = {({X}_{1},{X}_{2})}^{\mathrm{T}}\) has a two-dimensional Gaussian distribution with expected value \(\mu = {({\mu }_{1},{\mu }_{2})}^{\mathrm{T}}\) and inverse covariance matrix

$${R}_{X}^{-1} = \left [\begin{array}{cc} a&b\\ b &c \end{array} \right ],$$

then its PDF has the form

$${f}_{\mathbf{X}}(\mathbf{x}) = \frac{\sqrt{ac - {b}^{2}}} {2\pi } \exp \left \{-\frac{1} {2}[a{({x}_{1} - {\mu }_{1})}^{2} + 2b({x}_{ 1} - {\mu }_{1})({x}_{2} - {\mu }_{2}) + c{({x}_{2} - {\mu }_{2})}^{2}]\right \},$$

where \(a,b,c,{\mu }_{1},{\mu }_{2}\) are constants satisfying the conditions a > 0,  c > 0, and b 2 < ac.

Note that the marginal distributions of random variables \({X}_{1}\) and \({X}_{2}\) are \(N({\mu }_{1},{\sigma }_{1})\) and \(N({\mu }_{2},{\sigma }_{2})\) Gaussian, respectively, where

$${\sigma }_{1} = \sqrt{ \frac{c} {ac - {b}^{2}}},\ \ \ {\sigma }_{2} = \sqrt{ \frac{a} {ac - {b}^{2}}},\ \text{ and }\ \mathrm{cov}({X}_{1},{X}_{2}) = \frac{-b} {ac - {b}^{2}}.$$

Distribution Functions Associated with Gaussian Distributions Let \(Z,{Z}_{1},{Z}_{2},\ldots \) be independent random variables whose distributions are standard Gaussian, i.e., with the parameters (0, 1). There are many distributions, for example, the \({\chi }^{2}\) and the logarithmically normal distributions defined subsequently (further examples are the t, F, and Wishart distributions frequently used in statistics [46]), that can be given as distributions of appropriately chosen functions of the random variables \(Z,{Z}_{1},{Z}_{2},\ldots \).

\({\chi }^{2}\) Distribution The distribution of the random variable \(X = {Z}_{1}^{2} + \ldots + {Z}_{n}^{2}\) is called a \({\chi }^{2}\) distribution with parameter n. The PDF is

$${f}_{n}(x) = \left \{\begin{array}{c} \frac{1} {{2}^{n/2}\Gamma (n/2)}{x}^{n/2-1}{\mathrm{e}}^{-x/2},\text{ if }x > 0, \\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0,\ \ \ \ \ \ \ \ \ \ \ \text{ if }x \leq 0. \end{array} \right.$$
$$\begin{array}{cc} \text{ Expected value and variance: } &\mathbf{E}\left (X\right ) = n,\ \ \ {\mathbf{D}}^{2}\left (X\right ) = 2n; \\ \text{ Characteristic function: }& \varphi (t) = {(1 - 2it)}^{-n/2}.\end{array}$$

Logarithmic Gaussian (Normal) Distribution If random variable Z has an N(μ, σ) Gaussian distribution, then the distribution of the random variable X = eZ is called a logarithmic Gaussian (normal) distribution. The PDF is

$$f(x) = \left \{\begin{array}{c} \frac{1} {\sqrt{2\pi }\sigma x}\exp \left \{-\frac{{(\log x-\mu )}^{2}} {2{\sigma }^{2}} \right \},\text{ if }x > 0, \\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0,\ \ \ \ \ \ \ \ \ \ \ \ \text{ if }x \leq 0. \end{array} \right.$$

Expected value and variance: \(\mathbf{E}\left (X\right ) ={ \mathrm{e}}^{\mu +{\sigma }^{2}/2 },\ \ \ {\mathbf{D}}^{2}\left (X\right ) ={ \mathrm{e}}^{2\mu +{\sigma }^{2} }\left ({\mathrm{e}}^{{\sigma }^{2} } - 1\right ).\)

Weibull Distribution The Weibull distribution is a generalization of the exponential distribution for which the behavior of the tail distribution is modified by a positive constant k as follows:

$$\begin{array}{rcl} & & F(x) = \left \{\begin{array}{ll} 1 -{\mathrm{e}}^{-{(x/\lambda )}^{k} },&\text{ if }x > 0, \\ 0, &\text{ if }x \leq 0; \end{array} \right. \\ & & f(x) = \left \{\begin{array}{ll} \left (\frac{k} {\lambda }\right ){\left (\frac{x} {\lambda }\right )}^{k-1}{\mathrm{e}}^{-{(x/\lambda )}^{k} },&\text{ if }x > 0, \\ 0, &\text{ if }x \leq 0. \end{array} \right.\end{array}$$

Expected value and variance:

$$\mathbf{E}\left (X\right ) = \lambda \Gamma (1 + 1/k),\ \ \ {\mathbf{D}}^{2}\left (X\right ) = {\lambda }^{2}\left (\Gamma (1 + 2/k) - {\Gamma }^{2}(1 + 1/k)\right ).$$

Pareto Distribution Let c and λ be positive numbers. The CDF and the PDF of a Pareto distribution are defined as follows:

$$\begin{array}{rcl} & & F(x) = \left \{\begin{array}{ll} 1 - {(\frac{x} {c} )}^{-\lambda },&\text{ if }x > c, \\ 0, &\text{ if }x \leq c; \end{array} \right. \\ & & f(x) = \left \{\begin{array}{ll} \left (\frac{\lambda } {c} \right ){\left (\frac{x} {c} \right )}^{-\lambda -1},&\text{ if }x > c, \\ 0, &\text{ if }x \leq c. \end{array} \right.\end{array}$$

Since the PDF of the Pareto distribution is a simple power function, it tends to zero at a polynomial rate as x goes to infinity, and the nth moment exists if and only if n < λ.

Expected value (if λ > 1) and variance (if λ > 2):

$$\mathbf{E}\left (X\right ) = \frac{c\lambda } {\lambda - 1},\ \ \ {\mathbf{D}}^{2}\left (X\right ) = \frac{{c}^{2}\lambda } {{(\lambda - 1)}^{2}(\lambda - 2)}.$$

3 Limit Theorems

3.1 Convergence Notions

There are many convergence notions in the theory of analysis, for example, pointwise convergence, uniform convergence, and convergences defined by various metrics. In the theory of probability, several kinds of convergences are also used that are related to the sequences of random variables or to their sequence of distribution functions. The following notion is the so-called weak convergence of distribution functions.

Definition 1.54.

The sequence of distribution functions \({F}_{n},\ n = 1,2,\ldots \), weakly converges to the distribution function F (abbreviated \({F}_{n}\mathop{ \rightarrow }\limits^{ w}F,\ n \rightarrow \infty \)) if the convergence \({F}_{n}(x) \rightarrow F(x),\ n \rightarrow \infty \), holds at all continuity points of F.

If the distribution function F is continuous, then the convergence \({F}_{n}\mathop{ \rightarrow }\limits^{ w}F,\ n \rightarrow \infty \) holds if and only if \({F}_{n}(x) \rightarrow F(x),\ n \rightarrow \infty \) for all \(x \in \mathbb{R}\). The weak convergence of the sequence F n ,  n = 1, 2,  is equivalent to the condition that the convergence

$$\int\limits_{-\infty }^{\infty }g(x)\mathrm{d}{F}_{ n}(x) \rightarrow \int\limits_{-\infty }^{\infty }g(x)\mathrm{d}F(x)$$

is true for all bounded and continuous functions g.

In addition, the weak convergence of a sequence of distribution functions can be characterized with the help of an appropriate metric on the space \(\mathbb{F} =\{ F\}\) of all distribution functions. Let G and H be two distribution functions (i.e., \(G,H \in \mathbb{F}\)), and define the Lévy metric [96] as follows:

$$L(G,H) =\inf \{ \epsilon :\ G(x) \leq H(x + \epsilon ) + \epsilon ,\ H(x) \leq G(x + \epsilon ) + \epsilon ,\ \text{ for all }x \in \mathbb{R}\}.$$

Then it can be proved that the weak convergence \({F}_{n}\mathop{ \rightarrow }\limits^{ w}F,\ n \rightarrow \infty \), of the distribution functions \(F,\ {F}_{n},\ n = 1,2,\ldots \), holds if and only if \(\mathop{\lim }\limits_{n \rightarrow \infty }L({F}_{n},F) = 0\).
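A brute-force numerical approximation of the Lévy metric is straightforward (an added sketch; the grid search below is an assumed discretization for illustration, not an exact algorithm):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def levy(G, H, lo=-10.0, hi=10.0, m=2001):
    # smallest eps on a grid with G(x) <= H(x+eps)+eps and H(x) <= G(x+eps)+eps
    xs = np.linspace(lo, hi, m)
    gx = np.array([G(x) for x in xs])
    hx = np.array([H(x) for x in xs])
    for eps in np.linspace(0.0, 1.0, 201):
        gs = np.array([G(x + eps) for x in xs])
        hs = np.array([H(x + eps) for x in xs])
        if np.all(gx <= hs + eps) and np.all(hx <= gs + eps):
            return eps
    return 1.0

F  = Phi                              # N(0,1)
Fn = lambda x: Phi((x - 0.2) / 1.1)   # a slightly shifted and scaled normal
print(levy(F, F))     # 0.0
print(levy(F, Fn))    # a small positive number
```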

The most frequently used convergence notions in probability theory for a sequence of random variables are convergence in distribution, convergence in probability, convergence with probability 1 (or almost surely, a.s.), and convergence in mean square (convergence in \({L}_{2}\)), which will be introduced subsequently. In the case of the last three convergences, it is assumed that the random variables are defined on a common probability space \((\Omega ,\mathcal{A},\mathbf{P})\).

Definition 1.55.

The sequence of random variables \({X}_{1},{X}_{2},\ldots \) converges in distribution to a random variable X (abbreviated \({X}_{n}\mathop{ \rightarrow }\limits^{ d}X,\ n \rightarrow \infty \)) if their distribution functions satisfy the weak convergence

$${F}_{{X}_{n}}\mathop{ \rightarrow }\limits^{ w}{F}_{X},\ n \rightarrow \infty \,.$$

Definition 1.56.

The sequence of random variables \({X}_{1},{X}_{2},\ldots \) converges in probability to a random variable X (\({X}_{n}\mathop{ \rightarrow }\limits^{ P}X,\ n \rightarrow \infty \)) if the convergence

$$\mathop{\lim }\limits_{n \rightarrow \infty }\mathbf{P}\left (\left \vert {X}_{n} - X\right \vert > \epsilon \right ) = 0$$

holds for all positive constants \(\epsilon \).

Definition 1.57.

The random variables \({X}_{1},{X}_{2},\ldots \) converge with probability 1(or almost surely) to a random variable X (abbreviated \({X}_{n}\mathop{ \rightarrow }\limits^{\text{ a.s.}}X,\ n \rightarrow \infty \)) if the condition

$$\mathbf{P}\left (\mathop{\lim }\limits_{n \rightarrow \infty }{X}_{n} = X\right ) = 1$$

holds.

The limit \(\mathop{\lim }\limits_{n \rightarrow \infty }{X}_{n} = X\) exists with probability 1 if the random variables \({X}^{{\prime}} =\mathop{\lim \sup }\limits_{ n \rightarrow \infty }{X}_{n}\) and \({X}^{{\prime\prime}} =\mathop{\lim \inf }\limits_{ n \rightarrow \infty }{X}_{n}\), which are defined with probability 1, satisfy the relation

$$\mathbf{P}\left ({X}^{{\prime}}(\omega ) = {X}^{{\prime\prime}}(\omega ) = X(\omega )\right ) = 1$$

is true. This means that there is an event \(A \in \mathcal{A}\), \(\mathbf{P}\left (A\right ) = 0\), such that the equality

$${X}^{{\prime}}(\omega ) = {X}^{{\prime\prime}}(\omega ) = X(\omega ),\ \ \omega \in \Omega \setminus A$$

holds.

Theorem 1.58 ([84]). 

The convergence \({\lim }_{n\rightarrow \infty }{X}_{n} = X\) with probability 1 is true if and only if for all \(\epsilon > 0\)

$$\mathop{\lim }\limits_{n \rightarrow \infty }\mathbf{P}\left (\mathop{\sup }\limits_{k \geq n}\vert {X}_{k} - X\vert > \epsilon \right ) = 0.$$

Definition 1.59.

Let X n ,  n ≥ 1 and X be random variables with finite variance. The sequence \({X}_{1},{X}_{2},\ldots \) converges in mean square to random variable X (abbreviated \({X}_{n}\mathop{ \rightarrow }\limits^{ {L}_{2}}X,\ n \rightarrow \infty \)) if

$$\mathbf{E}\left ({\left \vert {X}_{n} - X\right \vert }^{2}\right ) \rightarrow 0,\ n \rightarrow \infty.$$

This type of convergence is often called an L 2 convergence of random variables.

The enumerated convergence notions are not equivalent to each other, but we can mention several connections between them. The convergence in distribution follows from all the others. The convergence in probability follows from the convergence with probability 1 and from the convergence in mean square. It can be proved that if the sequence \({X}_{1},{X}_{2},\ldots \) is convergent in probability to the random variable X, then there exists a subsequence \({X}_{{n}_{1}},{X}_{{n}_{2}},\ldots \) such that it converges with probability 1 to random variable X.

3.2 Laws of Large Numbers

The intuitive introduction of probability implicitly uses the limit behavior of the average

$${\overline{S}}_{n} = \frac{{X}_{1} + \ldots + {X}_{n}} {n} ,\ n = 1,2,\ldots ,$$

of independent identically distributed random variables \({X}_{1},{X}_{2},\ldots \). The main question is: under what condition does the sequence \({\overline{S}}_{n}\) converge to a constant μ in probability (weak law of large numbers) or with probability 1 (strong law of large numbers) as n goes to infinity?

Consider an experiment in which we observe whether an event A occurs or not. Repeating the experiment n times independently, denote the frequency of event A by \({S}_{n}(A)\) and the relative frequency by \({\overline{S}}_{n}(A)\).

Theorem 1.60 (Bernoulli). 

The relative frequency of an event A tends in probability to the probability of the event p = P(A), that is, for all \(\epsilon > 0\) the relation

$$\mathop{\lim }\limits_{n \rightarrow \infty }\mathbf{P}\left (\left \vert {\overline{S}}_{n}(A) - p\right \vert > \epsilon \right ) = 0$$

holds.

If we introduce the notation

$${X}_{i} = \left \{\begin{array}{ll} 1,&\text{ if the }i\text{ th outcome is in }A,\\ 0, &\text{ otherwise,} \end{array} \right.$$

then the assertion of the last theorem can be formulated as follows:

$${\overline{S}}_{n} = \frac{{X}_{1} + \ldots + {X}_{n}} {n} \mathop{ \rightarrow }\limits^{ p}p,\ n \rightarrow \infty ,$$

which is a simple consequence of the Chebyshev inequality because the X i are independent and identically distributed and \(\mathbf{E}\left ({X}_{i}\right ) = p = \mathbf{P}\left (A\right ),\ {\mathbf{D}}^{2}\left ({X}_{i}\right ) = p(1 - p),\ \ i = 1,2,\ldots \). This result can be generalized without any difficulties as follows.
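The convergence of the relative frequency is easy to observe numerically (an added sketch; p = 0.25 is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(9)
p = 0.25                                  # P(A)

flips = rng.random(100_000) < p           # indicator of A in each experiment
rel_freq = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:6d}: relative frequency = {rel_freq[n - 1]:.4f}")
# The running relative frequency settles near p = 0.25.
```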

Theorem 1.61.

Let \({X}_{1},{X}_{2},\ldots \) be independent and identically distributed random variables with common expected value μ and finite variance σ 2 . Then the convergence in probability

$${\overline{S}}_{n} = \frac{{X}_{1} + \ldots + {X}_{n}} {n} \mathop{ \rightarrow }\limits^{ p}\mu ,\ n \rightarrow \infty ,$$

is true.

Proof.

Example 1.38, which is given after the proof of the Chebyshev inequality, shows that for all \(\epsilon > 0\) the inequality

$$\mathbf{P}\left (\left \vert \frac{{X}_{1} + \ldots + {X}_{n}} {n} - \mu \right \vert \geq \epsilon \right ) \leq \frac{{\sigma }^{2}} {n{\epsilon }^{2}}$$

is valid. From this the convergence in probability \({\overline{S}}_{n}\mathop{ \rightarrow }\limits^{ p}\mu ,\ n \rightarrow \infty \), follows. It is not difficult to see that the convergence in \({L}_{2}\) also holds, i.e., \({\overline{S}}_{n}\mathop{ \rightarrow }\limits^{ {L}_{2}}\mu ,\ n \rightarrow \infty \). □ 

It should be noted that the inequality \(\mathbf{P}\left (\left \vert \frac{{X}_{1}+\ldots +{X}_{n}} {n} - \mu \right \vert \geq \epsilon \right ) \leq \frac{{\sigma }^{2}} {n{\epsilon }^{2}}\) not only guarantees the convergence in probability but also provides a quantitative upper bound on the probability of a deviation of size at least \(\epsilon \).
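For instance, if \({\sigma }^{2} = 1\) and \(\epsilon = 0.1\), the bound reads \(\mathbf{P}\left (\left \vert {\overline{S}}_{n} - \mu \right \vert \geq 0.1\right ) \leq 100/n\), so n = 10,000 observations already guarantee that the probability of such a deviation is at most 0.01.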

The Kolmogorov strong law of large numbers gives a necessary and sufficient condition for convergence with probability 1.

Theorem 1.62 ( Kolmogorov). 

If the sequence of random variables \({X}_{1},{X}_{2},\ldots \) is independent and identically distributed, then the convergence

$$\frac{{X}_{1} + \ldots + {X}_{n}} {n} \mathop{ \rightarrow }\limits^{ a.s.}\mu ,\ n \rightarrow \infty $$

holds for a constant μ if and only if the random variables \({X}_{i}\) have a finite expected value and \(\mathbf{E}\left ({X}_{i}\right ) = \mu.\)

Corollary 1.63.

If \({\overline{S}}_{n}(A)\) denotes the relative frequency of an event A occurring in n independent experiments, then the Bernoulli law of large numbers

$${\overline{S}}_{n}(A)\mathop{ \rightarrow }\limits^{ p}p = \mathbf{P}\left (A\right ),\ n \rightarrow \infty ,$$

is valid. By the Kolmogorov law of large numbers, this convergence is true with probability 1 also, that is,

$${\overline{S}}_{n}(A)\mathop{ \rightarrow }\limits^{ a.s.}p = \mathbf{P}\left (A\right ),\ n \rightarrow \infty.$$

3.3 Central Limit Theorem, Lindeberg–Feller Theorem

The basic problem of central limit theorems is as follows. Let \({X}_{1},{X}_{2},\ldots \) be independent and identically distributed random variables with a common distribution function \({F}_{X}(x)\). The question is: under what conditions do sequences of constants \({\mu }_{n}\) and \({\sigma }_{n},\ {\sigma }_{n}\neq 0,\ n = 1,2,\ldots \), exist such that the sequence of centered and linearly normed sums

$${ \overline{S}}_{n} = \frac{{X}_{1} + \ldots + {X}_{n} - {\mu }_{n}} {{\sigma }_{n}} ,\ n = 1,2,\ldots $$
(1.11)

converges in distribution,

$${F}_{{\overline{S}}_{ n}}\mathop{ \rightarrow }\limits^{ w}F,\ n \rightarrow \infty ,$$

to a nondegenerate limit distribution function F? A distribution function F(x) is nondegenerate if there is no point \({x}_{0} \in \mathbb{R}\) satisfying the condition \(F({x}_{0}) - F({x}_{0}-) = 1\), that is, if the distribution is not concentrated at a single point.

Theorem 1.64.

If the random variables \({X}_{1},{X}_{2},\ldots \)  are independent and identically distributed with finite expected value \(\mu = \mathbf{E}\left ({X}_{1}\right )\) and variance \({\sigma }^{2} ={ \mathbf{D}}^{2}({X}_{1})\) , then

$$\mathbf{P}\left (\frac{{X}_{1} + \ldots + {X}_{n} - n\mu } {\sqrt{n}\sigma } \leq x\right ) \rightarrow \Phi (x) =\int\limits_{-\infty }^{x} \frac{1} {\sqrt{2\pi }}{\mathrm{e}}^{-{u}^{2}/2 }\mathrm{d}u$$

holds for all \(x \in \mathbb{R}\) , where Φ(x) denotes the distribution function of the standard normal distribution.
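The theorem is easy to check numerically. The following simulation sketch (assuming the NumPy library; the choice of exponentially distributed summands with the parameter 1, for which μ = σ = 1, is illustrative) compares the empirical distribution of the standardized sums with Φ:

import numpy as np

rng = np.random.default_rng(1)
n, trials = 40, 200_000
mu, sigma = 1.0, 1.0  # mean and standard deviation of the Exp(1) distribution
x = rng.exponential(scale=1.0, size=(trials, n))
z = (x.sum(axis=1) - n * mu) / (np.sqrt(n) * sigma)  # standardized sums
for t in (-1.0, 0.0, 1.0):
    # compare with Phi(-1) = 0.1587, Phi(0) = 0.5000, Phi(1) = 0.8413
    print(f"P(Z <= {t:+.1f}) is approximately {(z <= t).mean():.4f}")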

If the random variables \({X}_{1},{X}_{2},\ldots \) are independent but not necessarily identically distributed, then the following more general result, the so-called Lindeberg–Feller theorem, is valid.

Theorem 1.65.

Let \({X}_{1},{X}_{2},\ldots \) be independent random variables whose variances are finite. Denote

$${\mu }_{n} = \mathbf{E}\left ({X}_{1}\right ) + \ldots + \mathbf{E}\left ({X}_{n}\right ),\ \ \ {\sigma }_{n} = \sqrt{{\mathbf{D} }^{2 } \left ({X}_{1 } \right ) + \ldots +{ \mathbf{D} }^{2 } \left ({X}_{n } \right )},\ \ n = 1,2,\ldots \,.$$

The limit

$$\mathbf{P}\left (\frac{{X}_{1} + \ldots + {X}_{n} - {\mu }_{n}} {{\sigma }_{n}} \leq x\right ) \rightarrow \Phi (x),\ \ n \rightarrow \infty ,$$

is true for all \(x \in \mathbb{R}\) , together with the asymptotic negligibility condition \(\mathop{\max }\limits_{1 \leq j \leq n}{\mathbf{D}}^{2}\left ({X}_{j}\right )/{\sigma }_{n}^{2} \rightarrow 0\), if and only if the Lindeberg condition holds for all \(\epsilon > 0\):

$$\mathop{\lim }\limits_{n \rightarrow \infty } \frac{1} {{\sigma }_{n}^{2}}\sum \limits_{j=1}^{n}\mathbf{E}\left ({\left ({X}_{j} -\mathbf{E}\left ({X}_{j}\right )\right )}^{2}{\mathcal{I}}_{\left \{\left \vert {X}_{j}-\mathbf{E}\left ({X}_{j}\right )\right \vert >\epsilon {\sigma }_{n}\right \}}\right ) = 0,$$

where \({\mathcal{I}}_{\left \{\cdot \right \}}\) denotes the indicator of the event in braces.
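In the independent and identically distributed case of Theorem 1.64 the Lindeberg condition is satisfied automatically: \({\sigma }_{n}^{2} = n{\sigma }^{2}\), so the expression above reduces to \(\frac{1} {{\sigma }^{2}}\mathbf{E}\left ({\left ({X}_{1} - \mu \right )}^{2}{\mathcal{I}}_{\left \{\left \vert {X}_{1}-\mu \right \vert >\epsilon \sigma \sqrt{n}\right \}}\right )\), which tends to 0 as \(n \rightarrow \infty \) by the dominated convergence theorem, since \(\mathbf{E}\left ({\left ({X}_{1} - \mu \right )}^{2}\right ) = {\sigma }^{2} < \infty \).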

3.4 Infinitely Divisible Distributions and Convergence to the Poisson Distribution

There are many practical problems for which model (1.11) and the results related to it are not satisfactory. The reason is that the class of possible limit distributions is not large enough; for instance, it contains no discrete distributions. An example is the Poisson distribution, which is used frequently in queueing theory.

As a generalization of model (1.11), consider a sequence of series of random variables (also called a triangular array of random variables)

$$\left \{{X}_{n,1},\ldots ,{X}_{n,{k}_{n}}\right \},\ n = 1,2,\ldots ,\ {k}_{n} \rightarrow \infty ,$$

satisfying the following conditions for all fixed positive integers n:

  1.

    The random variables \({X}_{n,1},\ldots ,{X}_{n,{k}_{n}}\) are independent.

  2.

    The random variables \({X}_{n,1},\ldots ,{X}_{n,{k}_{n}}\) are infinitesimal (in other words, asymptotically negligible), that is, for all \(\epsilon > 0\) the limit

    $$\mathop{\lim }\limits_{n \rightarrow \infty }\mathop{\max }\limits_{1 \leq j \leq {k}_{n}}\mathbf{P}\left (\left \vert {X}_{n,j}\right \vert > \epsilon \right ) = 0$$

    holds.

Considering the sums of series of random variables

$${S}_{n} = {X}_{n,1} + \ldots + {X}_{n,{k}_{n}},\ n = 1,2,\ldots ,$$

the class of possible limit distributions (the class of so-called infinitely divisible distributions) is already sufficiently large, containing, for example, the Poisson distribution.
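A classical illustration is the law of rare events: take \({k}_{n} = n\) and let \({X}_{n,1},\ldots ,{X}_{n,n}\) be independent Bernoulli random variables with \(\mathbf{P}\left ({X}_{n,j} = 1\right ) = \lambda /n\); then the sums \({S}_{n}\) converge in distribution to a Poisson distribution with the parameter λ. A minimal simulation sketch of this fact (assuming the NumPy library; the values of λ, n, and the number of trials are illustrative):

import numpy as np
from math import exp, factorial

rng = np.random.default_rng(7)
lam, n, trials = 2.0, 1_000, 200_000
# S_n = X_{n,1} + ... + X_{n,n} with X_{n,j} ~ Bernoulli(lam/n), i.e., Binomial(n, lam/n)
s = rng.binomial(n, lam / n, size=trials)
for k in range(5):
    emp = (s == k).mean()
    poi = exp(-lam) * lam**k / factorial(k)
    print(f"P(S_n = {k}): empirical {emp:.4f}, Poisson {poi:.4f}")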

Definition 1.66.

A random variable X is called infinitely divisible if it can be given in the form

$$X\mathop{ =}\limits^{ d}{X}_{n,1} + \ldots + {X}_{n,n}$$

for every n = 1,2,…, where the random variables \({X}_{n,1},\ldots ,{X}_{n,n}\) are independent and identically distributed and \(\mathop{ =}\limits^{ d}\) denotes equality in distribution. For example, a Poisson distributed random variable with the parameter λ admits such a decomposition with \({X}_{n,j}\) Poisson distributed with the parameter \(\lambda /n\), and a normally distributed one with the parameters (μ,σ) admits it with \({X}_{n,j}\) normally distributed with the parameters \((\mu /n,\ \sigma /\sqrt{n})\).

Infinitely divisible distributions (to which, for example, the normal and Poisson distributions belong) can be given with the help of their characteristic functions.

Theorem 1.67.

If random variable X is infinitely divisible, then its characteristic function has the form ( Lévy–Khinchin canonical form)

$$\begin{array}{rcl} \log f(t)& =& i\mu t -\frac{{\sigma }^{2}} {2} {t}^{2} +\int\limits_{-\infty }^{0}\left ({\mathrm{e}}^{itx} - 1 - \frac{itx} {1 + {x}^{2}}\right )\mathrm{d}L(x) \\ & & +\int\limits_{0}^{\infty }\left ({\mathrm{e}}^{itx} - 1 - \frac{itx} {1 + {x}^{2}}\right )\mathrm{d}R(x), \\ \end{array}$$

where the functions L and R satisfy the following conditions:

  (a)

    μ and σ (σ ≥ 0) are real constants.

  (b)

    \(L(x),\ x \in (-\infty ,0),\) and \(R(x),\ x \in (0,\infty ),\) are monotonically increasing functions on the intervals \((-\infty ,0)\) and \((0,\infty )\), respectively.

  (c)

    \(L(-\infty ) = R(+\infty ) = 0\) and the inequality condition

    $$\int\limits_{-\infty }^{0}{x}^{2}\mathrm{d}L(x) +\int\limits_{0}^{\infty }{x}^{2}\mathrm{d}R(x) < \infty $$

    holds.

If an infinitely divisible distribution has finite variance, then its characteristic function can be given in a simpler form (Kolmogorov formula):

$$\log f(t) = i\mu t +\int\limits_{-\infty }^{\infty }\left ({\mathrm{e}}^{itx} - 1 - itx\right ) \frac{1} {{x}^{2}}\mathrm{d}K(x),$$

where μ is a constant and K(x) (\(K(-\infty ) = 0\)) is a monotonically nondecreasing function.

As special cases of the Kolmogorov formula, we get the normal and Poisson distributions.

  (a)

    An infinitely divisible distribution is normal with the parameters (μ,σ) if the function K(x) is defined as

    $$K(x) = \left \{\begin{array}{ll} 0, &\text{ if }x \leq 0,\\ {\sigma }^{2 } , &\text{ if } x > 0. \end{array} \right.$$

    Then the logarithm of the characteristic function is

    $$\log f(t) = i\mu t -\frac{{\sigma }^{2}} {2} {t}^{2}.$$
  (b)

    An infinitely divisible distribution is Poisson with the parameter λ (λ > 0) if μ = λ and the function K(x) is defined as

    $$K(x) = \left \{\begin{array}{ll} 0, &\text{ if }x \leq 1,\\ \lambda , &\text{ if } x > 1. \end{array} \right.$$

    In this case dK concentrates its mass λ at the point x = 1, and hence the logarithm of the characteristic function is

    $$\log f(t) = i\mu t +\int\limits_{-\infty }^{\infty }\left ({\mathrm{e}}^{itx} - 1 - itx\right ) \frac{1} {{x}^{2}}\mathrm{d}K(x) = i\lambda t + \lambda \left ({\mathrm{e}}^{it} - 1 - it\right ) = \lambda ({\mathrm{e}}^{it} - 1).$$

The following theorem gives an answer to the question of the conditions under which the limit distribution of sums of independent infinitesimal random variables is Poisson. This result will be used later when considering sums of independent arrival processes of queues.

Theorem 1.68 ( Gnedenko, Marcinkiewicz). 

Let \(\left \{{X}_{n,1},\ldots ,{X}_{n,{k}_{n}}\right \},\)  n = 1,2,…, be a sequence of series of independent infinitesimal random variables. The sequence of distributions of the sums

$${S}_{n} = {X}_{n,1} + \ldots + {X}_{n,{k}_{n}},\ \ n \geq 1,$$

converges weakly to a Poisson distribution with the parameter λ (λ > 0) as \(n \rightarrow \infty \) if and only if the following conditions hold for all \(\epsilon \) \((0 < \epsilon < 1)\):

  (A)

    \( \sum \limits_{j=1}^{{k}_{n}}\int\limits_{{\mathbb{R}}_{\epsilon }}\mathrm{d}{F}_{nj}(x) \rightarrow 0\).

  (B)

    \( \sum \limits_{j=1}^{{k}_{n}}\int\limits_{\left \vert x-1\right \vert <\epsilon }\mathrm{d}{F}_{nj}(x) \rightarrow \lambda \).

  (C)

    \( \sum \limits_{j=1}^{{k}_{n}}\int\limits_{\left \vert x\right \vert <\epsilon }x\,\mathrm{d}{F}_{nj}(x) \rightarrow 0\) .

  (D)

    \( \sum \limits_{j=1}^{{k}_{n}}\left [\int\limits_{\left \vert x\right \vert <\epsilon }{x}^{2}\mathrm{d}{F}_{nj}(x) -{\left (\int\limits_{\left \vert x\right \vert <\epsilon }x\mathrm{d}{F}_{nj}(x)\right )}^{2}\right ] \rightarrow 0\),

where \({F}_{nj}(x) = \mathbf{P}\left ({X}_{nj} \leq x\right )\) and \({\mathbb{R}}_{\epsilon } = \mathbb{R} \setminus \left (\{\left \vert x\right \vert < \epsilon \} \cup \{\left \vert x - 1\right \vert < \epsilon \}\right ).\)

Note that conditions (A) and (B) guarantee that the Poisson part of the limit converges to the appropriate Poisson distribution, condition (C) means that no centering is needed, and condition (D) ensures that the limit distribution contains no Gaussian component.
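For the Bernoulli array of the law of rare events above (\({k}_{n} = n\), \(\mathbf{P}\left ({X}_{n,j} = 1\right ) = \lambda /n\)), the conditions are easy to verify for \(0 < \epsilon < 1\): each \({F}_{nj}\) places mass \(1 - \lambda /n\) at 0 and mass \(\lambda /n\) at 1, so the sum in (A) equals 0, the sum in (B) equals \(n \cdot \lambda /n = \lambda \), and the sums in (C) and (D) vanish because the only mass in the region \(\left \vert x\right \vert < \epsilon \) sits at the point x = 0.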

4 Exercises

Exercise 1.1.

Let X be a nonnegative random variable with CDF \({F}_{X}\). For a fixed \(t \geq 0\) with \(\mathbf{P}\left (X > t\right )\neq 0\), find the CDF of the residual lifetime X − t, that is, the conditional distribution function \(\mathbf{P}\left (X - t \leq x\ \vert \ X > t\right )\).

Exercise 1.2.

Let X and Y be independent random variables with a Poisson distribution of parameters λ and μ, respectively. Verify that

  (a)

    The sum X + Y has a Poisson distribution with the parameter λ + μ;

  (b)

    For any nonnegative integers m ≤ n, the conditional distribution \(\mathbf{P}\left (X = m\ \vert \ X + Y = n\right )\) is binomial with the parameters \((n, \frac{\lambda } {\lambda +\mu })\), i.e.,

    $$\mathbf{P}\left (X = m\ \vert \ X + Y = n\right ) = \left ({ n \atop m} \right ){\left ( \frac{\lambda } {\lambda + \mu }\right )}^{m}{\left (1 - \frac{\lambda } {\lambda + \mu }\right )}^{n-m}.$$

Exercise 1.3.

Let X and Y be independent random variables having a uniform distribution on the interval (0, 1) and an exponential distribution with the parameter 1, respectively. Find, as a concrete number, the probability that X < Y.

Exercise 1.4.

Divide the interval (0, 1) into three parts by two independent points \({U}_{1}\) and \({U}_{2}\), each uniformly distributed on (0, 1). Find the probability of the event A that the three parts can form a triangle.

Exercise 1.5.

Show that for a nonnegative random variable X with a finite nth (\(n \geq 1\)) moment it is true that \(\mathbf{E}\left ({X}^{n}\right ) =\int\limits_{0}^{\infty }\mathbf{P}\left (x < X\right )n{x}^{n-1}\mathrm{d}x\).

Exercise 1.6.

Let X and Y be independent random variables with a uniform distribution on the interval (0, 1). Find the quantities

  (a)

    \(\mathbf{E}\left (\left \vert X - Y \right \vert \right )\), \({\mathbf{D}}^{2}\left (\vert X - Y \vert \right )\),

  (b)

    \(\mathbf{P}\left (\left \vert X - Y \right \vert > \frac{1} {2}\right )\).

Exercise 1.7.

Let X and Y be independent random variables having an exponential distribution with the parameters λ and μ, respectively.

  (a)

    Determine the density function of the random variable Z = X + Y.

  (b)

    Find the density function of the random variable W = min(X, Y ).

Exercise 1.8.

Let \({X}_{1},\ldots ,{X}_{n}\) be independent random variables having an exponential distribution with the parameter λ. Find the expected values of the random variables \({V }_{n} =\max ({X}_{1},\ldots ,{X}_{n})\) and \({W}_{n} =\min ({X}_{1},\ldots ,{X}_{n})\).

Exercise 1.9.

Let X and Y be independent random variables with density functions \({f}_{X}(x)\) and \({f}_{Y }(x)\), respectively. Determine the conditional expected value \(\mathbf{E}\left (X\ \vert \ X < Y \right )\).

Exercise 1.10.

Determine the conditional expectations \(\mathbf{E}\left (X\ \vert Y = y\right )\) and \(\mathbf{E}\left (X\ \vert Y \right )\) if the joint PDF of the random variables X and Y has the form

  (a)

    \({f}_{X,Y }(x,y) = \left \{\begin{array}{ll} 2,&\text{ if }\ 0 < x,y\text{ and}\ \ x + y < 1,\\ 0, &\text{ otherwise;} \end{array} \right.\)

  (b)

    \({f}_{X,Y }(x,y) = \left \{\begin{array}{ll} 3(x + y),&\text{ if }\ 0 < x,y\text{ and}\ \ x + y < 1,\\ 0, &\text{ otherwise}. \end{array} \right.\)

Exercise 1.11.

Let \({X}_{1},{X}_{2},\ldots \) be independent random variables with an exponential distribution of the parameter λ. Let N be a geometrically distributed random variable with the parameter p [\({p}_{k} = \mathbf{P}\left (N = k\right ) = p{(1 - p)}^{k-1},\ k = 1,2,\ldots \)], independent of the random variables \({X}_{1},{X}_{2},\ldots \). Prove that the sum \(Y = {X}_{1} + \ldots + {X}_{N}\) has an exponential distribution with the parameter pλ.

Exercise 1.12.

Consider the distribution function of the sum \({Y }_{40}\) of independent random variables \({X}_{1},\ldots ,{X}_{40}\) having an exponential distribution with the parameter 1. Give an estimate for the probability \(p = \mathbf{P}\left (\frac{\left \vert {Y }_{40}-\mathbf{E}\left ({Y }_{40}\right )\right \vert } {\mathbf{D}\left ({Y }_{40}\right )} > 0.05\right )\) calculated with the help of the central limit theorem. This probability can also be computed numerically because the random variable \({Y }_{40}\) has a gamma distribution with the parameters (40, 1). Using this fact, what result do we obtain for the probability in question? (On the numerical calculation of the gamma distribution see, e.g., [72] or [63].)
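One possible numerical check is sketched below (assuming the SciPy library; norm.cdf and gamma.cdf are SciPy functions, not part of the exercise). The central limit theorem estimate 2(1 − Φ(0.05)) is compared with the exact value obtained from the gamma distribution of \({Y }_{40}\):

from math import sqrt
from scipy.stats import gamma, norm

n, eps = 40, 0.05
mean, std = n, sqrt(n)  # E(Y_40) = 40 and D(Y_40) = sqrt(40) for Gamma(40, 1)
p_clt = 2 * (1 - norm.cdf(eps))  # CLT estimate of P(|Z| > 0.05), Z standard normal
lo, hi = mean - eps * std, mean + eps * std
p_exact = 1 - (gamma.cdf(hi, a=n) - gamma.cdf(lo, a=n))  # exact gamma probability
print(f"CLT estimate: {p_clt:.4f}")  # approximately 0.9601
print(f"exact value:  {p_exact:.4f}")

Both values are close to 0.96, showing that the normal approximation is already accurate for n = 40.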