
A probability space is a measure space \((\Omega,\mathcal{A},P)\), where \(\mathcal{A}\) is a σ-algebra of subsets of \(\Omega \) (the measurable sets), and P is a measure on \(\mathcal{A}\) with \(P(\Omega ) = 1\). The set \(\Omega \) is called a sample space. An element \(a \in \mathcal{A}\) is called an event. P(a) is called the probability of the event a. The measure P on \(\mathcal{A}\) is called a probability measure, or (probability) distribution law, or simply (probability) distribution.

A random variable X is a measurable function from a probability space \((\Omega,\mathcal{A},P)\) into a measurable space, called a state space of possible values of the variable; it is usually taken to be \(\mathbb{R}\) with the Borel σ-algebra, so \(X: \Omega \rightarrow \mathbb{R}\). The range \(\mathcal{X}\) of the variable X is called the support of the distribution P; an element \(x \in \mathcal{X}\) is called a state.

A distribution law can be uniquely described via a cumulative distribution (or simply, distribution) function CDF, which describes the probability that a random value X takes on a value at most x: \(F(x) = P(X \leq x) = P(\omega \in \Omega: X(\omega ) \leq x)\).

So, any random variable X gives rise to a probability distribution which assigns to the interval [a, b] the probability \(P(a \leq X \leq b) = P(\omega \in \Omega: a \leq X(\omega ) \leq b)\), i.e., the probability that the variable X will take a value in the interval [a, b].

A distribution is called discrete if F(x) consists of a sequence of finite jumps at \(x_{i}\); a distribution is called continuous if F(x) is continuous. We consider (as in the majority of applications) only discrete or absolutely continuous distributions, i.e., the CDF function \(F: \mathbb{R} \rightarrow \mathbb{R}\) is absolutely continuous. It means that, for every number ε > 0, there is a number δ > 0 such that, for any sequence of pairwise disjoint intervals \([x_{k},y_{k}]\), 1 ≤ k ≤ n, the inequality \(\sum _{1\leq k\leq n}(y_{k} - x_{k}) <\delta \) implies the inequality \(\sum _{1\leq k\leq n}\vert F(y_{k}) - F(x_{k})\vert <\epsilon \).

A distribution law can also be uniquely defined via a probability density (or density, probability) function PDF of the underlying random variable. For an absolutely continuous distribution, the CDF is almost everywhere differentiable, and the PDF is defined as the derivative \(p(x) = F^{^{{\prime}} }(x)\) of the CDF; so, \(F(x) = P(X \leq x) =\int _{ -\infty }^{x}p(t)\mathit{dt}\), and \(\int _{a}^{b}p(t)\mathit{dt} = P(a \leq X \leq b)\). In the discrete case, the CDF is \(\sum _{x_{i}\leq x}p(x_{i})\), where p(x) = P(X = x) is the probability mass function (which plays the role of the PDF). But p(x) = 0 for each fixed x in any continuous case.

The random variable X is used to “push-forward” the measure P on \(\Omega \) to a measure dF on \(\mathbb{R}\). The underlying probability space is a technical device used to guarantee the existence of random variables and sometimes to construct them.

We usually present the discrete version of probability metrics, but many of them are defined on any measurable space; see [Bass89, Bass13, Cha08]. For a probability distance d on random quantities, the conditions P(X = Y ) = 1 or equality of distributions imply (and characterize) d(X, Y ) = 0; such distances are called [Rach91] compound or simple distances, respectively. Often, some ground distance d is given on the state space \(\mathcal{X}\) and the presented distance is a lifting of it to a distance on distributions. A quasi-distance between distributions is also called divergence or distance statistic.

Below we denote p X  = p(x) = P(X = x), F X  = F(x) = P(X ≤ x), p(x, y) = P(X = x, Y = y). We denote by \(\mathbb{E}[X]\) the expected value (or mean) of the random variable X: in the discrete case \(\mathbb{E}[X] =\sum _{x}\mathit{xp}(x)\), in the continuous case \(\mathbb{E}[X] =\int \mathit{xp}(x)\mathit{dx}\).

The covariance between the random variables X and Y is \(\mathit{Cov}(X,Y ) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y ])] = \mathbb{E}[\mathit{XY }] - \mathbb{E}[X]\mathbb{E}[Y ].\) The variance and standard deviation of X are \(\mathit{Var}(X) = \mathit{Cov}(X,X)\) and \(\sigma (X) = \sqrt{\mathit{Var } (X)}\), respectively. The correlation between X and Y is \(\mathit{Corr}(X,Y ) = \frac{\mathit{Cov}(X,Y )} {\sigma (X)\sigma (Y )}\); cf. Chap. 17.
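
A minimal numeric sketch (in Python; the joint law below is illustrative and not taken from the text) of the expected value, variance, covariance and correlation for a discrete joint distribution:

    # expectation, variance, covariance and correlation for a small discrete joint law
    import math

    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}    # p(x, y), illustrative

    E_x = sum(p * x for (x, y), p in joint.items())
    E_y = sum(p * y for (x, y), p in joint.items())
    E_xy = sum(p * x * y for (x, y), p in joint.items())
    cov = E_xy - E_x * E_y                                          # Cov(X, Y)
    var_x = sum(p * (x - E_x) ** 2 for (x, y), p in joint.items())  # Var(X)
    var_y = sum(p * (y - E_y) ** 2 for (x, y), p in joint.items())  # Var(Y)
    corr = cov / (math.sqrt(var_x) * math.sqrt(var_y))              # Corr(X, Y)
    print(E_x, E_y, cov, corr)                                      # 0.5 0.5 0.15 0.6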

1 Distances on Random Variables

All distances in this section are defined on the set Z of all random variables with the same support \(\mathcal{X}\); here X, Y ∈ Z.

  • p -Average compound metric

    Given p ≥ 1, the p -average compound metric (or L p -metric between variables) is a metric on Z with \(\mathcal{X} \subset \mathbb{R}\) and \(\mathbb{E}[\vert Z\vert ^{p}] < \infty \) for all Z ∈ Z defined by

    $$\displaystyle{(\mathbb{E}[\vert X - Y \vert ^{p}])^{1/p} = (\sum _{ (x,y)\in \mathcal{X}\times \mathcal{X}}\vert x - y\vert ^{p}p(x,y))^{1/p}.}$$

    For p = 2 and p = ∞, it is called, respectively, the mean-square distance and essential supremum distance between variables.
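
    A minimal sketch (Python; the joint distribution p(x, y) is illustrative) of the p-average compound metric for a discrete joint law:

        # p-average compound metric (E[|X - Y|^p])^(1/p) for a small discrete joint law
        joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # p(x, y), illustrative

        def p_average(joint, p=1):
            # sum |x - y|^p weighted by p(x, y), then take the 1/p-th power
            return sum(prob * abs(x - y) ** p for (x, y), prob in joint.items()) ** (1 / p)

        print(p_average(joint, p=1))   # 0.3
        print(p_average(joint, p=2))   # sqrt(0.3) ~ 0.548 (mean-square distance)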

  • Łukaszyk–Karmowski metric

    The Łukaszyk–Karmowski metric (2001) on Z with \(\mathcal{X} \subset \mathbb{R}\) is defined by

    $$\displaystyle{\sum _{(x,y)\in \mathcal{X}\times \mathcal{X}}\vert x - y\vert p(x)p(y).}$$

    For continuous random variables, it is defined by \(\int _{-\infty }^{+\infty }\int _{-\infty }^{+\infty }\vert x - y\vert p(x)p(y)\mathit{dxdy}\), where p(x), p(y) are the densities of X and Y. This function can be positive even for X = Y. This possibility is excluded, and so it becomes a metric, if and only if it holds

    $$\displaystyle{\int _{-\infty }^{+\infty }\int _{ -\infty }^{+\infty }\vert x - y\vert \delta (x - \mathbb{E}[X])\delta (y - \mathbb{E}[Y ])\mathit{dxdy} = \vert \mathbb{E}[X] - \mathbb{E}[Y ]\vert.}$$
  • Absolute moment metric

    Given p ≥ 1, the absolute moment metric is a metric on Z with \(\mathcal{X} \subset \mathbb{R}\) and \(\mathbb{E}[\vert Z\vert ^{p}] < \infty \) for all Z ∈ Z defined by

    $$\displaystyle{\vert (\mathbb{E}[\vert X\vert ^{p}])^{1/p} - (\mathbb{E}[\vert Y \vert ^{p}])^{1/p}\vert.}$$

    For p = 1 it is called the engineer metric.

  • Indicator metric

    The indicator metric is a metric on Z defined by

    $$\displaystyle{\mathbb{E}[1_{X\neq Y }] =\sum _{(x,y)\in \mathcal{X}\times \mathcal{X}}1_{x\neq y}p(x,y) =\sum _{(x,y)\in \mathcal{X}\times \mathcal{X},x\neq y}p(x,y).}$$

    (Cf. Hamming metric in Chap. 1.)

  • Ky Fan metric K

    The Ky Fan metric K is a metric K on Z, defined by

    $$\displaystyle{\inf \{\epsilon > 0: P(\vert X - Y \vert >\epsilon ) <\epsilon \}.}$$

    It is the case d(x, y) =  | x − y |  of the probability distance below.
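
    A minimal sketch (Python; the grid search and the joint law are illustrative) approximating the Ky Fan metric K for a discrete joint law:

        # Ky Fan metric K = inf{eps > 0 : P(|X - Y| > eps) < eps}, by grid search over eps
        joint = {(0, 0): 0.4, (0, 2): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # p(x, y), illustrative

        def ky_fan(joint, grid=10000):
            for k in range(1, grid + 1):
                eps = k / grid
                tail = sum(prob for (x, y), prob in joint.items() if abs(x - y) > eps)
                if tail < eps:
                    return eps
            return 1.0   # the metric never exceeds 1

        print(ky_fan(joint))   # ~0.3, since P(|X - Y| > eps) = 0.3 for all eps < 1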

  • Ky Fan metric K*

    The Ky Fan metric K* is a metric on Z defined by

    $$\displaystyle{\mathbb{E}\left [ \frac{\vert X - Y \vert } {1 + \vert X - Y \vert }\right ] =\sum _{(x,y)\in \mathcal{X}\times \mathcal{X}} \frac{\vert x - y\vert } {1 + \vert x - y\vert }p(x,y).}$$
  • Probability distance

    Given a metric space \((\mathcal{X},d)\), the probability distance on Z is defined by

    $$\displaystyle{\inf \{\epsilon > 0: P(d(X,Y ) >\epsilon ) <\epsilon \}.}$$

2 Distances on Distribution Laws

All distances in this section are defined on the set \(\mathcal{P}\) of all distribution laws such that corresponding random variables have the same range \(\mathcal{X}\); here \(P_{1},P_{2} \in \mathcal{P}\).

  • L p -metric between densities

    The L p -metric between densities is a metric on \(\mathcal{P}\) (for a countable \(\mathcal{X}\)) defined, for any p ≥ 1, by

    $$\displaystyle{(\sum _{x}\vert p_{1}(x) - p_{2}(x)\vert ^{p})^{\frac{1} {p} }.}$$

    For p = 1, one half of it is called the variational distance (or total variation distance, Kolmogorov distance). For p = 2, it is the Patrick–Fisher distance. The point metric \(\sup _{x}\vert p_{1}(x) - p_{2}(x)\vert \) corresponds to p = ∞.

    The Lissak–Fu distance with parameter α > 0 is defined as \(\sum _{x}\vert p_{1}(x) - p_{2}(x)\vert ^{\alpha }\).
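
    A minimal sketch (Python; the discrete laws p1, p2 are illustrative) of the L p -metric between densities, the variational (total variation) distance and the Lissak–Fu distance:

        # L_p-metric between densities, total variation (half of L_1), Lissak-Fu distance
        p1 = {'a': 0.5, 'b': 0.3, 'c': 0.2}
        p2 = {'a': 0.4, 'b': 0.4, 'c': 0.2}

        def lp_between_densities(p1, p2, p=1):
            return sum(abs(p1[x] - p2[x]) ** p for x in p1) ** (1 / p)

        def lissak_fu(p1, p2, alpha=0.5):
            return sum(abs(p1[x] - p2[x]) ** alpha for x in p1)

        total_variation = lp_between_densities(p1, p2, p=1) / 2   # 0.1
        patrick_fisher = lp_between_densities(p1, p2, p=2)        # ~0.141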

  • Bayesian distance

    The error probability in classification is the following error probability of the optimal Bayes rule for the classification into two classes with a priori probabilities ϕ, 1 −ϕ and corresponding densities p 1, p 2 of the observations:

    $$\displaystyle{P_{e} =\sum _{x}\min (\phi p_{1}(x),(1-\phi )p_{2}(x)).}$$

    The Bayesian distance on \(\mathcal{P}\) is defined by 1 − P e .

    For the classification into m classes with a priori probabilities ϕ i , 1 ≤ i ≤ m, and corresponding densities p i of the observations, the error probability becomes

    $$\displaystyle{P_{e} = 1 -\sum _{x}p(x)\max _{i}P(C_{i}\vert x),}$$

    where P(C i  | x) is the a posteriori probability of the class C i given the observation x and \(p(x) =\sum _{ i=1}^{m}\phi _{i}P(x\vert C_{i})\). The general mean distance between m classes C i (cf. m-hemimetric in Chap. 3) is defined (Van der Lubbe, 1979) for α > 0, β > 1 by

    $$\displaystyle{\sum _{x}p(x)\left (\sum _{i}P(C_{i}\vert x)^{\beta }\right )^{\alpha }.}$$

    The case α = 1, β = 2 corresponds to the Bayesian distance in Devijver, 1974; the case \(\beta = \frac{1} {\alpha }\) was considered in Trouborst et al., 1974.

  • Mahalanobis semimetric

    The Mahalanobis semimetric is a semimetric on \(\mathcal{P}\) (for \(\mathcal{X} \subset \mathbb{R}^{n}\)) defined by

    $$\displaystyle{\sqrt{(\mathbb{E}_{P_{1 } } [X] - \mathbb{E}_{P_{2 } } [X])^{T } A(\mathbb{E}_{P_{1 } } [X] - \mathbb{E}_{P_{2 } } [X])}}$$

    for a given positive-semidefinite matrix A; its square is a Bregman quasi-distance (cf. Chap. 13). Cf. also the Mahalanobis distance in Chap. 17.

  • Engineer semimetric

    The engineer semimetric is a semimetric on \(\mathcal{P}\) (for \(\mathcal{X} \subset \mathbb{R}\)) defined by

    $$\displaystyle{\vert \mathbb{E}_{P_{1}}[X] - \mathbb{E}_{P_{2}}[X]\vert = \vert \sum _{x}x(p_{1}(x) - p_{2}(x))\vert.}$$
  • Stop-loss metric of order m

    The stop-loss metric of order m is a metric on \(\mathcal{P}\) (for \(\mathcal{X} \subset \mathbb{R}\)) defined by

    $$\displaystyle{\sup _{t\in \mathbb{R}}\sum _{x\geq t}\frac{(x - t)^{m}} {m!} (p_{1}(x) - p_{2}(x)).}$$
  • Kolmogorov–Smirnov metric

    The Kolmogorov–Smirnov metric (or Kolmogorov metric, uniform metric) is a metric on \(\mathcal{P}\) (for \(\mathcal{X} \subset \mathbb{R}\)) defined (1948) by

    $$\displaystyle{\sup _{x\in \mathbb{R}}\vert P_{1}(X \leq x) - P_{2}(X \leq x)\vert.}$$

    This metric is used, for example, in Biology as a measure of sexual dimorphism.

    The Kuiper distance on \(\mathcal{P}\) is defined by

    $$\displaystyle{\sup _{x\in \mathbb{R}}(P_{1}(X \leq x) - P_{2}(X \leq x)) +\sup _{x\in \mathbb{R}}(P_{2}(X \leq x) - P_{1}(X \leq x)).}$$

    (Cf. Pompeiu–Eggleston metric in Chap. 9).

    The Crnkovic–Drachman distance is defined by

    $$\displaystyle{\sup _{x\in \mathbb{R}}(P_{1}(X \leq x) - P_{2}(X \leq x))\ln \frac{1} {\sqrt{P_{1 } (X \leq x)(1 - P_{1 } (X \leq x))}}+}$$
    $$\displaystyle{+\sup _{x\in \mathbb{R}}(P_{2}(X \leq x) - P_{1}(X \leq x))\ln \frac{1} {\sqrt{P_{1 } (X \leq x)(1 - P_{1 } (X \leq x))}}.}$$
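
    A minimal sketch (Python; the support and probabilities are illustrative) of the Kolmogorov–Smirnov and Kuiper distances between the CDFs of two discrete laws on a common support:

        # Kolmogorov-Smirnov and Kuiper distances from the two step-function CDFs
        support = [0, 1, 2, 3]
        p1 = [0.1, 0.4, 0.3, 0.2]
        p2 = [0.3, 0.1, 0.2, 0.4]

        def cdf(p):
            out, acc = [], 0.0
            for prob in p:
                acc += prob
                out.append(acc)
            return out

        F1, F2 = cdf(p1), cdf(p2)
        ks = max(abs(a - b) for a, b in zip(F1, F2))          # sup_x |F1(x) - F2(x)| = 0.2
        kuiper = max(a - b for a, b in zip(F1, F2)) + max(b - a for a, b in zip(F1, F2))   # 0.4

    (Since the CDFs are constant between atoms, the suprema are attained at the support points.)
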
  • Cramér–von Mises distance

    The Cramér–von Mises distance (1928) is defined on \(\mathcal{P}\) (for \(\mathcal{X} \subset \mathbb{R}\)) by

    $$\displaystyle{\omega ^{2} =\int _{ -\infty }^{+\infty }(P_{ 1}(X \leq x) - P_{2}(X \leq x))^{2}\mathit{dP}_{ 2}(x).}$$

    The Anderson–Darling distance (1954) on \(\mathcal{P}\) is defined by

    $$\displaystyle{\int _{-\infty }^{+\infty }\frac{(P_{1}(X \leq x) - P_{2}(X \leq x))^{2}} {P_{2}(X \leq x)(1 - P_{2}(X \leq x))}\mathit{dP}_{2}(x).}$$

    In Statistics, the above distances of Kolmogorov–Smirnov, Cramér–von Mises, Anderson–Darling and, below, the χ 2 -distance are the main measures of goodness of fit between estimated, P 2, and theoretical, P 1, distributions.

    They and other distances were generalized (for example by Kiefer, 1955, and Glick, 1969) to the K-sample setting, i.e., some convenient generalized distances \(d(P_{1},\ldots,P_{K})\) were defined. Cf. m-hemimetric in Chap. 3.

  • Energy distance

    The energy distance (Székely, 2005) between cumulative distribution functions F(X), F(Y ) of two independent random vectors \(X,Y \in \mathbb{R}^{n}\) is defined by

    $$\displaystyle{d(F(X),F(Y )) = 2\mathbb{E}[\vert \vert X - Y \vert \vert ] - \mathbb{E}[\vert \vert X - X^{{\prime}}\vert \vert ] - \mathbb{E}[\vert \vert Y - Y ^{{\prime}}\vert \vert ],}$$

    where \(X,X^{{\prime}}\) are iid (independent and identically distributed), \(Y,Y^{{\prime}}\) are iid and | | . | | is the length of a vector. Cf. distance covariance in Chap. 17.

    It holds d(F(X), F(Y )) = 0 if and only if X and Y are identically distributed.
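
    A minimal sketch (Python; the two samples are illustrative) of an empirical estimate of the energy distance, where each expectation is replaced by an average of pairwise Euclidean distances:

        # empirical energy distance 2 E||X - Y|| - E||X - X'|| - E||Y - Y'|| between two samples
        import math

        X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
        Y = [(2.0, 2.0), (3.0, 2.0)]

        def mean_dist(A, B):
            # average of all pairwise distances (a simple, slightly biased V-statistic estimate)
            return sum(math.dist(a, b) for a in A for b in B) / (len(A) * len(B))

        energy = 2 * mean_dist(X, Y) - mean_dist(X, X) - mean_dist(Y, Y)
        print(energy)   # positive, since the two samples are well separated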

  • Prokhorov metric

    Given a metric space \((\mathcal{X},d)\), the Prokhorov metric on \(\mathcal{P}\) is defined (1956) by

    $$\displaystyle{\inf \{\epsilon > 0: P_{1}(X \in B) \leq P_{2}(X \in B^{\epsilon }) +\epsilon \mbox{ and }P_{2}(X \in B) \leq P_{1}(X \in B^{\epsilon })+\epsilon \},}$$

    where B is any Borel subset of \(\mathcal{X}\), and \(B^{\epsilon } =\{ x: d(x,y) <\epsilon,y \in B\}\).

    It is the smallest (over all joint distributions of pairs (X, Y ) of random variables X, Y such that the marginal distributions of X and Y are P 1 and P 2, respectively) probability distance between random variables X and Y.

  • Lévy–Sibley metric

    The Lévy–Sibley metric is a metric on \(\mathcal{P}\) (for \(\mathcal{X} \subset \mathbb{R}\) only) defined by

    $$\displaystyle{\inf \{\epsilon > 0: P_{1}(X \leq x-\epsilon )-\epsilon \leq P_{2}(X \leq x) \leq P_{1}(X \leq x+\epsilon ) +\epsilon \mbox{ for any }x \in \mathbb{R}\}.}$$

    It is a special case of the Prokhorov metric for \((\mathcal{X}, d) = (\mathbb{R}, \vert x - y\vert )\).

  • Dudley metric

    Given a metric space \((\mathcal{X},d)\), the Dudley metric on \(\mathcal{P}\) is defined by

    $$\displaystyle{\sup _{f\in F}\vert \mathbb{E}_{P_{1}}[f(X)] - \mathbb{E}_{P_{2}}[f(X)]\vert =\sup _{f\in F}\vert \sum _{x\in \mathcal{X}}f(x)(p_{1}(x) - p_{2}(x))\vert,}$$

    where \(F =\{ f: \mathcal{X} \rightarrow \mathbb{R},\vert \vert f\vert \vert _{\infty } + \mathit{Lip}_{d}(f) \leq 1\}\), and \(\mathit{Lip}_{d}(f) =\sup _{x,y\in \mathcal{X},x\neq y}\frac{\vert f(x)-f(y)\vert } {d(x,y)}\).

  • Szulga metric

    Given a metric space \((\mathcal{X},d)\), the Szulga metric (1982) on \(\mathcal{P}\) is defined by

    $$\displaystyle{\sup _{f\in F}\vert (\sum _{x\in \mathcal{X}}\vert f(x)\vert ^{p}p_{ 1}(x))^{1/p} - (\sum _{ x\in \mathcal{X}}\vert f(x)\vert ^{p}p_{ 2}(x))^{1/p}\vert,}$$

    where \(F =\{ f: \mathcal{X} \rightarrow \mathbb{R},\,\,\mathit{Lip}_{d}(f) \leq 1\}\), and \(\mathit{Lip}_{d}(f) =\sup _{x,y\in \mathcal{X},x\neq y}\frac{\vert f(x)-f(y)\vert } {d(x,y)}\).

  • Zolotarev semimetric

    The Zolotarev semimetric is a semimetric on \(\mathcal{P}\), defined (1976) by

    $$\displaystyle{\sup _{f\in F}\vert \sum _{x\in \mathcal{X}}f(x)(p_{1}(x) - p_{2}(x))\vert,}$$

    where F is any set of functions \(f: \mathcal{X} \rightarrow \mathbb{R}\) (in the continuous case, F is any set of such bounded continuous functions); cf. Szulga metric, Dudley metric.

  • Convolution metric

    Let G be a separable locally compact Abelian group, and let C(G) be the set of all real bounded continuous functions on G vanishing at infinity. Fix a function g ∈ C(G) such that | g | is integrable with respect to the Haar measure on G, and \(\{\beta \in G^{{\ast}}:\hat{ g}(\beta ) = 0\}\) has empty interior; here \(G^{{\ast}}\) is the dual group of G, and \(\hat{g}\) is the Fourier transform of g.

    The convolution metric (or smoothing metric) is defined (Yukich, 1985), for any two finite signed Baire measures P 1 and P 2 on G, by

    $$\displaystyle{\sup _{x\in G}\vert \int _{y\in G}g(xy^{-1})(\mathit{dP}_{ 1} -\mathit{dP}_{2})(y)\vert.}$$

    It can also be seen as the difference \(T_{P_{1}}(g) - T_{P_{2}}(g)\) of convolution operators on C(G) where, for any f ∈ C(G), the operator \(T_{P}f(x)\) is \(\int _{y\in G}f(xy^{-1})\mathit{dP}(y)\).

    In particular, this metric can be defined on the space of probability measures on \(\mathbb{R}^{n}\), where g is a PDF satisfying the above conditions.

  • Discrepancy metric

    Given a metric space \((\mathcal{X},d)\), the discrepancy metric on \(\mathcal{P}\) is defined by

    $$\displaystyle{\sup \{\vert P_{1}(X \in B) - P_{2}(X \in B)\vert: B\mbox{ is any closed ball}\}.}$$
  • Bi-discrepancy semimetric

    The bi-discrepancy semimetric (evaluating the proximity of distributions P 1, P 2 over different collections \(\mathcal{A}_{1},\mathcal{A}_{2}\) of measurable sets) is defined by

    $$\displaystyle{D(P_{1},P_{2}) + D(P_{2},P_{1}),}$$

    where \(D(P_{1},P_{2}) =\sup \{\inf \{ P_{2}(C): B \subset C \in \mathcal{A}_{2}\} - P_{1}(B): B \in \mathcal{A}_{1}\}\) (discrepancy).

  • Le Cam distance

    The Le Cam distance (1974) is a semimetric, evaluating the proximity of probability distributions P 1, P 2 (on different spaces \(\mathcal{X}_{1},\mathcal{X}_{2}\)) and defined as follows:

    $$\displaystyle{\max \{\delta (P_{1},P_{2}),\delta (P_{2},P_{1})\},}$$

    where \(\delta (P_{1},P_{2}) =\inf _{B}\sum _{x_{2}\in \mathcal{X}_{2}}\vert BP_{1}(X_{2} = x_{2}) - BP_{2}(X_{2} = x_{2})\vert \) is the Le Cam deficiency. Here \(BP_{1}(X_{2} = x_{2}) =\sum _{x_{1}\in \mathcal{X}_{1}}p_{1}(x_{1})b(x_{2}\vert x_{1})\), where B is a probability distribution over \(\mathcal{X}_{1} \times \mathcal{X}_{2}\), and

    $$\displaystyle{b(x_{2}\vert x_{1}) = \frac{B(X_{1} = x_{1},X_{2} = x_{2})} {B(X_{1} = x_{1})} = \frac{B(X_{1} = x_{1},X_{2} = x_{2})} {\sum _{x\in \mathcal{X}_{2}}B(X_{1} = x_{1},X_{2} = x)}.}$$

    So, BP 2(X 2 = x 2) is a probability distribution over \(\mathcal{X}_{2}\), since \(\sum _{x_{2}\in \mathcal{X}_{2}}b(x_{2}\vert x_{1}) = 1\).

    The Le Cam distance is not a probabilistic distance, since P 1 and P 2 are defined over different spaces; it is a distance between statistical experiments (models).

  • Skorokhod–Billingsley metric

    The Skorokhod–Billingsley metric is a metric on \(\mathcal{P}\), defined by

    $$\displaystyle\begin{array}{rcl} & & \inf _{f}\max \left \{\sup _{x}\vert P_{1}(X \leq x) - P_{2}(X \leq f(x))\vert,\sup _{x}\vert f(x) - x\vert,\right. {}\\ & & \qquad \quad \left.\sup _{x\neq y}\left \vert \ln \frac{f(y) - f(x)} {y - x} \right \vert \right \}, {}\\ \end{array}$$

    where \(f: \mathbb{R} \rightarrow \mathbb{R}\) is any strictly increasing continuous function.

  • Skorokhod metric

    The Skorokhod metric is a metric on \(\mathcal{P}\) defined (1956) by

    $$\displaystyle{\inf \{\epsilon > 0:\max \{\sup _{x}\vert P_{1}(X < x) - P_{2}(X \leq f(x))\vert,\sup _{x}\vert f(x) - x\vert \} <\epsilon \},}$$

    where \(f: \mathbb{R} \rightarrow \mathbb{R}\) is a strictly increasing continuous function.

  • Birnbaum–Orlicz distance

    The Birnbaum–Orlicz distance (1931) is a distance on \(\mathcal{P}\) defined by

    $$\displaystyle{\sup _{x\in \mathbb{R}}f(\vert P_{1}(X \leq x) - P_{2}(X \leq x)\vert ),}$$

    where \(f: \mathbb{R}_{\geq 0} \rightarrow \mathbb{R}_{\geq 0}\) is any nondecreasing continuous function with f(0) = 0, and f(2t) ≤ Cf(t) for any t > 0 and some fixed C ≥ 1. It is a near-metric, since the C -triangle inequality \(d(P_{1},P_{2}) \leq C(d(P_{1},P_{3}) + d(P_{3},P_{2}))\) holds.

    The Birnbaum–Orlicz distance is also used, in Functional Analysis, on the set of all integrable functions on the segment [0, 1], where it is defined by \(\int _{0}^{1}H(\vert f(x) - g(x)\vert )\mathit{dx}\), where H is a nondecreasing continuous function from \([0,\infty )\) onto \([0,\infty )\) which vanishes at the origin and satisfies the Orlicz condition: \(\sup _{t>0}\frac{H(2t)} {H(t)} < \infty \).

  • Kruglov distance

    The Kruglov distance (1973) is a distance on \(\mathcal{P}\), defined by

    $$\displaystyle{\int f(P_{1}(X \leq x) - P_{2}(X \leq x))\mathit{dx},}$$

    where \(f: \mathbb{R}_{\geq 0} \rightarrow \mathbb{R}_{\geq 0}\) is any even strictly increasing function with f(0) = 0, and f(s + t) ≤ C(f(s) + f(t)) for any s, t ≥ 0 and some fixed C ≥ 1. It is a near-metric, since the C -triangle inequality d(P 1, P 2) ≤ C(d(P 1, P 3) + d(P 3, P 2)) holds.

  • Bregman divergence

    Given a differentiable strictly convex function \(\phi (p): \mathbb{R}^{n} \rightarrow \mathbb{R}\) and β ∈ (0, 1), the skew Jensen (or skew Burbea–Rao) divergence on \(\mathcal{P}\) is (Basseville–Cardoso, 1995)

    $$\displaystyle{J_{\phi }^{(\beta )}(P_{ 1},P_{2}) =\beta \phi (p_{1}) + (1-\beta )\phi (p_{2}) -\phi (\beta p_{1} + (1-\beta )p_{2}).}$$

    The Burbea–Rao distance (1982) is the case \(\beta = \frac{1} {2}\) of it, i.e., it is

    $$\displaystyle{\sum _{x}\left (\frac{\phi (p_{1}(x)) +\phi (p_{2}(x))} {2} -\phi \left (\frac{p_{1}(x) + p_{2}(x)} {2} \right )\right ).}$$

    The Bregman divergence (1967) is a quasi-distance on \(\mathcal{P}\) defined by

    $$\displaystyle{\sum _{x}(\phi (p_{1}(x)) -\phi (p_{2}(x)) - (p_{1}(x) - p_{2}(x))\phi ^{{\prime}}(p_{ 2}(x))) =\lim _{\beta \rightarrow 0}\frac{1} {\beta } J_{\phi }^{(\beta )}(P_{ 1},P_{2}).}$$

    The generalised Kullback–Leibler distance \(\sum _{x}p_{1}(x)\ln \frac{p_{1}(x)} {p_{2}(x)} -\sum _{x}(p_{1}(x) - p_{2}(x))\) and Itakura–Saito distance (cf. Chap. 21) \(\sum _{x}\left (\frac{p_{1}(x)} {p_{2}(x)} -\ln \frac{p_{1}(x)} {p_{2}(x)} - 1\right )\) are the cases \(\phi (p) =\sum _{x}p(x)\ln p(x) -\sum _{x}p(x)\) and \(\phi (p) = -\sum _{x}\ln p(x)\) of the Bregman divergence. Cf. Bregman quasi-distance in Chap. 13.

    Csiszár, 1991, proved that the Kullback–Leibler distance is the only Bregman divergence which is an f -divergence.
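
    A minimal sketch (Python; phi is applied coordinatewise and the laws p1, p2 are illustrative) of the Bregman divergence, recovering the generalised Kullback–Leibler and Itakura–Saito distances as the special cases named above:

        # Bregman divergence sum_x (phi(p1) - phi(p2) - (p1 - p2) * phi'(p2)), coordinatewise
        import math

        def bregman(p1, p2, phi, dphi):
            return sum(phi(a) - phi(b) - (a - b) * dphi(b) for a, b in zip(p1, p2))

        p1 = [0.5, 0.3, 0.2]
        p2 = [0.4, 0.4, 0.2]

        gen_kl = bregman(p1, p2, lambda t: t * math.log(t) - t, lambda t: math.log(t))
        itakura_saito = bregman(p1, p2, lambda t: -math.log(t), lambda t: -1.0 / t)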

  • f -divergence

    Given a convex function \(f(t): \mathbb{R}_{\geq 0} \rightarrow \mathbb{R}\) with \(f(1) = 0,f^{{\prime}}(1) = 0,f^{{\prime\prime}}(1) = 1\), the f -divergence (independently, Csiszár, 1963, Morimoto, 1963, Ali–Silvey, 1966, Ziv–Zakai, 1973, and Akaike, 1974) on \(\mathcal{P}\) is defined by

    $$\displaystyle{\sum _{x}p_{2}(x)f\left (\frac{p_{1}(x)} {p_{2}(x)}\right ).}$$

    The cases f(t) = tlnt and f(t) = (t − 1)2 correspond to the Kullback–Leibler distance and to the χ 2 -distance below, respectively. The case f(t) =  | t − 1 | corresponds to the variational distance, and the case \(f(t) = 4(1 -\sqrt{t})\) (as well as \(f(t) = 2(t + 1) - 4\sqrt{t}\)) corresponds to the squared Hellinger metric.

    Semimetrics can also be obtained, as the square root of the f-divergence, in the cases f(t) = (t − 1)2∕(t + 1) (the Vajda–Kus semimetric), f(t) =  | t a − 1 | 1∕a with 0 < a ≤ 1 (the generalized Matusita distance), and \(f(t) = \frac{(t^{a}+1)^{1/a}-2^{(1-a)/a}(t+1)} {1-1/a}\) (the Österreicher semimetric).
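
    A minimal sketch (Python; the laws p1, p2 are illustrative) of the f -divergence, with the choices of f that recover the Kullback–Leibler, χ 2 and variational distances mentioned above:

        # f-divergence sum_x p2(x) * f(p1(x) / p2(x))
        import math

        def f_divergence(p1, p2, f):
            return sum(b * f(a / b) for a, b in zip(p1, p2))

        p1 = [0.5, 0.3, 0.2]
        p2 = [0.4, 0.4, 0.2]

        kl = f_divergence(p1, p2, lambda t: t * math.log(t))       # Kullback-Leibler distance
        chi2 = f_divergence(p1, p2, lambda t: (t - 1) ** 2)        # chi^2-distance
        variational = f_divergence(p1, p2, lambda t: abs(t - 1))   # variational (L_1) distance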

  • α -divergence

    Given \(\alpha \in \mathbb{R}\), the α -divergence (independently, Csiszár, 1967, Havrda–Charvát, 1967, Cressie–Read, 1984, and Amari, 1985) is defined as KL(P 1, P 2) for α = 1, as KL(P 2, P 1) for α = 0 and, for α ≠ 0, 1, it is

    $$\displaystyle{ \frac{1} {\alpha (1-\alpha )}\left (1 -\sum _{x}p_{2}(x)\left (\frac{p_{1}(x)} {p_{2}(x)}\right )^{\alpha }\right ).}$$

    The Amari divergence comes from the above by the transformation \(\alpha = \frac{1+t} {2}\).

  • Harmonic mean similarity

    The harmonic mean similarity is a similarity on \(\mathcal{P}\) defined by

    $$\displaystyle{2\sum _{x} \frac{p_{1}(x)p_{2}(x)} {p_{1}(x) + p_{2}(x)}.}$$
  • Fidelity similarity

    The fidelity similarity (or Bhattacharya coefficient, Hellinger affinity) on \(\mathcal{P}\) is

    $$\displaystyle{\rho (P_{1},P_{2}) =\sum _{x}\sqrt{p_{1 } (x)p_{2 } (x)}.}$$

    Cf. more general quantum fidelity similarity in Chap. 24.

  • Hellinger metric

    In terms of the fidelity similarity ρ, the Hellinger metric (or Matusita distance , Hellinger–Kakutani metric) on \(\mathcal{P}\) is defined by

    $$\displaystyle{(2\sum _{x}(\sqrt{p_{1 } (x)} -\sqrt{p_{2 } (x)})^{2})^{\frac{1} {2} } = 2\sqrt{1 -\rho (P_{1 }, P_{2 } )}.}$$
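
    A minimal sketch (Python; the laws p1, p2 are illustrative) checking that the two expressions above agree:

        # fidelity similarity rho and the Hellinger metric 2 * sqrt(1 - rho)
        import math

        p1 = [0.5, 0.3, 0.2]
        p2 = [0.4, 0.4, 0.2]

        rho = sum(math.sqrt(a * b) for a, b in zip(p1, p2))   # fidelity (Bhattacharya coefficient)
        hellinger = math.sqrt(2 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p1, p2)))
        assert abs(hellinger - 2 * math.sqrt(1 - rho)) < 1e-9
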
  • Bhattacharya distance 1

    In terms of the fidelity similarity ρ, the Bhattacharya distance 1 (1946) is

    $$\displaystyle{(\arccos \rho (P_{1},P_{2}))^{2}}$$

    for \(P_{1},P_{2} \in \mathcal{P}\). Twice this distance is the Rao distance from Chap. 7. It is used also in Statistics and Machine Learning, where it is called the Fisher distance.

    The Bhattacharya distance 2 (1943) on \(\mathcal{P}\) is defined by

    $$\displaystyle{-\ln \rho (P_{1},P_{2}).}$$
  • χ 2 -distance

    The χ 2 -distance (or Pearson χ 2 -distance ) is a quasi-distance on \(\mathcal{P}\), defined by

    $$\displaystyle{\sum _{x}\frac{(p_{1}(x) - p_{2}(x))^{2}} {p_{2}(x)}.}$$

    The Neyman χ 2 -distance is a quasi-distance on \(\mathcal{P}\), defined by

    $$\displaystyle{\sum _{x}\frac{(p_{1}(x) - p_{2}(x))^{2}} {p_{1}(x)}.}$$

    Half of the χ 2 -distance is also called Kagan's divergence.

    The probabilistic symmetric χ 2 -measure is a distance on \(\mathcal{P}\), defined by

    $$\displaystyle{2\sum _{x}\frac{(p_{1}(x) - p_{2}(x))^{2}} {p_{1}(x) + p_{2}(x)}.}$$
  • Separation quasi-distance

    The separation distance is a quasi-distance on \(\mathcal{P}\) (for a countable \(\mathcal{X}\)) defined by

    $$\displaystyle{\max _{x}\left (1 -\frac{p_{1}(x)} {p_{2}(x)}\right ).}$$

    (Not to be confused with separation distance in Chap. 9).

  • Kullback–Leibler distance

    The Kullback–Leibler distance (or relative entropy, information deviation, information gain, KL-distance) is a quasi-distance on \(\mathcal{P}\), defined (1951) by

    $$\displaystyle{\mathit{KL}(P_{1},P_{2}) = \mathbb{E}_{P_{1}}[\ln L] =\sum _{x}p_{1}(x)\ln \frac{p_{1}(x)} {p_{2}(x)},}$$

    where \(L = \frac{p_{1}(x)} {p_{2}(x)}\) is the likelihood ratio. Therefore,

    $$\displaystyle{\mathit{KL}(P_{1},P_{2})\,=\, -\sum _{x}p_{1}(x)\ln \,p_{2}(x) +\sum _{x}p_{1}(x)\ln \,p_{1}(x)\,=\,H(P_{1},P_{2}) - H(P_{1}),}$$

    where H(P 1) is the entropy of P 1, and H(P 1, P 2) is the cross-entropy of P 1 and P 2.

    If P 2 is the product of marginals of P 1 (say, p 2(x, y) = p 1(x)p 1(y)), the KL-distance KL(P 1, P 2) is called the Shannon information quantity and (cf. Shannon distance) is equal to \(\sum _{(x,y)\in \mathcal{X}\times \mathcal{X}}p_{1}(x,y)\ln \frac{p_{1}(x,y)} {p_{1}(x)p_{1}(y)}\).

    The exponential divergence is defined by \(\sum _{x}p_{1}(x)(\ln \frac{p_{1}(x)} {p_{2}(x)})^{2}.\)
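
    A minimal sketch (Python; the laws p1, p2 are illustrative) of the Kullback–Leibler distance, written both directly and as cross-entropy minus entropy:

        # KL(P1, P2) = sum_x p1(x) ln(p1(x)/p2(x)) = H(P1, P2) - H(P1)
        import math

        p1 = [0.5, 0.3, 0.2]
        p2 = [0.4, 0.4, 0.2]

        kl = sum(a * math.log(a / b) for a, b in zip(p1, p2))
        cross_entropy = -sum(a * math.log(b) for a, b in zip(p1, p2))   # H(P1, P2)
        entropy = -sum(a * math.log(a) for a in p1)                     # H(P1)
        assert abs(kl - (cross_entropy - entropy)) < 1e-12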

  • Distance to normality

    For a continuous distribution P on \(\mathbb{R}\), the differential entropy is defined by

    $$\displaystyle{h(P) = -\int _{-\infty }^{\infty }p(x)\ln p(x)\mathit{dx}.}$$

    It is \(\ln (\delta \sqrt{2\pi e})\) for a normal (or Gaussian) distribution \(g_{\delta,\mu }(x) = \frac{1} {\sqrt{2\pi \delta ^{2}}} \exp \left (-\frac{(x-\mu )^{2}} {2\delta ^{2}} \right )\) with variance δ 2 and mean μ.

    The distance to normality (or negentropy) of P is the Kullback–Leibler distance \(\mathit{KL}(P,g) =\int _{ -\infty }^{\infty }p(x)\ln \left (\frac{p(x)} {g(x)}\right )\mathit{dx} = h(g) - h(P)\), where g is a normal distribution with the same mean and variance as P. So, it is nonnegative and equal to 0 if and only if P = g almost everywhere. Cf. Shannon distance.

    Also, \(h(u_{a,b}) =\ln (b - a)\) for a uniform distribution with minimum a and maximum b > a, i.e., \(u_{a,b}(x) = \frac{1} {b-a}\) if x ∈ [a, b], and 0 otherwise. It holds \(h(u_{a,b}) \geq h(P)\) for any distribution P with support contained in [a, b]; so, \(h(u_{a,b}) - h(P)\) can be called the distance to uniformity. Tononi, 2008, used it in his model of consciousness.

  • Jeffrey distance

    The Jeffrey distance (or J-divergence, KL2-distance) is a symmetric version of the Kullback–Leibler distance defined (1946) on \(\mathcal{P}\) by

    $$\displaystyle{\mathit{KL}(P_{1},P_{2}) + \mathit{KL}(P_{2},P_{1}) =\sum _{x}(p_{1}(x) - p_{2}(x))\ln \frac{p_{1}(x)} {p_{2}(x)}.}$$

    The Aitchison distance (1986) is defined by \(\sqrt{\sum _{x }\left (\ln \frac{p_{1 } (x)g(p_{2 } )} {p_{2}(x)g(p_{1})}\right )^{2}}\), where \(g(p) = (\prod _{x}p(x))^{ \frac{1} {n} }\) is the geometric mean of the components p(x) of p.

  • Resistor-average distance

    The resistor-average distance is (Johnson–Sinanović, 2000) a symmetric version of the Kullback–Leibler distance on \(\mathcal{P}\) which is defined by the harmonic sum

    $$\displaystyle{\left ( \frac{1} {\mathit{KL}(P_{1},P_{2})} + \frac{1} {\mathit{KL}(P_{2},P_{1})}\right )^{-1}.}$$
  • Jensen–Shannon divergence

    Given a number β ∈ [0, 1] and \(P_{1},P_{2} \in \mathcal{P}\), let P 3 denote β P 1 + (1 −β)P 2. The skew divergence and the Jensen–Shannon divergence between P 1 and P 2 are defined on \(\mathcal{P}\) as KL(P 1, P 3) and β KL(P 1, P 3) + (1 −β)KL(P 2, P 3), respectively. Here KL is the Kullback–Leibler distance; cf. clarity similarity.

    In terms of the entropy \(H(P) = -\sum _{x}p(x)\ln p(x)\), the Jensen–Shannon divergence is \(H(\beta P_{1} + (1-\beta )P_{2}) -\beta H(P_{1}) - (1-\beta )H(P_{2})\), i.e., the Jensen divergence (cf. Bregman divergence).

    Let \(P_{3} = \frac{1} {2}(P_{1} + P_{2})\), i.e., \(\beta = \frac{1} {2}\). Then the skew divergence and twice the Jensen–Shannon divergence are called K -divergence and Topsøe distance (or information statistics), respectively. The Topsøe distance is a symmetric version of KL(P 1, P 2). It is not a metric, but its square root is a metric.
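
    A minimal sketch (Python; the laws and β are illustrative) of the Jensen–Shannon divergence via two Kullback–Leibler distances to the mixture P 3:

        # Jensen-Shannon divergence beta*KL(P1, P3) + (1 - beta)*KL(P2, P3), P3 = beta*P1 + (1-beta)*P2
        import math

        def kl(p, q):
            return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

        def jensen_shannon(p1, p2, beta=0.5):
            p3 = [beta * a + (1 - beta) * b for a, b in zip(p1, p2)]
            return beta * kl(p1, p3) + (1 - beta) * kl(p2, p3)

        p1 = [0.5, 0.3, 0.2]
        p2 = [0.4, 0.4, 0.2]
        print(2 * jensen_shannon(p1, p2))   # Topsoe distance for beta = 1/2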

  • Clarity similarity

    The clarity similarity is a similarity on \(\mathcal{P}\), defined by

    $$\displaystyle{(\mathit{KL}(P_{1},P_{3}) + \mathit{KL}(P_{2},P_{3})) - (\mathit{KL}(P_{1},P_{2}) + \mathit{KL}(P_{2},P_{1})) =}$$
    $$\displaystyle{=\sum _{x}\left (p_{1}(x)\ln \frac{p_{2}(x)} {p_{3}(x)} + p_{2}(x)\ln \frac{p_{1}(x)} {p_{3}(x)}\right ),}$$

    where KL is the Kullback–Leibler distance, and P 3 is a fixed probability law.

    It was introduced in [CCL01] with P 3 being the probability distribution of English.

  • Ali–Silvey distance

    The Ali–Silvey distance is a quasi-distance on \(\mathcal{P}\) defined by the functional

    $$\displaystyle{f(\mathbb{E}_{P_{1}}[g(L)]),}$$

    where \(L = \frac{p_{1}(x)} {p_{2}(x)}\) is the likelihood ratio, f is a nondecreasing function on \(\mathbb{R}\), and g is a continuous convex function on \(\mathbb{R}_{\geq 0}\) (cf. f -divergence).

    The case f(x) = x, g(x) = xlnx corresponds to the Kullback–Leibler distance; the case f(x) = −lnx, g(x) = x t corresponds to the Chernoff distance.

  • Chernoff distance

    The Chernoff distance (or Rényi cross-entropy) on \(\mathcal{P}\) is defined (1954) by

    $$\displaystyle{\max _{t\in (0,1)}D_{t}(P_{1},P_{2}),}$$

    where 0 < t < 1 and \(D_{t}(P_{1},P_{2}) = -\ln \sum _{x}(p_{1}(x))^{t}(p_{2}(x))^{1-t}\) (called the Chernoff coefficient), which is proportional to the Rényi distance.

  • Rényi distance

    Given \(t \in \mathbb{R}\), the Rényi distance (or Rényi divergence of order t, 1961) is a quasi-distance on \(\mathcal{P}\) defined as the Kullback–Leibler distance KL(P 1, P 2) if t = 1, and, otherwise, by

    $$\displaystyle{ \frac{1} {t - 1}\ln \sum _{x}p_{2}(x)\left (\frac{p_{1}(x)} {p_{2}(x)}\right )^{t}.}$$

    For \(t = \frac{1} {2}\), one half of the Rényi distance is the Bhattacharya distance 2. Cf. f -divergence and Chernoff distance.

  • Shannon distance

    Given a measure space \((\Omega,\mathcal{A},P)\), where the set \(\Omega \) is finite and P is a probability measure, the entropy (or Shannon information entropy) of a function \(f: \Omega \rightarrow X\), where X is a finite set, is defined by

    $$\displaystyle{H(f) = -\sum _{x\in X}P(f = x)\log _{a}(P(f = x)).}$$

    Here a = 2, e, or 10 and the unit of entropy is called a bit, nat, or dit (digit), respectively. The function f can be seen as a partition of the measure space.

    For any two such partitions \(f: \Omega \rightarrow X\) and \(g: \Omega \rightarrow Y\), denote by H(f, g) the entropy of the partition \((f,g): \Omega \rightarrow X \times Y\) (joint entropy), and by H(f | g) the conditional entropy (or equivocation). Then the Shannon distance between f and g is a metric defined by

    $$\displaystyle{H(f\vert g) + H(g\vert f) = 2H(f,g) - H(f) - H(g) = H(f,g) - I(f;g),}$$

    where I(f; g) = H(f) + H(g) − H(f, g) is the Shannon mutual information.

    If P is the uniform probability law, then Goppa showed that the Shannon distance can be obtained as a limiting case of the finite subgroup metric.

    In general, the information metric (or entropy metric ) between two random variables (information sources) X and Y is defined by

    $$\displaystyle{H(X\vert Y ) + H(Y \vert X) = H(X,Y ) - I(X;Y ),}$$

    where the conditional entropy H(X | Y ) is defined by \(-\sum _{x\in X}\sum _{y\in Y }p(x,y)\ln p(x\vert y)\), and p(x | y) = P(X = x | Y = y) is the conditional probability.

    The Rajski distance (or normalized information metric) is defined (Rajski, 1961, for discrete probability distributions X, Y ) by

    $$\displaystyle{\frac{H(X\vert Y ) + H(Y \vert X)} {H(X,Y )} = 1 - \frac{I(X;Y )} {H(X,Y )}.}$$

    It is equal to 1 if X and Y are independent. (Cf., a different one, normalized information distance in Chap. 11).
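
    A minimal sketch (Python; the joint law is illustrative) of the information metric H(X|Y) + H(Y|X) and the Rajski distance computed from a joint distribution:

        # information metric 2H(X,Y) - H(X) - H(Y) and Rajski distance 1 - I(X;Y)/H(X,Y)
        import math

        joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # p(x, y), illustrative

        def entropy(probs):
            return -sum(p * math.log2(p) for p in probs if p > 0)

        px, py = {}, {}
        for (x, y), p in joint.items():
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p

        H_xy = entropy(joint.values())
        H_x, H_y = entropy(px.values()), entropy(py.values())
        info_metric = 2 * H_xy - H_x - H_y          # H(X|Y) + H(Y|X)
        rajski = info_metric / H_xy                 # 1 - I(X;Y)/H(X,Y)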

  • Transportation distance

    Given a metric space \((\mathcal{X},d)\), the transportation distance (and/or, according to Villani, 2009, Monge–Kantorovich–Wasserstein–Rubinstein–Ornstein–Gini–Dall’Aglio–Mallows–Tanaka distance) is the metric defined by

    $$\displaystyle{W_{1}(P_{1},P_{2}) =\inf \, \mathbb{E}_{S}[d(X,Y )] =\inf _{S}\int _{(X,Y )\in \mathcal{X}\times \mathcal{X}}d(X,Y )\mathit{dS}(X,Y ),}$$

    where the infimum is taken over all joint distributions S of pairs (X, Y ) of random variables X, Y such that marginal distributions of X and Y are P 1 and P 2.

    For any separable metric space \((\mathcal{X},d)\), this is equivalent to the Lipschitz distance between measures sup f ∫ f d(P 1P 2), where the supremum is taken over all functions f with | f(x) − f(y) | ≤ d(x, y) for any \(x,y \in \mathcal{X}\). Cf. Dudley metric.

    In general, for a Borel function \(c: \mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}_{\geq 0}\), the c -transportation distance T c (P 1, P 2) is \(\inf \,\mathbb{E}_{S}[c(X,Y )]\). It is the minimal total transportation cost if c(X, Y ) is the cost of transporting a unit of mass from the location X to the location Y. Cf. the Earth Mover’s distance (Chap. 21), which is a discrete form of it.

    The L p -Wasserstein distance is \(W_{p} = (T_{d^{p}})^{1/p} = (\inf \,\mathbb{E}_{S}[d^{p}(X,Y )])^{1/p}\). For \((\mathcal{X},d) = (\mathbb{R},\vert x - y\vert )\), it is also called the L p -metric between distribution functions (CDF) F i with \(F_{i}^{-1}(x) =\sup \{u: P_{i}(X \leq u) < x\}\), and can be written as

    $$\displaystyle\begin{array}{rcl} (\inf \,\mathbb{E}[\vert X - Y \vert ^{p}])^{1/p}& =& \left (\int _{ \mathbb{R}}\vert F_{1}(x) - F_{2}(x)\vert ^{p}\mathit{dx}\right )^{1/p} {}\\ & =& \left (\int _{0}^{1}\vert F_{ 1}^{-1}(x) - F_{ 2}^{-1}(x)\vert ^{p}\mathit{dx}\right )^{1/p}. {}\\ \end{array}$$

    For p = 1, this metric is called the Monge–Kantorovich metric (or Wasserstein metric, Fortet–Mourier metric, Hutchinson metric, Kantorovich–Rubinstein metric). For p = 2, it is the Lévy–Fréchet metric (Fréchet, 1957).
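
    A minimal sketch (Python; the common support and probabilities are illustrative) of the L 1 -Wasserstein (Monge–Kantorovich) distance on \(\mathbb{R}\) as the integral of \(\vert F_{1}(x) - F_{2}(x)\vert \), which reduces to a finite sum for step-function CDFs:

        # W_1 between two discrete laws on R via the area between their CDFs
        support = [0.0, 1.0, 2.5, 4.0]      # common support, sorted
        p1 = [0.1, 0.4, 0.3, 0.2]
        p2 = [0.3, 0.1, 0.2, 0.4]

        F1 = F2 = 0.0
        w1 = 0.0
        for i in range(len(support) - 1):
            F1 += p1[i]
            F2 += p2[i]
            w1 += abs(F1 - F2) * (support[i + 1] - support[i])   # CDFs are constant between atoms
        print(w1)   # 0.65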

  • Ornstein \(\overline{d}\) -metric

    The Ornstein \(\overline{d}\) -metric is a metric on \(\mathcal{P}\) (for \(\mathcal{X} = \mathbb{R}^{n}\)) defined (1974) by

    $$\displaystyle{ \frac{1} {n}\inf \int _{x,y}\left (\sum _{i=1}^{n}1_{ x_{i}\neq y_{i}}\right )\mathit{dS},}$$

    where the infimum is taken over all joint distributions S of pairs (X, Y ) of random variables X, Y such that marginal distributions of X and Y are P 1 and P 2.

  • Distances between belief assignments

    In Bayesian (or subjective, evidential) interpretation, a probability can be assigned to any statement, even if no random process is involved, as a way to represent its subjective plausibility, or the degree to which it is supported by the available evidence, or, mainly, degree of belief. Within this approach, imprecise probability generalizes Probability Theory to deal with scarce, vague, or conflicting information. The main model is Dempster–Shafer theory, which allows evidence to be combined.

    Given a set X, a (basic) belief assignment is a function m: P(X) → [0, 1] (where P(X) is the set of all subsets of X) with \(m(\varnothing ) = 0\) and \(\sum _{A\in P(X)}m(A) = 1\). Probability measures are a special case in which m(A) > 0 only for singletons.

    Then, for a classical probability P, it holds that Bel(A) ≤ P(A) ≤ Pl(A), where the belief function and plausibility function are defined, respectively, by

    $$\displaystyle{\mathrm{Bel}(A) =\sum _{B:B\subset A}m(B)\,\mbox{ and }\,\mathrm{Pl}(A) =\sum _{B:B\cap A\neq \varnothing }m(B) = 1 -\mathrm{Bel}(\overline{A}).}$$

    The original (Dempster, 1967) conflict factor between two belief assignments m 1 and m 2 was defined as \(c(m_{1},m_{2}) =\sum _{A\cap B=\varnothing }m_{1}(A)m_{2}(B)\). This is not a distance since c(m, m) can be positive. The combination of m 1 and m 2, seen as independent sources of evidence, is defined by \(m_{1} \oplus m_{2}(A) = \frac{1} {1-c(m_{1},m_{2})}\sum _{B\cap C=A}m_{1}(B)m_{2}(C)\).
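
    A minimal sketch (Python; the frame {a, b} and the assignments m1, m2 are illustrative) of the belief and plausibility functions and of Dempster's conflict factor:

        # Bel, Pl and the conflict factor for basic belief assignments on subsets of X = {a, b}
        m1 = {frozenset({'a'}): 0.6, frozenset({'a', 'b'}): 0.4}
        m2 = {frozenset({'b'}): 0.5, frozenset({'a', 'b'}): 0.5}

        def bel(m, A):
            return sum(p for B, p in m.items() if B <= A)       # sum over B contained in A

        def pl(m, A):
            return sum(p for B, p in m.items() if B & A)        # sum over B meeting A

        conflict = sum(p * q for B, p in m1.items() for C, q in m2.items() if not (B & C))
        print(bel(m1, frozenset({'a'})), pl(m1, frozenset({'a'})), conflict)   # 0.6 1.0 0.3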

    Usually, a distance between m 1 and m 2 estimates the difference between these sources in the form d U  =  | U(m 1) − U(m 2) | , where U is an uncertainty measure; see Sarabi-Jamab et al., 2013, for a comparison of their performance. In particular, this distance is called:

    • confusion (Hoehle, 1981) if \(U(m) = -\sum _{A}m(A)\log _{2}\mathrm{Bel}(A)\);

    • dissonance (Yager, 1983) if \(U(m) = E(m) = -\sum _{A}m(A)\log _{2}\mathrm{Pl}(A)\);

    • Yager’s factor (Yager, 1983) if \(U(m) = 1 -\sum _{A\neq \varnothing }\frac{m(A)} {\vert A\vert }\);

    • possibility-based (Smets, 1983) if \(U(m) = -\sum _{A}\log _{2}\sum _{B:A\subset B}m(B)\);

    • U-uncertainty (Dubois–Prade, 1985) if \(U(m)\,=\,I(m)\,=\, -\sum _{A}m(A)\log _{2}\vert A\vert \);

    • Lamata–Moral’s (1988) if \(U(m) =\log _{2}(\sum _{A}m(A)\vert A\vert )\) and U(m) = E(m) + I(m);

    • discord (Klir–Ramer, 1990) if \(U(m) = D(m) = -\sum _{A}m(A)\log _{2}(1 -\sum _{B}m(B)\frac{\vert B\setminus A\vert } {\vert B\vert } )\) and a variant: U(m) = D(m) + I(m);

    • strife (Klir–Parviz, 1992) if \(U(m) = -\sum _{A}m(A)\log _{2}(\sum _{B}m(B)\frac{\vert A\cap B\vert } {\vert A\vert } )\);

    • Pal et al.’s (1993) if \(U(m) = G(m) = -\sum _{A}\log _{2}m(A)\) and U(m) = G(m) + I(m);

    • total conflict (George–Pal, 1996) if \(U(m) =\sum _{A}m(A)\sum _{B}(m(B)(1 -\frac{\vert A\cap B\vert } {\vert A\cup B\vert }))\).

    Among other distances used are the cosine distance \(1 - \frac{m_{1}^{T}m_{ 2}} {\vert \vert m_{1}\vert \vert \,\vert \vert m_{2}\vert \vert }\), the Mahalanobis distance \(\sqrt{(m_{1 } - m_{2 } )^{T } A(m_{1 } - m_{2 } )}\) for some matrix A, and a pignistic-based one (Tessem, 1993): \(\max _{A}\{\vert \sum _{B\neq \varnothing }(m_{1}(B) - m_{2}(B))\frac{\vert A\cap B\vert } {\vert B\vert }\vert \}\).