1 Outline

The goals formulated in the abstract are achieved in the following way and order: to address a wide audience, throughout the paper (with a few connection-indicative exceptions) we formulate and investigate divergences and distances entirely between functions, even for the probability context. In Sect. 2, we provide some non-technical background and an overview of their principal possible uses for tasks in data analytics such as statistics, machine learning, and artificial intelligence (AI). Furthermore, we indicate some connections with geometry and information. Thereafter, in Sect. 3 we introduce a new structured framework (toolkit) of divergences between functions, and discuss their building blocks, boundary behaviour, and identifiability properties. Several subcases, running examples, technical subtleties of practical importance, etc. are illuminated, too. Finally, we study divergences between “entirely different functions”, which appear e.g. in the frequent situation where one wants to find, for data-derived discrete functions, a closest possible continuous-function model (cf. Sect. 4); several corresponding noisy minimum-divergence procedures are compared – for the first time within a unifying framework – and new methods are derived, too.

2 Some General Motivations and Uses of Divergences

2.1 Quantification of Proximity

As a starting motivation, it is basic knowledge that there are numerous ways of evaluating the proximity d(p, q) of two real numbers p and q of primary interest. For instance, to quantify that p and q nearly coincide one could use the difference \(d^{(1)}(p,q) := p-q \approx 0\) or the fraction \(d^{(2)}(p,q) := \frac{p}{q} \approx 1\), scaled (e.g. magnifying, zooming-in) versions \(d_{m}^{(3)}(p,q) := m \cdot (p-q) \approx 0\) or \(d_{m}^{(4)}(p,q) := m \cdot \frac{p}{q} \approx 1\) with “scale” m of secondary (auxiliary) interest, as well as more flexible hybrids \(d_{m_{1},m_{2},m_{3}}^{(5)}(p,q) := m_{3} \cdot \big (\frac{p}{m_{1}} - \frac{q}{m_{2}}\big ) \approx 0\) where \(m_{i}\) may also take one of the values p, q. All these “dissimilarities” \(d^{(j)}(\cdot ,\cdot )\) can in principle take any sign and are asymmetric, which is consistent with the desire – required in many applications – that one of the two primary-interest numbers (say p) plays a distinct role; moreover, the involved divisions require technical care if one allows for (convergence to) zero-valued numbers. A more sophisticated, nonlinear alternative to \(d^{(1)}(\cdot ,\cdot )\) is given by the dissimilarity \(d_{\phi }^{(6)}(p,q) := \phi (p) - (\phi (q) + \phi ^{\prime }(q) \cdot (p-q))\) where \(\phi (\cdot )\) is a strictly convex, differentiable function; thus \(d_{\phi }^{(6)}(p,q)\) quantifies the difference between \(\phi (p)\) and the value at p of the tangent line taken at \(\phi (q)\). Notice that \(d_{\phi }^{(6)}(\cdot ,\cdot )\) is generally still asymmetric but always stays nonnegative, independently of the possible signs of the “generator” \(\phi \) and the signs of p, q. In contrast, as a nonlinear alternative to \(d_{m}^{(4)}(\cdot ,\cdot )\) one can construct from \(\phi \) the dissimilarity \(d_{\phi }^{(7)}(p,q) := q \cdot \phi \big (\frac{p}{q}\big )\) (where \(m=q\)), which is also asymmetric but can become negative depending on the signs of p, q, \(\phi \). More generally, one often wants to work with dissimilarities \(d(\cdot ,\cdot )\) having the properties

  1. (D1)

    \(d(p,q) \geqslant 0\)    for all p, q            (nonnegativity),

  2. (D2)

    \(d(p,q) = 0\) if and only if \(p=q\)    (reflexivity; identity of indiscerniblesFootnote 1),

and such \(d(\cdot ,\cdot )\) is then called a divergence (or disparity, contrast function). Loosely speaking, the divergence d(p, q) of p and q can be interpreted as a kind of “directed distance from p to q”.Footnote 2 As already indicated above, the underlying directedness turns out to be especially useful in contexts where the first component (point), say p, is always/principally of “more importance” or of “higher attention” than the second component, say q; this is nothing unusual, since after all, one of our most fundamental daily-life constituents – namely time – is directed (and therefore also time-dependent quantities)! Moreover, as a further analogue consider the “way/path-length” d(p, q) a taxi would travel from point p to point q in parts of a city with at least one one-way street. Along the latter, there automatically exist points \(p \ne q\) such that \(d(p,q) \ne d(q,p)\); this non-equality may even hold for all \(p \ne q\) if the street pattern is irregular enough; the same holds on similar systems of connected “one-way loops”, directed graphs, etc. However, sometimes the application context demands the usage of a dissimilarity \(d(\cdot ,\cdot )\) satisfying (D1), (D2) and

  1. (D3)

    \(d(p,q) = d(q,p)\)    for all p, q    (symmetry),

and such \(d(\cdot ,\cdot )\) is denoted as a distance; notice that we don’t assume that the triangle inequality holds. Hence, we regard a distance as a symmetric divergence. Moreover, a distance \(d(\cdot ,\cdot )\) can be constructed from a divergence \(\widetilde{d}(\cdot ,\cdot )\) e.g. by means of one of the three “symmetrizing operations” \(d(p,q) := \widetilde{d}(p,q) + \widetilde{d}(q,p)\), \(d(p,q) := \min \{\widetilde{d}(p,q), \widetilde{d}(q,p)\}\), \(d(p,q) := \max \{\widetilde{d}(p,q), \widetilde{d}(q,p)\}\) for all p and q.
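To make these pointwise quantities concrete, the following minimal Python sketch implements \(d_{\phi }^{(6)}\), \(d_{\phi }^{(7)}\) and the three symmetrizing operations; the generator \(\phi (t) := t \log t - t + 1\) on \(]0,\infty [\) and the strictly positive test values are merely assumed illustrative choices, not part of the constructions above.

```python
import numpy as np

# Assumed concrete generator: phi(t) = t*log(t) - t + 1, strictly convex
# and differentiable on ]0, infinity[ with phi'(t) = log(t).
def phi(t):
    return t * np.log(t) - t + 1.0

def phi_prime(t):
    return np.log(t)

def d6(p, q):
    # d^(6): difference between phi(p) and the tangent line of phi at q,
    # evaluated at p -- asymmetric, but always nonnegative
    return phi(p) - (phi(q) + phi_prime(q) * (p - q))

def d7(p, q):
    # d^(7): q * phi(p/q) -- asymmetric; its sign depends on p, q, phi
    return q * phi(p / q)

def symmetrize(d, p, q, mode="sum"):
    # the three symmetrizing operations turning a divergence into a distance
    if mode == "sum":
        return d(p, q) + d(q, p)
    if mode == "min":
        return min(d(p, q), d(q, p))
    return max(d(p, q), d(q, p))

print(d6(2.0, 1.0), d6(1.0, 2.0))  # nonnegative, but asymmetric
print(symmetrize(d6, 2.0, 1.0))    # symmetric by construction
```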

In many real-life applications, the numbers p, q of primary interest as well as the scaling numbers \(m_{i}\) of secondary interest are typically replaced by real-valued functions \(x \rightarrow p(x)\), \(x \rightarrow q(x)\), \(x \rightarrow m_{i}(x)\), where \(x \in \mathscr {X}\) is taken from some underlying set \(\mathscr {X}\). To address the entire functions as objects we use the abbreviations \(P := \big \{p(x)\big \}_{x \in \mathscr {X}}\), \(Q := \big \{q(x)\big \}_{x \in \mathscr {X}}\), \(M_{i} := \big \{m_{i}(x)\big \}_{x \in \mathscr {X}}\), and alternatively sometimes also \(p(\cdot )\), \(q(\cdot )\), \(m_{i}(\cdot )\). This conforms with the high-level data processing paradigms in “functional programming” and “symbolic computation”, where functions are basically treated as whole entities, too.

Depending on the nature of the data-analytical task, the function P of primary interest may stem either from a hypothetical model, or its analogue derived from observed/measured data, or its analogue derived from artificial computer-generated (simulated) data; the same holds for Q, where “cross-over constellations” (w.r.t. the origin of P) are possible.

The basic underlying set (space) \(\mathscr {X}\) respectively the function argument x can play different roles, depending on the application context. For instance, if \(\mathscr {X} \subset \mathbb {N}\) is a subset of the natural numbers \(\mathbb {N}\) then \(x \in \mathscr {X}\) may be an index and p(x) may describe the xth real-valued data-point. Accordingly, P is then an s-dimensional vector, where s is the total number of elements in \(\mathscr {X}\), also allowing for \(s=\infty \). In other situations, x itself may be a data point of arbitrary nature (i.e. \(\mathscr {X}\) can be any set) and p(x) a real value attributed to x; this p(x) may be of direct or of indirect use. The latter holds for instance in cases where \(p(\cdot )\) is a density function (on \(\mathscr {X}\)) which roughly serves as a “basis” for the operationalized calculation of the “local aggregations over allFootnote 3 \(A \subset \mathscr {X}\)” in the sense of \(A \rightarrow \sum _{x \in A} p(x)\) or \(A \rightarrow \int _{A} p(x) \, \mathrm {d}\widetilde{\lambda }(x)\) subject to some “integrator” \(\widetilde{\lambda }(\cdot )\) (including classical Riemann integrals \(\mathrm {d}\widetilde{\lambda }(x) = \mathrm {d}x\)); as examples for nonnegative densities \(p(\cdot ) \geqslant 0\) one can take “classical” (volumetric, weights-concerning) inertial-mass densities, population densities, probability densities, whereas densities \(p(\cdot ) \) with possible negative values can occur in electromagnetism (charge densities, polarization densities), in other fields of contemporary physics (negative inertial-mass respectively gravitational-mass densities) as well as in the field of acoustic metamaterials (effective density), to name but a few.

Especially when used as a set of possible states/data configurations (rather than indices), \(\mathscr {X}\) can be of arbitrary complexity. For instance, each x itself may be a real-valued continuous function on a time interval [0, T] (i.e. \(x:[0,T] \rightarrow ]-\infty , \infty [\)) which describes the scenario of the overall time-evolution of a quantity of interest (e.g. of a time-varying quantity in a deterministic production process of one machine, of the return on a stock, of a neural spike train). Accordingly, one can take e.g. \(\mathscr {X} =C\big ([0,T] , ]-\infty , \infty [\big )\) to be the set of all such continuous functions, and e.g. \(p(\cdot )\) a density thereupon (which is then a function on functions). Other kinds of functional data analytics can be covered in an analogous fashion.

To proceed with the proximity-quantification of the primary-interest functions \(P := \big \{p(x)\big \}_{x \in \mathscr {X}}\), \(Q := \big \{q(x)\big \}_{x \in \mathscr {X}}\), in accordance with the above-mentioned investigations one can deal with the pointwise dissimilarities/divergences \(d_{\phi }^{(j)}(p(x),q(x))\), \(d_{m_{1}(x),m_{2}(x),m_{3}(x)}^{(5)}(p(x),q(x))\) for fixed \(x \in \mathscr {X}\), but in many contexts it is crucial to take “summarizing” dissimilarities/divergences

$$ D_{\phi }^{(j)}(P,Q) := \sum _{x \in \mathscr {X}} d_{\phi }^{(j)}(p(x),q(x)) \cdot \lambda (x) \ \ \text {or} \ \ D_{\phi }^{(j)}(P,Q) := \int \limits _{\mathscr {X}} d_{\phi }^{(j)}(p(x),q(x)) \, \mathrm {d}\lambda (x) $$

subject to some weight-type “summator”/“integrator” \(\lambda (\cdot )\) (including classical Riemann integrals); analogously, one can deal with \(D_{\phi , M_{1}, M_{2}, M_{3}}^{(5)}(P,Q) := \sum _{x \in \mathscr {X}} d_{m_{1}(x),m_{2}(x),m_{3}(x)}^{(5)}(p(x),q(x)) \cdot \lambda (x)\) or \(D_{\phi , M_{1}, M_{2}, M_{3}}^{(5)}(P,Q) := \int _{\mathscr {X}} d_{m_{1}(x),m_{2}(x),m_{3}(x)}^{(5)}(p(x),q(x)) \, \mathrm {d}\lambda (x)\). Notice that the requirements (D1), (D2) respectively (D3) carry over in a straightforward manner to these pointwise and aggregated dissimilarities between the functions (rather than real points), and accordingly one calls them (pointwise/aggregated) divergences respectively distances, too.
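For a finite state space, such a summarizing divergence is just a weighted sum. The following sketch (with the same assumed generator \(\phi (t) = t \log t - t + 1\) as in the previous sketch, strictly positive functions, and counting-measure weights as illustrative choices) computes \(D_{\phi }^{(6)}(P,Q)\):

```python
import numpy as np

# Sketch of the aggregated divergence
# D^(6)_phi(P, Q) = sum_x d^(6)_phi(p(x), q(x)) * lambda(x)
# on a finite X = {0, 1, 2, 3}; phi(t) = t*log(t) - t + 1 is an assumed choice.

def d6(p, q):
    phi = lambda t: t * np.log(t) - t + 1.0
    return phi(p) - (phi(q) + np.log(q) * (p - q))  # pointwise tangent-line gap

def aggregated_divergence(d_pointwise, p, q, lam):
    p, q, lam = map(np.asarray, (p, q, lam))
    return float(np.sum(d_pointwise(p, q) * lam))   # weighted "summator" lambda

P = np.array([0.10, 0.40, 0.30, 0.20])  # two strictly positive functions on X
Q = np.array([0.25, 0.25, 0.25, 0.25])
lam = np.ones_like(P)                   # counting-measure weights lambda(x) = 1
print(aggregated_divergence(d6, P, Q, lam))  # nonnegative; zero iff P = Q here
```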

2.2 Divergences and Geometry

There are several ways in which pointwise dissimilarities \(d(\cdot , \cdot )\) respectively aggregated dissimilarities \(D(\cdot , \cdot )\) between two functions \(P := \big \{p(x)\big \}_{x \in \mathscr {X}}\) and \(Q := \big \{q(x)\big \}_{x \in \mathscr {X}}\) can be connected with geometric issues. To start with an “all-encompassing view”, following the lines of e.g. Birkhoff [14] and Millmann and Parker [50], one can build from any set \(\mathscr {S}\), whose elements can be interpreted as “points”, together with a collection \(\mathscr {L}\) of non-empty subsets of \(\mathscr {S}\), interpreted as “lines” (as a manifestation of a principal sort of structural connectivity between points), and an arbitrary distance \(\mathfrak {d}(\cdot ,\cdot )\) on \(\mathscr {S} \times \mathscr {S}\), an axiomatic constructive framework of geometry which can be of far-reaching nature; therein, \(\mathfrak {d}(\cdot ,\cdot )\) plays basically the role of a marked ruler. Accordingly, each triplet \((\mathscr {S}, \mathscr {L}, \mathfrak {d}(\cdot ,\cdot ))\) forms a distinct “quantitative geometric system”; the most prominent classical case is certainly \(\mathscr {S} = \mathbb {R}^2\) with \(\mathscr {L}\) as the collection of all vertical and non-vertical lines, equipped with the Euclidean distance \(\mathfrak {d}(\cdot ,\cdot )\), hence generating the usual Euclidean geometry in the two-dimensional space. In the case that \(\mathfrak {d}(\cdot ,\cdot )\) is only an asymmetric divergence but not a distance anymore, we propose that some of the resulting geometric building blocks have to be interpreted in a direction-based way (e.g. the use of \(\mathfrak {d}(\cdot ,\cdot )\) as a marked directed ruler, the construction of points of equal divergence from a center viewed as distorted directed spheres, etc.). For \(d(\cdot , \cdot )\) one takes \(\mathscr {S} \subset \mathbb {R}\) whereas for \(D(\cdot , \cdot )\) one has to work with \(\mathscr {S}\) being a family of real-valued functions on \(\mathscr {X}\).

Secondly, from any distance \(\mathfrak {d}(\cdot ,\cdot )\) on a “sufficiently rich” set \(\mathscr {S}\) and a finite number of (fixed or adaptively flexible) distinct “reference points” \(s_{i}\) (\(i=1, \ldots , n\)) one can construct the corresponding Voronoi cells \(V(s_i)\) by

$$ V(s_i) := \{ z \in \mathscr {S} : \ \mathfrak {d}(z,s_i) \leqslant \mathfrak {d}(z,s_j) \ \text {for all}\, j=1, \ldots , n \, \} . $$

This produces a tessellation (tiling) of \(\mathscr {S}\) which is very useful for classification purposes. Of course, the geometric shape of these tessellations is of fundamental importance. In the case that \(\mathfrak {d}(\cdot ,\cdot )\) is only an asymmetric divergence but not a distance anymore, \(V(s_i)\) has to be interpreted as a directed Voronoi cell, and then there is also the “reversely directed” alternative

$$ \widetilde{V}(s_i) := \{ z \in \mathscr {S} : \ \mathfrak {d}(s_i,z) \leqslant \mathfrak {d}(s_j,z) \ \text {for all}\, j=1, \ldots , n \, \} . $$

Recent applications where \(\mathscr {S} \subset \mathbb {R}^d\) and \(\mathfrak {d}(\cdot ,\cdot )\) is a Bregman divergence or a more general conformal divergence can be found e.g. in Boissonnat et al. [15], Nock et al. [64] (and the references therein), where they also deal with the corresponding adaptation of k-nearest neighbour classification methods.
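As a small illustration of such directed cells, the following sketch assigns a point z to the directed Voronoi cell \(V(s_i)\) minimizing \(\mathfrak {d}(z,s_i)\) and to the reversely directed cell \(\widetilde{V}(s_i)\) minimizing \(\mathfrak {d}(s_i,z)\); the generalized-Kullback–Leibler Bregman divergence on the positive orthant and the two reference points are assumed choices for illustration, not taken from the cited works.

```python
import numpy as np

# Directed Voronoi cell assignment under an asymmetric divergence; the
# generalized-KL Bregman divergence (generator phi(t) = t*log(t) - t,
# summed over coordinates) is an assumed example on ]0, inf[^d.

def gen_kl(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(np.sum(u * np.log(u / v) - u + v))

def voronoi_label(z, refs, reverse=False):
    # reverse=False: directed cell V(s_i), divergence measured from z to s_i;
    # reverse=True: the "reversely directed" cell V~(s_i)
    scores = [gen_kl(s, z) if reverse else gen_kl(z, s) for s in refs]
    return int(np.argmin(scores))

refs = [np.array([1.0, 2.0]), np.array([2.0, 1.0])]  # assumed reference points
z = np.array([1.2, 1.9])
print(voronoi_label(z, refs), voronoi_label(z, refs, reverse=True))
# the two assignments may differ, since gen_kl is asymmetric
```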

Thirdly, consider a “specific framework” where the functions \(P:= \widetilde{P}_{\theta _{1}} := \big \{\widetilde{p}_{\theta _{1}}(x)\big \}_{x \in \mathscr {X}}\) and \(Q:= \widetilde{P}_{\theta _{2}} := \big \{\widetilde{p}_{\theta _{2}}(x)\big \}_{x \in \mathscr {X}}\) depend on some parameters \(\theta _1 \in \varTheta \), \(\theta _2 \in \varTheta \), which reflect the striving for a complexity-reducing representation of “otherwise intrinsically complicated” functions P, Q. The way of dependence of the function (say) \(\widetilde{p}_{\theta }(\cdot )\) on the underlying parameter \(\theta \) from an appropriate space \(\varTheta \) of e.g. manifold type, may show up directly e.g. via its operation/functioning as a relevant system-indicator, or it may be manifested implicitly e.g. such that \(\widetilde{p}_{\theta }(\cdot )\) is the solution of an optimization problem with \(\theta \)-involving constraints. In such a framework, one can induce divergences \(D(\widetilde{P}_{\theta _{1}},\widetilde{P}_{\theta _{2}}) =: f(\theta _{1},\theta _{2})\) and – under sufficiently smooth dependence – study the corresponding differential-geometric behaviour of \(f(\cdot ,\cdot )\) on \(\varTheta \). An example is provided by the Kullback–Leibler divergence between two distributions of the same exponential family of distributions, which defines a Bregman divergence on the parameter space. This and related issues are subsumed in the research field of “information geometry”; for comprehensive overviews see e.g. Amari [3], Amari [1], Ay et al. [8]. Moreover, for recent connections between divergence-based information geometry and optimal transport the reader is e.g. referred to Pal and Wong [66, 67], Karakida and Amari [34], Amari et al. [2], Peyre and Cuturi [71], and the literature therein.
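To make the exponential-family example tangible, the following numerical sketch checks – for the Poisson family as an assumed concrete instance, with natural parameter \(\theta = \log(\text{mean})\) and log-partition \(\psi (\theta ) = e^{\theta }\) – the standard fact that the Kullback–Leibler divergence between two family members equals the Bregman divergence \(B_{\psi }(\theta _2,\theta _1)\) on the parameter space:

```python
import numpy as np

# Bregman divergence B_psi(s, t) = psi(s) - psi(t) - psi'(t)*(s - t)
def bregman(psi, psi_prime, s, t):
    return psi(s) - psi(t) - psi_prime(t) * (s - t)

# Poisson family as an assumed concrete example:
# natural parameter theta = log(mu), log-partition psi(theta) = exp(theta)
mu1, mu2 = 3.0, 5.0
theta1, theta2 = np.log(mu1), np.log(mu2)

kl_closed_form = mu1 * np.log(mu1 / mu2) + mu2 - mu1  # Poisson KL divergence
kl_via_bregman = bregman(np.exp, np.exp, theta2, theta1)
print(kl_closed_form, kl_via_bregman)  # both approx. 0.4675
```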

Further relations of divergences with other approaches to geometry can be overviewed e.g. from the wide-range-covering research-article collections in Nielsen and Bhatia [58], Nielsen and Barbaresco [55,56,57]. Finally, geometry also enters as a tool for visualizing quantitative effects on divergences.

2.3 Divergences and Uncertainty in Data

In general, data-uncertainty (including “deficiencies” like data incompleteness, fakery, unreliability, faultiness, vagueness, etc.) can enter the framework in various different ways. For instance, in situations where \(x \in \mathscr {X}\) plays the role of an index (e.g. \(\mathscr {X} = \{1, 2, \ldots , s\}\)) and p(x) describes the xth real-valued data-point, the uncertainty is typicallyFootnote 4 incorporated by adding a random argument \(\omega \in \varOmega \) to end up with the “vectors” \(P(\omega ) := \big \{p(x,\omega )\big \}_{x \in \mathscr {X}}\), \(Q(\omega ) := \big \{q(x, \omega )\big \}_{x \in \mathscr {X}}\) of random data points. Accordingly, one ends up with random-variable-type pointwise divergences \(\omega \, \rightarrow \, d_{\phi }^{(j)}(p(x,\omega ),q(x,\omega )), \ \omega \, \rightarrow \, d_{m_{1}(x),m_{2}(x),m_{3}(x)}^{(5)}(p(x,\omega ),q(x,\omega ))\) (\(x \in \mathscr {X}\)) as well as with the random-variable-type “summarizing” divergences \(\omega \, \rightarrow \, D_{\phi }^{(j)}(P(\omega ),Q(\omega )) := \sum _{x \in \mathscr {X}} d_{\phi }^{(j)}(p(x,\omega ),q(x,\omega )) \cdot \lambda (x)\) respectively \(\omega \rightarrow D_{\phi }^{(j)}(P(\omega ),Q(\omega )) := \int _{\mathscr {X}} d_{\phi }^{(j)}(p(x,\omega ),q(x,\omega )) \, \mathrm {d}\lambda (x)\), as well as with \(\omega \, \rightarrow \, D_{\phi , M_{1}, M_{2}, M_{3}}^{(5)}(P(\omega ),Q(\omega )) \, := \, \sum _{x \in \mathscr {X}} \, d_{m_{1}(x),m_{2}(x),m_{3}(x)}^{(5)}(p(x,\omega ),q(x,\omega )) \cdot \lambda (x) \), resp. \(\omega \, \negthinspace \rightarrow \, \negthinspace D_{\phi , M_{1}, M_{2}, M_{3}}^{(5)}\,(P(\omega ),Q(\omega )) \, := \, \int _{\mathscr {X}} \, d_{m_{1}(x),m_{2}(x),m_{3}(x)}^{(5)}\,(p(x,\omega ),q(x,\omega )) \mathrm {d}\lambda (x)\). More generally, one can allow for random scales \(m_{1}(x,\omega )\), \(m_{2}(x,\omega )\), \(m_{3}(x,\omega )\).

In other situations where \(\mathscr {X}\) carries finitely many elements, the state x may e.g. describe a possible outcome \(Y(\omega )\) of an uncertainty-prone observation of a quantity Y of interest and p(x), q(x) represent the corresponding probability mass functions (“discrete density functions”) at x under two alternative probability mechanisms Pr, \(\widetilde{Pr}\) (i.e. \(p(x) = Pr[\{\omega \in \varOmega : Y(\omega )=x\}]\), \(q(x) = \widetilde{Pr}[\{\omega \in \varOmega : Y(\omega )=x\}]\)); as already indicated above, \(P := \big \{p(x)\big \}_{x \in \mathscr {X}}\) respectively \(Q := \big \{q(x)\big \}_{x \in \mathscr {X}}\) serve then as a kind of “basis” for the computation of the probabilities \(\sum _{x \in A} p(x)\) respectively \(\sum _{x \in A} q(x)\) that an arbitrary event \(\{\omega \in \varOmega : Y(\omega ) \in A \}\) (\(A \subset \mathscr {X}\)) occurs. Accordingly, the pointwise divergences \(d_{\phi }^{(j)}(p(x),q(x))\), \(d_{m_{1}(x),m_{2}(x),m_{3}(x)}^{(5)}(p(x),q(x))\) (\(x \in \mathscr {X}\)), and the aggregated divergences \(D_{\phi }^{(j)}(P,Q) := \sum _{x \in \mathscr {X}} d_{\phi }^{(j)}(p(x),q(x))\), \(D_{\phi , M_{1}, M_{2}, M_{3}}^{(5)}(P,Q) := \sum _{x \in \mathscr {X}} d_{m_{1}(x),m_{2}(x),m_{3}(x)}^{(5)}(p(x),q(x))\), \(D_{\phi , M_{1}, M_{2}, M_{3}}^{(5)}(P,Q) := \int _{\mathscr {X}} d_{m_{1}(x),m_{2}(x),m_{3}(x)}^{(5)}(p(x),q(x)) \, \mathrm {d}\lambda (x)\) can then be regarded as (nonnegative, reflexive) dissimilarities between the two alternative uncertainty-quantification-bases P and Q. Analogously, when e.g. \(\mathscr {X}= \mathbb {R}^n\) is the n-dimensional Euclidean space and P, Q are classical probability density functions interpreted roughly via \(p(x) \mathrm {d}x = Pr[\{\omega \in \varOmega : Y(\omega ) \in [x,x + \mathrm {d}x[ \}]\), \(q(x) \mathrm {d}x = \widetilde{Pr}[\{\omega \in \varOmega \, : \, Y(\omega ) \in [x,x + \mathrm {d}x[ \}]\), then \(d_{\phi }^{(j)}(p(x),q(x))\), \(d_{m_{1}(x),m_{2}(x),m_{3}(x)}^{(5)}(p(x),q(x))\) (\(x \in \mathscr {X}\)), \(D_{\phi }^{(j)}(P,Q) := \int _{\mathscr {X}} d_{\phi }^{(j)}(p(x),q(x)) \, \mathrm {d}x\), \(D_{\phi , M_{1}, M_{2}, M_{3}}^{(5)}(P,Q) := \int _{\mathscr {X}} d_{m_{1}(x),m_{2}(x),m_{3}(x)}^{(5)}(p(x),q(x)) \, \mathrm {d}x\) serve as dissimilarities between the two alternative uncertainty-quantification-bases P, Q.

Let us finally mention that in concrete applications, the “degree” of intrinsic data-uncertainty may be zero (deterministic), low (e.g. small random data contamination and small random deviations from a “basically” deterministic system, slightly noisy data, measurement errors) or high (forecast of the price of a stock in one year from now). Furthermore, the data may contain “high unusualnesses” (“surprising observations”) such as outliers and inliers. All this should be taken into account when choosing or even designing the right type of divergence, since different divergences have different sensitivities to such issues (see e.g. Kißlinger and Stummer [37] and the references therein).

2.4 Divergences, Information and Model Uncertainty

In the main spirit of this book on geometric structures of information, let us also connect the latter with dissimilarities in a wide sense which is appropriate enough for our ambitions of universal modeling. In correspondingly adapting some conception e.g. of Buckland [20] to our above-mentioned investigations, in the following we regard a density function (say) \(p(\cdot )\) as a fundamental basis of information understood as quantified real – respectively hypothetical – knowledge which can be communicated about some particular (family of) subjects or (family of) events; according to this information-as-knowledge point of view, pointwise dissimilarities/divergences/distances d(p(x), q(x)) (\(x \in \mathscr {X}\)) respectively aggregated dissimilarities/divergences/distances D(P, Q) quantify the proximity between the two information-bases \(P := \big \{p(x)\big \}_{x \in \mathscr {X}}\) and \(Q := \big \{q(x)\big \}_{x \in \mathscr {X}}\) in a directed/nonnegative directed/nonnegative symmetric way. Hence, \(d(\cdot ,\cdot )\) respectively \(D(\cdot ,\cdot )\) themselves can be seen as higher-level information on pairs of information bases.

Divergences can be used for the quantification of information-concerning issues for model uncertainty (model risk) and exploratory model search in various different ways. For instance, suppose that we search for (respectively learn to understand) a true unknown density function \(Q^{true} := \big \{q^{true}(x)\big \}_{x \in \mathscr {X}}\) of an underlying data-generating mechanism of interest, which is often supposed to be a member of a prefixed class \(\mathscr {P}\) of “hypothetical model-candidate density functions”; frequently, this task is (e.g. for the sake of fast tractability) simplified to a setup of finding the true unknown parameter \(\theta =\theta _{0}\) – and hence \(Q^{true} = Q_{\theta _{0}}\) – within a parametric family \(\mathscr {P} := \{ Q_{\theta } \}_{\theta \in \varTheta }\). Let us first consider the case where the data-generating mechanism of interest \(Q^{true}\) is purely deterministic and hence also all the candidates \(Q \in \mathscr {P}\) are (taken to be) not of probability-density-function type. Although one has no intrinsic data-uncertainty, one faces another type of knowledge-lack called model-uncertainty. Then, one standard goal is to “track down” (respectively learn to understand) this true unknown \(Q^{true}\) respectively \(Q_{\theta _0}\) by collecting and purpose-appropriately postprocessing some corresponding data observations. Accordingly, one attempts to design a density-function-construction rule (mechanism, algorithm) \(data \rightarrow P^{data} := \big \{p^{data}(x)\big \}_{x \in \mathscr {X}}\) to produce data-derived information-basis-type replica of a “comparable principal form” as the anticipated \(Q^{true}\). This rule should theoretically guarantee that \(P^{data}\) converges – with reasonable “operational” speed – to \(Q^{true}\) as the number \(N^{data}\) of data grows, which particularly implies that (say) \(D(P^{data},Q^{true})\) for some prefixed aggregated divergence \(D(\cdot ,\cdot )\) becomes close to zero “fast enough”. On these grounds, one reasonable strategy to down-narrow the true unknown data-generating mechanism \(Q^{true}\) is to take a prefixed class \(\mathscr {P}^{hyp}\) of hypothetical density-function models and compute \(infodeg:= \inf _{Q\in \mathscr {P}^{hyp}} D(P^{data},Q)\) which in the light of the previous discussions can be interpreted as an “unnormalized degree of informative evidence of \(Q^{true}\) being a member of \(\mathscr {P}^{hyp}\)”, or from a reversed point of view, as an “unnormalized degree of goodness of approximation (respectively fit) of the data-derived density function \(P^{data}\) through/by means of \(\mathscr {P}^{hyp}\)”. Within this current paradigm, if infodeg is too large (to be specified in a context-dependent, appropriately quantified sense by taking into account the size of \(N^{data}\)), then one has to repeat the same procedure with a different class \(\widetilde{\mathscr {P}^{hyp}}\); on the other hand, if (and roughly only if) infodeg is small enough then \(\widehat{Q^{data}} := \arg \inf _{Q\in \mathscr {P}^{hyp}} D(P^{data},Q)\) (which may not be unique) is “the most reasonable” approximation. This procedure is repeated recursively as soon as new data points are observed.
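A minimal sketch of this down-narrowing strategy, under assumed concrete choices throughout (an empirical probability mass function built from simulated data, a binomial hypothetical class \(\mathscr {P}^{hyp}\), and the Kullback–Leibler divergence as D), might look as follows:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

# infodeg = inf_{Q in P^hyp} D(P^data, Q), sketched with an assumed
# binomial model class and the Kullback-Leibler divergence between pmfs.

def kl_pmf(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

K = 10
data = np.random.default_rng(0).binomial(K, 0.35, size=500)  # simulated sample
p_data = np.bincount(data, minlength=K + 1) / data.size      # empirical pmf P^data

def objective(theta):
    return kl_pmf(p_data, binom.pmf(np.arange(K + 1), K, theta))

res = minimize_scalar(objective, bounds=(1e-3, 1 - 1e-3), method="bounded")
print(res.fun, res.x)  # small infodeg -> the class fits; res.x is close to 0.35
```

If the resulting infodeg were too large (relative to a context-dependent benchmark accounting for \(N^{data}\)), one would repeat the search with a different hypothetical class, as described above.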

In contrast to the last paragraph, let us now cope with the case where the true unknown data-generating mechanism of interest is prone to uncertainties (i.e. is random, noisy, risk-prone) and hence \(Q^{true}\) as well as all the candidates \(Q \in \mathscr {P}\) are of probability-density-function type. Even more, the data-derived information-basis-type replica \(\omega \rightarrow data(\omega ) \rightarrow P^{data(\omega )} := \big \{p^{data(\omega )}(x)\big \}_{x \in \mathscr {X}}\) of \(Q^{true}\) is now a density-function-valued (!) random variable; notice that in an above-mentioned “full-scenario” time-evolutionary context, this becomes a density-function-on-functions-valued random variable. Correspondingly, the above-mentioned procedure for the deterministic case has to be adapted and the notions of convergence and smallness have to be stochastified, which leads to the need for considerably more advanced techniques.

Another field of applying divergences to a context of synchronous model and data uncertainty is Bayesian sequential updating. In such a “doubly uncertain” framework, one deals with a parametric context of probability density functions \(Q^{true} = Q_{\theta _{0}}\), \(\mathscr {P} := \{ Q_{\theta } \}_{\theta \in \varTheta }\) where the uncertain knowledge about the parameter \(\theta \) (to be learnt) is operationalized by replacing it with a random variable \(\vartheta \) on \(\varTheta \). Based on both (i) an initial prior distribution \(Prior_1[\cdot ] := Pr[\vartheta \in \cdot \, ]\) of \(\vartheta \) (with probability density function (pdf) \(\theta \rightarrow prior_1(\theta )\)) and (ii) observed data \(data_1(\omega ), \ldots ,data_{N^{data}}(\omega )\) of number \(N^{data}\), a posterior distribution of \(\vartheta \) (with pdf \(\theta \rightarrow post_1(\theta ,\omega )\)) is determined with (amongst other things) the help of the well-known Bayes formula. This procedure is repeated recursively with new incoming data input (block) \(data_{N^{data}+1}\), where the new prior distribution \(Prior_2[\cdot ,\omega ] := Post_1[\cdot ,\omega ]\) is chosen as the old posterior and the new posterior distribution (with pdf \(\theta \rightarrow post_2(\theta ,\omega )\)) is determined, etc. The corresponding (say) aggregated divergence \(D(P(\omega ),Q(\omega ))\) between the probability-density-valued random variables \(\omega \rightarrow P(\omega ) := \big \{prior_2(\theta ,\omega )\big \}_{\theta \in \varTheta }\) and \(\omega \rightarrow Q(\omega ) := \big \{post_2(\theta ,\omega )\big \}_{\theta \in \varTheta }\) serves as “degree of informativity of the new data-point observation on the learning of the true unknown \(\theta _{0}\)”.
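In a conjugate setting this informativity can be computed in closed form. The following sketch assumes a Beta-Bernoulli model (an illustrative choice, not prescribed by the text) and uses the Kullback–Leibler divergence between the new prior (= old posterior) and the new posterior:

```python
from scipy.special import betaln, digamma

# Closed-form KL( Beta(a1,b1) || Beta(a2,b2) ), a standard formula.
def kl_beta(a1, b1, a2, b2):
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1) + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

a, b = 1.0, 1.0                              # initial prior Beta(1, 1)
for block in ([1, 1, 0], [1, 0, 1, 1]):      # incoming 0/1 data blocks (assumed)
    a_new = a + sum(block)                   # conjugate form of the Bayes update
    b_new = b + len(block) - sum(block)
    print("informativity of block:", kl_beta(a_new, b_new, a, b))
    a, b = a_new, b_new                      # old posterior becomes new prior
```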

As another application in a “doubly uncertain” framework, divergences D(P, Q) also appear in a dichotomous Bayesian testing problem between the two alternative probability density functions P and Q, where D(P, Q) represents an appropriate average (over prior probabilities) of the corresponding difference between the prior Bayes risk (prior minimal mean decision loss) and the posterior Bayes risk (posterior minimal mean decision loss). This, together with non-averaging versions and an interpretation of D(P, Q) as a (weighted-average) statistical information measure in the sense of De Groot [29], can be found e.g. in Österreicher and Vajda [65]; see also Stummer [78,79,80], Liese and Vajda [42], Reid and Williamson [73]. In contrast to this employment of D(P, Q) as a quantifier of “decision risk reduction” respectively “model risk reduction” respectively “information gain”, a different use of divergences D(P, Q) in a “doubly uncertain” general Bayesian context of dichotomous loss-dependent decisions between arbitrary probability density functions P and Q can be found in Stummer and Vajda [81], where they obtain \(D_{\phi _{\alpha }}(P,Q)\) (for some power functions \(\phi _{\alpha }\), cf. (5)) as upper and lower bounds of the Bayes risk (minimal mean decision loss) itself and also give applications to decision making for time-continuous, non-stationary financial stochastic processes.

Divergences can be also employed to detect distributional changes in streams (respectively clouds) \((data_{j})_{j \in \tau }\) of uncertain (random, noisy, risk-prone) data indexed by j from an arbitrary countable set \(\tau \) (e.g. the integers, an undirected graph); a survey together with some general framework can be found in Kißlinger and Stummer [38]: the basic idea is to pick out twoFootnote 5 non-identical, purpose-appropriately chosen subcollections respectively sample patterns (e.g. windows) \(data_{one}(\omega ):= (data_{s_1}(\omega ),\ldots , data_{s_{N_1}}(\omega ))\), \(data_{two}(\omega ):= \) \((data_{t_1}(\omega ),\ldots , data_{t_{N_2}}(\omega ))\), and to build from them data-derived probability-density functions \(\omega \rightarrow data_{one}(\omega ) \rightarrow P^{data_{one}(\omega )} := \big \{p^{data_{one}(\omega )}(x)\big \}_{x \in \mathscr {X}}\), \(\omega \rightarrow data_{two}(\omega ) \rightarrow P^{data_{two}(\omega )} := \big \{p^{data_{two}(\omega )}(x)\big \}_{x \in \mathscr {X}}\). If a correspondingly chosen (say) aggregated divergence \(D\big (P^{data_{one}(\omega )} , P^{data_{two}(\omega )}\big )\) – which plays the role of a condensed change-score – is “significantly large” in the sense that it is large enough – compared to some sound threshold which within the model reflects the desired “degree of confidential plausibility” – then there is strong indication of a distributional change which we then “believe in”. Notice that both components of the divergence \(D(\cdot ,\cdot )\) are now probability-density-function-valued random variables. The sound threshold can e.g. be derived from advanced random asymptotic theory.
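A toy version of this window-based change detection, with assumed ingredients throughout (histogram-based density estimates, a Kullback–Leibler-type change-score, and an arbitrary placeholder threshold rather than one derived from the asymptotic theory mentioned above), could look like this:

```python
import numpy as np

# Change-score D(P^data_two, P^data_one) between two sliding windows of a
# data stream; the smoothing constant eps and THRESHOLD are placeholders.

def kl_pmf(p, q, eps=1e-12):
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0.0, 1, 600), rng.normal(1.5, 1, 400)])
bins = np.linspace(-4, 6, 21)
W, THRESHOLD = 100, 0.25          # window size and placeholder threshold

for t in range(2 * W, len(stream) + 1, W):
    h_one, _ = np.histogram(stream[t - 2 * W: t - W], bins=bins)
    h_two, _ = np.histogram(stream[t - W: t], bins=bins)
    score = kl_pmf(h_two, h_one)  # condensed change-score
    if score > THRESHOLD:
        print(f"change indicated at t = {t}, score = {score:.3f}")
```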

From the above discussion it is clear that divergence-based model-uncertainty methods are useful tools in concrete applications for machine learning and artificial intelligence, see e.g. Collins et al. [25], Murata et al. [54], Banerjee et al. [9], Tsuda et al. [87], Cesa-Bianchi and Lugosi [21], Nock and Nielsen [63], Sugiyama et al. [85], Wu et al. [94], Nock et al. [62], Nielsen et al. [60], respectively Minka [51], Cooper et al. [26], Lizier [46], Zhang et al. [96], Chhogyal [22], Cliff et al. [23, 24].

3 General Framework

For the rest of this paper, we shall use the following

Main (i.e. non-locally used) Notation and Symbols

\(\mathbb {R}\), \(\mathbb {N}\), \(\mathbb {R}^d\)

Set of real numbers, respectively natural numbers, respectively d-dimensional real vectors

\(\varTheta \), \(\theta \)

Set of parameters, see p. 188

\(\mathbbm {1}\)

Function with constant value 1

\(\varvec{1}_{A}(z) = \delta _{z}[A]\)

Indicator function on the set A evaluated at data point z, which is equal to Dirac’s one-point distribution on z evaluated at A

\(\# A\)

Number of elements in set A

\(\mathscr {X}\); \(\mathscr {X}_{\#}\)

Space/set in which data can take values; space/set of countable size

\(\mathscr {F}\)

System of admissible events/data-collections (\(\sigma \)-algebra) on \(\mathscr {X}\)

\(\lambda \)

Reference measure/integrator/summator, see p. 160 & Sect. 3.1 on p. 165

\(\lambda \)-a.a.

\(\lambda \)-almost all, see p. 160

\(\lambda _{L}\)

Lebesgue measure (“Riemann-type” integrator), see p. 160 & Sect. 3.1

\(\lambda _{\#}\)

Counting measure (“classical summator”), see p. 160 & Sect. 3.1 on p. 165

\(P \, := \, \big \{p(x)\big \}_{x \in \mathscr {X}}\)

Function from which the divergence/dissimilarity is measured, see p. 160

\(Q \, := \, \big \{q(x)\big \}_{x \in \mathscr {X}}\)

Function to which the divergence/dissimilarity is measured, see p. 160

\(M_{i} \, := \, \big \{m_{i}(x)\big \}_{x \in \mathscr {X}}\)

Scaling function (\(i=1,2\)) respectively aggregation function (\(i=3\)), see p. 161, (1) and paragraph (I1) thereafter, as well as Sect. 3.3 on p. 170

\(p(\cdot )\), \(q(\cdot )\), \(m_{i}(\cdot )\)

Alternative representations of P, Q, \(M_{i}\)

\(R \, := \, \big \{r(x)\big \}_{x \in \mathscr {X}}\)

Function used for the aggregation function \(m_{3}(\cdot )\), see Sect. 3.3.1 on p. 171

\(W_{i}\, \)

Connector function of the form \(W_{i}\, := \, \big \{w_{i}(x,y,z)\big \}_{x,y,z \in \ldots }\), for adaptive scaling and aggregation functions \(m_{i}(x) = w_{i}(x,p(x),q(x))\) (\(i=1,2,3\)), see e.g. Assumption 2 on p. 163 and Sect. 3.3.1.3 on p. 181

\(\mathbbm {P}\), \(\mathbbm {Q}\), \(\mathbbm {M}_{i}\), \(\mathbbm {W}_{i}\)

Functions with \(\mathbbm {p}(x) \geqslant 0\), \(\mathbbm {q}(x) \geqslant 0\), \(\mathbbm {m}_{i}(x) \geqslant 0\), \(\mathbbm {w}_{i}(x) \geqslant 0\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\)

\(\mathbbm {Q}^{\chi } \, := \, \big \{\mathbbm {q}^{\chi }(x)\big \}_{x \in \mathscr {X}}\)

Function for the aggregation function \(m_{3}(\cdot )\), see Sect. 4.2 on p. 184, (73)

,

\(\lambda \)-probability density functions (incl. probability mass functions for \(\lambda =\lambda _{\#}\)), i.e. functions which are nonnegative for \(\lambda \)-a.a. \(x \in \mathscr {X}\) and whose total \(\lambda \)-integral equals 1, see Remark 2 on p. 172

\(\lambda \)-probab. density function which depends on a parameter \(\theta \in \varTheta \), see p. 188

\(\mathscr {R}\big (\frac{P}{M_{1}}\big )\)

Range (image) of the function \(\big \{\frac{p(x)}{m_{1}(x)}\big \}_{x \in \mathscr {X}}\), see paragraph (I2) on p. 161

\(\mathscr {R}(Y_{1}, \ldots , Y_{N})\)

Range (image) of the random variables \(Y_{1}, \ldots , Y_{N}\), see p. 182

\(\lambda \)-probab. density function (modification of ) defined by , see p. 191

\(\phi \, := \, \big \{\phi (t)\big \}_{t \in ]a,b[}\)

Divergence generator, a convex real-valued function on ]a, b[, see p. 161, (1) and paragraph (I2), as well as Sect. 3.2 on p. 165

\(\varPhi (]a,b[)\);

Class of all such \(\phi \), see paragraph (I2) on p. 161

\(\overline{\phi } := \, \big \{\phi (t)\big \}_{t \in [a,b]}\)

Continuous extension of \(\phi \) on [a, b], with \(\overline{\phi }(t) = \phi (t)\) for all \(t \in ]a,b[\), see (I2)

\(\phi _{+,c}^{\prime }(t)\)

c-weighted mixture of left-hand and right-hand derivative of \(\phi \) at t, see (I2)

\(\varPhi _{C_{1}}(]a,b[)\)

Subclass of everywhere continuously differentiable \(\phi \), with derivative \(\phi ^{\prime }(t)\) (being equal to \(\phi _{+,c}^{\prime }(t)\) for all \(c \in [0,1]\)), see (I2) on p. 161

\(\phi _{\alpha }\)

\(\alpha \)-power-function type divergence generator, see (5) on p. 166, (14), (18), (19)

\(\phi _{TV}\)

Generator of total variation distance, see (31) on p. 169

\(\phi _{ie}\)

Divergence generator with interesting effects, see (35) on p. 170

\(\psi _{\phi ,c}\)

Function given by \(\psi _{\phi ,c}(s,t) := \phi (s) - \phi (t) - \phi _{+,c}^{\prime }(t) \cdot (s-t) \geqslant 0\), see (I2)

\(\overline{\psi _{\phi ,c}}\)

Bivariate extension of \(\psi _{\phi ,c}\), see (I2) on p. 161

\({\overline{\int }}_{{\mathscr {X}}} \ldots \), \({\overline{\sum }}_{{\mathscr {X}}} \ldots \)

Integral/sum over extension of integrand/summand \(\ldots \), see (I2) & (2) on p. 165

\(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q)\)

Divergence between two functions P (scaled by \(M_{1}\)) and Q (scaled by \(M_{2}\)), generated by \(\phi \) and weight c, and aggregated by \(\mathbbm {M}_{3}\) and \(\lambda \), see (1) on p. 161

\(D_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q)\)

As above, but with \(\phi \in \varPhi _{C_{1}}(]a,b[)\) and obsolete c, see Sect. 3.2 on p. 165

\(D_{\lambda }(P,Q)\)

General \(\lambda \)-aggregated divergence, see p. 189, respectively pseudo-divergence, see Definition 2 on p. 195

Pointwise decomposable pseudo-divergence, scaled by \(\mathbbm {M}\) and aggregated by \(\mathbbm {M}\) and \(\lambda \), see Sect. 4.6 on p. 200

NN0, NN1

Nonnegativity setup 0 respectively 1, see p. 166 resp. p. 171

\(\mathfrak {P}^{\mathbbm {R} \cdot \lambda }\), \(\mathfrak {Q}^{\mathbbm {R} \cdot \lambda }\), \(\mathfrak {M}^{\mathbbm {R} \cdot \lambda }\)

Measures with \(\lambda \)-densities \(p(\cdot ) \cdot r(\cdot )\), \(q(\cdot ) \cdot r(\cdot )\), \(m(\cdot ) \cdot r(\cdot )\), see Remark 2 on p. 171

,

Probability measures (distributions) with \(\lambda \)-densities \(p(\cdot )\), \(q(\cdot )\), see Remark 2

\(\mathscr {Q}_{\varTheta }^{\lambda _{2}}\),

Class of probability measures with \(\lambda _{2}\)-densities \(q_{\theta }(\cdot )\) with parameter \(\theta \in \varTheta \), see p. 188

, ,

Data-derived empirical (probability) distribution, and probability mass function (\(\lambda _{\#}\)-density) thereof, see Remark 2 on p. 172

,

Data-derived “extended” empirical (probability) distribution, and probability mass function thereof, see (85) on p. 190 and thereafter

DPD, CASD

Density-power divergences (see p. 174), Csiszár–Ali–Silvey divergences (see p. 177)

\(\ell i_{1}\), \(\phi ^{*}(0)\), \(\ell i_{2}\), \(\ell i_{3}\)

Certain limits, see (50), (71), (72)

\(\mathbbm {P} \perp \mathbbm {Q}\)

The functions \(\mathbbm {P}\), \(\mathbbm {Q}\) are “essentially different”, see (64) to (66) and thereafter

\(\mathbbm {P} \not \perp \mathbbm {Q}\)

Negation of \(\mathbbm {P} \perp \mathbbm {Q}\), see p. 192

\(\mathbbm {P} \sim \mathbbm {Q}\)

The functions \(\mathbbm {P}\), \(\mathbbm {Q}\) are “equivalent” (concerning zeros), see (80)

\(\mathbbm {P} \not \sim \mathbbm {Q}\)

Negation of \(\mathbbm {P} \sim \mathbbm {Q}\), see p. 195

\(\widehat{\theta }_{N,D_{\lambda _{2}}}\)

Minimum-divergence estimator (“approximator”) of the true unknown parameter \(\theta _{0}\), based on N data observations, see (82) on p. 189

\(\widehat{\theta }_{N,D_{\lambda _{\#}}}\), \(\widehat{\theta }_{N,D_{\lambda }}\)

Certain minimum-divergence estimators, see (83), (86)

\(\widehat{\theta }_{N \, , decD_{\lambda }}\) ,      

Certain minimum-divergence estimators, see (107), (123)

\(\widehat{\theta }_{N,sup\mathscr {D}_{\phi ,\lambda }}\)

Certain minimum-divergence estimator, see (135)

\(\mathscr {P}^{\lambda }\)

Certain class of nonnegative, mutually equivalent functions, see p. 194

\(\mathscr {P}^{\lambda \not \sim }\), \(\widetilde{\mathscr {P}}^{\lambda }\)

Certain classes of nonnegative functions, see p. 194

\(\mathscr {P}_{\varTheta }^{\lambda }\), \(\mathscr {P}_{emp}^{\lambda \perp }\), \(\mathscr {P}_{\varTheta ,emp}^{\lambda }\)

Certain classes of nonnegative functions, see p. 195

\(\mathfrak {D}^{0}\), \(\mathfrak {D}^{1}\), \(\rho _{\mathbbm {Q}}\)

Functionals and mapping for decomposable pseudo-divergences, see Definition 3 on p. 195

\(\psi ^{dec}\), \(\psi ^{0}\), \(\psi ^{1}\), \(\rho \)

Mappings for pointwise decomposable pseudo-divergences, see Definition 3 on p. 196

\(h_{0}\), \(h_{1}\), \(h_{2}\)

Mappings for pointwise decomposable pseudo-divergences, see Definition 3 on p. 196

\(\psi _{m}^{dec}\)

Perspective function of \(\psi ^{dec}\), see (120)

New Divergence Toolkit

In the above Sect. 2, we have motivated that for many different tasks within a broad spectrum of situations, it is useful to employ divergences as “directed distances”, including distances as their symmetric special case. For the rest of the paper, we shall only deal with aggregated forms of divergences, and thus drop the attribute “aggregated” from now on. In the following, we present a fairly universal, flexible, multi-component system of divergences by adapting and widening the concept of scaled Bregman divergences of Stummer [81] and Stummer and Vajda [84] to the current context of arbitrary (measurable) functions. To begin with, let us assume that the modeled respectively observed (random) data take values in a state space \(\mathscr {X}\) (with at least two distinct values), equipped with a system \(\mathscr {F}\) of admissible events (\(\sigma \)-algebra) and a \(\sigma \)-finite measure \(\lambda \) (e.g. the Lebesgue measure, the counting measure, etc.). Furthermore, we suppose that \(x \rightarrow p(x) \in [-\infty ,\infty ]\) and \(x \rightarrow q(x) \in [-\infty ,\infty ]\) are (correspondingly measurable) functions on \(\mathscr {X}\) which satisfy \(p(x) \in ]-\infty ,\infty [\), \(q(x) \in ]-\infty ,\infty [\) for \(\lambda \)-almost all (abbreviated as \(\lambda \)-a.a.) \(x \in \mathscr {X}\).Footnote 6 To address the entire functions as objects we write \(P := \big \{p(x)\big \}_{x \in \mathscr {X}}\), \(Q := \big \{q(x)\big \}_{x \in \mathscr {X}}\) and alternatively sometimes also \(p(\cdot )\), \(q(\cdot )\). To better highlight the very important special case of \(\lambda \)-probability density functions – where \(p(x) \geqslant 0\), \(q(x) \geqslant 0\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\) and \(\int _{\mathscr {X}} p(x) \, \mathrm {d}\lambda (x) =1\), \(\int _{\mathscr {X}} q(x) \, \mathrm {d}\lambda (x) =1\) – we use the notation , , , instead of P, p, Q, q (where symbolizes a lying 1). For instance, if \(\lambda = \lambda _{L}\) is the Lebesgue measure on the s-dimensional Euclidean space \(\mathscr {X} = \mathbb {R}^{s}\), then , are “classical” (e.g. Gaussian) probability density functions. In contrast, in the discrete setup where the state space (i.e. the set of all possible data points) \(\mathscr {X} = \mathscr {X}_{\#}\) has countably many elements and \(\lambda := \lambda _{\#}\) is the counting measure (i.e., \(\lambda _{\#}[\{x\}] =1\) for all \(x \in \mathscr {X}_{\#}\)), then , are probability mass functions and (say) can be interpreted as the probability that the data point x is taken by the underlying random (uncertainty-prone) mechanism. If \(p(x) \geqslant 0\), \(q(x) \geqslant 0\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\) (but not necessarily with the restrictions \(\int _{\mathscr {X}} p(x) \, \mathrm {d}\lambda (x) =1 =\int _{\mathscr {X}} q(x) \, \mathrm {d}\lambda (x)\)) then we write \(\mathbbm {P}\), \(\mathbbm {Q}\), \(\mathbbm {p}\), \(\mathbbm {q}\) instead of P, p, Q, q.

Back to generality, we quantify the dissimilarity between the two functions P,Q in terms of divergences \(D^{c}_{\beta }(P,Q)\) with \(\beta = (\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda )\), defined by

$$\begin{aligned}& \textstyle 0 \leqslant D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q) \nonumber \\& \textstyle : = \int _{{\mathscr {X}}} \big [ \phi \big ( { \frac{p(x)}{m_{1}(x)}}\big ) -\phi \big ( {\frac{q(x)}{m_{2}(x)}}\big ) - \phi _{+,c}^{\prime } \big ( {\frac{q(x)}{m_{2}(x)}}\big ) \cdot \big ( \frac{p(x)}{m_{1}(x)}-\frac{q(x)}{m_{2}(x)}\big ) \big ] \cdot \mathbbm {m}_{3}(x) \, \mathrm {d}\lambda (x) \qquad \end{aligned}$$
(1)

(see Stummer [81], Stummer and Vajda [84] for the case \(c=1, m_{1}(x)=m_{2}(x)=\mathbbm {m}_{3}(x)\)). Here, we use:

  1. (I1)

    (measurable) scaling functions \(m_{1}: \mathscr {X} \rightarrow [-\infty , \infty ]\) and \(m_{2}: \mathscr {X} \rightarrow [-\infty , \infty ]\) as well as a nonnegative (measurable) aggregating function \(\mathbbm {m}_{3}: \mathscr {X} \rightarrow [0,\infty ]\) such that \(m_{1}(x) \in ]-\infty , \infty [\), \(m_{2}(x) \in ]-\infty , \infty [\), \(\mathbbm {m}_{3}(x) \in [0, \infty [\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\).Footnote 7 In accordance with the above notation, we use the symbols \(M_{i} := \big \{m_{i}(x)\big \}_{x \in \mathscr {X}}\) respectively \(m_{i}(\cdot )\) to refer to the entire functions, and \(\mathbbm {M_{i}}\), \(\mathbbm {m_{i}}(\cdot )\) when they are nonnegative as well as , when they manifest \(\lambda \)-probability density functions. Furthermore, let us emphasize that we allow for (i.e. cover) adaptive situations in the sense that all three functions \(m_1(x)\), \(m_{2}(x)\), \(\mathbbm {m}_{3}(x)\) (evaluated at x) may also depend on p(x) and q(x).

  2. (I2)

    the so-called “divergence-generator” \(\phi \) which is a continuous, convex (finite) function \(\phi : E \rightarrow ]-\infty ,\infty [\) on some appropriately chosen open interval \(E = ]a,b[\) such that [a, b] covers (at least) the union \(\mathscr {R}\big (\frac{P}{M_{1}}\big ) \cup \mathscr {R}\big (\frac{Q}{M_{2}}\big )\) of both ranges \(\mathscr {R}\big (\frac{P}{M_{1}}\big )\) of \(\big \{\frac{p(x)}{m_{1}(x)}\big \}_{x \in \mathscr {X}}\) and \(\mathscr {R}\big (\frac{Q}{M_{2}}\big )\) of \(\big \{\frac{q(x)}{m_{2}(x)}\big \}_{x \in \mathscr {X}}\); for instance, \(E=]0,1[\), \(E=]0,\infty [\) or \(E=]-\infty ,\infty [\); the class of all such functions will be denoted by \(\varPhi (]a,b[)\). Furthermore, we assume that \(\phi \) is continuously extended to \(\overline{\phi }: [a,b] \rightarrow [-\infty ,\infty ]\) by setting \(\overline{\phi }(t) := \phi (t)\) for \(t\in ]a,b[\) as well as \(\overline{\phi }(a):= \lim _{t\downarrow a} \phi (t)\), \(\overline{\phi }(b):= \lim _{t\uparrow b} \phi (t)\) on the two boundary points \(t=a\) and \(t=b\). The latter two are the only points at which infinite values may appear. Moreover, for any fixed \(c \in [0,1]\) the (finite) function \(\phi _{+,c}^{\prime }: ]a,b[ \rightarrow ]-\infty ,\infty [\) is well-defined by \(\phi _{+,c}^{\prime }(t) := c \cdot \phi _{+}^{\prime }(t) + (1- c) \cdot \phi _{-}^{\prime }(t)\), where \(\phi _{+}^{\prime }(t)\) denotes the (always finite) right-hand derivative of \(\phi \) at the point \(t \in ]a,b[\) and \(\phi _{-}^{\prime }(t)\) the (always finite) left-hand derivative of \(\phi \) at \(t \in ]a,b[\). If \(\phi \in \varPhi (]a,b[)\) is also continuously differentiable – which we denote by \(\phi \in \varPhi _{C_{1}}(]a,b[)\) – then for all \(c \in [0,1]\) one gets \(\phi _{+,c}^{\prime }(t) = \phi ^{\prime }(t)\) (\(t \in ]a,b[\)) and in such a situation we always suppress the obsolete indices c, \(+\) in the corresponding expressions. We also employ the continuous continuation \(\overline{\phi _{+,c}^{\prime }}: [a,b] \rightarrow [-\infty ,\infty ]\) given by \(\overline{\phi _{+,c}^{\prime }}(t) := \phi _{+,c}^{\prime }(t)\) (\(t \in ]a,b[\)), \(\overline{\phi _{+,c}^{\prime }}(a) := \lim _{t\downarrow a} \phi _{+,c}^{\prime }(t)\), \(\overline{\phi _{+,c}^{\prime }}(b) := \lim _{t\uparrow b} \phi _{+,c}^{\prime }(t)\). To explain the precise meaning of (1), we also make use of the (finite, nonnegative) function \(\psi _{\phi ,c}: ]a,b[ \times ]a,b[ \rightarrow [0,\infty [\) given by \(\psi _{\phi ,c}(s,t) := \phi (s) - \phi (t) - \phi _{+,c}^{\prime }(t) \cdot (s-t) \geqslant 0\) (\(s,t \in ]a,b[\)). To extend this to a lower semi-continuous function \(\overline{\psi _{\phi ,c}}: [a,b] \times [a,b] \rightarrow [0,\infty ]\) we proceed as follows: firstly, we set \(\overline{\psi _{\phi ,c}}(s,t) := \psi _{\phi ,c}(s,t)\) for all \(s,t \in ]a,b[\). Moreover, since for fixed \(t \in ]a,b[\), the function \(s \rightarrow \psi _{\phi ,c}(s,t)\) is convex and continuous, the limit \(\overline{\psi _{\phi ,c}}(a,t) := \lim _{s \rightarrow a} \psi _{\phi ,c}(s,t)\) always exists and (in order to avoid overlines in (1)) will be interpreted/abbreviated as \(\phi (a) - \phi (t) - \phi _{+,c}^{\prime }(t) \cdot (a-t)\). Analogously, for fixed \(t \in ]a,b[\) we set \(\overline{\psi _{\phi ,c}}(b,t) := \lim _{s \rightarrow b} \psi _{\phi ,c}(s,t)\) with corresponding short-hand notation \(\phi (b) - \phi (t) - \phi _{+,c}^{\prime }(t) \cdot (b-t)\). 
Furthermore, for fixed \(s\in ]a,b[\) we interpret \(\phi (s) - \phi (a) - \phi _{+,c}^{\prime }(a) \cdot (s-a)\) as

    $$\begin{aligned} \overline{\psi _{\phi ,c}}(s,a) \, :=&\, \big \{ \phi (s) - \overline{\phi _{+,c}^{\prime }}(a) \cdot s + \lim _{t \rightarrow a} \big (t \cdot \overline{\phi _{+,c}^{\prime }}(a) - \phi (t) \big ) \big \} \cdot \varvec{1}_{]-\infty ,\infty [}\big (\overline{\phi _{+,c}^{\prime }}(a)\big ) \nonumber \\&+ \ \infty \cdot \varvec{1}_{\{-\infty \}}\big (\overline{\phi _{+,c}^{\prime }}(a)\big ) \, , \nonumber \end{aligned}$$

    where the involved limit always exists but may be infinite. Analogously, for fixed \(s\in ]a,b[\) we interpret \(\phi (s) - \phi (b) - \phi _{+,c}^{\prime }(b) \cdot (s-b)\) as

    $$\begin{aligned} \overline{\psi _{\phi ,c}}(s,b) :=&\big \{ \phi (s) - \overline{\phi _{+,c}^{\prime }}(b) \cdot s + \lim _{t \rightarrow b} \Big (t \cdot \overline{\phi _{+,c}^{\prime }}(b) - \phi (t) \Big ) \big \} \cdot \varvec{1}_{]-\infty ,\infty [}\big (\overline{\phi _{+,c}^{\prime }}(b) \big ) \nonumber \\[-0.1cm]&+ \ \infty \cdot \varvec{1}_{\{+\infty \}}\big (\overline{\phi _{+,c}^{\prime }}(b)\big ) \, , \nonumber \end{aligned}$$

    where again the involved limit always exists but may be infinite. Finally, we always set \(\overline{\psi _{\phi ,c}}(a,a):= 0\), \(\overline{\psi _{\phi ,c}}(b,b):=0\), and \(\overline{\psi _{\phi ,c}}(a,b) := \lim _{s \rightarrow a} \overline{\psi _{\phi ,c}}(s,b)\), \(\overline{\psi _{\phi ,c}}(b,a) := \lim _{s \rightarrow b} \overline{\psi _{\phi ,c}}(s,a)\). Notice that \(\overline{\psi _{\phi ,c}}(\cdot ,\cdot )\) is lower semi-continuous but not necessarily continuous. Since ratios are ultimately involved, we also consistently take \(\overline{\psi _{\phi ,c}}\big (\frac{0}{0},\frac{0}{0}\big ) := 0\). Taking all this into account, we interpret \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q)\) as \(\int _{{\mathscr {X}}} \overline{\psi _{\phi ,c}}\big (\frac{p(x)}{m_{1}(x)},\frac{q(x)}{m_{2}(x)}\big ) \mathbbm {m}_{3}(x) \, \mathrm {d}\lambda (x)\) at first glance (see further investigations in Assumption 2 below), and use the (in lengthy examples) less clumsy notation \({\overline{\int }}_{{\mathscr {X}}} \psi _{\phi ,c}\big (\frac{p(x)}{m_{1}(x)},\frac{q(x)}{m_{2}(x)}\big ) \mathbbm {m}_{3}(x) \, \mathrm {d}\lambda (x)\) as a shortcut for the implicitly involved boundary behaviour.    \(\square \)

Notice that despite the “difference-structure” in the integrand of (1), the splitting of the integral into differences of several “autonomous” integrals may not always be feasible due to the possible appearance of differences between infinite integral values. Furthermore, there is non-uniqueness in the construction (1); for instance, one (formally) gets \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q)= D^{c}_{\tilde{\phi },M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q)\) for any \(\tilde{\phi }(t):= \phi (t) + c_1 + c_2 \cdot t\) (\(t \in E\)) with \(c_1,c_2 \in \mathbb {R}\). Moreover, there exist “essentially different” pairs \((\phi ,\mathbbm {M})\) and \((\breve{\phi },\breve{\mathbbm {M}})\) (where \(\phi (t) - \breve{\phi }(t)\) is nonlinear in t) for which \(D^{c}_{\phi ,\mathbbm {M},\mathbbm {M},\mathbbm {M},\lambda }(P,Q)= D^{c}_{\breve{\phi },\breve{\mathbbm {M}},\breve{\mathbbm {M}},\breve{\mathbbm {M}},\lambda }(P,Q)\) (see e.g. [37]). Let us also mention that we could further generalize (1) by adapting the divergence concept of Stummer and Kißlinger [82], who even deal with non-convex, non-concave divergence generators \(\phi \); for the sake of brevity, this is omitted here.
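For orientation, here is a minimal finite-\(\mathscr {X}\) sketch of (1), covering only the interior-point case (all ratios assumed to lie in ]a, b[ and \(\phi \) assumed continuously differentiable, so that none of the boundary conventions from (I2) are needed); the generator and the test functions are again assumed choices:

```python
import numpy as np

# Interior-point sketch of the divergence (1) on a finite X with weights lam:
# sum_x [phi(p/m1) - phi(q/m2) - phi'(q/m2) * (p/m1 - q/m2)] * m3(x) * lam(x).

def scaled_divergence(phi, phi_prime, p, q, m1, m2, m3, lam):
    s = np.asarray(p) / np.asarray(m1)   # p(x)/m1(x)
    t = np.asarray(q) / np.asarray(m2)   # q(x)/m2(x)
    integrand = phi(s) - phi(t) - phi_prime(t) * (s - t)
    return float(np.sum(integrand * np.asarray(m3) * np.asarray(lam)))

phi = lambda u: u * np.log(u) - u + 1.0  # assumed generator on ]0, inf[
phi_prime = np.log
P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.3, 0.4, 0.3])
ones = np.ones_like(P)

# m1 = m2 = m3 = 1: unscaled Bregman-type divergence
print(scaled_divergence(phi, phi_prime, P, Q, ones, ones, ones, ones))
# m1 = m2 = m3 = Q: the scaled special case mentioned after (1)
print(scaled_divergence(phi, phi_prime, P, Q, Q, Q, Q, ones))
```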

Notice that by construction we obtain the following important assertion:

Theorem 1

Let \(\phi \in \varPhi (]a,b[)\) and \(c \in [0,1]\). Then there holds \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q) \geqslant 0\), with equality if \(\frac{p(x)}{m_1(x)}=\frac{q(x)}{m_2(x)}\) for \(\lambda \)-almost all \(x \in \mathscr {X}\). Depending on the concrete situation, \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q)\) may take an infinite value.

To get “sharp identifiability” (i.e. reflexivity) one needs further assumptions on \(\phi \in \varPhi (]a,b[)\), \(c \in [0,1]\). As a motivation, consider the case where \(\mathbbm {m}_{3}(x) \equiv 1\) and \(\phi \in \varPhi (]a,b[)\) is affine linear on the whole interval ]a, b[, and hence its extension \(\overline{\phi }\) is affine-linear on [a, b]. Accordingly, one gets for the integrand-builder \(\overline{\psi _{\phi ,c}}(s,t) \equiv 0\) and hence \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q) = \int _{{\mathscr {X}}} \overline{\psi _{\phi ,c}}\big (\frac{p(x)}{m_{1}(x)},\frac{q(x)}{m_{2}(x)}\big ) \, \mathrm {d}\lambda (x) = 0\) even in cases where \(\frac{p(x)}{m_{1}(x)} \ne \frac{q(x)}{m_{2}(x)}\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\). In order to avoid such and similar phenomena, we use the following set of requirements:

Assumption 2

Let \(c \in [0,1]\), \(\phi \in \varPhi (]a,b[)\) and \(\mathscr {R}\big (\frac{P}{M_{1}}\big ) \cup \mathscr {R}\big (\frac{Q}{M_{2}}\big ) \subset [a,b]\). The aggregation function is supposed to be of the form \(\mathbbm {m}_{3}(x)= \mathbbm {w}_{3}\big (x,\frac{p(x)}{m_{1}(x)},\frac{q(x)}{m_{2}(x)} \big )\) for some (measurable) function \(\mathbbm {w}_{3}: \mathscr {X} \times [a,b] \times [a,b] \rightarrow [0,\infty ]\). Moreover, for all \(s \in \mathscr {R}\big (\frac{P}{M_{1}}\big )\), all \(t \in \mathscr {R}\big (\frac{Q}{M_{2}}\big )\) and \(\lambda \)-a.a. \(x \in \mathscr {X}\), let the following conditions hold:

  1. (a)

    \(\phi \) is strictly convex at t;

  2. (b)

    if \(\phi \) is differentiable at t and \(s \ne t\), then \(\phi \) is not affine-linear on the interval \([\min (s,t),\max (s,t)]\) (i.e. between t and s);

  3. (c)

    if \(\phi \) is not differentiable at t, \(s > t\) and \(\phi \) is affine linear on [ts], then we exclude \(c=1\) for the (“globally/universally chosen”) subderivative \(\phi _{+,c}^{\prime }(\cdot ) = c \cdot \phi _{+}^{\prime }(\cdot ) + (1- c) \cdot \phi _{-}^{\prime }(\cdot )\);

  4. (d)

    if \(\phi \) is not differentiable at t, \(s < t\) and \(\phi \) is affine linear on [st], then we exclude \(c=0\) for \(\phi _{+,c}^{\prime }(\cdot )\);

  5. (e)

    \(\mathbbm {w}_{3}(x,s,t) < \infty \);

  6. (f)

    \(\mathbbm {w}_{3}(x,s,t) >0\) if \(s \ne t\);

  7. (g)

    \(\mathbbm {w}_{3}(x,a,a) \cdot \psi _{\phi ,c}(a,a) :=0\) by convention (even in cases where the function \(\mathbbm {w}_{3}(x,\cdot ,\cdot ) \cdot \psi _{\phi ,c}(\cdot ,\cdot )\) is not continuous on the boundary point (a, a));

  8. (h)

    \(\mathbbm {w}_{3}(x,b,b) \cdot \psi _{\phi ,c}(b,b) :=0\) by convention (even in cases where the function \(\mathbbm {w}_{3}(x,\cdot ,\cdot ) \cdot \psi _{\phi ,c}(\cdot ,\cdot )\) is not continuous on the boundary point (b, b));

  9. (i)

    \(\mathbbm {w}_{3}(x,a,t) \cdot \psi _{\phi ,c}(a,t) >0\), where \(\mathbbm {w}_{3}(x,a,t) \cdot \psi _{\phi ,c}(a,t) := \lim _{s\rightarrow a} \mathbbm {w}_{3}(x,s,t) \cdot \psi _{\phi ,c}(s,t)\) if this limit exists, and otherwise we set by convention \(\mathbbm {w}_{3}(x,a,t) \cdot \psi _{\phi ,c}(a,t) := 1\) (or any other strictly positive constant);

  10. (j)

    \(\mathbbm {w}_{3}(x,b,t) \cdot \psi _{\phi ,c}(b,t) >0\), where \(\mathbbm {w}_{3}(x,b,t) \cdot \psi _{\phi ,c}(b,t)\) is analogous to (i);

  11. (k)

    \(\mathbbm {w}_{3}(x,s,a) \cdot \psi _{\phi ,c}(s,a) >0\), where \(\mathbbm {w}_{3}(x,s,a) \cdot \psi _{\phi ,c}(s,a) := \lim _{t\rightarrow a} \mathbbm {w}_{3}(x,s,t) \cdot \psi _{\phi ,c}(s,t)\) if this limit exists, and otherwise we set by convention \(\mathbbm {w}_{3}(x,s,a) \cdot \psi _{\phi ,c}(s,a) := 1\) (or any other strictly positive constant);

  12. (l)

    \(\mathbbm {w}_{3}(x,s,b) \cdot \psi _{\phi ,c}(s,b) >0\), where \(\mathbbm {w}_{3}(x,s,b) \cdot \psi _{\phi ,c}(s,b)\) is analogous to (k);

  13. (m)

    \(\mathbbm {w}_{3}(x,a,b) \cdot \psi _{\phi ,c}(a,b) >0\), where \(\mathbbm {w}_{3}(x,a,b) \cdot \psi _{\phi ,c}(a,b) := \lim _{s\rightarrow a} \mathbbm {w}_{3}(x,s,b) \cdot \psi _{\phi ,c}(s,b)\) if this limit exists, and otherwise we set by convention \(\mathbbm {w}_{3}(x,a,b) \cdot \psi _{\phi ,c}(a,b) := 1\) (or any other strictly positive constant);

  14. (n)

    \(\mathbbm {w}_{3}(x,b,a) \cdot \psi _{\phi ,c}(b,a) >0\), where \(\mathbbm {w}_{3}(x,b,a) \cdot \psi _{\phi ,c}(b,a) := \lim _{s\rightarrow b} \mathbbm {w}_{3}(x,s,a) \cdot \psi _{\phi ,c}(s,a)\) if this limit exists, and otherwise we set by convention \(\mathbbm {w}_{3}(x,b,a) \cdot \psi _{\phi ,c}(b,a) := 1\) (or any other strictly positive constant).    \(\square \)

Under Assumption 2, we always interpret the corresponding divergence

$$\begin{aligned}&D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q) := D^{c}_{\phi ,M_{1},M_{2},\mathbbm {W}_{3},\lambda }(P,Q) \nonumber \\&:= {\overline{\int }}_{{\mathscr {X}}} \mathbbm {w}_{3}\big (x,\frac{p(x)}{m_{1}(x)},\frac{q(x)}{m_{2}(x)} \big ) \cdot \big [ \phi \big ( { \frac{p(x)}{m_{1}(x)}}\big ) -\phi \big ( {\frac{q(x)}{m_{2}(x)}}\big ) \nonumber \\&\qquad - \phi _{+,c}^{\prime } \big ( {\frac{q(x)}{m_{2}(x)}}\big ) \cdot \big ( \frac{p(x)}{m_{1}(x)}-\frac{q(x)}{m_{2}(x)}\big ) \big ] \, \mathrm {d}\lambda (x) \nonumber \end{aligned}$$

as \(\int _{{\mathscr {X}}} \overline{\mathbbm {w}_{3} \cdot \psi _{\phi ,c}}\big (x, \frac{p(x)}{m_{1}(x)},\frac{q(x)}{m_{2}(x)}\big ) \, \mathrm {d}\lambda (x)\), where \(\overline{\mathbbm {w}_{3} \cdot \psi _{\phi ,c}}(x,s,t)\) denotes the extension of the function \(\mathscr {X}\times ]a,b[ \times ]a,b[ \ni (x,s,t) \rightarrow \mathbbm {w}_{3}(x,s,t) \cdot \psi _{\phi ,c}(s,t)\) on \(\mathscr {X}\times [a,b] \times [a,b]\) according to the conditions (g) to (n) above.

Remark 1

(a) We could even work with a weaker assumption, obtained by replacing s with \(\frac{p(x)}{m_{1}(x)}\) as well as t with \(\frac{q(x)}{m_{2}(x)}\) and by requiring that the correspondingly plugged-in conditions (a) to (n) then hold for \(\lambda \)-a.a. \(x \in \mathscr {X}\). (b) Notice that our above context subsumes aggregation functions of the form \(\mathbbm {m}_{3}(x) = \tilde{\mathbbm {w}_{3}}(x,p(x),q(x),m_{1}(x),m_{2}(x))\) with \(\tilde{\mathbbm {w}_{3}}(x,z_1,z_2,z_3,z_4)\) having appropriately embeddable behaviour in its arguments \(x,z_1,z_2,z_3,z_4\), the resulting ratios \(\frac{z_1}{z_3}\), \(\frac{z_2}{z_4}\) and possible boundary values thereof.    \(\square \)

The following requirement is stronger than the “model-individual/dependent” Assumption 2 but is more “universally applicable” (amongst all models such that \(\mathscr {R}\big (\frac{P}{M_{1}}\big ) \cup \mathscr {R}\big (\frac{Q}{M_{2}}\big ) \subset [a,b]\); take e.g. \(E=]a,b[\) to be \(E=]0,\infty [\) or \(E=]-\infty ,\infty [\)):

Assumption 3

Let \(c \in [0,1]\), \(\phi \in \varPhi (]a,b[)\) on some fixed \(]a,b[ \, \subset \, ]-\infty ,+\infty [\) such that \(]a,b[ \, \supset \mathscr {R}\big (\frac{P}{M_{1}}\big ) \cup \mathscr {R}\big (\frac{Q}{M_{2}}\big )\). The aggregation function is of the form \(\mathbbm {m}_{3}(x)= \mathbbm {w}_{3}\big (x,\frac{p(x)}{m_{1}(x)},\frac{q(x)}{m_{2}(x)} \big )\) for some (measurable) function \(\mathbbm {w}_{3}: \mathscr {X} \times [a,b] \times [a,b] \rightarrow [0,\infty ]\). Furthermore, for all \(s \in ]a,b[\), \(t \in ]a,b[\) and \(\lambda \)-a.a. \(x \in \mathscr {X}\), the conditions (a) to (n) of Assumption 2 hold.

Important examples in connection with the Assumptions 2, 3 will be given in Sect. 3.2 (for \(\phi \)) and Sect. 3.3 (for \(m_{1}\), \(m_{2}\), \(\mathbbm {w}_{3}\)) below. With these assumptions at hand, we obtain the following non-negativity and reflexivity assertions:

Theorem 4

Let Assumption 2 be satisfied. Then the following two assertions hold: (1) \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q) \geqslant 0\); depending on the concrete situation, \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q)\) may take the value \(\infty \).

$$\begin{aligned}& {\textit{(2)}} \ D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q) = 0 \ \ \text {if and only if} \ \ \frac{p(x)}{m_1(x)}=\frac{q(x)}{m_2(x)} \ \text {for}\, \lambda \text {-a.a.}\, x \in \mathscr {X}. \qquad \ \nonumber \end{aligned}$$

Theorem 4 – whose proof will be given in the appendix – says that \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q)\) is indeed a “proper” divergence under Assumption 2. Hence, the latter will be assumed for the rest of the paper, unless stated otherwise; for instance, we shall sometimes work with the stronger Assumption 3. For convenient reference, we state explicitly

Corollary 1

Under the more universally applicable Assumption 3, the Assertions (1) and (2) of Theorem 4 hold.

Under some non-obvious additional constraints on the functions P, Q it may be possible to show the Assertions (1), (2) of Theorem 4 even when dropping the purely generator-concerning Assumptions 2(b) to (d); see e.g. Sect. 3.3.1.2 below. In the following, we discuss several important features and special cases of \(\beta = (\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda )\) in a well-structured way, starting with the last component, the reference measure \(\lambda \).

3.1 The Reference Measure \(\lambda \)

In (1), \(\lambda \) can be interpreted as a “governor” of the principal aggregation structure, whereas the “aggregation function” \(\mathbbm {m}_{3}\) tunes the fine aggregation details. For instance, if one chooses \(\lambda = \lambda _{L}\) as the Lebesgue measure on \(\mathscr {X} \subset \mathbb {R}\), then the integral in (1) turns out to be of Lebesgue-type and (with some rare exceptions) consequently of Riemann-type. In contrast, in the discrete setup where \(\mathscr {X} := \mathscr {X}_{\#}\) has countably many elements and is equipped with the counting measure \(\lambda := \lambda _{\#} := \sum _{z \in \mathscr {X}_{\#}} \delta _{z}\) (where \(\delta _{z}\) is Dirac’s one-point distribution \(\delta _{z}[A] := \varvec{1}_{A}(z)\), and thus \(\lambda _{\#}[\{z\}] =1\) for all \(z \in \mathscr {X}_{\#}\)), (1) simplifies to

$$\begin{aligned}& \textstyle 0 \leqslant D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda _{\#}}(P,Q) \nonumber \\& \textstyle : = {\overline{\sum }}_{z \in \mathscr {X}} \Big [ \phi \big ( { \frac{p(z)}{m_{1}(z)}}\big ) -\phi \big ( {\frac{q(z)}{m_{2}(z)}}\big ) - \phi _{+,c}^{\prime } \big ( {\frac{q(z)}{m_{2}(z)}}\big ) \cdot \big ( \frac{p(z)}{m_{1}(z)}-\frac{q(z)}{m_{2}(z)}\big ) \Big ] \cdot \mathbbm {m}_{3}(z) \, , \qquad \ \ \end{aligned}$$
(2)

which we interpret as \(\sum _{{z \in \mathscr {X}}} \overline{\psi _{\phi ,c}}\big (\frac{p(z)}{m_{1}(z)},\frac{q(z)}{m_{2}(z)}\big ) \cdot \mathbbm {m}_{3}(z)\) with the same conventions and limits as in the paragraph right after (1); if \(\mathscr {X}_{\#} = \{z_{0}\}\) for arbitrary \(z_{0} \in \widetilde{X}\), we obtain the corresponding one-point divergence over any space \(\widetilde{X}\).
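For numerical experimentation, the discrete formula (2) translates directly into a short routine. The following is a minimal sketch (all names are illustrative assumptions, the generator being supplied as a pair of vectorized callables) in which the boundary conventions discussed after (1) are deliberately omitted, i.e. all ratios are assumed to lie in the interior \(]a,b[\):

```python
import numpy as np

def discrete_divergence(p, q, m1, m2, m3, phi, dphi_c):
    """Sketch of (2) on a finite state space: p, q, m1, m2, m3 are NumPy
    arrays indexed by the states; phi is the generator and dphi_c its
    (sub)derivative phi'_{+,c}; interior points only, no boundary handling."""
    s = p / m1                                          # scaled first function
    t = q / m2                                          # scaled second function
    integrand = phi(s) - phi(t) - dphi_c(t) * (s - t)   # psi_{phi,c}(s, t)
    return float(np.sum(integrand * m3))                # aggregation via m3
```

For instance, passing `phi = lambda t: (t - 1.0)**2 / 2.0` and `dphi_c = lambda t: t - 1.0` recovers the \(\phi _{2}\)-case treated in Sect. 3.2 below.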

3.2 The Divergence Generator \(\phi \)

We continue with the inspection of interesting special cases of \(\beta = (\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda )\) by dealing with the first component. For this, let \(\varPhi _{C_1}(]a,b[)\) be the class of all functions \(\phi \in \varPhi (]a,b[)\) which are also continuously differentiable on \(E = ]a,b[\). For a divergence generator \(\phi \in \varPhi _{C_1}(]a,b[)\), the formula (1) becomes (recall that we then suppress the obsolete index c as well as the subderivative index \(+\))

$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q) \nonumber \\& \textstyle : = {\overline{\int }}_{{\mathscr {X}}} \Big [ \phi \big ( { \frac{p(x)}{m_{1}(x)}}\big ) -\phi \big ( {\frac{q(x)}{m_{2}(x)}}\big ) - \phi ^{\prime } \big ( {\frac{q(x)}{m_{2}(x)}}\big ) \cdot \big ( \frac{p(x)}{m_{1}(x)}-\frac{q(x)}{m_{2}(x)}\big ) \Big ] \cdot \mathbbm {m}_{3}(x) \, \mathrm {d}\lambda (x) \ , \qquad \ \ \end{aligned}$$
(3)

whereas (2) turns into

$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda _{\#}}(P,Q) \nonumber \\& \textstyle : = {\overline{\sum }}_{x \in \mathscr {X}} \Big [ \phi \big ( { \frac{p(x)}{m_{1}(x)}}\big ) -\phi \big ( {\frac{q(x)}{m_{2}(x)}}\big ) - \phi ^{\prime } \big ( {\frac{q(x)}{m_{2}(x)}}\big ) \cdot \big ( \frac{p(x)}{m_{1}(x)}-\frac{q(x)}{m_{2}(x)}\big ) \Big ] \cdot \mathbbm {m}_{3}(x) . \nonumber \end{aligned}$$

Formally, by defining the integral functional \(g_{\phi ,\mathbbm {M}_{3},\lambda }(\xi ) := \int _{\mathscr {X}} \phi (\xi (x)) \cdot \mathbbm {m}_{3}(x) \mathrm {d}\lambda (x)\) and plugging in e.g. \(g_{\phi ,\mathbbm {M}_{3},\lambda } \big ( {\frac{P}{M_{1}}}\big ) = \int _{\mathscr {X}} \phi \big ( {\frac{p(x)}{m_{1}(x)}}\big ) \cdot \mathbbm {m}_{3}(x) \, \mathrm {d}\lambda (x)\), the divergence in (3) can be interpreted as

$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q) \nonumber \\& \textstyle = g_{\phi ,\mathbbm {M}_{3},\lambda } \big ( {\frac{P}{M_{1}}}\big ) - g_{\phi ,\mathbbm {M}_{3},\lambda } \big ( {\frac{Q}{M_{2}}}\big ) - g_{\phi ,\mathbbm {M}_{3},\lambda }^{\prime } \big ( {\frac{Q}{M_{2}}}, {\frac{P}{M_{1}}} - {\frac{Q}{M_{2}}}\big ) \end{aligned}$$
(4)

where \(g_{\phi ,\mathbbm {M}_{3},\lambda }^{\prime } \big ( \eta , \, \cdot \, \big )\) denotes the corresponding directional derivate at \(\eta = \frac{Q}{M_{2}}\). If one has a “nonnegativity-setup” (NN0) in the sense that for all \(x \in \mathscr {X}\) there holds \(\frac{p(x)}{m_{1}(x)} \geqslant 0\) and \(\frac{q(x)}{m_{2}(x)}\geqslant 0\) (but not necessarily \(p(x) \geqslant 0\), \(q(x) \geqslant 0\), \(m_{1}(x) \geqslant 0\), \(m_{2}(x) \geqslant 0\)) then one can take \(a=0\), \(b=\infty \), i.e. \(E=]0,\infty [\), and employ the strictly convex power functions

$$\begin{aligned}& \textstyle \tilde{\phi }(t): = \tilde{\phi }_{\alpha }(t) := \frac{t^\alpha -1}{\alpha (\alpha -1)} \ \in ]-\infty ,\infty [ , \qquad t \in ]0,\infty [, \ \alpha \in \mathbb {R}\backslash \{0,1\} \ , \nonumber \\& \textstyle \phi (t): = \phi _{\alpha }(t) := \tilde{\phi }_{\alpha }(t) - \tilde{\phi }_{\alpha }^{\prime }(1) \cdot (t-1) = \frac{t^\alpha -1}{\alpha (\alpha -1)}-\frac{t-1}{\alpha -1} \ \in [0,\infty [ , \quad t \in ]0,\infty [, \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \quad \qquad \ \alpha \in \mathbb {R}\backslash \{0,1\} \ , \end{aligned}$$
(5)

which satisfy (with the notations introduced in the paragraph right after (1))

$$\begin{aligned}& \textstyle \phi _{\alpha }(1)=0, \quad \phi _{\alpha }^{\prime }(t)=\frac{t^{\alpha -1}-1}{\alpha -1}, \quad \phi _{\alpha }^{\prime }(1)=0, \quad \phi _{\alpha }^{\prime \prime }(t)=t^{\alpha -2} >0, \quad t \in ]0,\infty [, \end{aligned}$$
(6)
$$\begin{aligned}& \textstyle \phi _{\alpha }(0) := \lim _{t\downarrow 0}\phi _{\alpha }(t)= \frac{1}{\alpha } \cdot \varvec{1}_{]0,1] \cup ]1,\infty [}(\alpha ) + \infty \cdot \varvec{1}_{]-\infty ,0[}(\alpha ), \quad \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \phi _{\alpha }(\infty ) := \lim _{t\uparrow \infty } \phi _{\alpha }(t)= \infty , \end{aligned}$$
(7)
$$\begin{aligned}& \textstyle \phi _{\alpha }^{\prime }(0) := \lim _{t\downarrow 0}\phi _{\alpha }^{\prime }(t)= \frac{1}{1-\alpha } \cdot \varvec{1}_{]1,\infty [}(\alpha ) - \infty \cdot \varvec{1}_{]-\infty ,0[\cup ]0,1[}(\alpha ), \nonumber \\& \textstyle \phi _{\alpha }^{\prime }(\infty ) \, := \, \lim _{t\uparrow \infty }\phi _{\alpha }^{\prime }(t) \, = \, \infty \cdot \varvec{1}_{]1,\infty [}(\alpha ) + \frac{1}{1-\alpha } \cdot \varvec{1}_{]-\infty ,0[\cup ]0,1[}(\alpha ) \, = \, \lim _{t\uparrow \infty }\frac{\phi _{\alpha }(t)}{t} , \end{aligned}$$
(8)
$$\begin{aligned}& \textstyle \psi _{\phi _{\alpha }}(s,t) = \frac{1}{\alpha \cdot (\alpha -1)} \cdot \Big [ s^{\alpha } + (\alpha -1) \cdot t^{\alpha } - \alpha \cdot s \cdot t^{\alpha -1} \Big ], \quad s,t \in ]0,\infty [ , \end{aligned}$$
(9)
$$\begin{aligned}& \textstyle \psi _{\phi _{\alpha }}(0,t) = \frac{t^{\alpha }}{\alpha } \cdot \varvec{1}_{]0,1[ \cup ]1,\infty [}(\alpha ) + \infty \cdot \varvec{1}_{]-\infty ,0[}(\alpha ), \quad t \in ]0,\infty [ , \end{aligned}$$
(10)
$$\begin{aligned}& \textstyle \psi _{\phi _{\alpha }}(\infty ,t) = \infty , \quad t \in ]0,\infty [ , \nonumber \\& \textstyle \lim _{s\rightarrow \infty } \frac{1}{s} \cdot \psi _{\phi _{\alpha }}(s,1) = \frac{1}{1-\alpha } \cdot \varvec{1}_{ ]-\infty ,0[ \cup ]0,1[}(\alpha ) + \infty \cdot \varvec{1}_{]1,\infty [}(\alpha ), \nonumber \\& \textstyle \psi _{\phi _{\alpha }}(s,0) = \frac{s^{\alpha }}{\alpha \cdot (\alpha -1)} \cdot \varvec{1}_{]1,\infty [}(\alpha ) + \infty \cdot \varvec{1}_{]-\infty ,0[ \cup ]0,1[}(\alpha ), \quad s \in ]0,\infty [ , \end{aligned}$$
(11)
$$\begin{aligned}& \textstyle \psi _{\phi _{\alpha }}(s,\infty ) = \frac{s^{\alpha }}{\alpha \cdot (\alpha -1)} \cdot \varvec{1}_{]-\infty ,0[}(\alpha ) + \infty \cdot \varvec{1}_{]0,1[ \cup ]1,\infty [}(\alpha ), \quad s \in ]0,\infty [ , \nonumber \\& \textstyle \psi _{\phi _{\alpha }}(0,0) := 0 \ \text {(which is unequal to}\, \lim _{t\rightarrow 0} \lim _{s\rightarrow 0} \psi _{\phi _{\alpha }}(s,t)\, \text {for}\, \alpha <0 \nonumber \\&\qquad \qquad \qquad \text { and which is unequal to}\, \lim _{s\rightarrow 0} \lim _{t\rightarrow 0} \psi _{\phi _{\alpha }}(s,t)\, \text {for}\, \alpha >1), \nonumber \\& \textstyle \psi _{\phi _{\alpha }}(\infty ,\infty ) := 0 \ \text {(which is unequal to}\, \lim _{t\rightarrow \infty } \lim _{s\rightarrow \infty } \psi _{\phi _{\alpha }}(s,t)\, \text {for}\, \alpha \in \mathbb {R}\backslash \{0,1\} \nonumber \\&\qquad \qquad \qquad \text { and which is unequal to}\, \lim _{s\rightarrow \infty } \lim _{t\rightarrow \infty } \psi _{\phi _{\alpha }}(s,t)\, \text {for}\, \alpha \in ]0,1[ \cup ]1,\infty [), \nonumber \\& \textstyle \psi _{\phi _{\alpha }}(0,\infty ) := \lim _{s \rightarrow 0} \lim _{t \rightarrow \infty } \psi _{\phi _{\alpha }}(s,t) = \infty \end{aligned}$$
(12)
$$\begin{aligned}&\qquad \qquad \qquad \text {(which coincides with}\, \lim _{t\rightarrow \infty } \lim _{s\rightarrow 0} \psi _{\phi _{\alpha }}(s,t)\, \text {for}\, \alpha \in \mathbb {R}\backslash \{0,1\}), \nonumber \\& \textstyle \psi _{\phi _{\alpha }}(\infty ,0) := \lim _{s \rightarrow \infty } \lim _{t \rightarrow 0} \psi _{\phi _{\alpha }}(s,t) = \infty \\&\qquad \qquad \qquad \text {(which coincides with}\, \lim _{t\rightarrow 0} \lim _{s\rightarrow \infty } \psi _{\phi _{\alpha }}(s,t)\, \text {for}\, \alpha \in \mathbb {R}\backslash \{0,1\}) . \nonumber \end{aligned}$$
(13)
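As an illustrative aside (not part of the formal development), the closed forms (5) and (9) can be sketched in code; the sketch is valid only for interior points \(s,t \in ]0,\infty [\) and \(\alpha \notin \{0,1\}\), the boundary limits (10)–(13) being deliberately not implemented:

```python
def phi_alpha(t, alpha):
    # power-function generator of (5), t > 0, alpha not in {0, 1}
    return (t**alpha - 1.0) / (alpha * (alpha - 1.0)) - (t - 1.0) / (alpha - 1.0)

def psi_alpha(s, t, alpha):
    # closed form (9) of the integrand psi_{phi_alpha}(s, t), s, t > 0
    return (s**alpha + (alpha - 1.0) * t**alpha
            - alpha * s * t**(alpha - 1.0)) / (alpha * (alpha - 1.0))

# quick consistency check against (15): alpha = 2 gives (s - t)^2 / 2
assert abs(psi_alpha(3.0, 1.0, 2.0) - 2.0) < 1e-12
```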

Perhaps the most important special case is \(\alpha =2\), for which (5) turns into

$$\begin{aligned}& \textstyle \phi _{2}(t) := \frac{(t-1)^2}{2}, \quad t \in ]0,\infty [ = E, \end{aligned}$$
(14)

having for \(s,t \in ]0,\infty [\) the properties (cf. (7)–(13))

$$\begin{aligned}& \textstyle \phi _{2}(1)=0, \quad \phi _{2}^{\prime }(1)=0, \quad \phi _{2}(0) = \frac{1}{2} , \quad \phi _{2}(\infty ) = \infty , \quad \phi _{2}^{\prime }(0) = - \frac{1}{2} , \quad \nonumber \\& \textstyle \phi _{2}^{\prime }(\infty ) = \infty = \lim _{t\uparrow \infty }\frac{\phi _{2}(t)}{t} , \psi _{\phi _{2}}(s,t) = \frac{(s-t)^2}{2} , \end{aligned}$$
(15)
$$\begin{aligned}& \textstyle \psi _{\phi _{2}}(0,t) = \frac{t^{2}}{2} , \quad \psi _{\phi _{2}}(\infty ,t) = \infty , \quad \lim _{s\rightarrow \infty } \frac{1}{s} \cdot \psi _{\phi _{2}}(s,1) = \infty , \nonumber \\& \textstyle \psi _{\phi _{2}}(s,0) = \frac{s^{2}}{2} , \quad \psi _{\phi _{2}}(s,\infty ) = \infty , \quad \psi _{\phi _{2}}(0,0) := 0 , \\& \textstyle \psi _{\phi _{2}}(\infty ,\infty ) := 0 , \quad \psi _{\phi _{2}}(0,\infty ) = \infty , \quad \psi _{\phi _{2}}(\infty ,0) = \infty . \nonumber \end{aligned}$$
(16)

Also notice that the divergence-generator \(\phi _{2}\) of (14) can be trivially extended to

$$\begin{aligned}& \textstyle \bar{\phi }_{2}(t) := \frac{(t-1)^2}{2}, \quad t \in ]-\infty ,\infty [ = \bar{E}, \end{aligned}$$
(17)

which is useful in a general setup (GS) where for all \(x \in \mathscr {X}\) one has \(\frac{p(x)}{m_{1}(x)} \in [-\infty , \infty ]\) and \(\frac{q(x)}{m_{2}(x)} \in [-\infty , \infty ]\). Convex extensions to \(]a, \infty [\) with \(a \in ]-\infty ,0[\) can be easily done by the shift \(\bar{\phi }_{\alpha }(t) := \phi _{\alpha }(t-a)\).

Further examples of everywhere strictly convex differentiable divergence generators \(\phi \in \varPhi _{C_{1}}(]a,b[)\) for the “nonnegativity-setup” (NN0) (i.e. \(a=0\), \(b=\infty \), \(E=]0,\infty [\)) can be obtained by taking the \(\alpha \)-limits

$$\begin{aligned}& \textstyle \tilde{\phi }_{1}(t) := t \cdot \log t \ \in [- e^{-1},\infty [ , \qquad t \in ]0,\infty [, \nonumber \\& \textstyle \phi _{1}(t) \, := \, \lim _{\alpha \rightarrow 1} \phi _{\alpha }(t) \, = \, \tilde{\phi }_{1}(t) - \tilde{\phi }_{1}^{\prime }(1) \cdot (t-1) \, = \, t \cdot \log t + 1 - t \in [0, \infty [, \ t \in ]0,\infty [, \ \ \end{aligned}$$
(18)
$$\begin{aligned}& \textstyle \tilde{\phi }_{0}(t) := - \log t \ \in ]-\infty ,\infty [ , \qquad t \in ]0,\infty [, \nonumber \\& \textstyle \phi _{0}(t) \, := \, \lim _{\alpha \rightarrow 0} \phi _{\alpha }(t) \, = \, \tilde{\phi }_{0}(t) - \tilde{\phi }_{0}^{\prime }(1) \cdot (t-1) \, = \, - \log t + t - 1 \in [0, \infty [, \ t \in ]0,\infty [, \ \ \end{aligned}$$
(19)

which satisfy

$$\begin{aligned}& \textstyle \phi _{1}(1)=0, \quad \phi _{1}^{\prime }(t)=\log t, \quad \phi _{1}^{\prime }(1)=0, \quad \phi _{1}^{\prime \prime }(t)=t^{-1} >0, \quad t \in ]0,\infty [, \nonumber \\& \textstyle \phi _{1}(0) := \lim _{t\downarrow 0}\phi _{1}(t)= 1, \quad \phi _{1}(\infty ) := \lim _{t\uparrow \infty } \phi _{1}(t)= \infty , \end{aligned}$$
(20)
$$\begin{aligned}& \textstyle \phi _{1}^{\prime }(0) := \lim _{t\downarrow 0}\phi _{1}^{\prime }(t)= - \infty , \quad \phi _{1}^{\prime }(\infty ) := \lim _{t\uparrow \infty }\phi _{1}^{\prime }(t)= + \infty = \lim _{t\uparrow \infty }\frac{\phi _{1}(t)}{t} , \qquad \ \ \end{aligned}$$
(21)
$$\begin{aligned}& \textstyle \psi _{\phi _{1}}(s,t) = s \cdot \log \big (\frac{s}{t}\big ) + t - s , \quad s,t \in ]0,\infty [ , \end{aligned}$$
(22)
$$\begin{aligned}& \textstyle \psi _{\phi _{1}}(0,t) = t, \quad \psi _{\phi _{1}}(\infty ,t) = \infty , \quad \lim _{s\rightarrow \infty } \frac{1}{s} \cdot \psi _{\phi _{1}}(s,1) = \infty , \quad t \in ]0,\infty [ , \end{aligned}$$
(23)
$$\begin{aligned}& \textstyle \psi _{\phi _{1}}(s,0) = \infty , \quad \psi _{\phi _{1}}(s,\infty ) = \infty , \quad s \in ]0,\infty [ , \\& \textstyle \psi _{\phi _{1}}(0,0) := 0 \ \text {(which coincides with}\, \lim _{t\rightarrow 0} \lim _{s\rightarrow 0} \psi _{\phi _{1}}(s,t) \nonumber \\&\qquad \qquad \qquad \quad \text {but which does not coincide with}\, \lim _{s\rightarrow 0} \lim _{t\rightarrow 0} \psi _{\phi _{1}}(s,t) = \infty \text {)}, \nonumber \\& \textstyle \psi _{\phi _{1}}(\infty ,\infty ) := 0 \ \text {(which does not coincide with } \nonumber \\&\qquad \qquad \qquad \quad \lim _{t\rightarrow \infty } \lim _{s\rightarrow \infty } \psi _{\phi _{1}}(s,t) = \lim _{s\rightarrow \infty } \lim _{t\rightarrow \infty } \psi _{\phi _{1}}(s,t) = \infty \text {)}, \nonumber \\& \textstyle \psi _{\phi _{1}}(0,\infty ) := \lim _{s \rightarrow 0} \lim _{t \rightarrow \infty } \psi _{\phi _{1}}(s,t) = \infty \nonumber \\&\qquad \qquad \qquad \quad \text {(which coincides with}\, \lim _{t\rightarrow \infty } \lim _{s\rightarrow 0} \psi _{\phi _{1}}(s,t)\text {)}, \nonumber \\& \textstyle \psi _{\phi _{1}}(\infty ,0) := \lim _{s \rightarrow \infty } \lim _{t \rightarrow 0} \psi _{\phi _{1}}(s,t) = \infty \nonumber \\&\qquad \qquad \qquad \quad \text {(which coincides with}\, \lim _{t\rightarrow 0} \lim _{s\rightarrow \infty } \psi _{\phi _{1}}(s,t)\text {)} , \nonumber \end{aligned}$$
(24)

as well as

$$\begin{aligned}& \textstyle \phi _{0}(1)=0, \quad \phi _{0}^{\prime }(t)=1 - \frac{1}{t}, \quad \phi _{0}^{\prime }(1)=0, \quad \phi _{0}^{\prime \prime }(t)=t^{-2} >0, \quad t \in ]0,\infty [, \qquad \ \ \end{aligned}$$
(25)
$$\begin{aligned}& \textstyle \phi _{0}(0) := \lim _{t\downarrow 0}\phi _{0}(t)= \infty , \quad \phi _{0}(\infty ) := \lim _{t\uparrow \infty } \phi _{0}(t)= \infty , \end{aligned}$$
(26)
$$\begin{aligned}& \textstyle \phi _{0}^{\prime }(0) := \lim _{t\downarrow 0}\phi _{0}^{\prime }(t)= - \infty , \quad \phi _{0}^{\prime }(\infty ) := \lim _{t\uparrow \infty }\phi _{0}^{\prime }(t)= 1 = \lim _{t\uparrow \infty }\frac{\phi _{0}(t)}{t} , \end{aligned}$$
(27)
$$\begin{aligned}& \textstyle \psi _{\phi _{0}}(s,t) = - \log \big (\frac{s}{t}\big ) + \frac{s}{t} - 1 , \quad s,t \in ]0,\infty [ , \end{aligned}$$
(28)
$$\begin{aligned}& \textstyle \psi _{\phi _{0}}(0,t) = \infty , \quad \psi _{\phi _{0}}(\infty ,t) = \infty , \quad \lim _{s\rightarrow \infty } \frac{1}{s} \cdot \psi _{\phi _{0}}(s,1) = 1 , \quad t \in ]0,\infty [ , \end{aligned}$$
(29)
$$\begin{aligned}& \textstyle \psi _{\phi _{0}}(s,0) = \infty , \quad \psi _{\phi _{0}}(s,\infty ) = \infty , \quad s \in ]0,\infty [ , \\& \textstyle \psi _{\phi _{0}}(0,0) := 0 \ \text {(which does not coincide with } \nonumber \\&\qquad \qquad \qquad \lim _{t\rightarrow 0} \lim _{s\rightarrow 0} \psi _{\phi _{0}}(s,t) = \lim _{s\rightarrow 0} \lim _{t\rightarrow 0} \psi _{\phi _{0}}(s,t) = \infty ), \nonumber \\& \textstyle \psi _{\phi _{0}}(\infty ,\infty ) := 0 \ \text {(which does not coincide with } \nonumber \\&\qquad \qquad \qquad \lim _{t\rightarrow \infty } \lim _{s\rightarrow \infty } \psi _{\phi _{0}}(s,t) = \lim _{s\rightarrow \infty } \lim _{t\rightarrow \infty } \psi _{\phi _{0}}(s,t) = \infty ), \nonumber \\& \textstyle \psi _{\phi _{0}}(0,\infty ) := \lim _{s \rightarrow 0} \lim _{t \rightarrow \infty } \psi _{\phi _{0}}(s,t) = \infty \nonumber \\&\qquad \qquad \qquad \text {(which coincides with}\, \lim _{t\rightarrow \infty } \lim _{s\rightarrow 0} \psi _{\phi _{0}}(s,t)\text {)}, \nonumber \\& \textstyle \psi _{\phi _{0}}(\infty ,0) := \lim _{s \rightarrow \infty } \lim _{t \rightarrow 0} \psi _{\phi _{0}}(s,t) = \infty \nonumber \\&\qquad \qquad \qquad \text {(which coincides with}\, \lim _{t\rightarrow 0} \lim _{s\rightarrow \infty } \psi _{\phi _{0}}(s,t)\text {)} . \nonumber \end{aligned}$$
(30)
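Analogously, the limit integrands (22) and (28) admit the following sketch for interior points \(s,t \in ]0,\infty [\) (again an illustration only, boundary cases excluded):

```python
import numpy as np

def psi_1(s, t):
    # (22): generalized Kullback-Leibler integrand, the alpha -> 1 limit
    return s * np.log(s / t) + t - s

def psi_0(s, t):
    # (28): reverse-KL-type integrand, the alpha -> 0 limit
    return -np.log(s / t) + s / t - 1.0
```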

An important, but (in our context) technically delicate, convex divergence generator is \(\phi _{TV}(t):= |t-1|\), which is non-differentiable at \(t=1\); moreover, \(t=1\) is its only point of strict convexity. Further properties are, for arbitrarily fixed \(s,t \in ]0,\infty [\), \(c \in [0,1]\) (if not stated otherwise),

$$\begin{aligned}& \textstyle \phi _{TV}(1)=0, \quad \phi _{TV}(0) = 1, \quad \phi _{TV}(\infty ) = \infty , \end{aligned}$$
(31)
$$\begin{aligned}& \textstyle \phi _{TV,+,c}^{\prime }(t)= \varvec{1}_{]1,\infty [}(t) + (2c-1) \cdot \varvec{1}_{\{1\}}(t) - \varvec{1}_{]0,1[}(t), \nonumber \\& \textstyle \phi _{TV,+,1}^{\prime }(t)= \varvec{1}_{[1,\infty [}(t) - \varvec{1}_{]0,1[}(t), \nonumber \\& \textstyle \phi _{TV,+,\frac{1}{2}}^{\prime }(t)= \varvec{1}_{]1,\infty [}(t) - \varvec{1}_{]0,1[}(t) = \text {sgn}(t-1) \cdot \varvec{1}_{]0,\infty [}(t) , \nonumber \\& \textstyle \phi _{TV,+,c}^{\prime }(1) = 2c-1, \quad \quad \phi _{TV,+,1}^{\prime }(1) = 1, \quad \phi _{TV,+,\frac{1}{2}}^{\prime }(1) =0, \end{aligned}$$
(32)
$$\begin{aligned}& \textstyle \phi _{TV,+,c}^{\prime }(0) = \lim _{t \rightarrow 0} \phi _{TV,+,c}^{\prime }(t) = -1, \quad \phi _{TV,+,c}^{\prime }(\infty ) = \lim _{t \rightarrow \infty } \phi _{TV,+,c}^{\prime }(t) = 1, \nonumber \\& \textstyle \psi _{\phi _{TV},c}(s,t) = \varvec{1}_{]0,1[}(t) \cdot 2 (s-1) \cdot \varvec{1}_{]1,\infty [}(s) + \varvec{1}_{]1,\infty [}(t) \cdot 2 (1-s) \cdot \varvec{1}_{]0,1]}(s) \nonumber \\&\qquad \qquad \qquad + \ \varvec{1}_{\{1\}}(t) \cdot \Big [ 2 (1-c) \cdot (s-1) \cdot \varvec{1}_{]1,\infty [}(s) + 2c \cdot (1-s) \cdot \varvec{1}_{]0,1]}(s) \Big ], \nonumber \\& \textstyle \psi _{\phi _{TV},\frac{1}{2}}(s,1) = |s-1|, \end{aligned}$$
(33)
$$\begin{aligned}& \textstyle \psi _{\phi _{TV},c}(0,t) = \lim _{s \rightarrow 0} \psi _{\phi _{TV},c}(s,t) = 2 \cdot \varvec{1}_{]1,\infty [}(t) + 2c \cdot \varvec{1}_{\{1\}}(t) , \nonumber \\& \textstyle \psi _{\phi _{TV},c}(\infty ,t) = \lim _{s \rightarrow \infty } \psi _{\phi _{TV},c}(s,t) = \infty \cdot \varvec{1}_{]0,1[}(t) + \ \infty \cdot \varvec{1}_{\{1\}}(t) \cdot \varvec{1}_{[0,1[}(c) , \nonumber \\& \textstyle \lim _{s\rightarrow \infty } \frac{1}{s} \cdot \psi _{\phi _{TV},c}(s,1) = 2 (1-c), \\& \textstyle \psi _{\phi _{TV},c}(s,0) = \lim _{t \rightarrow 0} \psi _{\phi _{TV},c}(s,t) = 2(s-1) \cdot \varvec{1}_{]1,\infty [}(s), \nonumber \\& \textstyle \psi _{\phi _{TV},c}(s,\infty ) = \lim _{t \rightarrow \infty } \psi _{\phi _{TV},c}(s,t) = 2(1-s) \cdot \varvec{1}_{]0,1]}(s), \nonumber \\& \textstyle \psi _{\phi _{TV},c}(0,0) := 0 \ \text {(which coincides with both}\, \lim _{t\rightarrow 0} \lim _{s\rightarrow 0} \psi _{\phi _{TV},c}(s,t) \nonumber \\&\qquad \qquad \qquad \text { and}\, \lim _{s\rightarrow 0} \lim _{t\rightarrow 0} \psi _{\phi _{TV},c}(s,t) ), \nonumber \\& \textstyle \psi _{\phi _{TV},c}(\infty ,\infty ) := 0 \ \text {(which coincides with both}\, \lim _{t\rightarrow \infty } \lim _{s\rightarrow \infty } \psi _{\phi _{TV},c}(s,t) \nonumber \\&\qquad \qquad \qquad \text { and}\, \lim _{s\rightarrow \infty } \lim _{t\rightarrow \infty } \psi _{\phi _{TV},c}(s,t) \text {)}, \nonumber \\& \textstyle \psi _{\phi _{TV},c}(0,\infty ) := \lim _{s \rightarrow 0} \lim _{t \rightarrow \infty } \psi _{\phi _{TV},c}(s,t) = 2 \nonumber \\&\qquad \qquad \qquad \text {(which coincides with}\, \lim _{t\rightarrow \infty } \lim _{s\rightarrow 0} \psi _{\phi _{TV},c}(s,t)\text {)}, \nonumber \\& \textstyle \psi _{\phi _{TV},c}(\infty ,0) := \lim _{s \rightarrow \infty } \lim _{t \rightarrow 0} \psi _{\phi _{TV},c}(s,t) = \infty \nonumber \\&\qquad \qquad \qquad \text {(which coincides with}\, \lim _{t\rightarrow 0} \lim _{s\rightarrow \infty } \psi _{\phi _{TV},c}(s,t) \text {)} . \nonumber \end{aligned}$$
(34)

In particular, one sees from Assumption 2(a) that – in our context – \(\phi _{TV}\) can potentially be applied only if \(\frac{q(x)}{m_{2}(x)} = 1\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\), and from Assumptions 2(c), (d) that we generally have to exclude \(c=1\) and \(c=0\) for \(\phi _{+,c}^{\prime }(\cdot )\) (i.e. we choose \(c \in ]0,1[\)); as already mentioned above, under some non-obvious additional constraints on the functions P, Q it may be possible to drop Assumptions 2(c), (d), see for instance Sect. 3.3.1.2 below.
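For completeness, the c-subderivative (32) – the only ingredient of \(\phi _{TV}\) that requires a case distinction – can be sketched as follows (names illustrative):

```python
def dphi_tv_c(t, c):
    # c-subderivative (32) of phi_TV(t) = |t - 1|; the kink sits at t = 1
    if t > 1.0:
        return 1.0                # right of the kink
    if t < 1.0:
        return -1.0               # left of the kink
    return 2.0 * c - 1.0          # mixture c*phi'_+ + (1 - c)*phi'_- at t = 1
```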

Another interesting and technically delicate example is the divergence generator \(\phi _{ie}(t):= t -1 + \frac{(1-t)^3}{3} \cdot \varvec{1}_{[0,1]}(t)\), which is convex, twice continuously differentiable, strictly convex at every point \(t \in ]0,1]\), and affine-linear on \([1,\infty [\). In more detail, one obtains for arbitrarily fixed \(s,t \in ]0,\infty [\) (if not stated otherwise):

$$\begin{aligned}& \textstyle \phi _{ie}(1)=0, \quad \phi _{ie}(0) = - \frac{2}{3}, \quad \phi _{ie}(\infty ) = \infty , \\& \textstyle \phi _{ie}^{\prime }(t)= 1 - (1-t)^{2} \cdot \varvec{1}_{]0,1[}(t), \nonumber \\& \textstyle \phi _{ie}^{\prime }(1) = 1, \quad \phi _{ie}^{\prime }(0) = \lim _{t \rightarrow 0} \phi _{ie}^{\prime }(t) = 0, \quad \phi _{ie}^{\prime }(\infty ) = \lim _{t \rightarrow \infty } \phi _{ie}^{\prime }(t) = 1, \nonumber \\& \textstyle \phi _{ie}^{\prime \prime }(t)= 2(1-t) \cdot \varvec{1}_{]0,1[}(t), \quad \phi _{ie}^{\prime \prime }(1) = 0, \nonumber \\& \textstyle \psi _{\phi _{ie}}(s,t) = \frac{(1-s)^3}{3} \cdot \varvec{1}_{]0,1[}(s) + \ (1-t)^2 \cdot \Big [\frac{2}{3} \cdot (1-t) + (s-1) \Big ] \cdot \varvec{1}_{]0,1[}(t), \nonumber \\& \textstyle \psi _{\phi _{ie}}(s,1) = \frac{(1-s)^3}{3} \cdot \varvec{1}_{]0,1[}(s) , \nonumber \\& \textstyle \psi _{\phi _{ie}}(0,t) \, = \, \lim _{s \rightarrow 0} \psi _{\phi _{ie}}(s,t) \, = \, \frac{1}{3} \cdot \varvec{1}_{[1,\infty [}(t) + \frac{1}{3} \, \cdot \, \Big [1-(1-t)^{2} \cdot (1-2t) \Big ] \cdot \varvec{1}_{]0,1[}(t) , \nonumber \\& \textstyle \psi _{\phi _{ie}}(\infty ,t) = \lim _{s \rightarrow \infty } \psi _{\phi _{ie}}(s,t) = \infty \cdot \varvec{1}_{]0,1[}(t) , \nonumber \\& \textstyle \lim _{s\rightarrow \infty } \frac{1}{s} \cdot \psi _{\phi _{ie}}(s,1) = 0, \nonumber \\& \textstyle \psi _{\phi _{ie}}(s,0) = \lim _{t \rightarrow 0} \psi _{\phi _{ie}}(s,t) = \big ( s - \frac{1}{3} \big ) \cdot \varvec{1}_{[1,\infty [}(s) + s^{2}\cdot \big ( 1 - \frac{s}{3} \big ) \cdot \varvec{1}_{]0,1[}(s) , \nonumber \\& \textstyle \psi _{\phi _{ie}}(s,\infty ) = \lim _{t \rightarrow \infty } \psi _{\phi _{ie}}(s,t) = \frac{(1-s)^3}{3} \cdot \varvec{1}_{]0,1[}(s) , \nonumber \\& \textstyle \psi _{\phi _{ie}}(0,0) := 0 \ \text {(which coincides with both}\, \lim _{t\rightarrow 0} \lim _{s\rightarrow 0} \psi _{\phi _{ie}}(s,t) \nonumber \\&\qquad \qquad \qquad \text { and}\, \lim _{s\rightarrow 0} \lim _{t\rightarrow 0} \psi _{\phi _{ie}}(s,t) ), \nonumber \\& \textstyle \psi _{\phi _{ie}}(\infty ,\infty ) := 0 \ \text {(which coincides with both}\, \lim _{t\rightarrow \infty } \lim _{s\rightarrow \infty } \psi _{\phi _{ie}}(s,t) \nonumber \\&\qquad \qquad \qquad \text { and}\, \lim _{s\rightarrow \infty } \lim _{t\rightarrow \infty } \psi _{\phi _{ie}}(s,t) ), \nonumber \\& \textstyle \psi _{\phi _{ie}}(0,\infty ) := \lim _{s \rightarrow 0} \lim _{t \rightarrow \infty } \psi _{\phi _{ie}}(s,t) = \frac{1}{3} \nonumber \\&\qquad \qquad \qquad \text {(which coincides with}\, \lim _{t\rightarrow \infty } \lim _{s\rightarrow 0} \psi _{\phi _{ie}}(s,t) \text {)}, \nonumber \\& \textstyle \psi _{\phi _{ie}}(\infty ,0) := \lim _{s \rightarrow \infty } \lim _{t \rightarrow 0} \psi _{\phi _{ie}}(s,t) = \infty \nonumber \\&\qquad \qquad \qquad \text {(which coincides with}\, \lim _{t\rightarrow 0} \lim _{s\rightarrow \infty } \psi _{\phi _{ie}}(s,t)\text {)} . \nonumber \end{aligned}$$
(35)

In particular, one sees from Assumptions 2(a), (b) that – in our context – \(\phi _{ie}\) can potentially be applied only in the following two disjoint situations:

(i) \(\frac{q(x)}{m_{2}(x)} < 1\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\); (ii) \(\frac{q(x)}{m_{2}(x)} = 1\) and \(\frac{p(x)}{m_{1}(x)} \leqslant 1\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\).

As already mentioned above, under some non-obvious additional constraints on the functions P, Q it may be possible to drop Assumption 2(b); consequently, (ii) can then be replaced by \(\widetilde{(ii)}\): \(\frac{q(x)}{m_{2}(x)} = 1\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\); see for instance Sect. 3.3.1.2 below.
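A corresponding sketch of \(\phi _{ie}\) and its derivative, with the indicator factors implementing the piecewise definition (again an illustration only), reads:

```python
def phi_ie(t):
    # phi_ie(t) = t - 1 + (1 - t)^3 / 3 on ]0,1], affine-linear t - 1 on [1,inf[
    return t - 1.0 + ((1.0 - t)**3) / 3.0 * (t <= 1.0)

def dphi_ie(t):
    # phi'_ie(t) = 1 - (1 - t)^2 on ]0,1[, and 1 on [1,inf[
    return 1.0 - (1.0 - t)**2 * (t < 1.0)
```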

3.3 The Scaling and the Aggregation Functions \(m_1\), \(m_2\), \(\mathbbm {m}_{3}\)

In the above two Sects. 3.1 and 3.2, we have illuminated details of the choices of the first and the last component of \(\beta = (\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda )\). Let us now discuss the principal roles as well as examples of \(m_1\), \(m_2\), \(\mathbbm {m}_{3}\), which considerably widen the divergence-modeling flexibility and thus bring in a broad spectrum of goal-oriented, situation-based applicability. To start with, recall that in accordance with (1), the aggregation function \(\mathbbm {m}_{3}\) tunes the fine aggregation details (whereas \(\lambda \) can be interpreted as a “governor” of the basic/principal aggregation structure); furthermore, the function \(m_1(\cdot )\) scales the function \(p(\cdot )\) and \(m_2(\cdot )\) the function \(q(\cdot )\). From a modeling perspective, these two scaling functions can e.g. be “purely direct” in the sense that \(m_{1}(x)\), \(m_{2}(x)\) are chosen to directly reflect some dependence on the data-reflecting state \(x\in \mathscr {X}\) (independently of the choice of P, Q), or “purely adaptive” in the sense that \(m_{1}(x) = w_{1}(p(x),q(x))\), \(m_{2}(x) = w_{2}(p(x),q(x))\) for some appropriate (measurable) “connector functions” \(w_{1}\), \(w_{2}\) on the product \(\mathscr {R}(P) \times \mathscr {R}(Q)\) of the ranges of \(\big \{p(x)\big \}_{x \in \mathscr {X}}\) and \(\big \{q(x)\big \}_{x \in \mathscr {X}}\), or “hybrids” \(m_{1}(x) = w_{1}(x,p(x),q(x))\), \(m_{2}(x) = w_{2}(x,p(x),q(x))\); a small illustration follows below. Also recall that in consistency with Assumption 2 we always assume \(\mathbbm {m}_{3}(x)= \mathbbm {w}_{3}\big (x,\frac{p(x)}{m_{1}(x)},\frac{q(x)}{m_{2}(x)} \big )\) for some (measurable) function \(\mathbbm {w}_{3}: \mathscr {X} \times [a,b] \times [a,b] \rightarrow [0,\infty ]\). Whenever applicable and insightfulness-enhancing, we use the notation \(D^{c}_{\phi ,W_{1},W_{2},\mathbbm {W}_{3},\lambda }(P,Q)\) instead of \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q)\).
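As the small illustration announced above, one conceivable “purely adaptive” choice – merely one option among many, not singled out by the text – is the arithmetic-mean connector \(w_{1} = w_{2}\):

```python
def m_adaptive(p, q):
    # purely adaptive scaling m1(x) = m2(x) = (p(x) + q(x)) / 2
    return 0.5 * (p + q)
```

With this choice, the integrand compares the two density ratios relative to their pointwise midpoint.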

Let us start with the following important sub-setup:

3.3.1 \(\mathbf {m_{1}(x) = m_{2}(x) := m(x)}\), \(\mathbf {\mathbbm {m}_{3}(x) = r(x) \cdot m(x)\in [0,\infty ]}\) for Some (meas.) Function \(\mathbf {r: \mathscr {X} \rightarrow \mathbb {R}}\) Satisfying \(\mathbf {r(x) \in ]-\infty ,0[ \cup ]0,\infty [}\) for \({\varvec{\lambda -}}\)a.a. \(\mathbf {x \in \mathscr {X}}\)

As an interpretation, here the scaling functions are strongly coupled with the aggregation function; in order to avoid “case-overlapping”, we assume that the function \(r(\cdot )\) does not (explicitly) depend on the functions \(m(\cdot )\), \(p(\cdot )\) and \(q(\cdot )\) (i.e. it is not of the form \(r(\cdot )= h(\cdot , m(\cdot ), p(\cdot ), q(\cdot ))\)). From (1) one can deduce

$$\begin{aligned}& \textstyle \textstyle 0 \leqslant D^{c}_{\phi ,M,M,R\cdot M,\lambda }(P,Q) \nonumber \\& \textstyle \, : = \, {\overline{\int }}_{{\mathscr {X}}} \Big [ \phi \big ( { \frac{p(x)}{m(x)}}\big ) -\phi \big ( {\frac{q(x)}{m(x)}}\big ) - \phi _{+,c}^{\prime } \big ( {\frac{q(x)}{m(x)}}\big ) \cdot \big ( \frac{p(x)}{m(x)}-\frac{q(x)}{m(x)}\big ) \Big ] \, \cdot \, m(x) \, \cdot \, r(x) \, \mathrm {d}\lambda (x) \ , \ \ \ \ \end{aligned}$$
(36)

which for the discrete setup \((\mathscr {X},\lambda ) = (\mathscr {X}_{\#},\lambda _{\#})\) (recall \(\lambda _{\#}[\{x\}] =1\) for all \(x \in \mathscr {X}_{\#}\)) simplifies to

$$\begin{aligned}& \textstyle \textstyle 0 \leqslant D^{c}_{\phi ,M,M,R\cdot M,\lambda _{\#}}(P,Q) \nonumber \\& \textstyle = {\overline{\sum }}_{{x \in \mathscr {X}}} \Big [ \phi \big ( { \frac{p(x)}{m(x)}}\big ) -\phi \big ( {\frac{q(x)}{m(x)}}\big ) - \phi _{+,c}^{\prime } \big ( {\frac{q(x)}{m(x)}}\big ) \cdot \big ( \frac{p(x)}{m(x)}-\frac{q(x)}{m(x)}\big ) \Big ] \cdot m(x) \cdot r(x) \ . \qquad \ \ \end{aligned}$$
(37)
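A direct numerical sketch of (37) – interior points only, illustrative names, zeros of \(m(\cdot )\), \(p(\cdot )\), \(q(\cdot )\) not handled – is:

```python
import numpy as np

def scaled_bregman(p, q, m, r, phi, dphi_c):
    # (37): common scaling m and aggregation m3 = r * m on a finite state space
    s, t = p / m, q / m
    return float(np.sum((phi(s) - phi(t) - dphi_c(t) * (s - t)) * m * r))
```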

Remark 2

(a) If one has a “nonnegativity-setup” (NN1) in the sense that for \(\lambda \)-almost all \(x \in \mathscr {X}\) there holds \(\mathbbm {m}(x) \geqslant 0\), \(\mathbbm {r}(x)\geqslant 0\), \(\mathbbm {p}(x) \geqslant 0\), \(\mathbbm {q}(x) \geqslant 0\), then (36) (and hence also (37)) can be interpreted as scaled Bregman divergence \(B_{\phi }\big ( \mathfrak {P}, \mathfrak {Q}\,|\, \mathfrak {M} \big )\) between the two nonnegative measures \(\mathfrak {P}, \mathfrak {Q}\) (on \((\mathscr {X},\mathscr {F})\)) defined by \(\mathfrak {P}[\bullet ] := \mathfrak {P}^{\mathbbm {R} \cdot \lambda }[\bullet ] : = \int _{\bullet } \mathbbm {p}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x)\) and \(\mathfrak {Q}[\bullet ]:= \mathfrak {Q}^{\mathbbm {R} \cdot \lambda }[\bullet ] : = \int _{\bullet } \mathbbm {q}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x)\), with scaling by the nonnegative measure \(\mathfrak {M}[\bullet ] := \mathfrak {M}^{\mathbbm {R} \cdot \lambda }[\bullet ] : = \int _{\bullet } \mathbbm {m}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x)\).

(b) In a context of \(\mathbbm {r}(x) \equiv 1\) and “\(\lambda \)-probability-densities” \(\mathbbm {p}\) and \(\mathbbm {q}\) on a general state space \(\mathscr {X}\), the corresponding measures \(\mathbbm {P}[\bullet ] := \int _{\bullet } \mathbbm {p}(x) \, \mathrm {d}\lambda (x)\) and \(\mathbbm {Q}[\bullet ] := \int _{\bullet } \mathbbm {q}(x) \, \mathrm {d}\lambda (x)\) are probability measures (where \(\mathbbm {1}\) stands for the function with constant value 1). Accordingly, (36) (and hence also (37)) can be interpreted as a scaled Bregman divergence, which has been first defined in Stummer [81] and Stummer and Vajda [84]; see also Kisslinger and Stummer [35,36,37] for the “purely adaptive” case and indications on non-probability measures.

For instance, if Y is a random variable taking values in the discrete space \(\mathscr {X}_{\#}\), then (with a slight abuse of notation) \(\mathbbm {q}(\cdot )\) may be its probability mass function under a hypothetical/candidate law, and \(\mathbbm {p}(\cdot )\) the probability mass function of the corresponding data-derived “empirical distribution” of an N-size independent and identically distributed (i.i.d.) sample \(Y_1, \ldots , Y_N\) of Y, which is nothing but the probability distribution reflecting the underlying (normalized) histogram. Typically, for small respectively medium sample size N one gets \(\mathbbm {p}(x) = 0\) for some states \(x \in \mathscr {X}\) which are feasible but “not yet” observed; amongst other things, this explains why density-zeros play an important role especially in statistics and information theory. This concludes the current Remark 2.    \(\square \)
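The following illustrative snippet (assumed names, a uniform candidate law on six states) generates such an empirical probability mass function and typically exhibits the density-zeros just mentioned:

```python
import numpy as np

rng = np.random.default_rng(0)
support = np.arange(6)                             # discrete space X_#
q_hyp = np.full(6, 1.0 / 6.0)                      # hypothetical law (uniform)
sample = rng.choice(support, size=20, p=q_hyp)     # N = 20 i.i.d. draws of Y
p_emp = np.bincount(sample, minlength=6) / sample.size
# for such a small N, some feasible states typically remain unobserved,
# i.e. p_emp[x] == 0: exactly the density-zeros discussed above
```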

In the following, we illuminate two important special cases of the scaling (and aggregation-part) function \(m(\cdot )\), namely \(m(x) := 1\) and \(m(x):= q(x)\):

3.3.1.1    \(\mathbf {\mathbbm {m}_{1}(x) = \mathbbm {m}_{2}(x) := 1}\), \(\mathbf {\mathbbm {m}_{3}(x) = \mathbbm {r}(x)}\) for Some (Measurable) Function \(\mathbf {\mathbbm {r}: \mathscr {X} \rightarrow [0,\infty ]}\) Satisfying \(\mathbf {\mathbbm {r}(x) \in ]0,\infty [}\) for \({\varvec{\lambda -}}\)a.a. \(\mathbf {x \in \mathscr {X}}\)

Accordingly, (36) turns into

$$\begin{aligned}& \textstyle 0 \leqslant D^{c}_{\phi ,\mathbbm {1},\mathbbm {1},\mathbbm {R} \cdot \mathbbm {1},\lambda }(P,Q) \nonumber \\& \textstyle : = {\overline{\int }}_{{\mathscr {X}}} \Big [ \phi \big ( p(x)\big ) -\phi \big ( q(x) \big ) - \phi _{+,c}^{\prime } \big ( q(x) \big ) \cdot \big ( p(x) - q(x) \big ) \Big ] \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \ , \end{aligned}$$
(38)

which for the discrete setup \((\mathscr {X},\lambda ) = (\mathscr {X}_{\#},\lambda _{\#})\) becomes

$$\begin{aligned}& \textstyle 0 \leqslant D^{c}_{\phi ,\mathbbm {1},\mathbbm {1},\mathbbm {R}\cdot \mathbbm {1},\lambda _{\#}}(P,Q) \nonumber \\& : = {\overline{\sum }}_{{x \in \mathscr {X}}} \Big [ \phi \big ( p(x)\big ) -\phi \big ( q(x) \big ) - \phi _{+,c}^{\prime } \big ( q(x) \big ) \cdot \big ( p(x) - q(x) \big ) \Big ] \cdot \mathbbm {r}(x) \ \ \end{aligned}$$
(39)

Notice that for \(\mathbbm {r}(x) \equiv 1\), the divergences (38) and (39) are “consistent extensions” of the motivating pointwise dissimilarity \(d_{\phi }^{(6)}(\cdot ,\cdot )\) from Sect. 2. A special case of (38) is e.g. the rho-tau divergence (cf. Lemma 1 of Zhang and Naudts [95]).

Let us exemplarily illuminate the special case \(\phi = \phi _{\alpha }\) together with \(\mathbbm {p}(x) \geqslant 0\), \(\mathbbm {q}(x) \geqslant 0\) for \(\lambda \)-almost all \(x\in \mathscr {X}\), which by means of (9), (22), (28) turns (38) into the “explicit-boundary” version (of the corresponding “implicit-boundary-describing” \({\overline{\int }}\ldots \))

$$\begin{aligned}& \textstyle \textstyle 0 \leqslant D_{\phi _{\alpha },\mathbbm {1},\mathbbm {1},\mathbbm {R}\cdot \mathbbm {1},\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle = {\overline{\int }}_{{\mathscr {X}}} \frac{\mathbbm {r}(x)}{\alpha \cdot (\alpha -1)} \cdot \big [ \mathbbm {p}(x)^{\alpha } + (\alpha -1) \cdot \mathbbm {q}(x)^{\alpha } - \alpha \cdot \mathbbm {p}(x) \cdot \mathbbm {q}(x)^{\alpha -1} \big ] \, \mathrm {d}\lambda (x) \end{aligned}$$
(40)
$$\begin{aligned}& \textstyle = \int _{{\mathscr {X}}} \frac{\mathbbm {r}(x)}{\alpha \cdot (\alpha -1)} \cdot \big [ \mathbbm {p}(x)^{\alpha } + (\alpha -1) \cdot \mathbbm {q}(x)^{\alpha } - \alpha \cdot \mathbbm {p}(x) \cdot \mathbbm {q}(x)^{\alpha -1} \big ]\cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \, \int _{{\mathscr {X}}} \, \mathbbm {r}(x) \, \cdot \, \big [ \frac{\mathbbm {p}(x)^{\alpha }}{\alpha \, \cdot \, (\alpha -1)} \, \cdot \, \varvec{1}_{]1,\infty [}(\alpha ) \, + \, \infty \, \cdot \, \varvec{1}_{]-\infty ,0[ \cup ]0,1[}(\alpha ) \big ] \, \cdot \, \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \big ) \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \, \int _{{\mathscr {X}}} \, \mathbbm {r}(x) \, \cdot \, \big [ \frac{\mathbbm {q}(x)^{\alpha }}{\alpha } \, \cdot \, \varvec{1}_{]0,1[ \cup ]1,\infty [}(\alpha ) \, + \, \infty \, \cdot \, \varvec{1}_{]-\infty ,0[}(\alpha ) \big ] \, \cdot \, \varvec{1}_{]0,\infty [}\big (\mathbbm {q}(x) \big ) \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \, , \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \text { for } \alpha \in \mathbb {R}\backslash \{0,1\}, \end{aligned}$$
(41)
$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi _{1},\mathbbm {1},\mathbbm {1},\mathbbm {R}\cdot \mathbbm {1},\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle = \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \big [ \mathbbm {p}(x) \cdot \log \big (\frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}\big ) + \mathbbm {q}(x) - \mathbbm {p}(x) \big ] \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \infty \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {q}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \end{aligned}$$
(42)
$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi _{0},\mathbbm {1},\mathbbm {1},\mathbbm {R}\cdot \mathbbm {1},\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle = \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \Big [ - \log \big (\frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}\big ) + \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)} - 1 \Big ] \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \infty \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \infty \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {q}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \, , \end{aligned}$$
(43)

where we have employed (10), (11), (23), (24), (29), (30); notice that \(D_{\phi _{1},\mathbbm {1},\mathbbm {1},\mathbbm {R}\cdot \mathbbm {1},\lambda }(\mathbbm {P},\mathbbm {Q})\) is a generalized version of the Kullback–Leibler information divergence (resp. of the relative entropy). According to the above calculations, one should exclude \(\alpha \leqslant 0\) whenever \(\mathbbm {p}(x) =0\) for all x in some A with \(\lambda [A]>0\), respectively \(\alpha \leqslant 1\) whenever \(\mathbbm {q}(x) =0\) for all x in some \(\tilde{A}\) with \(\lambda [\tilde{A}]>0\) (a refined alternative for \(\alpha =1\) is given in Sect. 3.3.1.2 below). As far as the splitting of the first integral in e.g. (42) resp. (43) is concerned, notice that the integral \((\mathfrak {Q}^{\mathbbm {R} \cdot \lambda } - \mathfrak {P}^{\mathbbm {R} \cdot \lambda })[\mathscr {X}] := \int _{{\mathscr {X}}} \big [\mathbbm {q}(x) - \mathbbm {p}(x) \big ] \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x)\) resp. \(\int _{{\mathscr {X}}} \Big [ \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)} - 1 \Big ] \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x)\) may be finite even in cases where \(\mathfrak {P}^{\mathbbm {R} \cdot \lambda }[\mathscr {X}] = \int _{{\mathscr {X}}} \mathbbm {p}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) = \infty \) and \(\mathfrak {Q}^{\mathbbm {R} \cdot \lambda }[\mathscr {X}] = \int _{{\mathscr {X}}} \mathbbm {q}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) = \infty \) (especially in the case of an unbounded data space (e.g. \(\mathscr {X}=\mathbb {R}\)) when an additive constant is involved and \(\mathbbm {r}(\cdot )\) is bounded from above); furthermore, there are situations where \(\mathfrak {P}^{\mathbbm {R}\cdot \lambda }[\mathscr {X}] = \mathfrak {Q}^{\mathbbm {R}\cdot \lambda }[\mathscr {X}] < \infty \) and thus \((\mathfrak {P}^{\mathbbm {R} \cdot \lambda } - \mathfrak {Q}^{\mathbbm {R} \cdot \lambda })[\mathscr {X}] =0\) but \(\int _{{\mathscr {X}}} \Big [ \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)} - 1 \Big ] \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) = \infty \). For \(\alpha =2\), we obtain from (41) and (15) to (16)

$$\begin{aligned}& \textstyle \textstyle 0 \leqslant D_{\phi _{2},\mathbbm {1},\mathbbm {1},\mathbbm {R}\cdot \mathbbm {1},\lambda }(\mathbbm {P},\mathbbm {Q}) = \int _{{\mathscr {X}}} \frac{\mathbbm {r}(x)}{2} \cdot \big [ \mathbbm {p}(x) - \mathbbm {q}(x) \big ]^2 \, \mathrm {d}\lambda (x) \ , \end{aligned}$$
(44)

where we can exceptionally drop the non-negativity constraints \(\mathbbm {p}(x) \geqslant 0\), \(\mathbbm {q}(x) \geqslant 0\). As for interpretation, (44) is nothing but half of the \(\mathbbm {r}(\cdot )\)-weighted squared \(L^2(\lambda )\)-distance between \(\mathbbm {p}(\cdot )\) and \(\mathbbm {q}(\cdot )\).

In the special sub-setup of \(\mathbbm {r}(x) \equiv 1\) and “\(\lambda \)-probability-densities” \(\mathbbm {p}\), \(\mathbbm {q}\) on data space \(\mathscr {X}\) (cf. Remark 2(b)), we can deduce from (41)–(43) the divergences

$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi _{\alpha },\mathbbm {1},\mathbbm {1},\mathbbm {1},\lambda }(\mathbbm {P},\mathbbm {Q}) = \int _{{\mathscr {X}}} \frac{1}{\alpha \cdot (\alpha -1)} \cdot \big [ \mathbbm {p}(x)^{\alpha } + (\alpha -1) \cdot \mathbbm {q}(x)^{\alpha } - \alpha \cdot \mathbbm {p}(x) \cdot \mathbbm {q}(x)^{\alpha -1} \big ] \, \mathrm {d}\lambda (x) , \quad \alpha \in \mathbb {R}\backslash \{0,1\} , \end{aligned}$$
(45)

valid in this form whenever \(\mathbbm {p}(x) \cdot \mathbbm {q}(x) > 0\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\) (otherwise the boundary terms of (41)–(43), with \(\mathbbm {r}(x) \equiv 1\), have to be added), together with the corresponding \(\alpha = 1\) and \(\alpha = 0\) versions,

which for the choice \(\alpha >0\) can be interpreted as “order\(-\alpha \)” density-power divergences DPD of Basu et al. [10] between the two corresponding probability measures \(\mathbbm {P}\) and \(\mathbbm {Q}\); for their statistical applications see e.g. Basu et al. [12], Ghosh and Basu [30, 31] and the references therein, and for general \(\alpha \in \mathbb {R}\) see e.g. Stummer and Vajda [84]. In particular, the divergence in (45) corresponding to the case \(\alpha =1\) is called the “Kullback–Leibler information divergence” between \(\mathbbm {P}\) and \(\mathbbm {Q}\), and is also known under the name “relative entropy”. For \(\alpha =2\), we derive from (44) with \(\mathbbm {r}(x) \equiv 1\) the divergence \(D_{\phi _{2},\mathbbm {1},\mathbbm {1},\mathbbm {1},\lambda }(\mathbbm {P},\mathbbm {Q}) = \frac{1}{2} \int _{{\mathscr {X}}} \big [ \mathbbm {p}(x) - \mathbbm {q}(x) \big ]^{2} \, \mathrm {d}\lambda (x)\), which is nothing but half of the squared \(L^2(\lambda )\)-distance between the two “\(\lambda \)-probability-densities” \(\mathbbm {p}\) and \(\mathbbm {q}\).
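For strictly positive probability mass functions, (45) admits the following sketch (boundary cases excluded, names illustrative):

```python
import numpy as np

def dpd(p, q, alpha):
    # order-alpha density power divergence, cf. (45); requires p > 0, q > 0
    # elementwise and alpha not in {0, 1}
    return float(np.sum(p**alpha + (alpha - 1.0) * q**alpha
                        - alpha * p * q**(alpha - 1.0))) / (alpha * (alpha - 1.0))
```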

For the special discrete setup \((\mathscr {X},\lambda ) = (\mathscr {X}_{\#},\lambda _{\#})\) (recall \(\lambda _{\#}[\{x\}] =1\) for all \(x \in \mathscr {X}_{\#}\)), the divergences (41)–(44) simplify to

$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi _{\alpha },\mathbbm {1},\mathbbm {1},\mathbbm {R}\cdot \mathbbm {1},\lambda _{\#}}(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle = \sum _{{x \in \mathscr {X}}} \frac{\mathbbm {r}(x)}{\alpha \cdot (\alpha -1)} \cdot \big [ \big (\mathbbm {p}(x)\big )^{\alpha } + (\alpha -1) \cdot \big (\mathbbm {q}(x)\big )^{\alpha } - \alpha \cdot \mathbbm {p}(x) \cdot \big (\mathbbm {q}(x)\big )^{\alpha -1} \big ] \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \nonumber \\& \textstyle + \, \sum _{{x \in \mathscr {X}}} \, \mathbbm {r}(x) \, \cdot \, \big [ \frac{\mathbbm {p}(x)^{\alpha }}{\alpha \cdot (\alpha -1)} \, \cdot \, \varvec{1}_{]1,\infty [}(\alpha ) \, + \, \infty \, \cdot \, \varvec{1}_{]-\infty ,0[ \cup ]0,1[}(\alpha ) \big ] \, \cdot \, \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \big ) \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \nonumber \\& \textstyle + \, \sum _{{x \in \mathscr {X}}} \, \mathbbm {r}(x) \, \cdot \, \big [ \frac{\mathbbm {q}(x)^{\alpha }}{\alpha } \cdot \varvec{1}_{]0,1[ \cup ]1,\infty [}(\alpha ) \, + \, \infty \, \cdot \, \varvec{1}_{]-\infty ,0[}(\alpha ) \big ] \, \cdot \, \varvec{1}_{]0,\infty [}\big (\mathbbm {q}(x) \big ) \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, , \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \text { for } \alpha \in \mathbb {R}\backslash \{0,1\}, \\[-0.2cm]& \textstyle 0 \leqslant D_{\phi _{1},\mathbbm {1},\mathbbm {1},\mathbbm {R}\cdot \mathbbm {1},\lambda _{\#}}(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle = \sum _{{x \in \mathscr {X}}} \mathbbm {r}(x) \cdot \big [ \mathbbm {p}(x) \cdot \log \big (\frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}\big ) + \mathbbm {q}(x) - \mathbbm {p}(x) \big ] \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x)\big ) \nonumber \\& \textstyle + \sum _{{x \in \mathscr {X}}} \mathbbm {r}(x) \cdot \infty \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \nonumber \\& \textstyle + \sum _{{x \in \mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {q}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, , \nonumber \\& \textstyle 0 \leqslant D_{\phi _{0},\mathbbm {1},\mathbbm {1},\mathbbm {R}\cdot \mathbbm {1},\lambda _{\#}}(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle = \sum _{{x \in \mathscr {X}}} \mathbbm {r}(x) \cdot \big [ - \log \big (\frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}\big ) + \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)} - 1 \big ] \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x)\big ) \nonumber \\& \textstyle + \sum _{{x \in \mathscr {X}}} \mathbbm {r}(x) \cdot \infty \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \nonumber \\& \textstyle + \sum _{{x \in \mathscr {X}}} \mathbbm {r}(x) \cdot \infty \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {q}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, , \nonumber \\& \textstyle 0 \leqslant D_{\phi _{2},\mathbbm {1},\mathbbm {1},\mathbbm {R} \cdot \mathbbm {1},\lambda _{\#}}(\mathbbm {P},\mathbbm {Q}) = \sum _{{x \in \mathscr {X}}} \frac{\mathbbm {r}(x)}{2} \cdot \big [ \mathbbm {p}(x) - \mathbbm {q}(x) \big ]^2 \, . \nonumber \end{aligned}$$
(46)

Hence, as above, one should exclude \(\alpha \leqslant 0\) whenever \(\mathbbm {p}(x) =0\) for all x in some A with \(\lambda _{\#}[A]>0\), respectively \(\alpha \leqslant 1\) whenever \(\mathbbm {q}(x) =0\) for all x in some \(\tilde{A}\) with \(\lambda _{\#}[\tilde{A}]>0\) (a refined alternative for \(\alpha =1\) is given in Sect. 3.3.1.2 below).

In particular, take the probability context of Remark 2(b), with discrete random variable Y, hypothetical probability mass function \(\mathbbm {q}(\cdot )\), and data-derived probability mass function (relative frequency) \(\mathbbm {p}(\cdot )\) with sample size N. For \(\mathbbm {r}(x)\equiv 1\), the corresponding sample-size-weighted divergences (for \(\alpha \in \mathbb {R}\)) can be used as goodness-of-fit test statistics; see e.g. Kisslinger and Stummer [37] for their limit behaviour as the sample size N tends to infinity. A small numerical illustration is given below.
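As the numerical illustration just announced, the two preceding snippets can be combined; the weighting by the plain sample size N is purely illustrative here, the exact normalization required for a limit law being given in the cited references:

```python
# divergence between the empirical pmf p_emp and the hypothetical pmf q_hyp
p_pos = np.maximum(p_emp, 1e-12)   # crude guard against density-zeros
                                   # (unneeded for alpha = 2, cf. (44))
stat = sample.size * dpd(p_pos, q_hyp, alpha=2.0)
```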

3.3.1.2     \(\mathbf {m_{1}(x) = m_{2}(x) := q(x)}\), \(\mathbf {\mathbbm {m}_{3}(x) = r(x) \cdot q(x) \in [0, \infty ]}\) for Some (meas.) Function \(\mathbf {r: \mathscr {X} \rightarrow \mathbb {R}}\) Satisfying \(\mathbf {r(x) \in ]-\infty ,0[ \cup ]0,\infty [}\) for \({\varvec{\lambda -}}\)a.a. \(\mathbf {x \in \mathscr {X}}\)

In such a set-up, the divergence (36) becomes

$$\begin{aligned}& \textstyle \textstyle 0 \leqslant D^{c}_{\phi ,Q,Q,R\cdot Q,\lambda }(P,Q) \nonumber \\& \textstyle = {\overline{\int }}_{{\mathscr {X}}} \big [ \phi \big ( { \frac{p(x)}{q(x)}}\big ) -\phi \big ( 1 \big ) - \phi _{+,c}^{\prime } \big ( 1 \big ) \cdot \big ( \frac{p(x)}{q(x)}- 1 \big ) \big ] \cdot q(x) \cdot r(x) \, \mathrm {d}\lambda (x) \end{aligned}$$
(47)
$$\begin{aligned}& \textstyle = {\overline{\int }}_{{\mathscr {X}}} \big [ q(x) \cdot \phi \big ( { \frac{p(x)}{q(x)}}\big ) - q(x) \cdot \phi \big ( 1 \big ) - \phi _{+,c}^{\prime } \big ( 1 \big ) \cdot \big ( p(x) - q(x) \big ) \big ] \cdot r(x) \, \mathrm {d}\lambda (x) \, , \qquad \ \end{aligned}$$
(48)

where in accordance with the descriptions right after (1) we require that \(\phi : ]a,b[ \rightarrow \mathbb {R}\) is convex and strictly convex at \(1 \in ]a,b[\) and incorporate the zeros of \(p(\cdot ),q(\cdot ),r(\cdot )\) by the appropriate limits and conventions. In the following, we demonstrate this in a non-negativity set-up where for \(\lambda \)-almost all \(x \in \mathscr {X}\) one has \(\mathbbm {r}(x) \in ]0,\infty [\) as well as \(\mathbbm {p}(x) \in [0,\infty [\), \(\mathbbm {q}(x) \in [0,\infty [\), and hence \(E=]a,b[=]0,\infty [\). In order to achieve a reflexivity result in the spirit of Theorem 4, we have to check for – respectively analogously adapt most of – the points in Assumption 2: to begin with, the weight \(w(x,s,t)\) evaluated at \(s:= \mathbbm {p}(x)\), \(t:= \mathbbm {q}(x)\) has to be substituted/replaced by \(\widetilde{w}(x,\widetilde{t}) := \mathbbm {r}(x) \cdot \widetilde{t}\) evaluated at \(\widetilde{t} = \mathbbm {q}(x)\), and the dissimilarity \(\psi _{\phi ,c}(s,t)\) has to be substituted/replaced by \(\widetilde{\widetilde{\psi }}_{\phi ,c}(\widetilde{s},\widetilde{t}) := \psi _{\phi ,c}\big (\frac{\widetilde{s}}{{\widetilde{t}}^{}},1\big )\) with the plug-in \(\widetilde{s} = \mathbbm {p}(x)\). Putting things together, instead of the integrand-generating term \(w(x,s,t) \cdot \psi _{\phi ,c}(s,t)\) we have to inspect the boundary behaviour of \(\widetilde{w}(x,\widetilde{t}) \cdot \widetilde{\widetilde{\psi }}_{\phi ,c}(\widetilde{s},\widetilde{t})\), explicitly given (with a slight abuse of notation) by the function \(\widetilde{\psi }_{\phi ,c}: ]0,\infty [^3 \rightarrow [0,\infty [\) in

$$\begin{aligned}& \textstyle \widetilde{\psi }_{\phi ,c}\big (r,\widetilde{s},\widetilde{t}\big ) := r \cdot \widetilde{t} \cdot \psi _{\phi ,c}\big (\frac{\widetilde{s}}{{\widetilde{t}}^{}},1\big ) = r \cdot \widetilde{t} \cdot \big [ \phi \big (\frac{\widetilde{s}}{{\widetilde{t}}^{}}\big ) - \phi (1) - \phi _{+,c}^{\prime }(1) \cdot \big (\frac{\widetilde{s}}{\widetilde{t}^{}}-1 \big ) \big ] \ \nonumber \\& \textstyle = r \cdot \widetilde{t} \cdot \big [ \phi \big (\frac{\widetilde{s}\cdot r}{{\widetilde{t} \cdot r}^{}}\big ) - \phi (1) - \phi _{+,c}^{\prime }(1) \cdot \big (\frac{\widetilde{s} \cdot r}{{\widetilde{t} \cdot r}^{}}-1 \big ) \big ] \ = r \cdot \widetilde{t} \cdot \psi _{\phi ,c}\big (\frac{\widetilde{s} \cdot r}{{\widetilde{t} \cdot r}^{}},1\big ) \, . \qquad \ \ \end{aligned}$$
(49)

Since the analogue of the general right-hand-derivative assumption \(t \in \mathscr {R}\big (\frac{Q}{M_{2}}\big )\) is \(\frac{\widetilde{s}}{\widetilde{t}} =1\), we require that the convex function \(\phi :]0,\infty [ \rightarrow ]-\infty ,\infty [\) is strictly convex (only) at 1, in conformity with Assumption 2(a) (which is also employed in Assumption 3); for the sake of brevity we use the short-hand notation 2(a) etc. in the following discussion. We shall not need 2(b) to 2(d) in the prevailing context, so that the above-mentioned generator \(\phi _{TV}(t):= |t-1|\) is allowed for achieving reflexivity (for reasons which will become clear in the proof of Theorem 5 in the appendix). The analogue of 2(e) is \(\mathbbm {r}(x) \cdot \widetilde{t} < \infty \), which is always (almost surely) automatically satisfied (a.a.sat.), whereas 2(f) converts to “\(\mathbbm {r}(x) \cdot \widetilde{t} > 0\)  for all \(\widetilde{s} \ne \widetilde{t}\)”, which is also a.a.sat. except for the case \(\widetilde{t} =0\), which will be incorporated below in combination with the \(\psi _{\phi ,c}\)-multiplication (cf. (50)). For the derivation of the analogue of 2(k) we observe that for fixed \(r >0\), \(\widetilde{s} >0\) the function \(\widetilde{t} \rightarrow \widetilde{\psi }_{\phi ,c}\big (r,\widetilde{s},\widetilde{t}\,\big )\) is (the r-fold of) the perspective function (at \(\widetilde{s}\)) of the convex function \(\psi _{\phi ,c}\big ( \cdot ,1\big )\) and thus convex with existing limit

$$\begin{aligned}& \textstyle \ell i_{1} := r \cdot 0 \cdot \psi _{\phi ,c}\big (\frac{\widetilde{s}}{0},1\big ) : = \lim _{\widetilde{t}\rightarrow 0} \widetilde{\psi }_{\phi ,c}\big (r,\widetilde{s},\widetilde{t}\,\big ) \nonumber \\& \textstyle = - r \cdot \widetilde{s} \cdot \phi _{+,c}^{\prime }(1) + r \cdot \widetilde{s} \cdot \lim _{\widetilde{t}\rightarrow 0} \big [ \frac{\widetilde{t}}{\widetilde{s}} \cdot \phi \big (\frac{\widetilde{s}}{\widetilde{t}}\big ) \big ] = r \cdot \widetilde{s} \cdot (\phi ^{*}(0) -\phi _{+,c}^{\prime }(1)) \geqslant 0 \, , \qquad \ \ \end{aligned}$$
(50)

where \(\phi ^{*}(0) := \lim _{u\rightarrow 0} u \cdot \phi \big (\frac{1}{u}\big ) = \lim _{v\rightarrow \infty } \frac{\phi (v)}{v}\) exists but may be infinite (recall that \(\phi _{+,c}^{\prime }(1)\) is finite). Notice that in contrast to 2(k) we need not assume \(\ell i_{1} >0\) (and thus do not exclude \(\phi _{TV}\)). To convert 2(i), we employ the fact that for fixed \(r >0\), \(\widetilde{t} >0\) the function \(\widetilde{s} \rightarrow \widetilde{\psi }_{\phi ,c}\big (r,\widetilde{s},\widetilde{t}\big )\) is convex with existing limit

$$\begin{aligned}& r \cdot \widetilde{t} \cdot \psi _{\phi ,c}\big (\frac{0}{\widetilde{t}},1\big ) : = \lim _{\widetilde{s}\rightarrow 0} \widetilde{\psi }_{\phi ,c}\big (r,\widetilde{s},\widetilde{t}\big ) = r \cdot \widetilde{t} \cdot (\phi (0) + \phi _{+,c}^{\prime }(1) - \phi (1)) > 0 \, , \qquad \ \nonumber \end{aligned}$$

where \(\phi (0) := \lim _{u\rightarrow 0} \phi (u)\) exists but may be infinite. To achieve the analogue of 2(g), let us first remark that for fixed \(r >0\) the function \((\widetilde{s},\widetilde{t}) \rightarrow \widetilde{\psi }_{\phi ,c}\big (r,\widetilde{s},\widetilde{t}\big )\) may not be continuous at \((\widetilde{s},\widetilde{t}) = (0,0)\), but due to the very nature of a divergence we make the 2(g)-conform convention of setting

$$\begin{aligned}& \textstyle r \cdot 0 \cdot \psi _{\phi ,c}\big (\frac{0}{0},1\big ) : = \widetilde{\psi }_{\phi ,c}\big (r,0,0\big ) := 0 \, \nonumber \end{aligned}$$

(notice that e.g. the power function \(\phi _{-1}\) of (5) with index \(\alpha =-1\) obeys \(\lim _{\widetilde{t} \rightarrow 0} \widetilde{\psi }_{\phi _{-1}}\big (r,\widetilde{t},\widetilde{t}\big ) = 0 \ne \frac{r}{2} = \lim _{\widetilde{t} \rightarrow 0} \widetilde{\psi }_{\phi _{-1}}\big (r,\widetilde{t}^2,\widetilde{t}\big )\)). The analogues of the remaining Assumptions 2(h),(j),(\(\ell \)),(m),(n) are (almost surely) obsolete because of our basic (almost surely) finiteness requirements. Summing up, with the above-mentioned limits and conventions we write (47) explicitly as

$$\begin{aligned}& \textstyle 0 \leqslant D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle = \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \big [ \mathbbm {q}(x) \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) - \mathbbm {q}(x) \cdot \phi \big ( 1 \big ) - \phi _{+,c}^{\prime } \big ( 1 \big ) \cdot \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \big ] \, \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi ^{*}(0) -\phi _{+,c}^{\prime }(1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {p}(x) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi (0) + \phi _{+,c}^{\prime }(1) - \phi (1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {q}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle = \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \Big [ \mathbbm {q}(x) \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) - \mathbbm {q}(x) \cdot \phi \big ( 1 \big ) - \phi _{+,c}^{\prime } \big ( 1 \big ) \cdot \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \Big ] \, \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi ^{*}(0) -\phi _{+,c}^{\prime }(1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {p}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi (0) + \phi _{+,c}^{\prime }(1) - \phi (1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \, . \end{aligned}$$
(51)
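
To make the boundary constants \(\phi ^{*}(0)\) and \(\phi (0)\) entering (50) and (51) concrete, the following is a minimal symbolic sketch of ours (not part of the original derivation), assuming SymPy and the power generators \(\phi _{\alpha }\) of (5):

```python
# Illustrative SymPy check (our sketch) of the boundary constants in (50)-(51):
# phi*(0) = lim_{v->oo} phi(v)/v  and  phi(0) = lim_{u->0+} phi(u),
# here for the power generators phi_alpha of (5); both may be infinite.
import sympy as sp

v, u = sp.symbols('v u', positive=True)

def phi_alpha(t, alpha):
    # power generator of (5), for alpha not in {0, 1}
    return (t**alpha - 1) / (alpha * (alpha - 1)) - (t - 1) / (alpha - 1)

for alpha in [sp.Rational(1, 2), 2, -1]:
    phistar0 = sp.limit(phi_alpha(v, alpha) / v, v, sp.oo)  # phi*(0)
    phi0 = sp.limit(phi_alpha(u, alpha), u, 0, '+')         # phi(0)
    print(f"alpha = {alpha}: phi*(0) = {phistar0}, phi(0) = {phi0}")

# Output: alpha = 1/2 gives phi*(0) = 2 = 1/(1-alpha) and phi(0) = 2 = 1/alpha,
# alpha = 2 gives phi*(0) = oo, and alpha = -1 gives phi(0) = oo.
```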

In case of \(\mathfrak {Q}^{\mathbbm {R}\cdot \lambda }[\mathscr {X}] := \int _{{\mathscr {X}}} \mathbbm {q}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) < \infty \), the divergence (51) becomes

$$\begin{aligned}& \textstyle \textstyle 0 \leqslant D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle = \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \Big [ \mathbbm {q}(x) \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) - \phi _{+,c}^{\prime } \big ( 1 \big ) \cdot \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \Big ] \, \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi ^{*}(0) -\phi _{+,c}^{\prime }(1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {p}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi (0) + \phi _{+,c}^{\prime }(1) \big ] \, \cdot \, \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \, - \phi (1) \, \cdot \, \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \, \mathrm {d}\lambda (x) \, . \nonumber \\ \end{aligned}$$
(52)

Moreover, in case of \(\phi \big ( 1 \big ) = 0\) and \((\mathfrak {P}^{\mathbbm {R}\cdot \lambda }-\mathfrak {Q}^{\mathbbm {R}\cdot \lambda })[\mathscr {X}] = \int _{{\mathscr {X}}} \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \in ]-\infty , \infty [\) (but not necessarily \(\mathfrak {P}^{\mathbbm {R}\cdot \lambda }[\mathscr {X}] = \int _{{\mathscr {X}}} \mathbbm {p}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) < \infty \), \(\mathfrak {Q}^{\mathbbm {R}\cdot \lambda }[\mathscr {X}] = \int _{{\mathscr {X}}} \mathbbm {q}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) < \infty \)), the divergence (51) turns into

$$\begin{aligned}& \textstyle 0 \leqslant D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) = \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \phi ^{*}(0) \cdot \int _{{\mathscr {X}}} \, \mathbbm {r}(x) \, \cdot \, \mathbbm {p}(x) \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) + \phi (0) \cdot \int _{{\mathscr {X}}} \, \mathbbm {r}(x) \, \cdot \, \mathbbm {q}(x) \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle - \phi _{+,c}^{\prime } \big ( 1 \big ) \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \, . \end{aligned}$$
(53)

Let us remark that (53) can be interpreted as \(\phi \)-divergence \(D^{c}_{\phi }\big ( \mu , \nu \big )\) between the two nonnegative measures \(\mu , \nu \) (on \((\mathscr {X},\mathscr {F})\)) (cf. Stummer and Vajda [83]), where \(\mu [\bullet ] := \mathfrak {P}^{\mathbbm {R}\cdot \lambda }[\bullet ]\) and \(\nu [\bullet ] := \mathfrak {Q}^{\mathbbm {R}\cdot \lambda }[\bullet ]\). In the following, we briefly discuss two important sub-cases. First, in the “\(\lambda \)-probability-densities” context of Remark 2(b) one has for general \(\mathscr {X}\) the manifestation \(\int _{\mathscr {X}} \mathbbm {p}(x) \, \mathrm {d}\lambda (x) = \int _{\mathscr {X}} \mathbbm {q}(x) \, \mathrm {d}\lambda (x) = 1\), and under the constraint \(\phi (1)=0\) the corresponding divergence turns out to be the (\(\mathbbm {r}\)-)“local \(\phi \)-divergence” of Avlogiaris et al. [6, 7]; in case of \(\mathbbm {r}(x) \equiv 1\) this reduces – due to the fact \(\int _{\mathscr {X}} \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) = 0\) – to the classical Csiszar-Ali-Silvey \(\phi \)-divergence CASD ([4, 27], see also e.g. Liese and Vajda [41], Vajda [89])

$$\begin{aligned}& \textstyle 0 \leqslant D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) = \int _{{\mathscr {X}}} \mathbbm {q}(x) \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \phi ^{*}(0) \cdot \int _{{\mathscr {X}}} \mathbbm {p}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) + \phi (0) \cdot \int _{{\mathscr {X}}} \mathbbm {q}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \, ; \end{aligned}$$
(54)

if \(\phi (1) \ne 0\) then one has to additionally subtract \(\phi (1)\) (cf. the corresponding special case of (52)). In particular, for the special sub-setup where for \(\lambda \)-almost all \(x \in \mathscr {X}\) there holds \(\mathbbm {p}(x) > 0\), \(\mathbbm {q}(x) > 0\), \(\mathbbm {r}(x) \equiv 1\), \(\phi (1) = 0\), one ends up with the reduced Csiszar-Ali-Silvey divergence

$$\begin{aligned}& \textstyle 0 \leqslant \int _{{\mathscr {X}}} \mathbbm {q}(x) \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) \, \mathrm {d}\lambda (x) \nonumber \end{aligned}$$

which can be interpreted as a “consistent extension” of the motivating pointwise dissimilarity \(d_{\phi }^{(7)}(\cdot ,\cdot )\) from the introductory Sect. 2; notice the fundamental structural difference to the divergence (38) which reflects \(d_{\phi }^{(6)}(\cdot ,\cdot )\). For comprehensive treatments of statistical applications of CASD, the reader is referred to Liese and Vajda [41], Read and Cressie [72], Vajda [89], Pardo [68], Liese and Miescke [40], Basu et al. [13].

Returning to the general divergence setup (51), we derive the reflexivity result (to be proved in the appendix):

Theorem 5

Let \(c \in [0,1]\), \(\mathbbm {r}(x) \in ]0,\infty [\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\), \(\mathscr {R}\big (\frac{\mathbbm {P}}{\mathbbm {Q}}\big ) \cup \{1\} \subset [a,b]\), and \(\phi \in \varPhi (]a,b[)\) be strictly convex at \(t=1\). Moreover, suppose that

$$\begin{aligned}& \textstyle \int _{{\mathscr {X}}} \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \ = \ 0 \end{aligned}$$
(55)

(but not necessarily \(\int _{{\mathscr {X}}} \mathbbm {p}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) < \infty \), \(\int _{{\mathscr {X}}} \mathbbm {q}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) < \infty \)). Then:  (1) \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q})\geqslant 0\). Depending on the concrete situation,  \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q})\) may take an infinite value.

$$\begin{aligned}&\textstyle \textit{(2)} \ \ D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) = 0 \ \ \text {if and only if} \ \ \mathbbm {p}(x) = \mathbbm {q}(x) \ \text {for}\, \lambda \text {-a.a.}\, x \in \mathscr {X}. \quad \ \end{aligned}$$
(56)

Remark 3

(a) In the context of non-negative measures, the special case \(c=1\) – together with \(\int _{{\mathscr {X}}} \mathbbm {p}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) < \infty \), \(\int _{{\mathscr {X}}} \mathbbm {q}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) < \infty \) – of Theorem 5 was first achieved by Stummer and Vajda [83].

(b) Assumption (55) is always automatically satisfied if one has coincidence of finite total masses in the sense of \(\mathfrak {P}^{\mathbbm {R} \cdot \lambda }[\mathscr {X}] = \int _{\mathscr {X}} \mathbbm {p}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) = \int _{\mathscr {X}} \mathbbm {q}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) =\mathfrak {Q}^{\mathbbm {R} \cdot \lambda }[\mathscr {X}] < \infty \). For \(\mathbbm {r}(x) \equiv 1\) this is always satisfied for \(\lambda \)-probability densities \(\mathbbm {p}(\cdot )\), \(\mathbbm {q}(\cdot )\), since \(\int _{\mathscr {X}} \mathbbm {p}(x) \, \mathrm {d}\lambda (x) = \int _{\mathscr {X}} \mathbbm {q}(x) \, \mathrm {d}\lambda (x) = 1\).

(c) Notice that in contrast to Theorem 4, the generator-concerning Assumptions 2(b)–(d) are replaced by the “model-concerning” constraint (55). This opens the gate for the use of the generators \(\phi _{ie}\) and \(\phi _{TV}\) for cases where (55) is satisfied. For the latter, we obtain with \(c = \frac{1}{2}\) explicitly from (49) and (33)

$$ \widetilde{\psi }_{\phi _{TV},\frac{1}{2}}\big (r,\widetilde{s},\widetilde{t}\big ) := r \cdot \widetilde{t} \cdot \psi _{\phi _{TV},\frac{1}{2}}\big (\frac{\widetilde{s}}{\widetilde{t}},1\big ) = r \cdot \widetilde{t} \cdot \big | \frac{\widetilde{s}}{\widetilde{t}} -1 \big | = r \cdot \big | \widetilde{s} - \widetilde{t} \big | , $$

and hence from (51) together with \(\phi _{TV}(1)=0\), \(\phi _{TV}(0) = 1\) (cf. (31)), \(\phi _{TV,+,\frac{1}{2}}^{\prime }(1) =0\) (cf. (32)), \(\phi _{TV}^{*}(0) = \lim _{s\rightarrow \infty } \frac{1}{s} \cdot \psi _{\phi _{TV},\frac{1}{2}}(s,1) = 1\) (cf. (34)) we get

$$\begin{aligned}& \textstyle 0 \leqslant D^{1/2}_{\phi _{TV},\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) = \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \big | \mathbbm {p}(x) - \mathbbm {q}(x) \big | \, \mathrm {d}\lambda (x) \, \end{aligned}$$
(57)

which is nothing but the (possibly infinite) \(\mathbbm {r}(\cdot )\)-weighted \(L_{1}\)-distance between the functions \(x \rightarrow \mathbbm {p}(x)\) and \(x \rightarrow \mathbbm {q}(x)\).

(d) In the light of (52), Theorem 4 (adapted to the current context) and Theorem 5, let us indicate that if one wants to use \(\varXi := \int _{{\mathscr {X}}} \mathbbm {q}(x) \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \) (with appropriate zero-conventions) as a divergence, then one should either employ generators \(\phi \) satisfying \(\phi (1)=\phi _{+,c}^{\prime }(1)=0\), or employ models fulfilling the assumption (55) together with generators \(\phi \) satisfying \(\phi (1)=0\). On the other hand, if this integral \(\varXi \) appears in your application context “naturally”, then one should be aware that \(\varXi \) may become negative depending on the involved set-up; for a counter-example, see Stummer and Vajda [83]. This concludes Remark 3.

As an important example, we illuminate the special case \(\phi = \phi _{\alpha }\) with \(\alpha \in \mathbb {R}\backslash \{0,1\}\) (cf. (5)) under the constraint \((\mathfrak {P}^{\mathbbm {R}\cdot \lambda }-\mathfrak {Q}^{\mathbbm {R}\cdot \lambda })[\mathscr {X}] = \int _{{\mathscr {X}}} \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \in ]-\infty , \infty [\). Accordingly, the “implicit-boundary-describing” divergence (48) resp. the corresponding “explicit-boundary” version (53) turn into the generalized power divergences of order \(\alpha \) (cf. Stummer and Vajda [83] for \(\mathbbm {r}(x) \equiv 1\))Footnote 11

$$\begin{aligned} \textstyle 0 \leqslant D_{\phi _{\alpha },\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\ \textstyle = {\overline{\int }}_{{\mathscr {X}}} \frac{1}{\alpha \cdot (\alpha -1)} \cdot \Big [ \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big )^{\alpha } - \alpha \cdot \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)} + \alpha -1 \Big ] \cdot \mathbbm {q}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \end{aligned}$$
(58)
$$\begin{aligned}&\textstyle = \frac{1}{\alpha \cdot (\alpha -1)} \, \cdot \int _{{\mathscr {X}}} \, \mathbbm {r}(x) \, \cdot \, \mathbbm {q}(x) \, \cdot \, \Big [ \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big )^{\alpha } \, - \, \alpha \, \cdot \, \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)} \, + \, \alpha -1 \Big ] \cdot \, \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x)\, \cdot \, \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\&\textstyle + \phi _{\alpha }^{*}(0) \, \cdot \, \int _{{\mathscr {X}}} \, \mathbbm {r}(x) \, \cdot \, \mathbbm {p}(x) \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \, + \, \phi _{\alpha }(0) \, \cdot \, \int _{{\mathscr {X}}} \, \mathbbm {r}(x) \, \cdot \, \mathbbm {q}(x) \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\&\textstyle = \frac{1}{\alpha \cdot (\alpha -1)} \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \Big [ \mathbbm {p}(x)^{\alpha } \cdot \mathbbm {q}(x)^{1-\alpha } - \mathbbm {q}(x) \Big ] \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\&\textstyle + \frac{1}{1-\alpha } \cdot \int _{{\mathscr {X}}} \, \, \mathbbm {r}(x) \, \cdot \, (\mathbbm {p}(x) \, - \, \mathbbm {q}(x)) \, \mathrm {d}\lambda (x) \, + \, \infty \, \cdot \, \varvec{1}_{]1,\infty [}(\alpha ) \, \cdot \, \int _{{\mathscr {X}}} \, \, \mathbbm {r}(x) \, \cdot \, \mathbbm {p}(x) \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\&\textstyle + \big (\frac{1}{\alpha \cdot (1-\alpha )} \cdot \varvec{1}_{]0,1] \cup ]1,\infty [}(\alpha ) + \infty \cdot \varvec{1}_{]-\infty ,0[}(\alpha ) \big ) \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) , \qquad \ \nonumber \end{aligned}$$

where we have employed (8) and (7); in particular, one gets for \(\alpha =2\)

$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi _{2},\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) = {\overline{\int }}_{{\mathscr {X}}} \frac{1}{2} \cdot \frac{(\mathbbm {p}(x) - \mathbbm {q}(x))^2}{\mathbbm {q}(x)} \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle = \frac{1}{2} \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \frac{(\mathbbm {p}(x) - \mathbbm {q}(x))^2}{\mathbbm {q}(x)} \cdot \varvec{1}_{[0,\infty [}(\mathbbm {p}(x)) \cdot \varvec{1}_{]0,\infty [}(\mathbbm {q}(x)) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \infty \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {p}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \end{aligned}$$

which is called Pearson’s chi-square divergence. Under the same constraint \((\mathfrak {P}^{\mathbbm {R}\cdot \lambda }-\mathfrak {Q}^{\mathbbm {R}\cdot \lambda })[\mathscr {X}] \in ]-\infty , \infty [\), the case \(\alpha =1\) leads by (18)–(22) to the generalized Kullback–Leibler divergence (generalized relative entropy)

$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi _{1},\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) = {\overline{\int }}_{{\mathscr {X}}} \Big [ \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)} \cdot \log \big ( {\frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}} \big ) + 1 - \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)} \Big ] \cdot \mathbbm {q}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \qquad \ \nonumber \\& \textstyle = \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {p}(x) \cdot \log \big ( {\frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}} \big ) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot (\mathbbm {q}(x) - \mathbbm {p}(x)) \, \mathrm {d}\lambda (x) + \infty \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {p}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \qquad \ \nonumber \end{aligned}$$

(which equals (42)), and for \(\alpha =0\) one gets from (19), (25)–(27) the generalized reverse Kullback–Leibler divergence (generalized reverse relative entropy)

$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi _{0},\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) = {\overline{\int }}_{{\mathscr {X}}} \big [ - \log \big ( {\frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}} \big ) + \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)} - 1 \big ] \cdot \mathbbm {q}(x) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \qquad \ \nonumber \\& \textstyle = \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \log \big ( {\frac{\mathbbm {q}(x)}{\mathbbm {p}(x)}} \big ) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot (\mathbbm {p}(x) - \mathbbm {q}(x)) \, \mathrm {d}\lambda (x) + \infty \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) . \qquad \ \nonumber \end{aligned}$$

Notice that instead of the limit in (50) one could also use the convention \(r \cdot 0 \cdot \psi _{\phi }\big (\frac{s}{0},1\big ) : = \widetilde{\psi }_{\phi }\big (r,s,0\big ) := 0\); in the context of \(\lambda \)-probability densities, one then ends up with the divergence of Rüschendorf [75].

For the discrete setup \((\mathscr {X},\lambda ) = (\mathscr {X}_{\#},\lambda _{\#})\), the divergence in (51) simplifies to

$$\begin{aligned}& \textstyle \textstyle 0 \leqslant D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda _{\#}}(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle = \sum _{x \in \mathscr {X}} \mathbbm {r}(x) \cdot \big [ \mathbbm {q}(x) \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) - \mathbbm {q}(x) \cdot \phi \big ( 1 \big ) - \phi _{+,c}^{\prime } \big ( 1 \big ) \cdot \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \big ] \, \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \textstyle \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \nonumber \\& \textstyle + \big [ \phi ^{*}(0) -\phi _{+,c}^{\prime }(1) \big ] \cdot \sum _{x \in \mathscr {X}} \mathbbm {r}(x) \cdot \mathbbm {p}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \nonumber \\& \textstyle + \big [ \phi (0) + \phi _{+,c}^{\prime }(1) - \phi (1) \big ] \cdot \sum _{x \in \mathscr {X}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \end{aligned}$$
(59)

which in case of \(\phi (1)=\phi _{+,c}^{\prime }(1)=0\) – respectively \(\phi (1)=0\) and (55) – turns into

$$\begin{aligned}& \textstyle 0 \leqslant D^{c}_{\phi , \mathbbm {Q}, \mathbbm {Q}, \mathbbm {R}\cdot \mathbbm {Q},\lambda _{\#}}(\mathbbm {P}, \mathbbm {Q}) = \sum _{{x \in \mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \phi \big ( { \frac{ \mathbbm {p}(x)}{ \mathbbm {q}(x)}}\big ) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \quad \ \nonumber \\& \textstyle + \phi ^{*}(0) \, \cdot \, \sum _{x \in \mathscr {X}} \mathbbm {r}(x) \, \cdot \, \mathbbm {p}(x) \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) + \phi (0) \, \cdot \, \sum _{x \in \mathscr {X}} \mathbbm {r}(x) \, \cdot \, \mathbbm {q}(x) \, \cdot \,\varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) . \quad \ \end{aligned}$$
(60)
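
For the discrete formulas (59)/(60), the zero-conventions can be implemented directly. The following is a minimal NumPy sketch of our own (function and argument names are ours); as a consistency check it reproduces the \(\mathbbm {r}(\cdot )\)-weighted \(L_{1}\)-distance (57) for \(\phi _{TV}\) with \(c=\frac{1}{2}\).

```python
# Our illustrative NumPy sketch of the discrete divergence (59); the caller
# supplies the generator phi (vectorized on ]0,oo[) together with the
# constants phi(1), phi'_{+,c}(1), phi(0), phi*(0) (the latter two possibly
# np.inf). All names below are ours.
import numpy as np

def discrete_divergence(p, q, r, phi, phi1, dphi1, phi0, phistar0):
    p, q, r = map(np.asarray, (p, q, r))
    both = (p > 0) & (q > 0)                       # states with p(x)*q(x) > 0
    d = np.sum(r[both] * (q[both] * phi(p[both] / q[both])
                          - q[both] * phi1 - dphi1 * (p[both] - q[both])))
    m_q0 = np.sum((r * p)[(p > 0) & (q == 0)])     # p-mass where q vanishes
    m_p0 = np.sum((r * q)[(q > 0) & (p == 0)])     # q-mass where p vanishes
    if m_q0 > 0:
        d += (phistar0 - dphi1) * m_q0             # may be +infinity
    if m_p0 > 0:
        d += (phi0 + dphi1 - phi1) * m_p0          # may be +infinity
    return d

# Consistency check with (57): for phi_TV(t) = |t-1| and c = 1/2 one has
# phi(1) = 0, phi'_{+,1/2}(1) = 0, phi(0) = 1, phi*(0) = 1, and (59)
# collapses to the r-weighted L1-distance.
p = np.array([0.5, 0.5, 0.0]); q = np.array([0.2, 0.3, 0.5]); r = np.ones(3)
d_tv = discrete_divergence(p, q, r, lambda t: np.abs(t - 1), 0.0, 0.0, 1.0, 1.0)
assert np.isclose(d_tv, np.sum(r * np.abs(p - q)))   # both equal 1.0 here
```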

3.3.1.3    \(\mathbf {m_{1}(x) = m_{2}(x) := w(p(x),q(x))}\), \(\mathbf {m_{3}(x) = r(x) \cdot w(p(x),q(x)) \in [0, \infty [}\) for Some (Measurable) Functions \(\mathbf {w: \mathscr {R}(P) \times \mathscr {R}(Q) \rightarrow \mathbb {R}}\) and \(\mathbf {r: \mathscr {X} \rightarrow \mathbb {R}}\)

Such a choice extends the context of the previous Sect. 3.3.1.2 where the “connector function” w took the simple form \(w(u,v) = v\), as well as the setup of Sect. 3.3.1.1 dealing with constant \(w(u,v) \equiv 1\). This introduces a wide flexibility with divergences of the form

$$\begin{aligned}& \textstyle \textstyle 0 \leqslant D^{c}_{\phi ,W(P,Q),W(P,Q),R\cdot W(P,Q),\lambda }(P,Q) \nonumber \\& \textstyle : = {\overline{\int }}_{{\mathscr {X}}} \Big [ \phi \big ( { \frac{p(x)}{w(p(x),q(x))}}\big ) -\phi \big ( {\frac{q(x)}{w(p(x),q(x))}}\big ) \nonumber \\& \textstyle - \phi _{+,c}^{\prime } \big ( {\frac{q(x)}{w(p(x),q(x))}}\big ) \cdot \big ( \frac{p(x)}{w(p(x),q(x))}-\frac{q(x)}{w(p(x),q(x))}\big ) \Big ] \cdot w(p(x),q(x)) \cdot r(x) \, \mathrm {d}\lambda (x), \quad \ \ \end{aligned}$$
(61)

which for the discrete setup \((\mathscr {X},\lambda ) = (\mathscr {X}_{\#},\lambda _{\#})\) (recall \(\lambda _{\#}[\{x\}] =1\) for all \(x \in \mathscr {X}_{\#}\)) simplifies to

$$\begin{aligned}& \textstyle \textstyle 0 \, \leqslant \, D^{c}_{\phi ,W(P,Q),W(P,Q),R\cdot W(P,Q),\lambda _{\#}}\,(P,Q) \, = \, {\overline{\sum }}_{{x \in \mathscr {X}}} \Big [ \phi \big ( { \frac{p(x)}{w(p(x),q(x))}}\big ) \, - \,\phi \big ( {\frac{q(x)}{w(p(x),q(x))}}\big ) \nonumber \\& \textstyle - \phi _{+,c}^{\prime } \big ( {\frac{q(x)}{w(p(x),q(x))}}\big ) \cdot \big ( \frac{p(x)}{w(p(x),q(x))}-\frac{q(x)}{w(p(x),q(x))}\big ) \Big ] \cdot w(p(x),q(x)) \cdot r(x) \ . \end{aligned}$$
(62)

A detailed discussion of this wide class of divergences (61),(62) is beyond the scope of this paper. For the \(\lambda \)-probability density context (and an indication for more general functions), see the comprehensive paper of Kisslinger and Stummer [37] and the references therein. Finally, by appropriate choices of \(w(\cdot ,\cdot )\) we can even derive divergences of the form (60) but with non-convex non-concave \(\phi \): see e.g. the “perturbed” power divergences of Roensch and Stummer [74].

3.3.2 Global Scaling and Aggregation, and Other Paradigms

Our universal framework also contains, as special cases, scaling and aggregation functions of the form \(m_{i}(x) := m_{\ell ,i}(x) \cdot H_{i}\big ( ( m_{g,i}(z) )_{z \in \mathscr {X}} \big )\) for some (measurable, possibly nonnegative) functions \(m_{\ell ,i}:\mathscr {X} \mapsto \mathbb {R}\), \(m_{g,i}:\mathscr {X} \mapsto \mathbb {R}\) and some nonzero scalar functionals \(H_{i}\) thereupon (\(i=1,2,3\), \(x \in \mathscr {X}\)). Accordingly, the components \(H_{i}\big ( \ldots \big )\) can be viewed as “global tunings”, and may depend adaptively on the primary-interest functions P and Q, i.e. \(m_{g,i}(z) = w_{g,i}(z,p(z),q(z))\). For instance, in a finite discrete setup \((\mathscr {X}_{\#},\lambda _{\#})\) with strictly convex and differentiable \(\phi \), \(m_{1}(x) \equiv m_{2}(x) \equiv 1\), \(m_{3}(x)= H_{3} \big ( (w_{g,3}(q(z)) )_{z \in \mathscr {X}} \big )\) this reduces to the conformal divergences of Nock et al. [64] (they also indicate the extension to equal non-unity scaling \(m_{1}(x) \equiv m_{2}(x)\)), for which the subcase \(w_{g,3}(q(z)) := \left( \phi ^{\prime }\left( q(z)\right) \right) ^2\), \(H_{3}\big ( \big ( h(x) \big ) _{x \in \mathscr {X}} \big ) := \big (1+ \sum _{x \in \mathscr {X}} h(x)\big )^{-1/2}\) leads to the total Bregman divergences of Liu et al. [44, 45], Vemuri et al. [91]. In contrast, Nock et al. [62] use \(m_{1}(x) \equiv m_{1} = H_{1} \big ( (p(z))_{z \in \mathscr {X}} \big )\), \(m_{2}(x) \equiv m_{2} = H_{1} \big ( (q(z))_{z \in \mathscr {X}} \big )\), \(m_{3}(x) \equiv 1\). A more detailed discussion can be found in Stummer and Kißlinger [82] and Roensch and Stummer [74], where also versions for nonconvex nonconcave divergence generators are given. Let us finally mention that for the construction of divergence families, there are other recent paradigms which are essentially different from (1), e.g. by means of measuring the tightness of inequalities (cf. Nielsen et al. [60, 61]), respectively of comparative convexity (cf. Nielsen et al. [59]).
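
As a small numerical illustration of this global-tuning idea (our own sketch, not taken from [44, 45]): in the finite discrete setup with \(m_{1} \equiv m_{2} \equiv 1\), the conformal factor \(H_{3}\) simply rescales an ordinary discrete Bregman sum.

```python
# Our sketch of a globally tuned ("conformal") divergence in the finite
# discrete setup: an ordinary Bregman sum with m_1 = m_2 = 1, rescaled by
# the functional H_3((phi'(q(z))^2)_z) = (1 + sum_z phi'(q(z))^2)^(-1/2),
# which the text identifies with the total Bregman divergence of Liu et al.
import numpy as np

def total_bregman(p, q, phi, dphi):
    p, q = np.asarray(p, float), np.asarray(q, float)
    breg = np.sum(phi(p) - phi(q) - dphi(q) * (p - q))   # pointwise Bregman sum
    return breg / np.sqrt(1.0 + np.sum(dphi(q) ** 2))    # global conformal factor

# e.g. with the (sample) generator phi(t) = t^2, dphi(t) = 2t:
print(total_bregman([0.2, 0.8], [0.5, 0.5], lambda t: t**2, lambda t: 2*t))
```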

4 Divergences for Essentially Different Functions

4.1 Motivation

Especially in divergence-based statistics, one is often faced with the situation where the functions \(p(\cdot )\) and \(q(\cdot )\) are of “essentially different nature”. For instance, consider the situation where the uncertainty-prone data-generating mechanism is a random variable Y taking values in \(\mathscr {X}=\mathbb {R}\) having a “classical” (e.g. Gaussian) probability density with respect to the one-dimensional Lebesgue measure \(\lambda _{L}\), so that the corresponding integrals can almost always be taken as Riemann integrals (i.e. \(\mathrm {d}\lambda _{L}(x) = \mathrm {d}x\)); notice that we have set \(\mathbbm {r}(x) \equiv 1\) (\(x \in \mathbb {R}\)). As already indicated above, under independent and identically distributed (i.i.d.) data observations \(Y_1, \ldots , Y_N\) of Y one often builds the corresponding “empirical distribution”, which is nothing but the probability distribution reflecting the underlying (normalized) histogram. By rewriting the latter in terms of its empirical probability mass function, one encounters some basic problems for a straightforward application of divergence concepts: the two aggregating measures \(\lambda _{L}\) and \(\lambda _{\#}\) do not coincide and actually they are of “essentially different” nature; moreover, the empirical probability mass function is nonzero only on the range \(\mathscr {R}(Y_{1}, \ldots , Y_{N}) = \{ z_1, \ldots , z_s \}\) of distinguishable points \(z_1, \ldots , z_s\) (\(s \leqslant N\)) occupied by \(Y_{1}, \ldots , Y_{N}\). In particular, one has \(\lambda _{L}[ \{ z_1, \ldots , z_s \} ] = 0\). Accordingly, building a “non-coarsely discriminating” dissimilarity/divergence between such types of functions is a task like “comparing apples with pears”. There are several solutions to tackle this. To begin with, in the following we take the “encompassing” approach of quantifying their dissimilarity by means of their common superordinate characteristics as “fruits”. Put in mathematical terms, we choose e.g. \(\mathscr {X}= \mathbb {R}\), \(\lambda = \lambda _{L} + \lambda _{\#}\) and work with the particular representations \(\mathbbm {p}(\cdot )\) with \(\mathbbm {p}(x) > 0\) for \(\lambda \)-almost all \(x \in \{ z_1, \ldots , z_s \}\) as well as \(\mathbbm {q}(\cdot )\) with \(\mathbbm {q}(x) > 0\) for \(\lambda \)-almost all \(x \in \widetilde{A} \backslash \{z_1, \ldots , z_s \}\) with some large enough (measurable) subset \(\widetilde{A}\) of \(\mathscr {X} = \mathbb {R}\) such that

(63)

hold. In fact, with these choices one obtains two \(\lambda \)-probability densities, as well as

$$\begin{aligned}& \textstyle \mathbbm {p}(x) \cdot \mathbbm {q}(x) = 0 \quad \text {for}\, \lambda \text {-almost all}\, x \in \mathscr {X}, \end{aligned}$$
(64)
$$\begin{aligned}& \textstyle \mathbbm {p}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) =\mathbbm {p}(x) \quad \text {for}\, \lambda \text {-almost all}\, x \in \mathscr {X}, \end{aligned}$$
(65)
$$\begin{aligned}& \textstyle \mathbbm {q}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) =\mathbbm {q}(x) \quad \text {for}\, \lambda \text {-almost all}\, x \in \mathscr {X} \end{aligned}$$
(66)

for the special choices made above. By means of these and (63), the divergence (51) simplifies to

$$\begin{aligned}& \textstyle D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) = \phi ^{*}(0) + \phi (0) - \phi (1) \, . \end{aligned}$$
(67)

Since for arbitrary space \(\mathscr {X}\) (and not only \(\mathbb {R}\)) and any aggregator \(\lambda \) thereupon, the formula (67) holds for all functions \(\mathbbm {p}(\cdot )\), \(\mathbbm {q}(\cdot )\) which satisfy (63) as well as (64)–(66) for \(\lambda \)-almost all \(x \in \mathscr {X}\), and since \(\phi ^{*}(0) + \phi (0) - \phi (1)\) is just a constant (which may be infinite), these divergences are not suitable for discriminating between such “essentially different” (basically orthogonal) \(\lambda \)-probability densities \(\mathbbm {p}(\cdot )\) and \(\mathbbm {q}(\cdot )\). More generally, under the validity of (64)–(66) for \(\lambda \)-almost all \(x \in \mathscr {X}\) – which we denote by \(\mathbbm {P} \perp \mathbbm {Q}\) and which basically amounts to pairs of functions of the type

$$\begin{aligned} \textstyle \mathbbm {p}(x) := \widetilde{p}(x) \cdot \varvec{1}_{A}(x) \quad \text {with}\, \widetilde{p}(x) > 0 \, \text {for}\, \lambda \text {-almost all}\, x \in A, \end{aligned}$$
(68)
$$\begin{aligned} \textstyle \mathbbm {q}(x) := \widetilde{q}(x) \cdot \varvec{1}_{B \backslash A}(x) \quad \text {with}\, \widetilde{q}(x) > 0 \, \text {for}\, \lambda \text {-almost all}\, x \in B \backslash A, \end{aligned}$$
(69)

with some (measurable) subsets \(A \subset B\) of \(\mathscr {X}\) – the divergence (51) turns into

$$\begin{aligned} \textstyle D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q})= & {} \textstyle \big [ \phi ^{*}(0) -\phi _{+,c}^{\prime }(1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {p}(x) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi (0) + \phi _{+,c}^{\prime }(1) - \phi (1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \, \mathrm {d}\lambda (x) \ > \ 0 \qquad \end{aligned}$$
(70)

which now depends on \(\mathbbm {P}\) and \(\mathbbm {Q}\) in a rudimentary “weighted-total-mass” way. Inspired by this, we specify a statistically interesting divergence subclass:

Definition 1

We say that a divergence (respectively dissimilarity respectively distance)Footnote 12 \(D(\cdot ,\cdot )\) is encompassing for a class  \(\widetilde{\mathscr {P}}\)  of functions if

  • for arbitrarily fixed \(Q := \big \{q(x)\big \}_{x \in \mathscr {X}} \in \widetilde{\mathscr {P}}\) the function \(P := \big \{p(x)\big \}_{x \in \mathscr {X}} \rightarrow D(P,Q)\) is non-constant on the subfamily of all \(P \in \widetilde{\mathscr {P}}\) with \(P \perp Q\), and

  • for arbitrarily fixed \(P \in \widetilde{\mathscr {P}}\) the function \(Q \rightarrow D(P,Q)\) is non-constant on the subfamily of all \(Q \in \widetilde{\mathscr {P}}\) with \(Q \perp P\).

Accordingly, due to (67) the prominently used \(\phi \)-divergences are not encompassing for the class \(\widetilde{\mathscr {P}}\) of all \(\lambda \)-probability densities; more generally, because of (70) the divergences \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q})\) are in general encompassing for the class \(\widetilde{\mathscr {P}}\) of all \(\lambda \)-probability densities, but not for \(\widetilde{\mathscr {P}}:= \{ \widetilde{P} := \big \{\widetilde{\mathbbm {p}}(x)\big \}_{x \in \mathscr {X}} \, | \, \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \widetilde{\mathbbm {p}}(x) \, \mathrm {d}\lambda (x) = \widetilde{c} \, \}\) for any fixed \(\widetilde{c}\).

4.2 \(\mathbf {\mathbbm {m}_{1}(x) = \mathbbm {m}_{2}(x) := \mathbbm {q}(x)}\), \(\mathbf {\mathbbm {m}_{3}(x) = \mathbbm {r}(x) \cdot \mathbbm {q}(x)^{\varvec{\chi }} \in [0, \infty [}\) for Some \(\chi >1\) and Some (Measurable) Function \(\mathbf {\mathbbm {r}: \mathscr {X} \rightarrow [0,\infty [}\)

In the following, we propose a new way of repairing the above-mentioned encompassing-concerning deficiency for \(\lambda \)-probability density functions, by introducing a new divergence in terms of choosing a generator \(\phi : ]0,\infty [ \rightarrow \mathbb {R}\) which is convex and strictly convex at 1, the scaling function \(\mathbbm {m}_{1}(x) = \mathbbm {m}_{2}(x) := \mathbbm {q}(x)\) as in the non-negativity set-up of Sect. 3.3.1.2, but the more general aggregation function \(\mathbbm {m}_{3}(x) = \mathbbm {r}(x) \cdot \mathbbm {q}(x)^{\chi } \in [0, \infty [\) for some power \(\chi >1\) and some (measurable) function \(\mathbbm {r} : \mathscr {X} \rightarrow [0,\infty [\) which satisfies \(\mathbbm {r}(x) \in ]0,\infty [\) for \(\lambda \)-almost all \(x \in \mathscr {X}\). To incorporate the zeros of \(\mathbbm {p}(\cdot ),\mathbbm {q}(\cdot ),\mathbbm {r}(\cdot )\) by appropriate limits and conventions, we proceed analogously to Sect. 3.3.1.2. Accordingly, we inspect the boundary behaviour of the function \(\widetilde{\psi }_{\phi ,c}: ]0,\infty [^3 \rightarrow [0,\infty [\) given by

$$\begin{aligned}& \textstyle \widetilde{\psi }_{\phi ,c}\big (r,\widetilde{s},\widetilde{t}\big ) := r \cdot \widetilde{t}^{\chi } \cdot \psi _{\phi ,c}\big (\frac{\widetilde{s}}{\widetilde{t}},1\big ) = r \cdot \widetilde{t}^{\chi } \cdot \big [ \phi \big (\frac{\widetilde{s}}{\widetilde{t}}\big ) - \phi (1) - \phi _{+,c}^{\prime }(1) \cdot \big (\frac{\widetilde{s}}{\widetilde{t}}-1 \big ) \big ] \ \nonumber \\& \textstyle = r \cdot \widetilde{t}^{\chi } \cdot \big [ \phi \big (\frac{\widetilde{s}\cdot r}{\widetilde{t} \cdot r}\big ) - \phi (1) - \phi _{+,c}^{\prime }(1) \cdot \big (\frac{\widetilde{s} \cdot r}{\widetilde{t} \cdot r}-1 \big ) \big ] \ = r \cdot \widetilde{t}^{\chi } \cdot \psi _{\phi ,c}\big (\frac{\widetilde{s} \cdot r}{\widetilde{t} \cdot r},1\big ) . \qquad \ \ \nonumber \end{aligned}$$

As in Sect. 3.3.1.2, Assumption 2(a) is satisfied in conformity with the current setting; again we use the short-hand notation 2(a) etc. in the following discussion. Moreover, we require the validity of 2(b)–2(d) at the point \(t=1\). The analogue of 2(e) is \(\mathbbm {r}(x) \cdot \widetilde{t}^{\chi } < \infty \) which is always (almost surely) automatically satisfied (a.a.sat.), whereas 2(f) converts to “\(\mathbbm {r}(x) \cdot \widetilde{t}^{\chi } > 0\) for all \(\widetilde{s} \ne \widetilde{t}\)” which is also a.a.sat. except for the case \(\widetilde{t} =0\), which will be incorporated below. For the derivation of the analogue of 2(k) we observe that for fixed \(r >0\), \(\widetilde{s} >0\)

$$\begin{aligned}& \textstyle \ell i_{2} := r \cdot 0^{\chi } \cdot \psi _{\phi ,c}\big (\frac{\widetilde{s}}{0},1\big ) : = \lim _{\widetilde{t}\rightarrow 0} \widetilde{\psi }_{\phi ,c}\big (r,\widetilde{s},\widetilde{t}\,\big ) \nonumber \\& \textstyle = r \cdot \widetilde{s}^{\chi } \cdot \lim _{\widetilde{t}\rightarrow 0} \big [ \frac{\widetilde{t}^{\chi }}{\widetilde{s}^{\chi }} \cdot \phi \big (\frac{\widetilde{s}}{\widetilde{t}}\big ) \big ] = r \cdot \widetilde{s}^{\chi } \cdot \phi _{\chi }^{*}(0) \geqslant 0 , \end{aligned}$$
(71)

where \(\phi _{\chi }^{*}(0) := \lim _{u\rightarrow 0} u^{\chi -1} \cdot u \cdot \phi \big (\frac{1}{u}\big ) = \lim _{v\rightarrow \infty } \frac{\phi (v)}{v^{\chi }}\) exists but may be infinite. To convert 2(i), we employ the fact that for fixed \(r >0\), \(\widetilde{t} >0\) the function \(\widetilde{s} \rightarrow \widetilde{\psi }_{\phi ,c}\big (r,\widetilde{s},\widetilde{t}\big )\) is convex with existing limit

$$\begin{aligned} \textstyle \ell i_{3}:= & {} r \cdot \widetilde{t}^{\chi } \cdot \psi _{\phi ,c}\big (\frac{0}{\widetilde{t}},1\big ) : = \lim _{\widetilde{s}\rightarrow 0} \widetilde{\psi }_{\phi ,c}\big (r,\widetilde{s},\widetilde{t}\big ) \nonumber \\ \textstyle= & {} r \cdot \widetilde{t}^{\chi } \cdot (\phi (0) + \phi _{+,c}^{\prime }(1) - \phi (1)) > 0 . \end{aligned}$$
(72)

To achieve the analogue of 2(g), let us first remark that for fixed \(r >0\) the function \((\widetilde{s},\widetilde{t}) \rightarrow \widetilde{\psi }_{\phi ,c}\big (r,\widetilde{s},\widetilde{t}\,\big )\) may not be continuous at \((\widetilde{s},\widetilde{t}) = (0,0)\), but due to the very nature of a divergence we make the 2(g)-conform convention of setting

$$\begin{aligned}& \textstyle r \cdot 0^{\chi } \cdot \psi _{\phi ,c}\big (\frac{0}{0},1\big ) : = \widetilde{\psi }_{\phi ,c}\big (r,0,0\big ) := 0 \, . \nonumber \end{aligned}$$

The analogues of the Assumptions 2(h), (j), (\(\ell \)), (m), (n) are obsolete because of our basic finiteness requirements. Putting together all the building-blocks, with the above-mentioned limits and conventions we obtain the divergence

$$\begin{aligned}& \textstyle \textstyle 0 \leqslant D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle \, : = \, {\overline{\int }}_{{\mathscr {X}}} \mathbbm {r}(x) \, \cdot \, \Big [ \mathbbm {q}(x)^{\chi } \, \cdot \, \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) \, - \, \mathbbm {q}(x)^{\chi } \, \cdot \, \phi \big ( 1 \big ) \, - \, \phi _{+,c}^{\prime } \big ( 1 \big ) \, \cdot \, \big ( \mathbbm {p}(x) \, \cdot \, \mathbbm {q}(x)^{\chi -1} \, - \, \mathbbm {q}(x)^{\chi } \big ) \Big ] \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle : = \int _{{\mathscr {X}}} \mathbbm {r}(x) \, \cdot \, \Big [ \mathbbm {q}(x)^{\chi } \, \cdot \, \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) \, - \, \mathbbm {q}(x)^{\chi } \, \cdot \, \phi \big ( 1 \big ) \, - \, \phi _{+,c}^{\prime } \big ( 1 \big ) \, \cdot \, \big ( \mathbbm {p}(x) \, \cdot \, \mathbbm {q}(x)^{\chi -1} \, - \, \mathbbm {q}(x)^{\chi } \big ) \Big ] \, \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \phi _{\chi }^{*}(0) \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {p}(x)^{\chi } \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi (0) + \phi _{+,c}^{\prime }(1) - \phi (1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x)^{\chi } \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {q}(x) \big ) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle \, = \, \int _{{\mathscr {X}}} \mathbbm {r}(x) \, \cdot \, \Big [ \mathbbm {q}(x)^{\chi } \, \cdot \, \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) \, - \, \mathbbm {q}(x)^{\chi } \, \cdot \, \phi \big ( 1 \big ) \, - \, \phi _{+,c}^{\prime } \big ( 1 \big ) \, \cdot \, \big ( \mathbbm {p}(x) \, \cdot \, \mathbbm {q}(x)^{\chi -1} \, - \, \mathbbm {q}(x)^{\chi } \big ) \Big ] \, \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \phi _{\chi }^{*}(0) \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {p}(x)^{\chi } \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi (0) + \phi _{+,c}^{\prime }(1) - \phi (1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x)^{\chi } \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \, . \end{aligned}$$
(73)

In case of \(\mathfrak {Q}_{\chi }^{\mathbbm {R}\cdot \lambda }[\mathscr {X}] := \int _{{\mathscr {X}}} \mathbbm {q}(x)^{\chi } \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) < \infty \), the divergence (73) becomes

$$\begin{aligned}& \textstyle 0 \leqslant D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle = \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \Big [ \mathbbm {q}(x)^{\chi } \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) - \phi _{+,c}^{\prime } \big ( 1 \big ) \cdot \big ( \mathbbm {p}(x) \cdot \mathbbm {q}(x)^{\chi -1} - \mathbbm {q}(x)^{\chi } \big ) \Big ] \, \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \phi _{\chi }^{*}(0) \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {p}(x)^{\chi } \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi (0) + \phi _{+,c}^{\prime }(1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x)^{\chi } \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle - \phi (1) \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x)^{\chi } \, \mathrm {d}\lambda (x) \, . \end{aligned}$$
(74)

Moreover, in case of \(\phi \big ( 1 \big ) = 0\) and \(\int _{{\mathscr {X}}} \big ( \mathbbm {p}(x) \cdot \mathbbm {q}(x)^{\chi -1} - \mathbbm {q}(x)^{\chi } \big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \in [0, \infty [\) (but not necessarily \(\int _{{\mathscr {X}}} \mathbbm {p}(x) \cdot \mathbbm {q}(x)^{\chi -1} \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) < \infty \), \(\int _{{\mathscr {X}}} \mathbbm {q}(x)^{\chi } \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) < \infty \)), the divergence (73) turns into

$$\begin{aligned}& \textstyle 0 \leqslant D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q}) = \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x)^{\chi } \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \phi _{\chi }^{*}(0) \, \cdot \, \int _{{\mathscr {X}}} \mathbbm {r}(x) \, \cdot \, \mathbbm {p}(x)^{\chi } \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \, + \, \phi (0) \, \cdot \, \int _{{\mathscr {X}}} \mathbbm {r}(x) \, \cdot \, \mathbbm {q}(x)^{\chi } \, \cdot \, \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle - \phi _{+,c}^{\prime } \big ( 1 \big ) \cdot \int _{{\mathscr {X}}} \big ( \mathbbm {p}(x) \cdot \mathbbm {q}(x)^{\chi -1} - \mathbbm {q}(x)^{\chi } \big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \, . \nonumber \end{aligned}$$

In contrast to the case \(\chi =1\) where for \(\lambda \)-probability-density functions \(\mathbbm {p}(\cdot )\), \(\mathbbm {q}(\cdot )\) the divergence (53) was further simplified due to \(\int _{{\mathscr {X}}} \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) = 0\), for the current setup \(\chi >1\) the latter has no impact towards further simplification. However, in general, for the new divergence defined by (73) one gets for any \(\mathbbm {P} \perp \mathbbm {Q}\) from (68), (69), (64)–(66) the expression

$$\begin{aligned}& \textstyle 0 \leqslant D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle \, = \, \phi _{\chi }^{*}(0) \, \cdot \, \int _{{\mathscr {X}}} \mathbbm {r}(x) \, \cdot \, \mathbbm {p}(x)^{\chi } \, \mathrm {d}\lambda (x) \, + \, \big [ \phi (0) \, + \, \phi _{+,c}^{\prime }(1) \, - \, \phi (1) \big ] \, \cdot \, \int _{{\mathscr {X}}} \mathbbm {r}(x) \, \cdot \, \mathbbm {q}(x)^{\chi } \, \mathrm {d}\lambda (x) \, \qquad \ \ \end{aligned}$$
(75)

which is encompassing for the class of \(\lambda \)-probability density functions. By inspection of the above calculations, one can even relax the assumptions away from convexity:

Theorem 6

Let \(\chi >1\), \(c \in [0,1]\), \(\phi : ]0,\infty [ \rightarrow \mathbb {R}\) such that both \(\phi _{+,c}^{\prime }(1)\) and \(\phi (0) := \lim _{s \rightarrow 0} \phi (s)\) exist and \(\psi _{\phi ,c}(s,1) = \phi (s) - \phi (1) - \phi _{+,c}^{\prime }(1) \cdot (s-1) \geqslant 0 \) for all \(s>0\). Moreover, assume that \(\psi _{\phi ,c}(s,1) = 0\) if and only if \(s=1\). Furthermore, let the limits \(\ell i_{2} \geqslant 0\) defined by (71) and \(\ell i_{3} \geqslant 0\) defined by (72) exist and satisfy \(\ell i_{2}+\ell i_{3} >0\). Then one gets for the divergence defined by (73):

(1) \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q}) \geqslant 0\). Depending on the concrete situation, \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q})\) may take an infinite value.

$$\begin{aligned}&\textit{(2)} \quad D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q}) = 0 \quad \text {if and only if} \quad \mathbbm {p}(x)=\mathbbm {q}(x) \ \text {for}\, \lambda \text {-a.a.}\, x \in \mathscr {X}. \nonumber \end{aligned}$$

(3) For \(\mathbbm {P} \perp \mathbbm {Q}\), the representation (75) holds.

Remark 4

(1) As seen above, if the generator \(\phi \) is in \(\varPhi (]0,\infty [)\) and satisfies the Assumptions 2(a)–(d) for \(t=1\), then the requirements on \(\phi \) in Theorem 6 are automatically satisfied. The case \(\chi =1\) has already been covered by Theorem 5.

(2) For practical purposes, it is sometimes useful to work with a sub-setup of choices \(\chi >1\), \(c \in [0,1]\) and \(\phi \) such that \(\ell i_{2} \in ]0,\infty [\) and/or \(\ell i_{3} \in ]0,\infty [\).    \(\square \)
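
Before turning to concrete generators, here is a small numeric toy illustration of ours of the encompassing effect, with all concrete numbers assumed; the constants \(\phi ^{*}(0)=\phi (0)=2\) for \(\phi _{1/2}\) and \(\phi _{\chi }^{*}(0)=\phi _{2}(0)=\frac{1}{2}\) for \(\alpha =\chi =2\) are computed from (70), (71), (72) and the power generators introduced below.

```python
# Our toy illustration of Definition 1 on X = {0,...,5} with lambda = lambda_#
# and r(x) = 1: for pmfs with disjoint supports (P orthogonal to Q), the
# chi = 1 divergence (67) equals the constant phi*(0) + phi(0) - phi(1) no
# matter which P is chosen, whereas the chi = alpha = 2 divergence (75),
# namely (1/2)*sum(p^2) + (1/2)*sum(q^2) here, does vary with P.
import numpy as np

q  = np.array([0.0, 0.0, 0.0, 0.2, 0.3, 0.5])   # fixed Q
p1 = np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0])   # P1 orthogonal to Q
p2 = np.array([0.1, 0.1, 0.8, 0.0, 0.0, 0.0])   # P2 orthogonal to Q

for p in (p1, p2):
    d_chi1 = 2.0 * np.sum(p) + 2.0 * np.sum(q)         # (70) for phi_{1/2}
    d_chi2 = 0.5 * np.sum(p**2) + 0.5 * np.sum(q**2)   # (75) for alpha = chi = 2
    print(d_chi1, d_chi2)
# -> 4.0 in both rows for chi = 1, but 0.44 vs. 0.52 for chi = 2:
#    only the chi > 1 divergence discriminates between P1 and P2.
```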

Let us give some examples. To begin with, for \(\alpha \in \mathbb {R}\backslash \{0,1\}\) take the power functions \(\phi (t): = \phi _{\alpha }(t) := \frac{t^\alpha -1}{\alpha (\alpha -1)}-\frac{t-1}{\alpha -1} \ \in [0,\infty [ , \quad t \in ]0,\infty [\), with the properties \(\phi _{\alpha }(1) =0\), \(\phi _{\alpha }^{\prime }(1)=0\) (cf. (6)) and \(\phi _{\alpha }(0) := \lim _{t\downarrow 0}\phi _{\alpha }(t)= \frac{1}{\alpha } \cdot \varvec{1}_{]0,1] \cup ]1,\infty [}(\alpha ) + \infty \cdot \varvec{1}_{]-\infty ,0[}(\alpha )\). Then, for arbitrary \(\chi \in \mathbb {R}\) one gets the representation

$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi _{\alpha },\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle : = {\overline{\int }}_{{\mathscr {X}}} \mathbbm {r}(x) \, \cdot \, \Big [ \mathbbm {q}(x)^{\chi } \, \cdot \, \phi _{\alpha } \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) \, - \, \mathbbm {q}(x)^{\chi } \, \cdot \, \phi _{\alpha } \big ( 1 \big ) \, - \, \phi _{\alpha }^{\prime } \big ( 1 \big ) \, \cdot \, \big ( \mathbbm {p}(x) \, \cdot \, \mathbbm {q}(x)^{\chi -1} \, - \, \mathbbm {q}(x)^{\chi } \big ) \Big ] \, \mathrm {d}\lambda (x) \nonumber \\ \end{aligned}$$
(76)
$$\begin{aligned}& \textstyle = {\overline{\int }}_{{\mathscr {X}}} \Big [ \phi _{\alpha } \big ( \frac{\mathbbm {p}(x)}{w_{\widetilde{\chi }}{(\mathbbm {p}(x),\mathbbm {q}(x))}} \big ) -\phi _{\alpha } \big ( {\frac{\mathbbm {q}(x)}{w_{\widetilde{\chi }}(\mathbbm {p}(x),\mathbbm {q}(x))}}\big ) \nonumber \\& \textstyle - \phi _{\alpha }^{\prime } \big ( {\frac{\mathbbm {q}(x)}{w_{\widetilde{\chi }}(\mathbbm {p}(x),\mathbbm {q}(x))}}\big ) \cdot \big ( \frac{\mathbbm {p}(x)}{w_{\widetilde{\chi }}(\mathbbm {p}(x),\mathbbm {q}(x))}- \frac{\mathbbm {q}(x)}{w_{\widetilde{\chi }}(\mathbbm {p}(x),\mathbbm {q}(x))}\big ) \Big ] \cdot w_{\widetilde{\chi }}(\mathbbm {p}(x),\mathbbm {q}(x)) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \ \nonumber \\& \textstyle = D_{\phi _{\alpha },\mathbbm {Q}^{\widetilde{\chi }},\mathbbm {Q}^{\widetilde{\chi }},\mathbbm {R}\cdot \mathbbm {Q}^{\widetilde{\chi }},\lambda }(\mathbbm {P},\mathbbm {Q}) \, \end{aligned}$$
(77)

with the adaptive scaling/aggregation function \(w_{\widetilde{\chi }}(u,v) = v^{\widetilde{\chi }}\) and \(\widetilde{\chi } := 1 + \frac{\chi -1}{1-\alpha }\); in other words, the divergence (76) can be seen as a particularly adaptively scaled Bregman divergence of non-negative functions in the sense of Kißlinger and Stummer [37], from which their robustness and non-singularity-asymptotical-statistics properties can be derived as a special case (for the probability setup , , \(\mathbbm {r}(x) \equiv 1\), and beyond). From (77), it is immediate to see that the case \(\chi =1\) corresponds to the generalized power divergences (58) of order \(\alpha \in \mathbb {R}\backslash \{0,1\}\), whereas \(\chi =\alpha \) corresponds to the unscaled divergences (40), i.e.

$$\begin{aligned}& \textstyle 0 \leqslant D_{\phi _{\alpha },\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\alpha },\lambda }(\mathbbm {P},\mathbbm {Q}) = D_{\phi _{\alpha },\mathbbm {1},\mathbbm {1},\mathbbm {R}\cdot \mathbbm {1},\lambda }(\mathbbm {P},\mathbbm {Q}) \\& \textstyle = {\overline{\int }}_{{\mathscr {X}}} \frac{\mathbbm {r}(x)}{\alpha \cdot (\alpha -1)} \cdot \Big [ \mathbbm {p}(x)^{\alpha } + (\alpha -1) \cdot \mathbbm {q}(x)^{\alpha } - \alpha \cdot \mathbbm {p}(x) \cdot \mathbbm {q}(x)^{\alpha -1} \Big ] \, \mathrm {d}\lambda (x) \quad \text {(cf. (40))} \nonumber \end{aligned}$$
(78)

which for \(\alpha >1\), \(\mathbbm {r}(x) \equiv 1\) and \(\lambda \)-probability densities \(\mathbbm {p}(\cdot )\), \(\mathbbm {q}(\cdot )\) is a multiple of the \(\alpha \)-order density-power divergences DPD used by Basu et al. [10]; as a side remark, in the latter setup our divergence (77) manifests a smooth interconnection between PD and DPD which differs from that of Patra et al. [70], Ghosh et al. [32].
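
As a quick numerical cross-check (our own sketch, assuming SciPy; the Gaussian choice is ours), the unscaled divergence (78) can be evaluated by direct numerical integration; for \(\alpha =2\) it reduces to \(\frac{1}{2}\int (\mathbbm {p}(x)-\mathbbm {q}(x))^{2} \, \mathrm {d}x\).

```python
# Our numerical-integration sketch of the unscaled divergence (78) with
# r(x) = 1, for two hypothetical Gaussian lambda_L-probability densities.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def dpd(alpha, pdf_p, pdf_q, lo=-20.0, hi=20.0):
    c = 1.0 / (alpha * (alpha - 1.0))
    f = lambda x: c * (pdf_p(x)**alpha + (alpha - 1.0) * pdf_q(x)**alpha
                       - alpha * pdf_p(x) * pdf_q(x)**(alpha - 1.0))
    val, _err = quad(f, lo, hi)
    return val

p_pdf = norm(loc=0.0, scale=1.0).pdf   # P = N(0,1)
q_pdf = norm(loc=1.0, scale=1.0).pdf   # Q = N(1,1)
print(dpd(2.0, p_pdf, q_pdf))          # alpha = 2: (1/2) * int (p-q)^2 dx ~ 0.0624
```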

For (76), let us briefly inspect the corresponding \(\ell i_{2}\) from (71) as well as \(\ell i_{3}\) from (72). Only for \(\alpha \in ]0,1[ \cup ]1,\infty [\), one gets finite \(\ell i_{3} = \frac{r \widetilde{t}^{\chi }}{\alpha } \in ]0,\infty [\) for all \(\chi \in \mathbb {R}\), \(r>0\), \(\widetilde{t} >0\). Additionally, one obtains finite \(\ell i_{2}\) only for \(\chi =1\), \(\alpha \in ]0,1[\) where \(\ell i_{2} = \frac{r \widetilde{s}}{1-\alpha }\) (PD case), respectively for \(\chi >1\), \(\alpha \in ]0,1[ \cup ]1,\chi [\) where \(\ell i_{2} = 0\), respectively for \(\alpha =\chi >1\) where \(\ell i_{2} = \frac{r \widetilde{s}^{\alpha }}{\alpha \cdot (\alpha -1)}\) (DPD case), for all \(r>0\), \(\widetilde{s} >0\).

Another interesting example for the divergence \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q})\) in (73) is given for \(\alpha \in \mathbb {R}\backslash \{0,1\}\) by the generators

$$\begin{aligned}& \textstyle \phi (t) := \widetilde{\widetilde{\phi }}_{\alpha }(t) := \frac{(\alpha -1) \cdot t^{\alpha } - \alpha \cdot t^{\alpha -1} +1}{\alpha \cdot (\alpha -1)}, \ \ t >0, \quad \widetilde{\widetilde{\phi }}_{\alpha }(1) = 0, \ \widetilde{\widetilde{\phi ^{\prime }}}_{\alpha }(1) = 0 , \qquad \ \nonumber \end{aligned}$$

for which \(t \rightarrow \widetilde{\widetilde{\phi }}_{\alpha }(t) = \widetilde{\widetilde{\phi }}_{\alpha }(t) - \widetilde{\widetilde{\phi }}_{\alpha }(1) - \widetilde{\widetilde{\phi ^{\prime }}}_{\alpha }(1) \cdot (t-1) = \psi _{\widetilde{\widetilde{\phi }}_{\alpha }}(t,1)\) is strictly decreasing on ]0, 1[ and strictly increasing on \(]1,\infty [\). Hence, the corresponding assumptions of Theorem 6 are satisfied. Beyond this, notice that \(\widetilde{\widetilde{\phi }}_{\alpha }(\cdot )\) is strictly convex on \(]0,\infty [\) if \(\alpha \in ]1,2]\), respectively strictly convex on \(]1-\frac{1}{\alpha -1}, \infty [\) and strictly concave on \(]0, 1-\frac{1}{\alpha -1}[\) if \(\alpha >2\), respectively strictly convex on \(]0, 1+\frac{1}{1-\alpha }[\) and strictly concave on \(]1+\frac{1}{1-\alpha },\infty [\) if \(\alpha \in ]-\infty ,0[ \cup ]0,1[\). Furthermore, the corresponding \(\ell i_{3}\) is finite only for \(\alpha >1\), namely \(\ell i_{3} = \frac{r \widetilde{t}^{\chi }}{\alpha \cdot (\alpha -1)} \in ]0,\infty [\) for all \(\chi \in \mathbb {R}\), \(r>0\), \(\widetilde{t} >0\). Additionally, if \(\alpha >1\) one gets finite \(\ell i_{2}\) only for \(\chi> \alpha >1\) where \(\ell i_{2} = 0\), respectively for \(\alpha =\chi >1\) where \(\ell i_{2} = \frac{r \widetilde{s}^{\alpha }}{\alpha }\) for all \(r>0\), \(\widetilde{s} >0\). Notice that for \(\chi =\alpha >1\), the limits \(\ell i_{2}\), \(\ell i_{3}\) for the cases \(\phi _{\alpha }\) and \(\widetilde{\widetilde{\phi }}_{\alpha }\) are interchanged. Indeed, by straightforward calculations one can easily see that

$$\begin{aligned}& \textstyle \textstyle 0 \leqslant D_{\widetilde{\widetilde{\phi }}_{\alpha },\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\alpha },\lambda }(\mathbbm {P},\mathbbm {Q}) = D_{\phi _{\alpha },\mathbbm {1},\mathbbm {1},\mathbbm {R}\cdot \mathbbm {1},\lambda }(\mathbbm {Q},\mathbbm {P}) \nonumber \\& \textstyle = {\overline{\int }}_{{\mathscr {X}}} \frac{\mathbbm {r}(x)}{\alpha \cdot (\alpha -1)} \cdot \Big [ \big (\mathbbm {q}(x)\big )^{\alpha } + (\alpha -1) \cdot \big (\mathbbm {p}(x)\big )^{\alpha } - \alpha \cdot \mathbbm {q}(x) \cdot \big (\mathbbm {p}(x)\big )^{\alpha -1} \Big ] \, \mathrm {d}\lambda (x) \qquad \ \end{aligned}$$
(79)

which is the “reversion” of the divergence (40).

4.3 Minimum Divergences - The Encompassing Method

So far, we have almost entirely dealt with aggregated divergences between functions \(P := \big \{p(x)\big \}_{x \in \mathscr {X}}\), \(Q := \big \{q(x)\big \}_{x \in \mathscr {X}}\) under the same aggregator (measure) \(\lambda \). On the other hand, in Sect. 4.1 we have already encountered an important statistical situation where two aggregators \(\lambda _{1}\) and \(\lambda _{2}\) come into play. Let us now investigate such a context in more detail. To achieve this, for the rest of this paper we confine ourselves to the following probabilistic setup: the modeled respectively observed (random) data take values in a state space \(\mathscr {X}\) (with at least two distinct values), equipped with a system \(\mathscr {F}\) of admissible events (\(\sigma \)-algebra) and two \(\sigma \)-finite measures \(\lambda _{1}\) and \(\lambda _{2}\). Furthermore, let \(\mathbbm {p}: \mathscr {X} \rightarrow [0,\infty [\), \(\mathbbm {q}: \mathscr {X} \rightarrow [0,\infty [\) be such that \(\mathbbm {p}(x) \geqslant 0\) for \(\lambda _{1}\)-a.a. \(x \in \mathscr {X}\), \(\mathbbm {q}(x) \geqslant 0\) for \(\lambda _{2}\)-a.a. \(x \in \mathscr {X}\), \(\int _{\mathscr {X}} \mathbbm {p}(x) \, \mathrm {d}\lambda _{1}(x) = 1\), and \(\int _{\mathscr {X}} \mathbbm {q}(x) \, \mathrm {d}\lambda _{2}(x) = 1\); in other words, \(\mathbbm {p}(\cdot )\) is a \(\lambda _{1}\)-probability density function and \(\mathbbm {q}(\cdot )\) is a \(\lambda _{2}\)-probability density function; the two corresponding probability measures are denoted by \(\mathfrak {P}^{\lambda _{1}}\) and \(\mathfrak {Q}^{\lambda _{2}}\). Notice that we henceforth assume \(\mathbbm {r}(x) =1\) for all \(x \in \mathscr {X}\).

More specifically, we deal with a parametric framework of double uncertainty in the data and in the model (cf. Sect. 2.4). The former is described by a random variable Y taking values in the space \(\mathscr {X}\) and by its probability law which (as far as model risk is concerned) is supposed to be unknown but to belong to a class \(\mathscr {Q}_{\varTheta }^{\lambda _{2}}\) of probability measures on \(({\mathscr {X}},{\mathscr {F}})\) indexed by a set of parameters \(\varTheta \subset {\mathbb {R}}^{d}\) (the non-parametric case works basically in an analogous way, with more sophisticated technicalities). Accordingly, all \(\mathfrak {Q}_{\theta }^{\lambda _{2}}\) (\(\theta \in \varTheta \)) are principal model-candidate laws, with \(\theta _{0}\) to be found out (approximately and with high confidence) by N concrete data observations described by the independent and identically distributed random variables \(Y_{1}, \ldots , Y_{N}\). Furthermore, we assume that the true unknown parameter \(\theta _{0}\) (to be learnt) is identifiable and that the family \(\mathscr {Q}_{\varTheta }^{\lambda _{2}}\) is (measure-theoretically) equivalent in the sense

(80)

As usual, the equivalence means that for \(\lambda _{2}\)-a.a. \(x \in \mathscr {X}\) there holds the density-function-relation: if and only if ; this implies in particular that and for \(\lambda _{2}\)-a.a. \(x \in \mathscr {X}\), and by cutting off “datapoints/states of zero contributions” one can then even take \(\mathscr {X}\) small enough such that (and hence, ) for \(\lambda _{2}\)-a.a. \(x \in \mathscr {X}\). Clearly, since any \(\lambda _{2}\)-aggregated divergence \(D_{\lambda _{2}}(\cdot ,\cdot )\) satisfies (the aggregated version of) the axioms (D1) and (D2), and since \(\theta _{0}\) is identifiable, one gets immediately in terms of the corresponding \(\lambda _{2}\)-probability density functions

(81)

Inspired by this, one major idea of tracking down (respectively, learning) the true unknown \(\theta _{0}\) is to replace by a data-observation-derived – and thus noisy – probability law where the \(\lambda _{1}\)-probability density function depends, as indexed, on the outcome of the observations \(Y_{1}(\omega ),\ldots , Y_{N}(\omega )\). If converges in distribution to as N tends to infinity, then one intuitively expects to obtain the so-called minimum-divergence estimator (“approximator”)

(82)

which estimates \(\theta _{0}\) consistently in the usual sense of the convergence \(\widehat{\theta }_{N} \rightarrow \theta _{0}\) for \(N\rightarrow \infty \). However, by the nature of our divergence construction, the method (82) makes principal sense only if the two aggregators \(\lambda _{1}\) and \(\lambda _{2}\) coincide (and if (82) is analytically respectively computationally solvable)! Note that the minimum-divergence estimator (82) depends on the choice of the divergence \(D_{\lambda _{2}}(\cdot ,\cdot )\).

Subsetup 1. For instance, if by nature the set \(\mathscr {X}\) of all possible data points has only countably many elements, say \(\mathscr {X} = \mathscr {X}_{\#} = \{z_{1}, \ldots , z_{s}\}\) (where s is either an integer larger than one, or infinity), then a natural model-concerning aggregator is the counting measure \(\lambda _{2} := \lambda _{\#}\) (recall \(\lambda _{\#}[\{x\}] =1\) for all \(x \in \mathscr {X}\)), and hence (where \(\bullet \) stands for any arbitrary subset of \(\mathscr {X}\)). In such a context, a popular choice for the data-observation-derived probability law is the so-called “empirical distribution” , where \(\lambda _{1} := \lambda _{\#}= \lambda _{2}\) and is the total number of x-observations divided by the total number N of observations. In other words, , where \(\delta _{z}[\bullet ]\) is the corresponding Dirac (resp. one-point) distribution given by \(\delta _{z}[A] := \varvec{1}_{A}(z)\). Hence, in such a set-up it makes sense to solve the noisy minimization problem

(83)

where and \(D_{\lambda _{\#}}(\cdot ,\cdot )\) is the discrete version of any of the divergences above. Notice that – at least for a small enough number N of observations – for some \(x \in \mathscr {X}\) with \(\lambda _{\#}[\{x\}] >0\) one has but (i.e. an “extreme inlier”), and hence, ; this must be taken into account in the calculation of the explicit forms of the corresponding divergences.Footnote 13 By the assumed convergence, this effect disappears as N becomes large enough.    \(\square \)
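To illustrate (83) concretely, the following minimal sketch (our own, assuming a hypothetical binomial model family on \(\mathscr {X} = \{0,\ldots ,10\}\)) minimizes the discrete power divergence between the empirical \(\lambda _{\#}\)-density and the model \(\lambda _{\#}\)-density; note that “extreme-inlier” cells with empirical mass zero automatically contribute the finite boundary value \(q^{\alpha }/\alpha \), since \(\alpha > 1\).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

rng = np.random.default_rng(0)
K = 10                                    # X = {0,...,10}, i.e. s = 11 states
theta0 = 0.35                             # true unknown parameter
Y = rng.binomial(K, theta0, size=50)      # N = 50 i.i.d. observations

x = np.arange(K + 1)
p_emp = np.bincount(Y, minlength=K + 1) / Y.size   # empirical lambda_#-density

def D(theta, a=2.0):
    q = binom.pmf(x, K, theta)            # model lambda_#-density
    # cells with p_emp = 0 ("extreme inliers") contribute q**a / a
    return np.sum((p_emp**a + (a - 1) * q**a - a * p_emp * q**(a - 1))
                  / (a * (a - 1)))

res = minimize_scalar(D, bounds=(1e-3, 1 - 1e-3), method="bounded")
print("minimum-divergence estimate:", res.x)       # close to theta0
```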

Subsetup 2. Consider the “crossover case” where \(\mathscr {X}\) is uncountable (e.g. \(\mathscr {X}=\mathbb {R}\)) and the family \(\mathscr {Q}_{\varTheta }^{\lambda _{2}}\) is assumed to be continuous (nonatomic) in the sense

(84)

(e.g. are Gaussian densities with mean \(\theta \) and variance 1), and the data-observation-derived probability law is the “extended” empirical distribution

(85)

where the extension on \(\mathscr {X}\) is accomplished by attributing zeros to all x outside the finite range \(\mathscr {R}(Y_{1}(\omega ), \ldots , Y_{N}(\omega )) = \{ z_{1}(\omega ), \ldots , z_{s}(\omega ) \}\) of distinguishable points \(z_{1}(\omega ), \ldots , z_{s}(\omega )\) (\(s \leqslant N\)) occupied by the observations \(Y_{1}(\omega ), \ldots , Y_{N}(\omega )\); notice that the involved counting measure given by \(\lambda _{1}[\bullet ] := \sum _{z \in \mathscr {X}} \varvec{1}_{\mathscr {R}(Y_{1}(\omega ), \ldots , Y_{N}(\omega ))}(z) \cdot \delta _{z}[\bullet ]\) puts mass 1 on each data point z that has been observed. Because \(\lambda _{1}\) and \(\lambda _{2}\) are now essentially different, the minimum-divergence method (82) cannot be applied directly (by taking either \(\lambda := \lambda _{1}\) or \(\lambda := \lambda _{2}\)), despite converging in distribution to as N tends to infinity.    \(\square \)

There are several ways to circumvent the problem in Subsetup 2. In the following, we discuss in more detail our abovementioned new encompassing approach:

  1. (Enc1)

    take the encompassing aggregator \(\lambda := \lambda _{1} + \lambda _{2}\) and the imbedding with ;

  2. (Enc2)

choose a “sufficiently discriminating” (e.g. encompassing) divergence \(D_{\lambda }(\cdot ,\cdot )\) from above and evaluate it with the density-functions obtained in (Enc1);

  3. (Enc3)

    solve the corresponding noisy minimization problem

    (86)

    for respectively (to be defined right below);

  4. (Enc4)

compute the noisy minimal distance as an indicator of “goodness of fit” (“goodness of noisy approximation”);

  5. (Enc5)

    investigate sound statistical properties of the outcoming estimator \(\widehat{\theta }_{N}(\omega )\), e.g. show probabilistic convergence (as N tends to infinity) to the true unknown parameter \(\theta _{0}\), compute the corresponding convergence speed, analyze its robustness against data-contamination, etc.

Typically, for fixed N the step (Enc3) is not straightforward to solve, and consequently, the tasks described in the unavoidable step (Enc4) become even more complicated; a detailed discussion of both is – for the sake of brevity – beyond the scope of this paper. As far as (Enc1) is concerned, things are non-trivial due to the well-known fact that “continuous” densities are only almost-surely unique. Indeed, consider e.g. the case where the \(\theta \)-family of functions satisfies

(87)

and the alternative \(\theta \)-family of functions defined by ; for the latter, one obtains

(88)

Furthermore, due to (85) one has

(89)

and the validity of (64)–(66) with , \(\lambda _{1}+\lambda _{2}\); in other words, there holds the singularity (measure-theoretical orthogonality) for all \(\theta \in \varTheta \). Accordingly, for the step (Enc2) one can e.g. take directly the (family of) encompassing divergences \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q})\) of (73) for ,, \(\lambda := \lambda _{1} + \lambda _{2}\), \(\mathbbm {r}(x) \equiv 1\), and apply (75) to get

(90)

hence, the corresponding solution of (Enc3) does not depend on the data-observations \(Y_{1}(\omega ), \ldots , Y_{N}(\omega )\), and is thus “statistically non-relevant”. As an important remark for the rest of this paper, let us mention that only in situations where no observations are taken into account does one have , \(\mathscr {R}(Y_1, \ldots , Y_N) = \emptyset \), and the collapse of \(\lambda _{1}\) to the “zero aggregator” (i.e. \(\lambda _{1}[\bullet ] \equiv 0\)).

In contrast, let us replace the alternative \(\theta \)-family by the original , on which \(\lambda _{1}\) acts differently. In fact, instead of (88) there holds

(91)

moreover, one has for all \(\theta \in \varTheta \) the non-singularity but

(92)
(93)
(94)

Correspondingly, for the step (Enc2) one can e.g. take directly the (family of) encompassing divergences \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q})\) of (73) for , , \(\lambda := \lambda _{1} + \lambda _{2}\), \(\mathbbm {r}(x) \equiv 1\); the corresponding solution of the noisy minimization problem (Enc3) generally does depend on the data-observations \(Y_{1}(\omega ), \ldots , Y_{N}(\omega )\), as required. Let us demonstrate this exemplarily for the special subsetup where \(\phi :[0,\infty [ \rightarrow [0,\infty [\) is continuous (e.g. strictly convex on \(]0,\infty [\)), differentiable at 1, \(\phi (1) = \phi ^{\prime }(1) = 0\), \(\phi (t) \in ]0,\infty [\) for all \(t \in [0,1[ \, \cup \, ]1,\infty [\), \(\chi >1\), \(\mathbbm {r}(x) \equiv 1\), and for all \(\theta \in \varTheta \). Then, for each fixed \(\theta \in \varTheta \) we derive from (73) and (92)–(94) the divergence

(95)

When choosing this divergence (95) in step (Enc2), we call the solution \(\widehat{\theta }_{N}(\omega )\) of the corresponding noisy minimization problem (86) of step (Enc3) a minimum \((\phi ,\chi )\)-divergence estimator of the true unknown parameter \(\theta _{0}\); in ML and AI contexts, the pair \((\phi ,\chi )\) may be regarded as a “hyperparameter”. Exemplarily, for the power functions \(\phi := \phi _{\alpha }\) (cf. (5)) with \(\alpha = \chi >1\), we obtain from (95) (see also (78), (41)) the divergence

(96)

where for the last equality we have used the representation

(97)

notice that . Clearly, the outcoming minimum \((\phi ,\chi )\)-divergence estimator of (95) (and in particular, the minimum \((\phi _{\alpha },\alpha )\)-divergence estimator of (96)) depends on the data observations \(Y_{1}(\omega ), \ldots , Y_{N}(\omega )\), where for technical reasons such as existence and uniqueness – as well as for the tasks (Enc4), (Enc5) – some further assumptions are generally needed; for the sake of brevity, corresponding details will appear in a forthcoming paper.
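The contrast between the two conventions (88) and (91) can be made tangible numerically. In the following sketch (our own illustration, assuming a hypothetical Gaussian location family), the \(\lambda \)-total masses 1 respectively \(1 + \sum _{i} q_{\theta }(z_{i}) > 1\) are computed, and the data-independence underlying (90) becomes visible: under the convention (88), the \(\theta \)-dependence enters only through data-free \(\lambda _{2}\)-integrals, which for a pure location family are even constant in \(\theta \).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
Y = rng.normal(0.7, 1.0, size=40)
z = np.unique(Y)                        # atoms of lambda_1 (observed points)
grid = np.linspace(-6, 8, 4001)         # quadrature grid for the lambda_2-part

def total_mass(theta, keep_atom_values):
    mass_cont = np.trapz(norm.pdf(grid, theta, 1.0), grid)   # lambda_2-part: 1
    # lambda_1-part: 0 under convention (88), sum of q_theta(z_i) under (91)
    mass_atoms = norm.pdf(z, theta, 1.0).sum() if keep_atom_values else 0.0
    return mass_cont + mass_atoms

print(total_mass(0.7, False))           # convention (88): total mass 1
print(total_mass(0.7, True))            # convention (91): total mass > 1

# Under convention (88) the model density vanishes on the atoms, so the
# theta-dependent part of the divergence reduces to data-free integrals like
# the following one, which for a location family does not vary with theta:
for th in (-1.0, 0.0, 1.5):
    print(np.trapz(norm.pdf(grid, th, 1.0)**2.0, grid))      # identical values
```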

4.4 Minimum Divergences - Grouping and Smoothing

Next, we briefly indicate two other ways to circumvent the problem described in Subsetup 2 of Sect. 4.3, with continuous (nonatomic) \(\mathscr {Q}_{\varTheta }^{\lambda _{2}}\) and \(\lambda _{2}\) from (84):

  1. (GR)

grouping (partitioning, quantization) of data: convert everything into a purely discrete context,Footnote 14 by subdividing the data-point-set \(\mathscr {X} = \bigcup _{j=1}^{s} A_{j}\) into countably many – (say) \(s \in (\mathbb {N}\cup \{\infty \}) \backslash \{1\}\) – (measurable) disjoint classes \(A_{1}, \ldots , A_{s}\) with the property \(\lambda _{2}[A_{j}] >0\) (“essential partition”); proceed as in Subsetup 1 of Sect. 4.3, with \(\mathscr {X}^{new} := \{A_{1}, \ldots , A_{s}\}\) instead of \(\{z_1, \ldots , z_s\}\), so that the ith data observation \(Y_{i}(\omega )\) (and the corresponding running variable x) manifests (only) the corresponding class-membership. For the subcase of Csiszar-Ali-Silvey divergences and adjacently related divergences, thorough statistical investigations (such as efficiency, robustness, types of grouping, grouping-error sensitivity, etc.) of the corresponding minimum-divergence-estimation can be found e.g. in Victoria-Feser and Ronchetti [92], Menendez et al. [47,48,49], Morales et al. [52, 53], Lin and He [43].

  2. (SM)

smoothing of the empirical density function: convert everything to a purely continuous context, by keeping the original data-point-set \(\mathscr {X}\) and by “continuously modifying” (e.g. with the help of kernels) the empirical density to a function such that and that for all \(\theta \in \varTheta \) there holds: if and only if (in addition to (80)). For the subcase of Csiszar-Ali-Silvey divergences, thorough statistical investigations (such as efficiency, robustness, information loss, etc.) of the corresponding minimum-divergence-estimation can be found e.g. in Basu and Lindsay [11], Park and Basu [69], Chapter 3 of Basu et al. [13], Kuchibhotla and Basu [39], Al Mohamad [5], and the references therein. Due to the “curse of dimensionality”, such a solution cannot be applied successfully in a high-dimensional setting, as required in the so-called “big data” paradigm. For instance (in preparation for divergence evaluation), take \(\mathscr {X}=\mathbb {R}^d\), \(\lambda _{2}\) to be the \(d\)-dimensional Lebesgue measure and where \(K(\cdot ,\cdot ,\cdot )\) is an appropriate smooth kernel function with “bandwidth” \(h_{n}\), e.g. with appropriate nonnegative function satisfying . Since such kernel smoothers (KS) use local averaging, and since for large d most neighborhoods tend to be empty of data observations (data often “live” on lower-dimensional manifolds; sparsity of data), a typical KS technique (choosing concrete kernels and bandwidths, etc.) then needs a huge number N of data points to provide reasonable accuracy; for \(d=8\) one may need N to be 1 million. For background details, the reader is e.g. referred to DasGupta [28], Scott and Wand [77], Chapter 7 of Scott [76] and the references therein; a minimal numerical sketch of such a smoothed minimum-divergence fit follows right after this list.

For the sake of brevity, a detailed discussion of (GR) and (SM) is beyond the scope of this paper.
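As a minimal one-dimensional illustration of (SM) (our own sketch, assuming a hypothetical Gaussian location model, a Gaussian kernel and a rule-of-thumb bandwidth), one may smooth the empirical density and then minimize a \(\lambda _{2}\)-aggregated power divergence between the smoothed density and the model density via quadrature:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(2)
Y = rng.normal(1.2, 1.0, size=200)                # N = 200 observations, d = 1
h = 1.06 * Y.std() * Y.size ** (-1 / 5)           # rule-of-thumb bandwidth h_N
grid = np.linspace(Y.min() - 4, Y.max() + 4, 2001)

# kernel-smoothed empirical density w.r.t. lambda_2 = Lebesgue measure
p_smooth = norm.pdf(grid[:, None], Y[None, :], h).mean(axis=1)

def D(theta, a=2.0):                              # power divergence, alpha = 2
    q = norm.pdf(grid, theta, 1.0)
    integrand = (p_smooth**a + (a - 1) * q**a
                 - a * p_smooth * q**(a - 1)) / (a * (a - 1))
    return np.trapz(integrand, grid)

res = minimize_scalar(D, bounds=(-5, 5), method="bounded")
print("smoothed minimum-divergence estimate:", res.x)   # close to 1.2
```

In higher dimensions d, both the quadrature grid and the kernel averaging grow exponentially, which is precisely the curse-of-dimensionality effect described in (SM) above.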

4.5 Minimum Divergences - The Decomposability Method

Let us discuss yet another strategy to circumvent the problem described in Subsetup 2 of Sect. 4.3. As a motivation, for a divergence of the form

$$\begin{aligned}& \textstyle 0 \leqslant D_{\lambda }(\mathbbm {P},\mathbbm {Q}) = \int _{{\mathscr {X}}} f_{1}(x) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \int _{{\mathscr {X}}} f_{2}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \cdot \varvec{1}_{]0,\infty [}(\mathbbm {q}(x)) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \int _{{\mathscr {X}}} f_{3}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \cdot \varvec{1}_{]0,\infty [}(\mathbbm {p}(x)) \, \mathrm {d}\lambda (x) \qquad \ \ \end{aligned}$$
(98)

with \(f_{1}(x)\geqslant 0\), \(f_{2}(x)\geqslant 0\), \(f_{3}(x)\geqslant 0\), and an “adjacent” dissimilarity

$$\begin{aligned}& \textstyle \widetilde{D_{\lambda }}(\mathbbm {P},\mathbbm {Q}) = \int _{{\mathscr {X}}} f_{1}(x) \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \cdot \mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \int _{{\mathscr {X}}} g_{2}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \cdot \varvec{1}_{]0,\infty [}(\mathbbm {q}(x)) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \int _{{\mathscr {X}}} g_{3}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {q}(x)\big ) \cdot \varvec{1}_{]0,\infty [}(\mathbbm {p}(x)) \, \mathrm {d}\lambda (x), \qquad \ \ \end{aligned}$$
(99)

there holds \(D_{\lambda }(\mathbbm {P},\mathbbm {Q}) = \widetilde{D_{\lambda }}(\mathbbm {P},\mathbbm {Q})\) for all equivalent \(\mathbbm {P} \sim \mathbbm {Q}\) (where for both, the second and third integral become zero), but (in case that \(g_{2}(\cdot )\), \(g_{3}(\cdot )\) differ sufficiently from \(f_{2}(\cdot )\), \(f_{3}(\cdot )\)) one gets \(D_{\lambda }(\mathbbm {P},\mathbbm {Q}) \ne \widetilde{D_{\lambda }}(\mathbbm {P},\mathbbm {Q})\) for \(\mathbbm {P} \perp \mathbbm {Q}\) and even for \(\mathbbm {P} \not \sim \mathbbm {Q}\); in the latter two cases, depending on the signs of \(g_{2}(\cdot )\), \(g_{3}(\cdot )\), \(\widetilde{D_{\lambda }}(\mathbbm {P},\mathbbm {Q})\) may even become negative.
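Numerically, the effect looks as follows (our own toy sketch on a three-point space with counting aggregator, where \(f_{1}\) is a power-divergence integrand on the common support, \(f_{2}\), \(f_{3}\) are one concrete choice of boundary terms, and \(\widetilde{D_{\lambda }}\) is obtained by setting \(g_{2} = g_{3} = 0\)):

```python
import numpy as np

A = 2.0                                       # alpha > 1

def f1(p, q, a=A):                            # overlap term (p > 0 and q > 0)
    return (p**a + (a - 1) * q**a - a * p * q**(a - 1)) / (a * (a - 1))

def dissimilarity(p, q, g2, g3):
    both = (p > 0) & (q > 0)
    p0 = (p == 0) & (q > 0)                   # boundary: p vanishes
    q0 = (q == 0) & (p > 0)                   # boundary: q vanishes
    return f1(p[both], q[both]).sum() + g2(q[p0]).sum() + g3(p[q0]).sum()

f2 = lambda q: q**A / A                       # boundary choice for D, cf. (98)
f3 = lambda p: p**A / (A * (A - 1.0))
g0 = lambda v: np.zeros_like(v)               # boundary choice for D-tilde, (99)

P1, Q1 = np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.5, 0.3])   # P ~ Q
P2, Q2 = np.array([0.5, 0.5, 0.0]), np.array([0.0, 0.5, 0.5])   # P not~ Q

print(dissimilarity(P1, Q1, f2, f3), dissimilarity(P1, Q1, g0, g0))  # equal
print(dissimilarity(P2, Q2, f2, f3), dissimilarity(P2, Q2, g0, g0))  # differ
```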

Such issues are of importance for our current problem where e.g. . For further illuminations, and for the sake of a compact presentation, we use henceforth the notations \(\mathscr {P}^{\lambda }\) for an arbitrarily fixed class of nonnegative, mutually equivalent functions (i.e. \(\mathbbm {P}_{1} \sim \mathbbm {P}_{2}\) for all \(\mathbbm {P}_{1} \in \mathscr {P}^{\lambda }\), \(\mathbbm {P}_{2} \in \mathscr {P}^{\lambda }\)), and \(\mathscr {P}^{\lambda \not \sim }\) for a corresponding class of nonnegative (not necessarily mutually equivalent) functions such that \(\mathbbm {P}_{1} \not \sim \mathbbm {P}_{2}\) for all \(\mathbbm {P}_{1} \in \mathscr {P}^{\lambda }\), \(\mathbbm {P}_{2} \in \mathscr {P}^{\lambda \not \sim }\). Furthermore, we employ \(\widetilde{\mathscr {P}}^{\lambda } := \mathscr {P}^{\lambda } \cup \mathscr {P}^{\lambda \not \sim }\) and specify:

Definition 2

We say that a function \(D_{\lambda }: \widetilde{\mathscr {P}}^{\lambda } \otimes \mathscr {P}^{\lambda } \rightarrow \mathbb {R}\) is a pseudo-divergence on  \(\widetilde{\mathscr {P}}^{\lambda } \times \mathscr {P}^{\lambda }\), if its restriction to \(\mathscr {P}^{\lambda } \times \mathscr {P}^{\lambda }\) is a divergence, i.e.

$$\begin{aligned}& \textstyle D_{\lambda }( \mathbbm {P},\mathbbm {Q}) \geqslant 0 \ \ \text {for all} \ \ \mathbbm {P} \in \mathscr {P}^{\lambda },\mathbbm {Q} \in \mathscr {P}^{\lambda }, \quad \text {and} \\& \textstyle D_{\lambda }( \mathbbm {P},\mathbbm {Q}) = 0 \ \ \text {if and only if} \ \ \mathbbm {P} = \mathbbm {Q} \in \mathscr {P}^{\lambda } \, . \nonumber \end{aligned}$$
(100)

If also \(D_{\lambda }( \mathbbm {P},\mathbbm {Q}) > 0\) for all \(\mathbbm {P} \in \mathscr {P}^{\lambda \not \sim }\), \(\mathbbm {Q} \in \mathscr {P}^{\lambda }\), then \(D_{\lambda }(\cdot ,\cdot )\) is a divergence.

As for interpretation, a pseudo-divergence \(D_{\lambda }( \cdot ,\cdot )\) acts like a divergence if both arguments are from \(\mathscr {P}^{\lambda }\), but only like a dissimilarity if the first argument is from \(\mathscr {P}^{\lambda \not \sim }\) and thus is “quite different” from the second argument. In the following, we often use pseudo-divergences for our noisy minimum-distance-estimation problem – cf. (81), (82) – by taking \(\lambda =\lambda _{1}+\lambda _{2}\), (cf. (87), (88)), and (cf. (85), (Enc1)) covering all numbers N of data observations (sample sizes), and the according \(\widetilde{\mathscr {P}}^{\lambda } := \mathscr {P}_{\varTheta ,emp}^{\lambda } = \mathscr {P}_{\varTheta }^{\lambda } \cup \mathscr {P}_{emp}^{\lambda \perp }\); notice that by construction we have even the function-class-relationship \(\perp \) which is stronger than \(\not \sim \). In such a setup, we have seen that for the choice , the divergence \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {1}\cdot \mathbbm {Q}^{\chi },\lambda }(\mathbbm {P},\mathbbm {Q}) > 0\) of (90) is unpleasant for (Enc3) since the solution does not depend on the data-observations \(Y_{1}(\omega ), \ldots , Y_{N}(\omega )\); also recall the special case of power functions \(\phi := \phi _{\alpha }\) (cf. (5)) with \(\alpha = \chi >1\) which amounts to the unscaled divergences (78), (40) and thus to (41). In (95), for general \(\phi \) we have repaired this deficiency by replacing with , at the cost of getting total mass larger than 1 but by keeping the strict positivity of the involved divergence; especially for \(\phi := \phi _{\alpha }\), the divergence (41) has then amounted to (96).

In contrast, let us show another method to repair the (Enc3)-deficiency of (41), by sticking to but changing the underlying divergence. In fact, we deal with the even more general

Definition 3

(a) We say that a pseudo-divergence \(D_{\lambda }: \widetilde{\mathscr {P}}^{\lambda } \otimes \mathscr {P}^{\lambda } \rightarrow \mathbb {R}\) is decomposable if there exist functionals \(\mathfrak {D}^{0}: \widetilde{\mathscr {P}}^{\lambda } \mapsto \mathbb {R}\), \(\mathfrak {D}^{1}: \mathscr {P}^{\lambda } \mapsto \mathbb {R}\) and a (measurable) mapping \(\rho _{\mathbbm {Q}}:\mathscr {X} \mapsto \mathbb {R}\) (for each \(\mathbbm {Q} \in \mathscr {P}^{\lambda }\)) such thatFootnote 15

$$\begin{aligned}& \textstyle D_{\lambda }(\mathbbm {P},\mathbbm {Q}) \, = \, \mathfrak {D}^{0}(\mathbbm {P})+\mathfrak {D}^{1}(\mathbbm {Q}) +\int _{\mathscr {X}} \rho _{\mathbbm {Q}}(x) \cdot \mathbbm {p}(x) \, \mathrm {d}\lambda (x) \ \ \text {for all }\mathbbm {P} \in \widetilde{\mathscr {P}}^{\lambda }, \mathbbm {Q} \in \mathscr {P}^{\lambda }. \nonumber \\ \end{aligned}$$
(101)

(b) We say that a pseudo-divergence \(D_{\lambda }: \widetilde{\mathscr {P}}^{\lambda } \otimes \mathscr {P}^{\lambda } \rightarrow \mathbb {R}\) is pointwise decomposable if it is of the form \(D_{\lambda }(\mathbbm {P},\mathbbm {Q}) = \int _\mathscr {X} \psi ^{dec}(\mathbbm {p}(x), \mathbbm {q}(x)) \, \mathrm {d}\lambda (x)\) for some (measurable) mapping \(\psi ^{dec}: [0,\infty [ \times [0,\infty [ \mapsto \mathbb {R}\) with representation

$$\begin{aligned}& \textstyle \psi ^{dec}(s,t) := \psi ^{0}\big (s + h_{0}(x,s) \cdot \varvec{1}_{\{0\}}(t)\big ) \cdot \varvec{1}_{]\overline{c}_{0},\infty [}(s) \cdot \varvec{1}_{]c_{0},\infty [}(t) \nonumber \\[-0.1cm]& +\psi ^{1}\big (t + h_{1}(x) \cdot \varvec{1}_{\{0\}}(t)\big ) \cdot \varvec{1}_{]c_{1},\infty [}(t) \nonumber \\& + \rho \big (t + h_{2}(x) \cdot \varvec{1}_{\{0\}}(t)\big ) \cdot s \quad \text { for all}\, (s,t) \in [0,\infty [ \times [0,\infty [ \backslash \{(0,0)\} \ , \qquad \ \ \\& \textstyle \psi ^{dec}(0,0):=0, \nonumber \end{aligned}$$
(102)

with constants \(c_{0},c_{1},\overline{c}_{0} \in \{-1,0,1\}\), and (measurable) mappings \(\psi ^{0},\psi ^{1},\rho : [0,\infty [ \mapsto \mathbb {R}\), \(h_{1},h_{2} : \mathscr {X} \mapsto [0,\infty [\), \(h_{0} : \mathscr {X} \times [0,\infty [ \mapsto \mathbb {R}\), such that

$$\begin{aligned}& \textstyle \psi ^{dec}(s,t)= \psi ^{0}(s) +\psi ^{1}(t) + \rho (t) \cdot s \geqslant 0 \quad \text {for all}\, (s,t) \in ]0,\infty [ \times ]0,\infty [ \, , \quad \ \end{aligned}$$
(103)
$$\begin{aligned}& \textstyle \psi ^{dec}(s,t) = 0 \quad \text {if and only if} \quad s=t \, , \\& \textstyle s + h_{0}(x,s) \geqslant 0 \quad \text {for all}\, s \in [0,\infty [\, \text {and}\, \lambda \text {-almost all}\, x \in \mathscr {X} \, . \nonumber \end{aligned}$$
(104)

Remark 5

(a) Any pointwise decomposable pseudo-divergence is decomposable, under the additional assumption that the integral \(\int _\mathscr {X} \ldots \, \mathrm {d}\lambda (x)\) can be split into three appropriate parts.

(b) For use in (Enc3), \(\mathfrak {D}^{1}(\cdot )\) and \(\rho _{\mathbbm {Q}}(\cdot )\) should be non-constant.

(c) In the Definitions 2 and 3 we have put the “extension-role” to the first component \(\mathbbm {P}\); of course, everything can be worked out analogously for the second component \(\mathbbm {Q}\) by using (pseudo-)divergences \(D_{\lambda }: \mathscr {P}^{\lambda } \times \widetilde{\mathscr {P}}^{\lambda } \rightarrow \mathbb {R}\).

(d) We could even extend (102) for bivariate functions \(h_{1}(x,s)\), \(h_{2}(x,s)\).    \(\square \)
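For a quick numerical sanity check of (103) and (104) (our own sketch), one can take the decomposition of the Kullback–Leibler generator \(\phi _{1}(t) = t \cdot \log t + 1 - t\) which reappears below, namely \(\psi ^{0}(s) = s \log s - s\), \(\psi ^{1}(t) = t\), \(\rho (t) = -\log t\), and verify nonnegativity with equality exactly on the diagonal \(s = t\):

```python
import numpy as np

def psi0(s): return s * np.log(s) - s    # psi^0 for phi_1(t) = t log t + 1 - t
def psi1(t): return t
def rho(t):  return -np.log(t)

s = np.linspace(0.01, 3.0, 300)
S, T = np.meshgrid(s, s)
psi_dec = psi0(S) + psi1(T) + rho(T) * S    # = S log(S/T) - S + T, cf. (103)

print(psi_dec.min() >= -1e-12)              # nonnegativity (103)
mask = np.abs(psi_dec) < 1e-12              # (near-)zeros of psi_dec
print(np.allclose(S[mask], T[mask]))        # they lie on the diagonal, cf. (104)
```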

Notice that from (102) one obtains the boundary behaviour

(105)
(106)

with , , . Notice that \(\psi ^{dec}(s,0)\) of (105) generally does not coincide with the possibly existing “(103)-limit” \(\lim _{t \rightarrow 0} [\psi ^{0}(s) +\psi ^{1}(t) + \rho (t) \cdot s]\) (\(s>0\)), which reflects a possibly “non-smooth boundary behaviour” (also recall (98), (99)). Moreover, when choosing a decomposable pseudo-divergence (101) in step (Enc2), we operationalize the solution \(\widehat{\theta }_{N}(\omega )\) of the corresponding noisy minimization problem (86) of step (Enc3) as follows:

Definition 4

(a) We say that a functional \(T_{D_{\lambda }}: \mathscr {P}_{\varTheta ,emp}^{\lambda } \mapsto \varTheta \) generates a minimum decomposable pseudo-divergence estimator (briefly, \(\min -decD_{\lambda }\)-estimator)

(107)

of the true unknown parameter \(\theta _{0}\), if \(D_{\lambda }(\cdot ,\cdot ): \mathscr {P}_{\varTheta ,emp}^{\lambda } \otimes \mathscr {P}_{\varTheta }^{\lambda } \mapsto \mathbb {R}\) is a decomposable pseudo-divergence and

(108)

(b) If \(D_{\lambda }(\cdot ,\cdot )\) is a pointwise decomposable pseudo-divergence we replace (108) by

but do not introduce a new notion (also recall that \(\lambda =\lambda _{2}\) and for the case of no observations, e.g. if ).

To proceed, let us point out that by (107) and (97) every \(\min -decD_{\lambda }\)-estimator rewrites straightforwardly as

(109)

and is Fisher consistent in the sense that

(110)

Furthermore, the criterion to be minimized in (109) is of the form

which e.g. for the task (Enc5) opens the possibility to apply the methods of the asymptotic theory of so-called M-estimators (cf. e.g. Hampel et al. [33], van der Vaart and Wellner [88], Liese and Miescke [40]). The concept of \(\min -decD_{\lambda }\)-estimators (101) was introduced in Vajda [90] and Broniatowski and Vajda [18] within the probability-law-restriction of the non-encompassing, “plug-in” context of footnote 15.
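In code, this M-estimation structure is transparent: up to the data-only term \(\mathfrak {D}^{0}(\cdot )\), the criterion of (109) is a model term plus an empirical average. The following generic skeleton (our own sketch; the callables `psi1_int` and `rho` are hypothetical placeholders for \(\mathfrak {D}^{1}(\mathbbm {Q}_{\theta })\) and \(\rho _{\mathbbm {Q}_{\theta }}\)) is instantiated with concrete decompositions further below:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def min_dec_estimator(psi1_int, rho, data, bounds):
    """Minimize theta -> D^1(Q_theta) + (1/N) * sum_i rho_{Q_theta}(Y_i),
    i.e. the criterion of (109) with the theta-free term D^0(P^emp) dropped.

    psi1_int(theta): the aggregated model term D^1(Q_theta);
    rho(theta, y):   the mapping rho_{Q_theta}, evaluated at data points y.
    """
    data = np.asarray(data)
    def criterion(theta):
        return psi1_int(theta) + np.mean(rho(theta, data))
    return minimize_scalar(criterion, bounds=bounds, method="bounded").x
```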

In the following, we demonstrate that our new concept of pointwise decomposability defined by (102) is very useful and flexible for creating new \(\min -decD_{\lambda }\)-estimators and imbedding existing ones. In fact, since in our current statistics-ML-AI context we have chosen \(\lambda [\bullet ] := \lambda _{1}[\bullet ] + \lambda _{2}[\bullet ]\) with \(\lambda _{1}[\bullet ] := \sum _{z \in \mathscr {X}} \varvec{1}_{\mathscr {R}(Y_{1}(\omega ), \ldots , Y_{N}(\omega ))}(z) \cdot \delta _{z}[\bullet ]\) and \(\lambda _{2}[\bullet ]\) stemming from (87), we have seen that for all \(\theta \in \varTheta \). Hence, from (102), (105), (106) we obtain

(111)

where we have employed (97); recall that . Hence, we always choose . Notice that the functions \(h_{0}\), \(h_{1}\), \(h_{2}\) may depend on the parameter \(\theta \). Indeed, for \(h_{0}(x,s) \equiv 0\), \(h_{1}(x) \equiv 0\), (), the pseudo-divergence (111) turns into

(112)

whereas for \(h_{0}(x,s) \equiv 0\), , , (111) becomes

(113)

The last sum in (112) respectively (113) is the desired . As an example, let us take \(c_{0} = c_{1}= \overline{c}_{0}= -1\) (and hence, ) and for \(\alpha >1\) the power functions \(\phi (t) := \phi _{\alpha }(t) := \frac{t^{\alpha }- \alpha \cdot t + \alpha - 1}{\alpha \cdot (\alpha -1)}\) (\(t \in ]0,\infty [\)) of (6), for which by (9) and (103) one derives immediately the decomposition \(\psi ^{0}(t) := \psi _{\alpha }^{0}(t) := \frac{t^{\alpha }}{\alpha (\alpha -1)} >0\), \(\psi ^{1}(t) := \psi _{\alpha }^{1}(t) := \frac{t^{\alpha }}{\alpha } >0\), \(\rho (t) := \rho _{\alpha }(t) := - \frac{t^{\alpha -1}}{\alpha -1} < 0\) (\(t \in ]0,\infty [\)). Accordingly, (111) simplifies to

(114)

and in particular the special case (112) turns into

(115)

whereas the special case (113) simplifies to

(116)

Notice that (116) coincides with (96), but both were derived within quite different frameworks: to obtain (116) we have used the concept of decomposable pseudo-divergences (which may generally become negative at the boundary) together with which leads to total mass of 1 (cf. (88)); on the other hand, for establishing (96) we have employed the concept of divergences (which are in general strictly positive at the boundary) together with which amounts to total mass greater than 1 (cf. (91)). Moreover, choosing \(h_{0}(x,s) \equiv 0\), \(h_{1}(x) \equiv 0\), \(h_{2}(x) \equiv 0\) in (114) gives exactly the divergence (90) for the current generator \(\phi (t) := \phi _{\alpha }(t)\) with \(\alpha >1\); recall that the latter has been a starting motivation for the search of repairs. For \(c_{0} = c_{1}= \overline{c}_{0}= -1\) and the limit case \(\alpha \rightarrow 1\) one gets \(\phi (t) := \phi _{1}(t) := t \cdot \log t + 1 - t\) (\(t \in ]0,\infty [\)) of (18), for which by (22) and (103) we obtain the decomposition \(\psi ^{0}(t) := \psi _{1}^{0}(t) := t \cdot \log t - t \), \(\psi ^{1}(t) := \psi _{1}^{1}(t) := t >0\), \(\rho (t) := \rho _{1}(t) := - \log t\). Accordingly, (111) simplifies to

(117)

and in particular the special case (112) turns into

(118)

whereas the special case (113) becomes

(119)

To end this subsection, let us briefly indicate that by choosing in step (Enc2) a decomposable pseudo-divergence of the form (111)–(119), and in the course of (Enc3) minimizing it over \(\theta \in \varTheta \), we end up at the corresponding \(\min -decD_{\lambda }\)-estimator (109). For the special case (118) (i.e. \(\alpha = 1\)) this leads to the omnipresent, celebrated maximum-likelihood-estimator (MLE) which is known to be efficient but not robust. The particular choice (115) for \(\alpha > 1\) gives the density-power divergence estimator DPDE of Basu et al. [10], where \(\alpha =2\) amounts to the (squared) \(L_{2}\)-estimator which is robust but not efficient (see e.g. Hampel et al. [33]); accordingly, taking \(\alpha \in ]1,2[\) builds a smooth bridge between robustness and efficiency. The reversed version of the DPDE can be analogously imbedded in our context, by employing our new approach with \(\phi (t) := \widetilde{\widetilde{\phi }}_{\alpha }(t)\) (cf. (79)).
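To see these bridges numerically, here is a self-contained sketch (our own illustration, assuming a hypothetical Gaussian location model with unit variance and contaminated data): the criterion uses \(\psi _{\alpha }^{1}(t) = t^{\alpha }/\alpha \), \(\rho _{\alpha }(t) = -t^{\alpha -1}/(\alpha -1)\) for \(\alpha > 1\), respectively \(\psi _{1}^{1}(t) = t\), \(\rho _{1}(t) = -\log t\) in the MLE case \(\alpha = 1\); the model term is evaluated by quadrature (for a location family it is even constant in \(\theta \)):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(3)
Y = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 8.0)])  # 5% outliers
grid = np.linspace(-10, 15, 5001)

def criterion(theta, alpha):
    q_data = norm.pdf(Y, theta, 1.0)
    q_grid = norm.pdf(grid, theta, 1.0)
    if alpha == 1.0:                          # case (118): maximum likelihood
        return np.trapz(q_grid, grid) - np.mean(np.log(q_data))
    # case alpha > 1: model term plus empirical rho-average
    model_term = np.trapz(q_grid**alpha, grid) / alpha
    return model_term - np.mean(q_data**(alpha - 1.0)) / (alpha - 1.0)

for alpha in (1.0, 1.5, 2.0):
    est = minimize_scalar(criterion, args=(alpha,),
                          bounds=(-5, 10), method="bounded").x
    print(f"alpha = {alpha}: estimate = {est:+.3f}")
# alpha = 1 (MLE) is dragged towards the outliers at 8, whereas alpha = 2
# (the squared-L2 case) stays near 0; alpha in ]1,2[ interpolates.
```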

4.6 Minimum Divergences - Generalized Subdivergence Method

One can flexibilize some of the methods of the previous Sect. 4.5, by employing an additional (a.s.) strictly positive density function \(\mathbbm {M}\) to define a pseudo-divergence \(D_{\mathbbm {M},\lambda }: \widetilde{\mathscr {P}}^{\lambda } \otimes \mathscr {P}^{\lambda } \rightarrow \mathbb {R}\) of the form \(D_{\mathbbm {M},\lambda }(\mathbbm {P},\mathbbm {Q}) = \int _\mathscr {X} \psi ^{dec}\big (\frac{\mathbbm {p}(x)}{\mathbbm {m}(x)}, \frac{\mathbbm {q}(x)}{\mathbbm {m}(x)}\big ) \cdot \mathbbm {m}(x) \, \mathrm {d}\lambda (x)\) for some (measurable) mapping \(\psi ^{dec}: [0,\infty [ \times [0,\infty [ \mapsto \mathbb {R}\) with representation

$$\begin{aligned}& \textstyle \psi ^{dec}(s,t) := \psi ^{0}\Big (s + h_{0}(x,s) \cdot \varvec{1}_{\{0\}}(t)\Big ) \cdot \varvec{1}_{]\overline{c}_{0},\infty [}(s) \cdot \varvec{1}_{]c_{0},\infty [}(t) \nonumber \\& +\psi ^{1}\Big (t + h_{1}(x) \cdot \varvec{1}_{\{0\}}(t)\Big ) \cdot \varvec{1}_{]c_{1},\infty [}(t) \nonumber \\& + \rho \Big (t + h_{2}(x) \cdot \varvec{1}_{\{0\}}(t)\Big ) \cdot s \ \ \text { for all}\, (s,t) \in [0,\infty [ \times [0,\infty [ \backslash \{(0,0)\} \quad \text {(cf. (102))}, \nonumber \\& \textstyle \psi ^{dec}(0,0):=0. \nonumber \end{aligned}$$

It is straightforward to see that \(D_{\mathbbm {M},\lambda }(\cdot ,\cdot )\) is a pointwise decomposable pseudo-divergence in the sense of Definition 3(b), and one gets for fixed \(m >0\)

$$\begin{aligned}& \textstyle \psi _{m}^{dec}(s,t) := m \cdot \psi ^{dec}\Big (\frac{s}{m},\frac{t}{m}\Big ) = m \cdot \psi ^{0}\Big (\frac{s}{m}\Big ) + m \cdot \psi ^{1}\Big (\frac{t}{m}\Big ) + \rho \Big (\frac{t}{m}\Big ) \cdot s \geqslant 0 \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \text {for all}\, (s,t) \in ]0,\infty [ \times ]0,\infty [ \, , \quad \ \end{aligned}$$
(120)
(121)
(122)

For each class-family member with arbitrarily fixed \(\tau \in \varTheta \), we can apply Definition 4 to , and arrive at the corresponding -estimators

(123)

of the true unknown parameter \(\theta _{0}\). Hence, analogously to the derivation of (111), we obtain from (102), (121), (122) for each \(\tau \in \varTheta \)

(124)

Just as in the derivation of (112) respectively (113), reasonable choices for the “boundary-functions” in (124) are \(h_{0}(x,s) \equiv 0\), , respectively \(h_{0}(x,s) \equiv 0\), , . As an example, consider for all \(\theta _{0},\theta ,\tau \in \varTheta \) the scaled Bregman divergences in the sense of Stummer [81], Stummer and Vajda [84] (cf. Remark ()(b)), for which we get from (36) with \(r(x) \equiv 1\)

(125)

from which – together with (120) – one can identify immediately the pointwise decomposability with \(\psi ^{0}(s) := \psi _{\phi }^{0}(s) := \phi (s)\), \(\psi ^{1}(t) := \psi _{\phi }^{1}(t) := t \cdot \phi _{+,c}^{\prime }(t) -\phi (t)\), \(\rho (t) := \rho _{\phi }(t) := - \phi _{+,c}^{\prime }(t)\); by plugging this into (124), one obtains the objective , which in the course of (Enc3) should be – for fixed \(\tau \in \varTheta \) – minimized over \(\theta \in \varTheta \) in order to obtain the corresponding “\(\tau \)-individual” -estimator . Recall that this choice can be motivated by and . Furthermore, one gets even , , and in case of also , . This suggests the alternative, “\(\tau \)-uniform” estimators , respectively . As a side remark, let us mention that in general, (say) is not necessarily decomposable anymore, and therefore the standard theory of M-estimators is not applicable to this class of estimators.
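The decomposition just read off from (125) is easily verified numerically; in the following sketch (our own, using the differentiable power generator \(\phi _{\alpha }\) as a test case, so that \(\phi _{+,c}^{\prime } = \phi ^{\prime }\)) the three pieces reassemble exactly into the Bregman integrand \(\phi (s) - \phi (t) - \phi ^{\prime }(t) \cdot (s-t) \geqslant 0\):

```python
import numpy as np

A = 2.5                                         # alpha

def phi(t, a=A):                                # power generator phi_alpha
    return (t**a - a * t + a - 1) / (a * (a - 1))

def phi_prime(t, a=A):
    return (t**(a - 1) - 1) / (a - 1)

psi0 = phi                                      # pieces read off from (125)
psi1 = lambda t: t * phi_prime(t) - phi(t)
rho  = lambda t: -phi_prime(t)

s = np.linspace(0.05, 4.0, 200)
S, T = np.meshgrid(s, s)
lhs = psi0(S) + psi1(T) + rho(T) * S            # pointwise decomposable form
rhs = phi(S) - phi(T) - phi_prime(T) * (S - T)  # Bregman integrand
print(np.allclose(lhs, rhs), lhs.min() >= -1e-12)
```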

With our approach, we can generate numerous further estimators of the true unknown parameter \(\theta _{0}\), by permuting the positions – but not the roles (!) – of the parameters \((\theta _{0},\theta ,\tau )\) in the (pseudo-)divergences of the above investigations. For the sake of brevity, we only sketch two further cases; the full variety will appear elsewhere. To start with, consider the adaptively scaled and aggregated divergence

(indeed, by Theorem 4 and (80) this is zero if and only if \(\theta = \theta _{0}\)). By means of the involved mappings \(\psi ^{0}(s) := \psi _{m}^{0,rev}(s) := s \cdot \phi (\frac{m}{s})\), \(\psi ^{1}(t) := \psi _{m}^{1,rev}(t) := - m \cdot \phi _{+,c}^{\prime }(\frac{m}{t})\), \(\rho (t) := \rho _{m}^{rev}(t) := \frac{m}{t} \cdot \phi _{+,c}^{\prime }(\frac{m}{t}) - \phi (\frac{m}{t}) =: \phi ^{\circledcirc }(\frac{m}{t})\) (\(s,t,m >0\)), the properties (103), (104) are applicable and thus this divergence can be extended to a pointwise decomposable pseudo-divergence on \(\widetilde{\mathscr {P}}^{\lambda } \otimes \mathscr {P}^{\lambda }\) by using (102) with appropriate functions \(h_{0}\), \(h_{1}\), \(h_{2}\) and constants \(c_{0}\), \(c_{1}\), \(\overline{c}_{0}\). Furthermore, by minimizing over \(\theta \in \varTheta \) the objective (111) with these choices \(\psi _{m}^{0,rev}(\cdot )\), \(\psi _{m}^{1,rev}(\cdot )\), \(\rho _{m}^{rev}(\cdot )\), in the course of (Enc3) we end up at the corresponding -estimator. In particular, the corresponding special case \(h_{0}(x,s) \equiv 0\), \(h_{1}(x) \equiv 1\), () leads to the objective (cf. (112) but with \(\psi ^{1}(1)\) instead of \(\psi ^{1}(0)\))

to be minimized over \(\theta \). As a second possibility to permute the positions of the parameters \((\theta _{0},\theta ,\tau )\), let us consider

(126)

this is a pointwise decomposable divergence between and , but it is not a divergence – yet still a nonnegative and obviously not pointwise decomposable functional – between and . Indeed, for \(\theta = \theta _{0} \ne \tau \) one obtains . Notice that from (126) one gets

(127)

provided that the integral on the right-hand side exists and is finite. If moreover \(\phi (1)=0\), then by (54) the inequality (127) rewrites as

(128)

with (for fixed \(\theta \)) equality if and only if \(\theta _{0} = \tau \); this implies that

(129)
(130)

with \(\psi ^{0}(s) := \psi _{m}^{0,sub}(s) \equiv 0\), \(\psi ^{1}(t) := \psi _{m}^{1,sub}(t) := t\cdot \phi (\frac{m}{t}) - m \cdot \phi _{+,c}^{\prime }(\frac{m}{t})\), \(\rho (t) := \rho _{m}^{sub}(t) := \phi _{+,c}^{\prime }(\frac{m}{t})\) (\(s,t,m >0\)). In other words, this means that the Csiszar-Ali-Silvey divergence CASD can be represented as the \(\tau \)-maximum over – not necessarily nonnegative – pointwise decomposable (in the sense of (103), (104)) functionals between and . Furthermore, from Theorem 5 and (130) we arrive at

(131)

Accordingly, in analogy to the spirit of (81), (82), (86), respectively Definition 4 and (110), in order to achieve an estimator of the true unknown parameter \(\theta _{0}\) we first extend the “pure parametric case” to a singularity-covering functional , although it is not a pseudo-divergence anymore; indeed, by employing the reduced form of (102) we take

(132)

Hence, analogously to the derivation of (111), we obtain from (132)

(133)
(134)

to be minimized over \(\theta \in \varTheta \). In view of (131), we can estimate (respectively learn) the true unknown parameter \(\theta _{0}\) by the estimator

(135)

which under appropriate technical assumptions (integrability, etc.) exists, is finite, unique, and Fisher consistent; moreover, this method can be straightforwardly extended to non-parametric setups. Similarly to the derivation of (112) respectively (113), reasonable choices for the “boundary-functions” in (134) are together with \(h_{1}(x) \equiv 1\) respectively (where the numerator in the last sum becomes ). In the special case with – where the choice of \(h_{1}(\cdot )\) is irrelevant – and , the estimator \(\widehat{\theta }_{N,sup\mathscr {D}_{\phi ,\lambda }}(\omega )\) was first proposed independently by Liese and Vajda [42] under the name modified \(\phi \)-divergence estimator, and by Broniatowski and Keziou [16, 17] under the name minimum dual \(\phi \)-divergence estimator; furthermore, within this special-case setup, Broniatowski and Keziou [17] also introduced for each fixed \(\theta \in \varTheta \) the related, so-called dual \(\phi \)-divergence estimator . The latter four references also work within a nonparametric framework. Let us also mention that by (128) and (129), \(\widehat{\theta }_{N,\mathscr {D}_{\phi ,\lambda }}(\omega )\) can be interpreted as a maximum sub-\(\phi \)-divergence estimator, whereas \(\widehat{\theta }_{N,sup\mathscr {D}_{\phi ,\lambda }}(\omega )\) can be viewed as a minimum super-\(\phi \)-divergence estimator (cf. Vajda [90], Broniatowski and Vajda [18] for the probability-measure-theoretic context of footnote 15).
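As a schematic numerical sketch of the resulting min–sup procedure (135) (our own illustration; the exact integrands should be read off from (132)–(134)), we use the dual representation of Broniatowski and Keziou [16, 17] in the form we recall it – \(\sup _{\tau }\) of \(\int \phi ^{\prime }(q_{\theta }/q_{\tau }) \, \mathrm {d}Q_{\theta } - \int \big [u \phi ^{\prime }(u) - \phi (u)\big ]\big |_{u=q_{\theta }/q_{\tau }} \, \mathrm {d}P\) – for the generator \(\phi _{1}(t) = t \log t + 1 - t\) and a hypothetical Gaussian location family; in accordance with the final example of this subsection, the min–sup estimate lands (up to grid resolution) at the maximum-likelihood estimate, i.e. the sample mean:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
Y = rng.normal(0.4, 1.0, size=100)
grid = np.linspace(-8, 8, 4001)

def dual_objective(theta, tau):
    # phi_1'(u) = log u  and  u phi_1'(u) - phi_1(u) = u - 1
    q_theta = norm.pdf(grid, theta, 1.0)
    log_ratio = norm.logpdf(grid, theta, 1.0) - norm.logpdf(grid, tau, 1.0)
    model_part = np.trapz(log_ratio * q_theta, grid)   # = KL(Q_theta || Q_tau)
    ratio_data = np.exp(norm.logpdf(Y, theta, 1.0) - norm.logpdf(Y, tau, 1.0))
    return model_part - np.mean(ratio_data - 1.0)      # empirical part

thetas = np.linspace(-1.0, 2.0, 61)
sup_over_tau = [max(dual_objective(th, ta) for ta in thetas) for th in thetas]
print("min-sup estimate:", thetas[int(np.argmin(sup_over_tau))])
print("sample mean (MLE):", Y.mean())
```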

Remark 6

Making use of the escort parameter \(\tau \) proves to be useful in statistical inference under the model; its use under misspecification has been considered for Csiszar-Ali-Silvey divergences in Toma and Broniatowski [86] and Al Mohamad [5].

As a final example, consider \(c_{1}= 0\), , and \(\phi (t) := t \log t + 1 - t\), for which we can deduce

for all \(\theta \in \varTheta \), i.e. in this case all maximum sub-\(\phi \)-divergence estimators and the minimum super-\(\phi \)-divergence estimator exceptionally coincide and give the celebrated maximum-likelihood estimator.

5 Conclusions

Motivated by fields of applications from statistics, machine learning, artificial intelligence and information geometry, we presented for a wide audience a new unifying framework of divergences between functions. Within this, we illuminated several important subcases – such as scaled Bregman divergences and Csiszar-Ali-Silvey \(\phi \)-divergences – as well as involved subtleties and pitfalls. For the often desired task of finding the “continuous” model with best divergence-proximity to the observed “discrete” data, we summarized existing approaches and also derived new ones. As far as potential future studies are concerned, the universal nature of our introduced toolkit suggests a wealth of possibilities for further adjacent developments and concrete applications.