Abstract
Abstract
Distances are fundamental primitives whose choice significantly impacts the performance of algorithms in applications. However, selecting the most appropriate distance for a given task is a difficult endeavor. Instead of testing the entries of an ever-expanding dictionary of ad hoc distances one by one, one rather prefers to consider parametric classes of distances that are exhaustively characterized by axioms derived from first principles. Bregman divergences are such a class. However, fine-tuning a Bregman divergence is delicate since it requires smoothly adjusting a functional generator. In this work, we propose an extension of Bregman divergences called the Bregman chord divergences. This new class of distances bypasses gradient calculations, uses two scalar parameters that can be easily tailored in applications, and asymptotically recovers Bregman divergences.
1 Introduction
Distances are at the heart of many signal processing tasks [6, 14], and the performance of algorithms solving those tasks heavily depends on the chosen distances. Historically, many ad hoc distances have been proposed and empirically benchmarked on different tasks in order to improve state-of-the-art performance. However, choosing the most appropriate distance for a given task is often a difficult endeavor. Thus principled classes of distancesFootnote 1 have been proposed and studied. Among those, three main generic classes have emerged:
-
The Bregman divergences [5, 7, 22] defined for a strictly convex and differentiable generator \(F\in \mathcal {B}:\varTheta \rightarrow \mathbb {R}\) (where \(\mathcal {B}\) denotes the class of strictly convex and differentiable functions defined modulo affine terms):
$$\begin{aligned} B_F(\theta _1:\theta _2) :=F(\theta _1)-F(\theta _2)-(\theta _1-\theta _2)^\top \nabla F(\theta _2), \end{aligned}$$ (1)

measure the dissimilarity between parameters \(\theta _1,\theta _2\in \varTheta \), where \(\varTheta \subset \mathbb {R}^d\) is a d-dimensional convex set. Bregman divergences have also been generalized to other types of objects like matrices [26].
-
The Csiszár f-divergences [1, 11, 12] defined for a convex generator \(f\in \mathcal {C}\) satisfying \(f(1)=0\) and strictly convex at 1:
$$\begin{aligned} I_f[p_1:p_2] :=\int _\mathcal {X}p_1(x) f\left( \frac{p_2(x)}{p_1(x)}\right) \mathrm {d}\mu (x) \ge f(1)=0, \end{aligned}$$ (2)

measure the dissimilarity between probability densities \(p_1\) and \(p_2\) that are absolutely continuous with respect to a base measure \(\mu \) (defined on a support \(\mathcal {X}\)).
-
The Burbea-Rao divergences [9], also called Jensen differences or Jensen divergences because they rely on Jensen's inequality [16], defined for a strictly convex function \(F\in \mathcal {J}:\varTheta \rightarrow \mathbb {R}\):
$$\begin{aligned} J_F(\theta _1,\theta _2) :=\frac{F(\theta _1)+F(\theta _2)}{2} -F \left( \frac{\theta _1+\theta _2}{2}\right) \ge 0, \end{aligned}$$ (3)

where \(\theta _1\) and \(\theta _2\) belong to a parameter space \(\varTheta \).
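To make these three definitions concrete, here is a minimal numerical sketch (ours, not from the paper) instantiating each class with standard generators: the squared Euclidean generator \(F(\theta )=\theta ^2\) for the Bregman and Jensen divergences, and the Csiszár generator \(f(u)=-\log u\), which recovers the Kullback-Leibler divergence:

```python
import math

def bregman(F, dF, t1, t2):
    # Eq. 1 (univariate): B_F(t1 : t2) = F(t1) - F(t2) - (t1 - t2) F'(t2)
    return F(t1) - F(t2) - (t1 - t2) * dF(t2)

def csiszar(f, p1, p2):
    # Eq. 2 for finite discrete distributions: sum_i p1_i f(p2_i / p1_i)
    return sum(a * f(b / a) for a, b in zip(p1, p2))

def jensen(F, t1, t2):
    # Eq. 3: Jensen (Burbea-Rao) divergence
    return 0.5 * (F(t1) + F(t2)) - F(0.5 * (t1 + t2))

F = lambda t: t * t            # squared Euclidean generator
dF = lambda t: 2.0 * t
f_kl = lambda u: -math.log(u)  # Csiszar generator recovering KL (f(1) = 0)

print(bregman(F, dF, 3.0, 1.0))  # (3 - 1)^2 = 4
print(jensen(F, 3.0, 1.0))       # (3 - 1)^2 / 4 = 1
print(csiszar(f_kl, [0.5, 0.5], [0.25, 0.75]))  # KL([.5,.5] : [.25,.75])
```

For \(F(\theta )=\theta ^2\), the Bregman divergence is the squared Euclidean distance, and the Jensen divergence is a quarter of it.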
These three fundamental classes of distances are not mutually exclusive, and their pairwise intersections (e.g., \(\mathcal {B}\cap \mathcal {C}\) or \(\mathcal {J}\cap \mathcal {C}\)) have been studied in [2, 17, 27]. The ':' notation between arguments of distances emphasizes the potential asymmetry of distances (oriented distances with \(D(\theta _1:\theta _2)\not = D(\theta _2:\theta _1)\)), and the brackets surrounding distance arguments indicate that it is a statistical distance between probability densities, and not a distance between parameters. Using these notations, we express the Kullback-Leibler distance [10] (KL) as
$$\begin{aligned} \mathrm {KL}[p_1:p_2] :=\int _\mathcal {X}p_1(x) \log \frac{p_1(x)}{p_2(x)} \mathrm {d}\mu (x). \end{aligned}$$ (4)

The KL distance/divergence between two members \(p_{\theta _1}\) and \(p_{\theta _2}\) of a parametric family \(\mathcal {F}\) of distributions amounts to a parameter divergence:
$$\begin{aligned} \mathrm {KL}_{\mathcal {F}}(\theta _1:\theta _2) :=\mathrm {KL}[p_{\theta _1}:p_{\theta _2}]. \end{aligned}$$
For example, the KL statistical distance between two probability densities belonging to the same exponential family or the same mixture family amounts to a (parameter) Bregman divergence [3, 23]. When \(p_1\) and \(p_2\) are finite discrete distributions of the d-dimensional probability simplex \(\varDelta _d\), we have \(\mathrm {KL}_{\varDelta _d}(p_1:p_2)=\mathrm {KL}[p_{1}:p_{2}]\). This explains why we can sometimes loosely handle distances between discrete distributions as both parameter distances and statistical distances. For example, the KL distance between two discrete distributions is a Bregman divergence \(B_{F_\mathrm {KL}}\) for \(F_\mathrm {KL}(x)=\sum _{i=1}^d x_i\log x_i\) (Shannon negentropy) for \(x\in \varTheta =\varDelta _d\). Extending \(\varTheta =\varDelta _d\) to positive measures \(\varTheta =\mathbb {R}_+^d\), this Bregman divergence \(B_{F_\mathrm {KL}}\) yields the extended KL distance:
$$\begin{aligned} \mathrm {KL}^+(\theta _1:\theta _2) :=\sum _{i=1}^d \theta _{1,i}\log \frac{\theta _{1,i}}{\theta _{2,i}} + \theta _{2,i} - \theta _{1,i}. \end{aligned}$$
Notice that the KL divergence of Eq. 4 between unnormalized positive distributions may yield negative values (e.g., Example 2.1 of [28] and [8]). This case also happens when performing Monte Carlo stochastic integration of the KL divergence integral.
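The identity between the Bregman divergence of the Shannon negentropy and the extended KL divergence can be checked numerically; the following sketch (our own illustration, not from the paper) compares the two formulas on arbitrary positive measures:

```python
import math

def F_KL(x):
    # Shannon negentropy generator, extended to positive measures
    return sum(xi * math.log(xi) for xi in x)

def bregman_F_KL(t1, t2):
    # B_F(t1 : t2) = F(t1) - F(t2) - <t1 - t2, grad F(t2)>,
    # with grad F(x)_i = log(x_i) + 1
    g = [math.log(xi) + 1.0 for xi in t2]
    return F_KL(t1) - F_KL(t2) - sum((a - b) * gi for a, b, gi in zip(t1, t2, g))

def extended_kl(p, q):
    # extended KL between positive (not necessarily normalized) measures
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

p, q = [0.2, 0.5, 0.8], [0.6, 0.1, 0.4]       # positive, not normalized
print(bregman_F_KL(p, q), extended_kl(p, q))  # both formulas agree
```

The two printed values coincide: the correction terms \(\theta _{2,i}-\theta _{1,i}\) of the extended KL are exactly what the Bregman construction adds when the measures do not sum to one.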
Whenever using a functionally parameterized distance in applications, we need to choose the most appropriate functional generator, ideally from first principles [3, 4, 13]. For example, Non-negative Matrix Factorization (NMF) for audio source separation or music transcription from the signal power spectrogram can be done by selecting the Itakura-Saito divergence [15]Footnote 2 that satisfies the requirement of being scale-invariant:
$$\begin{aligned} \mathrm {IS}(\lambda \theta _1:\lambda \theta _2)=\mathrm {IS}(\theta _1:\theta _2), \end{aligned}$$
for any \(\lambda >0\). When no such first principles can be easily stated for a task [13], we are left with choosing a generator manually or by cross-validation. Notice that a convex combination of Csiszár generators is a Csiszár generator (and similarly for Bregman generators): \(\sum _{i=1}^d \lambda _i I_{f_i}=I_{\sum _{i=1}^d \lambda _i f_i}\) for \(\lambda \) belonging to the standard \((d-1)\)-dimensional simplex \(\varDelta _d\).
In this work, we propose a novel class of distances, termed Bregman chord divergences. A Bregman chord divergence is parameterized by a Bregman generator and two scalar parameters which make it easy to fine-tune in applications, and matches asymptotically the ordinary Bregman divergence.
The paper is organized as follows: In Sect. 2, we describe the skewed Jensen divergence, show how to bi-skew any distance by using two scalars, and report the Jensen chord divergence [20]. In Sect. 3, we introduce the univariate Bregman chord divergence, and we extend its definition to the multivariate case in Sect. 4. Finally, we conclude in Sect. 5.
2 Geometric Design of Skewed Divergences
We can geometrically design divergences from the convexity gap properties of the graph of the generator. For example, the Jensen divergence \(J_F(\theta _1,\theta _2)\) of Eq. 3 is visualized as the ordinate (vertical) gap between the midpoint of the line segment \([(\theta _1,F(\theta _1));(\theta _2,F(\theta _2))]\) and the point \((\frac{\theta _1+\theta _2}{2},F(\frac{\theta _1+\theta _2}{2}))\). The non-negativity of the Jensen divergence follows from Jensen's midpoint convexity inequality [16]. Instead of taking the midpoint \(\bar{\theta }=\frac{\theta _1+\theta _2}{2}\), we can take any interior point \((\theta _1\theta _2)_\alpha :=(1-\alpha )\theta _1+\alpha \theta _2\), and get the skewed \(\alpha \)-Jensen divergence (for any \(\alpha \in (0,1)\)):
$$\begin{aligned} J_F^\alpha (\theta _1:\theta _2) :=(1-\alpha )F(\theta _1)+\alpha F(\theta _2)-F((\theta _1\theta _2)_\alpha )\ge 0. \end{aligned}$$
A remarkable fact is that the scaled \(\alpha \)-Jensen divergence \(\frac{1}{\alpha }J_F^\alpha (\theta _1:\theta _2)\) tends asymptotically to the reverse Bregman divergence \(B_F(\theta _2:\theta _1)\) when \(\alpha \rightarrow 0\), see [21, 30].
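This asymptotic behavior is easy to verify numerically; the sketch below (our own illustration, with the generator \(F(\theta )=\theta ^4\) chosen for checkability) shows that \(\frac{1}{\alpha }J_F^\alpha (\theta _1:\theta _2)\) approaches \(B_F(\theta _2:\theta _1)\) as \(\alpha \rightarrow 0\):

```python
def jensen_alpha(F, a, t1, t2):
    # skewed alpha-Jensen divergence J_F^a(t1 : t2)
    return (1 - a) * F(t1) + a * F(t2) - F((1 - a) * t1 + a * t2)

def bregman(F, dF, t1, t2):
    # ordinary Bregman divergence B_F(t1 : t2)
    return F(t1) - F(t2) - (t1 - t2) * dF(t2)

F = lambda t: t ** 4
dF = lambda t: 4 * t ** 3

t1, t2 = 0.5, 2.0
target = bregman(F, dF, t2, t1)  # reverse Bregman divergence B_F(t2 : t1)
for a in (1e-2, 1e-4, 1e-6):
    # the gap between the scaled alpha-Jensen divergence and the
    # reverse Bregman divergence shrinks linearly with a
    print(a, jensen_alpha(F, a, t1, t2) / a - target)
```

The printed gap decreases by roughly two orders of magnitude at each step, consistent with the first-order Taylor expansion of \(F\) at \(\theta _1\).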
By measuring the ordinate gap between two non-crossing upper and lower chords anchored at the generator graph plot, we can extend the \(\alpha \)-Jensen divergences to a tri-parametric family of Jensen chord divergences [20], measured at \((\theta _1\theta _2)_\gamma \) between the chord induced by \(\theta _1\) and \(\theta _2\) and the chord induced by \((\theta _1\theta _2)_\alpha \) and \((\theta _1\theta _2)_\beta \):
$$\begin{aligned} J_F^{\alpha ,\beta ,\gamma }(\theta _1:\theta _2) :=(1-\gamma )F(\theta _1)+\gamma F(\theta _2) -(1-\lambda )F((\theta _1\theta _2)_\alpha )-\lambda F((\theta _1\theta _2)_\beta ), \quad \lambda :=\frac{\gamma -\alpha }{\beta -\alpha }, \end{aligned}$$
with \(\alpha ,\beta \in [0,1]\) and \(\gamma \in [\alpha ,\beta ]\). The \(\alpha \)-Jensen divergence is recovered when \(\alpha =\beta =\gamma \), taking the degenerate lower chord to be the single point \(((\theta _1\theta _2)_\gamma ,F((\theta _1\theta _2)_\gamma ))\) (Fig. 1).
For any given distance \(D:\varTheta \times \varTheta \rightarrow \mathbb {R}_+\) (with convex parameter space \(\varTheta \)), we can bi-skew the distance by considering two scalars \(\gamma ,\delta \in \mathbb {R}\) (with \(\delta \not =\gamma \)) as:
$$\begin{aligned} D_{\gamma ,\delta }(\theta _1:\theta _2) :=D((\theta _1\theta _2)_\gamma :(\theta _1\theta _2)_\delta ). \end{aligned}$$
Clearly, \((\theta _1\theta _2)_\gamma =(\theta _1\theta _2)_\delta \) if and only if \((\delta -\gamma )(\theta _1-\theta _2)=0\). That is, if (i) \(\theta _1=\theta _2\) or if (ii) \(\delta =\gamma \). Since by definition \(\delta \not =\gamma \), we have \(D_{\gamma ,\delta }(\theta _1:\theta _2)=0\) if and only if \(\theta _1=\theta _2\). Notice that both \((\theta _1\theta _2)_\gamma =(1-\gamma )\theta _1+\gamma \theta _2\) and \((\theta _1\theta _2)_\delta =(1-\delta )\theta _1+\delta \theta _2\) should belong to the parameter space \(\varTheta \). A sufficient condition is to ensure that \(\gamma ,\delta \in [0,1]\) so that both \((\theta _1\theta _2)_\gamma \in \varTheta \) and \((\theta _1\theta _2)_\delta \in \varTheta \). When \(\varTheta =\mathbb {R}^d\), we may further consider any \(\gamma ,\delta \in \mathbb {R}\).
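As a sketch (our illustration), bi-skewing the squared Euclidean distance \(D(\theta _1,\theta _2)=(\theta _1-\theta _2)^2\) simply rescales it by \((\delta -\gamma )^2\), consistent with the identity \((\theta _1\theta _2)_\gamma -(\theta _1\theta _2)_\delta =(\delta -\gamma )(\theta _1-\theta _2)\):

```python
def interp(t1, t2, lam):
    # (t1 t2)_lam = (1 - lam) * t1 + lam * t2
    return (1 - lam) * t1 + lam * t2

def biskew(D, gamma, delta):
    # D_{gamma, delta}(t1 : t2) := D((t1 t2)_gamma : (t1 t2)_delta)
    if gamma == delta:
        raise ValueError("gamma and delta must be distinct")
    return lambda t1, t2: D(interp(t1, t2, gamma), interp(t1, t2, delta))

sq = lambda a, b: (a - b) ** 2
D_gd = biskew(sq, 0.2, 0.7)
print(D_gd(3.0, 1.0))  # ((0.7 - 0.2) * (3 - 1))^2 = 1.0
print(D_gd(2.0, 2.0))  # 0.0: the bi-skewed distance vanishes iff t1 = t2
```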
3 The Scalar Bregman Chord Divergence
Let \(F:\varTheta \subset \mathbb {R}\rightarrow \mathbb {R}\) be a univariate Bregman generator with open convex domain \(\varTheta \), and denote by \(\mathcal {F}=\{ (\theta ,F(\theta )) \}_\theta \) its graph. Let us rewrite the ordinary univariate Bregman divergence [7] of Eq. 1 as follows:
$$\begin{aligned} B_F(\theta _1:\theta _2) = F(\theta _1)-T_{\theta _2}(\theta _1), \end{aligned}$$
where \(y=T_{\theta }(\omega )\) denotes the equation of the tangent line of F at \(\theta \):
$$\begin{aligned} T_{\theta }(\omega ) :=F(\theta )+(\omega -\theta )F'(\theta ). \end{aligned}$$
Let \(\mathcal {T}_\theta =\{ (\omega ,T_\theta (\omega )) \ :\ \omega \in \varTheta \}\) denote the graph of that tangent line. Line \(\mathcal {T}_\theta \) is tangent to curve \(\mathcal {F}\) at point \(P_\theta :=(\theta ,F(\theta ))\). Graphically speaking, the Bregman divergence is interpreted as the ordinate (vertical) gap between the point \(P_{\theta _1}=(\theta _1,F(\theta _1))\in \mathcal {F}\) and the point \((\theta _1,T_{\theta _2}(\theta _1))\in \mathcal {T}_{\theta _2}\), as depicted in Fig. 2.
Now let us observe that we may relax the tangent line \(\mathcal {T}_{\theta _2}\) to a chord line (or secant) \(\mathcal {C}_{\theta _1,\theta _2}^{\alpha ,\beta } = \mathcal {C}_{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }\) passing through the points \(((\theta _1\theta _2)_\alpha ,F((\theta _1\theta _2)_\alpha ))\) and \(((\theta _1\theta _2)_\beta ,F((\theta _1\theta _2)_\beta ))\) for \(\alpha ,\beta \in (0,1)\) with \(\alpha \not =\beta \) (with corresponding Cartesian equation \(y=C_{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }(\omega )\)), and still get a non-negative vertical gap between \((\theta _1,F(\theta _1))\) and \((\theta _1,C_{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }(\theta _1))\): a line can cross the graph of a strictly convex function in at most two points, and the graph lies above the chord outside the chord interval. By construction, this vertical gap is smaller than the gap measured by the ordinary Bregman divergence. This yields the Bregman chord divergence (\(\alpha ,\beta \in (0,1]\), \(\alpha \not =\beta \)):
$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) :=F(\theta _1)-C_{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }(\theta _1), \end{aligned}$$
illustrated in Fig. 3. By expanding the chord equation and massaging the equation, we get the following formula:
$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) = F(\theta _1)-F((\theta _1\theta _2)_\alpha )-\alpha (\theta _1-\theta _2)\varDelta _F^{\alpha ,\beta }(\theta _1,\theta _2), \end{aligned}$$
where
$$\begin{aligned} \varDelta _F^{\alpha ,\beta }(\theta _1,\theta _2) :=\frac{F((\theta _1\theta _2)_\alpha )-F((\theta _1\theta _2)_\beta )}{(\theta _1\theta _2)_\alpha -(\theta _1\theta _2)_\beta } = \frac{F((\theta _1\theta _2)_\alpha )-F((\theta _1\theta _2)_\beta )}{(\beta -\alpha )(\theta _1-\theta _2)} \end{aligned}$$
is the slope of the chord, since \((\theta _1\theta _2)_\alpha -(\theta _1\theta _2)_\beta =(\beta -\alpha )(\theta _1-\theta _2)\) and \(\theta _1-(\theta _1\theta _2)_\alpha =\alpha (\theta _1-\theta _2)\).
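A direct transcription of this construction (a sketch of ours, with \(F(\theta )=\theta ^2\) chosen because the chord gap then admits the closed form \((\theta _1-(\theta _1\theta _2)_\alpha )(\theta _1-(\theta _1\theta _2)_\beta )\)) shows the gradient-free evaluation:

```python
def interp(t1, t2, lam):
    # (t1 t2)_lam = (1 - lam) * t1 + lam * t2
    return (1 - lam) * t1 + lam * t2

def bregman_chord(F, a, b, t1, t2):
    # B_F^{a,b}(t1 : t2): ordinate gap at t1 between F and the chord of F
    # through the points above (t1 t2)_a and (t1 t2)_b; no derivative of F needed
    xa, xb = interp(t1, t2, a), interp(t1, t2, b)
    slope = (F(xa) - F(xb)) / (xa - xb)   # chord slope Delta_F^{a,b}
    return F(t1) - F(xa) - (t1 - xa) * slope

F = lambda t: t * t
t1, t2 = 3.0, 1.0
d = bregman_chord(F, 0.25, 0.75, t1, t2)
print(d)  # for F(t) = t^2: (t1 - x_a)(t1 - x_b) = 0.5 * 1.5 = 0.75
```

Only two evaluations of \(F\) along the segment are needed, and the gap stays below the ordinary Bregman gap \(B_F(3:1)=4\).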
Notice the symmetry \(B_F^{\alpha ,\beta }(\theta _1:\theta _2)=B_F^{\beta ,\alpha }(\theta _1:\theta _2)\). We have \(0\le B_F^{\alpha ,\beta }(\theta _1:\theta _2)\le B_F(\theta _1:\theta _2)\).
When \(\alpha \rightarrow \beta \), the Bregman chord divergence yields a subfamily of Bregman tangent divergences: We consider the tangent line \(\mathcal {T}_{(\theta _1\theta _2)_\alpha }\) at \((\theta _1\theta _2)_\alpha \) and measure the ordinate gap at \(\theta _1\) between the function plot and this tangent line:
$$\begin{aligned} B_F^{\alpha }(\theta _1:\theta _2) :=F(\theta _1)-T_{(\theta _1\theta _2)_\alpha }(\theta _1) = F(\theta _1)-F((\theta _1\theta _2)_\alpha )-\alpha (\theta _1-\theta _2)F'((\theta _1\theta _2)_\alpha ), \end{aligned}$$
for \(\alpha \in (0,1]\). The ordinary Bregman divergence is recovered when \(\alpha =1\). Notice that the mean value theorem yields \(\varDelta _F^{\alpha ,\beta }(\theta _1,\theta _2)=F'(\xi )\) for some \(\xi \) strictly between \((\theta _1\theta _2)_\alpha \) and \((\theta _1\theta _2)_\beta \). Thus \(B_F^{\alpha ,\beta }(\theta _1:\theta _2)=F(\theta _1)-F((\theta _1\theta _2)_\alpha )-\alpha (\theta _1-\theta _2)F'(\xi )\). Letting \(\beta =1\) and \(\alpha =1-\epsilon \) (for small values of \(\epsilon \in (0,1)\)), we can approximate the ordinary Bregman divergence by the Bregman chord divergence without requiring to compute the gradient:
$$\begin{aligned} B_F(\theta _1:\theta _2) \approx B_F^{1-\epsilon ,1}(\theta _1:\theta _2) = F(\theta _1)-F((\theta _1\theta _2)_{1-\epsilon })-\frac{1-\epsilon }{\epsilon }\left( F((\theta _1\theta _2)_{1-\epsilon })-F(\theta _2)\right) . \end{aligned}$$
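The quality of this gradient-free approximation can be probed numerically. The sketch below (our illustration) uses the Burg-type generator \(F(\theta )=-\log \theta \) of footnote 2 and shows \(B_F^{1-\epsilon ,1}\) approaching \(B_F\) from below as \(\epsilon \) shrinks:

```python
import math

def interp(t1, t2, lam):
    return (1 - lam) * t1 + lam * t2

def bregman_chord(F, a, b, t1, t2):
    # gradient-free Bregman chord divergence B_F^{a,b}(t1 : t2)
    xa, xb = interp(t1, t2, a), interp(t1, t2, b)
    slope = (F(xa) - F(xb)) / (xa - xb)
    return F(t1) - F(xa) - (t1 - xa) * slope

F = lambda t: -math.log(t)   # univariate Burg negentropy, t > 0
dF = lambda t: -1.0 / t

t1, t2 = 2.0, 0.5
exact = F(t1) - F(t2) - (t1 - t2) * dF(t2)   # ordinary Bregman divergence
for eps in (1e-1, 1e-2, 1e-3):
    approx = bregman_chord(F, 1.0 - eps, 1.0, t1, t2)
    print(eps, exact - approx)  # positive and shrinking with eps
```

Each decade of \(\epsilon \) reduces the (always non-negative) approximation gap by roughly one decade, matching the first-order analysis above.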
4 The Multivariate Bregman Chord Divergence
When the generator is separable [3], i.e., \(F(x)=\sum _i F_i(x_i)\) for univariate generators \(F_i\), we easily extend the Bregman chord divergence as:
$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) :=\sum _{i=1}^D B_{F_i}^{\alpha ,\beta }(\theta _{1,i}:\theta _{2,i}). \end{aligned}$$
Otherwise, we have to carefully define the notion of "slope" for the multivariate case. An example of such a non-separable multivariate generator is the Legendre dual of the Shannon negentropy: the log-sum-exp function [24, 25]:
$$\begin{aligned} \mathrm {lse}(\theta ) :=\log \left( 1+\sum _{i=1}^D \exp (\theta _i)\right) . \end{aligned}$$
Given a multivariate (non-separable) Bregman generator \(F(\theta )\) with \(\varTheta \subseteq \mathbb {R}^D\) and two prescribed distinct parameters \(\theta _1\) and \(\theta _2\), consider the following univariate function, for \(\lambda \in \mathbb {R}\):
$$\begin{aligned} F_{\theta _1,\theta _2}(\lambda ) :=F((1-\lambda )\theta _1+\lambda \theta _2)=F((\theta _1\theta _2)_\lambda ), \end{aligned}$$
with \(F_{\theta _1,\theta _2}(0)=F(\theta _1)\) and \(F_{\theta _1,\theta _2}(1)=F(\theta _2)\).
The functions \(\{F_{\theta _1,\theta _2}\}_{\theta _1\not =\theta _2}\) are strictly convex and differentiable univariate Bregman generators.
Proof
To prove the strict convexity of a univariate function G, we need to show that for any \(\alpha \in (0,1)\) and \(x\not =y\), we have \(G((1-\alpha )x+\alpha y)< (1-\alpha )G(x)+\alpha G(y)\). For \(G=F_{\theta _1,\theta _2}\), since \((\theta _1\theta _2)_{(1-\alpha )x+\alpha y}=(1-\alpha )(\theta _1\theta _2)_x+\alpha (\theta _1\theta _2)_y\) and \((\theta _1\theta _2)_x\not =(\theta _1\theta _2)_y\) (because \(\theta _1\not =\theta _2\)), the strict convexity of F yields \(F_{\theta _1,\theta _2}((1-\alpha )x+\alpha y)<(1-\alpha )F_{\theta _1,\theta _2}(x)+\alpha F_{\theta _1,\theta _2}(y)\). Differentiability follows from the chain rule. \(\square \)
Then we define the multivariate Bregman chord divergence by applying the definition of the univariate Bregman chord divergence to these families of univariate Bregman generators:
$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) :=B_{F_{\theta _1,\theta _2}}^{\alpha ,\beta }(0:1). \end{aligned}$$
Since \((01)_\alpha =\alpha \) and \((01)_\beta =\beta \), we get:
$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) = F(\theta _1)-F((\theta _1\theta _2)_\alpha )-\frac{\alpha }{\beta -\alpha }\left( F((\theta _1\theta _2)_\alpha )-F((\theta _1\theta _2)_\beta )\right) , \end{aligned}$$
in accordance with the univariate case. Since \((\theta _1\theta _2)_\beta =(\theta _1\theta _2)_\alpha +(\beta -\alpha )(\theta _2-\theta _1)\), we have the first-order Taylor expansion
$$\begin{aligned} F((\theta _1\theta _2)_\beta ) = F((\theta _1\theta _2)_\alpha )+(\beta -\alpha )(\theta _2-\theta _1)^\top \nabla F((\theta _1\theta _2)_\alpha )+o(\beta -\alpha ). \end{aligned}$$
Therefore, we have:
$$\begin{aligned} \lim _{\beta \rightarrow \alpha } B_F^{\alpha ,\beta }(\theta _1:\theta _2) = F(\theta _1)-F((\theta _1\theta _2)_\alpha )-\alpha (\theta _1-\theta _2)^\top \nabla F((\theta _1\theta _2)_\alpha ). \end{aligned}$$
This proves that the multivariate Bregman chord divergence tends to the multivariate Bregman tangent divergence \(B_F^{\alpha }\) when \(\beta \rightarrow \alpha \), and to the ordinary multivariate Bregman divergence when, moreover, \(\alpha \rightarrow 1\).
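Putting the pieces together, the multivariate Bregman chord divergence can be sketched by reducing to the univariate case along the segment \([\theta _1,\theta _2]\). We illustrate it (our own code, not from the paper) with the strictly convex, non-separable generator \(F(\theta )=\log (1+\sum _i e^{\theta _i})\) and check the convergence to the ordinary Bregman divergence as \((\alpha ,\beta )\rightarrow (1,1)\):

```python
import math

def F(theta):
    # non-separable strictly convex generator: log(1 + sum_i exp(theta_i))
    return math.log(1.0 + sum(math.exp(t) for t in theta))

def grad_F(theta):
    z = 1.0 + sum(math.exp(t) for t in theta)
    return [math.exp(t) / z for t in theta]

def restrict(F, t1, t2):
    # univariate restriction F_{t1,t2}(lam) = F((1 - lam) t1 + lam t2)
    return lambda lam: F([(1 - lam) * a + lam * b for a, b in zip(t1, t2)])

def mv_bregman_chord(F, a, b, t1, t2):
    # multivariate chord divergence: univariate chord divergence of the
    # restriction evaluated as B^{a,b}_{F_{t1,t2}}(0 : 1)
    G = restrict(F, t1, t2)
    slope = (G(a) - G(b)) / (a - b)
    return G(0.0) - G(a) + a * slope

def mv_bregman(F, grad_F, t1, t2):
    # ordinary multivariate Bregman divergence (needs the gradient)
    g = grad_F(t2)
    return F(t1) - F(t2) - sum((x - y) * gi for x, y, gi in zip(t1, t2, g))

t1, t2 = [0.0, 1.0], [1.0, -0.5]
exact = mv_bregman(F, grad_F, t1, t2)
print(exact, mv_bregman_chord(F, 0.999, 1.0, t1, t2))  # close for (a,b) near (1,1)
```

Only evaluations of \(F\) along the segment are used; the gradient appears solely in the reference value we compare against.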
Notice that the Bregman chord divergence does not require computing the gradient \(\nabla F\). The "slope term" in the definition is reminiscent of the q-derivative [18] (quantum/discrete derivatives). However, the (p, q)-derivatives [18] are defined with respect to a single reference point while the chord definition requires two reference points.
5 Conclusion
In this paper, we geometrically designed a new class of distances using a Bregman generator and two additional scalar parameters, termed the Bregman chord divergence, and its one-parametric subfamily, the Bregman tangent divergences, which includes the ordinary Bregman divergence. This generalization allows one to easily fine-tune Bregman divergences in applications by smoothly adjusting one or two scalar knobs. Moreover, by choosing \(\alpha =1-\epsilon \) and \(\beta =1\) for small \(\epsilon >0\), the Bregman chord divergence \(B_{F}^{1-\epsilon ,1}(\theta _1:\theta _2)\) closely lower-bounds the Bregman divergence \(B_{F}(\theta _1:\theta _2)\) without requiring any gradient computation (a different gradient-free approximation is \(\frac{1}{\epsilon } J_F^\epsilon (\theta _2:\theta _1)\)). We expect that this new class of distances brings further improvements in signal processing and information fusion applications [29] (e.g., by tuning \(B_{F_\mathrm {KL}}^{\alpha ,\beta }\) or \(B_{F_\mathrm {IS}}^{\alpha ,\beta }\)). While the Bregman chord divergence defines an ordinate gap on the exterior of the epigraph, the Jensen chord divergence [20] defines the gap inside the epigraph of the generator. In future work, the dualistic information-geometric structure induced by the Bregman chord divergences shall be investigated from the viewpoint of gauge theory [19] and in contrast with the dually flat structures of Bregman manifolds [3].
Source code in Java™ is available for reproducible research.Footnote 3
Notes
- 1.
Here, we use the word distance to mean a dissimilarity (or a distortion, a deviance, a discrepancy, etc.), not necessarily a metric distance [14]. A distance between arguments \(\theta _1\) and \(\theta _2\) satisfies \(D(\theta _1,\theta _2)\ge 0\) with equality if and only if \(\theta _1=\theta _2\).
- 2.
A Bregman divergence for the Burg negentropy \(F_\mathrm {IS}(x)=-\sum _i\log x_i\).
- 3.
References
Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. Roy. Stat. Soc.: Ser. B (Methodol.) 28(1), 131–142 (1966)
Amari, S.I.: \(\alpha \)-divergence is unique, belonging to both \(f\)-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 55(11), 4925–4931 (2009)
Amari, S.: Information Geometry and Its Applications. AMS, vol. 194. Springer, Tokyo (2016). https://doi.org/10.1007/978-4-431-55978-8
Banerjee, A., Guo, X., Wang, H.: On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. Inf. Theory 51(7), 2664–2669 (2005)
Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6(Oct), 1705–1749 (2005)
Basseville, M.: Divergence measures for statistical data processing: an annotated bibliography. Sig. Process. 93(4), 621–633 (2013)
Bregman, L.M.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7(3), 200–217 (1967)
Broniatowski, M., Stummer, W.: Some universal insights on divergences for statistics, machine learning and artificial intelligence. In: Nielsen, F. (ed.) Geometric Structures of Information. SCT, pp. 149–211. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-02520-5_8
Burbea, J., Rao, C.: On the convexity of some divergence measures based on entropy functions. IEEE Trans. Inf. Theory 28(3), 489–495 (1982)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (2012)
Csiszár, I.: Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tudományos Akadémia - MAT 8, 85–108 (1963)
Csiszár, I.: Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica 2, 229–318 (1967)
Csiszár, I.: Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Stat. 19(4), 2032–2066 (1991). https://doi.org/10.1007/978-1-4613-0071-7
Deza, M.M., Deza, E.: Encyclopedia of distances. In: Deza, M.M., Deza, E. (eds.) Encyclopedia of Distances, pp. 1–583. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-00234-2_1
Févotte, C.: Majorization-minimization algorithm for smooth Itakura-Saito nonnegative matrix factorization. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1980–1983. IEEE (2011)
Jensen, J.L.W.V.: Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Math. 30(1), 175–193 (1906)
Jiao, J., Courtade, T.A., No, A., Venkat, K., Weissman, T.: Information measures: the curious case of the binary alphabet. IEEE Trans. Inf. Theory 60(12), 7616–7626 (2014)
Kac, V., Cheung, P.: Quantum Calculus. Springer, New York (2001)
Naudts, J., Zhang, J.: Rho-tau embedding and gauge freedom in information geometry. Inf. Geom. 1(1), 79–115 (2018)
Nielsen, F.: The chord gap divergence and a generalization of the Bhattacharyya distance. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2276–2280, April 2018. https://doi.org/10.1109/ICASSP.2018.8462244
Nielsen, F., Boltz, S.: The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 57(8), 5455–5466 (2011)
Nielsen, F., Nock, R.: Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 55(6), 2882–2904 (2009)
Nielsen, F., Nock, R.: On the geometry of mixtures of prescribed distributions. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2861–2865. IEEE (2018)
Nielsen, F., Sun, K.: Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy 18(12), 442 (2016)
Nielsen, F., Sun, K.: Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures. IEEE Signal Process. Lett. 23(11), 1543–1546 (2016)
Nock, R., Magdalou, B., Briys, E., Nielsen, F.: Mining matrix data with Bregman matrix divergences for portfolio selection. In: Nielsen, F., Bhatia, R. (eds.) Matrix Information Geometry, pp. 373–402. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-30232-9_15
Pardo, M.C., Vajda, I.: About distances of discrete distributions satisfying the data processing theorem of information theory. IEEE Trans. Inf. Theory 43(4), 1288–1293 (1997)
Stummer, W., Vajda, I.: On divergences of finite measures and their applicability in statistics and information theory. Statistics 44(2), 169–187 (2010)
Üney, M., Houssineau, J., Delande, E., Julier, S.J., Clark, D.E.: Fusion of finite set distributions: pointwise consistency and global cardinality. CoRR abs/1802.06220 (2018). http://arxiv.org/abs/1802.06220
Zhang, J.: Divergence function, duality, and convex analysis. Neural Comput. 16(1), 159–195 (2004)
© 2019 Springer Nature Switzerland AG
Cite this paper
Nielsen, F., Nock, R. (2019). The Bregman Chord Divergence. In: Nielsen, F., Barbaresco, F. (eds) Geometric Science of Information. GSI 2019. Lecture Notes in Computer Science(), vol 11712. Springer, Cham. https://doi.org/10.1007/978-3-030-26980-7_31
Print ISBN: 978-3-030-26979-1
Online ISBN: 978-3-030-26980-7