Abstract
Abstract
Distances are fundamental primitives whose choice significantly impacts the performance of algorithms in applications. However, selecting the most appropriate distance for a given task is a difficult endeavor. Instead of testing the entries of an ever-expanding dictionary of ad hoc distances one by one, one rather prefers to consider parametric classes of distances that are exhaustively characterized by axioms derived from first principles. Bregman divergences are such a class. However, fine-tuning a Bregman divergence is delicate since it requires smoothly adjusting a functional generator. In this work, we propose an extension of Bregman divergences called the Bregman chord divergences. This new class of distances bypasses gradient calculations, uses two scalar parameters that can be easily tailored in applications, and asymptotically recovers Bregman divergences.
1 Introduction
Distances are at the heart of many signal processing tasks [6, 14], and the performance of algorithms solving those tasks heavily depends on the chosen distances. Historically, many ad hoc distances have been proposed and empirically benchmarked on different tasks in order to improve state-of-the-art performance. However, choosing the most appropriate distance for a given task is often a difficult endeavor. Thus principled classes of distancesFootnote 1 have been proposed and studied. Among those, three main generic classes have emerged:
-
The Bregman divergences [5, 7, 22] defined for a strictly convex and differentiable generator \(F\in \mathcal {B}:\varTheta \rightarrow \mathbb {R}\) (where \(\mathcal {B}\) denotes the class of strictly convex and differentiable functions defined modulo affine terms):
$$\begin{aligned} B_F(\theta _1:\theta _2) :=F(\theta _1)-F(\theta _2)-(\theta _1-\theta _2)^\top \nabla F(\theta _2), \end{aligned}$$ (1)

measure the dissimilarity between parameters \(\theta _1,\theta _2\in \varTheta \), where \(\varTheta \subset \mathbb {R}^d\) is a d-dimensional convex set. Bregman divergences have also been generalized to other types of objects like matrices [26].
-
The Csiszár f-divergences [1, 11, 12] defined for a convex generator \(f\in \mathcal {C}\) satisfying \(f(1)=0\) and strictly convex at 1:
$$\begin{aligned} I_f[p_1:p_2] :=\int _\mathcal {X}p_1(x) f\left( \frac{p_2(x)}{p_1(x)}\right) \mathrm {d}\mu (x) \ge f(1)=0, \end{aligned}$$ (2)

measure the dissimilarity between probability densities \(p_1\) and \(p_2\) that are absolutely continuous with respect to a base measure \(\mu \) (defined on a support \(\mathcal {X}\)).
-
The Burbea-Rao divergences [9], also called Jensen differences or Jensen divergences because they rely on Jensen's inequality [16], defined for a strictly convex function \(F\in \mathcal {J}:\varTheta \rightarrow \mathbb {R}\):
$$\begin{aligned} J_F(\theta _1,\theta _2) :=\frac{F(\theta _1)+F(\theta _2)}{2} -F \left( \frac{\theta _1+\theta _2}{2}\right) \ge 0, \end{aligned}$$ (3)

where \(\theta _1\) and \(\theta _2\) belong to a parameter space \(\varTheta \).
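To make these three definitions concrete, here is a minimal numerical sketch (ours, not from the paper) instantiating each class with standard generators: the squared Euclidean generator \(F(\theta )=\theta ^2\) for the Bregman and Jensen divergences, and the Csiszár generator \(f(u)=-\log u\), which recovers the Kullback-Leibler divergence:

```python
import math

def bregman(F, dF, t1, t2):
    # Eq. 1 (univariate): B_F(t1 : t2) = F(t1) - F(t2) - (t1 - t2) F'(t2)
    return F(t1) - F(t2) - (t1 - t2) * dF(t2)

def csiszar(f, p1, p2):
    # Eq. 2 for finite discrete distributions: sum_i p1_i f(p2_i / p1_i)
    return sum(a * f(b / a) for a, b in zip(p1, p2))

def jensen(F, t1, t2):
    # Eq. 3: Jensen (Burbea-Rao) divergence
    return 0.5 * (F(t1) + F(t2)) - F(0.5 * (t1 + t2))

F = lambda t: t * t            # squared Euclidean generator
dF = lambda t: 2.0 * t
f_kl = lambda u: -math.log(u)  # Csiszar generator recovering KL (f(1) = 0)

print(bregman(F, dF, 3.0, 1.0))  # (3 - 1)^2 = 4
print(jensen(F, 3.0, 1.0))       # (3 - 1)^2 / 4 = 1
print(csiszar(f_kl, [0.5, 0.5], [0.25, 0.75]))  # KL([.5,.5] : [.25,.75])
```

For \(F(\theta )=\theta ^2\), the Bregman divergence is the squared Euclidean distance, and the Jensen divergence is a quarter of it.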
These three fundamental classes of distances are not mutually exclusive, and their pairwise intersections (e.g., \(\mathcal {B}\cap \mathcal {C}\) or \(\mathcal {J}\cap \mathcal {C}\)) have been studied in [2, 17, 27]. The ':' notation between arguments of distances emphasizes the potential asymmetry of distances (oriented distances with \(D(\theta _1:\theta _2)\not = D(\theta _2:\theta _1)\)), and the brackets surrounding distance arguments indicate that it is a statistical distance between probability densities, and not a distance between parameters. Using these notations, we express the Kullback-Leibler distance [10] (KL) as
$$\begin{aligned} \mathrm {KL}[p_1:p_2] :=\int _\mathcal {X}p_1(x) \log \frac{p_1(x)}{p_2(x)} \mathrm {d}\mu (x). \end{aligned}$$ (4)

The KL distance/divergence between two members \(p_{\theta _1}\) and \(p_{\theta _2}\) of a parametric family \(\mathcal {F}\) of distributions amounts to a parameter divergence:
$$\begin{aligned} \mathrm {KL}_{\mathcal {F}}(\theta _1:\theta _2) :=\mathrm {KL}[p_{\theta _1}:p_{\theta _2}]. \end{aligned}$$
For example, the KL statistical distance between two probability densities belonging to the same exponential family or the same mixture family amounts to a (parameter) Bregman divergence [3, 23]. When \(p_1\) and \(p_2\) are finite discrete distributions of the d-dimensional probability simplex \(\varDelta _d\), we have \(\mathrm {KL}_{\varDelta _d}(p_1:p_2)=\mathrm {KL}[p_{1}:p_{2}]\). This explains why we can sometimes loosely handle distances between discrete distributions as both parameter distances and statistical distances. For example, the KL distance between two discrete distributions is a Bregman divergence \(B_{F_\mathrm {KL}}\) for \(F_\mathrm {KL}(x)=\sum _{i=1}^d x_i\log x_i\) (Shannon negentropy) for \(x\in \varTheta =\varDelta _d\). Extending \(\varTheta =\varDelta _d\) to positive measures \(\varTheta =\mathbb {R}_+^d\), this Bregman divergence \(B_{F_\mathrm {KL}}\) yields the extended KL distance:
$$\begin{aligned} \mathrm {KL}^+(\theta _1:\theta _2) :=\sum _{i=1}^d \theta _{1,i}\log \frac{\theta _{1,i}}{\theta _{2,i}} + \theta _{2,i} - \theta _{1,i}. \end{aligned}$$
Notice that the KL divergence of Eq. 4 between unnormalized positive distributions may yield negative values (e.g., Example 2.1 of [28] and [8]). This case also happens when performing Monte Carlo stochastic integration of the KL divergence integral.
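The identity between the Bregman divergence of the Shannon negentropy and the extended KL divergence can be checked numerically; the following sketch (our own illustration, not from the paper) compares the two formulas on arbitrary positive measures:

```python
import math

def F_KL(x):
    # Shannon negentropy generator, extended to positive measures
    return sum(xi * math.log(xi) for xi in x)

def bregman_F_KL(t1, t2):
    # B_F(t1 : t2) = F(t1) - F(t2) - <t1 - t2, grad F(t2)>,
    # with grad F(x)_i = log(x_i) + 1
    g = [math.log(xi) + 1.0 for xi in t2]
    return F_KL(t1) - F_KL(t2) - sum((a - b) * gi for a, b, gi in zip(t1, t2, g))

def extended_kl(p, q):
    # extended KL between positive (not necessarily normalized) measures
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

p, q = [0.2, 0.5, 0.8], [0.6, 0.1, 0.4]       # positive, not normalized
print(bregman_F_KL(p, q), extended_kl(p, q))  # both formulas agree
```

The two printed values coincide: the correction terms \(\theta _{2,i}-\theta _{1,i}\) of the extended KL are exactly what the Bregman construction adds when the measures do not sum to one.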
Whenever using a functionally parameterized distance in applications, we need to choose the most appropriate functional generator, ideally from first principles [3, 4, 13]. For example, Non-negative Matrix Factorization (NMF) for audio source separation or music transcription from the signal power spectrogram can be done by selecting the Itakura-Saito divergence [15]Footnote 2 that satisfies the requirement of being scale-invariant:
$$\begin{aligned} \mathrm {IS}(\lambda \theta _1:\lambda \theta _2)=\mathrm {IS}(\theta _1:\theta _2), \end{aligned}$$
for any \(\lambda >0\). When no such first principles can be easily stated for a task [13], we are left with choosing a generator manually or by cross-validation. Notice that a convex combination of Csiszár generators is a Csiszár generator (and similarly for Bregman generators): \(\sum _{i=1}^d \lambda _i I_{f_i}=I_{\sum _{i=1}^d \lambda _i f_i}\) for \(\lambda \) belonging to the standard \((d-1)\)-dimensional simplex \(\varDelta _d\).
In this work, we propose a novel class of distances, termed Bregman chord divergences. A Bregman chord divergence is parameterized by a Bregman generator and two scalar parameters which make it easy to fine-tune in applications, and matches asymptotically the ordinary Bregman divergence.
The paper is organized as follows: In Sect. 2, we describe the skewed Jensen divergence, show how to bi-skew any distance by using two scalars, and report the Jensen chord divergence [20]. In Sect. 3, we introduce the univariate Bregman chord divergence, and we extend its definition to the multivariate case in Sect. 4. Finally, we conclude in Sect. 5.
2 Geometric Design of Skewed Divergences
We can geometrically design divergences from the convexity gap properties of the graph of the generator. For example, the Jensen divergence \(J_F(\theta _1,\theta _2)\) of Eq. 3 is visualized as the ordinate (vertical) gap between the midpoint of the line segment \([(\theta _1,F(\theta _1));(\theta _2,F(\theta _2))]\) and the point \((\frac{\theta _1+\theta _2}{2},F(\frac{\theta _1+\theta _2}{2}))\). The non-negativity of the Jensen divergence follows from Jensen's midpoint convexity inequality [16]. Instead of taking the midpoint \(\bar{\theta }=\frac{\theta _1+\theta _2}{2}\), we can take any interior point \((\theta _1\theta _2)_\alpha :=(1-\alpha )\theta _1+\alpha \theta _2\), and get the skewed \(\alpha \)-Jensen divergence (for any \(\alpha \in (0,1)\)):
$$\begin{aligned} J_F^\alpha (\theta _1:\theta _2) :=(1-\alpha )F(\theta _1)+\alpha F(\theta _2)-F((\theta _1\theta _2)_\alpha )\ge 0. \end{aligned}$$
A remarkable fact is that the scaled \(\alpha \)-Jensen divergence \(\frac{1}{\alpha }J_F^\alpha (\theta _1:\theta _2)\) tends asymptotically to the reverse Bregman divergence \(B_F(\theta _2:\theta _1)\) when \(\alpha \rightarrow 0\), see [21, 30].
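This asymptotic behavior is easy to verify numerically; the sketch below (our own illustration, with the generator \(F(\theta )=\theta ^4\) chosen for checkability) shows that \(\frac{1}{\alpha }J_F^\alpha (\theta _1:\theta _2)\) approaches \(B_F(\theta _2:\theta _1)\) as \(\alpha \rightarrow 0\):

```python
def jensen_alpha(F, a, t1, t2):
    # skewed alpha-Jensen divergence J_F^a(t1 : t2)
    return (1 - a) * F(t1) + a * F(t2) - F((1 - a) * t1 + a * t2)

def bregman(F, dF, t1, t2):
    # ordinary Bregman divergence B_F(t1 : t2)
    return F(t1) - F(t2) - (t1 - t2) * dF(t2)

F = lambda t: t ** 4
dF = lambda t: 4 * t ** 3

t1, t2 = 0.5, 2.0
target = bregman(F, dF, t2, t1)  # reverse Bregman divergence B_F(t2 : t1)
for a in (1e-2, 1e-4, 1e-6):
    # the gap between the scaled alpha-Jensen divergence and the
    # reverse Bregman divergence shrinks linearly with a
    print(a, jensen_alpha(F, a, t1, t2) / a - target)
```

The printed gap decreases by roughly two orders of magnitude at each step, consistent with the first-order Taylor expansion of \(F\) at \(\theta _1\).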
By measuring the ordinate gap between two non-crossing upper and lower chords anchored at the generator graph plot, we can extend the \(\alpha \)-Jensen divergences to a tri-parametric family of Jensen chord divergences [20], measured at \((\theta _1\theta _2)_\gamma \) between the chord induced by \(\theta _1\) and \(\theta _2\) and the chord induced by \((\theta _1\theta _2)_\alpha \) and \((\theta _1\theta _2)_\beta \):
$$\begin{aligned} J_F^{\alpha ,\beta ,\gamma }(\theta _1:\theta _2) :=(1-\gamma )F(\theta _1)+\gamma F(\theta _2) -(1-\lambda )F((\theta _1\theta _2)_\alpha )-\lambda F((\theta _1\theta _2)_\beta ), \quad \lambda :=\frac{\gamma -\alpha }{\beta -\alpha }, \end{aligned}$$
with \(\alpha ,\beta \in [0,1]\) and \(\gamma \in [\alpha ,\beta ]\). The \(\alpha \)-Jensen divergence is recovered when \(\alpha =\beta =\gamma \), taking the degenerate lower chord to be the single point \(((\theta _1\theta _2)_\gamma ,F((\theta _1\theta _2)_\gamma ))\) (Fig. 1).
For any given distance \(D:\varTheta \times \varTheta \rightarrow \mathbb {R}_+\) (with convex parameter space \(\varTheta \)), we can bi-skew the distance by considering two scalars \(\gamma ,\delta \in \mathbb {R}\) (with \(\delta \not =\gamma \)) as:
$$\begin{aligned} D_{\gamma ,\delta }(\theta _1:\theta _2) :=D((\theta _1\theta _2)_\gamma :(\theta _1\theta _2)_\delta ). \end{aligned}$$
Clearly, \((\theta _1\theta _2)_\gamma =(\theta _1\theta _2)_\delta \) if and only if \((\delta -\gamma )(\theta _1-\theta _2)=0\). That is, if (i) \(\theta _1=\theta _2\) or if (ii) \(\delta =\gamma \). Since by definition \(\delta \not =\gamma \), we have \(D_{\gamma ,\delta }(\theta _1:\theta _2)=0\) if and only if \(\theta _1=\theta _2\). Notice that both \((\theta _1\theta _2)_\gamma =(1-\gamma )\theta _1+\gamma \theta _2\) and \((\theta _1\theta _2)_\delta =(1-\delta )\theta _1+\delta \theta _2\) should belong to the parameter space \(\varTheta \). A sufficient condition is to ensure that \(\gamma ,\delta \in [0,1]\) so that both \((\theta _1\theta _2)_\gamma \in \varTheta \) and \((\theta _1\theta _2)_\delta \in \varTheta \). When \(\varTheta =\mathbb {R}^d\), we may further consider any \(\gamma ,\delta \in \mathbb {R}\).
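As a sketch (our illustration), bi-skewing the squared Euclidean distance \(D(\theta _1,\theta _2)=(\theta _1-\theta _2)^2\) simply rescales it by \((\delta -\gamma )^2\), consistent with the identity \((\theta _1\theta _2)_\gamma -(\theta _1\theta _2)_\delta =(\delta -\gamma )(\theta _1-\theta _2)\):

```python
def interp(t1, t2, lam):
    # (t1 t2)_lam = (1 - lam) * t1 + lam * t2
    return (1 - lam) * t1 + lam * t2

def biskew(D, gamma, delta):
    # D_{gamma, delta}(t1 : t2) := D((t1 t2)_gamma : (t1 t2)_delta)
    if gamma == delta:
        raise ValueError("gamma and delta must be distinct")
    return lambda t1, t2: D(interp(t1, t2, gamma), interp(t1, t2, delta))

sq = lambda a, b: (a - b) ** 2
D_gd = biskew(sq, 0.2, 0.7)
print(D_gd(3.0, 1.0))  # ((0.7 - 0.2) * (3 - 1))^2 = 1.0
print(D_gd(2.0, 2.0))  # 0.0: the bi-skewed distance vanishes iff t1 = t2
```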
3 The Scalar Bregman Chord Divergence
Let \(F:\varTheta \subset \mathbb {R}\rightarrow \mathbb {R}\) be a univariate Bregman generator with open convex domain \(\varTheta \), and denote by \(\mathcal {F}=\{ (\theta ,F(\theta )) \}_\theta \) its graph. Let us rewrite the ordinary univariate Bregman divergence [7] of Eq. 1 as follows:
$$\begin{aligned} B_F(\theta _1:\theta _2) = F(\theta _1)-T_{\theta _2}(\theta _1), \end{aligned}$$
where \(y=T_{\theta }(\omega )\) denotes the equation of the tangent line of F at \(\theta \):
$$\begin{aligned} T_{\theta }(\omega ) :=F(\theta )+(\omega -\theta )F'(\theta ). \end{aligned}$$
Let \(\mathcal {T}_\theta =\{ (\omega ,T_\theta (\omega )) \ :\ \omega \in \varTheta \}\) denote the graph of that tangent line. Line \(\mathcal {T}_\theta \) is tangent to curve \(\mathcal {F}\) at point \(P_\theta :=(\theta ,F(\theta ))\). Graphically speaking, the Bregman divergence is interpreted as the ordinate (vertical) gap between the point \(P_{\theta _1}=(\theta _1,F(\theta _1))\in \mathcal {F}\) and the point \((\theta _1,T_{\theta _2}(\theta _1))\in \mathcal {T}_{\theta _2}\), as depicted in Fig. 2.
Now let us observe that we may relax the tangent line \(\mathcal {T}_{\theta _2}\) to a chord line (or secant) \(\mathcal {C}_{\theta _1,\theta _2}^{\alpha ,\beta } = \mathcal {C}_{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }\) passing through the points \(((\theta _1\theta _2)_\alpha ,F((\theta _1\theta _2)_\alpha ))\) and \(((\theta _1\theta _2)_\beta ,F((\theta _1\theta _2)_\beta ))\) for \(\alpha ,\beta \in (0,1)\) with \(\alpha \not =\beta \) (with corresponding Cartesian equation \(y=C_{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }(\omega )\)), and still get a non-negative vertical gap between \((\theta _1,F(\theta _1))\) and \((\theta _1,C_{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }(\theta _1))\): a line can cross the graph of a strictly convex function in at most two points, and the graph lies above the chord outside the chord interval. By construction, this vertical gap is smaller than the gap measured by the ordinary Bregman divergence. This yields the Bregman chord divergence (\(\alpha ,\beta \in (0,1]\), \(\alpha \not =\beta \)):
$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) :=F(\theta _1)-C_{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }(\theta _1), \end{aligned}$$
illustrated in Fig. 3. By expanding the chord equation and massaging the equation, we get the following formula:
$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) = F(\theta _1)-F((\theta _1\theta _2)_\alpha )-\alpha (\theta _1-\theta _2)\varDelta _F^{\alpha ,\beta }(\theta _1,\theta _2), \end{aligned}$$
where
$$\begin{aligned} \varDelta _F^{\alpha ,\beta }(\theta _1,\theta _2) :=\frac{F((\theta _1\theta _2)_\alpha )-F((\theta _1\theta _2)_\beta )}{(\theta _1\theta _2)_\alpha -(\theta _1\theta _2)_\beta } = \frac{F((\theta _1\theta _2)_\alpha )-F((\theta _1\theta _2)_\beta )}{(\beta -\alpha )(\theta _1-\theta _2)} \end{aligned}$$
is the slope of the chord, since \((\theta _1\theta _2)_\alpha -(\theta _1\theta _2)_\beta =(\beta -\alpha )(\theta _1-\theta _2)\) and \(\theta _1-(\theta _1\theta _2)_\alpha =\alpha (\theta _1-\theta _2)\).
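A direct transcription of this construction (a sketch of ours, with \(F(\theta )=\theta ^2\) chosen because the chord gap then admits the closed form \((\theta _1-(\theta _1\theta _2)_\alpha )(\theta _1-(\theta _1\theta _2)_\beta )\)) shows the gradient-free evaluation:

```python
def interp(t1, t2, lam):
    # (t1 t2)_lam = (1 - lam) * t1 + lam * t2
    return (1 - lam) * t1 + lam * t2

def bregman_chord(F, a, b, t1, t2):
    # B_F^{a,b}(t1 : t2): ordinate gap at t1 between F and the chord of F
    # through the points above (t1 t2)_a and (t1 t2)_b; no derivative of F needed
    xa, xb = interp(t1, t2, a), interp(t1, t2, b)
    slope = (F(xa) - F(xb)) / (xa - xb)   # chord slope Delta_F^{a,b}
    return F(t1) - F(xa) - (t1 - xa) * slope

F = lambda t: t * t
t1, t2 = 3.0, 1.0
d = bregman_chord(F, 0.25, 0.75, t1, t2)
print(d)  # for F(t) = t^2: (t1 - x_a)(t1 - x_b) = 0.5 * 1.5 = 0.75
```

Only two evaluations of \(F\) along the segment are needed, and the gap stays below the ordinary Bregman gap \(B_F(3:1)=4\).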
Notice the symmetry \(B_F^{\alpha ,\beta }(\theta _1:\theta _2)=B_F^{\beta ,\alpha }(\theta _1:\theta _2)\). We have \(0\le B_F^{\alpha ,\beta }(\theta _1:\theta _2)\le B_F(\theta _1:\theta _2)\).
When \(\alpha \rightarrow \beta \), the Bregman chord divergence yields a subfamily of Bregman tangent divergences: We consider the tangent line \(\mathcal {T}_{(\theta _1\theta _2)_\alpha }\) at \((\theta _1\theta _2)_\alpha \) and measure the ordinate gap at \(\theta _1\) between the function plot and this tangent line:
$$\begin{aligned} B_F^{\alpha }(\theta _1:\theta _2) :=F(\theta _1)-T_{(\theta _1\theta _2)_\alpha }(\theta _1) = F(\theta _1)-F((\theta _1\theta _2)_\alpha )-\alpha (\theta _1-\theta _2)F'((\theta _1\theta _2)_\alpha ), \end{aligned}$$
for \(\alpha \in (0,1]\). The ordinary Bregman divergence is recovered when \(\alpha =1\). Notice that the mean value theorem yields \(\varDelta _F^{\alpha ,\beta }(\theta _1,\theta _2)=F'(\xi )\) for some \(\xi \) strictly between \((\theta _1\theta _2)_\alpha \) and \((\theta _1\theta _2)_\beta \). Thus \(B_F^{\alpha ,\beta }(\theta _1:\theta _2)=F(\theta _1)-F((\theta _1\theta _2)_\alpha )-\alpha (\theta _1-\theta _2)F'(\xi )\). Letting \(\beta =1\) and \(\alpha =1-\epsilon \) (for small values of \(\epsilon \in (0,1)\)), we can approximate the ordinary Bregman divergence by the Bregman chord divergence without requiring to compute the gradient:
$$\begin{aligned} B_F(\theta _1:\theta _2) \approx B_F^{1-\epsilon ,1}(\theta _1:\theta _2) = F(\theta _1)-F((\theta _1\theta _2)_{1-\epsilon })-\frac{1-\epsilon }{\epsilon }\left( F((\theta _1\theta _2)_{1-\epsilon })-F(\theta _2)\right) . \end{aligned}$$
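The quality of this gradient-free approximation can be probed numerically. The sketch below (our illustration) uses the Burg-type generator \(F(\theta )=-\log \theta \) of footnote 2 and shows \(B_F^{1-\epsilon ,1}\) approaching \(B_F\) from below as \(\epsilon \) shrinks:

```python
import math

def interp(t1, t2, lam):
    return (1 - lam) * t1 + lam * t2

def bregman_chord(F, a, b, t1, t2):
    # gradient-free Bregman chord divergence B_F^{a,b}(t1 : t2)
    xa, xb = interp(t1, t2, a), interp(t1, t2, b)
    slope = (F(xa) - F(xb)) / (xa - xb)
    return F(t1) - F(xa) - (t1 - xa) * slope

F = lambda t: -math.log(t)   # univariate Burg negentropy, t > 0
dF = lambda t: -1.0 / t

t1, t2 = 2.0, 0.5
exact = F(t1) - F(t2) - (t1 - t2) * dF(t2)   # ordinary Bregman divergence
for eps in (1e-1, 1e-2, 1e-3):
    approx = bregman_chord(F, 1.0 - eps, 1.0, t1, t2)
    print(eps, exact - approx)  # positive and shrinking with eps
```

Each decade of \(\epsilon \) reduces the (always non-negative) approximation gap by roughly one decade, matching the first-order analysis above.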
4 The Multivariate Bregman Chord Divergence
When the generator is separable [3], i.e., \(F(x)=\sum _i F_i(x_i)\) for univariate generators \(F_i\), we easily extend the Bregman chord divergence as:
$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) :=\sum _{i=1}^D B_{F_i}^{\alpha ,\beta }(\theta _{1,i}:\theta _{2,i}). \end{aligned}$$
Otherwise, we have to carefully define the notion of "slope" for the multivariate case. An example of such a non-separable multivariate generator is the Legendre dual of the Shannon negentropy: the log-sum-exp function [24, 25]:
$$\begin{aligned} \mathrm {lse}(\theta ) :=\log \left( 1+\sum _{i=1}^D \exp (\theta _i)\right) . \end{aligned}$$
Given a multivariate (non-separable) Bregman generator \(F(\theta )\) with \(\varTheta \subseteq \mathbb {R}^D\) and two prescribed distinct parameters \(\theta _1\) and \(\theta _2\), consider the following univariate function, for \(\lambda \in \mathbb {R}\):
$$\begin{aligned} F_{\theta _1,\theta _2}(\lambda ) :=F((1-\lambda )\theta _1+\lambda \theta _2)=F((\theta _1\theta _2)_\lambda ), \end{aligned}$$
with \(F_{\theta _1,\theta _2}(0)=F(\theta _1)\) and \(F_{\theta _1,\theta _2}(1)=F(\theta _2)\).
The functions \(\{F_{\theta _1,\theta _2}\}_{\theta _1\not =\theta _2}\) are strictly convex and differentiable univariate Bregman generators.
Proof
To prove the strict convexity of a univariate function G, we need to show that for any \(\alpha \in (0,1)\) and \(x\not =y\), we have \(G((1-\alpha )x+\alpha y)< (1-\alpha )G(x)+\alpha G(y)\). For \(G=F_{\theta _1,\theta _2}\), since \((\theta _1\theta _2)_{(1-\alpha )x+\alpha y}=(1-\alpha )(\theta _1\theta _2)_x+\alpha (\theta _1\theta _2)_y\) and \((\theta _1\theta _2)_x\not =(\theta _1\theta _2)_y\) (because \(\theta _1\not =\theta _2\)), the strict convexity of F yields \(F_{\theta _1,\theta _2}((1-\alpha )x+\alpha y)<(1-\alpha )F_{\theta _1,\theta _2}(x)+\alpha F_{\theta _1,\theta _2}(y)\). Differentiability follows from the chain rule. \(\square \)
Then we define the multivariate Bregman chord divergence by applying the definition of the univariate Bregman chord divergence to these families of univariate Bregman generators:
$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) :=B_{F_{\theta _1,\theta _2}}^{\alpha ,\beta }(0:1). \end{aligned}$$
Since \((01)_\alpha =\alpha \) and \((01)_\beta =\beta \), we get:
$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) = F(\theta _1)-F((\theta _1\theta _2)_\alpha )-\frac{\alpha }{\beta -\alpha }\left( F((\theta _1\theta _2)_\alpha )-F((\theta _1\theta _2)_\beta )\right) , \end{aligned}$$
in accordance with the univariate case. Since \((\theta _1\theta _2)_\beta =(\theta _1\theta _2)_\alpha +(\beta -\alpha )(\theta _2-\theta _1)\), we have the first-order Taylor expansion
$$\begin{aligned} F((\theta _1\theta _2)_\beta ) = F((\theta _1\theta _2)_\alpha )+(\beta -\alpha )(\theta _2-\theta _1)^\top \nabla F((\theta _1\theta _2)_\alpha )+o(\beta -\alpha ). \end{aligned}$$
Therefore, we have:
$$\begin{aligned} \lim _{\beta \rightarrow \alpha } B_F^{\alpha ,\beta }(\theta _1:\theta _2) = F(\theta _1)-F((\theta _1\theta _2)_\alpha )-\alpha (\theta _1-\theta _2)^\top \nabla F((\theta _1\theta _2)_\alpha ). \end{aligned}$$
This proves that the multivariate Bregman chord divergence tends to the multivariate Bregman tangent divergence \(B_F^{\alpha }\) when \(\beta \rightarrow \alpha \), and to the ordinary multivariate Bregman divergence when, moreover, \(\alpha \rightarrow 1\).
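Putting the pieces together, the multivariate Bregman chord divergence can be sketched by reducing to the univariate case along the segment \([\theta _1,\theta _2]\). We illustrate it (our own code, not from the paper) with the strictly convex, non-separable generator \(F(\theta )=\log (1+\sum _i e^{\theta _i})\) and check the convergence to the ordinary Bregman divergence as \((\alpha ,\beta )\rightarrow (1,1)\):

```python
import math

def F(theta):
    # non-separable strictly convex generator: log(1 + sum_i exp(theta_i))
    return math.log(1.0 + sum(math.exp(t) for t in theta))

def grad_F(theta):
    z = 1.0 + sum(math.exp(t) for t in theta)
    return [math.exp(t) / z for t in theta]

def restrict(F, t1, t2):
    # univariate restriction F_{t1,t2}(lam) = F((1 - lam) t1 + lam t2)
    return lambda lam: F([(1 - lam) * a + lam * b for a, b in zip(t1, t2)])

def mv_bregman_chord(F, a, b, t1, t2):
    # multivariate chord divergence: univariate chord divergence of the
    # restriction evaluated as B^{a,b}_{F_{t1,t2}}(0 : 1)
    G = restrict(F, t1, t2)
    slope = (G(a) - G(b)) / (a - b)
    return G(0.0) - G(a) + a * slope

def mv_bregman(F, grad_F, t1, t2):
    # ordinary multivariate Bregman divergence (needs the gradient)
    g = grad_F(t2)
    return F(t1) - F(t2) - sum((x - y) * gi for x, y, gi in zip(t1, t2, g))

t1, t2 = [0.0, 1.0], [1.0, -0.5]
exact = mv_bregman(F, grad_F, t1, t2)
print(exact, mv_bregman_chord(F, 0.999, 1.0, t1, t2))  # close for (a,b) near (1,1)
```

Only evaluations of \(F\) along the segment are used; the gradient appears solely in the reference value we compare against.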
Notice that the Bregman chord divergence does not require computing the gradient \(\nabla F\). The "slope term" in the definition is reminiscent of the q-derivative [18] (quantum/discrete derivatives). However, the (p, q)-derivatives [18] are defined with respect to a single reference point while the chord definition requires two reference points.
5 Conclusion
In this paper, we geometrically designed a new class of distances using a Bregman generator and two additional scalar parameters, termed the Bregman chord divergence, and its one-parametric subfamily, the Bregman tangent divergences, which includes the ordinary Bregman divergence. This generalization allows one to easily fine-tune Bregman divergences in applications by smoothly adjusting one or two scalar knobs. Moreover, by choosing \(\alpha =1-\epsilon \) and \(\beta =1\) for small \(\epsilon >0\), the Bregman chord divergence \(B_{F}^{1-\epsilon ,1}(\theta _1:\theta _2)\) closely lower-bounds the Bregman divergence \(B_{F}(\theta _1:\theta _2)\) without requiring any gradient computation (a different gradient-free approximation is \(\frac{1}{\epsilon } J_F^\epsilon (\theta _2:\theta _1)\)). We expect that this new class of distances brings further improvements in signal processing and information fusion applications [29] (e.g., by tuning \(B_{F_\mathrm {KL}}^{\alpha ,\beta }\) or \(B_{F_\mathrm {IS}}^{\alpha ,\beta }\)). While the Bregman chord divergence defines an ordinate gap on the exterior of the epigraph, the Jensen chord divergence [20] defines the gap inside the epigraph of the generator. In future work, the dualistic information-geometric structure induced by the Bregman chord divergences shall be investigated from the viewpoint of gauge theory [19] and in contrast with the dually flat structures of Bregman manifolds [3].
Source code in Java™ is available for reproducible research.Footnote 3
Notes
- 1.
Here, we use the word distance to mean a dissimilarity (or a distortion, a deviance, a discrepancy, etc.), not necessarily a metric distance [14]. A distance between arguments \(\theta _1\) and \(\theta _2\) satisfies \(D(\theta _1,\theta _2)\ge 0\) with equality if and only if \(\theta _1=\theta _2\).
- 2.
A Bregman divergence for the Burg negentropy \(F_\mathrm {IS}(x)=-\sum _i\log x_i\).
- 3.
References
Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. Roy. Stat. Soc.: Ser. B (Methodol.) 28(1), 131–142 (1966)
Amari, S.I.: \(\alpha \)-divergence is unique, belonging to both \(f\)-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 55(11), 4925–4931 (2009)
Amari, S.: Information Geometry and Its Applications. AMS, vol. 194. Springer, Tokyo (2016). https://doi.org/10.1007/978-4-431-55978-8
Banerjee, A., Guo, X., Wang, H.: On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. Inf. Theory 51(7), 2664–2669 (2005)
Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6(Oct), 1705–1749 (2005)
Basseville, M.: Divergence measures for statistical data processing: an annotated bibliography. Sig. Process. 93(4), 621–633 (2013)
Bregman, L.M.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7(3), 200–217 (1967)
Broniatowski, M., Stummer, W.: Some universal insights on divergences for statistics, machine learning and artificial intelligence. In: Nielsen, F. (ed.) Geometric Structures of Information. SCT, pp. 149–211. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-02520-5_8
Burbea, J., Rao, C.: On the convexity of some divergence measures based on entropy functions. IEEE Trans. Inf. Theory 28(3), 489–495 (1982)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (2012)
Csiszár, I.: Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tudományos Akadémia - MAT 8, 85–108 (1963)
Csiszár, I.: Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica 2, 229–318 (1967)
Csiszár, I.: Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Stat. 19(4), 2032–2066 (1991). https://doi.org/10.1007/978-1-4613-0071-7
Deza, M.M., Deza, E.: Encyclopedia of distances. In: Deza, M.M., Deza, E. (eds.) Encyclopedia of Distances, pp. 1–583. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-00234-2_1
Févotte, C.: Majorization-minimization algorithm for smooth Itakura-Saito nonnegative matrix factorization. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1980–1983. IEEE (2011)
Jensen, J.L.W.V.: Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Math. 30(1), 175–193 (1906)
Jiao, J., Courtade, T.A., No, A., Venkat, K., Weissman, T.: Information measures: the curious case of the binary alphabet. IEEE Trans. Inf. Theory 60(12), 7616–7626 (2014)
Kac, V., Cheung, P.: Quantum Calculus. Springer, New York (2001)
Naudts, J., Zhang, J.: Rho-tau embedding and gauge freedom in information geometry. Inf. Geom. 1(1), 79–115 (2018)
Nielsen, F.: The chord gap divergence and a generalization of the Bhattacharyya distance. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2276–2280, April 2018. https://doi.org/10.1109/ICASSP.2018.8462244
Nielsen, F., Boltz, S.: The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 57(8), 5455–5466 (2011)
Nielsen, F., Nock, R.: Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 55(6), 2882–2904 (2009)
Nielsen, F., Nock, R.: On the geometry of mixtures of prescribed distributions. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2861–2865. IEEE (2018)
Nielsen, F., Sun, K.: Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy 18(12), 442 (2016)
Nielsen, F., Sun, K.: Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures. IEEE Signal Process. Lett. 23(11), 1543–1546 (2016)
Nock, R., Magdalou, B., Briys, E., Nielsen, F.: Mining matrix data with Bregman matrix divergences for portfolio selection. In: Nielsen, F., Bhatia, R. (eds.) Matrix Information Geometry, pp. 373–402. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-30232-9_15
Pardo, M.C., Vajda, I.: About distances of discrete distributions satisfying the data processing theorem of information theory. IEEE Trans. Inf. Theory 43(4), 1288–1293 (1997)
Stummer, W., Vajda, I.: On divergences of finite measures and their applicability in statistics and information theory. Statistics 44(2), 169–187 (2010)
Üney, M., Houssineau, J., Delande, E., Julier, S.J., Clark, D.E.: Fusion of finite set distributions: pointwise consistency and global cardinality. CoRR abs/1802.06220 (2018). http://arxiv.org/abs/1802.06220
Zhang, J.: Divergence function, duality, and convex analysis. Neural Comput. 16(1), 159–195 (2004)
© 2019 Springer Nature Switzerland AG
Cite this paper
Nielsen, F., Nock, R. (2019). The Bregman Chord Divergence. In: Nielsen, F., Barbaresco, F. (eds) Geometric Science of Information. GSI 2019. Lecture Notes in Computer Science(), vol 11712. Springer, Cham. https://doi.org/10.1007/978-3-030-26980-7_31
Print ISBN: 978-3-030-26979-1
Online ISBN: 978-3-030-26980-7