
1 Introduction

Distances are at the heart of many signal processing tasks [6, 14], and the performance of the algorithms solving those tasks heavily depends on the chosen distances. Historically, many ad hoc distances have been proposed and empirically benchmarked on different tasks in order to improve the state of the art. However, choosing the most appropriate distance for a given task is often a difficult endeavour. Thus principled classes of distances have been proposed and studied. Among those, three main generic classes have emerged:

  • The Bregman divergences [5, 7, 22] defined for a strictly convex and differentiable generator \(F\in \mathcal {B}:\varTheta \rightarrow \mathbb {R}\) (where \(\mathcal {B}\) denotes the class of strictly convex and differentiable functions defined modulo affine terms):

    $$\begin{aligned} B_F(\theta _1:\theta _2) :=F(\theta _1)-F(\theta _2)-(\theta _1-\theta _2)^\top \nabla F(\theta _2), \end{aligned}$$
    (1)

    measure the dissimilarity between parameters \(\theta _1,\theta _2\in \varTheta \), where \(\varTheta \subset \mathbb {R}^d\) is a d-dimensional convex set. Bregman divergences have also been generalized to other types of objects like matrices [26].

  • The Csiszár f-divergences [1, 11, 12] defined for a convex generator \(f\in \mathcal {C}\) satisfying \(f(1)=0\) and strictly convex at 1:

    $$\begin{aligned} I_f[p_1:p_2] :=\int _\mathcal {X}p_1(x) f\left( \frac{p_2(x)}{p_1(x)}\right) \mathrm {d}\mu (x) \ge f(1)=0, \end{aligned}$$
    (2)

    measure the dissimilarity between probability densities \(p_1\) and \(p_2\) that are absolutely continuous with respect to a base measure \(\mu \) (defined on a support \(\mathcal {X}\)).

  • The Burbea-Rao divergences [9], also called Jensen differences or Jensen divergences because they rely on Jensen's inequality [16], defined for a strictly convex function \(F\in \mathcal {J}:\varTheta \rightarrow \mathbb {R}\):

    $$\begin{aligned} J_F(\theta _1,\theta _2) :=\frac{F(\theta _1)+F(\theta _2)}{2} -F \left( \frac{\theta _1+\theta _2}{2}\right) \ge 0, \end{aligned}$$
    (3)

    where \(\theta _1\) and \(\theta _2\) belong to a parameter space \(\varTheta \).
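
To make these definitions concrete, here is a minimal, self-contained Java sketch (illustrative only, not the paper's companion code; class and method names are ours) that evaluates a Bregman divergence (Eq. 1) and a Jensen divergence (Eq. 3) for the separable quadratic generator \(F(\theta )=\sum _i \theta _i^2\), for which \(B_F\) is the squared Euclidean distance and \(J_F\) is one quarter of it.

```java
// Illustrative sketch (not the paper's companion code): a Bregman divergence
// and a Jensen (Burbea-Rao) divergence for the separable quadratic generator
// F(theta) = sum_i theta_i^2.
public class DivergenceDemo {
    static double F(double[] t) {
        double s = 0.0;
        for (double x : t) s += x * x;
        return s;
    }
    static double[] gradF(double[] t) {
        double[] g = new double[t.length];
        for (int i = 0; i < t.length; i++) g[i] = 2.0 * t[i];
        return g;
    }
    // B_F(t1 : t2) = F(t1) - F(t2) - <t1 - t2, grad F(t2)>   (Eq. 1)
    static double bregman(double[] t1, double[] t2) {
        double[] g = gradF(t2);
        double inner = 0.0;
        for (int i = 0; i < t1.length; i++) inner += (t1[i] - t2[i]) * g[i];
        return F(t1) - F(t2) - inner;
    }
    // J_F(t1, t2) = (F(t1) + F(t2)) / 2 - F((t1 + t2) / 2)   (Eq. 3)
    static double jensen(double[] t1, double[] t2) {
        double[] mid = new double[t1.length];
        for (int i = 0; i < t1.length; i++) mid[i] = 0.5 * (t1[i] + t2[i]);
        return 0.5 * (F(t1) + F(t2)) - F(mid);
    }
    public static void main(String[] args) {
        double[] a = {1.0, 2.0}, b = {0.5, -1.0};
        System.out.println("B_F = " + bregman(a, b)); // squared Euclidean distance: 9.25
        System.out.println("J_F = " + jensen(a, b));  // one quarter of it: 2.3125
    }
}
```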

These three fundamental classes of distances are not mutually exclusive, and their pairwise intersections (e.g., \(\mathcal {B}\cap \mathcal {C}\) or \(\mathcal {J}\cap \mathcal {C}\)) have been studied in [2, 17, 27]. The ‘:’ notation between arguments of distances emphasizes the potential asymmetry of distances (oriented distances with \(D(\theta _1:\theta _2)\not = D(\theta _2:\theta _1)\)), and the brackets surrounding distance arguments indicate that it is a statistical distance between probability densities, and not a distance between parameters. Using these notations, we express the Kullback-Leibler distance [10] (KL) as

$$\begin{aligned} \mathrm {KL}[p_1:p_2] :=\int p_{1}(x)\log \frac{p_{1}(x)}{p_{2}(x)}\mathrm {d}\mu (x). \end{aligned}$$
(4)

The KL distance/divergence between two members \(p_{\theta _1}\) and \(p_{\theta _2}\) of a parametric family \(\mathcal {F}\) of distributions amounts to a parameter divergence:

$$\begin{aligned} \mathrm {KL}_\mathcal {F}(\theta _1:\theta _2) :=\mathrm {KL}[p_{\theta _1}:p_{\theta _2}]. \end{aligned}$$
(5)

For example, the KL statistical distance between two probability densities belonging to the same exponential family or the same mixture family amounts to a (parameter) Bregman divergence [3, 23]. When \(p_1\) and \(p_2\) are finite discrete distributions of the d-dimensional probability simplex \(\varDelta _d\), we have \(\mathrm {KL}_{\varDelta _d}(p_1:p_2)=\mathrm {KL}[p_{1}:p_{2}]\). This explains why we can sometimes loosely handle distances between discrete distributions both as parameter distances and as statistical distances. For example, the KL distance between two discrete distributions is a Bregman divergence \(B_{F_\mathrm {KL}}\) for the Shannon negentropy generator \(F_\mathrm {KL}(x)=\sum _{i=1}^d x_i\log x_i\) with \(x\in \varTheta =\varDelta _d\). Extending \(\varTheta =\varDelta _d\) to the cone of positive measures \(\varTheta =\mathbb {R}_+^d\), this Bregman divergence \(B_{F_\mathrm {KL}}\) yields the extended KL distance:

$$\begin{aligned} \mathrm {eKL}[p:q] = \sum _{i=1}^d p_{i}\log \frac{p_{i}}{q_{i}}+q_i-p_i. \end{aligned}$$
(6)

Notice that the KL divergence of Eq. 4 between unnormalized positive distributions may take negative values (e.g., Example 2.1 of [28] and [8]). This situation also arises when performing Monte Carlo stochastic integration of the KL divergence integral.
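
As a small numerical illustration of Eqs. 4 and 6 (a sketch with made-up inputs, not the paper's companion code), the following Java snippet evaluates the plain discrete KL divergence and the extended KL divergence on unnormalized positive arrays; the plain KL sum comes out negative on this input while eKL remains non-negative.

```java
// Illustrative sketch (not the paper's companion code): plain KL (Eq. 4) vs.
// extended KL (Eq. 6) on unnormalized positive arrays.
public class KLDemo {
    static double kl(double[] p, double[] q) {
        double s = 0.0;
        for (int i = 0; i < p.length; i++) s += p[i] * Math.log(p[i] / q[i]);
        return s;
    }
    static double ekl(double[] p, double[] q) {
        double s = 0.0;
        for (int i = 0; i < p.length; i++)
            s += p[i] * Math.log(p[i] / q[i]) + q[i] - p[i];
        return s;
    }
    public static void main(String[] args) {
        double[] p = {0.2, 0.3}; // positive but unnormalized measure
        double[] q = {0.6, 0.9}; // positive but unnormalized measure
        System.out.println("KL  = " + kl(p, q));  // negative on this input
        System.out.println("eKL = " + ekl(p, q)); // always non-negative
    }
}
```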

Whenever using a functionally parameterized distance in applications, we need to choose the most appropriate functional generator, ideally from first principles [3, 4, 13]. For example, Non-negative Matrix Factorization (NMF) for audio source separation or music transcription from the signal power spectrogram can be performed by selecting the Itakura-Saito divergence [15], which satisfies the requirement of being scale-invariant:

$$\begin{aligned} B_{F_\mathrm {IS}}(\lambda \theta :\lambda \theta ')=B_{F_\mathrm {IS}}(\theta :\theta ')=\sum _i \left( \frac{\theta _i}{\theta _i'}-\log \frac{\theta _i}{\theta _i'}-1\right) , \end{aligned}$$
(7)

for any \(\lambda >0\). When no such first principles can be easily stated for a task [13], we are left with choosing a generator manually or by cross-validation. Notice that a convex combination of Csiszár generators is a Csiszár generator (and similarly for Bregman generators): \(\sum _{i=1}^d \lambda _i I_{f_i}=I_{\sum _{i=1}^d \lambda _i f_i}\) for \(\lambda \) belonging to the standard \((d-1)\)-dimensional simplex \(\varDelta _d\).
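
The following Java sketch (illustrative only, not the paper's companion code) implements the Itakura-Saito divergence of Eq. 7 and checks its scale invariance numerically on made-up inputs.

```java
// Illustrative sketch (not the paper's companion code): the Itakura-Saito
// divergence of Eq. 7 and a numerical check of its scale invariance.
public class ItakuraSaitoDemo {
    static double itakuraSaito(double[] t1, double[] t2) {
        double s = 0.0;
        for (int i = 0; i < t1.length; i++) {
            double r = t1[i] / t2[i];
            s += r - Math.log(r) - 1.0;
        }
        return s;
    }
    public static void main(String[] args) {
        double[] t1 = {1.0, 4.0, 0.5}, t2 = {2.0, 3.0, 0.25};
        double lambda = 7.3;
        double[] s1 = new double[3], s2 = new double[3];
        for (int i = 0; i < 3; i++) { s1[i] = lambda * t1[i]; s2[i] = lambda * t2[i]; }
        // The two printed values coincide up to floating-point rounding.
        System.out.println(itakuraSaito(t1, t2));
        System.out.println(itakuraSaito(s1, s2));
    }
}
```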

In this work, we propose a novel class of distances, termed Bregman chord divergences. A Bregman chord divergence is parameterized by a Bregman generator and two scalar parameters, which makes it easy to fine-tune in applications, and it matches the ordinary Bregman divergence asymptotically.

The paper is organized as follows: In Sect. 2, we describe the skewed Jensen divergence, show how to bi-skew any distance by using two scalars, and report on the Jensen chord divergence [20]. In Sect. 3, we first introduce the univariate Bregman chord divergence, and then extend its definition to the multivariate case, in Sect. 4. Finally, we conclude in Sect. 5.

2 Geometric Design of Skewed Divergences

We can geometrically design divergences from convexity gap properties of the graph plot of the generator. For example, the Jensen divergence \(J_F(\theta _1:\theta _2)\) of Eq. 3 is visualized as the ordinate (vertical) gap between the midpoint of the line segment \([(\theta _1,F(\theta _1));(\theta _2,F(\theta _2))]\) and the point \((\frac{\theta _1+\theta _2}{2},F(\frac{\theta _1+\theta _2}{2}))\). The non-negativity property of the Jensen divergence follows from Jensen's midpoint convexity inequality [16]. Instead of taking the midpoint \(\bar{\theta }=\frac{\theta _1+\theta _2}{2}\), we can take any interior point \((\theta _1\theta _2)_\alpha :=(1-\alpha )\theta _1+\alpha \theta _2\), and get the skewed \(\alpha \)-Jensen divergence (for any \(\alpha \in (0,1)\)):

$$\begin{aligned} J_F^\alpha (\theta _1:\theta _2) :=(F(\theta _1)F(\theta _2))_\alpha - F((\theta _1\theta _2)_\alpha ) \ge 0. \end{aligned}$$
(8)

A remarkable fact is that the scaled \(\alpha \)-Jensen divergence \(\frac{1}{\alpha }J_F^\alpha (\theta _1:\theta _2)\) tends asymptotically to the reverse Bregman divergence \(B_F(\theta _2:\theta _1)\) when \(\alpha \rightarrow 0\), see [21, 30].
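
The following Java sketch (an illustration with an arbitrarily chosen generator, not the paper's companion code) checks this limit numerically for the univariate generator \(F(x)=x\log x\): the scaled skewed Jensen divergence approaches \(B_F(\theta _2:\theta _1)\) as \(\alpha \) shrinks.

```java
// Illustrative numerical check (not the paper's companion code): for the
// univariate generator F(x) = x log x, (1/alpha) J_F^alpha(t1 : t2)
// approaches the reverse Bregman divergence B_F(t2 : t1) as alpha -> 0.
import java.util.function.DoubleUnaryOperator;

public class JensenToBregmanDemo {
    static double jensenSkew(DoubleUnaryOperator F, double a, double t1, double t2) {
        return (1 - a) * F.applyAsDouble(t1) + a * F.applyAsDouble(t2)
                - F.applyAsDouble((1 - a) * t1 + a * t2);
    }
    static double bregman(DoubleUnaryOperator F, DoubleUnaryOperator dF, double t1, double t2) {
        return F.applyAsDouble(t1) - F.applyAsDouble(t2) - (t1 - t2) * dF.applyAsDouble(t2);
    }
    public static void main(String[] args) {
        DoubleUnaryOperator F = x -> x * Math.log(x);    // Shannon negentropy (1D)
        DoubleUnaryOperator dF = x -> Math.log(x) + 1.0; // F'(x)
        double t1 = 0.3, t2 = 1.7;
        for (double a : new double[]{1e-1, 1e-2, 1e-3, 1e-4})
            System.out.println("alpha=" + a + "  (1/alpha) J^alpha = " + jensenSkew(F, a, t1, t2) / a);
        System.out.println("B_F(t2:t1)       = " + bregman(F, dF, t2, t1));
    }
}
```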

By measuring the ordinate gap between two non-crossing upper and lower chords anchored at the generator graph plot, we can extend the \(\alpha \)-Jensen divergences to a tri-parametric family of Jensen chord divergences [20]:

$$\begin{aligned} J_F^{\alpha ,\beta ,\gamma }(\theta :\theta ') :=(F(\theta )F(\theta '))_\gamma -(F((\theta \theta ')_\alpha )F((\theta \theta ')_\beta ))_{\frac{\gamma -\alpha }{\beta -\alpha }}, \end{aligned}$$
(9)

with \(\alpha ,\beta \in [0,1]\) and \(\gamma \in [\alpha ,\beta ]\). The \(\alpha \)-Jensen divergence is recovered when \(\alpha =\beta =\gamma \) (Fig. 1).

Fig. 1. The Jensen chord gap divergence.

For any given distance \(D:\varTheta \times \varTheta \rightarrow \mathbb {R}_+\) (with convex parameter space \(\varTheta \)), we can bi-skew the distance by considering two scalars \(\gamma ,\delta \in \mathbb {R}\) (with \(\delta \not =\gamma \)) as:

$$\begin{aligned} D_{\gamma ,\delta }(\theta _1:\theta _2) :=D((\theta _1\theta _2)_\gamma :(\theta _1\theta _2)_\delta ). \end{aligned}$$
(10)

Clearly, \((\theta _1\theta _2)_\gamma =(\theta _1\theta _2)_\delta \) if and only if \((\delta -\gamma )(\theta _1-\theta _2)=0\). That is, if (i) \(\theta _1=\theta _2\) or if (ii) \(\delta =\gamma \). Since by definition \(\delta \not =\gamma \), we have \(D_{\gamma ,\delta }(\theta _1:\theta _2)=0\) if and only if \(\theta _1=\theta _2\). Notice that both \((\theta _1\theta _2)_\gamma =(1-\gamma )\theta _1+\gamma \theta _2\) and \((\theta _1\theta _2)_\delta =(1-\delta )\theta _1+\delta \theta _2\) should belong to the parameter space \(\varTheta \). A sufficient condition is to ensure that \(\gamma ,\delta \in [0,1]\) so that both \((\theta _1\theta _2)_\gamma \in \varTheta \) and \((\theta _1\theta _2)_\delta \in \varTheta \). When \(\varTheta =\mathbb {R}^d\), we may further consider any \(\gamma ,\delta \in \mathbb {R}\).
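
As a minimal illustration of Eq. 10 (a sketch with illustrative names, not the paper's companion code), the following Java snippet bi-skews a base distance given as a function object:

```java
// Illustrative sketch (not the paper's companion code): bi-skewing a base
// distance D into D_{gamma,delta} as in Eq. 10 (univariate parameters).
import java.util.function.DoubleBinaryOperator;

public class BiSkewDemo {
    // (t1 t2)_a = (1 - a) t1 + a t2
    static double interpolate(double t1, double t2, double a) {
        return (1 - a) * t1 + a * t2;
    }
    // D_{gamma,delta}(t1 : t2) = D((t1 t2)_gamma : (t1 t2)_delta), with gamma != delta
    static double biSkew(DoubleBinaryOperator D, double gamma, double delta, double t1, double t2) {
        return D.applyAsDouble(interpolate(t1, t2, gamma), interpolate(t1, t2, delta));
    }
    public static void main(String[] args) {
        DoubleBinaryOperator sqDist = (a, b) -> (a - b) * (a - b); // base distance
        System.out.println(biSkew(sqDist, 0.25, 0.75, 1.0, 3.0)); // 1.0
        System.out.println(biSkew(sqDist, 0.25, 0.75, 2.0, 2.0)); // 0.0: vanishes iff t1 == t2
    }
}
```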

3 The Scalar Bregman Chord Divergence

Let \(F:\varTheta \subset \mathbb {R}\rightarrow \mathbb {R}\) be a univariate Bregman generator with open convex domain \(\varTheta \), and denote by \(\mathcal {F}=\{ (\theta ,F(\theta )) \}_\theta \) its graph. Let us rewrite the ordinary univariate Bregman divergence [7] of Eq. 1 as follows:

$$\begin{aligned} B_F(\theta _1:\theta _2) = F(\theta _1) - T_{\theta _2}(\theta _1), \end{aligned}$$
(11)

where \(y=T_{\theta }(\omega )\) denotes the equation of the tangent line of F at \(\theta \):

$$\begin{aligned} T_{\theta }(\omega ) :=F(\theta )+(\omega -\theta ) F'(\theta ). \end{aligned}$$
(12)

Let \(\mathcal {T}_\theta =\{ (\omega ,T_\theta (\omega )) \ :\ \omega \in \varTheta \}\) denote the graph of that tangent line. Line \(\mathcal {T}_\theta \) is tangent to curve \(\mathcal {F}\) at point \(P_\theta :=(\theta ,F(\theta ))\). Graphically speaking, the Bregman divergence is interpreted as the ordinate (vertical) gap between the point \(P_{\theta _1}=(\theta _1,F(\theta _1))\in \mathcal {F}\) and the point \((\theta _1,T_{\theta _2}(\theta _1))\in \mathcal {T}_{\theta _2}\), as depicted in Fig. 2.

Fig. 2. Bregman divergence as the vertical gap between the generator graph \(\mathcal {F}\) and the tangent line \(\mathcal {T}_{\theta _2}\) at \(\theta _2\).

Now let us observe that we may relax the tangent line \(\mathcal {T}_{\theta _2}\) to a chord line (or secant) \(\mathcal {C}_{\theta _1,\theta _2}^{\alpha ,\beta } = \mathcal {C}_{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }\) passing through the points \(((\theta _1\theta _2)_\alpha ,F((\theta _1\theta _2)_\alpha ))\) and \(((\theta _1\theta _2)_\beta ,F((\theta _1\theta _2)_\beta ))\) for \(\alpha ,\beta \in (0,1)\) with \(\alpha \not =\beta \) (with corresponding Cartesian equation \(C_{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }\)), and still get a non-negative vertical gap between \((\theta _1,F(\theta _1))\) and \((\theta _1,C_{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }(\theta _1))\): since \(\theta _1\) lies outside the chord interval and a line can cross the graph of a strictly convex function in at most two points, the graph lies above the chord at \(\theta _1\). By construction, this vertical gap is smaller than the gap measured by the ordinary Bregman divergence. This yields the Bregman chord divergence (\(\alpha ,\beta \in (0,1]\), \(\alpha \not =\beta \)):

$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) :=F(\theta _1) - C_F^{(\theta _1\theta _2)_\alpha ,(\theta _1\theta _2)_\beta }(\theta _1) \le B_F(\theta _1:\theta _2), \end{aligned}$$
(13)

illustrated in Fig. 3. By expanding the chord equation and rearranging terms, we get the following formula:

$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2):= & {} F(\theta _1) - \varDelta _F^{\alpha ,\beta }(\theta _1,\theta _2) (\theta _1-(\theta _1\theta _2)_\alpha ) - F((\theta _1\theta _2)_\alpha ),\\= & {} F(\theta _1) - F\left( (\theta _1\theta _2)_\alpha \right) + \frac{\alpha \left\{ {F\left( (\theta _1\theta _2)_\beta \right) -F\left( (\theta _1\theta _2)_\alpha \right) }\right\} }{\beta -\alpha },\nonumber \end{aligned}$$
(14)

where

$$\begin{aligned} \varDelta _F^{\alpha ,\beta }(\theta _1,\theta _2):=\frac{F((\theta _1\theta _2)_\alpha )-F((\theta _1\theta _2)_\beta )}{(\theta _1\theta _2)_\alpha -(\theta _1\theta _2)_\beta } \end{aligned}$$
(15)

is the slope of the chord, where we used the identities \((\theta _1\theta _2)_\alpha -(\theta _1\theta _2)_\beta =(\beta -\alpha )(\theta _1-\theta _2)\) and \(\theta _1-(\theta _1\theta _2)_\alpha =\alpha (\theta _1-\theta _2)\).
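
The following Java sketch (illustrative only, not the paper's companion code) implements the closed form of Eq. 14 for the univariate generator \(F(x)=x\log x\) and compares it with the ordinary Bregman divergence, which upper-bounds it.

```java
// Illustrative sketch (not the paper's companion code): the univariate Bregman
// chord divergence of Eqs. 13-15 for F(x) = x log x, compared with the
// ordinary Bregman divergence that upper-bounds it.
import java.util.function.DoubleUnaryOperator;

public class BregmanChordDemo {
    static double interpolate(double t1, double t2, double a) {
        return (1 - a) * t1 + a * t2;
    }
    // B_F^{alpha,beta}(t1 : t2) = F(t1) - F((t1 t2)_alpha)
    //     + alpha (F((t1 t2)_beta) - F((t1 t2)_alpha)) / (beta - alpha)
    static double bregmanChord(DoubleUnaryOperator F, double alpha, double beta, double t1, double t2) {
        double fa = F.applyAsDouble(interpolate(t1, t2, alpha));
        double fb = F.applyAsDouble(interpolate(t1, t2, beta));
        return F.applyAsDouble(t1) - fa + alpha * (fb - fa) / (beta - alpha);
    }
    // Ordinary Bregman divergence (requires the derivative F').
    static double bregman(DoubleUnaryOperator F, DoubleUnaryOperator dF, double t1, double t2) {
        return F.applyAsDouble(t1) - F.applyAsDouble(t2) - (t1 - t2) * dF.applyAsDouble(t2);
    }
    public static void main(String[] args) {
        DoubleUnaryOperator F = x -> x * Math.log(x);    // strictly convex on (0, +infinity)
        DoubleUnaryOperator dF = x -> Math.log(x) + 1.0;
        double t1 = 0.4, t2 = 2.0;
        System.out.println("B_F^{0.3,0.8} = " + bregmanChord(F, 0.3, 0.8, t1, t2)); // ~0.336
        System.out.println("B_F           = " + bregman(F, dF, t1, t2));            // ~0.956
    }
}
```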

Fig. 3. The Bregman chord divergence \(B_F^{\alpha ,\beta }(\theta _1:\theta _2)\).

Notice the symmetry \(B_F^{\alpha ,\beta }(\theta _1:\theta _2)=B_F^{\beta ,\alpha }(\theta _1:\theta _2)\). We have

$$\begin{aligned} \lim _{\alpha \rightarrow 1,\beta \rightarrow 1} B_F^{\alpha ,\beta }(\theta _1:\theta _2) = B_F(\theta _1:\theta _2). \end{aligned}$$
(16)

When \(\beta \rightarrow \alpha \), the Bregman chord divergence yields a one-parameter subfamily of divergences, the Bregman tangent divergences:

$$\begin{aligned} B_F^{\alpha }(\theta _1:\theta _2)=\lim _{\beta \rightarrow \alpha } B_F^{\alpha ,\beta }(\theta _1:\theta _2)\le B_F(\theta _1:\theta _2). \end{aligned}$$
(17)

We consider the tangent line \(\mathcal {T}_{(\theta _1\theta _2)_\alpha }\) at \((\theta _1\theta _2)_\alpha \) and measure the ordinate gap at \(\theta _1\) between the function plot and this tangent line:

$$\begin{aligned} B_F^\alpha (\theta _1:\theta _2):= & {} F(\theta _1)-F\left( (\theta _1\theta _2)_\alpha \right) - \left( \theta _1-(\theta _1\theta _2)_\alpha \right) ^\top \nabla F\left( (\theta _1\theta _2)_\alpha \right) ,\nonumber \\= & {} F(\theta _1)-F\left( (\theta _1\theta _2)_\alpha \right) -\alpha (\theta _1-\theta _2)^\top \nabla F\left( (\theta _1\theta _2)_\alpha \right) , \end{aligned}$$
(18)

for \(\alpha \in (0,1]\). The ordinary Bregman divergence is recovered when \(\alpha =1\). Notice that the mean value theorem yields \(\varDelta _F^{\alpha ,\beta }(\theta _1,\theta _2)=F'(\xi )\) for some \(\xi \) lying strictly between \((\theta _1\theta _2)_\alpha \) and \((\theta _1\theta _2)_\beta \); the chord line is therefore parallel to the tangent line of F at \(\xi \). Letting \(\beta =1\) and \(\alpha =1-\epsilon \) (for a small \(\epsilon \in (0,1)\)), we can approximate the ordinary Bregman divergence by the Bregman chord divergence without having to compute the gradient:

$$\begin{aligned} B_F(\theta _1:\theta _2) \simeq _{\epsilon \rightarrow 0} B_F^{1-\epsilon ,1}(\theta _1:\theta _2). \end{aligned}$$
(19)
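
The following Java sketch (illustrative only, not the paper's companion code) checks the approximation of Eq. 19 numerically for the Burg generator \(F(x)=-\log x\), whose Bregman divergence is the Itakura-Saito divergence: as \(\epsilon \) shrinks, the gradient-free chord value approaches \(B_F\).

```java
// Illustrative check (not the paper's companion code) of Eq. 19: with
// alpha = 1 - eps and beta = 1, the chord value approaches B_F as eps -> 0
// without ever evaluating the derivative F'. Here F(x) = -log x (Burg
// generator), whose Bregman divergence is the Itakura-Saito divergence.
import java.util.function.DoubleUnaryOperator;

public class GradientFreeApproxDemo {
    static double chord(DoubleUnaryOperator F, double a, double b, double t1, double t2) {
        double xa = (1 - a) * t1 + a * t2, xb = (1 - b) * t1 + b * t2;
        double fa = F.applyAsDouble(xa), fb = F.applyAsDouble(xb);
        return F.applyAsDouble(t1) - fa + a * (fb - fa) / (b - a);
    }
    public static void main(String[] args) {
        DoubleUnaryOperator F = x -> -Math.log(x);
        double t1 = 0.7, t2 = 2.5;
        // Exact Bregman divergence, using F'(x) = -1/x:
        double exact = F.applyAsDouble(t1) - F.applyAsDouble(t2) + (t1 - t2) / t2;
        for (double eps : new double[]{1e-1, 1e-2, 1e-3, 1e-4})
            System.out.println("eps=" + eps + "  B^{1-eps,1} = " + chord(F, 1 - eps, 1.0, t1, t2));
        System.out.println("B_F = " + exact);
    }
}
```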

4 The Multivariate Bregman Chord Divergence

When the generator is separable [3], i.e., \(F(x)=\sum _i F_i(x_i)\) for univariate generators \(F_i\), we easily extend the Bregman chord divergence as:

$$\begin{aligned} B_F^{\alpha ,\beta }(\theta :\theta ')=\sum _i B_{F_i}^{\alpha ,\beta }(\theta _i:\theta '_i). \end{aligned}$$
(20)

Otherwise, we have to carefully define the notion of “slope” for the multivariate case. An example of such a non-separable multivariate generator is the Legendre dual of the Shannon negentropy, namely the log-sum-exp function [24, 25]:

$$\begin{aligned} F(\theta )=\log (1+\sum _i e^{\theta _i}). \end{aligned}$$
(21)

Given a multivariate (non-separable) Bregman generator \(F(\theta )\) with \(\varTheta \subseteq \mathbb {R}^D\) and two prescribed distinct parameters \(\theta _1\) and \(\theta _2\), consider the following univariate function, for \(\lambda \in \mathbb {R}\):

$$\begin{aligned} F_{\theta _1,\theta _2}(\lambda ) :=F\left( (1-\lambda )\theta _1+\lambda \theta _2\right) = F\left( \theta _1+\lambda (\theta _2-\theta _1)\right) , \end{aligned}$$
(22)

with \(F_{\theta _1,\theta _2}(0)=F(\theta _1)\) and \(F_{\theta _1,\theta _2}(1)=F(\theta _2)\).

The functions \(\{F_{\theta _1,\theta _2}\}_{\theta _1\not =\theta _2}\) are strictly convex and differentiable univariate Bregman generators.

Proof

To prove the strict convexity of a univariate function G, we need to show that for any \(\alpha \in (0,1)\) and any \(x\not =y\), we have \(G((1-\alpha )x+\alpha y)< (1-\alpha )G(x)+\alpha G(y)\).

$$\begin{aligned} F_{\theta _1,\theta _2}((1-\alpha )\lambda _1+\alpha \lambda _2)= & {} F\left( \theta _1+((1-\alpha )\lambda _1+\alpha \lambda _2)(\theta _2-\theta _1)\right) ,\\= & {} F\left( (1-\alpha )(\lambda _1(\theta _2-\theta _1)+\theta _1) + \alpha (\lambda _2(\theta _2-\theta _1)+\theta _1) \right) ,\\< & {} (1-\alpha ) F\left( \lambda _1(\theta _2-\theta _1)+\theta _1\right) + \alpha F\left( \lambda _2(\theta _2-\theta _1)+\theta _1\right) ,\\= & {} (1-\alpha ) F_{\theta _1,\theta _2}(\lambda _1) + \alpha F_{\theta _1,\theta _2}(\lambda _2), \end{aligned}$$

where the strict inequality follows from the strict convexity of F (the two arguments are distinct since \(\lambda _1\not =\lambda _2\) and \(\theta _1\not =\theta _2\)). Differentiability of \(F_{\theta _1,\theta _2}\) follows from the chain rule, since it is the composition of the differentiable F with an affine map.

Then we define the multivariate Bregman chord divergence by applying the definition of the univariate Bregman chord divergence to this family of univariate Bregman generators:

$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2) :=B_{F_{\theta _1,\theta _2}}^{\alpha ,\beta }(0:1). \end{aligned}$$
(23)

Since \((01)_\alpha =\alpha \) and \((01)_\beta =\beta \), we get:

$$\begin{aligned} B_F^{\alpha ,\beta }(\theta _1:\theta _2)= & {} F_{\theta _1,\theta _2}(0)+\frac{\alpha (F_{\theta _1,\theta _2}(\beta )-F_{\theta _1,\theta _2}(\alpha ))}{\beta -\alpha }-F_{\theta _1,\theta _2}(\alpha ),\\= & {} F(\theta _1) - F\left( (\theta _1\theta _2)_\alpha \right) + \frac{\alpha \left( F\left( (\theta _1\theta _2)_\beta \right) - F\left( (\theta _1\theta _2)_\alpha \right) \right) }{\beta -\alpha }, \end{aligned}$$

in accordance with the univariate case. Since \((\theta _1\theta _2)_\beta =(\theta _1\theta _2)_\alpha +(\beta -\alpha )(\theta _2-\theta _1)\), we have the first-order Taylor expansion

$$\begin{aligned} F\left( (\theta _1\theta _2)_\beta \right) \simeq _{\beta \simeq \alpha } F\left( (\theta _1\theta _2)_\alpha \right) +(\beta -\alpha )(\theta _2-\theta _1)^\top \nabla F\left( (\theta _1\theta _2)_\alpha \right) . \end{aligned}$$
(24)

Therefore, we have:

$$\begin{aligned} \frac{\alpha \left( F\left( (\theta _1\theta _2)_\beta \right) - F\left( (\theta _1\theta _2)_\alpha \right) \right) }{\beta -\alpha } \simeq \alpha (\theta _2-\theta _1)^\top \nabla F\left( (\theta _1\theta _2)_\alpha \right) . \end{aligned}$$
(25)

This proves that

$$\begin{aligned} \lim _{\beta \rightarrow \alpha } B_F^{\alpha ,\beta }(\theta _1:\theta _2)=B_F^{\alpha }(\theta _1:\theta _2). \end{aligned}$$
(26)

Notice that the Bregman chord divergence does not require computing the gradient \(\nabla F\). The “slope term” in the definition is reminiscent of the q-derivative [18] (quantum/discrete derivative). However, the (p, q)-derivatives [18] are defined with respect to a single reference point, while the chord definition requires two reference points.
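
The following Java sketch (illustrative only, not the paper's companion code) assembles Eqs. 21-23: it builds the univariate restriction \(F_{\theta _1,\theta _2}\) for the non-separable log-sum-exp generator and evaluates the multivariate Bregman chord divergence through it, including the gradient-free regime \(\alpha =1-\epsilon \), \(\beta =1\).

```java
// Illustrative sketch (not the paper's companion code): the multivariate
// Bregman chord divergence of Eq. 23, obtained by applying the univariate
// chord divergence to the restriction F_{t1,t2}(lambda) of Eq. 22, here for
// the non-separable log-sum-exp generator of Eq. 21.
import java.util.function.DoubleUnaryOperator;

public class MultivariateChordDemo {
    // F(theta) = log(1 + sum_i exp(theta_i))   (Eq. 21)
    static double logSumExp(double[] t) {
        double s = 1.0;
        for (double x : t) s += Math.exp(x);
        return Math.log(s);
    }
    // F_{t1,t2}(lambda) = F((1 - lambda) t1 + lambda t2)   (Eq. 22)
    static DoubleUnaryOperator restriction(double[] t1, double[] t2) {
        return lambda -> {
            double[] p = new double[t1.length];
            for (int i = 0; i < t1.length; i++) p[i] = (1 - lambda) * t1[i] + lambda * t2[i];
            return logSumExp(p);
        };
    }
    // B_F^{alpha,beta}(t1 : t2) = B_{F_{t1,t2}}^{alpha,beta}(0 : 1)   (Eq. 23)
    static double chord(double[] t1, double[] t2, double alpha, double beta) {
        DoubleUnaryOperator G = restriction(t1, t2);
        double ga = G.applyAsDouble(alpha), gb = G.applyAsDouble(beta);
        return G.applyAsDouble(0.0) - ga + alpha * (gb - ga) / (beta - alpha);
    }
    public static void main(String[] args) {
        double[] t1 = {0.2, -1.0, 0.5}, t2 = {1.3, 0.4, -0.7};
        System.out.println(chord(t1, t2, 0.3, 0.8));        // Bregman chord divergence
        System.out.println(chord(t1, t2, 1.0 - 1e-4, 1.0)); // ~ B_F(t1 : t2), gradient-free
    }
}
```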

5 Conclusion

In this paper, we geometrically designed a new class of distances using a Bregman generator and two additional scalar parameters, termed the Bregman chord divergence, together with its one-parametric subfamily, the Bregman tangent divergences, which includes the ordinary Bregman divergence. This generalization allows one to easily fine-tune Bregman divergences in applications by smoothly adjusting one or two scalar knobs. Moreover, by choosing \(\alpha =1-\epsilon \) and \(\beta =1\) for small \(\epsilon >0\), the Bregman chord divergence \(B_{F}^{1-\epsilon ,1}(\theta _1:\theta _2)\) closely lower-bounds the Bregman divergence \(B_{F}(\theta _1:\theta _2)\) without requiring the computation of the gradient (a different gradient-free approximation is \(\frac{1}{\epsilon } J_F^\epsilon (\theta _2:\theta _1)\)). We expect that this new class of distances brings further improvements in signal processing and information fusion applications [29] (e.g., by tuning \(B_{F_\mathrm {KL}}^{\alpha ,\beta }\) or \(B_{F_\mathrm {IS}}^{\alpha ,\beta }\)). While the Bregman chord divergence measures an ordinate gap on the exterior of the epigraph, the Jensen chord divergence [20] measures a gap inside the epigraph of the generator. In future work, the dualistic information-geometric structure induced by the Bregman chord divergences shall be investigated from the viewpoint of gauge theory [19] and contrasted with the dually flat structures of Bregman manifolds [3].

Source code in Java™ is available for reproducible research.