
11.1 Introduction and Prior Work

Mixture models are a powerful and flexible tool to model an unknown probability density function \(f(x)\) as a weighted sum of parametric density functions \(p_{j}(x;\theta _{j})\):

$$\begin{aligned} f(x) = \sum _{j=1}^{K} w_{j} p_{j}(x;\theta _{j}),\; \text{ with } \;w_{j} > 0 \; \text{ and } \;\sum _{j=1}^{K} w_{j} = 1. \end{aligned}$$
(11.1)

By far the most common case is the mixture of Gaussians, for which the Expectation-Maximization (EM) method has been used for decades to estimate the parameters \(\{(w_{j}, \theta_{j})\}_j\) under the maximum likelihood principle. Many extensions have aimed at overcoming its slowness and lack of robustness [1]. Since the seminal work of Banerjee et al. [2], several methods have been generalized to exponential families in connection with Bregman divergences. In particular, Bregman soft clustering provides a unifying and elegant framework for the EM algorithm with mixtures of exponential families. In a recent work [3], the \(k\)-Maximum Likelihood Estimator (\(k\)-MLE) has been proposed as a fast alternative to EM for learning any exponential family mixture: \(k\)-MLE relies on the bijection between exponential families and Bregman divergences to transform the mixture learning problem into a geometric clustering problem. We refer the reader to the review paper [4] for an introduction to clustering.

This paper proposes several variations around the initial \(k\)-MLE algorithm with a specific focus on mixtures of Wishart distributions [5]. Such a mixture can model complex distributions over the set \(\mathcal {S}^{d}_{++}\) of \(d \times d\) symmetric positive definite matrices. Data of this kind arises naturally in applications such as diffusion tensor imaging and radar imaging, but also artificially as a signature for a multivariate dataset (a region of interest in a multispectral image, or a temporal sequence of measures from several sensors).

In the literature, the Wishart distribution is rarely used for modeling data directly; it appears more often in Bayesian approaches as a (conjugate) prior for the inverse covariance matrix of a Gaussian vector. This explains why few works address the estimation of the parameters of a Wishart distribution from a set of matrices. To the best of our knowledge, the most closely related work is that of Tsai [6] concerning the MLE and the restricted MLE with ordering constraints. From the application viewpoint, one may cite polarimetric SAR imaging [7] and bio-medical imaging [8]. Another example is a recent paper on people tracking [9] which applies a Dirichlet process mixture model (infinite mixture model) to the clustering of covariance matrices.

The paper is organized as follows: Sect. 11.2 recalls the definition of an exponential family (EF), the principle of maximum likelihood estimation in EFs and how it is connected with Bregman divergences. From these definitions, the complete description of the \(k\)-MLE technique is derived in Sect. 11.3 by following the formalism of the Expectation-Maximization algorithm. In the same section, the Hartigan approach for \(k\)-MLE is proposed and discussed, as well as how to initialize it properly. Section 11.4 concerns the learning of a mixture of Wishart distributions with \(k\)-MLE; for this purpose, an iterative procedure that converges to the MLE (when it exists) is described. In Sect. 11.5, we describe an application scenario for motion retrieval before concluding in Sect. 11.6.

11.2 Preliminary Definitions and Notations

An exponential family is a set of probability distributions admitting the following canonical decomposition:

$$ p_{F}(x;\theta ) = \exp \left\{ \langle t(x), \theta \rangle + k(x) -F(\theta )\right\} $$

with \(t(x)\) the sufficient statistic, \(\theta \) the natural parameter, \(k\) the carrier measure and \(F\) the log-normalizer [10]. Most commonly used distributions such as the Bernoulli, Gaussian, Multinomial, Dirichlet, Poisson, Beta, Gamma and von Mises distributions are indeed exponential families (see the above reference for a complete list). Later in the chapter, a canonical decomposition of the Wishart distribution as an exponential family will be detailed.

11.2.1 Maximum Likelihood Estimator

The framework of exponential families gives a direct solution for finding the maximum likelihood estimator \(\hat{\theta }\) from a set of i.i.d. observations \(\chi = \{x_{1},\ldots ,x_{N}\}\). Denote by \(\mathcal {L}\) the likelihood function

$$\begin{aligned} \mathcal {L}(\theta ;\chi ) = \prod _{i=1}^{N} p_{F}(x_{i};\theta ) = \prod _{i=1}^{N} \exp \left\{ \langle t(x_{i}), \theta \rangle + k(x_{i}) -F(\theta )\right\} \end{aligned}$$
(11.2)

and by \(\bar{l}\) the average log-likelihood function

$$\begin{aligned} \bar{l}(\theta ; \chi )=\frac{1}{N}\sum _{i=1}^{N} \left( \langle t(x_i), {\theta } \rangle +k(x_i) -F(\theta )\right) {.} \end{aligned}$$
(11.3)

It follows that the MLE \(\hat{\theta } = \mathrm{{arg\,max}}_{\varTheta } \bar{l}(\theta ; \chi )\) for \(\theta \) satisfies

$$\begin{aligned} \nabla F (\hat{\theta }) = \frac{1}{N}\sum _{i=1}^{N}t(x_{i}). \end{aligned}$$
(11.4)

Recall that the functional reciprocal \((\nabla F)^{-1}\) of \(\nabla F\) is also \(\nabla F^{*}\), where \(F^{*}\) is the convex conjugate of \(F\) [11]. It is a mapping from the expectation parameter space \(\mathbf {H}\) to the natural parameter space \(\varTheta \). Thus, the MLE is obtained by applying \((\nabla F)^{-1}\) to the average of the sufficient statistics:

$$\begin{aligned} \hat{\theta } = (\nabla F)^{-1} \left( \frac{1}{N}\sum _{i=1}^{N}t(x_{i})\right) {.} \end{aligned}$$
(11.5)

Whereas determining \((\nabla F)^{-1}\) may be trivial for some univariate distributions like the Bernoulli, Poisson or Gaussian, the multivariate case is much more challenging and leads to considering approximate methods to solve this variational problem [12].
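As a minimal illustration of Eq. (11.5) (ours, not taken from the chapter), consider the Poisson family, for which \(t(x)=x\), \(F(\theta )=e^{\theta }\) and thus \((\nabla F)^{-1}(\eta )=\log \eta \); the sketch below recovers the rate parameter from the average sufficient statistic:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.5, size=1000)

# Poisson as an exponential family: t(x) = x and F(theta) = exp(theta),
# hence (grad F)^{-1}(eta) = log(eta) and Eq. (11.5) reads theta_hat = log(mean(x)).
eta_hat = x.mean()              # average of the sufficient statistics
theta_hat = np.log(eta_hat)     # mapped back to the natural parameter space
print(np.exp(theta_hat))        # recovered rate, close to 3.5
```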

11.2.2 MLE and Bregman Divergence

In this part, the link between the MLE and the Kullback-Leibler (KL) divergence is recalled. Banerjee et al. [2] interpret the log-density of a (regular) exponential family as a (regular) Bregman divergence:

$$\begin{aligned} \log p_{F}(x;\theta ) = - B_{F^{*}}(t(x) : \eta ) + F^{*}(t(x)) + k(x), \end{aligned}$$
(11.6)

where \(F^{*}\) is the convex conjugate (Legendre transform) of \(F\). Skipping a formal definition, a Bregman divergence for a strictly convex and differentiable function \(\varphi : \varOmega \mapsto \mathbb {R}\) is

$$\begin{aligned} B_{\varphi }(\omega _{1} : \omega _{2}) = \varphi (\omega _{1}) - \varphi (\omega _{2}) - \langle \omega _{1} - \omega _{2}, \nabla \varphi (\omega _{2})\rangle . \end{aligned}$$
(11.7)

From a geometric viewpoint, \(B_{\varphi }(\omega _{1}:\omega _{2})\) is the difference between the value of \(\varphi \) at \(\omega _{1}\) and its first-order Taylor expansion around \(\omega _{2}\) evaluated at \(\omega _{1}\). Since \(\varphi \) is strictly convex, \(B_{\varphi }\) is non-negative, zero if and only if (iff) \(\omega _{1}= \omega _{2}\), but not symmetric in general. The expression of \(F^*\) (and thus of \(B_{F^*}\)) follows from \((\nabla F)^{-1}\):

$$\begin{aligned} F^{*} (\eta ) = \langle (\nabla F)^{-1} (\eta ), \eta \rangle - F( (\nabla F)^{-1} (\eta )). \end{aligned}$$
(11.8)

In Eq. 11.6, the term \(B_{F^{*}}(t(x) : \eta )\) measures how dissimilar the sufficient statistic \(t(x)\) of observation \(x\) is from \(\eta \in \mathbf {H}\).

The Kullback-Leibler divergence between two members of the same exponential family is equivalent to the Bregman divergence of the associated log-normalizer on swapped natural parameters [10]:

$$\begin{aligned} {\mathrm {KL}}(p_{F}(.;\theta _{1}) || p_{F}(.;\theta _{2})) = B_{F}(\theta _{2} : \theta _{1}) = B_{F^{*}}(\eta _{1}:\eta _{2}). \end{aligned}$$
(11.9)

Let us remark that \(B_{F}\) is always known in closed form from the canonical decomposition of \(p_{F}\), whereas \(B_{F^{*}}\) requires the knowledge of \(F^{*}\). Finding the maximizer of the log-likelihood on \(\varTheta \) amounts to finding the minimizer \(\hat{\eta }\) of

$$\sum _{i=1}^{N} B_{F^{*}}(t(x_{i}) : \eta ) = \sum _{i=1}^{N} {\mathrm {KL}}\left( p_{F^*}(.;t(x_{i})) || p_{F^*}(.;\eta )\right) $$

on \(\mathbf {H}\), since the last two terms in Eq. (11.6) are constant with respect to \(\eta \).
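The following small numerical check (ours) illustrates this duality in one dimension: for the strictly convex \(\varphi (u)=u\log u - u\), the minimizer of \(\sum _i B_{\varphi }(t_i : \eta )\) is the arithmetic mean of the \(t_i\), which is exactly Eq. (11.5) read in the expectation parameter space.

```python
import numpy as np
from scipy.optimize import minimize_scalar

t = np.array([0.5, 1.2, 3.0, 0.8])            # stand-ins for sufficient statistics t(x_i)

phi = lambda u: u * np.log(u) - u             # strictly convex on (0, +inf)
grad = lambda u: np.log(u)
bregman = lambda a, b: phi(a) - phi(b) - (a - b) * grad(b)

res = minimize_scalar(lambda eta: np.sum(bregman(t, eta)),
                      bounds=(1e-6, 10.0), method="bounded")
print(res.x, t.mean())                        # both are (numerically) the arithmetic mean
```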

11.3 Learning Mixtures with \(k\)-MLE

This section presents how to fit a mixture of exponential families with \(k\)-MLE. This algorithm requires an MLE (see previous section) for each component distribution \(p_{F_{j}}\) of the considered mixture. As it shares many properties with the EM algorithm for mixtures, the latter is recalled first. The Lloyd and Hartigan heuristics for \(k\)-MLE are then completely described. Also, two methods for the initialization of \(k\)-MLE are proposed, depending on whether or not the component distributions are known.

11.3.1 EM Algorithm

Mixture modeling is a convenient framework to address the problem of clustering, defined as the partitioning of a set of i.i.d. observations \(\chi = \{x_{i}\}_{i=1,..,N}\) into “meaningful” groups with respect to some similarity. Consider a finite mixture model of exponential families (see Eq. 11.1)

$$\begin{aligned} f(x) = \sum _{j=1}^{K} w_{j} p_{F_j} (x;\theta _{j}){,} \end{aligned}$$
(11.10)

where \(K\) is the number of components and the \(w_{j}\) are the mixture weights, which sum up to unity. Finding the mixture parameters \(\{(w_{j}, \theta _{j})\}_{j}\) can again be addressed by maximizing the log-likelihood of the mixture distribution

$$\begin{aligned} \mathcal {L}(\{(w_{j}, \theta _{j})\}_j; \chi ) = \sum _{i=1}^{N} \log \sum _{j=1}^{K} w_{j} p_{F_j} (x_{i};\theta _{j}). \end{aligned}$$
(11.11)

For \(K>1\), the sum of terms appearing inside the logarithm makes the optimization much more difficult than the one of Sect. 11.2.1 (\(K=1\)). A classical solution, also well suited for clustering purposes, is to augment the model with hidden indicator vector variables \(z_{i}\), where \(z_{ij} = 1\) iff observation \(x_{i}\) is generated by the \(j\)th component and 0 otherwise. The previous equation is now replaced by the complete log-likelihood of the mixture distribution

$$\begin{aligned} \mathcal {L}_{\mathbf {c}}(\{(w_{j}, \theta _{j})\}_j; \{(x_{i},z_{i})\}_{i}) =\sum _{i=1}^{N} \sum _{j=1}^{K} z_{ij} \log \left( w_{j} p_{F_j} (x_{i};\theta _{j})\right) {.} \end{aligned}$$
(11.12)

This is typically the framework of the Expectation-Maximization (EM) algorithm [13] which optimizes this function by repeating two steps:

  1.

    Compute \(\mathcal {Q}(\{(w_{j}, \theta _{j})\}_j,\{(w_{j}^{(t)}, \theta _{j}^{(t)})\}_j)\), the conditional expectation of \(\mathcal {L}_{\mathbf {c}}\) given the observed data \(\chi \) and an estimate \(\{(w_{j}^{(t)}, \theta _{j}^{(t)})\}_j\) of the mixture parameters. This step amounts to computing \(\hat{z}_{i}^{(t)}=\mathbb {E}_{\{(w_{j}^{(t)}, \theta _{j}^{(t)})\}_j}[z_{i}|x_{i}]\), the vector of responsibilities of each component for having generated \(x_{i}\):

    $$\begin{aligned} \hat{z}_{ij}^{(t)} = \frac{w_{j}^{(t)} p_{F_j} (x_{i};\theta _{j}^{(t)})}{\sum _{j'}w_{j'}^{(t)} p_{F_{j'}} (x_{i};\theta _{j'}^{(t)})}. \end{aligned}$$
    (11.13)
  2.

    Update the mixture parameters by maximizing \(\mathcal {Q}\) (i.e. Eq. (11.12) where the hidden values \(z_{ij}\) are replaced by \(\hat{z}_{ij}^{(t)}\)):

    $$\begin{aligned} \hat{w}_{j}^{(t+1)} = \frac{\sum _{i=1}^{N} \hat{z}_{ij}^{(t)}}{N},\quad \hat{\theta }_{j}^{(t+1)} = \mathop {\mathrm{arg}\, \mathrm{max}}\limits _{\theta _{j} \in \varTheta _{j}} \sum _{i=1}^{N} \hat{z}_{ij}^{(t)} \log \left( p_{F_j} (x_{i};\theta _{j})\right) . \end{aligned}$$
    (11.14)

    While \(\hat{w}_{j}^{(t+1)}\) is always known in closed-form whatever \(F_{j}\) are, \(\hat{\theta }_{j}^{(t+1)}\) are obtained by component-wise specific optimization involving all observations.

Many properties of this algorithm are known (e.g. maximization of \(\mathcal {Q}\) implies maximization of \(\mathcal {L}\), slow convergence to a local maximum, etc.). From a clustering perspective, components are identified with clusters and the values \(\hat{z}_{ij}\) are interpreted as soft memberships of \(x_{i}\) to cluster \(\mathcal {C}_{j}\). In order to get a strict partition after convergence, each \(x_{i}\) is assigned to the cluster \(\mathcal {C}_{j}\) iff \(\hat{z}_{ij}\) is maximal over \(\hat{z}_{i1}, \hat{z}_{i2}, \ldots , \hat{z}_{iK}\).
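To make the two steps concrete, here is a minimal sketch (ours) of EM for a Poisson mixture, a convenient stand-in for a general exponential-family mixture since its M-step is available in closed form; the function name and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import poisson

def em_poisson_mixture(x, K, n_iter=100):
    """Minimal EM for a K-component Poisson mixture (here t(x) = x)."""
    w = np.full(K, 1.0 / K)
    lam = np.quantile(x, np.linspace(0.1, 0.9, K)) + 0.5          # crude spread-out start
    for _ in range(n_iter):
        # E-step: responsibilities z_hat[i, j] of Eq. (11.13), computed in the log domain
        logp = np.log(w) + poisson.logpmf(x[:, None], lam[None, :])
        z = np.exp(logp - logp.max(axis=1, keepdims=True))
        z /= z.sum(axis=1, keepdims=True)
        # M-step, Eq. (11.14): closed-form weights; the Poisson component MLE is the
        # responsibility-weighted average of the sufficient statistic t(x) = x
        w = z.mean(axis=0)
        lam = (z * x[:, None]).sum(axis=0) / z.sum(axis=0)
    return w, lam

rng = np.random.default_rng(1)
x = np.concatenate([rng.poisson(2.0, 300), rng.poisson(9.0, 300)])
print(em_poisson_mixture(x, K=2))      # weights near (0.5, 0.5), rates near (2, 9)
```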

11.3.2 \(k\)-MLE with Lloyd Method

A main reason for the slowness of EM is that all observations are taken into account in the update of the parameters of each component, since \(\hat{z}_{ij}^{(t)} \in [0,1]\). A natural idea is then to generate smaller sub-samples of \(\chi \) from the \(\hat{z}_{ij}^{(t)}\) in a deterministic manner.Footnote 1 The simplest way to do this is to get a strict partition of \(\chi \) with the MAP assignment:

$$\tilde{z}_{ij}^{(t)} = \left\{ \begin{array}{ll} 1 &{} \quad \text{ if } \hat{z}_{ij}^{(t)} = \max \nolimits _{k} \hat{z}_{ik}^{(t)}\\ 0 &{} \quad \text{ otherwise } \end{array}\right. {.}$$

When multiple maxima exist, the component with the smallest index is chosen. If this classification step is inserted between the E-step and the M-step, the Classification EM (CEM) algorithm [14] is recovered. Moreover, for isotropic Gaussian components with fixed unit variance, CEM is shown to be equivalent to Lloyd's \(k\)-means algorithm [4]. More recently, CEM was reformulated in a close way under the name \(k\)-MLE [3] in the context of exponential families and Bregman divergences. In the remainder of the paper, we will refer only to the latter. Replacing \(z_{ij}^{(t)}\) by \(\tilde{z}_{ij}^{(t)}\) in Eq. (11.12), the criterion to be maximized in the M-step can be reformulated as

$$\begin{aligned} \tilde{\mathcal {L}}_{\mathbf {c}}(\{(w_{j}, \theta _{j})\}_j; \{(x_{i},\tilde{z}_{i}^{(t)})\}_{i}) = \sum _{j=1}^{K} \sum _{i=1}^{N} \tilde{z}_{ij}^{(t)} \log \left( w_{j} p_{F_{j}} (x_{i};\theta _{j})\right) {.} \end{aligned}$$
(11.15)

Following CEM terminology, this quantity is called the “classification maximum likelihood”. Letting \(C_{j}^{(t)} = \left\{ x_{i} \in \chi | \tilde{z}_{ij}^{(t)} = 1\right\} \), this equation can be conveniently rewritten as

$$\begin{aligned} \tilde{\mathcal {L}}_{\mathbf {c}}(\{(w_{j}, \theta _{j})\}_j; \{\mathcal {C}_{j}^{(t)}\}_{j}) =&\sum _{x \in \mathcal {C}_{1}^{(t)}} \log \left( w_{1}p_{F_{1}} (x;\theta _{1})\right) \nonumber \\&+ \cdots + \sum _{x \in \mathcal {C}_{K}^{(t)}} \log \left( w_{K}p_{F_{K}} (x;\theta _{K})\right) {.} \end{aligned}$$
(11.16)

Each term leads to a separate optimization to get the parameters of the corresponding component:

$$\begin{aligned} \hat{w}_{j}^{(t+1)} = \frac{|\mathcal {C}_{j}^{(t)}|}{N},\quad \hat{\theta }_{j}^{(t+1)} = \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\theta _{j} \in \varTheta _{j}} \sum _{x \in \mathcal {C}_{j}^{(t)}} \log p_{F_{j}} (x;\theta _{j}). \end{aligned}$$
(11.17)

The last equation is nothing but the MLE equation for the \(j\)-th component restricted to a subset of \(\chi \). Algorithm 1 summarizes \(k\)-MLE with the Lloyd method given an initial description of the mixture.

[Algorithm 1: \(k\)-MLE with the Lloyd method]

Contrary to the CEM algorithm, the \(k\)-MLE algorithm updates the mixture weights after the convergence of \(\tilde{\mathcal {L}}_{\mathbf {c}}\) (line 8) and not simultaneously with the component parameters. Despite this difference, both algorithms can be proved to converge to a local maximum of \(\tilde{\mathcal {L}}_{\mathbf {c}}\) with the same kind of arguments (see [3, 14]). In practice, the local maxima (and also the mixture parameters) are not necessarily equal for the two algorithms.
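For concreteness, here is a compact sketch (ours) of Algorithm 1 on the same hypothetical Poisson mixture as above; the hard C-step, the per-cluster MLE of Eq. (11.17) and the delayed weight update are the only differences with the EM sketch.

```python
import numpy as np
from scipy.stats import poisson

def kmle_lloyd(x, lam, w, n_outer=10, n_inner=20):
    """Sketch of Algorithm 1 for a Poisson mixture (names and loop sizes are ours)."""
    K = len(lam)
    for _ in range(n_outer):
        for _ in range(n_inner):
            # C-step: assign each observation to its most probable component (MAP)
            labels = np.argmax(np.log(np.maximum(w, 1e-12))
                               + poisson.logpmf(x[:, None], lam[None, :]), axis=1)
            # M-step, Eq. (11.17): MLE of each component on its own cluster only
            for j in range(K):
                if np.any(labels == j):              # empty clusters are left untouched
                    lam[j] = x[labels == j].mean()
        # weight update once the classification likelihood has converged (line 8)
        w = np.bincount(labels, minlength=K) / len(x)
    return w, lam, labels

rng = np.random.default_rng(2)
x = np.concatenate([rng.poisson(2.0, 300), rng.poisson(9.0, 300)])
print(kmle_lloyd(x, lam=np.array([1.0, 10.0]), w=np.array([0.5, 0.5]))[:2])
```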

11.3.3 \(k\)-MLE with Hartigan Method

In this section, a different optimization of the classification maximum likelihood is presented. A drawback of the previous methods is that they can produce empty clusters without any means of control. This occurs especially when the observations lie in a high-dimensional space. A mild solution is to discard empty clusters by setting their weights to zero and their parameters to \(\emptyset \). A better approach, detailed in the following, is to rewrite \(k\)-MLE following the same principle as the Hartigan method for \(k\)-means [15]. Moreover, this heuristic is preferred to Lloyd's one since it generally provides better local maxima [16].

The Hartigan method is generally summarized by the sentence “Pick an observation, say \(x_{c}\) in cluster \(\mathcal {C}_{c}\), and optimally reassign it to another cluster.” Let us first consider as “optimal” the assignment of \(x_{c}\) to its most probable cluster, say \(\mathcal {C}_{j^{*}}\):

$$ j^{*} = \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{j} \log \hat{w}_{j}^{(t)} p_{F_{j}}(x_{c};\hat{\theta }_{j}^{(t)}),$$

where \(\hat{w}_{j}^{(t)}\) and \(\hat{\theta }_{j}^{(t)}\) denote the weight and the parameters of the \(j\)-th component at some iteration. Then, the parameters of the two affected components are updated with the MLE:

$$\begin{aligned} \hat{\theta }_{c}^{(t+1)} = \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\theta _{c} \in \varTheta _{c}} \sum _{x \in \mathcal {C}_{c}^{(t)} \backslash \{x_{c}\}} \log p_{F_{c}} (x;\theta _{c}) \end{aligned}$$
(11.18)
$$\begin{aligned} \hat{\theta }_{j^{*}}^{(t+1)} = \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\theta _{j^{*}} \in \varTheta _{j^{*}}} \sum _{x \in \mathcal {C}_{j^{*}}^{(t)}\cup \{x_{c}\}} \log p_{F_{j^{*}}} (x;\theta _{j^{*}}). \end{aligned}$$
(11.19)

The mixture weights \(\hat{w}_{c}\) and \(\hat{w}_{j^{*}}\) remain unchanged in this step (see line 9 of Algorithm 1). Consequently, \(\tilde{\mathcal {L}}_{\mathbf {c}}\) increases by \(\varPhi ^{(t)}(x_{c},\mathcal {C}_{c},\mathcal {C}_{j^{*}})\) where \(\varPhi ^{(t)}(x_{c},\mathcal {C}_{c},\mathcal {C}_{j})\) is more generally defined as

$$\begin{aligned} \varPhi ^{(t)}(x_{c},\mathcal {C}_{c},\mathcal {C}_{j}) =&\, \sum _{x \in \mathcal {C}^{(t)}_{c} \backslash \{x_{c}\}} \log p_{F_{c}} (x; \hat{\theta }_{c}^{(t+1)}) - \sum _{x \in \mathcal {C}^{(t)}_{c} \cup \{x_{c}\}} \log p_{F_{c}} (x; \hat{\theta }_{c}^{(t)})-\log \frac{\hat{w}_{c}}{\hat{w}_{j}}\nonumber \\&+ \sum _{x \in \mathcal {C}^{(t)}_{j} \cup \{x_{c}\}} \log p_{F_{j}} (x; \hat{\theta }_{j}^{(t+1)}) - \sum _{x \in \mathcal {C}^{(t)}_{j} \backslash \{x_{c}\}} \log p_{F_{j}} (x; \hat{\theta }_{j}^{(t)}){.} \end{aligned}$$
(11.20)

This procedure is nothing more than a partial assignment (C-step) in the Lloyd method for \(k\)-MLE. This is an indirect way to reach our initial goal which is the maximization of \(\tilde{\mathcal {L}}_{\mathbf {c}}\).

Following Telgarsky and Vattani [16], a better approach is to consider as “optimal” the assignment to cluster \(\mathcal {C}_{j}\) which maximizes \(\varPhi ^{(t)}(x_{c},\mathcal {C}_{c},\mathcal {C}_{j})\)

$$\begin{aligned} j^{*} = \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{j} \varPhi ^{(t)}(x_{c},\mathcal {C}_{c},\mathcal {C}_{j}){.} \end{aligned}$$
(11.21)

Since \(\varPhi ^{(t)}(x_{c},\mathcal {C}_{c},\mathcal {C}_{c}) = 0\), such an assignment satisfies \(\varPhi ^{(t)}(x_{c},\mathcal {C}_{c},\mathcal {C}_{j^{*}}) \ge 0\) and therefore guarantees that \(\tilde{\mathcal {L}}_{\mathbf {c}}\) does not decrease. As the optimization space is finite (the partitions of \(\{x_{1},...,x_{N}\}\)), this procedure converges to a local maximum of \(\tilde{\mathcal {L}}_{\mathbf {c}}\). Note that there is no guarantee that \(\mathcal {C}_{j^{*}}\) coincides with the MAP assignment for \(x_{c}\).

For the \(k\)-means loss function, the Hartigan method avoids empty clusters since any assignment to an empty cluster necessarily decreases the loss [16]. An analogous property will now be studied for \(k\)-MLE through the formulation of \(\tilde{\mathcal {L}}_{\mathbf {c}}\) in \(\eta \)-coordinates:

$$\begin{aligned}&{\tilde{\mathcal {L}}_{\mathbf {c}}(\{(w_{j}, \eta _{j})\}_j; \{\mathcal {C}^{(t)}_{j}\}_{j}) }\\ \nonumber&= \sum _{j=1}^{K} \sum _{x \in \mathcal {C}^{(t)}_{j}} \left[ F^{*}_{j}(\eta _{j})+ k_{j}(x) + \langle t_{j}(x) - \eta _{j}, \nabla F_{j}^{*} (\eta _{j})\rangle + \log w_{j}\right] {.} \end{aligned}$$
(11.22)

Recalling that the MLE satisfies \(\hat{\eta }^{(t)}_{j} = |\mathcal {C}^{(t)}_{j}|^{-1}\sum _{x \in \mathcal {C}^{(t)}_{j}} t_{j}(x)\), dot product vanishes when \(\eta _{j} = \hat{\eta }^{(t)}_{j}\) and it follows after small calculations

$$\begin{aligned} \varPhi ^{(t)}(x_{c},\mathcal {C}_{c},\mathcal {C}_{j}) = \,&(|\mathcal {C}^{(t)}_{c}| - 1) F_{c}^{*} (\hat{\eta }_{c}^{(t+1)}) - |\mathcal {C}^{(t)}_{c}| F_{c}^{*} (\hat{\eta }_{c}^{(t)}) \nonumber \\&+\; (|\mathcal {C}^{(t)}_{j}| + 1) F_{j}^{*} (\hat{\eta }_{j}^{(t+1)}) - |\mathcal {C}^{(t)}_{j}| F_{j}^{*} (\hat{\eta }_{j}^{(t)})\nonumber \\&+\; k_{j}(x_{c}) - k_{c}(x_{c}) - \log \frac{\hat{w}_{c}}{\hat{w}_{j}}{.} \end{aligned}$$
(11.23)

As long as \(F_{j}^{*}\) is known in closed form, this criterion is faster to compute than Eq. (11.20) since the updates of the component parameters are immediate:

$$\begin{aligned} \hat{\eta }_{c}^{(t+1)} = \frac{|\mathcal {C}^{(t)}_{c}|\hat{\eta }_{c}^{(t)} - t_{c}(x_{c})}{|\mathcal {C}^{(t)}_{c}|-1},\;\hat{\eta }_{j}^{(t+1)} = \frac{|\mathcal {C}^{(t)}_{j}|\hat{\eta }_{j}^{(t)} + t_{j}(x_{c})}{|\mathcal {C}^{(t)}_{j}|+1}{.} \end{aligned}$$
(11.24)
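For illustration, the \(O(1)\) updates of Eq. (11.24) can be written as the following sketch (ours), where t_x stands for the sufficient statistic of the transferred observation:

```python
import numpy as np

def transfer_update(eta_c, size_c, eta_j, size_j, t_x):
    """Eq. (11.24): constant-time update of the expectation parameters when moving
    one observation from the donor cluster c to the receiving cluster j."""
    eta_c_new = (size_c * eta_c - t_x) / (size_c - 1)   # donor must not be a singleton
    eta_j_new = (size_j * eta_j + t_x) / (size_j + 1)
    return eta_c_new, eta_j_new
```

The transfer gain \(\varPhi ^{(t)}\) of Eq. (11.23) is then evaluated from \(F^{*}\) at the old and new expectation parameters.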

When \(\mathcal {C}^{(t)}_{c}=\{x_{c}\}\), there is no particular reason for \(\varPhi ^{(t)}(x_{c},\{x_{c}\},\mathcal {C}_{j})\) to be always negative. The simplifications occurring for \(k\)-means in the Euclidean case (e.g. \(k_{j}(x_{c}) = 0\), clusters with equal weights \(w_{j}= K^{-1}\), etc.) do not exist in this more general case. Thus, in order to avoid emptying a cluster, it is mandatory to reject every outgoing transfer from a singleton cluster (cf. line 8).

Algorithm 2 details the \(k\)-MLE algorithm with the Hartigan method when the \(F^{*}_{j}\) are available. When only the \(F_{j}\) are known, \(\varPhi ^{(t)}(x_{c},\mathcal {C}_{c},\mathcal {C}_{j})\) can be computed with Eq. (11.20). In this case, the computation of the MLE \(\hat{\theta }_{j}\) is much slower and is an issue for a singleton cluster. Its existence and possible solutions will be discussed later for the Wishart distribution. Further remarks on Algorithm 2:

(line 1):

When all \(F^*_{j} = F^{*}\) are identical, this partitioning can be understood as geometric split in the expectation parameter space induced by divergence \(B_{F^{*}}\) and additive weight \(-\log w_{j}\) (weighted Bregman Voronoi diagram [17]).

(line 4):

This permutation avoids using the same ordering at each loop.

(line 6):

A weaker choice may be done here: any cluster \(\mathcal {C}_{j}\) (for instance the first) which satisfies \(\varPhi ^{(t)}(x_{i},\mathcal {C}_{\bar{z}_{i}},\mathcal {C}_{j}) > 0\) is a possible candidate still guaranteeing convergence of the algorithm. For such clusters, it may be also advisable to select \(\mathcal {C}_{j}\) with maximum \(\hat{z}_{ij}^{(t)}\).

(line 12):

Obviously, this criterion is equivalent to local convergence of \(\tilde{\mathcal {L}}_{\mathbf {c}}\).

[Algorithm 2: \(k\)-MLE with the Hartigan method]

As said before, this algorithm is faster when the component parameters \(\eta _{j}\) can be updated in the expectation parameter space \(\mathbf {H}_{j}\). But the price to pay is the memory needed to keep all the sufficient statistics \(t_{j}(x_{i})\) for each observation \(x_{i}\).

11.3.4 Initialization with DP-\(k\)-MLE\(++\)

To complete the description of \(k\)-MLE, there remains the problem of the initialization of the algorithm: the choice of the exponential family for each component, the initial values of \(\{(\hat{w}^{(0)}_{j},\eta ^{(0)}_{j})\}_{j}\), and the number of components \(K\). Ideally, a good initialization would be fast, would automatically select the number of components (unknown for most applications) and would provide initial mixture parameters not too far from a good local optimum of the clustering criterion. The choice of the model complexity (i.e. the number \(K\) of groups) is a recurrent problem in clustering, since a compromise has to be found between genericity and goodness of fit. Since the likelihood increases with \(K\), many criteria such as BIC and NEC are based on penalizing the likelihood by a function of the degrees of freedom of the model. Other approaches include the MDL principle, Bayes factors or simply a visual inspection of some plots (e.g. silhouette graph, dendrogram for hierarchical clustering, Gram matrix, etc.). The reader interested in this topic may refer to section M of the survey of Xu and Wunsch [18]. The proposed method, inspired by the algorithms \(k\)-MLE\(++\) [3] and DP-means [19], is described in the following.

At the beginning of a clustering, there is no particular reason for favoring one particular cluster among all others. Assuming uniform weighting for components, \(\tilde{\mathcal {L}}_{\mathbf {c}}\) simplifies to

$$\begin{aligned} \mathring{\mathcal {L}}_{\mathbf {c}}(\{\theta _{j}\}_j; \{\mathcal {C}_{j}^{(t)}\}_{j})&= \sum _{j=1}^{K}\sum _{x \in \mathcal {C}_{j}^{(t)}}\log p_{F_{j}} (x;\theta _{j}) \qquad \text{ or equivalently to }\end{aligned}$$
(11.25)
$$\begin{aligned} \mathring{\mathcal {L}}_{\mathbf {c}}(\{\eta _{j}\}_j; \{\mathcal {C}^{(t)}_{j}\}_{j})&= \sum _{j=1}^{K} \sum _{x \in \mathcal {C}^{(t)}_{j}} \left[ F^{*}_{j}(\eta _{j})+ k_{j}(x) + \langle t_{j}(x) - \eta _{j}, \nabla F_{j}^{*} (\eta _{j})\rangle \right] . \end{aligned}$$
(11.26)

When all \(F^*_{j} = F^{*}\) are identical and the partition \(\{\mathcal {C}_{j}^{(t)}\}_{j}\) corresponds to MAP assignment, \(\mathring{\mathcal {L}}\) is exactly the objective function \(\breve{\mathcal {L}}\) for the Bregman \(k\)-means [2]. Rewriting \(\breve{\mathcal {L}}\) as an equivalent criterion to be minimized, it follows

$$\begin{aligned} \breve{\mathcal {L}}(\{\eta _{j}\}_j) = \sum _{i=1}^{N} \min _{j=1}^{K} B_{F^{*}}(t(x_{i}) :\eta _{j}). \end{aligned}$$
(11.27)

Bregman \(k\)-means\(++\) [20, 21] provides initial centers \(\{\eta ^{(0)}_{j}\}_{j}\) which guarantee a clustering that is \(\mathcal {O}(\log K)\)-competitive with the optimal Bregman \(k\)-means clustering. The \(k\)-MLE\(++\) algorithm amounts to using Bregman \(k\)-means\(++\) on the dual log-normalizer \(F^*\) (see Algorithm 3).

[Algorithm 3: \(k\)-MLE\(++\)]

When \(K\) is unknown, the same strategy can still be applied but a stopping criterion has to be set. The probability \(p_{i}\) in Algorithm 3 is the relative contribution of observation \(x_{i}\), through \(t(x_{i})\), to \(\breve{\mathcal {L}}(\{\eta _{1},...,\eta _{K}\})\), where \(K\) is the number of already selected centers. A high \(p_{i}\) indicates that \(x_{i}\) is relatively far from these centers, and thus atypical for the mixture \(\{(w^{(0)}_{1}, \eta ^{(0)}_{1}),...,(w^{(0)}_{K}, \eta ^{(0)}_{K})\}\) with \(w^{(0)}_{j} = w^{(0)}\) an arbitrary constant. When a new center is selected, \(p_{i}\) necessarily decreases at the next iteration. A good covering of \(\chi \) is obtained when all the \(p_{i}\) are lower than some threshold \(\lambda \in [0,1]\). Algorithm 4 describes the resulting initialization, named DP-\(k\)-MLE\(++\).

[Algorithm 4: DP-\(k\)-MLE\(++\)]

The higher the threshold \(\lambda \), the lower the number of generated centers. In particular, the value \(\frac{1}{N}\) may be considered a reasonable minimum setting for \(\lambda \). For \(\lambda = 1\), the algorithm will simply return one center. Since \(p_{i}=0\) for already selected centers, this method guarantees that all centers are distinct.
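As an illustration only, here is a sketch (ours) of the selection loop of Algorithm 4 in which the squared Euclidean divergence plays the role of \(B_{F^{*}}\) (i.e. \(F^{*}(\eta )=\Vert \eta \Vert ^{2}/2\), the plain \(k\)-means\(++\)/DP-means setting); the function and variable names are assumptions:

```python
import numpy as np

def dp_seeding(T, lam, seed=0):
    """Sketch of Algorithm 4 with the squared Euclidean divergence standing in for
    B_{F*}.  T is the (N, D) array of sufficient statistics t(x_i), lam the threshold."""
    rng = np.random.default_rng(seed)
    centers = [T[rng.integers(len(T))]]                  # first center drawn uniformly
    while True:
        # divergence of every point to its closest already-selected center
        d = np.min([0.5 * np.sum((T - c) ** 2, axis=1) for c in centers], axis=0)
        if d.sum() == 0:                                 # every point is already a center
            break
        p = d / d.sum()                                  # relative contributions p_i
        if p.max() <= lam:                               # good covering of chi: stop
            break
        centers.append(T[rng.choice(len(T), p=p)])       # new center drawn proportionally to p_i
    return np.array(centers)
```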

11.3.5 Initialization with DP-comp-\(k\)-MLE

Although \(k\)-MLE can be used with component-wise exponential families, the previous initialization methods yield components from the same exponential family. The component distribution may be chosen at the same time as a center is selected when additional knowledge \(\xi _{i}\) about \(x_{i}\) is available (see Sect. 11.5 for an example). Given such a choice function \(H\), Algorithm 5, called “DP-comp-\(k\)-MLE”, describes this new flexible initialization method. DP-comp-\(k\)-MLE is clearly a generalization of DP-\(k\)-MLE\(++\) when \(H\) always returns the same exponential family. However, in the general case, it remains to be proved whether a DP-comp-\(k\)-MLE clustering is \(\mathcal {O}(\log K)\)-competitive with the optimal \(k\)-MLE clustering (with equal weights). Pending this difficult theoretical study, the suffix “\(++\)” is carefully omitted from the name DP-comp-\(k\)-MLE.

To end this section, let us recall that all we need to know in order to use the proposed algorithms is the MLE for the considered exponential family, whether it is available in closed form or not. For many exponential families, all the details (canonical decomposition, \(F, \nabla F, F^{*}, \nabla F^* = (\nabla F)^{-1}\)) are already known [10]. The next section focuses on the case of the Wishart distribution.

[Algorithm 5: DP-comp-\(k\)-MLE]

11.4 Learning Mixtures of Wishart with k-MLE

This section recalls the definition of the Wishart distribution and proposes a maximum likelihood estimator for its parameters. Some known facts, such as the Kullback-Leibler divergence between two Wishart densities, are recalled. Their use within the above algorithms is also discussed.

11.4.1 Wishart Distribution

The Wishart distribution [5] is the multidimensional version of the chi-square distribution and characterizes the empirical scatter matrix estimator of the multivariate Gaussian distribution. Let \(\mathbb {X}\) be an \(n\)-sample consisting of independent realizations of a \(d\)-dimensional Gaussian random vector with zero mean and covariance matrix \(S\). Then the scatter matrix \(X={}^t\!{\mathbb {X}}\mathbb {X}\) follows a central Wishart distribution with scale matrix \(S\) and \(n\) degrees of freedom, denoted by \(X \sim \mathcal {W}_d(n,S)\). Its density function is

$$\begin{aligned} \mathcal {W}_{d}(X;n,S)= \frac{|X|^{\frac{n-d-1}{2}}\exp \big \{-\frac{1}{2}{\mathrm {tr}}(S^{-1}X)\big \}}{2^{\frac{nd}{2}} |S|^{\frac{n}{2}}\varGamma _{d}\left( \frac{n}{2}\right) },\nonumber \end{aligned}$$

where, for \(y>\frac{d-1}{2}\), \(\varGamma _{d} (y)= \pi ^{\frac{d(d-1)}{4}} \prod _{j=1}^d\varGamma \left( y -\frac{j-1}{2}\right) \) is the multivariate gamma function. Let us remark immediately that this definition implies that \(n\) is constrained to be strictly greater than \(d-1\).

Wishart distribution is an exponential family since

$$\mathcal {W}_{d}(X;\theta _{n},\theta _{S})=\exp \left\{ <\theta _{n},\log |X|>_{\mathbb {R}} + <\theta _{S}, - \frac{1}{2}X>_{HS} +\, k(X) - F(\theta _{n},\theta _{S})\right\} {,}$$

where \((\theta _{n},\theta _{S})=(\frac{n-d-1}{2},S^{-1})\), \(t(X) = (\log |X|,- \frac{1}{2}X)\), \(\langle , \rangle _{HS}\) denotes the Hilbert-Schmidt inner product and

$$\begin{aligned} F(\theta _{n},\theta _{S}) = \left( \theta _{n} + \frac{(d + 1)}{2}\right) \left( d\log (2) - \log |\theta _{S}|\right) + \log \varGamma _{d}\left( \theta _{n} + \frac{(d + 1)}{2}\right) . \end{aligned}$$
(11.28)

Note that this decomposition is not unique (see another one in [22]). Refer to Appendix A.1 for detailed calculations.
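As a sanity check of this decomposition (not part of the chapter), one can verify numerically that \(\exp \left\{ \langle t(X), \theta \rangle - F(\theta )\right\} \) matches the Wishart density; the values of \(d\), \(n\) and \(S\) below are arbitrary.

```python
import numpy as np
from scipy.stats import wishart
from scipy.special import multigammaln

d, n = 3, 7.0
S = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 0.5]])
X = wishart.rvs(df=n, scale=S, random_state=0)

# natural parameters and log-normalizer of Eq. (11.28); the carrier measure k(X) is 0
theta_n, theta_S = (n - d - 1) / 2.0, np.linalg.inv(S)
F = (theta_n + (d + 1) / 2.0) * (d * np.log(2.0) - np.linalg.slogdet(theta_S)[1]) \
    + multigammaln(theta_n + (d + 1) / 2.0, d)
log_pdf_ef = theta_n * np.linalg.slogdet(X)[1] - 0.5 * np.trace(theta_S @ X) - F

print(np.allclose(log_pdf_ef, wishart.logpdf(X, df=n, scale=S)))   # expect True
```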

11.4.2 MLE for Wishart Distribution

Let us recall (see Sect. 11.2.1) that the MLE is obtained by applying \((\nabla F)^{-1}\) to the average of the sufficient statistics. Finding \((\nabla F)^{-1}\) amounts here to solving the following system (see Eqs. 11.5 and 11.28):

$$\begin{aligned} \left\{ \begin{array}{lcl} \ d\log (2) - \log |\theta _{S}|+ \varPsi _{d}\left( \theta _{n} + \frac{(d + 1)}{2} \right) &{} = &{} \eta _{n},\\ - \left( \theta _{n} + \frac{(d + 1)}{2}\right) \theta _{S}^{-1} &{} = &{} \eta _{S}. \end{array} \right. \end{aligned}$$
(11.29)

with \(\eta _{n} = \mathbb {E} [\log |X|]\) and \(\eta _{S} = \mathbb {E}[- \frac{1}{2}X]\) the expectation parameters and \(\varPsi _{d}\) the derivative of \(\log \varGamma _{d}\). Unfortunately, the variables \(\theta _{n}\) and \(\theta _{S}\) are not separable, so that no closed-form solution is known. Instead, as pointed out in [23], it is possible to adopt an iterative scheme that alternately yields the maximum likelihood estimate of one parameter when the other is fixed. This is equivalent to considering two sub-families \(\mathcal {W}_{d,\underline{n}}\) and \(\mathcal {W}_{d,\underline{S}}\) of the Wishart distribution \(\mathcal {W}_{d}\) which are also exponential families. For the sake of simplicity, the natural parameterizations and sufficient statistics of the decomposition in the general case are kept (see Appendices A.2 and A.3 for more details).

Distribution \(\mathcal {W}_{d,\underline{n}}\) (\(\underline{n} = 2\theta _{\underline{n}} + d + 1\)):    \(k_{\underline{n}}(X) = \frac{\underline{n}-d-1}{2} \log |X|\) and

$$\begin{aligned} F_{\underline{n}}(\theta _{S}) = \frac{\underline{n}d}{2}\log (2) - \frac{\underline{n}}{2} \log |\theta _{S}| + \log \varGamma _{d}\left( \frac{\underline{n}}{2}\right) . \end{aligned}$$
(11.30)

Using classical results on matrix derivatives, Eq. (11.5) can be easily solved:

$$\begin{aligned} -\frac{\underline{n}}{2} \hat{\theta }_{S}^{-1} = \frac{1}{N} \sum _{i=1}^{N} -\frac{1}{2} X_{i} \implies \hat{\theta }_{S} = N\underline{n} \left( \sum _{i=1}^{N} X_{i}\right) ^{-1}. \end{aligned}$$
(11.31)

Distribution \(\mathcal {W}_{d,\underline{S}}\) (\(\underline{S} = \theta _{\underline{S}}^{-1}\)):    \(k_{\underline{S}}(X) = -\frac{1}{2} {\mathrm {tr}}(\underline{S}^{-1} X)\) and

$$\begin{aligned} F_{\underline{S}}(\theta _{n}) = \left( \theta _{n} + \frac{d+1}{2}\right) \log \left| 2\underline{S}\right| + \log \varGamma _{d}\left( \theta _{n} + \frac{d+1}{2}\right) \end{aligned}$$
(11.32)

Again, Eq. (11.5) can be numerically solved:

$$\begin{aligned} \hat{\theta }_{n} = \varPsi _{d}^{-1}\left( \frac{1}{N} \sum _{i=1}^{N} \log \left| X_{i}\right| - \log \left| 2\underline{S}\right| \right) - \frac{d+1}{2}, \hat{\theta }_{n}>-1 \end{aligned}$$
(11.33)

with \(\varPsi _{d}^{-1}\) the functional reciprocal of \(\varPsi _{d}\). The latter can be computed with any optimization method on a bounded domain (e.g. the Brent method [24]). Let us mention that the notation is simplified here since \(\hat{\theta }_{S}\) and \(\hat{\theta }_{n}\) should have been indexed by their corresponding family. Algorithm 6 summarizes the estimation of the parameters \(\hat{\theta }\) of the Wishart distribution. By precomputing \(N\left( \sum _{i=1}^{N} X_{i}\right) ^{-1}\) and \(N^{-1}\sum _{i=1}^{N} \log \left| X_{i}\right| \), much computation time can be saved. The computation of \(\varPsi _{d}^{-1}\) remains the most expensive part of the algorithm.
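A possible implementation of Algorithm 6 is sketched below (function names are ours); it alternates the closed-form update of Eq. (11.31) with a numerical inversion of \(\varPsi _{d}\) from Eq. (11.33) by Brent's method, assuming the sought root lies in \((d-1, +\infty )\).

```python
import numpy as np
from scipy.special import psi
from scipy.optimize import brentq
from scipy.stats import wishart

def multipsi(y, d):
    """Derivative of log Gamma_d, i.e. Psi_d(y) = sum_{j=1}^d psi(y - (j-1)/2)."""
    return sum(psi(y - j / 2.0) for j in range(d))

def wishart_mle(Xs, n_iter=30):
    """Sketch of Algorithm 6: alternating MLE of (n, S) from a stack of d x d matrices."""
    Xs = np.asarray(Xs, dtype=float)
    N, d, _ = Xs.shape
    sum_X = Xs.sum(axis=0)                                        # precomputed once
    mean_logdet = np.mean([np.linalg.slogdet(X)[1] for X in Xs])  # precomputed once
    n = d + 1.0                                                   # any start above d - 1
    for _ in range(n_iter):
        S = sum_X / (N * n)                                       # Eq. (11.31): MLE of S given n
        rhs = mean_logdet - np.linalg.slogdet(2.0 * S)[1]
        # Eq. (11.33): MLE of n given S, by inverting Psi_d numerically on (d-1, +inf)
        n = brentq(lambda m: multipsi(m / 2.0, d) - rhs, d - 1.0 + 1e-8, 1e8)
    return n, sum_X / (N * n)

Xs = wishart.rvs(df=10, scale=np.eye(3), size=500, random_state=0)
print(wishart_mle(Xs))            # n close to 10, S close to the identity
```

When used inside \(k\)-MLE, the two precomputed statistics are simply restricted to the observations of the current cluster \(\mathcal {C}_{j}\).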

Let us now prove the convergence and the consistency of this method. Maximizing \(\bar{l}\) amounts to minimizing \(E(\theta ) = F(\theta )- \langle \frac{1}{N}\sum _{i=1}^{N}t(X_{i}),{\theta }\rangle \). The following properties are satisfied by \(E\):

  • The Hessian \(\nabla ^2 E = \nabla ^2 F\) of \(E\) is positive definite on \(\varTheta \) since \(F\) is strictly convex.

  • Its unique minimizer on \(\varTheta \) is the MLE \(\hat{\theta } = \nabla F^{*} (\frac{1}{N}\sum _{i=1}^{N}t(X_{i}))\) whenever it exists (although \(F^*\) is not known for Wishart, and \(F\) is not separable).

[Algorithm 6: MLE for the Wishart distribution \(\mathcal {W}_{d}\)]

Therefore, Algorithm 6 is an instance of the group coordinate descent algorithm of Bezdek et al. (Theorem 2.2 in [25]) for \(\theta =(\theta _n,\theta _S)\):

$$\begin{aligned} \hat{\theta }_S^{(t+1)}&= \mathop {\mathrm{arg}\,\mathrm{min}}\limits _{\theta _S} E(\hat{\theta }_n^{(t)},\theta _S)\end{aligned}$$
(11.34)
$$\begin{aligned} \hat{\theta }_n^{(t+1)}&= \mathop {\mathrm{arg}\,\mathrm{min}}\limits _{\theta _n} E(\theta _n,\hat{\theta }_S^{(t+1)}) \end{aligned}$$
(11.35)

Resulting sequences \(\{\hat{\theta }_n^{(t)}\}\) and \(\{\hat{\theta }_S^{(t)}\}\) are shown to converge linearly to the coordinates of \(\hat{\theta }\).

By looking carefully at the previous algorithms, let us remark that the initialization methods require the ability to compute the divergence \(B_{F^{*}}\) between two elements \(\eta _{1}\) and \(\eta _{2}\) of the expectation space \(\mathbf {H}\). Whereas \(F^*\) is known for \(\mathcal {W}_{d,\underline{n}}\) and \(\mathcal {W}_{d,\underline{S}}\), Eq. (11.9) gives a potential solution for \(\mathcal {W}_{d}\) by considering \(B_{F}\) on the natural parameters \(\theta _{2}\) and \(\theta _{1}\) in \(\varTheta \). Finding the correspondence \(\mathbf {H} \mapsto \varTheta \) is analogous to computing the MLE for a single observation.

The previous MLE procedure does not converge with a single observation \(X_{1}\). Bogdan and Bogdan [26] proved that the MLE exists and is unique in an exponential family iff the affine envelope of the \(N\) points \(t(X_1)\), ..., \(t(X_N)\) is of dimension \(D\), the order of this exponential family. Since the affine envelope of \(t(X_{1})\) is of dimension \(d \times d\) (instead of \(D = d \times d + 1\)), the MLE does not exist and the likelihood function goes to infinity.Footnote 2 The unboundedness of the likelihood function is a well-known problem that can be tackled by adding a penalty term to it [27]. A simpler solution is to take the MLE in the family \(\mathcal {W}_{d,\underline{n}}\) for some \(n\) (known or arbitrarily fixed above \(d-1\)) instead of \(\mathcal {W}_{d}\).

11.4.3 Divergences for Wishart Distributions

For two Wishart distributions \(\mathcal {W}_{d}^{1} = \mathcal {W}_{d}(X;n_{1},S_{1})\) and \(\mathcal {W}_{d}^{2}=\mathcal {W}_{d}(X;n_{2},S_{2})\), the KL divergence is known [22] (even if \(F^{*}\) is unknown):

$$\begin{aligned} {\mathrm {KL}}(\mathcal {W}_{d}^{1} || \mathcal {W}_{d}^{2}) =\,&- \log \left( \frac{\varGamma _{d}\left( \frac{n_{1}}{2}\right) }{\varGamma _{d}\left( \frac{n_{2}}{2}\right) }\right) + \left( \frac{n_{1} - n_{2}}{2}\right) \varPsi _{d}\left( \frac{n_{1}}{2}\right) \nonumber \\&+ \frac{n_{1}}{2} \left( - \log \frac{|S_{1}|}{|S_{2}|} + {\mathrm {tr}}(S_{2}^{-1}S_{1}) - d\right) \end{aligned}$$
(11.36)

Looking at the KL divergences of the two Wishart sub-families \(\mathcal {W}_{d,\underline{n}}\) and \(\mathcal {W}_{d,\underline{S}}\) gives an interesting perspective on this formula. Applying Eqs. 11.9 and 11.8, it follows that

$$\begin{aligned} {\mathrm {KL}}(\mathcal {W}_{d,\underline{n}}^{1} || \mathcal {W}_{d,\underline{n}}^{2})&= \frac{\underline{n}}{2} \left( - \log \frac{|S_{1}|}{|S_{2}|} + {\mathrm {tr}}(S_{2}^{-1}S_{1}) - d\right) \end{aligned}$$
(11.37)
$$\begin{aligned} {\mathrm {KL}}(\mathcal {W}_{d,{\underline{S}}}^{1} || \mathcal {W}_{d,{\underline{S}}}^{2})&= -\log \left( \frac{\varGamma _{d}\left( \frac{n_{1}}{2}\right) }{\varGamma _{d}\left( \frac{n_{2}}{2}\right) }\right) + \left( \frac{n_{1} - n_{2}}{2}\right) \varPsi _{d}\left( \frac{n_{1}}{2}\right) \end{aligned}$$
(11.38)

Detailed calculations can be found in the Appendix. Notice that \({\mathrm {KL}}(\mathcal {W}_{d}^{1} || \mathcal {W}_{d}^{2})\) is simply the sum of these two divergences

$$\begin{aligned} {\mathrm {KL}}(\mathcal {W}_{d}^{1} || \mathcal {W}_{d}^{2}) = {\mathrm {KL}}(\mathcal {W}_{d,\underline{S_{1}}}^{1} || \mathcal {W}_{d,\underline{S_{1}}}^{2}) + {\mathrm {KL}}(\mathcal {W}_{d,\underline{n_{1}}}^{1} || \mathcal {W}_{d,\underline{n_{1}}}^{2}) \end{aligned}$$
(11.39)

and that \({\mathrm {KL}}(\mathcal {W}_{d,{\underline{S}}}^{1} || \mathcal {W}_{d,{\underline{S}}}^{2})\) does not depend on \(\underline{S}\).

The divergence \({\mathrm {KL}}(\mathcal {W}_{d,\underline{n}}^{1} || \mathcal {W}_{d,\underline{n}}^{2})\), commonly used as a dissimilarity measure between covariance matrices, is sometimes referred to as the log-Det divergence due to the form of \(\varphi (S) = F_{\underline{n}} (S) \propto \log |S|\) (see Eq. 11.30). However, the dependency on the term \(\underline{n}\) should be neglected only when the two empirical covariance matrices come from samples of the same size. In this case, the log-Det divergence between two covariance matrices is the KL divergence in the sub-family \(\mathcal {W}_{d,\underline{n}}\).
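For reference, Eqs. (11.36)-(11.38) translate directly into the following sketch (ours), which also makes the decomposition of Eq. (11.39) explicit.

```python
import numpy as np
from scipy.special import multigammaln, psi

def multipsi(y, d):
    """Psi_d(y): derivative of the log multivariate gamma function."""
    return sum(psi(y - j / 2.0) for j in range(d))

def kl_wishart(n1, S1, n2, S2):
    """KL(W_d(n1, S1) || W_d(n2, S2)) following Eq. (11.36), split as in Eq. (11.39)."""
    d = S1.shape[0]
    # term depending only on the degrees of freedom (Eq. 11.38)
    kl_n = (multigammaln(n2 / 2.0, d) - multigammaln(n1 / 2.0, d)
            + (n1 - n2) / 2.0 * multipsi(n1 / 2.0, d))
    # "log-Det" term depending only on the scale matrices (Eq. 11.37)
    logdet_ratio = np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S2)[1]
    kl_S = n1 / 2.0 * (-logdet_ratio + np.trace(np.linalg.solve(S2, S1)) - d)
    return kl_n + kl_S

print(kl_wishart(6.0, np.eye(3), 6.0, np.eye(3)))   # 0.0, as expected for identical densities
```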

11.4.4 Toy Examples

In this part, some simple simulations are presented for \(d=2\). Since the observations are positive semi-definite matrices, it is possible to visualize them as ellipses parametrized by their eigen-decompositions. For example, Fig. 11.1 shows 20 matrices generated from \(\mathcal {W}_{d}(.;n,S)\) for \(n=5\) and for \(n=50\), with \(S\) having eigenvalues \(\{2,1\}\).

Fig. 11.1  20 random matrices from \(\mathcal {W}_{d}(.;n,S)\) for \(n=5\) (left) and \(n=50\) (right)

This visualization highlights the difficulty of estimating the parameters (even for small \(d\)) when \(n\) is small.

Then, a dataset of 60 matrices is generated from a three-component mixture with parameters \(\mathcal {W}_{d}(.;10,S_{1})\), \(\mathcal {W}_{d}(.;20,S_{2})\), \(\mathcal {W}_{d}(.;30,S_{3})\) and equal weights \(w_{1}=w_{2}=w_{3}=1/3\). The respective eigenvalues of \(S_{1}, S_{2}, S_{3}\) are \(\{2,1\}, \{2,0.5\}, \{1,1\}\). Figure 11.2 illustrates this dataset. To study the influence of a good initialization on \(k\)-MLE, the Normalized Mutual Information (NMI) [28] is computed between the final partition and the ground-truth partition for different initializations. This value, between 0 and 1, is higher when the two partitions are more similar. The following table gives the average and standard deviation of the NMI over 30 runs:

 

NMI: Rand. Init/Lloyd \(0.229 \pm 0.279\); Rand. Init/Hartigan \(0.243 \pm 0.276\); \(k\)-MLE\(++\)/Hartigan \(0.67 \pm 0.083\)

From this small experiment, we can easily verify the importance of a good initialization. The partitions having the highest NMI are also reported in Fig. 11.4 for each method. Let us mention that the Hartigan method almost always gives a better partition than Lloyd's for the same initial mixture.

Fig. 11.2  Dataset: 60 matrices generated from a three-component mixture of \(\mathcal {W}_{2}\)

Fig. 11.3  DP-\(k\)-MLE\(++\): influence of \(\lambda \) on \(K\) and on the average log-likelihood

Fig. 11.4  Best partitions with Rand. Init/Lloyd (left), Rand. Init/Hartigan (middle), \(k\)-MLE\(++\)/Hartigan (right)

A last simulation indicates that the initialization with DP-\(k\)-MLE\(++\) is very sensitive to its parameter \(\lambda \). Again with the same set of matrices, Fig. 11.3 shows how the number of generated clusters \(K\) and the average log-likelihood evolve with \(\lambda \). Not surprisingly, both quantities decrease when \(\lambda \) increases.

11.5 Application to Motion Retrieval

In this section, a potential application to motion retrieval is proposed, following our previous work [23]. A raw motion-captured movement can be identified with an \(n_{i} \times d\) matrix \(\mathbb {X}_{i}\) where each row corresponds to the captured locations of a set of sensors.

11.5.1 Movement Representation

When the aim is to provide a taxonomy of a set of movements, it is difficult to compare matrices of varying sizes. The cross-product matrix \(X_{i} = {}^{t}\mathbb {X}_{i}\mathbb {X}_{i}\) is a possible descriptorFootnote 3 of \(\mathbb {X}_{i}\). Denoting by \(N\) the number of movements, the set \(\{X_{1},...,X_{N}\}\) of \(d \times d\) matrices is exactly the input of \(k\)-MLE. Note that \(d\) can easily be of high dimension when the number of sensors is large.
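For instance, assuming three hypothetical clips with different lengths \(n_{i}\) but the same \(d\) channels, the descriptors are obtained as follows (the data here are random stand-ins for motion-capture sequences):

```python
import numpy as np

d = 6
rng = np.random.default_rng(0)
clips = [rng.standard_normal((n_i, d)) for n_i in (120, 75, 200)]   # variable-length movements
descriptors = [clip.T @ clip for clip in clips]                     # X_i = {}^t X_i X_i, all d x d
```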

The simplest way to initialize \(k\)-MLE in this setting is to apply DP-\(k\)-MLE\(++\) for \(\mathcal {W}_{d}\). But when the \(n_{i}\) are known, it is better not to estimate them. In this case, DP-comp-\(k\)-MLE is appropriate with a function \(H\) selecting \(\mathcal {W}_{d,\underline{n_{i}}}\) given \(\xi _{i} = n_{i}\). When the learning algorithm is fast enough, it is common practice to restart it from different initializations and to keep the best output (mixture parameters).

To enrich the description of a single movement, it is possible to define a mixture \(m_{i}\) per movement \(\mathbb {X}_{i}\). For example, several subsets of successive observations with different sizes can be extracted and their cross-product matrices used as inputs for \(k\)-MLE (and DP-comp-\(k\)-MLE). Mixture \(m_{i}\) can be viewed as a sparse representation of local dynamics of \(\mathbb {X}_{i}\) through their local second-order moments.

While these two representations are of different kinds, it is possible to encompass both in a common framework where \(\mathbb {X}_{i}\) is described by a mixture with a single component \(\{(w_{i,1} = 1, \eta _{i,1} = t(X_{i}))\}\). The \(k\)-MLE algorithm applied to such inputs for all movements (i.e. \(\{t(X_{i})\}_{i}\)) then provides another set of mixture parameters \(\{(\hat{w}_{j},\hat{\eta }_{j})\}_{j}\). Note that the general treatment of arbitrary mixtures of mixtures of Wishart is not claimed to be addressed here.

11.5.2 Querying with the Cauchy-Schwarz Divergence

Let us consider a movement \(\mathbb {X}\) (an \(n \times d\) matrix) and its mixture representation \(m\). Without loss of generality, let us denote by \(\{(w_{j},\theta _{j})\}_{j=1..K}\) the mixture parameters of \(m\). The problem of comparing two movements amounts to computing an appropriate dissimilarity between \(m\) and another mixture \(m'\) of the same kind with parameters \(\{(w'_{j},\theta '_{j})\}_{j=1..K'}\).

When both mixtures have a single component (\(K = K' = 1\)), an immediate solution is to consider the Kullback-Leibler divergence \({\mathrm {KL}}(m:m')\) between two members of the same exponential family. Since it is the Bregman divergence on the swapped natural parameters, \(B_{F}(\theta ':\theta )\), a closed form is always available from Eq. (11.7). It is important to mention that this formula holds for \(\theta \) and \(\theta '\) viewed as parameters of \(\mathcal {W}_{d}\) even if they are estimated in the sub-families \(\mathcal {W}_{d,\underline{n}}\) and \(\mathcal {W}_{d,\underline{n'}}\).

For general mixtures of the same exponential family (\(K>1\) or \(K'>1\)), the KL divergence no longer admits a closed form and has to be approximated with numerical methods. Recently, other divergences such as the Cauchy-Schwarz (CS) divergence [29] were shown to be available in closed form:

$$\begin{aligned} \mathrm {CS}(m:m') = -\log \frac{\int m(x)m'(x) \mathrm {d}x}{\sqrt{\int m(x)^{2} \mathrm {d}x\int m'(x)^{2} \mathrm {d}x}}. \end{aligned}$$
(11.40)

Within the same exponential family \(p_{F}\), the integral of the product of mixtures is

$$\begin{aligned} \int m(x)m'(x) \mathrm {d}x = \sum _{j=1}^{K}\sum _{j'=1}^{K'}w_{j}w'_{j'} \int p_{F}(x;\theta _{j})p_{F}(x;\theta '_{j'}) \mathrm {d}x. \end{aligned}$$
(11.41)

When carrier measure \(k(X) = 0\), as it is for \(\mathcal {W}_{d}\) but not for \(\mathcal {W}_{d,\underline{n}}\) and \(\mathcal {W}_{d,\underline{S}}\), the integral can be further expanded as

$$\begin{aligned} \int p_{F}(X;\theta _{j})p_{F}(X;\theta '_{j'}) \mathrm {d}X&= \int e^{\langle \theta _{j}, t(X) \rangle - F(\theta _{j})}e^{\langle \theta '_{j'}, t(X) \rangle - F(\theta '_{j'})}\mathrm {d}X\\&= \int e^{\langle \theta _{j} + \theta '_{j'}, t(X) \rangle - F(\theta _{j})- F(\theta '_{j'})}\mathrm {d}X\\&= e^{F(\theta _{j} + \theta '_{j'}) - F(\theta _{j})- F(\theta '_{j'})} \underbrace{\int e^{\langle \theta _{j} + \theta '_{j'}, t(X) \rangle - F(\theta _{j} + \theta '_{j'})}\mathrm {d}X}_{=1}. \end{aligned}$$

Note that \(\theta _{j} +\, \theta '_{j'}\) must be in the natural parameter space \(\varTheta \) to ensure that \(F(\theta _{j} +\, \theta '_{j'})\) is finite. A sufficient condition is that \(\varTheta \) is a convex cone.

When \(p_{F} = \mathcal {W}_{d}\), the space \(\varTheta = ]-1;+\infty [ \times \mathcal {S}_{++}^{d}\) is not a convex cone, since \(\theta _{n_{j}} +\theta '_{n_{j'}} < - 1\) whenever \(n_{j} + n'_{j'} < 2d\). In practice, this constraint is tested for each pair of parameters before proceeding with the computation of the CS divergence. A possible fix, not developed here, would be to constrain \(n\) to be greater than \(d+1\) (or equivalently \(\theta _{n} > 0\)). Such a constraint amounts to taking the convex cone \(]0;+\infty [ \times \mathcal {S}_{++}^{d}\), a subset of \(\varTheta \). Denoting \(\varDelta (\theta _{j},\theta '_{j'})= F(\theta _{j} + \theta '_{j'}) - F(\theta _{j})- F(\theta '_{j'})\), the CS divergence is also

$$\begin{aligned} \mathrm {CS}(m:m') =&-\log \sum _{j=1}^{K}\sum _{j'=1}^{K'} w_{j}w'_{j'} e^{\varDelta (\theta _{j},\theta '_{j'})} + \frac{1}{2} \log \sum _{j=1}^{K}\sum _{j''=1}^{K} w_{j}w_{j''} e^{\varDelta (\theta _{j},\theta _{j''})}\nonumber \\&+ \frac{1}{2} \log \sum _{j'=1}^{K'}\sum _{j''=1}^{K'} w'_{j'}w'_{j''} e^{\varDelta (\theta '_{j'},\theta '_{j''})}. \end{aligned}$$
(11.42)

Note that the CS divergence is symmetric since \(\varDelta (\theta _{j},\theta '_{j'})\) is. A numerical value of \(\varDelta (\theta _{j},\theta '_{j'})\) can be computed for \(\mathcal {W}_{d}\) from Eq. 11.28 (see Eq. 11.45 or 11.46 in the Appendix).
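The sketch below (ours) evaluates the CS divergence between two Wishart mixtures given as lists of (weight, \((\theta _{n},\theta _{S})\)) pairs in the natural parameterization of Eq. (11.28); the constraint \(\theta _{n}+\theta '_{n}>-1\) discussed above is asserted before each evaluation of \(\varDelta \).

```python
import numpy as np
from scipy.special import multigammaln

def F_wishart(theta_n, theta_S, d):
    # log-normalizer of W_d in natural coordinates (Eq. 11.28)
    y = theta_n + (d + 1) / 2.0
    return y * (d * np.log(2.0) - np.linalg.slogdet(theta_S)[1]) + multigammaln(y, d)

def delta(t1, t2, d):
    # Delta(theta, theta') = F(theta + theta') - F(theta) - F(theta')
    (n1, S1), (n2, S2) = t1, t2
    assert n1 + n2 > -1, "theta_n + theta_n' must stay in the natural parameter space"
    return F_wishart(n1 + n2, S1 + S2, d) - F_wishart(n1, S1, d) - F_wishart(n2, S2, d)

def log_prod_integral(mix1, mix2, d):
    # log of  int m(x) m'(x) dx  =  log sum_{j,j'} w_j w'_j' exp(Delta), via log-sum-exp
    terms = [np.log(w1) + np.log(w2) + delta(t1, t2, d)
             for (w1, t1) in mix1 for (w2, t2) in mix2]
    m = max(terms)
    return m + np.log(sum(np.exp(t - m) for t in terms))

def cauchy_schwarz(mix1, mix2, d):
    # CS(m : m') of Eq. (11.42), entirely in the log domain
    return (-log_prod_integral(mix1, mix2, d)
            + 0.5 * (log_prod_integral(mix1, mix1, d) + log_prod_integral(mix2, mix2, d)))

d = 2
mix = [(0.6, (2.0, np.eye(d))), (0.4, (3.0, 2.0 * np.eye(d)))]
print(cauchy_schwarz(mix, mix, d))        # ~0 for identical mixtures
```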

11.5.3 Summary of Proposed Motion Retrieval System

To conclude this section, let us recall the elements of our proposal for a motion retrieval system. A movement is represented by a Wishart mixture model learned by \(k\)-MLE initialized with DP-\(k\)-MLE\(++\) or DP-comp-\(k\)-MLE. In the case of a mixture with a single component, a simple application of the MLE for \(\mathcal {W}_{d,\underline{n}}\) is sufficient. Although the Wishart distribution may appear to be an inadequate model for the scatter matrix \(X\) of a movement, it has been shown that this crude assumption provides good classification rates on a real dataset [23]. Learning the representations of the movements may be performed offline since it is computationally demanding. Using the CS divergence as a dissimilarity, we can then extract a taxonomy of movements with any spectral clustering algorithm. For a query movement, its representation by a mixture has to be computed first. Then it is possible to search the database for the most similar movements according to the CS divergence, or to predict its type by a majority vote among them. More details on the implementation and results for the real dataset will be given in a forthcoming technical report.

11.6 Conclusions and Perspectives

Hartigan’s swap clustering method for \(k\)-MLE was studied in the general case of an exponential family. Unlike for \(k\)-means, this method does not guarantee the avoidance of empty clusters, but it generally achieves better performance than Lloyd's heuristic. Two methods, DP-\(k\)-MLE\(++\) and DP-comp-\(k\)-MLE, are proposed to initialize \(k\)-MLE and to set the number of clusters automatically. While the former shares the good properties of \(k\)-MLE\(++\), the latter selects the component distributions given some extra knowledge. A small experiment indicates that these methods are quite sensitive to their only parameter.

We recalled the definition and some properties of the Wishart distribution \(\mathcal {W}_{d}\), especially its canonical decomposition as a member of an exponential family. By fixing either one of its two parameters \(n\) and \(S\), two other (nested) exponential (sub-)families \(\mathcal {W}_{d,\underline{n}}\) and \(\mathcal {W}_{d,\underline{S}}\) may be defined. From their respective MLEs, it is possible to define an iterative process which provably converges to the MLE for \(\mathcal {W}_{d}\). For a single observation, the MLE does not exist; a crude solution is then to replace the MLE in \(\mathcal {W}_{d}\) by the MLE in one of the two sub-families.

The MLE is an example of a point estimator among many others (e.g. method of moments, minimax estimators, Bayesian point estimators). This suggests as future work many other learning algorithms such as \(k\)-MoM, \(k\)-Minimax [30], \(k\)-MAP following the same algorithmic scheme as \(k\)-MLE.

Finally, an application to the retrieval of motion-captured movements is proposed. Each motion is described by a Wishart mixture model and the Cauchy-Schwarz divergence is used as a dissimilarity measure between two mixture models. As the CS divergence is always available in closed form, it is fast to compute compared to stochastic integration estimation schemes. This divergence can be used in spectral clustering methods and for the visualization of a set of motions in a Euclidean embedding.

Another perspective is the connection between closed-form divergences between mixtures and kernels based on divergences [31]: the CS divergence looks similar to the Normalized Correlation Kernel [32]. This could lead to a broader class of methods (e.g., SVM) using these divergences.