1 Introduction and Motivation

Estimating the probability density function of an unknown probability law is an old and well-studied problem, and among the available techniques, mixture models are particularly widespread in practical applications. Much work is therefore devoted to speeding up mixture parameter estimation, which is of particular interest in real-time applications such as object tracking in videos [4, 9, 10].

Research on fast mixture learning falls into three main categories. First, one may reduce the computational burden of the algorithm itself: for example, \(k\)-MLE [7] and cEM [3] are fast variants of EM where the slow soft-assignment step is replaced by a fast hard-assignment step. Second, one may work on the input data: in [5], coresets are used to reduce the number of points needed to build the model. Third, online algorithms can be designed to handle large datasets more easily [2, 10], avoiding the need to store the whole dataset.

We take a slightly different point of view: we address both the massive-data problem and the online constraint in the case where many different mixtures must be estimated from rather similar sets of points. Our new algorithm is divided into two steps: a training step (which may be slow, but is performed only once) builds a dictionary of components whose atoms are the parameters of the distributions; a learning step then uses a nearest-neighbor search to associate each incoming observation with its most probable component, incrementally populating the weight vector of the mixture. This learning step is online, since observations are processed one by one, and is faster than Expectation-Maximization (EM), since a nearest-neighbor search is much simpler than a full EM.

We believe that separating the training step from the learning step for mixture models can be very useful in numerous applications. For example, in a video analysis application (on an MPEG compressed stream), the dictionary can be built on a key-frame and the inter-frames can be modeled using the dictionary of the corresponding key-frame: a dictionary learned on a key-frame will be well suited to the following images, but a new one will be needed when the scene changes too much.

Our contributions are the following: first, we define the co-mixture concept and present an Expectation-Maximization-based algorithm, called co-EM, to estimate its parameters; then, we introduce an online algorithm, called Bag of Components, which relies on a co-mixture to learn a dictionary of components and uses a nearest-neighbor search to estimate a new mixture from observations arriving one by one.

This article is organized as follows: the first part describes co-mixtures and the co-EM algorithm; the second introduces Bag of Components and its online learning step; the next discusses some improvements over the basic algorithm; and the last presents experimental results.

2 Co-mixture Models

We formally define a co-mixture of exponential families as a set of S mixture models sharing the same parameters for their components:

$$\begin{aligned} \left\{ \begin{array}{l} m_1(x; \omega ^{(1)}_1 \dots \omega ^{(1)}_K) = \sum _{i=1}^K \omega ^{(1)}_i p_F(x; \theta _i) \\ m_2(x; \omega ^{(2)}_1 \dots \omega ^{(2)}_K) = \sum _{i=1}^K \omega ^{(2)}_i p_F(x; \theta _i) \\ \dots \\ m_S(x; \omega ^{(S)}_1 \dots \omega ^{(S)}_K) = \sum _{i=1}^K \omega ^{(S)}_i p_F(x; \theta _i) \end{array} \right. \end{aligned}$$
(1)

where \(p_F\) is the exponential family with log-normalizer \(F\), the parameters \(\theta_1, \dots, \theta_K\) of the components are shared between all the individual mixtures of the co-mixture, and the S vectors \(\omega^{(s)}_1, \dots, \omega^{(s)}_K\) are the weight vectors (hence non-negative and summing to 1).

In the expressions above, all the mixtures have the same number of components; this is not a limitation, since the weight associated with a component may be zero.
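
As an illustration, here is a minimal sketch (in Python with NumPy/SciPy, using univariate Gaussians as the exponential family; the component parameters, weights and function names are purely illustrative, not part of the paper) that evaluates the S densities of Eq. (1) from a shared set of component parameters and per-mixture weight vectors.

```python
import numpy as np
from scipy.stats import norm

# Shared component parameters (here: K univariate Gaussians, mean and std).
thetas = [(-2.0, 0.5), (0.0, 1.0), (3.0, 0.8)]      # theta_1 ... theta_K

# One weight vector per individual mixture (each row sums to 1).
weights = np.array([[0.7, 0.2, 0.1],                # omega^(1)
                    [0.1, 0.1, 0.8]])               # omega^(2)

def co_mixture_pdf(x, weights, thetas):
    """Densities m_1(x), ..., m_S(x) of the co-mixture at point x (Eq. 1)."""
    comps = np.array([norm.pdf(x, mu, sigma) for mu, sigma in thetas])
    return weights @ comps                          # shape (S,)

print(co_mixture_pdf(0.3, weights, thetas))
```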

In order to build such a set of mixtures from a dataset made of S sets of points \(\mathcal {X}^{(l)} = \{x^{(l)}_1, \dots , x^{(l)}_{n_l}\}\) (where \(n_l\) is the number of observations in the set \(\mathcal {X}^{(l)}\)), we design an EM-based iterative algorithm, called co-EM. For clarity, we describe a generic version that works for any exponential family: it is a variant of Bregman Soft Clustering [1], for which the maximization step is simply an arithmetic mean in the space of expectation parameters (which is in bijection with the usual parameters). It can be described by three main steps:

  • Expectation step,

  • Maximization step (set of points by set of points),

  • Maximization step (aggregation).

Expectation Step. We compute S responsibility matrices \(p^{(1)}, \dots , p^{(S)}\): the coefficient \(p^{(l)}(i, j)\) is the posterior probability that the observation \(x^{(l)}_i\) from the set of points \(\mathcal {X}^{(l)}\) was generated by the j-th component of the mixture \(m_l\), given the current estimate of the parameters \(\eta _1, \dots , \eta _K\) and of the weights \(\omega ^{(l)}_1, \dots , \omega ^{(l)}_K\) of the l-th mixture. In short, we have:

$$\begin{aligned} p^{(l)}(i, j) = \frac{\omega ^{(l)}_j\, p_F(x^{(l)}_i; \eta _j)}{m_l(x^{(l)}_i; \omega ^{(l)}, \eta )} \end{aligned}$$
(2)

Maximization Step (Set of Points by Set of Points). In the first part of the maximization step, S partial estimates \((\eta ^{(1)}_1, \dots , \eta ^{(1)}_K), \dots , (\eta ^{(S)}_1, \dots , \eta ^{(S)}_K)\) are computed, one for each individual mixture of the co-mixture.

The new estimates \((\eta ^{(l)}_1, \dots , \eta ^{(l)}_K)\) for the l-th set of points are computed using the observations of \(\mathcal {X}^{(l)}\) and the l-th responsibility matrix:

$$\begin{aligned} \eta ^{(l)}_j&= \sum _i \frac{p^{(l)}(i, j)}{\sum _u p^{(l)}(u, j)}\, t(x^{(l)}_i) \end{aligned}$$
(3)

The weights of each individual mixture are updated with:

$$\begin{aligned} \omega ^{(l)}_j&= \frac{1}{n_l} \sum _{i=1}^{n_l} p^{(l)}(i, j) \end{aligned}$$
(4)

Maximization Step (Aggregation). All these partial estimates are then aggregated into the new estimate of the parameters \(\eta _1, \dots , \eta _K\).

For component j, the new estimate of \(\eta _j\) is the Bregman barycenter of the partial estimates \(\eta ^{(1)}_j, \dots , \eta ^{(S)}_j\), which reduces to an arithmetic mean in the expectation parameter space:

$$\begin{aligned} \eta _j = \frac{1}{S} \sum _{l=1}^S \eta ^{(l)}_j \end{aligned}$$
(5)

This aggregation step gives the same weight to every set of points, regardless of the number of observations it contains, which removes the influence of varying set sizes.
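
To make the three steps concrete, here is a minimal sketch of one co-EM iteration in Python/NumPy. It assumes univariate Gaussians with a fixed, known standard deviation, so that the sufficient statistic is t(x) = x and the expectation parameter \(\eta_j\) is simply the component mean; the function name and array layout are our own illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def co_em_step(point_sets, etas, omegas, sigma=1.0):
    """One co-EM iteration (Eqs. 2-5) for univariate Gaussians with fixed
    standard deviation sigma, so that t(x) = x and eta_j is the mean.
    point_sets : list of S 1-D arrays, etas : (K,), omegas : (S, K)."""
    K = len(etas)
    partial_etas = np.zeros((len(point_sets), K))
    new_omegas = np.zeros_like(omegas, dtype=float)
    for l, x in enumerate(point_sets):
        # E-step (Eq. 2): responsibilities p^(l)(i, j).
        lik = norm.pdf(x[:, None], etas[None, :], sigma)          # (n_l, K)
        resp = omegas[l] * lik
        resp /= resp.sum(axis=1, keepdims=True)
        # Per-set M-step (Eqs. 3-4): partial expectation parameters and weights.
        partial_etas[l] = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        new_omegas[l] = resp.mean(axis=0)
    # Aggregation (Eq. 5): arithmetic mean of the partial estimates.
    new_etas = partial_etas.mean(axis=0)
    return new_etas, new_omegas
```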

Fig. 1. Segmentation with regular EM and co-EM using a 5D RGBxy description of the images.

The co-EM algorithm converges with respect to the average of the log-likelihoods over all the individual mixtures of the co-mixture, and it can be used independently of Bag of Components: Fig. 1 shows an image segmentation application.

3 Bag of Components

This online algorithm is inspired by dictionary methods: the training step amounts to building a dictionary whose atoms are the components, and the learning step amounts to computing the activation of each atom given the observations.

The dictionary can be directly extracted from the output of co-EM (or from the output of any algorithm building a co-mixture). Given a co-mixture, the dictionary is the set of parameters:

$$\begin{aligned} \mathcal {D}= \left\{ \theta _1, \dots , \theta _K \right\} \end{aligned}$$
(6)

Since a co-mixture must be built, the training step is potentially expensive, but this cost is counterbalanced by two points. First, it is performed only once and its result is reused during the learning step. Second, there is no extra cost if the sets used to build the co-mixture form a subset of the sets of points of interest: in this case, since building a co-mixture of size S with co-EM is not more costly than learning S mixtures with EM, the overall cost of the training and learning steps remains smaller than the cost of running EM on the whole dataset.

The learning step can be done online: instead of working on the whole input set at once, we update the model each time a new observation arrives. This step amounts to a hard assignment: given a new observation, we search for the most probable component among the atoms of the dictionary (using a naive linear search):

$$\begin{aligned} \hat{\imath } = \arg \max _{\theta \in \mathcal {D}} p_F(x_i; \theta ) \end{aligned}$$
(7)

We then increment bin \(\hat{\imath }\) of a histogram that counts how many observations have been associated with each atom. At the end of the processing, the histogram is turned into a proper weight vector simply by dividing by the total number of observations.
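
The whole online learning step can be sketched in a few lines; the snippet below (Python/NumPy, same illustrative fixed-variance Gaussian setting as before, with hypothetical function and variable names) performs the hard assignment of Eq. (7) and maintains the histogram of atom activations.

```python
import numpy as np
from scipy.stats import norm

def boc_learn(stream, dict_etas, sigma=1.0):
    """Online learning step of Bag of Components (Eq. 7), sketched for a
    dictionary of univariate Gaussian means with a fixed standard deviation.
    Each observation is hard-assigned to its most likely atom."""
    counts = np.zeros(len(dict_etas), dtype=int)
    for x in stream:                      # observations arrive one by one
        i_hat = np.argmax(norm.pdf(x, dict_etas, sigma))
        counts[i_hat] += 1
    return counts / counts.sum()          # histogram -> weight vector
```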

4 Improvements

The previous maximization problem can be rewritten as a nearest-neighbor search using the bijection between exponential families and Bregman divergences [1]:

$$\begin{aligned} \hat{\imath } = \arg \min _{\theta \in \mathcal {D}} B_{F^*}\left( t(x_i) \Vert \eta (\theta )\right) \end{aligned}$$
(8)

where \(F^*\) is the Legendre dual of the log-normalizer F of the exponential family and \(\eta (\theta )\) is the transformation of the natural parameter \(\theta \) into the space of expectation parameters.

It is therefore possible to improve on the linear-time search described previously by using appropriate nearest-neighbor techniques and data structures, such as Bregman ball trees [8], and to achieve sub-linear search times.
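
As a simplified illustration (not the general Bregman ball tree of [8]): in the fixed-variance Gaussian setting used in the sketches above, \(B_{F^*}\) reduces to a scaled squared Euclidean distance, so Eq. (8) can be served by an ordinary KD-tree over the expectation parameters. A general exponential family would require a genuinely Bregman-aware structure.

```python
import numpy as np
from scipy.spatial import cKDTree

# For univariate Gaussians with fixed variance, B_{F*}(t(x) || eta) equals
# (x - eta)^2 / (2 sigma^2), so Eq. (8) is a Euclidean nearest-neighbor
# search over the expectation parameters (here: the means).
dict_etas = np.array([-2.0, 0.0, 3.0])          # illustrative dictionary
tree = cKDTree(dict_etas[:, None])              # KD-tree on eta(theta)

def assign(x):
    _, i_hat = tree.query([x])                  # index of the nearest atom
    return int(i_hat)

print(assign(0.4))                              # -> index of the closest mean
```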

Another possible variant is to enforce sparsity of the weights: after the weight vector has been computed, some components are likely to have a very low weight and thus carry almost no information. These components can be removed by thresholding and renormalizing the weights. Another option is to cluster the mixture with the k-medoids algorithm [6] in order to concentrate the weights on the most important components.
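
The thresholding variant is straightforward; here is a minimal sketch (hypothetical function name; the default threshold is the value used in the 1D experiment of the next section):

```python
import numpy as np

def sparsify(weights, threshold=0.06):
    """Drop components whose weight is below the threshold and renormalize."""
    w = np.where(weights >= threshold, weights, 0.0)
    return w / w.sum()
```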

5 Experiments

We evaluate the Bag of Components algorithm on artificially generated mixture models. In order to generate mixture models that are similar enough to be used with a dictionary-based method, we first generate a dictionary of multivariate Gaussian distributions (the covariance matrices are generated from an \(LDL^T\) decomposition, where L is a unit lower-triangular matrix and D a diagonal matrix with positive coefficients). We then generate mixtures by randomly drawing the weights, imposing that only a small fraction of the components has a non-zero weight (to enforce some diversity between the random mixtures).
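
The generation procedure can be sketched as follows (Python/NumPy; the range of the diagonal entries, the random seed and the function names are our own illustrative choices, not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_covariance(d):
    """Random SPD covariance via an LDL^T construction: L is unit
    lower-triangular, D is diagonal with positive entries."""
    L = np.tril(rng.normal(size=(d, d)), k=-1) + np.eye(d)
    D = np.diag(rng.uniform(0.5, 2.0, size=d))
    return L @ D @ L.T

def random_weights(K, active_fraction=0.3):
    """Sparse random weight vector: only a fraction of components is active."""
    active = rng.choice(K, size=max(1, int(active_fraction * K)), replace=False)
    w = np.zeros(K)
    w[active] = rng.random(len(active))
    return w / w.sum()
```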

Fig. 2. From top to bottom: original mixture, Expectation-Maximization, raw Bag of Components, Bag of Components with thresholded weights. The number of components with non-zero weights is given in parentheses.

In all the following experiments, the random mixtures are generated from a dictionary of size 30 with only 30% of non-zero weights. co-EM builds a 30-component co-mixture from 10 sets of 1000 points, and the components of this co-mixture are used as the dictionary for Bag of Components.

The goal of the first experiment is to visually check the quality of a 1D mixture built with Bag of Components (from 1000 observations). Figure 2 compares the original mixture (first curve) with the output of EM (second curve, with 10 components) and the output of Bag of Components (third curve). In the third curve, some components clearly have a low weight compared to the most prominent Gaussians, so in the fourth curve the weights below 0.06 are thresholded out in order to keep only 5 components: in this particular case, most of the information seems to be preserved.

Fig. 3. Computation time for EM and BoC (left) and relative log-likelihood (in percent, right) with respect to the number of observations during the learning step (from 1000 to 10000, in dimension 5).

A second experiment in Fig. 3 compares the execution time (left) and the quality (right) of the output of Bag of Components and of an EM (10 components) with respect to the number of points in the input set (from 1000 to 10000 points, in dimension 5). Given a dictionary built with co-EM during a pre-processing offline step, we build a mixture with the Bag of Components method from a new input set of points (not present in the dataset used for the dictionary learning step) and compare the output mixture to the result of a classical EM.

The quality of the mixtures from the two algorithms is compared using the relative log-likelihood \(\frac{\mathrm {ll}_\mathrm {BoC} - \mathrm {ll}_\mathrm {EM}}{\mathrm {ll}_\mathrm {BoC}}\), so a negative value means Bag of Components produces worse mixtures than EM: over the explored range, the two algorithms produce mixtures of similar quality, with a relative difference roughly between −4% and −2%.

The left part of Fig. 3 measures the execution time of Bag of Components (without the dictionary-building step, since it is done offline): not surprisingly, it is linear in the number of observations, with a speed-up between 1.2 and 4 over EM (whose execution time is very irregular).

The speed-up from EM to Bag of Components is real but modest. Indeed, even though the learning step of Bag of Components runs in O(nK) (where n is the number of observations and K the number of atoms of the dictionary) while EM runs in O(nki) (where k is the number of components and i the number of iterations), the number of atoms K is larger than the number of components k (three times larger in the experiments). The execution time of Bag of Components is thus of the same order of magnitude as all the iterations of EM, and can be nearly identical when EM converges in a few iterations. The speed of Bag of Components could be increased by using a Bregman ball tree, which would allow a sub-linear nearest-neighbor search. Moreover, independently of speed, Bag of Components has the big advantage of being an online algorithm (so in the curves of Fig. 3, each point for Bag of Components is not a new mixture built from scratch, but an update of the previous one).

6 Conclusion

We described the notion of co-mixtures along with the co-EM algorithm, and used it as the basis for a new mixture model learning algorithm, called Bag of Components: this algorithm works online and builds a mixture faster than Expectation-Maximization. It is well suited when many mixtures from related or similar sets of points are needed: in such a case, it is worth building a dictionary on a subset of the sets of points and applying Bag of Components to the remaining ones. It is also of interest when only a few sets of points are available at a time: the available sets can be used to learn the dictionary of components, and new mixtures can be built on new sets of points as soon as they become available.

There is room for many improvements, both in speed, by using efficient exact or approximate nearest-neighbor techniques, and in the sparsity of the weights, by evaluating the need for and the benefit of removing low-weight components. We also leave for future work the validation of co-EM and Bag of Components on a real application instead of artificial mixtures.