
1 Introduction

In recent years, the analysis of directional data (i.e., data for which the “direction” matters more than the magnitude) has drawn significant attention in various fields [12, 14]. Typical directional data are data that are normalized to have unit norm and therefore lie on the surface of the unit sphere. Since directional data are better represented on a manifold, the nonlinear nature of manifolds implies that common distributions such as the multivariate Gaussian distribution cannot be used to model and analyze them. Instead, distributions defined on the unit hypersphere are more appropriate and effective for modeling directional data.

One of the most basic directional distributions is the von Mises-Fisher (vMF) distribution, which is defined on the unit hypersphere (\(\mathbb {S}^{D-1}\)) and has characteristics similar to those of the multivariate Gaussian distribution defined in the Euclidean space \(\mathbb {R}^D\). Although vMF distributions have been widely used for directional data modeling, they are not a universal solution for all types of directional data. For instance, recent research has demonstrated that axial data, where the observations are axes of direction (i.e. the unit vectors \(\pm \mathbf {X}\) are indistinguishable), are better modeled with Watson distributions than with vMF distributions [1]. As a special type of directional data, axial data have found applications in various areas, such as blind speech separation [21], speech clustering in distributed microphone arrays [17], differentiation between normal and schizophrenic brains [13], gene expression data clustering analysis [7], etc.

Different methods have been proposed to learn Watson distributions or their natural extension, the Watson mixture model (WMM). The major difficulty in learning Watson-based models lies in the fact that no analytical solution exists for inferring the concentration parameters of Watson distributions. Thus, approximation methods have been proposed to solve this problem. A simple approximation for large concentrations was proposed in [13] to learn Watson distributions with maximum likelihood (ML) estimates. This learning method, however, cannot deal with axial data of higher dimensions. In [1], an approximation to the ML estimates was proposed within an expectation maximization (EM) framework to learn WMMs. However, this method is prone to over-fitting. A better alternative to ML estimation is variational Bayes (VB) [4, 9], a method that approximates posterior distributions through optimization. In [18], a VB inference method was proposed to learn WMMs and demonstrated better performance than ML estimation. Although closed-form solutions can be obtained with this method, evaluating the model complexity (i.e. the number of mixture components that best fits the data) requires extra effort. Specifically, the VB inference method in [18] treats the mixing coefficients of the WMM as random variables assigned a Dirichlet prior, and model selection is then performed by removing the components with small responsibilities. A more elegant solution to the model selection problem for WMMs was proposed in [7], where a nonparametric framework known as the Dirichlet process mixture model [3, 10] was adopted to define the WMM with an infinite number of components. By applying VB inference to learn the infinite WMM (In-WMM), the number of mixture components can be freely initialized and is adjusted automatically as the data set grows [7].

Although both VB inference methods ([18] and [7]) are effective for learning WMMs, to ensure closed-form solutions they must adopt the mean-field assumption [2], under which parameters are assumed to be independent. This assumption, however, is not realistic for the WMM or In-WMM, in which the mixing coefficients and the latent indicator variables are clearly closely related. This issue can be addressed by applying VB inference in a collapsed space where parameters are marginalized out, which leads to the so-called collapsed VB (CVB) inference framework [20]. As described in [11, 20], the mean-field assumption is better satisfied under CVB, without the concern of dependencies between parameters. Thus, in this work we focus on developing an effective CVB inference method to learn the In-WMM in a collapsed space where the mixing coefficients are integrated out.

We summarize the contributions of this work as follows. Firstly, a collapsed infinite WMM (Co-InWMM) is proposed for modeling axial data by marginalizing out the mixing coefficients. Secondly, an effective CVB inference method is theoretically developed to learn the Co-InWMM with closed-form solutions. Lastly, the proposed Co-InWMM with CVB inference is validated on both synthetic data sets and a challenging application in depth image analysis.

2 The Collapsed Infinite WMM

2.1 Infinite Watson Mixture Models

Given a data set \(\mathcal {X}=\{\mathbf {x}_i\}_{i=1}^N\) containing N axial random vectors (i.e. \(\mathbf {x}\) and \(-\mathbf {x}\) are equivalent), each D-dimensional data vector can be represented as a unit vector (i.e. \(\Vert \mathbf {x}\Vert _2=1\)) defined on the \((D-1)\)-dimensional unit hypersphere \(\mathbb {S}^{D-1}\). If each vector \(\mathbf {x}\) is drawn from a mixture of an infinite number of Watson distributions, then the probability density function of this infinite Watson mixture model (InWMM) is given by

$$\begin{aligned} p(\mathbf {x}|\mathbf {\pi },\mathbf {\mu },\mathbf {\gamma })= \sum _{k=1}^\infty \pi _k\mathcal {W}(\mathbf {x}|\mathbf {\mu }_k,\gamma _k) \end{aligned}$$
(1)

where \(\mathbf {\pi }=\{\pi _k\}_{k=1}^\infty \) denotes the mixing coefficients, which are nonnegative and sum to 1; \(\mathbf {\mu }\in \mathbb {S}^{D-1}\) denotes the mean direction with \(\Vert \mathbf {\mu }\Vert _2=1\); and \(\gamma \in \mathbb {R}\) represents the concentration. \(\mathcal {W}(\mathbf {x}_i|\mathbf {\mu }_k,\gamma _k)\) denotes the Watson distribution associated with the kth component of the mixture model and is defined by

$$\begin{aligned} \mathcal {W}(\mathbf {x}|\mathbf {\mu }_k,\gamma _k)=\frac{\varGamma (D/2)}{2\pi ^{D/2}M(\frac{1}{2},\frac{D}{2},\gamma _k)} \exp [{\gamma _k(\mathbf {\mu }_k^T\mathbf {x})^2}] \end{aligned}$$
(2)

where \(M(a,b,\cdot )\) represents the Kummer function (also known as the confluent hypergeometric function) which is given by

$$\begin{aligned} M(a,b,\gamma )=\sum _{n=0}^\infty \frac{\varGamma (a+n)\varGamma (b)}{\varGamma (a)\varGamma (b+n)}\frac{\gamma ^n}{n!} \end{aligned}$$
(3)

where \(\varGamma (\cdot )\) denotes the Gamma function.
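For concreteness, the Watson density of Eq. (2) and the Kummer function of Eq. (3) can be evaluated numerically. The following minimal sketch (our own illustration, not the authors' code) uses SciPy's confluent hypergeometric function `hyp1f1` for \(M(1/2,D/2,\gamma )\) and also shows the axial symmetry of the density.

```python
import numpy as np
from scipy.special import gammaln, hyp1f1

def watson_logpdf(x, mu, gamma):
    """log W(x | mu, gamma) of Eq. (2) for unit vectors x, mu in R^D."""
    D = x.shape[-1]
    log_norm = (gammaln(D / 2.0) - np.log(2.0) - (D / 2.0) * np.log(np.pi)
                - np.log(hyp1f1(0.5, D / 2.0, gamma)))   # M(1/2, D/2, gamma), Eq. (3)
    return log_norm + gamma * (x @ mu) ** 2

# Axial symmetry: x and -x have the same density.
mu = np.array([0.0, 0.0, 1.0])
x = np.array([0.1, 0.0, np.sqrt(0.99)])
print(watson_logpdf(x, mu, 20.0), watson_logpdf(-x, mu, 20.0))
```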

Next, each vector \(\mathbf {x}_i\) is assigned a latent indicator variable \(z_i\), which indicates the component from which \(\mathbf {x}_i\) is drawn. For the data set \(\mathcal {X}\), the distribution of the indicator variables \(\mathbf {z}=\{z_i\}_{i=1}^N\) can be written as

$$\begin{aligned} p(\mathbf {z}|\mathbf {\pi }) = \prod _{i=1}^N\prod _{k=1}^\infty \pi _k ^{\mathbf {1}[z_i=k]} \end{aligned}$$
(4)

where \(\mathbf {1}[\cdot ]\) denotes the indicator function, which equals 1 when \(z_i = k\) and 0 otherwise.

2.2 Prior Distributions

The InWMM is constructed within a Bayesian framework, in which each unknown variable is assigned a prior distribution. A nonparametric prior, namely the Dirichlet process [10], is placed on the mixing coefficients \(\mathbf {\pi }\) and is defined in terms of a stick-breaking representation [3] as

$$\begin{aligned} \pi _k = \pi _k'\prod _{s =1}^{k-1}(1-\pi _s'), \quad \pi _k' \sim \mathrm {Beta}(1,\varpi _k), \quad G = \sum _{k = 1}^{\infty }\pi _k\delta _{\theta _k}, \quad \theta _k \sim H \end{aligned}$$
(5)

where G is a draw from the Dirichlet process \(G \sim DP(\varpi ,H)\) with base distribution H and scaling parameter \(\varpi \), and \(\delta _{\theta _k}\) is an atom at \(\theta _k\).
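As an illustration of the stick-breaking construction in Eq. (5), the short sketch below (an assumption on our part, not the authors' code) draws a truncated set of mixing coefficients from \(\mathrm {Beta}(1,\varpi )\) sticks; the truncation step corresponds to Eq. (11) introduced later.

```python
import numpy as np

def stick_breaking(varpi, K, seed=0):
    """Draw K truncated weights: pi_k = pi_k' * prod_{s<k}(1 - pi_s')."""
    rng = np.random.default_rng(seed)
    sticks = rng.beta(1.0, varpi, size=K)        # pi_k' ~ Beta(1, varpi), Eq. (5)
    sticks[-1] = 1.0                             # truncation pi_K' = 1, cf. Eq. (11)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    return sticks * remaining

weights = stick_breaking(varpi=1.0, K=10)
print(weights.sum())                              # sums to 1 under the truncation
```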

Following [7, 18], a Watson-Gamma prior is selected for parameters \(\mathbf {\mu }\) and \(\mathbf {\gamma }\) as

$$\begin{aligned} p(\mathbf {\mu },\mathbf {\gamma })= \prod _{k=1}^\infty \mathcal {W}(\mathbf {\mu }_k|\mathbf {m}_k,\beta _k\gamma _k)\mathcal {G}(\gamma _k|a_k,b_k) \end{aligned}$$
(6)

where \(\mathcal {G}(\cdot )\) indicates the Gamma distribution.

2.3 Collapsed Infinite Watson Mixture Models

According to several recent works in the mixture modeling literature [5, 6], better performance is often obtained when model learning is conducted in a collapsed space where some or all of the parameters are marginalized out. In our case, inspired by [5, 6, 11], we reformulate a collapsed version of the InWMM (i.e. the Co-InWMM) by marginalizing out the mixing coefficients \(\mathbf {\pi }\). Consequently, the latent variables \(\mathbf {z}\) no longer depend on the mixing coefficients \(\mathbf {\pi }\) and are distributed as

$$\begin{aligned} p(\mathbf {z}) = \prod _{k=1}^\infty \frac{\varGamma (1+n_k)\varGamma (\varpi _k+n_{>k})}{\varGamma (1+\varpi _k+n_{\ge k})} \end{aligned}$$
(7)

where \(n_k= \sum _{i=1}^N\mathbf {1}[z_i=k]\) indicates the number of data instances from the kth component, \(n_{>k}= \sum _{i=1}^N\mathbf {1}[z_i>k]\), and \(n_{\ge k}=n_k+n_{>k}\).

The conditional distribution of \(z_i=k\) given the current state of all the other indicator variables is

$$\begin{aligned} p(z_i=k|\mathbf {z}^{\lnot i})\propto (1+n_k^{\lnot i})(\varpi _k+n_{>k}^{\lnot i})(1+\varpi _k+n_{\ge k}^{\lnot i})^{-1} \end{aligned}$$
(8)

where the superscript \(\lnot i\) indicates that the ith term is removed from the associated count.
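The count statistics and the right-hand side of Eq. (8) are straightforward to compute from a hard assignment vector, as in the following hedged sketch (our own illustration; the variational updates in Sect. 3 use expected counts in place of these hard counts).

```python
import numpy as np

def collapsed_prior_term(z, i, k, varpi):
    """Unnormalized value of Eq. (8) for z_i = k, with 0-indexed components."""
    z_rest = np.delete(z, i)                 # remove the i-th assignment
    n_k = np.sum(z_rest == k)                # n_k^{not i}
    n_gt = np.sum(z_rest > k)                # n_{>k}^{not i}
    n_ge = n_k + n_gt                        # n_{>=k}^{not i}
    return (1.0 + n_k) * (varpi + n_gt) / (1.0 + varpi + n_ge)

z = np.array([0, 0, 1, 2, 1, 0])
print([collapsed_prior_term(z, i=3, k=k, varpi=1.0) for k in range(4)])
```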

The joint distribution of all latent and random variables in the Co-InWMM is given by

$$\begin{aligned} p(\mathcal {X},\mathbf {z},\mathbf {\mu },\mathbf {\gamma })=\prod _{i=1}^Np(\mathbf {x}_i|\mathbf {\mu }_{z_i},\gamma _{z_i})p(z_i) \prod _{k=1}^\infty p(\mathbf {\mu }_k,\gamma _k) \end{aligned}$$
(9)

In contrast with the InWMM described in Eq. (1), the Co-InWMM has two major advantages: 1) the explicit dependency between the latent variables \(\mathbf {z}\) and the mixing coefficients \(\mathbf {\pi }\) is broken, which favors the mean-field variational Bayes learning method developed in the following section; 2) integrating out \(\mathbf {\pi }\) yields a smaller number of parameters, which leads to a faster inference process with better performance.

3 Model Learning

In this section, building on the VB inference methods proposed in [18] and [7] for learning the finite WMM and the InWMM, respectively, we develop an effective method based on collapsed variational Bayes (CVB) [11, 20] to learn the proposed Co-InWMM with closed-form solutions.

3.1 Mean-Field Collapsed Variational Inference

VB inference is an effective method for approximating posterior densities in Bayesian models. In our case, VB is adopted to approximate the true posterior \(p(\varTheta |\mathcal {X})\) with an approximate posterior \(q(\varTheta )\) (also referred to as the variational posterior), where \(\varTheta = \{\mathbf {z},\mathbf {\mu },\mathbf {\gamma }\}\) denotes the set of all latent and random variables of the Co-InWMM. VB inference solves the approximation problem through optimization, by minimizing the Kullback-Leibler (KL) divergence between \(q(\varTheta )\) and \(p(\varTheta |\mathcal {X})\), which is equivalent to maximizing the lower bound of \(\ln p(\mathcal {X})\) defined by

$$\begin{aligned} \mathcal {L}(q) = \int q(\varTheta )\ln [ p(\mathcal {X},\varTheta )/q(\varTheta )] d\varTheta \end{aligned}$$
(10)

To perform VB inference for the Co-InWMM, which contains an infinite number of mixture components, a common technique is to truncate its stick-breaking representation at a finite value K as

$$\begin{aligned} \pi _K' = 1, \quad \sum _{k=1}^K\pi _k = 1, \quad \pi _k = 0 \;\;\text {when}\;\; k>K \end{aligned}$$
(11)

where K can be freely initialized; the effective number of components is then inferred automatically through VB inference.

To obtain closed-form solutions, the mean-field assumption [2] is often adopted in VB inference to factorize the variational posterior as a product of independent factors, where each factor represents the variational posterior of the corresponding variable. In [7], the variational posterior of the truncated InWMM was factorized as

$$\begin{aligned} q(\varTheta ) = q(\mathbf {\pi })q(\mathbf {z})q(\mathbf {\mu },\mathbf {\gamma }) \end{aligned}$$
(12)

This factorization, however, clearly violates the fact that the latent variables \(\mathbf {z}\) and the mixing coefficients \(\mathbf {\pi }\) are closely related, with strong dependency as demonstrated in Eq. (4). The mean-field assumption is better satisfied in the Co-InWMM, where \(\mathbf {\pi }\) is marginalized out, giving

$$\begin{aligned} q(\varTheta ) = \prod _{i=1}^N\bigg [q(z_i)\bigg ]\prod _{k=1}^K\bigg [q(\mathbf {\mu }_k,\gamma _k)\bigg ] \end{aligned}$$
(13)

Then, we can obtain the following update equations by maximizing the lower bound \(\mathcal {L}(q)\) with respect to each variational posterior

$$\begin{aligned} q(\mathbf {z})=\prod _{i=1}^N\prod _{k=1}^Kr_{ik}^{\mathbf {1}[z_i=k]} \end{aligned}$$
(14)
$$\begin{aligned} q(\mathbf {\mu },\mathbf {\gamma })=\prod _{k=1}^K\mathcal {W}(\mathbf {\mu }_k|\mathbf {m}_k^*,\beta _k^*\gamma _k)\mathcal {G}(\gamma _k|a_k^*,b_k^*) \end{aligned}$$
(15)

where the hyperparameters in the above variational posteriors are calculated by

$$\begin{aligned} r_{ik} = \frac{\widetilde{r}_{ik}}{\sum _{s=1}^K\widetilde{r}_{is}}, \end{aligned}$$
(16)
$$\begin{aligned} \widetilde{r}_{ik} =&\ln \varGamma (\frac{D}{2})-\frac{D}{2}\ln 2\pi +\frac{D}{2}\langle \ln \gamma _k\rangle -\ln [\bar{\gamma }_k^{\frac{D}{2}} M(\frac{1}{2},\frac{D}{2},\bar{\gamma }_k)] \nonumber \\&-\frac{\partial }{\partial \bar{\gamma }_k}\bigg [\ln \bar{\gamma }_k^{\frac{D}{2}}M(\frac{1}{2},\frac{D}{2},\bar{\gamma }_k)\bigg ](\langle \gamma _k\rangle -\bar{\gamma }_k)\nonumber \\&+\bar{\gamma }_k\vartheta (\beta _k^*\bar{\gamma }_k)+ \big \{\bar{\gamma }_k[\vartheta (\beta _k^*\bar{\gamma }_k)+\beta _k^*\bar{\gamma }_k\vartheta '(\beta _k^*\bar{\gamma }_k)]\nonumber \\&\times (\langle \ln \gamma _k\rangle +\ln \beta _k^*-\ln \beta _k^*\bar{\gamma }_k )\big \}(\mathbf {m}_k^{*T}\mathbf {X}_i)^2\nonumber \\&+\langle \ln (1+n_k^{\lnot i})\rangle -\langle \ln (1+\varpi _k+n_{\ge k}^{\lnot i})\rangle \nonumber \\&+\sum _{j<k}\big [\langle \ln (\varpi _j+n_{>j}^{\lnot i})\rangle -\langle \ln (1+\varpi _j+n_{\ge j}^{\lnot i})\rangle \big ] \end{aligned}$$
(17)
$$\begin{aligned} a_k^*= a_k + \frac{D}{2}(1+\sum _{i=1}^N\langle z_{i=k}\rangle )+\beta _k^*\bar{\gamma }_k\frac{\partial }{\partial \beta _k^*\bar{\gamma }_k}\ln M\big (\frac{1}{2},\frac{D}{2},\beta _k^*\bar{\gamma }_k\big ) \end{aligned}$$
(18)
$$\begin{aligned} b_k^*=&b_k + \sum _{i=1}^N\langle z_{i=k}\rangle \frac{\partial }{\partial \bar{\gamma }_k}\bigg [\ln \bar{\gamma }_k^{\frac{D}{2}}M(\frac{1}{2},\frac{D}{2},\bar{\gamma }_k)\bigg ] \nonumber \\&+\beta _k\frac{\partial }{\partial \beta _k\bar{\gamma }_k}\bigg [\ln (\beta _k\bar{\gamma }_k)^{\frac{D}{2}}M(\frac{1}{2},\frac{D}{2},\beta _k\bar{\gamma }_k)\bigg ] \end{aligned}$$
(19)
$$\begin{aligned} A = \beta _k\mathbf {m}_k\mathbf {m}_k^T+\sum _{i=1}^N\langle z_{i=k}\rangle \mathbf {x}_i\mathbf {x}_i^T \end{aligned}$$
(20)

where \(\vartheta (x) = \frac{\partial }{\partial x}\ln M\big (\frac{1}{2},\frac{D}{2},x\big )\), \(\beta _k^*\) is the largest eigenvalue of A, and \(\mathbf {m}_k^*\) is the corresponding eigenvector. The expected values in the above equations are given by

$$\begin{aligned} \langle z_{i=k} \rangle = r_{ik}, \qquad \bar{\gamma }_k=a_k^*/b_k^*, \qquad \langle \ln \gamma _k\rangle = \psi (a_k^*)-\ln b_k^*\end{aligned}$$
(21)
$$\begin{aligned} \langle \ln (1+n_k^{\lnot i})\rangle \approx \ln (1+\langle n_k^{\lnot i}\rangle ), \end{aligned}$$
(22)
$$\begin{aligned} \langle \ln (\varpi _k+n_{>k}^{\lnot i})\rangle \approx \ln (\varpi _k+\langle n_{>k}^{\lnot i}\rangle ) \end{aligned}$$
(23)
$$\begin{aligned} \langle \ln (1+\varpi _k+n_{\ge k}^{\lnot i})\rangle \approx \ln (1+\varpi _k+\langle n_{\ge k}^{\lnot i}\rangle ) \end{aligned}$$
(24)
$$\begin{aligned} \langle n_k^{\lnot i}\rangle =\sum _{i'\ne i}r_{i'k}, \qquad \langle n_{>k}^{\lnot i}\rangle =\sum _{i'\ne i}\sum _{s=k+1}^Kr_{i's}, \qquad \langle n_{\ge k}^{\lnot i}\rangle =\langle n_k^{\lnot i}\rangle + \langle n_{>k}^{\lnot i}\rangle \end{aligned}$$
(25)

where the expected values of \(\ln (1+n_k^{\lnot i})\), \(\ln (\varpi _k+n_{>k}^{\lnot i})\), and \(\ln (1+\varpi _k+n_{\ge k}^{\lnot i})\) were obtained via Gaussian approximations [20] with a 0th-order Taylor expansion [15]. Our CVB inference method for learning the Co-InWMM is analogous to the maximum likelihood expectation maximization (EM) algorithm and is summarized in Algorithm 1.

Algorithm 1. CVB inference for learning the Co-InWMM.
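To make the structure of Algorithm 1 concrete, the following simplified sketch implements one CVB pass under several assumptions of our own: the Watson term of Eq. (17) is replaced by the point-estimate log-density at \(\bar{\gamma }_k\) and \(\mathbf {m}_k^*\) (dropping the Taylor correction terms), the collapsed prior terms use the approximations of Eqs. (22)-(25), and the prior mean \(\mathbf {m}_k\) in Eq. (20) is taken equal to the current \(\mathbf {m}_k^*\). It is a sketch of the update structure, not the authors' implementation; the updates of \(a_k^*\) and \(b_k^*\) in Eqs. (18)-(19) are omitted for brevity.

```python
import numpy as np
from scipy.special import gammaln, hyp1f1

def watson_loglik(X, m, gamma_bar):
    # Point-estimate Watson log-density (cf. Eq. (2)) used in place of the full
    # expected log-likelihood term of Eq. (17).
    D = X.shape[1]
    log_norm = (gammaln(D / 2.0) - np.log(2.0) - (D / 2.0) * np.log(np.pi)
                - np.log(hyp1f1(0.5, D / 2.0, gamma_bar)))
    return log_norm + gamma_bar * (X @ m) ** 2            # shape (N,)

def cvb_iteration(X, R, M, gamma_bar, varpi=1.0, beta=1.0):
    """One simplified CVB pass: update responsibilities R (N x K), then the
    mean directions M (K x D) from the scatter matrix of Eq. (20)."""
    N, D = X.shape
    K = R.shape[1]
    log_r = np.empty((N, K))
    for k in range(K):
        # Expected counts with the i-th point removed, Eq. (25).
        n_k = R[:, k].sum() - R[:, k]
        n_gt = R[:, k + 1:].sum() - R[:, k + 1:].sum(axis=1)
        n_ge = n_k + n_gt
        # Collapsed prior terms of Eq. (17) with the approximations (22)-(24).
        prior = np.log(1.0 + n_k) - np.log(1.0 + varpi + n_ge)
        for j in range(k):
            n_gtj = R[:, j + 1:].sum() - R[:, j + 1:].sum(axis=1)
            n_gej = (R[:, j].sum() - R[:, j]) + n_gtj
            prior += np.log(varpi + n_gtj) - np.log(1.0 + varpi + n_gej)
        log_r[:, k] = watson_loglik(X, M[k], gamma_bar[k]) + prior
    R = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    R /= R.sum(axis=1, keepdims=True)                      # Eq. (16)
    for k in range(K):
        # Eq. (20): m_k^* is the top eigenvector of A (prior mean approximated
        # by the current M[k] in this sketch).
        A = beta * np.outer(M[k], M[k]) + (X * R[:, [k]]).T @ X
        _, vecs = np.linalg.eigh(A)
        M[k] = vecs[:, -1]
    return R, M
```

Repeating `cvb_iteration` until the responsibilities stabilize mirrors the EM-like loop of Algorithm 1.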

4 Experimental Results

The proposed Co-InWMM with CVB inference is evaluated through two experiments involving simulated data and an application to depth image analysis. In our experiments, the truncation level K is initialized to 10, \(\varpi _k\) and \(\beta _k\) are set to 1, and \(a_{k}\) and \(b_{k}\) are initialized to 1 and 0.01, respectively. These initial values were found through cross validation.

4.1 Synthetic Data

The principal purpose of the experiments on synthetic axial data is to validate the correctness of the proposed CVB inference algorithm for learning the Co-InWMM. This is done by examining the discrepancy between the estimated parameter values and their true values. A synthetic data set was generated for these experiments; it contains 900 3-dimensional data instances drawn from 3 Watson distributions (as shown in Fig. 1).

Fig. 1. The synthetic data set.

The true parameters used to generate the data set and the parameters estimated by the CVB inference method are shown in Table 1. According to this table, the proposed learning algorithm is able to effectively learn the Co-InWMM, with estimated parameter values that are very close to the true ones.

4.2 Depth Image Analysis

In this experiment, we apply the proposed Co-InWMM to a challenging application, namely depth image analysis. We use the NYU-V2 depth data set [16] to conduct our experiments. This data set includes 1449 RGB-D images collected from three different cities in the United States, covering 464 different indoor scenes across 26 scene classes in commercial buildings and residences. Following [8], we compute the surface normals of the depth images and then apply the Co-InWMM to cluster the normals. It is worth noting that the axially symmetric property of the WMM naturally overcomes the sign ambiguity of the normal vectors computed by plane fitting.
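The sketch below is our own illustration of this preprocessing step (the exact pipeline of [8] may differ): depth is back-projected to 3D with assumed camera intrinsics `fx, fy, cx, cy`, and the normal at each pixel is taken as the smallest-eigenvalue eigenvector of the local covariance of neighboring points. The resulting normals are defined only up to sign, which is precisely the axial property handled by the Watson distribution.

```python
import numpy as np

def depth_to_normals(depth, fx, fy, cx, cy, win=5):
    """Estimate sign-ambiguous unit normals from a depth map by local plane fitting."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    X = (u - cx) * depth / fx                  # back-projection with assumed intrinsics
    Y = (v - cy) * depth / fy
    P = np.stack([X, Y, depth], axis=-1)       # H x W x 3 point cloud
    normals = np.zeros_like(P)
    r = win // 2
    for i in range(r, H - r):
        for j in range(r, W - r):
            patch = P[i - r:i + r + 1, j - r:j + r + 1].reshape(-1, 3)
            C = np.cov(patch.T)                # local covariance of the neighborhood
            _, vecs = np.linalg.eigh(C)
            n = vecs[:, 0]                     # smallest-eigenvalue direction = plane normal
            normals[i, j] = n / np.linalg.norm(n)   # unit normal; +n and -n are equivalent
    return normals
```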

Table 1. Parameter estimation for the synthetic data set.

Figure 2 shows the number of clusters estimated for the whole NYU-V2 depth data set by the finite WMM with the Integrated Completed Likelihood (ICL) criterion [8] and by the proposed Co-InWMM. As we can see from the figure, most of the images contain 3-4 clusters. It is noteworthy that the WMM method in [8] has to evaluate the ICL criterion for different numbers of clusters in order to determine the optimal one. In contrast, our model detects the number of clusters automatically in a single run.

Fig. 2. Estimated number of clusters for the NYU-V2 depth data set.

Fig. 3. Clustering example from the NYU-V2 depth data set: (a) RGB image; (b) depth image; (c) normals; (d) results by Co-InWMM.

Fig. 4. Clustering results on the NYU-V2 depth data set: (a) RGB images; (b) depth images; (c) normals; (d) results by WMM; (e) results by In-WMM; (f) results by Co-InWMM.

Figure 3 shows an example of depth image analysis. From the results we observe that different clusters correspond to different image regions, each associated with a planar segment of the scene along a specific axis. Further results are shown in Fig. 4. From these results we can see that some clusters correspond to nonplanar objects (see case-7 and case-9 of Fig. 4), which means that our method can find nonplanar objects. In case-3 and case-5 there is considerable noise in the normal vectors, yet our method can still identify planar and nonplanar objects well. In addition, similar to [8], we also find that data with lower prior probability tend to be divided into fewer clusters. A reasonable solution to this problem is to highlight each cluster by preprocessing the normal vectors in order to make the clustering more accurate.

Table 2. Results obtained by different methods in terms of MI and computational runtime (in min.)

To show the superiority of our model, we compare it with K-means, the finite vMFMM [19], the finite WMM [8], and the In-WMM proposed in [7] in terms of clustering performance on the normals and computational runtime. Note that the first three algorithms use the ICL criterion to determine the optimal number of clusters. We use mutual information (MI) to evaluate the clustering performance. The results are shown in Table 2. Based on these results, it is clear that the Co-InWMM provides the best clustering performance, achieving the highest MI value. Moreover, the Co-InWMM is more computationally efficient than the other tested methods, with the shortest runtime. This result demonstrates the advantage of constructing the nonparametric infinite WMM in a collapsed space, where the mixing coefficients are integrated out, leaving a smaller number of parameters to be estimated.
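As a side note on the evaluation metric, the mutual information between a predicted clustering and a reference labeling can be computed, for example, with scikit-learn; the label arrays below are hypothetical placeholders, not data from our experiments.

```python
from sklearn.metrics import mutual_info_score

# Hypothetical label arrays, for illustration only.
labels_ref = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]
print(mutual_info_score(labels_ref, labels_pred))
```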

5 Conclusion

In this paper, we proposed a collapsed infinite Watson mixture model for modeling axial data, in which the mixing coefficients are integrated out. We developed an effective collapsed variational Bayes inference method to learn the proposed model with closed-form solutions. The effectiveness of the proposed Co-InWMM with CVB inference for modeling axial data was verified through experiments on both synthetic data sets and a challenging application in depth image analysis.