
1 Introduction

Statistical shape models (SSMs) from point sets, proposed by Cootes et al. [2], are powerful tools in medical imaging for encoding the natural variability of anatomical structures. Traditionally, an SSM is constructed by selecting points on training surfaces and establishing point-to-point correspondences between them. By consistently concatenating the points of each training data set, shapes are represented as high-dimensional vectors and assumed to be sampled from a Gaussian distribution; under this hypothesis the major modes of variation are then extracted by PCA. In reality, however, the training data can have a multi-modal distribution and represent several classes of shapes. As a result, no particular class is fully represented by the mean model and the constructed SSM often does not generalize well. To alleviate this problem, Zhang et al. [15] proposed a sparse non-parametric shape description, and Cootes et al. [3] used a Gaussian Mixture Model (GMM) to represent the pdf of the training sets: the shape space is first partitioned and PCA is then applied within each cluster. However, clustering of point sets and estimation of the variation modes are likely to benefit from being performed jointly. In addition, this approach requires point-to-point correspondences and an a priori, user-selected number of components, which is difficult to choose.

Establishing point-to-point correspondences across training point sets is a major challenge that undermines the practicality of SSMs. Manually specifying correspondences over landmarks is an ambiguous and subjective task. Automatic 3D techniques based on image registration [5] or on minimizing the description length [4] show varying performance, in particular for complex structures such as the heart. EM-ICP based methods [7, 11] offer more flexibility by computing probabilistic matchings between points and have been shown to be robust to matching errors. Recently, Hufnagel et al. [8] proposed a generative model for estimating modes of variation directly from point sets with no correspondences, without resorting to PCA. This method, however, still assumes a unimodal Gaussian distribution.

We present a hierarchical clustering scheme to estimate pdfs of unstructured, rigidly aligned point sets having no point-to-point correspondences. Points in each set are regarded as samples from a low-dimensional GMM, whose means are concatenated to form a higher-dimensional vector. This vector is considered to be a sample drawn from a Mixture of Probabilistic Principal Component Analyzers (MPPCA) [13]. The latter is essentially a higher-dimensional GMM, where the covariance matrix of each cluster can be decomposed into a subspace of local principal components. An inference algorithm based on variational Bayes (VB) [1] is proposed for unsupervised learning of class labels and variations.

Thanks to this hierarchical structure, the proposed method estimates probabilistic point matchings across the training data sets and handles mixtures of different shape classes. Another important advantage of the proposed VB approach is that the number of clusters is automatically learned from the data. It is noteworthy that, in machine learning, VB has been successfully applied to infer mixtures of subspace analyzers [6] from training vectors of equal length. Adapting the framework to point sets, however, as order-less random variables having different cardinalities (point counts), is a challenging problem. In the rest of this paper, we first present our generative model, then derive an efficient inference algorithm, and finally compare the method to the standard PCA model using cardiac data with different pathologies.

Fig. 1. The graphical representation of the proposed model; shaded and hollow circles represent observed and latent variables, respectively, arrows indicate dependencies, and plates enclose repeated instances.

2 Methods

2.1 Probabilistic Generative Model

Our observation consists of K point sets, denoted as \(\mathcal {X}_k=\{\mathrm {\mathbf {x}}_{kn}\}^{N_k}_{n=1}\), \(1\le k \le K\), where \(\mathrm {\mathbf {x}}_{kn}\) is a D dimensional feature vector corresponding to the nth landmark in the kth point set. The model can be explained as two interacting layers of mixture models. In the first (lower-dimensional) layer, \(\mathcal {X}_k\) is assumed to be a collection of D-dimensional samples from a GMM with M Gaussian components. Meanwhile, by concatenating the means of the GMM (in a consistent order), a vector representation for \(\mathcal {X}_k\) is derived in \(M\cdot D\) dimensions. Clustering and linear component analysis of \(\mathcal {X}_k\) take place in this space.

More specifically, we consider a mixture of J probabilistic principal component analyzers (MPPCA). A PPCA is essentially an \(M\cdot D\)-dimensional Gaussian specified by a mean vector, \(\bar{\varvec{\mu }}_{j}\in \mathcal {R}^{MD}\), \(1\le j \le J\), and a covariance matrix having a subspace component of the form \(\mathrm {\mathbf {W}}_{j}\mathrm {\mathbf {W}}_{j}^T\) [13]. Here, \({\mathrm {\mathbf {W}}_{j}}\) is an \({MD\times L}\) matrix, whose column l, i.e. \(\mathrm {\mathbf {W}}^{(l)}_{j}\), represents one mode of variation of cluster j. Let \(\mathrm {\mathbf {v}}_{k}\) be an L dimensional vector of loading coefficients corresponding to \(\mathcal {X}_k\), and let us define \(\varvec{\mu }_{jk}=\mathrm {\mathbf {W}}_{j}\mathrm {\mathbf {v}}_{k}+\bar{\varvec{\mu }}_{j}\). These vectors can be thought of as variables that bridge the two layers of our model: in the higher dimension, \(\varvec{\mu }_{jk}\) is a re-sampled representation of \(\mathcal {X}_k\) in the space spanned by the principal components of the jth cluster; meanwhile, if we partition \(\varvec{\mu }_{jk}\) into a sequence of M consecutive sub-vectors, each denoted as \(\varvec{\mu }^{(m)}_{jk}\), we obtain the means of the D dimensional Gaussians of the corresponding GMM.

Let \(\mathcal {Z}_k=\{\mathrm {\mathbf {z}}_{kn}\}^{N_k}_{n=1}\) be a set of \(N_k\) 1-of-M coded latent membership vectors for the points in \(\mathcal {X}_k\). Each \(\mathrm {\mathbf {z}}_{kn}\in \{0,1\}^M\) is a binary vector whose mth component being one (\(z_{knm}=1\)) indicates that \(\mathrm {\mathbf {x}}_{kn}\) is a sample from the D dimensional Gaussian m. The precision (inverse variance) of these Gaussians is globally denoted by \(\beta \mathbf {I}_{D\times D}\). Similarly, let \(\mathrm {\mathbf {t}}_{k}\in \{0,1\}^J\) be a latent, 1-of-J coded vector whose component j being one (\(t_{kj}=1\)) indicates the membership of \(\mathcal {X}_k\) in cluster j. The conditional pdf of \(\mathrm {\mathbf {x}}_{kn}\) is then given by:

$$\begin{aligned} p(\mathrm {\mathbf {x}}_{kn}|\mathrm {\mathbf {z}}_{kn},\mathrm {\mathbf {t}}_{k},\beta ,\mathrm {\mathbb {W}},\mathrm {\mathbf {v}}_{k})=\prod ^J_{j=1}\prod ^M_{m=1}\Big (\mathcal {N}(\mathrm {\mathbf {x}}_{kn}|\varvec{\mu }^{(m)}_{jk},\beta ^{-1} \varvec{I}_{D\times D})^{z_{knm}}\Big )^{t_{kj}} \end{aligned}$$
(1)

where \(\mathbb {W}=\{\mathrm {\mathbf {W}}_{j}\}^J_{j=1}\) is the set of principal component matrices. To facilitate our derivations, we introduce the following prior distributions over \(\mathrm {\mathbf {W}}_{j}\), \(\mathrm {\mathbf {v}}_{k}\), and \(\beta \), which are conjugate to the normal distribution in Eq. (1):

$$\begin{aligned} \!\!p(\mathrm {\mathbf {W}}_{j})\!\!=\!\!\prod ^{L}_{l=1} \mathcal {N}(\mathrm {\mathbf {W}}^{(l)}_{j}|\varvec{0},\alpha ^{-1}_{jl} \varvec{I}),\quad p(\mathrm {\mathbf {v}}_{k})\!\!=\!\!\mathcal {N}(\mathrm {\mathbf {v}}_{k}|\varvec{0},\varvec{I}),\quad p(\beta )\!\!=\!\!\varGamma (\beta |a_0,b_0) \end{aligned}$$
(2)

The hyper-parameters of the Gamma distribution in Eq. (2) are set to \(a_0=10^{-3}\) and \(b_0=1\) to obtain a flat prior over \(\beta \). Next, we denote the mixture weights of the GMMs and of the MPPCA by the vectors \(\varvec{\pi }^{\mathrm {\mathbf {z}}}\) and \(\varvec{\pi }^{\mathrm {\mathbf {t}}}\), respectively, each having a Dirichlet prior: \(p(\varvec{\pi }^{\mathrm {\mathbf {z}}}) = \text {Dir}(\varvec{\pi }^{\mathrm {\mathbf {z}}}|\lambda ^{\mathrm {\mathbf {z}}}_{0})\), \(p(\varvec{\pi }^{\mathrm {\mathbf {t}}}) = \text {Dir}(\varvec{\pi }^{\mathrm {\mathbf {t}}}|\lambda ^{\mathrm {\mathbf {t}}}_{0})\), where we set \(\lambda ^{\mathrm {\mathbf {z}}}_{0}=\lambda ^{\mathrm {\mathbf {t}}}_{0}=10^{-3}\). The conditional distributions of the membership vectors \(\mathrm {\mathbf {z}}_{kn}\) (for points) and \(\mathrm {\mathbf {t}}_{k}\) (for point sets) given the mixing weights are specified by two multinomial distributions: \(p(\mathrm {\mathbf {z}}_{kn}|\varvec{\pi }^{\mathrm {\mathbf {z}}}) = \prod ^{M}_{m=1}{(\pi ^{\mathrm {\mathbf {z}}}_{m})}^{z_{knm}}\) and \(p(\mathrm {\mathbf {t}}_{k}|\varvec{\pi }^{\mathrm {\mathbf {t}}}) = \prod ^{J}_{j=1}{(\pi ^{\mathrm {\mathbf {t}}}_{j})}^{t_{kj}}\), where \(0\le \pi ^{\mathrm {\mathbf {z}}}_{m}\) and \(0\le \pi ^{\mathrm {\mathbf {t}}}_{j}\) are the components m and j of \(\varvec{\pi }^{\mathrm {\mathbf {z}}}\) and \(\varvec{\pi }^{\mathrm {\mathbf {t}}}\), respectively. We now construct the joint pdf of all random variables, assuming (conditional) independence and multiplying the pdfs where needed. Let \(\mathbb {X}=\{\mathcal {X}_k\}^K_{k=1}\), \(\mathbb {Z}=\{\mathcal {Z}_k\}^K_{k=1}\), \(\mathbb {V}=\{\mathrm {\mathbf {v}}_{k}\}^K_{k=1}\), and \(\mathbb {T}=\{\mathrm {\mathbf {t}}_{k}\}^K_{k=1}\); then the distributions of these variables can be written as:

$$\begin{aligned}&p(\mathrm {\mathbb {W}})\!\!=\!\!\prod _j\! p(\mathrm {\mathbf {W}}_{j}|\varvec{\alpha }_j),\quad p(\mathbb {Z}|\varvec{\pi }^{\mathrm {\mathbf {z}}})\!\!=\!\!\prod _k\!p(\mathcal {Z}_k|\varvec{\pi }^{\mathrm {\mathbf {z}}})\!\!=\!\!\prod _k\!\prod _n\! p(\mathrm {\mathbf {z}}_{kn}|\varvec{\pi }^{\mathrm {\mathbf {z}}}),\,\, p(\mathbb {T}|\varvec{\pi }^{\mathrm {\mathbf {t}}})\!\!=\!\!\prod _k\! p(\mathrm {\mathbf {t}}_{k}|\varvec{\pi }^{\mathrm {\mathbf {t}}}) \nonumber \\&p(\mathbb {V})=\prod _k p(\mathrm {\mathbf {v}}_{k}),\quad p(\mathbb {X}|\mathbb {Z},\mathbb {T},\mathrm {\mathbb {W}},\mathbb {V},\beta )=\prod _k p(\mathcal {X}_k|\mathcal {Z}_k,\mathrm {\mathbf {t}}_{k},\beta ,\mathbb {\mathrm {\mathbb {W}}},\mathrm {\mathbf {v}}_{k}), \end{aligned}$$
(3)
$$\begin{aligned}&p(\mathcal {X}_k|\mathcal {Z}_k,\mathrm {\mathbf {t}}_{k},\beta ,\mathbb {\mathrm {\mathbb {W}}},\mathrm {\mathbf {v}}_{k})=\prod ^{N_k}_{n=1} p(\mathrm {\mathbf {x}}_{kn}|\mathrm {\mathbf {z}}_{kn},\mathrm {\mathbf {t}}_{k},\beta ,\mathrm {\mathbb {W}},\mathrm {\mathbf {v}}_{k}) \end{aligned}$$
(4)

Having defined the required distributions through Eqs. (1)-(4), the joint distribution over all variables is given as

$$\begin{aligned} \!\!\!\!\!\!p(\mathbb {X}\!,\!\mathbb {Z},\!\mathbb {T},\!\!\mathrm {\mathbb {W}},\!\mathbb {V},\!\varvec{\pi }^{\mathrm {\mathbf {t}}}\!,\varvec{\pi }^{\mathrm {\mathbf {z}}}\!,\beta \!)= & {} p(\mathbb {X}|\mathbb {Z},\!\mathbb {T},\!\mathrm {\mathbb {W}},\!\mathbb {V}\!,\!\beta )p(\mathbb {Z}|\varvec{\pi }^{\mathrm {\mathbf {z}}}\!)p(\mathbb {T}|\varvec{\pi }^{\mathrm {\mathbf {t}}}\!)p(\!\varvec{\pi }^{\mathrm {\mathbf {z}}}\!)p(\!\varvec{\pi }^{\mathrm {\mathbf {t}}}\!)p(\!\mathrm {\mathbb {W}}\!)p(\!\mathbb {V}\!)p(\!\beta ) \end{aligned}$$
(5)

Figure 1 is a graphical representation of the generative model considered in this paper. Given the observations (shaded circles) as D dimensional points, our problem is to estimate the posterior distributions of all latent random variables (hollow circles) and hyper-parameters, which include the discrete cluster memberships and the continuous variables (e.g. means and modes of variation).
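To make the generative process concrete, the following minimal numpy sketch draws synthetic point sets from the model by ancestral sampling (the same mechanism used for the synthetic experiments in Sect. 3.1). All names and the fixed hyper-parameter values are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def sample_point_sets(K, J, M, D, L, N_k=100, alpha=1.0, beta=100.0, seed=0):
    """Ancestral sampling from the hierarchical model of Sec. 2.1 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # cluster-level parameters: mean vectors and principal-component matrices
    mu_bar = rng.normal(size=(J, M * D))                     # \bar{mu}_j
    W = rng.normal(scale=alpha ** -0.5, size=(J, M * D, L))  # columns W_j^(l) ~ N(0, alpha^-1 I)
    pi_t = rng.dirichlet(np.ones(J))                         # mixture weights over clusters
    pi_z = rng.dirichlet(np.ones(M))                         # mixture weights over GMM components

    point_sets, labels = [], []
    for k in range(K):
        j = rng.choice(J, p=pi_t)                            # t_k: cluster membership
        v_k = rng.normal(size=L)                             # v_k ~ N(0, I)
        mu_jk = (W[j] @ v_k + mu_bar[j]).reshape(M, D)       # GMM means of set k
        z = rng.choice(M, size=N_k, p=pi_z)                  # z_kn: component memberships
        X_k = mu_jk[z] + rng.normal(scale=beta ** -0.5, size=(N_k, D))
        point_sets.append(X_k)
        labels.append(j)
    return point_sets, labels
```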

2.2 Approximated Inference

If we denote the set of latent variables as \(\varvec{\theta }=\{\mathbb {Z},\mathbb {T},\mathrm {\mathbb {W}},\mathbb {V},\varvec{\pi }^{\mathrm {\mathbf {t}}},\varvec{\pi }^{\mathrm {\mathbf {z}}},\beta \}\), direct inference of \(p(\varvec{\theta }|\mathbb {X})\) (our objective) is analytically intractable; thus an approximate distribution, \(q(\varvec{\theta })\), is sought. Owing to the dimensionality of the data, we prefer Variational Bayes (VB) over sampling-based methods. The VB principle for obtaining \(q(\varvec{\theta })\) is briefly as follows. The log model evidence, \(\ln p(\mathbb {X})\), can be decomposed as \(\ln p(\mathbb {X})=\mathcal {L}+\text {KL}( q(\varvec{\theta })||p(\varvec{\theta }|\mathbb {X}))\), where \(\text {KL}(\cdot ||\cdot )\ge 0\) denotes the Kullback-Leibler divergence, and

$$\begin{aligned} \mathcal {L}=\int q(\varvec{\theta })\ln {\frac{p(\mathbb {X},\varvec{\theta })}{q(\varvec{\theta })}}d\varvec{\theta } \le \ln p(\mathbb {X}) \end{aligned}$$
(6)

is a lower bound on \(\ln p(\mathbb {X})\). To obtain \(q(\varvec{\theta })\), the \(\text {KL}\) divergence between the approximate and the true posterior should be minimized. This cannot be done directly because the true posterior is not accessible; however, since \(\ln p(\mathbb {X})\) does not depend on \(q\), minimizing the KL divergence is equivalent to maximizing \(\mathcal {L}\). We approximate the true posterior in a factorized form, i.e., \(q(\varvec{\theta })=\prod _i q(\theta _i)\), where \(\theta _i\) refers to any of our latent variables. This factorization leads to the following tractable result: let \(\varepsilon \) be the variable of interest in \(\varvec{\theta }\), and \(\xi =\varvec{\theta }\setminus \varepsilon \); then the variational posterior of \(\varepsilon \) is given by \(\ln {q(\varepsilon )}=\langle \ln {p(\mathbb {X},\varvec{\theta })}\rangle _\xi +\text {const}\), where \(p(\mathbb {X},\varvec{\theta })\) is given in Eq. (5) and \(\langle \cdot \rangle _\xi \) denotes the expectation w.r.t. the product of the \(q(\cdot )\) factors of all variables in \(\xi \).

2.3 Update of Posteriors and Hyper-Parameters

In this section, we provide the equations to update the variational posteriors. Thanks to the conjugacy of the priors to the likelihoods, these derivations are done by inspecting expectations of logarithms and matching posteriors to their corresponding likelihood template forms. Detailed proofs of the derivations are omitted for brevity. Starting with the \(\mathbb {Z}\) variables, we have \(q(\mathbb {Z})=\prod _{k}q(\mathcal {Z}_k)=\prod _{k,m,n} (r_{knm})^{z_{knm}}\). Under this parameterization, \(\langle z_{knm}\rangle =r_{knm}\), where the right-hand side can be computed using the following relationships:

$$\begin{aligned} \!\!\!\!r_{knm}=\frac{\rho _{knm}}{\sum _{m'}\rho _{knm'}},\, \ln {\rho _{knm}} = -\frac{\langle \beta \rangle }{2}\sum _{j}\langle t_{kj}\rangle \langle |\mathrm {\mathbf {x}}_{kn}-\varvec{\mu }^{(m)}_{jk}|^2\rangle \!+\!\langle \ln {\pi ^{\mathrm {\mathbf {z}}}_{m}}\rangle \end{aligned}$$
(7)

The first term can be directly computed using the expectations of \(\mathrm {\mathbb {W}}\) and \(\mathbb {V}\) as follows: \(\langle |\mathrm {\mathbf {x}}_{kn}-\varvec{\mu }^{(m)}_{jk}|^2\rangle =|\mathrm {\mathbf {x}}_{kn}-\langle \varvec{\mu }^{(m)}_{jk} \rangle |^2\!\!+\!\!\text {Tr}[\text {Cov}[\varvec{\mu }_{jk}]^{(m,m)}]\), where the superscripts \((\cdot )\) and \((\cdot ,\cdot )\) denote the mth D-dimensional block of the vector \(\langle \varvec{\mu }_{jk}\rangle =\langle \mathrm {\mathbf {W}}_{j}\rangle \langle \mathrm {\mathbf {v}}_{k}\rangle +\bar{\varvec{\mu }}_{j}\) and the (m,m)th \(D\times D\) block of the matrix defined by:

$$\begin{aligned} \text {Cov}[\varvec{\mu }_{jk}]=\langle \mathrm {\mathbf {W}}_{j}\rangle \langle \mathrm {\mathbf {v}}_{k}\mathrm {\mathbf {v}}_{k}^T\rangle \langle \mathrm {\mathbf {W}}_{j}\rangle ^T\!\!+\!\!\sum _l\langle \mathrm {\mathbf {v}}_{k}\mathrm {\mathbf {v}}_{k}^T\rangle _{ll}\text {Cov}[\mathrm {\mathbf {W}}^{(l)}_{j}] \end{aligned}$$
(8)
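For illustration, Eqs. (7)-(8) can be implemented with a few lines of numpy, assuming the required expectations (block means and covariance traces) have already been computed; the function and argument names below are illustrative.

```python
import numpy as np

def update_point_responsibilities(X_k, mu_mean, mu_cov_tr, t_k, beta_mean, ln_pi_z):
    """Eqs. (7)-(8): responsibilities r_knm of the points in one set (sketch).
    X_k:       (N_k, D)  points of set k
    mu_mean:   (J, M, D) expected block means <mu_jk^(m)>
    mu_cov_tr: (J, M)    Tr[Cov[mu_jk]^(m,m)]
    t_k:       (J,)      <t_kj>
    beta_mean: scalar    <beta>
    ln_pi_z:   (M,)      <ln pi^z_m>
    """
    # expected squared distances <|x_kn - mu_jk^(m)|^2>, shape (J, N_k, M)
    diff = X_k[None, :, None, :] - mu_mean[:, None, :, :]
    sq = (diff ** 2).sum(axis=-1) + mu_cov_tr[:, None, :]
    # ln rho_knm: distances weighted by the cluster responsibilities <t_kj>
    ln_rho = -0.5 * beta_mean * np.einsum('j,jnm->nm', t_k, sq) + ln_pi_z[None, :]
    # normalize over m (numerically stable softmax)
    ln_rho -= ln_rho.max(axis=1, keepdims=True)
    r = np.exp(ln_rho)
    return r / r.sum(axis=1, keepdims=True)
```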

To simplify the rest of our notation, we introduce the following auxiliary variables:

$$\begin{aligned} \mathrm {\mathbf {R}}_{k}=\text {Diag}(\underbrace{R_{k1}\cdots R_{k1}}_{D\,\text {copies}},\cdots , \underbrace{R_{kM}\cdots R_{kM}}_{D\,\text {copies}}),\, \mathrm {\mathbf {\bar{x}}}_k=[\mathrm {\mathbf {\bar{x}}}_{k1}^T,\cdots ,\mathrm {\mathbf {\bar{x}}}_{kM}^T]^T \end{aligned}$$
(9)

where \(R_{km}=\sum _n r_{knm}\) and \(\mathrm {\mathbf {\bar{x}}}_{km}=\sum _n r_{knm} \mathrm {\mathbf {x}}_{kn}\). Under these definitions, the posterior of \(\mathbb {T}\) is given by \(q(\mathbb {T})=\prod _{k}q(\mathrm {\mathbf {t}}_{k})=\prod _{k,j}(r'_{kj})^{t_{kj}}\), where, in analogy to Eq. (7), we have \(\langle t_{kj}\rangle =r'_{kj}\) and

$$\begin{aligned} r'_{kj}=\frac{\rho '_{kj}}{\varSigma _{j'}\rho '_{kj'}},\quad \ln {\rho '_{kj}}=\langle \beta \rangle \! \text {Tr}[-\frac{1}{2}\mathrm {\mathbf {R}}_{k}\langle \varvec{\mu }_{jk}\varvec{\mu }_{jk}^T\rangle +\varvec{\mu }_{jk}\mathrm {\mathbf {\bar{x}}}_k^T]+\! \langle \ln \pi ^{\mathrm {\mathbf {t}}}_{j}\rangle \end{aligned}$$
(10)
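A matching sketch for Eqs. (9)-(10) is given below. Since \(\mathrm {\mathbf {R}}_{k}\) is diagonal, the trace terms reduce to elementwise sums over diagonals; the input layout and names are illustrative assumptions.

```python
import numpy as np

def sufficient_stats(X_k, r_k):
    """Eq. (9): diag(R_k) and the stacked weighted means x_bar_k for one point set."""
    D = X_k.shape[1]
    R_km = r_k.sum(axis=0)                       # (M,)  R_km = sum_n r_knm
    xbar = r_k.T @ X_k                           # (M, D), row m is x_bar_km
    return np.repeat(R_km, D), xbar.reshape(-1)  # both of length M*D

def update_set_responsibilities(R_diag, xbar_k, mu_mean, mu_cov, beta_mean, ln_pi_t):
    """Eq. (10): responsibilities r'_kj of point set k over the J clusters (sketch).
    mu_mean: (J, M*D) = <mu_jk>;  mu_cov: (J, M*D, M*D) = Cov[mu_jk]."""
    J = mu_mean.shape[0]
    ln_rho = np.empty(J)
    for j in range(J):
        # <mu_jk mu_jk^T> = <mu><mu>^T + Cov[mu]; Tr[R_k A] = sum(diag(R_k) * diag(A))
        second_moment_diag = mu_mean[j] ** 2 + np.diag(mu_cov[j])
        ln_rho[j] = beta_mean * (-0.5 * np.sum(R_diag * second_moment_diag)
                                 + xbar_k @ mu_mean[j]) + ln_pi_t[j]
    ln_rho -= ln_rho.max()
    r = np.exp(ln_rho)
    return r / r.sum()
```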

The posterior of the principal components is given by

$$\begin{aligned} \!\!\!\!\!\!q(\mathrm {\mathbb {W}})=\prod _{j,l}q(\mathrm {\mathbf {W}}^{(l)}_{j}),\quad q(\mathrm {\mathbf {W}}^{(l)}_{j})=\mathcal {N}(\mathrm {\mathbf {W}}^{(l)}_{j}|\langle \mathrm {\mathbf {W}}^{(l)}_{j}\rangle ,\text {Cov}[\mathrm {\mathbf {W}}^{(l)}_{j}]) \end{aligned}$$
(11)

where the means and covariance matrices are specified as:

$$\begin{aligned}&\text {Cov}[\mathrm {\mathbf {W}}^{(l)}_{j}] = [\alpha _{jl}\varvec{I}+ \langle \beta \rangle \sum _k \langle t_{kj}\rangle \langle \mathrm {\mathbf {v}}_{k}\mathrm {\mathbf {v}}_{k}^T\rangle _{ll}\mathrm {\mathbf {R}}_{k}]^{-1} \nonumber \\&\langle \mathrm {\mathbf {W}}^{(l)}_{j}\rangle = \langle \beta \rangle \text {Cov}[\mathrm {\mathbf {W}}^{(l)}_{j}](\sum _k \langle t_{kj}\rangle \mathrm {\mathbf {Q}_{kj}}^{(l)}) \end{aligned}$$
(12)

Here, the auxiliary matrix \(\mathrm {\mathbf {Q}_{kj}}\) is defined as

$$\begin{aligned} \!\!\!\!\!\!\mathrm {\mathbf {Q}_{kj}}= (\mathrm {\mathbf {\bar{x}}}_k-\mathrm {\mathbf {R}}_{k}\bar{\varvec{\mu }}_{j})\langle \mathrm {\mathbf {v}}_{k}\rangle ^T - \mathrm {\mathbf {R}}_{k}\langle \mathrm {\mathbf {W}}_{j}\rangle \Big [\langle \mathrm {\mathbf {v}}_{k}\mathrm {\mathbf {v}}_{k}^T\rangle -\text {Diag}(\text {diag}\langle \mathrm {\mathbf {v}}_{k}\mathrm {\mathbf {v}}_{k}^T\rangle )\Big ] \end{aligned}$$
(13)

where the inner \(\text {diag}\) operator copies the main diagonal of \(\langle \mathrm {\mathbf {v}}_{k}\mathrm {\mathbf {v}}_{k}^T\rangle \) into a vector, and the outer \(\text {Diag}\) transforms the vector back into a diagonal matrix. The posterior of \(\mathrm {\mathbf {v}}_{k}\) vectors is given by

$$\begin{aligned}&q(\mathbb {V})\!\!=\!\!\prod _kq(\mathrm {\mathbf {v}}_{k})\!\!=\!\!\mathcal {N}(\mathrm {\mathbf {v}}_{k}|\langle \mathrm {\mathbf {v}}_{k}\rangle ,\text {Cov}[\mathrm {\mathbf {v}}_{k}]),\, \text {Cov}[\mathrm {\mathbf {v}}_{k}]\!=\!\Big [\mathrm {\varvec{I}}+\langle \beta \rangle \!\!\sum _j \langle t_{kj}\rangle \langle \!\mathrm {\mathbf {W}}_{j}^{\!T}\mathrm {\mathbf {R}}_{k}\!\mathrm {\mathbf {W}}_{j}\rangle \Big ]^{ -1}\nonumber \\&\langle \mathrm {\mathbf {W}}_{j}^T\!\mathrm {\mathbf {R}}_{k}\!\mathrm {\mathbf {W}}_{j}\rangle \!=\!\langle \mathrm {\mathbf {W}}_{j}\rangle ^{\!T}\mathrm {\mathbf {R}}_{k}\langle \mathrm {\mathbf {W}}_{j}\rangle +\text {Diag}\Big (\!\text {Tr}[\mathrm {\mathbf {R}}_{k}\text {Cov}[\mathrm {\mathbf {W}}_{j}^{(1)}]],\cdots ,\!\!\text {Tr}[\mathrm {\mathbf {R}}_{k}\text {Cov}[\mathrm {\mathbf {W}}_{j}^{(L)}]]\Big )\nonumber \\&\langle \mathrm {\mathbf {v}}_{k}\rangle =\langle \beta \rangle \text {Cov}[\mathrm {\mathbf {v}}_{k}]\sum _j\langle t_{kj}\rangle \langle \mathrm {\mathbf {W}}_{j}\rangle ^T(\mathrm {\mathbf {\bar{x}}}_k-\mathrm {\mathbf {R}}_{k}\bar{\varvec{\mu }}_{j}) \end{aligned}$$
(14)
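A possible numpy sketch of the \(q(\mathrm {\mathbf {v}}_{k})\) update in Eq. (14) follows; it exploits the fact that \(\mathrm {\mathbf {R}}_{k}\) is diagonal, so \(\text {Tr}[\mathrm {\mathbf {R}}_{k}\text {Cov}[\mathrm {\mathbf {W}}_{j}^{(l)}]]\) only needs the diagonals of the column covariances. Names and input layout are illustrative.

```python
import numpy as np

def update_v(t_k, R_diag, xbar_k, W_mean, W_cov_diag, mu_bar, beta_mean):
    """Eq. (14): posterior mean and covariance of v_k (sketch).
    t_k (J,), R_diag (M*D,), xbar_k (M*D,), W_mean (J, M*D, L),
    W_cov_diag (J, L, M*D) holding diag(Cov[W_j^(l)]), mu_bar (J, M*D)."""
    J, MD, L = W_mean.shape
    A = np.eye(L)                # accumulates Cov[v_k]^-1
    b = np.zeros(L)
    for j in range(J):
        # <W_j^T R_k W_j> = <W_j>^T R_k <W_j> + Diag(Tr[R_k Cov[W_j^(l)]])
        WtRW = W_mean[j].T @ (R_diag[:, None] * W_mean[j]) \
             + np.diag((W_cov_diag[j] * R_diag).sum(axis=1))
        A += beta_mean * t_k[j] * WtRW
        b += t_k[j] * (W_mean[j].T @ (xbar_k - R_diag * mu_bar[j]))
    cov_v = np.linalg.inv(A)
    v_mean = beta_mean * cov_v @ b
    return v_mean, cov_v
```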

The posterior of the precision \(\beta \) is a Gamma distribution specified by:

$$\begin{aligned} \!\!\!\!\!\!\!\!\!\!q(\beta )=\varGamma (\beta |a,b),\,\, a = a_0\!\!+\!\!\frac{DN}{2},\,\, b = b_0\!\!+\!\!\frac{1}{2}\!\!\sum _{k,n,m,j}\!\!\langle z_{knm}\rangle \langle t_{kj}\rangle \langle |\mathrm {\mathbf {x}}_{kn}-\varvec{\mu }^{(m)}_{jk}|^2\rangle \end{aligned}$$
(15)

Under these definitions, with \(N=\sum _k N_k\) the total number of points, we have \(\langle \beta \rangle =a/b\) and \(\langle \ln \beta \rangle = \psi (a)-\ln (b)\), where \(\psi \) is the digamma function. Finally, the posteriors of the mixing coefficients are Dirichlet distributions:

$$\begin{aligned} \!\!\!\!\!\!\!\!\!\!q(\varvec{\pi }^{\mathrm {\mathbf {t}}})\!\!=\!\!\text {Dir}(\!\varvec{\pi }^{\mathrm {\mathbf {t}}}|\varvec{\lambda }^{\mathrm {\mathbf {t}}}\!), \lambda ^{\mathrm {\mathbf {t}}}_{j}\!\!=\!\!\lambda ^{\mathrm {\mathbf {t}}}_{0}\!\!+\!\!\sum _k\langle t_{kj}\rangle ,\, q(\varvec{\pi }^{\mathrm {\mathbf {z}}})\!\!=\!\!\text {Dir}(\!\varvec{\pi }^{\mathrm {\mathbf {z}}}|\varvec{\lambda }^{\mathrm {\mathbf {z}}}\!), \lambda ^{\mathrm {\mathbf {z}}}_{m}\!\!=\!\!\lambda ^{\mathrm {\mathbf {z}}}_{0}+\sum _{k,n}\langle z_{knm}\rangle \end{aligned}$$
(16)

Using Eq. (16), the expectations related to the mixing coefficients are computed as \(\langle \pi ^{\mathrm {\mathbf {z}}}_{m}\rangle =\lambda ^{\mathrm {\mathbf {z}}}_{m}/\sum _{m'}\lambda ^{\mathrm {\mathbf {z}}}_{m'}\) and \(\langle \ln {\pi ^{\mathrm {\mathbf {t}}}_{j}}\rangle =\psi {(\lambda ^{\mathrm {\mathbf {t}}}_{j})}-\psi {(\sum _{j'}\lambda ^{\mathrm {\mathbf {t}}}_{j'})}\). Finally, by maximizing Eq. (6) with respect to \(\bar{\varvec{\mu }}_{j}\) and \(\alpha _{jl}\), we obtain:

$$\begin{aligned}&\bar{\varvec{\mu }}_{j}=\Big [\sum _k\langle t_{kj}\rangle \mathrm {\mathbf {R}}_{k}\Big ]^{-1}\Big [\sum _k\!\langle t_{kj}\rangle (\mathrm {\mathbf {\bar{x}}}_k-\mathrm {\mathbf {R}}_{k}\langle \mathrm {\mathbf {W}}_{j}\rangle \langle \mathrm {\mathbf {v}}_{k}\rangle )\Big ], \end{aligned}$$
(17)
$$\begin{aligned}&\alpha _{jl}=MD/\Big [|\langle \mathrm {\mathbf {W}}^{(l)}_{j}\rangle |^2 +\text {Tr}[\text {Cov}[\mathrm {\mathbf {W}}^{(l)}_{j}]]\Big ] \end{aligned}$$
(18)
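The closed-form updates of Eqs. (17)-(18) can be sketched as follows; because every \(\mathrm {\mathbf {R}}_{k}\) is diagonal, the matrix inverse in Eq. (17) reduces to an elementwise division. All inputs are assumed precomputed and the names are illustrative.

```python
import numpy as np

def update_mu_bar_and_alpha(t, R_diag, xbar, W_mean, W_cov_tr, v_mean):
    """Eqs. (17)-(18) (sketch).
    t (K, J) = <t_kj>, R_diag (K, M*D) = diag(R_k), xbar (K, M*D) = x_bar_k,
    W_mean (J, M*D, L) = <W_j>, W_cov_tr (J, L) = Tr[Cov[W_j^(l)]], v_mean (K, L) = <v_k>."""
    K, MD = xbar.shape
    J, _, L = W_mean.shape
    mu_bar = np.empty((J, MD))
    alpha = np.empty((J, L))
    for j in range(J):
        Wv = v_mean @ W_mean[j].T                      # rows are <W_j><v_k>, shape (K, M*D)
        denom = (t[:, j:j + 1] * R_diag).sum(axis=0)   # diagonal of sum_k <t_kj> R_k
        numer = (t[:, j:j + 1] * (xbar - R_diag * Wv)).sum(axis=0)
        mu_bar[j] = numer / denom                      # Eq. (17), elementwise since R_k is diagonal
        alpha[j] = MD / ((W_mean[j] ** 2).sum(axis=0) + W_cov_tr[j])   # Eq. (18)
    return mu_bar, alpha
```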

2.4 Predictive Distribution

For a new test point set \(\mathcal {X}_r=\{\mathrm {\mathbf {x}}_{rn}\}^{N_r}_{n=1}\), with \(r > K\), we can obtain a model-projected point set as \(\mathcal {\hat{X}}_r=\{\langle \hat{\mathrm {\mathbf {x}}}_{rn}\rangle \}^{N_r}_{n=1}\), where \(\langle \hat{\mathrm {\mathbf {x}}}_{rn}\rangle =\int \hat{\mathrm {\mathbf {x}}}_{rn}p(\hat{\mathrm {\mathbf {x}}}_{rn}|\mathcal {X}_r,\mathbb {X})d\hat{\mathrm {\mathbf {x}}}_{rn}\). Here, the predictive distribution is computed by marginalizing over the corresponding latent and model variables:

$$\begin{aligned} p(\hat{\mathrm {\mathbf {x}}}_{rn}|\mathcal {X}_r,\mathbb {X})\!\!=\!\!\sum _{\mathrm {\mathbf {z}}_{rn},\mathrm {\mathbf {t}}_{r}}\!\!\int p(\hat{\mathrm {\mathbf {x}}}_{rn}|\mathrm {\mathbf {z}}_{rn},\mathrm {\mathbf {t}}_{r},\beta ,\mathrm {\mathbb {W}},\mathrm {\mathbf {v}}_{r})p(\mathrm {\mathbf {z}}_{rn},\mathrm {\mathbf {t}}_{r},\beta ,\mathrm {\mathbb {W}},\mathrm {\mathbf {v}}_{r}|\mathcal {X}_r,\mathbb {X})d\mathrm {\mathbb {W}}d\mathrm {\mathbf {v}}_{r}d\beta \end{aligned}$$
(19)

Because this integral is analytically intractable, we approximate the posterior as \(p(\mathrm {\mathbf {z}}_{rn},\mathrm {\mathbf {t}}_{r},\beta ,\mathrm {\mathbb {W}},\mathrm {\mathbf {v}}_{r}|\mathcal {X}_r,\mathbb {X})\approx q(\mathrm {\mathbf {z}}_{rn})q(\mathrm {\mathbf {t}}_{r})q(\mathrm {\mathbf {v}}_{r})q(\beta )q(\mathrm {\mathbb {W}})\). Thus, given \(\mathcal {X}_r\), we iterate over the updates of \(q(\mathrm {\mathbf {z}}_{rn})\), \(q(\mathrm {\mathbf {t}}_{r})\), and \(q(\mathrm {\mathbf {v}}_{r})\), while keeping \(q(\beta )\) and \(q(\mathrm {\mathbb {W}})\) fixed from the training step.

2.5 Initialization and Computational Burden

To initialize the model, a GMM with M Gaussians is fit to the set of all points. Next, for each Gaussian component m, the point in \(\mathcal {X}_k\) with the maximum posterior probability of belonging to that component is identified as its correspondence. Iterating over the M Gaussian components, all corresponding points from point set k are identified and concatenated to form an MD dimensional vector. This procedure is repeated over the K training point sets and the obtained vectors are clustered using k-means. Next, by applying PCA within each cluster, we initialize the mean \(\bar{\varvec{\mu }}_{j}\), set \(\mathrm {\mathbf {W}}_{j}\) to the first L principal components, and set \(\mathrm {\mathbf {v}}_{k}\) to the projections of the original vectors onto these components. Finally, \(\beta \) is initialized from the component-wise average squared difference between the original and the PCA-projected vectors. In practice, we have observed that for a set of fifty point sets, each having 4000 points, sufficient convergence is achieved within 50 VB iterations in roughly an hour.
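A possible scikit-learn sketch of this initialization is given below; the exact choices (e.g. deriving \(\beta \) from the PCA residuals) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def initialize(point_sets, M, J, L):
    """Initialization of Sec. 2.5 (sketch); point_sets is a list of (N_k, D) arrays."""
    K = len(point_sets)
    gmm = GaussianMixture(n_components=M).fit(np.vstack(point_sets))
    # for each set, pick the point with maximum posterior under each GMM component
    vecs = []
    for X in point_sets:
        post = gmm.predict_proba(X)        # (N_k, M)
        idx = post.argmax(axis=0)          # best-matching point per component
        vecs.append(X[idx].reshape(-1))    # concatenate to an (M*D,) vector
    V = np.vstack(vecs)                    # (K, M*D)
    labels = KMeans(n_clusters=J).fit_predict(V)
    mu_bar, W = [], []
    v = np.zeros((K, L))
    residuals = []
    for j in range(J):
        Vj = V[labels == j]
        pca = PCA(n_components=L).fit(Vj)
        mu_bar.append(pca.mean_)
        W.append(pca.components_.T)        # (M*D, L), first L principal components
        proj = pca.transform(Vj)
        v[labels == j] = proj
        residuals.append(Vj - pca.inverse_transform(proj))
    # precision from the average squared residual (one of several reasonable choices)
    beta = 1.0 / np.mean(np.vstack(residuals) ** 2)
    return mu_bar, W, v, beta, labels
```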

Fig. 2. Clustering and mode estimation of synthetic point sets. (a) Overlay of 750 point sets, (b) corresponding color-separated ground-truth clusters, (c) estimated labels (colors) and GMM centroids showing local modes of variation, (d) lower bound \(\mathcal {L}\) on the model evidence versus the number of clusters, indicating \(J=3\) as the optimal number (Color figure online).

Fig. 3. Short axis MR images from normal (a), PH (b), and HCM (c) patients.

3 Results

We evaluate our method on both synthetic and real cardiac MRI data sets as follows. The reliability of the lower bound as a criterion to select the number of clusters of point sets is evaluated on both data types. We also measure generalization and specificity errors and compare them to a standard PCA based SSM on the real data sets. Generalization ability is the error between the actual and the model-projected point sets. Specificity relates to the ability of the model to instantiate correct samples resembling the training data. We randomly divide the available point sets into testing and training subsets and train the model using the latter. Next, we measure generalization and specificity using the test and model-generated point sets, as explained in [4]. To measure the distance between point sets, we consider \(d(\mathcal {X}_k,\hat{\mathcal {X}_k})=1/N_k \sum _\mathrm {\mathbf {x}}{{\mathrm{min}}}_{\mathrm {\mathbf {y}}\in \hat{\mathcal {X}_k}}||\mathrm {\mathbf {x}} - \mathrm {\mathbf {y}}||_2\), where \(N_k\) is the number of points in \(\mathcal {X}_k\) and \(\hat{\mathcal {X}_k}\) denotes the model-projected point set. Furthermore, since d is asymmetric, we also compute \(\hat{d}(\mathcal {X}_k,\hat{\mathcal {X}_k})=d(\hat{\mathcal {X}_k},\mathcal {X}_k)\) and report both.
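For reference, this asymmetric point-set distance can be computed with a few lines of numpy (the function name is illustrative):

```python
import numpy as np

def set_distance(X, Y):
    """d(X, Y): mean distance from each point of X to its nearest neighbour in Y.
    X: (N, D), Y: (M, D). Note d(X, Y) != d(Y, X) in general."""
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

# both directions are reported, e.g.
# d_gen     = set_distance(X_test, X_projected)
# d_gen_hat = set_distance(X_projected, X_test)
```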

Fig. 4. Quantitative results for models trained by clustering normal-HCM (top row) and normal-PH (bottom row) cases versus the number of clusters. (a-b) Generalization errors as distances between the model-projected and original test sets (red), and vice versa (blue); (c-d) specificity errors as distances between model-generated and training point sets (red), and vice versa (blue); (e-f) the model evidence (Color figure online).

Table 1. Generalization and Specificity errors (in mm) for PCA and the proposed method at \(J=2,6\). Significant differences between results of PCA and our method are indicated in bold (\(\text {p-value} < 0.001\)). Double lines separate \(d | \hat{d}\) distances computed from the models trained by clustering PH or HCM patients versus normals.

3.1 Synthetic Dataset

We investigate the problem of selecting a proper number of clusters, J, from the data by generating synthetic point sets using ancestral sampling [1]. Owing to the marginalization of the latent variables, we expect the model evidence, i.e. \(p(\mathbb {X}|J)\), to reach a maximum at the number of clusters used to generate the data. Three distinctive 2D point set patterns are generated as the cluster means. To aid visualization, we set \(L=1\), so that a single mode of variation per cluster is considered and sampled from \(p(\mathrm {\mathbf {W}}_{j})\). Next, a set of 250 point sets per cluster is generated by sampling from \(p(\mathrm {\mathbf {v}}_{k})\), applying the corresponding variations at each point (local Gaussians in Fig. 2), and adding measurement noise (with precision \(\beta =1\)). The number of points in each point set is around 100, but it varies over the population and thus no correspondence is assumed. For \(1\le J\le 5\), we repeated 15 rounds of experiments with variable patterns (one shown in Fig. 2) and recorded the mean and one standard deviation of \(\mathcal {L}\) (Fig. 2(d)). As shown, a maximum of the model evidence is correctly found at \(J=3\), and the color match between (b)-(c) indicates correct clustering of the point sets. In addition, the linear patterns of the GMM means, i.e. \(\varvec{\mu }_{jk}^{(m)}\), in (c) match the local structures of the points, showing that the variation modes are estimated reasonably well.

3.2 Cardiac Datasets

We apply the proposed approach to the analysis of cardiac data sets, which are known to display significant variability and geometrical complexity. We consider three groups of individuals: 36 normal cases, 20 subjects with Pulmonary Hypertension (PH), and 20 subjects with Hypertrophic Cardiomyopathy (HCM). Data acquisition was performed using a balanced Steady State Free Precession protocol on 1.5 T MR scanners of various brands, resulting in image matrices of \(256\times 256\) in the short axis direction and slice thicknesses of \(8-10\) mm.

These groups differ in the properties of their cardiac shapes. In PH patients, in whom pulmonary vascular proliferation occurs [12], both the left and right ventricles undergo complex shape remodeling (see Fig. 3). As a result, the RV becomes very dilated, pushing onto the LV, which deforms and loses its roundness [14]. On the other hand, HCM [9] is a condition in which the heart muscle shows excessive thickening; its most characteristic feature is a hypertrophied LV (asymmetric thickening prominently involving the ventricular septum) without abnormal enlargement of the ventricular cavities.

Fig. 5. Means and variation modes for normal (a), PH (b), and HCM (c) cases, with the mean models in the middle and variations in opposite directions on either side; (d-f) axial and coronal cross sections of the mean models for each population.

To derive the cardiac surfaces, the MRI slices were first labeled to obtain binary masks; a volumetric mesh was then generated from the binary images; finally, a surface mesh was extracted and its vertices were taken as a point set. Next, we rigidly registered the point sets, removing scaling, rotation, and translation effects. Because we compare our method to standard PCA, point-to-point correspondences between the training sets were needed; these were established using the projection method proposed in [10]. Our proposed method, however, ignores this correspondence information.

We cluster mixtures of normal-PH and normal-HCM cases in two sets of experiments. We randomly pick 33 cases (20 normals and 13 pathological), ignore the labels, and perform clustering. We then sample random point sets from the trained generative model and compute specificity. To quantify generalization on the test point sets, we use Eq. (19) to compute model-projected point sets for the test set. The number of clusters, J, is varied from 1 to a maximum of 20. At each setting, we compare both measures to the PCA model, which uses 13-15 modes to cover \(95\,\%\) of the full trace of the covariance matrix.
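For completeness, the number of retained PCA modes can be chosen as the smallest count whose eigenvalues cover \(95\,\%\) of the covariance trace, e.g. (an illustrative sketch):

```python
import numpy as np

def n_modes_for_variance(eigvals, frac=0.95):
    """Smallest number of leading PCA modes covering `frac` of the total trace."""
    ratios = np.cumsum(np.sort(eigvals)[::-1]) / np.sum(eigvals)
    return int(np.searchsorted(ratios, frac) + 1)
```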

Figure 4 shows the quantitative results obtained as described above. It can be seen that the generalization distances, i.e., d (blue) and \(\hat{d}\) (red), are minimal at \(J=6\). Compared to PCA, both generalization and specificity improve (see Table 1). Also, note that as the number of clusters increases and the variations within each cluster are eliminated, the specificity error improves. To understand this behavior, consider the extreme case where every training point set becomes a cluster of its own: all intra-cluster variations are eliminated and the trained model becomes strictly specific to the training data. It is noteworthy that the model evidence is maximized at \(J=2\) (see Fig. 4(e-f)), which is the expected number of clusters in each experiment. The discrepancy between this number of clusters and \(J=6\), where the generalization error is minimal, could be due to the approximations made in computing the predictive distribution. Nevertheless, as shown in Table 1, at both \(J=2\) and \(J=6\) the results are significantly improved over the PCA model.

Next, we evaluate the clustering performance by comparing the estimated labels of the point sets to the ground truth. For these experiments, setting \(L=2\) and \(J=2\), we independently apply the method to cluster all the available normal-PH and normal-HCM cases. In both experiments, only 2 out of 53 cases are incorrectly clustered as normal cases. The first mode of variation, both in 3D and in longitudinal cross sections, is visualized in Fig. 5. It can be seen that, in the normal heart (see (a) and (d)), the LV is significantly larger than the RV and, compared to PH and HCM, more spherical. On the other hand, in the PH heart ((b) and (e)), the RV is evidently dilated and the LV loses its roundness. Finally, significant thickening of the septum and shrinkage of the LV are noticeable in the HCM heart ((c) and (f)). These morphological differences with respect to the normal heart have been reported for both pathologies [9, 14].

4 Conclusion

We proposed a unified framework for joint clustering and component analysis of point sets. We modeled the pdf of the point sets as a mixture of probabilistic principal component analyzers, where the labels of the point sets and the variations of the clusters are inferred through a variational Bayesian framework. The method is flexible and handles point sets with no point-to-point correspondences. We showed that it can identify the number of clusters automatically and outperforms traditional PCA based SSMs in generalization and specificity. Application of the proposed framework to cardiac data sets shows successful clustering of normal and pathological cases, as well as extraction of their intra-cluster variations.