1 Introduction

Cluster analysis, or clustering, is one of several general approaches to the problem of unsupervised learning, that is, classification when no class labels are given. In practice, clustering is often based on heuristic ideas and intuitive measures, such as hierarchical clustering or k-means clustering, that do not assume a probability model. By contrast, this paper focuses on model-based clustering, in which data are viewed as coming from a mixture of probability distributions, each representing a cluster. Typically, the distributions are assumed to come from a parametric family such as normal (Fraley and Raftery 1998, 2002; Banfield and Raftery 1993), and group membership is learned from data by estimation algorithms that are often variations of the expectation-maximization method described by Dempster et al. (1977). Since the early work of Wolfe (1963) and others, the literature on model-based clustering has expanded enormously. Indeed, there are several book-length treatments of mixture models, such as Frühwirth-Schnatter (2006) and McLachlan and Peel (2000).

The advent of easily accessible computing power has given rise to semi- and non-parametric methods that avoid the standard assumption that the cluster densities come from a known parametric family, and applications and extensions of these methods are growing more common in the literature. A semiparametric model-based clustering analysis for DNA microarray data can be found in Han and Davis (2006). Azzalini and Torelli (2007) propose nonparametric density estimation using Delaunay triangulation for clustering via identification of subpopulations associated with high-density regions of the underlying probability distribution. Li et al. (2007) develop a clustering approach based on mode identification by applying new optimization techniques to a nonparametric density estimator. Vichi (2008) fits semiparametric clustering models to dissimilarity data. In Zhang et al. (2009), a semiparametric model is introduced to account for varying impacts of factors over clusters by using cluster-level covariates. Mallapragada et al. (2010) propose a non-parametric mixture model (NMM) for data clustering. Guglielmi et al. (2014) fit Bayesian semiparametric logit models to grouped data on in-hospital survival outcomes of patients hospitalized with a diagnosis of ST-segment elevation myocardial infarction. Certain mixtures of linear regressions also fall under the category of semiparametric model-based clustering. For instance, Hunter and Young (2012) present an algorithm for estimating parameters in a mixture-of-regressions model in which the errors are assumed to be independent and identically distributed but no other distributional assumption is made. Huang et al. (2013) propose nonparametric finite mixture-of-regression models for analysis of U.S. housing price index (HPI) data. Vandekerkhove (2013) studies estimation of a semiparametric mixture-of-regressions model with two components when one component is known. Bajari et al. (2011) view a game abstractly as a semiparametric mixture distribution and study the semiparametric efficiency bound of this model. Finally, Butucea and Vandekerkhove (2014) consider a semiparametric mixture of two distributions that are equal up to a shift parameter.

The current article combines recent advances in multivariate non-parametric finite mixture models under what we refer to as the conditional independence assumption with another method, independent component analysis (ICA), that addresses one of the main drawbacks of this assumption. To illustrate this drawback, let us first introduce the modeling framework: suppose that the r-dimensional vectors \(\mathbf{Y}_i = (Y_{i1}, Y_{i2}, \ldots , Y_{ir})^\top \), \(1\le i\le n\), are a simple random sample from a finite mixture of \(m>1\) components with positive mixing proportions \(\lambda _1, \lambda _2, \ldots , \lambda _m\) that sum to 1 and (Lebesgue measurable) density functions \(q_1, q_2, \ldots , q_m\), respectively. This gives the mixture density

$$\begin{aligned} d(\mathbf{y})=\sum \limits _{j=1}^{m}\lambda _jq_{j}(\mathbf{y}) \end{aligned}$$
(1)

for \(\mathbf{y}\in R^r\). The sole assumption imposed on the densities \(q_1, q_2, \ldots , q_m\) is that the coordinates of \(\mathbf{y}\) are independent given the component from which \(\mathbf{y}\) is sampled, so that Eq. (1) becomes

$$\begin{aligned} d(\mathbf{y}) =\sum \limits _{j=1}^{m}\lambda _j\prod \limits _{k=1}^{r}q_{jk}(y_k). \end{aligned}$$
(2)
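
To make the product structure of Eq. (2) concrete, the following small R sketch evaluates such a density at a point for a hypothetical two-component, three-coordinate mixture; the particular marginals and mixing proportions are illustrative choices only.

```r
## Illustrative evaluation of the conditionally independent mixture density (2):
## m = 2 components, r = 3 coordinates, with arbitrary (hypothetical) marginals.
lambda <- c(0.4, 0.6)
q <- list(
  list(function(y) dnorm(y, 0, 1), function(y) dexp(y, 1),   function(y) dunif(y, -1, 1)),
  list(function(y) dnorm(y, 2, 1), function(y) dgamma(y, 2), function(y) dunif(y, 0, 2))
)
dmix <- function(y) {                 # y is a vector of length r = 3
  sum(sapply(seq_along(lambda), function(j) {
    lambda[j] * prod(mapply(function(qjk, yk) qjk(yk), q[[j]], y))
  }))
}
dmix(c(0.5, 0.3, 0.1))                # the mixture density d(y) at one point
```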

The basic idea of conditional independence is outlined by Hall and Zhou (2003), and, notably, Allman et al. (2009) prove the generic identifiability of the parameters in Eq. (2) for \(r\ge 3\) under some weak assumptions. Chauveau et al. (2015) present a survey of the growing literature on the theory and algorithmic treatment of model (1) under the conditional independence assumption. The algorithms in this article have their roots in the EM-like algorithm of Benaglia et al. (2009), which is later modified by Levine et al. (2011). An alternative estimation method to EM-like algorithms is the method of moments approach described in Anandkumar et al. (2012). More recently, Bonhomme et al. (2016a, b) provide new estimation algorithms for this model and prove consistency and asymptotic normality of the estimators, an important statistical innovation missing from earlier work. Related work on mixtures of nonparametric hidden Markov models (HMMs) is presented by Gassiat and Rousseau (2016), Gassiat et al. (2016), and Castro et al. (2016); these authors describe links between their HMM work and the model of Eq. (2).

Although the conditional independence assumption of Eq. (2) is important theoretically due to its guarantee of identifiable parameters despite essentially no assumptions other than \(r\ge 3\), it is clearly not appropriate for some clustering problems. As a simple example, consider the well-known Fisher Iris data depicted in Fig. 1. In a model-based clustering scenario, the goal of estimation would be to learn the shapes of the three four-dimensional distributions, one for each species, without the benefit of the species labels. Under the conditional independence assumption of Eq. (2), no bivariate plot should exhibit correlation within any of the categories. Yet it is clear in Fig. 1a that nonzero within-species correlation exists, so any correct classifier of the three species would necessarily violate the conditional independence assumption.
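
This violation is easy to verify directly in R: the sketch below computes the within-species correlation matrices of the four iris measurements, whose off-diagonal entries would all be near zero under Eq. (2).

```r
## Within-species correlations for the iris data: under conditional
## independence (2), these off-diagonal entries would all be near zero,
## yet several pairs of measurements are substantially correlated.
data(iris)
by(iris[, 1:4], iris$Species, cor)
```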

Fig. 1

a At left, the four-dimensional iris dataset with \(n=150\) and three distinct species. b At right, the same dataset classified according to our algorithm into three groups, where each group is transformed linearly in an attempt to achieve conditional independence. The seven misclassified points are circled; the colors indicate the algorithm’s choices while the numbers are the correct species

To remedy this shortcoming, the algorithm we propose in this article combines existing work on nonparametric mixture models with the ideas of independent component analysis (ICA). The basic idea of ICA, as elucidated, for example, by Hyvarinen et al. (2002), is to find a linear transformation of a multivariate dataset under which its coordinates are as close to independent as possible.

We can see the result of applying our algorithm to the iris dataset in Fig. 1b, in which each mixture component is associated with its own linear transformation. Each point in this figure is assigned to the category for which its estimated posterior probability is highest, and we observe that 7 of the 150 points are incorrectly classified (recall that the mixture model parameter estimates are calculated without taking labels into account), but the correlation structure evident in the left-hand plots has been eliminated. In the remainder of this article, we describe the algorithm, discuss the potential issues of identifiability that arise, and illustrate the algorithm’s performance on three datasets, including one in which \(n=10{,}000\) and \(r=144\). The result of our work is an algorithm we call NSMM-ICA, which we implement in the icamix package for R (R Core Team 2015), available at http://cran.r-project.org/web/packages/icamix/index.html.
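
For readers who wish to reproduce an analysis of this kind, the sketch below shows how the iris data might be passed to the icamix package. The function name EMFASTICAALG, its numCluster argument, and the WeightMat element of the returned object are our assumptions about the package interface and should be checked against the package documentation.

```r
## Hypothetical usage sketch for the icamix package (function name and
## arguments assumed; see help(package = "icamix") for the exact API).
# install.packages("icamix")
library(icamix)
data(iris)
X   <- as.matrix(iris[, 1:4])              # n = 150, r = 4
fit <- EMFASTICAALG(X, numCluster = 3)     # assumed interface
## The fitted object should contain mixing proportions, per-component
## unmixing matrices, and posterior membership probabilities; hard labels
## then come from the highest posterior probability (Fig. 1b reports 7
## misclassified points):
# labels <- apply(fit$WeightMat, 1, which.max)   # element name assumed
# table(labels, iris$Species)
```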

The novelty of the current article is its combination of the non-parametric mixture structure with the linear transformations of ICA; previous work on model-based clustering using ICA has imposed parametric assumptions on the component density functions. Lee et al. (1999b) and Lee et al. (2000) propose parametric ICA mixture models with algorithms based on the infomax principle for various unsupervised classification problems. Shah et al. (2004) apply the ICA mixture model methodology to the unsupervised classification of hyperspectral or multispectral imagery, in which image data are captured at multiple frequencies or across a continuous range of frequencies in the electromagnetic spectrum; this is an important application in remote sensing and land cover classification. Palmer et al. (2008) derive an asymptotic Newton algorithm for quasi-maximum likelihood estimation of the parametric ICA mixture model and present its application to EEG segmentation. Finally, Salazar et al. (2010) extend the ICA mixture model methodology of Lee et al. (1999b) and others by positing that the component density functions have the form of kernel density estimates; that is, the components are themselves mixtures of parametric functions of a fixed and known form (the kernel itself), with one term per observation. Salazar et al. (2015) apply a similar idea in an agglomerative clustering framework; however, in our work we assume that the number of mixture components is fixed and known.

As mentioned above, our work drops the parametric assumption made throughout the previous ICA mixture model literature that we are aware of. Yet it does bear some similarities to the work described in, for example, Salazar et al. (2010, 2015), because we also employ kernel density estimation to approximate the unknown underlying univariate density functions described below. An additional similarity is that these papers minimize a Kullback–Leibler divergence, which has a form similar to our penalized and smoothed Kullback–Leibler divergence objective function. However, a crucial distinction is that we do not assume knowledge of any of the category labels a priori; our algorithm is designed to handle completely unsupervised multivariate clustering problems. In addition, there are methods other than ICA in the literature based on the same idea of exploring multivariate data to determine coordinate systems having some desirable property such as inter-coordinate independence. Invariant coordinate selection (Tyler et al. 2009; Miettinen et al. 2015), or ICS, is one such method, and Peña et al. (2010) in particular use ICS to search for cluster structure in the data based on the eigenvalues of a kurtosis matrix. That work does not assume an underlying non-parametric mixture structure as in the current article, yet in principle it would be possible to combine non-parametric mixtures with ICS instead of ICA.

2 The nonparametric ICA mixture model

Most previous work on model (2) assumes that we observe the random sample \(\mathbf{Y}_1, \ldots , \mathbf{Y}_n\). However, in the current article we generalize this previous work by adding the assumption that the observed data are \(\mathbf{X}_1, \ldots , \mathbf{X}_n\), where \(\mathbf{X}_i = {\mathsf {A}}_j \mathbf{Y}_i\) for some invertible \(r\times r\) matrix \({\mathsf {A}}_j\), conditional on \(\mathbf{Y}_i\) being generated from the jth component density \(q_j\). In other words, we introduce additional parameters \({\mathsf {A}}_1, \ldots , {\mathsf {A}}_m\), one for each mixture component, consisting of the matrices that linearly transform the latent \(\mathbf{Y}_i\) with independent coordinates into the observed \(\mathbf{X}_i\). When there is no mixture structure, this assumption is exactly the independent component analysis (ICA) framework as described by Hyvarinen et al. (2002). Note that the word “component” in “ICA” corresponds to “coordinate” or “dimension” in the terminology of this article; here, “component” refers to one of the mixture densities.

To aid notation, let us define \(q_{{\mathsf {A}}}\) for any nonnegative function q on \(R^r\) and invertible \(r\times r\) matrix \({\mathsf {A}}\) as

$$\begin{aligned} q_{{\mathsf {A}}}(\mathbf{x})=q({\mathsf {A}}^{-1}{} \mathbf{x})|\det {{\mathsf {A}}}|^{-1}, \end{aligned}$$

which is the density function of a linearly transformed random variable having density q after left-multiplication by \({\mathsf {A}}\).
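
A small R helper (our own notation, not part of the icamix package) that constructs \(q_{{\mathsf {A}}}\) from a density q and an invertible matrix may make the definition concrete:

```r
## Density of A %*% Y when Y has density q: q_A(x) = q(A^{-1} x) / |det A|.
transform_density <- function(q, A) {
  Ainv <- solve(A)
  function(x) q(as.vector(Ainv %*% x)) / abs(det(A))
}
## Example: bivariate density with independent standard normal coordinates.
q  <- function(y) prod(dnorm(y))
A  <- matrix(c(2, 1, 0, 1), 2, 2)
qA <- transform_density(q, A)
qA(c(1, 0.5))   # density of X = A %*% Y evaluated at one point
```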

Our ICA mixture model may thus be described formally as follows: We observe a random sample \(\mathbf{X}_1, \ldots , \mathbf{X}_n\) from the mixture density

$$\begin{aligned} g(\mathbf{x}) = \sum _{j=1}^m \lambda _j f_j(\mathbf{x}), \end{aligned}$$
(3)

where

$$\begin{aligned} f_j(\mathbf{x}) = ({q_j})_{{\mathsf {A}}_j}(\mathbf{x}) \end{aligned}$$
(4)

and

$$\begin{aligned} q_j(\mathbf{y})=\prod \limits _{k=1}^{r}q_{jk}({y}_k). \end{aligned}$$
(5)

For each observed \(\mathbf{X}_i\), we shall define the usual latent variables

$$\begin{aligned} Z_{ij} = I\{ \mathbf{X}_i \hbox { is drawn from the } j\hbox {th mixture component} \} \end{aligned}$$

and \(\mathbf{Y}_i = {\mathsf {A}}_j^{-1} \mathbf{X}_i\) for the unique j such that \(Z_{ij}=1\). For estimation purposes, we write

$$\begin{aligned} (e_j)_{{\mathsf {A}}_j} = \lambda _j f_j, \end{aligned}$$

so \(e_j(\mathbf{x}) = \lambda _j q_j(\mathbf{x})\). Since any constant multiple of \(\mathbf{Y}_i\) can be absorbed into the \({\mathsf {A}}_j\) matrices, we guard against non-identifiability by further assuming

$$\begin{aligned} \text{ Var }\, Y_{ik}=1 \end{aligned}$$
(6)

for all i and for \(1\le k\le r\). Finally, we assume that for each j, \(1\le j\le m\), at most one of the density functions \(q_{j1}, \ldots , q_{jr}\) is a normal density. The reason for this last assumption is that ICA operates on standardized versions of the data, and if a subset of the linearly transformed \(\mathbf{Y}\) coordinates is multivariate normal, its standardized version is always standard multivariate normal, so there is no way to uniquely identify an ICA transformation \({\mathsf {A}}_j\). Thus, the non-normality assumption, along with assumptions (4), (5), and (6), is standard in the literature on ICA (Hyvarinen et al. 2002). Since the \(q_{jk}\) are not assumed to follow any parametric form, we call the model

$$\begin{aligned} g(\mathbf{x}) = \sum _{j=1}^m \lambda _j |\det {\mathsf {A}}_j |^{-1} \prod \limits _{k=1}^{r} {q_{jk}}\left( [{\mathsf {A}}_j^{-1}{} \mathbf{x} ] _k \right) \end{aligned}$$
(7)

a nonparametric ICA mixture model.
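
For intuition about the data-generating process behind Eq. (7), the following R sketch simulates from a small instance of the model: a component j is drawn with probability \(\lambda _j\), a vector of independent, unit-variance, non-normal coordinates is generated in accordance with assumptions (5) and (6), and the component's matrix \({\mathsf {A}}_j\) is applied. All specific choices (the marginal distribution and the matrices) are illustrative.

```r
## Simulate n observations from a two-component, three-coordinate
## nonparametric ICA mixture (7).  The same centered, unit-variance,
## non-normal marginal is used for every coordinate for brevity.
set.seed(1)
n <- 500; m <- 2; r <- 3
lambda <- c(0.4, 0.6)
A <- list(matrix(c(1, 0.8, 0, 0, 1, 0.5, 0, 0, 1), r, r),      # A_1
          matrix(c(2, 0, 0, -1, 1, 0, 0, 0.5, 1), r, r))       # A_2
rmarg <- function(n) rexp(n) - 1            # mean 0, variance 1, non-normal
z <- sample(m, n, replace = TRUE, prob = lambda)   # latent component labels
Y <- matrix(rmarg(n * r), n, r)                    # independent coordinates
X <- t(sapply(seq_len(n), function(i) A[[z[i]]] %*% Y[i, ]))   # observed data
```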

It is not known whether the parameters in Eq. (7) are uniquely identifiable even in the case of perfect information about the form of \(g(\mathbf{x})\). Our empirical experience with the algorithm, of which we provide examples in Sect. 5, suggests that parameter estimation is well-behaved, yet this important theoretical question remains open. For example, general identifiability does not follow directly from the facts that the \(q_{jk}\) and \(\lambda _j\) parameters are uniquely determined by Eq. (2) and the \({\mathsf {A}}_j\) parameters are uniquely determined by Eq. (5) together with \(\mathbf{X}={\mathsf {A}}_j \mathbf{Y}\), even under the usual assumptions stated above. It may be possible to extend the methods of Allman et al. (2009) to prove identifiability, but for now we are in a situation analogous to the period just prior to the publication of that article, when estimation algorithms existed even though only special cases of identifiability had been addressed in the literature and no general identifiability result had yet been established. It is also important to realize that even when the parameters lie within the set of identifiable parameters, estimation may still be difficult if the true parameters happen to be near the boundary of that set. Here again we can point to our empirical experience with an example such as the iris dataset of Sect. 1. The iris species have often been modeled in the literature as multivariate normal distributions, suggesting that the data are generated from parameters quite close to non-identifiability; yet in practice our algorithm appears to find reasonable \({\mathsf {A}}_j\) estimates, as shown in Fig. 1b.

3 Parameter estimation

This section introduces an MM-like algorithm that seeks to estimate the parameters \(\mathbf{e}=(e_1,e_2,\ldots ,e_m)\) and \({\mathsf {A}}=({\mathsf {A}}_1,{\mathsf {A}}_2,\ldots ,{\mathsf {A}}_m)\) by minimizing a function that gives in some sense the distance between the empirical data distribution and the theoretical mixture distribution determined by the parameters.

We begin by defining some operators that will aid notation. Much of the development of this section follows the recent work of Levine et al. (2011) and Zhu and Hunter (2016); the novelty here is in the incorporation of the \({\mathsf {A}}_j\) matrices into the usual conditional independence framework, which requires some delicacy.

First, we define the linear smoothing, or convolution, operators \(S_h\) and \(S_h^*\) on \(L^1(R^r)\). Let \(s_h(\cdot ,\cdot )\in L^1(R\times R)\) be a nonnegative kernel function satisfying

$$\begin{aligned} \int s_h(v,z) \text { d}z=\int s_h(v,z) \text { d}v=1 \end{aligned}$$
(8)

for \(v,z \in R\). Here, \(h>0\) is a user-specified tuning parameter often referred to as a bandwidth in smoothing contexts. For any \(f\in L^1(R^r)\), define \(S_hf\) and \(S_h^*f\) by

$$\begin{aligned} (S_hf)(\mathbf{{x}})=\int \tilde{s}_h(\mathbf{{x}},\mathbf{{u}})f(\mathbf{{u}})\text { d}\mathbf{{u}} \quad \text {and}\quad (S_h^*f)(\mathbf{{x}})=\int \tilde{s}_h(\mathbf{{u}},\mathbf{{x}})f(\mathbf{{u}})\text { d}\mathbf{{u}}, \end{aligned}$$

where

$$\begin{aligned} \tilde{s}_h(\mathbf{{x}},\mathbf{{u}})=\prod \limits _{k=1}^{r}s_h(x_k,u_k) \qquad \text {for } \mathbf{{x}}, \mathbf{{u}} \in R^r. \end{aligned}$$
(9)

Furthermore, let

$$\begin{aligned} ({\mathcal {N}}_hf)(\mathbf{{x}})= \exp [(S_h^*\log {f})(\mathbf{{x}})]. \end{aligned}$$

Notice that \({\mathcal {N}}_h\) as an operator on \(L^1(R^r)\) is nonlinear, as it is the exponential of the linear operator \(S_h^*\) applied to the logarithm of a function. This nonlinear smoothing operator plays an important role in the algorithm.
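
A minimal univariate numerical illustration of the difference between \(S_h\) and \({\mathcal {N}}_h\), using a Gaussian kernel and a grid approximation of the integrals, is sketched below; with a symmetric kernel, \(S_h\) and \(S_h^*\) coincide.

```r
## Grid approximation of S_h f and N_h f = exp(S_h^* log f) for r = 1.
x  <- seq(-5, 5, length.out = 501)
dx <- x[2] - x[1]
f  <- 0.5 * dnorm(x, -1.5, 0.6) + 0.5 * dnorm(x, 1.5, 0.6)   # density to smooth
h  <- 0.4
S  <- outer(x, x, function(u, v) dnorm(u - v, sd = h))        # s_h(u, v), symmetric
Shf <- as.vector(S %*% f) * dx                                # linear smoothing
Nhf <- exp(as.vector(S %*% log(f)) * dx)                      # nonlinear smoothing
## Unlike S_h f, the nonlinearly smoothed N_h f need not integrate to one:
sum(Shf) * dx; sum(Nhf) * dx
```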

Finally, we reproduce the projection-multiplication operator of Zhu and Hunter (2016), defined as

$$\begin{aligned} (Pf)(\mathbf{{x}})=\displaystyle \frac{\left[ \prod \nolimits _{k=1}^{r} \displaystyle \int f(\mathbf{{x}})\text { d}x_1\text { d}x_2\cdots \text { d}x_{k-1}\text { d}x_{k+1} \cdots \text { d}x_r\right] }{\left[ \int {f}\right] ^{(r-1)}}. \end{aligned}$$
(10)

Zhu and Hunter (2016) point out that when f is a density on \(R^r\), the right side of (10) simplifies because the denominator is 1, and also that the P and \(S_h\) operators commute, i.e., \((P \circ S_h)f=(S_h \circ P)f\).
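
The following grid-based R sketch illustrates P for \(r=2\): it forms the product of the marginals of f, and when f is a density the denominator in (10) is 1.

```r
## Grid approximation of (P f)(x) for a bivariate density f on [-4, 4]^2.
x  <- seq(-4, 4, length.out = 201)
dx <- x[2] - x[1]
## A correlated bivariate normal density evaluated on the grid:
f  <- outer(x, x, function(u, v) dnorm(u) * dnorm(v - 0.7 * u, sd = sqrt(1 - 0.49)))
m1 <- rowSums(f) * dx                            # marginal in the first coordinate
m2 <- colSums(f) * dx                            # marginal in the second coordinate
Pf <- outer(m1, m2) / (sum(f) * dx^2)^(2 - 1)    # Eq. (10) with r = 2
## P f is the product of the marginals of f, so its coordinates are independent.
```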

Let us consider the hypothetical case of a known target density \(g(\mathbf{x})\), which we sometimes call the infinite sample size case. To estimate the parameters \(\mathbf{e}=(e_1,e_2,\ldots ,e_m)\) and \({\mathsf {A}}=({\mathsf {A}}_1,{\mathsf {A}}_2,\ldots ,{\mathsf {A}}_m)\), the idea is to minimize a measure of the distance between g and the mixture density determined by the parameters. Due to mathematical considerations explained in Levine et al. (2011), we wish to first apply the nonlinear smoother to the mixture density and then minimize the Kullback–Leibler distance between g and this nonlinearly smoothed density. We therefore propose in this hypothetical case to minimize

$$\begin{aligned} \int g(\mathbf{x})\log \left[ {g(\mathbf{x})}/{\sum \limits _{j=1}^{m}[{\mathcal {N}}_he_j]_{{\mathsf {A}}_j} (\mathbf{x})}\right] \text { d}\mathbf{x}+\int \left[ \sum \limits _{j=1}^{m}(e_j)_{{\mathsf {A}}_j}(\mathbf{x})\right] \text { d}\mathbf{x} \end{aligned}$$

with respect to \(\mathbf{e}\) and \({\mathsf {A}}\). To analyze an actual dataset, we would replace the g density by the empirical distribution of the data, which leads to the objective function

$$\begin{aligned} \ell (\mathbf{e},{\mathsf {A}})= - \sum _{i=1}^n \log \sum \limits _{j=1}^{m}[{\mathcal {N}}_he_j]_{{\mathsf {A}}_j} (\mathbf{x}_i) +\int \left[ \sum \limits _{j=1}^{m}(e_j)_{{\mathsf {A}}_j}(\mathbf{x})\right] \text { d}\mathbf{x} \end{aligned}$$
(11)

to be minimized with respect to \(\mathbf{e}\) and \({\mathsf {A}}\).

Two aspects of Eq. (11) are worth noticing. First, the second integral is part of a penalty term whose presence guarantees a convenient property of the functional parameters \(\mathbf{e}=(e_1,e_2,\ldots ,e_m)\) that minimize \(\ell (\mathbf{e},{\mathsf {A}})\) for a fixed \({\mathsf {A}}\), and this property is explained below in Eq. (12). Second, the definition of \(\ell (\mathbf{e},{\mathsf {A}})\) uses \(\sum _j[{\mathcal {N}}_he_j] _{{\mathsf {A}}_j}(\mathbf{x})\) instead of \(\sum _j{\mathcal {N}}_h[(e_j)_{{\mathsf {A}}_j}](\mathbf{x})\); that is, the nonlinear smoothing is applied before the linear transformation. The intuition is that after the transformation, the data are no longer conditionally independent and standardized, so the smoothing would affect each dimension very differently if it were applied after the transformation.

An advantage of using the \(\mathbf{e}\) parameters instead of \(\varvec{\lambda }\) and \(\mathbf{q}\) is that the latter parameterization requires the constraint that every \(q_j\) is a density function. With the \(\mathbf{e}\) parameters, such a constraint is unnecessary: As a straightforward corollary of Theorem 2.1 of Zhu and Hunter (2016), any minimizer \((\tilde{\mathbf{e}}, \tilde{{\mathsf {A}}})\) of (11) must satisfy

$$\begin{aligned} \int \left[ \sum \limits _{j=1}^{m}(\tilde{e}_j)_{\tilde{{\mathsf {A}}}_j}(\mathbf{x})\right] \text { d}\mathbf{x} =\int \sum _{j=1}^{m}\tilde{e}_j(\mathbf{x})\text { d}\mathbf{x} =1. \end{aligned}$$
(12)

4 The NSMM-ICA algorithm

Here, we derive an iterative algorithm for solving the main problem of minimizing Eq. (11). The algorithm is based on the MM framework, which stands for majorization-minimization (Hunter and Lange 2004) and which involves constructing and minimizing an alternative to the \(\ell (\mathbf{e},{\mathsf {A}})\) function with respect to \(\mathbf{e}\) and \({\mathsf {A}}\) at each iteration.

4.1 Majorizing the objective function

In the remainder of this section we write the derivation in terms of a general target density g, as in the infinite sample size case; for an actual dataset, integrals against \(g(\mathbf{x})\text { d}\mathbf{x}\) are replaced by sums over the observed data, as in Eq. (11). Given the current estimate \(\mathbf{e}^{(0)}\) and \({\mathsf {A}}^{(0)}\), let us define

$$\begin{aligned} w^{(0)}_j(\mathbf{x})=\frac{\left[ {\mathcal {N}}_he^{(0)}_j\right] _{{\mathsf {A}}^{(0)}_j}(\mathbf{x})}{\sum \nolimits _{j'=1}^{m}\left[ {\mathcal {N}}_he^{(0)}_{j'}\right] _{{\mathsf {A}}^{(0)}_{j'}}(\mathbf{x})}. \end{aligned}$$

Since \(\sum _j w^{(0)}_j(\mathbf{x})=1\), Jensen’s inequality gives

$$\begin{aligned}&\ell (\mathbf{e},{\mathsf {A}})-\ell (\mathbf{e}^{(0)},{\mathsf {A}}^{(0)})\\&\quad =-\int {g(\mathbf{x})\log {\sum \limits _{j=1}^{m} w^{(0)}_j(\mathbf{x}) \frac{({\mathcal {N}}_he_j)_{{\mathsf {A}}_j}(\mathbf{x})}{({\mathcal {N}}_he^{(0)}_j)_{{\mathsf {A}}^{(0)}_j}(\mathbf{x})}}}\text { d}\mathbf{x} +\int {\left( \sum \limits _{j=1}^{m}(e_j)_{{\mathsf {A}}_j}-\sum \limits _{j=1}^{m}(e^{(0)}_j)_{{\mathsf {A}}^{(0)}_j}\right) }\\&\quad \le -\int {g(\mathbf{x})\sum \limits _{j=1}^{m} w^{(0)}_j(\mathbf{x}) \log {\frac{({\mathcal {N}}_he_j)_{{\mathsf {A}}_j}(\mathbf{x})}{({\mathcal {N}}_he^{(0)}_j)_{{\mathsf {A}}^{(0)}_j}(\mathbf{x})}}}\text { d}\mathbf{x}+\int {\left( \sum \limits _{j=1}^{m} (e_j)_{{\mathsf {A}}_j}-\sum \limits _{j=1}^{m}(e^{(0)}_j)_{{\mathsf {A}}^{(0)}_j}\right) }. \end{aligned}$$

Thus, if we let

$$\begin{aligned} b^{(0)}(\mathbf{e},{\mathsf {A}})=-\int {g(\mathbf{x}) \sum \limits _{j=1}^{m}w^{(0)}_j(\mathbf{x})\cdot \log {{({\mathcal {N}}_he_j)_{{\mathsf {A}}_j}(\mathbf{x})}}}\text { d}\mathbf{x} +\int {\left( \sum \limits _{j=1}^{m}(e_j)_{{\mathsf {A}}_j}\right) }, \end{aligned}$$

then

$$\begin{aligned} \ell (\mathbf{e},{\mathsf {A}})-\ell (\mathbf{e}^{(0)},{\mathsf {A}}^{(0)})\le b^{(0)}(\mathbf{e},{\mathsf {A}})-b^{(0)}(\mathbf{e}^{(0)},{\mathsf {A}}^{(0)}). \end{aligned}$$

Therefore \(b^{(0)}\) majorizes \(\ell \) at \((\mathbf{e}^{(0)}, {\mathsf {A}}^{(0)})\) up to an additive constant. Taking the next estimate in the iterative scheme to be the minimizer of \(b^{(0)}(\mathbf{e},{\mathsf {A}})\) therefore yields an MM algorithm, as explained by Hunter and Lange (2004), and guarantees that the algorithm possesses a descent property.

4.2 Minimizing the majorizer

For each j, \(1\le j\le m\), we wish to minimize

$$\begin{aligned} b_j^{(0)}(e_j,{\mathsf {A}}_j)=-\int {g(\mathbf{x})w^{(0)}_j(\mathbf{x})\cdot \log {{({\mathcal {N}}_he_j)_{{\mathsf {A}}_j}(\mathbf{x})}}}\text { d}\mathbf{x} +\int {(e_j)_{{\mathsf {A}}_j}} (\mathbf{x}) \text { d}\mathbf{x} \end{aligned}$$
(13)

with respect to \(e_j\) and \({\mathsf {A}}_j\). Instead of finding a global minimizer for \(b_j^{(0)}\), we first hold \({\mathsf {A}}_j\) fixed and minimize with respect to \(e_j\), then plug in the resulting update to \(e_j\) and minimize with respect to \({\mathsf {A}}_j\). The resulting algorithm, which mimics the multiple “conditional maximization” steps of the ECM algorithm (Meng and Rubin 1993), does not actually minimize \(b_j^{(0)}\), but it does ensure that the next iterate achieves a smaller value of \(b_j^{(0)}\), which is enough to guarantee the descent property: the value of the objective function decreases at each iteration of the algorithm.

We find that Eq. (13) has a closed-form minimizer as a function of \(e_j\) when \({\mathsf {A}}_j\) is held fixed.

Proposition 1

The minimizer of Eq. (13) with respect to \(e_j\), with \({\mathsf {A}}_j\) held fixed, is

$$\begin{aligned} \hat{e}_j(\mathbf{u})=\frac{|\det {{\mathsf {A}}_j}|}{\left[ \int {g({\mathsf {A}}_j\mathbf{x})w^{(0)}_j({\mathsf {A}}_j\mathbf{x})}\text { d}{} \mathbf{x}\right] ^{r-1}} \cdot \prod \limits _{k=1}^{r}\int {g({\mathsf {A}}_j\mathbf{y})w^{(0)}_j({\mathsf {A}}_j\mathbf{y})s_h(u_k,y_k)}\text { d}\mathbf{y}. \end{aligned}$$
(14)

A proof of Proposition 1 is provided in “Appendix A”.

Equation (14) can be rewritten as

$$\begin{aligned} \hat{e}_j(\mathbf{u})=\left[ P\circ S_h(|\det {{\mathsf {A}}_j}|\cdot (g\cdot w^{(0)}_j)\circ {\mathsf {A}}_j)\right] (\mathbf{{u}}) \end{aligned}$$

using the P operator of Eq. (10). In general, for any nonnegative function f on \(R^r\),

$$\begin{aligned} S_h(f\circ {\mathsf {A}}_j)=(S_h)_{{\mathsf {A}}_j}(f)\circ {\mathsf {A}}_j, \end{aligned}$$

where

$$\begin{aligned} (S_h)_{{\mathsf {A}}_j}f(\mathbf{{x}})=\int |\det {{\mathsf {A}}_j}|^{-1}\tilde{s}_h({\mathsf {A}}_j^{-1} \mathbf{{x}},{\mathsf {A}}_j^{-1}{} \mathbf{{u}})f(\mathbf{{u}})\text { d}\mathbf{{u}}. \end{aligned}$$

Thus, we may also write

$$\begin{aligned} (\hat{e}_j)_{{\mathsf {A}}_j}(\mathbf{u})&=\left[ P_{{\mathsf {A}}_j}\circ (S_h)_{{\mathsf {A}}_j}(g\cdot w^{(0)}_j)\right] (\mathbf u), \end{aligned}$$

where

$$\begin{aligned} P_{{\mathsf {A}}_j}f(\mathbf{u})=[P(f_{{\mathsf {A}}_j^{-1}})]_{{\mathsf {A}}_j}(\mathbf{u})=[P(f\circ {\mathsf {A}}_j)]({\mathsf {A}}_j^{-1}{} \mathbf{{u}}). \end{aligned}$$

Now let us turn to the minimization of Eq. (13) with respect to \({\mathsf {A}}_j\). We first define

$$\begin{aligned} \hat{q}_{jk}(u_k)= \frac{|\det {{\mathsf {A}}_j}|}{\int {g({\mathsf {A}}_j\mathbf{x})w^{(0)}_j({\mathsf {A}}_j\mathbf{x})}\text { d}{} \mathbf{x}} \int {g({\mathsf {A}}_j\mathbf{y})w^{(0)}_j({\mathsf {A}}_j\mathbf{y})s_h(u_k,y_k)}\text { d}\mathbf{y}. \end{aligned}$$
(15)

If we apply the change of variables \(\mathbf{x} = {\mathsf {A}}_j\mathbf{y}\) to Eq. (13) and then plug in \(\hat{e}_j(\mathbf{u})\) into the resulting expression for \(b_j^{(0)}(e_j,{\mathsf {A}}_j)\), we find that minimizing the result with respect to \({\mathsf {A}}_j\) is equivalent to minimizing

$$\begin{aligned} \log {|\det {{\mathsf {A}}_j}|}+\sum \limits _{k=1}^{r}\int \hat{q}_{jk}(u)\log \hat{q}_{jk} (u) \text { d}u \end{aligned}$$
(16)

with respect to \({\mathsf {A}}_j\), where \(\hat{q}_{jk}\) depends on \({\mathsf {A}}_j\) through (15).

In Expression (16), \(\hat{q}_{jk}\) is the kth margin of the kernel-smoothed version of \((g\cdot w^{(0)}_j)_{{\mathsf {A}}_j^{-1}}/\int {g\cdot w^{(0)}_j}\). In the discrete case, where g is replaced by the empirical distribution of the data, \(\hat{q}_{jk}\) is the kth margin of the kernel density estimate based on the weighted observed data set after linear transformation by \({\mathsf {A}}_j^{-1}\), where the weight for the data point \(\mathbf{x}_i\) is \(w^{(0)}_j(\mathbf{x}_i)\). Let us denote this weighted data set by \(\mathbf{\mathsf {D}}^{(0)}_j\) and its linear transformation by \({\mathsf {A}}_j^{-1}\mathbf{\mathsf {D}}^{(0)}_j\). By (15), the optimization mechanism at the current step views \({\mathsf {A}}_j^{-1}\mathbf{\mathsf {D}}^{(0)}_j\) as a weighted sample generated from the unknown density function \(q_j\), where \(\mathbf{\mathsf {D}}^{(0)}_j\) is a weighted sample from the jth mixture component and \({\mathsf {A}}_j^{-1}\) is the matrix that recovers the associated ICA transformation. Let us call \({\mathsf {A}}_j^{-1}\) a recovering matrix.

By Eq. (6), we may treat \(|\det {{\mathsf {A}}}_j|\) as fixed given the weighted data \({\mathsf {A}}_j^{-1}\mathbf{\mathsf {D}}^{(0)}_j\). The second term in (16) is an estimate of the sum of marginal entropies of \(q_j\), which is equal, up to a term that does not involve \({\mathsf {A}}_j\), to the mutual information of the marginals of \(q_j\). According to Hyvarinen et al. (2002), minimizing mutual information in this setting (that is, minimizing the mutual information among the coordinates of \({\mathsf {A}}_j^{-1}\mathbf{X}\) on the basis of a weighted sample from \(\mathbf{X}\)) can be achieved by existing ICA algorithms such as the fastICA algorithm described in Sect. 4.3.
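
In the discrete case, the entropy term of (16) can be approximated from weighted kernel density estimates of the transformed coordinates. The sketch below does this for one coordinate of one component using R's density() function, which accepts observation weights; the objects Dj (the transformed data \({\mathsf {A}}_j^{-1}\mathbf{\mathsf {D}}^{(0)}_j\) arranged as an \(n\times r\) matrix) and w (the weights \(w^{(0)}_j(\mathbf{x}_i)\)) are assumed to be available.

```r
## Approximate the k-th summand of (16) for one component, assuming:
##   Dj : n x r matrix whose rows are A_j^{-1} x_i
##   w  : length-n vector of weights w_j^(0)(x_i)
marginal_neg_entropy <- function(Dj, w, k, h) {
  kde <- density(Dj[, k], weights = w / sum(w), bw = h)    # weighted KDE of q_jk
  dx  <- diff(kde$x)[1]
  sum(kde$y * log(pmax(kde$y, .Machine$double.eps))) * dx  # Riemann sum for ∫ q̂ log q̂ du
}
## Summing marginal_neg_entropy over k = 1, ..., r gives the second term of (16),
## the quantity that the ICA step seeks to make small.
```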

To summarize, the NSMM-ICA algorithm iterates as follows, where the parameters at the tth iteration are denoted by \((\mathbf{e}^{(t)}, {\mathsf {A}}^{(t)})\):

Majorization Step: For \(1\le j\le m\), compute

$$\begin{aligned} w^{(t)}_j(\mathbf{x})=\frac{({\mathcal {N}}_he^{(t)}_j)_{{\mathsf {A}}^{(t)}_j}(\mathbf{x})}{\sum \nolimits _{j'=1}^{m}({\mathcal {N}}_he^{(t)}_{j'})_{{\mathsf {A}}^{(t)}_{j'}}(\mathbf{x})}. \end{aligned}$$

ICA Step: Use the fastICA technique of Sect. 4.3 to find \({\mathsf {A}}^{(t+1)}_j\) subject to (6) that minimizes

$$\begin{aligned} \sum \limits _{k=1}^{r}\int \hat{q}^{(t+1)}_{jk}(u) \log \hat{q}^{(t+1)}_{jk} (u) \text { d}u, \end{aligned}$$

for \(j=1, \ldots , m\), where \(\hat{q}^{(t+1)}_{jk}(u_k)\) is defined in Eq. (15).

Minimization Step: Let

$$\begin{aligned} e^{(t+1)}_j(\mathbf{u})=\hat{\lambda }^{(t+1)}_j\hat{q}^{(t+1)}_j(\mathbf{u})= \hat{\lambda }^{(t+1)}_j\prod \limits _{k=1}^{r}\hat{q}^{(t+1)}_{jk}(u_k), \end{aligned}$$

where

$$\begin{aligned} \hat{\lambda }^{(t+1)}_j=\int {(g\cdot w^{(t)}_j)}. \end{aligned}$$

4.3 Practical implementation of NSMM-ICA

Section 4.2 suggests alternating NSMM and ICA methods to form an iterative algorithm for the estimation of the nonparametric ICA mixture model. This section describes the practical considerations that went into the development of a package for R (R Core Team 2015), called icamix, that implements these ideas.

Empirical evidence suggests that NSMM and the npEM algorithm of Benaglia et al. (2009) tend to give very similar estimates (Levine et al. 2011), the reason being that \({\mathcal {N}}_hf\) is usually close to f itself. This suggests that the smoothed version of the algorithm, which is more computationally burdensome, can reasonably be replaced by the non-smoothed version. The decision to implement the non-smoothed version affects only Step 1 of the algorithm below. The result is an algorithm that forgoes the provable descent property of the smoothed version but is much faster and appears to produce nearly identical results for most test problems.

Among the many ICA techniques available in the literature, here we use the efficient and well-tested FastICA of Hyvarinen et al. (2002). At each iteration, FastICA will be applied to a weighted dataset, where the weight on observation i for component j is determined as the estimate, given the information available at the present iteration, of the probability that observation i falls into component j.

Assume we are given raw data as a matrix \({\mathsf {X}}^\top =\{\mathbf{x}_1, \mathbf{x}_2, \ldots , \mathbf{x}_n\}^\top \), where \(\mathbf{x}_i=(x_{i1}, x_{i2}, \ldots , x_{ir})^\top \) for \(1\le i\le n\). We first choose a set of starting parameter values. Since our algorithm, like any MM algorithm, finds at best a local minimum, different starting values may lead to different solutions. In the icamix package, we begin with a k-means clustering solution, which assigns each data point to one of the mixture components (and which itself has a stochastic element, making it easy to generate multiple starting points). Given this initial partition, straightforward estimation (e.g., using FastICA) on the separate components leads to initial parameter values, as sketched below. Then, our algorithm iterates through Steps 1 through 4 below until a convergence criterion is met.
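
A sketch of this initialization strategy, using the kmeans() function in base R and the fastICA package (our own illustration, not the icamix source code), is as follows.

```r
## Initialization sketch: k-means partition, then a separate ICA fit per cluster.
## X is the n x r data matrix and m the number of mixture components.
library(fastICA)
init_nsmm_ica <- function(X, m) {
  km      <- kmeans(X, centers = m, nstart = 10)
  lambda0 <- as.numeric(table(km$cluster)) / nrow(X)     # initial mixing proportions
  A0 <- lapply(seq_len(m), function(j) {
    Xj  <- X[km$cluster == j, , drop = FALSE]
    ica <- fastICA(Xj, n.comp = ncol(X))
    ## fastICA uses the row-vector convention X = S %*% A, so the mixing matrix
    ## in the column-vector notation of this paper is the transpose (centering
    ## is handled inside fastICA; location parameters are omitted here):
    t(ica$A)
  })
  list(lambda = lambda0, A = A0, cluster = km$cluster)
}
```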

Step 1. For all i and j, estimate the jth component weight for the ith observation using the non-smoothed densities:

$$\begin{aligned} p^{(t)}_{ij}=\displaystyle \frac{\lambda ^{(t)}_{j}f_j^{(t)}(\mathbf{x}_i)}{\sum \nolimits _{j\prime =1}^{m}\lambda ^{(t)}_{j\prime } f_{j\prime }^{(t)}(\mathbf{x}_i)} =\displaystyle \frac{\lambda ^{(t)}_{j}\left| \det {{\mathsf {A}}_j^{(t)}}\right| ^{-1}\prod \nolimits _{k=1}^{r}q_{jk}^{(t)} \left( \left[ ({\mathsf {A}}_j^{(t)})^{-1}{} \mathbf{x}_i \right] _k\right) }{\sum \nolimits _{j\prime =1}^{m}\lambda ^{(t)}_{j\prime } \left| \det {{\mathsf {A}}_{j\prime }^{(t)}}\right| ^{-1}\prod \nolimits _{k=1}^{r}q_{j\prime ,k}^{(t)} \left( \left[ ({\mathsf {A}}_{j\prime }^{(t)})^{-1}{} \mathbf{x}_i \right] _k\right) }. \end{aligned}$$
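
In code, Step 1 is a direct transcription of the formula above; in the sketch below, lambda is the current vector of mixing proportions, Ainv a list of the current matrices \(({\mathsf {A}}_j^{(t)})^{-1}\), and qk a list of lists of current marginal density functions (object names are ours).

```r
## Step 1 sketch: posterior component weights p_ij for all i and j.
## lambda: length-m vector; Ainv: list of m (r x r) matrices;
## qk: qk[[j]][[k]] is the current (vectorized) estimate of q_jk.
posterior_weights <- function(X, lambda, Ainv, qk) {
  m <- length(lambda)
  dens <- sapply(seq_len(m), function(j) {
    Yj   <- X %*% t(Ainv[[j]])                 # rows are (A_j^{-1} x_i)^T
    marg <- sapply(seq_len(ncol(X)), function(k) qk[[j]][[k]](Yj[, k]))
    lambda[j] * abs(det(Ainv[[j]])) * apply(marg, 1, prod)  # |det A_j|^{-1} = |det A_j^{-1}|
  })
  dens / rowSums(dens)                         # n x m matrix of p_ij
}
```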

Steps 2, 3a, 3b, 3c, and 4 are now repeated separately for each value of j, \(1\le j\le m\):

Step 2. Update the \({\lambda }\) parameters:

$$\begin{aligned} \lambda ^{(t+1)}_j=\displaystyle \frac{1}{n}\sum \limits _{i=1}^{n}p^{(t)}_{ij}. \end{aligned}$$

Step 3a. Centering FastICA step for component j: For each i, define

$$\begin{aligned} \tilde{\mathbf{x}}_i \leftarrow \mathbf{x}_i-\displaystyle \frac{\sum \nolimits _{i=1}^{n}{} \mathbf{x}_ip_{ij}^{(t)}}{\sum \nolimits _{i=1}^{n}p_{ij}^{(t)}}. \end{aligned}$$

Step 3b. Decorrelating FastICA step for component j: We first obtain the eigenvalue decomposition as

$$\begin{aligned} \displaystyle \frac{\sum \nolimits _{i=1}^{n}\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top p_{ij}^{(t)}}{\sum \nolimits _{i=1}^{n}p_{ij}^{(t)}}=E_jD_jE_j^\top , \end{aligned}$$

then let \(V_j=E_jD_j^{-1/2}E_j^\top \) and \(\mathbf{z}_{ij}=V_j\tilde{\mathbf{x}}_i\) for \(i=1, \ldots , n\). Here, the notation \(\mathbf{z}_{ij}\) refers to the jth component version of the ith observation of the \(\mathbf{z}\) vector, which is r-dimensional. Therefore,

$$\begin{aligned} \displaystyle \frac{\sum \nolimits _{i=1}^{n}{} \mathbf{z}_{ij}{} \mathbf{z}_{ij}^\top p_{ij}^{(t)}}{\sum \nolimits _{i=1}^{n}p_{ij}^{(t)}}=V_jE_jD_jE_j^\top V_j^\top =I. \end{aligned}$$
(17)

The transformed data \(\mathbf{z}_{ij}\) with weights \(p_{ij}^{(t)}\), \(1\le i\le n\), thus have their coordinates uncorrelated and standardized according to (17). Since \(\mathbf{Z}_j=V_j\tilde{\mathbf{X}}=V_j{\mathsf {A}}_j\mathbf{S}\), we need to first estimate \(\left( V_j{\mathsf {A}}_j\right) ^{-1}\), of which the ith row is the same as \(\mathbf{w}_{ij}\) in (18) below, and multiply it by \(V_j\) on the right to get an update of \({\mathsf {A}}_j^{-1}\).

Step 3c. Symmetric orthogonalization FastICA step for component j: At this step, we enter an internal loop that ultimately results in the update of the \({\mathsf {A}}_j\) matrix. Let \(W_j=[\mathbf{w}_{1j}, \mathbf{w}_{2j}, \ldots , \mathbf{w}_{rj}]^\top \) for r-dimensional unit length vectors \(\mathbf{w}_{1j}, \mathbf{w}_{2j}, \ldots , \mathbf{w}_{rj}\). The first time we enter Step 3c, we may simply take \(\mathbf{w}_{ij}\) to be the ith standard basis vector for each j, and at succeeding iterations we take these \(\mathbf{w}_{ij}\) to be rows of the most recent \(W_j\) matrix.

The inner loop then proceeds by updating the \(\mathbf{w}_{ij}\) from their previous values according to

$$\begin{aligned} \mathbf{w}_{ij}\leftarrow \frac{\sum \nolimits _{i'=1}^{n}\mathbf{z}_{i'j}\,g(\mathbf{w}_{ij}^\top \mathbf{z}_{i'j})\,p_{i'j}^{(t)}}{\sum \nolimits _{i'=1}^{n}p_{i'j}^{(t)}}-\mathbf{w}_{ij}\frac{\sum \nolimits _{i'=1}^{n}g'(\mathbf{w}_{ij}^\top \mathbf{z}_{i'j})\,p_{i'j}^{(t)}}{\sum \nolimits _{i'=1}^{n}p_{i'j}^{(t)}}, \end{aligned}$$
(18)

where g may be chosen to be either \(g(y)=\text {tanh}(\alpha _1y)\) for some \(1\le \alpha _1\le 2\) or \(g(y)=y\exp (-y^2/2)\) (Hyvarinen et al. 2002). Let \(W_j=[\mathbf{w}_{1j}, \mathbf{w}_{2j}, \ldots , \mathbf{w}_{rj}]^\top \) and then symmetrize and orthogonalize by

$$\begin{aligned} W_j\leftarrow (W_jW_j^\top )^{-1/2}W_j. \end{aligned}$$
(19)

Iteratively update the \(\mathbf{w}_{ij}, i=1,\ldots ,r\) using Eqs. (18) and (19) until convergence is achieved. More precisely, we choose a tolerance \(\tau \) and stop updating when

$$\begin{aligned} \max \limits _{1\le i\le r}\left\{ \Bigl |\left( \mathbf{w}_{ij}^\mathrm{(previous)}\right) ^\top \cdot \mathbf{w}_{ij}^\mathrm{(current)}-1\Bigr |\right\} \le \tau . \end{aligned}$$

Finally, set

$$\begin{aligned} {\mathsf {A}}_j^{(t+1)}=V_j^{-1}W_j^{-1}. \end{aligned}$$
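
The inner loop of Step 3c, with the tanh nonlinearity and symmetric orthogonalization computed through an eigendecomposition of \(W_jW_j^\top \), can be sketched in R as follows; Z and p are the whitened data and the weights for component j from Steps 3a and 3b.

```r
## Step 3c sketch: weighted symmetric FastICA updates for one component.
## Z: n x r whitened data (rows z_ij); p: length-n weights; W: r x r, rows w_ij.
sym_orth <- function(W) {                       # W <- (W W^T)^{-1/2} W, Eq. (19)
  ed <- eigen(W %*% t(W), symmetric = TRUE)
  ed$vectors %*% diag(1 / sqrt(ed$values)) %*% t(ed$vectors) %*% W
}
fastica_step <- function(Z, p, W = diag(ncol(Z)), alpha = 1, tol = 1e-6, maxit = 200) {
  sp <- sum(p)
  for (it in seq_len(maxit)) {
    Wold <- W
    Y  <- Z %*% t(W)                            # Y[i', a] = w_aj^T z_i'j
    G  <- tanh(alpha * Y)                       # g(y)
    Gp <- alpha * (1 - G^2)                     # g'(y)
    ## Eq. (18), all rows of W at once, with observation weights p:
    W  <- t(crossprod(Z * p, G)) / sp - diag(colSums(Gp * p) / sp) %*% W
    W  <- sym_orth(W)                           # Eq. (19)
    if (max(abs(rowSums(W * Wold) - 1)) < tol) break   # convergence criterion above
  }
  W
}
```

The mixing matrix is then updated as in the display above, \({\mathsf {A}}_j^{(t+1)}=V_j^{-1}W_j^{-1}\), or equivalently \(({\mathsf {A}}_j^{(t+1)})^{-1}=W_jV_j\).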

Step 4. Non-parametric density estimation step: For all j and k, let

$$\begin{aligned} q^{(t+1)}_{jk}(u)= \left( \sum \limits _{i=1}^{n}p^{(t)}_{ij} \right) ^{-1} \sum \limits _{i=1}^{n}p^{(t)}_{ij}\displaystyle \frac{1}{h}K\left( \displaystyle \frac{u- \left[ ({\mathsf {A}}^{(t+1)}_j)^{-1}\mathbf{x}_i \right] _k}{h}\right) . \end{aligned}$$
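
Step 4 is a weighted kernel density estimate of each transformed coordinate; a direct sketch with a Gaussian kernel, returning the estimate as a function of u, follows (the choice of bandwidth h is discussed next).

```r
## Step 4 sketch: weighted Gaussian-kernel density estimate of q_jk.
## yk: length-n vector of transformed coordinates [A_j^{-1} x_i]_k;
## p: length-n weights p_ij; h: bandwidth.
weighted_kde <- function(yk, p, h) {
  p <- p / sum(p)
  function(u) sapply(u, function(u0) sum(p * dnorm((u0 - yk) / h)) / h)
}
```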

The R package icamix makes use of the Rcpp (Eddelbuettel and François 2011; Eddelbuettel 2013) and RcppArmadillo (Eddelbuettel and Sanderson 2014) packages for compiling and calling the core algorithms implemented in C++ code to speed up the calculations.

In the discrete algorithm we have developed, a single fixed bandwidth calculated once from the data is not sensible, especially because the scale of each coordinate changes from iteration to iteration under the ICA framework. Thus we propose an iterative scheme for choosing the bandwidth, similar to that of Benaglia et al. (2011), whereby

$$\begin{aligned} h^{(t+1)}_{jk}=0.5\cdot \text {SD}^{(t+1)}_{jk}\cdot (n\lambda ^{(t+1)}_{j})^{-0.2}=0.5\cdot (n\lambda ^{(t+1)}_{j})^{-0.2}. \end{aligned}$$
(20)

For simplicity, we replace \(\min {\{\text {SD}^{(t+1)}_{jk}, \text {IQR}^{(t+1)}_{jk}/1.349\}}\) in the original Benaglia et al. (2011) formulation by \(\text {SD}^{(t+1)}_{jk}=1\), since the transformed coordinates are standardized to unit variance by assumption (6). We also use the ad hoc coefficient of 0.5 rather than Silverman's 0.9 used by Benaglia et al. (2011) in order to capture fine features of the density for better performance in the classification task; our experience is that using 0.9 tends to oversmooth the estimated density. Simulation studies and applications we have run suggest that Eq. (20) works well in practice.
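
In code, the bandwidth rule of Eq. (20) is a one-line computation:

```r
## Adaptive bandwidth of Eq. (20): component j at iteration t+1 uses
## h_jk = 0.5 * (n * lambda_j)^(-0.2) for every coordinate k (SD = 1 under (6)).
bandwidth20 <- function(n, lambda_j) 0.5 * (n * lambda_j)^(-0.2)
```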

When running the algorithm, we have determined that Step 4 dominates the computing time. By making use of Gaussian kernels and exploiting certain symmetry in evaluating the Gaussian terms, we are able to lower the computing cost of the kernel density estimation step by about 50% relative to the mixtools package. Further improvement might be possible via the Fast Gauss Transform (Raykar et al. 2005) and related techniques, though we have not implemented these improvements.

5 Applications

Here, we describe our experience applying the modified NSMM-ICA algorithm implemented in the icamix package to several datasets of varying size.

5.1 Italian wine classification

The Italian wine data set is another popular data set used for comparing various classifiers (Forina et al. 1988; Aeberhard et al. 1992). It contains results of a chemical analysis of wines grown in Italy but derived from three different cultivars. A total of 178 observations are recorded, each with 13 continuous attributes such as color intensity, magnesium and malic acid. There are 59, 71 and 48 instances in the first (Barolo), second (Grignolino) and third (Barbera) wine classes, respectively (Table 1).

Table 1 Wine data classifications by PCA \(+\) NSMM-ICA algorithm

If we feed the unlabeled data directly to the NSMM-ICA algorithm, we obtain a classification error rate of 28.65%, prompting us to consider remedies. It seems that, given the small number of observations, the relatively large number of attributes is challenging: there are many parameters to estimate, and some of the attributes may consist mostly of noise. So instead of using all 13 attributes, we first run principal component analysis (PCA) on the attributes and then select the 5 PCA scores that explain the largest proportion of variance. Finally, we run the NSMM-ICA algorithm on the data set with the chosen PCA scores as attributes. In this way, the classification performance improves markedly, giving a classification error rate of 5.62%. Hence, in situations with relatively large numbers of coordinates, it may be worthwhile to apply a dimension reduction technique before the NSMM-ICA algorithm. Figure 2 shows a comparison of the true class information and results from our unsupervised learning algorithms.
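
A sketch of this preprocessing pipeline is given below; wine is assumed to be a data frame whose first column holds the class label and whose remaining 13 columns hold the attributes, and the clustering call is hedged as in Sect. 1 because the exact icamix interface should be checked against its documentation. Whether to standardize the attributes before PCA is a choice left to the analyst; the sketch standardizes them.

```r
## Wine data preprocessing sketch: keep the first 5 principal component scores,
## then cluster the scores ('wine' is an assumed data frame; see lead-in).
pca    <- prcomp(wine[, -1], center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:5]                          # 5 leading PCA scores
# fit <- EMFASTICAALG(scores, numCluster = 3)   # hypothetical icamix call, as in Sect. 1
# table(apply(fit$WeightMat, 1, which.max), wine[, 1])
```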

Fig. 2

Wine data: comparison of the true class information (far left) and two results from our unsupervised learning algorithms

5.2 Tone data

The tone perception experiment and data were first introduced by Cohen (1984). These data have been analyzed by Veaux (1989), Viele and Tong (2002), and Hunter and Young (2012) in the context of mixtures of regressions. In each trial of the experiment, a musician is presented with a fundamental tone plus a series of overtones determined by a stretching ratio. The musician is then asked to tune an adjustable tone to one octave above the fundamental tone. Both the stretching ratio and the ratio of the adjusted tone to the fundamental are recorded for each trial. Five musicians were involved in the experiment; however, the tone data set contains only 150 trials, all from the same musician. The question of interest is whether the musician tunes the adjustable tone to the nominal octave at a ratio of 2:1 to the fundamental tone (the interval memory hypothesis) or uses the overtones to tune it to the stretching ratio (the partial matching hypothesis). The findings of Hunter and Young (2012), obtained via a semi-parametric mixture of regressions, conform with the latter theory.

For this unsupervised learning task, we run both the npEM algorithm of Benaglia et al. (2009) and our NSMM-ICA algorithm on the tone data. Figure 3 shows that our NSMM-ICA algorithm does a good job of classification, very close to the mixture-of-regressions results obtained by Hunter and Young (2012), despite omitting any explicit assumption of regression structure. The reason the results of the npEM algorithm shown in Fig. 3 do not look nearly as good is that, with regression lines of nonzero slope, the mixture is far from conditionally independent. Thus, the additional ICA step in our algorithm is essential.

To compare our result with that of Hunter and Young (2012), we use the estimated mixing weights obtained from NSMM-ICA as weights in ordinary least squares fits, yielding regression coefficients for each component. The results, summarized in Table 2, show that the estimated memberships from SP EM and from our NSMM-ICA differ primarily near the intersection of the two components, which accounts for the noticeable difference in the estimated mixing weights.
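
This weighted least-squares step is straightforward in R; in the sketch below, tonedata is the version of Cohen's data distributed with the mixtools package (columns stretchratio and tuned), and post is assumed to be the \(n\times 2\) matrix of estimated posterior membership probabilities from NSMM-ICA.

```r
## Weighted OLS fits for each component, using estimated posterior membership
## probabilities as weights ('post' is assumed to come from the NSMM-ICA fit).
library(mixtools)
data(tonedata)
coef_by_component <- function(post) lapply(seq_len(ncol(post)), function(j)
  coef(lm(tuned ~ stretchratio, data = tonedata, weights = post[, j])))
```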

Fig. 3

Comparison of three algorithms for fitting the tone dataset of Cohen (1984). Only the plot labeled SP EM explicitly assumes a mixture of regressions. The npEM algorithm does not utilize ICA and therefore misses the two lines entirely

Table 2 Comparison of mixtures of least squares fits for the tone dataset of Cohen (1984)

5.3 Clustering images

Learning efficient codes for images obtained from different sources or contexts is an important problem in the area of image processing. The task involves extracting intrinsic structure in images by clustering and finding a complete set of efficient linear basis functions for each image source, so that the resulting coefficient values are as statistically independent as possible. Techniques that utilize a parametric form of ICA mixture models have been proposed by Lee et al. (1999a, b, 2000). Here, we apply the NSMM-ICA algorithm, which eliminates the parametric assumptions, to the task of unsupervised learning of image codes.

The two images shown in Fig. 4 serve as sources for the data set in this application. One is a painting image (2508 \(\times \) 1808 pixels) and the other is a newspaper image (2057 \(\times \) 1365 pixels). The images are converted to grey scale, so each pixel carries an intensity value ranging from 0 (black) to 1 (white).

Fig. 4

Two images as sources for data: newspaper and painting

We randomly select 5000 patches of \(12\times 12\) pixels from each image, so the complete data set has dimension \(10{,}000 \times 144\). The NSMM-ICA algorithm converges after 19 iterations, taking a little under 8 hours. Again, each bandwidth is learned automatically by the iterative scheme of Eq. (20). The result shows a very good recovery of the class-membership information, with a classification error rate of 1.2%.
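
The construction of the patch data matrix can be sketched as follows; img is assumed to be a grey-scale image stored as a numeric matrix with values in [0, 1].

```r
## Sketch of building the patch data matrix: draw 5000 random 12 x 12 patches
## from a grey-scale image matrix 'img' and vectorize each patch into a row.
sample_patches <- function(img, npatch = 5000, d = 12) {
  rows <- sample(nrow(img) - d + 1, npatch, replace = TRUE)
  cols <- sample(ncol(img) - d + 1, npatch, replace = TRUE)
  t(mapply(function(r, c) as.vector(img[r:(r + d - 1), c:(c + d - 1)]),
           rows, cols))                       # npatch x 144 matrix
}
## Stacking the patches from the two images gives the 10,000 x 144 data set.
```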

The learned basis functions (i.e., a basis for the linear space of the pixel patches) show interesting patterns. Figures 5 and 6 show the basis functions for each image. The ones for the painting image appear smoother but more irregular, while the ones for the newspaper image look spottier and more regular.

Fig. 5

Learned basis functions for the newspaper image of Fig. 4. Each basis function, which is a 144-dimensional vector, is standardized to be within 0 and 1, then displayed as a \(12\times 12\) image patch

Fig. 6

Learned basis functions for the painting image of Fig. 4. Each basis function, which is a 144-dimensional vector, is standardized to be within 0 and 1, then displayed as a \(12\times 12\) image patch

6 Discussion

Although the conditional independence model for multivariate data is promising because its parameters are provably identifiable under weak conditions, even some simple datasets, such as the well-known iris dataset, clearly violate the conditional independence assumption. The current article loosens this assumption, positing instead that each multivariate component can be linearly transformed to have independent coordinates. This gives a much more flexible model-based clustering framework than the conditional independence assumption alone. Among the particularly favorable features of this extended model is that it allows for linear feature extraction, as illustrated by the tone data application of Sect. 5.2.

Yet much work remains to be done on this and similar methods. Despite the favorable-looking results that we have obtained on the datasets in this article, the theoretically important question of the identifiability of the parameters remains unresolved. A related issue is the fact that ICA cannot operate in the setting of multivariate normal data, since in that case the standardized data are already standard multivariate normal, so that there is no way to identify an ICA transformation. Thus, with the increased flexibility of our current framework come additional questions regarding conditions under which parameter identifiability holds.

In addition, the large-sample behavior of the NSMM estimator, such as its convergence rate, is still not fully known; however, recent work on alternative algorithms such as those of Bonhomme et al. (2016a), for which asymptotic properties have been proven, might provide suitable alternatives for estimation of the nonparametric mixture structure. Similarly, alternatives to the ICA algorithm we employ here, such as ICS, are also possible; in particular, the ability of ICS to search for interesting clustering features, as in Peña et al. (2010), could be exploited.

In addition, there is the question of computational efficiency given the burden of estimating multiple univariate densities, particularly for large datasets. Our implementation of the algorithm currently replaces the smoothed NSMM portion of the algorithm by the non-smoothed npEM of Benaglia et al. (2009), since empirical comparisons have suggested that these two algorithms often result in nearly the same solutions. The icamix package for R (R Core Team 2015), available on CRAN, interweaves npEM with a weighted version of the FastICA algorithm (Hyvarinen et al. 2002). The package also implements an automated and adaptive scheme for bandwidth selection that is based on the work of Benaglia et al. (2011). Further computing efficiencies may be attainable through the use of the Fast Gauss Transform.

It is important to remember that although this article compares some clustering solutions obtained via our algorithm with those obtained using other techniques, such as k-means clustering, model-based clustering yields much more than mere cluster memberships: Not only does it assign each data point a probability vector of component membership, but the statistical modeling of the individual components is often of great interest beyond the assignment of individuals to groups. We did not explore the component density estimates obtained via our algorithm in this article, but for an example that does so in the nonparametric mixture modeling literature, consider the water-level dataset as analyzed in Section 5.2 of Chauveau et al. (2015).

All in all, given its flexibility and hence wide applicability, we believe that the novel approach to model-based clustering presented here has the potential to be a useful alternative to existing approaches based on parametric mixtures or mixtures that assume conditional independence.