1 Introduction

Clustering is a common unsupervised learning methodology for data analysis and has been widely used for uncovering hidden patterns within data. One extensively studied approach in statistical modeling is the mixture model, which clusters data into homogeneous subgroups by representing the whole model as a weighted sum of the subpopulations' densities. Owing to their flexible representation and interpretable results, mixture models are adopted in many applications from different domains.

A common assumption when using mixture models for statistical analysis is that the per-component densities are Gaussian (Park et al. 2013). However, the Gaussian distribution is not always an appropriate choice since the shape of the observations may not be symmetric. This is especially the case in natural images, where the density may be far from Gaussian (Hyvärinen and Hoyer 2000; Laptev 2009; Boutemedjet et al. 2010; Elguebaly and Bouguila 2014). Some evolving systems have been proposed for this problem (Andonovski et al. 2018; Škrjanc et al. 2019). To achieve a better approximation, we investigate the use of the asymmetric Gaussian distribution (AGD), which is capable of modeling asymmetric data: the AGD has left and right variance parameters that control the shape of each side of the mode to better capture the asymmetry of the data (Elguebaly and Bouguila 2011; Song et al. 2019).

Parameter estimation is one of the challenges in using mixture models, and various algorithms have been studied for this purpose. The expectation maximization (EM) algorithm is a well-known approach to this problem (Bouguila and Ziou 2006). Nevertheless, EM is a deterministic approach that is not guaranteed to reach a globally optimal solution because of its sensitivity to initialization and its tendency to overfit. Instead, Bayesian inference may be used, and it has been extensively studied in mixture modeling (Channoufi et al. 2018; Elguebaly and Bouguila 2014). It provides a strong theoretical framework for designing clustering algorithms as well as a formal approach to incorporating prior knowledge about the problem. The authors in Fu and Bouguila (2018) recently studied Bayesian learning of the asymmetric Gaussian mixture model, implementing Markov chain Monte Carlo (MCMC) methods that remove the dependency between the mixture parameters and components to address over-fitting problems.

Several studies have been devoted to the automatic selection of the number of components that best describes the observations. We introduce the Dirichlet process to address this problem since it leads to a mixture model with an unbounded number of components (Antoniak 1974). This can also be considered a nonparametric Bayesian approach since it allows the number of components to grow as required to fit the data (Griffin and Steel 2010). In this paper, we are interested in Bayesian nonparametric approaches for modeling, particularly models based on the Dirichlet process (Bouguila and Ziou 2012). The Dirichlet process allows the number of latent components to grow as necessary to fit the data, while the individual components still follow parametric distributions. We thus address the prevalent problem of choosing the correct number of mixture components by using the Dirichlet process to extend the finite mixture model to an infinite one, and we apply a hierarchical Bayesian learning technique to the proposed infinite asymmetric Gaussian mixture model (IAGM).

In theory, the more features used to represent data, the better a clustering algorithm is expected to perform. In practice, however, some features can be noisy, redundant, or uninformative and can thus hinder clustering performance (Boutemedjet et al. 2009; Bouguila 2009). The presence of many irrelevant features introduces a bias and renders homogeneity measures unreliable (Elguebaly and Bouguila 2015). A viable solution is to remove irrelevant features by identifying those most useful to the trained model. The process of reducing the collected features to a relevant subset is known as feature selection. It can increase the performance of models by eliminating noise in the data, improving model interpretation and decreasing the risk of overfitting. Feature selection methods can be broadly divided into three groups: filters, wrappers, and embedded methods (Adams and Beling 2017).

Filter approaches treat feature selection as a preprocessing step in which the relevance of each feature is evaluated using the dataset alone. Thus, filters only consider the properties of the features, regardless of the model. The authors in Krishnan et al. (1996) propose a trimming feature selection technique specific to mixture models based on the Fisher ratio. However, this method neither iterates through the feature space nor simultaneously estimates model parameters and feature subsets. Wrapper approaches, on the other hand, evaluate feature relevance with regard to the model. In most cases, a model is built from a subset of features and its performance is evaluated according to specified criteria; wrappers then move through the subset space, evaluating feature subsets with the evaluation function. The reader is referred to Galimberti et al. (2018) and Marbac and Sedki (2017) for further details about wrapper approaches.

Embedded methods simultaneously select features and construct models. Penalized model-based clustering (Pan and Shen 2007; Bouveyron and Brunet-Saumard 2014) and Bayesian methods (Gustafson et al. 2003; Wang and Zhu 2008) are extensively used in many applications. Feature saliency approaches treat feature selection as a parameter estimation problem and recast the probability distribution into mixture-dependent and mixture-independent parts (Elguebaly and Bouguila 2012; Law et al. 2004). Feature saliencies are added as new parameters to the conditional distribution of the mixture model and used to find clusters embedded in feature subspaces. Because a feature saliency represents the probability of belonging to the mixture-dependent distribution, it can be interpreted as the probability that a feature is relevant. In this paper, we propose a feature saliency measure and integrate it into the Bayesian inference framework. Our approach focuses on detecting cluster structure and discriminating feature relevance simultaneously through Bayesian learning.

To summarize, in this paper we propose a Bayesian inference approach for infinite asymmetric Gaussian mixture (IAGM) models with a simultaneous feature selection framework. The proposed approach fits asymmetric data better than the traditionally applied Gaussian mixture models. The extension to an infinite number of mixture components allows the number of clusters to be estimated as required by the data. The simultaneous feature selection further improves the approximation by choosing the features that best represent the data, enhancing the ability of the clustering task to separate the different classes. A potential drawback of the model is its computational complexity, which can be mitigated with modern computational resources.

The remainder of this paper is organized as follows. Sect. 2 outlines the asymmetric Gaussian mixture model, sets up the Dirichlet process, and describes the feature selection algorithm. Sect. 3 presents the Bayesian inference process and the complete algorithm for our model. In Sect. 4, we validate the model on dynamic textures clustering and image categorization tasks and compare it with a number of state-of-the-art methods. Finally, Sect. 5 concludes the paper.

2 Infinite asymmetric Gaussian mixture model

In this section, we introduce IAGM with the feature saliency algorithm. We first present the finite mixture model and then extend it to an infinite one. We also introduce the concept of feature saliency and present our model combined with a feature selection algorithm.

2.1 Finite mixture model

Assume we have a dataset of N observations \(\chi = (X_1,\dots ,X_N) \), where each observation \(X_i = (X_{i1}, \dots , X_{iD})\) is a D-dimensional random variable following the asymmetric Gaussian distribution (AGD). The probability density function of the dataset \(\chi \) can be written as:

$$\begin{aligned} p\big (\chi \mid \Theta \big ) = \prod _{i=1}^N \sum _{j=1}^M p_j p\big (X_i \mid \xi _j\big ) \end{aligned}$$
(1)

where \(\Theta =(p_1, \dots , p_M, \xi _1, \dots , \xi _M)\) represents the complete set of parameters fully characterizing the mixture model, M is the number of components, \(\overrightarrow{p} = (p_1, \dots , p_M)\) represents the mixing proportions which must be positive and sum to one, and \(\xi _j\) represents the AGD parameters for mixture component j.

Given AGD parameters for mixture component j, the AGD density function is defined as:

$$\begin{aligned} p \big (X_i \mid \xi \big )&\propto \prod _{k=1}^D \frac{1}{ (S_{lj k})^{-\frac{1}{2}} + (S_{rj k})^{-\frac{1}{2}}} \nonumber \\&\quad \times {\left\{ \begin{array}{ll} \exp \big [-\frac{S_{lj k} (X_{i k} - \mu _{j k})^2}{2}\big ]\quad if \, X_{i k} <\mu _{j k} \\ \exp \big [-\frac{S_{rj k} (X_{i k} - \mu _{j k})^2}{2}\big ]\quad if \, X_{i k} \ge \mu _{j k} \end{array}\right. } \end{aligned}$$
(2)

where \(\xi _j = (\mu _j,\, S_{lj},\, S_{rj}) \) is the set of parameters of the AGD, with \(\mu _j = (\mu _{j1}, \dots , \mu _{jD})\), \(S_{lj} = (S_{lj 1}, \dots , S_{lj D})\), and \(S_{rj} = (S_{rj 1}, \dots , S_{rj D})\). \(\mu _{j k}\), \(S_{lj k}\) and \(S_{rj k}\) are the mean, the left precision and the right precision of the kth-dimensional distribution (Fu and Bouguila 2018), respectively. In this paper, we assume that the dimensions of each observation \(X_i\) are independent; hence, the covariance matrix is diagonal. This assumption reduces the computational cost during processing and deployment.
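
For concreteness, the per-observation log-density of Eq. (2) can be evaluated as in the following minimal Python sketch (the function name and vectorized form are ours, not part of a reference implementation):

```python
import numpy as np

def agd_logpdf(x, mu, s_l, s_r):
    """Log-density of a D-dimensional AGD (Eq. 2), up to an additive constant.

    x, mu, s_l, s_r are length-D arrays; s_l and s_r are the left and right
    precisions. Dimensions are treated as independent (diagonal covariance).
    """
    x, mu, s_l, s_r = (np.asarray(a, dtype=float) for a in (x, mu, s_l, s_r))
    # Normalizing factor per dimension: 1 / (S_l^{-1/2} + S_r^{-1/2}).
    log_norm = -np.log(s_l ** -0.5 + s_r ** -0.5)
    # Use the left precision below the mode and the right precision above it.
    prec = np.where(x < mu, s_l, s_r)
    return float(np.sum(log_norm - 0.5 * prec * (x - mu) ** 2))
```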

We introduce latent indicator variables \(Z = (Z_1, \dots , Z_N)\), where \(Z_i\) indicates which mixture component observation i belongs to. \(Z_i = (Z_{i1}, \dots , Z_{iM})\), where the hidden label \(Z_{ij}\) is set to 1 when observation \(X_i\) is allocated to component j and to 0 otherwise. The likelihood function of the mixture model is then given by:

$$\begin{aligned} p \big (\chi \mid Z, \Theta \big ) = \prod _{i=1}^N \prod _{j=1}^M p\big (X_i \mid \xi _j\big )^{Z_{ij}} \end{aligned}$$
(3)

For the mixing weights, \(p_j = p(Z_i = j)\), \(j = 1, \dots , M\), indicates the probability that an observation \(X_i\) is associated with component j. Hence, the missing allocation variables Z are given a Multinomial prior as follows:

$$\begin{aligned} p \big (Z \mid \overrightarrow{p}) \sim \text {Multi}\left( \overrightarrow{p}\right) = \prod _{j=1}^M p_j^{n_j} \end{aligned}$$
(4)

where \(n_j = \sum _{i=1}^N I_{Z_i=j} \) is the number of observations allocated to component j, and I is the indicator function. The mixing proportions are assumed to follow a symmetric Dirichlet prior with concentration parameter \(\frac{\alpha }{M} \) (Rasmussen 1999), so that all components share an equal prior probability. This can be written as follows:

$$\begin{aligned} p\big (\overrightarrow{p} \mid \alpha \big ) \!\sim \! Dirichlet\left( \frac{\alpha }{M },\ldots ,\frac{\alpha }{M }\right) \!=\! \frac{\Gamma (\alpha )}{\Gamma (\frac{\alpha }{M })^M } \prod _{j=1}^M p_j^{\frac{\alpha }{M} - 1} \end{aligned}$$
(5)

The Dirichlet distribution is a conjugate prior of the Multinomial distribution. Due to the conjugacy of Z and \(\overrightarrow{p}\), we can achieve better inference by integrating out \(\overrightarrow{p}\) to obtain the prior of Z given the hyperparameter \(\alpha \), and then directly infer the distribution of the latent variables Z:

$$\begin{aligned} p\big (Z \mid \alpha \big )&= \int p\big (Z \mid \overrightarrow{p}\big ) p\big (\overrightarrow{p} \mid \alpha \big ) d\overrightarrow{p} \nonumber \\&= \frac{\Gamma (\alpha )}{\Gamma (N + \alpha )} \prod _{j=1}^M\frac{\Gamma (\frac{\alpha }{M} + n_j)}{\Gamma (\frac{\alpha }{M})} \end{aligned}$$
(6)

To use the Gibbs sampling technique, it is required to obtain the conditional prior for a single allocation variable \(Z_i\) given all the others. Keeping all the other indicators fixed, we obtain the following conditional prior:

$$\begin{aligned} p(Z_i=j \mid \alpha , Z_{-i}) = \frac{n_{-i,j} + \frac{\alpha }{M}}{N-1+\alpha } \end{aligned}$$
(7)

where the subscript \(-i\) indicates all indexes except i, \(Z_{-i} = (Z_1, \dots , Z_{i-1}, Z_{i+1}, \dots , Z_N) \), and \(n_{-i, j} \) is the number of observations, excluding \(X_i \), that are allocated to component j.

2.2 Infinite mixture model

We now extend the finite mixture model proposed in the last section to an infinite mixture model by letting the number of components \(M \rightarrow \infty \) and updating the conditional priors of the indicators in Eq. (7). This is achieved by introducing the Dirichlet process (Blei and Jordan 2006):

$$\begin{aligned} p \big (Z_i=j \mid \alpha , Z_{-i}\big ) = {\left\{ \begin{array}{ll} \frac{n_{-i,j}}{N-1+\alpha } \quad if \, n_{-i,j} > 0 \\ \frac{\alpha }{N-1+\alpha } \quad if \, n_{-i,j} = 0 \end{array}\right. } \end{aligned}$$
(8)

where \(n_{-i,j} > 0\) indicates that component j is represented. Thus, an observation \(X_i \) is allocated to an existing component with probability proportional to the number of observations already associated with that component, whereas the probability of creating a new component depends only on the concentration parameter \(\alpha \) and the number of observations N. Given these priors, the conditional posteriors are obtained by combining the priors with the likelihood:

$$\begin{aligned} p(Z_i=j \mid \ldots ) = {\left\{ \begin{array}{ll} \frac{n_{-i,j}}{N-1+\alpha } \prod _{k=1}^{D} p \big (X_{i k} \mid \xi _{j k}\big ) \\ \quad \quad \quad if \,\, n_{-i,j} > 0 \\ \frac{\alpha }{N-1+\alpha }\int p\big (\xi _j \mid \cdots ) p\big (X_i \mid \xi _j\big ) d\xi _j \\ \quad \quad \quad if \,\, n_{-i,j} = 0 \end{array}\right. } \end{aligned}$$
(9)

where the conditional posterior of an unrepresented component is obtained by integrating over the hyperparameters, and this integral is analytically intractable. To infer these intractable posteriors, we adopt Algorithm 2 of Neal (2000), which uses sampling to approximate the desired distribution.
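
To illustrate how Eq. (9) is used in practice, the sketch below performs one generic Gibbs sweep over the allocation variables; `component_loglik` and `new_component_loglik` are hypothetical callbacks standing in for the represented-component density and for the sampled approximation of the intractable integral, and the bookkeeping for newly created components (including drawing their parameters) is deliberately omitted:

```python
import numpy as np

def allocation_sweep(X, Z, alpha, component_loglik, new_component_loglik,
                     rng=np.random.default_rng()):
    """One Gibbs sweep over the allocations Z following Eq. (9) (schematic).

    component_loglik(x, j)  -> log p(x | xi_j) for a represented component j
    new_component_loglik(x) -> log of the approximated integral for a new component
    """
    for i in range(len(X)):
        Z[i] = -1                                  # remove x_i before resampling
        labels, counts = np.unique(Z[Z >= 0], return_counts=True)
        # Existing components: n_{-i,j} * p(x_i | xi_j); new component: alpha * integral.
        log_w = [np.log(n) + component_loglik(X[i], j) for j, n in zip(labels, counts)]
        log_w.append(np.log(alpha) + new_component_loglik(X[i]))
        log_w = np.array(log_w) - np.max(log_w)    # the 1/(N-1+alpha) factor cancels
        w = np.exp(log_w)
        choice = rng.choice(len(w), p=w / w.sum())
        Z[i] = labels[choice] if choice < len(labels) else (labels.max() + 1 if len(labels) else 0)
    return Z
```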

Concerning the concentration parameter \(\alpha \), we place an inverse Gamma prior on \(\alpha \) with parameters \(\kappa \) and \(\eta \):

$$\begin{aligned} p \big ( \alpha ^{-1} \mid \kappa , \eta \big ) \sim \Gamma \left( \kappa , \eta \right) \end{aligned}$$
(10)

Given the likelihood of \(\alpha \) in Eq. (6), we obtain the conditional posterior for \(\alpha \), which depends on the number of observations N and the number of components M:

$$\begin{aligned} p \big (\alpha \mid Z, \kappa , \eta \big ) \propto p\big (Z \mid \alpha \big ) p \big ( \alpha \mid \kappa , \eta \big ) \end{aligned}$$
(11)
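
One concrete possibility for carrying out this update (a sketch under our own assumptions, not necessarily the sampler used in Algorithm 1) is a random-walk Metropolis-Hastings step on \(\log \alpha \): in the infinite limit, the likelihood in Eq. (6) contributes a factor proportional to \(\alpha ^{k}\,\Gamma (\alpha )/\Gamma (N+\alpha )\), where k is the number of represented components, and we assume a rate parameterization for the Gamma prior on \(\alpha ^{-1}\) in Eq. (10):

```python
import numpy as np
from scipy.special import gammaln

def log_post_alpha(alpha, k, N, kappa=0.5, eta=2.0):
    # Likelihood term from Eq. (6) in the infinite limit: alpha^k Gamma(alpha) / Gamma(N + alpha).
    log_lik = k * np.log(alpha) + gammaln(alpha) - gammaln(N + alpha)
    # Inverse-Gamma density on alpha implied by Eq. (10) (rate parameterization assumed).
    log_prior = -(kappa + 1.0) * np.log(alpha) - eta / alpha
    return log_lik + log_prior

def sample_alpha(alpha, k, N, step=0.5, rng=np.random.default_rng()):
    """Random-walk Metropolis-Hastings update of alpha on the log scale."""
    prop = np.exp(np.log(alpha) + step * rng.standard_normal())
    # The log-transform Jacobian contributes log(alpha) to each side of the ratio.
    log_ratio = (log_post_alpha(prop, k, N) + np.log(prop)
                 - log_post_alpha(alpha, k, N) - np.log(alpha))
    return prop if np.log(rng.uniform()) < log_ratio else alpha
```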

2.3 Feature saliency

In this section, we introduce the concept of feature saliency and consider the feature selection problem as a parameter estimation problem (Law et al. 2004). It is natural to consider that different features may have different weights for each of the mixture components. Thus, we define feature saliency as the weight of feature importance.

We assume that a feature is relevant if it follows the mixture-dependent AGD; otherwise, it is modeled by a mixture-independent background distribution, for which we adopt a Gaussian assumption. We introduce latent relevance indicators \(\phi _i = (\phi _{i1}, \dots , \phi _{iM})\) with \(\phi _{ij} = (\phi _{ij1}, \dots , \phi _{ijD})\) to represent whether a given feature is relevant or not: the binary indicator \(\phi _{ijk} = 1\) if feature k of observation \(X_i\) is relevant for component j, and \(\phi _{ijk}=0\) otherwise. Thus, we rewrite the probability density function as follows:

$$\begin{aligned} p\big ({\chi } {\mid } {\Theta }, \xi ^{irr}, {\Phi } \big )= & {} \prod _{i=1}^N \sum _{j=1}^M p_j {\prod }_{k=1}^D \big [p\big (X_{ik} {\mid } {\xi }_{j k} \big )^{\phi _k} \nonumber \\&\quad p\big (X_{ik} {\mid } \xi _{j k}^{irr}\big )^{1-\phi _k} \big ] \end{aligned}$$
(12)

where \({\xi }^{irr} = ({\xi }_1^{irr}, \dots , {\xi }_M^{irr})\) represents the set of parameters of the background Gaussian distribution, with \(\xi _j^{irr} = (\mu _j^{irr}, (S_j^{irr})^{-1})\), \(\mu _j^{irr} = (\mu _{j 1}^{irr}, \dots , \mu _{j D}^{irr})\) and \(S_j^{irr} = (S_{j 1}^{irr}, \dots , S_{j D}^{irr})\). \(\mu _{j k}^{irr}\) and \(S_{j k}^{irr}\) represent the mean and precision of the kth-dimensional Gaussian distribution, respectively.

Feature saliency is defined as \(\overrightarrow{\rho } = (\rho _1,\dots ,\rho _M)\) with \(\rho _j = (\rho _{j 1}, \dots , \rho _{j D})\), where \(\rho _{j k} = p\big ( \phi _{j k} = 1\big )\) represents the prior probability that feature k is relevant in mixture component j. We can thus recast the likelihood function after introducing the feature saliency \(\overrightarrow{\rho }\):

$$\begin{aligned} p\big (X_i \mid \Theta _F \big )= & {} \sum _{j=1}^M p_j \prod _{k=1}^D \big ( \rho _{j k} p\big (X_{ik} \mid \xi _{j k} \big ) \nonumber \\&+ \big ( 1 - \rho _{j k} \big ) p\big (X_{ik} \mid \xi _{j k}^{irr}\big )\big ) \end{aligned}$$
(13)

where \(\Theta _F = (\Theta , \overrightarrow{\rho }, \xi ^{irr})\) is the full set of parameters of the mixture model after introducing feature saliency. Eq. (13) offers a sound generative interpretation of our model. First, the model selects component j by sampling from a Multinomial distribution with mixing proportions \((p_1, \dots , p_M)\). Then, for each feature dimension \(k = 1, \dots , D\), a Bernoulli trial with success probability \(\rho _{j k}\) is performed; if it succeeds, the relevant mixture component \(p\big (X_{ik} \mid \xi _{j k} \big )\) generates feature k; otherwise, the background component \(p \big ( X_{ik} \mid \xi _{j k}^{irr} \big )\) is used. The model of the previous section can therefore be viewed as the special case in which all features are relevant.
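
This generative reading can be summarized with the following sketch, in which the mixing proportions, saliencies, and the two per-feature samplers for the relevant AGD and the background Gaussian are assumed to be supplied by the caller (all names are illustrative):

```python
import numpy as np

def generate_observation(p, rho, sample_agd, sample_background,
                         rng=np.random.default_rng()):
    """Draw one observation from the feature-saliency mixture of Eq. (13).

    p                        : mixing proportions, length M
    rho[j, k]                : saliency of feature k in component j
    sample_agd(j, k)         : draws feature k from the relevant AGD of component j
    sample_background(j, k)  : draws feature k from the background Gaussian
    """
    M, D = rho.shape
    j = rng.choice(M, p=p)                         # 1) pick a component
    x = np.empty(D)
    for k in range(D):
        relevant = rng.uniform() < rho[j, k]       # 2) Bernoulli(rho_jk) relevance trial
        x[k] = sample_agd(j, k) if relevant else sample_background(j, k)
    return x
```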

The conditional posteriors of the Dirichlet process mixture can be rewritten, after bringing feature saliency into the model, as:

$$\begin{aligned} p(Z_i=j \mid \ldots ) = {\left\{ \begin{array}{ll} \frac{n_{-i,j}}{N-1+\alpha } \prod _{k=1}^{D} \big (\rho _{j k} p\big (X_{i k} \mid \xi _{j k}\big ) \\ +\big (1-\rho _{j k}) p\big (X_{i k} \mid \xi _{j k}^{irr} \big )) \quad if \, n_{-i,j} > 0 \\ \frac{\alpha }{N-1+\alpha }\int p\big (\xi _j \mid \cdots ) p\big (\xi _j^{irr} \mid \cdots ) \\ \times p\big (X_i \mid \xi _j\big ) d\xi _j \quad if \, n_{-i,j} = 0 \\ \end{array}\right. } \end{aligned}$$
(14)

We use these posteriors to generate new components or to allocate observations to existing ones. For the latent allocation variables \(Z = (Z_1, \dots , Z_N)\), \(p_j = p\big ( Z_i = j\big )\) represents the prior probability that observation \(X_i\) is associated with component j. The posterior probability that observation \(X_i\) is allocated to component j is then:

$$\begin{aligned} p \big ( Z_i= & {} j \mid X_i \big ) = \frac{p_j\, p\big (X_i \mid \Theta _F, Z_i = j \big )}{p\big (X_i \mid \Theta _F \big )} \nonumber \\&\propto p_j\prod _{k=1}^D \big ( \rho _{j k} p\big (X_{i k} \mid \xi _{j k}\big ) + ( 1 - \rho _{j k}) p\big (X_{i k} \mid \xi _{j k}^{irr}\big )\big )\nonumber \\ \end{aligned}$$
(15)

The latent relevancy variable \(\phi _{ij k} \) indicates whether feature k is relevant for component j given observation \(X_i\), and \(\rho _{j k} = p\big ( \phi _{ij k} = 1\big )\) represents the corresponding prior probability. The posterior probability that feature k is relevant for component j, conditioned on \(X_i \), is given by:

$$\begin{aligned}&p \big ( \phi _{ij k} = 1, Z_i = j \mid X_i \big ) = p \big ( Z_i = j \mid X_i \big ) \cdot \nonumber \\&\frac{\rho _{j k} p\big (X_{ik} \mid \xi _{j k} \big )}{\rho _{j k} p\big (X_{ik} \mid \xi _{j k} \big ) + \big (1-\rho _{j k} \big ) p\big (X_{ik} \mid \xi _{j k}^{irr}\big )} \end{aligned}$$
(16)

The posterior for an irrelevant feature is deduced in the same way:

$$\begin{aligned}&p \big ( \phi _{ij k} = 0, Z_i = j \mid X_i \big ) = p \big ( Z_i = j \mid X_i \big ) \cdot \nonumber \\&\frac{\big (1-\rho _{j k} \big ) p\big (X_{ik} \mid \xi _{j k}^{irr}\big )}{\rho _{j k} p\big (X_{ik} \mid \xi _{j k} \big ) + \big (1-\rho _{j k} \big ) p\big (X_{ik} \mid \xi _{j k}^{irr}\big )} \end{aligned}$$
(17)
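
In implementation terms, Eqs. (15)-(17) reduce to the following per-observation computation, sketched here with the relevant and background densities passed in as precomputed arrays (a convenience of our own, not part of the original derivation):

```python
import numpy as np

def responsibilities(p, rho, lik_rel, lik_irr):
    """Posterior allocation and relevance probabilities, Eqs. (15)-(17).

    p        : mixing proportions, shape (M,)
    rho      : feature saliencies, shape (M, D)
    lik_rel  : p(x_k | xi_jk), shape (M, D), relevant (AGD) densities for one observation
    lik_irr  : p(x_k | xi_jk^irr), shape (M, D), background Gaussian densities
    """
    mix = rho * lik_rel + (1.0 - rho) * lik_irr                 # per-feature mixture, Eq. (13)
    post_z = p * np.prod(mix, axis=1)                           # unnormalized Eq. (15)
    post_z /= post_z.sum()
    post_phi1 = post_z[:, None] * (rho * lik_rel) / mix         # Eq. (16)
    post_phi0 = post_z[:, None] * ((1.0 - rho) * lik_irr) / mix # Eq. (17)
    return post_z, post_phi1, post_phi0
```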

The likelihood function of \(\chi \), conditioned on the complete set of mixture parameters, can now be written; it will be used in the Bayesian inference derivations that follow:

$$\begin{aligned} p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big )= & {} \prod _{i=1}^N \prod _{k=1}^D \big [p\big (X_{ik} \mid \xi _{j k} \big )^{\phi _k} \cdot \nonumber \\&\quad p\big (X_{ik} \mid \xi _{j k}^{irr}\big )^{1-\phi _k} \big ] \end{aligned}$$
(18)

3 Non-parametric Bayesian inference

In the Bayesian context, the most important step is the determination of the posteriors used for inference. In this section, we describe a Markov chain Monte Carlo (MCMC) based inference approach to learn the proposed model. The goal of inference is to approximate the posteriors of the parameters, which absorb the information in the data to update the priors. Thus, we define a hierarchical Bayesian model and use conjugacy to develop the appropriate posteriors, and the parameters are then inferred with an MCMC method. The graphical representation is shown in Fig. 1.

3.1 Parameter estimation for \(\mu _{j k}\) and \(\mu _{j k}^{irr}\)

We consider that the relevant and irrelevant mean parameters \(\mu _{j k}\) and \(\mu _{j k}^{irr}\) follow Gaussian priors with common hyperparameters, the means \(\lambda \), \(\lambda ^{irr}\) and precisions r, \(r^{irr}\), respectively, as follows:

$$\begin{aligned}&p\big (\mu _{j k} \mid \lambda , r\big ) \sim \mathcal {N}(\lambda ,\,r^{-1}) \nonumber \\&p\big (\mu _{j k}^{irr} \mid \lambda ^{irr}, r^{irr}\big ) \sim \mathcal {N}(\lambda ^{irr}, \,(r^{irr})^{-1}) \end{aligned}$$
(19)

where the hyperparameters, the mean \(\lambda \) and the precision r, are common to all components in a given dimension k. \(\lambda \) and r are given Gaussian and Gamma priors, respectively, with the following hyperparameters:

$$\begin{aligned} p\big (\lambda \big ) \sim \mathcal {N}\left( e, f\right) \quad \quad p\big (r\big ) \sim \gamma \left( g, h\right) \end{aligned}$$
(20)
Fig. 1 Graphical model representation of IAGM. Symbols in circles denote random variables, while those in squares denote model parameters. Plates indicate repetition (with the number of repetitions in the lower right), and arcs describe the conditional dependencies between the variables

where \(\lambda \), \(\lambda ^{irr}\), r and \(r^{irr}\) have the same prior forms; we omit the duplicated expressions to save space. The conditional posteriors for \(\mu _{j k}\) and \(\mu _{j k}^{irr} \) are obtained by combining the likelihood in Eq. (18) with the priors in Eq. (19):

$$\begin{aligned} p\big (\mu _{j k} \mid \dots )\propto & {} p\big (\mu _{j k} \mid \lambda , r\big ) p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big ) \nonumber \\ p\big (\mu _{j k}^{irr} \mid \dots )\propto & {} p\big (\mu _{j k}^{irr} \mid \lambda ^{irr}, r^{irr}\big ) p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big )\nonumber \\ \end{aligned}$$
(21)

For the posteriors of the hyperparameters \(\lambda \) and r, Eq. (19) plays the role of the likelihood and is combined with the priors in Eq. (20) to obtain:

$$\begin{aligned}&p\big (\lambda \mid \dots ) \propto p\big (\lambda \big ) \prod _{j=1}^M p\big (\mu _{j k} \mid \lambda , r\big )\nonumber \\&p\big (r \mid \dots ) \propto p\big (r\big ) \prod _{j=1}^M p\big (\mu _{j k} \mid \lambda , r\big ) \end{aligned}$$
(22)

3.2 Parameter estimation for \(S_{lj k}\), \(S_{rj k}\) and \(S_{j k}^{irr}\)

The precision parameters \(S_{lj k}\), \(S_{rj k}\) and \(S_{j k}^{irr}\) are endowed with Gamma priors whose hyperparameters \(\beta \) and w are shared across components:

$$\begin{aligned}&p\big (S_{lj k} \mid \beta _l, w_l \big ) \sim \gamma (\beta _l, w_l^{-1}) \nonumber \\&p\big (S_{rj k} \mid \beta _r, w_r \big ) \sim \gamma (\beta _r, w_r^{-1}) \nonumber \\&p\big (S_{j k}^{irr} \mid \beta ^{irr}, w^{irr} \big ) \sim \gamma (\beta ^{irr}, (w^{irr})^{-1}) \end{aligned}$$
(23)

where the hyperparameters \(\beta \) and w are common to all components in a given dimension k. \(\beta \) and w are given inverse Gamma and Gamma priors, respectively, with the following hyperparameters:

$$\begin{aligned} p\big (\beta ^{-1} \big ) \sim \gamma \left( s, t\right) \quad \quad p\big (w\big ) \sim \gamma \left( u, v\right) \end{aligned}$$
(24)

where \(\beta _l\), \(\beta _r\), \(\beta ^{irr}\), \(w_l\), \(w_r\), \(w^{irr}\) have the same prior forms. The conditional posteriors for \(S_{lj k}\), \(S_{rj k}\) and \(S_{j k}^{irr}\) are obtained by combining the likelihood in Eq. (18) and the priors in Eq. (23) as follows:

$$\begin{aligned}&p\big (S_{lj k} \mid \dots ) \propto p\big (S_{lj k} \mid \beta _l, w_l\big ) p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big ) \nonumber \\&p\big (S_{rj k} \mid \dots ) \propto p\big (S_{rj k} \mid \beta _r, w_r\big ) p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big ) \nonumber \\&p\big (S_{j k}^{irr} \mid \dots ) \propto p\big (S_{j k}^{irr} \mid \beta ^{irr}, w^{irr}\big ) p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big ) \end{aligned}$$
(25)

For the posteriors of the hyperparameters \(\beta \) and w, Eq. (23) plays the role of the likelihood and is combined with the priors in Eq. (24) to obtain the following:

$$\begin{aligned}&p\big (\beta \mid \dots ) \propto p\big (\beta \big ) \prod _{j=1}^M p\big (S_{j k} \mid \beta , w\big ) \nonumber \\&p\big (w \mid \dots ) \propto p\big (w \big ) \prod _{j=1}^M p\big (S_{j k} \mid \beta , w\big ) \end{aligned}$$
(26)

3.3 Parameter estimation for \(\rho \)

The feature saliency \(\rho _{j k}\) has support over [0, 1] and is naturally given a Beta prior with common hyperparameters a and b, as follows:

$$\begin{aligned} p\big (\rho _{j k}\mid a, b) \sim \text {Beta}\left( a, b\right) \end{aligned}$$
(27)

where the shape hyperparameters a and b are common to all components and follow Gamma priors:

$$\begin{aligned} p\big (a \big ) \sim \gamma \left( \delta _1, \delta _2\right) \quad \quad p\big (b \big ) \sim \gamma \left( \varphi _1, \varphi _2\right) \end{aligned}$$
(28)

We assume that the latent relevancy indicators \(\phi _{i j k} \) follow a Bernoulli distribution with probability \(\rho _{j k}\), so we have:

$$\begin{aligned} p\big (\phi _{j k} \mid \rho _{j k}) \sim \prod _{i=1}^N \rho _{j k}^{\phi _{i j k}} (1 -\rho _{j k})^{(1 - \phi _{i j k})} = \rho _{j k}^{n_{j k}} (1-\rho _{j k})^{N - n_{j k}} \end{aligned}$$
(29)

where \(n_{j k} = \sum _{i=1}^N I_{\phi _{i j k} = 1} \) is the number of observations for which feature k is relevant in component j. Considering Eq. (29) as the likelihood, we obtain the conditional posterior by multiplying it with the prior in Eq. (27):

$$\begin{aligned} p \big ( \rho _{j k} \mid \dots \big ) \propto p\big (\phi _{j k} \mid \rho _{j k}) p\big (\rho _{j k}\mid a, b) \end{aligned}$$
(30)
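
Since the Beta prior of Eq. (27) is conjugate to the Bernoulli likelihood of Eq. (29), this posterior takes a closed form from which \(\rho _{j k}\) can be sampled directly within the Gibbs sweep:

$$\begin{aligned} p \big ( \rho _{j k} \mid \dots \big ) \sim \text {Beta}\big (a + n_{j k},\; b + N - n_{j k}\big ) \end{aligned}$$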

The conditional posteriors of the hyperparameters a and b can then be obtained by combining Eq. (27) and Eq. (28) as follows:

$$\begin{aligned} p\big (a \mid \dots )\propto & {} p\big (a \big ) \prod _{j=1}^M p\big (\rho _{j k}\mid a, b) \nonumber \\ p\big (b \mid \dots )\propto & {} p\big (b \big ) \prod _{j=1}^M p\big (\rho _{j k}\mid a, b) \end{aligned}$$
(31)

3.4 Complete algorithm

Following the inference approach above, we propose an MCMC-based algorithm for inferring our hierarchical Bayesian mixture model. Among Monte Carlo methods, Gibbs sampling is one of the most popular and is widely used for sampling complicated posteriors. We also use the Metropolis-Hastings algorithm to sample from non-standard posteriors. The Gibbs sequence converges to the joint posterior distribution. The procedure is summarized in Algorithm 1 and sketched below.
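
As a schematic of how the conditional posteriors derived above fit together, one iteration of the sampler can be organized as follows; the block names and the `samplers` mapping are illustrative placeholders for the conjugate draws and Metropolis-Hastings steps of the corresponding conditionals, not the authors' implementation:

```python
def gibbs_iteration(X, state, samplers, rng):
    """One sweep of the hierarchical sampler (schematic only).

    `samplers` maps each block of unknowns to a user-supplied function that draws
    from its conditional posterior (a conjugate draw or a Metropolis-Hastings step).
    """
    for block in ("Z_Phi",        # allocations and relevance indicators, Eqs. (14), (16)-(17)
                  "mu",           # AGD and background means, Eq. (21)
                  "S",            # left/right/background precisions, Eq. (25)
                  "rho_a_b",      # feature saliencies and their hyperparameters, Eqs. (30)-(31)
                  "lambda_r",     # shared mean hyperparameters, Eq. (22)
                  "beta_w",       # shared precision hyperparameters, Eq. (26)
                  "alpha"):       # DP concentration parameter, Eq. (11)
        state = samplers[block](X, state, rng)
    return state
```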


4 Experimental results

In this section, we validate our algorithm on two challenging applications: dynamic textures clustering and scene categorization. We compare our results with multiple state-of-the-art methods for these applications. The hyperparameters are chosen as \(e = \mu _y\), \(f = \sigma ^2\), \(g = 2\), \(h = \frac{2}{\sigma ^2}\), \(s = 0.5\), \(t = 2\), \(u = 0.5\), \(v = \frac{2}{\sigma ^2}\), \(\delta _1 = 2\), \(\delta _2 = 0.5\), \(\varphi _1 = 2\), \(\varphi _2 = 0.5\), \(\kappa = 0.5\), and \(\eta = 2\), where \(\mu _y\) and \(\sigma ^2\) are the mean and variance of the observations.

4.1 Dynamic textures clustering

Dynamic textures are the temporal extension of spatial textures, defined as sequences of images of moving scenes that exhibit certain stationarity properties in time (sea-waves, smoke, foliage, whirlwind) (Doretto et al. 2003). Dynamic textures have drawn tremendous attention in recent years due to their applications in several domains of image processing and pattern recognition, such as motion classification, video registration, and computer games (Fan and Bouguila 2013, 2015). In our experiment, we apply the proposed IAGM with simultaneous feature selection to cluster dynamic textures represented by LBP-TOP features.

Fig. 2 Sample frames from the DynTex database

We carry out our experiments on the challenging DynTex dynamic textures dataset (Péteri et al. 2010) to evaluate the performance of the algorithm. This dataset contains over 650 dynamic texture video sequences from several categories. In our case, we use a subset of video sequences from 8 different categories: candle, flag, flower, fountain, grass, sea, smoke and tree. Each category has about 20 video sequences; sample frames from each category are shown in Fig. 2. As a preprocessing step, we extract LBP-TOP descriptors from the selected video sequences. In our experiment, we adopt the parameter choice of 4,4,4,1,1,1 as suggested in Zhao and Pietikainen (2007). This setting of the LBP-TOP descriptor achieves good performance while providing a comparatively short 48-dimensional feature vector.

The obtained features are modeled using the proposed IAGM algorithm. In order to evaluate the performance of the proposed method, we compare our approach with four other methods: the infinite Beta-Liouville mixture, the infinite generalized Dirichlet mixture, the infinite Dirichlet mixture, and the infinite Gaussian mixture model. We run each approach 30 times and report the average results. The average clustering accuracies are given in Table 1, and Fig. 3 shows the confusion matrix for the dataset using IAGM with feature selection. According to the results, the IAGM with feature selection approach outperforms the other four methods with the highest categorization accuracy rate (87.02%). It shows a clear improvement over the other methods because it successfully distinguishes 6 of the categories, leading to a higher overall accuracy.

The results of dynamic texture clustering demonstrate the advantage of applying mixture models that incorporate the asymmetry of the observations when modeling data with non-standard shapes. Meanwhile, performing feature selection simultaneously allows the model to account for background noise while accurately representing the important features, which contributes to the better performance.

Table 1 Average accuracy of different algorithms for dynamic textures clustering
Fig. 3 Confusion matrix of the IAGM with feature selection for the DynTex database

4.2 Scene categorization

Humans are proficient at perceiving, recognizing and understanding natural scenes, and there has been growing interest in developing machines that simulate these human vision capabilities. The representation of scene images has drawn considerable interest in recent years. In this section, we apply our proposed algorithm to the challenging scene categorization task. Our approach consists of three parts: feature extraction, image representation, and scene classification.

Fig. 4 Confusion matrix of the IAGM with feature selection for the UIUC sport event dataset

Table 2 Average accuracy of different algorithms for scene categorization

In this application, we use the UIUC sports event dataset (Li and Fei-Fei 2007) to validate the performance of our algorithm. This dataset consists of 8 different sport event classes: rowing (250 images), badminton (200 images), polo (182 images), bocce (137 images), snowboarding (190 images), croquet (236 images), sailing (190 images), and rock climbing (194 images). Fig. 5 demonstrates its diverse nature.

We represent each image by a collection of local image patches. Specifically, we adopt scale-invariant feature transform (SIFT) descriptors of 16 \(\times \) 16 pixel patches computed over a grid with a spacing of 8 pixels. We then employ the bag of visual words (BoVW) approach to obtain an overall representation of each image: the k-means algorithm clusters the training descriptors into a vocabulary of V visual words, and each SIFT keypoint is assigned to its nearest visual word in the codebook. Thus, each image can be represented as a frequency histogram over the V visual words, as sketched below. We then use the IAGM with feature selection model to classify the processed data. For each sport event class, we randomly select 70 images as a training set and 60 images as a testing set. We run our proposed algorithm 30 times to obtain average accuracy results for comparison.
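
The bag-of-visual-words step can be sketched as follows, assuming the SIFT descriptors have already been extracted (e.g. with a standard computer vision library); the function name, the default vocabulary size and the k-means settings are our illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_bovw_histograms(train_descriptors, image_descriptors, V=500, seed=0):
    """Quantize SIFT descriptors into a V-word codebook and return per-image histograms.

    train_descriptors : array (n_train_descriptors, 128) pooled from the training images
    image_descriptors : list of arrays, one (n_i, 128) array of descriptors per image
    """
    codebook = KMeans(n_clusters=V, n_init=10, random_state=seed).fit(train_descriptors)
    histograms = np.zeros((len(image_descriptors), V))
    for i, desc in enumerate(image_descriptors):
        words = codebook.predict(desc)                  # nearest visual word per keypoint
        counts = np.bincount(words, minlength=V)
        histograms[i] = counts / max(counts.sum(), 1)   # normalized frequency histogram
    return codebook, histograms
```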

Fig. 5 Sample images from the UIUC sport event dataset. The samples show the diversity of the backgrounds and the complexity of the scene content

In order to demonstrate the advantages of our algorithm, we compare our model with a number of state-of-the-art approaches in this area. These approaches include the Gaussian mixture model with the Expectation Maximization algorithm (GMM-EM) (Law et al. 2004), the Gaussian mixture model with Rival Penalized Expectation Maximization (GMM-RPEM) (Cheung and Zeng 2006), GIST (Oliva and Torralba 2001), multi-class supervised Latent Dirichlet Allocation and multi-class supervised Latent Dirichlet Allocation with annotations (probabilistic) (Wang et al. 2009), spatial pyramid matching (SPM) (Lazebnik et al. 2006), bag of keypoints (BOK) (Csurka et al. 2004), maximum likelihood estimation Scene (MLE-Scene) and Max-Margin Scene (MM-Scene) (Zhu et al. 2010). The evaluation results are shown in Table 2, and Fig. 4 displays the confusion matrix for IAGM with feature selection applied to the sport event dataset.

We can observe from these results that our proposed IAGM with simultaneous feature selection outperforms the other approaches under consideration and provides better average accuracy for the scene categorization task.

5 Conclusion

In this paper, we have presented a Dirichlet process mixture model capable of approximating asymmetric Gaussian distributed data, automatically determining the number of components, and simultaneously performing feature selection for clustering high-dimensional data. The asymmetric Gaussian assumption is supported by the fact that natural scene data are usually not Gaussian distributed. The Dirichlet process allows the number of components to grow to infinity, and the resulting infinite mixture model offers a flexible representation and a straightforward interpretation. Within the Bayesian framework, identifying relevant features and inferring parameters are unified in a single procedure. We have demonstrated the strong performance of our algorithm on both dynamic textures clustering and scene categorization tasks.

Although MCMC-based Bayesian inference provides a clear posterior sampling approach, it also brings a heavier computational cost. A possible direction for future work is the development of a variational inference based learning approach for the proposed model, which would scale to massive datasets and substantially reduce the computation time.