1 Introduction

Clustering is a common unsupervised learning methodology for data analysis and has been widely used for uncovering hidden patterns within data. One extensively studied approach in statistical modeling is the mixture model, which clusters data into homogeneous subgroups by representing the whole model as a weighted sum of the subpopulations' densities. Owing to their flexible representation and interpretable results, mixture models are adopted in many applications from different domains.

A common assumption when using mixture models for statistical analysis is that the per-component densities are Gaussian (Park et al. 2013). However, the Gaussian distribution is not always an appropriate choice since the shape of the observations may not be symmetric. This is especially the case in natural images, where the density may be far from Gaussian (Hyvärinen and Hoyer 2000; Laptev 2009; Boutemedjet et al. 2010; Elguebaly and Bouguila 2014). Some evolving systems have been proposed for this problem (Andonovski et al. 2018; Škrjanc et al. 2019). To achieve a better approximation, we investigate the use of the asymmetric Gaussian distribution (AGD), which is capable of modeling asymmetric data: the AGD has left and right variance parameters that control the shape of each side of the mode to better capture the asymmetry of the data (Elguebaly and Bouguila 2011; Song et al. 2019).

Parameter estimation is one of the challenges in using mixture models, and various algorithms have been studied for this purpose. The expectation maximization (EM) algorithm is a well-known approach to this problem (Bouguila and Ziou 2006). Nevertheless, EM is a deterministic approach that is not guaranteed to reach a globally optimal solution because of its sensitivity to initialization and its tendency to overfit. Instead, Bayesian inference may be used, and it has been extensively studied in mixture modeling (Channoufi et al. 2018; Elguebaly and Bouguila 2014). It provides a strong theoretical framework for designing clustering algorithms as well as a formal approach to incorporating prior knowledge about the problem. The authors in Fu and Bouguila (2018) recently studied Bayesian learning of the asymmetric Gaussian mixture model, implementing Markov chain Monte Carlo (MCMC) methods that remove the dependency between the mixture parameters and components to address over-fitting problems.

Several studies have been devoted to the automatic selection of the number of components that best describes the observations. We introduce the Dirichlet process to address this problem since it leads to a mixture model with an unbounded number of components (Antoniak 1974). This can also be considered a nonparametric Bayesian approach since it allows the number of components to grow as required to fit the data (Griffin and Steel 2010). In this paper, we are interested in Bayesian nonparametric approaches for modeling, particularly models based on the Dirichlet process (Bouguila and Ziou 2012). The Dirichlet process allows the number of latent components to grow as necessary to fit the data, while the individual components still follow parametric distributions. We thus address the prevalent problem of choosing the correct number of mixture components by using the Dirichlet process to extend the finite mixture model to an infinite one, and we apply a hierarchical Bayesian learning technique to the proposed infinite asymmetric Gaussian mixture model (IAGM).

In theory, the more features used to represent data, the better a clustering algorithm is expected to perform. In practice, however, some features can be noisy, redundant, or uninformative and can thus hinder clustering performance (Boutemedjet et al. 2009; Bouguila 2009). The presence of many irrelevant features introduces a bias and renders homogeneity measures unreliable (Elguebaly and Bouguila 2015). A viable solution is to remove irrelevant features by identifying those most useful to the trained model. The process of reducing the collected features to a relevant subset is known as feature selection. It can increase the performance of models by eliminating noise in the data, improving model interpretation and decreasing the risk of overfitting. Feature selection methods can be broadly divided into three groups: filters, wrappers, and embedded methods (Adams and Beling 2017).

Filter approaches treat feature selection as a preprocessing step in which the relevance of each feature is evaluated using the dataset alone. Thus, filters only consider the properties of the features, regardless of the model. The authors in Krishnan et al. (1996) propose a trimming feature selection technique specific to mixture models based on the Fisher ratio. However, this method neither iterates through the feature space nor simultaneously estimates model parameters and feature subsets. Wrapper approaches, on the other hand, evaluate feature relevance with regard to the model. In most cases, a model is built from a subset of features and its performance is evaluated according to specified criteria; wrappers then move through the subset space, evaluating feature subsets with the evaluation function. The reader is referred to Galimberti et al. (2018) and Marbac and Sedki (2017) for further details about wrapper approaches.

Embedded methods simultaneously select features and construct models. Penalized model-based clustering (Pan and Shen 2007; Bouveyron and Brunet-Saumard 2014) and Bayesian methods (Gustafson et al. 2003; Wang and Zhu 2008) are extensively used in many applications. Feature saliency approaches treat feature selection as a parameter estimation problem and recast the probability distribution into mixture-dependent and mixture-independent parts (Elguebaly and Bouguila 2012; Law et al. 2004). Feature saliencies are added as new parameters to the conditional distribution of the mixture model and used to find clusters embedded in feature subspaces. Because a feature saliency represents the probability of belonging to the mixture-dependent distribution, it can be interpreted as the probability that a feature is relevant. In this paper, we propose a feature saliency measure and integrate it into the Bayesian inference framework. Our approach focuses on detecting cluster structure and discriminating feature relevance simultaneously through Bayesian learning.

To summarize, in this paper we propose a Bayesian inference approach for infinite asymmetric Gaussian mixture (IAGM) models with a simultaneous feature selection framework. The proposed approach fits asymmetric data better than the traditionally applied Gaussian mixture models. The extension to an infinite number of mixture components allows the number of clusters to be estimated as required by the data. The simultaneous feature selection further improves the approximation by choosing the features that best represent the data, enhancing the ability of the clustering task to separate the different classes. A potential drawback of the model is its computational complexity, which can be mitigated with modern computational resources.

The remainder of this paper is organized as follows. Sect. 2 outlines the asymmetric Gaussian mixture model, sets up the Dirichlet process, and describes the feature selection algorithm. Sect. 3 presents the Bayesian inference process and the complete algorithm for our model. In Sect. 4, we validate the model on dynamic textures clustering and image categorization tasks and compare it with a number of state-of-the-art methods. Finally, Sect. 5 concludes the paper.

2 Infinite asymmetric Gaussian mixture model

In this section, we introduce IAGM with the feature saliency algorithm. We first present the finite mixture model and then extend it to an infinite one. We also introduce the concept of feature saliency and present our model combined with a feature selection algorithm.

2.1 Finite mixture model

Assume we have a dataset of N observations \(\chi = (X_1,\dots ,X_N) \), where each observation \(X_i = (X_{i1}, \dots , X_{iD})\) is a D-dimensional random variable following the asymmetric Gaussian distribution (AGD). The probability density function of the dataset \(\chi \) can be written as:

$$\begin{aligned} p\big (\chi \mid \Theta \big ) = \prod _{i=1}^N \sum _{j=1}^M p_j p\big (X_i \mid \xi _j\big ) \end{aligned}$$
(1)

where \(\Theta =(p_1, \dots , p_M, \xi _1, \dots , \xi _M)\) represents the complete set of parameters fully characterizing the mixture model, M is the number of components, \(\overrightarrow{p} = (p_1, \dots , p_M)\) represents the mixing proportions which must be positive and sum to one, and \(\xi _j\) represents the AGD parameters for mixture component j.

Given AGD parameters for mixture component j, the AGD density function is defined as:

$$\begin{aligned} p \big (X_i \mid \xi \big )&\propto \prod _{k=1}^D \frac{1}{ (S_{lj k})^{-\frac{1}{2}} + (S_{rj k})^{-\frac{1}{2}}} \nonumber \\&\quad \times {\left\{ \begin{array}{ll} \exp \big [-\frac{S_{lj k} (X_{i k} - \mu _{j k})^2}{2}\big ]\quad if \, X_{i k} <\mu _{j k} \\ \exp \big [-\frac{S_{rj k} (X_{i k} - \mu _{j k})^2}{2}\big ]\quad if \, X_{i k} \ge \mu _{j k} \end{array}\right. } \end{aligned}$$
(2)

where \(\xi _j = (\mu _j,\, S_{lj},\, S_{rj}) \) is the set of parameters of the AGD, with \(\mu _j = (\mu _{j1}, \dots , \mu _{jD})\), \(S_{lj} = (S_{lj 1}, \dots , S_{lj D})\), and \(S_{rj} = (S_{rj 1}, \dots , S_{rj D})\). \(\mu _{j k}\), \(S_{lj k}\) and \(S_{rj k}\) are the mean, the left precision and the right precision of the kth-dimensional distribution (Fu and Bouguila 2018), respectively. In this paper, we assume that the dimensions of each observation \(X_i\) are independent; hence, the covariance matrix is diagonal. This assumption reduces the computational cost during processing and deployment.
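
For concreteness, the per-observation log-density of Eq. (2) can be evaluated as in the following minimal Python sketch (the function name and vectorized form are ours, not part of a reference implementation):

```python
import numpy as np

def agd_logpdf(x, mu, s_l, s_r):
    """Log-density of a D-dimensional AGD (Eq. 2), up to an additive constant.

    x, mu, s_l, s_r are length-D arrays; s_l and s_r are the left and right
    precisions. Dimensions are treated as independent (diagonal covariance).
    """
    x, mu, s_l, s_r = (np.asarray(a, dtype=float) for a in (x, mu, s_l, s_r))
    # Normalizing factor per dimension: 1 / (S_l^{-1/2} + S_r^{-1/2}).
    log_norm = -np.log(s_l ** -0.5 + s_r ** -0.5)
    # Use the left precision below the mode and the right precision above it.
    prec = np.where(x < mu, s_l, s_r)
    return float(np.sum(log_norm - 0.5 * prec * (x - mu) ** 2))
```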

We introduce latent indicator variables \(Z = (Z_1, \dots , Z_N)\), where \(Z_i\) indicates which mixture component observation i belongs to. \(Z_i = (Z_{i1}, \dots , Z_{iM})\), where the hidden label \(Z_{ij}\) is set to 1 when observation \(X_i\) is allocated to component j and to 0 otherwise. The likelihood function of the mixture model is then given by:

$$\begin{aligned} p \big (\chi \mid Z, \Theta \big ) = \prod _{i=1}^N \prod _{j=1}^M p\big (X_i \mid \xi _j\big )^{Z_{ij}} \end{aligned}$$
(3)

For the mixing weights, \(p_j = p(Z_i = j)\), \(j = 1, \dots , M\), indicates the probability that an observation \(X_i\) is associated with component j. Hence, the missing allocation variables Z are given a Multinomial prior as follows:

$$\begin{aligned} p \big (Z \mid \overrightarrow{p}) \sim \text {Multi}\left( \overrightarrow{p}\right) = \prod _{j=1}^M p_j^{n_j} \end{aligned}$$
(4)

where \(n_j = \sum _{i=1}^N I_{Z_i=j} \) is the number of observations allocated to component j, and I is the indicator function. The mixing proportions are assumed to follow a symmetric Dirichlet prior with concentration parameter \(\frac{\alpha }{M} \) (Rasmussen 1999), so that all components share an equal prior probability. This can be written as follows:

$$\begin{aligned} p\big (\overrightarrow{p} \mid \alpha \big ) \!\sim \! Dirichlet\left( \frac{\alpha }{M },\ldots ,\frac{\alpha }{M }\right) \!=\! \frac{\Gamma (\alpha )}{\Gamma (\frac{\alpha }{M })^M } \prod _{j=1}^M p_j^{\frac{\alpha }{M} - 1} \end{aligned}$$
(5)

The Dirichlet distribution is a conjugate prior of the Multinomial distribution. Due to the conjugacy of Z and \(\overrightarrow{p}\), we can achieve better inference by integrating out \(\overrightarrow{p}\) to obtain the prior of Z given the hyperparameter \(\alpha \), and then directly infer the distribution of the latent variables Z:

$$\begin{aligned} p\big (Z \mid \alpha \big )&= \int p\big (Z \mid \overrightarrow{p}\big ) p\big (\overrightarrow{p} \mid \alpha \big ) d\overrightarrow{p} \nonumber \\&= \frac{\Gamma (\alpha )}{\Gamma (N + \alpha )} \prod _{j=1}^M\frac{\Gamma (\frac{\alpha }{M} + n_j)}{\Gamma (\frac{\alpha }{M})} \end{aligned}$$
(6)

To use the Gibbs sampling technique, it is required to obtain the conditional prior for a single allocation variable \(Z_i\) given all the others. Keeping all the other indicators fixed, we obtain the following conditional prior:

$$\begin{aligned} p(Z_i=j \mid \alpha , Z_{-i}) = \frac{n_{-i,j} + \frac{\alpha }{M}}{N-1+\alpha } \end{aligned}$$
(7)

where the subscript \(-i\) indicates all indexes except i, \(Z_{-i} = (Z_1, \dots , Z_{i-1}, Z_{i+1}, \dots , Z_N) \), and \(n_{-i, j} \) is the number of observations, excluding \(X_i \), that are allocated to component j.

2.2 Infinite mixture model

We now extend the finite mixture model proposed in the last section to an infinite mixture model by letting the number of components \(M \rightarrow \infty \) and updating the conditional priors of the indicators in Eq. (7). This is achieved by introducing the Dirichlet process (Blei and Jordan 2006):

$$\begin{aligned} p \big (Z_i=j \mid \alpha , Z_{-i}\big ) = {\left\{ \begin{array}{ll} \frac{n_{-i,j}}{N-1+\alpha } \quad if \, n_{-i,j} > 0 \\ \frac{\alpha }{N-1+\alpha } \quad if \, n_{-i,j} = 0 \end{array}\right. } \end{aligned}$$
(8)

where \(n_{-i,j} > 0\) indicates that component j is represented. Thus, an observation \(X_i \) is allocated to an existing component with probability proportional to the number of observations already associated with that component, whereas the probability of creating a new component depends only on the concentration parameter \(\alpha \) and the number of observations N. Given these priors, the conditional posteriors are obtained by combining the priors with the likelihood:

$$\begin{aligned} p(Z_i=j \mid \ldots ) = {\left\{ \begin{array}{ll} \frac{n_{-i,j}}{N-1+\alpha } \prod _{k=1}^{D} p \big (X_{i k} \mid \xi _{j k}\big ) \\ \quad \quad \quad if \,\, n_{-i,j} > 0 \\ \frac{\alpha }{N-1+\alpha }\int p\big (\xi _j \mid \cdots ) p\big (X_i \mid \xi _j\big ) d\xi _j \\ \quad \quad \quad if \,\, n_{-i,j} = 0 \end{array}\right. } \end{aligned}$$
(9)

where the conditional posterior of an unrepresented component is obtained by integrating over the hyperparameters, and this integral is analytically intractable. To infer these intractable posteriors, we adopt Algorithm 2 of Neal (2000), which uses sampling to approximate the desired distribution.
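
To illustrate how Eq. (9) is used in practice, the sketch below performs one generic Gibbs sweep over the allocation variables; `component_loglik` and `new_component_loglik` are hypothetical callbacks standing in for the represented-component density and for the sampled approximation of the intractable integral, and the bookkeeping for newly created components (including drawing their parameters) is deliberately omitted:

```python
import numpy as np

def allocation_sweep(X, Z, alpha, component_loglik, new_component_loglik,
                     rng=np.random.default_rng()):
    """One Gibbs sweep over the allocations Z following Eq. (9) (schematic).

    component_loglik(x, j)  -> log p(x | xi_j) for a represented component j
    new_component_loglik(x) -> log of the approximated integral for a new component
    """
    for i in range(len(X)):
        Z[i] = -1                                  # remove x_i before resampling
        labels, counts = np.unique(Z[Z >= 0], return_counts=True)
        # Existing components: n_{-i,j} * p(x_i | xi_j); new component: alpha * integral.
        log_w = [np.log(n) + component_loglik(X[i], j) for j, n in zip(labels, counts)]
        log_w.append(np.log(alpha) + new_component_loglik(X[i]))
        log_w = np.array(log_w) - np.max(log_w)    # the 1/(N-1+alpha) factor cancels
        w = np.exp(log_w)
        choice = rng.choice(len(w), p=w / w.sum())
        Z[i] = labels[choice] if choice < len(labels) else (labels.max() + 1 if len(labels) else 0)
    return Z
```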

Concerning the concentration parameter \(\alpha \), we place an inverse Gamma prior on \(\alpha \) with parameters \(\kappa \) and \(\eta \):

$$\begin{aligned} p \big ( \alpha ^{-1} \mid \kappa , \eta \big ) \sim \Gamma \left( \kappa , \eta \right) \end{aligned}$$
(10)

Given the likelihood of \(\alpha \) in Eq. (6), we obtain the conditional posterior for \(\alpha \), which depends on the number of observations N and the number of components M:

$$\begin{aligned} p \big (\alpha \mid Z, \kappa , \eta \big ) \propto p\big (Z \mid \alpha \big ) p \big ( \alpha \mid \kappa , \eta \big ) \end{aligned}$$
(11)
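
One concrete possibility for carrying out this update (a sketch under our own assumptions, not necessarily the sampler used in Algorithm 1) is a random-walk Metropolis-Hastings step on \(\log \alpha \): in the infinite limit, the likelihood in Eq. (6) contributes a factor proportional to \(\alpha ^{k}\,\Gamma (\alpha )/\Gamma (N+\alpha )\), where k is the number of represented components, and we assume a rate parameterization for the Gamma prior on \(\alpha ^{-1}\) in Eq. (10):

```python
import numpy as np
from scipy.special import gammaln

def log_post_alpha(alpha, k, N, kappa=0.5, eta=2.0):
    # Likelihood term from Eq. (6) in the infinite limit: alpha^k Gamma(alpha) / Gamma(N + alpha).
    log_lik = k * np.log(alpha) + gammaln(alpha) - gammaln(N + alpha)
    # Inverse-Gamma density on alpha implied by Eq. (10) (rate parameterization assumed).
    log_prior = -(kappa + 1.0) * np.log(alpha) - eta / alpha
    return log_lik + log_prior

def sample_alpha(alpha, k, N, step=0.5, rng=np.random.default_rng()):
    """Random-walk Metropolis-Hastings update of alpha on the log scale."""
    prop = np.exp(np.log(alpha) + step * rng.standard_normal())
    # The log-transform Jacobian contributes log(alpha) to each side of the ratio.
    log_ratio = (log_post_alpha(prop, k, N) + np.log(prop)
                 - log_post_alpha(alpha, k, N) - np.log(alpha))
    return prop if np.log(rng.uniform()) < log_ratio else alpha
```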

2.3 Feature saliency

In this section, we introduce the concept of feature saliency and consider the feature selection problem as a parameter estimation problem (Law et al. 2004). It is natural to consider that different features may have different weights for each of the mixture components. Thus, we define feature saliency as the weight of feature importance.

We assume that a feature is relevant if it follows the mixture-dependent AGD; otherwise, it is modeled by a mixture-independent background distribution, for which we adopt a Gaussian assumption. We introduce latent relevance indicators \(\phi _i = (\phi _{i1}, \dots , \phi _{iM})\) with \(\phi _{ij} = (\phi _{ij1}, \dots , \phi _{ijD})\) to represent whether a given feature is relevant or not: the binary indicator \(\phi _{ijk} = 1\) if feature k of observation \(X_i\) is relevant for component j, and \(\phi _{ijk}=0\) otherwise. Thus, we rewrite the probability density function as follows:

$$\begin{aligned} p\big ({\chi } {\mid } {\Theta }, \xi ^{irr}, {\Phi } \big )= & {} \prod _{i=1}^N \sum _{j=1}^M p_j {\prod }_{k=1}^D \big [p\big (X_{ik} {\mid } {\xi }_{j k} \big )^{\phi _k} \nonumber \\&\quad p\big (X_{ik} {\mid } \xi _{j k}^{irr}\big )^{1-\phi _k} \big ] \end{aligned}$$
(12)

where \({\xi }^{irr} = ({\xi }_1^{irr}, \dots , {\xi }_M^{irr})\) represents the set of parameters of the background Gaussian distribution, with \(\xi _j^{irr} = (\mu _j^{irr}, (S_j^{irr})^{-1})\), \(\mu _j^{irr} = (\mu _{j 1}^{irr}, \dots , \mu _{j D}^{irr})\) and \(S_j^{irr} = (S_{j 1}^{irr}, \dots , S_{j D}^{irr})\). \(\mu _{j k}^{irr}\) and \(S_{j k}^{irr}\) represent the mean and precision of the kth-dimensional Gaussian distribution, respectively.

Feature saliency is defined as \(\overrightarrow{\rho } = (\rho _1,\dots ,\rho _M)\) with \(\rho _j = (\rho _{j 1}, \dots , \rho _{j D})\), where \(\rho _{j k} = p\big ( \phi _{j k} = 1\big )\) represents the prior probability that feature k is relevant in mixture component j. We can thus recast the likelihood function after introducing the feature saliency \(\overrightarrow{\rho }\):

$$\begin{aligned} p\big (X_i \mid \Theta _F \big )= & {} \sum _{j=1}^M p_j \prod _{k=1}^D \big ( \rho _{j k} p\big (X_{ik} \mid \xi _{j k} \big ) \nonumber \\&+ \big ( 1 - \rho _{j k} \big ) p\big (X_{ik} \mid \xi _{j k}^{irr}\big )\big ) \end{aligned}$$
(13)

where \(\Theta _F = (\Theta , \overrightarrow{\rho }, \xi ^{irr})\) is the full set of parameters of the mixture model after introducing feature saliency. Eq. (13) offers a sound generative interpretation of our model. First, the model selects component j by sampling from a Multinomial distribution with mixing proportions \((p_1, \dots , p_M)\). Then, for each feature dimension \(k = 1, \dots , D\), a Bernoulli trial with success probability \(\rho _{j k}\) is performed; if it succeeds, the relevant mixture component \(p\big (X_{ik} \mid \xi _{j k} \big )\) generates feature k; otherwise, the background component \(p \big ( X_{ik} \mid \xi _{j k}^{irr} \big )\) is used. The model of the previous section can therefore be viewed as the special case in which all features are relevant.
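
This generative reading can be summarized with the following sketch, in which the mixing proportions, saliencies, and the two per-feature samplers for the relevant AGD and the background Gaussian are assumed to be supplied by the caller (all names are illustrative):

```python
import numpy as np

def generate_observation(p, rho, sample_agd, sample_background,
                         rng=np.random.default_rng()):
    """Draw one observation from the feature-saliency mixture of Eq. (13).

    p                        : mixing proportions, length M
    rho[j, k]                : saliency of feature k in component j
    sample_agd(j, k)         : draws feature k from the relevant AGD of component j
    sample_background(j, k)  : draws feature k from the background Gaussian
    """
    M, D = rho.shape
    j = rng.choice(M, p=p)                         # 1) pick a component
    x = np.empty(D)
    for k in range(D):
        relevant = rng.uniform() < rho[j, k]       # 2) Bernoulli(rho_jk) relevance trial
        x[k] = sample_agd(j, k) if relevant else sample_background(j, k)
    return x
```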

The conditional posteriors of the Dirichlet process mixture can be rewritten, after bringing feature saliency into the model, as:

$$\begin{aligned} p(Z_i=j \mid \ldots ) = {\left\{ \begin{array}{ll} \frac{n_{-i,j}}{N-1+\alpha } \prod _{k=1}^{D} \big (\rho _{j k} p\big (X_{i k} \mid \xi _{j k}\big ) \\ +\big (1-\rho _{j k}) p\big (X_{i k} \mid \xi _{j k}^{irr} \big )) \quad if \, n_{-i,j} > 0 \\ \frac{\alpha }{N-1+\alpha }\int p\big (\xi _j \mid \cdots ) p\big (\xi _j^{irr} \mid \cdots ) \\ \times p\big (X_i \mid \xi _j\big ) d\xi _j \quad if \, n_{-i,j} = 0 \\ \end{array}\right. } \end{aligned}$$
(14)

We use these posteriors to generate new components or to allocate observations to existing ones. For the latent allocation variables \(Z = (Z_1, \dots , Z_N)\), \(p_j = p\big ( Z_i = j\big )\) represents the prior probability that observation \(X_i\) is associated with component j. The posterior probability that observation \(X_i\) is allocated to component j is then:

$$\begin{aligned} p \big ( Z_i= & {} j \mid X_i \big ) = \frac{p_j\, p\big (X_i \mid \Theta _F, Z_i = j \big )}{p\big (X_i \mid \Theta _F \big )} \nonumber \\&\propto p_j\prod _{k=1}^D \big ( \rho _{j k} p\big (X_{i k} \mid \xi _{j k}\big ) + ( 1 - \rho _{j k}) p\big (X_{i k} \mid \xi _{j k}^{irr}\big )\big )\nonumber \\ \end{aligned}$$
(15)

The latent relevancy variable \(\phi _{ij k} \) indicates whether feature k is relevant for component j given observation \(X_i\), and \(\rho _{j k} = p\big ( \phi _{ij k} = 1\big )\) represents the corresponding prior probability. The posterior probability that feature k is relevant for component j, conditioned on \(X_i \), is given by:

$$\begin{aligned}&p \big ( \phi _{ij k} = 1, Z_i = j \mid X_i \big ) = p \big ( Z_i = j \mid X_i \big ) \cdot \nonumber \\&\frac{\rho _{j k} p\big (X_{ik} \mid \xi _{j k} \big )}{\rho _{j k} p\big (X_{ik} \mid \xi _{j k} \big ) + \big (1-\rho _{j k} \big ) p\big (X_{ik} \mid \xi _{j k}^{irr}\big )} \end{aligned}$$
(16)

The posterior for an irrelevant feature is deduced in the same way:

$$\begin{aligned}&p \big ( \phi _{ij k} = 0, Z_i = j \mid X_i \big ) = p \big ( Z_i = j \mid X_i \big ) \cdot \nonumber \\&\frac{\big (1-\rho _{j k} \big ) p\big (X_{ik} \mid \xi _{j k}^{irr}\big )}{\rho _{j k} p\big (X_{ik} \mid \xi _{j k} \big ) + \big (1-\rho _{j k} \big ) p\big (X_{ik} \mid \xi _{j k}^{irr}\big )} \end{aligned}$$
(17)
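
In implementation terms, Eqs. (15)-(17) reduce to the following per-observation computation, sketched here with the relevant and background densities passed in as precomputed arrays (a convenience of our own, not part of the original derivation):

```python
import numpy as np

def responsibilities(p, rho, lik_rel, lik_irr):
    """Posterior allocation and relevance probabilities, Eqs. (15)-(17).

    p        : mixing proportions, shape (M,)
    rho      : feature saliencies, shape (M, D)
    lik_rel  : p(x_k | xi_jk), shape (M, D), relevant (AGD) densities for one observation
    lik_irr  : p(x_k | xi_jk^irr), shape (M, D), background Gaussian densities
    """
    mix = rho * lik_rel + (1.0 - rho) * lik_irr                 # per-feature mixture, Eq. (13)
    post_z = p * np.prod(mix, axis=1)                           # unnormalized Eq. (15)
    post_z /= post_z.sum()
    post_phi1 = post_z[:, None] * (rho * lik_rel) / mix         # Eq. (16)
    post_phi0 = post_z[:, None] * ((1.0 - rho) * lik_irr) / mix # Eq. (17)
    return post_z, post_phi1, post_phi0
```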

The likelihood function of \(\chi \), conditioned on the complete set of mixture parameters, can now be written; it will be used in the Bayesian inference derivations that follow:

$$\begin{aligned} p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big )= & {} \prod _{i=1}^N \prod _{k=1}^D \big [p\big (X_{ik} \mid \xi _{j k} \big )^{\phi _k} \cdot \nonumber \\&\quad p\big (X_{ik} \mid \xi _{j k}^{irr}\big )^{1-\phi _k} \big ] \end{aligned}$$
(18)

3 Non-parametric Bayesian inference

In the Bayesian context, the most important step is the determination of the posteriors used for inference. In this section, we describe a Markov chain Monte Carlo (MCMC) based inference approach to learn the proposed model. The goal of inference is to approximate the posteriors of the parameters, which absorb the information in the data to update the priors. Thus, we define a hierarchical Bayesian model and use conjugacy to develop the appropriate posteriors, and the parameters are then inferred with an MCMC method. The graphical representation is shown in Fig. 1.

3.1 Parameter estimation for \(\mu _{j k}\) and \(\mu _{j k}^{irr}\)

We consider that the relevant and irrelevant mean parameters \(\mu _{j k}\) and \(\mu _{j k}^{irr}\) follow Gaussian priors with common hyperparameters, the means \(\lambda \), \(\lambda ^{irr}\) and precisions r, \(r^{irr}\), respectively, as follows:

$$\begin{aligned}&p\big (\mu _{j k} \mid \lambda , r\big ) \sim \mathcal {N}(\lambda ,\,r^{-1}) \nonumber \\&p\big (\mu _{j k}^{irr} \mid \lambda ^{irr}, r^{irr}\big ) \sim \mathcal {N}(\lambda ^{irr}, \,(r^{irr})^{-1}) \end{aligned}$$
(19)

where the hyperparameters, the mean \(\lambda \) and the precision r, are common to all components in a given dimension k. \(\lambda \) and r are given Gaussian and Gamma priors, respectively, with the following hyperparameters:

$$\begin{aligned} p\big (\lambda \big ) \sim \mathcal {N}\left( e, f\right) \quad \quad p\big (r\big ) \sim \gamma \left( g, h\right) \end{aligned}$$
(20)
Fig. 1 Graphical model representation of IAGM. Symbols in circles denote random variables, while those in squares denote model parameters. Plates indicate repetition (with the number of repetitions in the lower right), and arcs describe the conditional dependencies between the variables

where \(\lambda \), \(\lambda ^{irr}\), r and \(r^{irr}\) have the same prior forms; we omit the duplicated expressions to save space. The conditional posteriors for \(\mu _{j k}\) and \(\mu _{j k}^{irr} \) are obtained by combining the likelihood in Eq. (18) with the priors in Eq. (19):

$$\begin{aligned} p\big (\mu _{j k} \mid \dots )\propto & {} p\big (\mu _{j k} \mid \lambda , r\big ) p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big ) \nonumber \\ p\big (\mu _{j k}^{irr} \mid \dots )\propto & {} p\big (\mu _{j k}^{irr} \mid \lambda ^{irr}, r^{irr}\big ) p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big )\nonumber \\ \end{aligned}$$
(21)

For the posteriors of the hyperparameters \(\lambda \) and r, Eq. (19) plays the role of the likelihood and is combined with the priors in Eq. (20) to obtain:

$$\begin{aligned}&p\big (\lambda \mid \dots ) \propto p\big (\lambda \big ) \prod _{j=1}^M p\big (\mu _{j k} \mid \lambda , r\big )\nonumber \\&p\big (r \mid \dots ) \propto p\big (r\big ) \prod _{j=1}^M p\big (\mu _{j k} \mid \lambda , r\big ) \end{aligned}$$
(22)

3.2 Parameter estimation for \(S_{lj k}\), \(S_{rj k}\) and \(S_{j k}^{irr}\)

The precision parameters \(S_{lj k}\), \(S_{rj k}\) and \(S_{j k}^{irr}\) are endowed with Gamma priors whose hyperparameters \(\beta \) and w are shared across components:

$$\begin{aligned}&p\big (S_{lj k} \mid \beta _l, w_l \big ) \sim \gamma (\beta _l, w_l^{-1}) \nonumber \\&p\big (S_{rj k} \mid \beta _r, w_r \big ) \sim \gamma (\beta _r, w_r^{-1}) \nonumber \\&p\big (S_{j k}^{irr} \mid \beta ^{irr}, w^{irr} \big ) \sim \gamma (\beta ^{irr}, (w^{irr})^{-1}) \end{aligned}$$
(23)

where the hyperparameters \(\beta \) and w are common to all components in a given dimension k. \(\beta \) and w are given inverse Gamma and Gamma priors, respectively, with the following hyperparameters:

$$\begin{aligned} p\big (\beta ^{-1} \big ) \sim \gamma \left( s, t\right) \quad \quad p\big (w\big ) \sim \gamma \left( u, v\right) \end{aligned}$$
(24)

where \(\beta _l\), \(\beta _r\), \(\beta ^{irr}\), \(w_l\), \(w_r\), \(w^{irr}\) have the same prior forms. The conditional posteriors for \(S_{lj k}\), \(S_{rj k}\) and \(S_{j k}^{irr}\) are obtained by combining the likelihood in Eq. (18) and the priors in Eq. (23) as follows:

$$\begin{aligned}&p\big (S_{lj k} \mid \dots ) \propto p\big (S_{lj k} \mid \beta _l, w_l\big ) p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big ) \nonumber \\&p\big (S_{rj k} \mid \dots ) \propto p\big (S_{rj k} \mid \beta _r, w_r\big ) p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big ) \nonumber \\&p\big (S_{j k}^{irr} \mid \dots ) \propto p\big (S_{j k}^{irr} \mid \beta ^{irr}, w^{irr}\big ) p \big ( \chi \mid Z, \Phi , \xi , \xi ^{irr} \big ) \end{aligned}$$
(25)

For the posteriors of the hyperparameters \(\beta \) and w, Eq. (23) plays the role of the likelihood and is combined with the priors in Eq. (24) to obtain the following:

$$\begin{aligned}&p\big (\beta \mid \dots ) \propto p\big (\beta \big ) \prod _{j=1}^M p\big (S_{j k} \mid \beta , w\big ) \nonumber \\&p\big (w \mid \dots ) \propto p\big (w \big ) \prod _{j=1}^M p\big (S_{j k} \mid \beta , w\big ) \end{aligned}$$
(26)

3.3 Parameter estimation for \(\rho \)

The feature saliency \(\rho _{j k}\) has support over [0, 1] and is naturally given a Beta prior with common hyperparameters a and b, as follows:

$$\begin{aligned} p\big (\rho _{j k}\mid a, b) \sim \text {Beta}\left( a, b\right) \end{aligned}$$
(27)

where the shape hyperparameters a and b are common to all components and follow Gamma priors:

$$\begin{aligned} p\big (a \big ) \sim \gamma \left( \delta _1, \delta _2\right) \quad \quad p\big (b \big ) \sim \gamma \left( \varphi _1, \varphi _2\right) \end{aligned}$$
(28)

We assume that the latent relevancy indicators \(\phi _{i j k} \) follow a Bernoulli distribution with probability \(\rho _{j k}\), so we have:

$$\begin{aligned} p\big (\phi _{j k} \mid \rho _{j k}) \sim \prod _{i=1}^N \rho _{j k}^{\phi _{i j k}} (1 -\rho _{j k})^{(1 - \phi _{i j k})} = \rho _{j k}^{n_{j k}} (1-\rho _{j k})^{N - n_{j k}} \end{aligned}$$
(29)

where \(n_{j k} = \sum _{i=1}^N I_{\phi _{i j k} = 1} \) is the number of observations for which feature k is relevant in component j. Considering Eq. (29) as the likelihood, we obtain the conditional posterior by multiplying it with the prior in Eq. (27):

$$\begin{aligned} p \big ( \rho _{j k} \mid \dots \big ) \propto p\big (\phi _{j k} \mid \rho _{j k}) p\big (\rho _{j k}\mid a, b) \end{aligned}$$
(30)
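
Since the Beta prior of Eq. (27) is conjugate to the Bernoulli likelihood of Eq. (29), this posterior takes a closed form from which \(\rho _{j k}\) can be sampled directly within the Gibbs sweep:

$$\begin{aligned} p \big ( \rho _{j k} \mid \dots \big ) \sim \text {Beta}\big (a + n_{j k},\; b + N - n_{j k}\big ) \end{aligned}$$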

The conditional posteriors of the hyperparameters a and b can then be obtained by combining Eq. (27) and Eq. (28) as follows:

$$\begin{aligned} p\big (a \mid \dots )\propto & {} p\big (a \big ) \prod _{j=1}^M p\big (\rho _{j k}\mid a, b) \nonumber \\ p\big (b \mid \dots )\propto & {} p\big (b \big ) \prod _{j=1}^M p\big (\rho _{j k}\mid a, b) \end{aligned}$$
(31)

3.4 Complete algorithm

Following the inference approach above, we propose an MCMC-based algorithm for inferring our hierarchical Bayesian mixture model. Among Monte Carlo methods, Gibbs sampling is one of the most popular and is widely used for sampling complicated posteriors. We also use the Metropolis-Hastings algorithm to sample from non-standard posteriors. The Gibbs sequence converges to the joint posterior distribution. The procedure is summarized in Algorithm 1 and sketched below.
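
As a schematic of how the conditional posteriors derived above fit together, one iteration of the sampler can be organized as follows; the block names and the `samplers` mapping are illustrative placeholders for the conjugate draws and Metropolis-Hastings steps of the corresponding conditionals, not the authors' implementation:

```python
def gibbs_iteration(X, state, samplers, rng):
    """One sweep of the hierarchical sampler (schematic only).

    `samplers` maps each block of unknowns to a user-supplied function that draws
    from its conditional posterior (a conjugate draw or a Metropolis-Hastings step).
    """
    for block in ("Z_Phi",        # allocations and relevance indicators, Eqs. (14), (16)-(17)
                  "mu",           # AGD and background means, Eq. (21)
                  "S",            # left/right/background precisions, Eq. (25)
                  "rho_a_b",      # feature saliencies and their hyperparameters, Eqs. (30)-(31)
                  "lambda_r",     # shared mean hyperparameters, Eq. (22)
                  "beta_w",       # shared precision hyperparameters, Eq. (26)
                  "alpha"):       # DP concentration parameter, Eq. (11)
        state = samplers[block](X, state, rng)
    return state
```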


4 Experimental results

In this section, we validate our algorithm on two challenging applications: dynamic textures clustering and scene categorization. We compare our results with multiple state-of-the-art methods for these applications. The hyperparameters are chosen as \(e = \mu _y\), \(f = \sigma ^2\), \(g = 2\), \(h = \frac{2}{\sigma ^2}\), \(s = 0.5\), \(t = 2\), \(u = 0.5\), \(v = \frac{2}{\sigma ^2}\), \(\delta _1 = 2\), \(\delta _2 = 0.5\), \(\varphi _1 = 2\), \(\varphi _2 = 0.5\), \(\kappa = 0.5\), and \(\eta = 2\), where \(\mu _y\) and \(\sigma ^2\) are the mean and variance of the observations.

4.1 Dynamic textures clustering

Dynamic textures are the temporal extension of spatial textures, defined as sequences of images of moving scenes that exhibit certain stationarity properties in time (sea-waves, smoke, foliage, whirlwind) (Doretto et al. 2003). Dynamic textures have drawn tremendous attention in recent years due to their applications in several domains of image processing and pattern recognition, such as motion classification, video registration, and computer games (Fan and Bouguila 2013, 2015). In our experiment, we apply the proposed IAGM with simultaneous feature selection to cluster dynamic textures represented by LBP-TOP features.

Fig. 2 Sample frames from the DynTex database

We carry out our experiments on the challenging DynTex dynamic textures dataset (Péteri et al. 2010) to evaluate the performance of the algorithm. This dataset contains over 650 dynamic texture video sequences from several categories. In our case, we use a subset of video sequences from 8 different categories: candle, flag, flower, fountain, grass, sea, smoke and tree. Each category has about 20 video sequences; sample frames from each category are shown in Fig. 2. As a preprocessing step, we extract LBP-TOP descriptors from the selected video sequences. In our experiment, we adopt the parameter choice of 4,4,4,1,1,1 as suggested in Zhao and Pietikainen (2007). This setting of the LBP-TOP descriptor achieves good performance while providing a comparatively short 48-dimensional feature vector.

The obtained features are modeled using the proposed IAGM algorithm. In order to evaluate the performance of the proposed method, we compare our approach with four other methods: the infinite Beta-Liouville mixture, the infinite generalized Dirichlet mixture, the infinite Dirichlet mixture, and the infinite Gaussian mixture model. We run each approach 30 times and report the average results. The average clustering accuracies are given in Table 1, and Fig. 3 shows the confusion matrix for the dataset using IAGM with feature selection. According to the results, the IAGM with feature selection approach outperforms the other four methods with the highest categorization accuracy rate (87.02%). It shows a clear improvement over the other methods because it successfully distinguishes 6 of the categories, leading to a higher overall accuracy.

The results of dynamic texture clustering demonstrate the advantage of applying mixture models that incorporate the asymmetry of the observations when modeling data with non-standard shapes. Meanwhile, performing feature selection simultaneously allows the model to account for background noise while accurately representing the important features, which contributes to the better performance.

Table 1 Average accuracy of different algorithms for dynamic textures clustering
Fig. 3 Confusion matrix of the IAGM with feature selection for the DynTex database

4.2 Scene categorization

Humans are proficient at perceiving, recognizing and understanding natural scenes, and there has been growing interest in developing machines that simulate these human vision capabilities. The representation of scene images has drawn considerable interest in recent years. In this section, we apply our proposed algorithm to the challenging scene categorization task. Our approach consists of three parts: feature extraction, image representation, and scene classification.

Fig. 4 Confusion matrix of the IAGM with feature selection for the UIUC sport event dataset

Table 2 Average accuracy of different algorithms for scene categorization

In this application, we use the UIUC sports event dataset (Li and Fei-Fei 2007) to validate the performance of our algorithm. This dataset consists of 8 different sport event classes: rowing (250 images), badminton (200 images), polo (182 images), bocce (137 images), snowboarding (190 images), croquet (236 images), sailing (190 images), and rock climbing (194 images). Fig. 5 demonstrates its diverse nature.

We represent each image by a collection of local image patches. Specifically, we adopt scale-invariant feature transform (SIFT) descriptors of 16 \(\times \) 16 pixel patches computed over a grid with a spacing of 8 pixels. We then employ the bag of visual words (BoVW) approach to obtain an overall representation of each image: the k-means algorithm clusters the training descriptors into a vocabulary of V visual words, and each SIFT keypoint is assigned to its nearest visual word in the codebook. Thus, each image can be represented as a frequency histogram over the V visual words, as sketched below. We then use the IAGM with feature selection model to classify the processed data. For each sport event class, we randomly select 70 images as a training set and 60 images as a testing set. We run our proposed algorithm 30 times to obtain average accuracy results for comparison.
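
The bag-of-visual-words step can be sketched as follows, assuming the SIFT descriptors have already been extracted (e.g. with a standard computer vision library); the function name, the default vocabulary size and the k-means settings are our illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_bovw_histograms(train_descriptors, image_descriptors, V=500, seed=0):
    """Quantize SIFT descriptors into a V-word codebook and return per-image histograms.

    train_descriptors : array (n_train_descriptors, 128) pooled from the training images
    image_descriptors : list of arrays, one (n_i, 128) array of descriptors per image
    """
    codebook = KMeans(n_clusters=V, n_init=10, random_state=seed).fit(train_descriptors)
    histograms = np.zeros((len(image_descriptors), V))
    for i, desc in enumerate(image_descriptors):
        words = codebook.predict(desc)                  # nearest visual word per keypoint
        counts = np.bincount(words, minlength=V)
        histograms[i] = counts / max(counts.sum(), 1)   # normalized frequency histogram
    return codebook, histograms
```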

Fig. 5 Sample images from the UIUC sport event dataset. The samples show the diversity of the backgrounds and the complexity of the scene content

In order to demonstrate the advantages of our algorithm, we compare our model with a number of state-of-the-art approaches in this area. These approaches include the Gaussian mixture model with the Expectation Maximization algorithm (GMM-EM) (Law et al. 2004), the Gaussian mixture model with Rival Penalized Expectation Maximization (GMM-RPEM) (Cheung and Zeng 2006), GIST (Oliva and Torralba 2001), multi-class supervised Latent Dirichlet Allocation and multi-class supervised Latent Dirichlet Allocation with annotations (probabilistic) (Wang et al. 2009), spatial pyramid matching (SPM) (Lazebnik et al. 2006), bag of keypoints (BOK) (Csurka et al. 2004), maximum likelihood estimation Scene (MLE-Scene) and Max-Margin Scene (MM-Scene) (Zhu et al. 2010). The evaluation results are shown in Table 2, and Fig. 4 displays the confusion matrix for IAGM with feature selection applied to the sport event dataset.

We can observe from these results that our proposed IAGM with simultaneous feature selection outperforms the other approaches under consideration and provides better average accuracy for the scene categorization task.

5 Conclusion

In this paper, we have presented a Dirichlet process mixture model capable of approximating asymmetric Gaussian distributed data, automatically determining the number of components, and simultaneously performing feature selection for clustering high-dimensional data. The asymmetric Gaussian assumption is supported by the fact that natural scene data are usually not Gaussian distributed. The Dirichlet process allows the number of components to grow to infinity, and the resulting infinite mixture model offers a flexible representation and a straightforward interpretation. Within the Bayesian framework, identifying relevant features and inferring parameters are unified in a single procedure. We have demonstrated the strong performance of our algorithm on both dynamic textures clustering and scene categorization tasks.

Although MCMC-based Bayesian inference provides a clear posterior sampling approach, it also brings a heavier computational cost. A possible direction for future work is the development of a variational inference based learning approach for the proposed model, which would scale to massive datasets and substantially reduce the computation time.