1 Introduction

Mixture modeling is firmly established in the literature as a useful method for finding homogeneous groups within heterogeneous data. Using mixture models for cluster analysis has a long history (Hasselblad, 1966; Day, 1969), dating at least to Wolfe (1965), who used a Gaussian mixture model for clustering. When mixture models are used for clustering, known as model-based clustering, the data are partitioned so as to estimate the group memberships, or labels, of observations whose labels are unknown. If some observations are labeled a priori, a semi-supervised analogue of model-based clustering, known as model-based classification, is used instead. Extensive details on model-based clustering and classification are given by McNicholas (2016a), and recent reviews are provided by Bouveyron and Brunet-Saumard (2014) and McNicholas (2016b).

A G-component finite mixture model assumes that a random vector X has density of the form

$$ f(\mathbf{x}|\boldsymbol {\vartheta})=\sum\limits_{g=1}^{G}\pi_{g}f_{g}(\mathbf{x}|\boldsymbol {\theta}_{g}), $$

where πg > 0, for g = 1,…,G, are the mixing proportions with \({\sum }_{g=1}^{G}\pi _{g}=1\), and fg(⋅|θg) are the component densities. The Gaussian mixture model (see, e.g., Banfield and Raftery, 1993; Celeux and Govaert, 1995; Tipping and Bishop, 1999; McNicholas and Murphy, 2008) remains popular due to its mathematical tractability. However, it is inflexible in the presence of cluster skewness and differing levels of cluster kurtosis, and it can overestimate the number of clusters and yield poor density estimates for known clusters (see Franczak et al., 2014; Dang et al., 2015, for examples). Therefore, it has become popular to consider mixtures of more flexible distributions for clustering to deal with such scenarios.

Mixture models that can deal with varying cluster tail weight, skewness and/or concentration, and kurtosis are becoming increasingly common. A small selection of such models includes mixtures using power transformations (Zhu et al., 2022), mixtures of multivariate t-distributions (Peel and McLachlan, 2000; Andrews and McNicholas, 2012), mixtures of normal inverse Gaussian distributions (Karlis and Santourian, 2009; Subedi and McNicholas, 2014; O’Hagan et al., 2016), mixtures of skew-t distributions (Lin, 2010; Murray et al., 2014; Vrbik and McNicholas, 2014; Lee and McLachlan, 2014, 2016), mixtures of shifted asymmetric Laplace distributions (Morris and McNicholas, 2013; Franczak et al., 2014), mixtures of multivariate power exponential distributions (Dang et al., 2015), mixtures of variance-gamma distributions (McNicholas et al., 2017), and mixtures of generalized hyperbolic distributions and variations thereof (Browne and McNicholas, 2015; Murray et al., 2017).

Two common approaches to introducing skewness are by means of a normal variance-mean mixture model and via hidden truncation using an elliptical distribution and a skewing function. The former assumes that a random vector X can be written in the form

$$ \mathbf{X}=\boldsymbol {\mu}+W \boldsymbol {\alpha}+\sqrt{W}\mathbf{V}, $$

where μ and α are location and skewness vectors, respectively, \(\mathbf {V}\sim \mathcal {N}(\textbf {0}, \boldsymbol {\Sigma })\) independently of W, and W > 0 is a random variable with density h(w|Θ). Depending on the distribution of W, different skewed distributions can be derived, e.g., the generalized hyperbolic, skew-t, variance-gamma, and normal inverse Gaussian distributions. The hidden truncation approach combines an elliptical distribution with a skewing function. For example, a random vector X follows a multivariate skew-normal distribution with skewness α if its density can be written as follows:

$$ f(\mathbf{x})=2\phi_{p}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}){\Phi}(\boldsymbol {\alpha}^{\prime}\mathbf{x}), $$

where ϕp(⋅) is the density of the p-dimensional normal distribution and Φ(⋅) denotes the cumulative distribution function of the standard normal distribution (Azzalini & Valle, 1996).
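To make the variance-mean mixture construction concrete, the following sketch simulates from the representation X = μ + Wα + √W V. The choice W ∼ Gamma is illustrative only (corresponding to a variance-gamma-type distribution), and the function name rnvmm is ours; other choices of W yield the other skewed distributions listed above.

```python
# Minimal sketch of simulating from the normal variance-mean mixture
# representation X = mu + W*alpha + sqrt(W)*V.  W ~ Gamma is an illustrative
# choice (a variance-gamma-type case); other positive W give other skewed laws.
import numpy as np

rng = np.random.default_rng(1)

def rnvmm(n, mu, alpha, Sigma, shape=2.0, rate=1.0):
    """Draw n observations from X = mu + W*alpha + sqrt(W)*V with V ~ N(0, Sigma)."""
    p = len(mu)
    W = rng.gamma(shape, 1.0 / rate, size=n)                # positive mixing variable
    V = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return mu + W[:, None] * alpha + np.sqrt(W)[:, None] * V

X = rnvmm(1000, mu=np.array([0.0, 0.0]), alpha=np.array([2.0, -1.0]), Sigma=np.eye(2))
print(X.mean(axis=0))   # shifted away from mu in the direction of alpha
```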

The multivariate power exponential (MPE) distribution (Gómez et al., 1998) has been used in many different applications (e.g., Lindsey, 1999; Cho and Bui, 2005; Verdoolaege et al., 2008) and was recently used in the mixture model context by Dang et al. (2015). Depending on the shape parameter β, either a leptokurtic or a platykurtic distribution can be obtained. Specifically, if β ∈ (0,1), the distribution is leptokurtic, characterized by a more peaked centre and heavier tails than the Gaussian distribution. If β > 1, a platykurtic distribution is obtained, characterized by a flatter peak and thinner tails than the Gaussian distribution. Other distributions are obtained for specific values of the shape parameter: for β = 0.5, the distribution is a Laplace (double-exponential) distribution and, for β = 1, it is a Gaussian distribution. Furthermore, as \(\beta \rightarrow \infty \), the MPE approaches a multivariate uniform distribution.

Dang et al. (2015) derived a family of mixtures of MPE distributions, but those mixtures can only account for elliptical clusters. Previously, skew power exponential distributions had been discussed in the univariate case with constrained β (Azzalini, 1986; DiCiccio and Monti, 2004; da Silva Ferreira et al., 2011) or in the multivariate case as a scale mixture of skew-normal distributions with constrained β (Branco and Dey, 2001). Herein, we present mixtures based on a novel multivariate skewed power exponential (MSPE) distribution. Compared to earlier proposals, this distribution is more suitable for clustering and classification purposes and can be used for a wide range of β (heavy, Gaussian, and light tails). Using an eigen-decomposition of the component scale matrices (à la Celeux and Govaert, 1995), we construct a family of 16 MSPE mixture models for use in both clustering and semi-supervised classification. These models can account for varying tail weight (heavy, Gaussian, or light), peakedness (thinner or thicker than Gaussian), and skewness of mixture components.

2 Background

Using the parametrization given by Gómez et al. (1998), a random vector X follows a p-dimensional power exponential distribution if the density is

$$ f(\mathbf{x}| \boldsymbol{\mu}, \boldsymbol{\Sigma},\beta)=\frac{p{\Gamma}\left( \frac{p}{2}\right)}{\pi^{p/2}{\Gamma}\left( 1+\frac{p}{2\beta}\right)2^{1+\frac{p}{2\beta}}}| \boldsymbol {\Sigma}|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \delta(\mathbf{x})^{\beta} \right\}, $$
(1)

where μ is the location parameter, Σ is a scale matrix, β determines the kurtosis, and

$$ \delta(\mathbf{x})\colon=\delta \left( \mathbf{x}| \boldsymbol {\mu}, \boldsymbol {\Sigma} \right)=\left (\mathbf{x}-\boldsymbol {\mu} \right)^{\prime} \boldsymbol{\Sigma}^{-1} \left( \mathbf{x}-\boldsymbol{\mu} \right). $$
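For concreteness, the log-density in Eq. 1 can be evaluated directly from the expression above. The following sketch is written from Eq. 1 and is an illustration only, not code taken from any package.

```python
# Sketch of the MPE log-density in Eq. 1 (Gomez et al., 1998 parametrization).
import numpy as np
from scipy.special import gammaln

def mpe_logpdf(x, mu, Sigma, beta):
    p = len(mu)
    diff = x - mu
    delta = diff @ np.linalg.solve(Sigma, diff)             # delta(x) = (x-mu)' Sigma^{-1} (x-mu)
    log_const = (np.log(p) + gammaln(p / 2.0)
                 - (p / 2.0) * np.log(np.pi)
                 - gammaln(1.0 + p / (2.0 * beta))
                 - (1.0 + p / (2.0 * beta)) * np.log(2.0))
    _, logdet = np.linalg.slogdet(Sigma)
    return log_const - 0.5 * logdet - 0.5 * delta ** beta
```

For β = 1, this reduces to the Gaussian log-density, which provides a quick check of the normalizing constant.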

In a similar manner to Azzalini and Valle (1996), Lin et al. (2014) derived the multivariate skew t-normal distribution by using an elliptical multivariate t-distribution and the cumulative distribution function of the standard normal distribution as the skewing function.

Herein, the skewing function is still the N(0,1) cumulative distribution function, while the elliptical distribution is now the MPE distribution. Specifically, a random vector X follows a p-dimensional skew power exponential distribution if its density is of the form

$$ \begin{array}{@{}rcl@{}} f(\mathbf{x}| \boldsymbol {\mu}, \boldsymbol {\Sigma},\beta, \boldsymbol {\psi})&=&2 g(\mathbf{x}| \boldsymbol {\mu},\boldsymbol {\Sigma},\beta) {\Phi}(\boldsymbol {\psi}^{\prime} \boldsymbol {\Sigma}^{-1/2} (\mathbf{x} - \boldsymbol {\mu})),\\ &=&\frac{2p{\Gamma}\left( \frac{p}{2}\right)}{\pi^{p/2}{\Gamma}\left( 1+\frac{p}{2\beta}\right)2^{1+\frac{p}{2\beta}}} |\boldsymbol {\Sigma}|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \delta(\mathbf{x})^{\beta} \right\} {\Phi}\left( \boldsymbol{\psi}^{\prime} \boldsymbol {\Sigma}^{-1/2} (\mathbf{x} - \boldsymbol{\mu})\right) \end{array} $$
(2)

with location vector μ, scale matrix Σ, shape parameter β, and skewness vector ψ. Some special cases of this distribution include the skew-normal distribution (β = 1), a variant of a skew Laplace distribution (β = 0.5), the power exponential distribution (ψ = 0), and a generalization of the multivariate uniform distribution \((\beta \rightarrow \infty ,\psi =0)\). Examples of contours of the MSPE distribution are given in Fig. 1.

Fig. 1 Contours of the multivariate skew power exponential distribution for different values of the shape and skewness parameters with \(\boldsymbol {\mu }=(0, 0)^{\prime }\) and an identity scale matrix. The middle panel, with β = 1, is a multivariate skew-normal distribution
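Because the MSPE density in Eq. 2 is the MPE density multiplied by 2Φ(ψ′Σ−1/2(x − μ)), it can be evaluated by reusing mpe_logpdf from the sketch above. Here Σ−1/2 is taken to be the inverse of the symmetric matrix square root; this is an illustration, not the implementation used for the analyses herein.

```python
# Sketch of the MSPE log-density in Eq. 2, reusing mpe_logpdf from the sketch
# above; Sigma^{-1/2} is the inverse symmetric matrix square root.
import numpy as np
from scipy.stats import norm
from scipy.linalg import sqrtm

def mspe_logpdf(x, mu, Sigma, beta, psi):
    Sigma_inv_sqrt = np.linalg.inv(sqrtm(Sigma)).real
    skew_arg = psi @ Sigma_inv_sqrt @ (x - mu)              # psi' Sigma^{-1/2} (x - mu)
    return np.log(2.0) + mpe_logpdf(x, mu, Sigma, beta) + norm.logcdf(skew_arg)
```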

3 Mixtures of MSPE Distributions

3.1 Inference

An iterative procedure is used for parameter estimation; specifically, a generalized expectation-maximization (GEM) algorithm (Dempster et al., 1977) with conditional maximization steps. The expectation-maximization (EM) algorithm (Dempster et al., 1977) is an iterative procedure in which the conditional expected value of the complete-data log-likelihood is maximized at each iteration to yield parameter updates. In a GEM algorithm, by contrast, the conditional maximization steps increase, rather than maximize, the conditional expected value of the complete-data log-likelihood at each iteration. Consider a random sample x1,…,xn of p-dimensional observations from a population with G subgroups, modeled via an MSPE mixture. If we define

$$ z_{ig}= \left\{\begin{array}{l} 1 ~~~ \text{if} ~\mathbf{x}_{i} \text{is from group} g,\\ 0 ~~~ \text{otherwise}, \end{array}\right. $$

then the complete-data log-likelihood can be written as follows:

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{c}(\boldsymbol {\Theta}) &=&\sum\limits_{i=1}^{n} \sum\limits_{g=1}^{G} z_{ig}\log\Bigg[2 \pi_{g} \frac{p{\Gamma}\left( \frac{p}{2}\right)}{\Gamma\left( 1+\frac{p}{2\beta_{g}}\right)2^{1+\frac{p}{2\beta_{g}}}\pi^{p/2}} |\boldsymbol {\Sigma}_{g}|^{-\frac{1}{2}}\\ & & \qquad~~~~~~~~~~~~~~~~~~~~ \times\exp\left\{ -\frac{\delta_{ig}(\mathbf{x}_{i})^{\beta_{g}}}{2} \right\} {\Phi}\left( \boldsymbol {\psi}^{\prime}_{g} \boldsymbol {\Sigma}_{g}^{-1/2} (\mathbf{x}_{i} - \boldsymbol {\mu}_{g})\right)\Bigg]. \end{array} $$

For parsimony, an eigen-decomposition is commonly imposed on component scale matrices using the re-parameterization \(\boldsymbol {\Sigma }_{g}=\lambda _{g} \boldsymbol {\Gamma }_{g} \boldsymbol {\Delta }_{g} \boldsymbol {\Gamma }_{g}^{\prime }\), where Δg is a diagonal matrix with entries proportional to the eigenvalues of Σg (with |Δg| = 1), λg is the associated constant of proportionality, and Γg is a p × p orthogonal matrix of the eigenvectors of Σg with entries ordered according to the eigenvalues (Banfield and Raftery, 1993; Celeux & Govaert, 1995). A subset of eight models was considered in Dang et al. (2015) including the most parsimonious (EII) and the fully unconstrained (VVV) models, along with a possible constraint on βg, for their family of mixture models using elliptical power exponential distributions (Table 1). Herein, we consider the same eight models to form a family of mixtures of skewed power exponential distributions.

Table 1 Nomenclature, scale matrix structure, and the number of free scale parameters for the eigen-decomposed family of models
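As a small illustration of the re-parameterization Σg = λgΓgΔgΓ′g, the following sketch rebuilds a scale matrix from its volume (λg), orientation (Γg), and shape (Δg) components; the numerical values are illustrative only.

```python
# Sketch of rebuilding a component scale matrix from its eigen-decomposition
# Sigma_g = lambda_g * Gamma_g * Delta_g * Gamma_g'; values are illustrative.
import numpy as np

def scale_from_decomposition(lam, Gamma, delta_diag):
    delta_diag = np.asarray(delta_diag, dtype=float)
    delta_diag = delta_diag / np.prod(delta_diag) ** (1.0 / len(delta_diag))  # enforce |Delta_g| = 1
    return lam * Gamma @ np.diag(delta_diag) @ Gamma.T

Gamma = np.array([[ 0.36, 0.48, -0.80],
                  [-0.80, 0.60,  0.00],
                  [ 0.48, 0.64,  0.60]])                    # an orthogonal matrix
Sigma = scale_from_decomposition(2.0, Gamma, [4.0, 3.0, 1.0])
```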

After initialization (Section 3.2), the algorithm proceeds as follows. E-step: the group membership estimates \(\hat {z}_{ig}\) are updated using

$$ \hat{z}_{ig} \colon= \mathbb{E}_{\widehat{\boldsymbol{\Theta}}} [Z_{ig}|\mathbf{x}_{i}] = \frac{\hat{\pi}_{g} f \left( \mathbf{x}_{i}| {\hat{\boldsymbol\mu}}_{g}, {\hat{\boldsymbol{\Sigma}}}_{g},\hat\beta_{g},{\hat{\boldsymbol\psi}}_{g}\right)} {{\sum}_{j=1}^{G} \hat\pi_{j} f \left( \mathbf{x}_{i}| {\hat{\boldsymbol\mu}}_{j}, {\hat{\boldsymbol{\Sigma}}}_{j},\hat\beta_{j},{\hat{\boldsymbol\psi}}_{j}\right)}, $$

for i = 1,…,n and g = 1,…,G. M-step: The update for πg is \(\hat {\pi }_{g} = {n_{g}}/{n},\) where \(n_{g}={\sum }_{i=1}^{n}\hat {z}_{ig}\). However, the updates for μg, Σg, βg, and ψg are not available in closed form. For estimating βg, either a Newton-Raphson method or a root-finding algorithm may be used, and the update is identical to that in Dang et al. (2015). In our implemented code, we constrain βg to be less than 20 for numerical stability.
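A minimal sketch of the E-step and the mixing-proportion update is given below; it reuses mspe_logpdf from the sketch in Section 2, and the parameter lists stand in for the current estimates.

```python
# Sketch of the E-step (posterior membership probabilities) and the
# closed-form update for the mixing proportions; reuses mspe_logpdf.
import numpy as np

def e_step(X, pis, mus, Sigmas, betas, psis):
    n, G = X.shape[0], len(pis)
    log_dens = np.empty((n, G))
    for g in range(G):
        log_dens[:, g] = np.log(pis[g]) + np.array(
            [mspe_logpdf(x, mus[g], Sigmas[g], betas[g], psis[g]) for x in X])
    log_dens -= log_dens.max(axis=1, keepdims=True)          # shift for numerical stability
    z_hat = np.exp(log_dens)
    z_hat /= z_hat.sum(axis=1, keepdims=True)                 # rows sum to one
    return z_hat

def update_pi(z_hat):
    return z_hat.sum(axis=0) / z_hat.shape[0]                 # pi_g = n_g / n
```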

Let

$$ \mathcal{Q} \colon=\mathbb{E}_{\boldsymbol{\Theta}} [\mathcal{L}_{c}(\boldsymbol {\Theta}|\mathbf{x})]. $$

Then, a Newton-Raphson update is used for the location parameter \({\hat {\boldsymbol \mu }}_{g}\), based on the following derivatives:

$$ \begin{array}{@{}rcl@{}} \frac{\partial \mathcal{Q}}{\partial\boldsymbol{\mu}_{g}} &=& \hat{\beta}_{g}\sum\limits_{i=1}^{n} \hat{z}_{ig} \delta_{ig}(\mathbf{x}_{i})^{\hat{\beta}_{g}-1} {\hat{\boldsymbol{\Sigma}}}_{g}^{-1} (\mathbf{x}_{i}-{\hat{\boldsymbol\mu}}_{g}) - \sum\limits_{i=1}^{n} \hat{z}_{ig} \frac{\phi(\boldsymbol{\psi}^{\prime}_{g} \boldsymbol{\Sigma}_{g}^{-1/2} (\mathbf{x}_{i}-{\hat{\boldsymbol\mu}}_{g}))}{\Phi(\boldsymbol {\psi}^{\prime}_{g} \boldsymbol {\Sigma}_{g}^{-1/2} (\mathbf{x}_{i}-{\hat{\boldsymbol\mu}}_{g}))} \boldsymbol{\Sigma}_{g}^{-1/2} \boldsymbol {\psi}_{g},\\ \frac{\partial^{2} \mathcal{Q}}{\partial \boldsymbol {\mu}_{g}\boldsymbol{\mu}_{g}^{\prime}} &=& \hat{\beta}_{g}\sum\limits_{i=1}^{n} \hat{z}_{ig} \left[-\delta_{ig}(\mathbf{x}_{i})^{\hat{\beta}_{g}-1} {\hat{\boldsymbol{\Sigma}}}_{g}^{-1} - 2(\hat{\beta}_{g}-1)\delta_{ig}(\mathbf{x}_{i})^{\hat{\beta}_{g}-2}{\hat{\boldsymbol {\Sigma}}}_{g}^{-1} (\mathbf{x}_{i}-{\hat{\boldsymbol \mu}}_{g})(\mathbf{x}_{i}-{\hat{\boldsymbol \mu}}_{g})^{\prime}{\hat{\boldsymbol{\Sigma}}}_{g}^{-1} \right] \\ & &- \sum\limits_{i=1}^{n} \hat{z}_{ig} \boldsymbol {\psi}^{\prime}_{g} \boldsymbol {\Sigma}_{g}^{-1/2} (\mathbf{x}_{i}-{\hat{\boldsymbol \mu}}_{g}) \frac{\phi(\boldsymbol {\psi}^{\prime}_{g} \boldsymbol{\Sigma}_{g}^{-1/2} (\mathbf{x}_{i}-{\hat{\boldsymbol\mu}}_{g}))}{\Phi(\boldsymbol{\psi}^{\prime}_{g} \boldsymbol{\Sigma}_{g}^{-1/2} (\mathbf{x}_{i}-{\hat{\boldsymbol \mu}}_{g}))} {\hat{\boldsymbol{\Sigma}}}_{g}^{-1} \boldsymbol{\psi}_{g}\boldsymbol {\psi}^{\prime}_{g}{\hat{\boldsymbol{\Sigma}}}_{g}^{-1} \\ & &- \sum\limits_{i=1}^{n} \hat{z}_{ig} \left[\frac{\phi(\boldsymbol{\psi}^{\prime}_{g} \boldsymbol {\Sigma}_{g}^{-1/2} (\mathbf{x}_{i}-{\hat{\boldsymbol \mu}}_{g}))}{\Phi(\boldsymbol {\psi}^{\prime}_{g} \boldsymbol {\Sigma}_{g}^{-1/2} (\mathbf{x}_{i}-{\hat{\boldsymbol\mu}}_{g}))}\right]^{2} {\hat{\boldsymbol{\Sigma}}}_{g}^{-1} \boldsymbol{\psi}_{g}\boldsymbol{\psi}^{\prime}_{g}{\hat{\boldsymbol{\Sigma}}}_{g}^{-1}, \end{array} $$
(3)

where \(\delta _{ig}(\mathbf {x}_{i})\colon = \left (\mathbf {x}_{i}-{\hat {\boldsymbol \mu }}_{g} \right )^{\prime }{\hat {\boldsymbol {\Sigma }}}_{g}^{-1} \left (\mathbf {x}_{i}-{\hat {\boldsymbol \mu }}_{g} \right )\).
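The following sketch evaluates the gradient in Eq. 3. For brevity it then takes a damped gradient-ascent step rather than forming the Hessian written above; step_size is a hypothetical tuning constant and not part of the method described here.

```python
# Sketch of the gradient of Q with respect to mu_g (first expression in Eq. 3)
# followed by a simple damped ascent step in place of a full Newton-Raphson step.
import numpy as np
from scipy.stats import norm
from scipy.linalg import sqrtm

def mu_gradient(X, z_g, mu, Sigma, beta, psi):
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_inv_sqrt = np.linalg.inv(sqrtm(Sigma)).real
    diffs = X - mu
    delta = np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs)     # delta_ig(x_i)
    term1 = beta * Sigma_inv @ (diffs * (z_g * delta ** (beta - 1.0))[:, None]).sum(axis=0)
    s = diffs @ Sigma_inv_sqrt @ psi                              # psi' Sigma^{-1/2} (x_i - mu)
    ratio = np.exp(norm.logpdf(s) - norm.logcdf(s))               # phi/Phi, computed stably
    term2 = (z_g * ratio).sum() * (Sigma_inv_sqrt @ psi)
    return term1 - term2

def mu_step(X, z_g, mu, Sigma, beta, psi, step_size=1e-3):
    return mu + step_size * mu_gradient(X, z_g, mu, Sigma, beta, psi)
```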

For estimating the skewness parameter ψ, the density is first re-parameterized as follows:

$$ \begin{array}{@{}rcl@{}} f(\mathbf{x}|\boldsymbol {\mu},\boldsymbol{\Sigma},\beta, \boldsymbol {\psi})&=& 2 g(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma},\beta) {\Phi}(\boldsymbol {\psi}^{\prime} \boldsymbol{\Sigma}^{-1/2} (\mathbf{x} - \boldsymbol{\mu})) \\ &=&2 g(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma},\beta) {\Phi}(\boldsymbol {\eta}^{\prime} (\mathbf{x} - \boldsymbol{\mu})), \end{array} $$

where \(\boldsymbol{\eta} = \boldsymbol{\Sigma}^{-1/2}\boldsymbol{\psi}\). A quadratic lower-bound principle (Böhning and Lindsay, 1988; Hunter and Lange, 2004) is then applied to the relevant part of the complete-data log-likelihood, written in terms of this re-parameterized density, using the following property to construct a quadratic minorizer:

$$ \log({\Phi}(s)) \geq \log({\Phi}(s_{0})) + \frac{\phi(s_{0})}{\Phi(s_{0})} (s - s_{0}) + \frac{1}{2} (-1) (s - s_{0})^{2}, $$

where − 1 is a lower bound on the second derivative of \(\log {\Phi }(s)\) in the Taylor expansion around s0. Maximizing the resulting minorizer with respect to ηg yields the update

$$ \begin{array}{@{}rcl@{}} \boldsymbol{\eta}_{g} &=& \left[ \sum\limits_{i=1}^{n} \hat{z}_{ig} (\mathbf{x}_{i}-{\hat{\boldsymbol \mu}}_{g}) (\mathbf{x}_{i}-{\hat{\boldsymbol \mu}}_{g})^{\prime}\right]^{{-1}} \left[\sum\limits_{i=1}^{n} \hat{z}_{ig} \frac{\phi(\boldsymbol{\eta}^{\prime}_{g0} (\mathbf{x}_{i}-{\hat{\boldsymbol \mu}}_{g}))}{\Phi(\boldsymbol{\eta}^{\prime}_{g0} (\mathbf{x}_{i}-{\hat{\boldsymbol\mu}}_{g}))} (\mathbf{x}_{i}-{\hat{\boldsymbol \mu}}_{g}) \right. \\ & & \qquad\qquad\qquad\qquad\quad\quad ~~~~~~~~~~~~~~\left. + \sum\limits_{i=1}^{n} \hat{z}_{ig} (\mathbf{x}_{i}-{\hat{\boldsymbol \mu}}_{g})(\mathbf{x}_{i}-{\hat{\boldsymbol \mu}}_{g})^{\prime} \boldsymbol{\eta}_{g0}\right] \end{array} $$

and we can back-transform to obtain \(\boldsymbol {\psi }_{g} = \boldsymbol {\Sigma }^{1/2}_{g}\boldsymbol {\eta }_{g}\).
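A sketch of this quadratic lower-bound update for ηg, together with the back-transformation to ψg, is given below; eta0 denotes the current value of the re-parameterized skewness and z_g holds the current membership weights.

```python
# Sketch of the minorization-based update for eta_g and the back-transformation
# psi_g = Sigma_g^{1/2} eta_g.
import numpy as np
from scipy.stats import norm
from scipy.linalg import sqrtm

def eta_update(X, z_g, mu, eta0):
    diffs = X - mu
    s = diffs @ eta0                                          # eta0'(x_i - mu)
    ratio = np.exp(norm.logpdf(s) - norm.logcdf(s))           # phi/Phi
    A = (z_g[:, None, None] * np.einsum('ij,ik->ijk', diffs, diffs)).sum(axis=0)
    rhs = (z_g * ratio) @ diffs + A @ eta0
    return np.linalg.solve(A, rhs)

def psi_from_eta(Sigma, eta):
    return sqrtm(Sigma).real @ eta
```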

For the scale matrices Σg, the estimation makes use of minorization-maximization algorithms (Hunter and Lange, 2000, 2004), exploiting the concavity of the functions containing Σg (or parts of its decomposition), and of accelerated line search algorithms on the Stiefel manifold (Absil et al., 2009; Browne and McNicholas, 2014), with different updates depending on whether the latest estimate of βg is less than 1 or greater than or equal to 1. For more details, see Dang et al. (2015). Combining the eigen-decomposition constraints in Table 1 with the constraint that βg is either equal across groups or varies between groups results in a family of 16 models. For example, a VVIE model has a VVI scale structure (as in Table 1) with the shape parameter constrained to be equal across groups (βg = β).

3.2 Initialization

It is well known that the performance of the EM algorithm depends heavily on the starting values. The following strategy is adopted. The group memberships are initialized using a combination of the emEM approach (Biernacki et al., 2003), the k-means algorithm (Hartigan and Wong, 1979), and random soft starts. Specifically, the best run (highest log-likelihood) among 10 k-means-based short EM runs is chosen for a long EM run. This process is repeated with random soft starts. The fits of the two long EM runs are then compared via a model selection criterion (see Section 3.3) to choose the best model. Once these initial memberships are set, the μg and Σg are initialized using a constrained model. The shape parameters βg are initialized to 0.5, and the skewness vectors ψg are initialized to zero vectors.
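A sketch of how such starting values can be constructed is given below. The short and long EM runs and their comparison are omitted; only the k-means-based memberships, a random soft alternative, and the initial values of βg and ψg are shown.

```python
# Sketch of the starting values described above: hard memberships from k-means,
# a random soft start, and simple initial values for the shape and skewness.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_start(X, G, seed=0):
    labels = KMeans(n_clusters=G, n_init=10, random_state=seed).fit_predict(X)
    return np.eye(G)[labels]                                  # hard 0/1 membership matrix

def random_soft_start(X, G, seed=0):
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.ones(G), size=X.shape[0])         # each row sums to one

def initial_shape_and_skewness(G, p):
    betas = np.full(G, 0.5)                                   # beta_g initialized to 0.5
    psis = np.zeros((G, p))                                   # skewness initialized to zero vectors
    return betas, psis
```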

3.3 Convergence, Model Selection, and Performance Assessment

Following Lindsay (1995) and McNicholas et al. (2010), the iterative GEM algorithm is stopped based on Aitken's acceleration (Aitken, 1926). Specifically, an asymptotic estimate of the log-likelihood at iteration k + 1 is compared with the current log-likelihood value, and the algorithm is considered to have converged when the difference is less than some positive 𝜖. We use 𝜖 = 0.005 for the analyses herein.
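A sketch of this stopping rule is given below; here the asymptotic estimate is compared with the most recent log-likelihood value, although some presentations compare with the previous value instead.

```python
# Sketch of the Aitken acceleration stopping rule; logliks holds the
# log-likelihood values from successive GEM iterations.
def aitken_converged(logliks, epsilon=0.005):
    if len(logliks) < 3:
        return False
    l_km1, l_k, l_kp1 = logliks[-3], logliks[-2], logliks[-1]
    denom = l_k - l_km1
    if denom == 0.0:                                          # log-likelihood has stopped changing
        return True
    a_k = (l_kp1 - l_k) / denom                               # Aitken acceleration at iteration k
    l_inf = l_k + (l_kp1 - l_k) / (1.0 - a_k)                 # asymptotic log-likelihood estimate
    return 0.0 <= l_inf - l_kp1 < epsilon
```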

In a typical clustering scenario, neither the number of groups nor the scale structure is known a priori. Therefore, a model selection criterion is required. The most common criterion for model selection is the Bayesian information criterion (BIC; Schwarz, 1978), which can be written as follows:

$$ \text{BIC}=2l({\hat{\boldsymbol \vartheta}})-m\log n, $$
(4)

where m is the number of free parameters, n is the sample size, and \(l({\hat {\boldsymbol \vartheta }})\) is the maximized log-likelihood. When written as in Eq. 4, a larger BIC indicates a better model fit. The integrated complete likelihood (ICL; Biernacki et al., 2000) was also considered for model selection; however, in initial testing, the ICL did not consistently outperform the BIC in simulations and thus, for the remainder of this manuscript, we use only the BIC.
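For reference, the BIC as written in Eq. 4 amounts to the following one-line computation.

```python
# BIC under the sign convention of Eq. 4 (larger is better).
import numpy as np

def bic(loglik, n_free_params, n_obs):
    return 2.0 * loglik - n_free_params * np.log(n_obs)
```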

To evaluate classification performance, we use the adjusted Rand index (ARI; Hubert and Arabie, 1985). The ARI compares two partitions; in our case, the estimated classification and the (known) true classification. The ARI takes the value 1 under perfect classification and has expected value 0 under random classification (see Steinley, 2004, for extensive details on the ARI).
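The ARI is available in standard software; the short snippet below, using scikit-learn with made-up labels, illustrates its invariance to label permutations.

```python
# Quick ARI illustration with placeholder labels.
from sklearn.metrics import adjusted_rand_score

true_labels      = [0, 0, 1, 1, 2, 2]
estimated_labels = [1, 1, 0, 0, 2, 2]      # relabelled but otherwise identical partition
print(adjusted_rand_score(true_labels, estimated_labels))   # 1.0
```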

4 Analyses

4.1 Overview

The performance of the MSPE mixture models is compared with mixture model implementations based on the MPE distribution (Dang et al., 2015), as well as implementations from the mixture package (Pocuca et al., 2022) of the generalized hyperbolic distribution (ghpcm) and the Gaussian distribution (gpcm). We chose these mixtures for comparison because Gaussian mixtures remain widely used and the generalized hyperbolic distribution has special cases that include some parameterizations of the inverse Gaussian, variance-gamma, skew-t (noting that there are formulations of the skew-t distribution that cannot be obtained from the generalized hyperbolic), multivariate normal-inverse Gaussian, and asymmetric Laplace distributions. Using these comparators, we obtain comparisons with mixtures based on purely elliptical (Gaussian), elliptical with flexible kurtosis modeling (MPE), and skewed (generalized hyperbolic) distributions. For a fair comparison, we restrict the models in the other implementations to those in Table 1. In addition, we use the BIC as the model selection criterion, we run G = 1,…,4 for all simulations and real data analyses, and we use the same starting soft memberships for all comparator models. Data from the MSPE distribution are simulated using a Metropolis-Hastings algorithm.
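A sketch of such a sampler is given below: a random-walk Metropolis-Hastings scheme targeting the MSPE density, reusing mspe_logpdf from Section 2. The proposal scale, burn-in, and absence of thinning are illustrative choices, not the settings used for the simulations reported here.

```python
# Sketch of random-walk Metropolis-Hastings sampling from the MSPE density;
# successive draws are dependent, so thinning could be added if desired.
import numpy as np

def rmspe_mh(n, mu, Sigma, beta, psi, proposal_sd=0.5, burn_in=500, seed=0):
    rng = np.random.default_rng(seed)
    p = len(mu)
    x = np.array(mu, dtype=float)                             # start the chain at the location
    current_logp = mspe_logpdf(x, mu, Sigma, beta, psi)
    draws = []
    for it in range(burn_in + n):
        proposal = x + rng.normal(scale=proposal_sd, size=p)
        prop_logp = mspe_logpdf(proposal, mu, Sigma, beta, psi)
        if np.log(rng.uniform()) < prop_logp - current_logp:
            x, current_logp = proposal, prop_logp             # accept the proposal
        if it >= burn_in:
            draws.append(x.copy())
    return np.array(draws)
```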

4.2 Simulations

4.2.1 Simulation 1: Heavy and Light-Tailed Skewed Clusters

A three-component mixture is simulated with 500 observations in total. Group sample sizes are sampled from a multinomial distribution with mixing proportions (0.35,0.25,0.4). The first component is simulated from a heavy-tailed three-dimensional MSPE distribution with \(\boldsymbol {\mu }_{1} = (3, 3, 0)^{\prime }\), β1 = 0.85, and \(\boldsymbol {\psi }_{1}=(-5, -10, 0)^{\prime }\). The second component is simulated with \(\boldsymbol {\mu }_{2} = (3, 6, 0)^{\prime }\), β2 = 0.9, and \(\boldsymbol {\psi }_{2}=(15, 10, 0)^{\prime }\). The third component is simulated with light tails, with \(\boldsymbol {\mu }_{3} = (4, 2, 0)^{\prime }\), β3 = 2, and ψ3 = ψ2. Lastly, the scale matrices are common to all three components, with diag\((\boldsymbol {\Delta }_{g}) = (4, 3, 1)^{\prime }\) and

$$ \boldsymbol{\Gamma}_{g}=\begin{pmatrix} 0.36 & 0.48 & -0.80 \\ {}-0.80 & 0.60 & 0.0 \\ 0.48 & 0.64 & 0.6 \\ \end{pmatrix}, $$

for g = 1,2,3. The simulated components are not well separated (an example scatterplot matrix is given in Fig. 2). All four mixture implementations are run on 100 such datasets for G = 1,…,4.

Fig. 2 An example scatterplot matrix of the three-component mixture of Simulation 1

For the MSPE family, a three-component (four-component) model is selected by the BIC 98 (2) times. For the MPE family, the BIC selects a three-component (four-component) model 87 (13) times. When a four-component model is selected, this is typically because the chosen model has split the heavy-tailed cluster into two separate components; the exception is two of the four-component MPE solutions, in which the light-tailed cluster was split into two separate components. For the gpcm family, the BIC selects a three-component (four-component) model 99 (1) times. On the other hand, for the ghpcm algorithm, one-component, two-component, and three-component models are selected 3, 46, and 51 times, respectively. This under-fitting, in which the heavy-tailed and light-tailed components are merged, may be due to the use of the BIC.

The ARI values for the selected MSPE models range from 0.71 to 0.91, with a median (mean) ARI value of 0.83 (0.83). The selected MPE models yield ARI values ranging between 0.67 and 0.91, with a median (mean) value of 0.80 (0.79). The selected gpcm models yield ARI values ranging between 0.71 and 0.91, with a median (mean) value of 0.80 (0.80). Similarly, the ghpcm algorithm yields ARI values ranging between 0 and 0.82, with a median (mean) value of 0.41 (0.47). For the MSPE models, an EEEV model was selected 68 out of 100 times, with a less parsimonious model selected the other times. To demonstrate parameter recovery, a similar simulation setup was carried out and the findings are provided in the Appendix.

4.2.2 Simulation 2: Heavy-Tailed and Gaussian Skewed Clusters

A three-component mixture is simulated with 500 observations in total. Group sample sizes are sampled from a multinomial distribution with mixing proportions \((0.3, 0.45, 0.25)^{\prime }\). The first component is simulated from a three-dimensional skew-normal distribution (i.e., β1 = 1) with \(\boldsymbol {\mu }_{1} = (0, 1, 2)^{\prime }\) and \(\boldsymbol {\psi }_{1}=(3, 5, 10)^{\prime }\). The second component is simulated from a heavy-tailed three-dimensional MSPE distribution with \(\boldsymbol {\mu }_{2} = (0, 4, 2)^{\prime }\), β2 = 0.8, and \(\boldsymbol {\psi }_{2}=(-3, 5, -5)^{\prime }\). The third component is simulated with \(\boldsymbol {\mu }_{3} = (-2, -3, 0)^{\prime }\), β3 = 0.9, and \(\boldsymbol {\psi }_{3}=(5, 10, -5)^{\prime }\). Lastly, the scale matrices are

$$ \boldsymbol {\Sigma}_{1}=\begin{pmatrix} 1.00 & 0.50 & 0.40 \\ 0.50 & 1.50 & 0.35 \\ 0.40 & 0.35 & 1.20 \\ \end{pmatrix} \qquad \text{and} \qquad \boldsymbol {\Sigma}_{2}=\boldsymbol {\Sigma}_{3}=\begin{pmatrix} 1.00 & 0.30 & 0.20 \\ 0.30 & 1.50 & 0.30 \\ 0.20 & 0.30 & 1.20 \\ \end{pmatrix}. $$

Again, the simulated components are not well separated and all four mixture implementations are run on 100 such datasets for G = 1,…,4.

For the MSPE family, a three-component (four-component) model is selected by the BIC 95 (5) times. For the MPE family, the BIC selects a three-component (four-component) model 99 (1) times. Interestingly, for the ghpcm mixtures, the BIC selects a three-component model all 100 times. On the other hand, for the gpcm family, the BIC selects a three-component (four-component) model 94 (6) times. In all cases where a four-component model is selected, this is because the chosen model has split one of the heavy-tailed clusters into two components.

The ARI values for the selected MSPE models range from 0.74 to 0.98, with a median (mean) ARI value of 0.94 (0.93). The selected MPE models yield ARI values ranging between 0.72 and 0.96, with a median (mean) value of 0.90 (0.89). Similarly, the ghpcm algorithm yields ARI values ranging between 0.70 and 0.94, with a median (mean) value of 0.86 (0.85). The selected gpcm models yield ARI values ranging between 0.68 and 0.95, with a median (mean) value of 0.90 (0.88). For the MSPE models, an EEEV model is selected 76 out of 100 times.

4.2.3 Simulation 3: Two Light-Tailed Elliptical Clusters

A simulation from Dang et al. (2015) is replicated, in which a two-component EIIV model is simulated with 450 observations; the sample size for each group is drawn from a binomial distribution with success probability 0.45. Both components have identity scale matrices and zero skewness. The first component is simulated from a two-dimensional MPE distribution with \(\boldsymbol {\mu }_{1} = (0,0)^{\prime }\) and β1 = 2, while the second component is simulated using \(\boldsymbol {\mu }_{2} = (2,0)^{\prime }\) and β2 = 5. Again, the simulated components are not well separated. All four algorithms are run on 100 such datasets. For the MSPE and MPE families, a two-component model is selected by the BIC for 100 and 99 datasets, respectively. The dataset for which a three-component model is selected by the MPE family involves a spurious component of four tightly clustered observations with tiny eigenvalues. On the other hand, for the gpcm family, the BIC selects a two-, three-, and four-component model 80, 12, and 8 times, respectively. Similarly, for the ghpcm algorithm, one-, two-, and three-component models are selected 1, 95, and 4 times, respectively. Here, additional components are being fitted to accommodate the light-tailed nature of the data.

For the MPE family (which was used to simulate the data), the ARI values for the selected models range from 0.81 to 0.95, with a median (mean) ARI value of 0.88 (0.88). The MSPE family performs similarly, as expected (the MPE is a special case of the MSPE), with ARI values for the selected models ranging from 0.81 to 0.94 and a median (mean) ARI value of 0.88 (0.88). The selected gpcm models yield ARI values ranging between 0.35 and 0.96, with a median (mean) value of 0.85 (0.80). Similarly, the ghpcm algorithm yields ARI values ranging between 0 and 0.91, with a median (mean) value of 0.81 (0.80). For the MSPE and MPE models, an EIIV model was selected 89 and 95 times out of 100, respectively. Because the range of ARI values can be driven by a single poor or excellent fit, the median and the first and third quartiles of the ARIs for the selected models are summarized in Table 2 for all simulations.

Table 2 Performance comparison of four mixture model families on simulations

4.3 Dataset Descriptions

For assessment of performance on real data, we considered the following “benchmark” datasets, i.e., datasets often used for comparison of clustering algorithms, of various sizes and dimensionalities available through various R packages:

  • Body Dataset: The body data from the gclus package (Hurley, 2012) has 24 measurements on body dimension, age, etc., for 507 individuals (247 men and 260 women).

  • Coffee Dataset: The coffee dataset (Streuli, 1973), obtained from the pgmm package (McNicholas et al., 2022), has 12 chemical measurements on two types of coffee (arabica and robusta).

  • Female Voles Dataset: The fvoles dataset (Flury, 2012) has measurements on six size variables and age of 86 female voles from two different species (californicus and ochrogaster).

  • Olive Dataset: The olive dataset (Forina and Tiscornia, 1982), obtained from the pgmm package, has percentage composition of eight fatty acids of olive oils from three regions of Italy.

  • Penguins Dataset: The penguins dataset (Horst et al., 2020) has measurements on bill length, bill depth, flipper length, and body mass of 342 penguins from three different species (Adélie, Chinstrap, and Gentoo).

  • Wine Dataset: The wine dataset, obtained from the gclus package (Hurley, 2012), has 13 measurements of chemical aspects of 178 Italian wines from three different types of wine.

Several other datasets were also considered, on which performance of the different algorithms is summarized in the Appendix.

4.4 Unsupervised Classification

Unsupervised classification is performed on the (scaled) datasets described above using the same comparator distributions as for the simulated data. The ARI values and the numbers of groups chosen by the BIC are shown in Table 3, along with the classification tables in Table 4.

Table 3 Performance comparison of four mixture model families on real data for the unsupervised scenarios. Sample size, dimensionality, and the number of known groups (i.e., classes) are in parentheses following each dataset name
Table 4 Classification tables for the cluster analysis of each model and dataset

In the case of the body dataset, the best selected MSPE model (an EEEV model) fits heavy-tailed components with some skewness. For the other three comparators, the BIC selects a four-component model, with the fewest misclassifications for the MPE model. The coffee dataset is small (n = 43), and the selected MSPE model (heavy-tailed) achieves perfect classification. The other three comparators overfit the number of components, with MPE and gpcm obtaining the same classification and the ghpcm model fitting two small clusters containing six and seven observations, respectively. In the case of the fvoles dataset, skewed yet light-tailed clusters are fit. The gpcm model performs equally well here, while the ghpcm model misclassifies two more observations in comparison. On the other hand, the selected MPE model fits three components (with one component containing only six observations). In the case of the olive data (for which the known number of regions is G = 3), the BIC selects a similarly performing four-component solution for all four comparators, with olive oils from Southern Italy split into two components; the MSPE model fits heavy-tailed components. On the penguins data, the two skewed mixtures, MSPE and ghpcm, perform best, fitting the “correct” number of components. The MSPE model fits heavy-tailed (and somewhat skewed) clusters. The selected MPE and gpcm models each split a different species into two components, with the best-fitting gpcm model having the largest number of misclassifications of all the comparators. Finally, in the case of the wine data, the MSPE mixture fits heavy-tailed components with five misclassifications, while the BIC selects a four-component solution for the other three comparators.

Note that, for the body and wine datasets, the findings above differ from those previously reported for the MPE mixtures in Dang et al. (2015); this is likely due to different starting values. For example, in Dang et al. (2015), a solution was obtained using MPE mixtures that fit three components and misclassified only one observation; here, however, the MPE mixture selected by the BIC is a four-component model (one of the MPE mixture fits here also had only one misclassification, but it was the third-best model according to the BIC). Note that in the semi-supervised case as well (Table 5), the MSPE mixtures perform best, followed by the MPE mixtures. The estimates of the βg parameters from the MSPE fit indicate heavy tails, while ghpcm overfits the number of components.

It is interesting to note that, while the BIC is used to select one model from within each family, there is no guarantee that the selected model, whether within a family or between families, provides the best clustering performance.

4.5 Semi-supervised Classification

Using the same datasets as in the unsupervised classification case, semi-supervised classification is now considered. Following McNicholas (2010), the (observed) likelihood in the model-based classification framework can be written

$$ \mathcal{L}(\boldsymbol{\vartheta} \mid \mathbf{x}_{1}, \ldots, \mathbf{x}_{n})= \prod\limits_{i=1}^{k} \prod\limits_{g=1}^{G} [\pi_{g} f(\mathbf{x}_{i}|\boldsymbol{\theta}_{g})]^{z_{ig}} \prod\limits_{j=k+1}^{n} \sum\limits_{h=1}^{G} [\pi_{h} f(\mathbf{x}_{j}|\boldsymbol{\theta}_{h})], $$

where the first k observations have known component memberships (i.e., labels), f(xi|𝜃g) is the g th component density, and πg and zig have the usual interpretations.
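A sketch of evaluating this semi-supervised log-likelihood is given below; it reuses mspe_logpdf from the earlier sketch, and labels holds the known component index for each of the first k observations and None for the unlabelled ones.

```python
# Sketch of the semi-supervised (observed) log-likelihood: labelled observations
# contribute through their known component, unlabelled ones through the mixture.
import numpy as np
from scipy.special import logsumexp

def semisup_loglik(X, labels, pis, mus, Sigmas, betas, psis):
    G = len(pis)
    total = 0.0
    for x_i, lab in zip(X, labels):
        log_terms = np.array([np.log(pis[g])
                              + mspe_logpdf(x_i, mus[g], Sigmas[g], betas[g], psis[g])
                              for g in range(G)])
        total += log_terms[lab] if lab is not None else logsumexp(log_terms)
    return total
```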

For each dataset previously considered, we take 25 labelled/unlabelled splits with 25% supervision. In Table 5, we display the median ARI values along with the first and third quartiles over the 25 splits. Generally, and as one would expect, performance in the semi-supervised scenarios was found to be better than in the fully unsupervised scenarios. In the case of the wine data, as mentioned in Section 4.4, the MSPE mixtures clearly exhibit the best performance (perfect classification), followed by the MPE mixtures. On the fvoles, olive, and penguins datasets, all comparators performed similarly. For the penguins dataset, the highest ARI was achieved by the two power exponential-based mixtures (only one misclassification with 25% supervision). On the coffee dataset, all models performed well, although gpcm and ghpcm had some runs with poor fits. On the body dataset, the two power exponential-based mixtures performed best, with the highest ARI corresponding to three misclassifications among the 75% of observations left unlabelled, followed by gpcm.

Table 5 Median ARI values along with first and third quartiles in parentheses for the four different models for each dataset for the semi-supervised runs

5 Discussion

A multivariate skewed power exponential distribution is introduced that is well suited to density estimation for a wide range of data with non-Gaussian clusters. The family of 16 MSPE mixtures presented herein allows for robust model-based clustering of skewed as well as symmetric components. These models can accommodate varying levels of peakedness and tail weight (light, Gaussian, or heavy) simultaneously with skewness and, as a result, are well suited to heterogeneous data with non-Gaussian components.

In addition to these properties, special cases of the MSPE distribution include the skew-normal distribution among others. The performance of such mixtures for clustering is investigated on a wide range of simulated scenarios—on heavy-tailed, light-tailed, Gaussian, and skewed components, and combinations thereof—and on real data of varying dimensionalities and sizes commonly used for illustrating clustering and classification. At present, model selection is performed using the BIC and, although this performs well in most cases, it is by no means perfect and alternative criteria could be considered in more detail. Standard errors of parameter estimates were not considered herein; this could be implemented via the standard information-based method to obtain the asymptotic covariance matrix of the estimates or via bootstrapping (Basford et al., 1997; Lin et al., 2014).

Through simulations, we showed scenarios in which such skewed mixtures are comparable to, or better than, widely used elliptical mixture models (mixtures of Gaussians) and skewed mixture models that are gaining increasing attention (mixtures of generalized hyperbolic distributions). On real datasets, we compared performance in the context of both unsupervised classification (i.e., clustering) and semi-supervised classification. The MSPE model performed as well as or better than the three other mixture model families on most of the investigated datasets. The analyses of the real datasets in the unsupervised case revealed some possible weaknesses, which may be related to initialization or to the choice of the BIC as the model selection criterion. Performance improved substantially for some of the datasets when a small level of supervision was introduced.

There are numerous areas of possible future work. One such area would be to consider a mixture of factor analyzers with the MSPE distribution for high-dimensional data. A matrix variate extension, in a similar manner to Gallaugher and McNicholas (2018), might also be interesting for modeling three-way data.