1 Introduction

Statistical models based on finite mixtures of distributions have been extensively used in a wide variety of applications. Applying finite mixture models to real datasets makes it possible to capture different characteristics of the empirical distribution across observations, such as multimodality, skewness, kurtosis and heterogeneity. For general reviews of mixture models and applications, see Hogan and Laird (1997), Böhning (2000), McLachlan and Peel (2000), Frühwirth-Schnatter (2006), Lin (2010) and Mengersen et al. (2011).

While the vast majority of work on mixture models has focused on Gaussian mixture models, in many applications the tails of the Gaussian distribution are shorter than appropriate and the Gaussian shape is not suitable for highly asymmetric data. Recent research has thus focused on fitting finite mixture models with more flexible distributional forms. The Student-t and the contaminated Gaussian distributions are two symmetric members of the scale mixtures of normal (SMN) family of distributions due to Andrews and Mallows (1974), which provide attractive heavy-tailed alternatives to the Gaussian distribution. Building upon this work is the class of scale mixtures of skew-normal (SMSN) distributions proposed by Branco and Dey (2001). The class of SMSN distributions provides location-scale density functions which depend on additional shape and kurtosis parameters, and includes as special cases the normal and skew-normal (SN) densities, as well as the full SMN class of symmetric densities. Special symmetric and skew-symmetric heavier-tailed members of the SMSN family are, e.g., the Student-t, Cauchy, skew-t (ST), skew-Cauchy (SC), skew-contaminated normal (SCN) and skew-slash (SSL) distributions. Comprehensive coverage of the fundamental theory and new developments for the SN and related distributions is given by Azzalini and Capitanio (2014); see also Genton (2004), Arellano-Valle and Genton (2005, 2010) and Arellano-Valle and Azzalini (2006).

Many distributions within the SMSN class have been developed, and their performance has been assessed in the context of mixture models. Lin et al. (2007), Lin et al. (2009) and Pyne et al. (2009) studied mixtures of skew-normal distributions. Frühwirth-Schnatter and Pyne (2010) considered Bayesian inference for finite mixtures of univariate and multivariate SN and ST distributions. Basso et al. (2010) considered robust mixture modeling based on the SMSN family. Wang et al. (2009), Lin (2010), Lee and McLachlan (2014), Vrbik and McNicholas (2012) and Forbes and Wraith (2014) considered mixtures of multivariate ST distributions. Maleki and Arellano-Valle (2017) proposed a time series model based on finite mixtures of SMSN distributions. For a review of mixtures of SN and ST distributions, see Lee and McLachlan (2013a, b).

Other distributional forms within the SMSN family of distributions have also been examined. Karlis and Santourian (2009) developed mixtures of multivariate normal inverse Gaussian distributions. Franczak et al. (2014) examined mixtures of shifted asymmetric Laplace (SAL) distributions. Morris et al. (2014) proposed mixtures of contaminated SAL distributions. Browne and McNicholas (2015) and Wraith and Forbes (2015) examined mixtures of generalized hyperbolic distributions.

In the SMSN class of distributions, although the mixing distribution typically controls the tail behavior, it can also affect the behavior of the skewness (Branco and Dey 2001). Recent work on this theme, providing flexibility in both skewness and heavy tails, was carried out by Vilca et al. (2014), who introduced a class of distributions referred to as multivariate SN generalized hyperbolic (SNGH) distributions. In this setting, the mixing distribution follows a generalized inverse Gaussian (GIG) distribution, which has previously been demonstrated to provide considerable flexibility in modeling heavy-tailed data (Wraith and Forbes 2015).

In other recent work, a broad class of skewed distributions has been explored by Lee and McLachlan (2016) in a mixture model context, focusing on the unified ST (SUT) distribution (Arellano-Valle and Azzalini 2006) and the fundamental ST distribution (Arellano-Valle and Genton 2005), including as a special case the location-scale variant of the canonical fundamental or unrestricted ST (or skew unified t; SUT) distribution. A particular feature and advantage of the SUT distribution is that it encompasses as special cases the canonical fundamental or unrestricted SN (or skew unified normal; SUN) distribution (Arellano-Valle and Genton 2005) and other SN or ST variants (e.g., Sahu et al. 2003; Arellano-Valle et al. 2007; Lachos et al. 2007, 2010), thus providing considerable flexibility for modeling, where the best fitting distribution can be chosen simply (automatically) through parameter estimation or the use of model choice criteria.

In this paper, we propose a very general class of distributions which extends the previous work on the SUN and SUT distributions by considering a mixing distribution for this class of models which follows a generalized inverse Gaussian (GIG) distribution. We refer to this new family of distributions as the unrestricted skew-normal generalized hyperbolic (SUNGH) distribution. The new family provides a very general framework for a large class of distributions and has several desirable properties, including an analytically tractable density and ease of computation for simulation and estimation of parameters. The family also provides a high degree of flexibility for modeling complex multivariate data with different degrees of asymmetry, kurtosis and heavy tails. A particularly attractive feature of this family of distributions is that it encompasses as special cases all of the distributions previously considered in the SMSN family and extensions to the unrestricted classes (e.g., SUT and SUN). Thus, this class of distributions provides a much needed unified framework where the choice of the best fitting distribution can proceed quite naturally through parameter estimation, or by placing constraints on specific parameters and assessing the result through model choice criteria. We illustrate the advantages of this new family in the finite mixture modeling context using a Bayesian framework.

There are some computational advantages to using a Bayesian framework in a mixture model setting. First, allowing for the influence or effect of missing data on parameter estimates is quite natural in a Bayesian setting, as various patterns of missing data (e.g., class-dependent missingness) can be imputed at each MCMC iteration from the posterior predictive distribution (e.g., using a mixture model defined in open-source software such as JAGS or NIMBLE). In contrast, quite separate and often complex methods are needed for maximum likelihood estimation in these settings (Lin et al. 2009; Wang et al. 2004). Further, for the complex distributions we consider in this paper, previous work using the EM algorithm has at times relied on approximations (Lee and McLachlan 2016) or on calculations of derivatives involving complex functions (Browne and McNicholas 2015) for the estimation of parameters. This difficulty also extends to the estimation of standard errors for the parameters (if they are available), which requires asymptotic approximations to the observed information matrix when the sample size is large, or resorting to a bootstrap method which is computationally demanding (Basso et al. 2010). At times, standard errors for the parameters are also unavailable (particularly for the GH distribution) or not reported (e.g., Browne and McNicholas 2015). This is not to say that estimation in the Bayesian setting is devoid of potential computational issues; in particular, label switching is a more prominent issue compared to methods using ML estimation (Mengersen et al. 2011).

The paper is organized as follows. In Sect. 2, we provide some background to the SUN and GIG distributions. Section 3 outlines the details and properties of the new SUNGH family. In Sect. 4, we present a Bayesian analysis of a finite mixture model following a SUNGH distribution. In Sect. 5, we illustrate the performance of the proposed approach on real and simulated data. Finally, in Sect. 6, we present our main conclusions and discuss some areas of further research.

2 SUN and GIG distributions

2.1 Preliminaries

Following Arellano-Valle and Genton (2005), Arellano-Valle and Azzalini (2006) and Arellano-Valle et al. (2007), we say that a \(p\times 1\) random vector \({\varvec{X}}\) follows an unrestricted skew-normal (SUN) distribution with \(p\times 1\) location vector \({\varvec{\mu }} \), \(p\times p\) positive definite dispersion matrix \(\varvec{\Sigma }\) and \(p\times q\) skewness parameter matrix \(\varvec{\Lambda }\), denoted by \({\varvec{X}}\sim \mathrm{SUN}_{p,q} \left( {{\varvec{\mu }} ,\varvec{\Sigma },\varvec{\Lambda }} \right) \), if its probability density function (pdf) is

$$\begin{aligned} f\left( {{\varvec{x}}\hbox {|}{\varvec{\mu }} ,\varvec{\Sigma },\varvec{\Lambda }} \right)= & {} 2^{q}\phi _p \left( {{\varvec{x}}\hbox {|}{\varvec{\mu }} ,\varvec{\uppsi }} \right) {\Phi }_q \left( {\varvec{\Lambda }^{\top }\varvec{\uppsi }^{-1}\left( {{\varvec{x}}-{\varvec{\mu }} } \right) \hbox {|}{\varvec{\varUpsilon }} } \right) ,\nonumber \\&{\varvec{x}}\in {\varvec{R}}^{p}, \end{aligned}$$
(1)

where \(\varvec{\uppsi }=\varvec{\Sigma }+{\varvec{\Lambda \Lambda }}^{\top }\), \(\varvec{\Upsilon }={\varvec{I}}_q -\varvec{\Lambda }^{\top }\varvec{\uppsi }^{-1}\varvec{\Lambda }=\left( {{\varvec{I}}_q +\varvec{\Lambda }^{\top }\varvec{\Sigma }^{-1}\varvec{\Lambda }} \right) ^{-1}\), and \(\phi _p \left( {\cdot \hbox {|}{\varvec{\mu }} ,\varvec{\uppsi }} \right) \) and \({\Phi }_q \left( {\cdot \hbox {|}\varvec{\Upsilon }} \right) \) are, respectively, the pdf and cumulative distribution function (cdf) of the multivariate normal distributions \(N_p \left( {{\varvec{\mu }} ,\varvec{\uppsi }} \right) \) and \(N_q \left( {{\varvec{0}},\varvec{\Upsilon }} \right) \). The SUN class of multivariate distributions defined by (1) contains various special cases. For instance, we recover the multivariate normal when \(\varvec{\Lambda }=\mathbf{0}\), the multivariate SN, called here the restricted SN (rMSN), when \(q=1\), and the multivariate SN of Sahu et al. (2003) when \(p=q\) and \(\varvec{\Lambda }\) is a diagonal matrix. In fact, the distribution in (1) is an important special case of the unified SN (SUN) distribution studied by Arellano-Valle and Azzalini (2006).

The random vector \({\varvec{X}}\sim \mathrm{SUN}_{p,q} \left( {{\varvec{\mu }} ,\varvec{\Sigma },\varvec{\Lambda }} \right) \) can be stochastically represented in different ways. According to Arellano-Valle et al. (2006), the SUN random vector \({\varvec{X}}\) has the selection representation

$$\begin{aligned} {\varvec{X}}={\varvec{\mu }} +\left( {{\varvec{V}}_1 \hbox {|}{\varvec{V}}_0 >0} \right) , \end{aligned}$$
(2)

where the condition \({\varvec{V}}_0 >0\) means that each element of \({\varvec{V}}_0 \) is positive, and

$$\begin{aligned} \left( {{\begin{array}{c} {{\varvec{V}}_0 } \\ {{\varvec{V}}_1 } \\ \end{array} }} \right) \sim N_{p+q} \left( {\left( {{\begin{array}{c} {\varvec{0}} \\ {\varvec{0}} \\ \end{array} }} \right) ,\left( {{\begin{array}{cc} {{\varvec{I}}_{q} }&{} {\varvec{\Lambda }^{\top }} \\ \varvec{\Lambda }&{} \varvec{\Sigma } \\ \end{array} }} \right) } \right) . \end{aligned}$$

The representation in (2) becomes a selection representation of the rMSN distribution when \(q=1\), i.e., when the latent vector \({\varvec{V}}_0 \) is replaced by a one-dimensional normal random variable \({{V}}_0 \). Also, if we let \({\varvec{V}}_0 ={\varvec{W}}_0\) and \({\varvec{V}}_1 ={\varvec{W}}_1 +\varvec{\Lambda }{\varvec{W}}_0 \), where \({\varvec{W}}_0 \sim {\varvec{N}}_q \left( {\mathbf{0},{\varvec{I}}_{q} } \right) \) and \({\varvec{W}}_1 \sim {\varvec{N}}_p \left( {{\varvec{0}}, {\varvec{I}}_p } \right) \) are independent, it follows from (2) that the stochastic representation of \({\varvec{X}}\) is given by

$$\begin{aligned} {\varvec{X}}={\varvec{\mu }} +\varvec{\Lambda }\left| {{\varvec{W}}_0} \right| +\varvec{\Sigma }^{1/2}{\varvec{W}}_1 , \end{aligned}$$
(3)

where \(\left| {{\varvec{W}}_0} \right| \) is the vector formed by the absolute values of the components of \({\varvec{W}}_0\). For more details, see Arellano-Valle et al. (2006), Arellano-Valle and Azzalini (2006) and Arellano-Valle et al. (2007). In particular, the mean vector and covariance matrix of \({\varvec{X}}\) are given by \(E\left[ {\varvec{X}} \right] ={\varvec{\mu }} +\sqrt{2/\pi }\varvec{\Lambda }\mathbf{1}_q \) and \(\hbox {Cov}\left[ {\varvec{X}} \right] =\varvec{\uppsi }-\frac{2}{\pi }\varvec{\Lambda }{} \mathbf{1}_q \mathbf{1}_q^\top \varvec{\Lambda }^{\top }\), respectively, where \(\mathbf{1}_q \) denotes the vector of ones of length q.
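To make the representation (3) concrete, the following base-R sketch simulates from a \(\mathrm{SUN}_{p,q}\) distribution and compares the Monte Carlo mean with the formula above; the dimensions and parameter values (p, q, mu, Sigma, Lambda) are purely illustrative choices of ours, not taken from any example in the paper.

```r
## Base-R sketch: simulate SUN_{p,q}(mu, Sigma, Lambda) via representation (3)
## and compare the Monte Carlo mean with E[X] = mu + sqrt(2/pi) * Lambda %*% 1_q.
set.seed(1)
p <- 3; q <- 2; n <- 1e5
mu     <- c(0, 1, -1)
Sigma  <- diag(p)
Lambda <- matrix(c(2, 0.5, -1, 0, 1, 0.5), nrow = p, ncol = q)
R  <- chol(Sigma)                             # Sigma = t(R) %*% R
W0 <- abs(matrix(rnorm(n * q), n, q))         # |W0|, with W0 ~ N_q(0, I_q)
W1 <- matrix(rnorm(n * p), n, p)              # W1 ~ N_p(0, I_p)
X  <- sweep(W0 %*% t(Lambda) + W1 %*% R, 2, mu, "+")
colMeans(X)                                      # Monte Carlo estimate
drop(mu + sqrt(2 / pi) * Lambda %*% rep(1, q))   # theoretical mean
```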

In this work, we consider the extension of the scale mixtures of rMSN (SMRSN or SMSN) distributions to the scale mixtures of SUN (SMSUN) distributions. Specifically, we consider the family of random vectors defined by

$$\begin{aligned} {\varvec{Y}}={\varvec{\mu }} +\kappa \left( U \right) ^{1/2}{\varvec{X}}, \end{aligned}$$
(4)

where \({\varvec{X}}\sim \mathrm{SUN}_{p,q} \left( {\mathbf{0},\varvec{\Sigma },\varvec{\Lambda }} \right) \), \(\kappa \left( \cdot \right) \) is a positive scale function and U is a mixing random variable which is independent of \({\varvec{X}}\). For our proposed SUNGH distribution, we consider the SMSUN class of distributions defined by (4) when the mixing random variable U follows a GIG distribution.

2.2 The family of GIG distribution

The GIG class is a rich family of flexible distributions with positive support that has been studied by several authors; see, for instance, Good (1953), Barndorff-Nielsen and Halgreen (1977) and Jørgensen (1982), among others. A GIG distribution for the scale mixing variable U in (4) is thus a natural choice and yields a highly flexible unified class of multivariate distributions for multivariate statistical analysis.

The GIG distribution admits several equivalent parameterizations. In this paper, in order to simplify matters and obtain closed-form posterior distributions in the Bayesian framework adopted here, we consider (without loss of generality) the following two representations:

First representation \(\mathrm{GIG}^{*}\left( {\upsilon ,\gamma ,\rho } \right) \) : A random variable U has a GIG distribution, denoted by \(U\sim \mathrm{GIG}^{*}\left( {\upsilon ,\gamma ,\rho } \right) \), if its pdf is given by

$$\begin{aligned}&\mathcal{G}\mathcal{I}\mathcal{G}^{*}\left( {u\left| {\upsilon ,\gamma ,\rho } \right. } \right) =\left( {\frac{\gamma }{\rho }} \right) ^{\upsilon }\frac{u^{\upsilon -1}}{2K_\upsilon \left( {\rho \gamma } \right) }\nonumber \\&\quad \exp \left( {-\frac{1}{2}\left( {\frac{\rho ^{2}}{u}+\gamma ^{2}u} \right) } \right) ,\quad u>0, \end{aligned}$$
(5)

where \(K_r \left( x \right) \) is the modified Bessel function of the third kind of order r evaluated at x, and the parameter spaces are given by \(\gamma >0\), \(\rho >0\) and \(-\infty<\upsilon <+\infty \).

Second representation \(\mathrm{GIG}_*\left( {\upsilon ,\psi ,\eta } \right) \) : A random variable U follows a GIG distribution denoted by \(U\sim \mathrm{GIG}_*\left( {\upsilon ,\psi ,\eta } \right) \), if its pdf is given by

$$\begin{aligned}&\mathcal{G}\mathcal{I}\mathcal{G}_*\left( {u\left| {\upsilon ,\psi ,\eta } \right. } \right) =\frac{\left( {u/\eta } \right) ^{\upsilon -1}}{2\eta K_\upsilon \left( \psi \right) }\nonumber \\&\quad \exp \left( {-\frac{\psi }{2}\left( {\frac{u}{\eta }+\frac{\eta }{u}} \right) } \right) ,u>0, \end{aligned}$$
(6)

where \(K_r \left( x \right) \) is defined previously and the parameter spaces are \(\psi >0\), \(\eta >0\) and \(-\infty<\upsilon <+\infty \). This representation will be used to simplify the posterior representation of the GIG parameters. In this case, the mth moment of the random variable \(U^{1/2}\) is given by

$$\begin{aligned} E\left( {U^{m/2}} \right) =\frac{K_{\upsilon +m/2} \left( \psi \right) }{K_\upsilon \left( \psi \right) }\eta ^{m/2},m=1,2,\ldots . \end{aligned}$$

The equivalence between both representations of the GIG distribution considered in (5) and (6) is obtained by observing the one-to-one relationship between their parameters given by \(\psi =\rho \gamma \) and \(\eta =\rho /\gamma \). Particular members of the GIG class lead to a variety of skewed distributions belonging to the proposed family. The inverse Gaussian is one member of this class which has been extensively studied by Chhikara and Folks (1989), Seshadri (1993) and Johnson et al. (1994, chap. 15). Two additional members of the GIG class are the hyperbola and the positive hyperbolic distributions, both of which have been studied by Barndorff-Nielsen (1978) and Barndorff-Nielsen and Blaesild (1980). The exponential, gamma and inverse gamma distributions are also special members of the GIG family. For a recent study on these distributions, see Vilca et al. (2014) and references therein.
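As a small worked illustration of the second representation (6), the base-R sketch below evaluates the \(\mathrm{GIG}_*\) density and the moment formula for \(E(U^{m/2})\) using R's besselK for \(K_\nu(\cdot)\); the parameter values are illustrative.

```r
## GIG_*(upsilon, psi, eta) density (6) and moment formula
## E(U^{m/2}) = K_{upsilon+m/2}(psi)/K_upsilon(psi) * eta^{m/2}.
## besselK(x, nu) is K_nu(x); K_{-nu}(x) = K_nu(x), so abs() is used for safety.
dgig_star <- function(u, upsilon, psi, eta) {
  (u / eta)^(upsilon - 1) / (2 * eta * besselK(psi, abs(upsilon))) *
    exp(-(psi / 2) * (u / eta + eta / u))
}
mom_gig_star <- function(m, upsilon, psi, eta) {
  besselK(psi, abs(upsilon + m / 2)) / besselK(psi, abs(upsilon)) * eta^(m / 2)
}
upsilon <- -0.5; psi <- 1; eta <- 1
integrate(dgig_star, 0, Inf, upsilon = upsilon, psi = psi, eta = eta)$value  # ~ 1
c(numerical = integrate(function(u) sqrt(u) * dgig_star(u, upsilon, psi, eta),
                        0, Inf)$value,
  formula   = mom_gig_star(1, upsilon, psi, eta))
```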

In this paper, we define the multivariate random variable \({\varvec{Y}}\) via (4), and by considering a multivariate SUN random variable \({\varvec{X}}\) according to (3) and a GIG scale random variable U distributed according to the second representation in (6). As mentioned previously, we refer to this proposed family as SUNGH distributions.

3 The family of SUNGH distributions

An alternative way to define the SUNGH distribution follows by substituting Eq. (3) into Eq. (4). From this, we can say that a \(p\times 1\) random vector \({\varvec{Y}}\) follows a SUNGH distribution if

$$\begin{aligned} {\varvec{Y}}={\varvec{\mu }} +\varvec{\Lambda }{\varvec{W}}+\kappa \left( U \right) ^{1/2}\varvec{\Sigma }^{1/2}{\varvec{W}}_1 , \end{aligned}$$
(7)

where \({\varvec{\mu }} \) is a \(p\times 1\) location vector, \(\varvec{\Sigma }\) is a \(p\times p\) scale matrix, \(\varvec{\Lambda }\) is a \(p\times q\) shape matrix, \({\varvec{W}}=\kappa ^{1/2}\left( U \right) \left| {{\varvec{W}}_0 } \right| \), \({\varvec{W}}_0 \sim N_q \left( {\mathbf{0},{\varvec{I}}_{q} } \right) \), \({\varvec{W}}_1 \sim N_p \left( {\mathbf{0},{\varvec{I}}_{p} } \right) \) and \(U\sim \mathrm{GIG}_*\left( {\upsilon ,\psi ,\eta } \right) \), with \({\varvec{W}}_0 \), \({\varvec{W}}_1 \) and U being independent random quantities. These assumptions imply that \({\varvec{W}}\) is also independent of \({\varvec{W}}_1 \). Note that if we set \({\varvec{W}}=U\), \(\kappa \left( u \right) =u\) and \(q=1\), we obtain the GH distribution proposed by McNeil et al. (2005) and considered in the mixture model context by Browne and McNicholas (2015). For this reason, the GH distribution is more restrictive (less flexible) than the SUNGH distribution. Since the conditional distribution of \({\varvec{Y}}\) given \(U=u\) is \(\left. {\varvec{Y}} \right| U=u\sim \mathrm{SUN}_{p,q} \left( {{\varvec{\mu }} ,\kappa \left( u \right) \varvec{\Sigma },\kappa \left( u \right) ^{1/2}\varvec{\Lambda }} \right) \), the marginal pdf of \({\varvec{Y}}\) becomes the infinite mixture of the SUN pdf in (1) given by

$$\begin{aligned}&f\left( {{\varvec{y}}|{\varvec{\mu }} ,\varvec{\Sigma },\varvec{\Lambda },{\varvec{\varpi }} } \right) =2^{q}\mathop \int \limits _0^\infty \phi _p \left( {{\varvec{y}}\hbox {|}{\varvec{\mu }},\kappa \left( u \right) \varvec{\uppsi }} \right) \Phi _q \nonumber \\&\quad \left( {\kappa \left( u \right) ^{-1/2}\varvec{\Lambda }^{\top }\varvec{\uppsi }^{-1}\left( {{\varvec{y}}-{\varvec{\mu }} } \right) \hbox {|}\Upsilon } \right) \mathcal{G}\mathcal{I}\mathcal{G}_*\left( {u\left| {\varvec{\varpi }} \right. } \right) du, \end{aligned}$$
(8)

\({\varvec{y}}\in {\varvec{R}}^{p}\), where \({\varvec{\varpi }} =\left( {\upsilon ,\psi ,\eta } \right) ^{\top }\), and \(\varvec{\uppsi }\) and \({\varvec{\varUpsilon }} \) are defined as in (1). In what follows, we refer to the SUNGH random vector in (7) as \({\varvec{Y}}\sim \mathrm{SUNGH}_{p,q} \left( {{\varvec{\mu }} ,\varvec{\Sigma },\varvec{\Lambda },{\varvec{\varpi }} } \right) \).
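A minimal sketch of how a \(\mathrm{SUNGH}_{p,q}\) vector can be simulated directly from the representation (7) with \(\kappa(u)=u\) is given below. It assumes the GIGrvg package is available for GIG sampling; its rgig(n, lambda, chi, psi) samples a density proportional to \(x^{\lambda-1}\exp\{-(\chi/x+\psi x)/2\}\), so \(\mathrm{GIG}_*(\upsilon,\psi,\eta)\) in (6) corresponds to lambda = \(\upsilon\), chi = \(\psi\eta\) and the rgig argument psi = \(\psi/\eta\). All parameter values are illustrative.

```r
## Sketch of the stochastic representation (7) with kappa(u) = u.
library(GIGrvg)
rsungh <- function(n, mu, Sigma, Lambda, upsilon, psi, eta) {
  p <- length(mu); q <- ncol(Lambda)
  U  <- rgig(n, lambda = upsilon, chi = psi * eta, psi = psi / eta)  # U ~ GIG_*
  W0 <- abs(matrix(rnorm(n * q), n, q))              # |W0|
  W1 <- matrix(rnorm(n * p), n, p)                   # W1
  ## Y = mu + sqrt(U) * (Lambda |W0| + Sigma^{1/2} W1), row by row
  sweep(sqrt(U) * (W0 %*% t(Lambda) + W1 %*% chol(Sigma)), 2, mu, "+")
}
set.seed(2)
Y <- rsungh(1000, mu = c(0, 0), Sigma = diag(2),
            Lambda = matrix(c(2, -1), 2, 1), upsilon = -0.5, psi = 1, eta = 1)
```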

Note that there are some identifiability issues concerning the GIG parameters \(\varvec{\varpi } \) and the skewness matrix \(\varvec{\Lambda }\). The density in (8) is not identifiable since, for any constant \(c>0\), the parameters \(\left( {\varvec{\mu } ,\varvec{\Sigma },\varvec{\Lambda },\upsilon ,\psi ,\eta } \right) \) and \(\left( {\varvec{\mu } ,c\varvec{\Sigma },c\varvec{\Lambda },\upsilon ,\psi /c,c\eta } \right) \) yield the same density. A simple fix which results in an identifiable density is to set \(\eta =1\), so that \(\varvec{\varpi }=\left( {\upsilon ,\psi } \right) ^{\top }\). An alternative parameterization which can provide greater flexibility is discussed in Wraith and Forbes (2015). Further, \({\varvec{W}}\) in the stochastic representation (7) can be premultiplied by any permutation matrix without changing the distribution of \({\varvec{Y}}\), so sorting the columns of \(\varvec{\Lambda }\) by their norms, or some other ordering convention, is also needed to ensure identifiability of the proposed model.

Varying the scale mixing function \(\kappa \left( U \right) \) for a given distribution of U belonging to the \(\mathrm{GIG}_*\left( \varvec{\varpi } \right) \) class leads to a variety of members in the SUNGH family. Alternatively, we can fix the scale function and vary the distribution of U within the \(\mathrm{GIG}_*\left( \varvec{\varpi } \right) \) class. In the latter case, a convenient choice for the scale function is \(\kappa \left( u \right) =u\), for which the pdf in (8) becomes

$$\begin{aligned}&f\left( {{\varvec{y}}\hbox {|}{\varvec{\mu }} ,\varvec{\Sigma },\varvec{\Lambda },\varvec{\varpi } } \right) =2^{q}\mathcal{G}\mathcal{H}_p \left( {\varvec{y}}|{\varvec{\mu }} ,\varvec{\uppsi },\mathbf{0},{\varvec{\nu }}^{\prime } \right) GH_{q}\nonumber \\&\quad ({\varvec{B}}|\mathbf{0},{\varvec{\varUpsilon }},{\varvec{0}},{\varvec{\nu }}^{''}), {\varvec{y}}\in {\varvec{R}}^{p}, \end{aligned}$$
(9)

where \({\varvec{\nu }}'=\left( {\upsilon ,\sqrt{\psi /\eta } ,\sqrt{\psi \eta } } \right) ^{\top }\), \({\varvec{\nu }}''=\left( {\upsilon -p/2,\sqrt{\psi /\eta } ,q'\left( {\varvec{y}} \right) } \right) ^{\top }\), \(q'\left( {\varvec{y}} \right) ^{2}=\left( {{\varvec{y}}-{\varvec{\mu }} } \right) ^{\top }\varvec{\uppsi }^{-1}\left( {{\varvec{y}}-{\varvec{\mu }} } \right) +\psi \eta \), \(\varvec{\uppsi }=\varvec{\Sigma }+\varvec{\Lambda \Lambda }^{\top }\), \({\varvec{\varUpsilon }} ={\varvec{I}}_{q} -\varvec{\Lambda }^{\top }\varvec{\uppsi }^{-1}\varvec{\Lambda }\) and \({\varvec{B}}=\varvec{\Lambda }^{\top }\varvec{\uppsi }^{-1}\left( {{\varvec{y}}-{\varvec{\mu }} } \right) \); here \(\mathcal{G}\mathcal{H}_p \) and \(GH_q \) denote the p-variate pdf and the q-variate cdf of the generalized hyperbolic distribution, respectively (Wraith and Forbes 2015).

The flexibility of the SUNGH family proposed in (8) can also be seen by varying the value of the dimension q. In fact, for \(q=1\) (the restricted case) we obtain as a special case of (8) the SN generalized hyperbolic (SNGH) distributions considered in Vilca et al. (2014), and thus several known SMSN (or SMRSN) distributions, as well as the corresponding symmetric variants for \(\varvec{\Lambda }=\mathbf{0}\).

A special case of the GIG distribution is the gamma distribution, so the proposed family of distributions covers the canonical fundamental unrestricted skew-normal (CFUSN) distribution of Arellano-Valle and Genton (2005) and the canonical fundamental unrestricted skew-t (CFUST) distribution of Lee and McLachlan (2016). Consequently, a mixture model based on this family contains finite mixtures of CFUSN and CFUST distributions. Considering (9) in the symmetric case, the SUNGH family and the GH family studied by Wraith and Forbes (2015) and Browne and McNicholas (2015) are similar, but in the asymmetric case these families differ. In particular, a greater degree of flexibility is available for the SUNGH family by allowing the skewness parameter to be a \(p\times q\) matrix rather than a \(p\times 1\) vector. The SUNGH family also has several desirable properties, outlined in Propositions 2 to 6 below, which allow the family to be used in a variety of statistical models (e.g., mixed models and regression).

Known members of the SMSN family contained in the SNGH family are the SN, ST, SSL and skew-Laplace (SLP) distributions, and their respective symmetric versions. In the unrestricted case (\(q>1\)), the proposed family contains several subfamilies of distributions (symmetric and asymmetric) considered in the literature. For instance, if in (9) we let \(q=p\), \(\varvec{\Lambda }=\hbox {diag}\left( {{\lambda }_1 ,\ldots ,{\lambda }_{\mathrm{p}} } \right) \) and \(\kappa \left( u \right) =1\), then the multivariate skew-normal distribution of Sahu et al. (2003) is obtained. Finally, if \(\varvec{\Lambda }=\mathbf{0}\) (symmetric case) and \(\kappa \left( u \right) =u\), then (9) becomes the symmetric generalized hyperbolic (GH) distribution introduced by Barndorff-Nielsen and Halgreen (1977).

In the following propositions, we present some useful properties of the SUNGH family that will be needed in the following sections. The proofs of these results are presented in the “Appendix.”

Proposition 1

Let \({\varvec{Y}}\sim \mathrm{SUNGH}_{p,q} \left( {\varvec{\mu } ,\varvec{\Sigma },\varvec{\Lambda },\varvec{\varpi } } \right) \). Then, the following results hold:

(a) if \(k_1 =E\left[ {\kappa \left( U \right) ^{1/2}} \right] <\infty \), then \(E\left[ {\varvec{Y}} \right] ={\varvec{\mu }} +\sqrt{\frac{2}{\pi }}k_1 \varvec{\Lambda }{} \mathbf{1}_q \);

(b) if \(k_2 =E\left[ {\kappa \left( U \right) } \right] <\infty \), then \(\hbox {Var}\left[ {\varvec{Y}} \right] =k_2 \varvec{\uppsi }-\frac{2}{\pi }\varvec{\Lambda }\left[ {k_2 {\varvec{I}}_{q} -\left( {k_2 -k_1^2 } \right) \mathbf{1}_q \mathbf{1}_q^\top } \right] \varvec{\Lambda }^{\top }\).
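As a numerical check, the moments in Proposition 1 can be evaluated directly for \(\kappa(u)=u\) using the GIG moment formula of Sect. 2.2; the base-R sketch below does this for illustrative parameter values (all names are ours, not from the paper).

```r
## k1 = E[U^{1/2}] and k2 = E[U] from the GIG_* moment formula of Sect. 2.2;
## K_{-nu}(x) = K_nu(x), so abs() on the order is safe.
k_gig <- function(m, upsilon, psi, eta) {                # E[U^{m/2}]
  besselK(psi, abs(upsilon + m / 2)) / besselK(psi, abs(upsilon)) * eta^(m / 2)
}
sungh_moments <- function(mu, Sigma, Lambda, upsilon, psi, eta) {
  q  <- ncol(Lambda); ones <- rep(1, q)
  k1 <- k_gig(1, upsilon, psi, eta); k2 <- k_gig(2, upsilon, psi, eta)
  Psi <- Sigma + Lambda %*% t(Lambda)                    # the matrix psi of (1)
  list(mean = drop(mu + sqrt(2 / pi) * k1 * Lambda %*% ones),
       var  = k2 * Psi - (2 / pi) * Lambda %*%
              (k2 * diag(q) - (k2 - k1^2) * ones %*% t(ones)) %*% t(Lambda))
}
sungh_moments(mu = c(0, 0), Sigma = diag(2), Lambda = matrix(c(2, -1), 2, 1),
              upsilon = -0.5, psi = 1, eta = 1)
```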

Proposition 2

Let \({\varvec{Y}}\sim \mathrm{SUNGH}_{p,q} \left( {\varvec{\mu } ,\varvec{\Sigma },\varvec{\Lambda },\varvec{\varpi } } \right) \). Then, \({\varvec{Y}}\sim \mathrm{SUNGH}_{p,q+m} \left( {\varvec{\mu } ,\varvec{\Sigma },\varvec{\Lambda }^{*},\varvec{\varpi } } \right) \) for each \(m=1,2\),..., where \(\varvec{\Lambda }^{*}=\left( {{\begin{array}{cc} {\varvec{\Lambda }_{p\times q} }&{} {\mathbf{0}_{p\times m} } \\ \end{array} }} \right) \) or \(\varvec{\Lambda }^{*}=\left( {{\begin{array}{cc} {\mathbf{0}_{p\times m} }&{} {\varvec{\Lambda }_{p\times q} } \\ \end{array} }} \right) \).

Proposition 3

Let \({\varvec{Y}}\sim \mathrm{SUNGH}_{p,q} \left( {\varvec{\mu } ,\varvec{\Sigma },\varvec{\Lambda },\varvec{\varpi } } \right) \). Then, for each \({\varvec{b}}\in {\varvec{R}}^{n}\) and full row rank matrix \({\varvec{B}}\in {\varvec{R}}^{n\times p}\) we have

$$\begin{aligned} {\varvec{X}}={\varvec{b}}+{\varvec{BY}}\sim \mathrm{SUNGH}_{n,q} \left( {{\varvec{b}}+{\varvec{B}}{\varvec{\mu }} ,{\varvec{B}}\varvec{\Sigma }{\varvec{B}}^{\top },{\varvec{B}}\varvec{\Lambda },\varvec{\varpi } } \right) . \end{aligned}$$

Proposition 4

Let \({\varvec{Y}}\sim \mathrm{SUNGH}_{p,q} \left( {\varvec{\mu } ,\varvec{\Sigma },\varvec{\Lambda },\varvec{\varpi } } \right) \). Partition \({\varvec{Y}}=\left( {{\varvec{Y}}_1^\top ,{\varvec{Y}}_2^\top } \right) ^{\top }\), where the first and second sub-vectors are of dimensions \(p_1 \times 1\) and \(p_2 \times 1\), respectively, with \(p_1 +p_2 =p\). The corresponding partition of the parameters \(\left( {\varvec{\mu } ,\varvec{\Sigma },\varvec{\Lambda }} \right) \) is

$$\begin{aligned} {\varvec{\mu }}=\left( {{\begin{array}{c} {\varvec{\mu } _1 } \\ {\varvec{\mu } _2 } \\ \end{array} }} \right) ,\varvec{\Sigma }=\left( {{\begin{array}{cc} {\varvec{\Sigma }_{11} }&{} {\varvec{\Sigma }_{12} } \\ {\varvec{\Sigma }_{21} }&{} {\varvec{\Sigma }_{22} } \\ \end{array} }} \right) ,\varvec{\Lambda }=\left( {{\begin{array}{c} {\varvec{\Lambda }_1 } \\ {\varvec{\Lambda }_2 } \\ \end{array} }} \right) , \end{aligned}$$

where \({\varvec{\mu }}_i \), \(\varvec{\Sigma }_{ii} \) and \(\varvec{\Lambda }_i \) have dimensions \(p_i \times 1\), \(p_i \times p_i \) and \(p_i \times q\), respectively, for \(i=1,2\). Then, the marginal distribution of \({\varvec{Y}}_i \) is \(\mathrm{SUNGH}_{p_i ,q} \left( {\varvec{\mu } _i ,\varvec{\Sigma }_{ii} ,\varvec{\Lambda }_i ,\varvec{\varpi } } \right) ,i=1,2\).

Proposition 5

If under the same conditions of Proposition 4, we have \(\varvec{\Sigma }_{12} =\varvec{\Sigma }_{21} =\mathbf{0}\) then a necessary and sufficient condition to have null correlation between \({\varvec{Y}}_1 \) and \({\varvec{Y}}_2 \) is that \(\varvec{\Lambda }_1 =\mathbf{0}\) or \(\varvec{\Lambda }_2 =\mathbf{0}\).

Proposition 6

Consider the same conditions of Proposition 4 with the partition of the shape matrix \(\varvec{\Lambda } =\left( {\varvec{\Lambda }_{ij} } \right) _{i,j=1,2} \), where \(\varvec{\Lambda }_{ij} \) has dimension \(p_i \times q_j\), with \(q_1 +q_2 =q\). If \(\varvec{\Sigma }_{12} =\varvec{\Sigma }_{21}^\top =\mathbf{0}\) and \(\varvec{\Lambda }_{12} =\mathbf{0}\) or \(\varvec{\Lambda }_{21} =\mathbf{0}\), then \({\varvec{Y}}_i \sim \mathrm{SUNGH}_{p_i ,q_i } \left( {\varvec{\mu } _i ,\varvec{\Sigma }_{ii} ,\varvec{\Lambda }_{ii} ,\varvec{\varpi } } \right) ,i=1,2\), and \(\hbox {Cov}\left( {{\varvec{Y}}_1 ,{\varvec{Y}}_2 } \right) =-\frac{2}{\pi }k_1^2 \varvec{\Lambda }_{12} \mathbf{1}_{q_1 } \mathbf{1}_{q_2 }^\top \varvec{\Lambda }_{21}^\top \).

4 Finite mixtures of SUNGH family

4.1 FM-SUNGH model

In this section, we consider finite mixtures of the proposed SUNGH family of distributions (hereafter FM-SUNGH). To establish notation, we consider the usual mixture model defined as

$$\begin{aligned} f\left( {{\varvec{y}};\varvec{\Theta },{\varvec{p}}} \right) =\sum \nolimits _{k=1}^K p_k f\left( {{\varvec{y}};\varvec{\Theta }_k } \right) , \end{aligned}$$
(10)

where \(\varvec{\Theta }=\big ( {\varvec{\Theta }_1 ,\ldots ,\varvec{\Theta }_K } \big )\), with \(\varvec{\Theta }_k =\big ( {\varvec{\mu }} _k ,\varvec{\Sigma }_k ,\varvec{\Lambda }_k ,{\varvec{\upsilon }}_k ,{\varvec{\psi }} _k ,{\varvec{\eta }} _k \big ), k=1,\ldots ,K\), \({\varvec{p}}=\left( {p_1 ,\ldots ,p_K } \right) ^{\top }\) (for which \(p_k >0\), \(k=1,\ldots ,K\), and \(\sum \nolimits _{k=1}^K p_k =1\)), \({\varvec{\upsilon }}_k =\left( {\upsilon _{k1} ,\ldots ,\upsilon _{kp} } \right) ^{\top }\), \({\varvec{\psi }} _k =\left( {\psi _{k1} ,\ldots ,\psi _{kp} } \right) ^{\top }\), \({\varvec{\eta }} _k = \big ( \eta _{k1} ,\ldots ,\eta _{kp} \big )^{\top }\) and \(f\left( {{\varvec{y}};\varvec{\Theta }_k } \right) \) is given by (8). This model will hereafter be called FM-SUNGH. The identifiability of mixtures of distributions has been studied by Teicher (1963) and Holzmann et al. (2006); these results can be used to ensure that the FM-SUNGH model is identifiable.
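For concreteness, a generic evaluator of the mixture density (10) can be written in a few lines of base R; the component densities are passed in as functions, and the two univariate normal components below are a toy example only, not the SUNGH components of the paper.

```r
## Finite mixture density (10): dens_list is a list of functions y -> f(y; Theta_k),
## p_vec the vector of mixing weights.
dmix <- function(y, dens_list, p_vec) {
  Reduce(`+`, Map(function(f, pk) pk * f(y), dens_list, p_vec))
}
dmix(0.3,
     dens_list = list(function(y) dnorm(y, mean = 0, sd = 1),
                      function(y) dnorm(y, mean = 3, sd = 2)),
     p_vec = c(0.4, 0.6))
```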

The SUNGH family is a rich class of distributions, and various particular forms from this family have been considered over the last few years in the case of mixture models. In Table 1, we outline details of some of the distributions and the corresponding parameters within the SUNGH family.

Using the mixture model representation in (10), for an i.i.d. sample \({\varvec{Y}}_1 ,\ldots ,{\varvec{Y}}_n \), we can introduce latent indicator (allocation) variables \(Z_1 ,\ldots ,Z_n \) to assign observations to the different components of the mixture \(\left( {k=1,\ldots ,K} \right) \). The standard assumption for the allocation variables \(Z_1 ,\ldots ,Z_n \) is that they follow a multinomial distribution, \(Z_i \sim Multinomial\left( {K,p_1 ,\ldots ,p_K } \right) \), so that \(P\left( {Z_i =k} \right) =p_k \), \(i=1,\ldots ,n\), \(k=1,\ldots ,K\). In terms of \(Z_i \), we can then write

$$\begin{aligned} \left. {{\varvec{Y}}_i } \right| Z_i =k{\mathop {\sim }\limits ^{\mathrm{ind.}}}\mathrm{SUNGH}\left( {\varvec{\Theta }_k } \right) ,P\left( {Z_i =k} \right) =p_k . \end{aligned}$$

Let \({\varvec{C}}=\left\{ {{\varvec{Y,U,W,Z}}} \right\} \) denote the complete data, where \({\varvec{Y}}=\left( {{\varvec{Y}}_1^\top ,\ldots ,{\varvec{Y}}_n^\top } \right) ^{\top }\) are the observed data and \({\varvec{U}}=\left( {U_{11} ,\ldots ,U_{1K} ,\ldots ,U_{n1} ,\ldots ,U_{nK} } \right) ^{\top }\), \({\varvec{W}}=\big ( {\varvec{W}}_{11}^\top ,\ldots ,{\varvec{W}}_{1K}^\top ,\ldots ,{\varvec{W}}_{n1}^\top ,\ldots ,{\varvec{W}}_{nK}^\top \big )^{\top }\) and \({\varvec{Z}}=\left( {Z_1 ,\ldots ,Z_n } \right) ^{\top }\) are the latent (unobserved) variables. Considering the SUNGH stochastic representation (7) in terms of a finite mixture model, for \(i=1,\ldots ,n\) and \(k=1,\ldots ,K\), a hierarchical representation is

$$\begin{aligned}&\left. {{\varvec{Y}}_i } \right| {\varvec{W}}_{ik} =\varvec{w}_{ik} ,U_{ik} =u_{ik} ,Z_i =k{\mathop {\sim }\limits ^{\mathrm{ind.}}}N_p \nonumber \\&\quad \left( {{\varvec{\mu }} _k +\varvec{\Lambda }_k {\varvec{w}}_{ik} ,\kappa \left( {u_{ik} } \right) \varvec{\Sigma }_k } \right) , \end{aligned}$$
(11)
$$\begin{aligned}&\left. {{\varvec{W}}_{ik} } \right| U_{ik} =u_{ik} ,Z_i =k {\mathop {\sim }\limits ^{\mathrm{ind.}}} HN_q \left( {\mathbf{0},\kappa \left( {u_{ik} } \right) {\varvec{I}}_q } \right) , \end{aligned}$$
(12)
$$\begin{aligned}&\left. {U_{ik} } \right| Z_i =k{\mathop {\sim }\limits ^{\mathrm{ind.}}} \mathrm{GIG}_*\left( {\upsilon _k ,\psi _k ,\eta _k } \right) , \end{aligned}$$
(13)
$$\begin{aligned}&{\varvec{Z}}\sim \mathrm{Multinomial} (K, p_{1},\ldots ,p_{K}), \end{aligned}$$
(14)

where \(HN_q \) denotes the q-variate right half-normal distribution.
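The hierarchy (11)–(14) also gives a direct way to generate data from the FM-SUNGH model. The base-R sketch below does so for \(\kappa(u)=u\) and \(K=2\); all parameter values are illustrative, and the GIG draws again assume the GIGrvg package, using the mapping to \(\mathrm{GIG}_*\) noted earlier.

```r
## Simulating from the hierarchy (11)-(14), kappa(u) = u, K = 2 (illustrative values).
library(GIGrvg)
set.seed(3)
n <- 500; K <- 2; p <- 2; q <- 1
prob   <- c(0.4, 0.6)
mu     <- list(c(4, 4), c(-1, -1))
Sigma  <- list(diag(2), diag(2))
Lambda <- list(matrix(c(-3, 1), p, q), matrix(c(3, 1), p, q))
gig    <- list(c(-0.5, 1, 1), c(-0.5, 1, 1))       # (upsilon_k, psi_k, eta_k)
Z <- sample(1:K, n, replace = TRUE, prob = prob)                           # (14)
Y <- t(sapply(seq_len(n), function(i) {
  k <- Z[i]; g <- gig[[k]]
  u <- rgig(1, lambda = g[1], chi = g[2] * g[3], psi = g[2] / g[3])        # (13)
  w <- sqrt(u) * abs(rnorm(q))                                             # (12)
  mu[[k]] + Lambda[[k]] %*% w + sqrt(u) * t(chol(Sigma[[k]])) %*% rnorm(p) # (11)
}))
```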

The model’s complete data likelihood function is then given by

$$\begin{aligned}&L\left( {\varvec{\Theta }\left| {\varvec{C}} \right. } \right) =\mathop \prod \limits _{i=1}^n \mathop \prod \limits _{k=1}^K \left( \phi _p \left( {{\varvec{y}}_i \hbox {|}{\varvec{\mu }} _k +\varvec{\Lambda }_k {\varvec{w}}_{ik} ,\kappa \left( {u_{ik} } \right) \varvec{\Sigma }_k } \right) \right. \nonumber \\&\quad \left. H\phi _q \left( {{\varvec{w}}_{ik} \hbox {|}{} \mathbf{0},\kappa \left( {u_{ik} } \right) {\varvec{I}}_q } \right) \mathcal{G}\mathcal{I}\mathcal{G}_*\left( {u_{ik} \hbox {|}\upsilon _k ,\psi _k ,\eta _k } \right) \right. \nonumber \\&\quad \left. P\left( {Z_i =k} \right) \right) , \end{aligned}$$
(15)

where \(H\phi _q \left( {{\varvec{w}}|\mathbf{0},\cdot } \right) =\phi _q \left( {{\varvec{w}}|\mathbf{0},\cdot } \right) I({\varvec{w}}>\mathbf{0})\) is the q-variate right half-normal pdf.

4.2 Bayesian analysis

4.3 Priors

In this section, we choose priors for the parameters \(\varvec{\Theta }\) which will be used in the applications of Sect. 5. Assuming independence between the different types of parameters in \(\varvec{\Theta }\), and writing the skewness matrix of each mixture component as \(\varvec{\Lambda }_k =\left( {\left. {\left. {{\varvec{\lambda }} _{k1} } \right| \ldots } \right| {\varvec{\lambda }} _{kq} } \right) \), prior distributions for some of the FM-SUNGH model parameters are given by

$$\begin{aligned}&{\varvec{p}}\sim Dir\left( {\delta _1 ,\ldots ,\delta _K } \right) ,{\varvec{\mu }} _k \sim N_p \left( {{\varvec{m}}_k,{\varvec{M}}_k } \right) ,\\&\quad \varvec{\Sigma }_k \sim IW_{t_k } \left( {{\varvec{T}}_k } \right) ,\varvec{\lambda }_{kt} \sim N_p \left( {{\varvec{\ell }}_{kt} ,{\varvec{L}}_{kt} } \right) ,\quad t=1,\ldots ,q \end{aligned}$$

for \(k=1,\ldots ,K\), where Dir and IW denote the Dirichlet and inverse Wishart distributions, respectively. An alternative representation of the skewness matrix priors and posteriors used in the Gibbs updates is provided in the “Appendix.” Prior distributions of the scaled factor variables, for \(k=1,\ldots ,K\), are:

$$\begin{aligned} \upsilon _k \sim N\left( {\mu _k ,\sigma _k^2 } \right) ,\eta _k \sim \mathrm{GIG}^{*}\left( {\alpha _k ,\chi _k ,\varepsilon _k } \right) ,\psi _k \sim \exp \left( {\varsigma _k } \right) . \end{aligned}$$
Table 1 Summary of non-normal finite mixture models belonging to the FM-SUNGH model

4.3.1 Posteriors

By considering the likelihood function (15) and the priors specified previously, the joint posterior of \(\varvec{\Theta }\) is given by

$$\begin{aligned} \pi \left( {\left. {\varvec{\Theta },{\varvec{u,w,z}}} \right| {\varvec{y}}} \right) \propto L\left( {\varvec{\Theta }\left| {{\varvec{y,u,w,z}}} \right. } \right) \pi \left( \varvec{\Theta } \right) . \end{aligned}$$

The above joint posterior is intractable, but we can use an MCMC scheme combining Gibbs sampling and Metropolis–Hastings steps to draw samples from the conditional posterior distributions. To establish notation, let \(B_k =\left\{ {i:Z_i =k} \right\} \) be the set of indices of the observations \({\varvec{y}}_i \) classified into the kth cluster, and let \(n_k\) be the number of observations allocated to the kth component (cluster). Apart from the parameters for the scaled factor variables, all conditional posterior distributions have closed form and are as follows. (Note that \(\varvec{\Theta }_{\left( {-{\varvec{\theta }} } \right) } \) denotes the set of parameters without its element \({\varvec{\theta }} \).)

\(\left. {\varvec{p}} \right| \varvec{\Theta }_{\left( {-p} \right) } ,{\varvec{y,u,w}},z_i =k\sim Dir\left( {\delta _{pos.1} ,\ldots ,\delta _{pos.K} } \right) \), where

$$\begin{aligned} \delta _{pos.k} =\delta _k +n_k ;k=1,\ldots ,K. \end{aligned}$$
(16)
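For example, the update (16) is a standard Dirichlet draw, which can be obtained in base R from independent gamma variables; the prior counts and component sizes in the sketch below are illustrative.

```r
## One draw of (p_1, ..., p_K) from the Dirichlet posterior in (16).
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}
delta <- c(1, 1); n_k <- c(312, 438)      # illustrative prior counts / cluster sizes
p_draw <- rdirichlet1(delta + n_k)
```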

\(\left. {{\varvec{\mu }} _k } \right| \varvec{\Theta }_{\left( {-\mu _k } \right) } ,{\varvec{y,u,w,}}z_i =k\sim N_p \left( {{\varvec{\mu }} ,\varvec{\Sigma }} \right) ,k=1,\ldots ,K\), where

$$\begin{aligned}&{\varvec{\mu }} =\varvec{\Sigma }\left( {{\varvec{M}}_k^{-1} {\varvec{m}}_k +\sum \nolimits _{B_k } \kappa \left( {u_{ik} } \right) ^{-1}\varvec{\Sigma }_k^{-1} \left( {{\varvec{y}}_i -\varvec{\Lambda }_k \varvec{w}_{ik} } \right) } \right) ,\nonumber \\&\varvec{\Sigma }=\left[ {{\varvec{M}}_k^{-1} +\sum \nolimits _{B_k } \kappa \left( {u_{ik} } \right) ^{-1}\varvec{\Sigma }_k^{-1} } \right] ^{-1}, \end{aligned}$$
(17)
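A base-R sketch of the Gibbs update (17) for \({\varvec{\mu }}_k\) (with \(\kappa(u)=u\)) is shown below; the inputs y_k, u_k and W_k are assumed to hold only the observations, scale variables and latent w's currently allocated to component k, and all object names are ours.

```r
## Gibbs update (17): posterior precision/mean accumulated over B_k, then one
## multivariate normal draw via the Cholesky factor of the posterior covariance.
draw_mu_k <- function(y_k, u_k, W_k, Lambda_k, Sigma_k, m_k, M_k) {
  Sigma_k_inv <- solve(Sigma_k)
  prec <- solve(M_k)                                   # prior precision M_k^{-1}
  rhs  <- prec %*% m_k
  for (i in seq_len(nrow(y_k))) {
    prec <- prec + Sigma_k_inv / u_k[i]
    rhs  <- rhs  + Sigma_k_inv %*% (y_k[i, ] - Lambda_k %*% W_k[i, ]) / u_k[i]
  }
  V <- solve(prec)                                     # posterior covariance in (17)
  drop(V %*% rhs + t(chol(V)) %*% rnorm(length(m_k)))  # draw from N_p(mean, V)
}
```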

\(\left. {\varvec{\Sigma }_k } \right| \varvec{\Theta }_{\left( {-\varvec{\Sigma }_k } \right) } ,{\varvec{y,u,w}},z_i =k\sim IW_{t_k +n_k } \left( {\varvec{T}} \right) ,k=1,\ldots ,K\), where

$$\begin{aligned} {\varvec{T}}= & {} \sum \nolimits _{B_k } \kappa \left( {u_{ik} } \right) ^{-1}\left( {{\varvec{y}}_i -{\varvec{\mu }} _k -\varvec{\Lambda }_k \varvec{w}_{ik} } \right) \nonumber \\&\left( {{\varvec{y}}_i -{\varvec{\mu }} _k -\varvec{\Lambda }_k {\varvec{w}}_{ik} } \right) ^{\top }+{\varvec{T}}_k , \end{aligned}$$
(18)

\(\left. {{\varvec{\lambda }} _{kt} } \right| \varvec{\Theta }_{\left( {-\lambda _{kt} } \right) } ,{\varvec{y,u,w,}}z_i =k\sim N_p \left( {{\varvec{\mu }} ,\varvec{\Sigma }} \right) ;k=1,\ldots ,K;t=1,\ldots ,q\), where

$$\begin{aligned}&{\varvec{\mu }} =\varvec{\Sigma }\left( {\varvec{L}}_{kt}^{-1} {\varvec{\ell }} _{kt} +\sum \nolimits _{B_k } \kappa \left( {u_{ik} } \right) ^{-1}{w}_{ik\left( t \right) } \varvec{\Sigma }_k^{-1}\right. \nonumber \\&\quad \left. \left( {{\varvec{y}}_i -{\varvec{\mu }} _k -\varvec{\Lambda }_{k\left( {-t} \right) } {\varvec{w}}_{ik\left( {-t} \right) } } \right) \right) ,\nonumber \\&\varvec{\Sigma }=\left( {{\varvec{L}}_{kt}^{-1} +\sum \nolimits _{B_k } \kappa \left( {u_{ik} } \right) ^{-1}{w}_{ik\left( t \right) }^2 \varvec{\Sigma }_k^{-1} } \right) ^{-1}, \end{aligned}$$
(19)

where \(\varvec{\Lambda }_{k\left( {-t} \right) } \) denotes the \(p\times \left( {q-1} \right) \) skewness matrix \(\varvec{\Lambda }_k \) with the tth column eliminated, \(\varvec{w}_{ik\left( {-t} \right) } \) denotes the \(\left( {q-1} \right) \times 1\) vector \(\varvec{w}_{ik} \) vector with the tth element eliminated, and \({w}_{ik\left( t \right) } \) denotes the tth element of the vector \(\varvec{w}_{ik} \).

The full conditional posterior distribution for the latent variables \(Z_i \), \(U_{ik} \) and \({\varvec{W}}_{ik} \), for \(i=1,\ldots ,n;k=1,\ldots ,K\), are given by:

\(\left. {Z_i } \right| \varvec{\Theta },{\varvec{y,u,w}}\sim \mathrm{Multinomial}\left( {K,p_{p.1} ,\ldots ,p_{p.K} } \right) \), where

$$\begin{aligned} p_{p.k} =\frac{p_k f\left( {{\varvec{y}}_i ;\varvec{\Theta }_k } \right) }{\mathop \sum \nolimits _{j=1}^K p_j f\left( {{\varvec{y}}_i ;\varvec{\Theta }_j } \right) },k=1,\ldots ,K, \end{aligned}$$
(20)

\(\left. {U_{ik} } \right| \varvec{\Theta },{\varvec{y,w}},z_i =k\sim \mathrm{GIG}^{*}\left( {\hbox {a}_u ,\hbox {b}_u ,\hbox {c}_u } \right) \), where \(\kappa \left( u \right) =u\) and

$$\begin{aligned} \hbox {a}_u= & {} \upsilon _{ik} -\frac{p+q}{2},\nonumber \\ \hbox {b}_u= & {} \left( {\psi _{ik} /\eta _{ik} } \right) ^{1/2},\nonumber \\ \hbox {c}_u= & {} \left( \left[ \varvec{w}_{ik}^\top \varvec{w}_{ik} +\left( { {\varvec{y}}_i -{\varvec{\mu }} _k -\varvec{\Lambda }_k \varvec{w}_{ik} } \right) ^{\top }\right. \right. \nonumber \\&\left. \left. \varvec{\Sigma }_k^{-1} \left( {{\varvec{y}}_i -{\varvec{\mu }} _k -\varvec{\Lambda }_k \varvec{w}_{ik} } \right) \right] +\psi _{ik} \eta _{ik} \right) ^{1/2}. \end{aligned}$$
(21)

\(\left. {{\varvec{W}}_{ik} } \right| \varvec{\Theta },{\varvec{y,u}},z_i =k\sim HN_q \left( {{\varvec{\mu }} ,\varvec{\Sigma }} \right) \), where

$$\begin{aligned} {\varvec{\mu }}= & {} \kappa \left( {u_{ik} } \right) ^{-1}\varvec{\Sigma \Lambda }_k^\top \varvec{\Sigma }_k^{-1} \left( {{\varvec{y}}_i -{\varvec{\mu }} _k } \right) ,\nonumber \\ \varvec{\Sigma }= & {} \kappa \left( {u_{ik} } \right) \left( {{\varvec{I}}_q +\varvec{\Lambda }_k^\top \varvec{\Sigma }_k^{-1} \varvec{\Lambda }_k } \right) ^{-1}. \end{aligned}$$
(22)
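Sampling \({\varvec{W}}_{ik}\) from (22) amounts to drawing from a q-variate normal restricted to the positive orthant. A naive base-R rejection sketch is shown below (all inputs illustrative); for strongly correlated or high-dimensional cases a dedicated truncated multivariate normal sampler would be a better choice.

```r
## Rejection sketch for the positive-orthant (right half-normal) draw in (22).
rhalfnorm_q <- function(mu, Sigma, max_tries = 1e4) {
  L <- t(chol(Sigma))
  for (try in seq_len(max_tries)) {
    w <- drop(mu + L %*% rnorm(length(mu)))
    if (all(w > 0)) return(w)                # accept only draws with all w > 0
  }
  stop("rejection sampler failed: acceptance rate too low")
}
w_draw <- rhalfnorm_q(mu = c(0.8, 0.2), Sigma = 0.5 * diag(2))   # illustrative
```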

Finally, the full conditional posterior for the scaled factor variables \(\upsilon _k ,\psi _k ,\eta _k ,k=1,\ldots ,K\), is as follows: \(\left. {\eta _k } \right| \varvec{\Theta }_{\left( {-\eta _k } \right) } ,{\varvec{u,w}},z_i =k\sim \mathrm{GIG}^{*}\left( {\hbox {a}_\eta ,\hbox {b}_\eta ,\hbox {c}_\eta } \right) \), where

$$\begin{aligned}&\hbox {a}_\eta =\alpha _k -\upsilon _k n_k ,\nonumber \\&\hbox {b}_\eta =\left( {\chi _k^2 +\psi _k \sum \nolimits _{B_k } 1/u_{ik} } \right) ^{1/2},\nonumber \\&\hbox {c}_\eta =(\varepsilon _k^2+\psi _{k} \sum \nolimits _{B_{k}}u_{ik})^{1/2}. \end{aligned}$$
(23)
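The \(\mathrm{GIG}^{*}\) conditionals in (21) and (23) can again be sampled with GIGrvg::rgig (assumed available); matching the \(\mathrm{GIG}^{*}(\upsilon,\gamma,\rho)\) density in (5) to rgig's density gives the mapping shown below, with purely illustrative values for \((\hbox{a}_\eta,\hbox{b}_\eta,\hbox{c}_\eta)\).

```r
## GIG^*(upsilon, gamma, rho) of (5) maps to rgig arguments
## lambda = upsilon, chi = rho^2, psi = gamma^2.
library(GIGrvg)
a_eta <- -150; b_eta <- 2.3; c_eta <- 18.1                 # illustrative placeholders
eta_k_draw <- rgig(1, lambda = a_eta, chi = c_eta^2, psi = b_eta^2)   # draw for (23)
```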

The full conditional posterior density of \(\upsilon _k ,k=1,\ldots ,K\) is proportional to:

$$\begin{aligned} \pi \left( {\left. {\upsilon _k } \right| \varvec{\Theta }_{\left( {-\upsilon _k } \right) } ,{\varvec{u,w}},Z_i =k} \right) \propto \pi _1 \left( {\upsilon _k } \right) N\left( {\mu _k ,\sigma _k^2 } \right) ,\nonumber \\ \end{aligned}$$
(24)

where \(\pi _1 \left( {\upsilon _k } \right) =\left( {K_{\upsilon _k } \left( {\psi _k } \right) } \right) ^{-n_k }\mathop \prod \limits _{B_k } \left( {u_{ik} /\eta _k } \right) ^{\upsilon _k }\).

The full conditional posterior density of \(\psi _k ,k=1,\ldots ,K\) is also proportional to:

$$\begin{aligned}&\pi \left( {\left. {\psi _k } \right| \varvec{\Theta }_{\left( {-\psi _k } \right) } ,{\varvec{u,w}},z_i =k} \right) \propto \pi _2 \left( {\psi _k } \right) \nonumber \\&\quad E\left( {\varsigma _k +\sum \nolimits _{B_k } \left( {u_{ik} /\eta _k +\eta _k /u_{ik} } \right) /2} \right) , \end{aligned}$$
(25)

where \(\pi _2 \left( {\psi _k } \right) =\left( {K_{\upsilon _k } \left( {\psi _k } \right) } \right) ^{-n_k }\) and \(E\left( \varphi \right) \) denotes the density of the exponential distribution with rate parameter \(\varphi \).

Note that (24) and (25) do not have closed forms, but a Metropolis–Hastings or rejection sampling step can be embedded in the MCMC scheme to obtain draws from them.
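As one possibility for the \(\upsilon_k\) update, a random-walk Metropolis step based on (24) can be written in a few lines of base R, working on the log scale for numerical stability; all names (u_k, eta_k, psi_k, mu0, sigma0) and the tuning value sd_prop are illustrative stand-ins for the quantities in (24).

```r
## Log of the unnormalized full conditional (24); K_{-nu}(x) = K_nu(x), hence abs().
log_post_upsilon <- function(ups, u_k, eta_k, psi_k, mu0, sigma0) {
  -length(u_k) * log(besselK(psi_k, abs(ups))) +
    ups * sum(log(u_k / eta_k)) +
    dnorm(ups, mu0, sigma0, log = TRUE)
}
## One random-walk Metropolis step for upsilon_k.
mh_upsilon <- function(ups_cur, u_k, eta_k, psi_k, mu0, sigma0, sd_prop = 0.5) {
  ups_prop  <- rnorm(1, ups_cur, sd_prop)                  # symmetric proposal
  log_ratio <- log_post_upsilon(ups_prop, u_k, eta_k, psi_k, mu0, sigma0) -
               log_post_upsilon(ups_cur,  u_k, eta_k, psi_k, mu0, sigma0)
  if (log(runif(1)) < log_ratio) ups_prop else ups_cur     # accept or keep current
}
```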

5 Applications

In this section, we present a simulation study and applications on two real datasets to evaluate the performance of the proposed SUNGH model for clustering problems. For illustrative purposes, we choose K to be equal to two for all models presented.

5.1 Simulated data

To illustrate some of the differences between models in the SUNGH family, we consider the case of two clusters, each sampled from a four-dimensional SUNGH distribution with known parameters, with the clusters only slightly separated from each other. For the first and second clusters,

$$\begin{aligned} {\varvec{\mu }} _1= & {} \left( {4,4,4,4} \right) ^{\top }, \quad \varvec{\Sigma }_1 =\varvec{\Sigma }_2 =\left( {{\begin{array}{ccc} 1&{} \cdots &{} {0.5} \\ \vdots &{} \ddots &{} \vdots \\ {0.5}&{} \cdots &{} 1 \\ \end{array} }} \right) \hbox { and }\\ \varvec{\Lambda }_1= & {} \left( {{\begin{array}{cc} {-4}&{} {-4} \\ {1}&{} {3} \\ {-4}&{} {-4} \\ {1}&{} {3} \\ \end{array} }} \right) ,\\ {\varvec{\mu }} _2= & {} \left( {-1,-1,-1,-1} \right) ^{\top }\hbox { and }\varvec{\Lambda }_2 =\left( {{\begin{array}{cc} {4}&{} {-5} \\ {1}&{} {2} \\ 4&{} {-5} \\ 1&{} {2} \\ \end{array} }} \right) , \end{aligned}$$

respectively. Both clusters shared the same parameters for the \(\mathrm{GIG}_*\left( {\upsilon ,\psi ,\eta } \right) \) distribution, with \(\upsilon =-0.5\), \(\psi =1\) and \(\eta =1\). The sample sizes for the two clusters were 300 and 450, respectively. A plot of the simulated data is shown in Fig. 1, with the observations belonging to each cluster labeled by different colors.

Fig. 1 Plot of results for simulated data. Colors indicate the groups to which observations belong in the first two dimensions. (Color figure online)

Table 2 Results for simulated data example

For estimation of the different models, largely non-informative prior distributions were used for each of the component parameters: \({\varvec{\mu }} =\left( {\mu _1 ,\ldots ,\mu _4 } \right) ^{\top }\sim N_4 \left( {\mathbf{0},\varvec{\Sigma }} \right) \) with \(\varvec{\Sigma }=10^{3}\varvec{I}_4 \); \(\varvec{\Sigma }\sim IW_\tau \left( {\varvec{T}} \right) \) with \(\tau =4\) and \({\varvec{T}}=\varvec{I}_4 \); and, for the skewness matrix \(\varvec{\Lambda }_{4\times 2} \), column priors \(\varvec{\lambda }_t \sim N_4 \left( {{\varvec{\ell }} _t ,{\varvec{L}}_t } \right) \) with \({\varvec{\ell }} _t =\mathbf{0}\) and \({\varvec{L}}_t =10^{3}\varvec{I}_4 \) for \(t=1,2\) (equivalently, a matrix-variate normal prior \(\varvec{\Lambda }_{4\times 2} \sim MN_{4,2} \left( {\mathbf{0},10^{3}\varvec{I}_4 ,10^{3}\varvec{I}_2 } \right) \) can be used). For the remaining parameters, \(\upsilon \sim N\left( {0,10^{3}} \right) \), \(\eta \sim \mathrm{GIG}^{*}\left( {0.001,2000,0} \right) \), \(\psi \sim \exp \left( {0.1} \right) \) and \({\varvec{p}}\sim Dir\left( {1,\ldots ,1} \right) \). Also, we chose \(\kappa \left( u \right) =u\) for the scale mixing function. All computations were implemented in the R software version 3.3.1 (R Core Team 2017) on a core i7 760 processor at 2.8 GHz. The R and Nimble code for the models is available from the authors upon request. Gibbs sampling runs of 60,000 iterations with a burn-in of 30,000 were used, and convergence was assessed using the Gelman–Rubin statistic (Gelman and Rubin 1992) and by visual inspection. To address the issue of label switching over the MCMC iterations (Mengersen et al. 2011), we used the maximum a posteriori (MAP) estimate to select one of the \(K!\) modal regions and a distance-based measure on the parameter space to relabel parameters in proximity to this region (Celeux et al. 2000).

Model performance was assessed by comparing the classification accuracy and model selection criteria for different distributions within the SUNGH family of models (see Table 2). For classification accuracy, we report the adjusted Rand index (ARI) (Hubert and Arabie 1985), which ranges from 0 (no match) to 1 (perfect match). We also report the EAIC and EBIC, which are variations of the classical AIC and BIC criteria for use in a Bayesian setting (Carlin and Louis 2011); lower values indicate a better fit. In a mixture setting, it is also possible to compare DIC values using one of the measures suggested by Celeux et al. (2006).
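For reference, the ARI can be computed directly from two label vectors with a small base-R helper such as the one below, a sketch of the Hubert and Arabie (1985) formula; it should agree, up to rounding, with adjustedRandIndex() in the mclust package.

```r
## Adjusted Rand index from two label vectors a and b.
ari <- function(a, b) {
  tab <- table(a, b); n <- sum(tab)
  sum_comb <- function(x) sum(choose(x, 2))
  idx  <- sum_comb(tab)                                  # sum_ij C(n_ij, 2)
  e_a  <- sum_comb(rowSums(tab)); e_b <- sum_comb(colSums(tab))
  expc <- e_a * e_b / choose(n, 2)                       # expected index
  (idx - expc) / ((e_a + e_b) / 2 - expc)
}
ari(c(1, 1, 2, 2, 2), c(2, 2, 1, 1, 2))                  # toy example
```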

As expected, Table 2 shows quite clearly that the classification performance of the true model (SUNGH \((q=2)\)) is very good, with an ARI of 0.87, and the model choice criteria all appear to favor this model. A higher log-likelihood was found for the SUNGH \((q=3)\) model, with an ARI similar to that of the SUNGH \((q=2)\) model, but on the other criteria this model was not favored because of the extra parameters involved. In applied settings, where the true labels are unknown, a similar trade-off will be made between choosing more complex models with extra flexibility in the skewness matrix (higher q values) and the relative improvement in model choice or goodness-of-fit measures. The performance of the SN and SNGH models is also as expected, given the relative lack of flexibility of the skewness parameter to accommodate the degree of skewness in all dimensions in this application.

Fig. 2 Plots of results for AIS example: a true data, b SNGH, c SN, d SUNGH

Table 3 Model choice criteria for AIS data example

5.2 Real applications

5.2.1 AIS example

In this example, we consider a dataset from the Australian Institute of Sport (AIS) containing measurements on 202 athletes (102 male and 100 female): sex, red cell count, white cell count, hematocrit, hemoglobin, plasma ferritin concentration, body mass index, sum of skin folds, body fat percentage, lean body mass, height and weight (Cook and Weisberg 1994). The data are available in the R package “sn” (see Azzalini 2015).

Fig. 3 Plots of observations and fitted contours for lymphoma example: a plot of observations for CD4 and ZAP70, b SNGH, c SN, d SUNGH

To assess the performance of the proposed SUNGH model, we use BMI and body fat percentage (Bfat) to classify male and female athletes. Figure 2a shows the observations for male (in black) and female (in red) athletes according to these two measures, suggesting reasonably skewed distributions for both males and females, with a particularly strongly skewed and heavy-tailed distribution for the male athletes. Figure 2b–d also shows the fitted contours and assigned labels for each observation for several of the models examined (SNGH, SN and SUNGH).

Table 3 presents the model choice criteria for the different models examined. The results suggest that the SUNGH model has the highest log-likelihood and the lowest EAIC, but the SN model has the lowest values for the EBIC and DIC\(_{2}\) measures. However, the ARI for the SN model (0.52) is considerably lower than that of the SUNGH model (0.79), suggesting more support for the SUNGH model in terms of classification accuracy. These results are also reflected visually in Fig. 2, with the SUNGH model (Fig. 2d) able to represent the skewed nature of the distribution for the two groups, particularly for the male athletes. In contrast, the SN model (Fig. 2c) poorly represents the skewed distribution of the female athletes and the heavy-tailed nature of the distribution for the male athletes. As expected, the classification results for the SNGH model (Fig. 2b) are visually similar to those for the SUNGH; however, small differences (due to the reduced flexibility of the skewness parameter in the SNGH) can be observed, which greatly affect the classification accuracy (ARI = 0.64).

5.2.2 Lymphoma example

In another example, we examine a clustering problem for a lymphoma dataset analyzed by Lee and McLachlan (2013b). The data consist of a subset of the data originally presented and collected by Maier et al. (2007), in which blood samples from 30 subjects were stained with four fluorophore-labeled antibodies against CD4, CD45RA, SLP-76 (pY128) and ZAP-70 (pY292) before and after an anti-CD3 stimulation. To illustrate the performance of different distributions within the SUNGH family, we cluster a subset of the data containing the variables CD4 and ZAP70 (Fig. 3), which appear to be bimodal and display an asymmetric pattern. In particular, the largest mode appears to show both strong correlation between the two variables and substantial skewness in both dimensions.

Table 4 Model choice criteria for lymphoma example

From Fig. 3, we can see a clear difference between the SNGH and SUNGH models, with the latter providing a closer fit to the two groups visible in the data. This is further supported by the model choice criteria, with all three measures (EAIC, EBIC and DIC\(_{2}\)) favoring the SUNGH model over the SNGH model (Table 4). The SUNGH model is also preferred over the SN model (Fig. 3c), with the latter not appearing to fit or represent the larger component in the data. Overall, the SUNGH model appears to fit the two groups in these data quite well, with the lowest values for two of the three model choice measures (EAIC and EBIC). Under the DIC\(_{2}\) criterion, the lowest value is obtained for the SUT model, so there is some support for this model in terms of model choice. However, the difference between the DIC\(_{2}\) values for this model and the SUNGH model is not great (\(\mathrm{SUNGH}=7201.3; \mathrm{SUT}=7198.5\)), suggesting little difference in terms of this measure.

6 Conclusion

We have proposed a flexible family of unrestricted skew-normal generalized hyperbolic (SUNGH) distributions for application in clustering problems, capable of representing distributions of asymmetric and heavy-tailed forms. The family contains several other well-known symmetric and asymmetric families of distributions, such as the scale mixtures of skew-normal (SMSN) family, as special cases. Various properties of the SUNGH family have been established, and estimation of the parameters is relatively straightforward in a Bayesian framework, with most of the Gibbs sampling updates available in closed form. Assessments of the performance of the proposed model on simulated and real data suggest that the family provides a considerable degree of freedom and flexibility in modeling data of varying tail behavior and directional shape. As this family of distributions and the parameterization we have adopted satisfy several important properties (e.g., closure under affine transformations), the SUNGH family can be used in a variety of other statistical models (e.g., linear multilevel/mixed models and regression).