Seemingly unrelated clusterwise linear regression for contaminated data

Perrone, Gabriele; Soffritti, Gabriele

doi:10.1007/s00362-022-01344-6

Seemingly unrelated clusterwise linear regression for contaminated data

Regular Article
Open access
Published: 06 August 2022

Volume 64, pages 883–921, (2023)
Cite this article

Download PDF

You have full access to this open access article

Statistical Papers Aims and scope Submit manuscript

Seemingly unrelated clusterwise linear regression for contaminated data

Download PDF

2315 Accesses
3 Citations
Explore all metrics

Abstract

Clusterwise regression is an approach to regression analysis based on finite mixtures which is generally employed when sample observations come from a population composed of several unknown sub-populations. Whenever the response is continuous, Gaussian clusterwise linear regression models are usually employed. Such models have been recently robustified with respect to the possible presence of mild outliers in the sub-populations. However, in some fields of research, especially in the modelling of multivariate economic data or data from the social sciences, there may be prior information on the specific covariates to be considered in the linear term employed in the prediction of a certain response. As a consequence, covariates may not be the same for all responses. Thus, a novel class of multivariate Gaussian linear clusterwise regression models is proposed. This class provides an extension to mixture-based regression analysis for modelling multivariate and correlated responses in the presence of mild outliers that let the researcher free to use a different vector of covariates for each response. Details about the model identification and maximum likelihood estimation via an expectation-conditional maximisation algorithm are given. The performance of the new models is studied by simulation in comparison with other clusterwise linear regression models. A comparative evaluation of their effectiveness and usefulness is provided through the analysis of a real dataset.

Seemingly unrelated clusterwise linear regression

Article 12 August 2019

Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model

Article 20 June 2017

Matrix Normal Cluster-Weighted Models

Article Open access 02 June 2021

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

In multivariate regression analysis, when modelling the dependence of a random vector ${\textbf {Y}}=(Y_{1}, \ldots ,Y_{m}, \ldots , Y_{M})'$ of M responses on a given vector ${\textbf {X}}=(X_{1}, \ldots , X_{p}, \ldots , X_{P} )'$ of P predictors through a sample $\mathcal {S}= \{ ({\textbf {x}}_1,{\textbf {y}}_1),\ldots ,$ $({\textbf {x}}_I,{\textbf {y}}_I) \}$ drawn from a certain population, the following sources of complexity could affect the data and make the prediction of the responses a task difficult to perform.

(a):: With multivariate longitudinal data, time-series data or repeated measures, the M responses contained in ${\textbf {Y}}$ are typically correlated. Furthermore, in analyses of economic data or data from the social sciences, it is not unusual that prior information about the phenomenon under study enables the analyst to specify a system of M regression equations (one equation for each response) in which certain regressors contained in ${\textbf {X}}$ are absent from certain regression equations. This is especially true for multivariate economic data referring to general theories (i.e., investment equations, production functions) or applications dealing with the explanation of a certain economic activity (i.e., demand of petrol, employment) in different geographical locations (see, e.g., Giles and Hampton 1984; White and Hewings 1982; Zellner 1962). Further examples can be found also in other fields, such as medicine, food quality, tourism economics, quality of life and health (see, e.g., Cadavez and Hennningsen 2012; Disegna and Osti 2016; Heidari et al. 2017; Keshavarzi et al. 2012, 2013). A parametric framework able to take into consideration both multivariate correlated responses and systems of regression equations with equation-dependent vectors of predictors (i.e., vectors which do not necessarily contain the same predictors for all the responses) is given by the so-called seemingly unrelated regression approach (see, e.g., Park 1993; Srivastava and Giles 1987). In particular, in this approach the random disturbances associated with the M regression equations are allowed to be correlated with each other; hence, the variance-covariance matrix $\varvec{\Sigma }$ of the resulting M-dimensional vector of the error terms will have a non-diagonal structure.
(b):: In general, real data can often be characterised by the presence of atypical observations. In parametric regression analysis, such observations negatively impact on both the estimation of the regression coefficients and the prediction of the responses based on the classical procedures. Such procedures have been widely recognized to be extremely sensitive to even seemingly minor or negligible deviations from some conventional assumptions (see, e.g., Tukey 1960). Thus, when the data are contaminated by such observations, it is crucial that robust methods are employed (see, e.g., Maronna et al. 2006). Departures from the Gaussian distribution of the error terms in the regression model caused by some mildly atypical observations can be managed by simply resorting to heavy-tailed models for the conditional distribution of ${\textbf {Y}} \vert {\textbf {X}}={\textbf {x}}$. Those observations are also called small or mild outliers (see, e.g., Ritter 2015). Examples of robust methods against the presence of such outliers have been developed by Lange et al. (1989), Kibria and Haq (1999), Lachos et al. (2011); to this end, the multivariate t distribution or scale mixtures of Gaussian distributions have been exploited. Another model able to manage the possible presence of mild outliers in a dataset is the contaminated Gaussian distribution (see, e.g., Aitkin and Wilson 1980; Tukey 1960). This probabilistic model is defined as a mixture of two Gaussian distributions having the same expected mean values but different variances-covariances. Furthermore, the Gaussian distribution having the smallest mixing weight also has inflated variances-covariances and is employed to represent the mild outliers. Maximum likelihood (ML) estimation can be performed via an expectation-maximisation (EM) algorithm (see Aitkin and Wilson 1980; Dempster et al. 1977). Once such a model is fitted to the observed data, each sample observation can be classified as either typical or outlier using the maximum a posteriori probability (for further details see, e.g., Aitkin and Wilson 1980). With an approach based on the use of one of these distributions, robustness can be achieved without suppressing any observation from the sample $\mathcal {S}$.
(c):: Sometimes the population from which the sample $\mathcal {S}$ comes from is composed of a certain number, say K, of sub-populations. Furthermore, when the information about the value of K and the specific sub-population each sample observation belongs to is not known, $\mathcal {S}$ is characterised by unobserved heterogeneity. If this source of heterogeneity affects the distribution of ${\textbf {Y}} \vert {\textbf {X}}={\textbf {x}}$, then a mixture of K different regression models (one for each sub-population) will describe the distribution of ${\textbf {Y}} \vert {\textbf {X}}={\textbf {x}}$ in the population. This phenomenon can be experienced in many fields, such as economics, marketing, agriculture, education, human genomics, quantitative finance, social sciences and transport systems (see, e.g., Ding 2006; Dyer et al. 2012; Elhenawy et al. 2017; Fair and Jaffe 1972; Kamakura 1988; McDonald et al. 2016; Qin and Self 2006; Tashman and Frey 2009; Turner 2000; Van Horn et al. 2015). In this case, the sample $\mathcal {S}$ should be analysed in a regression framework able to detect both the number of sub-populations and their regression models. Methods for clusterwise regression analysis play a special role. They exploit clusterwise regression models, which are mixtures of K regression models (see, e.g., De Sarbo and Cron 1988; Depraetere and Vandebroek 2014; Frühwirth-Schnatter 2006; Hosmer 1974). In these models, the mixing weights can also be expressed as a function of some concomitant variables (Wedel 2002). With M continuous responses in vector ${\textbf {Y}}$, multivariate Gaussian clusterwise linear regression models are generally employed (see, e.g., Jones and McLachlan 1992). If the P predictors are random and the source of heterogeneity mentioned above affects the distribution of $({\textbf {X}}, {\textbf {Y}})$, then Gaussian cluster-weighted models should be employed (see, e.g., Dang et al. 2017).

Recently, Mazza and Punzo (2020) have introduced methods to perform Gaussian clusterwise linear regression analysis which are robust with respect to heavy-tailed departures from Gaussianity due to the presence of mild outliers in the data. By relying on contaminated Gaussian clusterwise linear regression models, their methods are able to produce a simultaneous clustering of the sample observations and the detection of mild outliers in a multivariate regression context. In this way, they allow to manage the sources of complexity (b) and (c); they are also capable of explaining the correlation among responses. A limitation of an approach based on those models is that the same vector of regressors has to be employed for the prediction of all responses. Galimberti and Soffritti (2020) have developed models for Gaussian clusterwise linear regression which make use of seemingly unrelated regression equations. The methods based on these latter models are suitable for the analysis of data affected by complexities (a) and (c); however, they are not insensitive to the possible presence of mild outliers in the K sub-populations. Based on all these considerations, multivariate seemingly unrelated clusterwise linear regression models for data contaminated by mild outliers are introduced here. They are obtained from the models described in Mazza and Punzo (2020) by modifying the definition of the linear terms in the M regression equations so that a different vector of regressors can be employed for each dependent variable. With these new models, the three sources of complexities mentioned above are jointly taken into consideration when predicting the responses in a multivariate linear regression framework. Thus, a more flexible approach for the analysis of linear dependencies in multivariate data is provided.

The key contributions of this paper are:

the specification of a novel class of models able to jointly account for the sources of complexity (a), (b) and (c) mentioned above;
a comparison with some other linear clusterwise regression models;
the description of conditions for the identifiability of the novel models;
details about ML estimation via an expectation-conditional maximisation (ECM) algorithm (Meng and Rubin 1993);
a treatment of the initialisation and convergence of the ECM algorithm and the issue of model selection;
an investigation of the effectiveness of the new models, based on simulated datasets, in comparison with the models proposed by Galimberti and Soffritti (2020) and Mazza and Punzo (2020);
an application to a study of the effects of prices and promotional activities on sales for two U.S. brands of canned tuna.

The remainder of this paper is organised as follows. The novel models are introduced in Sect. 2.1. Section 2.2 shows how they relate to some clusterwise linear regression models. Identifiability is treated in Sect. 2.3. Section 2.4 and Appendix A provide details on the ECM algorithm. Issues of algorithm initialisation, convergence criterion and model selection are discussed in Sects. 2.5 and 2.6 . Section 3 contains a summary of the experimental results obtained from the analysis of simulated data. The study of the effects of prices and promotional activities on U.S. canned tuna sales is presented in Sect. 4. Finally, in Sect. 5, some concluding remarks and ideas for future research are illustrated.

2 Seemingly unrelated contaminated Gaussian linear clusterwise regression analysis

2.1 Seemingly unrelated contaminated Gaussian linear clusterwise regression models

In order to introduce the new model, the following notation is required. Suppose that only $P_m$ of the P covariates contained in ${\textbf {X}}$ are considered to be relevant for the prediction of the response $Y_m$, where $P_m \le P$. Thus, let ${\textbf {X}}_{m}=(X_{m_1}, X_{m_2},\ldots ,X_{m_{P_m}})'$ be the vector composed of such $P_m$ covariates, and let ${\textbf {X}}_{m}^*=(1,{\textbf {X}}'_{m})'$. Furthermore, let $\varvec{\beta }_{km}=(\beta _{k,m_1}, \beta _{k,m_2},\ldots ,\beta _{k,m_{P_m}})'$ be the vector of the $P_m$ regression coefficients capturing the linear effect of such covariates on the response $Y_m$ in the kth sub-population, and $\varvec{\beta }_{km}^*=(\beta _{0k,m}, \varvec{\beta }'_{km})'$. Then, the vector containing all linear effects on the M responses in the kth sub-population can be obtained by stacking the M regression coefficient vectors specific for the kth sub-population one underneath the other; it can be denoted as $\varvec{\beta }^*_k=(\varvec{\beta }_{k1}^{*_{'}},\ldots ,\varvec{\beta }_{km}^{*_{'}},\ldots ,\varvec{\beta }_{kM}^{*_{'}})'$ and its length is $P^*+M$, where $P^*=\sum _{m=1}^M P_m$. Finally, the following $(P^*+M) \times M$ partitioned matrix is required:

$$\begin{aligned} \tilde{{\textbf {X}}}^*= \begin{bmatrix} {\textbf {X}}_{1}^* &{} {\textbf {0}}_{P_1+1} &{} \cdots &{} {\textbf {0}}_{P_1+1} \\ {\textbf {0}}_{P_2+1} &{} {\textbf {X}}_{2}^* &{} \cdots &{} {\textbf {0}}_{P_2+1} \\ \vdots &{} \vdots &{} &{} \vdots \\ {\textbf {0}}_{P_M+1} &{} {\textbf {0}}_{P_M+1} &{} \cdots &{} {\textbf {X}}_{M}^* \\ \end{bmatrix} , \end{aligned}$$

where ${\textbf {0}}_{P_m+1}$ denotes the $(P_m+1)$-dimensional null vector.

The random vector ${\textbf {Y}}$ follows a seemingly unrelated contaminated Gaussian linear clusterwise regression model of order K if the conditional probability density function (p.d.f.) of ${\textbf {Y}} \vert {\textbf {X}} = {\textbf {x}}$ has the form

$$\begin{aligned} f({\textbf {y}} \vert {\textbf {x}}; \varvec{\psi })=\sum _{k=1}^K \pi _k h\left( {\textbf {y}};\varvec{\theta }_k\right) , {\textbf {y}} \in \mathbb {R}^M, \end{aligned}$$

(1)

where $\pi _k$ is the mixing weight of the kth sub-population, with $\pi _k>0$ for $k=1, \ldots , K$, and $\sum _{k=1}^K \pi _k =1$; $h\left( {\textbf {y}};\varvec{\theta }_k\right) $ is the contaminated Gaussian p.d.f. of ${\textbf {Y}} \vert {\textbf {X}} = {\textbf {x}}$ in the kth sub-population, defined as follows:

$$\begin{aligned} h\left( {\textbf {y}}; \varvec{\theta }_k\right) = \alpha _k \phi _M\left( {\textbf {y}}; \varvec{\mu }_k({\textbf {x}};\varvec{\beta }^*_k),\varvec{\Sigma }_k\right) +(1-\alpha _k)\phi _M\left( {\textbf {y}}; \varvec{\mu }_k({\textbf {x}};\varvec{\beta }^*_k),\eta _k \varvec{\Sigma }_k\right) , \end{aligned}$$

(2)

and $\phi _M\left( \cdot ; \varvec{\mu },\varvec{\Sigma }\right) $ denotes the p.d.f. of an M-dimensional Gaussian distribution with expected mean vector $\varvec{\mu }$ and positive definite covariance matrix $\varvec{\Sigma }$. The term $\varvec{\mu }_k({\textbf {x}};\varvec{\beta }^*_k)$ in Eq. (2) is the conditional expected value of ${\textbf {Y}} \vert {\textbf {X}} = {\textbf {x}}$ in the kth sub-population; it is defined as follows:

$$\begin{aligned} \varvec{\mu }_k({\textbf {x}};\varvec{\beta }^*_k) = \tilde{{\textbf {x}}}^{*_{'}} \varvec{\beta }_k^* = \begin{bmatrix} {\textbf {x}}_{1}^{*_{'}} \ \varvec{\beta }_{k1}^{*} \\ \vdots \\ {\textbf {x}}_{m}^{*_{'}} \ \varvec{\beta }_{km}^{*} \\ \vdots \\ {\textbf {x}}_{M}^{*_{'}} \ \varvec{\beta }_{kM}^{*} \\ \end{bmatrix}, \end{aligned}$$

(3)

where $\tilde{{\textbf {x}}}^*$ denotes the realisation of $\tilde{{\textbf {X}}}^*$ obtained when ${\textbf {X}}={\textbf {x}}$. Thus, $\tilde{{\textbf {x}}}^{*_{'}} \varvec{\beta }_k^*$ coincides with an M-dimensional vector whose mth element is a linear combination of the realisations of the $P_m$ regressors selected for the prediction of $Y_m$ with weights given by the elements of vector $\varvec{\beta }_{km}^{*}$. Terms $\alpha _k \in (0,1)$ and $\eta _k>1$ are the weight of the typical observations in the kth sub-population and the factor contaminating the conditional variances and covariances of ${\textbf {Y}} \vert {\textbf {X}} = {\textbf {x}}$ for the mild outliers in the kth sub-population, respectively. In robust statistics, it is generally assumed that at least half of the observations are typical; thus, it is also possible to consider $\alpha _k \in [0.5,1)$. As a consequence of the constraint $\eta _k>1$, $\eta _k$ represents an inflation parameter for the elements of $\varvec{\Sigma }_k$. $\varvec{\theta }_k=(\varvec{\beta }_k^*,\varvec{\Sigma }_k,\alpha _k,\eta _k)$ is the parameter vector of model (2). The parameter vector of model (1) is given by $\varvec{\psi }=(\varvec{\psi }_1, \ldots , \varvec{\psi }_k, \ldots , \varvec{\psi }_K)$, where $\varvec{\psi }_k=(\pi _k,\varvec{\theta }_k)$; the number of free parameters in this vector is equal to $n_{\varvec{\psi }}= 3K-1+K(P^{*}+M)+K\frac{M(M+1)}{2}$.

In summary, the conditional p.d.f. $f({\textbf {y}} \vert {\textbf {x}}; \varvec{\psi })$ in Eq. (1) can be interpreted as a weighted average (namely, a mixture) of K Gaussian regression models with weights $\pi _k$, $k=1, \ldots , K$. The kth component of this mixture represents a multivariate seemingly unrelated contaminated Gaussian linear regression model with intercepts and regression coefficients $\varvec{\beta }_k^*$, symmetric and positive definite covariance matrix $\varvec{\Sigma }_{k}$, proportion of typical points $\alpha _k$ and inflation parameter $\eta _k$. Thanks to the non-diagonal structure of the variance-covariance matrices $\varvec{\Sigma }_k$, $k=1, \ldots , K$, the proposed model is able to account for correlated random disturbances within each of the K sub-populations associated with the mixture (1). Since the contaminated Gaussian distribution (2) is a mixture of two Gaussian linear regression models which are both associated with the kth component of the mixture in Eq. (1), the model defined by this latter equation can also be considered as a mixture of 2K seemingly unrelated Gaussian clusterwise linear regression models, whose components can be grouped into K pairs, each of which contains two Gaussian components having the same expected values and proportional covariance matrices.

2.2 Comparisons with other linear clusterwise regression models

When specific conditions are met, some special linear regression models can be obtained from model (1).

If $M>1$ and ${\textbf {X}}_{m}={\textbf {X}} \ \forall m$ (the same vector of predictors is considered for all responses), the following equality holds: $\tilde{{\textbf {x}}}^*={\textbf {I}}_M\otimes {\textbf {x}}^*$, where ${\textbf {I}}_M$ is the identity matrix of order M and $\otimes $ denotes the Kronecker product operator (see, e.g., Magnus and Neudecker 1988). Equation (3) can be rewritten as
$$\begin{aligned} \varvec{\mu }_k({\textbf {x}};\varvec{\beta }^*_k) = \left( {\textbf {I}}_M\otimes {\textbf {x}}^*\right) '\varvec{\beta }^*_k={\textbf {B}}_k'{} {\textbf {x}}, \ k=1, \ldots , K, \end{aligned}$$
(4)
where ${\textbf {B}}_k=\left[ \varvec{\beta }^*_{k1} \cdots \varvec{\beta }^*_{km} \cdots \varvec{\beta }^*_{kM}\right] $. Thus, Eq. (1) reduces to the mixture of multivariate contaminated Gaussian regression models introduced by Mazza and Punzo (2020).
If $M>1$, $\alpha _k \rightarrow 1$ and $\eta _k \rightarrow 1 \ \forall k$ (there is no contamination in the data), the resulting model coincides with the mixture of multivariate seemingly unrelated linear regressions described in Galimberti and Soffritti (2020).
If $\alpha _k \rightarrow 1$, $\eta _k \rightarrow 1 \ \forall k$ and ${\textbf {X}}_{m}={\textbf {X}} \ \forall m$ (there is no contamination in the data and the same vector of predictors is considered for all responses), Eq. (1) reduces to a mixture of either univariate Gaussian linear regression models (see, e.g., De Sarbo and Cron 1988; De Veaux 1989; Quandt and Ramsey 1978) or multivariate Gaussian linear regression models (see Jones and McLachlan 1992).
If $\alpha _k \rightarrow 1$, $\eta _k \rightarrow 1 \ \forall k$, ${\textbf {X}}_{m}={\textbf {X}} \ \forall m$ and $\varvec{\beta }^*_k=\varvec{\beta }^* \ \forall k$ (there is no contamination in the data, the same vector of predictors is considered for all responses and their effects are the same across all the sub-populations), the resulting model coincides with a linear regression model with error terms distributed according to a mixture of K either univariate Gaussian distributions (Bartolucci and Scaccia 2005) or multivariate Gaussian distributions (Soffritti and Galimberti 2011).
If $M>1$, $\alpha _k \rightarrow 1$, $\eta _k \rightarrow 1 \ \forall k$, $\varvec{\beta }^*_k=\varvec{\beta }^* \ \forall k$ (there is no contamination in the data and the effects of the predictors are the same across all the sub-populations), a multivariate seemingly unrelated linear regression model whose error terms are assumed to follow a Gaussian mixture model is obtained (Galimberti et al. 2016).

Seemingly unrelated regression models represent multivariate regression models in which prior information about the absence of certain covariates for the prediction of certain responses is explicitly taken into consideration (Srivastava and Giles 1987). Thus, Eq. (1) can also be seen as a mixture of multivariate contaminated Gaussian regression models in which some regression coefficients are constrained to be a priori equal to zero. To the best of the authors’ knowledge, the inclusion of such constraints in these latter models has not been addressed yet. Models obtained from Eq. (1) by embedding different constraints on the regression coefficients could also be employed in any practical application in which the relevant regressors for each response cannot be established from a priori information and, thus, the choice of the regressors to be used for the M responses is questionable. As it will be illustrated in Sect. 4, in such situations strategies based on a joint use of models (1) and variable selection techniques could be devised and employed.

2.3 Identifiability

A preliminary requirement for the consistency and other asymptotic properties of the ML estimator is represented by identifiability of the model parameters. Thus, before detailing ML estimation of $\varvec{\psi }$, a discussion about identifiability of model (1) is provided here. Consider the class of models $\mathfrak {F} = \{\mathfrak {F}_K, K=1, \ldots , K_{max}\}$, where $\mathfrak {F}_K = \{f({\textbf {y}} \vert {\textbf {x}}; \varvec{\psi }), \varvec{\psi } \in \varvec{\Psi }\}$, $f({\textbf {y}} \vert {\textbf {x}}; \varvec{\psi })$ is the p.d.f. of ${\textbf {Y}} \vert {\textbf {X}}={\textbf {x}}$ under the seemingly unrelated contaminated Gaussian linear clusterwise regression model of order K defined in (1) and $K_{max}$ denotes the maximum order specified by the researcher for that model. This class is said to be identifiable if, for any two models M, $\tilde{M} \in \mathfrak {F}$ with parameters $\varvec{\psi }=(\varvec{\psi }_1, \ldots , \varvec{\psi }_k, \ldots , \varvec{\psi }_K)$ and $\tilde{\varvec{\psi }}=(\tilde{\varvec{\psi }}_1, \ldots , \tilde{\varvec{\psi }}_k, \ldots , \tilde{\varvec{\psi }}_{\tilde{K}})$, respectively,

$$\begin{aligned} \sum _{k=1}^K \pi _k h\left( {\textbf {y}};\varvec{\theta }_{k}\right) = \sum _{k=1}^{\tilde{K}} \tilde{\pi }_k h\left( {\textbf {y}};\tilde{\varvec{\theta }}_{k}\right) \, \forall \, {\textbf {y}} \in \mathbb {R}^{M} \end{aligned}$$

implies that $K=\tilde{K}$ and $\varvec{\psi }=\tilde{\varvec{\psi }}$.

Several types of non-identifiability can affect the model class $\mathfrak {F}$. A first type is due to invariance to relabeling the components of the mixture (also known as label-switching). Non-identifiability can also be caused by potential overfitting associated with empty components or equal components (see, e.g., Frühwirth-Schnatter 2006). Imposing suitable constraints on the parameter space $\varvec{\Psi }$ can prevent such sources of non-identifiability for $\mathfrak {F}$. Another type of non-identifiability affecting this class is specifically associated with the use of finite mixtures in linear regression analysis with fixed covariates, which requires an additional constraint on the number of components of the mixture (1) (see Hennig 2000). Non-identifiability due to empty components is avoided by requiring the positivity of all the mixing weights $\pi _k$. Conditions specifically devised for ensuring identifiability of mixtures of contaminated Gaussian regression models are provided in Mazza and Punzo (2020). These results have been exploited in Theorem 1 to show that model (1) is identifiable if the parameters $(\varvec{\beta }^*_k, \varvec{\Sigma }_{k})$, $k=1, \ldots , K$, are pairwise distinct and the order K is exceeded by the number of distinct $(P_m-1)$-dimensional hyperplanes required to cover the covariates employed for the prediction of $Y_m$, for $m=1, \ldots , M$. In order to state Theorem 1, the following notation is also required: $\left\| \cdot \right\| _F$ is the element-wise matrix 2-norm (also known as the Frobenious norm); $H^{P_{m}-1} =\{ {\textbf {x}}_{m} \in \mathbb {R}^{P_m}: \varvec{\lambda }' {\textbf {x}}_{m} = c, \varvec{\lambda } \in \mathbb {R}^{P_m}, \varvec{\lambda } \ne {\textbf {0}}\}$ is a $(P_{m}-1)$-dimensional hyperplane; $J_m$ is the minimum number of such hyperplanes required to cover the covariates ${\textbf {x}}_{m}$; $\mathcal {H}^{P_{m}-1}$ is the space of $(P_{m}-1)$-dimensional hyperplanes of $\mathbb {R}^{P_m}$.

Theorem 1

Let $M \in \mathfrak {F}$ and $\tilde{M} \in \mathfrak {F}$ be two models, $\varvec{\psi }=(\varvec{\psi }_1, \ldots , \varvec{\psi }_k, \ldots , \varvec{\psi }_K)$ and $\tilde{\varvec{\psi }}=(\tilde{\varvec{\psi }}_1, \ldots , \tilde{\varvec{\psi }}_k, \ldots , \tilde{\varvec{\psi }}_{\tilde{K}})$ the corresponding parameters and, without loss of generality, $K \ge \tilde{K}$. If

(C1)
$K < J_m$ for $m=1, \ldots , M$, where
$$\begin{aligned} J_m:=\min \left\{ q_m: \{{\textbf {x}}_{im}, i \in \mathcal {I}_m \} \subseteq \bigcup _{b=1}^{q_m} H^{P_{m}-1}_b: H^{P_{m}-1}_b \in \mathcal {H}^{P_{m}-1} \right\} , \end{aligned}$$
with $\mathcal {I}_m$ being an index set associated with the distinct covariate points available for the prediction of $Y_m$, and
(C2)
$k \ne l$, with $k, l \in \{1, \ldots , K\}$, implies
$$\begin{aligned} \left\| \varvec{\beta }^*_k-\varvec{\beta }^*_l \right\| _F^2 + \left\| \varvec{\Sigma }_k-a \varvec{\Sigma }_l \right\| _F^2 \ne 0\ \forall a >0, \end{aligned}$$

then the class $\mathfrak {F}$ is identifiable.

Conditions (C1) and (C2) are obtained from Mazza and Punzo (2020) after suitable modifications of similar conditions required for the identifiability of their mixtures of contaminated Gaussian regression models. In particular, condition (C2) results from a simple substitution of the vector $\varvec{\beta }^*_k$ of model (1) for the matrix ${\textbf {B}}_k$ introduced in Eq. (4) containing the intercepts and regression coefficients in the kth component of the regression mixture model developed by Mazza and Punzo (2020). The modifications involved in the definition of the condition (C1) derive from the fact that each $Y_m \in {\textbf {Y}}$ may have its own covariates and, thus, M different restrictions on K have to be required, each one involving a (possibly) different minimum number of low-dimensional hyperplanes to cover those covariates. As a consequence, the proof of Theorem 1 can be obtained by exploiting the same arguments illustrated in Mazza and Punzo (2020) for the proof of their theorem about identifiability of mixtures of contaminated Gaussian regression models.

2.4 Maximum likelihood estimation

The ML estimation of the parameters $\varvec{\psi }$ is carried out here for a fixed value of K. Given a sample $\mathcal {S}$ of I independent observations drawn from model (1), the model log-likelihood is equal to $\ell (\varvec{\psi })=\sum _{i=1}^I \ln \left( \sum _{k=1}^K \pi _k h\left( {\textbf {y}}_i;\varvec{\theta }_k\right) \right) $. Following Mazza and Punzo (2020), ML estimates $\hat{\varvec{\psi }}$ can be computed by means of an ECM algorithm, which represents a variant of the EM algorithm usually employed for the computation of ML estimates from incomplete data. In the considered situation, the missing information is twofold. On the one hand, there is a classical source of incompleteness of any mixture model associated with the component memberships of the I sample observations. On the other hand, it is not known whether such observations are outliers with reference to any component or not. These two sources can be described by two different types of K-dimensional vectors. For the ith sample observation, they are given by ${\textbf {z}}_i$ and ${\textbf {u}}_i$, respectively: ${\textbf {z}}_i=(z_{i1},\ldots ,z_{iK})'$, with $z_{ik}=1$ if the ith observation comes from the kth component and $z_{ik}=0$ otherwise; ${\textbf {u}}_i=(u_{i1},\ldots ,u_{iK})'$, with $u_{ik}=1$ if the ith observation is typical in the kth component and $u_{ik}=0$ if it is an outlier, for $k=1,\ldots ,K$. Then, the set of complete data would be $\mathcal {S}_c=\{({\textbf {x}}_1,{\textbf {y}}_1,{\textbf {z}}_1,{\textbf {u}}_1),\ldots ,({\textbf {x}}_I,{\textbf {y}}_I,{\textbf {z}}_I,{\textbf {u}}_I)\}$, and the complete-data likelihood function is equal to

$$\begin{aligned} L_c(\varvec{\psi })&= \prod _{i=1}^I \prod _{k=1}^K \Bigr \{ \pi _k \Bigr [ \alpha _k \phi _M\Bigr ({\textbf {y}}_i; \varvec{\mu }_k({\textbf {x}}_i;\varvec{\beta }^*_k),\varvec{\Sigma }_k\Bigl )\Bigl ]^{u_{ik}} \\&\qquad \Bigr [(1-\alpha _k)\phi _M\Bigr ({\textbf {y}}_i; \varvec{\mu }_k({\textbf {x}}_i;\varvec{\beta }^*_k),\eta _k\varvec{\Sigma }_k \Bigl )\Bigl ]^{1-u_{ik}}\Bigl \}^{z_{ik}}. \end{aligned}$$

Thus, up to an additive constant, the complete-data log-likelihood function employed in the ECM algorithm for the computation of the parameter estimates can be expressed as follows:

$$\begin{aligned} \ell _c(\varvec{\psi })&=\sum _{i=1}^I \sum _{k=1}^K z_{ik} \Bigr [\ln \pi _k+ u_{ik} \ln \alpha _k+(1-u_{ik})\ln (1-\alpha _k)-\frac{1}{2}\ln \vert \varvec{\Sigma }_k \vert +\\&\qquad -\Bigr (\frac{M}{2}\ln \eta _k\Bigl )(1- u_{ik})-\frac{1}{2} \Bigr (u_{ik}+\frac{1-u_{ik}}{\eta _k}\Bigl )\delta ^2_{\varvec{\Sigma }_k}\Bigr ({\textbf {y}}_i, \varvec{\mu }_k({\textbf {x}}_i;\varvec{\beta }^*_k) \Bigl )\Bigl ], \end{aligned}$$

where

$$\begin{aligned} \delta ^2_{\varvec{\Sigma }_k}\left( {\textbf {y}}_i, \varvec{\mu }_k({\textbf {x}}_i;\varvec{\beta }^*_k)\right) = ({\textbf {y}}_i-\varvec{\mu }_k({\textbf {x}}_i;\varvec{\beta }^*_k))' \varvec{\Sigma }_k^{-1} ({\textbf {y}}_i- \varvec{\mu }_k({\textbf {x}}_i;\varvec{\beta }^*_k)) \end{aligned}$$

(5)

is the squared Mahalanobis distance between ${\textbf {y}}_i$ and $\varvec{\mu }_k({\textbf {x}}_i;\varvec{\beta }^*_k)$ with respect to the matrix $\varvec{\Sigma }_k$.

The hth iteration of the E-step in the ECM algorithm consists in calculating the conditional expectation of $l_c(\varvec{\psi })$ on the basis of the current estimate $\varvec{\psi }^{(h)}$ of the model parameters $\varvec{\psi }$; up to an additive constant, this expected value can be expressed as follows:

$$\begin{aligned} Q\left( \varvec{\psi } \vert \varvec{\psi }^{(h)}\right)&= \mathbb {E}_{\varvec{\psi }^{(h)}} [l_c(\varvec{\psi })] \\&=\sum _{i=1}^I \sum _{k=1}^K \hat{z}_{ik}^{(h)}\Bigr \{\ln \pi _k^{(h)}+ \hat{u}_{ik}^{(h)} \ln \alpha _k^{(h)}+(1-\hat{u}_{ik}^{(h)})\ln (1-\alpha _k^{(h)})+ \\&\quad + Q_i \Bigr (\varvec{\beta }^*_k,\varvec{\Sigma }_{k} \vert \varvec{\psi }^{(h)} \Bigl )\Bigl \}, \end{aligned}$$

where

$$\begin{aligned} Q_i \Bigr (\varvec{\beta }^*_k,\varvec{\Sigma }_{k} \vert \varvec{\psi }^{(h)} \Bigl )&=- \frac{1}{2} \Bigr [ \ln \vert \varvec{\Sigma }_{k}^{(h)} \vert + M(1-\hat{u}^{(h)}_{ik})\ln \eta _k^{(h)}+ \\&\quad +\Bigr (\hat{u}_{ik}^{(h)}+\frac{1-\hat{u}_{ik}^{(h)}}{\eta _k^{(h)}}\Bigl ) \delta ^2_{\varvec{\Sigma }^{(h)}_k}\left( {\textbf {y}}_i, \varvec{\mu }_k({\textbf {x}}_i;\varvec{\beta }^{*(h)}_k)\right) \Bigl ], \end{aligned}$$

$\hat{z}_{ik}^{(h)}$ and $\hat{u}_{ik}^{(h)}$ are the posterior probabilities (evaluated using $\varvec{\psi }^{(h)}$) that the ith observation is generated from the kth component of the mixture (1) and that the ith observation is a typical point of such a component, respectively:

$$\begin{aligned}&\hat{z}_{ik}^{(h)} = \mathbb {E}_{\varvec{\psi }^{(h)}} [Z_{ik} \vert ({\textbf {x}}_i,{\textbf {y}}_i)] = \frac{\pi _{k}^{(h)} h\left( {\textbf {y}}_i; \varvec{\theta }^{(h)}_k\right) }{ f\left( {\textbf {y}}_i \vert {\textbf {x}}_i;\varvec{\psi }^{(h)}\right) }, \end{aligned}$$

(6)

$$\begin{aligned}&\hat{u}_{ik}^{(h)} = \mathbb {E}_{\varvec{\psi }^{(h)}} [U_{ik} \vert ({\textbf {x}}_i,{\textbf {y}}_i,{\textbf {z}}_i)] = \frac{\alpha _{k}^{(h)}\phi \Bigr ({\textbf {y}}_i; \varvec{\mu }_k({\textbf {x}}_i;\varvec{\beta }^{*(h)}_k),\varvec{\Sigma }^{(h)}_k\Bigl )}{ h\left( {\textbf {y}}_i; \varvec{\theta }^{(h)}_k\right) }, \end{aligned}$$

(7)

with ${\textbf {Z}}_i=(Z_{i1},\ldots ,Z_{iK})'$ denoting a K-dimensional multinomial random vector with probabilities $\varvec{\pi }=(\pi _1, \ldots , \pi _K)'$, and $U_{ik} \vert Z_{ik}=1$ having a Bernoulli distribution with success probability of $\alpha _k$.

As far as the conditional maximisation is concerned, the update of $\varvec{\psi }^{(h)}$ is carried out by considering the following two parameter sub-vectors: $\varvec{\gamma }=(\varvec{\pi },\varvec{\beta }^{*},\varvec{\Sigma },\varvec{\alpha })$ and $\varvec{\eta }=(\eta _1, \ldots , \eta _K)'$, where $\varvec{\beta }^{*}=(\varvec{\beta }^*_{1}, \ldots , \varvec{\beta }^*_{K})$, $\varvec{\Sigma }=(\varvec{\Sigma }_1, \ldots , \varvec{\Sigma }_K)$, $\varvec{\alpha }=(\alpha _1, \ldots , \alpha _K)$. At the $(h+1)$th iteration of the ECM algorithm, $\varvec{\gamma }^{(h)}=(\varvec{\pi }^{(h)},\varvec{\beta }^{*(h)},\varvec{\Sigma }^{(h)},\varvec{\alpha }^{(h)})$ is updated through the maximisation of $Q(\varvec{\psi } \vert \varvec{\psi }^{(h)})$ with respect to $\varvec{\gamma }$ with $\varvec{\eta }$ fixed at $\varvec{\eta }^{(h)}$ (first CM step); then, the update of $\varvec{\eta }^{(h)}$ is carried out by maximising $Q(\varvec{\psi } \vert \varvec{\psi }^{(h)})$ with respect to $\varvec{\eta }$ with $\varvec{\gamma }$ fixed at $\varvec{\gamma }^{(h+1)}$ (second CM step). The resulting updates of ${\pi }^{(h)}_k$, ${\alpha }_k^{(h)}$ and ${\eta }_k^{(h)}$ are:

$$\begin{aligned} {\pi }^{(h+1)}_k&= \frac{1}{I}\sum _{i=1}^I \hat{z}_{ik}^{(h)}, \nonumber \\ {\alpha }_k^{(h+1)}&= \frac{\sum _{i=1}^I \hat{z}_{ik}^{(h)}\hat{u}_{ik}^{(h)}}{\sum _{i=1}^I \hat{z}_{ik}^{(h)}}, \end{aligned}$$

(8)

$$\begin{aligned} {\eta }_k^{(h+1)}&= \max \Bigr \{1,\frac{\sum _{i=1}^I \hat{z}_{ik}^{(h)}(1-\hat{u}_{ik}^{(h)})\delta ^2_{\varvec{\Sigma }^{(h+1)}_k}\left( {\textbf {y}}_i, \varvec{\mu }_k({\textbf {x}}_i;\varvec{\beta }^{*(h+1)}_k)\right) }{M \sum _{i=1}^I \hat{z}_{ik}^{(h)}(1-\hat{u}_{ik}^{(h)})}\Bigl \}. \end{aligned}$$

(9)

Such updates coincide with the ones reported in Mazza and Punzo (2020) for their model. Based on Eq. (9), it is possible to highlight that the update ${\eta }_k^{(h+1)}$ will be larger when the kth component is highly contaminated by the presence of outliers (i.e., when it is characterised by many observations with a small value of $\hat{u}_{ik}^{(h)}$ and a large value of the squared Mahalanobis distance from $\varvec{\mu }_k({\textbf {x}}_i;\varvec{\beta }^{*(h+1)}_k)$). As far as the remaining parameters are concerned, their updates are (details are reported in the Appendix):

$$\begin{aligned} {\varvec{\beta }}_{k}^{*{(h+1)}}&=\, \Bigr (\sum _{i=1}^I\hat{z}_{ik}^{(h)} \hat{w}_{ik}^{(h)}\tilde{{\textbf {x}}}^{*}_i \varvec{\Sigma }_k^{{(h)}^{-1}}\tilde{{\textbf {x}}}^{*'}_i\Bigl )^{-1} \Bigr (\sum _{i=1}^I \hat{z}_{ik}^{(h)} \hat{w}_{ik}^{(h)} \tilde{{\textbf {x}}}^{*}_i \varvec{\Sigma }_k^{{(h)}^{-1}} \varvec{y}_i \Bigl ), \end{aligned}$$

(10)

$$\begin{aligned} {\varvec{\Sigma }}_k^{(h+1)}&= \frac{\sum _{i=1}^I \hat{z}_{ik}^{(h)} \hat{w}_{ik}^{(h)} \Bigr ({\textbf {y}}_i-\tilde{{\textbf {x}}}^{*'}_i\varvec{\beta }^{*_{(h+1)}}_{k}\Bigl )\Bigr ({\textbf {y}}_i-\tilde{{\textbf {x}}}^{*'}_i\varvec{\beta }^{*_{(h+1)}}_{k}\Bigl )'}{\sum _{i=1}^I \hat{z}_{ik}^{(h)}}, \end{aligned}$$

(11)

where

$$\begin{aligned} \hat{w}_{ik}^{(h)}=\hat{u}_{ik}^{(h)}+\frac{1-\hat{u}_{ik}^{(h)}}{{\eta }_k^{(h)}}. \end{aligned}$$

(12)

It is worth noting that the matrix $\sum _{i=1}^I\hat{z}_{ik}^{(h)} \hat{w}_{ik}^{(h)}\tilde{{\textbf {x}}}^{*}_i \varvec{\Sigma }_k^{{(h)}^{-1}}\tilde{{\textbf {x}}}^{*'}_i$ in (10) has to be nonsingular; otherwise, the update ${\varvec{\beta }}^{*(h+1)}_k$ cannot be computed. Equation (10) also highlights that this update can be considered as a generalised least squares estimate with weights depending on $\hat{w}_{ik}^{(h)}$; this latter term also affects the update $\varvec{\Sigma }_k^{(h+1)}$ in (11), which represents a weighted sum of squared residuals. Using such weights leads to a reduction in the effects of the outliers on the estimation of ${\varvec{\beta }}^{*(h+1)}_k$; thus, this approach provides robust estimates of ${\varvec{\beta }}^{*(h+1)}_k$, for $k=1, \ldots K$. Furthermore, based on (12), sample observations with the highest posterior estimated probabilities of being generated from the kth component and of representing typical points in the kth component will have the largest impact on the updates of both the regression coefficients and covariances within that component.

Once the convergence is reached and the ML estimates $\hat{\varvec{\psi }}$ are computed, by exploiting Eq. (6) the ECM algorithm provides estimates of the posterior probabilities $\mathbb {P}_{\hat{\varvec{\psi }}}[Z_{ik}=1 \vert ({\textbf {x}}_i,{\textbf {y}}_i)] = \hat{z}_{ik}$, $i=1, \ldots , I$, $k=1, \ldots , K$. Such estimated probabilities can be employed to partition the I sample observations into K clusters, by assigning each observation to the component showing the highest posterior probability; for the ith observation:

$$\begin{aligned} \text {MAP}(\hat{z}_{ik}) = {\left\{ \begin{array}{ll} 1 &{} \text {if max}_h\{\hat{z}_{ih}\} \text { occurs when } h=k;\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Furthermore, Eq. (7) allows to compute the estimated posterior probabilities $\mathbb {P}_{\hat{\varvec{\psi }}}[U_{ik}=1 \vert ({\textbf {x}}_i,{\textbf {y}}_i,\hat{{\textbf {z}}}_i)]=\hat{u}_{ik}$, and an intra-cluster distinction between typical observations and mild outliers can be defined: the ith observation will be classified as an outlier of the hth cluster, where h is the label of the component for which $\text {MAP}(\hat{z}_{ik})=1$, if $\hat{u}_{ih}<0.5$. From the ML estimates $\hat{\varvec{\psi }}$ and Eq. (5) it is also possible to compute the estimated squared Mahalanobis distances $\hat{d}^2_{ik}=\delta ^2_{\hat{\varvec{\Sigma }}_k}\left( {\textbf {y}}_i, \hat{\varvec{\mu }}_k({\textbf {x}}_i;\hat{\varvec{\beta }}^*_k)\right) $, $i=1, \ldots , I$, $k=1, \ldots , K$, which can be employed as multivariate measures of the outlyingness of the I sample observations with respect to the K clusters detected by the model. From the definition of the squared Mahalanobis distance given in Eq. (5) and the expressions for $\hat{u}_{ik}^{(h)}$ and $\hat{w}_{ik}^{(h)}$ reported in Eqs. (7) and (12), respectively, it is possible to express both $\hat{u}_{ik}$ and $\hat{w}_{ik}$ as decreasing functions of $\hat{d}^2_{ik}$ (see Mazza and Punzo 2020, for the explicit expressions). Thus, atypical observations could also be detected and studied by considering the values of $\hat{d}^2_{ik}$ $\forall (i,k) \in \{i \in \{1, \ldots , I\}, k: \text {MAP}(\hat{z}_{ik})=1\}$ and by focusing on the largest values obtained in this way (see McLachlan and Peel 2000, p. 232).

2.5 Technical details about the ECM algorithm

A crucial point of any EM-based algorithm is the choice of the starting values for the model parameters (i.e., $\varvec{\psi }^{(0)})$. Multiple executions of the algorithm in association with multiple random initialisations or approaches based on non-random choices of either $\varvec{\psi }^{(0)}$ or the missing information can provide a solution (see, e.g., Biernacki et al. 2003; Karlis and Xekalaki 2003). As far as the ECM algorithm described above is concerned, the initialisation technique illustrated in Mazza and Punzo (2020) could be modified so as to be employed also for model (1). This task would require setting the initial values $\hat{z}_{ik}^{(0)}$, $i=1, \ldots , I$, $k=1, \ldots , K$, equal to the posterior probabilities from the EM algorithm for the estimation of the seemingly unrelated Gaussian clusterwise linear regression models, which are nested in model (1) when $\alpha _k \rightarrow 1^-$ and $\eta _k \rightarrow 1^+$, $k=1, \ldots , K$; furthermore, $\hat{u}_{ik}^{(0)}=0.999$, $i=1, \ldots , I$, $k=1, \ldots , K$. Another strategy for the initialisation of $\varvec{\psi }$ which exploits the relationship between model (1) and seemingly unrelated Gaussian clusterwise linear regression models (see Sect. 2.2) could be composed of the following three steps. Firstly, a Gaussian mixture model with K components is fitted to the sample residuals of a seemingly unrelated linear regression model (Srivastava and Giles 1987); this allows to obtain the starting values $\pi _k^{(0)}$ and $\varvec{\Sigma }_k^{(0)}$. Secondly, the starting values $\varvec{\beta }^{*(0)}_{k}$ are obtained from the fitting of K different seemingly unrelated linear regression models, one for each cluster of the partition associated with the Gaussian mixture model considered in the previous step. Thirdly, $\alpha _k^{(0)}$ and $\eta _k ^{(0)}$, $k=1 \ldots , K$, are set equal to 0.999 and 1.001, respectively. Models involved in the first two steps can be estimated through the packages mclust (Scrucca et al. 2017) and systemfit (Henningsen and Hamann 2007) in the R environment (R Core Team 2021). In the analyses of Sects. 3 and 4 , the ECM algorithm has been initialised using this latter strategy. Furthermore, since $(1-\alpha _k)$ in model (1) can be considered as the proportion of outliers in the kth sub-population, when this model is employed for outlier detection, a reasonable requirement is that in each cluster the number of typical observations cannot be smaller than the number of outliers, that is $\alpha _k \in [0.5,1) \ \forall k$. To guarantee this result, constraints on the estimation of $\alpha _k$, $k=1, \ldots , K$, have been included in the ECM algorithm; namely, Eq. (8) has been modified as follows: $ {\alpha }_k^{(h+1)} = \max \left\{ 0.5, \frac{\sum _{i=1}^I \hat{z}_{ik}^{(h)}\hat{u}_{ik}^{(h)}}{\sum _{i=1}^I \hat{z}_{ik}^{(h)}}\right\} $.

In order to avoid premature stops of the ECM algorithm associated with the use of lack of progress stopping criteria, such as the one based on the difference between the log-likelihood values at two consecutive steps, a convergence criterion based on the Aitken acceleration (Aitken 1926) has been adopted. It consists in stopping the algorithm when $\vert \ell ^{(h+1)}_{A} - \ell (\varvec{\psi }^{(h)}) \vert < \epsilon $, where $0<\epsilon <+ \infty $, $\ell ^{(h+1)}_{A}$ is $(h+1)$th Aitken accelerated estimate of the log-likelihood limit, and $\ell (\varvec{\psi }^{(h)})$ is the incomplete log-likelihood evaluated at $\varvec{\psi }^{(h)}$ (see, e.g., McNicholas 2010). Furthermore, a criterion based on a maximum number of iterations for the ECM algorithm has been employed. In the analyses of Sects. 3 and 4 , the maximum number of iterations and $\epsilon $ have been set equal to 500 and $10^{-6}$, respectively. Furthermore, in order to circumvent the possible issue of unbounded likelihood associated with a degenerate model, the ECM algorithm has been developed by embedding some constraints on the eigenvalues of $\varvec{\Sigma }_k^{(h)}$ for $k = 1,\ldots , K$. Namely, for all estimated covariance matrices, the ratio between the smallest and the largest eigenvalues is required to be not lower than $10^{-10}$.

2.6 Determining the value of K

As illustrated in Sect. 2.4, the ML estimation of $\varvec{\psi }$ based on the ECM algorithm is carried out for a given number of mixture components. When this number is not known and has to be determined from the data $\mathcal {S}$, it is common practice to employ model selection criteria able to take account of different aspects which are considered relevant when evaluating the adequacy of a model (see, e.g., Depraetere and Vandebroek 2014; Frühwirth-Schnatter 2006). For example, the Bayesian Information Criterion (Schwarz 1978) provides a trade-off between the fit and the model complexity; it can be computed as follows:

$$\begin{aligned} BIC(K)= 2 \ell (\hat{\varvec{\psi }}) - n_{\varvec{\psi }} \ln I. \end{aligned}$$

Model selection criteria that also consider the uncertainty of the estimated partition of the sample observations could be employed. An example is represented by the integrated completed likelihood (Biernacki et al. 2000), which can be computed according to different ways of measuring the uncertainty of the estimated partition (see, e.g., Andrews and McNicholas 2011; Baek and McLachlan 2011):

$$\begin{aligned} ICL_{1}(K)= & {} 2 \ell (\hat{\varvec{\psi }}) - n_{\varvec{\psi }} \ln I +2 \sum _{i=1}^I \sum _{k=1}^K \text {MAP}(\hat{z}_{ik}) \ln \hat{z}_{ik},\\ ICL_{2}(K)= & {} 2 \ell (\hat{\varvec{\psi }}) - n_{\varvec{\psi }} \ln I +2 \sum _{i=1}^I \sum _{k=1}^K \hat{z}_{ik} \ln \hat{z}_{ik}. \end{aligned}$$

These latter criteria penalize complex models more severely than BIC because of the presence of an additional penalty, which represents the estimated mean entropy. Thus, when using these criteria in comparison with the BIC, one cluster should be less likely split into two different components. $ICL_{1}$ and $ICL_{2}$ differ on whether a soft (i.e., $\hat{z}_{ik}$) or hard (i.e., $\text {MAP}(\hat{z}_{ik})$) clustering is considered in the estimation of the mean entropy. Higher values of these criteria indicate better-fit models; as it will be illustrated in Sect. 4, BIC, $ICL_{1}$ and $ICL_{2}$ can also be employed to select the predictors to be considered in the linear terms employed in the prediction of the M responses in model (1).

3 Results from Monte Carlo studies

3.1 Settings

This section focuses on the investigation of the effectiveness of models (1) (mixtures of contaminated seemingly unrelated Gaussian regressions, hereafter denoted as MCSG) in comparison with other approaches using simulated datasets. This task has been carried out in a multivariate setting with $M=4$ responses, $P=4$ covariates and datasets comprising $K=3$ groups of observations. The additional models considered in the comparison are those described by Mazza and Punzo (2020) and Galimberti and Soffritti (2020). From now on, these latter models have been denoted as MCG (mixtures of contaminated Gaussian regressions) and MSG (mixtures of seemingly unrelated Gaussian regressions), respectively.

The simulated datasets have been generated using three different data generation processes:

(a):: MSG;
(b):: MCSG with $\alpha _k=0.9 \ \forall k$, $\eta _1=40$, $\eta _2=\eta _3= 20$;
(c):: mixtures of regression models with seemingly unrelated t-distributed errors (MSt), with $\nu _1=\nu _2=\nu _3=4$ degrees of freedom.

In all the regression models employed to generate the datasets, the response $Y_m$ has been assumed to depend on $X_m$, for $m=1,2,3,4$; thus, $P_m=1$ $\forall m$. With each process, the following parameters have been employed: $\pi _1=0.3$, $\pi _2=0.5$, $\pi _3=0.2$, $\varvec{\beta }^{*}_1=(-3,0.2,-3,0.2,-3,0.2,-3,0.2)'$, $\varvec{\beta }^*_2=-\varvec{\beta }^*_1$, $\varvec{\beta }^{*}_3=(3+\epsilon ,-0.2,3+\epsilon ,-0.2,3+\epsilon ,-0.2,3+\epsilon ,-0.2)$,

$\varvec{\Sigma }_1 = \begin{pmatrix} 1.0 &{} 0.5 &{} 0.5 &{} 0.5 \\ 0.5 &{}\quad 1.0 &{}\quad 0.5 &{}\quad 0.5\\ 0.5 &{}\quad 0.5 &{}\quad 1.0 &{}\quad 0.5\\ 0.5 &{}\quad 0.5 &{}\quad 0.5 &{}\quad 1.0 \end{pmatrix}$, $\varvec{\Sigma }_2 = \varvec{\Sigma }_3 = \begin{pmatrix} 1.00 &{}\quad 0.75 &{}\quad 0.75 &{}\quad 0.75 \\ 0.75 &{}\quad 1.00 &{}\quad 0.75 &{}\quad 0.75\\ 0.75 &{}\quad 0.75 &{}\quad 1.00 &{}\quad 0.75\\ 0.75 &{}\quad 0.75 &{}\quad 0.75 &{}\quad 1.00 \end{pmatrix}$.

It is worth noting that the second and third components only differ in the intercepts of the four regression equations. Covariate values have been generated by a uniform distribution over the interval $(-5,5)$. As concerns $\epsilon $, two alternatives have been considered in order to produce two different degrees of separation between groups of observations: $\epsilon =9$ (higher degree), $\epsilon =6.5$ (lower degree). Figure 1 shows the scatterplots of the variables $Y_1$ and $X_1$ for a sample of size $I = 1000$ generated using the MSG (upper panel), MCSG (central panel) and MSt (lower panel) processes with $\epsilon =9$ (on the left) and $\epsilon =6.5$ (on the right). Due to the values of the regression coefficients employed to model the linear dependencies of $Y_m$ and $X_m$ across the three components, the scatterplots of $Y_m$ and $X_m$ for $m=2,3,4$ are similar. Under each data generating process, 100 random samples of size I have been simulated for each $\epsilon $. As far as the sample size is concerned, the following values have been examined: $I=500, 1000$. Thus, the degree of separation and the sample size can be considered as experimental factors. This yields a total of 600 generated datasets for each I. The whole analysis has been run on an IBM x3750 M4 server with 4 Intel Xeon E5-4620 processors with 8 cores and 128GB RAM.

3.2 Results

Table 1 Estimation of $\alpha _k$ and $\eta _k$: averages and standard deviations of the estimates over 100 samples for the fitted MCG and MCSG models of order $K=3$ ($I=500$)

Seemingly unrelated clusterwise linear regression for contaminated data

Abstract

Similar content being viewed by others

Seemingly unrelated clusterwise linear regression

Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model

Matrix Normal Cluster-Weighted Models

1 Introduction

2 Seemingly unrelated contaminated Gaussian linear clusterwise regression analysis

2.1 Seemingly unrelated contaminated Gaussian linear clusterwise regression models

2.2 Comparisons with other linear clusterwise regression models

2.3 Identifiability

Theorem 1

2.4 Maximum likelihood estimation

2.5 Technical details about the ECM algorithm

2.6 Determining the value of K

3 Results from Monte Carlo studies

3.1 Settings

3.2 Results

3.2.1 Estimation of \(\alpha _k\) and \(\eta _k\)

3.2.2 Parameter recovery

3.2.3 Classification recovery

3.2.4 Trade-off between fit and complexity

3.2.5 Comparison among information criteria

4 Results from the analysis of canned tuna sales

5 Conclusions

Code Availability

Data availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Competing interest

Additional information

Publisher's Note

Appendix A: Update of \(\varvec{\beta }_{k}^{*}\) and \(\varvec{\Sigma }_k\)

Appendix A: Update of \(\varvec{\beta }_{k}^{*}\) and \(\varvec{\Sigma }_k\)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation