1 Introduction

The decomposition of the total sum of squares (total variation) into the explained sum of squares (explained variation) plus the sum of squared residuals (unexplained variation) is a peculiarity of the linear regression model whose parameters are estimated by least squares (see, e.g., Davidson and MacKinnon 2004, pp. 117–118). The famous coefficient of determination, universally referred to as R2 and defined as a measure of explained variation, arises from this decomposition and is used to evaluate the goodness-of-fit of the linear regression model. Surprisingly, as also emphasized by Cameron and Windmeijer (1996), extensions to other models are rare, with the notable exceptions of models with heteroscedastic errors with known variance (Buse 1973), logit and probit models (see Maddala 1986, Windmeijer 1995, and the references therein), tobit models (surveyed by Veall and Zimmermann 1996), regression models for count data (Cameron and Windmeijer 1996), and some common nonlinear regression models (Cameron and Windmeijer 1997).

We focus on mixtures of (linear) regressions, also known in the literature as switching regression or clusterwise regression models (see, e.g., Wedel 1990; Wedel and Kamakura 2000, Chapter 7; Frühwirth-Schnatter 2006, Chapter 8). These models represent a classical alternative to, and generalization of, a single (linear) regression, to be used when some latent or unobserved feature splits the data into groups (or clusters), each characterized by its own regression relationship.

Three eminent members of the class of mixtures of regressions, whose peculiarities and differences are detailed in Ingrassia et al. (2012) and Ingrassia and Punzo (2016), are mixtures of regressions with fixed covariates (DeSarbo and Cron 1988; see also Quandt 1972, Hosmer 1974, and Quandt and Ramsey 1978 for the special case of two mixture components), mixtures of regressions with concomitant variables (Dayton and Macready 1988), and mixtures of regressions with random covariates (Hennig 2000). For these three classes of mixtures of regressions, we propose a finer three-term decomposition of the total sum of squares when the parameters are estimated with the expectation-maximization (EM) algorithm (Dempster et al. 1977), within a maximum likelihood framework, under normally distributed errors in each mixture component. The terms of this decomposition allow the user to investigate the main aspects of the fitted model via normalized measures: the association between the response variable and the latent groups, the goodness-of-fit of the model, and the proportion of the total variation in the dependent variable that remains unexplained by the fitted model. Furthermore, local and overall coefficients of determination are introduced to evaluate how well the model fits the data group by group and as a whole, respectively.

The proposed decomposition and measures can also be seen as cluster validity methods (see, e.g., Halkidi et al. 2001; Theodoridis and Koutroumbas 2008, Chapter 16) for mixtures of regressions, i.e., as methods aiming at the quantitative evaluation of the clusters produced by the fitted models; this is a step of fundamental importance in most applications (Rezaee et al. 1998; Steinley et al. 2015). According to the usual classification of cluster validity criteria as internal, external, and relative (Arbelaitz et al. 2013), our measures can be categorized as internal (Milligan and Cheng 1996), i.e., as criteria which measure the goodness of the estimated clusters without reference to external information.

The paper is organized as follows. Section 2 summarizes basic concepts about the mixtures of linear regressions we consider. Section 3 details the part of the EM algorithm devoted to the update of the local regression coefficients. The proposed three-term decomposition is presented in Section 4. The other proposals are presented in Section 5: normalized measures based on the proposed decomposition are given in Section 5.1, the use of the ternary diagram to display the normalized terms of the decomposition is suggested in Section 5.2, a normalized measure of explained response variation is defined in Section 5.3, and local and overall coefficients of determination are introduced in Section 5.4. Sections 6 and 7 illustrate applications to artificial and real data, respectively. Section 8 summarizes and concludes.

2 Mixtures of Regressions

Let X be a vector of covariates with values in \(\mathbb {R}^{d}\), and let Y be a dependent (or response) variable taking values in \(\mathbb {R}\). Suppose that the regression of Y on X varies across the k levels (groups or clusters) of a categorical latent variable G.

Mixtures of regressions with fixed covariates (MRFC; DeSarbo and Cron 1988) are characterized by the following conditional density function:

$$ p\left( y|\boldsymbol{x}; \boldsymbol{\psi}\right) = \sum\limits_{g=1}^{k} \pi_{g} f(y|\boldsymbol{x}; \boldsymbol{\theta}_{g}), $$
(1)

where πg = P(G = g) is the mixing weight, with πg > 0 and \({\sum }_{g=1}^{k} \pi _{g}=1\), while f(y|x;θg) is the conditional density of Y |X = x,G = g depending on the parameter vector θg. In (1), \(\boldsymbol {\psi } = (\pi _{1}, \ldots , \pi _{k-1}, \boldsymbol {\theta }_{1}^{\prime }, \ldots , \boldsymbol {\theta }_{k}^{\prime })^{\prime }\) denotes the set of all parameters of the model (see also Mazza and Punzo 2018).

Mixtures of regressions with concomitant variables (MRCV; Dayton and Macready 1988), when covariates and concomitant variables coincide, are characterized by the following density:

$$ p\left( y|\boldsymbol{x}; \boldsymbol{\psi}\right) = {\sum}^{k}_{g=1} p\left( G=g | \boldsymbol{x}; \boldsymbol{\alpha}\right) f\left( y|\boldsymbol{x}; \boldsymbol{\theta}_{g}\right), $$
(2)

where the mixing weight p(G = g|x;α) is now a function depending on x through a parameter vector α, and \(\boldsymbol {\psi } = (\boldsymbol {\alpha }^{\prime }, \boldsymbol {\theta }_{1}^{\prime }, \ldots , \boldsymbol {\theta }_{k}^{\prime })^{\prime }\) denotes the set of all parameters of the model. The probability p(G = g|x;α) is usually modeled by the multinomial logistic model:

$$ \begin{array}{@{}rcl@{}} p(G=g|\boldsymbol{x}; \boldsymbol{\alpha}) = \frac{\exp(\alpha_{g0}+\boldsymbol{\alpha}_{g1}^{\prime}\boldsymbol{x})}{\displaystyle\sum\limits^k_{j=1} \exp(\alpha_{j0} + \boldsymbol{\alpha}_{j1}^{\prime}\boldsymbol{x})}, \end{array} $$

where \(\boldsymbol {\alpha }_{g}=(\alpha _{g0},\boldsymbol {\alpha }^{\prime }_{g1})^{\prime } \in \mathbb {R}^{d+1}\) and \(\boldsymbol {\alpha }=(\boldsymbol {\alpha }^{\prime }_{1}, \ldots , \boldsymbol {\alpha }^{\prime }_{k})^{\prime }\); for identifiability, one of these vectors, typically \(\boldsymbol {\alpha }_{1}\), is set to \(\boldsymbol {0}\) (see, e.g., Grün and Leisch 2008 and Mazza et al. 2019).
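As a concrete illustration, the following minimal R sketch (hypothetical function and object names, unrelated to the packages cited later) evaluates these multinomial logistic mixing weights through a numerically stable softmax.

```r
## Multinomial logistic mixing weights p(G = g | x; alpha) for one covariate vector x.
## `alpha` is a k x (d + 1) matrix whose g-th row is (alpha_g0, alpha_g1').
mixing_weights <- function(x, alpha) {
  eta <- alpha[, 1] + alpha[, -1, drop = FALSE] %*% x  # linear predictors, one per group
  w <- exp(eta - max(eta))                             # subtract the max for numerical stability
  as.numeric(w / sum(w))                               # softmax: weights are positive and sum to one
}

## Example with k = 2 groups and d = 1 covariate; alpha_1 = 0 is the reference group
alpha <- rbind(c(0, 0), c(-1, 2))
mixing_weights(x = 1, alpha)
```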

Mixtures of regressions with random covariates (MRRC) were first discussed by Gershenfeld (1997), Hennig (2000), and Wedel (2002), who referred to them as cluster-weighted models (CWM), clusterwise regression with random covariates, and saturated mixture regression models, respectively. Recent work on MRRC can be found in Ingrassia et al. (2012, 2014, 2015), Punzo (2014), Punzo and Ingrassia (2015), Subedi et al. (2013, 2015), Berta et al. (2016), McNicholas (2016), Punzo and McNicholas (2017), Punzo et al. (2018), and Zarei et al. (2018). Unlike MRFC and MRCV, which model the conditional density of Y |X = x, MRRC models the joint distribution of \(\left (\boldsymbol {X}^{\prime }, Y\right )^{\prime }\) as:

$$ p\left( \boldsymbol{x},y; \boldsymbol{\psi}\right) = \sum\limits^{k}_{g=1} \pi_{g} f\left( y|\boldsymbol{x}; \boldsymbol{\theta}_{g}\right) p\left( \boldsymbol{x} ; \boldsymbol{\xi}_{g}\right), $$
(3)

where p(x;ξg) is the density of X|G = g, depending on the parameter vector ξg, and \(\boldsymbol {\psi } = \left (\pi _{1}, \ldots , \pi _{k-1}, \boldsymbol {\theta }_{1}^{\prime }, \ldots , \boldsymbol {\theta }_{k}^{\prime }, \boldsymbol {\xi }_{1}^{\prime }, \ldots , \boldsymbol {\xi }_{k}^{\prime }\right )^{\prime }\).

3 Maximum Likelihood Estimation: the EM Algorithm

In models (1)–(3), assume a normal distribution for Y |X = x,G = g. Denoting by \(\phi \left (y; \mu , \sigma ^{2}\right )\) the univariate normal density with mean μ and variance σ2, these models specialize, respectively, as:

$$ \begin{array}{@{}rcl@{}} \text{MRFC}: && p\left( y|\boldsymbol{x}; \boldsymbol{\psi}\right) = \sum\limits^{k}_{g=1} \pi_{g} \phi\left[y;\mu\left( \boldsymbol{x};\boldsymbol{\beta}_{g}\right),{\sigma^{2}_{g}}\right], \end{array} $$
(4)
$$ \begin{array}{@{}rcl@{}} \text{MRCV}: && p\left( y|\boldsymbol{x}; \boldsymbol{\psi}\right) = \sum\limits^{k}_{g=1} p\left( G=g|\boldsymbol{x};\boldsymbol{\alpha}\right) \phi\left[y;\mu\left( \boldsymbol{x};\boldsymbol{\beta}_{g}\right),{\sigma^{2}_{g}}\right], \end{array} $$
(5)
$$ \begin{array}{@{}rcl@{}} \text{MRRC}: && p\left( \boldsymbol{x},y; \boldsymbol{\psi}\right) = \sum\limits^{k}_{g=1} \pi_{g} p\left( \boldsymbol{x};\boldsymbol{\xi}_{g}\right) \phi\left[y;\mu\left( \boldsymbol{x};\boldsymbol{\beta}_{g}\right),{\sigma^{2}_{g}}\right], \end{array} $$
(6)

where the local conditional densities are based on the linear function \(\mu \left (\boldsymbol {x};\boldsymbol {\beta }_{g}\right ) = \beta _{g0}+\boldsymbol {\beta }_{g1}^{\prime }\boldsymbol {x}\), with \(\boldsymbol {\beta }_{g}=\left (\beta _{g0},\boldsymbol {\beta }^{\prime }_{g1}\right )^{\prime }\), \(\beta _{g0} \in \mathbb {R}\), and \(\boldsymbol {\beta }_{g1} \in \mathbb {R}^{d}\).

Maximum likelihood (ML) parameter estimates for models (4)–(6) are usually obtained via the expectation-maximization (EM) algorithm (Dempster et al. 1977). Given a random sample \((\boldsymbol {x}_{1}^{\prime },y_{1})^{\prime },\ldots ,(\boldsymbol {x}_{n}^{\prime },y_{n})^{\prime }\) from \(\left (\boldsymbol {X}^{\prime },Y\right )^{\prime }\) and a fixed number k of groups, the algorithm for models (4)–(6) is based on the complete-data log-likelihood:

$$ \begin{array}{@{}rcl@{}} \text{MRFC}: l_{c}\left( \boldsymbol{\psi}\right) &=& \displaystyle \sum\limits_{g=1}^{k}\sum\limits_{i=1}^{n} z_{ig}\ln\left( \pi_{g}\right) + l_{\text{reg}}\left( \boldsymbol{\beta},\boldsymbol{\sigma}^{2}\right), \end{array} $$
(7)
$$ \begin{array}{@{}rcl@{}} \text{MRCV}: l_{c}\left( \boldsymbol{\psi}\right) &=& \displaystyle \sum\limits_{g=1}^{k}\sum\limits_{i=1}^{n} z_{ig}\ln\left[p\left( G=g | \boldsymbol{x}_{i}; \boldsymbol{\alpha}\right)\right] + l_{\text{reg}}\left( \boldsymbol{\beta},\boldsymbol{\sigma}^{2}\right), \end{array} $$
(8)
$$ \begin{array}{@{}rcl@{}} \text{MRRC}: l_{c}\left( \boldsymbol{\psi}\right) &=& \displaystyle \sum\limits_{g=1}^{k}\sum\limits_{i=1}^{n} z_{ig}\ln\left( \pi_{g}\right) + \sum\limits_{g=1}^{k}\sum\limits_{i=1}^{n} z_{ig}\ln\left[p\left( \boldsymbol{x}_{i};\boldsymbol{\xi}_{g}\right)\right] + l_{\text{reg}}\left( \boldsymbol{\beta},\boldsymbol{\sigma}^{2}\right), \end{array} $$
(9)

respectively, where zig = 1 if \((\boldsymbol {x}_{i}^{\prime },y_{i})^{\prime }\) comes from component g and zig = 0 otherwise, and

$$ l_{\text{reg}}\left( \boldsymbol{\beta},\boldsymbol{\sigma}^{2}\right)=\sum\limits_{g=1}^{k}\sum\limits_{i=1}^{n} z_{ig}\ln\left\{\phi\left[y_{i};\mu\left( \boldsymbol{x}_{i};\boldsymbol{\beta}_{g}\right),{\sigma^{2}_{g}}\right]\right\}, $$
(10)

where \(\boldsymbol {\beta }=(\boldsymbol {\beta }_{1}^{\prime },\ldots ,\boldsymbol {\beta }_{k}^{\prime })^{\prime }\) and \(\boldsymbol {\sigma }^{2}=({\sigma ^{2}_{1}},\ldots ,{\sigma ^{2}_{k}})^{\prime }\). It is well known that the EM algorithm iterates between two steps, the E-step and the M-step, until convergence; a schematization, restricted to the estimation of β and \(\boldsymbol {\sigma }^{2}\) from lreg, is given below.

E-step:

Given the current parameter estimates \(\boldsymbol {\psi }^{\left (r\right )}\) at the r th iteration, each zig is replaced by the estimated posterior probability:

$$ \begin{array}{@{}rcl@{}} \text{MRFC}: && z_{ig}^{\left( r\right)} = \frac{ \pi_{g}^{(r)}\phi\left( y_{i}|\boldsymbol{x}_{i};\boldsymbol{\beta}_{g}^{(r)},\sigma_{g}^{2,(r)}\right) }{ p\left( y_{i}|\boldsymbol{x}_{i};\boldsymbol{\psi}^{\left( r\right)}\right) }, \end{array} $$
(11)
$$ \begin{array}{@{}rcl@{}} \text{MRCV}: && z_{ig}^{\left( r\right)} = \frac{ p\left( G=g | \boldsymbol{x}_{i}; \boldsymbol{\alpha}^{(r)}\right)\phi\left( y_{i}|\boldsymbol{x}_{i};\boldsymbol{\beta}_{g}^{(r)},\sigma_{g}^{2,(r)}\right) }{ p\left( y_{i}|\boldsymbol{x}_{i};\boldsymbol{\psi}^{\left( r\right)}\right) }, \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} \text{MRRC}: && z_{ig}^{\left( r\right)} = \frac{ \pi_{g}^{(r)} p\left( \boldsymbol{x}_{i};\boldsymbol{\xi}_{g}^{(r)}\right)\phi\left( y_{i}|\boldsymbol{x}_{i};\boldsymbol{\beta}_{g}^{(r)},\sigma_{g}^{2,(r)}\right) }{ p\left( \boldsymbol{x}_{i},y_{i};\boldsymbol{\psi}^{\left( r\right)}\right) }. \end{array} $$
(13)
M-step (regression parameters only):

The values \(z_{ig}^{\left (r\right )}\) are substituted for zig in (7)–(9), yielding the expected complete-data log-likelihood, whose terms can be maximized separately. In particular, the expectation of lreg in (10) yields:

$$ Q_{\text{reg}}\left( \boldsymbol{\beta},\boldsymbol{\sigma}^{2}\right)=\sum\limits_{i=1}^{n}\sum\limits_{g=1}^{k} z_{ig}^{\left( r\right)}\ln\left[\phi\left( y_{i}|\boldsymbol{x}_{i};\boldsymbol{\beta}_{g},{\sigma^{2}_{g}}\right)\right]. $$

The maximization of Qreg with respect to β and σ2 is equivalent to independently maximizing each of the k expressions:

$$ Q_{\text{reg, \textit{g}}}\left( \boldsymbol{\beta}_{g},{\sigma_{g}^{2}}\right)=\frac{1}{2}\sum\limits_{i=1}^{n} z_{ig}^{\left( r\right)}\left[-\ln\left( 2\pi\right)-\ln\left( {\sigma_{g}^{2}}\right)-\frac{\left( y_{i}-\beta_{g0}-\boldsymbol{\beta}_{g1}^{\prime}\boldsymbol{x}_{i}\right)^{2}}{{\sigma_{g}^{2}}}\right] $$
(14)

with respect to βg and \({\sigma _{g}^{2}}\), g = 1,…,k. The maximization of (14) is equivalent to the maximization problem of the linear regression model (for the complete data), except that each observation \((\boldsymbol {x}_{i}^{\prime },y_{i})^{\prime }\) contributes to the log-likelihood with a known weight \(z_{ig}^{\left (r\right )}\).

Update for \(\boldsymbol {\beta }_{g1}\). Equating to zero the derivative of (14) with respect to βg1 yields:

$$ \begin{array}{@{}rcl@{}} \sum\limits_{i=1}^{n} z_{ig}^{(r)} \left( y_{i} - \beta_{g0} - \boldsymbol{\beta}_{g1}^{\prime}\boldsymbol{x}_{i}\right)\boldsymbol{x}_{i} &= \boldsymbol{0} \end{array} $$
(15)
Subtracting from (15) the first-order condition for βg0 (cf. (20) below), multiplied by \(\bar{\boldsymbol{x}}_{g}\) and hence null at the optimum, gives:

$$ \begin{array}{@{}rcl@{}} \sum\limits_{i=1}^{n} z_{ig}^{(r)} \left( y_{i} - \beta_{g0} - \boldsymbol{\beta}_{g1}^{\prime}\boldsymbol{x}_{i}\right)\boldsymbol{x}_{i} - \sum\limits_{i=1}^{n} z_{ig}^{(r)} \left( y_{i}-\beta_{g0}-\boldsymbol{\beta}_{g1}^{\prime}\boldsymbol{x}_{i}\right)\bar{\boldsymbol{x}}_{g} &= \boldsymbol{0} \\ \sum\limits_{i=1}^{n} z_{ig}^{(r)} \left( y_{i} - \beta_{g0} - \boldsymbol{\beta}_{g1}^{\prime}\boldsymbol{x}_{i}\right)\left( \boldsymbol{x}_{i}-\bar{\boldsymbol{x}}_{g}\right) &= \boldsymbol{0} \end{array} $$
(16)
Substituting \(\beta_{g0} = \bar{y}_{g}-\boldsymbol{\beta}_{g1}^{\prime}\bar{\boldsymbol{x}}_{g}\), which follows from the first-order condition for βg0 (cf. (20) below), into (16) and rearranging yields:

$$ \begin{array}{@{}rcl@{}} \sum\limits_{i=1}^{n} z_{ig}^{(r)} \left[\left( y_{i} - \bar{y}_{g}\right) - \boldsymbol{\beta}_{g1}^{\prime}\left( \boldsymbol{x}_{i}-\bar{\boldsymbol{x}}_{g}\right)\right]\left( \boldsymbol{x}_{i}-\bar{\boldsymbol{x}}_{g}\right) &= \boldsymbol{0} \\ \sum\limits_{i=1}^{n} z_{ig}^{(r)} \left( y_{i} - \bar{y}_{g}\right)\left( \boldsymbol{x}_{i}-\bar{\boldsymbol{x}}_{g}\right) - \left[\sum\limits_{i=1}^{n} z_{ig}^{(r)}\left( \boldsymbol{x}_{i}-\bar{\boldsymbol{x}}_{g}\right)\left( \boldsymbol{x}_{i}-\bar{\boldsymbol{x}}_{g}\right)^{\prime}\right]\boldsymbol{\beta}_{g1} &= \boldsymbol{0}, \end{array} $$
(17)

where

$$ \bar{y}_{g} = \frac{1}{n_{g}^{(r)}}\sum\limits_{i=1}^{n} z_{ig}^{(r)} y_{i} \qquad \text{and} \qquad \bar{\boldsymbol{x}}_{g} = \frac{1}{n_{g}^{(r)}}\sum\limits_{i=1}^{n} z_{ig}^{(r)} \boldsymbol{x}_{i}, $$
(18)

with \(n_{g}^{(r)}=\displaystyle \sum \limits _{i=1}^{n} z_{ig}^{(r)}\) being the expected a posteriori size of the g th group. Solving (17) with respect to βg1 yields:

$$ \boldsymbol{\beta}_{g1}^{(r+1)}=\left[\sum\limits_{i=1}^{n} z_{ig}^{(r)}\left( \boldsymbol{x}_{i}-\bar{\boldsymbol{x}}_{g}\right)\left( \boldsymbol{x}_{i}-\bar{\boldsymbol{x}}_{g}\right)^{\prime}\right]^{-1}\sum\limits_{i=1}^{n} z_{ig}^{(r)} \left( y_{i} - \bar{y}_{g}\right)\left( \boldsymbol{x}_{i}-\bar{\boldsymbol{x}}_{g}\right). $$
(19)

Update for \(\beta _{g0}\). Equating to zero the derivative of (14) with respect to βg0, with βg1 replaced by \(\boldsymbol {\beta }_{g1}^{(r+1)}\) from (19), yields:

$$ \begin{array}{@{}rcl@{}} \sum\limits_{i=1}^{n} z_{ig}^{(r)} \left( y_{i}-\beta_{g0}-\boldsymbol{\beta}_{g1}^{(r+1)^{\prime}}\boldsymbol{x}_{i}\right) & =& 0 \\ n_{g}^{(r)}\beta_{g0} & =& \sum\limits_{i=1}^{n} z_{ig}^{(r)} y_{i} - \sum\limits_{i=1}^{n} z_{ig}^{(r)} \boldsymbol{\beta}_{g1}^{(r+1)^{\prime}}\boldsymbol{x}_{i}. \end{array} $$
(20)

Solving (20) with respect to βg0 yields:

$$ \beta_{g0}^{(r+1)} = \bar{y}_{g} - \boldsymbol{\beta}_{g1}^{(r+1)^{\prime}}\bar{\boldsymbol{x}}_{g}. $$
(21)

Note that the local regression coefficients \(\beta _{g0}^{(r+1)}\) in (21) and \(\boldsymbol {\beta }_{g1}^{(r+1)}\) in (19) are weighted least squares (WLS) estimates of βg0 and βg1 (see, e.g., Chat et al. 2006, Chapter 7).

Update for \({\sigma _{g}^{2}}\). The maximization of (14) with respect to \({\sigma _{g}^{2}}\), with βg0 replaced by \(\beta _{g0}^{(r+1)}\) and βg1 by \(\boldsymbol {\beta }_{g1}^{(r+1)}\), yields:

$$ \sigma_{g}^{2,(r+1)} = \frac{1}{n_{g}^{(r)}}\displaystyle\sum\limits_{i=1}^{n} z_{ig}^{(r)} \left( y_{i}-\beta_{g0}^{(r+1)}-\boldsymbol{\beta}_{g1}^{(r+1)^{\prime}}\boldsymbol{x}_{i}\right)^{2}. $$

A complete description of the M-step can be found in Wedel and De Sarbo (1995) and Wedel and Kamakura (2000), pp. 120–124, for the MRFC, in Leisch (2004) and Grün and Leisch (2008) for the MRCV, and in Mazza et al. (2018) for the MRRC.
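To make the preceding formulas concrete, the following minimal R sketch (all names are illustrative; it is not taken from any of the packages cited above) performs a single EM iteration for the MRFC in (4) with one covariate: the E-step implements (11), and the M-step implements (18), (19), (21), and the update for \({\sigma _{g}^{2}}\). The mixing-weight update, not detailed above because the focus is on the regression parameters, is the standard one based on the average posterior probabilities.

```r
## One EM iteration for the MRFC in (4) with d = 1 (illustrative names).
## y, x: numeric vectors of length n; pi_g, beta0, beta1, sigma2: vectors of length k.
em_step_mrfc <- function(y, x, pi_g, beta0, beta1, sigma2) {
  n <- length(y); k <- length(pi_g)

  ## E-step, Eq. (11): posterior probabilities z_ig
  dens <- sapply(1:k, function(g) pi_g[g] * dnorm(y, beta0[g] + beta1[g] * x, sqrt(sigma2[g])))
  z <- dens / rowSums(dens)                       # n x k matrix of posteriors

  ## M-step: weighted least squares updates
  for (g in 1:k) {
    ng   <- sum(z[, g])                           # expected group size, Eq. (18)
    xbar <- sum(z[, g] * x) / ng                  # weighted means, Eq. (18)
    ybar <- sum(z[, g] * y) / ng
    beta1[g] <- sum(z[, g] * (y - ybar) * (x - xbar)) /
                sum(z[, g] * (x - xbar)^2)        # Eq. (19) with d = 1
    beta0[g] <- ybar - beta1[g] * xbar            # Eq. (21)
    res <- y - beta0[g] - beta1[g] * x
    sigma2[g] <- sum(z[, g] * res^2) / ng         # update for sigma_g^2
    pi_g[g]   <- ng / n                           # standard mixing-weight update for the MRFC
  }
  list(pi_g = pi_g, beta0 = beta0, beta1 = beta1, sigma2 = sigma2, z = z)
}
```

Iterating em_step_mrfc() until the observed-data log-likelihood stabilizes reproduces, in this special case, the scheme described above.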

Once the model is fitted, each observation is classified into one of the k categories of the latent variable G according to the maximum a posteriori probability (MAP) estimate: \(\text {MAP}(\hat {z}_{ig})=1\) if \( \max \limits _{h}\{\hat {z}_{ih}\}\) occurs in cluster g, and 0 otherwise, where \(\hat {z}_{ig}\) denotes the value of zig at convergence of the EM algorithm.
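For instance, with an n × k matrix of fitted posterior probabilities, the MAP classification can be obtained in R as follows (the values below are purely illustrative).

```r
## MAP classification from fitted posterior probabilities (illustrative values)
z_hat <- matrix(c(0.9, 0.1,
                  0.2, 0.8,
                  0.6, 0.4), ncol = 2, byrow = TRUE)  # n x k posteriors at convergence
map_cluster <- apply(z_hat, 1, which.max)             # group with the largest z_hat[i, ]
map_cluster                                           # 1 2 1
```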

4 Three-Term Decomposition of the Total Sum of Squares

The total sum of squares (total variability) on Y, i.e.:

$$ \begin{array}{@{}rcl@{}} \text{TSS}=\sum\limits_{i=1}^n \left( y_i - \bar{y}\right)^2, \end{array} $$

can be written, because \({\sum }_{g=1}^{k} \hat {z}_{ig} = 1\), i = 1,…,n, as:

$$ \begin{array}{@{}rcl@{}} \text{TSS} & = \sum\limits_{i=1}^{n} \left( y_{i} - \bar{y}\right)^{2} \sum\limits_{g=1}^{k} \hat{z}_{ig} = \sum\limits_{g=1}^{k} \sum\limits_{i=1}^{n} \hat{z}_{ig} \left( y_{i} - \bar{y}\right)^{2} = \sum\limits_{g=1}^{k} \sum\limits_{i=1}^{n} \hat{z}_{ig} \left( y_{i} - \bar{y}_{g} + \bar{y}_{g}- \bar{y}\right)^{2} \\ & = \sum\limits_{g=1}^{k} \sum\limits_{i=1}^{n} \hat{z}_{ig} \left( y_{i} - \bar{y}_{g}\right)^{2} + \sum\limits_{g=1}^{k} \sum\limits_{i=1}^{n} \hat{z}_{ig} \left( \bar{y}_{g}- \bar{y}\right)^{2} + 2 \sum\limits_{g=1}^{k} \sum\limits_{i=1}^{n} \hat{z}_{ig} \left( y_{i} - \bar{y}_{g}\right) \left( \bar{y}_{g} - \bar{y}\right) \\ & = \sum\limits_{g=1}^{k} \sum\limits_{i=1}^{n} \hat{z}_{ig} \left( y_{i} - \bar{y}_{g}\right)^{2} + \sum\limits_{g=1}^{k} \hat{n}_{g}\left( \bar{y}_{g} - \bar{y}\right)^{2} + 2 \sum\limits_{g=1}^{k}\left( \bar{y}_{g}- \bar{y}\right)\sum\limits_{i=1}^{n} \hat{z}_{ig} \left( y_{i} - \bar{y}_{g}\right), \end{array} $$
(22)

where \(\hat {n}_{g}=\displaystyle \sum \limits _{i=1}^{n}\hat {z}_{ig}\) denotes the expected (soft) size of the g th group according to the fitted model. Based on (18), \(\displaystyle \sum \limits _{i=1}^{n} \hat {z}_{ig} y_{i} = \bar {y}_{g} \sum \limits _{i=1}^{n} \hat {z}_{ig}\), and hence the last term on the right-hand side of (22) vanishes. Thus,

$$ \begin{array}{@{}rcl@{}} \text{TSS} = \sum\limits_{g=1}^{k} \text{SS}_{g} + \sum\limits_{g=1}^{k} \hat{n}_{g}\left( \bar{y}_{g} - \bar{y}\right)^{2} = \text{WSS} + \text{BSS}, \end{array} $$
(23)

where

$$ \begin{array}{@{}rcl@{}} \text{SS}_g = \sum\limits_{i=1}^n \hat{z}_{ig} \left( y_i - \bar{y}_g\right)^2 \end{array} $$

is the (soft) sum of squares in the g th group,

$$ \text{WSS} = \sum\limits_{g=1}^{k} \text{SS}_{g} $$
(24)

is the (soft) within-group sum of squares, and

$$ \text{BSS} = \sum\limits_{g=1}^{k} \hat{n}_{g}\left( \bar{y}_{g} - \bar{y}\right)^{2} $$
(25)

is the (soft) between-group sum of squares. The term “soft” is used because the group memberships \(\hat {z}_{ig}\), i = 1,…,n and g = 1,…,k, are posterior probabilities rather than “hard” 0/1 values. Denoting by \(\hat {\boldsymbol {\beta }}_{g}=(\hat {\beta }_{g0},\hat {\boldsymbol {\beta }}_{g1}^{\prime })'\) the ML estimate of \(\boldsymbol {\beta }_{g}=(\beta _{g0},\boldsymbol {\beta }_{g1}^{\prime })'\) at convergence of the EM algorithm, the WSS term can be further decomposed as:

$$ \begin{array}{@{}rcl@{}} \text{WSS} &= & \sum\limits_{g=1}^{k} \sum\limits_{i=1}^{n} \hat{z}_{ig}\left[y_{i} - \mu\left( \boldsymbol{x}_{i};\hat{\boldsymbol{\beta}}_{g}\right) + \mu\left( \boldsymbol{x}_{i};\hat{\boldsymbol{\beta}}_{g}\right) - \bar{y}_{g}\right]^{2} \\ &= & \sum\limits_{g=1}^{k} \sum\limits_{i=1}^{n} \hat{z}_{ig} \left[y_{i} - \mu\left( \boldsymbol{x}_{i};\hat{\boldsymbol{\beta}}_{g}\right)\right]^{2} + \sum\limits_{g=1}^{k} \sum\limits_{i=1}^{n} \hat{z}_{ig} \left[\mu\left( \boldsymbol{x}_{i};\hat{\boldsymbol{\beta}}_{g}\right) - \bar{y}_{g}\right]^{2} \\ & &+ 2 \sum\limits_{g=1}^{k} \sum\limits_{i=1}^{n} \hat{z}_{ig} \left[y_{i} - \mu\left( \boldsymbol{x}_{i};\hat{\boldsymbol{\beta}}_{g}\right)\right]\left[\mu\left( \boldsymbol{x}_{i};\hat{\boldsymbol{\beta}}_{g}\right) - \bar{y}_{g}\right]. \end{array} $$
(26)

The use of (16) and (21) in (26) yields:

$$ \begin{array}{@{}rcl@{}} &&\sum\limits_{i=1}^{n} \hat{z}_{ig}\left[y_{i} - \mu\left( \boldsymbol{x}_{i};\hat{\boldsymbol{\beta}}_{g}\right)\right]\left[\mu\left( \boldsymbol{x}_{i};\hat{\boldsymbol{\beta}}_{g}\right) - \bar{y}_{g}\right] \\& =& \sum\limits_{i=1}^{n} \hat{z}_{ig} \left( y_{i} - \hat{\beta}_{g0} - \hat{\boldsymbol{\beta}}_{g1}^{\prime}\boldsymbol{x}_{i}\right)\left( \hat{\beta}_{g0} + \hat{\boldsymbol{\beta}}_{g1}^{\prime}\boldsymbol{x}_{i} - \hat{\beta}_{g0} - \hat{\boldsymbol{\beta}}_{g1}^{\prime}\bar{\boldsymbol{x}}_{g}\right) \\ & =& \hat{\boldsymbol{\beta}}_{g1}^{\prime} \sum\limits_{i=1}^{n} \hat{z}_{ig} \left( y_{i} - \hat{\beta}_{g0} - \hat{\boldsymbol{\beta}}_{g1}^{\prime}\boldsymbol{x}_{i}\right)\left( \boldsymbol{x}_{i} - \bar{\boldsymbol{x}}_{g}\right) = 0 , \end{array} $$

so that the third term on the right-hand side of (26) vanishes. Thus, the WSS term in (26) simplifies as:

$$ \text{WSS} = \text{EWSS} + \text{RWSS}, $$
(27)

with

$$ \begin{array}{@{}rcl@{}} \text{EWSS} & =& \sum\limits_{g=1}^{k} \text{ESS}_{g}, \end{array} $$
(28)
$$ \begin{array}{@{}rcl@{}} \text{RWSS} & =& \sum\limits_{g=1}^{k} \text{RSS}_{g}, \end{array} $$
(29)

where, for each group g,

$$ \text{ESS}_{g}=\sum\limits_{i=1}^{n} \hat{z}_{ig}\left[\mu\left( \boldsymbol{x}_{i};\hat{\boldsymbol{\beta}}_{g}\right) - \bar{y}_{g}\right]^{2} $$
(30)

is the (soft) explained sum of squares and

$$ \text{RSS}_{g}=\sum\limits_{i=1}^{n} \hat{z}_{ig} \left[y_{i} - \mu\left( \boldsymbol{x}_{i};\hat{\boldsymbol{\beta}}_{g}\right)\right]^{2} $$
(31)

is the (soft) residual sum of squares. Finally, substituting (27) in (23) yields

$$ \text{TSS} = \text{BSS} + \text{RWSS} + \text{EWSS}. $$
(32)

Thus, adopting the classical nomenclature of the (one-factor) analysis of covariance (ANCOVA; see, e.g., Huitema 2011, Chapter 6), the total sum of squares TSS can be broken into three parts: the (soft) between-group sum of squares BSS (the variability of Y explained by the latent group variable G), the (soft) within-group sum of squares explained by the model thanks to the covariates, EWSS, and the (soft) residual within-group sum of squares, RWSS. In other words, the (soft) within-group sum of squares WSS is decomposed into the part predictable from the covariates X via the chosen model (EWSS) and the part not predictable from X via the chosen model (RWSS). Finally, note that, when k = 1, the BSS term in (32) vanishes and TSS = EWSS + RWSS, which is the classical decomposition of the total sum of squares for the standard linear regression model whose parameters are estimated by least squares.
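The decomposition can be computed directly from the fitted posteriors and local regression lines. A minimal R sketch is given below (illustrative names, not tied to any package); the reported check, BSS + EWSS + RWSS − TSS, is expected to be numerically zero only when the local coefficients are the WLS estimates at convergence, since the cross-product term in (26) vanishes only in that case.

```r
## Three-term decomposition of Eq. (32) from fitted quantities (illustrative names).
## y: response vector (length n); z_hat: n x k matrix of posteriors at convergence;
## mu_hat: n x k matrix with mu(x_i; beta_hat_g) in column g.
ss_decomposition <- function(y, z_hat, mu_hat) {
  n_g    <- colSums(z_hat)                           # soft group sizes
  ybar_g <- colSums(z_hat * y) / n_g                 # weighted group means, cf. Eq. (18)
  ybar   <- mean(y)
  TSS  <- sum((y - ybar)^2)
  BSS  <- sum(n_g * (ybar_g - ybar)^2)               # Eq. (25)
  EWSS <- sum(z_hat * sweep(mu_hat, 2, ybar_g)^2)    # Eqs. (28) and (30)
  RWSS <- sum(z_hat * (y - mu_hat)^2)                # Eqs. (29) and (31)
  c(TSS = TSS, BSS = BSS, EWSS = EWSS, RWSS = RWSS,
    check = BSS + EWSS + RWSS - TSS)                 # ~0 at the WLS/EM solution
}
```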

In terms of clustering validation, BSS can be seen as a separation measure (see, e.g., Cerdeira et al. 2012), i.e., as a measure of how well separated clusters are along the y-axis (the greater the value of BSS, the more “separated” the clusters are on Y ), while WSS can be seen as a compactness measure (see, e.g., Panagiotakis 2015), i.e., as a measure of how close observations in a cluster are with respect to the regression line of that cluster (the smaller the value of WSS, the more “compact” the clusters are around their regression line).

5 Evaluating the Main Aspects of the Fitted Model

5.1 Normalized Three-Term Decomposition

Starting from the three-term decomposition given in (32), it is possible to define normalized summary measures aiming to evaluate the main aspects of the fitted model. In particular, dividing both sides of (32) by TSS yields:

$$ \begin{array}{@{}rcl@{}} \frac{\text{BSS}}{\text{TSS}} + \frac{\text{EWSS}}{\text{TSS}} + \frac{\text{RWSS}}{\text{TSS}} & =& 1 \\ \text{NBSS} + \text{NEWSS} + \text{NRWSS} & =& 1, \end{array} $$
(33)

where NBSS, NEWSS, and NRWSS are the normalized versions, with respect to TSS, of BSS, EWSS, and RWSS, respectively. In terms of interpretation, NBSS is the proportion of the total variability of Y in the sample explained by the weighted differences between the weighted group means \(\bar {y}_{g}\) and the overall mean \(\bar {y}\); hence, NBSS can be interpreted as a sort of correlation ratio, i.e., a measure of association between the dependent variable Y and the latent group variable G. NEWSS is the proportion of the total variability of Y explained by the inclusion of the covariates X via the slope(s) of the local regressions. Finally, NRWSS is the proportion of the total variability of Y in the sample which remains unexplained by the fitted model.

5.2 Graphical Representation of the Three-Term Decomposition

With reference to a fitted model, the triplet \(\left (\text {NBSS},\text {NEWSS},\text {NRWSS}\right )\) can be seen as a point p in the probability simplex \(\mathbb {S}^{3}\), defined as the 2-dimensional subset of the 3-dimensional space containing vectors with non-negative coordinates summing to one. As illustrated in Aitchison (2003), Chapter 1.4, a convenient way of displaying points in \(\mathbb {S}^{3}\) is the ternary diagram in Fig. 1, an equilateral triangle of unit altitude.

Fig. 1 A point \(\boldsymbol {p}=\left (\text {NBSS},\text {NEWSS},\text {NRWSS}\right )\) in the ternary diagram

Here, for any point p, the lengths of the perpendiculars from p to the sides opposite the vertices NBSS, NEWSS, and NRWSS are non-negative and sum to one. Since there is a unique point with these perpendicular values, there is a one-to-one correspondence between \(\mathbb {S}^{3}\) and points in the triangle. In such a representation, the larger a component, say NBSS, the further the point p is from the side opposite the vertex NBSS or, in other words, the nearer the point is to the vertex NBSS. Moreover, points with two components, say NBSS and NEWSS, in constant ratio lie on a straight line through the complementary vertex NRWSS. Finally, points with one component, say NRWSS, of constant value lie on a straight line parallel to the side opposite the vertex NRWSS.
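The mapping from a triplet to a point of the triangle is a barycentric-coordinate transformation. A minimal base-R sketch follows; the vertex layout is illustrative (an equilateral triangle of unit side rather than unit altitude), and the example triplet is the MRCV triplet reported later in Section 7.

```r
## Placing a triplet p = (NBSS, NEWSS, NRWSS) in a ternary diagram (base-R sketch).
vertices <- rbind(NBSS  = c(0, 0),
                  NEWSS = c(1, 0),
                  NRWSS = c(0.5, sqrt(3) / 2))        # equilateral triangle
p  <- c(NBSS = 0.712, NEWSS = 0.119, NRWSS = 0.169)   # MRCV triplet of Section 7
xy <- colSums(p * vertices)                           # barycentric -> Cartesian coordinates

plot(vertices, type = "n", asp = 1, axes = FALSE, xlab = "", ylab = "")
polygon(vertices)                                     # boundary of the simplex
text(vertices, labels = rownames(vertices), pos = 3)  # vertex labels
points(xy[1], xy[2], pch = 19)                        # the point p
```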

5.3 Normalized Explained Sum of Squares

According to (33), it is natural to introduce the quantity:

$$ \text{NESS} = \text{NBSS} + \text{NEWSS} = 1-\text{NRWSS} $$
(34)

representing the proportion of the total variability of Y explained by the fitted model. NESS conveniently takes values in the interval \(\left [0,1\right ]\): large values of NESS, and hence small values of NRWSS, indicate a mixture of regressions that fits the observed data closely.

Provided that TSS > 0, the limit cases NESS = 0 and NESS = 1 are respectively obtained when NBSS = NEWSS = 0 and NRWSS = 0. Cases where each of the three terms NBSS, NEWSS, and NRWSS is null are analyzed below.

  • NBSS = 0 when BSS = 0, i.e., when \(\bar {y}_{1}=\cdots =\bar {y}_{k}=\bar {y}\), regardless of the group sizes \(\hat {n}_{1},\ldots ,\hat {n}_{k}\); refer to (25).

  • NEWSS = 0 when EWSS = 0, i.e., when \(\hat {\boldsymbol {\beta }}_{11}=\cdots =\hat {\boldsymbol {\beta }}_{k1}=\boldsymbol {0}\) so that \(\hat {\beta }_{g0}=\bar {y}_{g}\), g = 1,…,k, regardless of the values of \(\hat {z}_{ig}\); refer to (28) and (30).

  • NRWSS = 0 when RWSS = 0. A sufficient condition for the latter equality, regardless of the values of \(\hat {z}_{ig}\), is that the k component regression lines coincide (i.e., \(\hat {\beta }_{10}=\cdots =\hat {\beta }_{k0}=\hat {\beta }_{0}\) and \(\hat {\boldsymbol {\beta }}_{11}=\cdots =\hat {\boldsymbol {\beta }}_{k1}=\hat {\boldsymbol {\beta }}_{1}\)) and all the n data points lie on the resulting common regression line (i.e., \(y_{i} = \mu (\boldsymbol {x}_{i};\hat {\boldsymbol {\beta }})\), i = 1,…,n, with \(\hat {\boldsymbol {\beta }}=(\hat {\beta }_{0},\hat {\boldsymbol {\beta }}_{1}^{\prime })'\)); refer to (29) and (31).

5.4 Local and Overall Coefficients of Determination

Since \(\hat {\boldsymbol {\beta }}_{g}\) is a WLS estimate of βg, it is natural to define the local coefficient of determination for the g th group as:

$$ {R^{2}_{g}}=\frac{\text{ESS}_{g}}{\text{ESS}_{g}+\text{RSS}_{g}}=\frac{\text{ESS}_{g}}{\text{SS}_{g}} $$
(35)

(see, e.g., Will et al. 1988). \({R^{2}_{g}}\) can be interpreted as the proportion of response variation in the g th group that is not explained by the intercept-only model \(\mu \left (\boldsymbol {x};\hat {\beta }_{g0}\right )=\hat {\beta }_{g0}\) but is explained by the covariates X included in the linear model \(\mu \left (\boldsymbol {x};\hat {\boldsymbol {\beta }}_{g}\right )=\hat {\beta }_{g0}+\hat {\boldsymbol {\beta }}_{g1}^{\prime }\boldsymbol {x}\). In general, the higher \({R^{2}_{g}}\), the better the g th linear model fits the data in the g th group: the more response variability accounted for by the regression, the closer the data points fall to the fitted regression line.

With the same principle, it is natural to define the overall coefficient of determination as:

$$ R^{2} = \frac{\text{EWSS}}{\text{WSS}}. $$
(36)

It can be interpreted as the proportion of the within-group response variation explained (accounted for) by the fitted mixture of regressions. Based on (28), R2 is related to \({R^{2}_{1}},\ldots ,{R^{2}_{k}}\) by the following relation:

$$ R^{2} = \frac{\displaystyle\sum\limits_{g=1}^{k} \text{ESS}_{g}}{\text{WSS}} = \frac{\displaystyle\sum\limits_{g=1}^{k} \text{SS}_{g} \frac{\text{ESS}_{g}}{\text{SS}_{g}}}{\text{WSS}} = \frac{\displaystyle\sum\limits_{g=1}^{k} \text{SS}_{g} {R^{2}_{g}}}{\text{WSS}} = \displaystyle\sum\limits_{g=1}^{k} \frac{\text{SS}_{g}}{\text{WSS}} {R^{2}_{g}}. $$
(37)

According to (37), R2 can be seen as a weighted average of the local coefficients of determination \({R^{2}_{1}},\ldots ,{R^{2}_{k}}\) with weights SS1/WSS,…,SSk/WSS being the proportion of the within-group sum of squares due to each group.
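In the same spirit as the sketch of Section 4, the local and overall coefficients of determination can be computed in R as follows (illustrative names; the denominator ESS_g + RSS_g coincides with SS_g at convergence, as noted in (35)).

```r
## Local and overall coefficients of determination, Eqs. (35)-(37) (illustrative names).
r2_mixture <- function(y, z_hat, mu_hat) {
  ybar_g <- colSums(z_hat * y) / colSums(z_hat)          # weighted group means
  ESS_g  <- colSums(z_hat * sweep(mu_hat, 2, ybar_g)^2)  # Eq. (30), one value per group
  RSS_g  <- colSums(z_hat * (y - mu_hat)^2)              # Eq. (31)
  SS_g   <- ESS_g + RSS_g                                # local sums of squares, cf. Eq. (35)
  list(R2_g    = ESS_g / SS_g,                           # local R^2, Eq. (35)
       R2      = sum(ESS_g) / sum(SS_g),                 # overall R^2, Eq. (36)
       weights = SS_g / sum(SS_g))                       # weights appearing in Eq. (37)
}
```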

6 Analyses on Artificial Data

To find out more about the three terms of the decomposition proposed in (33), and to evaluate the behavior of these terms under violations of the model assumptions, applications to artificial data are considered here. The analyses are performed in R (R Core Team 2016). MRFC and MRRC are fitted via the cwm() function of the flexCWM package (Mazza et al. 2018), while MRCV are fitted via the flexmix() function of the flexmix package (Leisch 2004; Grün and Leisch 2008). These functions implement the EM algorithm to find ML estimates of the parameters (cf. Section 3). Among the possible initialization strategies for the EM algorithm (see, e.g., Biernacki et al. 2003; Karlis and Xekalaki 2003; Bagnato and Punzo 2013), a random initialization is repeated 20 times from different random positions, and the solution maximizing the observed-data log-likelihood over these 20 runs is retained.
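As a hedged illustration of the flexmix route only (argument usage follows Grün and Leisch 2008; the simulated data, seed, and parameter values are ours and purely illustrative, and the flexCWM call for MRFC/MRRC is omitted), an MRCV with k = 2 components can be fitted along these lines.

```r
library(flexmix)

## Illustrative two-component data with a single covariate
set.seed(1)
n  <- 500
g  <- sample(1:2, n, replace = TRUE, prob = c(0.3, 0.7))
x  <- rnorm(n)
y  <- c(-2, 2)[g] + x + rnorm(n)
df <- data.frame(x = x, y = y)

## MRCV: the concomitant model lets the mixing weights depend on x
fit <- flexmix(y ~ x, data = df, k = 2, concomitant = FLXPmultinom(~ x))

parameters(fit)         # local intercepts, slopes, and standard deviations
z_hat <- posterior(fit) # n x k matrix of fitted posterior probabilities
table(clusters(fit), g) # MAP classification against the simulated labels
BIC(fit)                # for comparisons across k (MRFC/MRCV only, see Section 7)
```

The posterior matrix z_hat, together with the component-wise fitted lines, can then be passed to sketches such as ss_decomposition() and r2_mixture() above to obtain the measures of Sections 4 and 5.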

6.1 Understanding the Decomposition

In the first illustrative example, artificial data are considered to find out more about the role of the three terms of the decomposition in (33). To simplify the graphical representations, a single covariate X (d = 1) and two groups (k = 2) are taken into account. The data generating process is a mixture of regressions where:

  • The weights are π1 = 0.3 and π2 = 0.7;

  • A standard normal distribution is used to generate the values of X in both groups;

  • A normal distribution is adopted to generate the values of the dependent variable Y;

  • The two regression lines have intercepts β10 = 0 and β20, the same slope β11 = β21 = β1, and the same conditional standard deviation σ1 = σ2 = σ.

The experimental conditions are the intercept in the second group (\(\beta _{20}\in \left \{0.5,1,1.5,2,2.5,3\right \}\)), the common slope (\(\beta _{1}\in \left \{0,0.2,0.4,0.6,0.8,1,1.2,1.4,1.6,1.8,2\right \}\)), and the common conditional standard deviation (\(\sigma \in \left \{0.1,0.4,0.7,1\right \}\)). These experimental conditions cover the aspects the three terms of the decomposition are based on: the difference between β10 and β20 is related to the BSS term in (25), and the slope β1 of the parallel regression lines affects the EWSS term in formula (28), while the conditional standard deviation σ impacts the RWSS term in (29).

One hundred datasets, each of size n = 1000, have been generated for each of the 264 combinations of the conditions above. Figure 2 shows an example of generated dataset related to the following combination of experimental conditions: β20 = 2, β1 = 0.6, and σ = 0.7.
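For reproducibility, a minimal R sketch of this data generating process, for the single combination shown in Fig. 2 (β20 = 2, β1 = 0.6, and σ = 0.7), could look as follows; the seed and object names are arbitrary.

```r
## One artificial dataset from the Section 6.1 design (beta_20 = 2, beta_1 = 0.6, sigma = 0.7)
set.seed(123)
n     <- 1000
pi_g  <- c(0.3, 0.7)                               # mixing weights
beta0 <- c(0, 2)                                   # intercepts beta_10 and beta_20
beta1 <- 0.6                                       # common slope
sigma <- 0.7                                       # common conditional standard deviation

g <- sample(1:2, n, replace = TRUE, prob = pi_g)   # latent group labels
x <- rnorm(n)                                      # standard normal covariate
y <- beta0[g] + beta1 * x + rnorm(n, sd = sigma)   # local linear regressions
plot(x, y, col = g, pch = 16, cex = 0.5)           # scatter akin to Fig. 2
```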

Fig. 2 Section 6.1. Example of scatter plot in the case β20 = 2, β1 = 0.6, and σ = 0.7

On each generated dataset, a MRFC with k = 2 components is fitted and the three terms NBSS, NEWSS, and NRWSS are computed. Figure 3 displays the ternary diagrams of the obtained results. Each of these diagrams contains the same points, but their color (in grayscale) changes based on the considered experimental factor. In these diagrams, each of the 264 points is related to a particular combination \(\left (\beta _{20},\beta _{1},\sigma \right )\), and the point is obtained by averaging the triplets \(\left (\text {NBSS},\text {NEWSS},\text {NRWSS}\right )\) related to the 100 replications for the considered combination.

Fig. 3 Section 6.1. Average (over 100 replications) of the decomposition terms from the fitted MRFC models

From Fig. 3a, we note that points roughly depart from the vertex NBSS as the second regression line approaches the first one (i.e., as \(\beta _{20}\rightarrow \beta _{10}=0\)). This happens because if the parallel lines move closer, then the group means of Y, i.e., \(\overline {y}_{1}\) and \(\overline {y}_{2}\), move closer too; consequently, the separation (on Y ) between groups reduces and the BSS term in (25) decreases too. From Fig. 3b, we note that points roughly depart from the vertex NEWSS as the positive common slope of the regression lines tends to 0. This happens because if \(\beta _{1}\rightarrow 0\), then \(\mu \left (x;\boldsymbol {\beta }_{g}\right ) \rightarrow \overline {y}_{g}\), with \(\boldsymbol {\beta }_{g}=\left (\beta _{g0},\beta _{1}\right )^{\prime }\), g = 1,2; consequently, X is not useful (via the linear model) to explain Y in each group, and \(\text {EWSS}\rightarrow 0\); refer to formula (28). Finally, from Fig. 3c, we note that points roughly depart from the vertex NRWSS as the local conditional variability of Y tends to vanish (i.e., as \(\sigma \rightarrow 0\)). This happens because if \(\sigma \rightarrow 0\), then the observed couples \(\left (x_{i},y_{i}\right )\), i = 1,…,n, tend to lie on one of the local regression lines and \(\text {RWSS}\rightarrow 0\); refer to formula (29).

6.2 Atypical Points and Departures from Conditional Local Normality

In the second illustrative example, artificial data are considered to evaluate the behavior of the decomposition in (33) with respect to the presence of atypical observations and departures from conditional normality of Y |X = x,G = g, g = 1,…,k (local conditional normality).

In regression analysis, atypical observations in Y |x represent model failure, and such observations are called outliers, while atypical observations with respect to X are called leverage points. There are two types of leverage points: good and bad. A bad leverage point is a regression outlier whose x value is also atypical among the values of X. A good leverage point is a point that is unusually large or small among the X values but is not a regression outlier, i.e., x is atypical but the corresponding y fits the model quite well. Such a point is called good because it improves the precision of the regression coefficients (Rousseeuw and Van Zomeren 1990, p. 635). Each point \((\boldsymbol{x}^{\prime}, y)^{\prime}\) can be considered as belonging to one of the four categories indicated in Table 1.

Table 1 Categorization for points in a regression analysis

As in Section 6.1, a single covariate X (d = 1) and two groups (k = 2) are taken into account. The t distribution—with mean \(\mu \in \mathbb {R}\), scale parameter τ > 0, and degrees of freedom ν > 2—is considered to introduce departures from normality and the possible presence of atypical observations. It is important to recall that the t-distribution approaches the normal distribution, with mean μ and standard deviation τ, as \(\nu \rightarrow \infty \) (see, e.g., Lange et al. 1989). The data generating process is a mixture of regressions where:

  • The weights are π1 = 0.3 and π2 = 0.7;

  • The values of X are generated by a t-distribution with mean μX = 0, scale parameter τX = 1, and νX degrees of freedom;

  • Two different t-distributions are adopted to generate the values of the dependent variable Y in the two groups;

  • The two regression lines have intercepts β10 = − 4 and β20 = 4, the same slope β11 = β21 = 1, the same conditional scale parameter τY = 1, and the same degrees of freedom νY.

The experimental conditions are \(\nu _{X}\in \left \{3,4,\infty \right \}\) and \(\nu _{Y} \in \left \{3,4,\infty \right \}\). Their combination gives rise to nine different scenarios. These scenarios cover all the types of data categorized in Table 1: typical data (with respect to MRFC, MRCV, and MRRC models) when \(\nu _{X}\rightarrow \infty \) and \(\nu _{Y}\rightarrow \infty \), good leverage points when \(\nu _{X}<\infty \), outliers when \(\nu _{Y} < \infty \), and bad leverage points when \(\nu _{X}<\infty \) and \(\nu _{Y} < \infty \).

One hundred datasets, each of size n = 1000, have been generated for each of the 9 scenarios. Figure 4 shows examples of generated data for each scenario. On each generated dataset, MRFC, MRCV, and MRRC models, all with k = 2 components, are fitted and the terms NBSS, NEWSS, and NRWSS are computed. Figure 5 displays the ternary diagrams of the obtained results for each model. Each of these diagrams contains 9 triplets \(\left (\text {NBSS},\text {NEWSS},\text {NRWSS}\right )\), averaged over the 100 replications, each related to a particular scenario. Points into the diagrams are denoted as \(\left (\nu _{X},\nu _{Y}\right )\), with \(\nu _{X},\nu _{Y}\in \left \{3,4,\infty \right \}\).

Fig. 4 Section 6.2. Examples of generated scatters as a function of \(\left (\nu _{X},\nu _{Y}\right )\)

Fig. 5 Section 6.2. Averages (over 100 replications) of the decomposition terms from the fitted models. Each point is represented by a pair \(\left (\nu _{X},\nu _{Y}\right )\), with \(\nu _{X},\nu _{Y}\in \left \{3,4,\infty \right \}\)

By comparing the three diagrams in Fig. 5, it is possible to note that the position of the points is essentially the same regardless of the considered model, with a slightly worse performance, in terms of NESS, for the MRCV model. Therefore, the following considerations apply to all three models. Taking \(\left (\infty ,\infty \right )\) as a reference scenario, it is possible to note that the points move upward in the ternary diagram (and, consequently, NESS decreases) as νY decreases. This means that, as expected, outliers worsen the NESS values. At the same time, it is also interesting to note that good leverage points make the NESS values slightly better; compare the position of the pairs \(\left (4,\infty \right )\) and \(\left (3,\infty \right )\) with that of \(\left (\infty ,\infty \right )\). Finally, pairs where both νX and νY are finite, i.e., scenarios including bad leverage points, are located closer to the NRWSS vertex, as expected.

7 Illustration on Tourism Data

This application focuses on n = 180 monthly observations concerning tourist overnight stays (X, in millions) and attendance at museums and monuments (Y, in millions) in Italy over the 15-year period from January 1996 to December 2010. These data, available at http://www.economia.unict.it/punzo/Data.htm, have been recently analyzed by Cellini and Cuccia (2013) and Ingrassia et al. (2014).

The scatter plot of the data is shown in Fig. 6; it gives strong evidence of both a group structure and a regression relationship of Y on X.

Fig. 6 Tourism data. Scatter plot

Motivated by this consideration, we fit MRFC, MRCV, and MRRC for \(k\in \left \{1,\ldots ,4\right \}\), resulting in 12 different models. For the MRRC, a normal distribution is assumed for X in each group (see, e.g., Punzo and Ingrassia 2016 and Dang et al. 2017).

When using mixtures of regressions, and mixture models in general, some objective criterion is needed to select the number of mixture components k for the data under consideration. The Bayesian information criterion (BIC; Schwarz 1978) is the most commonly used criterion for this purpose and is given by:

$$ \begin{array}{@{}rcl@{}} \text{BIC} = - 2 l(\hat{\boldsymbol{\psi}}) + m \ln \left( n\right), \end{array} $$

where \(l(\hat {\boldsymbol {\psi }})\) is the (maximized) observed-data log-likelihood and m is the number of free parameters; with this definition, smaller BIC values are preferred. Note that, while the likelihood for MRFC and MRCV is a product of conditional densities \(p\left (y_{i}|\boldsymbol {x}_{i}; \boldsymbol {\psi }\right )\), the likelihood for MRRC is a product of joint densities \(p\left (\boldsymbol {x}_{i},y_{i};\boldsymbol {\psi }\right )\); therefore, values of \(l\left (\hat {\boldsymbol {\psi }}\right )\) and BIC can be compared between MRFC and MRCV, but not with the MRRC. Operationally, this means the BIC can also be used to select between MRFC and MRCV. With respect to these latter models, it is finally important to underline that, given k, the MRFC in (1) can be thought of as nested in the MRCV in (2).
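In R, the BIC in this form is immediate to compute from the maximized log-likelihood (the numbers below are hypothetical and only illustrate the arithmetic); flexmix, for instance, also reports it directly via BIC(), as in the sketch of Section 6.

```r
## BIC = -2 * logLik + m * log(n); smaller values indicate a better fit/complexity trade-off
bic <- function(loglik, m, n) -2 * loglik + m * log(n)
bic(loglik = -350.2, m = 14, n = 180)   # hypothetical values for one fitted model
```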

Values of m, \(l(\hat {\boldsymbol {\psi }})\), and BIC for the fitted models are reported in Table 2. Bold numbers in Table 2(c) highlight the best BIC value among the fitted MRFC and MRCV models (whose likelihoods can be compared) and among the fitted MRRC models. The selected models are the MRCV and the MRRC with k = 4 components; they are represented in Fig. 7 in terms of regression lines and MAP classification of the observations; points are displayed as numbers denoting the MAP group membership.

Fig. 7 Tourism data. Scatter plots with regression lines and MAP classification of the observations from the models selected by the BIC

Table 2 Tourism data. Values of m, \(l(\hat {\boldsymbol {\psi }})\), and BIC for the fitted mixtures of regressions. Bold numbers in Table 2(c) highlight the BIC values of the selected models

The classifications from the two models are quite similar, with slight differences only in the composition of groups 2 and 4. For both models, it is interesting to note the good agreement between clusters and months (see Table 3). In detail, with reference to the MRCV, only one unit in February, concerning the year 2008, and three units in March, concerning the years 1996, 1998, and 1999, are assigned to a different group (see Table 3(a)). With reference to the MRRC, only two units in November, concerning the years 2006 and 2010, are assigned to a different group (see Table 3 and compare with Ingrassia et al. 2014, p. 170).

Table 3 Tourism data. Relation between the clusters from the selected MRCV and MRRC and the months

To determine how well separated the obtained clusters are on Y, and how well the selected models fit the data, it is useful to consider the measures introduced in Sections 4 and 5. Figure 8 shows the ternary diagram containing the triplets \(\left (\text {NBSS},\text {NEWSS},\text {NRWSS}\right )\) of the selected models. The displayed triplets are \(\left (0.712,0.119,0.169\right )\) for the MRCV and \(\left (0.660,0.180,0.160\right )\) for the MRRC. In terms of the proportion of the total variability of Y explained by the fitted model, as measured by NESS in (34), the MRRC (with NESS = 0.840) performs slightly better than the MRCV (with NESS = 0.831); this is visually confirmed by the MRRC point lying slightly further from the NRWSS vertex. Even if the two models have similar NESS values, they behave differently in terms of NBSS and NEWSS which, according to (34), are the components of NESS. In particular, for the MRCV, NBSS/NESS ⋅ 100 = 85.7% of the explained variability is due to the clustering on Y, as measured by NBSS. The analogous percentage for the MRRC is lower (78.6%). Indeed, the MRCV point in Fig. 8 lies closer to the vertex NBSS than the MRRC point.

Fig. 8 Tourism data. Ternary diagram of the triplets \(\left (\text {NBSS},\text {NEWSS},\text {NRWSS}\right )\) from the mixtures of regressions selected by the BIC

Given the clustering provided by the fitted model, i.e., given the values of \(\hat {z}_{ig}\), it is useful to refer to the local coefficients of determination introduced in Section 5.4 to evaluate how close the data are to the fitted regression lines. For the MRCV, the local coefficients of determination are \({R^{2}_{1}}=0.267\), \({R^{2}_{2}}=0.446\), \({R^{2}_{3}}=0.807\), and \({R^{2}_{4}}=0.023\). A good fit can be noted in the third group, where the regression line accounts for 80.7% of the local sum of squares SS3. The overall coefficient of determination is R2 = 0.412. The third group contributes to this value with weight SS3/WSS = 0.106; refer to (37). The other groups take part in the overall R2 with weights SS1/WSS = 0.084, SS2/WSS = 0.674, and SS4/WSS = 0.136. For the MRRC, the local coefficients of determination are \({R^{2}_{1}}=0.267\), \({R^{2}_{2}}=0.572\), \({R^{2}_{3}}=0.807\), and \({R^{2}_{4}}=0.020\). With respect to the MRCV, \({R^{2}_{1}}\) and \({R^{2}_{3}}\) are the same (see also Fig. 6), \({R^{2}_{4}}\) is slightly lower, and \({R^{2}_{2}}\) is considerably larger. The overall coefficient of determination is R2 = 0.530, considerably larger than the overall R2 for the MRCV. This improvement is due to the greater weight (0.764) associated with \({R^{2}_{2}}\). The other groups participate with weights SS1/WSS = 0.071, SS3/WSS = 0.090, and SS4/WSS = 0.075.

8 Conclusions and Discussion

When we use mixtures of regressions, the aim is twofold. First, as in the classical use of clustering/classification techniques, we want a method explaining the unobserved heterogeneity via the identification of homogeneous groups of observations. Second, as in the classical use of regression models, we hope that the inclusion of covariates explains more variation in the dependent variable. A mixture of regressions performs well if both these aspects are accounted for.

In this paper, for classical classes of mixtures of linear regressions, we proposed a three-term decomposition of the total sum of squares when the parameters are estimated with the expectation-maximization (EM) algorithm, within a maximum likelihood framework, under normally distributed errors in each mixture component. Based on this decomposition, we also introduced a measure of the explained within-group response variation (NEWSS), a measure of association between the response variable and the latent groups (NBSS), and an overall measure of explained variation (NESS) that combines NEWSS and NBSS. Moreover, we introduced local and overall coefficients of determination to further evaluate how well the model fits the data group by group and as a whole. The application to real data in Section 7 illustrated the use and the usefulness of our measures.

Finally, we remark that a natural extension of the ideas proposed herein would be the definition of “adjusted” local and overall coefficients of determination to be used, like the classical adjusted R2 for the standard linear regression model whose parameters are estimated by least squares, as comparative measures of the suitability of models with alternative nested/nonnested sets of covariates (de Amorim 2016). However, the groups are unknown and they change every time the model is estimated with a different set of covariates; this would make adjusted indexes meaningless in our context.