
1 Introduction

Regression is one of the most widely used tools for establishing the relationship between a set of predictor variables and a response variable. However, in many circumstances careful data preparation may not be possible, and the data may therefore be heavily contaminated by a substantial number of outliers. In these situations, the estimates of the parameters of the regression model obtained by the Maximum Likelihood criterion are fairly unstable.

The development of robust methods is underscored by the appearance of a large number of papers and books on the topic, including: Huber (1981), Rousseeuw and Leroy (1987), Staudte and Sheather (1990), Davies (1993), Dodge and Jurečková (2000), Seber and Lee (2003), Rousseeuw et al. (2004), Jurečková and Picek (2006), Maronna et al. (2006) and Fujisawa and Eguchi (2006).

The approach based on minimizing the Integrated Square Error is particularly helpful in those situations where, due to large sample size, careful data preparation is not feasible and hence data may contain a substantial number of outliers (Scott 2001). In this sense the \(L_2E\) criterion can be viewed as an efficient diagnostic tool in building useful models.

In this paper we suggest a regression analysis procedure whose first step consists of comparing the results arising from \(L_2\) estimates with those obtained from some common M-estimators. Afterwards, if a hypothesis test leads us to reject the conjecture of similarity between the estimated regression models, we investigate the data for the presence of clusters by analyzing the \(L_2\) minimizing function. The third step of the procedure consists of fitting a mixture of regression models via the \(L_2\) criterion.

Below, we introduce the Integrated Square Error minimizing criterion for regression models, define a new concept of similarity between functions and introduce a Monte Carlo Significance (M.C.S.) test. We also illustrate the whole procedure by means of some simulated examples involving simple linear regression models. Finally, we present an analysis of a case study concerning the evaluation of the risk of fire and the risk of electric shocks in electronic transformers.

2 Parametric Linear Regression Models and Robust Estimators

Let \(\{(x_{i1},\ldots,x_{\mathit{ip}},y_{i})\}_{i=1,\ldots,n}\) be the observed data set, where each observation stems from a random sample drawn from the \(p+1\) random variables \((X_{1},\ldots,X_{p},Y)\). The regression model under study is \(y_{i} = m_{\boldsymbol{\beta }}(\mathbf{x}_{i}) +\varepsilon _{i}\), with \(i = 1,\ldots,n\), where the object of our interest is the regression mean

$$\displaystyle{ m_{\boldsymbol{\beta }}(\mathbf{x}_{i}) = \mathbb{E}[Y \vert \mathbf{x}_{i}] =\beta _{0} +\sum _{ j=1}^{p}\beta _{ j}\mathbf{x}_{\mathit{ij}} }$$
(1)

and the errors \(\{\varepsilon _{i}\}_{i=1,\ldots,n}\) are assumed to be independent random variables with zero mean and unknown finite variances.

2.1 Huber M-Estimator

The presence of outliers is a problem for regression techniques; outliers may occur for many reasons. An extreme situation arises when the outliers are numerous and arise as a consequence of clustered data. For example, a large proportion of outliers may be found if there is an omitted unknown categorical variable (e.g. gender, species, geographical location, etc.) such that the data behave differently in each category. In parametric estimation, the estimators with good robustness properties relative to maximum likelihood are the M-estimators. The class of M-estimators of the vector \(\boldsymbol{\beta }\) is defined as (e.g., Hampel et al. 2005)

$$\displaystyle{ \hat{\boldsymbol{\beta }}_{M} =\mathop{ \text{arg min}}\limits_{\boldsymbol{\beta }}\sum _{i=1}^{n}\rho \left (y_{ i} - m_{\boldsymbol{\beta }}(\mathbf{x}_{i})\right ), }$$
(2)

where \(\rho: \mathbb{R} \rightarrow \mathbb{R}\) is an absolutely continuous convex function with derivative ψ.

If we assume that the random variables \(\varepsilon _{i}\) are independent and identically distributed as \(\varepsilon \sim \mathcal{N}(0,\sigma )\), the least-squares estimator gives the Maximum Likelihood Estimate (MLE) of the vector \(\boldsymbol{\beta }\), i.e.:

$$\displaystyle{ \hat{\boldsymbol{\beta }}_{\mathit{MLE}} =\mathop{ \text{arg min}}\limits_{\boldsymbol{\beta }}\sum _{i=1}^{n}{\left [y_{ i} - m_{\boldsymbol{\beta }}(\mathbf{x}_{i})\right ]}^{2}. }$$

Since MLEs are quite unstable in the presence of outliers, i.e., inefficient and biased, within the class of M-estimators we shall for our purposes resort to the robust Huber M-estimator (HME), for which

$$\displaystyle{ \rho (y_{i}-m_{\boldsymbol{\beta }}(\mathbf{x}_{i})) = \left \{\begin{array}{@{}l@{\quad }l@{}} \dfrac{1} {2}{(y_{i} - m_{\boldsymbol{\beta }}(\mathbf{x}_{i}))}^{2} \quad &\mbox{ if $\vert y_{i} - m_{\boldsymbol{\beta }}(\mathbf{x}_{i})\vert \leq k$}, \\ k\left (\vert y_{i} - m_{\boldsymbol{\beta }}(\mathbf{x}_{i})\vert -\dfrac{k} {2}\right )\quad &\mbox{ if $\vert y_{i} - m_{\boldsymbol{\beta }}(\mathbf{x}_{i})\vert > k$}, \end{array} \right. }$$

where the tuning constant \(k\) is generally set to \(1.345\,\sigma\).
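As an informal illustration, the following sketch (not the authors' code) shows how the Huber M-estimate of Eq. (2) could be computed in R for a simple linear model, assuming observed vectors x and y; the least-squares starting values and the preliminary MAD scale estimate are our own choices. In practice a standard implementation such as rlm in the MASS package (with psi = psi.huber, whose default tuning constant is 1.345) can be used instead.

```r
## Sketch of the Huber M-estimate of Eq. (2) for y_i = beta0 + beta1 * x_i + eps_i
huber_rho <- function(r, k) ifelse(abs(r) <= k, 0.5 * r^2, k * (abs(r) - k / 2))

huber_fit <- function(x, y, tune = 1.345) {
  ols   <- lm(y ~ x)                     # least-squares fit for starting values
  sigma <- mad(residuals(ols))           # robust preliminary scale estimate
  k     <- tune * sigma                  # tuning constant k = 1.345 * sigma
  obj   <- function(beta) sum(huber_rho(y - beta[1] - beta[2] * x, k))
  optim(coef(ols), obj)$par              # numerical minimisation of Eq. (2)
}
```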

2.2 \(\boldsymbol{L_{2}}\)-Based Estimator

We investigate estimation methods in parametric linear regression models based on the minimum Integrated Square Error and the minimum \(L_2\) metric. In the α-family of estimators proposed by Basu et al. (1998), the \(L_2\) estimator, briefly \(L_2E\), is the most robust to outliers, even though it is less efficient than the MLE.

Given the r.v. X, with unknown density \(f(x\vert \boldsymbol{\theta }_{0})\), for which we introduce the model \(f(x\vert \boldsymbol{\theta })\), the estimate of \(\boldsymbol{\theta }_{0}\) minimizing the \(L_2\) metric will be:

$$\displaystyle\begin{array}{rcl} \hat{\boldsymbol{\theta }}_{L_{2}E}& =& \mathop{\text{arg min}}\limits_{\boldsymbol{\theta }}\int _{\mathbb{R}}{\left [f(x\vert \boldsymbol{\theta }) - f(x\vert \boldsymbol{\theta }_{0})\right ]}^{2}\,dx = \\ & =& \mathop{\text{arg min}}\limits_{\boldsymbol{\theta }}\left [\int _{\mathbb{R}}{f}^{2}(x\vert \boldsymbol{\theta })\,dx - 2\,\mathbb{E}\left [f(x\vert \boldsymbol{\theta }_{ 0})\right ]\right ] = \\ & =& \mathop{\text{arg min}}\limits_{\boldsymbol{\theta }}\left [\int _{\mathbb{R}}{f}^{2}(x\vert \boldsymbol{\theta })\,dx - \frac{2} {n}\,\sum _{i=1}^{n}f(x_{ i}\vert \boldsymbol{\theta })\right ],{}\end{array}$$
(3)

where the so-called expected height of the density, \(\mathbb{E}\left [f(x\vert \boldsymbol{\theta }_{0})\right ]\), is replaced by its estimate \(\hat{\mathbb{E}}\left [f(x\vert \boldsymbol{\theta }_{0})\right ] = {n}^{-1}\sum _{i=1}^{n}f(x_{i}\vert \boldsymbol{\theta })\) and where (Basu et al. 1998),

$$\displaystyle\begin{array}{rcl} \int _{\mathbb{R}}{f}^{2}(x\vert \boldsymbol{\theta })\,dx& =& \dfrac{1} {n}\sum _{i=1}^{n}\int _{ \mathbb{R}}{f}^{2}(x_{ i}\vert \boldsymbol{\theta })\,dx_{i}\,.{}\end{array}$$
(4)

We now turn our attention to illustrating how estimates based on the \(L_2\) criterion can be applied to parametric regression models. Assuming that the random variables Y | x are distributed as a \(\mathcal{N}(m_{\boldsymbol{\beta }_{0}}(\mathbf{x}),\sigma _{0})\), i.e. \(f_{Y \vert \mathbf{x}}(y\vert \boldsymbol{\beta }_{0},\sigma _{0}) =\phi (y\vert m_{\boldsymbol{\beta }_{0}}(\mathbf{x}),\sigma _{0})\), the \(L_2\) estimates of the parameters in \(\boldsymbol{\beta }_{0}\) and \(\sigma_{0}\) are given by Eq. (3), which in this case becomes

$$\displaystyle\begin{array}{rcl} (\hat{\boldsymbol{\beta }},\hat{\sigma })_{L_{2}E}& =& \mathop{\text{arg min}}\limits_{\boldsymbol{\beta },\sigma }\left [\int {_{\mathbb{R}}\phi }^{2}(y\vert m_{\boldsymbol{\beta }}(\mathbf{x}),\sigma )\,\mathit{dy} - \frac{2} {n}\,\sum _{i=1}^{n}\phi (y_{ i}\vert m_{\boldsymbol{\beta }}(\mathbf{x}_{i}),\sigma )\right ] \\ & =& \mathop{\text{arg min}}\limits_{\boldsymbol{\beta },\sigma }\left [ \frac{1} {2\sigma \sqrt{\pi }}- \frac{2} {n}\sum _{i=1}^{n}\phi (y_{ i}\vert m_{\boldsymbol{\beta }}(\mathbf{x}_{i}),\sigma )\right ], {}\end{array}$$
(5)

since from Eq. (4)

$$\displaystyle\begin{array}{rcl} \int {_{\mathbb{R}}\phi }^{2}(y\vert m_{\boldsymbol{\beta }}(\mathbf{x}),\sigma )\,dy& =& \dfrac{1} {n}\sum _{i=1}^{n}\int {_{ \mathbb{R}}\phi }^{2}(y_{ i}\vert m_{\boldsymbol{\beta }}(\mathbf{x}_{i}),\sigma )\,dy_{i} = \dfrac{1} {2\,\sigma \,\sqrt{\pi }}\,. {}\\ \end{array}$$

Clearly Eq. (5) is a computationally feasible closed-form expression, so the \(L_2\) criterion can be minimized by any standard non-linear optimization routine, for example the nlm function in R. However, it is important to recall that, whatever the algorithm, convergence to the global optimum can depend strongly on the starting values.
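A minimal sketch of such a minimization, assuming data vectors x and y (variable names and starting values are illustrative only), could be:

```r
## L2E fit of Eq. (5) for the simple linear model m_beta(x) = beta0 + beta1 * x
l2e_obj <- function(theta, x, y) {
  mu    <- theta[1] + theta[2] * x
  sigma <- exp(theta[3])                         # log-scale keeps sigma > 0
  1 / (2 * sigma * sqrt(pi)) - 2 / length(y) * sum(dnorm(y, mu, sigma))
}
ols   <- lm(y ~ x)
start <- c(coef(ols), log(sd(residuals(ols))))   # least-squares starting values
fit   <- nlm(l2e_obj, start, x = x, y = y)
c(beta0 = fit$estimate[1], beta1 = fit$estimate[2], sigma = exp(fit$estimate[3]))
```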

3 The Similarity Index and the M.C.S. Test

To compare the performance of the \(L_2E\) with that of some other common estimators, we resort to an index of similarity between regression models introduced in Durio and Isaia (2010). Let \(T_0\) and \(T_1\) be two regression estimators and \(\hat{\boldsymbol{\beta }}_{T_{0}}\), \(\hat{\boldsymbol{\beta }}_{T_{1}}\) the corresponding vectors of estimated parameters. In order to measure the discrepancy between the two estimated regression models, the index of similarity compares the space region between \(\hat{m}_{T_{0}}(\mathbf{x})\) and \(\hat{m}_{T_{1}}(\mathbf{x})\) with the space region where all the data points lie. Introducing the sets:

$$\displaystyle\begin{array}{rcl} & &{ \boldsymbol{I}}^{p} = \left [\min (x_{ i1});\max (x_{i1})] \times \ldots \times [\min (x_{ip});\max (x_{ip})\right ], {}\\ & & \boldsymbol{I} = \left [\min (y_{i});\max (y_{i})\right ] = \left [a;b\right ], {}\\ \end{array}$$

we define the similarity index as

$$\displaystyle\begin{array}{rcl} & & sim(T_{0},T_{1})\mathop{ =}\limits^{ \mathit{def }}\dfrac{\int _{{\boldsymbol{D}}^{p+1}}\,d\mathbf{t}} {\int _{{\boldsymbol{C}}^{p+1}}\,d\mathbf{t}} \\ & &{ \boldsymbol{C}}^{p+1} ={ \boldsymbol{I}}^{p} \times \boldsymbol{I} \\ & &{ \boldsymbol{D}}^{p+1} = \left \{(\mathbf{x},y) \in {\mathbb{R}}^{p+1}:\zeta (\mathbf{x}) \leq y \leq \xi (\mathbf{x}),\mathbf{x} \in {I}^{p}\right \} \cap {\boldsymbol{C}}^{p+1}{}\end{array}$$
(6)

with \(\zeta (\mathbf{x}) = \mathit{min}\left (\hat{m}_{T_{0}}(\mathbf{x}),\hat{m}_{T_{1}}(\mathbf{x})\right )\) and \(\xi (\mathbf{x}) = \mathit{max}\left (\hat{m}_{T_{0}}(\mathbf{x}),\hat{m}_{T_{1}}(\mathbf{x})\right )\).

Fig. 1 Data points and two estimated regression models \(\hat{m}_{T_{0}}(x)\) and \(\hat{m}_{T_{1}}(x)\). In panel (b), the domains \({\boldsymbol{D}}^{p+1}\) and \({\boldsymbol{C}}^{p+1}\) upon which the \(sim(T_{0},T_{1})\) statistic is computed

Figure 1 shows how the similarity index given by Eq. (6) can be computed in the simple case where p = 1. In panel (a) we have the cloud of data points and the two estimated models \(\hat{m}_{T_{0}}(x)\) and \(\hat{m}_{T_{1}}(x)\). The shaded area of panel (b) corresponds to \(\int _{{\boldsymbol{D}}^{p+1}}\,d\mathbf{t}\), while the integral \(\int _{{\boldsymbol{C}}^{p+1}}\,d\mathbf{t}\) is given by the area of the dotted rectangle in which the data points lie.

In order to compute the integrals of Eq. (6), we employ the fast and accurate algorithm proposed by Durio and Isaia (2010).
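That algorithm is not reproduced here; as a rough substitute, the ratio of Eq. (6) can be approximated for p = 1 by simple Monte Carlo sampling over the enclosing rectangle, as in the following sketch (beta_T0 and beta_T1 are assumed to be the intercept–slope vectors of the two fitted lines):

```r
## Monte Carlo approximation of sim(T0, T1) of Eq. (6) for p = 1
sim_index <- function(x, y, beta_T0, beta_T1, n_mc = 1e5) {
  u <- runif(n_mc, min(x), max(x))          # uniform points over I^p
  v <- runif(n_mc, min(y), max(y))          # uniform points over I
  m0 <- beta_T0[1] + beta_T0[2] * u         # first fitted regression line
  m1 <- beta_T1[1] + beta_T1[2] * u         # second fitted regression line
  inside <- v >= pmin(m0, m1) & v <= pmax(m0, m1)   # points falling in D
  mean(inside)                              # |D| / |C|
}
```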

If the vectors \(\hat{\boldsymbol{\beta }}_{T_{0}}\) and \(\hat{\boldsymbol{\beta }}_{T_{1}}\) are close to each other, then \(\mathit{sim}(T_{0},T_{1})\) will be close to zero. On the other hand, if the estimated regression models \(\hat{m}_{T_{0}}(\mathbf{x})\) and \(\hat{m}_{T_{1}}(\mathbf{x})\) are dissimilar, we are likely to observe a value of \(\mathit{sim}(T_{0},T_{1})\) far from zero. We therefore propose to use the \(\mathit{sim}(T_{0},T_{1})\) statistic to verify the following system of hypotheses

$$\displaystyle{ \left \{\begin{array}{@{}l@{\quad }l@{}} H_{0}: \boldsymbol{\beta }_{0} =\hat{\boldsymbol{\beta }} _{T_{0}}\quad \\ H_{1}: \boldsymbol{\beta }_{0}\neq \hat{\boldsymbol{\beta }}_{T_{0}} \quad \end{array} \right. }$$
(7)

Since it is not reasonable to look for an exact form of the \(sim(T_{0},T_{1})\) distribution, in order to check the above system of hypotheses we utilise a simplified M.C.S. test originally suggested by Barnard (1963) and later developed by Hope (1968).

Let \(\mathit{sim}_{T_{0}T_{1}}\) denote the value of the \(\mathit{sim}(T_{0},T_{1})\) statistic computed on the observed data. The simplified M.C.S. test consists of rejecting \(H_{0}\) if \(\mathit{sim}_{T_{0}T_{1}}\) ranks among the mα most extreme statistics relative to the corresponding quantities computed on the random samples of the reference set, where the reference set consists of m − 1 random samples, of size n each, generated under the null hypothesis, i.e., drawn at random from the model \(\hat{m}_{T_{0}}(\mathbf{x})\) with \(\sigma =\hat{\sigma } _{T_{0}}\). In other words, we generate m − 1 random samples under \(H_{0}\), for each of them we compute \(\mathit{sim}_{T_{0}T_{1}}^{{\ast}}\), and we reject the null hypothesis, at the α significance level, if and only if the value of the test statistic \(\mathit{sim}_{T_{0}T_{1}}\) is greater than all the m − 1 values of \(\mathit{sim}_{T_{0}T_{1}}^{{\ast}}\). We remark that if we set mα = 1 and fix α = 0.01, we have \(m - 1 = 99\) (while fixing α = 0.05 would yield \(m - 1 = 19\)).
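A compact sketch of this procedure for simple linear regression, assuming fitting functions fit_T0 and fit_T1 that return (intercept, slope, sigma), together with the sim_index helper above, might read:

```r
## Simplified M.C.S. test: reject H0 iff the observed sim exceeds all the
## m - 1 values simulated under the null model fitted by T0.
mcs_test <- function(x, y, fit_T0, fit_T1, m = 100) {
  b0 <- fit_T0(x, y)                 # assumed to return c(beta0, beta1, sigma)
  b1 <- fit_T1(x, y)
  sim_obs  <- sim_index(x, y, b0[1:2], b1[1:2])
  sim_star <- replicate(m - 1, {
    y_star <- b0[1] + b0[2] * x + rnorm(length(x), sd = b0[3])   # data under H0
    sim_index(x, y_star, fit_T0(x, y_star)[1:2], fit_T1(x, y_star)[1:2])
  })
  list(sim_obs = sim_obs, reject_H0 = sim_obs > max(sim_star))
}
```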

4 Simple Linear Regression and Examples

Since for our case study we shall consider the simple linear regression model \(y_{i} =\beta _{0} +\beta _{1}\,x_{i} +\varepsilon _{i}\), the \(L_2\) criterion of Eq. (5) reduces to the following computationally feasible closed-form expression

$$\displaystyle{ (\hat{\boldsymbol{\beta }},\hat{\sigma })_{L_{2}E} =\mathop{ \text{arg min}}\limits_{\boldsymbol{\beta },\sigma }\left [ \frac{1} {2\sigma \sqrt{\pi }}- \frac{2} {n}\sum _{i=1}^{n}\phi (y_{ i}\vert \beta _{0} +\beta _{1}\,x_{i},\sigma )\right ]. }$$
(8)

In the following we introduce two simulated examples in order to demonstrate the behaviour of the \(L_2\) criterion in the presence of outliers and in the presence of clustered data. To evaluate its performance, we shall use the Maximum Likelihood estimator and the robust Huber M-estimator. Given \(T_1 = L_2E\), we shall perform the M.C.S. test twice: first, fixing \(T_0 = \mathit{MLE}\), for \(\mathit{sim}(\mathit{MLE}, L_2E)\); second, fixing \(T_0 = \mathit{HME}\), for \(\mathit{sim}(\mathit{HME}, L_2E)\). We remark that, as p = 1, in both situations we have \({\boldsymbol{I}}^{p} = \left [\min (x_{i});\max (x_{i})\right ]\) and that clearly the integrals of Eq. (6) are defined on bi-dimensional domains.

Example I.

Let us consider a simulated dataset of n = 200 points generated according to the model \(Y = X+\varepsilon\), where \(X \sim \mathcal{U}(0,10)\) and \(\varepsilon \sim \mathcal{N}(0,0.8)\). We then introduce m = 10 (respectively, 30) points according to the model \(Y = -3 + X+\varepsilon\), where \(X \sim \mathcal{U}(8,10)\) and \(\varepsilon \sim \mathcal{N}(0,0.4)\), so that they can be considered as outliers. Resorting to the estimators ML, HM and \(L_2\), we obtain the estimates of the parameters \(\beta_0\), \(\beta_1\) and σ listed in Table 1 (also see Fig. 2).
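A sketch of how such data could be generated in R (the seed is arbitrary and only for reproducibility), reusing the vector names x and y assumed in the earlier sketches:

```r
## Simulated data of Example I: main cloud plus m outlying points
set.seed(1)
n <- 200; m <- 10                                             # use m <- 30 for the second scenario
x  <- runif(n, 0, 10);  y  <- x + rnorm(n, sd = 0.8)          # Y = X + eps
xo <- runif(m, 8, 10);  yo <- -3 + xo + rnorm(m, sd = 0.4)    # outlying cluster
x <- c(x, xo); y <- c(y, yo)
```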

Table 1 Results of simulated Example I
Fig. 2 Data points of Example I and estimated models \(\hat{m}_{ML}(x)\), \(\hat{m}_{HM}(x)\) and \(\hat{m}_{L2}(x)\). In panel (a) we set m = 10 outliers, while in panel (b) m = 30

Applying the M.C.S. test, with α = 0.01, to the estimated models \(\hat{m}_{\mathit{ML}}(x)\) and \(\hat{m}_{L2}(x)\), we reject the null hypothesis of system (7), as we have \(\mathit{sim}_{ML,L2} = 0.0203 >\max (\mathit{sim}_{\mathit{ML},L2}^{{\ast}}) = 0.0128\). Turning our attention to the models \(\hat{m}_{\mathit{HM}}(x)\) and \(\hat{m}_{L2}(x)\), the M.C.S. test leads us not to reject the null hypothesis, since \(\mathit{sim}_{\mathit{HM},L2} = 0.0091 <\max (\mathit{sim}_{\mathit{HM},L2}^{{\ast}}) = 0.0123\).

When we add m = 30 outliers to the sample data, the M.C.S. tests lead us to a different conclusion. In both situations we reject the null hypothesis of system (7), as we have

$$\displaystyle\begin{array}{rcl} & & \mathit{sim}_{\mathit{ML},L2} = 0.0364 >\max (\mathit{sim}_{\mathit{ML},L2}^{{\ast}}) = 0.0159 {}\\ & & \mathit{sim}_{\mathit{HM},L2} = 0.0289 >\max (\mathit{sim}_{\mathit{HM},L2}^{{\ast}}) = 0.0103 {}\\ \end{array}$$

When the outliers are few, the estimated regression models \(\hat{m}_{\mathit{HM}}(x)\) and \(\hat{m}_{L2}(x)\) do not differ significantly. This is not the case when the number of outliers increases; in this sense it seems that the \(L_2\) estimator can be helpful in cluster detection.

Table 2 Results of simulated Example II
Fig. 3 (Panel a) Data points of Example II and estimated models \(\hat{m}_{\mathit{ML}}(x)\), \(\hat{m}_{\mathit{HM}}(x)\) and \(\hat{m}_{L2}(x)\). (Panel b) Contour plot of function \(g(\boldsymbol{\beta }{\vert \sigma }^{{\ast}})\) of Eq. (9) evaluated at \({\sigma }^{{\ast}} = 0.5\,\hat{\sigma }_{L_{2}E}\)

Example II.

Let us consider a dataset of n = 300 points, 200 of which arise from the model \(Y = 1 + 0.8\,X +\varepsilon _{1}\) while the remaining 100 arise from the model \(Y = 5 - 0.2\,X +\varepsilon _{2}\), where \(\varepsilon _{1} \sim \mathcal{N}(0,1)\), \(\varepsilon _{2} \sim \mathcal{N}(0,0.5)\) and \(X \sim \mathcal{U}(1,10)\). Again, resorting to the estimators ML, HM and \(L_2\), we obtain the estimates of the parameters \(\beta_0\), \(\beta_1\) and σ listed in Table 2 (also see Fig. 3, panel a). Considering the models \(\hat{m}_{\mathit{ML}}(x)\) and \(\hat{m}_{L2}(x)\), the M.C.S. test, with α = 0.01, indicates that they can be considered dissimilar, as we observe \(\mathit{sim}_{\mathit{ML},L2} = 0.0582 >\max (\mathit{sim}_{\mathit{ML},L2}^{{\ast}}) = 0.0210\). This is still true if we consider the estimated models \(\hat{m}_{\mathit{HM}}(x)\) and \(\hat{m}_{L2}(x)\); in fact, from the M.C.S. test we have \(\mathit{sim}_{\mathit{HM},L2} = 0.0451 >\max (\mathit{sim}_{\mathit{HM},L2}^{{\ast}}) = 0.0156\). Also in this situation the \(L_2\) estimator seems to be helpful in detecting clusters of data when compared with the Maximum Likelihood and the Huber M-estimators.

5 Mixture of Regression Models via \(\boldsymbol{L_{2}}\)

It seems to the authors that the robustness properties of \(L_2\) estimates, as outlined above, can be helpful in pointing out the presence of clusters in the data; see, e.g., Durio and Isaia (2007).

This is in the sense that whenever the sample data belong to two (or more) clusters, \(\hat{m}_{L2}(\mathbf{x})\) will always tend to fit the cluster containing the largest number of data points, and hence large discrepancies between \(\hat{m}_{\mathit{ML}}(\mathbf{x})\) and \(\hat{m}_{L2}(\mathbf{x})\) are likely to be observed, as illustrated by the previous examples. Investigating the function of Eq. (5) more closely for a fixed value of σ, it can be seen that in all situations where the sample data are clustered it can show more than one local minimum. A simple way forward is to investigate the behaviour of the function

$$\displaystyle{ g(\boldsymbol{\beta }{\vert \sigma }^{{\ast}}) = \frac{1} {2{\sigma }^{{\ast}}\sqrt{\pi }} - \frac{2} {n}\sum _{i=1}^{n}\phi (y_{ i}\vert m_{\boldsymbol{\beta }}(\mathbf{x}_{i}){,\sigma }^{{\ast}}) }$$
(9)

for different values of \({\sigma }^{{\ast}}\) in its parameter space, for instance the interval \(\left ]0,2 \cdot \hat{\sigma }_{L_{2}E}\right ]\). In fact, whenever the sample data are clustered, the function \(g(\boldsymbol{\beta }{\vert \sigma }^{{\ast}})\) given by Eq. (9) shows one global minimum and one or more local minima.
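A sketch of how \(g(\boldsymbol{\beta}\vert\sigma^{*})\) could be scanned over a grid of \((\beta_0,\beta_1)\) values to reveal such minima, e.g. for contour plots like those of Figs. 3 and 5, assuming data x, y, an \(L_2E\) scale estimate sigma_hat and illustrative grid ranges:

```r
## Grid evaluation of g(beta | sigma*) of Eq. (9) for a simple linear model
g_fun <- function(beta0, beta1, sigma_star, x, y)
  1 / (2 * sigma_star * sqrt(pi)) -
    2 / length(y) * sum(dnorm(y, beta0 + beta1 * x, sigma_star))

sigma_star <- 0.5 * sigma_hat                    # as in the examples below
b0 <- seq(-2, 8, length.out = 80)                # assumed grid for beta0
b1 <- seq(-1, 2, length.out = 80)                # assumed grid for beta1
z  <- outer(b0, b1, Vectorize(function(a, b) g_fun(a, b, sigma_star, x, y)))
contour(b0, b1, z, xlab = expression(beta[0]), ylab = expression(beta[1]))
```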

Whenever the presence of clusters in the data is detected by the \(L_2\) criterion, we can use the \(L_2\) estimator under the assumption that the model that best fits the data is a mixture of K ≥ 2 regression models. Assuming that each data point \((\mathbf{x}_{i},y_{i})\) comes from the k-th regression model \(y_{i} = m_{\boldsymbol{\beta }_{k}}(\mathbf{x}_{i}) +\varepsilon _{ik}\) with probability \(p_{k}\), we suppose that the random variables Y | x are distributed as a mixture of K Gaussian random variables, i.e.,

$$\displaystyle{ f_{Y \vert \mathbf{x}}(y\vert \boldsymbol{\theta }_{0}) =\sum _{ k=1}^{K}p_{ k}^{0}\,\phi (y\vert m_{\boldsymbol{\beta }_{ k}^{0}}(\mathbf{x}),\sigma _{k}^{0}). }$$
(10)

We are now able to derive the following closed-form expression for the estimates of \(\boldsymbol{\theta }_{0} = [{\boldsymbol{p}}^{0},{\boldsymbol{\beta }}^{0}{,\boldsymbol{\sigma }}^{0}]\); in fact, according to Eq. (3) and recalling Eq. (4), we have

$$\displaystyle\begin{array}{rcl} \hat{\boldsymbol{\theta }}_{L_{2}E}& =& \mathop{\text{arg min}}\limits_{\boldsymbol{p},\boldsymbol{\beta },\boldsymbol{\sigma }}\left [ \frac{1} {n}\sum _{i=1}^{n}\sum _{ j=1}^{K}\sum _{ h=1}^{K}p_{ j}\,p_{h}\,\phi (0\vert m_{\boldsymbol{\beta }_{j}}(\mathbf{x}_{i}) - m_{\boldsymbol{\beta }_{h}}(\mathbf{x}_{i}),\sigma _{j}^{2} +\sigma _{ h}^{2})-\right. \\ & & -\left.\frac{2} {n}\sum _{i=1}^{n}\sum _{ k=1}^{K}p_{ k}\,\phi (y_{i}\vert m_{\boldsymbol{\beta }_{k}}(\mathbf{x}_{i}),\sigma _{k}^{2})\right ]. {}\end{array}$$
(11)

Solving Eq. (11) we obtain the estimates of the vector of the weights, i.e. \(\hat{\boldsymbol{p}} = {[p_{1},\ldots,p_{K}]}^{T}\), the vector of the parameters, i.e. \(\hat{\boldsymbol{\beta }}= {[\beta _{0_{1}},\ldots,\beta _{d_{1}},\ldots,\beta _{0_{K}},\ldots,\beta _{d_{K}}]}^{T}\) and the vector of the standard deviations of the error of each component of the mixture, i.e. \(\hat{\boldsymbol{\sigma }}= {[\sigma _{1},\ldots,\sigma _{K}]}^{T}\).

Example II (continued).

Referring to the situation of Example II, for which \(\hat{\sigma }_{L_{2}} = 1.1633\), the contour plot of the function \(g(\boldsymbol{\beta }{\vert \sigma }^{{\ast}})\) of Eq. (9), evaluated at \({\sigma }^{{\ast}} = 0.5\,\hat{\sigma }_{L_{2}E}\) and displayed in Fig. 3, panel b, shows the existence of one global minimum corresponding to the estimates of the parameters of the model \(Y = 1 + 0.8\,X +\varepsilon _{1}\) and one local minimum close to the values of the parameters of the model \(Y = 5 - 0.2\,X +\varepsilon _{2}\). We therefore consider a mixture of K = 2 simple linear regression models. Since in this situation Eq. (10) becomes

$$\displaystyle{ f_{Y \vert \mathbf{x}}(y\vert \boldsymbol{\theta }_{0}) = p_{1}^{0}\,\phi (y\vert \beta _{ 0_{1}}^{0} +\beta _{ 1_{1}}^{0}\,x,\sigma _{ 1}^{0}) + p_{ 2}^{0}\,\phi (y\vert \beta _{ 0_{2}}^{0} +\beta _{ 1_{2}}^{0}\,x,\sigma _{ 2}^{0}), }$$

the \(L_2\) estimates of the vector \(\boldsymbol{\theta }_{0}\), according to Eq. (11), will be given by solving

$$\displaystyle\begin{array}{rcl} & & \hat{\boldsymbol{\theta }}_{L_{2}E}\,=\,\mathop{\text{arg min}}\limits_{\boldsymbol{p},\boldsymbol{\beta },\boldsymbol{\sigma }}\left [\frac{p_{1}^{2}\sigma _{2}\,+\,p_{2}^{2}\sigma _{1}} {2\sigma _{1}\sigma _{2}\sqrt{\pi }} \,+\, \frac{2} {n}\sum _{i=1}^{n}p_{ 1}p_{2}\phi (0\vert \beta _{0_{1}}\,+\,\beta _{1_{1}}x_{i}\,-\,\beta _{0_{2}}\,-\,\beta _{1_{2}}x_{i},\sigma _{1}^{2} +\sigma _{ 2}^{2})-\right. \\ & & -\left.\frac{2} {n}\sum _{i=1}^{n}\left (p_{ 1}\,\phi (y_{i}\vert \beta _{0_{1}} +\beta _{1_{1}}\,x_{i},\sigma _{1}^{2}) + p_{ 2}\,\phi (y_{i}\vert \beta _{0_{2}} +\beta _{1_{2}}\,x_{i},\sigma _{2}^{2})\right )\right ]. {}\end{array}$$
(12)
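A minimal sketch (not the authors' code) of how Eq. (12) could be minimized numerically is given below; the parameters are reparametrized to keep the constraints implicit, the starting values are illustrative and placed near the two minima of \(g(\boldsymbol{\beta}\vert\sigma^{*})\) (as noted earlier, convergence depends strongly on them), and, since dnorm is parametrized by the standard deviation, the term \(\phi(\cdot\vert\cdot,\sigma_1^2+\sigma_2^2)\) of Eq. (12) enters through sqrt(s1^2 + s2^2).

```r
## L2E objective of Eq. (12) for a two-component mixture of simple linear models;
## theta = (logit p1, beta01, beta11, beta02, beta12, log sigma1, log sigma2)
l2e_mix2 <- function(theta, x, y) {
  p1 <- plogis(theta[1]); p2 <- 1 - p1
  m1 <- theta[2] + theta[3] * x;  m2 <- theta[4] + theta[5] * x
  s1 <- exp(theta[6]);            s2 <- exp(theta[7])
  (p1^2 * s2 + p2^2 * s1) / (2 * s1 * s2 * sqrt(pi)) +
    2 / length(y) * sum(p1 * p2 * dnorm(0, m1 - m2, sqrt(s1^2 + s2^2))) -
    2 / length(y) * sum(p1 * dnorm(y, m1, s1) + p2 * dnorm(y, m2, s2))
}
start <- c(0, 1, 0.8, 5, -0.2, log(1), log(0.5))        # assumed starting values
fit   <- optim(start, l2e_mix2, x = x, y = y, method = "BFGS")
```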

From numerical minimization of Eq. (12), we obtain (see Fig. 4, panel a) the following estimates of the eight parameters of the mixture

$$\displaystyle{\begin{array}{l@{\quad }l@{\quad }l@{\quad }l@{\quad }l } L_{2}E\ \text{Model}\_1:\quad &\hat{p}_{1} = 0.646\quad &\hat{\beta }_{0_{1}} = 1.0281\quad &\hat{\beta }_{1_{1}} = 0.8109 \quad &\hat{\sigma }_{1} = 0.8411 \\ L_{2}E\ \text{Model}\_2:\quad &\hat{p}_{2} = 0.354\quad &\hat{\beta }_{0_{2}} = 4.8267\quad &\hat{\beta }_{1_{2}} = -0.0576\quad &\hat{\sigma }_{2} = 0.5854\end{array} }$$

which are quite close to the true values of the parameters.

Fig. 4 (Panel a) Data points and estimated components of the mixture of two simple regression models via \(L_2\). (Panel b) Data points assignment according to the "quick classification rule" with γ = 3

From a practical point of view, it would be interesting to identify which data points belong to each component of the mixture; to this end we resort to a quick classification rule based on the assumption that the density of the errors follows a Normal distribution, i.e. \(\forall \,i = 1,\ldots,n\)

$$\displaystyle\begin{array}{rcl} & & \text{if }\vert \hat{\varepsilon }_{i_{1}}\vert \leqq \gamma \,\hat{\sigma }_{1} \wedge \vert \hat{\varepsilon }_{i_{2}}\vert >\gamma \,\hat{\sigma } _{2}\quad \rightarrow (x_{i},y_{i}) \in \text{ Model }\;L_{2}E - I \\ & & \text{if }\vert \hat{\varepsilon }_{i_{1}}\vert >\gamma \,\hat{\sigma } _{1} \wedge \vert \hat{\varepsilon }_{i_{2}}\vert \leqq \gamma \,\hat{\sigma }_{2}\quad \rightarrow (x_{i},y_{i}) \in \text{ Model }\;L_{2}E - II \\ & & \text{if }\vert \hat{\varepsilon }_{i_{1}}\vert \leqq \gamma \,\hat{\sigma }_{1} \wedge \vert \hat{\varepsilon }_{i_{2}}\vert \leqq \gamma \,\hat{\sigma }_{2}\quad \rightarrow (x_{i},y_{i}) \in \text{ Unknown model} \\ & & \text{if }\vert \hat{\varepsilon }_{i_{1}}\vert >\gamma \,\hat{\sigma } _{1} \wedge \vert \hat{\varepsilon }_{i_{2}}\vert >\gamma \,\hat{\sigma } _{2}\quad \rightarrow (x_{i},y_{i}) \in \text{ Outlier} {}\end{array}$$
(13)

where γ is an appropriate quantile of a \(\mathcal{N}(0,1)\).
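A sketch of this rule in R, assuming the mixture estimates b1, b2 (intercept and slope of each component) and s1, s2 obtained above:

```r
## Quick classification rule of Eq. (13) with default gamma = 3
classify <- function(x, y, b1, b2, s1, s2, gamma = 3) {
  e1 <- abs(y - b1[1] - b1[2] * x)          # residuals from component 1
  e2 <- abs(y - b2[1] - b2[2] * x)          # residuals from component 2
  in1 <- e1 <= gamma * s1
  in2 <- e2 <= gamma * s2
  ifelse(in1 & !in2, "Model_1",
  ifelse(!in1 & in2, "Model_2",
  ifelse(in1 & in2,  "Unknown", "Outlier")))
}
```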

Fixing γ = 3, if we apply the quick rule and drop the two points classified as outliers, we obtain (see Fig. 4, panel b) the classification reported in Table 3. Clearly, the high percentage of unassigned points (52.7 %) is due to the specific structure of the two clusters, which overlap considerably.

Table 3 Classification I

6 The Case Study

A firm operating in the field of diagnosis and decontamination of electronic transformer fluids assesses the risks of fluid degradation, electric shocks, fire or explosion, PCB contamination, decomposition of cellulosic insulation, etc. With the aid of well-known models and relying on the results of chemical analyses, the firm's staff estimate the value of each risk on a continuous scale.

In order to determine whether their methods of assigning risk values are independent of specific characteristics of the transformers (age, voltage, fluid mass, etc.), we conducted an analysis based on a database of 1,215 diagnosis records containing oil chemical analyses, technical characteristics and risk values.

Taking into account the risk of fire (Y ) and the risk of electric shocks (X), it was natural to suppose a linear dependence between the two variables, i.e., we considered the simple regression model with \(m_{\boldsymbol{\beta }}(x_{i}) =\beta _{0} +\beta _{1}x_{i}\).

Resorting to the estimators ML, HM and \(L_2\), we obtained the estimates of the parameters \(\beta_0\), \(\beta_1\) and σ listed in Table 4.

Although the estimates of the vector of parameters \(\boldsymbol{\beta }\) are quite close, the three corresponding estimated models differ to some extent; see, e.g., Fig. 5, panel a.

Computing the values of the sim() statistics, the M.C.S. test led us to the conclusion that the \(L_2\) estimated model can be considered dissimilar from both the \(\hat{m}_{\mathit{ML}}(x)\) and \(\hat{m}_{\mathit{HM}}(x)\) models, as

$$\displaystyle\begin{array}{rcl} & & \mathit{sim}_{\mathit{ML},L2} = 0.0220 >\max (\mathit{sim}_{\mathit{ML},L2}^{{\ast}}) = 0.0051 {}\\ & & \mathit{sim}_{\mathit{HM},L2} = 0.0203 >\max (\mathit{sim}_{\mathit{HM},L2}^{{\ast}}) = 0.0031 {}\\ \end{array}$$

Probing more deeply, we found that the function \(g(\boldsymbol{\beta }{\vert \sigma }^{{\ast}})\) of Eq. (9) exhibits two minima for \({\sigma }^{{\ast}} = 0.5\,\hat{\sigma }_{L_{2}E} = 0.0755\), as shown in Fig. 5, panel b.

Therefore we decided to model our data by means of a mixture of two simple regression models. Considering the \(L_2\) criterion and solving Eq. (12), we found that about 57 % \((=\hat{p}_{1})\) of the data points follow the model

$$\displaystyle{ \hat{m}_{\boldsymbol{\beta }_{1}}(x) = -0.4042 + 1.7705\,x\qquad \rightarrow \qquad L_{2}E\text{ Model}\_1 }$$

for which \(\hat{\sigma }_{1} = 0.0547\), while the remaining 43 % \((=\hat{p}_{2})\) of the data points follow the model

$$\displaystyle{ \hat{m}_{\boldsymbol{\beta }_{2}}(x) = -0.3955 + 1.5847\,x\qquad \rightarrow \qquad L_{2}E\text{ Model}\_2 }$$

for which \(\hat{\sigma }_{2} = 0.0775\). Panel a of Fig. 6 shows the two estimated models.

Table 4 Estimates of the parameters obtained resorting to the estimators ML, HM and \(L_2\)
Fig. 5 Case study. (Panel a) Data points and estimated models \(\hat{m}_{\mathit{ML}}(x)\), \(\hat{m}_{\mathit{HM}}(x)\) and \(\hat{m}_{L2}(x)\). (Panel b) Contour plot of function \(g(\boldsymbol{\beta }{\vert \sigma }^{{\ast}})\) of Eq. (9) evaluated at \({\sigma }^{{\ast}} = 0.5\,\hat{\sigma }_{L_{2}E}\), with \(\hat{\sigma }_{L_{2}E} = 0.151\)

Applying the quick rule, we were able to classify the data according to whether they followed the first or the second regression model. From the \(L_2\) estimates of \(\boldsymbol{\hat{p}}\) and the quick rule (dropping two points that were classified as outliers), we obtained the classification reported in Table 5.

In order to classify the 266 (= 22.0 %) points belonging, according to the quick rule, to the Unknown model, we had to investigate more deeply the specific characteristics of the transformers themselves.

Examining our database, we found that 40 % of the transformers have a fluid mass ≦ 500 kg; the \(L_2\) criterion gave an estimate of 43 % for the weight of the points belonging to \(L_2E\) Model_2, while our quick rule assigned 36.9 % of the data points to \(L_2E\) Model_2.

Furthermore, 419 out of the 448 points (93.5 %) that our quick classification rule assigns to \(L_2E\) Model_2 have a fluid mass less than or equal to 500 kg, while all the 499 transformers imputed to \(L_2E\) Model_1 have a fluid mass greater than 500 kg; see Table 6.

Fig. 6 Case study. (Panel a) Data points and estimated models \(\hat{m}_{\boldsymbol{\beta }_{1}}(x)\) and \(\hat{m}_{\boldsymbol{\beta }_{2}}(x)\). (Panel b) Final data points assignment according to the fluid mass of the electrical transformers

Table 5 Classification II
Table 6 Fluid mass of the model

In light of the above, we decided to use the fluid mass as the clustering variable: the transformers with a fluid mass less than or equal to 500 kg were assigned to \(L_2E\) Model_2, while the transformers with a fluid mass greater than 500 kg were assigned to the \(L_2E\) Model_1 regression line. The final assignment is shown in Fig. 6, panel b.

These results allowed us to state that, at a fixed level of risk of electric shocks, the risk of fire was evaluated in a different way for the two groups of transformers, i.e., the relationship between the two variables depended on the fluid mass of the transformers.

However, the firm's chemical staff could not find any scientific reason to explain the different risks of fire in the two types of transformers, so they decided to change the model in use by assigning different weights to the hydrocarbon variable in order to better reflect the differential risks of fire.