1 Introduction

Generally, a regression model has two main purposes (Saville and Wood 1991):

  a. it could be used to determine to what extent the dependent variable can be predicted by the independent variables;

  b. it could be used to capture the strength of a theoretical relationship between the dependent variable and the independent variables.

The first use of the term “multicollinearity” in the literature was by Frisch (1934). We refer to multicollinearity when there is a strong linear dependence between two or more regressors (Leahy 2000).

Multicollinearity is not simply present or absent: what matters is its degree, or severity, which ranges from no collinearity to perfect collinearity. Some degree of multicollinearity is almost always present in real data. Perfect multicollinearity is an extreme and uncommon case, occurring when two or more independent variables in a regression model exhibit an exact linear relationship. In that case, the regression coefficients are indeterminate and their standard errors are infinite.

Multicollinearity may lead to several issues, summarized in the following points (Belsley et al. 1980; Shieh 2011; Montgomery et al. 2012):

  • Estimation issues: considering the ordinary least squares (OLS) method, the regression coefficients’ estimates are given by:

    $$\hat{\beta }_{OLS} = (X^{{\prime }} X)^{ - 1} X^{{\prime }} Y$$

    In the presence of a high degree of multicollinearity, the X′X matrix is quasi-singular (ill-conditioned); the system of normal equations that yields the parameter estimates is therefore numerically unstable, and the estimates are inaccurate. In the case of perfect multicollinearity, the matrix is not invertible; this also causes computational problems, as any statistical software will issue warnings such as “The matrix is singular” and will be unable to proceed with the estimation (a small numerical illustration is given after this list).

  • Forecasting issues: from a practical point of view, predictions of the response variable y become unreliable, owing to the uncertainty surrounding the values of \(\hat{\beta }\).

  • Interpretation issues: in the presence of multicollinearity, the coefficients’ estimates are less efficient and their interpretation becomes difficult. The regressors no longer explain the variance of the dependent variable as they individually should.
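The effect of an ill-conditioned X′X matrix is easy to reproduce numerically. The following R fragment is a minimal illustration, not taken from the paper: two regressors are made nearly identical, the condition number of X′X explodes and the standard errors of the corresponding coefficients are inflated.

```r
# A small numerical illustration (not from the paper) of how near-perfect
# collinearity makes X'X ill-conditioned and inflates coefficient variances.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)        # x2 is almost an exact copy of x1
X  <- cbind(1, x1, x2)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

kappa(t(X) %*% X)                     # very large condition number: X'X is quasi-singular
summary(lm(y ~ x1 + x2))$coefficients # note the inflated standard errors of x1 and x2
```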

When multicollinearity is detected, it becomes important to understand which explanatory variables are causing the issue, and the easiest solution would be to exclude them from the analysis. However, doing so is not always a good solution; it depends on whether the purpose of the analysis is explanation or prediction. When prediction is the goal, no problem arises if, among the independent variables, two regressors have the same ‘meaning’: it is possible to simply drop one of them. When, instead, explanation is the goal, regressors are selected according to a theoretical rule, so that it is not possible to eliminate any variable: the model has to be estimated exactly as specified, including every variable that was selected. The recent literature has introduced other methods that deal with multicollinearity without dropping variables from the model; they are described in the following section.

A widely used estimator is the ordinary least squares (OLS) one, which is a valid alternative to the Maximum Likelihood estimator, especially when no information is available about the error distribution. According to the Gauss-Markov theorem, the OLS estimator is BLUE (Best Linear Unbiased Estimator) when E(ε) = 0 and var(ε) = σ2In. This means that, among all the linear unbiased estimators, it is the one with the smallest variance.

When the error terms do not satisfy the Gauss-Markov assumptions, the OLS estimator is not BLUE and might therefore provide less efficient estimates than those provided by other methods, even though it remains unbiased. Moreover, the estimated error variance will be biased, invalidating significance tests, confidence intervals and the R2 coefficient.

In this paper, we are going to carry out some simulations in order to show the differences, in terms of efficiency, between the OLS estimator and two different Lp-norm estimators—whose properties will be described in the following sections—in the presence of multicollinearity.

This paper is organized as follows. In the next section, after a brief review of the origins of the multicollinearity problem and of the tools available to deal with it, we illustrate some new methods, proposed in the literature, to address this issue in the case of stochastic regressors, typical of economic and social data. In the third section, we introduce the exponential power function (E.P.F.), a useful family of symmetrical random error curves, and its connection with the Lp-norm method. In the fourth section, we investigate the relationship between the values of p, the shape parameter of the E.P.F., and some kurtosis indexes of the error distributions; we then analyze and propose the Lpmin algorithm in the fifth section. In Sect. 6, we present a simulation study, with the aim of choosing the best rule to select the most appropriate value of p for any given error distribution and of evaluating the influence of multicollinearity on the parameter estimation procedures in terms of variance. In Sect. 7, we propose a real data application, showing that our method gives better estimates than OLS. Finally, in Sect. 8, some results from the empirical study and comparisons are discussed.

2 Solving the multicollinearity problem: a review

The literature has devoted considerable attention to the multicollinearity problem and to the different ways of solving it.

When multicollinearity is detected in a regression model, the easiest solution is to eliminate the problematic variables from the model, so that it does not include variables that are correlated (Bowerman and O’Connell 1993; Bring 1996).

However, it is sometimes hard to choose which variables to exclude. Another classical remedy for the multicollinearity problem is Principal Component Analysis (Maddala 1977; Gower and Blasius 2005), which converts the original variables into a new set of linearly uncorrelated variables.

Since eliminating a variable leads to a mis-specification of the model, a viable remedy is the use of an estimator other than least squares. This is because the ‘classical’ OLS method provides estimates that may be statistically non-significant in the presence of multicollinearity. Some common alternatives are partial least squares (Cassel et al. 1999) and Ridge Regression (Pagel and Lunneberg 1985; Dijkstra 2014; Masson et al. 2014). The latter, in particular, was introduced by Hoerl and Kennard (1970), and has the advantage of reducing the variance of the slope parameters (Muniz and Kibria 2009; Akdeniz et al. 2015). In the most recent literature, though, some robust estimators are preferred. There are many types of robust regression models, although traditionally the least squares criterion has been the most widely used. The Least Absolute Value (LAV) estimator minimizes the sum of the absolute values of the residuals:

$$\hat{\beta }_{LAV} = \mathop {\arg \hbox{min} }\limits_{\beta } \sum {\left| {y_{i} - x_{i}^{\prime } \beta } \right|}$$

However, when multicollinearity is present, other methods should be preferred, such as the Ridge Least Absolute Value (RLAV), which handles the problems of multicollinearity and outliers simultaneously. A more recent approach to the multicollinearity problem considers a ridge regression based on Bisquare Ridge LTS estimators (BRLTS). Through this method, it is possible to obtain good estimates, mitigating multicollinearity when it is “moderate or high” (Pati et al. 2016).

In a regression model, a high proportion of explained variance (R2) is generally desirable: the higher the explained variance, the better the model. However, in the presence of multicollinearity, the coefficient variances, standard errors and parameter estimates may all be inflated.

Other authors suggest the so-called “nested estimate procedure” (Lin 2008), a relatively new OLS-based method consisting of an iterative estimation of the independent variables’ parameters. Garcia et al. (2011) presented the “raise method” as an alternative to nested estimation; it keeps all the available information, which can be highly desirable in some cases, and they compare its results with those of other procedures.

There are no specific tests to detect multicollinearity, but some characteristics of the estimated model can reveal it:

  • a high R2 with non-significant regression coefficients (t-scores);

  • high pairwise correlations between the individual regressors;

  • a high VIF (Variance Inflation Factor).

The Variance Inflation Factor (VIF) is calculated for each variable in the model, based on the expression:

$$VIF_{j} = \frac{1}{{\left( {1 - R_{j}^{2} } \right)}}$$

where \(R_{j}^{2}\) is the coefficient of determination of the regression of the j-th regressor on the remaining covariates.

A high VIF reveals linear dependence between the j-th column and the remaining columns of the X matrix and, thus, the presence of multicollinearity (Lazaridis 2007). However, as Curto and Pinto (2011) point out, the real impact on the variance can be overestimated by the VIF and, for this reason, CVIF can be used in place of VIF. In formulas:

$$CVIF_{j} = \frac{{(1 - R^{2} )\,TSS/\left[ {(n - k)(1 - R_{j}^{2} )\,TSS_{j} } \right]}}{{(1 - R_{0}^{2} )\,TSS/\left[ {(n - k)\,TSS_{j} } \right]}}$$

or:

$$CVIF_{j} = VIF_{j} \times \frac{{1 - R^{2} }}{{1 - R_{0}^{2} }}$$

where \(R_{0}^{2} = R_{{yx_{2} }}^{2} + R_{{yx_{3} }}^{2} + \cdots + R_{{yx_{k} }}^{2}\).

In particular, \(R_{{yx_{k} }}^{2}\) is the squared correlation coefficient between the dependent variable and the k-th independent variable; the comparison is made keeping TSS and TSSj constant. In general, \(R_{0}^{2}\) can be lower or higher than R2, but when the independent variables are orthogonal we have \(R_{0}^{2} = {\text{R}}^{2}\).

When \(R_{0}^{2} > {\text{R}}^{2}\), CVIF ranges from 1 to +∞ and CVIFj > VIFj; when \(R_{0}^{2} < {\text{R}}^{2}\), CVIF ranges from 0 to 1 and CVIFj < VIFj. As regards interpretation, as suggested by Curto and Pinto (2011), the rule of thumb CVIFj > 10 can be used to indicate the presence of severe multicollinearity.
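As a concrete reference, the following R sketch computes VIFj and CVIFj = VIFj(1 − R2)/(1 − R02) for each regressor of a fitted linear model. It is an illustrative sketch rather than the code used in the paper, and the data frame in the usage comment is hypothetical.

```r
# Hedged sketch (not the authors' code) of the VIF and CVIF computations,
# for a linear model with response y and regressors x_2, ..., x_k.
vif_cvif <- function(formula, data) {
  full <- lm(formula, data = data)
  X    <- model.matrix(full)[, -1, drop = FALSE]     # regressors without the intercept
  y    <- model.response(model.frame(full))
  R2   <- summary(full)$r.squared
  R2_0 <- sum(cor(y, X)^2)                           # R0^2: sum of squared simple correlations with y
  vif  <- sapply(seq_len(ncol(X)), function(j) {
    R2_j <- summary(lm(X[, j] ~ X[, -j]))$r.squared  # R_j^2: x_j regressed on the other regressors
    1 / (1 - R2_j)
  })
  cvif <- vif * (1 - R2) / (1 - R2_0)                # corrected VIF of Curto and Pinto (2011)
  data.frame(variable = colnames(X), VIF = vif, CVIF = cvif)
}
# Usage (hypothetical data frame `d` with columns y, x2, x3, x4):
# vif_cvif(y ~ x2 + x3 + x4, data = d)
```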

Another common issue in economics is the spurious relationship between variables (Armor et al. 2017). As Chatelain and Ralf (2014) show, increasing the number of observations can lead to spurious inference in the presence of highly correlated classical suppressors. In particular, they propose a tool to address these issues, the Parameter Inflation Factor (PIF), obtained by analyzing a trivariate regression model of the form:

$$x_{1} = \beta_{12} x_{2} + \beta_{13} x_{3} + \varepsilon_{1,23}$$

PIF is defined as:

$$PIF_{12} = \left( {1 - \frac{{r_{13} }}{{r_{12} }}r_{23} } \right)VIF_{12}$$

where r12 and r13 are the correlation coefficients between x1 and the regressors x2 and x3, r23 is the correlation coefficient between the regressors, and VIF12 is the Variance Inflation Factor.

While the Variance Inflation Factor only depends on the correlation between the independent variables, the Parameter Inflation Factor also considers the correlation between the regressors and the dependent variable. In the presence of a high PIF value, there may be highly correlated classical suppressors.
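A direct transcription of the PIF expression might read as follows; it is only a sketch based on the formula above, and the default definition of VIF12 used here is our assumption rather than something stated in the text.

```r
# Illustrative sketch of the Parameter Inflation Factor for the trivariate
# model above, given the three pairwise correlations. By default vif12 is taken
# as 1/(1 - r23^2), the usual VIF of x2 in a regression of x1 on x2 and x3
# (an assumption about the notation, not stated explicitly in the text).
pif_12 <- function(r12, r13, r23, vif12 = 1 / (1 - r23^2)) {
  (1 - (r13 / r12) * r23) * vif12
}
pif_12(r12 = 0.6, r13 = 0.5, r23 = 0.9)   # a high value signals possible classical suppressors
```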

When regressors and error are correlated, the OLS estimator becomes less efficient because its variance increases (Vargha et al. 2013; Griffiths and Hajargasht 2016), even though it remains consistent (Krone et al. 2017).

Another framework that can be considered in the presence of multicollinearity is the Classical Linear Regression Model (CLRM) (Ayinde 2007):

$$y_{t} = \beta_{0} + \beta_{1} x_{1t} + \beta_{2} x_{2t} + \beta_{3} x_{3t} + \varepsilon_{t}$$

where t = 1, 2,…,n and εt ~ (0,σ2), assuming that x1 is correlated with the error ε and x2 is correlated with x1.

In addition, we can consider a generalized least squares (GLS) model:

$$y_{t} = \beta_{0} + \beta_{1} x_{1t} + \beta_{2} x_{2t} + \beta_{3} x_{3t} + u_{t}$$

where t = 1, 2,…,n; \(u_{t} = \rho u_{t - 1} + \varepsilon_{t}\) and ρ is the autocorrelation coefficient of the errors.

Starting from this point, Ayinde (2007) shows that the CLRM and GLS models are equivalent in the case of zero correlation. In addition, comparing the CLRM, GLS and Maximum Likelihood (ML) estimators using the MSE (Mean Square Error) criterion in a Monte Carlo simulation plan, the author shows that ML and GLS perform better than the OLS estimator when the number of replications is low, whereas with a high number of replications the OLS method is to be favored.

Another study, by Tran and Tsionas (2013), proposes the Generalized Method of Moments (GMM) estimator, showing that, if there is no correlation between the variables, it performs like the standard Maximum Likelihood estimator. In the case of strong correlation, however, the MLE becomes biased while the GMM estimator remains unbiased.

Finally, as Giacalone and Richiusa (2006) point out, Lp-norm estimators allow the residuals to be handled more efficiently, particularly when there is some degree of multicollinearity and the regressors are correlated with the residuals. This is because Lp-norm methods are adaptive procedures with respect to the error component of the model and not to the deterministic one.

As a result, by using Lp-norm estimators in the presence of stochastic regressors, it is possible to obtain more efficient estimates than with OLS.

3 The exponential power function and the Lp-norm estimators

The exponential power function (E.P.F.) is a family of probability functions proposed by Subbotin (1923) and studied, among others, by Vianelli (1963), Lunetta (1966), Mineo (1989), Gonin and Money (1989), Chiodi (1995) and Bottazzi and Secchi (2011).

The E.P.F. constitutes a valid generalization of the Gaussianity hypothesis, which is usually assumed even though, depending on the data at hand, it is not always fully supportable. In the literature, it is also known as the “Normal distribution of order p” or the “Generalized Error Distribution”, and it constitutes a parametric alternative to robust methods (Mineo 2003).

The density function of the E.P.F. is:

$$f_{p} (z) = \frac{1}{{2p^{1/p} \sigma_{p} \varGamma (1 + 1/p)}}\exp \left[ { - \frac{1}{p}\left| {\frac{{z - M_{p} }}{{\sigma_{p} }}} \right|^{p} } \right]$$
(1)

where Mp = E(z) is the location parameter, σp = (E[|z − Mp|p])1/p is the scale parameter and p > 0 is the shape parameter.
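As a minimal illustration, the density (1) can be coded directly; the R package of Mineo and Ruggieri (2005), mentioned at the end of this section, provides equivalent and more complete routines. The function name below is ours.

```r
# A minimal sketch of the E.P.F. density (1), meant only to make the roles of
# M_p, sigma_p and p explicit.
depf <- function(z, Mp = 0, sigma_p = 1, p = 2) {
  1 / (2 * p^(1 / p) * sigma_p * gamma(1 + 1 / p)) *
    exp(-abs((z - Mp) / sigma_p)^p / p)
}
# For p = 2 the E.P.F. reduces to the Gaussian density with standard deviation sigma_p
all.equal(depf(0.7, p = 2), dnorm(0.7))   # TRUE
```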

Considering the Pearson kurtosis index β2, we distinguish:

  • 0 < p < 1: distribution more leptokurtic than the double exponential, β2 > 6;

  • 1 < p < 2: leptokurtic distribution, 3 < β2 < 6;

  • p > 2: platykurtic distribution, 1.8 < β2 < 3.

For particular values of p, we have:

  • the Laplace distribution (p = 1, β2 = 6);

  • the Gaussian distribution (p = 2, β2 = 3);

  • the Uniform distribution (p → ∞, β2 → 1.8).

Considering a sample of n observed data (xi, yi), a general linear regression model is:

$$y_{i} = g\left( {x_{i} ,\theta } \right) + \varepsilon_{i}$$
(2)

where g(.) is a linear function.

Lp-norm estimators are useful generalizations of ordinary least squares estimators, obtained by replacing the exponent 2 with a general exponent p (Ekblom and Henriksson 1969; Forsythe 1972). Therefore, they minimize the sum of the p-th power of the absolute deviations of the observed points from the regression function:

$$S_{p} (\theta ) = \sum\limits_{i = 1}^{n} {\left| {y_{i} - g\left( {x_{i} ,\theta } \right)} \right|}^{p} \; {\text{with}}\;1 \le p < \infty$$
(3)

Under the usual assumptions, setting \(z = y_{i}\) and \(M_{p} = g(x_{i} ,\theta )\) in (1), the log-likelihood associated with the sample is given by:

$$l(\theta ,\sigma_{p} ,p) = - n\log \left[ {2p^{1/p} \sigma_{p} \varGamma (1 + 1/p)} \right] - \left( {p\sigma_{p}^{p} } \right)^{ - 1} \sum\limits_{i = 1}^{n} {\left| {y_{i} - g\left( {x_{i} ,\theta } \right)} \right|}^{p}$$
(4)
$$\begin{array}{*{20}l} {\frac{\partial l}{{\partial \theta_{j} }} = \sum\limits_{i = 1}^{n} {\left| {y_{i} - g(x_{i} ,\theta )} \right|}^{p - 1} sign(y_{i} - g(x_{i} ,\theta ))\frac{\partial g}{{\partial \theta_{j} }} = 0} \hfill \\ {\sum\limits_{i = 1}^{n} {\left| {y_{i} - g\left( {x_{i} ,\theta } \right)} \right|}^{p} = \hbox{min} \;{\text{with}}\;p \ge 1} \hfill \\ \end{array}$$
(5)

When the order p is specified, all the terms in (4), except for the last part containing the vector θ, are constant. Therefore, Maximum Likelihood estimators are equivalent to Lp-norm estimators (5).
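For a fixed and known p, the Lp-norm fit in (3)/(5) can be computed numerically. The following R fragment is a minimal sketch under the assumption of a model that is linear in the parameters; it is not the authors' implementation.

```r
# A minimal sketch of an Lp-norm fit for a fixed, known p (illustrative only).
lp_fit <- function(X, y, p) {
  stopifnot(p >= 1)
  start <- qr.solve(X, y)                       # OLS starting values
  obj   <- function(b) sum(abs(y - X %*% b)^p)  # objective (3) / (5)
  # "CG" is a Fletcher-Reeves-type conjugate gradient; for p close to 1 the
  # objective is not smooth and a derivative-free method may be preferable.
  optim(start, obj, method = "CG")$par
}

# Usage: with p = 2 the result coincides with OLS up to numerical tolerance
set.seed(1)
X <- cbind(1, matrix(rnorm(200), ncol = 2))
y <- as.numeric(X %*% c(1, 2, 3) + rnorm(100))
cbind(Lp = lp_fit(X, y, p = 2), OLS = qr.solve(X, y))
```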

If p is unknown, there are two related problems to consider:

  1. the estimation of the exponent p on the sample data;

  2. the choice of the minimization algorithm to obtain the regression parameters’ estimates.

Regarding the procedures to estimate p, the following proposals can be found in the literature:

  • Harter (1977), noting that p depends on \(\hat{\beta }_{2}\) (the sample residual kurtosis), proposed to select p using the following rules:

    if \(\hat{\beta }_{2} > 3.8\) use p = 1 (the least absolute deviations regression);

    if \(2.2 < \hat{\beta }_{2} < 3.8\) use p = 2 (the least squares regression);

    if \(\hat{\beta }_{2} < 2.2\) use \(p = \infty\) (the minimax or Chebychev regression).

  • Money et al. (1982) and Sposito (1982) proposed two different criteria, respectively (a short sketch implementing these selection rules, together with Harter’s, is given after this list):

$$\hat{p} = 9/\hat{\beta }_{2}^{2} + 1\;{\text{for }}1 \le p < \infty$$
(6)
$$\hat{p} = 6/\hat{\beta }_{2}^{2} \;{\text{for }}1 < p < 2$$
(7)
  • Mineo (1989) proposed the Generalized Kurtosis βK, as described in Sect. 4.

  • Mineo (1994) considered a new method to estimate p, based on an empirical index called VI. This estimation method relies on a two-step algorithm in which the estimates of p and of the other parameters are evaluated iteratively: the shape parameter is estimated by numerically solving an equation in which the theoretical value of the index is set equal to the empirical one, while the other parameters are estimated from the corresponding maximum likelihood estimators as functions of the current value of \(\hat{p}\).

  • Agrò (1995) proposed maximum likelihood estimation both for the E.P.F. parameters and for the shape parameter p. It is a two-step process which, however, is essentially suitable for medium to large samples (n > 50).

  • Giacalone (1997) proposed an algorithm based on a two-step alternating procedure that first estimates the θ parameter vector by means of the classical conjugate gradient algorithm (Fletcher and Reeves 1964) and then estimates p using a joint inverse function of I and β2, obtained by comparing empirical and theoretical moments, i.e. matching (10) with (12) and (9) with (11). The minimization algorithm stops when the variation of p is no longer significant (Everitt and Hand 1987).

  • Agrò (1999) proposed an adjustment of the likelihood estimation, based on Cox and Reid (1987), which consists in a reparametrization that leads to asymptotically uncorrelated estimators and, consequently, to better results (at least, for n > 30).

  • Finally, Mineo and Ruggieri (2005) developed an R package for dealing with the exponential power function, with functions to compute the density function, the distribution function and the quantiles from an E.P.F., available on the Comprehensive R Archive Network (CRAN).
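The simplest of these proposals can be written down in a few lines. The sketch below transcribes Harter’s rule and Eqs. (6)–(7) exactly as given in the text; it is illustrative only, and the function name is ours.

```r
# Compact sketch of the p-selection rules quoted above, driven by the sample
# residual kurtosis beta2_hat (Harter 1977; Money et al. 1982; Sposito 1982).
select_p <- function(beta2_hat, rule = c("harter", "money", "sposito")) {
  rule <- match.arg(rule)
  switch(rule,
    harter  = if (beta2_hat > 3.8) 1 else if (beta2_hat >= 2.2) 2 else Inf,
    money   = 9 / beta2_hat^2 + 1,   # Eq. (6)
    sposito = 6 / beta2_hat^2        # Eq. (7), as given in the text
  )
}
select_p(6.0, "harter")   # heavy-tailed residuals -> p = 1 (least absolute deviations)
select_p(3.0, "money")    # Gaussian-like residuals -> p = 2
```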

4 The exponential power function kurtosis indexes

For the density (1), the theoretical moment of order k is a function of the shape parameter p as follows:

$$E\left| {z - M_{p} } \right|^{k} = \left( {p\sigma_{p}^{p} } \right)^{k/p} \frac{\varGamma ((k + 1)/p)}{\varGamma (1/p)} = \mu_{k}$$
(8)

The ratio of the moment of order 2k to the squared moment of order k depends only on the shape parameter p. This theoretical relation is also known as the “Generalized Kurtosis” (Mineo 1989):

$$\beta_{k} = \frac{{\mu_{2k} }}{{\mu_{k}^{2} }} = \frac{\varGamma (1/p)\varGamma ((2k + 1)/p)}{{[\varGamma ((k + 1)/p)]^{2} }}$$

If k = 2, we get the Pearson’s kurtosis index:

$$\beta_{2} = \frac{{\mu_{4} }}{{\mu_{2}^{2} }} = \frac{\varGamma (1/p)\varGamma (5/p)}{{[\varGamma (3/p)]^{2} }}$$
(9)

If k = 1, considering the square root of the reciprocal, we get Geary’s length-of-tails index (Geary 1936):

$$I = \frac{{\mu_{1} }}{{\sqrt {\mu_{2} } }} = \frac{\varGamma (2/p)}{{\sqrt {\varGamma (1/p)\varGamma (3/p)} }}$$
(10)

The indexes I and β2 behave differently as p varies (Giacalone 1996, 1997). By computing the sample values of I and β2, it is possible to obtain, by inverse interpolation, two different estimates of p.

Gonin and Money (1987), Kendall and Stuart (1966) and Lunetta (1966) considered the unbiased estimates of the second and fourth order sample moments, with correction factors depending on the sample size n:

$$\begin{array}{*{20}l} {\hat{\mu }_{2} = \frac{1}{n - 1}\sum\limits_{i} {\left( {\varepsilon_{i} - \bar{\varepsilon }} \right)}^{2} } \hfill \\ {\hat{\mu }_{4} = \frac{{\left( {n^{2} - 2n + 3} \right)}}{{\left( {n - 1} \right)\left( {n - 2} \right)\left( {n - 3} \right)}}\sum\limits_{i} {\left( {\varepsilon_{i} - \bar{\varepsilon }} \right)}^{4} - \frac{{3\left( {n - 1} \right)(2n - 3)}}{{n\left( {n - 2} \right)\left( {n - 3} \right)}}\hat{\mu }_{2}^{2} } \hfill \\ \end{array}$$

where \(\varepsilon_{i}\) and \(\bar{\varepsilon }\) are, respectively, the estimated residuals and their average.

The ratio of \(\hat{\mu }_{4}\) to \(\hat{\mu }_{2}^{2}\) gives the following estimator of β2:

$$\hat{\beta }_{2} = \frac{{\hat{\mu }_{4} }}{{\hat{\mu }_{2}^{2} }}$$
(11)

For the I empirical index we obtain:

$$\hat{I} = \frac{{\sum\nolimits_{i} {\left| {\varepsilon_{i} - \bar{\varepsilon }} \right|} }}{{\sqrt {\sum\nolimits_{i} {\left( {\varepsilon_{i} - \bar{\varepsilon }} \right)^{2} } } }}\frac{{\sqrt {n - 1} }}{n}$$
(12)
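All four quantities are straightforward to compute. The following R sketch, which is not the authors' code, transcribes (9)–(12) and checks them on a Laplace-distributed sample, for which β2 should be close to 6.

```r
# Sketch of the theoretical indexes (9)-(10) and of their empirical
# counterparts (11)-(12), computed from a vector of residuals `eps`.
beta2_theo <- function(p) gamma(1 / p) * gamma(5 / p) / gamma(3 / p)^2      # Eq. (9)
I_theo     <- function(p) gamma(2 / p) / sqrt(gamma(1 / p) * gamma(3 / p))  # Eq. (10)

beta2_emp <- function(eps) {                                                # Eq. (11)
  n <- length(eps); d <- eps - mean(eps)
  m2 <- sum(d^2) / (n - 1)
  m4 <- (n^2 - 2 * n + 3) / ((n - 1) * (n - 2) * (n - 3)) * sum(d^4) -
        3 * (n - 1) * (2 * n - 3) / (n * (n - 2) * (n - 3)) * m2^2
  m4 / m2^2
}
I_emp <- function(eps) {                                                    # Eq. (12)
  n <- length(eps); d <- eps - mean(eps)
  sum(abs(d)) / sqrt(sum(d^2)) * sqrt(n - 1) / n
}

# Example: Laplace residuals (p = 1) should give beta2 near 6 and I near I_theo(1)
set.seed(1)
eps <- rexp(1000) * sample(c(-1, 1), 1000, replace = TRUE)
c(beta2_emp(eps), beta2_theo(1), I_emp(eps), I_theo(1))
```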

5 The Lpmin algorithm

The proposed algorithm is based on a two-steps alternating procedure:

  1. minimization procedure to estimate the parameters;

  2. joint inverse function of I and β2 to estimate p.

The algorithm stops when p no longer varies significantly. To obtain this estimate, we minimize the difference between the empirical and theoretical indexes, as in (13), rather than setting this difference equal to zero: the latter choice, explored in a different algorithm, caused convergence problems (Counihan 1985; Giacalone 2002).

The function used to estimate p is therefore the following:

$$\left[ {\frac{{I - \hat{I}}}{0.86054}} \right]^{2} + \left[ {\frac{{\beta_{2} - \hat{\beta }_{2} }}{25.2}} \right]^{2} = \hbox{min}$$
(13)

where \(I,\hat{I},\beta_{2} ,\hat{\beta }_{2}\) are given, respectively, by (10), (12), (9) and (11). For simplicity, we can express (13) as follows:

$$\left[ {f(p)} \right]^{2} + \left[ {g(p)} \right]^{2} = \hbox{min}.$$

As the indexes I and β2 have different orders of magnitude and different variances, it is necessary to remove this difference in order to obtain a joint estimate of p; the standardization therefore constrains 0 < f(p) < 1 and 0 < g(p) < 1.

The maximum theoretical values of f(p) and g(p) are the chosen standardization factors.

Considering p ranging from 0.5 to 10, 25.2 is the value of β2 when p = 0.5, and 0.86054 is the value of I when p = 10. In this way, it is possible to obtain a joint estimator made up of two squared functions. Both kurtosis indexes are used because the norm-1 kurtosis (I) was observed to be a valid choice for estimating p in the presence of outliers, while the norm-2 kurtosis (β2) performs better when the sample values gather around the center of the distribution.

So, using the relation (9) we calculate max(β2) = 25.2, for p = 0.5, whilst using the relation (10) we calculate max(I) = 0.86054, for p = 10.

The proposed algorithm is then specified in the following steps (Giacalone 1997):

  1. set i = 0 and p0 = 2;

  2. fit the model (2) to the data using the previous step value pi;

  3. compute the estimated residuals εi = yi − g(xi, θ) and their average \(\bar{\varepsilon }\), and insert these quantities into (13), the sum of the two squared functions to be minimized;

  4. minimize the function (13) to obtain pi+1, the new estimate of p;

  5. compare the estimate pi+1 with the previous pi: if |pi+1 − pi| > 0.01, set i = i + 1 and repeat steps 2–5;

  6. otherwise, stop the algorithm, taking the current parameter estimates \(\hat{\theta }\) as the Lp-norm estimates and the current value of p as the joint estimate of p.

In step 2, a nonlinear Lp-norm estimation problem is solved. It could be tackled using the optimality conditions encountered in unconstrained optimization (McCormick 1983); here, the minimization algorithm of Fletcher and Reeves (1964) is used to take the special structure of the problem directly into account.

In step 3, we calculate both empirical and theoretical I and β2 kurtosis indexes to obtain the values to estimate p from (13).

In step 4, a parabolic interpolation method (Everitt and Hand 1987) is used to find the minimum of the sum of squared functions (13). The convergence of the proposed algorithm was empirically verified by simulating 5000 samples of different sizes (n = 30, 50, 100, 200, 500, 1000) for six fixed theoretical values of p.
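To make the six steps concrete, the following R sketch outlines one possible implementation of the alternating procedure; it is illustrative only and is not the authors' original code. The index functions repeat those of the Sect. 4 sketch so that the block is self-contained, and R's optimize(), which combines golden-section search with successive parabolic interpolation, stands in for the parabolic interpolation step.

```r
# Rough, self-contained sketch of the Lpmin alternating procedure (steps 1-6).
beta2_theo <- function(p) gamma(1 / p) * gamma(5 / p) / gamma(3 / p)^2      # Eq. (9)
I_theo     <- function(p) gamma(2 / p) / sqrt(gamma(1 / p) * gamma(3 / p))  # Eq. (10)
beta2_emp  <- function(e) {                                                 # Eq. (11)
  n <- length(e); d <- e - mean(e); m2 <- sum(d^2) / (n - 1)
  m4 <- (n^2 - 2 * n + 3) / ((n - 1) * (n - 2) * (n - 3)) * sum(d^4) -
        3 * (n - 1) * (2 * n - 3) / (n * (n - 2) * (n - 3)) * m2^2
  m4 / m2^2
}
I_emp <- function(e) {                                                      # Eq. (12)
  n <- length(e); d <- e - mean(e)
  sum(abs(d)) / sqrt(sum(d^2)) * sqrt(n - 1) / n
}

lp_fit <- function(X, y, p, start = qr.solve(X, y)) {
  # Step 2: Lp-norm fit of a linear model for a given p (conjugate gradient)
  optim(start, function(b) sum(abs(y - X %*% b)^p), method = "CG")$par
}

lp_min <- function(X, y, tol = 0.01, max_iter = 50) {
  p <- 2                                               # Step 1: p0 = 2
  for (i in seq_len(max_iter)) {
    theta <- lp_fit(X, y, p)                           # Step 2
    eps   <- as.numeric(y - X %*% theta)               # Step 3: residuals
    crit  <- function(pp)                              # standardized criterion (13)
      ((I_theo(pp) - I_emp(eps)) / 0.86054)^2 +
      ((beta2_theo(pp) - beta2_emp(eps)) / 25.2)^2
    p_new <- optimize(crit, interval = c(0.5, 10))$minimum   # Step 4
    if (abs(p_new - p) <= tol) { p <- p_new; break }         # Steps 5-6
    p <- p_new
  }
  list(coefficients = lp_fit(X, y, p), p = p)
}
```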

6 A simulation study for efficiency comparison

After looking at the more recent literature (e.g. Alabi et al. 2008, 2014), we chose to consider 5000 samples of size n = 30, 50, 100, 200, 500 and 1000, generated from an E.P.F., and 6 values of p (1.1, 1.5, 2.0, 2.5, 3.0, 3.5) for three different degrees of multicollinearity (low R2 = 0.33, medium R2 = 0.66, high R2 = 0.99).

The algorithm to generate the εi (for p ≥ 1) from an E.P.F. is suggested by Chiodi (1986). The values of yi are given by the following multiple regression model:

$$y_{i} = \beta_{0} + \beta_{1} x_{i1} + \beta_{2} x_{i2} + \beta_{3} x_{i3} + \varepsilon_{i}$$
(14)

where X1 and X2 are independent, identically distributed variables drawn from a standardized Gaussian distribution, and X3 is a linear combination of X1 and X2:

$$X_{3} = X_{1} + X_{2} + Z\;{\text{with }}Z \sim {\text{N(0,}}\sigma_{z} )$$
(15)

Therefore, we can write the associated variance and covariance matrix:

$${\text{S}} = \left\| {\begin{array}{*{20}c} {} & {X_{1} } & {X_{2} } & {X_{3} } \\ {X_{1} } & 1 & 0 & 1 \\ {X_{2} } & 0 & 1 & 1 \\ {X_{3} } & 1 & 1 & {2 + \sigma_{z}^{2} } \\ \end{array} } \right\|$$

It is easy to notice that \(E(X_{3}^{2} ) = E(X_{1}^{2} ) + E(X_{2}^{2} ) + \sigma_{z}^{2} = 2 + \sigma_{z}^{2}\), and the corresponding correlation matrix is equal to:

$${\text{A}} = \left\| {\begin{array}{*{20}c} {} & {X_{1} } & {X_{2} } & {X_{3} } \\ {X_{1} } & 1 & 0 & {1/\sqrt {2 + (\sigma_{z}^{2} )} } \\ {X_{2} } & 0 & 1 & {1/\sqrt {2 + (\sigma_{z}^{2} )} } \\ {X_{3} } & {1/\sqrt {2 + (\sigma_{z}^{2} )} } & {1/\sqrt {2 + (\sigma_{z}^{2} )} } & 1 \\ \end{array} } \right\|$$

where \(A_{13} = \frac{{\text{cov} (X_{1} ,X_{3} )}}{{\sqrt {\text{var} (X_{1} )\text{var} (X_{3} )} }} = \frac{1}{{\sqrt {2 + \left( {\sigma_{z}^{2} } \right)} }} = A_{23}\), \(R_{3.12}^{2} = 1 - \frac{\det A}{{\det A_{33} }} = \frac{2}{{2 + \left( {\sigma_{z}^{2} } \right)}}\), and \(\det A_{33}\) is the cofactor of the element occupying the same position in the correlation matrix (Leti 1983).

In this particular regression model, the degree of multicollinearity is inversely related to \(\sigma_{z}^{2}\): the smaller \(\sigma_{z}^{2}\), the stronger the collinearity.

Specifically, in our simulation we set the parameter values β0 = 1, β1 = 2, β2 = 3 and β3 = 4, and a comparative analysis was performed by applying the following three estimation methods to the same samples (a minimal sketch of the data-generating step is given after the list):

  1. Least squares estimators (L2);

  2. Lp-norm estimators with theoretical p of the E.P.F. (Lp);

  3. Lp-norm estimators with p as in our proposal (Lpmin).
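As an illustration of this design, one simulated sample could be generated as in the following sketch. Note that the E.P.F. errors are drawn here with a standard gamma transform rather than with the Chiodi (1986) algorithm used in the paper, and the function names are ours.

```r
# Illustrative generator for one simulated sample under (14)-(15). If
# G ~ Gamma(1/p, 1), then sign * sigma_p * (p*G)^(1/p) has density (1) with M_p = 0.
repf <- function(n, p, sigma_p = 1) {
  g <- rgamma(n, shape = 1 / p, scale = 1)
  sample(c(-1, 1), n, replace = TRUE) * sigma_p * (p * g)^(1 / p)
}

gen_sample <- function(n, p, sigma_z, beta = c(1, 2, 3, 4)) {
  x1 <- rnorm(n); x2 <- rnorm(n)
  x3 <- x1 + x2 + rnorm(n, sd = sigma_z)   # Eq. (15): collinearity grows as sigma_z shrinks
  eps <- repf(n, p)
  y   <- beta[1] + beta[2] * x1 + beta[3] * x2 + beta[4] * x3 + eps
  data.frame(y, x1, x2, x3)
}

# sigma_z = 0.01 corresponds to R^2_{3.12} = 2/(2 + sigma_z^2), i.e. almost 1
d <- gen_sample(n = 100, p = 1.5, sigma_z = 0.01)
summary(lm(x3 ~ x1 + x2, data = d))$r.squared
```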

The results of the simulation plan are reported in Tables 10, 11, 12, 13, 14 and 15 in “Appendix 1”, where the OLS simulation results are marked in bold; furthermore, in Tables 13, 14 and 15, the number of samples with a difficult convergence (“Conv.”) is also presented. In the tables, we report mean (M) and variance (V) of the regression model parameters. We can see that, for any value of p and for any method, the estimates of β0, β1, β2 and β3 are less efficient for small samples (e.g. for n = 30 and n = 50) but their variances decrease as n increases.

We calculated the relative efficiency for every value of p and n, and the results are shown in Tables 1A, 1B, 2A, 2B, 3A and 3B. From the tables, we notice that Lp-norm estimators (Lp and Lpmin) give better estimates for every parameter, compared to the least squares method, especially when the value of p is not close to 2: there is a gain in efficiency in all the cases considered, except for the case p = 2, in which the error follows a Gaussian distribution.

Table 1 Relative efficiency of Lp-norm estimators compared to OLS (parameters β0, β1 β2, β3) on 5000 samples of size n = 30, n = 50, n = 100, n = 200, n = 500, n = 1000 (σz = 2, R2 = 0.33)
Table 2 Relative efficiency of Lp-norm estimators compared to OLS (parameters β0, β1 β2, β3) on 5000 samples of size n = 30, n = 50, n = 100, n = 200, n = 500, n = 1000 (σz = 1, R2 = 0.66)
Table 3 Relative efficiency of Lp-norm estimators compared to OLS (parameters β0, β1 β2, β3) on 5000 samples of size n = 30, n = 50, n = 100, n = 200, n = 500, n = 1000 (σz = 0.01, R2 = 0.99)

However, the p parameter of a real dataset is rarely exactly equal to 2. When the value of p differs from 2, the OLS method always yields less efficient estimates than both the Lp and Lpmin methods.

Indeed, in most cases, the Lpmin method produces estimates that are more efficient than OLS but less efficient than the Lp method, which uses the true value of p. Therefore, Lpmin can be considered halfway between the L2 and Lp methods, because the p parameter is estimated from the data sample.

The relative efficiency indexes (Tables 1A, 1B, 2A, 2B, 3A, 3B) show that, for medium and high degrees of multicollinearity (R2 = 0.66; R2 = 0.99), Lp-norm estimators are more efficient than OLS. However, in the presence of a low degree of multicollinearity (R2 = 0.33), the efficiency of the OLS estimator is sometimes higher than that of the Lp-norm estimators.

Since the distribution of the residuals is known, the Lp-norm estimator with the true p is also the Maximum Likelihood estimator in our setting (and coincides with OLS when p = 2). As the sample size tends to infinity, a Maximum Likelihood estimator is asymptotically efficient, since it attains the Cramér-Rao lower bound. Given that all of these estimators share this asymptotic efficiency property, the efficiency gain is larger for small samples.

7 A real data application

To better show the difference between OLS and Lp-norm estimates, we built a dataset made up of four variables, as in Mouza and Targoudtzidis (2012): GDP per capita, unemployment rate and working hours are examined together in order to show a connection with labor accidents in the UK. Working hours are considered along with the unemployment rate, as the latter measures any occupation of the individuals throughout a period of time, without giving any information about the nature of this occupation (e.g. part-time and occasional workers are considered employed even if they spend a small amount of time at the workplace). The data, presented in “Appendix 2” (Table 16), refer to the United Kingdom over the period 1971–2007 (yearly, 37 observations).

In more detail, the variables used in our application are the following:

  1. GDP—Gross Domestic Product per capita (thousand US dollars, current prices and purchasing power parity), source: OECD databank;

  2. UR—unemployment rate (percentage), source: OECD databank;

  3. AH—average hours worked (hours per year per worker), source: OECD databank;

  4. FI—fatal injuries of employees (absolute numbers), source: UK national statistics database, health and safety executive.

To study the relationship between the economic cycle and workplace accidents, we estimated the following regression model:

$$FI =\upbeta_{0} +\upbeta_{1} GDP +\upbeta_{2} UR +\upbeta_{3} AH + \varepsilon$$

The R2 for this regression is 0.886, and the Adjusted R2 is 0.875. In this model, there are four parameters to be estimated (including the intercept), just as in our simulation study.

In the following table (Table 4), we show the VIF values obtained by taking each explanatory variable in turn as the dependent variable and the remaining ones as regressors, together with the corresponding R2:

Table 4 R2, adjusted R2 and VIF for each explanatory variable in the model

Regarding VIF interpretation, and bearing in mind the limitations highlighted by O’Brien (2007), we opted for the “rule of 10” proposed by Menard (1995). Since none of the VIFs exceeds 10, we conclude that there is no severe collinearity in this regression, but only a medium degree of collinearity. In this case, according to the results of the simulation study, our method yields better estimates than OLS.

The results of the OLS estimation (p = 2) are (Table 5):

Table 5 OLS regression results

Thus, the OLS method returns the following model:

$$\hat{Y} = - 255.40 - 13.97X_{1} - 15.14X_{2} + 0.54X_{3}$$
(16)

Using the Lp-norm estimation method proposed in this paper (Lpmin), we estimated p = 1.696 (leptokurtic distribution of residuals), and the results are (Table 6):

Table 6 Lp-norm regression results

Thus, the Lp-norm method returns the following model:

$$\hat{Y} = - 25.01 - 14.68X_{1} - 16.06X_{2} + 0.42X_{3}$$
(17)

As the results show, the intercept (β0) estimate is much closer to zero under the Lp-norm method. This is a first indication of better accuracy of the estimates: when the values of the other variables (e.g. GDP and Average Working Hours) are zero, it is natural to expect a value of Fatal Injuries as close as possible to zero.

Table 7 reports the minimized value of the objective function (3) for the two methods and shows that the Lp-norm method brings a clear improvement in the efficiency of the estimates.

Table 7 Minimization of the objective function

The results in Table 7 indicate that our estimates are better than the ones obtained by OLS, as they are characterized by a significantly lower variability of the residuals. The objective function represents the stochastic component of the model; the deterministic component is therefore better explained by the Lp-norm method, given the lower value of the objective function. The large difference in the minimized values can be largely attributed to the marked difference in the estimated intercepts.

Below, for each value of p, we present the scatter plot of residuals against fitted values (Fig. 1). As the plots suggest, the variances of the error terms are not equal.

Fig. 1 Scatter plot of residuals against fitted values, for p = 1.696 and p = 2

The following figure (Fig. 2) shows, for each value of p, the Normal Quantile–Quantile plot. As the points do not form a straight line, the plots suggest that the data do not come from a Normal distribution and, therefore, a different distribution should be considered.

Fig. 2 Normal quantile–quantile plot, for p = 1.696 and p = 2

Dynamic graphic techniques could also be used, for a graphical comparison of the two different estimation methods (Destefanis and Porzio 1999).

Moreover, we performed a regression analysis in terms of elasticity. The results, shown in Tables 8 and 9, are noteworthy, since they suggest, as also evidenced by Mouza and Targoudtzidis (2012), that fatal injuries are to some extent related to, and explained by, labor market indicators.

Table 8 OLS regression results (in terms of elasticity)
Table 9 Lp-norm regression results (in terms of elasticity)

It is interesting to note that the signs of the estimated coefficients are the same under both our estimation method and OLS, which supports the robustness of our analysis.

Fatal Injuries appear inelastic with respect to GDP per capita and the Unemployment Rate but elastic with respect to Average Working Hours. In particular, although FI is positively influenced by an increase in unemployment, this relationship is inelastic. Overall, we might expect rising unemployment rates to increase the marginal working hours of those who remain employed, leading to an increase in workplace accidents. Indeed, the AH coefficient is positive and the relationship is elastic: when AH increases, FI increases as well. The reason behind this result is statistically intuitive: as the time spent at the workplace increases, the chance of injury increases accordingly.

In this application, although there are some improvements in the estimates from using Lp-norm methods, principally linked to a better explanation of the deterministic part of the regression model, the results are in line with those obtained using OLS. This is because the data are affected by only a medium degree of collinearity and the estimated p is close to two.

8 Conclusions

Multicollinearity indicates a situation in which the independent variables in a regression model are highly correlated, leading to instability and large variance of the OLS estimator. For this reason, the absence of multicollinearity is essential to obtain optimal estimates from a multiple regression model.

Ordinary least squares, one of the simplest and most widely used estimation methods, relies on several binding assumptions regarding the error terms and the nature of the independent variables.

In this paper, Lp-norm methods are considered not only in order to mitigate the effects of multicollinearity in the model, but also in order to improve the parameter estimation.

By using Lp-norm estimators, a better performance in terms of the variance of the parameter estimates is generally gained in the case of non-Normal symmetric error distributions, compared with least squares, even when considering a model with collinear regressors.

When interpreting the simulation results, we notice that the improvements obtained by using Lp-norm estimators in place of least squares are more evident in the case σz = 0.01 and \(R_{3.12}^{2} = 0.99\) (high collinearity) than in the case σz = 1 and \(R_{3.12}^{2} = 0.66\) (medium collinearity). Only in the case of low collinearity does the OLS method sometimes give more efficient estimates.

Therefore, the use of Lp-norm methods in the presence of stochastic regressors is strongly recommended. This is linked to the characteristics of these methods, which are adaptive procedures with respect to the error component of the model rather than to the deterministic one.

Finally, the real data application is particularly interesting, as it confirms that the coefficients estimated by the Lp-norm method can describe the data better than those obtained through OLS. Lp-norm estimators, characterized by a lower value of the objective function than OLS, allow for a better explanation of the deterministic component of the model.

All the simulations were made using R 3.3.2 and Stata 12.0.