1 Introduction

Generally, a regression model has two main purposes (Saville and Wood 1991):

  a. it could be used to determine to what extent the dependent variable can be predicted by the independent variables;

  b. it could be used to capture the strength of a theoretical relationship between the dependent variable and the independent variables.

The first use of the term “multicollinearity” in the literature was by Frisch (1934). We refer to multicollinearity when there is a strong linear dependence between two or more regressors (Leahy 2000).

Multicollinearity is not simply present or absent: what matters is its degree, or severity, which ranges from no collinearity to perfect collinearity. Some degree of multicollinearity is almost always present in real data. Perfect multicollinearity is an extreme and uncommon case, occurring when two or more independent variables in a regression model exhibit an exact linear relationship. In that case, the regression coefficients are indeterminate and their standard errors are infinite.

Multicollinearity may lead to several issues, summarized in the following points (Belsley et al. 1980; Shieh 2011; Montgomery et al. 2012):

  • Estimation issues: considering the ordinary least squares (OLS) method, the regression coefficients’ estimates are given by:

    $$\hat{\beta }_{OLS} = (X^{{\prime }} X)^{ - 1} X^{{\prime }} Y$$

    In the presence of a high degree of multicollinearity, the X′X matrix is quasi-singular (ill-conditioned); the system of normal equations that yields the parameter estimates is therefore numerically unstable, and the estimates are inaccurate. In the case of perfect multicollinearity, the matrix is not invertible; this also causes computational problems, as any statistical software will issue warnings such as “The matrix is singular” and will be unable to proceed with the estimation (a small numerical illustration is given after this list).

  • Forecasting issues: from a practical point of view, predictions of the response variable y become unreliable, owing to the uncertainty surrounding the values of \(\hat{\beta }\).

  • Interpretation issues: in the presence of multicollinearity, the coefficients’ estimates are less efficient and their interpretation becomes difficult. The regressors no longer explain the variance of the dependent variable as they individually should.
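The effect of an ill-conditioned X′X matrix is easy to reproduce numerically. The following R fragment is a minimal illustration, not taken from the paper: two regressors are made nearly identical, the condition number of X′X explodes and the standard errors of the corresponding coefficients are inflated.

```r
# A small numerical illustration (not from the paper) of how near-perfect
# collinearity makes X'X ill-conditioned and inflates coefficient variances.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)        # x2 is almost an exact copy of x1
X  <- cbind(1, x1, x2)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

kappa(t(X) %*% X)                     # very large condition number: X'X is quasi-singular
summary(lm(y ~ x1 + x2))$coefficients # note the inflated standard errors of x1 and x2
```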

When multicollinearity is detected, it becomes important to understand which explanatory variables are causing the issue, and the easiest solution would be to exclude them from the analysis. However, doing so is not always a good solution; it depends on whether the purpose of the analysis is explanation or prediction. When prediction is the goal, no problem arises if, among the independent variables, two regressors have the same ‘meaning’: it is possible to simply drop one of them. When, instead, explanation is the goal, regressors are selected according to a theoretical rule, so that it is not possible to eliminate any variable: the model has to be estimated exactly as specified, including every variable that was selected. The recent literature has introduced other methods that deal with multicollinearity without dropping variables from the model; they are described in the following section.

A widely used estimator is the ordinary least squares (OLS) one, which is a valid alternative to the Maximum Likelihood estimator, especially when no information is available about the error distribution. According to the Gauss-Markov theorem, the OLS estimator is BLUE (Best Linear Unbiased Estimator) when E(ε) = 0 and var(ε) = σ2In. This means that, among all the linear unbiased estimators, it is the one with the smallest variance.

When the error terms do not satisfy the Gauss-Markov assumptions, the OLS estimator is not BLUE and might therefore provide less efficient estimates than those provided by other methods, even though it remains unbiased. Moreover, the estimated error variance will be biased, invalidating significance tests, confidence intervals and the R2 coefficient.

In this paper, we are going to carry out some simulations in order to show the differences, in terms of efficiency, between the OLS estimator and two different Lp-norm estimators—whose properties will be described in the following sections—in the presence of multicollinearity.

This paper is organized as follows. In the next section, after a brief review of the origins of the multicollinearity problem and of the tools available to deal with it, we illustrate some new methods, proposed in the literature, to address this issue in the case of stochastic regressors, typical of economic and social data. In the third section, we introduce the exponential power function (E.P.F.), a useful family of symmetrical random error curves, and its connection with the Lp-norm method. In the fourth section, we investigate the relationship between the values of p, the shape parameter of the E.P.F., and some kurtosis indexes of the error distributions; we then analyze and propose the Lpmin algorithm in the fifth section. In Sect. 6, we present a simulation study, with the aim of choosing the best rule to select the most appropriate value of p for any given error distribution and of evaluating the influence of multicollinearity on the parameter estimation procedures in terms of variance. In Sect. 7, we propose a real data application, showing that our method gives better estimates than OLS. Finally, in Sect. 8, some results from the empirical study and comparisons are discussed.

2 Solving the multicollinearity problem: a review

The literature has devoted considerable attention to the multicollinearity problem and to the different ways of solving it.

When multicollinearity is detected in a regression model, the easiest solution is to eliminate the problematic variables from the model, so that it does not include variables that are correlated (Bowerman and O’Connell 1993; Bring 1996).

However, it is sometimes hard to choose which variables to exclude. Another classical remedy for the multicollinearity problem is Principal Component Analysis (Maddala 1977; Gower and Blasius 2005), which converts the original variables into a new set of linearly uncorrelated variables.

Since eliminating a variable leads to a mis-specification of the model, a viable remedy is the use of an estimator other than least squares. This is because the ‘classical’ OLS method provides estimates that may be statistically non-significant in the presence of multicollinearity. Some common alternatives are partial least squares (Cassel et al. 1999) and Ridge Regression (Pagel and Lunneberg 1985; Dijkstra 2014; Masson et al. 2014). The latter, in particular, was introduced by Hoerl and Kennard (1970), and has the advantage of reducing the variance of the slope parameters (Muniz and Kibria 2009; Akdeniz et al. 2015). In the most recent literature, though, some robust estimators are preferred. There are many types of robust regression models, although traditionally the least squares criterion has been the most widely used. The Least Absolute Value (LAV) estimator minimizes the sum of the absolute values of the residuals:

$$\hat{\beta }_{LAV} = \mathop {\arg \hbox{min} }\limits_{\beta } \sum {\left| {y_{i} - x_{i}^{\prime } \beta } \right|}$$

However, when multicollinearity is present, other methods should be preferred, such as the Ridge Least Absolute Value (RLAV), which handles the problems of multicollinearity and outliers simultaneously. A more recent approach to the multicollinearity problem considers a ridge regression based on Bisquare Ridge LTS estimators (BRLTS). Through this method, it is possible to obtain good estimates, mitigating multicollinearity when it is “moderate or high” (Pati et al. 2016).

In a regression model, a high proportion of explained variance (R2) is generally desirable: the higher the explained variance, the better the model. However, in the presence of multicollinearity, the coefficient variances, standard errors and parameter estimates may all be inflated.

Other authors suggest the so-called “nested estimate procedure” (Lin 2008), a relatively new OLS-based method consisting of an iterative estimation of the independent variables’ parameters. Garcia et al. (2011) presented the “raise method” as an alternative to nested estimation; it keeps all the available information, which can be highly desirable in some cases, and they compare its results with those of other procedures.

There are no specific tests to detect multicollinearity, but some characteristics of the estimated model can reveal it:

  • a high R2 with non-significant regression coefficients (t-scores);

  • high pairwise correlations between the individual regressors;

  • a high VIF (Variance Inflation Factor).

The Variance Inflation Factor (VIF) is calculated for each variable in the model, based on the expression:

$$VIF_{j} = \frac{1}{{\left( {1 - R_{j}^{2} } \right)}}$$

where \(R_{j}^{2}\) is the coefficient of determination of the regression of the j-th regressor on the remaining covariates.

A high VIF reveals linear dependence between the j-th column and the remaining columns of the X matrix and, thus, the presence of multicollinearity (Lazaridis 2007). However, as Curto and Pinto (2011) point out, the real impact on the variance can be overestimated by the VIF and, for this reason, CVIF can be used in place of VIF. In formulas:

$$CVIF_{j} = \frac{{(1 - R^{2} )\,TSS/\left[ {(n - k)(1 - R_{j}^{2} )\,TSS_{j} } \right]}}{{(1 - R_{0}^{2} )\,TSS/\left[ {(n - k)\,TSS_{j} } \right]}}$$

or:

$$CVIF_{j} = VIF_{j} \times \frac{{1 - R^{2} }}{{1 - R_{0}^{2} }}$$

where \(R_{0}^{2} = R_{{yx_{2} }}^{2} + R_{{yx_{3} }}^{2} + \cdots + R_{{yx_{k} }}^{2}\).

In particular, \(R_{{yx_{k} }}^{2}\) is the squared correlation coefficient between the dependent variable and the k-th independent variable; the comparison is made keeping TSS and TSSj constant. In general, \(R_{0}^{2}\) can be lower or higher than R2, but when the independent variables are orthogonal we have \(R_{0}^{2} = {\text{R}}^{2}\).

When \(R_{0}^{2} > {\text{R}}^{2}\), CVIF ranges from 1 to +∞ and CVIFj > VIFj; when \(R_{0}^{2} < {\text{R}}^{2}\), CVIF ranges from 0 to 1 and CVIFj < VIFj. As regards interpretation, as suggested by Curto and Pinto (2011), the rule of thumb CVIFj > 10 can be used to indicate the presence of severe multicollinearity.
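As a concrete reference, the following R sketch computes VIFj and CVIFj = VIFj(1 − R2)/(1 − R02) for each regressor of a fitted linear model. It is an illustrative sketch rather than the code used in the paper, and the data frame in the usage comment is hypothetical.

```r
# Hedged sketch (not the authors' code) of the VIF and CVIF computations,
# for a linear model with response y and regressors x_2, ..., x_k.
vif_cvif <- function(formula, data) {
  full <- lm(formula, data = data)
  X    <- model.matrix(full)[, -1, drop = FALSE]     # regressors without the intercept
  y    <- model.response(model.frame(full))
  R2   <- summary(full)$r.squared
  R2_0 <- sum(cor(y, X)^2)                           # R0^2: sum of squared simple correlations with y
  vif  <- sapply(seq_len(ncol(X)), function(j) {
    R2_j <- summary(lm(X[, j] ~ X[, -j]))$r.squared  # R_j^2: x_j regressed on the other regressors
    1 / (1 - R2_j)
  })
  cvif <- vif * (1 - R2) / (1 - R2_0)                # corrected VIF of Curto and Pinto (2011)
  data.frame(variable = colnames(X), VIF = vif, CVIF = cvif)
}
# Usage (hypothetical data frame `d` with columns y, x2, x3, x4):
# vif_cvif(y ~ x2 + x3 + x4, data = d)
```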

Another common issue in economics is the spurious relationship between variables (Armor et al. 2017). As Chatelain and Ralf (2014) show, increasing the number of observations can lead to spurious inference in the presence of highly correlated classical suppressors. In particular, they propose a tool to address these issues, the Parameter Inflation Factor (PIF), obtained by analyzing a trivariate regression model of the form:

$$x_{1} = \beta_{12} x_{2} + \beta_{13} x_{3} + \varepsilon_{1,23}$$

PIF is defined as:

$$PIF_{12} = \left( {1 - \frac{{r_{13} }}{{r_{12} }}r_{23} } \right)VIF_{12}$$

where r12 and r13 are the correlation coefficients between x1 and the regressors x2 and x3, r23 is the correlation coefficient between the regressors, and VIF12 is the Variance Inflation Factor.

While the Variance Inflation Factor only depends on the correlation between the independent variables, the Parameter Inflation Factor also considers the correlation between the regressors and the dependent variable. In the presence of a high PIF value, there may be highly correlated classical suppressors.
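A direct transcription of the PIF expression might read as follows; it is only a sketch based on the formula above, and the default definition of VIF12 used here is our assumption rather than something stated in the text.

```r
# Illustrative sketch of the Parameter Inflation Factor for the trivariate
# model above, given the three pairwise correlations. By default vif12 is taken
# as 1/(1 - r23^2), the usual VIF of x2 in a regression of x1 on x2 and x3
# (an assumption about the notation, not stated explicitly in the text).
pif_12 <- function(r12, r13, r23, vif12 = 1 / (1 - r23^2)) {
  (1 - (r13 / r12) * r23) * vif12
}
pif_12(r12 = 0.6, r13 = 0.5, r23 = 0.9)   # a high value signals possible classical suppressors
```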

When regressors and error are correlated, the OLS estimator becomes less efficient because its variance increases (Vargha et al. 2013; Griffiths and Hajargasht 2016), even though it remains consistent (Krone et al. 2017).

Another framework that can be considered in the presence of multicollinearity is the Classical Linear Regression Model (CLRM) (Ayinde 2007):

$$y_{t} = \beta_{0} + \beta_{1} x_{1t} + \beta_{2} x_{2t} + \beta_{3} x_{3t} + \varepsilon_{t}$$

where t = 1, 2,…,n and εt ~ (0,σ2), assuming that x1 is correlated with the error ε and x2 is correlated with x1.

In addition, we can consider a generalized least squares (GLS) model:

$$y_{t} = \beta_{0} + \beta_{1} x_{1t} + \beta_{2} x_{2t} + \beta_{3} x_{3t} + u_{t}$$

where t = 1, 2,…,n; \(u_{t} = \rho u_{t - 1} + \varepsilon_{t}\) and ρ is the autocorrelation coefficient of the errors.

Starting from this point, Ayinde (2007) shows that the CLRM and GLS models are equivalent in the case of zero correlation. In addition, comparing the CLRM, GLS and Maximum Likelihood (ML) estimators using the MSE (Mean Square Error) criterion in a Monte Carlo simulation plan, the author shows that ML and GLS perform better than the OLS estimator when the number of replications is low, whereas with a high number of replications the OLS method is to be favored.

Another study, by Tran and Tsionas (2013), proposes the Generalized Method of Moments (GMM) estimator, showing that, if there is no correlation between the variables, it performs like the standard Maximum Likelihood estimator. In the case of strong correlation, however, the MLE becomes biased while the GMM estimator remains unbiased.

Finally, as Giacalone and Richiusa (2006) point out, Lp-norm estimators allow the residuals to be handled more efficiently, particularly when there is some degree of multicollinearity and the regressors are correlated with the residuals. This is because Lp-norm methods are adaptive procedures with respect to the error component of the model and not to the deterministic one.

As a result, by using Lp-norm estimators in the presence of stochastic regressors, it is possible to obtain more efficient estimates than with OLS.

3 The exponential power function and the Lp-norm estimators

The exponential power function (E.P.F.) is a family of probability functions proposed by Subbotin (1923) and studied, among others, by Vianelli (1963), Lunetta (1966), Mineo (1989), Gonin and Money (1989), Chiodi (1995) and Bottazzi and Secchi (2011).

The E.P.F. constitutes a valid generalization of the Gaussianity hypothesis, which is usually assumed even though, depending on the data at hand, it is not always fully supportable. In the literature, it is also known as the “Normal distribution of order p” or the “Generalized Error Distribution”, and it constitutes a parametric alternative to robust methods (Mineo 2003).

The density function of the E.P.F. is:

$$f_{p} (z) = \frac{1}{{2p^{1/p} \sigma_{p} \varGamma (1 + 1/p)}}\exp \left[ { - \frac{1}{p}\left| {\frac{{z - M_{p} }}{{\sigma_{p} }}} \right|^{p} } \right]$$
(1)

where Mp = E(z) is the location parameter, σp = (E[|z − Mp|p])1/p is the scale parameter and p > 0 is the shape parameter.
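As a minimal illustration, the density (1) can be coded directly; the R package of Mineo and Ruggieri (2005), mentioned at the end of this section, provides equivalent and more complete routines. The function name below is ours.

```r
# A minimal sketch of the E.P.F. density (1), meant only to make the roles of
# M_p, sigma_p and p explicit.
depf <- function(z, Mp = 0, sigma_p = 1, p = 2) {
  1 / (2 * p^(1 / p) * sigma_p * gamma(1 + 1 / p)) *
    exp(-abs((z - Mp) / sigma_p)^p / p)
}
# For p = 2 the E.P.F. reduces to the Gaussian density with standard deviation sigma_p
all.equal(depf(0.7, p = 2), dnorm(0.7))   # TRUE
```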

Considering the Pearson kurtosis index β2, we distinguish:

  • 0 < p < 1: distribution more leptokurtic than the double exponential, β2 > 6;

  • 1 < p < 2: leptokurtic distribution, 3 < β2 < 6;

  • p > 2: platykurtic distribution, 1.8 < β2 < 3.

For particular values of p, we have:

  • the Laplace distribution (p = 1, β2 = 6);

  • the Gaussian distribution (p = 2, β2 = 3);

  • the Uniform distribution (p → ∞, β2 → 1.8).

Considering a sample of n observed data (xi, yi), a general linear regression model is:

$$y_{i} = g\left( {x_{i} ,\theta } \right) + \varepsilon_{i}$$
(2)

where g(.) is a linear function.

Lp-norm estimators are useful generalizations of ordinary least squares estimators, obtained by replacing the exponent 2 with a general exponent p (Ekblom and Henriksson 1969; Forsythe 1972). Therefore, they minimize the sum of the p-th power of the absolute deviations of the observed points from the regression function:

$$S_{p} (\theta ) = \sum\limits_{i = 1}^{n} {\left| {y_{i} - g\left( {x_{i} ,\theta } \right)} \right|}^{p} \; {\text{with}}\;1 \le p < \infty$$
(3)

Under the usual assumptions, setting \(z = y_{i}\) and \(M_{p} = g(x_{i} ,\theta )\) in (1), the log-likelihood associated with the sample is given by:

$$l(\theta ,\sigma_{p} ,p) = - n\log \left[ {2p^{1/p} \sigma_{p} \varGamma (1 + 1/p)} \right] - \left( {p\sigma_{p}^{p} } \right)^{ - 1} \sum\limits_{i = 1}^{n} {\left| {y_{i} - g\left( {x_{i} ,\theta } \right)} \right|}^{p}$$
(4)
$$\begin{array}{*{20}l} {\frac{\partial l}{{\partial \theta_{j} }} = \sum\limits_{i = 1}^{n} {\left| {y_{i} - g(x_{i} ,\theta )} \right|}^{p - 1} sign(y_{i} - g(x_{i} ,\theta ))\frac{\partial g}{{\partial \theta_{j} }} = 0} \hfill \\ {\sum\limits_{i = 1}^{n} {\left| {y_{i} - g\left( {x_{i} ,\theta } \right)} \right|}^{p} = \hbox{min} \;{\text{with}}\;p \ge 1} \hfill \\ \end{array}$$
(5)

When the order p is specified, all the terms in (4), except for the last part containing the vector θ, are constant. Therefore, Maximum Likelihood estimators are equivalent to Lp-norm estimators (5).
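For a fixed and known p, the Lp-norm fit in (3)/(5) can be computed numerically. The following R fragment is a minimal sketch under the assumption of a model that is linear in the parameters; it is not the authors' implementation.

```r
# A minimal sketch of an Lp-norm fit for a fixed, known p (illustrative only).
lp_fit <- function(X, y, p) {
  stopifnot(p >= 1)
  start <- qr.solve(X, y)                       # OLS starting values
  obj   <- function(b) sum(abs(y - X %*% b)^p)  # objective (3) / (5)
  # "CG" is a Fletcher-Reeves-type conjugate gradient; for p close to 1 the
  # objective is not smooth and a derivative-free method may be preferable.
  optim(start, obj, method = "CG")$par
}

# Usage: with p = 2 the result coincides with OLS up to numerical tolerance
set.seed(1)
X <- cbind(1, matrix(rnorm(200), ncol = 2))
y <- as.numeric(X %*% c(1, 2, 3) + rnorm(100))
cbind(Lp = lp_fit(X, y, p = 2), OLS = qr.solve(X, y))
```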

If p is unknown, there are two related problems to consider:

  1. the estimation of the exponent p on the sample data;

  2. the choice of the minimization algorithm to obtain the regression parameters’ estimates.

Regarding the procedures to estimate p, the following proposals can be found in the literature:

  • Harter (1977), noting that p depends on \(\hat{\beta }_{2}\) (the sample residual kurtosis), proposed to select p using the following rules:

    if \(\hat{\beta }_{2} > 3.8\) use p = 1 (the least absolute deviations regression);

    if \(2.2 < \hat{\beta }_{2} < 3.8\) use p = 2 (the least squares regression);

    if \(\hat{\beta }_{2} < 2.2\) use \(p = \infty\) (the minimax or Chebychev regression).

  • Money et al. (1982) and Sposito (1982) proposed two different criteria, respectively (a short sketch implementing these selection rules, together with Harter’s, is given after this list):

$$\hat{p} = 9/\hat{\beta }_{2}^{2} + 1\;{\text{for }}1 \le p < \infty$$
(6)
$$\hat{p} = 6/\hat{\beta }_{2}^{2} \;{\text{for }}1 < p < 2$$
(7)
  • Mineo (1989) proposed the Generalized Kurtosis βK, as described in Sect. 4.

  • Mineo (1994) considered a new method to estimate p, based on an empirical index called VI. This estimation method relies on a two-step algorithm in which the estimates of p and of the other parameters are evaluated iteratively: the shape parameter is estimated by numerically solving an equation in which the theoretical value of the index is set equal to the empirical one, while the other parameters are estimated from the corresponding maximum likelihood estimators as functions of the current value of \(\hat{p}\).

  • Agrò (1995) proposed maximum likelihood estimation both for the E.P.F. parameters and for the shape parameter p. It is a two-step process which, however, is essentially suitable for medium to large samples (n > 50).

  • Giacalone (1997) proposed an algorithm based on a two-step alternating procedure that first estimates the θ parameter vector by means of the classical conjugate gradient algorithm (Fletcher and Reeves 1964) and then estimates p using a joint inverse function of I and β2, obtained by comparing empirical and theoretical moments, i.e. matching (10) with (12) and (9) with (11). The minimization algorithm stops when the variation of p is no longer significant (Everitt and Hand 1987).

  • Agrò (1999) proposed an adjustment of the likelihood estimation, based on Cox and Reid (1987), which consists in a reparametrization that leads to asymptotically uncorrelated estimators and, consequently, to better results (at least, for n > 30).

  • Finally, Mineo and Ruggieri (2005) developed an R package for dealing with the exponential power function, with functions to compute the density function, the distribution function and the quantiles from an E.P.F., available on the Comprehensive R Archive Network (CRAN).
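The simplest of these proposals can be written down in a few lines. The sketch below transcribes Harter’s rule and Eqs. (6)–(7) exactly as given in the text; it is illustrative only, and the function name is ours.

```r
# Compact sketch of the p-selection rules quoted above, driven by the sample
# residual kurtosis beta2_hat (Harter 1977; Money et al. 1982; Sposito 1982).
select_p <- function(beta2_hat, rule = c("harter", "money", "sposito")) {
  rule <- match.arg(rule)
  switch(rule,
    harter  = if (beta2_hat > 3.8) 1 else if (beta2_hat >= 2.2) 2 else Inf,
    money   = 9 / beta2_hat^2 + 1,   # Eq. (6)
    sposito = 6 / beta2_hat^2        # Eq. (7), as given in the text
  )
}
select_p(6.0, "harter")   # heavy-tailed residuals -> p = 1 (least absolute deviations)
select_p(3.0, "money")    # Gaussian-like residuals -> p = 2
```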

4 The exponential power function kurtosis indexes

For the density (1), the theoretical moment of order k is a function of the shape parameter p as follows:

$$E\left| {z - M_{p} } \right|^{k} = \left( {p\sigma_{p}^{p} } \right)^{k/p} \frac{\varGamma ((k + 1)/p)}{\varGamma (1/p)} = \mu_{k}$$
(8)

The ratio of the moment of order 2k to the squared moment of order k depends only on the shape parameter p. This theoretical relation is also known as the “Generalized Kurtosis” (Mineo 1989):

$$\beta_{k} = \frac{{\mu_{2k} }}{{\mu_{k}^{2} }} = \frac{\varGamma (1/p)\varGamma ((2k + 1)/p)}{{[\varGamma ((k + 1)/p)]^{2} }}$$

If k = 2, we get the Pearson’s kurtosis index:

$$\beta_{2} = \frac{{\mu_{4} }}{{\mu_{2}^{2} }} = \frac{\varGamma (1/p)\varGamma (5/p)}{{[\varGamma (3/p)]^{2} }}$$
(9)

If k = 1, considering the square root of the reciprocal, we get Geary’s length-of-tails index (Geary 1936):

$$I = \frac{{\mu_{1} }}{{\sqrt {\mu_{2} } }} = \frac{\varGamma (2/p)}{{\sqrt {\varGamma (1/p)\varGamma (3/p)} }}$$
(10)

The indexes I and β2 behave differently as p varies (Giacalone 1996, 1997). By computing the sample values of I and β2, it is possible to obtain, by inverse interpolation, two different estimates of p.

Gonin and Money (1987), Kendall and Stuart (1966) and Lunetta (1966) considered the unbiased estimates of the second and fourth order sample moments, with correction factors depending on the sample size n:

$$\begin{array}{*{20}l} {\hat{\mu }_{2} = \frac{1}{n - 1}\sum\limits_{i} {\left( {\varepsilon_{i} - \bar{\varepsilon }} \right)}^{2} } \hfill \\ {\hat{\mu }_{4} = \frac{{\left( {n^{2} - 2n + 3} \right)}}{{\left( {n - 1} \right)\left( {n - 2} \right)\left( {n - 3} \right)}}\sum\limits_{i} {\left( {\varepsilon_{i} - \bar{\varepsilon }} \right)}^{4} - \frac{{3\left( {n - 1} \right)(2n - 3)}}{{n\left( {n - 2} \right)\left( {n - 3} \right)}}\hat{\mu }_{2}^{2} } \hfill \\ \end{array}$$

where \(\varepsilon_{i}\) and \(\bar{\varepsilon }\) are, respectively, the estimated residuals and their average.

The ratio of \(\hat{\mu }_{4}\) to \(\hat{\mu }_{2}^{2}\) gives the following estimator of β2:

$$\hat{\beta }_{2} = \frac{{\hat{\mu }_{4} }}{{\hat{\mu }_{2}^{2} }}$$
(11)

For the I empirical index we obtain:

$$\hat{I} = \frac{{\sum\nolimits_{i} {\left| {\varepsilon_{i} - \bar{\varepsilon }} \right|} }}{{\sqrt {\sum\nolimits_{i} {\left( {\varepsilon_{i} - \bar{\varepsilon }} \right)^{2} } } }}\frac{{\sqrt {n - 1} }}{n}$$
(12)
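All four quantities are straightforward to compute. The following R sketch, which is not the authors' code, transcribes (9)–(12) and checks them on a Laplace-distributed sample, for which β2 should be close to 6.

```r
# Sketch of the theoretical indexes (9)-(10) and of their empirical
# counterparts (11)-(12), computed from a vector of residuals `eps`.
beta2_theo <- function(p) gamma(1 / p) * gamma(5 / p) / gamma(3 / p)^2      # Eq. (9)
I_theo     <- function(p) gamma(2 / p) / sqrt(gamma(1 / p) * gamma(3 / p))  # Eq. (10)

beta2_emp <- function(eps) {                                                # Eq. (11)
  n <- length(eps); d <- eps - mean(eps)
  m2 <- sum(d^2) / (n - 1)
  m4 <- (n^2 - 2 * n + 3) / ((n - 1) * (n - 2) * (n - 3)) * sum(d^4) -
        3 * (n - 1) * (2 * n - 3) / (n * (n - 2) * (n - 3)) * m2^2
  m4 / m2^2
}
I_emp <- function(eps) {                                                    # Eq. (12)
  n <- length(eps); d <- eps - mean(eps)
  sum(abs(d)) / sqrt(sum(d^2)) * sqrt(n - 1) / n
}

# Example: Laplace residuals (p = 1) should give beta2 near 6 and I near I_theo(1)
set.seed(1)
eps <- rexp(1000) * sample(c(-1, 1), 1000, replace = TRUE)
c(beta2_emp(eps), beta2_theo(1), I_emp(eps), I_theo(1))
```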

5 The Lpmin algorithm

The proposed algorithm is based on a two-steps alternating procedure:

  1. minimization procedure to estimate the parameters;

  2. joint inverse function of I and β2 to estimate p.

The algorithm stops when p no longer varies significantly. To obtain this estimate, we minimize the difference between the empirical and theoretical indexes, as in (13), rather than setting this difference equal to zero: the latter choice, explored in a different algorithm, caused convergence problems (Counihan 1985; Giacalone 2002).

The function used to estimate p is therefore the following:

$$\left[ {\frac{{I - \hat{I}}}{0.86054}} \right]^{2} + \left[ {\frac{{\beta_{2} - \hat{\beta }_{2} }}{25.2}} \right]^{2} = \hbox{min}$$
(13)

where \(I,\hat{I},\beta_{2} ,\hat{\beta }_{2}\) are given, respectively, by (10), (12), (9) and (11). For simplicity, we can express (13) as follows:

$$\left[ {f(p)} \right]^{2} + \left[ {g(p)} \right]^{2} = \hbox{min}.$$

As the indexes I and β2 have different orders of magnitude and different variances, it is necessary to remove this difference in order to obtain a joint estimate of p; the standardization therefore constrains 0 < f(p) < 1 and 0 < g(p) < 1.

The maximum theoretical values of f(p) and g(p) are the chosen standardization factors.

Considering p ranging from 0.5 to 10, 25.2 is the value of β2 when p = 0.5, and 0.86054 is the value of I when p = 10. In this way, it is possible to obtain a joint estimator made up of two squared functions. Both kurtosis indexes are used because the norm-1 kurtosis (I) was observed to be a valid choice for estimating p in the presence of outliers, while the norm-2 kurtosis (β2) performs better when the sample values gather around the center of the distribution.

So, using the relation (9) we calculate max(β2) = 25.2, for p = 0.5, whilst using the relation (10) we calculate max(I) = 0.86054, for p = 10.

The proposed algorithm is then specified in the following steps (Giacalone 1997):

  1. set i = 0 and p0 = 2;

  2. fit the model (2) to the data using the previous step value pi;

  3. compute the estimated residuals εi = yi − g(xi, θ) and their average \(\bar{\varepsilon }\), and insert these quantities into (13), the sum of the two squared functions to be minimized;

  4. minimize the function (13) to obtain pi+1, the new estimate of p;

  5. compare the estimate pi+1 with the previous pi: if |pi+1 − pi| > 0.01, set i = i + 1 and repeat steps 2–5;

  6. otherwise, stop the algorithm, taking the current parameter estimates \(\hat{\theta }\) as the Lp-norm estimates and the current value of p as the joint estimate of p.

In step 2, a nonlinear Lp-norm estimation problem is solved. It could be tackled using the optimality conditions encountered in unconstrained optimization (McCormick 1983); here, the minimization algorithm of Fletcher and Reeves (1964) is used to take the special structure of the problem directly into account.

In step 3, we calculate both empirical and theoretical I and β2 kurtosis indexes to obtain the values to estimate p from (13).

In step 4, a parabolic interpolation method (Everitt and Hand 1987) is used to find the minimum of the sum of squared functions (13). The convergence of the proposed algorithm was empirically verified by simulating 5000 samples of different sizes (n = 30, 50, 100, 200, 500, 1000) for six fixed theoretical values of p.
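To make the six steps concrete, the following R sketch outlines one possible implementation of the alternating procedure; it is illustrative only and is not the authors' original code. The index functions repeat those of the Sect. 4 sketch so that the block is self-contained, and R's optimize(), which combines golden-section search with successive parabolic interpolation, stands in for the parabolic interpolation step.

```r
# Rough, self-contained sketch of the Lpmin alternating procedure (steps 1-6).
beta2_theo <- function(p) gamma(1 / p) * gamma(5 / p) / gamma(3 / p)^2      # Eq. (9)
I_theo     <- function(p) gamma(2 / p) / sqrt(gamma(1 / p) * gamma(3 / p))  # Eq. (10)
beta2_emp  <- function(e) {                                                 # Eq. (11)
  n <- length(e); d <- e - mean(e); m2 <- sum(d^2) / (n - 1)
  m4 <- (n^2 - 2 * n + 3) / ((n - 1) * (n - 2) * (n - 3)) * sum(d^4) -
        3 * (n - 1) * (2 * n - 3) / (n * (n - 2) * (n - 3)) * m2^2
  m4 / m2^2
}
I_emp <- function(e) {                                                      # Eq. (12)
  n <- length(e); d <- e - mean(e)
  sum(abs(d)) / sqrt(sum(d^2)) * sqrt(n - 1) / n
}

lp_fit <- function(X, y, p, start = qr.solve(X, y)) {
  # Step 2: Lp-norm fit of a linear model for a given p (conjugate gradient)
  optim(start, function(b) sum(abs(y - X %*% b)^p), method = "CG")$par
}

lp_min <- function(X, y, tol = 0.01, max_iter = 50) {
  p <- 2                                               # Step 1: p0 = 2
  for (i in seq_len(max_iter)) {
    theta <- lp_fit(X, y, p)                           # Step 2
    eps   <- as.numeric(y - X %*% theta)               # Step 3: residuals
    crit  <- function(pp)                              # standardized criterion (13)
      ((I_theo(pp) - I_emp(eps)) / 0.86054)^2 +
      ((beta2_theo(pp) - beta2_emp(eps)) / 25.2)^2
    p_new <- optimize(crit, interval = c(0.5, 10))$minimum   # Step 4
    if (abs(p_new - p) <= tol) { p <- p_new; break }         # Steps 5-6
    p <- p_new
  }
  list(coefficients = lp_fit(X, y, p), p = p)
}
```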

6 A simulation study for efficiency comparison

After looking at the more recent literature (e.g. Alabi et al. 2008, 2014), we chose to consider 5000 samples of size n = 30, 50, 100, 200, 500 and 1000, generated from an E.P.F., and 6 values of p (1.1, 1.5, 2.0, 2.5, 3.0, 3.5) for three different degrees of multicollinearity (low R2 = 0.33, medium R2 = 0.66, high R2 = 0.99).

The algorithm to generate the εi (for p ≥ 1) from an E.P.F. is suggested by Chiodi (1986). The values of yi are given by the following multiple regression model:

$$y_{i} = \beta_{0} + \beta_{1} x_{i1} + \beta_{2} x_{i2} + \beta_{3} x_{i3} + \varepsilon_{i}$$
(14)

where X1 and X2 are independent, identically distributed variables drawn from a standardized Gaussian distribution, and X3 is a linear combination of X1 and X2:

$$X_{3} = X_{1} + X_{2} + Z\;{\text{with }}Z \sim {\text{N(0,}}\sigma_{z} )$$
(15)

Therefore, we can write the associated variance and covariance matrix:

$${\text{S}} = \left\| {\begin{array}{*{20}c} {} & {X_{1} } & {X_{2} } & {X_{3} } \\ {X_{1} } & 1 & 0 & 1 \\ {X_{2} } & 0 & 1 & 1 \\ {X_{3} } & 1 & 1 & {2 + \sigma_{z}^{2} } \\ \end{array} } \right\|$$

It is easy to notice that \(E(X_{3}^{2} ) = E(X_{1}^{2} ) + E(X_{2}^{2} ) + \sigma_{z}^{2} = 2 + \sigma_{z}^{2}\), and the corresponding correlation matrix is equal to:

$${\text{A}} = \left\| {\begin{array}{*{20}c} {} & {X_{1} } & {X_{2} } & {X_{3} } \\ {X_{1} } & 1 & 0 & {1/\sqrt {2 + (\sigma_{z}^{2} )} } \\ {X_{2} } & 0 & 1 & {1/\sqrt {2 + (\sigma_{z}^{2} )} } \\ {X_{3} } & {1/\sqrt {2 + (\sigma_{z}^{2} )} } & {1/\sqrt {2 + (\sigma_{z}^{2} )} } & 1 \\ \end{array} } \right\|$$

where \(A_{13} = \frac{{\text{cov} (X_{1} ,X_{3} )}}{{\sqrt {\text{var} (X_{1} )\text{var} (X_{3} )} }} = \frac{1}{{\sqrt {2 + \left( {\sigma_{z}^{2} } \right)} }} = A_{23}\), \(R_{3.12}^{2} = 1 - \frac{\det A}{{\det A_{33} }} = \frac{2}{{2 + \left( {\sigma_{z}^{2} } \right)}}\), and \(\det A_{33}\) is the cofactor of the element occupying the same position in the correlation matrix (Leti 1983).

In this particular regression model, the degree of multicollinearity is inversely related to \(\sigma_{z}^{2}\): the smaller \(\sigma_{z}^{2}\), the stronger the collinearity.

Specifically, in our simulation we set the parameter values β0 = 1, β1 = 2, β2 = 3 and β3 = 4, and a comparative analysis was performed by applying the following three estimation methods to the same samples (a minimal sketch of the data-generating step is given after the list):

  1. Least squares estimators (L2);

  2. Lp-norm estimators with theoretical p of the E.P.F. (Lp);

  3. Lp-norm estimators with p as in our proposal (Lpmin).
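As an illustration of this design, one simulated sample could be generated as in the following sketch. Note that the E.P.F. errors are drawn here with a standard gamma transform rather than with the Chiodi (1986) algorithm used in the paper, and the function names are ours.

```r
# Illustrative generator for one simulated sample under (14)-(15). If
# G ~ Gamma(1/p, 1), then sign * sigma_p * (p*G)^(1/p) has density (1) with M_p = 0.
repf <- function(n, p, sigma_p = 1) {
  g <- rgamma(n, shape = 1 / p, scale = 1)
  sample(c(-1, 1), n, replace = TRUE) * sigma_p * (p * g)^(1 / p)
}

gen_sample <- function(n, p, sigma_z, beta = c(1, 2, 3, 4)) {
  x1 <- rnorm(n); x2 <- rnorm(n)
  x3 <- x1 + x2 + rnorm(n, sd = sigma_z)   # Eq. (15): collinearity grows as sigma_z shrinks
  eps <- repf(n, p)
  y   <- beta[1] + beta[2] * x1 + beta[3] * x2 + beta[4] * x3 + eps
  data.frame(y, x1, x2, x3)
}

# sigma_z = 0.01 corresponds to R^2_{3.12} = 2/(2 + sigma_z^2), i.e. almost 1
d <- gen_sample(n = 100, p = 1.5, sigma_z = 0.01)
summary(lm(x3 ~ x1 + x2, data = d))$r.squared
```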

The results of the simulation plan are reported in Tables 10, 11, 12, 13, 14 and 15 in “Appendix 1”, where the OLS simulation results are marked in bold; furthermore, in Tables 13, 14 and 15, the number of samples with a difficult convergence (“Conv.”) is also presented. In the tables, we report mean (M) and variance (V) of the regression model parameters. We can see that, for any value of p and for any method, the estimates of β0, β1, β2 and β3 are less efficient for small samples (e.g. for n = 30 and n = 50) but their variances decrease as n increases.

We calculated the relative efficiency for every value of p and n, and the results are shown in Tables 1A, 1B, 2A, 2B, 3A and 3B. From the tables, we notice that Lp-norm estimators (Lp and Lpmin) give better estimates for every parameter, compared to the least squares method, especially when the value of p is not close to 2: there is a gain in efficiency in all the cases considered, except for the case p = 2, in which the error follows a Gaussian distribution.

Table 1 Relative efficiency of Lp-norm estimators compared to OLS (parameters β0, β1 β2, β3) on 5000 samples of size n = 30, n = 50, n = 100, n = 200, n = 500, n = 1000 (σz = 2, R2 = 0.33)
Table 2 Relative efficiency of Lp-norm estimators compared to OLS (parameters β0, β1 β2, β3) on 5000 samples of size n = 30, n = 50, n = 100, n = 200, n = 500, n = 1000 (σz = 1, R2 = 0.66)
Table 3 Relative efficiency of Lp-norm estimators compared to OLS (parameters β0, β1 β2, β3) on 5000 samples of size n = 30, n = 50, n = 100, n = 200, n = 500, n = 1000 (σz = 0.01, R2 = 0.99)

However, the p parameter of a real dataset is rarely exactly equal to 2. When the value of p differs from 2, the OLS method always yields less efficient estimates than both the Lp and Lpmin methods.

Indeed, in most cases, the Lpmin method produces estimates that are more efficient than OLS but less efficient than the Lp method, which uses the true value of p. Therefore, Lpmin can be considered halfway between the L2 and Lp methods, because the p parameter is estimated from the data sample.

The relative efficiency indexes (Tables 1A, 1B, 2A, 2B, 3A, 3B) show that, for medium and high degrees of multicollinearity (R2 = 0.66; R2 = 0.99), Lp-norm estimators are more efficient than OLS. However, in the presence of a low degree of multicollinearity (R2 = 0.33), the efficiency of the OLS estimator is sometimes higher than that of the Lp-norm estimators.

Since the distribution of the residuals is known, the Lp-norm estimator with the true p is also the Maximum Likelihood estimator in our setting (and coincides with OLS when p = 2). As the sample size tends to infinity, a Maximum Likelihood estimator is asymptotically efficient, since it attains the Cramér-Rao lower bound. Given that all of these estimators share this asymptotic efficiency property, the efficiency gain is larger for small samples.

7 A real data application

To better show the difference between OLS and Lp-norm estimates, we built a dataset made up of four variables, as in Mouza and Targoudtzidis (2012): GDP per capita, unemployment rate and working hours are examined together in order to show a connection with labor accidents in the UK. Working hours are considered along with the unemployment rate, as the latter measures any occupation of the individuals throughout a period of time, without giving any information about the nature of this occupation (e.g. part-time and occasional workers are considered employed even if they spend a small amount of time at the workplace). The data, presented in “Appendix 2” (Table 16), refer to the United Kingdom over the period 1971–2007 (yearly, 37 observations).

In more detail, the variables used in our application are the following:

  1. GDP—Gross Domestic Product per capita (thousand US dollars, current prices and purchasing power parity), source: OECD databank;

  2. UR—unemployment rate (percentage), source: OECD databank;

  3. AH—average hours worked (hours per year per worker), source: OECD databank;

  4. FI—fatal injuries of employees (absolute numbers), source: UK national statistics database, health and safety executive.

To study the relationship between the economic cycle and workplace accidents, we estimated the following regression model:

$$FI =\upbeta_{0} +\upbeta_{1} GDP +\upbeta_{2} UR +\upbeta_{3} AH + \varepsilon$$

The R2 for this regression is 0.886, and the Adjusted R2 is 0.875. In this model, there are four parameters to be estimated (including the intercept), just as in our simulation study.

In the following table (Table 4), we show the VIF values obtained by taking each explanatory variable in turn as the dependent variable and the remaining ones as regressors, together with the corresponding R2:

Table 4 R2, adjusted R2 and VIF for each explanatory variable in the model

Regarding VIF interpretation, and bearing in mind the limitations highlighted by O’Brien (2007), we opted for the “rule of 10” proposed by Menard (1995). Since none of the VIFs exceeds 10, we conclude that there is no severe collinearity in this regression, but only a medium degree of collinearity. In this case, according to the results of the simulation study, our method yields better estimates than OLS.

The results of the OLS estimation (p = 2) are (Table 5):

Table 5 OLS regression results

Thus, the OLS method returns the following model:

$$\hat{Y} = - 255.40 - 13.97X_{1} - 15.14X_{2} + 0.54X_{3}$$
(16)

Using the Lp-norm estimation method proposed in this paper (Lpmin), we estimated p = 1.696 (leptokurtic distribution of residuals), and the results are (Table 6):

Table 6 Lp-norm regression results

Thus, the Lp-norm method returns the following model:

$$\hat{Y} = - 25.01 - 14.68X_{1} - 16.06X_{2} + 0.42X_{3}$$
(17)

As the results show, the intercept (β0) estimate is much closer to zero under the Lp-norm method. This is a first indication of better accuracy of the estimates: when the values of the other variables (e.g. GDP and Average Working Hours) are zero, it is natural to expect a value of Fatal Injuries as close as possible to zero.

Table 7 reports the minimized value of the objective function (3) for the two methods and shows that the Lp-norm method brings a clear improvement in the efficiency of the estimates.

Table 7 Minimization of the objective function

The results in Table 7 indicate that our estimates are better than the ones obtained by OLS, as they are characterized by a significantly lower variability of the residuals. The objective function represents the stochastic component of the model; the deterministic component is therefore better explained by the Lp-norm method, given the lower value of the objective function. The large difference in the minimized values can be largely attributed to the marked difference in the estimated intercepts.

Below, for each value of p, we present the scatter plot of residuals against fitted values (Fig. 1). As the plots suggest, the variances of the error terms are not equal.

Fig. 1 Scatter plot of residuals against fitted values, for p = 1.696 and p = 2

The following figure (Fig. 2) shows, for each value of p, the Normal Quantile–Quantile plot. As the points do not form a straight line, the plots suggest that the data do not come from a Normal distribution and, therefore, a different distribution should be considered.

Fig. 2 Normal quantile–quantile plot, for p = 1.696 and p = 2

Dynamic graphic techniques could also be used, for a graphical comparison of the two different estimation methods (Destefanis and Porzio 1999).

Moreover, we performed a regression analysis in terms of elasticity. The results, shown in Tables 8 and 9, are noteworthy, since they suggest, as also evidenced by Mouza and Targoudtzidis (2012), that fatal injuries are to some extent related to, and explained by, labor market indicators.

Table 8 OLS regression results (in terms of elasticity)
Table 9 Lp-norm regression results (in terms of elasticity)

It is interesting to note that the signs of the estimated coefficients are the same under both our estimation method and OLS, which supports the robustness of our analysis.

Fatal Injuries appear inelastic with respect to GDP per capita and the Unemployment Rate but elastic with respect to Average Working Hours. In particular, although FI is positively influenced by an increase in unemployment, this relationship is inelastic. Overall, we might expect rising unemployment rates to increase the marginal working hours of those who remain employed, leading to an increase in workplace accidents. Indeed, the AH coefficient is positive and the relationship is elastic: when AH increases, FI increases as well. The reason behind this result is statistically intuitive: as the time spent at the workplace increases, the chance of injury increases accordingly.

In this application, although there are some improvements in the estimates from using Lp-norm methods, principally linked to a better explanation of the deterministic part of the regression model, the results are in line with those obtained using OLS. This is because the data are affected by only a medium degree of collinearity and the estimated p is close to two.

8 Conclusions

Multicollinearity indicates a situation in which the independent variables in a regression model are highly correlated, leading to instability and large variance of the OLS estimator. For this reason, the absence of multicollinearity is essential to obtain optimal estimates from a multiple regression model.

Ordinary least squares, one of the simplest and most widely used estimation methods, relies on several binding assumptions regarding the error terms and the nature of the independent variables.

In this paper, Lp-norm methods are considered not only in order to mitigate the effects of multicollinearity in the model, but also in order to improve the parameter estimation.

By using Lp-norm estimators, a better performance in terms of the variance of the parameter estimates is generally gained in the case of non-Normal symmetric error distributions, compared with least squares, even when considering a model with collinear regressors.

When interpreting the simulation results, we notice that the improvements obtained by using Lp-norm estimators in place of least squares are more evident in the case σz = 0.01 and \(R_{3.12}^{2} = 0.99\) (high collinearity) than in the case σz = 1 and \(R_{3.12}^{2} = 0.66\) (medium collinearity). Only in the case of low collinearity does the OLS method sometimes give more efficient estimates.

Therefore, the use of Lp-norm methods in the presence of stochastic regressors is strongly recommended. This is linked to the characteristics of these methods, which are adaptive procedures with respect to the error component of the model rather than to the deterministic one.

Finally, the real data application is particularly interesting, as it confirms that the coefficients estimated by the Lp-norm method can describe the data better than those obtained through OLS. Lp-norm estimators, characterized by a lower value of the objective function than OLS, allow for a better explanation of the deterministic component of the model.

All the simulations were made using R 3.3.2 and Stata 12.0.