
1 Background and Introduction

When correlation between predictor variables is moderate or high, coefficients estimated using traditional regression techniques become unstable or cannot be uniquely estimated due to multicollinearity (singularity of the covariance matrix). In the case of high-dimensional data, where the number of predictor variables P approaches or exceeds the sample size N, such instability is often accompanied by perfect or near-perfect predictions within the analysis sample. However, this seemingly good predictive performance usually reflects overfitting and tends to deteriorate when the model is applied to new cases outside the sample.

The primary “regularization” approaches that have been proposed for dealing with this problem are (1) penalized regression, such as Ridge, Lasso and Elastic Net, and (2) dimension reduction methods, such as Principal Component Regression and pls Regression (pls-r). In this paper we describe a new method similar to pls-r called Correlated Component Regression (ccr) and an associated step-down algorithm for reducing the number of predictors in the model to P* < P. ccr has different variants depending upon the scale type of the dependent variable (e.g., ccr-linear regression for continuous Y, ccr-logistic regression for dichotomous Y, ccr-Cox regression for survival data). Unlike the other regularization approaches, the ccr algorithm shares with traditional maximum likelihood regression approaches the favorable property of scale invariance.

In this paper we introduce ccr, and describe its performance on various real and simulated data sets. The basic ccr algorithms are described in Sect. 2. ccr is contrasted with pls-r in a linear regression key driver application with few predictors (Sect. 3) and in an application with Near Infrared (NIR) data involving many predictors (Sect. 4). We then describe the ccr extension to logistic regression, linear discriminant analysis (lda) and survival analysis and discuss results from simulated data where suppressor variables are included among the predictors (Sect. 5). Results from our simulations suggest that ccr may be expected to outperform other sparse regularization approaches, especially when important suppressor variables are included among the predictors. We conclude with a discussion of a hybrid latent class ccr model extension (Sect. 6).

2 Correlated Component Regression

ccr utilizes K < P correlated components in place of the P predictors to predict an outcome variable. Each component \(S_k\) is an exact linear combination of the predictors, \(X = (X_{1},X_{2},\ldots,X_{P})\), the first component \(S_1\) capturing the effects of those predictors that have direct effects on the outcome. The ccr-linear regression (ccr-lm) algorithm proceeds as follows:

Estimate the loading \(\lambda _{g}^{(1)}\) on \(S_1\), for each predictor \(g = 1,2,\ldots,P\), as the simple regression coefficient in the regression of Y on \(X_g\): \(\lambda _{g}^{(1)} = \frac{cov(Y,X_{g})}{var(X_{g})}\). Then \(S_1\) is defined as a weighted average of all 1-predictor effects:

$$\displaystyle{ S_{1} = \frac{1} {P}\sum \limits _{g=1}^{P}\lambda _{ g}^{(1)}X_{ g} }$$
(1)

The predictions for Y in the 1-component ccr model are obtained from the simple ols regression of Y on \(S_1\). Similarly, predictions for the 2-component ccr model are obtained from the ols regression of Y on \(S_1\) and \(S_2\), where the second component \(S_2\) captures the effects of suppressor variables that improve prediction by removing extraneous variation from one or more predictors that have direct effects. Component \(S_{{k}^{{\prime}}}\), for \({k}^{{\prime}} > 1\), is defined as a weighted average of all 1-predictor partial effects, where the partial effect for predictor g is computed as the partial regression coefficient in the ols regression of Y on \(X_g\) and all previously computed components \(S_{k},k = 1,\ldots,{k}^{{\prime}}- 1\). For example, for K = 2 we have:

$$\displaystyle{ Y = \alpha + \gamma _{1.g}^{(2)}S_{ 1} + \lambda _{g}^{(2)}X_{ g} + \epsilon _{g}^{(2)} }$$
(2)

and \(S_{2} = \frac{1} {P}\sum \limits _{g=1}^{P}\lambda _{ g}^{(2)}X_{ g}\), or, since the factor 1/P merely rescales the component and does not affect the subsequent regression on the components, more simply \(S_{2} =\sum \limits _{ g=1}^{P}\lambda _{g}^{(2)}X_{g}\).

As mentioned earlier, predictions for Y in the K-component ccr model are obtained from the ols regression of Y on \(S_{1},\ldots,S_{K}\). For example, for K = 2: \(\hat{Y } = {\alpha }^{(2)} + b_{1}^{(2)}S_{1} + b_{2}^{(2)}S_{2}\). In general, K components are computed, where the optimal value, K*, is determined by M-fold cross-validation (CV). For K = 1 (maximum regularization), no predictor correlation information is used in parameter estimation. As K is repeatedly incremented by 1, more and more of the information provided by the predictor correlations is utilized, and M-fold CV identifies the value of K at which near multicollinearity begins to degrade predictive performance, K* being set accordingly. Deterioration begins at K = 3 for the example illustrated in Sect. 3, and thus K* = 2.
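To make the component-construction steps concrete, the following is a minimal sketch of the basic ccr-lm algorithm in Python/NumPy. It is an illustration only, written under the simplifying convention noted above (the 1/P factor is dropped, since rescaling a component does not change the fitted values); the function and variable names are ours and do not correspond to any released ccr implementation such as XLSTAT-CCR.

```python
import numpy as np

def ccr_lm(X, y, K):
    """Minimal sketch of K-component CCR-linear regression.

    X: (N, P) predictor matrix, y: (N,) outcome.
    Returns the intercept, implied coefficients for X, loadings, and
    component weights.
    """
    N, P = X.shape
    components = []     # component scores S_1, ..., S_K (each of length N)
    loadings = []       # loading vectors lambda^(k) (each of length P)

    for k in range(K):
        lam = np.empty(P)
        for g in range(P):
            # Regress y on X_g plus all previously formed components;
            # lambda_g^(k) is the (partial) coefficient of X_g.
            Z = np.column_stack([np.ones(N)] + components + [X[:, g]])
            coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
            lam[g] = coef[-1]
        S_k = X @ lam   # component = weighted sum of predictors
        loadings.append(lam)
        components.append(S_k)

    # Final OLS regression of y on S_1, ..., S_K gives the component weights b_k.
    S = np.column_stack([np.ones(N)] + components)
    b, *_ = np.linalg.lstsq(S, y, rcond=None)
    alpha, b_k = b[0], b[1:]

    # Re-express the model in terms of the original predictors:
    # beta_g = sum_k b_k * lambda_g^(k)
    beta = np.column_stack(loadings) @ b_k
    return alpha, beta, loadings, b_k

# Predictions for new data: y_hat = alpha + X_new @ beta
```

The final line of the function anticipates the re-expression of the model in terms of the original predictors that is described next.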

Any K-component ccr model can be re-expressed to obtain regression coefficients for X by substituting for the components as follows:

$$\displaystyle{ \hat{Y } = {\alpha }^{(K)}+\sum \limits _{ k=1}^{K}b_{ k}^{(K)}S_{ k} = {\alpha }^{(K)}+\sum \limits _{ k=1}^{K}b_{ k}^{(K)}\sum \limits _{ g=1}^{P}\lambda _{ g}^{(k)}X_{ g} = {\alpha }^{(K)}+\sum \limits _{ g=1}^{P}\beta _{ g}X_{g} }$$

Thus, the regression coefficient \(\beta_g\) for predictor \(X_g\) is simply the weighted sum of its loadings, where the weights are the regression coefficients for the components (component weights) in the K-component model: \(\beta _{g} =\sum \limits _{ k=1}^{K}b_{k}^{(K)}\lambda _{g}^{(k)}\).

Simultaneous variable reduction is achieved using a step-down algorithm where at each step the least important predictor is removed, importance being defined by the absolute value of the standardized coefficient \(\beta _{g}^{{\ast}} = (\sigma _{g}/\sigma _{Y })\beta _{g}\), where σ denotes the standard deviation. M-fold CV is used to determine the two tuning parameters: the number of components K and the number of predictors P.

Consider an example with 6 predictors. For any given value of K, and say M = 10 folds, the basic ccr algorithm is applied 10 times, generating predictions for the cases in each of the 10 folds based on models with all 6 predictors, yielding a baseline (iteration = 0) CV-R²(K) for P = 6. In iteration 1, the variable reduction algorithm eliminates 1 predictor, which need not be the same predictor in all 10 subsamples, and each resulting 5-predictor model is used to obtain new predictions for the associated omitted fold, yielding CV-R²(K) for P = 5. In iteration 2, the variable reduction process continues, resulting in 10 4-predictor models and CV-R²(K) for P = 4. Following the last iteration, P*(K) is determined as the value of P associated with the maximum CV-R²(K).
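A hedged sketch of this step-down procedure, building on the illustrative ccr_lm function above, might look as follows; the scikit-learn utilities are used only for fold generation and scoring, and the loop structure mirrors the description above rather than any released ccr implementation.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

def step_down_cv(X, y, K, n_folds=10, random_state=0):
    """For a fixed K, return CV-R^2 indexed by the number of retained
    predictors (P, P-1, ..., 1), computed from out-of-fold predictions."""
    N, P = X.shape
    preds = {p: np.empty(N) for p in range(1, P + 1)}

    for train, test in KFold(n_folds, shuffle=True, random_state=random_state).split(X):
        keep = list(range(P))                      # start with all P predictors
        while keep:
            k_eff = min(K, len(keep))              # cap K at P (saturated model)
            alpha, beta, *_ = ccr_lm(X[np.ix_(train, keep)], y[train], k_eff)
            preds[len(keep)][test] = alpha + X[np.ix_(test, keep)] @ beta
            # Importance = |standardized coefficient|; drop the least important.
            std_beta = np.abs(beta) * X[np.ix_(train, keep)].std(axis=0) / y[train].std()
            keep.pop(int(np.argmin(std_beta)))

    return {p: r2_score(y, preds[p]) for p in preds}   # P*(K) = argmax of CV-R^2
```

Note that the predictors eliminated may differ across folds, exactly as described above; only the out-of-fold predictions for each predictor count are pooled to compute CV-R²(K).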

The basic idea is that by applying the proper amount of regularization through the tuning of K, we reduce any confounding effects due to high predictor correlation, thus obtaining more interpretable regression coefficients, and better, more reliable predictions. In addition, tuning P tends to eliminate irrelevant or otherwise extraneous predictors and further improve both prediction and interpretability.

Since the optimal P may depend on K, P should be tuned for each K, the optimal (P*, K*) yielding the global maximum for CV-R². Alternatively, as a matter of preference a final model may be based on a smaller value for P and/or K, such that the resulting CV-R² is within c standard errors of the global maximum, where c ≤ 1.

Since K can never exceed P, when P = K the model becomes saturated and is equivalent to the traditional regression model. For a pre-specified K, when P is reduced below K we maintain the saturated model by also reducing K so that K = P. For example, with K = 4, when we step down to 3 predictors we reduce K to 3; similarly, when we step down to 1 predictor, K = 1. In this respect the procedure resembles traditional stepwise regression with backwards elimination.

Prime predictors, those having direct effects, are identified as those with substantial loadings on \(S_1\); suppressor variables are identified as those with substantial loadings on one or more of the other components and relatively small loadings on \(S_1\). See Sect. 5 for further insight into suppressor variables.

Since ccr is scale invariant, it yields identical results regardless of whether predictions are based on unstandardized or standardized predictors (Z-scores). Other methods such as pls-r and penalized regression (Ridge Regression, Lasso, Elastic Net) are not scale invariant and hence yield different results depending on the predictor scaling used.
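As a simple numerical illustration of this property, rescaling the columns of X leaves the fitted values of the ccr_lm sketch above unchanged; the data below are random and purely for demonstration.

```python
import numpy as np

# Illustrative check of scale invariance using the ccr_lm sketch above:
# rescaling the predictors leaves the fitted values unchanged.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 6))
y = X @ rng.normal(size=6) + rng.normal(size=24)
scales = np.array([1.0, 10.0, 0.01, 3.0, 100.0, 0.5])   # arbitrary rescaling

a1, b1, *_ = ccr_lm(X, y, K=2)
a2, b2, *_ = ccr_lm(X * scales, y, K=2)
assert np.allclose(a1 + X @ b1, a2 + (X * scales) @ b2)  # identical predictions
```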

3 A Simple Example with Six Correlated Predictors

Our first example makes use of data involving the prediction of car prices (Y) as a linear function of 6 predictors, each having a statistically significant positive correlation with Y (between 0.6 and 0.9).

  • N = 24 car models

  • Dependent variable: Y = PRICE (car price measured in francs)

  • 6 Predictor Variables:

    • X_1 = CYLINDER (engine size in cubic centimeters)

    • X_2 = POWER (horsepower)

    • X_3 = SPEED (top speed in kilometers/hour)

    • X_4 = WEIGHT (kilograms)

    • X_5 = LENGTH (centimeters)

    • X_6 = WIDTH (centimeters)

The ols regression solution (Table 1a) imposes no regularization, maximizing R² in the training sample. This solution is equivalent to that obtained from a saturated (\(K = P = 6\) components) ccr model. Since this solution is based on a relatively small sample and correlated predictors, it is likely to overfit the data, and the R² is likely to be an overly optimistic estimate of the true population R². Table 1a shows only one statistically significant coefficient (at the 0.05 level) and unrealistic (negative) coefficient estimates for 3 of the 6 predictors, problems that can be attributed to overfitting in the absence of regularization.

Table 1 (a) (left) ols regression coefficients (\(P = K = 6\)); (b) (right) R² and CV-R² for different numbers of components K and for the final ccr model \((P = 3,K = 2)\)

To determine the value of K that provides the optimal amount of regularization, we choose the ccr model that maximizes the CV-R². For cross-validation we used 10 rounds of 6 folds; since 24 divides evenly by 6, each fold contains exactly 4 cars. Table 1b shows that K = 2 components provides the maximum CV-R² based on all P = 6 predictors, and when the step-down algorithm is employed, CV-R² increases to 0.769, which occurs with P = 3 predictors. While traditional ols regression yields a higher R² in the analysis sample (0.847 vs. 0.836), the 2-component ccr model with 3 predictors yields a higher CV-R², suggesting that this ccr model will outperform ols regression when applied to new data.
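For readers who wish to reproduce this style of tuning, the following sketch implements 10 rounds of 6-fold cross-validation, reusing the illustrative ccr_lm function from Sect. 2; the averaging over rounds and the random seeds are our own choices, not those of any ccr software.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

def repeated_cv_r2(X, y, K, rounds=10, folds=6):
    """CV-R^2 for a K-component model, averaged over `rounds` rounds of
    `folds`-fold CV (here 10 rounds of 6 folds, so with N = 24 cars each
    fold holds exactly 4 cars)."""
    scores = []
    for r in range(rounds):
        y_pred = np.empty(len(y))
        for train, test in KFold(folds, shuffle=True, random_state=r).split(X):
            alpha, beta, *_ = ccr_lm(X[train], y[train], K)   # sketch in Sect. 2
            y_pred[test] = alpha + X[test] @ beta
        scores.append(r2_score(y, y_pred))
    return float(np.mean(scores))
```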

Further evidence of the improvement of the 2-component models over ols regression is that their coefficients are more interpretable. Table 2 shows that the coefficients in the 2-component ccr models are all positive, which is what we would expect if we were to interpret them as measures of effect.

Table 2 Comparison of results from pls-r (a) (left) with unstandardized predictors and (b) with standardized predictors, and from ccr (c) without variable selection and (d) (right) with variable selection

pls-r with standardized predictors, the recommended pls-r option when predictors are measured in different units, yields results similar to ccr here. When the predictors remain unstandardized, pls-r yields more components (K = 3), two negative coefficients, and substantially worse predictions (CV-R² = 0.69), as the much larger variance of the predictor CYLINDER causes it to dominate the first component, requiring two additional components to compensate.

4 An Example with Near Infrared (nir) Data

Next, we analyze high-dimensional data involving N = 72 biscuits, each measured at P = 700 near infrared (nir) wavelengths, corresponding to every other wavelength in the range 1,100–2,500 nm [2]. Since all 700 predictors are measured in comparable units in this popular pls-r application, the 700 predictors are typically analyzed on an unstandardized basis, or standardized using Pareto scaling [3], where the scaling factor is the square root of the standard deviation. As shown above, results from pls-r differ depending upon whether the predictors are standardized or not, while for the scale-invariant ccr no decision needs to be made regarding such standardization, predictions being identical in either case.

The goal of the modeling here is to reduce the cost of monitoring fat content by predicting percent fat from the spectroscopic absorbance values at the nir frequencies. Following Kraemer and Boulesteix [4], we use N = 40 samples as the calibration (training) set to develop models based on the 700 wavelengths.

It is well known that for nir data, a column plot of the regression coefficients exhibits a sequence of oscillating patterns, the most important wavelength ranges being those with the highest peak-to-peak amplitude. For example, for these data, wavelengths in the 1,500–1,598 range yield a peak-to-peak amplitude of \(0.109 - (-0.203) = 0.312\), based on a ccr model with K = 9 (see Fig. 1).
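The peak-to-peak amplitude for a wavelength range is simply the spread of the standardized coefficients within that range. A small helper along the following lines (names are illustrative) reproduces the 0.312 computation above, given the fitted coefficients and their wavelengths.

```python
import numpy as np

def peak_to_peak(std_coefs, wavelengths, lo, hi):
    """Peak-to-peak amplitude of standardized coefficients within the
    wavelength range [lo, hi] (e.g. 1500-1598 nm)."""
    in_range = (wavelengths >= lo) & (wavelengths <= hi)
    coefs = std_coefs[in_range]
    return coefs.max() - coefs.min()

# Example from the text: a maximum of 0.109 and a minimum of -0.203 in the
# 1500-1598 range give an amplitude of 0.312.
```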

Table 3a compares the corresponding amplitudes obtained from ccr and from both unstandardized and Pareto-standardized pls-r models, where the number of components is determined based on 10 rounds of 5-fold cross-validation. As can be seen in Table 3a, all three models agree that absorbances at the 1,500–1,598 wavelengths tend to be among the most important (relatively large amplitude).

Table 3 (a) (left) Comparison of peak-to-peak amplitudes for various frequency ranges based on the three models, with the most and least important ranges according to ccr in bold; (b) (right) comparison of CV-R² (highest in bold) obtained from the three models with (P = 700) and without (P = 650) the highest wavelengths included among the predictors
Fig. 1 Column plot of standardized coefficients output from XLSTAT-CCR

Previous analyses of these data excluded the highest 50 wavelengths since they were “…thought to contain little useful information” [5]. Table 3a shows that ccr identifies these wavelengths as least important (smallest amplitude), but the amplitude of 0.44 resulting from pls-r suggests that these wavelengths are important.

Figure 2 shows the standardized coefficients for the 50 highest wavelengths from the ccr and pls-r models. As can be seen, the weights obtained from the ccr model are small and diminishing, the coefficients for the highest wavelengths being very close to 0. In contrast, the pls-r weights are quite large and show no sign of diminishing for the highest wavelengths (Fig. 2, right), a similar pattern being observed for pls-Pareto.

Fig. 2 Comparison of column plots of standardized coefficients for the 50 highest wavelengths based on ccr (left) vs. pls-r estimated with unstandardized predictors (right)

One possible reason that ccr and pls-r reach different conclusions about the importance of these high wavelengths is that its scale invariance allows ccr to better recognize that the high variability associated with these wavelengths is due to increased measurement error. In other words, the much higher amplitude obtained from pls-r is likely driven by the higher standard deviations of the absorbances in this range.

To test the hypothesis that these higher wavelengths tend to be unimportant, we re-estimated the models after omitting these variables. Table 3b shows that for all three models the CV-R² increases when these variables are omitted, supporting the hypothesis that these wavelengths are not important.

In order to compare the predictive performance of ccr with other regularization approaches, 100 simulated samples of size N = 50 were generated with 14 predictors according to the assumptions of ols regression. An additional 14 extraneous predictors, correlated with the 14 true predictors, plus 28 irrelevant predictors were also generated and included among the candidate predictors. The results indicated that ccr outperformed pls-r, Elastic Net, and sparse pls-r with respect to mean squared error and several other criteria. All methods were tuned using an independent validation sample of size 50 (for more details, see [6]).

5 Extension of ccr to Logistic Regression, Linear Discriminant Analysis and Survival Analysis

When the dependent variable is dichotomous, the ccr algorithm generalizes directly to ccr-logistic or ccr-lda, depending upon whether no assumptions are made about the predictor distributions (ccr-logistic) or the normality assumptions of linear discriminant analysis are made (ccr-lda). In either case, the generalization involves replacing Y by Logit(Y) on the left-hand side of the linear equations. Thus, for example, under ccr-logistic and ccr-lda, Eq. 2 becomes:

$$\displaystyle{ Logit(Y ) = \alpha + \gamma _{1.g}^{(2)}S_{ 1} + \lambda _{g}^{(2)}X_{ g} }$$
(3)

where parameter estimation in each regression equation is performed by use of the appropriate ML algorithm (for logistic regression or lda).
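As an illustration of how the loadings might be obtained in the dichotomous case, the following sketch replaces the ols fits of the ccr-lm sketch in Sect. 2 with unpenalized ML logistic regressions. statsmodels is used only for convenience; the function and its names are illustrative and should not be taken as the ccr-logistic implementation.

```python
import numpy as np
import statsmodels.api as sm

def ccr_logistic_components(X, y, K):
    """Sketch of the CCR-logistic component build: each loading is the ML
    logistic-regression coefficient of X_g given the earlier components."""
    N, P = X.shape
    components, loadings = [], []
    for k in range(K):
        lam = np.empty(P)
        for g in range(P):
            Z = sm.add_constant(np.column_stack(components + [X[:, g]]))
            fit = sm.Logit(y, Z).fit(disp=0)     # ML estimation of Eq. (3)
            lam[g] = fit.params[-1]              # lambda_g^(k)
        loadings.append(lam)
        components.append(X @ lam)               # S_k
    # Final step: ML logistic regression of y on S_1, ..., S_K.
    S = sm.add_constant(np.column_stack(components))
    return components, loadings, sm.Logit(y, S).fit(disp=0)
```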

M-fold cross-validation continues to be used for tuning, but CV-R² is replaced by the more appropriate statistics CV-Accuracy and CV-AUC, AUC denoting the Area Under the ROC Curve. Accuracy is most useful when the distribution of the dichotomous Y is approximately uniform, with about 50% of the sample in each group. When Y is skewed, accuracy frequently results in many ties and thus is less useful. In such cases AUC can be used as a tie-breaker, with accuracy remaining the primary criterion, or, in the case of a large skew, AUC can replace accuracy as the primary criterion.
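A minimal sketch of these two tuning criteria, computed from out-of-fold predicted probabilities and assuming a 0.5 classification threshold for accuracy, is shown below.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def cv_criteria(y_true, p_hat):
    """CV-Accuracy and CV-AUC from out-of-fold predicted probabilities.
    Prefer accuracy when y is roughly balanced, break ties with AUC,
    and switch to AUC as the primary criterion when y is strongly skewed."""
    return (accuracy_score(y_true, np.asarray(p_hat) >= 0.5),
            roc_auc_score(y_true, p_hat))
```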

For survival data, Cox regression and other important log-linear hazard models can be expressed as Poisson regression models, since the likelihood functions are equivalent [7]. As such, ccr can be employed using the logit equation above, where Y is a dichotomous variable indicating the occurrence of a rare event. In this case, since Y has an extreme skew, AUC is used as the primary criterion.

Similar to the result for ccr-linear regression, predictions obtained from the saturated ccr model for dichotomous Y are equivalent to those from the corresponding traditional model (logistic regression, lda or Poisson regression). In addition, for dichotomous Y the 1-component ccr model is equivalent to Naïve Bayes, which in the case of ccr-lda is also called diagonal discriminant analysis [8].

In a surprising result reported in [9], for high-dimensional data (small samples and many predictors) generated according to the lda assumptions, traditional lda does not work well and is outperformed by Naïve Bayes. Because of the equivalences described above, this means that the 1-component ccr model should outperform the saturated ccr model under such conditions. However, the Naïve Bayes model will not work well if the predictors include one or more important suppressor variables, since suppressor variables tend to have loadings of 0 on the first component and require at least two components for their effects to be captured in the model [10]. Thus, a ccr model with two components should outperform Naïve Bayes whenever important suppressor variables are included among the predictors.

Despite the extensive literature documenting the enhancement effects of suppressor variables (e.g. [11, 12]), most pre-screening methods omit suppressor variables prior to model development, resulting in suboptimal models. Since suppressor variables are commonplace and are often among the most important predictors in a model [10], such screening is akin to “throwing out the baby with the bath water.”

In order to compare the predictive performance of ccr with other sparse modeling methods in a realistic high-dimensional setting, data were simulated according to lda assumptions to reflect the relationships among real-world data for prostate cancer patients and normal controls, where at least one important suppressor variable was among the predictors. The simulated data involved 100 samples, each with N = 25 cases in each group, and the predictors included 28 valid predictors plus 56 that were irrelevant. The sparse methods compared were ccr, sparse pls-r [13, 14] and the penalized regression methods Lasso and Elastic Net [15–17]. For tuning purposes, five-fold cross-validation was used with accuracy as the criterion for all methods.

Results showed that ccr, typically with 4–10 components, outperformed the other methods with respect to accuracy (82.6% vs. 80.9% for sparse pls-r, and under 80% for Lasso and Elastic Net) and included the fewest irrelevant predictors (3.4 vs. 6.2 for Lasso, 11.5 for Elastic Net and 13.1 for sparse pls-r). The most important variable, a suppressor variable, was captured by the ccr model in 91 of the 100 samples, compared to 78 for sparse pls-r, 61 for Elastic Net and only 51 for Lasso. For further details of this and other simulations see [6].

6 Extension to Latent Class Models

In practice, sample data often reflect two or more distinct subpopulations (latent segments) with different intercepts and/or different regression coefficients, possibly due to different key drivers or at least different effects of the key drivers. In this section we describe a two-step hybrid approach that identifies the latent segments without use of the predictors (step 1) and then uses ccr to develop a predictive model based on a possibly large number of predictors (step 2). If the predictors are characteristics of the respondents, the dependent variable (Y) is the latent class membership; if the predictors are attributes of objects being rated, Y is taken to be the ratings.

As an example of the first case, where the latent segments have different intercepts, in step 1 a latent class (lc) survival analysis was conducted on a sample of patients with late-stage prostate cancer. The lc model identified long-term and short-term survival groups [18]. The goal of that study was to use gene expression measurements to predict whether patients belong to the longer or shorter survival class. Since the relevant genes were not known beforehand, the large number of available candidate predictors (genes) ruled out the use of traditional methods.

In this case, ccr can be used to simultaneously select the appropriate genes and develop reliable predictions of lc membership based on the selected genes. One way to perform this task is to predict the dichotomy formed by the two groups of patients classified according to the lc model. However, this approach is suboptimal because the classifications contain error due to modal assignment. That is, assigning patients with a posterior probability of, say, 0.6 of being a long-term survivor to this class (with probability 1) ignores the 40% expected misclassification error (\(1 - 0.6 = 0.4\)). The better way is to perform a weighted logistic (or lda) ccr regression, where the posterior probabilities from the lc model serve as case weights.
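A hedged sketch of this weighting idea is given below: each patient enters the fit twice, once per class, with weights equal to its posterior probabilities of belonging to each class. The plain logistic regression here is only a placeholder for the ccr-logistic steps, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted_lc_membership(X, post_prob_class1):
    """Instead of modal assignment, expand each case into two records
    (class 1 and class 0) weighted by its posterior probabilities from the
    latent class model, then fit a weighted logistic regression."""
    N = len(post_prob_class1)
    X_rep = np.vstack([X, X])                            # each case appears twice
    y_rep = np.concatenate([np.ones(N), np.zeros(N)])    # class-1 / class-0 record
    w_rep = np.concatenate([post_prob_class1, 1.0 - post_prob_class1])
    # Large C makes the fit effectively unpenalized ML, as in the text.
    return LogisticRegression(C=1e6, max_iter=1000).fit(
        X_rep, y_rep, sample_weight=w_rep)
```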

Table 4 Results from ccr showing that P = 3 of the 16 attributes were selected for inclusion in the model together with the random intercept CFactor1

As an example of the second case, consider ratings of 6 different orange juice (OJ) drinks provided by 96 judges [20]. Based on these ratings, in step 1 an lc regression determines that there are two latent segments exhibiting different OJ preferences. In step 2, separate weighted least squares ccr regressions are performed for each class to predict the ratings based on the 16 OJ attributes. For a given class, the posterior membership probabilities for that class are used as case weights.

For this application ccr is needed because traditional regression could include no more than six attributes in the model, since the attributes describe the six juices rather than the respondents. In addition, since these data consist of multiple records (6) per case, residuals from records associated with the same case are correlated, a violation of the independent-observations assumption. This violation is handled in step 1 by the lc model, which satisfies the “local independence” assumption; in step 2, the cross-validation is refined by assigning all records associated with the same case to the same fold. Separate ccr models are developed for each lc segment and then combined to obtain predicted ratings, providing a substantial improvement over traditional regression (CV-R² increases from 0.28 to 0.48). The results of step 2 are summarized in Table 4, which shows that the most important attribute for both segments is acidity, since it has the largest standardized coefficient magnitude. Segment 1 tends to prefer juices with low acidity (negative coefficient) and high sweetening power (positive coefficient), while the reverse is true for segment 2. Details of this analysis are provided in tutorials available from www.statisticalinnovations.com.
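The refined fold assignment described above can be sketched with a group-aware splitter such as scikit-learn's GroupKFold, assuming a case (judge) identifier is available for each record; the helper below is illustrative only.

```python
from sklearn.model_selection import GroupKFold

def case_respecting_folds(X, y, case_id, n_splits=5):
    """Yield CV folds in which all records of the same case (judge) share a
    fold, so the within-case residual correlation does not leak across the
    train/test split."""
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=case_id):
        yield train_idx, test_idx
```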

Appendix

Claim: ols predictions based on X are equivalent to predictions based on S = XA, where A is a nonsingular matrix.

Proof:

  • Predictions based on X:

    $$\displaystyle{ \hat{Y } = X\hat{\beta } = X{({X}^{{\prime}}X)}^{-1}{X}^{{\prime}}Y. }$$
  • Predictions based on S:

    $$ \displaystyle\begin{array}{rcl} \hat{Y }& =& S\hat{\gamma } {}\\ & =& S{({S}^{{\prime}}S)}^{-1}{S}^{{\prime}}Y = XA{({(XA)}^{{\prime}}XA)}^{-1}{(XA)}^{{\prime}}Y {}\\ & =& XA{({A}^{{\prime}}{X}^{{\prime}}XA)}^{-1}{A}^{{\prime}}{X}^{{\prime}}Y = XA{A}^{-1}{({X}^{{\prime}}X)}^{-1}{A}^{{\prime}-1}{A}^{{\prime}}{X}^{{\prime}}Y {}\\ & =& X{({X}^{{\prime}}X)}^{-1}{X}^{{\prime}}Y. {}\\ \end{array} $$

    The second and third lines of the derivation above follow from the standard identities for nonsingular square matrices:

    $$\displaystyle{ {(BC)}^{{\prime}} = {C}^{{\prime}}{B}^{{\prime}}\quad \mathrm{and}\quad {(BC)}^{-1} = {C}^{-1}{B}^{-1}. }$$

    It also follows that the ols regression coefficients for X are identical to those obtained from ccr with a saturated model (i.e., K = P).
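As a quick numerical illustration of the claim, with random data and a random (almost surely nonsingular) A, the fitted values computed from X and from S = XA agree to rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 24, 6
X = rng.normal(size=(N, P))
y = rng.normal(size=N)
A = rng.normal(size=(P, P))     # nonsingular with probability 1
S = X @ A

def ols_fitted(Z, y):
    """OLS fitted values Z (Z'Z)^{-1} Z'y."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)

# Fitted values from X and from S = XA coincide (up to rounding error),
# confirming the equivalence used for the saturated (K = P) ccr model.
assert np.allclose(ols_fitted(X, y), ols_fitted(S, y))
```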