
1 Background and Introduction

When correlation between predictor variables is moderate or high, coefficients estimated using traditional regression techniques become unstable or cannot be uniquely estimated due to multicollinearity (singularity of the covariance matrix). In the case of high-dimensional data, where the number of predictor variables P approaches or exceeds the sample size N, such instability is often accompanied by perfect or near-perfect predictions within the analysis sample. However, this seemingly good predictive performance usually reflects overfitting and tends to deteriorate when the model is applied to new cases outside the sample.

The primary “regularization” approaches that have been proposed for dealing with this problem are (1) penalized regression, such as Ridge, Lasso and Elastic Net, and (2) dimension reduction methods, such as Principal Component Regression and pls Regression (pls-r). In this paper we describe a new method similar to pls-r called Correlated Component Regression (ccr) and an associated step-down algorithm for reducing the number of predictors in the model to P* < P. ccr has different variants depending upon the scale type of the dependent variable (e.g., ccr-linear regression for continuous Y, ccr-logistic regression for dichotomous Y, ccr-Cox regression for survival data). Unlike the other regularization approaches, the ccr algorithm shares with traditional maximum likelihood regression approaches the favorable property of scale invariance.

In this paper we introduce ccr, and describe its performance on various real and simulated data sets. The basic ccr algorithms are described in Sect. 2. ccr is contrasted with pls-r in a linear regression key driver application with few predictors (Sect. 3) and in an application with Near Infrared (NIR) data involving many predictors (Sect. 4). We then describe the ccr extension to logistic regression, linear discriminant analysis (lda) and survival analysis and discuss results from simulated data where suppressor variables are included among the predictors (Sect. 5). Results from our simulations suggest that ccr may be expected to outperform other sparse regularization approaches, especially when important suppressor variables are included among the predictors. We conclude with a discussion of a hybrid latent class ccr model extension (Sect. 6).

2 Correlated Component Regression

ccr utilizes K < P correlated components in place of the P predictors to predict an outcome variable. Each component \(S_k\) is an exact linear combination of the predictors, \(X = (X_{1},X_{2},\ldots,X_{P})\), the first component \(S_1\) capturing the effects of those predictors that have direct effects on the outcome. The ccr-linear regression (ccr-lm) algorithm proceeds as follows:

Estimate the loading \(\lambda _{g}^{(1)}\) on \(S_1\), for each predictor \(g = 1,2,\ldots,P\), as the simple regression coefficient in the regression of Y on \(X_g\): \(\lambda _{g}^{(1)} = \frac{cov(Y,X_{g})}{var(X_{g})}\). Then \(S_1\) is defined as a weighted average of all 1-predictor effects:

$$\displaystyle{ S_{1} = \frac{1} {P}\sum \limits _{g=1}^{P}\lambda _{ g}^{(1)}X_{ g} }$$
(1)

The predictions for Y in the 1-component ccr model are obtained from the simple ols regression of Y on \(S_1\). Similarly, predictions for the 2-component ccr model are obtained from the ols regression of Y on \(S_1\) and \(S_2\), where the second component \(S_2\) captures the effects of suppressor variables that improve prediction by removing extraneous variation from one or more predictors that have direct effects. Component \(S_{{k}^{{\prime}}}\), for \({k}^{{\prime}} > 1\), is defined as a weighted average of all 1-predictor partial effects, where the partial effect for predictor g is computed as the partial regression coefficient in the ols regression of Y on \(X_g\) and all previously computed components \(S_{k},k = 1,\ldots,{k}^{{\prime}}- 1\). For example, for K = 2 we have:

$$\displaystyle{ Y = \alpha + \gamma _{1.g}^{(2)}S_{ 1} + \lambda _{g}^{(2)}X_{ g} + \epsilon _{g}^{(2)} }$$
(2)

and \(S_{2} = \frac{1} {P}\sum \limits _{g=1}^{P}\lambda _{ g}^{(2)}X_{ g}\), or, since the factor 1/P merely rescales the component and does not affect the subsequent regression on the components, more simply \(S_{2} =\sum \limits _{ g=1}^{P}\lambda _{g}^{(2)}X_{g}\).

As mentioned earlier, predictions for Y in the K-component ccr model are obtained from the ols regression of Y on \(S_{1},\ldots,S_{K}\). For example, for K = 2: \(\hat{Y } = {\alpha }^{(2)} + b_{1}^{(2)}S_{1} + b_{2}^{(2)}S_{2}\). In general, K components are computed, where the optimal value, K*, is determined by M-fold cross-validation (CV). For K = 1 (maximum regularization), no predictor correlation information is used in parameter estimation. As K is repeatedly incremented by 1, more and more of the information provided by the predictor correlations is utilized, and M-fold CV identifies the value of K at which near multicollinearity begins to degrade predictive performance, K* being set accordingly. Deterioration begins at K = 3 for the example illustrated in Sect. 3, and thus K* = 2.
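To make the component-construction steps concrete, the following is a minimal sketch of the basic ccr-lm algorithm in Python/NumPy. It is an illustration only, written under the simplifying convention noted above (the 1/P factor is dropped, since rescaling a component does not change the fitted values); the function and variable names are ours and do not correspond to any released ccr implementation such as XLSTAT-CCR.

```python
import numpy as np

def ccr_lm(X, y, K):
    """Minimal sketch of K-component CCR-linear regression.

    X: (N, P) predictor matrix, y: (N,) outcome.
    Returns the intercept, implied coefficients for X, loadings, and
    component weights.
    """
    N, P = X.shape
    components = []     # component scores S_1, ..., S_K (each of length N)
    loadings = []       # loading vectors lambda^(k) (each of length P)

    for k in range(K):
        lam = np.empty(P)
        for g in range(P):
            # Regress y on X_g plus all previously formed components;
            # lambda_g^(k) is the (partial) coefficient of X_g.
            Z = np.column_stack([np.ones(N)] + components + [X[:, g]])
            coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
            lam[g] = coef[-1]
        S_k = X @ lam   # component = weighted sum of predictors
        loadings.append(lam)
        components.append(S_k)

    # Final OLS regression of y on S_1, ..., S_K gives the component weights b_k.
    S = np.column_stack([np.ones(N)] + components)
    b, *_ = np.linalg.lstsq(S, y, rcond=None)
    alpha, b_k = b[0], b[1:]

    # Re-express the model in terms of the original predictors:
    # beta_g = sum_k b_k * lambda_g^(k)
    beta = np.column_stack(loadings) @ b_k
    return alpha, beta, loadings, b_k

# Predictions for new data: y_hat = alpha + X_new @ beta
```

The final line of the function anticipates the re-expression of the model in terms of the original predictors that is described next.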

Any K-component ccr model can be re-expressed to obtain regression coefficients for X by substituting for the components as follows:

$$\displaystyle{ \hat{Y } = {\alpha }^{(K)}+\sum \limits _{ k=1}^{K}b_{ k}^{(K)}S_{ k} = {\alpha }^{(K)}+\sum \limits _{ k=1}^{K}b_{ k}^{(K)}\sum \limits _{ g=1}^{P}\lambda _{ g}^{(k)}X_{ g} = {\alpha }^{(K)}+\sum \limits _{ g=1}^{P}\beta _{ g}X_{g} }$$

Thus, the regression coefficient \(\beta_g\) for predictor \(X_g\) is simply the weighted sum of its loadings, where the weights are the regression coefficients for the components (component weights) in the K-component model: \(\beta _{g} =\sum \limits _{ k=1}^{K}b_{k}^{(K)}\lambda _{g}^{(k)}\).

Simultaneous variable reduction is achieved using a step-down algorithm where at each step the least important predictor is removed, importance being defined by the absolute value of the standardized coefficient \(\beta _{g}^{{\ast}} = (\sigma _{g}/\sigma _{Y })\beta _{g}\), where σ denotes the standard deviation. M-fold CV is used to determine the two tuning parameters: the number of components K and the number of predictors P.

Consider an example with 6 predictors. For any given value of K, and say M = 10 folds, the basic ccr algorithm is applied 10 times, generating predictions for the cases in each of the 10 folds based on models with all 6 predictors, yielding a baseline (iteration = 0) CV-R²(K) for P = 6. In iteration 1, the variable reduction algorithm eliminates 1 predictor, which need not be the same predictor in all 10 subsamples, and each resulting 5-predictor model is used to obtain new predictions for the associated omitted fold, yielding CV-R²(K) for P = 5. In iteration 2, the variable reduction process continues, resulting in 10 4-predictor models and CV-R²(K) for P = 4. Following the last iteration, P*(K) is determined as the value of P associated with the maximum CV-R²(K).
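A hedged sketch of this step-down procedure, building on the illustrative ccr_lm function above, might look as follows; the scikit-learn utilities are used only for fold generation and scoring, and the loop structure mirrors the description above rather than any released ccr implementation.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

def step_down_cv(X, y, K, n_folds=10, random_state=0):
    """For a fixed K, return CV-R^2 indexed by the number of retained
    predictors (P, P-1, ..., 1), computed from out-of-fold predictions."""
    N, P = X.shape
    preds = {p: np.empty(N) for p in range(1, P + 1)}

    for train, test in KFold(n_folds, shuffle=True, random_state=random_state).split(X):
        keep = list(range(P))                      # start with all P predictors
        while keep:
            k_eff = min(K, len(keep))              # cap K at P (saturated model)
            alpha, beta, *_ = ccr_lm(X[np.ix_(train, keep)], y[train], k_eff)
            preds[len(keep)][test] = alpha + X[np.ix_(test, keep)] @ beta
            # Importance = |standardized coefficient|; drop the least important.
            std_beta = np.abs(beta) * X[np.ix_(train, keep)].std(axis=0) / y[train].std()
            keep.pop(int(np.argmin(std_beta)))

    return {p: r2_score(y, preds[p]) for p in preds}   # P*(K) = argmax of CV-R^2
```

Note that the predictors eliminated may differ across folds, exactly as described above; only the out-of-fold predictions for each predictor count are pooled to compute CV-R²(K).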

The basic idea is that by applying the proper amount of regularization through the tuning of K, we reduce any confounding effects due to high predictor correlation, thus obtaining more interpretable regression coefficients, and better, more reliable predictions. In addition, tuning P tends to eliminate irrelevant or otherwise extraneous predictors and further improve both prediction and interpretability.

Since the optimal P may depend on K, P should be tuned for each K, the optimal (P*, K*) yielding the global maximum for CV-R². Alternatively, as a matter of preference a final model may be based on a smaller value for P and/or K, such that the resulting CV-R² is within c standard errors of the global maximum, where c ≤ 1.

Since K can never exceed P, when P = K the model becomes saturated and is equivalent to the traditional regression model. For a pre-specified K, when P is reduced below K we maintain the saturated model by also reducing K so that K = P. For example, with K = 4, when we step down to 3 predictors we reduce K to 3; similarly, when we step down to 1 predictor, K = 1. In this respect the procedure resembles traditional stepwise regression with backwards elimination.

Prime predictors, those having direct effects, are identified as those with substantial loadings on \(S_1\); suppressor variables are identified as those with substantial loadings on one or more of the other components and relatively small loadings on \(S_1\). See Sect. 5 for further insight into suppressor variables.

Since ccr is scale invariant, it yields identical results regardless of whether predictions are based on unstandardized or standardized predictors (Z-scores). Other methods such as pls-r and penalized regression (Ridge Regression, Lasso, Elastic Net) are not scale invariant and hence yield different results depending on the predictor scaling used.
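As a simple numerical illustration of this property, rescaling the columns of X leaves the fitted values of the ccr_lm sketch above unchanged; the data below are random and purely for demonstration.

```python
import numpy as np

# Illustrative check of scale invariance using the ccr_lm sketch above:
# rescaling the predictors leaves the fitted values unchanged.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 6))
y = X @ rng.normal(size=6) + rng.normal(size=24)
scales = np.array([1.0, 10.0, 0.01, 3.0, 100.0, 0.5])   # arbitrary rescaling

a1, b1, *_ = ccr_lm(X, y, K=2)
a2, b2, *_ = ccr_lm(X * scales, y, K=2)
assert np.allclose(a1 + X @ b1, a2 + (X * scales) @ b2)  # identical predictions
```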

3 A Simple Example with Six Correlated Predictors

Our first example makes use of data involving the prediction of car prices (Y) as a linear function of 6 predictors, each having a statistically significant positive correlation with Y (between 0.6 and 0.9).

  • N = 24 car models

  • Dependent variable: Y = PRICE (car price measured in francs)

  • 6 Predictor Variables:

    • X_1 = CYLINDER (engine size in cubic centimeters)

    • X_2 = POWER (horsepower)

    • X_3 = SPEED (top speed in kilometers/hour)

    • X_4 = WEIGHT (kilograms)

    • X_5 = LENGTH (centimeters)

    • X_6 = WIDTH (centimeters)

The ols regression solution (Table 1a) imposes no regularization, maximizing R² in the training sample. This solution is equivalent to that obtained from a saturated (\(K = P = 6\) components) ccr model. Since this solution is based on a relatively small sample and correlated predictors, it is likely to overfit the data, and the R² is likely to be an overly optimistic estimate of the true population R². Table 1a shows only one statistically significant coefficient (at the 0.05 level) and unrealistic (negative) coefficient estimates for 3 of the 6 predictors, problems that can be attributed to overfitting in the absence of regularization.

Table 1 (a) (left) ols regression coefficients (\(P = K = 6\)); (b) (right) R² and CV-R² for different numbers of components K and for the final ccr model \((P = 3,K = 2)\)

To determine the value of K that provides the optimal amount of regularization, we choose the ccr model that maximizes the CV-R². For cross-validation we used 10 rounds of 6 folds; since 24 divides evenly by 6, each fold contains exactly 4 cars. Table 1b shows that K = 2 components provides the maximum CV-R² based on all P = 6 predictors, and when the step-down algorithm is employed, CV-R² increases to 0.769, which occurs with P = 3 predictors. While traditional ols regression yields a higher R² in the analysis sample (0.847 vs. 0.836), the 2-component ccr model with 3 predictors yields a higher CV-R², suggesting that this ccr model will outperform ols regression when applied to new data.
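For readers who wish to reproduce this style of tuning, the following sketch implements 10 rounds of 6-fold cross-validation, reusing the illustrative ccr_lm function from Sect. 2; the averaging over rounds and the random seeds are our own choices, not those of any ccr software.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

def repeated_cv_r2(X, y, K, rounds=10, folds=6):
    """CV-R^2 for a K-component model, averaged over `rounds` rounds of
    `folds`-fold CV (here 10 rounds of 6 folds, so with N = 24 cars each
    fold holds exactly 4 cars)."""
    scores = []
    for r in range(rounds):
        y_pred = np.empty(len(y))
        for train, test in KFold(folds, shuffle=True, random_state=r).split(X):
            alpha, beta, *_ = ccr_lm(X[train], y[train], K)   # sketch in Sect. 2
            y_pred[test] = alpha + X[test] @ beta
        scores.append(r2_score(y, y_pred))
    return float(np.mean(scores))
```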

Further evidence of the improvement of the 2-component models over ols regression is that their coefficients are more interpretable. Table 2 shows that the coefficients in the 2-component ccr models are all positive, which is what we would expect if we were to interpret them as measures of effect.

Table 2 Comparison of results from pls-r (a) (left) with unstandardized predictors and (b) with standardized predictors, and from ccr (c) without variable selection and (d) (right) with variable selection

pls-r with standardized predictors, the recommended pls-r option when predictors are measured in different units, yields results similar to ccr here. When the predictors remain unstandardized, pls-r yields more components (K = 3), two negative coefficients, and substantially worse predictions (CV-R² = 0.69), as the much larger variance of the predictor CYLINDER causes it to dominate the first component, requiring two additional components to compensate.

4 An Example with Near Infrared (nir) Data

Next, we analyze high-dimensional data involving N = 72 biscuits, each measured at P = 700 near infrared (nir) wavelengths, corresponding to every other wavelength in the range 1,100–2,500 nm [2]. Since all 700 predictors are measured in comparable units in this popular pls-r application, the 700 predictors are typically analyzed on an unstandardized basis, or standardized using Pareto scaling [3], where the scaling factor is the square root of the standard deviation. As shown above, results from pls-r differ depending upon whether the predictors are standardized or not, while for the scale-invariant ccr no decision needs to be made regarding such standardization, predictions being identical in either case.

The goal of the modeling here is to reduce the cost of monitoring fat content by predicting percent fat from the spectroscopic absorbance values at the nir frequencies. Following Kraemer and Boulesteix [4], we use N = 40 samples as the calibration (training) set to develop models based on the 700 wavelengths.

It is well known that for nir data, a column plot of the regression coefficients exhibits a sequence of oscillating patterns, the most important wavelength ranges being those with the highest peak-to-peak amplitude. For example, for these data, wavelengths in the 1,500–1,598 range yield a peak-to-peak amplitude of \(0.109 - (-0.203) = 0.312\), based on a ccr model with K = 9 (see Fig. 1).
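The peak-to-peak amplitude for a wavelength range is simply the spread of the standardized coefficients within that range. A small helper along the following lines (names are illustrative) reproduces the 0.312 computation above, given the fitted coefficients and their wavelengths.

```python
import numpy as np

def peak_to_peak(std_coefs, wavelengths, lo, hi):
    """Peak-to-peak amplitude of standardized coefficients within the
    wavelength range [lo, hi] (e.g. 1500-1598 nm)."""
    in_range = (wavelengths >= lo) & (wavelengths <= hi)
    coefs = std_coefs[in_range]
    return coefs.max() - coefs.min()

# Example from the text: a maximum of 0.109 and a minimum of -0.203 in the
# 1500-1598 range give an amplitude of 0.312.
```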

Table 3a compares the corresponding amplitudes obtained from ccr and from both unstandardized and Pareto-standardized pls-r models, where the number of components is determined based on 10 rounds of 5-fold cross-validation. As can be seen in Table 3a, all three models agree that absorbances at the 1,500–1,598 wavelengths tend to be among the most important (relatively large amplitude).

Table 3 (a) (left) Comparison of peak-to-peak amplitudes for various frequency ranges based on the three models, with the most and least important ranges according to ccr in bold; (b) (right) comparison of CV-R² (highest in bold) obtained from the three models with (P = 700) and without (P = 650) the highest wavelengths included among the predictors
Fig. 1 Column plot of standardized coefficients output from XLSTAT-CCR

Previous analyses of these data excluded the highest 50 wavelengths since they were “…thought to contain little useful information” [5]. Table 3a shows that ccr identifies these wavelengths as least important (smallest amplitude), but the amplitude of 0.44 resulting from pls-r suggests that these wavelengths are important.

Figure 2 shows the standardized coefficients for the 50 highest wavelengths from the ccr and pls-r models. As can be seen, the weights obtained from the ccr model are small and diminishing, the coefficients for the highest wavelengths being very close to 0. In contrast, the pls-r weights are quite large and show no sign of diminishing for the highest wavelengths (Fig. 2, right), a similar pattern being observed for pls-Pareto.

Fig. 2 Comparison of column plots of standardized coefficients for the 50 highest wavelengths based on ccr (left) vs. pls-r estimated with unstandardized predictors (right)

One possible reason that ccr and pls-r reach different conclusions about the importance of these high wavelengths is that its scale invariance allows ccr to better recognize that the high variability associated with these wavelengths is due to increased measurement error. In other words, the much higher amplitude obtained from pls-r is likely driven by the higher standard deviations of the absorbances in this range.

To test the hypothesis that these higher wavelengths tend to be unimportant, we re-estimated the models after omitting these variables. Table 3b shows that for all three models the CV-R² increases when these variables are omitted, supporting the hypothesis that these wavelengths are not important.

In order to compare the predictive performance of ccr with other regularization approaches, 100 simulated samples of size N = 50 were generated with 14 predictors according to the assumptions of ols regression. An additional 14 extraneous predictors, correlated with the 14 true predictors, plus 28 irrelevant predictors were also generated and included among the candidate predictors. The results indicated that ccr outperformed pls-r, Elastic Net, and sparse pls-r with respect to mean squared error and several other criteria. All methods were tuned using an independent validation sample of size 50 (for more details, see [6]).

5 Extension of ccr to Logistic Regression, Linear Discriminant Analysis and Survival Analysis

When the dependent variable is dichotomous, the ccr algorithm generalizes directly to ccr-logistic or ccr-lda, depending upon whether no assumptions are made about the predictor distributions (ccr-logistic) or the normality assumptions of linear discriminant analysis are made (ccr-lda). In either case, the generalization involves replacing Y by Logit(Y) on the left-hand side of the linear equations. Thus, for example, under ccr-logistic and ccr-lda, Eq. 2 becomes:

$$\displaystyle{ Logit(Y ) = \alpha + \gamma _{1.g}^{(2)}S_{ 1} + \lambda _{g}^{(2)}X_{ g} }$$
(3)

where parameter estimation in each regression equation is performed by use of the appropriate ML algorithm (for logistic regression or lda).
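As an illustration of how the loadings might be obtained in the dichotomous case, the following sketch replaces the ols fits of the ccr-lm sketch in Sect. 2 with unpenalized ML logistic regressions. statsmodels is used only for convenience; the function and its names are illustrative and should not be taken as the ccr-logistic implementation.

```python
import numpy as np
import statsmodels.api as sm

def ccr_logistic_components(X, y, K):
    """Sketch of the CCR-logistic component build: each loading is the ML
    logistic-regression coefficient of X_g given the earlier components."""
    N, P = X.shape
    components, loadings = [], []
    for k in range(K):
        lam = np.empty(P)
        for g in range(P):
            Z = sm.add_constant(np.column_stack(components + [X[:, g]]))
            fit = sm.Logit(y, Z).fit(disp=0)     # ML estimation of Eq. (3)
            lam[g] = fit.params[-1]              # lambda_g^(k)
        loadings.append(lam)
        components.append(X @ lam)               # S_k
    # Final step: ML logistic regression of y on S_1, ..., S_K.
    S = sm.add_constant(np.column_stack(components))
    return components, loadings, sm.Logit(y, S).fit(disp=0)
```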

M-fold cross-validation continues to be used for tuning, but CV-R² is replaced by the more appropriate statistics CV-Accuracy and CV-AUC, AUC denoting the Area Under the ROC Curve. Accuracy is most useful when the distribution of the dichotomous Y is approximately uniform, with about 50% of the sample in each group. When Y is skewed, accuracy frequently results in many ties and thus is less useful. In such cases AUC can be used as a tie-breaker, with accuracy remaining the primary criterion, or, in the case of a large skew, AUC can replace accuracy as the primary criterion.
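A minimal sketch of these two tuning criteria, computed from out-of-fold predicted probabilities and assuming a 0.5 classification threshold for accuracy, is shown below.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def cv_criteria(y_true, p_hat):
    """CV-Accuracy and CV-AUC from out-of-fold predicted probabilities.
    Prefer accuracy when y is roughly balanced, break ties with AUC,
    and switch to AUC as the primary criterion when y is strongly skewed."""
    return (accuracy_score(y_true, np.asarray(p_hat) >= 0.5),
            roc_auc_score(y_true, p_hat))
```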

For survival data, Cox regression and other important log-linear hazard models can be expressed as Poisson regression models, since the likelihood functions are equivalent [7]. As such, ccr can be employed using the logit equation above, where Y is a dichotomous variable indicating the occurrence of a rare event. In this case, since Y has an extreme skew, AUC is used as the primary criterion.

Similar to the result for ccr-linear regression, predictions obtained from the saturated ccr model for dichotomous Y are equivalent to those from the corresponding traditional model (logistic regression, lda or Poisson regression). In addition, for dichotomous Y the 1-component ccr model is equivalent to Naïve Bayes, which in the case of ccr-lda is also called diagonal discriminant analysis [8].

In a surprising result reported in [9], for high-dimensional data (small samples and many predictors) generated according to the lda assumptions, traditional lda does not work well and is outperformed by Naïve Bayes. Because of the equivalences described above, this means that the 1-component ccr model should outperform the saturated ccr model under such conditions. However, the Naïve Bayes model will not work well if the predictors include one or more important suppressor variables, since suppressor variables tend to have loadings of 0 on the first component and require at least two components for their effects to be captured in the model [10]. Thus, a ccr model with two components should outperform Naïve Bayes whenever important suppressor variables are included among the predictors.

Despite the extensive literature documenting the enhancement effects of suppressor variables (e.g. [11, 12]), most pre-screening methods omit suppressor variables prior to model development, resulting in suboptimal models. Since suppressor variables are commonplace and are often among the most important predictors in a model [10], such screening is akin to “throwing out the baby with the bath water.”

In order to compare the predictive performance of ccr with other sparse modeling methods in a realistic high-dimensional setting, data were simulated according to lda assumptions to reflect the relationships among real-world data for prostate cancer patients and normal controls, where at least one important suppressor variable was among the predictors. The simulated data involved 100 samples, each with N = 25 cases in each group, and the predictors included 28 valid predictors plus 56 that were irrelevant. The sparse methods compared were ccr, sparse pls-r [13, 14] and the penalized regression methods Lasso and Elastic Net [15–17]. For tuning purposes, five-fold cross-validation was used with accuracy as the criterion for all methods.

Results showed that ccr, typically with 4–10 components, outperformed the other methods with respect to accuracy (82.6% vs. 80.9% for sparse pls-r, and under 80% for Lasso and Elastic Net) and included the fewest irrelevant predictors (3.4 vs. 6.2 for Lasso, 11.5 for Elastic Net and 13.1 for sparse pls-r). The most important variable, a suppressor variable, was captured by the ccr model in 91 of the 100 samples, compared to 78 for sparse pls-r, 61 for Elastic Net and only 51 for Lasso. For further details of this and other simulations see [6].

6 Extension to Latent Class Models

In practice, sample data often reflect two or more distinct subpopulations (latent segments) with different intercepts and/or different regression coefficients, possibly due to different key drivers or at least different effects of the key drivers. In this section we describe a two-step hybrid approach that identifies the latent segments without use of the predictors (step 1) and then uses ccr to develop a predictive model based on a possibly large number of predictors (step 2). If the predictors are characteristics of the respondents, the dependent variable (Y) is the latent class membership; if the predictors are attributes of objects being rated, Y is taken to be the ratings.

As an example of the first case, where the latent segments have different intercepts, in step 1 a latent class (lc) survival analysis was conducted on a sample of patients with late-stage prostate cancer. The lc model identified long-term and short-term survival groups [18]. The goal of that study was to use gene expression measurements to predict whether patients belong to the longer or shorter survival class. Since the relevant genes were not known beforehand, the large number of available candidate predictors (genes) ruled out the use of traditional methods.

In this case, ccr can be used to simultaneously select the appropriate genes and develop reliable predictions of lc membership based on the selected genes. One way to perform this task is to predict the dichotomy formed by the two groups of patients classified according to the lc model. However, this approach is suboptimal because the classifications contain error due to modal assignment. That is, assigning patients with a posterior probability of, say, 0.6 of being a long-term survivor to this class (with probability 1) ignores the 40% expected misclassification error (\(1 - 0.6 = 0.4\)). The better way is to perform a weighted logistic (or lda) ccr regression, where the posterior probabilities from the lc model serve as case weights.
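A hedged sketch of this weighting idea is given below: each patient enters the fit twice, once per class, with weights equal to its posterior probabilities of belonging to each class. The plain logistic regression here is only a placeholder for the ccr-logistic steps, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted_lc_membership(X, post_prob_class1):
    """Instead of modal assignment, expand each case into two records
    (class 1 and class 0) weighted by its posterior probabilities from the
    latent class model, then fit a weighted logistic regression."""
    N = len(post_prob_class1)
    X_rep = np.vstack([X, X])                            # each case appears twice
    y_rep = np.concatenate([np.ones(N), np.zeros(N)])    # class-1 / class-0 record
    w_rep = np.concatenate([post_prob_class1, 1.0 - post_prob_class1])
    # Large C makes the fit effectively unpenalized ML, as in the text.
    return LogisticRegression(C=1e6, max_iter=1000).fit(
        X_rep, y_rep, sample_weight=w_rep)
```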

Table 4 Results from ccr showing that P = 3 of the 16 attributes were selected for inclusion in the model together with the random intercept CFactor1

As an example of the second case, consider ratings of 6 different orange juice (OJ) drinks provided by 96 judges [20]. Based on these ratings, in step 1 an lc regression determines that there are two latent segments exhibiting different OJ preferences. In step 2, separate weighted least squares ccr regressions are performed for each class to predict the ratings based on the 16 OJ attributes. For a given class, the posterior membership probabilities for that class are used as case weights.

For this application ccr is needed because traditional regression could include no more than six attributes in the model, since the attributes describe the six juices rather than the respondents. In addition, since these data consist of multiple records (6) per case, residuals from records associated with the same case are correlated, a violation of the independent-observations assumption. This violation is handled in step 1 by the lc model, which satisfies the “local independence” assumption; in step 2, the cross-validation is refined by assigning all records associated with the same case to the same fold. Separate ccr models are developed for each lc segment and then combined to obtain predicted ratings, providing a substantial improvement over traditional regression (CV-R² increases from 0.28 to 0.48). The results of step 2 are summarized in Table 4, which shows that the most important attribute for both segments is acidity, since it has the largest standardized coefficient magnitude. Segment 1 tends to prefer juices with low acidity (negative coefficient) and high sweetening power (positive coefficient), while the reverse is true for segment 2. Details of this analysis are provided in tutorials available from www.statisticalinnovations.com.
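The refined fold assignment described above can be sketched with a group-aware splitter such as scikit-learn's GroupKFold, assuming a case (judge) identifier is available for each record; the helper below is illustrative only.

```python
from sklearn.model_selection import GroupKFold

def case_respecting_folds(X, y, case_id, n_splits=5):
    """Yield CV folds in which all records of the same case (judge) share a
    fold, so the within-case residual correlation does not leak across the
    train/test split."""
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=case_id):
        yield train_idx, test_idx
```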

Appendix

Claim: ols predictions based on X are equivalent to predictions based on S = XA, where A is a nonsingular matrix.

Proof:

  • Predictions based on X:

    $$\displaystyle{ \hat{Y } = X\hat{\beta } = X{({X}^{{\prime}}X)}^{-1}{X}^{{\prime}}Y. }$$
  • Predictions based on S:

    $$ \displaystyle\begin{array}{rcl} \hat{Y }& =& S\hat{\gamma } {}\\ & =& S{({S}^{{\prime}}S)}^{-1}{S}^{{\prime}}Y = XA{({(XA)}^{{\prime}}XA)}^{-1}{(XA)}^{{\prime}}Y {}\\ & =& XA{({A}^{{\prime}}{X}^{{\prime}}XA)}^{-1}{A}^{{\prime}}{X}^{{\prime}}Y = XA{A}^{-1}{({X}^{{\prime}}X)}^{-1}{A}^{{\prime}-1}{A}^{{\prime}}{X}^{{\prime}}Y {}\\ & =& X{({X}^{{\prime}}X)}^{-1}{X}^{{\prime}}Y. {}\\ \end{array} $$

    The second and third lines of the derivation above follow from the standard identities for nonsingular square matrices:

    $$\displaystyle{ {(BC)}^{{\prime}} = {C}^{{\prime}}{B}^{{\prime}}\quad \mathrm{and}\quad {(BC)}^{-1} = {C}^{-1}{B}^{-1}. }$$

    It also follows that the ols regression coefficients for X are identical to those obtained from ccr with a saturated model (i.e., K = P).
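As a quick numerical illustration of the claim, with random data and a random (almost surely nonsingular) A, the fitted values computed from X and from S = XA agree to rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 24, 6
X = rng.normal(size=(N, P))
y = rng.normal(size=N)
A = rng.normal(size=(P, P))     # nonsingular with probability 1
S = X @ A

def ols_fitted(Z, y):
    """OLS fitted values Z (Z'Z)^{-1} Z'y."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)

# Fitted values from X and from S = XA coincide (up to rounding error),
# confirming the equivalence used for the saturated (K = P) ccr model.
assert np.allclose(ols_fitted(X, y), ols_fitted(S, y))
```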