In the regression setting, the standard linear model

$$\displaystyle{ Y =\beta _{0} +\beta _{1}X_{1} + \cdots +\beta _{p}X_{p}+\epsilon }$$
(6.1)

is commonly used to describe the relationship between a response Y and a set of variables \(X_{1},X_{2},\ldots,X_{p}\). We have seen in Chapter 3 that one typically fits this model using least squares.

In the chapters that follow, we consider some approaches for extending the linear model framework. In Chapter 7 we generalize (6.1) in order to accommodate non-linear, but still additive, relationships, while in Chapter 8 we consider even more general non-linear models. However, the linear model has distinct advantages in terms of inference and, on real-world problems, is often surprisingly competitive in relation to non-linear methods. Hence, before moving to the non-linear world, we discuss in this chapter some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures.

Why might we want to use another fitting procedure instead of least squares? As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.

  • Prediction Accuracy: Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias. If n ≫ p—that is, if n, the number of observations, is much larger than p, the number of variables—then the least squares estimates tend to also have low variance, and hence will perform well on test observations. However, if n is not much larger than p, then there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training. And if p > n, then there is no longer a unique least squares coefficient estimate: the variance is infinite so the method cannot be used at all. By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias. This can lead to substantial improvements in the accuracy with which we can predict the response for observations not used in model training.

  • Model Interpretability: It is often the case that some or many of the variables used in a multiple regression model are in fact not associated with the response. Including such irrelevant variables leads to unnecessary complexity in the resulting model. By removing these variables—that is, by setting the corresponding coefficient estimates to zero—we can obtain a model that is more easily interpreted. Now least squares is extremely unlikely to yield any coefficient estimates that are exactly zero. In this chapter, we see some approaches for automatically performing feature selection or variable selection —that is, for excluding irrelevant variables from a multiple regression model.


There are many alternatives, both classical and modern, to using least squares to fit (6.1). In this chapter, we discuss three important classes of methods.

  • Subset Selection. This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.

  • Shrinkage. This approach involves fitting a model involving all p predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.

  • Dimension Reduction. This approach involves projecting the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.

In the following sections we describe each of these approaches in greater detail, along with their advantages and disadvantages. Although this chapter describes extensions and modifications to the linear model for regression seen in Chapter 3, the same concepts apply to other methods, such as the classification models seen in Chapter 4.

6.1 Subset Selection

In this section we consider some methods for selecting subsets of predictors. These include best subset and stepwise model selection procedures.

6.1.1 Best Subset Selection

To perform best subset selection, we fit a separate least squares regression for each possible combination of the p predictors. That is, we fit all p models that contain exactly one predictor, all \(\left ({ p \atop 2} \right ) = p(p - 1)/2\) models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best.


The problem of selecting the best model from among the \(2^{p}\) possibilities considered by best subset selection is not trivial. This is usually broken up into two stages, as described in Algorithm 6.1.

Algorithm 6.1 Best subset selection

  1. Let \(\mathcal{M}_{0}\) denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.

  2. For k = 1, 2, …, p:

     (a) Fit all \(\left ({ p \atop k} \right )\) models that contain exactly k predictors.

     (b) Pick the best among these \(\left ({ p \atop k} \right )\) models, and call it \(\mathcal{M}_{k}\). Here best is defined as having the smallest RSS, or equivalently largest R 2.

  3. Select a single best model from among \(\mathcal{M}_{0},\ldots,\mathcal{M}_{p}\) using cross-validated prediction error, C p (AIC), BIC, or adjusted R 2.

In Algorithm 6.1, Step 2 identifies the best model (on the training data) for each subset size, in order to reduce the problem from one of \(2^{p}\) possible models to one of p + 1 possible models. In Figure 6.1, these models form the lower frontier depicted in red.

Fig. 6.1

For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R 2 are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R 2 . Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.

Now in order to select a single best model, we must simply choose among these p + 1 options. This task must be performed with care, because the RSS of these p + 1 models decreases monotonically, and the R 2 increases monotonically, as the number of features included in the models increases. Therefore, if we use these statistics to select the best model, then we will always end up with a model involving all of the variables. The problem is that a low RSS or a high R 2 indicates a model with a low training error, whereas we wish to choose a model that has a low test error. (As shown in Chapter 2 in Figures 2.9–2.11, training error tends to be quite a bit smaller than test error, and a low training error by no means guarantees a low test error.) Therefore, in Step 3, we use cross-validated prediction error, C p , BIC, or adjusted R 2 in order to select among \(\mathcal{M}_{0},\mathcal{M}_{1},\ldots,\mathcal{M}_{p}\). These approaches are discussed in Section 6.1.3.
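
To make Algorithm 6.1 concrete, the following is a minimal Python sketch (using numpy and scikit-learn; the predictor matrix `X` and response vector `y` are hypothetical arrays assumed to be available). It carries out Step 2 by enumerating every subset of each size and retaining the lowest-RSS fit, and leaves Step 3, the choice among \(\mathcal{M}_{0},\ldots,\mathcal{M}_{p}\), to one of the criteria discussed in Section 6.1.3.

```python
import itertools

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """Step 2 of Algorithm 6.1: for each size k, keep the lowest-RSS least squares fit."""
    n, p = X.shape
    models = {0: ([], float(np.sum((y - y.mean()) ** 2)))}   # M_0: the null model, RSS = TSS
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        models[k] = (best_cols, best_rss)                     # this is M_k
    return models  # Step 3: choose among M_0, ..., M_p by CV error, Cp, BIC, or adjusted R^2
```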

An application of best subset selection is shown in Figure 6.1. Each plotted point corresponds to a least squares regression model fit using a different subset of the 11 predictors in the Credit data set, discussed in Chapter 3. Here the variable ethnicity is a three-level qualitative variable, and so is represented by two dummy variables, which are selected separately in this case. We have plotted the RSS and R 2 statistics for each model, as a function of the number of variables. The red curves connect the best models for each model size, according to RSS or R 2. The figure shows that, as expected, these quantities improve as the number of variables increases; however, from the three-variable model on, there is little improvement in RSS and R 2 as a result of including additional predictors.

Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression. In the case of logistic regression, instead of ordering models by RSS in Step 2 of Algorithm 6.1, we instead use the deviance, a measure that plays the role of RSS for a broader class of models. The deviance is negative two times the maximized log-likelihood; the smaller the deviance, the better the fit.
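
Because the deviance is just negative two times the maximized log-likelihood, it is simple to compute once fitted class probabilities are available. A small Python illustration (the array `y` of 0/1 responses and the array `p_hat` of fitted probabilities are hypothetical):

```python
import numpy as np

def binomial_deviance(y, p_hat):
    """Deviance of a fitted logistic regression: -2 times the binomial log-likelihood."""
    eps = 1e-12                                   # guard against log(0)
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -2.0 * np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
```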


While best subset selection is a simple and conceptually appealing approach, it suffers from computational limitations. The number of possible models that must be considered grows rapidly as p increases. In general, there are \(2^{p}\) models that involve subsets of p predictors. So if p = 10, then there are approximately 1,000 possible models to be considered, and if p = 20, then there are over one million possibilities! Consequently, best subset selection becomes computationally infeasible for values of p greater than around 40, even with extremely fast modern computers. There are computational shortcuts—so-called branch-and-bound techniques—for eliminating some choices, but these have their limitations as p gets large. They also only work for least squares linear regression. We present computationally efficient alternatives to best subset selection next.

6.1.2 Stepwise Selection

For computational reasons, best subset selection cannot be applied with very large p. Best subset selection may also suffer from statistical problems when p is large. The larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data. Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates.

For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.

Algorithm 6.2 Forward stepwise selection

  1. Let \(\mathcal{M}_{0}\) denote the null model, which contains no predictors.

  2. For \(k = 0,\ldots,p - 1\):

     (a) Consider all p − k models that augment the predictors in \(\mathcal{M}_{k}\) with one additional predictor.

     (b) Choose the best among these p − k models, and call it \(\mathcal{M}_{k+1}\). Here best is defined as having smallest RSS or highest R 2.

  3. Select a single best model from among \(\mathcal{M}_{0},\ldots,\mathcal{M}_{p}\) using cross-validated prediction error, C p (AIC), BIC, or adjusted R 2.

6.1.2.1 Forward Stepwise Selection

Forward stepwise selection is a computationally efficient alternative to best subset selection. While the best subset selection procedure considers all \(2^{p}\) possible models containing subsets of the p predictors, forward stepwise considers a much smaller set of models. Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model. More formally, the forward stepwise selection procedure is given in Algorithm 6.2.


Unlike best subset selection, which involved fitting \(2^{p}\) models, forward stepwise selection involves fitting one null model, along with p − k models in the kth iteration, for \(k = 0,\ldots,p - 1\). This amounts to a total of \(1 +\sum _{ k=0}^{p-1}(p - k) = 1 + p(p + 1)/2\) models. This is a substantial difference: when p = 20, best subset selection requires fitting 1,048,576 models, whereas forward stepwise selection requires fitting only 211 models.Footnote 1

In Step 2(b) of Algorithm 6.2, we must identify the best model from among those p − k that augment \(\mathcal{M}_{k}\) with one additional predictor. We can do this by simply choosing the model with the lowest RSS or the highest R 2. However, in Step 3, we must identify the best model among a set of models with different numbers of variables. This is more challenging, and is discussed in Section 6.1.3.
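
A minimal Python sketch of Algorithm 6.2, under the same assumptions as before (hypothetical arrays `X` and `y`, with numpy and scikit-learn available), might look as follows; it greedily adds the single predictor that most reduces the RSS at each step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Algorithm 6.2: greedily add the predictor that most improves the fit at each step."""
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    for _ in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:                          # the p - k ways to augment M_k
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)                      # M_{k+1}: M_k plus the single best new predictor
        remaining.remove(best_j)
        path.append((list(selected), best_rss))
    return path  # Step 3 then picks one model size using CV error, Cp, BIC, or adjusted R^2
```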

Forward stepwise selection’s computational advantage over best subset selection is clear. Though forward stepwise tends to do well in practice, it is not guaranteed to find the best possible model out of all \(2^{p}\) models containing subsets of the p predictors. For instance, suppose that in a given data set with p = 3 predictors, the best possible one-variable model contains X 1, and the best possible two-variable model instead contains X 2 and X 3. Then forward stepwise selection will fail to select the best possible two-variable model, because \(\mathcal{M}_{1}\) will contain X 1, so \(\mathcal{M}_{2}\) must also contain X 1 together with one additional variable.

Table 6.1, which shows the first four selected models for best subset and forward stepwise selection on the Credit data set, illustrates this phenomenon. Both best subset selection and forward stepwise selection choose rating for the best one-variable model and then include income and student for the two- and three-variable models. However, best subset selection replaces rating by cards in the four-variable model, while forward stepwise selection must maintain rating in its four-variable model. In this example, Figure 6.1 indicates that there is not much difference between the three- and four-variable models in terms of RSS, so either of the four-variable models will likely be adequate.

Table 6.1 The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ.

Forward stepwise selection can be applied even in the high-dimensional setting where n < p; however, in this case, it is possible to construct submodels \(\mathcal{M}_{0},\ldots,\mathcal{M}_{n-1}\) only, since each submodel is fit using least squares, which will not yield a unique solution if p ≥ n.

6.1.2.2 Backward Stepwise Selection

Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time. Details are given in Algorithm 6.3.


Algorithm 6.3 Backward stepwise selection

  1. Let \(\mathcal{M}_{p}\) denote the full model, which contains all p predictors.

  2. For \(k = p,p - 1,\ldots,1\):

     (a) Consider all k models that contain all but one of the predictors in \(\mathcal{M}_{k}\), for a total of k − 1 predictors.

     (b) Choose the best among these k models, and call it \(\mathcal{M}_{k-1}\). Here best is defined as having smallest RSS or highest R 2.

  3. Select a single best model from among \(\mathcal{M}_{0},\ldots,\mathcal{M}_{p}\) using cross-validated prediction error, C p (AIC), BIC, or adjusted R 2.

Like forward stepwise selection, the backward selection approach searches through only \(1 + p(p + 1)/2\) models, and so can be applied in settings where p is too large to apply best subset selection.Footnote 2 Also like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors.

Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.

6.1.2.3 Hybrid Approaches

The best subset, forward stepwise, and backward stepwise selection approaches generally give similar but not identical models. As another alternative, hybrid versions of forward and backward stepwise selection are available, in which variables are added to the model sequentially, in analogy to forward selection. However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit. Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.

6.1.3 Choosing the Optimal Model

Best subset selection, forward selection, and backward selection result in the creation of a set of models, each of which contains a subset of the p predictors. In order to implement these methods, we need a way to determine which of these models is best. As we discussed in Section 6.1.1, the model containing all of the predictors will always have the smallest RSS and the largest R 2, since these quantities are related to the training error. Instead, we wish to choose a model with a low test error. As is evident here, and as we show in Chapter 2, the training error can be a poor estimate of the test error. Therefore, RSS and R 2 are not suitable for selecting the best model among a collection of models with different numbers of predictors.

In order to select the best model with respect to test error, we need to estimate this test error. There are two common approaches:

  1. We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.

  2. We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in Chapter 5.

We consider both of these approaches below.

6.1.3.1 C p , AIC, BIC, and Adjusted R 2

We show in Chapter 2 that the training set MSE is generally an underestimate of the test MSE. (Recall that \(\mbox{ MSE} = \mbox{ RSS}/n\).) This is because when we fit a model to the training data using least squares, we specifically estimate the regression coefficients such that the training RSS (but not the test RSS) is as small as possible. In particular, the training error will decrease as more variables are included in the model, but the test error may not. Therefore, training set RSS and training set R 2 cannot be used to select from among a set of models with different numbers of variables.

However, a number of techniques for adjusting the training error for the model size are available. These approaches can be used to select among a set of models with different numbers of variables. We now consider four such approaches: C p, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted R 2. Figure 6.2 displays C p , BIC, and adjusted R 2 for the best model of each size produced by best subset selection on the Credit data set.

Fig. 6.2

C p , BIC, and adjusted R 2 are shown for the best models of each size for the Credit data set (the lower frontier in Figure 6.1). C p and BIC are estimates of test MSE. In the middle plot we see that the BIC estimate of test error shows an increase after four variables are selected. The other two plots are rather flat after four variables are included.


For a fitted least squares model containing d predictors, the C p estimate of test MSE is computed using the equation

$$\displaystyle{ C_{p} = \frac{1} {n}\left (\mbox{ RSS} + 2{d\hat{\sigma }}^{2}\right ), }$$
(6.2)

where \(\hat{{\sigma }}^{2}\) is an estimate of the variance of the error ε associated with each response measurement in (6.1).Footnote 3 Essentially, the C p statistic adds a penalty of \(2{d\hat{\sigma }}^{2}\) to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. Clearly, the penalty increases as the number of predictors in the model increases; this is intended to adjust for the corresponding decrease in training RSS. Though it is beyond the scope of this book, one can show that if \(\hat{{\sigma }}^{2}\) is an unbiased estimate of σ 2 in (6.2), then C p is an unbiased estimate of test MSE. As a consequence, the C p statistic tends to take on a small value for models with a low test error, so when determining which of a set of models is best, we choose the model with the lowest C p value. In Figure 6.2, C p selects the six-variable model containing the predictors income, limit, rating, cards, age and student.

The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of the model (6.1) with Gaussian errors, maximum likelihood and least squares are the same thing. In this case AIC is given by

$$\displaystyle{\mathrm{AIC} = \frac{1} {{n\hat{\sigma }}^{2}}\left (\mbox{ RSS} + 2{d\hat{\sigma }}^{2}\right ),}$$

where, for simplicity, we have omitted an additive constant. Hence for least squares models, C p and AIC are proportional to each other, and so only C p is displayed in Figure 6.2.

BIC is derived from a Bayesian point of view, but ends up looking similar to C p (and AIC) as well. For the least squares model with d predictors, the BIC is, up to irrelevant constants, given by

$$\displaystyle{ \mbox{ BIC} = \frac{1} {n}\left (\mbox{ RSS} +\log (n){d\hat{\sigma }}^{2}\right ). }$$
(6.3)

Like C p , the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value. Notice that BIC replaces the \(2{d\hat{\sigma }}^{2}\) used by C p with a \(\log (n){d\hat{\sigma }}^{2}\) term, where n is the number of observations. Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than C p . In Figure 6.2, we see that this is indeed the case for the Credit data set; BIC chooses a model that contains only the four predictors income, limit, cards, and student. In this case the curves are very flat and so there does not appear to be much difference in accuracy between the four-variable and six-variable models.

The adjusted R 2 statistic is another popular approach for selecting among a set of models that contain different numbers of variables. Recall from Chapter 3 that the usual R 2 is defined as \(1 -\mbox{ RSS}/\mbox{ TSS}\), where \(\mbox{ TSS} =\sum {(y_{i} -\overline{y})}^{2}\) is the total sum of squares for the response. Since RSS always decreases as more variables are added to the model, the R 2 always increases as more variables are added. For a least squares model with d variables, the adjusted R 2 statistic is calculated as

$$\displaystyle{ \text{Adjusted }{R}^{2} = 1 -\frac{\mbox{ RSS}/(n - d - 1)} {\mbox{ TSS}/(n - 1)}. }$$
(6.4)

Unlike C p , AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted R 2 indicates a model with a small test error. Maximizing the adjusted R 2 is equivalent to minimizing \(\frac{\mbox{ RSS}}{n-d-1}\). While RSS always decreases as the number of variables in the model increases, \(\frac{\mbox{ RSS}} {n-d-1}\) may increase or decrease, due to the presence of d in the denominator.

The intuition behind the adjusted R 2 is that once all of the correct variables have been included in the model, adding additional noise variables will lead to only a very small decrease in RSS. Since adding noise variables leads to an increase in d, such variables will lead to an increase in \(\frac{\mbox{ RSS}}{n-d-1}\), and consequently a decrease in the adjusted R 2. Therefore, in theory, the model with the largest adjusted R 2 will have only correct variables and no noise variables. Unlike the R 2 statistic, the adjusted R 2 statistic pays a price for the inclusion of unnecessary variables in the model. Figure 6.2 displays the adjusted R 2 for the Credit data set. Using this statistic results in the selection of a model that contains seven variables, adding gender to the model selected by C p and AIC.
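
The quantities in (6.2)–(6.4) are easy to compute from a fitted model's RSS. A minimal Python sketch (the estimate `sigma2` of the error variance, typically taken from the full least squares fit, and the total sum of squares `tss` are assumed to be available):

```python
import numpy as np

def cp_bic_adj_r2(rss, n, d, sigma2, tss):
    """Cp (6.2), BIC (6.3), and adjusted R^2 (6.4) for a least squares model with d predictors."""
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2
```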

C p , AIC, and BIC all have rigorous theoretical justifications that are beyond the scope of this book. These justifications rely on asymptotic arguments (scenarios where the sample size n is very large). Despite its popularity, and even though it is quite intuitive, the adjusted R 2 is not as well motivated in statistical theory as AIC, BIC, and C p . All of these measures are simple to use and compute. Here we have presented the formulas for AIC, BIC, and C p in the case of a linear model fit using least squares; however, these quantities can also be defined for more general types of models.

6.1.3.2 Validation and Cross-Validation

As an alternative to the approaches just discussed, we can directly estimate the test error using the validation set and cross-validation methods discussed in Chapter 5. We can compute the validation set error or the cross-validation error for each model under consideration, and then select the model for which the resulting estimated test error is smallest. This procedure has an advantage relative to AIC, BIC, C p , and adjusted R 2, in that it provides a direct estimate of the test error, and makes fewer assumptions about the true underlying model. It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance σ 2.

In the past, performing cross-validation was computationally prohibitive for many problems with large p and/or large n, and so AIC, BIC, C p , and adjusted R 2 were more attractive approaches for choosing among a set of models. However, nowadays with fast computers, the computations required to perform cross-validation are hardly ever an issue. Thus, cross-validation is a very attractive approach for selecting from among a number of models under consideration.

Figure 6.3 displays, as a function of d, the BIC, validation set errors, and cross-validation errors on the Credit data, for the best d-variable model. The validation errors were calculated by randomly selecting three-quarters of the observations as the training set, and the remainder as the validation set. The cross-validation errors were computed using k = 10 folds. In this case, the validation and cross-validation methods both result in a six-variable model. However, all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors.

Fig. 6.3

For the Credit data set, three quantities are displayed for the best model containing d predictors, for d ranging from 1 to 11. The overall best model, based on each of these quantities, is shown as a blue cross. Left: Square root of BIC. Center: Validation set errors. Right: Cross-validation errors.

In fact, the estimated test error curves displayed in the center and right-hand panels of Figure 6.3 are quite flat. While a three-variable model clearly has lower estimated test error than a two-variable model, the estimated test errors of the 3- to 11-variable models are quite similar. Furthermore, if we repeated the validation set approach using a different split of the data into a training set and a validation set, or if we repeated cross-validation using a different set of cross-validation folds, then the precise model with the lowest estimated test error would surely change. In this setting, we can select a model using the one-standard-error rule. We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve. The rationale here is that if a set of models appear to be more or less equally good, then we might as well choose the simplest model—that is, the model with the smallest number of predictors. In this case, applying the one-standard-error rule to the validation set or cross-validation approach leads to selection of the three-variable model.
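
A sketch of the one-standard-error rule in Python, assuming hypothetical arrays in which `cv_errors[d]` and `cv_se[d]` hold the estimated test MSE and its standard error for the best model with d predictors:

```python
import numpy as np

def one_standard_error_rule(cv_errors, cv_se):
    """Return the smallest model size whose CV error is within one SE of the minimum."""
    cv_errors, cv_se = np.asarray(cv_errors), np.asarray(cv_se)
    best = int(np.argmin(cv_errors))
    threshold = cv_errors[best] + cv_se[best]
    return int(np.min(np.nonzero(cv_errors <= threshold)[0]))  # smallest qualifying size
```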


6.2 Shrinkage Methods

The subset selection methods described in Section 6.1 involve using least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

6.2.1 Ridge Regression

Recall from Chapter 3 that the least squares fitting procedure estimates \(\beta _{0},\beta _{1},\ldots,\beta _{p}\) using the values that minimize

$$\displaystyle{\mbox{ RSS} =\sum _{ i=1}^{n}{\left (y_{ i} -\beta _{0} -\sum _{j=1}^{p}\beta _{ j}x_{ij}\right )}^{2}.}$$

Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates \(\hat{{\beta }}^{R}\) are the values that minimize

$$\displaystyle{ \sum _{i=1}^{n}{\left (y_{ i} -\beta _{0} -\sum _{j=1}^{p}\beta _{ j}x_{ij}\right )}^{2} +\lambda \sum _{ j=1}^{p}\beta _{ j}^{2} = \mbox{ RSS} +\lambda \sum _{ j=1}^{p}\beta _{ j}^{2}, }$$
(6.5)

where λ ≥ 0 is a tuning parameter, to be determined separately. Equation 6.5 trades off two different criteria. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, \(\lambda \sum _{j}\beta _{j}^{2}\), called a shrinkage penalty, is small when \(\beta _{1},\ldots,\beta _{p}\) are close to zero, and so it has the effect of shrinking the estimates of β j towards zero. The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates. When λ = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates, \(\hat{\beta }_{\lambda }^{R}\), for each value of λ. Selecting a good value for λ is critical; we defer this discussion to Section 6.2.3, where we use cross-validation.


Note that in (6.5), the shrinkage penalty is applied to β 1, , β p , but not to the intercept β 0. We want to shrink the estimated association of each variable with the response; however, we do not want to shrink the intercept, which is simply a measure of the mean value of the response when \(x_{i1} = x_{i2} =\ldots = x_{ip} = 0\). If we assume that the variables—that is, the columns of the data matrix X—have been centered to have mean zero before ridge regression is performed, then the estimated intercept will take the form \(\hat{\beta }_{0} =\bar{ y} =\sum _{ i=1}^{n}y_{i}/n\).
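
Differentiating (6.5) shows that, once the predictors and response have been centered, the ridge coefficient estimates have the closed form \(\hat{\beta }_{\lambda }^{R} = {({X}^{T}X +\lambda I)}^{-1}{X}^{T}y\). A minimal numpy sketch of this calculation (the centering step below is what keeps the intercept out of the penalty, as discussed above; `X` and `y` are hypothetical arrays):

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2) on centered data; the intercept is not penalized."""
    Xc = X - X.mean(axis=0)                    # center each column of X
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y.mean()                       # with centered predictors, beta_0_hat = ybar
    return intercept, beta
```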

6.2.1.1 An Application to the Credit Data

In Figure 6.4, the ridge regression coefficient estimates for the Credit data set are displayed. In the left-hand panel, each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of λ. For example, the black solid line represents the ridge regression estimate for the income coefficient, as λ is varied. At the extreme left-hand side of the plot, λ is essentially zero, and so the corresponding ridge coefficient estimates are the same as the usual least squares estimates. But as λ increases, the ridge coefficient estimates shrink towards zero. When λ is extremely large, then all of the ridge coefficient estimates are basically zero; this corresponds to the null model that contains no predictors. In this plot, the income, limit, rating, and student variables are displayed in distinct colors, since these variables tend to have by far the largest coefficient estimates. While the ridge coefficient estimates tend to decrease in aggregate as λ increases, individual coefficients, such as rating and income, may occasionally increase as λ increases.

Fig. 6.4

The standardized ridge regression coefficients are displayed for the Credit data set, as a function of λ and \(\|\hat{\beta }_{\lambda }^{R}\|_{2}/\|\hat{\beta }\|_{2}\) .

The right-hand panel of Figure 6.4 displays the same ridge coefficient estimates as the left-hand panel, but instead of displaying λ on the x-axis, we now display \(\|\hat{\beta }_{\lambda }^{R}\|_{2}/\|\hat{\beta }\|_{2}\), where \(\hat{\beta }\) denotes the vector of least squares coefficient estimates. The notation \(\|\beta \|_{2}\) denotes the ℓ2 norm (pronounced “ell 2”) of a vector, and is defined as \(\|\beta \|_{2} = \sqrt{\sum _{j=1 }^{p }{\beta _{j } }^{2}}\). It measures the distance of β from zero. As λ increases, the ℓ2 norm of \(\hat{\beta }_{\lambda }^{R}\) will always decrease, and so will \(\|\hat{\beta }_{\lambda }^{R}\|_{2}/\|\hat{\beta }\|_{2}\). The latter quantity ranges from 1 (when λ = 0, in which case the ridge regression coefficient estimate is the same as the least squares estimate, and so their ℓ2 norms are the same) to 0 (when λ = ∞, in which case the ridge regression coefficient estimate is a vector of zeros, with ℓ2 norm equal to zero). Therefore, we can think of the x-axis in the right-hand panel of Figure 6.4 as the amount that the ridge regression coefficient estimates have been shrunken towards zero; a small value indicates that they have been shrunken very close to zero.


The standard least squares coefficient estimates discussed in Chapter 3 are scale invariant : multiplying X j by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of 1 ∕ c. In other words, regardless of how the jth predictor is scaled, \(X_{j}\hat{\beta }_{j}\) will remain the same. In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant. For instance, consider the income variable, which is measured in dollars. One could reasonably have measured income in thousands of dollars, which would result in a reduction in the observed values of income by a factor of 1,000. Now due to the sum of squared coefficients term in the ridge regression formulation (6.5), such a change in scale will not simply cause the ridge regression coefficient estimate for income to change by a factor of 1,000. In other words, \(X_{j}\hat{\beta }_{j,\lambda }^{R}\) will depend not only on the value of λ, but also on the scaling of the jth predictor. In fact, the value of \(X_{j}\hat{\beta }_{j,\lambda }^{R}\) may even depend on the scaling of the other predictors! Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula

$$\displaystyle{ \tilde{x}_{ij} = \frac{x_{ij}} {\sqrt{ \frac{1} {n}\sum _{i=1}^{n}{(x_{ij} -\overline{x}_{j})}^{2}}}, }$$
(6.6)

so that they are all on the same scale. In (6.6), the denominator is the estimated standard deviation of the jth predictor. Consequently, all of the standardized predictors will have a standard deviation of one. As a result the final fit will not depend on the scale on which the predictors are measured. In Figure 6.4, the y-axis displays the standardized ridge regression coefficient estimates—that is, the coefficient estimates that result from performing ridge regression using standardized predictors.
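
For example, a sketch of this standardize-then-fit workflow using scikit-learn's Ridge, whose `alpha` parameter multiplies the sum of squared coefficients and so plays the role of λ in (6.5); the arrays `X`, `y`, and the grid `lambdas` are assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge

def standardized_ridge_path(X, y, lambdas):
    """Standardize each predictor as in (6.6), then trace out ridge coefficients over a grid of lambda."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # np.std divides by n, matching (6.6)
    coefs = []
    for lam in lambdas:
        fit = Ridge(alpha=lam).fit(X_std, y)       # alpha plays the role of lambda in (6.5)
        coefs.append(fit.coef_)
    return np.array(coefs)                         # one row of standardized coefficients per lambda
```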


6.2.1.2 Why Does Ridge Regression Improve Over Least Squares?

Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. This is illustrated in the left-hand panel of Figure 6.5, using a simulated data set containing p = 45 predictors and n = 50 observations. The green curve in the left-hand panel of Figure 6.5 displays the variance of the ridge regression predictions as a function of λ. At the least squares coefficient estimates, which correspond to ridge regression with λ = 0, the variance is high but there is no bias. But as λ increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in the variance of the predictions, at the expense of a slight increase in bias. Recall that the test mean squared error (MSE), plotted in purple, is a function of the variance plus the squared bias. For values of λ up to about 10, the variance decreases rapidly, with very little increase in bias, plotted in black. Consequently, the MSE drops considerably as λ increases from 0 to 10. Beyond this point, the decrease in variance due to increasing λ slows, and the shrinkage on the coefficients causes them to be significantly underestimated, resulting in a large increase in the bias. The minimum MSE is achieved at approximately λ = 30. Interestingly, because of its high variance, the MSE associated with the least squares fit, when λ = 0, is almost as high as that of the null model for which all coefficient estimates are zero, when λ = ∞. However, for an intermediate value of λ, the MSE is considerably lower.

Fig. 6.5

Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of λ and \(\|\hat{\beta }_{\lambda }^{R}\|_{2}/\|\hat{\beta }\|_{2}\) . The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.

The right-hand panel of Figure 6.5 displays the same curves as the left-hand panel, this time plotted against the ℓ2 norm of the ridge regression coefficient estimates divided by the ℓ2 norm of the least squares estimates. Now as we move from left to right, the fits become more flexible, and so the bias decreases and the variance increases.

In general, in situations where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance. This means that a small change in the training data can cause a large change in the least squares coefficient estimates. In particular, when the number of variables p is almost as large as the number of observations n, as in the example in Figure 6.5, the least squares estimates will be extremely variable. And if p > n, then the least squares estimates do not even have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance.

Ridge regression also has substantial computational advantages over best subset selection, which requires searching through \(2^{p}\) models. As we discussed previously, even for moderate values of p, such a search can be computationally infeasible. In contrast, for any fixed value of λ, ridge regression only fits a single model, and the model-fitting procedure can be performed quite quickly. In fact, one can show that the computations required to solve (6.5), simultaneously for all values of λ, are almost identical to those for fitting a model using least squares.

6.2.2 The Lasso

Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The penalty \(\lambda \sum \beta _{j}^{2}\) in (6.5) will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless λ = ∞). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large. For example, in the Credit data set, it appears that the most important variables are income, limit, rating, and student. So we might wish to build a model including just these predictors. However, ridge regression will always generate a model involving all ten predictors. Increasing the value of λ will tend to reduce the magnitudes of the coefficients, but will not result in exclusion of any of the variables.

The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, \(\hat{\beta }_{\lambda }^{L}\), minimize the quantity

$$\displaystyle{ \sum _{i=1}^{n}{\left (y_{ i} -\beta _{0} -\sum _{j=1}^{p}\beta _{ j}x_{ij}\right )}^{2} +\lambda \sum _{ j=1}^{p}\vert \beta _{ j}\vert \ = \mbox{ RSS} +\lambda \sum _{ j=1}^{p}\vert \beta _{ j}\vert. }$$
(6.7)

Comparing (6.7) to (6.5), we see that the lasso and ridge regression have similar formulations. The only difference is that the β j 2 term in the ridge regression penalty (6.5) has been replaced by | β j  | in the lasso penalty (6.7). In statistical parlance, the lasso uses an ℓ1 (pronounced “ell 1”) penalty instead of an ℓ2 penalty. The ℓ1 norm of a coefficient vector β is given by \(\|\beta \|_{1} =\sum \vert \beta _{j}\vert \).


As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the ℓ1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection. As a result, models generated from the lasso are generally much easier to interpret than those produced by ridge regression. We say that the lasso yields sparse models—that is, models that involve only a subset of the variables. As in ridge regression, selecting a good value of λ for the lasso is critical; we defer this discussion to Section 6.2.3, where we use cross-validation.
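
As a rough illustration, one can trace out the lasso's variable selection behavior over a grid of λ values using scikit-learn. Note that its Lasso solver scales the objective as RSS divided by 2n plus `alpha` times the ℓ1 norm, so `alpha` corresponds to λ∕(2n) in (6.7); the arrays `X`, `y`, and the grid `lambdas` are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def lasso_sparsity_path(X, y, lambdas):
    """Count how many lasso coefficients are exactly zero at each value of the penalty."""
    X_std = StandardScaler().fit_transform(X)      # standardize predictors, as with ridge
    n = X.shape[0]
    zero_counts = []
    for lam in lambdas:
        # scikit-learn minimizes RSS/(2n) + alpha * sum(|beta_j|), so alpha = lam/(2n) mirrors (6.7)
        fit = Lasso(alpha=lam / (2 * n), max_iter=100000).fit(X_std, y)
        zero_counts.append(int(np.sum(fit.coef_ == 0.0)))
    return zero_counts
```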


As an example, consider the coefficient plots in Figure 6.6, which are generated from applying the lasso to the Credit data set. When λ = 0, then the lasso simply gives the least squares fit, and when λ becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero. However, in between these two extremes, the ridge regression and lasso models are quite different from each other. Moving from left to right in the right-hand panel of Figure 6.6, we observe that at first the lasso results in a model that contains only the rating predictor. Then student and limit enter the model almost simultaneously, shortly followed by income. Eventually, the remaining variables enter the model. Hence, depending on the value of λ, the lasso can produce a model involving any number of variables. In contrast, ridge regression will always include all of the variables in the model, although the magnitude of the coefficient estimates will depend on λ.

Fig. 6.6

The standardized lasso coefficients on the Credit data set are shown as a function of λ and \(\|\hat{\beta }_{\lambda }^{L}\|_{1}/\|\hat{\beta }\|_{1}\) .

6.2.2.1 Another Formulation for Ridge Regression and the Lasso

One can show that the lasso and ridge regression coefficient estimates solve the problems

$$\displaystyle{ \mathrm{minimize}_{\beta }\left \{\sum _{i=1}^{n}{\left (y_{ i} -\beta _{0} -\sum _{j=1}^{p}\beta _{ j}x_{ij}\right )}^{2}\right \}\quad \text{subject to}\quad \sum _{ j=1}^{p}\vert \beta _{ j}\vert \leq s }$$
(6.8)

and

$$\displaystyle{ \mathrm{minimize}_{\beta }\left \{\sum _{i=1}^{n}{\left (y_{ i} -\beta _{0} -\sum _{j=1}^{p}\beta _{ j}x_{ij}\right )}^{2}\right \}\quad \text{subject to }\quad \sum _{ j=1}^{p}\beta _{ j}^{2} \leq s, }$$
(6.9)

respectively. In other words, for every value of λ, there is some s such that the Equations (6.7) and (6.8) will give the same lasso coefficient estimates. Similarly, for every value of λ there is a corresponding s such that Equations (6.5) and (6.9) will give the same ridge regression coefficient estimates. When p = 2, then (6.8) indicates that the lasso coefficient estimates have the smallest RSS out of all points that lie within the diamond defined by | β 1 |  +  | β 2 | ≤ s. Similarly, the ridge regression estimates have the smallest RSS out of all points that lie within the circle defined by \(\beta _{1}^{2} +\beta _{ 2}^{2} \leq s\).

We can think of (6.8) as follows. When we perform the lasso we are trying to find the set of coefficient estimates that lead to the smallest RSS, subject to the constraint that there is a budget s for how large \(\sum _{j=1}^{p}\vert \beta _{j}\vert \) can be. When s is extremely large, then this budget is not very restrictive, and so the coefficient estimates can be large. In fact, if s is large enough that the least squares solution falls within the budget, then (6.8) will simply yield the least squares solution. In contrast, if s is small, then \(\sum _{j=1}^{p}\vert \beta _{j}\vert \) must be small in order to avoid violating the budget. Similarly, (6.9) indicates that when we perform ridge regression, we seek a set of coefficient estimates such that the RSS is as small as possible, subject to the requirement that \(\sum _{j=1}^{p}\beta _{j}^{2}\) not exceed the budget s.

The formulations (6.8) and (6.9) reveal a close connection between the lasso, ridge regression, and best subset selection. Consider the problem

$$\displaystyle{ \mathrm{minimize}_{\beta }\left \{\sum _{i=1}^{n}{\left (y_{ i} -\beta _{0} -\sum _{j=1}^{p}\beta _{ j}x_{ij}\right )}^{2}\right \}\quad \text{subject to}\quad \sum _{ j=1}^{p}I(\beta _{ j}\neq 0) \leq s. }$$
(6.10)

Here I(β j ≠0) is an indicator variable: it takes on a value of 1 if β j ≠0, and equals zero otherwise. Then (6.10) amounts to finding a set of coefficient estimates such that RSS is as small as possible, subject to the constraint that no more than s coefficients can be nonzero. The problem (6.10) is equivalent to best subset selection. Unfortunately, solving (6.10) is computationally infeasible when p is large, since it requires considering all \(\left ({ p \atop s} \right )\) models containing s predictors. Therefore, we can interpret ridge regression and the lasso as computationally feasible alternatives to best subset selection that replace the intractable form of the budget in (6.10) with forms that are much easier to solve. Of course, the lasso is much more closely related to best subset selection, since only the lasso performs feature selection for s sufficiently small in (6.8).

6.2.2.2 The Variable Selection Property of the Lasso

Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero? The formulations (6.8) and (6.9) can be used to shed light on the issue. Figure 6.7 illustrates the situation. The least squares solution is marked as \(\hat{\beta }\), while the blue diamond and circle represent the lasso and ridge regression constraints in (6.8) and (6.9), respectively. If s is sufficiently large, then the constraint regions will contain \(\hat{\beta }\), and so the ridge regression and lasso estimates will be the same as the least squares estimates. (Such a large value of s corresponds to λ = 0 in (6.5) and (6.7).) However, in Figure 6.7 the least squares estimates lie outside of the diamond and the circle, and so the least squares estimates are not the same as the lasso and ridge regression estimates.

Fig. 6.7

Contours of the error and constraint functions for the lasso (left) and ridge regression (right) . The solid blue areas are the constraint regions, \(\vert \beta _{1}\vert + \vert \beta _{2}\vert \leq s\) and \(\beta _{1}^{2} +\beta _{ 2}^{2} \leq s\) , while the red ellipses are the contours of the RSS.

The ellipses that are centered around \(\hat{\beta }\) represent regions of constant RSS. In other words, all of the points on a given ellipse share a common value of the RSS. As the ellipses expand away from the least squares coefficient estimates, the RSS increases. Equations (6.8) and (6.9) indicate that the lasso and ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region. Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero. However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero. In higher dimensions, many of the coefficient estimates may equal zero simultaneously. In Figure 6.7, the intersection occurs at β 1 = 0, and so the resulting model will only include β 2.

In Figure 6.7, we considered the simple case of p = 2. When p = 3, then the constraint region for ridge regression becomes a sphere, and the constraint region for the lasso becomes a polyhedron. When p > 3, the constraint for ridge regression becomes a hypersphere, and the constraint for the lasso becomes a polytope. However, the key ideas depicted in Figure 6.7 still hold. In particular, the lasso leads to feature selection when p > 2 due to the sharp corners of the polyhedron or polytope.

6.2.2.3 Comparing the Lasso and Ridge Regression

Fig. 6.8

Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dotted). Both are plotted against their R 2 on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.

It is clear that the lasso has a major advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors. However, which method leads to better prediction accuracy? Figure 6.8 displays the variance, squared bias, and test MSE of the lasso applied to the same simulated data as in Figure 6.5. Clearly the lasso leads to qualitatively similar behavior to ridge regression, in that as λ increases, the variance decreases and the bias increases. In the right-hand panel of Figure 6.8, the dotted lines represent the ridge regression fits. Here we plot both against their R 2 on the training data. This is another useful way to index models, and can be used to compare models with different types of regularization, as is the case here. In this example, the lasso and ridge regression result in almost identical biases. However, the variance of ridge regression is slightly lower than the variance of the lasso. Consequently, the minimum MSE of ridge regression is slightly smaller than that of the lasso.

However, the data in Figure 6.8 were generated in such a way that all 45 predictors were related to the response—that is, none of the true coefficients β 1, , β 45 equaled zero. The lasso implicitly assumes that a number of the coefficients truly equal zero. Consequently, it is not surprising that ridge regression outperforms the lasso in terms of prediction error in this setting. Figure 6.9 illustrates a similar situation, except that now the response is a function of only 2 out of 45 predictors. Now the lasso tends to outperform ridge regression in terms of bias, variance, and MSE.

Fig. 6.9

Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso. The simulated data is similar to that in Figure 6.8, except that now only two predictors are related to the response. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dotted). Both are plotted against their R 2 on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.

These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that is related to the response is never known a priori for real data sets. A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.

As with ridge regression, when the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions. Unlike ridge regression, the lasso performs variable selection, and hence results in models that are easier to interpret.

There are very efficient algorithms for fitting both ridge and lasso models; in both cases the entire coefficient paths can be computed with about the same amount of work as a single least squares fit. We will explore this further in the lab at the end of this chapter.

6.2.2.4 A Simple Special Case for Ridge Regression and the Lasso

In order to obtain a better intuition about the behavior of ridge regression and the lasso, consider a simple special case with n = p, and X a diagonal matrix with 1’s on the diagonal and 0’s in all off-diagonal elements. To simplify the problem further, assume also that we are performing regression without an intercept. With these assumptions, the usual least squares problem simplifies to finding \(\beta _{1},\ldots,\beta _{p}\) that minimize

$$\displaystyle{ \sum _{j=1}^{p}{(y_{ j} -\beta _{j})}^{2}. }$$
(6.11)

In this case, the least squares solution is given by

$$\displaystyle{\hat{\beta }_{j} = y_{j}.}$$

And in this setting, ridge regression amounts to finding β 1, , β p such that

$$\displaystyle{ \sum _{j=1}^{p}{(y_{ j} -\beta _{j})}^{2} +\lambda \sum _{ j=1}^{p}\beta _{ j}^{2} }$$
(6.12)

is minimized, and the lasso amounts to finding the coefficients such that

$$\displaystyle{ \sum _{j=1}^{p}{(y_{ j} -\beta _{j})}^{2} +\lambda \sum _{ j=1}^{p}\vert \beta _{ j}\vert }$$
(6.13)

is minimized. One can show that in this setting, the ridge regression estimates take the form

$$\displaystyle{ \hat{\beta }_{j}^{R} = y_{ j}/(1+\lambda ), }$$
(6.14)

and the lasso estimates take the form

$$\displaystyle{ \hat{\beta }_{j}^{L} = \left \{\begin{array}{@{}l@{\quad }l@{}} y_{j} -\lambda /2\quad &\mbox{ if }y_{j} >\lambda /2; \\ y_{j} +\lambda /2\quad &\mbox{ if }y_{j} < -\lambda /2; \\ 0 \quad &\mbox{ if }\vert y_{j}\vert \leq \lambda /2. \end{array} \right. }$$
(6.15)

Figure 6.10 displays the situation. We can see that ridge regression and the lasso perform two very different types of shrinkage. In ridge regression, each least squares coefficient estimate is shrunken by the same proportion. In contrast, the lasso shrinks each least squares coefficient towards zero by a constant amount, λ ∕ 2; the least squares coefficients that are less than λ ∕ 2 in absolute value are shrunken entirely to zero. The type of shrinkage performed by the lasso in this simple setting (6.15) is known as soft-thresholding. The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection.
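
The two shrinkage rules (6.14) and (6.15) are simple enough to write out directly; a small numpy sketch:

```python
import numpy as np

def ridge_shrink(y, lam):
    """Orthonormal special case (6.14): shrink every least squares estimate by the same proportion."""
    return y / (1.0 + lam)

def soft_threshold(y, lam):
    """Orthonormal special case (6.15): shift each estimate towards zero by lam/2, truncating at zero."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)
```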

Fig. 6.10

The ridge regression and lasso coefficient estimates for a simple setting with n = p and X a diagonal matrix with 1’s on the diagonal. Left: The ridge regression coefficient estimates are shrunken proportionally towards zero, relative to the least squares estimates. Right: The lasso coefficient estimates are soft-thresholded towards zero.


In the case of a more general data matrix X, the story is a little more complicated than what is depicted in Figure 6.10, but the main ideas still hold approximately: ridge regression more or less shrinks every dimension of the data by the same proportion, whereas the lasso more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.

6.2.2.5 Bayesian Interpretation for Ridge Regression and the Lasso

We now show that one can view ridge regression and the lasso through a Bayesian lens. A Bayesian viewpoint for regression assumes that the coefficient vector β has some prior distribution, say p(β), where \(\beta = {(\beta _{0},\beta _{1},\ldots,\beta _{p})}^{T}\). The likelihood of the data can be written as f(Y | X, β), where \(X = (X_{1},\ldots,X_{p})\). Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the posterior distribution, which takes the form

$$\displaystyle{p(\beta \vert X,Y ) \propto f(Y \vert X,\beta )p(\beta \vert X) = f(Y \vert X,\beta )p(\beta ),}$$

where the proportionality above follows from Bayes’ theorem, and the equality above follows from the assumption that X is fixed.

posterior distribution

We assume the usual linear model,

$$\displaystyle{Y =\beta _{0} + X_{1}\beta _{1} +\ldots +X_{p}\beta _{p}+\epsilon,}$$

and suppose that the errors are independent and drawn from a normal distribution. Furthermore, assume that \(p(\beta ) =\prod _{ j=1}^{p}g(\beta _{j})\), for some density function g. It turns out that ridge regression and the lasso follow naturally from two special cases of g:

  • If g is a Gaussian distribution with mean zero and standard deviation a function of λ, then it follows that the posterior mode for β—that is, the most likely value for β, given the data—is given by the ridge regression solution. (In fact, the ridge regression solution is also the posterior mean.)

  • If g is a double-exponential (Laplace) distribution with mean zero and scale parameter a function of λ, then it follows that the posterior mode for β is the lasso solution. (However, the lasso solution is not the posterior mean, and in fact, the posterior mean does not yield a sparse coefficient vector.)

posterior mode

The Gaussian and double-exponential priors are displayed in Figure 6.11. Therefore, from a Bayesian viewpoint, ridge regression and the lasso follow directly from assuming the usual linear model with normal errors, together with a simple prior distribution for β. Notice that the lasso prior is steeply peaked at zero, while the Gaussian is flatter and fatter at zero. Hence, the lasso expects a priori that many of the coefficients are (exactly) zero, while ridge assumes the coefficients are randomly distributed about zero.
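
To make these two bullet points a little more concrete, here is a brief sketch (assuming the errors have a fixed variance σ2, and writing b for the scale of the double-exponential prior and c for the variance of the Gaussian prior; these symbols are not used elsewhere in this chapter). Up to an additive constant, the negative log posterior under the double-exponential prior is

$$\displaystyle{-\log p(\beta \vert X,Y ) = \frac{1}{2{\sigma }^{2}}\sum _{i=1}^{n}{\Big(y_{i} -\beta _{0} -\sum _{j=1}^{p}\beta _{j}x_{ij}\Big)}^{2} + \frac{1}{b}\sum _{j=1}^{p}\vert \beta _{j}\vert,}$$

so the posterior mode minimizes the lasso criterion (6.7) with \(\lambda = 2{\sigma }^{2}/b\). The same calculation with a mean-zero Gaussian prior of variance c yields the ridge criterion (6.5) with \(\lambda = {\sigma }^{2}/c\).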

Fig. 6.11

Left: Ridge regression is the posterior mode for β under a Gaussian prior. Right: The lasso is the posterior mode for β under a double-exponential prior.

6.2.3 Selecting the Tuning Parameter

Just as the subset selection approaches considered in Section 6.1 require a method to determine which of the models under consideration is best, implementing ridge regression and the lasso requires a method for selecting a value for the tuning parameter λ in (6.5) and (6.7), or equivalently, the value of the constraint s in (6.9) and (6.8). Cross-validation provides a simple way to tackle this problem. We choose a grid of λ values, and compute the cross-validation error for each value of λ, as described in Chapter 5. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
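
As a rough sketch of this procedure in R, one might use cv.glmnet() from the glmnet package (the same package used in the lab); the simulated x and y below are purely illustrative.

library(glmnet)
set.seed(1)
x <- matrix(rnorm(50 * 20), 50, 20)
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(50)
cv.out <- cv.glmnet(x, y, alpha = 1, nfolds = 10)        # ten-fold CV over a grid of lambda values
best.lam <- cv.out$lambda.min                            # lambda with the smallest CV error
final.fit <- glmnet(x, y, alpha = 1, lambda = best.lam)  # re-fit using all of the observations
coef(final.fit)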

Figure 6.12 displays the choice of λ that results from performing leave-one-out cross-validation on the ridge regression fits from the Credit data set. The dashed vertical lines indicate the selected value of λ. In this case the value is relatively small, indicating that the optimal fit only involves a small amount of shrinkage relative to the least squares solution. In addition, the dip is not very pronounced, so there is rather a wide range of values that would give very similar error. In a case like this we might simply use the least squares solution.

Fig. 6.12

Left: Cross-validation errors that result from applying ridge regression to the Credit data set with various values of λ. Right: The coefficient estimates as a function of λ. The vertical dashed lines indicate the value of λ selected by cross-validation.

Figure 6.13 provides an illustration of ten-fold cross-validation applied to the lasso fits on the sparse simulated data from Figure 6.9. The left-hand panel of Figure 6.13 displays the cross-validation error, while the right-hand panel displays the coefficient estimates. The vertical dashed lines indicate the point at which the cross-validation error is smallest. The two colored lines in the right-hand panel of Figure 6.13 represent the two predictors that are related to the response, while the grey lines represent the unrelated predictors; these are often referred to as signal and noise variables, respectively. Not only has the lasso correctly given much larger coefficient estimates to the two signal predictors, but also the minimum cross-validation error corresponds to a set of coefficient estimates for which only the signal variables are non-zero. Hence cross-validation together with the lasso has correctly identified the two signal variables in the model, even though this is a challenging setting, with p = 45 variables and only n = 50 observations. In contrast, the least squares solution—displayed on the far right of the right-hand panel of Figure 6.13—assigns a large coefficient estimate to only one of the two signal variables.

Fig. 6.13

Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set from Figure 6.9. Right: The corresponding lasso coefficient estimates are displayed. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.

signal

6.3 Dimension Reduction Methods

The methods that we have discussed so far in this chapter have controlled variance in two different ways, either by using a subset of the original variables, or by shrinking their coefficients toward zero. All of these methods are defined using the original predictors, \(X_{1},X_{2},\ldots,X_{p}\). We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.

dimension reduction

Let \(Z_{1},Z_{2},\ldots,Z_{M}\) represent M < p linear combinations of our original p predictors. That is,

$$\displaystyle{ Z_{m} =\sum _{ j=1}^{p}\phi _{ jm}X_{j} }$$
(6.16)

for some constants \(\phi _{1m},\phi _{2m},\ldots,\phi _{pm},\;m = 1,\ldots,M\). We can then fit the linear regression model

$$\displaystyle{ y_{i} =\theta _{0} +\sum _{ m=1}^{M}\theta _{ m}z_{im} +\epsilon _{i},\quad i = 1,\ldots,n, }$$
(6.17)

using least squares. Note that in (6.17), the regression coefficients are given by \(\theta _{0},\theta _{1},\ldots,\theta _{M}\). If the constants \(\phi _{1m},\phi _{2m},\ldots,\phi _{pm}\) are chosen wisely, then such dimension reduction approaches can often outperform least squares regression. In other words, fitting (6.17) using least squares can lead to better results than fitting (6.1) using least squares.

linear combination

The term dimension reduction comes from the fact that this approach reduces the problem of estimating the p + 1 coefficients \(\beta _{0},\beta _{1},\ldots,\beta _{p}\) to the simpler problem of estimating the M + 1 coefficients \(\theta _{0},\theta _{1},\ldots,\theta _{M}\), where M < p. In other words, the dimension of the problem has been reduced from p + 1 to M + 1.

Notice that from (6.16),

$$\displaystyle{\sum _{m=1}^{M}\theta _{ m}z_{im} =\sum _{ m=1}^{M}\theta _{ m}\sum _{j=1}^{p}\phi _{ jm}x_{ij} =\sum _{ j=1}^{p}\sum _{ m=1}^{M}\theta _{ m}\phi _{jm}x_{ij} =\sum _{ j=1}^{p}\beta _{ j}x_{ij},}$$

where

$$\displaystyle{ \beta _{j} =\sum _{ m=1}^{M}\theta _{ m}\phi _{jm}. }$$
(6.18)

Hence (6.17) can be thought of as a special case of the original linear regression model given by (6.1). Dimension reduction serves to constrain the estimated β j coefficients, since now they must take the form (6.18). This constraint on the form of the coefficients has the potential to bias the coefficient estimates. However, in situations where p is large relative to n, selecting a value of M ≪ p can significantly reduce the variance of the fitted coefficients. If M = p, and all the Z m are linearly independent, then (6.18) poses no constraints. In this case, no dimension reduction occurs, and so fitting (6.17) is equivalent to performing least squares on the original p predictors.

All dimension reduction methods work in two steps. First, the transformed predictors \(Z_{1},Z_{2},\ldots,Z_{M}\) are obtained. Second, the model is fit using these M predictors. However, the choice of \(Z_{1},Z_{2},\ldots,Z_{M}\), or equivalently, the selection of the ϕ jm ’s, can be achieved in different ways. In this chapter, we will consider two approaches for this task: principal components and partial least squares.
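
The following R sketch illustrates this two-step recipe on simulated data; the matrix phi of constants below is arbitrary and purely illustrative, standing in for the loadings that principal components or partial least squares would supply.

set.seed(1)
n <- 100; p <- 5; M <- 2
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p)) + rnorm(n)
phi <- matrix(rnorm(p * M), p, M)  # hypothetical loadings phi_jm
Z <- X %*% phi                     # step 1: Z_m = sum_j phi_jm X_j, as in (6.16)
fit <- lm(y ~ Z)                   # step 2: fit (6.17) by least squares
coef(fit)                          # theta_0, theta_1, ..., theta_M
phi %*% coef(fit)[-1]              # implied beta_j = sum_m theta_m phi_jm, as in (6.18)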

6.3.1 Principal Components Regression

Principal components analysis (PCA) is a popular approach for deriving a low-dimensional set of features from a large set of variables. PCA is discussed in greater detail as a tool for unsupervised learning in Chapter 10. Here we describe its use as a dimension reduction technique for regression.

principal components analysis

6.3.1.1 An Overview of Principal Components Analysis

PCA is a technique for reducing the dimension of an n × p data matrix X. The first principal component direction of the data is that along which the observations vary the most. For instance, consider Figure 6.14, which shows population size (pop) in tens of thousands of people, and ad spending for a particular company (ad) in thousands of dollars, for 100 cities. The green solid line represents the first principal component direction of the data. We can see by eye that this is the direction along which there is the greatest variability in the data. That is, if we projected the 100 observations onto this line (as shown in the left-hand panel of Figure 6.15), then the resulting projected observations would have the largest possible variance; projecting the observations onto any other line would yield projected observations with lower variance. Projecting a point onto a line simply involves finding the location on the line which is closest to the point.

Fig. 6.14

The population size ( pop ) and ad spending ( ad ) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component, and the blue dashed line indicates the second principal component.

Fig. 6.15

A subset of the advertising data. The mean pop and ad budgets are indicated with a blue circle. Left: The first principal component direction is shown in green. It is the dimension along which the data vary the most, and it also defines the line that is closest to all n of the observations. The distances from each observation to the principal component are represented using the black dashed line segments. The blue dot represents \((\overline{\mathtt{pop}},\overline{\mathtt{ad}})\) . Right: The left-hand panel has been rotated so that the first principal component direction coincides with the x-axis.

The first principal component is displayed graphically in Figure 6.14, but how can it be summarized mathematically? It is given by the formula

$$\displaystyle{ Z_{1} = 0.839 \times (\mathtt{pop}-\overline{\mathtt{pop}}) + 0.544 \times (\mathtt{ad}-\overline{\mathtt{ad}}). }$$
(6.19)

Here ϕ 11 = 0.839 and ϕ 21 = 0.544 are the principal component loadings, which define the direction referred to above. In (6.19), \(\overline{\mathtt{pop}}\) indicates the mean of all pop values in this data set, and \(\overline{\mathtt{ad}}\) indicates the mean of all advertising spending. The idea is that out of every possible linear combination of pop and ad such that \(\phi _{11}^{2} +\phi _{ 21}^{2} = 1\), this particular linear combination yields the highest variance: i.e. this is the linear combination for which \(\mbox{ Var}(\phi _{11} \times (\mathtt{pop}-\overline{\mathtt{pop}}) +\phi _{21} \times (\mathtt{ad}-\overline{\mathtt{ad}}))\) is maximized. It is necessary to consider only linear combinations of the form \(\phi _{11}^{2} +\phi _{ 21}^{2} = 1\), since otherwise we could increase ϕ 11 and ϕ 21 arbitrarily in order to blow up the variance. In (6.19), the two loadings are both positive and have similar size, and so Z 1 is almost an average of the two variables.

Since n = 100, pop and ad are vectors of length 100, and so is Z 1 in (6.19). For instance,

$$\displaystyle{ z_{i1} = 0.839 \times (\mathtt{pop}_{i} -\overline{\mathtt{pop}}) + 0.544 \times (\mathtt{ad}_{i} -\overline{\mathtt{ad}}). }$$
(6.20)

The values of \(z_{11},\ldots,z_{n1}\) are known as the principal component scores, and can be seen in the right-hand panel of Figure 6.15.

There is also another interpretation for PCA: the first principal component vector defines the line that is as close as possible to the data. For instance, in Figure 6.14, the first principal component line minimizes the sum of the squared perpendicular distances between each point and the line. These distances are plotted as dashed line segments in the left-hand panel of Figure 6.15, in which the crosses represent the projection of each point onto the first principal component line. The first principal component has been chosen so that the projected observations are as close as possible to the original observations.

In the right-hand panel of Figure 6.15, the left-hand panel has been rotated so that the first principal component direction coincides with the x-axis. It is possible to show that the first principal component score for the ith observation, given in (6.20), is the distance in the x-direction of the ith cross from zero. So for example, the point in the bottom-left corner of the left-hand panel of Figure 6.15 has a large negative principal component score, \(z_{i1} = -26.1\), while the point in the top-right corner has a large positive score, \(z_{i1} = 18.7\). These scores can be computed directly using (6.20).

We can think of the values of the principal component Z 1 as single-number summaries of the joint pop and ad budgets for each location. In this example, if \(z_{i1} = 0.839 \times (\mathtt{pop}_{i} -\overline{\mathtt{pop}}) + 0.544 \times (\mathtt{ad}_{i} -\overline{\mathtt{ad}}) < 0\), then this indicates a city with below-average population size and below-average ad spending. A positive score suggests the opposite. How well can a single number represent both pop and ad? In this case, Figure 6.14 indicates that pop and ad have approximately a linear relationship, and so we might expect that a single-number summary will work well. Figure 6.16 displays z i1 versus both pop and ad. The plots show a strong relationship between the first principal component and the two features. In other words, the first principal component appears to capture most of the information contained in the pop and ad predictors.
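
As an illustration of how such loadings and scores might be computed in practice, here is a short R sketch using prcomp() on simulated data that loosely mimic the pop and ad example; the actual advertising data, and the loadings 0.839 and 0.544 in (6.19), are not reproduced by this sketch.

set.seed(1)
pop <- rnorm(100, mean = 30, sd = 10)
ad  <- 0.6 * pop + rnorm(100, sd = 4)
pr.out <- prcomp(cbind(pop, ad), center = TRUE, scale. = FALSE)
pr.out$rotation[, 1]  # loadings phi_11 and phi_21 of the first principal component
head(pr.out$x[, 1])   # first principal component scores z_i1, as in (6.20)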

Fig. 6.16

Plots of the first principal component scores z i1 versus pop and ad . The relationships are strong.

So far we have concentrated on the first principal component. In general, one can construct up to p distinct principal components. The second principal component Z 2 is a linear combination of the variables that is uncorrelated with Z 1, and has largest variance subject to this constraint. The second principal component direction is illustrated as a dashed blue line in Figure 6.14. It turns out that the zero correlation condition of Z 1 with Z 2 is equivalent to the condition that the direction must be perpendicular, or orthogonal, to the first principal component direction. The second principal component is given by the formula

$$\displaystyle{Z_{2} = 0.544 \times (\mathtt{pop}-\overline{\mathtt{pop}}) - 0.839 \times (\mathtt{ad}-\overline{\mathtt{ad}}).}$$

Since the advertising data has two predictors, the first two principal components contain all of the information that is in pop and ad. However, by construction, the first component will contain the most information. Consider, for example, the much larger variability of z i1 (the x-axis) versus z i2 (the y-axis) in the right-hand panel of Figure 6.15. The fact that the second principal component scores are much closer to zero indicates that this component captures far less information. As another illustration, Figure 6.17 displays z i2 versus pop and ad. There is little relationship between the second principal component and these two predictors, again suggesting that in this case, one only needs the first principal component in order to accurately represent the pop and ad budgets.

Fig. 6.17

Plots of the second principal component scores z i2 versus pop and ad . The relationships are weak.

perpendicular

orthogonal

With two-dimensional data, such as in our advertising example, we can construct at most two principal components. However, if we had other predictors, such as population age, income level, education, and so forth, then additional components could be constructed. They would successively maximize variance, subject to the constraint of being uncorrelated with the preceding components.

6.3.1.2 The Principal Components Regression Approach

The principal components regression (PCR) approach involves constructing the first M principal components, Z 1, , Z M , and then using these components as the predictors in a linear regression model that is fit using least squares. The key idea is that often a small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response. In other words, we assume that the directions in which X 1 ,…,X p show the most variation are the directions that are associated with Y. While this assumption is not guaranteed to be true, it often turns out to be a reasonable enough approximation to give good results.

principal components regression

If the assumption underlying PCR holds, then fitting a least squares model to Z 1, , Z M will lead to better results than fitting a least squares model to X 1, , X p , since most or all of the information in the data that relates to the response is contained in Z 1, , Z M , and by estimating only M ≪ p coefficients we can mitigate overfitting. In the advertising data, the first principal component explains most of the variance in both pop and ad, so a principal component regression that uses this single variable to predict some response of interest, such as sales, will likely perform quite well.
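
A minimal sketch of PCR in R uses pcr() from the pls package, with the number of components chosen by cross-validation; the simulated data below are purely illustrative.

library(pls)
set.seed(2)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- x[, 1] - 2 * x[, 2] + rnorm(100)
dat <- data.frame(y = y, x)
pcr.fit <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")  # standardize, then choose M by CV
summary(pcr.fit)                            # cross-validated error for each number of components M
validationplot(pcr.fit, val.type = "MSEP")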

Fig. 6.18

PCR was applied to two simulated data sets. Left: Simulated data from Figure  6.8. Right: Simulated data from Figure  6.9.

Figure 6.18 displays the PCR fits on the simulated data sets from Figures 6.8 and 6.9. Recall that both data sets were generated using n = 50 observations and p = 45 predictors. However, while the response in the first data set was a function of all the predictors, the response in the second data set was generated using only two of the predictors. The curves are plotted as a function of M, the number of principal components used as predictors in the regression model. As more principal components are used in the regression model, the bias decreases, but the variance increases. This results in a typical U-shape for the mean squared error. When \(M = p = 45\), then PCR amounts simply to a least squares fit using all of the original predictors. The figure indicates that performing PCR with an appropriate choice of M can result in a substantial improvement over least squares, especially in the left-hand panel. However, by examining the ridge regression and lasso results in Figures 6.5, 6.8, and 6.9, we see that PCR does not perform as well as the two shrinkage methods in this example.

The relatively worse performance of PCR in Figure 6.18 is a consequence of the fact that the data were generated in such a way that many principal components are required in order to adequately model the response. In contrast, PCR will tend to do well in cases when the first few principal components are sufficient to capture most of the variation in the predictors as well as the relationship with the response. The left-hand panel of Figure 6.19 illustrates the results from another simulated data set designed to be more favorable to PCR. Here the response was generated in such a way that it depends exclusively on the first five principal components. Now the bias drops to zero rapidly as M, the number of principal components used in PCR, increases. The mean squared error displays a clear minimum at M = 5. The right-hand panel of Figure 6.19 displays the results on these data using ridge regression and the lasso. All three methods offer a significant improvement over least squares. However, PCR and ridge regression slightly outperform the lasso.

Fig. 6.19

PCR, ridge regression, and the lasso were applied to a simulated data set in which the first five principal components of X contain all the information about the response Y. In each panel, the irreducible error Var(ε) is shown as a horizontal dashed line. Left: Results for PCR. Right: Results for lasso (solid) and ridge regression (dotted). The x-axis displays the shrinkage factor of the coefficient estimates, defined as the ℓ 2 norm of the shrunken coefficient estimates divided by the ℓ 2 norm of the least squares estimate.

We note that even though PCR provides a simple way to perform regression using M < p predictors, it is not a feature selection method. This is because each of the M principal components used in the regression is a linear combination of all p of the original features. For instance, in (6.19), Z 1 was a linear combination of both pop and ad. Therefore, while PCR often performs quite well in many practical settings, it does not result in the development of a model that relies upon a small set of the original features. In this sense, PCR is more closely related to ridge regression than to the lasso. In fact, one can show that PCR and ridge regression are very closely related. One can even think of ridge regression as a continuous version of PCR!

In PCR, the number of principal components, M, is typically chosen by cross-validation. The results of applying PCR to the Credit data set are shown in Figure 6.20; the right-hand panel displays the cross-validation errors obtained, as a function of M. On these data, the lowest cross-validation error occurs when there are M = 10 components; this corresponds to almost no dimension reduction at all, since PCR with M = 11 is equivalent to simply performing least squares.

Fig. 6.20

Left: PCR standardized coefficient estimates on the Credit data set for different values of M. Right: The ten-fold cross validation MSE obtained using PCR, as a function of M.

When performing PCR, we generally recommend standardizing each predictor, using (6.6), prior to generating the principal components. This standardization ensures that all variables are on the same scale. In the absence of standardization, the high-variance variables will tend to play a larger role in the principal components obtained, and the scale on which the variables are measured will ultimately have an effect on the final PCR model. However, if the variables are all measured in the same units (say, kilograms, or inches), then one might choose not to standardize them.

6.3.2 Partial Least Squares

The PCR approach that we just described involves identifying linear combinations, or directions, that best represent the predictors X 1, , X p . These directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions. That is, the response does not supervise the identification of the principal components. Consequently, PCR suffers from a drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response. Unsupervised methods are discussed further in Chapter 10.

We now present partial least squares (PLS), a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method, which first identifies a new set of features Z 1, , Z M that are linear combinations of the original features, and then fits a linear model via least squares using these M new features. But unlike PCR, PLS identifies these new features in a supervised way—that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also that are related to the response. Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.

partial least squares

We now describe how the first PLS direction is computed. After standardizing the p predictors, PLS computes the first direction Z 1 by setting each ϕ j1 in (6.16) equal to the coefficient from the simple linear regression of Y onto X j . One can show that this coefficient is proportional to the correlation between Y and X j . Hence, in computing \(Z_{1} =\sum _{ j=1}^{p}\phi _{j1}X_{j}\), PLS places the highest weight on the variables that are most strongly related to the response.
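
The following R sketch computes the first PLS direction exactly as just described, on purely illustrative simulated data: each ϕ j1 is the slope from the simple linear regression of y onto the standardized X j , and Z 1 is the corresponding linear combination.

set.seed(3)
x <- scale(matrix(rnorm(100 * 5), 100, 5))  # standardized predictors
y <- x[, 1] + 0.5 * x[, 2] + rnorm(100)
phi1 <- apply(x, 2, function(xj) coef(lm(y ~ xj))[2])  # simple-regression slopes, one per predictor
z1 <- x %*% phi1                                       # scores on the first PLS direction Z_1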

Figure 6.21 displays an example of PLS on the advertising data. The solid green line indicates the first PLS direction, while the dotted line shows the first principal component direction. PLS has chosen a direction that has less change in the ad dimension per unit change in the pop dimension, relative to PCA. This suggests that pop is more highly correlated with the response than is ad. The PLS direction does not fit the predictors as closely as does PCA, but it does a better job explaining the response.

Fig. 6.21

For the advertising data, the first PLS direction (solid line) and first PCR direction (dotted line) are shown.

To identify the second PLS direction we first adjust each of the variables for Z 1, by regressing each variable on Z 1 and taking residuals. These residuals can be interpreted as the remaining information that has not been explained by the first PLS direction. We then compute Z 2 using this orthogonalized data in exactly the same fashion as Z 1 was computed based on the original data. This iterative approach can be repeated M times to identify multiple PLS components Z 1, , Z M . Finally, at the end of this procedure, we use least squares to fit a linear model to predict Y using Z 1, , Z M in exactly the same fashion as for PCR.

As with PCR, the number M of partial least squares directions used in PLS is a tuning parameter that is typically chosen by cross-validation. We generally standardize the predictors and response before performing PLS.
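
In practice one would typically rely on an existing implementation rather than the manual construction above. A minimal sketch with plsr() from the pls package, reusing the simulated x and y from the previous sketch:

library(pls)
pls.fit <- plsr(y ~ x, scale = TRUE, validation = "CV")
summary(pls.fit)  # cross-validated error as a function of the number of components M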

PLS is popular in the field of chemometrics, where many variables arise from digitized spectrometry signals. In practice it often performs no better than ridge regression or PCR. While the supervised dimension reduction of PLS can reduce bias, it also has the potential to increase variance, so that the overall benefit of PLS relative to PCR is a wash.

6.4 Considerations in High Dimensions

6.4.1 High-Dimensional Data

Most traditional statistical techniques for regression and classification are intended for the low-dimensional setting in which n, the number of observations, is much greater than p, the number of features. This is due in part to the fact that throughout most of the field’s history, the bulk of scientific problems requiring the use of statistics have been low-dimensional. For instance, consider the task of developing a model to predict a patient’s blood pressure on the basis of his or her age, gender, and body mass index (BMI). There are three predictors, or four if an intercept is included in the model, and perhaps several thousand patients for whom blood pressure and age, gender, and BMI are available. Hence n ≫ p, and so the problem is low-dimensional. (By dimension here we are referring to the size of p.)

low-dimensional

In the past 20 years, new technologies have changed the way that data are collected in fields as diverse as finance, marketing, and medicine. It is now commonplace to collect an almost unlimited number of feature measurements (p very large). While p can be extremely large, the number of observations n is often limited due to cost, sample availability, or other considerations. Two examples are as follows:

  1.

    Rather than predicting blood pressure on the basis of just age, gender, and BMI, one might also collect measurements for half a million single nucleotide polymorphisms (SNPs; these are individual DNA mutations that are relatively common in the population) for inclusion in the predictive model. Then n ≈ 200 and p ≈ 500,000.

  2.

    A marketing analyst interested in understanding people’s online shopping patterns could treat as features all of the search terms entered by users of a search engine. This is sometimes known as the “bag-of-words” model. The same researcher might have access to the search histories of only a few hundred or a few thousand search engine users who have consented to share their information with the researcher. For a given user, each of the p search terms is scored present (1) or absent (0), creating a large binary feature vector. Then n ≈ 1,000 and p is much larger.

Data sets containing more features than observations are often referred to as high-dimensional. Classical approaches such as least squares linear regression are not appropriate in this setting. Many of the issues that arise in the analysis of high-dimensional data were discussed earlier in this book, since they apply also when n > p: these include the role of the bias-variance trade-off and the danger of overfitting. Though these issues are always relevant, they can become particularly important when the number of features is very large relative to the number of observations.

high-dimensional

We have defined the high-dimensional setting as the case where the number of features p is larger than the number of observations n. But the considerations that we will now discuss certainly also apply if p is slightly smaller than n, and are best always kept in mind when performing supervised learning.

6.4.2 What Goes Wrong in High Dimensions?

In order to illustrate the need for extra care and specialized techniques for regression and classification when p > n, we begin by examining what can go wrong if we apply a statistical technique not intended for the high-dimensional setting. For this purpose, we examine least squares regression. But the same concepts apply to logistic regression, linear discriminant analysis, and other classical statistical approaches.

When the number of features p is as large as, or larger than, the number of observations n, least squares as described in Chapter 3 cannot (or rather, should not) be performed. The reason is simple: regardless of whether or not there truly is a relationship between the features and the response, least squares will yield a set of coefficient estimates that result in a perfect fit to the data, such that the residuals are zero.

An example is shown in Figure 6.22 with p = 1 feature (plus an intercept) in two cases: when there are 20 observations, and when there are only two observations. When there are 20 observations, n > p and the least squares regression line does not perfectly fit the data; instead, the regression line seeks to approximate the 20 observations as well as possible. On the other hand, when there are only two observations, then regardless of the values of those observations, the regression line will fit the data exactly. This is problematic because this perfect fit will almost certainly lead to overfitting of the data. In other words, though it is possible to perfectly fit the training data in the high-dimensional setting, the resulting linear model will perform extremely poorly on an independent test set, and therefore does not constitute a useful model. In fact, we can see that this happened in Figure 6.22: the least squares line obtained in the right-hand panel will perform very poorly on a test set comprised of the observations in the left-hand panel. The problem is simple: when p > n or p ≈ n, a simple least squares regression line is too flexible and hence overfits the data.
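
A short R sketch makes the point: with n = 2 observations and two parameters (an intercept and a slope), the residuals are zero no matter how the data were generated; the data below are pure noise.

set.seed(4)
x <- rnorm(2)
y <- rnorm(2)   # y is completely unrelated to x
fit <- lm(y ~ x)
residuals(fit)  # both residuals are (numerically) zero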

Fig. 6.22

Left: Least squares regression in the low-dimensional setting. Right: Least squares regression with n = 2 observations and two parameters to be estimated (an intercept and a coefficient).

Figure 6.23 further illustrates the risk of carelessly applying least squares when the number of features p is large. Data were simulated with n = 20 observations, and regression was performed with between 1 and 20 features, each of which was completely unrelated to the response. As shown in the figure, the model R 2 increases to 1 as the number of features included in the model increases, and correspondingly the training set MSE decreases to 0 as the number of features increases, even though the features are completely unrelated to the response. On the other hand, the MSE on an independent test set becomes extremely large as the number of features included in the model increases, because including the additional predictors leads to a vast increase in the variance of the coefficient estimates. Looking at the test set MSE, it is clear that the best model contains at most a few variables. However, someone who carelessly examines only the R 2 or the training set MSE might erroneously conclude that the model with the greatest number of variables is best. This indicates the importance of applying extra care when analyzing data sets with a large number of variables, and of always evaluating model performance on an independent test set.
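
A rough sketch of this kind of simulation in R (the exact settings used to produce Figure 6.23 are not given here, so the numbers below are illustrative only):

set.seed(5)
n <- 20
y <- rnorm(n)
x <- matrix(rnorm(n * 19), n, 19)  # up to 19 pure-noise features
r2 <- sapply(1:19, function(p) summary(lm(y ~ x[, 1:p, drop = FALSE]))$r.squared)
round(r2, 2)  # the training R^2 climbs steadily toward 1 as noise features are added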

Fig. 6.23

On a simulated example with n = 20 training observations, features that are completely unrelated to the outcome are added to the model. Left: The R 2 increases to 1 as more features are included. Center: The training set MSE decreases to 0 as more features are included. Right: The test set MSE increases as more features are included.

In Section 6.1.3, we saw a number of approaches for adjusting the training set RSS or R 2 in order to account for the number of variables used to fit a least squares model. Unfortunately, the C p , AIC, and BIC approaches are not appropriate in the high-dimensional setting, because estimating \(\hat{{\sigma }}^{2}\) is problematic. (For instance, the formula for \(\hat{{\sigma }}^{2}\) from Chapter 3 yields an estimate \(\hat{{\sigma }}^{2} = 0\) in this setting.) Similarly, problems arise in the application of adjusted R 2 in the high-dimensional setting, since one can easily obtain a model with an adjusted R 2 value of 1. Clearly, alternative approaches that are better-suited to the high-dimensional setting are required.

6.4.3 Regression in High Dimensions

It turns out that many of the methods seen in this chapter for fitting less flexible least squares models, such as forward stepwise selection, ridge regression, the lasso, and principal components regression, are particularly useful for performing regression in the high-dimensional setting. Essentially, these approaches avoid overfitting by using a less flexible fitting approach than least squares.

Figure 6.24 illustrates the performance of the lasso in a simple simulated example. There are p = 20, 50, or 2,000 features, of which 20 are truly associated with the outcome. The lasso was performed on n = 100 training observations, and the mean squared error was evaluated on an independent test set. As the number of features increases, the test set error increases. When p = 20, the lowest validation set error was achieved when λ in (6.7) was small; however, when p was larger, the lowest validation set error was achieved using a larger value of λ. In each boxplot, rather than reporting the values of λ used, the degrees of freedom of the resulting lasso solution are displayed; this is simply the number of non-zero coefficient estimates in the lasso solution, and is a measure of the flexibility of the lasso fit. Figure 6.24 highlights three important points: (1) regularization or shrinkage plays a key role in high-dimensional problems, (2) appropriate tuning parameter selection is crucial for good predictive performance, and (3) the test error tends to increase as the dimensionality of the problem (i.e. the number of features or predictors) increases, unless the additional features are truly associated with the response.
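
A minimal sketch of the degrees-of-freedom measure in R: glmnet() reports the number of non-zero coefficients at each value of λ as the df component of the fit. The p ≫ n data below are simulated purely for illustration.

library(glmnet)
set.seed(6)
n <- 100; p <- 2000
x <- matrix(rnorm(n * p), n, p)
y <- rowSums(x[, 1:20]) + rnorm(n)
fit <- glmnet(x, y, alpha = 1)
cbind(lambda = fit$lambda, df = fit$df)[1:5, ]  # larger lambda, fewer non-zero coefficients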

Fig. 6.24

The lasso was performed with n = 100 observations and three values of p, the number of features. Of the p features, 20 were associated with the response. The boxplots show the test MSEs that result using three different values of the tuning parameter λ in (6.7). For ease of interpretation, rather than reporting λ, the degrees of freedom are reported; for the lasso this turns out to be simply the number of estimated non-zero coefficients. When p = 20, the lowest test MSE was obtained with the smallest amount of regularization. When p = 50, the lowest test MSE was achieved when there is a substantial amount of regularization. When p = 2,000 the lasso performed poorly regardless of the amount of regularization, due to the fact that only 20 of the 2,000 features truly are associated with the outcome.

The third point above is in fact a key principle in the analysis of high-dimensional data, which is known as the curse of dimensionality. One might think that as the number of features used to fit a model increases, the quality of the fitted model will increase as well. However, comparing the left-hand and right-hand panels in Figure 6.24, we see that this is not necessarily the case: in this example, the test set MSE almost doubles as p increases from 20 to 2,000. In general, adding additional signal features that are truly associated with the response will improve the fitted model, in the sense of leading to a reduction in test set error. However, adding noise features that are not truly associated with the response will lead to a deterioration in the fitted model, and consequently an increased test set error. This is because noise features increase the dimensionality of the problem, exacerbating the risk of overfitting (since noise features may be assigned nonzero coefficients due to chance associations with the response on the training set) without any potential upside in terms of improved test set error. Thus, we see that new technologies that allow for the collection of measurements for thousands or millions of features are a double-edged sword: they can lead to improved predictive models if these features are in fact relevant to the problem at hand, but will lead to worse results if the features are not relevant. Even if they are relevant, the variance incurred in fitting their coefficients may outweigh the reduction in bias that they bring.

curse of dimensionality

6.4.4 Interpreting Results in High Dimensions

When we perform the lasso, ridge regression, or other regression procedures in the high-dimensional setting, we must be quite cautious in the way that we report the results obtained. In Chapter 3, we learned about multicollinearity, the concept that the variables in a regression might be correlated with each other. In the high-dimensional setting, the multicollinearity problem is extreme: any variable in the model can be written as a linear combination of all of the other variables in the model. Essentially, this means that we can never know exactly which variables (if any) truly are predictive of the outcome, and we can never identify the best coefficients for use in the regression. At most, we can hope to assign large regression coefficients to variables that are correlated with the variables that truly are predictive of the outcome.

For instance, suppose that we are trying to predict blood pressure on the basis of half a million SNPs, and that forward stepwise selection indicates that 17 of those SNPs lead to a good predictive model on the training data. It would be incorrect to conclude that these 17 SNPs predict blood pressure more effectively than the other SNPs not included in the model. There are likely to be many sets of 17 SNPs that would predict blood pressure just as well as the selected model. If we were to obtain an independent data set and perform forward stepwise selection on that data set, we would likely obtain a model containing a different, and perhaps even non-overlapping, set of SNPs. This does not detract from the value of the model obtained—for instance, the model might turn out to be very effective in predicting blood pressure on an independent set of patients, and might be clinically useful for physicians. But we must be careful not to overstate the results obtained, and to make it clear that what we have identified is simply one of many possible models for predicting blood pressure, and that it must be further validated on independent data sets.

It is also important to be particularly careful in reporting errors and measures of model fit in the high-dimensional setting. We have seen that when p > n, it is easy to obtain a useless model that has zero residuals. Therefore, one should never use sum of squared errors, p-values, R 2 statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting. For instance, as we saw in Figure 6.23, one can easily obtain a model with R 2 = 1 when p > n. Reporting this fact might mislead others into thinking that a statistically valid and useful model has been obtained, whereas in fact this provides absolutely no evidence of a compelling model. It is important to instead report results on an independent test set, or cross-validation errors. For instance, the MSE or R 2 on an independent test set is a valid measure of model fit, but the MSE on the training set certainly is not.

6.5 Lab 1: Subset Selection Methods

6.5.1 Best Subset Selection

Here we apply the best subset selection approach to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.

First of all, we note that the Salary variable is missing for some of the players. The is.na() function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a TRUE for any elements that are missing, and a FALSE for non-missing elements. The sum() function can then be used to count all of the missing elements.

is.na()

sum()

> library(ISLR)

> fix(Hitters)

> names(Hitters)

 [1] "AtBat"     "Hits"      "HmRun"    "Runs"      "RBI"

 [6] "Walks"     "Years"     "CAtBat"   "CHits"     "CHmRun"

[11] "CRuns"     "CRBI"      "CWalks"   "League"    "Division"

[16] "PutOuts"   "Assists"   "Errors"   "Salary"    "NewLeague"

> dim(Hitters)

[1] 322  20

> sum(is.na(Hitters$Salary))

[1] 59

Hence we see that Salary is missing for 59 players. The na.omit() function removes all of the rows that have missing values in any variable.

> Hitters=na.omit(Hitters)

> dim(Hitters)

[1] 263  20

> sum(is.na(Hitters))

[1] 0

The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for lm(). The summary() command outputs the best set of variables for each model size.

regsubsets()

> library(leaps)

> regfit.full=regsubsets(Salary∼.,Hitters)

> summary(regfit.full)

Subset selection object

Call: regsubsets.formula(Salary ∼ ., Hitters)

19 Variables  (and intercept)

...

1 subsets of each size up to 8

Selection Algorithm: exhaustive

         AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits

1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "

2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "

3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "

4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "

5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "

6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "

7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"

8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "

         CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts

1  ( 1 ) " "    " "   "*"  " "    " "     " "       " "

2  ( 1 ) " "    " "   "*"  " "    " "     " "       " "

3  ( 1 ) " "    " "   "*"  " "    " "     " "       "*"

4  ( 1 ) " "    " "   "*"  " "    " "     "*"       "*"

5  ( 1 ) " "    " "   "*"  " "    " "     "*"       "*"

6  ( 1 ) " "    " "   "*"  " "    " "     "*"       "*"

7  ( 1 ) "*"    " "   " "  " "    " "     "*"       "*"

8  ( 1 ) "*"    "*"   " "  "*"    " "     "*"       "*"

         Assists Errors NewLeagueN

1  ( 1 ) " "     " "    " "

2  ( 1 ) " "     " "    " "

3  ( 1 ) " "     " "    " "

4  ( 1 ) " "     " "    " "

5  ( 1 ) " "     " "    " "

6  ( 1 ) " "     " "    " "

7  ( 1 ) " "     " "    " "

8  ( 1 ) " "     " "    " "

An asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits and CRBI. By default, regsubsets() only reports results up to the best eight-variable model. But the nvmax option can be used in order to return as many variables as are desired. Here we fit up to a 19-variable model.

> regfit.full=regsubsets(Salary∼.,data=Hitters,nvmax=19)

> reg.summary=summary(regfit.full)

The summary() function also returns R 2, RSS, adjusted R 2, C p , and BIC. We can examine these to try to select the best overall model.

> names(reg.summary)

[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"

[7] "outmat" "obj"

For instance, we see that the R 2 statistic increases from 32 %, when only one variable is included in the model, to almost 55 %, when all variables are included. As expected, the R 2 statistic increases monotonically as more variables are included.

> reg.summary$rsq

 [1] 0.321 0.425 0.451 0.475 0.491 0.509 0.514 0.529 0.535

[10] 0.540 0.543 0.544 0.544 0.545 0.545 0.546 0.546 0.546

[19] 0.546

Plotting RSS, adjusted R 2, C p , and BIC for all of the models at once will help us decide which model to select. Note the type="l" option tells R to connect the plotted points with lines.

> par(mfrow=c(2,2))

> plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")

> plot(reg.summary$adjr2,xlab="Number of Variables",         ylab="Adjusted RSq",type="l")

The points() command works like the plot() command, except that it puts points on a plot that has already been created, instead of creating a new plot. The which.max() function can be used to identify the location of the maximum point of a vector. We will now plot a red dot to indicate the model with the largest adjusted R 2 statistic.

points()

> which.max(reg.summary$adjr2)

[1] 11

> points(11,reg.summary$adjr2[11], col="red",cex=2,pch=20)

In a similar fashion we can plot the C p and BIC statistics, and indicate the models with the smallest statistic using which.min().

which.min()

> plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type="l")

> which.min(reg.summary$cp)

[1] 10

> points(10,reg.summary$cp[10],col="red",cex=2,pch=20)

> which.min(reg.summary$bic)

[1] 6

> plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type="l")

> points(6,reg.summary$bic[6],col="red",cex=2,pch=20)

The regsubsets() function has a built-in plot() command which can be used to display the selected variables for the best model with a given number of predictors, ranked according to the BIC, C p , adjusted R 2, or AIC. To find out more about this function, type ?plot.regsubsets.

> plot(regfit.full,scale="r2")

> plot(regfit.full,scale="adjr2")

> plot(regfit.full,scale="Cp")

> plot(regfit.full,scale="bic")

The top row of each plot contains a black square for each variable selected according to the optimal model associated with that statistic. For instance, we see that several models share a BIC close to − 150. However, the model with the lowest BIC is the six-variable model that contains only AtBat, Hits, Walks, CRBI, DivisionW, and PutOuts. We can use the coef() function to see the coefficient estimates associated with this model.

> coef(regfit.full,6)

(Intercept)       AtBat        Hits       Walks        CRBI

     91.512      -1.869       7.604       3.698       0.643

  DivisionW     PutOuts

   -122.952       0.264

6.5.2 Forward and Backward Stepwise Selection

We can also use the regsubsets() function to perform forward stepwise or backward stepwise selection, using the argument method="forward" or method="backward".

> regfit.fwd=regsubsets(Salary∼.,data=Hitters,nvmax=19,        method="forward")

> summary(regfit.fwd)

> regfit.bwd=regsubsets(Salary∼.,data=Hitters,nvmax=19,        method="backward")

> summary(regfit.bwd)

For instance, we see that using forward stepwise selection, the best one-variable model contains only CRBI, and the best two-variable model additionally includes Hits. For this data, the best one-variable through six-variable models are each identical for best subset and forward selection. However, the best seven-variable models identified by forward stepwise selection, backward stepwise selection, and best subset selection are different.

> coef(regfit.full,7)

(Intercept)        Hits       Walks      CAtBat       CHits

     79.451       1.283       3.227      -0.375       1.496

     CHmRun   DivisionW     PutOuts

      1.442    -129.987       0.237

> coef(regfit.fwd,7)

(Intercept)       AtBat        Hits       Walks        CRBI

    109.787      -1.959       7.450       4.913       0.854

     CWalks   DivisionW     PutOuts

     -0.305    -127.122       0.253

> coef(regfit.bwd,7)

(Intercept)       AtBat        Hits       Walks       CRuns

    105.649      -1.976       6.757       6.056       1.129

     CWalks   DivisionW     PutOuts

     -0.716    -116.169       0.303

6.5.3 Choosing Among Models Using the Validation Set Approach and Cross-Validation

We just saw that it is possible to choose among a set of models of different sizes using C p , BIC, and adjusted R 2. We will now consider how to do this using the validation set and cross-validation approaches.

In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of model-fitting—including variable selection. Therefore, the determination of which model of a given size is best must be made using only the training observations. This point is subtle but important. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error.

In order to use the validation set approach, we begin by splitting the observations into a training set and a test set. We do this by creating a random vector, train, of elements equal to TRUE if the corresponding observation is in the training set, and FALSE otherwise. The vector test has a TRUE if the observation is in the test set, and a FALSE otherwise. Note the ! in the command to create test causes TRUEs to be switched to FALSEs and vice versa. We also set a random seed so that the user will obtain the same training set/test set split.

> set.seed(1)

> train=sample(c(TRUE,FALSE), nrow(Hitters),rep=TRUE)

> test=(!train)

Now, we apply regsubsets() to the training set in order to perform best subset selection.

> regfit.best=regsubsets(Salary∼.,data=Hitters[train,],     nvmax=19)

Notice that we subset the Hitters data frame directly in the call in order to access only the training subset of the data, using the expression Hitters[train,]. We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data.

> test.mat=model.matrix(Salary∼.,data=Hitters[test,])

The model.matrix() function is used in many regression packages for building an “X” matrix from data. Now we run a loop, and for each size i, we extract the coefficients from regfit.best for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE.

model.matrix()

> val.errors=rep(NA,19)

> for(i in 1:19){

+   coefi=coef(regfit.best,id=i)

+   pred=test.mat[,names(coefi)]%*%coefi

+   val.errors[i]=mean((Hitters$Salary[test]-pred)^2)

+ }

We find that the best model is the one that contains ten variables.

> val.errors

 [1] 220968 169157 178518 163426 168418 171271 162377 157909

 [9] 154056 148162 151156 151742 152214 157359 158541 158743

[17] 159973 159860 160106

> which.min(val.errors)

[1] 10

> coef(regfit.best,10)

(Intercept)       AtBat        Hits       Walks      CAtBat

    -80.275      -1.468       7.163       3.643      -0.186

      CHits      CHmRun      CWalks     LeagueN   DivisionW

      1.105       1.384      -0.748      84.558     -53.029

    PutOuts

      0.238

This was a little tedious, partly because there is no predict() method for regsubsets(). Since we will be using this function again, we can capture our steps above and write our own predict method.

>  predict.regsubsets=function(object,newdata,id,...){

+  form=as.formula(object$call[[2]])

+  mat=model.matrix(form,newdata)

+  coefi=coef(object,id=id)

+  xvars=names(coefi)

+  mat[,xvars]%*%coefi

+  }

Our function pretty much mimics what we did above. The only complex part is how we extracted the formula used in the call to regsubsets(). We demonstrate how we use this function below, when we do cross-validation.

Finally, we perform best subset selection on the full data set, and select the best ten-variable model. It is important that we make use of the full data set in order to obtain more accurate coefficient estimates. Note that we perform best subset selection on the full data set and select the best ten-variable model, rather than simply using the variables that were obtained from the training set, because the best ten-variable model on the full data set may differ from the corresponding model on the training set.

> regfit.best=regsubsets(Salary∼.,data=Hitters,nvmax=19)

> coef(regfit.best,10)

(Intercept)       AtBat        Hits       Walks      CAtBat

    162.535      -2.169       6.918       5.773      -0.130

      CRuns        CRBI      CWalks   DivisionW     PutOuts

      1.408       0.774      -0.831    -112.380       0.297

    Assists

      0.283

In fact, we see that the best ten-variable model on the full data set has a different set of variables than the best ten-variable model on the training set.

We now try to choose among the models of different sizes using cross-validation. This approach is somewhat involved, as we must perform best subset selection within each of the k training sets. Despite this, we see that with its clever subsetting syntax, R makes this job quite easy. First, we create a vector that allocates each observation to one of k = 10 folds, and we create a matrix in which we will store the results.

> k=10

> set.seed(1)

> folds=sample(1:k,nrow(Hitters),replace=TRUE)

> cv.errors=matrix(NA,k,19, dimnames=list(NULL, paste(1:19)))

Now we write a for loop that performs cross-validation. In the jth fold, the elements of folds that equal j are in the test set, and the remainder are in the training set. We make our predictions for each model size (using our new predict() method), compute the test errors on the appropriate subset, and store them in the appropriate slot in the matrix cv.errors.

> for(j in 1:k){

+  best.fit=regsubsets(Salary∼.,data=Hitters[folds!=j,],

      nvmax=19)

+  for(i in 1:19){

+    pred=predict(best.fit,Hitters[folds==j,],id=i)

+    cv.errors[j,i]=mean( (Hitters$Salary[folds==j]-pred)^2)

+    }

+  }

This has given us a 10 ×19 matrix, of which the (i, j)th element corresponds to the test MSE for the ith cross-validation fold for the best j-variable model. We use the apply() function to average over the columns of this matrix in order to obtain a vector for which the jth element is the cross-validation error for the j-variable model.

apply()

> mean.cv.errors=apply(cv.errors,2,mean)

> mean.cv.errors

 [1] 160093 140197 153117 151159 146841 138303 144346 130208

 [9] 129460 125335 125154 128274 133461 133975 131826 131883

[17] 132751 133096 132805

> par(mfrow=c(1,1))

> plot(mean.cv.errors,type="b")

We see that cross-validation selects an 11-variable model. We now perform best subset selection on the full data set in order to obtain the 11-variable model.
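As a quick check (a one-line sketch), which.min() confirms the model size that attains the smallest cross-validation error:

> which.min(mean.cv.errors)   # returns 11, the model size with the lowest CV error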

> reg.best=regsubsets(Salary~.,data=Hitters, nvmax=19)

> coef(reg.best,11)

(Intercept)       AtBat        Hits       Walks      CAtBat

    135.751      -2.128       6.924       5.620      -0.139

      CRuns        CRBI      CWalks     LeagueN   DivisionW

      1.455       0.785      -0.823      43.112    -111.146

    PutOuts     Assists

      0.289       0.269

6.6 Lab 2: Ridge Regression and the Lasso

We will use the glmnet package in order to perform ridge regression and the lasso. The main function in this package is glmnet(), which can be used to fit ridge regression models, lasso models, and more. This function has slightly different syntax from other model-fitting functions that we have encountered thus far in this book. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y ∼ x syntax. We will now perform ridge regression and the lasso in order to predict Salary on the Hitters data. Before proceeding ensure that the missing values have been removed from the data, as described in Section 6.5.

glmnet()

> x=model.matrix(Salary~.,Hitters)[,-1]

> y=Hitters$Salary

The model.matrix() function is particularly useful for creating x; not only does it produce a matrix corresponding to the 19 predictors but it also automatically transforms any qualitative variables into dummy variables. The latter property is important because glmnet() can only take numerical, quantitative inputs.
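For instance (a brief sketch), we can confirm the dimensions of x and pick out the dummy columns that model.matrix() created for the qualitative predictors League, Division, and NewLeague:

> dim(x)

[1] 263  19

> grep("League|Division",colnames(x),value=TRUE)   # the dummy-coded factor columns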

6.6.1 Ridge Regression

The glmnet() function has an alpha argument that determines what type of model is fit. If alpha=0 then a ridge regression model is fit, and if alpha=1 then a lasso model is fit. We first fit a ridge regression model.

> library(glmnet)

> grid=10^seq(10,-2,length=100)

> ridge.mod=glmnet(x,y,alpha=0,lambda=grid)

By default the glmnet() function performs ridge regression for an automatically selected range of λ values. However, here we have chosen to implement the function over a grid of values ranging from \(\lambda = 10^{10}\) to \(\lambda = 10^{-2}\), essentially covering the full range of scenarios from the null model containing only the intercept, to the least squares fit. As we will see, we can also compute model fits for a particular value of λ that is not one of the original grid values. Note that by default, the glmnet() function standardizes the variables so that they are on the same scale. To turn off this default setting, use the argument standardize=FALSE.

Associated with each value of λ is a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef(). In this case, it is a 20 ×100 matrix, with 20 rows (one for each predictor, plus an intercept) and 100 columns (one for each value of λ).

> dim(coef(ridge.mod))

[1]  20 100
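It can also be instructive to visualize how the coefficient estimates shrink as λ grows. A minimal sketch, using the plot method that glmnet provides for fitted models (its xvar argument chooses the x-axis):

> plot(ridge.mod,xvar="lambda")   # coefficient paths as a function of log(lambda)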

We expect the coefficient estimates to be much smaller, in terms of \(\ell_2\) norm, when a large value of λ is used, as compared to when a small value of λ is used. These are the coefficients when λ = 11,498, along with their \(\ell_2\) norm:

> ridge.mod$lambda[50]

[1] 11498

> coef(ridge.mod)[,50]

(Intercept)       AtBat        Hits       HmRun        Runs

    407.356       0.037       0.138       0.525       0.231

        RBI       Walks       Years      CAtBat       CHits

      0.240       0.290       1.108       0.003       0.012

     CHmRun       CRuns        CRBI      CWalks     LeagueN

      0.088       0.023       0.024       0.025       0.085

  DivisionW     PutOuts     Assists      Errors  NewLeagueN

     -6.215       0.016       0.003      -0.021       0.301

> sqrt(sum(coef(ridge.mod)[-1,50]^2))

[1] 6.36

In contrast, here are the coefficients when λ = 705, along with their \(\ell_2\) norm. Note the much larger \(\ell_2\) norm of the coefficients associated with this smaller value of λ.

> ridge.mod$lambda[60]

[1] 705

> coef(ridge.mod)[,60]

(Intercept)       AtBat        Hits       HmRun        Runs

     54.325       0.112       0.656       1.180       0.938

        RBI       Walks       Years      CAtBat       CHits

      0.847       1.320       2.596       0.011       0.047

     CHmRun       CRuns        CRBI      CWalks     LeagueN

      0.338       0.094       0.098       0.072      13.684

  DivisionW     PutOuts     Assists      Errors  NewLeagueN

    -54.659       0.119       0.016      -0.704       8.612

> sqrt(sum(coef(ridge.mod)[-1,60]^2))

[1] 57.1

We can use the predict() function for a number of purposes. For instance, we can obtain the ridge regression coefficients for a new value of λ, say 50:

> predict(ridge.mod,s=50,type="coefficients")[1:20,]

(Intercept)       AtBat        Hits       HmRun        Runs

     48.766      -0.358       1.969      -1.278       1.146

        RBI       Walks       Years      CAtBat       CHits

      0.804       2.716      -6.218       0.005       0.106

     CHmRun       CRuns        CRBI      CWalks     LeagueN

      0.624       0.221       0.219      -0.150      45.926

  DivisionW     PutOuts     Assists      Errors  NewLeagueN

   -118.201       0.250       0.122      -3.279      -9.497

We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and the lasso. There are two common ways to randomly split a data set. The first is to produce a random vector of TRUE, FALSE elements and select the observations corresponding to TRUE for the training data. The second is to randomly choose a subset of numbers between 1 and n; these can then be used as the indices for the training observations. The two approaches work equally well. We used the former method in Section 6.5.3. Here we demonstrate the latter approach.
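For completeness, here is a sketch of the first approach (train.flag and test.flag are purely illustrative names); in what follows we use the index-based approach instead.

> train.flag=sample(c(TRUE,FALSE),nrow(x),replace=TRUE)

> test.flag=(!train.flag)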

We first set a random seed so that the results obtained will be reproducible.

> set.seed(1)

> train=sample(1:nrow(x), nrow(x)/2)

> test=(-train)

> y.test=y[test]

Next we fit a ridge regression model on the training set, and evaluate its MSE on the test set, using λ = 4. Note the use of the predict() function again. This time we get predictions for a test set, by replacing type="coefficients" with the newx argument.

> ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid, thresh=1e-12)

> ridge.pred=predict(ridge.mod,s=4,newx=x[test,])

> mean((ridge.pred-y.test)^2)

[1] 101037

The test MSE is 101037. Note that if we had instead simply fit a model with just an intercept, we would have predicted each test observation using the mean of the training observations. In that case, we could compute the test set MSE like this:

> mean((mean(y[train])-y.test)^2)

[1] 193253

We could also get the same result by fitting a ridge regression model with a very large value of λ. Note that 1e10 means \(10^{10}\).

> ridge.pred=predict(ridge.mod,s=1e10,newx=x[test,])

> mean((ridge.pred-y.test)^2)

[1] 193253

So fitting a ridge regression model with λ = 4 leads to a much lower test MSE than fitting a model with just an intercept. We now check whether there is any benefit to performing ridge regression with λ = 4 instead of just performing least squares regression. Recall that least squares is simply ridge regression with λ = 0.

> ridge.pred=predict(ridge.mod,s=0,newx=x[test,],exact=T)

> mean((ridge.pred-y.test)^2)

[1] 114783

> lm(y~x, subset=train)

> predict(ridge.mod,s=0,exact=T,type="coefficients")[1:20,]

In general, if we want to fit an (unpenalized) least squares model, then we should use the lm() function, since that function provides more useful outputs, such as standard errors and p-values for the coefficients.
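For example (a sketch; lm.fit is an illustrative name), fitting the least squares model with lm() and calling summary() yields exactly the kind of inferential output that glmnet() does not report:

> lm.fit=lm(y~x,subset=train)

> summary(lm.fit)   # coefficient estimates, standard errors, t-statistics, and p-values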

In general, instead of arbitrarily choosing λ = 4, it would be better to use cross-validation to choose the tuning parameter λ. We can do this using the built-in cross-validation function, cv.glmnet(). By default, the function performs ten-fold cross-validation, though this can be changed using the argument nfolds. Note that we set a random seed first so our results will be reproducible, since the choice of the cross-validation folds is random.

cv.glmnet()

> set.seed(1)

> cv.out=cv.glmnet(x[train,],y[train],alpha=0)

> plot(cv.out)

> bestlam=cv.out$lambda.min

> bestlam

[1] 212

Therefore, we see that the value of λ that results in the smallest cross-validation error is 212. What is the test MSE associated with this value of λ?

> ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])

> mean((ridge.pred-y.test)^2)

[1] 96016

This represents a further improvement over the test MSE that we got using λ = 4. Finally, we refit our ridge regression model on the full data set, using the value of λ chosen by cross-validation, and examine the coefficient estimates.

> out=glmnet(x,y,alpha=0)

> predict(out,type="coefficients",s=bestlam)[1:20,]

(Intercept)       AtBat        Hits       HmRun        Runs

     9.8849      0.0314      1.0088      0.1393      1.1132

        RBI       Walks       Years      CAtBat       CHits

     0.8732      1.8041      0.1307      0.0111      0.0649

     CHmRun       CRuns        CRBI      CWalks     LeagueN

     0.4516      0.1290      0.1374      0.0291     27.1823

  DivisionW     PutOuts     Assists      Errors  NewLeagueN

   -91.6341      0.1915      0.0425     -1.8124      7.2121

As expected, none of the coefficients are zero—ridge regression does not perform variable selection!
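A one-line check (sketch) makes this concrete: at the chosen λ, none of the 20 ridge coefficient estimates (intercept included) is exactly zero.

> sum(predict(out,type="coefficients",s=bestlam)[1:20,]==0)

[1] 0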

6.6.2 The Lasso

We saw that ridge regression with a wise choice of λ can outperform least squares as well as the null model on the Hitters data set. We now ask whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a lasso model, we once again use the glmnet() function; however, this time we use the argument alpha=1. Other than that change, we proceed just as we did in fitting a ridge model.

> lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)

> plot(lasso.mod)

We can see from the coefficient plot that depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero. We now perform cross-validation and compute the associated test error.

> set.seed(1)

> cv.out=cv.glmnet(x[train,],y[train],alpha=1)

> plot(cv.out)

> bestlam=cv.out$lambda.min

> lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])

> mean((lasso.pred-y.test)^2)

[1] 100743

This is substantially lower than the test set MSE of the null model and of least squares, and very similar to the test MSE of ridge regression with λ chosen by cross-validation.

However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse. Here we see that 12 of the 19 coefficient estimates are exactly zero. So the lasso model with λ chosen by cross-validation contains only seven variables.

> out=glmnet(x,y,alpha=1,lambda=grid)

> lasso.coef=predict(out,type="coefficients",s=bestlam)[1:20,]

> lasso.coef

(Intercept)       AtBat        Hits       HmRun        Runs

     18.539       0.000       1.874       0.000       0.000

        RBI       Walks       Years      CAtBat       CHits

      0.000       2.218       0.000       0.000       0.000

     CHmRun       CRuns        CRBI      CWalks     LeagueN

      0.000       0.207       0.413       0.000       3.267

  DivisionW     PutOuts     Assists      Errors  NewLeagueN

   -103.485       0.220       0.000       0.000       0.000

> lasso.coef[lasso.coef!=0]

(Intercept)        Hits       Walks       CRuns        CRBI

     18.539       1.874       2.218       0.207       0.413

    LeagueN   DivisionW     PutOuts

      3.267    -103.485       0.220
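A quick count (sketch) confirms the sparsity described above.

> sum(lasso.coef==0)   # 12 coefficient estimates are exactly zero

[1] 12

> sum(lasso.coef!=0)-1   # leaving 7 variables, not counting the intercept

[1] 7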

6.7 Lab 3: PCR and PLS Regression

6.7.1 Principal Components Regression

Principal components regression (PCR) can be performed using the pcr() function, which is part of the pls library. We now apply PCR to the Hitters data, in order to predict Salary. Again, ensure that the missing values have been removed from the data, as described in Section 6.5.

pcr()

> library(pls)

> set.seed(2)

> pcr.fit=pcr(Salary~., data=Hitters,scale=TRUE,validation="CV")

The syntax for the pcr() function is similar to that for lm(), with a few additional options. Setting scale=TRUE has the effect of standardizing each predictor, using (6.6), prior to generating the principal components, so that the scale on which each variable is measured will not have an effect. Setting validation="CV" causes pcr() to compute the ten-fold cross-validation error for each possible value of M, the number of principal components used. The resulting fit can be examined using summary().

> summary(pcr.fit)

Data:   X dimension: 263 19

    Y dimension: 263 1

Fit method: svdpc

Number of components considered: 19

VALIDATION: RMSEP

Cross-validated using 10 random segments.

       (Intercept)  1 comps  2 comps  3 comps  4 comps

CV             452    348.9    352.2    353.5    352.8

adjCV          452    348.7    351.8    352.9    352.1

...

TRAINING: % variance explained

        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps

X         38.31    60.16    70.84    79.03    84.29    88.63

Salary    40.63    41.58    42.17    43.22    44.90    46.48

...

The CV score is provided for each possible number of components, ranging from M = 0 onwards. (We have printed the CV output only up to M = 4.) Note that pcr() reports the root mean squared error; in order to obtain the usual MSE, we must square this quantity. For instance, a root mean squared error of 352.8 corresponds to an MSE of \(352.8^2 = 124{,}468\).
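As a sketch, the conversion is simple arithmetic; the pls package also provides an MSEP() function that reports the cross-validated MSE directly.

> 352.8^2   # the M = 4 RMSEP on the MSE scale

[1] 124467.8

> MSEP(pcr.fit)   # cross-validated MSE for each number of components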

One can also plot the cross-validation scores using the validationplot() function. Using val.type="MSEP" will cause the cross-validation MSE to be plotted.

validationplot()

> validationplot(pcr.fit,val.type="MSEP")

We see that the smallest cross-validation error occurs when M = 16 components are used. This is barely fewer than M = 19, which amounts to simply performing least squares, because when all of the components are used in PCR no dimension reduction occurs. However, from the plot we also see that the cross-validation error is roughly the same when only one component is included in the model. This suggests that a model that uses just a small number of components might suffice.

The summary() function also provides the percentage of variance explained in the predictors and in the response using different numbers of components. This concept is discussed in greater detail in Chapter 10. Briefly, we can think of this as the amount of information about the predictors or the response that is captured using M principal components. For example, setting M = 1 only captures 38.31% of all the variance, or information, in the predictors. In contrast, using M = 6 increases the value to 88.63%. If we were to use all \(M = p = 19\) components, this would increase to 100%.

We now perform PCR on the training data and evaluate its test set performance.

> set.seed(1)

> pcr.fit=pcr(Salary~., data=Hitters,subset=train,scale=TRUE, validation="CV")

> validationplot(pcr.fit,val.type="MSEP")

Now we find that the lowest cross-validation error occurs when M = 7 components are used. We compute the test MSE as follows.

> pcr.pred=predict(pcr.fit,x[test,],ncomp=7)

> mean((pcr.pred-y.test)^2)

[1] 96556

This test set MSE is competitive with the results obtained using ridge regression and the lasso. However, as a result of the way PCR is implemented, the final model is more difficult to interpret because it does not perform any kind of variable selection or even directly produce coefficient estimates.
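That said, the pls package can translate the fitted PCR model back into implied regression coefficients on the original (standardized) predictors via its coef() method; a sketch is below. These implied coefficients are dense, unlike the lasso estimates, which is part of why the fit remains harder to interpret.

> coef(pcr.fit,ncomp=7)   # implied coefficients for the 7-component fit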

Finally, we fit PCR on the full data set, using M = 7, the number of components identified by cross-validation.

> pcr.fit=pcr(y~x,scale=TRUE,ncomp=7)

> summary(pcr.fit)

Data:   X dimension: 263 19

        Y dimension: 263 1

Fit method: svdpc

Number of components considered: 7

TRAINING: % variance explained

   1 comps  2 comps  3 comps  4 comps  5 comps  6 comps

X    38.31    60.16    70.84    79.03    84.29    88.63

y    40.63    41.58    42.17    43.22    44.90    46.48

   7 comps

X    92.26

y    46.69

6.7.2 Partial Least Squares

We implement partial least squares (PLS) using the plsr() function, also in the pls library. The syntax is just like that of the pcr() function.

plsr()

> set.seed(1)

> pls.fit=plsr(Salary~., data=Hitters,subset=train,scale=TRUE, validation="CV")

> summary(pls.fit)

Data:   X dimension: 131 19

    Y dimension: 131 1

Fit method: kernelpls

Number of components considered: 19

VALIDATION: RMSEP

Cross-validated using 10 random segments.

       (Intercept)  1 comps  2 comps  3 comps  4 comps

CV           464.6    394.2    391.5    393.1    395.0

adjCV        464.6    393.4    390.2    391.1    392.9

...

TRAINING: % variance explained

        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps

X         38.12    53.46    66.05    74.49    79.33    84.56

Salary    33.58    38.96    41.57    42.43    44.04    45.59

...

> validationplot(pls.fit,val.type="MSEP")

The lowest cross-validation error occurs when only M = 2 partial least squares directions are used. We now evaluate the corresponding test set MSE.

> pls.pred=predict(pls.fit,x[test,],ncomp=2)

> mean((pls.pred-y.test)^2)

[1] 101417

The test MSE is comparable to, but slightly higher than, the test MSE obtained using ridge regression, the lasso, and PCR.

Finally, we perform PLS using the full data set, using M = 2, the number of components identified by cross-validation.

> pls.fit=plsr(Salary~., data=Hitters,scale=TRUE,ncomp=2)

> summary(pls.fit)

Data:   X dimension: 263 19

    Y dimension: 263 1

Fit method: kernelpls

Number of components considered: 2

TRAINING: % variance explained

        1 comps  2 comps

X         38.08    51.03

Salary    43.05    46.40

Notice that the percentage of variance in Salary that the two-component PLS fit explains, 46.40%, is almost as much as that explained using the final seven-component PCR fit, 46.69%. This is because PCR only attempts to maximize the amount of variance explained in the predictors, while PLS searches for directions that explain variance in both the predictors and the response.

6.8 Exercises

6.8.1 Conceptual

  1. 1.

    We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing \(0,1,2,\ldots,p\) predictors. Explain your answers:

    1. (a)

      Which of the three models with k predictors has the smallest training RSS?

    2. (b)

      Which of the three models with k predictors has the smallest test RSS?

    3. (c)

      True or False:

      1. i.

        The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.

      2. ii.

        The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by backward stepwise selection.

      3. iii.

        The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.

      4. iv.

        The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k + 1)-variable model identified by backward stepwise selection.

      5. v.

        The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.

  2. 2.

    For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

    1. (a)

      The lasso, relative to least squares, is:

      1. i.

        More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

      2. ii.

        More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

      3. iii.

        Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

      4. iv.

        Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

    2. (b)

      Repeat (a) for ridge regression relative to least squares.

    3. (c)

      Repeat (a) for non-linear methods relative to least squares.

  3. 3.

    Suppose we estimate the regression coefficients in a linear regression model by minimizing

    $$\displaystyle{\sum _{i=1}^{n}\left (y_{i} -\beta _{0} -\sum _{j=1}^{p}\beta _{j}x_{ij}\right )^{2}\quad \text{subject to}\quad \sum _{j=1}^{p}\vert \beta _{j}\vert \leq s}$$

    for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

    1. (a)

      As we increase s from 0, the training RSS will:

      1. i.

        Increase initially, and then eventually start decreasing in an inverted U shape.

      2. ii.

        Decrease initially, and then eventually start increasing in a U shape.

      3. iii.

        Steadily increase.

      4. iv.

        Steadily decrease.

      5. v.

        Remain constant.

    2. (b)

      Repeat (a) for test RSS.

    3. (c)

      Repeat (a) for variance.

    4. (d)

      Repeat (a) for (squared) bias.

    5. (e)

      Repeat (a) for the irreducible error.

  4. 4.

    Suppose we estimate the regression coefficients in a linear regression model by minimizing

    $$\displaystyle{\sum _{i=1}^{n}\left (y_{i} -\beta _{0} -\sum _{j=1}^{p}\beta _{j}x_{ij}\right )^{2} +\lambda \sum _{j=1}^{p}\beta _{j}^{2}}$$

    for a particular value of λ. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

    1. (a)

      As we increase λ from 0, the training RSS will:

      1. i.

        Increase initially, and then eventually start decreasing in an inverted U shape.

      2. ii.

        Decrease initially, and then eventually start increasing in a U shape.

      3. iii.

        Steadily increase.

      4. iv.

        Steadily decrease.

      5. v.

        Remain constant.

    2. (b)

      Repeat (a) for test RSS.

    3. (c)

      Repeat (a) for variance.

    4. (d)

      Repeat (a) for (squared) bias.

    5. (e)

      Repeat (a) for the irreducible error.

  5. 5.

    It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting.

    Suppose that \(n = 2\), \(p = 2\), \(x_{11} = x_{12}\), and \(x_{21} = x_{22}\). Furthermore, suppose that \(y_{1} + y_{2} = 0\) and \(x_{11} + x_{21} = 0\) and \(x_{12} + x_{22} = 0\), so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: \(\hat{\beta }_{0} = 0\).

    1. (a)

      Write out the ridge regression optimization problem in this setting.

    2. (b)

      Argue that in this setting, the ridge coefficient estimates satisfy \(\hat{\beta }_{1} =\hat{\beta } _{2}\).

    3. (c)

      Write out the lasso optimization problem in this setting.

    4. (d)

      Argue that in this setting, the lasso coefficients \(\hat{\beta }_{1}\) and \(\hat{\beta }_{2}\) are not unique—in other words, there are many possible solutions to the optimization problem in (c). Describe these solutions.

  6. 6.

    We will now explore (6.12) and (6.13) further.

    1. (a)

      Consider (6.12) with p = 1. For some choice of \(y_{1}\) and λ > 0, plot (6.12) as a function of \(\beta _{1}\). Your plot should confirm that (6.12) is solved by (6.14).

    2. (b)

      Consider (6.13) with p = 1. For some choice of \(y_{1}\) and λ > 0, plot (6.13) as a function of \(\beta _{1}\). Your plot should confirm that (6.13) is solved by (6.15).

  7. 7.

    We will now derive the Bayesian connection to the lasso and ridge regression discussed in Section 6.2.2.

    1. (a)

      Suppose that \(y_{i} =\beta _{0} +\sum _{ j=1}^{p}x_{ij}\beta _{j} +\epsilon _{i}\) where \(\epsilon _{1},\ldots,\epsilon _{n}\) are independent and identically distributed from a \(N(0,{\sigma }^{2})\) distribution. Write out the likelihood for the data.

    2. (b)

      Assume the following prior for β: \(\beta _{1},\ldots,\beta _{p}\) are independent and identically distributed according to a double-exponential distribution with mean 0 and common scale parameter b, i.e. \(p(\beta ) = \frac{1} {2b}\exp (-\vert \beta \vert /b)\). Write out the posterior for β in this setting.

    3. (c)

      Argue that the lasso estimate is the mode for β under this posterior distribution.

    4. (d)

      Now assume the following prior for β: \(\beta _{1},\ldots,\beta _{p}\) are independent and identically distributed according to a normal distribution with mean zero and variance c. Write out the posterior for β in this setting.

    5. (e)

      Argue that the ridge regression estimate is both the mode and the mean for β under this posterior distribution.

6.8.2 Applied

  1. 8.

    In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.

    1. (a)

      Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector ε of length n = 100.

    2. (b)

      Generate a response vector Y of length n = 100 according to the model

      $$\displaystyle{Y =\beta _{0} +\beta _{1}X +\beta _{2}{X}^{2} +\beta _{ 3}{X}^{3}+\epsilon,}$$

      where \(\beta _{0}\), \(\beta _{1}\), \(\beta _{2}\), and \(\beta _{3}\) are constants of your choice.

      Note you will need to use the data.frame() function to create a single data set containing both X and Y.

    3. (c)

      Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors \(X,{X}^{2},\ldots,{X}^{10}\). What is the best model obtained according to \(C_{p}\), BIC, and adjusted \({R}^{2}\)? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained.

    4. (d)

      Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

    5. (e)

      Now fit a lasso model to the simulated data, again using \(X,{X}^{2},\ldots,{X}^{10}\) as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.

    6. (f)

      Now generate a response vector Y according to the model

      $$\displaystyle{Y =\beta _{0} +\beta _{7}{X}^{7}+\epsilon,}$$

      and perform best subset selection and the lasso. Discuss the results obtained.

  2. 9.

    In this exercise, we will predict the number of applications received using the other variables in the College data set.

    1. (a)

      Split the data set into a training set and a test set.

    2. (b)

      Fit a linear model using least squares on the training set, and report the test error obtained.

    3. (c)

      Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

    4. (d)

      Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

    5. (e)

      Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

    6. (f)

      Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

    7. (g)

      Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

  3. 10.

    We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.

    1. (a)

      Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated according to the model

      $$\displaystyle{Y = X\beta +\epsilon,}$$

      where β has some elements that are exactly equal to zero.

    2. (b)

      Split your data set into a training set containing 100 observations and a test set containing 900 observations.

    3. (c)

      Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.

    4. (d)

      Plot the test set MSE associated with the best model of each size.

    5. (e)

      For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.

    6. (f)

      How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.

    7. (g)

      Create a plot displaying \(\sqrt{\sum _{j=1 }^{p }{(\beta _{j } -\hat{\beta }_{ j }^{r })}^{2}}\) for a range of values of r, where \(\hat{\beta }_{j}^{r}\) is the jth coefficient estimate for the best model containing r coefficients. Comment on what you observe. How does this compare to the test MSE plot from (d)?

  4. 11.

    We will now try to predict per capita crime rate in the Boston data set.

    1. (a)

      Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

    2. (b)

      Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

    3. (c)

      Does your chosen model involve all of the features in the data set? Why or why not?