Abstract
The use of information criteria, especially AIC (Akaike’s information criterion) and BIC (Bayesian information criterion), for choosing an adequate number of principal components is illustrated.
1 Introduction
This paper applies model selection criteria, especially AIC and BIC, to the problem of choosing a sufficient number of principal components to retain. It applies the concepts of Sclove [13] to this particular problem.
2 Background
Other researchers have considered the problem of choosing the number of principal components. For example, Bai et al. [6] examined the asymptotic consistency of the criteria AIC and BIC for determining the number of significant principal components in high-dimensional problems. The focus here is not necessarily on high-dimensional problems.
To begin the discussion here, we first give a short review of some general background on the relevant portions of multivariate statistical analysis, such as may be obtained from textbooks such as Anderson [5] or Johnson and Wichern [9].
2.1 Sample Quantities
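Suppose we have a multivariate sample \({{\mathbf{x}}}_1, {{\mathbf{x}}}_2 , \; \dots , \; {{\mathbf{x}}}_n \) of n p-dimensional random vectors,
\[ {\mathbf{x}}_i = (x_{1i}, x_{2i}, \ldots , x_{pi})', \quad i = 1, 2, \ldots , n. \]
The transpose (\('\)) means that we are thinking of the vectors as column vectors. The sample mean vector is
\[ \bar{{\mathbf{x}}} = \frac{1}{n} \sum _{i=1}^n {\mathbf{x}}_i. \]
The \(p \times p\) sample covariance matrix is
\[ {\mathbf{S}} = \frac{1}{n-1} \sum _{i=1}^n ({\mathbf{x}}_i - \bar{{\mathbf{x}}}) ({\mathbf{x}}_i - \bar{{\mathbf{x}}})'. \]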
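The sample quantities of this section can be sketched in a few lines of NumPy. This is only an illustration: the function name `sample_mean_and_cov` and the random data are assumptions made here, and the divisor \(n - 1\) is assumed for the covariance matrix.

```python
import numpy as np

def sample_mean_and_cov(X):
    """Return the sample mean vector and the p x p sample covariance matrix.

    X is an (n, p) array: one row per individual, one column per variable.
    Assumption: the covariance uses the divisor n - 1.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    xbar = X.mean(axis=0)                 # sample mean vector (length p)
    centered = X - xbar                   # subtract the mean from every row
    S = centered.T @ centered / (n - 1)   # sum of outer products / (n - 1)
    return xbar, S

# Made-up data purely for illustration (not the data analyzed in the paper).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
xbar_demo, S_demo = sample_mean_and_cov(X_demo)
```

The result agrees with `np.cov(X_demo, rowvar=False)`, which uses the same divisor by default.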
2.2 Population Quantities and Principal Components
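The sample covariance matrix \({\mathbf{S}}\) estimates the true covariance matrix \( {\varvec{\Sigma }} \) of the random variables
\[ X_1, X_2, \ldots , X_p. \]
That is,
\[ {\varvec{\Sigma }} = [\sigma _{uv}], \quad u, v = 1, 2, \ldots , p, \]
where
\[ \sigma _{uv} = {\mathcal{C}}[X_u, X_v], \]
the covariance of \(X_u\) and \(X_v.\) In particular, \({{\mathcal{C}}}[X_v, X_v] ={\mathcal {V}}[X_v], \) the variance of \(X_v. \)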
The principal components of \({\varvec{\Sigma }}\) are defined as uncorrelated linear combinations of maximal variance. A linear combination, say LC, of the p variables is \({\mathbf{a'X}},\) that is
\[ \text{LC} = {\mathbf{a}}'{\mathbf{X}} = a_1 X_1 + a_2 X_2 + \cdots + a_p X_p. \]
Here the vector \( \, {\mathbf{a}} \, \) is a vector of scalars \(\, a_1, a_2, \ldots , \; a_p{:} \)
\[ {\mathbf{a}} = (a_1, a_2, \ldots , a_p)'. \]
These are the coefficients in the linear combination. Such linear combinations are called variates.
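We have
\[ {\mathcal {V}}[{\mathbf{a}}'{\mathbf{X}}] = {\mathbf{a}}' {\varvec{\Sigma }} {\mathbf{a}}. \]
This is estimated as \({\mathbf{a'Sa}}. \) This is to be maximized over \({\mathbf{a}}.\) The derivative is
\[ \frac{\partial \, ({\mathbf{a}}'{\mathbf{S}}{\mathbf{a}})}{\partial {\mathbf{a}}} = 2 \, {\mathbf{S}}{\mathbf{a}}. \]
A constraint is required for meaningful maximization. A reasonable such constraint is \({\mathbf{a'a}} = 1,\) which is equivalent to the length of \({\mathbf{a}} , \) the quantity \(\sqrt{{\mathbf{a'a}}},\) being equal to 1.
The Lagrangian function incorporating the constraint is
\[ \phi ({\mathbf{a}}, \lambda ) = {\mathbf{a}}'{\mathbf{S}}{\mathbf{a}} - \lambda \, ({\mathbf{a}}'{\mathbf{a}} - 1). \]
The partial derivatives are
\[ \frac{\partial \phi }{\partial {\mathbf{a}}} = 2 \, {\mathbf{S}}{\mathbf{a}} - 2 \lambda {\mathbf{a}} \]
and
\[ \frac{\partial \phi }{\partial \lambda } = -({\mathbf{a}}'{\mathbf{a}} - 1). \]
Setting these equal to zero gives the simultaneous equations
\[ {\mathbf{S}}{\mathbf{a}} = \lambda {\mathbf{a}}, \qquad {\mathbf{a}}'{\mathbf{a}} = 1. \]
The first is the equation
\[ {\mathbf{S}}{\mathbf{a}} - \lambda {\mathbf{a}} = {\mathbf{0}}, \]
the zero vector. This is the homogeneous equation
\[ ({\mathbf{S}} - \lambda {\mathbf{I}}) \, {\mathbf{a}} = {\mathbf{0}}. \]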
For nontrivial solutions, we must have det\(( {\mathbf{S}} - \lambda {\mathbf{I}} ) = 0, \) where \({\mathbf{I}}\) is the \(p \times p\) identity matrix. This is a polynomial equation of degree p in \(\lambda \); denote the roots by \(\lambda _1 \ge \lambda _2 \ge \; \cdots \; \ge \lambda _p. \) These are the eigenvalues. Their sum is the trace of \({\mathbf{S}};\) their product is the determinant of \({\mathbf{S}}. \)
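The two identities just stated (sum of eigenvalues equals the trace, product equals the determinant) are easy to check numerically. The sketch below uses a small made-up symmetric matrix; the function name `eigen_identities` is an assumption introduced for illustration.

```python
import numpy as np

def eigen_identities(S):
    """Return the eigenvalues of a symmetric matrix S in descending order,
    together with their sum and product."""
    S = np.asarray(S, dtype=float)
    eigvals = np.linalg.eigvalsh(S)[::-1]   # eigvalsh returns ascending order
    return eigvals, eigvals.sum(), eigvals.prod()

# Made-up 2 x 2 covariance matrix, chosen so the eigenvalues are 3 and 1.
S_demo = np.array([[2.0, 1.0],
                   [1.0, 2.0]])
eigvals_demo, total, product = eigen_identities(S_demo)

# Sum of eigenvalues vs. trace, product of eigenvalues vs. determinant.
trace_matches = np.isclose(total, np.trace(S_demo))
det_matches = np.isclose(product, np.linalg.det(S_demo))
```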
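The corresponding eigenequations are
\[ {\mathbf{S}} {\mathbf{a}}_j = \lambda _j {\mathbf{a}}_j, \quad j = 1, 2, \ldots , p. \]
The j-th PC (principal component), \(C_j,\) is the linear combination of the form
\[ C_j = {\mathbf{a}}_j' {\mathbf{X}} = a_{1j} X_1 + a_{2j} X_2 + \cdots + a_{pj} X_p, \]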
where \({\mathbf{a}}_j' =(a_{1j}, a_{2j}, \ldots , \, a_{pj}). \) That is to say, for \(j = 1, 2, \ldots , p,\) the value of the j-th PC for Individual i is \({\mathbf{a}}_j' {\mathbf{x}}_i, \; i = 1, 2, \ldots , n. \)
The equations for the PCs in terms of the Xs are PC\(_j = {\mathbf{a}}'_j {\mathbf{X}}, \; j = 1, 2, \ldots , p. \) Let \({\mathbf{C}}\) be the p-vector of PCs. Then \({\mathbf{C}} \; = \; {{\mathbf{A}}}'{} {\mathbf{X}}, \) where \({\mathbf{A}} \; = \; [{\mathbf{a}}_1 \, {\mathbf{a}}_2 \, \dots \, {\mathbf{a}}_p ] \) is the matrix whose columns are the eigenvectors. The inverse relation is
\[ {\mathbf{X}} = {\mathbf{L}} {\mathbf{C}}, \]
where
\[ {\mathbf{L}} = ({\mathbf{A}}')^{-1} \]
is the matrix of loadings of the \(X_v\) on the PCs \(C_j.\) Actually, \({\mathbf{A}} \) is an orthonormal matrix (its columns are of length one and are pairwise orthogonal), so \({\mathbf{A}}^{-1} = {{\mathbf{A}}}'.\) Thus \({\mathbf{L}} = \; {\mathbf{A}}. \) So
\[ {\mathbf{X}} = {\mathbf{A}} {\mathbf{C}}. \]
Letting \( {\mathbf{a}}^{(v)'} \, \) be the v-th row of the matrix \({\mathbf{A}},\) that is
\[ {\mathbf{a}}^{(v)'} = (a_{v1}, a_{v2}, \ldots , a_{vp}), \]
we have
\[ X_v = {\mathbf{a}}^{(v)'} {\mathbf{C}} = \sum _{j=1}^p a_{vj} C_j. \]
In terms of the first k PCs, this is
\[ X_v = \sum _{j=1}^k a_{vj} C_j + \varepsilon _v, \qquad (*) \]
where the error \(\varepsilon _v\) is
\[ \varepsilon _v = \sum _{j=k+1}^p a_{vj} C_j. \]
The covariance matrix can be represented as
\[ {\mathbf{S}} = \sum _{j=1}^p \lambda _j {\mathbf{a}}_j {\mathbf{a}}_j'. \]
Correspondingly, the best rank k approximation to \({\mathbf{S}}\) is
\[ \sum _{j=1}^k \lambda _j {\mathbf{a}}_j {\mathbf{a}}_j'. \]
Recall that for a symmetric positive semidefinite matrix such as a covariance matrix, the eigenvalues are real and non-negative.
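As a rough numerical companion to this subsection, the sketch below computes the eigensystem of a covariance matrix, checks that the eigenvector matrix is orthonormal, and forms the rank k approximation from the k largest eigenvalues. The function names and the small matrix are assumptions made for illustration only.

```python
import numpy as np

def principal_components(S):
    """Eigenvalues (descending) and eigenvectors (as columns of A) of S."""
    eigvals, A = np.linalg.eigh(np.asarray(S, dtype=float))
    order = np.argsort(eigvals)[::-1]      # eigh returns ascending order
    return eigvals[order], A[:, order]

def rank_k_approximation(eigvals, A, k):
    """Sum over the first k PCs of lambda_j * a_j a_j'."""
    A_k = A[:, :k]
    # Multiplying column j of A_k by lambda_j, then by A_k', gives the sum.
    return (A_k * eigvals[:k]) @ A_k.T

# Made-up covariance matrix with eigenvalues 3 and 1.
S_demo = np.array([[2.0, 1.0],
                   [1.0, 2.0]])
eigvals_demo, A_demo = principal_components(S_demo)

is_orthonormal = np.allclose(A_demo.T @ A_demo, np.eye(2))   # A' A = I
S_rank1 = rank_k_approximation(eigvals_demo, A_demo, 1)
S_full = rank_k_approximation(eigvals_demo, A_demo, 2)       # recovers S
```

The sign of each eigenvector is arbitrary, but the outer products \({\mathbf{a}}_j {\mathbf{a}}_j'\) are not affected by it.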
2.3 Ad Hoc Procedures for Determining an Appropriate Number of PCs
2.3.1 Procedure Based on the Average Eigenvalue
The average eigenvalue is
\[ \bar{\lambda } = \frac{1}{p} \sum _{j=1}^p \lambda _j = \frac{\text{trace} \; {\mathbf{S}}}{p}. \]
One rule for the number of PCs to retain is to retain those for which the eigenvalues are greater than \(\bar{\lambda }. \) When \({\mathbf{S}}\) is taken to be the sample correlation matrix, the trace is p and the average eigenvalue \(\bar{\lambda }\) is 1.
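A minimal sketch of the average-eigenvalue rule follows. The eigenvalues are made up for illustration (they sum to 5, as they would for a correlation matrix of five variables) and are not those of the example in Section 4; the function name is likewise an assumption.

```python
import numpy as np

def retain_by_average_eigenvalue(eigvals):
    """Number of PCs whose eigenvalue exceeds the average eigenvalue."""
    eigvals = np.asarray(eigvals, dtype=float)
    return int(np.sum(eigvals > eigvals.mean()))

# Hypothetical eigenvalues of a 5 x 5 correlation matrix (average is 1).
eigvals_demo = [2.0, 1.5, 0.8, 0.5, 0.2]
k_average_rule = retain_by_average_eigenvalue(eigvals_demo)
```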
2.3.2 Procedure Based on Retaining a Prescribed Portion of the Total Variance
Another procedure is to retain a number of PCs sufficient to account for, say, 90% of the total variance, trace \({\mathbf{S}} = \sum _{j=1}^p \, \lambda _j. \) Of course, the figure of 90% is somewhat arbitrary, and a more objective criterion would be desirable.
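The proportion-of-variance rule can be sketched as follows. The eigenvalues are again hypothetical, and the function name and the `share` parameter are assumptions introduced here.

```python
import numpy as np

def retain_by_variance_share(eigvals, share=0.90):
    """Smallest k such that the first k PCs account for at least `share`
    of the total variance (the trace, i.e. the sum of the eigenvalues)."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    cumulative[-1] = 1.0   # guard against round-off in the final entry
    # argmax returns the index of the first True entry.
    return int(np.argmax(cumulative >= share)) + 1

# Hypothetical eigenvalues; cumulative shares are 0.40, 0.70, 0.86, 0.96, 1.00.
eigvals_demo = [2.0, 1.5, 0.8, 0.5, 0.2]
k_90 = retain_by_variance_share(eigvals_demo, share=0.90)
k_50 = retain_by_variance_share(eigvals_demo, share=0.50)
```

With these made-up values the 90% rule keeps four PCs while a 50% rule keeps two, which illustrates how strongly the answer depends on the chosen percentage.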
2.3.3 Procedure Based on the Dropoff of the Eigenvalues
Another procedure is to plot \(\lambda _1, \lambda _2, \ldots , \; \lambda _p\) against \(1, 2, \ldots , p. \) One then looks for an elbow in the curve and retains a number of PCs corresponding to the point before the leveling off of the curve, if it does indeed take an elbow shape. Such a plot is called a scree plot, “scree” being the rubble that accumulates at the foot of a cliff or steep slope.
3 AIC and BIC for the Number of PCs
Let us see what a Gaussian model would imply. The maximized likelihood for the model (*) approximating the p variables in terms of k PCs is \((2\pi | \hat{{\varvec{\Sigma }}}_k |)^{-n/2} \, C(n,p,k ), \) where C(n, p, k) is a constant depending upon n, p, and k and \(|\hat{{\varvec{\Sigma }}}_k|\) denotes the determinant of the estimate \(\hat{{\varvec{\Sigma }}}_k\) of the residual covariance matrix \({\varvec{\Sigma }}_k.\)
The determinant of the covariance matrix is the product of the eigenvalues,
\[ |{\mathbf{S}}| = \prod _{j=1}^p \lambda _j. \]
For a model based on the first k PCs, this is
\[ \prod _{j=1}^k \lambda _j. \]
The determinant of the residual covariance is \(\Pi _{j=k+1}^p \lambda _j. \) The model-selection criterion AIC—Akaike’s information criterion [2,3,4]—is based on an estimate of the log cross-entropy of K proposed models with a null model.
The Bayesian information criterion BIC [12] is based on a large-sample estimate of the posterior probability \(pp_k\) of Model \(k, \; k = 1, 2, \ldots , K. \, \)
More precisely, BIC\(_k \) is an approximation to \(\, -2 \ln pp_k.\) These model-selection criteria (MSCs) are thus smaller-is-better criteria and take the form
\[ \text{MSC}_k = -2 \ln \max L_k + a(n) \, m_k, \quad k = 1, 2, \ldots , K, \]
where \(L_k\) is the likelihood for Model \(k, \, a(n) = \ln n\) for BIC\(_k, \; a(n) = 2\) (not depending upon n) for AIC\(_k,\) and \(m_k\) is the number of independent parameters in Model \(k.\, \) Because \(\ln n > 2\) once \(n \ge 8,\) BIC penalizes additional parameters more heavily than AIC, so relative to AIC, BIC tends to favor models with a smaller number of parameters. Note that
\[ pp_k \approx C \exp (-\text{BIC}_k / 2), \]
where C is a constant. Thus BIC values can be converted to a scale of 0 to 1. This is done by exponentiating -BIC\(_k/2,\) summing the values, and dividing each value by the sum.
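The conversion of BIC values to a 0-to-1 scale described above can be sketched as below. The function name and the BIC values are made up for illustration; subtracting the smallest BIC before exponentiating is a numerical convenience added here, and it cancels when dividing by the sum.

```python
import numpy as np

def bic_to_posterior(bic_values):
    """Approximate posterior model probabilities from BIC values:
    exponentiate -BIC_k / 2 and divide by the sum over models."""
    bic = np.asarray(bic_values, dtype=float)
    # Shifting by the minimum avoids underflow and does not change the ratio.
    weights = np.exp(-(bic - bic.min()) / 2.0)
    return weights / weights.sum()

# Hypothetical BIC values for K = 3 competing models.
bic_demo = [10.0, 12.0, 20.0]
pp_demo = bic_to_posterior(bic_demo)
```

The model with the smallest BIC receives the largest approximate posterior probability, consistent with the smaller-is-better convention.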
For the PC model,
The criteria can be written as
\[ \text{MSC}_k = \text{Deviance}_k + \text{Penalty}_k, \]
where \(\text {Deviance}_k = -2 \ln \max L_k\) is a measure of lack of fit and \(\text {Penalty}_k = a(n) \, m_k.\) Inclusion of an additional PC is justified if the criterion value decreases, that is if MSC\(_{k+1} < {MSC}_k. \) For PCs, this is
This is
or
or
or
Thus for AIC, inclusion of the additional PC\(_{k+1} \) is justified if \(\lambda _{k+1}\) is greater than \(\exp (-2/n). \)
For BIC, inclusion of an additional PC\(_{k+1} \) is justified if \( \lambda _{k+1} >\exp (-\ln n / n) = \; [\exp (\ln n)]^{-1/n} = n^{-1/n},\) which tends to 1 for large n. So this is in approximate agreement with the average eigenvalue rule for correlation matrices, stating that one should retain dimensions with eigenvalues larger than 1.
4 Example
Here we consider a sample from the LA Heart Study. See, e.g., [8]. The sample is \(n = 100 \) men. The variables include Age, Systolic blood pressure, Diastolic blood pressure, weight, height and Coronary Incident, a binary variable indicating whether or not the individual had a coronary incident during the course of the study. (Data on the same variables for another 100 men are also given in Dixon and Massey’s book. Results can be compared and contrasted between the two samples.) Here we focus on the first five variables. Minitab statistical software was used for the analysis.
Table 1 is the lower-triangular portion of the correlation matrix for the five variables (Table 2).
4.1 Principal Component Analysis in the Example
Note that an eigenvector can be multiplied by \(-1,\) changing the signs of all its elements. Below, this is done with PC1 so that SYS and DIAS have positive loadings. Interpretations, BPtotal, SIZE, AGE, OVERWT, BPdiff, are given below the eigenvectors. The interpretations are based on which loadings are large and which are small. Taking .6 as a cut-off point, in PC1, SYS and DIAS have loadings above this, while the other variables have loadings less than this (in fact, less than .4), so PC1 can be interpreted as an index of total BP. In PC2, WT and HT have large loadings with the same sign, so PC2 can be interpreted as SIZE (Table 3).
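As above, denote the eigensystem by
\[ (\lambda _u, {\varvec{a}}_u), \quad u = 1, 2, \ldots , p. \]
Then the eigensystem equations are
\[ {\mathbf{S}} {\varvec{a}}_u = \lambda _u {\varvec{a}}_u, \quad u = 1, 2, \ldots , p. \]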
Here \({\mathbf{S}}\) is taken to be the correlation matrix. Let \({\mathbf{1}}_v' \; = \; ( 0 \; \cdots \; 0 \; 1 \; 0 \; \cdots \; 0 ) \) be the vector with 1 in the v-th position and zeroes elsewhere. The covariance between a variable \(X_v\) and a PC \(C_u\) is \({\mathcal{C}}[X_v, \, C_u \,] = \, {\mathcal{C}}[{\mathbf{1}}_v' {\varvec{X}}, {\varvec{a}}_u' \, {\varvec{X}}] = {\mathbf{1}}_v' \, {\varvec{\Sigma }} \, {\varvec{a}}_u \; = \; {\mathbf{1}}_v' \, \lambda _u \, {\varvec{a}}_u \; = \; \lambda _u a_{uv}, \, \) where \( \, a_{uv} \, \) is the v-th element of the vector \({\varvec{a}}_u. \, \) The correlation is Corr\( [X_v, \, C_u \,] = \; {\mathcal{C}}[X_v, \, C_u \,] / ({SD}[X_v] \, {SD}[ C_u \, ]) \; = \; \lambda _u \, a_{uv} \, / \, (\sigma _v \, \sqrt{\lambda _u}) \; = \; \sqrt{\lambda _u} \, a_{uv} \, / \, \sigma _v. \, \) When the correlation matrix is used, \(\sigma _v = 1, \) and this correlation is \(\sqrt{\lambda _u} \ a_{uv}. \) A correlation of magnitude .6 corresponds to 36% of variance explained. The variable \(X_v \, \) has a correlation of magnitude greater than .6 with the component \(C_u\) if the absolute value of its loading in \(C_u, \) the value \(a_{uv}, \) is greater than .6 / \(\sqrt{\lambda _u}.\) These values are appended to the table below. Loadings larger than this cut point are in boldface. (The cut-off of .6 is somewhat arbitrary; one might use, for example, a cut-off of .5.)
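The relation between loadings, correlations and the cut-off \(.6 / \sqrt{\lambda _u}\) can be illustrated numerically. The sketch below uses a made-up 2 x 2 correlation matrix rather than the data of the example, and the function name and the `corr_cut` parameter are assumptions introduced here.

```python
import numpy as np

def pc_variable_correlations(R, corr_cut=0.6):
    """For a correlation matrix R, return the eigenvalues (descending),
    the eigenvector matrix A (columns a_u), the correlations between each
    variable X_v and each PC C_u, and the per-PC loading cut-offs."""
    eigvals, A = np.linalg.eigh(np.asarray(R, dtype=float))
    order = np.argsort(eigvals)[::-1]
    eigvals, A = eigvals[order], A[:, order]
    # Entry [v, u] is sqrt(lambda_u) times the v-th element of a_u.
    corr = A * np.sqrt(eigvals)
    # A loading exceeds this in magnitude exactly when |corr| > corr_cut.
    cutoffs = corr_cut / np.sqrt(eigvals)
    return eigvals, A, corr, cutoffs

# Made-up correlation matrix for two variables with correlation 0.5.
R_demo = np.array([[1.0, 0.5],
                   [0.5, 1.0]])
eigvals_demo, A_demo, corr_demo, cutoffs_demo = pc_variable_correlations(R_demo)

# Flag loadings that would be set in boldface under the .6 cut-off.
bold_demo = np.abs(A_demo) > cutoffs_demo
```

For each variable the squared correlations with all p PCs add up to 1 when a correlation matrix is used, which gives a quick sanity check on the computation.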
One can also focus on the pattern of loadings within the different PCs for interpretation of the PCs. To reiterate:
- PC1: SYS and DIAS have large loadings with the same sign; we interpret PC1 as BPindex or BPtotal.
- PC2: WT and HT have large loadings of the same sign; we interpret PC2 as the man’s SIZE.
- PC3: Only AGE has a large loading; we interpret PC3 as AGE.
- PC4: WT and HT have large loadings with opposite signs; we interpret PC4 as OVERWEIGHT.
- PC5: SYS and DIAS have large loadings with opposite signs; we interpret PC5 as BPdrop.
I continue to marvel at how readily interpretable the PCs are. And, this is even without using a factor analysis model and using rotation (Table 4).
4.2 Employing the Criteria in the Example
Table 5 shows the eigenvalues and the results according to the various criteria. According to the rule based on the average eigenvalue, the dimension is retained if its eigenvalue is greater than 1 (for a correlation matrix). For BIC, the k-th PC is retained if \(n \, \ln \, \lambda _k > - a(n), \) where \(a(n) = \ln n .\) Here, \(n = 100 \) and \( \ln n = \ln 100,\) approx. 4.61. For AIC, the k-th PC is retained if \(n \ln \lambda _k > - 2. \) In this example, the methods agree on retaining \(k = 2\) PCs.
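The retention rules exactly as stated in the preceding paragraph can be sketched as follows. The eigenvalues are hypothetical stand-ins (Table 5 is not reproduced here), and the function name is an assumption; only the inequality \(n \ln \lambda _k > -a(n)\) is taken from the text.

```python
import math

def retained_by_criterion(eigvals, n, criterion="BIC"):
    """Apply the rule stated in the text: retain the k-th PC if
    n * ln(lambda_k) > -a(n), with a(n) = ln n for BIC and 2 for AIC."""
    if criterion == "BIC":
        a_n = math.log(n)
    elif criterion == "AIC":
        a_n = 2.0
    else:
        raise ValueError("criterion must be 'AIC' or 'BIC'")
    return [n * math.log(lam) > -a_n for lam in eigvals]

# Hypothetical eigenvalues of a 5 x 5 correlation matrix, NOT those of Table 5.
eigvals_demo = [2.0, 1.5, 0.8, 0.5, 0.2]
n_demo = 100

bic_flags = retained_by_criterion(eigvals_demo, n_demo, "BIC")
aic_flags = retained_by_criterion(eigvals_demo, n_demo, "AIC")
average_flags = [lam > 1.0 for lam in eigvals_demo]   # average-eigenvalue rule
```

With these made-up values all three rules keep the first two PCs, which mirrors the kind of agreement reported for the example.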
I feel that I should remark that, though this is the case, the fourth and fifth PCs do have simple and interesting interpretations. It is just that they do not improve the fit very much.
5 Discussion
The focus here has been on determining the number of dimensions needed to represent a complex of variables adequately.
5.1 Regression on Principal Components
Given a response variable Y and explanatory variables \(X_1, X_2, \ldots , X_p,\) one may transform the Xs to their principal components, as this may aid in the interpretation of the results of the regression. In such regression on principal components (see, e.g., [10]), however, one should not necessarily eliminate the principal components with small eigenvalues, as they may still be strongly related to the response variable. The Bayesian information criterion is
\[ \text{BIC}_k = -2 \, LL_k + m_k \ln n \]
for alternative models indexed by \( k = 1, 2,\ldots , K,\) where \(LL_k\) is the maximum log likelihood for Model k and \(m_k\) is the number of independent parameters in Model k. For linear regression models with Gaussian-distributed errors, BIC takes the form, up to an additive constant that does not depend on k,
\[ \text{BIC}_k = n \ln MSE_k + m_k \ln n, \]
where \(MSE_k\) is the maximum likelihood estimate (MLE), with divisor n, of the error variance of Model k, that is, its mean squared error (MSE). With p explanatory variables, there are \(2^p\) alternative models (including the model where no explanatory variables are used and the fitted value of Y is simply \(\bar{y}). \) It would usually seem wise to evaluate all \(2^p\) models using \(BIC_k\) rather than reducing the number of principal components by looking at the explanatory variables alone.
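An all-subsets evaluation of this kind might be sketched as below. Everything here is illustrative: the data are simulated, the function names are hypothetical, and the parameter count \(m_k\) is assumed to be the number of included PCs plus two (intercept and error variance).

```python
import itertools
import numpy as np

def pc_scores(X):
    """PC scores of the centered data, ordered by descending eigenvalue."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    eigvals, A = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return Xc @ A[:, order]

def bic_all_subsets(C, y):
    """Fit a Gaussian linear regression of y on every subset of the columns
    of C and return (subset, BIC) pairs sorted from best to worst."""
    C = np.asarray(C, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = C.shape
    results = []
    for size in range(p + 1):
        for subset in itertools.combinations(range(p), size):
            Z = np.column_stack([np.ones(n)] + [C[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            resid = y - Z @ beta
            mse = resid @ resid / n        # MLE of error variance (divisor n)
            m_k = len(subset) + 2          # assumption: slopes + intercept + variance
            results.append((subset, n * np.log(mse) + m_k * np.log(n)))
    return sorted(results, key=lambda item: item[1])

# Simulated data: the response depends only on the PC with the SMALLEST eigenvalue.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])
C_demo = pc_scores(X_demo)
y_demo = 5.0 * C_demo[:, 2] + rng.normal(scale=0.1, size=200)

ranking = bic_all_subsets(C_demo, y_demo)
best_subset, best_bic = ranking[0]
```

In this simulated case the best-ranked subset includes the third PC even though it has the smallest eigenvalue, which is the situation the paragraph above warns about.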
5.2 Some Related Recent Literature
Recent literature contains various applications that involve choosing the number of principal components, and the method presented here could possibly be applied in them. A good book on the topic of model selection and testing is Bhatti et al. [7]. In recent years econometricians have examined the problems of diagnostic testing, specification testing, semiparametric estimation and model selection. In addition, researchers have considered whether to use model testing and model selection procedures to decide upon the models that best fit a particular dataset. This book explores both issues with application to various regression models, including arbitrage pricing theory models. Along the lines of model-selection criteria, the book references, e.g., Schwarz [12], the foundational paper for BIC.
Next we mention some recent papers which show applications of model selection in various research areas.
One such paper is Xu et al. [14], an application of principal components analysis and other methods to water quality assessment in a lake basin in China.
Another is Omuya et al. [11], on feature selection for classification using principal component analysis.
As mentioned, a particularly interesting application of principal components analysis is in regression and logistic regression. We have mentioned the paper by Massy [10] on using principal components analysis in regression. Another is Aguilera et al. [1] on using principal components in logistic regression.
6 Conclusions
The information criteria AIC and BIC have been applied here to the choice of the number of principal components to represent a dataset. The results have been compared and contrasted with criteria such as retaining those principal components which explain more than an average amount of the total variance.
Availability of Data and Material
The source of data used is a book that is referenced and available.
Abbreviations
- AIC: Akaike’s information criterion
- BIC: Bayesian information criterion
- DIAS: Diastolic blood pressure
- HT: Height
- LC: Linear combination
- LL: Maximum log likelihood
- MLE: Maximum likelihood estimate
- MSE: Mean squared error
- PC: Principal component
- SYS: Systolic blood pressure
- WT: Weight
References
1. Aguilera, A.M., Escabias, M., Valderrama, M.J.: Using principal components for estimating logistic regression with high-dimensional multicollinear data. Comput. Stat. Data Anal. 50(8), 1905–1924 (2006)
2. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csáki, F. (eds.) 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, September 2–8, 1971, pp. 267–281. Akadémiai Kiadó, Budapest (1973). [Republished in Kotz, S., Johnson, N.L. (eds.) Breakthroughs in Statistics, I, pp. 610–624. Springer (1992)]
3. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19(6), 716–723 (1974). https://doi.org/10.1109/TAC.1974.1100705, MR 0423716
4. Akaike, H.: Prediction and entropy. In: Atkinson, A.C., Fienberg, S.E. (eds.) A Celebration of Statistics, pp. 1–24. Springer, New York (1985)
5. Anderson, T.W.: An Introduction to Multivariate Statistical Analysis, 3rd edn. Wiley, New York (1958) [Wiley, Hoboken, NJ, 2002]
6. Bai, Z., Choi, K.P., Fujikoshi, Y.: Consistency of AIC and BIC in estimating the number of significant components in high-dimensional principal component analysis. Ann. Stat. 46(3), 1050–1076 (2018). https://doi.org/10.1214/17-AOS1577
7. Bhatti, M.I., Al-Shanfari, H., Zakir Hossain, M.: Econometric Analysis of Model Selection and Model Testing. Routledge, London (2017)
8. Dixon, W.J., Massey, F.J., Jr.: Introduction to Statistical Analysis, 3rd edn. McGraw-Hill, New York (1969)
9. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis, 6th edn. Pearson, Upper Saddle River (2008)
10. Massy, W.F.: Principal components regression in exploratory statistical research. J. Am. Stat. Assoc. 60(309), 234–256 (1965). https://doi.org/10.1080/01621459.1965.10480787
11. Omuya, E.O., Okeyo, G.O., Kimwele, M.W.: Feature selection for classification using principal component analysis and information gain. Expert Syst. Appl. 174, 114765 (2021)
12. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978). Stable URL: http://www.jstor.org/stable/2958889
13. Sclove, S.L.: Application of model-selection criteria to some problems in multivariate analysis. Psychometrika 52, 333–343 (1987). https://doi.org/10.1007/BF02294360
14. Xu, S., Cui, Y., Yang, C., Wei, S., Dong, W., Huang, L., Liu, C., Ren, Z., Wang, W.: The fuzzy comprehensive evaluation (FCE) and the principal component analysis (PCA) model simulation and its applications in water quality assessment of Nansi Lake Basin, China. Environ. Eng. Res. 26(2), 222–232 (2021)
Acknowledgements
There are no further acknowledgements.
Funding
There was no funding other than the author’s usual salary at the university.
Contributions
SLS is the sole author.
Ethics declarations
Conflict of interest
There are no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sclove, S.L. Using Model Selection Criteria to Choose the Number of Principal Components. J Stat Theory Appl 20, 450–461 (2021). https://doi.org/10.1007/s44199-021-00002-4
DOI: https://doi.org/10.1007/s44199-021-00002-4