
A frequently applied paradigm in analyzing data from multivariate observations is to model the relevant information (represented in a multivariate variable X) as coming from a limited number of latent factors. In a survey on household consumption, for example, the consumption levels, X, of p different goods during one month could be observed. The variations and covariations of the p components of X throughout the survey might in fact be explained by two or three main social behavior factors of the household. For instance, a basic desire for comfort, the willingness to achieve a certain social level, or other latent social concepts might explain most of the consumption behavior. These unobserved factors are much more interesting to the social scientist than the observed quantitative measures (X) themselves, because they give a better understanding of the behavior of households. As shown in the examples below, the same kind of factor analysis is of interest in many fields such as psychology, marketing, economics, political science, etc.

How can we provide a statistical model addressing these issues and how can we interpret the obtained model? This is the aim of factor analysis. As in Chapter 9 and Chapter 10, the driving statistical theme of this chapter is to reduce the dimension of the observed data. The perspective used, however, is different: we assume that there is a model (it will be called the “Factor Model”) stating that most of the covariances between the p elements of X can be explained by a limited number of latent factors. Section 11.1 defines the basic concepts and notations of the orthogonal factor model, stressing the non-uniqueness of the solutions. We show how to take advantage of this non-uniqueness to derive techniques which lead to easier interpretations. This will involve (geometric) rotations of the factors. Section 11.2 presents an empirical approach to factor analysis. Various estimation procedures are proposed and an optimal rotation procedure is defined. Many examples are used to illustrate the method.

1 The Orthogonal Factor Model

The aim of factor analysis is to explain the outcome of p variables in the data matrix \({\mathcal{X}}\) using fewer variables, the so-called factors. Ideally all the information in \({\mathcal{X}}\) can be reproduced by a smaller number of factors. These factors are interpreted as latent (unobserved) common characteristics of the observed \(x \in \mathbb {R}^{p}\). The case just described occurs when every observed \(x=(x_{1},\ldots,x_{p})^{\top}\) can be written as

$$ x_{j}=\sum_{\ell=1}^{k} q_{j\ell}f_{\ell} + \mu_{j},\quad j=1,\ldots,p.$$
(11.1)

Here \(f_{\ell}\), for \(\ell=1,\ldots,k\), denotes the factors. The number of factors, k, should always be much smaller than p. For instance, in psychology x may represent p results of a test measuring intelligence scores. One common latent factor explaining \(x \in \mathbb {R}^{p}\) could be the overall level of “intelligence”. In marketing studies, x may consist of p answers to a survey on the levels of satisfaction of the customers. These p measures could be explained by common latent factors like the attraction level of the product or the image of the brand, and so on. Indeed it is possible to create a representation of the observations that is similar to the one in (11.1) by means of principal components, but only if the last p−k eigenvalues corresponding to the covariance matrix are equal to zero. Consider a p-dimensional random vector X with mean μ and covariance matrix \(\mathop {\mathsf {Var}}(X)=\Sigma\). A model similar to (11.1) can be written for X in matrix notation, namely

$$ X = {\mathcal{Q}} F + \mu,$$
(11.2)

where F is the k-dimensional vector of the k factors. When using the factor model (11.2) it is often assumed that the factors F are centered, uncorrelated and standardized: \(\mathop {\mathsf {E}}(F)=0\) and \(\mathop {\mathsf {Var}}(F)={{\mathcal{I}}}_{k}\). We will now show that if the last p−k eigenvalues of Σ are equal to zero, we can easily express X by the factor model (11.2).

The spectral decomposition of Σ is given by \(\Gamma\Lambda\Gamma^{\top}\). Suppose that only the first k eigenvalues are positive, i.e., \(\lambda_{k+1}=\cdots=\lambda_{p}=0\). Then the (singular) covariance matrix can be written as

$$\Sigma = \sum_{\ell=1}^k \lambda_\ell \gamma_{\ell} \gamma_{\ell}^{\top}=(\Gamma_1 \Gamma_2) \left(\begin{array}{c@{\quad}c}\Lambda_1& 0\\ 0& 0\end{array}\right)\left(\Gamma_1^{\top} \atop \Gamma_2^{\top}\right).$$

In order to show the connection to the factor model (11.2), recall that the PCs are given by \(Y=\Gamma^{\top}(X-\mu)\). Rearranging we have \(X-\mu=\Gamma Y=\Gamma_{1}Y_{1}+\Gamma_{2}Y_{2}\), where the components of Y are partitioned according to the partition of Γ above, namely

$$Y=\left(Y_1 \atop Y_2\right)=\left(\Gamma_1^{\top} \atop \Gamma_2^{\top}\right)(X-\mu)\quad \mbox{with}\quad \mathop {\mathsf {Var}}(Y)=\Lambda=\left(\begin{array}{c@{\quad}c}\Lambda_1& 0\\ 0& 0\end{array}\right).$$

In other words, \(Y_{2}\) has a singular distribution with mean and covariance matrix equal to zero. Therefore, \(X-\mu=\Gamma_{1}Y_{1}+\Gamma_{2}Y_{2}\) implies that \(X-\mu\) is equivalent to \(\Gamma_{1}Y_{1}\), which can be written as

$$X=\Gamma_1 \Lambda_1^{1/2}\Lambda_1^{-1/2} Y_1 + \mu.$$

Defining \({{\mathcal{Q}}}= \Gamma_{1} \Lambda_{1}^{1/2}\) and \(F=\Lambda_{1}^{-1/2} Y_{1}\), we obtain the factor model (11.2).

Note that the covariance matrix of model (11.2) can be written as

$$\Sigma = \mathop {\mathsf {E}}(X-\mu)(X-\mu)^{\top} = {\mathcal{Q}}\mathop {\mathsf {E}}(FF^{\top}){\mathcal{Q}}^{\top} = {\mathcal{QQ}}^{\top}= \sum_{j=1}^k \lambda_j \gamma_{j} \gamma_{j}^{\top}.$$
(11.3)

We have just shown how the variable X can be completely determined by a weighted sum of k (where k<p) uncorrelated factors. The situation used in the derivation, however, is too idealistic. In practice the covariance matrix is rarely singular.
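To make the derivation above concrete, here is a minimal numerical sketch (with an artificially constructed, hypothetical covariance matrix, not one of the book's data sets): when Σ has rank k, the loadings \({\mathcal{Q}}=\Gamma_{1}\Lambda_{1}^{1/2}\) built from the k leading eigenvalues and eigenvectors reproduce Σ exactly, as in (11.3).

```python
import numpy as np

# Hypothetical rank-2 covariance matrix for illustration only.
k = 2
rng = np.random.default_rng(0)
A = rng.normal(size=(5, k))
Sigma = A @ A.T                         # singular covariance matrix of rank k

lam, gamma = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
lam, gamma = lam[::-1], gamma[:, ::-1]  # reorder: largest eigenvalues first
Q = gamma[:, :k] * np.sqrt(lam[:k])     # Q = Gamma_1 Lambda_1^{1/2}

print(np.allclose(Sigma, Q @ Q.T))      # True: Sigma = Q Q^T, cf. (11.3)
```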

It is common practice in factor analysis to split the influences of the factors into common and specific ones. There are, for example, highly informative factors that are common to all of the components of X and factors that are specific to certain components. The factor analysis model used in practice is a generalization of (11.2):

$$ X = {\mathcal{Q}} F + U +\mu,$$
(11.4)

where \({\mathcal{Q}}\) is a (p×k) matrix of the (non-random) loadings of the common factors F(k×1) and U is a (p×1) matrix of the (random) specific factors. It is assumed that the components of the factor vector F are uncorrelated, and that the specific factors are uncorrelated with each other and have zero covariance with the common factors. More precisely, it is assumed that:

$$\mathop {\mathsf {E}}(F)=0,\qquad \mathop {\mathsf {Var}}(F)={{\mathcal{I}}}_{k},\qquad \mathop {\mathsf {E}}(U)=0,\qquad \mathop {\mathsf {Cov}}(U_{i},U_{j})=0,\ i\neq j,\qquad \mathop {\mathsf {Cov}}(F,U)=0. $$
(11.5)

Define

$$\mathop {\mathsf {Var}}(U)=\Psi =\mathop {\mathrm {diag}}(\psi _{11},\ldots ,\psi _{pp}). $$

The generalized factor model (11.4) together with the assumptions given in (11.5) constitute the orthogonal factor model.


Note that (11.4) implies for the components of \(X=(X_{1},\ldots,X_{p})^{\top}\) that

$$ X_j=\sum ^k_{\ell=1}q_{j\ell}F_\ell+U_j+\mu_j,\quad j=1,\ldots ,p.$$
(11.6)

Using (11.5) we obtain \(\sigma_{X_{j}X_{j}} = \mathop {\mathsf {Var}}(X_{j}) = \sum^{k}_{\ell=1} q^{2}_{j\ell} +\psi_{jj}\). The quantity \(h^{2}_{j} = \sum^{k}_{\ell=1} q^{2}_{j\ell} \) is called the communality and ψ jj the specific variance. Thus the covariance of X can be rewritten as

$$\Sigma = {\mathcal{Q}}{\mathcal{Q}}^{\top} + \Psi. $$
(11.7)

In a sense, the factor model explains most of the variation of X by a small number of latent factors F common to its p components, and it explains the entire correlation structure between the components; the “noise” U allows specific variations of each component to enter. The specific factors adjust to capture the individual variance of each component. Factor analysis relies on the assumptions presented above. If the assumptions are not met, the analysis could be spurious. Although principal component analysis and factor analysis might be related (this was hinted at in the derivation of the factor model), they are quite different in nature. PCs are linear transformations of X arranged in decreasing order of variance and used to reduce the dimension of the data set, whereas in factor analysis, we try to model the variations of X using a linear transformation of a fixed, limited number of latent factors. The objective of factor analysis is to find the loadings \({\mathcal{Q}}\) and the specific variance Ψ. Estimates of \({\mathcal{Q}}\) and Ψ are deduced from the covariance structure (11.7).
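The variance split implied by (11.7) is easy to verify numerically. The following sketch uses hypothetical loadings and specific variances (not taken from any of the book's data sets) and checks that \(\mathop {\mathsf {Var}}(X_{j})=h_{j}^{2}+\psi_{jj}\).

```python
import numpy as np

Q = np.array([[0.9, 0.1],
              [0.8, 0.3],
              [0.2, 0.7]])              # hypothetical (p x k) loadings
Psi = np.diag([0.18, 0.27, 0.47])       # hypothetical specific variances

Sigma = Q @ Q.T + Psi                   # covariance implied by (11.7)
h2 = (Q ** 2).sum(axis=1)               # communalities h_j^2
print(np.allclose(np.diag(Sigma), h2 + np.diag(Psi)))   # True
```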

1.1 Interpretation of the Factors

Assume that a factor model with k factors was found to be reasonable, i.e., most of the (co)variations of the p measures in X were explained by the k fixed latent factors. The next natural step is to try to understand what these factors represent. To interpret \(F_{\ell}\), it makes sense to compute its correlations with the original variables \(X_{j}\) first. This is done for \(\ell=1,\ldots,k\) and for \(j=1,\ldots,p\) to obtain the matrix \(P_{XF}\). The sequence of calculations used here is in fact the same as the one used to interpret the PCs in the principal component analysis.

The following covariance between X and F is obtained via (11.5),

$$\Sigma_{XF} = \mathop {\mathsf {E}}\{ ({\mathcal{Q}}F+U)F^{\top} \} ={\mathcal{Q}}. $$

The correlation is

$$P_{XF} = D^{-1/2} {\mathcal{Q}}, $$
(11.8)

where \(D = \mathop {\mathrm {diag}}(\sigma_{X_{1}X_{1}}, \ldots, \sigma_{X_{p}X_{p}}) \). Using (11.8) it is possible to construct a figure analogous to Figure 10.6 and thus to consider which of the original variables X 1,…,X p play a role in the unobserved common factors F 1,…,F k .
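As a small sketch of (11.8), again with hypothetical loadings and specific variances, the correlations between the observed variables and the factors are obtained by rescaling the rows of \({\mathcal{Q}}\) with the standard deviations of X.

```python
import numpy as np

Q = np.array([[0.9, 0.1],
              [0.8, 0.3],
              [0.2, 0.7]])               # hypothetical loadings
Psi = np.diag([0.18, 0.27, 0.47])        # hypothetical specific variances
Sigma = Q @ Q.T + Psi

D_inv_sqrt = np.diag(np.diag(Sigma) ** -0.5)
P_XF = D_inv_sqrt @ Q                    # entry (j, l) = Corr(X_j, F_l), cf. (11.8)
print(P_XF)
```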

Returning to the psychology example where X are the observed scores to p different intelligence tests (the WAIS data set in Table B.12 provides an example), we would expect a model with one factor to produce a factor that is positively correlated with all of the components in X. For this example the factor represents the overall level of intelligence of an individual. A model with two factors could produce a refinement in explaining the variations of the p scores. For example, the first factor could be the same as before (overall level of intelligence), whereas the second factor could be positively correlated with some of the tests, X j , that are related to the individual’s ability to think abstractly and negatively correlated with other tests, X i , that are related to the individual’s practical ability. The second factor would then concern a particular dimension of the intelligence stressing the distinctions between the “theoretical” and “practical” abilities of the individual. If the model is true, most of the information coming from the p scores can be summarized by these two latent factors. Other practical examples are given below.

1.2 Invariance of Scale

What happens if we change the scale of X to \(Y = {\mathcal{C}}X\) with \({\mathcal{C}} = \mathop {\mathrm {diag}}(c_{1},\ldots,c_{p})\)? If the k-factor model (11.6) is true for X with \({\mathcal{Q}} ={\mathcal{Q}}_{X}\), Ψ=Ψ X , then, since

$$\mathop {\mathsf {Var}}(Y) = {\mathcal{C}}\Sigma {\mathcal{C}}^{\top} = {\mathcal{C}}{\mathcal{Q}}_X {\mathcal{Q}}_X^{\top}{\mathcal{C}}^{\top} + {\mathcal{C}} \Psi_X{\mathcal{C}}^{\top}, $$

the same k-factor model is also true for Y with \({\mathcal{Q}}_{Y} = {\mathcal{C}}{\mathcal{Q}}_{X}\) and \(\Psi_{Y} = {\mathcal{C}}\Psi_{X}{\mathcal{C}}^{\top}\). In many applications, the search for the loadings \({\mathcal{Q}}\) and for the specific variance Ψ will be done by the decomposition of the correlation matrix of X rather than the covariance matrix Σ. This corresponds to a factor analysis of a linear transformation of X (i.e., Y=D −1/2(Xμ)). The goal is to try to find the loadings \({\mathcal{Q}}_{Y}\) and the specific variance Ψ Y such that

$$P = {\mathcal{Q}}_Y \; {\mathcal{Q}}_Y^{\top} + \Psi_Y. $$
(11.9)

In this case the interpretation of the factors F immediately follows from (11.8) given the following correlation matrix:

$$P_{XF} = P_{YF} = {\mathcal{Q}}_{Y}. $$
(11.10)

Because of the scale invariance of the factors, the loadings and the specific variance of the model, where X is expressed in its original units of measure, are given by

$${\mathcal{Q}}_X = D^{1/2}\,{\mathcal{Q}}_Y \quad \mbox{and} \quad \Psi_X = D^{1/2}\,\Psi_Y\, D^{1/2}. $$
It should be noted that although the factor analysis model (11.4) enjoys the scale invariance property, the actual estimated factors could be scale dependent. We will come back to this point later when we discuss the method of principal factors.

1.3 Non-uniqueness of Factor Loadings

The factor loadings are not unique! Suppose that \({\mathcal{G}}\) is an orthogonal matrix. Then X in (11.4) can also be written as

$$X = ({\mathcal{Q}} {\mathcal{G}})({\mathcal{G}}^{\top}F) + U + \mu .$$

This implies that, if a k-factor model of X with factors F and loadings \({{\mathcal{Q}}}\) is true, then the k-factor model with factors \({{\mathcal{G}}}^{\top}F\) and loadings \({{\mathcal{Q}}}{{\mathcal{G}}}\) is also true. In practice, we will take advantage of this non-uniqueness. Indeed, referring back to Section 2.6 we can conclude that premultiplying a vector F by an orthogonal matrix corresponds to a rotation of the system of axes, the direction of the first new axis being given by the first row of the orthogonal matrix. It will be shown that choosing an appropriate rotation will result in a matrix of loadings \({{\mathcal{Q}}}{{\mathcal{G}}}\) that is easier to interpret. We have seen that the loadings provide the correlations between the factors and the original variables; therefore, it makes sense to search for rotations that give factors that are maximally correlated with various groups of variables.
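A quick numerical check of this non-uniqueness (with a hypothetical loadings matrix): rotating the loadings by any orthogonal \({\mathcal{G}}\) leaves \({\mathcal{Q}}{\mathcal{Q}}^{\top}\), and hence the implied covariance, unchanged.

```python
import numpy as np

Q = np.array([[0.9, 0.1],
              [0.8, 0.3],
              [0.2, 0.7]])                          # hypothetical loadings
theta = 0.7                                         # an arbitrary rotation angle
G = np.array([[np.cos(theta),  np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])     # orthogonal 2x2 rotation matrix

print(np.allclose(Q @ Q.T, (Q @ G) @ (Q @ G).T))    # True
```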

From a numerical point of view, the non-uniqueness is a drawback. We have to find loadings \({{\mathcal{Q}}}\) and specific variances Ψ satisfying the decomposition \(\Sigma= {{\mathcal{Q}}}{{\mathcal{Q}}}^{\top} + \Psi\), but no straightforward numerical algorithm can solve this problem due to the multiplicity of the solutions. An acceptable technique is to impose some chosen constraints in order to get—in the best case—a unique solution to the decomposition. Then, as suggested above, once we have a solution we will take advantage of the rotations in order to obtain a solution that is easier to interpret.

An obvious question is: what kind of constraints should we impose in order to eliminate the non-uniqueness problem? Usually, we impose the additional constraint that

$$ {\mathcal{Q}}^{\top} \Psi^{-1} {\mathcal{Q}} \quad \mbox{is diagonal}$$
(11.11)

or

$$ {\mathcal{Q}} ^{\top} {\mathcal{D}}^{-1}{\mathcal{Q}} \quad \mbox{is diagonal.}$$
(11.12)

How many parameters does the model (11.7) have without constraints? The loadings matrix \({\mathcal{Q}}\) is of dimension (p×k) and the diagonal matrix Ψ contains p free entries.

Hence we have to determine pk+p parameters! Condition (11.11), respectively (11.12), introduces \(\frac{1}{2} \{ k(k-1) \}\) constraints, since we require the matrix to be diagonal. Therefore, the degrees of freedom of a model with k factors is:

$$d = \frac{1}{2}p(p+1) - \Bigl\{pk+p-\frac{1}{2}k(k-1)\Bigr\} = \frac{1}{2}(p-k)^{2}-\frac{1}{2}(p+k),$$

where \(\frac{1}{2}p(p+1)\) is the number of parameters of Σ unconstrained and \(pk+p-\frac{1}{2}k(k-1)\) is the number of parameters of Σ constrained.

If d<0, then the model is undetermined: there are infinitely many solutions to (11.7). This means that the number of parameters of the factorial model is larger than the number of parameters of the original model, or that the number of factors k is “too large” relative to p. In some cases d=0: there is a unique solution to the problem (except for rotation). In practice we usually have that d>0: there are more equations than parameters, thus an exact solution does not exist. In this case approximate solutions are used. An approximation of Σ, for example, is \({\mathcal{QQ}}^{\top} + \Psi\). The last case is the most interesting since the factorial model has fewer parameters than the original one. Estimation methods are introduced in the next section.
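The following sketch simply tabulates d for the small cases discussed next (p=4 and p=6), reproducing the counts given below.

```python
def dof(p, k):
    """Degrees of freedom d = {(p - k)^2 - p - k} / 2 of a k-factor model."""
    return ((p - k) ** 2 - p - k) // 2   # the numerator is always even

for p in (4, 6):
    print(p, [dof(p, k) for k in range(1, 5)])
# 4 [2, -1, -3, -4]
# 6 [9, 4, 0, -3]
```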

Evaluating the degrees of freedom, d, is particularly important, because it already gives an idea of the upper bound on the number of factors we can hope to identify in a factor model. For instance, if p=4, we could not identify a factor model with 2 factors (this results in d=−1, which gives infinitely many solutions). With p=4, only a one factor model gives an approximate solution (d=2). When p=6, models with 1 and 2 factors provide approximate solutions and a model with 3 factors results in a unique solution (up to the rotations) since d=0. A model with 4 or more factors would not be allowed, but of course, the aim of factor analysis is to find suitable models with a small number of factors, i.e., smaller than p. The next two examples give more insights into the notion of degrees of freedom.

Example 11.1

Let p=3 and k=1, then d=0 and

$$\Sigma = \left(\begin{array}{l@{\quad}l@{\quad}l}\sigma_{11} &\sigma_{12}& \sigma_{13}\\\sigma_{21} & \sigma_{22} & \sigma_{23}\\\sigma_{31} & \sigma_{32} & \sigma_{33}\end{array}\right) = \left(\begin{array}{c@{\quad}c@{\quad}c}q_1^2 + \psi_{11} &q_1q_2 & q_1q_3\\q_1q_2 & q_2^2+\psi_{22} &q_2q_3 \\q_1q_3 & q_2q_3 & q_3^2 + \psi_{33}\end{array} \right) $$

with \({\mathcal{Q}}=(q_{1},q_{2},q_{3})^{\top}\) and \(\Psi=\mathop {\mathrm {diag}}(\psi_{11},\psi_{22},\psi_{33})\). Note that here the diagonality constraint (11.11) is automatically verified since k=1. We have

$$q_1^2 = \frac{\sigma_{12}\sigma_{13}}{\sigma_{23}};\qquad q_2^2 = \frac{\sigma_{12}\sigma_{23}}{\sigma_{13}};\qquad q_3^2 = \frac{\sigma_{13}\sigma_{23}}{\sigma_{12}}$$

and

$$\psi_{11} = \sigma_{11} - q_1^2;\qquad \psi_{22} = \sigma_{22} - q_2^2;\qquad \psi_{33} = \sigma_{33} - q_3^2.$$

In this particular case (k=1), the only rotation is defined by \({\mathcal{G}}=-1\), so the other solution for the loadings is provided by \(-{\mathcal{Q}}\).

Example 11.2

Suppose now p=2 and k=1, then d<0 and

$$\Sigma = \left( \begin{array}{c@{\quad}c} 1 & \rho \\ \rho & 1 \end{array}\right) = \left( \begin{array}{c@{\quad}c} q_1^2 + \psi_{11} & q_1q_2 \\q_1q_2 & q_2^2+\psi_{22} \end{array} \right).$$

We have infinitely many solutions: for any α (ρ<α<1), a solution is provided by

$$q_1 = \alpha;\qquad q_2 = \rho/\alpha;\qquad \psi_{11} = 1-\alpha^2; \qquad \psi_{22} = 1-(\rho/\alpha)^2.$$

The solution in Example 11.1 may be unique (up to a rotation), but it is not necessarily proper, in the sense that it may not allow a statistical interpretation. Exercise 11.5 gives an example where the specific variance ψ 11 is negative.

Even in the case of a unique solution (d=0), the solution may be inconsistent with statistical interpretations.


2 Estimation of the Factor Model

In practice, we have to find estimates \(\widehat{{\mathcal{Q}}}\) of the loadings \({{\mathcal{Q}}}\) and estimates \(\widehat{\Psi}\) of the specific variances Ψ such that analogously to (11.7)

$${{\mathcal{S}}} = \widehat{{\mathcal{Q}}} \widehat{{\mathcal{Q}}}^{\top} + \widehat{\Psi}, $$

where \({{\mathcal{S}}}\) denotes the empirical covariance of \({{\mathcal{X}}}\). Given an estimate \(\widehat{{\mathcal{Q}}}\) of \({{\mathcal{Q}}}\), it is natural to set

$$\widehat{\psi}_{jj}=s_{X_{j}X_{j}}-\sum ^k_{\ell=1}\widehat{q}^2_{j\ell}.$$

We have that \(\widehat{h}_{j}^{2}=\sum ^{k}_{\ell=1}\widehat{q}^{2}_{j\ell}\) is an estimate for the communality \(h_{j}^{2}\).

In the ideal case d=0, there is an exact solution. However, d is usually greater than zero, therefore we have to find \(\widehat{{\mathcal{Q}}}\) and \(\widehat{\Psi}\) such that S is approximated by \(\widehat{{\mathcal{Q}}} \widehat{{\mathcal{Q}}}^{\top} +\widehat{\Psi}\). As mentioned above, it is often easier to compute the loadings and the specific variances of the standardized model.

Define \({{\mathcal{Y}}}={{\mathcal{HXD}}}^{-1/2}\), the standardization of the data matrix \({{\mathcal{X}}}\), where, as usual, \({{\mathcal{D}}}=\mathop {\mathrm {diag}}( s_{X_{1}X_{1}}, \ldots, s_{X_{p}X_{p}} )\) and the centering matrix \({{\mathcal{H}}}= {{\mathcal{I}}}-n^{-1}1_{n}1_{n}^{\top}\) (recall from Chapter 2 that \({{\mathcal{S}}}= \frac{1}{n}{{\mathcal{X}}}^{\top}{ {\mathcal{HX}}}\)). The estimated factor loading matrix \(\widehat{{\mathcal{Q}}}_{Y}\) and the estimated specific variance \(\widehat{\Psi}_{Y}\) of \({{\mathcal{Y}}}\) are

$$\widehat{{\mathcal{Q}}}_Y={{\mathcal{D}}}^{-1/2}\widehat{{\mathcal{Q}}}_X \quad \mbox{and} \quad \widehat{\Psi}_Y={{\mathcal{D}}}^{-1}\widehat{\Psi}_X.$$

For the correlation matrix \({{\mathcal{R}}}\) of \({{\mathcal{X}}}\), we have that

$${{\mathcal{R}}}=\widehat{{\mathcal{Q}}}_Y\widehat{{\mathcal{Q}}}_Y^{\top}+\widehat{\Psi}_Y.$$

The interpretations of the factors are formulated from the analysis of the loadings \(\widehat{{\mathcal{Q}}}_{Y}\).

Example 11.3

Let us calculate the matrices just defined for the car data given in Table B.7. This data set consists of the averaged marks (from 1=low to 6=high) for 24 car types. Considering the three variables price, security and easy handling, we get the following correlation matrix:

$${{\mathcal{R}}} = \left (\begin{array}{c@{\quad}c@{\quad}c}1 & 0.975 & 0.613 \\0.975& 1 & 0.620 \\0.613& 0.620& 1\end{array} \right ).$$

We will first look for one factor, i.e., k=1. Note that the number of parameters of Σ unconstrained minus the number of parameters of Σ constrained equals \(\frac{1 }{2 } (p-k)^{2}-\frac{1 }{2 }(p+ k)= \frac{1 }{2 }(3-1)^{2}-\frac{1 }{2 }(3+1)=0\). This implies that there is an exact solution! The equation

$$\left (\begin{array}{c@{\quad}c@{\quad}c}1 & r_{X_{1}X_{2}} & r_{X_{1}X_{3}}\\r_{X_{1}X_{2}}& 1 & r_{X_{2}X_{3}} \\r_{X_{1}X_{3}}& r_{X_{2}X_{3}} & 1\end{array} \right )= {{\mathcal{R}}} =\left (\begin{array}{c@{\quad}c@{\quad}c}\widehat{q} _1^2 + \widehat{\psi}_{11} & \widehat{q} _1 \widehat{q} _2 & \widehat{q} _1\widehat{q} _3 \\\widehat{q} _1 \widehat{q} _2& \widehat{q}^2_2 + \widehat{\psi}_{22} & \widehat{q}_2 \widehat{q}_3\\\widehat{q} _1\widehat{q} _3&\widehat{q}_2 \widehat{q}_3 & \widehat{q}^2_3 + \widehat{\psi}_{33}\end{array} \right )$$

yields the communalities \(\widehat{h}^{2}_{i}=\widehat{q}^{2}_{i}\), where

$$\widehat{q} ^2_1 = \frac{r_{X_{1}X_{2}}r_{X_{1}X_{3}}}{r_{X_{2}X_{3}}},\qquad \widehat{q} ^2_2 = \frac{r_{X_{1}X_{2}}r_{X_{2}X_{3}}}{r_{X_{1}X_{3}}}\quad \mbox{and}\quad \widehat{q} ^2_3 = \frac{r_{X_{1}X_{3}}r_{X_{2}X_{3}}}{r_{X_{1}X_{2}}}.$$

Combining this with the specific variances \(\widehat{\psi}_{11} = 1- \widehat{q} ^{2}_{1}\) , \(\widehat{\psi}_{22}= 1- \widehat{q} ^{2}_{2}\) and \(\widehat{\psi}_{33} = 1- \widehat{q}^{2}_{3}\), we obtain the following solution

$$\begin{array}{rcl@{\qquad}rcl@{\qquad}rcl}\widehat{q}_1 &=& 0.982 &\widehat{q}_2 &=& 0.993 &\widehat{q}_3 &=& 0.624 \\[6pt]\widehat{\psi}_{11} &=& 0.035 &\widehat{\psi}_{22} &=& 0.014 &\widehat{\psi}_{33} &=& 0.610.\end{array}$$

Since the first two communalities (\(\widehat{h}^{2}_{i}=\widehat{q}^{2}_{i}\)) are close to one, we can conclude that the first two variables, namely price and security, are explained by the single factor quite well. This factor can be interpreted as a “price+security” factor.
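The exact one-factor solution of this example is easily reproduced numerically from the three correlations of the car data:

```python
import numpy as np

# correlations: price/security, price/easy handling, security/easy handling
r12, r13, r23 = 0.975, 0.613, 0.620

q = np.sqrt([r12 * r13 / r23,     # q_1^2
             r12 * r23 / r13,     # q_2^2
             r13 * r23 / r12])    # q_3^2
psi = 1 - q ** 2                  # specific variances

print(np.round(q, 3))    # approx. (0.982, 0.993, 0.624)
print(np.round(psi, 3))  # approx. (0.036, 0.014, 0.610); matches the text up to rounding
```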

2.1 The Maximum Likelihood Method

Recall from Chapter 6 the log-likelihood function for a data matrix \({{\mathcal{X}}}\) of observations of \(X\sim N_{p}(\mu,\Sigma)\):

$$\ell({{\mathcal{X}}};\mu,\Sigma)= -\frac{n}{2} \log |2\pi\Sigma| -\frac{n}{2} \mathop {\mathrm {tr}}(\Sigma^{-1}{{\mathcal{S}}}) -\frac{n}{2}(\overline{x}-\mu)^{\top}\Sigma^{-1}(\overline{x}-\mu).$$

This can be rewritten as

$$\ell({{\mathcal{X}}};\widehat{\mu},\Sigma)= -\frac{n}{2} \{ \log |2\pi\Sigma| +\mathop {\mathrm {tr}}(\Sigma^{-1}{{\mathcal{S}}}) \}. $$

Replacing μ by \(\widehat{\mu}= \overline{x}\) and substituting \(\Sigma = {{\mathcal{QQ}}}^{\top} + \Psi\) this becomes

$$ \ell({{\mathcal{X}}};\widehat{\mu},{{\mathcal{Q}}},\Psi)= -\frac{n}{2} [ \log \{ | 2 \pi( {{\mathcal{QQ}}}^{\top} + \Psi)| \}+ \mathop {\mathrm {tr}}\{({{\mathcal{QQ}}}^{\top} + \Psi)^{-1}{{\mathcal{S}}}\} ].$$
(11.13)

Even in the case of a single factor (k=1), these equations are rather complicated and iterative numerical algorithms have to be used (for more details see Mardia et al. (1979, p. 263ff)). A practical computation scheme is also given in Supplement 9A of Johnson and Wichern (1998).
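These iterative schemes are not reproduced here, but a minimal sketch of the idea is to maximize (11.13) numerically over \({\mathcal{Q}}\) and Ψ with a general-purpose optimizer; the result is only determined up to rotation since no identification constraint such as (11.11) is imposed. The helper name ml_factor and the starting values are hypothetical choices, not the algorithm of the references above.

```python
import numpy as np
from scipy.optimize import minimize

def ml_factor(S, k):
    """Crude numerical maximization of the factor-model likelihood (11.13)."""
    p = S.shape[0]

    def neg_loglik(theta):
        Q = theta[:p * k].reshape(p, k)
        psi = np.exp(theta[p * k:])                  # keeps psi_jj > 0
        Sigma = Q @ Q.T + np.diag(psi)
        _, logdet = np.linalg.slogdet(Sigma)
        return logdet + np.trace(np.linalg.solve(Sigma, S))

    # start from the principal component solution (11.16)
    lam, gam = np.linalg.eigh(S)
    lam, gam = lam[::-1], gam[:, ::-1]
    Q0 = gam[:, :k] * np.sqrt(lam[:k])
    psi0 = np.maximum(np.diag(S) - (Q0 ** 2).sum(axis=1), 1e-3)

    res = minimize(neg_loglik, np.concatenate([Q0.ravel(), np.log(psi0)]),
                   method="L-BFGS-B")
    return res.x[:p * k].reshape(p, k), np.diag(np.exp(res.x[p * k:]))
```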

2.1.1 Likelihood Ratio Test for the Number of Common Factors

Using the methodology of Chapter 7, it is easy to test the adequacy of the factor analysis model by comparing the likelihood under the null (factor analysis) and alternative (no constraints on covariance matrix) hypotheses.

Assuming that \(\widehat{{\mathcal{Q}}}\) and \(\widehat{\Psi}\) are the maximum likelihood estimates corresponding to (11.13), we obtain the following LR test statistic:

$$ - 2 \log \left( \frac{\textrm{maximized likelihood under}\ H_0}{\textrm{maximized likelihood}}\right)=n \log \left(\frac{|\widehat{{\mathcal{Q}}} \widehat{{\mathcal{Q}}}^{\top}+ \widehat{\Psi}|}{|{{\mathcal{S}}}|}\right),$$
(11.14)

which asymptotically has the \(\chi^{2}_{\frac{1}{2}\{(p-k)^{2}-p-k\}}\) distribution.

The χ 2 approximation can be improved if we replace n by n−1−(2p+4k+5)/6 in (11.14) (Bartlett, 1954). Using Bartlett’s correction, we reject the factor analysis model at the α level if

$$\{n-1-(2p+4k+5)/6\}\log \left(\frac{|\widehat{{\mathcal{Q}}} \widehat{{\mathcal{Q}}}^{\top}+ \widehat{\Psi}|}{|{{\mathcal{S}}}|}\right)>\chi^2_{1-\alpha; \{(p-k)^2-p-k\}/2},$$
(11.15)

provided that the number of observations n is large and the number of common factors k is such that the χ 2 statistic has a positive number of degrees of freedom.
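A sketch of the corrected test (11.15), assuming the estimates \(\widehat{{\mathcal{Q}}}\) and \(\widehat{\Psi}\) (e.g. from the ml_factor sketch above), the empirical covariance S and the sample size n are given:

```python
import numpy as np
from scipy.stats import chi2

def lr_test(S, Qhat, Psihat, n, k, alpha=0.05):
    """Bartlett-corrected LR test (11.15) of the k-factor model."""
    p = S.shape[0]
    df = ((p - k) ** 2 - p - k) // 2                     # must be positive for the test to apply
    stat = (n - 1 - (2 * p + 4 * k + 5) / 6) * np.log(
        np.linalg.det(Qhat @ Qhat.T + Psihat) / np.linalg.det(S))
    return stat, df, stat > chi2.ppf(1 - alpha, df)      # True = reject the k-factor model
```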

2.2 The Method of Principal Factors

The method of principal factors concentrates on the decomposition of the correlation matrix \({{\mathcal{R}}}\) or the covariance matrix \({{\mathcal{S}}}\). For simplicity, only the method for the correlation matrix \({{\mathcal{R}}}\) will be discussed. As pointed out in Chapter 10, the spectral decompositions of \({{\mathcal{R}}}\) and \({{\mathcal{S}}}\) yield different results and therefore, the method of principal factors may result in different estimators. The method can be motivated as follows: Suppose we know the exact Ψ, then the constraint (11.12) implies that the columns of \({{\mathcal{Q}}}\) are orthogonal since \({{\mathcal{D}}}={{\mathcal{I}}}\) and it implies that they are eigenvectors of \({{\mathcal{Q}}}{{\mathcal{Q}}}^{\top} = {{\mathcal{R}}}-\Psi\). Furthermore, assume that the first k eigenvalues are positive. In this case we could calculate \({{\mathcal{Q}}}\) by means of a spectral decomposition of \({{\mathcal{Q}}}{{\mathcal{Q}}}^{\top}\) and k would be the number of factors.

The principal factors algorithm is based on good preliminary estimators \(\widetilde{h}^{2}_{j}\) of the communalities \(h^{2}_{j}\), for j=1,…,p. There are two traditional proposals:

  • \(\widetilde{h}_{j}^{2}\), defined as the square of the multiple correlation coefficient of \(X_{j}\) with the remaining variables \((X_{\ell})_{\ell\neq j}\), i.e., \(\rho^{2}(V,W\widehat{\beta})\) with \(V=X_{j}\), \(W=(X_{\ell})_{\ell\neq j}\) and where \(\widehat{\beta}\) is the least squares regression parameter of a regression of V on W.

  • \(\widetilde{h}_{j}^{2} =\max_{\ell\neq j}|r_{X_{j}X_{\ell}}|\), where \({{\mathcal{R}}}=(r_{X_{j}X_{\ell}})\) is the correlation matrix of \({\mathcal{X}}\).

Given \(\tilde{\psi}_{jj} = 1-\tilde{h}_{j}^{2}\) we can construct the reduced correlation matrix, \({\mathcal{R}}-\widetilde{\Psi}\). The Spectral Decomposition Theorem says that

$${\mathcal{R}}- \widetilde{\Psi}= \sum^p_{\ell=1}\lambda_\ell\gamma_{\ell}\gamma_{\ell}^{\top},$$

with eigenvalues λ 1≥⋯≥λ p . Assume that the first k eigenvalues λ 1,…,λ k are positive and large compared to the others. Then we can set

$$\widehat{q}_{\ell} = \sqrt{\lambda_\ell}\ \gamma_{\ell},\quad \ell=1, \ldots, k$$

or

$$\widehat{{\mathcal{Q}}} = \Gamma_1\Lambda_1^{1/2}$$

with

$$\Gamma_1 = (\gamma_{1}, \ldots, \gamma_{k})\quad \mbox{and} \quad \Lambda_1 = \mathop {\mathrm {diag}}(\lambda_1, \ldots, \lambda_k).$$

In the next step set

$$\widehat{\psi}_{jj} = 1- \sum^k_{\ell=1} \widehat{q}^2_{j\ell},\quad j=1, \ldots, p.$$

Note that the procedure can be iterated: from \(\widehat{\psi}_{jj}\) we can compute a new reduced correlation matrix \({\mathcal{R}}-\widehat{\Psi}\) following the same procedure. The iteration usually stops when the \(\widehat{\psi}_{jj}\) have converged to a stable value.
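A compact sketch of this iteration, using the second proposal above (the maximal absolute correlation) as the preliminary communality:

```python
import numpy as np

def principal_factors(R, k, n_iter=20):
    """Principal factor method on a correlation matrix R with k factors."""
    p = R.shape[0]
    h2 = np.max(np.abs(R - np.eye(p)), axis=1)   # preliminary communalities
    for _ in range(n_iter):
        reduced = R - np.diag(1 - h2)            # reduced correlation matrix R - Psi~
        lam, gam = np.linalg.eigh(reduced)
        lam, gam = lam[::-1], gam[:, ::-1]       # largest eigenvalues first
        Q = gam[:, :k] * np.sqrt(np.maximum(lam[:k], 0))
        h2 = (Q ** 2).sum(axis=1)                # updated communalities
    return Q, np.diag(1 - h2)                    # loadings and specific variances
```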

Example 11.4

Consider once again the car data given in Table B.7. From Exercise 10.4 we know that the first PC is mainly influenced by \(X_{2}\)–\(X_{7}\). Moreover, we know that most of the variance is already captured by the first PC. Thus we can conclude that the data are mainly determined by one factor (k=1).

The eigenvalues of \({\mathcal{R}}- \widetilde{\Psi}\), using the preliminary communalities \(\widetilde{h}_{j}^{2}=\max_{\ell\neq j}|r_{X_{j}X_{\ell}}|\), are

$$(5.448,0.003,-0.246,-0.646,-0.901,-0.911,-0.948,-0.964)^{\top}.$$

It would suffice to choose only one factor. Nevertheless, we have computed two factors. The result (the factor loadings for two factors) is shown in Figure 11.1.

Fig. 11.1
figure 1

Loadings of the evaluated car qualities, factor analysis with k=2  MVAfactcarm

We can clearly see a cluster of points to the right, which contains the factor loadings for the variables \(X_{2}\)–\(X_{7}\). This shows, as did the PCA, that these variables are highly dependent and are thus more or less equivalent. The factor loadings for \(X_{1}\) (economy) and \(X_{8}\) (easy handling) are separate, but note the different scales on the horizontal and vertical axes! Although there are two or three sets of variables in the plot, most of the variance is already explained by the first factor, the “price+security” factor.

2.3 The Principal Component Method

The principal factor method involves finding an approximation \(\tilde{\Psi}\) of Ψ, the matrix of specific variances, and then correcting \({\mathcal{R}}\), the correlation matrix of X, by \(\tilde{\Psi}\). The principal component method starts with an approximation \(\hat{{\mathcal{Q}}}\) of \({\mathcal{Q}}\), the factor loadings matrix. The sample covariance matrix is diagonalized, \({\mathcal{S}}=\Gamma \Lambda \Gamma^{\top}\). Then the first k eigenvectors are retained to build

$$ \hat{{\mathcal{Q}}}= [\sqrt{\lambda_{1}} \gamma_{1}, \ldots, \sqrt{\lambda_{k}} \gamma_{k}].$$
(11.16)

The estimated specific variances are provided by the diagonal elements of the matrix \({\mathcal{S}}-\hat{{\mathcal{Q}}}\hat{{\mathcal{Q}}}^{\top}\),

$$ \hat{\Psi}= \left(\begin{array}{c@{\quad}c@{\quad}c@{\quad}c}\hat{\psi}_{11} & & & 0\\ & \hat{\psi}_{22} & & \\ & & \ddots & \\0 & & & \hat{\psi}_{pp}\end{array} \right)\quad \mbox{with}\ \hat{\psi}_{jj}=s_{X_{j}X_{j}}-\sum^{k}_{\ell=1} \hat{q}^{2}_{j\ell}.$$
(11.17)

By definition, the diagonal elements of \({\mathcal{S}}\) are equal to the diagonal elements of \(\hat{{\mathcal{Q}}} \hat{{\mathcal{Q}}}^{\top}+ \hat{\Psi}\). The off-diagonal elements, however, are not necessarily reproduced exactly. How good then is this approximation? Consider the residual matrix

$${\mathcal{S}}-(\hat{{\mathcal{Q}}}\hat{{\mathcal{Q}}}^\top+\hat{\Psi})$$

resulting from the principal component solution. Analytically we have that

$$\sum_{i,j}({\mathcal{S}}-\hat{{\mathcal{Q}}}\hat{{\mathcal{Q}}}^\top-\hat{\Psi})^{2}_{ij}\leq \lambda^{2}_{k+1}+\cdots+\lambda^{2}_{p}.$$

This implies that a small value of the neglected eigenvalues can result in a small approximation error. A heuristic device for selecting the number of factors is to consider the proportion of the total sample variance due to the j-th factor. This quantity is in general equal to

  (A) \(\lambda_{j}/ \sum^{p}_{j=1}s_{jj}\) for a factor analysis of \({\mathcal{S}}\),

  (B) \(\lambda_{j}/p\) for a factor analysis of \({\mathcal{R}}\).
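A minimal sketch of the principal component method in (11.16) and (11.17), assuming a sample covariance matrix S is available:

```python
import numpy as np

def principal_component_method(S, k):
    """Principal component estimates of the loadings (11.16) and Psi (11.17)."""
    lam, gam = np.linalg.eigh(S)
    lam, gam = lam[::-1], gam[:, ::-1]        # largest eigenvalues first
    Q = gam[:, :k] * np.sqrt(lam[:k])         # (11.16)
    Psi = np.diag(np.diag(S - Q @ Q.T))       # (11.17): residual diagonal
    return Q, Psi
```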

Example 11.5

This example uses a consumer-preference study from Johnson and Wichern (1998). Customers were asked to rate several attributes of a new product. The responses were tabulated and the following correlation matrix \({{\mathcal{R}}}\) was constructed:

[The 5×5 correlation matrix \({{\mathcal{R}}}\) of the rated attributes is not reproduced here; see Johnson and Wichern (1998).]

The correlation matrix \({{\mathcal{R}}}\) shows that variables 1 and 3 and variables 2 and 5 are highly correlated. Variable 4 is more correlated with variables 2 and 5 than with variables 1 and 3. Hence, a model with 2 (or 3) factors seems to be reasonable.

The first two eigenvalues λ 1=2.85 and λ 2=1.81 of \({{\mathcal{R}}}\) are the only eigenvalues greater than one. Moreover, k=2 common factors account for a cumulative proportion

$$\frac{\lambda_1 + \lambda_2}{p} = \frac{2.85+1.81}{5} =0.93$$

of the total (standardized) sample variance. Using the principal component method, the estimated factor loadings, communalities, and specific variances, are calculated from formulas (11.16) and (11.17), and the results are given in Table 11.1.

Table 11.1 Estimated factor loadings, communalities, and specific variances

Taking a look at the reproduced matrix \(\widehat{{\mathcal{Q}}} \widehat{{\mathcal{Q}}}^{\top}+ \widehat{\Psi}\), computed from the estimates in Table 11.1, we see that it nearly reproduces the correlation matrix \({{\mathcal{R}}}\). We conclude that the two-factor model provides a good fit of the data. The communalities (0.98,0.88,0.98,0.89,0.93) indicate that the two factors account for a large percentage of the sample variance of each variable. Due to the nonuniqueness of factor loadings, the interpretation might be enhanced by rotation. This is the topic of the next subsection.

2.4 Rotation

The constraints (11.11) and (11.12) are given as a matter of mathematical convenience (to create unique solutions) and can therefore complicate the problem of interpretation. The interpretation of the loadings would be very simple if the variables could be split into disjoint sets, each being associated with one factor. A well known analytical algorithm to rotate the loadings is given by the varimax rotation method proposed by Kaiser (1985). In the simplest case of k=2 factors, a rotation matrix \({\mathcal{G}}\) is given by

$${\mathcal{G}}(\theta)=\left(\begin{array}{r@{\quad}r}\cos{\theta}&\sin{\theta}\\-\sin{\theta}&\cos{\theta}\end{array}\right),$$

representing a clockwise rotation of the coordinate axes by the angle θ. The corresponding rotation of loadings is calculated via \(\hat{{\mathcal{Q}}}^{*} = \hat{{\mathcal{Q}}}{\mathcal{G}}(\theta)\). The idea of the varimax method is to find the angle θ that maximizes the sum of the variances of the squared loadings \(\hat{q}^{*}_{ij}\) within each column of \(\hat{{\mathcal{Q}}}^{*}\). More precisely, defining \(\tilde{q}_{jl}=\hat{q}^{*}_{jl}/ \hat{h}^{*}_{j}\), the varimax criterion chooses θ so that

$${\mathcal{V}}=\frac{1}{p}\sum^{k}_{\ell=1}\Biggl[ \sum^{p}_{j=1}(\tilde{q}^{*}_{jl})^{4}-\Biggl\{\frac{1}{p} \sum^{p}_{j=1}(\tilde{q}^{*}_{jl})^{2} \Biggr\}^2\Biggr]$$

is maximized.
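For k=2 the varimax rotation can be sketched as a simple search over the angle θ; the analytic solution used in practice is equivalent, and the brute-force grid below (with the hypothetical helper name varimax_k2) is only for illustration.

```python
import numpy as np

def varimax_k2(Q, n_grid=3600):
    """Return the rotated loadings Q G(theta) maximizing the varimax criterion V."""
    h = np.sqrt((Q ** 2).sum(axis=1))                   # square roots of the communalities
    best_V, best_Q = -np.inf, Q
    # theta in [0, pi/2) suffices up to sign changes and column swaps
    for theta in np.linspace(0, np.pi / 2, n_grid):
        G = np.array([[np.cos(theta),  np.sin(theta)],
                      [-np.sin(theta), np.cos(theta)]])
        q = (Q @ G) / h[:, None]                        # normalized rotated loadings
        V = np.sum(np.mean(q ** 4, axis=0) - np.mean(q ** 2, axis=0) ** 2)
        if V > best_V:
            best_V, best_Q = V, Q @ G
    return best_Q
```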

Example 11.6

Let us return to the marketing example of Johnson and Wichern (1998) (Example 11.5). The factor loadings of the first and the second factor given in Table 11.1 are of similar magnitude for several variables, making it difficult to interpret the factors. Applying the varimax rotation we obtain the loadings \(\tilde{q}_{1}=(0.02, \mathbf{0.94}, 0.13,\allowbreak \mathbf{0.84}, \mathbf{0.97})^{\top}\) and \(\tilde{q}_{2}=(\mathbf{0.99}, -0.01, \mathbf{0.98}, 0.43, -0.02)^{\top}\). The high loadings, indicated as bold entries, show that variables 2, 4, 5 define factor 1, a nutritional factor. Variables 1 and 3 define factor 2, which might be referred to as a taste factor.


3 Factor Scores and Strategies

Up to now strategies have been presented for factor analysis that have concentrated on the estimation of loadings and communalities and on their interpretations. This was a logical step since the factors F were considered to be normalized random sources of information and were explicitly addressed as nonspecific (common factors). The estimated values of the factors, called the factor scores, may also be useful in the interpretation as well as in the diagnostic analysis. To be more precise, the factor scores are estimates of the unobserved random factors \(F_{\ell}\), \(\ell=1,\ldots,k\), for each individual \(x_{i}\), \(i=1,\ldots,n\). Johnson and Wichern (1998) describe three methods which in practice yield very similar results. Here, we present the regression method which has the advantage of being the simplest technique and is easy to implement.

The idea is to consider the joint distribution of (Xμ) and F, and then to proceed with the regression analysis presented in Chapter 5. Under the factor model (11.4), the joint covariance matrix of (Xμ) and F is:

$$\mathop {\mathsf {Var}}\left({X-\mu \atop F}\right)=\left(\begin{array}{c@{\quad}c}{\mathcal{QQ}}^{\top}+\Psi & {\mathcal{Q}}\\{\mathcal{Q}}^{\top} & {{\mathcal{I}}}_{k}\end{array}\right). $$
(11.18)

Note that the upper left entry of this matrix equals Σ and that the matrix has size (p+k)×(p+k).

Assuming joint normality, the conditional distribution of F|X is multinormal, see Theorem 5.1, with

$$ \mathop {\mathsf {E}}(F|X=x)={{\mathcal{Q}}}^\top\Sigma^{-1}(X-\mu)$$
(11.19)

and using (5.7) the covariance matrix can be calculated:

$$ \mathop {\mathsf {Var}}(F|X=x)={{\mathcal{I}}}_k-{{\mathcal{Q}}}^\top\Sigma^{-1}{{\mathcal{Q}}}.$$
(11.20)

In practice, we replace the unknown \({{\mathcal{Q}}}\), Σ and μ by corresponding estimators, leading to the estimated individual factor scores:

$$ \widehat{f}_i=\widehat{{\mathcal{Q}}}^\top {{\mathcal{S}}}^{-1}(x_i-\overline{x}).$$
(11.21)

We prefer to use the original sample covariance matrix \({{\mathcal{S}}}\) as an estimator of Σ, instead of the factor analysis approximation \(\widehat{{\mathcal{Q}}}\widehat{{\mathcal{Q}}}^{\top}+\widehat{\Psi}\), in order to be more robust against incorrect determination of the number of factors.

The same rule can be followed when using \({{\mathcal{R}}}\) instead of \({{\mathcal{S}}}\). Then (11.18) remains valid when standardized variables, i.e., \({Z}={{\mathcal{D}}}_{\Sigma}^{-1/2} ({X}-\mu)\), are considered if \({{\mathcal{D}}}_{\Sigma}= \mathop {\mathrm {diag}}(\sigma_{11},\ldots,\sigma_{pp})\). In this case the factors are given by

$$ \widehat{f}_i=\widehat{{\mathcal{Q}}}^\top {{\mathcal{R}}}^{-1}(z_i),$$
(11.22)

where \(z_{i}={{\mathcal{D}}}_{S}^{-1/2}(x_{i}-\overline{x})\), \(\widehat{{\mathcal{Q}}}\) is the loading obtained with the matrix \({{\mathcal{R}}}\), and \({{\mathcal{D}}}_{S}=\mathop {\mathrm {diag}}(s_{11},\dots,s_{pp})\).

If the factors are rotated by the orthogonal matrix \({{\mathcal{G}}}\), the factor scores have to be rotated accordingly, that is

$$ \widehat{f}_i^*={{\mathcal{G}}}^\top \widehat{f}_i.$$
(11.23)
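A sketch of the regression scores (11.21) and their rotation (11.23), assuming a data matrix X (n×p), estimated loadings Qhat and, optionally, an orthogonal rotation matrix G are available (the helper name factor_scores is a hypothetical choice). The covariance is computed with the 1/n convention used for \({{\mathcal{S}}}\) in this chapter.

```python
import numpy as np

def factor_scores(X, Qhat, G=None):
    """Regression factor scores (11.21); optionally rotated as in (11.23)."""
    S = np.cov(X, rowvar=False, bias=True)     # empirical covariance (1/n convention)
    xc = X - X.mean(axis=0)                    # centered observations x_i - xbar
    F = xc @ np.linalg.solve(S, Qhat)          # row i equals Qhat^T S^{-1} (x_i - xbar)
    return F if G is None else F @ G           # rotated scores G^T f_i
```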

A practical example is presented in Section 11.4 using the Boston Housing data.

3.1 Practical Suggestions

No one method outperforms another in the practical implementation of factor analysis. However, by applying a tâtonnement process, the factor analysis view of the data can be stabilized. This motivates the following procedure.

  1. Fix a reasonable number of factors, say k=2 or 3, based on the correlation structure of the data and/or the screeplot of eigenvalues.

  2. Perform several of the presented methods, including rotation. Compare the loadings, communalities, and factor scores from the respective results.

  3. If the results show significant deviations, check for outliers (based on factor scores), and consider changing the number of factors k.

For larger data sets, cross-validation methods are recommended. Such methods involve splitting the sample into a training set and a validation data set. On the training sample one estimates the factor model with the desired methodology and uses the obtained parameters to predict the factor scores for the validation data set. The predicted factor scores should be comparable to the factor scores obtained using only the validation data set. This stability criterion may also involve the loadings and communalities.
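A possible sketch of such a stability check, reusing the hypothetical helpers principal_component_method and factor_scores introduced earlier in this chapter (any other estimation method could be plugged in instead):

```python
import numpy as np

def stability_check(X, k, train_frac=0.7, seed=0):
    """Compare validation-set factor scores under training vs. validation loadings."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    X_train, X_val = X[idx[:n_train]], X[idx[n_train:]]

    Q_train, _ = principal_component_method(np.cov(X_train, rowvar=False), k)
    Q_val, _ = principal_component_method(np.cov(X_val, rowvar=False), k)

    scores_pred = factor_scores(X_val, Q_train)   # predicted with training loadings
    scores_val = factor_scores(X_val, Q_val)      # estimated on the validation data alone
    # cross-correlations close to +/-1 on the diagonal indicate a stable factor solution
    return np.corrcoef(scores_pred.T, scores_val.T)[:k, k:]
```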

3.2 Factor Analysis versus PCA

Factor analysis and principal component analysis use the same set of mathematical tools (spectral decomposition, projections, …). One could conclude, on first sight, that they share the same view and strategy and therefore yield very similar results. This is not true. There are substantial differences between these two data analysis techniques that we would like to describe here.

The biggest difference between PCA and factor analysis comes from the model philosophy. Factor analysis imposes a strict structure with a fixed number of common (latent) factors, whereas PCA determines p factors in decreasing order of importance. The most important factor in PCA is the one that maximizes the projected variance. The most important factor in factor analysis is the one that (after rotation) allows the clearest interpretation. Often this is different from the direction of the first principal component.

From an implementation point of view, the PCA is based on a well-defined, unique algorithm (spectral decomposition), whereas fitting a factor analysis model involves a variety of numerical procedures. The non-uniqueness of the factor analysis procedure opens the door for subjective interpretation and yields therefore a spectrum of results. This data analysis philosophy makes factor analysis difficult especially if the model specification involves cross-validation and a data-driven selection of the number of factors.

4 Boston Housing

To illustrate how to implement factor analysis we will use the Boston housing data set and the by now well known set of transformations. Once again, the variable X 4 (Charles River indicator) will be excluded. As before, standardized variables are used and the analysis is based on the correlation matrix.

In Section 11.3, we described a practical implementation of factor analysis. Based on principal components, three factors were chosen and factor analysis was applied using the maximum likelihood method (MLM), the principal factor method (PFM), and the principal component method (PCM). For illustration, the MLM will be presented with and without varimax rotation.

Table 11.2 gives the MLM factor loadings without rotation and Table 11.3 gives the varimax version of this analysis. The corresponding graphical representations of the loadings are displayed in Figures 11.2 and 11.3. We can see that the varimax does not significantly change the interpretation of the factors obtained by the MLM. Factor 1 can be roughly interpreted as a “quality of life factor” because it is positively correlated with variables like X 11 and negatively correlated with X 8, both having low specific variances. The second factor may be interpreted as a “residential factor”, since it is highly correlated with variables X 6, and X 13. The most striking difference between the results with and without varimax rotation can be seen by comparing the lower left corners of Figures 11.2 and 11.3. There is a clear separation of the variables in the varimax version of the MLM. Given this arrangement of the variables in Figure 11.3, we can interpret factor 3 as an employment factor, since we observe high correlations with X 8 and X 5.

Fig. 11.2
figure 2

Factor analysis for Boston housing data, MLM  MVAfacthous

Fig. 11.3
figure 3

Factor analysis for Boston housing data, MLM after varimax rotation  MVAfacthous

Table 11.2 Estimated factor loadings, communalities, and specific variances, MLM   MVAfacthous
Table 11.3 Estimated factor loadings, communalities, and specific variances, MLM, varimax rotation  MVAfacthous

We now turn to the PCM and PFM analyses. The results are presented in Tables 11.4 and 11.5 and in Figures 11.4 and 11.5. We would like to focus on the PCM, because this 3-factor model yields only one specific variance (unexplained variation) above 0.5. Looking at Figure 11.4, it turns out that factor 1 remains a “quality of life factor” which is clearly visible from the clustering of X 5, X 3, X 10 and X 1 on the right-hand side of the graph, while the variables X 8, X 2, X 14, X 12 and X 6 are on the left-hand side. Again, the second factor is a “residential factor”, clearly demonstrated by the location of variables X 6, X 14, X 11, and X 13. The interpretation of the third factor is more difficult because all of the loadings (except for X 12) are very small.

Fig. 11.4
figure 4

Factor analysis for Boston housing data, PCM after varimax rotation  MVAfacthous

Fig. 11.5
figure 5

Factor analysis for Boston housing data, PFM after varimax rotation  MVAfacthous

Table 11.4 Estimated factor loadings, communalities, and specific variances, PCM, varimax rotation  MVAfacthous
Table 11.5 Estimated factor loadings, communalities, and specific variances, PFM, varimax rotation  MVAfacthous

5 Exercises

Exercise 11.1

In Example 11.4 we have computed \(\widehat{{\mathcal{Q}}}\) and \(\widehat{\Psi}\) using the method of principal factors. We used a two-step iteration for \(\widehat{\Psi}\). Perform the third iteration step and compare the results (i.e., use the given \(\widehat{{\mathcal{Q}}}\) as a pre-estimate to find the final Ψ).

Exercise 11.2

Using the bank data set, how many factors can you find with the Method of Principal Factors?

Exercise 11.3

Repeat Exercise 11.2 with the U.S. company data set!

Exercise 11.4

Generalize the two-dimensional rotation matrix in Section 11.2 to n-dimensional space.

Exercise 11.5

Compute the orthogonal factor model for

$$\Sigma=\left( \begin{array}{c@{\quad}c@{\quad}c}1 & 0.9 & 0.7\\ 0.9 & 1 & 0.4\\ 0.7 & 0.4 & 1\end{array} \right).$$

[Solution: ψ 11=−0.575,q 11=1.255]

Exercise 11.6

Perform a factor analysis on the type of families in the French food data set. Rotate the resulting factors in a way which provides the most reasonable interpretation. Compare your result with the varimax method.

Exercise 11.7

Perform a factor analysis on the variables X 3 to X 9 in the U.S. crime data set (Table B.10). Would it make sense to use all of the variables for the analysis?

Exercise 11.8

Analyze the athletic records data set (Table B.18). Can you recognize any patterns if you sort the countries according to the estimates of the factor scores?

Exercise 11.9

Perform a factor analysis on the U.S. health data set (Table B.16) and estimate the factor scores.

Exercise 11.10

Redo Exercise 11.9 using the U.S. crime data in Table B.10. Compare the estimated factor scores of the two data sets.

Exercise 11.11

Analyze the vocabulary data given in Table B.17.