
Complex multivariate data structures are better understood by studying low-dimensional projections. For a joint study of two data sets, we may ask what type of low-dimensional projection helps in finding possible joint structures for the two samples. Canonical correlation analysis is a standard tool of multivariate statistical analysis for discovering and quantifying associations between two sets of variables.

The technique is based on projections: for each data set one defines an index (a projection of the multivariate variable) such that the two indices are maximally correlated. The aim of canonical correlation analysis is thus to maximize the association (measured by correlation) between the low-dimensional projections of the two data sets. The canonical correlation vectors are found by a joint covariance analysis of the two variables. The technique is applied to a marketing example where the association of a price factor with other variables (such as design, sportiness, etc.) is analysed. Tests for evaluating the significance of the discovered associations are also given.

1 Most Interesting Linear Combination

The associations between two sets of variables may be identified and quantified by canonical correlation analysis. The technique was originally developed by Hotelling (1935) who analyzed how arithmetic speed and arithmetic power are related to reading speed and reading power. Other examples are the relation between governmental policy variables and economic performance variables and the relation between job and company characteristics.

Suppose we are given two random variables \(X\in \mathbb {R}^{q}\) and \(Y\in \mathbb {R}^{p}\). The idea is to find an index describing a (possible) link between X and Y. Canonical correlation analysis (CCA) is based on linear indices, i.e., linear combinations

$$a^{\top}X\quad \mbox{and} \quad b^{\top}Y$$

of the random variables. Canonical correlation analysis searches for vectors a and b such that the relation of the two indices \(a^{\top}X\) and \(b^{\top}Y\) is quantified in some interpretable way. More precisely, one is looking for the “most interesting” projections a and b in the sense that they maximize the correlation

$$\rho (a,b)=\rho_{a^{\top}X\,b^{\top}Y} $$
(15.1)

between the two indices.

Let us consider the correlation ρ(a,b) between the two projections in more detail. Suppose that

$$\left(\begin{array}{c}X\\Y\end{array}\right)\sim \left(\left(\begin{array}{c}\mu\\\nu\end{array}\right),\ \left(\begin{array}{c@{\quad}c}\Sigma _{XX}&\Sigma _{XY}\\\Sigma _{YX}&\Sigma _{YY}\end{array}\right)\right),$$

where the sub-matrices of this covariance structure are given by

$$\mathop {\mathsf {Var}}(X)=\Sigma _{XX}\ (q\times q),\qquad \mathop {\mathsf {Var}}(Y)=\Sigma _{YY}\ (p\times p),\qquad \mathop {\mathsf {Cov}}(X,Y)=\Sigma _{XY}=\Sigma ^{\top}_{YX}\ (q\times p).$$

Using (3.7) and (4.26),

$$\rho(a,b) =\frac{a^{\top}\Sigma _{XY}b }{(a^{\top}\Sigma _{XX}a)^{1/2}\;(b^{\top}\Sigma _{YY}b)^{1/2}}.$$
(15.2)

Therefore, ρ(ca,b)=ρ(a,b) for any \(c \in \mathbb {R}^{+}\). Given the invariance of scale we may rescale projections a and b and thus we can equally solve

$$\max\limits_{a,b}\; a^{\top}\Sigma _{XY}b$$

under the constraints

$$a^{\top}\Sigma _{XX}a=1,\qquad b^{\top}\Sigma _{YY}b=1.$$

For this problem, define

$$ {\mathcal{K}}=\Sigma ^{-1/2}_{XX}\Sigma _{XY}\Sigma ^{-1/2}_{YY}.$$
(15.3)

Recall the singular value decomposition of \({\mathcal{K}}(q\times p)\) from Theorem 2.2. The matrix \({\mathcal{K}}\) may be decomposed as

$${\mathcal{K}}=\Gamma \Lambda \Delta^{\top}$$

with

$$\Gamma =(\gamma _1,\ldots ,\gamma _k),\qquad \Lambda =\mathop {\mathrm {diag}}\bigl(\lambda ^{1/2}_1,\ldots ,\lambda ^{1/2}_k\bigr),\qquad \Delta =(\delta _1,\ldots ,\delta _k),$$
(15.4)

where by (15.3) and (2.15),

$$k= \mathop {\mathrm {rank}}({\mathcal{K}}) = \mathop {\mathrm {rank}}(\Sigma _{XY})= \mathop {\mathrm {rank}}(\Sigma _{YX}),$$

and \(\lambda_1\geq \lambda_2\geq \cdots \geq \lambda_k\) are the nonzero eigenvalues of \({\mathcal{N}}_{1}={\mathcal{K}}{\mathcal{K}}^{\top}\) and \({\mathcal{N}}_{2}={\mathcal{K}}^{\top} {\mathcal{K}}\), and \(\gamma_i\) and \(\delta_j\) are the standardized eigenvectors of \({\mathcal{N}}_{1}\) and \({\mathcal{N}}_{2}\), respectively.

Define now for i=1,…,k the vectors

$$a_i=\Sigma ^{-1/2}_{XX}\gamma _i,$$
(15.5)

$$b_i=\Sigma ^{-1/2}_{YY}\delta _i,$$
(15.6)

which are called the canonical correlation vectors. Using these canonical correlation vectors we define the canonical correlation variables

$$\eta _i=a^{\top}_iX,$$
(15.7)

$$\varphi _i=b^{\top}_iY.$$
(15.8)

The quantities \(\rho_{i}=\lambda_{i}^{1/2}\) for i=1,…,k are called the canonical correlation coefficients.
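The construction in (15.3)–(15.8) translates directly into a few lines of linear algebra. The following sketch (Python with NumPy; not part of the original text, and the function names are ours) computes the canonical correlation vectors and coefficients from given covariance blocks.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca_from_cov(S_xx, S_xy, S_yy):
    """Canonical correlation vectors and coefficients from covariance blocks,
    via the SVD of K = S_xx^{-1/2} S_xy S_yy^{-1/2} as in (15.3)-(15.8)."""
    K = inv_sqrt(S_xx) @ S_xy @ inv_sqrt(S_yy)          # the matrix K of (15.3)
    G, rho, Dt = np.linalg.svd(K, full_matrices=False)   # K = Gamma Lambda Delta^T
    A = inv_sqrt(S_xx) @ G        # columns a_i = Sigma_XX^{-1/2} gamma_i   (15.5)
    B = inv_sqrt(S_yy) @ Dt.T     # columns b_i = Sigma_YY^{-1/2} delta_i   (15.6)
    return A, B, rho              # rho_i = lambda_i^{1/2}, the coefficients
```

The singular values of \({\mathcal{K}}\) returned here are exactly the canonical correlation coefficients \(\rho_i=\lambda_i^{1/2}\), since the \(\lambda_i\) are the eigenvalues of \({\mathcal{K}}{\mathcal{K}}^{\top}\).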

From the properties of the singular value decomposition given in (15.4) we have

$$\mathop {\mathsf {Cov}}(\eta _i,\eta _j)=a^{\top}_i\Sigma _{XX}a_j=\gamma_{i}^{\top}\gamma_{j}= \left\{ \begin{array}{c@{\quad}c} 1 & i=j,\\0& i\neq j. \end{array} \right . $$
(15.9)

The same is true for \(\mathop {\mathsf {Cov}}(\varphi _{i},\varphi _{j})\). The following theorem tells us that the canonical correlation vectors are the solution to the maximization problem of (15.1).

Theorem 15.1

For any given r, \(1\leq r\leq k\), the maximum

$$ C(r)=\max_{a,b} a^{\top}\Sigma _{XY}b$$
(15.10)

subject to

$$a^{\top}\Sigma _{XX}a=1,\qquad b^{\top}\Sigma _{YY}b=1$$

and

$$a_i^{\top}\Sigma _{XX}a=0\quad \mbox{\textit{for}}\ i=1,\ldots ,r-1$$

is given by

$$C(r)=\rho_r=\lambda_r^{1/2}$$

and is attained when \(a=a_r\) and \(b=b_r\).

Proof

The proof is given in three steps.

(i) Fix a and maximize over b, i.e., solve:

$$\max_b (a^{\top} \Sigma_{XY} b)^2 =\max_b (b^{\top} \Sigma_{YX} a)(a^{\top} \Sigma_{XY} b)$$

subject to \(b^{\top}\Sigma _{YY}b=1\). By Theorem 2.5 the maximum is given by the largest eigenvalue of the matrix

$$\Sigma_{YY}^{-1}\Sigma_{YX} a a^{\top} \Sigma_{XY}.$$

By Corollary 2.2, the only nonzero eigenvalue equals

$$ a^{\top}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}a.$$
(15.11)

(ii) Maximize (15.11) over a subject to the constraints of the theorem. Put \(\gamma=\Sigma_{XX}^{1/2}a\) and observe that (15.11) equals

$$\gamma^{\top}\Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1/2}\gamma =\gamma^{\top} {\mathcal{K}}^{\top} {\mathcal{K}} \gamma.$$

Thus, solve the equivalent problem

$$ \max_\gamma \gamma^{\top}{ {\mathcal{N}}_1}\gamma $$
(15.12)

subject to \(\gamma ^{\top}\gamma =1\), \(\gamma_{i}^{\top}\gamma=0\) for i=1,…,r−1.

Note that the γ i ’s are the eigenvectors of \({\mathcal{N}}_{1}\) corresponding to its first r−1 largest eigenvalues. Thus, as in Theorem 10.3, the maximum in (15.12) is obtained by setting γ equal to the eigenvector corresponding to the r-th largest eigenvalue, i.e., γ=γ r or equivalently a=a r . This yields

$$C^2(r)=\gamma_r^{\top} {\mathcal{N}}_1 \gamma_r = \lambda_r \gamma_r^{\top} \gamma_r = \lambda_r.$$

(iii) Show that the maximum is attained for a=a r and b=b r . From the SVD of \({{\mathcal{K}}}\) we conclude that \({{\mathcal{K}}}\delta_{r}=\rho_{r}\gamma_{r}\) and hence

$$a_r^{\top}\Sigma_{XY}b_r=\gamma_r^{\top}{ {\mathcal{K}}}\delta_r=\rho_r\gamma_r^{\top}\gamma_r=\rho_r.$$

 □

Let \({\mathcal{K}}=\Sigma ^{-1/2}_{XX}\Sigma _{XY}\Sigma ^{-1/2}_{YY}\). The canonical correlation vectors

$$a_1=\Sigma ^{-1/2}_{XX}\gamma _1,\qquad b_1=\Sigma ^{-1/2}_{YY}\delta _1$$

maximize the correlation between the canonical variables

$$\eta _1=a_1^{\top}X,\qquad \varphi _1=b_1^{\top}Y.$$
The covariance of the canonical variables η and φ is given in the next theorem.

Theorem 15.2

Let \(\eta_i\) and \(\varphi_i\) be the i-th canonical correlation variables (i=1,…,k). Define \(\eta =(\eta _1,\ldots ,\eta _k)^{\top}\) and \(\varphi =(\varphi _1,\ldots ,\varphi _k)^{\top}\). Then

$$\mathop {\mathsf {Cov}}\left(\begin{array}{c}\eta \\\varphi \end{array}\right)=\left(\begin{array}{c@{\quad}c}{\mathcal{I}}_k&\Lambda \\\Lambda &{\mathcal{I}}_k\end{array}\right)$$

with Λ given in (15.4).

This theorem shows that the canonical correlation coefficients, \(\rho_{i}=\lambda_{i}^{1/2}\), are the covariances between the canonical variables η i and φ i and that the indices \(\eta_{1}=a_{1}^{\top}X\) and \(\varphi_{1}=b_{1}^{\top}Y\) have the maximum covariance \(\sqrt{\lambda_{1}}=\rho_{1}\).

The following theorem shows that canonical correlations are invariant w.r.t. linear transformations of the original variables.

Theorem 15.3

Let \(X^{*}= {\mathcal{U}}^{\top}X+u\) and \(Y^{*}={\mathcal{V}}^{\top}Y+v\) where \({\mathcal{U}}\) and \({\mathcal{V}}\) are nonsingular matrices. Then the canonical correlations between \(X^{*}\) and \(Y^{*}\) are the same as those between X and Y. The canonical correlation vectors of \(X^{*}\) and \(Y^{*}\) are given by

$$\everymath{\displaystyle}\begin{array}{rcl}a_i^*&=&{\mathcal{U}}^{-1}a_i, \\[6pt]b_i^*&=&{\mathcal{V}}^{-1}b_i.\end{array}$$
(15.13)
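As a quick numerical illustration of Theorem 15.3 (a sketch only, reusing the hypothetical cca_from_cov helper defined above and simulated data), one can check that the canonical correlation coefficients do not change under nonsingular affine transformations of X and Y:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, p = 500, 2, 3

# simulate correlated X (n x q) and Y (n x p)
X = rng.standard_normal((n, q))
Y = X @ rng.standard_normal((q, p)) + rng.standard_normal((n, p))

def cov_blocks(X, Y):
    """Sample covariance blocks S_XX, S_XY, S_YY of the joint data matrix."""
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    qx = X.shape[1]
    return S[:qx, :qx], S[:qx, qx:], S[qx:, qx:]

_, _, rho = cca_from_cov(*cov_blocks(X, Y))

# nonsingular affine transformations: each row becomes U^T x + u and V^T y + v
U, V = rng.standard_normal((q, q)), rng.standard_normal((p, p))
Xs, Ys = X @ U + 1.0, Y @ V - 2.0
_, _, rho_star = cca_from_cov(*cov_blocks(Xs, Ys))

print(np.allclose(rho, rho_star))   # True: the coefficients are invariant
```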

2 Canonical Correlation in Practice

In practice we have to estimate the covariance matrices \(\Sigma_{XX}\), \(\Sigma_{XY}\) and \(\Sigma_{YY}\). Let us apply canonical correlation analysis to the car marks data (see Table B.7). In the context of this data set one is interested in relating price variables with variables such as sportiness, safety, etc. In particular, we would like to investigate the relation between the two variables price and value stability (non-depreciation of value) and all other variables.

Example 15.1

We perform the canonical correlation analysis on the data matrices \({\mathcal{X}}\) and \({\mathcal{Y}}\) that correspond to the set of values {Price, Value Stability} and {Economy, Service, Design, Sporty car, Safety, Easy handling}, respectively. The estimated covariance matrix \({\mathcal{S}}\) is given by

Hence,

It is interesting to see that value stability and price have a negative covariance. This makes sense since highly priced vehicles tend to lose their market value at a faster pace than medium priced vehicles.

Now we estimate \({\mathcal{K}} = \Sigma ^{-1/2}_{XX}\,\Sigma_{XY}\,\Sigma ^{-1/2}_{YY}\) by

$$\widehat{{\mathcal{K}}} = {\mathcal{S}}^{-1/2}_{XX}\;{\mathcal{S}}_{XY}\;{\mathcal{S}}^{-1/2}_{YY}$$

and perform a singular value decomposition of \(\widehat{{\mathcal{K}}}\):

$$\widehat{{\mathcal{K}}}={\mathcal{G}} {\mathcal{L}}{\mathcal{D}}^{\top}= (g_{1},g_{2})\,\mathop {\mathrm {diag}}(\ell^{1/2}_1,\ell^{1/2}_2)\,(d_{1},d_{2})^{\top}$$

where the \(\ell_i\)'s are the eigenvalues of \(\widehat{{\mathcal{K}}}\widehat{{\mathcal{K}}}^{\top}\) and \(\widehat{{\mathcal{K}}}^{\top}\widehat{{\mathcal{K}}}\) with \(\mathop {\mathrm {rank}}(\widehat{{\mathcal{K}}})=2\), and \(g_i\) and \(d_i\) are the eigenvectors of \(\widehat{{\mathcal{K}}}\widehat{{\mathcal{K}}}^{\top}\) and \(\widehat{{\mathcal{K}}}^{\top}\widehat{{\mathcal{K}}}\), respectively. The canonical correlation coefficients are

$$r_1=\ell^{1/2}_1=0.98,\qquad r_2=\ell^{1/2}_2=0.89.$$

The high correlation of the first two canonical variables can be seen in Figure 15.1. The first canonical variables are

Note that the variables y 1 (economy), y 2 (service) and y 6 (easy handling) have positive coefficients on \(\widehat{\varphi}_{1}\). The variables y 3 (design), y 4 (sporty car) and y 5 (safety) have a negative influence on \(\widehat{\varphi}_{1}\).

Fig. 15.1 The first canonical variables for the car marks data  MVAcancarm

The canonical variable \(\eta_1\) may be interpreted as a price and value index. The canonical variable \(\varphi_1\) is mainly formed from the qualitative variables economy, service and handling, with negative weights on design, safety and sportiness. It may therefore be interpreted as an appreciation of the value of the car. Sportiness has a negative effect on the price and value index, as do the design and the safety features.
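The computations of this example can be reproduced along the following lines. The sketch below uses simulated placeholder data standing in for Table B.7 (swap in the actual car marks data to reproduce the numbers above) and assumes the cca_from_cov helper sketched in Section 1 is available.

```python
import numpy as np

# Placeholder data standing in for Table B.7 (n=40 cars, 2 + 6 marks);
# replace X and Y with the actual car marks data to reproduce Example 15.1.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 2))                        # price, value stability
Y = X @ rng.standard_normal((2, 6)) + 0.3 * rng.standard_normal((40, 6))

S = np.cov(np.hstack([X, Y]), rowvar=False)
S_xx, S_xy, S_yy = S[:2, :2], S[:2, 2:], S[2:, 2:]
A, B, r = cca_from_cov(S_xx, S_xy, S_yy)   # with the real data, r is about (0.98, 0.89)

eta1 = X @ A[:, 0]     # scores on the first canonical variable eta_1 = a_1^T x
phi1 = Y @ B[:, 0]     # scores on the first canonical variable phi_1 = b_1^T y
# a scatter plot of eta1 against phi1 corresponds to Figure 15.1
```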

2.1 Testing the Canonical Correlation Coefficients

The hypothesis that the two sets of variables \({\mathcal{X}}\) and \({\mathcal{Y}}\) are uncorrelated may be tested (under normality assumptions) with Wilks' likelihood ratio statistic (Gibbins 1985):

$$T^{2/n}=| {\mathcal{I}} - S_{YY}^{-1}S_{YX}S_{XX}^{-1}S_{XY}|=\prod_{i=1}^k(1-\ell_i).$$

This statistic unfortunately has a rather complicated distribution. Bartlett (1939) provides an approximation for large n:

$$ -\{n-(p+q+3)/2\}\log\prod_{i=1}^k(1-\ell_i)\sim \chi^2_{pq}.$$
(15.14)

A test of the hypothesis that only s of the canonical correlation coefficients are non-zero may be based (asymptotically) on the statistic

$$ -\{n-(p+q+3)/2\}\log\prod_{i=s+1}^k(1-\ell_i)\sim \chi^2_{(p-s)(q-s)}.$$
(15.15)
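Both test statistics are easy to evaluate once the squared canonical correlations \(\ell_i\) are available. The following sketch (Python with SciPy for the χ² tail probabilities; the function name is ours) computes the Bartlett approximations (15.14) and (15.15).

```python
import numpy as np
from scipy.stats import chi2

def bartlett_test(ell, n, p, q, s=0):
    """Bartlett's approximate chi^2 statistic for H0: only the first s
    canonical correlations are non-zero (s=0 tests for no correlation at all)."""
    ell = np.asarray(ell, dtype=float)
    stat = -(n - (p + q + 3) / 2) * np.log(np.prod(1 - ell[s:]))
    df = (p - s) * (q - s)
    return stat, df, chi2.sf(stat, df)          # statistic, dof, p-value

# numbers from Example 15.2 below: n=40, p=2, q=6, r_1=0.98, r_2=0.89
ell = np.array([0.98, 0.89]) ** 2
print(bartlett_test(ell, n=40, p=2, q=6, s=0))  # approx (165.6, 12, ...)
print(bartlett_test(ell, n=40, p=2, q=6, s=1))  # approx (54.2, 5, ...)
```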

Example 15.2

Consider Example 15.1 again. There are n=40 persons who rated the cars according to different categories with p=2 and q=6. The canonical correlation coefficients were found to be \(r_1=0.98\) and \(r_2=0.89\). Bartlett's statistic (15.14) is therefore

$$-\{40-(2+6+3)/2\}\log\{(1-0.98^2)(1-0.89^2)\}=165.59\sim\chi_{12}^2$$

which is highly significant (the 99% quantile of the \(\chi_{12}^{2}\) is 26.23). The hypothesis of no correlation between the variables \({\mathcal{X}}\) and \({\mathcal{Y}}\) is therefore rejected.

Let us now test whether the second canonical correlation coefficient is different from zero. We use Bartlett’s statistic (15.15) with s=1 and obtain

$$-\{40-(2+6+3)/2\}\log\{(1-0.89^2)\}=54.19\sim\chi_5^2$$

which is again highly significant with the \(\chi_{5}^{2}\) distribution.

2.2 Canonical Correlation Analysis with Qualitative Data

The canonical correlation technique may also be applied to qualitative data. Consider for example the contingency table \({\mathcal{N}}\) of the French baccalauréat data. The data set is given in Table B.8 in Appendix B.8. CCA cannot be applied directly to this contingency table since the table does not correspond to the usual data matrix structure. We may wish, however, to explain the relationship between the r row and c column categories. It is possible to represent the data in an (n×(r+c)) data matrix \({\mathcal{Z}}=({\mathcal{X}},{\mathcal{Y}})\) where n is the total number of frequencies in the contingency table \({\mathcal{N}}\) and \({\mathcal{X}}\) and \({\mathcal{Y}}\) are matrices of zero-one dummy variables. More precisely, let

$$x_{ki}=\left\{\begin{array}{l@{\quad}l}1&\mbox{if the $k$-th individual belongs to the $i$-th row category}\\0&\mbox{otherwise}\end{array} \right.$$

and

$$y_{kj}=\left\{\begin{array}{l@{\quad}l}1&\mbox{if the $k$-th individual belongs to the $j$-th column category}\\0&\mbox{otherwise}\end{array} \right.$$

where the indices range from k=1,…,n, i=1,…,r and j=1,…,c. Denote the cell frequencies by n ij so that \({\mathcal{N}}=(n_{ij})\) and note that

$$x_{(i)}^{\top} y_{(j)}=n_{ij},$$

where x (i) (y (j)) denotes the i-th (j-th) column of \({\mathcal{X}}\) (\({\mathcal{Y}}\)).

Example 15.3

Consider the following example where

$${\mathcal{N}}=\left(\begin{array}{c@{\quad}c}3&2\\1&4\end{array}\right).$$

The matrices \({\mathcal{X}}\), \({\mathcal{Y}}\) and \({\mathcal{Z}}\) are therefore

$${\mathcal{X}}=\left(\begin{array}{c@{\quad}c}1&0\\1&0\\1&0\\1&0\\1&0\\0&1\\0&1\\0&1\\0&1\\0&1\\\end{array}\right),\qquad {\mathcal{Y}}=\left(\begin{array}{c@{\quad}c}1&0\\1&0\\1&0\\0&1\\0&1\\1&0\\0&1\\0&1\\0&1\\0&1\\\end{array}\right),\qquad {\mathcal{Z}}= ({\mathcal{X}},{\mathcal{Y}})=\left(\begin{array}{c@{\quad}c@{\quad}c@{\quad}c}1&0&1&0\\1&0&1&0\\1&0&1&0\\1&0&0&1\\1&0&0&1\\0&1&1&0\\0&1&0&1\\0&1&0&1\\0&1&0&1\\0&1&0&1\\\end{array}\right).$$

The element n 12 of \({\mathcal{N}}\) may be obtained by multiplying the first column of \({\mathcal{X}}\) with the second column of \({\mathcal{Y}}\) to yield

$$x_{(1)}^{\top}y_{(2)}=2.$$
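The dummy coding of Example 15.3 can be generated mechanically from any contingency table. A minimal sketch (Python with NumPy; the helper name dummy_matrices is ours) builds \({\mathcal{X}}\) and \({\mathcal{Y}}\) and verifies that \({\mathcal{X}}^{\top}{\mathcal{Y}}={\mathcal{N}}\):

```python
import numpy as np

def dummy_matrices(N):
    """Expand a contingency table N (r x c) into zero-one dummy matrices
    X (n x r) and Y (n x c), one row per individual, as described above."""
    N = np.asarray(N, dtype=int)
    rows, cols = [], []
    for i in range(N.shape[0]):
        for j in range(N.shape[1]):
            rows += [i] * N[i, j]          # n_ij individuals fall in cell (i, j)
            cols += [j] * N[i, j]
    n = len(rows)
    X = np.zeros((n, N.shape[0]), dtype=int)
    Y = np.zeros((n, N.shape[1]), dtype=int)
    X[np.arange(n), rows] = 1
    Y[np.arange(n), cols] = 1
    return X, Y

N = np.array([[3, 2], [1, 4]])
X, Y = dummy_matrices(N)
print(np.array_equal(X.T @ Y, N))          # True: x_(i)^T y_(j) = n_ij
```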

The purpose is to find the canonical variables η=a x and φ=b y that are maximally correlated. Note, however, that x has only one non-zero component and therefore an “individual” may be directly associated with its canonical variables or score (a i ,b j ). There will be n ij points at each (a i ,b j ) and the correlation represented by these points may serve as a measure of dependence between the rows and columns of \({\mathcal{N}}\).

Let \({\mathcal{Z}}=({\mathcal{X}},{\mathcal{Y}})\) denote a data matrix constructed from a contingency table \({\mathcal{N}}\). Similar to Chapter 13 define

$$c={\mathcal{X}}^{\top}1_n\quad \mbox{and}\quad d={\mathcal{Y}}^{\top}1_n,$$

the vectors of row and column marginal frequencies of \({\mathcal{N}}\), and define \({\mathcal{C}}=\mathop {\mathrm {diag}}(c)\) and \({\mathcal{D}}=\mathop {\mathrm {diag}}(d)\). Suppose that \(c_i>0\) and \(d_j>0\) for all i and j. It is not hard to see that

$$(n-1)S_{XY}={\mathcal{X}}^{\top}{\mathcal{Y}}-n\bar{x}\bar{y}^{\top}={\mathcal{N}}-\widehat{{\mathcal{N}}},$$

where \(\widehat{{\mathcal{N}}}= cd^{\top}/n\) is the estimated value of \({\mathcal{N}}\) under the assumption of independence of the row and column categories.

Note that

$$(n-1)S_{XX}1_r = {\mathcal{C}}1_r -n^{-1}cc^{\top}1_r= c-c(n^{-1}c^{\top}1_r) = c-c(n^{-1}n)=0$$

and therefore \(S_{XX}^{-1}\) does not exist. The same is true for \(S_{YY}^{-1}\). One way out of this difficulty is to drop one column from both \({\mathcal{X}}\) and \({\mathcal{Y}}\), say the first column. Let \(\bar{c}\) and \(\bar{d}\) denote the vectors obtained by deleting the first component of c and d.

Define \(\bar{{\mathcal{C}}}\), \(\bar{{\mathcal{D}}}\) and \(\bar{S}_{XX}\), \(\bar{S}_{YY}\), \(\bar{S}_{XY}\) accordingly and obtain nonsingular matrices \(\bar{S}_{XX}\) and \(\bar{S}_{YY}\), so that (15.3) exists. The score associated with an individual contained in the first row (column) category of \({\mathcal{N}}\) is 0.
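Putting the pieces together, the scores of the remaining row and column categories can be obtained by dropping the first dummy column on each side and applying the covariance-based procedure from Section 1. This is only an illustrative sketch reusing the hypothetical helpers cca_from_cov and dummy_matrices defined earlier.

```python
import numpy as np

N = np.array([[3, 2], [1, 4]])
X, Y = dummy_matrices(N)                  # zero-one coding as in Example 15.3
Xb, Yb = X[:, 1:], Y[:, 1:]               # drop the first column on each side

S = np.cov(np.hstack([Xb, Yb]), rowvar=False)
rb = Xb.shape[1]
A, B, rho = cca_from_cov(S[:rb, :rb], S[:rb, rb:], S[rb:, rb:])

print(rho)                                 # canonical correlation(s) of the table
# row scores: 0 for the first category, the entries of A[:, 0] for the others
# column scores: 0 for the first category, the entries of B[:, 0] for the others
```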

The technique described here for purely qualitative data may also be used when the data is a mixture of qualitative and quantitative characteristics. One has to “blow up” the data matrix by dummy zero-one values for the qualitative data variables.


3 Exercises

Exercise 15.1

Show that the eigenvalues of \({\mathcal{K}}{\mathcal{K}}^{\top}\) and \({\mathcal{K}}^{\top} {\mathcal{K}}\) are identical. (Hint: Use Theorem 2.6.)

Exercise 15.2

Perform the canonical correlation analysis for the following subsets of variables: \({\mathcal{X}}\) corresponding to {price} and \({\mathcal{Y}}\) corresponding to {economy, easy handling} from the car marks data (Table B.7).

Exercise 15.3

Calculate the second canonical variables for Example 15.1. Interpret the coefficients.

Exercise 15.4

Use the SVD of matrix \({\mathcal{K}}\) to show that the canonical variables η 1 and η 2 are not correlated.

Exercise 15.5

Verify that the number of nonzero eigenvalues of matrix \({\mathcal{K}}\) is equal to \(\mathop {\mathrm {rank}}(\Sigma_{XY})\).

Exercise 15.6

Express the singular value decomposition of matrices \({\mathcal{K}}\) and \({\mathcal{K}}^{\top}\) using eigenvalues and eigenvectors of matrices \({{\mathcal{K}}}^{\top} {\mathcal{K}}\) and \({\mathcal{K}}{{\mathcal{K}}}^{\top}\).

Exercise 15.7

What will be the result of CCA for Y=X?

Exercise 15.8

What will be the results of CCA for Y=2X and for Y=−X?

Exercise 15.9

What results do you expect if you perform CCA for X and Y such that Σ XY =0? What if \(\Sigma_{XY}={{\mathcal{I}}}_{p}\)?