1 Introduction

Herman (H.O.A.) Wold (1908–1992) developed partial least squares (PLS) in a series of papers, published as well as privately circulated. The seminal published papers are Wold (1966, 1975, 1982). A key characteristic of PLS is the determination of composites, linear combinations of observables, by weights that are fixed points of sequences of alternating least squares programs, called “modes.” Wold distinguished three types of modes (not models!): mode A, reminiscent of principal component analysis, mode B, related to canonical variables analysis, and mode C, which mixes the former two. In a sense PLS is an extension of canonical variables and principal components analyses. While Wold designed the algorithms, great strides were made in the estimation, testing, and analysis of structured covariance matrices, as induced by linear structural equations in terms of latent factors and indicators (LISREL first, then EQS et cetera). Latent factor modeling became the dominant backdrop against which Wold designed his tools. One model in particular, the “basic design,” became the model of choice in calibrating PLS. Here each latent factor is measured indirectly by a unique set of indicators, with all measurement errors usually assumed to be mutually uncorrelated. The composites combine the indicators for each latent factor separately, and their relationships are estimated by regressions.Footnote 1 The basic design embodies Wold’s “fundamental principle of soft modeling”: all information between the blocks of observables is assumed to be conveyed by latent variables (Wold 1982).Footnote 2 However, in this model PLS is not well-calibratedFootnote 3: when applied to the true covariance matrix it yields by necessity approximations; see, e.g., Dijkstra (1981, 1983, 2010, 2014). For consistency, meaning that the probability limit of the estimators equals the theoretical value, Wold also requires the number of indicators to increase alongside the number of observations (consistency-at-large).

In this chapter we leave the realm of the unobservables, and build a model in terms of manifest variables that satisfies the fundamental principle of soft modeling, adjusted to read: all information between the blocks is conveyed solely by the composites. For this model, mode B is the perfect match, in the sense that estimation via mode B is the natural thing to do: when applied to the true covariance matrix it yields the underlying parameter values, not approximations that require corrections. A latent factor model, in contrast, would need additional structure (like uncorrelated measurement errors) and fitting it would produce approximations.

The chapter is structured as follows. The next section, Sect. 4.2, outlines the new model. We specify, for a vector y of observable variables (“indicators”), a structural model that generates, via linear composites of separate blocks of indicators, all the standard basic design rank restrictions on the covariance matrix, without invoking the existence of unobservable latent factors. These composites are linked to each other by a “structural,” “simultaneous,” or “interdependent” equations system that, together with the loadings, fully captures the (linear) relationships between the blocks of indicators.

Section 4.3 is devoted to estimation issues. We describe a step-wise procedure: first the weights defining the composites via generalized canonical variables,Footnote 4 then their correlations and the loadings in the simplest possible way, and finally the parameters of the simultaneous equations system using the econometric methods 2SLS or 3SLS. The estimation proceeds essentially in a non-iterative fashion (even when we use one of PLS’ traditional algorithms, it will be very fast), making it eminently suitable for bootstrap analyses. We give the results of a Monte Carlo simulation for a model with 18 indicators; they are generated by six composites linked to each other via two linear equations, which are not regressions. We also show that mode A, when applied to the true covariance matrix of the indicators, can only yield the correct results when the composites are certain principal components. As in PLSc, mode A can be adjusted to produce the right results (in the limit).

Section 4.4 suggests how to test various aspects of the model, via tests of the rank constraints, via prediction/cross-validation, and via global goodness-of-fit tests.

Section 4.5 contains some final observations and comments. We briefly return to the “latent factors versus composites” issue and point out that in a latent factor model the factors cannot be fully replaced by linear composites, no matter how we choose them: either the regression of the indicators on the composites will not yield the loadings on the factors, or the composites cannot satisfy the same equations that the factors satisfy (possibly both).

The Appendix contains a proof for a statement needed in Sect. 4.3.

2 The Model: Composites as Factors

Our point of departure is a random vectorFootnote 5 y of “indicators” that can be partitioned into N subvectors, “blocks” in PLS parlance, as \(\mathbf{y} = \left (\mathbf{y}_{1};\mathbf{y}_{2};\mathbf{y}_{3};\ldots;\mathbf{y}_{N}\right )\). Here the semi-colon stacks the subvectors one underneath the other, as in MATLAB; y i is of order p i × 1 with p i usually larger than one. So y is of dimension p × 1 with \(p:=\mathop{ \sum }_{i=1}^{N}p_{i}\). We will denote cov\(\left (\mathbf{y}\right )\) by \(\boldsymbol{\Sigma }\), and take it to be positive definite (p.d.), so no indicator is redundant. We will let \(\boldsymbol{\Sigma }_{ii}:=\) cov\(\left (\mathbf{y}_{i}\right )\). \(\boldsymbol{\Sigma }_{ii}\) is of order p i × p i and it is of course p.d. as well. It is not necessary to have other constraints on \(\boldsymbol{\Sigma }_{ii},\) in particular it need not have a one-factor structure. Each block y i is condensed into a composite, a scalar c i , by means of a conformable weight vector w i : \(c_{i}:= \mathbf{w}_{i}^{\intercal }\mathbf{y}_{i}\). The composites will be normalized to have variance one: var\(\left (c_{i}\right ) = \mathbf{w}_{i}^{\intercal }\boldsymbol{\Sigma }_{ii}\mathbf{w}_{i} = 1\). The vector of composites \(\mathbf{c:=}\left (c_{1};c_{2};c_{3};\ldots;c_{N}\right )\) has a p.d. covariance/correlation matrix denoted by \(\mathbf{R}_{c} = \left (r_{ij}\right )\) with r ij = \(\mathbf{w}_{i}^{\intercal }\boldsymbol{\Sigma }_{ij}\mathbf{w}_{j}\) where \(\boldsymbol{\Sigma }_{ij}:=\) E\(\left (\mathbf{y}_{i} -\text{E}\mathbf{y}_{i}\right )\left (\mathbf{y}_{j} -\text{E}\mathbf{y}_{j}\right )^{\intercal }\). A regression of y i on c i and a constant gives a loading vector L i of order p i × 1:

$$\displaystyle{ \mathbf{L}_{i}:= \text{E}\left (\mathbf{y}_{i} -\text{E}\mathbf{y}_{i}\right ) \cdot \left (c_{i} -\text{E}c_{i}\right ) = \text{E}\left (\mathbf{y}_{i} -\text{E}\mathbf{y}_{i}\right )\left (\mathbf{y}_{i} -\text{E}\mathbf{y}_{i}\right )^{\intercal }\mathbf{w}_{ i} =\boldsymbol{ \Sigma }_{ii}\mathbf{w}_{i} }$$
(4.1)
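
To fix ideas, here is a minimal numerical sketch (in Python/NumPy, not used in the chapter itself) of these definitions: a raw weight vector is rescaled so that the composite has unit variance, and the loadings follow from (4.1). The numbers are purely illustrative.

```python
import numpy as np

Sigma_ii = np.array([[1.00, 0.49, 0.49],
                     [0.49, 1.00, 0.49],
                     [0.49, 0.49, 1.00]])     # covariance of block i (p.d.)
w_raw = np.array([1.0, 2.0, 3.0])             # arbitrary raw weights

w_i = w_raw / np.sqrt(w_raw @ Sigma_ii @ w_raw)   # now var(w_i' y_i) = 1
L_i = Sigma_ii @ w_i                              # loadings, eq. (4.1)

print(w_i @ Sigma_ii @ w_i)   # 1.0, the normalization of the composite
print(L_i)                    # regression coefficients of y_i on c_i (and a constant)
```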

So far all we have is a list of definitions but as yet no real model: there are no constraints on the joint distribution of y apart from the existence of momentsFootnote 6 and a p.d. covariance matrix. We will now impose our version of Wold’s fundamental principle in soft modeling:

all information between the blocks is conveyed solely by the composites

We deviate from Wold’s original formulation in an essential way: whereas Wold postulated that all information is conveyed by unobserved, even unobservable, latent variables, we let the information to be fully transmitted by indices, by composites of observable indicators. So we postulate the existence of weight vectors such that for any two different blocks y i and y j

$$\displaystyle\begin{array}{rcl} \boldsymbol{\Sigma }_{ij}& =& r_{ij}\mathbf{L}_{i}\mathbf{L}_{j}^{\intercal }{}\end{array}$$
(4.2)
$$\displaystyle\begin{array}{rcl} & =& \mathbf{w}_{i}^{\intercal }\boldsymbol{\Sigma }_{ ij}\mathbf{w}_{j} \cdot \boldsymbol{ \Sigma }_{ii}\mathbf{w}_{i} \cdot \left (\boldsymbol{\Sigma }_{jj}\mathbf{w}_{j}\right )^{\intercal } \\ & =& \text{corr}\left (\mathbf{w}_{i}^{\intercal }\mathbf{y}_{ i},\mathbf{w}_{j}^{\intercal }\mathbf{y}_{ j}\right ) \cdot \text{cov}\left (\mathbf{y}_{i},\mathbf{w}_{i}^{\intercal }\mathbf{y}_{ i}\right ) \cdot \left (\text{cov}\left (\mathbf{y}_{j},\mathbf{w}_{j}^{\intercal }\mathbf{y}_{ j}\right )\right )^{\intercal }{}\end{array}$$
(4.3)

The cross-covariances between the blocks are determined by the correlation between their corresponding composites and the loadings of the blocks on those composites. Note that line (4.2) is highly reminiscent of the corresponding equation for the basic design, with latent variables. There it would read \(\rho _{ij}\boldsymbol{\lambda }_{i}\boldsymbol{\lambda }_{j}^{\intercal }\) with ρ ij representing the correlation between the latent variables, with \(\boldsymbol{\lambda }_{i}\) and \(\boldsymbol{\lambda }_{j}\) capturing the loadings. So the rank-one structure of the covariance matrices between the blocks is maintained fully, without requiring the existence of N additional unobservable variables.

We now have:

$$\displaystyle{ \boldsymbol{\Sigma =} \left [\begin{array}{*{10}c} \boldsymbol{\Sigma }_{11} & r_{12}\mathbf{L}_{1}\mathbf{L}_{2}^{\intercal }&r_{13}\mathbf{L}_{1}\mathbf{L}_{3}^{\intercal }& \cdot & r_{1N}\mathbf{L}_{1}\mathbf{L}_{N}^{\intercal } \\ & \boldsymbol{\Sigma }_{22} & r_{23}\mathbf{L}_{2}\mathbf{L}_{3}^{\intercal }& \cdot & r_{2N}\mathbf{L}_{2}\mathbf{L}_{N}^{\intercal } \\ & & \cdot & \cdot & \cdot \\ & & &\boldsymbol{\Sigma }_{ N-1,N-1} & r_{N-1,N}\mathbf{L}_{N-1}\mathbf{L}_{N}^{\intercal } \\ & & & & \boldsymbol{\Sigma }_{NN} \end{array} \right ] }$$
(4.4)

The appendix contains a proof of the fact that \(\boldsymbol{\Sigma }\) is positive definite when and only when the correlation matrix of the composites, R c , is positive definite. Note that in a Monte Carlo analysis we can choose the weight vectors (or loadings) and the values of R c independently.
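
As a small illustration of this point, the following sketch (Python/NumPy; the numbers are our own and purely illustrative) assembles a \(\boldsymbol{\Sigma }\) of the form (4.4) from freely chosen within-block covariances, normalized weights, and a p.d. R c , and verifies that the result is positive definite.

```python
import numpy as np

def assemble_sigma(Sigma_blocks, weights, R_c):
    """Model-implied covariance matrix of the indicators, eq. (4.4)."""
    L = [S @ w for S, w in zip(Sigma_blocks, weights)]   # loadings per block
    N = len(Sigma_blocks)
    return np.block([[Sigma_blocks[i] if i == j
                      else R_c[i, j] * np.outer(L[i], L[j])   # rank-one blocks, eq. (4.2)
                      for j in range(N)] for i in range(N)])

S_ii = np.full((3, 3), 0.49) + 0.51 * np.eye(3)
blocks = [S_ii.copy() for _ in range(3)]
weights = [np.ones(3) / np.sqrt(np.ones(3) @ S_ii @ np.ones(3)) for _ in range(3)]
R_c = np.array([[1.0, 0.5, 0.4],
                [0.5, 1.0, 0.3],
                [0.4, 0.3, 1.0]])

Sigma = assemble_sigma(blocks, weights, R_c)
print(np.all(np.linalg.eigvalsh(Sigma) > 0))   # True: Sigma is p.d. since R_c is p.d.
```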

We can add more structure to the model by imposing constraints on R c . This is done most conveniently by postulating a set of simultaneous equations to be satisfied by c. We will call one subvector of c the exogenous composites, denoted by c exo, and the remaining elements will be collected in c endo, the endogenous composites. There will be conformable matrices B and C with B invertible such that

$$\displaystyle{ \mathbf{Bc}_{\mathrm{endo}} = \mathbf{Cc}_{\mathrm{exo}} + \mathbf{z} }$$
(4.5)

It is customary to normalize B, i.e., all diagonal elements equal one (perhaps after some re-ordering). The residual vector z has a zero mean and is uncorrelated with c exo. In this type of (econometric) model the relationships between the exogenous variables are usually not the main concern. The research focus is on the way they drive the endogenous variables and the interplay or the feedback mechanism between the latter as captured by a matrix B that has nonzero elements both above and below the diagonal. A special case, with no feedback mechanism at all, is the class of recursive models, where B has only zeros on one side of its diagonal, and the elements of z are mutually uncorrelated. Here the coefficients in B and C can be obtained directly by consecutive regressions, given the composites. For general B this is not possible, since c endo is a linear function of z so that z i will typically be correlated with every endogenous variable in the ith equation.Footnote 7

Even when the model is not recursive, the matrices B and C will be postulated to satisfy certain zero constraints (and possibly other types of constraints, but we focus here on the simplest situation). So some B ij ’s and C kl ’s are zero. We will assume that the remaining coefficients are identifiable from a knowledge of the so-called reduced form matrix \(\boldsymbol{\Pi }\)

$$\displaystyle{ \boldsymbol{\Pi:= B}^{-1}\mathbf{C}\text{ } }$$
(4.6)

Note that

$$\displaystyle{ \mathbf{c}_{\mathrm{endo}} =\boldsymbol{ \Pi c}_{\mathrm{exo}} + \mathbf{B}^{-1}\mathbf{z} }$$
(4.7)

so \(\boldsymbol{\Pi }\) is a matrix of regression coefficients. Once we have those, we should be able to retrieve B and C from them. Identifiability is equivalent to the existence of certain rank conditions on \(\boldsymbol{\Pi }\); we will have more to say about them later on. We could have additional constraints on the covariance matrices of c exo and z but we will not develop that here, taking the approach that demands the least in terms of knowledge about the relationships between the composites. It is perhaps good to note that, granted identifiability, the free elements in B and C can be interpreted as regression coefficients, provided we replace the “explanatory” endogenous composites by their regression on the exogenous composites. This is easily seen as follows:

$$\displaystyle{ \mathbf{c}_{\mathrm{endo}} = \left (\mathbf{I - B}\right )\mathbf{c}_{\mathrm{endo}} + \mathbf{Cc}_{\mathrm{exo}} + \mathbf{z} }$$
(4.8)
$$\displaystyle{ = \left (\mathbf{I - B}\right )\left (\boldsymbol{\Pi c}_{\mathrm{exo}} + \mathbf{B}^{-1}\mathbf{z}\right ) + \mathbf{Cc}_{\mathrm{ exo}} + \mathbf{z} }$$
(4.9)
$$\displaystyle{ = \left (\mathbf{I - B}\right )\left (\boldsymbol{\Pi c}_{\mathrm{exo}}\right ) + \mathbf{Cc}_{\mathrm{exo}} + \mathbf{B}^{-1}\mathbf{z} }$$
(4.10)

where B −1 z is uncorrelated with \(\boldsymbol{\Pi c}_{\mathrm{exo}}\) and c exo. So the free elements of \(\left (\mathbf{I - B}\right )\) and C can be obtained by a regression of c endo on \(\boldsymbol{\Pi c}_{\mathrm{exo}}\) and c exo, equation by equation.Footnote 8 Identifiability is here equivalent to invertibility of the covariance matrix of the “explanatory” variables in each equation. A necessary condition for this to work is that we cannot have more coefficients to estimate in each equation than the total number of exogenous composites in the system.

We have for R c

$$\displaystyle{ \mathbf{R}_{c} = \left [\begin{array}{*{10}c} \text{cov}\left (\mathbf{c}_{\mathrm{exo}}\right )& \text{cov}\left (\mathbf{c}_{\mathrm{exo}}\right ) \cdot \boldsymbol{ \Pi }^{\intercal } \\ &\boldsymbol{\Pi }\text{cov}\left (\mathbf{c}_{\mathrm{exo}}\right )\boldsymbol{\Pi }^{\intercal } + \mathbf{B}^{-1}\text{cov}\left (\mathbf{z}\right )\left (\mathbf{B}^{\intercal }\right )^{-1} \end{array} \right ] }$$
(4.11)

Thanks to the structural constraints, the number of parameters in R c could be (considerably) less than \(\frac{1} {2}\) N(N − 1), potentially allowing for an increase in estimation efficiency.

As far as \(\boldsymbol{\Sigma }\) is concerned, the model is now completely specified.
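
A small numerical sketch (Python/NumPy) of (4.11), using for concreteness the structural matrices and residual covariance of the Monte Carlo example later in this chapter (eqs. (4.51) and (4.52)): given B, C, cov(c exo), and cov(z), it assembles R c and confirms that the endogenous composites have unit variance and correlation \(\sqrt{0.50}\).

```python
import numpy as np

B = np.array([[1.00, -0.25], [-0.50, 1.00]])
C = np.array([[-0.30, 0.50, 0.00, 0.00], [0.00, 0.00, 0.50, 0.25]])
R_exo = np.full((4, 4), 0.50) + 0.50 * np.eye(4)            # cov(c_exo)
cov_z = np.array([[0.5189, -0.0295], [-0.0295, 0.1054]])    # cov(z), eq. (4.52)

Pi = np.linalg.solve(B, C)                                  # reduced form, eq. (4.6)
B_inv = np.linalg.inv(B)
cov_endo = Pi @ R_exo @ Pi.T + B_inv @ cov_z @ B_inv.T      # lower-right block of (4.11)
cov_exo_endo = R_exo @ Pi.T                                 # upper-right block of (4.11)

R_c = np.block([[R_exo, cov_exo_endo],
                [cov_exo_endo.T, cov_endo]])
print(np.round(cov_endo, 3))   # approximately [[1.0, 0.707], [0.707, 1.0]]
```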

2.1 Fundamental Properties of the Model and Wold’s Fundamental Principle

Now define for each i the measurement error vector d i via

$$\displaystyle{ \mathbf{y}_{i} -\text{mean}\left (\mathbf{y}_{i}\right ) = \mathbf{L}_{i}\left (c_{i} -\text{mean}\left (c_{i}\right )\right ) + \mathbf{d}_{i} }$$
(4.12)

where \(\mathbf{L}_{i} =\boldsymbol{ \Sigma }_{ii}\mathbf{w}_{i}\), the loadings vector obtained by a regression of the indicators on their composite (and a constant).

By construction d i has a zero mean and is uncorrelated with c i . In what follows it will be convenient to have all variables de-meaned, so we have y i = L i c i + d i . It is easy to verify that:

The measurement error vectors are mutually uncorrelated, and uncorrelated with all composites:

$$\displaystyle{ \text{E}\mathbf{d}_{i}\mathbf{d}_{j}^{\intercal } = 0\text{ for all different }i\text{ and }j }$$
(4.13)
$$\displaystyle{ \text{E}\mathbf{d}_{i}c_{j} = 0\text{ for all }i\text{ and }j }$$
(4.14)

It follows that E\(\mathbf{y}_{i}\mathbf{d}_{j}^{\intercal } = 0\) for all different i and j. In addition:

$$\displaystyle{ \text{cov}\left (\mathbf{d}_{i}\right ) =\boldsymbol{ \Sigma }_{ii} -\mathbf{L}_{i}\mathbf{L}_{i}^{\intercal } }$$
(4.15)

The latter is also very similar to the corresponding expression in the basic design, but we cannot in general have a diagonal cov\(\left (\mathbf{d}_{i}\right )\), because cov\(\left (\mathbf{d}_{i}\right )\mathbf{w}_{i}\) is identically zero (implying that the variance of \(\mathbf{w}_{i}^{\intercal }\mathbf{d}_{i}\) is zero, and therefore \(\mathbf{w}_{i}^{\intercal }\mathbf{d}_{i} = 0\) with probability one). The following relationships can be verified algebraically using regression results, or by using conditional expectations formally (so even though we use the formalism and notation of conditional expectations, we just mean regression).

$$\displaystyle{ \text{E}\left (\mathbf{y}_{1}\vert c_{1}\right ) = \mathbf{L}_{1}c_{1}\text{ } }$$
(4.16)

because E\(\left (\mathbf{y}_{1}\vert c_{1}\right ) =\) E\(\left (\mathbf{L}_{1}c_{1} + \mathbf{d}_{1}\vert c_{1}\right ) = \mathbf{L}_{1}c_{1} + 0.\) Also note that

$$\displaystyle{ \text{E}\left (c_{1}\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N}\right ) }$$
(4.17)
$$\displaystyle{ = \text{E}(\text{E}\left (c_{1}\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N},\mathbf{d}_{2},\mathbf{d}_{3},\ldots,\mathbf{d}_{N}\right )\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N}) }$$
(4.18)
$$\displaystyle{ = \text{E}(\text{E}\left (c_{1}\vert c_{2},c_{3},\ldots,c_{N},\mathbf{d}_{2},\mathbf{d}_{3},\ldots,\mathbf{d}_{N}\right )\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N}) }$$
(4.19)
$$\displaystyle{ = \text{E}(\text{E}\left (c_{1}\vert c_{2},c_{3},\ldots,c_{N}\right )\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N}) }$$
(4.20)
$$\displaystyle{ = \text{E}\left (c_{1}\vert c_{2},c_{3},\ldots,c_{N}\right ) }$$
(4.21)

We use the “tower property” of conditional expectation on the second line. (In order to project on a target space, we first project on a larger space, and then project the result of this on the target space.) On the third line we use y i = L i c i + d i so that conditioning on the y i ’s and the d i ’s is the same as conditioning on the c i ’s and the d i ’s. The fourth line is due to zero correlation between the c i ’s and the d i ’s, and the last line exploits the fact that the composites are determined fully by the indicators. So because E\(\left (\mathbf{y}_{1}\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N}\right ) =\) E\(\left (\mathbf{L}_{1}c_{1} + \mathbf{d}_{1}\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N}\right ) = \mathbf{L}_{1}\) E\(\left (c_{1}\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N}\right )\) we have

$$\displaystyle{ \text{E}\left (\mathbf{y}_{1}\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N}\right ) = \mathbf{L}_{1}\text{E}\left (c_{1}\vert c_{2},c_{3},\ldots,c_{N}\right ) }$$
(4.22)

In other words, the best (least squares) predictor of a block of indicators given other blocks is determined by the best predictor of the composite of that block given the composites of the other blocks, together with the loadings on the composite. This contrasts rather strongly with the model Wold used, with latent factors/variables f. Here instead of L 1E\(\left (c_{1}\vert c_{2},c_{3},\ldots,c_{N}\right )\) we have

$$\displaystyle{ \text{E}\left (\mathbf{y}_{1}\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N}\right ) =\boldsymbol{\lambda } _{1}\text{E}(\text{E}\left (\,f_{1}\vert \,f_{2},f_{3},\ldots,f_{N}\right )\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N}) }$$
(4.23)

Basically, we can follow the same sequence of steps as above for the composites, except the step from (4.20) to (4.21). I would maintain that the model as specified answers more truthfully to the fundamental principle of soft modeling than the basic design.

3 Estimation Issues

We will assume that we have the outcome of a Consistent and Asymptotically Normal (CAN-)estimator for \(\boldsymbol{\Sigma }\). One can think of the sample covariance matrix of a random sample from a population with covariance matrix \(\boldsymbol{\Sigma }\) and finite fourth-order moments (the latter are sufficient for asymptotic normality; consistency requires finite second-order moments only). The estimators to be described are all (locally) smooth functions of the CAN-estimator for \(\boldsymbol{\Sigma }\), hence they are CAN as well.

We will use a step-wise approach: first the weights, then the loadings and the correlations between the composites, and finally the structural form coefficients. Each step uses a procedure that is essentially non-iterative, or if it iterates, it is very fast. So there is no explicit overall fit criterion, although one could interpret the approach as the first iterate in a block relaxation program that aims to optimize a positive combination of target functions appropriate for each step. The view that the lack of an overall criterion to be optimized is a major flaw is ill-founded. Estimators should be compared on the basis of their distribution functions, the extent to which they satisfy computational desiderata, and the induced quality of the predictions. There is no theorem, and there cannot be one, to the effect that estimators that optimize a function are better than those that are not so motivated. For composites a proper comparison between the “step-wise” (partial) and the “global” approaches is still open. Of the issues to be addressed two stand out: efficiency in case of a proper, correct specification, and robustness with respect to distributional assumptions and specification errors (the optimization of a global fitting function that takes each and every structural constraint seriously may not be as robust to specification errors as a step-wise procedure).

3.1 Estimation of Weights, Loadings, and Correlations

The only issue of some substance in this section is the estimation of the weights. Once they are available, estimates for the loadings and correlations present themselves: the latter are estimated via the correlation between the composites, the former by a regression of each block on its corresponding composite. One could devise more intricate methods but at this stage there seems little point in doing so.

To estimate the weights we will use generalized Canonical Variables (CV’s) analysis.Footnote 9 This includes of course the approach proposed by Wold, the so-called mode B estimation method. Composites simply are canonical variables. Any method that yields CV’s matches naturally, “perfectly,” with the model. We will describe some of the methods while applying them to \(\boldsymbol{\Sigma }\) and show that they do indeed yield the weights. A continuity argument then gives that when they are applied to the CAN-estimator for \(\boldsymbol{\Sigma }\) the estimators for the weights are consistent as well. Local differentiability leading to asymptotic normality is not difficult to establish either.Footnote 10

For notational ease we will employ a composites model with three blocks, N = 3, but that is no real limitation. Now consider the covariance matrix, denoted by \(\mathbf{R}\left (\mathbf{v}\right )\), of \(\mathbf{v}_{1}^{\intercal }\mathbf{y}_{1}\), \(\mathbf{v}_{2}^{\intercal }\mathbf{y}_{2}\), and \(\mathbf{v}_{3}^{\intercal }\mathbf{y}_{3}\) where each v i is normalized (var\(\left (\mathbf{v}_{i}^{\intercal }\mathbf{y}_{i}\right ) = 1\)). So

$$\displaystyle{ \mathbf{R}\left (\mathbf{v}\right ):= \left [\begin{array}{*{10}c} 1 &\mathbf{v}_{1}^{\intercal }\boldsymbol{\Sigma }_{12}\mathbf{v}_{2} & \mathbf{v}_{1}^{\intercal }\boldsymbol{\Sigma }_{13}\mathbf{v}_{3} \\ \mathbf{v}_{1}^{\intercal }\boldsymbol{\Sigma }_{12}\mathbf{v}_{2} & 1 &\mathbf{v}_{2}^{\intercal }\boldsymbol{\Sigma }_{23}\mathbf{v}_{3} \\ \mathbf{v}_{1}^{\intercal }\boldsymbol{\Sigma }_{13}\mathbf{v}_{3} & \mathbf{v}_{2}^{\intercal }\boldsymbol{\Sigma }_{23}\mathbf{v}_{3} & 1 \end{array} \right ]. }$$
(4.24)

Canonical variables are composites whose correlation matrix has “maximum distance” to the identity matrix of the same size. They are “collectively maximally correlated.” The term is clearly ambiguous for more than two blocks. One program that would seem to be natural is to maximize with respect to v

$$\displaystyle{ z\left (\mathbf{v}\right ):= \text{abs}\left (R_{12}\right ) + \text{abs}\left (R_{13}\right ) + \text{abs}\left (R_{23}\right )\text{ } }$$
(4.25)

subject to the usual normalizations. Since

$$\displaystyle{ \text{abs}\left (R_{ij}\right ) = \text{abs}\left (r_{ij}\right ) \cdot \text{abs}\left (\mathbf{v}_{i}^{\intercal }\boldsymbol{\Sigma }_{ ii}\mathbf{w}_{i}\right ) \cdot \text{abs}\left (\mathbf{v}_{j}^{\intercal }\boldsymbol{\Sigma }_{ jj}\mathbf{w}_{j}\right ) }$$
(4.26)

we know, thanks to Cauchy–Schwarz, that

$$\displaystyle\begin{array}{rcl} \text{abs}\left (\mathbf{v}_{i}^{\intercal }\boldsymbol{\Sigma }_{ ii}\mathbf{w}_{i}\right )& =& \text{abs}\left (\mathbf{v}_{i}^{\intercal }\boldsymbol{\Sigma }_{ ii}^{\frac{1} {2} }\boldsymbol{\Sigma }_{ii}^{\frac{1} {2} }\mathbf{w}_{i}\right ) \leq \sqrt{\mathbf{v} _{i }^{\intercal }\boldsymbol{\Sigma }_{ii }^{\frac{1} {2} }\boldsymbol{\Sigma }_{ii}^{\frac{1} {2} }\mathbf{v}_{i} \cdot \mathbf{w}_{i}^{\intercal }\boldsymbol{\Sigma }_{ii}^{\frac{1} {2} }\boldsymbol{\Sigma }_{ii}^{\frac{1} {2} }\mathbf{w}_{i}}{}\end{array}$$
(4.27)
$$\displaystyle{ = \sqrt{\mathbf{v} _{i }^{\intercal }\boldsymbol{\Sigma }_{ii } \mathbf{v} _{i } \cdot \mathbf{w} _{i }^{\intercal }\boldsymbol{\Sigma }_{ii } \mathbf{w} _{i}} = 1 }$$
(4.28)

with equality if and only if v i = w i (ignoring irrelevant sign differences). Observe that the upper bound can be reached for v i = w i for all terms in which v i appears, so maximization of the sum of the absolute correlations gives w. A numerical, iterative routineFootnote 11 suggests itself by noting that the optimal v 1 satisfies the first order condition

$$\displaystyle{ 0 = \text{sgn}\left (R_{12}\right ) \cdot \boldsymbol{ \Sigma }_{12}\mathbf{v}_{2} + \text{sgn}\left (R_{13}\right ) \cdot \boldsymbol{ \Sigma }_{13}\mathbf{v}_{3} - l_{1}\boldsymbol{\Sigma }_{11}\mathbf{v}_{1} }$$
(4.29)

where l 1 is a Lagrange multiplier (for the normalization), and two other quite similar equations for v 2 and v 3. So with arbitrary starting vectors one could solve the equations recursively for v 1, v 2, and v 3, respectively, updating them after each complete round or at the first opportunity, until they settle down at the optimal value. Note that each update of v 1 is obtainable by a regression of a “sign-weighted sum”

$$\displaystyle{ \text{sgn}\left (R_{12}\right ) \cdot \mathbf{v}_{2}^{\intercal }\mathbf{y}_{ 2} + \text{sgn}\left (R_{13}\right ) \cdot \mathbf{v}_{3}^{\intercal }\mathbf{y}_{ 3} }$$
(4.30)

on y 1, and analogously for the other weights. This happens to be the classical form of PLS’ mode B.Footnote 12 For \(\boldsymbol{\Sigma }\) we do not need many iterations, to put it mildly: the update of v 1 is already w 1, as straightforward algebra will easily show. And similarly for the other weight vectors. In other words, we have in essentially just one iteration a fixed point for the mode B equations that is precisely w.
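
The following sketch (Python/NumPy; the toy \(\boldsymbol{\Sigma }\) is our own) implements this classical mode B iteration at the population level: each weight vector is updated by regressing the sign-weighted sum of the other composites on its block and renormalizing. Applied to a \(\boldsymbol{\Sigma }\) that satisfies the composites model, the weights settle at w essentially immediately, as claimed above.

```python
import numpy as np

def mode_b(Sigma, block_sizes, n_iter=20):
    """Classical PLS mode B applied to a population covariance matrix Sigma."""
    idx = np.cumsum([0] + list(block_sizes))
    blocks = [slice(idx[i], idx[i + 1]) for i in range(len(block_sizes))]
    # start from arbitrary (here: equal) weights, normalized to unit composite variance
    v = [np.ones(p) for p in block_sizes]
    v = [vi / np.sqrt(vi @ Sigma[b, b] @ vi) for vi, b in zip(v, blocks)]
    for _ in range(n_iter):
        for i, bi in enumerate(blocks):
            s = np.zeros(block_sizes[i])
            for j, bj in enumerate(blocks):
                if j == i:
                    continue
                r_ij = v[i] @ Sigma[bi, bj] @ v[j]
                s += np.sign(r_ij) * Sigma[bi, bj] @ v[j]   # sign-weighted sum, eq. (4.30)
            vi_new = np.linalg.solve(Sigma[bi, bi], s)      # regression on block i, eq. (4.29)
            v[i] = vi_new / np.sqrt(vi_new @ Sigma[bi, bi] @ vi_new)
    return v

# toy Sigma satisfying the composites model: three blocks of three indicators
S_ii = np.full((3, 3), 0.49) + 0.51 * np.eye(3)
w = np.ones(3) / np.sqrt(np.ones(3) @ S_ii @ np.ones(3))
L = S_ii @ w
R_c = np.array([[1.0, 0.5, 0.4], [0.5, 1.0, 0.3], [0.4, 0.3, 1.0]])
Sigma = np.block([[S_ii if i == j else R_c[i, j] * np.outer(L, L)
                   for j in range(3)] for i in range(3)])

print([np.round(vi, 3) for vi in mode_b(Sigma, [3, 3, 3])])   # each equals w, about [0.41, 0.41, 0.41]
```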

If we use the correlations themselves in the recursions instead of just their signs, we regress the “correlation weighted sum”

$$\displaystyle{ R_{12} \cdot \mathbf{v}_{2}^{\intercal }\mathbf{y}_{ 2} + R_{13} \cdot \mathbf{v}_{3}^{\intercal }\mathbf{y}_{ 3} }$$
(4.31)

on y 1 (and analogously for the other weights), and end up with weights that maximize

$$\displaystyle{ z\left (\mathbf{v}\right ):= R_{12}^{2} + R_{ 13}^{2} + R_{ 23}^{2} }$$
(4.32)

the simple sum of the squared correlations. Again, with the same argument, the optimal value is w.

Observe that for this \(z\left (\mathbf{v}\right )\) we have

$$\displaystyle{ \text{tr}\left (\mathbf{R}^{2}\right ) = 2 \cdot z\left (\mathbf{v}\right ) + 3 =\mathop{ \sum }_{ i=1}^{3}\gamma _{ i}^{2} }$$
(4.33)

where \(\gamma _{i}:=\gamma _{i}\left (\mathbf{R}\left (\mathbf{v}\right )\right )\) is the ith eigenvalue of \(\mathbf{R}\left (\mathbf{v}\right )\). We can take other functions of the eigenvalues, in order to maximize the difference between \(\mathbf{R}\left (\mathbf{v}\right )\) and the identity matrix of the same order. Kettenring (1971) discusses a number of alternatives. One of them minimizes the product of the γ i ’s, the determinant of \(\mathbf{R}\left (\mathbf{v}\right )\), also known as the generalized variance. The program is called GENVAR. Since \(\mathop{\sum }_{i=1}^{N}\gamma _{i}\) is always N (three in this case) for every choice of v, GENVAR tends to make the eigenvalues as diverse as possible (as opposed to the identity matrix where they are all equal to one). The determinant of \(\mathbf{R}\left (\mathbf{v}\right )\) equals \(\left (1 - R_{23}^{2}\right )\), which is independent of v 1, times

$$\displaystyle\begin{array}{rcl} & & 1 -\left [\begin{array}{*{10}c} R_{12} & R_{13} \end{array} \right ]\left [\begin{array}{*{10}c} 1 &R_{23} \\ R_{23} & 1 \end{array} \right ]^{-1}\left [\begin{array}{*{10}c} R_{12} \\ R_{13} \end{array} \right ] {} \\ & =& 1 -\left (\mathbf{v}_{1}^{\intercal }\boldsymbol{\Sigma }_{ 11}\mathbf{w}_{1}\right )^{2}\left [\begin{array}{*{10}c} r_{ 12}\mathbf{v}_{2}^{\intercal }\boldsymbol{\Sigma }_{ 22}\mathbf{w}_{2} & r_{13}\mathbf{v}_{3}^{\intercal }\boldsymbol{\Sigma }_{ 33}\mathbf{w}_{3} \end{array} \right ]\left [\begin{array}{*{10}c} 1 &R_{23} \\ R_{23} & 1 \end{array} \right ]^{-1}\left [\begin{array}{*{10}c} r_{12}\mathbf{v}_{2}^{\intercal }\boldsymbol{\Sigma }_{22}\mathbf{w}_{2} \\ r_{13}\mathbf{v}_{3}^{\intercal }\boldsymbol{\Sigma }_{33}\mathbf{w}_{3} \end{array} \right ]\\ \end{array}$$
(4.34)

where the last quadratic form does not involve v 1 either, and the usual argument gives that GENVAR produces w as well. See Kettenring (1971) for an appropriate iterative routine (this involves the calculation of ordinary canonical variables of y i and the \(\left (N - 1\right )\)-vector consisting of the other composites).

Another program is MAXVAR, which maximizes the largest eigenvalue. For every v one can calculate the linear combination of the corresponding composites that best predicts or explains them: the first principal component of \(\mathbf{R}\left (\mathbf{v}\right )\). No other set is as well explained by the first principal component as the MAXVAR composites. There is an explicit solution here: no iterative routine is needed for the estimate of \(\boldsymbol{\Sigma }\), if one views the calculation of eigenvectors as non-iterative; see Kettenring (1971) for details.Footnote 13 One can show again that the optimal v equals w when MAXVAR is applied to \(\mathbf{\Sigma }\), although this requires a bit more work than for GENVAR (due to the additional detail needed to describe the solution).
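
For completeness, here is a sketch (Python/NumPy/SciPy) of one explicit MAXVAR computation; the construction below, via the leading eigenvector of the matrix with blocks \(\boldsymbol{\Sigma }_{ii}^{-1/2}\boldsymbol{\Sigma }_{ij}\boldsymbol{\Sigma }_{jj}^{-1/2}\), is our reading of the solution and should be checked against Kettenring (1971).

```python
import numpy as np
from scipy.linalg import sqrtm

def maxvar_weights(Sigma, block_sizes):
    """MAXVAR weights: w_i proportional to Sigma_ii^{-1/2} a_i, with a the leading
    eigenvector of the matrix with blocks Sigma_ii^{-1/2} Sigma_ij Sigma_jj^{-1/2}."""
    idx = np.cumsum([0] + list(block_sizes))
    blocks = [slice(idx[i], idx[i + 1]) for i in range(len(block_sizes))]
    D = np.zeros_like(Sigma)                       # block-diagonal Sigma_ii^{-1/2}
    for b in blocks:
        D[b, b] = np.linalg.inv(sqrtm(Sigma[b, b]).real)
    _, vecs = np.linalg.eigh(D @ Sigma @ D)
    a = vecs[:, -1]                                # eigenvector of the largest eigenvalue
    weights = []
    for b in blocks:
        w = D[b, b] @ a[b]
        weights.append(w / np.sqrt(w @ Sigma[b, b] @ w))   # unit composite variance
    return weights
```

Applied to a \(\boldsymbol{\Sigma }\) that satisfies the composites model (for instance, the toy matrix assembled in the earlier sketches), the returned weights coincide with w up to an irrelevant sign.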

As one may have expected, there is also MINVAR, the program aimed at minimizing the smallest eigenvalue (Kettenring 1971). The result is a set of composites with the property that no other set is “as close to linear dependency” as the MINVAR set. We also have an explicit solution, and w is optimal again.

3.2 Mode A and Mode B

In the previous subsection we recalled that mode B generates weight vectors by iterating regressions of certain weighted sums of composites on blocks. There is also mode A (and a mode C which we will not discuss), where weights are found iteratively by reversing the regressions: now blocks are regressed on weighted sums of composites. The algorithm generally converges, and the probability limits of the weights can be found as before by applying mode A to \(\boldsymbol{\Sigma }\). If we denote the probability limits (plims) of the (normalized) mode A weights by \(\widetilde{\mathbf{w}}_{i}\), we have in the generic case that y i is regressed on \(\mathop{\sum }_{j\neq i}\) sgn(cov\((\widetilde{\mathbf{w}}_{i}^{\intercal }\mathbf{y}_{i}\), \(\widetilde{\mathbf{w}}_{j}^{\intercal }\mathbf{y}_{j})\))\(\cdot \widetilde{\mathbf{w}}_{j}^{\intercal }\mathbf{y}_{j}\) so that

$$\displaystyle{ \widetilde{\mathbf{w}}_{i} \propto \mathop{\sum }_{j\neq i}\text{sgn}(\text{cov}(\widetilde{\mathbf{w}}_{i}^{\intercal }\mathbf{y}_{ i},\widetilde{\mathbf{w}}_{j}^{\intercal }\mathbf{y}_{ j})) \cdot \boldsymbol{ \Sigma }_{ij}\widetilde{\mathbf{w}}_{j} }$$
(4.35)
$$\displaystyle{ =\mathop{ \sum }_{j\neq i}\text{sgn}(\text{cov}(\widetilde{\mathbf{w}}_{i}^{\intercal }\mathbf{y}_{ i},\widetilde{\mathbf{w}}_{j}^{\intercal }\mathbf{y}_{ j})) \cdot r_{ij}\mathbf{L}_{i}\mathbf{L}_{j}^{\intercal }\widetilde{\mathbf{w}}_{ j} }$$
(4.36)
$$\displaystyle{ = \mathbf{L}_{i} \cdot \left (\mathop{\sum }_{j\neq i}\text{sgn}(\text{cov}(\widetilde{\mathbf{w}}_{i}^{\intercal }\mathbf{y}_{ i},\widetilde{\mathbf{w}}_{j}^{\intercal }\mathbf{y}_{ j})) \cdot r_{ij}\mathbf{L}_{j}^{\intercal }\widetilde{\mathbf{w}}_{ j}\right ) }$$
(4.37)

and so

$$\displaystyle{ \widetilde{\mathbf{w}}_{i} \propto \mathbf{L}_{i}\text{, in fact }\widetilde{\mathbf{w}}_{i} = \frac{1} {\sqrt{\mathbf{L} _{i }^{\intercal }\boldsymbol{\Sigma }_{ii } \mathbf{L} _{i}}}\mathbf{L}_{i} }$$
(4.38)

An immediate consequence is that the plim of mode A’s correlation, \(\widetilde{r}_{ij}\), equals

$$\displaystyle{ \widetilde{r}_{ij} =\widetilde{ \mathbf{w}}_{i}^{\intercal }\left (r_{ ij}\mathbf{L}_{i}\mathbf{L}_{j}^{\intercal }\right )\widetilde{\mathbf{w}}_{ j} = r_{ij} \cdot \frac{\mathbf{L}_{i}^{\intercal }\mathbf{L}_{i}} {\sqrt{\mathbf{L} _{i }^{\intercal }\boldsymbol{\Sigma }_{ii } \mathbf{L} _{i}}} \frac{\mathbf{L}_{j}^{\intercal }\mathbf{L}_{j}} {\sqrt{\mathbf{L} _{j }^{\intercal }\boldsymbol{\Sigma }_{jj } \mathbf{L} _{j}}} }$$
(4.39)

One would expect this to be smaller in absolute value than r ij , and so it is, since

$$\displaystyle{ \frac{\mathbf{L}_{i}^{\intercal }\mathbf{L}_{i}} {\sqrt{\mathbf{L} _{i }^{\intercal }\boldsymbol{\Sigma }_{ii } \mathbf{L} _{i}}} = \frac{\mathbf{w}_{i}^{\intercal }\boldsymbol{\Sigma }_{ii}^{2}\mathbf{w}_{i}} {\sqrt{\mathbf{w} _{i }^{\intercal }\boldsymbol{\Sigma }_{ii }^{3 }\mathbf{w} _{i}}} }$$
(4.40)
$$\displaystyle{ = \frac{\mathbf{w}_{i}^{\intercal }\boldsymbol{\Sigma }_{ii}^{1/2}\mathbf{\Sigma }_{ii}^{3/2}\mathbf{w}_{i}} {\sqrt{\mathbf{w} _{i }^{\intercal }\mathbf{\Sigma } _{ii }^{3 }\mathbf{w} _{i}}} \leq \frac{\sqrt{\mathbf{w} _{i }^{\intercal }\boldsymbol{\Sigma }_{ii } \mathbf{w} _{i}}\sqrt{\mathbf{w} _{i }^{\intercal }\mathbf{\Sigma } _{ii }^{3 }\mathbf{w} _{i}}} {\sqrt{\mathbf{w} _{i }^{\intercal }\mathbf{\Sigma } _{ii }^{3 }\mathbf{w} _{i}}} = 1 }$$
(4.41)

because of Cauchy–Schwarz. In general, mode A’s composites, \(\widetilde{\mathbf{c}}\), will not satisfy \(\mathbf{B}\widetilde{\mathbf{c}}_{\mathrm{endo}} = \mathbf{C}\widetilde{\mathbf{c}}_{\mathrm{exo}} +\widetilde{ \mathbf{z}}\) with \(\widetilde{\mathbf{z}}\) uncorrelated with \(\widetilde{\mathbf{c}}_{\mathrm{exo}}\). Observe that we have \(\widetilde{r}_{ij} = r_{ij}\) when and only when \(\mathbf{\Sigma }_{ii}\mathbf{w}_{i} \propto \mathbf{w}_{i}\) and \(\boldsymbol{\Sigma }_{jj}\mathbf{w}_{j} \propto \mathbf{w}_{j}\), in which case each composite is a principal component of its corresponding block.

For the plim of the loadings, \(\widetilde{\mathbf{L}}_{i}\), we note

$$\displaystyle{ \widetilde{\mathbf{L}}_{i} = \frac{1} {\sqrt{\mathbf{L} _{i }^{\intercal }\boldsymbol{\Sigma }_{ii } \mathbf{L} _{i}}}\boldsymbol{\Sigma }_{ii}\mathbf{L}_{i} }$$
(4.42)

So mode A’s loading vector is in the limit proportional to the true vector when and only when \(\boldsymbol{\Sigma }_{ii}\mathbf{w}_{i} \propto \mathbf{w}_{i}\).

To summarize:

  1. Mode A will tend to underestimate the correlations in absolute value.Footnote 14

  2. The plims of the correlations between the composites for Mode A and Mode B will be equal when and only when each composite is a principal component of its corresponding block, in which case we have a perfect match between a model and two modes as far as the relationships between the composites are concerned.

  3. The plims of the loading vectors for Mode A and Mode B will be proportional when and only when each composite is a principal component of its corresponding block.

A final observation: we can “correct” mode A to yield the right results in the general situation via

$$\displaystyle{ \frac{\boldsymbol{\Sigma }_{ii}^{-1}\widetilde{\mathbf{w}}_{i}} {\sqrt{\widetilde{\mathbf{w} }_{i }^{\intercal }\boldsymbol{\Sigma }_{ii }^{-1 }\widetilde{\mathbf{w} }_{i}}} = \mathbf{w}_{i} }$$
(4.43)

and

$$\displaystyle{ \frac{\widetilde{\mathbf{w}}_{i}} {\sqrt{\widetilde{\mathbf{w} }_{i }^{\intercal }\boldsymbol{\Sigma }_{ii }^{-1 }\widetilde{\mathbf{w} }_{i}}} = \mathbf{L}_{i} }$$
(4.44)
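
A small numerical check (Python/NumPy, with an illustrative block covariance) of the mode A limit and its correction: \(\widetilde{\mathbf{w}}_{i}\) is proportional to the loadings, and (4.43)–(4.44) recover the true weights and loadings.

```python
import numpy as np

S_ii = np.array([[1.00, 0.49, 0.25],
                 [0.49, 1.00, 0.36],
                 [0.25, 0.36, 1.00]])
w_i = np.array([1.0, 2.0, 3.0])
w_i = w_i / np.sqrt(w_i @ S_ii @ w_i)             # true weights, unit composite variance
L_i = S_ii @ w_i                                  # true loadings

w_tilde = L_i / np.sqrt(L_i @ S_ii @ L_i)         # plim of mode A weights, eq. (4.38)

S_inv = np.linalg.inv(S_ii)
w_corrected = S_inv @ w_tilde / np.sqrt(w_tilde @ S_inv @ w_tilde)   # eq. (4.43)
L_corrected = w_tilde / np.sqrt(w_tilde @ S_inv @ w_tilde)           # eq. (4.44)

print(np.allclose(w_corrected, w_i), np.allclose(L_corrected, L_i))  # True True
```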

3.3 Estimation of the Structural Equations

Given the estimate of R c we now focus on the estimation of Bc endo = Cc exo + z. We have exclusion constraints for the structural form matrices B and C, i.e., certain coefficients are a priori known to be zero. There are no restrictions on cov\(\left (\mathbf{z}\right )\), or if there are, we will ignore them here (for convenience, not as a matter of principle). This seems to exclude Wold’s recursive system, where the elements of B on one side of the diagonal are zero and the equation-residuals are uncorrelated. But we can always regress the first endogenous composite c endo,1 on c exo, and c endo,2 on [c endo,1; c exo], and c endo,3 on [c endo,1; c endo,2; c exo] et cetera. The ensuing residuals are by construction uncorrelated with the explanatory variables in their corresponding equations, and by implication they are mutually uncorrelated. In a sense there are no assumptions here: the purpose of the exercise (prediction of certain variables using a specific set of predictors) determines the regressions to be performed; there is also no identifiability issue.Footnote 15

Now consider P, the regression matrix obtained from regressing the (estimated) endogenous composites on the (estimated) exogenous composites. It estimates \(\boldsymbol{\Pi }\), the reduced form matrix B −1 C. We will use P, and possibly other functions of R c , to estimate the free elements of B and C. There is no point in trying when \(\boldsymbol{\Pi }\) is compatible with different values of the structural form matrices. So the crucial question is whether \(\boldsymbol{\Pi = B}^{-1}\mathbf{C}\), or equivalently \(\mathbf{B\Pi = C}\), can be solved uniquely for the free elements of B and C. Take the ith equationFootnote 16

$$\displaystyle{ \mathbf{B}_{i\cdot }\boldsymbol{\Pi = C}_{i\cdot } }$$
(4.45)

where the ith row of B, B i⋅ , has 1 in the ith entry (normalization) and possibly some zeros elsewhere, and where the ith row of C, C i⋅ , may also contain some zeros. The free elements in C i⋅  are given when those in B i⋅  are known, and the latter are to be determined by the zeros in C i⋅ . More precisely

$$\displaystyle{ \mathbf{B}_{\left (i,\text{ }k:B_{ ik}\mbox{ free or unit}\right )} \cdot \boldsymbol{ \Pi }_{\left (k:B_{ ik}\mbox{ free or unit, }j:C_{ij}=0\right )} = 0 }$$
(4.46)

So we have a submatrix of \(\boldsymbol{\Pi }\): the rows correspond to the free elements (and the unit) in the ith row of B, and the columns to the zero elements in the ith row of C. This equation determines \(\mathbf{B}_{\left (i,\text{ }k:B_{ ik}\mbox{ free or unit}\right )}\) uniquely, apart from an irrelevant nonzero multiple, when and only when the particular submatrix of \(\boldsymbol{\Pi }\) has a rank equal to its number of rows minus one. This is just the number of elements to be estimated in the ith row of B. To have this rank requires the submatrix to have at least as many columns. So a little thought shows that a necessary condition for unique solvability, identifiability, is that we must have at least as many exogenous composites in the system as coefficients to be estimated in any one equation. We emphasize that this order condition, as it is traditionally called, is indeed nothing more than necessary.Footnote 17 The rank condition is both necessary and sufficient.

A very simple example, which we will use in a small Monte Carlo study in the next subsection is as follows. Let

$$\displaystyle{ \left [\begin{array}{*{10}c} 1 &b_{12} \\ b_{21} & 1 \end{array} \right ]\left [\begin{array}{*{10}c} c_{\mathrm{endo},1} \\ c_{\mathrm{endo},2} \end{array} \right ] = \left [\begin{array}{*{10}c} c_{11} & c_{12} & 0 & 0 \\ 0 & 0 &c_{23} & c_{24} \end{array} \right ]\left [\begin{array}{*{10}c} c_{\mathrm{exo},1} \\ c_{\mathrm{exo},2} \\ c_{\mathrm{exo},3} \\ c_{\mathrm{exo},4} \end{array} \right ]+\left [\begin{array}{*{10}c} z_{1} \\ z_{2}\end{array} \right ] }$$
(4.47)

with 1 − b 12 b 21 ≠ 0. The order conditions are satisfied: each equation has three free coefficients and there are four exogenous composites.Footnote 18 Note that

$$\displaystyle{ \boldsymbol{\Pi =} \frac{1} {1 - b_{12}b_{21}}\left [\begin{array}{*{10}c} c_{11} & c_{12} & -b_{12}c_{23} & -b_{12}c_{24} \\ -b_{21}c_{11} & -b_{21}c_{12} & c_{23} & c_{24} \end{array} \right ] }$$
(4.48)

The submatrix of \(\boldsymbol{\Pi }\) relevant for an investigation into the validity of the rank condition for the first structural form equation is

$$\displaystyle{ \left [\begin{array}{*{10}c} \Pi _{13} & \Pi _{14} \\ \Pi _{23} & \Pi _{24} \end{array} \right ] = \frac{1} {1 - b_{12}b_{21}}\left [\begin{array}{*{10}c} -b_{12}c_{23} & -b_{12}c_{24}\\ c_{23 } & c_{24 } \end{array} \right ] }$$
(4.49)

It should have rank one, and it does so in the generic case, since its first row is a multiple of its second row.Footnote 19 Note that we cannot have both c 23 and c 24 zero. Clearly, b 12 can be obtained from \(\boldsymbol{\Pi }\) via \(-\Pi _{13}/\Pi _{23}\) or via -\(\Pi _{14}/\Pi _{24}\). A similar analysis applies to the second structural form equation. We note that the model imposes two constraints on \(\boldsymbol{\Pi }\): \(\Pi _{11}\Pi _{22} - \Pi _{12}\Pi _{21} = 0\) and \(\Pi _{13}\Pi _{24} - \Pi _{14}\Pi _{23} = 0\), in agreement with the fact that the 8 reduced form coefficients can be expressed in terms of 6 structural form parameters. For an extended analysis of the number and type of constraints that a structural form imposes on the reduced form see Bekker and Dijkstra (1990) and Bekker et al. (1994).
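
A small numerical sketch (Python/NumPy, with the structural values of the Monte Carlo example in the next subsection) of this identification argument: it builds \(\boldsymbol{\Pi }\), checks that the submatrix in (4.49) has rank one, recovers b 12 in the two ways mentioned, and verifies the two constraints on \(\boldsymbol{\Pi }\).

```python
import numpy as np

b12, b21 = -0.25, -0.50
c11, c12, c23, c24 = -0.30, 0.50, 0.50, 0.25

B = np.array([[1.0, b12], [b21, 1.0]])
C = np.array([[c11, c12, 0.0, 0.0], [0.0, 0.0, c23, c24]])
Pi = np.linalg.solve(B, C)                          # reduced form, eq. (4.48)

sub = Pi[np.ix_([0, 1], [2, 3])]                    # submatrix in eq. (4.49)
print(np.linalg.matrix_rank(sub))                   # 1
print(-Pi[0, 2] / Pi[1, 2], -Pi[0, 3] / Pi[1, 3])   # both equal b12 = -0.25
# the two constraints the structural form imposes on Pi:
print(np.isclose(Pi[0, 0] * Pi[1, 1] - Pi[0, 1] * Pi[1, 0], 0.0),
      np.isclose(Pi[0, 2] * Pi[1, 3] - Pi[0, 3] * Pi[1, 2], 0.0))    # True True
```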

It will be clear that the estimate P of \(\boldsymbol{\Pi }\) will not in general satisfy the rank conditions (although we do expect them to be close for sufficiently large samples), and using either − P 13∕P 23 or − P 14∕P 24 as an estimate for b 12 will give different answers. Econometric methods construct, explicitly or implicitly, compromises between the possible estimates. 2SLS, as discussed above, is one of them. See Dijkstra and Henseler (2015a,b) for a specification of the relevant formula (formula (23)) for 2SLS that honors the motivation via two regressions. Here we will outline another approach, based on Dijkstra (1989), that is close to the discussion about identifiability.

Consider a row vectorFootnote 20 with ith subvector B i⋅  P − C i⋅ . If P equaled \(\boldsymbol{\Pi }\) we could get the free coefficients by making B i⋅  P − C i⋅  zero. But that will not be the case. So we could decide to choose values for the free coefficients that make each B i⋅  P − C i⋅  as “close to zero as possible.” One way to implement that is to minimize a suitable quadratic form subject to the exclusion constraints and normalizations. We take

$$\displaystyle{ \left (\text{vec}\left [\left (\mathbf{BP - C}\right )^{\intercal }\right ]\right )^{\intercal }\cdot \left (\mathbf{W\otimes }\widehat{\mathbf{R}}_{\mathrm{ exo}}\right ) \cdot \text{vec}\left [\left (\mathbf{BP - C}\right )^{\intercal }\right ] }$$
(4.50)

Here \(\otimes \) stands for Kronecker’s matrix multiplication symbol, \(\widehat{\mathbf{R}}_{\mathrm{exo}}\) is the estimated p.d. correlation matrix of the estimated exogenous composites, W is a p.d. matrix with as many rows and columns as there are endogenous composites, and the operator “vec” stacks the columns of its matrix-argument one underneath the other, starting with the first. If we take a diagonal matrix W the quadratic form disintegrates into separate quadratic forms, one for each subvector, and minimization yields in fact 2SLS estimates. A non-diagonal W tries to exploit information about the covariances between the subvectors. For the classical econometric simultaneous equation model it is true that vec\(\left [\left (\mathbf{BP - C}\right )^{\intercal }\right ]\) is asymptotically normal with zero mean and covariance matrix cov\(\left (\mathbf{z}\right )\mathbf{\otimes R}_{\mathrm{exo}}^{-1}\) divided by the sample size, adapting the notation somewhat freely. General estimation theory tells us to use the inverse of an estimate of this covariance matrix in order to get asymptotic efficiency. So W should be the inverse of an estimate for cov\(\left (\mathbf{z}\right )\). The latter is traditionally estimated by the obvious estimate based on 2SLS. Note that the covariances between the structural form residuals drive the extent to which the various optimizations are integrated. There is little or no gain when there is little or no correlation between the elements of z. This more elaborate method is called 3SLS.
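
A sketch (Python/NumPy) of the minimization of (4.50); the bookkeeping of free coefficients is our own illustrative parameterization. With a diagonal W the problem separates per equation (2SLS, per the text); plugging in an estimate of cov\(\left (\mathbf{z}\right )^{-1}\) for W gives the 3SLS-type estimator.

```python
import numpy as np

def structural_estimates(P, R_exo, W, b_free, c_free):
    """Minimize the quadratic form (4.50) over the free elements of B and C.

    P       : estimated reduced form matrix (n_endo x n_exo)
    W       : p.d. weight matrix (n_endo x n_endo); diagonal -> per-equation (2SLS)
    b_free  : b_free[i] = list of columns k (k != i) with B_ik free
    c_free  : c_free[i] = list of columns l with C_il free
    """
    n_endo, n_exo = P.shape
    # residual of equation i: row_i = P[i,:] + sum_k b_ik P[k,:] - sum_l c_il e_l
    M = [np.array([P[k, :] for k in b_free[i]] +
                  [-np.eye(n_exo)[l] for l in c_free[i]]) for i in range(n_endo)]
    # normal equations of (4.50): sum_j W_ij M_i R_exo (P[j,:]' + M_j' theta_j) = 0
    A = np.block([[W[i, j] * M[i] @ R_exo @ M[j].T for j in range(n_endo)]
                  for i in range(n_endo)])
    rhs = np.concatenate([-sum(W[i, j] * M[i] @ R_exo @ P[j, :]
                               for j in range(n_endo)) for i in range(n_endo)])
    theta = np.linalg.solve(A, rhs)
    B, C, pos = np.eye(n_endo), np.zeros((n_endo, n_exo)), 0
    for i in range(n_endo):
        for k in b_free[i]:
            B[i, k] = theta[pos]; pos += 1
        for l in c_free[i]:
            C[i, l] = theta[pos]; pos += 1
    return B, C

# quick check with an exact reduced form: the true B and C are recovered
B0 = np.array([[1.0, -0.25], [-0.5, 1.0]])
C0 = np.array([[-0.3, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.25]])
R_exo = np.full((4, 4), 0.5) + 0.5 * np.eye(4)
Bh, Ch = structural_estimates(np.linalg.solve(B0, C0), R_exo, np.eye(2),
                              b_free=[[1], [0]], c_free=[[0, 1], [2, 3]])
print(np.round(Bh, 3), np.round(Ch, 3))
```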

We close with some observations. Since the quadratic form in the parameters is minimized subject to zero constraints and normalizations only, there is an explicit solution; see Dijkstra (1989, section 5) for the formulae.Footnote 21 If the fact that the weights are estimated can be ignored, there is also an explicit expression for the asymptotic covariance matrix, both for 2SLS and 3SLS. But if the sampling variation in the weights does matter, this formula may not be accurate and 3SLS may not be more efficient than 2SLS. Both methods are essentially non-iterative and very fast, and therefore suitable candidates for bootstrapping. One potential advantage of 2SLS over 3SLS is that it may be more robust to model specification errors, because, as opposed to its competitor, it estimates equation by equation, so that an error in one equation need not affect the estimation of the others.

3.4 Some Monte Carlo Results

We use the setup from Dijkstra and Henseler (2015a,b) adapted to the present setting. We have

$$\displaystyle{ \left [\begin{array}{*{10}c} 1 &-0.25\\ -0.50 & 1 \end{array} \right ]\left [\begin{array}{*{10}c} c_{\mathrm{endo},1} \\ c_{\mathrm{endo},2} \end{array} \right ] = \left [\begin{array}{*{10}c} -0.30&0.50& 0 & 0\\ 0 & 0 &0.50 &0.25 \end{array} \right ]\left [\begin{array}{*{10}c} c_{\mathrm{exo},1} \\ c_{\mathrm{exo},2} \\ c_{\mathrm{exo},3} \\ c_{\mathrm{exo},4} \end{array} \right ]+\left [\begin{array}{*{10}c} z_{1} \\ z_{2}\end{array} \right ] }$$
(4.51)

All variables have zero mean, and we will take them jointly normal. Cov\(\left (\mathbf{c}_{\mathrm{exo}}\right )\) has ones on the diagonal and 0.50 everywhere else; the variances of the endogenous composites are also one and we take cov\(\left (c_{\mathrm{endo},1}\text{,}c_{\mathrm{endo},2}\right ) = \sqrt{0.50}\). The values as specified imply for the covariance matrix of the structural form residuals z:

$$\displaystyle{ \text{cov}\left (\mathbf{z}\right ) = \left [\begin{array}{*{10}c} 0.5189 &-0.0295\\ -0.0295 & 0.1054 \end{array} \right ] }$$
(4.52)

Note that the correlation between z 1 and z 2 is rather small, −0.1261, so the setup has the somewhat unfortunate consequence of potentially favoring 2SLS. The R-squared for the first reduced form equation is 0.3329 and for the second reduced form equation it is 0.7314.
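
These implied quantities can be checked directly (a sketch in Python/NumPy): since z = Bc endo − Cc exo and \(\mathbf{B\Pi = C}\), one has cov\(\left (\mathbf{z}\right ) = \mathbf{B}\text{cov}\left (\mathbf{c}_{\mathrm{endo}}\right )\mathbf{B}^{\intercal }-\mathbf{C}\text{cov}\left (\mathbf{c}_{\mathrm{exo}}\right )\mathbf{C}^{\intercal }\), and the reduced form R-squared of equation i is \(\boldsymbol{\Pi }_{i\cdot }\text{cov}\left (\mathbf{c}_{\mathrm{exo}}\right )\boldsymbol{\Pi }_{i\cdot }^{\intercal }\) (all composites having unit variance).

```python
import numpy as np

B = np.array([[1.00, -0.25], [-0.50, 1.00]])
C = np.array([[-0.30, 0.50, 0.00, 0.00], [0.00, 0.00, 0.50, 0.25]])
R_exo = np.full((4, 4), 0.50) + 0.50 * np.eye(4)
R_endo = np.array([[1.0, np.sqrt(0.5)], [np.sqrt(0.5), 1.0]])

cov_z = B @ R_endo @ B.T - C @ R_exo @ C.T     # uses z = B c_endo - C c_exo and B Pi = C
Pi = np.linalg.solve(B, C)
r_squared = np.diag(Pi @ R_exo @ Pi.T)

print(np.round(cov_z, 4))       # [[ 0.5189 -0.0295] [-0.0295  0.1054]], eq. (4.52)
print(np.round(r_squared, 4))   # [0.3329 0.7314]
```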

Every composite is built up from three indicators, with a covariance matrix that has ones on the diagonal and 0.49 everywhere else. This is compatible with a one-factor model for each vector of indicators, but we have no use or need for that interpretation here.

The composites \(\left (c_{\mathrm{exo},1}\text{, }c_{\mathrm{exo},2}\text{, }c_{\mathrm{exo},3}\text{, }c_{\mathrm{exo},4}\text{, }c_{\mathrm{endo},1}\text{, }c_{\mathrm{endo},2}\right )\) need weights. For the first and fourth we take weights proportional to \(\left [1,1,1\right ]\). For the second and fifth the weights are proportional to \(\left [1,2,3\right ]\) and for the third and sixth they are proportional to \(\left [1,4,9\right ]\). There are no deep thoughts behind these choices.

We get the following weights (rounded to two decimals for readability): \(\left [0.41,0.41,0.41\right ]\) for blocks one and four, \(\left [0.20,0.40,0.60\right ]\) for blocks two and five, and \(\left [0.08,0.33,0.74\right ]\) for blocks three and six.

The loadings are now given as well: \(\left [0.81,0.81,0.81\right ]\) for blocks one and four, \(\left [0.69,0.80,0.90\right ]\) for blocks two and five, and \(\left [0.61,0.74,0.95\right ]\) for blocks three and six.

One can now calculate the 18 by 18 covariance/correlation matrix \(\boldsymbol{\Sigma }\) and its unique p.d. matrix square root \(\boldsymbol{\Sigma }^{1/2}\). We generate samples of size 300, which appears to be relatively modest given the number of parameters to estimate. A sample of size 300 is obtained via \(\boldsymbol{\Sigma }^{1/2}\times\) randn\(\left (18,300\right ).\) We repeat this ten thousand times, each time estimating the weights via MAXVAR,Footnote 22 the loadings via regressions and the correlations in the obvious way, and all structural form parameters via 2SLS and 3SLS using standardized indicators.Footnote 23

The loadings and weights are on the average slightly underestimated; see Dijkstra (2015) for some of the tables: when rounded to two decimals the difference is at most 0.01. The standard deviations of the weights estimators for the endogenous composites are either the largest or the smallest: for the weights of c endo,1 we have resp. \(\left [0.12,0.12,0.11\right ]\) and for c endo,2 \(\left [0.04,0.04,0.04\right ]\); the standard deviations for the weights of the exogenous composites are, roughly, in between. And similarly for the standard deviations for the loadings estimators: for the loadings on c endo,1 we have resp. \(\left [0.08,0.07,0.05\right ]\) and for c endo,2 \(\left [0.05,0.04,0.01\right ]\); the standard deviations for the loadings on the exogenous composites are again, roughly, in between.

The following table gives the results for the coefficients in B and C, rounded to two decimals:

Table 1

Clearly, for the model at hand 3SLS does nothing to distinguish itself positively from 2SLSFootnote 24 (its standard deviations are smaller than those of 2SLS only when we use three decimals). This might be different when the structural form residuals are materially correlated.

We also calculated, not shown, for each of the 10,000 samples of size 300 the theoretical (asymptotic) standard deviations for the 3SLS estimators. They are on average 0.01 smaller than the values in the table and they are relatively stable, with standard deviations ranging from 0.0065 for b 12 to 0.0015 for c 24. They are not perfect but not really bad either.

It would be reckless to read too much into this small and isolated study, for one type of distribution. But the approach does appear to be feasible.

4 Testing the Composites Model

In this section we sketch four more or less related approaches to test the appropriateness or usefulness of the model. In practice one might perhaps want to deploy all of them. Investigators will easily think of additional, “local” tests, like those concerning the signs or the order of magnitude of coefficients et cetera.

A thorny issue that should be mentioned here is capitalization on chance, which refers to the phenomenon that in practice one runs through cycles of model testing and adaptation until the current model tests signal that all is well according to popular rules-of-thumb.Footnote 25 This makes the model effectively stochastic, random. Taking a new sample and going through the cycles of testing and adjusting all over again may well lead to another model. But when we give estimates of the distribution functions of our estimators we imply that this helps to assess how the estimates will vary when other samples of the same size would be employed, while keeping the model fixed. It is tempting, but potentially very misleading, to ignore the fact that the sample (we/you, actually) favored a particular model after a (dedicated) model search, see Freedman et al. (1988), Dijkstra and Veldkamp (1988), Leeb and Pötscher (2006), and Freedman (2009)Footnote 26. It is not clear at all how to properly validate the model on the very same data that gave it birth, while using test statistics as design criteria.Footnote 27 Treating the results conditional on the sample at hand, as purely descriptive (which in itself may be rather useful, Berk 2008), or testing the model on a fresh sample (e.g., a random subset of the data that was kept apart when the model was constructed), while bracing oneself for a possibly big disappointment, appear to be the best or most honest responses.

4.1 Testing Rank Restrictions on Submatrices

The covariance matrix of any subvector of y i with any choice from the other indicators has rank one. So the corresponding regression matrix has rank one. To elaborate a bit, since E\(\left (c_{1}\vert c_{2},c_{3},\ldots,c_{N}\right )\) is a linear function of y, the formula E\(\left (\mathbf{y}_{1}\vert \mathbf{y}_{2},\mathbf{y}_{3},\ldots,\mathbf{y}_{N}\right ) = \mathbf{L}_{1}\) E\(\left (c_{1}\vert c_{2},c_{3},\ldots,c_{N}\right )\) tells us that the regression matrix is a column times a row vector. Therefore its \(p_{1} \cdot \left (\,p - p_{1}\right )\) elements can be expressed in terms of just \(\left (\,p - 1\right )\) parameters (one row of \(\left (\,p - p_{1}\right )\) elements plus \(\left (\,p_{1} - 1\right )\) proportionality factors). This number could be even smaller when the model imposes structural constraints on R c as well. A partial check could be performed using any of the methods developed for restricted rank testing. A possible objection could be that the tests are likely to be sensitive to deviations from the Gaussian distribution, but jackknifing or bootstrapping might help to alleviate this. Another issue is the fact that we get many tests that are also correlated, so that simultaneous testing techniques based on Bonferroni or more modern approaches are required.Footnote 28

4.2 Exploiting the Difference Between Different Estimators

We noted that a number of generalized canonical variable programs yield identical results when applied to a \(\boldsymbol{\Sigma }\) satisfying the composites factor model. But we expect to get different results when this is not the case. So, when using the estimate for \(\boldsymbol{\Sigma }\), one might want to check whether the differences between, say, PLS mode B and MAXVAR (or any other pair of methods) are too big for comfort. The scale on which to measure this could be based on the probability (as estimated by the bootstrap) of obtaining a larger “difference” than actually observed.

4.3 Prediction Tests, via Cross-Validation

The path diagram might naturally indicate composites and indicators that are most relevant for prediction. So it would seem to make sense to test whether the model’s rank restrictions can help improve predictions of certain selected composites or indicators. The result will not only reflect model adequacy but also the statistical phenomenon that the imposition of structure, even when strictly unwarranted, can help in prediction. It would therefore also reflect the sample size. The reference for an elaborate and fundamental discussion of prediction and cross-validation in a PLS-context is Shmueli et al. (2016).

4.4 Global Goodness-of-Fit Tests

In SEM we test the model by assessing the probability value of a distance measure between the sample covariance matrix S and an estimated matrix \(\widehat{\boldsymbol{\Sigma }}\) that satisfies the model. Popular measures are

$$\displaystyle{ \frac{1} {2}\text{tr}\left (\mathbf{S}^{-1}\left (\mathbf{S} -\widehat{\boldsymbol{ \Sigma }}\right )\right )^{2} }$$
(4.53)

and

$$\displaystyle{ \text{tr}\left (\mathbf{S}\widehat{\boldsymbol{\Sigma }}^{-1}\right ) -\text{log}\left (\text{det}\left (\mathbf{S}\widehat{\boldsymbol{\Sigma }}^{-1}\right )\right ) - p }$$
(4.54)

They belong to a large class of distances, all expressible in terms of a suitable function f:

$$\displaystyle{ \mathop{\sum }_{k=1}^{p}f\left (\gamma _{ k}\left (\mathbf{S}^{-1}\widehat{\mathbf{\Sigma }}\right )\right ). }$$
(4.55)

Here \(\gamma _{k}\left (\cdot \right )\) is the kth eigenvalue of its argument, and f is essentially a smooth real function defined on positive real numbers, with a unique global minimum of zero at the argument value 1. The functions are “normalized,” \(f^{{\prime\prime} }\left (1\right ) = 1\), entailing that the second-order Taylor expansions around 1 are identical.Footnote 29 For the examples referred to we have \(f\left (\gamma \right ) = \frac{1} {2}\left (1-\gamma \right )^{2}\) and \(f\left (\gamma \right ) = 1/\gamma +\) log\(\left (\gamma \right ) - 1\), respectively. Another example is \(f\left (\gamma \right ) = \frac{1} {2}\left (\text{log}\left (\gamma \right )\right )^{2}\), the so-called geodesic distance; its value is the same whether we work with \(\mathbf{S}^{-1}\widehat{\boldsymbol{\Sigma }}\) or with \(\mathbf{S}\widehat{\boldsymbol{\Sigma }}^{-1}\). The idea is that when the model fits perfectly, so \(\mathbf{S}^{-1}\widehat{\boldsymbol{\Sigma }}\) is the identity matrix, then all its eigenvalues equal one, and conversely. This class of distances was first analyzed by Swain (1975).Footnote 30 Distance measures outside of this class are those induced by WLS with general fourth-order-moment-based weight matrices,Footnote 31 but also the simple ULS: tr\(\left (\mathbf{S} -\widehat{\boldsymbol{ \Sigma }}\right )^{2}\). We can take any of these measures, calculate its value, and use the bootstrap to estimate the corresponding probability value. It is important to pre-multiply the observation vectors by \(\widehat{\boldsymbol{\Sigma }}^{\frac{1} {2} }\mathbf{S}^{-\frac{1} {2} }\) before the bootstrap is implemented, in order to ensure that their empirical distribution has a covariance matrix that agrees with the assumed model. For \(\widehat{\boldsymbol{\Sigma }}\) one could take in an obvious notation \(\widehat{\boldsymbol{\Sigma }}_{ii}:= \mathbf{S}_{ii}\) and for i ≠ j

$$\displaystyle{ \widehat{\boldsymbol{\Sigma }}_{ij}:=\widehat{ r}_{ij} \cdot \mathbf{S}_{ii}\widehat{\mathbf{w}}_{i} \cdot \widehat{\mathbf{w}}_{j}^{\intercal }\mathbf{S}_{ jj}. }$$
(4.56)

Here \(\widehat{r}_{ij} =\widehat{ \mathbf{w}}_{i}^{\intercal }\mathbf{S}_{ij}\widehat{\mathbf{w}}_{j}\) if there are no constraints on R c , otherwise it will be the ijth element of \(\widehat{\mathbf{R}}_{c}\). If S is p.d., then \(\widehat{\boldsymbol{\Sigma }}\) is p.d. (as follows from the appendix) and \(\widehat{\boldsymbol{\Sigma }}^{\frac{1} {2} }\mathbf{S}^{-\frac{1} {2} }\) is well-defined.
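A sketch (Python/NumPy/SciPy) of the computations just described; S, the estimated weights, and \(\widehat{r}_{ij}\) are assumed to come from the earlier estimation steps, and the data matrix X has the observations in its rows.

```python
import numpy as np
from scipy.linalg import sqrtm

def sigma_hat(S, blocks, w_hat, r_hat):
    """Model-implied covariance matrix, eq. (4.56); blocks is a list of slices."""
    Sh = S.copy()
    for i, bi in enumerate(blocks):
        for j, bj in enumerate(blocks):
            if i != j:
                Sh[bi, bj] = r_hat[i, j] * np.outer(S[bi, bi] @ w_hat[i],
                                                    S[bj, bj] @ w_hat[j])
    return Sh

def fit_distances(S, Sh):
    """The three distances above, expressed via the eigenvalues of S^{-1} Sigma_hat."""
    gamma = np.linalg.eigvals(np.linalg.solve(S, Sh)).real
    return {"squared":    np.sum(0.5 * (1.0 - gamma) ** 2),
            "likelihood": np.sum(1.0 / gamma + np.log(gamma) - 1.0),
            "geodesic":   np.sum(0.5 * np.log(gamma) ** 2)}

def transform_for_bootstrap(X, S, Sh):
    """Pre-multiply the (centered) observations by Sigma_hat^{1/2} S^{-1/2}."""
    T = sqrtm(Sh).real @ np.linalg.inv(sqrtm(S).real)
    return (X - X.mean(axis=0)) @ T.T
```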

5 Some Final Observations and Comments

In this chapter we outlined a model in terms of observables only while adhering to the soft modeling principle of Wold’s PLS. Wold developed his methods against the backdrop of a particular latent variables model, the basic design. This introduces N additional unobservable variables which by necessity cannot in general be expressed unequivocally in terms of the “manifest variables,” the indicators. However, we can construct composites that satisfy the same structural equations as the latent variables, in an infinite number of ways in fact. Also, we can design composites such that the regression of the indicators on the composites yields the loadings. But in the regular case we cannot have both.

Suppose \(\mathbf{y} =\boldsymbol{ \Lambda }\mathbf{f}+\boldsymbol{\varepsilon }\) with E\(\mathbf{f}\boldsymbol{\varepsilon }^{\intercal } = 0\), \(\boldsymbol{\Theta:=}\) cov\(\left (\boldsymbol{\varepsilon }\right )> 0\), and \(\boldsymbol{\Lambda }\) has full column rank. The p.d. cov\(\left (\mathbf{f}\right )\) will satisfy the constraints as implied by identifiable equations like \(\mathbf{Bf}_{\mathrm{endo}} = \mathbf{Cf}_{\mathrm{exo}}+\boldsymbol{\zeta }\) with E\(\mathbf{f}_{\mathrm{exo}}\boldsymbol{\zeta }^{\intercal } = 0\). All variables have zero mean. Let \(\widehat{\mathbf{f}}\), of the same dimension as f, equal Fy for a fixed matrix F. If the regression of y on \(\widehat{\mathbf{f}}\) yields \(\boldsymbol{\Lambda }\) we must have \(\mathbf{F}\boldsymbol{\Lambda }\mathbf{= I}\) because then

$$\displaystyle{ \boldsymbol{\Lambda =} \text{E}\left [\mathbf{y}\left (\mathbf{Fy}\right )^{\intercal }\right ] \cdot \left [\text{cov}\left (\mathbf{Fy}\right )\right ]^{-1} = \text{cov}\left (\mathbf{y}\right )\mathbf{F}^{\intercal }[\mathbf{F}\text{cov}\left (\mathbf{y}\right )\mathbf{F}^{\intercal }\mathbf{]}^{-1} }$$
(4.57)

Consequently

$$\displaystyle{ \widehat{\mathbf{f}} = \mathbf{F}\left (\boldsymbol{\Lambda f+\varepsilon }\right )\mathbf{= f + F}\boldsymbol{\varepsilon } }$$
(4.58)

and \(\widehat{\mathbf{f}}\) has a larger covariance matrix than f (the difference is p.s.d., usually p.d.). One example isFootnote 32 \(\mathbf{F =}\left (\boldsymbol{\Lambda }^{\intercal }\boldsymbol{\Theta }^{-1}\boldsymbol{\Lambda }\right )^{-1}\boldsymbol{\Lambda }^{\intercal }\boldsymbol{\Theta }^{-1}\) with cov\(\left (\widehat{\mathbf{f}}\right )-\) cov\(\left (\mathbf{f}\right ) = \left (\boldsymbol{\Lambda }^{\intercal }\boldsymbol{\Theta }^{-1}\boldsymbol{\Lambda }\right )^{-1}\).
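
A quick numerical check of this example (Python/NumPy, with illustrative values for \(\boldsymbol{\Lambda }\), \(\boldsymbol{\Theta }\), and cov(f)): F satisfies \(\mathbf{F}\boldsymbol{\Lambda }= \mathbf{I}\) and the excess covariance equals \(\left (\boldsymbol{\Lambda }^{\intercal }\boldsymbol{\Theta }^{-1}\boldsymbol{\Lambda }\right )^{-1}\).

```python
import numpy as np

Lambda = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                   [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
Theta = np.diag(1.0 - (Lambda ** 2).sum(axis=1))      # uncorrelated measurement errors
cov_f = np.array([[1.0, 0.4], [0.4, 1.0]])

Theta_inv = np.linalg.inv(Theta)
F = np.linalg.solve(Lambda.T @ Theta_inv @ Lambda, Lambda.T @ Theta_inv)

cov_y = Lambda @ cov_f @ Lambda.T + Theta
cov_fhat = F @ cov_y @ F.T

print(np.allclose(F @ Lambda, np.eye(2)))                                           # True
print(np.allclose(cov_fhat - cov_f, np.linalg.inv(Lambda.T @ Theta_inv @ Lambda)))  # True
```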

So, generally, if the regression of y on the composites yields \(\boldsymbol{\Lambda }\), the covariance matrices cannot be the same, and the composites cannot satisfy the same equations as the latent variables f.Footnote 33 Conversely, if cov\(\left (\widehat{\mathbf{f}}\right ) =\) cov\(\left (\mathbf{f}\right )\), then the regression of y on the composites cannot yield \(\boldsymbol{\Lambda }\).

If we minimize E\(\left (\mathbf{y-}\boldsymbol{\Lambda }\mathbf{Fy}\right )^{\intercal }\boldsymbol{\Theta }^{-1}\left (\mathbf{y-}\boldsymbol{\Lambda }\mathbf{Fy}\right )\) subject to cov\(\left (\mathbf{Fy}\right ) =\) cov\(\left (\mathbf{f}\right )\) we get the composites that LISREL reports. We can generate an infinite number of alternativesFootnote 34 by minimizing E\(\left (\mathbf{f - Fy}\right )^{\intercal }\mathbf{V}\left (\mathbf{f - Fy}\right )\) subject to cov\(\left (\mathbf{Fy}\right ) =\) cov\(\left (\mathbf{f}\right )\) for any conformable p.d. V. Note that each composite here typically uses all indicators. Wold takes composites that combine the indicators per block. Of course, they also cannot reproduce the measurement equations and the structural equations, but the parameters can be obtained (consistently estimated) using suitable corrections (PLSc.Footnote 35)

Two challenging research topics present themselves: first, the extension of the approach to more dimensions/layers, and second, the imposition of sign constraints on weights, loadings, and structural coefficients, while maintaining as far as possible the numerical efficiency of the approach.