Regression Analysis of Interval Data

Berry, Kenneth J.; Mielke, Paul W.; Johnston, Janis E.

doi:10.1007/978-3-319-28770-6_4


Abstract

Chapter 4 continues Chap. 3, utilizing the multi-response permutation procedures developed in Chap. 2 for analyzing completely randomized data at the interval level of measurement. In Chap. 4, multi-response permutation procedures are used to analyze regression residuals generated by ordinary least squares (OLS) and least absolute deviation (LAD) regression models. Experimental designs presented and analyzed in Chap. 4 include one-way randomized, one-way randomized with a covariate, one-way randomized-block, two-way randomized-block, two-way factorial, Latin square, split-plot, and two-factor nested designs.


Multi-Response Permutation Procedures (MRPP) were introduced in Chap. 2 and applied to interval-level, completely randomized data in Chap. 3. While multi-response permutation procedures are generally thought of as providing tests of differences among g treatment groups as demonstrated in Chap. 3, they also have applications in ordinary least squares (OLS) linear regression analyses with v = 2 and least absolute deviations (LAD) linear regression analyses with v = 1. In this fourth chapter of Permutation Statistical Methods, MRPP analyses of LAD regression residuals are illustrated with a variety of experimental designs, including one-way completely randomized with and without a covariate, one-way and two-way randomized-block, two-way factorial, Latin square, and two-factor nested analysis-of-variance designs. Also considered are multivariate multiple regression designs.

4.1 LAD Linear Regression

OLS linear regression has long been recognized as a useful tool in many fields of research. The optimal properties of OLS regression are well known when the errors are normally distributed. However, in practice the assumption of multivariate normality is rarely justified. LAD linear regression is an attractive alternative to OLS regression as it is extremely robust to deviations from normality as well as to the presence of extreme values [297, p. 172].

It is widely recognized that estimators of OLS regression parameters can be severely affected by unusual values in either the criterion variable or in one or more of the predictor variables. This is due in large part to the weight given to each data point when minimizing the sum of squared errors. In contrast, LAD regression is much less sensitive to the effects of unusual-value errors because the errors are not squared. Moreover, LAD regression has been shown to be superior to OLS regression when errors are generated from heavy-tailed or outlier-producing distributions, such as the Cauchy and double-exponential distributions; see, for example, articles by Blattberg and Sargent [46], Dielman [94, 95], Dielman and Pfaffenberger [96], Dielman and Rose [97], Mathew and Nordström [264], Mielke, Berry, Landsea, and Gray [303], Pfaffenberger and Dinkel [337], Rice and White [346], Rosenberg and Carlson [352], Rousseeuw [355], Taylor [394], and Wilson [432].

As described by Sheynin, the initial known use of regression by Daniel Bernoulli (c. 1734) for astronomical prediction problems involved LAD regression based on ordinary Euclidean distances between the observed and predicted response values [372]. Further developments in LAD regression were due to Roger Joseph (Rogerius Josephus) Boscovich (c. 1755), Pierre-Simon Laplace (c. 1789), and Carl Friedrich Gauss (c. 1809). The American mathematician and astronomer Nathaniel Bowditch (c. 1809) was highly critical of OLS regression because, as he argued, squared regression residuals unduly emphasized questionable observations in comparison with the absolute regression residuals associated with LAD regression [372].

Consider the general multivariate regression model given by

$$\displaystyle{ \mathbf{y}_{i} =\boldsymbol{ h}\left (\boldsymbol{\beta },\mathbf{x}_{i}\right ) + \mathbf{e}_{i}\;, }$$

where $\mathbf{y}_{i}^{\,{\prime}} = (y_{1i},\,\ldots,\,y_{ri})$ denotes the row vector of r observed response measurements for the ith of N objects, $\mathbf{x}_{i}^{\,{\prime}} = (x_{1i},\,\ldots,\,x_{si})$ is the row vector of s predictor values for the ith object, $\boldsymbol{\beta }^{{\prime}} = (\beta _{1},\,\ldots,\,\beta _{t})$ is the row vector of t parameters, $\boldsymbol{h}^{{\prime}} = (h_{1},\,\ldots,\,h_{r})$ is the row vector of r model functions of $\boldsymbol{\beta }$ and $\mathbf{x}_{i}$ for the ith object, and $\mathbf{e}_{i}^{\,{\prime}} = (e_{1i},\,\ldots,\,e_{ri})$ denotes the r errors between the response variables and model functions for the ith object, i = 1, …, N. The special case of a multivariate linear regression model is given by

$$\displaystyle{ \mathbf{y}_{i} = \mathbf{B}\boldsymbol{f}(\mathbf{x}_{i}) + \mathbf{e}_{i}\;, }$$

where $\boldsymbol{f}(\mathbf{x}_{i})$ denotes a column vector of p distinct functions of s predictors ($\mathbf{x}_{i}$) for the ith object, i = 1, …, N, and $\mathbf{B}$ is an r × p matrix of parameters in which ($B_{j1},\,\ldots,\,B_{jp}$) is the row vector of p parameters associated with the jth response measurement, j = 1, …, r.

Let $\mathbf{y}_{i}$ denote a column vector of r observed response measurement scores and let $\tilde{\mathbf{y}}_{i}$ denote a column vector of r predicted response values for the ith object, i = 1, …, N. Thus, the general and linear predicted multivariate regression models are given by

$$\displaystyle\begin{array}{rcl} \qquad \qquad \qquad \tilde{\mathbf{y}}_{i}& =& \boldsymbol{h}\left (\boldsymbol{\tilde{\beta }},\mathbf{x}_{i}\right ) {}\\ \text{and}\qquad \qquad \qquad \qquad \qquad \qquad & & {}\\ \qquad \qquad \qquad \tilde{\mathbf{y}}_{i}& =& \tilde{\mathbf{B}}\boldsymbol{f}\left (\mathbf{x}_{i}\right )\;, {}\\ \end{array}$$

respectively, where $\boldsymbol{\tilde{\beta }}$ and $\tilde{\mathbf{B}}$ are estimated parameters that are intended to provide good fits between the $\mathbf{y}_{i}$ and $\tilde{\mathbf{y}}_{i}$ values relative to a selected goodness-of-fit criterion. The null hypothesis ($H_{0}$) underlying each criterion dictates that each of the N! possible pairings of the predicted sequential ordering ($\tilde{\mathbf{y}}_{1},\,\ldots,\,\tilde{\mathbf{y}}_{N}$) with the fixed observed sequential ordering ($\mathbf{y}_{1},\,\ldots,\,\mathbf{y}_{N}$) occurs with equal probability, i.e., $1/N!$.

Let $\Delta (\mathbf{\tilde{y}_{i}},\mathbf{y}_{i})$ for i = 1, …, N denote the distance function between the predicted and observed response measurement values and consider the generalized Minkowski distance function given by

$$\displaystyle{ \Delta (\tilde{\mathbf{y}}_{i},\mathbf{y}_{i}) = \left (\,\sum _{j=1}^{r}\big\vert \tilde{y}_{ ij} - y_{ij}\big\vert ^{w}\right )^{\!v/w}\;, }$$

where w ≥ 1 and v > 0. The choice of v = 1 is preferred because it yields the Minkowski metric [12], whereas v > 1 yields distance functions that do not satisfy the triangle inequality property of a metric. Consequently, the distance function of choice utilizes v = 1 and w = 2, i.e., an ordinary Euclidean distance function.
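The effect of v on the triangle inequality can be checked numerically. The following is a minimal sketch, assuming three hypothetical collinear points with a single response each; the function name `minkowski_distance` is an illustrative choice and does not come from the text.

```python
def minkowski_distance(a, b, v=1.0, w=2.0):
    """Generalized Minkowski distance: (sum_j |a_j - b_j|^w)^(v/w)."""
    return sum(abs(x - y) ** w for x, y in zip(a, b)) ** (v / w)

# Three hypothetical points on a line (r = 1 response each).
p, q, s = (0.0,), (1.0,), (2.0,)

for v in (1.0, 2.0):
    direct = minkowski_distance(p, s, v=v)
    via_q = minkowski_distance(p, q, v=v) + minkowski_distance(q, s, v=v)
    # A metric requires direct <= via_q for every choice of q.
    print(f"v = {v}: direct = {direct}, via q = {via_q}, holds = {direct <= via_q}")
```

With v = 1 the direct distance equals the detour through the middle point, whereas with v = 2 the squared distance exceeds the sum of the two shorter squared distances, so the triangle inequality fails.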

Let the average distance function between $(\tilde{\mathbf{y}}_{1},\,\ldots,\,\tilde{\mathbf{y}}_{N})$ and $(\mathbf{y}_{1},\,\ldots,\,\mathbf{y}_{N})$ be given by

$$\displaystyle{ \delta = \frac{1} {N}\sum _{i=1}^{N}\Delta \big(\tilde{\mathbf{y}}_{ i},\mathbf{y}_{i}\big)\;. }$$

(4.1)

As noted previously, a distance function with v > 1 is not a metric function. If the distance function associated with LAD regression is squared (i.e., v = 2), then the estimated parameters that minimize δ yield an OLS regression model.

The criterion for fitting multivariate regression models based on δ is the chance-corrected measure of agreement between the observed and predicted response measurement values given by

$$\displaystyle{ \mathfrak{R} = 1 -\frac{\delta } {\mu _{\delta }}\;, }$$

(4.2)

where $\mu_{\delta}$ is the expected value of δ over the N! possible pairings under the null hypothesis. An efficient computational expression for obtaining $\mu_{\delta}$ that involves a sum of $N^{2}$ rather than N! terms is given by

$$\displaystyle{ \mu _{\delta } = \frac{1} {N^{2}}\sum _{i=1}^{N}\,\sum _{ j=1}^{N}\Delta \big(\tilde{\mathbf{y}}_{ i},\mathbf{y}_{j}\big)\;. }$$

(4.3)
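Equations (4.1) through (4.3) translate directly into a few lines of code. The sketch below assumes small hypothetical bivariate responses (r = 2, N = 3); the function names `distance` and `agreement` are illustrative and are not taken from the text.

```python
def distance(pred, obs, v=1.0, w=2.0):
    """Generalized Minkowski distance between one predicted and one observed vector."""
    return sum(abs(p - o) ** w for p, o in zip(pred, obs)) ** (v / w)

def agreement(preds, obs, v=1.0, w=2.0):
    """Return (delta, mu_delta, R) following Eqs. (4.1), (4.3), and (4.2)."""
    n = len(obs)
    # Eq. (4.1): average distance over the N observed pairings.
    delta = sum(distance(p, o, v, w) for p, o in zip(preds, obs)) / n
    # Eq. (4.3): average distance over all N^2 (predicted, observed) pairs.
    mu_delta = sum(distance(p, o, v, w) for p in preds for o in obs) / n ** 2
    # Eq. (4.2): chance-corrected agreement.
    return delta, mu_delta, 1.0 - delta / mu_delta

# Hypothetical data: the third prediction misses the second response by one unit.
observed = [(1.0, 2.0), (3.0, 1.0), (5.0, 4.0)]
predicted = [(1.0, 2.0), (3.0, 1.0), (5.0, 5.0)]

delta, mu_delta, R = agreement(predicted, observed)
print(delta, mu_delta, R)
```

Because the double sum in Eq. (4.3) has only N² terms, the expected value is obtained without enumerating any of the N! pairings.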

4.1.1 Linear Regression and Agreement

A simple interpretation of $\mathfrak{R}$ can be described for $r = s = 1$ since the same interpretation holds for any r and s. In the case involving perfect agreement, $\tilde{y}_{i} = y_{i}$ for i = 1, …, N, δ = 0.00, and $\mathfrak{R} = 1.00$. This implies that the functional relationship between $\tilde{y}$ and y can be described by a straight line that passes through the origin at an angle of 45°, i.e., with unit slope, as depicted in Fig. 4.1 with N = 5 bivariate $(y,\,\tilde{y})$ values: (2, 2), (4, 4), (6, 6), (8, 8), and (10, 10). For the N = 5 data points depicted in Fig. 4.1, the intercept is $\tilde{\beta }_{0} = 0.00$, the unstandardized slope is $\tilde{\beta }_{1} = +1.00$, the squared Pearson product-moment correlation coefficient is $r_{y\tilde{y}}^{2} = +1.00$, and the agreement percentage is also 1.00, i.e., all five of the y and $\tilde{y}$ paired values agree.

In this context, the squared Pearson product-moment correlation coefficient, $r_{y\tilde{y}}^{2}$, has also been used as a measure of agreement. However, $r_{y\tilde{y}}^{2} = +1.00$ implies a linear relationship between y and $\tilde{y}$, where both the intercept and slope are arbitrary. While perfect agreement is described by $\mathfrak{R} = +1.00$, $r_{y\tilde{y}}^{2} = +1.00$ describes a linear relationship that may or may not reflect perfect agreement, as depicted in Fig. 4.2 with N = 5 $(y,\,\tilde{y})$ values: (2, 4), (4, 5), (6, 6), (8, 7), and (10, 8). For the N = 5 bivariate data points depicted in Fig. 4.2, the intercept is $\tilde{\beta }_{0} = +3.00$, the unstandardized slope is $\tilde{\beta }_{1} = +0.50$, the squared Pearson product-moment correlation coefficient is $r_{y\tilde{y}}^{2} = +1.00$, and the agreement percentage is 0.20, i.e., only one (6, 6) of the N = 5 y and $\tilde{y}$ paired values agrees.
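The contrast between the two measures can be reproduced with the ten data points quoted for Figs. 4.1 and 4.2. The sketch below uses absolute differences (v = 1, r = 1); the helper names `agreement_R` and `pearson_r2` are illustrative only.

```python
def agreement_R(y, y_pred):
    """Chance-corrected agreement for univariate responses with v = 1."""
    n = len(y)
    delta = sum(abs(p - o) for p, o in zip(y_pred, y)) / n
    mu_delta = sum(abs(p - o) for p in y_pred for o in y) / n ** 2
    return 1.0 - delta / mu_delta

def pearson_r2(y, y_pred):
    """Squared Pearson product-moment correlation coefficient."""
    n = len(y)
    my, mp = sum(y) / n, sum(y_pred) / n
    sxy = sum((a - my) * (b - mp) for a, b in zip(y, y_pred))
    sxx = sum((a - my) ** 2 for a in y)
    syy = sum((b - mp) ** 2 for b in y_pred)
    return sxy ** 2 / (sxx * syy)

y = [2, 4, 6, 8, 10]
fig_4_1 = [2, 4, 6, 8, 10]   # perfect agreement
fig_4_2 = [4, 5, 6, 7, 8]    # perfectly linear, but not in agreement

for label, y_pred in (("Fig. 4.1", fig_4_1), ("Fig. 4.2", fig_4_2)):
    print(label, round(pearson_r2(y, y_pred), 4), round(agreement_R(y, y_pred), 4))
```

For the Fig. 4.2 values, δ = 6/5 and the expected value is 66/25, so the agreement measure is 1 − 30/66, roughly 0.55, even though the squared correlation remains 1.00.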

Comparisons of $\mathfrak{R}$ with other measures of agreement and the advantages of $\mathfrak{R}$ relative to the other agreement measures were detailed in a 1996 article by Watterson [416].

While the agreement measure $\mathfrak{R}$ provides a description of the functional relationship between $(\tilde{\mathbf{y}}_{1},\,\ldots,\,\tilde{\mathbf{y}}_{N})$ and $(\mathbf{y}_{1},\,\ldots,\,\mathbf{y}_{N})$, it does not indicate how extreme an observed value of $\mathfrak{R}$, say $\mathfrak{R}_{\text{o}}$, is relative to the N! possible values of $\mathfrak{R}$ under the null hypothesis. Since $\mu_{\delta}$ is invariant under the null hypothesis and the observed value of δ is given by

$$\displaystyle{ \delta _{\text{o}} =\mu _{\delta }(1 -\mathfrak{R}_{\text{o}})\;, }$$

the exact probability value for $\mathfrak{R}_{\text{o}}$ is given by

$$\displaystyle{ P\big(\mathfrak{R}\geq \mathfrak{R}_{\text{o}}\vert H_{0}\big) = P\big(\delta \leq \delta _{\text{o}}\vert H_{0}\big) = \frac{\mbox{ number of $\delta $ values $ \leq \delta _{\text{o}}$}} {M} \;, }$$

where M = N!. Because an exact probability value requires generating N! arrangements of the observed data, calculation of an exact value is prohibitive even for small values of N, e.g., $M = N! = 15! = 1,307,674,368,000$.

When M is very large, an approximate probability value for δ may be obtained from a resampling permutation procedure. Let L denote the number of arrangements randomly sampled from all M possible arrangements of the observed data, where L is typically a large number, e.g., L = 1,000,000. Then, an approximate resampling probability value is given by

$$\displaystyle{ P\big(\mathfrak{R}\geq \mathfrak{R}_{\text{o}}\vert H_{0}\big) = P\big(\delta \leq \delta _{\text{o}}\vert H_{0}\big) = \frac{\mbox{ number of $\delta $ values $ \leq \delta _{\text{o}}$}} {L} \;. }$$

Also, when M is very large and P is exceedingly small, a moment-approximation permutation procedure based on fitting the first three exact moments of the discrete permutation distribution to a Pearson type III distribution provides approximate probability values, as detailed in Chap. 1, Sect. 1.2.2; see also references [284] and [300].
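The resampling probability value described above can be sketched as follows for univariate responses with v = 1. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the seed, and the default of L = 10,000 shuffles (far smaller than the L = 1,000,000 used in the text) are all choices made for the example.

```python
import random

def delta(pred, obs):
    """Eq. (4.1) for univariate responses with v = 1."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)

def resampling_p_value(pred, obs, L=10_000, seed=0):
    """Proportion of L random pairings whose delta is <= the observed delta."""
    rng = random.Random(seed)
    delta_o = delta(pred, obs)
    shuffled = list(pred)
    hits = 0
    for _ in range(L):
        rng.shuffle(shuffled)  # random pairing of predicted with fixed observed values
        if delta(shuffled, obs) <= delta_o + 1e-12:  # tolerance for rounding error
            hits += 1
    return hits / L

# Values quoted for Fig. 4.1: only the identity pairing attains delta = 0,
# so the exact probability value is 1/5! = 1/120.
y = [2, 4, 6, 8, 10]
print(resampling_p_value(y, y))
```

The estimate fluctuates around the exact value from seed to seed, which is why L is normally set to a large number.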

4.2 Example LAD Regression Analyses

In this section, example analyses illustrate the permutation approach to typical multiple regression problems. The first example analyzes a small set of multivariate response measurement scores using LAD regression and generates a resampling permutation probability value; the second example analyzes the same small set of multivariate response measurement scores using OLS regression and also generates a resampling permutation probability value; the third example analyzes the same set of multivariate response measurement scores using OLS regression, but provides a conventional approximate probability value based on Snedecor’s F distribution.

4.2.1 Example Analysis 1

Consider the multiple regression data listed in Fig. 4.3, where a response measurement score and s = 2 predictor values have been obtained for each of N = 12 objects, $y_{1},\,\ldots,\,y_{N}$ denotes the observed response measurement scores for the N objects, and $\mathbf{x}_{i}^{\,{\prime}} = (x_{1i},\,x_{2i})$ is the row vector of s = 2 predictor variables for the ith of N objects.

Because there are $M = 12! = 479,001,600$ possible, equally-likely arrangements of the N = 12 multivariate response measurement scores in Fig. 4.3, an exact permutation approach is impractical and a resampling procedure is mandated.

A LAD regression analysis of the multivariate response measurement scores listed in Fig. 4.3 yields estimated regression coefficients of

$$\displaystyle{ \tilde{\beta }_{0} = +3.8571\;,\quad \tilde{\beta }_{1} = +0.4286\;,\mbox{ and}\quad \tilde{\beta }_{2} = +0.1429\;. }$$

^{Footnote 1} Figure 4.4 lists the observed $y_{i}$ values, LAD predicted $\tilde{y}_{i}$ values, and residual $e_{i}$ values for i = 1, …, 12.

Following Eq. (4.1) on p. 117 with v = 1, the observed value of the MRPP test statistic calculated on the LAD regression residuals listed in Fig. 4.4 is $\delta_{\text{o}} = 1.50$.

If all M possible arrangements of the N = 12 observed LAD regression residuals listed in Fig. 4.4 occur with equal chance, the approximate resampling probability value of $\delta_{\text{o}} = 1.50$ calculated on L = 1,000,000 random arrangements of the observed LAD regression residuals is

$$\displaystyle{ P\big(\delta \leq \delta _{\text{o}}\vert H_{0}\big) = \frac{\mbox{ number of $\delta $ values $ \leq \delta _{\text{o}}$}} {L} = \frac{19,128} {1,000,000} = 0.0191\;. }$$

Following Eq. (4.3) on p. 117, the exact expected value of the M = 479,001,600 δ values is $\mu_{\delta} = 1.8294$ and, following Eq. (4.2) on p. 117, the observed chance-corrected measure of effect size for the $y_{i}$ and $\tilde{y}_{i}$ values, i = 1, …, N, is

$$\displaystyle{ \mathfrak{R}_{\text{o}} = 1 -\frac{\delta _{\text{o}}} {\mu _{\delta }} = 1 - \frac{1.50} {1.8294} = +0.1800\;, }$$

indicating 18 % agreement between the observed and predicted y values above that expected by chance.

4.2.2 Example Analysis 2

For a second example analysis of the multivariate response measurement scores listed in Fig. 4.3 on p. 120, consider an OLS regression analysis based on a resampling permutation procedure. An OLS regression analysis of the multivariate response measurement scores listed in Fig. 4.3 yields estimated regression coefficients of

$$\displaystyle{ \hat{\beta }_{0} = +6.8198\;,\quad \hat{\beta }_{1} = +0.6356\;,\mbox{ and}\quad \hat{\beta }_{2} = -0.0649\;. }$$

Figure 4.5 lists the observed $y_{i}$ values, OLS predicted $\hat{y}_{i}$ values, and residual $e_{i}$ values for i = 1, …, 12.

Following Eq. (4.1) on p. 117 with v = 2, the observed value of the MRPP test statistic computed on the OLS regression residuals listed in Fig. 4.5 is $\delta_{\text{o}} = 3.1502$. If all M possible arrangements of the N = 12 observed OLS regression residuals listed in Fig. 4.5 occur with equal chance, the approximate resampling probability value of $\delta_{\text{o}} = 3.1502$ computed on L = 1,000,000 random arrangements of the observed OLS regression residuals is

$$\displaystyle{ P\big(\delta \leq \delta _{\text{o}}\vert H_{0}\big) = \frac{\mbox{ number of $\delta $ values $ \leq \delta _{\text{o}}$}} {L} = \frac{96,104} {1,000,000} = 0.0961\;. }$$

For comparison, the approximate resampling probability value based on LAD regression in Example 1 is P = 0.0191.

Following Eq. (4.3) on p. 117, the exact expected value of the M = 479,001,600 δ values is $\mu_{\delta} = 5.2942$ and, following Eq. (4.2) on p. 117, the observed chance-corrected measure of effect size for the $y_{i}$ and $\hat{y}_{i}$ values, i = 1, …, N, is

$$\displaystyle{ \mathfrak{R}_{\text{o}} = 1 -\frac{\delta _{\text{o}}} {\mu _{\delta }} = 1 -\frac{3.1502} {5.2942} = +0.4050\;, }$$

indicating approximately 41 % agreement between the observed and predicted y values above that expected by chance.

4.2.3 Example Analysis 3

Finally, consider a conventional OLS regression analysis of the multivariate response measurement scores listed in Fig. 4.3 on p. 120. An OLS regression analysis yields estimated regression coefficients of

$$\displaystyle{ \hat{\beta }_{0} = +6.8198\;,\quad \hat{\beta }_{1} = +0.6356\;,\mbox{ and}\quad \hat{\beta }_{2} = -0.0649\;, }$$

the regression residuals are listed in Fig. 4.5, and the observed squared multiple correlation coefficient is $R_{y.x_{1},x_{2}}^{2} = 0.2539$. $R_{y.x_{1},x_{2}}^{2}$ may be transformed into an F-ratio by

$$\displaystyle{ F = \frac{(N - s - 1)R_{y.x_{1},x_{2}}^{2}} {s(1 - R_{y.x_{1},x_{2}}^{2})} = \frac{(12 - 2 - 1)(0.2539)} {(2)(1 - 0.2539)} = 1.5313\;. }$$

Assuming independence, normality, and homogeneity of variance, F is approximately distributed as Snedecor’s F under the null hypothesis with $\nu _{1} = s = 2$ and $\nu _{2} = N - s - 1 = 12 - 2 - 1 = 9$ degrees of freedom. Under the null hypothesis, the observed value of $F_{\text{o}} = 1.5313$ yields an approximate probability value of P = 0.2677.

Note that the asymptotic probability value based on OLS regression in Example 3 is P = 0.2677, while a resampling analysis of the same data in Example 2 yielded a probability value, again based on OLS regression, of P = 0.0961, a marked difference. Moreover, a LAD regression analysis of the same data in Example 1 yielded an approximate resampling probability value of P = 0.0191, once again demonstrating the different results possible with v = 1 and v = 2, both with and without a permutation analysis.
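The conversion from the squared multiple correlation to an F-ratio and its probability value in Example 3 can be checked with a short sketch. The closed-form upper tail used here is a standard property of the F distribution that holds only when the numerator degrees of freedom equal 2, as they do in this example; it is not a formula given in the chapter, and the function names are illustrative.

```python
def f_from_r2(r2, n, s):
    """F-ratio from a squared multiple correlation with s predictors and n objects."""
    return (n - s - 1) * r2 / (s * (1.0 - r2))

def f_upper_tail_two_numerator_df(f, nu2):
    """P(F >= f) when nu1 = 2; this closed form is not valid for other nu1."""
    return (1.0 + 2.0 * f / nu2) ** (-nu2 / 2.0)

f_obs = f_from_r2(0.2539, n=12, s=2)
p_value = f_upper_tail_two_numerator_df(f_obs, nu2=9)
print(round(f_obs, 4), round(p_value, 4))
```

The computed F-ratio can differ from the reported 1.5313 in the last decimal place because the squared correlation is entered here already rounded to four places; the probability value is approximately 0.2677.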

4.3 LAD Regression and Analysis of Variance Designs

It is well known that experimental designs that would ordinarily be analyzed by some form of analysis of variance can also be analyzed by OLS multiple regression using either dummy- or effect-coding schemes. The same is true of LAD regression. In this section a variety of analysis-of-variance designs are analyzed using MRPP, LAD regression, and either dummy or effect coding of treatment groups; included are one-way randomized, one-way randomized with a covariate, one-way randomized-block, two-way randomized-block, two-way factorial, Latin square, split-plot, and two-factor nested analysis-of-variance designs.

4.3.1 One-Way Randomized Design

Consider a one-way completely randomized experimental design with fixed effects in which N = 26 objects have been randomly assigned to one of g = 3 treatment groups with $n_{1} = 8$ and $n_{2} = n_{3} = 9$. The design and data are adapted from Stevens [387, p. 70] and are given in Fig. 4.6.

For a one-way randomized experimental design, the appropriate regression model is given by

$$\displaystyle{ y_{i} =\sum _{ j=1}^{m}x_{ ij}\beta _{j} + e_{i}\;, }$$

where $y_{i}$ denotes the ith of N responses possibly affected by a treatment; $x_{ij}$ is the jth of m covariates associated with the ith response, where $x_{i1} = 1$ if the model includes an intercept; $\beta_{j}$ denotes the jth of m regression parameters; and $e_{i}$ designates the error associated with the ith of N responses. If the estimates of $\beta_{1},\,\ldots,\,\beta_{m}$ that minimize

$$\displaystyle{ \sum _{i=1}^{N}\vert e_{ i}\vert }$$

are denoted by $\tilde{\beta }_{1},\,\ldots,\,\tilde{\beta }_{m}$, then the N residuals of the LAD regression model are given by $e_{i} = y_{i} -\tilde{ y}_{i}$ for i = 1, …, N, where the predicted value of y _i is given by

$$\displaystyle{ \tilde{y}_{i} =\sum _{ j=1}^{m}x_{ ij}\tilde{\beta }_{j}\;,\qquad i = 1,\,\ldots,\,N\;. }$$

In contrast, OLS regression estimators of $\beta_{1},\,\ldots,\,\beta_{m}$ minimize

$$\displaystyle{ \sum _{i=1}^{N}e_{ i}^{2}\;, }$$

the N residuals of the OLS regression model are given by $e_{i} = y_{i} -\hat{ y}_{i}$ for i = 1, …, N, and the predicted value of $y_{i}$ is given by

$$\displaystyle{ \hat{y}_{i} =\sum _{ j=1}^{m}x_{ ij}\hat{\beta }_{j}\;,\qquad i = 1,\,\ldots,\,N\;. }$$
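The difference between the two criteria is easiest to see in the intercept-only case (m = 1 with every covariate equal to 1), where the LAD estimate is a median of the responses and the OLS estimate is their mean. The sketch below assumes a small hypothetical data set with one extreme value; it does not implement general LAD fitting with several predictors, which requires a linear-programming or iterative solver.

```python
from statistics import mean, median

def sum_abs_residuals(y, b0):
    """LAD criterion for an intercept-only model."""
    return sum(abs(v - b0) for v in y)

def sum_sq_residuals(y, b0):
    """OLS criterion for an intercept-only model."""
    return sum((v - b0) ** 2 for v in y)

# Hypothetical responses; the last value is deliberately extreme.
y = [9, 10, 11, 12, 13, 14, 40]

b0_lad = median(y)  # minimizes the sum of absolute residuals
b0_ols = mean(y)    # minimizes the sum of squared residuals

print("LAD intercept:", b0_lad, "OLS intercept:", round(b0_ols, 4))
print("sum |e| at LAD vs OLS:", sum_abs_residuals(y, b0_lad), sum_abs_residuals(y, b0_ols))
print("sum e^2 at LAD vs OLS:", sum_sq_residuals(y, b0_lad), sum_sq_residuals(y, b0_ols))
```

The extreme value pulls the OLS intercept well above the bulk of the data, while the LAD intercept stays at the central response, which illustrates the robustness argument made in Sect. 4.1.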

If the N regression residuals are partitioned into g disjoint treatment groups of sizes $n_{1},\,\ldots,\,n_{g}$, where $n_{i} \geq 2$ for i = 1, …, g and

$$\displaystyle{ N =\sum _{ i=1}^{g}n_{ i}\;, }$$

then the permutation test depends on test statistic

$$\displaystyle{ \delta =\sum _{ i=1}^{g}C_{ i}\xi _{i}\;, }$$

(4.4)

where

$$\displaystyle{ C_{i} = \frac{n_{i}} {N}\;,\qquad i = 1,\,\ldots,\,g\;, }$$

is a positive weight for the ith of g treatment groups that minimizes the variability of δ,

$$\displaystyle{ \sum _{i=1}^{g}C_{ i} = 1\;, }$$

and $\xi _{i}$ is the average pairwise Euclidean difference among the $n_{i}$ residuals in the ith of g treatment groups defined by

$$\displaystyle{ \xi _{i} = \binom{n_{i}}{2}^{\!-1}\,\sum _{ j=1}^{N-1}\,\sum _{ k=j+1}^{N}\Big[\big(e_{ j} - e_{k}\big)^{2}\Big]^{v/2}\Psi _{ ji}\,\Psi _{ki}\;, }$$

(4.5)

where v = 1 for LAD regression and

$$\displaystyle{ \Psi _{ji} = \left \{\begin{array}{@{}l@{\quad }l@{}} \,1 \quad &\mbox{ if $e_{j}$ is in the $i$th treatment group}\;, \\ [6pt]\,0\quad &\text{otherwise}\;. \end{array} \right. }$$
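Equations (4.4) and (4.5) can be sketched in code by storing the residuals group by group, so that the indicator function is implicit in the grouping. Note that the bracketed term in Eq. (4.5) reduces to the absolute difference raised to the power v. The residual values and function names below are hypothetical.

```python
from itertools import combinations

def xi(group, v=1.0):
    """Eq. (4.5): average of |e_j - e_k|^v over all pairs within one group."""
    pairs = list(combinations(group, 2))
    return sum(abs(a - b) ** v for a, b in pairs) / len(pairs)

def mrpp_delta(groups, v=1.0):
    """Eq. (4.4) with weights C_i = n_i / N."""
    n_total = sum(len(g) for g in groups)
    return sum((len(g) / n_total) * xi(g, v) for g in groups)

# Hypothetical residuals for g = 2 treatment groups of sizes 3 and 2.
groups = [[-1.0, 0.0, 1.0], [2.0, 4.0]]
print(xi(groups[0]), xi(groups[1]), mrpp_delta(groups))
```

Each group must contain at least two residuals, matching the requirement that every group size be at least 2, since otherwise no within-group pairs exist.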

The null hypothesis specifies that each of the

$$\displaystyle{ M = \frac{N!} {\prod _{i=1}^{g}n_{ i}!} }$$

allocations of the N residuals to the g treatment groups is equally likely with $n_{i}$, i = 1, …, g, residuals preserved for each arrangement of the observed data. The exact probability value of an observed value of δ, $\delta_{\text{o}}$, is given by

$$\displaystyle{ P\big(\delta \leq \delta _{\text{o}}\vert H_{0}\big) = \frac{\mbox{ number of $\delta $ values $ \leq \delta _{\text{o}}$}} {M} \;. }$$

As previously, when M is large, an approximate probability value of δ may be obtained from a resampling procedure, where

$$\displaystyle{ P\big(\delta \leq \delta _{\text{o}}\vert H_{0}\big) = \frac{\mbox{ number of $\delta $ values $ \leq \delta _{\text{o}}$}} {L} }$$

and L denotes the number of resampled test statistic values. Typically, L is set to a large number to ensure accuracy, e.g., L = 1,000,000. When M is very large and P is exceedingly small, a resampling-approximation permutation procedure may produce no δ values equal to or less than $\delta_{\text{o}}$, even with L = 1,000,000, yielding an approximate resampling probability value of P = 0.00. In such cases, moment-approximation permutation procedures based on fitting the first three exact moments of the discrete permutation distribution to a Pearson type III distribution provide approximate probability values, as detailed in Chap. 1, Sect. 1.2.2 [284, 300].
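The resampling procedure for grouped residuals can be sketched as follows. The residuals, group sizes, seed, and the default of L = 10,000 shuffles are hypothetical choices for illustration, and the function names are not from the text; a production analysis would use a much larger L.

```python
import random
from itertools import combinations

def mrpp_delta(residuals, sizes, v=1.0):
    """Weighted average within-group distance with C_i = n_i / N.

    Groups are consecutive slices of `residuals` with the given sizes.
    """
    n_total = len(residuals)
    delta, start = 0.0, 0
    for n_i in sizes:
        group = residuals[start:start + n_i]
        pairs = list(combinations(group, 2))
        xi = sum(abs(a - b) ** v for a, b in pairs) / len(pairs)
        delta += (n_i / n_total) * xi
        start += n_i
    return delta

def resampling_p_value(residuals, sizes, L=10_000, seed=1):
    """Proportion of L random allocations with delta <= the observed delta."""
    rng = random.Random(seed)
    delta_o = mrpp_delta(residuals, sizes)
    shuffled = list(residuals)
    hits = 0
    for _ in range(L):
        rng.shuffle(shuffled)  # reallocate residuals, preserving group sizes
        if mrpp_delta(shuffled, sizes) <= delta_o + 1e-12:
            hits += 1
    return hits / L

# Hypothetical residuals: two well-separated groups of four.
residuals = [-5.0, -4.5, -4.0, -3.5, 3.5, 4.0, 4.5, 5.0]
print(resampling_p_value(residuals, sizes=[4, 4]))
```

For this small example there are only 8!/(4! 4!) = 70 distinct allocations, two of which reproduce the observed separation, so the estimate should fall near 2/70.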

An index of the effect size for the $y_{i}$ and $\tilde{y}_{i}$ values, i = 1, …, N, is given by the chance-corrected measure

$$\displaystyle{ \mathfrak{R} = 1 -\frac{\delta } {\mu _{\delta }}\;, }$$

(4.6)

where $\mu_{\delta}$ is the arithmetic average of the δ values calculated on all M equally-likely arrangements of the observed response measurements, i.e.,

$$\displaystyle{ \mu _{\delta } = \frac{1} {M}\sum _{i=1}^{M}\delta _{ i}\;. }$$

(4.7)

A design matrix of dummy codes for an MRPP regression analysis of the N = 26 response measurement scores in Fig. 4.6 is given in Fig. 4.7 where the first columns of 1 values provide for an intercept. The second columns contain the N = 26 univariate response measurement scores listed according to the original random assignment of the N = 26 objects to the g = 3 treatment groups with the first $n_{1} = 8$ scores, the next $n_{2} = 9$ scores, and the last $n_{3} = 9$ scores associated with the first, second, and third treatment groups, respectively.

Because the purpose of the analysis is to test for possible differences among the g = 3 treatment groups, a reduced regression model is constructed without a variate for treatments. Therefore, for a single-factor experiment the design matrix for the reduced model is composed solely of a code for the intercept. The MRPP regression analysis examines the N = 26 regression residuals for possible differences among the g = 3 treatment levels; consequently, no dummy codes for treatments are included in Fig. 4.7 as this information is implicit in the ordering of the g = 3 treatment groups in the three columns labeled “Score” with $n_{1} = 8$ and $n_{2} = n_{3} = 9$ values.

An exact permutation solution is impractical for the univariate response measurements listed in Fig. 4.7 since there are

$$\displaystyle{ M = \frac{N!} {\prod _{i=1}^{g}n_{ i}!} = \frac{26!} {8!\;9!\;9!} = 75,957,810,500 }$$

possible, equally-likely arrangements of the N = 26 univariate response measurement scores; consequently, a resampling procedure is the default in this case.
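The arrangement counts quoted in this chapter can be verified with exact integer arithmetic. The sketch below uses only the Python standard library; the function name `arrangements` is an illustrative choice.

```python
from math import factorial, prod

def arrangements(sizes):
    """Number of distinct allocations of N objects to groups of the given sizes."""
    return factorial(sum(sizes)) // prod(factorial(n) for n in sizes)

print(factorial(12))            # pairings for N = 12: 479001600
print(arrangements([8, 9, 9]))  # allocations for the one-way design: 75957810500
```

Counts of this size explain why exhaustive enumeration is replaced by resampling: even at one million arrangements per second, the one-way design would take roughly 21 hours to enumerate.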

LAD Regression Analysis

An MRPP resampling analysis of the LAD regression residuals calculated on the univariate response measurement scores listed in Fig. 4.7 yields an estimated LAD regression coefficient of $\tilde{\beta }_{0} = +12.00$. Figure 4.8 lists the observed $y_{i}$ values, LAD predicted $\tilde{y}_{i}$ values, and residual $e_{i}$ values for i = 1, …, 26.

Following Eq. (4.5) on p. 125 and employing ordinary Euclidean distance between residuals with v = 1, the N = 26 LAD regression residuals listed in Fig. 4.8 yield g = 3 average distance-function values of

$$\displaystyle{ \xi _{1} = 4.50\;,\quad \xi _{2} = 4.2222\;,\mbox{ and}\quad \xi _{3} = 6.8889\;. }$$

Following Eq. (4.4) on p. 125, the observed value of the MRPP test statistic calculated on the LAD regression residuals listed in Fig. 4.8 with v = 1 and treatment-group weights

$$\displaystyle{ C_{i} = \frac{n_{i}} {N} \;,\qquad i = 1,2,3\;, }$$

is

$$\displaystyle{ \delta _{\text{o}} =\sum _{ i=1}^{g}C_{ i}\xi _{i} = \frac{1} {26}\big[(8)(4.50) + (9)(4.2222) + (9)(6.8889)\big] = 5.2308\;. }$$

If all M possible arrangements of the N = 26 observed LAD regression residuals listed in Fig. 4.8 occur with equal chance, the approximate resampling probability value of $\delta_{\text{o}} = 5.2308$ computed on L = 1,000,000 random arrangements of the observed LAD regression residuals with $n_{1} = 8$ and $n_{2} = n_{3} = 9$ residuals preserved for each arrangement is

$$\displaystyle{ P\big(\delta \leq \delta _{\text{o}}\vert H_{0}\big) = \frac{\mbox{ number of $\delta $ values $ \leq \delta _{\text{o}}$}} {L} = \frac{12,062} {1,000,000} = 0.0121\;. }$$

Following Eq. (4.7) on p. 126, the exact expected value of the M δ values is $\mu_{\delta} = 6.1262$ and, following Eq. (4.6) on p. 126, the observed chance-corrected measure of effect size for the $y_{i}$ and $\tilde{y}_{i}$ values, i = 1, …, N, is

$$\displaystyle{ \mathfrak{R}_{\text{o}} = 1 -\frac{\delta _{\text{o}}} {\mu _{\delta }} = 1 -\frac{5.2308} {6.1262} = +0.1462\;, }$$

indicating approximately 15 % agreement between the observed and predicted y values above that expected by chance.

OLS Regression Analysis

For comparison, consider an MRPP resampling analysis of OLS regression residuals calculated on the N = 26 univariate response measurement scores listed in Fig. 4.7 on p. 127. The MRPP regression analysis yields an estimated OLS regression coefficient of $\hat{\beta }_{0} = +14.2692$. Figure 4.9 lists the observed $y_{i}$ values, OLS predicted $\hat{y}_{i}$ values, and residual $e_{i}$ values for i = 1, …, 26.

Following Eq. (4.5) on p. 125 and employing squared Euclidean distance between residuals with v = 2, the N = 26 OLS regression residuals listed in Fig. 4.9 yield g = 3 average distance-function values of

$$\displaystyle{ \xi _{1} = 29.7143\;,\quad \xi _{2} = 25.00\;,\mbox{ and}\quad \xi _{3} = 103.2222\;. }$$

Following Eq. (4.4) on p. 125, the observed value of the MRPP test statistic calculated on the OLS regression residuals listed in Fig. 4.9 with v = 2 and treatment-group weights

$$\displaystyle{ C_{i} = \frac{n_{i} - 1} {N - g}\;,\qquad i = 1,2,3\;, }$$

is

$$\displaystyle\begin{array}{rcl} \delta _{\text{o}} =\sum _{ i=1}^{g}C_{ i}\xi _{i} = \frac{1} {26 - 3}\big[(8 - 1)(29.7143)& +& (9 - 1)(25.00) {}\\ & +& (9 - 1)(103.2222)\big] = 53.6425\;. {}\\ \end{array}$$

If all M possible arrangements of the N = 26 observed OLS regression residuals listed in Fig. 4.9 occur with equal chance, the approximate resampling probability value of $\delta_{\text{o}} = 53.6425$ computed on L = 1,000,000 random arrangements of the observed OLS regression residuals with $n_{1} = 8$ and $n_{2} = n_{3} = 9$ residuals preserved for each arrangement is

$$\displaystyle{ P\big(\delta \leq \delta _{\text{o}}\vert H_{0}\big) = \frac{\mbox{ number of $\delta $ values $ \leq \delta _{\text{o}}$}} {L} = \frac{91,842} {1,000,000} = 0.0918\;. }$$

For comparison, the approximate resampling probability value based on LAD regression, v = 1, L = 1,000,000, and $C_{i} = n_{i}/N$ for i = 1, 2, 3 is P = 0.0121.

Following Eq. (4.7) on p. 126, the exact expected value of the M = 75,957,810,500 δ values is $\mu_{\delta} = 60.5692$ and, following Eq. (4.6) on p. 126, the observed chance-corrected measure of effect size for the $y_{i}$ and $\hat{y}_{i}$ values, i = 1, …, N, is

$$\displaystyle{ \mathfrak{R}_{\text{o}} = 1 -\frac{\delta _{\text{o}}} {\mu _{\delta }} = 1 -\frac{53.6425} {60.5692} = +0.1144\;, }$$

indicating approximately 11 % agreement between the observed and predicted y values above that expected by chance.
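The two weighting schemes used in the LAD and OLS analyses of this design can be compared side by side. The sketch below re-computes the observed test statistics and effect sizes from the rounded group averages and expected values reported above, so the results agree with the text only up to rounding; the function name `weighted_delta` is illustrative.

```python
def weighted_delta(xis, weights):
    """Eq. (4.4): weighted sum of the within-group average distances."""
    return sum(c * x for c, x in zip(weights, xis))

sizes = [8, 9, 9]
N, g = sum(sizes), len(sizes)

lad_weights = [n / N for n in sizes]              # C_i = n_i / N, used with v = 1
ols_weights = [(n - 1) / (N - g) for n in sizes]  # C_i = (n_i - 1)/(N - g), used with v = 2

delta_lad = weighted_delta([4.50, 4.2222, 6.8889], lad_weights)
delta_ols = weighted_delta([29.7143, 25.00, 103.2222], ols_weights)

R_lad = 1.0 - delta_lad / 6.1262    # Eq. (4.6) with the reported LAD expected value
R_ols = 1.0 - delta_ols / 60.5692   # Eq. (4.6) with the reported OLS expected value

print(round(delta_lad, 4), round(R_lad, 4))
print(round(delta_ols, 4), round(R_ols, 4))
```

Both sets of weights sum to 1, so in each case the test statistic is a weighted average of the three within-group distances.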

Conventional ANOVA Analysis

A conventional fixed-effects one-way analysis of variance calculated on the N = 26 univariate response measurement scores listed in Fig. 4.6 on p. 124 yields an observed F-ratio of $F_{\text{o}} = 2.6141$. Assuming independence, normality, and homogeneity of variance, F is approximately distributed as Snedecor’s F under the null hypothesis with $\nu _{1} = g - 1 = 3 - 1 = 2$ and $\nu _{2} = N - g = 26 - 3 = 23$ degrees of freedom. Under the null hypothesis, the observed value of $F_{\text{o}} = 2.6141$ yields an approximate probability value of P = 0.0948, which is similar to that produced by the MRPP resampling analysis of the OLS regression residuals.

4.3.2 One-Way Randomized Design with a Covariate

A covariate experimental design permits the testing of differences among the treatment groups after the effect of the covariate has been removed from the analysis. Consider a one-way completely randomized design with a covariate in which N = 47 objects are randomly assigned to one of g = 5 treatment groups. The experimental data are listed in Table 4.1 and are adapted from a 1984 study by Conti and Musty [78].

Table 4.1 Example data for a one-way randomized design with a covariate, consisting of pre-test (Pre) and post-test (Post) response measurement scores on N = 47 randomly assigned objects to g = 5 treatment groups


A design matrix of dummy codes for analyzing treatments is given in Fig. 4.10, where the first column of 1 values provides for an intercept, the second column contains the covariate (Pre-test) values, and the third column contains the (Post-test) scores listed according to the original random assignment of the N = 47 objects to the g = 5 treatment groups with the first $n_{1} = 10$ scores, the next $n_{2} = 10$ scores, the next $n_{3} = 9$ scores, the next $n_{4} = 8$ scores, and the last $n_{5} = 10$ scores associated with the g = 5 treatment groups, respectively.

The MRPP regression analysis examines the N = 47 regression residuals for possible differences among the g = 5 treatment levels; consequently, no dummy codes for treatments are included in Fig. 4.10 as this information is implicit in the ordering of the g = 5 treatment groups in the two paired columns labeled “Pre” and “Post.”

Because there are

$$\displaystyle{ M = \frac{N!} {\prod _{i=1}^{g}n_{ i}!} = \frac{47!} {10!\;10!\;9!\;8!\;10!} = 369,908,998,147,203,213,613,129,815,600 }$$

possible, equally-likely arrangements of the N = 47 univariate response measurement scores listed in Table 4.1, an exact permutation approach is not possible and a resampling analysis is mandated.