Chapter 3 of The Measurement of Association applied permutation statistical methods to measures of association based on Pearson’s chi-squared test statistic for two nominal-level (categorical) variables, e.g., Pearson’s ϕ 2, Tschuprov’s T 2, Cramér’s V 2, and Pearson’s C. This fourth chapter of The Measurement of Association continues the examination of measures of association designed for nominal-level variables, but concentrates on exact and Monte Carlo permutation statistical methods for measures of nominal association that are based on criteria other than Pearson’s chi-squared test statistic. First, two asymmetric measures of nominal-level association proposed by Goodman and Kruskal in 1954, λ and t, are described [37]. Next, Cohen’s unweighted kappa coefficient, κ, provides an introduction to the measurement of agreement, in contrast to measures of association [23]. Also included in Chap. 4 are McNemar’s test [63] and Cochran’s Q test [22], which measure the degree to which response measurements change over time, Leik and Gove’s [52] \(d_{N}^{\,c}\) measure of nominal association, and a solution to the matrix occupancy problem proposed by Mielke and Siddiqui [68]. Fisher’s [32] exact probability test is the iconic permutation test for contingency tables. While Fisher’s exact test is typically limited to the 2×2 contingency tables for which it was originally intended, in this chapter Fisher’s exact test is extended to 2×c, 3×3, 2×2×2, and other larger contingency tables.

Some measures designed for ordinal-level variables also serve as measures of association for nominal-level variables when r = 2 rows and c = 2 columns, i.e., a 2×2 contingency table. Other measures were originally designed for 2×2 contingency tables with nominal-level variables. Included in measures of association for 2×2 contingency tables are percentage differences, Yule’s Q and Y  measures [90], the odds ratio, and Somers’ asymmetric measures, d yx and d xy [78]. These measures are more appropriately described and discussed in Chaps. 9 and 10, which are devoted to measures of association for analyzing 2×2 contingency tables, where the level of measurement is often irrelevant.

4.1 Hypergeometric Probability Values

Exact permutation statistical methods, especially when applied to contingency tables, are heavily dependent on hypergeometric probability values. In this section, a brief introduction to hypergeometric probability values illustrates their calculation and interpretation. For 2×2 contingency tables, the calculation of hypergeometric probability values is easily demonstrated. Consider the 2×2 contingency table in Table 4.1 where n 11, …, n 22 denote the four cell frequencies, R 1 and R 2 denote the two row marginal frequency totals, C 1 and C 2 denote the two column marginal frequency totals, and

$$\displaystyle \begin{aligned} N = \sum_{i=1}^{2}\,\sum_{j=1}^{2} n_{ij}\;. \end{aligned} $$
Table 4.1 Notation for a 2×2 contingency table

Because the contingency table given in Table 4.1 is a 2×2 table and, consequently, has only one degree of freedom, the probability of any one cell frequency constitutes the probability of the entire contingency table. Thus, the hypergeometric point probability value for the cell containing n 11 is given by:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} p(n_{11}|R_{1},C_{1},N) = \binom{C_{1}}{n_{11}}\binom{C_{2}}{n_{12}}\binom{N}{R_{1}}^{-1} &\displaystyle =&\displaystyle \binom{R_{1}}{n_{11}}\binom{R_{2}}{n_{21}}\binom{N}{C_{1}}^{-1} \\ &\displaystyle =&\displaystyle \frac{R_{1}!\;R_{2}!\;C_{1}!\;C_{2}!}{N!\;n_{11}!\;n_{12}!\;n_{21}!\;n_{22}!}\;. \end{array} \end{aligned} $$
(4.1)

To illustrate the calculation of a hypergeometric point probability value for a 2×2 contingency table, consider the frequency data given in Table 4.2 with N = 20 observations. Following Eq. (4.1)

$$\displaystyle \begin{aligned} p(n_{11}|R_{1},C_{1},N) = \frac{R_{1}!\;R_{2}!\;C_{1}!\;C_{2}!}{N!\;n_{11}!\;n_{12}!\;n_{21}!\;n_{22}!} = \frac{11!\;9!\;12!\;8!}{20!\;9!\;2!\;3!\;6!} = 0.0367\;. \end{aligned} $$
Table 4.2 Example 2×2 contingency table
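
The arithmetic of Eq. (4.1) is easy to check directly; the short Python sketch below evaluates the binomial-coefficient form of Eq. (4.1) for the Table 4.2 frequencies.

```python
from math import comb

# Eq. (4.1) applied to Table 4.2: C1 = 12, C2 = 8, R1 = 11, N = 20, n11 = 9, n12 = 2
p = comb(12, 9) * comb(8, 2) / comb(20, 11)
print(round(p, 4))  # 0.0367
```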

The calculation of hypergeometric probability values for r×c contingency tables is more complex than for simple 2×2 contingency tables. Consider the 4×3 contingency table given in Table 4.3 where n 11, …, n 43 denote the 12 cell frequencies, R 1, …, R 4 denote the four row marginal frequency totals, C 1, C 2, and C 3 denote the three column marginal frequency totals, and

$$\displaystyle \begin{aligned} N = \sum_{i=1}^{4}\,\sum_{j=1}^{3} n_{ij}\;. \end{aligned} $$
Table 4.3 Notation for a 4×3 contingency table

When there are only two rows, as in the previous 2×2 example, each column probability value is binomial, but with four rows each column probability value is multinomial. It is well known that a multinomial probability value can be obtained from an inter-connected series of binomial expressions. For example, for column A 1 in Table 4.3,

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \binom{C_{1}}{n_{11}}\binom{C_{1}-n_{11}}{n_{21}}\binom{C_{1}-n_{11}-n_{21}}{n_{31}} = \frac{C_{1}!}{n_{11}!\;(C_{1}-n_{11})!}\\ &\displaystyle &\displaystyle \qquad \qquad \times \frac{(C_{1}-n_{11})!}{n_{21}!\;(C_{1}-n_{11}-n_{21})!} \times \frac{(C_{1}-n_{11}-n_{21})!}{n_{31}!\;(C_{1}-n_{11}-n_{21}-n_{31})!}\\ &\displaystyle &\displaystyle \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad = \frac{C_{1}!}{n_{11}!\;n_{21}!\;n_{31}!\;n_{41}!}\;, \end{array} \end{aligned} $$

for column A 2 in Table 4.3,

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \binom{C_{2}}{n_{12}}\binom{C_{2}-n_{12}}{n_{22}}\binom{C_{2}-n_{12}-n_{22}}{n_{32}} = \frac{C_{2}!}{n_{12}!\;(C_{2}-n_{12})!}\\ &\displaystyle &\displaystyle \qquad \qquad \times \frac{(C_{2}-n_{12})!}{n_{22}!\;(C_{2}-n_{12}-n_{22})!} \times \frac{(C_{2}-n_{12}-n_{22})!}{n_{32}!\;(C_{2}-n_{12}-n_{22}-n_{32})!}\\ &\displaystyle &\displaystyle \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad = \frac{C_{2}!}{n_{12}!\;n_{22}!\;n_{32}!\;n_{42}!}\;, \end{array} \end{aligned} $$

for column A 3 in Table 4.3,

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \binom{C_{3}}{n_{13}}\binom{C_{3}-n_{13}}{n_{23}}\binom{C_{3}-n_{13}-n_{23}}{n_{33}} = \frac{C_{3}!}{n_{13}!\;(C_{3}-n_{13})!}\\ &\displaystyle &\displaystyle \qquad \qquad \times \frac{(C_{3}-n_{13})!}{n_{23}!\;(C_{3}-n_{13}-n_{23})!} \times \frac{(C_{3}-n_{13}-n_{23})!}{n_{33}!\;(C_{3}-n_{13}-n_{23}-n_{33})!}\\ &\displaystyle &\displaystyle \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad = \frac{C_{3}!}{n_{13}!\;n_{23}!\;n_{33}!\;n_{43}!}\;, \end{array} \end{aligned} $$

and for the row marginal frequency distribution in Table 4.3,

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \binom{N}{R_{1}}\binom{N-R_{1}}{R_{2}}\binom{N-R_{1}-R_{2}}{R_{3}} = \frac{N!}{R_{1}!\;(N-R_{1})!}\\ &\displaystyle &\displaystyle \qquad \qquad \times \frac{(N-R_{1})!}{R_{2}!\;(N-R_{1}-R_{2})!} \times \frac{(N-R_{1}-R_{2})!}{R_{3}!\;(N-R_{1}-R_{2}-R_{3})!}\\ &\displaystyle &\displaystyle \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad = \frac{N!}{R_{1}!\;R_{2}!\;R_{3}!\;R_{4}!}\;. \end{array} \end{aligned} $$

Thus, for an r×c contingency table,

$$\displaystyle \begin{aligned} p(n_{ij}|R_{i},C_{j},N) = \frac{\left( \,\displaystyle\prod_{i=1}^{r} R_{i}! \right) \left( \,\displaystyle\prod_{j=1}^{c} C_{j}! \right)}{N!\;\displaystyle\prod_{i=1}^{r}\,\prod_{j=1}^{c} n_{ij}!}\;. \end{aligned} $$
(4.2)

In this form, Eq. (4.2) can easily be generalized to more complex multi-way contingency tables [64].

To illustrate the calculation of a hypergeometric point probability value for an r×c contingency table, consider the sparse frequency data given in Table 4.4 with N = 14 observations. Following Eq. (4.2)

$$\displaystyle \begin{aligned} \begin{array}{rcl} p(n_{ij}|R_{i},C_{j},N) &\displaystyle =&\displaystyle \frac{\left( \,\displaystyle\prod_{i=1}^{r} R_{i}! \right) \left( \,\displaystyle\prod_{j=1}^{c} C_{j}! \right)}{N!\;\displaystyle\prod_{i=1}^{r}\,\prod_{j=1}^{c} n_{ij}!}\\ &\displaystyle &\displaystyle = \frac{3!\;4!\;3!\;4!\;5!\;5!\;4!}{14!\;2!\;1!\;0!\;0!\;1!\;3!\;0!\;3!\;0!\;3!\;0!\;1!} = 0.1903 {\times} 10^{-3}\;. \end{array} \end{aligned} $$
Table 4.4 Example 4×3 contingency table
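
Equation (4.2) translates into a few lines of code. The sketch below assumes that the cell factorials in the calculation above are listed row by row, giving row marginal frequency totals {3, 4, 3, 4} and column marginal frequency totals {5, 5, 4} for Table 4.4.

```python
from math import factorial, prod

def point_prob(table):
    """Hypergeometric point probability of an r x c contingency table, Eq. (4.2)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    numerator = prod(map(factorial, row_totals)) * prod(map(factorial, col_totals))
    denominator = factorial(sum(row_totals)) * prod(factorial(n) for row in table for n in row)
    return numerator / denominator

# Cell frequencies read row by row from the calculation above (an assumed layout of Table 4.4)
table_4_4 = [[2, 1, 0],
             [0, 1, 3],
             [0, 3, 0],
             [3, 0, 1]]
print(f"{point_prob(table_4_4):.4e}")  # 1.9028e-04, i.e., 0.1903 x 10^-3
```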

While this section illustrates the calculation of a hypergeometric point probability value, for an exact permutation test of an r×c contingency table it is necessary to calculate the selected measure of association for the observed cell frequencies and, then, exhaustively enumerate all possible, equally-likely arrangements of the N objects in the rc cells, given the observed marginal frequency distributions.

For each arrangement in the reference set of all permutations of cell frequencies, a measure of association, say T, and the associated exact hypergeometric point probability value, p(n ij|R i, C j, N) for i = 1, …, r and j = 1, …, c, are calculated. If T o denotes the value of the observed test statistic, i.e., measure of association, the exact two-sided probability value of T o is the sum of the hypergeometric point probability values associated with the values of T computed on all possible arrangements of cell frequencies that are equal to or greater than T o.
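
For tables small enough to enumerate, the exact procedure just described can be sketched as follows: generate every arrangement of cell frequencies with the observed marginal frequency distributions, evaluate the chosen measure of association on each arrangement, and sum the hypergeometric point probability values of the arrangements whose statistic equals or exceeds the observed value. In the sketch below, stat stands for whatever measure of association T is of interest; the recursive enumeration is illustrative and practical only for small tables.

```python
from math import factorial, prod
from itertools import product

def point_prob(table):
    """Hypergeometric point probability of a contingency table, Eq. (4.2)."""
    R = [sum(row) for row in table]
    C = [sum(col) for col in zip(*table)]
    num = prod(map(factorial, R)) * prod(map(factorial, C))
    den = factorial(sum(R)) * prod(factorial(n) for row in table for n in row)
    return num / den

def tables_with_margins(R, C):
    """Generate every table of non-negative integers with row margins R and column margins C."""
    if len(R) == 1:
        yield [list(C)]                                   # the last row is forced by the column margins
        return
    first, rest = R[0], R[1:]
    for row in product(*(range(min(first, c) + 1) for c in C)):
        if sum(row) == first:
            remaining = [c - x for c, x in zip(C, row)]
            for tail in tables_with_margins(rest, remaining):
                yield [list(row)] + tail

def exact_upper_pvalue(observed, stat):
    """Sum of point probabilities of all fixed-margin tables with stat(table) >= stat(observed)."""
    R = [sum(row) for row in observed]
    C = [sum(col) for col in zip(*observed)]
    t_obs = stat(observed)
    return sum(point_prob(t) for t in tables_with_margins(R, C) if stat(t) >= t_obs)
```

For the small tables in this section the enumeration completes quickly; for larger tables the Monte Carlo approach described next is preferable.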

When the number of possible arrangements of cell frequencies is very large, exact tests are impractical and Monte Carlo permutation statistical methods become necessary. Monte Carlo permutation statistical methods generate a random sample of all possible arrangements of cell frequencies, drawn with replacement, given the observed marginal frequency distributions. The resampling two-sided probability value is simply the proportion of the T values computed on the randomly selected arrangements that are equal to or greater than T o. In the case of Monte Carlo resampling, hypergeometric probability values are not involved—simply the proportion of the values of the measures of association (T values) equal to or greater than the value of the observed measure of association (T o).
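
A minimal Monte Carlo sketch of the same idea appears below. Randomly pairing the fixed row labels of the N objects with a shuffled copy of their column labels produces tables with the observed marginal frequency distributions, each drawn with probability equal to its hypergeometric point probability, so the proportion of sampled statistic values equal to or greater than T o estimates the exact probability value. The function name, the number of trials, and the seed are illustrative choices.

```python
import numpy as np

def resampling_upper_pvalue(observed, stat, trials=100_000, seed=1):
    """Monte Carlo estimate of the permutation probability value for an r x c table."""
    observed = np.asarray(observed)
    r, c = observed.shape
    rows = np.repeat(np.arange(r), observed.sum(axis=1))    # row label of each of the N objects
    cols = np.repeat(np.arange(c), observed.sum(axis=0))    # column label of each object
    t_obs = stat(observed)
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        table = np.zeros((r, c), dtype=int)
        np.add.at(table, (rows, rng.permutation(cols)), 1)  # a random table with the observed margins
        if stat(table) >= t_obs:
            hits += 1
    return hits / trials
```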

4.2 Goodman and Kruskal’s λ a and λ b Measures

A common problem that many researchers confront is the analysis of a cross-classification table where both variables are categorical, as categorical variables usually do not contain as much information as ordinal- or interval-level variables [54]. As noted in Chap. 3, the usual measures of association based on chi-squared, such as Pearson’s ϕ 2, Tschuprov’s T 2, Cramér’s V 2, and Pearson’s C, have proven to be less than satisfactory due to difficulties in interpretation; see, for example, discussions by Agresti and Finlay [2, p. 284], Berry, Martin, and Olson [11], Berry, Johnston, and Mielke [8, 9], Blalock [18, p. 306], Costner [27], Ferguson [30, p. 422], Guilford [42, p. 342], and Wickens [86, p. 226].

In 1954, Leo Goodman and William Kruskal proposed several new measures of association [37]. Among the measures were two asymmetric proportional-reduction-in-error (PRE) prediction measures for the analysis of a random sample of two categorical variables: λ a, for when A was considered to be the dependent variable, and λ b, for when B was considered to be the dependent variable [37].

Consider an r×c contingency table such as depicted in Table 4.5, where a j for j = 1, …, c denotes the c categories for dependent variable A, b i for i = 1, …, r denotes the r categories for independent variable B, n ij denotes a cell frequency for i = 1, …, r and j = 1, …, c, and N denotes the total of cell frequencies in the table. Denote by a dot (⋅) the partial sum of all rows or all columns, depending on the position of the (⋅) in the subscript list. If the (⋅) is in the first subscript position, the sum is over all rows and if the (⋅) is in the second subscript position, the sum is over all columns. Thus, n i. denotes the marginal frequency total of the ith row, i = 1, …, r, summed over all columns, and n .j denotes the marginal frequency total of the jth column, j = 1, …, c summed over all rows.

Table 4.5 Notation for the cross-classification of two categorical variables, A j for j = 1, …, c and B i for i = 1, …, r

Given the notation in Table 4.5, let

$$\displaystyle \begin{aligned} W = \sum_{i=1}^{r} \max(n_{i1},n_{i2},\,\ldots,\,n_{ic}) \end{aligned} $$

and

$$\displaystyle \begin{aligned} X = \max(n_{.1},n_{.2},\,\ldots,\,n_{.c})\;. \end{aligned} $$

Then, λ a, with variable A the dependent variable, is given by:

$$\displaystyle \begin{aligned} \lambda_{a} = \frac{W-X}{N-X}\;. \end{aligned}$$

In like manner, let

$$\displaystyle \begin{aligned} Y = \sum_{j=1}^{c} \max(n_{1j},n_{2j},\,\ldots,\,n_{rj}) \end{aligned}$$

and

$$\displaystyle \begin{aligned} Z = \max(n_{1.},n_{2.},\,\ldots,\,n_{r.})\;. \end{aligned}$$

Then, λ b, with variable B the dependent variable, is given by:

$$\displaystyle \begin{aligned} \lambda_{b} = \frac{Y-Z}{N-Z}\;. \end{aligned}$$

Both λ a and λ b are proportional-reduction-in-error (PRE) measures. Consider λ a and two possible scenarios:

Case 1:

Knowledge of only the disjoint categories of dependent variable A.

Case 2:

Knowledge of the disjoint categories of variable A, and also knowledge of the disjoint categories of independent variable B.

For Case 1, it is expedient for a researcher to guess the category of dependent variable A that has the largest marginal frequency total (mode), which in this case is \(X = \max (n_{.1},\,\ldots ,\,n_{.c})\). Then, the number of prediction errors is N − X; label these “errors of the first kind” or E 1. For Case 2, it is expedient for a researcher to guess the category of dependent variable A that has the largest cell frequency (mode) in each category of the independent variable B, which in this case is

$$\displaystyle \begin{aligned} W = \sum_{i=1}^{r} \max(n_{i1},n_{i2},\,\ldots,\,n_{ic})\;. \end{aligned}$$

The number of prediction errors is then N − W; label these “errors of the second kind” or E 2. Then, λ a may be expressed as:

$$\displaystyle \begin{aligned} \lambda_{a} = \frac{E_{1}-E_{2}}{E_{1}} = \frac{N-X-(N-W)}{N-X} = \frac{W-X}{N-X}\;. \end{aligned}$$

As noted by Goodman and Kruskal in 1954, a problem was immediately observed with the interpretations of both λ a and λ b. Since both measures were based on the modal values of the categories of the independent variable, when the modal values all occurred in the same category of the dependent variable, λ a and λ b returned results of zero [37, p. 742]. Thus, while λ a and λ b were equal to zero under independence, λ a and λ b could also be equal to zero for cases other than independence. This made both λ a and λ b difficult to interpret; consequently, λ a and λ b are seldom found in the contemporary literature. The problem is easy to illustrate with simple 2×2 contingency tables. Consider first the 2×2 contingency table given in Table 4.6 where the cell frequencies indicate independence between variables A and B. For the frequency data given in Table 4.6,

$$\displaystyle \begin{aligned} W = \sum_{i=1}^{r}\max(n_{i1},\,\ldots,\,n_{ic}) = \max(36,24)+\max(24,16) = 36+24 = 60\;, \end{aligned}$$
$$\displaystyle \begin{aligned} X = \max(n_{.1},\,\ldots,\,n_{.c}) = \max(60,40) = 60\;, \end{aligned}$$

and the observed value of λ a is

$$\displaystyle \begin{aligned} \lambda_{a} = \frac{W-X}{N-X} = \frac{60-60}{100-60} = 0.00\;. \end{aligned}$$
Table 4.6 Example 2×2 contingency table with variables A and B independent

Now, consider the 2×2 contingency table given in Table 4.7 where the cell frequencies do not indicate independence between variables A and B. For the frequency data given in Table 4.7,

$$\displaystyle \begin{aligned} W = \sum_{i=1}^{r}\max(n_{i1},\,\ldots,\,n_{ic}) = \max(32,28)+\max(28,12) = 32+28 = 60\;, \end{aligned}$$
$$\displaystyle \begin{aligned} X = \max(n_{.1},\,\ldots,\,n_{.c}) = \max(60,40) = 60\;, \end{aligned}$$

and the observed value of λ a is

$$\displaystyle \begin{aligned} \lambda_{a} = \frac{W-X}{N-X} = \frac{60-60}{100-60} = 0.00\;. \end{aligned}$$
Table 4.7 Example 2×2 contingency table with variables A and B not independent

Finally, consider the 2×2 contingency table given in Table 4.8, where the cell frequencies indicate perfect association between variables A and B. For the frequency data given in Table 4.8,

$$\displaystyle \begin{aligned} W = \sum_{i=1}^{r}\max(n_{i1},\,\ldots,\,n_{ic}) = \max(60,0)+\max(0,40) = 60+40 = 100\;, \end{aligned} $$
$$\displaystyle \begin{aligned} X = \max(n_{.1},\,\ldots,\,n_{.c}) = \max(60,40) = 60\;, \end{aligned} $$

and the observed value of λ a is

$$\displaystyle \begin{aligned} \lambda_{a} = \frac{W-X}{N-X} = \frac{100-60}{100-60} = 1.00\;. \end{aligned} $$

Thus, as Goodman and Kruskal explained in 1954 [37, p. 742]:

  1.

    λ a is indeterminate if and only if the population lies in one column; that is, it appears in one category of variable A.

    Table 4.8 Example 2×2 contingency table with variables A and B in perfect association
  2.

    Otherwise, the value of λ a lies between the limits 0 and 1.

  3.

    λ a is 0 if and only if knowledge of the B classification is of no help in predicting the A classification.

  4.

    λ a is 1 if and only if knowledge of an object’s B category completely specifies its A category, i.e., if each row of the cross-classification table contains at most one non-zero value.

  5.

    In the case of statistical independence, λ a, when determinate, is zero. The converse need not hold: λ a may be zero without statistical independence holding.

  6.

    λ a is unchanged by any permutation of rows or columns.
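
A brief computational check of the three 2×2 examples in Tables 4.6, 4.7, and 4.8 makes the interpretive problem concrete: λ a is zero for both the independent and the non-independent table, and unity only for the table exhibiting perfect association. The functions below are a minimal sketch of the defining formulas; λ b is obtained simply by transposing the table.

```python
def lambda_a(table):
    """Goodman and Kruskal's lambda_a, with the column variable (A) dependent."""
    N = sum(sum(row) for row in table)
    W = sum(max(row) for row in table)              # sum of the within-row maximum frequencies
    X = max(sum(col) for col in zip(*table))        # largest column marginal frequency total
    return (W - X) / (N - X)

def lambda_b(table):
    """lambda_b is lambda_a computed on the transposed table (row variable B dependent)."""
    return lambda_a([list(col) for col in zip(*table)])

print(lambda_a([[36, 24], [24, 16]]))  # 0.0 -- Table 4.6, independence
print(lambda_a([[32, 28], [28, 12]]))  # 0.0 -- Table 4.7, association, yet lambda_a is still zero
print(lambda_a([[60,  0], [ 0, 40]]))  # 1.0 -- Table 4.8, perfect association
```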

4.2.1 Example λ a and λ b Analyses

For a more realistic application of Goodman and Kruskal’s λ a and λ b measures of nominal association, consider the 3×4 contingency table given in Table 4.9, where for λ a the sum of the within-row maximum cell frequencies is W = 50,

$$\displaystyle \begin{aligned} X = \max(n_{.1},\,\ldots,\,n_{.c}) = \max(15,25,35,15) = 35\;, \end{aligned}$$

and the observed value of λ a is

$$\displaystyle \begin{aligned} \lambda_{a} = \frac{W-X}{N-X} = \frac{50-35}{90-35} = 0.2727\;. \end{aligned}$$
Table 4.9 Example 3×4 contingency table for Goodman and Kruskal’s λ a and λ b

The exact probability value of an observed value of λ a under the null hypothesis is given by the sum of the hypergeometric point probability values associated with values of λ a equal to or greater than the observed λ a value. For the frequency data given in Table 4.9, there are only M = 3,453,501 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies given the observed row and column marginal frequency distributions, {20, 30, 40} and {15, 25, 35, 15}, respectively, making an exact permutation analysis possible. The exact upper-tail probability value of the observed λ a value is P = 0.2715, i.e., the sum of the hypergeometric point probability values associated with values of λ a = 0.2727 or greater.

The frequency data given in Table 4.9 can also be considered with variable B as the dependent variable. Thus, for λ b the sum of the within-column maximum cell frequencies is Y = 50,

$$\displaystyle \begin{aligned} Z = \max(n_{1.},\ldots,n_{r.}) = \max(20,30,40) = 40\;, \end{aligned}$$

and the observed value of λ b is

$$\displaystyle \begin{aligned} \lambda_{b} = \frac{Y-Z}{N-Z} = \frac{50-40}{90-40} = 0.20\;. \end{aligned}$$

For the frequency data given in Table 4.9, there are only M = 3,453,501 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies given the observed row and column marginal frequency distributions, {20, 30, 40} and {15, 25, 35, 15}, respectively, making an exact permutation analysis feasible. The exact upper-tail probability value of the observed λ b value is P = 0.7669, i.e., the sum of the hypergeometric point probability values associated with values of λ b = 0.20 or greater.

4.3 Goodman and Kruskal’s t a and t b Measures

As noted, vide supra, in 1954 Leo Goodman and William Kruskal proposed several new measures of association. Among the measures was an asymmetric proportional-reduction-in-error (PRE) prediction measure, t a, for the analysis of a random sample of two categorical variables [37]. Consider two cross-classified unordered polytomies, A and B, with variable A the dependent variable and variable B the independent variable. Table 4.5 on p. 144, replicated in Table 4.10 for convenience, provides notation for the cross-classification, where a j for j = 1, …, c denotes the c categories for dependent variable A, b i for i = 1, …, r denotes the r categories for independent variable B, N denotes the total of cell frequencies in the table, n i. denotes a marginal frequency total for the ith row, i = 1, …, r, summed over all columns, n .j denotes a marginal frequency total for the jth column, j = 1, …, c, summed over all rows, and n ij denotes a cell frequency for i = 1, …, r and j = 1, …, c.

Table 4.10 Notation for the cross-classification of two categorical variables, A j for j = 1, …, c and B i for i = 1, …, r

Goodman and Kruskal’s t a statistic is a measure of the relative reduction in prediction error where two types of errors are defined. The first type is the error in prediction based solely on knowledge of the distribution of the dependent variable, termed “errors of the first kind” (E 1) and consisting of the expected number of errors when predicting the c dependent variable categories (a 1, …, a c) from the observed distribution of the marginals of the dependent variable (n .1, …, n .c). The second type is the error in prediction based on knowledge of the distributions of both the independent and dependent variables, termed “errors of the second kind” (E 2) and consisting of the expected number of errors when predicting the c dependent variable categories (a 1, …, a c) from knowledge of the r independent variable categories (b 1, …, b r).

To illustrate the two error types, consider predicting category a 1 only from knowledge of its marginal distribution, n .1, …, n .c. Clearly, n .1 out of the N total cases are in category a 1, but exactly which n .1 of the N cases is unknown. The probability of incorrectly identifying one of the N cases in category a 1 by chance alone is given by:

$$\displaystyle \begin{aligned} \frac{N-n_{.1}}{N}\;. \end{aligned}$$

Since there are n .1 such classifications required, the number of expected incorrect classifications is

$$\displaystyle \begin{aligned} \frac{n_{.1}(N-n_{.1})}{N} \end{aligned} $$

and, for all c categories of variable A, the number of expected errors of the first kind is given by:

$$\displaystyle \begin{aligned} E_{1} = \sum_{j=1}^{c} \frac{n_{.j}(N-n_{.j})}{N}\;. \end{aligned} $$

Likewise, to predict n 11, …, n 1c from the independent category b 1, the probability of incorrectly classifying one of the n 1. cases in cell n 11 by chance alone is

$$\displaystyle \begin{aligned} \frac{n_{1.}-n_{11}}{n_{1.}}\;. \end{aligned} $$

Since there are n 11 such classifications required, the number of expected incorrect classifications is

$$\displaystyle \begin{aligned} \frac{n_{11}(n_{1.}-n_{11})}{n_{1.}} \end{aligned} $$

and, for all cr cells, the number of expected errors of the second kind is given by:

$$\displaystyle \begin{aligned} E_{2} = \sum_{j=1}^{c}\,\sum_{i=1}^{r} \frac{n_{ij}(n_{i.}-n_{ij})}{n_{i.}}\;. \end{aligned}$$

Goodman and Kruskal’s t a statistic is then defined as:

$$\displaystyle \begin{aligned} t_{a} = \frac{E_{1}-E_{2}}{E_{1}}\;. \end{aligned}$$

An efficient computation form for Goodman and Kruskal’s t a is given by:

$$\displaystyle \begin{aligned} t_{a} = \frac{N \displaystyle\sum_{i=1}^{r}\,\sum_{j=1}^{c} \frac{n_{ij}^{2}}{n_{i.}}-\sum_{j=1}^{c}n_{.j}^{2}}{N^{2}-\displaystyle\sum_{j=1}^{c}n_{.j}^{2}}\;. \end{aligned} $$
(4.3)

A computed value of t a indicates the proportional reduction in prediction error given knowledge of the distribution of independent variable B over and above knowledge of only the distribution of dependent variable A. As defined, t a is a point estimator of Goodman and Kruskal’s population parameter τ a for the population from which the sample of N cases was obtained. If variable B is considered the dependent variable and variable A the independent variable, then Goodman and Kruskal’s test statistic t b and associated population parameter τ b are analogously defined.
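
The computing form in Eq. (4.3) is straightforward to program. In the sketch below, t a is coded as written and t b is obtained by transposing the table; the check uses the Table 4.2 frequencies, for which t a = t b = χ 2∕N because the table is 2×2 (a relationship noted in Sect. 4.4).

```python
def tau_a(table):
    """Goodman and Kruskal's t_a, Eq. (4.3), with the column variable (A) dependent."""
    N = sum(sum(row) for row in table)
    col_sq = sum(c * c for c in (sum(col) for col in zip(*table)))
    within = sum(n * n / sum(row) for row in table for n in row)   # sum of n_ij^2 / n_i.
    return (N * within - col_sq) / (N * N - col_sq)

def tau_b(table):
    """t_b is t_a computed on the transposed table (row variable B dependent)."""
    return tau_a([list(col) for col in zip(*table)])

# For a 2x2 table, t_a = t_b = chi-squared / N; e.g., the Table 4.2 frequencies
print(round(tau_a([[9, 2], [3, 6]]), 4), round(tau_b([[9, 2], [3, 6]]), 4))  # 0.2424 0.2424
```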

While parameter τ a norms properly from 0 to 1, possesses a clear and meaningful proportional-reduction-in-error interpretation [27], and is characterized by high intuitive and factorial validity [45], test statistic t a poses difficulties whenever the null hypothesis posits that H 0: τ a = 0 [61]. The problem is that the sampling distribution of t a is not asymptotically normal under the null hypothesis H 0: τ a = 0. Consequently, the applicability of Goodman and Kruskal’s t a to typical tests of null hypotheses has been severely circumscribed.

Although t a was developed by Goodman and Kruskal in 1954, it was not until 1963 that the asymptotic normality for t a was established and an asymptotic variance was given for t a, but only for 0 < τ a < 1 [39]. Unfortunately, the asymptotic variance for t a given in 1963 was later found to be incorrect, and it was not until 1972 that the correct asymptotic variance for t a was obtained, but again, only for 0 < τ a < 1.

In 1971, Richard Light and Barry Margolin developed R 2, an analysis-of-variance technique for categorical response variables, called CATANOVA for CATegorical ANalysis Of VAriance [55]. Light and Margolin apparently were unaware that R 2 was identical to Goodman and Kruskal’s t a and that they had asymptotically solved the longstanding problem of testing H 0: τ a = 0. The identity between R 2 and t a was first recognized by Särndal in 1974 [75] and later discussed by Margolin and Light [61], where they showed that t a(N − 1)(c − 1) was distributed as chi-squared with (r − 1)(c − 1) degrees of freedom under H 0: τ a = 0 as N → ∞ [13].

4.3.1 Example Analysis for t a

Consider the same 3×4 contingency table analyzed with Goodman and Kruskal’s λ a, replicated in Table 4.11 for convenience. Following Eq. (4.3), the observed value of Goodman and Kruskal’s t a is

$$\displaystyle \begin{aligned} \begin{array}{rcl} t_{a} &\displaystyle =&\displaystyle \frac{N \displaystyle\sum_{i=1}^{r}\,\sum_{j=1}^{c} \frac{n_{ij}^{2}}{n_{i.}}-\sum_{j=1}^{c}n_{.j}^{2}}{N^{2}-\displaystyle\sum_{j=1}^{c}n_{.j}^{2}}\\ &\displaystyle &\displaystyle = \frac{90 \left( \displaystyle\frac{5^{2}}{20}+\frac{0^{2}}{20}+\cdots+\frac{10^{2}}{40} \right)-(15^{2}+25^{2}+35^{2}+15^{2})}{90^{2}-(15^{2}+25^{2}+35^{2}+15^{2})} = 0.1659\;. \end{array} \end{aligned} $$
Table 4.11 Example 3×4 contingency table

The exact probability value of an observed t a under the null hypothesis is given by the sum of the hypergeometric point probability values associated with values of t a equal to or greater than the observed value of t a. For the frequency data given in Table 4.11, there are only M = 3,453,501 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies given the observed row and column marginal frequency distributions, {20, 30, 40} and {15, 25, 35, 15}, respectively, making an exact permutation analysis possible. The exact upper-tail probability value of the observed t a value is P = 0.3828, i.e., the sum of the hypergeometric point probability values associated with values of t a = 0.1659 or greater.

4.3.2 Example Analysis for t b

Now, consider variable B as the dependent variable. A convenient computing formula for t b is

$$\displaystyle \begin{aligned} t_{b} = \frac{N \displaystyle\sum_{j=1}^{c}\,\sum_{i=1}^{r} \frac{n_{ij}^{2}}{n_{.j}}-\sum_{i=1}^{r}n_{i.}^{2}}{N^{2}-\displaystyle\sum_{i=1}^{r}n_{i.}^{2}}\;. \end{aligned}$$

Thus, for the frequency data given in Table 4.11 the observed value of t b is

$$\displaystyle \begin{aligned} t_{b} = \frac{90 \left( \displaystyle\frac{5^{2}}{15}+\frac{0^{2}}{25}+\cdots+\frac{10^{2}}{40} \right)-(20^{2}+30^{2}+40^{2})}{90^{2}-(20^{2}+30^{2}+40^{2})} = 0.2022\;. \end{aligned}$$

For the frequency data given in Table 4.11, there are only M = 3,453,501 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies given the observed row and column marginal frequency distributions, {20, 30, 40} and {15, 25, 35, 15}, respectively, making an exact permutation analysis feasible. The exact upper-tail probability value of the observed t b value is P = 0.5187, i.e., the sum of the hypergeometric point probability values associated with values of t b = 0.2022 or greater.

4.4 An Asymmetric Test of Homogeneity

Oftentimes a research question involves determining if the proportions of items in a set of mutually exclusive categories are the same for two or more groups. When independent random samples are drawn from each of g ≥ 2 groups and then classified into r ≥ 2 mutually exclusive categories, the appropriate test is a test of homogeneity of the g distributions. In a test of homogeneity, one of the marginal distributions is known prior to collecting the data, i.e., the row or column marginal frequency totals indicating the numbers in each of the g groups. This is termed product multinomial sampling, since the sampling distribution is the product of g multinomial distributions and the null hypothesis is that the g multinomial distributions are identical [19, 49, 61].

A test of homogeneity is quite different from a test of independence, where a single sample is drawn and then classified on both variables. In a test of independence, both sets of marginal frequency totals are known only after the data have been collected [62]. This is termed simple multinomial sampling, since the sampling distribution is a multinomial distribution [19, 49]. The most widely used test of homogeneity is the Pearson [69] chi-squared test of homogeneity with degrees of freedom given by df  = (r − 1)(g − 1). The Pearson chi-squared test of homogeneity tests the null hypothesis that there is no difference in the proportions of subjects in a set of mutually exclusive categories between two or more populations [60].

Pearson’s chi-squared test of homogeneity is a symmetrical test, yielding only a single value for an r×g contingency table. In contrast, an asymmetrical test yields two values depending on which variable is considered to be the dependent variable. As noted by Berkson, if the differences are all in one direction, a symmetrical test such as chi-squared is insensitive to this fact [6, p. 536].

A symmetrical test of homogeneity, by its nature, excludes known information about the data—which variable is the independent variable and which variable is the dependent variable. While it is sometimes necessary to reduce the level of measurement when distributional requirements cannot be met, in general it is not advisable to use a statistical test that discounts important information [29, p. 911]. For example, a researcher should not discard the magnitude of a set of scores and use a signed-ranks test instead of a Fisher–Pitman test, nor should a researcher subsequently ignore the ranks and reduce the analysis to a simple sign test. In the same fashion, given the problem of examining the contingency of two ordered polytomies, the use of a chi-squared-based measure of association does not take into consideration the inherent ordering of the categories [7].

Consider two cross-classified unordered polytomies, A and B, with B the dependent variable. Let b 1, …, b r represent the r ≥ 2 categories of the dependent variable, a 1, …, a g represent the g ≥ 2 categories of the independent variable, n ij indicate the cell frequency in the ith row and jth column, i = 1, …, r and j = 1, …, g, and N denote the total sample size. Denote by a dot (⋅) the partial sum of all rows or all columns, depending on the position of the (⋅) in the subscript list. If the (⋅) is in the first subscript position, the sum is over all rows and if the (⋅) is in the second subscript position, the sum is over all columns. Thus, n 1., …, n r. denotes the marginal frequency totals of row variable B summed over all columns and n .1, …, n .g denotes the marginal frequency totals of column variable A summed over all rows. The cross-classification of variables A and B is displayed in Table 4.12.

Table 4.12 Notation for the cross-classification of two categorical variables, A j for j = 1, …, g and B i for i = 1, …, r

Although never advanced as a test of homogeneity, the asymmetrical test t b, first introduced by Goodman and Kruskal in 1954 [37], is an attractive alternative to the symmetrical chi-squared test of homogeneity. The test statistic is given by:

$$\displaystyle \begin{aligned} t_{b} = \frac{N \displaystyle\sum_{j=1}^{g}\,\sum_{i=1}^{r} \frac{n_{ij}^{2}}{n_{.j}}-\sum_{i=1}^{r}n_{i.}^{2}}{N^{2}-\displaystyle\sum_{i=1}^{r}n_{i.}^{2}}\;, \end{aligned}$$

where B is the dependent variable and the associated population parameter is denoted as τ b. If variable A is considered the dependent variable, the test statistic is given by:

$$\displaystyle \begin{aligned} t_{a} = \frac{N \displaystyle\sum_{i=1}^{r}\,\sum_{j=1}^{g} \frac{n_{ij}^{2}}{n_{i.}}-\sum_{j=1}^{g} n_{.j}^{2}}{N^{2}-\displaystyle\sum_{j=1}^{g} n_{.j}^{2}} \end{aligned}$$

and the associated population parameter is τ a.

Test statistic t b takes on values between 0 and 1; t b is 0 if and only if there is homogeneity over the r categories of the dependent variable (B) for all g groups, and t b is 1 if and only if knowledge of variable A j for j = 1, …, g completely determines knowledge of variable B i for i = 1, …, r. In like fashion, test statistic t a is 0 if and only if there is homogeneity over the g categories of the dependent variable (A) for all r groups, and t a is 1 if and only if knowledge of variable B i for i = 1, …, r completely determines knowledge of variable A j for j = 1, …, g.

While no general equivalence exists for test statistics t b, t a, and χ 2, certain relationships hold among t b, t a, and χ 2 under special conditions. If g = 2, then χ 2 = Nt b, and if the row marginal frequency totals are equal, i.e., n i. = N∕r for i = 1, …, r, then χ 2 = N(r − 1)t b. Similarly, if r = 2, then χ 2 = Nt a, and if the column marginal frequency totals are equal, i.e., n .j = N∕g for j = 1, …, g, then χ 2 = N(g − 1)t a. It follows that if r = g = 2, t b = t a = χ 2∕N, which is the Pearson mean-squared contingency coefficient, ϕ 2. Finally, as N → ∞, t b(N − 1)(r − 1) and t a(N − 1)(g − 1) are distributed as chi-squared with (r − 1)(g − 1) degrees of freedom.

There are three methods to determine the probability value of a computed t b or t a test statistic: exact, Monte Carlo resampling, and asymptotic procedures. The following discussions consider only t b, but the methods are analogous for t a.

Exact Probability Values

Under the null hypothesis, H 0: τ b = 0, each of the M possible arrangements of the N cases over the rg categories of the contingency table is equally probable with fixed marginal frequency distributions. For each arrangement of the observed data in the reference set of all possible arrangements, the desired test statistic is calculated. The exact probability value of an observed t b test statistic is the sum of the hypergeometric point probability values associated with the values of t b equal to or greater than the observed value of t b.

Resampling Probability Values

An exact test is computationally not practical except for fairly small samples. An alternative method that avoids the computational demands of an exact test is a resampling permutation approximation. Under the null hypothesis, H 0: τ b = 0, resampling permutation tests generate and examine a Monte Carlo random subset of all possible, equally-likely arrangements of the observed data. For each randomly selected arrangement of the observed data, the desired test statistic is calculated. The Monte Carlo resampling probability value of an observed t b test statistic is simply the proportion of the randomly selected values of t b equal to or greater than the observed value of t b.

Asymptotic Probability Values

Under the null hypothesis, H 0: τ b = 0, as N → ∞, t b(N − 1)(r − 1) is distributed as chi-squared with (r − 1)(g − 1) degrees of freedom [61]. The asymptotic probability value is the proportion of the appropriate chi-squared distribution equal to or greater than the observed value of t b(N − 1)(r − 1).
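
A minimal sketch of this chi-squared approximation, assuming SciPy is available; the function name and arguments are illustrative.

```python
from scipy.stats import chi2

def asymptotic_pvalue_tb(t_b, N, r, g):
    """Upper-tail probability of the chi-squared approximation for t_b."""
    return chi2.sf(t_b * (N - 1) * (r - 1), df=(r - 1) * (g - 1))
```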

4.4.1 Example 1

Consider a sample of N = 80 seventh-grade female students, all from complete families with three children, stratified by Residence Type (Rural, Suburban, or Urban). Each subject is categorized into one of four Personality Characteristics (Domineering, Assertive, Submissive, or Passive) in a classroom setting by a panel of trained observers. The data are given in Table 4.13. The null hypothesis posits that the proportions of the r = 4 observed Personality Types are the same for each of the g = 3 Residence Types. Thus, Residence Type (A) is the independent variable and Personality Type (B) is the dependent variable.

Table 4.13 Example data set of residence type (A) and personality type (B)

For the frequency data given in Table 4.13,

$$\displaystyle \begin{aligned} \begin{array}{rcl} t_{b} &\displaystyle =&\displaystyle \frac{N \displaystyle\sum_{j=1}^{g}\,\sum_{i=1}^{r} \frac{n_{ij}^{2}}{n_{.j}}-\sum_{i=1}^{r}n_{i.}^{2}}{N^{2}-\displaystyle\sum_{i=1}^{r}n_{i.}^{2}}\\ &\displaystyle &\displaystyle \qquad \quad = \frac{80 \left( \displaystyle\frac{15^{2}}{30}+\frac{15^{2}}{30}+ \cdots + \frac{5^{2}}{20} \right)-(45^{2}+15^{2}+15^{2}+5^{2})}{80^{2}-(45^{2}+15^{2}+15^{2}+5^{2})} = 0.2308\;. \end{array} \end{aligned} $$

There are only M = 359,961 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies given the observed row and column marginal frequency distributions, {45, 15, 15, 5} and {30, 30, 20}, respectively, making an exact permutation analysis reasonable. The exact upper-tail probability value for the observed value of t b is P = 0.1728, i.e., the sum of the hypergeometric point probability values associated with values of t b = 0.2308 or greater.

In dramatic contrast, the Pearson chi-squared test of homogeneity yields a computed value of χ 2 = 66.6667 for the frequency data given in Table 4.13 and the exact Pearson χ 2 probability value is P = 0.1699×10⁻¹². For comparison, the asymptotic Pearson χ 2 probability value based on (r − 1)(g − 1) = (4 − 1)(3 − 1) = 6 degrees of freedom is P = 0.1969×10⁻¹¹.

The Pearson χ 2 test of homogeneity is a symmetrical test and does not distinguish between independent and dependent variables, thus excluding important information. Because the Pearson χ 2 test of homogeneity considers both variables A and B, some insight can be gained by calculating a value for t a. For the frequency data given in Table 4.13,

$$\displaystyle \begin{aligned} \begin{array}{rcl} t_{a} &\displaystyle =&\displaystyle \frac{N \displaystyle\sum_{i=1}^{r}\,\sum_{j=1}^{g} \frac{n_{ij}^{2}}{n_{i.}}-\sum_{j=1}^{g} n_{.j}^{2}}{N^{2}-\displaystyle\sum_{j=1}^{g} n_{.j}^{2}}\\ &\displaystyle &\displaystyle \qquad \quad = \frac{80 \left( \displaystyle\frac{15^{2}}{45}+\frac{15^{2}}{45}+ \cdots +\frac{5^{2}}{5} \right)-(30^{2}+30^{2}+20^{2})}{80^{2}-(30^{2}+30^{2}+20^{2})} = 0.4286\;, \end{array} \end{aligned} $$

which is considerably larger than the value for t b of 0.2308. There are only M = 359,961 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies given the observed row and column marginal frequency distributions, {45, 15, 15, 5} and {30, 30, 20}, respectively, making an exact permutation analysis feasible. The exact upper-tail probability value for the observed value of t a is P = 0.0073, i.e., the sum of the hypergeometric point probability values associated with values of t a = 0.4286 or greater.
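
The three values just reported can be reproduced from cell frequencies that are consistent with the marginal frequency distributions, the conditional proportions in Tables 4.14 and 4.15, and the reported statistics; the Submissive and Passive rows below are inferred rather than copied from Table 4.13.

```python
import numpy as np

table = np.array([[15, 15, 15],    # Domineering
                  [15,  0,  0],    # Assertive
                  [ 0, 15,  0],    # Submissive (inferred)
                  [ 0,  0,  5]])   # Passive (inferred)

N = table.sum()
row = table.sum(axis=1)            # {45, 15, 15, 5}
col = table.sum(axis=0)            # {30, 30, 20}

t_b = (N * (table**2 / col).sum() - (row**2).sum()) / (N**2 - (row**2).sum())
t_a = (N * (table**2 / row[:, None]).sum() - (col**2).sum()) / (N**2 - (col**2).sum())
expected = np.outer(row, col) / N
chi_squared = ((table - expected)**2 / expected).sum()

print(round(t_b, 4), round(t_a, 4), round(chi_squared, 4))  # 0.2308 0.4286 66.6667
```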

Clearly, the Pearson χ 2 test of homogeneity is detecting the substantial departure from homogeneity of the row proportions. This is reflected in the relatively low probability value for t a (P = 0.0073) where the column variable (A) is considered to be the dependent variable. As the dependent variable of interest is variable B, the Pearson χ 2 test of homogeneity yields a misleading result with an asymptotic probability value of P = 0.1969×10⁻¹¹ compared with the exact probability value for t b of P = 0.1728.

Table 4.14 displays the conditional column proportions obtained from the sample cell frequencies of Table 4.13. In Table 4.14, variable B is the dependent variable and the conditional column proportions are given by p i|j = n ij∕n .j, e.g., p 1|1 = 15∕30 = 0.5000. Table 4.15 displays the conditional row proportions obtained from the sample cell frequencies of Table 4.13. In Table 4.15, variable A is the dependent variable and the conditional row proportions are given by p j|i = n ij∕n i., e.g., p 1|1 = 15∕45 = 0.3333.

Table 4.14 Conditional column proportions for residence type (A) and personality type (B)
Table 4.15 Conditional row proportions for residence type (A) and personality type (B)

Even the most casual inspection of Tables 4.14 and 4.15 reveals the relative homogeneity extant among the proportions in the columns of Table 4.14, compared with the lack of homogeneity among the proportions in the rows of Table 4.15. Compare, for example, the Domineering (0.3333, 0.3333, 0.3333) and Assertive (1.0000, 0.0000, 0.0000) row proportions in Table 4.15. It is this departure from homogeneity in the row proportions that contributes to the low probability value, i.e., P = 0.1969×10⁻¹¹, associated with the Pearson χ 2 test of homogeneity.

4.4.2 Example 2

To clarify the utility of a test of homogeneity based on Goodman and Kruskal’s t b test statistic, consider a simplified example. Suppose that a researcher wishes to conduct a test of homogeneity with respect to Voting Behavior on three categories of Marital Status. The null hypothesis posits that the proportions of the r = 3 observed categories of Marital Status (independent variable) are the same for each of the g = 3 categories of Voting Behavior (dependent variable). The researcher obtains three independent simple random samples of 80 individuals from each of the three categories of Marital Status—Single, Married, and Divorced—in a local election. Table 4.16 contains the raw frequency data and conditional row proportions where independent variable Marital Status (Single, Married, Divorced) is cross-classified with dependent variable Voting Behavior (Republican, Democrat, Independent).

Table 4.16 Example data set of marital status (A) and voting behavior (B) with row proportions in parentheses

Because the frequency data given in Table 4.16 correspond to the expected values for each of the nine cells, Pearson’s chi-squared test of homogeneity is χ 2 = 0.00 with a probability value under the null hypothesis of P = 1.00. In contrast, Goodman and Kruskal’s test statistic, with variable B (Voting Behavior) the dependent variable, is t b = 1.00 with a probability value under the null hypothesis of P = 0.00.

4.5 The Measurement of Agreement

The measurement of agreement is a special case of measuring association between two or more variables. A number of statistical research problems require the measurement of agreement, rather than association or correlation. Agreement indices measure the extent to which a set of response measurements are identical to another set, i.e., agree, rather than the extent to which one set of response measurements is a linear function of another set of response measurements, i.e., correlated.

The usual research situation involving a measure of agreement arises when several judges or raters assign objects to a set of disjoint, unordered categories. In 1957, W.S. Robinson published an article in American Sociological Review on “The statistical measurement of agreement” [73]. In this formative article, Robinson developed the idea of agreement, as contrasted with correlation, and showed that a simple modification of the intraclass correlation coefficient was an appropriate measure of statistical agreement, which he called A, presumably for agreement [73, p. 20]. Robinson explained that statistical agreement requires that paired values be identical, while correlation requires only that the paired values be linked by some mathematical function [73, p. 19]. Thus, agreement is a more restrictive measure than is correlation. Robinson argued that the distinction between agreement and correlation leads to the conclusion that a logically correct estimate of the reliability of a test is given by the intraclass correlation coefficient rather than the Pearsonian (interclass) correlation coefficient and that the concept of agreement, rather than correlation, is the proper basis of reliability theory [73, p. 18]. The 1957 Robinson article, which was quite mathematical, was followed by a more interpretive article in 1959 in the same journal on “The geometric interpretation of agreement” [74].

A measure of inter-rater agreement should, as a minimum, embody seven basic attributes [16]. First, it is generally agreed that a measure of agreement should be chance corrected, i.e., any agreement coefficient should reflect the amount of agreement in excess of what would be expected by chance. Several researchers have advocated chance-corrected measures of agreement, including Brennan and Prediger [20], Cicchetti, Showalter, and Tyrer [21], Cohen [23], Conger [26], and Krippendorff [50]. Although some investigators have argued against chance-corrected measures of agreement, e.g., Armitage, Blendis, and Smyllie [3] and Goodman and Kruskal [37], supporters of chance-corrected measures of agreement far outweigh detractors.

Second, as noted by Bartko [4, 5], Bartko and Carpenter [5], Krippendorff [50], and Robinson [72], a measure of inter-rater agreement possesses an added advantage if it is directly applicable to the assessment of reliability. Robinson, in particular, was emphatic that reliability could not simply be measured by some function of Pearsonian product-moment correlation, such as in the split-half or test–retest methods, and argued that the concept of agreement should be the basis of reliability theory, not correlation [73, p. 18].

Third, a number of researchers have commented on the simplicity of Euclidean distance for measures of inter-rater agreement, noting that the squaring of differences between scale values is questionable at best, while acknowledging that squared differences allow for familiar interpretations of coefficients [34, 50]. Moreover, Graham and Jackson noted that squaring of differences between values, i.e., quadratic weighting, results in a measure of association, not agreement [41]. Thus, Euclidean distance is a desired property for measures of inter-rater agreement.

Fourth, every measure of agreement should have a statistical base [5]. A measure of agreement without a proper test of significance is severely limited in application to practical research situations. Asymptotic analyses are interesting and useful, under large sample conditions, but often limited in their practical utility when sample sizes are small.

Fifth, a measure of agreement that analyzes multivariate data has a decided advantage over univariate measures of agreement. Thus, if one observer locates a set of objects in an r-dimensional space, a multivariate measure of agreement can ascertain the degree to which a second observer locates the same set of objects in the defined r-dimensional space.

Sixth, a measure of agreement should be able to analyze data at any level of measurement. Cohen’s kappa measure of inter-rater agreement is, at the present time, the most widely used measure of agreement. Extensions of Cohen’s kappa to incompletely ranked data by Iachan [46] and to continuous categorical data by Conger [26] have been established. An extension of Cohen’s kappa measure of agreement to fully ranked ordinal data and to interval data was provided by Berry and Mielke in 1988 [16].

Seventh, a measure of agreement should be able to evaluate information from more than two raters or judges. Fleiss proposed a measure of agreement for multiple raters on a nominal scale [33]. Williams presented a measure that was limited to comparisons of the joint agreement of several raters with another rater singled out as being of special interest [88]. Landis and Koch considered agreement among several raters in terms of a majority opinion [51]. Light focused on an extension of Cohen’s [23] kappa measure of inter-rater agreement to multiple raters that was based on the average of all pairwise kappa values [54].

Unfortunately, the measure proposed by Fleiss was dependent on the average proportion of raters who agree on the classification of each observation. The limitation in the measure proposed by Williams appears to be overly restrictive, and the formulation by Landis and Koch becomes computationally prohibitive if either the number of observers or the number of response categories is large. Moreover, the extension of kappa proposed by Fleiss did not reduce to Cohen’s kappa when the number of raters was two. Finally, Hubert [44] and Conger [25] provided critical summaries of the problem of extending Cohen’s kappa measure of inter-rater agreement to multiple raters for categorical data.

4.5.1 Robinson’s Measure of Agreement

An early measure of maximum-corrected agreement was developed by W.S. Robinson in 1957 [73, 74]. Assume that k = 2 judges independently rate N objects. Robinson argued that the Pearson product-moment (interclass) correlation calculated between the ratings of two judges was an inadequate measure of agreement because it measures the degree to which the paired values of the two variables are proportional, when expressed as deviations from their means, rather than identical [73, p. 19]. Robinson proposed a new measure of agreement based on the intraclass correlation coefficient that he called A. Consider two sets of ratings such as given in Table 4.17, where there are N = 3 pairs of values. Robinson defined A as:

$$\displaystyle \begin{aligned} A = 1-\frac{D}{D_{\text{max}}}\;, \end{aligned}$$

where D (for Disagreement) is given by:

$$\displaystyle \begin{aligned} D = \sum_{i=1}^{N} \big( X_{1i}-\bar{X}_{i} \big)^{2}+\sum_{i=1}^{N} \big( X_{2i}-\bar{X}_{i} \big)^{2} \end{aligned}$$

and

$$\displaystyle \begin{aligned} \begin{array}{rcl} X_{1i} &\displaystyle =&\displaystyle \ \mbox{the value of}\ X_{1}\ \mbox{for the}\ i\mbox{th}\ \mbox{pair of ratings}\;,\\ X_{2i} &\displaystyle =&\displaystyle \mbox{the value of}\ X_{2}\ \mbox{for the}\ i\mbox{th}\ \mbox{pair of ratings}\;,\\ \bar{X}_{i} &\displaystyle =&\displaystyle \mbox{the mean of}\ X_{1}\ \mbox{and}\ X_{2}\ \mbox{for the}\ i\mbox{th}\ \mbox{pair of ratings}\;. \end{array} \end{aligned} $$

Robinson noted that, by itself, D is not a very useful measure because it involves the units of X 1 and X 2. To find a relative, rather than an absolute, measure of agreement, Robinson standardized D by its range of possible variation, given by:

$$\displaystyle \begin{aligned} D_{\text{max}} = \sum_{i=1}^{N} \big( X_{1i}-\bar{X} \big)^{2}+\sum_{i=1}^{N}\big( X_{2i}-\bar{X} \big)^{2}\;, \end{aligned}$$
Table 4.17 Example data for Robinson’s A coefficient of agreement

where the common mean is given by:

$$\displaystyle \begin{aligned} \bar{X} = \frac{\displaystyle\sum_{i=1}^{N}X_{1i}+\displaystyle\sum_{i=1}^{N}X_{2i}}{2N}\;. \end{aligned}$$

4.5.1.1 Example

Consider the data listed in Table 4.17 on p. 162 with N = 3 paired observations and k = 2 sets of ratings, replicated in Table 4.18 for convenience. Then,

$$\displaystyle \begin{aligned} D = \sum_{i=1}^{N} \big( X_{1i}-\bar{X}_{i} \big)^{2}+\sum_{i=1}^{N} \big( X_{2i}-\bar{X}_{i} \big)^{2} = 8.25+8.25 = 16.50\;. \end{aligned}$$
Table 4.18 Illustration of the calculation of Robinson’s D coefficient of agreement

Define the common mean as:

$$\displaystyle \begin{aligned} \bar{X} = \frac{\displaystyle\sum_{i=1}^{N} X_{1i}+\displaystyle\sum_{i=1}^{N} X_{2i}}{2N} = \frac{12+21}{(2)(3)} = 5.50\;, \end{aligned}$$

then the calculation of the maximum value of D, illustrated in Table 4.19, yields

$$\displaystyle \begin{aligned} D_{\text{max}} = \sum_{i=1}^{N} \big( X_{1i}-\bar{X} \big)^{2}+\sum_{i=1}^{N} \big( X_{2i}-\bar{X} \big)^{2} = 32.75+56.75 = 89.50 \end{aligned}$$
Table 4.19 Illustration of calculation of Robinson’s maximum value of D

and Robinson’s A is

$$\displaystyle \begin{aligned} A = 1-\frac{D}{D_{\text{max}}} = 1-\frac{16.50}{89.50} = 0.8156\;. \end{aligned}$$

The sums,

$$\displaystyle \begin{aligned} \sum_{i=1}^{N} X_{1i} = 12 \quad \mbox{and} \quad \sum_{i=1}^{N} X_{2i} = 21, \end{aligned} $$

are invariant under permutation. Therefore, \(\bar {X} = 5.50\) and D max = 89.50 are also invariant under permutation. Moreover,

$$\displaystyle \begin{aligned} \sum_{i=1}^{N} \big( X_{1i}-\bar{X}_{i} \big)^{2} = \sum_{i=1}^{N} \big( X_{2i}-\bar{X}_{i} \big)^{2} \end{aligned} $$

for all arrangements of the observed data. Thus, for an exact permutation analysis, it is only required to calculate either

$$\displaystyle \begin{aligned} \sum_{i=1}^{N} \big( X_{1i} - \bar{X}_{i} \big)^{2} \quad \mbox{or} \quad \sum_{i=1}^{N} \big( X_{2i} - \bar{X}_{i} \big)^{2}\;. \end{aligned} $$

In addition, it is only necessary to shuffle either the X 1i values or the X 2i values, i = 1, 2, 3, while holding the X 2i or X 1i values, respectively, constant.

For the data listed in Table 4.18, there are only M = 6 possible, equally-likely arrangements of the observed data. Since M = 6 is a very small number, it will be illustrative to list the shuffled X 1i values and the associated D and A values in Table 4.20, where the arrangement with the observed values in Table 4.18 is indicated with an asterisk. The exact upper-tail probability of the observed value of A = 0.8156 under the null hypothesis is given by:

$$\displaystyle \begin{aligned} P(A \geq A_{\text{o}}|H_{0}) = \frac{\text{number of}\ A\ \text{values}\ \geq A_{\text{o}}}{M} = \frac{1}{6} = 0.1667\;, \end{aligned} $$

where A o denotes the observed value of Robinson’s A. Alternatively,

$$\displaystyle \begin{aligned} P(D \leq D_{\text{o}}|H_{0}) = \frac{\text{number of}\ D\ \text{values} \leq D_{\text{o}}}{M} = \frac{1}{6} = 0.1667\;, \end{aligned} $$

where D o denotes the observed value of Robinson’s D.

Table 4.20 The M = 6 possible arrangements of the X 1i values, i = 1, 2, 3, with associated values of Robinson’s D and A
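
The calculations in Tables 4.18, 4.19, and 4.20 can be reproduced with a short script. The two rating vectors below are assumed values chosen to match every intermediate quantity reported in the text (sums of 12 and 21, D = 8.25 + 8.25 = 16.50, and D max = 32.75 + 56.75 = 89.50); they are illustrative rather than copied from Table 4.18.

```python
from itertools import permutations

x1 = [1, 3, 8]     # assumed first set of ratings
x2 = [2, 7, 12]    # assumed second set of ratings

def robinson_D(a, b):
    """Disagreement D: squared deviations of each rating from its pair mean."""
    return sum((ai - (ai + bi) / 2) ** 2 + (bi - (ai + bi) / 2) ** 2 for ai, bi in zip(a, b))

grand_mean = (sum(x1) + sum(x2)) / (2 * len(x1))                  # 5.50
D_max = sum((v - grand_mean) ** 2 for v in x1 + x2)               # 89.50, invariant under permutation
D_obs = robinson_D(x1, x2)                                        # 16.50
print(round(1 - D_obs / D_max, 4))                                # A = 0.8156

# Exact permutation test: all M = 3! = 6 arrangements of the x1 values against fixed x2
D_values = [robinson_D(p, x2) for p in permutations(x1)]
print(round(sum(d <= D_obs for d in D_values) / len(D_values), 4))  # 0.1667
```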

4.5.1.2 The Intraclass Correlation Coefficient

It is well known that the intraclass correlation coefficient (r I) between N pairs of observations on two variables is by definition the ordinary Pearson product-moment (interclass) correlation between 2N pairs of observations, the first N of which are the original observations, and the second N the original observations with X 1i replacing X 2i and vice versa for i = 1, …, N [31, Sect. 38]. Thus, the intraclass correlation between the values of X 1i and X 2i for i = 1, …, N given in Table 4.18 on p. 162 is the Pearson product-moment correlation between the six pairs of values, as illustrated in Table 4.21.

Table 4.21 Example data for the intraclass correlation coefficient

For the data given in Table 4.21 with N = 6 pairs of observations, the intraclass correlation coefficient is

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} r_{\text{I}} = r_{12} &\displaystyle =&\displaystyle \frac{N\displaystyle\sum_{i=1}^{N}X_{1i}X_{2i}-\sum_{i=1}^{N}X_{1i}\sum_{i=1}^{N}X_{2i}}{\sqrt{\left[ N\displaystyle\sum_{i=1}^{N}X_{1i}^{2}-\left( \sum_{i=1}^{N} X_{1i} \right)^{2} \,\right]\left[ N\displaystyle\sum_{i=1}^{N}X_{2i}^{2}-\left( \sum_{i=1}^{N} X_{2i} \right)^{2} \,\right]}} \\ &\displaystyle &\displaystyle \ = \frac{(6)(238)-(33)(33)}{\sqrt{\big[(6)(271)-(33)^{2}\big]\big[(6)(271)-(33)^{2}\big]}} = +0.6313\;. \end{array} \end{aligned} $$
(4.4)

It is obvious from Eq. (4.4) that certain computational simplifications follow from the reversal of the variable values, i.e., the row and column marginal frequency distributions for the new variables are identical and, therefore, the means and variances of the new variables are identical [73, p. 20].

For the case of two variables, the relationships between Robinson’s coefficient of agreement and the coefficient of intraclass correlation are given by:

$$\displaystyle \begin{aligned} r_{\text{I}} = 2A-1 \quad \mbox{and} \quad A = \frac{r_{\text{I}}+1}{2}\;. \end{aligned}$$

Thus, in the case of two variables the intraclass correlation is a simple linear function of the coefficient of agreement. For the example data given in Table 4.18 on p. 162,

$$\displaystyle \begin{aligned} r_{\text{I}} = 2(0.8156)-1 = 0.6313 \quad \mbox{and} \quad A = \frac{0.6313+1}{2} = 0.8156\;. \end{aligned}$$

For k > 2 sets of ratings, the relationships between the intraclass correlation coefficient and Robinson’s A are not so simple and are given by:

$$\displaystyle \begin{aligned} r_{\text{I}} = \frac{kA-1}{k-1} \quad \mbox{and} \quad A = \frac{r_{\text{I}}(k-1)+1}{k}\;. \end{aligned} $$
(4.5)

It is apparent from the expressions in Eq. (4.5) that the value of the intraclass coefficient depends not only upon A but also upon k, the number of observations per case. The range of Robinson’s A is always from zero to unity regardless of the number of observations. Therefore, comparisons between agreement coefficients based upon different numbers of variables are commensurable [73, p. 22]. The upper limit of the intraclass correlation coefficient is always unity, but its lower limit is − 1∕(k − 1) [31, Sect. 38]. For k = 2 variables, the lower limit of r I is − 1, but for k = 3 variables the lower limit is − 1∕2, for k = 4 the lower limit is − 1∕3, for k = 5 the lower limit is − 1∕4, and so on.

4.5.2 Scott’s π Measure of Agreement

An early measure of chance-corrected agreement was introduced by William Scott in 1955 [76]. Assume that two judges or raters independently classify each of N observations into one of c categories. The resulting classifications can be displayed in a c×c contingency table, such as the 3×3 table in Table 4.22, with frequencies for cell entries. Denote by a dot (⋅) the partial sum of all rows or all columns, depending on the position of the (⋅) in the subscript list. If the (⋅) is in the first subscript position, the sum is over all rows and if the (⋅) is in the second subscript position, the sum is over all columns. Thus, n i. denotes the marginal frequency total of the ith row, i = 1, …, r, summed over all columns; n .j denotes the marginal frequency total of the jth column, j = 1, …, c, summed over all rows; and

$$\displaystyle \begin{aligned} N = \sum_{i=1}^{r}\,\sum_{j=1}^{c}n_{ij} \end{aligned} $$

denotes the table frequency total. In the notation of Table 4.22, Scott’s coefficient of agreement for nominal-level data is given by:

$$\displaystyle \begin{aligned} \pi = \frac{p_{\text{o}}-p_{\text{e}}}{1-p_{\text{e}}}\;, \end{aligned} $$
(4.6)

where

$$\displaystyle \begin{aligned} p_{\text{o}} = \frac{1}{N} \sum_{i=1}^{c} n_{ii} \quad \mbox{and} \quad p_{\text{e}} = \frac{1}{4N^{2}}\sum_{k=1}^{c} \big( n_{.k}+n_{k.}\big)^{2}\;. \end{aligned} $$

In this configuration, p o is the observed proportion of observations on which the judges agree, p e is the proportion of observations for which agreement is expected by chance, p o − p e is the proportion of agreement beyond that expected by chance, 1 − p e is the maximum possible proportion of agreement beyond that expected by chance, and Scott’s π is the proportion of agreement between the two judges, after chance agreement has been removed.

Table 4.22 Example 3×3 cross-classification (agreement) table with frequencies for cell entries

4.5.2.1 Example

For an example of Scott’s π measure of inter-rater agreement, consider the frequency data given in Table 4.23, where two judges have independently classified N = 40 objects into four disjoint categories: A, B, C, and D. For the agreement data given in Table 4.23,

$$\displaystyle \begin{aligned} p_{\text{o}} = \frac{1}{N} \sum_{i=1}^{c} n_{ii} = \frac{4+4+4+4}{40} = 0.40\;, \end{aligned}$$
$$\displaystyle \begin{aligned}\begin{array}{rcl} p_{\text{e}} = \frac{1}{4N^{2}} \sum_{k=1}^{c} (n_{.k}+n_{k.})^{2} &\displaystyle =&\displaystyle \frac{1}{(4)(40^{2})}\big[ (10+10)^{2}+(10+10)^{2}\\ &\displaystyle &\displaystyle \qquad \qquad +(10+10)^{2}+(10+10)^{2}\big] = 0.25\;, \end{array} \end{aligned} $$

and the observed value of Scott’s π is

$$\displaystyle \begin{aligned} \pi = \frac{p_{\text{o}}-p_{\text{e}}}{1-p_{\text{e}}} = \frac{0.40-0.25}{1-0.25} = +0.20\;, \end{aligned} $$
(4.7)

indicating 20% agreement above that expected by chance.

Table 4.23 Example 4×4 cross-classification (agreement) table
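The calculation of Scott's π is easily programmed. The following minimal Python sketch uses a hypothetical off-diagonal fill of Table 4.23 that is consistent with the reported diagonal frequencies (4, 4, 4, 4) and marginal frequency totals (10, 10, 10, 10); because π depends only on the diagonal cells and the marginal totals, any consistent fill yields the same value.

```python
def scott_pi(table):
    """Scott's pi for a c x c agreement table given as a list of rows."""
    c = len(table)
    N = sum(sum(row) for row in table)
    row_tot = [sum(table[i]) for i in range(c)]
    col_tot = [sum(table[i][j] for i in range(c)) for j in range(c)]
    p_o = sum(table[i][i] for i in range(c)) / N
    p_e = sum((row_tot[k] + col_tot[k]) ** 2 for k in range(c)) / (4 * N ** 2)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical off-diagonal cells; diagonal and marginals match Table 4.23.
table = [[4, 2, 2, 2],
         [2, 4, 2, 2],
         [2, 2, 4, 2],
         [2, 2, 2, 4]]
print(round(scott_pi(table), 2))   # 0.2, matching Eq. (4.7)
```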

The exact probability value of an observed value of Scott’s π under the null hypothesis is given by the sum of the hypergeometric point probability values associated with the π values equal to or greater than the observed π value. For the frequency data given in Table 4.23, there are only M = 5, 045, 326 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies given the observed row and column marginal frequency distributions, {10, 10, 10, 10} and {10, 10, 10, 10}, respectively, making an exact permutation analysis possible. The exact upper-tail probability value of the observed π value is P = 0.2047, i.e., the sum of the hypergeometric point probability values associated with values of π = +0.20 or greater.

While Scott’s π is interesting from a historical perspective, π has fallen into desuetude and is no longer found in the current literature. Based as it is on joint proportions, Scott’s π makes the assumption that the two judges have the same distribution of responses, as in the example data in Table 4.23 with identical marginal distributions, {10, 10, 10, 10} and {10, 10, 10, 10}. Cohen’s κ measure does not make this assumption and, consequently, has emerged as the preferred chance-corrected measure of inter-rater agreement for two judges/raters.

4.5.3 Cohen’s κ Measure of Agreement

Currently, the most popular measure of agreement between two judges or raters is the chance-corrected measure of inter-rater agreement first proposed by Jacob Cohen in 1960 and termed kappa [23]. Cohen’s kappa measures the magnitude of agreement between b = 2 observers on the assignment of N objects to a set of c disjoint, unordered categories. In 1968, Cohen proposed a version of kappa that allowed for weighting of the c categories [24]. Whereas the original (unweighted) kappa did not distinguish among magnitudes of disagreement, weighted kappa incorporated the magnitude of each disagreement and provided partial credit for disagreements when agreement was not complete [57]. The usual approach is to assign weights to each disagreement pair with larger weights indicating greater disagreement.Footnote 4

In both the unweighted and weighted cases, kappa is equal to +1 when perfect agreement among two or more judges occurs, 0 when agreement is equal to that expected under independence, and negative when agreement is less than expected by chance. Because weighted kappa applies to ordered categories, it is discussed in Chap. 6. Unweighted kappa is discussed here as it is typically used for unordered categorical data.

Assume that two judges or raters independently classify each of N observations into one of c mutually exclusive, exhaustive, unordered categories. The resulting classifications can be displayed in a c×c cross-classification, such as the 3×3 contingency table in Table 4.24, with proportions for cell entries. Denote by a dot (⋅) the partial sum of all rows or all columns, depending on the position of the (⋅) in the subscript list. If the (⋅) is in the first subscript position, the sum is over all rows and if the (⋅) is in the second subscript position, the sum is over all columns. Thus, p i. denotes the marginal proportion total of the ith row, i = 1, …, c, summed over all columns; p .j denotes the marginal proportion total of the jth column, j = 1, …, c, summed over all rows; and p .. = 1.00. In the notation of Table 4.24, Cohen’s unweighted kappa coefficient for nominal-level data is given by:

$$\displaystyle \begin{aligned} \kappa = \frac{p_{\text{o}}-p_{\text{e}}}{1-p_{\text{e}}}\;, \end{aligned} $$
(4.8)

where

$$\displaystyle \begin{aligned} p_{\text{o}} = \sum_{i=1}^{c} p_{ii} \quad \mbox{and} \quad p_{\text{e}} = \sum_{i=1}^{c} p_{i.} p_{.i}\;. \end{aligned} $$
Table 4.24 Example 3×3 cross-classification table with proportions for cell entries

Cohen’s kappa can also be defined in terms of raw frequency values, making calculations somewhat more straightforward. Thus,

$$\displaystyle \begin{aligned} \kappa = \frac{\displaystyle\sum_{i=1}^{c}O_{ii}-\displaystyle\sum_{i=1}^{c}E_{ii}}{N-\displaystyle\sum_{i=1}^{c}E_{ii}}\;, \end{aligned} $$

where O ii denotes an observed cell frequency value on the principal diagonal of a c×c agreement table, E ii denotes an expected cell frequency value on the principal diagonal, and

$$\displaystyle \begin{aligned} E_{ii} = \frac{n_{i.}n_{.i}}{N} \qquad \mbox{for}\ i = 1,\,\ldots,\,c\;. \end{aligned}$$

In the configuration of Table 4.24, p o is the observed proportion of observations on which the judges agree, p e is the proportion of observations for which agreement is expected by chance, p o − p e is the proportion of agreement beyond that expected by chance, 1 − p e is the maximum possible proportion of agreement beyond that expected by chance, and Cohen’s kappa test statistic is the proportion of agreement between the two judges, after chance agreement has been removed.

4.5.3.1 Example 1

To illustrate Cohen’s kappa measure of chance-corrected inter-rater agreement, consider the frequency data given in Table 4.25 where two judges have independently classified N = 5 objects into c = 3 disjoint, unordered categories: A, B, and C. For the agreement data given in Table 4.25,

$$\displaystyle \begin{aligned} p_{\text{o}} = \sum_{i=1}^{c} p_{ii} = \frac{0}{5}+\frac{2}{5}+\frac{1}{5} = 0.60\;, \end{aligned}$$
$$\displaystyle \begin{aligned} p_{\text{e}} = \sum_{i=1}^{c} p_{i.} p_{.i} = \left( \frac{1}{5} \right) \left( \frac{1}{5} \right)+\left( \frac{2}{5} \right) \left( \frac{3}{5} \right)+\left( \frac{2}{5} \right) \left( \frac{1}{5} \right) = 0.36\;, \end{aligned}$$

and following Eq. (4.8), the observed value of Cohen’s κ is

$$\displaystyle \begin{aligned} \kappa = \frac{p_{\text{o}}-p_{\text{e}}}{1-p_{\text{e}}} = \frac{0.60-0.36}{1-0.36} = +0.3750\;, \end{aligned}$$

indicating approximately 37% agreement above that expected by chance.

Table 4.25 Example 3×3 cross-classification table for Cohen’s unweighted kappa

The exact probability value of an observed κ value under the null hypothesis is given by the sum of the hypergeometric point probability values associated with the κ values equal to or greater than the observed κ value. For the frequency data given in Table 4.25, there are only M = 8 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies given the observed row and column marginal frequency distributions, {1, 2, 2} and {1, 3, 1}, respectively, making an exact permutation analysis possible. The eight possible arrangements of cell frequencies, given the observed marginal frequency totals, are listed in Table 4.26, where Table 3 of Table 4.26 contains the N = 5 observed cell frequencies.

Table 4.26 Listing of the eight sets of 3×3 cell frequencies with row marginal distribution {1, 2, 2} and column marginal distribution {1, 3, 1}

Table 4.27 lists the computed κ values and associated hypergeometric point probability values for the M = 8 tables given in Table 4.26, ordered from high to low by the κ values. Only two κ values are equal to or greater than the observed value of κ = +0.3750, those belonging to Tables 8 and 3 (indicated with asterisks). Thus, the exact upper-tail probability value of the observed κ value is P = 0.1000 + 0.1000 = 0.2000, the sum of the hypergeometric point probability values associated with values of κ = +0.3750 or greater, i.e., κ 8 = +0.6875 and κ 3 = +0.3750.

Table 4.27 Kappa and hypergeometric probability values for the eight 3×3 contingency tables listed in Table 4.26
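The exact permutation analysis for a small agreement table is easily programmed. The following minimal Python sketch enumerates every 3×3 arrangement of cell frequencies with the fixed marginal frequency distributions of Table 4.25 and sums the hypergeometric point probability values associated with values of κ equal to or greater than the observed value; the cell frequencies shown are the single arrangement implied by the reported diagonal frequencies and marginal totals.

```python
from itertools import product
from math import factorial, prod

# The only 3 x 3 arrangement with diagonal (0, 2, 1), row marginals
# {1, 2, 2}, and column marginals {1, 3, 1}, as reported for Table 4.25.
observed = [[0, 1, 0],
            [0, 2, 0],
            [1, 0, 1]]

def kappa(table):
    N = sum(map(sum, table))
    c = len(table)
    row = [sum(table[i]) for i in range(c)]
    col = [sum(table[i][j] for i in range(c)) for j in range(c)]
    p_o = sum(table[i][i] for i in range(c)) / N
    p_e = sum(row[i] * col[i] for i in range(c)) / N ** 2
    return (p_o - p_e) / (1 - p_e)

def point_probability(table):
    # Hypergeometric point probability of a two-way table with fixed margins.
    N = sum(map(sum, table))
    c = len(table)
    row = [sum(table[i]) for i in range(c)]
    col = [sum(table[i][j] for i in range(c)) for j in range(c)]
    num = prod(factorial(r) for r in row) * prod(factorial(s) for s in col)
    den = factorial(N) * prod(factorial(n) for rw in table for n in rw)
    return num / den

row = [sum(r) for r in observed]
col = [sum(observed[i][j] for i in range(3)) for j in range(3)]

# Enumerate every 3 x 3 table of non-negative integers with the same margins.
tables = []
for top in product(range(row[0] + 1), repeat=3):
    if sum(top) != row[0]:
        continue
    for mid in product(range(row[1] + 1), repeat=3):
        if sum(mid) != row[1]:
            continue
        bot = [col[j] - top[j] - mid[j] for j in range(3)]
        if min(bot) >= 0 and sum(bot) == row[2]:
            tables.append([list(top), list(mid), bot])

kappa_obs = kappa(observed)
p_value = sum(point_probability(t) for t in tables if kappa(t) >= kappa_obs)
print(len(tables), round(kappa_obs, 4), round(p_value, 4))   # 8 0.375 0.2
```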

4.5.3.2 Example 2

For a second, more realistic, example of Cohen’s unweighted kappa measure of chance-corrected inter-rater agreement, consider the frequency data given in Table 4.28, where two judges have independently classified N = 68 objects into four disjoint, unordered categories: A, B, C, and D. For the agreement data given in Table 4.28,

$$\displaystyle \begin{aligned} p_{\text{o}} = \sum_{i=1}^{c} p_{ii} = \frac{8}{68}+\frac{7}{68}+\frac{9}{68}+\frac{8}{68} = 0.4706\;, \end{aligned}$$
$$\displaystyle \begin{aligned} p_{\text{e}} = \sum_{i=1}^{c} p_{i.}\,p_{.i} = \frac{(15)(11)+(17)(16)+(20)(24)+(16)(17)}{68^{2}} = 0.2571\;, \end{aligned}$$

and following Eq. (4.8), the observed value of Cohen’s κ is

$$\displaystyle \begin{aligned} \kappa = \frac{p_{\text{o}}-p_{\text{e}}}{1-p_{\text{e}}} = \frac{0.4706-0.2571}{1-0.2571} = +0.2873\;, \end{aligned}$$

indicating approximately 29% agreement above that expected by chance.

Table 4.28 Example 4×4 cross-classification table

The exact probability value of an observed κ value under the null hypothesis is given by the sum of the hypergeometric point probability values associated with κ values equal to or greater than the observed κ value. For the frequency data given in Table 4.28, there are M = 181, 260, 684 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies, given the observed row and column marginal frequency distributions, {15, 17, 20, 16} and {11, 16, 24, 17}, respectively, making an exact permutation analysis feasible. The exact upper-tail probability value of the observed κ value is P = 0.1098×10−3, i.e., the sum of the hypergeometric point probability values associated with values of κ = +0.2873 or greater.

4.5.4 Application with Multiple Judges

Cohen’s κ measure of chance-corrected inter-rater agreement was originally designed for, and limited to, only b = 2 judges. In this section, a procedure is introduced for computing unweighted kappa with multiple judges. Although the procedure is appropriate for any number of judges, b ≥ 2, and any number of disjoint, unordered categories, c ≥ 2, the description of the procedure is confined to b = 3 independent judges, and the example is limited to b = 3 independent judges and c = 3 disjoint, unordered categories to simplify the presentation.

Consider b = 3 judges who independently classify N objects into c disjoint, unordered categories. The classification may be conceptualized as a c×c×c contingency table with c rows, c columns, and c slices. Let n ijk, R i, C j, and S k denote the observed cell frequencies and the row, column, and slice marginal frequency totals for i, j, k = 1, …, c and let the frequency total be given by:

$$\displaystyle \begin{aligned} N = \sum_{i=1}^{c}\,\sum_{j=1}^{c}\,\sum_{k = 1}^{c} n_{ijk}\;.\end{aligned} $$

Cohen’s unweighted kappa test statistic for a three-way contingency table is given by:

$$\displaystyle \begin{aligned} \kappa = 1-\frac{N^{2}\displaystyle\sum_{i=1}^{c}\,\sum_{j=1}^{c}\,\sum_{k=1}^{c}w_{ijk}n_{ijk}}{\displaystyle\sum_{i=1}^{c}\,\sum_{j=1}^{c}\,\sum_{k=1}^{c}w_{ijk}R_{i}C_{j}S_{k}}\;,\end{aligned} $$
(4.9)

where w ijk are disagreement “weights” assigned to each cell for i, j, k = 1, …, c. For unweighted kappa, the disagreement weights are given by:

$$\displaystyle \begin{aligned} w_{ijk} = \begin{cases} \,0 & \text{if}\ i = j = k\;, \\ {} \,1 & \text{otherwise}\;. \end{cases}\end{aligned} $$

Given a c×c×c contingency table with N objects cross-classified by b = 3 independent judges, an exact permutation test involves generating all possible, equally-likely arrangements of the N objects to the c 3 cells, while preserving the marginal frequency distributions. For each arrangement of cell frequencies, the unweighted kappa statistic, κ, and the exact hypergeometric point probability value under the null hypothesis, p(n ijk|R i, C j, S k, N), are calculated, where

$$\displaystyle \begin{aligned} p(n_{ijk}|R_{i},C_{j},S_{k},N) = \frac{\left( \,\displaystyle\prod_{i=1}^{c}R_{i}! \right) \left( \,\displaystyle\prod_{j=1}^{c}C_{j}! \right) \left( \,\displaystyle\prod_{k=1}^{c}S_{k}! \right)}{(N!)^{b-1}\displaystyle\prod_{i=1}^{c}\,\displaystyle\prod_{j=1}^{c}\,\displaystyle\prod_{k=1}^{c}n_{ijk}!}\;. \end{aligned} $$
(4.10)

If κ o denotes the value of the observed unweighted kappa test statistic, the exact probability value of κ o under the null hypothesis is given by:

$$\displaystyle \begin{aligned} P(\kappa_{\text{o}}) = \sum_{l=1}^{M}\Psi_{l}\left( n_{ijk}|R_{i},C_{j},S_{k},N \right)\;, \end{aligned}$$

where

$$\displaystyle \begin{aligned} \Psi_{l}\left( n_{ijk}|R_{i},C_{j},S_{k},N \right) = \begin{cases} \,p(n_{ijk}|R_{i},C_{j},S_{k},N) & \text{if}\ \kappa \geq \kappa_{\text{o}}\;, \\ {} \,0 & \text{otherwise}\;, \end{cases} \end{aligned}$$

and M denotes the total number of possible, equally-likely cell frequency arrangements in the reference set of all possible arrangements of cell frequencies, given the observed marginal frequency distributions. When M is very large, as is typical with multi-way contingency tables, exact tests are impractical and Monte Carlo resampling procedures become necessary. In such cases, a random sample of the M possible, equally-likely arrangements of cell frequencies provides a comparison of κ test statistics calculated on L random multi-way tables with the κ test statistic calculated on the observed multi-way contingency table.

An efficient Monte Carlo resampling algorithm to generate random cell frequency arrangements for multi-way contingency tables with fixed marginal frequency distributions was developed by Mielke, Berry, and Johnston in 2007 [66, pp. 19–20]. For a three-way contingency table with r rows, c columns, and s slices, the resampling algorithm is given in 12 simple steps.

  1. Step 1. Construct an r×c×s contingency table from the observed data.

  2. Step 2. Obtain the fixed marginal frequency totals R 1, …, R r, C 1, …, C c, S 1, …, S s, and the frequency total N. Set a resampling counter JL = 0, and set L equal to the number of samples desired.

  3. Step 3. Set the resampling counter JL = JL + 1.

  4. Step 4. Set the marginal frequency counters JR i = R i for i = 1, …, r; JC j = C j for j = 1, …, c; JS k = S k for k = 1, …, s, and M = N.

  5. Step 5. Set n ijk = 0 for i = 1, …, r, j = 1, …, c, and k = 1, …, s, and set row, column, and slice counters IR, IC, and IS equal to zero.

  6. Step 6. Create cumulative probability distributions PR i, PC j, and PS k from the adjusted marginal frequency totals JR i, JC j, and JS k for i = 1, …, r, j = 1, …, c, and k = 1, …, s, where

     $$\displaystyle \begin{aligned} \mathit{PR}_{1} = \mathit{JR}_{1}/M \quad \mbox{and} \quad \mathit{PR}_{i} = \mathit{PR}_{i-1}+\mathit{JR}_{i}/M \end{aligned}$$

     for i = 1, …, r,

     $$\displaystyle \begin{aligned} \mathit{PC}_{1} = \mathit{JC}_{1}/M \quad \mbox{and} \quad \mathit{PC}_{j} = \mathit{PC}_{j-1}+\mathit{JC}_{j}/M \end{aligned}$$

     for j = 1, …, c, and

     $$\displaystyle \begin{aligned} \mathit{PS}_{1} = \mathit{JS}_{1}/M \quad \mbox{and} \quad \mathit{PS}_{k} = \mathit{PS}_{k-1}+\mathit{JS}_{k}/M \end{aligned}$$

     for k = 1, …, s.

  7. Step 7. Generate three uniform pseudorandom numbers U r, U c, and U s over [0, 1) and set row, column, and slice indices i = j = k = 1, respectively.

  8. Step 8. If U r ≤ PR i, then IR = i, JR i = JR i − 1, and go to Step 9; otherwise, i = i + 1 and repeat Step 8.

  9. Step 9. If U c ≤ PC j, then IC = j, JC j = JC j − 1, and go to Step 10; otherwise, j = j + 1 and repeat Step 9.

  10. Step 10. If U s ≤ PS k, then IS = k, JS k = JS k − 1, and go to Step 11; otherwise, k = k + 1 and repeat Step 10.

  11. Step 11. Set M = M − 1 and n IR,IC,IS = n IR,IC,IS + 1. If M > 0, go to Step 6; otherwise, obtain the required test statistic for the completed table.

  12. Step 12. If JL < L, go to Step 3; otherwise, stop.

At the conclusion of the resampling procedure, Cohen’s κ, as given in Eq. (4.9) on p. 172, is obtained for each of the L random three-way contingency tables, given fixed marginal frequency distributions. If κ o denotes the observed value of κ, then under the null hypothesis the resampling approximate probability value for κ o is given by:

$$\displaystyle \begin{aligned} P\left( \kappa_{\text{o}} \right) = \frac{1}{L} \sum_{l=1}^{L} \Psi_{l} \left( \kappa \right)\;, \end{aligned}$$

where

$$\displaystyle \begin{aligned} \Psi_{l} \left( \kappa \right) = \begin{cases} \,1 & \text{if}\ \kappa \geq \kappa_{\text{o}}\;, \\ {} \,0 & \text{otherwise}\;. \end{cases} \end{aligned}$$
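A minimal Python sketch of the procedure follows, combining the three-way unweighted kappa statistic of Eq. (4.9) with the generation of random tables sketched in Steps 4 through 11; the cumulative probability distributions of Step 6 are handled here by weighted random selection. The 3×3×3 array of cell frequencies is hypothetical illustrative data, not the data of Table 4.29.

```python
import random

def three_way_kappa(table):
    """Unweighted kappa of Eq. (4.9) for a c x c x c table (b = 3 judges)."""
    c = len(table)
    N = sum(table[i][j][k] for i in range(c) for j in range(c) for k in range(c))
    R = [sum(table[i][j][k] for j in range(c) for k in range(c)) for i in range(c)]
    C = [sum(table[i][j][k] for i in range(c) for k in range(c)) for j in range(c)]
    S = [sum(table[i][j][k] for i in range(c) for j in range(c)) for k in range(c)]
    # w_ijk = 0 when i = j = k and 1 otherwise, so only off-diagonal cells count.
    num = sum(table[i][j][k] for i in range(c) for j in range(c) for k in range(c)
              if not (i == j == k))
    den = sum(R[i] * C[j] * S[k] for i in range(c) for j in range(c) for k in range(c)
              if not (i == j == k))
    return 1 - (N ** 2) * num / den

def random_table(R, C, S, rng):
    """One random r x c x s table with the given marginal frequency totals,
    generated by sequential allocation from the adjusted marginals."""
    r, c, s = len(R), len(C), len(S)
    JR, JC, JS = list(R), list(C), list(S)
    M = sum(R)
    table = [[[0] * s for _ in range(c)] for _ in range(r)]
    while M > 0:
        i = rng.choices(range(r), weights=JR)[0]
        j = rng.choices(range(c), weights=JC)[0]
        k = rng.choices(range(s), weights=JS)[0]
        JR[i] -= 1; JC[j] -= 1; JS[k] -= 1
        table[i][j][k] += 1
        M -= 1
    return table

# Hypothetical counts for b = 3 judges and c = 3 categories.
observed = [[[4, 1, 0], [1, 2, 1], [0, 1, 1]],
            [[1, 2, 1], [2, 5, 2], [1, 2, 2]],
            [[0, 1, 1], [1, 2, 2], [1, 1, 3]]]
c = 3
R = [sum(observed[i][j][k] for j in range(c) for k in range(c)) for i in range(c)]
C = [sum(observed[i][j][k] for i in range(c) for k in range(c)) for j in range(c)]
S = [sum(observed[i][j][k] for i in range(c) for j in range(c)) for k in range(c)]

rng = random.Random(1)
kappa_obs = three_way_kappa(observed)
L = 10_000
count = sum(three_way_kappa(random_table(R, C, S, rng)) >= kappa_obs for _ in range(L))
print(round(kappa_obs, 4), count / L)   # observed kappa and resampling P-value
```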

4.5.5 Example Analysis with Multiple Judges

The calculation of unweighted kappa and the resampling procedure for obtaining a probability value with multiple judges can be illustrated with a sparse data set. Consider b = 3 independent judges who classify N = 93 objects into one of c = 3 disjoint, unordered categories: A, B, or C. Table 4.29 lists the c 3 cross-classified frequencies and corresponding disagreement weights, where the cell disagreement weights are given in parentheses.

Table 4.29 Classification of N = 93 objects by three independent judges into one of three disjoint, unordered categories: A, B, or C, with disagreement weights in parentheses

For the frequency data listed in Table 4.29, the observed value of kappa is κ = +0.1007, indicating approximately 10% agreement among the b = 3 judges above that expected by chance. If κ o denotes the observed value of κ, the approximate resampling probability value based on L = 1, 000, 000 random arrangements of the observed data is

$$\displaystyle \begin{aligned} P(\kappa \geq \kappa_{\text{o}}|H_{0}) = \frac{\text{number of}\ \kappa\ \text{values}\ \geq \kappa_{\text{o}}}{L} = \frac{8{,}311}{1{,}000{,}000} = 0.0083\;. \end{aligned}$$

4.6 McNemar’s Q Test for Change

In 1947, psychologist Quinn McNemar proposed a test for change that was derived from the matched-pairs t test for proportions [63]. A typical application is to analyze binary responses, coded (0, 1), at g = 2 time periods for each of N ≥ 2 subjects, such as Success and Failure, Yes and No, Agree and Disagree, or Pro and Con. If the four cells are identified as in Table 4.30, then McNemar’s test for change is given by:

$$\displaystyle \begin{aligned} Q = \frac{ \left( B-C \right)^{2}}{B+C}\;, \end{aligned}$$

where N = A + B + C + D and B and C represent the two cells of change, i.e., from Pro to Con and from Con to Pro.

Table 4.30 Notation for a 2×2 cross-classification for McNemar’s Q test for change

Alternatively, McNemar’s Q test can be thought of as a chi-squared goodness-of-fit test with two categories, where the observed frequencies, O 1 and O 2, correspond to cells B and C, respectively, and the expected frequencies, E 1 and E 2, are given by E 1 = E 2 = (B + C)∕2, i.e., half the subjects are expected to change in one direction (e.g., from Pro to Con) and half in the other direction (e.g., from Con to Pro), under the null hypothesis of no change from Time 1 to Time 2. Let

$$\displaystyle \begin{aligned} E = \frac{B+C}{2} \end{aligned}$$

denote an expected value where, by chance, half of the changes are from Pro to Con and half are from Con to Pro. Then, a chi-squared goodness-of-fit statistic for the two categories of change is given by:

$$\displaystyle \begin{aligned} \chi^{2} = \frac{(B-E)^{2}}{E}+\frac{(C-E)^{2}}{E} = \frac{B^{2}}{E}+\frac{C^{2}}{E}+2E-2B-2C\;. \end{aligned}$$

Substituting (B + C)∕2 for E yields

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle \frac{2B^{2}}{B+C}+\frac{2C^{2}}{B+C}+B+C-2B-2C&\displaystyle \\ &\displaystyle {}= \frac{2B^{2}}{B+C}+\frac{2C^{2}}{B+C}-B-C&\displaystyle \\ &\displaystyle {}= \frac{2B^{2}+2C^{2}-B(B+C)-C(B+C)}{B+C}&\displaystyle \\ &\displaystyle {}= \frac{B^{2}-2BC+C^{2}}{B+C}&\displaystyle \\ &\displaystyle {}= \frac{(B-C)^{2}}{B+C}\;.&\displaystyle \end{array} \end{aligned} $$

4.6.1 Example 1

To illustrate McNemar’s test for change, consider the frequency data given in Table 4.31, where N = 50 objects have been recorded as either Pro or Con on a specified issue at Time 1 and again on the same issue at Time 2. For the frequency data given in Table 4.31, the observed value of McNemar’s Q test statistic is

$$\displaystyle \begin{aligned} Q = \frac{(B-C)^{2}}{B+C} = \frac{(5-25)^{2}}{5+25} = 13.3333\;. \end{aligned}$$

Alternatively, O 1 = B = 5, O 2 = C = 25, E 1 = E 2 = (O 1 + O 2)∕2 = (5 + 25)∕2 = 15, and

$$\displaystyle \begin{aligned} \chi_{1}^{2} = \frac{\left( O_{1}-E_{1} \right)^{2}}{E_{1}}+\frac{\left( O_{2}-E_{2} \right)^{2}}{E_{2}} = \frac{\left( 5-15 \right)^{2}}{15}+\frac{\left( 25-15 \right)^{2}}{15} = 13.3333\;. \end{aligned}$$
Table 4.31 Example frequency data for McNemar’s test for change with N = 50 objects

The exact probability value of an observed value of Q, under the null hypothesis, is given by the sum of the hypergeometric point probability values associated with the Q values that are equal to or greater than the observed value of Q. For the frequency data listed in Table 4.31, there are only M = 31 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies given the two cell frequencies of change, 5 and 25, and only 12 Q values are equal to or greater than the observed value of Q = 13.3333.

Since M = 31 is a reasonably small number of arrangements, it will be illustrative to list the complete set of Q values and the associated hypergeometric point probability values in Table 4.32, where rows with hypergeometric point probability values associated with Q values equal to or greater than the observed value of Q are indicated with asterisks. The exact upper-tail probability value of the observed value of Q is the sum of the hypergeometric point probability values that are associated with values of Q = 13.3333 or greater; because the distribution of all possible Q values is symmetrical, this sum also constitutes the exact two-tailed probability value.

Table 4.32 McNemar Q values and exact hypergeometric point probability values for M = 31 possible arrangements of the frequency data given in Table 4.31
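A minimal Python sketch of the exact analysis follows. It enumerates the M = B + C + 1 possible arrangements of the two cell frequencies of change and, under the assumption that each change is equally likely to occur in either direction, assigns arrangement (b, B + C − b) the point probability value C(B + C, b)∕2^(B+C); the count of qualifying arrangements matches the 12 reported above.

```python
from math import comb

def mcnemar_exact(B, C):
    """McNemar's Q and the exact probability of Q values >= the observed Q,
    summing binomial point probabilities over the M = B + C + 1 arrangements."""
    n = B + C
    q_obs = (B - C) ** 2 / n
    count, p_value = 0, 0.0
    for b in range(n + 1):                  # the M = n + 1 possible arrangements
        q = (2 * b - n) ** 2 / n
        if q >= q_obs:
            count += 1
            p_value += comb(n, b) / 2 ** n
    return q_obs, count, p_value

q_obs, count, p_value = mcnemar_exact(5, 25)   # the data of Table 4.31
print(round(q_obs, 4), count, round(p_value, 6))   # 13.3333 12 0.000325
```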

4.6.2 Example 2

For a second example of McNemar’s Q test, consider the frequency data given in Table 4.33, where N = 190 objects have been recorded as either Pro or Con on a specified issue at Time 1 and again at Time 2. For the frequency data given in Table 4.33, the observed value of McNemar’s Q test statistic is

$$\displaystyle \begin{aligned} Q = \frac{(B-C)^{2}}{B+C} = \frac{(59-37)^{2}}{59+37} = 5.0417\;. \end{aligned}$$

Alternatively, O 1 = B = 59, O 2 = C = 37, E 1 = E 2 = (O 1 + O 2)∕2 = (59 + 37)∕2 = 48, and

$$\displaystyle \begin{aligned} \chi_{1}^{2} = \frac{\left( O_{1}-E_{1} \right)^{2}}{E_{1}}+\frac{\left( O_{2}-E_{2} \right)^{2}}{E_{2}} = \frac{\left( 59-48 \right)^{2}}{48}+\frac{\left( 37-48 \right)^{2}}{48} = 5.0417\;. \end{aligned}$$
Table 4.33 Example frequency data for McNemar’s test for change with N = 190 objects

The exact probability value of an observed value of Q, under the null hypothesis, is given by the sum of the hypergeometric point probability values associated with the Q values that are equal to or greater than the observed value of Q. For the frequency data listed in Table 4.33, there are only M = 97 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies given the two cell frequencies of change, 59 and 37, and only 76 Q values are equal to or greater than the observed value of Q = 5.0417. The exact upper-tail probability value of the observed Q value is P = 0.0315, i.e., the sum of the hypergeometric point probability values that are associated with values of Q = 5.0417 or greater.

4.7 Cochran’s Q Test for Change

The ubiquitous dichotomous variable plays a large role and has many applications in research and measurement. Conventionally, a value of one is assigned to each test item that a subject answers correctly and a zero is assigned to each incorrect answer. A common example application occurs when subjects are placed into an experimental situation, observed as to whether or not some specified response is elicited, and scored appropriately [56].

In 1950, William Cochran published an article on “The comparison of percentages in matched samples” [22]. In this brief but formative article, Cochran described a test for equality of matched proportions that is now widely used in educational and psychological research. The matching may be based on the characteristics of different subjects or on the same subjects under different conditions. The Cochran Q test may be viewed as an extension of the McNemar [63] test to three or more treatment conditions. For a typical application, suppose that a sample of N ≥ 2 subjects is observed in a situation wherein each subject performs individually under each of k ≥ 3 different experimental conditions. The performance is scored as a success (1) or as a failure (0). The research question is whether the true proportion of successes is constant over the k time periods.

Cochran’s Q test for the analysis of k treatment conditions (columns) and N subjects (rows) is given by:

$$\displaystyle \begin{aligned} Q = \frac{(k-1)\left( k \displaystyle\sum_{j=1}^{k} C_{j}^{2}-A^{2} \right)}{kA - B}\;, \end{aligned} $$
(4.11)

where

$$\displaystyle \begin{aligned} C_{j} = \sum_{i=1}^{N} x_{ij} \end{aligned}$$

is the number of 1s in the jth of k columns,

$$\displaystyle \begin{aligned} R_{i} = \sum_{j=1}^{k}x_{ij} \end{aligned}$$

is the number of 1s in the ith of N rows,

$$\displaystyle \begin{aligned} A = \sum_{i=1}^{N} R_{i}\;, \quad B = \sum_{i=1}^{N} R_{i}^{2}\;, \end{aligned}$$

and x ij denotes the cell entry of either 0 or 1 associated with the ith of N rows and the jth of k columns. The null hypothesis stipulates that each of the

$$\displaystyle \begin{aligned} M = \prod_{i=1}^{N} \binom{k}{R_{i}} \end{aligned}$$

distinguishable arrangements of 1s and 0s within each of the N rows occurs with equal probability, given that the values of R 1, …, R N are fixed [65].

4.7.1 Example 1

For an example analysis of Cochran’s Q test, consider the binary-coded data listed in Table 4.34 consisting of responses (1 or 0) for N = 10 subjects evaluated over k = 5 time periods, where a 1 denotes success on a prescribed task and a 0 denotes failure. For the binary-coded data listed in Table 4.34,

$$\displaystyle \begin{aligned} \sum_{j=1}^{k} C_{j}^{2} = 4^{2}+7^{2}+7^{2}+3^{2}+1^{2} = 124\;, \end{aligned}$$
$$\displaystyle \begin{aligned} A = \sum_{i=1}^{N} R_{i} = 2+3+2+2+3+2+2+1+2+3 = 22\;, \end{aligned}$$
$$\displaystyle \begin{aligned} B = \sum_{i=1}^{N} R_{i}^{2} = 2^{2}+3^{2}+2^{2}+2^{2}+3^{2}+2^{2}+2^{2}+1^{2}+2^{2}+3^{2} = 52\;, \end{aligned}$$

and, following Eq. (4.11) on p. 180, the observed value of Cochran’s Q is

$$\displaystyle \begin{aligned} Q = \frac{(k-1)\left( k \displaystyle\sum_{j=1}^{k} C_{j}^{2}-A^{2} \right)}{kA - B} = \frac{(5-1)[(5)(124)-22^{2}]}{(5)(22)-52} = 9.3793\;. \end{aligned}$$
Table 4.34 Successes (1) and failures (0) of N = 10 subjects on a series of k = 5 time periods

For the binary-coded data listed in Table 4.34, there are

$$\displaystyle \begin{aligned} M = \prod_{i=1}^{N} \binom{k}{R_{i}} = \binom{5}{1}^{1} \binom{5}{2}^{6} \binom{5}{3}^{3} = (5)(10^{6})(10^{3}) = 5{,}000{,}000{,}000 \end{aligned}$$

possible, equally-likely arrangements of the observed data, making an exact permutation analysis prohibitive and a Monte Carlo resampling analysis necessary. Based on L = 1, 000, 000 random arrangements of the observed data, there are 54,486 Q values equal to or greater than the observed value of Q = 9.3793. If Q o denotes the observed value of Q, the approximate resampling probability value of the observed data is

$$\displaystyle \begin{aligned} P \big( Q \geq Q_{\text{o}}|H_{0} \big) = \frac{\text{number of}\ Q\ \text{values}\ \geq Q_{\text{o}}}{L} = \frac{54{,}486}{1{,}000{,}000} = 0.0545\;. \end{aligned}$$

For comparison, under the null hypothesis Cochran’s Q is approximately distributed as chi-squared with k − 1 degrees of freedom. The approximate probability of Q = 9.3793 with k − 1 = 5 − 1 = 4 degrees of freedom is P = 0.0523.
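A minimal Python sketch of the resampling analysis follows. Under the null hypothesis the 1s in each row may occupy any of the C(k, R i) positions with equal probability, so each resample simply permutes the entries within every row; the 0/1 matrix is hypothetical illustrative data, not the data of Table 4.34.

```python
import random

def cochran_q(x):
    """Cochran's Q of Eq. (4.11) for a 0/1 matrix x with N rows and k columns."""
    n, k = len(x), len(x[0])
    col = [sum(row[j] for row in x) for j in range(k)]
    a = sum(sum(row) for row in x)              # A: total number of 1s
    b = sum(sum(row) ** 2 for row in x)         # B: sum of squared row totals
    return (k - 1) * (k * sum(c * c for c in col) - a * a) / (k * a - b)

# Hypothetical responses for N = 6 subjects over k = 5 time periods.
x = [[1, 0, 1, 0, 0],
     [1, 1, 0, 0, 0],
     [0, 1, 1, 0, 1],
     [1, 1, 0, 0, 0],
     [1, 0, 1, 1, 0],
     [0, 1, 0, 0, 0]]

rng = random.Random(1)
q_obs = cochran_q(x)
L = 10_000
count = 0
for _ in range(L):
    shuffled = [rng.sample(row, len(row)) for row in x]   # permute within rows
    if cochran_q(shuffled) >= q_obs:
        count += 1
print(round(q_obs, 4), count / L)   # observed Q and resampling P-value
```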

4.7.2 Example 2

For a second example of Cochran’s Q test, consider the binary-coded data listed in Table 4.35 consisting of responses (1 or 0) for N = 9 subjects evaluated over k = 3 time periods, where a 1 indicates success on a prescribed task and a 0 indicates failure. For the binary-coded data listed in Table 4.35,

$$\displaystyle \begin{aligned} A = \sum_{i=1}^{N} R_{i} = 1+1+1+1+2+1+2+1+2 = 12\;, \end{aligned}$$
$$\displaystyle \begin{aligned}B = \sum_{i=1}^{N} R_{i}^{2} = 1^{2}+1^{2}+1^{2}+1^{2}+2^{2}+1^{2}+2^{2}+1^{2}+2^{2} = 18\;, \end{aligned}$$
$$\displaystyle \begin{aligned}\sum_{j=1}^{k}C_{j}^{2} = 4^{2}+7^{2}+1^{2} = 66\;, \end{aligned}$$
Table 4.35 Successes (1) and failures (0) of N = 9 subjects on a series of k = 3 time periods

and, following Eq. (4.11) on p. 180, the observed value of Cochran’s Q is

$$\displaystyle \begin{aligned} Q = \frac{(k-1)\left( k \displaystyle\sum_{j=1}^{k} C_{j}^{2}-A^{2} \right)}{kA - B} = \frac{(3-1)[(3)(66)-12^{2}]}{(3)(12)-18} = 6.00\;. \end{aligned}$$

For the binary-coded data listed in Table 4.35, there are only

$$\displaystyle \begin{aligned} M = \prod_{i=1}^{N} \binom{k}{R_{i}} = \binom{3}{1}^{6} \binom{3}{2}^{3} = (3^{6})(3^{3}) = 19{,}683 \end{aligned}$$

possible, equally-likely arrangements of the observed data in the reference set of all possible arrangements, making an exact permutation analysis easily accomplished. Based on M = 19, 683 equally-likely, possible arrangements of the observed data, there are 1,056 Q values equal to or greater than the observed value of Q = 6.00. If Q o denotes the observed value of Q, the exact upper-tail probability value of the observed data is

$$\displaystyle \begin{aligned} P \big( Q \geq Q_{\text{o}}|H_{0} \big) = \frac{\text{number of}\ Q\ \text{values}\ \geq Q_{\text{o}}}{M} = \frac{1{,}056}{19{,}683} = 0.0537\;. \end{aligned}$$

For comparison, under the null hypothesis Cochran’s Q is approximately distributed as chi-squared with k − 1 degrees of freedom. The approximate probability of Q = 6.00 with k − 1 = 3 − 1 = 2 degrees of freedom is P = 0.0498.

4.8 A Measure of Effect Size for Cochran’s Q Test

Measures of effect size are increasingly important in reporting research outcomes. The American Psychological Association (APA) has long recommended measures of effect size for articles published in APA journals. For example, as far back as 1994 the 4th edition of the APA Publication Manual strongly encouraged reporting measures of effect size in conjunction with probability values. In 1999, the APA Task Force on Statistical Inference, under the direction of Leland Wilkinson, noted that “reporting and interpreting effect sizes in the context of previously reported effects is essential to good research” [87, p. 599]. In 2016, the American Statistical Association (ASA) recommended that measures of effect size be included in future publications in ASA journals [84]. Unfortunately, measures of effect size do not exist for a number of common statistical tests. In this section, a chance-corrected measure of effect size is presented for Cochran’s Q test for related proportions [9].

Consider an alternative approach to Cochran’s Q test where g treatments are applied independently to each of N subjects with the result of each treatment application recorded as either 1 or 0, representing any suitable dichotomization of the treatment results, i.e., a randomized-block design where the subjects are the blocks and the treatment results are registered as either 1 or 0. Let x ij denote the recorded 1 and 0 response measurements for i = 1, …, N and j = 1, …, g. Then, Cochran’s test statistic can be defined as:

$$\displaystyle \begin{aligned} Q = \frac{g-1}{2\displaystyle\sum_{i=1}^{N} p_{i}(1-p_{i})} \left[ 2 \left( \sum_{i=1}^{N} p_{i} \right) \left( N-\sum_{i=1}^{N} p_{i} \right) -N(N-1) \,\delta \right]\;, \end{aligned}$$

where

$$\displaystyle \begin{aligned} \delta = \left[ g \binom{N}{2} \right]^{-1} \sum_{k=1}^{g}\,\sum_{i=1}^{N-1}\,\sum_{j=i+1}^{N} \big| x_{ik}-x_{jk} \big| \end{aligned} $$
(4.12)

and

$$\displaystyle \begin{aligned} p_{i} = \frac{1}{g} \sum_{j=1}^{g} x_{ij} \qquad \mbox{for}\ i = 1,\,\ldots,\,N\;, \end{aligned}$$

that is, the proportion of 1 values for the ith of N subjects. Note that in this representation the variation of Q is totally dependent on δ.

In 1979, Acock and Stavig [1] proposed a maximum value for Q given by:

$$\displaystyle \begin{aligned} Q_{\text{max}} = N(g-1)\;. \end{aligned} $$
(4.13)

Acock and Stavig’s maximum value of Q in Eq. (4.13) was employed by Serlin, Carr, and Marascuilo [77] to provide a measure of effect size for Cochran’s Q given by:

$$\displaystyle \begin{aligned} \hat{\eta}_{Q}^{\,2} = \frac{Q}{Q_{\text{max}}} = \frac{Q}{N(g-1)}\;, \end{aligned}$$

which standardized Cochran’s Q by a maximum value. Unfortunately, the value of Q max = N(g − 1) advocated by Acock and Stavig is achieved only when each subject g-tuple is identical and there is at least one 1 and one 0 in each g-tuple. Thus, \(\hat {\eta }_{Q}^{\,2}\) is a “maximum-corrected” measure of effect size and \(0 \leq \hat {\eta }_{Q}^{\,2} \leq 1\) only under these rare conditions.

Assume 0 < p i < 1 for i = 1, …, N since p i = 0 and p i = 1 are uninformative. If p i is constant for i = 1, …, N, then Q max = N(g − 1). However, for the vast majority of cases when p i ≠ p j for i ≠ j, Q max < N(g − 1). Thus, the routine use of setting Q max = N(g − 1) is problematic and leads to questionable results.

It should also be noted that \(\hat {\eta }_{Q}^{\,2}\) is a member of the V  family of measures of nominal association based on Cramér’s V 2 test statistic given by:

$$\displaystyle \begin{aligned} V^{2} = \frac{\chi^{2}}{\chi_{\text{max}}^{2}} = \frac{\chi^{2}}{N \big[ \min(r-1,c-1) \big]}\;, \end{aligned}$$

where r and c denote the number of rows and columns in an r×c contingency table [1]. Other members of the V  family are Pearson’s ϕ 2 for 2×2 contingency tables [70] and Tschuprov’s T 2 for r×c contingency tables where r = c [82]. The difficulties in interpreting V 2 extend to \(\hat {\eta }_{Q}^{\,2}\).

As noted in Chap. 3, Wickens observed that Cramér’s V 2 lacks an intuitive interpretation other than as a scaling of chi-squared, which limits its usefulness [86, p. 226]. Also, Costner noted that V 2 and other measures based on Pearson’s chi-squared lack any interpretation at all for values other than 0 and 1, or the maximum, given the observed marginal frequency distributions [27]. Agresti and Finlay also noted that Cramér’s V 2 is very difficult to interpret and recommended other measures [2, p. 284]. Blalock noted that “all measures based on chi square are somewhat arbitrary in nature, and their interpretations leave a lot to be desired …they all give greater weight to those columns or rows having the smallest marginals rather than to those with the largest marginals” [17, 18, p. 306]. Ferguson discussed the problem of using idealized marginal frequencies [30, p. 422], and Guilford noted that measures such as Pearson’s ϕ 2, Tschuprov’s T 2, and Cramér’s V 2 necessarily underestimate the magnitude of association present [42, p. 342]. Berry, Martin, and Olson considered these issues with respect to 2×2 contingency tables [10, 12], and Berry, Johnston, and Mielke discussed in some detail the problems with using Pearson’s ϕ 2, Tschuprov’s T 2, and Cramér’s V 2 as measures of effect size [8]. Since \(\hat {\eta }_{Q}^{\,2}\) is simply a special case of Cramér’s V 2, it presents the same problems of interpretation. For a detailed assessment of Pearson’s ϕ 2, Tschuprov’s T 2, and Cramér’s V 2, see Chap. 3.

4.8.1 A Chance-Corrected Measure of Effect Size

Chance-corrected measures of effect size have much to commend them over maximum-corrected measures. A chance-corrected measure of effect size is a measure of agreement among the N subjects over g treatments, corrected for chance. A number of researchers have advocated chance-corrected measures of effect size, including Brennan and Prediger [20], Cicchetti, Showalter, and Tyrer [21], Conger [26], and Krippendorff [50]. A chance-corrected measure is zero under chance conditions, unity when agreement among the N subjects is perfect, and negative under conditions of disagreement. Some well-known chance-corrected measures are Scott’s coefficient of inter-coder agreement [76], Kendall and Babington Smith’s u measure of agreement [48], Cohen’s unweighted and weighted coefficients of inter-rater agreement [23, 24], and Spearman’s footrule measure [79, 80]. Under certain conditions, Spearman’s rank-order correlation coefficient [79, 80] is also a chance-corrected measure of agreement, i.e., when variables x and y consist of ranks from 1 to N with no tied values, or when variable x includes tied values and variable y is a permutation of variable x, then Spearman’s rank-order correlation coefficient is both a measure of correlation and a chance-corrected measure of agreement [50, p. 144].

Let x ij denote the (0, 1) response measurements for i = 1, …, N blocks and j = 1, …, g treatments, then

$$\displaystyle \begin{aligned} \delta = \left[ g \binom{N}{2} \right]^{-1} \sum_{k=1}^{g}\,\sum_{i=1}^{N-1}\,\sum_{j=i+1}^{N} \big| x_{ik}-x_{jk} \big|\;. \end{aligned}$$

Under the null hypothesis that the distribution of δ assigns equal probability to each of

$$\displaystyle \begin{aligned} M = \big( g! \big)^{N} \end{aligned}$$

possible allocations of the g dichotomous response measurements to the g treatment positions for each of the N subjects, the average value of δ is given by:

$$\displaystyle \begin{aligned} \mu_{\delta} = \frac{2}{N(N-1)} \left[ \left( \sum_{i=1}^{N} p_{i} \right) \left( N-\sum_{i=1}^{N} p_{i} \right)-\sum_{i=1}^{N} p_{i} (1-p_{i}) \right]\;, \end{aligned}$$

where

$$\displaystyle \begin{aligned} p_{i} = \frac{1}{g} \sum_{j=1}^{g} x_{ij} \qquad \mbox{for}\ i = 1,\,\ldots,\,N\;. \end{aligned}$$

Then, a chance-corrected measure of effect size may be defined as:

$$\displaystyle \begin{aligned} \Re = 1-\frac{\delta}{\mu_{\delta}}\;. \end{aligned}$$

4.8.2 Example

Consider a sample of N = 6 psychology graduate students enrolled in a seminar designed to hone skills in assessing patients with various disorders. The seminar includes a clinical aspect whereby actors, provided with different scripts, present symptoms that the students then diagnose. There are g = 8 scripts for a variety of symptoms including eating disorders, anxiety, depression, oppositional defiant behavior, obsessive-compulsive disorder, and post-traumatic stress disorders, any of which may be presented over the course of the seminar. The “patients” present at random intervals during the semester and the students are assessed as to whether or not the correct diagnosis was made. Table 4.36 lists the data with a 1 (0) indicating a correct (incorrect) diagnosis. For the binary data listed in Table 4.36, Table 4.37 illustrates the calculation of

$$\displaystyle \begin{aligned} \sum_{i=1}^{N} p_{i} \quad \mbox{and} \quad \sum_{i=1}^{N} p_{i}(1-p_{i})\;, \end{aligned}$$

where

$$\displaystyle \begin{aligned} p_{1} &= \frac{1}{g} \sum_{j=1}^{g} x_{1j} = \frac{0+1+1+1+0+0+1+0}{8} = 0.5000\;,\\ p_{2} &= \frac{1}{g} \sum_{j=1}^{g} x_{2j} = \frac{1+1+1+0+0+1+1+1}{8} = 0.7500\;,\\ p_{3} &= \frac{1}{g} \sum_{j=1}^{g} x_{3j} = \frac{0+1+0+1+1+0+1+1}{8} = 0.6250\;,\\ p_{4} &= \frac{1}{g} \sum_{j=1}^{g} x_{4j} = \frac{1+1+1+1+0+1+1+1}{8} = 0.8750\;,\\ p_{5} &= \frac{1}{g} \sum_{j=1}^{g} x_{5j} = \frac{0+1+1+0+0+0+1+1}{8} = 0.5000\;, \end{aligned} $$

and

$$\displaystyle \begin{aligned} p_{6} &= \frac{1}{g} \sum_{j=1}^{g} x_{6j} = \frac{1+1+1+1+0+1+1+0}{8} = 0.7500\;. \end{aligned} $$

Table 4.38 illustrates the calculation of the |x ik − x jk| values, i = 1, …, N − 1 and j = i + 1, …, N, for Treatments 1, 2, …, 8. Then,

$$\displaystyle \begin{aligned} \begin{array}{rcl} Q &\displaystyle =&\displaystyle \frac{g-1}{2\displaystyle\sum_{i=1}^{N} p_{i}(1-p_{i})} \left[ 2 \left( \sum_{i=1}^{N} p_{i} \right) \left( N-\sum_{i=1}^{N} p_{i} \right) -N(N-1) \,\delta \right]\\ &\displaystyle &\displaystyle \qquad = \frac{8-1}{2(1.2188)} \big[ 2(4.00)(6-4.00)-6(6-1)(0.3667) \big] = 14.3590\;, \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} \mu_{\delta} &\displaystyle =&\displaystyle \frac{2}{N(N-1)} \left[ \left( \sum_{i=1}^{N} p_{i} \right) \left( N-\sum_{i=1}^{N} p_{i} \right)-\sum_{i=1}^{N} p_{i} (1-p_{i}) \right]\\ &\displaystyle &\displaystyle \qquad \qquad \qquad \qquad = \frac{2}{6(6-1)} \big[ (4.00)(6-4.00)-1.2188 \big] = 0.4521\;, \end{array} \end{aligned} $$

and

$$\displaystyle \begin{aligned} \Re = 1-\frac{\delta}{\mu_{\delta}} = 1-\frac{0.3667}{0.4521} = +0.1889\;, \end{aligned}$$

indicating approximately 19% agreement above that expected by chance. For comparison, the maximum-corrected measure of effect size proposed by Serlin et al. [77] is

$$\displaystyle \begin{aligned} \hat{\eta}_{Q}^{\,2} = \frac{Q}{Q_{\text{max}}} = \frac{Q}{N(g-1)} = \frac{14.3590}{6(8-1)} = 0.3419. \end{aligned}$$
Table 4.36 Example data for Cochran’s Q test of related proportions with N = 6 subjects and g = 8 treatments
Table 4.37 Summations for p i and p i(1 − p i) for i = 1, …, N
Table 4.38 Summation totals for |x ik − x jk| for k = 1, 2, …, 7, 8 treatments, i = 1, …, N − 1, and j = i + 1, …, N
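The worked example is easily verified. The following minimal Python sketch recomputes δ, Q, μ δ, \(\Re \), and \(\hat {\eta }_{Q}^{\,2}\) from the binary responses displayed in the calculations of p 1 through p 6 above.

```python
from itertools import combinations

# The 0/1 responses of the N = 6 students over the g = 8 scripts, taken
# from the calculations of p_1, ..., p_6 above.
x = [[0, 1, 1, 1, 0, 0, 1, 0],
     [1, 1, 1, 0, 0, 1, 1, 1],
     [0, 1, 0, 1, 1, 0, 1, 1],
     [1, 1, 1, 1, 0, 1, 1, 1],
     [0, 1, 1, 0, 0, 0, 1, 1],
     [1, 1, 1, 1, 0, 1, 1, 0]]
N, g = len(x), len(x[0])

p = [sum(row) / g for row in x]
sum_p = sum(p)
sum_pq = sum(pi * (1 - pi) for pi in p)

# delta of Eq. (4.12): average between-subject disagreement over treatments.
pairs = g * N * (N - 1) // 2
delta = sum(abs(x[i][k] - x[j][k])
            for k in range(g) for i, j in combinations(range(N), 2)) / pairs

Q = (g - 1) / (2 * sum_pq) * (2 * sum_p * (N - sum_p) - N * (N - 1) * delta)
mu_delta = 2 / (N * (N - 1)) * (sum_p * (N - sum_p) - sum_pq)
R = 1 - delta / mu_delta                 # the chance-corrected effect size
eta2 = Q / (N * (g - 1))                 # the maximum-corrected effect size

print(round(delta, 4), round(Q, 4), round(mu_delta, 4), round(R, 4), round(eta2, 4))
# 0.3667 14.359 0.4521 0.1889 0.3419
```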

4.8.3 Advantages of the \( \Re \) Measure of Effect Size

Chance-corrected measures of effect size, such as \(\Re \), possess distinct advantages in interpretation over maximum-corrected measures of effect size, such as \(\hat {\eta }_{Q}^{\,2}\). The problem with \(\hat {\eta }_{Q}^{\,2}\) lies in the manner in which \(\hat {\eta }_{Q}^{\,2}\) is maximized. The denominator of \(\hat {\eta }_{Q}^{\,2}\), Q max = N(g − 1), standardizes the observed value of Q for the sample size (N) and the number of treatments (g). Unfortunately, N(g − 1) does not standardize Q for the data on which Q is based, but rather standardizes Q on another unobserved hypothetical set of data.

Consider a simple example with N = 10 subjects and g = 2 treatments. The observed data are given in Table 4.39, where at Time 1 seven subjects were classified as Pro and three subjects were classified as Con, and at Time 2 five subjects were classified as Pro and five subjects were classified as Con.

Table 4.39 Example 2×2 cross-classification for Cochran’s Q test for change

Given the observed data in Table 4.39, only four values of Q are possible. Table 4.40 displays the four possible arrangements in the reference set of all permutations of cell frequencies given the observed row and column marginal frequency distributions, {7, 3} and {5, 5}, respectively. Table A in Table 4.40 (the observed table) yields Q = 2.00, Table B yields Q = 1.00, Table C yields Q = 0.6667, and Table D yields Q = 0.50. Thus, for the observed data given in Table 4.39, Q = 2.00 is the maximum value of Q possible, given the observed marginal frequency distributions. Note that Q max = N(g − 1) = 10(2 − 1) = 10 cannot be achieved with these data. For the data given in Table A in Table 4.40 with Q = 2.00, \(\hat {\eta }_{Q}^{\,2}\) is only 0.20, while \(\Re = 1.00\), properly registering that Q has attained its maximum value given the observed marginal frequency distributions.

Table 4.40 Four possible arrangements of the data given in Table 4.39 with fixed observed row and column marginal frequency distributions, {7, 3} and {5, 5}, respectively

\(\Re \) is a preferred alternative to \(\hat {\eta }_{Q}^{\,2}\) as a measure of effect size for two reasons. First, \(\Re \) can achieve an effect size of unity for the observed data, while this is often impossible for \(\hat {\eta }_{Q}^{\,2}\). Second, \(\Re \) is a chance-corrected measure of effect size, meaning that \(\Re \) is zero under chance conditions, unity when agreement among the N subjects is perfect, and negative under conditions of disagreement. Therefore, \(\Re \) possesses a clear interpretation corresponding to Cohen’s coefficient of inter-rater agreement and other chance-corrected measures that are familiar to most researchers. On the other hand, \(\hat {\eta }_{Q}^{\,2}\) possesses no meaningful interpretation except at its limiting values of 0 and 1.

4.9 Leik and Gove’s \(d_{N}^{\,c}\) Measure of Association

In 1971, Robert Leik and Walter Gove proposed a new measure of nominal association based on pairwise comparisons of differences between observations [53]. Dissatisfied with the existing measures of nominal association, Leik and Gove suggested a proportional-reduction-in-error measure of association that was corrected for the true maximum amount of association, given the observed marginal frequency distributions. The new measure was denoted by \(d_{N}^{\,c}\), where d indicated the index, following other indices such as Somers’ d yx and d xy; the subscript N indicated the relevance of d to a nominal dependent variable; and the superscript c indicated that the measure was corrected for the constraints imposed by the marginal frequency distributions [53, p. 287].

Like \(d_{N}^{\,c}\), many measures of association for two variables have been based on pairwise comparisons of differences between observations. Consider two nominal-level variables that have been cross-classified into an r×c contingency table, where r and c denote the number of rows and columns, respectively. Let n i., n .j, and n ij denote the row marginal frequency totals, column marginal frequency totals, and number of objects in the ijth cell, respectively, for i = 1, …, r and j = 1, …, c, and let N denote the total number of objects in the r×c contingency table. If y and x represent the row and column variables, respectively, there are N(N − 1)∕2 pairs of objects in the table that can be partitioned into five mutually exclusive, exhaustive types of pairs: concordant pairs, discordant pairs, pairs tied on variable y but differing on variable x, pairs tied on variable x but differing on variable y, and pairs tied on both variables x and y.

For an r×c contingency table, concordant pairs (pairs of objects that are ranked in the same order on both variable x and variable y) are given by:

$$\displaystyle \begin{aligned} C = \sum_{i=1}^{r-1} \,\sum_{j=1}^{c-1}n_{ij} \left( \,\sum_{k=i+1}^{r} \,\sum_{l=j+1}^{c} n_{kl} \right)\;, \end{aligned}$$

discordant pairs (pairs of objects that are ranked in one order on variable x and the reverse order on variable y) are given by:

$$\displaystyle \begin{aligned} D = \sum_{i=1}^{r-1} \,\sum_{j=1}^{c-1} n_{i,c-j+1} \left( \: \sum_{k=i+1}^{r} \,\sum_{l=1}^{c-j} n_{kl} \right)\:, \end{aligned}$$

pairs of objects tied on variable x but differing on variable y are given by:

$$\displaystyle \begin{aligned} T_{x} = \sum_{j=1}^{c} \,\sum_{i=1}^{r-1} n_{ij} \left( \;\sum_{k=i+1}^{r} n_{kj} \right)\:, \end{aligned}$$

pairs of objects tied on variable y but differing on variable x are given by:

$$\displaystyle \begin{aligned} T_{y} = \sum_{i=1}^{r} \,\sum_{j=1}^{c-1} n_{ij} \left( \, \sum_{k = j+1}^{c} n_{ik} \right)\:, \end{aligned}$$

and pairs of objects tied on both variable x and variable y are given by:

$$\displaystyle \begin{aligned} T_{xy} = \frac{1}{2} \sum_{i=1}^{r} \,\sum_{j=1}^{c} n_{ij} \left( \, n_{ij}-1 \right)\:. \end{aligned} $$

Then,

$$\displaystyle \begin{aligned} C+D+T_{x}+T_{y}+T_{xy} = \frac{N(N-1)}{2}\;. \end{aligned} $$

To illustrate the calculation of Leik and Gove’s \(d_{N}^{\,c}\) measure, consider first an example 3×3 contingency table, such as given in Table 4.41, where N = 100 observations are cross-classified into variable x and variable y, each with r = c = 3 categories labeled x 1, x 2, x 3 and y 1, y 2, y 3, respectively.

Table 4.41 Example observed values in a 3×3 contingency table with N = 100 observations

4.9.1 Observed Contingency Table

For the frequency data given in Table 4.41, consider all possible pairs of observed cell frequency values that have been partitioned into concordant pairs,

$$\displaystyle \begin{aligned} \begin{array}{rcl} C &\displaystyle =&\displaystyle \sum_{i=1}^{r-1} \,\sum_{j=1}^{c-1}n_{ij} \left( \,\sum_{k=i+1}^{r} \,\sum_{l=j+1}^{c} n_{kl} \right)\\ &\displaystyle &\displaystyle \qquad = (15)(25+10+10+20)+(5)(10+20)+(15)(10+20)+(25)(20) = 2{,}075\;, \end{array} \end{aligned} $$

all discordant pairs of observed cell frequency values,

$$\displaystyle \begin{aligned} \begin{array}{rcl} D &\displaystyle =&\displaystyle \sum_{i=1}^{r-1} \,\sum_{j=1}^{c-1} n_{i,c-j+1} \left( \: \sum_{k=i+1}^{r} \,\sum_{l=1}^{c-j} n_{kl} \right)\\ &\displaystyle &\displaystyle \qquad \qquad = (0)(15+25+0+10)+(5)(15+0)+(10)(0+10)+(25)(0) = 175\;, \end{array} \end{aligned} $$

all pairs of observed cell frequency values tied on variable x,

$$\displaystyle \begin{aligned} \begin{array}{rcl} T_{x} &\displaystyle =&\displaystyle \sum_{j=1}^{c} \,\sum_{i=1}^{r-1} n_{ij} \left( \;\sum_{k=i+1}^{r} n_{kj} \right)\\ &\displaystyle &\displaystyle \qquad \ \ = (15)(15+0)+(15)(0)+(5)(25+10)+(25)(10)\\ &\displaystyle &\displaystyle \qquad \qquad \qquad \qquad \qquad \qquad \qquad +(0)(10+20)+(10)(20) = 850\;, \vspace{-2pt}\end{array} \end{aligned} $$

all pairs of observed cell frequency values tied on variable y,

$$\displaystyle \begin{aligned} \begin{array}{rcl} T_{y} &\displaystyle =&\displaystyle \sum_{i=1}^{r} \,\sum_{j=1}^{c-1} n_{ij} \left( \, \sum_{k=j+1}^{c} n_{ik} \right)\\ &\displaystyle &\displaystyle \qquad \ \ = (15)(5+0)+(5)(0)+(15)(25+10)+(25)(10)\\ &\displaystyle &\displaystyle \qquad \qquad \qquad \qquad \qquad \qquad \qquad +(0)(10+20)+(10)(20) = 1{,}050\;, \end{array} \end{aligned} $$

and all pairs of observed cell frequency values tied on both variables x and y,

$$\displaystyle \begin{aligned} \begin{array}{rcl} T_{xy} &\displaystyle =&\displaystyle \frac{1}{2} \sum_{i=1}^{r} \,\sum_{j=1}^{c} n_{ij} \left( \, n_{ij}-1 \right)\\ &\displaystyle &\displaystyle \quad = \frac{(15)(14)+(5)(4)+(15)(14)+(25)(24)+(10)(9)+(10)(9)+(20)(19)}{2} = 800\;. \end{array} \end{aligned} $$

Then,

$$\displaystyle \begin{aligned} C+D+T_{x}+T_{y}+T_{xy} = \frac{N(N-1)}{2} \end{aligned} $$

and, for the observed frequency data given in Table 4.41,

$$\displaystyle \begin{aligned} 2{,}075+175+850+1{,}050+800 = \frac{100(100-1)}{2} = 4{,}950\;. \end{aligned}$$
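The five pair counts are easily programmed. The following minimal Python sketch recomputes C, D, T x, T y, and T xy for Table 4.41; the cell frequencies used are those implied by the expansions shown above and are consistent with the observed row and column marginal frequency totals.

```python
# Cell frequencies consistent with the marginal totals and the pair-count
# expansions shown above for Table 4.41.
table = [[15,  5,  0],
         [15, 25, 10],
         [ 0, 10, 20]]
r, c = len(table), len(table[0])
N = sum(map(sum, table))

C_pairs = sum(table[i][j] * table[k][l]            # concordant pairs
              for i in range(r) for j in range(c)
              for k in range(i + 1, r) for l in range(j + 1, c))
D_pairs = sum(table[i][j] * table[k][l]            # discordant pairs
              for i in range(r) for j in range(c)
              for k in range(i + 1, r) for l in range(j))
T_x = sum(table[i][j] * table[k][j]                # tied on x (same column)
          for j in range(c) for i in range(r) for k in range(i + 1, r))
T_y = sum(table[i][j] * table[i][l]                # tied on y (same row)
          for i in range(r) for j in range(c) for l in range(j + 1, c))
T_xy = sum(n * (n - 1) for row in table for n in row) // 2   # tied on both

print(C_pairs, D_pairs, T_x, T_y, T_xy)                 # 2075 175 850 1050 800
print(C_pairs + D_pairs + T_x + T_y + T_xy, N * (N - 1) // 2)   # 4950 4950
```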

4.9.2 Expected Contingency Table

Now, consider Table 4.41 expressed as expected cell values, as given in Table 4.42, where an expected value is given by:

$$\displaystyle \begin{aligned} E_{ij} = \frac{n_{i.}n_{.j}}{N} \qquad \mbox{for}\ i = 1,\,\ldots,\,r\ \mbox{and}\ j = 1,\,\ldots,\,c\;. \end{aligned}$$
Table 4.42 Example expected values in a 3×3 contingency table with N = 100 observations

For example,

$$\displaystyle \begin{aligned} E_{11} = \frac{(20)(30)}{100} = 6 \quad \mbox{and} \quad E_{12} = \frac{(20)(40)}{100} = 8\;. \end{aligned}$$

Following Leik and Gove, let a prime (′) indicate a sum of pairs calculated on the expected cell frequency values. Then, for the expected cell frequency values given in Table 4.42, the same partitioning of all possible pairs yields \(C^{\,\prime}\) = 1,023 concordant pairs, \(D^{\,\prime}\) = 1,023 discordant pairs, \(T_{x}^{\,\prime}\) = 1,054 pairs tied on variable x, \(T_{y}^{\,\prime}\) = 1,254 pairs tied on variable y, and \(T_{xy}^{\,\prime}\) = 596 pairs tied on both variables x and y.

Then,

$$\displaystyle \begin{aligned} C^{\,\prime}+D^{\,\prime}+T_{x}^{\,\prime}+T_{y}^{\,\prime}+T_{xy}^{\,\prime} = \frac{N(N-1)}{2} \end{aligned}$$

and, for the expected frequency data given in Table 4.42,

$$\displaystyle \begin{aligned} 1{,}023+1{,}023+1{,}054+1{,}254+596 = \frac{100(100-1)}{2} = 4{,}950\;. \end{aligned}$$

Fortunately, there is a more convenient way to calculate \(C^{\,\prime}\), \(D^{\,\prime}\), \(T_{x}^{\,\prime }\), \(T_{y}^{\,\prime }\), and \(T_{xy}^{\,\prime }\) without first calculating the expected values. First, given the observed row and column marginal frequency distributions in Table 4.41, {20, 50, 30} and {30, 40, 30}, respectively, calculate the number of pairs of expected cell frequency values tied on both variables x and y,

$$\displaystyle \begin{aligned} \begin{array}{rcl} T_{xy}^{\,\prime} &\displaystyle =&\displaystyle \frac{1}{2N^{2}} \left( \sum_{i=1}^{r} n_{i.}^{2} \right) \left( \sum_{j=1}^{c} n_{.j}^{2} \right)-\frac{N}{2}\\ &\displaystyle &\displaystyle \qquad = \frac{1}{2(100^{2})} \left( 20^{2}+50^{2}+30^{2} \right) \left( 30^{2}+40^{2}+30^{2} \right)-\frac{100}{2} = 596\;. \end{array} \end{aligned} $$

Next, calculate the number of pairs of expected cell frequency values tied on variable y,

$$\displaystyle \begin{aligned} T_{y}^{\,\prime} = \frac{1}{2} \sum_{i=1}^{r} n_{i.}^{2}-\frac{N}{2}-T_{xy}^{\,\prime} = \frac{1}{2} \left( 20^{2}+50^{2}+30^{2} \right)-\frac{100}{2}-596 = 1{,}254\;. \end{aligned}$$

In like manner, calculate the number of pairs of expected cell frequency values tied on variable x,

$$\displaystyle \begin{aligned} T_{x}^{\,\prime} = \frac{1}{2} \sum_{j=1}^{c} n_{.j}^{2}-\frac{N}{2}-T_{xy}^{\,\prime} = \frac{1}{2} \left( 30^{2}+40^{2}+30^{2} \right)-\frac{100}{2}-596 = 1{,}054\;. \end{aligned}$$

Finally, calculate the number of concordant and discordant pairs of expected cell frequency values,

$$\displaystyle \begin{aligned} \begin{array}{rcl} C^{\,\prime} = D^{\,\prime} &\displaystyle =&\displaystyle \frac{1}{2} \left[ \frac{N(N-1)}{2}-T_{x}^{\,\prime}-T_{y}^{\,\prime}-T_{xy}^{\,\prime} \right]\\ &\displaystyle &\displaystyle \qquad \qquad = \frac{1}{2} \left[ \frac{100(100-1)}{2}-1054-1254-596 \right] = 1{,}023\;. \end{array} \end{aligned} $$

It should be noted that \(C^{\,\prime}\), \(D^{\,\prime}\), \(T_{x}^{\,\prime }\), \(T_{y}^{\,\prime }\), and \(T_{xy}^{\,\prime }\) are all calculated on the observed marginal frequency totals of the observed contingency table, which are invariant under permutation.
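A minimal Python sketch of the shortcut calculations follows; only the observed row and column marginal frequency totals of Table 4.41 are required.

```python
# Shortcut calculations of the expected-pair totals from the marginals alone.
row = [20, 50, 30]
col = [30, 40, 30]
N = sum(row)

T_xy_e = sum(n * n for n in row) * sum(n * n for n in col) / (2 * N ** 2) - N / 2
T_y_e = sum(n * n for n in row) / 2 - N / 2 - T_xy_e
T_x_e = sum(n * n for n in col) / 2 - N / 2 - T_xy_e
C_e = D_e = (N * (N - 1) / 2 - T_x_e - T_y_e - T_xy_e) / 2

print(C_e, D_e, T_x_e, T_y_e, T_xy_e)   # 1023.0 1023.0 1054.0 1254.0 596.0
```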

4.9.3 Maximized Contingency Table

Test statistic \(d_{N}^{\,c}\) is based on three contingency tables: the table of observed values given in Table 4.41, the table of expected values given in Table 4.42, and a table of maximum values to be described next. A contingency table of maximum values is necessary for computing \(d_{N}^{\,c}\). An algorithm for generating an arrangement of cell frequencies in an r×c contingency table that provides the maximum value of a test statistic was presented in Chap. 3, Sect. 3.2. The algorithm is reproduced here for convenience.

  1. Step 1:

    List the observed marginal frequency totals of an r×c contingency table with empty cell frequencies.

  2. Step 2:

    If any pair of marginal frequency totals, one from each set of marginals, are equal to each other, enter that value in the table as n ij and subtract the value from the two marginal frequency totals. For example, if the marginal frequency total for Row 2 is equal to the marginal frequency total for Column 3, enter the marginal frequency total in the table as n 23 and subtract the value of n 23 from the marginal frequency totals of Row 2 and Column 3.

    Repeat Step 2 until no two marginal frequency totals are equal. If all marginal frequency totals have been reduced to zero, go to Step 5; otherwise, go to Step 3.

  3. Step 3:

    Find the largest remaining marginal frequency totals in each set and enter the smaller of the two values in n ij. Then, subtract that (smaller) value from the two marginal frequency totals. Go to Step 4.

  4. Step 4:

    If all marginal frequency totals have been reduced to zero, go to Step 5; otherwise, go to Step 2.

  5. Step 5:

    Set any remaining n ij values to zero, i = 1, …, r and j = 1, …, c.

To illustrate the algorithmic procedure, consider the 3×3 contingency table given in Table 4.41 on p. 192, replicated in Table 4.43 for convenience. Then, the procedure is:

Table 4.43 Example observed values in a 3×3 contingency table with N = 100 observations
  1. Step 1:

    List the observed row and column marginal frequency totals, leaving the cell frequencies empty, as in Table 4.44.

    Table 4.44 Empty 3×3 contingency table with observed row marginal frequency distribution {20, 50, 30} and observed column marginal frequency distribution {30, 40, 30}
  2. Step 2:

    For the two sets of marginal frequency totals given in Table 4.44, three marginal frequency totals are equal to 30, one for Row 3, one for Column 1, and one for Column 3, i.e., n 3. = n .1 = n .3 = 30. Set n 31 = 30 and subtract 30 from the two marginal frequency totals. The adjusted row and column marginal frequency totals are now {20, 50, 0} and {0, 40, 30}, respectively. No other two marginal frequency totals are identical, so go to Step 3.

  3. Step 3:

    The two largest remaining marginal frequency totals are 50 in Row 2 and 40 in Column 2, i.e., n 2. = 50 and n .2 = 40. Set n 22 = 40, the smaller of the two marginal frequency totals, and subtract 40 from the two adjusted marginal frequency totals. The adjusted row and column marginal frequency totals are now {20, 10, 0} and {0, 0, 30}, respectively. Go to Step 4.

  4. Step 4:

    Not all marginal frequency totals have been reduced to zero, so go to Step 2.

  5. Step 2:

    No two marginal frequency totals are identical, so go to Step 3.

  6. Step 3:

    The two largest marginal frequency totals are 20 in Row 1 and 30 in Column 3, i.e., n 1. = 20 and n .3 = 30. Set n 13 = 20, the smaller of the two marginal frequency totals, and subtract 20 from the two adjusted marginal frequency totals. The adjusted row and column marginal frequency totals are now {0, 10, 0} and {0, 0, 10}, respectively. Go to Step 4.

  7. Step 4:

    Not all marginal frequency totals have been reduced to zero, so go to Step 2.

  8. Step 2:

    Two marginal frequency totals are equal to 10, one for Row 2 and one for Column 3, i.e., n 2. = n .3 = 10. Set n 23 = 10 and subtract 10 from the two adjusted marginal frequency totals. The adjusted row and column marginals are now {0, 0, 0} and {0, 0, 0}. All adjusted marginal frequency totals are now zero, so go to Step 5.

  9. Step 5:

    Set any remaining n ij values to zero; in this case, n 11, n 12, n 21, n 32, and n 33 are set to zero.

The completed contingency table is given in Table 4.45. When there are tied values in a marginal distribution, e.g., n .1 = n .3 = 30, there may be alternative cell locations for the non-zero entries, meaning that more than one arrangement of cell frequencies may satisfy the conditions, but the nine cell frequency values {0, 0, 20, 0, 40, 10, 30, 0, 0} must be included in the 3×3 maximized contingency table.

Table 4.45 Completed 3×3 contingency table with row marginal frequency distribution {20, 50, 30} and column marginal frequency distribution {30, 40, 30}
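
The five-step algorithm is straightforward to program. The following Python sketch (not the authors' implementation) transcribes the steps directly; when marginal totals are tied it breaks ties in scan order, so the non-zero entries may occupy alternative cell locations, as noted above, but for the marginals of Table 4.41 it reproduces the arrangement of Table 4.45.

```python
def maximize_table(row_marg, col_marg):
    """Arrange cell frequencies for maximum association (Steps 1-5 above)."""
    rows, cols = list(row_marg), list(col_marg)
    r, c = len(rows), len(cols)
    table = [[0] * c for _ in range(r)]              # Step 1: empty cells
    while sum(rows) > 0:
        matched = True
        while matched:                               # Step 2: pair equal totals
            matched = False
            for i in range(r):
                for j in range(c):
                    if rows[i] > 0 and rows[i] == cols[j]:
                        table[i][j] += rows[i]
                        rows[i] = cols[j] = 0
                        matched = True
        if sum(rows) == 0:
            break                                    # Step 5: remaining cells stay 0
        i = max(range(r), key=lambda k: rows[k])     # Step 3: largest remaining totals
        j = max(range(c), key=lambda k: cols[k])
        v = min(rows[i], cols[j])
        table[i][j] += v
        rows[i] -= v
        cols[j] -= v                                 # Step 4: repeat until all zero
    return table

print(maximize_table([20, 50, 30], [30, 40, 30]))
# [[0, 0, 20], [0, 40, 10], [30, 0, 0]] -- the cell frequencies of Table 4.45
```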

Let a double prime (′′) indicate a sum of pairs calculated on the maximized cell frequency values. Then, for the maximized frequency data given in Table 4.45, the number of concordant pairs of maximized cell frequency values is \(C^{\,\prime\prime} = 0\), the number of discordant pairs is \(D^{\,\prime\prime} = 2{,}900\), the number of pairs tied on variable x is \(T_{x}^{\,\prime\prime} = 200\), the number of pairs tied on variable y is \(T_{y}^{\,\prime\prime} = 400\), and the number of pairs tied on both variables x and y is \(T_{xy}^{\,\prime\prime} = 1{,}450\). Then,

$$\displaystyle \begin{aligned} C^{\,\prime\prime}+D^{\,\prime\prime}+T_{x}^{\,\prime\prime}+T_{y}^{\,\prime\prime}+T_{xy}^{\,\prime\prime} = \frac{N(N-1)}{2} \end{aligned}$$

and for the maximized data given in Table 4.45,

$$\displaystyle \begin{aligned} 0+2{,}900+200+400+1{,}450 = \frac{100(100-1)}{2} = 4{,}950\;. \end{aligned}$$

Note that the maximized contingency table given in Table 4.45 occurs only when as few cells as possible contain non-zero entries. Thus, either \(C^{\,\prime\prime}\) or \(D^{\,\prime\prime}\) is maximized and the other is minimized; in this case, \(C^{\,\prime\prime} = 0\) is the minimum value of C possible, given the observed marginal frequency distributions, and \(D^{\,\prime\prime} = 2{,}900\) is the maximum value of D possible, given the observed marginal frequency distributions. Also, \(T_{x}^{\,\prime \prime } = 200\) and \(T_{y}^{\,\prime \prime } = 400\) are the minimum values of T x and T y possible, given the observed marginal frequency distributions. On the other hand, \(T_{xy}^{\,\prime \prime } = 1{,}450\) is the maximum value of T xy possible, given the observed marginal frequency distributions.

Table 4.46 summarizes the C, D, T x, T y, and T xy values obtained from the observed, expected, and maximized contingency tables.

Table 4.46 Values for C, D, T x, T y, and T xy obtained from the observed, expected, and maximized frequency tables

4.9.4 Calculation of Leik and Gove’s \(d_{N}^{\,c}\)

Given the observed, expected, and maximized values for C, D, T x, T y, and T xy in Table 4.46, errors of the first kind (E 1)—the variation between independence and maximum association—are given by:

$$\displaystyle \begin{aligned} E_{1} = T_{y}^{\,\prime}-T_{y}^{\,\prime\prime} = 1{,}254-400 = 854 \end{aligned}$$

and errors of the second kind (E 2)—the variation between the observed table and the table of maximum association—are given by:

$$\displaystyle \begin{aligned} E_{2} = T_{y}-T_{y}^{\,\prime\prime} = 1{,}050-400 = 650\;. \end{aligned}$$

Then, in the manner of proportional-reduction-in-error measures of association,

$$\displaystyle \begin{aligned} \begin{array}{rcl} d_{N}^{\,c} = \frac{E_{1}-E_{2}}{E_{1}} = \frac{(T_{y}^{\,\prime}-T_{y}^{\,\prime\prime})-(T_{y}-T_{y}^{\,\prime\prime})}{T_{y}^{\,\prime}-T_{y}^{\,\prime\prime}} &\displaystyle =&\displaystyle \frac{T_{y}^{\,\prime}-T_{y}}{T_{y}^{\,\prime}-T_{y}^{\,\prime\prime}}\\ &\displaystyle &\displaystyle = \frac{1{,}254-1{,}050}{1{,}254-400} = 0.2389\;. \end{array} \end{aligned} $$

Because \(d_{N}^{\,c}\) is a symmetrical measure, the number of tied values on variable x can be used in place of the number of tied values on variable y. Thus,

$$\displaystyle \begin{aligned} d_{N}^{\,c} = \frac{T_{x}^{\,\prime}-T_{x}}{T_{x}^{\,\prime}-T_{x}^{\,\prime\prime}} = \frac{1{,}054-850}{1{,}054-200} = 0.2389\;. \end{aligned}$$

Alternatively, \(d_{N}^{\,c}\) can be defined in terms of the number of values tied on both x and y. Thus,

$$\displaystyle \begin{aligned} d_{N}^{\,c} = \frac{T_{xy}^{\,\prime}-T_{xy}}{T_{xy}^{\,\prime}-T_{xy}^{\,\prime\prime}} = \frac{596-800}{596-1{,}450} = 0.2389\;. \end{aligned}$$

Because the data are categorical, C and D can be considered as grouped together. Thus,

$$\displaystyle \begin{aligned} \begin{array}{rcl} d_{N}^{\,c} = \frac{\big( C^{\,\prime}+D^{\,\prime} \big)- \big( C+D\big)}{\big( C^{\,\prime}+D^{\,\prime}\big)-\big( C^{\,\prime\prime}+D^{\,\prime\prime}\big)} &\displaystyle =&\displaystyle \frac{(1{,}023+1{,}023)-(2{,}075+175)}{(1{,}023+1{,}023)-(0+2{,}900)}\\ &\displaystyle &\displaystyle \qquad \qquad \qquad \qquad \qquad \qquad = 0.2389\;. \end{array} \end{aligned} $$

Finally,

$$\displaystyle \begin{aligned} d_{N}^{\,c} = \frac{T_{y}^{\,\prime}-T_{y}}{T_{y}^{\,\prime}-T_{y}^{\,\prime\prime}} = \frac{T_{x}^{\,\prime}-T_{x}}{T_{x}^{\,\prime}-T_{x}^{\,\prime\prime}} = \frac{T_{xy}^{\,\prime}-T_{xy}}{T_{xy}^{\,\prime}-T_{xy}^{\,\prime\prime}} = \frac{\big( C^{\,\prime}+D^{\,\prime} \big)- \big( C+D\big)}{\big( C^{\,\prime}+D^{\,\prime}\big)-\big( C^{\,\prime\prime}+D^{\,\prime\prime}\big)}\;. \end{aligned}$$
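
As a quick numerical check (a small sketch, not from the source), substituting the values collected in Table 4.46 into each of the four equivalent expressions gives the same result.

```python
# Values taken from Table 4.46: observed, expected (single prime),
# and maximized (double prime) quantities.
C_o, D_o, Tx_o, Ty_o, Txy_o = 2075, 175, 850, 1050, 800       # observed
C_e, D_e, Tx_e, Ty_e, Txy_e = 1023, 1023, 1054, 1254, 596     # expected
C_m, D_m, Tx_m, Ty_m, Txy_m = 0, 2900, 200, 400, 1450         # maximized

d1 = (Ty_e - Ty_o) / (Ty_e - Ty_m)
d2 = (Tx_e - Tx_o) / (Tx_e - Tx_m)
d3 = (Txy_e - Txy_o) / (Txy_e - Txy_m)
d4 = ((C_e + D_e) - (C_o + D_o)) / ((C_e + D_e) - (C_m + D_m))
print(round(d1, 4), round(d2, 4), round(d3, 4), round(d4, 4))
# 0.2389 0.2389 0.2389 0.2389
```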

As noted by Leik and Gove, as an aid in interpreting the relationship between variables x and y, it would be preferable to explicitly determine the number of pairs lost to the marginal requirements of the contingency table. Association can then be defined within those limits, enabling the index to reach unity if cell frequencies are as close to a perfect pattern as the marginal distributions allow [53, p. 286]. Thus, for the frequency data given in Table 4.41 on p. 192, the proportion of cases being considered is

$$\displaystyle \begin{aligned} 1-\frac{2 \left( T_{x}^{\,\prime\prime}+T_{y}^{\,\prime\prime} \right)}{N(N-1)} = 1-\frac{2(200+400)}{100(100-1)} = 0.8788\;. \end{aligned}$$

4.9.5 A Permutation Test for \(d_{N}^{\,c}\)

Leik and Gove did not provide a standard error for test statistic \(d_{N}^{\,c}\) [52]. On the other hand, permutation tests neither assume nor require knowledge of standard errors. Consider the expression

$$\displaystyle \begin{aligned} d_{N}^{\,c} = \frac{T_{y}^{\,\prime}-T_{y}}{T_{y}^{\,\prime}-T_{y}^{\,\prime\prime}}\;. \end{aligned}$$

It is readily apparent that \(T_{y}^{\,\prime }\) and \(T_{y}^{\,\prime \prime }\) are invariant under permutation. Therefore, the probability of \(d_{N}^{\,c}\) under the null hypothesis can be determined by the discrete permutation distribution of T y alone, which is easily obtained from the observed contingency table. Exact permutation statistical methods are highly efficient when only the variable portion of the defined test statistic is calculated on each of the M possible arrangements of the observed data; in this case, T y.

For the frequency data given in Table 4.41 on p. 192, there are only M = 96,151 possible, equally-likely arrangements in the reference set of all permutations of cell frequencies given the observed row and column marginal frequency distributions, {20, 50, 30} and {30, 40, 30}, respectively, making an exact permutation analysis feasible. If all M = 96,151 arrangements occur with equal chance, the exact probability value of \(d_{N}^{\,c}\) under the null hypothesis is the sum of the hypergeometric point probability values associated with \(d_{N}^{\,c} = 0.2389\) or greater. Based on the underlying hypergeometric probability distribution, the exact upper-tail probability value is P = 0.1683×10−11.
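
For illustration, the permutation distribution of T y can also be sampled rather than enumerated. The following Monte Carlo sketch in Python (not the authors' procedure) permutes column labels against fixed row labels, which generates random tables with both observed marginal frequency distributions fixed and with the appropriate hypergeometric probabilities; the observed value T y = 1,050 is taken from Table 4.46. Because the exact probability value is so small (P = 0.1683×10−11), an impractically large number of random arrangements would be required to estimate it, so the sketch only illustrates the mechanics.

```python
import numpy as np

rng = np.random.default_rng(2023)          # arbitrary seed

row_marg = [20, 50, 30]                    # Table 4.41 row marginals
col_marg = [30, 40, 30]                    # Table 4.41 column marginals
T_y_obs = 1050                             # observed T_y from Table 4.46

def pairs_tied_on_y_only(table):
    """Pairs of cases in the same row (tied on y) but in different columns."""
    total = 0
    for row in table:
        s = int(row.sum())
        total += (s * (s - 1) - int(np.sum(row * (row - 1)))) // 2
    return total

rows = np.repeat(np.arange(3), row_marg)   # fixed row labels
cols = np.repeat(np.arange(3), col_marg)   # column labels to be permuted

L = 100_000
count = 0
for _ in range(L):
    table = np.zeros((3, 3), dtype=int)
    np.add.at(table, (rows, rng.permutation(cols)), 1)
    # Larger d_N^c corresponds to smaller T_y, so the upper tail of d_N^c
    # is the lower tail of T_y.
    if pairs_tied_on_y_only(table) <= T_y_obs:
        count += 1
print(count / L)                           # Monte Carlo estimate of the upper-tail P
```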

4.10 A Matrix Occupancy Problem

In many research situations, it is necessary to examine a sequence of observations on a small group of subjects, where each observation is classified in one of two ways. Suppose, for example, a Success (1) or Failure (0) is recorded for each of N ≥ 2 subjects on each of k ≥ 2 tasks. The standard test in such cases is Cochran’s Q test, as described in Sect. 4.7.

However, when the number of subjects is small, e.g., 2 ≤ N ≤ 6, and the number of treatments is large, e.g., 20 ≤ k ≤ 400, an alternative test may be preferable to Cochran’s Q test. Such research conditions arise for a number of reasons. First, a long-term panel study is proposed, but few subjects are willing to make a research commitment due to the extended time of the research, or the treatment is either distasteful or time-intensive for the subjects. Second, a longitudinal study begins with an adequate number of subjects, but there is a high drop-out rate and survival analysis cannot be justified. Third, very few subjects satisfy the research protocol. Fourth, the cost of each observation/treatment is expensive for the researcher. Fifth, subjects are very expensive, as in primate studies. Sixth, a pilot study with a small number of subjects may be implemented to establish the validity of the research prior to applying for funding for a larger study.

Consider an N×k occupancy matrix with N subjects (rows) and k treatment conditions (columns). Let x ij denote the observation of the ith subject (i = 1, …, N) in the jth treatment condition (j = 1, …, k), where a success is coded 1 and a failure is coded 0. For any subject, a success might result from the treatment administered or it might result from some other cause or a random response, i.e., a false positive. Therefore, a successful treatment response is counted only when all N subjects score a success, i.e., a full column of 1 values. Clearly, this approach does not generalize well to a great number of subjects since it is unrealistic for a large number of subjects to respond in concert. The Q test of Cochran is preferable when N is large.

In 1965, Mielke and Siddiqui presented an exact permutation procedure for the matrix occupancy problem in Journal of the American Statistical Association that is appropriate for small samples (N) and a large number of treatments (k) [68]. Let

$$\displaystyle \begin{aligned} R_{i} = \sum_{j=1}^{k} x_{ij} \end{aligned}$$

for i = 1, …, N denote subject (row) totals, let

$$\displaystyle \begin{aligned} M = \prod_{i=1}^{N} \binom{k}{R_{i}} \end{aligned}$$

denote the number of equally-likely distinguishable N × k occupancy matrices in the reference set, under the null hypothesis, and let \(v = \min (R_{1},\,\ldots ,\,R_{N})\). The null hypothesis stipulates that each of the M distinguishable configurations of 1s and 0s within each of the N rows occurs with equal probability, given that the R 1, …, R N values are fixed. If U g is the number of distinct configurations where exactly g treatment conditions (columns) are filled with successes (1s), then

$$\displaystyle \begin{aligned} U_{v} = \binom{k}{v} \prod_{i=1}^{N} \binom{k-v}{R_{i}-v} \end{aligned}$$

is the initial value of the recursive relation

$$\displaystyle \begin{aligned} U_{g} = \binom{k}{g} \left[ \prod_{i=1}^{N} \binom{k-g}{R_{i}-g}-\sum_{j=g+1}^{v} \binom{k-g}{j-g} \frac{U_{j}}{\binom{k}{j}} \right]\;, \end{aligned}$$

where 0 ≤ g ≤ v − 1. Once the recursion has been carried down to g = 0, then

$$\displaystyle \begin{aligned} M = \sum_{g=0}^{v} U_{g} \end{aligned}$$

and the exact probability of observing s or more treatment conditions (columns) completely filled with successes (1s) is given by:

$$\displaystyle \begin{aligned} P = \frac{1}{M} \sum_{g=s}^{v} U_{g}\;, \end{aligned}$$

where 0 ≤ s ≤ v.

In 1972, Eicker, Siddiqui, and Mielke described extensions to the matrix occupancy problem [28]. In 1974, Mantel [58] observed that the solution to the matrix occupancy problem was also the solution to the “committee problem” considered by Mantel and Pasternack in 1968 [59], Gittelsohn in 1969 [36], Sprott in 1969 [81], and White in 1971 [85]. Whereas the matrix occupancy problem considers N subjects and k treatments, scoring a success by a subject for a specific treatment as a 1 and a failure as a 0, the committee problem considers N committees and k individuals, scoring a 1 if an individual is not a member of a specified committee and 0 otherwise. The committee problem is concerned with the number of individuals belonging to no committees, which is equivalent to the concern of the matrix occupancy problem with the number of treatments associated with successes among all subjects.

4.10.1 Example Analysis

Consider an experiment with N = 6 subjects and k = 8 treatment conditions, such as given in Table 4.47. For the binary data listed in Table 4.47, the R i totals are {4, 6, 5, 7, 4, 6}, the minimum of R i, i = 1, … , N, is v = 4, the number of treatment conditions filled with 1s is s = 2 (treatments 2 and 7),

$$\displaystyle \begin{aligned} \sum_{g=s}^{v}U_{g} = \sum_{g=2}^{4}U_{g} = 149{,}341{,}920+6{,}838{,}720+40{,}320 = 156{,}220{,}960\;, \end{aligned}$$

the number of N×k occupancy matrices in the reference set of all possible occupancy matrices, under the null hypothesis, is

$$\displaystyle \begin{aligned} \begin{array}{rcl} M = \prod_{i=1}^{N} \binom{k}{R_{i}} &\displaystyle =&\displaystyle \binom{8}{4} \binom{8}{6} \binom{8}{5} \binom{8}{7} \binom{8}{4} \binom{8}{6}\\ &\displaystyle &\displaystyle \qquad \quad = 70 \times 28 \times 56 \times 8 \times 70 \times 28 = 1{,}721{,}036{,}800\;, \end{array} \end{aligned} $$

and the exact probability of observing s = 2 or more treatment conditions completely filled with 1s is

$$\displaystyle \begin{aligned} P = \frac{1}{M} \sum_{g=s}^{v} U_{g} = \frac{156{,}220{,}960}{1{,}721{,}036{,}800} = 0.0908\;. \end{aligned}$$

It is also possible to define a maximum-corrected measure of effect size as R = s∕k, which varies from 0, when no treatments (columns) are completely filled with 1s, to a maximum of 1, when all k columns are filled with 1s; in this example,

$$\displaystyle \begin{aligned} R = \frac{s}{k} = \frac{2}{8} = 0.25. \end{aligned}$$
Table 4.47 Successes (1s) and failures (0s) of N = 6 subjects on a series of k = 8 treatments
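
The recursion lends itself to a compact implementation. The following Python sketch (not the authors' program) computes U v, U v−1, …, U 0 from the row sums R i, the number of treatments k, and the observed number s of completely filled columns, and reproduces the probability value of the example just given.

```python
from math import comb, prod

def occupancy_probability(R, k, s):
    """Exact probability of s or more columns completely filled with 1s,
    using the Mielke-Siddiqui recursion for U_g."""
    v = min(R)
    U = {v: comb(k, v) * prod(comb(k - v, r - v) for r in R)}
    for g in range(v - 1, -1, -1):
        correction = sum(comb(k - g, j - g) * (U[j] // comb(k, j))
                         for j in range(g + 1, v + 1))
        U[g] = comb(k, g) * (prod(comb(k - g, r - g) for r in R) - correction)
    M = sum(U.values())                      # total number of occupancy matrices
    return sum(U[g] for g in range(s, v + 1)) / M

print(occupancy_probability([4, 6, 5, 7, 4, 6], k=8, s=2))   # 0.0908 (rounded)
```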

4.11 Fisher’s Exact Probability Test

While Fisher’s exact probability (FEP) test is, strictly speaking, not a measure of association between two nominal-level variables, it has assumed such importance in the analysis of 2×2 contingency tables that excluding Fisher’s exact test from consideration would be a serious omission. That said, however, Fisher’s exact probability test provides the probability of association rather than a measure of the strength of association. The Fisher exact probability test was independently developed by R.A. Fisher, Frank Yates, and Joseph Irwin in the early 1930s [32, 47, 89]. Consequently, the test is often referred to as the Fisher–Yates or the Fisher–Irwin exact probability test.Footnote 5

Although the Fisher exact probability test was originally designed for 2×2 contingency tables and is used almost exclusively for this purpose, in this section the test is extended to apply to other contingency tables such as 2×3, 3×3, 3×4, 2×2×2, and other larger contingency tables. For ease of calculation and to avoid large factorial expressions, a recursion procedure with an arbitrary initial value provides an efficient method to obtain exact probability values; for a detailed description of recursion procedures, see Chap. 2, Sects. 2.6.1 and 2.6.2.

4.11.1 Fisher’s Exact Analysis of a 2×2 Table

Consider a 2×2 contingency table with N cases, where x o denotes the observed frequency of any cell and r and c represent the row and column marginal frequency totals, respectively, corresponding to x o. Table 4.48 illustrates the notation for a 2×2 contingency table.

Table 4.48 Example notation for a 2×2 contingency table

If H(x|r, c, N) is a recursively defined positive function in which

$$\displaystyle \begin{aligned} \begin{array}{rcl} H(x|r,c,N) &\displaystyle =&\displaystyle D \times\displaystyle\binom{r}{x}\binom{N-r}{c-x}\displaystyle\binom{N}{c}^{-1}\\ &\displaystyle &\displaystyle \qquad \qquad = D \times \frac{r!\;c!\;(N-r)!\;(N-c)!}{N!\;x!\;(r-x)!\;(c-x)!\;(N-r-c+x)!}\;, \end{array} \end{aligned} $$

where D > 0 is an unknown constant, then solving the recursive relation

$$\displaystyle \begin{aligned} H(x+1|r,c,N) = H(x|r,c,N) \times g(x) \end{aligned}$$

yields

$$\displaystyle \begin{aligned} g(x) = \frac{(r-x)(c-x)}{(x+1)(N-r-c+x+1)}\;. \end{aligned}$$

The algorithm may then be employed to enumerate all values of

$$\displaystyle \begin{aligned} H(x|r,c,N)\;, \end{aligned}$$

where a ≤ x ≤ b, \(a = \max (0,r+c-N)\), \(b = \min (r,c)\), and H(a|r, c, N) is initially set to some small positive value [14]. The total over the entire distribution may be found by:

$$\displaystyle \begin{aligned} T = \sum_{k=a}^{b} H(k|r,c,N)\;. \end{aligned}$$

To calculate the probability value of x o, given the observed marginal frequency distributions, the point probability of the observed table must be determined. This value, designated by U 2 = H(x o|r, c, N), is found recursively. Next, the tail of the probability distribution associated with U 2 must be identified. Let

$$\displaystyle \begin{aligned} U_{1} = \begin{cases} \,H(x_{\text{o}}-1|r,c,N) & \text{if}\ x_{\text{o}} > a\;, \\ {} \,0 & \text{if}\ x_{\text{o}} = a\;, \end{cases} \end{aligned}$$

and

$$\displaystyle \begin{aligned} U_{3} = \begin{cases} \,H(x_{\text{o}}+1|r,c,N) & \text{if}\ x_{\text{o}} < b\;, \\ {} \,0 & \text{if}\ x_{\text{o}} = b\;. \end{cases} \end{aligned}$$

If U 1 > U 3, U 2 is located in the right tail of the distribution; otherwise, U 2 is defined to be in the left tail of the distribution, and the one-tailed (S 1) and two-tailed (S 2) subtotals may be found by:

$$\displaystyle \begin{aligned} S_{1}(x_{\text{o}}|r,c,N) = \sum_{k=a}^{b} K_{k}H(k|r,c,N) \end{aligned}$$

and

$$\displaystyle \begin{aligned} S_{2}(x_{\text{o}}|r,c,N) = \sum_{k=a}^{b} L_{k}H(k|r,c,N)\;, \end{aligned}$$

respectively, where

$$\displaystyle \begin{aligned} K_{k} = \begin{cases} \,1 & \text{if}\ U_{1} \leq U_{3}\ \text{and}\ k \leq x_{\text{o}}\ \text{or}\ \text{if}\ U_{1} > U_{3}\ \text{and}\ k \geq x_{\text{o}}\;, \\ {} \,0 & \text{otherwise ,} \end{cases} \end{aligned}$$

and

$$\displaystyle \begin{aligned} L_{k} = \begin{cases} \,1 & \text{if}\ H(k|r,c,N) \leq U_{2}\;, \\ {} \,0 & \text{otherwise ,} \end{cases} \end{aligned}$$

for k = a, …, b. The one- and two-tailed exact probability values are then given by:

$$\displaystyle \begin{aligned} P_{1} = \frac{S_{1}}{T} \quad \mbox{and} \quad P_{2} = \frac{S_{2}}{T}\;, \end{aligned}$$

respectively.

4.11.1.1 A 2×2 Contingency Table Example

To illustrate the calculation of Fisher’s exact probability test for a fourfold contingency table, consider the 2×2 contingency table given in Table 4.49 with x o = 6, r = 9, c = 8, N = 20,

$$\displaystyle \begin{aligned} a = \max(0,r+c-N) = \max(0,9+8-20) = \max(0,-3) = 0\;, \end{aligned}$$
$$\displaystyle \begin{aligned} b = \min(r,c) = \min(9,8) = 8\;, \end{aligned}$$

and b − a + 1 = 8 − 0 + 1 = 9 possible table configurations in the reference set of all permutations of cell frequencies, given the observed row and column marginal frequency distributions, {9, 11} and {8, 12}, respectively.

Table 4.49 Example 2×2 contingency table

Table 4.50 lists the nine possible values of x in the first column. The second column of Table 4.50 lists the exact point probability values for x = 0, …, 8 calculated from the conventional hypergeometric probability expression given by:

$$\displaystyle \begin{aligned} \begin{array}{rcl} p(x|r,c,N) &\displaystyle =&\displaystyle \displaystyle\binom{r}{x}\binom{N-r}{c-x}\displaystyle\binom{N}{c}^{-1}\\ &\displaystyle &\displaystyle \qquad \qquad \qquad \ = \frac{r!\;(N-r)!\;c!\;(N-c)!}{N!\;x!\;(r-x)!\;(c-x)!\;(N-r-c+x)!}\;. \end{array} \end{aligned} $$

The third column of Table 4.50 contains the recursion values where, for x = 0, the initial (starting) value is arbitrarily set to 1 for this example analysis. Then,

$$\displaystyle \begin{aligned} 1 \left[ \frac{(9)(8)}{(1)(4)} \right] &= 18 \;,\\ 18 \left[ \frac{(8)(7)}{(2)(5)} \right] &= 100.80 \;,\\ 100.80 \left[ \frac{(7)(6)}{(3)(6)} \right] &= 235.20 \;,\\ 235.20 \left[ \frac{(6)(5)}{(4)(7)} \right] &= 252 \;,\\ 252 \left[ \frac{(5)(4)}{(5)(8)} \right] &= 126 \;,\\ 126 \left[ \frac{(4)(3)}{(6)(9)} \right] &= 28 \;,\\ 28 \left[ \frac{(3)(2)}{(7)(10)} \right] &= 2.40 \;,\\ 2.40 \left[ \frac{(2)(1)}{(8)(11)} \right] &= 0.054545 \;. \end{aligned} $$

The total of H(x|r, c, N) for x = 0, …, 8 is

$$\displaystyle \begin{aligned} T = 1+18+100.80+235.20+252+126+28+2.40+0.054545 = 763.454545\;. \end{aligned}$$

The fourth column of Table 4.50 corrects the entries of the third column by dividing each entry by T. For the frequency data given in Table 4.49,

$$\displaystyle \begin{aligned} U_{2} = H(x_{\text{o}}|r,c,N) = H(6|9,8,20) = 28\;. \end{aligned}$$

Because x o > a, i.e., 6 > 0,

$$\displaystyle \begin{aligned} U_{1} = H(x_{\text{o}}-1|r,c,N) = H(5|9,8,20) = 126 \end{aligned}$$

and because x o < b, i.e., 6 < 8,

$$\displaystyle \begin{aligned} U_{3} = H(x_{\text{o}}+1|r,c,N) = H(7|9,8,20) = 2.40\;. \end{aligned}$$

Thus, U 2 = 28 is located in the right tail of the distribution since U 1 > U 3, i.e., 126 > 2.40. Then, the one- and two-tailed subtotals are

$$\displaystyle \begin{aligned} S_{1} = 28+2.40+0.054545 = 30.454545 \end{aligned}$$

and

$$\displaystyle \begin{aligned} S_{2} = 1+18+28+2.40+0.054545 = 49.454545\;, \end{aligned}$$

respectively, and the one- and two-tailed exact probability values are

$$\displaystyle \begin{aligned} P_{1} = \frac{S_{1}}{T} = \frac{30.454545}{763.454545} = 0.039890 \end{aligned}$$

and

$$\displaystyle \begin{aligned} P_{2} = \frac{S_{2}}{T} = \frac{49.454545}{763.454545} = 0.064777\;, \end{aligned}$$

respectively.

Table 4.50 Example of statistical recursion with an arbitrary initial value
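
The recursion with an arbitrary initial value is easily programmed. The following Python sketch (not taken from the source) reproduces the one- and two-tailed probability values just obtained; because only the ratios S 1∕T and S 2∕T are reported, the arbitrary starting value H(a) = 1 cancels.

```python
def fisher_2x2(x_obs, r, c, N):
    """One- and two-tailed Fisher exact probability values for a 2x2 table,
    computed with the recursion and an arbitrary initial value."""
    a, b = max(0, r + c - N), min(r, c)
    H = {a: 1.0}                                     # arbitrary starting value
    for x in range(a, b):
        H[x + 1] = H[x] * (r - x) * (c - x) / ((x + 1) * (N - r - c + x + 1))
    T = sum(H.values())
    U2 = H[x_obs]
    U1 = H[x_obs - 1] if x_obs > a else 0.0
    U3 = H[x_obs + 1] if x_obs < b else 0.0
    if U1 > U3:                                      # observed value in the right tail
        S1 = sum(h for k, h in H.items() if k >= x_obs)
    else:                                            # observed value in the left tail
        S1 = sum(h for k, h in H.items() if k <= x_obs)
    S2 = sum(h for h in H.values() if h <= U2)
    return S1 / T, S2 / T

print(fisher_2x2(6, 9, 8, 20))   # approximately (0.039890, 0.064777)
```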

4.11.2 Larger Contingency Tables

Although Fisher’s exact probability test has largely been limited to the analysis of 2×2 contingency tables in the literature, it is not difficult to extend Fisher’s exact test to larger contingency tables, although such extensions may be computationally intensive [71, pp. 127–130, 296–298 ]. Consider an example 2×3 contingency table with N cases, where x o denotes the observed frequency of the cell in the first row and first column, y o denotes the observed frequency of the cell in the second row and first column, and r 1, r 2, and c 1 are the observed marginal frequency totals in the first row, second row, and first column, respectively. If H(x, y), given N, r 1, r 2, and c 1, is a recursively defined positive function, then solving the recursive relation

$$\displaystyle \begin{aligned} H(x, y+1) = H(x,y) \times g_{1}(x, y) \end{aligned}$$

yields

$$\displaystyle \begin{aligned} g_{1}(x, y) = \frac{(c_{1}-x-y)(r_{2}-y)}{(1+y)(N-r_{1}-r_{2}-c_{1}+1+x+y)}\;. \end{aligned} $$
(4.14)

If \(y = \min (r_{2},c_{1}-x)\), then H(x + 1, y) = H(x, y) × g 2(x, y), where

$$\displaystyle \begin{aligned} g_{2}(x, y) = \frac{(c_{1}-x-y)(r_{1}-x)}{(1+x)(N-r_{1}-r_{2}-c_{1}+1+x+y)}\;, \end{aligned} $$
(4.15)

given that \(\max (0,r_{1}+r_{2}+c_{1}-N-x) = 0\). However, if \(y = \min (r_{2},c_{1}-x)\) and \(\max (0,r_{1}+r_{2}+c_{1}-N-x) > 0\), then H(x + 1, y − 1) = H(x, y) × g 3(x, y), where

$$\displaystyle \begin{aligned} g_{3}(x,y) = \frac{y(r_{1}-x)}{(1+x)(r_{2}+1-y)}\;. \end{aligned} $$
(4.16)

The three recursive expressions given in Eqs. (4.14), (4.15), and (4.16) may be employed to completely enumerate the distribution of H(x, y), where a ≤ x ≤ b, \(a = \max (0,r_{1}+c_{1}-N)\), \(b = \min (r_{1}, c_{1})\), c(x) ≤ y ≤ d(x), \(c(x) = \max (0,r_{1}+r_{2}+c_{1}-N-x)\), \(d(x) = \min (r_{2},c_{1}-x)\), and H[a, c(a)] is initially set to some small positive value [15]. The total over the completely enumerated distribution may be found by:

$$\displaystyle \begin{aligned} T = \sum_{x=a}^{b}\,\sum_{y=c(x)}^{d(x)} H(x,y)\;. \end{aligned}$$

To calculate the probability value of (x o, y o), given the observed marginal frequency distributions, the hypergeometric point probability value of the observed 2×3 contingency table must be obtained; this value may also be found recursively. Next, the probability of a result this extreme or more extreme must be found. The subtotal is given by:

$$\displaystyle \begin{aligned} S = \sum_{x=a}^{b}\,\sum_{y=c(x)}^{d(x)} J_{x,y}H_{x,y}\;, \end{aligned}$$

where

$$\displaystyle \begin{aligned} J_{x,y} = \begin{cases} \,1 & \text{if}\ H(x,y) \leq H(x_{\text{o}},y_{\text{o}})\;, \\ {} \,0 & \text{otherwise ,} \end{cases} \end{aligned}$$

for x = a, …, b and y = c(x), …, d(x). The exact probability value for independence associated with the observed cell frequencies, x o and y o, is given by P = S∕T.

4.11.2.1 A 2×3 Contingency Table Example

To illustrate the calculation of Fisher’s exact probability test for a 2×3 contingency table, consider the frequency data given in Table 4.51 where x o = 5, y o = 3, r 1 = 10, c 1 = 13, c 2 = 7, and N = 29. For the frequency data given in Table 4.51, there are only M = 59 arrangementsFootnote 6 of cell frequencies that are consistent with the observed row and column marginal frequency distributions, {10, 19} and {13, 7, 9}, respectively, and exactly 56 of the arrangements M = 59 have hypergeometric point probability values equal to or less than the point probability value of the observed table (p = 0.8096×10−1), yielding an exact probability value of P = 0.6873. Since the 2×3 table in Table 4.51 has only two degrees of freedom, Table 4.52 lists the M = 59 values for n 11 and n 12 for each possible arrangement of cell frequencies, given the observed marginal frequency totals, and the associated hypergeometric point probability values. Row 56 contains the observed values of n 11 = 5 and n 12 = 3 indicated by an asterisk.

Table 4.51 Example 2×3 contingency table
Table 4.52 Listing of the M = 59 possible cell arrangements for the data given in Table 4.51 with cell frequencies n 11, n 12, and associated exact hypergeometric point probability values
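
Because the reference set for a 2×3 table is small, the recursion can be cross-checked by direct enumeration. The following Python sketch (a brute-force alternative, not the recursion described above) reconstructs the observed table from the marginals of Table 4.51 and the reported cell values n 11 = 5 and n 12 = 3, enumerates the two free cells, and sums the hypergeometric point probability values that do not exceed the observed point probability value.

```python
from math import factorial

def point_probability(table, N):
    """Hypergeometric point probability of an r x c table with fixed margins."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    num = 1
    for total in rows + cols:
        num *= factorial(total)
    den = factorial(N)
    for row in table:
        for n in row:
            den *= factorial(n)
    return num / den

# Table 4.51: row marginals {10, 19}, column marginals {13, 7, 9}.
r1, c1, c2, c3, N = 10, 13, 7, 9, 29
observed = [[5, 3, 2], [8, 4, 7]]          # remaining cells follow from the margins
p_obs = point_probability(observed, N)

M, S, T = 0, 0.0, 0.0
for n11 in range(0, min(r1, c1) + 1):
    for n12 in range(0, min(r1 - n11, c2) + 1):
        n13 = r1 - n11 - n12
        n21, n22, n23 = c1 - n11, c2 - n12, c3 - n13
        if n23 < 0:
            continue                       # arrangement inconsistent with margins
        p = point_probability([[n11, n12, n13], [n21, n22, n23]], N)
        M += 1
        T += p
        if p <= p_obs * (1 + 1e-9):        # tolerance for floating-point ties
            S += p
print(M, S / T)   # 59 arrangements; exact P (reported as 0.6873 in the text)
```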

4.11.2.2 A 2×6 Contingency Table Example

Fisher’s exact probability test is easily extended to any 2×c contingency table. For example, consider the 2×6 contingency table given in Table 4.53 where v o = 1, w o = 4, x o = 3, y o = 4, z o = 8, r 1 = 6, r 2 = 5, r 3 = 10, r 4 = 9, r 5 = 10, c 1 = 29, and N = 52. For the frequency data given in Table 4.53, M = 33, 565 arrangements of cell frequencies are consistent with the observed row and column marginal frequency distributions, {29, 23} and {6, 5, 10, 9, 10, 12}, respectively, and exactly 27,735 of the M = 33, 565 arrangements have hypergeometric point probability values equal to or less than the point probability value of the observed table (p = 0.1159×10−3), yielding an exact probability value of P = 0.0338.

Table 4.53 Example 2×6 contingency table

4.11.2.3 A 3×3 Contingency Table Example

Fisher’s exact probability test can also be applied to larger contingency tables, although calculation time increases substantially as the number of rows and columns increase. In this section, Fisher’s exact probability test is applied to a 3×3 contingency table. Consider the 3×3 contingency table given in Table 4.54 where w o = 3, x o = 5, y o = 2, z o = 9, r 1 = 10, r 2 = 14, c 1 = 13, c 2 = 16, and N = 40. For the frequency data given in Table 4.54, M = 4, 818 arrangements of cell frequencies are consistent with the observed row and column marginal frequency distributions, {10, 14, 16} and {13, 16, 11}, respectively, and exactly 3,935 of the M = 4, 818 arrangements have hypergeometric point probability values equal to or less than the point probability value of the observed table (p = 0.1273×10−4), yielding an exact probability value of P = 0.0475.

Table 4.54 Example 3×3 contingency table

4.11.2.4 A 3×4 Contingency Table Example

Finally, consider the sparse 3×4 contingency table given in Table 4.55. For the frequency data given in Table 4.55, only M = 706 arrangements of cell frequencies are consistent with the observed row and column marginal frequency distributions, {5, 5, 4} and {4, 3, 4, 3}, respectively, and 168 of the M = 706 arrangements have hypergeometric point probability values equal to or less than the point probability value of the observed table (p = 0.1903×10−3), yielding an exact probability value of P = 0.0187.

Table 4.55 Example 3×4 contingency table

4.12 Analyses of 2×2×2 Tables

Fisher’s exact probability test is not limited to two-way contingency tables. Consider a 2×2×2 contingency table, such as depicted in Fig. 4.1, where n ijk denotes the cell frequency of the ith row, jth column, and kth slice for i, j, k = 1, 2. Denote by a dot (⋅) the partial sum of all rows, all columns, or all slices, depending on the position of the (⋅) in the subscript list. If the (⋅) is in the first subscript position, the sum is over all rows, if the (⋅) is in the second subscript position, the sum is over all columns, and if the (⋅) is in the third subscript position, the sum is over all slices. Thus, n i.. denotes the marginal frequency total of the ith row, i = 1, …, r, summed over all columns and slices; n .j. denotes the marginal frequency total of the jth column, j = 1, …, c, summed over all rows and slices; and n ..k denotes the marginal frequency total of the kth slice, k = 1, …, s, summed over all rows and columns. Therefore, A = n 1.., B = n .1., C = n ..1, and N = n denote the observed marginal frequency totals of the first row, first column, first slice, and entire table, respectively, such that 1 ≤ A ≤ B ≤ C ≤ N∕2. Also, let w = n 111, x = n 112, y = n 121, and z = n 211 denote cell frequencies of the 2×2×2 contingency table. Then, the probability for any w, x, y, and z is given by:

[67]. An algorithm to compute Fisher’s exact probability test involves a nested looping structure and requires two distinct passes. The first pass yields the exact probability, U, of the observed 2×2×2 contingency table and is terminated when U is obtained. The second pass yields the exact probability value of all tables with hypergeometric point probability values equal to or less than the point probability of the observed contingency table. The four nested loops within each pass are over the cell frequency indices w, x, y, and z, respectively. The bounds for w, x, y, and z are

$$\displaystyle \begin{aligned} 0 \leq &w \leq M_{w}\;,\\ 0 \leq &x \leq M_{x}\;,\\ 0 \leq &y \leq M_{y}\;,\\ \end{aligned} $$

and

$$\displaystyle \begin{aligned} L_{z} \leq &z \leq M_{z}\;, \end{aligned} $$

respectively, where M w = A, M x = A − w, M y = A − w − x, \(M_{z} = \min (B-w-x,C-w-y)\), and \(L_{z} = \max (0,A+B+C-N-2w-x-y)\).

Fig. 4.1
figure 1

Graphic depiction of a 2×2×2 contingency table

The recursion method can be illustrated with the fourth (inner) loop over z, given w, x, y, A, B, C, and N because the inner loop yields both U on the first pass and the exact probability value on the second pass. Let H(w, x, y, z) be a recursively defined positive function given A, B, C, and N, satisfying

$$\displaystyle \begin{aligned} H(w,x,y,z+1) = H(w,x,y,z) \times g(w,x,y,z)\;, \end{aligned}$$

where

$$\displaystyle \begin{aligned} g(w,x,y,z) = \frac{(B-w-x-z)(C-w-y-z)}{(z+1)(N-A-B-C+2w+x+y+z+1)}\;. \end{aligned}$$

The remaining three loops of each pass initialize H(w, x, y, z) for continued enumerations. Let \(I_{z} = \max (0,A+B+C-N)\) and set the initial value of H(0, 0, 0, I z) to an arbitrary small positive constant. Then, the total over the completely enumerated distribution is found by:

$$\displaystyle \begin{aligned} T = \sum_{w=0}^{M_{w}}\,\sum_{x=0}^{M_{x}}\,\sum_{y=0}^{M_{y}}\,\sum_{z=L_{z}}^{M_{z}} H(w,x,y,z)\;. \end{aligned}$$

If w o, x o, y o, and z o are the values of w, x, y, and z in the observed 2×2×2 contingency table, then U and the exact probability value (P) are given by:

$$\displaystyle \begin{aligned} U = H(w_{\text{o}},x_{\text{o}},y_{\text{o}},z_{\text{o}})/T \end{aligned}$$

and

$$\displaystyle \begin{aligned} P = \sum_{w=0}^{M_{w}}\,\sum_{x=0}^{M_{x}}\,\sum_{y=0}^{M_{y}}\,\sum_{z=L_{z}}^{M_{z}} H(w,x,y,z)\,\psi(w,x,y,z)/T\;, \end{aligned}$$

respectively, where

$$\displaystyle \begin{aligned} \psi(w,x,y,z) = \begin{cases} \,1 & \text{if}\ H(w,x,y,z) \leq H(w_{\text{o}},x_{\text{o}},y_{\text{o}},z_{\text{o}})\;, \\ {} \,0 & \text{otherwise .} \end{cases} \end{aligned}$$
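
For small N, the two-pass recursion can be cross-checked by brute force. The following Python sketch (a brute-force alternative, not the recursion, and far too slow for a table as large as the example that follows) enumerates every 2×2×2 table with the observed one-way marginal totals A, B, C, and N, weights each table by N!∕∏ n ijk!, which is proportional to its hypergeometric point probability, and accumulates the weights of tables with point probability values equal to or less than that of the observed table.

```python
from math import factorial

def exact_p_2x2x2(table):
    """Brute-force exact probability for a 2x2x2 table under fixed one-way
    marginal totals A = n_1.., B = n_.1., C = n_..1, and N; table[i][j][k]
    holds n_ijk for i, j, k in {0, 1}."""
    cells_obs = [table[i][j][k] for i in (0, 1) for j in (0, 1) for k in (0, 1)]
    N = sum(cells_obs)
    A = sum(table[0][j][k] for j in (0, 1) for k in (0, 1))
    B = sum(table[i][0][k] for i in (0, 1) for k in (0, 1))
    C = sum(table[i][j][0] for i in (0, 1) for j in (0, 1))

    def weight(cells):
        den = 1
        for n in cells:
            den *= factorial(n)
        return factorial(N) // den                   # N!/prod(n_ijk!)

    w_obs = weight(cells_obs)
    total = extreme = 0
    for w in range(A + 1):                           # w = n_111
        for x in range(A - w + 1):                   # x = n_112
            for y in range(A - w - x + 1):           # y = n_121
                z_lo = max(0, A + B + C - N - 2 * w - x - y)
                z_hi = min(B - w - x, C - w - y)
                for z in range(z_lo, z_hi + 1):      # z = n_211
                    cells = [w, x, y, A - w - x - y,
                             z, B - w - x - z, C - w - y - z,
                             N - A - B - C + 2 * w + x + y + z]
                    wt = weight(cells)
                    total += wt
                    if wt <= w_obs:
                        extreme += wt
    return extreme / total
```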

4.12.1 A 2×2×2 Contingency Table Example

Consider a scenario in which N = 1,663 respondents were asked if they agreed with the statement that women should have equal pay for the same job as men (No, Yes). The respondents were then classified by region of the country (North, South) and by year of the survey (2000, 2010). For the frequency data given in Table 4.56, M = 3,683,159,504 arrangements of cell frequencies are consistent with the observed row, column, and slice marginal frequency distributions, {623, 1,040}, {1,279, 384}, and {1,039, 624}, respectively. Exactly 2,761,590,498 of the arrangements have hypergeometric point probability values equal to or less than the point probability value of the observed table (p = 0.1684×10−72), yielding an exact probability value of P = 0.1684×10−65.

Table 4.56 Cross-classification of responses (No, Yes), categorized by year and region

4.12.2 A 3×4×2 Contingency Table Example

Fisher’s exact probability test is not limited to multi-way contingency tables with only two categories in each dimension. Consider the r×c×s contingency table given in Table 4.57 with r = 3 rows, c = 4 columns, and s = 2 slices. In general, it is not efficient to analyze complex multi-way tables with exact permutation procedures, as there are usually too many arrangements of cell frequencies in the reference set of all possible arrangements of cell frequencies. For the frequency data given in Table 4.57 with row, column, and slice marginal frequency distributions, {71, 31}, {21, 32, 25, 24}, and {29, 37, 36}, respectively, the approximate resampling probability value based on L = 1, 000, 000 random arrangements of cell frequencies is

$$\displaystyle \begin{aligned} P = \frac{29{,}600}{1{,}000{,}000} = 0.0296\;. \end{aligned}$$
Table 4.57 Three-way contingency table with r = 3 rows, c = 4 columns, and s = 2 slices
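
A resampling analysis of this kind can be sketched in a few lines of Python (not the authors' program). The sketch assumes, as in the 2×2×2 case above, that the null model fixes only the one-way marginal totals and that "as or more extreme" means a hypergeometric point probability value equal to or less than that of the observed table; the cell frequencies of Table 4.57 are not reproduced here, so the observed table must be supplied as input.

```python
import numpy as np
from math import lgamma

def resample_p(table, L=100_000, seed=0):
    """Approximate resampling probability for an r x c x s contingency table
    under independence with fixed one-way marginal totals."""
    rng = np.random.default_rng(seed)
    table = np.asarray(table)
    r_marg = table.sum(axis=(1, 2))
    c_marg = table.sum(axis=(0, 2))
    s_marg = table.sum(axis=(0, 1))
    N = int(table.sum())

    def log_weight(t):
        # log of N!/prod(n_ijk!), proportional to the point probability
        return lgamma(N + 1) - sum(lgamma(n + 1) for n in t.ravel())

    obs = log_weight(table)
    rows = np.repeat(np.arange(len(r_marg)), r_marg)
    cols = np.repeat(np.arange(len(c_marg)), c_marg)
    slices = np.repeat(np.arange(len(s_marg)), s_marg)

    count = 0
    for _ in range(L):
        t = np.zeros(table.shape, dtype=int)
        np.add.at(t, (rows, rng.permutation(cols), rng.permutation(slices)), 1)
        if log_weight(t) <= obs + 1e-9:    # tolerance for floating-point ties
            count += 1
    return count / L
```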

4.13 Coda

Chapter 3 applied permutation statistical methods to measures of association for two nominal-level variables that are based on Pearson’s chi-squared test statistic. Chapter 4 applied exact and resampling permutation statistical methods to measures of association for two nominal-level variables that are not based on Pearson’s chi-squared test statistic. Included in Chap. 4 were Goodman and Kruskal’s asymmetric λ a, λ b, t a, and t b measures, Cohen’s unweighted chance-corrected κ coefficient, McNemar’s and Cochran’s Q measures of change, Leik and Gove’s \(d_{N}^{\,c}\) measure, Mielke and Siddiqui’s exact probability for the matrix occupancy problem, and Fisher’s exact probability test, extended to cover a variety of contingency tables. For each test, examples illustrated the measures and either exact or resampling probability values based on the appropriate permutation analysis were provided.

Chapter 5 applies permutation statistical methods to a variety of measures of association designed for ordinal-level variables that are based on all possible paired comparisons. Included in Chap. 5 are Kendall’s τ a and τ b and Stuart’s τ c measures of ordinal association, Somers’ asymmetric d yx and d xy measures, Kim’s d y.x and d x.y measures, Wilson’s e measure, and Cureton’s rank-biserial correlation coefficient.