3.1 Notion of Distance Between Two Rankings

A ranking represents the order of preference one has with respect to a set of t objects. If we label the objects by the integers 1 to t, a ranking can then be thought of as a permutation of the integers \((1,2,\ldots,t)\). We may denote such a permutation by \(\mu = \left(\mu (1),\mu (2),\ldots,\mu (t)\right)'\), which may also be conceptualized as a point in t-dimensional space. It is natural to measure the discrepancy between two permutations μ, ν by means of a distance function. Several distance functions have been proposed in the literature. Here are a few:

Spearman

$$\displaystyle{ d_{S}(\mu,\nu ) = \frac{1} {2}\sum _{i=1}^{t}\left (\mu (i) -\nu (i)\right )^{2}. }$$
(3.1)

Kendall

$$\displaystyle{ d_{K}(\mu,\nu ) =\sum _{i<j}\left \{1 - sgn\left (\mu (j) -\mu (i)\right )sgn\left (\nu (j) -\nu (i)\right )\right \}, }$$
(3.2)

where sgn(x) is either 1 or −1 depending on whether x > 0 or x < 0.

Hamming

$$\displaystyle{ d_{H}(\mu,\nu ) = t -\sum _{i=1}^{t}\sum _{ j=1}^{t}I\left (\mu (i) = j\right )I\left (\nu (i) = j\right ) }$$
(3.3)

where \(I(\cdot)\) is the indicator function taking the value 1 or 0 according to whether the statement in brackets holds or not.

Spearman Footrule

$$\displaystyle{ d_{F}(\mu,\nu ) =\sum _{ i=1}^{t}\vert \mu (i) -\nu (i)\vert. }$$
(3.4)
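
For concreteness, the four distances can be computed directly from their definitions. The following Python sketch (the function names are ours, for illustration only) implements (3.1)–(3.4):

```python
def sgn(x):
    # sign function as used in (3.2): 1 if x > 0, -1 if x < 0
    return 1 if x > 0 else -1

def d_spearman(mu, nu):
    # (3.1): half the squared Euclidean distance between rank vectors
    return sum((m - n) ** 2 for m, n in zip(mu, nu)) / 2

def d_kendall(mu, nu):
    # (3.2): adds 2 for every discordant pair, 0 for every concordant one
    t = len(mu)
    return sum(1 - sgn(mu[j] - mu[i]) * sgn(nu[j] - nu[i])
               for i in range(t) for j in range(i + 1, t))

def d_hamming(mu, nu):
    # (3.3): the number of positions at which the two rankings disagree
    return sum(m != n for m, n in zip(mu, nu))

def d_footrule(mu, nu):
    # (3.4): sum of absolute rank differences
    return sum(abs(m - n) for m, n in zip(mu, nu))

# For example, with mu = (1, 3, 2) and nu = (2, 1, 3):
# d_spearman -> 3.0, d_kendall -> 4, d_hamming -> 3, d_footrule -> 4
```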

The Spearman measure is not a proper “distance” in that it does not obey the triangle inequality. We shall nonetheless refer to it as a distance function in this book. It is based upon the squared Euclidean distance, whereas the Footrule is based on absolute deviations. The Kendall distance counts the number of “discordant” pairs, whereas the Hamming distance counts the number of “mismatches.” The Hamming distance has found uses in coding theory. These distances have the property of being invariant under any permutation relabeling of the objects. That is, for any permutations σ, μ, ν,

$$\displaystyle{d\left (\mu,\nu \right ) = d\left (\mu \circ \sigma,\nu \circ \sigma \right )}$$

where \(\mu \circ \sigma \left (i\right ) =\mu \left (\sigma \left (i\right )\right ).\) This property is known as right invariance. Let \(\Delta = \left (d\left (\mu _{i},\mu _{j}\right )\right )\) denote the matrix of all pairwise distances. If d is right invariant, then it follows that there exists a constant c > 0 for which

$$\displaystyle{\Delta \mathbf{1} = (ct!)\mathbf{1}}$$

where \(\boldsymbol{1} = \left (1,1,\ldots,1\right )'\) is of dimension t!. Hence, c is equal to the average distance. It is straightforward to show that for the Spearman and Kendall distances

$$\displaystyle{c_{S} = \frac{t(t^{2} - 1)} {12},c_{K} = \frac{t(t - 1)} {2}.}$$

Turning attention to the Hamming distance, we note that if \(e = \left (1,2,\ldots,t\right )'\), then

$$\displaystyle\begin{array}{rcl} \Sigma _{\mu }d_{H}\left (\mu,e\right )& =& \Sigma _{\mu }t - \Sigma _{\mu }\Sigma _{i}\Sigma _{j}I\left (\mu \left (i\right ) = j\right )I\left (e\left (i\right ) = j\right ) {}\\ & =& t\left (t!\right ) - t! {}\\ \end{array}$$

and hence \(c_{H} = \left (t - 1\right )\).
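
These constants are easy to confirm numerically: by right invariance, the average distance from a fixed ranking to all t! rankings equals c. A small check for t = 4, reusing the distance functions sketched above:

```python
from itertools import permutations
from statistics import mean

t = 4
e = tuple(range(1, t + 1))
perms = list(permutations(e))

# By right invariance, the average distance to the identity equals c.
print(mean(d_spearman(mu, e) for mu in perms))  # c_S = t(t^2 - 1)/12 = 5.0
print(mean(d_kendall(mu, e) for mu in perms))   # c_K = t(t - 1)/2   = 6.0
print(mean(d_hamming(mu, e) for mu in perms))   # c_H = t - 1        = 3.0
```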

Example 3.1.

Suppose that t = 3 and that the complete rankings are denoted by

$$\displaystyle\begin{array}{rcl} & & \mu _{1} = \left (1,2,3\right )',\mu _{2} = \left (1,3,2\right )',\mu _{3} = \left (2,1,3\right )',\mu _{4} = \left (2,3,1\right )',\mu _{5} = \left (3,1,2\right )', {}\\ & & \mu _{6} = \left (3,2,1\right )'. {}\\ \end{array}$$

Using the above order of the permutations, we may write the matrix \(\Delta\) of pairwise Spearman, Kendall, Hamming, and Footrule distances respectively as

$$\displaystyle{\Delta _{S} = \left (\begin{array}{cccccc} 0&1&1&3&3&4\\ 1 &0 &3 &1 &4 &3 \\ 1&3&0&4&1&3\\ 3 &1 &4 &0 &3 &1 \\ 3&4&1&3&0&1\\ 4 &3 &3 &1 &1 &0 \end{array} \right )}$$
$$\displaystyle{\Delta _{K} = \left (\begin{array}{cccccc} 0&2&2&4&4&6\\ 2 &0 &4 &2 &6 &4 \\ 2&4&0&6&2&4\\ 4 &2 &6 &0 &4 &2 \\ 4&6&2&4&0&2\\ 6 &4 &4 &2 &2 &0 \end{array} \right )}$$
$$\displaystyle{\Delta _{H} = \left (\begin{array}{cccccc} 0&2&2&3&3&2\\ 2 &0 &3 &2 &2 &3 \\ 2&3&0&2&2&3\\ 3 &2 &2 &0 &3 &2 \\ 3&2&2&3&0&2\\ 2 &3 &3 &2 &2 &0 \end{array} \right )}$$
$$\displaystyle{\Delta _{F} = \left (\begin{array}{cccccc} 0&2&2&4&4&4\\ 2 &0 &4 &2 &4 &4 \\ 2&4&0&4&2&4\\ 4 &2 &4 &0 &4 &2 \\ 4&4&2&4&0&2\\ 4 &4 &4 &2 &2 &0 \end{array} \right )}$$

These distances may alternatively be written in terms of a similarity function in the form

$$\displaystyle{ d(\mu,\nu ) = c -\mathcal{A}(\mu,\nu ), }$$
(3.5)

Spearman:

$$\displaystyle{ \mathcal{A}_{S} = \mathcal{A}_{S}(\mu,\nu ) =\sum _{i=1}^{t}\left (\mu (i) -\frac{t + 1} {2} \right )\left (\nu (i) -\frac{t + 1} {2} \right ). }$$
(3.6)

Kendall:

$$\displaystyle{ \mathcal{A}_{K} = \mathcal{A}_{K}(\mu,\nu ) =\sum _{i<j}sgn\left (\mu (j) -\mu (i)\right )sgn\left (\nu (j) -\nu (i)\right ). }$$
(3.7)

Hamming:

$$\displaystyle{ \mathcal{A}_{H}\left (\mu,\nu \right ) =\sum _{ i=1}^{t}\sum _{ j=1}^{t}\left (I\left [\mu (i) = j\right ] -\frac{1} {t}\right )\left (I\left [\nu (i) = j\right ] -\frac{1} {t}\right ). }$$
(3.8)

Footrule:

$$\displaystyle{\mathcal{A}_{F}\left (\mu,\nu \right ) = \sum _{i=1}^{t}\sum _{ j=1}^{t}\left (I\left [\mu (i) \leq j\right ] -\frac{j} {t}\right )\left (I\left [\nu (i) \leq j\right ] -\frac{j} {t}\right ).}$$

The similarity measures may also be interpreted geometrically as inner products, which lays the groundwork for defining correlation in the next section.
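
To make the decomposition (3.5) concrete, the following sketch verifies \(d_{S} = c_{S} -\mathcal{A}_{S}\) exhaustively for t = 4, with \(\mathcal{A}_{S}\) computed from (3.6); it reuses the d_spearman helper sketched earlier:

```python
from itertools import permutations

def sim_spearman(mu, nu):
    # A_S in (3.6): inner product of the centred rank vectors
    t = len(mu)
    c = (t + 1) / 2
    return sum((m - c) * (n - c) for m, n in zip(mu, nu))

t = 4
c_S = t * (t ** 2 - 1) / 12
for mu in permutations(range(1, t + 1)):
    for nu in permutations(range(1, t + 1)):
        assert d_spearman(mu, nu) == c_S - sim_spearman(mu, nu)  # (3.5)
```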

3.2 Correlation Between Two Rankings

The notion of correlation occurs frequently in statistics. For example, in regression analysis, one is interested in the correlation between two variables such as height and weight. Similarly, in nonparametric statistics, we shall be interested in the correlation between two rankings. Let \(\mathcal{P}\) be the space of all possible permutations of the integers \(1,2,\ldots,t\). We may define the correlation between two rankings μ, ν as

$$\displaystyle{ \alpha \left (\mu,\nu \right ) = 1 -\frac{2d\left (\mu,\nu \right )} {M} }$$
(3.9)

where M is the maximum value of the distance \(d\left (\mu,\nu \right )\) taken over all possible pairs μ, ν in \(\mathcal{P}\) (Diaconis and Graham 1977). In the case of the Spearman and Kendall distance, the maximum values occur when

$$\displaystyle{\left (\mu (i) -\frac{t + 1} {2} \right ) = -\left (\nu (i) -\frac{t + 1} {2} \right )\:\mathit{for}\:\mathit{all}\:i,}$$

whereas the minimum occurs when

$$\displaystyle{\left (\mu (i) -\frac{t + 1} {2} \right ) = \left (\nu (i) -\frac{t + 1} {2} \right )\:\mathit{for}\:\mathit{all}\:i.}$$

This is a consequence of the rearrangement inequality given as a lemma below.

Lemma 3.1.

Let \(a_{1},\ldots,a_{t}\) and \(b_{1},\ldots,b_{t}\) be real numbers, not necessarily positive, with

$$\displaystyle{a_{1} \leq \ldots \leq a_{t},b_{1} \leq \ldots \leq b_{t}}$$

and let σ be a permutation of the integers 1,…,t. Then

$$\displaystyle{a_{1}b_{t} +\ldots +a_{t}b_{1} \leq a_{1}b_{\sigma \left (1\right )} +\ldots +a_{t}b_{\sigma \left (t\right )} \leq a_{1}b_{1} +\ldots +a_{t}b_{t}.}$$

Proof.

The proof follows by induction on t. □ 

It can be shown that for the Spearman and Kendall distances, the maximum is equal to twice the mean,

$$\displaystyle{ M_{S} = 2c_{S},M_{K} = 2c_{K}. }$$
(3.10)

In view of (3.10) we have

$$\displaystyle\begin{array}{rcl} \alpha _{S}\left (\mu,\nu \right )& =& \frac{\mathcal{A}_{S}} {c_{S}},\alpha _{K}\left (\mu,\nu \right ) = \frac{\mathcal{A}_{K}} {c_{K}}.{}\end{array}$$
(3.11)
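
In code, (3.11) yields the familiar rank correlation coefficients. A minimal sketch, again reusing the earlier distance helpers:

```python
def alpha_spearman(mu, nu):
    # (3.9)/(3.11) with M_S = 2 c_S; this is the classical Spearman rho
    t = len(mu)
    c_S = t * (t ** 2 - 1) / 12
    return 1 - d_spearman(mu, nu) / c_S

def alpha_kendall(mu, nu):
    # (3.9)/(3.11) with M_K = 2 c_K; this is the classical Kendall tau
    t = len(mu)
    c_K = t * (t - 1) / 2
    return 1 - d_kendall(mu, nu) / c_K

# e.g. alpha_spearman((1, 2, 3), (3, 2, 1)) -> -1.0 (complete reversal)
```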

Example 3.2 (Lehmann 1975, p. 298).

Consider the test scores in Language and Arithmetic for a group of 9 students as shown in Table 3.1. The right-invariance property shared by the Spearman and Kendall distances enables us to rewrite the table in a more convenient fashion with one of the rankings in natural order, as in Table 3.2. The Spearman and Kendall correlations are respectively 0.683 and 0.500. Here \(c_{S} = 60,c_{K} = 36\).

Table 3.1 Language and Arithmetic scores
Table 3.2 Language and Arithmetic scores rearranged

The correlation coefficients based on these distances are of the multiplicative type (Kendall and Gibbons 1990); that is, there exists a function g such that

$$\displaystyle{ \alpha \left (\mu,\nu \right ) = k_{\mu }k_{\nu }\sum _{i}\sum _{j}g\left (\mu \left (i\right ),\mu \left (j\right )\right )g\left (\nu \left (i\right ),\nu \left (j\right )\right ) }$$
(3.12)

where \(k_{\mu },k_{\nu }\) are normalizing constants. The constants may be different depending on whether the coefficient is of type a or b. A type a correlation is used above. For Spearman and Kendall, the functions are, respectively,

$$\displaystyle{g_{S}\left (\mu \left (i\right ),\mu \left (j\right )\right ) = \left (\mu \left (i\right ) -\mu \left (j\right )\right )}$$
$$\displaystyle{g_{K}\left (\mu \left (i\right ),\mu \left (j\right )\right ) = sgn\left [\mu \left (i\right ) -\mu \left (j\right )\right ].}$$

For a type b correlation, the constants are given by

$$\displaystyle{k_{\mu } = \sqrt{\Sigma _{i } \Sigma _{j } \left [g\left (\mu \left (i\right ),\mu \left (j\right ) \right ) \right ] ^{2}}.}$$

We shall make use of a type b correlation when defining angular correlations in Sect. 3.6.

For a multiplicative index, it can be shown that the correlation matrix is necessarily positive semidefinite (Quade 1972). Setting

$$\displaystyle{ Q = \left (J - \frac{2} {M}\Delta \right ) }$$
(3.13)

where \(J =\boldsymbol{ 11}'\) and \(\frac{M} {2} = c,\) this implies that there exists a matrix T for which

$$\displaystyle\begin{array}{rcl} Q& =& \frac{1} {c}\left (\mathbf{T}'\mathbf{T}\right ).{}\end{array}$$
(3.14)

It follows that the distance matrix for both Spearman and Kendall can be expressed as

$$\displaystyle{ \Delta = cJ -\mathbf{T}'\mathbf{T}. }$$
(3.15)

From the form of the Spearman and Kendall similarity measures (3.12), it can be seen that the matrices T are respectively

$$\displaystyle{ \mathbf{T}_{S} = \left (t_{S}\left (\mu _{1}\right ),\ldots,t_{S}\left (\mu _{t!}\right )\right )' }$$
(3.16)

where

$$\displaystyle{t_{S}\left (\mu \right ) = \left (\mu \left (1\right ) -\frac{t + 1} {2},\ldots,\mu \left (t\right ) -\frac{t + 1} {2} \right )'}$$

is the centered rank vector and

$$\displaystyle{ \mathbf{T}_{K} = \left (t_{K}\left (\mu _{1}\right ),\ldots,t_{K}\left (\mu _{t!}\right )\right )' }$$
(3.17)

is of dimension \(\binom{t}{2} \times t!\), where the qth element, for \(q = \left (i - 1\right )\left (t -\frac{i} {2}\right ) + \left (j - i\right )\), \(1 \leq i < j \leq t\), is

$$\displaystyle{\left (t_{K}\left (\mu \right )\right )_{q} = sgn\left [\mu \left (j\right ) -\mu \left (i\right )\right ].}$$

For Hamming, we may write the \(t^{2}\)-dimensional vector whose \(\left (i,j\right )\)th element is

$$\displaystyle{\left (t_{H}\left (\mu \right )\right )_{ij} = \left (I\left [\mu \left (i\right ) = j\right ] -\frac{1} {t}\right )}$$

for 1 ≤ i, j ≤ t. 

For the Footrule we have the \(t^{2}\)-dimensional vector whose qth element, for \(q = \left (i - 1\right )t + j\), \(1 \leq i,j \leq t\), is

$$\displaystyle{\left (t_{F}\left (\mu \right )\right )_{q} = \left (I\left [\mu \left (i\right ) \leq j\right ] -\frac{j} {t}\right ).}$$

Example 3.3.

Suppose that t = 3. Then, placing the rankings in the natural order of Example 3.1, we have that

$$\displaystyle{\mathbf{T}_{S} = \left (\begin{array}{cccccc} - 1& - 1& 0 & 0 & 1 & 1\\ 0 & 1 & - 1 & 1 & - 1 & 0 \\ 1 & 0 & 1 & - 1& 0 & - 1 \end{array} \right )}$$

and

$$\displaystyle{\mathbf{T}_{K} = \left (\begin{array}{cccccc} 1& 1 & - 1& 1 & - 1& - 1\\ 1 & 1 & 1 & - 1 & - 1 & - 1 \\ 1& - 1& 1 & - 1& 1 & - 1 \end{array} \right ).}$$
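
The factorization (3.15) is easy to verify numerically. The sketch below rebuilds \(\mathbf{T}_{S}\) and \(\mathbf{T}_{K}\) for t = 3, with columns ordered as in Example 3.1 (which coincides with lexicographic order), and reproduces \(\Delta _{S}\) and \(\Delta _{K}\):

```python
import numpy as np
from itertools import permutations

t = 3
perms = list(permutations(range(1, t + 1)))   # same order as Example 3.1
c_S, c_K = t * (t ** 2 - 1) / 12, t * (t - 1) / 2

# Columns of T_S are centred rank vectors (3.16); columns of T_K are the
# pairwise sign vectors (3.17), pairs (i, j) with i < j listed row by row.
T_S = np.array([[m - (t + 1) / 2 for m in mu] for mu in perms]).T
T_K = np.array([[np.sign(mu[j] - mu[i])
                 for i in range(t) for j in range(i + 1, t)]
                for mu in perms]).T

J = np.ones((len(perms), len(perms)))
Delta_S = c_S * J - T_S.T @ T_S    # reproduces Delta_S of Example 3.1
Delta_K = c_K * J - T_K.T @ T_K    # via (3.15)
```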

The notion of correlation is particularly useful in problems wherein one wishes to test for the independence of two variables, as in Example 3.2, or for the existence of a long-term monotone trend in the pH of a river. We postpone the discussion of these important topics until later in this chapter, where they will be addressed in the general context of incomplete rankings.

3.3 Incomplete Rankings and the Notion of Compatibility

A judge may rank a complete set of candidates in accordance with some criterion. On occasion, however, data may be missing either at random or by design. For example, one or more candidates may not be ranked. In another example, the pH data on a lake may not be available for certain months in a year, thereby making it impossible to test for a long-term trend using traditional nonparametric rank-based statistics. The option to ignore the missing data is unsatisfactory because it distorts the time scale. As we shall see later on, this option is always suboptimal when testing for trend. We address the topic in this section by first introducing the notion of compatibility.

Notation.

Missing ranks will be denoted by “−” and the corresponding incomplete rankings will be written with a superscript “∗”.

For example, the ranking \(\mu ^{{\ast}} = (2,-,3,4,1)'\) indicates that object 2 is unranked among the five objects presented.

Definition 3.1.

The complete ranking μ of t objects is said to be compatible with an incomplete ranking \(\mu^{{\ast}}\) of a subset of k of these objects, 2 ≤ k ≤ t, if the relative ranking of every pair of objects ranked in \(\mu^{{\ast}}\) coincides with their relative ranking in μ.

An incomplete ranking gives rise to a class of order-preserving complete rankings. Denoting by \(C\left (\mu ^{{\ast}}\right )\) the set of complete permutations compatible with \(\mu ^{{\ast}} = (2,-,3,4,1)'\), we have that

$$\displaystyle{C\left (\mu ^{{\ast}}\right ) = \left \{\left (2,5,3,4,1\right )',\left (2,4,3,5,1\right )',\left (2,3,4,5,1\right )',\left (3,2,4,5,1\right )',\left (3,1,4,5,2\right )'\right \}.}$$

The total number of complete rankings of t objects compatible with an incomplete ranking of a subset of k objects is given by \(t!/k!\). This follows from the fact that there are \(\binom{t}{k}\) ways of choosing k integers for the ranked objects, one way of placing them so as to preserve the order, and then \(\left (t - k\right )!\) ways of rearranging the remaining integers. The product is thus

$$\displaystyle{ a = \binom{t}{k}\left (t - k\right )! = t!/k! }$$
(3.18)
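
A brute-force enumeration of a compatibility class confirms the count (3.18). A minimal sketch, encoding an unranked object as None:

```python
from itertools import permutations
from math import factorial

def compatible(mu_star):
    # all complete rankings preserving the relative order of the ranked objects
    t = len(mu_star)
    ranked = [i for i, r in enumerate(mu_star) if r is not None]
    return [mu for mu in permutations(range(1, t + 1))
            if all((mu[i] < mu[j]) == (mu_star[i] < mu_star[j])
                   for a, i in enumerate(ranked) for j in ranked[a + 1:])]

cls = compatible((2, None, 3, 4, 1))              # mu* = (2, -, 3, 4, 1)'
assert len(cls) == factorial(5) // factorial(4)   # t!/k! = 5, as listed above
```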

The notion of compatibility establishes a connection between an incomplete ranking and the class of complete rankings from which the incomplete ranking could have arisen. It seems natural as a result to extend the notion of distance to incomplete rankings by referring to the corresponding compatibility classes.

Definition 3.2.

The distance \(d^{{\ast}}\left (\mu ^{{\ast}},\nu ^{{\ast}}\right )\) between two incomplete rankings \(\mu^{{\ast}}\) and \(\nu^{{\ast}}\) is defined to be the average of all values of the distances \(d(\mu _{i},\nu _{j})\) taken over all pairs of complete rankings \(\mu _{i},\nu _{j}\) compatible with \(\mu^{{\ast}}\) and \(\nu^{{\ast}}\), respectively.

Example 3.4.

Suppose that \(t = 3,k = 2.\) In that case, the possible incomplete rankings are denoted by

$$\displaystyle\begin{array}{rcl} & & \nu _{11}^{{\ast}} = \left (1,2,-\right )',\nu _{ 12}^{{\ast}} = \left (2,1,-\right )',\nu _{ 21}^{{\ast}} = \left (1,-,2\right )',\nu _{ 22}^{{\ast}} = \left (2,-,1\right )', {}\\ & & \nu _{31}^{{\ast}} = \left (-,1,2\right )',\nu _{ 32}^{{\ast}} = \left (-,2,1\right )' {}\\ \end{array}$$

We may associate with every incomplete ranking a (t! × 1) compatibility vector, also denoted by \(C\!\left (\nu ^{{\ast}}\right )\), whose ith component is 1 or 0 according to whether \(\mu _{i}\) is compatible with \(\nu^{{\ast}}\). A summary can be provided by a compatibility matrix as follows.

$$\displaystyle{C = \begin{array}{c} \\ \mu _{1} \\ \mu _{2} \\ \mu _{ 3} \\ \mu _{4} \\ \mu _{5} \\ \mu _{ 6} \end{array} \begin{array}{cccccc} \nu _{11}^{{\ast}}&\nu _{12}^{{\ast}}&\nu _{21}^{{\ast}}&\nu _{22}^{{\ast}}&\nu _{31}^{{\ast}}&\nu _{32}^{{\ast}} \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 \end{array} }$$

Consequently, the matrix of average pairwise Spearman distances for the incomplete rankings is given by the product \(C'\Delta _{S}C/a^{2}\) where \(a = t!/k! = 3\) and

$$\displaystyle{C'\Delta _{S}C = \begin{array}{ccccccc} &\nu _{11}^{{\ast}}&\nu _{12}^{{\ast}}&\nu _{21}^{{\ast}}&\nu _{22}^{{\ast}}&\nu _{31}^{{\ast}}&\nu _{32}^{{\ast}} \\ \nu _{11}^{{\ast}}&10 & 26 & 14 & 22 & 22 & 14 \\ \nu _{12}^{{\ast}}&26 & 10 & 22 & 14 & 14 & 22 \\ \nu _{21}^{{\ast}}&14 & 22 & 10 & 26 & 14 & 22 \\ \nu _{22}^{{\ast}}&22 & 14 & 26 & 10 & 22 & 14 \\ \nu _{31}^{{\ast}}&22 & 14 & 14 & 22 & 10 & 26 \\ \nu _{32}^{{\ast}}&14 & 22 & 22 & 14 & 26 & 10 \end{array} }$$

We note from this example that the distance of an incomplete ranking to itself is \(10/a^{2} = 10/9\) and not 0. In extending the notion of correlation to incomplete rankings, it will be necessary to take this into account.
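
The matrix above can be reproduced by combining the enumeration sketch from the previous page with the Spearman distance function:

```python
import numpy as np
from itertools import permutations

perms = list(permutations((1, 2, 3)))     # mu_1, ..., mu_6 of Example 3.1
nus = [(1, 2, None), (2, 1, None), (1, None, 2),
       (2, None, 1), (None, 1, 2), (None, 2, 1)]

C = np.array([[mu in compatible(nu) for nu in nus] for mu in perms], dtype=int)
Delta = np.array([[d_spearman(m, n) for n in perms] for m in perms])
D_star = C.T @ Delta @ C / 9   # divide by a^2 = 9; the diagonal is 10/9, not 0
```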

For the Spearman and Kendall distances, we may re-express the distance \(d^{{\ast}}\left(\mu ^{{\ast}},\nu ^{{\ast}}\right )\) as

$$\displaystyle\begin{array}{rcl} d^{{\ast}}\left(\mu ^{{\ast}},\nu ^{{\ast}}\right) = \frac{1} {a^{2}}\left[C\left (\mu ^{{\ast}}\right)\right]'\Delta \left[C\left (\nu ^{{\ast}}\right)\right]& &{}\end{array}$$
(3.19)
$$\displaystyle\begin{array}{rcl} & =& \frac{1} {a^{2}}\left [C\left (\mu ^{{\ast}}\right )\right ]'\left (cJ -\mathbf{T}'\mathbf{T}\right )\left [C\left (\nu ^{{\ast}}\right )\right ] \\ & =& c -\mathcal{A}^{{\ast}}\left (\mu ^{{\ast}},\nu ^{{\ast}}\right ) {}\end{array}$$
(3.20)

where

$$\displaystyle{\mathcal{A}^{{\ast}}\left(\mu ^{{\ast}},\nu ^{{\ast}}\right ) = \frac{1} {a^{2}}\left [C\left (\mu ^{{\ast}}\right )\right ]'\mathbf{T}'\mathbf{T}\left [C\left (\nu ^{{\ast}}\right )\right ].}$$

The latter may be viewed as the average of the \(\mathcal{A}(\mu _{i},\nu _{j})\) taken over all complete rankings \(\mu _{i},\nu _{j}\) compatible with \(\mu^{{\ast}}\) and \(\nu^{{\ast}}\), respectively.

3.4 Correlation for Incomplete Rankings

At this point it is useful to derive an expression for an incomplete ranking \(\mu^{{\ast}}\) given knowledge of its compatibility class \(C\!\left (\mu ^{{\ast}}\right ).\) We shall assume that each complete ranking has the same probability of being selected, i.e., that the complete rankings are uniformly distributed over the t! permutations of \(\left (1,2,\ldots,t\right )\).

Lemma 3.2.

The conditional distribution of the rank \(\mu \left (i\right )\) given the compatibility class \(C\left (\mu ^{{\ast}}\right )\) generated by \(\mu^{{\ast}}\) is given by

$$\displaystyle{P\left \{\mu \left (i\right ) = j\vert C\left (\mu ^{{\ast}}\right )\right \} = \left (\begin{array}{c} j-1 \\ \mu ^{{\ast}}\left (i\right )-1 \end{array} \right )\left (\begin{array}{c} t-j \\ k-\mu ^{{\ast}}\left (i\right ) \end{array} \right )\left (\begin{array}{c} t\\ k \end{array} \right )^{-1}\delta \left (i\right )+\frac{1} {t}\left (1-\delta \left (i\right )\right )}$$

where \(\delta \left (i\right )\) is either 1 or 0 depending on whether object i is or is not ranked in the incomplete ranking. Here \(\mu ^{{\ast}}\left (i\right ) \leq j \leq \left (t - k\right ) +\mu ^{{\ast}}\left (i\right )\) if object i is ranked, whereas 1 ≤ j ≤ t if object i is not ranked.

Proof.

If an object i is ranked in an incomplete ranking \(\mu^{{\ast}}\) of k objects, then the number of complete rankings compatible with \(\mu^{{\ast}}\) which assign rank j to object i is

$$\displaystyle{\left (\begin{array}{c} j - 1\\ \mu ^{{\ast} } \left (i\right ) -1 \end{array} \right )\left (\begin{array}{c} t - j\\ k -\mu ^{{\ast} } \left (i\right ) \end{array} \right )\left (t - k\right )!}$$

This consists of the number of ways of picking a set of \(\left (\mu ^{{\ast}}\left (i\right ) - 1\right )\) from the first \(\left (j - 1\right )\) integers and a set of \(\left (k -\mu ^{{\ast}}\left (i\right )\right )\) from the last \(\left (t - j\right )\) integers, while allowing all possible permutations of the \(\left (t - k\right )\) integers not picked. On the other hand, if object i is not ranked in \(\mu^{{\ast}}\), then the number of such complete compatible rankings is given by

$$\displaystyle{\left (\begin{array}{c} t - 1\\ k \end{array} \right )\left (t - k - 1\right )!}$$

the number of ways of picking k from the t − 1 integers not equal to j and allowing all possible permutations of the remaining \(\left (t - k - 1\right )\) integers. Dividing these counts by \(t!/k!\), the number of complete rankings compatible with \(\mu^{{\ast}}\), gives the result. □ 

In the next lemma, we show that it is possible to compute the value of a score function corresponding to an incomplete ranking from knowledge of the compatibility class. To this end, we make use of the conditional distribution of a complete ranking given its compatibility class and the fact that the conditional expectation of the score function corresponds to its projection onto that class. We apply this approach to compute the form of score functions for both the Spearman and Kendall distances.

Lemma 3.3.

Suppose that we select a complete ranking μ at random from the class of compatible rankings \(C\left (\mu ^{{\ast}}\right )\) . Suppose that object s is ranked. Then (a)

$$\displaystyle\begin{array}{rcl} E\left [\left (\mu (s) -\frac{t + 1} {2} \right )\mid \mathcal{C}(\mu ^{{\ast}})\right ] = \frac{t + 1} {k + 1}\left (\mu ^{{\ast}}(s) -\frac{k + 1} {2} \right ),& &{}\end{array}$$
(3.21)

and (b) for any pair of objects i < j,

$$\displaystyle{ E\left [sgn\left (\mu (j) -\mu (i)\right )\mid \mathcal{C}(\mu ^{{\ast}})\right ] = a(i,j), }$$
(3.22)

where

$$\displaystyle{ a(i,j) = \left \{\begin{array}{ll} sgn(\mu ^{{\ast}}(j) -\mu ^{{\ast}}(i))&\text{if both objects }i\text{ and }j\text{ are ranked} \\ 1 - \tfrac{2\mu ^{{\ast}}(i)} {(k+1)} &\text{if only object }i\text{ is ranked} \\ \tfrac{2\mu ^{{\ast}}(j)} {(k+1)} - 1 &\text{if only object }j\text{ is ranked} \\ 0 &\text{otherwise} \end{array} \right. }$$
(3.23)

Proof.

To prove (a), recall the identity

$$\displaystyle{ \sum _{j=l}^{t-k+l}\binom{j-1}{l-1}\binom{t-j}{k-l} = \binom{t}{k}. }$$
(3.24)

Consequently, we have that

$$\displaystyle\begin{array}{rcl} E\left [\left (\mu (s) -\frac{t + 1} {2} \right )\mid \mathcal{C}(\mu ^{{\ast}})\right ]& =& \sum _{ j=\mu ^{{\ast}}\left (s\right )}^{t-k+\mu ^{{\ast}}\left (s\right ) }\left (j -\frac{t + 1} {2} \right )\binom{j-1}{\mu ^{{\ast}}\left (s\right )-1}\binom{t-j}{k-\mu ^{{\ast}}\left (s\right )}\Big/\binom{t}{k} {}\\ & =& \frac{t + 1} {k + 1}\left (\mu ^{{\ast}}(s) -\frac{k + 1} {2} \right ). {}\\ \end{array}$$

For the proof of (b), let

$$\displaystyle{\delta \left (s,j\right ) = \left \{\begin{array}{@{}l@{\quad }l@{}} 1\quad &\mbox{ if judge $j$ ranks object $s$ }\\ 0\quad &\mathrm{otherwise } \end{array} \right.}$$

and define

$$\displaystyle{ \varpi _{j}(s) =\mu _{ j}^{{\ast}}\left (s\right )\delta \left (s,j\right ) + \left (\frac{k + 1} {2} \right )\left (1 -\delta \left (s,j\right )\right ) }$$
(3.25)

so that the incomplete ranking takes value \(\frac{k+1} {2}\) when an object is unranked. Note that for any complete ranking,

$$\displaystyle{ \mu \left (j\right ) = \frac{t + 1} {2} + \frac{1} {2}\sum _{i=1}^{t}sgn\left (\mu (j) -\mu (i)\right ). }$$
(3.26)

It is clear that if objects i and j are both ranked, then a(i, j) is as stated. Suppose now that only object j is ranked. Using (3.21), the adjusted score becomes

$$\displaystyle\begin{array}{rcl} E\left [\mu \left (j\right )\mid \mathcal{C}(\mu ^{{\ast}})\right ]& =& \frac{t + 1} {2} + \frac{1} {2}E\left [\sum _{i=1}^{t}sgn\left (\mu (j) -\mu (i)\right )\mid \mathcal{C}(\mu ^{{\ast}})\right ] {}\\ \frac{t + 1} {k + 1}\mu ^{{\ast}}\left (j\right )\ & =& \ \frac{t + 1} {2} + \frac{1} {2}\sum _{i=1}^{k}sgn\left (\mu ^{{\ast}}(j) -\mu ^{{\ast}}(i)\right ) + \frac{\left (t - k\right )} {2} a(i,j) {}\\ & =& \frac{t + 1} {2} + \left (\mu ^{{\ast}}\left (j\right ) -\frac{k + 1} {2} \right ) + \frac{\left (t - k\right )} {2} a(i,j). {}\\ \end{array}$$

Hence, \(a\left (i,j\right ) = \left (\frac{2\mu ^{{\ast}}\left (j\right )} {k+1} - 1\right ).\) The case where only object i is ranked is dealt with similarly. □ 
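
The scores (3.23) can be checked by averaging directly over a compatibility class, reusing the compatible and sgn helpers sketched earlier. A short illustration for \(\mu ^{{\ast}} = (2,-,3,4,1)'\), with 0-based indices and unranked objects encoded as None:

```python
def kendall_score(mu_star, i, j, k):
    # a(i, j) of (3.23) for 0-based object indices i < j
    ri, rj = mu_star[i], mu_star[j]
    if ri is not None and rj is not None:
        return (rj > ri) - (rj < ri)        # sgn(mu*(j) - mu*(i))
    if ri is not None:
        return 1 - 2 * ri / (k + 1)         # only object i ranked
    if rj is not None:
        return 2 * rj / (k + 1) - 1         # only object j ranked
    return 0

mu_star, k = (2, None, 3, 4, 1), 4
cls = compatible(mu_star)                   # enumeration sketch of Sect. 3.3
avg = sum(sgn(mu[1] - mu[0]) for mu in cls) / len(cls)
assert abs(avg - kendall_score(mu_star, 0, 1, k)) < 1e-12   # both equal 1/5
```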

In describing visualization techniques for incomplete ranking data, Kidwell et al. (2008) noted the efficiency of computing the Kendall scores in (3.23). Next, we proceed to find the maximum and minimum distances between incomplete rankings in which only k of the objects are ranked.

Lemma 3.4.

  (a)

    For the Spearman distance,

    $$\displaystyle{m_{S}^{{\ast}} = c_{ S} -\frac{\left (t + 1\right )^{2}} {12} \frac{k\left (k - 1\right )} {\left (k + 1\right )},M_{S}^{{\ast}} = c_{ S} + \frac{\left (t + 1\right )^{2}} {12} \frac{k\left (k - 1\right )} {\left (k + 1\right )} }$$

    where \(c_{S} = \frac{t\left (t^{2}-1\right )} {12}\) .

  (b)

    For the Kendall distance,

    $$\displaystyle{m_{K}^{{\ast}} = c_{ K} -\frac{\left (2t + k + 3\right )} {6} \frac{k\left (k - 1\right )} {\left (k + 1\right )},M_{K}^{{\ast}} = c_{ K} + \frac{\left (2t + k + 3\right )} {6} \frac{k\left (k - 1\right )} {\left (k + 1\right )} }$$

    where \(c_{K} = \frac{t\left (t-1\right )} {2}\) . It follows that the correlation between the incomplete rankings \(\mu _{1}^{{\ast}},\mu _{2}^{{\ast}}\) can be defined to be

    $$\displaystyle{ \alpha \left (\mu _{1}^{{\ast}},\mu _{ 2}^{{\ast}}\right ) = 1 -\frac{2\left [d^{{\ast}}\left (\mu _{1}^{{\ast}},\mu _{2}^{{\ast}}\right ) - m^{{\ast}}\right ]} {M^{{\ast}}- m^{{\ast}}}. }$$
    (3.27)

Proof.

The right-hand side of (3.21) provides a general expression for an incomplete ranking. It follows that the Spearman distance between two incomplete rankings with the same number of ranked objects is

$$\displaystyle{d_{S}^{{\ast}}\left (\mu _{ i}^{{\ast}},\mu _{ j}^{{\ast}}\right ) = \frac{t(t + 1)(2t + 1)} {6} -\left ( \frac{t + 1} {k + 1}\right )^{2}\sum _{ s=1}^{t}\varpi _{ i}\left (s\right )\varpi _{j}\left (s\right )}$$

and in the Kendall case, the distance may be written as

$$\displaystyle{d_{K}^{{\ast}}\left (\mu _{ i}^{{\ast}},\mu _{ j}^{{\ast}}\right ) = \frac{t(t - 1)} {2} -\sum _{q_{1}<q_{2}}a_{i}\left (q_{1},q_{2}\right )a_{j}\left (q_{1},q_{2}\right )}$$

where \(a_{i}\left (q_{1},q_{2}\right )\) is defined as in (3.23) and \(\varpi _{i}\left (s\right )\) is given in (3.25). An application of the Cauchy–Schwarz inequality indicates that the upper bound of the Spearman distance occurs when \(\mathbf{T}_{S}C\left (\mu _{i}^{{\ast}}\right ) = -\mathbf{T}_{S}C\left (\mu _{j}^{{\ast}}\right )\) whereas the lower bound is achieved when \(\mathbf{T}_{S}C\left (\mu _{i}^{{\ast}}\right ) = \mathbf{T}_{S}C\left (\mu _{j}^{{\ast}}\right )\). If we let \(\mu _{j}^{{\ast}}\) be the inverted ranking, that is, \(\mu _{j}^{{\ast}}\left (s\right ) = k + 1 -\mu _{i}^{{\ast}}\left (s\right )\) when object s is ranked in \(\mu _{i}^{{\ast}}\), then \(\varpi _{j}\left (s\right ) = k + 1 -\varpi _{i}\left (s\right )\) and \(\mathbf{T}_{S}C\left (\mu _{i}^{{\ast}}\right ) = -\mathbf{T}_{S}C\left (\mu _{j}^{{\ast}}\right )\). Furthermore, for the Kendall scores, \(a_{j}\left (q_{1},q_{2}\right ) = -a_{i}\left (q_{1},q_{2}\right )\) and thus \(\mathbf{T}_{K}C\left (\mu _{i}^{{\ast}}\right ) = -\mathbf{T}_{K}C\left (\mu _{j}^{{\ast}}\right )\). A straightforward calculation of these distances using the incomplete ranking \(\left (1,2,\ldots,k,-,-,\ldots,-\right )'\) and its inversion yields the minimum and maximum for each distance. □ 

We quote without proof a result in Alvo and Cabilio (1995a) which allows for different numbers of observations missing at random.

Lemma 3.5.

For fixed \(k_{1} \leq k_{2}\) suppose the pattern of missing observations is randomly selected from the set of all possible patterns. Then, for the Spearman and Kendall cases, the minimum and maximum values of the distance are of the form

$$\displaystyle{m^{{\ast}} = c -\gamma \left (i\right ),\;M^{{\ast}} = c +\gamma \left (i\right )}$$

where the \(\gamma \left (i\right )\) are given as

$$\displaystyle\begin{array}{rcl} \gamma _{S}\left (1\right )& =& \frac{\left (t + 1\right )^{2}\left (k_{1} - 1\right )\left (3k_{2} - k_{1}\right )} {24\left (k_{2} + 1\right )},\:k_{1}\,odd {}\\ \gamma _{S}\left (2\right )& =& \frac{\left (t + 1\right )^{2}k_{1}\left (k_{1}\left (3k_{2} - k_{1}\right ) - 2\right )} {24\left (k_{1} + 1\right )\left (k_{2} + 1\right )},\:k_{1}\,even {}\\ \gamma _{K}\left (1\right )& =& \frac{\left (k_{1} - 1\right )\left (t\left (3k_{2} - k_{1}\right ) + k_{2}\left (k_{1} + 3\right )\right )} {6\left (k_{2} + 1\right )},\:k_{1}\,odd {}\\ \gamma _{K}\left (2\right )& =& \frac{k_{1}\left (3k_{1}k_{2}\left (t + 1\right ) -\left (k_{1}^{2} + 2\right )\left (t - k_{2}\right ) - 3\left (k_{2} + 1\right )\right )} {6\left (k_{1} + 1\right )\left (k_{2} + 1\right )},\:k_{1}\,even {}\\ \end{array}$$

Consider now two independent rankings of length \(k_{1},k_{2}\), respectively, with \(2 \leq k_{1} \leq k_{2} \leq t.\) It follows from (3.6) and Lemma 3.3 that

$$\displaystyle\begin{array}{rcl} \mathcal{A}_{S}^{{\ast}}(\mu ^{{\ast}},\nu ^{{\ast}}) = E\left [\mathcal{A}_{ S}(\mu,\nu )\mid \mathcal{C}(\mu ^{{\ast}}),\mathcal{C}(\nu ^{{\ast}})\right ]& &{}\end{array}$$
(3.28)
$$\displaystyle\begin{array}{rcl} & =& \frac{\left (t + 1\right )^{2}} {\left (k_{1} + 1\right )\left (k_{2} + 1\right )}\sum _{s=1}^{t}\left (\mu ^{{\ast}}\left (s\right ) -\frac{k_{2} + 1} {2} \right )\left (\nu ^{{\ast}}\left (s\right ) -\frac{k_{1} + 1} {2} \right )\delta \left (s,\mu ^{{\ast}}\right )\delta \left (s,\nu ^{{\ast}}\right ) \\ & =& \frac{\left (t + 1\right )^{2}} {\left (k_{1} + 1\right )\left (k_{2} + 1\right )}\sum _{i=1}^{k^{{\ast}} }\left (o_{i} -\frac{k_{2} + 1} {2} \right )\left (\mu ^{{\ast}}\left (o_{ i}\right ) -\frac{k_{1} + 1} {2} \right ) {}\end{array}$$
(3.29)

where \(k^{{\ast}}\) is the number of objects ranked in ranking 1 among the \(k_{2}\) objects ranked in ranking 2 and \(o_{i}\) is the label of the ith object ranked in ranking 1. Here, \(\delta \left (s,\mu ^{{\ast}}\right )\) takes the value 1 if object s is ranked by \(\mu^{{\ast}}\) and the value 0 otherwise. Note that

$$\displaystyle{o_{i} = i + l_{i},}$$

where \(l_{i}\) is the number of objects unranked in ranking 1 which lie to the left of the ith ranked object. Similarly, from (3.7) we have that

$$\displaystyle\begin{array}{rcl} \mathcal{A}_{K}^{{\ast}}(\mu ^{{\ast}},\nu ^{{\ast}}) = E\left [\mathcal{A}_{ K}(\mu,\nu )\mid \mathcal{C}(\mu ^{{\ast}}),\mathcal{C}(\nu ^{{\ast}})\right ]& &{}\end{array}$$
(3.30)
$$\displaystyle\begin{array}{rcl} =\sum _{i<j}a_{1}\left (i,j\right )a_{2}\left (i,j\right ).& &{}\end{array}$$
(3.31)

Example 3.5.

Consider the test scores in Language (ranking 1) and Arithmetic (ranking 2) of a group of nine students in Table 3.3. The original data was altered by removing certain values, with the remaining observations reordered and ranked as follows.

Table 3.3 Language and arithmetic scores revisited

Here \(t = 9,k_{1} = 6,k_{2} = 7,k^{{\ast}} = 5,\) \(o_{1} = 1,o_{2} = 2,o_{3} = 3,o_{4} = 5,o_{5} = 7,o_{6} = 8\), and \(l_{1} = l_{2} = l_{3} = 0,l_{4} = 1,l_{5} = l_{6} = 2.\) Further,

$$\displaystyle{\mu ^{{\ast}}\left (o_{ 1}\right ) = 2,\mu ^{{\ast}}\left (o_{ 2}\right ) = 1,\mu ^{{\ast}}\left (o_{ 3}\right ) = 3,\mu ^{{\ast}}\left (o_{ 4}\right ) = 5,\mu ^{{\ast}}\left (o_{ 5}\right ) = 6,\mu ^{{\ast}}\left (o_{ 6}\right ) = 4.\quad }$$

Hence \(\mathcal{A}_{S}^{{\ast}} = 33.9286\) and \(\mathcal{A}_{K}^{{\ast}} = 4\).
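
These values can be verified arithmetically from (3.29). In the sketch below, the identification of the five objects ranked in both rankings is inferred from the example's \(k^{{\ast}} = 5\):

```python
t, k1, k2 = 9, 6, 7
o  = [1, 2, 3, 5, 7]      # labels of the k* = 5 objects ranked in both rankings
mu = [2, 1, 3, 5, 6]      # mu*(o_i) for those objects
A_S = (t + 1) ** 2 / ((k1 + 1) * (k2 + 1)) * sum(
    (oi - (k2 + 1) / 2) * (mi - (k1 + 1) / 2) for oi, mi in zip(o, mu))
print(round(A_S, 4))      # 33.9286, matching the value quoted above
```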

3.4.1 Asymptotic Normality of the Spearman and Kendall Test Statistics

The main objective of this section is to demonstrate the asymptotic normality of the similarity measures due to Spearman and Kendall in the case of incomplete rankings. Specifically, we shall be concerned with the asymptotic distributions of both \(\mathcal{A}_{S}^{{\ast}}\) and \(\mathcal{A}_{K}^{{\ast}}\) under each of two possible null hypotheses \(H_{1}\) and \(H_{2}\). For both hypotheses we assume that \(k_{1},k_{2}\), the numbers of ranked observations, are fixed and that the rankings for which we have (possibly) incomplete data are uniformly distributed over the t! permutations of \((1,2,\ldots,t)\).

  • Under hypothesis \(H_{1}\), we assume that the pattern of missing observations is fixed, so that all inference in this case is conditional on such a pattern.

  • Under \(H_{2}\), we assume that the patterns of missing observations are randomly selected from the set of all possible patterns. The latter situation would arise in practice if unranked objects occur by chance. An example would be testing for trend in water quality data when the historical data is incomplete.

We begin with the definition of a linear rank statistic.

Definition 3.3.

Let \(\left \{a\left (i\right )\right \}\) and \(\left \{c\left (i\right )\right \}\) be two sets of constants. A statistic of the form

$$\displaystyle{S = \Sigma _{i=1}^{N}c\left (i\right )a\left (R_{ i}\right )}$$

where \(R = \left (R_{1},\ldots,R_{N}\right )\) is a vector of ranks is called a linear rank statistic. The constants \(a\left (i\right )\) are called scores whereas the \(c\left (i\right )\) are called regression coefficients.

Many test statistics are of this form. For example, suppose that we have a random sample of n observations from one population and N − n from another. We are interested in testing the null hypothesis that the two populations are the same against the alternative that they differ only in location. Rank all N observations together. The Wilcoxon statistic then considers only the ranks of one of the samples by choosing

$$\displaystyle{c\left (i\right ) = \left \{\begin{array}{@{}l@{\quad }l@{}} 0\quad &i = 1,\ldots,n\\ 1\quad &i = n + 1,\ldots,N. \end{array} \right.}$$
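
As an illustration of Definition 3.3, a minimal sketch of the Wilcoxon statistic as a linear rank statistic with identity scores \(a\left (i\right ) = i\) (the sample data are invented for illustration):

```python
import numpy as np

def linear_rank_statistic(c, a, ranks):
    # S = sum_i c(i) a(R_i) of Definition 3.3 (0-based lists, 1-based ranks)
    return sum(ci * a[r - 1] for ci, r in zip(c, ranks))

x = [1.3, 2.1, 0.7]                        # first sample, n = 3
y = [2.5, 3.0, 1.9, 2.2]                   # second sample, N - n = 4
N = len(x) + len(y)
ranks = np.argsort(np.argsort(x + y)) + 1  # pooled ranks (no ties here)
c = [0] * len(x) + [1] * len(y)            # regression coefficients above
a = list(range(1, N + 1))                  # scores a(i) = i
W = linear_rank_statistic(c, a, ranks)     # sum of second sample's ranks: 21
```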

Lemma 3.6.

Suppose that R is uniformly distributed over the set of permutations in \(\mathcal{P}\) . Then

  (i)

    for i = 1,…,N, \(E\left (R_{i}\right ) = \frac{N+1} {2}\), \(Var\left (R_{i}\right ) = \frac{\left (N^{2}-1\right )} {12}\), and for i ≠ j, \(Cov\left (R_{i},R_{j}\right ) = -\frac{N+1} {12}\), and

  (ii)
    $$\displaystyle{ES = N\bar{c}\bar{a}}$$

and

$$\displaystyle{Var\,S = \frac{1} {N - 1}\Sigma \left (c\left (i\right ) -\bar{ c}\right )^{2}\Sigma \left (a\left (i\right ) -\bar{ a}\right )^{2}}$$

where \(\bar{a}\) and \(\bar{c}\) represent the corresponding means.

Proof.

The proof of this lemma is given in (Hájek and Sidak 1967). □ 

The following theorem states that under certain conditions, linear rank statistics are asymptotically normally distributed. We shall consider square integrable functions ϕ defined on \(\left (0,1\right )\) which have the property that they can be written as the difference of two nondecreasing functions and satisfy

$$\displaystyle{0 <\int _{ 0}^{1}\left [\phi \left (u\right )-\bar{\phi }\right ]^{2}du <\infty }$$

where \(\bar{\phi }=\int _{ 0}^{1}\phi \left (u\right )du.\)

Theorem 3.1.

Suppose that R is uniformly distributed over the set of permutations in \(\mathcal{P}\) . Let the score function be given by \(a\left (i\right ) =\phi \left ( \frac{i} {N}\right )\) where \(\phi \left (\cdot \right )\) is a square integrable score function. Then S is asymptotically normally distributed as N →∞ with mean \(N\bar{c}\bar{a}\) and variance

$$\displaystyle{Var\,S = \frac{1} {N - 1}\Sigma _{i=1}^{N}\left (c\left (i\right ) -\bar{ c}\right )^{2}\Sigma _{ i=1}^{N}\left (a\left (i\right ) -\bar{ a}\right )^{2}}$$

provided

$$\displaystyle{ \frac{\sum _{i=1}^{N}\left (c\left (i\right ) -\bar{ c}\right )^{2}} {\max _{1\leq i\leq N}\left (c\left (i\right ) -\bar{ c}\right )^{2}} \rightarrow \infty.}$$

Proof.

The proof of this important result is given in (Hájek and Sidak 1967). □ 

We may now apply Theorem 3.1 to obtain the asymptotic normality of the Spearman test statistic in the case of incomplete rankings under Hypothesis 1 wherein the pattern of missing data is fixed. Set

$$\displaystyle{ \sigma _{S}^{2} = \frac{1} {12}\left [ \frac{\left (t + 1\right )^{2}} {\left (k_{2} + 1\right )}\right ]^{2}\sum _{ i=1}^{k_{1} }\left (o_{i}^{{\ast}}-\overline{o}_{ 1}\right )^{2}, }$$
(3.32)

where

$$\displaystyle{ o_{i}^{{\ast}} = \left \{\begin{array}{ll} o_{i} &\mbox{ if}\ 1 \leq i \leq k^{{\ast}} \\ \\ \frac{k_{2}+1} {2} & \mbox{ if}\ k^{{\ast}} + 1 \leq i \leq k_{ 1} \end{array} \right. }$$
(3.33)

and \(\overline{o}_{1} = \left (\sum _{i=1}^{k_{1}}o_{i}^{{\ast}}\right )/k_{1}.\) Also set \(\overline{o}^{{\ast}} = \left (\sum _{i=1}^{k^{{\ast}} }o_{i}\right )/k^{{\ast}}.\)

Theorem 3.2.

Assume that \(k^{{\ast}}\rightarrow \infty\) (and hence \(k_{1} \rightarrow \infty,k_{2} \rightarrow \infty,t \rightarrow \infty\) ) with \(k^{{\ast}}/t \rightarrow \lambda > 0,\) where λ is a finite constant. Then, under \(H_{1}\) , whereby the pattern of missing data is fixed, \(\mathcal{A}_{S}^{{\ast}}\) given in (3.28) is asymptotically normal with mean 0 and variance \(\sigma _{S}^{2}\) .

Proof.

The proof hinges on the fact that \(\mathcal{A}_{S}^{{\ast}}\) is a linear rank statistic. In fact

$$\displaystyle\begin{array}{rcl} \mathcal{A}_{S}^{{\ast}}& =& \frac{\left (t + 1\right )^{2}} {\left (k_{1} + 1\right )\left (k_{2} + 1\right )}\sum _{i=1}^{k_{1} }\left (o_{i}^{{\ast}}-\frac{k_{2} + 1} {2} \right )\left (\mu ^{{\ast}}\left (o_{ i}\right ) -\frac{k_{1} + 1} {2} \right ) {}\\ & =& \frac{\left (t + 1\right )^{2}} {\left (k_{1} + 1\right )\left (k_{2} + 1\right )}\sum _{i=1}^{k_{1} }\left (o_{i}^{{\ast}}-\overline{o}_{ 1}\right )\left (\mu ^{{\ast}}\left (o_{ i}\right )\right ). {}\\ \end{array}$$

The normality follows provided

$$\displaystyle{\frac{\sum _{i=1}^{k_{1}}\left (o_{i}^{{\ast}}-\overline{o}_{1}\right )^{2}} {\max \left (o_{i}^{{\ast}}-\overline{o}_{1}\right )^{2}} \rightarrow \infty.}$$

Now

$$\displaystyle\begin{array}{rcl} \Sigma _{i=1}^{k_{1} }\left (o_{i}^{{\ast}}-\bar{ o}_{1}\right )^{2}& =& \Sigma _{ i=1}^{k^{{\ast}} }\left (o_{i} -\bar{ o}^{{\ast}}\right )^{2}+k^{{\ast}}\left (\bar{o}^{{\ast}}-\bar{o}_{ 1}\right )^{2} + \left (k_{ 1} - k^{{\ast}}\right )\left (\frac{k_{2} + 1} {2} -\bar{ o}_{1}\right )^{2} {}\\ & \geq & k^{{\ast}}\left (k^{{\ast}2} - 1\right )/12. {}\\ \end{array}$$

Further, \(\left (o_{i}^{{\ast}}-\bar{ o}_{1}\right )^{2} \leq \left (t - 1\right )^{2}\), so that the result follows on letting \(k^{{\ast}} \rightarrow \infty\) with \(k^{{\ast}}/t \rightarrow \lambda\). □ 

The exact variance of \(\mathcal{A}_{S}^{{\ast}}\) under \(H_{1}\), which is recommended in applications of Theorem 3.2, is related to \(\sigma _{S}^{2}\) by

$$\displaystyle{Var(\mathcal{A}_{S}^{{\ast}}) = \frac{k_{1}} {k_{1} + 1}\sigma _{S}^{2}}$$

(Lehmann 1975, (A.49), p. 334). That is, the asymptotic variance given in the theorem is essentially the actual variance of \(\mathcal{A}_{S}^{{\ast}}\). In any application, the calculation of the variance of \(\mathcal{A}_{S}^{{\ast}}\) is a straightforward computation. Next, we consider the asymptotic distribution of \(\mathcal{A}_{S}^{{\ast}}\) and \(\mathcal{A}_{K}^{{\ast}}\) when the pattern of missing observations is random.

Theorem 3.3.

Let \(k_{1} \rightarrow \infty\) (and hence \(k_{2} \rightarrow \infty,t \rightarrow \infty )\) with \(k_{1}/t \rightarrow \lambda > 0,\) where λ is a finite constant. Then, under \(H_{2}\) , whereby the pattern of missing data is random, \(\mathcal{A}_{S}^{{\ast}}\) is asymptotically normal with mean 0 and variance

$$\displaystyle{ Var\left (\mathcal{A}_{S}^{{\ast}}\right ) = \frac{\left (t + 1\right )^{4}} {144\left (t - 1\right )}\kappa _{1}\kappa _{2}, }$$
(3.34)

with

$$\displaystyle{\kappa _{i} = \frac{k_{i}\left (k_{i} - 1\right )} {\left (k_{i} + 1\right )},i = 1,2.}$$

Proof.

Define \(\mathbf{U} = (U_{1},U_{2},\ldots,U_{t})\) as the random vector uniformly distributed over the permutations of \((1,2,\ldots,k_{1}, \frac{k_{1}+1} {2},\ldots, \frac{k_{1}+1} {2} ).\) In this case, the extended Spearman distance may be written as

$$\displaystyle\begin{array}{rcl} d_{S}^{{\ast}} = \frac{t(t + 1)(2t + 1)} {6}\, -& &{}\end{array}$$
(3.35)
$$\displaystyle\begin{array}{rcl} \frac{\left (t + 1\right )^{2}} {\left (k_{1} + 1\right )\left (k_{2} + 1\right )}\left [\sum _{i=1}^{k_{2} }iU_{i} + \frac{k_{2} + 1} {2} \sum _{i=k_{2}+1}^{t}U_{ i}\right ].& &{}\end{array}$$
(3.36)

The result follows from the combinatorial central limit theorem of Hoeffding (see Appendix B.1) applied to the quantity within square brackets above. □ 

Theorem 3.4.

\(\mathcal{A}_{K}^{{\ast}}\) is asymptotically equivalent to \(\mathcal{A}_{S}^{{\ast}}\) under both hypotheses \(H_{1}\) and \(H_{2}\) . Hence, \(\mathcal{A}_{K}^{{\ast}}\) is asymptotically normal with mean 0 and variance \(\left (\frac{16} {t^{2}} \right )Var\left (\mathcal{A}_{S}^{{\ast}}\right )\) .

Proof.

We know from (Hájek and Sidak 1967) that for the complete case

$$\displaystyle{E\left (\mathcal{A}_{K} -\frac{4} {t}\mathcal{A}_{S}\right )^{2} = \frac{\left (t - 1\right )\left (t - 2\right )} {18} }$$

and that, moreover,

$$\displaystyle{ \frac{12\mathcal{A}_{S}} {t\left (t + 1\right )\sqrt{t - 1}} \Rightarrow N\left (0,1\right )\ \text{as }t \rightarrow \infty.}$$

Consequently, we have

$$\displaystyle{ \frac{6\mathcal{A}_{K}} {\sqrt{2t\left (t - 1 \right ) \left (2t + 5 \right )}} \Rightarrow N\left (0,1\right ).}$$

From Jensen’s inequality

$$\displaystyle\begin{array}{rcl} E\left (\mathcal{A}_{K}^{{\ast}}-\frac{4} {t}\mathcal{A}_{S}^{{\ast}}\right )^{2}& =& E\left (E^{2}\left (\left (\mathcal{A}_{ K} -\frac{4} {t}\mathcal{A}_{S}\right )\vert \mathcal{C}(\mu ^{{\ast}}),\mathcal{C}(\nu ^{{\ast}})\right )\right ) {}\\ & \leq & E\left (E\left (\mathcal{A}_{K} -\frac{4} {t}\mathcal{A}_{S}\right )^{2}\vert \mathcal{C}(\mu ^{{\ast}}),\mathcal{C}(\nu ^{{\ast}})\right ) = O\left (t^{2}\right ) {}\\ \end{array}$$

and consequently the asymptotic normality of \(\mathcal{A}_{S}^{{\ast}}\) will imply the asymptotic normality of \(\mathcal{A}_{K}^{{\ast}}.\) □ 

Example 3.6.

We return to Example 3.2 wherein we wish to test the hypothesis of independence against the alternative of a positive correlation. For the complete data, the value of \(\mathcal{A}_{S}\) is 41, and from the tables, under the randomness hypothesis, \(P(\mathcal{A}_{S}\geq 41) = 0.0252,\) whereas the use of the asymptotic result gives a p-value of \(1 - \Phi (1.9328) = 0.0266\), where \(\Phi\) is the cumulative distribution function of a standard normal. For the data in Example 3.5, the value of \(\mathcal{A}_{S}^{{\ast}}\) for the reduced data is calculated to be 33.9286. An application of the theorem yields that under \(H_{1}\), the p-value is \(P\left (\mathcal{A}_{S}^{{\ast}}\geq 33.9286\right ) = 0.0178\). On the other hand, if all observations with missing values are deleted, we obtain a reduced value of \(\mathcal{A}_{S} = 9\) with t = 5, and from the tables \(P(\mathcal{A}_{S}\geq 9) = 0.0417.\)

3.4.2 Asymptotic Efficiency

We now turn to the question of the efficiency which is further discussed in Appendix B.4. Let \(X_{1},X_{2},\ldots,X_{t}\) be independent random variables whose joint density under the alternative is described by

$$\displaystyle{q_{d} =\prod _{ i=1}^{t}f_{ 0}\left (x_{i} - d_{i}\right )}$$

where \(f_{0}\) is a known density having finite Fisher information \(I\left (f_{0}\right )\) and \(\mathbf{d} = \left (d_{1},d_{2},\ldots,d_{t}\right )\) is an arbitrary vector. In the notation of our tests, \(k_{2} = t\), and we write \(k_{1} = k\), the actual number of \(X_{i}\)'s observed. Recalling that \(o_{i}\) is the label of the ith object ranked, the Spearman test which deletes all missing observations is based on the Spearman correlation of the reduced sample of k pairs, and the test statistic may be written as

$$\displaystyle{A_{RS} = \left (t + 1\right )\sum _{i=1}^{k}\left (i -\frac{k + 1} {2} \right )\left (\frac{\mu ^{{\ast}}\left (o_{i}\right )} {t + 1}\right ).}$$

Since \(k = k_{1} = k^{{\ast}}\) and consequently \(o_{i} = o_{i}^{{\ast}},\) the statistic \(\mathcal{A}_{S}^{{\ast}}\) may be written as

$$\displaystyle{\mathcal{A}_{S}^{{\ast}} = \frac{\left (t + 1\right )} {\left (k + 1\right )}\sum _{i=1}^{k}\left (o_{ i} -\frac{t + 1} {2} \right )\left (\mu ^{{\ast}}\left (o_{ i}\right ) -\frac{k + 1} {2} \right ).}$$

Hence,

$$\displaystyle{\mathcal{A}_{S}^{{\ast}} = \frac{\left (t + 1\right )} {\left (k + 1\right )}\left \{A_{RS} +\sum _{ i=1}^{k}\left (\mu ^{{\ast}}\left (o_{ i}\right ) -\frac{k + 1} {2} \right )\left (o_{i} - i\right )\right \}.}$$

The weight \(\left (o_{i} - i\right )\) represents the number of time points to the left of \(o_{i}\) for which there are no observations. Similarly,

$$\displaystyle{\mathcal{A}_{K}^{{\ast}} = A_{ RK} + \frac{4} {k + 1}\sum _{i=1}^{k}\left (\mu ^{{\ast}}\left (o_{ i}\right ) -\frac{k + 1} {2} \right )\left (o_{i} - i\right )}$$

where

$$\displaystyle{A_{RK} =\sum _{ i<j}^{k}sgn\left (\mu ^{{\ast}}\left (o_{ j}\right ) -\mu ^{{\ast}}\left (o_{ i}\right )\right ).}$$

Set \(d_{i}^{{\ast}} = d_{o_{i}}\) and \(\overline{d} =\sum _{ i=1}^{t}d_{i}/t.\) Under the alternative q d , provided

$$\displaystyle{\max _{1\leq i\leq t}\left (d_{i} -\overline{d}\right )^{2} \rightarrow 0\text{ and }I\left (f_{ 0}\right )\sum _{i=1}^{t}\left (d_{ i} -\overline{d}\right )^{2} \rightarrow b^{2},0 <b^{2} <\infty,}$$

both \(A_{RS}\) and \(\mathcal{A}_{S}^{{\ast}}\) are asymptotically normal with means and variances given respectively by \(\left (\mu _{RS},\sigma _{RS}^{2}\right )\) and \(\left (\mu _{S},\sigma _{S}^{2}\right )\), where

$$\displaystyle\begin{array}{rcl} \mu _{RS}& =& \left (t + 1\right )\sum _{i=1}^{k}\left (i -\frac{k + 1} {2} \right )\left (d_{i}^{{\ast}}-\overline{d}\right )\int _{ 0}^{1}u\ \phi \left (u,{f_ 0}\right )du {}\\ \mu _{S}& =& \frac{\left (t + 1\right )^{2}} {\left (k + 1\right )} \sum _{i=1}^{k}\left (o_{ i} -\overline{o}\right )\left (d_{i}^{{\ast}}-\overline{d}\right )\int _{ 0}^{1}u\ \phi \left (u,{f_ 0}\right )du. {}\\ \sigma _{RS}^{2}& =& \frac{\left (t + 1\right )^{2}} {12} \sum _{i=1}^{k}\left (i -\frac{k + 1} {2} \right )^{2},\;\sigma _{ S}^{2} = \frac{\left (t + 1\right )^{4}} {12\left (k + 1\right )^{2}}\sum _{i=1}^{k}\left (o_{ i} -\overline{o}\right )^{2}. {}\\ \end{array}$$

Here \(\phi \left (u,f\right ) = \left [f^{{{\prime}} }\left (F^{-1}\left (u\right )\right )\right ]/\left [f\left (F^{-1}\left (u\right )\right )\right ],0 <u <1,\) and F is the cumulative distribution of f.

Shifting now to the efficiencies, it is seen that the asymptotic efficiencies, as \(k \rightarrow \infty\), of \(A_{RS}\) and \(\mathcal{A}_{S}^{{\ast}}\) are respectively given by

$$\displaystyle\begin{array}{rcl} e_{RS}& =& \lim \frac{\left [\sum _{i=1}^{k}\left (i -\frac{k+1} {2} \right )\left (d_{i}^{{\ast}}-\overline{d}\right )\right ]^{2}} {\sum _{i=1}^{k}\left (i -\frac{k+1} {2} \right )^{2}\sum _{i=1}^{t}\left (d_{i} -\overline{d}\right )^{2}}Q_{1} {}\\ e_{S}& =& \lim \frac{\left [\sum _{i=1}^{k}\left (o_{i} -\overline{o}\right )\left (d_{i}^{{\ast}}-\overline{d}\right )\right ]^{2}} {\sum _{i=1}^{k}\left (o_{i} -\overline{o}\right )^{2}\sum _{i=1}^{t}\left (d_{i} -\overline{d}\right )^{2}}Q_{1}, {}\\ \end{array}$$

where \(Q_{1}\) is a positive function of \(f_{0}\) and the limit is taken as \(t \rightarrow \infty,k \rightarrow \infty,\) with \(k/t \rightarrow \lambda > 0\). The asymptotic relative efficiency of \(\mathcal{A}_{S}^{{\ast}}\) relative to \(A_{RS}\) is then given by the ratio \(e_{S}/e_{RS}\) (Appendix B.4).

Now consider the case where \(d_{i}^{{\ast}} = o_{i},\bar{d} =\bar{ o},i = 1,\ldots,k\) and the remaining d i are arbitrary, a situation which includes alternatives of the form \(EX_{i} =\beta _{0} +\beta i,\beta> 0.\) It can be shown that irrespective of the density f 0, the asymptotic relative efficiency of \(\mathcal{A}_{S}^{{\ast}}\) relative to A RS is given by

$$\displaystyle{ARE\left (\mathcal{A}_{S}^{{\ast}},A_{ RS}\right ) =\lim _{k\rightarrow \infty }R\left (k,\mathbf{o}_{\mathbf{k}}\right ),}$$

where \(\mathbf{o}_{\mathbf{k}}\mathbf{=}\left (o_{1},\ldots,o_{k}\right )\) and

$$\displaystyle{R\left (k,\mathbf{o}_{\mathbf{k}}\right ) = \frac{\sum _{i=1}^{k}\left (i -\frac{k+1} {2} \right )^{2}\sum _{ i=1}^{k}\left (o_{ i} -\overline{o}\right )^{2}} {\left [\sum _{i=1}^{k}\left (i -\frac{k+1} {2} \right )\left (o_{i} -\overline{o}\right )\right ]^{2}} \geq 1.}$$

Note that \(R\left (k,\mathbf{o}_{\mathbf{k}}\right )> 1\) unless the o i ’s are equally spaced.

In order to illustrate the magnitude of this efficiency, suppose for example that \(t = 19,k = 7,o_{1} = 1,o_{2} = 2,o_{3} = 3,o_{4} = 10,o_{5} = 17,o_{6} = 18,o_{7} = 19\); then the ratio of the efficacies of \(\mathcal{A}_{S}^{{\ast}}\) to \(A_{RS}\) is 1.086. On the other hand, if \(o_{1} = 1,o_{2} = 8,o_{3} = 9,o_{4} = 10,o_{5} = 11,o_{6} = 12,o_{7} = 19\), then that ratio is 1.176.

3.5 Tied Rankings and the Notion of Compatibility

The notion of compatibility may also be extended to deal with tied rankings. As an example, suppose that objects 1 and 2 are equally preferred whereas object 3 is least preferred. Such a ranking would be compatible with the rankings \(\left (1,2,3\right )\) and \(\left (2,1,3\right )\) in that both are plausible. The average of the rankings in the compatibility class, which as we shall see results from the use of the Spearman distance, will then be the ranking

$$\displaystyle{\frac{1} {2}\left [\left (1,2,3\right ) + \left (2,1,3\right )\right ] = \left (1.5,1.5,3\right )}$$

to be presented in this case. It is seen that the notion of compatibility serves to justify the use of the midrank when ties exist. Formally we can define tied orderings as follows.

Definition 3.4.

A tied ordering of t objects is a partition into e sets, 1 ≤ e ≤ t, each containing d i objects, \(d_{1} + d_{2} +\ldots +d_{e} = t\), so that the d i objects in each set share the rank i,1 ≤ i ≤ e. Such a tie pattern is denoted by \(\delta = \left (d_{1},d_{2},\ldots,d_{e}\right )\). The ranking denoted by \(\mu _{\delta } = \left (\mu _{\delta }\left (1\right ),\ldots,\mu _{\delta }\left (t\right )\right )\) resulting from such an ordering is a tied ranking and is one of \(\frac{t!} {d_{1}!d_{2}!\ldots d_{e}!}\) possible permutations.
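
A small sketch of the midrank construction described above, which assigns to each tied group the average of the ranks it occupies:

```python
def midranks(values):
    # average rank within each tied group; e.g. [10, 10, 30] -> [1.5, 1.5, 3.0]
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1                          # extend the current tied group
        for q in range(i, j + 1):
            ranks[order[q]] = (i + j) / 2 + 1
        i = j + 1
    return ranks
```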

Associated with every tied ranking we may define a t! × (\(\frac{t!} {d_{1}!d_{2}!\ldots d_{e}!}\)) matrix of compatibility \(D_{\delta }\). Yu et al. (2002) considered the problem of testing for independence between two random variables when the tie patterns and the pattern of missing observations are fixed. Specifically, let \(\mu^{{\ast}}\) be an incomplete ranking of \(k_{1}\) out of t objects with tie pattern \(\delta _{1} = \left (d_{11},\ldots,d_{1e_{1}}\right )\). Similarly, let \(\nu^{{\ast}}\) be an incomplete ranking of \(k_{2}\) out of t objects with tie pattern \(\delta _{2} = \left (d_{21},\ldots,d_{2e_{2}}\right )\). The Spearman similarity measure between two incomplete rankings \(\mu ^{{\ast}},\nu ^{{\ast}}\) is defined to be

$$\displaystyle{A_{S}^{{\ast}} = \frac{\left (t + 1\right )^{2}} {\left (k_{1} + 1\right )\left (k_{2} + 1\right )}\sum _{j=1}^{t}\delta \left (j\right )\left [\mu ^{{\ast}}\left (j\right ) -\frac{k_{1} + 1} {2} \right ]\left [\nu ^{{\ast}}\left (j\right ) -\frac{k_{2} + 1} {2} \right ]}$$

where \(\delta \left (j\right ) = 1\) if object j is ranked in both rankings and 0 otherwise.

Theorem 3.5.

Let \(k^{{\ast}}\) be the number of objects ranked in ranking 1 among the \(k_{2}\) objects ranked in ranking 2, and let \(g_{ij}\) denote the number of objects in the jth tied group of ranking i. Let \(2 \leq k_{1} \leq k_{2} \leq t\) . Assume that

  (i)

    \(k^{{\ast}}\rightarrow \infty\) , (and hence \(k_{1} \rightarrow \infty,k_{2} \rightarrow \infty,t \rightarrow \infty\) ) with \(k^{{\ast}}/t \rightarrow \lambda> 0.\)

  (ii)

    \(\max _{j=1,\cdots \,,e_{1}} \frac{g_{1j}} {k^{{\ast}}}\) is bounded away from 1.

  (iii)

    \(\max _{j=1,\cdots \,,e_{2}} \frac{g_{2j}} {k^{{\ast}}}\) is bounded away from 1.

Then, under the null hypothesis of independence whereby the pattern of ties and missing data is fixed, \(A_{S}^{{\ast}}\) is asymptotically normal with mean 0 and exact variance

$$\displaystyle{Var\left (A_{S}^{{\ast}}\right ) = \left [ \frac{\left (t + 1\right )^{2}k_{ 1}} {\left (k_{1} + 1\right )\left (k_{2} + 1\right )}\right ]^{2}\frac{\sum _{j=1}^{k_{1}}\left (o_{ j}^{{\ast}}-\bar{ o}\right )^{2}} {12} \left \{1 -\frac{\sum _{j=1}^{e_{1}}\left (g_{1j}^{3} - g_{1j}\right )} {k_{1}^{3} - k_{1}} \right \}.}$$

Proof.

See Yu et al. (2002). □ 

Example 3.7.

In a public opinion survey held in 1999 in Hong Kong, it was of interest to determine whether the education level of the respondents is related to their level of dissatisfaction with the Policy Address of the Chief Executive of the Hong Kong Special Administrative Region. The response is an ordinal variable having seven options as follows: (1), very satisfied; (2), satisfied; (3), neutral; (4), unsatisfied; (5), very unsatisfied; (6), not sure; and (7), refuse to answer. Options (6) and (7) were combined and listed as “missing.” Table 3.4 displays the frequencies of the respondents listed by option and by education level.

Table 3.4 Data from the public opinion survey

It is noted that about 19.9 % of the respondents did not respond to one or both questions. Moreover, since the education levels are grouped into a few categories, the problem of ties cannot be ignored. An alternative approach for analyzing this data is as a contingency table. In that case, however, the ordering among the education levels and separately among the responses would not be taken into account. The results of the analysis shown in Table 3.5 reveal that at the 5 % significance level, the test based on the reduced sample (which discards all observations with at least one missing variable) cannot reject the hypothesis of independence whereas the one based on the complete sample can. Since the test statistic is positive, this implies that there is a positive association between education level and level of dissatisfaction. More highly educated respondents tend to be less satisfied with the Policy Address. The analysis by means of a contingency table whereby the missing categories for education and response were dropped leads to a chi-square statistic with a value of 35.2161 on 16 degrees of freedom and a p-value of 0.0037.

Table 3.5 Results of the analyses
Table 3.6 Wind direction in degrees

3.6 Angular Correlations

There has been a great deal of interest in directional statistics in the literature. Consider the following example on wind directions, in which we are interested in testing for independence between the 6 a.m. and the noon readings. The data shown in Table 3.6 can be viewed as points on the unit circle and cannot be dealt with by simply computing the usual rank correlation. The reason is that the largest ranks are close to the smallest ranks. Hence, for example, for the noon readings, angle 23 is closer to angle 313 than to angle 248. Yet, the ranks imply an opposite interpretation. In the table, tied ranks were replaced by their midranks.

Example 3.8 (Johnson and Wehrly 1977).

Wind directions were recorded at 6 a.m. and at 12 noon on each day at a weather station for 21 consecutive days. It is desired to test for independence. Tied rankings were replaced by their midranks (Table 3.6).

Excellent review articles along with additional references are given by Mardia (1975, 1976) and Jupp and Mardia (1989). Typically, data are provided in the form of directions either in two- or three-dimensional space or as rotations in such a space. The data may take on a variety of forms. It may consist of a unit vector of directions, pairs of such vectors, or a vector of directions along with a corresponding random variable on the line. Examples of applications are to be found in the fields of astronomy, biology, geology, medicine, and meteorology (Downs 1973; Johnson and Wehrly 1977; Breckling 1989). A large number of the works presented deal with the study of inference from parametric models. In this section, we define a corresponding notion of angular correlation using the ranks of the data.

Let X and Y be random vectors with covariance matrix \(\Sigma\) partitioned as

$$\displaystyle{\Sigma = \left (\begin{array}{ll} \Sigma _{11} & \Sigma _{12} \\ \Sigma _{21} & \Sigma _{22} \end{array} \right )}$$

and suppose \(\Sigma _{11}\) and \(\Sigma _{22}\) are non-singular of ranks p and q, respectively.

Definition 3.5 (Jupp and Mardia 1989).

The correlation coefficient \(\gamma _{XY}\) between X and Y is defined to be the trace

$$\displaystyle{\gamma _{XY } = Tr[\Sigma _{11}^{-1}\Sigma _{ 12}\Sigma _{22}^{-1}\Sigma _{ 21}].}$$

It follows that \(\gamma _{XY } =\sum _{ i=1}^{s}\lambda _{i}^{2}\) where the \(\lambda _{i}\) are the canonical correlations and s = min(p, q). This coefficient satisfies the property of invariance under rotation and reflection in addition to the usual properties of a correlation.

Suppose now that θ and \(\varphi\) are circular variables with \(0 \leq \theta,\varphi \leq 2\pi\). Define the directional vectors \(t_{1}^{{{\prime}} }(\theta ) = (\cos \theta,\sin \theta )\), \(t_{2}^{{{\prime}} }(\varphi ) = (\cos \varphi,\sin \varphi ),\) and let \(\Sigma\) be the covariance matrix of \(t_{1}\) and \(t_{2}\). It is seen that

$$\displaystyle\begin{array}{rcl} \gamma _{\theta \varphi }& =& [\rho _{cc}^{2} +\rho _{ cs}^{2} +\rho _{ sc}^{2} +\rho _{ ss}^{2} + 2(\rho _{ cc}\rho _{ss} +\rho _{cs}\rho _{sc})\rho _{1}\rho _{2} - 2(\rho _{cc}\rho _{sc}\qquad \\ & & +\rho _{cs}\rho _{ss})\rho _{1} - 2(\rho _{cc}\rho _{cs} +\rho _{sc}\rho _{ss})\rho _{2}]/[(1 -\rho _{1}^{2})(1 -\rho _{ 2}^{2})].{}\end{array}$$
(3.37)

where \(\rho _{cc} = corr(\cos \theta,\cos \varphi )\), \(\rho _{cs} = corr(\cos \theta,\sin \varphi )\), etc., and \(\rho _{1} = corr(\cos \theta,\sin \theta )\), \(\rho _{2} = corr(\cos \varphi,\sin \varphi )\).
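The following sketch evaluates (3.37) on simulated dependent angles and cross-checks it against the trace form of Definition 3.5; since the trace is invariant under rescaling of the component variables, correlation matrices may be used throughout. The simulation design is ours.

```python
# Numerical check of (3.37) against the trace definition.
import numpy as np

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 300)
phi = (theta + rng.normal(0, 0.5, 300)) % (2 * np.pi)  # dependent angles

c1, s1, c2, s2 = np.cos(theta), np.sin(theta), np.cos(phi), np.sin(phi)
r = lambda a, b: np.corrcoef(a, b)[0, 1]
rcc, rcs, rsc, rss = r(c1, c2), r(c1, s2), r(s1, c2), r(s1, s2)
r1, r2 = r(c1, s1), r(c2, s2)

gamma = (rcc**2 + rcs**2 + rsc**2 + rss**2
         + 2 * (rcc * rss + rcs * rsc) * r1 * r2
         - 2 * (rcc * rsc + rcs * rss) * r1
         - 2 * (rcc * rcs + rsc * rss) * r2) / ((1 - r1**2) * (1 - r2**2))

# Cross-check against Tr[S11^{-1} S12 S22^{-1} S21] on the correlation scale.
S = np.corrcoef(np.column_stack([c1, s1, c2, s2]), rowvar=False)
trace = np.trace(np.linalg.solve(S[:2, :2], S[:2, 2:])
                 @ np.linalg.solve(S[2:, 2:], S[2:, :2]))
print(gamma, trace)  # identical up to rounding
```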

Let \((\theta _{i},\varphi _{i})\), \(i = 1,\ldots,n\), be a random sample of n pairs of angles defining points on the unit circle. Without loss of generality, assume that the ranks of the θ’s are the natural integers \(1,\ldots,n\) whereas the corresponding ranks of the \(\varphi\)’s are denoted by \(R_{1},\ldots,R_{n}.\) Let

$$\displaystyle{\eta ^{(1)} = (\cos \frac{2\pi } {n},\cos \frac{4\pi } {n},\ldots,\cos 2\pi )^{{\prime}}\text{, }\eta ^{(2)} = (\sin \frac{2\pi } {n},\sin \frac{4\pi } {n},\ldots,\sin 2\pi )^{{\prime}}}$$
$$\displaystyle{\nu ^{(1)}=(\cos \frac{2\pi R_{1}} {n},\cos \frac{2\pi R_{2}} {n},\ldots,\cos \frac{2\pi R_{n}} {n} )^{{\prime}},\nu ^{(2)}=(\sin \frac{2\pi R_{1}} {n},\sin \frac{2\pi R_{2}} {n},\ldots,\sin \frac{2\pi R_{n}} {n} )^{{\prime}}.}$$

We may formally construct on the basis of the sample the matrix of pairwise correlations

$$\displaystyle{\Upsilon _{12} = \left (\begin{array}{ll} \rho (\eta ^{(1)},\nu ^{(1)})&\rho (\eta ^{(1)},\nu ^{(2)}) \\ \rho (\eta ^{(2)},\nu ^{(1)})&\rho (\eta ^{(2)},\nu ^{(2)}) \end{array} \right )}$$

where ρ(η, ν) is a measure of correlation between η and ν. In the next two subsections we consider correlations based on the Spearman and Kendall distance functions and determine the corresponding asymptotic distributions of the correlation coefficients as \(n \rightarrow \infty\). The construction of the score vectors is illustrated in the short sketch below.
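A small helper for the constructions above, mapping ranks to points on the unit circle; the rank vector used for the \(\varphi\)’s is hypothetical.

```python
# Uniform scores: ranks 1..n mapped to angles 2*pi*i/n on the unit circle.
import numpy as np

def uniform_scores(ranks):
    ranks = np.asarray(ranks, dtype=float)
    ang = 2 * np.pi * ranks / len(ranks)
    return np.cos(ang), np.sin(ang)

n = 8
eta1, eta2 = uniform_scores(np.arange(1, n + 1))     # theta-ranks 1,...,n
nu1, nu2 = uniform_scores([3, 1, 4, 2, 8, 6, 7, 5])  # hypothetical phi-ranks

# One entry of Upsilon_12; since the scores sum to zero and have equal
# norms, this equals (2/n) * eta1 @ nu1, the Spearman form of Sect. 3.6.1.
print(np.corrcoef(eta1, nu1)[0, 1])
```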

3.6.1 Spearman Distance

We shall consider the Kendall notion of a type b correlation (Kendall and Gibbons 1990) given by

$$\displaystyle\begin{array}{rcl} \rho _{S}(\eta,\nu )& =& \frac{\sum _{i\neq j}\left (\eta _{i} -\eta _{j}\right )\left (\nu _{i} -\nu _{j}\right )} {\sqrt{\sum _{i\neq j } \left (\eta _{i } -\eta _{j } \right ) ^{2 } \sum _{i\neq j } \left (\nu _{i } -\nu _{j } \right ) ^{2}}} {}\\ & =& \text{ } \frac{2} {n}\eta ^{{\prime}}\nu. {}\\ \end{array}$$

It is straightforward to show

$$\displaystyle{\sum _{i=1}^{n}\cos \frac{2\pi i} {n} =\sum _{ i=1}^{n}\sin \frac{2\pi i} {n} =\sum _{ i=1}^{n}\cos \frac{2\pi i} {n} \sin \frac{2\pi i} {n} = 0}$$

and

$$\displaystyle{\sum _{i=1}^{n}\cos ^{2}\frac{2\pi i} {n} =\sum _{ i=1}^{n}\sin ^{2}\frac{2\pi i} {n} = \frac{n} {2}.}$$

It follows that \(\Sigma _{11} = \Sigma _{22} = \frac{n} {2} I\). The sample estimate of \(\Upsilon _{12}\) is given by

$$\displaystyle{\Upsilon _{12}^{S} = \frac{2} {n}\left (\begin{array}{ll} T_{cc} &T_{cs} \\ T_{sc}&T_{ss} \end{array} \right )}$$

where \(T_{cc} =\eta ^{(1)^{{\prime}} }\nu ^{(1)},T_{cs} =\eta ^{(1)^{{\prime}} }\nu ^{(2)},T_{sc} =\eta ^{(2)^{{\prime}} }\nu ^{(1)},T_{ss} =\eta ^{(2)^{{\prime}} }\nu ^{(2)}.\)

We recognize that the T’s are measures of correlation in the Spearman sense. Consequently, the sample correlation using the Spearman distance becomes

$$\displaystyle{\gamma _{S} = \frac{4} {n^{2}}\left (T_{cc}^{2} + T_{ss}^{2} + T_{cs}^{2} + T_{sc}^{2}\right ).}$$
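The full Spearman computation may be sketched as follows; the function name is ours, ties are handled by midranks as in the examples, and the chi-square(4) calibration anticipates Theorem 3.6 below.

```python
# n*gamma_S and its asymptotic chi-square(4) p-value (Theorem 3.6, below).
import numpy as np
from scipy.stats import chi2, rankdata

def spearman_angular_test(theta, phi):
    n = len(theta)
    a = 2 * np.pi * rankdata(theta) / n   # rank angles for the theta's
    b = 2 * np.pi * rankdata(phi) / n     # rank angles for the phi's
    Tcc, Tcs = np.cos(a) @ np.cos(b), np.cos(a) @ np.sin(b)
    Tsc, Tss = np.sin(a) @ np.cos(b), np.sin(a) @ np.sin(b)
    gamma_s = 4 * (Tcc**2 + Tss**2 + Tcs**2 + Tsc**2) / n**2
    stat = n * gamma_s
    return stat, chi2.sf(stat, df=4)
```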

3.6.2 Kendall Distance

Recalling the Kendall measure of distance defined by

$$\displaystyle{d_{K}(\eta,\nu ) =\sum _{i<j}\left \{1 - sgn(\eta _{i} -\eta _{j})sgn(\nu _{i} -\nu _{j})\right \}}$$

where sgn indicates the sign function, we may define a corresponding type b correlation as

$$\displaystyle\begin{array}{rcl} \rho _{K}(\eta,\nu )& =& \frac{\sum _{i\neq j}sgn(\eta _{i} -\eta _{j})sgn(\nu _{i} -\nu _{j})} {\sqrt{\sum _{i\neq j } (sgn(\eta _{i } -\eta _{j } ))^{2}}\sqrt{\sum _{i\neq j } (sgn(\nu _{i } -\nu _{j } ))^{2}}} {}\\ & =& \frac{\sum _{i\neq j}sgn(\eta _{i} -\eta _{j})sgn(\nu _{i} -\nu _{j})} {\sqrt{A(\eta )A(\nu )}}, {}\\ \end{array}$$

where \(A(\eta ) = \#\left (\text{pairs }(i,j),i\neq j\vert \eta _{i}\neq \eta _{j}\right ).\) It is easy to see that \(\Sigma _{11}\) and \(\Sigma _{22}\) are diagonal matrices. In fact, the off-diagonal terms are equal to

$$\displaystyle\begin{array}{rcl} & & \sum _{i\neq j}sgn\left (\cos \frac{2\pi i} {n} -\cos \frac{2\pi j} {n} \right )sgn\left (\sin \frac{2\pi i} {n} -\sin \frac{2\pi j} {n} \right ) {}\\ & =& -4\sum _{i\neq j}sgn\left (\sin \frac{\pi (i + j)} {n} \sin \frac{\pi (i - j)} {n} \right )sgn\left (\cos \frac{\pi (i + j)} {n} \sin \frac{\pi (i - j)} {n} \right ) {}\\ & =& -2\sum _{i\neq j}sgn\left (\sin \frac{2\pi (i + j)} {n} \right ) = 0. {}\\ \end{array}$$

The normalization in the Kendall case is somewhat delicate and depends in part on the parity of n. For example, for n = 10 there are five pairs of equal values in the set \(\left \{\sin \frac{2\pi i} {n} \right \}\), whereas for n = 11 all the values are distinct. In general, the number of equal pairs is O(n), as the short check below illustrates.
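A quick numerical confirmation of this parity effect (our own check, not from the source):

```python
# Count tied pairs among {sin(2*pi*i/n) : i = 1,...,n} for even and odd n.
import numpy as np
from collections import Counter

for n in (10, 11):
    vals = np.round(np.sin(2 * np.pi * np.arange(1, n + 1) / n), 12)
    ties = sum(c * (c - 1) // 2 for c in Counter(vals.tolist()).values())
    print(f"n = {n}: {ties} tied pairs")  # five pairs for n = 10, none for n = 11
```

The sample estimate of \(\Upsilon _{12}\) is given by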

$$\displaystyle{\Upsilon _{12}^{K} = \left (\begin{array}{ll} K_{cc} &K_{cs} \\ K_{sc}&K_{ss} \end{array} \right )}$$

where \(K_{cc} =\rho _{K}(\eta ^{(1)},\nu ^{(1)})\), \(K_{cs} =\rho _{K}(\eta ^{(1)},\nu ^{(2)})\), \(K_{sc} =\rho _{K}(\eta ^{(2)},\nu ^{(1)})\), \(K_{ss} =\rho _{K}(\eta ^{(2)},\nu ^{(2)})\).

It follows that the sample correlation coefficient in the Kendall case is given by

$$\displaystyle{\gamma _{K} = \left (K_{cc}^{2} + K_{ ss}^{2} + K_{ cs}^{2} + K_{ sc}^{2}\right ).}$$
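A corresponding sketch for the Kendall version; scipy's kendalltau computes exactly the type-b coefficient \(\rho _{K}\) used here, and the 9∕4 scaling anticipates Theorem 3.7 below. The function name is ours.

```python
# (9n/4)*gamma_K and its asymptotic chi-square(4) p-value (Theorem 3.7).
import numpy as np
from scipy.stats import chi2, kendalltau, rankdata

def kendall_angular_test(theta, phi):
    n = len(theta)
    a = 2 * np.pi * rankdata(theta) / n
    b = 2 * np.pi * rankdata(phi) / n
    eta = (np.cos(a), np.sin(a))
    nu = (np.cos(b), np.sin(b))
    # kendalltau returns the tau-b correlation, i.e. rho_K above
    gamma_k = sum(kendalltau(u, v)[0] ** 2 for u in eta for v in nu)
    stat = 9 * n * gamma_k / 4
    return stat, chi2.sf(stat, df=4)
```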

In the next subsection, we derive the asymptotic null distributions of the test statistics induced by the Spearman and Kendall distances.

3.6.3 Asymptotic Distributions

We are interested in testing the null hypothesis that the circular variables \(\theta,\varphi\) are independent. In terms of the ranks, assuming no ties, this translates into the hypothesis H0 that all permutations of the integers \(1,\ldots,n\) are equally likely.

Theorem 3.6.

The asymptotic null distribution of \(n\gamma _{S}\) as \(n \rightarrow \infty\) is \(\chi _{4}^{2}\) .

Proof.

The joint distribution of \(T_{cc},T_{ss},T_{cs},T_{sc}\) is asymptotically normal. In fact, for arbitrary \(\left \{a_{i}\right \}\), consider the linear combination

$$\displaystyle\begin{array}{rcl} a_{1}T_{cc}+a_{2}T_{ss}+a_{3}T_{cs}+a_{4}T_{sc} =\ \ & & {}\\ \sum _{i=1}^{n}[\cos \frac{2\pi R_{i}} {n} (a_{1}\cos \frac{2\pi i} {n} + a_{2}\sin \frac{2\pi i} {n} )+\sin \frac{2\pi R_{i}} {n} (a_{3}\cos \frac{2\pi i} {n} + a_{4}\sin \frac{2\pi i} {n} )].& & {}\\ \end{array}$$

Let

$$\displaystyle{d_{n}(i,j) =\cos \frac{2\pi i} {n} (a_{1}\cos \frac{2\pi j} {n} + a_{2}\sin \frac{2\pi j} {n} ) +\sin \frac{2\pi i} {n} (a_{3}\cos \frac{2\pi j} {n} + a_{4}\sin \frac{2\pi j} {n} ).}$$

Since

$$\displaystyle{\max _{i,j}d_{n}^{2}(i,j) \leq 4(a_{ 1}^{2} + a_{ 2}^{2} + a_{ 3}^{2} + a_{ 4}^{2})}$$

and the variance

$$\displaystyle{ \frac{1} {n}\sum _{i=1}^{n}\sum _{ j=1}^{n}d_{ n}^{2}(i,j) = \frac{n} {4} (a_{1}^{2} + a_{ 2}^{2} + a_{ 3}^{2} + a_{ 4}^{2})}$$

we have that

$$\displaystyle{ \frac{\max _{i,j}d_{n}^{2}(i,j)} { \frac{1} {n}\sum _{i=1}^{n}\sum _{j=1}^{n}d_{n}^{2}(i,j)} \rightarrow 0\ \text{ as }\ n \rightarrow \infty.}$$

The result follows from Hoeffding’s combinatorial central limit theorem (see Appendix B.1). Hence \(\Upsilon _{12}^{S}\) is asymptotically multivariate normal and the theorem follows. □

A similar result holds for the Kendall tau statistic.

Theorem 3.7.

The asymptotic null distribution of \(\frac{9} {4}n\gamma _{K}\) as n \(\rightarrow \infty\) is \(\chi _{4}^{2}\) .

Proof.

See Alvo (1998) for the proof. A different proof can make use of the asymptotic equivalence between the Kendall and Spearman coefficients in general. □ 

Example 3.9.

We revisit the wind direction data. We calculate

$$\displaystyle{\Upsilon _{12}^{S} = \left (\begin{array}{cc} - 0.246& 0.306 \\ - 0.376& - 0.452 \end{array} \right )}$$

and hence \(n\gamma _{S} = 21(0.50047) = 10.51\) with a p-value of 0.0327. Consequently, we conclude that there is evidence that the 6 a.m. and noon wind directions are significantly correlated.

It is interesting to compare this result with the usual product moment correlation between the two angular measurements. The latter equals −0.04, which would suggest that the variables are uncorrelated. On the other hand, restricting attention to the pairs for which the 6 a.m. reading is below 180°, the product moment correlation is 0.512, while for the pairs with 6 a.m. readings above 180° it is −0.475. Taken separately, these results imply a fair degree of dependence. The test statistic \(\gamma _{S}\) takes into account the fact that very small and very large angles (mod 2π) are close to one another.

For the Kendall statistic, we may also calculate

$$\displaystyle{\Upsilon _{12}^{K} = \left (\begin{array}{ll} - 0.1822&0.2097 \\ - 0.3106& - 0.3637 \end{array} \right )}$$

and hence \(\frac{9n} {4} \gamma _{K} = \frac{9(21)} {4} (0.3056) = 14.44\) with a p-value of 0.006. With either the Spearman or the Kendall statistic, the hypothesis of independence is in doubt.

3.7 Angle-Linear Correlation

Suppose that we are now interested in defining the correlation between an angle θ and a real valued random variable X. It can be shown that the correlation coefficient in that case is given by

$$\displaystyle{\gamma _{L} = [\rho _{xc}^{2} +\rho _{ xs}^{2} - 2\rho _{ xc}\rho _{xs}\rho _{cs}]/(1 -\rho _{cs}^{2})}$$

where

$$\displaystyle{\rho _{xc} = corr(X,\cos \theta ),\rho _{xs} = corr(X,\sin \theta ),\rho _{cs} = corr(\cos \theta,\sin \theta ).}$$
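Numerically, \(\gamma _{L}\) may be evaluated directly from the three component correlations; the simulated linear-angular sample below is ours, chosen so that X genuinely depends on the direction.

```python
# gamma_L from the component correlations, on simulated data.
import numpy as np

rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, 200)
x = np.cos(theta) + rng.normal(0, 0.3, 200)  # X depends on the direction

r = lambda a, b: np.corrcoef(a, b)[0, 1]
rxc = r(x, np.cos(theta))
rxs = r(x, np.sin(theta))
rcs = r(np.cos(theta), np.sin(theta))

gamma_l = (rxc**2 + rxs**2 - 2 * rxc * rxs * rcs) / (1 - rcs**2)
print(gamma_l)  # well above zero, since X is essentially cos(theta)
```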

In the nonparametric context, let \((X_{i},\theta _{i})\), \(i = 1,\ldots,n\), be a random sample of linear-angular measurements. Let \(\{R_{i}\}\) be the ranks of the \(\{X_{i}\}\) and let \(\{S_{i}\}\) be the ranks of the \(\{\theta _{i}\}\). We may assume without loss of generality that the \(S_{i}\) are in the natural order \(1,2,\ldots,n\). Based on the Spearman measure of distance, the sample angular-linear correlation is defined by

$$\displaystyle{\gamma _{LS} = \frac{[T_{xc}^{2} + T_{xs}^{2}]} {\frac{n} {2} \left (\frac{n(n^{2}-1)} {12} \right )} }$$

where \(T_{xc} =\sum R_{i}\cos \left (\frac{2\pi i} {n} \right ),\) \(T_{xs} =\sum R_{i}\sin \left (\frac{2\pi i} {n} \right ).\) Similarly, for the Kendall measure, the angular-linear correlation is then given by

$$\displaystyle{\gamma _{LK} = [K_{xc}^{2} + K_{ xs}^{2}]}$$

where

$$\displaystyle\begin{array}{rcl} K_{xc}& =& \frac{\sum _{i\neq j}[sgn(R_{i} - R_{j})sgn(\cos \left (\frac{2\pi i} {n} \right ) -\cos \left (\frac{2\pi j} {n} \right ))]} {\sqrt{[n(n - 1)]}\sqrt{\sum _{i\neq j } (sgn(\eta _{i }^{(1) } -\eta _{ j }^{(1) }))^{2}}} {}\\ K_{xs}& =& \frac{\sum _{i\neq j}[sgn(R_{i} - R_{j})sgn(\sin \left (\frac{2\pi i} {n} \right ) -\sin \left (\frac{2\pi j} {n} \right ))]} {\sqrt{[n(n - 1)]}\sqrt{\sum _{i\neq j } (sgn(\eta _{i }^{(2) } -\eta _{ j }^{(2) }))^{2}}}. {}\\ \end{array}$$
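Both rank statistics can be sketched as follows; the function name is ours, ties are assumed absent, and the chi-square(2) calibrations are those of Theorems 3.8 and 3.9 below.

```python
# Angle-linear tests: n*gamma_LS and (9n/4)*gamma_LK, both ~ chi-square(2).
import numpy as np
from scipy.stats import chi2, kendalltau, rankdata

def angle_linear_tests(x, theta):
    n = len(x)
    R = rankdata(x)[np.argsort(theta)]   # x-ranks, theta-ranks in natural order
    ang = 2 * np.pi * np.arange(1, n + 1) / n
    # Spearman version
    Txc, Txs = R @ np.cos(ang), R @ np.sin(ang)
    gamma_ls = (Txc**2 + Txs**2) / ((n / 2) * (n * (n**2 - 1) / 12))
    stat_s = n * gamma_ls
    # Kendall version: K_xc and K_xs are tau-b correlations
    Kxc, Kxs = kendalltau(R, np.cos(ang))[0], kendalltau(R, np.sin(ang))[0]
    stat_k = 9 * n * (Kxc**2 + Kxs**2) / 4
    return (stat_s, chi2.sf(stat_s, df=2)), (stat_k, chi2.sf(stat_k, df=2))
```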

We may now prove a theorem giving the asymptotic distributions of γ LS and γ LK under the null hypothesis that all vectors of ranks \((R_{1},\ldots,R_{n})\) are equally likely.

Theorem 3.8.

The asymptotic null distribution of \(n\gamma _{LS}\) as \(n \rightarrow \infty\) is \(\chi _{2}^{2}\) .

Proof.

The joint distribution of \(T_{xc},T_{xs}\) is asymptotically normal. In fact, for arbitrary constants \(a_{1},a_{2}\), consider the linear combination

$$\displaystyle{a_{1}T_{xc} + a_{2}T_{xs} =\sum _{ i=1}^{n}R_{ i}\left (a_{1}\cos \frac{2\pi i} {n} + a_{2}\sin \frac{2\pi i} {n} \right ).}$$

This is a linear rank statistic for which the conditions in Hoeffding (1951) are satisfied. In fact, let

$$\displaystyle{d_{n}(i,j) = (j -\frac{n + 1} {2} )(a_{1}\cos \frac{2\pi i} {n} + a_{2}\sin \frac{2\pi i} {n} ).}$$

The variance is then equal to

$$\displaystyle{ \frac{1} {n}\sum _{i=1}^{n}\sum _{ j=1}^{n}d_{ n}^{2}(i,j) = \frac{1} {4}(a_{1}^{2} + a_{ 2}^{2})\frac{n(n^{2} - 1)} {12} }$$

and we have that

$$\displaystyle{ \frac{max\text{ }d_{n}^{2}(i,j)} { \frac{1} {n}\sum _{i=1}^{n}\sum _{j=1}^{n}d_{n}^{2}(i,j)} \rightarrow 0}$$

as \(n \rightarrow \infty\). The result follows. □ 

Theorem 3.9.

The asymptotic null distribution of \(\frac{9n} {4} \gamma _{LK}\) as \(n \rightarrow \infty\) is \(\chi _{2}^{2}\) .

Proof.

For arbitrary constants \(a_{1},a_{2}\), consider the linear combination

$$\displaystyle{\sum _{i\neq j}^{n}sgn(R_{ i} - R_{j})b_{ij}}$$

where

$$\displaystyle{b_{ij} = a_{1}sgn\left (\cos \frac{2\pi i} {n} -\cos \frac{2\pi j} {n} \right ) + a_{2}sgn\left (\sin \frac{2\pi i} {n} -\sin \frac{2\pi j} {n} \right ).}$$

Using a result of Daniels (1950), the asymptotic normality of K xc and K xs follows. □ 

Example 3.10 (Johnson and Wehrly 1977).

We consider data on wind direction and ozone concentration collected at a weather station for 19 days at 4-day intervals. The readings are given in Table 3.7.

Table 3.7 Wind direction and ozone concentration

The Spearman test statistic is found to be \(n\gamma _{LS} = 19\left (0.3751\right ) = 7.13\), which has a p-value equal to 0.0283. On the other hand, the Kendall statistic is \(\frac{9n} {4} \gamma _{LK} = \frac{9\left (19\right )} {4} \left (0.1595\right ) = 6.82\), for a p-value of 0.033. Both statistics imply a fair degree of dependence between wind direction and ozone concentration.
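These p-values can be checked directly against the \(\chi _{2}^{2}\) tail, using the statistic values quoted above:

```python
# Verify the quoted p-values from the chi-square(2) limit.
from scipy.stats import chi2

print(chi2.sf(19 * 0.3751, df=2))          # Spearman: about 0.028
print(chi2.sf(9 * 19 / 4 * 0.1595, df=2))  # Kendall:  about 0.033
```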