Abstract
A ranking represents the order of preference one has with respect to a set of t objects. If we label the objects by the integers 1 to t, a ranking can then be thought of as a permutation of the integers \((1,2,\ldots,t)\). We may denote such a permutation by μ = (μ(1), μ(2), …, μ(t))′ which may also be conceptualized as a point in t-dimensional space. It is natural to measure the spread between two individual permutations μ, ν by means of a distance function.
Access provided by Autonomous University of Puebla. Download chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
3.1 Notion of Distance Between Two Rankings
A ranking represents the order of preference one has with respect to a set of t objects. If we label the objects by the integers 1 to t, a ranking can then be thought of as a permutation of the integers \((1,2,\ldots,t)\). We may denote such a permutation by μ = (μ(1), μ(2), …, μ(t))′ which may also be conceptualized as a point in t-dimensional space. It is natural to measure the spread between two individual permutations μ, ν by means of a distance function. There are several examples of distance functions that have been proposed in the literature. Here are a few:
Spearman
Kendall
where sgn(x) is either 1 or − 1 depending on whether x > 0 or x < 0.
Hamming
where I(. ) is the indicator function taking values 1 or 0 depending on whether the statement in brackets holds or not.
Spearman Footrule
The Spearman measure is not a proper “distance” in that it does not obey the triangular inequality property. We shall nonetheless refer to it as a distance function in this book. It is based upon squared Euclidean distance whereas the Footrule is based on the absolute deviations. The Kendall distance counts the number of “discordant” pairs whereas the Hamming distance counts the number of “mismatches.” The Hamming distance has found uses in coding theory. These distances have the property of being invariant under any permutation relabeling of the objects. That is, for any permutations σ, μ, ν,
where \(\mu \circ \sigma \left (i\right ) =\mu \left (\sigma \left (i\right )\right ).\) This property is known as right invariance. Let \(\Delta = \left (d\left (\mu _{i},\mu _{j}\right )\right )\) denote the matrix of all pairwise distances. If d is right invariant, then it follows that there exists a constant c > 0 for which
where \(\boldsymbol{1} = \left (1,1,\ldots,1\right )'\) is of dimension t! . Hence, c is equal to the average distance. It is straightforward to show that for the Spearman and Kendall distances
Turning attention to the Hamming distance, we note that if \(e = \left (1,2,\ldots,t\right )'\), then
and hence \(c_{H} = \left (t - 1\right )\).
Example 3.1.
Suppose that t = 3 and that the complete rankings are denoted by
Using the above order of the permutations, we may write the matrix \(\Delta\) of pairwise Spearman, Kendall, Hamming, and Footrule distances respectively as
These distances may alternatively be written in terms of a similarity function in the form
Spearman:
Kendall:
Hamming:
Footrule:
The similarity measures may be also interpreted geometrically as inner products which sets the groundwork for defining correlation in the next section.
3.2 Correlation Between Two Rankings
The notion of correlation occurs frequently in statistics. For example, in regression analysis, one is interested in the correlation between two variables such as height and weight. Similarly, in nonparametric statistics, we shall be interested in the correlation between two rankings. Let \(\mathcal{P}\) be the space of all possible permutations of the integers \(1,2,\ldots,t\). We may define the correlation between two rankings μ, ν as
where M is the maximum value of the distance \(d\left (\mu,\nu \right )\) taken over all possible pairs μ, ν in \(\mathcal{P}\) (Diaconis and Graham 1977). In the case of the Spearman and Kendall distance, the maximum values occur when
whereas the minimum occurs when
This is a consequence of the rearrangement inequality given as a lemma below.
Lemma 3.1.
Let \(a_{1},\ldots,a_{t}\) and \(b_{1},\ldots,b_{t}\) be real numbers, not necessarily positive with
and let σ be a permutation of the integers 1,…,t. Then
Proof.
The proof follows by induction on t. □
It can be shown that for the Spearman and Kendall distances, the maximum is equal to twice the mean,
In view of (3.10) we have
Example 3.2 (Lehmann 1975, p. 298).
Consider the test scores in Language and Arithmetic for a group of 9 students as shown in Table 3.1. The right-invariance property shared by the Spearman and Kendall distances enables us to rewrite the table in a more convenient fashion with one of the rankings in natural order as in Table 3.2. The Spearman and Kendall correlations are respectively 0. 683 and 0. 500. Here \(c_{S} = 60,c_{K} = 36\).
The correlation coefficients based on these distances are of the multiplicative type (Kendall and Gibbons 1990); that is, there exists a function g such that
where \(k_{\mu },k_{\nu }\) are normalizing constants. The constants may be different depending on whether the coefficient is of type a or b. A type a correlation is used above. For Spearman and Kendall, the functions are, respectively,
For a type b correlation, the constants are given by
We shall make use of a type b correlation when defining angular correlations in Sect. 3.6.
For a multiplicative index, it can be shown that the correlation matrix is necessarily positive semidefinite (Quade 1972). Setting
where \(J =\boldsymbol{ 11}'\) and \(\frac{M} {2} = c,\) this implies that there exists a matrix T for which
It follows that the distance matrix for both Spearman and Kendall can be expressed as
From the form of the Spearman and Kendall similarity measures (3.12), it can be seen that the matrices T are respectively
where
is the centered rank vector and
is of dimension \(\left ({}_{2}^{t}\right ) \times t!\) where the qth element for \(q = \left (i - 1\right )\left (t - \frac{i} {2}\right ) + \left (j - i\right ),1 \leq i <j \leq t\),
For Hamming, we may write the t 2-dimensional vector where the \(\left (i,j\right )\) th element is
for 1 ≤ i, j ≤ t.
For the Footrule we have the t 2-dimensional vector where the qth element for \(q = \left (i - 1\right )t + j,1 \leq i <j \leq t\)
Example 3.3.
Suppose that t = 3. Then, placing the rankings in the natural order of Example 3.1, we have that
and
The notion of correlation is particularly useful in problems wherein one wishes to test for the independence of two variables as in Example 3.2 or for the existence of long-term monotone trend in the pH of a river. We will postpone a discussion of these important topics later in this chapter where it will be addressed in the general context of incomplete rankings.
3.3 Incomplete Rankings and the Notion of Compatibility
A judge may rank a complete set of candidates in accordance with some criterion. On occasion, however, data may be missing either at random or by design. For example, one or more candidates may not be ranked. In another example, the pH data on a lake may not be available for certain months in a year, thereby making it impossible to test for a long-term trend using traditional nonparametric rank-based statistics. The option to ignore the missing data is unsatisfactory because it distorts the time scale. As we shall see later on, this option is always suboptimal when testing for trend. We address the topic in this section by first introducing the notion of compatibility.
- Notation.:
-
Incomplete ranks will be denoted by “−” and corresponding incomplete rankings will be written with an upper script “*”.
For example, the ranking \(\mu ^{{\ast}} = (2,-,3,4,1)'\) indicates that object 2 is unranked among the five objects presented.
Definition 3.1.
The complete ranking μ of t objects is said to be compatible with an incomplete ranking μ ∗ of a subset of k of these objects, 2 ≤ k ≤ t, if the relative ranking of every pair of objects ranked in μ ∗ coincides with their relative ranking in μ.
An incomplete ranking gives rise to a class of order preserving complete rankings. Denoting by \(C\left (\mu ^{{\ast}}\right )\) the set of complete permutations compatible with \(\mu ^{{\ast}} = (2,-,3,4,1)'\), we have that t
The total number of complete rankings of t objects compatible with an incomplete ranking of a subset of k objects is given by t! ∕ k! . This follows from the fact that there are \(\left ({}_{k}^{t}\right )\) ways of choosing k integers for the ranked objects, one way in placing them to preserve the order and then \(\left (t - k\right )!\) ways of rearranging the remaining integers. The product is thus
The notion of compatibility establishes a connection between an incomplete ranking and the class of complete rankings from which the incomplete ranking could have arisen. It seems natural as a result to extend the notion of distance to incomplete rankings by referring to the corresponding compatibility classes.
Definition 3.2.
The distance \(d^{{\ast}}\left (\mu ^{{\ast}},\nu ^{{\ast}}\right )\) between two incomplete rankings μ ∗ and ν ∗ is defined to be the average of all values of the distances \(d(\mu _{i}\mathbf{,\nu }_{j}\mathbf{)}\) taken over all pairs of complete rankings \(\mu _{i}\mathbf{,\nu }_{j}\) compatible with μ ∗ and ν ∗, respectively.
Example 3.4.
Suppose that \(t = 3,k = 2.\) In that case, the possible incomplete rankings are denoted by
We may associate with every incomplete ranking a (t! x 1) compatibility vector, also denoted by \(C\!\left (\nu ^{{\ast}}\right )\), whose ith component is 1 or 0 according to whether μ i is compatible with ν ∗. A summary can be provided by a compatibility matrix as follows.
Consequently, the matrix of average pairwise Spearman distances for the incomplete rankings is given by the product \(C_{S}'\Delta C_{S}/a^{2}\) where \(a = t!/\ k! = 3\) and
We note from this example that the distance of an incomplete ranking to itself is 10 and not 0. In extending the notion of correlation to incomplete rankings, it will be necessary to take this into account.
For the Spearman and Kendall distances, we may re-express the distance \(d^{{\ast}}\left(\mu ^{{\ast}},\nu ^{{\ast}}\right )\) as
where
The latter may be viewed as the average of the \(\mathcal{A}(\mu _{i},\nu _{j}\mathbf{)}\) taken over all complete rankings \(\mu _{i},\nu _{j}\) compatible with μ ∗ and ν ∗, respectively.
3.4 Correlation for Incomplete Rankings
At this point it is useful to derive an expression for an incomplete ranking μ ∗ given knowledge of its compatibility class \(C\!\left (\mu ^{{\ast}}\right ).\) We shall assume that each complete ranking has the same probability of being selected, i.e., they are uniformly distributed over the t! permutations of \(\left (1,2,\ldots,t\right )\).
Lemma 3.2.
The conditional distribution of the rank \(\mu \left (i\right )\) given the compatibility class \(C\left (\mu ^{{\ast}}\right )\) generated by μ ∗ is given by
where \(\delta \left (i\right )\) is either 1 or 0 depending on whether the object i is or is not ranked in the incomplete ranking. Here \(\mu ^{{\ast}}\left (i\right ) \leq j \leq \left (t - k\right ) +\mu ^{{\ast}}\left (i\right )\) , if object i is ranked whereas 1 ≤ j ≤ t, if object i is not ranked.
Proof.
If an object i is ranked in an incomplete ranking μ ∗ of k objects, then the number of complete rankings compatible with μ ∗ which assign rank j to object i is
This consists of the number of ways of picking a set of \(\left (\mu ^{{\ast}}\left (i\right ) - 1\right )\) from the first \(\left (j - 1\right )\) integers and a set of \(\left (k -\mu ^{{\ast}}\left (i\right )\right )\) from the last \(\left (t - j\right )\) integers while allowing all possible permutations of the \(\left (t - k\right )\) integers not picked. On the other hand, if object i is not ranked in μ ∗ then the number of such complete compatible rankings is given by
the number of ways of picking k from the t − 1 integers not equal to j and allowing all possible permutations of the remaining \(\left (t - k - 1\right )\) integers. Dividing these by \(\frac{t!} {k!}\) the number of complete rankings compatible with μ ∗ gives the result. □
In the next lemma, we show that it is possible to compute the value of a score function corresponding to an incomplete ranking from knowledge of the compatibility class. To this end, we make use of the conditional distribution of a complete ranking given its compatibility class and the fact that the conditional expectation of the score function corresponds to its projection onto that class. We apply this approach to compute the form of score functions for both the Spearman and Kendall distances.
Lemma 3.3.
Suppose that we select a complete ranking μ at random from the class of compatible rankings \(C\left (\mu ^{{\ast}}\right )\) . Suppose that object s is ranked. Then (a)
and (b) for any pair of objects i < j,
where
Proof.
To prove (a), recall the identity
Consequently, we have that
For the proof of (b), let
and define
so that the incomplete ranking takes value \(\frac{k+1} {2}\) when an object is unranked. Note that for any complete ranking,
It is clear that if objects i and j are both ranked, then a(i, j) is as stated. Suppose now only object j is ranked. The adjusted score becomes on using (3.21)
Hence, \(a\left (i,j\right ) = \left (\frac{2\mu ^{{\ast}}\left (j\right )} {k+1} - 1\right ).\) The case where only object i is ranked is dealt with similarly. □
In describing visualization techniques for incomplete ranking data, Kidwell et al. (2008) have noted the efficiency for computing the Kendall scores in (3.23). Next, we proceed to find the maximum and minimum distances when only k objects are ranked among the incomplete rankings.
Lemma 3.4.
-
(a)
For the Spearman distance,
$$\displaystyle{m_{S}^{{\ast}} = c_{ S} -\frac{\left (t + 1\right )^{2}} {12} \frac{k\left (k - 1\right )} {\left (k + 1\right )},M_{S}^{{\ast}} = c_{ S} + \frac{\left (t + 1\right )^{2}} {12} \frac{k\left (k - 1\right )} {\left (k + 1\right )} }$$where \(c_{S} = \frac{t\left (t^{2}-1\right )} {12}\) .
-
(b)
For the Kendall distance,
$$\displaystyle{m_{K}^{{\ast}} = c_{ K} -\frac{\left (2t + k + 3\right )} {6} \frac{k\left (k - 1\right )} {\left (k + 1\right )},M_{K}^{{\ast}} = c_{ K} + \frac{\left (2t + k + 3\right )} {6} \frac{k\left (k - 1\right )} {\left (k + 1\right )} }$$where \(c_{K} = \frac{t\left (t-1\right )} {2}\) . It follows that the correlation between the incomplete rankings \(\mu _{1}^{{\ast}},\mu _{2}^{{\ast}}\) can be defined to be
$$\displaystyle{ \alpha \left (\mu _{1}^{{\ast}},\mu _{ 2}^{{\ast}}\right ) = 1 -\frac{2\left [d_{K}^{{\ast}}\left (\mu _{ i}^{{\ast}},\mu _{ j}^{{\ast}}\right ) - m^{{\ast}}\right ]} {M^{{\ast}}- m^{{\ast}}}. }$$(3.27)
Proof.
The right-hand side of (3.21) provides a general expression for an incomplete ranking. It follows that the Spearman distance between two incomplete rankings with the same number of ranked objects is
and in the Kendall case, the distance may be written as
where \(a_{i}\left (q_{1},q_{2}\right )\) is defined as in (3.23) and \(\varpi _{i}\left (s\right )\) is given in 3.25. An application of the Cauchy–Schwarz inequality indicates that the upper bound of the Spearman distance occurs when \(\mathbf{T}_{S}C\left (\mu _{i}^{{\ast}}\right ) = -\mathbf{T}_{S}C\left (\mu _{j}^{{\ast}}\right )\) whereas the lower bound is achieved when \(\mathbf{T}_{S}C\left (\mu _{i}^{{\ast}}\right ) = \mathbf{T}_{S}C\left (\mu _{j}^{{\ast}}\right )\). If we let \(\mu _{j}^{{\ast}}\) be the inverted ranking, that is, \(\mu _{j}^{{\ast}}\left (s\right ) = k + 1 -\mu _{i}^{{\ast}}\left (s\right )\) when object s is ranked by i, then \(\varpi _{j}\left (s\right ) = k + 1 -\varpi _{i}\left (s\right )\) and \(\mathbf{T}_{S}C\left (\mu _{i}^{{\ast}}\right ) = -\mathbf{T}_{S}C\left (\mu _{j}^{{\ast}}\right )\). Furthermore, for the Kendall scores, \(a_{j}\left (q_{1},q_{2}\right ) = -a_{i}\left (q_{1},q_{2}\right )\) and thus \(\mathbf{T}_{K}C\left (\mu _{i}^{{\ast}}\right ) = -\mathbf{T}_{K}C\left (\mu _{j}^{{\ast}}\right )\). A straightforward calculation of these distances using the incomplete ranking \(\left (1,2,\ldots,k,-,-,\ldots,-\right )'\) and its inversion yields the minimum and maximum for each distance. □
We quote without proof a result in Alvo and Cabilio (1995a) which allows for different numbers of observations missing at random.
Lemma 3.5.
For fixed \(k_{1} \leq k_{2}\) suppose the pattern of missing observations is randomly selected from the set of all possible patterns. Then, for the Spearman and Kendall cases, the minimum and maximum values of the distance are of the form
where the \(\gamma \left (i\right )\) are given as
Consider now two independent rankings of length \(k_{1},k_{2}\), respectively, with \(2 \leq k_{1} \leq k_{2} \leq t.\) It follows from (3.6) and Lemma 3.3 that
where k ∗ is the number of objects ranked in ranking 1 among the k 2 objects ranked in ranking 2 and o i is the label of the ith object ranked in ranking 1. Here, \(\delta \left (s,\mu ^{{\ast}}\right )\) takes value 1 if object s is ranked by μ ∗ and value 0 otherwise. Note that
where l i = number of objects unranked in ranking 1 which are to the left of the object being ranked. Similarly from (3.7) we have that
Example 3.5.
Consider the test scores in Language (ranking 1) and Arithmetic (ranking 2) of a group of nine students in Table 3.3. The original data was altered by removing certain values, with the remaining observations reordered and ranked as follows.
Here \(t = 9,k_{1} = 6,k_{2} = 7,k^{{\ast}} = 5,\) \(o_{1} = 1,o_{2} = 2,o_{3} = 3,o_{4} = 5,\) o 5 = 7, o 6 = 8, and \(l_{1} = l_{2} = l_{3} = 0,l_{4} = 1,l_{5} = l_{6} = 2.\) Further,
Hence A S ∗ = 33. 9286 and A K ∗ = 4.
3.4.1 Asymptotic Normality of the Spearman and Kendall Test Statistics
The main objective of this section is to demonstrate the asymptotic normality of the similarity measures due to Spearman and Kendall in the case of incomplete rankings. Specifically, we shall be concerned with the asymptotic distributions of both \(\mathcal{A}_{\mathcal{S}}^{\mathrm{{\ast}}}\),\(\mathcal{A}_{K}^{{\ast}}\) under each of two possible null hypotheses H 1 and H 2. For both hypotheses we assume that k 1, k 2, the number of ranked observations, are fixed and the rankings for which we have (possibly) incomplete data are uniformly distributed over the t! permutations of (1, 2, …, t).
-
Under hypothesis H 1, we assume that the pattern of missing observations is fixed, so that all inference in this case is conditional on such a pattern.
-
Under H 2, we assume that the patterns of missing observations are randomly selected from the set of all possible patterns. The latter situation would arise in practice if unranked objects occur by chance. An example would be testing for trend in water quality data when the historical data is incomplete.
We begin with the definition of a linear rank statistic.
Definition 3.3.
Let \(\left \{a\left (i\right )\right \}\) and \(\left \{c\left (i\right )\right \}\) be two sets of constants. A statistic of the form
where \(R = \left (R_{1},\ldots,R_{N}\right )\) is a vector of ranks is called a linear rank statistic . The constants \(a\left (i\right )\) are called scores whereas the \(c\left (i\right )\) are called regression coefficients .
Many test statistics are of this form. For example, suppose that we have a random sample of n observations from a population and N-n from another. We are interested in testing the null hypothesis that the two populations are the same against the alternative that they differ only in location. Rank all N observations together. The Wilcoxon statistic then considers only the ranks of one of the populations by choosing
Lemma 3.6.
Suppose that R is uniformly distributed over the set of permutations in \(\mathcal{P}\) . Then
-
(i)
for i = 1,…,N, E \(\left (R_{i}\right ) = \frac{N+1} {2},V ar\left (R_{i}\right ) = \frac{\left (N^{2}-1\right )} {12}\) and for i ≠ j, \(Cov\left (R_{i},R_{j}\right ) = -\frac{N+1} {12}\) and
-
(ii)
$$\displaystyle{ES = N\bar{c}\bar{a}}$$
and
where \(\bar{a}\) and \(\bar{c}\) represent the corresponding means.
Proof.
The proof of this lemma is given in (Hájek and Sidak 1967). □
The following theorem states that under certain conditions, linear rank statistics are asymptotically normally distributed. We shall consider square integrable functions ϕ defined on \(\left (0,1\right )\) which have the property that they can be written as the difference of two nondecreasing functions and satisfy
where \(\bar{\phi }=\int _{ 0}^{1}\phi \left (u\right )du.\)
Theorem 3.1.
Suppose that R is uniformly distributed over the set of permutations in \(\mathcal{P}\) . Let the score function be given by \(a\left (i\right ) =\phi \left ( \frac{i} {N}\right )\) where \(\phi \left (\right )\) is a square integrable score function. Then S is asymptotically normally distributed as N →∞ with mean \(N\bar{c}\bar{a}\) and variance
provided
Proof.
The proof of this important result is given in (Hájek and Sidak 1967). □
We may now apply Theorem 3.1 to obtain the asymptotic normality of the Spearman test statistic in the case of incomplete rankings under Hypothesis 1 wherein the pattern of missing data is fixed. Set
where
and \(\overline{o}_{1} = \left (\sum _{i=1}^{k_{1}}o_{i}^{{\ast}}\right )/k_{1}.\) Also set \(\overline{o}^{{\ast}} = \left (\sum _{i=1}^{k^{{\ast}} }o_{i}\right )/k^{{\ast}}.\)
Theorem 3.2.
Assume that \(k^{{\ast}}\rightarrow \infty\) (and hence \(k_{1} \rightarrow \infty,k_{2} \rightarrow \infty,t \rightarrow \infty\) ) with \(k^{{\ast}}/t \rightarrow \lambda> 0,\) where λ is a finite constant. Then, under H 1 , whereby the pattern of missing data is fixed, \(\mathcal{A}_{S}^{{\ast}}\) given in (3.28) is asymptotically normal with mean 0 and variance σ S 2 .
Proof.
The proof hinges on the fact that \(\mathcal{A}_{S}^{{\ast}}\) is a linear rank statistic. In fact
The normality follows provided
Now
Further, \(\left (o_{i}^{{\ast}}-\bar{ o}_{1}\right )^{2} \leq \left (t - 1\right )^{2}\), so that the result follows on letting k ∗ → ∞ with k ∗∕t → λ. □
The exact variance of \(\mathcal{A}_{S}^{{\ast}}\) under H 1, which is recommended in applications of Theorem 3.2, is related to \(\sigma _{S}^{2}\) by
(Lehmann 1975 (A. 49) p. 334). That is, the asymptotic variance given in the theorem is essentially the actual variance of \(\mathcal{A}_{S}^{{\ast}}\). In any application, the calculation of the variance of \(\mathcal{A}_{S}^{{\ast}}\) is a straightforward computation. Next, we consider the asymptotic distribution of \(\mathcal{A}_{S}^{{\ast}}\) and \(\mathcal{A}_{K}^{{\ast}}\) when the pattern of missing observations is random.
Theorem 3.3.
Let \(k_{1} \rightarrow \infty\) (and hence \(k_{2} \rightarrow \infty,t \rightarrow \infty )\) with \(k_{1}/t \rightarrow \lambda> 0,\) where λ is a finite constant. Then, under H 2 , whereby the pattern of missing data is random, \(\mathcal{A}_{S}^{{\ast}}\) is asymptotically normal with mean 0 and variance
with
Proof.
Define \(\mathbf{U} = (U_{1,}U_{2},\ldots,U_{t})\) as the random vector uniformly distributed over the permutations of \((1,2,\ldots,k_{1}, \frac{k_{1}+1} {2},\ldots, \frac{k_{1}+1} {2} ).\) In this case, the extended Spearman distance may be written as
The result follows from the combinatorial central limit theorem of Hoeffding (see Appendix B.1) applied to the quantity within square brackets above. □
Theorem 3.4.
\(\mathcal{A}_{\mathcal{K}}^{\mathrm{{\ast}}}\) is asymptotically equivalent to \(\mathcal{A}_{\mathcal{S}}^{\mathrm{{\ast}}}\) under both hypotheses H 1 and H 2 . Hence, \(\mathcal{A}_{\mathcal{K}}^{\mathrm{{\ast}}}\) is asymptotically normal with mean 0 and variance \(\left (\frac{16} {t^{2}} \right )V ar\left (\mathcal{A}_{S}^{{\ast}}\right )\) .
Proof.
We know from (Hájek and Sidak 1967) that for the complete case
and that, moreover,
Consequently, we have
From Jensen’s inequality
and consequently the asymptotic normality of \(\mathcal{A}_{S}^{{\ast}}\) will imply the asymptotic normality of \(\mathcal{A}_{K}^{{\ast}}.\) □
Example 3.6.
We return to Example 3.2 wherein we wish to test the hypothesis of independence against the alternative of a positive correlation. For the complete data, the value of \(\mathcal{A}_{\mathcal{S}}\) is 41, and from the tables, under the randomness hypothesis, \(P(\mathcal{A}_{\mathcal{S}}\geq 41) = 0.0252,\) whereas the use of the asymptotic result gives a p-value of \(1 - \Phi (1.9328) = 0.0266\), where \(\Phi\) is the cumulative distribution function of a standard normal. For the data in Example 3.5, the value of \(\mathcal{A}_{\mathcal{S}}^{\mathrm{{\ast}}}\) for the reduced data is calculated to be 33.9286. An application of the theorem yields that under H1, the p-value is \(P\left (\mathcal{A}_{\mathcal{S}}^{\mathrm{{\ast}}}\geq 33.9286\right ) = 0.0178\). On the other hand, if all observations with missing values are deleted, we obtain a reduced value of \(\mathcal{A}_{\mathcal{S}} = 9\) with t = 5, and from the tables \(P(\mathcal{A}_{\mathcal{S}}\geq 9) = 0.0417.\)
3.4.2 Asymptotic Efficiency
We now turn to the question of the efficiency which is further discussed in Appendix B.4. Let \(X_{1},X_{2},\ldots,X_{t}\) be independent random variables whose joint density under the alternative is described by
where f 0 is a known density having finite Fisher information \(I\left (f_{0}\right )\) and \(\mathbf{d} = \left (d_{1},d_{2},\ldots,d_{t}\right )\) is an arbitrary vector. In the notation of our tests, k 2 = t, and write k 1 = k, the actual number of X i ’s observed. Recalling that o i is the label of the ith object ranked, the Spearman test which deletes all missing observations is based on the Spearman correlation of the reduced sample of k pairs, and the test statistic may be written as
Since \(k = k_{1} = k^{{\ast}}\) and consequently \(o_{i} = o_{i}^{{\ast}},\) the statistic \(\mathcal{A}_{S}^{{\ast}}\) may be written as
Hence,
The weight \(\left (o_{i} - i\right )\) represents the number of time points to the left of o i for which there are no observations. Similarly,
where
Set \(d_{i}^{{\ast}} = d_{o_{i}}\) and \(\overline{d} =\sum _{ i=1}^{t}d_{i}/t.\) Under the alternative q d , provided
both A RS and \(A_{S}^{{\ast}}\) are asymptotically normal with means and variances given respectively by \(\left (\mu _{R},\sigma _{RS}^{2}\right )\) and \(\left (\mu _{S},\sigma _{S}^{2}\right )\), where
Here \(\phi \left (u,f\right ) = \left [f^{{{\prime}} }\left (F^{-1}\left (u\right )\right )\right ]/\left [f\left (F^{-1}\left (u\right )\right )\right ],0 <u <1,\) and F is the cumulative distribution of f.
Shifting now to the efficiencies, it is seen that the asymptotic efficiencies as \(k \rightarrow \infty\), for A RS and \(\mathcal{A}_{S}^{{\ast}}\) are respectively given by
where Q 1 is a positive function of f 0 and the limit is taken as \(t \rightarrow \infty,k \rightarrow \infty,\) with k∕t → λ > 0. The asymptotic relative efficiency of \(\mathcal{A}_{S}^{{\ast}}\) relative to A RS is then given by the ratio \(e_{S}/e_{RS}\) (Appendix B.4).
Now consider the case where \(d_{i}^{{\ast}} = o_{i},\bar{d} =\bar{ o},i = 1,\ldots,k\) and the remaining d i are arbitrary, a situation which includes alternatives of the form \(EX_{i} =\beta _{0} +\beta i,\beta> 0.\) It can be shown that irrespective of the density f 0, the asymptotic relative efficiency of \(\mathcal{A}_{S}^{{\ast}}\) relative to A RS is given by
where \(\mathbf{o}_{\mathbf{k}}\mathbf{=}\left (o_{1},\ldots,o_{k}\right )\) and
Note that \(R\left (k,\mathbf{o}_{\mathbf{k}}\right )> 1\) unless the o i ’s are equally spaced.
In order to illustrate the magnitude of this efficiency, suppose for example that \(t = 19,k = 7,o_{1} = 1,o_{2} = 2,o_{3} = 3,o_{4} = 10,o_{5} = 17,o_{6} = 18,o_{7} = 19,\) then the ratio of the efficacies of A S ∗ to A RS is 1. 086. On the other hand, if \(o_{1} = 1,o_{2} = 8,o_{3} = 9,o_{4} = 10,o_{5} = 11,o_{6} = 12,o_{7} = 19\), then that ratio is 1. 176.
3.5 Tied Rankings and the Notion of Compatibility
The notion of compatibility may also be extended to deal with tied rankings . As an example, suppose that objects 1 and 2 are equally preferred whereas object 3 is least preferred. Such a ranking would be compatible with the rankings \(\left (1,2,3\right )\) and \(\left (2,1,3\right )\) in that both are plausible. The average of the rankings in the compatibility class, which as we shall see results from the use of the Spearman distance, will then be the ranking
to be presented in this case. It is seen that the notion of compatibility serves to justify the use of the midrank when ties exist. Formally we can define tied orderings as follows.
Definition 3.4.
A tied ordering of t objects is a partition into e sets, 1 ≤ e ≤ t, each containing d i objects, \(d_{1} + d_{2} +\ldots +d_{e} = t\), so that the d i objects in each set share the rank i,1 ≤ i ≤ e. Such a tie pattern is denoted by \(\delta = \left (d_{1},d_{2},\ldots,d_{e}\right )\). The ranking denoted by \(\mu _{\delta } = \left (\mu _{\delta }\left (1\right ),\ldots,\mu _{\delta }\left (t\right )\right )\) resulting from such an ordering is a tied ranking and is one of \(\frac{t!} {d_{1}!d_{2}!\ldots d_{e}!}\) possible permutations.
Associated with every tied ranking we may define a t! × (\(\frac{t!} {d_{1}!d_{2}!\ldots d_{e}!}\)) matrix of compatibility D δ . Yu et al. (2002) considered the problem of testing for independence between two random variables when the tie patterns and the pattern of missing observations are fixed. Specifically, let μ ∗ be an incomplete ranking of k 1 out of t objects with tie pattern \(\delta _{1} = \left (d_{11},\ldots,d_{1e_{1}}\right )\). Similarly, let ν ∗ be an incomplete ranking of k 2 out of t objects with tie pattern \(\delta _{2} = \left (d_{21},\ldots,d_{2e_{2}}\right )\). The Spearman similarity measure between two incomplete rankings \(\mu ^{{\ast}},\nu ^{{\ast}}\) is defined to be
where \(\delta \left (j\right ) = 1\) if both rankings of object j are not missing and 0 otherwise.
Theorem 3.5.
Let k ∗ be the number of objects ranked in ranking 1 among the k 2 objects ranked in ranking 2. Let \(2 \leq k_{1} \leq k_{2} \leq t\) . Assume that
-
(i)
\(k^{{\ast}}\rightarrow \infty\) , (and hence \(k_{1} \rightarrow \infty,k_{2} \rightarrow \infty,t \rightarrow \infty\) ) with \(k^{{\ast}}/t \rightarrow \lambda> 0.\)
-
(ii)
\(\max _{j=1,\cdots \,,e_{1}} \frac{g_{1j}} {k^{{\ast}}}\) is bounded away from 1.
-
(iii)
\(\max _{j=1,\cdots \,,e_{2}} \frac{g_{2j}} {k^{{\ast}}}\) is bounded away from 1.
Then, under the null hypothesis of independence whereby the pattern of ties and missing data is fixed, A S ∗ is asymptotically normal with mean 0 and exact variance
Proof.
See Yu et al. (2002). □
Example 3.7.
In a public opinion survey held in 1999 in Hong Kong, it was of interest to determine whether the education level of the respondents is related to the level of dissatisfaction of the Policy Address of the Chief Executive of the Hong Kong Special Administrative Region. The response is an ordinal variable having seven options as follows: (1), very satisfied; (2), satisfied; (3), neutral; (4), unsatisfied; (5), very unsatisfied; (6), not sure; and (7), refuse to answer. Options (6) and (7) were combined and listed as “missing.” Table 3.4 displays the frequencies of the respondents listed by option and by education level.
It is noted that about 19.9 % of the respondents did not respond either to one or to both questions. Moreover, since the education levels are grouped into a few categories, the problem of ties cannot be ignored. One alternative approach for analyzing this data is as a contingency table. In that case, however, the ordering among the education levels and separately among the responses would not be taken into account. The results of the analysis shown in Table 3.5 reveal that at the 5 % significance level, the test based on the reduced sample (which discards all observations with at least one missing variable) cannot reject the hypothesis of independence whereas the one based on the complete sample can. Since the test statistic is positive, this implies that there is a positive association between education level and level of dissatisfaction. More highly educated respondents tend to be less satisfied with the Policy Address. The analysis by means of a contingency table whereby the missing categories for education and response were dropped leads to a chi-square statistic with a value of 35.2161 on 16 degrees of freedom and a p-value of 0.0037.
3.6 Angular Correlations
There has been a great deal of interest in directional statistics in the literature. Consider the following example on wind directions whereby we are interested in testing for independence between the 6 a.m. and the noon readings. The data shown in Table 3.6 can be viewed as points on the unit circle and cannot be dealt with by simply computing the usual rank correlation. The reason is that the larger ranks are close to the smaller ranks. Hence, for example, for the noon readings, angle 23 is closer to angle 313 than to angle 248. Yet, the ranks imply an opposite interpretation. In the table, tied ranks were replaced by their midranks.
Example 3.8 (Johnson and Wehrly 1977).
Wind directions were recorded at 6 a.m. and at 12 noon on each day at a weather station for 21 consecutive days. It is desired to test for independence. Tied rankings were replaced by their midranks (Table 3.6).
Excellent review articles along with additional references are given by Mardia (1975, 1976) and Jupp and Mardia (1989). Typically, data is provided in the form of directions either in two- or three-dimensional space or as rotations in such a space. The data may take on a variety of forms. It may consist of a unit vector of directions, pairs of such vectors, or a vector of directions along with a corresponding random variable on the line. Examples of applications are to be found in the fields of astronomy, biology, geology, medicine, and meteorology (Downs 1973; Johnson and Wehrly 1977; Breckling 1989). A large number of the works presented deal with the study of inference from parametric models. In this section, we define a corresponding notion of angular correlation using the ranks of the data.
Let X and Y be random vectors with covariance matrix \(\Sigma\) partitioned as
and suppose \(\Sigma _{11}\) and \(\Sigma _{22}\) are non-singular of ranks p and q, respectively.
Definition 3.5 (Jupp and Mardia 1989).
The correlation coefficient γ XY between X and Y is defined to be the trace γ of the matrix
It follows that, \(\gamma _{XY } =\sum _{ i=1}^{s}\lambda _{i}^{2}\) where the λ i are the canonical correlations and s = min(p, q). This coefficient satisfies the property of invariance under rotation and reflection in addition to the usual properties of a correlation.
Suppose now that θ and \(\varphi\) are circular variables with \(0 \leq \theta,\varphi \leq 2\pi\). Define the directional vectors \(t_{1}^{{{\prime}} }(\theta ) = (\cos \theta,\sin \theta )\), \(t_{2}^{{{\prime}} }(\varphi ) = (\cos \varphi,\sin \varphi ),\) and let \(\Sigma\) be the covariance matrix of t 1 and t 2. It is seen that
where \(\rho _{cc} = corr(\cos \theta,\cos \varphi )\), \(\rho _{cs} = corr(\cos \theta,\sin \varphi )\), etc., and \(\rho _{1} = corr(\cos \theta,\sin \theta )\), \(\rho _{2} = corr(\cos \varphi,\sin \varphi )\).
Let (\(\theta _{i},\varphi _{i}\)) for i = 1, …, n be a random sample of n pairs of angles which define points on the unit circle. Without loss in generality assume that the ranks of the θ’s are the natural integers 1,…,n whereas the corresponding ranks of the \(\varphi\)’s are denoted by \(R_{1},\ldots,R_{n}.\) Let
We may formally construct on the basis of the sample the matrix of pairwise correlations
where ρ(η, ν) is a measure of correlation between η and ν. We shall consider correlations based on the Spearman and Kendall distance functions in subsequent sections and we will determine the corresponding asymptotic distributions of the correlation coefficients as n → ∞.
3.6.1 Spearman Distance
We shall consider the Kendall notion of a type b correlation (Kendall and Gibbons 1990) given by
It is straightforward to show
and
It follows that \(\Sigma _{11} = \Sigma _{22} = \frac{n} {2} I\). The sample estimate of \(\Sigma _{12}\) is given by
where \(T_{cc} =\eta ^{(1)^{{\prime}} }\nu ^{(1)},T_{cs} =\eta ^{(1)^{{\prime}} }\nu ^{(2)},T_{sc} =\eta ^{(2)^{{\prime}} }\nu ^{(1)},T_{ss} =\eta ^{(2)^{{\prime}} }\nu ^{(2)}.\)
We recognize that the T ′ s are measures of correlation in the Spearman sense. Consequently, the sample correlation using Spearman distance becomes
3.6.2 Kendall Distance
Recalling the Kendall measure of distance defined by
where sgn indicates the sign function, we may define a corresponding type b correlation as
where \(A(\eta ) = \#\left (\text{pairs }(i,j),i\neq j\vert \eta _{i}\neq \eta _{j}\right ).\) It is easy to see that \(\Sigma _{11}\) and \(\Sigma _{22}\) are diagonal matrices. In fact, the off-diagonal terms are equal to
The normalization in the Kendall case is somewhat delicate and depends in part on the parity of n. For example, for n = 10, there are five pairs of equal values in the set \(\left \{\sin \frac{2\pi i} {n} \right \}\) whereas for n = 11, all the values are distinct. In general, the number of equal pairs is at most O(n). The sample estimate of \(\Upsilon _{12}\) is given by
where \(K_{cc} =\rho _{K}(\eta ^{(1)},\nu ^{(1)})\), \(K_{cs} =\rho _{K}(\eta ^{(1)},\nu ^{(2)})\), \(K_{sc} =\rho _{K}(\eta ^{(2)},\nu ^{(1)})\), \(K_{ss} =\rho _{K}(\eta ^{(2)},\nu ^{(2)})\).
It follows that the sample correlation coefficient in the Kendall case is given by
In the following sections, we shall derive the asymptotic null distributions of the test statistics induced by the Spearman and Kendall distances.
3.6.3 Asymptotic Distributions
We are interested in testing the null hypothesis that the circular variables \(\theta,\varphi\) are independent. In terms of the ranks, assuming no ties, this translates into the hypothesis H0 that all permutations of the integers 1, …, n are equally likely.
Theorem 3.6.
The asymptotic null distribution of nγ S as n →∞ is \(\chi _{4}^{2}\) .
Proof.
The joint distribution of \(T_{cc},T_{ss},T_{cs},T_{sc}\) is asymptotically normal. In fact, for arbitrary \(\left \{a_{i}\right \}\), consider the linear combination
Let
Since
and the variance
we have that
The result follows on using Hoeffding’s combinatorial central limit theorem (see Appendix B.1). Hence \(\Upsilon _{12}^{S}\) is multivariate normal and the theorem follows. □
A similar result holds for the Kendall tau statistic.
Theorem 3.7.
The asymptotic null distribution of \(\frac{9} {4}n\gamma _{K}\) as n \(\rightarrow \infty\) is \(\chi _{4}^{2}\) .
Proof.
See Alvo (1998) for the proof. A different proof can make use of the asymptotic equivalence between the Kendall and Spearman coefficients in general. □
Example 3.9.
We revisit the wind direction data. We calculate
and hence \(n\gamma _{S} = 21(0.50047) = 10.51\) with a p-value of 0. 0327. Consequently, we conclude that there is evidence that the 6 a.m. and noon wind directions are significantly correlated.
It is interesting to compare this result with the usual product moment correlation between the two angular measurements. The latter yields a value equal to − 0. 04, thereby implying that the variables are independent. On the other hand, restricting attention only to the pairs of measurements for which the 6 a.m. readings are below 180∘ the value of the product moment correlation is 0. 512 while for pairs for which the 6 a.m. readings are above 180∘ it is − 0. 475. These results taken separately imply a fair degree of dependence. The test statistic γ S takes into account the fact that very small and very large angles (mod 2π) are close to one another.
For the Kendall statistic, we may also calculate
and hence \(\frac{9n} {4}\) \(\gamma _{K} = \frac{9(21)} {4} (0.3056) = 14.44\) with a p-value of 0. 006. It is clear that with either the Spearman or the Kendall statistic, the hypothesis of independence is in doubt.
3.7 Angle-Linear Correlation
Suppose that we are now interested in defining the correlation between an angle θ and a real valued random variable X. It can be shown that the correlation coefficient in that case is given by
where
In the nonparametric context, let \((X_{i},\theta _{i})\) for \(i = 1,\ldots,n\) be a random sample of linear-angular measurements. Let {R i } be the ranks of the {X i } and let {S i } be the ranks of the {θ i }. We may assume without loss in generality that the S i are in natural order 1, 2, …, n. Based on the Spearman measure of distance, the sample angular-linear correlation is defined by
where \(T_{xc} =\sum R_{i}\cos \left (\frac{2\pi i} {n} \right ),\) \(T_{xs} =\sum R_{i}\sin \left (\frac{2\pi i} {n} \right ).\) Similarly, for the Kendall measure, the angular-linear correlation is then given by
where
We may now prove a theorem giving the asymptotic distributions of γ LS and γ LK under the null hypothesis that all vectors of ranks \((R_{1},\ldots,R_{n})\) are equally likely.
Theorem 3.8.
The asymptotic null distribution of nγ LS as n →∞ is χ 2 2 .
Proof.
The joint distribution of T xc , T xs is asymptotically normal. In fact, for arbitrary constants \(a_{1},a_{2}\), consider the linear combination
This is a linear rank statistic for which the conditions in Hoeffding (1951) are satisfied. In fact, let
The variance is then equal to
and we have that
as \(n \rightarrow \infty\). The result follows. □
Theorem 3.9.
The asymptotic null distribution of \(\frac{9n} {4} \gamma _{LK}\) as n →∞ is χ 2 2 .
Proof.
For arbitrary constants a 1, a 2, consider the linear combination
where
Using a result of Daniels (1950), the asymptotic normality of K xc and K xs follows. □
Example 3.10 (Johnson and Wehrly 1977).
We consider data on wind direction and ozone concentration collected at a weather station for 19 days at 4-day intervals. The readings are given in Table 3.7.
The Spearman test statistic to be \(n\gamma _{LS} = 19\left (0.3751\right ) = 7.13\) which has a p-value equal to 0. 0283. On the other hand the Kendall statistic is given by \(\frac{9n} {4} \gamma _{LK} = \frac{9\left (19\right )} {4} \left (0.1595\right ) = 6.82\) for a p-value of 0. 033. Both statistics imply that there is a fair degree of dependence between wind direction and ozone concentration.
Bibliography
Alvo, M. (1998). On non-parametric measures of correlation for directional data. Environmetrics, 9, 645–656.
Alvo, M., & Cabilio, P. (1992). Correlation methods for incomplete rankings. (Technical Report 200) Laboratory for Research in Statistics and Probability: Carleton University and University of Ottawa.
Alvo, M., & Cabilio, P. (1993). Tables of critical values of rank tests for trend when the data is incomplete. (Technical Report 230) Laboratory for Research in Statistics and Probability: Carleton University and University of Ottawa.
Alvo, M., & Cabilio, P. (1994). Rank test of trend when data are incomplete. Environmetrics, 5, 21–27.
Alvo, M., & Cabilio, P. (1995a). Rank correlation methods for missing data. Canadian Journal of Statistics, 23, 345–358.
Alvo, M., & Park, J. (2002). Multivariate non-parametric tests of trend when the data are incomplete. Statistics and Probability Letters, 57, 281–290.
Alvo, M., & Smrz, P. (2005). An arc model for ranking data. Journal of Statistical Research, 39, 43–54.
Breckling, J. (1989). The analysis of directional time series. Lecture Notes in Statistics (Vol. 61). Berlin: Springer.
Cabilio, P., & Tilley, J. (1999). Power calculations for tests of trend with missing observations. Environmetrics, 10, 803–816.
Daniels, H. (1950). Rank correlation and population models. Journal of the Royal Statistical Society Series B, 12, 171–181.
Diaconis, P., & Graham, R. (1977). Spearman’s footrule as a measure of disarray. Journal of the Royal Statistical Society Series B, 39, 262–268.
Downs, T. (1973). Rotational angular correlations. New York: Wiley.
Hájek, J., & Sidak, Z. (1967). Theory of rank tests. New York: Academic.
Hoeffding, W. (1951). A combinatorial central limit theorem. Annals of Mathematical Statistics, 22, 558–566.
Johnson, R., & Wehrly, T. (1977). Measures and models for angular correlation and angular-linear correlation. Journal of the Royal Statistical Society Series B, 39, 222–229.
Jupp, P., & Mardia, K. (1989). A unified view of the theory of directional statistics, 1975–1988. International Statistical Review, 57, 261–294.
Kendall, M., & Gibbons, J. (1990). Rank correlation methods. London: Edward Arnold.
Kidwell, P., Lebanon, G., & Cleveland, W. S. (2008). Visualizing incomplete and partially ranked data. IEEE Transactions on Visualization and Computer Graphics, 14(6), 1356–1363.
Lehmann, E. (1975). Nonparametrics: Statistical methods based on ranks. New York: McGraw-Hill.
Mardia, K. (1975). Statistics of directional data. Journal of the Royal Statistical Society Series B, 37, 349–393.
Mardia, K. (1976). Linear-circular correlation coefficients and rhythnometry. Biometrika, 63, 403–405.
Quade, D. (1972). Average internal rank correlation. (Technical report) Amsterdam: Mathematical Centre.
Yu, P. L. H., Lam, K. F., & Alvo, M. (2002). Nonparametric rank test for independence in opinion surveys. Austrian Journal of Statistics, 31, 279–290.
Author information
Authors and Affiliations
Chapter Notes
Chapter Notes
In this chapter, the traditional rank correlation has been extended to include incomplete rankings. This was made possible using the notion of compatibility which was developed by Alvo and Cabilio in a series of papers. Cabilio and Tilley (1999) report the results of a simulation study where they considered linear, quadratic, and square root trends. They observed that when there were no missing observations, the Spearman statistic was more powerful than Kendall’s. In the incomplete case, however, the new Kendall statistic has superior power for more patterns.
The calculation of the exact variance of \(\mathcal{A}_{K}^{{\ast}}\) under H 2, in Theorem 3.4, is more involved, and the reader is referred to Alvo and Cabilio (1992), where it is shown that
An important application of the results presented above which do not discard missing data is in tests of trend where k 2 = t and k 1 < t. It is seen that in this context, the superiority of the extended Spearman statistic is established through the calculation of its asymptotic relative efficiency relative to the “naive” statistic. (Alvo and Cabilio 1994) applied these methods to test for trend in precipitation data for St John and Fredericton (NB) and showed that the extended statistic based on Spearman distance is more sensitive in detecting trends than the statistic which ignores the missing observations. Tables of selected critical values of \(\mathcal{A}_{S}^{{\ast}}\) and \(\mathcal{A}_{K}^{{\ast}}\) for the trend case when k ≥ t∕2 have been developed for both hypotheses (Alvo and Cabilio 1993). The results of this section have been extended to the case of ties (Yu et al. 2002) and applied to deal with tests of independence in opinion surveys. A further extension to assess trend in proportions appears in Chap. 7.
Alvo and Smrz (2005) proposed an arc model which serves as a good approximation to Kendall distance.
Although not considered in this book, Alvo and Park (2002) were concerned with multivariate tests of trend when the data are partially incomplete. Such is the case in environmental studies when pH data for one or more lakes are often recorded over regular time intervals and examined for monotone increasing or decreasing trends in order to test for trend in acidification. In monitoring recovering patients, one looks for trends in their vital signs which are often multivariate data in nature. There may be as many as 20–30 blood constituents measured weekly over a period of several months or years. In those case, the use of separate tests on each constituent is inefficient.
Rights and permissions
Copyright information
© 2014 Springer Science+Business Media New York
About this chapter
Cite this chapter
Alvo, M., Yu, P.L.H. (2014). Correlation Analysis of Paired Ranking Data. In: Statistical Methods for Ranking Data. Frontiers in Probability and the Statistical Sciences. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1471-5_3
Download citation
DOI: https://doi.org/10.1007/978-1-4939-1471-5_3
Published:
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-1470-8
Online ISBN: 978-1-4939-1471-5
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)