
7.1 Introduction

In situations where an accepted standard diagnostic procedure exists, it is possible to plan a clinical trial to confirm that a new diagnostic procedure is superior to the standard procedure. However, if the efficacy of the new diagnostic procedure is expected to be no lower than that of the standard procedure, and the new procedure is less invasive or non-invasive, less toxic or non-toxic, inexpensive, or easier to operate than the standard procedure, we can plan a non-inferiority study. A non-inferiority study of two diagnostic procedures is designed to show that the sensitivity or specificity of the new diagnostic procedure is no more than 100Δ percent inferior to the sensitivity or specificity of the standard procedure, respectively, where Δ (0 < Δ ≤ 1) is a pre-specified acceptable difference between the two proportions. In general, sensitivity is defined as the probability that the result of a diagnostic procedure is positive when the subject has the disease, and specificity is defined as the probability that the result is negative when the subject does not have the disease. Both measures are essential for evaluating the performance of a diagnostic procedure, but they are calculated from different populations of subjects. We therefore consider statistical inference for the difference in sensitivities in this chapter; the same methods can be applied to the difference in specificities using a different study population.

If two diagnostic procedures are performed on each subject, the difference in proportions for matched-pair data involves a correlation between the two diagnostic procedures. Nam [10] and Tango [17] derived the same non-inferiority test for the difference in proportions for matched-pair categorical data based on the efficient score, assuming that the matched pairs are independent. Tango [17] also derived the confidence interval based on the efficient score. However, these methods are applicable only when the results of the two diagnostic procedures are evaluated by a single rater. Multiple independent raters often evaluate the diagnoses obtained from these diagnostic procedures (see, e.g., [6]). If multiple raters are involved in the evaluation, the differences in proportions for matched-pair data also have correlations between different raters. Although we can apply the aforementioned methods by considering consensus evaluations or majority votes, thereby handling the multiple results from the multiple raters as if there were a single rater, these approaches are not recommended for the primary evaluation [1, 2, 12]. Consensus evaluations may produce a bias caused by non-independent evaluations; for example, senior or persuasive raters may affect the evaluations of junior or passive raters. Moreover, majority votes cannot take into account the variability in the results of the multiple raters. Therefore, all results from the multiple independent raters should be used in the analysis.

In this chapter, we introduce a non-inferiority test, confidence interval and sample size formula proposed by Saeki and Tango [14], for inference of the difference in correlated proportions between two diagnostic procedures on the basis of the results from the multiple independent raters where the matched pairs are independent. Furthermore, we consider a possible procedure based on majority votes and we conduct Monte Carlo simulation studies to examine the validity of the proposed methods in comparison with the procedure based on majority votes. Finally, we illustrate the methods with data from studies of diagnostic procedures for the diagnosis of oesophageal carcinoma infiltrating the tracheobronchial tree [13] and for the diagnosis of aneurysm in patients with acute subarachnoid hemorrhage [4].

7.2 Design

7.2.1 Data Structure and Model

Consider a clinical experimental design in which a new diagnostic procedure (or treatment) and a standard diagnostic procedure (or treatment) are independently performed on the same subject (or matched pairs of subjects) and are independently evaluated by K raters. Each rater's judgment is assumed to take one of two values: 1 indicates that the subject is diagnosed as 'positive', and 0 indicates that the subject is diagnosed as 'negative'. Suppose we have n subjects. If we consider only subjects with a pre-specified disease, we use the positive probability as the measure, that is, sensitivity. On the other hand, if we consider subjects without the disease, we use the negative probability as the measure, that is, specificity. In the following, we consider the situation on the basis of sensitivity.

For ease of explanation, let us first consider the case of K = 2. The resulting types of matched observations and probabilities are naturally classified into the 4 × 4 contingency table shown in Table 7.1, where + (1) or − (0) denotes a positive or negative judgment on a procedure, respectively. For example, \(y_{1101}\) denotes the observed number of matched type {+ on the new procedure by rater 1, + on the new procedure by rater 2, − on the standard procedure by rater 1, + on the standard procedure by rater 2}, and \(r_{1101}\) denotes its probability.

Table 7.1 A 4 × 4 contingency table for matched-pair categorical data in the case of two raters

Let \(\pi _{N}^{(k)}\) (\(\pi _{S}^{(k)}\)) denote the probability that rater k judges as positive on the new (standard) diagnostic procedure for a randomly selected subject. Then, it is naturally calculated as

$$\displaystyle{ \pi _{N}^{(1)} = r_{ 11\cdot \cdot } + r_{10\cdot \cdot }\,,\quad \pi _{N}^{(2)} = r_{ 11\cdot \cdot } + r_{01\cdot \cdot } }$$
(7.1)

and \(\pi _{S}^{(1)}\) and \(\pi _{S}^{(2)}\) are defined in a similar manner. Let \(\pi _{N}\) and \(\pi _{S}\) denote the probability of a positive judgment on the new and standard diagnostic procedures, respectively. Then, these probabilities can, in general, be defined as follows:

$$\displaystyle{ \pi _{N} =\omega ^{(1)}\pi _{ N}^{(1)} +\omega ^{(2)}\pi _{ N}^{(2)}\;, }$$
(7.2)
$$\displaystyle{ \pi _{S} =\omega ^{(1)}\pi _{ S}^{(1)} +\omega ^{(2)}\pi _{ S}^{(2)}\;, }$$
(7.3)

where \(\omega ^{(k)}\) (\(\omega ^{(1)} +\omega ^{(2)} = 1\)) denotes the weight for rater k, reflecting differences in the raters' evaluation skill. However, raters are usually selected from among raters with at least equivalent skill, and it is assumed in this chapter that

$$\displaystyle{ \omega ^{(k)} = 1/K\quad (k = 1,\ldots,K)\;. }$$
(7.4)

Therefore, these probabilities can be defined as follows:

$$\displaystyle{ \pi _{N} = \frac{\pi _{N}^{(1)} +\pi _{ N}^{(2)}} {2} = r_{11\cdot \cdot } + \frac{r_{10\cdot \cdot } + r_{01\cdot \cdot }} {2} \;, }$$
(7.5)
$$\displaystyle{ \pi _{S} = \frac{\pi _{S}^{(1)} +\pi _{ S}^{(2)}} {2} = r_{\cdot \cdot 11} + \frac{r_{\cdot \cdot 10} + r_{\cdot \cdot 01}} {2} \;. }$$
(7.6)

On the basis of the form of expressions (7.5) and (7.6), the 4 × 4 contingency table is found to reduce to the 3 × 3 contingency table shown in Table 7.2, where \(p_{\ell m}\) (\(x_{\ell m}\)) denotes the probability (observed number of observations) that \(\ell\) raters judge as positive on the new procedure and m raters judge as positive on the standard procedure. Then, we have

$$ \displaystyle\begin{array}{rcl} \pi _{N}& =& p_{2\cdot } + \frac{1} {2}p_{1\cdot } \\ & =& p_{20} + (p_{21} + \frac{1} {2}p_{10}) + (p_{22} + \frac{1} {2}p_{11}) + \frac{1} {2}p_{12}\;,{}\end{array}$$
(7.7)
$$\displaystyle\begin{array}{rcl} \pi _{S}& =& p_{\cdot 2} + \frac{1} {2}p_{\cdot 1} \\ & =& p_{02} + (p_{12} + \frac{1} {2}p_{01}) + (p_{22} + \frac{1} {2}p_{11}) + \frac{1} {2}p_{21}\;.{}\end{array}$$
(7.8)

Let λ denote the difference in positive probabilities; that is,

$$\displaystyle\begin{array}{rcl} \lambda & =& \pi _{N} -\pi _{S} \\ & =& p_{20} + \frac{1} {2}(p_{21} + p_{10}) - p_{02} -\frac{1} {2}(p_{12} + p_{01})\;,{}\end{array}$$
(7.9)

and its sample estimate will be

$$\displaystyle{ \tilde{\lambda }= \frac{1} {n}\left \{x_{20} + \frac{1} {2}(x_{21} + x_{10}) - x_{02} -\frac{1} {2}(x_{12} + x_{01})\right \}\;, }$$
(7.10)

which clearly shows that inference on λ can be made from the observed vector \(\boldsymbol{x} = (x_{20},\,x_{21} + x_{10},\,x_{02},\,x_{12} + x_{01},\,x_{22} + x_{11} + x_{00})\) following a multinomial distribution with parameters n and \(\boldsymbol{p} = (p_{20},\,p_{21} + p_{10},\,p_{02},\,p_{12} + p_{01},\,p_{22} + p_{11} + p_{00})\).
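The computation of \(\tilde{\lambda }\) from the 3 × 3 table is elementary; as a minimal sketch (Python, with hypothetical counts \(x_{\ell m}\)):

```python
import numpy as np

# Hypothetical 3 x 3 table for K = 2: x[l, m] = number of matched pairs with
# l raters positive on the new procedure and m raters positive on the standard.
x = np.array([[10, 3, 1],
              [4, 20, 2],
              [2, 5, 13]])
n = x.sum()

# Sample estimate (7.10):
# lambda~ = {x20 + (x21 + x10)/2 - x02 - (x12 + x01)/2} / n
lam = (x[2, 0] + 0.5 * (x[2, 1] + x[1, 0])
       - x[0, 2] - 0.5 * (x[1, 2] + x[0, 1])) / n
print(lam)  # 0.05 for these counts
```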

Table 7.2 A 3 × 3 contingency table for matched-pair categorical data in the case of two raters

It should be noted that \(x_{20}\) is the frequency with which the number of raters judging as positive on the new procedure exceeds the number judging as positive on the standard procedure by 2, and \((x_{21} + x_{10})\) is the frequency with which it exceeds by 1. Similarly, \(x_{02}\) is the frequency with which the number of raters judging as positive on the standard procedure exceeds the number judging as positive on the new procedure by 2, and \((x_{12} + x_{01})\) is the frequency with which it exceeds by 1. These observations lead to a generalization to K raters. The resulting types of matched observations and probabilities are classified into a \((K + 1) \times (K + 1)\) contingency table similar to Table 7.2, and the method reduces to the following. Let \(n_{\mathit{Nk}}\) denote the frequency with which the number of raters who judge as positive on the new procedure exceeds the number who judge as positive on the standard procedure by k, and let \(q_{\mathit{Nk}}\) denote the corresponding probability. Namely, we have

$$\displaystyle\begin{array}{rcl} & n_{\mathit{Nk}} =\sum _{\ell-m=k}x_{\ell m}\;,& {}\\ & q_{\mathit{Nk}} =\sum _{\ell-m=k}p_{\ell m}\;,& {}\\ \end{array}$$

where \(\ell\) is the number of raters who judge as positive on the new procedure and m is the number of raters who judge as positive on the standard procedure. Similarly, let \(n_{\mathit{Sk}}\) denote the frequency with which the number of raters who judge as positive on the standard procedure exceeds the number who judge as positive on the new procedure by k, and let \(q_{\mathit{Sk}}\) denote the corresponding probability. Then, we have

$$\displaystyle\begin{array}{rcl} & n_{\mathit{Sk}} =\sum _{\ell-m=-k}x_{\ell m}\;,& {}\\ & q_{\mathit{Sk}} =\sum _{\ell-m=-k}p_{\ell m}\;,& {}\\ \end{array}$$

and \(q_{N0} = q_{S0}\) and \(n_{N0} = n_{S0}\). Namely, for K raters, inference on λ can be made from the vector of random variables \(\boldsymbol{n} = (n_{N0},\,n_{N1},\ldots,\,n_{\mathit{NK}},\,n_{S1},\ldots,\,n_{\mathit{SK}})\) following a multinomial distribution with parameters n and \(\boldsymbol{q} = (q_{N0},\,q_{N1},\ldots,\,q_{\mathit{NK}},\,q_{S1},\ldots,\,q_{\mathit{SK}})\). Then, we have

$$\displaystyle\begin{array}{rcl} \pi _{N}& =& \sum _{k=1}^{K}\omega ^{(k)}\pi _{ N}^{(k)} = \frac{1} {K}\sum _{k=1}^{K}k\sum _{ m=0}^{K}p_{\mathit{ km}} = \frac{1} {K}\sum _{k=1}^{K}kp_{ k\cdot } \\ & =& \frac{1} {K}\sum _{k=1}^{K}kq_{\mathit{ Nk}} + \frac{1} {K}\sum _{k=1}^{K}kp_{\mathit{ kk}} + \frac{1} {K}\sum _{{ \ell,m\in K \atop \ell<m} }\ell p_{\ell m} + \frac{1} {K}\sum _{{ \ell,m\in K \atop m<\ell} }mp_{\ell m}\;, \\ \pi _{S}& =& \sum _{k=1}^{K}\omega ^{(k)}\pi _{ S}^{(k)} = \frac{1} {K}\sum _{k=1}^{K}k\sum _{\ell =0}^{K}p_{\ell k} = \frac{1} {K}\sum _{k=1}^{K}kp_{ \cdot k} \\ & =& \frac{1} {K}\sum _{k=1}^{K}kq_{\mathit{ Sk}} + \frac{1} {K}\sum _{k=1}^{K}kp_{\mathit{ kk}} + \frac{1} {K}\sum _{{ \ell,m\in K \atop \ell<m} }\ell p_{\ell m} + \frac{1} {K}\sum _{{ \ell,m\in K \atop m<\ell} }mp_{\ell m}\;.{}\end{array}$$
(7.11)

Therefore, the difference in positive probabilities (7.9) is generalized to

$$\displaystyle\begin{array}{rcl} \lambda =\pi _{N} -\pi _{S}& =& \Big( \frac{1} {K}\sum _{k=1}^{K}kp_{ k\cdot }\Big) -\Big ( \frac{1} {K}\sum _{k=1}^{K}kp_{ \cdot k}\Big) \\ & =& \frac{1} {K}\sum _{k=1}^{K}k(q_{\mathit{ Nk}} - q_{\mathit{Sk}})\;. {}\end{array}$$
(7.12)

Then, the estimate \(\tilde{\lambda }\) given in (7.10) is generalized to

$$\displaystyle{ \tilde{\lambda }= \frac{1} {nK}\sum _{k=1}^{K}k(n_{\mathit{ Nk}} - n_{\mathit{Sk}})\;. }$$
(7.13)
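For general K, (7.13) can be computed directly from the \((K + 1) \times (K + 1)\) table by aggregating the off-diagonal counts; a minimal sketch (Python, with a hypothetical K = 3 table):

```python
import numpy as np

def lambda_tilde(x):
    """Estimate (7.13): x[l, m] is the number of subjects with l raters
    positive on the new procedure and m raters positive on the standard."""
    x = np.asarray(x, dtype=float)
    K = x.shape[0] - 1
    n = x.sum()
    l, m = np.indices(x.shape)  # rater counts (l, m) for each cell
    # sum_k k * (n_Nk - n_Sk), with n_Nk = sum over cells with l - m = k
    total = sum(k * (x[l - m == k].sum() - x[m - l == k].sum())
                for k in range(1, K + 1))
    return total / (n * K)

# Hypothetical table for K = 3 raters
x3 = [[5, 1, 0, 0],
      [2, 8, 1, 0],
      [1, 2, 9, 1],
      [0, 1, 2, 7]]
print(lambda_tilde(x3))
```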

7.2.2 Problems in Consensus Evaluations or Majority Votes

Although multiple results from the multiple raters can be handled as if there were a single rater by considering consensus evaluations or majority votes, these approaches are not recommended for the primary evaluation [1, 2, 12]. Consensus evaluations may produce a bias caused by non-independent evaluation, even if the consensus evaluations are performed after the individual evaluations by the multiple raters are completed. For example, senior or persuasive raters may affect the evaluations of junior or passive raters. Moreover, majority votes cannot take into account the variability in the results of the multiple raters. For ease of explanation, let us consider the case of K = 3. The resulting types of matched observations are classified into the 4 × 4 contingency table in Table 7.3. In this case, \(\tilde{\lambda }_{K=3}\) can be derived from (7.13) as

$$\displaystyle{\tilde{\lambda }_{K=3} = \frac{1} {n}\left \{(n_{N3} - n_{S3}) + \frac{2} {3}(n_{N2} - n_{S2}) + \frac{1} {3}(n_{N1} - n_{S1})\right \}\;,}$$

where \((n_{N3} - n_{S3}) = (x_{30} - x_{03})\), \((n_{N2} - n_{S2}) = \left \{(x_{31} + x_{20}) - (x_{13} + x_{02})\right \}\) and \((n_{N1} - n_{S1}) = \left \{(x_{32} + x_{21} + x_{10}) - (x_{23} + x_{12} + x_{01})\right \}\). If we adopt the majority votes, the 4 × 4 contingency table shown in Table 7.3 is transformed to the 2 × 2 contingency table shown in Table 7.4, and the estimate of the difference between π N and π S on the basis of the results from the majority votes will be

$$\displaystyle{\tilde{\lambda }_{\mathit{MV}} = \frac{(b - c)} {n} = \frac{1} {n}\left \{(n_{N3} - n_{S3}) + (n_{N2} - n_{S2}) + (x_{21} - x_{12})\right \}\;.}$$

Two problems with \(\tilde{\lambda }_{\mathit{MV}}\) should be noted.

  1. \(\tilde{\lambda }_{\mathit{MV}}\) involves \((n_{N2} - n_{S2})\) and \((x_{21} - x_{12})\) without the weights of the contribution to \(\pi _{N}\) and \(\pi _{S}\) from \(\pi _{N}^{(1)},\,\pi _{N}^{(2)},\,\pi _{N}^{(3)}\) and \(\pi _{S}^{(1)},\,\pi _{S}^{(2)},\,\pi _{S}^{(3)}\).

  2. \(x_{32}\), \(x_{10}\) and \(x_{23}\), \(x_{01}\) do not contribute to \(\tilde{\lambda }_{\mathit{MV}}\), because these values fall in the cells 'a' and 'd' of Table 7.4.

Table 7.3 A 4×4 contingency table for matched-pair categorical data in the case of three raters
Table 7.4 A 2×2 contingency table transformed from Table 7.3 by majority votes

Therefore, it is important that all results from the multiple independent raters are used in the analysis appropriately.
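The loss of information under majority voting can be seen numerically; a sketch (Python, K = 3, with hypothetical counts) comparing \(\tilde{\lambda }_{K=3}\) with \(\tilde{\lambda }_{\mathit{MV}}\):

```python
import numpy as np

# Hypothetical 4 x 4 table (K = 3): x[l, m] = number of subjects with
# l raters positive on the new procedure and m on the standard.
x = np.array([[5, 1, 0, 0],
              [2, 8, 1, 0],
              [1, 2, 9, 1],
              [0, 1, 2, 7]], dtype=float)
n = x.sum()

# Full-information estimate from (7.13)
lam_K3 = ((x[3, 0] - x[0, 3])
          + (2 / 3) * ((x[3, 1] + x[2, 0]) - (x[1, 3] + x[0, 2]))
          + (1 / 3) * ((x[3, 2] + x[2, 1] + x[1, 0])
                       - (x[2, 3] + x[1, 2] + x[0, 1]))) / n

# Majority-vote estimate: 'positive' when at least 2 of the 3 raters judge
# positive, so b and c below are the discordant cells of Table 7.4.
b = x[3, 0] + x[3, 1] + x[2, 0] + x[2, 1]
c = x[0, 3] + x[1, 3] + x[0, 2] + x[1, 2]
lam_MV = (b - c) / n
print(lam_K3, lam_MV)  # the two estimates differ
```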

7.3 Methods for Statistical Inference

In this section, we shall introduce methods for statistical inference of the difference λ, that is, a non-inferiority test, confidence interval and formula for determination of sample size.

7.3.1 Non-inferiority Test

The non-inferiority hypothesis will be formulated as

$$\displaystyle{H_{0}:\pi _{N} =\pi _{S}-\varDelta,\ H_{1}:\pi _{N} >\pi _{S}-\varDelta \;,}$$

where Δ (0 < Δ ≤ 1) is a pre-specified acceptable difference in two probabilities. Let

$$\displaystyle{ \delta =\lambda +\varDelta =\pi _{N} - (\pi _{S}-\varDelta ) = \frac{1} {K}\sum _{k=1}^{K}kq_{\mathit{ Nk}} -\Big ( \frac{1} {K}\sum _{k=1}^{K}kq_{\mathit{ Sk}}-\varDelta \Big)\;. }$$
(7.14)

Then, under the null hypothesis, the log-likelihood function without constant terms is expressed as

$$\displaystyle\begin{array}{rcl} L = L(\boldsymbol{\theta })& =& n_{N0}\log (q_{N0}) + n_{\mathit{NK}}\log (q_{\mathit{NK}}) +\sum _{ k=1}^{K-1}n_{\mathit{ Nk}}\log (q_{\mathit{Nk}}) +\sum _{ k=1}^{K}n_{\mathit{ Sk}}\log (q_{\mathit{Sk}}) \\ & =& n_{N0}\log (1 -\delta +\varDelta - A - B - C) + n_{\mathit{NK}}\log (\delta -\varDelta + A) \\ & & \ \ +\sum _{k=1}^{K-1}n_{\mathit{ Nk}}\log (q_{\mathit{Nk}}) +\sum _{ k=1}^{K}n_{\mathit{ Sk}}\log (q_{\mathit{Sk}})\;, {}\end{array}$$
(7.15)

where \(\boldsymbol{\theta }= (\delta,\,q_{N1},\ldots,\,q_{N(K-1)},\,q_{S1},\ldots,\,q_{\mathit{SK}})^{T}\) is the parameter vector of dimension 2K and

$$\displaystyle{A = \frac{1} {K}\Big(\sum _{k=1}^{K}kq_{\mathit{ Sk}} -\sum _{k=1}^{K-1}kq_{\mathit{ Nk}}\Big),\quad B =\sum _{ k=1}^{K-1}q_{\mathit{ Nk}}\;,\quad C =\sum _{ k=1}^{K}q_{\mathit{ Sk}}\;.}$$

Then, the score test for testing the null hypothesis H 0: δ = 0 against H 1: δ > 0 is expressed as

$$\displaystyle{ Z_{S} = \left [\frac{\partial L} {\partial \delta } \Big\vert _{\delta =0,\,q_{\mathit{Nk}}=\hat{q}_{\mathit{Nk}},\,q_{\mathit{Sk}}=\hat{q}_{\mathit{Sk}}}\right ]\sqrt{\left (\hat{I}^{-1 } \right ) _{11 } \big\vert _{\delta =0,\,q_{\mathit{Nk } } =\hat{q}_{\mathit{Nk } },\,q_{\mathit{Sk } } =\hat{q}_{\mathit{Sk } }}} \sim _{H_{0}}N(0,1)\;, }$$
(7.16)

where \((\hat{q}_{N1},\ldots,\,\hat{q}_{N(K-1)},\,\hat{q}_{S1},\ldots,\,\hat{q}_{\mathit{SK}})\) is the vector of the maximum likelihood estimators under the null hypothesis, which is the unique solution for the following equations:

$$\displaystyle{ \frac{\partial L} {\partial q_{\mathit{Nk}}}\bigg\vert _{\delta =0} = 0,\quad (k = 1,\ldots,K - 1)\;, }$$
(7.17)
$$\displaystyle{ \frac{\partial L} {\partial q_{\mathit{Sk}}}\bigg\vert _{\delta =0} = 0,\quad (k = 1,\ldots,K)\;. }$$
(7.18)

These equations can be solved iteratively using a quasi-Newton method with constraints; the R function 'constrOptim' is useful for this purpose. Further, \((\hat{I}^{-1})_{11}\) denotes the (1, 1)th element of the (2K × 2K) inverse Fisher information matrix evaluated at the maximum likelihood estimators. On the other hand, we can consider a test based on the sample estimate T of the difference δ

$$\displaystyle{ T =\tilde{\lambda } +\varDelta = \frac{1} {nK}\sum _{k=1}^{K}k(n_{\mathit{ Nk}} - n_{\mathit{Sk}}) +\varDelta \;. }$$
(7.19)
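The constrained maximization behind (7.17)–(7.18) can also be sketched with SciPy's SLSQP solver. This is only an illustration, not the authors' implementation: the free parameters are \((q_{N1},\ldots,q_{NK},\,q_{S1},\ldots,q_{SK})\), the null \(\delta = 0\) is imposed as the equality constraint \(\frac{1}{K}\sum _{k}k(q_{Nk} - q_{Sk}) = -\varDelta\), and the counts below are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def restricted_mle(n_N, n_S, n_N0, Delta):
    """Maximize the multinomial log-likelihood (7.15) under delta = 0,
    i.e. under (1/K) * sum_k k*(q_Nk - q_Sk) = -Delta.
    n_N = (n_N1, ..., n_NK), n_S = (n_S1, ..., n_SK)."""
    n_N, n_S = np.asarray(n_N, float), np.asarray(n_S, float)
    K = len(n_N)
    k = np.arange(1, K + 1)

    def negloglik(q):
        q = np.clip(q, 1e-12, None)
        q0 = max(1.0 - q.sum(), 1e-12)   # probability of the q_N0 (= q_S0) cell
        return -(n_N0 * np.log(q0)
                 + np.dot(n_N, np.log(q[:K]))
                 + np.dot(n_S, np.log(q[K:])))

    cons = [{'type': 'eq',
             'fun': lambda q: np.dot(k, q[:K] - q[K:]) / K + Delta},
            {'type': 'ineq', 'fun': lambda q: 1.0 - 1e-9 - q.sum()}]
    start = np.full(2 * K, 1.0 / (2 * K + 1))
    res = minimize(negloglik, start, method='SLSQP', constraints=cons,
                   bounds=[(1e-9, 1.0)] * (2 * K))
    return res.x[:K], res.x[K:]

# Hypothetical counts for K = 3 (n = 40, of which n_N0 = 29 are tied)
qN, qS = restricted_mle([6, 2, 0], [3, 0, 0], 29, Delta=0.1)
```

The fitted \((\hat{q}_{\mathit{Nk}},\,\hat{q}_{\mathit{Sk}})\) can then be plugged into (7.16) or (7.20).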

The variance of T evaluated at the null hypothesis δ = 0 is

$$\displaystyle{\text{Var}_{H_{0}}(T) = \frac{1} {n}\left [ \frac{1} {K^{2}}\sum _{k=1}^{K}k^{2}(q_{\mathit{ Nk}} + q_{\mathit{Sk}}) -\varDelta ^{2}\right ]\;.}$$

Therefore, the normal deviate for testing H 0: δ = 0 against H 1: δ > 0 is expressed as

$$\displaystyle{ Z_{\mathit{ND}} = \frac{ \frac{1} {nK}\sum _{k=1}^{K}k(n_{\mathit{ Nk}} - n_{\mathit{Sk}})+\varDelta } {\sqrt{ \frac{1} {n}\left [ \frac{1} {K^{2}} \sum _{k=1}^{K}k^{2}(\hat{q}_{\mathit{Nk}} +\hat{ q}_{\mathit{Sk}}) -\varDelta ^{2}\right ]}} \sim _{H_{0}}N(0,1)\;. }$$
(7.20)

It can be shown that when K = 1, the normal deviate test statistic \(Z_{\mathit{ND}}\) is equivalent to the score test statistic \(Z_{S}\) [10, 17]. When K = 2 or 3, we confirmed that \(Z_{S}\) and \(Z_{\mathit{ND}}\) were approximately equal using the example data (see Sect. 7.5). However, we have not been able to show the equivalence between \(Z_{S}\) and \(Z_{\mathit{ND}}\) analytically. On the other hand, by using the observed proportions \(\tilde{q}_{\mathit{Nk}} = n_{\mathit{Nk}}/n\), \(\tilde{q}_{\mathit{Sk}} = n_{\mathit{Sk}}/n\) instead of the maximum likelihood estimators, we can construct a Wald-type test statistic for testing \(H_{0}:\delta = 0\) against \(H_{1}:\delta > 0\):

$$\displaystyle{ Z_{W} = \frac{ \frac{1} {nK}\sum _{k=1}^{K}k(n_{\mathit{ Nk}} - n_{\mathit{Sk}})+\varDelta } {\sqrt{ \frac{1} {n}\left [ \frac{1} {nK^{2}} \sum _{k=1}^{K}k^{2}(n_{\mathit{Nk}} + n_{\mathit{Sk}}) -\varDelta ^{2}\right ]}} \sim _{H_{0}}N(0,1)\;. }$$
(7.21)

When Δ = 0, the Wald-type test \(Z_{W}\) is identical to Schouten's [15] generalized McNemar test, although Schouten's test statistic is presented in a different form. When K = 1, the Wald-type test \(Z_{W}\) is identical to the unconditional test for non-inferiority of Lu and Bean [7]. When Δ = 0 and K = 1, both the normal deviate test \(Z_{\mathit{ND}}\) and the Wald-type test \(Z_{W}\) are identical to the McNemar test [9].
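Since (7.21) depends only on the observed counts, \(Z_{W}\) is straightforward to compute; a minimal sketch (Python, with hypothetical counts):

```python
import numpy as np

def wald_Z(n_N, n_S, n, Delta):
    """Wald-type statistic Z_W of (7.21).
    n_N = (n_N1, ..., n_NK), n_S = (n_S1, ..., n_SK)."""
    n_N, n_S = np.asarray(n_N, float), np.asarray(n_S, float)
    K = len(n_N)
    k = np.arange(1, K + 1)
    num = np.dot(k, n_N - n_S) / (n * K) + Delta
    var = (np.dot(k**2, n_N + n_S) / (n * K**2) - Delta**2) / n
    return num / np.sqrt(var)

# Hypothetical counts: K = 3, n = 40
z = wald_Z([6, 2, 0], [3, 0, 0], n=40, Delta=0.1)
print(round(z, 2))  # 5.19
```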

7.3.2 Confidence Interval

Testing non-inferiority with an acceptable difference Δ at a one-sided significance level α∕2 is equivalent to judging whether the lower limit of the 1 −α level confidence interval is greater than −Δ. The score-type approximate confidence limits for the difference in two proportions, λ, are the two solutions to the equation

$$\displaystyle{ \frac{ \frac{1} {nK}\sum _{k=1}^{K}k(n_{\mathit{ Nk}} - n_{\mathit{Sk}})-\lambda } {\sqrt{ \frac{1} {n}\left [ \frac{1} {K^{2}} \sum _{k=1}^{K}k^{2}(\hat{q}_{\mathit{Nk}} +\hat{ q}_{\mathit{Sk}}) -\lambda ^{2}\right ]}} = \pm Z_{\alpha /2}\;, }$$
(7.22)

where the plus and minus signs yield the lower limit \(\lambda _{\mathrm{low}}\) and the upper limit \(\lambda _{\mathrm{up}}\), respectively, and \(Z_{\alpha /2}\) is the upper α∕2 percentile of the standard normal distribution. These two limits can be found using an iterative numerical method such as the secant method (see, e.g., [17]). On the other hand, we can easily derive the Wald-type confidence interval:

$$\displaystyle{ \text{CI}_{W}: \frac{1} {nK}\left (\sum _{k=1}^{K}k(n_{\mathit{ Nk}} - n_{\mathit{Sk}}) \pm Z_{\alpha /2}\sqrt{\sum _{k=1 }^{K }k^{2 } (n_{\mathit{Nk } } + n_{\mathit{Sk } } )}\right )\;. }$$
(7.23)

Equation (7.23) utilizes the variance evaluated under the null hypothesis and is identical to Schouten’s [15] Wald-type confidence interval.
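Equation (7.23) needs no iteration; a minimal sketch (Python, with hypothetical counts):

```python
import numpy as np

def wald_ci(n_N, n_S, n, z=1.959964):
    """Wald-type 95 % confidence interval CI_W of (7.23)."""
    n_N, n_S = np.asarray(n_N, float), np.asarray(n_S, float)
    K = len(n_N)
    k = np.arange(1, K + 1)
    center = np.dot(k, n_N - n_S)
    half = z * np.sqrt(np.dot(k**2, n_N + n_S))
    return (center - half) / (n * K), (center + half) / (n * K)

# Hypothetical counts: K = 3, n = 40
lo, hi = wald_ci([6, 2, 0], [3, 0, 0], n=40)
print(round(lo, 3), round(hi, 3))  # -0.009 0.126
```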

7.3.3 Sample Size

To calculate the sample size required for testing the null hypothesis H 0: δ = 0 against the alternative hypothesis H 1: δ > 0, we only have to consider the following properties of the statistic T:

$$\displaystyle\begin{array}{rcl} E_{H_{0}}(T)& =& 0\;, {}\\ E_{H_{1}}(T)& =& \lambda +\varDelta \;, {}\\ S =\lim _{n\rightarrow \infty }n\text{Var}_{H_{1}}(T)& =& \left [ \frac{1} {K^{2}}\sum _{k=1}^{K}k^{2}(q_{\mathit{ Nk}} + q_{\mathit{Sk}}) -\lambda ^{2}\right ]\;. {}\\ \end{array}$$

On the other hand, we have

$$\displaystyle{R =\lim _{n\rightarrow \infty }n\text{Var}_{H_{0}}(T) = \left [ \frac{1} {K^{2}}\sum _{k=1}^{K}k^{2}(\bar{q}_{\mathit{ Nk}} +\bar{ q}_{\mathit{Sk}}) -\varDelta ^{2}\right ]\;,}$$

where \((\bar{q}_{\mathit{Nk}},\,\bar{q}_{\mathit{Sk}})\), k = 0, …, K, are the asymptotic values of the maximum likelihood estimators \((\hat{q}_{\mathit{Nk}},\,\hat{q}_{\mathit{Sk}})\), k = 0, …, K. These asymptotic values are solutions to (7.17) and (7.18). From the aforementioned equations, the approximate sample size n required for 100(1 − β) % power of a one-sided normal deviate test at the α∕2 level is given by

$$\displaystyle{ n = \left (\frac{Z_{\alpha /2}\sqrt{R} + Z_{\beta }\sqrt{S}} {\lambda +\varDelta } \right )^{2}\;. }$$
(7.24)

When K = 1, the derived formula for determining the sample size agrees with that proposed by Nam [10]. The sample sizes required for 80 % power of a one-sided non-inferiority test at \(\alpha /2 = 2.5\,\%\) for K = 2, 3, Δ = 0.1, 0.05, and various values of \((q_{N3},\,q_{N2},\,q_{N1},\,q_{S3},\,q_{S2},\,q_{S1})\) with \(\pi _{N} -\pi _{S} =\lambda = 0\) are shown in Table 7.5.
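Formula (7.24) can be sketched as follows (Python). The restricted values \(\bar{q}_{\mathit{Nk}},\,\bar{q}_{\mathit{Sk}}\) must in general be obtained by solving (7.17)–(7.18); here they are simply supplied as arguments, and for a quick illustration with hypothetical design values they are set equal to the design values, which only approximates R.

```python
import math

def sample_size(q_N, q_S, qbar_N, qbar_S, Delta, z_a=1.959964, z_b=0.841621):
    """Approximate n from (7.24) for one-sided alpha/2 = 2.5 %, 80 % power.
    q_N, q_S: design values (q_N1..q_NK), (q_S1..q_SK);
    qbar_N, qbar_S: asymptotic restricted MLEs from (7.17)-(7.18)."""
    K = len(q_N)
    lam = sum(k * (a - b) for k, (a, b) in enumerate(zip(q_N, q_S), 1)) / K
    S = (sum(k**2 * (a + b) for k, (a, b) in enumerate(zip(q_N, q_S), 1))
         / K**2 - lam**2)
    R = (sum(k**2 * (a + b) for k, (a, b) in enumerate(zip(qbar_N, qbar_S), 1))
         / K**2 - Delta**2)
    return math.ceil(((z_a * math.sqrt(R) + z_b * math.sqrt(S)) / (lam + Delta))**2)

# Hypothetical symmetric design (lambda = 0), K = 3
q = (0.1, 0.05, 0.02)
n = sample_size(q, q, q, q, Delta=0.1)
print(n)
```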

Table 7.5 Sample sizes calculated by formula (7.24) for nominal power = 80 % of a non-inferiority test at \(\alpha /2 = 2.5\,\%\) for K = 2, 3, Δ = 0.1, 0.05, \(\pi _{N} -\pi _{S} =\lambda = 0\), \(q_{N3} = q_{S3}\), \(q_{N2} = q_{S2}\), \(q_{N1} = q_{S1}\)

7.4 Simulation

This section presents the results of simulation studies of the methods at the one-sided 2.5 % level for the case of K = 3 and sample sizes n = 25, 50 or 100, with 10,000 replicates. Simulation data were generated from a multinomial distribution under typical situations for the parameter values \((q_{N3},\,q_{N2},\,q_{N1},\,q_{S3},\,q_{S2},\,q_{S1})\) and non-inferiority margin Δ = 0.1. In assessing the performance of the methods based on majority votes, we transformed the simulation data using the following definitions: \(q_{N} = q_{N3} + q_{N2} + \frac{1}{3}q_{N1}\), \(q_{S} = q_{S3} + q_{S2} + \frac{1}{3}q_{S1}\).

7.4.1 Non-inferiority Test

We performed Monte Carlo simulation studies to assess the empirical size and power of the normal deviate test statistic \(Z_{\mathit{ND}}\), the Wald-type test statistic \(Z_{W}\) and the test statistic based on majority votes \(Z_{\mathit{MV}}\). \(Z_{\mathit{MV}}\) was calculated using the method of Nam [10] and Tango [17]. Table 7.6 presents the empirical sizes. For the set of parameter values \((q_{N3},\,q_{N2},\,q_{N1},\,q_{S3},\,q_{S2},\,q_{S1})\) considered here, the empirical sizes of the normal deviate test \(Z_{\mathit{ND}}\) are generally closer to the nominal α∕2 level of 2.5 % than those of the Wald-type test \(Z_{W}\) or the test based on majority votes \(Z_{\mathit{MV}}\). The empirical sizes of \(Z_{W}\) tend to be considerably inflated, whereas those of \(Z_{\mathit{MV}}\) tend to be considerably deflated. Table 7.7 presents the empirical powers under the alternative hypothesis \(H_{1}:\pi _{N} =\pi _{S}\) for the case of Δ = 0.1. The differences in power between \(Z_{\mathit{ND}}\) and \(Z_{W}\) are generally small. When the sample size is small, however, the empirical powers of \(Z_{W}\) are far greater than those of \(Z_{\mathit{ND}}\). On the other hand, the empirical powers of \(Z_{\mathit{MV}}\) are far smaller than those of \(Z_{\mathit{ND}}\) under all situations.
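The empirical size of \(Z_{W}\), for instance, can be estimated with a short simulation; a sketch (Python) under a null configuration \(\lambda = -\varDelta\), with hypothetical probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, Delta, reps = 3, 50, 0.1, 2000
k = np.arange(1, K + 1)

# Hypothetical null probabilities (q_N1, q_N2, q_N3) and (q_S1, q_S2, q_S3)
# chosen so that lambda = (1/3) * sum_k k*(q_Nk - q_Sk) = -0.1.
q_N = np.array([0.05, 0.04, 0.02])
q_S = np.array([0.17, 0.10, 0.04])
assert abs(np.dot(k, q_N - q_S) / K + Delta) < 1e-12

p = np.concatenate(([1 - q_N.sum() - q_S.sum()], q_N, q_S))
reject = 0
for _ in range(reps):
    cnt = rng.multinomial(n, p)
    n_N, n_S = cnt[1:K + 1], cnt[K + 1:]
    num = np.dot(k, n_N - n_S) / (n * K) + Delta
    var = (np.dot(k**2, n_N + n_S) / (n * K**2) - Delta**2) / n
    if var > 0 and num / np.sqrt(var) > 1.959964:
        reject += 1
print(reject / reps)  # empirical size; the nominal level is 2.5 %
```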

Table 7.6 Empirical sizes of the normal deviate test \(Z_{\mathit{ND}}\), the Wald-type test \(Z_{W}\) and the test based on majority votes \(Z_{\mathit{MV}}\) at \(\alpha /2 = 2.5\,\%\) for K = 3, \(\pi _{N} -\pi _{S} =\lambda = -0.1\), Δ = 0.1, based on 10,000 replicates
Table 7.7 Empirical powers of the normal deviate test \(Z_{\mathit{ND}}\), the Wald-type test \(Z_{W}\) and the test based on majority votes \(Z_{\mathit{MV}}\) at \(\alpha /2 = 2.5\,\%\) for K = 3, \(\pi _{N} -\pi _{S} =\lambda = 0\), Δ = 0.1, based on 10,000 replicates

7.4.2 Confidence Interval

We performed Monte Carlo simulation studies to evaluate the coverage probabilities of the score-type confidence interval, the Wald-type confidence interval \(\text{CI}_{W}\) and the confidence interval based on majority votes \(\text{CI}_{\mathit{MV}}\). \(\text{CI}_{\mathit{MV}}\) was calculated using the method of Tango [17]. Table 7.8 shows the empirical coverage probabilities of the score-type 95 % confidence interval, the Wald-type 95 % confidence interval and the 95 % confidence interval based on majority votes under the hypothesis \(\pi _{N} -\pi _{S} =\lambda = -0.1\). The score-type and Wald-type confidence intervals both generally perform very well, although for n = 25 the score-type confidence interval outperforms the Wald-type confidence interval. On the other hand, the confidence interval based on majority votes is conservative.

Table 7.8 Coverage probabilities of the score-type 95 % confidence interval, the Wald-type 95 % confidence interval and the 95 % confidence interval based on majority votes for K = 3, based on 10,000 replicates generated under the null hypothesis \(\pi _{N} -\pi _{S} =\lambda = -0.1\)

7.5 Example

7.5.1 Study of Diagnostic Procedures for the Diagnosis of Oesophageal Carcinoma Infiltrating the Tracheobronchial Tree

Here, we shall consider the data presented by Rapp-Bernhardt et al. [13]. They compared the sensitivities of axial computed tomography (CT) slices and minimal intensity projection (MIP) in 21 patients with oesophageal carcinoma infiltrating the tracheobronchial tree. The bronchoscopic findings were taken as the gold standard. Three radiologists, working independently of each other and without knowledge of the gold-standard findings, separately assessed the axial CT slices and MIP. In these diagnostic procedures, stenoses were localized and the degree of stenosis was assessed as in real bronchoscopy. The resulting types of matched observations were classified into the 4 × 4 contingency table for MIP versus axial CT slices shown in Table 7.9 (similar to Table 7.3), where '+' indicates a true positive and '−' indicates a false negative, based on a binary assessment in which 0–50 % of total occlusion was considered negative and 50–100 % was considered positive. MIP is a reconstruction technique for producing three-dimensional images; MIP images make it easier to appreciate the condition of the whole tracheobronchial tree than axial CT slices. Therefore, we are interested in the non-inferiority of MIP to axial CT slices, where the non-inferiority margin is set as Δ = 0.1. From Table 7.9, we have \(\tilde{p}_{3\cdot } = 17/21\), \(\tilde{p}_{2\cdot } = 0/21\), \(\tilde{p}_{1\cdot } = 2/21\), \(\tilde{p}_{\cdot 3} = 14/21\), \(\tilde{p}_{\cdot 2} = 2/21\) and \(\tilde{p}_{\cdot 1} = 5/21\). Then, the sensitivities of MIP and axial CT slices are estimated as \(\tilde{\pi }_{\mathit{MIP}} = \left (17 + 2/3 \times 0 + 1/3 \times 2\right )/21 = 0.841\) and \(\tilde{\pi }_{\mathit{CT}} = \left (14 + 2/3 \times 2 + 1/3 \times 5\right )/21 = 0.810\), respectively. Moreover, we have \(\tilde{q}_{N3} = 0/21\), \(\tilde{q}_{N2} = (1 + 0)/21\), \(\tilde{q}_{N1} = (2 + 0 + 0)/21\), \(\tilde{q}_{S3} = 0/21\), \(\tilde{q}_{S2} = (0 + 0)/21\) and \(\tilde{q}_{S1} = (0 + 0 + 2)/21\).
Then, the difference in sensitivities between MIP and axial CT slices based on the three raters is \(\tilde{\lambda }_{K=3} = 0.032\), and the normal deviate test gives \(Z_{\mathit{ND}} = 1.753 \approx Z_{S}\) (one-sided p-value = 0.040). The score-type 95 % confidence interval is −0.141 to 0.181, where the lower limit is not greater than −Δ = −0.1. These results suggest that the non-inferiority of MIP to axial CT slices cannot be claimed at the one-sided 2.5 % significance level. The Wald-type test statistic, on the other hand, suggests non-inferiority, because \(Z_{W} = 3.358\) with one-sided p-value < 0.001 and because the Wald-type 95 % confidence interval under the null hypothesis is −0.056 to 0.120. However, the simulation study suggests that the Wald-type test result here is not reliable because of its inflated empirical sizes for a sample size as small as n = 21. The result of the normal deviate test, on the other hand, may or may not be reliable, because its empirical sizes for Δ = 0.1 and n = 25 were shown to be around 1.6–2.4 %.
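As a numerical check, \(\tilde{\lambda }_{K=3}\) and \(Z_{W}\) reported above can be reproduced from the counts in Table 7.9 (a sketch; only \(Z_{W}\), which requires no constrained estimation, is recomputed here):

```python
import math

# Counts read from Table 7.9: n = 21, K = 3,
# (n_N1, n_N2, n_N3) = (2, 1, 0), (n_S1, n_S2, n_S3) = (2, 0, 0).
n, K, Delta = 21, 3, 0.1
n_N, n_S = [2, 1, 0], [2, 0, 0]

# lambda~ from (7.13) and Z_W from (7.21)
lam = sum(k * (a - b) for k, (a, b) in enumerate(zip(n_N, n_S), 1)) / (n * K)
inner = (sum(k**2 * (a + b) for k, (a, b) in enumerate(zip(n_N, n_S), 1))
         / (n * K**2) - Delta**2)
Z_W = (lam + Delta) / math.sqrt(inner / n)
print(round(lam, 3), round(Z_W, 3))  # 0.032 3.358
```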

Table 7.9 A 4 × 4 contingency table (K = 3) of the assessments of MIP and axial CT slices by three radiologists (True positive (TP: +) and false negative (FN: −) by three radiologists (1, 2, 3): I (+, +, +), II (+, +, − or +, −, + or −, +, +), III (+, −, − or −, +, − or −, −, +), IV (−, −, −)) (Rapp-Bernhardt et al. [13])

7.5.2 Study of Diagnostic Procedures for the Diagnosis of Aneurysm in Patients with Acute Subarachnoid Hemorrhage

Jäger et al. [4] performed a blinded multi-rater study comparing magnetic resonance angiography (MRA) and digital subtraction angiography (DSA) in 34 prospectively enrolled patients who presented with acute subarachnoid hemorrhage (SAH). Two raters independently evaluated the MRA and DSA images. The presence of an aneurysm was evaluated on a 4-point ordinal scale (1, absent; 2, probably absent; 3, probably present; 4, definitely present). Additionally, all aneurysms for which the two raters had given different evaluations on the 4-point scale were subsequently reviewed by consensus evaluation. Because the authors intended to study inter-rater and inter-procedure agreement, neither method was taken a priori as the gold standard. However, they reported the evaluations of the MRA and DSA images by the two raters together with details of the clinical follow-up of all patients. Therefore, we considered comparing the difference in sensitivities between MRA and DSA on the basis of the data of the 27 patients with aneurysms among the patients with SAH. Data were analyzed on a per-patient basis, taking into account only the aneurysm with the highest ranking on the 4-point scale in each patient. We assigned the rating of true positive ('+') for scores of 3 and 4, and false negative ('−') for scores of 1 and 2. The resulting types of matched observations based on the two independent raters and on the consensus evaluations were classified into 3 × 3 and 2 × 2 contingency tables, respectively (Tables 7.10 and 7.11). DSA is a procedure in which radiographic images of blood vessels filled with a contrast agent are digitized and then subtracted from images obtained before administration of the contrast agent; this increases the contrast between the vessels and the background. However, as a catheter (a long, thin, flexible tube) must be inserted into an artery, DSA is considered invasive. MRA is a procedure for imaging blood vessels based on MRI.
Unlike DSA, which involves placing a catheter into the body, MRA is considered noninvasive. Therefore, we are interested in the non-inferiority of MRA to DSA, where the non-inferiority margin is set as \(\varDelta = 0.1\). From Table 7.10 based on the two raters, we have \(\tilde{p}_{2.} = 20/27\), \(\tilde{p}_{1.} = 5/27\), \(\tilde{p}_{.2} = 22/27\) and \(\tilde{p}_{.1} = 2/27\). Then, the sensitivities of MRA and DSA are estimated as \(\tilde{\pi }_{\mathit{MRA}} = \left (20 + 1/2 \times 5\right )/27 = 0.833\) and \(\tilde{\pi }_{\mathit{DSA}} = \left (22 + 1/2 \times 2\right )/27 = 0.852\), respectively. Moreover, we have \(\tilde{q}_{N2} = 1/27\), \(\tilde{q}_{N1} = (0 + 2)/27\), \(\tilde{q}_{S2} = 0/27\) and \(\tilde{q}_{S1} = (3 + 2)/27\). Then, the difference in the sensitivities between MRA and DSA based on the two raters is \(\tilde{\lambda }_{K=2} = -0.019\), and the normal deviate test gives \(Z_{\mathit{ND}} = 1.393 \approx Z_{S}\) (one-sided p-value = 0.082). The score-type 95 % confidence interval is −0.141 to 0.144, whose lower limit falls below \(-\varDelta = -0.1\). Furthermore, the Wald-type test gives \(Z_{W} = 1.397\) (one-sided p-value = 0.081), and the Wald-type 95 % confidence interval under the null hypothesis is −0.139 to 0.102. From Table 7.11 based on the consensus evaluations, on the other hand, the sensitivities of MRA and DSA are estimated as \(\tilde{\pi }_{\mathit{MRA}_{ \mathit{CE}}} = 0.926\) and \(\tilde{\pi }_{\mathit{DSA}_{ \mathit{CE}}} = 0.889\), respectively. Then, the difference in the sensitivities between MRA and DSA based on the consensus evaluations is \(\tilde{\lambda }_{\mathit{CE}} = 0.037\), and the score test derived from Nam [10] and Tango [17] gives \(Z_{S} = 1.510\) (one-sided p-value = 0.066). Moreover, the score-based 95 % confidence interval derived from Tango [17] is −0.150 to 0.227. These results suggest that the non-inferiority of MRA to DSA cannot be claimed at the one-sided 2.5 % significance level.
Note, however, that although the difference in the sensitivities based on the two raters, \(\tilde{\lambda }_{K=2}\), is negative, the difference in the sensitivities based on the consensus evaluations, \(\tilde{\lambda }_{\mathit{CE}}\), is positive. We consider that this discrepancy reflects the bias introduced by the consensus evaluations.
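The point estimates reported above can be reproduced directly from the marginal counts of Table 7.10. The following Python sketch (the variable names are ours; the counts come from the text) verifies the arithmetic, using the convention that a subject rated ‘+’ by exactly one of the two raters contributes with weight 1/2:

```python
# Counts from Table 7.10 (K = 2 raters), n = 27 patients with aneurysms.
n = 27

# MRA margins: both raters '+', exactly one rater '+'.
mra_both_pos, mra_one_pos = 20, 5
# DSA margins: both raters '+', exactly one rater '+'.
dsa_both_pos, dsa_one_pos = 22, 2

# A '+' from exactly one of the two raters contributes with weight 1/2.
pi_mra = (mra_both_pos + 0.5 * mra_one_pos) / n   # estimated sensitivity of MRA
pi_dsa = (dsa_both_pos + 0.5 * dsa_one_pos) / n   # estimated sensitivity of DSA

lam = pi_mra - pi_dsa                             # difference lambda (K = 2)

print(round(pi_mra, 3), round(pi_dsa, 3), round(lam, 3))
# → 0.833 0.852 -0.019
```

This matches \(\tilde{\pi }_{\mathit{MRA}} = 0.833\), \(\tilde{\pi }_{\mathit{DSA}} = 0.852\) and \(\tilde{\lambda }_{K=2} = -0.019\) reported in the text.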

Table 7.10 A 3 × 3 contingency table (K = 2) of the assessments of MRA and DSA by two neuroradiologists (true positive (TP: +) or false negative (FN: −) by the two raters (1, 2): I = (+, +), II = (+, −) or (−, +), III = (−, −)) (Jäger et al. [4])
Table 7.11 A 2 × 2 contingency table of the assessments of MRA and DSA by consensus evaluations (True positive (TP: +) and false negative (FN: −)) (Jäger et al. [4])

7.6 Conclusion

A non-inferiority trial of diagnostic procedures is generally evaluated on the basis of the results from multiple independent raters who are independent of the study centers. However, consensus evaluations or majority votes that reduce the multiple raters' results to a single result are not recommended because of the bias or loss of information they entail [1, 2, 12]. Therefore, it is important that all of the results from the multiple raters be utilized appropriately in the statistical analysis. The methods addressed in this chapter can be used for inference on the difference in correlated proportions between two diagnostic procedures evaluated by multiple raters. In this chapter, we introduced the methods in terms of sensitivity; however, they can equally be applied to inference on the difference in specificities. Furthermore, if we need to consider the simultaneous non-inferiority of a new diagnostic procedure to the standard diagnostic procedure in both sensitivity and specificity, we can extend the methods using the approach proposed by Lu et al. [8], who extended the score test proposed by Nam [10] and Tango [17] for a single measure to a simultaneous test for both sensitivity and specificity based on the principle of the intersection-union test.
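The intersection-union construction mentioned above has a simple decision rule: the union null hypothesis (inferior in sensitivity or inferior in specificity) is rejected only if every component test rejects. The sketch below is our illustrative reading of the principle, not Lu et al.'s exact procedure; the function name and the use of component p-values are assumptions for illustration:

```python
def simultaneous_noninferiority(p_sensitivity: float,
                                p_specificity: float,
                                alpha: float = 0.025) -> bool:
    """Intersection-union test: claim simultaneous non-inferiority only
    if BOTH one-sided component tests reject at level alpha. Because the
    null is a union, each component may be tested at the full alpha with
    no multiplicity adjustment."""
    return p_sensitivity < alpha and p_specificity < alpha

# Example: sensitivity clearly non-inferior, specificity borderline.
print(simultaneous_noninferiority(0.004, 0.031))  # → False (specificity fails)
print(simultaneous_noninferiority(0.004, 0.012))  # → True
```

A useful property of this construction is that the overall test is level α without any correction, although it is conservative when the component statistics are highly correlated.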

We carried out Monte Carlo simulation studies to evaluate the performance of these methods. The normal deviate test for non-inferiority was shown to have an empirical size closer to the nominal one-sided significance level of 2.5 % than the Wald-type test or the test based on the majority votes. Moreover, the score-type confidence interval performed better than the Wald-type confidence interval under the null hypothesis in terms of coverage probability when the sample size was small. On the other hand, the confidence interval based on the majority votes showed a conservative property.
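The logic of such a size simulation can be sketched for the simpler single-rater case. The following Python sketch estimates the empirical size of a one-sided Wald-type non-inferiority test for paired proportions by simulating at the null boundary \(p_{10} - p_{01} = -\varDelta\); the cell probabilities, sample size, and number of replications are hypothetical choices for illustration, and the multi-rater normal deviate test itself is not reproduced here:

```python
import math
import random

def wald_noninferiority_reject(b, c, n, delta=0.10, alpha_crit=1.95996):
    """One-sided Wald-type test of H0: p_new - p_std <= -delta for
    matched-pair data; b and c are the discordant counts."""
    lam = (b - c) / n
    var = (b + c - (b - c) ** 2 / n) / n ** 2   # estimated Var(lam)
    if var <= 0:
        return False
    return (lam + delta) / math.sqrt(var) > alpha_crit  # one-sided 2.5 %

def empirical_size(n=50, n_sim=20000, delta=0.10, seed=1):
    """Estimate the type I error rate at the null boundary
    p10 - p01 = -delta by Monte Carlo simulation."""
    rng = random.Random(seed)
    p11, p10, p01 = 0.70, 0.05, 0.15   # hypothetical cells; p10 - p01 = -0.10
    rejections = 0
    for _ in range(n_sim):
        b = c = 0
        for _ in range(n):              # draw one multinomial subject at a time
            u = rng.random()
            if u < p11:
                pass                    # concordant positive
            elif u < p11 + p10:
                b += 1                  # new '+', standard '-'
            elif u < p11 + p10 + p01:
                c += 1                  # new '-', standard '+'
        if wald_noninferiority_reject(b, c, n, delta):
            rejections += 1
    return rejections / n_sim

size = empirical_size()
print(f"empirical size at the null boundary: {size:.3f}")
```

An empirical size close to 0.025 indicates the test holds its nominal level; repeating this with the normal deviate and majority-vote tests over a grid of sample sizes and cell probabilities gives comparisons of the kind summarized above.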

When we plan a clinical trial to compare the efficacy of two diagnostic procedures, it is very important to take the study design into account. The methods addressed in this chapter are only applicable to a study design in which both diagnostic procedures are applied to each subject and all raters evaluate all subjects, that is, a paired-patient, paired-rater design. Zhou et al. [18] provided detailed information on study designs for diagnostic procedures. Moreover, note that these methods may not be appropriate for clustered matched-pair data. Schwenke and Busse [16] proposed a Wald-type test for clustered matched-pair data based on multiple raters; however, their test is a test for superiority and cannot be used as a test for non-inferiority. If the results of the two diagnostic procedures are evaluated by a single rater, several non-inferiority tests for clustered matched-pair data are available [3, 5, 11]. Therefore, we expect that a non-inferiority test for clustered matched-pair data on the basis of the results from multiple raters will be developed. If there are missing data among the results from the multiple raters for some subjects, some kind of imputation method would have to be applied, which would require future research. Furthermore, if a qualitative interaction between the two diagnostic procedures and the multiple raters is demonstrated, these methods could not be applied to those data. However, this problem could probably be addressed by non-statistical means, for example, by training all of the raters on the judgment criteria for the diagnostic procedures before the start of the evaluation.

7.7 Program

The R programs for the methods of this chapter can be downloaded at http://www.medstat.jp/downloadsaeki.html.