Abstract
In a clinical trial of diagnostic procedures to demonstrate non-inferiority, efficacy is generally evaluated on the basis of the results from multiple raters who interpret and report their findings independently. Although we can handle the multiple results from the multiple raters as if there were a single rater by considering consensus evaluations or majority votes, this handling is not recommended for the primary evaluation. Therefore, all results from the multiple independent raters should be used in the analysis. This chapter addresses a non-inferiority test, confidence interval and sample size formula for inference of the difference in correlated proportions between two diagnostic procedures based on multiple raters. Moreover, we illustrate the methods with data from studies of diagnostic procedures for the diagnosis of oesophageal carcinoma infiltrating the tracheobronchial tree and for the diagnosis of aneurysm in patients with acute subarachnoid hemorrhage.
Keywords
- Magnetic Resonance Angiography
- Digital Subtraction Angiography
- Majority Vote
- Oesophageal Carcinoma
- Consensus Evaluation
7.1 Introduction
In situations where an accepted standard diagnostic procedure exists, it is possible to plan a clinical trial to confirm that a new diagnostic procedure is superior to the standard one. However, if the efficacy of the new diagnostic procedure is expected to be no lower than that of the standard procedure, and the new procedure is less invasive (or non-invasive), less toxic (or non-toxic), less expensive or easier to operate in comparison with the standard procedure, we can plan a non-inferiority study. A non-inferiority study of two diagnostic procedures is designed to show that the sensitivity or specificity of the new diagnostic procedure is no more than 100Δ percent inferior to that of the standard procedure, where Δ (0 < Δ ≤ 1) is a pre-specified acceptable difference between the two proportions. In general, sensitivity is defined as the probability that the result of a diagnostic procedure is positive when the subject has the disease, and specificity is defined as the probability that the result is negative when the subject does not have the disease. Both measures are important for evaluating the performance of a diagnostic procedure, but they are calculated from different populations of subjects. We therefore consider statistical inference for the difference in sensitivities in this chapter; the same methods can be applied to examine the difference in specificities using a different study population.
If two diagnostic procedures are performed on each subject, the difference in proportions for the resulting matched-pair data involves a correlation between the two procedures. Nam [10] and Tango [17] derived the same non-inferiority test for the difference in proportions for matched-pair categorical data based on the efficient score, under the assumption that the pairs are independent. Tango [17] also derived the corresponding score-based confidence interval. However, these methods are only applicable when the results of the two diagnostic procedures are evaluated by a single rater. In practice, multiple independent raters often evaluate the diagnoses obtained from these procedures (see, e.g., [6]). If multiple raters are involved in the evaluation, the differences in proportions for matched-pair data are also correlated across raters. Although we can apply the aforementioned methods by using consensus evaluations or majority votes to handle the multiple results as if there were a single rater, this is not recommended for the primary evaluation [1, 2, 12]. Consensus evaluations may introduce bias caused by non-independent evaluation; for example, senior or persuasive raters may influence the evaluations of junior or passive raters. Majority votes, in turn, cannot take into account the variability in the results of the multiple raters. Therefore, all results from the multiple independent raters should be used in the analysis.
In this chapter, we introduce a non-inferiority test, confidence interval and sample size formula proposed by Saeki and Tango [14], for inference of the difference in correlated proportions between two diagnostic procedures on the basis of the results from the multiple independent raters where the matched pairs are independent. Furthermore, we consider a possible procedure based on majority votes and we conduct Monte Carlo simulation studies to examine the validity of the proposed methods in comparison with the procedure based on majority votes. Finally, we illustrate the methods with data from studies of diagnostic procedures for the diagnosis of oesophageal carcinoma infiltrating the tracheobronchial tree [13] and for the diagnosis of aneurysm in patients with acute subarachnoid hemorrhage [4].
7.2 Design
7.2.1 Data Structure and Model
Consider a clinical experimental design in which a new diagnostic procedure (or treatment) and a standard diagnostic procedure (or treatment) are performed independently on the same subject (or on matched pairs of subjects) and are evaluated independently by K raters. Each rater’s judgment is assumed to take one of two values: 1 indicates that the subject is diagnosed as ‘positive’ and 0 that the subject is diagnosed as ‘negative’. Suppose we have n subjects. If we consider only subjects with a pre-specified disease, we use the positive probability as the measure, that is, sensitivity; if we consider subjects without the disease, we use the negative probability, that is, specificity. In the following, we consider the situation based on sensitivity.
For ease of explanation, let us consider the case of K = 2 first. The resulting types of matched observations and probabilities are naturally classified in the 4 × 4 contingency table shown in Table 7.1, where + (1) or − (0) denotes a positive or negative judgment on a procedure, respectively. For example, \(y_{1101}\) denotes the observed number of matched observations of type {+ on the new procedure by rater 1, + on the new procedure by rater 2, − on the standard procedure by rater 1, + on the standard procedure by rater 2}, and \(r_{1101}\) denotes its probability.
Let \(\pi _{N}^{(k)}\) (\(\pi _{S}^{(k)}\)) denote the probability that rater k judges a randomly selected subject as positive on the new (standard) diagnostic procedure. Then, it is naturally calculated as
and \(\pi _{S}^{(1)}\) and \(\pi _{S}^{(2)}\) are defined in a similar manner. Let \(\pi _{N}\) and \(\pi _{S}\) denote the probabilities of a positive judgment on the new and standard diagnostic procedures, respectively. Then, these probabilities can, in general, be defined as follows:
where \(\omega ^{(k)}\) (\(\omega ^{(1)} +\omega ^{(2)} = 1\)) denotes the weight for rater k, reflecting differences in the raters’ evaluation skill. However, raters are usually selected among raters with at least equivalent skill, and it is assumed in this chapter that
Therefore, these probabilities can be defined as follows:
On the basis of the expressions (7.5) and (7.6), the 4 × 4 contingency table reduces to the 3 × 3 contingency table shown in Table 7.2, where \(p_{\ell m}\) (\(x_{\ell m}\)) denotes the probability (observed number of observations) that ℓ raters judge as positive on the new procedure and m raters judge as positive on the standard procedure. Then, we have
Let λ denote the difference in positive probabilities; that is,
and its sample estimate will be
which clearly shows that inference on λ can be made from the observed vector \(\boldsymbol{x} = (x_{20},\,x_{21} + x_{10},\,x_{02},\,x_{12} + x_{01},\,x_{22} + x_{11} + x_{00})\) following a multinomial distribution with parameters n and \(\boldsymbol{p} = (p_{20},\,p_{21} + p_{10},\,p_{02},\,p_{12} + p_{01},\,p_{22} + p_{11} + p_{00})\).
It should be noted that \(x_{20}\) is the frequency with which the number of raters judging as positive on the new procedure exceeds the number judging as positive on the standard procedure by 2, and \((x_{21} + x_{10})\) is the frequency with which it exceeds by 1. Similarly, \(x_{02}\) is the frequency with which the number of raters judging as positive on the standard procedure exceeds the number judging as positive on the new procedure by 2, and \((x_{12} + x_{01})\) is the frequency with which it exceeds by 1. These observations lead to a generalization to K raters. The resulting types of matched observations and probabilities are classified in a \((K + 1) \times (K + 1)\) contingency table similar to Table 7.2, and the method reduces to the following. Let \(n_{Nk}\) denote the frequency with which the number of raters who judge as positive on the new procedure exceeds the number who judge as positive on the standard procedure by k, and let \(q_{Nk}\) denote the corresponding probability. Namely, we have
where ℓ is the number of raters who judge as positive on the new procedure and m is the number who judge as positive on the standard procedure. Similarly, let \(n_{Sk}\) denote the frequency with which the number of raters who judge as positive on the standard procedure exceeds the number who judge as positive on the new procedure by k, and let \(q_{Sk}\) denote the corresponding probability. Then, we have
and \(q_{N0} = q_{S0}\) and \(n_{N0} = n_{S0}\). Namely, for K raters, inference on λ can be made from the vector of random variables \(\boldsymbol{n} = (n_{N0},\,n_{N1},\ldots,\,n_{\mathit{NK}},\,n_{S1},\ldots,\,n_{\mathit{SK}})\) following a multinomial distribution with parameters n and \(\boldsymbol{q} = (q_{N0},\,q_{N1},\ldots,\,q_{\mathit{NK}},\,q_{S1},\ldots,\,q_{\mathit{SK}})\). Then, we have
Therefore, the difference in positive probabilities (7.9) is generalized to
Then, the estimate \(\tilde{\lambda }\) given in (7.10) is generalized to
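As a concrete sketch, the generalized estimate \(\tilde{\lambda } = \frac{1}{Kn}\sum _{k=1}^{K}k\,(n_{Nk} - n_{Sk})\), which is consistent with the K = 3 expansion shown in Sect. 7.2.2, can be computed directly from the counts (the function name and data layout below are ours, not the chapter's; Python is used here although the chapter's own programs are in R):

```python
def lambda_tilde(n_N, n_S, n):
    """Estimate lambda = pi_N - pi_S for K raters.

    n_N[k-1]: number of subjects for which k more raters judged positive on
              the new procedure than on the standard one (k = 1, ..., K);
    n_S[k-1]: the analogous counts favouring the standard procedure;
    n:        total number of subjects.
    """
    K = len(n_N)
    return sum((k / K) * (n_N[k - 1] - n_S[k - 1])
               for k in range(1, K + 1)) / n

# The MIP vs axial CT data of Sect. 7.5.1 (K = 3, n = 21) give 2/63:
print(round(lambda_tilde([2, 1, 0], [2, 0, 0], 21), 3))  # 0.032
```

This reproduces the point estimate \(\tilde{\lambda }_{K=3} = 0.032\) reported for the first example in Sect. 7.5.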
7.2.2 Problems in Consensus Evaluations or Majority Votes
Although we can handle multiple results from the multiple raters as if there were a single rater by considering consensus evaluations or majority votes, neither approach is recommended for the primary evaluation [1, 2, 12]. Consensus evaluations may produce bias caused by non-independent evaluation, even if they are performed after the individual evaluations by the multiple raters are completed; for example, senior or persuasive raters may affect the evaluations of junior or passive raters. Moreover, majority votes cannot take into account the variability in the results of the multiple raters. For ease of explanation, let us consider the case of K = 3. The resulting types of matched observations are classified in the 4 × 4 contingency table in Table 7.3. In this case, \(\tilde{\lambda }_{K=3}\) can be written from (7.13) as
where \((n_{N3} - n_{S3}) = (x_{30} - x_{03})\), \((n_{N2} - n_{S2}) = \left \{(x_{31} + x_{20}) - (x_{13} + x_{02})\right \}\) and \((n_{N1} - n_{S1}) = \left \{(x_{32} + x_{21} + x_{10}) - (x_{23} + x_{12} + x_{01})\right \}\). If we adopt the majority votes, the 4 × 4 contingency table shown in Table 7.3 is transformed to the 2 × 2 contingency table shown in Table 7.4, and the estimate of the difference between π N and π S on the basis of the results from the majority votes will be
We should focus on two problems with \(\tilde{\lambda }_{\mathit{MV}}\):

1. \(\tilde{\lambda }_{\mathit{MV}}\) involves \((n_{N2} - n_{S2})\) and \((x_{21} - x_{12})\) without the weights with which \(\pi _{N}^{(1)}\), \(\pi _{N}^{(2)}\), \(\pi _{N}^{(3)}\) and \(\pi _{S}^{(1)}\), \(\pi _{S}^{(2)}\), \(\pi _{S}^{(3)}\) contribute to \(\pi _{N}\) and \(\pi _{S}\).

2. \(x_{32}\), \(x_{10}\) and \(x_{23}\), \(x_{01}\) do not enter \(\tilde{\lambda }_{\mathit{MV}}\) at all, because these values fall into cells ‘a’ and ‘d’ of Table 7.4.
Therefore, it is important that all results from the multiple independent raters are used in the analysis appropriately.
7.3 Methods for Statistical Inference
In this section, we shall introduce methods for statistical inference of the difference λ, that is, a non-inferiority test, confidence interval and formula for determination of sample size.
7.3.1 Non-inferiority Test
The non-inferiority hypothesis will be formulated as
where Δ (0 < Δ ≤ 1) is a pre-specified acceptable difference in two probabilities. Let
Then, under the null hypothesis, the log-likelihood function without constant terms is expressed as
where \(\boldsymbol{\theta }= (\delta,\,q_{N1},\ldots,\,q_{N(K-1)},\,q_{S1},\ldots,\,q_{\mathit{SK}})^{T}\) is the parameter vector of dimension 2K and
Then, the score test for testing the null hypothesis H 0: δ = 0 against H 1: δ > 0 is expressed as
where \((\hat{q}_{N1},\ldots,\,\hat{q}_{N(K-1)},\,\hat{q}_{S1},\ldots,\,\hat{q}_{\mathit{SK}})\) is the vector of the maximum likelihood estimators under the null hypothesis, which is the unique solution for the following equations:
These equations can be solved iteratively using a quasi-Newton method with constraints; the R function ‘constrOptim’ is useful for this purpose. Further, \((\hat{I}^{-1})_{11}\) denotes the (1, 1)th element of the (2K × 2K) inverse Fisher information matrix evaluated at the maximum likelihood estimators. On the other hand, we can consider a test based on the sample estimate T of the difference δ
The variance of T evaluated at the null hypothesis δ = 0 is
Therefore, the normal deviate for testing H 0: δ = 0 against H 1: δ > 0 is expressed as
It can be shown that when K = 1, the normal deviate test statistic \(Z_{\mathit{ND}}\) is equivalent to the score test statistic \(Z_{S}\) [10, 17]. When K = 2 or 3, we confirmed that \(Z_{S}\) and \(Z_{\mathit{ND}}\) were approximately equal using the example data (see Sect. 7.5); however, we have not been able to show the equivalence between \(Z_{S}\) and \(Z_{\mathit{ND}}\) analytically. On the other hand, by using the observed proportions \(\tilde{q}_{\mathit{Nk}} = n_{\mathit{Nk}}/n\), \(\tilde{q}_{\mathit{Sk}} = n_{\mathit{Sk}}/n\) instead of the maximum likelihood estimators, we can construct a Wald-type test statistic for testing H0: δ = 0 against H1: δ > 0:
When Δ = 0, the Wald-type test \(Z_{W}\) is identical to Schouten’s [15] generalized McNemar test, although Schouten’s test statistic is presented in a different form. When K = 1, \(Z_{W}\) is identical to the unconditional test for non-inferiority of Lu and Bean [7]. When Δ = 0 and K = 1, both the normal deviate test \(Z_{\mathit{ND}}\) and the Wald-type test \(Z_{W}\) are identical to the McNemar test [9].
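As a sketch, \(Z_{W}\) can be computed from the observed proportions. The variance form used below is our reconstruction under the multinomial model (variance evaluated at λ = −Δ) and should be checked against Saeki and Tango [14]; it does, however, reproduce the values \(Z_{W} = 3.358\) and \(Z_{W} = 1.397\) reported for the two examples in Sect. 7.5:

```python
from math import sqrt

def wald_z(n_N, n_S, n, delta):
    """Wald-type statistic for H0: pi_N - pi_S = -delta against H1: > -delta.

    Reconstruction: lambda-tilde plus delta, divided by its estimated null
    standard error, with observed proportions n_N[k-1]/n and n_S[k-1]/n.
    """
    K = len(n_N)
    lam = sum((k / K) * (n_N[k - 1] - n_S[k - 1]) for k in range(1, K + 1)) / n
    var = (sum((k / K) ** 2 * (n_N[k - 1] + n_S[k - 1]) / n
               for k in range(1, K + 1)) - delta ** 2) / n
    return (lam + delta) / sqrt(var)

# MIP vs axial CT (K = 3, n = 21) and MRA vs DSA (K = 2, n = 27), Sect. 7.5:
print(round(wald_z([2, 1, 0], [2, 0, 0], 21, 0.1), 3))  # 3.358
print(round(wald_z([2, 1], [5, 0], 27, 0.1), 3))        # 1.397
```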
7.3.2 Confidence Interval
Testing non-inferiority with an acceptable difference Δ at a one-sided significance level α∕2 is equivalent to judging whether the lower limit of the 1 −α level confidence interval is greater than −Δ. The score-type approximate confidence limits for the difference in two proportions, λ, are the two solutions to the equation
where the plus and minus signs indicate the lower limit \(\lambda _{\mathrm{low}}\) and the upper limit \(\lambda _{\mathrm{up}}\), respectively, and \(Z_{\alpha /2}\) is the upper α∕2 percentile of the standard normal distribution. These two limits can be found using an iterative numerical method such as the secant method (see, e.g., [17]). On the other hand, we can easily derive the Wald-type confidence interval:
Equation (7.23) utilizes the variance evaluated under the null hypothesis and is identical to Schouten’s [15] Wald-type confidence interval.
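A corresponding sketch of this Wald-type interval, again a reconstruction (variance evaluated at λ = 0, matching Schouten's construction as described above), reproduces the intervals reported for the examples in Sect. 7.5:

```python
from math import sqrt

Z_975 = 1.959964  # upper 2.5 % point of the standard normal distribution

def wald_ci(n_N, n_S, n, z=Z_975):
    """Wald-type 95 % CI for lambda = pi_N - pi_S, with the variance
    evaluated at lambda = 0 (our reconstruction of Eq. (7.23))."""
    K = len(n_N)
    lam = sum((k / K) * (n_N[k - 1] - n_S[k - 1]) for k in range(1, K + 1)) / n
    se0 = sqrt(sum((k / K) ** 2 * (n_N[k - 1] + n_S[k - 1]) / n
                   for k in range(1, K + 1)) / n)
    return lam - z * se0, lam + z * se0

lo, hi = wald_ci([2, 1, 0], [2, 0, 0], 21)   # MIP vs axial CT, Sect. 7.5.1
print(round(lo, 3), round(hi, 3))            # -0.056 0.12
```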
7.3.3 Sample Size
To calculate the sample size required for testing the null hypothesis H 0: δ = 0 against the alternative hypothesis H 1: δ > 0, we only have to consider the following properties of the statistic T:
On the other hand, we have
where \((\bar{q}_{\mathit{Nk}},\,\bar{q}_{\mathit{Sk}})\), k = 0, …, K, are the asymptotic values of the maximum likelihood estimators \((\hat{q}_{\mathit{Nk}},\,\hat{q}_{\mathit{Sk}})\), k = 0, …, K. These asymptotic values are solutions to (7.17) and (7.18). From the aforementioned equations, the approximate sample size n required for 100(1 −β) power of a one-sided normal deviate test at α∕2 level is given by
When K = 1, the derived sample size formula agrees with that proposed by Nam [10]. The sample sizes required for 80 % power of a one-sided non-inferiority test at \(\alpha /2 = 2.5\,\%\) for K = 2, 3, Δ = 0.1, 0.05, and various values of \((q_{N3},\,q_{N2},\,q_{N1},\,q_{S3},\,q_{S2},\,q_{S1})\) with \(\pi _{N} -\pi _{S} =\lambda = 0\) are shown in Table 7.5.
7.4 Simulation
Here we present the results of simulation studies of the methods at the one-sided 2.5 % level for the case of K = 3 and sample sizes n = 25, 50 and 100, with 10,000 replicates. Simulation data were generated from a multinomial distribution under typical parameter values \((q_{N3},\,q_{N2},\,q_{N1},\,q_{S3},\,q_{S2},\,q_{S1})\) and non-inferiority margin Δ = 0.1. In assessing the performance of the methods based on majority votes, we transformed the simulation data using the following definitions: \(q_{N} = q_{N3} + q_{N2} + \frac{1}{3}q_{N1}\), \(q_{S} = q_{S3} + q_{S2} + \frac{1}{3}q_{S1}\).
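This majority-vote collapse of the simulation parameters is a direct transcription of the definitions above (function name ours):

```python
def majority_vote_probs(qN3, qN2, qN1, qS3, qS2, qS1):
    """Collapse the K = 3 difference probabilities into the majority-vote
    probabilities q_N and q_S: a difference of 3 or 2 raters always flips the
    majority, while a difference of 1 rater does so with weight 1/3."""
    return qN3 + qN2 + qN1 / 3, qS3 + qS2 + qS1 / 3

qN, qS = majority_vote_probs(0.10, 0.05, 0.06, 0.10, 0.05, 0.06)
print(round(qN, 2), round(qS, 2))  # 0.17 0.17
```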
7.4.1 Non-inferiority Test
We performed Monte Carlo simulation studies to assess the empirical size and power of the normal deviate test statistic \(Z_{\mathit{ND}}\), the Wald-type test statistic \(Z_{W}\) and the test statistic based on majority votes \(Z_{\mathit{MV}}\); \(Z_{\mathit{MV}}\) was calculated using the method of Nam [10] and Tango [17]. Table 7.6 presents the empirical sizes. For the sets of parameter values \((q_{N3},\,q_{N2},\,q_{N1},\,q_{S3},\,q_{S2},\,q_{S1})\) considered here, the empirical sizes of the normal deviate test \(Z_{\mathit{ND}}\) are generally closer to the nominal α∕2 level of 2.5 % than those of the Wald-type test \(Z_{W}\) or the majority-vote test \(Z_{\mathit{MV}}\). The empirical sizes of \(Z_{W}\) tend to be quite inflated, whereas those of \(Z_{\mathit{MV}}\) tend to be quite deflated. Table 7.7 presents the empirical powers under the alternative hypothesis H1: \(\pi _{N} =\pi _{S}\) for Δ = 0.1. The differences in power between \(Z_{\mathit{ND}}\) and \(Z_{W}\) are generally small; when the sample size is small, however, the empirical powers of \(Z_{W}\) are far greater than those of \(Z_{\mathit{ND}}\), reflecting its inflated size. The empirical powers of \(Z_{\mathit{MV}}\), on the other hand, are far smaller than those of \(Z_{\mathit{ND}}\) in all situations.
7.4.2 Confidence Interval
We performed Monte Carlo simulation studies to evaluate the coverage probabilities of the score-type confidence interval, the Wald-type confidence interval \(\mathit{CI}_{W}\) and the confidence interval based on majority votes \(\mathit{CI}_{\mathit{MV}}\); \(\mathit{CI}_{\mathit{MV}}\) was calculated using the method of Tango [17]. Table 7.8 shows the empirical coverage probabilities of the three 95 % confidence intervals under the hypothesis \(\pi _{N} -\pi _{S} =\lambda = -0.1\). Both the score-type and the Wald-type confidence intervals generally perform very well; however, when n = 25, the score-type confidence interval outperforms the Wald-type interval. The confidence interval based on majority votes, on the other hand, shows conservative behaviour.
7.5 Example
7.5.1 Study of Diagnostic Procedures for the Diagnosis of Oesophageal Carcinoma Infiltrating the Tracheobronchial Tree
Here, we consider the data presented by Rapp-Bernhardt et al. [13], who compared the sensitivities of axial computed tomography (CT) slices and minimal intensity projection (MIP) in 21 patients with oesophageal carcinoma infiltrating the tracheobronchial tree. The bronchoscopic findings were taken as the gold standard. Three radiologists, working independently of each other and blinded to the gold-standard findings, separately assessed the axial CT slices and the MIP images. In these diagnostic procedures, stenoses were localized and the degree of stenosis was assessed as in real bronchoscopy. The resulting types of matched observations were classified in a 4 × 4 contingency table for MIP versus axial CT slices, shown in Table 7.9 (similar to Table 7.3), where ‘+’ indicates a true positive and ‘−’ a false negative under the binary assessment in which 0–50 % of total occlusion was considered negative and 50–100 % positive. MIP is a reconstruction technique for producing three-dimensional images; MIP images make it easier to appreciate the condition of the whole tracheobronchial tree than axial CT slices. Therefore, we are interested in the non-inferiority of MIP to axial CT slices, with the non-inferiority margin set as Δ = 0.1. From Table 7.9, we have \(\tilde{p}_{3.} = 17/21\), \(\tilde{p}_{2.} = 0/21\), \(\tilde{p}_{1.} = 2/21\), \(\tilde{p}_{.3} = 14/21\), \(\tilde{p}_{.2} = 2/21\) and \(\tilde{p}_{.1} = 5/21\). Then, the sensitivities of MIP and axial CT slices are estimated as \(\tilde{\pi }_{MIP} = \left (17 + 2/3 \times 0 + 1/3 \times 2\right )/21 = 0.841\) and \(\tilde{\pi }_{CT} = \left (14 + 2/3 \times 2 + 1/3 \times 5\right )/21 = 0.810\), respectively. Moreover, we have \(\tilde{q}_{N3} = 0/21\), \(\tilde{q}_{N2} = (1 + 0)/21\), \(\tilde{q}_{N1} = (2 + 0 + 0)/21\), \(\tilde{q}_{S3} = 0/21\), \(\tilde{q}_{S2} = (0 + 0)/21\) and \(\tilde{q}_{S1} = (0 + 0 + 2)/21\).
Then, the difference in sensitivities between MIP and axial CT slices based on the three raters is \(\tilde{\lambda }_{K=3} = 0.032\), and the normal deviate test gives \(Z_{\mathit{ND}} = 1.753 \approx Z_{S}\) (one-sided p-value = 0.040). The score-type 95 % confidence interval is −0.141 to 0.181, whose lower limit is not greater than \(-\varDelta = -0.1\). These results suggest that the non-inferiority of MIP to axial CT slices cannot be claimed at the one-sided 2.5 % significance level. The Wald-type test, on the other hand, suggests non-inferiority, because \(Z_{W} = 3.358\) with one-sided p-value < 0.001 and the Wald-type 95 % confidence interval under the null hypothesis is −0.056 to 0.120. However, the simulation study suggests that the Wald-type result here is not reliable because of the test’s inflated empirical sizes at sample sizes as small as n = 21. The result of the normal deviate test, in turn, may or may not be reliable, because its empirical sizes for Δ = 0.1 and n = 25 were around 1.6–2.4 %.
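The arithmetic behind these point estimates can be checked in a few lines (marginal counts taken from Table 7.9 as quoted in the text):

```python
# Marginal counts from Table 7.9: number of subjects for which l of the
# 3 raters judged positive on MIP (new) and on axial CT (standard); n = 21.
n = 21
mip = {3: 17, 2: 0, 1: 2}
ct = {3: 14, 2: 2, 1: 5}

# A subject with l of 3 raters positive contributes l/3 to the sensitivity.
sens_mip = sum(l / 3 * c for l, c in mip.items()) / n
sens_ct = sum(m / 3 * c for m, c in ct.items()) / n

print(round(sens_mip, 3))            # 0.841
print(round(sens_ct, 3))             # 0.81
print(round(sens_mip - sens_ct, 3))  # 0.032
```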
7.5.2 Study of Diagnostic Procedures for the Diagnosis of Aneurysm in Patients with Acute Subarachnoid Hemorrhage
Jäger et al. [4] performed a blinded multi-rater study comparing magnetic resonance angiography (MRA) and digital subtraction angiography (DSA) in 34 prospectively enrolled patients who presented with acute subarachnoid hemorrhage (SAH). Two raters independently evaluated the MRA and DSA images. The presence of an aneurysm was rated on a 4-point ordinal scale (1, absent; 2, probably absent; 3, probably present; 4, definitely present). Additionally, all aneurysms for which the two raters had given different ratings on the 4-point scale were subsequently reviewed by consensus evaluation. Because the authors intended to study inter-rater and inter-procedure agreement, neither method was taken a priori as the gold standard. However, they reported the two raters’ evaluations of the MRA and DSA images together with details of the clinical follow-up of all patients. Therefore, we consider the difference in sensitivities between MRA and DSA on the basis of the data from the 27 patients with aneurysms among the patients with SAH. Data were analyzed on a per-patient basis, taking into account only the aneurysm with the highest rating on the 4-point scale in each patient. We assigned the rating of true positive (‘+’) for scores of 3 and 4 and false negative (‘−’) for scores of 1 and 2. The resulting types of matched observations based on the two independent raters and on the consensus evaluations were classified in 3 × 3 and 2 × 2 contingency tables, respectively (Tables 7.10 and 7.11). DSA is a procedure in which radiographic images of blood vessels filled with a contrast agent are digitized and then subtracted from images obtained before administration of the contrast agent; this increases the contrast between the vessels and the background. However, as a catheter (a long, thin, flexible tube) is inserted into an artery, DSA is considered invasive. MRA is a procedure for imaging blood vessels based on MRI.
Unlike DSA, which involves placing a catheter into the body, MRA is considered non-invasive. Therefore, we are interested in the non-inferiority of MRA to DSA, with the non-inferiority margin set as Δ = 0.1. From Table 7.10, based on the multiple raters, we have \(\tilde{p}_{2.} = 20/27\), \(\tilde{p}_{1.} = 5/27\), \(\tilde{p}_{.2} = 22/27\) and \(\tilde{p}_{.1} = 2/27\). Then, the sensitivities of MRA and DSA are estimated as \(\tilde{\pi }_{\mathit{MRA}} = \left (20 + 1/2 \times 5\right )/27 = 0.833\) and \(\tilde{\pi }_{\mathit{DSA}} = \left (22 + 1/2 \times 2\right )/27 = 0.852\), respectively. Moreover, we have \(\tilde{q}_{N2} = 1/27\), \(\tilde{q}_{N1} = (0 + 2)/27\), \(\tilde{q}_{S2} = 0/27\) and \(\tilde{q}_{S1} = (3 + 2)/27\). Then, the difference in sensitivities between MRA and DSA based on the two raters is \(\tilde{\lambda }_{K=2} = -0.019\), and the normal deviate test gives \(Z_{\mathit{ND}} = 1.393 \approx Z_{S}\) (one-sided p-value = 0.082). The score-type 95 % confidence interval is −0.141 to 0.144, whose lower limit is not greater than \(-\varDelta = -0.1\). Furthermore, the Wald-type test gives \(Z_{W} = 1.397\) (one-sided p-value = 0.081), and the Wald-type 95 % confidence interval under the null hypothesis is −0.139 to 0.102. From Table 7.11, based on the consensus evaluations, the sensitivities of MRA and DSA are estimated as \(\tilde{\pi }_{\mathit{MRA}_{ \mathit{CE}}} = 0.926\) and \(\tilde{\pi }_{\mathit{DSA}_{ \mathit{CE}}} = 0.889\), respectively. Then, the difference in sensitivities based on the consensus evaluations is \(\tilde{\lambda }_{\mathit{CE}} = 0.037\), and the score test derived from Nam [10] and Tango [17] gives \(Z_{S} = 1.510\) (one-sided p-value = 0.066); the score-based 95 % confidence interval derived from Tango [17] is −0.150 to 0.227. These results suggest that the non-inferiority of MRA to DSA cannot be claimed at the one-sided 2.5 % significance level.
Note, however, that although the difference in sensitivities based on the two raters, \(\tilde{\lambda }_{K=2}\), is negative, the difference based on the consensus evaluations, \(\tilde{\lambda }_{\mathit{CE}}\), is positive. We consider that bias from the consensus evaluations caused this discrepancy.
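The sign flip between the two analyses is easy to verify numerically (counts taken from Table 7.10 as quoted in the text; the consensus sensitivities are those quoted from Table 7.11):

```python
n = 27
# Difference-by-k counts from Table 7.10 (K = 2 raters):
nN = [2, 1]   # nN1 = 2, nN2 = 1 (more raters positive on MRA than on DSA)
nS = [5, 0]   # nS1 = 5, nS2 = 0 (more raters positive on DSA than on MRA)

lam_two_raters = sum((k / 2) * (nN[k - 1] - nS[k - 1]) for k in (1, 2)) / n
lam_consensus = 0.926 - 0.889   # consensus sensitivities from Table 7.11

print(round(lam_two_raters, 3))  # -0.019: MRA slightly worse
print(round(lam_consensus, 3))   # 0.037: consensus flips the sign
```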
7.6 Conclusion
A non-inferiority trial of diagnostic procedures is generally evaluated on the basis of the results from multiple independent raters who are independent of the study centers. However, consensus evaluations or majority votes used to handle the multiple results as if from a single rater are not recommended, because of bias or loss of information [1, 2, 12]. Therefore, it is important that all of the results from the multiple raters be used appropriately in the statistical analysis. The methods addressed in this chapter are available for inference of the difference in correlated proportions between two diagnostic procedures based on multiple raters. In this chapter, we introduced the methods in terms of sensitivity; however, they can equally be applied to inference of the difference in specificity. Furthermore, if we need to consider simultaneous non-inferiority of a new diagnostic procedure to the standard procedure in both sensitivity and specificity, we can extend the methods using the approach proposed by Lu et al. [8], who extended the score test of Nam [10] and Tango [17] for a single proportion to a simultaneous test for both sensitivity and specificity based on the principle of the intersection-union test.
We carried out Monte Carlo simulation studies to evaluate the performance of these methods. The normal deviate test for non-inferiority was shown to have empirical sizes closer to the nominal one-sided 2.5 % significance level than the Wald-type test or the test based on majority votes. Moreover, when the sample size was small, the score-type confidence interval performed better than the Wald-type confidence interval under the null hypothesis in terms of coverage probability. The confidence interval based on majority votes, on the other hand, showed conservative behaviour.
When we plan a clinical trial to compare the efficacies of two diagnostic procedures, it is very important to take the study design into account. The methods addressed in this chapter are useful only for a study design in which both diagnostic procedures are applied to each subject and all raters evaluate all subjects, that is, a paired-patient, paired-rater design. Zhou et al. [18] provide detailed information on study designs for diagnostic procedures. Moreover, note that these methods may not be appropriate for clustered matched-pair data. Schwenke and Busse [16] proposed a Wald-type test for clustered matched-pair data based on multiple raters; however, their test is a test for superiority and cannot be used as a test for non-inferiority. If the results of the two diagnostic procedures are evaluated by a single rater, several non-inferiority tests for clustered matched-pair data are available [3, 5, 11]. We therefore expect that a non-inferiority test for clustered matched-pair data based on the results from multiple raters will be developed. If there are missing data among the results from the multiple raters for some subjects, some kind of imputation method would be required, which is a topic for future research. Furthermore, if a qualitative interaction between the two diagnostic procedures and the multiple raters is demonstrated, these methods cannot be applied to those data. This problem, however, could probably be addressed by non-statistical means, for example, by training all of the raters on the judgment criteria for the diagnostic procedures before the start of evaluation.
7.7 Program
The R programs for the methods of this chapter can be downloaded at http://www.medstat.jp/downloadsaeki.html.
References
Guidance for industry. Developing medical imaging drugs and biological products. Part 3: design, analysis, and interpretation of clinical studies (2004). URL http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm071604.pdf. Cited 21 May 2012
Appendix 1 to the guideline on clinical evaluation of diagnostic agents (CPMP/EWP/1119/98 REV. 1) on imaging agents (Doc. Ref. EMEA/CHMP/EWP/321180/2008) (2009). URL http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003581.pdf. Cited 21 May 2012
Durkalski, V., Palesch, Y., Lipsitz, S., Rust, P.: Analysis of clustered matched-pair data for a non-inferiority study design. Statistics in Medicine 22, 279–290 (2003). DOI 10.1002/sim.1385
Jäger, H., Mansmann, U., Hausmann, O., Partzsch, U., Moseley, I., Taylor, W.: MRA versus digital subtraction angiography in acute subarachnoid haemorrhage: a blinded multireader study of prospectively recruited patients. Neuroradiology 42, 313–326 (2000)
Jin, H., Lu, Y.: Comparison of correlated proportions based on paired binary data from clustered samples. Journal of Statistical Planning and Inference 139, 4206–4212 (2009). DOI 10.1016/j.jspi.2009.06.005
Lehr, R., Kashanian, F.: Three persistent issues in analysis of clinical trials involving diagnostic contrast agents. Drug Information Journal 43, 525–532 (2009). DOI 10.1177/009286150904300501
Lu, Y., Bean, J.: On the sample size for one-sided equivalence of sensitivities based upon McNemar’s test. Statistics in Medicine 14, 1831–1839 (1995). DOI 10.1002/sim.4780141611
Lu, Y., Jin, H., Genant, H.: On the non-inferiority of a diagnostic test based on paired observations. Statistics in Medicine 22, 3029–3044 (2003). DOI 10.1002/sim.1569
McNemar, Q.: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 153–157 (1947). DOI 10.1007/BF02295996
Nam, J.: Establishing equivalence of two treatments and sample size requirements in matched-pairs design. Biometrics 53, 1422–1430 (1997)
Nam, J., Kwon, D.: Non-inferiority tests for clustered matched-pair data. Statistics in Medicine 28, 1668–1679 (2009). DOI 10.1002/sim.3580
Obuchowski, N., Lieber, M.: Statistics and methodology. Skeletal Radiology 37, 393–396 (2008). DOI 10.1007/s00256-008-0448-1
Rapp-Bernhardt, U., Welte, T., Budinger, M., Bernhardt, T.: Comparison of three-dimensional virtual endoscopy with bronchoscopy in patients with oesophageal carcinoma infiltrating the tracheobronchial tree. The British Journal of Radiology 71, 1271–1278 (1998)
Saeki, H., Tango, T.: Non-inferiority test and confidence interval for the difference in correlated proportions in diagnostic procedures based on multiple raters. Statistics in Medicine 30, 3313–3327 (2011). DOI 10.1002/sim.4364
Schouten, H.: Estimating kappa from binocular data and comparing marginal probabilities. Statistics in Medicine 12, 2207–2217 (1993). DOI 10.1002/sim.4780122306
Schwenke, C., Busse, R.: Analysis of differences in proportions from clustered data with multiple measurements in diagnostic studies. Methods of Information in Medicine 46, 548–552 (2007). DOI 10.1160/ME0433
Tango, T.: Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Statistics in Medicine 17, 891–908 (1998). DOI 10.1002/(SICI)1097-0258(19980430)17:8<891::AID-SIM780>3.0.CO;2-B
Zhou, X., Obuchowski, N., McClish, D.: Statistical Methods in Diagnostic Medicine, 2nd edn. Wiley & Sons, New York (2011)
© 2014 Springer-Verlag Berlin Heidelberg
Saeki, H., Tango, T. (2014). Statistical Inference for Non-inferiority of a Diagnostic Procedure Compared to an Alternative Procedure, Based on the Difference in Correlated Proportions from Multiple Raters. In: van Montfort, K., Oud, J., Ghidey, W. (eds) Developments in Statistical Evaluation of Clinical Trials. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55345-5_7
Print ISBN: 978-3-642-55344-8
Online ISBN: 978-3-642-55345-5