Introduction

Bioanalytical methods of quantitatively measuring chemical compounds and their metabolites in biological samples must be validated to ensure that such methods yield reliable results [1]. To achieve necessary analytical throughput, bioanalytical methods are often transferred from one laboratory to another, and sometimes the samples from a clinical or animal study are analyzed using different methods or at different laboratories.

To ensure that comparable results can be achieved between two laboratories or methods, it is important to conduct cross-validation [1] or comparability studies [2] before a new laboratory or method is permitted to analyze the biological samples. There should be three objectives of the experimental design in such studies: to establish statistical equivalence [2] between the two laboratories or methods, to identify the sources of any differences, and to resolve these differences. When conducting such studies, it is important to maintain the traceability of the results through proper documentation and to determine the uncertainties (variance) in the measurement [3].

Historically, several techniques have been used to compare the data generated by two laboratories or methods using either paired- or unpaired-sample analysis (e.g., the general t-test, Youden matched pair plots [4], and regression analysis). However, each of these techniques has certain drawbacks [5, 6]. For example, Youden plots only provide visual, qualitative assessment, and regression analysis is based on the false assumption that the errors in the x-values are negligible and that all of the error resides in the y-values. For the general t-test, the most widely used technique, the hypotheses are:

$$\mathrm{H}_0: \mu_2/\mu_1 = 1\ (\text{there is no bias}) \quad \text{versus} \quad \mathrm{H}_1: \mu_2/\mu_1 \neq 1\ (\text{bias exists})$$
(1)

If there is insufficient evidence to reject the null hypothesis, the two means are declared equal. It has been shown [6] that this test has the undesirable property of penalizing higher precision. If the sample size is small and the variation within the two data sets is large relative to the difference between the two means, the t-test may fail to detect a real difference; conversely, when the precision of each data set is very high, differences of little practical importance may be declared significant.

Hartmann et al. [7] compared the general t-test with the interval hypothesis test for testing the equivalence of two laboratories or two methods, and concluded that the latter is preferable for laboratory or method validation purposes because the error corresponding to the β-error of the general t-test (accepting a biased laboratory or method as unbiased) can be controlled. The interval hypothesis test is based on Schuirmann's [6] two one-sided t-tests (TOST), which have become the standard approach in bioequivalence studies. This approach requires the predetermination of an acceptance interval (lower limit $\theta_1$ and upper limit $\theta_2$) and then involves testing whether the measured bias lies within the acceptance interval. In statistical terms, the hypotheses to be tested are:

$$\mathrm{H}_0: \mu_2/\mu_1 \leq \theta_1 \ \text{or}\ \mu_2/\mu_1 \geq \theta_2 \quad \text{versus} \quad \mathrm{H}_1: \theta_1 < \mu_2/\mu_1 < \theta_2$$
(2)

The hypotheses can also be expressed as two one-sided tests:

$$\mathrm{H}_{01}: \mu_2/\mu_1 \leq \theta_1 \quad \text{versus} \quad \mathrm{H}_{11}: \mu_2/\mu_1 > \theta_1$$
(3)
$$\mathrm{H}_{02}: \mu_2/\mu_1 \geq \theta_2 \quad \text{versus} \quad \mathrm{H}_{12}: \mu_2/\mu_1 < \theta_2$$
(4)

Compared to the general t-test, the null and alternative hypotheses are reversed. Consequently, the definitions of the type I and type II errors in the interval hypothesis test are exactly switched with respect to the general t-test. The type I error is now the probability (α) of erroneously accepting equivalence when the two laboratories or methods are in fact not equivalent; the type II error is the probability (β) of erroneously accepting nonequivalence when they are in fact equivalent. Acceptance of the null hypotheses leads to the conclusion that the bias is not acceptable, and rejection of the null hypotheses leads to the conclusion that the bias is acceptable. This test has the advantage of limiting the risk of erroneously accepting a new laboratory or new method as unbiased (when it is actually biased) to a very low level.

However, close examination of the algorithm and procedures described by Hartmann et al. [7] led us to believe that they are only suitable for testing two means from two independent sets of samples (unpaired-sample analysis). An unpaired-sample experimental design in a bioanalytical cross-validation study may confound this statistical test because of a possibly large pooled variance that is actually due to intersample variability, especially for incurred biological samples obtained from clinical or animal studies. We believe that this problem can be overcome by applying paired-sample analysis.

In this manuscript, we present a modified-interval hypothesis testing procedure and practical experimental design (based on paired-sample analysis) for testing the equivalence between two bioanalytical laboratories or methods. The example used is the transfer of an LC–MS/MS method for S-phenylmercapturic acid (S-PMA), a benzene metabolite in human urine and a biomarker for exposure to benzene [8, 9]. Both spiked quality control (QC) and incurred human samples were included in this study.

Experimental part

The interval hypothesis test procedure

The acceptance interval ($\theta_1$ and $\theta_2$) should be defined before the experiments begin. It is now widely accepted [1] that a validated bioanalytical method should have less than ±15 % bias at concentration levels other than the lower limit of quantification (LLOQ), where at most ±20 % bias is acceptable; the $\theta_1$ and $\theta_2$ values adopted in this study were therefore 0.85 and 1.15, respectively (although we typically find these limits reasonable, the acceptance interval should be defined based on one's own experience with the method). A paired-sample design was employed and the concentration ratio for each sample was calculated. Let \(R = \mathrm{Conc}_{\mathrm{Lab2}}(\text{measured})/\mathrm{Conc}_{\mathrm{Lab1}}(\text{measured})\) for each sample and \(\overline{R}_{\mathrm{Lab2/Lab1}}\) be the mean ratio of each sample set, while $R_0$ is the true ratio between the two laboratories (Lab2/Lab1) for the entire sample population. The hypotheses to be tested can be stated as:

$$\mathrm{H}_0: R_0 \leq 0.85 \ \text{or}\ R_0 \geq 1.15 \quad \text{versus} \quad \mathrm{H}_1: 0.85 < R_0 < 1.15$$
(5)

or:

$$\mathrm{H}_{01}: R_0 \leq 0.85 \quad \text{versus} \quad \mathrm{H}_{11}: R_0 > 0.85$$
(6)

and

$$\mathrm{H}_{02}: R_0 \geq 1.15 \quad \text{versus} \quad \mathrm{H}_{12}: R_0 < 1.15$$
(7)

H0 will be rejected if

$$t_{\mathrm{cal}(1)} = \frac{\overline{R}_{\mathrm{Lab2/Lab1}} - 0.85}{SD_R/\sqrt{n}} > t_{\alpha,\,n-1}$$
(8)

where \(t_{\alpha,\,n-1}\) is the 1−α quantile of the t distribution with n−1 degrees of freedom and \(SD_R\) is the standard deviation of the ratios in the sample set; or, by rearrangement, if

$$\overline{R}_{\mathrm{Lab2/Lab1}} - t_{\alpha,\,n-1}\,SD_R/\sqrt{n} > 0.85$$
(9)

and if

$$t_{\mathrm{cal}(2)} = \frac{\overline{R}_{\mathrm{Lab2/Lab1}} - 1.15}{SD_R/\sqrt{n}} < -t_{\alpha,\,n-1}$$
(10)

or by rearrangement, if

$$\overline{R}_{\mathrm{Lab2/Lab1}} + t_{\alpha,\,n-1}\,SD_R/\sqrt{n} < 1.15$$
(11)

Notice that \(\overline{R}_{\mathrm{Lab2/Lab1}} \pm t_{\alpha,\,n-1}\,SD_R/\sqrt{n}\) is the confidence interval of the mean ratio at the (1−2α)×100 % level. Thus, H0 will be rejected at a significance level of α if the (1−2α)×100 % confidence interval lies entirely between 0.85 and 1.15.
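The CI-based decision rule above can be sketched in a few lines of Python. The ratio data below are illustrative placeholders (not the study data), and the critical value \(t_{0.025,11} = 2.201\) is hardcoded from a t-table rather than computed:

```python
import math
import statistics

def equivalence_ci(ratios, t_crit, theta1=0.85, theta2=1.15):
    """Paired-sample interval hypothesis test via the confidence
    interval (Eqs. 9 and 11): nonequivalence is rejected if the
    (1 - 2*alpha)*100% CI of the mean ratio lies entirely inside
    the acceptance interval (theta1, theta2)."""
    n = len(ratios)
    mean_r = statistics.mean(ratios)
    sd_r = statistics.stdev(ratios)            # SD of the individual ratios
    half_width = t_crit * sd_r / math.sqrt(n)
    lower, upper = mean_r - half_width, mean_r + half_width
    return lower, upper, (lower > theta1 and upper < theta2)

# Illustrative ratios for 12 hypothetical sample pairs (not the study data)
ratios = [0.93, 0.96, 0.91, 0.95, 0.98, 0.90, 0.94, 0.97, 0.92, 0.95, 0.93, 0.96]
t_crit = 2.201                                 # t_{0.025, 11} from a t-table
lo, hi, equivalent = equivalence_ci(ratios, t_crit)
print(f"95% CI of mean ratio: ({lo:.3f}, {hi:.3f}); equivalent: {equivalent}")
```

Because the same critical value appears in both rearranged inequalities, checking the two CI endpoints is equivalent to performing both one-sided tests.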

In this study, basic data calculations were performed with Microsoft Excel 2002. The interval hypothesis test and the statistical power were computed with SAS (Statistical Analysis System) software.

The LC–MS/MS method

An LC–MS/MS method for S-PMA in human urine was initially developed and validated at Lab1, the originating laboratory. The full details of the method will be published separately. Briefly, the method utilized 1 mL of a human urine sample and a solid phase extraction procedure. The LC–MS/MS system included a Thermo Hypersil BioBasic AX column (Thermo Electron Corporation, Waltham, MA, USA) with a guard column and a PE Sciex API 4000 MS detector (Applied Biosystems, Foster City, CA) with an electrospray ionization interface. Negative ions were monitored in the multiple reaction monitoring (MRM) mode.

After the method had been validated and used to analyze approximately 1,800 human urine samples at Lab1, it was transferred to Lab2 (accepting laboratory) to accommodate the need for higher throughput. The method used at Lab2 was identical to that at Lab1 except that the final step of the extraction procedure was slightly modified to allow for differences in available analytical equipment.

Cross-validation experimental design

The method transfer was conducted according to a method transfer protocol which outlined the sample types and source and the cross-validation acceptance criteria. Lab2 performed the initial between-site qualification batches to re-establish the lower and upper limits of quantification, interbatch accuracy and precision, and the linear range. The performance characteristics at both laboratories are summarized in Table 1.

Table 1 Validation summary for the two laboratories

To establish equivalence between the two laboratories, a paired-sample design was employed: paired analyses of both spiked quality control (QC) samples at three fixed concentrations (low, medium, and high), and human incurred samples. QC samples at each concentration level were prepared by Lab1 and a portion of each sample was sent to Lab2. The nominal values for each QC level were 89.0, 2,140, and 15,100 pg mL−1, respectively. Each laboratory measured QC samples at each concentration level in twelve replicates in paired fashion. In addition, 12 incurred human urine samples were collected at Lab1 and a portion of each sample was also sent to Lab2. The ratio of measured concentrations was then calculated for each sample pair. The interval hypothesis test was performed on each QC set as well as the incurred human sample set. Equivalence between the two laboratories was declared at the α=0.025 level if the 95 % confidence interval of the mean ratio occurred entirely between 0.85 and 1.15 for all QC levels and incurred human samples.

Results and discussion

Sample selection

The calibration curve of a bioanalytical method should represent the normal concentration distribution of the real biological samples. To establish equivalence between the two laboratories, we believe that at least three concentration levels which represent the entire range of the calibration curve should be studied: one near the lower boundary of the curve, one near the center, and one near the upper boundary of the curve. Real biological samples that represent the normal concentration distribution of the analyte should also be included. The selection of QC levels and incurred human samples in this study was based on these considerations. The paired-sample design is based on the properties of the interval hypothesis test, which will be discussed later in the section on paired versus unpaired sample analysis. The analytical results from the two laboratories are shown in Table 2.

Table 2 The analytical results from a cross-validation study

Type I and II errors, power and sample size

Similar to bioequivalence studies [10, 11], the primary concern of bioanalytical chemists should be the control of α (equivalent to β in the general t-test), because results for clinical samples generated by nonequivalent laboratories or methods that are erroneously taken to be equivalent will cause comparability problems and confusion in data interpretation. The α level should be defined a priori, as should the acceptance interval ($\theta_1$ and $\theta_2$). In this study, we predefined in the cross-validation protocol that the 95 % confidence interval of the mean ratio must lie entirely between 0.85 and 1.15 for all QC levels and incurred human samples; the risk of wrongly concluding that the two laboratories are equivalent is therefore limited to no more than 2.5 %. The actual probability of this risk is represented by the p-value, and can be calculated with SAS software using the two one-sided tests approach (Eqs. 6 and 7). The p-value of the interval hypothesis test (Eqs. 5, 6 and 7) is the maximum of the p-values of the two one-sided tests. For example, as shown in Table 3, the p-values for the two one-sided tests for the incurred human sample set were 0.0008 and <0.0001, respectively; the p-value of the interval hypothesis test is therefore 0.0008, which is much lower than the predefined acceptance limit of 0.025. To reject H0, both $p_1$ and $p_2$ must be less than the α-value. In this example, H0 (nonequivalence) was rejected and H1 (equivalence) was accepted at α=0.025. The risk of declaring equivalence between the two laboratories when they are in fact not equivalent was extremely small for all of the sample sets tested.
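Computing exact p-values requires a t-distribution CDF (as provided by SAS or similar software), but the decision itself can be sketched with the t statistics alone, since requiring both one-sided p-values to fall below α is equivalent to requiring both statistics to clear \(\pm t_{\alpha,\,n-1}\). The ratio data below are illustrative placeholders, not the study data:

```python
import math
import statistics

def tost_ratio(ratios, t_crit, theta1=0.85, theta2=1.15):
    """Two one-sided tests (Eqs. 8 and 10) on paired concentration
    ratios. Rejecting both H01 and H02 at level alpha (i.e., both
    one-sided p-values below alpha) is equivalent to both t
    statistics clearing the critical value t_{alpha, n-1}."""
    n = len(ratios)
    mean_r = statistics.mean(ratios)
    se = statistics.stdev(ratios) / math.sqrt(n)
    t1 = (mean_r - theta1) / se   # reject H01 if t1 >  t_crit
    t2 = (mean_r - theta2) / se   # reject H02 if t2 < -t_crit
    return t1, t2, (t1 > t_crit and t2 < -t_crit)

# Illustrative ratios for 12 hypothetical sample pairs
ratios = [0.92, 0.95, 0.89, 0.97, 0.93, 0.96, 0.91, 0.94, 0.90, 0.95, 0.93, 0.92]
t1, t2, reject_h0 = tost_ratio(ratios, t_crit=2.201)   # t_{0.025, 11}
print(f"t1 = {t1:.2f}, t2 = {t2:.2f}; equivalence accepted: {reject_h0}")
```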

Table 3 The interval hypothesis test* results for mean ratios

For the incurred sample set, we plotted the ratio versus the mean concentration from the two laboratories (not shown here), which showed no correlation. This indicated that the two laboratories were not systematically different at various points in the concentration range.

A basic question bioanalytical chemists often ask when they design experiments is: how many samples need to be analyzed in order to make the conclusions of the experiment statistically meaningful? Sample size (number of sample pairs in our case) is one of the key factors that affects the statistical power, and should be estimated based on the desired power and estimated variability. The power (1−β) for the interval hypothesis test should be:

$$\text{Power}\left(\overline{R}_{\mathrm{Lab2/Lab1}}\right) = P\left\{t_{\mathrm{cal}(1)} \geq t_{\alpha,\,n-1} \text{ and } t_{\mathrm{cal}(2)} \leq -t_{\alpha,\,n-1}\right\} = P\left\{0.85 + t_{\alpha,\,n-1}\,\sigma_R/\sqrt{n} < \overline{R}_{\mathrm{Lab2/Lab1}} < 1.15 - t_{\alpha,\,n-1}\,\sigma_R/\sqrt{n}\right\}$$
(12)

where P denotes the probability and $\sigma_R$ is the standard deviation of the ratio for the entire sample population.

Using the same procedure described by Chow and Liu [12, 13], we obtained Eqs. 13 and 14, which can be used to estimate the minimum number of sample pairs needed to achieve the desired power:

$$n \geq \left(t_{\alpha,\,n-1} + t_{\beta,\,n-1}\right)^2 \left(\frac{CV \times \overline{R}_{\mathrm{Lab2/Lab1}}}{1.15 - \overline{R}_{\mathrm{Lab2/Lab1}}}\right)^2 \quad \text{for } 1.00 < \overline{R}_{\mathrm{Lab2/Lab1}} < 1.15$$
(13)

and

$$n \geq \left(t_{\alpha,\,n-1} + t_{\beta,\,n-1}\right)^2 \left(\frac{CV \times \overline{R}_{\mathrm{Lab2/Lab1}}}{\overline{R}_{\mathrm{Lab2/Lab1}} - 0.85}\right)^2 \quad \text{for } 0.85 < \overline{R}_{\mathrm{Lab2/Lab1}} < 1.00$$
(14)

where CV is the coefficient of variation of the sample set and \(t_{\beta,\,n-1}\) is the 1−β quantile of the t distribution with n−1 degrees of freedom.
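Because \(t_{\alpha,\,n-1}\) itself depends on the unknown n, a common first pass is to substitute standard normal quantiles for the t quantiles in Eqs. 13 and 14 and then re-check the result against the exact t values. A hedged sketch of this approximation (which gives a slightly smaller n than the exact t-based calculation, since z quantiles are smaller than t quantiles):

```python
import math

# z quantiles stand in for the t quantiles in Eqs. 13 and 14 as a
# first-pass approximation; the resulting n should be re-checked
# against the exact t_{alpha,n-1} and t_{beta,n-1} values.
Z_ALPHA = 1.960   # alpha = 0.025 (one-sided)
Z_BETA = 0.842    # beta  = 0.20  (80% power)

def min_sample_pairs(cv, mean_ratio, theta1=0.85, theta2=1.15):
    """Estimate the minimum number of sample pairs (Eqs. 13/14),
    using whichever acceptance limit is nearer to the expected
    mean ratio."""
    if not theta1 < mean_ratio < theta2:
        raise ValueError("mean ratio must lie inside the acceptance interval")
    margin = min(theta2 - mean_ratio, mean_ratio - theta1)
    n = (Z_ALPHA + Z_BETA) ** 2 * (cv * mean_ratio / margin) ** 2
    return math.ceil(n)

print(min_sample_pairs(cv=0.07, mean_ratio=0.94))  # prints 5
```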

Figure 1 illustrates the effect of the number of sample pairs, the CV, and the ratio on the power at the α=0.025 level, based on the paired-sample design (when θ 1 and θ 2 were set at 0.85 and 1.15, respectively). The power increases as the CV decreases or the number of sample pairs increases.

Fig. 1

The effect of the number of sample pairs, CV (%) and mean ratio on the power at the α=0.025 level. The top figure is for a mean ratio of 0.90, and the bottom figure is for a mean ratio of 1.10

The relationships shown in Fig. 1 should be used to guide experimental planning. However, because computing the power requires computer programming, and because the main concern of bioanalytical chemists in cross-validation is the control of α, a post-study calculation of the actual power does not have to be part of the test. If the confidence interval falls within the acceptance interval after the experiment, a simple way to verify whether the desired power has been achieved is to compare the number of sample pairs calculated for the desired power with the actual number of sample pairs: if the calculated number is less than the actual number, the desired power has been achieved. It is common in clinical studies to set β to less than 0.2, which means that the power of the statistical test is greater than 80 %. For the incurred sample set in this study, for example, the estimated number of sample pairs for 80 % power is \(n = (2.20 + 0.88)^2 \times [0.07/(0.94 - 0.85)]^2 \approx 6\). Since the actual number was 12, the power must have exceeded 80 %; the power for this sample set is 98 %.
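The power can also be approximated without closed-form expressions by simulation. The sketch below assumes normally distributed ratios and uses the same parameters as the worked example above (n = 12, mean ratio 0.94, CV 7 %, \(t_{0.025,11} = 2.201\)); with these settings the estimate comes out near the 98 % figure quoted above:

```python
import math
import random
import statistics

def simulated_power(n, true_ratio, cv, t_crit, trials=20000, seed=1):
    """Monte Carlo estimate of the power of the paired interval
    hypothesis test, assuming normally distributed ratios with
    standard deviation cv * true_ratio."""
    rng = random.Random(seed)
    sd = cv * true_ratio
    hits = 0
    for _ in range(trials):
        r = [rng.gauss(true_ratio, sd) for _ in range(n)]
        mean_r = statistics.mean(r)
        half_width = t_crit * statistics.stdev(r) / math.sqrt(n)
        if mean_r - half_width > 0.85 and mean_r + half_width < 1.15:
            hits += 1
    return hits / trials

power = simulated_power(n=12, true_ratio=0.94, cv=0.07, t_crit=2.201)
print(f"estimated power: {power:.2f}")
```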

Each laboratory or method is associated with two types of errors: systematic and random errors. A ratio that is close to the acceptance boundary is an indication of systematic bias between the two laboratories or methods. One should first focus on finding the sources of systematic bias. For example, comparing the calibration standards, stock solutions and working solutions in both laboratories often helps resolve the problem. Sometimes in practice it is necessary to conduct pilot experiments with a limited number of samples to generate preliminary data in order to guide the full-scale comparison. The CV of the ratio is a function of random errors of the two laboratories or methods being compared. Assuming that there is no variation in sampling, the CV of the ratio can be estimated using Eq. 15 from the precision data of each laboratory or method for a given concentration level, which should have been generated during the method validation by repeated measurements of QC samples at different concentration levels.

$$CV = \frac{\sigma_R}{R_0} \approx \sqrt{\left(\frac{\sigma_{\mathrm{Lab1}}}{\mathrm{Conc}_{\mathrm{Lab1}}}\right)^2 + \left(\frac{\sigma_{\mathrm{Lab2}}}{\mathrm{Conc}_{\mathrm{Lab2}}}\right)^2} = \sqrt{CV_{\mathrm{Lab1}}^2 + CV_{\mathrm{Lab2}}^2}$$
(15)

Note that Eq. 15 does not apply to the incurred human samples, which may vary over a wide concentration range. Because the accuracy and precision at different concentration levels are often different for a laboratory or method, the incurred samples should have an approximately even distribution over the entire concentration range. If the incurred samples are not distributed evenly, both the ratio and CV may be skewed.
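Eq. 15 amounts to adding the two laboratories' CVs in quadrature. For example, hypothetical within-laboratory precisions of 5 % and 6 % would predict a ratio CV of about 7.8 %:

```python
import math

def ratio_cv(cv_lab1, cv_lab2):
    """Eq. 15: approximate CV of the concentration ratio from the
    within-laboratory CVs at a given concentration level, assuming
    independent random errors and no sampling variation."""
    return math.sqrt(cv_lab1 ** 2 + cv_lab2 ** 2)

# Illustrative within-laboratory precisions of 5% and 6%
print(f"expected CV of ratio: {ratio_cv(0.05, 0.06):.1%}")
```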

Paired versus unpaired sample analysis

The interval hypothesis test procedures used to test either the ratio of or the difference (δ) between two means from two independent random samples (unpaired-sample analysis) have been described by previous authors [7, 14, 15]. When testing the ratio of two means, the rejection criteria for the null hypotheses for two independent series of data with equal variance \((\sigma_1^2 = \sigma_2^2)\) are:

$$t_{\mathrm{cal}(1)} = \frac{\overline{x}_2 - \theta_1\,\overline{x}_1}{\sqrt{s_p^2\left(1/n_2 + \theta_1^2/n_1\right)}} \geq t_{\alpha,\,n_1+n_2-2}$$
(16)

where \(s_p^2\) is the pooled variance of the two data sets:

$$s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}$$
(17)

and

$$t_{\mathrm{cal}(2)} = \frac{\theta_2\,\overline{x}_1 - \overline{x}_2}{\sqrt{s_p^2\left(1/n_2 + \theta_2^2/n_1\right)}} \geq t_{\alpha,\,n_1+n_2-2}$$
(18)

It should be noted that an unpaired-sample design in bioanalytical cross-validation studies may confound this statistical test because of a very large pooled variance that actually arises from intersample variability, especially for incurred biological samples. Two actually equivalent laboratories or methods may be judged nonequivalent unless the sample size is extremely large. Using the paired-sample design and testing the ratios removes such variability and therefore increases the sensitivity of the test. For similar reasons, it is inappropriate to test the mean difference (δ) for incurred human samples in paired-sample analysis. For example, for the incurred human sample set in this study, \(\overline{\delta} = -308.8\), SD = 469.1, CV = 151.9 %, 95 % CI = (−606.9, −10.8), \(t_1 = 1.83\), and \(t_2 = -6.39\) (compared to \(t_{\alpha,\,n-1} = 2.20\)). Notice that the variability in the mean difference \(\overline{\delta}\) of this sample set was very high (CV = 151.9 %), mainly because of the variable sample concentrations. The test performed on the mean difference therefore led to the wrong conclusion that the two laboratories were not equivalent, because it was confounded by the variability of the incurred samples themselves.
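This confounding effect can be illustrated with a small simulation (all values hypothetical): incurred samples spanning a wide concentration range produce a very large pooled CV in the unpaired comparison, while the paired ratios retain only the analytical error:

```python
import math
import random
import statistics

# Simulated incurred samples spanning a wide concentration range,
# measured by two laboratories with 5% random error and a 6% relative
# bias at Lab2 (all values illustrative).
rng = random.Random(7)
true_conc = [rng.uniform(100, 15000) for _ in range(12)]
lab1 = [c * rng.gauss(1.00, 0.05) for c in true_conc]
lab2 = [c * rng.gauss(0.94, 0.05) for c in true_conc]

# Paired analysis: forming the ratio cancels the inter-sample spread
ratios = [b / a for a, b in zip(lab1, lab2)]
cv_ratio = statistics.stdev(ratios) / statistics.mean(ratios)

# Unpaired analysis: the pooled variance is dominated by the
# sample-to-sample concentration spread, not by laboratory error
pooled_sd = math.sqrt((statistics.variance(lab1) + statistics.variance(lab2)) / 2)
cv_unpaired = pooled_sd / statistics.mean(lab1)

print(f"CV of paired ratios:    {cv_ratio:.1%}")
print(f"CV of unpaired pooling: {cv_unpaired:.1%}")
```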

Some potential issues

Compared to unpaired-sample analysis, paired-sample analysis has fewer degrees of freedom ($n-1$ versus $n_1+n_2-2$), which seemingly implies lower power. However, with good planning the desired power can still be achieved, as discussed above.

Another issue is that the statistical testing procedure described herein assumes that the ratios are normally distributed about the mean, which was the case in this study. The normality of the sample distribution should be tested before applying the algorithms described here, because departures from normality may affect the sensitivity of the test. For non-normal ratio data, the procedure is still applicable if the sample size is sufficiently large and the variability is small; log-transformation may also be considered.

Sometimes, especially when the sample size is small, outliers (ratios significantly larger or smaller than the others in a data set) may occur, and the outcome of the analysis can be influenced by how they are treated. Potential outliers can be identified using graphical diagnostic tools (e.g., the box-and-whisker plot), statistical procedures such as Dixon's test or Grubbs' test [5], or simply by finding values that lie more than three standard deviations from the mean. We recommend that the root causes be investigated before a decision is made to exclude outliers from the statistical analysis. The exclusion of an outlier may be justified if a review of the documentation clearly indicates an error in the sample processing or analysis. One should bear in mind that excluding outliers reduces the sample size and therefore affects both types of statistical error. An alternative is to reanalyze the outlying samples in either or both laboratories using both methods. If the repeat analyses verify the original values, these values should not be excluded from the statistical analysis, and unfavorable conclusions may be reached in such cases. In our opinion, however, such a situation likely reflects large variability between the two laboratories or methods, which is itself an indication that they are not equivalent.
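A minimal outlier screen based on the three-standard-deviation rule is sketched below (the ratio values are illustrative). Note that with small n this rule is insensitive, because an outlier inflates the estimated SD; a formal procedure such as Grubbs' test is preferable for small data sets:

```python
import statistics

def flag_outliers(ratios, k=3.0):
    """Simple screen: flag ratios lying more than k standard
    deviations from the mean. A flagged value warrants a
    root-cause review before any exclusion is considered."""
    mean_r = statistics.mean(ratios)
    sd_r = statistics.stdev(ratios)
    return [r for r in ratios if abs(r - mean_r) > k * sd_r]

# Illustrative ratio set with one aberrant value
ratios = [0.93, 0.96, 0.91, 0.95, 0.94, 0.92, 0.95, 0.93, 0.96, 0.94, 0.92, 1.55]
print(flag_outliers(ratios))
```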

Sample source, stability and homogeneity are other factors that may affect the cross-validation results. The shipping and storage conditions must be proven to be effective at preserving the sample integrity. Proper documentation should be maintained to trace the samples back to their origins.

Conclusion

The interval hypothesis test and paired-sample experimental design described in this manuscript can be used to test for equivalence between two bioanalytical laboratories or methods. The premise for using this approach is that the two laboratories or methods being compared have been validated independently prior to the cross-validation. The acceptance interval and the limits for the two types of risk associated with this statistical test should be defined before the experiments. Equations for estimating the number of sample pairs needed to achieve a desired power were also described; the number of sample pairs should be large enough to achieve that power. One advantage of this test, and of the experimental design based on it, is that the risk of wrongly concluding equivalence when the two laboratories or methods are in fact not equivalent can be limited to a low level. In addition, because the number of sample pairs is determined from power calculations, the risk of wrongly concluding nonequivalence when the two laboratories or methods are in fact equivalent can also be limited to a low level (<20 %). Finally, since both the systematic bias and the random errors of the two laboratories or methods are taken into account, this procedure should be suitable for practical use.