Background

After developing a new health intervention (a treatment or a diagnostic test), the next step is to assess its effectiveness relative to the existing reference intervention. There are several strategies for doing so, such as superiority trials, which test whether the new treatment is superior to another (placebo, reference, or active control treatment). However, when the active control intervention already achieves maximum efficacy, or when the use of a placebo is unethical, it becomes difficult to statistically demonstrate the superiority of the new health intervention. Studies aimed at showing that a new intervention is not worse than the active control intervention by more than a pre-specified amount of efficacy have become increasingly common in the last decade [1]. The expression "not worse than the active control intervention by more than a pre-specified amount" means that it is acceptable to lose a "little bit" of the main effect of the active control intervention in exchange for the new intervention's benefits (fewer side effects, lower cost, better tolerability, improved safety). This acceptable loss of efficacy is quantified by the non-inferiority margin. A trial showing that the new intervention is non-inferior to the active control intervention is called a non-inferiority trial [1].

The Food and Drug Administration (FDA) [2] provided general principles for an appropriate choice of the non-inferiority margin. The non-inferiority margin is placed at the upper limit of the confidence interval, so the trial is designed to show that no more than this "loss of maximum efficacy" occurs. Generally, this margin is fixed, determined from historical trials as a fraction of the treatment effect. In some cases, however, the mean estimate of the reference treatment may vary to such an extent that adopting a fixed margin would not be relevant. Indeed, a fixed margin cannot account for the variability surrounding the reference treatment estimate; in this case, the margin should be a function of the reference treatment. For binary endpoints, tests that account for non-fixed margins have been studied [3–5]. Most work on non-inferiority tests for continuous endpoints with a fixed or linear margin has focused on the confidence interval approach [6–8], which mainly consists of comparing the bounds of the treatment difference to the fixed margin. However, few studies have addressed a non-fixed, or variable, margin for continuous endpoints. This work aims to derive non-inferiority tests for continuous endpoints with a flexible margin in active randomized controlled trials. The proposed methods are applied to the Stratall ANRS 12110/ESTHER trial.

Methods

Notations

The basic notations used are defined as follows.

  • XR and XN are the random variables for the continuous primary endpoint in the active control group (reference) and the new intervention group (new group), respectively.

  • nR and nN are the sample sizes for the active control group and the new group, respectively.

  • μR and μN are the means of the continuous primary endpoint for the active control group and the new group, respectively.

  • \({\sigma }^{2}_{R}\) and \({\sigma }^{2}_{N}\) are the variances of the continuous primary endpoint for the active control group and the new group, respectively.

  • ΔL(μR) is the non-inferiority margin, and Δ = μN − μR is the difference of the true means.

  • H0 and H1 are the null and alternative hypotheses, respectively.

Approach using a test statistic

Without loss of generality, assume that an increase in the endpoint corresponds to greater efficacy. The non-inferiority hypotheses can be formulated as follows:

$$ \left \{ \begin{array}{ll} H_{0}{:}\; \mu_{N} \leq \mu_{R}-\Delta_{L} & \text{There is no non-inferiority}\\ H_{1}{:}\; \mu_{N} > \mu_{R}-\Delta_{L} & \text{There is non-inferiority} \end{array} \right. $$
(1)

The formulation of the hypothesis test in Eq. (1) shows that non-inferiority means the new intervention is not worse than the active control intervention by more than the margin ΔL. When ΔL is fixed, testing the hypotheses (1) can be viewed as a classical composite hypothesis test for a mean difference [9]; therefore, applying the central limit theorem on the boundary of the null hypothesis, the asymptotic test statistic Zfixed is obtained as:

$$ Z_{\text{fixed}}=\frac{\bar{X}_{N}-\bar{X}_{R}+\Delta_{L}}{\sqrt{\frac{{\sigma}^{2}_{N}}{n_{N}}+\frac{{\sigma}^{2}_{R}}{n_{R}}}}\sim N(0,1). $$
(2)

Indeed, when ΔL is fixed, we have:

$$\begin{array}{*{20}l} \text{Var}(\bar{X}_{N}-\bar{X}_{R}+\Delta_{L}) &=\text{Var}(\bar{X}_{N})+\text{Var}(\bar{X}_{R}) \notag \\ &=\frac{{\sigma}^{2}_{N}}{n_{N}}+\frac{{\sigma}^{2}_{R}}{n_{R}}. \end{array} $$
(3)

The null hypothesis is rejected if Zfixed>Z1−α, where Z1−α is the (1−α) percentile of the standard normal distribution. From the Karlin-Rubin theorem, this test is the uniformly most powerful test of level α [10].
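As a minimal sketch of this fixed-margin test (written in Python, although the paper's own computations used R; the sample summaries and margin below are illustrative, not taken from any trial):

```python
# Fixed-margin non-inferiority test of Eq. (2); all numeric inputs here are
# made-up illustrative values.
import math
from statistics import NormalDist

def z_fixed(xbar_n, xbar_r, var_n, var_r, n_n, n_r, delta_l):
    """Z_fixed = (X̄_N - X̄_R + Δ_L) / sqrt(σ²_N/n_N + σ²_R/n_R)."""
    se = math.sqrt(var_n / n_n + var_r / n_r)
    return (xbar_n - xbar_r + delta_l) / se

z = z_fixed(xbar_n=98.0, xbar_r=100.0, var_n=225.0, var_r=225.0,
            n_n=120, n_r=120, delta_l=5.0)
p_value = 1 - NormalDist().cdf(z)           # one-sided p-value
reject = z > NormalDist().inv_cdf(0.95)     # reject H0 at alpha = 0.05 if True
```

Here the new arm loses 2 units of mean efficacy, which is within the fixed margin of 5, but the evidence falls just short of significance at the 5% level.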

If ΔL is not fixed, i.e., if ΔL is a function of μR, then \(\text {Var}\{\bar {X}_{N}-\bar {X}_{R}+\Delta _{L}(\bar {X}_{R})\}\neq \text {Var}(\bar {X}_{N})+\text {Var}(\bar {X}_{R})\), so \(\text {Var}(\bar {X}_{N})+\text {Var}(\bar {X}_{R})\) is not a valid variance for \(\bar {X}_{N}-\bar {X}_{R}+\Delta _{L}(\bar {X}_{R})\). Under the assumption that ΔL is continuously differentiable, the variance is estimated using the delta method, as discussed below.

Variance estimation using the delta method

If ΔL(·) is continuously differentiable with ΔL′(μR)≠0 (ΔL′ being the first derivative of ΔL), then using a first-order Taylor expansion in a neighborhood of μR,

$$ \Delta_{L}(\bar{X}_{R})=\Delta_{L}(\mu_{R})+\Delta'_{L}(\mu_{R})(\bar{X}_{R}-\mu_{R})+o_{p}(1). $$
(4)

Hence,

$$\begin{array}{*{20}l} &{}\{\bar{X}_{N}-\bar{X}_{R}+\Delta_{L}(\bar{X}_{R})\}-\{\mu_{N}-\mu_{R}+\Delta_{L}(\mu_{R})\}\\ &{}=(\bar{X}_{N}-\mu_{N})-(\bar{X}_{R}-\mu_{R})+\{\Delta_{L}(\bar{X}_{R})-\Delta_{L}(\mu_{R})\}\\ &{}=(\bar{X}_{N}-\mu_{N})-(\bar{X}_{R}-\mu_{R})+\Delta'_{L}(\mu_{R})(\bar{X}_{R}-\mu_{R})+o_{p}(1)\\ &{}=(\bar{X}_{N}-\mu_{N})+\{\Delta'_{L}(\mu_{R})-1\}(\bar{X}_{R}-\mu_{R})+o_{p}(1)\\ \end{array} $$

Thus, the variance estimate is:

$$ {}\text{Var}\{\bar{X}_{N}-\bar{X}_{R}+\Delta_{L}(\bar{X}_{R})\} = \frac{\sigma^{2}_{N}}{n_{N}}+\frac{\{\Delta'_{L}(\mu_{R})-1\}^{2}\sigma^{2}_{R}}{n_{R}} $$
(5)

The test statistic can then be expressed as:

$$ {}Z_{\text{flexible}}=\frac{\{\bar{X}_{N}-\bar{X}_{R}+\Delta_{L}(\bar{X}_{R})\}-\{\mu_{N}-\mu_{R}+\Delta_{L}(\mu_{R})\}}{\sqrt{\frac{\sigma^{2}_{N}}{n_{N}}+\frac{\{\Delta'_{L}(\mu_{R})-1\}^{2}\sigma^{2}_{R}}{n_{R}}}}. $$
(6)
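The delta-method correction in Eq. (6) can be sketched as follows (Python rather than the paper's R; the quarter-power margin is the one used later in the simulation study, and the numeric inputs are illustrative):

```python
# Flexible-margin statistic of Eq. (6): the margin Delta_L depends on mu_R,
# so the delta method inflates the reference-arm variance by
# (Delta'_L(mu_R) - 1)^2, with mu_R replaced by its estimate X̄_R.
import math

def z_flexible(xbar_n, xbar_r, var_n, var_r, n_n, n_r, margin, margin_deriv):
    """Z = (X̄_N - X̄_R + Δ_L(X̄_R)) /
           sqrt(σ²_N/n_N + (Δ'_L(X̄_R) - 1)² σ²_R/n_R)."""
    dprime = margin_deriv(xbar_r)
    se = math.sqrt(var_n / n_n + (dprime - 1.0) ** 2 * var_r / n_r)
    return (xbar_n - xbar_r + margin(xbar_r)) / se

# Example margin Delta_L(mu) = mu**(1/4), as in the simulation study.
z = z_flexible(xbar_n=98.0, xbar_r=100.0, var_n=225.0, var_r=225.0,
               n_n=120, n_r=120,
               margin=lambda m: m ** 0.25,
               margin_deriv=lambda m: 0.25 * m ** -0.75)
```

Note that for a slowly varying margin (ΔL′ close to 0), the factor {ΔL′(μR)−1}² is close to 1 and the standard error is close to the fixed-margin one; the margin itself, however, is now evaluated at the estimate X̄R.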

Asymptotic properties of the test statistic Zflexible

From the central limit theorem, as nN and nR approach infinity, Zflexible ∼ N(0,1) on the boundary of the null hypothesis; that is, asymptotically,

$$ Z_{\text{flexible}}=\frac{\bar{X}_{N}-\bar{X}_{R}+\Delta_{L}(\bar{X}_{R})}{\sqrt{\frac{\sigma^{2}_{N}}{n_{N}}+\frac{\{\Delta'_{L}(\mu_{R})-1\}^{2}\sigma^{2}_{R}}{n_{R}}}} \sim N(0, 1). $$
(7)

μR is unknown, and \(\sigma ^{2}_{R}\) and \(\sigma ^{2}_{N}\) may also be unknown; these parameters need to be estimated. We used maximum likelihood estimation on the boundary of the null hypothesis (μN = μR − ΔL(μR)). The unknown parameters are estimated considering the cases where the variances \(\sigma ^{2}_{R}\) and \(\sigma ^{2}_{N}\) are known, unknown, equal, or unequal.

The maximum likelihood (ML) estimators \(\hat {\mu _{R}}, \hat {\sigma _{R}}^{2}\) and \(\hat {\sigma _{N}}^{2}\) of \(\mu _{R}, \sigma ^{2}_{R}\) and \(\sigma ^{2}_{N}\), respectively, are consistent. Moreover, since ΔL′ is assumed continuous, \(\Delta '_{L}(\hat {\mu _{R}})\) is a consistent estimator of ΔL′(μR). The estimator \(\hat {Z}_{\text {flexible}}\) of the test statistic Zflexible is obtained by replacing the unknown parameters in (6) with their ML estimators. Therefore, in the test of H0′ versus H1 (where H0′ is the boundary of H0, i.e., μN = μR − ΔL(μR)), the null hypothesis is rejected if \(\hat {Z}_{\text {flexible}}>z_{1-\alpha }\), where α is the nominal type I error and z1−α denotes the (1−α) percentile of the standard normal distribution. The significance level of this test tends to α as nN and nR approach infinity.

Assume that, under the alternative hypothesis H1, μN − μR + ΔL(μR) = v, so that v > 0. Hence, if η is the power of the test, it follows that:

$$\begin{array}{*{20}l} \eta &= \mathbf{P}\left(\frac{\bar{X}_{N}-\bar{X}_{R}+\Delta_{L}(\bar{X}_{R})}{\sqrt{\frac{\sigma^{2}_{N}} {n_{N}}+\frac{(\Delta'_{L}(\mu_{R})-1)^{2}\sigma^{2}_{R}}{n_{R}}}} > z_{1-\alpha} \,\Big|\, H_{1}\right)\\ & = \mathbf{P}\left(\frac{\bar{X}_{N}-\bar{X}_{R}+\Delta_{L}(\bar{X}_{R})-v}{\sqrt{\frac{\sigma^{2}_{N}} {n_{N}}+\frac{(\Delta'_{L}(\mu_{R})-1)^{2}\sigma^{2}_{R}}{n_{R}}}} \right. \\ &>\left. z_{1-\alpha}-\frac{v}{\sqrt{\frac{\sigma^{2}_{N}} {n_{N}}+\frac{(\Delta'_{L}(\mu_{R})-1)^{2}\sigma^{2}_{R}}{n_{R}}}}\right), \end{array} $$

where, under the alternative hypothesis, \(\frac {\bar {X}_{N}-\bar {X}_{R}+\Delta _{L}(\bar {X}_{R})-v}{\sqrt {\frac {\sigma ^{2}_{N}} {n_{N}}+\frac {(\Delta '_{L}(\mu _{R})-1)^{2}\sigma ^{2}_{R}}{n_{R}}}} \sim N(0,1)\). Assuming equal variances in both groups (\(\sigma ^{2} = \sigma ^{2}_{R} =\sigma ^{2}_{N}\)) and writing δ = v/σ, the power, as a function of δ, nN, nR, and α, is:

$$ \eta(\delta, n_{N}, n_{R})=\Phi\left(\frac{\delta}{\sqrt{\frac{1} {n_{N}}+\frac{(\Delta'_{L}(\mu_{R})-1)^{2}}{n_{R}}}}-z_{1-\alpha}\right), $$
(8)

where Φ is the cumulative distribution function of the standard normal distribution. For a fixed nominal type I error α, and for any fixed μR and μN such that v = μN − μR + ΔL(μR) > 0, when nR→∞ and nN→∞, it follows that η→1. Therefore, the test Zflexible is asymptotically consistent. From Eq. (8), it is possible to find the sample size that achieves a fixed nominal power. Denoting the nominal type II error by β and assuming that nN = r·nR with r > 0, the sample size that yields the nominal power (1−β) satisfies:

$$ n_{R} \geq \frac{(z_{1-\alpha}+z_{1-\beta})^{2}\left[1+r\{\Delta'_{L}(\mu_{R})-1\}^{2}\right]}{r\delta^{2}}. $$
(9)

This formula is equivalent to the one found in [9] when the margin is fixed. Practically, δ plays the role of the standardized difference in a comparison of means; in this work, it will be called the standardized non-inferiority difference. In power and sample size calculations, one fixes δ (for example, δ = 0.05 or δ = 0.5 to detect small or large inferiority differences, respectively), and μR can be pre-specified from historical studies with a similar treatment.
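The power formula (8) and the sample size bound (9) can be sketched as follows (Python rather than the paper's R; the margin, δ, and allocation ratio are illustrative choices):

```python
# Power formula Eq. (8) and sample-size bound Eq. (9).
import math
from statistics import NormalDist

def power(delta, n_n, n_r, dprime, alpha=0.05):
    """eta = Phi( delta / sqrt(1/n_N + (Delta'_L(mu_R)-1)^2/n_R) - z_{1-alpha} )."""
    nd = NormalDist()
    se = math.sqrt(1.0 / n_n + (dprime - 1.0) ** 2 / n_r)
    return nd.cdf(delta / se - nd.inv_cdf(1 - alpha))

def sample_size(delta, dprime, r=1.0, alpha=0.05, beta=0.20):
    """Smallest integer n_R satisfying Eq. (9); n_N = r * n_R."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha) + nd.inv_cdf(1 - beta)
    n_r = z ** 2 * (1 + r * (dprime - 1.0) ** 2) / (r * delta ** 2)
    return math.ceil(n_r)

# Example: margin Delta_L(mu) = mu**(1/4) at mu_R = 100, balanced groups,
# large standardized non-inferiority difference delta = 0.5.
dprime = 0.25 * 100 ** -0.75
n_r = sample_size(delta=0.5, dprime=dprime)       # n_R for 80% power, alpha = 5%
eta = power(delta=0.5, n_n=n_r, n_r=n_r, dprime=dprime)
```

With ΔL′(μR) ≈ 0 this reproduces, up to the {ΔL′(μR)−1}² factor, the classical two-sample sample size formula for a fixed margin.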

The proposed test statistic \(\hat {Z}_{\text {flexible}}\) is asymptotic, so it works well for large sample sizes but is not suited to datasets with small sample sizes, which are not uncommon in practical situations. In such cases, a non-parametric test based on the percentile bootstrap confidence interval, which requires no assumptions on the sample size or the sampling distribution, can be used [11].

Approach based on confidence intervals

For any test based on confidence intervals, the main interest is the confidence level required to achieve a desired nominal type I error. Moreover, as discussed in [9] and [12], the type I error is a controversial issue in clinical trial tests. In the framework of non-inferiority tests with a fixed margin, [13] recommended using 1−α and \(1-\frac {\alpha }{2}\) for two-sided and one-sided confidence levels, respectively, while [7] recommended 1−2α for two-sided and 1−α for one-sided confidence intervals. In [7], it is argued that the recommendation of [13] leads to a conservative test, as the estimated type I error rate would be half the nominal one; moreover, it would entail approximately a 10% loss of power. In this section, we propose a non-parametric procedure for constructing the confidence interval (one-sided and two-sided) when the non-inferiority margin is flexible.

An intuitive confidence-interval-based procedure for the hypothesis test in Eq. (1) would check whether the confidence intervals of μN − μR and −ΔL(μR) overlap: the null hypothesis would be rejected if the two confidence intervals do not overlap, and not rejected otherwise. However, as illustrated in [14], the intervals may overlap while the difference is still statistically significant; thus, the power of such a test would be low. The proposed procedure instead compares the lower bound of the γ-level confidence interval (one- or two-sided, respectively) for μN − μR + ΔL(μR) with 0. The null hypothesis H0 is rejected if this lower bound is greater than 0.

Estimation of the type I error is performed using simulations and non-parametric estimation of confidence intervals on the boundary of the null hypothesis. The detailed steps are described below.

  1. From a fixed μR, calculate μN = μR − ΔL(μR) (satisfying the null hypothesis H0). We assume that the standard deviations σN and σR are known.
  2. Let m denote the number of desired simulations. For i∈{1,⋯,m}, simulate m pairs of samples XN and XR of sizes nN and nR, respectively, from the normal distributions \(\mathcal {N}(\mu _{N}, \sigma _{N})\) and \(\mathcal {N}(\mu _{R}, \sigma _{R})\), respectively.
  3. Using the bootstrap, compute the empirical percentile confidence interval [ai,∞) for the one-sided case (respectively [ai,bi] for the two-sided case) of level γ for μN − μR + ΔL(μR), for i∈{1,⋯,m}.
  4. For i∈{1,⋯,m}, H0 is rejected when ai>0; thus, the significance level is estimated by \(\alpha (\gamma)=\frac {1}{m}\sum ^{m}_{i=1}1_{a_{i}>0}\).
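The type-I-error simulation just described can be sketched as follows (Python rather than the paper's R; m, B, and the sample sizes are kept small so the sketch runs quickly, whereas the paper used m = 10000 and B = 1000; the margin μ^{1/4} is the one from the simulation study):

```python
# Monte Carlo estimate of the type I error of the percentile-bootstrap test,
# following the numbered steps above, with deliberately small m and B.
import random
from statistics import fmean

rng = random.Random(0)

def margin(mu):                           # example margin Delta_L(mu) = mu**(1/4)
    return mu ** 0.25

def type1_error(mu_r, sigma_r, sigma_n, n_r, n_n, m=100, B=200, gamma=0.95):
    mu_n = mu_r - margin(mu_r)            # step 1: boundary of H0
    rejections = 0
    for _ in range(m):                    # step 2: simulate the two samples
        x_n = [rng.gauss(mu_n, sigma_n) for _ in range(n_n)]
        x_r = [rng.gauss(mu_r, sigma_r) for _ in range(n_r)]
        stats = []                        # step 3: bootstrap the plug-in
        for _ in range(B):                #   estimate of mu_N - mu_R + Delta_L(mu_R)
            bn = rng.choices(x_n, k=n_n)
            br = rng.choices(x_r, k=n_r)
            mr = fmean(br)
            stats.append(fmean(bn) - mr + margin(mr))
        stats.sort()
        a = stats[int((1 - gamma) * B)]   # lower bound of the one-sided CI
        rejections += a > 0               # step 4: reject H0 when a > 0
    return rejections / m

alpha_hat = type1_error(mu_r=100.0, sigma_r=15.0, sigma_n=15.0,
                        n_r=50, n_n=50)
```

The power-estimation procedure described next differs only in step 1, where the data are instead drawn at the equivalence point μN = μR.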

As with any power estimation, the data are drawn under the alternative hypothesis, that is, μN > μR − ΔL(μR). Since there is a wide range of possibilities under the alternative hypothesis, in practice one considers the equivalence point, that is, μR = μN. Therefore, similarly to the studies of [5] and [15], the equivalence point (μR = μN) will be used for drawing data in the power estimation.

  1. Given μR, simulate m pairs of samples XN and XR of respective sizes nN and nR from the normal distributions \(\mathcal {N}(\mu _{R}, \sigma _{N})\) and \(\mathcal {N}(\mu _{R}, \sigma _{R})\), respectively.
  2. Using the bootstrap, compute the empirical percentile confidence interval [ai,bi] of level γ for μN − μR + ΔL(μR), for i∈{1,⋯,m}.
  3. For i∈{1,⋯,m}, H0 is rejected when ai>0. Thus, the power is estimated by \(\eta (\gamma)=\frac {1}{m}\sum ^{m}_{i=1}1_{a_{i}>0}\).

Performance assessment

Simulations were performed to evaluate the finite-sample performance of the asymptotic test and the confidence-interval-based test. The performance indicators used were the type I error and the statistical power, estimated by Monte Carlo simulation. In the simulations, we considered the margin \(\Delta _{L}(\mu _{R})=\mu _{R}^{1/4}\) and unknown variances \(\sigma _{R}^{2}\) and \(\sigma _{N}^{2}\).

Both indicators were computed for the two proposed tests as functions of the reference treatment. For the type I error, data were drawn on the boundary of the null hypothesis: for a given μR, μN was set to μN = μR − ΔL(μR). For the power, data were drawn under the alternative hypothesis: for a given μR, μN was chosen such that μN > μR − ΔL(μR); usually, one takes μN = μR. In all cases, μR was assumed to vary in [1,1000]. For the test-statistic-based approach, the power was estimated using formula (8), and two cases were considered: δ = 0.05 and δ = 0.5.

For the approach based on the asymptotic test, the nominal type I error was set at α = 5%. For the confidence-interval-based test, we considered 95% one- and two-sided confidence levels; the purpose was to estimate the type I error rate for each. In all simulations, we considered balanced sample sizes (n = nN = nR), with n = 30, 100, and 1000 for small, medium, and large sample sizes, respectively. The number of bootstrap samples with replacement was B = 1000, and the number of simulation replications was m = 10000. The R software [16] was used to conduct the simulations, and the code is available in a separate file on request.

Application to the Stratall ANRS 12110 / ESTHER

This study was motivated by the randomized non-inferiority "Stratall ANRS 12110 / ESTHER" trial [17]. Its main purpose was to assess an exclusively clinical monitoring strategy compared with a clinical plus laboratory monitoring strategy in terms of effectiveness and safety in HIV-infected patients in Cameroon. The aim was to support the scale-up of HIV care in rural districts, where most people living with HIV reside but local health facilities generally have low-grade equipment. A total of 459 HIV-infected patients were included in the study and randomly allocated to two groups, one receiving exclusively clinical monitoring (intervention group, N = 238) and the other receiving laboratory and clinical monitoring (active control group (reference), N = 221). All included patients were initiated on antiretroviral treatment and followed up for 24 months. Clinical monitoring alone was compared with laboratory plus clinical monitoring in a non-inferiority design. The continuous primary endpoint was the increase in CD4 cell count from treatment initiation to the twenty-fourth month. Based on previous studies, the non-inferiority margin ΔL(μR) was prespecified as a linear function (25%) of the mean CD4 cell count increase (μR) after 24 months of antiretroviral treatment in the laboratory and clinical monitoring group, \(\Delta _{L}(\mu_{R})= \frac {25}{100} \mu _{R}\). Unlike other non-inferiority studies [18, 19], the non-inferiority margin in this study was variable (depending on the mean CD4 increase in the active control group (reference)). However, the classical two-sided confidence-interval-based test at the 90% level was used to obtain a type I error (α) close to 5% [17]. Indeed, statistical procedures for the non-inferiority test for continuous data with variable margins were not available at the time of the original paper [17].
Moreover, as discussed in [12], the relationship between the confidence intervals level and the type I error can be controversial.

More details about the background of the study and the clinical trial process can be found in [17]. Two analyses were done according to the type of data:

  • First, the increase in CD4 cell count at 24 months from baseline was considered, which means that patients missing or lost to follow-up before the end of the follow-up period were excluded from the analysis. In that case, the total number of patients in the analysis was reduced to n = 334, with nR = 169 and nN = 165. "Observed data" will refer to this case, where data are analyzed excluding participants with a missing observation at 24 months.

  • Second, an analysis was done with all participants who attended at least one follow-up visit, with the last observation carried forward (LOCF) imputation method applied for participants whose CD4 data were missing at 24 months (in this case, the number of patients analyzed is the same as at baseline: n = 459, nN = 238, nR = 221).

The classical parametric two-sided confidence-interval-based test at the 90% level was used in [17] to perform the non-inferiority test. The final result was that clinical monitoring alone (CLIN) was inferior to laboratory plus clinical monitoring (LAB).

Results

Simulations results

Test statistic based test

The results for the test-statistic-based approach are summarized in Fig. 1 for the type I error rate estimates and in Figs. 2 and 3 for the power estimates. Whatever the sample size, the type I error rate estimates were constant and did not depend on μR. For the small sample size, the type I error rate estimate was slightly above the nominal value, with a median estimate of 0.053 and an interquartile range (IQR) of [0.051−0.054]. As the sample size increased, the type I error estimates approached the nominal value: for the medium sample size n = 100, the median estimate over μR was 0.051 (IQR = [0.050−0.052]), and for the large sample size n = 1000, the estimate was more accurate and closer to the nominal value, with a median of 0.050 (IQR = [0.050−0.050]).

Fig. 1
Type I error rate estimates according to sample size for the test-statistic-based test. Type I error rate estimates as a function of the reference treatment, for the test-statistic-based test; from left to right, sample sizes are nN=nR=20, 100, and 1000, respectively

Fig. 2
Power estimates according to sample size for the test-statistic-based test (with standardized non-inferiority difference delta = 0.05). Power estimates as a function of the reference treatment, for the test-statistic-based test; from left to right, sample sizes are nN=nR=20, 100, and 1000, respectively

Fig. 3
Power estimates according to sample size for the test-statistic-based test (with standardized non-inferiority difference delta = 0.5). Power estimates as a function of the reference treatment, for the test-statistic-based test; from left to right, sample sizes are nN=nR=20, 100, and 1000, respectively

The power estimates are summarized in Figs. 2 and 3; they did not depend on μR. As expected, the power increased with sample size for a fixed standardized non-inferiority difference δ, and larger values of δ led to higher power estimates for a fixed sample size.

Confidence interval based test

The results for the approach based on confidence intervals are summarized in Figs. 4, 5, 6, and 7. For both the 95% one- and two-sided confidence levels, the estimated type I error rates remained around 0.05 and 0.025, respectively, and concentrated around those values as the sample sizes increased. Hence, for a given nominal type I error α, the suitable confidence levels would be 1−α and 1−2α for one- and two-sided confidence intervals, respectively. The power (at the equivalence point, μR = μN) increased with the sample sizes, but convergence to 1 seemed to require very large sample sizes, which was not the case for the test-statistic-based method. Therefore, in terms of power, the test-statistic-based approach performs better than the confidence-interval-based approach.

Fig. 4
Type I error rate estimates according to sample size for the 95% one-sided confidence-interval-based test. Type I error rate estimates as a function of the reference treatment, for the 95% one-sided confidence-interval-based test; from left to right, sample sizes are nN=nR=20, 100, and 1000, respectively

Fig. 5
Power estimates according to sample size for the 95% one-sided confidence-interval-based test. Power estimates as a function of the reference treatment, for the 95% one-sided confidence-interval-based test; from left to right, sample sizes are nN=nR=20, 100, and 1000, respectively

Fig. 6
Type I error rate estimates according to sample size for the 95% two-sided confidence-interval-based test. Type I error rate estimates as a function of the reference treatment, for the 95% two-sided confidence-interval-based test; from left to right, sample sizes are nN=nR=20, 100, and 1000, respectively

Fig. 7
Power estimates according to sample size for the 95% two-sided confidence-interval-based test. Power estimates as a function of the reference treatment, for the 95% two-sided confidence-interval-based test; from left to right, sample sizes are nN=nR=20, 100, and 1000, respectively

The Stratall ANRS 12110 / ESTHER trial

The proposed methods were also applied to the Stratall ANRS 12110 / ESTHER trial, based on the Observed and LOCF data, with the linear margin \(\Delta _{L}(\mu_{R})= \frac {25}{100} \mu_{R}\). The results for the test-statistic-based approach are summarized in Table 1. The p-value was calculated from the test statistic in Eq. (6), and the statistical power was computed using Eq. (8) with the same inputs as in [17], namely μN = μR = 140 and σN = σR = 130. For the Observed data, the p-value was 0.02, and the null hypothesis that CLIN was inferior to LAB was rejected at the 0.05 level. In contrast, for the LOCF data, the p-value was 0.09, and the null hypothesis that CLIN was inferior to LAB was not rejected at the 0.05 level.
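As an illustrative worked example (a sketch, not a reproduction of Table 1), Eq. (8) can be evaluated for this setting: the margin is linear, ΔL(μR) = 0.25μR, so ΔL′(μR) = 0.25, with the inputs quoted above and the Observed-data sample sizes:

```python
# Power of the flexible-margin test, Eq. (8), for the Stratall-like setting.
# Inputs: mu_N = mu_R = 140, sigma_N = sigma_R = 130 (as quoted above),
# Observed-data sample sizes; linear margin gives Delta'_L(mu_R) = 0.25.
import math
from statistics import NormalDist

mu_r, sigma = 140.0, 130.0
dprime = 0.25                              # derivative of the linear margin
delta = 0.25 * mu_r / sigma                # v/sigma at the equivalence point
n_n, n_r = 165, 169                        # Observed-data sample sizes

nd = NormalDist()
se = math.sqrt(1.0 / n_n + (dprime - 1.0) ** 2 / n_r)
power = nd.cdf(delta / se - nd.inv_cdf(0.95))   # Eq. (8) with alpha = 5%
```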

Table 1 p-value and power determination for the approach based on the asymptotic test statistic and according to the data used

For the confidence-interval-based approach, the test was performed using both one- and two-sided confidence levels. The results are presented in Table 2. With the "LOCF data," the null hypothesis that CLIN was inferior to LAB was not rejected for any of the confidence intervals used. In contrast, with the "Observed data," the null hypothesis of inferiority was rejected.

Table 2 Confidence interval calculations and decision on non-inferiority confidence interval based test

The two proposed methods produced consistent results on the Stratall ANRS 12110 / ESTHER trial. Moreover, based on LOCF data, the obtained results are in line with those in [17]: the clinical monitoring alone was inferior to laboratory plus clinical monitoring.

Discussion

In this study, we proposed two non-inferiority test approaches for continuous endpoints with flexible margins: a test based on a test statistic and a confidence-interval-based test. The confidence interval approach is more widely used in the literature and is recommended by the international guideline [2]. For the non-inferiority test with continuous endpoints and a fixed margin, studies such as [7] and [12] examined the confidence interval approach, which does not allow explicit sample size calculation. In comparison, our proposed test based on a statistic yields explicit sample size and power formulas.

The simulation results for the confidence-interval-based test showed that the confidence level approximately determines the type I error rate. The tests with 95% one- and two-sided confidence levels yielded type I errors of approximately 0.05 and 0.025, respectively. Therefore, for a given nominal type I error α = 0.05, the confidence-interval-based test should be performed with one- or two-sided confidence intervals at the 1−α or 1−2α levels, respectively; these findings are consistent with those in [7]. The non-inferiority hypothesis test is one-tailed, so performing the procedure with the classical nominal type I error α yields an actual type I error of α/2. Therefore, to avoid a conservative test, the procedure should be performed with twice the desired nominal error. However, the debate on whether one- or two-sided confidence intervals should be used in non-inferiority trials remains open, as discussed in [20].

The most important finding of this study was that the type I error did not vary with the value of the reference treatment, either for the test based on a statistic or for the test based on confidence intervals. This suggests that the variability and uncertainty around the margin were accounted for without affecting the properties of the proposed tests. The methods proposed in this study can therefore be viewed as a generalization of the fixed-margin case for continuous endpoints.

Conclusions

In an active-controlled non-inferiority trial, the non-inferiority margin should be a function of the reference treatment to account for the uncertainty surrounding the mean estimate of the reference treatment. This paper provides a framework for performing the non-inferiority hypothesis test with a flexible margin. Based on the type I error rate and power estimates, the proposed non-inferiority test procedures perform well and are applicable in practice, as illustrated by a practical application to clinical data.