1 Introduction

In medical studies, researchers are often interested in testing whether a new drug is not worse than a standard drug. Efficacy is generally not the only factor considered: the new drug is regarded as beneficial if its efficacy is not inferior to that of the standard drug by more than a prespecified margin, provided the new drug offers advantages such as less toxicity, lower cost, or easier administration. However, some drugs do not produce any appreciable absorption into the systemic circulation, and thus pharmacokinetic approaches are not appropriate for assessing noninferiority or bioequivalence [1,2,3,4,5,6]. Instead, clinical endpoints are used directly to assess equivalence.

Although researchers often use the risk difference and risk ratio as measures of risk for comparing drugs or treatments in prospective clinical trials, these measures generally cannot be used to assess noninferiority or equivalence of risk or disease outcomes in retrospective studies. The odds ratio, on the other hand, is widely used as a measure of risk association in both prospective and retrospective studies.

In clinical trials, the crossover design is widely used to assess drug effects. The most common form is the two-treatment, two-period design with treatments A and B, in which some subjects receive treatment A first and B second, while the others receive treatment B first and A second. One of the main benefits of crossover trials is that the influence of between-subject variability is reduced, and hence the power is increased [7]. Medical researchers also use crossover trials to assess bioequivalence of drugs [2, 8]. Such trials often compare the response rates of binary outcomes to test for noninferiority or equivalence between two drugs. For example, in comparing the efficacy of two drugs, researchers may study the therapeutic response of relief of primary dysmenorrhea (yes \(=\) 1, no \(=\) 0) in a crossover trial. In this article, we focus on studies where the patient response outcome is dichotomous. That is, we want to assess whether the positive response rate for the generic drug is not inferior to that of the standard drug by more than a prespecified margin. Such a test is generally referred to as a noninferiority trial [9].

In a crossover design, it is generally assumed that the carryover effect, the effect of a drug that persists after the end of its dosing period, is eliminated by a sufficiently long washout period. If this assumption does not hold, the crossover design is not appropriate because the carryover effect biases the estimate of the treatment effect [10, 11]. In practice, we can generally conduct the crossover trial with a sufficient washout period between administrations of the drugs under comparison. There are many papers on crossover trials in the current literature. Gart [12] used a logistic regression model to test the equality of two treatment effects. Zimmermann and Rahlfs [13] used a linear additive risk model to test equality of patient response rates in a simple crossover trial. Schouten and Kester [14] used a similar approach to assess treatment effects in crossover trials. Ezzet and Whitehead [15] used a logistic random-effects model to assess equality of treatment effects. Lui and Chang [16] proposed a semiparametric approach for the equality test. Senn [17, 18] provided a general overview of current progress in crossover designs. On the other hand, Bayesian approaches are very limited in the literature. Osman and Ghosh [19] proposed a Bayesian semiparametric approach to test noninferiority. Ghosh et al. [20] proposed a Bayesian semiparametric approach to test noninferiority in three-arm trials. Although these approaches generally work well, they have some clear disadvantages. For example, the approaches of Ghosh et al. [20] and Osman and Ghosh [19] are not easily implemented and do not characterize heterogeneity among the subjects. The approach of Lui and Chang [16] relies on asymptotic theory and does not work well with small data sets; in particular, it fails when some cells have zero counts. To the best of our knowledge, no Bayesian approach has been proposed to test noninferiority or equivalence based on the odds ratio of patient response rates in a crossover design.

In this paper, we propose a Bayesian approach based on the odds ratio of patient response rates to test noninferiority and equivalence in simple crossover designs. Compared with frequentist approaches, a fully Bayesian approach can incorporate useful prior information to account for various sources of uncertainty. Statistical inference can be carried out efficiently and accurately using Markov chain Monte Carlo (MCMC) algorithms applied to the posterior distributions. Thus, the results of the noninferiority (or equivalence) hypothesis tests do not suffer from asymptotic constraints. In addition, the Bayesian approach generally has an advantage for hypothesis testing since it is based on the posterior probability distribution instead of p values, which are often misinterpreted.

The paper is organized as follows: Sect. 2 describes the model. Section 3 provides a simulation study to demonstrate the performance of the proposed approach. Section 4 illustrates the approach using real data examples, followed by a conclusion and discussion in Sect. 5.

2 The Model

2.1 General Description

We consider a simple crossover trial for the noninferiority (or equivalence) test of a new drug B versus a standard drug A. Assume that \(n_1\) patients are assigned to the first group (\(g=1\)), in which patients take drug A at period 1 and drug B at period 2, and \(n_2\) patients are assigned to the second group (\(g=2\)), in which patients take drug B at period 1 and drug A at period 2. We assess the treatment efficacy by conducting a noninferiority (or equivalence) test based on the odds ratio of the positive response rates of the patients. Let \(y_{gij}\) denote the binary outcome from subject \(i\ (i=1, \ldots , n_g)\) at period \(j\ (j=1,2)\) in treatment sequence \(g\ (g=1, 2)\); the model is specified as follows:

$$\begin{aligned} y_{gij} \sim \hbox {Bernoulli}\big (H^{-1}(\chi _{gij})\big ), \quad \chi _{gij}= \vartheta _{gj} + \tau _{g(1, 2)}I(j=2) + \mu _{gi}, \end{aligned}$$
(1)

where \(\vartheta _{gj}= u_0 + \eta _g + \psi _j + t_{l(g,j)}\), \(u_0\) is the overall mean, \(\eta _g\) is the fixed effect of the gth group sequence, \(g=1, 2\), \(\psi _j\) is the fixed effect of the jth period, \(t_{l(g,j)}\) is the fixed treatment effect with treatment index l(g, j), and \(\tau _{g(1, 2)}\) is the first-order carryover effect of the treatment administered at period 1 on the response at period 2 in sequence g. The term \(\mu _{gi}\) is the random effect of the ith subject in the gth sequence, \(i=1, \ldots , n_g\), specified as \(\mu _{gi} \sim N(0, \rho ^2)\), and H(.) is the logistic link function with \(H(\kappa )=\log (\kappa /(1-\kappa ))\). We set \(y_{gij}=1\) if the patient has the positive response outcome of interest (improvement), and 0 otherwise (no improvement). We impose the constraints \(\sum \eta _g=0\), \(\sum \psi _j=0\), and \(\sum _{m=1}^M t_m=0\) on the group, period, and treatment effects to resolve identifiability.
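For concreteness, writing Eq. (1) out for each cell of the two-sequence, two-period design (sequence 1 receives A then B; sequence 2 receives B then A) gives

$$\begin{aligned} \chi _{1i1}&= u_0 + \eta _1 + \psi _1 + t_{\mathrm{A}} + \mu _{1i},&\chi _{1i2}&= u_0 + \eta _1 + \psi _2 + t_{\mathrm{B}} + \tau _{1(1, 2)} + \mu _{1i}, \\ \chi _{2i1}&= u_0 + \eta _2 + \psi _1 + t_{\mathrm{B}} + \mu _{2i},&\chi _{2i2}&= u_0 + \eta _2 + \psi _2 + t_{\mathrm{A}} + \tau _{2(1, 2)} + \mu _{2i}, \end{aligned}$$

so that, with \(M=2\) treatments, the constraints amount to \(\eta _2=-\eta _1\), \(\psi _2=-\psi _1\), and \(t_{\mathrm{B}}=-t_{\mathrm{A}}\).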

The carryover effect is the effect of a drug that persists after the end of the dosing period. It poses a challenging issue in analyzing crossover trials since it biases the estimate of the treatment effect. If there is a sufficient washout period between dosing periods, the carryover effect can be ignored. Clearly, it is impossible to estimate all the parameters in the above formula given the limited information. In this article, we focus mainly on the treatment effects and drop the group sequence effect and the carryover effect, assuming a sufficiently long washout between the two dosing periods. Thus, the model specified in Eq. (1) reduces to

$$\begin{aligned} y_{gij} \sim \hbox {Bernoulli}\big (H^{-1}(\chi _{gij})\big ), \quad \chi _{gij}=u_0 + \psi _j + t_{l(g,j)}+ \mu _{gi}. \end{aligned}$$
(2)

Although the carryover effect is removed from the above model, we note that it can easily be estimated following general procedures, e.g., [21, 22]. Because the logistic model is nonlinear, we cannot obtain conditional conjugacy even with simple normal priors, which makes computation inefficient. We therefore take several steps to improve efficiency by converting the nonlinear model into a standard linear model. According to the current literature [23,24,25], the logistic distribution can be approximated by a Student's t distribution, and the Student's t distribution can be expressed as a scale mixture of normals [26]. Thus, the model specified in Eq. (2) can be expressed equivalently with auxiliary variables as follows:

$$\begin{aligned} y_{gij}&= 1 \quad \hbox {if } y_{gij}^*>0, \\ y_{gij}&= 0 \quad \hbox {if } y_{gij}^*\le 0, \end{aligned}$$

where \(y_{gij}^*\) is a latent variable following the logistic distribution with location parameter \( u_0 + \psi _j + t_{l(g,j)}+ \mu _{gi}\) and density function

$$\begin{aligned} f\big (y_{gij}^*|. \big ) =\frac{\hbox {exp}\big \{-\big (y_{gij}^*- ( u_0 + \psi _j + t_{l(g,j)}+ \mu _{gi}) \big ) \big \}}{\big \{ 1+ \hbox {exp}\big [-\big (y_{gij}^*- (u_0 + \psi _j + t_{l(g,j)}+ \mu _{gi}) \big ) \big ] \big \}^2 }. \end{aligned}$$
(3)

Thus, \(y_{gij}^*\) can be approximated by a noncentral t distribution with location parameter \(u_0 + \psi _j + t_{l(g,j)}+ \mu _{gi}\), degrees of freedom v, and scale parameter \(\sigma ^2\). Expressing this t distribution as a scale mixture of normals gives the following model:

$$\begin{aligned} y_{gij}^*= u_0 + \psi _j + t_{l(g,j)}+ \mu _{gi}+ \epsilon _{gij}, \quad \epsilon _{gij} \sim N(0, \sigma ^2/\phi _{gij}), \end{aligned}$$
(4)

where \(\phi _{gij}\) has a Gamma prior G(v/2, v/2). We take \(v=7.3\) and \(\sigma ^2=\pi ^2(v-2)/(3v) \) as suggested by O'Brien and Dunson [25] to make the approximation nearly exact.
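As a quick numerical check of this approximation, the following sketch compares quantiles of the scale mixture of normals with quantiles of the standard logistic distribution. Only the constants v and \(\sigma ^2\) come from the text; everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
v = 7.3                                # degrees of freedom suggested by O'Brien and Dunson [25]
sigma2 = np.pi**2 * (v - 2) / (3 * v)  # scale chosen so the t variance matches the logistic pi^2/3

n = 200_000
# Scale mixture of normals: phi ~ Gamma(v/2, rate v/2), then e | phi ~ N(0, sigma2 / phi)
phi = rng.gamma(shape=v / 2, scale=2.0 / v, size=n)
mixture_draws = rng.normal(0.0, np.sqrt(sigma2 / phi))

# Direct draws from the standard logistic distribution
logistic_draws = rng.logistic(loc=0.0, scale=1.0, size=n)

# The quantiles of the two samples should agree closely
for prob in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(prob, round(np.quantile(mixture_draws, prob), 3),
          round(np.quantile(logistic_draws, prob), 3))
```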

2.2 Prior and Posterior

A careful prior specification for the unknown parameters is essential in Bayesian approaches. It is advisable not to use excessively diffuse or flat priors, since they may lead to an improper posterior given the intractable nature of the density. On the other hand, conjugate priors are preferred to facilitate computation and improve efficiency. We follow these guidelines and choose the priors in our Bayesian approach with caution.

We specify a normal prior for the overall mean, \(u_0 \sim N( \mu _1, \sigma _1^2) \). Similar priors are selected for the period effect and the treatment effect: \(\psi _j \sim N(\mu _2, \sigma _2^2)\) and \( t_m \sim N(\mu _3, \sigma _3^2)\). The random effect is specified as \(\mu _{gi} \sim N(0, \rho ^2)\); the hyperparameter \(\rho ^2\) is given an Inverse Gamma prior, \(\rho ^2 \sim \mathrm{IG}(a_0, b_0)\), and \(\phi _{gij}\) is given a Gamma prior G(v/2, v/2). Based on the model and prior specifications, we can derive the joint posterior distribution for \(\tilde{{\varvec{\theta }}}=( u_0, {\varvec{\psi }}, {\mathbf{t}}, {\varvec{\phi }})\) as follows:

$$\begin{aligned} p(\tilde{{\varvec{\theta }}}|{\mathbf{y}}) \propto p(.) \left[ \prod _{g}\prod _{i}N(\mu _{gi}; 0, \rho ^2) \prod _{j}N\big (y_{gij}^*; u_0 + \psi _j + t_{l(g,j)}+ \mu _{gi}, \sigma ^2/\phi _{gij}\big )\,w_{gij} \right] , \end{aligned}$$
(5)

where \( w_{gij}= \{ 1(y_{gij}^*>0)\,y_{gij} + 1( y_{gij}^*\le 0) (1-y_{gij}) \}\,p( \phi _{gij})\), and \(p(.)=p( \rho ^2)\, p(u_0)\, p( {\mathbf{t}})\, p({\varvec{\psi }})\). The resulting posterior is too complicated to sample from directly. By introducing the latent variable \(y_{gij}^*\), we apply a data augmentation algorithm and can easily sample the parameters and hyperparameters of interest using a Gibbs sampler. The main idea of the data augmentation algorithm is to augment the observed data Y with another variable V, generally referred to as latent data. Given Y and V, one can easily sample the parameters \(\theta \) from the posterior distribution \(P(\theta |Y,V)\). In the above model, the auxiliary variable can be easily updated within the Gibbs sampler from a normal posterior distribution truncated below or above 0 according to the value of \(y_{gij}\). The conditional posterior of the auxiliary variable is

$$\begin{aligned} p\big (y_{gij}^*|{\varvec{\theta }}, y_{gij}\big ) = \frac{N\big (y_{gij}^*; q_{gij}, \sigma ^2/\phi _{gij}\big ) \big \{ 1\big (y_{gij}^*>0\big )y_{gij} + 1\big ( y_{gij}^*\le 0\big ) (1-y_{gij}) \big \}}{ {\varvec{\varPhi }}\big (0; q_{gij}, \sigma ^2/\phi _{gij}\big )^{1-y_{gij}} \big \{ 1-{\varvec{\varPhi }}\big (0; q_{gij}, \sigma ^2/\phi _{gij}\big ) \big \}^{y_{gij}} }, \end{aligned}$$
(6)

where \(q_{gij}=u_0 + \psi _j + t_{l(g,j)}+ \mu _{gi}\), and \({\varvec{\varPhi }}(0; q_{gij}, \sigma ^2/\phi _{gij})\) denotes the normal cumulative distribution function with mean \(q_{gij}\) and variance \(\sigma ^2/\phi _{gij}\) evaluated at 0. The full conditional posterior distributions of the remaining parameters follow from (5); the detailed sampling steps are listed in the Appendix. We run the Gibbs sampler by iteratively sampling all the parameters and hyperparameters of interest.
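To illustrate this update, here is a minimal sketch of the truncated normal draw in Eq. (6) via the inverse-CDF method; the function and variable names are ours for illustration and are not taken from the Appendix.

```python
import numpy as np
from scipy.stats import norm

def sample_latent(y, q, sigma2, phi, rng):
    """Draw y* from N(q, sigma2/phi) truncated to (0, inf) when y == 1 and to
    (-inf, 0] when y == 0, using the inverse-CDF method (cf. Eq. (6))."""
    sd = np.sqrt(sigma2 / phi)
    p0 = norm.cdf(0.0, loc=q, scale=sd)            # P(y* <= 0 | theta)
    u = rng.uniform(size=np.shape(q))
    # y == 1: uniform on (p0, 1) maps above 0; y == 0: uniform on (0, p0) maps at or below 0
    u_trunc = np.where(y == 1, p0 + u * (1.0 - p0), u * p0)
    return norm.ppf(u_trunc, loc=q, scale=sd)

rng = np.random.default_rng(0)
y = np.array([1, 0, 1])
q = np.array([0.2, -0.1, 0.4])                     # q_gij = u0 + psi_j + t_l(g,j) + mu_gi
phi = np.array([0.9, 1.2, 1.0])
sigma2 = np.pi**2 * (7.3 - 2) / (3 * 7.3)
print(sample_latent(y, q, sigma2, phi, rng))
```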

Given the full conditional posterior distributions derived from (5), one can easily use MCMC algorithms to obtain estimates of the parameters of interest. After discarding the initial burn-in period, we obtain posterior summaries of the parameters of interest from the Gibbs sampler output. For example, suppose we want to evaluate the relative treatment effect difference between drug B and drug A. The posterior estimate of \(t_{\mathrm{BA}}\) is given by

$$\begin{aligned} t_{\mathrm{BA}}= \sum _{k=\varsigma +1}^K( t_{\mathrm{B}}^{(k)} - t_{\mathrm{A}}^{(k)})/(K-\varsigma ), \end{aligned}$$
(7)

where the relative treatment effect difference is defined as \(t_{\mathrm{BA}}= t_{\mathrm{B}} - t_{\mathrm{A}}\), \( t_{\mathrm{B}}^{(k)}\) is the kth draw of \(t_{\mathrm{B}}\) from the posterior sampling, K is the total number of draws, and \(\varsigma \) is the number of burn-in iterations. We denote \(\nu _{\mathrm{BA}}=\exp (t_{\mathrm{BA}})\), which represents the odds ratio comparing treatment B with treatment A. The relative treatment effect difference derived in this section is used to assess the noninferiority and equivalence of drug B (treatment B) to drug A (treatment A) with respect to treatment efficacy, as discussed in the next section.
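In practice, Eq. (7) is simply the average of the post-burn-in draws of \(t_{\mathrm{B}}^{(k)}-t_{\mathrm{A}}^{(k)}\). A minimal sketch of this posterior summary follows; the array and function names are illustrative, and the equal-tailed interval is one possible choice of 95\(\%\) credible interval.

```python
import numpy as np

def summarize_treatment_effect(t_B_draws, t_A_draws, burn_in):
    """Posterior summary of the relative treatment effect difference (Eq. (7))
    and the corresponding odds ratio nu_BA = exp(t_B - t_A)."""
    diff = np.asarray(t_B_draws)[burn_in:] - np.asarray(t_A_draws)[burn_in:]
    t_BA = diff.mean()                              # Eq. (7)
    nu_draws = np.exp(diff)                         # posterior draws of nu_BA
    ci = np.quantile(nu_draws, [0.025, 0.975])      # equal-tailed 95% credible interval
    return t_BA, nu_draws, ci
```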

2.3 Testing Noninferiority and Equivalence

Generally, the noninferiority hypothesis for comparing two treatments is formulated as

$$\begin{aligned} H_0: t_{\mathrm{B}} - t_{\mathrm{A}} \le \omega \quad \mathrm{vs} \quad H_1: t_{\mathrm{B}} - t_{\mathrm{A}} > \omega , \end{aligned}$$
(8)

where \(\omega < 0 \) is a predetermined real-valued quantity called the noninferiority margin. Alternatively, we can formulate the hypotheses in terms of the odds ratio of the response rates:

$$\begin{aligned} H_0: \nu _{\mathrm{BA}} \le \delta \quad \mathrm{vs} \quad H_1: \nu _{\mathrm{BA}} > \delta , \end{aligned}$$
(9)

where \(0< \delta < 1\), with \(\delta =\exp (\omega )\). When the null hypothesis is rejected, noninferiority of the experimental treatment (B) to the reference treatment (A) is concluded.

The noninferiority formulation can easily be modified for a hypothesis test of equivalence between two treatments. The hypothesis test of equivalence between treatments B and A is formulated as

$$\begin{aligned} H_0: \nu _{\mathrm{BA}} \le \delta _1 \quad \mathrm{or} \quad \nu _{\mathrm{BA}} \ge \delta _2 \quad \mathrm{vs} \quad H_a: \delta _1< \nu _{\mathrm{BA}} < \delta _2 \end{aligned}$$
(10)

where \(\delta _1\) and \(\delta _2\) are the predetermined maximum clinically acceptable margins. When the null hypothesis is rejected, one concludes equivalence between the two treatments.

Given the posterior draws obtained from the MCMC algorithm, we can easily compute the posterior probabilities of these hypotheses in a manner similar to (7).
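Concretely, the posterior probabilities of the hypotheses in (9) and (10) are estimated by the proportion of post-burn-in draws of \(\nu _{\mathrm{BA}}\) falling in the corresponding region. A minimal sketch (function names are illustrative):

```python
import numpy as np

def posterior_prob_noninferior(nu_draws, delta):
    """Estimate P(H1 | data) for the noninferiority test (9): P(nu_BA > delta)."""
    return np.mean(np.asarray(nu_draws) > delta)

def posterior_prob_equivalent(nu_draws, delta1, delta2):
    """Estimate P(H_a | data) for the equivalence test (10): P(delta1 < nu_BA < delta2)."""
    nu = np.asarray(nu_draws)
    return np.mean((nu > delta1) & (nu < delta2))
```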

3 Simulation

To evaluate the performance of our approach, we conduct a simulation study. The specification of the Bayesian model is completed by specifying the priors. Basu and Santra [21] chose fairly flat but proper, conditionally conjugate priors to analyze crossover trial studies. Following this line, we choose similar priors for our analysis. We assume that there are no carryover effects, i.e., an adequate washout period, in the simulation. The overall mean \(u_0\) is set equal to 0.10. We generate the random effects \(\mu _{gi}\) independently and identically from a normal distribution with mean 0 and standard deviation \(s=0.2, 0.5\), and 0.8, respectively; we set the relative treatment effect difference \(t_{21}\) to 0.1, 0.3, and 0.5, respectively. The number of patients in each group is varied as \(n_1=n_2=15, 20, 50\), and 150. For the priors, we specify \(\mu _1=\mu _2=\mu _3 =0.5\), \(\sigma _1=\sigma _2=\sigma _3 =10.0\), and \(a_0 = b_0 = 0.1\).

We generate 200 simulated data sets according to Eq. (2). The percentage of data sets with zero-count cells varies across scenarios. For example, for the small sample size \(n_1=n_2=15\), about 5\(\%\) (\(s=0.2\)) to 12\(\%\) (\(s=0.8\)) of the data sets contain zero counts, whereas for \(n_1=n_2=20\), about 2\(\%\) to 4\(\%\) of the data sets contain zero counts. We apply the Gibbs sampling algorithm described in the previous section and the Appendix. The Gelman–Rubin approach [27] is used to assess mixing and convergence. After an initial 2000 burn-in iterations, we use the next 5000 iterations to estimate the parameters of interest. We also run the simulations with varying means and variances of the priors to evaluate their effects. For data sets with small sample sizes, the coverage intervals are slightly wider with flatter priors, but the effect is negligible for larger sample sizes. We generally do not see any significant deviation in the parameter estimates.
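For concreteness, here is a minimal sketch of how one simulated data set from Eq. (2) can be generated under the settings above. The function name, the period-effect values, and the coding of the treatment contrast as \(t_{\mathrm{A}}=-t_{21}/2\), \(t_{\mathrm{B}}=t_{21}/2\) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_crossover(n_per_group, u0, t_21, psi, s, rng):
    """Generate one simulated crossover data set from Eq. (2).

    Group 1 receives drug A then B; group 2 receives drug B then A.
    t_21 is the relative treatment effect difference t_B - t_A, coded with
    the sum-to-zero constraint as t_A = -t_21/2 and t_B = +t_21/2."""
    t = {"A": -t_21 / 2.0, "B": t_21 / 2.0}
    sequences = {1: ("A", "B"), 2: ("B", "A")}
    data = []
    for g, seq in sequences.items():
        mu = rng.normal(0.0, s, size=n_per_group)        # subject random effects
        for i in range(n_per_group):
            for j, drug in enumerate(seq, start=1):
                chi = u0 + psi[j - 1] + t[drug] + mu[i]  # linear predictor in Eq. (2)
                p = 1.0 / (1.0 + np.exp(-chi))           # inverse logit H^{-1}
                data.append((g, i + 1, j, drug, rng.binomial(1, p)))
    return data

rng = np.random.default_rng(2024)
one_data_set = simulate_crossover(n_per_group=15, u0=0.10, t_21=0.3,
                                  psi=(0.05, -0.05), s=0.5, rng=rng)
```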

We compare our proposed approach with that of Lui and Chang [16]. For ease of notation, we denote our approach by YB and that of Lui and Chang [16] by LC. The program is compiled with Intel Visual Fortran Professional 16.0 and executed on a PC (Windows 7 Professional, Intel Xeon E5-1620 processor at 3.5 GHz, 6 GB RAM). We ran the Gibbs sampler for the above simulation. Although we make a few approximations, the program runs very fast; e.g., the CPU time consumed is 642 s for \(n=50\) with 200 simulated data sets. Table 1 provides the bias, mean squared error (MSE), and 95\(\%\) coverage of the relative treatment effect difference under varying scenarios. From the table, we can see that our approach (YB) has smaller bias and MSE than LC, though slightly lower coverage for small sample sizes. As the sample size increases, the coverage and MSE of YB remain better than those of LC, and the biases of the two approaches become very close. The advantage for small samples is likely due to the presence of zero-count cells, which the LC approach has difficulty handling. As the sample size increases, the results of LC improve since zero-count cells become less likely. Overall, the performance of our approach is very good. Although some data sets contain cells with zero counts, a challenging issue for the approach of Lui and Chang [16], our approach does not suffer from this problem and still provides reliable and consistent results.

Table 1 The estimated bias, MSE, and 95\(\%\) coverage of the relative treatment effect difference

We also study the relationship between the posterior probability of \(H_1\) and the relative treatment effect difference in the above simulations. Figure 1 presents the results. Within each panel of Fig. 1, the posterior probabilities of the alternative hypothesis for the different random-effect standard deviations are in ascending order of s at the lower end of the relative treatment effect difference and in descending order at the upper end. The posterior probability of \(H_1\) increases as the relative treatment effect difference increases. Comparing across panels, the curves become steeper as the sample size increases. These results are expected and accord with intuition.

Fig. 1 Simulation: the expected posterior probability E[P(\(H_1|\)Data)] as a function of the relative treatment effect difference

4 Applications

In this section, we use two real data examples to illustrate our approach. For the first example, shown in Table 2, we consider the data set given by Senn [17]. In this study, 24 children who suffered from exercise-induced asthma were randomly assigned to a crossover trial of two treatment sequences: drug B, a \(12\,\upmu \)g formoterol solution aerosol, and drug A, a 200\(\,\upmu \)g salbutamol solution aerosol. The first group of 12 children received treatment A first and B second; the second group of 12 children received treatment B first and A second. The binary outcome was derived from a subjectively judged four-point scale: a success represents a good response, and a failure represents a poor, fair, or moderate efficacy. Here, we want to test whether drug B is noninferior to drug A based on the odds ratio of the patients' success response rates.

Table 2 Real data example 1: the frequency distribution of patients with a success (+) or a failure (−) during treatment A (200\(\,\upmu \)g salbutamol solution aerosol) or B (12\(\,\upmu \)g formoterol solution aerosol)

We specify the priors as in the simulation study. We run the Gibbs sampler for 5000 iterations after a 1000-iteration burn-in period and use the Gelman–Rubin approach to assess convergence, as in the simulation section. Sensitivity analyses similar to those in the simulation study are also conducted. The MCMC chains show good convergence, and the results are consistent.

We obtain a posterior mean estimate of \(\nu _{\mathrm{BA}}\) of 173.33 with a corresponding 95\(\%\) CI of (7.21, 628.67). We note that this is a very small data set: two cells have zero counts, and five cells have counts of no more than 2. Because of the zero-count cells, the approach of Lui and Chang [16] cannot be applied directly; they obtain an odds ratio estimate of 9.074 by adding 0.5 to the cells. Given the small sample size and the predominance of small cell counts, adding even a seemingly small value can change the results substantially, and Lui and Chang [16] do not justify what constitutes a reasonably small value. For example, when we try different small values from 0.001 to 1.0, the estimates from the approach of Lui and Chang [16] vary from 232.30 to 5.92. When we run our approach after adding 1 to the corresponding cells, the estimate of \(\nu _{\mathrm{BA}}\) and the corresponding 95\(\%\) CI become 10.72 and (2.70, 30.22), respectively, and the posterior median changes drastically from 42.43 to 8.60; see Table 3. When the maximum clinically acceptable noninferiority margin \(\delta \) is set to 0.8, the posterior probability of the null hypothesis is almost 0.0. Thus, we conclude that the treatment of 12\(\,\upmu \)g formoterol solution aerosol is noninferior to the treatment of 200\(\,\upmu \)g salbutamol solution aerosol.

Table 3 Real data example 1: comparison of the estimates of \(\nu _{\mathrm{BA}}\) from YB and LC under varying scenarios
Table 4 Real data example 2: the frequency distribution of patients with a response "yes" (+) or "no" (−) comparing two new inhalation devices delivering salbutamol
Fig. 2 Application 2: traceplots of the relative treatment effect difference, group difference, period difference, and intercept

For the second example, we consider a crossover trial study analyzed by Ezzet and Whitehead [15]. In the study, 3M Riker conducted a crossover trial to compare the suitability of two new inhalation devices (A and B) in patients who were currently using a standard inhaler device delivering salbutamol. Patients in group 1 used device A for one week and then device B for another week; the patients in group 2 used the devices in the reverse order. No washout was felt necessary. Patients were asked whether there were particular features which they liked about each device, and their responses were coded as "yes" or "no." Fewer than 3\(\%\) of the patients had missing outcomes. We assume that the missing outcomes occur completely at random and summarize the frequencies of patients with known responses in Table 4 for the purpose of illustration. To assess the noninferiority of device A versus device B, we run our approach and obtain an estimate of \(\nu _{\mathrm{BA}}\) of 2.42 with a corresponding 95\(\%\) CI of (1.58, 3.68). When the maximum clinically acceptable noninferiority margin \(\delta \) is set to 0.8, the posterior probability of the null hypothesis is 0.00081. All the results indicate that device A is noninferior to device B based on the odds ratio of the patients' favorable responses. The approach of Lui and Chang [16] provides results close to ours, with an estimate of \(\nu \) of 2.34 and a p value less than 0.001. Traceplots of several parameters are provided in Fig. 2 to assess convergence; the chains converge very quickly. We note that the second example has a fairly large sample size of 279. The real data examples show that our approach provides results very close to those of Lui and Chang [16] for fairly large data sets, whereas for small data sets our approach can provide more consistent and reliable results. We also rerun the analyses for the applications with varying means and variances of the priors to evaluate their effects. For the first application, with its small sample size, the credible interval is slightly wider with a flatter prior, but the effect is negligible for the second application with its larger sample size. We do not see any significant deviation in the parameter estimates.

5 Conclusion and Discussion

In this paper, we have developed a Bayesian approach for testing noninferiority and equivalence based on the odds ratio of patient response rates in simple crossover trials. The approach is easily implemented and computationally efficient, thanks to data augmentation and the scale mixture of normals representation. Compared to current frequentist approaches, our approach does not suffer from asymptotic constraints and can easily handle cells with zero counts. Through simulation studies, we have shown the strength and good performance of our approach. The approach can be easily modified to accommodate more complicated settings, for example, three-arm trials. We expect the approach to be helpful for clinical researchers conducting noninferiority tests in simple crossover trials.