1 Introduction

In this paper, we focus on the statistical performance of a hypothesis test that compares the means of the responses to two treatments. The procedure is illustrated within the framework of clinical trials, although the generality of the mathematical setting allows the method to be applied to a wider set of problems. We consider a clinical trial that compares the mean effect of two competing treatments, say \(R\) and \(W\), and a classical test \({\mathcal {T}}_0=(p_0,n_0)\) that involves \(n_0\) patients with a fixed proportion \(p_0\) of subjects allocated to treatment \(R\). In Sect. 2 we consider a test \({\mathcal {T}}\) based on a response-adaptive design with a different sample size and allocation proportion, and we compare the statistical performance of the two tests. In particular, the analysis aims at determining which characteristics guarantee that a test \({\mathcal {T}}\) performs better than \({\mathcal {T}}_0\), in terms of (a) higher power and (b) fewer subjects assigned to the inferior treatment. Response-adaptive designs are very attractive in a clinical setting, since they pursue two goals simultaneously, one statistical and one ethical: (i) collecting evidence to determine the superior treatment, and (ii) minimizing the number of subjects allocated to the inferior treatment. For a complete literature review on response-adaptive designs see Lachin and Rosenberger (2002), Hu and Rosenberger (2006), Flournoy et al. (2012), Atkinson and Biswas (2014). The adaptive procedure we propose is the Modified Randomly Reinforced Urn (MRRU) design introduced in Aletti et al. (2013). A wide class of response-adaptive randomized designs is based on urn models, which are classical tools to guarantee a randomized device; see Rosenberger (2002), Cheung et al. (2006). 
Asymptotic results concerning urn models with an irreducible reinforcement mean matrix can be found in Rosenberger (2002), Bai et al. (2002), Janson (2004), Bai and Hu (2005), Cheung et al. (2006). Recently, in Laruelle and Pagès (2013) the randomized urn designs proposed in Bai et al. (2002), Bai and Hu (2005) have been studied by applying stochastic approximation algorithms, and asymptotic results have been obtained using these techniques. Moreover, a general class of immigrated urn models has been proposed in Cheung et al. (2011), which provides a unified view of various urn models with irreducible mean reinforcement matrix. However, all these urn processes are based on the assumption that the replacement matrix is irreducible. This assumption is not satisfied by the Randomly Reinforced Urn (RRU) studied in May et al. (2005), Muliere et al. (2006), Paganoni and Secchi (2007), Flournoy and May (2009), which has a diagonal mean reinforcement matrix. The RRU models were introduced in Durham and Yu (1990) for binary responses, applied to dose-finding problems in Durham et al. (1996), Durham et al. (1998) and then extended to the case of continuous responses in Beggs (2005), Muliere et al. (2006). An interesting result concerning RRU models states that the probability of allocating units to the superior treatment converges to one as the sample size increases. This property is very attractive from an ethical point of view. However, because of this asymptotic behavior, RRU models are not in the class of designs targeting a proportion in \((0,1)\), which is usually fixed in advance or computed to optimize some suitable criterion. Hence, the asymptotic properties established in the literature for such procedures [see for instance Melfi and Page (2000), Melfi et al. (2001)] do not apply straightforwardly to the RRU designs.

So, in Aletti et al. (2013) the urn scheme of the RRU model has been conveniently modified, in order to construct a new urn model, called Modified Randomly Reinforced Urn (MRRU), that asymptotically targets a fixed allocation proportion in \((0,1)\), and at the same time reduces the number of subjects allocated to the inferior treatment. This goal has been achieved by introducing two thresholds \(\delta \) and \(\eta \), with \(0<\delta \le \eta <1\), for the urn proportion. These parameters modify the reinforcement process. A brief discussion of the MRRU design is reported at the end of Sect. 2. In general, \(\delta \) represents the desired asymptotic proportion of subjects to allocate to \(R\) when \(W\) is the superior treatment, i.e. \(m_R<m_W\), while \(\eta \) is the desired asymptotic proportion of subjects to allocate to \(R\) when \(R\) is the superior treatment, i.e. \(m_R>m_W\). The asymptotic properties of MRRU studied in Aletti et al. (2013) and Ghiglietti and Paganoni (2014), together with results about adaptive estimators proved in Melfi et al. (2001), are crucial for the procedure presented in this paper.

In Sect. 2 we describe the framework we deal with, considering the case of Gaussian responses with known variances, and we discuss the selection of the parameters for the MRRU model. In Sect. 3 some assumptions on the distributions of the reinforcements required in Sect. 2 are relaxed: specifically, Gaussian responses with unknown variances and Exponential and Bernoulli responses are considered. Section 4 gathers some simulation studies and Sect. 5 contains the analysis of a real case study.

A short conclusion ends the paper (Sect. 6). Data analysis and simulations have been carried out using the statistical software R (R Development Core Team 2011).

2 The proportion-sample size space

Consider the classical hypothesis test for comparing the means of two Gaussian samples with known variances, based on a procedure that assigns a proportion \(p_0\in (0,1)\) of patients to treatment \(R\) and \(1-p_0\) to treatment \(W\). Let \(n_0\in \mathbb {N}\) be the total number of subjects involved in the experiment, chosen as the sample size that guarantees a minimum power (\(\beta _0\)) evaluated at a specific difference of the means (\(\pm \Delta _0\)). In what follows, \(n_{0,R}:=n_0p_0\) and \(n_{0,W}:=n_0(1-p_0)\) indicate the number of subjects assigned to treatments \(R\) and \(W\), respectively. Moreover,

  • responses to treatment \(R\): \(M_1,M_2,..,M_{n_{0,R}}\) i.i.d. \(\sim \mathcal {N}(m_R,\sigma _R^2)\).

  • responses to treatment \(W\): \(N_1,N_2,..,N_{n_{0,W}}\) i.i.d. \(\sim \mathcal {N}(m_W,\sigma _W^2)\).

For the classical hypothesis test

$$\begin{aligned} H_0:\ m_R-m_W=0\ \ \ \ \ \ \ \ \ vs\ \ \ \ \ \ \ \ \ H_1:\ m_R-m_W\ne 0 \end{aligned}$$
(1)

the critical region of the Likelihood Ratio Test (LRT) \({\mathcal {T}}_0\) with level \(\alpha \) is:

$$\begin{aligned} R_{\alpha }=\left\{ \left| \overline{M}_{n_{0,R}}-\overline{N}_{n_{0,W}}\right| > \sqrt{\frac{\sigma _R^2}{n_{0,R}} + \frac{\sigma _W^2}{n_{0,W}} } z_{\frac{\alpha }{2}} \right\} \end{aligned}$$
(2)

where \(\overline{M}_{n_{0,R}}=\sum _{i=1}^{n_{0,R}}M_i/n_{0,R}\) and \(\overline{N}_{n_{0,W}}=\sum _{i=1}^{n_{0,W}}N_i/n_{0,W}\) and \(z_{\frac{\alpha }{2}}\) is the quantile of order \(1-\alpha /2\) of a standard normal distribution. The power function of the test \({\mathcal {T}}_0\) is the following

$$\begin{aligned} \beta _{{\mathcal {T}}_0}(\Delta )=P\left( Z < -z_{\frac{\alpha }{2}} - \frac{\Delta }{\sqrt{ \frac{\sigma _R^2}{n_{0,R}} + \frac{\sigma _W^2}{n_{0,W}} }}\right) + P\left( Z > z_{\frac{\alpha }{2}} - \frac{\Delta }{\sqrt{ \frac{\sigma _R^2}{n_{0,R}} + \frac{\sigma _W^2}{n_{0,W}} }}\right) , \end{aligned}$$

where \(\Delta =m_R-m_W\). The test \({\mathcal {T}}_0\) can be represented in the space \(((0,1)\times \mathbb {N})\), which we call the proportion-sample size space, by the pair \((p_0,n_0)\). Any other point \((p,n)\) in the same space represents a test \({\mathcal {T}}\) with sample size equal to \(n\) and allocation proportion to treatment \(R\) equal to \(p\). The goal is to identify the regions of this space that contain tests \({\mathcal {T}}\) performing better than \({\mathcal {T}}_0\), i.e.

  (a) \({\mathcal {T}}\) has a power function \(\beta _{{\mathcal {T}}}(\Delta )\) uniformly higher than the power function of \({\mathcal {T}}_0\), i.e. \(\beta _{{\mathcal {T}}_0}(\Delta )\);

  (b) \({\mathcal {T}}\) assigns fewer patients to the inferior treatment than \({\mathcal {T}}_0\).
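As a minimal sketch, the power function \(\beta _{{\mathcal {T}}_0}(\Delta )\) given above can be evaluated numerically for any design \((p,n)\). The function name `power_two_sided` and its argument names are illustrative choices, not part of the original analysis, and the printed values use the settings of Fig. 1 only as an example.

```python
from statistics import NormalDist


def power_two_sided(delta, n, p, sigma_r, sigma_w, alpha=0.05):
    """Power of the two-sided z-test at mean difference `delta` for the design (p, n)."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)  # quantile of order 1 - alpha/2
    # Standard deviation of the difference between the two sample means.
    se = (sigma_r**2 / (n * p) + sigma_w**2 / (n * (1 - p))) ** 0.5
    shift = delta / se
    lower_tail = std_normal.cdf(-z - shift)
    upper_tail = 1 - std_normal.cdf(z - shift)
    return lower_tail + upper_tail


# Illustrative values: alpha = 0.05, p0 = 0.5, n0 = 70, sigma_R = 1, sigma_W = 1.5.
for delta in (0.0, 0.5, 1.0):
    print(delta, round(power_two_sided(delta, 70, 0.5, 1.0, 1.5), 3))
```

At \(\Delta =0\) the function returns the level \(\alpha \), and it is symmetric in \(\Delta \), as expected for a two-sided test.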

To achieve condition (a) we impose the following constraint

$$\begin{aligned} \beta _{{\mathcal {T}}}(\Delta ) \ge \beta _{{\mathcal {T}}_0}(\Delta )\ \ \ \forall \Delta \in \mathbb {R} \ \Leftrightarrow \ \ \frac{\sigma _R^2}{n p} + \frac{\sigma _W^2}{n(1-p)} \le \frac{\sigma _R^2}{n_0p_0} + \frac{\sigma _W^2}{n_0(1-p_0)}. \end{aligned}$$
(3)

From (3) we compute the function \(n_{\beta }\) that separates two regions in the proportion-sample size space

$$\begin{aligned} n_{\beta }(p)\ =\ \left( \frac{\rho ^2}{p} + \frac{(1-\rho )^2}{1-p}\right) \left( \frac{\rho ^2}{n_0p_0} + \frac{(1-\rho )^2}{n_0(1-p_0)}\right) ^{-1} \end{aligned}$$
(4)

where \(\rho \) indicates the Neyman allocation proportion \(\frac{\sigma _R}{\sigma _R+\sigma _W}\), that, for any fixed sample size, provides the test with highest power.
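The boundary in (4) is straightforward to compute. The following sketch (with the hypothetical function name `n_beta`) evaluates it and checks two properties that follow from the formula: the curve passes through \((p_0,n_0)\), and with the values of Fig. 1 it dips below \(n_0\) at the Neyman proportion \(\rho =0.4\).

```python
def n_beta(p, n0, p0, sigma_r, sigma_w):
    """Smallest (real-valued) sample size at proportion p matching the power of (p0, n0)."""
    rho = sigma_r / (sigma_r + sigma_w)  # Neyman allocation proportion
    a, b = rho**2, (1 - rho) ** 2
    return (a / p + b / (1 - p)) / (a / (n0 * p0) + b / (n0 * (1 - p0)))


# Illustrative values: p0 = 0.5, n0 = 70, sigma_R = 1, sigma_W = 1.5 (so rho = 0.4).
print(n_beta(0.5, 70, 0.5, 1.0, 1.5))  # equals n0 at p = p0
print(n_beta(0.4, 70, 0.5, 1.0, 1.5))  # below n0 at p = rho
```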

In Fig. 1, points above the curve \(n_{\beta }\) (red line) indicate tests \({\mathcal {T}}\) with a power uniformly higher than \({\mathcal {T}}_0\). To satisfy condition (b) we distinguish two cases, depending on which treatment is superior:

  • if the superior treatment is \(R\), we impose

    $$\begin{aligned} n(1-p)<n_0(1-p_0)\ \ \Leftrightarrow \ \ p > 1-\frac{n_0}{n}(1-p_0); \end{aligned}$$
    (5)
  • if the superior treatment is \(W\), we impose

    $$\begin{aligned} np<n_0p_0\ \ \Leftrightarrow \ \ p<\frac{n_0}{n}p_0. \end{aligned}$$
    (6)

Both these constraints are depicted in blue in the proportion-sample size space. Below each of these lines, either (5) or (6) is satisfied. In conclusion, we divide the \(\textit{proportion-sample size}\) space into three regions:

  • Region \(A\):

    $$\begin{aligned} A = \left\{ (x,y)\in (0,1)\times (0,\infty )\ :\ n_{\beta }(x)<y<\frac{p_0}{x}n_0\right\} \end{aligned}$$

    tests \({\mathcal {T}}\in A\) having a power uniformly higher and assigning fewer patients to treatment \(R\) than \({\mathcal {T}}_0\).

  • Region \(B\):

    $$\begin{aligned} B = \left\{ (x,y)\in (0,1)\times (0,\infty )\ :\ y>\max \left\{ \frac{p_0}{x};\frac{1-p_0}{1-x}\right\} \cdot n_0 \right\} \end{aligned}$$

    tests \({\mathcal {T}}\in B\) having a power uniformly higher and assigning more patients to both treatments than \({\mathcal {T}}_0\).

  • Region \(C\):

    $$\begin{aligned} C\ =\ \left\{ (x,y)\in (0,1)\times (0,\infty )\ :\ n_{\beta }(x)<y<\frac{1-p_0}{1-x}n_0\ \right\} \end{aligned}$$

    tests \({\mathcal {T}}\in C\) having a power uniformly higher and assigning fewer patients to treatment \(W\) than \({\mathcal {T}}_0\).
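The three definitions above translate directly into a membership check. The sketch below uses a hypothetical helper `classify` that returns the label of the region containing a point \((p,n)\), or `None` when the point lies below the curve \(n_{\beta }\); the example points use the parameter values of Fig. 1.

```python
def classify(p, n, n0, p0, sigma_r, sigma_w):
    """Return 'A', 'B', 'C', or None for the point (p, n) of the proportion-sample size space."""
    rho = sigma_r / (sigma_r + sigma_w)
    a, b = rho**2, (1 - rho) ** 2
    n_beta = (a / p + b / (1 - p)) / (a / (n0 * p0) + b / (n0 * (1 - p0)))
    bound_r = p0 * n0 / p              # n below this bound: fewer patients on R than T0
    bound_w = (1 - p0) * n0 / (1 - p)  # n below this bound: fewer patients on W than T0
    if n > max(bound_r, bound_w):
        return "B"
    if n_beta < n < bound_r:
        return "A"
    if n_beta < n < bound_w:
        return "C"
    return None  # power not uniformly higher than that of T0


# Illustrative values: p0 = 0.5, n0 = 70, sigma_R = 1, sigma_W = 1.5.
for point in [(0.3, 80), (0.6, 80), (0.5, 100), (0.5, 60)]:
    print(point, classify(*point, 70, 0.5, 1.0, 1.5))
```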

Fig. 1
figure 1

The picture represents the regions \(A, B\) and \(C\), on the proportion-sample size space, with: \(\alpha =0.05, p_0=0.5, n_0=70, \sigma _R=1, \sigma _W=1.5\). The red line represents the function \(n_{\beta }\) in (4); it separates the tests \({\mathcal {T}}\) with power \(\beta _{{\mathcal {T}}}(\Delta ) > \beta _{{\mathcal {T}}_0}(\Delta )\), from the tests with power \(\beta _{{\mathcal {T}}}(\Delta ) < \beta _{{\mathcal {T}}_0}(\Delta )\). Blue lines separate tests according to the number of patients allocated to the treatments \(R\) and \(W\), with respect to \(n_{0,R}\) and \(n_{0,W}\). (Color figure online)

Hence, a test \({\mathcal {T}}=(p,n)\) is considered better than \({\mathcal {T}}_0\) if \((p,n)\in A\) and the superior treatment is \(W\), or if \((p,n)\in C\) and the superior treatment is \(R\). Unfortunately, the experimenter does not know which treatment is superior before the trial is conducted. For this reason, it is reasonable to use a response-adaptive design to construct the test \({\mathcal {T}}\). Let us introduce a vector \((X_1,X_2,...,X_n)\in \{0;1\}^n\) composed of the allocations to the treatments according to the adaptive design, i.e. \(X_i=1\) if subject \(i\) receives treatment \(R\) and \(X_i=0\) if subject \(i\) receives treatment \(W\). The quantities \(N_R(n) = \sum _{i=1}^n X_i\) and \(N_W(n) = \sum _{i=1}^n (1-X_i)\) are the numbers of patients allocated to treatments \(R\) and \(W\), respectively. Let us define the adaptive estimators based on the responses collected up to time \(n\)

$$\begin{aligned} \overline{M}(n)=\frac{\sum _{i=1}^n X_i M_i}{N_R(n)} \qquad \text {and} \qquad \overline{N}(n)=\frac{\sum _{i=1}^n (1-X_i) N_i}{N_W(n)}. \end{aligned}$$
(7)

Then, the test \({\mathcal {T}}\) is defined by the following critical region

$$\begin{aligned} R_{\alpha }^{adaptive }=\left\{ |\overline{M}(n)-\overline{N}(n)| > \sqrt{\frac{\sigma _R^2}{N_R(n)} + \frac{\sigma _W^2}{N_W(n)} } z_{\frac{\alpha }{2}} \right\} \end{aligned}$$
(8)

whose properties (in terms of power, level and asymptotic distribution of the test statistic) depend on the type of adaptive design that has been adopted in the trial.
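To make (7) and (8) concrete, the following sketch computes the adaptive estimators and checks the critical region for a given allocation vector. It assumes that `responses[i]` holds the single response actually observed for subject \(i\) (that is, \(M_i\) if \(X_i=1\) and \(N_i\) otherwise); the function name `adaptive_z_test` is hypothetical.

```python
from statistics import NormalDist


def adaptive_z_test(x, responses, sigma_r, sigma_w, alpha=0.05):
    """Return (standardized statistic, reject?) for the critical region in (8)."""
    n_r = sum(x)           # N_R(n)
    n_w = len(x) - n_r     # N_W(n)
    if n_r == 0 or n_w == 0:
        raise ValueError("both treatments need at least one observation")
    # Adaptive estimators of m_R and m_W as in (7).
    mean_r = sum(y for xi, y in zip(x, responses) if xi == 1) / n_r
    mean_w = sum(y for xi, y in zip(x, responses) if xi == 0) / n_w
    se = (sigma_r**2 / n_r + sigma_w**2 / n_w) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    statistic = (mean_r - mean_w) / se
    return statistic, abs(statistic) > z


# Toy data: two subjects on R, two on W, unit variances.
print(adaptive_z_test([1, 1, 0, 0], [10.0, 12.0, 1.0, 3.0], 1.0, 1.0))
```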

We propose to adopt as response-adaptive design the Modified Randomly Reinforced Urn design (MRRU) introduced in Aletti et al. (2013). In this model, an urn containing red and white balls is sequentially sampled and subjects are assigned to the treatments corresponding to the colors of the sampled balls. After each allocation, the urn is virtually reinforced with a random real number of balls depending on the response given by the patient just assigned. We call \(Z_n\) the proportion of red balls in the urn, which is also the probability of assigning the (\(n+1\))-th patient to treatment \(R\). We reinforce the number of red (white) balls only if \(Z_n<\eta \) (\(Z_n>\delta \)), where \(0<\delta \le \eta <1\) are fixed parameters. In Aletti et al. (2013) and Ghiglietti and Paganoni (2014) theoretical results concerning the MRRU model have been proved and the asymptotic behavior of the urn process has been discussed. In particular, when \(m_R\ne m_W\) it has been proved that

$$\begin{aligned} \lim _{n\rightarrow \infty }Z_n\ =\ \lim _{n\rightarrow \infty }\frac{N_R(n)}{n}\ =\ \eta \mathbf {1}_{\{m_R>m_W\}}+\delta \mathbf {1}_{\{m_R<m_W\}}\ \ \ a.s. \end{aligned}$$
(9)
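The urn dynamics described above can be illustrated with a short simulation. This is only a sketch under stated assumptions, not the implementation of Aletti et al. (2013): responses are Gaussian, the reinforcement is taken to be the response itself truncated at zero (the text only says that it depends on the response), and the names `simulate_mrru`, `d0` and `z0` are illustrative.

```python
import random


def simulate_mrru(n, m_r, m_w, sigma_r, sigma_w, delta, eta, d0, z0, seed=0):
    """Simulate n allocations; return the path of urn proportions and N_R(n)."""
    rng = random.Random(seed)
    red, white = d0 * z0, d0 * (1 - z0)  # initial urn composition
    z_path, n_r = [], 0
    for _ in range(n):
        z = red / (red + white)  # probability of assigning the next subject to R
        z_path.append(z)
        if rng.random() < z:
            n_r += 1
            response = rng.gauss(m_r, sigma_r)
            if z < eta:  # red balls are reinforced only below the threshold eta
                red += max(response, 0.0)  # assumption: truncated response
        else:
            response = rng.gauss(m_w, sigma_w)
            if z > delta:  # white balls are reinforced only above the threshold delta
                white += max(response, 0.0)  # assumption: truncated response
    return z_path, n_r


path, n_r = simulate_mrru(200, 15.0, 10.0, 1.5, 1.5, 0.3, 0.7, d0=12.5, z0=0.5, seed=1)
print(round(path[-1], 3), n_r)
```

Because the reinforcement of each color is switched off beyond its threshold, the urn proportion is pushed back whenever it overshoots \(\eta \) from below or \(\delta \) from above, which is the mechanism behind the limit in (9).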

Moreover, from (9) both the sequences \(N_R(n) = \sum _{i=1}^n X_i\) and \(N_W(n) = \sum _{i=1}^n (1-X_i)\) diverge to infinity a.s. For this reason we can apply Proposition 3.1 of Aletti et al. (2013) concerning the adaptive estimators \(\overline{M}(n)\) and \(\overline{N}(n)\) defined in (7), which is a consequence of Theorem 2 of Melfi et al. (2001), i.e.

Proposition 1

The estimators \(\overline{M}(n)\) and \(\overline{N}(n)\) are consistent estimators of \(m_R\) and \(m_W\), respectively. Moreover as \(n \rightarrow \infty \),

$$\begin{aligned} \left( \sqrt{N_R(n)}\frac{(\overline{M}(n)- m_R)}{\sigma _R},\sqrt{N_W(n)}\frac{(\overline{N}(n)- m_W)}{\sigma _W}\right) \rightarrow (\xi _1,\xi _2) \end{aligned}$$

in distribution, where \((\xi _1,\xi _2)\) are independent standard normal random variables.

This result gives us the asymptotic normality of the adaptive estimators \(\overline{M}(n)\) and \(\overline{N}(n)\). This result is very useful in an inferential setting, when a statistic based on the adaptive estimators is used. In particular, Proposition 1 provides the asymptotic normality of the test statistic, which justifies the term \(z_{\frac{\alpha }{2}}\) in (8).

Let us fix a sample size \(n\) higher than \(n_0\) used in \({\mathcal {T}}_0\) (i.e., \(n = c \cdot n_0\) with \(c > 1\)). For any \(n> n_0\), we can identify the following intervals

$$\begin{aligned} I^{R_i}_n\ =\ \left\{ \ x \in (0,1)\ :\ (x,n) \in R_i\ \right\} ,\ \mathrm{with}\ R_i\in \{A,B,C\}. \end{aligned}$$

Observe that the intervals \((I^{R_i}_n)_i\) are pairwise disjoint and their union is a subset of \((0,1)\). We look for an adaptive test \({\mathcal {T}}\) represented in the proportion-sample size space by a point in region \(A\) (\(C\)) when \(R\) (\(W\)) is the inferior treatment. This goal is achieved when \(\frac{N_R(n)}{n}\in I^{C}_n\) if \(m_R>m_W\), and \(\frac{N_R(n)}{n}\in I^{A}_n\) if \(m_R<m_W\). Since (9) holds, we set \(\delta \in I^A_n\) and \(\eta \in I^C_n\). This implies that the test \({\mathcal {T}}=(p,n)\) is in the right region, i.e. where both conditions (a) and (b) are satisfied. In Fig. 2 we show how the urn process \(Z_n\) converges towards the right region.
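For Gaussian responses with known variances, the intervals \(I^A_n\) and \(I^C_n\) have closed-form end points: the condition \(n_{\beta }(x)<n\) reduces to a quadratic inequality in \(x\), while constraints (6) and (5) give \(x<p_0n_0/n\) and \(x>1-(1-p_0)n_0/n\). The sketch below, with the hypothetical name `design_intervals`, follows this route and returns `None` for an interval that is empty.

```python
from math import sqrt


def design_intervals(n, n0, p0, sigma_r, sigma_w):
    """Return (I_A, I_C) for sample size n, each as a tuple (lower, upper) or None."""
    rho = sigma_r / (sigma_r + sigma_w)
    a, b = rho**2, (1 - rho) ** 2
    # n_beta(x) < n  <=>  a/x + b/(1-x) < t, with t defined below.
    t = n * (a / (n0 * p0) + b / (n0 * (1 - p0)))
    # a/x + b/(1-x) = t  <=>  t*x^2 + (b - a - t)*x + a = 0.
    disc = (b - a - t) ** 2 - 4 * t * a
    if disc <= 0:
        return None, None  # no proportion reaches the power of T0 with n subjects
    lo = ((t + a - b) - sqrt(disc)) / (2 * t)
    hi = ((t + a - b) + sqrt(disc)) / (2 * t)
    a_upper = min(hi, p0 * n0 / n)            # constraint (6)
    c_lower = max(lo, 1 - (1 - p0) * n0 / n)  # constraint (5)
    interval_a = (lo, a_upper) if lo < a_upper else None
    interval_c = (c_lower, hi) if c_lower < hi else None
    return interval_a, interval_c


# Illustrative values: parameters of Fig. 1 with n = 88 (about 1.25 * n0).
print(design_intervals(88, 70, 0.5, 1.0, 1.5))
```

Any \(\delta \) chosen in the first interval and \(\eta \) chosen in the second satisfy \((\delta ,n)\in A\) and \((\eta ,n)\in C\).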

Fig. 2
figure 2

The pictures represent the regions \(A, B\) and \(C\), with: \(\alpha =0.05, p_0=0.5, n_0=70, \sigma _R=1, \sigma _W=1.5\). For each fixed sample size \(n\), the parameters of the urn model \(\delta ,\eta \in (0,1)\) are chosen such that \((\delta ,n)\in A\) and \((\eta ,n)\in C\). (a) Simulations with \(m_R<m_W\). (b) Simulations with \(m_R>m_W\). In both pictures, the black lines represent ten replications of the urn process \((Z_k,k)\). (Color figure online)

Remark 1

It is worth observing that, without loss of generality, similar results can be proved for a one-sided test instead of (1), for instance \(H_0:m_R\le m_W\) and \(H_1:m_R> m_W\). In this case, the goal (b) is achieved when we assign more patients to treatment \(W\), so we can arbitrarily fix the parameter \(\delta \) within the interval \((0,\eta )\).

3 Different response distributions

In this section we relax some assumptions on response distributions. First, we consider Gaussian response distributions with unknown variances, then, we discuss the case of non-Gaussian responses (exponential and Bernoulli).

When the variances are unknown, the regions \(A-B-C\) cannot be defined a priori, since from (4) \(n_{\beta }\) depends on \(\rho =\frac{\sigma _R}{\sigma _R+\sigma _W}\). So, here we describe a convenient procedure to overcome this problem.

First, consider the adaptive estimators of the unknown variances \(S_R^2(n)\) and \(S_W^2(n)\), defined as follows

$$\begin{aligned} S_R^2(n)=\frac{\sum _{i=1}^n X_i (M_i-\overline{M}(n))^2}{N_R(n)-1},\quad \ \ \text {and}\ \ S_W^2(n)=\frac{\sum _{i=1}^n (1-X_i) (N_i-\overline{N}(n))^2}{N_W(n)-1}. \end{aligned}$$

Then, in (4) we replace the true variances \(\sigma _R^2\) and \(\sigma _W^2\) with their adaptive estimators \(S_R^2(i)\) and \(S_W^2(i)\), thus obtaining

$$\begin{aligned} n_{\beta }(p;\widehat{\rho }(i))\ :=\ \left( \frac{\widehat{\rho }^2(i)}{p} + \frac{(1-\widehat{\rho }(i))^2}{1-p}\right) \left( \frac{\widehat{\rho }^2(i)}{n_0p_0} + \frac{(1-\widehat{\rho }(i))^2}{n_0(1-p_0)}\right) ^{-1}, \end{aligned}$$
(10)

where \(\widehat{\rho }(i)=\frac{S_R(i)}{S_R(i)+S_W(i)}\).

We note that \(n_{\beta }(\cdot ;\widehat{\rho }(i))\) in (10) is a time dependent random function, since it depends on \(\widehat{\rho }(i)\); at each step \(i\le n\), a new response is collected, the adaptive estimators are updated and the function \(n_{\beta }(\cdot ;\widehat{\rho }(i))\) changes. So, also the intervals \(I^A_i, I^B_i, I^C_i\) will be random and they will change for any \(i\le n\). This generates two sequences \((\delta _i)_i,(\eta _i)_i\) instead of two parameters \(\delta ,\eta \), since we need to maintain the parameters of the urn model within the corresponding intervals: \(\delta _i\in I^A_i\) and \(\eta _i\in I^C_i\).
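The plug-in step can be sketched as follows. The helper names `rho_hat` and `n_beta_plugin` are hypothetical; `statistics.stdev` uses the \(N-1\) denominator, in agreement with the definition of \(S_R^2(n)\) and \(S_W^2(n)\) above, and the two input lists are assumed to contain the responses observed so far on each treatment (at least two per treatment).

```python
from statistics import stdev


def rho_hat(responses_r, responses_w):
    """Plug-in estimate of rho from the responses collected so far on R and W."""
    s_r, s_w = stdev(responses_r), stdev(responses_w)
    return s_r / (s_r + s_w)


def n_beta_plugin(p, rho, n0, p0):
    """Boundary (10) evaluated at proportion p for a given estimate of rho."""
    a, b = rho**2, (1 - rho) ** 2
    return (a / p + b / (1 - p)) / (a / (n0 * p0) + b / (n0 * (1 - p0)))


# Toy data: sample standard deviations 1 and 2, hence rho_hat = 1/3.
r = rho_hat([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(r, n_beta_plugin(0.3, r, 106, 0.5))
```

Recomputing `rho_hat` after each new response and re-evaluating the boundary yields the random intervals \(I^A_i\) and \(I^C_i\) within which \(\delta _i\) and \(\eta _i\) must be kept.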

From Melfi and Page (2000) we have that the adaptive estimators \(S_R^2(n)\) and \(S_W^2(n)\) are strongly consistent, since the sequences \(N_R(n)\) and \(N_W(n)\) increase to infinity almost surely. Moreover, since \(\widehat{\rho }(\cdot )\) and \(n_{\beta }(p;\cdot )\) are continuous functions, the consistency of \(S_R^2(n)\) and \(S_W^2(n)\) implies that \(n_{\beta }(p;\widehat{\rho }(i))\mathop {\longrightarrow }\limits ^{a.s.}n_{\beta }(p)\) for any \(p\in (0,1)\). So, we have that \(\delta _n\mathop {\longrightarrow }\limits ^{a.s.}\delta ,\eta _n\mathop {\longrightarrow }\limits ^{a.s.}\eta \) and \(\delta \in I^A, \eta \in I^C\). This implies that \(Z_n\) converges a.s. to \(\delta \) when \(m_R<m_W\) or to \(\eta \) when \(m_R>m_W\) [for further details see Ghiglietti (2014)].

When we relax the normality assumption on the reinforcement distributions, it is not easy to write the power function of the test in analytic form, and hence to solve the condition \(\beta _{{\mathcal {T}}}(\Delta )\ge \beta _{{\mathcal {T}}_0}(\Delta )\) and compute the function \(n_{\beta }\) explicitly. However, this task can be carried out numerically; so we will show that the \(\textit{proportion-sample size}\) space can again be partitioned into the regions \(A-B-C\) even with non-Gaussian reinforcements.

Exponential responses:

Let us make the following assumptions on the responses

  • responses to treatment \(R\): \(M_1,M_2,..,M_{n_{0,R}}\) i.i.d. \(\sim \mathcal {E}({\uplambda }_R)\).

  • responses to treatment \(W\): \(N_1,N_2,..,N_{n_{0,W}}\) i.i.d. \(\sim \mathcal {E}({\uplambda }_W)\).

Our aim is to perform the following hypothesis test

$$\begin{aligned} H_0:\ {\uplambda }_R={\uplambda }_W\ \ \ \ \ \ \ \ \ vs\ \ \ \ \ \ \ \ \ H_1:\ {\uplambda }_R\ne {\uplambda }_W. \end{aligned}$$
(11)

The likelihood function of the whole sample is

$$\begin{aligned} L({\uplambda }_R,{\uplambda }_W,data)= & {} {\uplambda }_R^{n_{0,R}}{\uplambda }_W^{n_{0,W}}\exp \left( -{\uplambda }_R\sum _{i=1}^{n_{0,R}}M_i -{\uplambda }_W\sum _{i=1}^{n_{0,W}}N_i\right) \\= & {} \left( \ {\uplambda }_R^{p_0}{\uplambda }_W^{1-p_0}\exp \left( -{\uplambda }_R\overline{M}_{n_{0,R}}p_0 -{\uplambda }_W\overline{N}_{n_{0,W}}(1-p_0)\right) \ \right) ^{n_0} \end{aligned}$$

where \(\overline{M}_{n_{0,R}}=\sum _{i=1}^{n_{0,R}}M_i/n_{0,R}\) and \(\overline{N}_{n_{0,W}}=\sum _{i=1}^{n_{0,W}}N_i/n_{0,W}\). Then, the likelihood ratio test (see Lehmann and Romano 2005) gives us the following critical region

$$\begin{aligned}&\left\{ \ \frac{\sup _{{\uplambda }_R={\uplambda }_W\in (0,\infty )}L({\uplambda }_R,{\uplambda }_W,data)}{\sup _{({\uplambda }_R,{\uplambda }_W)\in (0,\infty )^2}L({\uplambda }_R,{\uplambda }_W,data)} \ <\ c_{\alpha }\ \right\} \\&\quad = \left\{ \ \frac{\overline{M}_{n_{0,R}}^{p_0}\ \cdot \ \overline{N}_{n_{0,W}}^{1-p_0}}{\overline{M}_{n_{0,R}}\cdot p_0+\overline{N}_{n_{0,W}}\cdot (1-p_0)} \ <\ \sqrt[n_0]{c_{\alpha }}\ \right\} \end{aligned}$$

where \(c_{\alpha }\in (0,1)\) can be determined to set the level of this critical region equal to \(\alpha \).
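One way to determine the threshold in practice is by simulation. The sketch below computes the statistic on the left-hand side of the second form of the critical region and approximates its \(\alpha \)-quantile under \(H_0\) by Monte Carlo, so the returned value is on the scale of the root of \(c_{\alpha }\) appearing there. It relies on the fact that the statistic is invariant to a common rescaling of the responses, so its null distribution can be simulated with rate 1. The function names, the number of replications and the seed are illustrative choices, not the procedure used in the paper.

```python
import random


def exp_lrt_statistic(mean_r, mean_w, p0):
    """Weighted geometric mean over weighted arithmetic mean of the two sample means."""
    return mean_r**p0 * mean_w ** (1 - p0) / (p0 * mean_r + (1 - p0) * mean_w)


def critical_value(n_r, n_w, alpha=0.05, reps=2000, seed=0):
    """Monte Carlo alpha-quantile of the statistic under H0 (common rate set to 1)."""
    rng = random.Random(seed)
    p0 = n_r / (n_r + n_w)
    stats = []
    for _ in range(reps):
        mean_r = sum(rng.expovariate(1.0) for _ in range(n_r)) / n_r
        mean_w = sum(rng.expovariate(1.0) for _ in range(n_w)) / n_w
        stats.append(exp_lrt_statistic(mean_r, mean_w, p0))
    stats.sort()
    return stats[int(alpha * reps)]  # reject H0 when the statistic falls below this value


print(exp_lrt_statistic(0.5, 1.0, 0.5), critical_value(34, 33))
```

The statistic equals 1 when the two sample means coincide and is smaller than 1 otherwise, so small values are evidence against \(H_0\).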

Bernoulli responses:

Let us make the following assumptions on patients’ responses

  • responses to treatment \(R\): \(M_1,M_2,..,M_{n_{0,R}}\) i.i.d. \(\sim \mathcal {B}(p_R)\).

  • responses to treatment \(W\): \(N_1,N_2,..,N_{n_{0,W}}\) i.i.d. \(\sim \mathcal {B}(p_W)\).

Let us consider now the following hypothesis test

$$\begin{aligned} H_0:\ p_R=p_W\ \ \ \ \ \ \ \ \ vs\ \ \ \ \ \ \ \ \ H_1:\ p_R\ne p_W. \end{aligned}$$
(12)

The likelihood function for two samples of Bernoulli variables is

$$\begin{aligned} \begin{aligned}&L(p_R,p_W,data)\\&\quad =\left( p_R^{\overline{M}_{n_{0,R}}p_0}(1-p_R)^{(1-\overline{M}_{n_{0,R}})p_0}p_W^{\overline{N}_{n_{0,W}}(1-p_0)} (1-p_W)^{(1-\overline{N}_{n_{0,W}})(1-p_0)}\right) ^{n_0} \end{aligned} \end{aligned}$$

Then, the likelihood ratio test, see Lehmann and Romano (2005), gives us the following critical region

$$\begin{aligned} \begin{aligned}&\left\{ \ \frac{\sup _{p_R=p_W\in (0,1)}L(p_R,p_W,data)}{\sup _{(p_R,p_W)\in (0,1)^2}L(p_R,p_W,data)} \ <\ c_{\alpha }\ \right\} \\&=\left\{ \ \frac{\overline{P}^{\overline{P}}(1-\overline{P})^{1-\overline{P}}}{\overline{M}_{n_{0,R}}^{\overline{M}_{n_{0,R}}p_0}(1-\overline{M}_{n_{0,R}})^{(1-\overline{M}_{n_{0,R}})p_0} \overline{N}_{n_{0,W}}^{\overline{N}_{n_{0,W}}(1-p_0)}(1-\overline{N}_{n_{0,W}})^{(1-\overline{N}_{n_{0,W}})(1-p_0)}} \ <\ \sqrt[n_0]{c_{\alpha }}\ \right\} \end{aligned} \end{aligned}$$

where

$$\begin{aligned} \overline{P}=\frac{\sum _{i=1}^{n_{0,R}}M_i+\sum _{i=1}^{n_{0,W}}N_i}{n_0}=\overline{M}_{n_{0,R}}p_0+\overline{N}_{n_{0,W}}(1-p_0). \end{aligned}$$

Also in this case, \(c_{\alpha }\in (0,1)\) can be determined to set the level of this critical region equal to \(\alpha \).
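The Bernoulli statistic can be evaluated in the same way from the two sample proportions. In the sketch below the name `bern_lrt_statistic` and the inner helper `term` are hypothetical; the convention \(0^0=1\), which Python applies to floating-point powers, covers samples made only of successes or only of failures.

```python
def bern_lrt_statistic(mean_r, mean_w, p0):
    """Likelihood ratio statistic (root scale) for two Bernoulli samples."""
    pooled = p0 * mean_r + (1 - p0) * mean_w  # overall proportion of successes

    def term(q, weight):
        # q^(q*weight) * (1-q)^((1-q)*weight), with 0**0 evaluated as 1.
        return q ** (q * weight) * (1 - q) ** ((1 - q) * weight)

    return term(pooled, 1.0) / (term(mean_r, p0) * term(mean_w, 1 - p0))


# Toy values: observed success proportions 0.2 and 0.5, balanced allocation.
print(bern_lrt_statistic(0.2, 0.5, 0.5))
```

As in the exponential case, the statistic equals 1 when the two sample proportions coincide and decreases as they move apart.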

The power function (\(\widehat{\beta }_{(p_0,n_0)}\)) in both cases (11) and (12) can be numerically computed. For any \(p\in (0,1)\), we define

$$\begin{aligned} n_{\beta }(p)\ :=\ \min \left\{ \ n\ge 1\ :\ \widehat{\beta }_{(p,n)}\ge \widehat{\beta }_{(p_0,n_0)}\ \right\} \end{aligned}$$

Once we have computed the function \(n_{\beta }(\cdot )\), we partition the \(\textit{proportion-sample size}\) space, we introduce the intervals \(I_n^C\) and \(I_n^A\) and we fix the parameters \(\eta \) and \(\delta \). As we can see from Fig. 3, the shape of the regions is the same as that obtained in the case of Gaussian responses.

Fig. 3
figure 3

Left panel: exponential responses with \({\uplambda }_R=2\) and \({\uplambda }_W=1\). The parameters of test \({\mathcal {T}}_0\) are: \(\alpha =0.05, 1-\beta _0=0.2, \Delta _0=\Delta =1/2\), allocation proportion \(p_0=1/2\) and sample size \(n_0=67\). Right panel: Bernoulli responses with \(p_R=0.2\) and \(p_W=0.5\). The parameters of test \({\mathcal {T}}_0\) are: \(\alpha =0.05, 1-\beta _0=0.2, \Delta _0=\Delta =0.3\), allocation proportion \(p_0=1/2\) and sample size \(n_0=76\). a Exponential responses. b Bernoulli responses

4 Simulation studies

In this section we show some simulation studies that aim at illustrating the theory presented in the previous sections of the paper. Let us consider the two-sided hypothesis test (1), for comparing the mean effect of two treatments \(R\) and \(W\). We simulated Gaussian responses to treatments \(R\) and \(W\) with parameters:

  • \(m_W=10\),

  • \(m_R\in \{5,7,9,9.5,10.5,11,13,15\}\),

  • equal variances: \(\sigma _R^2=\sigma _W^2=1.5^2\),

  • different variances: \(\sigma _R^2=1,\sigma _W^2=2^2\).

The test \({\mathcal {T}}_0\) is defined by setting the following parameters: \(\alpha =0.05, \beta _0=0.9, \Delta _0=1, p_0=0.5\). Then, the sample size for \({\mathcal {T}}_0\) can be computed: it is \(n_0=96\) when the variances are equal and \(n_0=106\) when the variances are different.
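These sample sizes can be recovered with the usual normal-approximation formula for a two-sided test. The sketch below is one way to do it, under two assumptions that are not stated in the text: the negligible tail of the power function is dropped, and the size of each arm is rounded up separately. The function name `classical_sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def classical_sample_size(delta0, p0, sigma_r, sigma_w, alpha=0.05, power=0.9):
    """Total sample size of T0 reaching the given power at mean difference delta0."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    # Real-valued total size from the normal approximation.
    total = (sigma_r**2 / p0 + sigma_w**2 / (1 - p0)) * z_sum**2 / delta0**2
    # Assumption: each arm is rounded up to the next integer.
    return ceil(total * p0) + ceil(total * (1 - p0))


print(classical_sample_size(1.0, 0.5, 1.5, 1.5))  # equal variances
print(classical_sample_size(1.0, 0.5, 1.0, 2.0))  # different variances
```

Under these assumptions the two calls return 96 and 106, in agreement with the values given above.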

At this point, we apply the procedure described in Sect. 2 to get an adaptive test \({\mathcal {T}}\) based on MRRU design performing better than \({\mathcal {T}}_0\). The sample size of \({\mathcal {T}}\) has been increased by 25 % (\(n=1.25\cdot n_0\)), obtaining \(n=120\) in the case of equal variances and \(n=132\) in the case of different variances. In both cases, we can design the regions \(A, B\) and \(C\) and the corresponding intervals \(I^A_n, I^B_n\) and \(I^C_n\)

  • \(\sigma _R^2=1.5^2, \sigma _W^2=1.5^2\ \ \Rightarrow \) \(I^A_{120}=(0.279,0.403)\), \(I^C_{120}=(0.597,0.721)\).

  • \(\sigma _R^2=1, \sigma _W^2=4\ \ \ \ \ \ \ \ \Rightarrow \) \(I^A_{132}=(0.127,0.402)\), \(I^C_{132}=(0.598,0.632)\).

In all simulations, the urn has been initialized with a total number of balls equal to \(d_0=(m_R+m_W)/2\); the initial urn proportion \(z_0\) has been set at the center of the interval \((\delta ,\eta )\). Then, for each value of \(m_R \in \{5,7,9,9.5,10.5,11,13,15\}\), we have run \(1000\) urn processes \((Z_k)_k\) stopped at time \(n\).

In Table 1 (equal variances) and in Table 2 (different variances), we report the proportion of simulation runs in which the power of \({\mathcal {T}}\) is higher than the power of \({\mathcal {T}}_0\) (first column) and the proportion of replications in which \({\mathcal {T}}\) assigns fewer subjects than \({\mathcal {T}}_0\) to treatments \(R\) and \(W\) (second and third columns). Parentheses indicate the allocations to the superior treatment. In Fig. 4, we report side-by-side boxplots of the number of subjects assigned to the inferior treatment in the 1000 replications of the urn design, for different values of \(\Delta \).

Fig. 4
figure 4

Side-by-side boxplots of the number of subjects allocated to the inferior treatment by \({\mathcal {T}}\) for \(\Delta \in \{-5,-3,-1,-0.5,0.5,1,3,5\}\). The red line represents the number of subjects allocated to the inferior treatment by \({\mathcal {T}}_0\). Left panel: case of equal variances (\(\sigma _R^2=\sigma _W^2=1.5^2\)). Right panel: case of different variances (\(\sigma _R^2=1\) and \(\sigma _W^2=4\)). a Equal variances. b Different variances

Table 1 Proportion of simulation runs in which \({\mathcal {T}}\) performs better than \({\mathcal {T}}_0\) in terms of power (first column) and subjects assigned to the inferior treatment (second/third column).
Table 2 Proportion of simulation runs in which \({\mathcal {T}}\) performs better than \({\mathcal {T}}_0\) in terms of power (first column) and subjects assigned to the inferior treatment (second/third column).

It is interesting to investigate the procedure described in Sect. 2 when the test \({\mathcal {T}}_0\) adopts an allocation related to the treatment performances. Let us consider the Optimal Adaptive Design for Bernoulli responses (RSIHR) presented in Rosenberger et al. (2001). The allocation proportion of this model converges to \(p_0=\frac{\sqrt{p_R}}{\sqrt{p_R}+\sqrt{p_W}}\), which is the allocation that minimizes the expected number of failures at fixed power \(\beta _0\), where \(p_R\) and \(p_W\) are the success probabilities of \(R\) and \(W\), respectively. Let us fix

  • Significance level \(\alpha =0.05\) and the power \(\beta _0=0.9\)

  • Success probabilities: \(p_R=0.2, p_W=0.1\)

Then, \({\mathcal {T}}_0\) should have an allocation proportion \(p_0=\frac{\sqrt{p_R}}{\sqrt{p_R}+\sqrt{p_W}}=0.586\) and sample size \(n_0=516\).
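The two values above can be checked with a short computation. The sketch assumes a Wald-type test with unpooled variances \(p_R(1-p_R)\) and \(p_W(1-p_W)\) and rounds each arm up separately; these are assumptions made here for illustration, since the text does not specify how \(n_0\) was obtained. The function name `rsihr_design` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def rsihr_design(p_r, p_w, alpha=0.05, power=0.9):
    """Return the RSIHR target proportion p0 and an approximate sample size n0."""
    p0 = sqrt(p_r) / (sqrt(p_r) + sqrt(p_w))
    nd = NormalDist()
    z_sum = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    # Assumption: Wald test with unpooled Bernoulli variances.
    var_term = p_r * (1 - p_r) / p0 + p_w * (1 - p_w) / (1 - p0)
    total = var_term * z_sum**2 / (p_r - p_w) ** 2
    n0 = ceil(total * p0) + ceil(total * (1 - p0))
    return p0, n0


p0, n0 = rsihr_design(0.2, 0.1)
print(round(p0, 3), n0)
```

Under these assumptions the sketch gives \(p_0\approx 0.586\) and \(n_0=516\), matching the values above.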

By following the procedure described in Sect. 2, we construct the test \({\mathcal {T}}\) based on the MRRU model with sample size \(n=645, \eta =0.724\) and \(\delta =0.402\). We ran 200 replications and the results are reported in Fig. 5. For both tests \({\mathcal {T}}\) and \({\mathcal {T}}_0\), red boxplots indicate the power, blue boxplots represent the number of subjects assigned to the inferior treatment and green boxplots indicate the number of failures for \(n=645\) subjects. Since \({\mathcal {T}}_0\) uses only \(n_0=516\) subjects, we have counted the failures of the remaining \(n-n_0=129\) subjects as if they had been assigned to the superior treatment.

Fig. 5
figure 5

Boxplots representing the empirical power (left), the number of subjects assigned to the inferior treatment (center) and the number of failures (right) from 200 replications of \({\mathcal {T}}_0\) (RSIHR model) and \({\mathcal {T}}\) (MRRU model)

5 Real case study

In this section we show a real case study, also presented in Ghiglietti and Paganoni (2014), where the application of the methodology presented in the paper would have improved the performance of a classical test, from both the statistical and the ethical point of view. We consider data concerning treatment times of patients affected by ST-Elevation Myocardial Infarction gathered in the MOMI\(^2\) (MOnth MOnitoring Myocardial Infarction in MIlan) study (see Grieco et al. 2012). The main rescue procedure for these patients is Primary Angioplasty. It is well known that the time between the arrival at the ER (called Door) and the time of intervention (called Balloon) must be reduced as much as possible in order to improve the outcome of patients and reduce in-hospital mortality. So in this case the Door to Balloon time (DB) is the treatment response. We have two different treatments: the patients managed by the 118 (toll-free number for emergencies in Italy) and the self-presented ones. We design our experiment to allocate the majority of patients to the better-performing treatment, and simultaneously to collect evidence for comparing the distributions of DB times.

Data are door-to-balloon times (DB) in minutes of 1179 patients. Among them, 657 subjects have been managed by 118, while the other 522 subjects reached the hospital by themselves. We identify treatment \(W\) with the choice of calling 118 and treatment \(R\) with the choice of going to the hospital by themselves. Treatment responses are represented by DB times. Since lower responses (DB times) indicate a better treatment, without loss of generality we transform the responses through a monotonically decreasing function, so that higher transformed values correspond to better outcomes. The true means and variances of populations \(R\) and \(W\) have been computed using all data, obtaining: \(m_R=1.503, m_W=1.996, \sigma _R=0.518, \sigma _W=0.760\). The true difference of the means \(\Delta =m_R-m_W=-0.493\) is negative, so \(W\) is the superior treatment in this case.

Initially, we consider a test \({\mathcal {T}}_0\) to compare the mean effects of treatments \(R\) and \(W\). Let us fix \(\alpha =0.01, \beta _0=0.95, \Delta _0=0.5\). The allocation proportion is empirically set equal to \(p_0=0.468\). Response distributions are verified to be Gaussian. Then, for a two-sided t-test we need a total of \(n_0=119\) subjects, \(n_0p_0=56\) allocated to treatment \(R\) and \(n_0(1-p_0)=63\) allocated to treatment \(W\). The power of test \({\mathcal {T}}_0\) evaluated at \(\Delta \) is \(\beta _{{\mathcal {T}}_0}(\Delta )=0.945\).
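As a rough check on the sample size above, the following minimal sketch uses the standard normal-approximation formula for a two-sided test on the difference of two means with unequal allocation. It is an illustration under stated assumptions, not the authors' computation: the function name `sample_size_two_means` is hypothetical, and the reported standard deviations are plugged in as if they were known, whereas the text refers to a t-test.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_means(alpha, power, delta0, sigma_r, sigma_w, p):
    """Total sample size for a two-sided z-test on a difference of means.

    Assumption: normal approximation with known standard deviations.
    p is the fraction of subjects allocated to treatment R.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value of the two-sided test
    z_power = z.inv_cdf(power)           # quantile matching the target power
    # Variance of the difference of sample means, multiplied by n
    var_unit = sigma_r**2 / p + sigma_w**2 / (1 - p)
    return ceil((z_alpha + z_power) ** 2 * var_unit / delta0**2)


# Values reported in the case study
n0 = sample_size_two_means(alpha=0.01, power=0.95, delta0=0.5,
                           sigma_r=0.518, sigma_w=0.760, p=0.468)
n_r = round(n0 * 0.468)   # subjects allocated to R
n_w = n0 - n_r            # subjects allocated to W
print(n0, n_r, n_w)       # 119 56 63
```

Under these assumptions the approximation reproduces \(n_0=119\) and the 56/63 split reported above.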

Now, consider the MRRU model to construct the adaptive test \({\mathcal {T}}\). \({\mathcal {T}}\) involves more subjects in the experiment than \({\mathcal {T}}_0\); in particular, \(n\) is computed as \(1.25\times n_0=148\). Nevertheless, since in practice variances are unknown, \(n_0\) and \(n\) are computed from variance estimators. As a consequence, the sample size of \({\mathcal {T}}\) is random and each replication of \({\mathcal {T}}\) has a different value of \(n\).

We performed 500 simulation runs of the urn procedure. Each replication uses a subset of responses selected by permutation from the whole dataset. In Fig. 6, ten replications of the urn proportion process \((Z_n)_n\) are represented.

Fig. 6

Black lines represent ten replications of the urn proportion process \((Z_n)_n\). Each replication uses responses taken at random from the data at our disposal. The proportion-sample size space has been partitioned assuming known variances. (Color figure online)

As we can see from Fig. 6, the urn process seems to target region \(A\), where parameter \(\delta \) is set. Then, test \({\mathcal {T}}\) has higher power and assigns fewer patients to treatment \(R\) than \({\mathcal {T}}_0\). This is our goal, since \(R\) is the inferior treatment in this case (\(m_R<m_W\)).
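The behaviour of the urn proportion can be illustrated with a small simulation. The sketch below is a simplified reading of the MRRU update rule (a red ball assigns the subject to \(R\), and red balls are added only while \(Z_n<\eta \); a white ball assigns to \(W\), and white balls are added only while \(Z_n>\delta \)). Several ingredients are assumptions made for illustration only: the fixed thresholds `delta=0.3` and `eta=0.7`, the initial composition of one red and one white ball, Gaussian responses drawn from the reported moments in place of the resampled data, truncation of responses at zero, and the function name `simulate_mrru`.

```python
import random


def simulate_mrru(n, m_r, s_r, m_w, s_w, delta, eta, r0=1.0, w0=1.0, seed=None):
    """Simulate one path of a two-colour urn with thresholded reinforcement.

    Returns the number of subjects assigned to R and the path of the
    urn proportion Z after each subject. Thresholds and initial urn
    composition are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    red, white = r0, w0
    n_r = 0
    path = []
    for _ in range(n):
        z = red / (red + white)
        if rng.random() < z:
            # Red ball drawn: subject goes to treatment R
            n_r += 1
            response = max(rng.gauss(m_r, s_r), 0.0)  # keep reinforcement non-negative
            if z < eta:
                red += response
        else:
            # White ball drawn: subject goes to treatment W
            response = max(rng.gauss(m_w, s_w), 0.0)
            if z > delta:
                white += response
        path.append(red / (red + white))
    return n_r, path


# One replication with the moments reported in the case study and n = 148
n_r, path = simulate_mrru(148, m_r=1.503, s_r=0.518, m_w=1.996, s_w=0.760,
                          delta=0.3, eta=0.7, seed=1)
print(n_r, round(path[-1], 3))
```

Repeating the call with different seeds and collecting `n_r` gives an empirical distribution of the number of subjects assigned to \(R\), in the spirit of the replications shown in Fig. 6.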

For each of the 500 replications we compute analytically the power evaluated at \(\Delta \). In Fig. 7 we show a boxplot with the power of the 500 replications of the urn model, to be compared with the power of \({\mathcal {T}}_0\). Moreover, we show for each simulation the proportion of subjects assigned to treatment \(R\), to be compared with the proportion of subjects assigned to treatment \(R\) by \({\mathcal {T}}_0\).
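The analytical power of a single replication depends only on the numbers of subjects assigned to each treatment. A minimal sketch of such a computation, assuming a two-sided z-test with the reported standard deviations treated as known, is given below; the helper name `power_from_counts` and the 56/92 split used for \({\mathcal {T}}\) are hypothetical and serve only as an example.

```python
from math import sqrt
from statistics import NormalDist


def power_from_counts(n_r, n_w, delta, sigma_r, sigma_w, alpha):
    """Power of a two-sided z-test at a true mean difference delta.

    Assumption: known standard deviations (normal approximation).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Standard error of the difference of the two sample means
    se = sqrt(sigma_r**2 / n_r + sigma_w**2 / n_w)
    shift = abs(delta) / se
    # Probability of rejecting in either tail
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)


params = dict(delta=-0.493, sigma_r=0.518, sigma_w=0.760, alpha=0.01)
power_t0 = power_from_counts(56, 63, **params)   # allocation of T_0
power_t = power_from_counts(56, 92, **params)    # hypothetical allocation with n = 148
print(round(power_t0, 3))    # 0.945
print(power_t > power_t0)    # True
```

With the allocation of \({\mathcal {T}}_0\) this approximation agrees with the reported value 0.945. Keeping 56 subjects on \(R\) while enlarging the sample to 148 raises the power, which is the effect the adaptive test aims for.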

Fig. 7

On the left: boxplot representing 500 values of power evaluated at the true difference of the means \(\Delta =-0.493\) using \({\mathcal {T}}\): \(\beta _{{\mathcal {T}}}(\Delta )\). The red line represents the power obtained with \({\mathcal {T}}_0\): \(\beta _{{\mathcal {T}}_0}(\Delta )=0.945\). On the right: boxplot representing 500 values of the proportion of subjects assigned to treatment \(R\) by \({\mathcal {T}}\): \(N_R/n\). The red line represents the proportion of subjects assigned to treatment \(R\) by \({\mathcal {T}}_0\): \(p_0=0.468\)

From Fig. 7, we note that the MRRU design constructs a test \({\mathcal {T}}\) with power higher than \({\mathcal {T}}_0\). This occurs in more than 99 % of replications, and the average power over the replications is

$$\begin{aligned} \frac{1}{500}\sum _{i=1}^{500} \beta _{{\mathcal {T}}i}(\Delta )\ =\ 0.975\ >\ 0.945\ =\ \beta _{{\mathcal {T}}_0}(\Delta ). \end{aligned}$$

Even though \({\mathcal {T}}\) uses a sample size \(n\) larger than \({\mathcal {T}}_0\), in 52.6 % of the runs the number of subjects allocated to the inferior treatment \(R\) by \({\mathcal {T}}\) is less than by \({\mathcal {T}}_0\). Moreover, the average number of units assigned to treatment \(R\) is almost the same as the number computed with \({\mathcal {T}}_0\):

$$\begin{aligned} \frac{1}{500}\sum _{i=1}^{500} N_{Ri}\ =\ 56.43\ \simeq \ 56\ =\ n_0\cdot p_0. \end{aligned}$$

6 Conclusions

In this paper we analyze the statistical properties of tests that compare the means of the responses to two treatments. Given a test \({\mathcal {T}}_0\), we point out which features a response-adaptive test \({\mathcal {T}}\) should have in order to perform better than \({\mathcal {T}}_0\). In a clinical trials framework, this goal is achieved when \({\mathcal {T}}\) (a) has higher power and (b) assigns fewer subjects to the inferior treatment than \({\mathcal {T}}_0\). Specifically, we identify in the proportion-sample size space the subregions in which to select the allocation proportion \(p\) and the sample size \(n\) of the test \({\mathcal {T}}\).

The test \({\mathcal {T}}\) can be implemented by using a response-adaptive design. We propose an urn procedure (MRRU) that is able to target a fixed allocation proportion in (0, 1). This urn model places the test \({\mathcal {T}}\) in a specific region, depending on which treatment is inferior, so that both goals (a) and (b) are accomplished. We show that the assumption of Gaussian responses and known variances can be relaxed. We report some simulations and a case study that illustrate the good performance of the procedure.