1 Introduction

Evaluation of success is a crucial step of an experiment, especially at the design stage. Under the standard frequentist approach to testing, success is evaluated by computing the power function \(\eta _n(\cdot )\), the probability of rejecting the null hypothesis, at a fixed (design) value of the parameter of interest. It is well known that this approach lacks flexibility and does not account for uncertainty about the design value.

Conversely, under the hybrid Bayesian-frequentist paradigm, the parameter of interest \(\theta \) is considered as a random variable, \(\Theta \), with distribution \(\pi (\theta )\), often called design prior (in most cases a density). Consequently, the power function \(\eta _n(\Theta )\) is a random variable as well: we will refer to it as random power (it is also known as predictive power, see Spiegelhalter et al. 2004). In most cases, the experimental success is evaluated by the expected value of the random power, commonly known as probability of success (PoS). However, the concept of PoS has been highly debated in the literature: Kunzmann et al. (2021) is a recent review paper that identifies as many as 17 different definitions of it. The computation of PoS is a routine component of study planning and decision-making (Liu and Yu 2024), especially in clinical trials, where it is employed for several purposes: to support funding approval from sponsor governance boards (Crisp et al. 2018; Wang et al. 2013), to compute the optimal sample size at the design stage (Kunzmann et al. 2021) or at interim analysis (Wang 2007), and to choose the clinical development plan (Temple and Robertson 2021).

However, PoS is just the expected value of a random variable, whose distribution may not be well represented by its mean; the perils of relying solely on averaging are pointed out, for example, in Liu and Yu (2024) and Dallow and Fina (2011). Starting from the seminal paper of Spiegelhalter et al. (1986), several authors suggested complementing PoS with alternative summaries (such as the median or other quantiles, see for instance Huson 2009) or with the whole distribution that PoS summarizes (Huson 2009; Rufibach et al. 2016; De Santis and Gubbiotti 2023). The latter, in fact, provides an overall indication of the chance of success of an experiment: the basic idea is that a test is well-designed if the distribution of the random power induced by the design prior assigns high density to large values of the power (i.e. values as close to one as possible). The whole density function of \(\eta _n(\Theta )\) was studied in Rufibach et al. (2016) with the objective of providing recommendations and guidelines on the design prior choice, but Kunzmann et al. (2021), Liu (2010) and Dallow and Fina (2011) argued that, by definition, \(\eta _n(\Theta )\) and its mean are misleading measures of the success of the experiment, since they represent the probability of rejecting the null hypothesis regardless of whether it is true. Recently, Kunzmann et al. (2021) and De Santis et al. (2024) showed that PoS has essentially four main specifications, each representing the expected value of a power-related random variable (PrRV) based on a suitable function of \(\eta _n(\Theta )\). Since PrRVs have the potential to provide an overall indication of the success of an experiment, the goal of this paper is to investigate their distributions.

We find that the main definitions of PoS and the respective PrRVs are closely interconnected and that their discrepancies, albeit few, reveal which definition accounts for success most properly. This analysis also provides guidelines on prior choices, methods to evaluate whether PoS is a representative summary, and alternative tools for an overall pre-experimental evaluation of the designed experiment, such as cumulative distribution functions (cdfs), probability density functions (pdfs) and quantiles.

The outline of the article is as follows. We introduce the notation, the setup and a review of the main definitions of PoS in Sect. 2. In Sect. 3 we provide general results for the density functions of the PrRVs, whose closed-form expressions are derived under the normality assumption of the test statistic. In addition, to illustrate our ideas, we replicate an example of Spiegelhalter et al. (1986) on a clinical trial where the log-hazard ratio is the endpoint. We show how the qualitative features of the PrRV densities can be employed to set the design prior parameters. We also sketch a simulation algorithm useful when explicit expressions are not available. An application to a two-sample confirmatory phase III trial is illustrated in Sect. 4, while we come to conclusions in Sect. 5.

2 Setup and notation

In this section we formally introduce the power and utility functions of a test, whose definitions are needed to discuss PoS and its specifications. Consider the statistical model \((\mathcal {X}, f_n(\cdot |\theta ), \theta \in \Omega )\), where \(\theta \in \Omega \), \(\Omega \subseteq \mathbb {R}\), is the parameter of interest, \(\mathcal {X}\) is the sample space, \(f_n(\cdot |\theta )\) is the probability mass function or pdf of the random sample of independent and identically distributed (iid) random variables \({\textbf{X}_n}=(X_1,X_2, \ldots , X_n)\), and \({\textbf{x}_n}\) is an element of \(\mathcal {X}\) (i.e. the observed value of \({\textbf{X}_n}\)). We denote by \(\mathbb {P}_\theta (\cdot )\) and \(\mathbb {E}_\theta (\cdot )\) the probability measure and the expected value with respect to the sampling distribution \(f_n(\cdot |\theta )\), respectively. Given a partition \(\Omega _0, \Omega _1\) of \(\Omega \), consider the following testing problem on \(\theta \):

$$\begin{aligned} H_0: \theta \in \Omega _0 \qquad \text {vs} \qquad H_1: \theta \in \Omega _1. \end{aligned}$$
(1)

We here focus on the one-sided hypotheses \(\Omega _0 = (-\infty , \theta _0]\) and \(\Omega _1=\Omega _0^C=(\theta _0,+\infty )\), but implementation of the reversed one-sided test is straightforward. Let \(\mathcal {X}_0\) and \(\mathcal {X}_1\) be the two elements of a partition of \(\mathcal {X}\), representing the acceptance and the rejection regions of \(H_0\), respectively. The power function of the test is formally defined as:

$$\begin{aligned} \eta _n(\theta ) = \mathbb {P}_{\theta }({\textbf{X}_n}\in \mathcal {X}_1). \end{aligned}$$
(2)

In order to define the utility function of a test we adopt a decision-theoretic perspective, under which a test statistic is a decision function \(d_n({\textbf{x}_n}) = a_i\) if \({\textbf{x}_n} \in \mathcal {X}_i\), \(i=0,1\), where \(a_i\) means accepting \(H_i\); throughout, \(\mathbb {I}_{A}(\cdot )\) denotes the indicator function of a set A. The utility function evaluates the quality of the random decision \(d_n({\textbf{X}_n})\) and is defined as:

$$\begin{aligned} U_n(\theta ) = 1-\mathbb {E}_\theta [\mathbb {L}(\theta , d_n({\textbf{X}_n}))], \end{aligned}$$

where \(\mathbb {L}(\theta ,a_i)=\mathbb {I}_{\Omega _j}(\theta )\), with \(j \ne i\), is the \(0\)-\(1\) loss function of \(a_i\), \(i= 0,1\). Note that \(\mathbb {E}_\theta [\mathbb {L}(\theta ,d_n({\textbf{X}_n}))]\) is the risk function of the decision.

\(U_n(\theta )\) and \(\eta _n(\theta )\) are related in the following way:

$$\begin{aligned} U_n(\theta ) = [1-\eta _n(\theta )]\mathbb {I}_{\Omega _0}(\theta ) + \eta _n(\theta )\mathbb {I}_{\Omega _1}(\theta ). \end{aligned}$$
(3)

A graphical representation of \(\eta _n(\theta )\) and of \(U_n(\theta )\) is provided in the top panels of Fig. 1; we comment on it in the next section.
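Both functions in (3) are straightforward to evaluate numerically. The following minimal sketch (Python) assumes the normal setting used for Fig. 1, i.e. \(X_i|\theta \sim N(\theta , \sigma ^2)\) iid with \(\sigma ^2 = 4\), \(n = 10\), \(\alpha = 0.05\), \(\theta _0 = 0\); the function names are ours:

```python
import numpy as np
from scipy.stats import norm

sigma2, n, alpha, theta0 = 4.0, 10, 0.05, 0.0
se = np.sqrt(sigma2 / n)       # standard error of the sample mean
z_alpha = norm.ppf(alpha)      # alpha-quantile of N(0,1), negative here

def power(theta):
    """Power eta_n(theta) of the one-sided size-alpha z-test, cf. Eq. (2)."""
    return norm.cdf((theta - theta0) / se + z_alpha)

def utility(theta):
    """Utility U_n(theta) = (1 - eta)I_{Omega_0} + eta I_{Omega_1}, cf. Eq. (3)."""
    eta = power(theta)
    return np.where(theta <= theta0, 1.0 - eta, eta)
```

At \(\theta = \theta _0\) the power equals \(\alpha \) while the utility equals \(1-\alpha \), so \(U_n(\theta )\) jumps at \(\theta _0\).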

2.1 Main definitions of PoS

According to the hybrid Bayesian-frequentist approach, the unknown parameter of interest is a random variable \(\Theta \), thus power and utility functions are random variables as well.

We denote by \(\mathbb {P}_\pi (\cdot )\), \(\mathbb {F}_\pi (\cdot )\) and \(\mathbb {E}_\pi (\cdot )\) the probability measure, the cdf and the expected value with respect to \(\pi (\cdot )\), the design prior density of \(\Theta \), respectively.

As discussed in Sect. 1, our starting point is the specification of the main definitions of PoS, \( e_i = e_i(n, \pi )\), with \(i \in \{{P},{J},{C},{U}\}\), identified in Kunzmann et al. (2021) and De Santis et al. (2024), which are the expected values of certain PrRVs. Table 1 reports these definitions and the corresponding notation.

Table 1 Main definitions of PoS with the corresponding PrRV, cdf and pdf

The specifications of PoS and the related PrRVs depend strongly on the choice of the design prior, and they all coincide when the design prior assigns probability only to the alternative hypothesis. When the design prior assigns positive probability mass to \(\theta \) values in \(\Omega _0\), the differences between the definitions of PoS are not negligible, and in Sect. 3 we show that \(e_{U}\ge e_{C}\ge e_{P}\ge e_{J}\).

To clarify how the PrRVs differ from each other, in Fig. 1 we plot \(\eta _n(\theta )\), \(\eta _n(\theta )\mathbb {I}_{\Omega _1}(\theta )\), \(\eta _n(\theta )|\theta \in \Omega _1\) and \(U_n(\theta )\) as functions of a deterministic \(\theta \). We refer to the setting of an example discussed in Spiegelhalter et al. (1986) and further developed in Sect. 3.1.1. We now comment on the main definitions of PoS.

  • The value \(e_{P}\) (where \({P}\) stands for power) is mostly known under the name of assurance (O’Hagan and Stevens 2001; O’Hagan et al. 2005) and, according to Kunzmann et al. (2021), it is the most used definition of PoS. It is the expected value of the random power \({P}_n\), according to which the experimental success is the rejection of \(H_0\), regardless of its truth. In other words, \({P}_n\) treats the occurrence of a type I error as a success; hence, its expected value averages the power with an error probability, which reduces the chance of success.

  • The joint probability of rejecting \(H_0\) with \(H_1\) true, \(e_{J}\) (with \({J}\) standing for joint probability), is developed in Brown et al. (1987) and Ciarleglio et al. (2015). Similarly to the random power, the random variable \({J}_n\) assumes that the success of the test consists only in rejecting \(H_0\) when \(H_1\) is true, while correctly retaining \(H_0\) contributes nothing. Consequently, \(e_{J}\) averages the power values when \(\theta \in \Omega _1\) with zeros when \(\theta \in \Omega _0\), reducing the chance of success.

  • The expected conditional power \(e_{C}\) (where \({C}\) stands for conditional) is introduced in Spiegelhalter et al. (2004): here the concept of success is only related to the proper rejection of the null hypothesis, i.e. it is conditional on \(H_1\) being true. Consequently, the pdf \(f_{C}(\cdot )\) is the density of the random power when \(\Theta \) is restricted to the alternative hypothesis. Note that \(e_{C}\) can also be seen as \(\mathbb {E}_\pi [\eta _n(\Theta )\mathbb {I}_{\Omega _1}(\Theta )/\pi _1]\), where \(\pi _1 = \mathbb {P}_\pi (\Theta \in \Omega _1)\) and \(\eta _n(\Theta )\mathbb {I}_{\Omega _1}(\Theta )/\pi _1\) is a linear transformation of \({J}_n\). However, the support of this random variable is \((\frac{\alpha }{\pi _1}, \frac{1}{\pi _1})\), where \(\frac{1}{\pi _1} \ge 1\): since it can exceed one, it cannot be interpreted as an overall representation of the probability of success. For this reason, we will consider only the definition of \(e_{C}\) as the expected value of \({C}_n\) given in Table 1.

  • The Bayes utility \(e_U\) (where U stands for utility) has been formally introduced in De Santis et al. (2024) as uPoS. Its advantages over \(e_i, i \in \{{P},{J},{C}\}\), discussed in the aforementioned contribution, follow from the fact that the random variable \({U}_n\) generalizes the concept of probability of success by counting as successful both the correct rejection of \(H_0\) and the correct acceptance of \(H_0\).

Fig. 1

Power and power-related functions of a deterministic \(\theta \) for \(n=10\) under the setting of the example in Sect. 3.1.1, i.e. \(X_i|\theta \sim N(\theta , \sigma ^2)\) iid, \(i=1,\ldots ,n\), \(\sigma ^2=4\), \(\alpha = 0.05\), \(\theta _0 = 0\)

As discussed by Spiegelhalter et al. (2004) and Kunzmann et al. (2021), since the objective of a test is the rejection of \(H_0\), the design prior is usually almost fully concentrated on the alternative hypothesis, so that the differences between the \(e_i, i \in \{{P},{J},{C},{U}\}\), and their related distributions are negligible. Nonetheless, there are cases where accounting for a design distribution that assigns non-negligible probability mass to \(\Omega _0\) is unavoidable: this happens for instance in Temple and Robertson (2021), where the design prior is a mixture of two normals defined on \(\Omega _0\) and \(\Omega _1\), respectively, or in Liu and Yu (2024) and Dallow and Fina (2011) where, instead of a design prior, a data-driven design posterior is used in the interim analysis of a clinical trial. In these cases, the discrepancies between the PrRVs under study, and consequently in the results, cannot be ignored.

3 The power related random variables

We here provide the general expression of the pdf \(f_P(y)\) and the expressions of the pdfs \(f_i(y), i \in \{{J}, {C}, {U}\}\), as functions of \(f_{P}(y)\).

Theorem 3.1

Given the hypotheses in (1) and a size-\(\alpha \) test with monotone increasing power function \(\eta _n(\theta )\), the density functions of the PrRVs in Table 1 are:

$$\begin{aligned} f_{P}(y)= & \pi \big (\eta _n^{-1}(y)\big ) \left| \frac{d}{dy} \eta _n^{-1} (y) \right| \mathbb {I}_{[0,1]}(y) \\ f_{J}(y)= & \pi _0 \, \delta _{0}(y) + f_{P}(y) \mathbb {I}_{(\alpha ,1]}(y) \\ f_{C}(y)= & \frac{f_{P}(y)}{\pi _1} \mathbb {I}_{(\alpha ,1]}(y) \\ f_{U}(y)= & f_{P}(y) \mathbb {I}_{(\alpha ,1]}(y) + f_{P}(1-y) \mathbb {I}_{(1-\alpha ,1]}(y) \end{aligned}$$

where \(\eta _n^{-1}(\cdot )\) is the inverse function of \(\eta _n(\cdot )\), \(\delta _{0}(y)\) is the Dirac delta function at 0, and \(\pi _0 = \mathbb {P}_\pi (\Theta \in \Omega _0) = 1-\pi _1\).

Proof

We start the proof by deriving the expression of the cdf of the random power, whose support is [0, 1]. Since \(\eta _n(\cdot )\) is assumed monotone increasing, \(\mathbb {F}_{P}(y)\) can be written as:

$$\begin{aligned} \mathbb {F}_{P}(y)= & \mathbb {P}_\pi [{P}_n \le y] =\mathbb {P}_\pi [\eta _n(\Theta ) \le y] = \mathbb {P}_\pi \big [\Theta \le \eta _n^{-1}(y)\big ] = \\= & \mathbb {F}_\pi [\eta _n^{-1}(y)]. \end{aligned}$$

The pdf \(f_{P}(y)\) is obtained by differentiating \(\mathbb {F}_{P}(y)\) with respect to y. Recall that \(\frac{d}{dx} f^{-1}(x) = \frac{1}{f'(f^{-1}(x))}\), where \(f'(x) = \frac{d}{dx} f(x)\). We have:

$$\begin{aligned} f_{P}(y)= & \frac{d}{dy} \mathbb {F}_{P}(y) = \pi \big (\eta _n^{-1}(y)\big ) \left| \frac{d}{dy} \eta _n^{-1} (y) \right| . \end{aligned}$$

From (3) it follows that the generic expression of the cdf of \({U}_n\) is:

$$\begin{aligned} \mathbb {F}_{U}(y)= & \mathbb {P}_{\pi }[U_n \le y] = \\= & \mathbb {P}_{\pi }[{U}_n \le y, \Theta \in \Omega _0 ] + \mathbb {P}_{\pi }[{U}_n \le y, \Theta \in \Omega _1 ] = \\= & \mathbb {P}_{\pi }[{P}_n \ge 1-y, \Theta \in \Omega _0] + \mathbb {P}_{\pi }[{P}_n \le y, \Theta \in \Omega _1] = \\= & \mathbb {P}_{\pi }\big [\Theta \ge \eta _n^{-1}(1-y), \Theta \le \theta _0\big ] + \mathbb {P}_{\pi }[\Theta \le \eta _n^{-1}(y), \Theta> \theta _0] = \\= & \left\{ \begin{array}{l c l} 0 & \text{ for } & y \le \alpha \\ \mathbb {F}_{{P}}(y)-\pi _0 & \text{ for } & \alpha< y \le 1-\alpha \\ \mathbb {F}_{{P}}(y)-\mathbb {F}_{{P}}(1-y) & \text{ for } & 1-\alpha < y \le 1 \\ 1 & \text{ for } & y > 1 \\ \end{array}\right. , \end{aligned}$$

where \(\pi _0 = \mathbb {F}_{\pi }(\theta _0) = \mathbb {P}_\pi (\Theta \in \Omega _0)\). Note that \(\mathbb {F}_{U}(y)\) is differentiable for \(y \ne 1-\alpha \), hence \(\mathbb {F}_{U}(y)\) admits density function \(f_{U}(y)\), that is:

$$\begin{aligned} f_{U}(y)= & \pi \big (\eta _n^{-1}(y)\big ) \left| \frac{d}{dy} \eta _n^{-1} (y) \right| \mathbb {I}_{(\alpha , 1]}(y)+\nonumber \\ & +\pi \big (\eta _n^{-1}(1-y)\big ) \left| \frac{d}{dy} \eta _n^{-1} (1-y) \right| \mathbb {I}_{(1-\alpha ,1]}(y). \end{aligned}$$
(4)

Since \(\mathbb {F}_{U}(y)\) is not differentiable at \(y = 1-\alpha \), then \(f_{U}(y)\) has a jump discontinuity at that point. Derivations of cdf and pdf expressions for \( {J}_n\) and \({C}_n\) are available in Appendix A. \(\square \)
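As a numerical illustration of Theorem 3.1, the sketch below (Python) computes \(f_P\) by the change-of-variables formula, approximating the derivative of \(\eta _n^{-1}\) by central finite differences, and builds \(f_C\) and \(f_U\) from it; the point mass \(\pi _0\,\delta _0\) of \({J}_n\) would require separate bookkeeping and is omitted. For concreteness we assume the normal model and design prior of Sect. 3.1.1; all names are ours:

```python
import numpy as np
from scipy.stats import norm

# Assumed setting: normal test statistic (Sect. 3.1), normal design prior
sigma, n, alpha, theta0 = 2.0, 79, 0.05, 0.0
theta_d, n_d = 0.56, 9
z_alpha = norm.ppf(alpha)
prior = norm(loc=theta_d, scale=sigma / np.sqrt(n_d))
pi1 = 1.0 - prior.cdf(theta0)       # prior mass on the alternative
pi0 = 1.0 - pi1

def eta_inv(y):
    """Inverse power function eta_n^{-1}(y) for the one-sided z-test."""
    return theta0 + sigma / np.sqrt(n) * (norm.ppf(y) - z_alpha)

def f_P(y, h=1e-6):
    """Density of P_n: prior density times |d eta_n^{-1}/dy| (Theorem 3.1)."""
    jac = abs((eta_inv(y + h) - eta_inv(y - h)) / (2 * h))
    return prior.pdf(eta_inv(y)) * jac

def f_C(y):
    """Density of C_n: f_P restricted to (alpha, 1], renormalised by pi1."""
    return f_P(y) / pi1 if y > alpha else 0.0

def f_U(y):
    """Density of U_n: f_P on (alpha, 1] plus a reflected term on (1-alpha, 1]."""
    val = f_P(y) if y > alpha else 0.0
    if y > 1 - alpha:
        val += f_P(1 - y)
    return val
```

The same construction applies to any strictly increasing \(\eta _n\) with a computable inverse.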

Corollary 3.1

[Stochastic order of the PrRVs] Given the hypotheses in (1) and a size-\(\alpha \) test with monotone increasing power function \(\eta _n(\theta )\), the PrRVs in Table 1 are in the following stochastic order

$$\begin{aligned} U_n \succeq C_n \succeq P_n \succeq J_n, \end{aligned}$$

that is, for any \(y \in \mathbb {R}\),

$$\begin{aligned} \mathbb {F}_U(y) \le \mathbb {F}_C(y) \le \mathbb {F}_P(y) \le \mathbb {F}_J(y). \end{aligned}$$

Proof

The proof is provided in the Supplementary Material. \(\square \)

Remark 3.1

The stochastic order of the PrRVs implies that:

  (i) \(e_U \ge e_C \ge e_P \ge e_J\);

  (ii) \(q_U^\gamma \ge q_C^\gamma \ge q_P^\gamma \ge q_J^\gamma \) for all \(\gamma \in [0,1]\), where \(q_i^\gamma = q_i^\gamma (n, \pi )\), \(i \in \{{P},{J},{C},{U}\}\), is the \(\gamma \) level quantile of the corresponding PrRV.

3.1 Normal models

In this section, we apply Theorem 3.1 when the test statistic is normally distributed. Following Spiegelhalter et al. (2004), consider the statistical model of Sect. 2 and assume that \(T_n\) is the sufficient statistic and that, at least asymptotically, \(T_n|\theta \sim N(\theta ,\frac{\sigma ^2}{n})\). Then, the size-\(\alpha \) uniformly most powerful (UMP) test is based on the statistic

$$\begin{aligned} W(T_n,\theta _0) = \frac{\sqrt{n}}{\sigma }(T_n - \theta _0) \sim N(0,1) \text{ under } H_0. \end{aligned}$$
(5)

The random power of the test is:

$$\begin{aligned} {P}_n = \eta _n(\Theta ) = \Phi \left( \frac{\Theta -\theta _0 + \frac{\sigma }{\sqrt{n}}z_{\alpha }}{\frac{\sigma }{\sqrt{n}}} \right) , \end{aligned}$$
(6)

where \(\Phi (\cdot )\), \(\phi (\cdot )\) and \(z_\gamma = \Phi ^{-1}(\gamma )\) are the cdf, the pdf and the \(\gamma \) level quantile of the standard normal distribution, respectively. As shown in Rufibach et al. (2016),

$$\begin{aligned} f_{P}(y) = \pi \left( \theta _0 + \frac{\sigma }{\sqrt{n}}(z_y-z_{\alpha }) \right) \times \frac{\sigma }{\sqrt{n}}\sqrt{2\pi } \exp \left\{ \frac{1}{2} z_y^2\right\} \mathbb {I}_{[0,1]}(y). \end{aligned}$$
(7)

Details are available in Appendix B. From Theorem 3.1, the expressions of \(f_i(y), i \in \{{J},{C},{U}\}\) are straightforward. Following Rufibach et al. (2016), we use Eq. (7) and Theorem 3.1 to derive closed-form expressions of \(f_{U}(y)\) under three typical prior choices: (i) normal, (ii) truncated normal, (iii) uniform.

(i) Normal design prior

Let \(\Theta \sim N \bigg ( \theta _d, \frac{\sigma ^2}{n_d} \bigg )\); then the pdf \(f_{U}^N(y)\) is:

$$\begin{aligned} f_{U}^N(y)= & f_{P}^N(y)\mathbb {I}_{(\alpha ,1]}(y) + f_{P}^N(1-y)\mathbb {I}_{(1-\alpha ,1]}(y) \end{aligned}$$
(8)

where, as in Rufibach et al. (2016),

$$\begin{aligned} f_{P}^N(y)= & \tau \phi (\Psi + \tau (z_y-z_\alpha ))[\phi (z_y)]^{-1} = \nonumber \\= & \tau \exp \Big \{-\frac{1}{2}(\Psi + \tau (z_y - z_\alpha ))^2 + \frac{1}{2}z_{y}^2\Big \}, \end{aligned}$$
(9)

where \(\Psi = \frac{\theta _0 - \theta _d}{\frac{\sigma }{\sqrt{n_d}}}\) and \(\tau = \sqrt{\frac{n_d}{n}}\).
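Equations (8)-(9) can be coded directly. As a sanity check, from the cdf derived in the proof of Theorem 3.1, \(\int _\alpha ^{y} f_{U}^N(t)\,dt\) must equal \(\mathbb {F}_{P}(y) - \pi _0\) for \(\alpha < y \le 1-\alpha \), with \(\mathbb {F}_{P}(y) = \Phi (\Psi + \tau (z_y - z_\alpha ))\). A sketch (Python) with the design values of the example in Sect. 3.1.1, assumed here for concreteness; the names are ours:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Assumed design values (example of Sect. 3.1.1)
sigma, n, alpha, theta0 = 2.0, 79, 0.05, 0.0
theta_d, n_d = 0.56, 9
z_alpha = norm.ppf(alpha)
tau = np.sqrt(n_d / n)
Psi = (theta0 - theta_d) * np.sqrt(n_d) / sigma

def f_P_N(y):
    """Closed-form density of P_n under a normal design prior, Eq. (9)."""
    zy = norm.ppf(y)
    return tau * np.exp(-0.5 * (Psi + tau * (zy - z_alpha)) ** 2 + 0.5 * zy ** 2)

def f_U_N(y):
    """Density of U_n under a normal design prior, Eq. (8)."""
    val = f_P_N(y) if y > alpha else 0.0
    if y > 1 - alpha:
        val += f_P_N(1 - y)
    return val

F_P = lambda y: norm.cdf(Psi + tau * (norm.ppf(y) - z_alpha))  # cdf of P_n
pi0 = norm.cdf(Psi)                                            # prior mass on Omega_0
```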

(ii) Truncated normal design prior

Let \(\Theta \sim TN \bigg (\inf = a, \sup = b,\theta _d, \frac{\sigma ^2}{n_d} \bigg )\). The density \(f_{U}^{TN}(y)\) is:

$$\begin{aligned} f_{U}^{TN}(y) = f_{P}^{TN}(y)\mathbb {I}_{(\alpha , 1]}(y) + f_{P}^{TN}(1-y)\mathbb {I}_{(1-\alpha ,1]}(y) \end{aligned}$$

where

$$\begin{aligned} f_{P}^{TN}(y) = \frac{f_{P}^N(y)}{\Phi \left( \frac{b-\theta _d}{\sigma /\sqrt{n_d}}\right) - \Phi \left( \frac{a-\theta _d}{\sigma /\sqrt{n_d}} \right) } \mathbb {I}_{[\eta _n(a), \eta _n(b)]}(y). \end{aligned}$$

Specifically, if the design prior of \(\Theta \) is a normal truncated on the alternative hypothesis (\(\inf = \theta _0, \sup = \infty \)), then \(f_{P}^{TN}(y) = \frac{f_{P}^N(y)}{1 - \Phi \left( \frac{\theta _0 - \theta _d}{\sigma /\sqrt{n_d}}\right) }\mathbb {I}_{(\alpha ,1]}(y)\), where the denominator equals \(\pi _1\).

(iii) Uniform design prior

Let \(\Theta \sim Unif(a,b)\). The density \(f_{U}^{UNIF}(y)\) is:

$$\begin{aligned} f_{U}^{UNIF}(y) = f_{P}^{UNIF}(y)\mathbb {I}_{(\alpha ,1]}(y) + f_{P}^{UNIF}(1-y)\mathbb {I}_{(1-\alpha ,1]}(y) \end{aligned}$$

where

$$\begin{aligned} f_{P}^{UNIF}(y) = \frac{1}{b-a} \frac{\sigma }{\sqrt{n}}\sqrt{2\pi } \exp \left\{ \frac{1}{2} z_y^2\right\} \mathbb {I}_{[\eta _n(a), \eta _n(b)]}(y). \end{aligned}$$

3.1.1 Example on log-hazard ratio

As an example, we consider an application provided in Spiegelhalter et al. (1986), whose main objective is the design of a clinical trial to compare a new treatment with the standard one in terms of the log-hazard ratio \(\theta \), with \(\theta _0 = 0\), \(\alpha = 0.05\) and known variance \(\sigma ^2 = 4\). The authors assume that a balanced trial is designed to have a frequentist power greater than 0.8 at \(\theta _d = 0.56\), which is reached for \(n = 79\). We now consider a normal design prior \(N(\theta _d, \sigma ^2/n_d)\) with \(n_d = 9\), the prior sample size needed to obtain a prior with probability 0.20 that \(\theta \) is less than zero, so that \(\pi _1 = 0.80\). In Figs. 2 and 3 we provide graphical representations of \(\mathbb {F}_i(y)\) and of \(f_i(y), \;i \in \{{P},{J},{C},{U}\}\), respectively. Visual inspection of the PrRV cdfs confirms the result of Corollary 3.1. To complement cdfs and pdfs, we compute the expectations \(e_i\) and quantiles \(q_i^\gamma \), reported in Table 2. The PrRVs \({C}_n\) and \({U}_n\), in comparison with the other two, assign substantially higher density to values of y close to 1, with medians equal to \(q_{C}^{0.5} = 0.947\) and \(q_{U}^{0.5} = 0.981\), while \(q_{P}^{0.5} = 0.798\) and \(q_{J}^{0.5} = 0.798\). As expected, \(U_n\) presents the highest density mass at high values of the power. Conversely, the cdfs and pdfs of \({P}_n\) and \({J}_n\) coincide for \(y \in (\alpha ,1]\) but differ for \(y \in [0,\alpha ]\): consequently, their \(\gamma \) quantiles coincide when \(\gamma >\alpha \), but in terms of expected values \(e_{J}= 0.604\) is slightly lower than \(e_{P}= 0.606\), as an implication of the fact that \({J}_n\) is null on \([0,\alpha ]\).
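The value \(e_{P}= 0.606\) can be cross-checked in closed form: under a normal design prior, marginalizing the normal test statistic over \(\Theta \) reduces the assurance to a single normal cdf evaluation (a standard computation, see e.g. O’Hagan et al. 2005). A sketch (Python; ours):

```python
import numpy as np
from scipy.stats import norm

# Example of Sect. 3.1.1: sigma^2 = 4, n = 79, design prior N(0.56, sigma^2/9)
sigma, n, alpha, theta0 = 2.0, 79, 0.05, 0.0
theta_d, n_d = 0.56, 9
z_alpha = norm.ppf(alpha)

# e_P = P(Z <= z_alpha + (Theta - theta0) sqrt(n)/sigma) with
# (Theta - theta0) sqrt(n)/sigma ~ N((theta_d - theta0) sqrt(n)/sigma, n/n_d)
e_P = norm.cdf((z_alpha + (theta_d - theta0) * np.sqrt(n) / sigma)
               / np.sqrt(1.0 + n / n_d))

# prior probability of the alternative
pi_1 = 1.0 - norm.cdf((theta0 - theta_d) * np.sqrt(n_d) / sigma)
```

Both values agree with those in the text (\(e_P \approx 0.606\), \(\pi _1 \approx 0.80\)).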

Fig. 2

Cdfs of the PrRVs

Fig. 3

Pdfs of the PrRVs

Table 2 Summaries of the PrRVs

3.1.2 Qualitative features of \(f_U(y)\)

As discussed in Sect. 1, the shape of the PrRV pdfs indicates whether the experiment is well-designed: a large density at values close to 1 corresponds to high chances of conducting a well-designed experiment. The qualitative features of these density functions depend on the sample size and on the design prior; therefore, they provide guidelines for the choice of the design prior parameters. We specifically focus on \(f_{U}(y)\), assuming a normal design prior \(\Theta \sim N(\theta _d, \sigma ^2/n_d)\). From Eqs. (8) and (9), the qualitative features of \(f_{U}(y)\) can be summarized as follows.

Result 3.1

[Qualitative features of \(f_{U}(y)\)]

  • For \(y \in (\alpha , 1-\alpha ]\):

    • for \(n_d = n\) (\(\tau = 1\)):

      1. \(f_{U}(y)\) is strictly increasing for \(\theta _d > \theta ^\star \);

      2. \(f_{U}(y)\) is constant for \(\theta _d = \theta ^\star \);

      3. \(f_{U}(y)\) is strictly decreasing for \(\theta _d < \theta ^\star \);

    • for \(n_d \ne n\) (\(\tau \ne 1\)):

      1. \(f_{U}(y)\) is strictly increasing for \(\theta _d > \theta ^\star \);

      2. \(f_{U}(y)\) is not strictly increasing for \(\theta _d \le \theta ^\star \).

  • For \(y \in (1-\alpha , 1]\):

    • \(f_U(y)\) is strictly increasing for \(\theta _d > \theta ^\star \) when \(n_d \le n\) (\(\tau \le 1\)),

where \(\theta ^\star = \theta _0 + \kappa (n,n_d) \in \Omega _1\) and \(\kappa (n,n_d) = \frac{\sigma }{\sqrt{n_d}} \; \max \{-z_\alpha ; - 2 \sqrt{\frac{n_d}{n}} z_\alpha \}\).

Proof

The proof is provided in the Supplementary Material. \(\square \)

The interpretation in terms of experimental success is that we get a well-shaped density of the utility, and thus a well-designed experiment, when \(\theta _d > \theta ^\star \). Since \(\theta ^\star \) is a decreasing function of n and \(n_d\), the choice of the design prior parameters is a matter of finding a good trade-off among \(\theta _d\), \(n_d\) and n. For instance, for low values of \(n_d\) a higher value of \(\theta _d\) is required to ensure a good shape of the density. Note also that the expression of \(\theta ^\star \) helps in the choice of n, \(n_d\) or \(\theta _d\) when the other two are given. This is particularly useful, for example, in the design of clinical trials when the prior sample size \(n_d\) is the sample size of historical and/or external data, and is thus known at the design stage of the trial. Similar considerations about the shape of the pdf of \({U}_n\) hold for the truncated normal design prior case.
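The threshold \(\theta ^\star \) of Result 3.1 is immediate to compute from the displayed formula. A sketch (Python) with \(\sigma = 2\), \(\alpha = 0.05\), \(\theta _0 = 0\) as in the running example (names are ours):

```python
import numpy as np
from scipy.stats import norm

sigma, alpha, theta0 = 2.0, 0.05, 0.0
z_alpha = norm.ppf(alpha)   # negative, so -z_alpha > 0

def theta_star(n, n_d):
    """Threshold theta* = theta0 + kappa(n, n_d) of Result 3.1."""
    kappa = sigma / np.sqrt(n_d) * max(-z_alpha,
                                       -2.0 * np.sqrt(n_d / n) * z_alpha)
    return theta0 + kappa
```

Note that once \(n_d \ge n/4\) the second argument of the max dominates and \(\kappa \) no longer depends on \(n_d\); for smaller \(n_d\), \(\theta ^\star \) increases as \(n_d\) decreases, consistent with the trade-off discussed above.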

Example on log-hazard ratio (continued) We consider once again the example of Sect. 3.1.1 and we illustrate the qualitative features of \(f_{U}(y)\) in Fig. 4, bottom panels, for \(n_d = 29, 79, 129\) and for several values of \(\theta _d\) chosen to be symmetric around the \(\theta ^\star \) value. Moreover, for the same design values, we plot \(f_{{P}}(y)\) in Fig. 4, top panels.

As discussed above, in general, for low values of \(n_d\) a higher value of \(\theta _d\) is required to ensure a well-shaped pdf. We note that, according to the findings in Rufibach et al. (2016), in order to ensure a well-shaped \(f_{P}(y)\), the design values may need to satisfy hard-to-meet criteria. Conversely, the conditions that the design values need to satisfy to ensure a well-shaped \(f_{U}(y)\) are milder. For instance, when \(n_d = 29\) (thus for a realistically low \(n_d\)) and \(\theta _d = 0.51\), \(f_{U}(y)\) is substantially increasing (except for a very small set of values of y) and \(q_{U}^{0.25} = 0.46\), while \(f_{P}(y)\) is badly shaped (u-shaped) and \(q_{P}^{0.25} = 0.31\). As \(\theta _d\) is reduced to 0.41, the situation for \(f_{P}(y)\) worsens, since \(q_{P}^{0.25} = 0.17\), while \(q_{U}^{0.25} = 0.38\). Our point is that the u-shape of \(f_{P}(y)\) is not solely a consequence of the design prior choices: as discussed in Sect. 2, \({P}_n\) takes values close to 0 (\(< \alpha \)) with non-negligible probability (\(\pi _0\)), leading to a u-shape of \(f_{P}(y)\) and therefore to a reduced quantification of success even for realistic design prior parameter choices. The employment of \({U}_n\) avoids this scenario.

Fig. 4

Qualitative features of \(f_{P}(y)\) (top panels) and of \(f_{U}(y)\) (bottom panels) for the example in Sect. 3.1.1 when \(n_d = 29, 79, 129\) (left, central and right panels, respectively) and for several choices of \(\theta _d\)

3.2 Simulation algorithm

When closed-form expressions are not available, the PrRV distributions can be simulated. Let \(w_{1-\alpha }\) be the generic \(1-\alpha \) level quantile of the test statistic \(W(T_n,\theta _0)\). The algorithm works as follows.

[Algorithm listing shown as a figure in the original]

From the empirical distributions of the PrRVs it is easy to approximate their summaries: for instance, \(e_{P}\simeq \sum _{k=1}^M \eta _n(\theta ^{(k)})/M\). Note that if \(T_n|\theta \) is exactly or asymptotically normal, steps 2, 3 and 4 are not necessary: once \(\theta ^{(1)}, \ldots , \theta ^{(M)}\) are drawn, the values \(\eta _n(\theta ^{(k)}), \; k= 1, \ldots , M\), can be computed using Eq. (6). The R code is provided in the Supplementary Material. Note that the algorithm also works for the reversed one-sided test (i.e. \(\Omega _0 = [\theta _0, +\infty )\)), the point-null test (i.e. \(\Omega _0 = \{\theta _0\}, \Omega _1 = \{\theta _1\} \)) and the two-sided test (i.e. \(\Omega _0 = \{\theta _0\}, \Omega _1 = \Omega {\setminus } \{\theta _0\}\)) under appropriate specifications of \(\eta _n(\theta )\).
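Under the normal shortcut just described, the whole algorithm collapses to drawing \(\theta ^{(k)}\) from the design prior and transforming through Eq. (6). A Python sketch of this shortcut for the setting of Sect. 3.1.1 (the authors provide R code in the Supplementary Material; this translation and the variable names are ours):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
M = 200_000                                    # Monte Carlo draws

# Setting of Sect. 3.1.1
sigma, n, alpha, theta0 = 2.0, 79, 0.05, 0.0
theta_d, n_d = 0.56, 9
z_alpha = norm.ppf(alpha)

# Step 1: draw from the design prior; then evaluate Eq. (6) directly
theta = rng.normal(theta_d, sigma / np.sqrt(n_d), size=M)
eta = norm.cdf((theta - theta0) * np.sqrt(n) / sigma + z_alpha)

in_H1 = theta > theta0
P = eta                               # random power P_n
J = np.where(in_H1, eta, 0.0)         # J_n: zero on Omega_0
C = eta[in_H1]                        # C_n: draws restricted to Omega_1
U = np.where(in_H1, eta, 1.0 - eta)   # U_n: utility, cf. Eq. (3)

e_P, e_J, e_C, e_U = P.mean(), J.mean(), C.mean(), U.mean()
```

Empirical quantiles (e.g. `np.quantile(U, 0.5)`) approximate the \(q_i^\gamma \) of Remark 3.1.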

4 Application to a clinical trial

We here consider an application to a two-sample confirmatory Phase III trial, which aims to show the efficacy of a drug in treating Restless Legs Syndrome. This example was introduced in Muirhead and Soaita (2012) and discussed in Eaton et al. (2013). We exclude the pdf \(f_{J}(y)\) from the study, as its behavior almost coincides with that of \(f_{P}(y)\). It is assumed that \(T_n \sim N(\theta , \frac{4 \sigma ^2}{n})\), where \(\theta \) is the difference in treatment effects and \(\sigma ^2\) is the common variance in the two groups. Superiority of the experimental treatment is declared by rejecting \(H_0: \theta \le \theta _0 = 0\) at level \(\alpha = 0.025\). The authors assume a known variance \(\sigma ^2 = 64\) and a clinically meaningful difference \(\theta _d = 4\). In this case, \(\eta _n(\theta _d)\) reaches the desired power value of 0.80 at \(n = 128\). At the design stage of the trial, we want to obtain a pre-experimental evaluation of the trial under three different prior beliefs about \(\theta \):

  • neutral: \(\Theta \sim N(\theta _d, \frac{4 \sigma ^2}{n_d}) \), with \(n_d = 4\). In this case, \(\pi _1 = 0.69\);

  • pessimistic: \(\Theta \sim Unif(a,b)\), with \(a = -3, b = 5\). In this case, \(\pi _1 = 0.625\);

  • optimistic: \(\Theta \sim TN(a, b, \theta _d, \frac{4 \sigma ^2}{n_d})\), with \(n_d = 4, a = \theta _0 = 0, b = + \infty \). In this case, \(\pi _1 = 1\).
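The values of \(\pi _1\) reported for the three prior beliefs can be reproduced directly from the prior specifications; a minimal sketch (Python; ours):

```python
import numpy as np
from scipy.stats import norm

theta0, theta_d, sigma2, n_d = 0.0, 4.0, 64.0, 4

# neutral: Theta ~ N(theta_d, 4*sigma^2/n_d), i.e. standard deviation 8
sd_neutral = np.sqrt(4.0 * sigma2 / n_d)
pi1_neutral = 1.0 - norm.cdf(theta0, loc=theta_d, scale=sd_neutral)

# pessimistic: Theta ~ Unif(-3, 5)
a, b = -3.0, 5.0
pi1_pess = (b - theta0) / (b - a)

# optimistic: normal truncated on (theta0, +inf), so all mass is on Omega_1
pi1_opt = 1.0
```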

In Table 3 we provide the values of \(e_i\) and of \(q_{i}^\gamma \) for \(\gamma = 0.25, 0.5, 0.75\), \(i \in \{{P},{C},{U}\}\) and \(n = 64, 128, 256\). As the sample size increases, the expected values and quantiles of the PrRVs increase under each prior belief. As expected, regardless of the sample size, the stochastic order of the PrRVs is never violated. For fixed \(n=128\), in Fig. 5 we report the priors (left panels) and the induced PrRV densities (right panels). As a consequence of Corollary 3.1, in the neutral and pessimistic scenarios the summaries of \(f_{P}(y)\) and of \(f_{C}(y)\) assume lower values than those of \(f_{U}(y)\). This can also be seen from their pdfs, which are u-shaped. On the other hand, the u-shape of \(f_{U}(y)\) is less marked. In the optimistic scenario, the design prior is entirely concentrated on \(\Omega _1\), i.e. \(\pi _1 = 1\). As discussed in Sect. 2.1, this is the case where the definitions of the PrRVs (thus including their cdfs, pdfs, expected values and quantiles) coincide.

Finally, we consider the neutral normal design prior and show the expected values and quantiles as functions of n in Fig. 6. This reveals interesting aspects. First, in accordance with Corollary 3.1, \(e_{U}\ge e_{C}\ge e_{P}\) and \(q_{U}^{\gamma } \ge q_{C}^{\gamma } \ge q_{P}^{\gamma }\), \(\gamma = 0.25, 0.5, 0.75\). Second, the 0.25 quantile of \({P}_n\) never increases with n, meaning that, even when the sample size increases substantially, the u-shape of \(f_{P}(y)\) cannot be avoided; thus \({P}_n\) still places high probability mass on low values of the power. In conclusion, note that the stochastic order among the PrRVs also implies that sample sizes chosen using summaries of \(U_n\) would be less than or equal to those based on summaries of the other PrRVs.

Fig. 5

Normal, uniform and truncated normal prior (left panels) and induced \(f_i(y), i \in \{{P},{C},{U}\}\) (right panels)

Table 3 Expected values and quantiles for different design priors and sample sizes
Fig. 6

Expected values and quantiles as a function of n when the design prior is \(N(4, \frac{4 \cdot 64}{4})\)

5 Conclusion

In the hybrid Bayesian-frequentist approach to hypothesis testing, the probability of success, PoS, is commonly employed to evaluate the design of an experiment. In its simplest and most common definition, PoS is the expected value of the random power. Two criticisms arise:

  1. PoS is just the mean of an entire probability distribution, and thus may not be a representative summary of it (Spiegelhalter et al. 1986; Huson 2009; Rufibach et al. 2016; Liu and Yu 2024);

  2. the random power does not properly account for the success of the experiment, as it is defined as the random probability of rejecting \(H_0\), whether it is true or not (Kunzmann et al. 2021; De Santis et al. 2024).

To address criticism 1, the seminal paper of Spiegelhalter et al. (1986) proposed complementing PoS with other summaries of the random power (such as quantiles), or with the whole distribution itself. This is accomplished in Rufibach et al. (2016), where the pdf of \({P}_n\) is derived and studied under normality assumptions. To overcome criticism 2, several alternative definitions of PoS have been proposed in the literature (and thoroughly reviewed in Kunzmann et al. 2021). However, these alternative definitions are still affected by criticism 1.

Here, our starting point is the set of four main definitions of PoS identified in De Santis et al. (2024), which can be seen as expected values of PrRVs. In the proposed analysis, we aim to provide tools that address criticism 1 and, at the same time, we investigate which definition of PoS overcomes criticism 2.

Specifically, we provide general expressions of the cdfs and pdfs of the PrRVs under investigation, as well as closed-form expressions under the normality assumption of the test statistic; moreover, we sketch a simulation algorithm useful when explicit formulas are not available. We illustrate our ideas through an example on the log-hazard ratio and an application to a two-arm Phase III clinical trial.

When the design prior assigns null or negligible probability mass to \(\Omega _0\), as, for instance, in the optimistic design prior case of Sect. 4, the discrepancies between the PrRVs are null or negligible. This is also noted in Kunzmann et al. (2021), with specific reference to the definitions of PoS, i.e. the expected values of the PrRVs.

On the other hand, when the design prior assigns non-negligible probability mass to \(\Omega _0\), the use of \({U}_n\) for the evaluation of the experimental success is crucial. In this case we care both about rejecting \(H_0\) when it is false and about not rejecting \(H_0\) when it is true. \({U}_n\) accounts for both aspects; conversely, when \({C}_n\) is considered, the interpretation of success is only related to the proper rejection of \(H_0\); finally, when we consider \({P}_n\) or \({J}_n\), the power of the test under \(\Omega _1\) is mixed up with the type I error and with 0 values, respectively. This is illustrated in the application of Sect. 4 through the pessimistic and neutral design prior cases, where the graphical representations of \(f_{P}\) (and consequently of \(f_{J}\)) show that it never loses its u-shape. Not by chance, the 0.25 quantile of \(f_{P}\) tends to 0 as n increases.

Finally, we note that the PrRVs are likely to be skewed; thus the median is usually a more representative summary of the entire distribution than the expected value. Similar conclusions are drawn in Liu and Yu (2024) and Huson (2009). This has an impact on sample size determination (see De Santis et al. 2024).