1 Introduction

It has long been known, for a bivariate normal model with \(X_1, X_2\) independently distributed with means \(\theta _1\) and \(\theta _2\), and known variances \(\sigma ^2_1\) and \(\sigma ^2_2\), that the Bayes estimator of \(\theta _1\) with respect to the uniform prior on \(\theta _1 \ge \theta _2\) dominates the benchmark minimax estimator \(X_1\) when \(\theta _1 \ge \theta _2\) under squared error loss (Cohen and Sackrowitz 1970). However, there are situations where one would not expect this bound to hold exactly, and one could envisage introducing uncertainty in the parametric bound. This has been previously proposed (see O’Hagan and Leonard 1976 where uncertainty is expressed through a hierarchical prior, as well as Liseo and Loperfido 2003 for uncertain linear restrictions) and allows for a more flexible and encompassing model, where the data is allowed to contradict the believed parametric constraint. Moreover, with such a model, one has the ability to take into account the degree of prior belief in the constraint. Despite the earlier work, little is known about the frequentist risk performance of associated Bayes point estimators or Bayes credible sets.

Here, we consider Bayesian inference about \(\theta _1\) for the two-sample normal problem with hierarchical prior density given by \(\pi (\theta _1,\theta _2\,|\,m)=\mathbbm {1}_{[m,\infty )} (\theta _1-\theta _2)\) with \(m\sim N(0,\sigma _m^2)\), and study the frequentist performance of (generalized) Bayesian point estimators and credible sets. We show that the Bayes estimator of \(\theta _1\) dominates \(X_1\), and is hence minimax, under squared error loss for \(\theta _1-\theta _2\ge 0\) and all choices of \(\sigma _m^2>0\). We make use of the so-called rotation technique (e.g., Blumenthal and Cohen 1968a), and a one-sample minimax finding by Marchand and Nicoleris (2019) set in the context of a single normal mean with an uncertain lower bound. The proposed Bayesian estimators stem from posterior densities for \(\theta =(\theta _1, \theta _2)^{\top }\) that take values on \(\mathbb {R}^2\), but still pass the test of minimaxity for estimating \(\theta _1\) when evaluated on the restricted parameter space \(\theta _1 \ge \theta _2\). In this sense, they are more flexible and desirable in the context of constraint uncertainty than their counterpart estimator when \(\sigma ^2_m=0\), for which the posterior density is concentrated on \(\theta _1 \ge \theta _2\). The finding adds to known analyses for \(\sigma ^2_m=0\) carried out by Cohen and Sackrowitz (1970), van Eeden and Zidek (2002), and Kumar and Sharma (1988), among others.

The attractive performance of the proposed point estimators of the suspected larger of the two means, \(\theta _1\), leads to interest in Bayes credible sets, and to the investigation of the extent to which one can capitalize on this additional information. Specifically, we focus on the performance of such credible sets as measured by frequentist coverage probability. Typically, Bayesian credible sets are far from guaranteeing matching coverage probability and are not designed to do so. Exceptions lie in location and scale models without parametric restrictions and non-informative priors. Even so, in such problems, in the face of a parametric restriction \(\theta \in C\), the truncation of such non-informative priors on C perturbs probability matching, with both higher coverage and lower coverage than credibility occurring (e.g., Mandelkern 2002; Marchand and Strawderman 2006). We point out that there has been much work on evaluating Bayesian posterior densities and estimates with parametric restrictions, notably for ordered parameters with or without nuisance parameters (e.g., Gelfand et al. 1992; Madi et al. 2000).

We introduce below an ad hoc Bayes credible set with approximate \(1-\alpha \) credibility (based again on the prior \(\pi (\theta \,|\,m)=\mathbbm {1}_{[m,\infty )} (\theta _1-\theta _2)\) with \(m\sim N(0,\sigma _m^2)\)), and study its frequentist coverage probability, with evidence of very good matching to the nominal credibility \(1-\alpha \). Numerical evidence of the remarkable proximity between the actual and nominal credibilities is also provided. We furthermore explore how the performance is affected by the choice of the hyperparameter \(\sigma _m\), ranging from the case of a certain constraint, i.e., \(\sigma _m=0\), to the case of no useful information provided by \(X_2\) when \(\sigma _m \rightarrow \infty \).

For a given posterior distribution, there is no single definitive choice of a Bayes credible set and such a choice can be impactful in terms of frequentist coverage. Namely, as illustrated by Marchand and Strawderman (2013), as well as Ghashim et al. (2016), the characterization of Bayes credible sets through a spending function merits consideration. Hence, the analysis and illustrations presented here involve a spending function, the choice of which is guided by frequentist coverage considerations.

The paper is organized as follows. After having extracted and interpreted some useful properties of the posterior distributions in Sect. 2.1, which relate to extended skew-normal densities, the dominance and minimax results are presented and commented on in Sect. 2.2. Section 3 deals with proposed credible sets for \(\theta _1\), focussing mostly on their frequentist coverage probability. The findings are commented on at length and illustrated with several figures. Section 3.2 expands on modifications which make use of the concept of a spending function. A summary and further research questions are presented in Sect. 4. Finally, we mention that the developments in this paper also appear in the M.Sc. thesis (Drew 2021) of Courtney Drew.

2 Bayesian inference and minimax point estimators

2.1 Posterior analysis

We consider the following model for \(X=(X_1,X_2)^T\) and hierarchical prior:

$$\begin{aligned}&X_1\sim N(\theta _1,\sigma _1^2),X_2\sim N(\theta _2,\sigma _2^2); \nonumber \\&\pi (\theta _1,\theta _2|m)\,=\,\mathbbm {1}_{[m,\infty )} (\theta _1-\theta _2)\,,\, m\sim N(0,\sigma _m^2), \end{aligned}$$
(1)

where \(X_1\) and \(X_2\) are independently distributed and \(\sigma _1, \sigma _2, \sigma _m>0\) are known. This corresponds to a situation where the difference of parameters \(\theta _1-\theta _2\) is bounded below by m, with uncertainty on m. We denote throughout \(\phi \) and \(\Phi \) as the standard normal pdf and cdf respectively. An alternative and equivalent representation of the prior in (1) is readily obtained by integrating out m yielding the improper density \(\pi (\theta _1, \theta _2) \, = \, \Phi (\frac{\theta _1-\theta _2}{\sigma _m})\).

Remark 2.1

(a) The situation given by (1) also covers the case of a parametric bound of the form \(\theta _1-c\,\theta _2 \ge m\), with \(c \ne 0\). Setting \(X_1'=X_1, X_2'=cX_2\), the constraint becomes re-expressible as \(\mu _1-\mu _2 \ge m\) with \(X_1' \sim N(\mu _1, \sigma ^2_1)\) and \(X_2' \sim N(\mu _2 = c \theta _2, c^2 \sigma ^2_2)\).

(b) Analysis for (1) yields applications for correlated variables, specifically for \(W=(W_1, W_2)^{\top } \sim N_2(\xi ,\Sigma )\) with \(\xi _1 - \xi _2 \ge m\), correlation coefficient \(\rho =\rho (W_1, W_2) \in (-1,1)\), such that \(\lambda =\rho \sigma (W_1)/\sigma (W_2) \ne 1\). This is achieved by setting \(X_1\, = \, W_1 \, - \, \lambda W_2\), \(X_2\,=\, W_2\), whereupon part (a) applies with \(\theta _1 \, = \, \xi _1 \, - \, \lambda \xi _2\), \(\theta _2 \, = \, \xi _2\), \(c=(1-\lambda )\), \(\sigma _1^2 \, = \, \mathbb {V}(W_1) (1-\rho ^2)\), and \(\sigma ^2_2\,=\, \mathbb {V}(W_2)\).

Remark 2.2

There exist many instances with summary statistics well modelled by normal observables such as in (1). Common occurrences arise through sufficiency or asymptotically justified approximations. An example emerges in a basic linear model with \(W \sim N_n(Z\beta , \sigma ^2 I_n)\) with \(Z (n \times p)\) of full rank p, the least squares estimator \(\hat{\beta }=(\hat{\beta }_1, \ldots , \hat{\beta }_p)^{\top } \, = \, (Z^{\top }Z)^{-1}Z^{\top }W\), \(X_1=\hat{\beta }_1\) and \(X_2=\hat{\beta }_2\), where it is suspected that \(\beta _1 \ge \beta _2\). In such cases, with the link presented in part (b) of Remark 2.1, analysis for (1) applies whether \(\hat{\beta }_1\) and \(\hat{\beta }_2\) are correlated or not.

The following known result is useful in analyzing the posterior density in (1).

Lemma 2.3

Let \(Z\sim N(0,1)\) and \(\nu ,\varepsilon \in \mathbb {R}\). Then \(\mathbb {E}\big [\Phi (\nu (Z+\varepsilon ))\big ] \, = \, \Phi \left( \frac{\nu \,\varepsilon }{\sqrt{1+\nu ^2}}\right) \).

Proof

Let \(T \sim N(0,1)\) be independent of Z. Then, we can write \(\mathbb {E}\big [\Phi (\nu (Z+\varepsilon ))\big ]\, = \, \mathbb {P}\big (T \le \nu (Z+\varepsilon )\big )\, = \, \Phi \left( \frac{\nu \,\varepsilon }{\sqrt{1+\nu ^2}}\right) \) since \(T - \nu Z \sim N(0, 1 + \nu ^2)\). \(\square \)
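As a numerical sanity check of Lemma 2.3, the following Python sketch compares a midpoint-rule approximation of \(\mathbb {E}\big [\Phi (\nu (Z+\varepsilon ))\big ]\) with the closed form. It is an illustration only; the function names, the integration range and the grid size are our own choices and not part of the original development.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf, written with the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def expect_Phi(nu, eps, half_width=10.0, n=4000):
    """Midpoint-rule approximation of E[Phi(nu * (Z + eps))] for Z ~ N(0, 1)."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        z = -half_width + (i + 0.5) * h
        total += Phi(nu * (z + eps)) * phi(z)
    return total * h

def closed_form(nu, eps):
    """Right-hand side of Lemma 2.3."""
    return Phi(nu * eps / math.sqrt(1.0 + nu * nu))

# Arbitrary illustrative values of (nu, eps).
for nu, eps in [(0.5, 1.0), (2.0, -0.7), (-1.5, 0.3)]:
    print(nu, eps, expect_Phi(nu, eps), closed_form(nu, eps))
```

The two printed columns should agree to many decimal places, since the integrand is smooth and decays at a Gaussian rate.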

Theorem 2.4

Under the model and prior given by (1), setting \(d=x_1-x_2\), the marginal posterior density of \(U=\frac{\theta _1-x_1}{\sigma _1}\) is given by

$$\begin{aligned} \pi (u|x)=\frac{\phi (u) \, \Phi \left( \frac{\sigma _1 u+d}{\sqrt{\sigma _2^2+\sigma _m^2}}\right) }{\Phi \left( \frac{d}{\sqrt{\sigma _1^2+\sigma _2^2+\sigma _m^2}}\right) }. \end{aligned}$$
(2)

Proof

This follows from writing the marginal posterior density of \(\theta _1\) as

$$\begin{aligned}{} & {} \pi (\theta _1|x)=\frac{\int _{-\infty }^{\infty }\int _{-\infty }^{\theta _1-m} f(x|\theta ) \, \pi (\theta |m) \, \pi (m) \, d\theta _2\, dm}{\int _{-\infty }^{\infty }\int _{-\infty }^{\infty }\int _{-\infty }^{\theta _1-m} f(x|\theta ) \, \pi (\theta |m)\, \pi (m) \, d\theta _2 \, dm \, d\theta _1}, \end{aligned}$$

where

$$\begin{aligned}{} & {} f(x|\theta ) \, \pi (\theta |m) \, \pi (m)\\{} & {} \quad =\frac{1}{2\pi \sigma _1 \sigma _2}e^{-\frac{1}{2\sigma _1^2}(x_1-\theta _1)^2}e^{-\frac{1}{2\sigma _2^2}(x_2-\theta _2)^2}\frac{1}{\sqrt{2\pi \sigma _m^2}}e^{-\frac{m^2}{2\sigma _m^2}} \, \mathbbm {1}_{[m,\infty )} (\theta _1-\theta _2), \end{aligned}$$

then using Lemma 2.3 to evaluate the integrals, and changing variables from \(\theta _1\) to U.

\(\square \)
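The density (2) is straightforward to evaluate numerically. The sketch below, with hypothetical function names and an arbitrary quadrature grid, codes \(\pi (u|x)\) and checks that it integrates to one, including in the case \(\sigma _m=0\); it is meant as an illustration under these assumptions rather than as a reference implementation.

```python
import math

def phi(t):
    """Standard normal pdf."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal cdf."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def posterior_density_u(u, d, s1, s2, sm):
    """Density (2) of U = (theta_1 - x_1) / sigma_1 given x, with d = x_1 - x_2."""
    num = phi(u) * Phi((s1 * u + d) / math.sqrt(s2**2 + sm**2))
    den = Phi(d / math.sqrt(s1**2 + s2**2 + sm**2))
    return num / den

def total_mass(d, s1, s2, sm, half_width=12.0, n=6000):
    """Midpoint-rule approximation of the integral of (2) over the real line."""
    h = 2.0 * half_width / n
    return h * sum(
        posterior_density_u(-half_width + (i + 0.5) * h, d, s1, s2, sm)
        for i in range(n)
    )

# Illustrative settings: data contradicting, then supporting, the constraint.
print(total_mass(-2.0, 1.0, 2.0, 0.5))
print(total_mass(3.0, 1.0, 1.0, 0.0))
```

For large d, the same function returns values close to \(\phi (u)\), in line with the heuristics discussed in Remark 2.7.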

One recognizes the posterior density in (2) as an extended skew-normal density of the form \(\phi (u) \frac{\Phi (\alpha _1u+\alpha _2)}{\Phi \left( \alpha _2/\sqrt{1+\alpha ^2_1}\right) }\); \(\alpha _1, \alpha _2 \in \mathbb {R}\) (e.g., Azzalini 1985; Arnold and Beaver 2002). Note that the density in (2) also holds for \(\sigma _m=0\). We next link properties of such extended skew-normal distributions to the posterior distribution (2).

Lemma 2.5

Under the context of Theorem 2.4, the posterior moment generating function, expectation and variance of U are given respectively by

$$\begin{aligned}{} & {} M_{U|x}(t)=\frac{e^{\frac{t^2}{2}}}{\Phi \left( d'\right) }\Phi \left( t \sigma ' + d' \right) ,\ \ \mathbb {E}(U|x)=\sigma ' \, R\left( d'\right) , \\{} & {} \mathbb {V}(U|x)=1- \, \sigma '^2 \, d' \, R\left( d'\right) \, -\, \sigma '^2 \, R^2\left( d'\right) , \end{aligned}$$

with \(\sigma ' \, = \, \frac{\sigma _1}{\sqrt{\sigma _1^2+\sigma _2^2+\sigma _{m}^2}}\), \(d'= \, \frac{x_1-x_2}{\sqrt{\sigma _1^2+\sigma _2^2+\sigma _{m}^2}} \), and where \(R(t)=\frac{\phi (t)}{\Phi (t)}\) is the reverse Mills ratio.

Proof

The moment generating function is readily computed by making a change of variables \(u'=u-t\) and using Lemma 2.3. The posterior expectation and variance of U follow by straightforward calculations. \(\square \)
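The "straightforward calculations" can be cross-checked numerically: the posterior mean and variance are the first two derivatives at \(t=0\) of \(\log M_{U|x}(t)\). The sketch below compares the closed forms of Lemma 2.5 with central finite differences of the log moment generating function. The function names and the step size h are illustrative assumptions.

```python
import math

def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def R(t):
    """Reverse Mills ratio phi(t) / Phi(t); adequate for moderate t."""
    return phi(t) / Phi(t)

def scaled(d, s1, s2, sm):
    """Return (sigma', d') as defined in Lemma 2.5."""
    s = math.sqrt(s1**2 + s2**2 + sm**2)
    return s1 / s, d / s

def log_mgf(t, d, s1, s2, sm):
    """log M_{U|x}(t) from Lemma 2.5."""
    sp, dp = scaled(d, s1, s2, sm)
    return 0.5 * t * t + math.log(Phi(t * sp + dp)) - math.log(Phi(dp))

def closed_form_moments(d, s1, s2, sm):
    """E(U|x) and V(U|x) as stated in Lemma 2.5."""
    sp, dp = scaled(d, s1, s2, sm)
    r = R(dp)
    return sp * r, 1.0 - sp**2 * dp * r - sp**2 * r**2

def finite_difference_moments(d, s1, s2, sm, h=1e-3):
    """First and second derivatives of log M at t = 0 by central differences."""
    k_plus = log_mgf(h, d, s1, s2, sm)
    k_zero = log_mgf(0.0, d, s1, s2, sm)
    k_minus = log_mgf(-h, d, s1, s2, sm)
    mean = (k_plus - k_minus) / (2.0 * h)
    var = (k_plus - 2.0 * k_zero + k_minus) / h**2
    return mean, var

print(closed_form_moments(-1.5, 1.0, 2.0, 0.5))
print(finite_difference_moments(-1.5, 1.0, 2.0, 0.5))
```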

In Sect. 3, we construct an ad hoc credible set for \(\theta _1\) based on its posterior expectation and variance. It is therefore of interest to study the properties of these quantities, which in turn follow from well-known properties of the reverse Mills ratio.

Lemma 2.6

In the setting of Theorem 2.4, the following properties of \(\mathbb {E}(U|x)\) and \(\mathbb {V}(U|x)\) hold for \(d=x_1-x_2\):

(a) \(\mathbb {E}(U|x)\) is a decreasing function of d with \(\lim \nolimits _{d\rightarrow \infty }\mathbb {E}(U|x)=0\), \(\lim \nolimits _{d\rightarrow -\infty }\mathbb {E}(U|x)=+\infty \) and \(\lim \nolimits _{d\rightarrow -\infty }\frac{\mathbb {E}(U|x)}{d}=-\frac{\sigma _1}{\sigma _1^2+\sigma _2^2+\sigma _m^2}\);

(b) \(\mathbb {V}(U|x)\) is an increasing function of d with \(\lim \nolimits _{d\rightarrow \infty }\mathbb {V}(U|x)=1\) and \(\lim \nolimits _{d\rightarrow -\infty }\mathbb {V}(U|x)=1-\frac{\sigma _1^2}{\sigma _1^2+\sigma _2^2+\sigma _{m}^2}\);

(c) \(\mathbb {E}(U|x)\) is decreasing in \(\sigma _m^2\) when \(d<0\), and \(\mathbb {V}(U|x)\) is increasing in \(\sigma _m^2\) when \(d<0\).

Proof

These results follow from properties of the reverse Mills ratio, in particular \(\lim \nolimits _{t\rightarrow \infty }R(t)=0\), \(\lim \nolimits _{t\rightarrow -\infty }R(t)=\infty \), \(\lim \nolimits _{t\rightarrow -\infty }R(t)\big (t+R(t)\big )=1\), \(\lim \nolimits _{t\rightarrow \infty }tR(t)=0\) and \(\lim \nolimits _{t \rightarrow -\infty } \frac{R(t)}{t}=-1\), as well as the fact that R(t) is a decreasing function of t and \(R'(t)=-R(t)\big (t+R(t)\big )\). \(\square \)
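The monotonicity and limiting statements of Lemma 2.6 (a) and (b) can be illustrated numerically. In the sketch below, the variance is written as \(1-\sigma '^2 R(d')\big (d'+R(d')\big )\), which is algebraically the expression of Lemma 2.5. The asymptotic expansion used for very negative arguments of R, the grid, and the parameter values are our own assumptions, introduced only to keep the computation stable and concrete.

```python
import math

def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def R(t):
    """Reverse Mills ratio phi(t) / Phi(t)."""
    if t < -30.0:
        # Assumption: asymptotic expansion of the Mills ratio, used here
        # to avoid underflow of Phi(t) for very negative t.
        x = -t
        return x / (1.0 - 1.0 / x**2 + 3.0 / x**4)
    return phi(t) / Phi(t)

def moments(d, s1, s2, sm):
    """Posterior mean and variance of U from Lemma 2.5."""
    s = math.sqrt(s1**2 + s2**2 + sm**2)
    sp, dp = s1 / s, d / s
    r = R(dp)
    return sp * r, 1.0 - sp**2 * r * (dp + r)

# Illustrative parameter values (not taken from the paper).
S1, S2, SM = 1.0, 2.0, 0.5
S_SQ = S1**2 + S2**2 + SM**2

grid = [-10.0 + 0.5 * i for i in range(41)]
means = [moments(d, S1, S2, SM)[0] for d in grid]
variances = [moments(d, S1, S2, SM)[1] for d in grid]
mean_is_decreasing = all(a > b for a, b in zip(means, means[1:]))
variance_is_increasing = all(a < b for a, b in zip(variances, variances[1:]))

# Behaviour far in the left tail, to compare with the stated limits.
far_left_d = -1.0e4
far_left_mean, far_left_var = moments(far_left_d, S1, S2, SM)
print(mean_is_decreasing, variance_is_increasing)
print(far_left_mean / far_left_d, -S1 / S_SQ)
print(far_left_var, 1.0 - S1**2 / S_SQ)
```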

Remark 2.7

The case \(\sigma _m=0\), i.e., no uncertainty on the restriction \(\theta _1 \ge \theta _2\), warrants particular attention. One recovers results for this degenerate case in the literature, notably in Cohen and Sackrowitz (1970) and Blumenthal and Cohen (1968b). Moreover, the case \(\sigma _{m}\rightarrow \infty \) corresponds to an absence of additional information. It is useful to consider heuristics related to these limiting cases in order to gain additional understanding.

(A) If \(x_1 \gg x_2\), then \(d=x_1-x_2\) is large and, since \(\theta _1 \ge \theta _2\) given that \(\sigma _{m}=0\), \(x_2\) provides very little additional information. We would therefore expect to obtain results similar to those in the limiting case with information on \(x_1\) only. This is indeed the case, since we would expect a \(N(x_1,\sigma _1^2)\) posterior for \(\theta _1\), which matches the limiting density of U in (2) when \(d \rightarrow \infty \).

(B) In the opposite situation where \(\sigma _{m}=0\) but \(d \ll 0\), we have data which appears to contradict the model. Assuming the model is still correct, posterior belief would be concentrated on the boundary \(\theta _1=\theta _2\). This suggests the benchmark model

$$\begin{aligned} X_i|\theta _1 \sim N(\theta _1,\sigma _i^2) \, \hbox { independent}. \end{aligned}$$

For the flat prior \(\pi (\theta _1)=1\), the posterior distribution of \(\theta _1\) becomes

$$\begin{aligned} \theta _1|x \sim N\left( \frac{\sigma _2^2x_1+\sigma _1^2x_2}{\sigma _1^2+\sigma _2^2},\frac{\sigma _1^2\sigma _2^2}{\sigma _1^2+\sigma _2^2}\right) , \end{aligned}$$

which, for \(U = \frac{\theta _1-x_1}{\sigma _1}\) and \(d \ll 0\), yields the approximations:

$$\begin{aligned} \mathbb {E}\left( \frac{U}{d}|x\right) \approx -\frac{\sigma _1}{\sigma _1^2+\sigma _2^2} \hbox { and } \mathbb {V}(U|x) \approx \frac{\sigma _2^2}{\sigma _1^2+\sigma _2^2}\,, \end{aligned}$$

which match the limiting values as \(d\rightarrow -\infty \) given in Lemma 2.6 (taking \(\sigma _m=0\)).

2.2 Point estimation

This section concerns itself with the efficiency of point estimators of \(\theta _1\) for model (1). We obtain a class of Bayesian estimators that dominate \(X_1\). From Cohen and Sackrowitz (1970), it is known that \(X_1\) is minimax for \(\theta _1 \ge \theta _2\), which renders our class of estimators also minimax. Consider the problem of estimating \(\theta _1\) under squared error loss \(L(\theta ,d)= {(d-\theta _1)}^2\) with X distributed according to model (1) and with the additional prior information \(\theta _1-\theta _2\in A \subset \mathbb {R}\). As reviewed by Marchand and Strawderman (2004), it is pertinent to consider the class of estimators:

$$\begin{aligned} \mathcal {C}_1&= \left\{ \delta _\phi (X) = Y_2+\phi (Y_1) \text { where } Y_1=\frac{X_1-X_2}{1+\tau },Y_2 =\frac{\tau X_1+X_2}{1+\tau } \text {, and } \tau =\frac{\sigma _2^2}{\sigma _1^2} \right\} . \end{aligned}$$
(3)

Of particular interest is the choice \(\delta _{\phi _{0}}(X)=X_1\), i.e., the MLE of \(\theta _1\) without parametric restrictions, obtained by taking \(\phi (Y_1)=Y_1\). Under model (1), \(Y_1\) and \(Y_2\) are independently distributed with \(Y_1\sim N (\mu _1,\sigma _{Y_1}^2)\) and \(Y_2\sim N (\mu _2,\sigma _{Y_2}^2)\), where \(\mu _1=\frac{\theta _1-\theta _2}{1+\tau }\), \(\sigma _{Y_1}^2=\frac{\sigma _1^2}{1+\tau }\), \(\mu _2=\frac{\tau \theta _1+\theta _2}{1+\tau }\) and \(\sigma _{Y_2}^2=\frac{\tau \sigma _1^2}{1+\tau }\). Furthermore, the mean squared error of the estimator \(\delta _{\phi }(X)\) reduces to

$$\begin{aligned} R(\theta ,\delta _{\phi }(X))=\mathbb {E}_\theta \left[ (Y_2+\phi (Y_1)-\theta _1)^2\right] =\mathbb {E}_\theta \left[ (Y_2-\mu _2)^2\right] +\mathbb {E}_\theta \left[ (\phi (Y_1)-\mu _1)^2\right] . \end{aligned}$$

The efficiency of the estimator \(\delta _{\phi }(X)\) in estimating \(\theta _1\) is therefore reliant on that of the estimator \(\phi (Y_1)\) in estimating \(\mu _1\).
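The independence of \(Y_1\) and \(Y_2\) and their variances can be verified by propagating the covariance of \((X_1, X_2)\) through the linear map in (3). The short sketch below does this with plain Python lists; the function name and the numerical values of \(\sigma _1\) and \(\sigma _2\) are illustrative assumptions.

```python
def rotated_covariance(s1, s2):
    """Covariance matrix of (Y1, Y2) in (3) for independent X1, X2.

    Returns tau = s2^2 / s1^2 together with the 2 x 2 matrix A D A^T,
    where D = diag(s1^2, s2^2) and A maps (X1, X2) to (Y1, Y2).
    """
    tau = s2**2 / s1**2
    a = [
        [1.0 / (1.0 + tau), -1.0 / (1.0 + tau)],  # Y1 = (X1 - X2) / (1 + tau)
        [tau / (1.0 + tau), 1.0 / (1.0 + tau)],   # Y2 = (tau X1 + X2) / (1 + tau)
    ]
    var_x = [s1**2, s2**2]
    cov_y = [
        [sum(a[i][k] * var_x[k] * a[j][k] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]
    return tau, cov_y

tau, cov_y = rotated_covariance(1.5, 0.8)
print(tau)
print(cov_y)
```

The off-diagonal entries vanish up to rounding, and the diagonal entries reproduce \(\sigma _{Y_1}^2=\frac{\sigma _1^2}{1+\tau }\) and \(\sigma _{Y_2}^2=\frac{\tau \sigma _1^2}{1+\tau }\).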

Lemma 2.8

For estimating \(\theta _1\) in the context of model (1) under squared error loss \(L(\theta ,d)=(d-\theta _1)^2\), with prior additional information \(\theta _1-\theta _2\in A \subset \mathbb {R}\), the estimator \(\delta _{\phi _{1}}(X)\) dominates \(\delta _{\phi _{0}}(X)\) if and only if \(\phi _1(Y_1)\) dominates \(\phi _0(Y_1)\) in the problem of estimating \(\mu _1 \in \mathcal {C}=\{y:(1+\tau )y\in A\}\).

We now use a recent result from Marchand and Nicoleris (2019) which gives a class of minimax Bayes estimators for a normal mean suspected to be positive.

Lemma 2.9

(Marchand and Nicoleris 2019) For \(X\sim N(\epsilon ,\sigma ^2)\), squared error loss \(L(\epsilon ,d)=(d-\epsilon )^2\) and parametric restriction \(\epsilon \ge 0\), estimators \(\delta _c(X)=X+c\sigma R\left( \frac{cX}{\sigma }\right) \), \(c\in (0,1]\), dominate \(\delta _0(X)=X\). Moreover, this class of estimators contains Bayes point estimators of \(\epsilon \) under the hierarchical prior density \(\pi (\epsilon \,|\,m)=\mathbbm {1}_{[m,\infty )} (\epsilon )\) with \(m \sim N(0,\sigma _{m}^2)\), namely \(\delta _c\), with \(c=\frac{\sigma }{\sqrt{\sigma ^2+\sigma _m^2}}\).

Combining Lemma 2.8 and Lemma 2.9, one obtains the following result.

Theorem 2.10

Let X be distributed according to model (1), \(\tau =\frac{\sigma _2^2}{\sigma _1^2}\), with squared error loss for estimating \(\theta _1\), \(L(\theta ,d)=(d-\theta _1)^2\). Then under the additional information \(\theta _1-\theta _2 \ge 0\), estimators of the form

$$\begin{aligned} \delta _{\phi _c}(X)&=X_1+\frac{c\,\sigma _1}{\sqrt{1+\tau }}R\left( \frac{c\,(X_1-X_2)}{\sigma _1 \sqrt{1+\tau }}\right) \end{aligned}$$
(4)

dominate \(X_1\), and are hence minimax, for \(c \in (0,1]\). Furthermore, the choice \(c=\frac{\sqrt{1+\tau }}{\sqrt{1+\tau +\frac{\sigma _{m}^2}{\sigma _1^2}}}\) coincides with the Bayes estimator for \(\theta _1\) under the prior given in (1); that is,

$$\begin{aligned} \delta _{\pi _{\sigma _m}}(X)=\mathbb {E}[\theta _1|X]=X_1+\frac{\sigma _1^2}{\sqrt{\sigma _1^2+\sigma _2^2+\sigma _m^2}}R\left( \frac{X_1-X_2}{\sqrt{\sigma _1^2+\sigma _2^2+\sigma _{m}^2}}\right) . \end{aligned}$$
(5)

Proof

Under the setting of (3), Lemma 2.9 asserts that estimators of the form

$$\begin{aligned} \delta _c(Y_1)=Y_1+c\sigma _{Y_1}R\left( \frac{cY_1}{\sigma _{Y_1}}\right) =\frac{X_1-X_2}{1+\tau }+\frac{c\,\sigma _1}{\sqrt{1+\tau }}R\left( \frac{c\,(X_1-X_2)}{\sigma _1\,\sqrt{1+\tau }}\right) \end{aligned}$$
(6)

dominate \(\delta _0(Y_1)=Y_1\) for \(c\in (0,1]\). Thus, with \(\phi _0(Y_1)=Y_1\) and correspondingly \(\delta _{{\phi }_0}(X)=Y_2+Y_1=X_1\), Lemma 2.8 yields (4) as a class of estimators which dominate \(X_1\) for \(c\in (0,1]\). \(\square \)
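To illustrate Theorem 2.10 numerically, the sketch below codes the Bayes estimator (5) and evaluates its risk difference with respect to \(X_1\). Following the reduction behind Lemma 2.8, this difference equals \(\mathbb {E}\big [(\delta _c(Y_1)-\mu _1)^2\big ]-\sigma _{Y_1}^2\), a one-dimensional integral computed here by a midpoint rule. Function names, the quadrature settings and the parameter values are assumptions made for the example.

```python
import math

def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def R(t):
    """Reverse Mills ratio, with an asymptotic expansion (an assumption made
    for numerical stability) when t is very negative."""
    if t < -30.0:
        x = -t
        return x / (1.0 - 1.0 / x**2 + 3.0 / x**4)
    return phi(t) / Phi(t)

def bayes_estimator(x1, x2, s1, s2, sm):
    """Bayes estimator (5) of theta_1."""
    s = math.sqrt(s1**2 + s2**2 + sm**2)
    return x1 + (s1**2 / s) * R((x1 - x2) / s)

def risk_difference(beta, s1, s2, sm, half_width=10.0, n=4000):
    """Risk of (5) minus risk of X1 at theta_1 - theta_2 = beta."""
    tau = s2**2 / s1**2
    sy = s1 / math.sqrt(1.0 + tau)                      # sd of Y1
    c = math.sqrt(1.0 + tau) / math.sqrt(1.0 + tau + sm**2 / s1**2)
    mu1 = beta / (1.0 + tau)                            # mean of Y1
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        z = -half_width + (i + 0.5) * h
        y = mu1 + sy * z
        est = y + c * sy * R(c * y / sy)                # delta_c(Y1) in (6)
        total += (est - mu1) ** 2 * phi(z)
    return total * h - sy**2

# Illustrative values with sigma_1 = sigma_2 = sigma_m = 1.
for beta in (0.0, 0.5, 2.0, 5.0):
    print(beta, risk_difference(beta, 1.0, 1.0, 1.0))
```

Negative values on \(\beta \ge 0\) are consistent with the dominance result; for \(\beta <0\), the same function can be used to explore how the risk inflates.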

Theorem 2.10 provides a class of Bayesian estimators that dominate \(X_1\) and are minimax for \(\theta _1 \ge \theta _2\). As for the previously known result when \(\sigma _m=0\), the estimators \(\delta _{\pi _{\sigma _m}}(X)\) incorporate the sample information \(X_2\) but, in contrast, do not arise from a prior (or posterior) density for \(\theta \) concentrated on \(\theta _1 \ge \theta _2\). Expressed otherwise, choices with \(\sigma _m>0\) allow more flexibility for the data to contradict such a constraint and for it to be better reflected in the posterior distribution determination. Despite this accommodation, the estimators \(\delta _{\pi _{\sigma _m}}(X)\) for \(\sigma _m>0\) still remain minimax for \(\theta _1 \ge \theta _2\) and will have less inflated risk than \(\delta _{\pi _{0}}(X)\) for parameter values of \(\theta \) such that \(\theta _1 < \theta _2\). The value of \(\sigma _m\) relates to the degree of confidence for which \(\theta _1-\theta _2 \ge m\) and impacts the corresponding risk accordingly. Several of the frequentist risk features above will be paralleled by the frequentist coverage analysis of Bayes credible sets, which is the object of study of Sect. 3. Finally, questions of minimaxity and admissibility, including simultaneous estimation of \(\theta =(\theta _1, \theta _2)^{\top }\), are addressed in Drew (2021).

3 Bayes credible sets

Having evaluated the posterior distribution of \(\theta _1\) under model and prior (1), we now turn to the construction of a Bayesian credible set for \(\theta _1\) and the study of its frequentist coverage probability and length. One objective is to determine the effect of the additional information on the credible sets, notably by considering the length of the intervals, as well as their frequentist coverage probability and credibility. Naturally, one may strive to obtain a satisfactory compromise between a short interval and good coverage. While there exist several types of credible sets (one thinks of highest posterior density (HPD) or equal-tailed sets, for example), we focus on an ad hoc interval with approximate credibility \(1-\alpha \) due to its ease of computation (i.e., explicit endpoints) and interpretation, which also presents the potential for further analytical determination of frequentist coverage probability. In Sect. 3.1, the ad hoc credible set studied is of a standard form \(\mathbb {E}[\theta |x] \pm z_{\alpha /2} \, \sigma (\theta |x)\) (e.g., Berger 1985). In Sect. 3.2, we propose and study a modification based on the idea of a “spending function” (e.g., Marchand and Strawderman 2013) that shifts the above credible set towards lower values.

3.1 An ad hoc credible set

The Bayes credible set studied here is given by Definition 3.1.

Definition 3.1

Let \(\mathbb {E}(U|x)\) and \(\mathbb {V}(U|x)\) denote respectively the posterior expectation and variance of U given by Lemma 2.5. The ad hoc Bayes credible interval for \(\theta _1\) (i.e., for \(\sigma _1U+X_1\)) is defined as

$$\begin{aligned} I_{\textrm{ah}}(X)=[X_1+l(X_1-X_2),X_1+u(X_1-X_2)], \end{aligned}$$
(7)

where \(l(d)=\sigma _1\mathbb {E}(U|x)-z_{\alpha /2}\sigma _1\sqrt{\mathbb {V}(U|x)}\) and \(u(d)=\sigma _1\mathbb {E}(U|x)+z_{\alpha /2}\sigma _1\sqrt{\mathbb {V}(U|x)}\), and where \(z_{\alpha /2}=\Phi ^{-1}\left( 1-\frac{\alpha }{2}\right) \).
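Since the endpoints in (7) are explicit, the interval is simple to compute. The following sketch is one possible coding of Definition 3.1 using only the Python standard library; the function name `ad_hoc_interval` and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def R(t):
    """Reverse Mills ratio, with an asymptotic expansion (an assumption made
    for numerical stability) when t is very negative."""
    if t < -30.0:
        x = -t
        return x / (1.0 - 1.0 / x**2 + 3.0 / x**4)
    return phi(t) / Phi(t)

def ad_hoc_interval(x1, x2, s1, s2, sm, alpha=0.05):
    """Interval (7): x1 + sigma_1 * (E(U|x) -/+ z_{alpha/2} * sd(U|x))."""
    s = math.sqrt(s1**2 + s2**2 + sm**2)
    sp, dp = s1 / s, (x1 - x2) / s
    r = R(dp)
    mean_u = sp * r                                   # E(U|x), Lemma 2.5
    sd_u = math.sqrt(1.0 - sp**2 * r * (dp + r))      # sqrt of V(U|x)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    centre = x1 + s1 * mean_u
    half_length = z * s1 * sd_u
    return centre - half_length, centre + half_length

# Illustrative data: x1 slightly below x2, certain constraint (sigma_m = 0).
print(ad_hoc_interval(0.0, 0.5, 1.0, 1.0, 0.0))
# Very large sigma_m: close to the benchmark x1 -/+ z * sigma_1.
print(ad_hoc_interval(0.0, 0.5, 1.0, 1.0, 1.0e6))
```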

Theorem 3.2 (also see Denis 2010) gives an expression for the frequentist coverage probability of a more general interval for \(\theta _1\), of which \(I_{\textrm{ah}}(X)\) is a particular case.

Theorem 3.2

Let \(X_i\sim N(\theta _i,\sigma _i^2)\), \(i=1,2\), independent, with \(d=X_1-X_2\), \(\sigma _i^2\) known and consider an interval of the form \(I(X)=[X_1+l(d),X_1+u(d)]\). Then the frequentist coverage probability, \(\mathbb {P}[\theta _1 \in I(X)]\), is given by

$$\begin{aligned} C(\theta )&=\mathbb {E}^Z\left[ \Phi \left( \gamma \, \,u\left\{ \sqrt{\sigma _1^2+\sigma _2^2}\,Z+\beta \right\} +\frac{\sigma _1}{\sigma _2}Z\right) \right. \nonumber \\&\quad \left. -\Phi \left( \gamma \,l\left\{ \sqrt{\sigma _1^2+\sigma _2^2}\,Z+\beta \right\} +\frac{\sigma _1}{\sigma _2}Z\right) \right] , \end{aligned}$$
(8)

where \(\beta =\theta _1-\theta _2\), \(\gamma = \frac{\sqrt{\sigma _1^2+\sigma _2^2}}{\sigma _1 \sigma _2}\), and \(Z\sim N(0,1)\).

Proof

We have \( C(\theta )=\mathbb {P}_\theta \left[ \theta _1 \in I(X)\right] \, = \, \mathbb {P}_\theta \big [X_1+l\{X_1-X_2\}\le \theta _1\le X_1+u\{X_1-X_2\}\big ] = \mathbb {P}_\theta \left[ -u\{Y_1-Y_2+\beta \}\le Y_1 \le -l\{Y_1-Y_2+\beta \}\right] \,,\) where \(Y_i=X_i-\theta _i \sim N(0,\sigma _i^2)\), \(i=1,2\), are independent. Setting \(Z=\frac{Y_1-Y_2}{\sqrt{\sigma _1^2+\sigma _2^2}}\) and \(Z'= \gamma \, \big (Y_1-\frac{\sigma _1^2}{\sqrt{\sigma _1^2+\sigma _2^2}}Z \big )\), we obtain \((Z,Z')^T\sim N_2(0,I_2)\). Now, by conditioning, we have

$$\begin{aligned} C(\theta )&=\mathbb {P}\left[ \gamma \left( -u\left\{ \sqrt{\sigma _1^2+\sigma _2^2}Z+\beta \right\} -\frac{\sigma _1^2}{\sqrt{\sigma _1^2+\sigma _2^2}}Z\right) \right. \\&\left. \le Z' \le \gamma \left( -l\left\{ \sqrt{\sigma _1^2+\sigma _2^2}Z+\beta \right\} -\frac{\sigma _1^2}{\sqrt{\sigma _1^2+\sigma _2^2}}Z\right) \right] \\&=\mathbb {E}^Z\left[ \mathbb {P}\left[ \gamma \left( -u\left\{ \sqrt{\sigma _1^2+\sigma _2^2}Z+\beta \right\} -\frac{\sigma _1^2}{\sqrt{\sigma _1^2+\sigma _2^2}}Z\right) \right. \right. \\&\le Z' \le \gamma \left( -l\left\{ \sqrt{\sigma _1^2+\sigma _2^2}Z+\beta \right\} \left. \left. -\frac{\sigma _1^2}{\sqrt{\sigma _1^2+\sigma _2^2}}Z\right) \right] \right] , \end{aligned}$$

which yields (8). \(\square \)
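Expression (8) reduces the coverage probability to a one-dimensional expectation, which is easy to approximate. The sketch below evaluates (8) for the ad hoc interval by a midpoint rule over Z; the function names, the truncation of Z to \([-8,8]\) and the grid size are illustrative assumptions, not the numerical method used for the figures.

```python
import math
from statistics import NormalDist

def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def R(t):
    """Reverse Mills ratio, with an asymptotic expansion (an assumption made
    for numerical stability) when t is very negative."""
    if t < -30.0:
        x = -t
        return x / (1.0 - 1.0 / x**2 + 3.0 / x**4)
    return phi(t) / Phi(t)

def ad_hoc_offsets(d, s1, s2, sm, z_crit):
    """l(d) and u(d) of Definition 3.1."""
    s = math.sqrt(s1**2 + s2**2 + sm**2)
    sp, dp = s1 / s, d / s
    r = R(dp)
    mean_u = sp * r
    sd_u = math.sqrt(1.0 - sp**2 * r * (dp + r))
    return s1 * (mean_u - z_crit * sd_u), s1 * (mean_u + z_crit * sd_u)

def coverage(beta, s1, s2, sm, alpha=0.05, half_width=8.0, n=2000):
    """Midpoint-rule approximation of C(theta) in (8), beta = theta_1 - theta_2."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    s12 = math.sqrt(s1**2 + s2**2)
    gamma = s12 / (s1 * s2)
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        z = -half_width + (i + 0.5) * h
        lo, hi = ad_hoc_offsets(s12 * z + beta, s1, s2, sm, z_crit)
        shift = (s1 / s2) * z
        total += (Phi(gamma * hi + shift) - Phi(gamma * lo + shift)) * phi(z)
    return total * h

# Illustrative settings with sigma_1 = sigma_2 = 1.
for sm in (0.0, 1.0, 3.0):
    print(sm, [round(coverage(b, 1.0, 1.0, sm), 4) for b in (0.0, 1.0, 3.0)])
```

Any other pair of functions l and u, for instance those of a spending-function interval, can be substituted for `ad_hoc_offsets` without changing `coverage`.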

As a first example, Fig. 1 presents the frequentist coverage probability of the ad hoc interval for \(\sigma _1=\sigma _2=1\), a 0.95 nominal level and varying \(\sigma _m\).

Fig. 1 Frequentist coverage probability of the ad hoc interval (\(1-\alpha =0.95\), \(\sigma ^2_1=\sigma ^2_2=1\)) as a function of \(\beta =\theta _1-\theta _2\) for varying \(\sigma _m\)

While the maximum coverage appears to decrease in \(\sigma _m\), the overall discrepancy between frequentist coverage and credibility for \(\beta \ge 0\) tends to diminish as \(\sigma _m\) increases. The coverage of \(I_{ah}(X)\) at \(\beta =0\) also appears to increase as \(\sigma _m\) increases (although it seems to remain below the nominal level \(1-\alpha \)). The same ordering occurs for negative values of \(\beta \), which is understandable as larger values of \(\sigma _m\) correspond to more uncertainty on the bound \(\beta \ge 0\), which in turn becomes reflected in the posterior distribution. Moreover, we have \(\lim \nolimits _{\beta \rightarrow \infty }C(\theta )=1-\alpha \). This can be shown in the same way as in Remark 3.3 below for \(\sigma _m \rightarrow \infty \) since we have \(\lim \nolimits _{d\rightarrow \infty }u(d)=-\lim \nolimits _{d\rightarrow \infty }l(d)=\sigma _1z_{\alpha /2}\). We noted similar overall behaviour of \(I_{ah}(X)\) for other nominal levels such as 0.80, 0.90 and 0.99.

Remark 3.3

Without recourse to the additional information provided by \(X_2\), a benchmark confidence interval for \(\theta _1\) is given by \(X_1 \pm z_{\alpha /2}\sigma _1\). This interval arises from \(I_{ah}(X)\) by taking \(\sigma _m \rightarrow \infty \) in (2) and (7), yielding \(\lim \nolimits _{\sigma _m^2 \rightarrow \infty }\pi (u|x)=\phi (u),\ \forall u \in \mathbb {R}\). Accordingly, one infers that \(\lim \nolimits _{\sigma _m \rightarrow \infty } C(\theta )=1-\alpha , \ \forall \theta \in \mathbb {R}^2\), and this is illustrated in Fig. 1 (for \(\theta _1 \ge \theta _2\) mostly) with the flattening out around the nominal level observed as \(\sigma _m\) increases.

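We also consider the credibility \(\mathbb {P}[\theta _1 \in I_{ah}(X)|x]\) of the ad hoc interval, which can be expressed on the scale of \(U=\frac{\theta _1-x_1}{\sigma _1}\) as

$$\begin{aligned} \mathbb {P}\left[ U \in \left[ \tfrac{l(d)}{\sigma _1},\tfrac{u(d)}{\sigma _1}\right] \Big |\,x\right] =\int _{l(d)/\sigma _1}^{u(d)/\sigma _1}\pi (u|x)\,du, \end{aligned}$$

where, with l and u as in Definition 3.1, \(l(d)/\sigma _1=\mathbb {E}(U|x)-z_{\alpha /2}\sqrt{\mathbb {V}(U|x)}\) and \(u(d)/\sigma _1=\mathbb {E}(U|x)+z_{\alpha /2}\sqrt{\mathbb {V}(U|x)}\).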
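The credibility can be approximated by integrating the density (2) between \(\mathbb {E}(U|x)-z_{\alpha /2}\sqrt{\mathbb {V}(U|x)}\) and \(\mathbb {E}(U|x)+z_{\alpha /2}\sqrt{\mathbb {V}(U|x)}\), i.e., the endpoints of the interval expressed on the scale of U. The sketch below does so with a midpoint rule; the function names and grid size are illustrative assumptions.

```python
import math
from statistics import NormalDist

def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def R(t):
    """Reverse Mills ratio phi(t) / Phi(t); adequate for moderate t."""
    return phi(t) / Phi(t)

def posterior_density_u(u, d, s1, s2, sm):
    """Density (2) of U given x, with d = x_1 - x_2."""
    num = phi(u) * Phi((s1 * u + d) / math.sqrt(s2**2 + sm**2))
    return num / Phi(d / math.sqrt(s1**2 + s2**2 + sm**2))

def credibility(d, s1, s2, sm, alpha=0.05, n=4000):
    """Posterior probability of the ad hoc interval, computed on the U scale."""
    s = math.sqrt(s1**2 + s2**2 + sm**2)
    sp, dp = s1 / s, d / s
    r = R(dp)
    mean_u = sp * r
    sd_u = math.sqrt(1.0 - sp**2 * r * (dp + r))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    lo, hi = mean_u - z * sd_u, mean_u + z * sd_u
    h = (hi - lo) / n
    total = sum(
        posterior_density_u(lo + (i + 0.5) * h, d, s1, s2, sm) for i in range(n)
    )
    return total * h

# Illustrative settings with sigma_1 = sigma_2 = 1 and sigma_m = 0.
for d in (-3.0, 0.0, 3.0):
    print(d, credibility(d, 1.0, 1.0, 0.0))
```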

Figure 2 presents the credibility as a function of \(d=x_1-x_2\) of the ad hoc interval with \(1-\alpha =0.95\), and \(\sigma ^2_1=\sigma ^2_2=1\) for varying values of \(\sigma _{m}\). Examining Fig. 2, we notice that the credibility flattens out around the nominal level as \(\sigma _{m}\) increases, as was the case for the coverage probability, which is justified here by the fact that \(\pi (u|x)\rightarrow \phi (u)\) as \(\sigma _m^2 \rightarrow \infty \). For all values of \(\sigma _{m}\), the exact credibility is remarkably close to the nominal level, with slightly higher credibility for positive d. Such closeness was equally observed for other nominal levels and other settings of \(\sigma _1^2\) and \(\sigma _2^2\).

Fig. 2 Credibility of the ad hoc interval (\(1-\alpha =0.95\), \(\sigma ^2_1=\sigma ^2_2=1\)) as a function of \(d=x_1-x_2\) for varying \(\sigma _m\)

3.2 Credible sets defined in terms of a spending function

The ad hoc procedure previously considered creates a credible set which is centered at the mean of the posterior distribution and which extends on either side of the mean by equal amounts. Given the asymmetry of the posterior density, it is justifiable to consider throwing out \(\alpha _1\) in one tail and \(\alpha _2\) in the other tail such that \(\alpha _1+\alpha _2=\alpha \). As above, exact credibility will not be achieved for all x, but it turns out for practical purposes to be close to nominal credibility (see Fig. 4). This idea of discarding unequal amounts in the tails is referred to as a spending function in Ghashim et al. (2016), and previously in Marchand and Strawderman (2013). We consider the situation where we discard \(k\alpha \) in the left tail and \((1-k)\alpha \) in the right tail. The adjustment in this direction with \(k < 1/2\) is motivated by a relatively smaller coverage for \(\beta =\theta _1-\theta _2\) closer to 0 (see Fig. 1).

Definition 3.4

Let \(\mathbb {E}(U|x)\) and \(\mathbb {V}(U|x)\) denote respectively the posterior expectation and variance of U given by Lemma 2.5. The ad hoc Bayes credible interval for \(\theta _1=\sigma _1U+X_1\) defined in terms of a spending function is given by

$$\begin{aligned} I_{\textrm{ah}}'(X)=[X_1+l'(X_1-X_2),X_1+u'(X_1-X_2)], \end{aligned}$$
(9)

where \(l'(d)=\sigma _1\mathbb {E}(U|x)-z_{k\alpha } \, \sigma _1 \, \sqrt{\mathbb {V}(U|x)}\) and \(u'(d)=\sigma _1\mathbb {E}(U|x)+z_{(1-k)\alpha } \, \sigma _1 \sqrt{\mathbb {V}(U|x)}\), with \(z_{\alpha }=\Phi ^{-1}\left( 1-\alpha \right) \).
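A possible coding of Definition 3.4 is sketched below. The function name `spending_interval` and the example inputs are hypothetical; k must lie strictly between 0 and 1 for both quantiles to be finite, and \(k=1/2\) recovers the interval (7).

```python
import math
from statistics import NormalDist

def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def R(t):
    """Reverse Mills ratio, with an asymptotic expansion (an assumption made
    for numerical stability) when t is very negative."""
    if t < -30.0:
        x = -t
        return x / (1.0 - 1.0 / x**2 + 3.0 / x**4)
    return phi(t) / Phi(t)

def spending_interval(x1, x2, s1, s2, sm, alpha=0.05, k=0.5):
    """Interval (9): k * alpha discarded on the left, (1 - k) * alpha on the right."""
    if not 0.0 < k < 1.0:
        raise ValueError("k must lie strictly between 0 and 1")
    s = math.sqrt(s1**2 + s2**2 + sm**2)
    sp, dp = s1 / s, (x1 - x2) / s
    r = R(dp)
    mean_u = sp * r
    sd_u = math.sqrt(1.0 - sp**2 * r * (dp + r))
    z_left = NormalDist().inv_cdf(1.0 - k * alpha)            # z_{k alpha}
    z_right = NormalDist().inv_cdf(1.0 - (1.0 - k) * alpha)   # z_{(1-k) alpha}
    centre = x1 + s1 * mean_u
    return centre - z_left * s1 * sd_u, centre + z_right * s1 * sd_u

# Illustrative comparison of k = 1/2 (symmetric) and k = 1/4.
print(spending_interval(0.0, 0.5, 1.0, 1.0, 0.0, k=0.5))
print(spending_interval(0.0, 0.5, 1.0, 1.0, 0.0, k=0.25))
```

With \(k<1/2\), both endpoints move to the left relative to the symmetric choice, which is the shift towards lower values described above.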

Theorem 3.2 holds for general u(d) and l(d), so Eq. (8) holds here for all values of k. Figure 3 presents the frequentist coverage probability of the ad hoc interval for \(\sigma _1=\sigma _2=1,\sigma _m=0\), a 0.95 nominal level and varying values of k in the spending function.

Fig. 3 Frequentist coverage probability of the ad hoc interval (\(1-\alpha =0.95\), \(\sigma ^2_1=\sigma ^2_2=1\), and \(\sigma ^2_m=0\)) as a function of \(\beta =\theta _1-\theta _2\) for varying values of k in the spending function

Similarly to previous results, it is easy to show that \(\lim \nolimits _{\beta \rightarrow \infty }C(\theta )=1-\alpha \) for all k. The coverage at \(\beta =0\) appears to be a decreasing function of k. Further numerical exploration suggests that \(C(0) \ge 1-\alpha \) for \(k\le 1/4\), even for various other values of \(1-\alpha \). Moreover, for small values of k, the minimum coverage is no longer attained at \(\beta =0\). It would be interesting to investigate theoretically if the coverage has a local minimum after the initial peak or if it decreases monotonically towards the limiting value of \((1-\alpha )\). If the latter were true, then the coverage would always be above the nominal value whenever \(C(\theta )> 1-\alpha \) for \(\theta _1-\theta _2=0\). Further illustration and observations about the coverage at \(\beta =0\) are provided by Drew (2021).

Figure 4 presents the credibility as a function of \(d=x_1-x_2\) of the ad hoc interval for \(\sigma _1=\sigma _2=1\), \(\sigma _{m}=0\), a 0.95 nominal level and varying values of k.

Fig. 4 Credibility of the ad hoc interval (\(1-\alpha =0.95\), \(\sigma ^2_1=\sigma ^2_2=1\), and \(\sigma ^2_m=0\)) as a function of \(d=x_1-x_2\) for varying values of k in the spending function

The overall credibility appears to be best when \(k=1/2\), and to decrease as k decreases. That being said, for all values of k plotted here, the credibility remains extremely close to the nominal level. For the sake of further comparison, Table 1 gives an approximate maximum discrepancy of the credibility for \(k=1/4\) and varying values of \(1-\alpha \).

Table 1 Approximate maximum credibility discrepancy for the ad hoc interval with \(k=1/4\) in the spending function, \(\sigma _1=\sigma _2=1\) and \(\sigma _m=0\)

Remark 3.5

Unsurprisingly, the credible intervals \(I_{\textrm{ah}}(X)\) and \(I_{\textrm{ah}}'(X)\) are typically shorter than those of the non-informative case \(\sigma _m \rightarrow \infty \). The expected length of these credible intervals is further studied in Drew (2021) and illustrated for various settings of \(\sigma ^2_m\) and the spending function (i.e., k).

4 Concluding remarks

For estimating the suspected larger (\(\theta _1\)) of two normal means (\(\theta _1\) and \(\theta _2\)), we have studied the frequentist risk performance of Bayesian point and interval estimators associated with non-informative prior densities of the form:

$$\begin{aligned} \pi (\theta _1,\theta _2|m)=\mathbbm {1}_{[m,\infty )} (\theta _1-\theta _2),\, m\sim N(0,\sigma _m^2). \end{aligned}$$

Firstly, we establish for all \(\sigma _m > 0\) the minimaxity of the Bayesian point estimator of \(\theta _1\) under squared error loss when the supremum risk is taken over \(\theta _1 \ge \theta _2 \), thus extending the previously known result for \(\sigma _m=0\). Secondly, we provide ample evidence of satisfactory, or even excellent, frequentist performance of Bayesian credible sets for the same priors as measured on the set of parameter values \(\theta _1 \ge \theta _2 \), with such procedures capitalizing on the additional information available for \(\theta _2\). In doing so, we have shown how the frequentist probability of coverage varies with the difference \(\beta =\theta _1-\theta _2\) and with the choice of the hyperparameter \(\sigma _m\), ranging from the “no-useful additional information case” (\(\sigma _m \rightarrow \infty \)) to the certain constraint \(\theta _1 \ge \theta _2\) (\(\sigma _m=0\)). Moreover, we have further illustrated the role of a spending function in the construction of the Bayesian credible set and how its setting can give rise to even better frequentist coverage probability.

The findings of this paper also apply to situations where \(m \sim N(\xi , \sigma ^2_m)\) in (1) with \(\xi \ne 0\). Indeed, for such a case, we can set \(X_1'=X_1 - \xi \) and \(\theta _1'=\theta _1 - \xi \) so that point and interval estimates of \(\theta _1'\) based on \((X_1',X_2)\) with \(\theta _1' - \theta _2 \ge m'\), \(m' \overset{d}{=} m-\xi \sim N(0,\sigma ^2_m)\), translate to point and interval estimates of \(\theta _1\). For instance, the above strategy will generate point estimates \(\hat{\theta }_1(x) \, = \, \hat{\theta _1'}(x_1',x_2) \, + \, \xi \). Theorem 2.10’s minimaxity result will then apply to the parametric restriction \(\theta _1-\theta _2 \ge \xi \), and Section 3’s study of frequentist coverage probability, which pertains to \(\beta '=\theta _1'-\theta _2 \ge 0\), will correspond to \(\beta =\theta _1-\theta _2 \ge \xi \).
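The recentring argument above can be sketched in code as a pair of wrappers: given any point or interval procedure constructed for the case \(m \sim N(0,\sigma ^2_m)\), shift \(x_1\) by \(\xi \) before calling it and shift the result back afterwards. The names `shift_point_estimator` and `shift_interval`, and the assumed call signature `procedure(x1, x2)`, are illustrative assumptions only.

```python
def shift_point_estimator(estimator, xi):
    """Adapt a point estimator built for m ~ N(0, sigma_m^2) to m ~ N(xi, sigma_m^2).

    estimator(x1, x2) is a hypothetical callable returning an estimate of theta_1'.
    """
    def shifted(x1, x2):
        # Estimate theta_1' = theta_1 - xi from (x1 - xi, x2), then undo the shift.
        return estimator(x1 - xi, x2) + xi
    return shifted


def shift_interval(interval, xi):
    """Same recentring for an interval procedure returning (lower, upper)."""
    def shifted(x1, x2):
        lower, upper = interval(x1 - xi, x2)
        return lower + xi, upper + xi
    return shifted
```

For example, wrapping the unrestricted estimator \((x_1,x_2)\mapsto x_1\) returns \(x_1\) itself for every \(\xi \), as expected for a procedure that ignores the constraint.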

The results of this paper leave open several interesting questions about analytically derived lower bounds on coverage probabilities, which bring into play the model variances, the choice of \(\sigma _m\), and the spending function setting. It would be particularly interesting to carry out the analysis for an unknown-variances extension of model (1). Finally, although we have focussed on a relatively simple two-parameter problem with normal observables, we believe that the ideas and techniques put forth, namely the incorporation of uncertainty on a parametric restriction and the use of a spending function in the construction of Bayesian credible sets, can be adapted to a wider range of settings.