1 Introduction

Stein (1973, 1981) introduced the Stein identity, also known as the Stein equation, to derive unbiased estimators for risk functions of shrinkage estimators in the simultaneous estimation of means within normal distributions. This innovative method was employed to enhance the performance of unbiased estimators. The simplicity and potency of this technique have led to significant developments in the field of shrinkage estimation, as extensively documented in the literature. For a comprehensive exploration of this subject, refer to Stein (1981), Strawderman (1971), Shinozaki (1984), Berger (1985), Brandwein and Strawderman (1990), Robert (2007), and Fourdrinier et al. (2018). Komaki (2001) made an intriguing contribution by extending the Stein phenomenon to the prediction of predictive distributions. For further insights into this topic, see Ghosh et al. (2020).

It is crucial to note that Stein identities offer utility not only in the context of shrinkage estimation within decision-theoretic frameworks but also in normal approximations, such as the central limit theorem. Their relevance in normal approximation stems from the fact that Stein identities provide a characterization of the normal distribution. Consequently, these identities can be employed in constructing goodness-of-fit test statistics for normality. Hudson (1978) extended Stein-type identities to gamma, Poisson, and negative binomial distributions. Thus, the threefold applications of shrinkage estimation, distribution characterization, and goodness-of-fit testing can be extended to these alternative distributions.

In this paper, we offer an instructive exposition and review of these expanded applications of Stein identities. While many of the results presented herein are well-established in the literature, readers will appreciate the versatile utility of Stein identities in both statistical theory and practical applications.

In Sect. 2, we explain that the normal distribution is characterized by the Stein identity, or equivalently by a differential equation for the moment-generating function. Many characterizations of normal distributions have been given in the literature, and some of them are summarized there. Two applications of the Stein identity are provided in Sect. 3. In particular, we construct a goodness-of-fit test statistic for normality based on the Stein identity and investigate its performance numerically.

Another important application of the Stein identity is the normal approximation of sums of independent random variables. An instructive explanation is provided in Sect. 4 based on Chen et al. (2011).

In Sect. 5, we describe that the gamma distribution is characterized by the Stein-type identity, or equivalently by a differential equation for the moment-generating function. Some characterizations of gamma and exponential distributions and shrinkage estimation in decision-theoretic frameworks are summarized. A goodness-of-fit test statistic for exponentiality is constructed based on the Stein-type identity, and its performance is investigated numerically.

In Sect. 6, we explain that the Poisson distribution is characterized by the Stein-type identity, or equivalently by a differential equation for the moment-generating function. Shrinkage estimation and a goodness-of-fit test for Poissonity are demonstrated. The Stein-type identity for the negative binomial distribution is briefly described. Some remarks and extensions are given in Sect. 7 as concluding remarks.

2 Stein identity and characterization of normal distributions

In normal distributions, Stein (1973, 1981) developed the so-called Stein identity, which is not only useful for calculating higher moments, but also powerful for developing shrinkage estimators improving on the minimax estimator in the simultaneous estimation of normal means. The Stein identity also provides a characterization of the normal distribution, which means that the central limit theorem can be proved by using the Stein identity. For recent developments and reviews on the Stein identity, see Bellec and Zhang (2021), Chen (2021), Fathi et al. (2022) and Anastasiou (2023).

Theorem 2.1

Let X be a random variable with mean \(\textrm{E}[X]=\mu \) and variance \(\textrm{Var}(X)={\sigma }^2\). Then, the following four conditions are equivalent.

  1. (a)

    \(X\sim \mathcal{N}(\mu , {\sigma }^2)\).

  2. (b)

    For any differentiable function \(h(\cdot )\) with \(\textrm{E}[|(X-\mu )h(X)|]<\infty \) and \(\textrm{E}[|h'(X)|]<\infty \), it holds that

    $$\begin{aligned} \textrm{E}[(X-\mu )h(X)]={\sigma }^2 \textrm{E}[h'(X)], \end{aligned}$$
    (2.1)

    where \(h'(x)\) is the derivative of h(x).

  3. (c)

    For any real constant t satisfying \(\textrm{E}[|X|e^{tX}]<\infty \), it holds that

    $$\begin{aligned} \textrm{E}[(X-\mu )\exp \{tX\}]=t {\sigma }^2 \textrm{E}[\exp \{tX\}]. \end{aligned}$$
    (2.2)
  4. (d)

    Let \(g(t)=\textrm{E}[e^{t(X-\mu )}]\). Then for any t in the interval \((-c,c)\) for positive constant c, g(t) satisfies the differential equation

    $$\begin{aligned} {\textrm{d}\over \textrm{d}t}\log g(t)={g'(t)\over g(t)}={\sigma }^2t\quad \text {or}\quad {\textrm{d}^2\over \textrm{d}t^2}\log g(t)={g''(t)\over g(t)}-\Big ({g'(t)\over g(t)}\Big )^2={\sigma }^2, \end{aligned}$$
    (2.3)

    where \(g(0)=1\), \(g'(0)=0\) and \(g''(0)={\sigma }^2\).

Proof

The proof from (a) to (b) can be done by integration by parts as seen from Stein (1981) and Fourdrinier et al. (2018). We here introduce another approach. Making the transformation \(Y=X-\mu \) gives

$$\begin{aligned} \textrm{E}[h(X)]=\int _{-\infty }^\infty h(x) {1\over \sqrt{2\pi }{\sigma }}e^{-(x-\mu )^2/(2{\sigma }^2)}\textrm{d}x = \int _{-\infty }^\infty h(y+\mu ) {1\over \sqrt{2\pi }{\sigma }}e^{-y^2/(2{\sigma }^2)}\textrm{d}y. \end{aligned}$$

Differentiating both sides with respect to \(\mu \) and using Lebesgue’s dominated convergence theorem, we can demonstrate that

$$\begin{aligned} \int _{-\infty }^\infty h(x) {x-\mu \over {\sigma }^2}{1\over \sqrt{2\pi }{\sigma }}e^{-(x-\mu )^2/(2{\sigma }^2)}\textrm{d}x = \int _{-\infty }^\infty h'(y+\mu ) {1\over \sqrt{2\pi }{\sigma }}e^{-y^2/(2{\sigma }^2)}\textrm{d}y, \end{aligned}$$

which is rewritten as in (2.1) by turning back with \(X=Y+\mu \).

Clearly, one gets (c) from (b). Also, it is easy to get (d) from (c). For the proof from (d) to (a), the solution of the differential equation in (2.3) is \(g(t)=\exp \{{\sigma }^2 t^2/2\}\), which implies that \(X-\mu \sim \mathcal{N}(0, {\sigma }^2)\). \(\square \)
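
As an informal numerical illustration of the equivalence of (a) and (b), the following sketch compares both sides of (2.1) by Monte Carlo. Python with NumPy is assumed here purely for illustration (the paper's own computations use Ox), and the test function \(h(x)=\sin x\), the sample size and the seed are arbitrary choices. For a normal sample the two sides agree up to simulation error, while for a non-normal sample with the same mean and variance they differ visibly.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 2.0, 1.5, 10**6

def stein_sides(x, h, hprime, mu, sigma2):
    """Monte Carlo estimates of E[(X - mu) h(X)] and sigma^2 E[h'(X)] in (2.1)."""
    return np.mean((x - mu) * h(x)), sigma2 * np.mean(hprime(x))

h, hp = np.sin, np.cos                                   # a smooth bounded test function and its derivative

x_norm = rng.normal(mu, sigma, n)                        # N(mu, sigma^2) sample
x_expo = mu + sigma * (rng.exponential(1.0, n) - 1.0)    # same mean and variance, but not normal

print(stein_sides(x_norm, h, hp, mu, sigma**2))   # the two values nearly coincide
print(stein_sides(x_expo, h, hp, mu, sigma**2))   # a clear gap: the identity fails
```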

We briefly provide some conditions for characterizing normality. The study of characterizations of normality has had a long history as explained in Kagan et al. (1973) and Kotz (1974). For a good review of the book of Kagan et al. (1973), see Diaconis et al. (1977). In Theorem 2.2, sufficient condition (b) was given by Cramér (1936), and we provide a simple proof by using the Stein identity. Condition (d) was shown by Kac (1939), Bernstein (1941) and Lukacs (1942).

Theorem 2.2

Assume that independent random variables \(X_1\) and \(X_2\) are identically distributed with \(\textrm{E}[X_i]=\mu \) and \(\textrm{Var}(X_i)={\sigma }^2\) for \(i=1,2\). Then the following four conditions are equivalent.

  1. (a)

    \(X_i \sim \mathcal{N}(\mu , {\sigma }^2)\) for \(i=1, 2\).

  2. (b)

    \(X_1+X_2\sim \mathcal{N}(2\mu , 2{\sigma }^2)\).

  3. (c)

    For \(i=1,2\), the density function of \(X_i-\mu \) is symmetric, and \((X_1-X_2)^2/(2{\sigma }^2) \sim \chi _1^2\).

  4. (d)

    \(X_1+X_2\) and \(X_1-X_2\) are independent.

Proof

Since (b), (c) and (d) clearly follow from (a), it is sufficient to demonstrate the opposite directions. For the proof from (b) to (a), condition (b) and Theorem 2.1(c) imply that

$$\begin{aligned} \textrm{E}[(X_1+X_2-2\mu )\exp \{t(X_1+X_2)\}] = 2{\sigma }^2 t \textrm{E}[ \exp \{t(X_1+X_2)\}]. \end{aligned}$$
(2.4)

From the independence of \(X_1\) and \(X_2\), we have

$$\begin{aligned}&\textrm{E}[(X_1-\mu )\exp \{t X_1\}] \textrm{E}[\exp \{t X_2\}] + \textrm{E}[(X_2-\mu )\exp \{t X_2\}] \textrm{E}[\exp \{t X_1\}]\\&\quad = 2{\sigma }^2 t \textrm{E}[ \exp \{tX_1\}]\textrm{E}[ \exp \{tX_2\}]. \end{aligned}$$

Since \(X_1\) and \(X_2\) have the same distribution, we have \(\textrm{E}[(X_1-\mu )\exp \{t X_1\}] \textrm{E}[\exp \{t X_2\}]= \textrm{E}[(X_2-\mu )\exp \{t X_2\}] \textrm{E}[\exp \{t X_1\}]\), so that we can see that for \(i=1, 2\),

$$\begin{aligned} \textrm{E}[(X_i-\mu )\exp \{t X_i\}] = {\sigma }^2 t \textrm{E}[ \exp \{tX_i\}], \end{aligned}$$

which, from Theorem 2.1, shows that \(X_i\sim \mathcal{N}(\mu , {\sigma }^2)\).

For the proof from (c) to (b), let \(Y_i = X_i-\mu \) for simplicity. Since \((Y_1-Y_2)^2/(2{\sigma }^2)\sim \chi _1^2\), we have \(Y_1-Y_2\sim \mathcal{N}(0,2{\sigma }^2)\). From Theorem 2.1(c), it follows that

$$\begin{aligned} \textrm{E}[(Y_1-Y_2)\exp \{t(Y_1-Y_2)\}] = 2{\sigma }^2 t \textrm{E}[ \exp \{t(Y_1-Y_2)\}]. \end{aligned}$$
(2.5)

Note that \(Y_1\) and \(Y_2\) are independent and \(-Y_i\) has the same distribution as \(Y_i\). Thus, equality (2.5) can be rewritten as

$$\begin{aligned} \textrm{E}[(Y_1+Y_2)\exp \{t(Y_1+Y_2)\}] = 2{\sigma }^2 t \textrm{E}[ \exp \{t(Y_1+Y_2)\}], \end{aligned}$$

which is identical to equality (2.4). Hence one gets (b).

Finally, we provide the proof from (d) to (a), following the proof of Lukacs (1942). The independence of \(X_1+X_2\) and \(X_1-X_2\) is equivalent to the independence of \(Y_1+Y_2\) and \(Y_1-Y_2\) for \(Y_i=X_i-\mu \), which implies that

$$\begin{aligned} \textrm{E}[\exp \{ s(Y_1+Y_2)+t(Y_1-Y_2)\}]=\textrm{E}[\exp \{ s(Y_1+Y_2)\}] \textrm{E}[\exp \{ t(Y_1-Y_2)\}]. \end{aligned}$$
(2.6)

Letting \(g(t)=\textrm{E}[\exp \{tY_i\}]\), we can see that LHS of (2.6) is written as

$$\begin{aligned}{} & {} \textrm{E}[\exp \{ s(Y_1+Y_2)+t(Y_1-Y_2)\}]\\{} & {} \quad =\textrm{E}[\exp \{ (s+t)Y_1\}]\textrm{E}[\exp \{ (s-t)Y_2\}]= g(s+t) g(s-t). \end{aligned}$$

On the other hand, RHS of (2.6) is written as \(\{g(s)\}^2g(t)g(-t)\), so that Eq. (2.6) is expressed as

$$\begin{aligned} g(s+t)g(s-t) = \{g(s)\}^2g(t)g(-t), \end{aligned}$$

equivalently rewritten as

$$\begin{aligned} \log g(s+t)+ \log g(s-t) = 2 \log g(s) + \log g(t) + \log g(-t). \end{aligned}$$
(2.7)

Let \(\psi (t)=(d/dt)\log g(t)\). Differentiating both sides of (2.7) with respect to s and t, we have

$$\begin{aligned} \psi (s+t) + \psi (s-t)&= 2 \psi (s), \end{aligned}$$
(2.8)
$$\begin{aligned} \psi (s+t) - \psi (s-t)&= \psi (t) - \psi (-t). \end{aligned}$$
(2.9)

Note that \(\psi (0)=0\). Substituting \(s=0\) in (2.8) gives \(\psi (t)+\psi (-t)=0\), or \(\psi (-t)=- \psi (t)\), which is used to rewrite (2.9) as \(\psi (s+t)-\psi (s-t)=2\psi (t)\). Combining this equality and (2.8) gives

$$\begin{aligned} \psi (s+t) = \psi (s) + \psi (t). \end{aligned}$$
(2.10)

Equation (2.10) implies that \(\psi (t)\) is written as \(\psi (t) = c t\) for constant c, namely \((d/dt)\log g(t)=ct\). From Theorem 2.1(d), the solution is \(g(t)=\exp \{ct^2/2\}\). Since \(g''(0)={\sigma }^2\), we have \(c={\sigma }^2\). Thus, \(Y_i=X_i-\mu \sim \mathcal{N}(0, {\sigma }^2)\). \(\square \)

Normality can also be characterized through a random sample. Conditions (b), (e) and (f) in Theorem 2.3 were derived by Kagan et al. (1965), Lukacs (1942) and Ruben (1974), respectively.

Theorem 2.3

Let \(X_1, \ldots , X_n\) be a random sample from a population with \(\textrm{E}[X_i]=\mu \) and \(\textrm{Var}(X_i)={\sigma }^2\). Let \({{\overline{X}}}=n^{-1}\sum _{i=1}^n X_i\) and \(S^2=n^{-1}\sum _{i=1}^n(X_i-{{\overline{X}}})^2\). Then, the following six conditions are equivalent.

  1. (a)

    \(X_i \sim \mathcal{N}(\mu , {\sigma }^2)\) for \(i=1, \ldots , n\).

  2. (b)

    \(\textrm{E}[{{\overline{X}}}| X_1-{{\overline{X}}}, \ldots , X_n-{{\overline{X}}}]=\mu \) for \(n\ge 3\).

  3. (c)

    \({{\overline{X}}}\) and \((X_1-{{\overline{X}}}, \ldots , X_n-{{\overline{X}}})\) are independent.

  4. (d)

    \({{\overline{X}}}\sim \mathcal{N}(\mu , {\sigma }^2/n)\).

  5. (e)

    \({{\overline{X}}}\) and \(S^2\) are independent.

  6. (f)

    For \(i=1, \ldots , n\), the density function of \(X_i-\mu \) is symmetric, and \(nS^2/{\sigma }^2\sim \chi _{n-1}^2\).

Proof

Using well-known properties of a normal distribution, one gets (b), (c), (d), (e) and (f) from (a). For the proof from (b) to (a), let \(Z_i=(X_i-\mu )/{\sigma }\) for simplicity. Then, (b) is rewritten as \(\textrm{E}[{{\overline{Z}}}| Z_1-{{\overline{Z}}}, \ldots , Z_n-{{\overline{Z}}}]=0\) for \({{\overline{Z}}}=n^{-1}\sum _{i=1}^n Z_i\). This implies that

$$\begin{aligned} \textrm{E}\Big [ \sum _{i=1}^n Z_i \exp \{t_1(Z_1-{{\overline{Z}}})+\cdots + t_n(Z_n-{{\overline{Z}}})\}\Big ]=0. \end{aligned}$$

Let \(s_i=t_i-{{\overline{t}}}\) for \({{\overline{t}}}=n^{-1}\sum _{i=1}^n t_i\). Then the above equality is expressed as

$$\begin{aligned} \textrm{E}\Big [ \sum _{i=1}^n Z_i \exp \Big \{ \sum _{j=1}^n s_j Z_j\Big \}\Big ] =\sum _{i=1}^n \textrm{E}[ Z_i\exp \{s_i Z_i\}]\prod _{j=1, j\not = i}^n \textrm{E}[\exp \{s_j Z_j\}]=0, \end{aligned}$$

equivalently rewritten as

$$\begin{aligned} \sum _{i=1}^n {\textrm{E}[Z_i\exp \{s_i Z_i\}]\over \textrm{E}[ \exp \{s_i Z_i\}]}=0, \quad \textrm{or}\quad \sum _{i=1}^n \psi (s_i)=0 \end{aligned}$$

for \(\psi (t)=(d/dt)\log \textrm{E}[\exp \{t Z_i\}]\). Since \(\sum _{i=1}^n s_i=0\), we have \(s_n = - \sum _{i=1}^{n-1}s_i\). Thus,

$$\begin{aligned} \sum _{i=1}^{n-1} \psi (s_i) = - \psi \Big (- \sum _{i=1}^{n-1}s_i\Big ). \end{aligned}$$

Substituting \(s_2=\cdots =s_{n-1}=0\) gives \(\psi (s_1)=-\psi (-s_1)\). Hence, the above equality is expressed as

$$\begin{aligned} \sum _{i=1}^{n-1} \psi (s_i) = \psi \Big ( \sum _{i=1}^{n-1}s_i\Big ). \end{aligned}$$

Since this equality holds for \(n\ge 3\), we can see that the solution is \(\psi (t)=ct\). Since \(\textrm{Var}(Z_i)=1\), we have \(c=1\). Thus, from Theorem 2.1(d), it follows that \(Z_i \sim \mathcal{N}(0, 1)\).

For the proof from (c) to (a), the case of \(n=2\) follows from Theorem 2.2(d). When \(n\ge 3\), the independence between \({{\overline{X}}}\) and \((X_1-{{\overline{X}}}, \ldots , X_n-{{\overline{X}}})\) implies that

$$\begin{aligned} \textrm{E}[{{\overline{X}}}| X_1-{{\overline{X}}}, \ldots , X_n-{{\overline{X}}}]=\textrm{E}[{{\overline{X}}}]=\mu , \end{aligned}$$

which results in (b) and leads to (a).

The proof from (d) to (a) can be done by using the same arguments as in the proof of Theorem 2.2(d).

For the proof from (e) to (a), we provide the proof given by Lukacs (1942). Let \(Y_i=X_i-\mu \) and \({{\overline{Y}}}=n^{-1}\sum _{i=1}^n Y_i\). Note that \(\sum _{i=1}^n(X_i-{{\overline{X}}})^2=\sum _{i=1}^n (Y_i-{{\overline{Y}}})^2\) and \({{\overline{X}}}={{\overline{Y}}}+\mu \). Since \(\sum _{i=1}^n (Y_i-{{\overline{Y}}})^2\) and \({{\overline{Y}}}\) are independent, we have

$$\begin{aligned} \textrm{E}\Big [e^{s\sum _{i=1}^n Y_i + t\sum _{i=1}^n (Y_i-{{\overline{Y}}})^2}\Big ] = \textrm{E}\Big [e^{s\sum _{i=1}^n Y_i}\Big ] \textrm{E}\Big [ e^{t\sum _{i=1}^n (Y_i-{{\overline{Y}}})^2}\Big ]. \end{aligned}$$
(2.11)

For \(g(s)=\textrm{E}[e^{sY_i}]\), we can see that \(\textrm{E}[e^{s\sum _{i=1}^n Y_i}]= \{ g(s)\}^n\). Differentiating (2.11) with respect to t and putting \(t=0\) gives

$$\begin{aligned} \textrm{E}\Big [\sum _{i=1}^n (Y_i-{{\overline{Y}}})^2 e^{s\sum _{i=1}^n Y_i}\Big ] = \{ g(s)\}^n \textrm{E}\Big [ \sum _{i=1}^n (Y_i-{{\overline{Y}}})^2\Big ] = (n-1){\sigma }^2\{ g(s)\}^n. \end{aligned}$$
(2.12)

Noting that

$$\begin{aligned} \sum _{i=1}^n (Y_i-{{\overline{Y}}})^2=\sum _{i=1}^n Y_i^2- n{{\overline{Y}}}^2 = {n-1\over n}\sum _{i=1}^n Y_i^2 - {1\over n}\sum _{i=1}^n \sum _{j=1, j\not = i}^n Y_i Y_j, \end{aligned}$$

we can express (2.12) as

$$\begin{aligned} \textrm{E}\Bigg [ \Big \{ {n-1\over n}\sum _{i=1}^n Y_i^2 - {1\over n}\sum _{i=1}^n \sum _{j=1, j\not = i}^n Y_i Y_j\Big \} e^{s\sum _{k=1}^n Y_k}\Bigg ] = (n-1){\sigma }^2\{ g(s)\}^n. \end{aligned}$$
(2.13)

Since \(Y_1, \ldots , Y_n\) are independently and identically distributed, the terms in LHS of (2.13) are evaluated as

$$\begin{aligned} \textrm{E}\Bigg [\sum _{i=1}^n Y_i^2 e^{s\sum _{k=1}^n Y_k}\Bigg ]&= n \textrm{E}\Big [ Y_1^2 e^{s\sum _{k=1}^n Y_k}\Big ] = n \textrm{E}[Y_1^2 e^{sY_1}] \{ g(s)\}^{n-1}\\&= n g''(s) \{ g(s)\}^{n-1},\\ \textrm{E}\Bigg [\sum _{i=1}^n \sum _{j=1, j\not = i}^n Y_i Y_j e^{s\sum _{k=1}^n Y_k}\Bigg ]&= n(n-1)\textrm{E}\Big [ Y_1 Y_2 e^{s(Y_1+Y_2)}\Big ]\{ g(s)\}^{n-2} \\&= n(n-1)\{g'(s)\}^2\{ g(s)\}^{n-2}. \end{aligned}$$

Substituting these quantities into (2.13) yields

$$\begin{aligned} g''(s)g(s)-\{g'(s)\}^2={\sigma }^2 \{g(s)\}^2\quad \text {or}\quad {\textrm{d}^2 \over \textrm{d}s^2} \log g(s) ={\sigma }^2. \end{aligned}$$
(2.14)

Thus, from Theorem 2.1(d), it follows that \(X_i-\mu \sim \mathcal{N}(0, {\sigma }^2)\), and one gets (a). For the proof from (f) to (a), see Ruben (1974). \(\square \)
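
As an informal numerical companion to condition (e), the sketch below (Python/NumPy assumed; the sample size, number of replications and seed are arbitrary) estimates the correlation between \({{\overline{X}}}\) and \(S^2\) over repeated samples. Zero correlation is of course only a weak proxy for independence, but it is enough to see the contrast between normal and skewed data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 20, 20000

def corr_mean_var(sampler):
    """Correlation between Xbar and S^2 = n^{-1} sum (X_i - Xbar)^2 over repeated samples."""
    x = sampler((reps, n))
    return np.corrcoef(x.mean(axis=1), x.var(axis=1))[0, 1]   # var() divides by n, as S^2 does

print(corr_mean_var(lambda size: rng.normal(0.0, 1.0, size)))   # approximately 0
print(corr_mean_var(lambda size: rng.exponential(1.0, size)))   # clearly positive
```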

3 Applications to shrinkage estimation and goodness-of-fit test

We now provide two applications of the Stein identity to shrinkage estimation and goodness-of-fit tests for normality.

3.1 Shrinkage estimation

The Stein identity is very powerful for deriving unbiased risk estimators of shrinkage estimators. Let \(X_1, \ldots , X_p\) be independent random variables such that \(X_i \sim \mathcal{N}({\theta }_i, {\sigma }^2)\), \(i=1, \ldots , p\). Consider the problem of estimating \({{\varvec{\theta }}}=({\theta }_1, \ldots , {\theta }_p)^\top \) simultaneously for known \({\sigma }^2\). When estimator \({{\widehat{{{\varvec{\theta }}}}}}\) is evaluated with the risk function relative to the quadratic loss \(\Vert {{\widehat{{{\varvec{\theta }}}}}}-{{\varvec{\theta }}}\Vert ^2/{\sigma }^2=({{\widehat{{{\varvec{\theta }}}}}}-{{\varvec{\theta }}})^\top ({{\widehat{{{\varvec{\theta }}}}}}-{{\varvec{\theta }}})/{\sigma }^2\), Stein (1956) established the inadmissibility of \({{\varvec{X}}}=(X_1, \ldots , X_p)^\top \) in the case of \(p\ge 3\), and James and Stein (1961) suggested the shrinkage estimator

$$\begin{aligned} {{\widehat{{{\varvec{\theta }}}}}}^{\textrm{JS}}={{\varvec{X}}}- {(p-2){\sigma }^2\over \Vert {{\varvec{X}}}\Vert ^2}{{\varvec{X}}}. \end{aligned}$$

The improvement over \({{\varvec{X}}}\) was originally proved using somewhat complicated properties of the noncentral chi-square distribution. Stein (1973) provided a new technique based on the Stein identity for the proof. Because it requires only a simple integration by parts, the Stein identity has enabled many innovative results and major contributions to this research area. The Stein identity was extended to identities for the chi-square and Wishart distributions, and those identities were unified by Konno (2009), which enables us to handle high-dimensional cases. For some developments and extensions, see Berger (1985), Brandwein and Strawderman (1990), Fourdrinier et al. (2018), Ghosh et al. (2020), Tsukuma and Kubokawa (2020), Maruyama et al. (2023) and the references therein.

Theorem 3.1

Let \({{\varvec{h}}}({{\varvec{X}}})=(h_1({{\varvec{X}}}), \ldots , h_p({{\varvec{X}}}))^\top \), where \(h_i({{\varvec{X}}})\) is differentiable and satisfies \(\textrm{E}[|(X_i-{\theta }_i)h_i({{\varvec{X}}})|]<\infty \) and \(\textrm{E}[|(\partial /\partial X_i)h_i({{\varvec{X}}})|]<\infty \). Then, the shrinkage estimator \({{\widehat{{{\varvec{\theta }}}}}}_{{\varvec{h}}}={{\varvec{X}}}-{{\varvec{h}}}({{\varvec{X}}})\) has the unbiased risk estimator

$$\begin{aligned} \widehat{ R({{\varvec{\theta }}}, {{\widehat{{{\varvec{\theta }}}}}}_{{\varvec{h}}})}= p-2{{\varvec{\nabla }}}^\top {{\varvec{h}}}({{\varvec{X}}})+\{{{\varvec{h}}}({{\varvec{X}}})\}^\top {{\varvec{h}}}({{\varvec{X}}})/{\sigma }^2 \end{aligned}$$
(3.1)

where \({{\varvec{\nabla }}}=(\partial /\partial X_1, \ldots , \partial /\partial X_p)^\top \). In particular, \({{\widehat{{{\varvec{\theta }}}}}}_\phi ={{\varvec{X}}}-W^{-1}\phi (W){{\varvec{X}}}\) for \(W=\Vert {{\varvec{X}}}\Vert ^2/{\sigma }^2\) has the unbiased risk estimator

$$\begin{aligned} \widehat{ R({{\varvec{\theta }}}, {{\widehat{{{\varvec{\theta }}}}}}_\phi )}= p + { \{\phi (W)- (p-2)\}^2-(p-2)^2\over W} - 4 \phi '(W). \end{aligned}$$
(3.2)

Proof

The risk function of \({{\widehat{{{\varvec{\theta }}}}}}_{{\varvec{h}}}\) is \(R({{\varvec{\theta }}}, {{\widehat{{{\varvec{\theta }}}}}}_{{\varvec{h}}})=p-2\textrm{E}[({{\varvec{X}}}-{{\varvec{\theta }}})^\top {{\varvec{h}}}({{\varvec{X}}})]/{\sigma }^2+\textrm{E}[\{{{\varvec{h}}}({{\varvec{X}}})\}^\top {{\varvec{h}}}({{\varvec{X}}})]/{\sigma }^2\), and the Stein identity gives

$$\begin{aligned} \textrm{E}[({{\varvec{X}}}-{{\varvec{\theta }}})^\top {{\varvec{h}}}({{\varvec{X}}})]=\sum _{i=1}^p\textrm{E}[(X_i-{\theta }_i)h_i({{\varvec{X}}})]=\sum _{i=1}^p \textrm{E}\Big [{\sigma }^2 {\partial \over \partial X_i}h_i({{\varvec{X}}})\Big ] ={\sigma }^2\textrm{E}[ {{\varvec{\nabla }}}^\top {{\varvec{h}}}({{\varvec{X}}})]. \end{aligned}$$

This provides the unbiased estimator of the risk function given in (3.1). Equation (3.2) can be derived from (3.1). \(\square \)

From (3.1) or (3.2), we can derive conditions on \({{\varvec{h}}}(\cdot )\) or \(\phi (\cdot )\) for improvement over \({{\varvec{X}}}\). For example, Baranchik’s (1970) condition is (a) \(\phi (w)\) is nondecreasing and (b) \(0\le \phi (w)\le 2(p-2)\).

The James–Stein estimator \({{\widehat{{{\varvec{\theta }}}}}}^{\textrm{JS}}\) corresponds to the case of \(\phi (w)=p-2\), and the unbiased risk estimator suggests the equality

$$\begin{aligned} \textrm{E}[\Vert {{\varvec{X}}}-{{\varvec{\theta }}}\Vert ^2] = \textrm{E}[\Vert {{\widehat{{{\varvec{\theta }}}}}}^{\textrm{JS}}-{{\varvec{\theta }}}\Vert ^2] + \textrm{E}[\Vert {{\varvec{X}}}- {{\widehat{{{\varvec{\theta }}}}}}^{\textrm{JS}}\Vert ^2], \end{aligned}$$

which is interpreted as the Pythagorean triangle among \({{\varvec{X}}}\), \({{\varvec{\theta }}}\) and \({{\widehat{{{\varvec{\theta }}}}}}^{\textrm{JS}}\). Kubokawa (1994) constructed a class of estimators improving on the James–Stein estimator. See also Kubokawa (1991).
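
Before turning to such improvements, it may help to see Theorem 3.1 at work numerically. The following sketch (Python/NumPy assumed; the mean vector, replication number and seed are arbitrary choices) compares the simulated risk of the James–Stein estimator with the average of the unbiased risk estimate (3.2) evaluated at \(\phi (w)=p-2\); the two agree up to Monte Carlo error and both lie below the risk p of \({{\varvec{X}}}\).

```python
import numpy as np

rng = np.random.default_rng(2)
p, sigma2, reps = 6, 1.0, 200000
theta = np.full(p, 1.0)                                  # an arbitrary mean vector

x = rng.normal(theta, np.sqrt(sigma2), size=(reps, p))
w = (x**2).sum(axis=1) / sigma2                          # W = ||X||^2 / sigma^2
js = (1.0 - (p - 2) / w)[:, None] * x                    # James-Stein estimator

risk_mc = ((js - theta)**2).sum(axis=1).mean() / sigma2  # simulated risk
risk_ure = np.mean(p - (p - 2)**2 / w)                   # average of (3.2) with phi(w) = p - 2
print(risk_mc, risk_ure)                                 # close to each other and below p = 6
```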

Theorem 3.2

The estimator \({{\widehat{{{\varvec{\theta }}}}}}_\phi ={{\varvec{X}}}-W^{-1}\phi (W){{\varvec{X}}}\) improves on the James–Stein estimator if (a) \(\phi (w)\) is nondecreasing, and (b) \(\lim _{w\rightarrow \infty }\phi (w)=p-2\) and

$$\begin{aligned} \phi (w) \ge \phi _0(w)\equiv {\int _0^w y^{p/2-1}e^{-y/2}\textrm{d}y \over \int _0^w y^{p/2-2}e^{-y/2}\textrm{d}y}. \end{aligned}$$
(3.3)

Proof

The risk difference is \({\Delta }=R({{\varvec{\theta }}}, {{\widehat{{{\varvec{\theta }}}}}}^{\textrm{JS}})-R({{\varvec{\theta }}}, {{\widehat{{{\varvec{\theta }}}}}}_\phi )=-\textrm{E}[ \{\phi (W)- (p-2)\}^2/W] + 4 \textrm{E}[\phi '(W)]\), and from condition (b), it is noted that

$$\begin{aligned} -\{\phi (W)- (p-2)\}^2&= \Big [ \{\phi (tW)-(p-2)\}^2\Big ]_{t=1}^\infty =\int _1^\infty \Big \{{\textrm{d}\over \textrm{d}t}\{\phi (tW)- (p-2)\}^2\Big \} \textrm{d}t\\&= 2 W \int _1^\infty \{\phi (tW)- (p-2)\}\phi '(tW) \textrm{d}t, \end{aligned}$$

so that, after making the transformation, the first term is written as

$$\begin{aligned}{} & {} - \textrm{E}\Big [ { \{\phi (W)- (p-2)\}^2/ W}\Big ] \nonumber \\{} & {} \quad = 2 \int _0^\infty \int _1^\infty \{\phi (tw)- (p-2)\}\phi '(tw) \textrm{d}t f_p(w, {\lambda })\textrm{d}w, \end{aligned}$$
(3.4)

where \(f_p(w, {\lambda })\) denotes the density function of noncentral chi-square distribution with p degrees of freedom and noncentrality \({\lambda }=\Vert {{\varvec{\theta }}}\Vert ^2/{\sigma }^2\). Thus,

$$\begin{aligned} {\Delta }= 2 \int _0^\infty \phi '(w)\Big [ \{\phi (w)- (p-2)\} \int _0^w {1\over y}f_p(y, {\lambda }) \textrm{d}y + 2 f_p(w, {\lambda }) \Big ]\textrm{d}w. \end{aligned}$$

Since \(\phi '(w)\ge 0\) from condition (a), we have \({\Delta }\ge 0\) if \(\phi (w)\) satisfies \(\phi (w)\ge \phi _{\lambda }(w)\), where

$$\begin{aligned} \phi _{\lambda }(w) = p-2 - {2f_p(w,{\lambda })\over \int _0^w y^{-1}f_p(y,{\lambda })\textrm{d}y}. \end{aligned}$$

We here show that \(\phi _0(w)\ge \phi _{\lambda }(w)\), which is written as

$$\begin{aligned} {2f_p(w,{\lambda })\over \int _0^w y^{-1}f_p(y,{\lambda })\textrm{d}y}\ge {2f_p(w,0)\over \int _0^w y^{-1}f_p(y,0)\textrm{d}y}, \end{aligned}$$

or

$$\begin{aligned} \int _0^w{f_p(y,0)f_p(w,0)\over y}\Big \{ {f_p(w,{\lambda })\over f_p(w,0)}-{f_p(y,{\lambda })\over f_p(y,0)}\Big \}\textrm{d}y\ge 0. \end{aligned}$$

Since the noncentral chi-square distribution can be expressed as a mixture of Poisson and central chi-square distributions, it is noted that

$$\begin{aligned} f_p(y,{\lambda })=\sum _{k=0}^\infty P_{\lambda }(k) {1\over {\Gamma }(p/2+k)2^{p/2+k}}y^{p/2+k-1}e^{-y/2} \end{aligned}$$

for \(P_{\lambda }(k)=({\lambda }/2)^k e^{-{\lambda }/2}/k!\). Hence,

$$\begin{aligned} {f_p(y,{\lambda })\over f_p(y,0)}=\sum _{k=0}^\infty P_{\lambda }(k) {{\Gamma }(p/2)\over {\Gamma }(p/2+k)}{1\over 2^k}y^k \end{aligned}$$

is increasing in y, so that for \(w>y\),

$$\begin{aligned} {f_p(w,{\lambda })\over f_p(w,0)}-{f_p(y,{\lambda })\over f_p(y,0)}\ge 0, \end{aligned}$$

which implies that \(\phi _0(w)\ge \phi _{\lambda }(w)\). Using integration by parts, we can see that

$$\begin{aligned} \phi _0(w)=p-2 - {2 w^{p/2-1}e^{-w/2}\over \int _0^w y^{p/2-2}e^{-y/2}\textrm{d}y} ={\int _0^w y^{p/2-1}e^{-y/2}\textrm{d}y\over \int _0^w y^{p/2-2}e^{-y/2}\textrm{d}y}, \end{aligned}$$

which is given in (3.3). Hence, it is proved that \({\Delta }\ge 0\) under the condition \(\phi (w)\ge \phi _0(w)\). \(\square \)

It is interesting to note that the estimator \({{\widehat{{{\varvec{\theta }}}}}}^{\textrm{GB}}={{\widehat{{{\varvec{\theta }}}}}}_{\phi _0}\) with \(\phi _0(W)\) is the generalized Bayes estimator against the prior distribution \(\pi ({{\varvec{\theta }}})=\Vert {{\varvec{\theta }}}\Vert ^{2-p}\). Since \(\phi _0(w)\) satisfies the above conditions (a) and (b), the generalized Bayes estimator \({{\widehat{{{\varvec{\theta }}}}}}_{\phi _0}\) improves on the James–Stein estimator. It is also interesting to note that the prior distribution \(\pi ({{\varvec{\theta }}})=\Vert {{\varvec{\theta }}}\Vert ^{2-p}\) is a harmonic function, namely \({{\varvec{\nabla }}}^\top {{\varvec{\nabla }}}\pi ({{\varvec{\theta }}})=\sum _{i=1}^p({\partial ^2/\partial {\theta }_i^2})\Vert {{\varvec{\theta }}}\Vert ^{2-p}=0\). The positive-part Stein estimator \({{\widehat{{{\varvec{\theta }}}}}}^{\mathrm{S+}}=\max \{1-(p-2)/W, 0\}{{\varvec{X}}}\) also satisfies the conditions (a) and (b).

The risk performances of the shrinkage estimators \({{\widehat{{{\varvec{\theta }}}}}}^{\textrm{JS}}\), \({{\widehat{{{\varvec{\theta }}}}}}^{\mathrm{S+}}\) and \({{\widehat{{{\varvec{\theta }}}}}}^{\textrm{GB}}\), denoted by JS, PS and GB, respectively, are investigated by simulation with 10,000 replications, and the average values of the risks are reported in Table 1 for \(p=6\), \({\sigma }^2=1\) and \({{\varvec{\theta }}}=(k/3){{\varvec{I}}}\), \(k=0, \ldots , 9\). As distributions of the \(X_i\)'s, we treat the normal, double exponential and t-distributions with 5 degrees of freedom. From Table 1, it is seen that the risk improvement of the three shrinkage estimators is robust under the t- and double exponential distributions.

Table 1 Risks of three shrinkage estimators for \(p=6\) and \({{\varvec{\theta }}}=(k/3){{\varvec{I}}}\)
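
A minimal sketch of the kind of simulation summarized in Table 1 is given below for the normal case. It is written in Python with NumPy and SciPy rather than the Ox code used for the table, and with an arbitrary seed, so the numbers will only agree with Table 1 up to simulation error. The function \(\phi _0\) in (3.3) is evaluated through the regularized incomplete gamma function, using \(\int _0^w y^{a-1}e^{-y/2}\textrm{d}y=2^a{\Gamma }(a)P(a,w/2)\).

```python
import numpy as np
from scipy.special import gammainc           # regularized lower incomplete gamma P(a, x)

rng = np.random.default_rng(3)
p, sigma2, reps = 6, 1.0, 10000

def phi0(w):
    """phi_0(w) of (3.3): equal to (p-2) P(p/2, w/2) / P(p/2-1, w/2)."""
    return (p - 2) * gammainc(p / 2, w / 2) / gammainc(p / 2 - 1, w / 2)

def risks(k):
    theta = np.full(p, k / 3)
    x = rng.normal(theta, np.sqrt(sigma2), size=(reps, p))
    w = (x**2).sum(axis=1) / sigma2
    estimators = {
        "JS": (1 - (p - 2) / w)[:, None] * x,
        "PS": np.maximum(1 - (p - 2) / w, 0)[:, None] * x,
        "GB": (1 - phi0(w) / w)[:, None] * x,
    }
    return {name: ((e - theta)**2).sum(axis=1).mean() / sigma2
            for name, e in estimators.items()}

for k in range(10):
    print(k, risks(k))        # under normality all simulated risks stay below p = 6
```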

3.2 Goodness-of-fit tests for normality

Goodness-of-fit tests for normality have been studied in many articles. For references and explanations including omnibus test procedures, see Madansky (1988) and Thode (2002). The idea of using the Stein identity for testing normality is interesting and reasonable, because the Stein identity characterizes normal distributions. Henze and Visagie (2020) and Betsch and Ebner (2020) constructed test statistics based on the Stein identity.

Let \(X_1, \ldots , X_n\) be a random sample from a population with distribution function \(F(\cdot )\), where the mean and variance are denoted by \(\textrm{E}[X_i]=\mu \) and \(\textrm{Var}(X_i)={\sigma }^2\). The problem is to test the normality of the underlying distribution under the null hypothesis \(H_0: F=\mathcal{N}(\mu , {\sigma }^2)\). From Theorem 2.1, the characterization of a normal distribution of random variable X is

$$\begin{aligned} \textrm{E}\Big [ \Big ({X-\mu \over {\sigma }}-t\Big )e^{t(X-\mu )/{\sigma }}\Big ]=0, \end{aligned}$$

and the sample counterpart of the LHS is expressed by \(w_t/\sqrt{n}\), where

$$\begin{aligned} w_t={1\over \sqrt{n}}\sum _{i=1}^n\Big (Y_i-t\Big )e^{t Y_i}, \end{aligned}$$

for \(Y_i=(X_i-{{\overline{X}}})/S\) and \(S^2=n^{-1}\sum _{i=1}^n(X_i-{{\overline{X}}})^2\). It is noted that \(w_t\) is invariant under the transformation of location and scale. Then, the normality can be tested based on \(\textrm{ST}_t=|w_t|\). Since it depends on t, however, it is better to take a weighted \(L^2\) distance and integrate over t. Henze and Visagie (2020) considered the test statistic \(\int _{-\infty }^\infty w_t^2 K(t)\textrm{d}t\) for a weight function K(t) and suggested the use of \(K(t)=e^{-{\gamma }t^2}\) for positive \({\gamma }\). The resulting test statistic is

$$\begin{aligned} \textrm{HV}_{\gamma }= \int _{-\infty }^\infty w_t^2 e^{-{\gamma }t^2}\textrm{d}t = {\sqrt{\pi }\over n\sqrt{{\gamma }}}\sum _{i=1}^n\sum _{j=1}^n \Big (Y_iY_j - {A_{ij}^2\over 2{\gamma }}+{1\over 2{\gamma }}+{A_{ij}^2\over 4{\gamma }^2}\Big )e^{A_{ij}^2/(4{\gamma })}, \end{aligned}$$

where \(A_{ij}=Y_i+Y_j\). Taking \(K(t)=1\) for \(-c<t<c\) and otherwise \(K(t)=0\) with positive constant c, one gets another test statistic

$$\begin{aligned} \textrm{IST}_c&= \int _{-c}^c w_t^2 \textrm{d}t\\&= {1\over n}\sum _{i=1}^n\sum _{j=1}^n \Bigg \{ {Y_iY_j+c^2\over A_{ij}}(e^{cA_{ij}}-e^{-cA_{ij}})\\&\quad -\Big (1+{2\over A_{ij}}\Big )\Big ((c-A_{ij}^{-1})e^{cA_{ij}}+(c+A_{ij}^{-1})e^{-cA_{ij}}\Big )\Bigg \}. \end{aligned}$$

Based on \(w_t\), we also suggest the test statistic

$$\begin{aligned} \textrm{MST}_c = \sup _{-c<t<c} |w_t| \end{aligned}$$

for positive constant c.
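
As a sketch of how these statistics can be computed in practice, the code below (Python/NumPy assumed; the grid size and seed are arbitrary) evaluates \(w_t\) on a grid of t in \((-c,c)\) and approximates \(\textrm{IST}_c\) by a trapezoidal rule and \(\textrm{MST}_c\) by the maximum over the grid, rather than using the closed-form expression given above.

```python
import numpy as np

def stein_normality_stats(x, c=1.0, grid=2001):
    """Grid approximations of IST_c = int_{-c}^c w_t^2 dt and MST_c = sup_{|t|<c} |w_t|."""
    n = len(x)
    y = (x - x.mean()) / x.std()        # Y_i = (X_i - Xbar)/S with S^2 = n^{-1} sum (X_i - Xbar)^2
    t = np.linspace(-c, c, grid)
    w = ((y[None, :] - t[:, None]) * np.exp(t[:, None] * y[None, :])).sum(axis=1) / np.sqrt(n)
    ist = np.sum((w[:-1]**2 + w[1:]**2) / 2 * np.diff(t))   # trapezoidal rule for int w_t^2 dt
    mst = np.abs(w).max()
    return ist, mst

rng = np.random.default_rng(4)
print(stein_normality_stats(rng.normal(size=50)))        # small values under H_0
print(stein_normality_stats(rng.exponential(size=50)))   # typically much larger
```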

The following lemma is helpful for investigating asymptotic properties of these test statistics.

Lemma 3.1

Let \(Z_i=(X_i-\mu )/{\sigma }\) for \(i=1, \ldots , n\). For \(g(t)=\textrm{E}[e^{tZ_1}]\), let \(h_0(t)=g'(t)-tg(t)\), \(h_1(t)=tg'(t)+(1-t^2)g(t)\) and \(h_2(t)=tg''(t)+(1-t^2)g'(t)\). Assume that \(\textrm{E}[Z_1^2e^{t Z_1}]<\infty \) for t around zero. Then, \(w_t\) is approximated as \(w_t=W_n(t)+\sqrt{n}h_0(t)+o_p(1)\), where

$$\begin{aligned} W_n(t)={1\over \sqrt{n}}\sum _{i=1}^n\Big \{ (Z_i-t)e^{tZ_i}-h_0(t)-Z_ih_1(t)-{1\over 2}(Z_i^2-1)h_2(t)\Big \}. \end{aligned}$$
(3.5)

Proof

Let \({{\overline{Z}}}=n^{-1}\sum _{i=1}^n Z_i\). Note that \(\{1+(S/{\sigma }-1)\}^{-1}=1-(S/{\sigma }-1) +O_p(n^{-1})\) and \(S/{\sigma }-1=\sqrt{S^2/{\sigma }^2}-1=2^{-1}(S^2/{\sigma }^2-1)+O_p(n^{-1})\). Then, \((X_i-{{\overline{X}}})/S\) is approximated as

$$\begin{aligned} {X_i-{{\overline{X}}}\over S}&= {Z_i-{{\overline{Z}}}\over 1+(S/{\sigma }-1)} = (Z_i-{{\overline{Z}}})\Big \{ 1 - \Big ({S\over {\sigma }}-1\Big )+O_p(n^{-1})\Big \}\\&= (Z_i-{{\overline{Z}}})\Big \{ 1 - {1\over 2}\Big ({S^2\over {\sigma }^2}-1\Big )\Big \}+O_p(n^{-1})\\&= (Z_i-{{\overline{Z}}})\Big \{ 1-{1\over 2n}\sum _{j=1}^n(Z_j^2-1)\Big \}+O_p(n^{-1}). \end{aligned}$$

Using this approximation, we evaluate \(w_t\) as

$$\begin{aligned} w_t&= {1\over \sqrt{n}}\sum _{i=1}^n\Big ({X_i-{{\overline{X}}}\over S}-t\Big )e^{t(X_i-{{\overline{X}}})/S} \\&= {1\over \sqrt{n}}\sum _{i=1}^n\Big \{ Z_i-t-{{\overline{Z}}}-{Z_i-{{\overline{Z}}}\over 2n}\sum _{j=1}^n(Z_j^2-1)+O_p(n^{-1})\Big \}\\&\quad \times \exp \Big [t Z_i-t{{\overline{Z}}}-t{Z_i-{{\overline{Z}}}\over 2n}\sum _{j=1}^n(Z_j^2-1)+O_p(n^{-1})\Big ]. \end{aligned}$$

Since \(e^x=1+x + O(x^2)\) and \({{\overline{Z}}}=O_p(1/\sqrt{n})\), \(w_t\) is approximated as

$$\begin{aligned} w_t&= {1\over \sqrt{n}}\sum _{i=1}^n\Big \{ Z_i-t-{{\overline{Z}}}-{Z_i\over 2n}\sum _{j=1}^n(Z_j^2-1)+O_p(n^{-1})\Big \}\\&\quad \times \Big \{ 1 -t{{\overline{Z}}}-t{Z_i\over 2n}\sum _{j=1}^n(Z_j^2-1)+O_p(n^{-1})\Big \}e^{t Z_i}\\&= {1\over \sqrt{n}}\sum _{i=1}^n\Big [ Z_i-t-{{\overline{Z}}}-{Z_i\over 2n}\sum _{j=1}^n(Z_j^2-1)\\&\quad -t(Z_i-t)\Big \{{{\overline{Z}}}+{Z_i\over 2n}\sum _{j=1}^n(Z_j^2-1)\Big \}\Big ] e^{t Z_i}+o_p(1), \end{aligned}$$

which can be rewritten as

$$\begin{aligned} w_t&= {1\over \sqrt{n}}\sum _{i=1}^n(Z_i-t)e^{tZ_i} - {{\overline{Z}}}{1\over \sqrt{n}}\sum _{i=1}^n e^{tZ_i} -{1\over 2n}\sum _{j=1}^n(Z_j^2-1){1\over \sqrt{n}}\sum _{i=1}^n Z_i e^{tZ_i}\\&\quad -t{{\overline{Z}}}{1\over \sqrt{n}}\sum _{i=1}^n(Z_i-t)e^{tZ_i} - {t\over 2n}\sum _{j=1}^n(Z_j^2-1){1\over \sqrt{n}}\sum _{i=1}^nZ_i(Z_i-t)e^{tZ_i}+o_p(1). \end{aligned}$$

Each term can be evaluated as

$$\begin{aligned} n^{-1/2}\sum _{i=1}^n (Z_i-t)e^{tZ_i}&= \sqrt{n}\{n^{-1}\sum _{i=1}^n(Z_i-t)e^{tZ_i}-h_0(t)\}+\sqrt{n}h_0(t),\\ {{\overline{Z}}}n^{-1/2}\sum _{i=1}^n e^{tZ_i}&= \sqrt{n}{{\overline{Z}}}g(t)+o_p(1),\\ n^{-1}\sum _{j=1}^n(Z_j^2-1)n^{-1/2}\sum _{i=1}^n Z_i e^{tZ_i}&= n^{-1/2}\sum _{j=1}^n(Z_j^2-1)g'(t)+o_p(1),\\ {{\overline{Z}}}n^{-1/2}\sum _{i=1}^n(Z_i-t)e^{tZ_i}&= \sqrt{n}{{\overline{Z}}}\{g'(t)-tg(t)\}+o_p(1),\\ n^{-1}\sum _{j=1}^n(Z_j^2-1)n^{-1/2}\sum _{i=1}^nZ_i(Z_i-t)e^{tZ_i}&= n^{-1/2}\sum _{j=1}^n(Z_j^2-1)\{g''(t)-tg'(t)\}+o_p(1). \end{aligned}$$

Thus, it can be verified that \(w_t\) is approximated as \(w_t=W_n(t)+\sqrt{n}h_0(t)+o_p(1)\), which proves Lemma 3.1. \(\square \)

From (3.5), the central limit theorem shows that \(W_n(t)\) is asymptotically distributed as the normal distribution with mean zero and variance \(\textrm{Var}(W_n(t))\) under the assumption of \(\textrm{E}[Z_1^4]<\infty \), where the variance can be evaluated as

$$\begin{aligned} \textrm{Var}(W_n(t))&= \textrm{E}[(Z_1-t)^2e^{2tZ_1}]-h_0(t)^2+h_1(t)^2 \\&\quad + {1\over 4}h_2(t)^2(\textrm{E}[Z_1^4]-1) -2h_1(t)\textrm{E}[Z_1(Z_1-t)e^{tZ_1}]\\&\quad -h_2(t)\{\textrm{E}[Z_1^2(Z_1-t)e^{tZ_1}]-h_0(t)\}+h_1(t)h_2(t)\textrm{E}[Z_1^3]+o(1). \end{aligned}$$

Note that \(h_1(t)=g(t)+t h_0(t)\) and \(h_2(t)=2t g(t)+h_0(t)+t h_0'(t)\). Under the normality hypothesis \(H_0\), we have \(h_0(t)=0\), \(h_1(t)=g(t)\), \(h_2(t)=2tg(t)\) and \(\textrm{E}[(Z_1-t)^2e^{2tZ_1}]=(1+t^2)e^{2t^2}\). Thus, the asymptotic variance of \(w_t\) under the normality is \(\textrm{Var}(W_n(t)) =V(t)^2+o(1)\), where

$$\begin{aligned} V(t)^2 = (1+t^2)e^{2t^2}-(1+2t^2)e^{t^2}. \end{aligned}$$

Henze and Visagie (2020) showed the consistency of \(\textrm{HV}_{\gamma }\), namely \(\textrm{P}_F(\textrm{HV}_{\gamma }>d)\rightarrow 1\) as \(n\rightarrow \infty \) for positive constant d and non-normal distributions F. Using Lemma 3.1, we can verify the consistency of \(\textrm{IST}_c\) and \( \textrm{MST}_c\).

Theorem 3.3

Assume that \(\textrm{E}[Z_1^4]<\infty \). Then, the test statistics \(\textrm{IST}_c\) and \( \textrm{MST}_c\) are consistent.

Proof

Concerning the consistency of \(\textrm{MST}_c\), it can be observed that for all t in the interval \((-c, c)\),

$$\begin{aligned} \textrm{P}_F(\sup _{-c<t<c} |w_t|>d)&\ge \textrm{P}_F(|w_t|>d) =\textrm{P}_F(w_t<-d) + \textrm{P}_F(w_t>d)\\&= \textrm{P}_F\{W_n(t)< -\sqrt{n}h_0(t)-d+o_p(1)\}\\&\quad + \textrm{P}_F\{W_n(t)> -\sqrt{n}h_0(t)+d+o_p(1)\}. \end{aligned}$$

Note that \(W_n(t)\) converges in distribution to the normal distribution. When F is not a normal distribution, from Theorem 2.1, there is some \(t_0\) in \((-c, c)\) such that \(h_0(t_0)\not = 0\). Hence, \(\textrm{P}_F\{W_n(t_0)< -\sqrt{n}h_0(t_0)-d+o_p(1)\}\rightarrow 1\) when \(h_0(t_0)<0\), and \(\textrm{P}_F\{W_n(t_0)> -\sqrt{n}h_0(t_0)+d+o_p(1)\}\rightarrow 1\) when \(h_0(t_0)>0\). This shows the consistency of \(\textrm{MST}_c\).

For \(\textrm{IST}_c\), it is observed that for large n,

$$\begin{aligned} \textrm{P}_F\Big (\int _{-c}^c w_t^2 \textrm{d}t>d\Big )&= \textrm{P}_F\Big ( \int _{-c}^c W_n(t)^2\textrm{d}t \\&\quad + 2\sqrt{n}\int _{-c}^c W_n(t)h_0(t)\textrm{d}t +n\int _{-c}^c h_0(t)^2\textrm{d}t>d\Big )+o(1)\\&\ge \textrm{P}_F\Big ( - 2\sqrt{n}\Big |\int _{-c}^c W_n(t)h_0(t)\textrm{d}t\Big | +n\int _{-c}^c h_0(t)^2\textrm{d}t>d\Big )+o(1)\\&= \textrm{P}_F\Big ( \Big |\int _{-c}^c W_n(t)h_0(t)\textrm{d}t\Big | < {\sqrt{n}\over 2}\Big (\int _{-c}^c h_0(t)^2\textrm{d}t-d/n\Big )\Big )+o(1)\\&\ge 1 - {\textrm{E}\big [ \big \{\int _{-c}^c W_n(t)h_0(t)\textrm{d}t\big \}^2\big ] \over (n/4) (\int _{-c}^c h_0(t)^2\textrm{d}t-d/n)^2}+o(1). \end{aligned}$$

From (3.5), we have \(\int _{-c}^c W_n(t)h_0(t)\textrm{d}t=n^{-1/2}\sum _{i=1}^n Y_i^*\) for

$$\begin{aligned} Y_i^*= & {} \int _{-c}^c\{(Z_i-t)e^{tZ_i}-h_0(t)\} h_0(t)\textrm{d}t -Z_i\int _{-c}^c h_1(t)h_0(t)\textrm{d}t-{1\over 2}(Z_i^2-1)\\{} & {} \quad \int _{-c}^c h_2(t)h_0(t)\textrm{d}t. \end{aligned}$$

Since \(Y_1^*, \ldots , Y_n^*\) are independently and identically distributed with zero mean and a finite variance, it can be seen that \(\textrm{E}[\{\int _{-c}^c W_n(t)h_0(t)\textrm{d}t\}^2]\) converges to a positive constant. Thus, it is concluded that \(\textrm{P}_F\Big (\int _{-c}^c w_t^2 \textrm{d}t>d\Big )\rightarrow 1\). \(\square \)

We investigate the performances of powers of the test statistics \(\textrm{HV}_{\gamma }\) with \({\gamma }=3\) given in Henze and Visagie (2020), \(\textrm{IST}_c\) with \(c=1\) and \(\textrm{MST}_c\) with \(c=1\). We also treat the test statistic \(\textrm{ST}_t=w_t/V(t)\) with \(t=0.5\), which converges in distribution to \(\mathcal{N}(0,1)\) under \(H_0\). From the proof of Theorem 3.3, this test can be seen to be consistent in the sense that \(\textrm{P}_F(\textrm{ST}_{t_0}>d)\rightarrow 1\) for distributions with \(h_0(t_0)\not = 0\) for \(t_0=0.5\). As another competitor, we add the test statistic suggested by De Wet and Ventner (1972), who modified the Shapiro–Francia (1972) and the Shapiro–Wilk (1965) test statistics, as

$$\begin{aligned} \textrm{DWV} = {\sum _{i=1}^n a_i X_{(i)}\over \sqrt{\sum _{i=1}^n a_i^2} \sqrt{nS^2}}, \quad a_i=\Phi ^{-1}\Big ({i\over n+1}\Big ), \end{aligned}$$

where \(X_{(1)}\le \cdots \le X_{(n)}\) are the order statistics of the \(X_i\)'s. The idea of this test is simple but powerful. When the data are normally distributed, the points \((x_{(i)}, a_i)\) lie on or near a straight line, so that \(\textrm{DWV}\) is close to one. Thus, \(H_0\) is rejected when \(\textrm{DWV}\) is smaller than a critical value.
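
A direct transcription of \(\textrm{DWV}\) might look as follows (Python with SciPy assumed for \(\Phi ^{-1}\); the sample sizes and seed are arbitrary). Under normality the value is close to one, and small values lead to rejection.

```python
import numpy as np
from scipy.stats import norm

def dwv_statistic(x):
    """Correlation-type statistic of the order statistics against normal scores a_i."""
    n = len(x)
    a = norm.ppf(np.arange(1, n + 1) / (n + 1))       # a_i = Phi^{-1}(i/(n+1))
    s = np.sqrt(n * x.var())                          # sqrt(n S^2), S^2 = n^{-1} sum (X_i - Xbar)^2
    return np.dot(a, np.sort(x)) / (np.sqrt(np.sum(a**2)) * s)

rng = np.random.default_rng(5)
print(dwv_statistic(rng.normal(size=50)))        # close to 1 under normality
print(dwv_statistic(rng.exponential(size=50)))   # noticeably smaller
```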

The powers of the five test statistics \(\textrm{ST}_{0.5}\), \(\textrm{HV}_3\), \(\textrm{IST}_1\), \(\textrm{MST}_1\) and \(\textrm{DWV}\) are investigated by simulation, where their critical values are adjusted so that their type I errors are \({\alpha }=5\%\). We consider the three alternative distributions

$$\begin{aligned} M_1\ :\quad&X_i = (1-w) \mathcal{N}(0,1) + w Ex(1),\\ M_2\ :\quad&X_i = (1-w) \mathcal{N}(0,1) + w DE(0,2),\\ M_3\ :\quad&X_i = (1-w) \mathcal{N}(0,1) + w Ex(1)/2+w DE(0,0.5)/2, \end{aligned}$$

where \(\mathcal{N}(0,1)\), Ex(1) and \(DE(0,{\sigma })\) denote random variables having standard normal distribution \(\mathcal{N}(0,1)\), exponential distribution Ex(1) and double exponential distribution \(DE(0,{\sigma })\) with scale parameter \({\sigma }\), respectively.

The values of the powers for \(w=0.2, 0.5, 0.8, 1.0\) and \(n=50\) are obtained based on simulation data with 10,000 replications by using Ox (Doornik 2007) and reported in Table 2. The model \(M_1\) has skewness, and the tests \(\textrm{ST}_{0.5}\) and \(\textrm{DWV}\) are more powerful, but \(\textrm{MST}_1\) is less powerful. The model \(M_2\) has kurtosis, and the test \(\textrm{DWV}\) is more powerful, but \(\textrm{ST}_{0.5}\) is less powerful. In the model \(M_3\) with mixed distributions as an alternative, the test \(\textrm{DWV}\) is good, and the other four tests perform similarly. Through Table 2, it is seen that the performances of \(\textrm{HV}_3\) and \(\textrm{IST}_1\) are not bad, but such moment-based tests are less powerful than the quantile-based \(\textrm{DWV}\).

Table 2 Power of the six tests for \(n=50\) and \(w=0.2, 0.5, 0.8, 1.0\)

4 Stein’s methods for normal approximations

An important application of the Stein identity is the normal approximation. This approach is called Stein's method, and it has been studied in the literature including Ho and Chen (1978), Stein (1986), Goldstein and Reinert (1997), Shorack (2000), Barbour and Chen (2005), Diaconis and Holmes (2004), Chen and Shao (2005), Chen et al. (2011), Chen et al. (2013), Lehmann and Romano (2022) and the references therein. Of these, Chen et al. (2011) gives a good explanation of Stein's method. For instructive purposes, we here provide a simple introduction based on Chen et al. (2011).

4.1 A basic concept and a simple method

Let \(X_1, \ldots , X_n\) be independent random variables with \(\textrm{E}[X_i]=0\) and \(\textrm{Var}(X_i)=1\). Let

$$\begin{aligned} X={1\over \sqrt{n}}\sum _{i=1}^n X_i, \quad \xi _i={X_i\over \sqrt{n}}, \quad X^{(i)}=X-\xi _i \ \textrm{and}\ \mu _k={1\over n}\sum _{i=1}^n \textrm{E}[|X_i|^k] \ \textrm{for}\ k=1,2,3. \end{aligned}$$

Note that \(X^{(i)}\) is independent of \(\xi _i\). Let Y be a random variable having \(\mathcal{N}(0,1)\). Then, we want to show that for any real z,

$$\begin{aligned} |\textrm{P}(X\le z)-\Phi (z)|=|\textrm{E}[I(X\le z)]-\textrm{E}[I(Y\le z)]|\rightarrow 0 \end{aligned}$$

as \(n\rightarrow \infty \).

For any nonnegative function \(h(\cdot )\) satisfying \(\textrm{E}[h(X)]<\infty \), in general, the solution of the equation

$$\begin{aligned} h(x)-\textrm{E}[h(Y)] = f'_h(x)-xf_h(x) \end{aligned}$$
(4.1)

is given by

$$\begin{aligned} f_h(x)=\int _{-\infty }^x[ h(y)-\mu _h ]\phi (y)\textrm{d}y/\phi (x) =-\int _x^{\infty }[ h(y)-\mu _h ]\phi (y)\textrm{d}y/\phi (x), \end{aligned}$$
(4.2)

where \(\mu _h=\textrm{E}[h(Y)]\) and \(\phi (x)\) is the probability density function of \(\mathcal{N}(0,1)\). In particular, for \(h_z(x)=I(x\le z)-\Phi (z)\), the solution of the equation \(I(x\le z)-\Phi (z) = f_z'(x)-x f_z(x)\) is written as

$$\begin{aligned} f_z(x)&= \int _{-\infty }^x[ I(y\le z)-\Phi (z)]\phi (y)\textrm{d}y/\phi (x)\nonumber \\&= \left\{ \begin{array}{ll} \Phi (x){{\overline{\Phi }}}(z)/\phi (x),&{} x\le z,\\ \Phi (z){{\overline{\Phi }}}(x)/\phi (x),&{} x> z, \end{array} \right. \end{aligned}$$
(4.3)

where \({{\overline{\Phi }}}(x)=1-\Phi (x)\). Then, we get the equality

$$\begin{aligned} \textrm{E}[I(X\le z)-\Phi (z)] = \textrm{E}[f_z'(X)-X f_z(X)]. \end{aligned}$$
(4.4)
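
The following sketch (Python/SciPy assumed; the evaluation points and the smoothing step are arbitrary) evaluates \(f_z\) of (4.3) and verifies by numerical differentiation that it solves the Stein equation (4.1) with \(h=h_z\), away from the kink at \(x=z\).

```python
import numpy as np
from scipy.stats import norm

def f_z(x, z):
    """Solution (4.3) of the Stein equation for h_z(x) = I(x <= z) - Phi(z)."""
    x = np.asarray(x, dtype=float)
    return norm.cdf(np.minimum(x, z)) * norm.sf(np.maximum(x, z)) / norm.pdf(x)

z, eps = 0.5, 1e-6
for x in [-2.0, -0.3, 1.2, 3.0]:                              # points away from the kink at x = z
    deriv = (f_z(x + eps, z) - f_z(x - eps, z)) / (2 * eps)   # numerical f_z'(x)
    lhs = deriv - x * f_z(x, z)                               # f_z'(x) - x f_z(x)
    rhs = float(x <= z) - norm.cdf(z)                         # h_z(x)
    print(x, lhs, rhs)                                        # the two columns agree
```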

For the central limit theorem (CLT), it is sufficient to show that \(\lim _{n\rightarrow \infty }\textrm{E}[f_z'(X)-X f_z(X)]=0\). It is noted that the RHS of (4.4) is exactly zero from the Stein identity if \(X\sim \mathcal{N}(0,1)\). To this end, we prepare the following lemma.

Lemma 4.1

For any nonnegative function \(h(\cdot )\) satisfying \(\textrm{E}[h(X)]<\infty \), it holds that

$$\begin{aligned} \textrm{E}[f'_h(X)-Xf_h(X)]&= \sum _{i=1}^n \textrm{E}\Big [\xi _i^2\int _0^1\{f_h'(X^{(i)})-f_h'(X^{(i)}+u\xi _i)\}\textrm{d}u\Big ]\nonumber \\&\quad +{1\over n}\sum _{i=1}^n \textrm{E}[ f_h'(X^{(i)}+\xi _i)-f_h'(X^{(i)})]. \end{aligned}$$
(4.5)

Proof

The proof is from Chen et al. (2011). From the Taylor series expansion with integral remainder, it follows that

$$\begin{aligned} f_h(X^{(i)}+\xi _i)&= f_h(X^{(i)})+\int _{X^{(i)}}^{X^{(i)}+\xi _i}f_h'(t)\textrm{d}t\nonumber \\&= f_h(X^{(i)})+\xi _i \int _0^1 f_h'(X^{(i)}+u\xi _i)\textrm{d}u. \end{aligned}$$
(4.6)

We first write \(\textrm{E}[Xf_h(X)]=\sum _{i=1}^n \textrm{E}[\xi _i f_h(X^{(i)}+\xi _i)]\). From (4.6), \(\textrm{E}[\xi _i]=0\) and independence of \(X^{(i)}\) and \(\xi _i\), we observe that

$$\begin{aligned} \textrm{E}[\xi _i f_h(X^{(i)}+\xi _i)]&= \textrm{E}\Big [\xi _i f_h(X^{(i)})+\xi _i^2 \int _0^1 f_h'(X^{(i)}+u\xi _i)\textrm{d}u\Big ]\nonumber \\&= \textrm{E}\Big [\xi _i^2 \int _0^1 f_h'(X^{(i)}+u\xi _i)\textrm{d}u\Big ]. \end{aligned}$$
(4.7)

Since \(\textrm{E}[\xi _i^2f_h'(X^{(i)})]=\textrm{E}[\xi _i^2] \textrm{E}[f_h'(X^{(i)})]=n^{-1}\textrm{E}[f_h'(X^{(i)})]\), on the other hand, it can be seen that

$$\begin{aligned} \textrm{E}[f_h'(X)]&= \sum _{i=1}^n \textrm{E}[ n^{-1} f_h'(X^{(i)})]+ {1\over n} \sum _{i=1}^n\textrm{E}[f_h'(X)-f_h'(X^{(i)})]\nonumber \\&= \sum _{i=1}^n \textrm{E}[ \xi _i^2 \int _0^1 f_h'(X^{(i)})\textrm{d}u]+ {1\over n} \sum _{i=1}^n\textrm{E}[f_h'(X)-f_h'(X^{(i)})]. \end{aligned}$$
(4.8)

Combining (4.7) and (4.8) yields (4.5) in Lemma 4.1. \(\square \)

Hereafter, we consider the specific function \(h_{z,{\alpha }}(x)\), defined by

$$\begin{aligned} h_{z,{\alpha }}(x)=\left\{ \begin{array}{ll}1, &{} x\le z,\\ 1+(z-x)/{\alpha }, &{} z<x\le z+{\alpha },\\ 0,&{} x>z+{\alpha }, \end{array} \right. \end{aligned}$$

for positive constant \({\alpha }\). It is noted that \(h_{z,{\alpha }}(x)\) is absolutely continuous and bounded as \(|h_{z,{\alpha }}(x)|\le 1\). Let \(f_{z,{\alpha }}(x)\) be the function given in (4.2) for \(h(x)=h_{z,{\alpha }}(x)\).

Lemma 4.2

The function \(f_{z,{\alpha }}(x)\) satisfies that \(|f_{z,{\alpha }}(x)|\le \sqrt{\pi /2}\), \(|f_{z,{\alpha }}'(x)|\le 2\) and

$$\begin{aligned} |f_{z,{\alpha }}'(w+x)-f_{z,{\alpha }}'(w)| \le |x| (\sqrt{\pi /2}+2|w|)+d(w,x), \end{aligned}$$
(4.9)

where \(d(w, x)=|h_{z,{\alpha }}(w+x)-h_{z,{\alpha }}(w)|\).

Proof

For \(x>0\), from RHS of (4.2), it follows that

$$\begin{aligned} |f_{z,{\alpha }}(x)|\le \int _x^{\infty }| h_{z,{\alpha }}(y)-\mu _{h_{z,{\alpha }}}|\phi (y)\textrm{d}y/\phi (x) \le {1-\Phi (x)\over \phi (x)}. \end{aligned}$$

Since \(\{1-\Phi (x)\}/ \phi (x)\) is decreasing, we have \(\{1-\Phi (x)\}/ \phi (x)\le \{1-\Phi (0)\}/ \phi (0)=\sqrt{\pi /2}\). For \(x<0\), it follows from (4.2) that

$$\begin{aligned} |f_{z,{\alpha }}(x)|=\int _{-\infty }^x| h_{z,{\alpha }}(y)-\mu _{h_{z,{\alpha }}} |\phi (y)\textrm{d}y/\phi (x) \le {\Phi (x)\over \phi (x)}. \end{aligned}$$

Since \(\Phi (x)/ \phi (x)\) is increasing, we have \(\Phi (x)/\phi (x)\le \Phi (0)/\phi (0)=\sqrt{\pi /2}\). Thus, \(|f_{z,{\alpha }}(x)|\le \sqrt{\pi /2}\).

We next show that \(|f_{z,{\alpha }}'(x)|\le 2\). Note that \(f_{z,{\alpha }}'(x)=xf_{z,{\alpha }}(x)+h_{z,{\alpha }}(x)-\mu _{h_{z,{\alpha }}}\). For \(x>0\), it can be demonstrated that

$$\begin{aligned} {x\over x^2+1}< {1-\Phi (x)\over \phi (x)} < {1\over x}, \end{aligned}$$

which is called Mills’ ratio. Then from RHS of (4.2), it follows that

$$\begin{aligned} |f_{z,{\alpha }}'(x)|&\le x |f_{z,{\alpha }}(x)|+|h_{z,{\alpha }}(x)-\mu _{h_{z,{\alpha }}}| \le x\int _x^{\infty }\phi (y)\textrm{d}y/\phi (x) +1\\&= x {1-\Phi (x)\over \phi (x)}+1< 2. \end{aligned}$$

For \(x<0\), applying Mills’ ratio to \(-x>0\) implies that

$$\begin{aligned} {1-\Phi (-x)\over \phi (-x)}< -{1\over x}, \quad \textrm{or}\quad {\Phi (x)\over \phi (x)}< {1\over |x|}, \end{aligned}$$

namely we have \(|x|\Phi (x)/\phi (x)<1\). Then from (4.2), it follows that

$$\begin{aligned} |f_{z,{\alpha }}'(x)|\le |x|\int _{-\infty }^x \phi (y)\textrm{d}y/\phi (x) +1= |x| {\Phi (x)\over \phi (x)}+1< 2. \end{aligned}$$

Thus, the inequality \(|f_{z,{\alpha }}'(x)|\le 2\) is proved.

Finally, it is noted that \(f_{z,{\alpha }}'(w+x)-f_{z,{\alpha }}'(w)=(w+x)f_{z,{\alpha }}(w+x)-wf_{z,{\alpha }}(w)+h_{z,{\alpha }}(w+x)-h_{z,{\alpha }}(w)\). Since \(|f_{z,{\alpha }}'(x)|<2\), it can be observed that \(|f_{z,{\alpha }}(w+x)-f_{z,{\alpha }}(w)|<2|x|\). Since \(|f_{z,{\alpha }}(x)|<\sqrt{\pi /2}\), we have

$$\begin{aligned}&|f_{z,{\alpha }}'(w+x)-f_{z,{\alpha }}'(w)|\\&\quad \le |x||f_{z,{\alpha }}(w+x)| + |w||f_{z,{\alpha }}(w+x)-f_{z,{\alpha }}(w)|+|h_{z,{\alpha }}(w+x)-h_{z,{\alpha }}(w)|\\&\quad \le |x|\sqrt{\pi /2} +2 |w||x|+d(w,x), \end{aligned}$$

which shows (4.9). \(\square \)

We now show the central limit theorem using Lemmas 4.1 and 4.2. It is first noted that for \(Y\sim \mathcal{N}(0,1)\),

$$\begin{aligned} |\textrm{P}(X\le z)-\Phi (z)|= & {} |\textrm{E}[h_{z,{\alpha }}(X)]-\textrm{E}[h_{z,{\alpha }}(Y)] + \textrm{E}[h_{z,{\alpha }}(Y)]-\Phi (z)| \\\le & {} {\Delta }+ |\Phi (z+{\alpha })-\Phi (z)|, \end{aligned}$$

where \({\Delta }= |\textrm{E}[h_{z,{\alpha }}(X)]-\textrm{E}[h_{z,{\alpha }}(Y)]|\). Since \( |\Phi (z+{\alpha })-\Phi (z)|<{\alpha }/\sqrt{2\pi }\), we have

$$\begin{aligned} |\textrm{P}(X\le z)-\Phi (z)|\le {\Delta }+ {\alpha }/\sqrt{2\pi }. \end{aligned}$$
(4.10)

From (4.1), (4.2) and Lemma 4.1, it follows that

$$\begin{aligned} {\Delta }&= \sum _{i=1}^n \textrm{E}\Big [\xi _i^2\int _0^1\{f_{z,{\alpha }}'(X^{(i)})-f_{z,{\alpha }}'(X^{(i)}+u\xi _i)\}\textrm{d}u\Big ]\nonumber \\&\quad +{1\over n}\sum _{i=1}^n \textrm{E}[ f_{z,{\alpha }}'(X^{(i)}+\xi _i)-f_{z,{\alpha }}'(X^{(i)})]. \end{aligned}$$
(4.11)

From (4.9) in Lemma 4.2, it can be seen that

$$\begin{aligned} |f_{z,{\alpha }}'(X^{(i)}+\xi _i)-f_{z,{\alpha }}'(X^{(i)})| \le |\xi _i|(\sqrt{\pi /2}+2|X^{(i)}|) + d(X^{(i)},\xi _i). \end{aligned}$$
(4.12)

Note that \(d(X^{(i)},\xi _i)\le |\xi _i|/{\alpha }\) and \((\textrm{E}[|X^{(i)}|])^2\le \textrm{E}[(X^{(i)})^2]=(n-1)/n<1\). Then,

$$\begin{aligned} {1\over n}\sum _{i=1}^n \textrm{E}[ f_{z,{\alpha }}'(X^{(i)}+\xi _i)-f_{z,{\alpha }}'(X^{(i)})]&\le {1\over n}\sum _{i=1}^n \textrm{E}[|\xi _i|](\sqrt{\pi /2}+2+1/{\alpha }) \\&= \Big \{{\sqrt{\pi /2}+2\over \sqrt{n}}+{1\over {\alpha }\sqrt{n}}\Big \} {\sum _{i=1}^n \textrm{E}[|X_i|]\over n}. \end{aligned}$$

Similarly,

$$\begin{aligned}&\sum _{i=1}^n \textrm{E}\Big [\xi _i^2\int _0^1\{f_{z,{\alpha }}'(X^{(i)})-f_{z,{\alpha }}'(X^{(i)}+u\xi _i)\}\textrm{d}u\Big ]\\&\qquad \quad \le \Big \{{\sqrt{\pi /2}+2\over \sqrt{n}}+{1\over {\alpha }\sqrt{n}}\Big \}{\sum _{i=1}^n \textrm{E}[|X_i|^3]\over n}. \end{aligned}$$

Combining (4.10), (4.11) and these observations gives the following theorem.

Theorem 4.1

For \(X=\sum _{i=1}^nX_i/\sqrt{n}\), it holds that

$$\begin{aligned} |\textrm{P}(X\le z)-\Phi (z)|\le \Big \{{\sqrt{\pi /2}+2\over \sqrt{n}}+{1\over {\alpha }\sqrt{n}}\Big \}(\mu _1+\mu _3)+ {{\alpha }\over \sqrt{2\pi }}. \end{aligned}$$
(4.13)

Assume that \(\sum _{i=1}^n \textrm{E}[|X_i|^3]/n\) converges to a positive constant. Let \({\alpha }=n^{-1/4}\). Then, \(|\textrm{P}(X\le z)-\Phi (z)|\rightarrow 0\) as \(n\rightarrow \infty \).

The inequality in (4.13) is improved if we use the following concentration inequality due to Chen et al. (2011).

Lemma 4.3

For any a and b satisfying \(a<b\),

$$\begin{aligned} \textrm{P}(a\le X^{(i)} \le b) \le \sqrt{2}(b-a) + {2(\sqrt{2}+1)\mu _3\over \sqrt{n}}. \end{aligned}$$
(4.14)

In the evaluation of \(\textrm{E}[d(X^{(i)},\xi _i)]\) in (4.12), we can see that for \(\xi _i>0\),

$$\begin{aligned} \textrm{E}[d(X^{(i)},\xi _i)|\xi _i]\le \textrm{E}[I(z-\xi _i \le X^{(i)}< z+{\alpha })|\xi _i] =\textrm{P}( z-\xi _i \le X^{(i)} < z+{\alpha }\mid \xi _i). \end{aligned}$$

From the inequality (4.14), it follows that for all \(\xi _i\),

$$\begin{aligned} \textrm{E}[d(X^{(i)},\xi _i)|\xi _i]\le \sqrt{2}(|\xi _i|+{\alpha }) + {2(\sqrt{2}+1)\mu _3\over \sqrt{n}}, \end{aligned}$$

which gives

$$\begin{aligned} \textrm{E}[|f_{z,{\alpha }}'(X^{(i)}+\xi _i)-f_{z,{\alpha }}'(X^{(i)})| |\xi _i] \le |\xi _i|(\sqrt{\pi /2}+2) + \sqrt{2}(|\xi _i|+{\alpha })+{2(\sqrt{2}+1)\mu _3\over \sqrt{n}}. \end{aligned}$$

Thus, the same arguments used above provide the inequality

$$\begin{aligned} |\textrm{P}(X\le z)-\Phi (z)|\le {C({\alpha })\over \sqrt{n}}, \end{aligned}$$
(4.15)

where

$$\begin{aligned} C({\alpha })=(\sqrt{\pi /2}+2+\sqrt{2})(\mu _1+\mu _3)+\{\sqrt{2n}{\alpha }+2(\sqrt{2}+1)\mu _3\}(1+\mu _2)+ {\alpha }\sqrt{n/(2\pi )}. \end{aligned}$$

When \({\alpha }=1/\sqrt{n}\), we have

$$\begin{aligned} C(1/\sqrt{n})= (\sqrt{\pi /2}+2+\sqrt{2})(\mu _1+\mu _3) + \{\sqrt{2}+2(\sqrt{2}+1)\mu _3\}(1+\mu _2)+ 1/\sqrt{2\pi }. \end{aligned}$$

The inequality provides a Berry–Esseen-type bound.
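
As a rough numerical check (Python/NumPy/SciPy assumed; the choice of centred exponential summands, the sample sizes and the seed are arbitrary), the sketch below compares the bound \(C(1/\sqrt{n})/\sqrt{n}\) of (4.15) with the empirical Kolmogorov distance between the distribution of a standardized sum and \(\mathcal{N}(0,1)\); the bound is conservative but of the correct order \(n^{-1/2}\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, reps = 50, 100000

big = rng.exponential(1.0, 10**6) - 1.0                       # the X_i distribution: centred Exp(1)
mu1, mu2, mu3 = (np.mean(np.abs(big)**k) for k in (1, 2, 3))  # estimated absolute moments mu_k

# C(1/sqrt(n)) as displayed above, i.e. C(alpha) with alpha = 1/sqrt(n)
c_val = ((np.sqrt(np.pi / 2) + 2 + np.sqrt(2)) * (mu1 + mu3)
         + (np.sqrt(2) + 2 * (np.sqrt(2) + 1) * mu3) * (1 + mu2)
         + 1 / np.sqrt(2 * np.pi))
bound = c_val / np.sqrt(n)

# empirical Kolmogorov distance of X = sum X_i / sqrt(n) from N(0,1)
x = np.sort((rng.exponential(1.0, size=(reps, n)) - 1.0).sum(axis=1) / np.sqrt(n))
grid = np.linspace(-4.0, 4.0, 801)
ecdf = np.searchsorted(x, grid, side="right") / reps
print(bound, np.max(np.abs(ecdf - norm.cdf(grid))))           # the bound clearly dominates
```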

4.2 A method based on the K-function

Another method based on the K-function is useful for evaluating \(\textrm{E}[f_h'(X)-Xf_h(X)]\) for \(f_h(\cdot )\) in (4.2). Define \(K_i(t)\) by

$$\begin{aligned} K_i(t)=\textrm{E}[\xi _i \{ I(0\le t\le \xi _i)-I(\xi _i\le t<0)\}]. \end{aligned}$$

It can be seen that \(K_i(t)\ge 0\) for \(t\in \Re \) and that

$$\begin{aligned} \int _{-\infty }^\infty K_i(t)\textrm{d}t=\textrm{E}[\xi _i^2]\quad \textrm{and}\quad \int _{-\infty }^\infty |t| K_i(t)\textrm{d}t={1\over 2}\textrm{E}[|\xi _i|^3]. \end{aligned}$$
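
These two integral identities can be checked numerically. The sketch below (Python/NumPy assumed; the centred exponential distribution, grid and seed are arbitrary) estimates \(K(t)\) on a grid for a centred variable playing the role of \(\xi _i\), using \(K(t)=\textrm{E}[\xi I(\xi \ge t)]\) for \(t\ge 0\) and \(K(t)=-\textrm{E}[\xi I(\xi \le t)]\) for \(t<0\).

```python
import numpy as np

rng = np.random.default_rng(8)
xi = rng.exponential(1.0, 10**5) - 1.0          # a centred variable playing the role of xi_i
t = np.linspace(-2.0, 12.0, 1401)
dt = t[1] - t[0]

# K(t) = E[ xi { I(0 <= t <= xi) - I(xi <= t < 0) } ], estimated pointwise on the grid
k = np.array([np.mean(xi * (xi >= s)) if s >= 0 else -np.mean(xi * (xi <= s)) for s in t])

print(np.sum(k) * dt, np.mean(xi**2))                          # both approximate E[xi^2]
print(np.sum(np.abs(t) * k) * dt, np.mean(np.abs(xi)**3) / 2)  # both approximate E|xi|^3 / 2
```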

Lemma 4.4

For \(f_h(\cdot )\) in (4.2), it holds that

$$\begin{aligned} \textrm{E}[f_h'(X)-Xf_h(X)] =\sum _{i=1}^n \int _{-\infty }^\infty \textrm{E}[ f_h'(X)-f_h'(X^{(i)}+t)]K_i(t)\textrm{d}t. \end{aligned}$$
(4.16)

Proof

Since \(\xi _i\) and \(X^{(i)}\) are independent and \(\textrm{E}[\xi _i]=0\), it is observed that \(\textrm{E}[X f_h(X)]=\sum _{i=1}^n\textrm{E}[\xi _i f_h(X)]=\sum _{i=1}^n\textrm{E}[\xi _i \{f_h(X)-f_h(X^{(i)})\}]\), which is rewritten as

$$\begin{aligned}&\sum _{i=1}^n\textrm{E}\Big [\xi _i\int _0^{\xi _i} f_h'(X^{(i)}+t)\textrm{d}t\Big ] \\&= \sum _{i=1}^n\textrm{E}\Big [\int _{-\infty }^\infty f_h'(X^{(i)}+t) \xi _i\{I(0\le t\le \xi _i)-I(\xi _i\le t<0)\} \textrm{d}t\Big ]\\&= \sum _{i=1}^n\textrm{E}\Big [\int _{-\infty }^\infty f_h'(X^{(i)}+t) K_i(t) \textrm{d}t\Big ]. \end{aligned}$$

Since \(\sum _{i=1}^n \int _{-\infty }^\infty K_i(t) \textrm{d}t=\sum _{i=1}^n\textrm{E}[\xi _i^2]=1\), we have \(\textrm{E}[f_h'(X)]=\sum _{i=1}^n \textrm{E}[\int _{-\infty }^\infty f_h'(X)K_i(t) \textrm{d}t]\). Combining these observations gives the expression in (4.16). \(\square \)

We treat \(h_z(x)=I(x\le z)-\Phi (z)\) and the function \(f_z(x)\) given in (4.3). From (4.1), it follows that

$$\begin{aligned} f_z'(X)-f_z'(X^{(i)}+t)= & {} \{Xf_z(X)-(X^{(i)}+t)f_z(X^{(i)}+t)\}\\{} & {} +I(z-\xi _i\vee t< X^{(i)}<z-\xi _i\wedge t), \end{aligned}$$

where \(a\vee b=\max (a,b)\) and \(a\wedge b=\min (a,b)\). Similarly to Lemma 4.2, it can be shown that

$$\begin{aligned}&(w+u)f_z(w+u)-(w+v)f_z(w+v)\\&\quad \le |w||f_z(w+u)-f_z(w+v)|\\&\qquad +uf_z(w+u)-vf_z(w+v)\\&\quad \le (|w|+\sqrt{2\pi }/4)(|u|+|v|), \end{aligned}$$

because \(|f_z(x)|\le \sqrt{2\pi }/4\). Thus,

$$\begin{aligned} \textrm{E}[Xf_z(X)-(X^{(i)}+t)f_z(X^{(i)}+t)|\xi _i] \le (1+\sqrt{2\pi }/4)(|\xi _i|+|t|). \end{aligned}$$

From Lemma 4.3, it follows that

$$\begin{aligned} \textrm{P}(z-\xi _i\vee t< X^{(i)}<z-\xi _i\wedge t\mid \xi _i) \le \sqrt{2}|\xi _i-t|+2(\sqrt{2}+1){\mu _3\over \sqrt{n}}. \end{aligned}$$

Hence from Lemma 4.4, we get

$$\begin{aligned}&\sum _{i=1}^n \int _{-\infty }^\infty \textrm{E}[ f_h'(X)-f_h'(X^{(i)}+t)]K_i(t)\textrm{d}t\\&\quad \le \sum _{i=1}^n\int _{-\infty }^\infty \Big \{ (1+\sqrt{2\pi }/4+\sqrt{2})(\textrm{E}[|\xi _i|]+|t|) +{2(\sqrt{2}+1)\mu _3\over \sqrt{n}}\Big \} K_i(t)\textrm{d}t, \end{aligned}$$

which yields the following bound.

Theorem 4.2

For \(X=\sum _{i=1}^nX_i/\sqrt{n}\), it holds that

$$\begin{aligned}{} & {} |\textrm{P}(X\le z)-\Phi (z)|\nonumber \\{} & {} \quad \le {(1+\sqrt{2}+\sqrt{2\pi }/4)(\mu _1\mu _2+\mu _3/2)+2(1+\sqrt{2})\mu _3 \over \sqrt{n}}. \end{aligned}$$
(4.17)

Chen and Shao (2001) derived a more refined concentration inequality and obtained the improved bound given by

$$\begin{aligned} |\textrm{P}(X\le z)-\Phi (z)|\le 4.1({\beta }_2+{\beta }_3), \end{aligned}$$
(4.18)

where \({\beta }_2=\sum _{i=1}^n\textrm{E}[\xi _i^2I(|\xi _i|>1)]\) and \({\beta }_3=\sum _{i=1}^n\textrm{E}[|\xi _i|^3I(|\xi _i|\le 1)]\). This corresponds to Lindeberg's condition. In fact, let \(X_1, \ldots , X_n\) be independent random variables with \(\textrm{E}[X_i]=0\) and \(\textrm{Var}(X_i)={\sigma }_i^2\). Let \(S_n=\sum _{i=1}^n X_i\) and \(B_n^2=\sum _{i=1}^n{\sigma }_i^2\). Then, \(\xi _i\) and X correspond to \(\xi _i=X_i/B_n\) and \(X=S_n/B_n\). It is observed that for any \({\varepsilon }>0\),

$$\begin{aligned} {\beta }_2+{\beta }_3&= {1\over B_n^2}\sum _{i=1}^n\textrm{E}[X_i^2I(|X_i|>B_n)]+{1\over B_n^3}\sum _{i=1}^n\textrm{E}[|X_i|^3I(|X_i|\le B_n)]\\&\le {1\over B_n^2}\sum _{i=1}^n\textrm{E}[X_i^2I(|X_i|>B_n)]+{1\over B_n^3}\sum _{i=1}^nB_n \textrm{E}[X_i^2I({\varepsilon }B_n\le |X_i|\le B_n)]\\&\quad +{1\over B_n^3}\sum _{i=1}^n{\varepsilon }B_n \textrm{E}[X_i^2I(|X_i|\le {\varepsilon }B_n)]\\&\le {1\over B_n^2}\sum _{i=1}^n\textrm{E}[X_i^2I(|X_i|>{\varepsilon }B_n)]+{\varepsilon }. \end{aligned}$$

Thus from (4.18), it follows that

$$\begin{aligned} |\textrm{P}(S_n/B_n\le z)-\Phi (z)|\le 4.1\Big \{ {1\over B_n^2}\sum _{i=1}^n\textrm{E}[X_i^2I(|X_i|>{\varepsilon }B_n)]+{\varepsilon }\Big \}, \end{aligned}$$

which converges to zero if the Lindeberg condition

$$\begin{aligned} \lim _{n\rightarrow \infty } {1\over B_n^2}\sum _{i=1}^n\textrm{E}[X_i^2I(|X_i|>{\varepsilon }B_n)]=0 \end{aligned}$$

is satisfied.

5 Stein-type identities in gamma and exponential distributions

5.1 Stein-type identities and characterization of gamma and exponential distributions

We treat the gamma distribution \(Ga({\alpha }, {\beta })\) with the density function

$$\begin{aligned} f(x|{\alpha },{\beta }) = {1\over {\Gamma }({\alpha }){\beta }^{\alpha }} x^{{\alpha }-1} e^{-x/{\beta }},\quad x>0, \end{aligned}$$

where \({\alpha }\) and \({\beta }\) are positive parameters. Hudson (1978) derived the Stein-type identity in the gamma distribution, and Betsch and Ebner (2019) showed that the identity characterizes the gamma distribution.

Theorem 5.1

Let X be a positive random variable with \(\textrm{E}[X]={\alpha }{\beta }\) and \(\textrm{Var}(X)={\alpha }{\beta }^2\). Then, the following four conditions are equivalent.

  1. (a)

    \(X\sim Ga({\alpha }, {\beta })\).

  2. (b)

    For any differentiable function \(h(\cdot )\) with \(\textrm{E}[|Xh(X)|]<\infty \) and \(\textrm{E}[|Xh'(X)|]<\infty \), it holds that

    $$\begin{aligned} \textrm{E}[(X-{\alpha }{\beta })h(X)]={\beta }\textrm{E}[Xh'(X)]. \end{aligned}$$
    (5.1)
  3. (c)

    For real constant t with \(t<1/{\beta }\), it holds that

    $$\begin{aligned} \textrm{E}[(X-{\alpha }{\beta })\exp \{tX\}]=t {\beta }\textrm{E}[X\exp \{tX\}]. \end{aligned}$$
    (5.2)
  4. (d)

    \(g(t)=\textrm{E}[\exp \{tX\}]\) satisfies the differential equation

    $$\begin{aligned} {\textrm{d}\over \textrm{d}t}\log \{g(t)\}= {{\alpha }{\beta }\over 1-t{\beta }}\quad \text {or}\quad {\textrm{d}^2\over \textrm{d}t^2}\log \{g(t)\}= {{\alpha }{\beta }^2 \over (1-t{\beta })^2}. \end{aligned}$$
    (5.3)

Proof

For the proof from (a) to (b), the identity (5.1) can be derived by integration by parts, because \({\beta }(d/dx)\{x f(x|{\alpha },{\beta })\} = - (x-{\alpha }{\beta })f(x|{\alpha },{\beta })\). We here provide another approach. Making the transformation \(y=x/{\beta }\) gives the expression

$$\begin{aligned} \textrm{E}[h(X)]=\int _0^\infty h(x){1\over {\Gamma }({\alpha }){\beta }^{\alpha }}x^{{\alpha }-1}e^{-x/{\beta }}\textrm{d}x = \int _0^\infty h({\beta }y){1\over {\Gamma }({\alpha })}y^{{\alpha }-1}e^{-y}\textrm{d}y. \end{aligned}$$

Differentiating both sides with respect to \({\beta }\), we have

$$\begin{aligned} \int _0^\infty h(x)\Big ({x\over {\beta }^2}-{{\alpha }\over {\beta }}\Big ){1\over {\Gamma }({\alpha }){\beta }^{\alpha }}x^{{\alpha }-1}e^{-x/{\beta }}\textrm{d}x&= \int _0^\infty y h'({\beta }y){1\over {\Gamma }({\alpha })}y^{{\alpha }-1}e^{-y}\textrm{d}y\\&= \int _0^\infty {x\over {\beta }}h'(x){1\over {\Gamma }({\alpha }){\beta }^{\alpha }}x^{{\alpha }-1}e^{-x/{\beta }}\textrm{d}x, \end{aligned}$$

which leads to (5.1). Clearly, one gets (c) from (b). For the proof from (c) to (d), the identity (5.2) is written as \(g'(t) -{\alpha }{\beta }g(t) = t{\beta }g'(t)\), which is expressed as (5.3). For the proof from (d) to (a), the solution of the differential equation in (5.3) subject to \(g(0)=1\) is \(\log g(t)=-{\alpha }\log (1-t{\beta })\), namely \(g(t) = (1-t{\beta })^{-{\alpha }}\), which implies that \(X \sim Ga({\alpha },{\beta })\). \(\square \)
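
As a quick numerical illustration (not part of the original proof), the following Python sketch checks the Stein-type identity (5.1) by Monte Carlo; the test function \(h(x)=\log (1+x)\) and the values of \({\alpha }\) and \({\beta }\) are arbitrary choices of ours.

```python
# Monte Carlo check of (5.1): E[(X - alpha*beta) h(X)] = beta * E[X h'(X)] for
# X ~ Ga(alpha, beta), with the arbitrary smooth test function h(x) = log(1 + x).
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.5, 1.3
X = rng.gamma(shape=alpha, scale=beta, size=10**6)

h = np.log1p(X)            # h(X)  = log(1 + X)
h_prime = 1.0 / (1.0 + X)  # h'(X) = 1 / (1 + X)

lhs = np.mean((X - alpha * beta) * h)
rhs = beta * np.mean(X * h_prime)
print(lhs, rhs)            # the two Monte Carlo averages should nearly coincide
```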

We here provide some conditions for characterizing gamma distributions. Condition (c) in Theorem 5.2 is due to Lukacs (1955). See also Kagan et al. (1973) and Kotz (1974).

Theorem 5.2

Assume that independent positive random variables \(X_1\) and \(X_2\) are identically distributed with \(\textrm{E}[X_i]={\alpha }{\beta }\) and \(\textrm{Var}(X_i)={\alpha }{\beta }^2\). Then, the following three conditions are equivalent.

  1. (a)

    \(X_i \sim Ga({\alpha }, {\beta })\) for \(i=1, 2\).

  2. (b)

    \(X_1+X_2\sim Ga(2{\alpha }, {\beta })\).

  3. (c)

    \(X_1+X_2\) and \(X_1/(X_1+X_2)\) are independent.

Proof

Since (b) and (c) clearly follow from (a), it suffices to prove the converse implications. For the proof from (b) to (a), the condition in (b) and Theorem 5.1(c) imply that

$$\begin{aligned} \textrm{E}[(X_1+X_2-2{\alpha }{\beta })\exp \{t(X_1+X_2)\}] = {\beta }t \textrm{E}[(X_1+X_2) \exp \{t(X_1+X_2)\}]. \end{aligned}$$
(5.4)

From the independence of \(X_1\) and \(X_2\), equality (5.4) is rewritten as

$$\begin{aligned} \textrm{E}[(X_1-{\alpha }{\beta })\exp \{tX_1\}] \textrm{E}[ \exp \{tX_2\}] = {\beta }t \textrm{E}[X_1 \exp \{tX_1\}]\textrm{E}[ \exp \{tX_2\}], \end{aligned}$$

which, from Theorem 5.1, shows that \(X_i\sim Ga({\alpha }, {\beta })\).

The proof from (c) to (a) follows along the lines of Lukacs (1955). From the independence of \(X_1+X_2\) and \(X_1/(X_1+X_2)\), it follows that

$$\begin{aligned} \textrm{E}\Big [\exp \Big \{ s(X_1+X_2)+t{X_1\over X_1+X_2}\Big \}\Big ] =\textrm{E}[\exp \{ s(X_1+X_2)\}] \textrm{E}\Big [\exp \Big \{ t{X_1\over X_1+X_2}\Big \}\Big ]. \end{aligned}$$
(5.5)

Differentiating both sides twice with respect to s and twice with respect to t, and then putting \(t=0\), we have

$$\begin{aligned} \textrm{E}\Big [X_1^2 \exp \Big \{ s(X_1+X_2)\Big \}\Big ] =\textrm{E}[(X_1+X_2)^2\exp \{ s(X_1+X_2)\}] \textrm{E}\Big [\Big ({X_1\over X_1+X_2}\Big )^2\Big ]. \end{aligned}$$
(5.6)

Let \(a=\textrm{E}[X_1^2/(X_1+X_2)^2]\) and \(g(s)=\textrm{E}[\exp \{sX_1\}]\). Then,

$$\begin{aligned} \textrm{E}[(X_1+X_2)^2\exp \{ s(X_1+X_2)\}]&= \textrm{E}[(X_1^2+X_2^2 +2X_1X_2)\exp \{ s(X_1+X_2)\}] \\&= 2g''(s)g(s)+2\{g'(s)\}^2, \end{aligned}$$

which rewrites (5.6) as

$$\begin{aligned} g''(s)g(s) = 2a [g''(s)g(s)+\{g'(s)\}^2]\quad \text {or}\quad (1-2a)g''(s)g(s)=2a\{g'(s)\}^2. \end{aligned}$$
(5.7)

Let \(\psi (s)=g'(s)/g(s)\), and we have \(\psi '(s)=g''(s)/g(s)-\{\psi (s)\}^2\). The equality (5.7) is expressed as

$$\begin{aligned} (1-2a) [ \psi '(s)+\{\psi (s)\}^2] = 2a\{\psi (s)\}^2 \quad \text {or}\quad (1-2a) \psi '(s) = (4a-1) \{\psi (s)\}^2. \end{aligned}$$

The solution of the differential equation is

$$\begin{aligned} - {1\over \psi (s)} = {4a-1\over 1-2a} s - c_0 \quad \text {or}\quad \psi (s) = {1\over c_0- b s} \end{aligned}$$

for \(b=(4a-1)/(1-2a)\) and constant \(c_0\). Since \(\psi (s)=(d/ds)\log g(s)\), we have

$$\begin{aligned} \log g(s) = -{1\over b} \log (c_0-bs) + \log c_1\quad \text {or}\quad g(s) = {c_1\over (c_0-bs)^{1/b}}. \end{aligned}$$

Since \(g(0)=1\), \(g'(0)={\alpha }{\beta }\) and \(g''(0)={\alpha }({\alpha }+1){\beta }^2\), the constants satisfy \(c_1/c_0^{1/b}=1\), \(c_1/c_0^{1+1/b}={\alpha }{\beta }\) and \(c_1(1+b)/c_0^{2+1/b}={\alpha }({\alpha }+1){\beta }^2\), which gives \(c_0=({\alpha }{\beta })^{-1}\), \(c_1=({\alpha }{\beta })^{-{\alpha }}\) and \(b=1/{\alpha }\). This yields \(g(s)=(1-{\beta }s)^{-{\alpha }}\), which means that \(X_i \sim Ga({\alpha },{\beta })\) for \(i=1, 2\). \(\square \)

Condition (b) can easily be extended to the case of a sample of size n. Such an extension of condition (c) was given by Khatri and Rao (1968).

The exponential distribution \(Ex({\lambda })\) corresponds to the case of \({\alpha }=1\) and \({\beta }=1/{\lambda }\). The characterization problem of the exponential distribution has been studied extensively in the literature. Among these results, Shanbhag (1970a) showed that the memoryless property characterizes the exponential distribution: \(X\sim Ex({\lambda })\) if and only if for any \(x, y>0\),

$$\begin{aligned} \textrm{P}(X>x+y| X>y)=\textrm{P}(X>x)\quad \text {or}\quad \textrm{P}(X>x+y)=\textrm{P}(X>x)\textrm{P}(X>y). \end{aligned}$$

In fact, \(\log \textrm{P}(X>x+y)=\log \textrm{P}(X>x)+\log \textrm{P}(X>y)\) implies that \(\log \textrm{P}(X>x)=- cx\) for \(c>0\), namely \(\textrm{P}(X>x)=\exp (- cx)\), which means that X has an exponential distribution. For independently and identically distributed positive random variables \(X_1\) and \(X_2\), Ferguson (1964) proved that \(X_i\) has an exponential distribution if and only if \(\min (X_1, X_2)\) is independent of \(X_1-X_2\).

5.2 Shrinkage estimation

The Stein identity is useful for obtaining improved shrinkage estimators in simultaneous estimation in gamma distributions. Let \(X_1, \ldots , X_p\) be independent random variables such that \(X_i\sim Ga({\alpha }_i, {\beta }_i)\), \(i=1, \ldots , p\). We first consider the simultaneous estimation of \({{\varvec{\alpha }}}=({\alpha }_1, \ldots , {\alpha }_p)^\top \) in the case of \({\beta }_1=\cdots ={\beta }_p=1\). When estimator \({{\widehat{{{\varvec{\alpha }}}}}}=({{\widehat{{\alpha }}}}_1, \ldots , {{\widehat{{\alpha }}}}_p)^\top \) is evaluated by the risk relative to the quadratic loss \(\sum _{i=1}^p({{\widehat{{\alpha }}}}_i-{\alpha }_i)^2\), Hudson (1978) suggested the shrinkage estimator

$$\begin{aligned} {{\widehat{{\alpha }}}}^{\textrm{S}}_i = X_i - {p-2 \over \sum _{j=1}^p(\log X_j)^2}\log X_i. \end{aligned}$$
(5.8)

Theorem 5.3

For \(p\ge 3\), the risk functions have the relation

$$\begin{aligned} \textrm{E}[\Vert {{\varvec{X}}}-{{\varvec{\alpha }}}\Vert ^2]=\textrm{E}[\Vert {{\widehat{{{\varvec{\alpha }}}}}}^{\textrm{S}}-{{\varvec{\alpha }}}\Vert ^2]+\textrm{E}[ \Vert {{\varvec{X}}}- {{\widehat{{{\varvec{\alpha }}}}}}^{\textrm{S}}\Vert ^2], \end{aligned}$$

which is interpreted as the Pythagorean triangle among \({{\varvec{X}}}\), \({{\varvec{\alpha }}}\) and \({{\widehat{{{\varvec{\alpha }}}}}}^{\textrm{S}}\).

Proof

The estimator in (5.8) has the risk

$$\begin{aligned} \textrm{E}[\Vert {{\widehat{{{\varvec{\alpha }}}}}}^{\textrm{S}}-{{\varvec{\alpha }}}\Vert ^2]&= \textrm{E}[\Vert {{\varvec{X}}}-{{\varvec{\alpha }}}\Vert ^2]-2\sum _{i=1}^p\textrm{E}\Big [(X_i-{\alpha }_i){p-2\over \sum _{j=1}^p(\log X_j)^2}\log X_i\Big ] \\&\quad + \textrm{E}\Big [ {(p-2)^2 \over \sum _{j=1}^p(\log X_j)^2}\Big ]. \end{aligned}$$

The Stein-type identity in (5.1) gives \(\textrm{E}[(X_i-{\alpha }_i)h(X_i)]=\textrm{E}[X_i h'(X_i)]\) for

$$\begin{aligned} h(X_i)={p-2 \over \sum _{j=1}^p(\log X_j)^2}\log X_i. \end{aligned}$$

Noting that

$$\begin{aligned} X_i {\partial h(X_i)\over \partial X_i}={p-2\over \sum _{j=1}^p(\log X_j)^2} - {2(p-2)(\log X_i)^2\over \{\sum _{j=1}^p(\log X_j)^2\}^2}, \end{aligned}$$

we can see that

$$\begin{aligned} \sum _{i=1}^p X_i {\partial h(X_i)\over \partial X_i}={(p-2)p\over \sum _{j=1}^p(\log X_j)^2} - {2(p-2)\sum _{i=1}^p(\log X_i)^2\over \{\sum _{j=1}^p(\log X_j)^2\}^2} ={(p-2)^2\over \sum _{j=1}^p(\log X_j)^2}. \end{aligned}$$

Thus,

$$\begin{aligned} \sum _{i=1}^p\textrm{E}\Big [(X_i-{\alpha }_i){p-2\over \sum _{j=1}^p(\log X_j)^2}\log X_i\Big ] = \textrm{E}\Big [ {(p-2)^2 \over \sum _{j=1}^p(\log X_j)^2}\Big ], \end{aligned}$$

which is used to rewrite the risk as

$$\begin{aligned} \textrm{E}[\Vert {{\widehat{{{\varvec{\alpha }}}}}}^{\textrm{S}}-{{\varvec{\alpha }}}\Vert ^2]=\textrm{E}[\Vert {{\varvec{X}}}-{{\varvec{\alpha }}}\Vert ^2]- \textrm{E}\Big [ {(p-2)^2 \over \sum _{j=1}^p(\log X_j)^2}\Big ]. \end{aligned}$$

This shows that the estimator \({{\widehat{{{\varvec{\alpha }}}}}}^{\textrm{S}}\) improves on \({{\varvec{X}}}\) for \(p\ge 3\). Since \((p-2)^2/\sum _{j=1}^p (\log X_j)^2=\Vert {{\varvec{X}}}- {{\widehat{{{\varvec{\alpha }}}}}}^{\textrm{S}}\Vert ^2\), the above risk function expresses the Pythagorean triangle. \(\square \)

The risk performances of the estimators \({{\varvec{X}}}\) and \({{\widehat{{{\varvec{\alpha }}}}}}^{\textrm{S}}\) are investigated by simulation with 10,000 replications and the average values of the risks are reported in Table 3 for \(p=3, 6\), \({\beta }=1\) and \({{\varvec{\alpha }}}=(k/3){{\varvec{I}}}\), \(k=1, \ldots , 10\). Table 3 shows that the improvement of the shrinkage estimator \({{\widehat{{{\varvec{\alpha }}}}}}^{\textrm{S}}\) over \({{\varvec{X}}}\) is significant in the case of \(p=6\). A small simulation sketch of this comparison is given below Table 3.

Table 3 Risks of the estimators \({{\varvec{X}}}\) and \({{\widehat{{{\varvec{\alpha }}}}}}^{\textrm{S}}\) for \(p=3, 6\) and \({{\varvec{\alpha }}}=(k/3){{\varvec{I}}}\)
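
The following Python sketch is one possible implementation of this comparison (it is ours, not the authors' code); it approximates the risks of \({{\varvec{X}}}\) and the estimator (5.8) under quadratic loss for \(p=6\) and equal shape parameters.

```python
# Simulation sketch: risks of X and the Hudson-type shrinkage estimator (5.8)
# for independent Ga(alpha_i, 1) observations under quadratic loss.
import numpy as np

def risks(alpha, reps=10_000, seed=1):
    rng = np.random.default_rng(seed)
    p = alpha.size
    loss_x, loss_s = 0.0, 0.0
    for _ in range(reps):
        x = rng.gamma(shape=alpha, scale=1.0)
        shrink = (p - 2) / np.sum(np.log(x) ** 2) * np.log(x)
        loss_x += np.sum((x - alpha) ** 2)
        loss_s += np.sum((x - shrink - alpha) ** 2)
    return loss_x / reps, loss_s / reps

for k in (1, 5, 10):
    print(k, risks(np.full(6, k / 3)))   # p = 6 and alpha_i = k/3, as in Table 3
```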

We next consider the simultaneous estimation of \({{\varvec{\beta }}}=({\beta }_1, \ldots , {\beta }_p)\) for known common \({\alpha }_1=\cdots = {\alpha }_p={\alpha }\), where estimator \({{\widehat{{{\varvec{\beta }}}}}}=({{\widehat{{\beta }}}}_1, \ldots , {{\widehat{{\beta }}}}_p)\) is evaluated by the risk relative to the quadratic loss \(\sum _{i=1}^p({{\widehat{{\beta }}}}_i-{\beta }_i)^2\). This estimation problem was studied by Berger (1980), Das Gupta (1986) and others. Das Gupta (1986) suggested the shrinkage estimator \({{\widehat{{{\varvec{\beta }}}}}}^{\textrm{S}}=({{\widehat{{\beta }}}}^{\textrm{S}}_1, \ldots , {{\widehat{{\beta }}}}^{\textrm{S}}_p)\) with

$$\begin{aligned} {{\widehat{{\beta }}}}^{\textrm{S}}_i = {1\over {\alpha }+1}X_i + c V, \quad V= \Big (\prod _{j=1}^p X_j\Big )^{1/p}, \end{aligned}$$
(5.9)

and derived a condition for improving on \({{\widehat{{{\varvec{\beta }}}}}}_0={{\varvec{X}}}/({\alpha }+1)\).

Theorem 5.4

When \(p\ge 2\), the estimator \({{\widehat{{{\varvec{\beta }}}}}}^{\textrm{S}}\) improves on \({{\widehat{{{\varvec{\beta }}}}}}_0\) relative to the quadratic loss if

$$\begin{aligned} 0< c \le {2(p-1) \over ({\alpha }+1)({\alpha }p+1)}. \end{aligned}$$

Proof

The risk function of the estimator (5.9) is

$$\begin{aligned} \textrm{E}[\Vert {{\widehat{{{\varvec{\beta }}}}}}^{\textrm{S}}-{{\varvec{\beta }}}\Vert ^2]=\textrm{E}[\Vert {{\widehat{{{\varvec{\beta }}}}}}_0-{{\varvec{\beta }}}\Vert ^2]+2\sum _{i=1}^p\textrm{E}\Big [\Big ({X_i\over {\alpha }+1}-{\beta }_i\Big )cV\Big ] + p\textrm{E}[ c^2 V^2]. \end{aligned}$$

The Stein-type identity in (5.1) gives

$$\begin{aligned} \textrm{E}[X_i V]={\beta }_i \textrm{E}\Big [{\alpha }V+X_i {\partial V \over \partial X_i}\Big ]={\beta }_i {{\alpha }p+1\over p}\textrm{E}[ V], \end{aligned}$$

because \(\partial V/\partial X_i=V/(pX_i)\). Thus, the risk difference is written as

$$\begin{aligned} {\Delta }&= \textrm{E}[\Vert {{\widehat{{{\varvec{\beta }}}}}}^{\textrm{S}}-{{\varvec{\beta }}}\Vert ^2]-\textrm{E}[\Vert {{\widehat{{{\varvec{\beta }}}}}}_0-{{\varvec{\beta }}}\Vert ^2] = \sum _{i=1}^p \textrm{E}\Big [ 2{c\over {\alpha }+1}VX_i -2 {cp\over {\alpha }p+1}V X_i + pc^2 V^2\Big ]\\&= c\textrm{E}\Big [ - 2 {(p-1)p \over ({\alpha }+1)({\alpha }p+1)} {{\overline{X}}}V + pc V^2\Big ] \le c\textrm{E}\Big [ - 2 {(p-1)p \over ({\alpha }+1)({\alpha }p+1)} V^2 + pc V^2\Big ], \end{aligned}$$

because \({{\overline{X}}}\ge V\) by the arithmetic-geometric mean inequality. This shows that \({\Delta }\le 0\) under the condition in Theorem 5.4. \(\square \)

The risk performances of the estimators \({{\widehat{{{\varvec{\beta }}}}}}_0\) and \({{\widehat{{{\varvec{\beta }}}}}}^{\textrm{S}}\) are investigated by simulation with 10,000 replications and the average values of the risks are reported in Table 4 for \(p=2, 6\), \({\alpha }=1\) and \({{\varvec{\beta }}}=(k/3){{\varvec{I}}}\), \(k=1, \ldots , 10\). From the table, the improvement of the shrinkage estimator \({{\widehat{{{\varvec{\beta }}}}}}^{\textrm{S}}\) over \({{\widehat{{{\varvec{\beta }}}}}}_0\) is numerically confirmed. A corresponding simulation sketch is given below Table 4.

Table 4 Risks of the estimators \({{\widehat{{{\varvec{\beta }}}}}}_0\) and \({{\widehat{{{\varvec{\beta }}}}}}^{\textrm{S}}\) for \(p=2, 6\), \({\alpha }=1\) and \({{\varvec{\beta }}}=(k/3){{\varvec{I}}}\)
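
A corresponding sketch for the estimator (5.9) is given below; it is again our own illustration, with c set to half of the upper bound in Theorem 5.4 and \({\alpha }=1\) assumed known.

```python
# Simulation sketch: risks of beta_0 = X/(alpha+1) and the Das Gupta-type estimator (5.9).
import numpy as np

def risks(beta, alpha=1.0, reps=10_000, seed=2):
    rng = np.random.default_rng(seed)
    p = beta.size
    c = (p - 1) / ((alpha + 1) * (alpha * p + 1))   # half of the bound in Theorem 5.4
    loss0, loss_s = 0.0, 0.0
    for _ in range(reps):
        x = rng.gamma(shape=alpha, scale=beta)
        b0 = x / (alpha + 1)
        bs = b0 + c * np.prod(x) ** (1 / p)          # V = (prod X_j)^(1/p)
        loss0 += np.sum((b0 - beta) ** 2)
        loss_s += np.sum((bs - beta) ** 2)
    return loss0 / reps, loss_s / reps

for k in (1, 5, 10):
    print(k, risks(np.full(6, k / 3)))   # p = 6, alpha = 1 and beta_i = k/3, as in Table 4
```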

5.3 Goodness-of-fit tests for exponentiality

We consider constructing a statistic for testing exponentiality using the Stein identity. Goodness-of-fit tests for exponentiality have been studied extensively in the literature. For some good reviews, see Henze and Meintanis (2005) and Ossai et al. (2022). The idea of constructing test statistics for exponentiality based on the Stein identity appears in Betsch and Ebner (2019) and Henze et al. (2012).

Let \(X_1, \ldots , X_n\) be a positive random sample from a distribution function \(F(\cdot )\) with \(\textrm{E}[X_i]={\sigma }\). Consider the problem of testing the exponentiality of the underlying distribution \(H_0: F=Ex(1/{\sigma })=Ga(1, {\sigma })\). From Theorem 5.1, the characterization of exponential distributions is

$$\begin{aligned} \textrm{E}\Big [ \{(1-t)X/{\sigma }-1\}e^{tX/{\sigma }}\Big ]=0, \end{aligned}$$

and the sample counterpart is \(w_t/\sqrt{n}\), where

$$\begin{aligned} w_t = {1\over \sqrt{n}}\sum _{i=1}^n\{ (1-t)Y_i-1\}e^{t Y_i} \end{aligned}$$
(5.10)

for \(Y_i=X_i/{{\overline{X}}}\). It is noted that \(w_t\) is invariant under scale transformations. Henze et al. (2012) suggested a couple of test statistics based on \(w_t\), one of which is

$$\begin{aligned} \textrm{HME}_{\gamma }= \int _{-\infty }^0 w_t^2 e^{{\gamma }t}\textrm{d}t = {1\over n}\sum _{i=1}^n\sum _{j=1}^n \Big \{ {B_{ij}-A_{ij}+1\over A_{ij}+{\gamma }} +{2B_{ij}-A_{ij}\over (A_{ij}+{\gamma })^2}+{2B_{ij}\over (A_{ij}+{\gamma })^3}\Big \}, \end{aligned}$$

where \(A_{ij}=Y_i+Y_j\) and \(B_{ij}=Y_iY_j\). Similarly to the problem of testing normality, we can consider the test statistics \(\textrm{IST}_c^+=\int _0^c w_t^2 \textrm{d}t\), \(\textrm{IST}_c^-=\int _{-c}^0 w_t^2 \textrm{d}t\), \(\textrm{IST}_c=\int _{-c}^c w_t^2 \textrm{d}t\) and \(\textrm{MST}_c=\sup _{-c<t<c} |w_t|\) for positive constant c, where

$$\begin{aligned} \textrm{IST}_c^+&= {1\over n}\sum _{i=1}^n\sum _{j=1}^n\Big [ \Big \{ c\Big (c-2-{2\over A_{ij}}\Big )e^{cA_{ij}} - \Big (1+{2\over A_{ij}}+{2\over A_{ij}^2}\Big )(1-e^{cA_{ij}})\Big \}{B_{ij} \over A_{ij}}\\&\quad +(c-1)e^{cA_{ij}}+1\Big ],\\ \textrm{IST}_c^-&= {1\over n}\sum _{i=1}^n\sum _{j=1}^n\Big [ \Big \{ -c\Big (c+2+{2\over A_{ij}}\Big )e^{-cA_{ij}} + \Big (1+{2\over A_{ij}}+{2\over A_{ij}^2}\Big )(1-e^{-cA_{ij}})\Big \}{B_{ij}\over A_{ij}}\\&\quad +(c+1)e^{-cA_{ij}}-1\Big ],\\ \textrm{IST}_c&= {1\over n}\sum _{i=1}^n\sum _{j=1}^n\Big [ \Big \{ \Big (c^2+1+{2\over A_{ij}}+{2\over A_{ij}^2}\Big ) (e^{cA_{ij}}-e^{-cA_{ij}})\\&\quad - {2c\over A_{ij}}\Big (1+{1\over A_{ij}}\Big )(e^{cA_{ij}}+e^{-cA_{ij}})\Big \}{B_{ij}\over A_{ij}}\\&\quad -e^{cA_{ij}}+e^{-cA_{ij}}+c(e^{cA_{ij}}+e^{-cA_{ij}})\Big ]. \end{aligned}$$
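
For illustration, these statistics can also be evaluated by direct numerical integration of \(w_t^2\) rather than through the closed-form double sums above; the following Python sketch (our own, with a simple trapezoidal rule and a grid search for the supremum) does so for a sample generated under \(H_0\).

```python
# Numerical sketch: w_t of (5.10) on a grid, with IST_c^+, IST_c^-, IST_c by trapezoidal
# quadrature and MST_c by grid maximization (instead of the closed-form expressions).
import numpy as np

def w_t(x, t):
    y = x / x.mean()                     # Y_i = X_i / Xbar (scale invariant)
    return ((1 - t) * y - 1) @ np.exp(t * y) / np.sqrt(x.size)

def trapezoid(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def ist_mst(x, c=0.1, grid=401):
    ts = np.linspace(-c, c, grid)
    w = np.array([w_t(x, t) for t in ts])
    ist_minus = trapezoid(w[ts <= 0] ** 2, ts[ts <= 0])
    ist_plus = trapezoid(w[ts >= 0] ** 2, ts[ts >= 0])
    return ist_plus, ist_minus, ist_plus + ist_minus, float(np.abs(w).max())

rng = np.random.default_rng(3)
print(ist_mst(rng.exponential(size=50)))  # IST^+, IST^-, IST, MST for data under H_0
```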

The following lemma is helpful for investigating asymptotic properties of these test statistics.

Lemma 5.1

Let \(Z_i=X_i/{\sigma }\) for \(i=1, \ldots , n\). Let \(h_0(t)=\textrm{E}[\{(1-t)Z_1-1\}e^{tZ_1}]\) and \(h_1(t)=(1-2t)\textrm{E}[Z_1e^{tZ_1}]+t(1-t)\textrm{E}[Z_1^2e^{tZ_1}]\). Then, \(w_t\) is approximated as \(w_t=W_n(t)+\sqrt{n}h_0(t)+o_p(1)\), where

$$\begin{aligned} W_n(t)={1\over \sqrt{n}}\sum _{i=1}^n\Big [ \{ (1-t)Z_i-1\}e^{tZ_i}-h_0(t)-(Z_i-1)h_1(t)\Big ]. \end{aligned}$$
(5.11)

Proof

Letting \({{\overline{Z}}}=n^{-1}\sum _{i=1}^n Z_i\), we rewrite \(w_t\) as

$$\begin{aligned} w_t={1\over \sqrt{n}}\sum _{i=1}^n\{(1-t)Z_i/{{\overline{Z}}}-1\}e^{tZ_i/{{\overline{Z}}}}. \end{aligned}$$

Since \(Z_i/{{\overline{Z}}}=Z_i/\{1+({{\overline{Z}}}-1)\}=Z_i\{1-({{\overline{Z}}}-1)\}+O_p(n^{-1})\), \(w_t\) can be approximated as

$$\begin{aligned} w_t&= {1\over \sqrt{n}}\sum _{i=1}^n\{(1-t)Z_i - (1-t)({{\overline{Z}}}-1)Z_i -1+O_p(n^{-1})\}e^{tZ_i-t({{\overline{Z}}}-1)Z_i+O_p(n^{-1})}\\&= {1\over \sqrt{n}}\sum _{i=1}^n\{(1-t)Z_i -1 - (1-t)({{\overline{Z}}}-1)Z_i\}\{1-t({{\overline{Z}}}-1)Z_i\}e^{tZ_i}+o_p(1)\\&= {1\over \sqrt{n}}\sum _{i=1}^n\big \{(1-t)Z_i -1 - (1-2t)({{\overline{Z}}}-1)Z_i -t(1-t)({{\overline{Z}}}-1)Z_i^2\big \}e^{tZ_i}+o_p(1), \end{aligned}$$

which leads to the approximation \(w_t=W_n(t)+\sqrt{n}h_0(t)+o_p(1)\). Hence, Lemma 5.1 is proved. \(\square \)

From (5.11), the central limit theorem shows that \(W_n(t)\) is asymptotically distributed as the normal distribution with mean zero and variance \(\textrm{Var}(W_n(t))\) under the assumption of \(\textrm{E}[Z_1^2e^{2tZ_1}]<\infty \), where the variance can be evaluated as

$$\begin{aligned} \textrm{Var}(W_n(t))&= \textrm{E}[\{(1-t)Z_1-1\}^2e^{2tZ_1}]-h_0(t)^2+h_1(t)^2(\textrm{E}[Z_1^2]-1) \\&\quad -2h_1(t)\textrm{E}[(Z_1-1)\{(1-t)Z_1-1\}e^{tZ_1}]+o(1). \end{aligned}$$

Note that \(h_0(t)=(1-t)g'(t)-g(t)\) for \(g(t)=\textrm{E}[e^{tZ_1}]\). Then, \(h_1(t)=(1-2t)g'(t)+t(1-t)g''(t)=th_0'(t)+\{h_0(t)+g(t)\}/(1-t)\) and

$$\begin{aligned} \textrm{E}[(Z_1-1)\{(1-t)Z_1-1\}e^{tZ_1}]=h_0'(t)+{t\over 1-t}h_0(t)+{1\over 1-t}g(t). \end{aligned}$$

Under the exponentiality hypothesis \(H_0\), we have \(g(t)=1/(1-t)\), \(h_0(t)=0\), \(h_1(t)=g(t)/(1-t)=1/(1-t)^2\) and \(\textrm{E}[Z_1^2]=2\). Also note that

$$\begin{aligned} \textrm{E}[\{(1-t)Z_1-1\}^2e^{2tZ_1}]&= (1-t)^2\textrm{E}[Z_1^2e^{2tZ_1}]-2(1-t)\textrm{E}[Z_1e^{2tZ_1}]+\textrm{E}[e^{2tZ_1}]\\&= {2(1-t)^2\over (1-2t)^3}-2 {1-t\over (1-2t)^2}+{1\over 1-2t} ={(1-t)^2+t^2\over (1-2t)^3}. \end{aligned}$$

Thus, the asymptotic variance of \(w_t\) under the exponentiality is \(\textrm{Var}(W_n(t)) =V(t)^2+o(1)\), where

$$\begin{aligned} V(t)^2={(1-t)^2+t^2\over (1-2t)^3}-{1\over (1-t)^4} \end{aligned}$$

for \(t<1/2\).

Henze et al. (2012) showed the consistency of \(\textrm{HME}_{\gamma }\). Using Lemma 5.1 and the same arguments as in the proof of Theorem 3.3, we can verify the consistency of the suggested test statistics.

Theorem 5.5

Assume that \(\textrm{E}[Z_1^2e^{tZ_1}]<\infty \) for t around zero. Then, the test statistics \(\textrm{IST}_c^+\), \(\textrm{IST}_c^-\), \(\textrm{IST}_c\) and \(\textrm{MST}_c\) given below (5.10) are consistent.

We investigate the power performance of the suggested tests \(\textrm{IST}_c^+\), \(\textrm{IST}_c^-\), \(\textrm{IST}_c\) and \(\textrm{MST}_c=\sup _{-c<t<c} |w_t|\) for \(c=0.1\). As competitors, we treat the test \(\textrm{HME}_{\gamma }\) of Henze et al. (2012) for \({\gamma }=1\) and two simpler test statistics. One of them is the Cox and Oakes (1984) test

$$\begin{aligned} \textrm{CO} = {\sqrt{6}\over \sqrt{n}\pi } \Big \{ n + \sum _{i=1}^n\Big (1-{X_i\over {{\overline{X}}}}\Big )\log \Big ( {X_i\over {{\overline{X}}}}\Big )\Big \}, \end{aligned}$$

and the hypothesis \(H_0\) is rejected when \(|\textrm{CO}|>z_{{\alpha }/2}\). Another test is based on the coefficient of variation \(\textrm{HS}=nS^2/{{\overline{X}}}^2\), given in Hahn and Shapiro (1967), and the hypothesis \(H_0\) is rejected when \(\textrm{HS}<\chi _{n-1, 1-{\alpha }/2}^2\) or \(\textrm{HS}>\chi _{n-1, {\alpha }/2}^2\). This is also derived from a likelihood ratio for testing the homogeneity \(H_0^{*} : {\lambda }_1=\cdots = {\lambda }_n\) for \(X_i\sim Ex({\lambda }_i)\), and the null hypothesis \(H_0^{*}\) is rejected when \(\textrm{HS}>\chi _{n-1, {\alpha }}^2\).
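
For concreteness, these two competing tests can be implemented as follows; this is a minimal sketch of ours, with the normal and chi-square quantiles taken from scipy and \(S^2\) taken as the unbiased sample variance.

```python
# Sketch of the Cox-Oakes test CO and the coefficient-of-variation test HS = n S^2 / Xbar^2,
# with the two-sided rejection rules described above at level 0.05.
import numpy as np
from scipy.stats import norm, chi2

def co_test(x, level=0.05):
    n = x.size
    y = x / x.mean()
    co = np.sqrt(6) / (np.sqrt(n) * np.pi) * (n + np.sum((1 - y) * np.log(y)))
    return abs(co) > norm.ppf(1 - level / 2)

def hs_test(x, level=0.05):
    n = x.size
    hs = n * x.var(ddof=1) / x.mean() ** 2
    return (hs < chi2.ppf(level / 2, n - 1)) or (hs > chi2.ppf(1 - level / 2, n - 1))

rng = np.random.default_rng(4)
x = rng.exponential(size=50)
print(co_test(x), hs_test(x))   # under H_0 both reject with probability about 0.05
```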

The powers of those test statistics are examined by simulation with 10,000 replications. For \(n=50\), we adjust the type-I errors before calculating their powers under the following three alternatives for \(w=0.2, 0.5, 0.8, 1.0\).

$$\begin{aligned} M_1\ :\quad&X_i = (1-w) Ex(1) + w Ga(1.2,0.8),\\ M_2\ :\quad&X_i = (1-w) Ex(1) + w LogN(0,1),\\ M_3\ :\quad&X_i = (1-w) Ex(1) + w InvG(1,1), \end{aligned}$$

where Ex(1), Ga(a, b), LogN(0, 1) and InvG(1, 1) denote random variables having the exponential distribution Ex(1), gamma distribution Ga(a, b), log-normal distribution LogN(0, 1) and inverse Gaussian distribution InvG(1, 1), respectively. The values of their powers are reported in Table 5. From the table, it is observed that the tests \(\textrm{IST}_{0.1}\), \(\textrm{HME}_1\) and \(\textrm{CO}\) are more powerful for \(w=0.2\), 0.5 and 0.8 in \(M_1\), \(M_2\) and \(M_3\). In the case of \(w=1\), the test \(\textrm{IST}_{0.1}\) is most powerful for Ga(1.2, 0.8), the tests \(\textrm{IST}_{0.1}^+\), \(\textrm{IST}_{0.1}^-\), \(\textrm{MST}_{0.1}\) and \(\textrm{HS}\) are more powerful for LogN(0, 1), and the tests \(\textrm{IST}_{0.1}\), \(\textrm{HME}_1\) and \(\textrm{CO}\) are more powerful for InvG(1, 1). Overall, the tests \(\textrm{IST}_{0.1}\), \(\textrm{HME}_1\) and \(\textrm{CO}\) have similar performances, and the tests \(\textrm{IST}_{0.1}^+\), \(\textrm{IST}_{0.1}^-\), \(\textrm{MST}_{0.1}\) and \(\textrm{HS}\) perform similarly.

Table 5 Power of the five tests for \(n=50\) and \(w=0.2, 0.5, 0.8, 1.0\)

6 Stein-type identities in Poisson and negative binomial distributions

6.1 Stein-type identity in Poisson distributions

In the Poisson distribution \(Po({\lambda })\), Hudson (1978) provided the Stein-type identity, which also characterizes the Poisson distribution, as seen below.

Theorem 6.1

Let X be a non-negative and discrete random variable with \(\textrm{E}[X]={\lambda }\). Then, the following four conditions are equivalent.

  1. (a)

    \(X\sim Po({\lambda })\).

  2. (b)

    For any function \(h(\cdot )\) with \(\textrm{E}[|Xh(X)|]<\infty \), it holds that

    $$\begin{aligned} \textrm{E}[{\lambda }h(X)]=\textrm{E}[X h(X-1)] \quad \text {or}\quad \textrm{E}[X h(X)]=\textrm{E}[{\lambda }h(X+1)] \end{aligned}$$
    (6.1)
  3. (c)

    For any real constant t, it holds that

    $$\begin{aligned} \textrm{E}[(X-{\lambda }e^t)e^{tX}]=0. \end{aligned}$$
    (6.2)
  4. (d)

    \(g(t)=\textrm{E}[\exp \{tX\}]\) satisfies the differential equation

    $$\begin{aligned} {\textrm{d}\over \textrm{d}t}\log \{g(t)\}= {\lambda }e^t. \end{aligned}$$
    (6.3)

Proof

For the proof from (a) to (b), it is noted that

$$\begin{aligned} {\lambda }{{\lambda }^x\over x!}e^{-{\lambda }}=(x+1) {{\lambda }^{x+1}\over (x+1)!}e^{-{\lambda }}, \end{aligned}$$

which produces the identity (6.1). Clearly, one gets (c) from (b). For the proof from (c) to (d), the identity (6.2) is written as \(g'(t) = {\lambda }e^t g(t)\), which is given in (6.3). For the proof from (d) to (a), the solution of the differential equation in (6.3) is \(g(t) = \exp \{{\lambda }(e^t-1)\}\), which implies that \(X \sim Po({\lambda })\). \(\square \)
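
The identity (6.1) is easily verified numerically; the following short sketch (ours, with the arbitrary bounded test function \(h(x)=1/(x+2)\)) compares the two sides by Monte Carlo.

```python
# Monte Carlo check of (6.1): E[lambda h(X)] = E[X h(X-1)] for X ~ Po(lambda),
# with the arbitrary test function h(x) = 1/(x + 2) (chosen so that h(X-1) is finite).
import numpy as np

rng = np.random.default_rng(5)
lam = 2.7
X = rng.poisson(lam, size=10**6)

def h(x):
    return 1.0 / (x + 2.0)

print(np.mean(lam * h(X)), np.mean(X * h(X - 1)))  # the two averages nearly coincide
```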

The characterization of the Poisson distribution has been studied in many papers. A feature of this distribution is that the sample mean and the unbiased sample variance have the same expectation. Shanbhag (1970b) derived a related condition for characterizing the Poisson distribution.

Theorem 6.2

Assume that nonnegative and discrete random variables \(X_1\) and \(X_2\) are independently and identically distributed. Then, the following three conditions are equivalent.

  1. (a)

    \(X_i \sim Po({\lambda })\) for \(i=1, 2\).

  2. (b)

    \(X_1+X_2 \sim Po(2{\lambda })\).

  3. (c)

    The conditional expectation of \((X_1-X_2)^2\) given \(X_1+X_2\) is equal to \(X_1+X_2\), namely \(\textrm{E}[(X_1-X_2)^2|X_1+X_2]=X_1+X_2\).

Proof

The proof from (a) to (b) is trivial. For the proof from (b) to (a), from Theorem 6.1, we have

$$\begin{aligned} \textrm{E}[(X_1+X_2-2{\lambda }e^t)e^{t(X_1+X_2)}]=0, \end{aligned}$$

which easily leads to

$$\begin{aligned} 2 \textrm{E}[(X_1-{\lambda }e^t)e^{tX_1}] \textrm{E}[e^{tX_2}]=0. \end{aligned}$$

Thus, we get (a) by using Theorem 6.1 again.

For the proof from (a) to (c), it is noted that \(\textrm{E}[(X_1-X_2)^2]=2{\lambda }\) and \(\textrm{E}[X_1+X_2]=2{\lambda }\), namely

$$\begin{aligned} \textrm{E}[ (X_1-X_2)^2 - (X_1+X_2)]=0. \end{aligned}$$

Since \(X_1+X_2\) is complete and sufficient, from \(\textrm{E}[ \textrm{E}[(X_1-X_2)^2|X_1+X_2] - (X_1+X_2)]=0\), it follows that \(\textrm{E}[(X_1-X_2)^2|X_1+X_2] - (X_1+X_2)=0\), and we get (c).

For the proof from (c) to (a), it is noted that condition (c) implies

$$\begin{aligned} \textrm{E}[(X_1-X_2)^2 e^{t(X_1+X_2)}]=\textrm{E}[ (X_1+X_2)e^{t(X_1+X_2)}]. \end{aligned}$$
(6.4)

It is observed that

$$\begin{aligned} \textrm{E}[(X_1-X_2)^2 e^{t(X_1+X_2)}]&= \textrm{E}[(X_1^2+X_2^2-2X_1X_2) e^{t(X_1+X_2)}]\\&= 2\textrm{E}[X_1^2 e^{tX_1}]\textrm{E}[e^{tX_2}] -2\textrm{E}[X_1e^{tX_1}] \textrm{E}[X_2e^{tX_2}] ,\\ \textrm{E}[ (X_1+X_2)e^{t(X_1+X_2)}]&= 2\textrm{E}[X_1e^{tX_1}]\textrm{E}[e^{tX_2}]. \end{aligned}$$

Then from (6.4), for \(g(t)=\textrm{E}[e^{tX_1}]\), we have

$$\begin{aligned} g''(t)g(t) - \{g'(t)\}^2 = g'(t)g(t) \quad \text {or}\quad {g''(t)g(t) - \{g'(t)\}^2\over \{g(t)\}^2} = {g'(t)\over g(t)}. \end{aligned}$$

Let \(\psi (t)=g'(t)/g(t)\), and this equality is expressed as \(\psi '(t)=\psi (t)\). The solution of this differential equation is \(\log \psi (t)=t + \log c_0\), namely \(\psi (t)=c_0 e^t\). This implies that \(\log g(t)=c_0 e^t + \log c_1\), or \(g(t)=c_1\exp \{c_0 e^t\}\). Since \(g(0)=c_1e^{c_0}=1\) or \(c_1=e^{-c_0}\), we get \(g(t)=\exp \{c_0(e^t-1)\}\), which leads to the Poisson distribution. \(\square \)

Theorem 6.2 can be extended to the case of a random sample with size n, where condition (c) is replaced by \(\textrm{E}[(n-1)^{-1}\sum _{i=1}^n(X_i-{{\overline{X}}})^2|{{\overline{X}}}]={{\overline{X}}}\).

6.2 Two applications of the Stein-type identity in Poisson distributions

We here provide two applications of the Stein-type identity in Poisson distributions. One of them is to obtain improved shrinkage estimators in simultaneous estimation in Poisson distributions. Let \(X_1, \ldots , X_p\) be independent random variables such that \(X_i\sim Po({\lambda }_i)\), \(i=1, \ldots , p\). Consider the problem of simultaneously estimating \({{\varvec{\lambda }}}=({\lambda }_1, \ldots , {\lambda }_p)\) relative to the loss \(\sum _{i=1}^p({{\hat{{\lambda }}}}_i-{\lambda }_i)^2/{\lambda }_i\). Clevenson and Zidek (1975) constructed a class of estimators improving on \({{\varvec{X}}}\), given by \({{\widehat{{{\varvec{\lambda }}}}}}_\phi =({{\hat{{\lambda }}}}_{\phi ,1}, \ldots , {{\hat{{\lambda }}}}_{\phi ,p})\) for

$$\begin{aligned} {{\hat{{\lambda }}}}_{\phi , i} = X_i - {\phi (Z) \over Z + p-1}X_i,\quad Z=\sum _{j=1}^p X_j. \end{aligned}$$
(6.5)

Theorem 6.3

The unbiased risk estimator of the shrinkage estimator \({{\widehat{{{\varvec{\lambda }}}}}}_\phi \) is

$$\begin{aligned} \widehat{ R({{\varvec{\lambda }}}, {{\widehat{{{\varvec{\lambda }}}}}}_\phi )}=p - {2Z\over Z+p-1}\{\phi (Z+1)-\phi (Z)\} -{2(p-1)\phi (Z+1)\over Z+p-1} + {\phi ^2(Z+1)\over Z+p}. \end{aligned}$$

Thus, for \(p\ge 2\), \({{\widehat{{{\varvec{\lambda }}}}}}_\phi \) improves on \({{\varvec{X}}}\) if \(\phi (\cdot )\) satisfies the conditions (a) \(\phi (z)\) is nondecreasing in z, and (b) \(0\le \phi (z) \le 2(p-1)\).

Proof

The risk function of \({{\widehat{{{\varvec{\lambda }}}}}}_\phi \) is

$$\begin{aligned} R({{\varvec{\lambda }}}, {{\widehat{{{\varvec{\lambda }}}}}}_\phi )=p-2\sum _{i=1}^p\textrm{E}\Big [\Big ({X_i\over {\lambda }_i}-1\Big ){X_i\phi (Z)\over Z+p-1}\Big ]+\sum _{i=1}^p\textrm{E}\Big [{X_i^2\over {\lambda }_i} {\phi ^2(Z)\over (Z+p-1)^2} \Big ]. \end{aligned}$$

From the Stein identity in Theorem 6.1 and the independence of \(X_1, \ldots , X_p\), we use the identity \(\textrm{E}[(X_i/{\lambda }_i) h({{\varvec{X}}})]=\textrm{E}[ h(X_1, \ldots , X_i+1, \ldots , X_p)]\) to write

$$\begin{aligned} \sum _{i=1}^p\textrm{E}\Big [{X_i\over {\lambda }_i}{X_i\phi (Z)\over Z+p-1}\Big ]&= \sum _{i=1}^p\textrm{E}\Big [{(X_i+1)\phi (Z+1)\over Z+p}\Big ] =\textrm{E}[\phi (Z+1)],\\ \sum _{i=1}^p\textrm{E}\Big [{X_i^2\over {\lambda }_i} {\phi ^2(Z)\over (Z+p-1)^2} \Big ]&= \sum _{i=1}^p\textrm{E}\Big [{(X_i+1)\phi ^2(Z+1)\over (Z+p)^2} \Big ] =\textrm{E}\Big [{\phi ^2(Z+1)\over Z+p} \Big ]. \end{aligned}$$

These observations give the expression in the unbiased risk estimator. \(\square \)

In the case of \(\phi (z)=p-1\), the estimator \({{\widehat{{{\varvec{\lambda }}}}}}^{\textrm{S}}={{\varvec{X}}}-(p-1)(Z+p-1)^{-1}{{\varvec{X}}}\) has the unbiased risk estimator \({{\widehat{R}}({{\varvec{\lambda }}}, {{\widehat{{{\varvec{\lambda }}}}}}^{\textrm{S}})}=p - (p-1)^2(Z+p+1)/\{(Z+p-1)(Z+p)\}\). Clevenson and Zidek (1975) showed that the Bayes estimator \({{\widehat{{{\varvec{\lambda }}}}}}^{\textrm{GB}}={{\varvec{X}}}-(p-1+{\beta })(Z+p-1+{\beta })^{-1}{{\varvec{X}}}\) satisfies the conditions of Theorem 6.3 for \(0\le {\beta }\le p-1\) and that it is admissible for \({\beta }>1\). Numerical investigation of this Clevenson-Zidek-type estimator is given in Table 7.
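
A simulation sketch of this estimator with \(\phi (z)=p-1\) is given below (our own illustration, not the code behind Table 7), comparing the normalized risks of \({{\varvec{X}}}\) and the Clevenson-Zidek estimator.

```python
# Simulation sketch: risks of X and the Clevenson-Zidek estimator (phi(z) = p - 1 in (6.5))
# under the loss sum_i (lambdahat_i - lambda_i)^2 / lambda_i.
import numpy as np

def risks(lam, reps=10_000, seed=6):
    rng = np.random.default_rng(seed)
    p = lam.size
    loss_x, loss_cz = 0.0, 0.0
    for _ in range(reps):
        x = rng.poisson(lam)
        cz = x - (p - 1) / (x.sum() + p - 1) * x
        loss_x += np.sum((x - lam) ** 2 / lam)
        loss_cz += np.sum((cz - lam) ** 2 / lam)
    return loss_x / reps, loss_cz / reps

print(risks(np.full(6, 2.0)))   # p = 6 and lambda_i = 2, an arbitrary configuration
```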

We next consider deriving goodness-of-fit test statistics for Poissonity based on the Stein identity. The problem of testing Poissonity has been studied in the literature, and one can see Mijburgh and Visagie (2020) for an overview.

Let \(X_1, \ldots , X_n\) be a discrete and nonnegative random sample from a population with distribution function \(F(\cdot )\) with mean \(\textrm{E}[X_i]={\lambda }\). Consider the problem of testing the Poissonity of the underlying distribution \(H_0: F=Po({\lambda })\). From Theorem 6.1, the characterization of the Poisson distribution is

$$\begin{aligned} \textrm{E}\Big [ (X-{\lambda }e^t)e^{tX}\Big ]=0, \end{aligned}$$

and the sample counterpart is \(w_t/\sqrt{n}\), where

$$\begin{aligned} w_t = {1\over \sqrt{n}} \sum _{i=1}^n(X_i-{{\overline{X}}}e^t)e^{t X_i}. \end{aligned}$$
(6.6)

The idea of Henze et al. (2012) is used to construct the test statistic

$$\begin{aligned} \textrm{BHT}_{\gamma }&= \int _{-\infty }^0 w_t^2 e^{{\gamma }t}\textrm{d}t ={1\over n}\sum _{i=1}^n\sum _{j=1}^n \Big \{ {B_{ij}\over A_{ij}+{\gamma }}-{{{\overline{X}}}A_{ij}\over A_{ij}+1+{\gamma }}+{{{\overline{X}}}^2\over A_{ij}+2+{\gamma }}\Big \} \end{aligned}$$

for \(A_{ij}=X_i+X_j\), \(B_{ij}=X_iX_j\) and positive constant \({\gamma }\). This was suggested by Treutler (1995) and the related test was proposed by Baringhaus and Henze (1992). For other test statistics, see Gürtler and Henze (2000). Similarly to the problem of testing exponentiality, we can consider the test statistics

$$\begin{aligned} \textrm{IST}_c&= \int _{-c}^0 w_t^2 e^{t}\textrm{d}t ={1\over n}\sum _{i=1}^n\sum _{j=1}^n \Big \{ {B_{ij}\over A_{ij}+1}(1-e^{-cA_{ij}-c}) - {{{\overline{X}}}A_{ij}\over A_{ij}+2}(1-e^{-cA_{ij}-2c}) \\&\quad + {{{\overline{X}}}^2\over A_{ij}+3}(1-e^{-cA_{ij}-3c})\Big \}, \end{aligned}$$

and \(\textrm{MST}_c = \sup _{-c<t<0}|w_t|\) for positive constant c.

It is observed that \(w_t\) can be approximated as

$$\begin{aligned} w_t&= {1\over \sqrt{n}}\sum _{i=1}^n (X_i-{\lambda }e^t)e^{tX_i}-({{\overline{X}}}-{\lambda })e^t {1\over \sqrt{n}}\sum _{i=1}^n e^{tX_i}\\&= \sqrt{n}\Big \{{1\over n}\sum _{i=1}^n(X_i-{\lambda }e^t)e^{tX_i}-h_0(t)\Big \} +\sqrt{n}h_0(t) -\sqrt{n}({{\overline{X}}}-{\lambda })e^t g(t) +o_p(1), \end{aligned}$$

where \(h_0(t)=\textrm{E}[(X_1-{\lambda }e^t)e^{tX_1}]\) and \(g(t)=\textrm{E}[e^{tX_1}]\). This gives the following lemma.

Lemma 6.1

\(w_t\) is approximated as \(w_t=W_n(t)+\sqrt{n}h_0(t)+o_p(1)\), where

$$\begin{aligned} W_n(t)={1\over \sqrt{n}}\sum _{i=1}^n\Big \{ (X_i-{\lambda }e^t)e^{tX_i}-h_0(t)-(X_i-{\lambda })e^t g(t)\Big \}. \end{aligned}$$
(6.7)

From the central limit theorem, it follows that \(W_n(t)\) converges in distribution to the normal distribution with mean zero and the variance

$$\begin{aligned} \textrm{Var}(W_n(t))&= \textrm{E}[(X_1-{\lambda }e^t)^2e^{2tX_1}]-h_0(t)^2 +(\textrm{E}[X_1^2]-{\lambda }^2)e^{2t}g(t)^2\\&\quad -2\{ \textrm{E}[X_1(X_1-{\lambda }e^t)e^{tX_1}]-{\lambda }h_0(t)\}e^t g(t)+o(1). \end{aligned}$$

Since \(h_0(t)=g'(t)-{\lambda }e^tg(t)\), we have \(h_0'(t)=g''(t)-{\lambda }e^t g'(t)-{\lambda }e^tg(t)\), so that \(\textrm{E}[X_1(X_1-{\lambda }e^t)e^{tX_1}]=g''(t)-{\lambda }e^tg'(t)=h_0'(t)+{\lambda }e^t g(t)\). Under the null hypothesis \(H_0\) of Poissonity, it can be seen that \(g(t)=\exp \{{\lambda }(e^t-1)\}\), \(h_0(t)=0\) and

$$\begin{aligned} \textrm{E}[(X_1-{\lambda }e^t)^2e^{2tX_1}]={\lambda }\{ {\lambda }(e^t-1)^2+1\}e^{2t}\exp \{{\lambda }(e^{2t}-1)\}, \end{aligned}$$

so that \(\textrm{Var}(W_n(t))=V(t)^2+o(1)\), where

$$\begin{aligned} V(t)^2={\lambda }e^{2t}\Big [ \{ {\lambda }(e^t-1)^2+1\}\exp \{{\lambda }(e^{2t}-1)\} - \exp \{2{\lambda }(e^t-1)\}\Big ]. \end{aligned}$$

Using Lemma 6.1 and the same arguments as in the proof of Theorem 3.3, we can verify the consistency of the suggested test statistics.

Theorem 6.4

Assume that \(\textrm{E}[X_1^2e^{tX_1}]<\infty \) for t around zero. Then, the test statistics \(\textrm{BHT}_{\gamma }\), \(\textrm{IST}_c\) and \(\textrm{MST}_c\) given below (6.6) are consistent.

We investigate the power performance of the test statistics \(\textrm{BHT}_{\gamma }\) for \({\gamma }=1\), \(\textrm{IST}_c\) for \(c=1\) and \(\textrm{MST}_c\) for \(c=1\). As a competitor, we employ the test based on Fisher's index of dispersion, given by \(\textrm{FI}=(n-1)S^2/{{\overline{X}}}\). The null hypothesis \(H_0\) is rejected when \(\textrm{FI}<\chi _{n-1, 1-{\alpha }/2}^2\) or \(\textrm{FI}>\chi _{n-1, {\alpha }/2}^2\). The simulation experiments are conducted under the following three alternatives for \({\alpha }=0, 2, 3, 4, 5\) and \({\lambda }=1, 3\).

$$\begin{aligned} M_1\ :\quad&X_i = Po({\lambda }) + {\alpha }Nbin(10, 10/10.2),\\ M_2\ :\quad&X_i = Po({\lambda }) + {\alpha }Po(0.2),\\ M_3\ :\quad&X_i = Po({\lambda }) + {\alpha }Bin(10, 0.1), \end{aligned}$$

where \(Po({\lambda })\), Po(0.2), Nbin(10, 10/10.2) and Bin(10, 0.1) denote independent random variables having the Poisson distributions \(Po({\lambda })\) and Po(0.2), negative binomial distribution Nbin(10, 10/10.2) and binomial distribution Bin(10, 0.1), respectively. It is here noted that the type-I errors of the tests depend on the unknown parameter \({\lambda }\). To fix this problem, the parametric bootstrap method is useful for obtaining critical values of the tests. For example, we explain how to obtain the critical value of \(\textrm{BHT}_{\gamma }\) in model \(M_2\). We first generate K samples of size n from \(M_2\), where each sample consists of \(u_1, \ldots , u_{n}\) from \(Po({\lambda })\) and \(v_1, \ldots , v_{n}\) from Po(0.2), for \(K=1,000\) and \(n=50\). Then, we generate the B bootstrap samples \((u_1^{(b)}, \ldots , u_{n}^{(b)})\), \(b=1, \ldots , B\), from \(Po({{\overline{u}}})\) for \({{\overline{u}}}=\sum _{i=1}^{n}u_i/n\) and calculate the value of \(\textrm{BHT}_{\gamma }^{(b)}\) for \(b=1, \ldots , B\) and \(B=1,000\). We obtain the critical value \(q_{\textrm{BHT}}\) of the test \(\textrm{BHT}_{\gamma }\) as the upper 5% quantile of the empirical distribution of the \(\textrm{BHT}_{\gamma }^{(b)}\)'s. We then calculate K values of \(\textrm{BHT}_{\gamma }\) based on \(x_i=u_i + {\alpha }v_i\) for \(i=1, \ldots , n\) and count the number of cases with \(\textrm{BHT}_{\gamma }>q_{\textrm{BHT}}\). In this way, we can obtain approximated values of the size and the power of the test \(\textrm{BHT}_{\gamma }\). These values are reported in Table 6.
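
The calibration described above can be sketched in a few lines of Python; the following is a condensed illustration of ours for a single replication of model \(M_2\) with \({\lambda }=1\), \({\alpha }=3\), \(n=50\) and \(B=1,000\), using the closed-form expression of \(\textrm{BHT}_{\gamma }\) with \({\gamma }=1\).

```python
# Condensed sketch of the parametric bootstrap calibration of BHT_gamma (gamma = 1).
import numpy as np

def bht(x, gamma=1.0):
    a = x[:, None] + x[None, :]            # A_ij = X_i + X_j
    b = x[:, None] * x[None, :]            # B_ij = X_i X_j
    xbar = x.mean()
    return np.sum(b / (a + gamma) - xbar * a / (a + 1 + gamma)
                  + xbar**2 / (a + 2 + gamma)) / x.size

rng = np.random.default_rng(7)
n, B, lam, alpha_mix = 50, 1000, 1.0, 3

u = rng.poisson(lam, size=n)               # Po(lambda) component
v = rng.poisson(0.2, size=n)               # contaminating Po(0.2) component
x = u + alpha_mix * v                      # one observed sample from model M_2

# bootstrap critical value: upper 5% quantile of BHT_gamma over samples from Po(ubar)
boot = np.array([bht(rng.poisson(u.mean(), size=n)) for _ in range(B)])
q_bht = np.quantile(boot, 0.95)

print(bht(x) > q_bht)                      # reject H_0 when BHT_gamma exceeds q_BHT
```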

From the table, the sizes of the tests show small variations for \(n=50\), \(B=1,000\) and \(K=1,000\), and these variations become smaller for larger n, B and K. The tests \(\textrm{BHT}_1\), \(\textrm{IST}_1\) and \(\textrm{MST}_1\) have similar power performances and are more powerful than \(\textrm{FI}\) in the case of \({\lambda }=1\) for \(M_1, M_2\) and \(M_3\). In the case of \({\lambda }=3\), the test \(\textrm{MST}_1\) is more powerful than \(\textrm{BHT}_1\) and \(\textrm{IST}_1\), and \(\textrm{FI}\) is more powerful for \({\alpha }=4, 5\) in \(M_1\) and \(M_2\).

Table 6 Sizes and powers of the four tests for \(n=50\) and \({\alpha }=0, 2, 3, 4, 5\)

6.3 Stein-type identity in negative binomial distributions

Consider the negative binomial distribution \(Nbin({\alpha }, p)\) with the probability function

$$\begin{aligned} f(x|{\alpha },p) = {{\Gamma }(x+{\alpha })\over {\Gamma }({\alpha })x!}q^x p^{\alpha }, \quad q=1-p, \end{aligned}$$

where \({\alpha }>0\) and \(0<p<1\). Although \({\alpha }\) is a natural number in the usual negative binomial distribution, we here treat \({\alpha }\) as a positive real number. Hudson (1978) provided the Stein-type identity, which also characterizes the negative binomial distribution, as seen below.

Theorem 6.5

Let X be a non-negative and discrete random variable. Then, the following four conditions are equivalent.

  1. (a)

    \(X\sim Nbin({\alpha },p)\).

  2. (b)

    For any function \(h(\cdot )\) with \(\textrm{E}[|Xh(X)|]<\infty \), it holds that

    $$\begin{aligned} \textrm{E}[X h(X)]=q \textrm{E}[(X+{\alpha }) h(X+1)]. \end{aligned}$$
    (6.8)
  3. (c)

    For any real constant t, it holds that

    $$\begin{aligned} \textrm{E}[X e^{tX}]=q \textrm{E}[(X+{\alpha }) e^{t(X+1)}]. \end{aligned}$$
    (6.9)
  4. (d)

    \(g(t)=\textrm{E}[\exp \{tX\}]\) satisfies the differential equation

    $$\begin{aligned} {\textrm{d}\over \textrm{d}t}\log \{g(t)\}= {{\alpha }q e^t \over 1-qe^t}. \end{aligned}$$
    (6.10)

Proof

For the proof from (a) to (b), it is noted that

$$\begin{aligned} x {{\Gamma }(x+{\alpha })\over {\Gamma }({\alpha })x!}q^x p^{\alpha }=q(x-1+{\alpha }) {{\Gamma }(x-1+{\alpha })\over {\Gamma }({\alpha })(x-1)!}q^{x-1} p^{\alpha }, \end{aligned}$$

which shows the identity (6.8). Clearly, one gets (c) from (b). For the proof from (c) to (d), the identity (6.9) is written as \(g'(t) =q\{g'(t)+{\alpha }g(t)\}e^t\), which is (6.10). For the proof from (d) to (a), the solution of the differential equation in (6.10) is \(\log g(t)={\alpha }\log p -{\alpha }\log (1-qe^t)\), namely, \(g(t) = {p^{\alpha }/(1-q e^t)^{\alpha }}\), which implies that \(X \sim Nbin({\alpha },p)\). \(\square \)
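
As with the gamma and Poisson cases, the identity (6.8) can be checked numerically; the sketch below (ours, with the arbitrary bounded function \(h(x)=1/(1+x)\)) uses NumPy's negative binomial sampler, which accepts a real-valued \({\alpha }\).

```python
# Monte Carlo check of (6.8): E[X h(X)] = q E[(X + alpha) h(X + 1)] for X ~ Nbin(alpha, p),
# with the arbitrary bounded test function h(x) = 1/(1 + x).
import numpy as np

rng = np.random.default_rng(8)
alpha, p = 3.0, 0.4
q = 1.0 - p
X = rng.negative_binomial(alpha, p, size=10**6)

def h(x):
    return 1.0 / (1.0 + x)

print(np.mean(X * h(X)), q * np.mean((X + alpha) * h(X + 1)))  # nearly equal
```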

Theorem 6.6

Assume that nonnegative and discrete random variables \(X_1\) and \(X_2\) are independently and identically distributed with \(\textrm{E}[|X_1|e^{tX_1}]<\infty \). Then, the following two conditions are equivalent.

  1. (a)

    \(X_i\sim Nbin({\alpha },p)\) for \(i=1,2\).

  2. (b)

    \(X_1+X_2 \sim Nbin(2{\alpha }, p)\).

Proof

It is easy to see (b) from (a). For the proof from (b) to (a), from Theorem 6.5, it follows that

$$\begin{aligned} \textrm{E}[(X_1+X_2)e^{t(X_1+X_2)}]=q \textrm{E}[(X_1+X_2+2{\alpha })e^{t(X_1+X_2)}]. \end{aligned}$$

This equality leads to

$$\begin{aligned} 2 \textrm{E}[X_1 e^{tX_1}]\textrm{E}[e^{tX_2}] = 2 q \textrm{E}[(X_1+{\alpha }) e^{tX_1}]\textrm{E}[e^{tX_2}], \end{aligned}$$

which, from Theorem 6.5, shows (a). \(\square \)

We briefly describe the Stein problem of simultaneous estimation of the means of p negative binomial distributions. This is due to Tsui (1984). Let \(X_1, \ldots , X_p\) be independent random variables such that \(X_i \sim Nbin({\alpha }_i, \eta _i)\) for \(i=1, \ldots , p\). The mean of \(X_i\) is denoted by \({\theta }_i={\alpha }_i (1-\eta _i)/\eta _i\). Let \({{\varvec{X}}}=(X_1, \ldots , X_p)\) and \({{\varvec{\theta }}}=({\theta }_1, \ldots , {\theta }_p)\), and we consider the estimation of \({{\varvec{\theta }}}\) for known \({\alpha }_i\)'s relative to the loss \(\sum _{i=1}^p ({{\hat{{\theta }}}}_i-{\theta }_i)^2/{\theta }_i\). Tsui (1984) suggested the shrinkage estimator \({{\widehat{{{\varvec{\theta }}}}}}_\phi =({{\hat{{\theta }}}}_{\phi ,1}, \ldots , {{\hat{{\theta }}}}_{\phi ,p})\), where

$$\begin{aligned} {{\hat{{\theta }}}}_{\phi , i} = X_i - {\phi (Z)\over Z+p-1}X_i, \end{aligned}$$

for \(Z=\sum _{i=1}^p X_i\) and nonnegative function \(\phi (\cdot )\). Let \({\Delta }= R({{\varvec{\theta }}}, {{\widehat{{{\varvec{\theta }}}}}}_\phi )-R({{\varvec{\theta }}}, {{\varvec{X}}})\). The conditions on \(\phi (\cdot )\) for improving on \({{\varvec{X}}}\) were derived by Tsui (1984).

Theorem 6.7

The risk difference \({\Delta }\) is decomposed as \({\Delta }={\Delta }_1+{\Delta }_2\), where

$$\begin{aligned} {\Delta }_1&= \textrm{E}\Big [ {(Z+p)\phi ^2(Z+1)\over (Z+p)^2}- 2 {(Z+p)\phi (Z+1)\over Z+p}+ 2 {Z\phi (Z)\over Z+p-1}\Big ],\\ {\Delta }_2&= \sum _{i=1}^p \textrm{E}\Big [ {X_i\over {\alpha }_i}\Big \{ {(X_i+1)\phi (Z+1)\over Z+p}\Big ({\phi (Z+1)\over Z+p}-2\Big )+{X_i\phi (Z)\over Z+p-1}\Big (2-{\phi (Z)\over Z+p-1}\Big )\Big \}\Big ]. \end{aligned}$$

Then, it holds that \({\Delta }\le 0\) if the following conditions are satisfied. (a) \(\phi (z)\) is nondecreasing, (b) \(0<\phi (z) \le 2(p-1)\) and (c) \(\phi (z)/z\) is nonincreasing.

Proof

More generally, we consider the estimator \({{\hat{{\theta }}}}_i=X_i+f_i({{\varvec{X}}})\). Let \({{\varvec{e}}}_i\) be a p-variate vector whose i-th coordinate is one and whose other coordinates are zero. Note that \({\alpha }_i/{\theta }_i=1/q_i -1\) and \(\textrm{E}[h(X_i)/q_i]=\textrm{E}[(X_i+{\alpha }_i)h(X_i+1)/(X_i+1)]\) from Theorem 6.5. Then,

$$\begin{aligned} {\Delta }&= R({{\varvec{\theta }}}, {{\widehat{{{\varvec{\theta }}}}}})-R({{\varvec{\theta }}}, {{\varvec{X}}}) =\sum _{i=1}^p \textrm{E}[f_i^2({{\varvec{X}}})/{\theta }_i + 2X_i f_i({{\varvec{X}}})/{\theta }_i - 2f_i({{\varvec{X}}})]\\&= \sum _{i=1}^p \textrm{E}\Bigg [- {f_i^2({{\varvec{X}}})\over {\alpha }_i} - 2{X_i f_i({{\varvec{X}}})\over {\alpha }_i} -2f_i({{\varvec{X}}}) + {f_i^2({{\varvec{X}}})\over {\alpha }_i q_i}+ {2X_i f_i({{\varvec{X}}})\over {\alpha }_i q_i}\Bigg ]\\&= \sum _{i=1}^p \textrm{E}\Bigg [- {f_i^2({{\varvec{X}}})\over {\alpha }_i} - 2{X_i f_i({{\varvec{X}}})\over {\alpha }_i} -2f_i({{\varvec{X}}})\\&\quad + {(X_i+{\alpha }_i)f_i^2({{\varvec{X}}}+{{\varvec{e}}}_i)\over {\alpha }_i (X_i+1)}+ {2(X_i+{\alpha }_i) f_i({{\varvec{X}}}+{{\varvec{e}}}_i)\over {\alpha }_i}\Bigg ]\\&= {\Delta }_1+{\Delta }_2, \end{aligned}$$

where

$$\begin{aligned} {\Delta }_1&= \sum _{i=1}^p \textrm{E}\Bigg [ {X_if_i^2({{\varvec{X}}}+{{\varvec{e}}}_i)\over X_i+1}+ 2 f_i({{\varvec{X}}}+{{\varvec{e}}}_i)-2f_i({{\varvec{X}}})\Bigg ],\\ {\Delta }_2&= \sum _{i=1}^p \textrm{E}\Bigg [ {X_i f_i^2({{\varvec{X}}}+{{\varvec{e}}}_i)\over {\alpha }_i(X_i+1)} - {f_i^2({{\varvec{X}}})\over {\alpha }_i} +2 {X_if_i({{\varvec{X}}}+{{\varvec{e}}}_i)\over {\alpha }_i}-2{X_if_i({{\varvec{X}}})\over {\alpha }_i}\Bigg ]. \end{aligned}$$

Substituting \(f_i({{\varvec{X}}})=- X_i\phi (Z)/(Z+p-1)\) yields the expressions in Theorem 6.7. It can be easily checked that \({\Delta }_1\le 0\) and \({\Delta }_2\le 0\) under the conditions (a), (b) and (c).

\(\square \)

We investigate the risk performances of the estimators \({{\varvec{X}}}\) and \({{\widehat{{{\varvec{\theta }}}}}}^{\textrm{CZ}}\) under three distributions of \(X_i\): the Poisson distribution \(Po({\theta }_i)\), the geometric distribution \(Geo(1/({\theta }_i+1))\) and the negative binomial distribution \(Nbin(5, 5/({\theta }_i+5))\). The simulation experiment has been conducted with \(p=6\) and \({\theta }_i=k/2\) for \(i=1, \ldots , p\) and \(k=1, \ldots , 10\), and the average losses based on simulation with 10,000 replications are reported in Table 7. From Table 7, it is seen that the improvements of \({{\widehat{{{\varvec{\theta }}}}}}^{\textrm{CZ}}\) and \({{\widehat{{{\varvec{\theta }}}}}}^{\textrm{TS}}\) over \({{\varvec{X}}}\) are significant and robust in the Poisson, geometric and negative binomial distributions.

Table 7 Risks of \({{\varvec{X}}}\) and \({{\widehat{{{\varvec{\theta }}}}}}^{\textrm{CZ}}\) for \(p=6\) and \({\theta }_i=k/2\)

7 Concluding remarks

We conclude the paper with some remarks and extensions. Throughout the paper, we have used moment-generating functions for characterizing and testing the normal, exponential and Poisson distributions. As is well known, however, the use of moment-generating functions is limited to distributions for which they exist. To avoid this limitation, it may be better to use characteristic functions, and the results given in the paper can be extended to arguments based on characteristic functions.

The test of normality has been treated in Sect. 3.2 in the univariate case. For testing multivariate normal distributions, Ebner (2021) provided a test statistic based on the Stein method. Many other statistics for testing multivariate normality have been suggested in the literature. For example, see Mardia (1970), Mecklin and Mundfrom (2004) and Ebner and Henze (2020).

Although the Stein method for distributional approximation in this paper is limited to normal distributions, Ley and Swan (2013) suggested the general density approach, which is applicable to general distributions. Barbour (1988) and Götze (1991) introduced the generator approach, which can adapt the method to many other distributions. For the details, see Reinert (2005).

Finally, we mention some recent developments related to the Stein method. Betsch et al. (2021) suggested new techniques based on the Stein method for estimating parameters. In addition, Betsch et al. (2022) discussed the estimation of parameters in negative binomial distributions and the test of Poissonity based on the Stein method. Betsch and Ebner (2021) provided characterizations of continuous distributions based on indicator functions, and Betsch et al. (2022) applied a similar argument to discrete distributions.