1 Introduction

A central task in Bayesian statistics is estimating the posterior distribution of the parameter \(\theta \) given observations y. The posterior distribution, denoted \(p(\theta \vert y)\), satisfies

$$\begin{aligned} \begin{array}{ll} p(\theta \vert y) = \ \frac{p(y\vert \theta )p(\theta )}{{p}(y)} \propto p(y\vert \theta )p(\theta ), \end{array} \end{aligned}$$
(1)

where \({p}(y) =\int p(y\vert \theta )p(\theta )d\theta \) is the normalizing constant, which is generally expensive to compute, and \(p(y\vert \theta )\) and \(p(\theta )\) denote the likelihood function and the prior distribution, respectively. However, the likelihood \(p(y\vert \theta )\) is not always tractable, owing to large sample sizes and high-dimensional parameters. Approximate Bayesian Computation (ABC) methods provide a likelihood-free approach to statistical inference with Bayesian models  [5, 17, 26]. ABC replaces the calculation of the likelihood function \(p(y\vert \theta )\) in Eq. (1) with a simulation of the model that produces an artificial data set \(\{x_i\}\). The key ingredient of ABC is a metric (or distance) for comparing the simulated data \(\{x_i\}\) with the observed data \(\{y_i\}\)  [6, 15]. Recently, ABC has gained popularity, particularly for the analysis of complex problems arising in the biological sciences (e.g., population genetics, ecology, epidemiology, and systems biology)  [5, 24, 27].

There have been at least three major steps in the development of ABC, which we denote as algorithms \(\mathbb {A}\), \(\mathbb {B}\) and \(\mathbb {C}\). Algorithm \(\mathbb {A}\), the simplest ABC algorithm, proposed in  [25], is as follows:

  • \(\mathbb {A}1\). Sample \(\theta \) from the prior distribution \(p(\theta )\).

  • \(\mathbb {A}2\). Accept the proposed \(\theta \) with probability h proportional to \(p(y\vert \theta )\). Return to \(\mathbb {A}1\).

Concretely, if \(\theta ^{*}\) denotes the maximum-likelihood estimator of \(\theta \), the acceptance probability h can be directly set as:

$$\begin{aligned} \begin{array}{l} h=\frac{p(y\vert \theta )}{c}, \end{array} \end{aligned}$$
(2)

where c can be any constant greater than \(p(y\vert \theta ^*) \). Unfortunately, the likelihood function \(p(y\vert \theta )\) is often computationally expensive or even intractable. Hence Algorithm \(\mathbb {A}\) is not practical.
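Algorithm \(\mathbb {A}\) can be sketched in a case where the likelihood happens to be tractable. The example below is a minimal illustration, not the implementation of [25]: it assumes a single observation y with an exponential likelihood of rate \(\theta \) and a Gamma(1, 1) prior, and the function names are hypothetical.

```python
import math
import random


def likelihood(y, theta):
    """Exponential likelihood with rate theta for a single observation y."""
    return theta * math.exp(-theta * y)


def algorithm_a(y, n_samples, alpha=1.0, beta=1.0, seed=0):
    """Rejection sampler following steps A1 and A2 with h = p(y | theta) / c."""
    rng = random.Random(seed)
    theta_mle = 1.0 / y                      # maximizer of theta * exp(-theta * y)
    c = 1.01 * likelihood(y, theta_mle)      # any constant above p(y | theta*)
    accepted = []
    while len(accepted) < n_samples:
        # A1: sample from the Gamma(alpha, beta) prior.
        # random.gammavariate takes a scale parameter, hence 1 / beta.
        theta = rng.gammavariate(alpha, 1.0 / beta)
        # A2: accept with probability h, Eq. (2).
        h = likelihood(y, theta) / c
        if rng.random() < h:
            accepted.append(theta)
    return accepted
```

In this conjugate setting the accepted samples should follow a Gamma(\(\alpha +1\), \(\beta +y\)) distribution, which gives a simple sanity check on the sampler.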

Many variants have been proposed; one common approach is Algorithm \(\mathbb {B}\)  [19]:

  • \(\mathbb {B}\)1. Sample \(\theta \) from the prior distribution \(p(\theta )\).

  • \(\mathbb {B}\)2. Generate x given the parameter \(\theta \) via the simulator, i.e., \(x \sim p(\cdot \vert \theta )\).

  • \(\mathbb {B}\)3. Accept the proposed \(\theta \) if \(x = y\). Return to \(\mathbb {B}1\).

The success of Algorithm \(\mathbb {B}\) depends on the fact that simulating from \( p(\cdot \vert \theta )\) is easy for any \(\theta \), a basic assumption of ABC. To distinguish the simulated data x from the observation y, we call x a pseudo-observation. Moreover, in Step \(\mathbb {B}\)3, \(\mathbb {S}(x)=\mathbb {S}(y)\) is used instead of \(x=y\) in practice, where \(\mathbb {S}(x)\) denotes the summary statistics of x. It has been shown that if the summary statistics are sufficient, then Algorithm \(\mathbb {B}\) samples correctly from the true posterior distribution. Here, for ease of exposition, we write \(x=y \) instead of \(\mathbb {S}(x) = \mathbb {S}(y)\). However, the acceptance criterion \(x = y\) is too restrictive and makes the acceptance rate intolerably small. One can relax the criterion, as in Algorithm \(\mathbb {C}\)  [21]:

  • \(\mathbb {C}\)1. Sample \(\theta \) from the prior distribution \(p(\theta )\).

  • \(\mathbb {C}\)2. Generate x given the parameter \(\theta \) via the simulator, i.e., \(x \sim p(\cdot \vert \theta )\).

  • \(\mathbb {C}\)3. Calculate the similarity between observations y and simulated data x, denoted \(\rho (x,y)\) (Footnote 1).

  • \(\mathbb {C}\)4. Accept the proposed \(\theta \) if \(\rho (x,y)\ge \xi \) (\(\xi \) is a prespecified threshold). Return to \(\mathbb {C}\)1.

Note that in Step \(\mathbb {C}\)2, several pseudo-observations x are simulated from \(p(\cdot \vert \theta )\) independently, i.e., \(x = \{x_1,...,x_S\}, x_i{\sim } p(\cdot \vert \theta )~i.i.d.\), where S is the number of simulations per proposal, which is fixed and independent of \(\theta \). The similarity \(\rho (x,y)\) can be expressed as the average similarity between \(x_i\) and y, i.e., \(\rho (x,y) = \frac{1}{S} \sum \limits _{i=1}^{S}\pi _\zeta (x_i\vert y),\) where \(\pi _\zeta (\cdot \vert y)\) is a \(\zeta \)-kernel around the observation y (Footnote 2).
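Steps \(\mathbb {C}\)1 to \(\mathbb {C}\)4 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the method of [21]: the simulator returns the mean of `n_obs` exponential draws with rate \(\theta \), the \(\zeta \)-kernel is taken to be an unnormalized Gaussian, the prior is Gamma(1, 1), and all function and argument names are hypothetical.

```python
import math
import random


def simulate(theta, n_obs, rng):
    """Simulator p(. | theta): mean of n_obs exponential draws with rate theta."""
    return sum(rng.expovariate(theta) for _ in range(n_obs)) / n_obs


def kernel(x, y, zeta):
    """Unnormalized Gaussian zeta-kernel around y (an illustrative choice)."""
    return math.exp(-0.5 * ((x - y) / zeta) ** 2)


def abc_rejection(y, n_proposals, S, xi, zeta, n_obs=20, seed=0):
    """ABC rejection with a fixed number S of simulations per proposal."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        theta = rng.gammavariate(1.0, 1.0)                      # C1: prior draw
        xs = [simulate(theta, n_obs, rng) for _ in range(S)]    # C2: S simulations
        rho = sum(kernel(x, y, zeta) for x in xs) / S           # C3: similarity
        if rho >= xi:                                           # C4: threshold test
            accepted.append(theta)
    return accepted
```

Every proposal costs exactly S simulator calls here, regardless of how clear-cut the decision in Step \(\mathbb {C}\)4 turns out to be.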

The choice of S clearly plays a critical role in the efficiency of the algorithm. A large S degrades the efficiency of ABC. In contrast, if S is small, the computation for each \(\theta \) is significantly reduced, but the samples may fail to converge to the target distribution  [4]. Moreover, it is wasteful to spend a large amount of computation (S simulations) for just 1 bit of information, namely whether to accept or reject the proposal. A natural question arises: can we simulate a small number of pseudo-observations in Step \(\mathbb {C}\)2 and still maintain convergence to the target distribution? In other words, can we find a tradeoff between efficiency and accuracy? We claim that this is feasible.

In this paper, we devise the Pre-judgment Rule (PJR), which adjusts the number of simulations dynamically instead of using a constant S. In short, we first generate a small amount of data and estimate a rough similarity. If the similarity is far from the prespecified threshold (\(\xi \) in Step \(\mathbb {C}\)4), we judge (accept/reject) the proposal early. Otherwise, we draw more data from the simulator and repeat the evaluation until we have enough evidence to make the decision. Empirical results show that the majority of these decisions can be made with high confidence from a small number of simulations, so a large amount of computation is saved.

The remainder of the paper is organized as follows. Section 2 describes our algorithm and Sect. 3 provides a theoretical analysis. Section 4.1 uses a toy model to illustrate some properties of the PJR-based methods, and Sect. 4.2 gives empirical evaluations. The last section concludes the paper.

2 Methodology

In this section, we review related work and then present our method. We first show how the pre-judgment rule (PJR) accelerates the ABC rejection method. We then adapt the PJR strategy to the ABC-MCMC framework  [20].

2.1 Related Works

In this section, we briefly review related studies. We first focus on recent developments in the ABC community. Although it allows parallel computation, ABC is still in its infancy owing to its large computational cost. Many approaches have been proposed in the machine learning community to scale up ABC. Concretely,  [22, 29] introduced Gaussian processes to accelerate ABC.  [23] exploited the random seed in the sampling procedure and transformed the ABC sampler into a deterministic optimization procedure.  [21] adapted Hamiltonian Monte Carlo to the ABC scenario, allowing noise in the estimated gradient of the log-likelihood by borrowing ideas from the stochastic gradient MCMC framework  [1, 2, 11, 12, 18, 28] and pseudo-marginal MCMC methods  [3, 14].

In addition, theoretical work has become popular recently  [4, 7, 8, 30], and some work focuses on the selection of summary statistics  [9, 13]. Unlike these methods, the PJR strategy alleviates the computational burden of the ABC rejection step itself, and can therefore be extended to any ABC scenario, e.g., the ABC rejection approach and ABC-MCMC considered in this paper.

2.2 PJR Based ABC: (PJR-ABC)

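In Algorithm \(\mathbb {A}\), the likelihood is not available explicitly. We therefore resort to an approximation by introducing the simulated data x, as follows:

$$\begin{aligned} \begin{array}{ll} p(y\vert \theta ) = \int \delta _D(x-y)\, p(x\vert \theta )\, dx \approx \int \pi _\zeta (x\vert y)\, p(x\vert \theta )\, dx \approx \frac{1}{S} \sum \limits _{i=1}^{S}\pi _\zeta (x_i\vert y), \quad x_i{\sim } p(\cdot \vert \theta )~i.i.d., \end{array} \end{aligned}$$
(3)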

where \(\delta _D(\cdot )\) is the Dirac delta function. A relaxation is then employed by introducing a \(\zeta \)-kernel around the observation y. The last approximate equality uses a Monte Carlo estimate of the likelihood via S draws of x from the simulator \(p(\cdot \vert \theta )\).

On the other hand, for Algorithm \(\mathbb {C}\), the similarity between pseudo-observations x and raw observations y can be expressed as the mean similarity between each simulator output \(x_i\) and y

$$\begin{aligned} \begin{array}{l} \rho (x,y) = \frac{1}{S} \sum \limits _{i=1}^{S}\pi _\zeta (x_i\vert y). \end{array} \end{aligned}$$
(4)

Eqs. (3) and (4) show that Algorithm \(\mathbb {A}\) is essentially equivalent to Algorithm \(\mathbb {C}\). The acceptance conditions in both Step \(\mathbb {A}\)2 and Step \(\mathbb {C}\)4 then amount to a comparison between z and \(z_0\). Specifically, we first compute \( z = \frac{1}{S} \sum \limits _{i=1}^{S}\pi _\zeta (x_i\vert y),\ \text {where}\ x_i{\sim }p(\cdot \vert \theta )\,\,\,{i.i.d.}, \) and then compare it with a constant \(z_0\). If \(z > z_0\), we accept the proposed \(\theta \); if \(z \le z_0\), we reject it. Here \(z_0\) is a prespecified threshold; for example, in Step \(\mathbb {C}\)4, \(z_0\) corresponds to \(\xi \) (Footnote 3).

To guarantee the convergence to the true posterior, S should be a large number, which means each proposal needs S simulations  [4]. However, spending quantities of computation (i.e., simulating S pseudo-data \(x_1,\ldots ,x_S\)) to get just one bit of information, namely whether to accept or reject a proposal, is likely not the best use of computational resources.

To address this issue, PJR is devised to speed up the ABC procedure. We are willing to tolerate a small error in this step to achieve a faster judgment. In particular, we first draw a small number of pseudo-observations x and compute a rough estimate of z. If the difference between this estimate and \(z_0\) is significantly larger than the standard deviation of the estimate, we conclude with confidence that z is far enough from \(z_0\) and make the decision by comparing the rough estimate with \(z_0\). Otherwise, we draw more pseudo-observations to increase the precision of the estimate until we have enough evidence to make the decision.

More formally, checking the acceptance condition can be reformulated as the following statistical hypothesis test: \(H_1: z_0>z\) vs. \(H_2: z_0<z\).

To test the hypothesis, we could in principle generate arbitrarily many pseudo-observations from \( p(\cdot \vert \theta )\). On the other hand, we would like to simulate as few pseudo-observations as possible because of the computational cost.

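To do this, we proceed as follows. Given n pseudo-observations with \(z_i = \pi _\zeta (x_i\vert y)\), we compute the sample mean \(\bar{z}\) and the sample standard deviation \(s_z\) as

$$\begin{aligned} \begin{array}{l} \bar{z} = \frac{1}{n} \sum \limits _{i=1}^{n} z_i, \qquad s_z = \sqrt{\frac{\bar{z^2}-(\bar{z})^2}{n-1}}, \end{array} \end{aligned}$$
(5)

where \(\bar{z^2}\) represents the mean of \(z^2\). Then we compute the test statistic t via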

$$\begin{aligned} \begin{array}{l} t = \frac{\bar{z}-z_0}{s_z}. \end{array} \end{aligned}$$
(6)

We assume that n is large enough for the central limit theorem (CLT) to apply, so that the test statistic t follows the standard Student-t distribution with \(n-1\) degrees of freedom. Note that when n is large, the Student-t distribution with \(n-1\) degrees of freedom is close to the standard normal distribution. We then compute \(\eta \), defined as:

$$\begin{aligned} \begin{array}{l} \eta = 1 - \psi _{n-1}(\vert t\vert ), \end{array} \end{aligned}$$
(7)

where \(\psi _{n-1}(\cdot )\) is the cdf of the standard Student-t distribution with \(n-1\) degrees of freedom.

We then choose a threshold \(\epsilon \), e.g., \(\epsilon = 0.1\). If \(\eta <\epsilon \), we decide that z is significantly different from \(z_0\) and accept or reject \(\theta \) by comparing \(\bar{z}\) with \(z_0\). If \(\eta \ge \epsilon \), we do not have enough evidence to decide, so more pseudo-observations are drawn to reduce the uncertainty in z. Note that once S pseudo-observations have been drawn, the procedure terminates and reduces to the original ABC algorithm. The resulting algorithm is given in Algorithm 1.

The advantage of PJR-ABC is that we can often make confident decisions with \(s_i\) (\(s_i\ll S\)) pseudo-observations and thus reduce computation significantly. Although PJR-ABC introduces errors in the judgment, the computational time saved can be used to draw more samples to offset the small bias. Note that \(\epsilon \) can be regarded as a knob. When \(\epsilon \) approaches 0, we make almost the same decisions as the ABC rejection method but require a large number of simulations. On the other hand, when \(\epsilon \) is large, we make decisions without sufficient evidence and the error is high. This accuracy-efficiency trade-off is verified empirically in Sect. 4.1.

[Figure a: Algorithm 1, PJR-ABC]
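The sequential decision described in this section can be sketched as below. This is an illustrative reading of the procedure, not a transcription of Algorithm 1: it assumes that \(s_z\) is the standard error of the mean, that the first checkpoint of the schedule is at least 2, and it replaces the Student-t cdf \(\psi _{n-1}\) with the standard normal cdf, relying on the large-n approximation mentioned above. The names `pjr_decide`, `draw_z` and `schedule` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def pjr_decide(draw_z, z0, schedule, eps):
    """Sequentially test whether the mean of z exceeds z0.

    draw_z   -- callable returning one z_i = pi_zeta(x_i | y)
    schedule -- increasing checkpoints s_1 < ... < s_k = S (s_1 >= 2 assumed)
    eps      -- knob: stop early once eta < eps
    Returns (accept, number_of_draws_used).
    """
    total = total_sq = 0.0
    n = 0
    z_bar = 0.0
    for k, s in enumerate(schedule):
        while n < s:                       # top up to the next checkpoint
            z = draw_z()
            total += z
            total_sq += z * z
            n += 1
        z_bar = total / n
        if k == len(schedule) - 1:         # all S draws used: plain ABC decision
            break
        var = max(total_sq / n - z_bar * z_bar, 0.0)
        s_z = sqrt(var / (n - 1))          # standard error of z_bar
        if s_z > 0.0:
            # Normal approximation to 1 - psi_{n-1}(|t|), Eqs. (6) and (7).
            eta = 1.0 - NormalDist().cdf(abs(z_bar - z0) / s_z)
        else:
            eta = 0.0 if z_bar != z0 else 1.0
        if eta < eps:                      # enough evidence: decide early
            break
    return z_bar > z0, n
```

When the draws are clearly above or below \(z_0\), the function returns at the first checkpoint; only borderline proposals consume all S draws.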

2.3 PJR Based Markov Chain Monte Carlo Version of ABC: PJR-ABC-MCMC

ABC rejection methods are easy to implement and compatible with embarrassingly parallel computation. However, when the prior distribution is far from the posterior distribution, most samples from the prior are rejected, making the acceptance rate too small, especially in high-dimensional problems. To address this issue, a Markov Chain Monte Carlo version of ABC (ABC-MCMC) was proposed  [20]. MCMC has been the main workhorse of Bayesian computation since the 1990s, and many state-of-the-art MCMC samplers can be extended to the ABC scenario, e.g., Hamiltonian Monte Carlo can be extended to Hamiltonian ABC  [21]. Hence ABC-MCMC  [20] is a benchmark in the ABC community. We now show that the PJR rule can be adapted to the ABC-MCMC framework. First, ABC-MCMC is briefly introduced:

  • \(\mathbb {D}\)1. Given the current point \(\theta \), \(\theta ^\prime \) is proposed according to a transition kernel \(q(\theta ^\prime \vert \theta )\).

  • \(\mathbb {D}\)2. Generate \(x^\prime \) from the simulator \(p(\cdot \vert \theta ^\prime )\).

  • \(\mathbb {D}\)3. Compute the acceptance probability \(\alpha \) defined in Eq. 8.

  • \(\mathbb {D}\)4. Accept \(\theta ^\prime \) with probability \(\alpha \). Otherwise, stay at \(\theta \). Return to \(\mathbb {D}1\).

In an MCMC sampler, the Metropolis-Hastings (MH) acceptance probability \(\alpha \) is defined as

$$\begin{aligned} \begin{array}{lll} \alpha = \min \big \{1,\frac{p(\theta ^\prime ) p(y\vert \theta ^\prime )q(\theta \vert \theta ^\prime )}{p(\theta ) p(y\vert \theta ) q(\theta ^\prime \vert \theta )}\big \}. \end{array} \end{aligned}$$
(8)

In the likelihood-free scenario, the acceptance probability of ABC-MCMC is

$$\begin{aligned} \begin{array}{lll} \alpha = \min \big \{1,\frac{p(\theta ^{\prime })\sum \limits _{s=1}^{S}\pi _\zeta ( x_{s}^\prime \vert y)q(\theta \vert \theta ^{\prime })}{p(\theta ) \sum \limits _{s=1}^{S} \pi _\zeta ( x_{s}\vert y) q(\theta ^{\prime }\vert \theta )}\big \},\\ \end{array} \end{aligned}$$

where \(x_{s}{\sim } p(\cdot \vert \theta )~i.i.d.\) and \(x_{s}^\prime {\sim } p(\cdot \vert \theta ^\prime )~i.i.d.\) The proposal is accepted if

$$\begin{aligned} \begin{array}{ll} u < \alpha = \min \big \{1,\frac{p(\theta ^{\prime })\sum \limits _{s=1}^{S}\pi _\zeta ( x^\prime _{s}\vert y)q(\theta \vert \theta ^{\prime })}{p(\theta )\sum \limits _{s=1}^{S} \pi _\zeta ( x_{s}\vert y) q(\theta ^{\prime }\vert \theta )}\big \},\\ \end{array} \end{aligned}$$

where \(u\sim \text {Uniform}(0,1)\). Since \(u<1\), the minimum with 1 can be dropped, so this is equivalent to the following expression:

$$\begin{aligned} \begin{array}{ll} u < \frac{p(\theta ^{\prime }) \frac{1}{S}\sum \limits _{s=1}^{S} \pi _\zeta (x^\prime _{s} \vert y) q(\theta \vert \theta ^{\prime })}{p(\theta ) \frac{1}{S} \sum \limits _{s=1}^{S} \pi _\zeta (x_{s}\vert y) q(\theta ^{\prime }\vert \theta )}.\\ \end{array} \end{aligned}$$

Note that \(\{x_1,...,x_S\}\) is given in ABC-MCMC. Defining the fixed part \(z_0\) and the test variable z, we obtain

$$\begin{aligned} \begin{array}{l} z_0 =\frac{p(\theta )}{p(\theta ^{\prime })} \frac{1}{S} \sum \limits _{s=1}^{S} \pi _\zeta (x_{s}\vert y) \frac{q(\theta ^{\prime }\vert \theta )}{q(\theta \vert \theta ^{\prime })} u,\,\,\, z = \frac{1}{S} \sum \limits _{s=1}^{S} \pi _\zeta (x^\prime _{s}\vert y), \\ \end{array} \end{aligned}$$

where z can be further simplified into the following form, similar to PJR-ABC: \( z = \frac{1}{S} \sum \limits _{i=1}^{S} z_i, \) where \(z_i = \pi _\zeta ( x^\prime _{i}\vert y)\).

Following PJR-ABC, we test the hypothesis \(H_1: z_0>z\) vs. \(H_2:z_0<z\). The sample mean \(\bar{z}\), the sample standard deviation \(s_z\) and the test statistic t are then calculated as in Eqs. (5) and (6), as in PJR-ABC. The resulting algorithm is similar to Algorithm 1 and is not listed.
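The reformulation of the MH test as a comparison between z and \(z_0\) can be checked numerically. The sketch below is illustrative only: the function and argument names are hypothetical, `z_cur` and `z_prop` stand for the kernel averages at \(\theta \) and \(\theta ^\prime \), and all densities are assumed to be strictly positive.

```python
def mh_threshold(u, prior_cur, prior_prop, z_cur, q_fwd, q_bwd):
    """Fixed part z_0 = p(theta)/p(theta') * z_cur * q(theta'|theta)/q(theta|theta') * u."""
    return (prior_cur / prior_prop) * z_cur * (q_fwd / q_bwd) * u


def accepts_by_alpha(u, prior_cur, prior_prop, z_cur, z_prop, q_fwd, q_bwd):
    """Standard form of the test: accept when u < alpha."""
    ratio = (prior_prop * z_prop * q_bwd) / (prior_cur * z_cur * q_fwd)
    return u < min(1.0, ratio)


def accepts_by_threshold(u, prior_cur, prior_prop, z_cur, z_prop, q_fwd, q_bwd):
    """Equivalent form used by PJR: accept when z > z_0."""
    return z_prop > mh_threshold(u, prior_cur, prior_prop, z_cur, q_fwd, q_bwd)
```

Because only `z_prop` depends on new simulations at \(\theta ^\prime \), the threshold \(z_0\) can be computed once per iteration and the sequential test applied to the draws that make up `z_prop`. With a symmetric proposal, such as the Gaussian centered at the current \(\theta \) used in Sect. 4.1, `q_fwd` and `q_bwd` cancel.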

3 Theoretical Analysis

In this section, we study the theoretical properties of the PJR strategy. Specifically, we provide an error analysis for both PJR-ABC and PJR-ABC-MCMC. Every time we accept or reject a proposal in PJR-ABC or PJR-ABC-MCMC, we solve a hypothesis testing problem. We first bound the error caused by a single test, and then relate this single-test error to the total error of PJR-ABC and PJR-ABC-MCMC. In a hypothesis testing problem, two types of error are distinguished: a type I error is the incorrect rejection of a true hypothesis, while a type II error is the failure to reject a false hypothesis. We now discuss the probabilities of these two errors in a single decision problem.

Theorem 1

The probabilities of both type I and type II errors decrease approximately exponentially w.r.t. the sample size of z (the sample size of z corresponds to \(s_1,\ldots ,s_k\) in Algorithm 1).

Proof

Let \(\psi _{n-1}(\cdot )\) denote the cdf of the standard Student-t distribution with \(n-1\) degrees of freedom. For simplicity, we first discuss the probability of a type I error, i.e., the incorrect rejection of a true hypothesis. The conclusion extends easily to the type II error by symmetry.

In this case, \(z > z_0\). Suppose the number of sampled z is n. The test statistic t satisfies \(t = \frac{\bar{z}-z_0}{s_z}\) and follows the standard Student-t distribution with \(n-1\) degrees of freedom. The standard Student-t distribution approaches the standard normal distribution when \(n-1\) is large enough. Hence, many properties of the normal distribution carry over.

Given the knob parameter \(\epsilon \), by the monotonicity of the function \(\psi _{n-1}(\cdot )\) on \(\mathbb {R}\), there exists a unique s such that \(\psi _{n-1}(s) = \epsilon \). Moreover, since \(\bar{z} = \frac{z_1+z_2+\ldots +z_n}{n}\) and \(t = \frac{\bar{z}-z_0}{s_z}\sim \psi _{n-1}(\cdot ) \approx \mathcal {N}(0,1)\), the \(z_i\) can be regarded as independent and identically distributed samples from \(\mathcal {N}(z_0,ns_z)\), i.e., \(z_i{\sim }\mathcal {N}(z_0,ns_z)\,\,\,i.i.d\).

A type I error occurs only when \(\frac{\bar{z} - z_0}{s_z} < s\), that is, when \(\sum _{i=1}^{n}z_i < n(s_zs+z_0)\). Thus, the probability of a type I error is obtained by integrating over the region of \((z_1,z_2,\ldots ,z_n) \) in which \( \sum _{i=1}^{n}z_i < n(s_zs+z_0)\).

where \(\psi '(\cdot )\) and \(\psi _{n-1}(\cdot )\) represent the pdf and cdf of the standard Student-t distribution with \(n-1\) degree of freedom.

This completes the proof.
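As a rough illustration of why the error probability decays exponentially in the sample size, consider the following heuristic under a normal approximation. It is a sketch for intuition, not the derivation used in the proof above; the symbols \(\mu \) and \(\sigma ^2\) (mean and variance of \(z_i\)) and \(\Phi \) (standard normal cdf) are introduced here for illustration only. If \(\bar{z}\) is approximately \(\mathcal {N}(\mu ,\sigma ^2/n)\) with \(\mu \ne z_0\), the probability that \(\bar{z}\) falls on the wrong side of \(z_0\) satisfies

$$\begin{aligned} \begin{array}{l} \Phi \Big (-\frac{\sqrt{n}\,\vert \mu - z_0\vert }{\sigma }\Big ) \le \frac{1}{2}\exp \Big (-\frac{n(\mu - z_0)^2}{2\sigma ^2}\Big ), \end{array} \end{aligned}$$

which uses the standard Gaussian tail bound \(\Phi (-a)\le \frac{1}{2}e^{-a^2/2}\) for \(a\ge 0\). The right-hand side shrinks geometrically as n grows, and it does so faster when \(\mu \) is far from \(z_0\), which matches the observation that clear-cut proposals can be decided with few simulations.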

The above theorem shows that the error in a single judgment is negligible as long as the number of sampled z is large enough. Based on this theorem, the following assumption is reasonable.

Assumption 1

The probability of error produced by a single hypothesis testing problem can be upper-bounded by \(\delta _{1}\) for PJR-ABC and by \(\delta _2\) for PJR-ABC-MCMC, where \(\delta _{1}, \delta _2\rightarrow {}0_+\).

In Bayesian inference, we are interested in the posterior average, defined as \(\bar{\phi } \triangleq \int _{\mathcal {\theta }} \phi (\theta ) p(\theta \vert y) d\theta \) for some test function \(\phi (\theta )\) of interest. For a given numerical method (say, PJR-ABC or PJR-ABC-MCMC) with generated samples \(\{\theta _1, \ldots ,\theta _M\}\), we use the sample average \(\hat{\phi }\), defined as \(\hat{\phi } = 1/M\sum _{l=1}^{M}\phi (\theta _l)\), to approximate \(\bar{\phi }\). Before providing a bound for the bias of the PJR-ABC algorithm, we make a mild assumption.

Assumption 2

The prior average of \(\phi (\cdot )\) is finite, i.e.,

$$\begin{aligned} \int _\theta \phi (\theta )p(\theta )d\theta < +\infty . \end{aligned}$$

Theorem 2

Under Assumptions 1 and 2, the bias of PJR-ABC can be upper-bounded as: \(\begin{array}{l} \vert \mathbb {E}\hat{\phi } - \bar{\phi }\vert \le C_1 \delta _{1}, \end{array} \) where \(C_1 = \frac{\int _\theta \phi (\theta ) p(\theta ) d\theta }{ p(y) }\) is a constant and p(y) denotes the normalizing constant.

Proof

In the ABC rejection method, each \(\theta \) drawn from \(p(\theta )\) is independent. The error at \(\theta \) caused by PJR is denoted by \(\xi (\theta )\) and is modeled as a perturbation of the true likelihood. The estimated likelihood function can thus be written as \(\hat{p}(y\vert \theta ) = p(y\vert \theta ) + \xi (\theta )\), where \(\vert \xi (\theta )\vert \le \delta _{1}\) by the bound on the single-test error in Assumption 1.

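Substituting \(\hat{p}(y\vert \theta )\) into the posterior average gives

$$\begin{aligned} \begin{array}{lll} \mathbb {E}\hat{\phi } = \frac{1}{p(y)} \int \phi (\theta )\hat{p}(y\vert \theta )p(\theta )d\theta = \int \phi (\theta ) p(\theta \vert y) d\theta + \frac{1}{p(y)} \int \phi (\theta )\xi (\theta )p(\theta )d\theta . \end{array} \end{aligned}$$
(9)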

The first term on the RHS of Eq. (9) is the expectation under the true posterior distribution, while the second term is the error. The error is upper bounded as follows:

$$\begin{aligned} \begin{array}{lll} \frac{1}{p(y)} \int \phi (\theta )\xi (\theta )p(\theta )d\theta \le \frac{1}{p(y)} \vert \delta _{1}\vert \int \phi (\theta )p(\theta )d\theta = C_1 \vert \delta _{1}\vert , \\ \end{array} \end{aligned}$$

where \(C_1 = \frac{\int \phi (\theta )p(\theta )d\theta }{p(y)}\) is bounded because both \(\frac{1}{p(y)} \) and \(\int \phi (\theta )p(\theta )d\theta \) are finite.

This completes the proof.

In PJR-ABC, the samples are independent of each other. In PJR-ABC-MCMC, however, all the samples belong to a single chain, which makes the analysis more complicated. Here, the distance between probability distributions is measured by the total variational distance (TVD) (Footnote 4), as described below.

Theorem 3

Under Assumption 1, for any posterior distribution, there exists a constant \(C_2\) such that the discrepancy between the true posterior distribution \(\mathcal {S}_0\) and the stationary distribution \(\mathcal {S}_\epsilon \) of our PJR-ABC-MCMC algorithm can be upper bounded as: \( \begin{array}{ll} d_v(\mathcal {S}_0,\mathcal {S}_\epsilon ) \le C_2 \delta _{2}. \\ \end{array} \)

Proof

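We first focus on the error of a single step and then derive the error in the stationary distribution. The transition kernel of the ABC-MCMC algorithm can be written as

$$\begin{aligned} \begin{array}{l} \mathcal {T}_0(\theta ,\theta ^{\prime }) = P_a(\theta ,\theta ^{\prime })\, q(\theta ^{\prime }\vert \theta ) + \Big (1 - \int P_a(\theta ,\vartheta )\, q(\vartheta \vert \theta )\, d\vartheta \Big )\, \delta _D(\theta ^{\prime }-\theta ), \end{array} \end{aligned}$$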

where \(\delta _D(\cdot ) \) is the Dirac delta function and \(P_a(\theta ,\theta ^\prime )\) is the acceptance probability. The transition kernel \(\mathcal {T}_\epsilon (\theta ,\theta ^{\prime })\) of PJR-ABC-MCMC and its acceptance probability \(P_{a,\epsilon }(\theta ,\theta ^{\prime })\) are defined analogously.

The discrepancy between \(P_a(\theta ,\theta ^{\prime }) \) and \(P_{a,\epsilon }(\theta ,\theta ^{\prime })\) is defined as \(\delta P_a(\theta ,\theta ^{\prime }) \triangleq P_{a,\epsilon }(\theta ,\theta ^{\prime }) - P_a(\theta ,\theta ^{\prime })\). For every \((\theta ,\theta ^{\prime })\), by the bound on the single-test error, \(\delta P_a(\theta ,\theta ^{\prime })\) is bounded, i.e., \(\vert \delta P_a(\theta ,\theta ^{\prime }) \vert \le \delta _{\text {max}}\) for all \((\theta ,\theta ^{\prime })\).

Then the total variational distance for a single step can be upper bounded for any distribution P as:

Applying Lemma 1 with \(\delta = 2\delta _{\text {max}}\) in Eq. (10) then proves Theorem 3. This completes the proof.

Lemma 1

  [16]. Given two transition kernels, \(\mathcal {T}_0\) and \(\mathcal {T}_\epsilon \), whose stationary distributions are denoted by \(\mathcal {S}_0 \) and \(\mathcal {S}_\epsilon \), if \(\mathcal {T}_0\) satisfies the following contraction condition with a constant \(\eta \in [0,1) \) for all probability distributions \(\mathcal {P}\):

$$\begin{aligned} \begin{array}{l} d_v(\mathcal {P}\mathcal {T}_0,\mathcal {S}_0) \le \eta d_v(\mathcal {P},\mathcal {S}_0) \\ \end{array} \end{aligned}$$

and the one step error between \(\mathcal {T}_0\) and \(\mathcal {T}_\epsilon \) is upper bounded uniformly with a constant \(\delta >0\) as:

$$\begin{aligned} \begin{array}{l} d_v(\mathcal {P}\mathcal {T}_0,\mathcal {P}\mathcal {T}_\epsilon ) \le \delta ,\forall \mathcal {P} \\ \end{array} \end{aligned}$$
(10)

then the distance between \(\mathcal {S}_0\) and \(\mathcal {S}_\epsilon \) can be bounded as: \(\begin{array}{ll} d_v(\mathcal {S}_0,\mathcal {S}_\epsilon ) \le \frac{\delta }{1-\eta }. \\ \end{array} \)
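As a worked instance of how Lemma 1 is used in the proof of Theorem 3, substituting \(\delta = 2\delta _{\text {max}}\) into the bound of the lemma gives

$$\begin{aligned} \begin{array}{l} d_v(\mathcal {S}_0,\mathcal {S}_\epsilon ) \le \frac{2\delta _{\text {max}}}{1-\eta }. \end{array} \end{aligned}$$

If \(\delta _{\text {max}}\) is identified with the single-test bound \(\delta _2\) of Assumption 1 (an identification the text does not state explicitly), one admissible constant in Theorem 3 is \(C_2 = \frac{2}{1-\eta }\). For example, with the purely hypothetical values \(\eta = 0.9\) and \(\delta _{\text {max}} = 0.01\), the bound evaluates to \(0.02/0.1 = 0.2\), which shows how a slowly contracting chain (\(\eta \) close to 1) amplifies the per-step error.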

Theorems 2 and 3 indicate that the error is proportional to the single-test error. Combining this with Theorem 1, we conclude that the bias of both PJR-ABC and PJR-ABC-MCMC can be bounded.

4 Numerical Validation

In this section, we use a toy model to demonstrate both PJR-ABC and PJR-ABC-MCMC.

4.1 Synthetic Data

We adopt a gamma prior with shape \(\alpha \) and rate \(\beta \), i.e., \(p(\theta ) = \text {Gamma}(\alpha ,\beta )\). The likelihood is an exponential distribution, i.e., \(x\sim \text {exp}(1/\theta )\). The observation is generated via \(y = \frac{1}{N}\sum \nolimits _{i=1}^N e_i\), where \(e_i\sim \text {exp}(1/\theta ^*)\) and N is the number of observations. Regarding the selection of the sequence \(\{s_i\}_{i=1}^{k}\) (\(s_0=0\)), we find that a geometric sequence is usually the best choice, so it is used in both Sect. 4.1 and 4.2. The common ratio of the geometric sequence is usually set between 1.5 and 2. The true posterior is a gamma distribution with shape \(\alpha + N\) and rate \(\beta +Ny\), i.e., \(p(\theta \vert y) = \text {Gamma}(\alpha +N,\beta +Ny)\). In particular, we set \(S=1000\), \(N = 20\), \(y = 7.74\), \(\alpha = \beta = 1\), \(\theta ^* = 0.15\) in this scenario. We run chains of length 50K for ABC-MCMC and PJR-ABC-MCMC and draw 100K samples for ABC and PJR-ABC. For each method, we conduct 5 independent trials and report the average value. In this paper, the proposal distribution in both ABC-MCMC and PJR-ABC-MCMC is a Gaussian distribution centered at the current \(\theta \).
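The quantities of this setup can be computed directly. The sketch below evaluates the parameters of the true posterior for the stated values and builds one possible geometric checkpoint sequence; the first term 10 and the ratio 2 are assumptions chosen for illustration (the text only gives the range 1.5 to 2 for the ratio), and the function names are hypothetical.

```python
ALPHA, BETA = 1.0, 1.0      # prior shape and rate
N_OBS, Y_BAR = 20, 7.74     # number of observations and their mean
S_MAX = 1000                # maximum number of simulations per proposal


def posterior_params(alpha, beta, n_obs, y_bar):
    """Shape and rate of the true posterior Gamma(alpha + N, beta + N * y)."""
    return alpha + n_obs, beta + n_obs * y_bar


def geometric_schedule(first, ratio, s_max):
    """Checkpoints s_1 < s_2 < ... < s_k = s_max growing geometrically."""
    schedule, s = [], float(first)
    while s < s_max:
        schedule.append(int(round(s)))
        s *= ratio
    schedule.append(s_max)
    return schedule


shape, rate = posterior_params(ALPHA, BETA, N_OBS, Y_BAR)
posterior_mean = shape / rate            # mean of a Gamma(shape, rate) distribution
schedule = geometric_schedule(10, 2.0, S_MAX)
```

With these values the posterior is Gamma(21, 155.8), whose mean is about 0.135, reasonably close to \(\theta ^* = 0.15\) given only \(N = 20\) observations.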

First, we investigate empirically how the performance (both efficiency and accuracy) changes as a function of the knob \(\epsilon \). For each \(\epsilon \in \{0,0.01, 0.03, 0.07, 0.1, 0.2, 0.3\}\), we record both efficiency (Footnote 5) and accuracy (Footnote 6). Setting \(\epsilon =0\) means that PJR-ABC/PJR-ABC-MCMC reduces to the ABC/ABC-MCMC approach. The results are reported in Fig. 1. We find that a smaller \(\epsilon \) usually leads to higher accuracy and lower efficiency, confirming the statement about \(\epsilon \) in Sect. 2. Hence, the empirical trade-off between efficiency and accuracy can be controlled by adjusting \(\epsilon \). In the following, we set \(\epsilon =0.1\). In Fig. 3, we show the trace plots of the last 1K samples from a single chain for ABC-MCMC and PJR-ABC-MCMC. The result is positive, indicating that PJR-ABC-MCMC preserves the ability to explore the parameter space compared with ABC-MCMC. The empirical histograms of \(\theta \) for all the methods are presented in Fig. 2. We find that all of them are close to the desired posterior. In Table 1 we show

  • the average Total Variational Distance (Footnote 7) between the true posterior and the ABC posteriors, and the corresponding standard deviation, using the first 10K samples and the whole chain;

  • the average number of simulators.

We observe that our PJR-based ABC rejection and ABC-MCMC achieve results similar to the original algorithms in terms of convergence to the target posterior distribution. Furthermore, the PJR strategy accelerates both ABC and ABC-MCMC in terms of the number of simulations.

Fig. 1. Demonstration problem. TVD and number of simulations as a function of the knob \(\epsilon \).

Fig. 2. Demonstration problem. The empirical histograms of \(\theta \) for all the methods.

Fig. 3. Demonstration problem. Trace plot of last 1K samples, where \(\epsilon =0.1\).

Fig. 4. Ricker model. Empirical histogram of parameter \(\theta = (\log r,\sigma , \phi )\) generated by ABC-MCMC.

Fig. 5. Ricker model. Empirical histogram of parameter \(\theta = (\log r,\sigma , \phi )\) generated by PJR-ABC-MCMC.

Fig. 6. Ricker model. Trajectories of each pair of two parameters over the last 200 time-steps generated by our PJR-ABC-MCMC.

Table 1. Results for the demonstration problem in terms of TVD (Total Variational Distance) and number of simulators. Note that for TVD the value below is the actual value times 100 (mean ± std). Simulators represent the total number of pseudo-observations from the simulator. For the first two approaches, we draw 100K samples while for the last two approaches, 50K samples are drawn.

4.2 Real Applications

The Popular Ricker Model. In this section, we apply our method to the popular Ricker model  [31]. The Ricker model, a classic discrete population model used in ecology, gives the expected number of individuals in the current generation as a function of the number of individuals in the previous generation. This model is commonly used as an example of a complex model  [29] because its near-chaotic dynamics cause standard statistical methods to break down  [31]. In particular, \(N_t\) denotes the unobserved number of individuals in the population at time t, while the number of observed individuals is denoted by \(Y_t\). The Ricker model is defined via the following relationships  [31]

$$\begin{aligned} \begin{array}{l} N_{t+1} = r N_t \exp (-N_t + e_t), \quad Y_t \sim \text {Poisson}(\phi N_t), \quad e_t \sim \mathcal {N}(0,\sigma ^2), \end{array} \end{aligned}$$

where each \(e_t\) (\(t = 1,2,...,\)) is independent and \(Y_t\) only depends on \(N_t\). In this model, the parameter vector is \(\theta = \{\log r, \sigma ^2, \phi \}\). \(y_{1:T} = \{y_1,...,y_T \}\in \mathbb {R}^T\) is the time-series of observations. For each parameter, we adopt the uniform prior as

The target distribution is the posterior of \(\theta \) given observations \(y_{1:T}\), i.e., \(p(\theta \vert y_{1:T})\). An artificial dataset is generated using \({\theta }^* = (3.8,0.3,10.0)\). We compare the PJR-ABC-MCMC method with ABC-MCMC. For ABC-MCMC, we run the simulator \(S = 2000\) times at each \(\theta \) to approximate the likelihood value. The knob \(\epsilon \) is set to 0.1. For summary statistics, we follow the methods described in  [29], which use a collection of phase-invariant measures, such as coefficients of polynomial autoregressive models.

Effectiveness: Figures 4 and 5 show the empirical histograms of the parameters of interest \(\theta = (\log r,\sigma , \phi )\) generated by ABC-MCMC and PJR-ABC-MCMC, respectively. Furthermore, we present scatter plots of the trajectories for every pair of parameters in Fig. 6. We observe that the mode of the empirical posterior is close to \({\theta }^*\) and that the posteriors produced by the two algorithms are similar, showing the success of PJR-ABC-MCMC on the Ricker model.

Efficiency: The simulation procedure is complex and dominates the computational time. Therefore, the running time of the samplers is almost proportional to the number of simulations. Specifically, to sample 1K parameters, ABC-MCMC requires 2M simulations (\(S=2000\)) while PJR-ABC-MCMC requires only about 371K. We conclude that the majority of the decisions can be made with high confidence from a small number of simulations. Hence, our PJR strategy greatly accelerates the ABC-MCMC algorithm on the Ricker model.
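The reported counts translate into the following savings, using only the figures quoted above:

$$\begin{aligned} \begin{array}{l} 1000 \times 2000 = 2\times 10^{6}, \qquad \frac{2\times 10^{6}}{3.71\times 10^{5}} \approx 5.4, \qquad \frac{3.71\times 10^{5}}{1000} = 371. \end{array} \end{aligned}$$

That is, PJR-ABC-MCMC uses roughly 5.4 times fewer simulations overall, or about 371 simulations per sampled parameter on average instead of the fixed 2000.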

4.3 Apply to HABC-SGLD

In this part, we apply our method to the SGLD (Stochastic Gradient Langevin Dynamics,  [28]) version of HABC (Hamiltonian ABC) proposed in  [21].

In each iteration of SGLD, a mini-batch \(\mathcal {X}_n\) of size n is drawn to estimate the gradient of log-posterior. The proposal is

It can be shown that when the stepsize \(\alpha \) approaches zero, the acceptance probability approaches 1  [28]. Based on this, the MH correction step is omitted. However, the assumption that \(\alpha \rightarrow {} 0\) is too restrictive. In practice, to keep the mixing rate high, a reasonably large \(\alpha \) is chosen. In this situation, SGLD may fail to converge to the target distribution in some cases. The detailed reasons can be found in  [16].

In ABC scenarios, the conventional MH rejection step is time-consuming, so our method fits this problem naturally. Specifically, we consider an L1-regularized linear regression model. This model has been used in  [16] to explain the necessity of MH rejection in SGLD. We explore its effectiveness in the ABC scenario.

We are given a dataset \(\{u_i,v_i\}_{i=1}^N\), where \(u_i\) are the predictors and \(v_i \) are the targets. A Gaussian error model and a Laplacian prior for the parameter \(\theta \in \mathbb {R}^D \) are adopted, i.e., \(p(v\vert u,\theta ) \propto \text {exp}(-\frac{\lambda }{2} (v-\theta ^Tu)^2)\) and \(p(\theta ) \propto \exp (-\lambda _0\Vert \theta \Vert _1)\). We generate a synthetic dataset of size \(N=10000\) via \(v_i = \theta _0^Tu_i + \xi \), where \(\xi \sim \mathcal {N}(0,1/3)\) and \(\theta _0 = 0.5\), following  [16]. For pedagogical reasons, we set \(D=1 \). Furthermore, we choose \(\lambda = 1\) and \(\lambda _0 = 4700\) so that the prior is not washed out by the likelihood.
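For orientation, a single uncorrected SGLD update for this model can be sketched as follows. This uses the standard SGLD form of [28] (half-step drift along a minibatch estimate of the log-posterior gradient plus Gaussian noise with variance equal to the stepsize) and is not necessarily the exact proposal used in the experiments. The predictor distribution (standard normal), the reading of \(1/3\) as a variance, and all function names are assumptions made for illustration.

```python
import math
import random


def make_data(n, theta0=0.5, noise_var=1.0 / 3.0, seed=0):
    """Synthetic (u, v) pairs with v = theta0 * u + noise (D = 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)          # assumed predictor distribution
        v = theta0 * u + rng.gauss(0.0, math.sqrt(noise_var))
        data.append((u, v))
    return data


def sgld_step(theta, data, batch_size, step, rng, lam=1.0, lam0=4700.0):
    """One SGLD proposal without MH correction."""
    n_total = len(data)
    batch = rng.sample(data, batch_size)
    # Gradient of the Gaussian log-likelihood, summed over the minibatch.
    grad_lik = sum(lam * (v - theta * u) * u for u, v in batch)
    # Subgradient of the Laplacian log-prior -lam0 * |theta|.
    grad_prior = -lam0 * ((theta > 0) - (theta < 0))
    drift = 0.5 * step * (grad_prior + (n_total / batch_size) * grad_lik)
    return theta + drift + rng.gauss(0.0, math.sqrt(step))
```

In a corrected sampler, each value returned by `sgld_step` would be treated as a proposal \(\theta ^\prime \) and passed through an accept/reject test, which is the step that PJR is meant to make cheaper.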

Here, a standard MCMC sampler is employed as the baseline. We run HABC-SGLD without rejection and HABC-SGLD with rejection (PJR-HABC-SGLD). The empirical histograms of the samples obtained by the different samplers are shown in Fig. 7. We observe that the empirical histogram of samples obtained from PJR-HABC-SGLD is much closer to that of the standard MCMC sampler than that of HABC-SGLD, which verifies the effectiveness of PJR-HABC-SGLD.

Fig. 7. Application to HABC-SGLD. Empirical histogram of samples obtained by different samplers. HABC-SGLD fails to converge to the posterior distribution in this situation, whereas the PJR-corrected version of HABC-SGLD converges to the posterior.

5 Conclusion

In this paper, we have proposed the pre-judgment rule (PJR) to accelerate ABC methods. Algorithms adapted to the ABC rejection method and ABC-MCMC are provided as PJR-ABC and PJR-ABC-MCMC, respectively. We analyze the error bound of the PJR strategy. Our experiments indicate its practical value in terms of accuracy and efficiency. Finally, as a future direction, we plan to integrate the PJR strategy with neural networks as in  [24].