Abstract
Approximate Bayesian Computation (ABC) is a popular approach to Bayesian modeling when the model has an intractable likelihood. However, each ABC proposal requires a large number of simulations, and each simulation is typically time-consuming. The overall goal of this work is to reduce the computational cost of ABC. We propose a pre-judgment rule (PJR), which judges the acceptance condition using a small fraction of the simulations instead of all of them, thereby lowering the computational cost. In addition, we provide a theoretical study of the error bound induced by the PJR strategy. Finally, we illustrate the methodology with several examples. The empirical results show both the effectiveness and the efficiency of PJR compared with previous methods.
1 Introduction
A central task in Bayesian statistics is to estimate the posterior distribution of the parameter \(\theta \) given observations y. The posterior distribution, denoted \(p(\theta \vert y)\), satisfies
\[ p(\theta \vert y) = \frac{p(y\vert \theta )\,p(\theta )}{p(y)}, \tag{1} \]
where \({p}(y) =\int p(y\vert \theta )p(\theta )d\theta \) is the normalizing constant, which is generally expensive to compute, and \(p(y\vert \theta )\) and \(p(\theta )\) denote the likelihood function and the prior distribution, respectively. However, the likelihood \(p(y\vert \theta )\) is not always tractable, owing to large sample sizes and high-dimensional parameters. Approximate Bayesian Computation (ABC) methods provide a likelihood-free approach to statistical inference with Bayesian models [5, 17, 26]. The ABC method replaces the calculation of the likelihood function \(p(y\vert \theta )\) in Eq. (1) with a simulation of the model that produces an artificial data set \(\{x_i\}\). The key ingredient of ABC is to construct a metric (or distance) and compare the simulated data \(\{x_i\}\) with the observed data \(\{y_i\}\) [6, 15]. Recently, ABC has gained popularity, particularly for the analysis of complex problems arising in the biological sciences (e.g., population genetics, ecology, epidemiology, and systems biology) [5, 24, 27].
The development of ABC has seen at least three major steps, which we denote as algorithms \(\mathbb {A}\), \(\mathbb {B}\) and \(\mathbb {C}\). Algorithm \(\mathbb {A}\), the simplest ABC algorithm, proposed in [25], is as follows:
- \(\mathbb {A}1\). Sample \(\theta \) from the prior distribution \(p(\theta )\).
- \(\mathbb {A}2\). Accept the proposed \(\theta \) with probability h proportional to \(p(y\vert \theta )\). Return to \(\mathbb {A}1\).
Concretely, if \(\theta ^{*}\) denotes the maximum-likelihood estimator of \(\theta \), the acceptance probability h can be set directly as
\[ h = \frac{p(y\vert \theta )}{c}, \tag{2} \]
where c can be any constant greater than \(p(y\vert \theta ^*) \). Unfortunately, the likelihood function \(p(y\vert \theta )\) is computationally expensive or even intractable. Hence Algorithm \(\mathbb {A}\) is not practical.
Many variants have been proposed; a common one is Algorithm \(\mathbb {B}\) [19]:
- \(\mathbb {B}\)1. Sample \(\theta \) from the prior distribution \(p(\theta )\).
- \(\mathbb {B}\)2. Generate x given the parameter \(\theta \) via the simulator, i.e., \(x \sim p(\cdot \vert \theta )\).
- \(\mathbb {B}\)3. Accept the proposed \(\theta \) if \(x = y\). Return to \(\mathbb {B}1\).
The success of Algorithm \(\mathbb {B}\) rests on the basic assumption of ABC that simulating from \( p(\cdot \vert \theta )\) is easy for any \(\theta \). To distinguish the simulated data x from the observation y, we call x a pseudo-observation. In practice, Step \(\mathbb {B}\)3 uses \(\mathbb {S}(x)=\mathbb {S}(y)\) instead of \(x=y\), where \(\mathbb {S}(x)\) denotes the summary statistics of x. It has been shown that if the summary statistics are sufficient, then Algorithm \(\mathbb {B}\) samples correctly from the true posterior distribution. For ease of exposition, we write \(x=y \) instead of \(\mathbb {S}(x) = \mathbb {S}(y)\). However, the acceptance criterion \(x = y\) is too restrictive and makes the acceptance rate intolerably small. One can relax the criterion as in Algorithm \(\mathbb {C}\) [21]:
- \(\mathbb {C}\)1. Sample \(\theta \) from the prior distribution \(p(\theta )\).
- \(\mathbb {C}\)2. Generate x given the parameter \(\theta \) via the simulator, i.e., \(x \sim p(\cdot \vert \theta )\).
- \(\mathbb {C}\)3. Calculate the similarity between observations y and simulated data x, denoted \(\rho (x,y)\) (see Note 1).
- \(\mathbb {C}\)4. Accept the proposed \(\theta \) if \(\rho (x,y)\ge \xi \) (\(\xi \) is a prespecified threshold). Return to \(\mathbb {C}\)1.
Notice that in Step \(\mathbb {C}\)2, a set of S pseudo-observations is simulated independently from \(p(\cdot \vert \theta )\), i.e., \(x = \{x_1,...,x_S\}, x_i{\sim } p(\cdot \vert \theta )~i.i.d.\), where S is the number of simulations per proposal, which is fixed and independent of \(\theta \). The similarity \(\rho (x,y)\) can be expressed as the average similarity between each \(x_i\) and y, \(\rho (x,y) = \frac{1}{S} \sum \limits _{i=1}^{S}\pi _\zeta (x_i\vert y),\) where \(\pi _\zeta (\cdot \vert y)\) is a \(\zeta \)-kernel around the observation y (see Note 2).
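Steps \(\mathbb {C}\)1 to \(\mathbb {C}\)4 can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in the paper: the Gaussian kernel follows Note 2, while the toy prior, the toy simulator and all numerical values (`xi`, `zeta`, `S`) are hypothetical choices made for the example.

```python
import math
import random

def gaussian_kernel(x, y, zeta):
    """zeta-kernel of Note 2: Gaussian density in x centred at y with width zeta."""
    return math.exp(-((x - y) ** 2) / (2.0 * zeta ** 2)) / (math.sqrt(2.0 * math.pi) * zeta)

def similarity(xs, y, zeta):
    """rho(x, y): average kernel value over the S pseudo-observations."""
    return sum(gaussian_kernel(x, y, zeta) for x in xs) / len(xs)

def abc_rejection(sample_prior, simulate, y, xi, zeta, S, n_proposals, rng):
    """Steps C1-C4: keep every proposed theta whose similarity reaches xi."""
    accepted = []
    for _ in range(n_proposals):
        theta = sample_prior(rng)                       # C1: draw from the prior
        xs = [simulate(theta, rng) for _ in range(S)]   # C2: S pseudo-observations
        rho = similarity(xs, y, zeta)                   # C3: average kernel similarity
        if rho >= xi:                                   # C4: threshold test
            accepted.append(theta)
    return accepted

# Hypothetical toy problem: theta ~ Uniform(0, 10), x | theta ~ Normal(theta, 1).
rng = random.Random(0)
samples = abc_rejection(
    sample_prior=lambda r: r.uniform(0.0, 10.0),
    simulate=lambda theta, r: r.gauss(theta, 1.0),
    y=3.0, xi=0.2, zeta=0.5, S=50, n_proposals=2000, rng=rng,
)
print(len(samples), sum(samples) / len(samples))
```

Every proposal costs exactly S simulator calls here, regardless of how clear-cut the comparison with \(\xi \) turns out to be.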
It is apparent that the choice of S plays a critical role in the efficiency of the algorithm. A large S degrades the efficiency of ABC. In contrast, a small S significantly reduces the computation for each \(\theta \), but the samples may fail to converge to the target distribution [4]. Moreover, it is wasteful to spend a large amount of computation (S simulations) on just one bit of information, namely whether to accept or reject the proposal. A natural question arises: can we simulate a small number of pseudo-observations in Step \(\mathbb {C}\)2 while maintaining convergence to the target distribution? In other words, can we find a trade-off between efficiency and accuracy? We claim that this is feasible.
In this paper, we devise the pre-judgment rule (PJR), which adjusts the number of simulations dynamically instead of using a constant S. In short, we first generate a small amount of data and compute a rough estimate of the similarity. If this estimate is far from the prespecified threshold (\(\xi \) in Step \(\mathbb {C}\)4), we judge (accept or reject) the proposal early. Otherwise, we draw more data from the simulator and repeat the evaluation until we have enough evidence to make the decision. Empirical results show that the majority of these decisions can be made with high confidence from a small number of simulations, which saves a large amount of computation.
The remainder of the paper is organized as follows. Section 2 describes our algorithm and Sect. 3 provides the theoretical analysis. A toy model in Sect. 4.1 illustrates some properties of the PJR-based methods, and further empirical evaluations are given in Sect. 4.2. The last section concludes the paper.
2 Methodology
In this section, we review related work and then present our method. We first show how the pre-judgment rule (PJR) accelerates the ABC rejection method, and then adapt the PJR strategy to the ABC-MCMC framework [20].
2.1 Related Works
In this section, we briefly review related studies. We first focus on recent developments in the ABC community. Although it allows parallel computation, ABC is still in its infancy owing to its large computational cost. Many approaches have been proposed in the machine learning community to scale up ABC. Concretely, [22, 29] introduced Gaussian processes to accelerate ABC. [23] made use of the random seed in the sampling procedure and transformed the ABC sampler into a deterministic optimization procedure. [21] adapted Hamiltonian Monte Carlo to the ABC setting, allowing noise in the estimated gradient of the log-likelihood by borrowing ideas from the stochastic gradient MCMC framework [1, 2, 11, 12, 18, 28] and pseudo-marginal MCMC methods [3, 14].
In addition, theoretical work has become popular recently [4, 7, 8, 30]. Some works focus on the selection of summary statistics [9, 13]. Unlike these methods, the PJR strategy alleviates the computational burden of the ABC rejection step itself, so it can be extended to any ABC scenario, e.g., the ABC rejection approach and the ABC-MCMC method considered in this paper.
2.2 PJR Based ABC: (PJR-ABC)
In Algorithm \(\mathbb {A}\), the likelihood is not available explicitly. We therefore resort to an approximation by introducing the simulated data x, as follows:
\[ p(y\vert \theta ) = \int \delta _D(x-y)\,p(x\vert \theta )\,dx \approx \int \pi _\zeta (x\vert y)\,p(x\vert \theta )\,dx \approx \frac{1}{S}\sum \limits _{i=1}^{S}\pi _\zeta (x_i\vert y),\quad x_i\sim p(\cdot \vert \theta )\ \text {i.i.d.}, \tag{3} \]
where \(\delta _D(\cdot )\) is the Dirac delta function. The first approximation relaxes the delta function by introducing a \(\zeta \)-kernel around the observation y. The last approximation uses a Monte Carlo estimate of the likelihood via S draws of x from the simulator \(p(\cdot \vert \theta )\).
On the other hand, for Algorithm \(\mathbb {C}\), the similarity between the pseudo-observations x and the raw observations y can be expressed as the mean similarity between each simulator output \(x_i\) and y:
\[ \rho (x,y) = \frac{1}{S} \sum \limits _{i=1}^{S}\pi _\zeta (x_i\vert y). \tag{4} \]
From Eqs. (3) and (4), Algorithm \(\mathbb {A}\) is in essence equivalent to Algorithm \(\mathbb {C}\). Consequently, the acceptance conditions in both Step \(\mathbb {A}\)2 and Step \(\mathbb {C}\)4 amount to a comparison between z and \(z_0\). Specifically, we first compute \( z = \frac{1}{S} \sum \limits _{i=1}^{S}\pi _\zeta (x_i\vert y),\ \text {where}\ x_i{\sim }p(\cdot \vert \theta )\,\,\,{i.i.d.}, \) and then compare it with a constant \(z_0\). If \(z > z_0\), we accept the proposed \(\theta \); if \(z \le z_0\), we reject it. Here \(z_0\) is a prespecified threshold; for example, in Step \(\mathbb {C}\)4, \(z_0\) corresponds to \(\xi \) (see Note 3).
To guarantee the convergence to the true posterior, S should be a large number, which means each proposal needs S simulations [4]. However, spending quantities of computation (i.e., simulating S pseudo-data \(x_1,\ldots ,x_S\)) to get just one bit of information, namely whether to accept or reject a proposal, is likely not the best use of computational resources.
To address this issue, PJR is devised to speed up the ABC procedure. We are willing to tolerate a small error in this step in exchange for a faster judgment. In particular, we first draw a small number of pseudo-observations x and compute a rough estimate of z. If the difference between this estimate and \(z_0\) is significantly larger than the standard deviation of the estimate, we conclude with confidence that z is far enough from \(z_0\) and make the decision by comparing the rough estimate with \(z_0\). Otherwise, we draw more pseudo-observations to increase the precision of the estimate until we have enough evidence to make the decision.
More formally, checking the acceptance condition can be reformulated as the following statistical hypothesis test:
\[ H_1: z_0 > z \quad \text {vs.}\quad H_2: z_0 < z. \]
To test the hypothesis, we could in principle generate arbitrarily many pseudo-observations from \( p(\cdot \vert \theta )\). On the other hand, we would like to simulate as few pseudo-observations as possible owing to the computational cost.
To do this, we proceed as follows. Given n sampled values \(z_1,\ldots ,z_n\) with \(z_i = \pi _\zeta (x_i\vert y)\), we compute the sample mean \(\bar{z}\) and sample standard deviation \(s_z\) as
\[ \bar{z} = \frac{1}{n}\sum \limits _{i=1}^{n} z_i, \qquad s_z = \sqrt{\frac{\bar{z^2} - (\bar{z})^2}{n-1}}, \tag{5} \]
where \(\bar{z^2}\) represents the mean of \(z^2\), so that \(s_z\) estimates the standard deviation of \(\bar{z}\). Then we compute the test statistic t via
\[ t = \frac{\bar{z}-z_0}{s_z}. \tag{6} \]
We assume that n is large enough for the central limit theorem (CLT) to apply, so that the test statistic t follows the standard Student-t distribution with \(n-1\) degrees of freedom. Note that when n is large, the Student-t distribution with \(n-1\) degrees of freedom is close to the standard normal distribution. Then we compute \(\eta \) defined as
\[ \eta = 1 - \psi _{n-1}(\vert t\vert ), \]
where \(\psi _{n-1}(\cdot )\) is the cdf of the standard Student-t distribution with \(n-1\) degrees of freedom.
Then we choose a threshold \(\epsilon \), e.g., \(\epsilon = 0.1\). If \(\eta <\epsilon \), we decide that z is significantly different from \(z_0\) and accept or reject \(\theta \) by comparing \(\bar{z}\) with \(z_0\). If \(\eta \ge \epsilon \), we do not have enough evidence to decide, so more pseudo-observations are drawn to reduce the uncertainty in z. Note that once S pseudo-observations have been drawn, the procedure terminates and reduces to the original ABC algorithm. The resulting algorithm is given in Algorithm 1.
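The sequential decision described above can be sketched as follows. This is a hedged illustration rather than a transcription of Algorithm 1: the function name `pjr_decide`, the schedule and the toy kernel values are hypothetical, \(\eta \) is taken to be the upper tail probability of \(\vert t\vert \), and the Student-t cdf is replaced by the standard normal cdf, which the text notes is a close approximation when n is large.

```python
import math
import random
from statistics import NormalDist

def pjr_decide(draw_z, z0, eps, schedule):
    """Pre-judge whether the mean of z exceeds z0.

    draw_z   -- callable returning one kernel value z_i = pi_zeta(x_i | y)
    schedule -- increasing cumulative sample sizes s_1 < ... < s_k = S, s_1 >= 2
    Returns (accept, n_used).
    """
    zs = []
    for s in schedule:
        while len(zs) < s:                  # top up to s_i pseudo-observations
            zs.append(draw_z())
        n = len(zs)
        mean = sum(zs) / n
        if s == schedule[-1]:
            break                           # budget S exhausted: plain ABC decision
        var = sum((z - mean) ** 2 for z in zs) / (n - 1)
        std_err = math.sqrt(var / n)        # standard deviation of the sample mean
        if std_err > 0.0:
            t = (mean - z0) / std_err
            eta = 1.0 - NormalDist().cdf(abs(t))   # normal stand-in for psi_{n-1}
        else:
            eta = 0.0 if mean != z0 else 1.0
        if eta < eps:
            break                           # enough evidence: decide early
    return mean > z0, n

rng = random.Random(0)
draw = lambda: rng.uniform(0.4, 0.6)        # hypothetical kernel values
print(pjr_decide(draw, z0=0.1, eps=0.1, schedule=[10, 20, 40, 80, 160]))
```

With \(\epsilon = 0\) the early exit is never taken, so the function always consumes the full budget and reproduces the plain comparison of \(\bar{z}\) with \(z_0\).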
The advantage of PJR-ABC is that we can often make confident decisions with \(s_i\) (\(s_i\ll S\)) pseudo-observations and thus reduce computation significantly. Although PJR-ABC introduces errors in judgment, the computational time saved can be used to draw more samples, which offsets the small bias. It is worth noting that \(\epsilon \) acts as a knob. As \(\epsilon \) approaches 0, we make almost the same decisions as the ABC rejection method but require a large number of simulations. On the other hand, when \(\epsilon \) is large, we make decisions without sufficient evidence and the error is high. This accuracy-efficiency trade-off is verified empirically in Sect. 4.1.
2.3 PJR Based Markov Chain Monte Carlo Version of ABC: PJR-ABC-MCMC
The ABC rejection methods are easy to implement and compatible with embarrassingly parallel computation. However, when the prior distribution is far from the posterior distribution, most samples from the prior are rejected, making the acceptance rate very small, especially in high-dimensional problems. To address this issue, a Markov chain Monte Carlo version of ABC (ABC-MCMC) was proposed [20]. MCMC has been the main workhorse of Bayesian computation since the 1990s, and many state-of-the-art MCMC samplers can be extended to the ABC setting, e.g., Hamiltonian Monte Carlo can be extended to Hamiltonian ABC [21]. Hence ABC-MCMC [20] is a benchmark in the ABC community. We now show that PJR can be adapted to the ABC-MCMC framework. First, ABC-MCMC is briefly introduced:
- \(\mathbb {D}\)1. Given the current point \(\theta \), \(\theta ^\prime \) is proposed according to a transition kernel \(q(\theta ^\prime \vert \theta )\).
- \(\mathbb {D}\)2. Generate \(x^\prime \) from the simulator \(p(\cdot \vert \theta ^\prime )\).
- \(\mathbb {D}\)3. Compute the acceptance probability \(\alpha \) defined in Eq. 8.
- \(\mathbb {D}\)4. Accept \(\theta ^\prime \) with probability \(\alpha \). Otherwise, stay at \(\theta \). Return to \(\mathbb {D}1\).
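In an MCMC sampler, the Metropolis-Hastings (MH) acceptance probability \(\alpha \) is defined as
\[ \alpha = \min \left( 1, \frac{p(\theta ^\prime )\,p(y\vert \theta ^\prime )\,q(\theta \vert \theta ^\prime )}{p(\theta )\,p(y\vert \theta )\,q(\theta ^\prime \vert \theta )}\right) . \]
In the likelihood-free scenario, the acceptance probability of ABC-MCMC is
\[ \alpha = \min \left( 1, \frac{p(\theta ^\prime )\,q(\theta \vert \theta ^\prime )\,\frac{1}{S}\sum _{s=1}^{S}\pi _\zeta (x_s^\prime \vert y)}{p(\theta )\,q(\theta ^\prime \vert \theta )\,\frac{1}{S}\sum _{s=1}^{S}\pi _\zeta (x_s\vert y)}\right) , \tag{8} \]
where \(x_{s}{\sim } p(\cdot \vert \theta )~i.i.d.\) and \(x_{s}^\prime {\sim } p(\cdot \vert \theta ^\prime )~i.i.d.\) The proposal is accepted if
\[ u < \alpha , \]
where \(u\sim \text {Uniform}(0,1)\). Since \(u<1\), this is equivalent to
\[ u\,p(\theta )\,q(\theta ^\prime \vert \theta )\,\frac{1}{S}\sum _{s=1}^{S}\pi _\zeta (x_s\vert y) < p(\theta ^\prime )\,q(\theta \vert \theta ^\prime )\,\frac{1}{S}\sum _{s=1}^{S}\pi _\zeta (x_s^\prime \vert y). \]
Note that \(\{x_1,...,x_S\}\) is given in ABC-MCMC. Defining the fixed part \(z_0\) and the test variable z, we obtain the condition \(z > z_0\) with
\[ z_0 = \frac{u\,p(\theta )\,q(\theta ^\prime \vert \theta )}{p(\theta ^\prime )\,q(\theta \vert \theta ^\prime )}\cdot \frac{1}{S}\sum _{s=1}^{S}\pi _\zeta (x_s\vert y), \qquad z = \frac{1}{S}\sum _{s=1}^{S}\pi _\zeta (x_s^\prime \vert y), \]
where z takes the same form as in PJR-ABC: \( z = \frac{1}{S} \sum \limits _{i=1}^{S} z_i, \) where \(z_i = \pi _\zeta ( x^\prime _{i}\vert y)\).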
Following PJR-ABC, we test the hypothesis \(H_1: z_0>z\) vs. \(H_2:z_0<z\). The sample mean \(\bar{z}\), the sample standard deviation \(s_z\) and the test statistic t are then calculated as in Eqs. (5) and (6), exactly as in PJR-ABC. The resulting algorithm is similar and is therefore not listed.
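As a small illustration of how the fixed part \(z_0\) might be computed in practice, the sketch below assumes the standard pseudo-marginal form of the ABC-MCMC ratio, in which the prior, the proposal density and the kernel average at the current point multiply in the denominator. The function names and the numbers are hypothetical; only the idea of moving every known quantity to one side of the inequality \(u<\alpha \) comes from the text.

```python
def mcmc_threshold(u, prior_cur, prior_prop, q_forward, q_reverse, z_cur):
    """Fixed part z0 of the test 'z > z0' for one ABC-MCMC proposal.

    u          -- Uniform(0, 1) draw used for the accept/reject decision
    prior_cur  -- p(theta),          prior_prop -- p(theta')
    q_forward  -- q(theta' | theta), q_reverse  -- q(theta | theta')
    z_cur      -- kernel average (1/S) sum_s pi_zeta(x_s | y) at the current theta
    All densities are assumed strictly positive.
    """
    return u * prior_cur * q_forward * z_cur / (prior_prop * q_reverse)

def accepts_exact(u, prior_cur, prior_prop, q_forward, q_reverse, z_cur, z_prop):
    """Direct evaluation of u < alpha, kept only to check the reformulation."""
    ratio = (prior_prop * q_reverse * z_prop) / (prior_cur * q_forward * z_cur)
    return u < min(1.0, ratio)

z0 = mcmc_threshold(u=0.5, prior_cur=1.0, prior_prop=2.0,
                    q_forward=1.0, q_reverse=1.0, z_cur=0.4)
for z_prop in (0.2, 0.05):
    # Both columns agree: comparing z with z0 reproduces the u < alpha decision.
    print(z_prop > z0, accepts_exact(0.5, 1.0, 2.0, 1.0, 1.0, 0.4, z_prop))
```

Because \(z_0\) depends only on quantities available before any simulation at \(\theta ^\prime \), it can be computed once per proposal and then compared against a running mean of the \(z_i\).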
3 Theoretical Analysis
In this section, we study the theoretical properties of the PJR strategy. Specifically, we provide an error analysis for both PJR-ABC and PJR-ABC-MCMC. Every time we accept or reject a proposal in PJR-ABC or PJR-ABC-MCMC, we solve a hypothesis testing problem. We therefore first bound the error caused by a single test and then relate this single-test error to the total error of PJR-ABC and PJR-ABC-MCMC. In hypothesis testing, two types of error are distinguished: a type I error is the incorrect rejection of a true hypothesis, while a type II error is the failure to reject a false hypothesis. We now discuss the probabilities of these two errors in a single decision problem.
Theorem 1
The probabilities of both type I and type II errors decrease approximately exponentially w.r.t. the sample size of z (the sample size of z corresponds to \(s_1,\ldots ,s_k\) in Algorithm 1).
Proof
Let \(\psi _{n-1}(\cdot )\) denote the cdf of the standard Student-t distribution with \(n-1\) degrees of freedom. For simplicity, we first discuss the probability of a type I error, i.e., the incorrect rejection of a true hypothesis. The conclusion extends easily to the type II error by symmetry.
In this case, \(z > z_0\). Suppose the number of sampled z values is n. The test statistic \(t = \frac{\bar{z}-z_0}{s_z}\) follows the standard Student-t distribution with \(n-1\) degrees of freedom, which approaches the standard normal distribution when \(n-1\) is large. Hence, many properties of the normal distribution carry over.
Given the knob parameter \(\epsilon \), by the monotonicity of the function \(\psi _{n-1}(\cdot )\) on \(\mathbb {R}\), there exists a unique s such that \(\psi _{n-1}(s) = \epsilon \). Moreover, since \(\bar{z} = \frac{z_1+z_2+\ldots +z_n}{n}\) and \(t = \frac{\bar{z}-z_0}{s_z}\sim \psi _{n-1}(\cdot ) \approx \mathcal {N}(0,1)\), the \(z_i\) can be regarded as independent and identically distributed samples from \(\mathcal {N}(z_0,ns_z)\), i.e., \(z_i{\sim }\mathcal {N}(z_0,ns_z)\,\,\,i.i.d\).
A type I error occurs only when \(\frac{\bar{z} - z_0}{s_z} < s\), that is, \(\sum _{i=1}^{n}z_i < n(s_zs+z_0)\). Thus, the probability of a type I error is obtained by integrating over the region of \((z_1,z_2,\ldots ,z_n) \) where \( \sum _{i=1}^{n}z_i < n(s_zs+z_0)\).
where \(\psi '(\cdot )\) and \(\psi _{n-1}(\cdot )\) represent the pdf and cdf of the standard Student-t distribution with \(n-1\) degrees of freedom.
This completes the proof.
The above theorem shows that the error in a single judgment is negligible as long as the number of sampled z values is large enough. Based on this theorem, the following assumption is reasonable.
Assumption 1
The probability of error produced by a single hypothesis testing problem in both PJR-ABC and PJR-ABC-MCMC can be upper-bounded, denoted by \(\delta _{1}, \delta _2\rightarrow {}0_+\), for PJR-ABC and PJR-ABC-MCMC, respectively.
In Bayesian inference, we are interested in the posterior average, defined as \(\bar{\phi } \triangleq \int _{\mathcal {\theta }} \phi (\theta ) p(\theta \vert y) d\theta \) for some test function \(\phi (\theta )\) of interest. For a given numerical method (say, PJR-ABC or PJR-ABC-MCMC) with generated samples \(\{\theta _1, \ldots ,\theta _M\}\), we use the sample average \(\hat{\phi }\), defined as \(\hat{\phi } = 1/M\sum _{l=1}^{M}\phi (\theta _l)\), to approximate \(\bar{\phi }\). Before providing a bound on the bias of the PJR-ABC algorithm, we first make a mild assumption.
Assumption 2
The prior average of \(\phi (\cdot )\) is bounded away from infinity, i.e., \(\int \phi (\theta )p(\theta )d\theta < +\infty \).
Theorem 2
Under Assumptions 1 and 2, the bias of PJR-ABC can be upper-bounded as \( \vert \mathbb {E}\hat{\phi } - \bar{\phi }\vert \le C_1 \delta _{1}, \) where \(C_1 = \frac{\int _\theta \phi (\theta ) p(\theta ) d\theta }{ p(y) }\) is a constant and p(y) denotes the normalizing constant.
Proof
In the ABC rejection method, each \(\theta \) drawn from \(p(\theta )\) is independent. The error at \(\theta \) caused by PJR is denoted by \(\xi (\theta )\), which is treated as a perturbation of the true likelihood. Thus the estimated likelihood function can be represented as \(\hat{p}(y\vert \theta ) = p(y\vert \theta ) + \xi (\theta )\), where \(\vert \xi (\theta )\vert \le \delta _{1}\) owing to the boundedness of the single-test error stated in Assumption 1. Then
\[ \mathbb {E}\hat{\phi } = \int \phi (\theta )\frac{\hat{p}(y\vert \theta )p(\theta )}{p(y)}d\theta = \int \phi (\theta )p(\theta \vert y)d\theta + \int \phi (\theta )\frac{\xi (\theta )p(\theta )}{p(y)}d\theta . \tag{9} \]
The first term on the RHS of Eq. (9) is the expectation under the true posterior distribution, while the second term is the error, which is upper bounded as
\[ \left| \int \phi (\theta )\frac{\xi (\theta )p(\theta )}{p(y)}d\theta \right| \le \delta _{1}\frac{\int \phi (\theta )p(\theta )d\theta }{p(y)} = C_1\delta _{1}, \]
where \(C_1 = \frac{\int \phi (\theta )p(\theta )d\theta }{p(y)}\) is bounded, which follows from the fact that both \(\frac{1}{p(y)} \) and \(\int \phi (\theta )p(\theta )d\theta \) are bounded away from \(+\infty \).
This completes the proof.
In PJR-ABC, the samples are independent of each other. However, in PJR-ABC-MCMC all the samples lie on a single chain, which makes the analysis more complicated. Here, the distance between probability distributions is measured by the total variation distance (TVD; see Note 4). The result is as follows.
Theorem 3
Under Assumption 1, for any posterior distribution, there exists a constant \(C_2\) such that the discrepancy between the true posterior distribution \(\mathcal {S}_0\) and the stationary distribution \(\mathcal {S}_\epsilon \) of our PJR-ABC-MCMC algorithm can be upper bounded as \( d_v(\mathcal {S}_0,\mathcal {S}_\epsilon ) \le C_2 \delta _{2}. \)
Proof
We first focus on the error for a single step, from which the error in the stationary distribution is derived. The transition kernel of the ABC-MCMC algorithm can be written as
\[ \mathcal {T}_0(\theta ,\theta ^{\prime }) = P_a(\theta ,\theta ^{\prime })\,q(\theta ^{\prime }\vert \theta ) + \left( 1 - \int P_a(\theta ,\vartheta )\,q(\vartheta \vert \theta )\,d\vartheta \right) \delta _D(\theta ^{\prime }-\theta ), \]
where \(\delta _D(\cdot ) \) is the Dirac delta function and \(P_a(\theta ,\theta ^\prime )\) is the acceptance probability. Analogous definitions hold for the transition kernel \(\mathcal {T}_\epsilon (\theta ,\theta ^{\prime })\) of PJR-ABC-MCMC and its acceptance probability \(P_{a,\epsilon }(\theta ,\theta ^{\prime })\).
The discrepancy between \(P_a(\theta ,\theta ^{\prime }) \) and \(P_{a,\epsilon }(\theta ,\theta ^{\prime })\) is defined as \(\delta P_a(\theta ,\theta ^{\prime }) \triangleq P_{a,\epsilon }(\theta ,\theta ^{\prime }) - P_a(\theta ,\theta ^{\prime })\). For every \((\theta ,\theta ^{\prime })\), according to the error bound for a single test, there exists an upper bound for \(\delta P_a(\theta ,\theta ^{\prime })\), i.e., \(\vert \delta P_a(\theta ,\theta ^{\prime }) \vert \le \delta _{\text {max}}\) for all \((\theta ,\theta ^{\prime })\).
Then the total variational distance for a single step can be upper bounded for any distribution P as:
Applying Lemma 1 with \(2\delta _{\text {max}} \) substituted for \(\delta \) in Eq. 10 then proves Theorem 3. This completes the proof.
Lemma 1
[16]. Given two transition kernels, \(\mathcal {T}_0\) and \(\mathcal {T}_\epsilon \), whose stationary distributions are denoted by \(\mathcal {S}_0 \) and \(\mathcal {S}_\epsilon \), if \(\mathcal {T}_0\) satisfies the following contraction condition with a constant \(\eta \in [0,1) \) for all probability distributions \(\mathcal {P}\):
\[ d_v(\mathcal {P}\mathcal {T}_0,\mathcal {S}_0) \le \eta \, d_v(\mathcal {P},\mathcal {S}_0), \]
and the one-step error between \(\mathcal {T}_0\) and \(\mathcal {T}_\epsilon \) is upper bounded uniformly with a constant \(\delta >0\) as:
\[ d_v(\mathcal {P}\mathcal {T}_0,\mathcal {P}\mathcal {T}_\epsilon ) \le \delta , \]
then the distance between \(\mathcal {S}_0\) and \(\mathcal {S}_\epsilon \) can be bounded as: \( d_v(\mathcal {S}_0,\mathcal {S}_\epsilon ) \le \frac{\delta }{1-\eta }. \)
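For intuition, the bound in Lemma 1 can be recovered in one line. The sketch below assumes that the contraction condition reads \(d_v(\mathcal {P}\mathcal {T}_0,\mathcal {S}_0)\le \eta \,d_v(\mathcal {P},\mathcal {S}_0)\) and that the one-step error reads \(d_v(\mathcal {P}\mathcal {T}_0,\mathcal {P}\mathcal {T}_\epsilon )\le \delta \), and it uses only stationarity (\(\mathcal {S}_\epsilon =\mathcal {S}_\epsilon \mathcal {T}_\epsilon \)) and the triangle inequality. The numerical values at the end are hypothetical.

```latex
\begin{align*}
d_v(\mathcal{S}_0,\mathcal{S}_\epsilon)
  &= d_v(\mathcal{S}_0,\mathcal{S}_\epsilon\mathcal{T}_\epsilon) \\
  &\le d_v(\mathcal{S}_0,\mathcal{S}_\epsilon\mathcal{T}_0)
     + d_v(\mathcal{S}_\epsilon\mathcal{T}_0,\mathcal{S}_\epsilon\mathcal{T}_\epsilon) \\
  &\le \eta\, d_v(\mathcal{S}_0,\mathcal{S}_\epsilon) + \delta
  \quad\Longrightarrow\quad
  d_v(\mathcal{S}_0,\mathcal{S}_\epsilon) \le \frac{\delta}{1-\eta}. \\[4pt]
\text{Example: } & \eta = 0.9,\ \delta = 0.01
  \ \Rightarrow\ d_v(\mathcal{S}_0,\mathcal{S}_\epsilon) \le \frac{0.01}{1-0.9} = 0.1 .
\end{align*}
```

The example shows how a slowly contracting chain (\(\eta \) close to 1) amplifies the one-step error, here by a factor of ten.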
Theorems 2 and 3 indicate that the error is proportional to the single-test error. Combining this with Theorem 1, we conclude that the bias of both PJR-ABC and PJR-ABC-MCMC can be bounded.
4 Numerical Validation
In this section, we use a toy model to demonstrate both PJR-ABC and PJR-ABC-MCMC.
4.1 Synthetic Data
We adopt a gamma prior with shape \(\alpha \) and rate \(\beta \), i.e., \(p(\theta ) = \text {Gamma}(\alpha ,\beta )\). The likelihood is an exponential distribution with mean \(1/\theta \), i.e., \(x\sim \text {exp}(1/\theta )\). The observation is generated via \(y = \frac{1}{N}\sum \nolimits _{i=1}^N e_i\), where \(e_i\sim \text {exp}(1/\theta ^*)\) and N is the number of observations. Regarding the choice of the sequence \(\{s_i\}_{i=1}^{k}\) (\(s_0=0\)), we find that a geometric sequence is usually the best choice, and it is used in both Sect. 4.1 and 4.2. The common ratio of the geometric sequence is usually set between 1.5 and 2. The true posterior is a gamma distribution with shape \(\alpha + N\) and rate \(\beta +Ny\), i.e., \(p(\theta \vert y) = \text {Gamma}(\alpha +N,\beta +Ny)\). In particular, we set \(S=1000\), \(N = 20\), \(y = 7.74\), \(\alpha = \beta = 1\), \(\theta ^* = 0.15\) in this scenario. We run chains of length 50K for ABC-MCMC and PJR-ABC-MCMC and 100K for ABC and PJR-ABC. For each method, we conduct 5 independent trials and report the average value. In this paper, the proposal distribution in both ABC-MCMC and PJR-ABC-MCMC is a Gaussian distribution centered at the current \(\theta \).
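A geometric sequence \(\{s_i\}\) of cumulative sample sizes could be built as in the sketch below. The starting size `s1=10` and the helper name `geometric_schedule` are assumptions made for illustration; the text specifies only the common ratio (1.5 to 2) and the cap \(S=1000\).

```python
def geometric_schedule(s1, ratio, S):
    """Cumulative sample sizes s_1 < s_2 < ... < s_k = S growing geometrically."""
    sizes = []
    s = float(s1)
    while s < S:
        n = int(round(s))
        # keep the sequence strictly increasing and strictly below the cap S
        if n < S and (not sizes or n > sizes[-1]):
            sizes.append(n)
        s *= ratio
    sizes.append(S)          # the last stage always uses the full budget
    return sizes

schedule = geometric_schedule(s1=10, ratio=2.0, S=1000)
print(schedule)
```

A larger ratio gives fewer test stages and therefore fewer chances to stop early, while a smaller ratio tests more often at the cost of more intermediate computations of the test statistic.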
First, we investigate empirically how the performance (both efficiency and accuracy) changes as a function of the knob \(\epsilon \). For each \(\epsilon \in \{0,0.01, 0.03, 0.07, 0.1, 0.2, 0.3\}\), we record both efficiency (see Note 5) and accuracy (see Note 6). Setting \(\epsilon =0\) means that PJR-ABC and PJR-ABC-MCMC reduce to ABC and ABC-MCMC, respectively. The results are reported in Fig. 1. We find that a smaller \(\epsilon \) usually leads to higher accuracy and lower efficiency, which confirms the statement about \(\epsilon \) in Sect. 2. Hence, the empirical trade-off between efficiency and accuracy can be controlled by adjusting \(\epsilon \). In the following, we set \(\epsilon =0.1\). In Fig. 3, we show the trace plots of the last 1K samples from a single chain for ABC-MCMC and PJR-ABC-MCMC. The result is positive: PJR-ABC-MCMC preserves the ability to explore the parameter space compared with ABC-MCMC. The empirical histograms of \(\theta \) for all the methods are presented in Fig. 2. We find that all of them are close to the desired posterior. In Table 1 we show
- the average total variation distance (see Note 7) between the true posterior and the ABC posteriors, and the corresponding standard deviation, using the first 10K samples and the whole chain;
- the average number of simulations.
We observe that the PJR-based ABC rejection and ABC-MCMC methods achieve results similar to those of the original algorithms in terms of convergence to the target posterior distribution. Furthermore, the PJR strategy accelerates both ABC and ABC-MCMC in terms of the number of simulations.
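Since the total variation distance is estimated empirically, one simple possibility is a histogram estimate against the known gamma posterior, sketched below. This is an assumption about how such an estimate could be computed, not the procedure used for Table 1: the bin count, the integration range and the helper names are hypothetical, and the demo samples are drawn from the exact posterior \(\text {Gamma}(\alpha +N,\beta +Ny) = \text {Gamma}(21, 155.8)\) rather than from an ABC sampler.

```python
import math
import random

def gamma_pdf(theta, shape, rate):
    """Density of Gamma(shape, rate) at theta > 0."""
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1.0) * math.log(theta) - rate * theta)
    return math.exp(log_pdf)

def empirical_tvd(samples, pdf, lo, hi, n_bins=50):
    """Histogram estimate of 0.5 * integral |f_P - f_Q| restricted to [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in samples:
        if lo <= s < hi:
            counts[min(int((s - lo) / width), n_bins - 1)] += 1
    tvd = 0.0
    for b, c in enumerate(counts):
        mid = lo + (b + 0.5) * width
        emp_mass = c / len(samples)          # empirical probability of the bin
        true_mass = pdf(mid) * width         # midpoint rule for the reference density
        tvd += abs(emp_mass - true_mass)
    return 0.5 * tvd

rng = random.Random(0)
shape, rate = 1 + 20, 1 + 20 * 7.74          # alpha + N, beta + N y from Sect. 4.1
exact = [rng.gammavariate(shape, 1.0 / rate) for _ in range(20000)]  # second arg is the scale
tvd = empirical_tvd(exact, lambda th: gamma_pdf(th, shape, rate), 0.0, 0.4)
print(tvd)
```

Even for exact posterior draws the estimate is not zero, because of binning and sampling noise, so values reported this way are best read relative to such a baseline.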
4.2 Real Applications
The Popular Ricker Model. In this section, we apply our method to the popular Ricker model [31]. The Ricker model, a classic discrete population model used in ecology, gives the expected number of individuals in the current generation as a function of the number of individuals in the previous generation. This model is commonly used as an exemplar of a complex model [29] because its near-chaotic dynamics cause standard statistical methods to break down [31]. In particular, \(N_t\) denotes the unobserved number of individuals in the population at time t, while the number of observed individuals is denoted by \(Y_t\). The Ricker model is defined via the following relationships [31]:
\[ N_{t+1} = r N_t e^{-N_t + e_t}, \qquad Y_t \sim \text {Poisson}(\phi N_t), \qquad e_t \sim \mathcal {N}(0,\sigma ^2), \]
where each \(e_t\) (\(t = 1,2,...,\)) is independent and \(Y_t\) only depends on \(N_t\). In this model, the parameter vector is \(\theta = \{\log r, \sigma ^2, \phi \}\). \(y_{1:T} = \{y_1,...,y_T \}\in \mathbb {R}^T\) is the time series of observations. For each parameter, we adopt the uniform prior as
The target distribution is the posterior of \(\theta \) given observations \(y_{1:T}\), i.e., \(p(\theta \vert y_{1:T})\). An artificial dataset is generated using \({\theta }^* = (3.8,0.3,10.0)\). We compare the PJR-ABC-MCMC method with ABC-MCMC. For ABC-MCMC, we run the simulator \(S = 2000\) times at each \(\theta \) to approximate the likelihood value. The knob \(\epsilon \) is set to 0.1. For summary statistics, we follow [29] and use a collection of phase-invariant measures, such as the coefficients of polynomial autoregressive models.
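A simulator for this model might look like the sketch below, which assumes the usual form of the Ricker model, \(N_{t+1} = rN_te^{-N_t+e_t}\) with \(e_t\sim \mathcal {N}(0,\sigma ^2)\) and \(Y_t\sim \text {Poisson}(\phi N_t)\). The initial population `n0=1.0`, the series length `T=50`, the reading of the second component of \({\theta }^*\) as \(\sigma \), and the hand-written Poisson sampler are assumptions made to keep the example self-contained.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_ricker(log_r, sigma, phi, T, rng, n0=1.0):
    """Return y_1..y_T from N_{t+1} = r N_t exp(-N_t + e_t), Y_t ~ Poisson(phi N_t)."""
    r = math.exp(log_r)
    n = n0                                  # hypothetical initial population size
    ys = []
    for _ in range(T):
        e = rng.gauss(0.0, sigma)           # process noise e_t ~ N(0, sigma^2)
        n = r * n * math.exp(-n + e)        # hidden population dynamics
        ys.append(poisson(phi * n, rng))    # noisy observation of phi * N_t
    return ys

rng = random.Random(1)
y_obs = simulate_ricker(log_r=3.8, sigma=0.3, phi=10.0, T=50, rng=rng)
print(y_obs[:10])
```

Each call produces one pseudo-observation series, so a single ABC-MCMC proposal with \(S=2000\) would invoke `simulate_ricker` 2000 times before computing summary statistics.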
Effectiveness: Figures 4 and 5 show the empirical histograms of the parameters of interest \(\theta = (\log r,\sigma , \phi )\) generated by ABC-MCMC and PJR-ABC-MCMC, respectively. Furthermore, we present scatter plots of the trajectories for every pair of parameters in Fig. 6. We observe that the mode of the empirical posterior is close to \({\theta }^*\) and that the posteriors produced by the two algorithms are similar, showing the success of PJR-ABC-MCMC on the Ricker model.
Efficiency: The simulation procedure is complex and dominates the computational time. Therefore, the running time of the samplers is almost proportional to the number of simulations. Specifically, to sample 1K parameters, ABC-MCMC requires 2M simulations (\(S=2000\)), while PJR-ABC-MCMC requires only about 371K. We conclude that the majority of the decisions can be made with high confidence from a small number of simulations. Hence, the PJR strategy greatly accelerates the ABC-MCMC algorithm on the Ricker model.
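To make the saving concrete, the figures quoted above work out as follows; the derived quantities are rounded and follow from the reported counts alone.

```latex
\[
1000 \times 2000 = 2\times 10^{6}\ \text{simulations (ABC-MCMC)}, \qquad
\frac{3.71\times 10^{5}}{1000} \approx 371\ \text{simulations per proposal (PJR-ABC-MCMC)},
\]
\[
\frac{3.71\times 10^{5}}{2\times 10^{6}} \approx 0.19, \qquad
\frac{2\times 10^{6}}{3.71\times 10^{5}} \approx 5.4 .
\]
```

In other words, PJR-ABC-MCMC uses roughly 19% of the simulation budget of ABC-MCMC in this experiment, a reduction by a factor of about 5.4.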
4.3 Apply to HABC-SGLD
In this part, we apply our method to SGLD (Stochastic Gradient Langevin Dynamics, [28]) version of HABC (Hamiltonian ABC) proposed in [21].
In each iteration of SGLD, a mini-batch \(\mathcal {X}_n\) of size n is drawn to estimate the gradient of log-posterior. The proposal is
It can be shown that as the step size \(\alpha \) approaches zero, the acceptance probability approaches 1 [28]. Based on this, the MH correction step is ignored. However, the assumption \(\alpha \rightarrow {} 0\) is too restrictive. In practice, to keep the mixing rate high, a reasonably large \(\alpha \) is chosen. In this situation, SGLD may fail to converge to the target distribution in some cases. The detailed reasons can be found in [16].
In ABC scenarios, the conventional MH rejection step is time-consuming, so our method fits this problem naturally. Specifically, we consider an L1-regularized linear regression model, which was used in [16] to explain the necessity of MH rejection in SGLD. We explore its effectiveness in the ABC scenario.
We are given a dataset \(\{u_i,v_i\}_{i=1}^N\), where \(u_i\) are the predictors and \(v_i \) are the targets. A Gaussian error model and a Laplacian prior for the parameter \(\theta \in \mathbb {R}^D \) are adopted, i.e., \(p(v\vert u,\theta ) \propto \text {exp}(-\frac{\lambda }{2} (v-\theta ^Tu)^2)\) and \(p(\theta ) \propto \exp (-\lambda _0\Vert \theta \Vert _1)\). We generate a synthetic dataset of size \(N=10000\) via \(v_i = \theta _0^Tu_i + \xi \), where \(\xi \sim \mathcal {N}(0,1/3)\) and \(\theta _0 = 0.5\), following [16]. For pedagogical reasons, we set \(D=1 \). Furthermore, we choose \(\lambda = 1\) and \(\lambda _0 = 4700\) so that the prior is not washed out by the likelihood.
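For orientation, a plain SGLD proposal on this model (with the exact mini-batch gradient, not the simulation-based gradient of HABC) could be sketched as below. The update \(\theta + \frac{\alpha }{2}\hat{g} + \mathcal {N}(0,\alpha )\) is the standard SGLD form; the distribution of the predictors \(u_i\), the smaller dataset size, the batch size and the step size are assumptions chosen only to make the example runnable.

```python
import math
import random

def make_dataset(n, theta0, noise_var, rng):
    """v_i = theta0 * u_i + noise; predictors u_i ~ N(0, 1) are an assumption."""
    data = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        v = theta0 * u + rng.gauss(0.0, math.sqrt(noise_var))
        data.append((u, v))
    return data

def stochastic_grad_log_post(theta, batch, n_total, lam, lam0):
    """Mini-batch estimate of the gradient of the log-posterior for D = 1."""
    sign = (theta > 0) - (theta < 0)                  # subgradient of |theta|
    grad_prior = -lam0 * sign                         # from exp(-lam0 * |theta|)
    grad_lik = sum(lam * (v - theta * u) * u for u, v in batch)
    return grad_prior + (n_total / len(batch)) * grad_lik

def sgld_step(theta, data, batch_size, step, lam, lam0, rng):
    """One SGLD proposal: theta + (step / 2) * grad + N(0, step), no MH correction."""
    batch = rng.sample(data, batch_size)
    grad = stochastic_grad_log_post(theta, batch, len(data), lam, lam0)
    return theta + 0.5 * step * grad + rng.gauss(0.0, math.sqrt(step))

rng = random.Random(0)
data = make_dataset(n=1000, theta0=0.5, noise_var=1.0 / 3.0, rng=rng)
theta_new = sgld_step(0.5, data, batch_size=100, step=1e-6, lam=1.0, lam0=4700.0, rng=rng)
print(theta_new)
```

The kink of the Laplacian prior at zero is handled here with a subgradient, which is one reason why proposals near \(\theta = 0\) can behave poorly when no rejection step is applied.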
Here, a standard MCMC sampler is employed as the baseline. We run HABC-SGLD without rejection and HABC-SGLD with rejection (PJR-HABC-SGLD). The empirical histograms of the samples obtained from the different samplers are shown in Fig. 7. We observe that the empirical histogram of samples obtained from PJR-HABC-SGLD is much closer to that of the standard MCMC sampler than the histogram from HABC-SGLD, which verifies the effectiveness of PJR-HABC-SGLD.
5 Conclusion
In this paper, we have proposed the pre-judgment rule (PJR) to accelerate ABC methods. Adaptations to the ABC rejection method and to ABC-MCMC are provided as PJR-ABC and PJR-ABC-MCMC, respectively. We analyze the error bound induced by the PJR strategy. The experiments demonstrate the practical value of the methodology in terms of both accuracy and efficiency. Finally, as a future direction, we plan to integrate the PJR strategy with neural networks as in [24].
Notes
1. \(\rho (\mathbb {S}(x),\mathbb {S}(y))\) is replaced by \(\rho (x,y)\), similarly to Step \(\mathbb {C}\)3.
2. E.g., the \(\zeta \)-kernel can be chosen as \(\pi _\zeta (x_1\vert x_2) = (1/\sqrt{2\pi }\zeta )\exp (-\Vert x_1-x_2\Vert ^2/2\zeta ^2)\).
3. In Step A2, \(z_0\) is more complex. Checking the acceptance condition is equivalent to judging whether \(\frac{z}{c} > u\), where c is defined in Eq. (2) and \(u\sim \text {Uniform}(0,1)\).
4. The total variation distance between two distributions P and Q, both absolutely continuous w.r.t. a measure \(\varOmega \), is defined as \(d_v(P,Q) \triangleq \frac{1}{2}\int \nolimits _{\theta } \vert f_P(\theta ) - f_Q(\theta ) \vert d\varOmega (\theta )\), where \(f_P(\cdot )\) and \(f_Q(\cdot )\) are their respective densities.
5. Measured in terms of the number of simulations.
6. Measured in terms of the TVD with respect to the true posterior distribution.
7. Note that in the experiments the total variation distance is estimated empirically, owing to the absence of an explicit formula.
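The notes do not say how the total variation distance is estimated from samples. One simple possibility, sketched here as an assumption rather than the authors' procedure, is to bin both sample sets on a shared grid and take half the \(\ell _1\) distance between the bin proportions, which is the discrete analogue of the definition in Note 4. The function name `empirical_tvd` and the default bin count are hypothetical.

```python
import numpy as np


def empirical_tvd(samples_p, samples_q, bins=50):
    """Histogram estimate of the total variation distance between two samples."""
    samples_p = np.asarray(samples_p, dtype=float)
    samples_q = np.asarray(samples_q, dtype=float)
    # Use common bin edges so that the two histograms are comparable.
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(samples_p, bins=edges)
    q, _ = np.histogram(samples_q, bins=edges)
    # Convert counts to proportions, then apply 1/2 * sum |p_k - q_k|.
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()


rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5_000)
y = rng.normal(0.5, 1.0, size=5_000)
print("estimated TVD:", empirical_tvd(x, y))
```

The estimate depends on the number of bins and on the sample size, so comparisons between samplers are only meaningful when both are held fixed.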
References
Ahn, S., Korattikara, A., Welling, M.: Bayesian posterior sampling via stochastic gradient Fisher scoring. In: Proceedings of the 29th International Conference on Machine Learning, pp. 1771–1778 (2012)
Ahn, S., Shahbaba, B., Welling, M.: Distributed stochastic gradient MCMC. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1044–1052 (2014)
Andrieu, C., Roberts, G.O.: The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat. 37, 697–725 (2009)
Barber, S., Voss, J., Webster, M., et al.: The rate of convergence for approximate Bayesian computation. Electron. J. Stat. 9(1), 80–105 (2015)
Beaumont, M.A.: Approximate Bayesian computation in evolution and ecology. Annu. Rev. Ecol. Evol. Syst. 41, 379–406 (2010)
Bernton, E., Jacob, P.E., Gerber, M., Robert, C.P.: Approximate Bayesian computation with the Wasserstein distance. J. Roy. Stat. Soc.: Ser. B (Stat. Methodol.) 81(2), 235–269 (2019)
Biau, G., Cérou, F., Guyader, A., et al.: New insights into approximate Bayesian computation. In: Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 51, pp. 376–403. Institut Henri Poincaré (2015)
Blum, M.G., François, O.: Non-linear regression models for approximate Bayesian computation. Stat. Comput. 20(1), 63–73 (2010)
Blum, M.G., Nunes, M.A., Prangle, D., Sisson, S.A., et al.: A comparative review of dimension reduction methods in approximate Bayesian computation. Stat. Sci. 28(2), 189–208 (2013)
Cabras, S., Nueda, M.E.C., Ruli, E., et al.: Approximate Bayesian computation by modelling summary statistics in a quasi-likelihood framework. Bayesian Anal. 10(2), 411–439 (2015)
Chen, T., Fox, E., Guestrin, C.: Stochastic gradient Hamiltonian Monte Carlo. In: International Conference on Machine Learning, pp. 1683–1691 (2014)
Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R.D., Neven, H.: Bayesian sampling using stochastic gradient thermostats. In: Advances in Neural Information Processing Systems, pp. 3203–3211 (2014)
Fearnhead, P., Prangle, D.: Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. Roy. Stat. Soc.: Ser. B (Stat. Methodol.) 74(3), 419–474 (2012)
Fu, T., Luo, L., Zhang, Z.: Quasi-Newton Hamiltonian Monte Carlo. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 212–221 (2016)
Jiang, B., Wu, T.Y., Zheng, C., Wong, W.H.: Learning summary statistic for approximate Bayesian computation via deep neural network. Stat. Sin. 27, 1595–1618 (2017)
Korattikara, A., Chen, Y., Welling, M.: Austerity in MCMC land: cutting the Metropolis-Hastings budget. In: International Conference on Machine Learning, pp. 181–189 (2014)
Lintusaari, J., Gutmann, M.U., Dutta, R., Kaski, S., Corander, J.: Fundamentals and recent developments in approximate Bayesian computation. Syst. Biol. 66(1), e66–e82 (2017)
Ma, Y.A., Chen, T., Fox, E.: A complete recipe for stochastic gradient MCMC. In: Advances in Neural Information Processing Systems (2015)
Marin, J.M., Pudlo, P., Robert, C.P., Ryder, R.J.: Approximate Bayesian computational methods. Stat. Comput. 22(6), 1167–1180 (2012)
Marjoram, P., Molitor, J., Plagnol, V., Tavaré, S.: Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. 100(26), 15324–15328 (2003)
Meeds, E., Leenders, R., Welling, M.: Hamiltonian ABC. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 582–591 (2015)
Meeds, E., Welling, M.: GPS-ABC: Gaussian process surrogate approximate Bayesian computation. In: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 593–602 (2014)
Meeds, T., Welling, M.: Optimization Monte Carlo: efficient and embarrassingly parallel likelihood-free inference. In: Advances in Neural Information Processing Systems, pp. 2071–2079 (2015)
Mondal, M., Bertranpetit, J., Lao, O.: Approximate Bayesian computation with deep learning supports a third archaic introgression in Asia and Oceania. Nat. Commun. 10(1), 246 (2019)
Pritchard, J.K., Seielstad, M.T., Perez-Lezaun, A., Feldman, M.W.: Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 16(12), 1791–1798 (1999)
Sisson, S.A., Fan, Y., Beaumont, M.: Handbook of Approximate Bayesian Computation. Chapman and Hall/CRC, New York (2018)
Sunnåker, M., Busetto, A.G., Numminen, E., Corander, J., Foll, M., Dessimoz, C.: Approximate Bayesian computation. PLoS Comput. Biol. 9(1), e1002803 (2013)
Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688 (2011)
Wilkinson, R.: Accelerating ABC methods using Gaussian processes. In: Artificial Intelligence and Statistics, pp. 1015–1023 (2014)
Wilkinson, R.D.: Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Stat. Appl. Genet. Mol. Biol. 12(2), 129–141 (2013)
Wood, S.N.: Statistical inference for noisy nonlinear ecological dynamic systems. Nature 466(7310), 1102–1104 (2010)
Acknowledgement
This study was funded by the Scientific Research Fund of North University of China (No. XJJ201803).
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, Y., Yu, X., Qin, P., Chai, R., Qiao, G. (2020). Improving Approximate Bayesian Computation with Pre-judgment Rule. In: Zeng, J., Jing, W., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2020. Communications in Computer and Information Science, vol 1257. Springer, Singapore. https://doi.org/10.1007/978-981-15-7981-3_15
DOI: https://doi.org/10.1007/978-981-15-7981-3_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-7980-6
Online ISBN: 978-981-15-7981-3
eBook Packages: Computer Science, Computer Science (R0)