1 Computational challenges in Bayesian nonparametric models

One of the most successful strategies of the Bayesian nonparametric approach to statistical inference has arguably been semiparametric mixture modelling, which has proved to be extremely flexible and widely applicable. Semiparametric modelling assumes the observations are generated by parametric densities conditionally on the value of a set of parameters, which in turn are assigned a nonparametric distribution. More formally, we have the hierarchical representation

$$\begin{aligned} Y_i|\theta _i\mathop {\sim }\limits ^{\text {ind}}\varphi (y_i|\theta _i), \quad \theta _i|G \mathop {\sim }\limits ^{iid} G, \quad G\sim q(G^{*},\zeta ). \end{aligned}$$
(1)

Here \(Y_{1},\ldots ,Y_{n}\) are the observations, \(\theta _{1},\ldots ,\theta _{n}\) are a set of latent variables that parametrise the densities \(\varphi (y_i|\theta _i)\), and G is a nonparametric distribution with prior q. The latter is in turn parametrised by a baseline distribution \(G^{*}\) on the parameter space \(\Theta \subset \mathbb {R}^{d}\) and by a vector of reals \(\zeta \).

Upon observation of a dataset, posterior inference requires evaluating the conditional distribution of the parameters given the data. As is typically the case in the absence of a fully conjugate model, i.e. one for which the family of distributions assigned to the parameters is closed under conditioning on the data, one needs to resort to Markov chain Monte Carlo methods to sample from the posterior. Early contributions dealing with this problem date back to Escobar and West (1995), MacEachern and Müller (1998) and Neal (2000), and the widespread use of computer-aided inference has since boosted the investigation of new and efficient algorithms for posterior analysis under a variety of modelling assumptions, generating a very lively literature. A brief introduction to such methods can focus on models as in (1) under the assumption that the mixing distribution G is almost surely a discrete probability measure with representation \(G:=\sum _{h\ge 1}w_h\delta _{\theta _h^{*}}\), where \(\{w_h\}\) are random weights that sum to one and \(\{\theta _h^{*}\}\) are iid random points drawn, independently of the weights, from the baseline distribution \(G^{*}\). When the weights are obtained by normalising the increments of a time-changed subordinator, or more generally of a completely random measure (Kingman 1967), this specification coincides with the relevant class of random probability measures given by (homogeneous) normalised random measures with independent increments (Regazzini et al. 2003), which have recently been the object of intense research and which in turn include the celebrated Dirichlet process (Ferguson 1973). See Lijoi and Prünster (2010) for a recent review.
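As a concrete illustration of this discrete representation, the following minimal sketch draws a truncated stick-breaking approximation of a Dirichlet process and then generates data according to (1); the truncation level, the concentration parameter and the Gaussian baseline and kernel are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_dp(alpha, base_sampler, N):
    """Draw a truncated stick-breaking approximation of a Dirichlet process.

    Weights w_h = v_h * prod_{l<h}(1 - v_l) with v_h ~ Beta(1, alpha);
    atoms theta*_h are iid from the baseline G*.  N is the truncation level.
    """
    v = rng.beta(1.0, alpha, size=N)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = base_sampler(N)
    return w / w.sum(), atoms           # renormalise the truncated weights

# Illustrative choices: alpha = 1, baseline G* = N(0, 5^2), kernel phi = N(theta, 1)
alpha, N, n = 1.0, 200, 50
w, atoms = stick_breaking_dp(alpha, lambda size: rng.normal(0.0, 5.0, size), N)

# Hierarchy (1): theta_i | G ~ G, then Y_i | theta_i ~ N(theta_i, 1)
theta = atoms[rng.choice(N, size=n, p=w)]
y = rng.normal(theta, 1.0)
print(f"{len(np.unique(theta))} distinct latent values among n = {n}")
```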

A broad classification of the algorithms that enable posterior inference under the above specifications divides them into marginal and conditional Gibbs sampling methods. Marginal Gibbs samplers are so called because they integrate the random probability measure G out of (1). This entails sampling from

$$\begin{aligned} \mathcal {L}(\mathrm {d}\theta _1,\dots ,\mathrm {d}\theta _n|\mathbf y )\propto \prod _{i=1}^{n}\varphi (y_i|\theta _i) \mathcal {L}(\mathrm {d}\theta _1,\dots ,\mathrm {d}\theta _n) \end{aligned}$$
(2)

where \(\mathcal {L}(\mathrm {d}\theta _1,\dots ,\mathrm {d}\theta _n)\) is the prior marginal distribution of a sample from G and \(\mathbf y =(y_{1},\ldots ,y_{n})\) are the data. This marginal distribution can typically be characterised in terms of the predictive laws \(\mathcal {L}(\theta _{i}|\theta _1,\dots ,\theta _{i-1},\theta _{i+1},\ldots ,\theta _{n})\), which give rise to Pólya urn schemes in the case of the Dirichlet process (Blackwell and MacQueen 1973), or to generalisations thereof. Accordingly, Gibbs samplers with invariant distribution (2) are often called generalised Pólya urn Gibbs samplers.
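For the Dirichlet process with concentration parameter \(\alpha\) and baseline \(G^{*}\), the Blackwell–MacQueen predictive law is \(\theta_{i+1}\mid \theta_1,\ldots,\theta_i \sim (\alpha G^{*}+\sum_{j\le i}\delta_{\theta_j})/(\alpha+i)\). A minimal sketch of sequential sampling from this Pólya urn (the parameter values and Gaussian baseline are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def polya_urn_sample(n, alpha, base_sampler):
    """Sample theta_1, ..., theta_n from the Blackwell-MacQueen Polya urn
    associated with a Dirichlet process with concentration alpha and baseline G*."""
    thetas = []
    for i in range(n):
        # with probability alpha/(alpha + i) draw a new value from G*,
        # otherwise copy one of the previously drawn values uniformly at random
        if rng.random() < alpha / (alpha + i):
            thetas.append(base_sampler())
        else:
            thetas.append(thetas[rng.integers(i)])
    return np.array(thetas)

theta = polya_urn_sample(100, alpha=1.0, base_sampler=lambda: rng.normal(0.0, 5.0))
print(f"{len(np.unique(theta))} distinct values among 100 draws")
```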

Since G is discrete, the \(\theta _i\)’s will induce a partition \(\rho =\{C_1,\dots ,C_k\}\) of \(\{1,\dots ,n\}\) with \(C_j = \{i : \theta _i = \theta ^*_j\}\), where \(j=1,\dots ,k\) and \(\theta _{1}^{*},\ldots ,\theta _{k}^{*}\) are the distinct values among \(\theta _{1},\ldots ,\theta _{n}\). Since the \(\theta ^*_h\) are iid and independent of \(\{w_h\}\) in G, the law \(\mathcal {L}(\theta _1,\dots ,\theta _n)\) can be written as

$$\begin{aligned} \mathcal {L}(C_{1},\ldots ,C_{k},d\theta ^*_1,\dots ,d\theta ^*_k)=p(n_1,\dots ,n_k) \prod _{j=1}^{k}G^{*}(d\theta ^*_j), \end{aligned}$$

where \(n_j=\text {card}(C_j)\). The function p is called the exchangeable partition probability function (EPPF) and represents the law of the random partition \(\rho \). Marginalising over \(\theta ^*_1,\dots ,\theta ^*_k \) allows one to express (2) as

$$\begin{aligned} \mathcal {L}(C_1,\dots ,C_k|\mathbf y )\propto \prod _{j=1}^{k}m(y_{C_j})\, p(n_1,\dots ,n_k) \end{aligned}$$
(3)

where \(m(y_{C_j}):=\int _{\Theta }\prod _{i \in C_j}\varphi (y_i|\theta )\,G^{*}(\mathrm {d}\theta )\) is the marginal distribution of the data in group \(C_j\). The computation now relies on the availability of efficient strategies for sampling from the EPPF, which are model-specific. Efficient solutions have been found based on the so-called Chinese restaurant process (Pitman 2006) and its generalisations. Furthermore, expression (3) is the starting point for setting up inference under product partition models with regression on covariates, a class of models introduced in Müller et al. (2011) and extended to the spatial setting by Page and Quintana (2016) (cf. also Section 5.3 of Müller et al. 2017).
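To fix ideas, in the Dirichlet process case the EPPF is \(p(n_1,\dots,n_k)\propto \alpha^{k}\prod_{j=1}^{k}(n_j-1)!\), up to a factor not depending on the partition, and with a Gaussian kernel and Gaussian baseline the marginals \(m(y_{C_j})\) are available in closed form. The following sketch evaluates the unnormalised logarithm of (3) for candidate partitions; the kernel, baseline and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal(y_group, sigma=1.0, mu0=0.0, tau0=5.0):
    """log m(y_C) for a group sharing theta: y_i | theta ~ N(theta, sigma^2),
    theta ~ G* = N(mu0, tau0^2).  Jointly, y_C is multivariate normal."""
    m = len(y_group)
    cov = sigma**2 * np.eye(m) + tau0**2 * np.ones((m, m))
    return multivariate_normal.logpdf(y_group, mean=np.full(m, mu0), cov=cov)

def log_post_partition(partition, y, alpha=1.0):
    """Unnormalised log of (3) for the Dirichlet process, whose EPPF is
    proportional to alpha^k * prod_j (n_j - 1)!."""
    lp = 0.0
    for block in partition:                          # partition: list of index lists
        n_j = len(block)
        # each block contributes log(alpha) + log((n_j - 1)!)
        lp += np.log(alpha) + np.sum(np.log(np.arange(1, n_j)))
        lp += log_marginal(y[np.asarray(block)])
    return lp

y = np.array([-4.1, -3.8, -4.3, 5.0, 5.2, 4.7])
print(log_post_partition([[0, 1, 2], [3, 4, 5]], y))   # two well-separated groups
print(log_post_partition([[0, 3], [1, 4], [2, 5]], y)) # a mismatched alternative
```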

The marginal sampling methods briefly described above are extremely useful, since integrating out the random probability measure reduces an infinite-dimensional computation to a finite number of operations. A downside, however, is that inference is limited to point estimation of linear functionals of the population, such as predictive distributions, without allowing one to quantify the associated uncertainty.

Alternative strategies retain the random probability measure G as part of the model, to be updated within the Gibbs sampling routine, and are therefore called conditional methods. Given the series representation of G, these strategies shift the problem to that of simulating G, conditional on the data, with small or no approximation error. Truncation methods are the most intuitive option, and entail finding an appropriate N in \(G_{N}:=\sum \limits _{h=1}^{N}w_{h}\delta _{\theta ^*_h}\) which guarantees that certain minimal accuracy requirements are met. Several ways to achieve this have been proposed, among others, in Muliere and Tardella (1998), Ishwaran and James (2001), Barrios et al. (2013), Argiento et al. (2016a, b), Arbel and Prünster (2017). These truncation methods are generally fairly easy to implement, but require fixing a priori, implicitly or explicitly, some notion of distance between the approximating and the target measures.
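As a minimal sketch of the truncation idea in the Dirichlet process case: under stick-breaking the expected mass left out after N components is \((\alpha/(1+\alpha))^{N}\), so a tolerance \(\varepsilon\) on this quantity translates into an explicit truncation level. This is only one of several possible truncation criteria proposed in the references above; the parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def dp_truncation_level(alpha, eps):
    """Smallest N with E[1 - sum_{h<=N} w_h] = (alpha/(1+alpha))^N <= eps,
    using the stick-breaking representation of the Dirichlet process."""
    return int(np.ceil(np.log(eps) / np.log(alpha / (1.0 + alpha))))

def truncated_dp(alpha, eps, base_sampler):
    N = dp_truncation_level(alpha, eps)
    v = rng.beta(1.0, alpha, size=N)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return w, base_sampler(N), 1.0 - w.sum()     # realised leftover mass

alpha, eps = 1.0, 1e-6
w, atoms, leftover = truncated_dp(alpha, eps, lambda size: rng.normal(0.0, 5.0, size))
print(f"N = {len(w)}, realised truncation error = {leftover:.2e}")
```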

Other, very successful, stochastic truncation methods allow exact sampling from the random probability measure and have proven to be reliable and relatively easy to implement. These include the slice sampler (Walker 2007; Kalli et al. 2011) and the retrospective sampler (Papaspiliopoulos and Roberts 2008), together with their adaptations and generalisations. The slice sampler requires introducing appropriate latent [0,1]-valued variables \(u_i\) so that

$$\begin{aligned} \mathcal {L}(\mathrm {d}u_i,\mathrm {d}\theta _i|G)=\sum _{h=1}^{\infty }\mathbf {1}_{(0,w_h)}(u_i)\,\mathrm {d}u_i\,\delta _{\theta ^{*}_h}(\mathrm {d}\theta _i), \end{aligned}$$

whereby integrating \(u_i\) out of the previous expression recovers \(\mathcal {L}(\mathrm {d}\theta _i|G)=G(\mathrm {d}\theta _i)\). Define now

$$\begin{aligned} \mathcal {L}(\mathrm {d}\theta _i|u_i,G) \propto G_{u_i}(\mathrm {d}\theta _i) :=\sum _{h \in A(u_i)}\delta _{\theta ^{*}_h}(\mathrm {d}\theta _i) \end{aligned}$$

where \(A(u_i):=\{h: w_h>u_i\}\) is a finite set. From the latter it is clear that sampling G, conditional on \(u_1,\dots ,u_n, \theta _1,\dots ,\theta _n\) and the data, entails updating only finitely many of its components, namely the pairs \((w_h,\theta ^{*}_h)\) for \(h\in \cup _{i=1}^{n}A(u_i)\).
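A minimal sketch of this slice step, assuming a current stick-breaking state and allocations (all values illustrative): given \(u_i\sim\text{Uniform}(0,w_{z_i})\) for the current allocation \(z_i\), only the finitely many components in \(\cup_i A(u_i)\) need to be touched.

```python
import numpy as np

rng = np.random.default_rng(3)

# illustrative current state: truncated stick-breaking weights and atoms
v = rng.beta(1.0, 1.0, size=500)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
atoms = rng.normal(0.0, 5.0, size=500)

# current allocations of n latent variables to components
n = 30
z = rng.choice(500, size=n, p=w / w.sum())

# slice step: u_i ~ Uniform(0, w_{z_i}), so only components with w_h > u_i matter
u = rng.uniform(0.0, w[z])

# A(u_i) = {h : w_h > u_i}; the union over i is the finite set of pairs to update
active = [np.flatnonzero(w > u_i) for u_i in u]
union = np.unique(np.concatenate(active))
print(f"components to update: {union.size} out of (in principle) infinitely many")
```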

The retrospective sampler is instead based on the idea of exchanging the intuitive order of simulation when sampling from G. The intuitive order would require first sampling the infinite sequences \(\{w_h\},\{\theta ^{*}_h\}\), then drawing \(v_i\) uniformly distributed on (0, 1) and setting \(\theta _i=\theta ^{*}_l\) if \(\sum _{j=1}^{l-1}w_j<v_i\le \sum _{j=1}^{l}w_j\). The retrospective sampler instead first samples \(v_i\) and then draws only as many pairs \((w_h,\theta ^{*}_h)\) as are needed to check the above inequalities.
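The following sketch illustrates the retrospective idea in the Dirichlet process stick-breaking case (concentration and baseline are illustrative): the weight and atom sequences are extended lazily, only as far as needed to locate the interval containing each \(v_i\).

```python
import numpy as np

rng = np.random.default_rng(4)

def retrospective_draw(v_i, w, atoms, alpha, base_sampler):
    """Return theta_i given v_i ~ U(0,1), extending the stick-breaking lists
    (w, atoms) retrospectively only as far as needed.  Lists are modified in place."""
    leftover = 1.0 - sum(w)                 # stick mass not yet broken off
    cum = 0.0
    h = 0
    while True:
        if h == len(w):                     # not enough sticks yet: generate one more
            stick = rng.beta(1.0, alpha) * leftover
            leftover -= stick
            w.append(stick)
            atoms.append(base_sampler())
        cum += w[h]
        if v_i <= cum:                      # sum_{j<h} w_j < v_i <= sum_{j<=h} w_j
            return atoms[h]
        h += 1

w, atoms = [], []                           # start with no components generated
alpha = 1.0
theta = [retrospective_draw(rng.random(), w, atoms, alpha,
                            lambda: rng.normal(0.0, 5.0)) for _ in range(20)]
print(f"components generated so far: {len(w)} for 20 draws")
```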

The Gibbs sampling procedures described so far are very appealing strategies, but remain computationally intensive. This makes the use of mixtures such as (1) infeasible when dealing with large datasets, or when computational resources are limited. Recently, variational Bayes methods have been proposed as an alternative (Blei and Jordan 2006). Acting essentially as optimisation algorithms, these methods approximate the posterior distribution of G by a distribution \(\tilde{q}\) of a finite-dimensional process, called the variational distribution. The goal is then to adjust the parameters of \(\tilde{q}\) so as to minimise the Kullback–Leibler divergence between \(\tilde{q}\) and the posterior. Robustness of variational Bayes methods is currently one of the open problems in the Bayesian nonparametric literature, as they are known to potentially underestimate the model uncertainty.
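The following toy sketch is not the algorithm of Blei and Jordan (2006), but a minimal numerical illustration of the two points just made: fitting a variational distribution by minimising the Kullback–Leibler divergence \(\text{KL}(\tilde{q}\,\|\,\text{posterior})\), and the resulting tendency to underestimate uncertainty. A single Gaussian \(\tilde{q}\) is fitted to a bimodal target by numerical minimisation of the divergence on a grid; all choices are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# toy bimodal "posterior" (mixture of two normals) evaluated on a grid
x = np.linspace(-10, 10, 4000)
dx = x[1] - x[0]
post = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)

def reverse_kl(params):
    """KL(q || posterior) for a Gaussian variational family q = N(mu, s^2)."""
    mu, log_s = params
    q = norm.pdf(x, mu, np.exp(log_s))
    mask = q > 1e-300                       # avoid log(0) in the far tails
    return np.sum(q[mask] * (np.log(q[mask]) - np.log(post[mask]))) * dx

res = minimize(reverse_kl, x0=[1.0, 0.0], method="Nelder-Mead")
mu_hat, s_hat = res.x[0], np.exp(res.x[1])
post_var = np.sum(x**2 * post) * dx - (np.sum(x * post) * dx) ** 2
print(f"variational fit: mean {mu_hat:.2f}, sd {s_hat:.2f}")   # locks onto one mode
print(f"target: mean 0.00, sd {np.sqrt(post_var):.2f}")        # much larger spread
```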

2 Temporal dependence in Bayesian nonparametric models

An important line of research in Bayesian nonparametrics on so-called dependent processes has developed from the ideas introduced in MacEachern (1999), where collections of dependent random probability measures \(\{G_{z},z\in \mathcal {Z}\}\) are considered, and \(G_{z}\) encodes the current state of the problem at the covariate value z. Cf. Müller et al. (2017, Section 3.2.1). Computational methods for dependent models are very often problem-specific extensions of those summarised in Sect. 1. Providing a general overview of these computational strategies would be a difficult task, far beyond the scope and possibilities of this discussion. Since Section 5 of Müller et al. (2017) presents some applications of dependent models to spatial data, we choose here to briefly discuss some issues related to models with temporal dependence, with particular emphasis on the role of conjugacy.

A common setting for Bayesian inference with temporal dependence is that of partial exchangeability, whereby the available data are of the form \(y_{t_{i},1},\ldots ,y_{t_{i},n_{i}}\), where the indices \(t_{i}\) are discrete data-collection times, \(n_{i}\ge 1\) for all i, and the data \(y_{t_{i},j}\) are such that, as j varies,

$$\begin{aligned} y_{t_{i},j}\mid G_{t_{i}}\sim ^{iid} G_{t_{i}}. \end{aligned}$$

Hence the data are exchangeable across the \(t_{i}\)-sections, but not overall. From a temporal modelling perspective, one ideally wants the correlation between pairs of random measures \(G_{t}\) and \(G_s\) to increase as the indices t and s get closer, and decay to zero as t and s grow farther apart.

A non-exhaustive list of contributions along this line of investigation, based on Dirichlet mixture models, includes, among others, Dunson (2006), Caron et al. (2008), Griffin and Steel (2010), Caron et al. (2007), Rodriguez and ter Horst (2008), Taddy and Kottas (2009), Mena and Ruggiero (2016). Other contributions have explored models which go beyond the structure of the Dirichlet process or closely related constructions, aiming at modelling, for example: marginal measures of the dependent process of geometric stick-breaking type (Mena et al. 2011), of Pitman–Yor type (Caron et al. 2017), of GEM type (Gutierrez et al. 2016), or of gamma type (Caron and Teh 2012); evolving binary matrices for relational network structures (Durante and Dunson 2014), or for dynamic feature allocation (Perrone et al. 2017); monotonic functional time series (Canale and Ruggiero 2016); emission distributions for hidden Markov models (Yau et al. 2011; Linderman et al. 2016).

Here we are interested in highlighting two roles that conjugacy can play in these approaches to inference. The first is in constructing stationary temporal models with a built-in simulation scheme, as done in Pitt and Walker (2005), Walker et al. (2007), Mena and Walker (2009). The kernel of the idea is to consider joint distributions, for some fixed \(n\ge 1\),

$$\begin{aligned} \mathcal {L}(\mathrm {d}\theta _{1},\ldots ,\mathrm {d}\theta _{n},\mathrm {d}G)=\mathcal {L}(\mathrm {d}G)\prod _{i=1}^{n}\mathcal {L}(\mathrm {d}\theta _{i}\mid G) \end{aligned}$$

where \(\mathcal {L}(\mathrm {d}G)=q(\mathrm {d}G)\) is the nonparametric prior on G, and to construct transition functions through latent variables by writing

$$\begin{aligned} P(G,\mathrm {d}H)=\int \mathcal {L}(\mathrm {d}H\mid \theta _{1},\ldots ,\theta _{n})\prod _{i=1}^{n}\mathcal {L}(\mathrm {d}\theta _{i}\mid G) \end{aligned}$$
(4)

where \(\mathcal {L}(\mathrm {d}H\mid \theta _{1},\ldots ,\theta _{n})\) is the posterior of H given \(\theta _{1},\ldots ,\theta _{n}\). For example, if \(p:=G(A)\in [0,1]\) for some fixed set A, the law of G(A) is a beta distribution, and \(\mathcal {L}(\mathrm {d}\theta _{i}\mid G(A))\) is Bernoulli with parameter p, then the above reduces to the beta-binomial transition

$$\begin{aligned} P(p,\mathrm {d}p')=\sum _{\theta =0}^{n}\text {beta}(\mathrm {d}p'\mid a+\theta ,b+n-\theta )\,\text {Binom}(\theta \mid n,p) \end{aligned}$$

where (a, b) are the prior beta hyperparameters. Note that this is in fact the transition function of the marginal state of a two-dimensional Gibbs sampler on the augmented space of \((p,\theta )\), which is stationary with respect to a beta distribution (a minimal simulation of this transition is sketched after (6) below). In a nonparametric framework, if \(\mathcal {L}(\mathrm {d}G)=\Pi (\mathrm {d}G\mid \alpha )\) for some finite parameter measure \(\alpha \), and \(\Pi \) is conjugate in the sense that \( G\sim \Pi \) and \(X\mid G\sim G\) jointly imply \(G\mid X\sim \Pi (\cdot \mid f(\alpha ,X))\) for some function f of the data and the prior parameter, then (4) yields

$$\begin{aligned} P(G,\mathrm {d}H)=\int \Pi (\mathrm {d}H\mid f(\alpha ,\theta _{1},\ldots ,\theta _{n}))\prod _{i=1}^{n} G(\mathrm {d}\theta _{i}). \end{aligned}$$
(5)

Here \(\Pi \) can be shown to be the reversible measure of the process, so this strategy allows the construction of stationary nonparametric processes. Lijoi et al. (2016) discuss along these lines the Bayesian interpretation of the dynamics of two families of continuous-time dependent Dirichlet and gamma models, the latter used for example in Caron and Teh (2012). See also Ruggiero and Walker (2009). The transition functions of such models can be obtained by randomising n in (4) and by introducing appropriate coefficients which make \(P(G,\mathrm {d}H)\) satisfy the Chapman–Kolmogorov conditions in continuous time. For example, for these two families one has

$$\begin{aligned} P_{t}(G,\mathrm {d}H)=\sum _{n\ge 0}P(N_{t}=n)\int \Pi (\mathrm {d}H\mid f(\alpha ,\theta _{1},\ldots ,\theta _{n}))\prod _{i=1}^{n} G(\mathrm {d}\theta _{i}), \end{aligned}$$
(6)

where \(N_{t}\) is an appropriate \(\mathbb {Z}_{+}\)-valued continuous-time process which determines the size of the latent sample \((\theta _{1},\ldots ,\theta _{n})\) and complies with the requirements on the correlation mentioned at the beginning of the section. This approach has been followed explicitly in Walker et al. (2007). The resulting transitions are therefore infinite mixtures. Simulation of these transition functions can in principle be done by resorting to one of the methods outlined in the previous section, e.g. by using a slice sampler twice, on the mixture (6) and on the infinite-dimensional random measure which is the state of the process, as done for example in Mena et al. (2011). Model-specific hurdles may however call for additional steps; e.g. Jenkins and Spanò (2017) develop an exact simulation scheme for (6) in some finite- and infinite-dimensional Dirichlet cases, which deals efficiently with the non-trivial expression for \(P(N_t = n)\), which itself has a series representation.
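As a toy illustration of the mechanism behind (4) and (6), the following sketch simulates the beta-binomial transition above and checks empirically that the beta law is left invariant. The law of the latent sample size in the randomised version is model-specific and is deliberately left unspecified here; only the fixed-n chain is actually run.

```python
import numpy as np

rng = np.random.default_rng(5)
a, b = 2.0, 3.0                      # hyperparameters of the stationary beta law

def transition(p, n):
    """One step of the beta-binomial instance of (4): draw the latent
    theta | p ~ Binom(n, p), then p' | theta ~ Beta(a + theta, b + n - theta)."""
    theta = rng.binomial(n, p)
    return rng.beta(a + theta, b + n - theta)

def transition_rand_n(p, sample_n):
    """Transition in the spirit of (6), with a randomised latent sample size.
    `sample_n` stands for the model-specific law P(N_t = n), left unspecified
    here; the n = 0 term returns an independent draw from the prior.  Any such
    mixture over n preserves Beta(a, b), since each fixed-n kernel does."""
    n = sample_n()
    return rng.beta(a, b) if n == 0 else transition(p, n)

# long run of the fixed-n chain: the empirical law matches Beta(a, b)
p, draws = 0.5, np.empty(20000)
for i in range(draws.size):
    p = transition(p, n=10)
    draws[i] = p
print(f"empirical mean {draws.mean():.3f}  vs  beta mean {a/(a+b):.3f}")
print(f"empirical var  {draws.var():.4f}  vs  beta var  {a*b/((a+b)**2*(a+b+1)):.4f}")
```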

Alternatively, conjugacy can be deliberately sought in order to reduce the overall Monte Carlo error and predictive uncertainty within a broader computation. Papaspiliopoulos et al. (2016), for example, extend classical posterior characterisations for Dirichlet and gamma random measures to the two above-mentioned families of dependent processes, conditional on discretely collected data. In particular, sufficient conditions are identified for these models (cf. Papaspiliopoulos and Ruggiero 2014) that allow one to write (6), conditional on \(y_{1},\ldots ,y_{m}\) collected possibly at different times, as

$$\begin{aligned} P_{t} (G,\mathrm {d}H\mid y_{1},\ldots ,y_{m})=\sum _{i=0}^{m}w_{i}(t)\Pi (\mathrm {d}H\mid f(\alpha ,y_{1},\ldots ,y_{i})). \end{aligned}$$

This reduces (6), upon conditioning on the observed data, to a finite mixture of distributions in the same conjugate family. Note that the mixture components only depend on \(y_{1},\ldots ,y_{i}\) and not on the entire sample. The \(w_{i}(t)\)’s are appropriate time-dependent weights which regulate how the posterior mass is reassigned at different times to the mixture components.
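As a minimal sketch of the resulting computational gain, consider the one-dimensional beta–Bernoulli analogue of the above: the conditional law is a finite mixture of beta distributions whose i-th component only involves \(y_1,\ldots,y_i\). The weights \(w_i(t)\) below are placeholder values, not the model-specific expressions derived in the cited works; the data and hyperparameters are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
a, b = 2.0, 3.0                              # prior beta hyperparameters

# Illustrative binary data y_1, ..., y_m (Bernoulli case, so f(alpha, y_1..y_i)
# amounts to updating the beta parameters with the first i observations).
y = np.array([1, 0, 1, 1, 0, 1])
m = len(y)

# Placeholder time-dependent weights w_0(t), ..., w_m(t): in the cited works they
# have explicit, model-specific expressions; here they are ONLY illustrative.
w_t = rng.dirichlet(np.ones(m + 1))

def sample_finite_mixture(size):
    """Draw from sum_i w_i(t) Beta(a + s_i, b + i - s_i), where s_i = y_1 + ... + y_i,
    i.e. the i-th component conditions only on the first i observations."""
    comp = rng.choice(m + 1, size=size, p=w_t)
    s = np.concatenate(([0], np.cumsum(y)))      # s_0 = 0, s_i = sum of first i y's
    i = np.arange(m + 1)
    return rng.beta(a + s[comp], b + i[comp] - s[comp])

samples = sample_finite_mixture(5)
print(np.round(samples, 3))
```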