1 Nonparametric Bayesian Inference in Biostatistics and Bioinformatics

The increased complexity of biomedical inference problems requires ever more sophisticated and flexible approaches to statistical inference. The challenges include in particular massive data, high-dimensional sets of potential covariates, highly structured stochastic systems, and complicated decision problems. Some of these challenges can be naturally addressed with a class of inference approaches known as nonparametric Bayesian (BNP) methods. A technical definition of BNP models is that they are probability models on infinite dimensional probability spaces. This includes priors on random probability measures, random mean functions, and more.

BNP methods relax the sometimes restrictive assumptions of traditional parametric methods. A parametric model is indexed by an unknown finite dimensional parameter vector \(\theta\). Bayesian inference proceeds by assuming a prior probability model \(p(\theta )\) which is updated with the relevant sampling model \(p(y\mid \theta )\) for the observed data y.

For example, consider a density estimation problem, with observed data \(y_i \sim G\), \(i = 1,\ldots,n\). Inference under the Bayesian paradigm requires a completion of the model with a prior for the unknown distribution G. If G is restricted to be in a family \(\{G_{\theta },\;\theta \in \mathfrak{R}^{d}\}\), then the prior is specified as a prior probability model \(p(\theta )\) for the d-dimensional parameter vector \(\theta\). In contrast, if G is not restricted to a finite dimensional parametric family, then the prior model p(G) becomes a probability model for the infinite dimensional G.

A very common related use of BNP priors on random probability measures is for random effects distributions in mixed effects models. Such generalizations of parametric models are important when the default choice of multivariate normal random effects might understate uncertainty and miss important structure. Another important class of BNP priors are priors on unknown functions, for example a prior p(f) for the unknown mean function f(x) in a regression model \(y_{i} = f(x_{i}) +\epsilon _{i}\).

The chapters in this volume discuss important research problems in biostatistics and bioinformatics that are naturally addressed by BNP methods. Each chapter introduces and defines the BNP methods and models that are used to address the specific problem. In this introductory chapter we briefly introduce and review some of the most commonly used BNP priors. Posterior inference in many of these models gives rise to challenging computational problems. We review some of the most commonly used computational methods and include some references. The brief review in this introduction includes the ubiquitous Dirichlet process (DP) model, the DP mixture model (DPM), the dependent DP (DDP) model, the Polya tree (PT) prior, and the Gaussian process (GP) prior. These models and their variations are the workhorses of BNP inference in biostatistics. The next chapter in this volume discusses some typical examples by reviewing BNP methods in some important applications.

For a more exhaustive discussion of BNP models see, for example, Hjort et al. (2010), Müller and Rodríguez (2013), Walker et al. (1999), Müller and Quintana (2004), Walker (2013), and Müller et al. (2015).

2 Dirichlet Process

Let \(\delta_x(\cdot)\) denote a point mass at x. The DP prior (Ferguson 1973) is a probability model for a random distribution G,

$$\displaystyle{ G =\sum _{ h=1}^{\infty }w_{ h}\delta _{m_{h}}, }$$
(1.1)

with independent locations \(m_h \sim G_0\), i.i.d., and weights that are constructed as \(w_{h} = v_{h}\prod _{\ell<h}(1 - v_{\ell})\) with independent beta fractions \(v_{h} \sim \mbox{ Be}(1,M)\), i.i.d. (Sethuraman 1994). The prior on the \(w_h\) is known as the stick-breaking process. It can be described as breaking off fractions \(v_h\) of a stick of initially unit length. The DP prior is characterized by the base measure \(G_0\) that generates the locations of the atoms \(m_h\) and the total mass parameter M that determines the distribution of the beta fractions \(v_h\). We write \(G \sim \mbox{ DP}(M,G_{0})\). Implied in the stick-breaking construction is an important property of DP random measures: a DP random measure \(G \sim \mbox{ DP}(M,G_{0})\) is discrete with probability one.
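
The stick-breaking construction (1.1) translates directly into simulation code. The following is a minimal sketch, assuming a standard normal base measure \(G_0\) and a finite truncation level H of the infinite sum; both choices are illustrative and not part of the construction itself.

```python
import numpy as np

def stick_breaking_dp(M, H, base_sampler, rng):
    """Truncated stick-breaking draw of G ~ DP(M, G0), as in (1.1):
    atoms m_h ~ G0 i.i.d., weights w_h = v_h * prod_{l<h} (1 - v_l)
    with v_h ~ Be(1, M)."""
    v = rng.beta(1.0, M, size=H)                                   # beta fractions v_h
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))      # stick-breaking weights
    m = base_sampler(H)                                            # atom locations m_h ~ G0
    return m, w

rng = np.random.default_rng(0)
# G0 = N(0, 1) and truncation level H = 1000 are illustrative choices
atoms, weights = stick_breaking_dp(M=2.0, H=1000,
                                   base_sampler=lambda n: rng.normal(size=n), rng=rng)
print(weights.sum())   # close to 1 for a large enough truncation
```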

The DP is a conjugate prior under i.i.d. sampling. That is, assume \(x_i \mid G \sim G\), i.i.d., i = 1, …, n and \(G \sim \mbox{ DP}(M,G_{0})\). Let \(F_{n} = \frac{1} {n}\sum _{i=1}^{n}\delta _{ x_{i}}\) denote the empirical distribution. Then \(p(G\mid \boldsymbol{x}) = \mbox{ DP}(M + n,G_{1})\) with \(G_1 \propto MG_0 + nF_n\). An interesting limiting case occurs for M → 0, when the posterior on G is entirely determined by the empirical distribution. This leads to a construction known as the Bayesian bootstrap, which is discussed in Chap. 16 (Inácio de Carvalho et al. 2015).
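
The conjugate update and its M → 0 limit can again be illustrated by simulation. The sketch below uses synthetic data, an assumed base measure \(G_0 = N(0,1)\), and a finite truncation: a posterior draw from \(\mbox{DP}(M+n, G_1)\) samples atoms from the mixture \(G_1\), while the Bayesian bootstrap places Dirichlet(1, …, 1) weights on the observed points.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=50)            # observed data (synthetic)
M, n, H = 2.0, x.size, 1000

# one truncated draw from the posterior DP(M + n, G1), G1 = (M G0 + n Fn) / (M + n);
# G0 = N(0, 1) is an illustrative choice
v = rng.beta(1.0, M + n, size=H)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
from_base = rng.random(H) < M / (M + n)      # atom drawn from G0 with prob M / (M + n)
atoms = np.where(from_base, rng.normal(size=H), rng.choice(x, size=H))

# Bayesian bootstrap (limit M -> 0): a posterior draw is supported on the data,
# with Dirichlet(1, ..., 1) weights on the observed points
bb_weights = rng.dirichlet(np.ones(n))

print(np.sum(w * atoms), np.sum(bb_weights * x))   # two posterior draws of the mean of G
```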

One of the reasons for the wide use of the DP prior is ease of computation for posterior inference in models based on the DP. In particular, the DP prior implies a particularly simple predictive probability function \(p(x_n \mid x_1,\ldots,x_{n-1})\). Under i.i.d. sampling from a DP random measure the marginal distribution \(p(x_{1},\ldots,x_{n}) =\int \prod _{ i=1}^{n}G(x_{i})\;dp(G)\) reduces to a simple expression, most easily characterized as \(p(x_{1},\ldots,x_{n}) =\prod _{ i=1}^{n}p(x_{i}\mid x_{1},\ldots,x_{i-1})\) with conditionals

$$\displaystyle{ p(x_{i}\mid x_{1},\ldots,x_{i-1}) \propto MG_{0}(x_{i}) +\sum _{ \ell=1}^{i-1}\delta _{ x_{\ell}}. }$$
(1.2)

With probability \(\pi _{0} = M/(i - 1 + M)\) the sample \(x_i\) is a new draw from \(G_0\), and with probability \(1/(i - 1 + M)\) the new sample is tied with the earlier draw \(x_\ell\), for each \(\ell < i\). The conditional distribution (1.2) is also known as the Polya urn. We will return to it below. Let \(\boldsymbol{x}_{-i} = \boldsymbol{x}\setminus \{x_{i}\}\). For later reference we note that by symmetry the conditional distribution \(p(x_{i}\mid \boldsymbol{x}_{-i})\) takes the same form.
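
A sequential sample from the Polya urn (1.2) is easy to generate. The sketch below assumes \(G_0 = N(0,1)\) purely for illustration.

```python
import numpy as np

def polya_urn_sample(n, M, base_sampler, rng):
    """Sequential draws x_1, ..., x_n from the Polya urn (1.2): x_i is a new
    draw from G0 with prob M/(i-1+M), or a copy of a uniformly chosen earlier
    draw x_l otherwise."""
    x = []
    for i in range(1, n + 1):
        if rng.random() < M / (i - 1 + M):
            x.append(base_sampler())              # new value from G0
        else:
            x.append(x[rng.integers(i - 1)])      # tie with an earlier draw
    return np.array(x)

rng = np.random.default_rng(2)
x = polya_urn_sample(n=20, M=1.0, base_sampler=rng.normal, rng=rng)   # G0 = N(0, 1) assumed
print(np.unique(x).size, "distinct values among", x.size, "draws")
```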

2.1 DP Mixture

The discrete nature of a DP random measure is awkward in many applications and is therefore often avoided by using an additional convolution with a continuous kernel. Let \(k(x_{i}\mid \theta )\) denote a continuous kernel, for example a Gaussian kernel. For concreteness we assume in the remaining discussion \(k(x_{i}\mid \theta ) = N(x_{i}\mid \theta,s)\) (with fixed s). The DP mixture (DPM) model assumes \(G(x_i) =\int N(x_{i}\mid \theta,s)\,dF(\theta )\), with \(F \sim \mbox{ DP}(M,G_{0})\). We write \(G \sim \mbox{ DPM}(M,G_{0},k)\). It is often convenient to rewrite the mixture as an equivalent hierarchical model. Instead of \(y_i \sim G\) and \(G \sim \mbox{ DPM}(M,G_{0},k)\) we write

$$\displaystyle{ y_{i}\mid \theta _{i} \sim N(\theta _{i},s)\mbox{ and }\theta _{i} \sim F }$$
(1.3)

with \(F \sim \mbox{ DP}(M,G_{0})\). The DPM model is one of the most widely used BNP priors for random distributions. In this volume we find it, for example, in Chap. 11 (Zhou and Hanson 2015) to construct a semiparametric version of an accelerated failure time model; in Chap. 16 (Inácio de Carvalho et al. 2015) as a prior for the distribution of test outcomes to develop inference on ROC curves; in Chap. 21 (Daniels and Linero 2015) for longitudinal outcomes under different missingness patterns; and many more.
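To make the convolution concrete, the sketch below evaluates the DPM density implied by a truncated draw \(F \sim \mbox{DP}(M, G_0)\) and generates data from the hierarchical form (1.3); the base measure \(G_0 = N(0, 3^2)\), the kernel scale s, and the truncation level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M, H, s = 1.0, 1000, 0.5                       # total mass, truncation level, kernel scale

# truncated draw F ~ DP(M, G0); G0 = N(0, 3^2) is an illustrative choice
v = rng.beta(1.0, M, size=H)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
m = rng.normal(0.0, 3.0, size=H)

# the DPM density g(y) = int N(y | theta, s^2) dF(theta) = sum_h w_h N(y | m_h, s^2)
grid = np.linspace(-8.0, 8.0, 200)
kernels = np.exp(-0.5 * ((grid[:, None] - m[None, :]) / s) ** 2) / (s * np.sqrt(2 * np.pi))
g = kernels @ w                                # a smooth density, despite the discrete F

# data from the hierarchical form (1.3): theta_i ~ F, then y_i | theta_i ~ N(theta_i, s^2)
theta = rng.choice(m, size=100, p=w / w.sum())
y = rng.normal(theta, s)
print(np.trapz(g, grid))                       # integrates to about 1
```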

Consider again the \(\theta _{i}\) in (1.3). As a sample from the discrete random measure F, the newly introduced latent variables \(\theta _{i}\) include many ties. Let \(\boldsymbol{\theta }^{\star } =\{\theta _{ 1}^{\star },\ldots,\theta _{k}^{\star }\}\) denote the k ≤ n unique values and let \(S_{j} =\{ i:\;\theta _{i} =\theta _{ j}^{\star }\}\) denote the subsets that arrange the indices [n] ≡ {1, …, n} by the configuration of ties. Then \(\rho_n \equiv \{S_1,\ldots,S_k\}\) defines a partition of [n]. Since the \(\theta_i\) are random, so is the partition. That is, the DP mixture model (1.3) induces a random partition \(p(\rho_n)\). At first glance this seems like a coincidental detail of the model. However, many applications of the DPM model exploit exactly this feature, and it appears in many chapters in this volume. The implied prior \(p(\rho_n)\) on the random partition is also known as the Chinese restaurant process (CRP). It is used, for example, in Chap. 3 (Zhang et al. 2015).

Sometimes it is convenient to index the partition \(\rho_n\) alternatively by an equivalent set of cluster membership indicators. Let \(s_i\) denote indicators with \(s_i = j\) if \(i \in S_j\), that is, when \(\theta _{i} =\theta _{ j}^{\star }\). Let \(n_j = \vert S_j\vert\) denote the size of the jth cluster, \(n_{j}^{-} = \vert S_{j}\setminus \{i\}\vert \), and let \(k^-\) denote the number of unique values among \(\boldsymbol{\theta }_{-i}\). Then we can rewrite (1.2) as

$$\displaystyle{ s_{i}\mid \boldsymbol{s}_{-i} = \left \{\begin{array}{@{}l@{\quad }l@{}} j \quad &\mbox{ with prob } \frac{n_{j}^{-}} {n-1+M},\ \ j = 1,\ldots,k^{-} \\ k^{-} + 1\quad &\mbox{ with prob } \frac{M} {n-1+M} \end{array} \right. }$$
(1.4)

The attraction of model (1.3) is the ease of posterior simulation. Consider a generic model \(y_i \sim G\) with DPM prior (1.3) and, similar to \(k^-\) and \(n_j^-\), let \(\theta _{j}^{\star -}\) denote the jth unique value among \(\boldsymbol{\theta }_{-i}\). Then (1.4) implies

$$\displaystyle{ \theta _{i}\mid \boldsymbol{y},\boldsymbol{\theta }_{-i} = \left \{\begin{array}{@{}l@{\quad }l@{}} \theta _{j}^{\star -} \quad &\mbox{ with prob. $ \propto $}n_{j}^{-}\,p(y_{i}\mid \theta _{j}^{\star -}) \\ \sim H_{1}\quad &\mbox{ with prob. $ \propto $}M\int p(y_{i}\mid \theta )\;dG_{0}(\theta ) \end{array} \right. }$$
(1.5)

with \(H_{1}(\theta ) \propto p(y_{i}\mid \theta )\,G_{0}(\theta )\). If \(p(y_{i}\mid \theta )\) and \(G_0(\theta)\) are chosen as a conjugate pair of sampling model and prior, then generating from (1.5) is straightforward. In the general case, the evaluation of \(h_{0} \equiv \int p(y_{i}\mid \theta )\;dG_{0}(\theta )\) can be computationally challenging. Several MCMC algorithms have been proposed to circumvent the evaluation of an analytically intractable integral \(h_0\) (Neal 2000). For a recent review of the DP and related models, see, for example, Ghosal (2010).
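
As an illustration of the conjugate case, the following sketch implements one Gibbs transition for \(\theta_i\) based on (1.5), assuming a normal kernel \(N(y_i \mid \theta, s^2)\) and a conjugate base measure \(G_0 = N(\mu_0, \tau_0^2)\), so that \(h_0\) and \(H_1\) are available in closed form. This is a sketch of the basic (and slowly mixing) marginal sampler, not the implementation used in any particular chapter; Neal (2000) discusses more efficient variants.

```python
import numpy as np

def norm_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gibbs_update_theta_i(i, y, theta, M, s2, mu0, tau02, rng):
    """One draw of theta_i | y, theta_{-i} as in (1.5), for the conjugate
    pair p(y | theta) = N(theta, s2) and G0 = N(mu0, tau02) (assumed choices)."""
    theta_minus = np.delete(theta, i)
    uniq, counts = np.unique(theta_minus, return_counts=True)   # theta*_j and n_j^-
    p_join = counts * norm_pdf(y[i], uniq, s2)                   # join an existing cluster
    h0 = norm_pdf(y[i], mu0, s2 + tau02)                         # marginal of y_i under G0
    p = np.append(p_join, M * h0)                                # ... or open a new cluster
    p /= p.sum()
    j = rng.choice(len(p), p=p)
    if j < len(uniq):
        return uniq[j]                                           # tie with theta*_j
    tau12 = 1.0 / (1.0 / tau02 + 1.0 / s2)                       # H1 = N(mu1, tau12)
    mu1 = tau12 * (mu0 / tau02 + y[i] / s2)
    return rng.normal(mu1, np.sqrt(tau12))

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(-2, 0.5, 50), rng.normal(3, 0.5, 50)])  # synthetic data
theta = y.copy()                                                       # crude initialization
for sweep in range(100):
    for i in range(len(y)):
        theta[i] = gibbs_update_theta_i(i, y, theta, M=1.0, s2=0.25,
                                        mu0=0.0, tau02=10.0, rng=rng)
print("clusters after 100 sweeps:", np.unique(theta).size)
```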

2.2 Generalizations of the DP

Many generalizations of the DP prior have been proposed in the literature. One example is the Poisson-Dirichlet (PD) process that is used in Chap. 9 (Guha et al. 2015). The PD arises by replacing the \(\mbox{ Be}(1,M)\) prior on the fractions \(v_h\) in the stick-breaking construction by \(\mbox{ Be}(1 - a,b + ha)\) priors. Other generalizations are specifically focused on the implied random partition model, like the generalized Ottawa sequence introduced in Chap. 5 (Bassetti et al. 2015) or the hierarchical DP (HDP) model in Chap. 7 (Iorio et al. 2015). The latter defines a prior on a family of random probability measures \(\{G_j;\;j = 1,\ldots,J\}\).

3 Dependent Dirichlet Process

Many problems involve a family of unknown random probability measures \(\mathcal{F} =\{ F_{x};x \in X\}\). For example, in a mixed effects model that includes data from several related studies, \(F_j\) might be the random effects distribution for patients in study j. More generally, a formalization of non-parametric regression could assume

$$\displaystyle{ y_{i}\mid \boldsymbol{x}_{i} = x,\mathcal{F}\sim F_{x} }$$
(1.6)

i = 1, …, n. That is, we denote by \(F_x\) the sampling model for the response of a subject with covariates \(\boldsymbol{x}_{i} = x\). If we are willing to assume \(F_{x} = N(\boldsymbol{x}_{i}'\boldsymbol{\beta },\sigma ^{2})\), then the problem reduces to parametric inference on the finite dimensional parameter vector \(\boldsymbol{\theta }= (\boldsymbol{\beta },\sigma ^{2})\). In other words, we restrict \(\mathcal{F}\) to the family of probability measures indexed by \(\boldsymbol{\theta }\). In the absence of such restrictions Bayesian inference in (1.6) requires a prior probability model \(p(\mathcal{F})\) that allows for dependence and borrowing of strength across x, short of the strict parametric assumption, but still more than in a model with independent, separate priors on each \(F_x\).

One of the most popular models in the recent literature for a family of random probability measures \(\mathcal{F}\) is the dependent DP (DDP) and variations of it. The model was first introduced in MacEachern (1999). The idea is simple. We continue to use

$$\displaystyle{ F_{x} =\sum _{ h=1}^{\infty }w_{ h}\delta _{m_{xh}}, }$$
(1.7)

with independent locations \(m_{xh}\), i.i.d. across h, and weights that are constructed with independent beta fractions as before, in (1.7). The only addition is that we now introduce dependence of the point masses \(m_{xh}\) across x. For example, we could assume that \((m_{xh},\;x \in X)\) is a realization of a Gaussian process indexed by x. In the simplest implementation the weights \(w_h\) are shared across all x, as implied by the notation \(w_h\) without a subindex for x.

Similar to the DP mixture model, the DDP model (1.7) is often combined with a continuous kernel, for example a normal kernel to define

$$\displaystyle{ G_{x}(y) =\int N(y\mid \theta,\sigma ^{2})\;dF_{ x}(\theta ) =\sum _{ h=1}^{\infty }w_{ h}N(y\mid m_{xh},\sigma ^{2}). }$$
(1.8)

with a DDP prior on \(\{F_x,\;x \in X\}\). Here \(N(y\mid m,\sigma^2)\) denotes a normal kernel in y. We refer to (1.8) as a DDP mixture of normals. For categorical covariates x ∈ X the dependent probability model for \((m_{xh},\;x \in X)\) could be defined, for example, as an ANOVA model. This defines the ANOVA DDP proposed in De Iorio et al. (2002). A version of the same, with a general linear model in place of the ANOVA model, is the linear dependent DP (LDDP) (Jara et al. 2010).
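
A minimal sketch of a common-weights DDP mixture of normals (1.8) follows. It uses atoms that depend linearly on a one-dimensional covariate, in the spirit of the LDDP; the linear form of the atoms, the normal priors on their coefficients, and the truncation level are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
M, H, sigma = 1.0, 500, 0.5

# shared stick-breaking weights w_h (common to all covariate values x)
v = rng.beta(1.0, M, size=H)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

# dependent atoms m_{x,h}: here linear in x, m_{x,h} = a_h + b_h * x, with
# (a_h, b_h) drawn i.i.d. -- an illustrative choice in the spirit of the LDDP
a = rng.normal(0.0, 2.0, size=H)
b = rng.normal(0.0, 1.0, size=H)

def ddp_mixture_density(y, x):
    """G_x(y) = sum_h w_h N(y | m_{x,h}, sigma^2), as in (1.8)."""
    m = a + b * x
    return np.sum(w * np.exp(-0.5 * ((y - m) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)))

# the density of y changes smoothly with the covariate x
for x in (0.0, 1.0, 2.0):
    print(x, ddp_mixture_density(0.0, x))
```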

3.1 Variations of the DDP

The DDP prior and variations of it are used in several chapters in this volume. Chapter 12 (Jara et al. 2015) uses an LDDP to implement survival regression. Chapter 20 (Karabatsos and Walker 2015) constructs a variation of a DDP by introducing the dependence on covariates in (1.7) through a probit regression on the weights \(w_h\), rather than on the atoms \(m_h\).

4 Polya Tree

The Polya tree (PT) prior (Lavine 1992, 1994) is an attractive alternative BNP prior for a random probability measure. The PT prior is essentially a random histogram. Without loss of generality, assume that we wish to define a random probability measure G on the unit interval [0, 1]. We could start with a random histogram with two bins \(\{B_0, B_1\}\), say over \(B_0 = [0, 0.5)\) and \(B_1 = [0.5, 1]\). Let \(Y_0 = G(B_0)\) and \(Y _{1} = 1 - Y _{0}\) denote the (random) probabilities of \(B_0\) and \(B_1\). Next we refine the histogram by splitting the bins into \(B_0 = B_{00} \cup B_{01}\) with \(B_{00} = [0, 0.25)\), etc. Let \(Y_{00} = G(B_{00}\mid B_0)\), \(Y_{10} = G(B_{10}\mid B_1)\), \(Y _{01} = 1 - Y _{00}\), and \(Y _{11} = 1 - Y _{10}\). We continue refining the histogram to \(2^m\) bins, m = 1, 2, …, by repeating similar binary splits. The process creates a sequence \(\varPi =\{\varPi _{m},\;m = 1,2,\ldots \}\) of nested binary partitions \(\varPi _{m} =\{ B_{e_{1}\cdots e_{m}}\}\) with \(e_j \in \{0, 1\}\). The PT defines a prior on G by assuming

$$\displaystyle{Y _{\epsilon 0} \sim \mbox{ Be}(\alpha _{\epsilon 0},\alpha _{\epsilon 1}),}$$

independently across \(\epsilon\) and \(Y _{\epsilon 1} = 1 - Y _{\epsilon 0}\). The nested partitions Π together with the beta parameters \(\mathcal{A} =\{\alpha _{\epsilon }\}\) characterize the PT prior. We write \(G \sim \mbox{ PT}(\varPi,\mathcal{A})\).

One of the attractions of the PT prior is the ease of centering the model. Let \(0.e_{1}\cdots e_{m} =\sum _{j}e_{j}2^{-j}\) denote the number with binary digits \(\epsilon = e_{1},\ldots,e_{m}\) and let \(q_{\epsilon }\) denote the corresponding quantile of a fixed probability measure \(G_0\). That is, for example, \(q_1\), \(q_{01}\), \(q_{11}\) are the median and the first and third quartiles of \(G_0\). Next define \(B_{\epsilon }\) to denote the corresponding partitioning subsets and let \(\varPi\) denote the nested partition sequence with partitioning subsets \(B_{\epsilon }\). If \(G \sim \mbox{ PT}(\varPi,\mathcal{A})\) with \(\alpha _{\epsilon 0} =\alpha _{\epsilon 1}\), then E(G) = \(G_0\). We write

$$\displaystyle{G \sim \mbox{ PT}(G_{0},\mathcal{A}).}$$

A particularly attractive choice is \(\alpha _{e_{1}\cdots e_{m}} = c2^{m}\) which can be shown to imply a continuous random probability measure G. We write \(G \sim \mbox{ PT}(G_{0},c)\). Alternatively, for an arbitrary nested partitioning sequence Π, define \(\mathcal{A}\) by \(\alpha _{\epsilon e_{m}} = c\,G_{0}(B_{\epsilon e_{m}}\mid B_{\epsilon })\) and assume \(G \sim \mbox{ PT}(\varPi,\mathcal{A})\). Then again E(G) = G 0. We write

$$\displaystyle{G \sim \mbox{ PT}(\varPi,G_{0}).}$$
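
The random histogram interpretation makes simulation straightforward. The sketch below draws the level-m bin probabilities of \(G \sim \mbox{PT}(G_0, c)\) with \(\alpha_{e_1\cdots e_m} = c\,2^m\), centered at \(G_0 = \mbox{Uniform}(0,1)\) for simplicity; for a general \(G_0\) one would relabel the dyadic bins by the corresponding \(G_0\) quantiles. All settings are illustrative.

```python
import numpy as np

def polya_tree_bins(levels, c, rng):
    """Level-`levels` bin probabilities of G ~ PT(G0, c) with
    alpha_{e_1...e_m} = c * 2**m, centered at G0 = Uniform(0, 1):
    returns G(B_{e_1...e_levels}) for the 2**levels dyadic bins of [0, 1]."""
    probs = np.array([1.0])
    for m in range(1, levels + 1):
        alpha = c * 2.0 ** m
        y0 = rng.beta(alpha, alpha, size=probs.size)   # Y_{eps 0}; Y_{eps 1} = 1 - Y_{eps 0}
        # split every bin B_eps into (B_{eps 0}, B_{eps 1})
        probs = np.column_stack((probs * y0, probs * (1.0 - y0))).ravel()
    return probs

rng = np.random.default_rng(6)
p = polya_tree_bins(levels=6, c=1.0, rng=rng)          # 64 dyadic bins on [0, 1]
print(p.sum(), p.mean())                               # sums to 1; prior mean per bin is 1/64
```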

For a recent review of the PT prior, see, for example, Müller et al. (2015, Chapter 3). PT priors are used, for example, in Chap. 11 (Zhou and Hanson 2015) to construct a semiparametric accelerated failure time model.

5 Gaussian Process

Gaussian Process (GP) priors are widely used in machine learning, medical imaging, ecology, and various disease risk models. A GP is a stochastic process \(\{Y(s);\;s \in S\}\) that extends (finite dimensional) multivariate Gaussians to infinite dimensions. Here Y(⋅) is a function-valued random variable while S denotes the domain (typically \(\mathfrak{R}^{e}\)) of the function. The domain S, and thus Y(⋅), can have very different interpretations depending on the specific application. For example, in Chap. 17 (Reich and Fuentes 2015), and typically in the context of spatial models, S refers to all location points in a given region. For machine learning applications, it can be the set of all possible input stimuli. It could even represent the time domain for recording neuronal activity as in the case study provided in Chap. 13 (Shahbaba et al. 2015). S is usually endowed with its own specific metric, e.g., the Euclidean distance in spatial applications. The problem of analyzing the random function Y(⋅) or predicting its value Y(s) at a specific point s can be formulated within the framework of non-parametric regression, where the values in S play the role of covariates and Y(⋅) is the regression function to be estimated. A prior on the random function Y(⋅) simply refers to the probability law of the stochastic process.

We formally characterize a GP as a stochastic process with mean function m(⋅) and covariance function k(⋅, ⋅) if every finite sub-collection of this process, \([Y(s_1), Y(s_2), \ldots, Y(s_n)]\), is multivariate Gaussian,

$$\displaystyle{[Y (s_{1}),\ldots,Y (s_{n})] \sim N(\mu,\varSigma )\mbox{ with }\mu = [m(s_{1}),\ldots,m(s_{n})]\mbox{ and }\varSigma _{ij} = k(s_{i},s_{j}).}$$

We write \(Y \sim \mbox{ GP}(m,k)\). The covariance function is sometimes also referred to as the kernel of the GP. The prior on the random Y(⋅), thus defined, is called a GP prior. It can be shown that such an extension of the finite dimensional multivariate Gaussian is possible using Kolmogorov’s consistency theorem. Naturally, the infinite process inherits many attractive properties of its finite version. For example, no restrictions are required for the mean function m. However, since all finite dimensional subsets are required to be Gaussian, a condition of positive semi-definiteness is implied for the covariance matrix Σ induced by any finite subset of S.

A variety of different families of valid kernels are in common use. Some popular choices include the squared exponential (SE), polynomial, neural network, Ornstein-Uhlenbeck (OU), and Matérn kernels. Each of these families typically has a number of free hyper-parameters. Choosing a covariance function for a particular application thus comprises both the setting of hyper-parameters within a family and, sometimes, the comparison across different families through model-selection techniques. Alternatively, flexible and non-parametric covariance functions can be built by exploiting the spectral representation of a GP. Chapter 17 (Reich and Fuentes 2015) introduces such general priors for spatial covariances by applying the DP and the DPM priors to the coefficients of the spectral density.

In general, all covariance functions formally encode some notion of similarity between a pair of random observations based on the distance between the corresponding elements of S. Consider, for example, the SE kernel given by \(k(s,t) = \exp (-\vert \vert s - t\vert \vert ^{2}/2\tau ^{2})\). The functional form implies that observations corresponding to proximal points are highly correlated, with the correlation dropping off exponentially in the squared distance between the points.
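
A minimal sketch of the SE kernel follows; the length-scale \(\tau\) is the free hyper-parameter, and the value \(\tau = 1\) below is an arbitrary choice.

```python
import numpy as np

def se_kernel(s, t, tau=1.0):
    """Squared exponential kernel k(s, t) = exp(-||s - t||^2 / (2 tau^2))."""
    d2 = np.sum((np.atleast_1d(s) - np.atleast_1d(t)) ** 2)
    return np.exp(-d2 / (2.0 * tau ** 2))

# correlation decays with the distance between the inputs
for d in (0.1, 0.5, 1.0, 2.0):
    print(d, se_kernel(0.0, d))
```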

Posterior inference and prediction with GP priors is greatly simplified by the analytical results available for multivariate Gaussians. It is enough to observe that the collection of new and observed variables is a finite subset of the GP, so that their joint density is multivariate Gaussian. Hence, the posterior predictive distribution, obtained by conditioning on the observed data, is again multivariate normal. The infinite dimension of the prior, while providing substantial modeling flexibility, poses no concern for inference and computation. These properties turn out to be critical for several analytical manipulations with the GP prior.
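
The sketch below carries out exactly this conditioning for GP regression with an SE covariance, a zero prior mean function, and known observation noise; the synthetic data and all settings are illustrative assumptions.

```python
import numpy as np

def se_cov(a, b, tau=1.0):
    """Squared exponential covariance matrix between 1-d input vectors a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / tau ** 2)

rng = np.random.default_rng(7)
# observed data: noisy evaluations of an (assumed) true function sin(s)
s_obs = np.linspace(0.0, 5.0, 15)
y_obs = np.sin(s_obs) + rng.normal(0.0, 0.1, s_obs.size)
s_new = np.linspace(0.0, 5.0, 50)                    # prediction points

sigma2 = 0.1 ** 2                                    # observation noise variance (assumed known)
K = se_cov(s_obs, s_obs) + sigma2 * np.eye(s_obs.size)
K_star = se_cov(s_new, s_obs)
K_ss = se_cov(s_new, s_new)

# posterior predictive mean and covariance: conditioning one block of a
# multivariate normal on the other (zero prior mean function assumed)
post_mean = K_star @ np.linalg.solve(K, y_obs)
post_cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
print(post_mean[:3], np.diag(post_cov)[:3])
```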

However, a known computational bottleneck is the inversion of the (n × n) matrices that appear in these analytical results, which makes the computational complexity cubic in the number of data points. For large datasets (n > 10,000) this is prohibitive (in both time and space) for any inference, Bayesian or otherwise. A number of computational methods [e.g., reduced rank matrix approximations (Fine et al. 2001; Smola and Schölkopf 2000)] have therefore been developed. Another approach is to exploit the structure of special classes of covariance functions for exact computation. These methods are iterative and the computation scales linearly with the size of the data (Johannesson and Cressie 2004). Cressie and Johannesson (2008) extended this approach to a flexible class of covariance functions. The computational complexity also increases drastically in multivariate settings with several spatially dependent response variables. Banerjee et al. (2008) used induced predictive process models as a clever strategy for dimension reduction and to reduce computational cost in this context. An alternative solution to the computational problem is the treed Gaussian process of Gramacy and Lee (2008). The approach proceeds by first partitioning the covariate space into a number of smaller regions, similar to a classification and regression tree (CART). Next, independent GPs are fit to each subregion. The overall inversion of a large matrix is replaced by a number of smaller, computationally feasible inversions. Posterior inference is efficiently handled in the tgp package for R.

An excellent reference on Gaussian process models for regression is Rasmussen and Williams (2005).

6 Conclusion

In this brief review we only introduced some of the most popular BNP models and variations. Some of the chapters use models beyond this selection. Chapter 4 (Ji et al. 2015) uses an Indian buffet process as a prior probability model for a feature allocation problem. Feature allocation generalizes random clusters, that is, non-overlapping subsets, to families of possibly overlapping subsets. Chapter 10 (Nieto-Barajas 2015) introduces several alternative models, including, for example, the normalized generalized gamma (NGG) process. The same NGG process appears in Chap. 6. Some chapters define random functions based on spline bases, including Chap. 14 (Telesca 2015) and Chap. 11 (Zhou and Hanson 2015). Finally, Chap. 8 (Ni et al. 2015) discusses prior probability models for random networks.

The next chapter, Chap. 2, continues this review by discussing some typical applications of basic BNP models.