Posterior Contraction Rates for Stochastic Block Models

Ghosh, Prasenjit; Pati, Debdeep; Bhattacharya, Anirban

doi:10.1007/s13171-019-00180-5


Published: 14 October 2019

Volume 82, pages 448–476, (2020)

Sankhya A

Abstract

With the advent of structured data in the form of social networks, genetic circuits and protein interaction networks, statistical analysis of networks has gained popularity over recent years. The stochastic block model constitutes a classical cluster-exhibiting random graph model for networks. There is a substantial amount of literature devoted to proposing strategies for estimating and inferring parameters of the model, both from classical and Bayesian viewpoints. Unlike the classical counterpart, there is a dearth of theoretical results on the accuracy of estimation in the Bayesian setting. In this article, we undertake a theoretical investigation of the posterior distribution of the parameters in a stochastic block model. In particular, we show that one obtains near-optimal rates of posterior contraction with routinely used multinomial-Dirichlet priors on cluster indicators and uniform or general Beta priors on the probabilities of the random edge indicators. Our theoretical results are corroborated through a small scale simulation study.

1 Introduction

Data in the form of networks are increasingly common in applications such as brain connectivity, protein interactions, web applications and social networks, motivating an explosion of activity in the statistical analysis of networks in recent years (Goldenberg et al. 2010). Estimating large networks offers unique challenges in terms of structured dimension reduction and estimation in stylized domains, necessitating new tools for inference. A rich variety of probabilistic models have been studied for network estimation, including the classical Erdős–Rényi graphs (Erdős and Rényi, 1961), exponential random graph models (Holland and Leinhardt, 1981), stochastic block models (Holland et al. 1983), Markov graphs (Frank and Strauss, 1986) and latent space models (Hoff et al. 2002), to name a few.

In a network with n nodes, there are O(n²) possible connections between pairs of nodes, the exact number depending on whether the network is directed/undirected and whether self-loops are permitted. A common goal of the parametric models mentioned previously is to parsimoniously represent the O(n²) probabilities of connections between pairs of nodes in terms of fewer parameters. The stochastic block model achieves this by clustering the nodes into k ≪ n groups, with the probability of an edge between two nodes solely dependent on their cluster memberships. The block model originated in the mathematical sociology literature (Holland et al. 1983), with subsequent widespread applications in statistics (Wang and Wong, 1987; Snijders and Nowicki, 1997; Nowicki and Snijders, 2001). In particular, the clustering property of block models offers a natural way to find communities within networks, inspiring a large literature on community detection (Bickel and Chen, 2009; Newman, 2012; Zhao et al. 2012; Karrer and Newman, 2011; Zhao et al. 2011; Amini et al. 2013). Various modifications of the stochastic block model have also been proposed, including the mixed membership stochastic block model (Airoldi et al. 2009) and degree-corrected stochastic block model (Dasgupta et al. 2004; Karrer and Newman, 2011).

Statistical accuracy of parameter estimates for inference in stochastic block models is of growing interest, with one of the objects of interest being the n × n matrix of probabilities of edges between pairs of nodes, which we shall denote by 𝜃 = (𝜃_ij). Using a singular-value thresholding approach, Chatterjee (2014) obtained a $\sqrt {k/n}$ rate for estimating 𝜃 with respect to the squared ℓ₂ distance in a k-component stochastic block model. In a recent technical report, Gao et al. (2015) obtained an improved $k^{2}/n^{2} + \log k/n$ rate by considering a least-squares type estimator. They also showed that the resulting rate is minimax-optimal; interestingly, the minimax rate comprises two parts, which Gao et al. (2015) refer to as the nonparametric and clustering rates, respectively. Among other related work, Bickel et al. (2013) provided conditions for asymptotic normality of maximum likelihood estimators in stochastic block models.

In this article, we consider a Bayesian formulation of a stochastic block model where 𝜃 is equipped with a hierarchical prior and study the contraction of the posterior distribution assuming the data to be generated from a stochastic block model. We show that one obtains the minimax rate of posterior contraction (up to a logarithmic factor) with essentially automatic prior choices, such as multinomial-Dirichlet priors on cluster indicators and uniform priors (see Footnote 1) on the probabilities of the random edge indicators. Such priors are commonly used and there is a sizable literature (Snijders and Nowicki, 1997; Nowicki and Snijders, 2001; Golightly and Wilkinson, 2005; McDaid et al. 2013) on posterior sampling and inference in the stochastic block model. The theoretical development of our present work assumes that the number of clusters is known a priori. From a different perspective, when such prior knowledge is unavailable, Geng et al. (2018) proposed an efficient Markov chain Monte Carlo (MCMC) algorithm to simultaneously estimate the number of clusters and the clustering structure. While preparing this manuscript, we also became aware of some recent studies on various theoretical properties of such stochastic block models. For instance, Gao et al. (2018) consider a general unified framework of structured linear models that covers many complex statistical problems such as stochastic block models, bi-clustering, sparse linear regression, regression under group sparsity, multi-task learning and dictionary learning. Those authors study the posterior contraction rate of their newly proposed elliptical Laplace distribution under this general setup. Refer also to Channarond et al. (2012), Suwan et al. (2016), van der Pas and van der Vaart (2018), and Hayashi et al. (2016), among others, for recent work on the theoretical investigation of various aspects of the stochastic block model (SBM).

Theoretical investigation of the posterior distribution in block models offers some unique challenges relative to the small but growing literature on posterior contraction in high-dimensional sparse problems (Castillo and van der Vaart 2012, 2015; Pati et al. 2014; Banerjee and Ghosal, 2014). When a large subset of the parameters are exactly or approximately zero, the sparsity assumption can be exploited to reduce the complexity of the model space to derive tests for the true parameter versus the complement of a neighborhood of the true parameter (Castillo and van der Vaart, 2012; Pati et al. 2014). It is now well appreciated that constructing such tests plays a crucial role in posterior asymptotics (Schwartz, 1965; Barron, 1988, 1999; Ghosal et al., 2000). In the present setting, we exploit the parsimonious structure of the parameter space as a result of clustering of n nodes into k ≪ n communities to derive such tests. This also enables us to reduce the “effective” number of parameters (the edge probabilities) to be estimated from O(n²) to O(k² + n). This dimension reduction is enabled by exploiting the structure of the model, unlike the traditional notion of sparsity in typical sparse high-dimensional studies where a subset of the parameters are zero or negligible in magnitude.

The remainder of the paper is organized as follows. Some notation is introduced in Section 2. We provide an overview of stochastic block models in Section 3. Our main theoretical results on posterior contraction are stated in Section 4. While the proof of Theorem 4.1 is given at the end of Section 4.2, the proofs of the other main theoretical results are deferred to the Appendix. A small-scale simulation study is presented in Section 5, and some additional simulation results are given in the Appendix. We conclude the paper with some discussion in Section 6.

2 Preliminaries

For $\mathcal S \subset \mathbb {R}$, we shall denote the set of all d × d matrices with entries in $\mathcal S$ by $\mathcal S^{d \times d}$. For any $B = (B_{ll^{\prime }}) \in \mathbb {R}^{d \times d}$, we denote the Euclidean (equivalently Frobenius) norm of B by $\|B\| = \sqrt {{\sum }_{l=1}^{d} {\sum }_{l^{\prime }=1}^{d} B_{ll^{\prime }}^{2}}$. Given $X^{*} \in \mathbb {R}^{d \times d}, W \in \mathbb {R}_{+}^{d \times d}$, let $\xi _{d^{2}}(X^{*}; W)$ denote the unit ellipsoid with center X^∗ and weight W given by

$$ \begin{array}{@{}rcl@{}} \xi_{d^{2}}(X^{*}; W) = \left\{ X \in \mathbb{R}^{d \times d} : \sum\limits_{l=1}^{d} \sum\limits_{l^{\prime}=1}^{d} W_{l l^{\prime}} (X_{l l^{\prime}} - X^{*}_{ll^{\prime}})^{2} \le 1 \right\}. \end{array} $$

(2.1)

Viewed as a subset of $\mathbb {R}^{d^{2}}$, the Euclidean volume of $\xi _{d^{2}}(X^{*}; W)$, denoted by $| \xi _{d^{2}}(X^{*}; W) |$, is

$$ \begin{array}{@{}rcl@{}} | \xi_{d^{2}}(X^{*}; W) | = \frac{\pi^{d^{2}/2}}{{\varGamma}(d^{2}/2 + 1)} \prod\limits_{l=1}^{d} \prod\limits_{l^{\prime} = 1}^{d} W_{ll^{\prime}}^{-1/2}. \end{array} $$

(2.2)
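As a numerical companion to the volume of the ellipsoid in (2.1), the following minimal sketch evaluates its logarithm. It relies on the standard fact that the unit Euclidean ball in $\mathbb{R}^{m}$ has volume $\pi^{m/2}/{\varGamma}(m/2+1)$, applied with m = d². The function name `log_ellipsoid_volume` is hypothetical, and working on the log scale is a choice made here to avoid overflow when d is large.

```python
import math


def log_ellipsoid_volume(W):
    """Log Euclidean volume of {X : sum_{l,l'} W_ll' (X_ll' - X*_ll')^2 <= 1}.

    W is a d x d nested list of positive weights. The center X* only
    translates the ellipsoid, so it does not enter the volume.
    """
    weights = [w for row in W for w in row]
    m = len(weights)  # ambient dimension, m = d^2
    # Log-volume of the unit ball in R^m.
    log_unit_ball = (m / 2) * math.log(math.pi) - math.lgamma(m / 2 + 1)
    # Each axis is scaled by W_ll'^{-1/2}.
    return log_unit_ball - 0.5 * sum(math.log(w) for w in weights)
```

For d = 1 and W = 1 the set is the interval [X^∗ − 1, X^∗ + 1], whose length 2 gives a quick sanity check of the constant.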

Given sequences {a_n},{b_n}, $a_{n} \lesssim b_{n}$ indicates there exists a constant K > 0 such that a_n ≤ Kb_n for all large n. We say a_n ≍ b_n when $a_{n} \lesssim b_{n}$ and $b_{n} \lesssim a_{n}$. Given any function f and some subset A in its domain, we denote by f(A) the image of A under f. Throughout, $C, C^{\prime }$ denote positive constants whose values might change from one line to the next.

3 Stochastic Block Models

Let $A = (A_{ij}) \in \{0, 1\}^{n \times n}$ denote the adjacency matrix of a network with n nodes, with A_ij = 1 indicating the presence of an edge from node i to node j and A_ij = 0 indicating a lack thereof. To keep the subsequent notation clean, we shall consider directed networks with self-loops, so that A_ij and A_ji need not be the same and A_ii can be either 0 or 1. Our theoretical results can be modified to undirected networks with or without self-loops in a straightforward fashion; refer to Section 4.1 for further discussion.

Let 𝜃_ij denote the probability of an edge from node i to j, with $A_{ij} \sim \text {{Bernoulli}}(\theta _{ij})$ independently for 1 ≤ i, j ≤ n. A stochastic block model postulates that the nodes are clustered into communities, with the probability of an edge between two nodes solely dependent on their community memberships. Specifically, let z_i ∈ {1,…, k} denote the cluster membership of the ith node and Q = (Q_rs) ∈ [0,1]^{k×k} be a matrix of probabilities, with Q_rs indicating the probability of an edge from any node i in cluster r to any node j in cluster s. With this notation, a k-component stochastic block model is given by

$$ \begin{array}{@{}rcl@{}} A_{ij} \sim \text{{Bernoulli}}(\theta_{ij}), \quad \theta_{ij} = Q_{z_{i} z_{j}}. \end{array} $$

(3.1)

We use $\mathbb {E}_{\theta } \slash \mathbb {P}_{\theta }$ to denote an expectation/probability under the sampling mechanism (3.1).
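To make the sampling mechanism (3.1) concrete, here is a minimal sketch that draws an adjacency matrix for a directed network with self-loops. It assumes NumPy, uses 0-based cluster labels (the text uses 1,…, k), and the function name `sample_sbm`, the seed and the toy values of z and Q are illustrative choices rather than anything taken from the paper.

```python
import numpy as np


def sample_sbm(z, Q, rng):
    """Draw a directed adjacency matrix with self-loops from model (3.1).

    z : length-n integer array of cluster labels in {0, ..., k-1}.
    Q : k x k array of edge probabilities.
    Returns the adjacency matrix A and the n x n probability matrix theta.
    """
    z = np.asarray(z)
    Q = np.asarray(Q, dtype=float)
    theta = Q[np.ix_(z, z)]                      # theta_ij = Q[z_i, z_j]
    A = (rng.random(theta.shape) < theta).astype(int)  # independent Bernoulli draws
    return A, theta


rng = np.random.default_rng(0)  # seed chosen arbitrarily
z = np.array([0, 0, 1, 1, 1])
Q = np.array([[0.9, 0.1],
              [0.2, 0.7]])
A, theta = sample_sbm(z, Q, rng)
```

Because the network is directed, Q need not be symmetric, and neither is the resulting theta.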

The stochastic block model clearly imposes a parsimonious structure on the edge probabilities 𝜃 = (𝜃_ij) when k ≪ n, reducing the effective number of parameters from O(n²) to O(k² + n). To describe the parameter space for 𝜃, we need to introduce some notation. For k ≤ n, let $\mathcal Z_{n, k} = \left \{z = (z_{1}, \ldots , z_{n}) : z_{i} \in \{1, \ldots , k\}, 1 \le i \le n \right \}$ denote all possible clusterings of n nodes into k clusters.

For any 1 ≤ r ≤ k, z^{−1}(r) is used as shorthand for {1 ≤ i ≤ n : z_i = r}, the set of nodes belonging to cluster r. When z is clear from the context, we shall use n_r = |z^{−1}(r)| to denote the number of nodes in cluster r; clearly ${\sum }_{r=1}^{k} n_{r} = n$. For the theoretical development in this paper, it is assumed that n_r ≥ 1 for all $r=1,\dots ,k$, that is, each cluster is assumed to be non-empty, containing at least one observation.

With these notations, the parameter space Θ_k for 𝜃 is given by

$$ \begin{array}{@{}rcl@{}} {\varTheta}_{k} = \{ \theta \in [0,1]^{n\times n} : \theta_{ij} = Q_{z_{i} z_{j}}, z \in \mathcal Z_{n, k}, Q \in [0, 1]^{k \times k} \}. \end{array} $$

(3.2)

For any $z \in \mathcal Z_{n, k}$ and Q ∈ [0,1]^{k×k}, we denote the corresponding 𝜃 ∈ Θ_k by 𝜃^{z, Q}, so that $\theta ^{z, Q}_{ij} = Q_{z_{i} z_{j}}$. In fact, (z, Q) ↦ 𝜃^{z, Q} is a surjective map from $\mathcal Z_{n, k} \times [0, 1]^{k \times k} \to {\varTheta }_{k}$, though it is clearly not injective.
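One simple way to see the lack of injectivity is label switching: relabeling the clusters and permuting the rows and columns of Q accordingly leaves 𝜃 unchanged. The sketch below illustrates this with a two-cluster toy example, assuming NumPy and 0-based labels; the helper name `theta_from` and the numerical values are hypothetical.

```python
import numpy as np


def theta_from(z, Q):
    """Return theta^{z,Q}, the n x n matrix with entries Q[z_i, z_j]."""
    z = np.asarray(z)
    return np.asarray(Q, dtype=float)[np.ix_(z, z)]


z = np.array([0, 0, 1])
Q = np.array([[0.9, 0.1],
              [0.2, 0.7]])

# Swap the two labels and permute rows/columns of Q accordingly.
perm = np.array([1, 0])
z_swapped = perm[z]                  # label 0 -> 1, label 1 -> 0
Q_swapped = Q[np.ix_(perm, perm)]    # Q_swapped[r, s] = Q[perm[r], perm[s]]

# Two distinct pairs (z, Q) map to the same theta.
same_theta = np.array_equal(theta_from(z, Q), theta_from(z_swapped, Q_swapped))
```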

Given $z \in \mathcal Z_{n, k}$, let A_[rs] denote the n_r × n_s sub matrix of A consisting of entries A_ij with z_i = r and z_j = s. The joint likelihood of A under model (3.1) can be expressed as

$$ \begin{array}{@{}rcl@{}} P(A \mid z, Q) = \prod\limits_{r = 1}^{k} \prod\limits_{s = 1}^{k} P(A_{[rs]} \mid z, Q), \quad P(A_{[rs]} \mid z, Q) = \prod\limits_{i: z_{i} = r} \prod\limits_{j: z_{j} = s} Q_{rs}^{A_{ij}} (1 - Q_{rs})^{1 - A_{ij}}. \end{array} $$

(3.3)
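The block factorization in (3.3) means the likelihood depends on A only through the number of edges in each block A_[rs]. A minimal sketch of the corresponding log-likelihood follows, assuming NumPy, 0-based labels and entries of Q strictly inside (0, 1); the function name `sbm_log_likelihood` is hypothetical.

```python
import numpy as np


def sbm_log_likelihood(A, z, Q):
    """Log of P(A | z, Q) in (3.3), accumulated block by block.

    Labels in z are 0-based; entries of Q are assumed to lie strictly
    inside (0, 1) so that the logarithms are finite.
    """
    A = np.asarray(A)
    z = np.asarray(z)
    Q = np.asarray(Q, dtype=float)
    k = Q.shape[0]
    total = 0.0
    for r in range(k):
        for s in range(k):
            block = A[np.ix_(z == r, z == s)]   # the n_r x n_s block A_[rs]
            edges = block.sum()
            non_edges = block.size - edges
            total += edges * np.log(Q[r, s]) + non_edges * np.log1p(-Q[r, s])
    return total
```

The result agrees with summing the Bernoulli log-probabilities entry by entry, but needs only k² logarithm evaluations per call.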

A Bayesian specification of the stochastic block model can be completed by assigning independent priors to z and Q, which in turn induces a prior on Θ_k via the mapping (z, Q) ↦ 𝜃^{z, Q}. We generically use p(z, Q) = p(z)p(Q) to denote the joint prior on z and Q. The induced prior on Θ_k will be denoted by Π(𝜃) and the corresponding posterior given data A = (A_ij) will be denoted by Π_n(𝜃 ∣ A). The following fact is heavily used in the sequel: for any U ⊂ Θ_k,

$$ \begin{array}{@{}rcl@{}} {\varPi}(U) = \sum\limits_{z \in \mathcal Z_{n,k}} {\varPi}(U \mid z) \ p(z) = \sum\limits_{z \in \mathcal Z_{n,k}} p(Q : \theta^{z,Q} \in U) \ p(z), \end{array} $$

(3.4)

where the second equality uses the independence of z and Q. Specific choices of p(z) and p(Q) are discussed below.

We assume independent U(0,1) priors on the Q_rs. We consider a hierarchical prior on z where each node has probability π_r of being allocated to the rth cluster independently of the other nodes, and the vector of probabilities π = (π₁,…, π_k) follows a Dirichlet(α₁,…, α_k) prior. Here α₁,…, α_k are fixed hyper-parameters that do not depend on k or n; a default choice is α_r = 1/2 for all r = 1,…, k. We further assume the number of clusters k to be known. Model (3.1) along with the prior specified above can be expressed hierarchically as follows:

$$ \begin{array}{@{}rcl@{}} && Q_{rs} \stackrel{\text{ind}} \sim U(0, 1), \quad r, s = 1, \ldots, k, \end{array} $$

(3.5)

$$ \begin{array}{@{}rcl@{}} && P(z_{i} = r \mid \pi) = \pi_{r}, \quad r=1,\ldots,k, i = 1, \ldots, n, \end{array} $$

(3.6)

$$ \begin{array}{@{}rcl@{}} && \pi \sim \text{Dirichlet}(\alpha_{1}, \ldots, \alpha_{k}), \end{array} $$

(3.7)

$$ \begin{array}{@{}rcl@{}} && A_{ij} \mid z, Q \stackrel{\text{ind}} \sim \text{{Bernoulli}}(\theta_{ij}), \quad \theta_{ij} = Q_{z_{i} z_{j}}. \end{array} $$

(3.8)
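Since U(0,1) is the Beta(1,1) distribution, the hierarchy (3.5)–(3.8) yields a closed-form conditional for each block probability: given z and A, Q_rs follows a Beta(1 + m_rs, 1 + n_r n_s − m_rs) distribution, where m_rs is the number of edges in block (r, s). The sketch below computes these parameters for all blocks at once. It assumes NumPy and 0-based labels; the function name `block_posterior_params` and the symbol m_rs are introduced here for illustration only.

```python
import numpy as np


def block_posterior_params(A, z, k):
    """Beta parameters of Q_rs | z, A under the U(0,1) prior in (3.5).

    Returns k x k arrays (a, b) with Q_rs | z, A ~ Beta(a[r, s], b[r, s]).
    """
    A = np.asarray(A)
    z = np.asarray(z)
    Z = np.eye(k, dtype=int)[z]      # n x k one-hot membership matrix
    edges = Z.T @ A @ Z              # edges[r, s] = sum of A_ij over z_i = r, z_j = s
    sizes = Z.sum(axis=0)            # cluster sizes n_r
    pairs = np.outer(sizes, sizes)   # n_r * n_s ordered pairs in block (r, s)
    return 1 + edges, 1 + pairs - edges
```

Such a conditional is the kind of ingredient a Gibbs-type sampler would use to update Q given z; this is a generic observation about conjugacy, not a description of any specific published algorithm.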

A hierarchical specification as in (or very similar to) (3.5)–(3.8) has been commonly used in the literature; see for example, Snijders and Nowicki (1997), Nowicki and Snijders (2001), Golightly and Wilkinson (2005), and McDaid et al. (2013). Analytic marginalizations can be carried out due to the conjugate nature of the prior, facilitating posterior sampling (McDaid et al. 2013). In particular, using standard multinomial-Dirichlet conjugacy, the marginal prior of z can be written as

$$ \begin{array}{@{}rcl@{}} p(z) = \frac{{\varGamma}({\sum}_{r=1}^{k} \alpha_{r})}{{\varGamma}(n + {\sum}_{r=1}^{k} \alpha_{r})} \prod\limits_{r=1}^{k} \frac{{\varGamma}(n_{r} + \alpha_{r})}{{\varGamma}(\alpha_{r})}, \quad z \in \mathcal Z_{n, k}, \end{array} $$

(3.9)

where we recall that $n_{r} = |z^{-1}(r)|$ is the number of nodes in cluster r. The following lemma provides an upper bound on the prior ratio p(z)/p(z₀), which is used subsequently in the proof of our main theorem.
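The marginal prior (3.9) is straightforward to evaluate on the log scale with the log-gamma function. The following minimal sketch uses only the Python standard library and 0-based labels; the function name `log_prior_z` is hypothetical. As a sanity check, it also sums p(z) over all kⁿ labelings of a tiny example, which should give one.

```python
import math
from itertools import product


def log_prior_z(z, alpha):
    """Log of the marginal prior p(z) in (3.9); labels in z are 0-based."""
    k = len(alpha)
    n = len(z)
    counts = [0] * k
    for label in z:
        counts[label] += 1           # n_r = number of nodes with label r
    a_sum = sum(alpha)
    out = math.lgamma(a_sum) - math.lgamma(n + a_sum)
    for n_r, a_r in zip(counts, alpha):
        out += math.lgamma(n_r + a_r) - math.lgamma(a_r)
    return out


# Sanity check on a tiny case (n = 4, k = 3): the prior sums to one.
alpha = [0.5, 0.5, 0.5]
total = sum(math.exp(log_prior_z(z, alpha)) for z in product(range(3), repeat=4))
```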

Lemma 3.1.

Assume $z_{0} \in \mathcal Z_{n,k}$ with $n_{0r} = |z_{0}^{-1}(r)| \ge 1$ for all r = 1,…, k. Then, $\max \limits _{z \in \mathcal Z_{n, k}} p(z)/p(z_{0}) \leq e^{C n \log k}$, where C is a positive constant.

Proof

Fix $z \in \mathcal Z_{n,k}$. From Eq. 3.9, $p(z)/p(z_{0}) = {\prod }_{r=1}^{k} {\varGamma }(n_{r} + \alpha _{r})/{\varGamma }(n_{0r} + \alpha _{r})$. Then

$$ \begin{array}{@{}rcl@{}} \log \{ p(z)/p(z_{0}) \} = \sum\limits_{r=1}^{k} \log {\varGamma}(n_{r}+\alpha_{r}) - \sum\limits_{r=1}^{k} \log {\varGamma}(n_{0r}+\alpha_{r}) \end{array} $$

The first term is maximized over $z \in \mathcal Z_{n,k}$ when n_r = n for some r and n_s = 0 for all s≠r. Further, replacing n_0r by n/k for all r = 1,…, k only decreases the second term in the above display. Hence, letting $\alpha _{(k)} = \max \limits \{\alpha _{1}, \ldots , \alpha _{k}\}$ and $\alpha _{(1)} = \min \limits \{\alpha _{1}, \ldots , \alpha _{k}\}$,

$$ \begin{array}{@{}rcl@{}} \log \{ p(z)/p(z_{0}) \} \le \log {\varGamma}(n+\alpha_{(k)}) - k \log {\varGamma}(n/k+\alpha_{(1)}). \end{array} $$

We now use the standard two-sided Stirling bound (Abramowitz and Stegun, 1964), $\log {\varGamma }(x) = \log (2 \pi )/2 + (x - 1/2) \log (x) - x + R(x)$ with $0 < R(x) < (12x)^{-1}$ for x > 0. The dominating term on the right-hand side of the above display is then $n \log \{ (n + \alpha _{(k)}) / (n/k + \alpha _{(1)}) \} \lesssim n\log k$, concluding the proof. □

4 Posterior Contraction Rates in Stochastic Block Models

We are interested in contraction properties of the posterior Π_n(⋅ ∣ A) assuming the true data-generating parameter $\theta ^{0} \in {\varTheta }_{k}$. To measure the discrepancy in the estimation of $\theta ^{0} \in {\varTheta }_{k}$, the mean squared error has been used in the frequentist literature,

$$ \begin{array}{@{}rcl@{}} \frac{1}{n^{2}} \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} (\hat{\theta}_{ij} - \theta_{ij}^{0})^{2} = \frac{1}{n^{2}} \left\Vert\hat{\theta} - \theta^{0}\right\Vert^{2}, \end{array} $$

(4.1)

where $\hat {\theta }$ is an estimator of 𝜃⁰. Chatterjee (2014) proposed estimating 𝜃⁰ using a low rank decomposition of the adjacency matrix A followed by a singular value decomposition to obtain a convergence rate of $\sqrt {k/n}$. More recently, Gao et al. (2015) considered a least squares type approach which can be related to maximum likelihood estimation where the Bernoulli likelihood is replaced by a Gaussian likelihood. They obtained a rate of $k^{2}/n^{2} + \log k/n$, which they additionally showed to be the minimax rate over Θ_k, i.e.,

$$ \begin{array}{@{}rcl@{}} \inf_{\hat{\theta}} \sup_{\theta^{0} \in {\varTheta}_{k}} \mathbb{E}_{\theta^{0}} \frac{1}{n^{2}} \left\Vert\hat{\theta} - \theta^{0}\right\Vert^{2} \asymp \frac{k^{2}}{n^{2}} + \frac{\log k}{n}. \end{array} $$

(4.2)

Interestingly, the minimax rate has two components, k²/n² and $\log k/n$. Gao et al. (2015) refer to the k²/n² term in the minimax rate as the nonparametric rate, since it arises from the need to estimate k² unknown elements in Q from n² observations. The second part, $\log k/n$, is termed as the clustering rate, which appears since the clustering configuration z is unknown and needs to be estimated from the data. Observe that the clustering rate grows logarithmically in k. Parameterizing k = n^ζ with ζ ∈ [0,1], the interplay between the two components becomes clearer (refer to equation 2.6 of Gao et al. 2015); in particular, the clustering rate dominates when k is small and the nonparametric rate dominates when k is large.
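The interplay between the two components can be explored numerically. The sketch below evaluates both parts of the rate in (4.2) for k = n^ζ at a few values of ζ; the function name `minimax_rate_parts`, the choice n = 10,000 and the grid of ζ values are illustrative assumptions made here.

```python
import math


def minimax_rate_parts(n, k):
    """Return the (nonparametric, clustering) components of the rate in (4.2)."""
    nonparametric = k ** 2 / n ** 2
    clustering = math.log(k) / n
    return nonparametric, clustering


n = 10_000
for zeta in (0.25, 0.5, 0.75):
    k = round(n ** zeta)
    nonpar, clust = minimax_rate_parts(n, k)
    dominant = "clustering" if clust > nonpar else "nonparametric"
    print(f"zeta={zeta}: k={k}, k^2/n^2={nonpar:.2e}, log(k)/n={clust:.2e} -> {dominant}")
```

The two components balance when k² is of the order n log k, that is, when k is of order $\sqrt{n \log n}$, so the switch occurs slightly above ζ = 1/2.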

To evaluate Bayesian procedures from a frequentist standpoint, one seeks the smallest sequence 𝜖_n → 0 such that the posterior assigns vanishingly small probability to the complement of an 𝜖_n-neighborhood (blown up by a constant factor) of 𝜃⁰. The smallest such 𝜖_n is called the posterior contraction rate (Ghosal et al. 2000). There is now a growing body of literature showing that Bayesian procedures achieve the frequentist minimax rate of posterior contraction (up to a logarithmic term) in models where the parameter dimension grows with the sample size; see Bontemps (2011), Castillo and van der Vaart (2012), Pati et al. (2014), Banerjee and Ghosal (2014), van der Pas et al. (2014), and Castillo et al. (2015) for some flavor of the recent literature.

We now state the main result of this article where we derive the contraction rate of the posterior arising from the hierarchical formulation (3.5)–(3.8).

Theorem 4.1.

Assume A = (A_ij) is generated from a k-component stochastic block model (3.1) with the true data-generating parameter $\theta ^{0} = (\theta ^{0}_{ij}) \in {\varTheta }_{k}$, where Θ_k is as in (3.2). Further assume that there exists a small constant δ ∈ (0,1/2) such that $\theta ^{0}_{ij} \in (\delta , 1 - \delta )$ for all i, j = 1,…, n. Suppose the hierarchical Bayesian model (3.5)–(3.8) is fitted. Then, with ${\epsilon _{n}^{2}} = k^{2}\{ \log n + \log (\delta ^{-1})\}/ n^{2} + \log k/n$, and a sufficiently large constant M > 0,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\theta^{0}} {\varPi}_{n} \left\{ \frac{1}{n^{2}} \left\Vert\theta - \theta^{0}\right\Vert^{2} > M^{2} {\epsilon_{n}^{2}} \mid A \right\} \leq \exp \{-M^{2} n^{2} {\epsilon_{n}^{2}}\} + \frac{1}{C n^{2}{\epsilon_{n}^{2}}}, \end{array} $$

(4.3)

for some C > 0 and for all n ≥ 1.

Remark 4.2.

Since $\theta ^{0} \in {\varTheta }_{k}$, following the discussion after (3.2), there exist $z^{0} \in \mathcal Z_{n, k}$ and Q⁰ ∈ [0,1]^{k×k} such that $\theta ^{0} = \theta ^{z^{0}, Q^{0}}$. The condition of the theorem posits that all entries of Q⁰ lie in (δ, 1 − δ). As long as δ ≥ n^{−a} for some a > 0, the posterior contraction rate is the same (up to a constant) as in the case of fixed δ. The assumption $\theta ^{0} \in {\varTheta }_{k}$ also implicitly requires that all the clusters have at least one observation, i.e., $n_{0r} = |(z^{0})^{-1}(r)| \ge 1$ for all r = 1,…, k; otherwise there exists l < k such that $\theta ^{0} \in {\varTheta }_{l}$.

A proof of Theorem 4.1 can be found towards the end of Section 4.2, after some auxiliary results which are instrumental in deriving the main theoretical results of this paper. Theorem 4.1 shows that as long as δ ≥ n^{−a} for some a > 0, the posterior contracts at a (near) minimax rate of $k^{2}\log n /n^{2} + \log k/n$. The nonparametric component of the rate is slightly inflated by a logarithmic term; the appearance of such an additional logarithmic term is common in Bayesian nonparametrics.
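The role of the lower bound on δ can be spelled out in one line. If δ ≥ n^{−a}, then $\log (\delta ^{-1}) \le a \log n$, and substituting this into the definition of ${\epsilon _{n}^{2}}$ in Theorem 4.1 gives, with the constant (1 + a) being the only quantity introduced here,

$$ \begin{array}{@{}rcl@{}} {\epsilon_{n}^{2}} = \frac{k^{2}\{ \log n + \log (\delta^{-1})\}}{n^{2}} + \frac{\log k}{n} \le (1 + a) \frac{k^{2} \log n}{n^{2}} + \frac{\log k}{n} \le (1 + a) \left\{ \frac{k^{2} \log n}{n^{2}} + \frac{\log k}{n} \right\}. \end{array} $$

Hence ${\epsilon _{n}^{2}}$ differs from $k^{2}\log n /n^{2} + \log k/n$ by at most the constant factor (1 + a), which can be absorbed into M.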

It is worth noting that in Theorem 4.1 a uniform U(0,1) prior is assigned to the edge probabilities Q_rs, while in similar independent work, van der Pas and van der Vaart (2018) considered a more general Beta(β₁, β₂) distribution for the Q_rs that includes the uniform prior as a special case. While our main goal of inference is the recovery of the edge probabilities, van der Pas and van der Vaart (2018) focused on detection of the community memberships. A natural question in this context is whether our posterior contraction results extend to a more general Beta prior. Corollary 4.3 below answers this in the affirmative: for recovery of the edge probabilities Q_rs, the posterior obtained from a Beta(β₁, β₂) prior contracts at the same rate as that obtained for the uniform prior in Theorem 4.1. In fact, our general scheme of arguments for deriving the contraction rates works equally well for this Beta prior.

Corollary 4.3.

Consider the setup of Theorem 4.1, where $Q_{rs}\stackrel {ind}{\sim }\text {Beta}(\beta _{1},\beta _{2})$, for $r,\ s=1,\dots ,k$, in Eq. 3.5 instead of a U(0,1) prior. Then, with ${\epsilon _{n}^{2}} = k^{2}\{ \log n + \log (\delta ^{-1})\}/ n^{2} + \log k/n$, and a sufficiently large constant M > 0 (depending on (β₁, β₂)),

$$ \mathbb{E}_{\theta^{0}} {\varPi}_{n} \left\{ \frac{1}{n^{2}} \left\Vert\theta - \theta^{0}\right\Vert^{2} > M^{2} {\epsilon_{n}^{2}} \mid A \right\} \leq \exp \{-M^{2} n^{2} {\epsilon_{n}^{2}}\} + \frac{1}{C n^{2}{\epsilon_{n}^{2}}}, $$

for some C > 0 and for all n ≥ 1.

The proof of Corollary 4.3 follows the same line of arguments as that of Theorem 4.1 and is given in the Appendix. The only technical difference between the two proofs lies in the volume argument used in the proof of Theorem 4.1, which must be carried out carefully under the more general Beta(β₁, β₂) prior for every possible choice of (β₁, β₂); the rest of the arguments remain unaltered.

4.1 Undirected Networks

Theorem 4.1 can be extended to the case of undirected networks with or without self-loops. For technical simplification, we consider the case when there are no self-loops. We highlight the key differences in the data generation and the prior specification below. Let z_i ∈ {1,…, k} denote the cluster membership of the ith node and Q = (Q_rs) ∈ [0,1]^{k×k} be a symmetric matrix of probabilities, with Q_rs = Q_sr indicating the probability of an edge between any node i in cluster r and any node j in cluster s. Then an undirected version of Eq. 3.1 can be obtained by letting

$$ \begin{array}{@{}rcl@{}} A_{ij} \sim \text{{Bernoulli}}(\theta_{ij}), \quad \theta_{ij} = Q_{z_{i} z_{j}}, \quad 1\leq i < j \leq n. \end{array} $$

(4.4)

and A_ii = 𝜃_ii = 0 for i = 1,…, n. The prior distributions are appropriately modified as:

$$ \begin{array}{@{}rcl@{}} && Q_{rs} \stackrel{\text{ind}} \sim U(0, 1), \quad 1\leq r \leq s \leq k \end{array} $$

(4.5)

$$ \begin{array}{@{}rcl@{}} && P(z_{i} = r \mid \pi) = \pi_{r}, \quad r = 1, \ldots, k, i = 1, \ldots, n, \end{array} $$

(4.6)

$$ \begin{array}{@{}rcl@{}} && \pi \sim \text{Dirichlet}(\alpha_{1}, \ldots, \alpha_{k}). \end{array} $$

(4.7)
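For illustration, the undirected data-generating mechanism (4.4) can be simulated by drawing the entries with i < j and mirroring them. The sketch below assumes NumPy and 0-based labels; setting A_ji = A_ij is the usual convention for undirected networks and is made explicit here as an assumption, and the function name `sample_undirected_sbm` is hypothetical.

```python
import numpy as np


def sample_undirected_sbm(z, Q, rng):
    """Draw a symmetric adjacency matrix without self-loops as in (4.4).

    z : 0-based cluster labels; Q : symmetric k x k edge-probability matrix.
    """
    z = np.asarray(z)
    Q = np.asarray(Q, dtype=float)
    n = z.size
    theta = Q[np.ix_(z, z)]                             # indexing returns a copy
    np.fill_diagonal(theta, 0.0)                        # theta_ii = 0
    upper = np.triu(rng.random((n, n)) < theta, k=1)    # independent draws for i < j
    A = (upper | upper.T).astype(int)                   # assumed convention: A_ji = A_ij
    return A, theta
```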

We modify the discrepancy measure as

$$ \begin{array}{@{}rcl@{}} \frac{1}{n^{2}} \underset{1\leq i < j \leq n}{\sum\sum} (\hat{\theta}_{ij} - \theta_{ij}^{0})^{2} \end{array} $$

(4.8)

where 𝜃, 𝜃⁰ are in the parameter space

$$ \begin{array}{@{}rcl@{}} {{\varTheta}_{k}^{u}} = \{ \theta \in [0,1]^{n\times n} : \theta_{ij} = Q_{z_{i} z_{j}}, 1\leq i \neq j \leq n; \theta_{ii} = 0, 1\leq i \leq n, z \in \mathcal Z_{n, k}, \\ Q \in [0, 1]^{k \times k}, Q_{rs} = Q_{sr}, 1\leq r\leq s \leq k \}.\\ \end{array} $$

(4.9)

Then the following version of Theorem 4.1 is true for undirected networks:

Theorem 4.4.

Assume A = (A_ij) is generated as in Eq. 4.4 with $\theta ^{0} = (\theta ^{0}_{ij}) \in {{\varTheta }_{k}^{u}}$, where ${{\varTheta }_{k}^{u}}$ is as in Eq. 4.9. Further assume that there exists a small constant δ ∈ (0,1/2) such that $\theta ^{0}_{ij} \in (\delta , 1 - \delta )$ for all 1 ≤ i < j ≤ n. Suppose the hierarchical Bayesian model (4.5)–(4.7) is fitted. Then, with ${\epsilon _{n}^{2}} = k^{2}\{ \log n + \log (\delta ^{-1})\}/ n^{2} + \log k/n$, and a sufficiently large constant M > 0, the conclusion (4.3) holds.

A sketch of the proof of Theorem 4.4 is given in the Appendix.

4.2 Geometry of Θ_k

In this section, we derive a number of auxiliary results aimed at understanding the geometry of the parameter space Θ_k. These results are useful in proving our main contraction result presented in Theorem 4.1.

We first state a testing lemma which harnesses the ability of the likelihood to separate points in the parameter space.

Lemma 4.5.

Assume $\theta ^{0} \ne \theta ^{1} \in {\varTheta }_{k}$ and let $E = \{\theta \in [0, 1]^{n \times n} : \left \Vert \theta - \theta ^{1}\right \Vert \le \left \Vert \theta ^{1} - \theta ^{0}\right \Vert /2\}$ be a Euclidean ball of radius ∥𝜃¹ − 𝜃⁰∥/2 around 𝜃¹ inside [0,1]^{n×n}. Based on $A_{ij} \stackrel {\text {ind}}\sim \text {{Bernoulli}}(\theta _{ij})$ for i, j = 1,…, n, consider testing

H₀ : 𝜃 = 𝜃⁰ versus H₁ : 𝜃 ∈ E. There exists a test function Φ such that

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\theta^{0}} ({\varPhi}) \le \exp\{- C_{1} \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2} \}, \quad \sup_{\theta \in E } \mathbb{E}_{\theta} (1- {\varPhi}) \le \exp\{- C_{2} \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2} \}, \end{array} $$

(4.10)

for constants C₁, C₂ > 0 independent of n, 𝜃¹ and 𝜃⁰.

Proof

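Define the test function Φ as

$$ \begin{array}{@{}rcl@{}} {\varPhi} = \mathbf{1} \left\{ \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} (\theta^{1}_{ij} - \theta^{0}_{ij} ) (A_{ij} - \theta^{0}_{ij}) > \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2}/4 \right\}, \end{array} $$

where $\mathbf{1}\{\cdot \}$ denotes the indicator of a set. We show below that this test has the desired error rates (4.10).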

We first bound the type-I error $\mathbb {E}_{\theta ^{0}} ({\varPhi })$. Noting that under $\mathbb {P}_{\theta ^{0}}$, $(A_{ij} - \theta ^{0}_{ij})$ are independent zero mean random variables with $|A_{ij} - \theta ^{0}_{ij}| < 1$, we use a version of Hoeffding’s inequality (refer to Proposition 5.10 of Vershynin 2012) to conclude that,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\theta^{0}} ({\varPhi}) &=& \mathbb{P}_{\theta^{0}} \left\{ \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} (\theta^{1}_{ij} - \theta^{0}_{ij} ) (A_{ij} - \theta^{0}_{ij}) > \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2}/4 \right\} \\ & \le& \exp \left\{-C_{1} \frac{\left\Vert\theta^{1} - \theta^{0}\right\Vert^{4}}{\left\Vert\theta^{1} - \theta^{0}\right\Vert^{2}}\right\} = \exp\left\{-C_{1} \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2} \right\} \end{array} $$

for a constant C₁ > 0 independent of n, 𝜃¹ and 𝜃⁰.

We next bound the type-II error $\sup _{\theta \in E} \mathbb {E}_{\theta }(1 - {\varPhi })$. Fix 𝜃 ∈ E. We have,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\theta}(1 - {\varPhi}) & = \mathbb{P}_{\theta} \left\{{\sum}_{i=1}^{n} {\sum}_{j=1}^{n} (\theta^{1}_{ij} - \theta^{0}_{ij} ) (A_{ij} - \theta^{0}_{ij}) < \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2}/4 \right\} \\ & = \mathbb{P}_{\theta} \left\{{\sum}_{i=1}^{n} {\sum}_{j=1}^{n} (\theta^{1}_{ij} - \theta^{0}_{ij} ) (A_{ij} \!- \theta_{ij}) < \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2}/4 - \left\langle \theta^{1} - \theta^{0}, \theta - \theta^{0}\right\rangle \right\}, \end{array} $$

(4.11)

where we abbreviate $\left \langle \theta ^{\prime }, \theta ^{\prime \prime }\right \rangle = {\sum }_{i=1}^{n} {\sum }_{j=1}^{n} \theta ^{\prime }_{ij} \theta ^{\prime \prime }_{ij}$. Bound

$$ \begin{array}{@{}rcl@{}} &&\left\langle \theta^{1} - \theta^{0}, \theta - \theta^{0}\right\rangle \\ && = \left\langle \theta^{1} - \theta^{0}, \theta^{1} - \theta^{0}\right\rangle - \left\langle \theta^{1} - \theta^{0}, \theta^{1} - \theta\right\rangle \\ && \ge \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2} - \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2}/2 = \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2}/2, \end{array} $$

where the penultimate step used the Cauchy–Schwarz inequality along with the fact that $\left \Vert \theta - \theta ^{1}\right \Vert \le \left \Vert \theta ^{1} - \theta ^{0}\right \Vert /2$. Substituting in Eq. 4.11 and noting that under $\mathbb {P}_{\theta }$, (A_ij − 𝜃_ij) are independent zero mean bounded random variables, another application of Hoeffding’s inequality yields

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\theta} (1 - {\varPhi}) & \le& \mathbb{P}_{\theta} \left\{\sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} (\theta^{1}_{ij} - \theta^{0}_{ij} ) (A_{ij} - \theta_{ij}) < - \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2}/4 \right\} \\ & \le& \exp \left\{-C_{2} \frac{\left\Vert\theta^{1} - \theta^{0}\right\Vert^{4}}{\left\Vert\theta^{1} - \theta^{0}\right\Vert^{2}}\right\} = \exp\left\{-C_{2} \left\Vert\theta^{1} - \theta^{0}\right\Vert^{2} \right\} \end{array} $$

for some constant C₂ > 0 independent of n and 𝜃. Taking a supremum over 𝜃 ∈ E yields the desired result. □
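As an illustration of the test analysed above, the following Python sketch evaluates the statistic appearing in Eq. 4.11 and estimates the type-II error by simulation. It assumes that Φ rejects H₀ exactly when the statistic is at least ∥𝜃¹ − 𝜃⁰∥²/4, which is the complement of the event in Eq. 4.11. The function names (`lr_statistic`, `reject_null`, `type2_error_rate`) and the Monte Carlo settings are illustrative choices and not part of the paper.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean (Frobenius) distance between two n x n arrays."""
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(n) for j in range(n))


def lr_statistic(A, theta0, theta1):
    """Sum over (i, j) of (theta1_ij - theta0_ij) * (A_ij - theta0_ij)."""
    n = len(A)
    return sum(
        (theta1[i][j] - theta0[i][j]) * (A[i][j] - theta0[i][j])
        for i in range(n)
        for j in range(n)
    )


def reject_null(A, theta0, theta1):
    """Return 1 (reject H0) when the statistic reaches ||theta1 - theta0||^2 / 4."""
    return int(lr_statistic(A, theta0, theta1) >= sq_dist(theta1, theta0) / 4)


def type2_error_rate(theta, theta0, theta1, reps=500, seed=0):
    """Monte Carlo estimate of E_theta(1 - Phi) with A_ij ~ Bernoulli(theta_ij)."""
    rng = random.Random(seed)
    n = len(theta)
    misses = 0
    for _ in range(reps):
        A = [[int(rng.random() < theta[i][j]) for j in range(n)] for i in range(n)]
        misses += 1 - reject_null(A, theta0, theta1)
    return misses / reps
```

Calling `type2_error_rate(theta1, theta0, theta1)` with well-separated 𝜃⁰ and 𝜃¹ should return a value close to zero, in line with the exponential bound above.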

Our next result is concerned with the structure of a specific type of Euclidean balls inside Θ_k. Recall that 𝜃^{z, Q} denotes the element of Θ_k with $\theta ^{z, Q}_{ij} = Q_{z_{i} z_{j}}$. For $z \in \mathcal Z_{n, k}$, let

$$ \begin{array}{@{}rcl@{}} {\varTheta}_{k}(z) = \left\{ \theta^{z, Q} : Q \in [0, 1]^{k \times k} \right\} \end{array} $$

(4.12)

denote a slice of Θ_k along z. In other words, given z, Θ_k(z) is the image of the map Q↦𝜃^{z, Q} in Θ_k. Suppose $\theta ^{*} = \theta ^{z^{*}, Q^{*}} \in {\varTheta }_{k}$, and consider a ball B(z) in Θ_k(z) centered at 𝜃^∗ of the form $B(z) = \left \{\theta \in {\varTheta }_{k}(z) : \|\theta - \theta ^{*}\| < t \right \}$ for some t > 0. If z^∗ = z, then it is straightforward to observe that

$$ \begin{array}{@{}rcl@{}} \left\Vert\theta^{z,Q} - \theta^{z^{*}, Q^{*}}\right\Vert^{2} = \sum\limits_{r=1}^{k} \sum\limits_{s=1}^{k} n_{r} n_{s} (Q_{rs} - Q_{rs}^{*})^{2}, \end{array} $$

(4.13)

where we recall that $n_{r} = {\sum }_{i=1}^{n} I(z_{i} = r)$ for r = 1,…, k. Therefore, although a subset of [0,1]^n×n, B(z) can be identified with a k²-dimensional ellipsoid in [0,1]^k×k. When z^∗ ≠ z, one no longer has a nice identity as above and the geometry of B(z) is more difficult to describe. However, we show below in Lemma 4.6 that B(z) is always contained inside a set $\widetilde {B}(z)$ in Θ_k(z) which can be identified with a k²-dimensional ellipsoid in [0,1]^k×k. Recall our convention for describing ellipsoids from Eq. 2.1.

Lemma 4.6.

Fix $z^{*} \in \mathcal Z_{n, k}$, $Q^{*} \in [0, 1]^{k \times k}$, and let $\theta ^{*} = \theta ^{z^{*}, Q^{*}}$. For $z \in \mathcal Z_{n, k}$ and t > 0, let $B(z) = \left \{\theta \in {\varTheta }_{k}(z) : \|\theta - \theta ^{*}\| < t \right \}$. Set W_rs = n_rn_s/t² and W = (W_rs), where $n_{r} = {\sum }_{i=1}^{n} I(z_{i} = r)$ for r = 1,…, k. Then, $B(z) \subseteq \widetilde {B}(z)$, where

$$ \begin{array}{@{}rcl@{}} { \widetilde{B} }(z) = \left\{ \theta^{z, Q}: Q \in \xi_{k^{2}}(\bar{Q}^{*}, W) \cap [0, 1]^{k \times k} \right\} \end{array} $$

(4.14)

for some $\bar {Q}^{*} \in [0, 1]^{k \times k}$ depending on Q^∗, z^∗ and z. In particular, if z^∗ = z, then $\bar {Q}^{*} = Q^{*}$ and the containment becomes equality, i.e., $B(z) = \widetilde {B}(z)$.

Proof

Define $\bar {\theta } = \text {arg min}_{\theta \in {\varTheta }_{k}(z)} \left \Vert \theta - \theta ^{*}\right \Vert ^{2}$. According to the definition of $\bar {\theta }$, we have from the Pythagorean identity

$$ \begin{array}{@{}rcl@{}} \left\Vert\theta - \theta^{*}\right\Vert^{2} = \left\Vert\theta - \bar{\theta}\right\Vert^{2} + \left\Vert\bar{\theta} - \theta^{*}\right\Vert^{2}. \end{array} $$

for 𝜃 ∈Θ_k(z). Therefore, $\left \Vert \theta - \bar {\theta }\right \Vert \leq \left \Vert \theta - \theta ^{*}\right \Vert $, which implies $\{\theta \in {\varTheta }_{k}(z): \left \Vert \theta - \theta ^{*}\right \Vert \leq t \} \subset \{\theta \in {\varTheta }_{k}(z): \left \Vert \theta - \bar {\theta }\right \Vert \leq t\}$. Since $\bar {\theta } \in {\varTheta }_{k}(z)$, there exists $\bar {Q}^{*} \in [0, 1]^{k \times k}$ such that $\bar {\theta }_{ij} =\bar {Q}^{*}_{z_{i} z_{j}}$. This completes the proof of the first part. When z = z^∗, the proof of the second part is completed by noting that

$$ \begin{array}{@{}rcl@{}} \left\Vert\theta - \theta^{*}\right\Vert^{2} = {\sum}_{r=1}^{k} {\sum}_{s=1}^{k} n_{r} n_{s} (Q_{rs} - Q_{rs}^{*})^{2}. \end{array} $$

□
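The projection step in the proof can be checked numerically. The sketch below relies on a standard fact that the proof does not spell out: the least-squares projection of 𝜃^∗ onto Θ_k(z) is obtained by averaging 𝜃^∗ over each block {(i, j) : z_i = r, z_j = s}, so the block averages play the role of $\bar {Q}^{*}$. Labels are 0-based, every cluster is assumed non-empty, and the helper names (`expand`, `block_means`) are illustrative.

```python
def expand(z, Q):
    """Build theta^{z,Q}: entry (i, j) equals Q[z[i]][z[j]] (labels are 0-based)."""
    return [[Q[zi][zj] for zj in z] for zi in z]


def sq_dist(a, b):
    """Squared Euclidean distance between two n x n arrays."""
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(n) for j in range(n))


def block_means(theta_star, z, k):
    """Average theta_star over each block {(i, j): z[i] = r, z[j] = s}.

    Assumes every label 0..k-1 occurs in z, so that no block is empty.
    """
    totals = [[0.0] * k for _ in range(k)]
    sizes = [[0] * k for _ in range(k)]
    for i, zi in enumerate(z):
        for j, zj in enumerate(z):
            totals[zi][zj] += theta_star[i][j]
            sizes[zi][zj] += 1
    return [[totals[r][s] / sizes[r][s] for s in range(k)] for r in range(k)]


z_star = [0, 0, 1, 1, 1, 0]                  # true clustering z*
Q_star = [[0.9, 0.1], [0.2, 0.7]]
z = [0, 1, 1, 0, 1, 0]                       # a different clustering z
theta_star = expand(z_star, Q_star)
Q_bar = block_means(theta_star, z, 2)        # candidate for Q-bar*
theta_bar = expand(z, Q_bar)

theta = expand(z, [[0.3, 0.6], [0.5, 0.4]])  # an arbitrary point of Theta_k(z)
lhs = sq_dist(theta, theta_star)
rhs = sq_dist(theta, theta_bar) + sq_dist(theta_bar, theta_star)
```

Here `lhs` and `rhs` agree up to floating-point error, which is the Pythagorean identity used in the proof. Since block averages of numbers in [0,1] stay in [0,1], `Q_bar` lies in [0,1]^k×k.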

Remark 4.7.

From Eq. 2.1, $\xi _{k^{2}}(\bar {Q}^{*}, W)$ in Lemma 4.6 is the collection of all Q satisfying ${\sum }_{r=1}^{k} {\sum }_{s=1}^{k} n_{r} n_{s}(Q_{rs} - \bar {Q}_{rs}^{*})^{2} < t^{2}$. The last part of Lemma 4.6 is consistent with the discussion preceding (4.13). When z^∗ = z, Eq. 4.13 implies that B(z) consists of all 𝜃^{z, Q} with Q ∈ [0,1]^k×k satisfying ${\sum }_{r=1}^{k} {\sum }_{s=1}^{k} n_{r} n_{s} (Q_{rs} - Q_{rs}^{*})^{2} < t^{2}$.
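The identity in Eq. 4.13 behind this remark is easy to verify numerically when z^∗ = z. The following sketch compares the n × n sum with the k × k weighted sum and uses the latter to test membership in B(z). Labels are 0-based and the function names are illustrative only.

```python
def cluster_sizes(z, k):
    """n_r = number of nodes with label r, for r = 0..k-1."""
    sizes = [0] * k
    for zi in z:
        sizes[zi] += 1
    return sizes


def weighted_sq_dist(z, Q, Q_star, k):
    """sum_{r,s} n_r n_s (Q_rs - Q*_rs)^2, the right-hand side of Eq. 4.13."""
    n_r = cluster_sizes(z, k)
    return sum(
        n_r[r] * n_r[s] * (Q[r][s] - Q_star[r][s]) ** 2
        for r in range(k)
        for s in range(k)
    )


def full_sq_dist(z, Q, Q_star):
    """||theta^{z,Q} - theta^{z,Q*}||^2 computed entry by entry over all (i, j)."""
    return sum((Q[zi][zj] - Q_star[zi][zj]) ** 2 for zi in z for zj in z)


def in_ball(z, Q, Q_star, k, t):
    """Membership of theta^{z,Q} in B(z) when the centre shares the same z."""
    return weighted_sq_dist(z, Q, Q_star, k) < t ** 2
```

The weighted form only touches k² numbers, which is the dimension reduction exploited in the volume arguments that follow.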

Corollary 4.8.

Inspecting the proof of Lemma 4.6, the condition Q ∈ [0,1]^k×k is only used to show that $\bar {Q}^{*} \in [0, 1]^{k \times k}$. If we let Q be unrestricted, then the containment relation continues to hold as subsets of $\mathbb {R}^{k \times k}$, i.e.,

$$ \begin{array}{@{}rcl@{}} \left\{\theta^{z, Q} : Q \in \mathbb{R}^{k \times k}, \left\Vert\theta^{z, Q} - \theta^{z^{*}, Q^{*}}\right\Vert < t \right\} \subseteq \left\{ \theta^{z, Q} : Q \in \xi_{k^{2}}(\bar{Q}^{*}, W) \right\}, \end{array} $$

(4.15)

with equality when z^∗ = z.

Lemma 4.6 crucially exploits the lower dimensional structure underlying the parameter space Θ_k and is used subsequently multiple times. First, recall from Eq. 3.4 that one needs a handle on p(Q : 𝜃^{z, Q} ∈ U) to bound the prior probability of U ⊂Θ_k. In particular, if U = {∥𝜃 − 𝜃⁰∥ < t}, then p(Q : 𝜃^{z, Q} ∈ U) equals the volume of U ∩Θ_k(z), which can be suitably bounded by the volume of the bounding k² dimensional ellipsoid. Second, a handle on the size of balls in Θ_k facilitates calculating the complexity of the model space (in terms of metric entropy) which is pivotal in proving the posterior contraction; in particular, to extend the test function in Lemma 4.5 to construct test functions against more complex alternatives in Lemma 4.9 below. Once again, the dimensionality reduction is key to preventing the metric entropy from growing too fast.

Lemma 4.9.

Recall 𝜖_n from Theorem 4.1. Assume $\theta ^{0} \in {\varTheta }_{k}$ and for l ≥ 1, let $U_{l, n} = \left \{ \theta \in {\varTheta }_{k} : l n \epsilon _{n} \le \|\theta - \theta ^{0}\| < (l+1) n \epsilon _{n} \right \}$. Based on $A_{ij} \stackrel {\text {ind}}\sim \text {{Bernoulli}}(\theta _{ij})$ for i, j = 1,…, n, consider testing H₀ : 𝜃 = 𝜃⁰ versus H₁ : 𝜃 ∈ U_{l, n}. There exists a test function Φ_{l, n} such that

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\theta^{0}} ({\varPhi}_{l, n}) \!\le \exp(- C_{1} l^{2} n^{2} {\epsilon_{n}^{2}}), \quad \sup_{\theta \in U_{l, n} } \mathbb{E}_{\theta} (1- {\varPhi}_{l, n}) \!\le\! \exp(- C_{2} l^{2} n^{2} {\epsilon_{n}^{2}} ), \end{array} $$

(4.16)

for constants C₁, C₂ > 0 independent of n.

Proof

Since $\theta ^{0} \in {\varTheta }_{k}$, there exists $z^{0} \in \mathcal Z_{n,k}$ and Q⁰ ∈ [0,1]^k×k with $\theta ^{0} = \theta ^{z^{0}, Q^{0}}$. For $z \in \mathcal Z_{n, k}$, define U_{l, n}(z) = U_{l, n} ∩Θ_k(z), where Θ_k(z) is as in Eq. 4.12. Clearly,

$$ \begin{array}{@{}rcl@{}} U_{l,n}(z) = \left\{\theta^{z,Q} : Q \in [0, 1]^{k \times k}, l n \epsilon_{n} \!\le\! \left\Vert\theta^{z, Q} - \theta^{z^{0}, Q^{0}}\right\Vert \!<\! (l+1) n \epsilon_{n} \right\}, \end{array} $$

(4.17)

and $U_{l, n} \subset \cup _{z \in \mathcal Z_{n, k}} U_{l,n}(z)$. We first use Lemma 4.5 to construct tests against U_{l, n}(z) for fixed z. Our desired test is obtained by taking the maximum of all such test functions.

Fix $z \in \mathcal {Z}_{n, k}$. Let $\mathcal {N}_{l,n}(z) = \{\theta _{l, n, h} \in U_{l, n}(z): h \in I_{l,n}(z) \}$ be a maximal ln𝜖_n/2-separated set inside U_{l, n}(z) for some index set I_{l, n}(z); i.e., $\mathcal N_{l,n}(z)$ is such that ∥𝜃¹ − 𝜃²∥ ≥ ln𝜖_n/2 for all $\theta ^{1} \ne \theta ^{2} \in \mathcal {N}_{l,n}(z)$, and no subset of U_{l, n}(z) strictly containing $\mathcal N_{l,n}(z)$ has this property. We provide a volume argument to determine an upper bound for |I_{l, n}(z)|, the cardinality of $\mathcal N_{l, n}(z)$. The separation property implies that Euclidean balls of radius ln𝜖_n/4 centered at the points in $\mathcal N_{l, n}(z)$ are disjoint. Since $B_{h}^{+} := \left \{\theta ^{z, Q} : Q \in \mathbb {R}^{k \times k}, \|\theta ^{z, Q} - \theta _{l, n, h}\| < l n \epsilon _{n}/4 \right \}$ is contained inside a Euclidean ball of radius ln𝜖_n/4 centered at 𝜃_{l, n, h}, the sets $B_{h}^{+}$ are disjoint as h varies over I_{l, n}(z). By the triangle inequality, all $B_{h}^{+}$s lie inside $B^{+} = \left \{\theta ^{z, Q} : Q \in \mathbb {R}^{k \times k}, \|\theta ^{z,Q} - \theta ^{0}\| \le (5l/4+1) n \epsilon _{n} \right \}$, since ∥𝜃^{z, Q} − 𝜃⁰∥ ≤ ∥𝜃^{z, Q} − 𝜃_{l, n, h}∥ + ∥𝜃_{l, n, h} − 𝜃⁰∥ ≤ ln𝜖_n/4 + (l + 1)n𝜖_n = (5l/4 + 1)n𝜖_n.

It should be noted that the sets $B_{h}^{+}$s and B⁺ are constructed in a way that Q is not restricted to be inside [0,1]^k×k. This allows us to invoke Corollary 4.8 to identify $B_{h}^{+}$ and B⁺ with appropriate ellipsoids in $\mathbb {R}^{k^{2}}$ and simplify volume calculations. First, since 𝜃_{l, n, h} ∈Θ_k(z) for each h, it follows from (the equality part of) Corollary 4.8, applied with t = ln𝜖_n/4, that $B_{h}^{+} = \{\theta ^{z,Q} : Q \in \xi _{k^{2}}(\bar {Q}_{h}, \widetilde {W})\}$ with $\bar {Q}_{h}$ constructed as in the proof of Lemma 4.6 and $\widetilde {W}_{rs} = n_{r} n_{s}/\{(l n \epsilon _{n}/4)^{2}\}$. The equality is crucially used below; also note that $\widetilde {W}$ does not depend on h. Invoking Corollary 4.8 one more time, we obtain $B^{+} \subset \{\theta ^{z,Q} : Q \in \xi _{k^{2}}(\bar {Q}^{0}, W)\}$, with $W_{rs} = n_{r} n_{s}/\{ (5l/4+1) n \epsilon _{n} \}^{2}$. We conclude that the Euclidean ellipsoids $\xi _{k^{2}}(\bar {Q}_{h}, \widetilde {W})$ are disjoint as h varies over I_{l, n}(z) and all of them are contained in $\xi _{k^{2}}(\bar {Q}^{0}, W)$. Comparing volumes,

$$ |\xi_{k^{2}}(\bar{Q}_{h}, { \widetilde{W} })| \, |I_{l,n}(z)| \le |\xi_{k^{2}}(\bar{Q}^{0},W)|. $$

Using the volume formula in Eq. 2.2 and canceling out common terms, we finally have

$$ \begin{array}{@{}rcl@{}} | I_{l,n}(z) | \le \left\{ \frac {(5l/4+1)}{l/4} \right\}^{k^{2}} \le 9^{k^{2}}. \end{array} $$

(4.18)

We are now in a position to construct the test. The maximality of $\mathcal N_{l,n}(z)$ implies that $\mathcal N_{l,n}(z)$ is an ln𝜖_n/2-net of U_{l, n}(z), i.e., the sets E_{l, n, z, h} = {𝜃 ∈ [0,1]^n×n : ∥𝜃 − 𝜃_{l, n, h}∥ < ln𝜖_n/2} cover U_{l, n}(z) as h varies. For each $\theta _{l,n,h} \in \mathcal N_{l,n}(z)$, consider testing H₀ : 𝜃 = 𝜃⁰ versus H₁ : 𝜃 ∈ E_{l, n, z, h} using the test function from Lemma 4.5. Lemma 4.5 is applicable since ∥𝜃⁰ − 𝜃_{l, n, h}∥≥ ln𝜖_n; let Φ_{l, n, z, h} denote the corresponding test with type-I and II errors bounded above by $e^{-C l^{2} n^{2} {\epsilon _{n}^{2}}}$. Define ${\varPhi }_{l,n} = \max \limits _{z \in \mathcal Z_{n,k}} \max \limits _{h \in I_{l,n}(z)} {\varPhi }_{l,n,z,h}$. For any 𝜃 ∈ U_{l, n}, there exists $z \in \mathcal Z_{n,k}$ and h ∈ I_{l, n}(z) such that 𝜃 ∈ E_{l, n, z, h}, so that $\mathbb {E}_{\theta } (1 - {\varPhi }_{l,n}) \le \mathbb {E}_{\theta }(1 - {\varPhi }_{l,n,z,h}) \le e^{-C l^{2} n^{2} {\epsilon _{n}^{2}}}$. Taking supremum over 𝜃 ∈ U_{l, n} delivers the desired type-II error. Further, the type-I error of Φ_{l, n} can be bounded as

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\theta^{0}} ({\varPhi}_{l,n}) \le \sum\limits_{z \in \mathcal Z_{n,k}} \sum\limits_{h \in I_{l,n}(z)} \mathbb{E}_{\theta^{0}} ({\varPhi}_{l,n,z,h}) \le k^{n} 9^{k^{2}} e^{- C l^{2} n^{2} {\epsilon_{n}^{2}}}, \end{array} $$

(4.19)

since $|\mathcal Z_{n,k}| = k^{n}$ and by Eq. 4.18, $|I_{l,n}(z)| \le 9^{k^{2}}$ for all z. The conclusion then follows since $n^{2} {\epsilon _{n}^{2}} = k^{2}\{\log n + \log (\delta ^{-1})\} + n \log k \gtrsim k^{2} + n \log k$. □
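The bookkeeping in this last step can be made concrete with a short numerical sketch. It computes n²𝜖_n², the per-dimension ratio between the outer radius (5l/4 + 1)n𝜖_n and the inner radius ln𝜖_n/4 used in the volume comparison, and the logarithm of the right hand side of Eq. 4.19. The constant `C` is not specified in the text, so the value passed below is purely hypothetical, as are the function names.

```python
import math


def n2_eps2(n, k, delta):
    """n^2 * eps_n^2 = k^2 {log n + log(1/delta)} + n log k."""
    return k * k * (math.log(n) + math.log(1.0 / delta)) + n * math.log(k)


def radius_ratio(l):
    """Outer radius (5l/4 + 1) n eps_n divided by inner radius l n eps_n / 4."""
    return (5 * l / 4 + 1) / (l / 4)


def log_type1_bound(n, k, delta, l, C):
    """Log of k^n * 9^{k^2} * exp(-C l^2 n^2 eps_n^2), as in Eq. 4.19."""
    return n * math.log(k) + k * k * math.log(9.0) - C * l * l * n2_eps2(n, k, delta)


# Hypothetical settings: n = 120 nodes, k = 3 clusters, delta = 0.1, C = 1.
example = log_type1_bound(120, 3, 0.1, l=2, C=1.0)
```

For n ≥ 9 one has log n ≥ log 9, so n log k + k² log 9 ≤ n²𝜖_n² and the prefactor k^n 9^{k²} is absorbed into the exponential once C l² exceeds 1.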

We now present the proof of Theorem 4.1, which was stated earlier in Section 4.

Proof of Theorem 4.1.

Let $\mathbb {E}_{0} \slash \mathbb {P}_{0}$ denote an abbreviation of $\mathbb {E}_{\theta ^{0}} \slash \mathbb {P}_{\theta ^{0}}$. Since $\theta ^{0} \in {\varTheta }_{k}$, there exists some $z^{0} \in \mathcal Z_{n,k}$ and Q⁰ ∈ [0,1]^k×k with $\theta ^{0} = \theta ^{z^{0}, Q^{0}}$. Recall ${\epsilon _{n}^{2}} = k^{2} \{\log n + \log (\delta ^{-1}) \}/n^{2} + \log k/n$ and define $U_{n} = \left \{\theta \in {\varTheta }_{k}: \|\theta - \theta ^{0}\|^{2} > M^{2} n^{2} {\epsilon _{n}^{2}}\right \}$ for some large constant M > 0 to be chosen later. Letting $f_{\theta _{ij}}(A_{ij}) = \theta _{ij}^{A_{ij}} (1 - \theta _{ij})^{1-A_{ij}}$ denote the Bernoulli(𝜃_ij) likelihood evaluated at A_ij, the posterior probability assigned to U_n can be written as

$$ \begin{array}{@{}rcl@{}} {\varPi}_{n}(U_{n} \mid A) = \frac{ {\int}_{U_{n}} {\prod}_{i=1}^{n} {\prod}_{j=1}^{n} \frac{ f_{\theta_{ij}}(A_{ij}) }{ f_{\theta_{ij}^{0}}(A_{ij}) } p(dz, dQ) }{ {\int}_{{\varTheta}_{k}} {\prod}_{i=1}^{n} {\prod}_{j=1}^{n} \frac{ f_{\theta_{ij}}(A_{ij}) }{ f_{\theta_{ij}^{0}}(A_{ij}) } p(dz, dQ) } = \frac{\mathcal N_{n}}{\mathcal D_{n}}, \end{array} $$

(4.20)

where $\mathcal N_{n}$ and $\mathcal D_{n}$ respectively denote the numerator and denominator of the fraction in Eq. 4.20. Let $\mathcal F_{n}$ denote the σ-field generated by $\tilde {A} = (\tilde {A}_{ij})$, with $\tilde {A}_{ij}$ independently distributed as $\text {{Bernoulli}}(\theta _{ij}^{0})$, the true data generating distribution. We first show in Lemma 4.10 that there exists a set $\mathcal A_{n} \in \mathcal F_{n}$, of large probability under $\mathbb {P}_{0}$, on which $\mathcal D_{n}$ can be bounded from below. The proof is adapted from Lemma 10 of Ghosal and van der Vaart (2007).

Lemma 4.10.

Assume 𝜃⁰ satisfies the conditions of Theorem 4.1. Then, there exists a set $\mathcal A_{n}$ in the σ-field $\mathcal F_{n}$ with $ \mathbb {P}_{0}(\mathcal A_{n}) \geq 1 - C/ (n^{2} {\epsilon _{n}^{2}})$ for some C > 0, such that within $\mathcal A_{n}$,

$$ \begin{array}{@{}rcl@{}} \mathcal D_{n} \ge e^{-C n^{2} {\epsilon_{n}^{2}}} {\varPi}\left( \left\Vert\theta - \theta^{0}\right\Vert^{2} < n^{2} \delta^{2} {\epsilon_{n}^{2}} \right). \end{array} $$

Proof

Let $f_{\theta _{ij}}(A_{ij})$ denote the likelihood for Bernoulli(𝜃_ij) evaluated at A_ij, and let $B_{l, n}= \{\theta \in {\varTheta }_{k}: l^{2} {\epsilon _{n}^{2}} \leq (1/n^{2})\|\theta - \theta ^{0}\|^{2} \leq (l+1)^{2} {\epsilon _{n}^{2}} \}$. Define

$$ \begin{array}{@{}rcl@{}} B_{n}(\theta^{0}; \epsilon_{n}):= \left\{\theta: \sum\limits_{1\leq i,j \leq n}E_{\theta^{0}_{ij}} \log \frac{f_{\theta^{0}_{ij}}(A_{ij})}{f_{\theta_{ij}}(A_{ij})} \leq n^{2} {\epsilon_{n}^{2}}, \sum\limits_{1\leq i,j \leq n} E_{\theta^{0}_{ij}} \left\{\log\frac{f_{\theta^{0}_{ij}}(A_{ij})}{f_{\theta_{ij}}(A_{ij})} \right\}^{2} \leq n^{2} {\epsilon_{n}^{2}} \right\} \end{array} $$

and $\mathcal {A}_{n} = \{ A:\! {\int \limits } {\prod }_{1\leq i,j \leq n } f_{\theta _{ij}}(A_{ij}) / f_{\theta ^{0}_{ij}}(A_{ij}) p(dz, dQ) \!\geq \! e^{- n^{2} {\epsilon _{n}^{2}}} {\varPi }(B_{n}(\theta ^{0}; \epsilon _{n})) \}$. The following fact is a straightforward modification of Lemma 5 of Ghosal and Roy (2006). Let 0 < δ < 1/2 and δ < α, β < 1 − δ. Then there exists a constant L such that

$$ \begin{array}{@{}rcl@{}} \alpha \left( \log \frac{\alpha}{\beta}\right)^{m} + (1- \alpha) \left( \log \frac{1- \alpha}{1-\beta}\right)^{m} \leq \frac{L(\alpha - \beta)^{2}}{\delta^{2}}, \quad m = 1, 2. \end{array} $$

Since $\delta < \theta _{ij}^{0} \leq 1- \delta $ for 1 ≤ i, j ≤ n, it follows from the above fact that $B_{n}(\theta ^{0}; \epsilon _{n}) \supset \{\theta : \left \Vert \theta - \theta ^{0}\right \Vert ^{2} \leq n^{2} \delta ^{2} {\epsilon _{n}^{2}} \}$. It now follows from Lemma 10 of Ghosal and van der Vaart (2007) that $ \mathbb {P}_{0}(\mathcal A_{n}) \geq 1 - C/ (n^{2} {\epsilon _{n}^{2}})$ for some C > 0. □
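The inequality borrowed from Ghosal and Roy (2006) can be explored numerically. The sketch below evaluates the left hand side for m = 1, 2 on a grid of (α, β) strictly inside (δ, 1 − δ) and reports the largest value of LHS · δ²/(α − β)², which is a grid-based lower estimate of the smallest admissible L. The grid size and function names are illustrative. For m = 1 the left hand side is the Bernoulli Kullback–Leibler divergence, which is bounded by the chi-square divergence (α − β)²/{β(1 − β)}, so the reported ratio cannot exceed δ/(1 − δ).

```python
import math


def log_moment(alpha, beta, m):
    """alpha * log(alpha/beta)^m + (1 - alpha) * log((1 - alpha)/(1 - beta))^m."""
    return (
        alpha * math.log(alpha / beta) ** m
        + (1 - alpha) * math.log((1 - alpha) / (1 - beta)) ** m
    )


def max_ratio(delta, m, grid=40):
    """Largest value of LHS * delta^2 / (alpha - beta)^2 over a grid in (delta, 1 - delta)."""
    pts = [delta + (1 - 2 * delta) * (t + 0.5) / grid for t in range(grid)]
    best = 0.0
    for a in pts:
        for b in pts:
            if a != b:
                ratio = log_moment(a, b, m) * delta ** 2 / (a - b) ** 2
                best = max(best, ratio)
    return best


ratio_m1 = max_ratio(0.1, 1)   # Kullback-Leibler term
ratio_m2 = max_ratio(0.1, 2)   # second-moment term
```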

In view of Lemma 4.10, it is sufficient to provide an upper bound to

For l ≥ M, let $U_{l, n} = \left \{ \theta \in {\varTheta }_{k} : l^{2} n^{2} {\epsilon _{n}^{2}} \le \left \Vert \theta - \theta ^{0}\right \Vert ^{2} < (l+1)^{2} n^{2} {\epsilon _{n}^{2}} \right \}$ denote an annulus in Θ_k centered at 𝜃⁰ with inner and outer Euclidean radii ln𝜖_n and (l + 1)n𝜖_n respectively. Using a standard testing argument (see, for example, the proof of Proposition 5.1 in Castillo and van der Vaart 2012) in conjunction with Lemma 4.10, one arrives at

(4.21)

where Φ_{l, n} is the test function constructed in Lemma 4.9 for testing H₀ : 𝜃 = 𝜃⁰ versus H₁ : 𝜃 ∈ U_{l, n} with error rates as in Eq. 4.16. Recall U_{l, n}(z) = U_{l, n} ∩Θ_k(z) and its equivalent representation in Eq. 4.17 from the proof of Lemma 4.9. Since $U_{l,n} \subseteq \cup _{z \in \mathcal Z_{n,k}} U_{l,n}(z)$, from Eq. 3.4,

$$ \begin{array}{@{}rcl@{}} {\varPi}(U_{l,n}) \le \sum\limits_{z \in \mathcal Z_{n,k}} {\varPi} \left\{U_{l,n}(z) \right\} \leq \left\vert \mathcal Z_{n,k}\right\vert \max_{z \in \mathcal Z_{n,k}} p(z), \end{array} $$

(4.22)

where p(z) is the prior probability (3.9) of z under the Dirichlet-multinomial prior.

Next, consider the term ${\varPi }(\left \Vert \theta - \theta ^{0}\right \Vert ^{2} < n^{2} \delta ^{2} {\epsilon _{n}^{2}})$ in the denominator of the expression for β_{l, n}. Bound ${\varPi }(\left \Vert \theta - \theta ^{0}\right \Vert ^{2} < n^{2} \delta ^{2} {\epsilon _{n}^{2}}) \ge {\varPi }(\left \Vert \theta - \theta ^{0}\right \Vert ^{2} < n^{2} \delta ^{2} {\epsilon _{n}^{2}} \mid z = z^{0}) p(z^{0})$ and using Lemma 4.6 once again,

$$ \begin{array}{@{}rcl@{}} {\varPi}\left( \left\Vert\theta - \theta^{0}\right\Vert^{2} < n^{2} \delta^{2} {\epsilon_{n}^{2}} \mid z = z^{0} \right) = p\left\{ Q : \sum\limits_{r=1}^{k} \sum\limits_{s=1}^{k} n_{0r} n_{0s} (Q_{rs} - Q^{0}_{rs})^{2} < n^{2} \delta^{2} {\epsilon_{n}^{2}} \right\}. \end{array} $$

(4.23)

The probability in the right hand side of the above display is the volume of the intersection of an ellipsoid with [0,1]^k×k, and therefore we cannot simply replace the probability by the volume of the ellipsoid. Instead, we embed an appropriate rectangle inside the intersection of the ellipsoid and [0,1]^k×k. We claim that

$$ \begin{array}{@{}rcl@{}} \prod\limits_{r=1}^{k} \prod\limits_{s=1}^{k} [Q_{rs}^{0} - \delta \epsilon_{n}/2, Q_{rs}^{0} + \delta \epsilon_{n}/2] \subset \left\{ Q \in [0, 1]^{k \times k}: \sum\limits_{r=1}^{k} \sum\limits_{s=1}^{k} n_{0r} n_{0s} (Q_{rs} - Q^{0}_{rs})^{2} \!<\! n^{2} \delta^{2} {\epsilon_{n}^{2}}\right\}. \end{array} $$

(4.24)

First, based on our assumption that all entries of Q⁰ are bounded away from 0 and 1 and the fact that 𝜖_n → 0, it is immediate that the rectangle is contained in [0,1]^k×k for sufficiently large n. Second, for any Q with $|Q_{rs} - Q^{0}_{rs}| \le \delta \epsilon _{n}/2$ for all 1 ≤ r, s ≤ k, we have

$$ \sum\limits_{r=1}^{k} \sum\limits_{s=1}^{k} n_{0r} n_{0s} (Q_{rs} - Q^{0}_{rs})^{2} \le \frac{\delta^{2} {\epsilon_{n}^{2}}}{4} \sum\limits_{r=1}^{k} \sum\limits_{s=1}^{k} n_{0r} n_{0s} = \frac{n^{2} \delta^{2} {\epsilon_{n}^{2}}}{4 }, $$

thereby proving the claim in Eq. 4.24. Now we can bound ${\varPi }(\left \Vert \theta - \theta ^{0}\right \Vert ^{2} < n^{2} \delta ^{2} {\epsilon _{n}^{2}} \mid z = z^{0})$ from below by the volume of the rectangle, which equals $(\epsilon _{n} \delta )^{k^{2}}$. Since n_0r ≥ 1 for all r = 1,…, k, we invoke Lemma 3.1 to bound $\max \limits _{z \in \mathcal Z_{n, k}}p(z) / p(z^{0}) \leq e^{C n \log k}$. Combining this with error rates (4.16) in (4.21) we obtain,

(4.25)

For $n^{2}{\epsilon _{n}^{2}} = k^{2} \{\log n + \log (\delta ^{-1})\} + n \log k$, the right hand side of Eq. 4.25 converges to zero for all M larger than a suitable constant. □
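The rectangle-in-ellipsoid claim of Eq. 4.24 used in the proof can be checked numerically. The sketch below evaluates the quadratic form at a corner of the rectangle, where every coordinate deviates from Q⁰ by the full half-width δ𝜖_n/2, and compares it with the squared radius n²δ²𝜖_n². The values of n, k, δ and Q⁰ are hypothetical and chosen only for illustration.

```python
import math


def eps_n(n, k, delta):
    """eps_n, with eps_n^2 = k^2 {log n + log(1/delta)} / n^2 + log(k) / n."""
    return math.sqrt(
        k * k * (math.log(n) + math.log(1.0 / delta)) / n ** 2 + math.log(k) / n
    )


def ellipsoid_form(n0, Q, Q0):
    """sum_{r,s} n_0r n_0s (Q_rs - Q0_rs)^2."""
    k = len(n0)
    return sum(
        n0[r] * n0[s] * (Q[r][s] - Q0[r][s]) ** 2 for r in range(k) for s in range(k)
    )


n, k, delta = 120, 3, 0.05                  # hypothetical settings
n0 = [40, 40, 40]                           # true cluster sizes, summing to n
Q0 = [[0.5 if r == s else 0.1 for s in range(k)] for r in range(k)]

eps = eps_n(n, k, delta)
half = delta * eps / 2                      # half-width of the rectangle
corner = [[Q0[r][s] + half for s in range(k)] for r in range(k)]

worst = ellipsoid_form(n0, corner, Q0)      # largest value of the form on the rectangle
radius_sq = n ** 2 * delta ** 2 * eps ** 2  # squared radius in Eq. 4.24
log_volume = k * k * math.log(delta * eps)  # log of the rectangle volume (delta * eps)^{k^2}
```

The corner value equals n²δ²𝜖_n²/4, strictly inside the ellipsoid, and `log_volume` is the logarithm of the lower bound on the conditional prior mass under the uniform prior on Q.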

5 Simulation Studies

In this section, we consider a small-scale simulation study to examine the accuracy in estimating 𝜃 as the number of nodes n in the network increases. We simulate 100 replicates of an SBM network using k = 3,4,5 equi-sized communities, with n = 120, 150, and 200. The off-diagonal entries of Q are set to 0.1 and all the diagonal entries are set to a constant ρ > 0.1. The smaller the value of ρ is, the more vague the block structure is in the network.
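A minimal sketch of the data-generating step described above, in plain Python. It assumes a directed network without self-loops (matching the i ≠ j sums used in Section 5.1) and 0-based community labels; the function names and the seed are illustrative and are not taken from the authors' code.

```python
import random


def make_truth(n, k, rho, off=0.1):
    """Equi-sized labels (n must be divisible by k) and a k x k matrix Q
    with diagonal entries rho and off-diagonal entries `off`."""
    z0 = [r for r in range(k) for _ in range(n // k)]
    Q = [[rho if r == s else off for s in range(k)] for r in range(k)]
    return z0, Q


def simulate_sbm(z, Q, rng):
    """Directed adjacency matrix with A_ij ~ Bernoulli(Q[z_i][z_j]) for i != j
    and A_ii = 0 (self-loops excluded by assumption)."""
    n = len(z)
    return [
        [0 if i == j else int(rng.random() < Q[z[i]][z[j]]) for j in range(n)]
        for i in range(n)
    ]


z0, Q = make_truth(n=120, k=3, rho=0.5)
A = simulate_sbm(z0, Q, random.Random(1))
```

Repeating the last line with 100 different seeds would give the replicates for one (n, ρ) configuration.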

For each n, we consider ρ = 0.3, 0.5, 0.8. The true community assignment z₀ is set to {(1)_n/3,(2)_n/3,(3)_n/3}, where (x)_k denotes the vector obtained by replicating x k times. We fit (i) an SBM with k = 3, (ii) an SBM with k = 4 and (iii) an SBM with k = 5; note that the number of communities is mis-specified in (ii) and (iii). The following Gibbs sampler is employed to sample from the posterior distribution of the parameters.

5.1 Gibbs sampling for fixed k (directed networks)

Define

$$ \begin{array}{@{}rcl@{}} n_{r} &=& \sum\limits_{i=1}^{n} I(z_{i}=r), \quad r=1, \ldots, k.\\ n_{rs} &=& \underset{1\leq i\neq j\leq n}{\sum\sum} I(z_{i}=r, z_{j}=s) = n_{r} n_{s} - n_{r} I(r=s). \\ A[rs] &=& \underset{(i, j): z_{i}=r, z_{j} =s}{\sum\sum} A_{ij}, \quad r=1, \ldots, k, s=1, \ldots, k. \end{array} $$

Then the full-conditional distributions of π and Q can be obtained as

$$ \begin{array}{@{}rcl@{}} \pi \mid - &\sim& \text{Dirichlet}(\alpha_{1} + n_{1}, \ldots, \alpha_{k} + n_{k})\\ Q_{rs} \mid - &\sim& \text{Beta}(1 + A[rs], 1+ n_{rs} - A[rs]). \end{array} $$
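The two conjugate updates above can be sketched in plain Python as follows. The Dirichlet draw uses the standard construction from normalised Gamma variables, labels are 0-based, and the function names are illustrative rather than the authors' implementation.

```python
import random


def block_counts(z, A, k):
    """Return n_r, n_rs = n_r n_s - n_r I(r = s), and A[rs] summed over i != j."""
    n = len(z)
    n_r = [0] * k
    for zi in z:
        n_r[zi] += 1
    n_rs = [
        [n_r[r] * n_r[s] - (n_r[r] if r == s else 0) for s in range(k)]
        for r in range(k)
    ]
    A_rs = [[0] * k for _ in range(k)]
    for i in range(n):
        for j in range(n):
            if i != j:
                A_rs[z[i]][z[j]] += A[i][j]
    return n_r, n_rs, A_rs


def sample_pi(n_r, alpha, rng):
    """Draw pi ~ Dirichlet(alpha_1 + n_1, ..., alpha_k + n_k) via Gamma variables."""
    g = [rng.gammavariate(alpha[r] + n_r[r], 1.0) for r in range(len(n_r))]
    total = sum(g)
    return [x / total for x in g]


def sample_Q(n_rs, A_rs, rng):
    """Draw Q_rs ~ Beta(1 + A[rs], 1 + n_rs - A[rs]) independently for each (r, s)."""
    k = len(n_rs)
    return [
        [rng.betavariate(1 + A_rs[r][s], 1 + n_rs[r][s] - A_rs[r][s]) for s in range(k)]
        for r in range(k)
    ]
```

An empty block (n_rs = 0) simply yields a draw from the Beta(1, 1) prior, so no special casing is needed.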

Observe that

$$ \begin{array}{@{}rcl@{}} P(z_{i} = l \mid z_{-i}, A, \pi, Q) \propto P(A \mid z, \pi, Q) P(z \mid \pi) P(\pi) P(Q). \end{array} $$

Keeping the terms involving z_i,

$$ \begin{array}{@{}rcl@{}} P(A \mid z, \pi, Q) \propto \left\{ \prod\limits_{j \neq i} Q_{z_{i} z_{j}}^{A_{ij}} (1 - Q_{z_{i} z_{j}})^{1- A_{ij}}\right\} \!\times\! \left\{ \prod\limits_{k \neq i} Q_{z_{k} z_{i}}^{A_{ki}} (1 - Q_{z_{k} z_{i}})^{1- A_{ki}}\right\}, \quad \!P(z \mid \pi) \propto \pi_{z_{i}}. \end{array} $$

Hence,

$$ \begin{array}{@{}rcl@{}} P(z_{i} = l \mid z_{-i}, A, \pi, Q) \propto \pi_{l} \times \left\{ \prod\limits_{j \neq i} Q_{l z_{j}}^{A_{ij}} (1 - Q_{l z_{j}})^{1- A_{ij}}\right\} \times \left\{ \prod\limits_{j \neq i} Q_{z_{j} l}^{A_{ji}} (1 - Q_{z_{j} l})^{1- A_{ji}}\right\}, \quad l = 1, \ldots, k. \end{array} $$
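The label update can be sketched as follows, with z_i set to each candidate label l in turn on the right hand side of the display above. The computation is done on the log scale and normalised with the usual max-subtraction trick to avoid underflow. The sketch assumes that all entries of π are positive and all entries of Q lie strictly inside (0, 1); labels are 0-based and the function names are illustrative.

```python
import math
import random


def z_conditional(i, z, A, pi, Q):
    """Normalised probabilities P(z_i = l | z_{-i}, A, pi, Q) for l = 0..k-1."""
    n, k = len(z), len(pi)
    logp = []
    for l in range(k):
        lp = math.log(pi[l])
        for j in range(n):
            if j == i:
                continue
            q_out = Q[l][z[j]]   # probability of an edge i -> j if z_i = l
            q_in = Q[z[j]][l]    # probability of an edge j -> i if z_i = l
            lp += A[i][j] * math.log(q_out) + (1 - A[i][j]) * math.log(1 - q_out)
            lp += A[j][i] * math.log(q_in) + (1 - A[j][i]) * math.log(1 - q_in)
        logp.append(lp)
    top = max(logp)
    weights = [math.exp(x - top) for x in logp]
    total = sum(weights)
    return [w / total for w in weights]


def sample_label(i, z, A, pi, Q, rng):
    """Draw a new label for node i from its full conditional."""
    probs = z_conditional(i, z, A, pi, Q)
    return rng.choices(range(len(pi)), weights=probs)[0]
```

One sweep of the sampler would call `sample_label` for i = 0, …, n − 1, writing each new label back into `z` before moving to the next node.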

The Gibbs sampler proceeds by cycling through π∣−, Q_rs∣ − and z_i∣z_−i, A, π, Q. We set α_j = 1, j = 1,…, k and ran the MCMC for 3000 iterations with a burn-in of 1000. The posterior mean $\hat {\theta }$ of 𝜃 and the posterior mode of z post burn-in are obtained as the Bayes estimates. As a measure of discrepancy between $\hat {\theta }$ and 𝜃₀, we compute the mean squared error (MSE) $(1/n^{2})\Vert \hat {\theta } - \theta _{0}\Vert ^{2}$, and between the estimate of z and z₀ we compute the Rand index (RI), where $\text {RI}=\# \text {matched pairs} / {n \choose 2}$ and a pair of nodes is matched when the two partitions agree on whether the pair belongs to the same cluster. The results are summarized in Tables 1 & 2. It is evident that for a fixed ρ, MSE decreases and RI increases as n increases. On the other hand, for a fixed n, as ρ increases, the clustering pattern is more evident in the network leading to improved accuracy in estimating z₀.
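The two accuracy measures can be computed as in the sketch below. The Rand index is implemented in its standard form, the fraction of node pairs on which the two partitions agree, which is the convention under which a larger RI means better recovery and which is invariant to relabelling of the clusters. The function names are illustrative.

```python
from itertools import combinations


def mse(theta_hat, theta0):
    """(1/n^2) * ||theta_hat - theta0||^2 for n x n arrays."""
    n = len(theta0)
    return sum(
        (theta_hat[i][j] - theta0[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / n ** 2


def rand_index(z, z0):
    """Fraction of pairs {i, j} on which z and z0 agree about co-clustering."""
    pairs = list(combinations(range(len(z)), 2))
    agree = sum((z[i] == z[j]) == (z0[i] == z0[j]) for i, j in pairs)
    return agree / len(pairs)
```

For instance, `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` equals 1 because the two label vectors describe the same partition.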

Table 1 MSE (× 10²) and standard error (× 10³) comparison over 100 replicates


Table 2 Rand Index and standard error comparison (× 10³) over 100 replicates


Interestingly, the results for known k are very similar to those for the mis-specified k, indicating the robustness of the Bayesian approach. It is possible that a phenomenon similar to overfitted Gaussian mixtures (Rousseau and Mengersen, 2011) is at work here.

6 Discussion

In this article, we presented a theoretical investigation of posterior contraction in stochastic block models. One crucial assumption in our current results is that the true number of clusters k is known. Geng et al. (2018) studied inference in a stochastic block model with an unknown number of clusters within a Bayesian non-parametric framework. Their objective was two-fold: (i) simultaneous estimation of the number of clusters and the cluster structure and (ii) consistent cluster detection. Towards that, they employed a mixture of finite mixtures (MFM) as a prior distribution for k. A natural and interesting extension of our present work would be to theoretically explore the situation when k remains unknown and an MFM or a mixture of Dirichlet processes prior is used to adaptively learn about k. We leave this as an important problem for future research. In a very recent technical report, Gao et al. (2018) provided general conditions for optimal posterior contraction rates in stochastic block models adaptively for all values of k ∈{1,2,…, n} using Laplace-type priors on Q and a complexity prior on k. Their proposed elliptical Laplace prior distribution is theoretically interesting and accommodates many statistical problems in a unified way. In contrast, we worked with a more natural and easily implementable uniform prior specification which is widely used in network analysis problems. An interesting direction is to develop a fully Bayesian approach with the more commonly used uniform prior on Q and a complexity prior on k and to show that the corresponding procedure yields optimal rates of posterior contraction adaptively for all values of k ∈{1,2,…, n}.
Such an approach can be connected to nonparametric estimation of networks (Bickel and Chen, 2009) where one typically assumes a more flexible way of data generation; $A_{ij} \mid \xi _{i}, \xi _{j} \sim \text {{Bernoulli}}\{f(\xi _{i}, \xi _{j})\}$, where f is a function from [0,1]² → [0,1], called a graphon and ξ_is are i.i.d. random variables on [0,1]. It is well known (refer, for example, to Szemerédi 1975; Lovász and Szegedy 2006; Airoldi et al. 2013; Gao et al. 2015) that one can approximate a sufficiently smooth graphon using elements of Θ_k. When the smoothness of the graphon is unknown, the prior on k should facilitate the posterior to concentrate in the appropriate region. Using such approximation results and modifying our Theorem 4.1, it may be possible to derive posterior contraction rates for estimating a graphon.

Notes

Our result continues to hold for general Beta priors on the edge-inclusion probabilities.

References

Abramowitz, M. and Stegun, I. (1964). Handbook of mathematical functions: with formulas, graphs, and mathematical tables. No. 55. Courier Corporation.
Airoldi, E. M., Blei, D. M., Fienberg, S. E. and Xing, E. P. (2009). Mixed membership stochastic blockmodels. In: Advances in Neural Information Processing Systems. pp. 33–40.
Airoldi, E., Costa, T. and Chan, S. (2013). Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In: Advances in Neural Information Processing Systems. pp. 692–700.
Amini, A. A., Chen, A., Bickel, P. J. and Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics 41, 4, 2097–2122.
Banerjee, S. and Ghosal, S. (2014). Posterior convergence rates for estimating large precision matrices using graphical models. Electronic Journal of Statistics 8, 2, 2111–2137.
Barron, A. R. (1988). The exponential convergence of posterior probabilities with implications for Bayes estimators of density functions. Univ.
Barron, A., Schervish, M. J. and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. The Annals of Statistics 27, 2, 536–561.
Bickel, P. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences 106, 50, 21068–21073.
Bickel, P. J., Choi, D., Chang, X. and Zhang, H. (2013). Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics 41, 4, 1922–1943.
Bontemps, D. (2011). Bernstein–von mises theorems for gaussian regression with increasing number of regressors. The Annals of Statistics 39, 5, 2557–2584.
Castillo, I. and van der Vaart, A. W. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics 40, 4, 2069–2101.
Castillo, I., Schmidt-Hieber, J. and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. Ann. Statist. 43, 5, 1986–2018. https://doi.org/10.1214/15-AOS1334.
Channarond, A., Daudin, J. -J. and Robin, S. (2012). Classification and estimation in the stochastic block model based on the empirical degrees. Electronic Journal of Statistics 6, 2574–2601.
Chatterjee, S. (2014). Matrix estimation by universal singular value thresholding. The Annals of Statistics 43, 1, 177–214.
Dasgupta, A., Hopcroft, J. E. and McSherry, F. (2004). Spectral analysis of random graphs with skewed degree distributions. IEEE, p. 602–610.
Erdős, P. and Rényi, A. (1961). On the evolution of random graphs. Bull. Inst. Internat. Statist 38, 4, 343–347.
Frank, O. and Strauss, D. (1986). Markov graphs. Journal of the American Statistical association 81, 395, 832–842.
Gao, C., Lu, Y. and Zhou, H. H. (2015). Rate-optimal graphon estimation. The Annals of Statistics 43, 6, 2624–2652.
Gao, C., van der Vaart, A. W. and Zhou, H. H. (2018). A general framework for bayes structured linear models. arXiv:1506.02174.
Geng, J., Bhattacharya, A. and Pati, D. (2018). Probabilistic community detection with unknown number of communities. Journal of American Statistical Association (to appear).
Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics 28, 2, 500–531.
Ghosal, S. and Roy, A. (2006). Posterior consistency of gaussian process prior for nonparametric binary regression. The Annals of Statistics, 2413–2429.
Ghosal, S. and van der Vaart, A. W. (2007). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics 35, 1, 192–223.
Goldenberg, A., Zheng, A., Fienberg, S. and Airoldi, E. (2010). A survey of statistical network models. Foundations and Trends® in Machine Learning 2, 2, 129–233.
Golightly, A. and Wilkinson, D. J. (2005). Bayesian inference for stochastic kinetic models using a diffusion approximation. Biometrics 61, 3, 781–788.
Hayashi, K., Konishi, T. and Kawamoto, T. (2016). A tractable fully bayesian method for the stochastic block model. arXiv:1602.02256.
Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical association 97, 460, 1090–1098.
Holland, P. W. and Leinhardt, S. (1981). An exponential family of probability distributions for directed graphs. Journal of the American Statistical association 76, 373, 33–50.
Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks 5, 2, 109–137.
Karrer, B. and Newman, M. E. J. (2011). Stochastic blockmodels and community structure in networks. Physical Review E 83, 1, 016107.
Lovász, L. and Szegedy, B. (2006). Limits of dense graph sequences. Journal of Combinatorial Theory, Series B 96, 6, 933–957.
McDaid, A., Murphy, T. B., Friel, N. and Hurley, N. (2013). Improved bayesian inference for the stochastic block model with application to large networks. Computational Statistics & Data Analysis 60, 12–31.
Newman, M. E. J. (2012). Communities, modules and large-scale structure in networks. Nature Physics 8, 1, 25–31.
Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association 96, 455, 1077–1087.
Pati, D., Bhattacharya, A., Pillai, N. S. and Dunson, D. (2014). Posterior contraction in sparse bayesian factor models for massive covariance matrices. The Annals of Statistics 42, 3, 1102–1130.
Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73, 5, 689–710.
Schwartz, L. (1965). On bayes procedures. Probability Theory and Related Fiel 4, 1, 10–26.
Snijders, T. A. B. and Nowicki, K. (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification 14, 1, 75–100.
MathSciNet MATH Google Scholar
Suwan, S., Lee, D. S., Tang, R., Sussman, D. L., Tang, M. and Priebe, C. E. (2016). Empirical bayes estimation for the stochastic block model. Electronic Journal of Statistics 10, 1, 761–782.
MathSciNet MATH Google Scholar
Szemerédi, E. (1975). On sets of integers containing no k elements in arithmetic progression. Acta Arith 27, 199-245, 2.
MathSciNet MATH Google Scholar
van der Pas, S., Kleijn, B. and van der Vaart, A. (2014). The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics 8, 2, 2585–2618.
MathSciNet MATH Google Scholar
van der Pas, S. L. and van der Vaart, A. W. (2018). Bayesian community detection. Bayesian Analysis 13, 3, 767–796.
MathSciNet MATH Google Scholar
Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing, 210–268.
Wang, Y. J. and Wong, G. Y. (1987). Stochastic blockmodels for directed graphs. Journal of the American Statistical Association 82, 397, 8–19.
MathSciNet MATH Google Scholar
Zhao, Y., Levina, E. and Zhu, J. (2011). Community extraction for social networks. Proceedings of the National Academy of Sciences 108, 18, 7321–7326.
Google Scholar
Zhao, Y., Levina, E. and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. The Annals of Statistics 40, 4, 2266–2292.
MathSciNet MATH Google Scholar

Author information

Authors and Affiliations

Department of Statistics, Texas A&M University, College Station, Texas, USA
Prasenjit Ghosh, Debdeep Pati & Anirban Bhattacharya

Corresponding author

Correspondence to Prasenjit Ghosh.

Appendix

A.1 Proof of Corollary 4.3

Following exactly the same set of arguments as in the proof of Theorem 4.1, we have for all sufficiently large n,

$$ \prod\limits_{r=1}^{k} \prod\limits_{s=1}^{k} [Q_{rs}^{0} - \delta \epsilon_{n}/2, Q_{rs}^{0} + \delta \epsilon_{n}/2] \subset \left\{ Q: \sum\limits_{r=1}^{k} \sum\limits_{s=1}^{k} n_{0r} n_{0s} (Q_{rs} - Q^{0}_{rs})^{2} < n^{2} \delta^{2} {\epsilon_{n}^{2}}\right\}. $$

(4.26)

Since for each (r, s), $Q_{rs}^{0}\in (\delta ,1-\delta )$, the prior probability of the embedded rectangle ${\prod }_{r=1}^{k}{\prod }_{s=1}^{k}[Q_{rs}^{0}-\delta \epsilon _{n}/2,Q_{rs}^{0}+\delta \epsilon _{n}/2]$ can be bounded below as follows:

$$ \begin{array}{@{}rcl@{}} p\left( \prod\limits_{r=1}^{k} \prod\limits_{s=1}^{k} [Q_{rs}^{0} - \delta \epsilon_{n}/2, Q_{rs}^{0} + \delta \epsilon_{n}/2]\right) &\geq& (\delta\epsilon_{n})^{k^{2}} \prod\limits_{r=1}^{k}\prod\limits_{s=1}^{k} \inf p\left( [Q_{rs}^{0} - \delta \epsilon_{n}/2, Q_{rs}^{0} + \delta \epsilon_{n}/2]\right)\\ &\geq& (\delta\epsilon_{n})^{k^{2}} \left\{\inf p[\delta (1-\epsilon_{n}/2), 1-\delta (1-\epsilon_{n}/2)]\right\}^{k^{2}}\\ &=& \left( \text{Beta}(\beta_{1},\beta_{2})\right)^{-k^{2}}(\delta\epsilon_{n})^{k^{2}} \times \\ && \left\{\inf_{q\in[\delta (1-\epsilon_{n}/2), 1-\delta (1-\epsilon_{n}/2)]} q^{\beta_{1}-1}(1-q)^{\beta_{2}-1} \right\}^{k^{2}}\\ \end{array} $$

(4.27)

where Beta(β₁, β₂) denotes the standard Beta function with parameters (β₁, β₂). Next we observe that

$$ \inf_{q\in[\delta (1-\epsilon_{n}/2), 1-\delta (1-\epsilon_{n}/2)]} q^{\beta_{1}-1}(1-q)^{\beta_{2}-1}\geq \psi_{\delta,\epsilon_{n}}(\beta_{1},\beta_{2}), $$

(4.28)

where for each fixed $(\delta ,\epsilon _{n})$, the function $\psi _{\delta ,\epsilon _{n}}\colon (0,\infty )^{2} \rightarrow (0,\infty )$ is defined as

$$ {\psi_{\delta,\epsilon_{n}}(\beta_{1},\beta_{2}) :=} \left\{ \begin{array}{ll} \{\delta (1-\epsilon_{n}/2)\}^{\beta_{1}+\beta_{2}-2} & \text{ if } \beta_{1}\geq 1,\beta_{2}\geq 1\\ \\ \{1-\delta (1-\epsilon_{n}/2)\}^{\beta_{1}-1}\{\delta (1-\epsilon_{n}/2)\}^{\beta_{2}-1} & \text{ if } \beta_{1} < 1,\beta_{2}\geq 1\\ \\ \{\delta (1-\epsilon_{n}/2)\}^{\beta_{1}-1}\{1-\delta (1-\epsilon_{n}/2)\}^{\beta_{2}-1} & \text{ if } \beta_{1} \geq 1,\beta_{2} < 1\\ \\ \{1-\delta (1-\epsilon_{n}/2)\}^{\beta_{1}+\beta_{2}-2} & \text{ if } \beta_{1}< 1,\beta_{2}< 1. \end{array}\right. $$
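The case analysis behind Eq. 4.28 can be checked numerically. The following is a minimal sketch, not part of the original proof: the function names `psi` and `grid_infimum`, the grid size, and the parameter values are illustrative assumptions. It compares the closed-form bound with a grid approximation of the infimum of $q^{\beta_{1}-1}(1-q)^{\beta_{2}-1}$ over $[\delta (1-\epsilon_{n}/2), 1-\delta (1-\epsilon_{n}/2)]$.

```python
def psi(delta, eps, b1, b2):
    """Closed-form lower bound psi_{delta,eps}(b1, b2) from the four cases."""
    a = delta * (1.0 - eps / 2.0)  # left endpoint; the right endpoint is 1 - a
    # A non-negative exponent is minimised by the smaller base a,
    # a negative exponent by the larger base 1 - a.
    first = (a if b1 >= 1 else 1.0 - a) ** (b1 - 1)
    second = (a if b2 >= 1 else 1.0 - a) ** (b2 - 1)
    return first * second


def grid_infimum(delta, eps, b1, b2, m=2001):
    """Grid approximation (hypothetical helper) of the infimum on [a, 1 - a]."""
    a = delta * (1.0 - eps / 2.0)
    grid = [a + (1.0 - 2.0 * a) * i / (m - 1) for i in range(m)]
    return min(q ** (b1 - 1) * (1.0 - q) ** (b2 - 1) for q in grid)


if __name__ == "__main__":
    delta, eps = 0.2, 0.1  # illustrative values with delta in (0, 1/2)
    for b1, b2 in [(2.0, 3.0), (0.5, 3.0), (3.0, 0.5), (0.5, 0.5)]:
        print(b1, b2, grid_infimum(delta, eps, b1, b2), psi(delta, eps, b1, b2))
```

In each of the four regimes the grid minimum should be at least the value of `psi`, since the bound treats the two factors separately and is therefore not necessarily attained.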

Using Eqs. 4.26–4.28, and following exactly the same line of arguments as in the proof of Theorem 4.1, we obtain

$$ \mathbb{E}_{0} \left\{{\varPi}_{n}(U_{n} \mid A) 1_{\mathcal{A}_{n}^{c}} \right\} \leq \sum\limits_{l=M}^{\infty} \left\{e^{-{C_{1}^{2}} l^{2} n^{2} {\epsilon_{n}^{2}} } + \frac{ e^{-{C_{2}^{2}} l^{2} n^{2} {\epsilon_{n}^{2}}} e^{C_{3}n \log k}}{(\delta\epsilon_{n})^{k^{2}}C(\beta_{1},\beta_{2},\delta,\epsilon_{n})}\right\}, $$

(4.29)

where $C(\beta _{1},\beta _{2},\delta ,\epsilon _{n})=\left (\text {Beta}(\beta _{1},\beta _{2})\right )^{-k^{2}}\left (\psi _{\delta ,\epsilon _{n}}(\beta _{1},\beta _{2})\right )^{k^{2}}>0$. Now, for every possible choice of the pair (β₁, β₂), we note that $\log C(\beta _{1},\beta _{2},\delta ,\epsilon _{n})\sim k^{2}$. For instance, suppose β₁ ≥ 1, β₂ ≥ 1. Then, as δ ∈ (0,1/2) is fixed and $\epsilon _{n}\rightarrow 0$ as $n\rightarrow \infty $, $\log C(\beta _{1},\beta _{2},\delta ,\epsilon _{n})=-k^{2} \log \text {Beta}(\beta _{1},\beta _{2})+(\beta _{1}+\beta _{2}-2)k^{2}\log (\delta (1-\epsilon _{n}/2))\sim k^{2}$. Therefore, for $n^{2}{\epsilon _{n}^{2}} = k^{2} \{\log n + \log (\delta ^{-1})\} + n \log k$, $\log C(\beta _{1},\beta _{2},\delta ,\epsilon _{n})=o(n^{2}{\epsilon _{n}^{2}})$ as $n\rightarrow \infty $. Thus, choosing a large enough constant M > 0 (depending on (β₁, β₂)), the sum in Eq. 4.29 converges to zero as $n\rightarrow \infty $, which concludes the argument.
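To illustrate the claim $\log C(\beta _{1},\beta _{2},\delta ,\epsilon _{n})=o(n^{2}{\epsilon _{n}^{2}})$ in the case β₁ ≥ 1, β₂ ≥ 1, the sketch below evaluates the ratio $|\log C|/(n^{2}{\epsilon _{n}^{2}})$ for growing n with k held fixed. The helper names and the values k = 3, δ = 0.2, (β₁, β₂) = (2, 3) are illustrative assumptions, not choices made in the article.

```python
import math


def log_beta(b1, b2):
    """Logarithm of the Beta function via log-gamma."""
    return math.lgamma(b1) + math.lgamma(b2) - math.lgamma(b1 + b2)


def squared_rate(n, k, delta):
    """n^2 eps_n^2 = k^2 {log n + log(1/delta)} + n log k."""
    return k ** 2 * (math.log(n) + math.log(1.0 / delta)) + n * math.log(k)


def log_c(n, k, delta, b1, b2):
    """log C in the case b1 >= 1, b2 >= 1, following the displayed formula."""
    if b1 < 1 or b2 < 1:
        raise ValueError("this sketch only covers b1 >= 1 and b2 >= 1")
    eps_n = math.sqrt(squared_rate(n, k, delta)) / n
    return (-k ** 2 * log_beta(b1, b2)
            + (b1 + b2 - 2) * k ** 2 * math.log(delta * (1.0 - eps_n / 2.0)))


def ratio(n, k=3, delta=0.2, b1=2.0, b2=3.0):
    """|log C| relative to n^2 eps_n^2 (all defaults are illustrative)."""
    return abs(log_c(n, k, delta, b1, b2)) / squared_rate(n, k, delta)


if __name__ == "__main__":
    for n in (10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5):
        print(n, ratio(n))
```

For fixed k the numerator stays of order k² while the denominator grows with n, so the printed ratios should shrink toward zero.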

A.2 Proof of Theorem 4.4

Observe that the posterior distribution in the case of undirected networks can be written as

$$ \begin{array}{@{}rcl@{}} {\varPi}_{n}(U_{n} \mid A) = \frac{ {\int}_{U_{n}} \prod{\prod}_{1\leq i < j \leq n} \frac{ f_{\theta_{ij}}(A_{ij}) }{ f_{\theta_{ij}^{0}}(A_{ij}) } p(dz, dQ) }{ {\int}_{{\varTheta}_{k}} \prod{\prod}_{1\leq i < j \leq n} \frac{ f_{\theta_{ij}}(A_{ij}) }{ f_{\theta_{ij}^{0}}(A_{ij}) } p(dz, dQ) }. \end{array} $$

(4.30)

Observe that the discrepancy measure in Eq. 4.8 can also be written as

$$ \begin{array}{@{}rcl@{}} \frac{1}{n^{2}} \underset{1\leq i < j \leq n}{\sum\sum} (\hat{\theta}_{ij} - \theta_{ij}^{0})^{2} = \frac{1}{2n^{2}} \left\Vert\hat{\theta} - \theta^{0}\right\Vert^{2} \end{array} $$

for $\hat{\theta }, \theta ^{0} \in {{\varTheta }_{k}^{u}}$, defined in Eq. 4.9. Hence, it is straightforward to obtain versions of Lemmata 4.5, 4.6, 4.9 and 4.10 as well as Corollary 4.8 for parameters $\theta \in {{\varTheta }_{k}^{u}}$. The conclusion then follows by replicating the arguments in Eqs. 4.21–4.25.
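The identity between the sum over pairs i < j and half of the squared matrix norm can be checked numerically. The sketch below assumes symmetric matrices whose diagonal entries contribute nothing to the difference (here they are set to zero, as with no self-loops) and reads the norm as the Frobenius norm; both are assumptions made for illustration, and all function names are hypothetical.

```python
import random


def symmetric_matrix(n, rng):
    """Random symmetric n x n matrix with zero diagonal (illustrative)."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = rng.random()
    return m


def upper_triangle_loss(theta_hat, theta0):
    """(1 / n^2) * sum over i < j of squared entrywise differences."""
    n = len(theta0)
    total = sum((theta_hat[i][j] - theta0[i][j]) ** 2
                for i in range(n) for j in range(i + 1, n))
    return total / n ** 2


def half_frobenius_loss(theta_hat, theta0):
    """(1 / (2 n^2)) * squared Frobenius norm of the difference."""
    n = len(theta0)
    total = sum((theta_hat[i][j] - theta0[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (2 * n ** 2)


if __name__ == "__main__":
    rng = random.Random(1)
    a, b = symmetric_matrix(6, rng), symmetric_matrix(6, rng)
    print(upper_triangle_loss(a, b), half_frobenius_loss(a, b))
```

The two printed values should agree up to floating-point rounding, because each off-diagonal pair is counted twice in the full matrix norm.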

A.3 Additional Simulation Results

Below we present an additional small-scale simulation study in which we simulate 100 replicates of an SBM network with k = 3 and 5 equi-sized communities, n = 30, 60 and 90, and ρ = 0.3, 0.5. We summarize these additional results in Tables 3 and 4 below.
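For readers who want to reproduce a setup of this kind, the following is a minimal sketch of drawing one SBM network with equi-sized communities. It is not the authors' simulation code: the block probability matrix `Q`, the exclusion of self-loops, and the function names are illustrative assumptions, and the parameter ρ is not modelled here because its role is defined elsewhere in the article.

```python
import random


def equal_sized_labels(n, k):
    """Assign n nodes to k communities of equal size (n must be divisible by k)."""
    if n % k != 0:
        raise ValueError("n must be a multiple of k for equi-sized communities")
    return [i // (n // k) for i in range(n)]


def simulate_sbm(labels, Q, rng, directed=True):
    """Draw A_ij ~ Bernoulli(Q[z_i][z_j]) for i != j; self-loops are excluded by assumption."""
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if directed or i < j:
                A[i][j] = 1 if rng.random() < Q[labels[i]][labels[j]] else 0
                if not directed:
                    A[j][i] = A[i][j]  # mirror the edge for undirected graphs
    return A


if __name__ == "__main__":
    rng = random.Random(0)
    labels = equal_sized_labels(30, 3)
    # Hypothetical block probabilities: denser within communities than between.
    Q = [[0.8, 0.2, 0.2], [0.2, 0.8, 0.2], [0.2, 0.2, 0.8]]
    A = simulate_sbm(labels, Q, rng)
    print(sum(map(sum, A)), "edges among", len(labels), "nodes")
```

Repeating the call with different seeds would give independent replicates under these assumed settings.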

Table 3 MSE (× 10²) and standard error (× 10³) comparison over 100 replicates

Table 4 Rand Index and standard error (× 10³) comparison over 100 replicates
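As a reminder of how a Rand index between an estimated and a true community assignment is typically computed, here is a minimal sketch using the standard pair-counting definition. Whether the article uses exactly this unadjusted form is an assumption, and the function name is hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of node pairs on which two clusterings agree.

    A pair agrees if both clusterings put the two nodes together,
    or both put them apart. Label names themselves do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two nodes")
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(rand_index(truth, [1, 1, 0, 0]))  # same partition, relabelled
    print(rand_index(truth, [0, 1, 0, 1]))  # a different partition
```

Because the index depends only on which pairs are grouped together, it is invariant to permutations of the community labels, which suits block models where labels are identifiable only up to relabelling.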

Cite this article

Ghosh, P., Pati, D. & Bhattacharya, A. Posterior Contraction Rates for Stochastic Block Models. Sankhya A 82, 448–476 (2020). https://doi.org/10.1007/s13171-019-00180-5

Received: 15 December 2018
Published: 14 October 2019
Issue Date: August 2020
DOI: https://doi.org/10.1007/s13171-019-00180-5
