1 Introduction

Acyclic digraphs or directed acyclic graphs (DAGs) comprise a collection of vertices or nodes linked by directional arrows. DAGs can be used to represent the structure of a popular class of probabilistic graphical models (Lauritzen 1996) known as Bayesian networks (Neapolitan 2004). These in turn represent probabilistic relationships between a large number of random variables, in short multivariate probability distributions, with the appealing property that conditional independence properties are encoded and can be read directly from the graph. They are widely used in many fields of applied statistics, with especially important applications in biostatistics such as the learning of epistatic relationships (Jiang et al. 2011). Triggered by the seminal paper of Friedman et al. (2000), Bayesian networks have become a popular tool for reverse engineering gene regulatory networks from large-scale expression data (Emmert-Streib et al. 2012). Inferring the structure as well as the parameters of a model is particularly important to shed some light on the mechanisms driving biological processes. The estimation of DAGs or their equivalence classes is a hard problem and the development of methods for their efficient reconstruction from data is a very active field of research: a recent review is given by Daly et al. (2011), while some new methodological developments for estimating high dimensional sparse DAGs by constraint-based methods are discussed by Colombo et al. (2012) and Kalisch and Bühlmann (2007).

For simulation studies aimed at assessing the performance of learning algorithms which reconstruct a graph from data, it is crucial to be able to generate uniform samples from the space of DAGs so that any structure related bias is removed. The only currently available method relies on the construction of a Markov chain whose properties ensure that the limiting distribution is uniform over all DAGs with a given number of vertices n. The strategy is based on a well known idea first suggested by Madigan and York (1995) as a Markov chain Monte Carlo (MCMC) scheme in the context of Bayesian graphical models to sample from the posterior distribution of graphs conditional on the data. A specific algorithm for uniform sampling of DAGs was first provided by Melançon et al. (2001), with the advantage over the standard MCMC scheme of not requiring the evaluation of the sampled graphs’ neighbourhood, at the expense of slower convergence. The method was later extended (Ide and Cozman 2002; Ide et al. 2004; Melançon and Philippe 2004) to limit the sampling to restricted sets of DAGs. An R implementation was also recently provided (Scutari 2010).

Since Markov chain based algorithms pose non-negligible convergence and computational issues, in practice random upper or lower triangular adjacency matrices are often sampled to generate random ensembles for simulation studies (as for example implemented in the pcalg R package of Kalisch et al. 2012). Sampling over the space of triangular matrices, however, does not provide uniformly distributed graphs on the space of DAGs and is, for example, likely to perform poorly when generating starting points for hill-climbing algorithms or slowly converging Markov chains. In fact, due to the non-uniformity, the risk of remaining within a small neighbourhood of certain graphs, and hence of exploring the space inefficiently, is increased. Likewise, uniform sampling allows the correct evaluation of structure learning algorithms. Finally, a uniform sample is essential when evaluating the prevalence of certain features in a population, in order to count the relative frequency of structural features in DAGs, which provides insight into the DAG landscape. This was recently pursued for example in Borboudakis and Tsamardinos (2012), to assign causal or associative structural priors in search and score algorithms.

In this work we therefore discuss a sampling strategy based on the recursive enumeration of DAGs, where no explicit listing is required and which follows directly from a combinatorial result of Robinson (1970, 1977). An algorithm for generating uniform DAGs easily follows from the enumeration formula, which was also noted in an earlier technical report version (Melançon et al. 2000) of Melançon et al. (2001), where the method was thought computationally impractical and its complexity overestimated with respect to the Markov chain approach. Explicit evaluation of both their complexities allows us to show that the enumeration method is actually the less expensive of the two. After re-establishing the practical convenience of enumeration based sampling, our main contribution is twofold. First we exploit the asymptotic behaviour of DAGs to derive an extremely fast implementation for approximate (but highly accurate) sampling, which has the same (minimal) complexity as sampling triangular matrices. Very large DAGs can thus in effect be sampled uniformly.

Taking a different direction we also devise a new hybrid MCMC method which is orders of magnitude faster than current MCMC methods but which shares their advantage of being easily adaptable to include restrictions or variations. For example we can use this method to restrict the number of parents each vertex has or to weight the edges with a certain probability as is often performed when sampling triangular matrices. The new hybrid MCMC method therefore offers a reasonably efficient alternative while avoiding bias by sampling from the correct space.

2 Acyclic digraphs

A directed graph or digraph on n vertices has up to one directed edge between each pair of vertices, from node i to node j with \(1\le i,j\le n\). Acyclic digraphs or DAGs are those which admit no cycles, so that there are no paths along the directed edges from a node to itself. The total number \(a_n\) of labelled DAGs with n nodes, which is sequence A003024 in Sloane (2013), and its asymptotic behaviour are respectively (Robinson 1970, 1973; Stanley 1973)

$$ a_n=\sum_{k=1}^{n}(-1)^{k+1} \binom{n}{k}2^{k(n-k)}a_{n-k}, \quad a_n\sim \frac{n!2^{L}}{Mq^n}, $$
(1)

with \(L=\binom{n}{2}=n(n-1)/2\) and where M=0.574… and q=1.48… are constants. The number of unlabelled DAGs was found later (Robinson 1977) and is recorded as sequence A003087 in Sloane (2013).

The directed edges of a DAG can be represented as entries with value 1 of a vertex adjacency matrix with remaining entries 0. Being acyclic, the matrix can be made lower triangular by relabelling the nodes, and this observation lies behind the proof of McKay et al. (2004) that the number of \(n\times n\) (0,1)-matrices with only positive real eigenvalues is also \(a_n\). Hence DAGs can be easily generated by sampling each of the L elements of a lower triangular matrix from a Bernoulli distribution with probability p=1/2 and then permuting the n labels. The graphs so obtained however are non-uniform on the space of DAGs since several permutations may correspond to the same graph. For example, the DAG with no arcs (the 0 matrix) would be counted n! times, while those with the most arcs (when all L lower triangular elements are 1) only once. Choosing p≠1/2 may reduce the overall non-uniformity but not remove it entirely, because the overcounting is sensitive to the structure of the DAG.
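As a concrete illustration of this commonly used (but biased) strategy, the following minimal sketch fills a lower triangular matrix with Bernoulli draws and permutes the labels; the function and variable names are ours and not part of any of the cited packages.

```python
# Non-uniform baseline: sample a permuted lower triangular adjacency matrix.
# Illustrative sketch only; names are ours.
import numpy as np

def sample_dag_triangular(n, p=0.5, seed=None):
    rng = np.random.default_rng(seed)
    adj = np.zeros((n, n), dtype=int)
    for i in range(1, n):                 # the L = n(n-1)/2 lower entries
        for j in range(i):
            adj[i, j] = rng.random() < p  # Bernoulli(p) arc indicator
    perm = rng.permutation(n)             # relabel the nodes
    return adj[np.ix_(perm, perm)]
```

As noted above, the empty DAG is produced by all n! permutations of the zero matrix while the densest DAGs correspond to a single permutation each, which is the source of the bias.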

3 Markov chain method

The algorithm of Melançon et al. (2001) starts with a given DAG, with no arcs say, and proceeds by repeatedly sampling an ordered pair of vertices from the n available. If an arc already joins the pair in that direction it is deleted, otherwise it is added as long as the graph remains acyclic. Addition and deletion of an arc are inverse operations so that the resulting transition matrix T is symmetric. As adding an arc to any pair (i,i) would immediately create a cycle, the probability of staying with the same graph is nonzero and at least 1/n. One could exclude all but one such pair from the sampling to reduce this probability and marginally improve the convergence of the chain. Finally a path exists between any pair of DAGs since deleting all the arcs in turn leads to the empty DAG and the steps are reversible. These three properties ensure that the stationary distribution of the Markov chain is uniform over the space of graphs and that an (approximately) uniform DAG can be generated by running the chain long enough.

The Markov chain method is very elegant, as no particular knowledge of DAGs is required apart from checking whether the graphs remain acyclic, which can be done in a time of the order between n and the number of edges (typically around \(n^2\)) and on average in a time of order \(n\log(n)\) (Alon and Rodeh 1978; Melançon et al. 2001). In fact when an edge is added, one only needs to check if cycles are added downstream, which speeds up the calculation but does not reduce its complexity.
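A single transition of this chain can be sketched as follows; the reachability test corresponds to the downstream cycle check mentioned above, and the function names are ours rather than those of any published implementation.

```python
# One step of the add/delete Markov chain of Melancon et al. (2001); sketch only.
import numpy as np

def creates_cycle(adj, i, j):
    # adding the arc i -> j creates a cycle iff j already reaches i
    seen, stack = set(), [j]
    while stack:
        v = stack.pop()
        if v == i:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(np.flatnonzero(adj[v]).tolist())
    return False

def markov_step(adj, rng):
    i, j = rng.integers(adj.shape[0]), rng.integers(adj.shape[0])
    if i == j:
        return adj                        # (i,i) pairs: the chain stays put
    if adj[i, j]:
        adj[i, j] = 0                     # delete an existing arc
    elif not creates_cycle(adj, i, j):
        adj[i, j] = 1                     # add the arc if acyclicity is preserved
    return adj
```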

3.1 Non-uniformity of the irreducible chain

The main issue for the applicability of the method is how quickly the Markov chain converges. The longest path is between a DAG with the most arcs (L) and the same graph with all the edges reversed. Therefore after 2L steps we obtain an irreducible transition matrix \(U=T^{2L}\), whose elements though are far from uniform. To illustrate this non-uniformity, we consider the extreme elements of U. Moving along the longest path, consider first the transition probability of going from any DAG with L arcs to the empty DAG with none. Removing all the arcs in L steps has a probability of \(L!/n^{2L}\). Adding arcs to arrive at the DAG with all the arcs reversed has the same probability due to the symmetry. The transition probability of the single path moving through the empty DAG is then \((L!)^{2}/n^{4L}\). Other paths exist as the L additions and deletions can be ordered in \(\binom{2L-2}{L-1}\) ways, with the first and the last step fixed as an addition and deletion respectively. The probability of moving along the other paths corresponding to different orderings is however lower. Hence the minimum transition probability, which is the minimum element of U, is

$$ < \frac{(L!)^2}{n^{4L}} \binom{2L-2}{L-1} = \frac {L^2(2L-2)!}{n^{4L}} < \frac{(2L)!}{2 n^{4L}}, $$
(2)

which is actually \(< 1/a_{n}^{2}\), much less than the required uniform value of \(1/a_n\). For comparison, the ratio between the probabilities of the most and least likely DAGs, which can be interpreted as a measure of non-uniformity, deriving from sampling triangular matrices is only n!.

This non-uniformity was overlooked in Melançon et al. (2001), leading them to underestimate the complexity of the Markov chain method. To obtain an approximately uniform distribution the Markov chain needs to be run long enough (say 2Lk steps) so that the elements of the resulting transition matrix \(U^{k}\) are close on a scale well below \(1/a_n\). This means we require \(\vert(U^{k})_{ij}-1/a_n\vert\ll 1/a_n\), \(\forall\, i,j\) or more accurately

$$ \bigl[a_n \cdot\bigl(U^{k} \bigr)_{ij} \bigr]^{\pm1} -1 \ll1, $$
(3)

where we take the positive power if the matrix element is larger than the uniform background and the negative power if it is smaller. To see why we need such a restriction, consider for example the case where \(\exists\, i,j\colon [a_n\cdot(U^{k})_{ij}]^{\pm1}\approx2\); then the corresponding DAG will be sampled about twice or half as often as it should be and the sample cannot yet be considered uniform. This uniformity scale becomes finer as n increases and leads to an additional dependence of k on n which increases the complexity of the algorithm.

3.2 Speed of convergence and complexity

The difference between the maximum and minimum of the elements of \(U^{k}\) decays at least exponentially, ∼exp(−αk), with a rate α. Proving lower bounds for the rate α is extremely hard and a generally known one, given by twice the minimum entry of U, is not enough to ensure that the algorithm is useful in practice since (2) means that k would need to be much larger than \(a_n\) for convergence. Instead, using the spectral decomposition as pursued in Appendix A, one can express a lower bound in terms of the largest eigenvalue of U below the one at 1. More importantly, each element of U must approach uniformity at least at this rate. For the maximum element of U for example we have

$$ U^k_{\mathrm{max}}-\frac{1}{a_n} \leq C_{\mathrm{max}} \exp(- \alpha k) $$
(4)

in terms of some constant \(C_{\mathrm{max}}\) which is probably of order 1 since the elements of U must satisfy \(0<U_{ij}<1\), and certainly less than \(a_n\) as detailed in Appendix A. The condition in (3) is guaranteed to be satisfied if

$$ a_nC_{\mathrm{max}}\exp(-\alpha k) \ll1. $$
(5)

Taking the logarithm in (5), it follows that αk of order \(n^2\) is sufficient to compensate for the quadratic growth of \(\log(a_n)\). The more detailed consideration in Appendix A suggests that in fact αk has to be at least this order for convergence.

What matters next for the speed of convergence is how the eigenvalues of U and α depend on n. For DAGs with 2 or 3 nodes, T can be easily filled out and we find α=0.575 and α=0.594 respectively. By removing all but one of the arcs (i,i) from the edge sampling, we can improve this to α=0.811 and α=0.776 instead. The behaviour should be verified, but as long as α does not decrease asymptotically with n, a uniformly sampled DAG can be obtained by throwing away the first \(O(n^4)\) steps of the Markov chain. Analogously \(O(n^4)\) steps should also be discarded between further samples to ensure they are independent. These conditions should be observed when using implementations like that in Scutari (2010). As each step may involve checking for acyclicity, the entire algorithm then becomes approximately \(O(n^5\log(n))\) which limits the size of graphs that can be sampled.

A marginal improvement in the speed of convergence could be achieved by starting each Markov chain on DAGs from a randomly drawn permuted triangular matrix.

4 Enumeration method

In order to obtain a uniform sample directly, we return to the enumeration of labelled DAGs detailed in Robinson (1970, 1977). Nodes with no incoming arcs are called ‘outpoints’ and can be used to further classify and recursively enumerate DAGs. Due to the acyclicity each DAG has at least one outpoint. Let \(a_{n,k}\) be the number of labelled DAGs with n nodes of which k are outpoints (\(1\le k\le n\)). Removing the latter and the arcs originating from them leaves a smaller DAG with \(m=n-k\) nodes, with s outpoints say. An example with n=5, k=3 and s=1 can be seen at the start of Fig. 1.

Fig. 1

Removing the outpoints from a DAG (and all their outgoing arcs), a smaller DAG is created until only outpoints remain. For the DAG depicted on the left, one can first remove 3 outpoints, then 1 to be left with a single one. Reversing the process by adding new outpoints and recording the number of possible ways of adding new edges and permuting the node labels leads to the recursive formula in (6)

Reversing the process, the required DAGs can be built by adding k new outpoints and allowing all the possible connections from them to the previous nodes. For each of the \(m-s\) non-outpoint nodes there are \(2^{k}\) possibilities of having an arc or not. Each of the s old outpoints must be connected to at least once, giving a factor of \(2^{k}-1\). Finally the labels can be rearranged in \(\binom{n}{k}\) ways, giving the recursions (Robinson 1970, 1977)

$$ a_{n,k}=\binom{n}{k}b_{n,k},\quad b_{n,k}=\sum_{s=1}^{m} \bigl(2^k-1\bigr)^{s}2^{k(m-s)}a_{m,s}, $$
(6)

with \(a_{n,n}=1\) counting the DAGs with no arcs and \(b_{n,k}\) auxiliary quantities useful later, also with the initialisation \(b_{n,n}=1\). The total number of DAGs with n nodes is \(a_{n}=\sum_{k=1}^{n}a_{n,k}\) which can be turned into the more compact form in (1) by means of the inclusion-exclusion principle. The separation in terms of outpoints however is the key to sampling labelled DAGs uniformly.

4.1 Recursively generating the outpoints

The first step in the algorithm is to compute all the integers \(a_{n,k}\) (and \(b_{n,k}\)) as well as the totals \(a_n\). From (1) it follows that the number of bits needed to store them grows like L, while 2n+1 integers are needed for each n (since \(1\le k\le n\)). Finally each of them is obtained by adding an order of n smaller integers (multiplied by the binomial function) so that the process becomes \(O(n^5)\) as detailed in Appendix B. However, they only need to be calculated once and then stored for sampling arbitrarily many DAGs. Later we also discuss how to avoid their explicit evaluation.
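This first step translates directly into a few lines of code; the sketch below (our notation, with memoisation and arbitrary precision integers) is the precomputation that the complexity estimates of this subsection refer to.

```python
# Recursion (6): a_{n,k} labelled DAGs on n nodes with k outpoints,
# b_{n,k} the auxiliary sums; sketch with our function names.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def b(n, k):
    m = n - k
    if m == 0:
        return 1                                  # b_{n,n} = 1
    return sum((2**k - 1)**s * 2**(k * (m - s)) * a(m, s)
               for s in range(1, m + 1))

@lru_cache(maxsize=None)
def a(n, k):
    return comb(n, k) * b(n, k)                   # a_{n,k} in (6)

def a_total(n):
    return sum(a(n, k) for k in range(1, n + 1))  # a_n

# a_total(5) returns 29281, in agreement with the example of Sect. 4.3
```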

The second step is, given n, to sample the number of outpoints. An integer is drawn between 1 and \(a_n\), for example by sampling from a Bernoulli\((\frac{1}{2})\) the digits of a binary string long enough to represent \(a_n\) (if the resulting number is larger than \(a_n\) the string is rejected and resampled, which happens with probability less than a half). The chosen number of outpoints is the minimum k for which the partial sum \(\sum_{i=1}^{k}a_{n,i}\) is larger than or equal to the sampled integer.

The final DAG will be obtained by connecting the k outpoints to a smaller one of size \(m=n-k\), whose number of outpoints can be taken to be the minimum s for which the partial sum \(\sum_{i=1}^{s}(2^{k}-1)^{i}2^{k(m-i)}a_{m,i}\) is larger than or equal to a randomly drawn integer between 1 and \(b_{n,k}\). Similarly a DAG with m nodes and s outpoints can be sampled by drawing an integer between 1 and \(b_{m,s}\). Let \(k_i\) denote the number of outpoints removed at each step, which can so be recursively sampled until their sum \(\sum_{i=1}^{I} k_{i} =n\). In total one needs to sample and manipulate \(I\le n\) integers whose size in bits is at most of order L, so that generating the sequence is \(O(n^3)\) as detailed in Appendix B.

Though for simplicity we sampled an integer at each step above, the original one between 1 and \(a_n\) could equally be reused to obtain the entire sequence of outpoints. In fact, this can be computationally more efficient than resampling at each step and more directly follows the mapping between the integers and the DAGs. For example, given the original integer r sampled uniformly between 1 and \(a_n\), we first find the minimal k so that \(1\leq r-\sum_{i=1}^{k-1}a_{n,i} \leq a_{n,k}\). The corresponding integer between 1 and \(b_{n,k}\) is then

$$ \biggl\lceil \frac{r-\sum_{i=1}^{k-1}a_{n,i}}{\binom{n}{k}} \biggr\rceil , $$
(7)

where we round up to the next integer. As the numerator was already calculated when finding k we can see why this can be cheaper than resampling. Repeating these steps with the next partial sums gives the full sequence of outpoints.
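Reusing a single integer in this way might look as follows, building on the a() and a_total() helpers of the sketch above (again our names, not the pseudocode of Appendix C):

```python
# Sample the outpoint sequence k_1, k_2, ... from one uniform integer r,
# peeling off partial sums of (6) and rescaling as in (7); sketch only.
import random
from math import comb

def ceildiv(x, y):                       # exact integer ceiling, avoids float overflow
    return -(-x // y)

def sample_outpoint_sequence(n, rng=random, r=None):
    if r is None:
        r = rng.randint(1, a_total(n))   # uniform integer identifying the DAG
    k = 1
    while r > a(n, k):                   # minimal k from the partial sums of a_{n,k}
        r -= a(n, k)
        k += 1
    seq = [k]
    q = ceildiv(r, comb(n, k))           # corresponding index in 1..b_{n,k}, Eq. (7)
    m = n - k
    while m > 0:
        s = 1
        term = (2**k - 1)**s * 2**(k * (m - s)) * a(m, s)
        while q > term:                  # partial sums of the terms of b_{n,k} in (6)
            q -= term
            s += 1
            term = (2**k - 1)**s * 2**(k * (m - s)) * a(m, s)
        seq.append(s)
        # rescale onto 1..b_{m,s} and recurse on the smaller DAG
        q = ceildiv(q, (2**k - 1)**s * 2**(k * (m - s)) * comb(m, s))
        n, k, m = m, s, m - s
    return seq

# sample_outpoint_sequence(5, r=28405) returns [3, 1, 1] as in Sect. 4.3
```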

4.2 Reconstructing the DAG

To represent the sampled DAG we fill a null \(n\times n\) adjacency matrix from the top left to obtain a lower triangular matrix. Since the recursion ends when all the remaining nodes are outpoints, the first \(k_I\times k_I\) elements stay zero. These must connect to at least one of the \(k_{I-1}\) newly added outpoints which must not connect to each other, hence the elements of each column from 1 to \(k_I\) in rows \(k_I+1\) to \(k_I+k_{I-1}\) can be sampled from a Bernoulli\((\frac{1}{2})\). Any samples where all the elements in a column are 0 can be rejected columnwise (alternatively a direct binomial sampling with this case excluded can be performed).

When adding an additional \(k_{I-2}\) nodes, the elements in columns 1 to \(k_I\) and rows \(k_I+k_{I-1}+1\) to \(k_I+k_{I-1}+k_{I-2}\) are again sampled from a Bernoulli\((\frac{1}{2})\), while for each column \(k_I+1\) to \(k_I+k_{I-1}\) we need to ensure that not all new entries are 0. Completing the adjacency matrix is a step of order \(n^2\) since up to L integers need to be sampled. The final sampled DAG can be obtained by permuting the labels only at the end, rather than at each step as suggested by (6), since each permutation must be equally likely. Although different permutations may still correspond to the same DAG, the procedure of Sect. 4.1 for drawing the number of outpoints at each step is weighted so that the resulting sample is perfectly uniform.

The procedure is simply a one-to-one mapping between the integers 1 to \(a_n\) and the DAGs with n nodes. In principle the originally sampled integer r uniquely identifies the edges and node permutations. For example the remainder when dividing by \(\binom{n}{k}\) in (7) can be mapped to the permutation of the k nodes. It is simpler to ignore the remainders and their mappings while drawing the sequence of outpoints and just sample the edges and permutation at the end. Since each set of compatible edges and node permutations is equally likely given the sequence of outpoints, the sample remains uniform. The recursions weight the choice of the number of outpoints removed at each step so that the probability of selecting each sequence is proportional to the total number of corresponding DAGs.

A detailed sampling procedure is provided in Appendix C in the form of pseudocode.
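For orientation, a compact sketch of the reconstruction step (our naming, not the pseudocode of Appendix C) builds the lower triangular matrix block by block and permutes the labels at the end:

```python
# Reconstruct an adjacency matrix from the outpoint sequence [k_1, ..., k_I];
# adj[i, j] = 1 encodes an arc from node i to node j.  Sketch only.
import numpy as np

def reconstruct_dag(seq, seed=None):
    rng = np.random.default_rng(seed)
    n = sum(seq)
    adj = np.zeros((n, n), dtype=int)
    blocks = list(reversed(seq))          # start from the last k_I outpoints removed
    row, prev_lo, prev_hi = blocks[0], 0, blocks[0]
    for k in blocks[1:]:
        for i in range(row, row + k):     # free arcs to nodes below the previous block
            adj[i, :prev_lo] = rng.integers(0, 2, size=prev_lo)
        for j in range(prev_lo, prev_hi): # each old outpoint must receive >= 1 arc
            col = rng.integers(0, 2, size=k)
            while not col.any():          # reject all-zero columns
                col = rng.integers(0, 2, size=k)
            adj[row:row + k, j] = col
        prev_lo, prev_hi, row = row, row + k, row + k
        # entries between outpoints added at the same step are left at 0
    perm = rng.permutation(n)             # finally relabel the nodes uniformly
    return adj[np.ix_(perm, perm)]
```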

4.3 Illustration of recursive sampling

The method is best illustrated with an example. For n=5, Table 1 lists the quantities \(a_{5,k}\). An integer up to \(a_{5}=29\,281\) is drawn uniformly, say r=28 405. Since r is larger than \(a_{5,1}\) this is subtracted to obtain 11 520, bigger in turn than \(a_{5,2}\). A further subtraction leaves 810, smaller than or equal to \(a_{5,3}\). The desired DAG is the 810th with 3 outpoints out of 5. Dividing by the \(\binom{5}{3}=10\) ways of permuting the new outpoint labels means that the sampled DAG corresponds to the 81st of a list of \(b_{5,3}=161\). Also from (6)

$$ b_{5,3} = 56 a_{2,1} + 49 a_{2,2}, $$
(8)

where \(a_{2,1}=2\) and \(a_{2,2}=1\). Since 81≤56×2, a DAG with one outpoint is selected. In fact, given that \(a_{2,1}=2b_{2,1}\) and ⌈81/112⌉=1, the desired DAG is the first and only one of \(b_{2,1}\). Removing the outpoint, a single node remains and the sampled sequence is

$$ k_1 = 3, \qquad k_2 = 1, \qquad k_3 = 1. $$
(9)

Sampling any of the integers between 27 596 and 28 715 would result in the same sequence of outpoints.

Table 1 The number of DAGs with 5 nodes and k outpoints: \(a_{5,1}=16\,885\), \(a_{5,2}=10\,710\), \(a_{5,3}=1610\), \(a_{5,4}=75\), \(a_{5,5}=1\), summing to \(a_{5}=29\,281\)

To avoid confusion in Fig. 2, the permutation of the node labels is drawn before reconstructing the edges. Let the permutation be

$$ \pi= \left ( \begin{array}{c@{\quad }c@{\quad }c@{\quad }c@{\quad }c} 1 & 2 & 3 & 4 & 5 \\ 2 & 4 & 1 & 3 & 5 \end{array} \right ), $$
(10)

in the two-line notation. Next a corresponding lower triangular adjacency matrix can be sampled

$$ \begin{array}{c|ccccc} & \;\;2\; & \;4\; & \;1\; & \;3\; & \;5\; \\ \hline \;2\; & 0 & 0 & 0 & 0 & 0\\ \;4\; & \mathbf{1} & 0 & 0 & 0 & 0\\ \;1\; & 1 & \mathbf{0} & 0 & 0 & 0\\ \;3\; & 1 & \mathbf{1} & 0 & 0 & 0\\ \;5\; & 1 & \mathbf{1} & 0 & 0 & 0 \end{array} . $$
(11)

Let the 5×5 matrix inside the solid lines be denoted by Q. The lower elements can be drawn uniformly as either 1 or 0, with certain restrictions. First, not all the elements in bold in each column can be 0. They correspond to possible connections between the old and new outpoints at adjacent steps in the sequence of outpoints. Graphically the sampling of the matrix in (11) can be represented by the recursive addition of outpoints as in Fig. 2.

Fig. 2

Once a sequence of outpoints has been drawn, for example as in (9), the edges can be sampled by recursively adding new outpoints. Here the labelling of the nodes respects the sampled permutation in (10). Starting with one outpoint from \(k_3\), when the next one from \(k_2\) is added an arc must also be included. At least one arc must arrive at node 4 from the last three nodes. Here there happened to be two. So far the bold elements in (11) have been sampled. Elements below the bold ones in the matrix in (11) may be drawn without restriction so that node 2 can receive an arbitrary number of arcs from nodes 1, 3 and 5; here all three were sampled. The final sampled DAG is identical to the one at the start of Fig. 1, merely organised by the sequence of outpoints

The starting point is \(k_3=1\) outpoint labelled by 2 because of the permutation in (10). Then \(k_2=1\) new outpoint labelled by 4 is added. Since node 2 must stop being an outpoint it has to receive an arc from node 4 and the corresponding element \(Q_{2,1}=1\). At the next step the remaining \(k_1=3\) are added ensuring that node 4 receives at least one arc. Here it happens to receive 2, corresponding to the second column of Q in (11).

The second restriction is that the lower triangular matrix elements above, or to the right, of the bold ones must all be 0 since no arcs are allowed between outpoints added at the same time. The remaining lower triangular matrix entries below the bold ones can be sampled without restriction. In this case they correspond to arcs from nodes 1, 3 and 5 to node 2. In the example they were all drawn to arrive at the final sampled DAG in Fig. 2, which is just a redrawing of the first DAG in Fig. 1.

To complete the example, the lower triangular matrix in (11) can be rearranged to standard ordering using \(R_{\pi(m),\pi(l)}=Q_{m,l}\), leading to the adjacency matrix

$$ R = \left ( \begin{array}{c@{\quad }c@{\quad }c@{\quad }c@{\quad }c} 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 1 & 0 \end{array} \right ), $$
(12)

which is also the adjacency matrix of the first DAG in Fig. 1.

4.4 Computational time and space complexity

As the different parts of the algorithm are performed sequentially, the complexity is determined by the first step, which is the most expensive. A uniform DAG can therefore be sampled from scratch in \(O(n^5)\), marginally better than the \(O(n^5\log(n))\) required by the Markov chain algorithm. Once the integers in (6) have been computed and stored, each subsequent perfectly uniformly sampled DAG is obtained in only \(O(n^3)\). This is significantly better than the Markov chain algorithm which requires the same \(O(n^5\log(n))\) for each subsequent approximately independent and uniform DAG. These considerations are important since in most practical cases the aim is to draw a reasonably large sample of DAGs, rather than a single one.

Not only the time complexity is improved, but the computational time is also vastly reduced. In a simple Maple implementation, the first step is rather quick to perform, for example taking under a second for n=100. When reusing the original integer drawn between 1 and a n it takes about a tenth of a second to sample each sequence of outpoints. The final DAG given the sequence is then sampled in less than a twentieth of a second so that the total time for all the steps is still under a second. At least 6 uniformly distributed DAGs can be obtained in a second before optimising the code further. For reference, the computations were all run on a single core of a standard Intel Q9550 processor with 4 GB of RAM.

The R implementation in Scutari (2010) of the Markov chain method can perform about 80 000 steps of the chain per second on the same machine. This means that in the time it takes to sample each DAG perfectly uniformly using the recursive enumeration, the Markov chain only just becomes irreducible (recall that this takes \(2L\approx10^4\) steps). Before it can be reasonably expected that an approximately uniform sample is provided by the chain it needs to run for hundreds of millions of steps (\(100^4=10^8\)) as discussed in Sect. 3 and is therefore tens of thousands of times slower than the enumeration method for n=100. This means going from the order of seconds to the order of hours and given the complexity arguments, things only get worse as n increases.

The time complexity analysis of the first step of the enumeration algorithm, on the other hand, does not account for the space required to store and access all the numbers \(a_{n,k}\) and \(b_{n,k}\) which together require up to about \(n^4/4\) bits. Even with the gigabits of fast memory available on modern computers, the space limit is reached when n gets to several hundred. In fact, as we will see in the next section, fast access is only needed for low k, reducing the order of bits to \(n^3\). The memory space limit is then only reached by DAGs on about a thousand nodes but the time complexity still makes the sampling on such ultra-large spaces unfeasible.

The exact sampling is just a way of mapping between the integers between 1 and \(a_n\) and DAGs, whose superexponential growth as in (1) lies behind the space and time problems for large n. As an indication of this growth, sampling a DAG with just 21 nodes is akin to sampling an atom uniformly from the observable universe. To move to arbitrarily large DAGs, however, we can introduce an approximate method which is both extremely fast and highly accurate.

5 Arbitrarily large DAGs

The number of outpoints sampled at each step is small and most often just 1. In fact the fraction of DAGs on n nodes with k outpoints,

$$ A_k=\lim_{n\to\infty} \frac{a_{n,k}}{a_n}, $$
(13)

converges, as was proved in Liskovets (1976), and to within \(10^{-10}\) by n=20. For n>20, within the uniformity limits of a 32 bit integer sampler, k could be directly sampled from the limiting distribution in Table 2. Given the initial k the next value s in the sequence of outpoints can be drawn from the limiting conditional probabilities

$$ B_{s\mid k}= \biggl(1-\frac{1}{2^k} \biggr)^{s}\frac{A_{s}}{Z_k}, \qquad Z_k = \sum _s \biggl(1-\frac{1}{2^k} \biggr)^{s} A_{s}, $$
(14)

which follow from (6) as detailed in Appendix D. The k-dependent weight further shifts the distribution towards low s so it would be computationally meaningful to store the small number of relevant probabilities. The procedure is iterated until 20 nodes or fewer are left to be sampled, when the tabulated distributions are no longer accurate enough. For n≤20 the exact distributions can be tabulated or one may simply return to the exact enumeration method.

Table 2 The relative occurrence of the number of outpoints in large DAGs, \(A_k\) multiplied by \(10^{10}\). \(A_8\) is approximately \(2.2\times10^{-12}\), so k>7 can be excluded at this level of accuracy
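A sketch of the resulting approximate sampler is given below; since the limit is reached to within \(10^{-10}\) by n=20, the probabilities \(A_k\) are here simply approximated by \(a_{20,k}/a_{20}\) using the exact recursion above rather than hard-coding Table 2, and the exact conditional of (6) is used once 20 or fewer nodes remain (the function names and this shortcut are ours).

```python
# Approximate outpoint sequence for large n via the limits (13) and (14);
# falls back to the exact conditional of (6) for the last <= 20 nodes.
import random

def draw_categorical(probs, rng):
    u, c = rng.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if u < c:
            return i + 1
    return len(probs)                      # negligible tail mass (k > 7)

def limiting_A(nref=20, kmax=7):
    tot = a_total(nref)                    # A_k approximated by a_{nref,k}/a_nref
    return [a(nref, k) / tot for k in range(1, kmax + 1)]

def sample_outpoints_large_n(n, nref=20, rng=random):
    A = limiting_A(nref)
    seq = [draw_categorical(A, rng)]       # first count from the limit A_k
    while n - sum(seq) > nref:             # conditional B_{s|k} of (14)
        k = seq[-1]
        w = [(1 - 2.0**-k)**(s + 1) * A[s] for s in range(len(A))]
        seq.append(draw_categorical([x / sum(w) for x in w], rng))
    while n > sum(seq):                    # exact tail: weights from (6)
        k, m = seq[-1], n - sum(seq)
        w = [(2**k - 1)**s * 2**(k * (m - s)) * a(m, s) for s in range(1, m + 1)]
        tot = sum(w)
        seq.append(draw_categorical([x / tot for x in w], rng))
    return seq
```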

Generating the sequence of outpoints exploiting the limiting behaviour is of order n while reconstructing the DAG remains order \(n^2\). Again these steps are sequential so that a uniform (at scales above \(10^{-10}\)) DAG can be drawn in \(O(n^2)\) allowing the sampling of very large graphs. Sampling a lower or upper triangular matrix of the same size has the same complexity, which is therefore the minimum possible for sampling DAGs. As it is built on the underlying space of DAGs, the procedure above has the advantage over sampling triangular matrices of removing (most of) the bias, while being faster than other potentially unbiased alternatives such as the Markov chain method of Melançon et al. (2001).

For the latter, convergence of the transition matrix to a given level of accuracy is reached in \(O(n^2)\) steps so each sample can be drawn in \(O(n^3\log(n))\), as follows from the discussion in Sect. 3. The probabilities with which different DAGs are sampled however vary wildly, with a more pronounced variability than observed when sampling triangular matrices. In the approximated method above instead, a zero weight is given to the rare event of drawing more than 7 outpoints at any step, resulting in a zero probability for all DAGs with any \(k_i>7\). The rest however are sampled exactly uniformly and again much faster than with the Markov chain method.

The accuracy of the convergence to the limiting distributions of \(A_k\) nearly exactly doubles when doubling n. Uniformity on the scale of a 64 bit integer sampler can be achieved by a more accurately calculated distribution of \(A_k\) for n>40 while returning to the exact enumeration for n≤40.

6 Hybrid MCMC method

Although the approximate method of Sect. 5 will in practice give indistinguishable results from the full enumeration sampler, it has the philosophical problem that certain DAGs will never be sampled. As an alternative, and one which allows restrictions and variations to be incorporated more easily, we propose a hybrid method combining the ideas from the recursive enumeration and Markov chain approaches.

In the enumeration based sampler, most of the effort goes towards sampling the sequence of outpoints \(k_i\) added at each step. As \(\sum_{i=1}^{I}k_{i} =n\), this sequence is just an ordered partition of the number n. An MCMC on the space of partitions can be introduced, where each partition is scored according to the number of DAGs corresponding to it. Denoting each partition by a sequence in square brackets \([k_I,\ldots,k_1]\), to respect the order in Sect. 4, and setting \(a_{[k_{I},\ldots,k_{1}]}\) to be the corresponding number of DAGs, from (6) we have

$$ a_{[k_I,\ldots,k_1]} = \frac{n!}{k_I!\cdots k_1!} \prod _{i=1}^{I-1} \bigl(2^{k_i}-1 \bigr)^{k_{i+1}}\prod_{i=2}^{I-1} 2^{S_{i-1}k_{i+1}}, $$
(15)

with \(S_{i} = \sum_{j=1}^{i}k_{j}\). The first term accounts for the number of ways of organising the labels, the second for the number of ways of connecting edges between adjacent partition elements and the last term for the number of ways of connecting edges between non-adjacent partition elements. Empty products are given the value of 1 so that \(a_{[n]}=1\) counts the single empty DAG with no edges.
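For reference, (15) can be evaluated as follows (our naming; the list convention is \([k_I,\ldots,k_1]\) as above); summing it over all ordered partitions of n recovers \(a_n\), which provides a useful check against the recursion of Sect. 4.

```python
# Number of labelled DAGs whose outpoint decomposition follows the ordered
# partition [k_I, ..., k_1], Eq. (15).  Sketch with our function names.
from math import factorial

def count_dags_of_partition(ks):
    k = ks[::-1]                          # k[i-1] = k_i for i = 1, ..., I
    I = len(k)
    count = factorial(sum(k))
    for ki in k:
        count //= factorial(ki)           # multinomial: label arrangements
    S = [0]                               # S[i] = k_1 + ... + k_i
    for ki in k:
        S.append(S[-1] + ki)
    for i in range(1, I):                 # adjacent elements: (2^{k_i}-1)^{k_{i+1}}
        count *= (2**k[i - 1] - 1)**k[i]
    for i in range(2, I):                 # non-adjacent elements: 2^{S_{i-1} k_{i+1}}
        count *= 2**(S[i - 1] * k[i])
    return count

# e.g. count_dags_of_partition([1, 1, 1]) == 12, and summing over the four
# ordered partitions of 3 gives 25, the total number of DAGs on 3 nodes
```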

6.1 Partitions and binary sequences

There are \(2^{n-1}\) ordered partitions of the integer n, which can be represented by the binary sequences of length (n−1). Each 0 corresponds to starting a new partition element whose size is one larger than the number of 1’s since the previous 0. For example the binary sequence (0,1,1,0,0,1) gives the partition [1,3,1,2] of 7 when treating the brackets as zeros. A Markov chain on the space of partitions can be built by flipping one of the elements of the binary sequence. Changing a 1 to a 0 splits a partition element into two, while changing a 0 to a 1 joins two adjacent partition elements. To introduce some probability of staying still, and ensure aperiodicity, the binary sequence can be extended with an artificial n-th element which does not affect the partition. At each step of the chain one of the n elements is uniformly drawn and flipped. Each partition then has (n−1) neighbours excluding itself and the chain takes (n−1) steps to become irreducible.

The Metropolis-Hastings acceptance probability to move from partition P to a proposed partition \(\hat{P}\) is

$$ \min \biggl(1,\frac{a_{\hat{P}}}{a_P} \biggr), $$
(16)

which involves the number of DAGs in each partition from (15). The resulting MCMC scheme converges to a stationary distribution where the probability of each partition is proportional to the number of DAGs belonging to it.

Only one digit of the binary sequence is flipped at each step. A single partition element is then split or two neighbouring ones are joined depending on the direction of the flip. The partition elements directly involved in the modification and their neighbours are the only ones which can be affected by changes in the edge configuration. Hence the terms corresponding to the unaffected partition elements cancel in the ratio \(a_{\hat{P}}/a_{P}\). For example, given a partition

$$ P=[k_I,\ldots,k_{i+1},k_i,k_{i-1}, \ldots,k_{1}], $$
(17)

if the element \(k_i\) is split into elements c and \(k_i-c\) to propose partition

$$ \hat{P}=[k_I,\ldots,k_{i+1},c,k_i-c,k_{i-1}, \ldots,k_{1}], $$
(18)

the ratio in the acceptance probability becomes

$$ \frac{a_{\hat{P}}}{a_P} = \binom{k_i}{c} \biggl( \frac {1-2^{-c}}{1-2^{-k_i}} \biggr)^{k_{i+1}} \biggl(\frac {1-2^{c-k_i}}{1-2^{-k_{i-1}} \mathrm{I}_{\{ i>1 \} } } \biggr)^{c} 2^{c(k_i-c)}, $$
(19)

depending just on \(k_{i-1}\), \(k_i\), c and \(k_{i+1}\). For convenience \(k_i\) is set to 0 for i outside 1,…,I and an indicator function is included to correct for the case when i=1.
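The ratio (19) is cheap to evaluate; the sketch below (our names, same list convention as before) computes it for a proposed split and can be checked against the exact counts of (15) via count_dags_of_partition.

```python
# Acceptance ratio (19) for splitting element k_i of [k_I, ..., k_1]
# into c and k_i - c; in the list ks, element k_i sits at position ks[-i].
from math import comb

def split_ratio(ks, i, c):
    k = ks[::-1]                           # k[i-1] = k_i
    I = len(k)
    ki = k[i - 1]
    kip1 = k[i] if i < I else 0            # k_{i+1}, taken as 0 outside 1..I
    kim1 = k[i - 2] if i > 1 else 0        # k_{i-1}
    ratio = comb(ki, c)
    ratio *= ((1 - 2.0**-c) / (1 - 2.0**-ki))**kip1
    denom = (1 - 2.0**-kim1) if i > 1 else 1.0     # indicator for i > 1
    ratio *= ((1 - 2.0**(c - ki)) / denom)**c
    return ratio * 2.0**(c * (ki - c))

# e.g. split_ratio([1, 2], 1, 1) gives 4/3, matching
# count_dags_of_partition([1, 1, 1]) / count_dags_of_partition([1, 2]) = 12/9
```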

6.2 Complexity of the MCMC steps

To evaluate the complexity of calculating the acceptance ratio and sampling accordingly, the relative weight of different partitions is first taken into account. In general, partitions with a larger number of smaller partition elements are more likely as they allow more DAGs. Therefore the chain will have a preference to move to and spend more time on them. In fact the most likely partition is the one with all \(k_i=1\) for which the calculation of the ratio in (19) and sampling can be performed in O(1). There is only a small chance of moving to partitions with large enough elements to make the ratio more complex, as noted in Sect. 5 for elements with \(k_i>7\). This should not suffice to increase the average complexity. However for completeness consider the most extreme case when the partition [n] is split into [n/2,n/2] assuming n is even. The ratio in the acceptance probability is then

$$ \frac{a_{[n/2,n/2]}}{a_{[n]}} = \binom{n}{\frac{n}{2}} 2^{\frac {n^2}{4}}, $$
(20)

involving large numbers but clearly larger than 1. The reverse move, which is based on the inverse, can however be performed by iterative sampling as follows. Repeatedly draw from a Bernoulli\((\frac{1}{2})\) and reject the move as soon as 0 is drawn. In the unlikely event that a 1 is drawn a total of \(n^2/4\) times consecutively the procedure moves through the binomial terms. More Bernoulli random variables are then sampled with rate ranging from 1/n to n/(n+2), but stopping and effectively rejecting as soon as a zero is drawn. A move may in principle take up to \(n^2\) steps, but it is on average rejected in just 2 steps. Even in this extreme case the move is still O(1) on average, leading to the conclusion that each move of the chain can be performed with no complexity overhead.

6.3 Speed of convergence

The chain requires (n−1) steps to become irreducible. Hence the speed of convergence is related to the eigenvalues of the transition matrix \(U=T^{n-1}\) with the elements of T corresponding to a single move of the MCMC chain. The difference between the maximum and minimum element of each column of \(U^{k}\) again decays exponentially with a rate α dependent on the largest eigenvalue of U below 1. Since the partition [n] only has one DAG, k of the order \(n^2/\alpha\) is again needed for convergence on the relevant scale as in Sect. 3. To explore how α depends on n consider the explicit values for n=2,…,5, which are found to be α=1.386,0.962,0.872,0.811 respectively. These decay with increasing n, but \(\alpha\log(n)\) increases, suggesting that convergence is attained for k of the order of \(n^2\log(n)\).

Since (n−1) steps are required to first obtain U and each step takes O(1), the whole algorithm is \(O(n^3\log(n))\). Each time a partition is sampled a corresponding DAG can also be drawn according to Sect. 4.2 without a complexity overhead. By combining DAGs into partitions, irreducibility is reached by shorter chains. Since there is also no need to check for acyclicity, as the samples are DAGs by construction, the resulting algorithm is much faster than the Markov chain method of Melançon et al. (2001) discussed in Sect. 3. Recall that the standard Markov chain only becomes irreducible in the time it takes for the hybrid partition method to converge. The partition Markov chain can then be applied to sample reasonably large graphs while sharing some of the advantages of other Markov chain methods, such as the possibility of easily including restrictions.

Irreducibility could be achieved by an even shorter chain, for example by sampling from all partitions (or binary sequences) at each step. The disadvantage lies in the potentially increased complexity of the acceptance probability evaluation and the associated sampling operation. Such large moves would also reduce the probability of moving away from the most likely partition (with all \(k_i=1\)), increasing the chance that the chain gets temporarily stuck and so hindering the convergence. Flipping a single element of the binary sequence allows more local and inexpensive moves with better acceptance probabilities. Such moves should explore the space more easily which is why we opted for them here.

7 Restrictions and variations

A promising application of the Markov chain method, as particularly pursued in Ide and Cozman (2002), Ide et al. (2004), is to impose restrictions on the DAGs to sample. For example the number of edges and the degrees of the nodes can be restricted by simply rejecting any steps producing graphs which do not meet the requirements. More complicated Markov chains are needed for sampling polytrees and DAGs with restricted density (Ide and Cozman 2002; Ide et al. 2004) but the speed of convergence to uniformity remains an issue. Although the space of restricted graphs is smaller, the probability of not moving for graphs on the border of the restricted set is increased due to the additional rejections. If no recursive enumeration is possible for the restricted graphs, however, a Markov chain approach remains the only one available. For restrictions which can be incorporated combinatorially, direct enumeration again provides a more efficient method and we discuss some examples below.

7.1 Connected DAGs

The main restriction imposed in Ide and Cozman (2002), Melançon and Philippe (2004) is that DAGs be weakly connected (admitting a path between every pair of nodes in the underlying undirected graph). They were also counted by Robinson (1973) and are recorded as sequence A082402 in Sloane (2013). The standard Markov chain method can be simply modified so as to only delete an arc if the resulting graph is also connected (Ide and Cozman 2002) or, more efficiently, to instead reverse its direction otherwise (Melançon and Philippe 2004). Checking for connectedness is of the order of the number of arcs, making the algorithm \(O(n^6)\). At each step either connectedness or acyclicity needs to be checked, the first when removing an edge and the second when adding one. The resulting algorithm may then actually be slower than simply rejecting unconnected DAGs at the end. The fraction of connected DAGs tends to 1 as n increases (Bender et al. 1986; Bender and Robinson 1988; Robinson 1973); for n=4 it is 82 % (for smaller n we can enumerate the possibilities by hand), while it is >99 % and growing for n>8. Therefore the above modifications of the Markov chain, though theoretically interesting, are of no practical use, as it is much more efficient to sample a DAG at random and reject the disconnected ones.

When removing the k outpoints of a connected DAG the rest breaks up into a set of smaller connected DAGs, hence a recursive combinatorial enumeration (involving a sum over the partitions of m) is possible. Again the increase in algorithmic complexity makes a simple rejection step more efficient, apart from possibly for small n.

7.2 Restricted new connections

Because new outpoints are added at each stage of the recursive enumeration of DAGs, restrictions on the number of incoming arcs the previous nodes can receive from them are the simplest to treat. Allowing up to a maximum K new connections (K≥1 avoids only generating DAGs with no arcs) the recursions change to

$$ \tilde{a}_{n,k}=\binom{n}{k}\sum _{s=1}^{m} \Biggl[\sum_{i=1}^{\min (k,K )} \binom{k}{i} \Biggr]^{s} \Biggl[\sum_{i=0}^{\min (k,K )} \binom{k}{i} \Biggr]^{m-s}\tilde{a}_{m,s}. $$
(21)

For K≥k the above reduces to (6). A restriction on the minimum number of new connections can be incorporated analogously. Such restricted DAGs can easily be counted and then sampled as in Sect. 4. In reconstructing the DAGs new arcs can be uniformly sampled from the binomial distributions in (21). Retaining the limit on the arcs received by the previous outpoints, it is also straightforward to separately limit the connections each new outpoint can send to the previous non-outpoints to a maximum of \(K_{\mathrm{n}}\). The corresponding DAGs can be sampled through

$$\begin{aligned} \tilde{a}_{n,k} =&\binom{n}{k}\sum_{s=1}^{m} \Biggl[\sum_{i=1}^{\min (k,K )}\binom{k}{i} \Biggr]^{s} \Biggl[\sum_{i=0}^{\min (m-s,K_{\mathrm {n}})} \binom{m-s}{i} \Biggr]^{k} \\ &{}\times\tilde{a}_{m,s}. \end{aligned}$$
(22)

7.3 Restricted number of children

Because previous outpoints must each be connected to, limiting the outgoing arcs from each new outpoint is more complicated. Since arcs only originate from newly added outpoints, this means limiting the number of children to a certain K. Denote by C(k,m,s,K) the number of ways of adding k new outpoints to a DAG with m nodes and s outpoints while restricting the number of children to K or below. An expression for C(k,m,s,K) may be found by subtracting, from the ways of connecting k nodes to m (with up to K arcs each), those which leave any of the s outpoints without an incoming arc:

$$ C(k,m,s,K)= \Biggl[\sum_{i=0}^{\min (m,K )} \binom{m}{i} \Biggr]^{k} - \sum_{i=1}^{s} \binom{s}{i}C(k,m-i,s-i,K), $$
(23)
$$ \tilde{a}_{n,k} =\binom{n}{k}\sum_{s=1}^{m}C(k,m,s,K)\, \tilde{a}_{m,s}. $$
(24)

The graph is reconstructed by drawing from the C(k,m,s,K) possibilities at each step, and formula (23) simply suggests rejecting the cases which leave some of the s outpoints with no incoming link. It is however more efficient to first derive the distribution of the number of arcs linking each new outpoint to the old ones when excluding the cases with a total <s. Samples from the resulting distribution where any old outpoints do not receive arcs are then rejected. Arcs to the remaining nodes can be drawn subsequently.

For C(k,m,s,K) much smaller than the first term on the right of (23), or when the average number of new arcs linking to the old outpoints falls, the acceptance ratio also drops. In this limit it may be convenient to first draw exactly one link to each of them and then sample the remaining arcs among all the nodes, including the s old outpoints. The configurations where each receives \(l_i\) incoming arcs are overcounted by a factor \(F=\prod_{i=1}^{s}l_{i}\), so the sample should only be accepted with probability 1/F.

A balance of the two approaches may help minimise the rejection probability, but this example highlights that just knowing the number of possibilities is not sufficient for reconstruction, and a constructive method to generate them is preferable. A uniformly drawn DAG with a limit on the number of parents can be simply obtained by inverting all the arcs of a DAG with a limited number of children.

7.4 Hybrid MCMC method to restrict parents

As illustrated in the previous subsection, the combinatorial enumeration based on recursively removing outpoints reaches its limits when aiming to restrict the number of children or parents. The difficulties can be avoided by working partitionwise, rather than directly on individual DAGs. Given a partition \([k_I,\ldots,k_{i+1},k_i,\ldots,k_1]\), the number of ways \(G_{i,K}\) in which each of the \(k_{i+1}\) nodes can receive up to K edges from the \(k_i\) in the next partition element and the remaining \(S_{i-1}\) further up the DAG can be easily computed

$$ G_{i,K}=\sum_{l=1}^{\min(K,k_i)} \binom{k_i}{l} \Biggl[ \sum_{j=0}^{\min (S_{i-1}, K-l)} \binom{S_{i-1}}{j} \Biggr]. $$
(25)

For K=∞ this directly reduces to the number of incoming edge configurations in unrestricted DAGs: \(G_{i,\infty}=(2^{k_{i}}-1) 2^{S_{i-1}}\).

The corresponding number of restricted DAGs for each partition is then simply

$$ \tilde{a}_{[k_I,\ldots,k_1]} = \frac{n!}{k_I!\cdots k_1!} \prod_{i=1}^{I-1} (G_{i,K})^{k_{i+1}}. $$
(26)

This can replace expression (15) in the acceptance ratio (16) of the MCMC method in Sect. 6 providing a way to uniformly sample DAGs with a restricted number of parents.
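Equations (25) and (26) are again straightforward to evaluate; the sketch below (our names, same conventions as the earlier partition code) can be dropped into the acceptance ratio (16) in place of the unrestricted count.

```python
# Restricted-parents counts: G_{i,K} of (25) and the partition total (26).
from math import comb, factorial

def G(ki, S_prev, K):
    # ways one node of the block above can receive <= K parents,
    # at least one from the k_i adjacent nodes, the rest from S_{i-1} nodes
    total = 0
    for l in range(1, min(K, ki) + 1):
        total += comb(ki, l) * sum(comb(S_prev, j)
                                   for j in range(min(S_prev, K - l) + 1))
    return total

def count_restricted_partition(ks, K):
    k = ks[::-1]                          # k[i-1] = k_i, as in Eq. (26)
    I = len(k)
    count = factorial(sum(k))
    for ki in k:
        count //= factorial(ki)
    S_prev = 0                            # running S_{i-1}
    for i in range(1, I):
        count *= G(k[i - 1], S_prev, K)**k[i]
        S_prev += k[i - 1]
    return count

# with K at least n this coincides with count_dags_of_partition(ks)
```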

8 Weighted sample

When generating DAGs by sampling triangular matrices their sparsity can be easily influenced by drawing the entries from a Bernoulli with p≠1/2. The associated weighting however acts on an underlying non-uniform space of DAGs. A weighted sample from an underlying uniform distribution may be preferable in practice when DAGs rather than triangular matrices are the object of study. The probability of a DAG with l arcs should basically be proportional to \(p^{l}(1-p)^{L-l}\) and analogous factors should be included in the recursive enumeration. Each of the \(m-s\) non-outpoints may receive no links from the k new ones with weight \((1-p)^{k}\), one link with weight \(kp(1-p)^{k-1}\) and so on to include all the terms in the binomial expansion of \([(1-p)+p]^{k}=1\). The possibility that no arcs link to the s old outpoints is excluded. The k new outpoints on the other hand could theoretically be connected by \(\binom{k}{2}\) links, but no arcs are allowed between them, leading to the recursion

$$ \hat{a}_{n,k}=\binom{n}{k}(1-p)^{\binom{k}{2}} \sum_{s=1}^{m} \bigl(1-(1-p)^k \bigr)^{s}\hat{a}_{m,s}, $$
(27)

where now \(\hat{a}_{n,n}=(1-p)^{L}\) corresponds to the DAG with no arcs out of the L possible ones. We choose to set \(0^0=1\) so that \(\hat {a}_{1,1}=1\) even for p=1, where only one new outpoint is added at each step. Weighted DAGs can be generated as in Sect. 4 where links are now added at each step with probability p. The terms in the recursions are no longer integers in general (though for rational p the numerator and denominator could be stored as a pair of integers) and exact sampling can be obtained up to the precision limits of the computational implementation.

If many samples were required for a particular value of p the limiting distribution could be derived again and used for sampling large graphs, as discussed in Sect. 5 for the case p=1/2. However, as p is reduced more weight is shifted towards DAGs with a larger number of outpoints added at each step and more terms would need to be included. Often sparse DAGs are required in practice for modelling reasons where p might scale like 1/n to keep the average number of parents low. In the limit of small p it may be necessary to calculate all the numbers in (27) up to a given precision.

The hybrid MCMC method of Sect. 6 can be used instead to treat weighted DAGs. Analogously to (27), the weight of DAGs corresponding to a partition \(P=[k_I,\ldots,k_1]\) can be expressed as

$$ \hat{a}_{[k_I,\cdots,k_1]}=\frac{n!}{k_I!\cdots k_1!} \prod_{i=1}^{I} (1-p)^{\binom{k_i}{2}} \prod_{i=1}^{I-1} \bigl(1-(1-p)^{k_{i}} \bigr)^{k_{i+1}}. $$
(28)

The acceptance ratio of a proposed partition \(\hat{P}\) where we split element \(k_i\) of P into two elements of size c and \((k_i-c)\) simplifies to

$$\begin{aligned} \frac{\hat{a}_{\hat{P}}}{\hat{a}_P} =& \binom{k_i}{c} \biggl( \frac {1-(1-p)^{c}}{1-(1-p)^{k_i}} \biggr)^{k_{i+1}} \\ &{} \times \biggl(\frac{1-(1-p)^{k_i-c}}{1-(1-p)^{k_{i-1}} \mathrm{I}_{\{ i>1 \}} } \biggr)^{c} (1-p)^{-c(k_i-c)}. \end{aligned}$$
(29)

For p=1/2 this reduces to the expression in (19). The hybrid MCMC method then allows us to efficiently sample DAGs weighted by their number of arcs without needing to calculate and store all the numbers in (27). For perfect sampling with rational p the time and space complexity are at least \(O(n^5)\) and \(O(n^4)\) respectively, as discussed in Sect. 4.4 for uniform sampling. The hybrid MCMC method therefore offers an alternative to sampling triangular matrices while removing the bias of operating on the wrong underlying space.
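A sketch of (29), generalising the split_ratio function above by the edge probability p (our naming, assuming 0 < p < 1); setting p=1/2 recovers the uniform ratio of (19).

```python
# Acceptance ratio (29) of the edge-weighted partition MCMC; sketch only,
# assumes 0 < p < 1.
from math import comb

def weighted_split_ratio(ks, i, c, p):
    k = ks[::-1]                           # k[i-1] = k_i, as in Sect. 6
    I = len(k)
    ki = k[i - 1]
    kip1 = k[i] if i < I else 0            # k_{i+1}
    kim1 = k[i - 2] if i > 1 else 0        # k_{i-1}
    q = 1.0 - p
    ratio = comb(ki, c)
    ratio *= ((1 - q**c) / (1 - q**ki))**kip1
    denom = (1 - q**kim1) if i > 1 else 1.0
    ratio *= ((1 - q**(ki - c)) / denom)**c
    return ratio * q**(-c * (ki - c))

# weighted_split_ratio(ks, i, c, 0.5) agrees with split_ratio(ks, i, c)
```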

9 Conclusions

The recursive method for generating DAGs we analysed has the advantage over the widely adopted strategy of sampling triangular adjacency matrices of producing a uniformly distributed sample, and over Markov chain based algorithms of avoiding convergence problems and being computationally more efficient. The recursive method provides a perfectly uniform DAG on n vertices with a setup time of \(O(n^5)\) and a sampling time of \(O(n^3)\), compared to the Markov chain method which provides an approximately uniform DAG in \(O(n^5\log(n))\). Comparing the actual computational time for n=100, the recursive method provides uniform DAGs thousands of times faster than the Markov chain. Moreover, an approximately uniform DAG can instead be obtained through the limiting behaviour of the recursive enumeration in just \(O(n^2)\). The accuracy can be chosen at will and this minimal level of complexity means that the very large graphs which may be needed for practical applications can be generated.

Along with using the limiting behaviour, we developed a novel hybrid MCMC setup based on the space of partitions of n. Each partition is simply scored according to the number of DAGs belonging to it. This offers the advantage compared to current alternatives of shortening the chain needed to reach irreducibility and avoiding the costs of checking for acyclicity. Approximately uniform DAGs can be sampled in \(O(n^3\log(n))\) which is a factor of \(n^2\) smaller than current Markov chain methods. Such an MCMC on the space of partitions can also be easily adapted to uniformly sample restricted DAGs, for example by including a restriction on the number of parents for each node or by weighting the edges.

The ideas discussed here could be integrated within MCMC schemes in the inference of Bayesian graphical models. A procedure can be built following the combinatorial construction where moves are suggested by going back a certain number of steps in the sequence of adding outpoints and resampling just the new connections at that step. As they are sampled uniformly, such moves are reversible. To ensure full space reachability, one could introduce some probability of proposing a move uniformly amongst all DAGs or according to the standard MCMC on network structures (Madigan and York 1995). Alternatively, a number of moves through partitions can be defined based on the underlying combinatorial structure.

The hybrid MCMC method based on partitions is similar in spirit to the sampling in the space of orders of Friedman and Koller (2003), which actually acts on the space of triangular matrices. Scoring all the DAGs in a given partition therefore possibly has similar potential to greatly improve the mixing properties of the chain produced, with the advantage of avoiding the bias, inherent in the order version, of acting on a different space. The local changes in partition elements could also possibly propose structures which are in some sense close in the space of graphs, while other moves based on the combinatorial structure might be tuned to the local scores. Since neither the original MCMC on network structures nor the order version appear to be completely satisfactory in practice (Grzegorczyk and Husmeier 2008) the possibilities outlined above seem promising.

Interpreted as Bayesian networks with random variables placed on each node, DAGs encode conditional independencies of multivariate distributions. However, several DAGs might encode the same set of conditional independencies, meaning that they typically cannot be uniquely identified from data. The set of DAGs encoding the same probability distribution, also known as the Markov equivalence class, can be represented by an essential graph (Andersson et al. 1997) or a completed partially directed acyclic graph. In practice, it might be desirable to sample equivalence classes and work on the space of essential graphs as opposed to DAGs. However, although essential graphs can now be counted, the current formula takes exponential time and their number is therefore only known up to n=13 (Steinsky 2013). Previously, essential graphs had only been counted by running through all possible DAGs for n≤10 (Gillispie and Perlman 2002) with the n=10 term corrected in sequence A007984 of Sloane (2013).

MCMC methods can be restricted to the space of essential graphs: for example an implementation of the general framework of Madigan and York (1995) was given in Madigan et al. (1996), while a Markov chain algorithm specifically targeted at uniform sampling was developed by Peña (2007). Compared to the Markov chain for sampling DAGs uniformly (Melançon et al. 2001; Scutari 2010), the algorithm for essential graphs is notably slower, taking a similar time for each step on an essential graph with about 20 or 30 nodes as for a DAG with 100. Although no convergence details are given, only a very small fraction of the proposed moves result in an essential graph, suggesting that the convergence would also be much slower than for DAGs.

Although little is known about the exact number of essential graphs beyond the results in Steinsky (2013), the number of essential graphs which are also DAGs is known (Steinsky 2003). More importantly, the asymptotic ratio of the number of DAGs to those which are essential graphs is under 13.7 (Steinsky 2004). Since there are more essential graphs than just those which are also DAGs, 13.7 is a bound on the average number of DAGs corresponding to each essential graph, while the actual ratio seems to be less than 4. With such a low bound, it might be worth thinking of rejection or importance sampling strategies, starting from the uniform DAG generation discussed here, possibly leading to a simple and more efficient alternative for sampling essential graphs.