Introduction

The three-dimensional organization of chromatin is thought to play an important role in the transcriptional regulation of gene expression (Hübner et al. 2013). In recent years, experiments based on chromosome conformation capture (3C, and the derived 4C, 5C and Hi-C, reviewed in Denker and de Laat (2016)) shed light into the structure of chromatin in the kilobase to megabase scale, where regulatory enhancer-promoter interactions occur, and revealed the existence of a new level of chromatin organization called topologically associating domains (TADs) (Dixon et al. 2012; Nora et al. 2012). The biological importance of TADs, and their internal sub-structures (Rao et al. 2014), stems from the fact that they seem to act as regulatory microenvironments by controlling the interaction frequencies of enhancers and promoters and consequently contributing to the establishment of correct transcriptional levels and patterns of gene expression.

3C-based techniques provide maps of the contact probabilities between pairs of genomic loci, averaged over millions of cells. However, this averaged infor-mation gives only a partial view of the complicated behaviour of the chromatin fibre and misses several properties that characterize chromatin folding within TADs, which might be important for transcriptional control. An example of those properties is the degree at which different pairs of loci co-localize. If Hi-C maps provide information on how often a locus A is in contact with locus B and how often locus B is in contact with locus C, it does not quantify explicitly how often these two contacts occur simultaneously unless more sophisticated approaches are used (Olivares-Chauvet et al. 2016). Moreover, it is not possible to extract any information about the shape and the three-dimensional size of a TAD directly from Hi-C maps. Finally, fluctuations over time of any conformational properties of the fibre cannot come directly from the experimental data.

Polymer models (Baù et al. 2011; Mirny 2011; Barbieri et al. 2012; Brackley et al. 2013; Benedetti et al. 2013; Rosa and Zimmer 2014; Giorgetti et al. 2014; Johnson et al. 2015; Goloborodko et al. 2016) are an useful tool to increase the amount of information that can be derived from 3C-based experiments, and 5C and Hi-C interaction maps in particular. We previously showed that a coarse-grained model of chromatin, in which a chain of beads interact through a contact potential tuned to reproduce the experimental contact properties of a given TAD, is able to extract realistic three-dimensional conformations of the chromatin fibre in single cells (Giorgetti et al. 2014). This model is based on the minimal assumption that chromosomal loci interact through a two-body, short-range interaction and remain agnostic on the actual molecular mechanisms underlying it (e.g. direct looping vs. loop extrusion between sites bound by CTCF (Merkenschlager and Nora 2016), provided that the resulting interaction can be approximated by this simple model). Nevertheless, the conformations derived were shown to correctly predict independent observables, such as the distribution of distances between pairs of loci measured by DNA fluorescence in situ hybridization (FISH) and how it is affected by mutations (Giorgetti et al. 2014; Tiana et al. 2016). Moreover, this model is thermodynamically correct, in the sense that the polymer obeys the laws of motion imposed by physics.

Co-localization of multiple loci was clearly observed in the X inactivation centre of mouse embryonic stem cells (ESC) by DNA-FISH (Nora et al. 2012). The present model proved efficient in identifying multibody interactions in the same system, specifically those involving the Linx, Tsix and Chic1 loci, whose co-localization was then verified by DNA-FISH (Giorgetti et al. 2014). Interestingly, also a different model, based on the correct thermodynamics of proteins as well, was able to highlight many body contacts in the Sox9 locus (Chiariello et al. 2016), thus testifying to the robustness of this phenomenon. In a recent work (Olivares-Chauvet et al. 2016), 3C-based techniques were extended to identify experimentally multiple interactions.

In the present work, we have extended our previous approach to study the conformational properties of TADs and their internal sub-structures at the genome-wide level in mouse embryonic stem cells, using Hi-C data at 20 kb resolution (Giorgetti et al. 2016) as an input for the model. TADs were identified using a custom algorithm (Zhan et al. 2017). For each domain that is bigger than 240 kb (a total of 2473 TADs), a polymer model was derived and its equilibrium properties were investigated and correlated with the activity of the genes contained within them.

The model

The starting point of the model is a mouse ESC Hi-C map M ij at a resolution of 20 kb, from ref. (Giorgetti et al. 2016), which was iteratively corrected using the ICE method (Imakaev et al. 2012). All samples were mapped to mouse mm9 using QuasR (Gaidatzis et al. 2015). Hi-C read counts of M ij were converted into contact probability p ij that is taken as input by the model following a procedure similar to Giorgetti et al. (2014) (see Supplementary Materials). TADs were identified using the CaTCH algorithm “(Zhan et al. 2017), which we developed to identify and compare hierarchies of nested interaction domains in Hi-C maps. We identified TADs as the set of domains that is maximally similar to the domains called on the same dataset using the directionality index method (Dixon et al. 2012). TADs simulated in this work have a slightly smaller median size than directionality index TADs (820 vs. 880 kb), due to the fact that the CaTCH algorithm often splits large, inactive directionality index TADs in two sub-domains (Zhan et al. 2017). The complete list of simulated domain coordinates is provided in Supplementary Table S1.

In this work, each TAD is studied independently of the others; see the ‘Size and shape of TADs’ section below for a discussion on the validity of this assumption. Each region is modelled as an inextensible chain of beads. A bead represents effectively a genomic segment of 20 kb (which would correspond to ≈134 nm in euchromatic regions, based on the estimations discussed in Giorgetti et al. (2014), corresponding to the resolution of the Hi-C experiments (see Fig. 1a)). We termed a the distance between consecutive beads, which sets the elementary length unit of the model; all lengths in the study will be thus given in units of a. An intrinsic assumption of the model is that such a distance is constant along the genome and thus that the internal structure of a 20 kb element does not change dramatically from one locus to another. However, importantly, it should be noted that the actual spatial extension of each bead (i.e. the number of nanometre that it occupies in space) could depend on the particular genomic location and notably on the local chromatin composition (notably euchromatin vs. heterochromatin).

Fig. 1
figure 1

a Chromosomes are represented as chains of beads, each representing a unit of 20 kb. Pairs of beads interact with a potential u ij (b) shaped as a spherical well. c The number of TADs identified in each chromosome. d A zoom into the Hi-C map showing five TADs

Pairs of beads i and j interact with a spherical well potential

$$ {u}_{i j}=\left(\begin{array}{cc}\hfill +\infty \hfill & \hfill if\ \left|{r}_i-{r}_j\right|\le {R}_{HC}\hfill \\ {}\hfill {B}_{i j}\hfill & \hfill if\ {R}_{HC}<\left|{r}_i-{r}_j\right|\le R\hfill \\ {}\hfill 0\hfill & \hfill \left|{r}_i-{r}_j\right|> R\hfill \end{array}\right. $$

where |r i  − r j | is the Euclidean distance between the positions of beads i and j, R HC  = 0.24a is a hard core radius which sets the diameter of the fibre and R = 0.6a is the interaction range (see Fig. 1b).

The matrix B ij contains the interaction energies between each pair of beads, which summarize the effect of all biomolecules (transcription factors, architectural proteins, histone modifications, etc.) associated with the corresponding 20 kb segments, which might mediate such interaction energies. The interaction energy matrix is optimized according to the algorithm described in Giorgetti et al. (2014) to minimize the χ 2 between the experimental Hi-C map and the one that is back-calculated from the model. The variance between two experimental replicates (Giorgetti et al. 2016), binned at 40 kb resolution, was used to weigh the χ2 measure.

The model was applied to each TAD of the 20 mouse chromosomes larger than 12 model beads (2473 TADs) (see Fig. 1c, d) and iterated until the χ 2 between the Hi-C map and the model contact map converges to 1. The final set of conformations associated with χ 2 ≈ 1 represents the most likely equilibrium ensemble of the TAD compatible with the experimental data (Tiana et al. 2016). These ensembles of conformations are then analysed.

Size and shape of TADs

We started by analysing the volume of space occupied by polymer models of the various TADs. Specifically, we quantified the compactness of the polymer (in units of the bead size a) using the root mean square radius of gyration s (see Eq. S4 in the Supplementary Materials), which represents the radius of a hypothetical sphere that contains the typical conformation of the TAD. The value of s as a function of the linear size N of each simulated TAD is displayed in Fig. 2a. The data appear to follow the behaviour of an ideal noninteracting chain, in which s scales as s ∼ N 1/2 (Grosberg and Khokhlov 1994), but is also compatible with the scaling of a compact globule (s ∼ N 1/3) (solid and dashed line, respectively, in Fig. 2a). However, fluctuations follow a linear dependence on s (Fig. 2b), which is typical of an ideal chain (see Eq. S6 in the Supplementary Materials) and not of a globule, in which σ s decreases with s.

Fig. 2
figure 2

a The root square mean gyration radius s of each TAD as a function of its size; the error bars report the associated fluctuations, that is the standard deviation σ Rg of the gyration radius. Lengths are in units of the inter-bead distance a. The red curves indicate the fits with the curves expected for an ideal chain behaviour (continuous curve) and for a random globule (dashed curve). b The standard deviation plotted as a function of the mean gyration radius; the red line displays the linear behaviour expected for an ideal chain. c The average number of contacts associated with each TAD as a function of the linear size of the TAD; in red, the fits (cf. a). d The average linearity parameter λ of each TAD as a function of the normalized mean gyration radius; the continuous and the dashed line indicate the average value calculated, respectively, for an ideal chain and a random coil; the dotted lines indicate their standard deviations

The polymer behaviour of chromatin within single TADs, thus, seems to be largely similar to that of an ideal chain. This in fact is not unexpected, because it refers to small chromosomal segments belonging to a much longer chromosome polymer, confined by the nuclear envelope and by its interactions with other chromosomes into a chromosomal territory of finite volume. The Flory theorem (Doi and Edwards 1986) states that in such a situation, the forces acting on the small segment cancel out and the segment behaves as an ideal chain.

The ideal chain behaviour is also compatible with our previous observation, based on 5C-based reconstruction of the Tsix and Xist TADs in ESCs, that chromatin behaves as an ideal chain in those two TADs (Tiana et al. 2016). Our previous calculations showed that the heterogeneous interactions optimized to fit the 5C data set the polymer model of the chromatin fibre close to the theta point of the system, i.e. the temperature at which all polymers behave ideally. We now provide evidence that this seems to be the case at the genome-wide level.

Other properties that can be extracted from TAD structural models are however far from those of an ideal chain. Notably, the average number of contacts between different monomers in a single TAD is displayed in Fig. 2c. In an ideal chain, the number of contacts would be expected to scale as N 1/2 (see Eq. S8 in the Supplementary Materials); however, in spite of the large variability, this does not seem compatible with the calculated data (solid curve in Fig. 2c) where the number of contacts seems to grow linearly with the physical size of the TAD (dashed curve in Fig. 2c). This is similar to the behaviour of a homopolymer that is collapsed into a globule (Lifshitz et al. 1978).

Another interesting property of polymers is their degree of isotropy, which measures to which extent the conformations of the polymer are spherically symmetric. The isotropy of a conformation can be quantified by a parameter λ that assumes values close to 0 in case of a perfectly spherical symmetry and 1 for a linear conformation of the chain (see Supplementary Materials). The average values of λ for the different TADs are displayed in Fig. 2d as a function of their gyration radius, normalized by N 1/2, in order to evaluate their compactness independently on the genomic length of the TAD. As a reference, an ideal chain is typically rather anisotropic, displaying λ = 0.85 ± 0.13 (cf. solid line in Fig. 2d), while a globular homopolymer displays λ = 0.25 ± 0.10. The values of λ for simulated TADs are in between these two reference cases but overall show increased similarity to an ideal chain rather than a globule. The spatial isotropy shows only a weak dependence on the compactness of the chain, increasing slightly in less compact chains.

The isotropy of TADs was observed experimentally by super-resolution imaging (Boettiger et al. 2016) and as result of simulations of a simple polymer model, different from ours, of the Hox-B locus (Annunziatella et al. 2016).

In summary, although single TADs display properties (such as the number of contacts) that are more similar to those of homopolymer globules, overall their compactness, fluctuations and (roughly) their anisotropy are more similar to those of an ideal chain. This is likely to reflect the fact that unlike homopolymers, a chromosomal region of the size of a TAD displays complex and heterogeneous interactions (in part mediated by proteins such as CTCF that promote chromosomal interactions), which are distributed unevenly along the chain and can be responsible for properties which depart considerably from those of homopolymers (Zhan et al. 2016).

Relation between structural properties of TADs and gene activity

We then set out to address the question whether any of the polymer properties defined in the previous section is related to level of transcriptional activity within a TAD. We quantified the level of expression of genes in ESC samples matched with the Hi-C data using published RNA-seq datasets (Gendrel et al. 2014). Expression levels were estimated by quantifying the reads per kilobase per million mapped reads (RPKM) of each gene

$$ RPKM=\frac{M}{N* L}\cdot {10}^3\cdot {10}^6, $$

where M is number of the mapped reads to the gene (exons), L is the length of the region (sum of all exons) and N is the total number of mapped reads. We then defined the overall transcriptional activity of a TAD as the sum of the RPKM of each gene belonging to that TAD. The box plot of Fig. 3a displays TAD activity as a function of the mean gyration radius s and shows that transcriptional activity appears to decrease with increasing gyration radius (in monomer unit a). This effect can also be observed when plotting the gyration radius of the active and of the inactive TADs, respectively, as a function of the genomic size (see Fig. 3b; the curve, that describes the behaviour of active TADs, displays a scaling exponent as a function of genomic size of 0.31, which is slightly smaller the value 0.35 of inactive TADs. The transcriptional activity within TADs also decreases with TAD spatial size, measured in units of a (see Fig. S1 in the Supplementary Materials). This behaviour seems to contradict previous results (le Dily et al. 2014; Fabre et al. 2015; Boettiger et al. 2016) but is difficult to interpret for two reasons: On one hand, TADs with different genomic size N could have similar gyration radii, depending on their internal interactions and consequent polymeric phase; on the other hand, in this analysis, we implicitly assumed that the density of bases (i.e. the spatial size of a bead) is constant across the genome, whereas (as discussed above) this could not be the case and might in fact critically change in euchromatic vs. heterochromatic regions.

Fig. 3
figure 3

a Box plot of the sum of the activity of the genes in each TAD (RPKM) as a function of the mean gyration radius s. b The mead gyration radius s of a subset of active TADs, with RPKM larger than 500 (in blue), and of inactive TADs, with RPKM lower than 50 (in red); the lines indicate the power law fit, which gives exponents 0.31 and 0.35, respectively. c The activity as a function of the relative fluctuations of the gyration radius. d The activity of genes as a function of the linearity coefficient λ

To better understand structural correlates of transcriptional activity, we plotted in Fig. 3c the RPKM as a function of the relative fluctuations of the gyration radius quantified by its normalized standard deviation σ s /s, which is an adimensional quantity that does not depend on the actual physical size of a monomer. This quantity provides, independently on the genomic size of a TAD, information about the phase of the polymer. If the relative fluctuations are small, the polymer is globular; fluctuations increase as the polymer swells into a coil (Lifshitz et al. 1978). As shown in Fig. 3c, the transcriptional activity appears to decrease as the TAD conformation swells.

Finally, Fig. 3d shows that the genetic activity within TADs is positively correlated with the mean linearity λ of polymer conformations.

It should be noted that if the chain were described by a simple homopolymeric theory, the correlation of transcriptional activity of TADs with their anisotropy and the anticorrelation with their relative fluctuations would be contradictory, because globules are more spherical than coils (cf. red lines in Fig. 2d). However, the heterogeneity of the interactions allows TADs to display both these features.

The contact probability between any pairs of loci (cf. Fig. 4a) can be estimated, according to its thermodynamic interpretation (Fudenberg and Mirny 2012) from the normalized Hi-C matrix M ij without the need of any polymer model. The average \( \overline{M} \) of all pairs of each TAD correlates with the activity of the TAD (see Fig. 4b; r = 0.28; p value = 10−16). Notably, transcriptional activity increases with \( \overline{M} \) for \( \overline{M}\le 70 \) and then reaches a plateau. This effect is mainly due to pairs of loci that are at relatively small genomic distances, since transcriptional activity correlates well with the mean of the Hi-C counts restricted to pairs that are separated by 100 kb (cf. Fig. 4c) but much less with the mean over pairs separated by 600 kb (cf. Fig. 4d).

Fig. 4
figure 4

For each pair of loci i and j, the Hi-C map gives the probability M ij that the two are within their interaction distance (orange circles in a), and the model permits the calculation of their mutual mean distance \( \overline{d_{ij}} \). The activity RPKM of each TAD is plotted versus the mean counts (b), the mean counts restricted to pairs of the TAD at small genomic distances (100 kb) (c), that between pairs at large genomic distances (600 kb) (d), the average distances between pairs of the TAD at small genomic distances (e) and that between pairs at large genomic distances (f)

Using the TAD polymer model, we also calculated the Euclidean three-dimensional distance between pairs of genomic loci. Consistent with what observed at the level of average contacts, TAD activity positively correlates with the average distance between genomic loci in the small-distance range (100 kb, cf. Fig. 4e). This reflects the fact that at small genomic distances, contact probabilities and mean distances are well correlated (see upper panel in Fig. S2 in the Supplementary Materials). However, this is not the case for pairs of loci located at large genomic distance (e.g. 600 kb), whose spatial distances are much less correlated with contact probabilities (see lower panel in Fig. S4). This discordance hints at the fact that comparing Hi-C results and distance measurements (for example by DNA-FISH) is not a trivial task (Giorgetti and Heard 2016; Dekker 2016) and suggests that this might be true especially at relatively large, TAD-scale genomic distances (several hundred kilobases) (Fudenberg and Imakaev 2016).

In summary, highly active TADs appear to be more polarized along a specific axis (‘elongated’) than inactive ones, to display a smaller amount of structural fluctuations relative to their overall size and to display a higher amount of short-range contacts (in the 100 kb range). This reflects the higher compartmentalization of active TADs in highly interacting, hierarchical sub-domains of preferential interactions reflecting CTCF-mediated sub-TAD architectures and interactions between active regulatory elements (Rao et al. 2014; Zhan et al. 2017).

Co-localized loci are enriched in active promoters and enhancers

While Hi-C data inform only on the probability that two loci are in contact over the cell population, polymer models can also predict the probability that multiplets of loci are found in contact with each other simultaneously. For each TAD, we calculated the probability p ijk that the triplet of loci i, j and k is in contact (cf. Fig. 4a) and further computed the quantity \( {p}_{ij k}^{*}={p}_{ij k}-\left({p}_{ij}{p}_{jk}+{p}_{ik}{p}_{kj}+{p}_{ki}{p}_{ij}\right)/3 \)expressing the excess probability that the three loci co-localize with respect to the probability that the three of them come in contact independently on each other. Thus, \( {p}_{ijk}^{*} \) is 0 if the three loci co-localize merely because of nonzero pairwise contact probabilities of each pair in the triplet, whereas it is larger than 0 if there is a specific cooperative mechanism which involves the three loci. We required that i, j and k have a mutual genomic distance of at least 60 kb (i.e. three beads in the model). As a negative control, we also calculated the same quantity for each TAD in 40 simulations derived by a randomly reshuffled interaction matrix B ij . For each TAD, we then recorded the triplets of loci whose excess probability \( {p}_{ijk}^{*} \) was found to be larger than in 95% of the randomized controls and filtered the data by requiring \( {p}_{ijk}^{*}>0.02 \).

We observed that TADs display a variable number of co-localized triplets, from 0 to ∼4000 per TAD, to be compared with the number of possible triplets ∼N 3/3! ∼ 105. Interestingly, the number of co-localized multiplets does not correlate with the activity within single TADs (cf. Fig. S4 in the Supplementary Materials). We then investigated whether co-localized sites contain preferentially active promoters, enhancers or CTCF-bound sites. CTCF-bound sites were detected using ChIP-seq from ENCODE project (Cheng et al. 2014). In particular, macs2 (Zhang et al. 2008) was applied to ChIP-seq datasets to identify significant CTCF-bound sites (peaks). Active promoters were defined as those genes where log2 RPKM > 1. To define a region to be enhancers, we used as landmark region enriched in H3K27ac (peaks from macs). Then, we analysed the ratio between H3K4me1/H3K4me3 (ChIP-seq normalized signal) and labelled a region as an enhancer if H3K4me1/H3K4me3 is enriched. In Fig. 4b, we display the probability P defined as the fraction of co-localized triplets of loci containing a promoter each (marked as ‘ppp’), a promoter in two of them and an enhancer in the other (‘ppe’), a promoter, an enhancer and a CTCF site (‘pec’), etc. These probabilities are compared with a negative control obtained reshuffling the co-localized triplets in each TAD and with the theoretical probability obtained from the densities of promoters, enhancers and CTCF in the genome (see Supplementary Materials).

All the enrichments displayed in the figure are statistically significant, displaying a p value <0.05 with respect to the bootstrap. Thus, a comparison of the significance of the different combinations was carried out on the basis of their Z score (Fig. 5).

Fig. 5
figure 5

a Co-localization occurs in TADs when three 20 kb loci interact with each other with a probability \( {p}_{ijk}^{*} \) which is statistically significant (p < 0.05) with respect to bootstrapped, random-interacting polymers. b The probability P that triplets of co-localized sites contain promoters (‘p’), enhancers (‘e’) and CTCF (‘c’). For example, ‘ppe’ means that two of them contain a promoter and the other an enhancer, and ‘pe*’ means that a site contains a promoter, another an enhancer and the third can contain anything

Triplets of CTCF are remarkably significant (Z score = 493), suggesting that the molecular mechanism employed by this protein is not just a dimerization (Pant et al. 2004) but can involve in some way multiple copies of CTCF. A similar significance is displayed by triplets made of an enhancer, a promoter and CTCF (Z score = 491), suggesting that CTCF could play a direct role in the approaching of enhancers to promoters. The significance of triplets of promoters (Z score = 451) and of two promoters and one enhancer (Z score = 410) could emphasize the fact that multiple promoters are controlled by few factors. Triplets involving more than one enhancer are less enriched (eee has Z score = 262 and pee has Z score = 294).

The list of triplets containing promoters and enhancers is reported in the Supplemental Table S2. Among them are, as an example, the triplet Tsix/Chic1/Linx, whose co-localization was already predicted and verified in FISH experiments (Giorgetti et al. 2014).

Conclusions

Hi-C datasets give information about interaction probabilities of pairs of chromosomal loci. Using a polymeric model, these data can be de-convolved into equilibrium ensembles of conformations that allow one to predict the shape and size of single TADs, their structural fluctuations and the correlations in the formation of sub-TAD contact loci.

Applying this strategy to all genome-wide TADs in mouse embryonic stem cells allowed us to extract structural properties and to correlate them with genomic datasets. Model-driven structural reconstruction predicts that the overall structure of TADs displays some features that are distinctive of ideal polymeric chains, as expected by general physical arguments applying to a small polymer chain surrounded by other polymers. However, TAD structural properties are strongly affected by the heterogeneity of interactions between loci within the TAD, irrespective of the molecular mechanisms that generate them. The global level of gene activity within single TADs correlates positively with their degree of structural anisotropy; this suggests that high local gene activity is accompanied by a ‘linearization’ of the three-dimensional conformations of the chromatin fibre and a suppression of structural fluctuations, as a consequence of looping interactions between CTCF sites and active regulatory regions and compartmentalization in sub-TAD domains. These two effects require a complex set of interaction to stabilize the conformation of the chromatin fibre within TAD and could not be possible, for instance, in a simple homopolymeric model (i.e. a genome without locus-specific interactions). Indeed, the number of contacts between pairs of loci at small genomic distance is a good proxy for global transcriptional activity, whereas the number of contacts between genes at large genomic distance (although within the same TAD) does not seem to correlate with activity.

Importantly, our simulations also provide information on the tendency to form simultaneous contacts within single TADs. In agreement with the experimental findings of Olivares-Chauvet et al. (2016), the amount of co-localizing triplets amounts to a small fraction of all possible ones. However, they are enriched in promoters and enhancers, which suggests a potential role of their co-localization in single TADs in setting up their functional interactions.