
1 Chromatin Structure, Combinatorial Complexity of Histone Modifications, and Mechanisms of Epigenetic Regulation

Epigenetic phenomena constitute a very important regulatory checkpoint in many key cellular processes such as DNA maintenance and repair [1, 2], epigenetic inheritance [3, 4], and gene expression [5, 6]. While the genome's underlying structure – i.e., the DNA sequence – is highly stable, epigenetic signatures are dynamic [7,8,9], with different epigenetic phenomena having different degrees of stability and variability, giving rise to most of the phenotypic differences across cells in multicellular organisms. Fluctuations in DNA condensation, and the establishment of heterochromatic or euchromatic regions, are determined by covalent modifications of chromatin, including DNA methylation of CpG islands [10,11,12] and a wide range of histone modifications [9, 13, 14], which form complex combinatorial networks of histone marks that constitute the “histone code” [15]. Moreover, DNA methylation and histone modification pathways are significantly interconnected [16,17,18], and the cross talk between DNA and histone epigenetic modifications significantly increases the combinatorial complexity of the mechanisms of epigenetic regulation. Although not yet fully understood, there are two characterized mechanisms by which epigenetic modifications exert their function [9]: the first is the disruption of contacts between nucleosomes in order to “unravel” chromatin, and the second is the recruitment of nonhistone proteins [9]. A wide family of epigenetic signaling proteins – i.e., readers, writers, and erasers [19,20,21,22] – recognizes the complex code of epigenetic modifications, controlling the condensation levels of genomic regions and the susceptibility of these regions to be transcribed [5, 6], to undergo DNA repair [1, 2], or to participate in other cellular processes. The central role of epigenetics in the regulation of a broad range of key cellular processes explains its implication in multiple common and serious human pathologies [23,24,25], such as developmental diseases [26,27,28], cancer [29,30,31,32], and neurological disorders [33,34,35,36,37]. Despite technological advances in the study of mechanisms of epigenetic regulation, we still lack a systematic understanding of how the epigenomic landscape contributes to cellular circuitry, lineage specification, and the onset and progression of human disease [38]. Due to the significant complexity of the mechanisms of epigenetic regulation, computational and bioinformatics approaches have been essential for disentangling these mechanisms at the genome-wide level and for answering important questions such as how epigenetic mechanisms sense environmental cues during lineage specification and development, and how different chromatin modifications interact to control transcription.

In this chapter, we review the state of the art of computational approaches and bioinformatics tools for genome-wide epigenetic research. We cover the field of “computational epigenetics” and discuss recent advances in computational methods for the processing and quality control of different types of epigenetic data, the prediction of chromatin states, the study of chromatin dynamics, and the analysis of the 3D structure of chromatin. We also address the status of different collaborative projects and databases comprising a wealth of genome-wide epigenetic data. We discuss how the fast growth in the generation of epigenetic data, boosted by the development of high-throughput sequencing (HTS) experimental technologies and inter-institutional public/private collaborative projects, has been complemented and prompted by the development of computational methods for analyzing and rationalizing these huge quantities of data. The steady decrease in the cost of technologies for generating epigenetic data has also opened the possibility of performing epigenetic surveys in human populations. In this regard, we also examine recent developments in the computational approaches used in these studies to uncover the main differences and similarities between individuals at the epigenetic level and their implications in cellular differentiation, gene regulation, and disease.

2 Whole Genome Annotation of Histone Modifications: Computational Tools for Data Quality Control and Mapping of Epigenetic Data

The characteristics and specificities of the wide range of computational methods commonly used for the analysis of epigenetic data depend significantly on the particularities of the experimental techniques used to perform epigenomic profiling. The techniques available for profiling histone modifications (and the other epigenetic phenomena described in the next sections of this chapter) are described in detail in a previous chapter of this book, but it is worth summarizing their commonalities and differences in order to discuss the computational approaches used to analyze the epigenetic data generated in each case. The most commonly used experimental approaches to profile histone posttranslational modifications are ChIP-on-chip [39,40,41], ChIP-seq [42,43,44], and mass spectrometry [45,46,47,48]. In ChIP-on-chip, chromatin-bound proteins are cross-linked to DNA by treatment with formaldehyde. Next, chromatin is collected and fragmented by sonication or using nucleases, and the fragments bearing the histone modification of interest are enriched using a histone modification-specific antibody captured on an antibody-binding matrix – i.e., immunoprecipitation. The DNA in the enriched fragments is released by reversing the cross-links at elevated temperature, and the purified DNA fragments are amplified and labeled with fluorescent dyes for further quantitation. Finally, the purified DNA is hybridized to a tiling microarray, which allows the identification of regions overrepresented in the immunoprecipitated DNA relative to control DNA – i.e., regarded as epigenetically modified. ChIP-seq shares the initial steps of the ChIP-on-chip technique but, unlike the latter, relies on HTS rather than on microarrays for identifying the sequences enriched in histone marks. Unlike immunoprecipitation techniques, proteomic profiling using mass spectrometry (MS) allows the detailed characterization of histone tail posttranslational modifications. This technique relies on the chromatographic separation of histones from cell lysates, followed by enzymatic digestion of individual histones for the accurate assignment and quantification of the amino acids bearing different kinds of posttranslational modifications [9, 13, 14], following top-down, bottom-up, or middle-down approaches [47, 49].

Immunoprecipitation techniques are by far the most commonly used, thanks to their high-throughput capabilities and to developments in the production of highly specific antibodies against individual histone modifications. The main bioinformatics problem in the analysis of ChIP-on-chip data is establishing, from raw probe intensities, a ranking of the genomic regions overrepresented on the arrays. In this regard, many different approaches have been specifically developed for performing peak calling from ChIP-on-chip experiments. In general, these methods share a set of common steps, encompassing the normalization of the intensities of hybridized fragments, the assessment of the statistical significance of the intensity of each peak with respect to the whole array, and finally the merging of overlapping overrepresented regions [39,40,41, 50]. The list of peak-calling packages for processing ChIP-on-chip data is long and diverse. It includes Tilescope [51], an automated data processing toolkit for analyzing high-density tiling microarray data that integrates data normalization, combination of replicate experiments, tile scoring, and feature identification in an easy-to-use online suite. TileMap [52] is a stand-alone package that provides a flexible way to study tiling array hybridizations under multiple experimental conditions in Affymetrix ChIP-on-chip experiments. Ringo [53] is an R package devised for NimbleGen microarrays, which facilitates the construction of automated programmed workflows and enables the scalability and reproducibility of analyses in comparison to other ChIP-on-chip peak callers. This list of bioinformatics tools for processing ChIP-on-chip microarray data is by no means exhaustive, and there is a wide spectrum of other approaches, including ACME [54], HGMM [55], ChIPOTle [56], HMMTiling [57], and MAT [58], among others. Notwithstanding the diversity of tools for processing ChIP-on-chip data, the bioinformatics analysis of tiling microarrays shares the drawbacks of the algorithms for analyzing DNA arrays, as these fail to accurately estimate histone modifications spanning extended genomic regions and underestimate weak binding events [50].

The key bioinformatics challenge in the analysis of ChIP-seq data is the fast and accurate mapping to the reference genome of thousands to millions of short reads corresponding to the regions bearing a specific histone modification. Many sequence aligners for mapping short reads have been developed, such as Bowtie [59], BWA [60], SOAP [61], and BLAT [62], among a long list of others (for a detailed review on short-read alignment methods, see [63]). Other methods with alignment strategies optimized for reads obtained with specific sequencing platforms have also been developed, including commercial suites such as ELAND, which forms part of the Solexa pipeline (http://www.solexa.com/), and the Broad Institute sequencing platform [64] (http://genomics.broadinstitute.org/). While mapping short reads to a reference genome, special care should be taken with the quality control of sequencing data. For instance, the random fragmentation of ChIP-seq samples treated with sonication renders arrays of overlapping reads corresponding to the same genomic regions, and these duplicated reads should be removed, using for example SAMtools [65]. This quality-control step is not necessary, however, when analyzing ChIP-seq data generated from samples treated with nucleases, because in that case the likelihood of generating overlapping reads is rather low. The assessment of “uniquely mapped” and “unique” reads is also a very important step in the quality control of ChIP-seq data. The former correspond to reads that align to a single genomic region, excluding repetitive loci and non-repetitive regions with highly similar sequences, while the latter correspond to the reads remaining after the removal of PCR duplicates. In this regard, depending on the specificities of the ChIP-seq dataset, the removal of duplicated reads to reduce amplification artifacts can result in an underestimation of real binding events. On the other hand, not removing duplicated reads can introduce a significant number of false positives, which can have strong implications for the downstream analysis of ChIP-seq data. Therefore, the alignment of short sequence reads to the reference genome, and the quality control of sequencing data, remain bioinformatics challenges. The analysis of the signal-to-noise ratio of sequencing signals also constitutes an important step in ChIP-seq quality control. The estimation of the “fraction of reads in peaks” (FRiP) – i.e., the proportion of mapped reads falling within called peak regions – and of cross-correlation profiles (CCPs), which measure the correlation of read densities between the two DNA strands as a function of the shift between them [66], is very useful for assessing the signal-to-noise ratio. Based on these metrics, different approaches for estimating the signal-to-noise ratio of ChIP-seq sequencing data have been developed [67].
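
To make the FRiP metric concrete, the following minimal Python sketch (using the pysam library; the function name and the assumption of an indexed, de-duplicated BAM file plus a BED file of called peaks are ours) counts the fraction of mapped reads overlapping peaks:

```python
import pysam

def frip(bam_path: str, peaks_bed: str) -> float:
    """Approximate fraction of reads in peaks (FRiP): reads overlapping
    called peak regions divided by all mapped reads. A read spanning two
    adjacent peaks is counted twice, so this is only an estimate."""
    bam = pysam.AlignmentFile(bam_path, "rb")  # requires a .bai index
    total_mapped = bam.mapped                  # mapped-read count from the index
    in_peaks = 0
    with open(peaks_bed) as bed:
        for line in bed:
            if not line.strip():
                continue
            chrom, start, end = line.split()[:3]
            in_peaks += bam.count(chrom, int(start), int(end))
    bam.close()
    return in_peaks / total_mapped
```

As a rule of thumb, the ENCODE guidelines consider FRiP values above roughly 1% acceptable for point-source ChIP-seq experiments [66].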

The procedures for performing peak calling from ChIP-seq samples differ from those commonly used for ChIP-on-chip experiments. There exists a myriad of peak callers based on different statistical criteria, which cannot be covered here in detail (for a detailed review, please see [68]). The general procedure followed by all of these algorithms includes the identification of enriched sequence read densities at different chromosomal loci relative to a background read distribution. The first step common to all ChIP-seq peak callers is the generation of a signal profile by integrating the reads mapped to specific genomic regions. Different tools rely on sliding-window approaches for smoothing the discrete distribution of read counts into a continuous signal profile. Tools such as CisGenome [69] follow this rationale, estimating the number of reads above a predefined peak cutoff, and others like SISSRs [70], Peakzilla [71], and SPP [72] also take into account the correspondence of read counts on the positive and negative strands to improve peak resolution. Other tools use more sophisticated approaches for integrating the signals in sequence windows. For example, MACS [73] uses a local Poisson model to identify local biases at genomic positions, F-Seq [74] and QuEST [75] rely on kernel density estimation, and PICS [76] uses a Bayesian hierarchical t-mixture model for smoothing read counts in the genomic signal profile. The HOMER program suite [77] has also been widely used for peak calling and is especially useful for analyzing broad peaks corresponding to histone modifications spanning large chromosomal regions – e.g., H3K9me3. Other tools such as JAMM [78] and PePr [79] integrate information from biological replicates to determine enrichment site widths in neighboring narrow peaks, whereas GLITR [80] and PeakSeq [81] use tag extension – i.e., extension of ChIP-seq tags along their strand direction – to identify genomic regions enriched in sequence reads. The selection of the background distribution used in the comparison with the analyzed sample is also an essential step in peak calling. Although there is no consensus on the best background distribution, different datasets have been used as control samples, such as ChIP-seq data for histone H3, or data from experiments using a control antibody against nonbinding proteins, such as immunoglobulins [66, 82]. The following steps during peak calling include the selection of the statistical criteria for identifying enriched peaks, which usually correspond to a specific cutoff for the enrichment of peaks relative to the background, or the estimation of metrics with more statistical support, such as the false discovery rate (FDR). Once enriched peaks are identified for a selected number of genes, or genome wide, most peak-calling algorithms allow ranking and selecting the most significant peaks by estimating their corresponding p-values and q-values. Despite the great variety of peak-calling toolkits for analyzing ChIP-seq data, comparisons of the performance of different approaches show that different programs produce very different peaks in terms of size, number, and position relative to genes [83, 84] when presented with the same input dataset. Thus, as different tools usually generate significantly different epigenomic profiles, peak calling of ChIP-seq data remains a difficult task, and the selection of the best-performing method usually depends on the species, sample conditions, and target proteins [43].
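
To illustrate the sliding-window rationale shared by many of these tools, the sketch below scores binned ChIP read counts against a local Poisson background, echoing the dynamic local lambda idea of MACS. It is a didactic simplification rather than any published algorithm, and all names and parameters are ours:

```python
import numpy as np
from scipy.stats import poisson

def candidate_peaks(chip, control, genome_lambda, halfwidth=5, cutoff=1e-5):
    """Flag bins whose ChIP read count is improbable under a Poisson
    background whose rate is the larger of the genome-wide average and
    the local control average (a MACS-style dynamic lambda).

    chip, control: numpy arrays of reads per fixed-width bin."""
    peaks = []
    for i, count in enumerate(chip):
        lo, hi = max(0, i - halfwidth), min(len(control), i + halfwidth + 1)
        lam = max(genome_lambda, control[lo:hi].mean())
        pval = poisson.sf(count - 1, lam)  # P(X >= count) under the background
        if pval < cutoff:
            peaks.append((i, pval))
    return peaks
```

A real peak caller would additionally merge adjacent significant bins into peak intervals and correct the resulting p-values for multiple testing, e.g., via the FDR discussed above.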

The bioinformatics analysis of histone posttranslational modification profiles obtained with MS depends significantly on the specific MS approach used – e.g., top-down, bottom-up, or middle-down [47, 49]. The preprocessing of MS data to remove false fragment ion assignments can be performed with different programs, such as Thrash [85], MS-Deconv [86], or YADA [87]. These approaches can also be used to deconvolute multiply charged ion signals into mono-charged ion mass values from bottom-up MS profiles, but they are unable to produce good results for other approaches generating longer peptides [88]. Unlike immunoprecipitation techniques, in which PTM-specific antibodies are used to profile one histone modification per experiment, the analysis of cell lysates with MS has the added difficulty of having to deal with the profiles of all histone modifications simultaneously. Due to the huge combinatorial complexity of this problem, current approaches concentrate on the most common histone PTMs [47], which might overlook unknown but functionally relevant modifications. Top-down and middle-down proteomics strategies require specialized search algorithms and annotation tools, owing to the great complexity of the MS spectra generated for intact or large polypeptides [89]. Methods such as ProSight PTM [90], MS-Align+ [91], ROCCIT [92], and MLIP [93] are specifically suited for performing database sequence searches from neutral mass lists of precursor and fragment ions obtained with top-down approaches. Different implementations of the THRASH algorithm [85] have been adapted for top-down histone modification profiling [94, 95], as has the MS-Deconv tool [86], developed specifically to analyze MS spectra from complete proteins. These methods offer a number of functionalities for guiding the search for specific modifications, allowing a significant reduction of the search space, which can increase the significance of assigned peaks. Other tools tackle the complex problem of discriminating histone PTM fragments with very similar ion masses [93, 96]. One such tool is VEMS [97], which can discriminate between acetylated and trimethylated lysines (see the illustrative mass calculation after this paragraph). In summary, mass spectrometry constitutes a very powerful approach for the global profiling of histone modifications, but there is still a need for more accurate bioinformatics approaches to allow a more comprehensive and thorough study of MS histone modification spectra.
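
The mass calculation referred to above shows why this discrimination is demanding: acetylation and trimethylation add nearly identical monoisotopic masses to a lysine residue, so very high resolving power is needed to tell them apart. An illustrative computation (the monoisotopic mass increments are standard values; the peptide mass chosen is arbitrary):

```python
# Monoisotopic mass increments (Da) of the two near-isobaric lysine marks
ACETYL = 42.010565     # +C2H2O
TRIMETHYL = 42.046950  # +3 x CH2
delta = TRIMETHYL - ACETYL  # ~0.0364 Da

# Resolving power needed to separate the two marks on a 1000 Da peptide
mz = 1000.0
print(f"mass difference: {delta:.4f} Da")
print(f"required resolving power at m/z {mz:.0f}: {mz / delta:,.0f}")  # ~27,500
```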

3 Bioinformatics Approaches for Analyzing Genome-Wide Methylation Profiling

DNA methylation, the only epigenetic phenomenon involving a direct covalent modification of the DNA itself, can be profiled experimentally with bisulfite sequencing [98, 99], bisulfite microarrays [100, 101], and enrichment methods such as MeDIP-seq and MethylCap-seq [102,103,104]. Different computational approaches have been developed for processing genome-wide profiling data obtained with each of these techniques. In bisulfite sequencing, methylated cytosines are protected from the chemical modification – i.e., sulfonation – induced by treatment with bisulfite, while unmethylated cytosines are converted and appear as thymines after amplification and sequencing. Next, the reads obtained at the sequencing stage are mapped back to the reference genome, and the ratios of Cs to Ts are measured, representing the methylation levels of genomic regions. In principle, aligners such as those currently used for mapping ChIP-seq reads (see the previous section of this chapter) can be used for processing bisulfite sequencing reads, but in this case it is necessary to account for the underrepresentation of unmethylated Cs. Moreover, different approaches specifically suited for analyzing these data have been developed, comprising RRBSMAP [105], RMAP [106], GSNAP [107], and Segemehl [108], among others, collectively known as wildcard aligners. These tools offer functionalities for wildcarding Cs in the sequencing reads during alignment and for adjusting the alignment scoring matrices to accommodate base mismatches. Furthermore, wildcard aligners allow the efficient and fast alignment of reads to large genomic regions, although they tend to overestimate the methylation levels of highly methylated regions. A second group of tools (MethylCoder [109], BRAT [110], and Bismark [111]) follows a more straightforward strategy, leveraging well-established short-read alignment tools and using a three-letter alphabet – i.e., considering T, G, and A – in the alignment. Three-letter alignment approaches are less efficient for scanning large genomic regions, as a significant proportion of reads is filtered out of the alignment owing to the increased alignment ambiguity of the reduced alphabet. Once bisulfite sequence reads are aligned to the reference genome, the methylation levels of specific genomic regions can be estimated using variant-caller algorithms, which allow the quantitation of the frequencies of Cs and Ts. For instance, Bis-SNP [112] relies on a Bayesian inference approach that evaluates strand-specific base calls, base call quality scores, and experiment-specific bisulfite conversion efficiency to derive fairly accurate DNA methylation estimates. Faster variant callers have been developed, including MethylExtract [113], which implements a modified version of the VarScan algorithm [114], and BS-SNPer [115], based on a “dynamic matrix algorithm” and Bayesian modeling, which are able to process large quantities of genomic sequence.
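
To illustrate the three-letter strategy, the following minimal Python sketch shows the in-silico conversion applied to reads (and, analogously, to the reference) before alignment, together with the subsequent per-cytosine methylation estimate from aligned base calls. The function names are ours, and real aligners such as Bismark handle strand bookkeeping, quality filtering, and CpG context far more carefully:

```python
# Collapse C to T in the reads (and in the reference) so that
# bisulfite-induced C->T conversions no longer count as mismatches.
C_TO_T = str.maketrans("Cc", "Tt")
G_TO_A = str.maketrans("Gg", "Aa")  # equivalent conversion for reverse-strand reads

def convert_for_alignment(seq: str, reverse_strand: bool = False) -> str:
    return seq.translate(G_TO_A if reverse_strand else C_TO_T)

def methylation_level(n_c: int, n_t: int) -> float:
    """Per-cytosine methylation estimate after alignment: methylated
    cytosines are read as C, converted (unmethylated) cytosines as T."""
    total = n_c + n_t
    return n_c / total if total else float("nan")
```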

The most widely used bisulfite microarrays are the Illumina® Infinium Methylation Assays [100], which allow single-CpG-site resolution quantitative measurement of genome-wide methylation profiles. In this assay, cytosine methylation at CpG islands is detected by multiplexed genotyping of bisulfite-converted genomic DNA (this technique also relies on the selective bisulfite modification of unmethylated cytosines, as described above). The assay uses two site-specific probes, one for the methylated and another for the unmethylated state of each locus. The Infinium MethylationEPIC BeadChip Kit enables quantitative genome-wide profiling of almost 900,000 methylation sites at single-nucleotide resolution, with expert-selected coverage of up to 99% of RefSeq genes, 95% of CpG islands, and ENCODE enhancer regions. Given the great potential of this technology, it has been the focus of intense research for the development of proprietary and open-source bioinformatics tools for processing Illumina methylation arrays. The GenomeStudio software developed by the chip supplier enables differential methylation analysis for small-scale studies and includes advanced tools for the visualization of large amounts of data, plotting, and statistical analysis. The R/Bioconductor BeadArray toolkit [116] is also available for performing large-scale stand-alone analyses requiring more intensive calculation or parallel computing infrastructures. Infinium® arrays include multiple probes for performing sample-dependent and sample-independent data quality control, whose readouts are the input of packages like IMA [117] and LumiWCluster [118]. These tools use different approaches for removing noisy probes from the chip data: probes are straightforwardly filtered out based on a median detection p-value cutoff in the case of IMA, while LumiWCluster relies on a more sophisticated weighted likelihood model based on clustering methylation data. Background correction should also be performed to remove nonspecific signals and differences between replicates. This step can be performed with the GenomeStudio integrated Infinium package, but also with many other toolkits, such as lumi [119], limma [120], and BeadArray [116]. After the initial quality control, microarray data need to be normalized to remove random noise, technical artifacts, and the measurement variation inherent to microarrays. Normalization should be performed between different replicate array measurements – i.e., between-array normalization – and internally for each array – i.e., within-array normalization. This can be accomplished with HumMethQCReport [121] and lumi [119], which use spline and weighted scatter smoothing for normalizing methylation data, but there are also many other alternatives based on different statistical approaches [122]. Special attention should also be paid to scaling the signals obtained for the two different probe types used in this technique – i.e., the probes for methylated and unmethylated loci – which produce rather different signal distributions, owing to the bias towards CpG islands in the genome [100]. Peak rescaling is usually performed with methods such as SWAN [123], which implements a subset-quantile within-array normalization (SQN) procedure, similar in rationale to another study implementing a pipeline for processing Illumina® Infinium Methylation BeadChips [124].
Other approaches use variations of this procedure, such as mixture quantile normalization, to rescale the distributions of the methylated and unmethylated probes into distributions that can be compared statistically [125, 126]. Batch effects, which are also common on DNA methylation arrays, can be corrected with R/Bioconductor packages such as CpGassoc [127], MethLAB [128], and ISVA [129].
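
Most of the normalization and differential analysis steps above operate on one of two summary quantities derived from the methylated (M) and unmethylated (U) probe intensities. A minimal sketch of both follows; the offset of 100 in the beta value is Illumina's customary default, and the alpha of 1 in the M-value is a common choice for stabilizing the log-ratio:

```python
import numpy as np

def beta_value(meth, unmeth, offset=100):
    """Illumina-style beta value, bounded in [0, 1): the offset stabilizes
    estimates when the total intensity is low."""
    return meth / (meth + unmeth + offset)

def m_value(meth, unmeth, alpha=1.0):
    """M-value: log2 ratio of intensities, whose roughly homoscedastic
    behavior suits linear-model tools such as limma better than betas."""
    return np.log2((meth + alpha) / (unmeth + alpha))
```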

Enrichment techniques, such as MeDIP-seq and MethylCap-seq [102,103,104], are based on the use of proteins that specifically bind methylated DNA regions – e.g., 5-methylcytosine-specific antibodies [104, 130] (methylated DNA immunoprecipitation, MeDIP) or methyl-binding domain proteins [131, 132] (MethylCap) – to enrich hypermethylated fragments, which are then subjected to HTS or microarray analysis. The bioinformatics processing of methylation data generated with these approaches can be performed with the same methods described above for sequencing or microarray platforms. Moreover, some methods are exclusively tailored to enrichment data, like MEDIPS [133], an R/Bioconductor suite that enables processing multiple replicates and performing a great variety of statistical analyses. Another toolkit, Batman [102] (“Bayesian tool for methylation analysis”), relies on the knowledge that almost all DNA methylation in mammals occurs at CpG dinucleotides and uses a standard Bayesian inference approach to estimate the posterior distribution of the methylation state parameters from the data, generating quantitative methylation profiles. A very interesting study built on a thorough comparison of more than 20 different software tools has resulted in the development of RnBeads [134], an integrative suite that supports all genome-scale and genome-wide DNA methylation assays and is implemented to facilitate the stand-alone running of complex pipelines on high-performance computing infrastructures. With this toolkit, it is possible to perform all the steps of DNA methylation data analysis, from data visualization and quality control to the handling of batch effects, correction for tissue heterogeneity, and differential DNA methylation analysis.

4 Computational Analysis of Chromatin Accessibility Data

The chromatin accessibility of genomic regions can be profiled with methodologies such as DNase-seq [135], FAIRE-seq [136], and ATAC-seq [137], which rely on different experimental principles and produce rather different data outputs. DNase-seq and ATAC-seq are based on the use of endonucleases – i.e., DNase I and an engineered Tn5 transposase, respectively – to fragment DNA, while FAIRE-seq relies on formaldehyde cross-linking of chromatin followed by physical fragmentation by sonication. The differences between the DNA fragmentation procedures used in each technique – i.e., DNase I and the engineered Tn5 transposase tend to cleave some DNA sequences more efficiently than others, and sonication can produce under- and over-sonicated chromatin depending on the sonication parameters used – mean that each technique generates rather different accessibility profiles [138]. Accordingly, these differences should be taken into consideration in the downstream bioinformatics processing of the sequencing data. Chromatin accessibility peaks are generally different from the peak signals generated in histone modification ChIP-seq experiments, which are in general broad sequence read peaks. Hence, peak callers designed for ChIP-seq need some fine-tuning for processing chromatin accessibility data [138, 139]. Furthermore, ChIP-seq data usually show a higher signal-to-noise ratio than DNase-seq data, making ChIP-seq peaks easier to detect [140]. Different peak callers have been developed to process accessibility data, including the F-Seq toolkit [74], which can be used for ChIP-seq and FAIRE-seq data [141], and ZINBA [142], which relies on a mixture regression approach for probabilistically distinguishing real from artifact peaks and can also handle ChIP-seq and FAIRE-seq data. Moreover, the Hotspot program [143], developed as part of the ENCODE project specifically for analyzing DNase-seq data, follows a rationale similar to the ChIP-seq sliding-window peak callers described above, using a probabilistic model to classify peaks by assessing the differences between the sample and a background distribution. MACS [73], commonly used for ChIP-seq data, and ChIPOTle [56], suited for processing ChIP-on-chip data as described above, have also been used for DNase-seq [144] and FAIRE-seq [136], respectively. In general, most of these tools have also been applied to ATAC-seq data analysis, but some tools have been specifically implemented for this novel technique, such as I-ATAC (https://www.jax.org/research-and-faculty/tools/i-atac), which integrates multiple methods for quality checking, preprocessing, and running sequential, multiple-parallel, and customized data analysis pipelines in a cross-platform, open-source desktop application. Interestingly, the choice of peak caller can play a key role in the peak assignment output, as a comparison of the most common tools for processing accessibility data has shown that there is little overlap among the peaks called on the same chromatin accessibility dataset [140].
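
One preprocessing detail specific to ATAC-seq deserves illustration: because Tn5 inserts as a dimer and duplicates 9 bp of the target sequence, aligned reads are conventionally shifted by +4 bp on the forward strand and -5 bp on the reverse strand to recover the actual insertion points. A minimal pysam-based sketch (the function name is ours):

```python
import pysam

def tn5_cut_site(read: pysam.AlignedSegment) -> int:
    """Approximate Tn5 insertion point of an aligned ATAC-seq read,
    using the conventional +4/-5 strand-specific offsets."""
    if read.is_reverse:
        return read.reference_end - 5  # reference_end points one past the last aligned base
    return read.reference_start + 4
```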

5 Epigenomic Databases and Epigenome Mapping Initiatives

The great developments in high-throughput sequencing technologies have allowed the steady generation of large quantities of epigenomic data in different cell types/lines and multiple organisms. This has been boosted by many large-scale epigenome mapping projects, such as the ENCODE project [145], the NIH Roadmap Epigenomics program [146], the International Human Epigenome Consortium (http://ihec-epigenomes.org/), and the HEROIC European project (http://cordis.europa.eu/project/rcn/78439_en.html), among others. Other resources, such as the MethBase database (http://smithlabresearch.org/software/methbase/) [147], encompassing hundreds of methylomes from different organisms, allow comparing the methylation profiles of genomic regions in different animal and plant genomes. There are also more specialized epigenomic projects and databases focused on the brain. These neuroepigenomic resources include the MethylomeDB database (http://www.neuroepigenomics.org/methylomedb) [148], which contains genome-wide DNA methylation profiles of human and mouse brain and is integrated with a genome browser that allows navigating the genome, analyzing the methylation of specific loci, searching for specific methylation profiles, and comparing methylation patterns between individual samples. BrainCloud (http://braincloud.jhmi.edu/) [149] compiles methylation data from human postmortem dorsolateral prefrontal cortices from normal subjects across the life span, also integrating single-nucleotide polymorphism data. The great amount of data generated in these projects has prompted the development of a great variety of computational tools for the analysis of epigenetic data, some of which have been described in detail in previous sections of this chapter. Moreover, the wealth of data in these databases has enabled groundbreaking studies, such as a recent report [38] encompassing a thorough integrative study of different epigenetic phenomena – e.g., chromatin accessibility, DNA methylation, chromatin marks, and gene expression – in different reference epigenomes. In this study, the authors profiled cells from different tissues and organs in more than 100 adult and fetal epigenomes and were able to identify epigenetic differences arising during lineage specification and cellular differentiation, modules of regulatory regions with coordinated activity across cell types, and the role of regulatory regions in human diseases associated with common traits and disorders [38]. This study shows that genomic regions vary greatly in their association with active marks, with approximately 5% of each epigenome marked by enhancer or promoter signatures, showing increased association with expressed genes and increased evolutionary conservation, while two-thirds of each reference epigenome are quiescent and enriched in gene-poor, stably repressed regions [38]. Furthermore, the authors found that genetic variants associated with complex traits are highly enriched in the epigenomic annotations of trait-relevant tissues and that genome-wide association enrichments are strongest for enhancer-associated marks, consistent with their highly tissue-specific nature [38]. However, promoter-associated and transcription-associated marks were also enriched, implicating several gene-regulatory levels in the genetic variants associated with complex traits [38].

6 Epigenetic Differential Analysis and Integration of Epigenomic and Gene Expression Data

Despite the great wealth of epigenomic data, we still lack a systematic understanding of how the epigenomic landscape regulates gene expression and of which epigenetic signatures control the most important regulatory circuitry at the transcriptional level. Differential analysis of ChIP-seq genome-wide profiles obtained for different cellular phenotypes is a rather challenging problem, owing to the significant heterogeneity in peak calling between different measurements and the lack of overlap between peak assignments obtained with different peak callers [140]. The diffReps program [150] has been designed to detect differential sites from ChIP-seq data, with or without biological replicates, and implements a sliding-window approach to estimate the statistical significance of differential peaks across samples based on a binomial distribution model. The differential histone modification profiles generated with diffReps can then be superimposed on gene expression data. The GeneOverlap R/Bioconductor package implements different statistical models for estimating the significance of the overlap between histone modification and gene expression profiles. However, the great complexity of the histone code, and the cross talk established between different histone marks to cooperatively regulate gene expression, make it difficult to capture the regulatory epigenetic mechanisms simply by superimposing histone modification and gene expression data. More complex computational models for predicting gene expression from histone modification profiles have therefore been proposed [151, 152]. In order to reproduce the quantitative relationship between gene expression levels and histone modifications, these approaches combine information from many different data tracks of repressive and activating chromatin modifications; processed with machine learning approaches, they were able to explain a fairly high proportion of the gene expression profiles in different organisms [151, 152]. In more complex expression datasets, such as brain tissues, similar approaches for combining histone modification data [153] have not been able to obtain a good correlation with the observed gene expression profiles, which could be related to the great complexity of gene regulation in these heterogeneous tissues and to the regulatory role of histone marks not included in the study.
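
As a minimal illustration of this class of quantitative models – not the actual implementations of [151, 152], which use richer feature sets, feature selection, and cross-validation – a linear fit of log expression on log-transformed histone-mark signals can be written in a few lines:

```python
import numpy as np

def fit_marks_to_expression(signals, expression):
    """Ordinary least-squares fit of log expression on log-transformed
    histone-mark signals (a genes x marks matrix), returning the
    per-mark weights and the fraction of variance explained (R^2)."""
    X = np.column_stack([np.ones(len(signals)), np.log1p(signals)])
    y = np.log1p(expression)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1.0 - np.var(y - X @ coef) / np.var(y)
    return coef, r2
```

The sign and magnitude of the fitted weights then indicate which marks behave as activating or repressive in the dataset at hand.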

The prediction of epigenetic states has also been the focus of intense research. Several computational approaches have been devised for predicting promoter regions (extensively reviewed in [154]), CpG islands [155, 156], DNA methylation [157, 158], and nucleosome positioning [159, 160]. However, with the advent of next-generation sequencing (NGS), whose combination with techniques for profiling chromatin accessibility, histone modifications, and DNA methylation has allowed the generation of huge quantities of genome-wide epigenetic data, the prediction of epigenetic states has lost relevance. Nevertheless, a different group of approaches has been developed that leverages genome annotation data at the epigenetic level to predict chromatin states – e.g., poised or strong enhancers, active promoters, and heterochromatin, among others – from histone modification data [161, 162]. ChromHMM [161] relies on a multivariate hidden Markov model that represents the observed combination of chromatin marks as a product of independent Bernoulli random variables in order to segment the genome into regions with different chromatin states. Segway [162] accepts not only histone modification data but also DNA methylation and chromatin accessibility data, and implements a dynamic Bayesian network model for hierarchical genome segmentation. Interestingly, ChromHMM and Segway can be used to process fairly complex experimental datasets and perform chromatin state assignments, which have provided key insights in transversal epigenomic studies of different cell types, tissues, and human populations [38, 163, 164].
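
The emission side of ChromHMM's model is simple to write down: given binary presence calls for each mark in a genomic bin (200 bp by default), the probability of the observation under a given state is a product of independent Bernoulli terms. A minimal sketch (mark names and probabilities are hypothetical; the full model adds state transition probabilities and expectation-maximization learning across all bins):

```python
import numpy as np

def state_emission_prob(obs, emit_p):
    """Probability of a binary mark vector under one chromatin state,
    modeled as a product of independent Bernoulli variables.

    obs:    0/1 presence calls for each mark in one genomic bin.
    emit_p: the state's per-mark emission probabilities."""
    obs, emit_p = np.asarray(obs), np.asarray(emit_p)
    return float(np.prod(np.where(obs == 1, emit_p, 1.0 - emit_p)))

# A hypothetical enhancer-like state emitting H3K4me1 and H3K27ac often
# and H3K9me3 rarely:
print(state_emission_prob([1, 1, 0], [0.90, 0.85, 0.05]))  # ~0.727
```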

7 Systems Biology Approaches and Reconstruction of Multilevel Regulatory Networks

The availability of highly detailed annotations of the human and mouse genomes [38, 145, 146] has paved the way for studies integrating multilevel biological data, encompassing epigenetics, DNA sequence variation, gene expression, and clinical data. The regulatory events triggering phenotypic transitions such as cellular differentiation, and the dysfunctions associated with disease onset and progression, are usually mediated by multiple genes, which establish complex interaction networks. Thus, in order to understand the regulatory mechanisms at the epigenetic and transcriptional levels involved in the regulation of these cellular phenotypes, it is necessary to derive more comprehensive systems-level computational models. For such large-scale molecular datasets, several network approaches have been developed to identify and dissect the underlying “interactomes” and to discover key mechanisms and causal regulators in normal or pathological biological systems [165]. Gene regulatory Boolean network models have been very useful for the systems-level modeling of complex high-throughput biological data, enabling the construction of complex interaction networks for studying disease mechanisms [166]. Disease network models have been essential for predicting disease-related genes based on the analysis of different topological characteristics, such as node connectivity [167, 168], gene-gene interaction tendencies in specific tissues [169], or network neighbors of disease-related genes [170, 171]. A different group of approaches models cellular phenotypes as attractors in the gene expression landscape: phenotypic transitions are modeled by identifying nodes destabilizing these attractors [172,173,174], and disease perturbations, such as chemical compounds or mutations, can cause a switch from a healthy to a disease attractor state [175,176,177]. Co-expression-based network inference approaches [178, 179] have also been used to build regulatory network models from HTS data. Weighted gene co-expression network analysis (WGCNA) models [180] – for which a widely used and very efficient R package is available [181]; the core quantities of this formalism are sketched at the end of this section – encode important information on the underlying relationships and interactions among genes and have been widely used to identify disease-causing genes in multigene human pathologies, such as autism [182,183,184] and Alzheimer’s disease [185, 186]. These WGCNA formalisms allow the generation of fairly complex network representations – e.g., eigengene networks [187, 188], in which the nodes are composite network modules. WGCNA models have enabled the identification of an age-related co-methylation module present in multiple human tissues, including blood and brain, from the analysis of up to 2442 Illumina DNA methylation arrays [189]. Similarly, these approaches have been used to identify common methylation patterns correlated with age in identical twins [190], to characterize the upstream epigenetic control and downstream cellular physiology associated with alcohol dependence and neuroadaptive changes in the alcoholic brain [191], and to predict the co-methylation modules associated with Huntington’s disease pathogenesis [192].
The further development of the abovementioned integrative and other multiscale network modeling approaches, which integrate complex, multidimensional biological data to infer regulatory relationships linking different regulatory levels – e.g., DNA sequence variation, epigenetics, transcription, and metabolism – will be key to gaining a deeper understanding of disease onset and progression, and of other important biological processes, such as development.
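
For readers unfamiliar with the WGCNA formalism referred to above, its two core quantities – the soft-thresholded adjacency and the topological overlap matrix (TOM) on which module detection is based – can be sketched in a few lines of numpy. This is a didactic simplification; the R package [181] adds scale-free topology diagnostics for choosing the power beta, hierarchical clustering of 1 - TOM into modules, and eigengene computation:

```python
import numpy as np

def wgcna_adjacency(expr, beta=6):
    """Unsigned WGCNA adjacency a_ij = |cor(x_i, x_j)|**beta, where expr is
    a samples x genes expression matrix and beta is the soft-thresholding
    power."""
    adj = np.abs(np.corrcoef(expr, rowvar=False)) ** beta
    np.fill_diagonal(adj, 0.0)
    return adj

def topological_overlap(adj):
    """Topological overlap matrix: the similarity of two genes' network
    neighborhoods, used as input for module detection."""
    shared = adj @ adj                   # sum_u a_iu * a_uj
    k = adj.sum(axis=0)                  # node connectivities
    denom = np.minimum.outer(k, k) + 1.0 - adj
    tom = (shared + adj) / denom
    np.fill_diagonal(tom, 1.0)
    return tom
```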

8 The Advent of the Single-Cell Era in Neuroepigenetics: Challenges for Analyzing Single-Cell Epigenomic Data

The great technological advances in methodologies for generating high-quality genome-wide epigenomic data have caused a revolution in the study of the epigenetic mechanisms regulating gene expression, stem cell differentiation, disease onset and progression, and other key biological phenomena. These developments have also contributed to the emergence of the field of “neuroepigenetics,” aimed at studying the epigenetic regulatory mechanisms in cells of the central nervous system. It has been shown that in neurons, which live throughout most of the life span of an animal, epigenetic mechanisms play a key role in the regulation of the complex metabolic and gene expression changes these cells must undergo upon synaptic input or interactions with other nervous system cells [193, 194]. One of the main problems in studying cells from the mammalian nervous system is disentangling the great cellular heterogeneity of brain tissues [195,196,197]. In this regard, most neuroepigenomic studies conducted so far have been performed with the traditional techniques for profiling chromatin accessibility, histone modifications, and DNA methylation described in this and other chapters of this book. These approaches require input samples containing hundreds of thousands or millions of cells, encompassing highly heterogeneous cell populations. In recent years, different experimental techniques have been developed for studying heterogeneous cell populations. Single-cell transcriptional profiling techniques, first developed 20 years ago [198], have become very popular and are now conventionally used in most laboratories, thanks to great technological developments in cell capture and next-generation sequencing approaches. The application of single-cell transcriptomics has been central to the study of gene expression and functional diversity in somatosensory neurons from the dorsal root ganglia [199, 200], in different cortical regions [197, 201, 202], and in the developing retina [203].

Different single-cell epigenomic approaches have recently been developed for high-throughput genome-wide mapping of DNA methylation, histone modifications, and chromatin accessibility. The single-cell reduced-representation bisulfite sequencing (scRRBS) technique [204] is highly sensitive and can detect the methylation status of up to 1.5 million CpG sites within the genome of an individual cell. This technique is very efficient for profiling promoter regions, though it has poor coverage in enhancer regions. Bisulfite single-cell sequencing approaches enable genome-wide profiling of single cells or very small cell populations, although with rather low sequencing coverage [205, 206]. Histone modifications can be profiled in single cells with different barcoding approaches, which index the regions bearing the posttranslational modification in individual cells with specific sequence tags and then perform the ChIP-seq measurement after pooling cells from different wells – i.e., the heterogeneous population – thereby alleviating the input sample requirements of ChIP-seq [207, 208]. A different technique, the nano-ChIP-seq protocol [209], combines a high-sensitivity small-scale ChIP assay with the generation of HTS libraries tailored to scarce amounts of ChIP DNA. Recently, the single-tube DNA amplification method (LinDA) has been developed, enabling ChIP-seq measurements from picogram amounts of DNA obtained from a few thousand cells [210]. Chromatin accessibility can be profiled in single cells with a modification of the ATAC-seq approach based on combinatorial indexing, in which populations of nuclei are barcoded in different wells and accessibility profiling is performed after pooling [211] (a back-of-the-envelope estimate of the barcode collision rate inherent to this strategy is given at the end of this section). Another methodology available for single-cell chromatin accessibility profiling is based on a programmable microfluidics platform that captures and analyzes cells in specific microfluidic chambers [212]. These methodologies are still under development, with improvements in single-cell isolation [203, 213] and single-molecule sequencing techniques [214, 215] aimed at increasing the reliability of the measurements and the sequencing coverage. The application of these approaches to central nervous system samples will be essential for obtaining a clearer picture of the epigenetic regulatory mechanisms in neurons from different brain regions and of how heterogeneity at the epigenetic level defines different circuitries at the transcriptional regulatory level in central nervous system cells. However, the computational analysis of single-cell epigenomic data poses many challenges that will be the focus of intense research in the coming years, in order to match the great developments in experimental techniques. Currently, the computational tools and approaches used for processing single-cell epigenomic data are essentially those developed for bulk measurements, which have been thoroughly discussed in this chapter. Nevertheless, it is crucial to develop computational methods tailored specifically to single-cell data, to tackle the problems of normalization and cell-type identification and to dissect the levels of variability across cells [216]. It is expected that such methods will be developed in the next few years, leading to new discoveries in areas ranging from the physiology of tissues to systems biology [216].
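
Finally, the back-of-the-envelope estimate promised above: in combinatorial indexing, the number of barcode combinations bounds how many nuclei can be pooled before two nuclei are likely to share a barcode and become indistinguishable. A birthday-problem approximation, with an illustrative 96 x 96 two-round plate layout:

```python
def barcode_collision_rate(n_nuclei: int, n_combinations: int) -> float:
    """Expected fraction of nuclei sharing their barcode combination with
    at least one other nucleus, assuming uniformly random assignment."""
    return 1.0 - ((n_combinations - 1) / n_combinations) ** (n_nuclei - 1)

# Two 96-well indexing rounds give 96 * 96 = 9216 combinations; pooling
# ~1000 nuclei keeps the expected collision rate near 10%.
print(f"{barcode_collision_rate(1000, 96 * 96):.1%}")  # ~10.3%
```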