Abstract
Technological advances in the last decade have resulted in an explosion of biological data. Sequencing methods in particular provide large-scale data sets as a resource for incorporating machine learning in the biological field. By measuring DNA accessibility, for instance, enzymatic hypersensitivity assays facilitate the identification of regions of open chromatin in the genome, marking potential locations of regulatory elements. ATAC-seq is the primary method of choice to determine these footprints. It allows measurements at the cellular level, complementing the recent progress in single-cell transcriptomics. However, as the method-specific enzymes tend to bind preferentially to certain sequences, the accessibility profile is confounded by binding specificity. The inference of open chromatin should be adjusted for this bias [1].
To enable such corrections, we built a deep learning model that learns the sequence specificity of ATAC-seq’s enzyme Tn5 on naked DNA. We found binding preferences and demonstrate that cleavage patterns specific to Tn5 can be discovered by means of convolutional neural networks. In the future, such models can be combined with accessibility analysis to predict the bias on new sequences and to provide a better picture of the regulatory landscape of the genome.
Keywords
- Single-cell ATAC-seq
- Convolutional neural networks
- Deep learning
- Sequence preference bias
- Regulatory element discovery
1 Introduction
It is still not completely understood how heterogeneity can arise from identical nucleotide sequences. Epigenetic modifications such as DNA methylation and histone alterations are implicated in additionally regulating gene expression without inducing changes in the nucleotide sequence, representing one association between variants in genome regions and phenotypic variation. Characterizing transcription factor binding sites across the genome and identifying the positions of regulatory elements that can increase or decrease transcription rates will further our understanding of gene regulation. These modifications can only take place if DNA- and RNA-binding proteins can access the genome. Enzymatic hypersensitivity assays make it possible to measure the accessibility of chromatin to the transcription machinery and to identify relevant DNA sites. They are useful tools for building a genome-wide landscape of epigenetic regulatory structures.
Interestingly, many epigenetic events have been linked to non-coding RNA. However, the functional description and interpretation of non-coding variants is still lacking [2]. Since deep learning methods are able to learn functional relationships from data without having to explicitly define them beforehand, these approaches can overcome the still limited description of non-coding regions. They are in particular suited to learn sequence specificity from experimental data and have been applied for detection of transcription factors in nucleotide sequences [3, 4]. Here, we build a deep learning model based on neural networks for the purpose of gaining insights on enzyme-specific sequence preferences.
1.1 Assay for Transposase Accessible Chromatin Using Sequencing
Chromatin is found in eukaryotic cells and has a main function of packaging DNA into a more compact form in order to prevent DNA damage. It is also involved in regulation of gene expression and DNA replication, as it controls the density of the DNA at different cell cycle states by its structural changes. It has been shown that there is a certain heterogeneity in not only the gene expression patterns but also the chromatin structure across different cell types, indicating differences in accessibility of the genome to the transcription machinery based on cellular profiles [5].
In recent years more and more single-cell sequencing methods have been introduced, empowering researchers to pinpoint changes at the highest resolution. One approach to study chromatin accessibility is the Assay for Transposase Accessible Chromatin using sequencing (ATAC-seq) described by Buenrostro et al. [6]. A basic scheme of this method is outlined in Fig. 1.
ATAC-seq is based on the ability of the Tn5 transposase to access open regions of the DNA. Transposases are enzymes that bind transposons, cut them out of their region and catalyse their movement across the genome. The hyperactive Tn5 transposase used in this assay cuts the DNA and simultaneously ligates synthetic sequencing adapters into the accessible sites. Tagging these sites enables the characterization and quantification of open genome regions after high-throughput sequencing.
Single-cell ATAC-seq provides a genome-wide map of sequence accessibility. This picture of the regulatory landscape can be used to explore multiple questions, e.g. the quantification of accessibility changes across different populations. Based on the position and length distribution of the adapter insertions, one can gain insight into nucleosome positions in regulatory regions and the interplay of genomic sites with DNA binding factors [7].
The sequence specificities of Tn5 are still poorly understood. Computational DNA footprinting is frequently used to investigate the regions associated with DNA binding proteins. Sequence signatures obtained by footprinting methods can be distorted by preferences specific to the enzyme used in the assay, and these preferences should be considered when interpreting the signals. Efforts have been made to correct for these artefacts in ATAC-seq by incorporating Hidden Markov Models [8] or via mathematical models, e.g. by scaling experimental data by the ratio of expected insertion events for a sequence window to the observed occurrences [1]. We evaluated whether we can identify a binding bias and aim to learn potential cleavage preferences of Tn5 by applying deep learning algorithms.
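The ratio-based scaling mentioned above can be illustrated with a minimal sketch. It is not the implementation of [1]: the function names, the choice of k, and the idea of representing each insertion event by the k-mer around it are assumptions made purely for illustration.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer over all sliding windows of the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def scale_factors(background_seqs, cut_site_kmers, k):
    """Per-k-mer weight = expected frequency / observed frequency.

    expected: k-mer frequency in the background sequences (hypothetical input)
    observed: k-mer frequency among the windows around insertion events
    """
    expected = kmer_counts(background_seqs, k)
    observed = Counter(cut_site_kmers)
    n_exp = sum(expected.values())
    n_obs = sum(observed.values())
    factors = {}
    for kmer, obs in observed.items():
        exp = expected.get(kmer, 0)
        if exp == 0:
            continue  # k-mer absent from the background: left unscaled here
        factors[kmer] = (exp / n_exp) / (obs / n_obs)
    return factors
```

A k-mer that is cut more often than its background frequency suggests receives a factor below 1, so reads at such sites would be down-weighted, and vice versa.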
1.2 Convolutional Neural Networks in Genomics
Neural networks are one prominently applied approach to capturing structures of interest in biological sequences. During the learning phase the connections between the neurons are refined, and relevant features in the data can be extracted without the need to define them manually beforehand.
Although fully connected networks are a powerful machine learning tool, they do not scale well with increasing data size and carry the risk of overfitting when adjusting to minimal details of the input, for example an image. Instead, it is reasonable to focus on certain patterns. Convolutional neural networks were developed specifically for this application. They are inspired by an animal’s visual cortex and its receptive field, the region of a single sensory neuron in which a stimulus can affect the behaviour of the neuron, and the connections between neurons mirror this region. Each neuron is assigned to a local region of the image which it processes [9].
Compared to a fully connected network, additional filters in the form of so-called convolutional layers are applied in order to reduce dimensions. To this end, a window of the size of the predefined receptive field is slid over the input image. At each step the values located in the window are combined into a single number, which is then propagated to the next layer [10, 11].
In the context of biology, the first convolutional layer can be thought of as a motif scanner resembling position weight matrices (PWMs). During convolution each filter searches for its specific motif while sliding along a biological sequence. These networks are well suited to detecting motifs in sequence windows and thus enable binding classification and motif discovery. By adding more convolutional layers and abstracting the model further, additional information such as the spatial interaction between the outputs of the initial detectors and the local sequence context can be captured [4, 11].
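The analogy between a convolutional filter and a motif scanner can be made concrete with a small sketch. The filter weights and the example sequence below are hypothetical and chosen only to show the sliding-window scoring.

```python
ALPHABET = "ACGT"


def scan(seq, filt):
    """Slide a filter (width x 4 weights, columns ordered A, C, G, T) along seq.

    Looking up filt[j][index of base] is equivalent to the dot product of the
    filter with the one-hot encoded window, which is what a one-dimensional
    convolution computes at each offset.
    """
    width = len(filt)
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        scores.append(sum(filt[j][ALPHABET.index(b)] for j, b in enumerate(window)))
    return scores


# Hypothetical filter that responds most strongly to the motif "TAC".
tac_filter = [
    [0, 0, 0, 1],  # position 1 prefers T
    [1, 0, 0, 0],  # position 2 prefers A
    [0, 1, 0, 0],  # position 3 prefers C
]

scores = scan("GGTACGT", tac_filter)
best_offset = max(range(len(scores)), key=scores.__getitem__)
```

Here the highest score is reached at the offset where "TAC" starts, which is how a trained filter would flag an occurrence of its motif.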
2 Materials and Methods
To detect the transposase’s sequence bias in ATAC-seq experiments, it was important to consider a data set resulting from ATAC-seq performed on “naked” DNA, i.e. DNA that was purified such that no other molecules such as proteins or lipids were associated with it. It has been shown previously that the probability of a cleavage event by the Tn5 transposase is higher in DNA stretches void of nucleosomes or attached proteins [1, 12]. Particularly in the case of scATAC-seq, where genomic accessibility is in question, such purified DNA lessens the confounding effect introduced via binding preferences of the assay’s Tn5.
Moreover, clearer results on the cleavage bias are obtained as it avoids capturing molecular profiles that arise due to protein/DNA interaction.
We fetched a data set from GEO (accession number GSM1550786) [13] which contains the result of a scATAC-seq assay performed on FACS-sorted germ cells of Mus musculus. The data set consists of positions with known transposition events and the read density at the corresponding coordinates in the genome. The procedure by which the sequences were extracted is outlined in Fig. 3. Briefly, the 46 base pair sequences centered at each signal peak were selected as positive sequences. To create the corresponding negative counterpart with similar properties, windows of length 46 that did not overlap any read were considered.
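The window selection described above could be sketched as follows. This is an illustrative reading of the procedure, not the authors' pipeline: the function names, the per-base `coverage` list, and the handling of windows at chromosome ends are assumptions.

```python
WINDOW = 46  # window length in base pairs, as stated in the text


def centered_window(genome, peak, width=WINDOW):
    """Return the width-bp sequence centered at a peak position, or None if
    the window would run past either end of the sequence."""
    start = peak - width // 2
    end = start + width
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]


def candidate_negative_starts(coverage, width=WINDOW):
    """Start positions of all windows in which no position is covered by a read.

    coverage: per-base read depth along the chromosome (hypothetical input).
    """
    starts = []
    run = 0  # length of the current stretch of zero-depth positions
    for pos, depth in enumerate(coverage):
        run = run + 1 if depth == 0 else 0
        if run >= width:
            starts.append(pos - width + 1)
    return starts
```

Candidate negative windows obtained this way would still need to be subsampled, for example to match the GC content of the positive set.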
Therefore, the negative set was chosen such that it mirrors the size and the GC content of the sequences in the positive set and preferably contains sequences from across the whole chromosome. Finally, in order to focus on the sequences with the strongest signal, the 100 000 sequence windows with the highest read count values were selected as base training input for the model. The sequences were converted to a one-hot encoded representation as multidimensional arrays (tensors, see Fig. 4A). The CNN models were generated with the Python package keras because it offers a comprehensible and simple way of translating the desired CNN architecture into a machine-readable format [14].
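A minimal sketch of the preprocessing steps named above (GC content, selection by read count, one-hot encoding) is given below. The helper names and the channel order A, C, G, T are assumptions; the original code is not part of the text.

```python
def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA string as a len(seq) x 4 matrix.

    Bases outside the alphabet (e.g. N) become all-zero rows.
    """
    return [[1 if base == a else 0 for a in alphabet] for base in seq.upper()]


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def top_windows(windows, n):
    """Keep the n (sequence, read_count) pairs with the highest read count."""
    return sorted(windows, key=lambda w: w[1], reverse=True)[:n]
```

For a 46 bp window, `one_hot` yields a 46 x 4 matrix, which corresponds to the 4-channel input expected by a one-dimensional convolutional layer.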
For the purpose of identifying transposition sites of the Tn5 transposase, we designed a shallow CNN. The architecture was inspired by models that predict transcription factor binding based on ChIP-seq data. One of the first approaches to apply deep learning to the identification of protein binding sites was DeepBind [3]. By using a CNN with a single convolutional layer, the model can learn sequence specificities of DNA- and RNA-binding proteins from raw genomic sequences and detect known motifs. Given an appropriate architecture, such neural networks tend to outperform prevailing simpler models. The DeepBind model consists of the following layers:
1. Convolution: one-dimensional convolution over 4-channel input. Can be interpreted as a motif scan in the context of biological sequences.
2. Rectification: take output of the first layer and propagate maximal value. Applies activation threshold for a certain motif at every position. If a motif’s score at a position passes the threshold, the score is propagated.
3. Max Pooling: reduce input matrix to a vector with length = number of motif detectors by retaining the maximal score per detector.
4. Output: consists of two neurons corresponding to two classification results which are fully connected to the previous layer.
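The four layers listed above can be traced with a framework-free forward pass. This is a sketch under stated assumptions, not DeepBind's code or the keras model used here: all function names are hypothetical, rectification is written as max(0, score - threshold), all filters are assumed to share one width, and the output layer returns raw scores without a final activation.

```python
def conv1d(x, filters):
    """x: L x 4 one-hot matrix; filters: list of (width x 4) weight matrices.
    Returns a (L - width + 1) x n_filters matrix of motif scores."""
    width = len(filters[0])
    out = []
    for start in range(len(x) - width + 1):
        out.append([
            sum(x[start + j][c] * f[j][c] for j in range(width) for c in range(4))
            for f in filters
        ])
    return out


def rectify(scores, thresholds):
    """Pass a score on only by the amount it exceeds its filter's threshold."""
    return [[max(0.0, v - t) for v, t in zip(row, thresholds)] for row in scores]


def global_max_pool(scores):
    """Keep the maximal score per filter over all positions."""
    return [max(col) for col in zip(*scores)]


def dense(v, weights, bias):
    """Fully connected layer; weights has shape n_out x n_in."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]


def forward(x, filters, thresholds, out_w, out_b):
    """Convolution -> rectification -> max pooling -> two output neurons."""
    pooled = global_max_pool(rectify(conv1d(x, filters), thresholds))
    return dense(pooled, out_w, out_b)
```

In keras the same structure would typically be expressed with a one-dimensional convolutional layer, a global max pooling layer and a dense layer with two units.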
Finally an appropriate evaluation method to interpret the results of the CNN prediction was established. For this the data set was split into portions used at different times for a 3-fold cross validation and a final evaluation using a holdout set, which was not considered during hyper parameter tuning. The procedure is outlined in Fig. 4D.
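The evaluation scheme could be set up along the following lines. The holdout fraction, the random seed and the function name are assumptions, since the text only states that a holdout set and a 3-fold cross validation were used.

```python
import random


def split_holdout_and_folds(n_samples, holdout_fraction=0.1, k=3, seed=0):
    """Shuffle sample indices, set aside a holdout set, and build k
    (train, validation) index splits from the remaining samples."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_hold = int(n_samples * holdout_fraction)
    holdout, rest = idx[:n_hold], idx[n_hold:]
    folds = [rest[i::k] for i in range(k)]  # k roughly equal parts
    splits = []
    for i in range(k):
        val = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, val))
    return holdout, splits
```

Hyper parameters would be tuned on the cross-validation splits only, and the holdout indices would be touched once, for the final evaluation.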
3 Results
Since DeepBind’s application differs from what we aim for, some adjustments in layer sizes and number of neurons were made. It has been shown that increasing the number of convolutional kernels can improve the performance of motif-based tasks [11]. We therefore expanded our initial shallow model by adding a second convolutional layer (deeper model). The architectures are presented in Fig. 4.
In order to find the most appropriate properties for the models, different combinations of adjustable parameters of keras were compared in a grid hyper parameter search. The optimization was done on 10,000 sequences (excluding the holdout set), as a run of all combinations on the whole set proved to be too time consuming. The results can be seen in Fig. 5. The optimizers tested were stochastic gradient descent, RMSprop, adagrad, adadelta, adam and adamax, and the loss functions tested were binary crossentropy, mean absolute error, mean squared error, mean squared logarithmic error and poisson. Likewise, different filter lengths [3, 5, 16, 24, 32] and numbers of filters [16, 32, 64, 86, 100] were tested for their accuracy. Because computation time increases with the number of filters, 64 filters with a length of 24 were chosen, a configuration that reached sufficient accuracy, as shown in Fig. 6A.
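A grid search over the values listed above can be sketched as follows. The sketch assumes a full joint grid over all four parameters, which the text does not state explicitly. The strings follow the usual keras identifiers, but no keras call is made: `evaluate` is a hypothetical callback that would train a model and return its validation accuracy, and `toy_evaluate` is an artificial stand-in that does not reproduce any result of this work.

```python
from itertools import product

OPTIMIZERS = ["sgd", "rmsprop", "adagrad", "adadelta", "adam", "adamax"]
LOSSES = [
    "binary_crossentropy",
    "mean_absolute_error",
    "mean_squared_error",
    "mean_squared_logarithmic_error",
    "poisson",
]
FILTER_LENGTHS = [3, 5, 16, 24, 32]
FILTER_COUNTS = [16, 32, 64, 86, 100]


def grid_search(evaluate):
    """Try every parameter combination and return the best-scoring one."""
    best_cfg, best_score = None, float("-inf")
    for opt, loss, length, count in product(
        OPTIMIZERS, LOSSES, FILTER_LENGTHS, FILTER_COUNTS
    ):
        cfg = {
            "optimizer": opt,
            "loss": loss,
            "filter_length": length,
            "n_filters": count,
        }
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_evaluate(cfg):
    """Artificial scorer (smaller models score higher); for demonstration only."""
    return -(cfg["filter_length"] * cfg["n_filters"])


n_configs = len(OPTIMIZERS) * len(LOSSES) * len(FILTER_LENGTHS) * len(FILTER_COUNTS)
best_cfg, best_score = grid_search(toy_evaluate)
```

Under the joint-grid assumption there are 6 x 5 x 5 x 5 = 750 configurations, which illustrates why the search was run on a subset of the sequences.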
The hyper parameter search was executed on the shallow model. The resulting parameters were then re-applied to the deeper model, with slight adjustments. In contrast to the first model, the deeper one uses 64 filters of length 32 in a first convolutional layer and an additional 32 filters of length 5 in a second convolutional layer. For both models adam was adopted as optimizer and poisson as loss function. The activation function was kept fixed as the rectified linear unit because of its prominent use in comparable methods.
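For reference, the poisson loss as defined in keras is the mean of y_pred - y_true * log(y_pred + epsilon). A plain-Python sketch of that formula is shown below; the function name and the epsilon default are chosen for illustration.

```python
import math


def poisson_loss(y_true, y_pred, eps=1e-7):
    """Mean of y_pred - y_true * log(y_pred + eps) over all outputs."""
    terms = [p - t * math.log(p + eps) for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


# A prediction close to the label yields a lower loss than a distant one.
good = poisson_loss([1.0, 0.0], [0.9, 0.1])
bad = poisson_loss([1.0, 0.0], [0.1, 0.9])
```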
After deciding on these parameters, the evaluation as described in Fig. 4D was executed for both models separately. Their performances are presented in Fig. 6A and B. In both parts of the evaluation the deeper model with an additional convolutional layer containing short filters achieved better accuracy.
To further examine whether the score returned for an input sequence can be used as a measure of how reliable the prediction is, Fig. 6C shows the distribution of scores returned for sequences from the holdout set. Indeed, in the vast majority of cases a high score was output if a transposition was observed in the sequence and a low score if not. This indicates that the relevant signal could be captured by the established convolutional network.
The associated motifs in the genomic sequence that were learnt during training of model one can be extracted via the keras architecture. Applying the methods described for Basset [4], the position weight matrix for each kernel in the first convolutional layer could be reconstructed by accessing its learnt weights. These matrices were further used to generate the corresponding sequence logos; part of the results are shown in Fig. 6D.
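One way such a reconstruction could look is sketched below. It follows the activation-based idea associated with Basset [4], in which a filter's position weight matrix is built from the subsequences that activate it to more than half of its maximal value. Whether this matches the exact procedure used here is an assumption, as are the function name and the restriction to sequences over A, C, G and T.

```python
def filter_to_pwm(sequences, filt, alphabet="ACGT"):
    """Build a PWM for one filter from its strongly activating subsequences.

    filt: width x 4 weight matrix of a first-layer filter.
    Returns a width x 4 matrix of base frequencies, or None if the filter
    never produces a positive activation on the given sequences.
    """
    width = len(filt)
    hits = []  # (activation, subsequence) for every window
    for seq in sequences:
        for start in range(len(seq) - width + 1):
            sub = seq[start:start + width]
            score = sum(filt[j][alphabet.index(b)] for j, b in enumerate(sub))
            hits.append((score, sub))
    if not hits:
        return None
    max_score = max(score for score, _ in hits)
    if max_score <= 0:
        return None
    counts = [[0] * len(alphabet) for _ in range(width)]
    for score, sub in hits:
        if score > 0.5 * max_score:  # keep strongly activating windows only
            for j, b in enumerate(sub):
                counts[j][alphabet.index(b)] += 1
    return [[c / sum(col) for c in col] for col in counts]
```

Each row of the returned matrix sums to 1 and can be passed to a sequence logo tool to visualize the motif a filter has learnt.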
4 Discussion
It is essential to assess changes in the regulatory element landscape in pathological conditions to further grasp altered mechanisms in disease. Because researchers’ knowledge of these regions is still limited, it is hard to distinguish functional stretches from non-functional ones. In contrast to coding regions, they contain no encoded genes, which typically guide the functional interpretation of a sequence. Machine learning approaches make it possible to use these genome regions as input, to find underlying patterns intrinsic to regulatory elements, and to extend known functional loci into the still mostly undescribed genomic regions. Modification of chromatin accessibility is one of the explorable aspects that can be measured thanks to the advances in recent profiling techniques. The regions of open chromatin vary considerably across different cell types. Differentially open elements such as enhancers and promoters can have a vast effect on a cell’s signature, and their analysis therefore adds further insight to transcriptomics.
As previously described, the individual binding preferences of the enzymes used in these techniques can influence the peak distribution and complicate the interpretation of results [1]. Considering these confounding factors and correcting for them is essential for meaningful biological analysis.
We show that convolutional neural networks provide a powerful tool that can capture motifs that are predominantly bound by the ATAC-seq assay’s transposase Tn5. These preferences captured on purified DNA can be used to correct the confounded observations in new ATAC-seq experiments and make it possible to understand the biology underlying accessible regions of the genome, untarnished by technical artefacts.
Based on this outcome there are many directions that are worth looking into in more detail. The results presented were obtained using mouse germ cell lines. Still, as there is perceivable heterogeneity across different cell types, it might be interesting to consider these during training and bias correction in order to streamline the results more.
References
1. Martins, A.L., et al.: Universal correction of enzymatic sequence bias reveals molecular signatures of protein/DNA interactions. Nucleic Acids Res. 46(2), e9 (2018). https://doi.org/10.1093/nar/gkx1053
2. Costa, F.F.: Non-coding RNAs, epigenetics and complexity. Gene 410(1), 9–17 (2008). https://doi.org/10.1016/j.gene.2007.12.008
3. Alipanahi, B., et al.: Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015). https://doi.org/10.1038/nbt.3300
4. Kelley, D.R., Snoek, J., Rinn, J.L.: Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26(7), 990–999 (2016). https://doi.org/10.1101/gr.200535.115
5. Natarajan, A., Yardimci, G.G., Sheffield, N.C., Crawford, G.E., Ohler, U.: Predicting cell-type-specific gene expression from regions of open chromatin. Genome Res. 22(9), 1711–1722 (2012). https://doi.org/10.1101/gr.135129.111
6. Buenrostro, J.D., Giresi, P.G., Zaba, L.C., Chang, H.Y., Greenleaf, W.J.: Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10(12), 1213–1218 (2013). https://doi.org/10.1038/nmeth.2688
7. Buenrostro, J., et al.: Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015). https://doi.org/10.1038/nature14590
8. Li, Z., Schulz, M.H., Look, T., Begemann, M., Zenke, M., Costa, I.G.: Identification of transcription factor binding sites using ATAC-seq. Genome Biol. 20(1), 45 (2019). https://doi.org/10.1186/s13059-019-1642-2
9. Angermueller, C., Pärnamaa, T., Parts, L., Stegle, O.: Deep learning for computational biology. Mol. Syst. Biol. 12(7), 878 (2016). https://doi.org/10.15252/msb.20156651
10. A Guide to Convolutional Neural Networks. https://adeshpande3.github.io/A-Beginner‘s-Guide-To-Understanding-Convolutional-Neural-Networks/. Accessed 30 Apr 2020
11. Zeng, H., Edwards, M.D., Liu, G., Gifford, D.K.: Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics 32(12), 121–127 (2016). https://doi.org/10.1093/bioinformatics/btw255
12. Picelli, S., Björklund, A.K., Reinius, B., Sagasser, S., Winberg, G., Sandberg, R.: Tn5 transposase and tagmentation procedures for massively scaled sequencing projects. Genome Res. 24(12), 2033–2040 (2014). https://doi.org/10.1101/gr.177881.114
13. Pastor, W.A., Stroud, H., Nee, K., et al.: MORC1 represses transposable elements in the mouse male germline. Nat. Commun. 5, 5795 (2014). https://doi.org/10.1038/ncomms6795
14. Chollet, F., et al.: Keras. https://keras.io
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Ansari, M., Fischer, D.S., Theis, F.J. (2020). Learning Tn5 Sequence Bias from ATAC-seq on Naked Chromatin. In: Farkaš, I., Masulli, P., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2020. ICANN 2020. Lecture Notes in Computer Science(), vol 12396. Springer, Cham. https://doi.org/10.1007/978-3-030-61609-0_9
DOI: https://doi.org/10.1007/978-3-030-61609-0_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61608-3
Online ISBN: 978-3-030-61609-0
eBook Packages: Computer Science, Computer Science (R0)