1 Introduction

It is still not completely understood how heterogeneity can arise from identical nucleotide sequences. Epigenetic modifications such as DNA methylation and histone alterations are implicated in additionally regulating gene expression without changing the nucleotide sequence, providing one link between variation in genomic regions and phenotypic variation. Characterizing transcription factor binding sites across the genome and identifying the positions of regulatory elements that can increase or decrease transcription rates will further our understanding of gene regulation. These modifications can only take place where DNA- and RNA-binding proteins can access the genome. Enzymatic hypersensitivity assays make it possible to measure chromatin accessibility to the transcription machinery and to identify the relevant DNA sites, making them useful tools for building a genome-wide landscape of epigenetic regulatory structures.

Fig. 1.

The Tn5 transposase used in ATAC-seq integrates sequencing adapters into sites of open chromatin. Transposition into less accessible regions is less likely due to steric hindrance. In contrast to previous methods, which consist of multiple steps and long incubation times, scATAC-seq requires a far smaller number of sample cells and involves only two steps: Tn5 insertion and PCR. Finally, peak calling and footprinting are performed after high-throughput sequencing.

Interestingly, many epigenetic events have been linked to non-coding RNA. However, the functional description and interpretation of non-coding variants is still lacking [2]. Since deep learning methods are able to learn functional relationships from data without these having to be defined explicitly beforehand, such approaches can compensate for the still limited annotation of non-coding regions. They are particularly suited to learning sequence specificity from experimental data and have been applied to the detection of transcription factor binding in nucleotide sequences [3, 4]. Here, we build a deep learning model based on neural networks to gain insight into enzyme-specific sequence preferences.

1.1 Assay for Transposase Accessible Chromatin Using Sequencing

Chromatin is found in eukaryotic cells, and its main function is to package DNA into a more compact form in order to prevent DNA damage. It is also involved in the regulation of gene expression and DNA replication, as its structural changes control the density of the DNA at different cell cycle stages. It has been shown that there is a certain heterogeneity not only in gene expression patterns but also in chromatin structure across different cell types, indicating differences in the accessibility of the genome to the transcription machinery depending on the cellular profile [5].

In recent years more and more single-cell sequencing methods have been introduced, empowering researchers to pinpoint changes at the highest resolution. One approach to study chromatin accessibility is the Assay for Transposase Accessible Chromatin using sequencing (ATAC-seq) described by Buenrostro et al. [6]. A basic scheme of this method is outlined in Fig. 1.

ATAC-seq is based on the Tn5 transposase's ability to access open regions of the DNA. Transposases are enzymes that bind transposons, cut them out of their genomic context and catalyse their movement across the genome. The hyperactive Tn5 transposase used in this assay cuts the DNA and simultaneously ligates synthetic sequencing adapters into accessible sites. Tagging these sites enables the characterization and quantification of open genome regions even after high-throughput sequencing.

Single-cell ATAC-seq provides a genome-wide map of sequence accessibility. This picture of the regulatory landscape can be used to explore multiple questions, e.g. quantifying changes in accessibility across different cell populations. Based on the position and length distribution of the adapter insertions, one can gain insight into nucleosome positions in regulatory regions and into the interplay of genomic sites with DNA-binding factors [7].

The sequence specificities of Tn5 are still poorly understood. Computational DNA footprinting is frequently used to investigate the regions associated with DNA-binding proteins. Sequence signatures obtained by footprinting methods can be distorted by preferences specific to the enzyme used in the assay, and this should be considered when interpreting the signals. Efforts have been made to correct for these artefacts in ATAC-seq by incorporating Hidden Markov Models [8] or via mathematical models, e.g. scaling the experimental data by the ratio of expected insertion events for a sequence window to the observed occurrences [1]. We evaluated whether a binding bias can be identified and aimed to learn potential cleavage preferences of Tn5 by applying deep learning algorithms.
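
To illustrate the scaling idea used in [1], the following minimal sketch rescales per-position insertion counts by the ratio of expected to observed insertion frequencies of the sequence window centred at each position; the variable names and data layout are assumptions for illustration, not the published implementation.

import numpy as np

def correct_insertion_counts(counts, window_kmers, observed_freq, expected_freq):
    """Toy illustration of the bias correction in [1]: scale the raw
    insertion count at each position by the expected/observed frequency
    of the sequence window (k-mer) centred there. All inputs are assumed:
    counts and window_kmers are parallel arrays, the *_freq arguments map
    k-mers to genome-wide frequencies."""
    corrected = np.zeros(len(counts), dtype=float)
    for i, kmer in enumerate(window_kmers):
        ratio = expected_freq[kmer] / max(observed_freq[kmer], 1e-9)
        corrected[i] = counts[i] * ratio
    return corrected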

1.2 Convolutional Neural Networks in Genomics

One prominent approach to capturing structures of interest in biological sequences is the use of neural networks. During the learning phase the connections between the neurons are refined, so that relevant features in the data can be extracted without the need to define them manually beforehand.

Although neural networks are a powerful machine learning tool, fully connected architectures do not scale well with increasing data size and carry the risk of overfitting when they adjust to minimal details of an image. Instead, it is reasonable to focus on certain patterns. Convolutional neural networks (CNNs) were developed specifically for this purpose. They are inspired by the animal visual cortex and its receptive fields, the region in which a stimulus can affect the behaviour of a single sensory neuron: the connections between neurons mirror such regions, and each neuron processes only a local region of the image [9].

Compared to a fully connected network, additional filters in the form of so-called convolutional layers are applied in order to reduce the dimensions. A window with the size of the predefined receptive field is slid over the input image, and at each position the values inside the window are combined into a single number, which is then propagated to the next layer [10, 11].

Fig. 2.

Characteristic structure of a CNN composed of multiple layers. Convolution layers combine input values (e.g. from biological sequences). Pooling layers are employed to further reduce the dimensions. An activation function is applied before the signal is passed to the fully connected layers, where the score for each category of interest is calculated and returned via the output neurons [10].

In the context of biology, the first convolutional layer can be thought of as a motif scanner resembling position weight matrices (PWMs). During convolution each filter searches for its specific motif while sliding along a biological sequence. These networks are therefore well suited to detecting motifs in sequence windows and thus enable binding classification and motif discovery. By adding further convolutional layers and abstracting the model further, additional information such as the spatial interaction between the outputs of the initial detectors as well as the local sequence context can be captured [4, 11].
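
To make the analogy concrete, the short numpy sketch below scores a one-hot encoded sequence with a single PWM-like filter by sliding it along the sequence, which is what one filter of the first convolutional layer computes; the toy sequence and filter values are made up for illustration.

import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix."""
    mat = np.zeros((len(seq), len(ALPHABET)))
    for i, base in enumerate(seq):
        mat[i, ALPHABET.index(base)] = 1.0
    return mat

def motif_scan(onehot_seq, pwm_filter):
    """Slide a (k, 4) filter along the sequence and return its score at
    every valid offset -- one output channel of a 1D convolution."""
    k = pwm_filter.shape[0]
    return np.array([np.sum(onehot_seq[i:i + k] * pwm_filter)
                     for i in range(onehot_seq.shape[0] - k + 1)])

# Toy filter that "prefers" the 3-mer TAC (columns ordered A, C, G, T).
pwm = np.array([[0., 0., 0., 1.],   # position 1: T
                [1., 0., 0., 0.],   # position 2: A
                [0., 1., 0., 0.]])  # position 3: C
print(motif_scan(one_hot("GGTACGT"), pwm))   # highest score at the TAC offset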

2 Materials and Methods

To detect the transposase's sequence bias in ATAC-seq experiments, it was important to use a data set resulting from ATAC-seq performed on "naked" DNA, i.e. DNA purified such that no other molecules such as proteins or lipids were associated with it. It has been shown previously that the probability of a cleavage event by the Tn5 transposase is higher in DNA stretches void of nucleosomes or attached proteins [1, 12]. Particularly in the case of scATAC-seq, where genomic accessibility is in question, such purified DNA lessens the confounding effect introduced by the binding preferences of the assay's Tn5.

Moreover, clearer results on the cleavage bias are obtained, as purified DNA avoids capturing molecular profiles that arise from protein/DNA interactions.

We fetched a data set from GEO (accession number GSM1550786) [13] which contains the result of a scATAC-seq assay performed on FACS-sorted germ cells of Mus musculus. The data set consists of positions with known transposition events and the read density at the corresponding coordinates in the genome. The procedure by which the training sequences were extracted is outlined in Fig. 3. Briefly, the 46 base pair sequences centered at each signal peak were selected as positive sequences. To create a corresponding negative counterpart with similar properties, windows of length 46 with no overlapping read were considered.

The negative set was therefore chosen such that it mirrors the size and GC content of the sequences in the positive set and preferably contains sequences from across the whole chromosome. Finally, in order to focus on the sequences with the strongest signal, the 100 000 sequence windows with the highest read count values were selected as the base training input for the model. The sequences were converted to a one-hot encoded representation as multidimensional arrays (tensors, see Fig. 4A). The CNN models were generated with the Python package keras due to its comprehensible and simple way of translating the desired CNN architecture into machine-readable format [14].
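
A minimal sketch of this encoding step is given below; the assumption that the fifth channel corresponds to the ambiguous base N is ours, based on the 46 \(\times \) 5 window dimensions in Fig. 4A.

import numpy as np

BASES = "ACGTN"   # five channels, matching the 46 x 5 windows in Fig. 4A

def encode_windows(windows):
    """One-hot encode equally long sequence windows into a tensor of
    shape (number of sequences, window length, 5)."""
    tensor = np.zeros((len(windows), len(windows[0]), len(BASES)), dtype=np.float32)
    for i, seq in enumerate(windows):
        for j, base in enumerate(seq.upper()):
            tensor[i, j, BASES.index(base)] = 1.0
    return tensor

X = encode_windows(["ACGTN" * 9 + "A"])   # a single toy 46-bp window
print(X.shape)                            # (1, 46, 5)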

Fig. 3.

(A) Creation of a data set from a scATAC-seq experiment on naked DNA from sorted mouse germ cells, given as a .bedGraph file. (B) The positive set was created as follows: a window of length 46 base pairs (bp) was slid over the sequence; if the center position overlapped with a read, the current sequence window (sw) was saved and assigned the value of the original read. Creation of the negative set: for each sequence in the positive set, search within a certain range for a sw with similar GC content (± 0.1). The range is defined as a region between reads, to ensure that no transposition event happened there. If no window is found, the search continues in the next possible range; if the chromosome's end is reached, the search wraps around to its start. (C) Both the positive and the negative set show the same GC content distribution in order to avoid sequence bias.
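
The window selection logic of panel B can be sketched roughly as follows; the data structures (a chromosome string and a list of read-free ranges) and function names are simplifications for illustration, not the authors' exact implementation.

def gc_content(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def positive_window(chrom_seq, peak_center, win=46):
    """46-bp window centered on a signal peak (positive example)."""
    start = peak_center - win // 2
    return chrom_seq[start:start + win]

def find_negative_window(chrom_seq, read_free_ranges, target_gc, win=46, tol=0.1):
    """Search read-free regions of the chromosome for a window whose GC
    content lies within +/- tol of a positive window's GC content."""
    for start, end in read_free_ranges:            # regions without any overlapping read
        for pos in range(start, end - win + 1):
            window = chrom_seq[pos:pos + win]
            if abs(gc_content(window) - target_gc) <= tol:
                return pos, window
    return None                                    # caller may wrap around to the chromosome start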

For the purpose of identifying transposition sites of the transposase Tn5, we designed a shallow CNN. The architecture was inspired by models that predict transcription factor binding based on ChIP-seq data. One of the first approaches applying deep learning to the identification of protein binding sites was DeepBind [3]. By using a CNN with a single convolutional layer, the model can learn sequence specificities of DNA- and RNA-binding proteins from raw genomic sequences and detect known motifs. With an appropriate architecture, such neural networks tend to outperform prevailing simpler models. The DeepBind model consists of the following layers (a minimal keras sketch is given below the list):

  1. Convolution: one-dimensional convolution over the 4-channel input; in the context of biological sequences this can be interpreted as a motif scan.

  2. Rectification: apply an activation threshold to the output of the first layer for every motif at every position; if a motif's score at a position passes the threshold, the score is propagated, otherwise it is set to zero.

  3. Max pooling: reduce the input matrix to a vector whose length equals the number of motif detectors by retaining the maximal score per detector.

  4. Output: two neurons corresponding to the two classification results, fully connected to the previous layer.
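
A minimal keras sketch of these four stages, using the 64 filters of length 24 of the shallow model (Fig. 4B) and the optimizer and loss ultimately selected in the Results (adam, poisson), might look as follows; the softmax output activation is an assumption.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense

def build_shallow_model(window_len=46, n_channels=5, n_filters=64, filter_len=24):
    """DeepBind-style shallow CNN: motif-scanning convolution with ReLU
    rectification, global max pooling and a two-neuron output layer."""
    model = Sequential([
        Conv1D(n_filters, filter_len, activation="relu",
               input_shape=(window_len, n_channels)),   # convolution + rectification
        GlobalMaxPooling1D(),                            # best score per motif detector
        Dense(2, activation="softmax"),                  # two output neurons (softmax is an assumption)
    ])
    model.compile(optimizer="adam", loss="poisson", metrics=["accuracy"])
    return model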

Finally, an appropriate evaluation method for interpreting the results of the CNN prediction was established. For this, the data set was split into portions used at different stages of a 3-fold cross validation and for a final evaluation on a hold-out set that was not considered during hyperparameter tuning. The procedure is outlined in Fig. 4D.
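
The evaluation scheme of Fig. 4D can be sketched with scikit-learn utilities as shown below; the use of train_test_split for the stratified splits and the training settings (epochs, batch size) are assumptions.

from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

def evaluate_model(X, y, build_model, epochs=5, batch_size=128):
    """Sketch of Fig. 4D: put a fourth of the data aside, run three
    stratified 85/15 train/test splits on the rest, then score the
    untouched hold-out set. X: one-hot tensor, y: 0/1 labels."""
    X_rest, X_hold, y_rest, y_hold = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    for fold in range(3):                              # 3-fold cross validation
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_rest, y_rest, test_size=0.15, stratify=y_rest, random_state=fold)
        model = build_model()
        model.fit(X_tr, to_categorical(y_tr, 2), epochs=epochs,
                  batch_size=batch_size, verbose=0)
        print(f"fold {fold}: test accuracy =",
              model.evaluate(X_te, to_categorical(y_te, 2), verbose=0)[1])

    final = build_model()                              # final evaluation on the hold-out set
    final.fit(X_rest, to_categorical(y_rest, 2), epochs=epochs,
              batch_size=batch_size, verbose=0)
    print("hold-out accuracy =",
          final.evaluate(X_hold, to_categorical(y_hold, 2), verbose=0)[1])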

Fig. 4.

(A) Conversion of a genomic sequence using one-hot encoding. Tensors have the dimensions number of sequences \(\times \) window length \(\times \) nucleotides, in this case 100 000 \(\times \) 46 \(\times \) 5. (B) Architecture of model 1 (shallow model). The first layer consists of 64 convolutional kernels whose output is transformed using the rectified linear unit as activation function. The propagated output is further condensed by the global max pooling layer, which passes only the maximal value per convolutional filter to the fully connected layer. (C) Architecture of model 2 (deeper model): the main difference to the first model is the additional convolutional layer with filters of size 5. (D) Evaluation method for the CNN prediction. In a first step a fourth of the data set was put aside (hold-out set). A 3-fold cross validation was performed; in each iteration the remaining portion of the data set was split such that 85% of the sequences were used for training and 15% for evaluation as test set. The final evaluation of the model was performed on the hold-out set. In each data split an equal amount of positive and negative sequences was ensured.

3 Results

Since DeepBind's application differs from ours, some adjustments to the layer sizes and the number of neurons were made. It has been shown that increasing the number of convolutional kernels can improve the performance on motif-based tasks [11]. We therefore expanded our initial shallow model by adding a second convolutional layer (deeper model). The architectures are presented in Fig. 4.

In order to find the most appropriate properties for the models, different combinations of adjustable keras parameters were compared in a grid hyperparameter search. The optimization was done on 10,000 sequences (excluding the hold-out set), as a run over all combinations on the whole set proved too time consuming. The results are shown in Fig. 5. The optimization functions tested were stochastic gradient descent, RMSprop, adagrad, adadelta, adam and adamax, and the loss functions tested were binary crossentropy, mean absolute error, mean squared error, mean squared logarithmic error and poisson. Likewise, different filter lengths (3, 5, 16, 24, 32) and filter numbers (16, 32, 64, 86, 100) were tested for their accuracy. Due to the increase in computation time with a larger number of filters, 64 filters with a length of 24 were chosen, which provide sufficient accuracy as shown in Fig. 6A.
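
The grid search over optimizer and loss function combinations can be sketched as two nested loops over a fixed shallow architecture; the training settings and the assumption of one-hot class labels are ours, and filter length and number would simply be additional loop dimensions.

from itertools import product
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense

OPTIMIZERS = ["sgd", "rmsprop", "adagrad", "adadelta", "adam", "adamax"]
LOSSES = ["binary_crossentropy", "mean_absolute_error", "mean_squared_error",
          "mean_squared_logarithmic_error", "poisson"]

def grid_search(X_train, y_train, X_test, y_test, n_filters=64, filter_len=24):
    """Train one shallow model per optimizer/loss combination on the
    10,000-sequence subset and record its test accuracy (labels are
    assumed to be one-hot encoded with two columns)."""
    results = {}
    for opt, loss in product(OPTIMIZERS, LOSSES):
        model = Sequential([
            Conv1D(n_filters, filter_len, activation="relu", input_shape=(46, 5)),
            GlobalMaxPooling1D(),
            Dense(2, activation="softmax"),
        ])
        model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])
        model.fit(X_train, y_train, epochs=5, batch_size=128, verbose=0)
        results[(opt, loss)] = model.evaluate(X_test, y_test, verbose=0)[1]
    return results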

Fig. 5.

Overview of parameter optimization. Test accuracy obtained by grid parameter search using 10,000 training sequences. Results for different (A) filter length and filter number combinations and (B) optimization and loss function combinations.

The hyperparameter search was executed on the shallow model. The resulting parameters were then applied to the deeper model with slight adjustments. In contrast to the first model, the deeper one uses 64 filters of length 32 in the first convolutional layer and additionally 32 filters of length 5 in a second convolutional layer. For both models adam was adopted as optimizer and poisson as loss function. The activation function was kept as the rectified linear unit throughout, due to its prominent use in comparable methods.
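
With the parameters stated above, the deeper model could be written in keras roughly as follows; the placement of the global max pooling after the second convolution and the softmax output are assumptions not detailed in the text.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense

def build_deeper_model(window_len=46, n_channels=5):
    """Deeper model: 64 filters of length 32 followed by a second
    convolution with 32 filters of length 5."""
    model = Sequential([
        Conv1D(64, 32, activation="relu", input_shape=(window_len, n_channels)),
        Conv1D(32, 5, activation="relu"),     # additional short-filter convolution
        GlobalMaxPooling1D(),
        Dense(2, activation="softmax"),       # output activation is an assumption
    ])
    model.compile(optimizer="adam", loss="poisson", metrics=["accuracy"])
    return model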

Fig. 6.

Prediction performance of the shallow model (A) and the deeper model (B), which has an additional convolutional layer compared to the shallow model. A 3-fold cross validation was performed on the training set (left) and a final evaluation on the hold-out set (right). (C) Bar plots showing the prediction score distribution separately for sequences that originated from the positive (upper) or the negative set (lower). (D) Examples from the generated set of sequence logos based on the extracted weights of the shallow model. The size of the motif filter was chosen as 24, leading to logos of the same length. The relative sizes of the letters indicate the contribution of a nucleotide at a given position to the associated weight in the convolutional layer.

After deciding on these parameters, the evaluation described in Fig. 4D was executed for both models separately. Their performances are presented in Fig. 6A and B. In both parts of the evaluation the deeper model, with its additional convolutional layer containing short filters, achieved better accuracy.

To further examine whether the score returned for an input sequence can be used as a measure of how reliable the prediction is, Fig. 6C shows the distribution of scores returned for sequences from the hold-out set. Indeed, for the vast majority of cases a high score was output if a transposition was observed in the sequence and a low score if not. This indicates that the relevant signal could be captured by the established convolutional network.
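
A score distribution of the kind shown in Fig. 6C can be reproduced by predicting on the hold-out set and separating the scores by the true label; the plotting details below are illustrative.

import matplotlib.pyplot as plt

def plot_score_distributions(model, X_hold, y_hold, bins=20):
    """Histogram of predicted scores on the hold-out set, split by
    whether a transposition was actually observed (cf. Fig. 6C).
    y_hold is a numpy array of 0/1 labels; class index 1 is assumed to
    be the positive (transposition) class."""
    scores = model.predict(X_hold)[:, 1]
    fig, (ax_pos, ax_neg) = plt.subplots(2, 1, sharex=True)
    ax_pos.hist(scores[y_hold == 1], bins=bins)
    ax_pos.set_title("sequences from the positive set")
    ax_neg.hist(scores[y_hold == 0], bins=bins)
    ax_neg.set_title("sequences from the negative set")
    ax_neg.set_xlabel("prediction score")
    fig.tight_layout()
    plt.show()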

The motifs in the genomic sequence that were learnt during training of model 1 can be extracted via the keras architecture. Applying the methods described for Basset [4], the position weight matrix for each kernel in the first convolutional layer could be reconstructed by accessing its learnt weights. These were further used to generate the corresponding sequence logos; a subset of the results is shown in Fig. 6D.
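
A simplified version of this extraction is sketched below: the kernel weights of the first convolutional layer are read out via keras and turned into one position weight matrix per filter by a softmax over the nucleotide channels. Note that the original Basset procedure builds PWMs from nucleotide counts of the subsequences that strongly activate each filter, so the softmax shortcut here is only an approximation.

import numpy as np
from tensorflow.keras.layers import Conv1D

def first_layer_pwms(model):
    """Approximate one PWM per filter from the weights of the first
    convolutional layer (the original Basset procedure instead counts
    nucleotides in strongly activating subsequences)."""
    conv = next(l for l in model.layers if isinstance(l, Conv1D))
    kernels = conv.get_weights()[0]                   # (filter_len, n_channels, n_filters)
    pwms = []
    for f in range(kernels.shape[-1]):
        w = kernels[:, :, f]                          # (filter_len, n_channels)
        w = np.exp(w - w.max(axis=1, keepdims=True))  # softmax over channels per position
        pwms.append(w / w.sum(axis=1, keepdims=True))
    return pwms   # each PWM can then be rendered as a sequence logo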

4 Discussion

It is essential to assess changes in the regulatory element landscape under pathological conditions to further grasp the mechanisms that are altered in disease. Due to researchers' still limited knowledge of these regions, it is hard to distinguish functional stretches from non-functional ones. In contrast to coding regions, no encoded genes, which typically guide the functional interpretation of a sequence, can be detected. Machine learning approaches make it possible to use these genome regions as input, find underlying patterns intrinsic to regulatory elements and extend the set of known functional loci into the still mostly undescribed genomic regions. Modification of chromatin accessibility is one of the explorable aspects that can be measured thanks to the advances in recent profiling techniques. The regions of open chromatin are considerably variable across different cell types. Differentially open elements like enhancers, promoters, etc. can have a vast effect on a cell's signature, and their analysis therefore adds further insight to transcriptomics.

As described previously, the individual binding preferences of the enzymes used in these techniques can influence the peak distribution and complicate the interpretation of the results [1]. Considering these confounding factors and correcting for them is essential for meaningful biological analysis.

We show that convolutional neural networks provide a powerful tool to capture motifs that are preferentially bound by Tn5, the ATAC-seq assay's transposase. These preferences, captured on purified DNA, can be used to correct the confounded observations in new ATAC-seq experiments and help to understand the underlying biology of accessible genome regions untarnished by technical artefacts.

Based on this outcome there are many directions worth investigating in more detail. The results presented here were obtained on mouse germ cells. Still, as there is perceivable heterogeneity across different cell types, it might be interesting to account for cell type during training and bias correction in order to refine the results further.