
1 Introduction

Transcription factors (TFs) are proteins that directly interpret the genome, performing the first step in decoding the DNA sequence [1,2,3]. They recognize specific DNA sequences to control chromatin and transcription, forming a complex system that guides expression of the genome [4, 5]. These specific DNA sequences are called transcription factor binding sites (TFBSs) and play important roles in vital biological processes. It has been shown that TFBSs are also associated with various human diseases and phenotypes; mutations of TFs and TFBSs are often highly deleterious and may lead to disease [6]. Thus, discovering TFBSs, also called motif discovery, is helpful for further study of gene expression and for therapies for diseases caused by gene mutation. Identification of DNA-binding motifs provides a gateway to further analyses [7].

Data for the study of transcription factor binding are now abundant thanks to the rapid development of high-throughput sequencing technology, which has made obtaining DNA data much easier [8]. Traditionally, transcription factor binding sites are displayed by sequence logos based on position weight matrices (PWMs) [9, 10]. The PWM score of a sequence is computed by multiplying the scores for each base of the sequence and represents a predicted relative affinity of the TF to that sequence. However, such a model has an obvious defect: it cannot process the large-scale data produced by high-throughput sequencing technology [4, 11,12,13].

Thanks to the rapid development of deep learning in recent years, new computational methods such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been applied to fields such as natural language processing and computer graphics with great success [14]. Recent applications of deep learning methods to biological data have also shown impressive performance improvements [15,16,17,18]. Among these applications, DeepBind is one of the earliest attempts to apply deep learning to the motif discovery task and has proved to be an effective model. DeepBind's success is mainly based on the appropriate use of a CNN, a variant of the multilayer artificial neural network specialized for computer vision. By converting the 1-D genomic sequence over the four nucleotides {A, C, G, T} into 2-D image-like data through one-hot encoding, a CNN can be adapted to model DNA sequence protein binding, which is analogous to a two-class image classification problem [16]. Another impressive application of deep learning to bioinformatics is DeepSEA [19, 20]. DeepSEA is an algorithmic framework for predicting the chromatin effects of sequence alterations with single-nucleotide sensitivity that directly learns a regulatory sequence code from large-scale chromatin-profiling data. It also achieved strong performance using a CNN and related deep learning algorithms.
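The one-hot encoding step described above can be sketched as follows; the function name and the layout (a 4 × L matrix with rows in A, C, G, T order) are illustrative assumptions, not the exact implementation of DeepBind.

```python
# Minimal sketch of one-hot encoding a DNA sequence: each sequence of
# length L over {A, C, G, T} becomes a 4 x L binary matrix that a CNN
# can treat like a single-channel image.
def one_hot_encode(seq):
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    matrix = [[0] * len(seq) for _ in range(4)]
    for pos, base in enumerate(seq.upper()):
        if base in mapping:  # ambiguous bases (e.g. N) stay all-zero
            matrix[mapping[base]][pos] = 1
    return matrix

encoded = one_hot_encode("ACGT")  # 4 x 4 identity-like matrix
```

Stacking such matrices for a batch of fixed-length sequences yields an image-like tensor suitable for convolutional layers.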

However, their performance still leaves room for improvement, and in this study we aim to improve on it with an innovative approach. Our idea is to transform the enhancer sequences into vectors using word embedding and then classify them with methods of natural language processing (NLP). DNA sequences and human language are both sequential data that express meaning through combinations of a fixed set of elements: words or letters in human language, and the four nucleotide bases A, C, G, T in DNA. This idea has indeed been used before, in work that applies existing natural language processing algorithms to the study of biological sequences [21]. It was first presented by E. Asgari et al. and applied successfully in many later biological applications [22]. Moreover, the k-mer word feature has also been applied to RNA sequence description and protein structure. In this paper, we propose a hierarchical attention network for the motif discovery task [23], based on a computational method for document classification proposed by Zichao Yang et al.

2 Proposed Approach

We propose a hierarchical attention network (HAN) for discovering DNA-protein binding sites. Our model is based on the hierarchical attention network for document classification proposed by Zichao Yang et al., which applies attention mechanisms at two hierarchical levels: word and sentence. By letting the model pay different amounts of attention to different parts of a sentence or document, the attention mechanism generates a more reasonable representation of the sentence or document. We believe the attention method is also suitable for bioinformatics data such as DNA sequences, because different parts of a DNA sequence may influence the expression of the whole sequence to different degrees. DNA sequences also have a hierarchical structure, like words and sentences in natural language, so the hierarchy in the model can improve the performance of processing DNA data [23].

To transform a DNA sequence into a form that can be read by our NLP-based model without losing information, we use k-mer slicing and word embedding when processing the sequence data. The DNA sequence is first sliced into k-mers, and word vectors are generated by word embedding from the sliced result. The vectors are then fed into the hierarchical attention network for training. We also tried several k-mer settings with different k lengths and gram lengths to find the best values.
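The k-mer slicing step can be sketched as below; the default of overlapping k-mers with stride 1, and the function name itself, are illustrative assumptions, since the text leaves the exact slicing scheme as a tunable choice.

```python
# Illustrative sketch of k-mer slicing: cut a DNA sequence into
# (possibly overlapping) substrings of length k, producing the "words"
# that are later mapped to vectors by the embedding layer.
def kmer_tokenize(seq, k=3, stride=1):
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ACGTAC", k=3)
# "ACGTAC" with k=3, stride=1 yields ["ACG", "CGT", "GTA", "TAC"]
```

Varying `k` and the stride changes the vocabulary size and the sequence length seen by the network, which is what the k-length and gram-length experiments explore.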

Figure 1 shows the structure of our proposed model. The result of word embedding is fed into a hierarchical attention network. The network uses a bidirectional LSTM-based recurrent neural network as the word encoder, which encodes the word embeddings into vectors containing information about both the context and the word itself. Because not all words contribute equally to the representation of the sentence meaning, the following attention layer extracts the words that are important to the meaning of the sentence and aggregates the representations of those informative words into a sentence vector. How the attention mechanism works is shown below.

Fig. 1.

A graphical illustration of HAN for motif discovery

$$ u_{it} = \tanh (W_{w} h_{it} + b_{w} ) $$
(1)
$$ \alpha_{it} = \frac{{\exp (u_{it}^{T} u_{w} )}}{{\sum\nolimits_{t} {\exp (u_{it}^{T} u_{w} )} }} $$
(2)
$$ s_{i} = \sum\limits_{t} {\alpha_{it} h_{it} } $$
(3)

where $h_{it}$ is the output of the word encoder, generated from the input word embedding. The word annotation $h_{it}$ is first fed into a one-layer MLP to get $u_{it}$ as a hidden representation of $h_{it}$. We then measure the importance of the word as the similarity of $u_{it}$ with a word-level context vector $u_{w}$ and obtain a normalized importance weight $\alpha_{it}$ through a softmax function. Finally, the sentence vector $s_{i}$ is computed as a weighted sum of the word annotations based on these weights. The context vector $u_{w}$ can be seen as a high-level representation of a fixed query "what is the informative word" over the words, as used in memory networks. The word context vector $u_{w}$ is randomly initialized and jointly learned during training.
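Eqs. (1)–(3) can be traced numerically with the short NumPy sketch below; the dimensions, random initialisation, and variable names are illustrative assumptions, not trained parameters.

```python
import numpy as np

# Sketch of the word-level attention of Eqs. (1)-(3): given word-encoder
# outputs h_it, compute hidden representations u_it, attention weights
# alpha_it, and the sentence vector s_i.
rng = np.random.default_rng(0)
T, d = 5, 8                      # words per sentence, hidden size
h = rng.normal(size=(T, d))      # word annotations h_it from the encoder
W_w = rng.normal(size=(d, d))    # one-layer MLP weights
b_w = np.zeros(d)                # MLP bias
u_w = rng.normal(size=d)         # word-level context vector (learned)

u = np.tanh(h @ W_w + b_w)                       # Eq. (1)
scores = u @ u_w                                 # similarity with u_w
alpha = np.exp(scores) / np.exp(scores).sum()    # Eq. (2), softmax
s = (alpha[:, None] * h).sum(axis=0)             # Eq. (3), weighted sum
```

The softmax guarantees the weights sum to one, so `s` is a convex combination of the word annotations, weighted toward the informative words.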

The structure of the sentence encoder and its attention layer is the same as at the word level, except that the sentence encoder takes the sentence vectors as input and finally outputs a vector representing the whole document. The document vector is fed into a softmax layer to get the final classification.

3 Experiments and Results

3.1 Data Sets

We collected 50 public ChIP-seq datasets from ENCODE to evaluate the performance of our proposed method. For each ChIP-seq dataset, the 1000–5000 top-ranking peaks were chosen as the foreground (positive) set, in which each sequence consists of 200 bps. The way background (negative) sequences are generated is also crucial. It is widely recognized that background sequences must be selected to match the statistical properties of the foreground set, otherwise the elicited motifs may be biased [24, 25]. To satisfy this requirement, equal numbers of background sequences were generated by matching the length, GC content and repeat fraction of the foreground set. Following this guideline, we selected the upstream and downstream DNA sequences of DNA-protein binding sites as negative sequences. Moreover, as mentioned before, the datasets were processed with several different k-mer settings with various k lengths and gram lengths to evaluate how these values influence the performance of the model [26, 27].
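One of the matching criteria above, GC content, can be checked with a small helper like the following; the function names and the tolerance value are hypothetical, chosen only for illustration.

```python
# Hedged sketch of GC-content matching between foreground and background
# sequences: a candidate background sequence is accepted only if its GC
# fraction is within a tolerance of the foreground sequence's.
def gc_content(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_matched(fg_seq, bg_seq, tol=0.05):
    return abs(gc_content(fg_seq) - gc_content(bg_seq)) <= tol
```

Analogous checks on length and repeat fraction would complete the matching procedure described in the text.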

3.2 Evaluation Metrics

Area under the curve (AUC), a widely used evaluation metric in both machine learning and motif discovery, is one of the two evaluation metrics used in this paper [2, 28]. It is equal to the area under receiver operating characteristic curve (ROC curve), which is a graphical plot that illustrates the performance of a binary classifier, and indicates the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one [29, 30].

Another evaluation metric applied in this paper is average precision (AP), which is also commonly used to measure the ability of a classifier [31]. Average precision combines recall and precision for ranked retrieval results: for one information need, it is the mean of the precision scores obtained after each relevant document is retrieved [32].
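The two metrics can be sketched in a few lines of pure Python; the toy labels and scores below are illustrative, not taken from our datasets. AUC is computed via its probabilistic definition (the chance a positive outranks a negative), and AP as the mean precision at each retrieved positive.

```python
# AUC as the probability that a random positive is scored above a
# random negative (ties count as half).
def auc_score(y_true, y_score):
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# AP as the mean of the precision values at the rank of each positive.
def average_precision(y_true, y_score):
    ranked = sorted(zip(y_score, y_true), reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])            # 0.75
ap = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])     # 5/6
```

In practice these values would typically be computed with a library such as scikit-learn, but the definitions above are what the metrics measure.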

3.3 Results

We compared the performance of HAN for predicting DNA-protein binding sites with those of DeepBind and DeepSEA. The AUCs and APs of the proposed HAN and the competing methods DeepBind and DeepSEA were calculated as performance metrics. Each method was tested on the same 50 datasets. The results of the comparison are illustrated in Figs. 2 and 3.

Fig. 2.

The AUC and AP of our proposed HAN method across 50 experiments in the motif discovery task compared with DeepBind

Fig. 3.

The AUC and AP of our proposed HAN method across 50 experiments in the motif discovery task compared with DeepSEA

Figures 2 and 3 display the AUC and AP comparisons between HAN and DeepBind, and between HAN and DeepSEA. We observe that the AUCs and APs of HAN are much higher than those of DeepBind and DeepSEA on most datasets, which means HAN achieves better performance than these two algorithms in most cases. In addition, the APs of HAN lie within a narrow range while those of DeepBind and DeepSEA span a much wider range, which means the accuracy of our proposed HAN method is more stable across the various datasets than that of the two compared methods.

The experimental results show that, with the help of the hierarchical network structure and the attention mechanism, our model extracts critical information from DNA sequences more efficiently and processes the data in a more reasonable way.

4 Conclusions and Future Work

In this paper, we propose a hierarchical attention network for predicting DNA-protein binding sites, inspired by a model from the natural language processing area, and show an impressive improvement in performance compared with DeepBind and DeepSEA through a series of experiments. Our proposed HAN method obtains higher accuracy than the two competing methods and shows stable performance on various datasets. The experimental results demonstrate that applying NLP algorithms such as hierarchical attention networks to the motif discovery field is practicable and effective.

Although our proposed HAN method achieves relatively better performance, it still has some drawbacks: it shows poor performance on some datasets even though it performs well on most, which means HAN for predicting DNA-protein binding sites still needs to be improved.