1 Introduction

Accurately modeling the sequence specificity of transcription factors is an essential problem in understanding the function and evolution of the genome [1,2,3,4,5]. A transcription factor (TF) is a protein that binds to DNA and regulates gene expression. Transcription factor binding sites (TFBS) are a subset of DNA binding sites; they are short segments of DNA that are specifically bound by one or more proteins with various functions. In particular, the binding affinity of a TF for a DNA sequence determines the relative expression of the genes downstream of its binding sites. The mechanism by which TFs select specific binding regions is complex, and a large number of DNA–protein binding sites remain to be determined at different levels.

With the development of high-throughput technologies such as ChIP-seq [6], ChIP-exo [7] and ChIP-nexus [8], a huge number of TF binding sites has been verified experimentally. However, these experiments are time-consuming and expensive. Fortunately, the resulting data can serve as training data for machine learning models that learn the binding patterns of TFs, and many computational approaches have been proposed to predict DNA–protein binding [9,10,11,12]. For example, Cirillo et al. [9] proposed PAnDA to predict DNA–protein binding for human transcription factors by using gene expression profiles, protein–protein interactions and recognition motifs. Zhang et al. [10] proposed DiseMLA to discover TFBS motifs in high-throughput datasets, optimizing the motif-searching phase with a more comprehensive criterion. Zhu et al. [11] presented LSUE for inferring DNA–protein binding from new ChIP-seq datasets, which mainly utilizes the local correlations between available datasets. Schmidt et al. [12] presented TEPIC2, a framework that allows fast, accurate and versatile prediction and analysis of DNA–protein binding from epigenetic data.

Recently, deep learning has shown better discriminative ability than other machine learning methods [13, 14] and has been widely applied in bioinformatics [15, 16], e.g., protein structure prediction [17], gene expression regulation [18, 19] and protein classification [20]. Convolutional neural networks (CNNs) have been applied successfully to predict DNA–protein binding [21,22,23,24]. These methods not only outperform other existing methods in prediction accuracy, but also allow binding motifs to be extracted directly from the learned CNN parameters. For example, DeepBind [21], a convolutional neural network trained on a large amount of data from high-throughput experiments, outperforms the state-of-the-art experimental and computational methods in identifying the binding preferences of DNA- and RNA-binding proteins. DeepSEA [22] also trains a CNN framework to predict the effects of noncoding variants from DNA sequences. Zeng et al. [23] systematically explored CNN architectures to predict DNA sequence binding in 690 transcription factor ChIP-seq experiments from the Encyclopedia of DNA Elements (ENCODE) project [25]. Cao et al. [24] introduced several CNN tricks to improve the performance of DNA sequence prediction tasks, taking DNA–protein binding as an illustrative task. Fast convolution on the graphics processing unit (GPU) allows CNNs to be trained on large-scale datasets. Wang et al. [26] studied the relationship between generalization and uncertainty by incorporating the complexity of classification, concluding that the generalization ability of a classifier statistically improves with increasing uncertainty when the complexity of the classification problem is relatively high. Wang et al. [27] investigated multiple-instance active learning (MIAL) by incorporating diversity and informativeness, proposing two diversity criteria for MIAL based on a support vector machine MIL classifier. However, these techniques cannot capture the dependency information of DNA sequences within the CNN framework. In addition, these methods are not accurate enough in predicting DNA–protein binding from DNA sequences.

In this study, we focus on classifying whether a DNA segment binds to any TF. We propose a computational approach for predicting DNA–protein binding based on BLSTM [28] and CNN, called DeepSite, to address the aforementioned shortcomings of existing methods. In the DeepSite model, both long- and short-range dependency information of DNA sequences can be captured by mining the information from every intermediate hidden value of the BLSTM and CNN. Experimental results on the benchmark datasets show that DeepSite outperforms other existing deep learning methods. DeepSite predicts DNA-binding sites with 87.12% sensitivity, 91.06% specificity, 89.19% accuracy and 0.783 MCC when tested on the dataset of 690 ChIP-seq experiments. Compared with the CNN model, our method improves sensitivity, specificity, accuracy and MCC by 5.28%, 8.35%, 6.89% and 0.138, respectively.

The original contributions of the proposed model are threefold: (1) we introduce a BLSTM layer into the DeepSite algorithm to capture the long- and short-range dependency information of DNA sequences, which improves predictive performance; (2) we propose a novel hybrid BLSTM–CNN framework for predicting DNA–protein binding from DNA sequences; and (3) the experimental results demonstrate that the proposed approach performs better in identifying DNA–protein binding in DNA sequences.

2 Materials and methods

In this study, we present a deep learning-based approach, DeepSite (Fig. 1), to predict TFBS in DNA sequences by integrating a BLSTM and a CNN. We first formulate the problem of transcription factor binding site prediction as a deep learning task. Then, we introduce the ChIP-seq experimental dataset from ENCODE, which is used to train and evaluate DeepSite. Next, we give the technical details of the two deep neural networks, BLSTM and CNN. Finally, we describe the proposed DeepSite method and its implementation in detail.

2.1 Problem statement

This study focuses on discovering DNA–protein binding in DNA sequences, and the task can be viewed as a binary sequence classification problem. The problem is formulated as follows: the training set is represented by \(\{X^{(i)},y^{(i)}\}^n_{i=1}\), where \(X^{(i)}\) is a \(4\times N\) matrix and N is the length of a DNA sequence (101 base pairs in our experiments). Each base pair in the sequence is represented as one of the four one-hot vectors [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0] and [0, 0, 0, 1]. This matrix is called a positional frequency matrix (PFM); its four rows correspond to the channels of the genetic alphabet \(\{A,T,C,G\}\). The label \(y^{(i)}\) can be a scalar or a vector, depending on the number of transcription factor binding sites being studied; its dimensionality equals the number of classification tasks, and each element of \(y^{(i)}\) is a binary label in \(\{0,1\}\). The goal is to accurately predict the labels of the testing data, that is, to predict whether a transcription factor binds to a given DNA sequence.
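As an illustration of this encoding, the following minimal Python sketch (the function name and channel ordering are our own, chosen to match the alphabet order in the text, and are not part of the published implementation) converts a DNA string into the \(4\times N\) one-hot matrix described above.

```python
import numpy as np

# Channel order assumed here: A, T, C, G (matching the alphabet order in the text).
BASE_INDEX = {'A': 0, 'T': 1, 'C': 2, 'G': 3}

def one_hot_encode(sequence):
    """Encode a DNA string as a 4 x N binary matrix (one column per base)."""
    matrix = np.zeros((4, len(sequence)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        if base in BASE_INDEX:            # unknown bases (e.g. 'N') stay all-zero
            matrix[BASE_INDEX[base], position] = 1.0
    return matrix

# Example: a toy 8-bp fragment; real inputs are 101 bp in this study.
x = one_hot_encode("ATCGGATC")
print(x.shape)  # (4, 8)
```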

2.2 Dataset

As in Alipanahi [21], Zhou [22] and Zeng [23], we obtain 690 ChIP-seq experiments from ENCODE. We use the same DNA sequence data as Zeng [23]: the positive dataset consists of the 101 base pair region centered on each ChIP-seq peak, and the negative dataset consists of shuffled positive sequences with matching dinucleotide composition.

We generate the dataset from the 690 ChIP-seq experiments. In this study, we focus on classifying whether a DNA segment binds to any TF. All training data are combined into a single dataset: the training set contains 2,725,808 DNA sequences and the testing set contains 255,700 DNA sequences. To reduce the runtime of DeepSite, we first use 10% of the training and testing sets to evaluate performance. Finally, the full datasets are used to assess the performance of DeepSite.
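For intuition only, the sketch below shuffles a sequence in non-overlapping 2-mers. This is a crude approximation: it keeps adjacent base pairs together but does not reproduce the exact matched-dinucleotide shuffling used to build the benchmark negatives.

```python
import random

def approximate_dinucleotide_shuffle(sequence, seed=None):
    """Shuffle a sequence in non-overlapping 2-mers.

    Note: this is only a rough approximation of dinucleotide-preserving
    shuffling (the benchmark negatives keep the exact dinucleotide
    composition); it merely keeps adjacent base pairs together.
    """
    rng = random.Random(seed)
    chunks = [sequence[i:i + 2] for i in range(0, len(sequence), 2)]
    rng.shuffle(chunks)
    return "".join(chunks)

print(approximate_dinucleotide_shuffle("ACGTACGTAC", seed=0))
```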

2.3 Bidirectional LSTM networks

Compared with the traditional RNN, the LSTM is better able to capture long-distance dependencies. Zhu et al. [29] used a traditional RNN to solve protein–protein network problems; one difference lies in how the protein sequence data are processed. Given a sequence, a traditional RNN works iteratively from \(t=1\) to n via Eqs. (1) and (2) to calculate the hidden vector sequence \(h=(h_1,h_2,\ldots ,h_n)\) and output a vector sequence \(y=(y_1,y_2,\ldots ,y_n)\).

$$h_t = f(W_{xh}x_t + W_{hh}h_{t-1} + b_h) \qquad (1)$$
$$y_t = g(W_{hy}h_t + b_y) \qquad (2)$$

where \(x=(x_1,x_2,\ldots ,x_n)\) is the input vector sequence, t indexes the input, output and hidden vectors, the W matrices are weight matrices learned during training, the \(b_*\) are bias vectors, and f() and g() denote activation functions.

The LSTM is a special type of RNN that is well suited to capturing both long- and short-range dependency information in sequences [30]. A memory mechanism replaces the hidden function of the traditional RNN. The commonly used LSTM unit consists of a memory cell, a forget gate, an input gate and an output gate, designed to enhance the ability of the LSTM to model long-range dependence. The LSTM memory cell is given by the following equations:

$$f_t = \sigma (W_{xf}x_t + W_{hf}h_{t-1} + b_f) \qquad (3)$$
$$i_t = \sigma (W_{xi}x_t + W_{hi}h_{t-1} + b_i) \qquad (4)$$
$$c_t = f_t \otimes c_{t-1} + i_t \otimes \tanh (W_{xc}x_t + W_{hc}h_{t-1} + b_c) \qquad (5)$$
$$o_t = \sigma (W_{xo}x_t + W_{ho}h_{t-1} + b_o) \qquad (6)$$
$$h_t = o_t \otimes \tanh (c_t) \qquad (7)$$

where \(\sigma\) is the logistic sigmoid function, \(\tanh\) squashes values into the range \(-1\) to 1, and f, i, c, o denote the forget gate, input gate, cell vector and output gate, respectively, all of which have the same size as the hidden vector h. \(W_{xf}\) is the input–forget gate matrix and \(W_{hf}\) is the hidden–forget gate matrix. The index t refers to the time step, and \(\otimes\) denotes the element-wise product of vectors. Note that the initial values are \(c_0=0\) and \(h_0=0\).
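To make Eqs. (3)–(7) concrete, the following is a minimal NumPy sketch of a single LSTM time step; the variable and dictionary names mirror the equations and are purely illustrative, not the library implementation used in DeepSite.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Eqs. (3)-(7).

    W is a dict of weight matrices (W['xf'], W['hf'], ...) and b a dict of
    bias vectors; shapes are assumed to be mutually consistent.
    """
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])      # Eq. (3), forget gate
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])      # Eq. (4), input gate
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])  # Eq. (5), cell state
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])      # Eq. (6), output gate
    h_t = o_t * np.tanh(c_t)                                      # Eq. (7), hidden state
    return h_t, c_t
```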

In sequence tagging, we have access to both past and future input features at a given time, so we can use a BLSTM as proposed in [28]. In this way, past and future features within a specific time interval can be used efficiently. Back-propagation is used to train the BLSTM. In this study, we apply forward and backward LSTMs over the entire DNA sequence in order to capture its long-range dependencies. The hidden states only need to be reset to 0 at the beginning of each sequence. In particular, we use a batch implementation that processes multiple sequences at the same time.
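As a concrete illustration, a bidirectional LSTM over the one-hot DNA input can be written in Keras roughly as follows; this is a minimal sketch assuming the tf.keras API, with illustrative layer sizes (the DeepSite configuration is given in Sect. 2.6).

```python
from tensorflow.keras import layers, Input, Model

inputs = Input(shape=(101, 4))          # 101 positions x 4 one-hot channels
# Forward and backward LSTM passes; their outputs are concatenated per position.
h = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(inputs)
model = Model(inputs, h)
print(model.output_shape)               # (None, 101, 64): 32 forward + 32 backward units
```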

2.4 Convolutional neural networks

The CNN is a well-known deep learning framework that has been widely applied in image recognition [31], speech recognition [32], computer vision [33], natural language processing [34], bioinformatics [21, 22] and other artificial intelligence research fields [35, 36]. Wang et al. [37] investigated the essential relationship between the generalization capability and the fuzziness of fuzzy classifiers, presenting evidence that higher fuzziness of a fuzzy classifier may imply better generalization. The components of a CNN include convolutional, pooling and fully connected layers. The convolutional layer extracts and represents local information from the original features through several feature maps and kernels. The pooling layer compresses the resolution of the feature maps to achieve spatial invariance. After several convolution and pooling operations, one or more fully connected layers perform higher-level reasoning. The output of the last fully connected layer is passed to an output layer. For classification or regression tasks, softmax regression is commonly used because it produces a well-formed probability distribution over the outputs.

2.5 The proposed model

This section introduces the proposed model, including the network structure and its learning algorithm. The Adam algorithm [38] is used to update the parameters. We use a bidirectional LSTM structure to model the forward and reverse dependency information in the DNA sequence. The network structure and the proposed algorithm are implemented with the Keras library, and all experiments are run on graphics processing units (GPUs) to accelerate training.

Fig. 1 The working mechanism of the DeepSite model

We combine a BLSTM network and a CNN network into a BLSTM-CNN model, shown in Fig. 1. This framework can efficiently characterize possibly highly complex ordering in the DNA sequence via the BLSTM layer and generate filters that generalize sequence patterns via the CNN and max pooling layers. With this neural network, both long- and short-range dependency information of the DNA sequence can be captured by tapping the information from every intermediate hidden value of the BLSTM and CNN.

As shown in Fig. 1, the first (input) layer uses one-hot coding to represent each input sequence as a 4-row binary matrix; the length of each sequence is 101 base pairs.

The second layer is a BLSTM layer in which each LSTM block receives the input extracted from the DNA trace of interest and encodes its interpretation of the overall contribution of the past history into its hidden state. This interpretation is then propagated to the next LSTM blocks located above and to the right of it. Once the last nucleotide is observed, the last unrolled LSTM block makes the final decision on the goodness of the probe.

The third layer is a convolutional layer composed of different convolutional kernels with rectified linear units as the activation function. Each convolutional kernel works as a motif detector that scans the input matrices and produces different strengths of signals that are correlated to underlying sequence patterns. The vertical and horizontal dimensions in the convolution box are 1 and 24, respectively.

The fourth layer is a max pooling layer that takes the maximum of the output signals of each convolutional kernel along the whole sequence.

The fifth layer is a fully connected layer with rectified linear units as the activation function. The size of the fully connected layer is 32, the same as in Zeng [23].

The last layer performs a non-linear transformation with sigmoid activation and produces a value between 0 and 1 that represents the binding probability for each probe.
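Putting these layers together, the architecture can be sketched in Keras as follows. This is a minimal sketch, not the published implementation: the 101-bp input, the kernel width of 24 and the 32-unit fully connected layer follow the description above, while the remaining sizes are placeholders to be tuned as in Sect. 3.2 (note that Keras expects the one-hot matrix in the transposed shape, length by channels).

```python
from tensorflow.keras import layers, Input, Model

def build_blstm_cnn(seq_len=101, n_channels=4, lstm_cells=32,
                    n_filters=32, kernel_width=24, dense_units=32):
    """Sketch of the BLSTM-CNN architecture described in Sect. 2.5.

    Defaults mirror the text where values are stated (101-bp input,
    kernel width 24, 32-unit fully connected layer); other settings
    are placeholders, to be tuned as in Sect. 3.2.
    """
    inputs = Input(shape=(seq_len, n_channels))                  # one-hot DNA matrix, (length, channels)
    x = layers.Bidirectional(layers.LSTM(lstm_cells, return_sequences=True))(inputs)
    x = layers.Conv1D(n_filters, kernel_width, activation='relu')(x)  # motif detectors
    x = layers.GlobalMaxPooling1D()(x)                           # max over the whole sequence
    x = layers.Dense(dense_units, activation='relu')(x)          # fully connected layer
    outputs = layers.Dense(1, activation='sigmoid')(x)           # binding probability
    return Model(inputs, outputs)

model = build_blstm_cnn()
model.summary()
```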

2.6 Model parameters and training procedure

DeepSite is trained using the standard back-propagation algorithm [39] and mini-batch gradient descent with the Adagrad [40] variant. Wang et al. [41] proposed a new deep learning approach for training multilayer feed-forward neural networks that does not need to iteratively tune the weights; it uses restricted Boltzmann machines for layer-wise training and the generalized inverse of a matrix for supervised fine-tuning. Dropout [42] and early stopping are used for regularization and model selection. Detailed parameter configurations are given in the next section.

All models in this study use a generic SGD-based forward and backward training procedure. We choose the most complex and best-performing model, BLSTM-CNN, to illustrate the training behavior. In the experiments, the training dataset is divided into batches and one batch is processed at a time. Each batch contains a set of sequences whose number is determined by the batch size parameter; as recommended by Alipanahi [21], the batch size is set to 64. The weights and biases are initialized with the Keras defaults. Each model is trained for 100 epochs. The learning rate is varied from 0.001 to 0.008, and the dropout ratio is set to 0.1, 0.3 and 0.5, respectively. The number of BLSTM cells is varied from 32 to 400 with a default of 32, and the number of CNN filters is varied from 32 to 400 with a default of 32.
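A corresponding training configuration might be sketched as follows; this is illustrative only (the text mentions both Adam and Adagrad, Adam is used here, and the early-stopping patience is an assumption), with `model`, `x_train`, `y_train`, `x_val` and `y_val` assumed to be prepared beforehand.

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

# A Dropout layer (ratio tuned in Sect. 3.2.2) would be inserted in the model
# definition, e.g. after the fully connected layer.
model.compile(optimizer=Adam(learning_rate=0.001),   # within the 0.001-0.008 range explored
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Early stopping for model selection; the patience value is an assumption.
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          batch_size=64,        # as recommended by Alipanahi [21]
          epochs=100,
          callbacks=[early_stop])
```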

All experiments are conducted with the Python library Keras, running on a machine with 24 Xeon processors, 256 GB of memory and one Nvidia Tesla K40C GPU.

3 Results and discussions

To examine the performance of the proposed DeepSite, experiments on the ChIP-seq benchmarks from ENCODE are performed against three selected state-of-the-art algorithms. In the following, the evaluation metrics are outlined first. Then the parameter tuning is discussed, including the learning rate, dropout ratio, number of cells in the LSTM and number of convolution kernels in the CNN. Finally, performance comparisons with other deep learning methods, three existing predictors and datasets of different sizes are presented.

3.1 Evaluation metrics

In this study, five evaluation measurements, namely sensitivity (Sen), specificity (Spe), accuracy (Acc), precision (Pre) and the Matthews correlation coefficient (MCC), are employed to evaluate predictive capability. They are calculated as follows:

$$Sen = \frac{TP}{TP+FN} \qquad (8)$$
$$Spe = \frac{TN}{TN+FP} \qquad (9)$$
$$Acc = \frac{TP+TN}{TP+FN+TN+FP} \qquad (10)$$
$$Pre = \frac{TP}{TP+FP} \qquad (11)$$
$$MCC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP+FN)(TP+FP)(TN+FN)(TN+FP)}} \qquad (12)$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, FN is the number of false negatives, P is the number of positives, and N is the number of negatives.
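For reference, the following small Python helper is a direct transcription of Eqs. (8)–(12) from confusion-matrix counts; it is not the evaluation script used in the paper, and the example counts are arbitrary.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Sen, Spe, Acc, Pre and MCC from confusion-matrix counts (Eqs. 8-12)."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    pre = tp / (tp + fp)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    mcc = (tp * tn - fn * fp) / denom if denom else 0.0
    return {'Sen': sen, 'Spe': spe, 'Acc': acc, 'Pre': pre, 'MCC': mcc}

print(binary_metrics(tp=87, tn=91, fp=9, fn=13))
```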

However, these five measurements are threshold dependent, so the way they are reported is critical for a fair comparison between different predictors. In this study, the area under the receiver operating characteristic (ROC) curve (AUC), which is threshold independent and reflects the overall prediction performance, is also used to evaluate the predictors.

3.2 Parameter tuning

3.2.1 Selecting the learning rate

The hyper-parameters for the TFBS task need to be tuned to obtain optimal results. The learning rate is one of the most important hyper-parameters when training deep neural networks. If the learning rate is too low, training is more reliable but optimization takes a long time, because each update changes the loss only slightly. If the learning rate is too high, training may not converge or may even diverge, as the optimizer can skip over the optimal value and make the optimization of the loss function worse. The suitable range of learning rates differs across datasets and parameter configurations. In this study, we observe the different metrics as the learning rate changes from 0.001 to 0.008. The experimental results are given in Table 1 and Fig. 2.

Table 1 Performance of DeepSite model with different learning rates
Fig. 2 AUC of the DeepSite model with different learning rates

From Table 1, we observe that when the learning rate is set to 0.001, the proposed algorithm obtains the best values of all evaluation metrics. The values of Sen, Acc, Pre and MCC are 72.23%, 79.85%, 83.05% and 0.598, respectively, improvements of approximately 7.74%, 3.42%, 1.01% and 0.064 over a learning rate of 0.008. Figure 2 shows the AUC as the learning rate changes from 0.001 to 0.008: as the learning rate increases, the AUC of the predictor decreases drastically. Based on these empirical results, the learning rate is set to 0.001 in the following experiments.

3.2.2 Selecting the dropout ratio

Table 2 Performance of the DeepSite model with different dropout ratios
Fig. 3 Performance variation curves of Acc, MCC and AUC under different dropout ratios for DeepSite

Overfitting is a common problem in deep neural networks. Dropout is a technique for addressing this problem that randomly sets some intermediate values to zero while training the neural network [42]. To prevent overfitting, we investigate whether dropout is a feasible strategy for improving training accuracy. Based on Fig. 3, adding dropout to the model improves the AUC, suggesting that it may improve robustness. A similar trend is observed in Table 2: the MCC is 0.706, 0.704 and 0.700 at dropout ratios of 0.1, 0.3 and 0.5, respectively. Therefore, the dropout ratio is set to 0.1 in the following experiments.

3.2.3 Selecting the number of cells in LSTM

In this section, we empirically demonstrate how to choose the number of cells in the LSTM. We evaluate the Sen, Spe, Acc, Pre, MCC and AUC values on the training dataset while gradually varying the number of cells from 32 to 400; the search procedure is sketched below.
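This search can be sketched as a simple loop over candidate cell counts, assuming the build_blstm_cnn sketch from Sect. 2.5 and pre-prepared training and validation arrays; roc_auc_score comes from scikit-learn.

```python
from sklearn.metrics import roc_auc_score

results = {}
for n_cells in [32, 64, 128, 256, 300, 350, 400]:
    model = build_blstm_cnn(lstm_cells=n_cells)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(x_train, y_train, batch_size=64, epochs=100,
              validation_data=(x_val, y_val), verbose=0)
    scores = model.predict(x_val).ravel()     # predicted binding probabilities
    results[n_cells] = roc_auc_score(y_val, scores)

print(results)
```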

Fig. 4 Performance variation curves of AUC for the LSTM, BLSTM, LSTM-CNN and DeepSite models with different numbers of cells

Figure 4 shows the AUC of the four algorithms, LSTM, BLSTM, LSTM-CNN and DeepSite, as the number of cells varies from 32 to 400. The AUC of BLSTM improves significantly as the number of cells grows from 32 to 300 and remains unchanged after that. For LSTM, the AUC increases drastically as the number of cells grows from 32 to 300 and then remains stable. LSTM-CNN and DeepSite show almost the same trend, with the AUC increasing gradually as the number of cells grows from 32 to 400. Overall, the AUC increases with the number of cells up to 256; beyond that, the improvement of the four methods is marginal even when more cells are used, which can be explained by the methods having reached their peak AUC. We conclude that the best number of cells is 256 in this set of experiments.

Table 3 Performance comparison of DeepSite and other deep learning predictors with different number of cells

Table 3 shows the values of Sen, Spe, Acc, Pre and MCC for different numbers of cells. Our algorithm achieves MCC values of 0.686, 0.691, 0.706, 0.713, 0.716, 0.724 and 0.721 with 32, 64, 128, 256, 300, 350 and 400 cells, respectively, outperforming BLSTM by 0.089, 0.044, 0.039, 0.015, 0.017, 0.021 and 0.017, respectively. To facilitate comparison, we finally set the number of cells to 256 in all four methods.

As we can see from Table 3, the proposed DeepSite algorithm achieves the best results on all metrics across the different numbers of cells; for example, with 350 cells DeepSite obtains the best value of the Sen metric.

3.2.4 Selecting the number of convolution kernels in CNN

In this section, we discuss how to choose the number of convolution kernels in the CNN. We evaluate the Sen, Spe, Acc, Pre, MCC and AUC values on the training dataset while gradually varying the number of convolution kernels from 32 to 400.

Fig. 5 Performance variation curves of AUC under different numbers of convolution kernels

Figure 5 shows the variation curves of AUC under different numbers of convolution kernels. The AUC increases with the number of convolution kernels, and the DeepSite model outperforms the CNN. Specifically, the AUC of the CNN improves significantly as the number of kernels grows from 32 to 300 and remains stable from 300 to 400. DeepSite and LSTM-CNN show the same trend: their AUC increases slowly as the number of kernels grows from 32 to 128 and remains stable afterwards. The peak performance of DeepSite is better than that of the LSTM-CNN and CNN models.

Table 4 AUC of DeepSite and other deep learning predictors with different convolution kernels

Table 4 gives the mean and standard deviation of the AUC of DeepSite and the other deep learning predictors with different numbers of convolution kernels. According to Table 4, the best average AUC of DeepSite is higher than that of the CNN and LSTM-CNN algorithms, and its standard deviation is lower. These results demonstrate that the DeepSite model is more accurate and stable at predicting DNA–protein binding.

Table 5 Performance comparison of DeepSite and other deep learning predictors with different number of convolution kernels

Table 5 shows the Sen, Spe, Acc, Pre and MCC values under different numbers of convolution kernels. Using the best-performing number of kernels for each model, the proposed method achieves better performance than the other classical models.

3.2.5 Peak performance of LSTM, BLSTM, LSTM-CNN and DeepSite models

Since the different methods have very different architectures, we also compare the peak performance of the LSTM, BLSTM, LSTM-CNN and DeepSite models based on the results in Table 3. The peak performance results are shown in Table 6.

According to Table 6, our method achieves a peak of 84.94% for Sen and 0.724 for MCC, outperforming the LSTM, BLSTM and LSTM-CNN models in all cases. These results demonstrate that the combination of BLSTM and CNN performs much better than the other deep learning models, and show the advantage of the BLSTM in capturing the long- and short-range dependency information of DNA sequences.

Table 6 Peak Performance of LSTM, BLSTM, LSTM-CNN and DeepSite models

3.3 Performance comparison

3.3.1 Performance comparison with different methods

In this section, we investigate the discriminative performance of three deep learning methods: CNN, BLSTM and BLSTM-CNN. Each method is evaluated on the same training dataset. The parameters of the different methods, chosen as the best settings from the analysis above, are listed in Table 7. Figure 6 illustrates the ROC curves of the three deep learning methods on the same dataset.

Table 7 Parameter setting of CNN, BLSTM, BLSTM-CNN models
Fig. 6 Performance comparison of ROC curves for CNN, BLSTM and BLSTM-CNN on the same dataset

As shown in Fig. 6, the AUC of BLSTM-CNN is 0.932, an improvement of approximately 0.005 and 0.035 over BLSTM and CNN, respectively. These comparison results empirically demonstrate that all three deep learning methods are highly useful, and that the combination of BLSTM and CNN, DeepSite, obtains the best ROC curve for predicting DNA–protein binding. This again indicates the advantage of the BLSTM, which captures the long- and short-range dependency information of DNA sequences.

3.3.2 Performance comparison with existing predictors

In this section, we demonstrate the efficacy of the proposed DeepSite algorithm by comparing it with state-of-the-art methods, including DeepBind [21], DeepSEA [22] and Zeng [23], on the same training and testing datasets; the results are shown in Table 8.

Table 8 Performance comparison of DeepSite and classical predictors

We obtained the source code of DeepBind from http://tools.genes.toronto.edu/deepbind/nbtcode/. We run DeepBind within the Docker Enterprise container platform so that it can be run on different systems without environment dependency problems.

The DeepSEA model contains three convolution layers with 320, 480 and 960 kernels and two max pooling layers in alternating order to learn the motifs. After the third convolution layer there is a fully connected layer, and the last layer is a sigmoid output layer.

We implemented the best model of Zeng [23] using the same training pipeline. This model has 128 convolution filters with a window size of 24, followed by a global max pooling layer and a fully connected layer with 32 neurons.

According to Table 8, DeepSite outperforms the other classifiers on all metrics, including the MCC, which is an overall index for evaluating the quality of binary predictions. The Sen, Acc, Pre and MCC of the DeepSite predictor are 80.09%, 85.62%, 88.47% and 0.713, respectively, improvements of approximately 3.89%, 3.93%, 4.78% and 0.08 over the Zeng predictor. Compared with the DeepBind and DeepSEA models, DeepSite improves the MCC by 0.147 and 0.057, respectively. These results demonstrate that adding the recurrent connections significantly improves the performance of the DeepSite algorithm.

3.3.3 Performance comparison with different datasets

To further assess the performance of DeepSite, we conduct experiments on four datasets containing 10%, 30%, 50% and 100% of the data, using DeepSite and CNN. Figure 7 shows the performance variation curves of AUC under the different dataset sizes.

Fig. 7 AUC of the CNN and DeepSite models under different sizes of data

Table 9 Performance comparison of DeepSite and CNN under different sizes of datasets

From Fig. 7, we find that the AUC increases with the size of the dataset and that DeepSite outperforms the CNN in most cases. Table 9 gives the Sen, Spe, Acc, Pre and MCC values under the different dataset sizes. Our method achieves MCC values of 0.713, 0.765, 0.770 and 0.783 on 10%, 30%, 50% and 100% of the data, respectively, outperforming the CNN model by 0.008, 0.116, 0.131 and 0.138 on the corresponding subsets. This can be explained by the fact that the full dataset provides more training data, and DeepSite can make good use of the large number of training instances to improve its performance.

4 Conclusions

In this study, we present a combined BLSTM and CNN framework, called DeepSite, to predict DNA–protein binding in DNA sequences. DeepSite uses a BLSTM to capture the context dependency information of DNA subsequences, passes it to the CNN layer to extract discriminative features, and finally feeds these features to a fully connected layer. Experimental results on the training dataset have demonstrated the efficacy of the proposed DeepSite, which can be applied to identify DNA–protein binding. In ongoing work, we will further investigate the applicability of the proposed model to other molecular binding prediction problems, e.g., RNA–protein binding, which could potentially help scientists identify new binding sites in test sequences.