1 Introduction

Understanding how proteins and DNA interact is crucial for controlling gene transcription, splicing, translation, replication, and degradation. These interactions significantly influence the complex systems of genetic regulation [1,2,3]. Modelling Transcription Factor (TF) binding affinity and predicting TF binding locations are therefore of key importance for annotating and investigating the activity of cis-regulatory elements. Transcription factor binding sites (TFBSs), also known as motifs [4], are a particular class of functional DNA sites that typically range in size from a few to around 20 base pairs (bp). Precisely locating TFBSs within DNA sequences is essential for understanding the many processes involved in gene expression, for studying cellular processes in vitro, and for designing medicinal treatments [5]. In the past ten years, improvements in techniques such as Protein Binding Microarrays (PBMs) [6, 7], Chromatin Immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) [8, 9], and Systematic Evolution of Ligands by Exponential Enrichment coupled with high-throughput sequencing (SELEX-seq) have produced detailed datasets of TFBSs, encompassing both in vivo and in vitro contexts. Nevertheless, despite the steadily growing variety of these datasets, we still cannot predict with complete accuracy the genomic regions to which a given TF binds.

Numerous precise techniques for analysing PBM data have been proposed to predict TFBSs accurately [10,11,12]. The availability of such data has improved the performance of computational methods for predicting transcription-factor-specific binding [13,14,15]. As a result, computational approaches have replaced biological experimentation as the principal strategy for answering critical biological questions [12, 16], owing to their inherent simplicity, speed, and cost-effectiveness compared with conventional biological experiments.

Deep Learning (DL) [17,18,19,20,21] has experienced rapid advancements and shown remarkable performance in diverse fields [22,23,24], including the prediction of functional genomics [25,26,27,28]. DL can process high-dimensional input data and automatically identify informative features. DL models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), applied to suitably encoded input data, have demonstrated favourable results in the identification of TFBSs [29,30,31,32,33] by producing a probability value representing the binding or non-binding of a TFBS. These DL models thus greatly outperform conventional methods.

Identifying particular sequences (TFBSs) within DNA sequences can be regarded as a Natural Language Processing (NLP) task, and advancements in NLP have been propelled by the emergence of the self-attention mechanism [34, 35]. Ullah et al. [36] introduced a DL model based on CNN and self-attention layers to capture interactions among regulatory elements within genomic sequences; the attention mechanism enhances the network's learning capability by inferring a global view of interactions in the genomic dataset. Shen et al. [37] introduced SAResNet, a model that merges the self-attention mechanism with a residual network structure: the self-attention mechanism captures positional information in biological sequences, while residual connections extract high-level features, enabling accurate prediction of DNA-protein binding interactions. These notable studies demonstrate the self-attention layer's significant utility in detecting potential motifs and its capacity to accurately capture the relationships between regulatory components within a given sequence.

An expanding body of research suggests that the shape of DNA at specific targeted locations captures a critical aspect of TF binding. The reason lies in the 3D structure of DNA, formed by the stacking of, and physical interactions among, adjacent base pairs, which inherently encodes the dependencies among nucleotides [38]. Studies have demonstrated that TF binding is notably affected by four distinct shape features derived from Monte Carlo (MC) simulations: Minor Groove Width (MGW), Roll, Propeller Twist (ProT), and Helix Twist (HelT) [39]. In [40], a kernel-based framework was introduced to identify TF-DNA binding similarities precisely; the spectrum + shape kernel and the di-mismatch + shape kernel were employed to model TF binding without requiring sequence alignment, potentially offering better scalability for large datasets. Unlike Ullah et al. [36] and Shen et al. [37], Ma et al. [40] rely on kernel-based methods, which may have limitations in capturing complex interactions and long-range dependencies within genomic sequences, potentially leading to lower predictive performance.

Yang et al. [41] employed the DEep Sequence and Shape mOtif (DESSO) model, a straightforward DL model that incorporates DNA shape to predict TFBSs using human ChIP-seq datasets. They discovered that DNA shape holds significant predictive capability for TF-DNA binding, offering novel potential shape motifs for human TFs. However, DESSO may lack the advanced attention mechanisms utilised by Ullah et al. [36] and Shen et al. [37], potentially limiting its ability to capture intricate interactions and long-range dependencies within genomic sequences. Additionally, it may not offer the flexibility and scalability of kernel-based methods such as that of Ma et al. [40] for handling diverse datasets. Zhang et al. [42] introduced a sequence + shape framework called DLBSS, and Wang et al. [43] introduced a hybrid convolutional recurrent neural network framework named CRPTS; both predict TFBSs using DNA sequence and shape features. The conclusion from these studies is that including DNA shape significantly enhances TFBS prediction.

Although promising results were obtained by using primary DNA sequences and shape features as input, as in the DLBSS and CRPTS models, these models lack the advanced attention mechanisms presented in Ullah et al. [36] and Shen et al. [37], potentially limiting their ability to capture complex dependencies and interactions within genomic sequences effectively. Moreover, this approach faces challenges such as prioritising key features rather than comprehensively considering all features, and handling the continuous nature of shape features, which differs from the discrete nature of sequence features. There is therefore still room for advancement in DL models. We present an improved shared DL architecture incorporating an attention mechanism, drawing inspiration from Wang et al. [43]. Our DeepCTF model combines attention mechanisms with a CNN and a recurrent neural network (RNN) to process DNA sequences and their associated local DNA shape features, yielding an enhanced predictive model for TFBS identification. The improved performance of our proposed model, DeepCTF, stems from two important advances: (1) the strategic incorporation of a self-attention mechanism into the CNN and RNN, which effectively allows the extraction of complex features from DNA sequences acquired from high-throughput technologies; and (2) DeepCTF's ability to extract hidden local structural information from DNA sequences, which reduces the need to rely solely on DNA shape data. This combination highlights the model's adaptability to complex genomic contexts while improving performance.

2 Approach

We adopt the problem formulation previously used to construct quantitative TF binding prediction models with kernel methods [40]. We consider a set of triples \((s_1, x_1, y_1), \ldots, (s_n, x_n, y_n)\). Each \(s_i\) represents a DNA sequence of a specific length, denoted as w. The information in \(x_i\) pertains to the DNA shape conformation of \(s_i\). Meanwhile, \(y_i\) is a binary indicator signifying whether a TF binds to the sequence. We aim to construct a predictive model, denoted as f(.), such that when given \(s_i\) and \(x_i\) as inputs, the model \(f(s_i, x_i)\) accurately predicts \(y_i\). As we integrate local DNA shape features within our prediction framework, we examine four DNA shape features: MGW, Roll, ProT, and HelT.
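As a concrete illustration of this formulation, the hypothetical sketch below shows how one training triple and the prediction interface \(f(s_i, x_i)\) might be represented in Python; the names and shapes are illustrative assumptions, not part of any published implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    """One training triple (s_i, x_i, y_i)."""
    sequence: str        # s_i: DNA sequence of length w
    shape: np.ndarray    # x_i: 4 x w matrix of MGW, Roll, ProT, and HelT values
    label: int           # y_i: 1 if the TF binds the sequence, 0 otherwise

def predict(f, sample: Sample) -> float:
    """The role of the learned model f(.): map (s_i, x_i) to a score for y_i."""
    return f(sample.sequence, sample.shape)
```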

2.1 Attention mechanism

Figure 1 depicts the framework of the attention mechanism. We first apply batch normalisation to the input features to reduce internal covariate shift, which allows subsequent layers to learn from earlier layers more effectively and speeds up training. We then apply the Rectified Linear Unit (ReLU) activation, computed by the formula below:

$$\begin{aligned} f(x)=max(0,x) \end{aligned}$$
(1)

Subsequently, the resulting representation X is transformed linearly to yield three matrices: Query \(Q_{r}\in R^{T \times d_k}\), Key \(K_{Y} \in R^{T \times d_k}\), and Value \(V_{V} \in R^{T \times d_v}\). T denotes the sequence length, while the hidden dimensionalities of the query/key and value are indicated by \(d_k\) and \(d_v\), respectively. These three matrices are formulated as follows:

$$\begin{aligned} Q_{r}= & {} W^{T}_{Q_{r}} X \end{aligned}$$
(2)
$$\begin{aligned} K_{Y}= & {} W^{T}_{K_{Y}} X \end{aligned}$$
(3)
$$\begin{aligned} V_{V}= & {} W^{T}_{V_{V}} X \end{aligned}$$
(4)

Where the learned weight matrices of the query, key, and value are denoted by \(W_{Q_{r}}\), \(W_{K_{Y}}\), and \(W_{V_{V}}\). We have selected scaled dot-product attention; in other words, the attention value from position x to position y is determined by the similarity between \((Q_{r})_x\) and \((K_{Y})_y\), which is then normalised and multiplied by \(V_{V}\) to give the final attention output \( A(Q_{r},K_{Y},V_{V})\), expressed as follows:

$$\begin{aligned} A(Q_{r},K_{Y},V_{V})= softmax\biggl (\frac{Q_{r}K_{Y}^{T}}{\sqrt{d_k}}\biggr ) V_{V} \end{aligned}$$
(5)

The scaling factor \(\frac{1}{\sqrt{d_k}}\) is essential to keep the attention values at an appropriate variance; it primarily prevents the input of the softmax function from becoming overly large. For each possible combination of queries and keys, \(Q_{r}K_{Y}^T\) gives the dot product, producing a matrix of shape \(T \times T\). Thanks to the self-attention mechanism, the model can capture long-term relationships among residues by dynamically focusing on the residues that make up the sequences and capturing the global properties of the input DNA sequences.
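For concreteness, a minimal NumPy sketch of Eqs. (2)-(5) is given below; the randomly initialised weight matrices stand in for the learned parameters, and the row-vector convention \(XW\) is used in place of the column-vector form \(W^{T}X\).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (T, d_in) input; returns the (T, d_v) attention output A(Q, K, V)."""
    Q = X @ W_q                              # queries, Eq. (2)
    K = X @ W_k                              # keys,    Eq. (3)
    V = X @ W_v                              # values,  Eq. (4)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scaled dot products, shape (T, T)
    return softmax(scores, axis=-1) @ V      # Eq. (5)

# toy usage: a one-hot encoded sequence of length T = 35 with 4 channels
rng = np.random.default_rng(0)
T, d_in, d_k, d_v = 35, 4, 16, 16
X = np.eye(4)[rng.integers(0, 4, size=T)]                      # (35, 4)
W_q, W_k, W_v = (rng.normal(size=(d_in, d)) for d in (d_k, d_k, d_v))
out = self_attention(X, W_q, W_k, W_v)                         # (35, 16)
```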

Fig. 1 Illustrative diagram of the self-attention module

2.2 Convolutional neural network (CNN)

As widely recognised, the convolutional layer, usually followed by a ReLU unit, acts as a motif scanner that calculates a score for all possible motifs; this stage is therefore in charge of detecting motif features. Prior CNN-based prediction techniques have shown that CNNs can learn complex features. Nonetheless, different CNN architectures result in different levels of network efficiency [44, 45]. Increasing the number of convolution kernels makes it easier to identify motif variants, while stacking convolutional layers deepens the model and improves feature identification and extraction. Without stacking, a single convolution layer focuses more on extracting local features. Multilayer convolutional neural networks are frequently employed to create layered representations of the input sequence, facilitating the extraction of meaningful features at different levels of abstraction [45] and detecting TFBSs more thoroughly [44]. The network achieves its goals through collaboration among convolution layers; however, this makes training challenging because of the excessive number of parameters, and the global information produced is typically incomplete and lossy. As a result, our model employs only one CNN layer to extract local features, and the 2D convolution at every position i is as follows:

$$\begin{aligned} Conv(E_{k}S_{i})=\sum _{m=1}^{l}\sum _{\tau =1}^{\gamma } E_{k_{m},\tau } S_{i+m-1, \tau } \end{aligned}$$
(6)

Where \(E_k\) denotes the k-th convolutional filter applied to the input sequence S, and i is the position at which the convolution is evaluated. The sequence motif detector \(E_k\) is an \(l \times \gamma \) weight matrix, where \(\gamma \) is the number of channels of S and l is the filter's length; m indexes positions within the filter and \(\tau \) indexes the channels.
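As an illustration, the short NumPy sketch below evaluates Eq. (6) at every valid position of a one-hot encoded sequence; the function name and toy dimensions are assumptions for demonstration only.

```python
import numpy as np

def conv_motif_scan(S, E_k):
    """Eq. (6): score of filter E_k (l x gamma) at each position i of the
    one-hot sequence S (L x gamma); returns a vector of length L - l + 1."""
    L, gamma = S.shape
    l = E_k.shape[0]
    # sum over filter positions m = 1..l and channels tau = 1..gamma
    return np.array([np.sum(E_k * S[i:i + l, :]) for i in range(L - l + 1)])

# toy usage: scan a length-35 one-hot sequence with a random 8 x 4 filter
rng = np.random.default_rng(1)
S = np.eye(4)[rng.integers(0, 4, size=35)]
E_k = rng.normal(size=(8, 4))
scores = conv_motif_scan(S, E_k)   # shape (28,)
```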

2.3 Recurrent neural network (RNN)

Long Short-Term Memory (LSTM) [46], a type of RNN, addresses the difficulty that regular RNNs have in handling long-term dependencies. We used an LSTM to extract long-term characteristics from the DNA sequence, considering its double-stranded structure. The cell state is central to the LSTM and is carefully regulated by structures known as gates, namely the output, forget, and input gates. In the first phase, the "forget gate" determines which data should be kept or discarded. The next step is to choose how much new data to add to the cell state. The output value is decided in the last stage.

$$\begin{aligned} fg_{t}= & {} \sigma (W_{f}\cdot [hl_{t-1}, x_{t}]+b_{f}) \end{aligned}$$
(7)
$$\begin{aligned} ig_{t}= & {} \sigma (W_{i}\cdot [hl_{t-1}, x_{t}]+b_{i}) \end{aligned}$$
(8)
$$\begin{aligned} Cm_{t}= & {} tanh(W_{G}\cdot [hl_{t-1}, x_{t}]+b_{G}) \end{aligned}$$
(9)
$$\begin{aligned} P_{t}= & {} fg_{t}\odot P_{t-1} + ig_{t}\odot Cm_{t} \end{aligned}$$
(10)
$$\begin{aligned} Og_{t}= & {} \sigma (W_{o} \cdot [hl_{t-1}, x_{t}]+b_{o}) \end{aligned}$$
(11)
$$\begin{aligned} hl_{t}= & {} Og_{t}\odot tanh(P_{t}) \end{aligned}$$
(12)

Where \(fg_{t}\), \(ig_{t}\), and \(Og_{t}\) stand for the forget, input, and output gate values; W is the weight matrix and b the bias; the input vector, the candidate memory representation, and the hidden-layer state at time t are denoted by \(x_{t}\), \(Cm_{t}\), and \(hl_{t}\), respectively; \(\odot \) represents element-wise multiplication; and \(\sigma \) stands for the sigmoid function. For clarity, the notation used here is summarised in Table 1.

Table 1 Description of the notation used above
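A minimal NumPy sketch of a single LSTM step following Eqs. (7)-(12) is shown below; the dictionary-based weight layout is an illustrative assumption rather than the layout of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, hl_prev, P_prev, W, b):
    """One LSTM step, Eqs. (7)-(12). W['f'], W['i'], W['G'], W['o'] act on
    the concatenated vector [hl_{t-1}, x_t]; b holds the matching biases."""
    z = np.concatenate([hl_prev, x_t])
    fg_t = sigmoid(W['f'] @ z + b['f'])     # forget gate, Eq. (7)
    ig_t = sigmoid(W['i'] @ z + b['i'])     # input gate, Eq. (8)
    Cm_t = np.tanh(W['G'] @ z + b['G'])     # candidate memory, Eq. (9)
    P_t = fg_t * P_prev + ig_t * Cm_t       # new cell state, Eq. (10)
    Og_t = sigmoid(W['o'] @ z + b['o'])     # output gate, Eq. (11)
    hl_t = Og_t * np.tanh(P_t)              # new hidden state, Eq. (12)
    return hl_t, P_t
```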

3 Material and method

We develop a two-path deep learning sequence-plus-shape framework (DeepCTF): one path processes the DNA sequences with an attention mechanism, and the other processes the DNA shape-related data. The specifics of DeepCTF are explained below, as illustrated in Fig. 2.

3.1 Dataset and processing

The data processing technique for the proposed DeepCTF model is depicted in Fig. 2a.

3.1.1 DNA sequence data

The PBM approach provides biological understanding of the regulatory roles and in vivo activities of protein-DNA interactions. We extracted 12 uPBM datasets [47], originating from a range of protein families, to assess the efficiency of the proposed model. Every input DNA sequence was first converted by one-hot encoding into an \(n \times l\) matrix suitable for a DL model. Here, n denotes the four nucleotides (A, T, C and G), represented by the binary vectors written below, and l is the sequence length, i.e. 35 in the uPBM data we used.

$$\begin{aligned} A=[1\ 0\ 0\ 0],\ T=[0\ 1\ 0\ 0],\ C=[0\ 0\ 1\ 0],\ \text {and}\ G=[0\ 0\ 0\ 1] \end{aligned}$$
(13)
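A small Python helper illustrating this encoding (the function name is assumed for illustration):

```python
import numpy as np

ONE_HOT = {'A': [1, 0, 0, 0], 'T': [0, 1, 0, 0],
           'C': [0, 0, 1, 0], 'G': [0, 0, 0, 1]}   # Eq. (13)

def one_hot_encode(seq):
    """Encode a DNA string into an n x l matrix (n = 4 nucleotides,
    l = 35 for the uPBM probes used here)."""
    return np.array([ONE_HOT[base] for base in seq.upper()]).T

matrix = one_hot_encode("ACGT" * 8 + "ACG")   # a toy 35-bp sequence -> shape (4, 35)
```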

3.1.2 DNA shape data

The binding preferences of TFs are significantly influenced by the 3D structure of DNA [48]. The four DNA shape features are obtained for distinct pentamers using a sliding-window method and a query table identified in earlier work [49]; the preliminary DNA shape data were included in Table S3 of [50]. The effective contribution of the four DNA shapes is determined from each pentamer, which contributes two Roll and two HelT values (which are averaged), one MGW value, and one ProT value. As in [43], the DNA sequence is padded with two zeros on both sides to obtain a sequence of length \(l + 4\). Next, a sliding window is used to produce an input shape feature matrix of size \(n \times l\), matching the size of the DNA sequence matrix. Zero-mean normalisation was applied to each feature to remove the bias resulting from the different value ranges of distinct shapes.
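The sketch below illustrates this preprocessing; the `PENTAMER_SHAPE` lookup table is a hypothetical placeholder for the pentamer query table of [49, 50], and 'N' is used here as the padding symbol.

```python
import numpy as np

# Hypothetical pentamer lookup; real values come from Table S3 of [50],
# e.g. {'AAAAA': {'MGW': 3.38, 'Roll': ..., 'ProT': ..., 'HelT': ...}, ...}
PENTAMER_SHAPE = {}

def shape_features(seq, table=PENTAMER_SHAPE):
    """Slide a 5-bp window over the padded sequence and build an m x l shape
    matrix (m = 4 features), then zero-mean normalise each feature row."""
    padded = 'NN' + seq + 'NN'                       # two padding symbols per side
    feats = {'MGW': [], 'Roll': [], 'ProT': [], 'HelT': []}
    for i in range(len(seq)):
        pentamer = padded[i:i + 5]
        entry = table.get(pentamer, {'MGW': 0, 'Roll': 0, 'ProT': 0, 'HelT': 0})
        for name in feats:
            feats[name].append(entry[name])
    mat = np.array([feats[n] for n in ('MGW', 'Roll', 'ProT', 'HelT')], dtype=float)
    return mat - mat.mean(axis=1, keepdims=True)     # zero-mean per feature
```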

Fig. 2 A visual representation of the suggested model DeepCTF, where m is the number of shape features (m=4), n is the length of DNA sequences, and \(B_{m}\) is the size of the mini-batch

3.2 Architecture of DeepCTF

The overall layout of the DeepCTF model is shown in Fig. 2. On the left side of Fig. 2b, DeepCTF begins by encoding the DNA sequences into one-hot form, which is fed to a self-attention module. This is followed by a convolutional layer that performs the convolution operation and a 2D max-pooling layer, giving an initial sense of local and global attributes. The purpose of the max-pooling layer is to shorten the long sequences in order to reduce the number of parameters and avoid overfitting. An LSTM layer is placed after the max-pooling layer to capture long-term relationships between motifs as well as the orientations and spatial separations within DNA sequences. To mitigate overfitting, a dropout layer is applied after the LSTM layer. The configuration of the left-sided module of the proposed DeepCTF model is presented in Table 2.

Table 2 Left-sided module configuration of Proposed DeepCTF model
Table 3 Right-sided module configuration of Proposed DeepCTF model

For the DNA shape feature data in DeepCTF, shown on the right side of Fig. 2b, we use the same method as CRPTS [43], i.e. a convolution layer that processes the DNA shape features so that their size matches that of the DNA sequence features. The output of this convolution layer is fed into the ReLU activation function, which improves convergence and addresses gradient-vanishing issues during back-propagation training. The configuration of the right-sided module of the proposed DeepCTF model is presented in Table 3.

In the end, the outputs from the left and right sides of the DeepCTF model are concatenated and processed through the dense block, which consists of two Fully Connected (FC) layers together with batch normalisation and dropout (Table 4).

Batch normalisation was used at the output stage to simplify the initialisation of network parameters and reduce gradient problems during back-propagation. The outputs of the previous layer were fed into an FC layer to enable feature integration. A dropout layer was followed by the output layer containing a single neuron, which was used to predict the binding/no-binding probability of TF-DNA binding. Table 4 presents the complete configuration of this dense block.

Table 4 Dense layer configuration of proposed DeepCTF model
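To make the two-path layout above concrete, a minimal Keras-style sketch is given below. It follows the described ordering of layers, but all filter counts, kernel sizes, dropout rates, and the use of 1D convolutions are illustrative assumptions and do not reproduce the exact configurations of Tables 2-4; the built-in dot-product `Attention` layer also stands in for the self-attention module of Sect. 2.1.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

seq_len, n_channels, n_shapes = 35, 4, 4

# Left path: one-hot sequence -> BN/ReLU -> self-attention -> Conv -> MaxPool -> LSTM -> Dropout
seq_in = layers.Input(shape=(seq_len, n_channels), name='dna_sequence')
x = layers.BatchNormalization()(seq_in)
x = layers.Activation('relu')(x)
x = layers.Attention(use_scale=True)([x, x])            # dot-product self-attention (query = value)
x = layers.Conv1D(filters=64, kernel_size=8, activation='relu')(x)
x = layers.MaxPooling1D(pool_size=2)(x)
x = layers.LSTM(32)(x)
x = layers.Dropout(0.3)(x)

# Right path: shape features -> single convolution + ReLU
shape_in = layers.Input(shape=(seq_len, n_shapes), name='dna_shape')
y = layers.Conv1D(filters=64, kernel_size=8, activation='relu')(shape_in)
y = layers.GlobalMaxPooling1D()(y)

# Merge and dense head: BN, FC layers, dropout, single-neuron output
z = layers.concatenate([x, y])
z = layers.BatchNormalization()(z)
z = layers.Dense(64, activation='relu')(z)
z = layers.Dropout(0.3)(z)
out = layers.Dense(1, activation='sigmoid')(z)          # binding probability

model = Model(inputs=[seq_in, shape_in], outputs=out)
```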

4 Experimental results

In this section, we conduct several comparative experiments to show how well the proposed model DeepCTF performs.

4.1 Experimental setup and hyper-parameter settings

Fig. 3 Performance comparison of DeepCTF model with state-of-the-art models using the \(R^2\) evaluation metric

Fig. 4 Performance comparison of DeepCTF model with state-of-the-art models using the PCC evaluation metric

Fig. 5 Boxplot of the average \(R^2\) and PCC evaluation metric values over the 12 datasets for DeepCTF and state-of-the-art models

Fig. 6 Bar plot of \(R^2\) and PCC for DeepCTF without DNA shape, DeepCTF with DNA shape, CRPTS, and CRPT models on the 12 in vitro datasets

During training of DeepCTF, we minimise the loss function on each dataset. The loss function used in our proposed model is the Mean Squared Error (MSE) with L2 regularisation, described below:

$$\begin{aligned} J(\theta )=\frac{1}{N} \sum _{i=1}^{N} (\bar{y_{i}}-y_{i})^2 + \lambda \left\| \theta \right\| _{2} \end{aligned}$$
(14)

where N represents the total number of DNA samples in each training dataset, and \({\bar{y}}_i\) and \(y_i\) denote the predicted and the observed value of the i-th sample, respectively. To prevent overfitting of the model, L2 regularisation was employed; \(\lambda \) denotes a regularisation coefficient, and \(\left\| .\right\| _{2}\) denotes the L2 norm. The mini-batch size is 300, and the loss function is optimised with AdaMax. For AdaMax, the network's dropout ratio, momentum, and delta were chosen at random from [0.2, 0.5], [0.9, 0.99, 0.999], and [1e-8, 1e-6, 1e-4], respectively. We employed five-fold cross-validation to guarantee model accuracy and avoid overfitting. An early-stopping approach was used, in addition to limiting training to 100 epochs, to reduce the model's running time. We also used a random-search approach, sampling 30 hyperparameter settings, to determine the optimal configuration of sensitive hyperparameters such as the dropout ratio, momentum, and delta. The training process spanned 100 epochs, during which the accuracy of the validation set was evaluated after each epoch, and the model achieving the highest validation accuracy was saved.
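Continuing the Keras sketch from Sect. 3.2, the snippet below illustrates this training configuration (AdaMax, MSE loss, mini-batches of 300, early stopping, up to 100 epochs); the random arrays stand in for real uPBM data, and the patience value is an assumption.

```python
import numpy as np
from tensorflow.keras import callbacks, optimizers

# dummy stand-ins for the real one-hot sequences, shape features, and binding values
n = 1200
X_seq = np.random.rand(n, 35, 4).astype('float32')
X_shape = np.random.rand(n, 35, 4).astype('float32')
y = np.random.rand(n).astype('float32')

# Eq. (14) without the explicit L2 term; in the real model, L2 regularisation
# would be attached to layer weights via kernel_regularizer.
model.compile(optimizer=optimizers.Adamax(), loss='mse')
early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                     restore_best_weights=True)
model.fit([X_seq, X_shape], y, validation_split=0.2,
          batch_size=300, epochs=100, callbacks=[early_stop], verbose=0)
```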

4.2 Evaluation metrics

The performance of the DeepCTF model is evaluated against current competitive techniques. The Pearson correlation coefficient (PCC) and the coefficient of determination \((R^2)\) were used to evaluate the binding affinities predicted by the proposed model; the closer these metrics are to 1, the better the model performs. Both metrics were computed on every dataset to confirm the model's overall performance. The two performance measures are defined as follows:

$$\begin{aligned} R^2= & {} 1- \frac{Rss}{Tss} \end{aligned}$$
(15)
$$\begin{aligned} PCC(y,Y)= & {} \frac{S_{yY}}{\sqrt{S_{yy}\times S_{YY}}} \end{aligned}$$
(16)

\(y_{i}\), \(Y_{i}\), \(\bar{y}\), and \(\bar{Y}\) stand for the observed, predicted, average observed, and average predicted binding affinity scores, respectively, where \(S_{yY}=\sum _{i}(y_{i}-\bar{y})(Y_{i}-\bar{Y})\), \(S_{yy}=\sum _{i}(y_{i}-\bar{y})^2\), and \(S_{YY}=\sum _{i}(Y_{i}-\bar{Y})^2\). Also, Rss=\(\sum _{i}(y_{i}-Y_{i})^2\) is the residual sum of squares and Tss=\(\sum _{i}(y_{i}-\bar{y})^2\) is the total sum of squares.
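A short NumPy implementation of Eqs. (15) and (16), with illustrative function names:

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """Eq. (15): R^2 = 1 - Rss / Tss."""
    rss = np.sum((y_obs - y_pred) ** 2)              # residual sum of squares
    tss = np.sum((y_obs - y_obs.mean()) ** 2)        # total sum of squares
    return 1.0 - rss / tss

def pcc(y_obs, y_pred):
    """Eq. (16): Pearson correlation coefficient."""
    s_yY = np.sum((y_obs - y_obs.mean()) * (y_pred - y_pred.mean()))
    s_yy = np.sum((y_obs - y_obs.mean()) ** 2)
    s_YY = np.sum((y_pred - y_pred.mean()) ** 2)
    return s_yY / np.sqrt(s_yy * s_YY)
```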

4.3 Performance comparison with competitive models

To assess DeepCTF's performance, we compare it not only with DeepBind, which relies on DNA sequences as the primary input processed by a CNN, but also with four techniques that combine DNA shape and sequence: two kernel-based approaches (the spectrum + shape and di-mismatch + shape kernels), DLBSS, and CRPTS. DeepCTF is evaluated against these state-of-the-art approaches on 12 in vitro datasets using the aforementioned PCC and \(R^2\) metrics.

Moreover, Figs. 3 and 4 compare the overall efficacy of DeepCTF with the state-of-the-art methods on the 12 in vitro datasets. Concerning PCC and \(R^2\), it is clear from Figs. 3 and 4 that DeepCTF performs better and more steadily than the other approaches. As seen from these plots, DeepCTF is superior to the two kernel-based techniques owing to its use of DNA sequences together with DNA shapes, showing that both significantly influence the identification of TFBSs. DeepCTF attains a statistically significant improvement in average \(R^2\) and PCC, as seen in Fig. 5. In terms of \(R^2\) and PCC, DeepCTF outperforms DLBSS and CRPTS by roughly 7% and 4%, and 3% and 1.4%, respectively. This indicates that our DL model with an attention mechanism outperforms models that merely use a CNN. Across the 12 in vitro datasets, DeepCTF's highest and lowest values both exceed those of the competing approaches, and its smaller box shows that the range of the two indicators (\(R^2\) and PCC) is more condensed, demonstrating strong stability.

The exceptional performance of DeepCTF compared with the other competitive models (K_spectrum+Shape, Di-mismatch+Shape, DeepBind, DLBSS, CRPT, and CRPTS) was attributed to two factors: (1) it utilises DNA shape information; and (2) by employing an attention mechanism with the CNN and RNN, DeepCTF captures global information about the DNA sequences rather than only local information. The main limitation of the convolution operation is that it only processes local neighbourhoods, so global information is missed; the self-attention modules overcome this limitation by gathering more relational information across the network.

Further, we experimented with the DeepCTF model using only DNA sequences (without DNA shape data) as input, to evaluate whether adding DNA shape information affects the resulting prediction accuracy of TF binding affinities. Figure 6 shows that the DeepCTF model with only DNA sequence data as input (DeepCTF_without DNAShape) has lower \(R^2\) and PCC values than DeepCTF with both DNA sequence and shape data as input, but higher values than CRPT. This shows that the DeepCTF model has good stability. DeepCTF's attention layer extracts a global representation of the input DNA sequences and combines it with the local features extracted by the following convolutional layer; the LSTM layer then captures long-term dependencies in the DNA sequences, which are finally combined with the DNA shape features generated by a convolutional layer, enhancing the model's prediction ability.

5 Conclusion

Deep-learning models have effectively reduced the computational cost and time required for exploring the intricate relationships within large-scale biological data, revealing hidden complexities. This paper proposes an attention-based deep learning model (DeepCTF) that uses DNA sequences and shape data to predict transcription factor binding specificities. The method uses an attention layer, a CNN layer, and an RNN layer to learn features from DNA sequences and, on the other side, a single convolutional layer to learn features from DNA shape data. The two heterogeneous data sources are suitably integrated and fully utilised by the model. The higher efficiency of our proposed model DeepCTF is due to the attention layer, which extracts a global representation of the DNA sequences and combines it with the local features extracted by the CNN layer, feeding the result to the RNN layer, which learns the long-term dependencies of the DNA sequences. The experimental findings obtained on 12 uPBM datasets demonstrate the high efficiency of our proposed approach, DeepCTF, in TFBS prediction.