1 Introduction

DNA/RNA interactions with proteins play an important part in the regulation of gene expression, including transcription, translation, alternative splicing, and degradation (Huang et al. 2014; Zhu et al. 2015). Both DNA and RNA contain short regulatory sequences that are recognized by binding proteins such as transcription factors (TFs). The interaction between these biomolecules at specific TF-binding sites is the basic and foremost criterion in gene regulation (Zhu et al. 2013). These binding sites are usually short, typically ranging from a few to approximately 20 bp (base pairs), and are localized in the regulatory regions of genes.

Each protein has a characteristic binding affinity for its complementary genomic sequence, which arises from short, recurring sequence patterns known as motifs. The presence of such short conserved sequences in a genome therefore indicates the binding sites of particular proteins such as nucleases and TFs.

RNA motifs are likewise involved in numerous significant RNA processes, including ribosomal binding and mRNA processing, and are useful in characterizing genomic regulatory pathways and decoding the regulatory code of different genes. Motif discovery is therefore an important tool for computational biology in the post-genomic era (Dhaeseleer 2006). It also provides insights into other primary problems such as amyloid illnesses and has many pharmaceutical and industrial applications (Nair et al. 2012). The sequence specificity of a motif, which is needed for correct transcription factor-binding site (TFBS) identification, is determined more accurately when reliable and reproducible high-throughput sequencing technology is combined with computational methods (Nutiu et al. 2011; Siggers and Gordan 2013). A review of the traditional techniques and algorithms employed for motif discovery can be found in Das and Dai (2007) and Hashim et al. (2019).

Motif elicitation proceeds in three phases (Hashim et al. 2019): data preprocessing, motif search, and motif evaluation (see Fig. 1). During the first phase, sequence data downloaded from motif databases, TFBS datasets, or other high-throughput experiment datasets are often clustered using state-of-the-art clustering algorithms to categorize the dataset based on some criteria. Data-cleaning procedures are then performed on the clusters so that the effects of biases and noise are reduced satisfactorily. During the second and most important phase, the motif algorithms work on the cleaned and clustered data to find conserved motifs. An encoding scheme for motif representation is applied to the data so that the chosen algorithm can work on it efficiently, using a scoring function to find statistically significant motif patterns. In the last phase, the elicited motifs are evaluated against known motif databases to determine the performance and accuracy of the motif discovery algorithm. Different flavors of motif search models use different motif discovery procedures in the second phase, where conventional methods such as the probabilistic approach and word enumeration techniques are now being replaced or augmented by neural network models.
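As an illustration of the scoring step in the conventional probabilistic approach, the following minimal sketch builds a log-odds position weight matrix (PWM) from aligned sites and slides it along a sequence. The aligned sites, the pseudocount, the uniform background frequency, and the function names are assumptions made for this example only; they are not taken from any tool surveyed here.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm

def best_window(sequence, pwm):
    """Slide the PWM along the sequence; return (start, score) of the top window."""
    width = len(pwm)
    scored = [
        (start, sum(pwm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored, key=lambda item: item[1])

# Hypothetical aligned binding sites (illustrative only, not real data).
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(sites)
start, score = best_window("CCGTTGACTCAGGA", pwm)
print(start, round(score, 2))
```

The window with the highest log-odds score is reported as the putative site; a real pipeline would additionally assess statistical significance against a background model.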

Fig. 1 A basic flow diagram of the motif search process

In this context, powerful “deep-learning” (DL) techniques have been developed, typically built on convolutional neural networks (CNNs). They capture the information relevant to motif discovery that is needed to define selective transcription factor-binding sites accurately (Alipanahi et al. 2015; Hassanzadeh and Wang 2016; Quang and Xie 2016; Zhou and Troyanskaya 2015). For motif research, modern computational methods have advantages over conventional biological experimental techniques because they are simple to operate, cost-effective, and less tedious.

1.1 Motif mining with deep learning

The main task of motif finding is to interpret the complex behavior of motifs. This task becomes difficult when the experimental methodology is poorly chosen, because the binding specificity of TFs for variant motifs can then no longer be located accurately. DL approaches, in contrast, can be trained on the various high-throughput datasets produced by biological research, in particular to understand the regulatory changes that bear directly on human health and disease. Furthermore, deep learning provides frameworks for developing and sharing DL models for diverse genomic sequences (Avsec et al. 2019; Chen et al. 2019) and for improving the interpretability of DL models (Shrikumar et al. 2017; Binder et al. 2021). Network architectures can also be optimized automatically (Zhang et al. 2021a), which makes the methods more accessible. DL models, which rely on the basic building blocks of CNNs (Krizhevsky et al. 2017), have thus become a leading technique for many applications in bioinformatics and computational biology (Eraslan et al. 2019).

DeepBind, the first of many models that employ DL methods, is a CNN-based application that predicts the specificity of protein binding (Alipanahi et al. 2015). Hybrid DL models are also used to predict the function of particular DNA sequences, of which DanQ is one exemplar (Quang and Xie 2016). A variety of other computational models are based on novel convolutional architectures; one example is the circular filter, which is used to interpret data on TF specificity and binding to DNA/RNA (Blum et al. 2019). Many DL methods that came after DeepBind employ the CNN model and add more complex models on top of it to capture the long-term relationships between motif sequences. Some models use RNNs (Recurrent Neural Networks) or their improvements, such as Bi-LSTM (Bidirectional Long Short-Term Memory) networks, or SAEs (Stacked Autoencoders) instead of CNNs, so that variable-length input can be provided to the model and long-term relationships are also captured. Others add further regulatory information, such as DNA/RNA shape features, chromatin accessibility data, and histone modifications, to the convolutional kernels of the baseline CNN model to enhance the interpretability of the model. However, the basic motif discovery phase (Phase II of Fig. 1) consists of several convolution kernels that act as motif sequence finders. The kernels are applied across the input sequence so that motif features are captured for each window of the sequence. The DNA input to these kernels is encoded either via one-hot encoding or k-mer encoding. A general deep-learning framework for motif discovery that summarizes the broad steps involved is given in Fig. 2.

Fig. 2 A generalized deep-learning framework for DNA/RNA motif elicitation. Any one or a combination of high-throughput datasets is pre-processed for noise, bias, etc., and encoded using either one-hot or k-mer encoding schemes before being used to train the deep neural network architecture of choice. Several deep neural networks may be combined for greater interpretability, performance, sensitivity, and specificity
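The two encoding schemes mentioned above can be sketched in a few lines. This is only an illustrative sketch: the function names, the A/C/G/T column order, the all-zero treatment of unknown bases such as N, and the choice of k = 3 with stride 1 are assumptions for the example, and individual tools may make different choices.

```python
ALPHABET = "ACGT"

def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows (columns A, C, G, T).

    Bases outside the alphabet (e.g. N) are left as all-zero rows.
    """
    return [
        [1.0 if base == letter else 0.0 for letter in ALPHABET]
        for base in sequence.upper()
    ]

def kmer_tokens(sequence, k=3, stride=1):
    """Split a sequence into overlapping k-mers, the usual input to a k-mer embedding."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

encoded = one_hot("ACGTN")
tokens = kmer_tokens("ACGTAC", k=3)
print(encoded[0], encoded[4])
print(tokens)
```

One-hot rows feed convolution kernels directly, whereas k-mer tokens are typically mapped to learned embedding vectors before entering a recurrent or convolutional layer.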

A comprehensive meta-analysis of DL architectures via deepRAM can be used to assess DNA- and RNA-binding specificity and gives the researcher a valuable, exhaustive investigation of the genome (Trabelsi et al. 2019). A similar survey that reiterates the work done by Trabelsi et al. (2019) can be found in He et al. (2020). These DL methods are used to find motifs from human ChIP-seq (chromatin immunoprecipitation sequencing) data, which contain common DNA sequence patterns and their corresponding TF-DNA-binding affinities (Yang et al. 2019). Combining the sequence and shape features of DNA (Sutskever et al. 2014) makes it possible to recognize TFBSs and sequence-specific motifs of DNA/RNA as well (Zhang et al. 2019a, b).

However, computational methods for genomic DNA/RNA motif mining still have some drawbacks. For example, the lack of large data profiles forces researchers to augment their training datasets. Further, the more complex a DL model becomes, the harder it is to interpret. Even after prediction results are available, it is often not fully understood how they connect to the intricacies of the body’s regulatory networks. Moreover, choosing the correct network architecture along with correctly tuned hyper-parameters is also very challenging. Thus, it is just as necessary to understand and work on these limitations as it is to understand the complex behavior of gene regulatory mechanisms concerning sequence-specific motifs and the binding affinities of their respective transcription factors to genomic sequences (Das and Dai 2007).

1.2 Brief overview of our study

This comprehensive review highlights the deep-learning predictive models developed for DNA/RNA motif mining over the past few years and briefly describes the attributes of existing learning models. The performance of existing DL models in predicting transcription factor binding to genomic DNA and RNA is also briefly explained. The review gathers evidence relevant to motif mining by following PRISMA reporting and literature survey guidelines. Unlike the meta-analysis reported by Trabelsi et al. (2019), it also includes models that use datasets other than ChIP-seq/CLIP-seq data, such as DNase-seq, ATAC-seq, ChIP-exo, and ChIP-nexus. It provides a methodology review of more than 30 models up to the year 2021, including benchmarking guidelines, whereas previous surveys (Trabelsi et al. 2019; Wang et al. 2020b; He et al. 2020) have included only 20 models. By including models that are scalable and flexible enough to work with more than one type of sequence dataset, this scoping methodology review intends to present a more comprehensive picture of the DNA/RNA motif mining problem that employs DL techniques. Furthermore, this evidence on DL methods for DNA/RNA motifs helps to evaluate recent improvements in computational approaches. To the best of our knowledge, no past reviews have used PRISMA-ScR guidelines for a systematic methodology review of human DNA/RNA motif discovery tools and algorithms that use deep-learning architectures with varied high-throughput datasets.

  • Review question: Is the prediction of DL models well suited for identifying the regulatory components and sequence structures that participate in the genomic rearrangement of DNA/RNA?

  • Inclusion criteria: The DL model accurately predicts the DNA/RNA protein sequence specificity pattern for gene regulatory mechanisms operating in a biological system.

  • Focus: The review will focus on short regulatory sequence elements or associated gene changes as a consequence of transcriptional factor binding.

  • Context: All surveyed literature reports are original and peer-reviewed, in any language and without date range limitations.

  • Types of sources: This scoping review will consider all full-text research papers, including experimental, case–control, quantitative, and genome-wide studies, meta-analyses, and candidate gene studies. In addition, the review will not consider the role of additional variant and non-variant sequences in therapeutic approach development and disease diagnosis, which may include an element of human cis-regulatory studies in gene expression.

2 Methods

We conducted a scoping review following the reporting checklist of the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines (Moher et al. 2009; Tricco et al. 2018; Martin et al. 2020; Peters et al. 2021, 2022) (see Fig. 3). The goal of the bibliometric analysis is to explore motif information and provide an in-depth learning pathway that helps biologists understand the complex chemistry of DNA/RNA sequences. The literature search was carried out on the scientific databases from Feb 2022 to June 2022. The search string used to explore the various scientific databases was: (“motif mining” OR “motif discovery”) AND “data” AND “mining” AND (“deep learning techniques” OR “deep learning methods”) AND “load” AND “profil*”. This string was matched against the article title, abstract, keywords, and content of reports in the literature databases published between 2012 and 2021. Exploring specific TFBSs in a genomic sequence is a novel and growing field in which research started around late 2004 (Häussler and Nicolas 2005). In addition, this report considers only research articles relevant to human DNA/RNA motif discovery and their data profiling.

Fig. 3 PRISMA Flow Diagram for a scoping review of motif discovery using DL models for human DNA/RNA

2.1 PRISMA-ScR results: motif pattern discovery and transcription factor site binding

The primary search found a total of (n = 13,801) records in the different databases (PubMed, BioMed Central, ScienceDirect, EBSCO, JASPAR, JSTOR, etc.), as shown in the PRISMA flow chart (Fig. 3). Automation tools help to speed up systematic reviews and also improve accuracy (Beller et al. 2018; Scott et al. 2021; Harrison et al. 2020), and tools such as LitSuggest (Allot et al. 2021), Abstrackr (Wallace et al. 2012), and Colandr (Cheng et al. 2018) were used in this study for automation tasks such as screening, extraction, and duplicate elimination. After the preliminary identification phase, (n = 5420) records were eliminated.

The remaining (n = 8381) records were screened in PubMed (https://pubmed.ncbi.nlm.nih.gov/), JASPAR (Castro-Mondragon et al. 2021), Nature (https://www.nature.com/), EBSCO (https://www.ebsco.com/), ENCODE (Luo et al. 2019), PLOS (https://plos.org/), ScienceDirect (https://www.sciencedirect.com/), JSTOR (https://www.jstor.org/), UniPROBE (Universal Protein Binding Microarray Resource for Oligonucleotide Binding Evaluation, Hume et al. 2014), and BioMed Central (https://www.biomedcentral.com/). We carefully screened the literature relevant to the genomic datasets. Secondary search results from the ENCODE, UniPROBE, and JASPAR datasets contributed the largest number of papers (n = 4051); these were first indexed to include short communication articles, books, book chapters, presented papers, and journals. From BioMed Central, which holds research articles and book chapters, around (n = 1042) related articles were selected after going through their titles and abstracts. Analysis of titles and abstracts from all databases using manual screening and automation tools reduced the number of articles to (n = 5316).

During the selection phase, titles related to designing personalized therapeutic approaches and disease-associated risk factors were excluded from the two search results, as they do not come under the scope of this review. We considered only those articles related to genomic studies that reflect genetic and other risk factors with acquired biochemical activity along with genomic profiling, because these factors have a substantial impact on motif profiling results. Studies of structural changes at the genomic level are also considered in this review, as these changes occur due to histone modifications, chromatin accessibility, or protein-to-protein binding, although this subject area is broader than the scope of the review. In addition, many articles were concerned with motifs other than human DNA/RNA and the corresponding TFs and TFBSs, and others were not relevant to genomic regulation pathways; these were excluded from the study. At this juncture, (n = 1209) records were eligible for retrieval. A search of the Google Scholar database found an additional (n = 1700) records that have a pivotal role in motif finding. These records were screened in the same way, and after removing duplicates and preprints, the remaining (n = 388) were merged with the eligible records to give (n = 1597) articles. Of these, (n = 303) studies are relevant to deep-learning frameworks and (n = 58) are relevant to human DNA/RNA motif discovery, from which (n = 33) quantitative studies that present novel DL models for the motif search problem from 2012 to 2021 were selected for inclusion in this review (Fig. 3).
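The record counts reported in this section can be cross-checked with simple arithmetic. The sketch below only restates the numbers given in the text; the variable names are chosen for this example.

```python
# Record counts as reported in the text of Sect. 2.1.
identified = 13801
removed_at_identification = 5420
eligible_from_databases = 1209
retained_from_google_scholar = 388

screened = identified - removed_at_identification
merged = eligible_from_databases + retained_from_google_scholar

# Later stages should be nested subsets of the merged pool:
# merged -> DL frameworks -> human DNA/RNA motif discovery -> included studies.
stages = [merged, 303, 58, 33]
is_nested = all(a >= b for a, b in zip(stages, stages[1:]))

print(screened, merged, is_nested)
```

The differences and sums agree with the reported figures of 8381 screened records and 1597 merged articles.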

3 Methods and deep-learning models

Several deep-learning model architectures have been designed to improve the efficacy of DNA or RNA motif extraction. Some DL frameworks for locating motifs are based on low-cost CNN (Convolutional Neural Network) variants, for example, the MobileNet family (Sandler et al. 2018), EfficientNet (Tan and Le 2019), CSPNet (Cross Stage Partial Network) (Wang et al. 2020a), and DenseNet (Huang et al. 2017). Furthermore, different deep-learning models have been used to explore ChIP-seq data, including recurrent neural networks (RNNs) (Kusupati et al. 2019), e.g., KEGRU (Gated Recurrent Unit with k-mer Embedding) used for RNA visual and textual motif mining (Shen et al. 2018; Xiong et al. 2016), Deep Belief Networks (DBN) (Chen et al. 2015), and Graph Neural Networks (GNN) (Chiang et al. 2019; Zou et al. 2019), giving rise to over 30 specialized computational tools, e.g., DESSO (DEep Sequence and Shape mOtif, Yang et al. 2019), DeepBind (Alipanahi et al. 2015), and DeeperBind (Hassanzadeh and Wang 2016). These models are further modified to address problems found within the biological domain (Pouladi et al. 2015) and to reduce the quadratic computational complexity and the memory cost of training. Earlier CNN-based models for motif mining include DeepBind (Alipanahi et al. 2015), Basset (Kelley et al. 2016), and DeepSEA (Zhou and Troyanskaya 2015) (Quang and Xie 2016). The motif discovery process involved in DeepBind is illustrated in Fig. 4. The convolution kernel filters detect low-level characteristics present in the one-hot encoded sequences, and the resulting scores are shifted by a threshold in the rectified linear unit (ReLU). The maximum and average pooling account for the detection of longer motifs and the cumulative effects of shorter motifs, respectively.
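The convolution, rectification, and pooling stages described above can be sketched for a single kernel as follows. This is a simplified illustration, not the DeepBind implementation: the kernel weights are set by hand rather than learned, the threshold value is arbitrary, and all function names are assumptions for this example.

```python
def one_hot(sequence, alphabet="ACGT"):
    """Encode a DNA string as a list of 4-element rows (columns A, C, G, T)."""
    return [[1.0 if base == letter else 0.0 for letter in alphabet] for base in sequence]

def conv_scan(one_hot_seq, kernel):
    """Cross-correlate one kernel (width x 4) with a one-hot sequence (length x 4)."""
    width = len(kernel)
    return [
        sum(kernel[i][j] * one_hot_seq[start + i][j]
            for i in range(width) for j in range(4))
        for start in range(len(one_hot_seq) - width + 1)
    ]

def rectify(scores, threshold):
    """Shifted ReLU: keep only the part of each score that exceeds the threshold."""
    return [max(0.0, s - threshold) for s in scores]

def pool(activations):
    """Return (max, average) of the rectified activations for one kernel."""
    return max(activations), sum(activations) / len(activations)

# Hypothetical kernel that rewards the 3-mer "TGA" (hand-set weights, not learned).
kernel = one_hot("TGA")
scores = conv_scan(one_hot("ACTGATGA"), kernel)
activations = rectify(scores, threshold=2.0)
max_pooled, avg_pooled = pool(activations)
print(scores, max_pooled, round(avg_pooled, 3))
```

Max pooling reports whether the pattern occurs anywhere in the sequence, while average pooling grows with the number of occurrences; a trained model would pass these pooled values from many kernels to a fully connected layer.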

Fig. 4 Basic architecture of the DeepBind model. It uses a CNN architecture with several convolution kernels that extract low-level features from the input. The predictions are gradually improved via backpropagating the errors and updating model parameters

The neural network then trains on the generated feature vector and outputs a final score, which is improved via backpropagation until the desired performance is achieved. DeepSEA added single-nucleotide sensitivity and chromatin profiling, and widened the input sequence window to 1000 bp. While Basset (Kelley et al. 2016) used DNase I hypersensitive sites (DHS) to take account of DNA accessibility effects on TF binding, DeepHistone (Yin et al. 2019) used chromatin accessibility data with motif prediction. Dilated (Gupta and Rush 2017) further increased the spectrum of search by taking longer sequences to capture the long-range effects of DNA motifs. DeepSNR (Salekin et al. 2018) added a deconvolution layer after the CNN to increase specificity to single nucleotides, using ChIP-exo datasets that remove noisy data and help to detect weak motifs as well. DESSO (Yang et al. 2019) added DNA shape features to the baseline CNN model, as well as a statistical analysis module based on the binomial distribution for greater predictability. scFAN (Fu et al. 2020) added the feature of genome-wide TFBS prediction using a CNN with 3 layers for each cell. TFImpute (Qin and Feng 2017) and FactorNet (Quang and Xie 2019) impute the TFBSs for cell lines whose ChIP-seq data are not available by training the network on known cell line data. However, while TFImpute uses a CNN model, FactorNet uses a hybrid architecture composed of CNN and LSTM units in the RNN layer. FCNA (Zhang et al. 2021b) employs many fully connected CNNs to form an encoding and a decoding layer to do away with dataset disparity between positive and negative sets. RNA motif detectors such as iDeepE (Pan and Shen 2018b), iDeepV (Pan and Shen 2018a), and DeepRBP-Pred (Zheng et al. 2018) used CNNs to find the locations of RBPs (RNA Binding Proteins) from CLIP-seq datasets. While iDeepV used k-mer embedding and a one-dimensional CNN model, iDeepE used a local CNN and a global CNN to infer local and genome-wide features before merging the results. DeepVISP (Xu et al. 2021) used an attention mechanism after the CNN layer to identify virus integration sites (VISs) for cancer-causing viruses in humans.

Hybrid CNN–RNN-based models such as DanQ (Quang and Xie 2016), FactorNet (Quang and Xie 2019), DeepSite (Zhang et al. 2019b), iDeep (Pan and Shen 2017), iDeepS (Pan et al. 2018), and DeeperBind (Hassanzadeh and Wang 2016) have been successfully applied to identify the sequence specificity of TF-DNA binding and RNA-binding proteins (Pan et al. 2018), with good performance compared with existing motif-based statistical methods. TBiNet (Park et al. 2020) and DeepGRN (Chen et al. 2021) use an attention mechanism to discover long-range dependence in addition to the LSTM units in the RNN layer. Similarly, WSCNN LSTM (Zhang et al. 2019a), DeepSite (Zhang et al. 2019b), iDeepS (Pan et al. 2018), and DeepCLIP (Grønning et al. 2020) employ bidirectional LSTM units in the RNN layer to take into account the forward and reverse long-term dependencies among the motifs detected in the CNN layer. DeepCpG (Angermueller et al. 2017) and KEGRU (Shen et al. 2018) use gated units in the RNN layer to capture DNA methylation sites and TFBSs, respectively. AgentBind (Zheng et al. 2021) uses fine-tuned models after initial motif detection in the CNN layer, which further enhances the specificity as each model is tuned for the target TF. iDeep (Pan and Shen 2017) uses Deep Belief Networks (DBN) to capture different features such as motif information, structure information, region, and co-binding factors. iDeep also uses CNN filters to derive the motif locations for RBP sites directly from the sequence data. It then merges the results obtained from the various individual networks to classify RBP-binding sites. DeepFinder (Lee et al. 2018) uses a stacked autoencoder (SAE) in its ‘three-stage approach’ to detect motifs and TFBSs. DeepFinder tries to impute the other TFBSs from a small subset of training data.

The advantage of the CNN-RNN hybrid model is that it is composed of multiple layers of data abstraction, which allows accurate prediction of complex biological data relevant to functional biology, e.g., phylogenetic inference, protein functions, and other aspects of computational biology (Krizhevsky et al. 2017). Other popular deep-learning architectures are applied to different areas of the biological sciences: CNN and ResNet (Residual Neural Networks) for phylogenetic inference; CNN, RNN, LSTM (Long Short-Term Memory), SAE (Stacked Autoencoders), and VAE (Variational Autoencoders) for systems biology and data integration; MLP (Multi-Layer Perceptron) and CNN for genome engineering, mainly for gRNA (guide RNA) sites on human genomes and CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) profile build-up; CNN, RNN, ResNet, and GNN for protein function prediction; and lastly, CNN, ResNet, BLSTM (Bidirectional Long Short-Term Memory), and Transformers for protein structure prediction (Sapoval et al. 2022). In addition, all the advanced DL models from 2012 to 2021 that deal with the problem of motif discovery are presented in Table 1 with a concise overview.

Table 1 DL models for human datasets such as ChIP-seq, CLIP-seq, DNase-seq, ATAC-seq, and ChIP-exo for motif mining

4 Deep-learning model selection benchmarks

Model selection benchmarks play a crucial role in the performance of deep-learning models for DNA/RNA motif discovery. To achieve the best performance, researchers need to carefully select the most suitable model for their specific dataset and research question. This requires extensive model benchmarking, which involves testing a range of models and selecting the best-performing one based on specific evaluation metrics. Choosing the correct DL model for a specific purpose is a confusing and challenging task for researchers without an assessment of model performance in terms of accuracy of motif finding, sequence classification, specificity, sensitivity, usability, and scalability. Comparative results on human ChIP-seq/CLIP-seq data have been demonstrated using deepRAM (Trabelsi et al. 2019) to reveal the performance of existing complex networks. DL model selection is primarily based on the available volume of data, the neural network type, and the model outputs (Pouladi et al. 2015). New DL methods derived from existing models and their variants are therefore needed to perform better when the data are complex and sufficiently large (Pan and Shen 2017). Thus, recent research trends tend to move towards complex model construction rather than simpler models. Model selection in motif mining is also difficult because of the many hyper-parameters that need to be carefully tuned to attain the desired accuracy and speed. The training sample size must also be chosen to achieve the right representation of the datasets. While the generally accepted rule of thumb is that the training set should contain at least ten thousand samples, some researchers (Lee et al. 2018; Zia and Moses 2012; Hu et al. 2005) have observed that a smaller sample of shorter sequences may suffice for the motif search problem and that a larger number of sequences will not result in any further improvement in model performance. Thus, researchers must carefully select the most suitable model based on specific evaluation metrics, and model selection procedures such as a validation set or cross-validation should be performed to achieve the best performance.
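Cross-validation for model selection, as recommended above, can be sketched as follows. The candidate names, the placeholder evaluator, and its scores are invented for illustration; a real evaluator would train each candidate on the training indices and return a validation metric such as the AUC.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        val_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, val_idx

def select_model(candidates, evaluate, n_samples, k=5):
    """Pick the candidate with the best mean validation score across the k folds."""
    mean_scores = {}
    for name in candidates:
        scores = [evaluate(name, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_scores[name] = sum(scores) / len(scores)
    return max(mean_scores, key=mean_scores.get), mean_scores

# Placeholder evaluator with made-up scores (no training is performed here).
def fake_evaluate(name, train_idx, val_idx):
    return {"cnn": 0.90, "cnn_bilstm": 0.88}[name]

best, table = select_model(["cnn", "cnn_bilstm"], fake_evaluate, n_samples=100)
print(best, table)
```

Using the same seed for every candidate keeps the folds identical across models, so differences in the mean score reflect the models rather than the split.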

4.1 Performance evaluation of computational DL models

For computational biology applications, one approach to enhancing the efficacy of DL models is to exploit their inherent capacity to locate complex biological sequences by focusing only on a small set of genomic sequences rather than the whole genome, as discussed by Ke and Vikalo (2020). Several researchers have also suggested Transformer models for DNA/RNA sequence modeling (Zaheer et al. 2021). Transformer models, however, incur higher training costs owing to the costly global attention procedure. Lightweight DL models combined with clustering methodology are therefore recommended to prune the model and lower the neural network size, which has become a popular approach in deployment.

Alternatively, a DL model known as deepBICS can compute the affinity of transcription factors to DNA target sites (Quan et al. 2022). This model is applied to human ChIP-seq datasets and differentiates disease-related variants from non-related variants. An improved version, deepBICS4SNV, has also been proposed to improve accuracy and generalization capability in diagnosing disease-related pathogenicity (Quan et al. 2022). Some performance evaluation of DL models has been presented in Trabelsi et al. (2019), Wang et al. (2020a, b), and He et al. (2020). The general consensus is that CNN models perform better at DNA/RNA motif discovery than models that amalgamate various architectures. This is mainly due to the interpretability of the model, which becomes more challenging as models incorporate different types of DL sub-units to create a hybrid.

The performance of the DL models can be evaluated using various metrics, such as accuracy, sensitivity, specificity, and the area under the receiver-operating characteristic (ROC) curve (AUC). These metrics help to assess the quality of the model’s output and its ability to discriminate true motifs from false positives. The use of AUC is increasingly common in the evaluation of DNA/RNA motif discovery models as it provides a single numerical value that summarizes the overall model performance. In addition to these metrics, other state-of-the-art evaluation methods have emerged, such as precision–recall curves and F1 score. These evaluation methods can help researchers to identify the strengths and weaknesses of the DL models, which can be used to refine the models further. Furthermore, new metrics and evaluation techniques are constantly being developed, demonstrating the need for continuous improvement in DNA/RNA motif discovery applications.
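The metrics named above can be computed from labels and scores as in the following minimal sketch. The toy labels, scores, and the 0.5 decision threshold are assumptions for the example; the AUC is computed here through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which is equivalent to the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 (motif) and 0 (background)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def basic_metrics(y_true, y_pred):
    """Assumes both classes are present and at least one positive prediction exists."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)   # recall, true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}

def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (made-up labels and model scores).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = basic_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, round(auc, 3))
```

Unlike accuracy or F1, the AUC does not depend on the decision threshold, which is one reason it is convenient as a single summary value.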

Figure 5 provides a summary of the performance of various DL models for DNA and RNA motif discovery in terms of the average AUC. Among the latest DNA motif discovery tools studied in this review, TBiNet, DeepSite, FactorNet, DeepGRN, AgentBind, and FCNA all outperform established models such as DeepSEA, DeepBind, Basset, DanQ, and Zeng (Park et al. 2020; Zhang et al. 2019b, 2021b; Chen et al. 2021). These tools were tested on ChIP-seq datasets from the ENCODE database. However, DeepGRN was shown to outperform FactorNet for some DNase-seq (DNase hypersensitive sites sequencing) datasets that are considered to be superior to ChIP-seq datasets for TFs and TFBSs. DeepVISP was compared only against traditional models and was found to outperform them with an average AUC (Area Under the receiver-operating characteristic Curve) of about 0.8 on several datasets (Xu et al. 2021). Overall, TBiNet and DeepFinder report the highest AUCs for ChIP-seq datasets from ENCODE, greater than 0.9 and 0.95, respectively. Among the RNA motif search models, RBPSuite was reported to be better than counterparts such as iDeepS and other traditional methods, with an approximate AUC of 0.85 (Pan et al. 2020). These models have demonstrated high performance in various benchmarks, and they are constantly being improved and refined to achieve better accuracy and generalization.

Fig. 5 Performance of the various DL models in terms of average area under the receiver-operating characteristic (ROC) curve (AUC) for DNA/RNA motif mining problem

4.2 Scalability evaluation of computational DL tools

Researchers should know which deep-learning tools are appropriate for motif analysis studies and DNA/RNA sequence classification (Qin and Feng 2017). For this purpose, the performance of many DL tools was evaluated (Wang et al. 2020b) on the basis of four scores, namely the area of eight metrics radar (AEMR) score, motif prediction score, algorithm scalability, and tool usability. The AEMR is based on eight metrics, viz. sensitivity, specificity, precision, negative predictive value, accuracy, F1 score, geometric mean, and Matthews correlation coefficient (MCC). The overall AEMR score and the motif prediction score summarize the performance of the DL tools and were used to rank every model from the highest to the lowest score. The AEMR score provides a single summary metric that captures the overall performance of a deep-learning model across multiple metrics. This can be useful when comparing the performance of different models, as it provides a simple way to see which model is performing better overall. Among recently developed DL tools, DESSO registers the highest overall score for DNA sequences, while DeepBind is the best-scoring DL tool for RNA sequence-based analysis and is considered the next best DL tool for DNA sequences. Despite this, some researchers (Tang and Sun 2019) find the CNN-based tools to be better than the CNN-RNN tools for DNA sequences and inferior for micro-RNA sequences. This might be due to insufficient RNA motif data availability and the more variable nature of RNA CLIP-seq (cross-linking immunoprecipitation with sequencing) data compared with DNA ChIP-seq data.
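To make the idea of a radar-area summary concrete, the sketch below computes the MCC and geometric mean from confusion counts and the area of the polygon that eight metric values span on a radar chart with equally spaced axes. The exact AEMR definition used by Wang et al. (2020b), including axis order and any normalization, is not given in this text, so the `radar_area` formula should be read as an assumption rather than as their method; note that the area of such a polygon depends on the order of the axes.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def g_mean(tp, tn, fp, fn):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

def radar_area(values):
    """Area of the polygon drawn on a radar chart with equally spaced axes.

    Neighbouring axes enclose the angle 2*pi/n, so each wedge between
    values r_i and r_(i+1) contributes 0.5 * r_i * r_(i+1) * sin(2*pi/n).
    """
    n = len(values)
    wedge = math.sin(2 * math.pi / n)
    return 0.5 * wedge * sum(values[i] * values[(i + 1) % n] for i in range(n))

# A model scoring 1.0 on all eight metrics spans a regular octagon of radius 1.
perfect = radar_area([1.0] * 8)
# Hypothetical confusion counts for illustration.
example_mcc = mcc(tp=2, tn=2, fp=1, fn=1)
example_gmean = g_mean(tp=2, tn=2, fp=1, fn=1)
print(round(perfect, 3), round(example_mcc, 3), round(example_gmean, 3))
```

Dividing a model's area by the area of the perfect octagon would give a score between 0 and 1 when all metrics are non-negative; whether the original evaluation normalizes in this way is not stated here.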

In addition, DeepHistone achieves the best AEMR score, and DESSO was identified as the best tool for analyzing different motif patterns (LeCun et al. 2015). For RNA sequences, the iDeepV and iDeepS models, which are based on CNN and BLSTM (bidirectional long short-term memory) networks, are identified as the best tools for RNA sequence cataloging and RNA motif mining, respectively.
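
To make the idea of a radar-area summary score concrete, the sketch below derives the eight metrics listed above from a binary confusion matrix and then computes the area of the polygon obtained by plotting them on eight equally spaced radar axes. The exact definition used in the cited evaluation (axis order, normalization, treatment of negative MCC values) is not given in this review, so the choices here are assumptions: metrics are plotted in the order listed, negative values are clipped to 0, and all function names are hypothetical.

```python
import math


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def eight_metrics(tp, fp, tn, fn):
    """Eight classification metrics from confusion-matrix counts."""
    sens = safe_div(tp, tp + fn)            # sensitivity (recall)
    spec = safe_div(tn, tn + fp)            # specificity
    prec = safe_div(tp, tp + fp)            # precision
    npv = safe_div(tn, tn + fn)             # negative predictive value
    acc = safe_div(tp + tn, tp + fp + tn + fn)
    f1 = safe_div(2 * prec * sens, prec + sens)
    gmean = math.sqrt(sens * spec)          # geometric mean of sens and spec
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": sens, "specificity": spec, "precision": prec,
        "npv": npv, "accuracy": acc, "f1": f1, "gmean": gmean, "mcc": mcc,
    }


def radar_area(values):
    """Area of the radar polygon for values on equally spaced axes.

    Adjacent axes are 2*pi/n apart, so each wedge between neighbours a and b
    has area 0.5 * a * b * sin(2*pi/n). Negative values are clipped to 0
    (an assumption, relevant only for MCC).
    """
    n = len(values)
    clipped = [max(v, 0.0) for v in values]
    wedge = 0.5 * math.sin(2 * math.pi / n)
    return wedge * sum(clipped[i] * clipped[(i + 1) % n] for i in range(n))


def rank_models(confusions):
    """Rank models by radar area, highest first.

    confusions: dict mapping a model name to its (tp, fp, tn, fn) counts.
    """
    scored = {
        name: radar_area(list(eight_metrics(*counts).values()))
        for name, counts in confusions.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical confusion matrices for two unnamed models.
for name, score in rank_models({"model_a": (40, 10, 45, 5),
                                "model_b": (30, 20, 35, 15)}):
    print(name, round(score, 3))
```

Because the polygon area depends on which metrics sit on neighbouring axes, a score of this kind is sensitive to axis order, which is one reason the ordering above should be read as an assumption rather than as the published procedure.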

Many of the models included in this review have been applied to the ENCODE-DREAM challenge datasets, which consist of repositories of ChIP-seq, RNA-seq, and DNase-seq datasets available for download (see https://www.synapse.org/#!Synapse:syn6131484/wiki/402028). Model scalability is thus often measured as how quickly and accurately a model can be trained on different datasets such as those available in the challenge. Models for DNA motif discovery must be able to scale up well to ChIP-seq, ATAC-seq, DNase-seq, DNA shape features, etc. AgentBind, FactorNet, and DanQ scale up well to both ChIP-seq and DNase-seq datasets. DeepSEA and Dilated have only been tested on DNase-seq, whereas many state-of-the-art models such as DeepBind, DeeperBind, Zeng, TFImpute, DeepFinder, deepRAM, DESSO, DeFine, DeepSite, scFAN, FCNA, TBiNet, and AgentBind have been successfully trained and tested with ChIP-seq data from cell lines. Basset is one model that has scaled up well on many different data types, such as ChIP-seq, ATAC-seq, DNase-seq, and CIS-BP (Catalog of Inferred Sequence Binding Preferences) datasets. DeepCpG has been trained and tested on two datasets, CIS-BP and UniPROBE. RNA motif finders such as iDeepS, iDeep, iDeepE, iDeepV, RBPSuite, and deepRAM have scaled up well on standard CLIP-seq datasets. The authors of DeepVISP (Xu et al. 2021), on the other hand, have created their own curated dataset called VISDB (Viral Integration Site DataBase), which they used to train their model.

4.3 Research gaps identified

This study noted that the existence of numerous versions of motifs for a single TF across several databases, together with the lack of a standardized evaluation system, makes it difficult for biologists to select a suitable model and for algorithm designers to standardize, assess, and enhance their DL models. In addition, data scientists are often not well versed in TFBSs, which hinders their ability to find specific motif patterns accurately and to select appropriate algorithms for predicting true TF-binding sites. This can happen when there is little interaction between researchers from the two domains, which in turn affects the identification of unknown true TF-binding sites in genomic sequences. Such missing information hampers high-throughput screening with advanced techniques such as next-generation sequencing (NGS), and the identification of such specific sites may also suffer. Poorly annotated TF-binding sites therefore reduce the confidence with which sequencing results can be interpreted. In addition, inadequate knowledge of the entire gene expression dataset, inappropriate tuning of model settings, or model selection performed without regard to the setting in which the model will be used in practice can result in error-prone results. Thus, incorporating generalizable domain knowledge within DL architectures, and training DL models adequately so that they generate strong estimates on test data and compare well with previous studies, can improve model performance. Automatic calibration of complex datasets and training that keeps biologists up to date can make it easier to predict the complex and variable nature of motifs, which can help them to understand the complex chemistry behind nucleic acid structure.

More specifically, the variable nature of motifs affects genomic sequences structurally and functionally, and the number of possible sequences of a given length grows exponentially with that length. Deep-learning models can handle the complex behavior of large genomic sequence datasets very well, especially for ChIP-seq data, and have therefore been preferred over other techniques for computational reasons. However, these DL techniques come with their own set of limitations and challenges. Model interpretability is still an issue, especially with complex models that combine many different types of DL concepts in a single framework. Further, training set size and hyper-parameter selection, along with other model selection benchmarks, remain a challenge when designing a novel DL framework that is both scalable and efficient at eliciting motifs for TFs and TFBSs with a low false-positive rate. The representation of DNA/RNA sequence data in the DL model is another area in which improvements and novelty are warranted.
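
As an illustration of the sequence-representation issue, the sketch below shows one-hot encoding, a widely used way of presenting DNA or RNA sequences to convolutional models, together with the count of possible sequences of a given length. The function names, the choice to encode unknown bases such as N as all-zero rows, and the example sequences are assumptions made for illustration; individual tools may use different conventions.

```python
DNA_ALPHABET = "ACGT"
RNA_ALPHABET = "ACGU"


def one_hot(sequence, alphabet=DNA_ALPHABET):
    """Encode a sequence as a list of 0/1 rows, one row per base.

    Bases outside the alphabet (for example N) become all-zero rows,
    which is one common convention but an assumption here.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def possible_sequences(length, alphabet=DNA_ALPHABET):
    """Number of distinct sequences of the given length: |alphabet| ** length."""
    return len(alphabet) ** length


print(one_hot("ACGTN"))
print(one_hot("ACGU", alphabet=RNA_ALPHABET))
print(possible_sequences(10))   # 4 ** 10 candidate 10-mers
print(possible_sequences(20))   # 4 ** 20 candidate 20-mers
```

Even for a motif of about 20 bp, the space of 4 ** 20 candidate sequences is far too large to enumerate exhaustively, which is part of the motivation for learned representations.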

5 Concluding comments

In this study, we have tried to present a comprehensive background of the state-of-the-art deep-learning models for human DNA/RNA motif mining, specifically those that use ChIP-seq, DNase-seq, ATAC-seq, CLIP-seq, and related data. This review concludes that the suitability of deep-learning methods for motif discovery depends on the speed of complex data preprocessing, the qualitative features of existing deep-learning architectures, and the differences among deep-learning models. Following the PRISMA-ScR reporting guidelines and a literature survey, we compared existing deep-learning models based on model size, automatic calibration ability, tool selection, and training set, and found that DESSO, TBiNet, DeepSite, and DeepBind are the DL models of choice in terms of performance and scalability in capturing true biological relationships, especially with respect to gene expression patterns and sequence analysis. The choice of the best DL model also depends on whether sufficient data are available and on the characteristics of the existing learning models. It is therefore necessary to survey the literature on large datasets for motif mining and transcription factor recognition in order to select deep-learning methods accurately. This will help researchers to understand the current state of computational biology approaches in their field of study.