1 Introduction

In general, a motif is an idea, subject, theme or pattern that recurs and carries some significance, especially in a musical, literary or artistic work, or in a set of sequences [1]. In bioinformatics, motif discovery is the process of finding motifs within a set of DNA, RNA or protein sequences, where a motif is a widespread nucleotide or amino-acid sequence pattern that has biological significance. Motifs are generally short sequence patterns of fixed length that express important functional or structural features of nucleic acids and proteins, such as active sites, transcription factor binding sites, interaction interfaces or splice junctions [2]. They can appear in exact or approximate form within a family or subfamily of sequences. In other words, a motif is a pattern common to a set of DNA, RNA or protein sequences that share a biological property, such as functioning as binding sites for a particular protein. Motif discovery is thus the problem of identifying short, similar sequence elements shared by a set of nucleotide or protein sequences with a common biological function [3].

Figure 1 shows an example of motif discovery. A position weight matrix (PWM) is commonly used to express a motif [4]; it records the occurrences of each nucleotide at each position of the motif. Let the number of DNA sequences be \(n=6\), each of length \(L=30\), and suppose we have to discover a motif of width \(W=8\). From every sequence we take one motif instance of length 8, and these six instances are used to build the PWM. From the PWM, we then select the nucleotide with the highest number of occurrences at every position, which yields a motif of length 8. This is the final motif discovered from the six instances.
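The construction described above can be illustrated with a short Python sketch. The six instances below are hypothetical and are not the data of Fig. 1, and the function names `count_matrix` and `consensus` are introduced here only for illustration. Ties between nucleotides are broken in alphabet order, which is an assumption the text does not specify.

```python
from collections import Counter

def count_matrix(instances):
    """Return, for each motif position, a Counter of nucleotide occurrences."""
    width = len(instances[0])
    if any(len(inst) != width for inst in instances):
        raise ValueError("all motif instances must have the same width")
    return [Counter(inst[j] for inst in instances) for j in range(width)]

def consensus(instances, alphabet="ACGT"):
    """Pick the most frequent nucleotide per column (ties: alphabet order)."""
    columns = count_matrix(instances)
    return "".join(max(alphabet, key=lambda k: col[k]) for col in columns)

# Hypothetical motif instances, one per sequence (n = 6, W = 8).
instances = [
    "TTGACATA",
    "TTGACGTA",
    "TAGACATA",
    "TTGACATC",
    "TTGCCATA",
    "TTGACATA",
]
motif = consensus(instances)
print(motif)
```

Here every column has a clear majority, so the consensus is unaffected by the tie-breaking rule.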

Fig. 1 Motif discovery of DNA sequences

In the era of the bioinformatics revolution, the volume of biological sequences in public databases keeps increasing. Motif discovery has therefore gradually become a fundamental problem in molecular biology and computer science [5]. The ability to predict the function, structure or behavior of biological entities or motifs such as proteins and genes, as well as the cooperation among them, plays a major role in analyzing information to describe biological mechanisms. Motif discovery is, therefore, an important field of bioinformatics. There are two main ways to discover a motif: biological experiments and computational approaches, i.e., bioinformatics. Biological experiments are costly and time-consuming, so computational approaches are used extensively to discover motifs [6]. Exact motif discovery is nevertheless a hard problem, because motifs are generally very short (up to about 30 nucleotides), whereas the regulatory regions that contain them are very long, ranging from several hundred to several thousand nucleotides. Mutations in the actual motif instances add a further burden [5].

Many algorithms have been proposed to predict motifs, such as the Gibbs sampler [7], MEME (Multiple Expectation Maximization for Motif Elicitation) [8], GA (Genetic Algorithm) [9,10,11,12], GARPS (Genetic Algorithm with Random Projection Strategy) [13], ACO (Ant Colony Optimization), ACOMotif (An Efficient Ant Colony Algorithm for DNA Motif Finding) [6], EMACO (Ant Colony Optimization (ACO) and Expectation Maximization (EM)) [14], MFACO (Motif Finding using Ant Colony Optimization) [15], ACRI (Ant-Colony-Regulatory-Identification) [16], MotifSuite [17], MotifSampler, BioProspector (an algorithm used to discover sequence motifs from a set of DNA sequences) [18], and an iterative algorithm (based on GA with an addition operator) for motif discovery [5]. The Gibbs sampler and MEME both tend to get trapped in local optima. The Gibbs sampler runs faster but has lower prediction accuracy, whereas MEME is superior to the other methods in prediction accuracy but is time-consuming [5]. There are also heuristic methods for predicting motifs, such as particle swarm optimization, Tabu search and simulated annealing [19]. A basic limitation of these algorithms is low prediction accuracy at the binding-site and nucleotide levels. Another limitation concerns the pattern model used to detect the regularity among the binding sites of transcription factors [20]. Many algorithms also do not produce motif statistics good enough to help the biologist distinguish functional motifs from statistical artifacts. For this reason, valid motifs can be rejected, or time may be wasted on random motifs. The main drawback of the genetic algorithm based on statistical significance is the lack of a mechanism to identify false positives [9]. The modified genetic algorithm [12] gets better results than Gibbs, MEME, Consensus and the genetic algorithm of [10], but it has a low contextual connection with the motifs introduced by TRANSFAC (a database on transcription factors and their DNA binding sites) [21]. Although all of these algorithms have limitations, they produce better results under certain restricted input criteria.

Motif discovery has several important application areas. It is widely used in locating regulatory sites and in drug target identification [5]. Mainly, it is used to analyze information for describing biological mechanisms. In addition, motif discovery has become a main component of several higher-level algorithms that deal with time series, especially rule-discovery, compression, summarization and clustering algorithms.

In this paper, we propose an approach based on Chemical Reaction Optimization (CRO), a nature-inspired metaheuristic that mimics the interactions of molecules participating in a chemical reaction. CRO has shown promising results on various optimization problems. We choose CRO to solve the motif discovery problem because it searches the solution space both globally and locally, and thus offers the benefits of both GA and simulated annealing (SA) [22,23,24]. It can be adapted to different optimization problems by redefining its four operators and by adding further operators as needed. Like a chemical reaction in the real world, the algorithm always tries to reach a stable solution. CRO allows a variable population size, which lets the system adapt automatically to the problem being solved. When diversification is required, the decomposition operator is triggered to produce more molecules in order to explore the solution space for the optimal solution. When intensification is required, the algorithm triggers synthesis to merge molecules; as a result, the probability that the resulting molecules are selected for manipulation increases. CRO also follows the law of conservation of energy: energy can be transformed from one form to another and transferred between entities, but the total amount of energy held by the molecules and the buffer remains constant, which distinguishes the algorithm from other existing metaheuristic algorithms. Besides, a molecule (solution) can be constructed with different attributes to suit the problem to be solved. This provides the flexibility to design and manage different operators as needed [22,23,24].

For the properties and efficiency of CRO we refer to two papers [23, 25]. CRO has been used to solve many optimization problems effectively, such as the channel assignment problem in wireless mesh networks [22], shortest common supersequence [26], longest common subsequence [27], RNA structure prediction [28], transportation scheduling optimization by a collaborative strategy in supply chain management with TPL [29], the quadratic assignment problem, the resource-constrained project scheduling problem [30], RNA secondary structure prediction with pseudoknots [31], and optimization of protein folding in the HP cubic lattice model [32]. We have therefore adapted CRO to the motif discovery problem by redesigning its four basic operators and designing two additional operators, with the aim of obtaining better results than the existing algorithms.

The contributions and novelty of the proposed work are as follows.

  • A new population generation process has been introduced to make our proposed approach more efficient.

  • The four basic operators of CRO have been redesigned to suit the DNA motif discovery problem. These operators help the proposed approach to search the solution space locally as well as globally.

  • The values of the parameters are defined very carefully to find the global optimal solution efficiently.

  • We have introduced an additional operator, called the repair function, that improves the quality of the solution by searching all the neighboring solutions of the current best solution. A local optimization technique is used as a second repair operator, which improves the quality of the binding sites.

  • The results of the proposed work have been compared with several state-of-the-art methods, and statistical tests have been performed to show the efficiency of the proposed method relative to the other methods.

1.1 Basic concepts

The motif discovery problem can be stated as follows. Let a set of N DNA sequences be represented as \(S=\{S_{1},S_{2},\ldots ,S_{N}\}\), where \(S_{i}=s_{1},s_{2},\ldots ,s_{l_{i}}\) and \(l_{i}\) is the length of sequence i. We have to find the most accurate possible motif pattern \(X=x_{1},x_{2},\ldots ,x_{l}\) of length l, where \(x_{i}, s_{i} \in \{A,C,G,T\}\). Motif discovery is based on a defined score function that measures the similarity of the motif pattern to its occurrences. For a given motif length l there are two approaches [10]:

  1. Consensus approach. Find a motif \(S_{c}\) of length l and a set of motif instances \(M=\{m_{1},m_{2},\ldots ,m_{N}\}\), where \(m_{i}\) is the motif instance of \(S_{i}\), so that \(S_{c}\) minimizes the total Hamming distance given in Eq. 1.

    $$\begin{aligned} T_{HD}(S_{c})=\sum _{i=1}^{N}{H_D(S_c,S_i)} \end{aligned}$$
    (1)

    where

    $$\begin{aligned} H_D(S_c,S_i)=\min \big \{H(S_c,m): m \text{ is a subsequence of } S_i \text{ of length } l \big \} \end{aligned}$$
    (2)

    and \(H(S_{c}, m)\) is the Hamming distance between the motif \(S_{c}\) and a motif instance m, that is, the number of positions at which the nucleotides of \(S_{c}\) and m differ. If M is taken as a consensus matrix whose ith row is the motif instance \(m_{i}\), and \(c(k,j)\) denotes the number of occurrences of nucleotide \(k \in \{A,C,G,T\}\) in column j, then the consensus score \(CSC_{M}\) is defined as:

    $$\begin{aligned} CSC_M=\sum _{j=1}^{l}\bigg (\underset{k\in \{A,C,G,T\}}{{\max }} \Big (c\big (k,j\big )\Big )\bigg ) \end{aligned}$$
    (3)
  2. Positional approach. Find a set of motif instances \(M=\{m_{1},m_{2},\ldots ,m_{N}\}\), each of length l, and a set of positions \(P=\{p_{1},p_{2},\ldots ,p_{N}\}\), where \(p_{i}\) is the starting position of motif instance \(m_{i}\) in sequence \(S_{i}\). The objective function of this approach, the information content \(IC_{M}\), is defined as:

    $$\begin{aligned} IC_M = \sum _{j=1}^{l}\sum _{k\in \{A,C,G,T\}} F(k,j)\cdot \log _2\frac{F(k,j)}{F_k} \end{aligned}$$
    (4)

    where \(F(k,j)\) represents the frequency of nucleotide k at position j of the matrix M and \(F_{k}\) is its background frequency in the entire set S.
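The score functions of Eqs. 1 to 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, \(F(k,j)\) is taken as a relative frequency, terms with \(F(k,j)=0\) are treated as contributing zero, and ties in Eq. 2 are broken in favour of the leftmost window.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def best_instance(motif, seq):
    """Eq. 2: closest length-l window of seq, with its start and distance."""
    l = len(motif)
    windows = ((hamming(motif, seq[p:p + l]), p) for p in range(len(seq) - l + 1))
    dist, start = min(windows)  # smallest distance, then leftmost start
    return seq[start:start + l], start, dist

def total_hamming(motif, seqs):
    """Eq. 1: sum of the per-sequence minimum Hamming distances."""
    return sum(best_instance(motif, s)[2] for s in seqs)

def consensus_score(instances):
    """Eq. 3: sum over columns of the largest nucleotide count."""
    return sum(max(Counter(col).values()) for col in zip(*instances))

def background_frequencies(seqs):
    """F_k: relative frequency of each nucleotide in the whole set S."""
    counts = Counter("".join(seqs))
    total = sum(counts[k] for k in ALPHABET)
    return {k: counts[k] / total for k in ALPHABET}

def information_content(instances, background):
    """Eq. 4, skipping terms with zero frequency (assumed convention)."""
    n = len(instances)
    ic = 0.0
    for col in zip(*instances):
        counts = Counter(col)
        for k in ALPHABET:
            f = counts[k] / n
            if f > 0:
                ic += f * math.log2(f / background[k])
    return ic
```

For example, `best_instance("ACG", "TTACGTT")` locates the exact occurrence starting at index 2 (0-based), and identical instances scored against a uniform background give 2 bits per column.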

In bioinformatics, motif discovery is an NP-hard problem [6]. For real biological DNA sequences, where the sequences of nucleotides (or amino acids) are very long, no polynomial-time method is known for finding the exact motif. We therefore solve the problem using chemical reaction optimization (a metaheuristic method), and our target is to find optimal or near-optimal solutions. To find a motif we use the consensus approach, and to find its positions within the input DNA set we use the positional approach.

2 Related works

Different approaches have been proposed for the motif discovery problem. Each algorithm has its own strengths and drawbacks. Some of the approaches are described below.

2.1 Greedy mixture learning for multiple motif discovery in biological sequences

Greedy mixture learning for multiple motif discovery in biological sequences (Greedy EM) was proposed by Blekas et al. [33]. This algorithm uses incremental methods for Gaussian mixture learning to find significant motifs within a set of DNA sequences. A mixture of motif models is learned in a greedy fashion by adding motifs incrementally to the mixture until some stopping criteria are met. The method starts with one motif that models the background, and a new candidate motif is added at every step. Combining local search, using partial EM (Expectation Maximization) steps, with global search for tuning the parameters, the algorithm finds good initial values for the parameters of the new candidate motif. It uses kd-trees to reduce the running time of nearest-neighbor queries, which lowers the time complexity of the initialization procedure [33]. Greedy EM was evaluated on real datasets from the PRINTS database [34] and the PROSITE database of protein families [35], and its results were compared with those of MEME [8]. Although Greedy EM finds good initial values for the parameters of a new motif, it fails to find multiple motifs of variable length.

2.2 Motif discovery using a genetic algorithm

In 2005 Che et al. proposed motif discovery using a genetic algorithm (MDGA) [10]. MDGA uses a generic genetic algorithm framework to explore the search space of starting positions of the motifs within the different target sequences. The initial population is selected randomly and its size is kept fixed during evolution. In every iteration, each new individual is generated from two parents using crossover and mutation, so the number of new individuals is half the size of the existing population. The new individuals are then merged with the existing population, and the worst one-third of all individuals are eliminated, which restores the original population size. MDGA was evaluated on the CRP dataset (18 sequences of length 105 nucleotides) [36], the YDR02c dataset (15 target genes of transcription factor YDR02c) [37] and the AZFI dataset (24 sequences of variable length, ranging from 175 to 1228) [37], and compared with other algorithms such as the Gibbs sampler [7], BioProspector [18] and AlignACE [38]. It gives higher prediction accuracy than the Gibbs sampler [7] and BioProspector [18] on the CRP dataset [36]. On the YDR02c dataset [37], MDGA finds a motif pattern that is more accurate from a statistical point of view, and on the AZFI dataset [37] it takes less time than AlignACE [38].

2.3 Motif discovery using evolutionary algorithms

In 2009 Shao et al. proposed motif discovery using evolutionary algorithms [39]. The algorithm integrates the bacterial foraging optimization (BFO) algorithm with Tabu Search (TS) and is also known as the TS-BFO algorithm. In this method, one candidate motif is treated as one bacterium that undergoes evolution. A bacterium's foraging action has four steps: chemotaxis, swarming, reproduction, and elimination and dispersal. TS-BFO was evaluated on SCPD datasets [40] and TRANSFAC datasets [21] and compared with other approaches such as DE/EDA [41], which combines global information extracted by an estimation of distribution algorithm (EDA) with differential information obtained by differential evolution (DE), MotifSampler and MEME [8]. TS-BFO uses a self-controlled multi-length chemotactic step approach to extend the search space, escape local extrema and speed up convergence. It avoids generating similar individuals in each step, guides the search direction, and discovers the global solution [39].

2.4 Motif finding using ant colony optimization

Bouamama et al. [15] proposed motif finding using ant colony optimization (MFACO) in 2010. The algorithm integrates a modified Gibbs sampling method as a local heuristic optimization search step. MFACO builds a weighted directed graph G(V, E), where V is the set of nodes and E is the set of edges. The graph contains 4l nodes organized in a grid of four rows and l columns, where l is the motif length. Every ant builds a solution incrementally by traversing the graph to complete a tour. Because MFACO searches both the space of motif patterns and the space of starting positions, it has a better chance of detecting a potential motif. Three datasets used in FMGA (Finding Motifs by Genetic Algorithm) [11] and the E. coli CRP binding sites [36] were used to test the performance of MFACO. The three datasets consist of 6, 9 and 18 sequences respectively, each sequence having a length of 3001 nucleotides. On these datasets, MFACO achieves better motif accuracy than MEME [8], MotifSampler, BioProspector [18] and FMGA [11] within a reasonable computational time. The E. coli dataset contains 18 sequences of length 105 nucleotides. On this dataset, MFACO is able to find the exact starting positions of the motifs identified by footprinting, whereas other approaches such as BioProspector [18] and MDGA [10] failed.
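The tour construction on the four-row grid can be sketched as follows. This is only an illustration under simplifying assumptions: pheromone is stored per node (column, nucleotide) rather than per edge, heuristic information and the Gibbs sampling step are omitted, and the names `build_tour` and `pheromone` are hypothetical rather than taken from [15].

```python
import random

ALPHABET = "ACGT"  # the four rows of the grid

def build_tour(pheromone, rng):
    """Build one candidate motif by visiting the l columns left to right.

    pheromone[j] holds four non-negative weights, one per row of column j;
    a row is drawn with probability proportional to its weight.
    """
    return "".join(
        rng.choices(ALPHABET, weights=column, k=1)[0] for column in pheromone
    )

# Uniform pheromone for a motif of length l = 8: every pattern is equally likely.
uniform = [[1.0, 1.0, 1.0, 1.0] for _ in range(8)]
print(build_tour(uniform, random.Random(0)))
```

As the pheromone on well-scoring nodes grows over the iterations, tours concentrate on the corresponding nucleotides.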

2.5 Optimizing genetic algorithm for motif discovery

Hongwei et al. [13] proposed a new algorithm in 2010, named GARPS, that optimizes the Genetic Algorithm (GA) via the Random Projection Strategy (RPS) to identify (l, d)-motifs. The initial population is generated by RPS, which makes the algorithm converge quickly to the best solution, while the overall structure of GARPS is derived from the simple genetic algorithm. Every new generation is created using a simple mutation operator, one-point crossover and a keep-the-best mechanism. These steps are repeated, generation after generation, in a while loop. During the iterations, new individuals appear because the crossover and mutation operators are applied to the population, and the selection operator, through the keep-the-best mechanism, guarantees that the best individuals survive. As GARPS progresses, the average fitness of the population increases, and the algorithm stops when no further improvement can be made. GARPS was compared with the Projection algorithm and showed better results. The data used include eighteen sequences with identified binding sites of the cAMP receptor protein (CRP) [36] and seven sequences with identified binding sites of PDR3 [42]. The algorithm cannot find an extremely weak planted motif unless it reports a sufficient number of patterns [13].

2.6 DNA motif discovery based on ant colony optimization and expectation maximization

Yang et al. [14] proposed a framework in 2011, known as EMACO, that combines Ant Colony Optimization (ACO) and Expectation Maximization (EM). ACO is effective in global search, and EM is efficient at maximizing the likelihood in parameter estimation, which makes the two algorithms complementary. Initially, some potential binding sites are randomly extracted from the given sequences. Next, ACO is applied iteratively to these solutions, constructing solutions and updating pheromone in search of good motifs. The EM algorithm then uses the predictions found by ACO to maximize the likelihood of the parameters. The expectation step of EM calculates the expected value of the log-likelihood function given the observed data under the current estimate of the missing motif sites, and the maximization step finds the positions of the motif instances. After these two algorithms have been applied, post-processing procedures refine the predicted results. Finally, the predicted binding sites are output as the motif predictions. EMACO was compared with GAME (Genetic Algorithm for Motif Elicitation) [43] and GALF (Genetic Algorithm with Local Filtering) [39] and predicts better motifs under most circumstances. The experiments were conducted on eight real datasets, CREB, CRP, E2F, ERE, MEF2, MyoD, SRF and TBP, which were previously constructed by the authors of GAME [39]. Its low standard deviations in prediction indicate stable performance [14].

2.7 An iterative algorithm for motif discovery

In 2013 an iterative algorithm for motif discovery was proposed by Fan et al. [5]. The method uses the common GA framework and finds motifs with three operators: mutation, deletion and a new addition operator proposed in that work. It starts with short motifs of length three, so there are \(4^{3}=64\) initial individuals in total because each site is chosen from A, C, G and T. The length of each individual then grows by one in each epoch through these operators until the length of the optimal motif reaches the standard length. Throughout the method, the population size is kept at 64. The iterative algorithm is a parallel random search, which lends itself to parallel computing and thus increases computational efficiency. The method can also avoid getting trapped in local optima. Both simulated and biological data are used to test its effectiveness; the biological dataset was downloaded from the SCPD database [40]. The iterative algorithm achieves a higher score than the Gibbs sampler, GA and GARPS on the CRP data [36].
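The size of the initial population follows directly from enumerating all patterns of length three over the four nucleotides. The short Python check below only illustrates this count; the variable name `initial` is hypothetical and the operators of [5] are not reproduced.

```python
from itertools import product

# All patterns of length 3 over {A, C, G, T}: 4 ** 3 = 64 individuals.
initial = ["".join(p) for p in product("ACGT", repeat=3)]
print(len(initial), initial[:4])
```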

2.8 An Ant Colony Optimization based algorithm for identifying gene regulatory elements

Liu et al. [16] proposed an Ant Colony Optimization based algorithm for identifying gene regulatory elements (ACRI) in 2013. The paper focuses on a specific type of motif, the de novo motif, in which the length of the motif is predefined. The algorithm detects all possible binding sites of a transcription factor in the upstream regions of co-expressed genes. It takes a set of sequences and a motif length as input. A special digraph is created in which each node except the last represents a sequence from the input set, the last node indicates the termination point, and each edge between two nodes represents a possible starting position of a binding site in the corresponding sequence. Each ant builds a solution by visiting each node once and picking one edge between two nodes. The best solution is then sought by various optimization steps. ACRI was evaluated on five transcription factors of Saccharomyces cerevisiae from the uniform database SCPD [40], five transcription factors of Homo sapiens from the uniform database JASPAR [44] and 18 gene sequences containing E. coli transcription factor binding sites [36], and its results were compared with those of the Gibbs sampler [7], AlignACE [38] and MEME [8], among others. ACRI obtains higher-quality solutions at a very high speed compared with other existing related algorithms.

2.9 An efficient ant colony algorithm for DNA motif finding

In 2015 Huan et al. proposed an efficient ant colony algorithm for DNA motif finding (ACOMotif) [6]. ACOMotif uses a simple memetic scheme and applies ACO with a reinforcement search technique. It uses the same structural graph G(V, E) as MFACO [15], but the heuristic information, pheromone update rule and local search technique are quite different. G(V, E) has 4l vertices organized in four rows and l columns, where l is the length of the motif. The path made by each ant from the starting vertex to the last vertex defines an acceptable solution for the motif. ACOMotif then applies local search, using the hill-climbing technique, to the potential motif. Additionally, it applies a relax method to find the binding site of every motif. ACOMotif was evaluated on the H. sapiens dataset [44], the E. coli dataset [36], the SCPD dataset [40], ERE and E2F, and compared with MFACO [15], ACRI [16], EMACO [14] and MotifSuite [17]. The H. sapiens datasets contain 6, 9 and 12 sequences respectively, each sequence containing 3001 nucleotides; the E. coli dataset holds 18 sequences of 105 nucleotides each; and ERE and E2F each have 25 sequences of 200 nucleotides. The experimental results show that ACOMotif/R-ACOMotif is superior to the other algorithms.

2.10 A genetic algorithm for motif finding based on statistical significance

In 2015 a genetic algorithm for motif finding based on statistical significance was proposed by Gutierrez et al. [9]. The approach is a computational technique built on a genetic algorithm that uses several statistical coefficients. It represents candidate motifs by the positions at which their instances are located. The only restriction is that motifs are overrepresented in at least a few sequences. Before the method starts, all input sequences are therefore merged into a single supersequence. The supersequence is then divided into subsequences of random length, disregarding the length of the individual sequences, in order to generate more diverse solutions faster. Finally, after the method has been applied for each given motif width, the solutions are filtered and clustered to generate the final solutions. The method was tested on the assessment provided by the study of Tompa et al. [45], which contains 52 datasets from four organisms (human, yeast, fly and mouse) and four negative controls. The algorithm successfully predicts many of the sites, with a high number of true positives at both the site level and the nucleotide level. Its main disadvantage is the lack of a mechanism to detect false positives: it generally detects a known motif, but with more instances than the motif really has [9].

3 Chemical reaction optimization

Chemical Reaction Optimization (CRO), a nature-inspired metaheuristic algorithm for optimization, was proposed by Lam and Li [30]. CRO has been applied successfully to many NP-hard problems and has obtained better performance than other metaheuristic algorithms. CRO loosely couples chemical reactions with optimization and obeys two laws of thermodynamics. The first law, commonly known as the energy conservation rule, states that the total energy of a system remains constant. According to the first law of thermodynamics we can therefore write,

$$\begin{aligned} \sum \limits _{i=1}^{popSize(t)}(PE_{i}(t)+KE_{i}(t))+Buffer(t)=C \end{aligned}$$
(5)

where \(PE_i(t)\) and \(KE_i(t)\) denote the potential and kinetic energy of the molecule i at time t respectively, Buffer(t) is the energy of the surrounding as well as the energy of the central buffer at time t, and C is a constant.

CRO is a multi-agent algorithm in which the manipulated agent is a molecule with some essential attributes, such as the molecular structure (z), the potential energy (PE), the kinetic energy (KE), the number of hits (NumHit) and other parameters. Excessive energy makes a molecule unstable, and an unstable molecule always tries to become stable at a lower energy. This phenomenon is similar to searching for the optimal point of an optimization problem. To reach stability, molecules undergo four basic reactions: on-wall ineffective collision, decomposition, inter-molecular ineffective collision and synthesis. The ineffective collisions make a small change to the molecular structure, which corresponds to local search, while decomposition and synthesis make a massive change to the molecular structure, which corresponds to global search. Because CRO follows the energy conservation rule, a reaction takes place only when the following condition is satisfied:

$$\begin{aligned} \sum _{i=1}^{t} (PE_{z_i} + KE_{z_i}) \ge \sum _{i=1}^{s} PE_{z'_i} \end{aligned}$$
(6)

where t is the number of reactants, s is the number of products, and \(z\) and \(z^{\prime }\) are the structures of the molecules before and after the reaction.
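Equations 5 and 6 can be illustrated with a small Python sketch. Molecules are represented here simply as (PE, KE) pairs; this representation, the function names and the numbers in the example are assumptions made for illustration only.

```python
def total_energy(molecules, buffer):
    """Left-hand side of Eq. 5: energy of all molecules plus the buffer."""
    return sum(pe + ke for pe, ke in molecules) + buffer

def reaction_allowed(reactants, product_pes):
    """Eq. 6: the reactants' PE + KE must cover the products' PE."""
    available = sum(pe + ke for pe, ke in reactants)
    return available >= sum(product_pes)

# One reactant with PE = 10 and KE = 5 can turn into a product with PE = 12,
# but not into two products with PE = 8 each (16 > 15).
print(reaction_allowed([(10, 5)], [12]))
print(reaction_allowed([(10, 5)], [8, 8]))

# The surplus of 3 is split here arbitrarily between the product's KE and the
# buffer, so the total of Eq. 5 is unchanged by the reaction.
before = total_energy([(10, 5)], buffer=0)
after = total_energy([(12, 2)], buffer=1)
```

How the surplus energy is divided between the products and the buffer depends on the reaction type and is not modelled in this sketch.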

3.1 Parameters of CRO

In CRO, molecules are the manipulated agents having some attributes. Table 1 lists the attributes and their algorithmic definitions.

Table 1 Various parameters of CRO and their algorithmic definitions

3.2 Operator selection

This section describes the basic scheme of CRO and its operator selection. Figure 2 shows a flowchart of the whole process. The process starts with the initialization stage, in which the population size and the other parameters are initialized. The iteration stage then starts, and a number of iterations are performed. In each iteration, one of the elementary reactions takes place. At the start of each iteration, a random number v between 0 and 1 is generated to decide whether a uni-molecular or an inter-molecular collision will occur. If \(v >MoleColl\) or only one molecule remains, a uni-molecular collision occurs; otherwise an inter-molecular collision takes place. The required number of molecules for the collision is then selected randomly from the population. For a uni-molecular collision (left side of the flowchart), a condition on the parameter \(\alpha\) is checked to decide whether the on-wall ineffective collision or decomposition will occur. Similarly, for an inter-molecular collision (right side of the flowchart), a condition on the parameter \(\beta\) is checked to decide whether the inter-molecular ineffective collision or synthesis will occur. The values of the parameters \(MoleColl\), \(\alpha\) and \(\beta\) are assigned at the initialization stage. After each elementary reaction, any new best solution found is saved. The iteration stage continues until a stopping criterion is met. In the final stage, the global best solution is obtained. The operator repair1 is applied to this final solution to search for a better solution, and the binding sites of the improved solution are then located. Finally, the operator repair2 is applied to improve the quality of the binding sites.
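The selection logic of one iteration can be sketched in Python as follows. The exact conditions involving \(\alpha\) and \(\beta\) are not spelled out in this section, so they are passed in as the hypothetical boolean flags `decomposition_due` and `synthesis_due`; the function name `choose_reaction` is likewise an assumption.

```python
import random

def choose_reaction(v, pop_size, mole_coll, decomposition_due, synthesis_due):
    """Return the name of the elementary reaction for one iteration.

    v                 -- random number in [0, 1) drawn for this iteration
    pop_size          -- current number of molecules
    mole_coll         -- the MoleColl parameter
    decomposition_due -- outcome of the (unspecified) alpha condition
    synthesis_due     -- outcome of the (unspecified) beta condition
    """
    if v > mole_coll or pop_size == 1:
        # Uni-molecular branch (left side of the flowchart).
        if decomposition_due:
            return "decomposition"
        return "on-wall ineffective collision"
    # Inter-molecular branch (right side of the flowchart).
    if synthesis_due:
        return "synthesis"
    return "inter-molecular ineffective collision"

# Example call with an arbitrary MoleColl value of 0.2.
print(choose_reaction(random.random(), 20, 0.2, False, False))
```

Passing v in as an argument keeps the function deterministic, which makes each branch easy to check in isolation.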

Fig. 2 A flowchart of CRO

4 CRO for DNA motif discovery problem

In this paper, we solve the DNA motif discovery problem using a well-known population-based metaheuristic algorithm, Chemical Reaction Optimization (CRO). CRO is an algorithmic framework that can solve optimization problems efficiently. It is a variable-population algorithm, meaning that the number of molecules can differ from iteration to iteration. We propose an algorithm that finds DNA motifs using the four basic operators of CRO. The operators are redesigned, and an additional operator (a repair operator) is designed to find the best solutions. A second repair function is used to find better binding sites, which gives a better result. The proposed algorithm is named DMD_CRO (DNA Motif Discovery using CRO). The code of the proposed DMD_CRO is available online (see Footnote 1).

4.1 Basic structure of DMD_CRO

The proposed DMD_CRO algorithm differs from the basic CRO algorithm in one respect: after the iteration stage, once the final solution has been found, two additional repair operators are applied to it to obtain a better potential motif and better binding sites. Algorithm 1 shows the pseudo code of DMD_CRO.

figure a

4.2 Solution representation and population generation

Suppose that, for a given set of N DNA sequences \(S=\{S_{1},S_{2},\ldots ,S_{N}\}\), we have to find a motif of length \(l=7\). There are \(4^{7} = 16384\) possible patterns, as each site is selected from \(\sigma = \{A,C,G,T\}\). We generate 100 patterns randomly from all possible patterns and calculate their information contents according to Eq. 4. From these 100 patterns, the 20 patterns with the highest information contents are taken to explore the solution space. When population generation is complete, each symbol of each pattern is encoded by a unique numerical value; we use 0, 1, 2 and 3 for the symbols A, C, T and G respectively. Figure 3 shows an example of solution representation.
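The population generation step can be sketched in Python as follows. The scoring function is passed in as a parameter because computing the information content of a pattern (Eq. 4) requires locating its instances in the input sequences, which is outside the scope of this sketch; the toy score used in the example, the function names and the fixed seed are all assumptions. Duplicate patterns are not filtered, since the text does not say whether they are.

```python
import random

# Numerical encoding as stated in the text: A, C, T, G -> 0, 1, 2, 3.
ENCODING = {"A": 0, "C": 1, "T": 2, "G": 3}

def random_pattern(length, rng):
    """Draw one pattern uniformly from the 4 ** length possible patterns."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def initial_population(score, length=7, candidates=100, keep=20, seed=0):
    """Generate `candidates` random patterns and keep the `keep` best ones.

    `score` stands in for the information content of Eq. 4; higher is better.
    The surviving patterns are returned in encoded (numerical) form.
    """
    rng = random.Random(seed)
    pool = [random_pattern(length, rng) for _ in range(candidates)]
    pool.sort(key=score, reverse=True)
    return [[ENCODING[c] for c in pattern] for pattern in pool[:keep]]

# Placeholder score for demonstration only: the number of A's in the pattern.
population = initial_population(score=lambda p: p.count("A"))
print(len(population), population[0])
```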

Fig. 3 Solution representation

4.3 Reaction operators

For the DMD_CRO algorithm, we have designed four reaction operators to find solutions and introduced two additional operators to get better results. The following subsections describe the operators used in the algorithm.

4.3.1 On-wall ineffective collision

This molecular reaction is used to search for a neighboring solution (local search). We use the one-difference operator, as shown in Fig. 4, for this elementary reaction: a position in the molecule is chosen randomly and its value is changed. Let \(S_{m}\) be the molecule to which the on-wall ineffective collision is applied. The values of the solution \(S_{m}\) are copied to a solution \(S_{m}^{\prime }\). A position i of the molecule \(S_{m}\) is randomly selected, where \(1 \le i \le l\) (l is the length of the motif). Next, the value at the ith position of \(S_{m}^{\prime }\) has to be changed. For this, we generate a value \(r \in \{0,1,2,3\}\) such that \(r \ne S_{m}[i]\) and put r in the ith position of \(S_{m}^{\prime }\). Thus a new solution \(S_{m}^{\prime }\) is created. In Fig. 4, \(i = 3\) and \(S_{m}[3]=2\). We randomly generate a value r between 0 and 3 such that \(r \ne 2\); here \(r = 3\) is selected and put in \(S_{m}^{\prime }[i]\). Algorithm 2 shows the pseudo code of the on-wall ineffective collision.
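The one-difference operator described above can be sketched in Python as follows. The function name and the `rng` argument are assumptions made for this illustration, and positions are 0-based here, whereas the text counts from 1.

```python
import random


def on_wall_ineffective_collision(molecule, rng=random):
    """Return a copy of `molecule` that differs in exactly one position."""
    new_molecule = list(molecule)
    i = rng.randrange(len(new_molecule))  # random position (0-based)
    # Pick a replacement value r in {0, 1, 2, 3} with r != molecule[i].
    r = rng.choice([v for v in range(4) if v != molecule[i]])
    new_molecule[i] = r
    return new_molecule
```

Because the replacement value is drawn only from the three values that differ from the current one, the new molecule is always a true neighbor of the original.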

Fig. 4 On-wall ineffective

figure b

4.3.2 Decomposition

This reaction enables the algorithm to explore other regions of the solution space (global search). We use the popular half-total exchange operator for decomposition, as shown in Fig. 5. In decomposition, two new molecules are generated from an original molecule. Let \(S_{d}\) be the original molecule to which we apply this reaction. First, the molecule \(S_{d}\) is divided into two parts. Then we copy the values of the first part of \(S_{d}\) to a new molecule \(S_{d1}\) and randomly generate the values of the remaining part of \(S_{d1}\). Similarly, the values of the last part of \(S_{d}\) are copied to the respective part of another new molecule \(S_{d2}\), and the values of the remaining part of \(S_{d2}\) are generated randomly. Algorithm 3 depicts the pseudo code of decomposition.
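A minimal Python sketch of this half-total exchange is given below. The text does not state where the molecule is split, so the midpoint split used here is an assumption, as are the function name and the `rng` argument.

```python
import random


def decomposition(molecule, rng=random):
    """Split `molecule` in two and return two new molecules.

    The first keeps the leading part and randomizes the rest; the
    second keeps the trailing part and randomizes the beginning.
    The midpoint split is an assumption of this sketch.
    """
    half = len(molecule) // 2
    rest = len(molecule) - half
    first = list(molecule[:half]) + [rng.randrange(4) for _ in range(rest)]
    second = [rng.randrange(4) for _ in range(half)] + list(molecule[half:])
    return first, second
```

Each child inherits half of the parent, while the randomized half moves the search to a different region of the solution space.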

figure c
Fig. 5 Decomposition reaction

4.3.3 Inter-molecular Ineffective Collision

In this elementary reaction, the well-known two-point crossover operator is used, as shown in Fig. 6. Two molecules \(S_{c1}\) and \(S_{c2}\) are randomly selected from the solution space. Then two points \(p_{1}\) and \(p_{2}\) in the molecule are randomly chosen, where \(p_{1} < p_{2}\). These two points divide both molecules \(S_{c1}\) and \(S_{c2}\) into three parts. The values from the first and third parts of \(S_{c1}\) are copied to the respective positions of a new molecule \(S_{n1}\), and the values of the second part of \(S_{c2}\) are copied to the respective positions of \(S_{n1}\). Similarly, another new solution \(S_{n2}\) is created from the first and third parts of \(S_{c2}\) along with the second part of \(S_{c1}\). Algorithm 4 gives the pseudo code of the inter-molecular ineffective collision.
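The two-point crossover can be sketched in Python as follows. The function name and the `rng` argument are assumptions for this illustration, and the cut points are drawn so that all three parts are non-empty, which the text does not explicitly require.

```python
import random


def inter_molecular_collision(mol_a, mol_b, rng=random):
    """Two-point crossover: swap the middle segment of two molecules."""
    if len(mol_a) != len(mol_b) or len(mol_a) < 3:
        raise ValueError("molecules must have equal length of at least 3")
    a, b = list(mol_a), list(mol_b)
    # Two distinct cut points with 1 <= p1 < p2 <= len - 1.
    p1, p2 = sorted(rng.sample(range(1, len(a)), 2))
    new_1 = a[:p1] + b[p1:p2] + a[p2:]
    new_2 = b[:p1] + a[p1:p2] + b[p2:]
    return new_1, new_2
```

At every position the two children together hold exactly the two parent values, so no information is lost or invented by this reaction.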

figure d
Fig. 6 Inter-molecular ineffective reaction

4.3.4 Synthesis

The probabilistic select operator depicted in Fig. 7 is used for this elementary reaction [26]. Synthesis takes two molecules \(S_{m1}\) and \(S_{m2}\) randomly from the solution space and produces a new molecule \(S_{m}^{\prime }\). This reaction is the opposite of the decomposition operator. First, the frequency of each symbol \(\{A,C,G,T\}\) in \(S_{m1}\) and in \(S_{m2}\) is calculated, and the frequencies are stored in two separate arrays. Then, to find a proper symbol for the ith position of \(S_{m}^{\prime }\), we compare the frequency of the ith symbol of \(S_{m1}\) with the frequency of the ith symbol of \(S_{m2}\) and take the symbol with the higher frequency as the value of the ith position of \(S_{m}^{\prime }\). The frequency of the selected symbol is then decreased by one in the frequency array of the molecule it was taken from. This procedure is repeated for every position. Algorithm 5 shows the pseudo code of synthesis.
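The following Python sketch illustrates the frequency-based selection described above. The function name is an assumption, and so is the tie-breaking rule (ties go to the first molecule), since the text does not say how equal frequencies are resolved.

```python
from collections import Counter


def synthesis(mol_1, mol_2):
    """Merge two molecules into one by frequency-based selection."""
    freq_1, freq_2 = Counter(mol_1), Counter(mol_2)
    child = []
    for sym_1, sym_2 in zip(mol_1, mol_2):
        # Keep the symbol that is currently more frequent in its own
        # molecule; ties go to the first molecule (an assumption).
        if freq_1[sym_1] >= freq_2[sym_2]:
            child.append(sym_1)
            freq_1[sym_1] -= 1
        else:
            child.append(sym_2)
            freq_2[sym_2] -= 1
    return child
```

For example, under these assumptions `synthesis([0, 0, 0, 1], [2, 3, 3, 3])` yields `[0, 3, 0, 3]`: the decrement after each pick lets both parents contribute their dominant symbol.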

figure e
Fig. 7 Synthesis reaction

4.3.5 Operator Repair1

The operator repair1 is applied to the final solution \(S_{m}\) to improve the result by local search and obtain a potential motif. First, we copy the values of \(S_{m}\) to form a new solution \(S_{m}^{\prime }\). The value of the first position of \(S_{m}^{\prime }\) is then changed to another value from \(\{0,1,2,3\}\) such that \(S_{m}[0] \ne S_{m}^{\prime }[0]\). We compute the information contents \(T = IC(S_{m})\) and \(T^{\prime } = IC(S_{m}^{\prime })\) using Eq. 4. If \(T^{\prime }\) is not greater than T, we change the value of the second position of \(S_{m}^{\prime }\) and repeat the comparison. If \(T^{\prime }\) is greater than T, \(S_{m}\) is replaced by \(S_{m}^{\prime }\) and the technique is applied again to the updated solution \(S_{m}\). The operator repair1 stops when no better result is obtained after checking all the positions of \(S_{m}\), and it outputs the updated best solution \(S_{m}\). Algorithm 6 gives the pseudo code of the process.

figure f

Figure 8 shows an example of the operator repair1. Here an initial solution \(S_{m}\) is taken with information content 11.78. We change the value of the \(1^{st}\) position of \(S_{m}\) using \(\{0,1,2,3\}\) to get three new solutions \(S_{m11},\ S_{m12}\) and \(S_{m13}\) such that \(S_{m11} \ne S_{m12} \ne S_{m13}\). Next, the information contents of \(S_{m11},\ S_{m12}\) and \(S_{m13}\) are computed, but none is larger than that of the initial solution. The value of the \(2^{nd}\) position of \(S_{m}\) is then changed to get \(S_{m21},\ S_{m22}\) and \(S_{m23}\) similarly, and again no larger information content is obtained. Next, the value of the \(3^{rd}\) position is changed, which gives a solution \(S_{m32}\) with a larger information content of 12.17. So the solution \(S_{m}\) is updated by \(S_{m32}\). At this point, the repair1 operator has to be reapplied to the updated solution. The repair1 operator searches all the neighboring solutions of the existing solution to get a better one. If a better solution is found, the existing solution is replaced by it and the process is repeated. This process continues until all the neighboring solutions are worse than the existing solution. In contrast, the CRO operators examine only one or two local or global solution(s) derived from the existing solution(s). Since the search space covered by the repair operator is very large compared to that of the traditional CRO operators, this additional operator helps the proposed DMD_CRO algorithm to search the solution space efficiently for better solutions. That is why CRO with the operator repair1 is more likely to find better solutions than CRO without this operator.
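The repair1 procedure amounts to a hill climb over all one-position neighbors. A minimal Python sketch is given below; the function name and the `score` callable (a stand-in for the information content of Eq. 4) are assumptions for illustration, and positions are 0-based.

```python
def repair1(solution, score):
    """Hill-climb over one-position changes until nothing improves.

    `score` is a hypothetical stand-in for IC (Eq. 4); higher is better.
    """
    best = list(solution)
    best_score = score(best)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            for value in range(4):
                if value == best[i]:
                    continue
                candidate = best[:i] + [value] + best[i + 1:]
                candidate_score = score(candidate)
                if candidate_score > best_score:
                    # Accept the better neighbor and restart the scan
                    # from the first position of the updated solution.
                    best, best_score = candidate, candidate_score
                    improved = True
                    break
            if improved:
                break
    return best, best_score
```

The loop ends only after a full pass over all positions and all alternative values finds no improvement, which matches the stopping rule described for the operator.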

Fig. 8 Repair1 operator

4.3.6 Locate binding sites

To get the binding sites for a solution \(S_{m}\), we find, for each input sequence \(S_{i}\), a position \(p_{i}\) that minimizes the Hamming distance in Eq. 2. The resulting set of positions \(P=\{p_{1},p_{2},\ldots ,p_{N}\}\) is known as the binding sites for the solution \(S_{m}\).
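This step can be sketched in Python as a sliding-window search. The function names are assumptions, positions are 0-based, and ties are resolved in favor of the leftmost window, which the text does not specify. The plain Hamming distance is used here as a stand-in for Eq. 2.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def locate_binding_sites(motif, sequences):
    """For each sequence, return the start of the window closest to motif."""
    width = len(motif)
    sites = []
    for seq in sequences:
        starts = range(len(seq) - width + 1)
        # min() returns the leftmost start among equally good windows.
        best = min(starts, key=lambda p: hamming(motif, seq[p:p + width]))
        sites.append(best)
    return sites
```

For instance, `locate_binding_sites("ACGT", ["TTACGTTT", "ACGAACGT"])` returns `[2, 4]`, the starts of the exact matches in the two sequences.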

4.3.7 Operator repair2

We use the operator repair2 (repair function), which is a modified version of a local optimization technique for subsequence tuples in ACRI [16]. This operator is applied to the binding sites \(P=\{p_{1},p_{2},\ldots ,p_{N}\}\) to find better binding sites. The information content of the existing binding sites P is calculated using Eq. 4. First, every position in P is shifted by k to get six new sets of binding sites \(P_{k}=\{p_{1}+k,p_{2}+k,\ldots ,p_{N}+k\}\), where \(-3 \le k \le 3\) and \(k \ne 0\). The information content of \(P_{k}\) is then calculated for each value of k using Eq. 4, which gives six new information contents. Finally, we select the binding sites with the highest information content among the six new sets and the original one. Algorithm 7 gives the pseudo code of the process.

figure g

Figure 9 shows an example of the process of the repair2 operator. Here \(P = \{59,53,\ldots ,76\}\) are the initial binding sites, with information content \(IC = 11.923\). Next, we get \(P^{\prime } = \{56,50,\ldots ,73\}\) by adding \(k = -3\) to each position of P and compute the information content value \(IC = 10.568\) for \(P^{\prime }\). Similarly, the binding sites and information content for each value of k are computed. From Fig. 9, the highest information content value \(IC = 13.091\) is found with binding sites \(P = \{61,55,\ldots ,78\}\) for \(k = 2\). So \(P = \{61,55,\ldots ,78\}\) and \(IC = 13.091\) are the final outputs of the operator repair2.
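The shift search performed by repair2 can be sketched in Python as follows. The function name, the `score` callable (a stand-in for the information content of Eq. 4 evaluated on a set of positions), and the `max_start` bound are assumptions for this illustration; in particular, skipping shifts that would move a site outside its sequence is a choice of this sketch that the text does not describe.

```python
def repair2(positions, score, max_start, max_shift=3):
    """Try shifting all binding sites by k (k != 0) and keep the best.

    `score` is a hypothetical stand-in for IC (Eq. 4) on positions.
    `max_start` is the largest valid start position (assumed bound).
    """
    best = list(positions)
    best_score = score(best)
    for k in range(-max_shift, max_shift + 1):
        if k == 0:
            continue
        shifted = [p + k for p in positions]
        # Skip shifts that push any site outside its sequence.
        if min(shifted) < 0 or max(shifted) > max_start:
            continue
        shifted_score = score(shifted)
        if shifted_score > best_score:
            best, best_score = shifted, shifted_score
    return best, best_score
```

The original positions are kept whenever none of the six shifted sets scores higher, so the operator never makes the result worse.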

Fig. 9 Repair2 operator

5 Experimental results and analysis

The proposed DMD_CRO algorithm was evaluated on several datasets given in ACRI [16]. We implemented our algorithm in the C# programming language using Microsoft Visual C# 2013 and executed it on an Intel Core i5 computer with a 2.50 GHz CPU and 4 GB RAM under the Windows 10 operating system (64 bit). For an effective test, we compared the results of DMD_CRO with Gibbs sampler [7], AlignACE [38], MEME [8] and ACRI [16]. The datasets used in the experiments contain five transcriptional factors of Homo sapiens, 18 gene sequences containing E. coli transcription factor binding sites [36], and RAP1 of Saccharomyces cerevisiae from SCPD [40]. The ACRI algorithm solved the de-novo motif discovery problem, and we have designed DMD_CRO to solve the same type of problem. A de-novo motif is a type of motif in which the length of the motif is predefined.

5.1 Experimental setup

The proposed DMD_CRO algorithm has some key parameters. We searched for their best values by testing on the dataset of 18 gene sequences with E. coli transcription factor binding sites. The tuning process is shown in Fig. 10 for \(\alpha\), \(\beta\), iteration, and KELossRate. In the first row and first column of Fig. 10, a line graph shows the effect of the value of \(\alpha\) on the value of information content (computed using Eq. 4). Here \(\alpha\) is plotted on the x-axis and information content (IC) on the y-axis. From the graph, it can be seen that the highest value of IC is obtained for \(\alpha = 1\). Similarly, we get the highest values of IC for \(\beta = 350\), \(iteration = 2000\), and \(KELossRate = 0.2\) respectively.

Fig. 10 Parameters tuning of CRO algorithm

Besides these parameters, several other parameters, namely popSize, MoleColl, and InitialKE, were used in the experiment. Table 2 shows all parameters and their respective values. The termination condition of the proposed algorithm was set based on the values of these parameters.

Table 2 Parameters of CRO algorithm for finding motif

5.2 Analysis for transcription factor binding sites of Homo sapiens

The experiments with the proposed DMD_CRO algorithm were performed on transcription factor binding sites of Homo sapiens from the uniform database JASPAR. We selected five transcriptional factor binding sites as test data. Table 3 shows the dataset (also used in ACRI).

Table 3 The five transcriptional factors of Homo sapiens

The dataset was tested using our proposed algorithm and ACRI, and weblogos were created using the http://weblogo.berkeley.edu/logo.cgi website. Table 4 shows the results. The second and third columns show the weblogos generated for the corresponding sequence using DMD_CRO and ACRI respectively. The weblogos of DMD_CRO and ACRI are similar to the real weblogos. This indicates the effectiveness of our proposed DMD_CRO algorithm, that is, DMD_CRO produces correct results. We did this experiment to demonstrate the effectiveness and correctness of our algorithm. In Tables 3 and 4, TF means sequence name.

Table 4 The experimental results for five transcriptional factors of Homo sapiens of DMD_CRO and ACRI

5.3 Analysis of CRP binding sites of E. coli

Another benchmark dataset for identifying regulatory elements is the CRP binding sites of E. coli. This dataset contains 18 sequences, each of length 105. Table 5 shows the 18 sequences of the CRP binding sites for E. coli.

Table 5 The 18 sequences of the CRP binding sites for E. coli

To find the motif starting positions in these sequences, we used the information content stated in Eq. 4 as the objective function. Like most of the popular computing methods, we set the motif length to 22. We executed the DMD_CRO algorithm five times using the same parameter settings as shown in Table 2. Table 6 shows the worst and best motif starting positions found for each sequence over five consecutive runs, both without and with the repair operator.

Table 6 The worst and best motif starting positions of the CRP binding sites of E. coli for DMD_CRO

Table 7 shows the experimental results of our proposed DMD_CRO algorithm in comparison with MEME [8], ACRI [16], Gibbs sampler [7], and AlignACE [38], using the best-found motif starting positions of the CRP binding sites of E. coli. A found binding site is considered acceptable if the difference between the actual position and the detected position is within 10 [16]. In Table 7, the binding sites column denotes the actual position of the motif in the sequence. The binding sites found by MEME, ACRI, Gibbs sampler, and AlignACE were taken from the ACRI paper [16], and the corresponding columns show the motif positions found by these algorithms. In the DMD_CRO column, we show the best motif starting position for each sequence from Table 6. The positional differences between the actual position and the position found by the respective algorithms are shown in the error columns. DMD_CRO (without repair) finds all binding sites successfully, but some of the results can still be improved. The DMD_CRO algorithm with repair finds all the binding sites successfully, and its results are better than those of the MEME and ACRI algorithms. So the proposed algorithm with repair operators obtained better results than the other related algorithms.

Table 7 Comparison of the results of DMD_CRO with MEME, and ACRI for the 18 sequences of the CRP binding sites for E. coli

In Tables 8 and 9, we compare the information content values of DMD_CRO with those of four other algorithms: Gibbs sampler, MEME, AlignACE, and ACRI. Values of information content (IC) were calculated using Eq. 4. We executed the DMD_CRO algorithm 18 times to get its information content distribution. The information content distributions of the other algorithms were taken directly from ACRI [16]. Table 8 depicts the information content distributions, and Table 9 presents the worst, average, and best information content values of the respective algorithms. A higher information content value denotes a better solution. From Tables 8 and 9, it is clear that the quality of the solutions found by DMD_CRO (with and without repair operator) is higher than that of the other algorithms.

Table 8 Comparison of the information content distribution of the results by different algorithms
Table 9 Comparison of the computation information content with different algorithms

5.4 Statistical significance test

The previous subsections indicate that the performance of the DMD_CRO algorithm is better than that of the other traditional algorithms in terms of the quality of the results. In this subsection, we examine whether the difference between DMD_CRO and the other traditional algorithms is statistically significant. The Student’s t-test and the Mann-Whitney U test were used for this purpose.

5.4.1 Comparison using student’s t-test

The information content values of Table 8 were used to calculate the t-values using Eq. 7.

$$\begin{aligned} t{\text {-}}value =\frac{|\overline{V_1} -\overline{V_2}|}{\sqrt{ \frac{{\sigma _1}^2}{n_1} + \frac{{\sigma _2}^2}{n_2} } } \end{aligned}$$
(7)

where \(\overline{V_{1}}\), \(\overline{V_{2}}\) are the average information content values, \(\sigma _{1}\), \(\sigma _{2}\) are the standard deviations, and \(n_{1}\), \(n_{2}\) are the numbers of samples for group 1 and group 2 respectively. Each group has 18 samples, so the degrees of freedom are \((18+18-2) = 34\). The significance level \(\alpha = 0.05\) was chosen to get the critical value \(t_{crit.} = 2.032\) at 34 degrees of freedom from the t-distribution table. We set the null hypothesis that there is no statistically significant difference between DMD_CRO and the other algorithms. If \(t{\text {-}}value > t_{crit.}\) or \(t{\text {-}}value < -t_{crit.}\), then the null hypothesis can be rejected, and we conclude that there is a statistically significant difference between DMD_CRO and the other algorithms. Table 10 shows the \(t{\text {-}}values\) of the different algorithms compared with DMD_CRO. Using the data of Table 8, for DMD_CRO (without repair), the average information content value is \(\overline{V_{1}} = 11.24\), the standard deviation is \(\sigma _{1} = 0.985\), and the number of samples is \(n_{1} = 18\); for Gibbs sampler, \(\overline{V_{2}} = 9.229\), \(\sigma _{2} = 0.124\), and \(n_{2} = 18\). Using Eq. 7, we get \(t{\text {-}}value = 8.58\) for Gibbs sampler compared with DMD_CRO (without repair), which is shown in the first row and first column of Table 10. The other values were calculated similarly. In Table 10, all the \(t{\text {-}}values\) are greater than \(t_{crit.} = 2.032\). So we can reject the null hypothesis and conclude that DMD_CRO differs statistically significantly from the other related algorithms.
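As a quick check of the worked example, the t-value formula can be evaluated in Python with the summary statistics quoted in the text. The function name is an assumption, and the inputs are the rounded values from the text, so the result can differ slightly from the reported one.

```python
import math


def t_value(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """t-value from two group means, standard deviations and sizes."""
    standard_error = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return abs(mean_1 - mean_2) / standard_error


# DMD_CRO (without repair) versus Gibbs sampler, values from the text.
t = t_value(11.24, 0.985, 18, 9.229, 0.124, 18)
print(round(t, 2))
```

With these rounded inputs the result is about 8.59, which agrees with the reported 8.58 up to rounding and lies well above \(t_{crit.} = 2.032\).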

Table 10 t-value of information content between DMD_CRO with other algorithms

5.4.2 Comparison using Mann-Whitney U Test

The two-tailed Mann–Whitney U test was used to compare DMD_CRO with the other algorithms. We considered the significance level \(\alpha = 0.05\) to get the critical value \(Z_{crit.} = 1.96\). Then \(Z_{stat.}\) was calculated from the information content values of Table 8 using Eq. 8.

$$\begin{aligned} Z_{stat.} =\frac{U- \frac{n_1 n_2}{2}}{\sqrt{ \frac{n_1 n_2(n_1 + n_{2} + 1)}{12} } } \end{aligned}$$
(8)

where U denotes the lowest sum between the positive and negative ranks of the information contents of DMD_CRO and any other algorithm, and \(n_{1}\), \(n_{2}\) are the numbers of samples for these two algorithms. The null hypothesis states that there is no statistically significant difference between DMD_CRO and the other algorithms. The alternative hypothesis states that there is a statistically significant difference. If \(Z_{stat.} > Z_{crit.}\) or \(Z_{stat.} < -Z_{crit.}\), then we can reject the null hypothesis and accept the alternative hypothesis. Table 11 shows the calculated values of U and \(Z_{stat.}\) of the different algorithms compared with DMD_CRO. Using the data of Table 8, the number of samples is \(n_{1} = 18\) for DMD_CRO (without repair) and \(n_{2} = 18\) for Gibbs sampler. The lowest sum is \(U = 2\) for Gibbs sampler. Using Eq. 8, we get \(Z_{stat.} = -5.046\) for Gibbs sampler compared with DMD_CRO (without repair), which is shown in the first row and first column of Table 11. The other values were calculated similarly. In Table 11, all the \(Z_{stat.}\) values are lower than \(-Z_{crit.} = -1.96\). So the null hypothesis can be rejected, and we conclude that DMD_CRO differs statistically significantly from the other algorithms.
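The normal approximation for \(Z_{stat.}\) can likewise be evaluated in Python for the worked example. The function name is an assumption, and the sketch applies the formula exactly as printed, without any tie or continuity correction.

```python
import math


def z_stat(u, n_1, n_2):
    """Normal approximation of the Mann-Whitney U statistic."""
    mean_u = n_1 * n_2 / 2
    sd_u = math.sqrt(n_1 * n_2 * (n_1 + n_2 + 1) / 12)
    return (u - mean_u) / sd_u


# U = 2 with 18 samples per group, as in the worked example.
z = z_stat(2, 18, 18)
print(round(z, 3))
```

This gives about -5.06, close to the reported -5.046; the small gap may stem from rounding or tie handling, which the text does not specify. Either value lies far below \(-Z_{crit.} = -1.96\).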

Table 11 U and \(Z_{stat.}\) of information content between DMD_CRO with other algorithms

These two significance tests support the superiority of the DMD_CRO algorithm over other state-of-the-art algorithms in this area.

5.5 Running time analysis

For the analysis of running time, we implemented the ACRI algorithm on our experimental platform. For this test, five ants were used for ACRI and the initial population size was fixed to five for DMD_CRO. We set a value for the iteration parameter and executed each algorithm five times, which gave five running times for each algorithm. We then averaged these five running times to get the final running time of each algorithm. The value of iteration was then changed, and the procedure was repeated to find the running times for all values of iteration.

Table 12 shows the running time comparison between ACRI and DMD_CRO on the CRP binding sites of E. coli dataset. For better visualization, this comparison is also shown as a line graph in Fig. 11, where the running times of the two algorithms are plotted for various numbers of iterations. From Table 12, it can be observed that when the number of iterations is 30, the running time of DMD_CRO (without repair) is less than that of ACRI. On the other hand, when the number of iterations is 45, DMD_CRO (with repair) takes less time than ACRI.

Table 12 Running time comparison for the 18 sequences of the CRP binding sites for E. coli
Fig. 11 Running time comparison for the 18 sequences of the CRP binding sites for E. coli

Similarly, Table 13 and Fig. 12 give the results and graphs of the running time comparison between ACRI and DMD_CRO using RAP1 of Saccharomyces cerevisiae. From Table 13, it can be noticed that when the number of iterations is 15 both DMD_CRO (without repair) and DMD_CRO (with repair) take less time than ACRI.

Table 13 Running time comparison for the RAP1 of Saccharomyces cerevisiae
Fig. 12 Running time comparison for the RAP1 of Saccharomyces cerevisiae

From Figs. 11 and 12, it can be observed that as the number of iterations increases, the running time of ACRI increases rapidly, whereas the running time of DMD_CRO increases very slowly. This shows that DMD_CRO takes less running time than ACRI when the number of iterations increases.

6 Conclusions

This paper is concerned with a renowned NP-hard combinatorial problem, motif discovery in biological sequences. As the demand for analyzing important biological sequences is growing rapidly, researchers have focused on solving this problem. It is very useful and has important applications in the field of bioinformatics. Several algorithms have been proposed with good results, but there is still a need for more precise identification of motifs in a shorter period of time. Here a population-based metaheuristic algorithm, Chemical Reaction Optimization (CRO), is selected to solve the motif discovery problem. The four basic operators of CRO have been redesigned to find the solutions. In addition, one repair operator has been designed to find a better potential motif, and another one is used to search for better binding sites. We compared the results of the proposed DMD_CRO algorithm with the Ant Colony Optimization (ACO) based algorithm ACRI, Gibbs sampler, and MEME, which are state-of-the-art algorithms. From the results, it can be concluded that for the dataset of five transcriptional factors of Homo sapiens, the sequence logos found are identical to the sequence logos obtained by the DNA footprinting method. For the dataset of 18 sequences of the CRP binding sites of Escherichia coli, DMD_CRO with the repair operator gets better results than the other algorithms. The repair operators help our proposed DMD_CRO algorithm to get better results efficiently and effectively. In addition, the statistical tests demonstrate the superiority of the DMD_CRO algorithm over the other state-of-the-art algorithms.

Defining the right values for the CRO parameters is a very difficult task. More statistical tests for proper parameter setting can be done to improve the results. The four operators of CRO can be modified to better suit this problem. Better population initialization can also be beneficial.