1 Introduction

RNA has a fundamental role in protein synthesis and it is also important in genetic and evolution process. RNA is a macromolecule made from the nucleotide sequence. Four types of nucleotides are possible in RNA: adenine (A), uracil (U), cytosine (C), and guanine (G). Usually, RNA is single-stranded but stable double helix structures of correspondent strands can be formed. RNA bends and folds back to form hydrogen bonds between correspondent nucleotides and builds base pairs. The canonical base pairs in RNA are the stable Watson-Crick base pairs A-U and C-G, and the less stable ‘wobble’ pair G-U [1]. In protein synthesis, three categories of RNA are used: messenger RNA (mRNA), ribosomal RNA (rRNA), and transfer RNA (tRNA). The mRNA contains the information needed for protein synthesis, while the rRNA is a ribosome component and tRNA transfers amino acids to the ribosome as basic materials for protein synthesis [2].

RNA has three types of structure; the nucleotide sequence is the primary structure. The secondary structure is the bonded base pairs in a two-dimensional way. RNA molecules in 3D space are called tertiary structure. RNA structure prediction is a process to calculate possible legal stems and select some of them to obtain an optimal result by constructing the secondary structure of RNA. At present, determining the structure of RNA has become the target of many researchers as it is one of the main issues in inventing new drugs and finding out the genetic diseases. Determining secondary structure is the first action in predicting the 3-D structure of RNA and interpretation of the biological function of the RNA molecules. The secondary structure provides numerous information about molecule structure. X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy are also used to obtain the secondary structure. These processes are difficult, slow and costly. On that account, it is needed to implement mathematical and computational methods to solve the RSP problem.

Dynamic Programming (DP) was the first approach to solving RSP problem based on the minimization of the free energy and successfully predicts the secondary structure of the small pseudoknot-free RNA molecule [3]. DP based mfold was developed by Zuker [4] for inventing the pseudoknot-free RNA secondary structure. Thermodynamic models were used to calculate the free energy of an RNA secondary structure [5].

The use of Simulated Annealing (SA) for predicting RNA secondary structures was first described by Schmitz and Steger [6]. In this method, iterative formation and separation of the single base pair through SA is used. Tsang and Grypma presented a permutation-based algorithm SARNA-Predict, for RSP problem which is based on simulated annealing [2]. They studied the synchronism behavior of the algorithm. The method requires a process of removing conflicted helix to correct the structure. This process is quite tedious especially in a sequence that has a large number of possible helices. N. McMillan investigated a solution for RSP problem using Ant Colony Optimization (ACO) [3]. Initially, all feasible stems are recognized using a brute-force algorithm for a particular RNA sequence. Then, using ACO, new stems are summed up probabilistically by an ant to build a possible secondary structure. The procedure is repeated for a number of ants and also for a particular iteration, the pheromone trails for all the structures are improved depending on the ideal ant, in the matter of minimum free energy.

Particle Swarm Optimization (PSO) is known as useful in solving many different types of optimization problems and known for being able to find out the global optimal outcomes in the solution space [7]. Several methods were introduced based on the PSO for the RSP problem. HelixPSO was introduced for finding RNA secondary structures with minimum free energy by Geis and Middendorf [8]. Another approach is set-based PSO to optimize the RNA molecule structure using an advanced thermodynamic model which was proposed by Neethling and Engelbrecht [9]. An improved PSO (IPSO) model was presented by Liu [10]. The authors designed an objective function according to the number of selected stems, the average length of selected stems and the minimum free energy. Another author, Xing, introduced PSOfold based on improved PSO. An adaptive parameter controller of PSO is applied to promote the balance between exploration and exploitation [11, 12]. In PSOfold, some crucial stems were ignored while predicting secondary structure. Genetic Algorithm (GA) based method named RNAPredict was proposed by Wiese using thermodynamic models [7]. In this method, a set of feasible helices is produced from a given RNA sequence with helix generation algorithm [1]. A permutation of the helix numbers represents each chromosome. If two or more helices in any chromosome share some common base pairs, then the conflict helices are not added to the RNA fold. After that crossover and mutation operators are used to improve the chromosome to find a minimum free energy finally [13]. For the sequences with a large number of helices, this method is time-consuming. For global optimization, another metaheuristic algorithm is known as the Bat algorithm. Changing Range Bat Algorithm (CRBA) was introduced for finding RNA Secondary Structure Prediction by Zhihua, Li, Cao and Zhu [32]. This paper represents an updated version of Bat algorithm. Bat algorithm describes microbats echolocation behavior. To predict RNA secondary structure CRBA showed the prediction for ten shorter sequences. And the results were compared with mfold.

In this paper, we have proposed an algorithm based on the Chemical Reaction Optimization (CRO) to solve the RNA structure prediction problem. Our main target is to find out the most stable secondary structure of an RNA sequence. The chemical reaction is a process involving the rearrangement of the molecular and ionic structure of substances. There is a common nature of this universe that every molecule or ion that is not in the stable state wants to be stable by chemical reaction. The CRO algorithm follows this exact behavior. The important feature of CRO is its searching ability. It has both local and global search properties. Another feature of CRO is the high flexibility of designing reaction operators and population generation. These two features help CRO to fit for any optimization problem and finding out the global best results of the optimization problems. In recent years CRO has successfully solved many optimization problems and showed better results than other meta-heuristics approaches. To solve 0-1 knapsack problem, the CRO with Greedy strategy showed a better result than ant colony optimization, genetic algorithm and quantum-inspired evolutionary algorithm [14]. CRO was also applied to the quadratic assignment problem [15], cognitive radio spectrum allocation problem [16], and network coding optimization [17]. To solve multiple choice 0-1 knapsack problem, the artificial CRO outperforms the genetic algorithm [18].

Contribution and novelty of this work are summarized below.

  1. 1.

    We have redesigned four reaction operators: On-wall ineffective collision, Decomposition, Inter-molecular ineffective collision and Synthesis to find the global optimal point. These operators make the whole process able to search best structures of RNA sequences. On the other hand, the proposed method gives the most stable structures for both shorter and longer sequences because of these operators.

  2. 2.

    A novel efficient approach has been proposed here. Chemical Reaction Optimization (CRO) algorithm has been successfully applied to solve different NP-hard problems, however it was not used or proposed for solving RNA structure prediction problem.

  3. 3.

    A new solution generation process has been introduced for CRO Algorithm. The process is efficient in generating valid solution for RNA structure prediction problem.

  4. 4.

    Repair function is one of our other novel tasks. With the help of repair function we verify and remove the duplicate stem number(s). While a solution is generated by any of the four basic operators of CRO, if there exist duplicates of stem numbers during the construction process of secondary structure then duplicate stem numbers are taken into consideration repeatedly, consequently the process takes lot a time. So the repair function makes the CRO algorithm robust and time efficient.

  5. 5.

    Our proposed work follows the hydrogen bond model (INN-HB) which is a group of thermodynamics model to estimate minimum free energy (MFE). This model is subtle and easy to implement and takes less time to obtain minimum free energy than all other models. These properties of the proposed work make it possible to give better results than other methods. The outcomes of the proposed work are compared with the previous related methods such as RNAPredict (GA) [7], the SARNA-Predict (SA) [2], COIN [13], TL-PSOfold [19] and CRBA [32] to show the performance of the proposed method.

2 Problem statement

The RSP is a problem to anticipate RNA secondary structure. Here, an RNA sequence is given to compute the correct secondary structure. The secondary structure of RNA is described by a list of base pairs formed from the primary sequence.

Let = S = s1,s2,...,sn be an RNA sequence. Here S is a string of alphabet \( \sum = \{a, u, g, c\}\). A pair (x, y) is called a base pair (complimentary) if {x, y} = {a, u} or {x, y} = {g, c}. Pairs like {a, g}, {c, u}, {a, c} are not treated as a base pair [20]. The most stable and common of these base pairs are {g, c}, {a, u}, and {g, u}, and their opposite, {c, g}, {u, a}, and {u, g}. When all the pairs are built, the RNA strand folds back to produce the secondary structure. Our main objective is to maximize the number of the stem to create RNA secondary structure from a given sequence and select the most stable secondary structure. The stability of a structure depends on the Gibbs free energy(ΔG). The structure with minimum energy is accepted.(ΔG) is used to calculate the total energy of different structures (RNA) of the same sequence. We use the individual nearest-neighbor hydrogen bond model (INN-HB) [21] for calculating the free energy of a helix in the RNA secondary structure. Now we define an objective function for the RNA structure prediction.

$$ \begin{array}{llll} R=min\{{\Delta} G_{i}\}; & where 1\leq i \leq n; n \\ &= \mathit{number} \mathit{of} \mathit{secondary} \mathit{structure} \\ & \mathit{for} \mathit{one} \mathit{sequence} \end{array} $$
(1)
$$ \begin{array}{lllll} {\Delta} G_{37}^{\circ} ={\Delta} G_{37init}^{\circ} + \sum &\left[{\Delta} G_{37NN}^{\circ}\right] \\&+ {\Delta} G_{\mathit{37AU/GUend}}^{\circ} (\mathit{per AU / GU end}) \\&+ {\Delta} G_{37sym}^{\circ} \end{array} $$
(2)

This Eq. (2) is widely used to calculate the fitness of the RNA secondary structure. The values of these parameters are taken based on [22]. In Table 1, the meaning of every symbol is given.

Table 1 Symbol table for (2)

In RNA structure pseudoknots are formed by the cross of base pairs, the cross of stems can form pseudoknotted structure. It is difficult for a prediction algorithm to compute large RNA molecules with pseudoknots. Jens and Robert proposed an algebraic dynamic programming algorithm for finding RNA pseudoknotted structure that takes a lot of time and space because of the pseudoknotted structure of RNA [20]. Another approach for RNA pseudoknotted structure of predigesting model based on minimum free energy presented by Rivas and Eddy which takes more time and space [23]. The problem for predicting RNA pseudoknotted secondary structure is an NP-complete and maximizing the number of stacking pairs makes it NP-hard. In mimic RNA structure, there exist pseudoknots. Several publications show that extending the RNA structures including arbitrary pseudoknots indicates the problem of finding the optimum structure is NP-hard [24, 25].

3 Related works

In the recent years, different approaches were proposed to predict RNA secondary structure. Some of them are briefly described here.

3.1 Dynamic programming

DP (Dynamic Programming) is a method of solving any complex problem by splitting up the problem into small problems. Then each small problem is solved and the results are stored, and later, they are calculated by the recursive algorithm. DP based algorithm, mfold was developed by Zuker [4] for inventing the pseudoknot-free RNA secondary structure with minimized free energy. For the prediction of secondary structure of RNA, “mfold web server” represents a number of closely nearly connected software applications accessible on World Wide Web (WWW). The web server takes the primary RNA sequence as input and computes pseudoknot-free secondary structure with minimized free energy (MFE) and predicts some suboptimal structures. A possible secondary structure is anticipated by computing the sum of free energies from each optimal substructure, for all probable combinations of substructures and at last, the combination with minimum free energy is selected. In the mfold (multiple folds) web server, users get some options to choose a window for optimal substructure and push some selected base pairs to take up some reserve information into account. The main algorithm brings the energy dot plot matrices for the base pairs involved in the folding. Although dynamic programming algorithm produces optimal and suboptimal structures with minimum free energy, it is incapable of considering the kinetic effects concerned to easily approachable states in RNA folding.

3.2 Simulated annealing

Simulated annealing (SA) is a searching approach for finding the comparative global optimum of a given operation. The method is a metaheuristic to estimate global optimization within vast search space. The use of Simulated Annealing for predicting RNA secondary structures, using free energy minimization approach, was first proposed by Schmitz and Seger [6]. The secondary structure is predicted by iterative formation and disruption of single base pairs. Another prediction based on SA was presented by Tsang and Grypma, a permutation-based algorithm [2]. The structure of SARNA-Predict is developed by four elements. These are i) state representation, ii) perturbation/mutation function, iii) evaluation function and iv) decision mechanism. They compared the performance of the inversion mutation operator to the percentage swap mutation operator used in SARNA-Predict. Then, they evaluated the use of different annealing schedules used in tandem with the two mutation operators. For testing data, 13 different RNA sequences were used and they were taken from the comparative RNA web site [26]. The results of the experiments of the process include sensitivity, specificity, and F-measure. It also includes Known bps, Predicted bps, true positive base pairs (TP), false positive base pairs (FP), and false negative base pairs (FN) [2]. It showed good results in term of sensitivity, selectivity, and F-measure. The results suggest that the inversion mutation operator could be used in place of the swap mutation operator when using geometric scheduled simulated annealing. It is useful only for lower energy sequences. In adaptive scheduling, at a given temperature the iterations number in the search space is variable which takes a long time in searching.

3.3 Genetic algorithm

A genetic algorithm is usually used to produce ideal solutions for optimization and search problems using mutation, crossover, and selection operators. Wiese proposed a method named RNAPredict based on thermodynamic models [7]. In that paper, at first, they evaluated the performance of the stacking energy based thermodynamic models, Individual Nearest Neighbor Hydrogen Bond (INN-HB) model precisely by assessing the statistical correlation between the free energy in a structure and the number of true positive base pairs. Then they measured the accuracy of 19 structures from four RNA classes predicted by RNAPredict by comparing them to known structures and compare the prediction accuracy of RNAPredict and the mfold DPA. They demonstrated that evolutionary computation is a feasible technique for RNA secondary structure prediction. They also demonstrated that RNAPredict gives more minimum Δ G structures than mfold [7]. TP, FP, and FN have been calculated and showed the sensitivity, specificity, and F-measure to further quantify the performance of RNAPredict and mfold. Another approach was proposed by Tong and Cheung [27]. They predicted RNA secondary structures with pseudoknots using GA. They named it GAknot and used two valid datasets for testing. TheGAknot was shown to be the best prediction method in terms of accuracy and speed compared to several competitive prediction methods [27]. One dataset is constructed from a subset of sequences used in HotKnots that contains 41 sequences and another dataset contains RNA secondary structures with pseudoknots. They calculated sensitivity and positive prediction value (PPV) to evaluate the attainment of the method.GA gives a population of solutions as suboptimal structures and makes it probable to search not only the MFE structures but also other structures that are closer to the natural fold. GA is also suitable for estimating certain energy parameters but it takes a long time for predicting RNA structure of sequences with a large number of helices.

3.4 Two-level particle swarm optimization algorithm

A set-based algorithm named two-level particle swarm optimization algorithm (TL-PSOfold) was applied to solve RSP problem [19]. The first objective of the method is to maximize the number of the stems for a given RNA sequence. After that, they have analyzed the minimum free energy. The procedure consists of an initialization stage, where the sets of levels one and two for each swarm are initialized, and then the procedure goes through two levels. In each level, the velocity and position of the level are updated, and the fitness value is evaluated. After that, the best and global best values are updated for the level. If the procedure meets the stopping criteria, it exports global best secondary structure. The outcomes of TL-PSOfold is compared with HelixPSO v1, HelixPSO v2, PSOfold, SetPSO, IPSO, FPSO, RNAfold, mfold and RNA-Predict, SARNA-Predict in terms of sensitivity, specificity, and F-measure. They have also shown the hypothesis test using the Kruskal-Wallis test followed by post-hoc analysis. All the results and analyses prove the higher prediction accuracy of TL-PSO than the other algorithms.

3.5 Coincidence algorithm

An evolutionary algorithm named coincidence algorithm was introduced by Srikamdee [13] to solve RSP problem. The main task of the algorithm was to obtain the feasible combination of helices from a sequence that has the minimum free energy among all combinations. At first, all the helices from a sequence is found by a helix generation algorithm. For this, the canonical base pairs are obtained at the beginning of the algorithm. After getting the population, they evaluated the population by calculating free energy by the INN-HB model. In the next step, the good solutions are selected as candidate solutions. At last the generator matrix is updated, which contains the probability of any helix that can appear together. They have taken 10 sequences for testing and the test results are compared with the two algorithms SARNA-Predict and RNA-Predict.

3.6 Changing range bat algorithm

Bat algorithm is another metaheuristic algorithm for global optimization. Zhihua, Li, Cao and Zhu introduced a new approach of Bat algorithm known as Changing Range Bat Algorithm (CRBA) [32]. This paper represents an updated version of Bat algorithm. Bat algorithm describes microbats echolocation behavior. The original bat algorithm has different variables for virtual bats: frequency, velocity, location etc. The bat algorithm adjusts its frequency first and updating its velocity and position next. By walking randomly, a new solution is picked from the current solution for the best solution. This paper improves the exploitation capability by generating each bat location using the random walk. Average loudness and rate of the pulse are used in the new algorithm, which indicates loudness decreases while the rate of pulse emission increases. The new algorithm improves on the previous the bats algorithm. Here the first bat has the best performance and the performance goes down with the list. They have taken 10 different sequences to predict secondary structure. The results of the experiments of the process include sensitivity, specificity, and F-measure. And to measure these three, this method calculated TP, FP, and FN. CRBA only took shorter sequences to compare with mfold. For some of the sequences, the results of mfold were better than CRBA.

3.7 Advances and assessment of 3D structure prediction

Miao and Eric discussed advances and assessment of 3D structure prediction [33]. This paper first described sequence (1D) to secondary (2D) structure, and then non-Watson-Crick (WC) base pair tertiary structure (3D). Then, it described previous module detection programs, namely RMDetect, JAR3D which can correctly identify standard RNA modules to a degree. RNA Puzzles contains 17 puzzles, which must be tackled by each program. It also indicated score which programmer can look at and improve upon it. They showed some results in term of INF for some sequences.

4 Chemical reaction optimization

Chemical reaction optimization (CRO) is a relatively new invented metaheuristic for optimization that is influenced by the nature of chemical reactions. In an atomic glimpse, a chemical reaction begins with some inconsistent molecules with enormous energy. The molecules collaborate with each other through some fundamental reactions and ultimately, they are altered to those with minimum energy to confirm their presence. Such a characteristic is inflicted in CRO to solve optimization problems. When a molecule occupies excessive energy, which means it is unstable, it manipulates itself to reduce excessive energy to stabilize, and this process is called chemical reaction [28]. Chemical reaction ends with a stable product with minimum energy which is alike to a stepwise process of finding the optimal point. The energy of molecules is of two types: kinetic energy (KE) and potential energy (PE). KE is a positive number and it evaluates the tolerance of the system accepting a worse solution than the current one. PE is represented as the objective function value of the corresponding solution. The change of the molecule from its unstable to a stable state is generated by a collision and two types of collisions exist such as unimolecular and inter-molecular collisions. The first type defines the status when the molecule hits on some external elements like a wall of a container. The second type defines the collision between a molecule and other molecules.

There are four elementary reactions of CRO: on-wall ineffective collision, inter-molecular ineffective collision, decomposition, and synthesis. Whereas an on-wall or inter-molecular ineffective collision occurs, the number of molecules earlier and then remains the same. Besides, when decomposition and synthesis occurs, the number becomes more or less [28]. For neighborhood search, the on-wall ineffective reaction is applied where the minimal change in molecular structure gives almost identical solutions [29]. Decomposition refers to the situation when a molecule hits a wall and then breaks into several molecules. Drastic change occurs in the molecular structure of new molecules in this reaction. The solution bounces to other portions of the solution space and in this way a global search is attained. Inter-molecular collision contains inter-molecular ineffective collision and synthesis. An inter-molecular ineffective collision occurs when multiple molecules clash with each other and after that bounce away. The molecularity remains unchanged before and after the process. This reaction is applied to the local search where the reactant molecules have minimal change in their molecular structure. Synthesis acts opposite of decomposition. A synthesis reaction occurs when multiple molecules hit against each other and integrate together. The elementary reaction can only occur when it assures the energy conservation condition as given below:

$$ \begin{array}{llll} {\sum}_{i = 1}^{k}(PE_{\omega i} + KE_{\omega i})\geq {\sum}_{i = 0}^{l}PE_{\omega^{\prime}i} \end{array} $$
(3)

Here, ω and ω are the structures of input and output molecules, respectively used in a reaction. The number of molecules that takes part in the reaction is denoted by k.

Literally, the steps of the chemical reaction optimization are achieved by generating a population of initial molecules(solutions) randomly first, and then calculating each solution’s objective function value as its PE and initialize each molecule’s other attributes. Then by mean of the parameter Molecoll (1 or 0) we choose one or two molecules from the population randomly by selecting one of the four elementary reactions and generate the new molecule(s) according to the corresponding reaction layout. If there is acceptable energy for the new molecule(s) to be generated, change the original molecule(s) with the new one(s) and then renew the KE. When the stopping criterion is met, the global minimum solution is obtained. This solution is the resultant solution.

4.1 Algorithm design

In the proposed methodology, we have applied CRO to solve RSP problem and CRO is a population-based metaheuristic, it contains three stages: initialization, iterations and the final stage. In initialization, the parameters of the algorithm are assigned, the algorithm analyzes the solution space in iterations, and it terminates with the best output so far in the final stage.

4.1.1 Initialization and population generation

At first, the values are assigned to the parameters such as PopSize, KELossRate, MoleColl, buffer, InitialKE, two thresholds (α and β) and the initial population of molecules is constructed in the initialization phase. To construct the population of molecules possible candidate stems are found. The construction of candidate stems is easy: for a given RNA sequence with length n, we take a matrix. In the matrix, a distinct row and a column denote the corresponding nucleotide in the input RNA sequence.

When, an element in the matrix can build a canonical Watson-Crick base pair (AU, GC) or Wobble base pair (GU) then the element is marked as 1 otherwise, it will be 0. When the matrix is filled, we can find a list of maximal stems. Maximal stems are found by checking the 1s from bottom left to top right of the diagonals (shown in ellipses). A feasible RNA structure has a minimum stem length of three. That means, if a stem has three base pairs or more, it is taken for secondary structure. If the number of 1s in a diagonal is three or more than three, we consider that stem for the count otherwise, the stem is neglected. When all the maximal stems are found, we build a list of stems. In the list, each element stores the stem number, the start, the end, and length of the stem. After that, by permuting the stem numbers of the stem list, we select a sequence which is said to be the randomly selected molecule (Fig. 1).

Fig. 1
figure 1

(a) a sequence matrix in which the individual row and column represent the primary sequence (b) reported stem list with info of start, end, and length (c) Some sequences after permutation (d) a randomly selected sequence of stem numbers

Algorithm 1 represents a generation of the population. After population generation, it is needed for initialization of the parameters of each molecule. To do this, “Molecule-CRO” class is built with some attributes and methods to initialize the parameters of the molecules. The attributes are ω, PE, KE, NumHit, MinStruct, MinPE, MinHit. There are five methods in the class, including the class constructor and the four elementary reactions. The constructor identifies the details of a molecule when it is created according to the class. The initial set of solutions randomly generates in the solution space, a random solution to ω in the constructor is assigned. Algorithm 2 represents the Molecule-CRO class. PopSize, a number of molecules are created from “Molecule-CRO” to produce the initial population of molecules. PE is measured by the (2).

figure c
figure d

4.1.2 Iteration and operator design

After the initialization, an operator is chosen to trigger. The decision depends on certain criteria. We have reconstructed the four reaction operators of CRO for the RSP problem. The four reaction operators are described below.

On-wall ineffective collision

For neighborhood search, the on-wall ineffective reaction is used. Here, the new solution is near to the initial solution. In Fig. 2 m_new is the newly generated molecule from molecule m. Algorithm 3 represents the on-wall ineffective collision.

figure e
Fig. 2
figure 2

On-wall Ineffective Collision

A position i in the molecule m is randomly chosen and a random integer r chosen within the length of the molecule m. Next, we verify the following condition to put a value to the ith position of m_new:

  1. i)

    If the value m[i] + r is less than or equal to the length of m, m[i] + r is assigned to m_new[i].

  2. ii)

    If m[i] is greater than r, the value m[i] - r is assigned to m_new[i].

  3. iii)

    Otherwise the value, r - m[i] is assigned to m_new[i].

The other values are copied from the respective positions (indices) of m to m_new. Suppose that i = 3 then m[i] = 2 as shown Fig. 2. We select r = 4 and get m[i] + r = 6, which is equal to the length of m. So, the value 6 is put to m_new [i]. After getting the new molecule m_new a Repair function (operator) is called to correct it.

Repair function

Repair function is applied to the resultant (output) solution generated by any of the four basic operators. When a new solution is produced by any of the four operators, it may have duplicate repeated stem number(s). As a result, during the constructing process of the secondary structure, the duplicate stem numbers are taken into account repeatedly. This repeated consideration of duplicate stem number(s) takes a lot of execution time. In Fig. 3, an example of constructing process of a secondary structure is given. A solution with a duplicate stem is taken in this example. From the figure, we can see that, the same phenomenon is happening twice for the repeated stems. If we do not remove the duplicate stem then the same checking to be done twice. This repetition of the same task makes the procedure time-consuming. For this, it is required to remove the duplicated stem(s) to avoid excessive run time.

Fig. 3
figure 3

Constructing a secondary structure of a solution with repeated stems

The main task of the repair function is to remove the duplicate stem number from the resultant solution and adjust the values in the solution by shifting the values after the duplicate stem number. So, there will be no repetition in the solution and no extra consideration is needed during the process of secondary structure. In Fig. 4, m is a solution built by any operator and m_new is the solution after applying repair function. In the solution, stem number 5 has a duplicate number. After eliminating the duplicate stem number(s) by the repair function, m_new is the updated solution and it makes the whole process time efficient. Algorithm 4 is for the repair function. Here, 0 is assigned to every index of the flag[i] at the beginning. Then within a loop, the values of flag array are checked and have been changed by putting 1s. If there is any repetition, it would be detected because it was already changed (there is 1 in that index of the flag array). So, the respective value is removed and the length is decreased.

figure f
Fig. 4
figure 4

Repair function

Decomposition

The decomposition is used to analyze another area of the search space. There are enormous changes in the molecular structure of the newly generated product by decomposition. At first in this reaction, the input molecule is divided into two equal parts. For the output of this operator, we get two molecules m1 and m2. The values from the first half of m are copied and pasted to the respective half of m1. Next, the values of the second half of m are copied and pasted them to the respective half of m2. The other halves of m1 and m2 are created by randomly generated values from [0…n], where n is the length of the molecule. The scenario is depicted in Fig. 5.

Fig. 5
figure 5

Decomposition

Hence, the reaction looks for another region of solution space to find out the global minimum solution. The algorithm for decomposition is given in Algorithm 5.

figure g

Inter-molecular Ineffective Collision

This reaction operator uses two molecules m1 and m2 randomly from the population and produces two new molecules m1-new and m2-new as shown in Fig. 6.

Fig. 6
figure 6

Inter-molecular Ineffective Collision

Here two points x1 and x2 have been chosen randomly such that the points divide both input solutions into three parts. Then the odd parts of the m1 and the even part of the m2 join up to form m1-new and the even part of the m1 and the odd parts of the m2 join up to form m2-new. The two-third portions of the molecule remain unchanged and one-third is changed. The points x1 and x2 are picked randomly and with the help of these two values, the new molecule would be created. If the index i is between x1 and x2, m1-new takes values from m1 and m2-new takes values from m2, otherwise m1-new gets values from m2 and m2-new gets values from m1. Algorithm 6 represents this reaction.

figure h

Synthesis

Two molecules m1 and m2 from the population are combined to form a new molecule m_new in the synthesis reaction. This reaction is the opposite of decomposition reaction. Two molecules m1 and m2 are chosen randomly from the population as shown in Fig. 7. The second step is to take a random value between 0 and 1. This random value helps to determine the molecule to choose. If the random value is less than 0.5, m1 is chosen to select value(s) otherwise, m2 is chosen to select the value(s) for m_new.

figure i
Fig. 7
figure 7

Synthesis

4.2 Parameter settings

The efficiency of the CRO algorithm depends on values of the parameters The results of RNA Structure Prediction (RSP) using CRO algorithm greatly depend on the values of these parameters. The parameters with their respective values in CRO have been given in Table 2.

Table 2 Parameters of CRO for RSP problems with their initial values

4.3 Operator selection

In this section, we describe the operator selection process in the CRO algorithm. A flowchart (Fig. 8)is given that depicts the stages step by step and the selection of the operator. The START and END express the initiation and the completion of CRO algorithm.

Fig. 8
figure 8

Flowchart of operator selection

The algorithm starts with the stage initialization then a number of iterations are executed and at the final stage, it terminates. In the initialization, values are assigned to some variables and control parameters. Based on the actions of decomposition and synthesis, the number of solutions spaces can be changed as the CRO is a population-based metaheuristic algorithm.

At first, the population is created by producing the number of solutions randomly which is called POPSize. From these solutions, the feasible solution is searched out. A number of iterations are executed at the iteration stage and a collision is chosen. A decision is made whether the collision is a unimolecular or an intermolecular collision by generating a random number t which lies between 0 and 1. Unimolecular collision occurs if t is greater than MoleColl else an inter-molecular collision takes place (if there is only one molecule in the population then the unimolecular collision takes place). The flowchart of operator selection is shown in Fig. 8. After that, an applicable number of molecules from the population is selected randomly, based on the decision of collision type. Literally, the molecules occupied in collision forms on their physical location in the container. Later a decision of collision type is made by testing the criteria of decomposition and synthesis (at the left of the flowchart: on-wall ineffective collision or decomposition, at the right of the flowchart: inter-molecular ineffective collision or synthesis). The iteration stage continues until it meets the termination criteria. The termination criteria may be defined by the maximum number of iterations performed. The output solution is determined in the final stage with the lowest value found in the objective function.

5 Experimental results

For the experiment, two datasets were taken from the RNA STRAND v2.0 where in the first dataset there are 20 RNA sequences and in the second dataset there are 10 RNA sequences [30]. These sequence sets or datasets had been taken from the website (http://www.rnasoft.ca/strand/). The characteristic of each sequence of data set 1 is given in Table 3 and the information about each sequence of data set 2 is given in Table 4. The first column is for the sequence number which is maintained properly in other tables. The second column is for sequence name with an accession number. The third column defines the RNA class. The fourth column is for the length and the last column shows the number of base pairs in the sequence. Parameter settings for our proposed method were given in Table 2. For the comparison of the experimental results, we use the word CRO for our proposed work and other five algorithms such as RNAPredict, SARNA-Predict, COIN, TL-PSOfold and Changing Range Bat Algorithm are represented by GA, SA, COIN, TL-PSO and CRBA respectively.

Table 3 Dataset 1: RNA sequence detail taken from the RNA STRAND V2.0
Table 4 Dataset 2: RNA sequence detail taken from the RNA STRAND V2.0

The prediction accuracy is calculated by comparing the predicted structure with the known structure. The performance measures for the predicted structure are defined as follows: a total number of true positive base pairs of the prediction is TP, the total number of false positive base pairs of the prediction is FP, and the total number of false negative base pairs of the prediction is FN. On the other hand, sensitivity is the proportion of correctly predicted base pairs of the total number of base pairs found in the known structure. It is computed using (4). The specification is the proportion of all predicted true positive base pairs to all predicted base pairs. It is computed by (5). F-measure and INF are computed using (6) and (7) respectively. We have taken the best results of ten iterations for every sequence of the proposed method. Our algorithm was implemented on Asus personal laptop. Model number: K46CA with Intel Corei3-3217u CPU @ 1.80GHz (4CPUs), 1.8GHz, 8192MB RAM and running on Windows 10 (64 bit). For the implementation, programming language C# 6.0 and Microsoft Visual Studio 2013 IDE had been used.

$$ \begin{array}{llll} Sensitivity = \frac{TP}{TP+FN} \end{array} $$
(4)
$$ \begin{array}{llll} Specificity = \frac{TP}{TP+FP} \end{array} $$
(5)
$$ \begin{array}{llll} F-measure = \frac{2*sensitivity*specificity}{sensitivity+specificity} \end{array} $$
(6)
$$ \begin{array}{llll} INF = \sqrt{\frac{TP}{TP+FP} * \frac{TP}{TP+FN}} \end{array} $$
(7)

The proposed CRO, GA and SA methods have been tested using 20 RNA sequences given in Table 3. The results of TP, FP, and FN found in the CRO, GA and SA are shown in Table 5. The proposed CRO gives the better TPs for all sequences than all other methods. The results of CRO for both with and without function are given in the table. The results of CRO with repair function are better than those of CRO without repair function. For the sequences of short length, the values of TP, FP, and FN are the same or close to the previous methods. The best outcomes are shown in bold text.

Table 5 Results of CRO, GA and SA in terms of TP, FP, and FN

In Tables 6 and 7, the results of Sensitivity, Specificity and F-Measure, INF of CRO are compared with the Sensitivity, Specificity, and F-Measure, INF of the GA and the SA respectively. The proposed method shows relatively better results than the other two methods. The best outcomes are shown in bold text. Results of CRO with repair function are better than those of CRO without repair function.

Table 6 Results of CRO, GA, and SA in terms of sensitivity and specificity
Table 7 Results of CRO, GA, and SA in terms of F-measure and INF

In the papers [13], the method COIN was tested for 10 sequences. The CRO has been tested for the same 10 sequences. Table 8 shows the results of CRO and COIN in terms of TP, FP and FN. We have given the results of CRO and COIN with respect to sensitivity, specificity in Table 9 and in Table 10 the results of CRO and COIN with respect to F-measure and INF have been given. From the results, it can be proved that CRO gives better results than those of COIN. The results of CRO with and without repair function are shown. The best results are given in the bold text.

Table 8 The results of CRO, COIN method in terms of TP, FP, and FN
Table 9 The results of CRO, COIN in terms of sensitivity and specificity
Table 10 The results of CRO, COIN in terms of F-measure and INF

In the papers [19], the method TL-PSO was tested for a specific 10 sequences. The CRO has been tested for the same 10 sequences. Table 11 depicts the results of CRO and COIN in terms of TP, FP and FN. We have shown the results of CRO and COIN with respect to sensitivity, specificity in Table 12. Table 13 shows the results in terms of F-measure and INF. From the results, it can be seen that CRO gives better results than then those of TL-PSO. The results of CRO with and without repair function are shown. The best results are given in the bold text.

Table 11 Results of CRO, TL-PSO method in terms of TP, FP, and FN
Table 12 Results of CRO, TL-PSO method in terms of sensitivity and specificity
Table 13 Results of CRO, TL-PSO method in terms of F-measure and INF

The method Changing Range Bat Algorithm (CRBA) was tested for 10 sequences in the paper [32]. This method took the dataset 2 for the experiment. Our method CRO has been tested for the same dataset. Table 14 depicts the results of CRBA and CRO in terms of TP, FP and FN. We have also shown the results of sensitivity, specificity, and F-measure in the Table 15. For all ten sequences, the outcomes of CRO are more appreciable than CRBA.

Table 14 Results of CRO, CRBA method in terms of TP, FP, and FN
Table 15 Results of CRO, CRBA method in terms of sensitivity, specificity, and F-measure

The simulation of the best, the worst, the average solutions, and the standard deviation of the sensitivity, specificity, and F-measure of each sequence by the proposed method are given. In Tables 16 and 17 it shows the same outcome in terms of F-measure and INF. Tables 16 and 17 show the results of CRO with repair function and Tables 18 and 19 show the results of CRO without repair function. From these tables, it can be seen that there are enough changes in the results while using the repair function. Repair function improves the results of the sensitivity, specificity, F-measure, and INF.

Table 16 Simulation results in terms of Sensitivity and Specificity of CRO with repair function
Table 17 Simulation results in terms of F-measure and INF of CRO with repair function
Table 18 Simulation results in terms of Sensitivity and Specificity of CRO without repair function
Table 19 Simulation results in terms of F-measure and INF of CRO without repair function

Figure 9 represents the graphs of the comparison of the proposed methods of GA and SA, COIN and TL-PSO in terms of sensitivity, specificity, and F-measure. From the graphs, it is proved that CRO outperforms the mentioned four methods. We have also implemented two algorithms of GA and SA to prove the time efficiency of our proposed work. After implementing two mentioned algorithms, it was concluded that the CRO takes less time in predicting RNA structure than GA and SA. The comparisons of execution time between the proposed method and the methods of GA as well as SA are demonstrated in Table 20. The execution time is shown in seconds in the table. The execution time of CRO with and without repair function CRO is given in Table 20 to show that the proposed method with repair function is faster than without repair function. Figure 10 represents the line graph which shows the comparison of the execution time of the proposed method (CRO) with and without the repair function.

Fig. 9
figure 9

The graph for the comparison of the CRO without repair function, CRO without repair function GA, SA, COIN and TL-PSO in terms of (a) sensitivity, (b) specificity and (c) F-measure

Table 20 Execution time of CRO, GA [7] and SA [2] (in seconds)
Fig. 10
figure 10

Line graph for the comparison of execution time of the CRO without repair function and with the repair function

In Table 21 the proposed method (CRO) is compared with the methods of SA and RNAfold [31] and TL-PSO method in terms of Minimum Free Energy (in KCAL/MOL). The values of minimum free energy in method RNAfold were given in the paper of TL-PSO [19], where RNAfold gives the best results than TL-PSO. So, we have compared the results of CRO and those of RNAfold in terms of minimum free energy to verify that our proposed method also outperforms the RNAfold in minimum free energy. As in the other two methods RNAPredict (GA) and COIN there were no minimum free energies of the tested sequences, so we could not provide their results. Any algorithm which has not tested for a particular sequence is denoted by − in the table. From the results, it is proved that CRO gives the lowest free energy as well as most stable structures for nineteen sequences than any other methods. For only one sequence the method of SA gives better result than CRO. However, the results of SA and CRO are very close.

Table 21 CRO is compared with the SA [2], TL-PSO [19], RNAfold [31] in terms of Minimum Free Energy (in KCAL/MOL)

Here, in Table 22 for the sequence, Saccharomyces cerevisiae, the known base pairs and the predicted base pairs of the proposed method is shown. The Saccharomyces cerevisiae has 38 base pairs. The primary sequence of Saccharomyces cerevisiae is given below. Both the known secondary structure of Saccharomyces cerevisiae and the predicted structure are given in dot parenthesis form in Table 17. In the table, i denotes the index of the opening parenthesis and j denotes the index of the closing parenthesis index. The pairs (6-112), (66-107) and (82-93) are not predicted by the proposed method. That is why the base pairs are called false negative (FN) base pairs. They are shown in bold text. Tables 232425, and 26 represent the results of significance testing using Kruskal-Wallis test followed by post-hoc analysis to confirm that, the proposed method significantly outperforms the algorithms of GA, SA, COIN and TL-PSO. The results of every algorithm for all sequences are not available. For example, T. aquaticus has been used in TL-PSO and CRO (the proposed method) but it was not used in SA.

Table 22 Secondary structure comparison of Saccharomyces cerevisiae
Table 23 Sensitivity comparison by Kruskal-Wallis test followed by post-hoc analysis
Table 24 Specificity comparison by Kruskal-Wallis test followed by post-hoc analysis
Table 25 F-Measure comparison by Kruskal-Wallis test followed by post-hoc analysis
Table 26 INF comparison by Kruskal-Wallis test followed by post-hoc analysis

So, when the significance was tested, the groups were formed by removing the unavailable sequences. The significance has been tested for the results of sensitivity, specificity, F-measure, INF. The results of sensitivity, specificity, F-measure and INF are compared in Tables 2326 where the lower portion has been removed to avoid repetition. From the comparison, it is verified that the proposed method outperforms all other algorithms.

Sequence: GGUUGCGGCCAUAUCUACCAGAAAGCACCGUUUCCCGUCCGAUCAACUGUGUUAAGCUGGUAGAGCCUGACCGAGUAGUGUAUGGGUGACCAUACGCGAAACUCAGGUGCU GCAAUCU

Secondary structure by CRO:

Dot-parenthesis form: [(((((.(((....((((((((....(((((((............))))..)))...))))))))..((((.......((((.(((....)))..))).....))))).))).))))).]

Benchmark secondary structure:

Dot-parenthesis form: [(((((((((....((((((((....(((((((............))))..)))...)))))))).(((((.......((((((((....)))))))).....))))).))))))))).]

6 Conclusion

Over the past decade, prediction of RNA structure achieved vital attention to researchers, as it has a crucial role in molecular biology. This work presents the popular optimization problem of RNA secondary structure prediction which is implemented using population-based metaheuristic algorithm called Chemical Reaction Optimization (CRO). The operators of CRO have been redesigned to solve the RSP problem using CRO algorithm. The hard tasks during the implementation of the problem were calculating the free energy of the structure and maintaining the accuracy of the predicted structure which depends on the value of TP (True base pairs), FP (False positive base pairs) and FN (False negative base pairs). The results of the RNA structure prediction problem using CRO algorithm are compared with the related metaheuristic algorithms such as RNAPredict, the SARNA-Predict, COIN, TL-PSOfold and CRBA and the results show that the proposed RNA structure prediction problem using CRO algorithm performs very well and gives more stable structures. The additional repair function helps to predict RNA structure in less time by eliminating the duplicate stem from the resultant solution.

We have not done any experiment with the pseudoknotted RNA sequences because of its complication in estimating the structures. In future work, the experiment of the RNA with pseudoknots could be considered.