Keywords

1 Introduction

DNA computing is a new computational paradigm, which has shown great potential to solve NP-complete problems, such as Hamiltonian path problem (HPP) [1], satisfaction problem (SAT) [2], traveling salesman problem (TSP) [3] and graph coloring problem (GCP) [4]. High-quality DNA sequences can improve the efficiency and reliability. Therefore, the design of DNA molecules should be carefully designed to prevent unwanted hybridization errors. The design of DNA molecules can be regarded as a multi-objective optimization [22,23,24,25] problem, which need satisfy a variety of conflicting DNA encoding constraints and objectives [5].

In the past few decades, a lot of efficient algorithms have been proposed to solve DNA sequences design problem. Frutos et al. [6] proposed the Template-Map method, but it is difficult to derive templates and mappings that satisfy the combined constraints when there are many constraints. Hartemink et al. [7] implemented an exhaustive search algorithm to design DNA sequences, which satisfy the constraints, but the algorithm has a high time complexity. Feldkamp [8] uses directed trees to design DNA encoding that the fixed length sub-sequences are only allowed to appear once, but the length of the sub-sequences requires a lot of testing to determine in the actual design. Recent years, evolutionary algorithms are widely adapted for DNA encoding design, such as genetic algorithm [9, 11, 14, 20], particle swarm optimization [11, 15, 17, 18], ant colony algorithm [12], simulated annealing [16], and multi-objective evolutionary algorithms [10, 13, 19]. However, existing algorithms often obtain the DNA sequences set with bias Similarity or H-measure values, which is easy to introduce errors during the DNA computing process.

In this paper, a reference point based multi-objective optimization evolutionary algorithm is proposed for designing DNA sequences. Firstly, the non-dominated sort algorithm is adapted to select convergent solutions rank by rank. Secondly, the crowding distance sort algorithm is adapted to choose the solutions that are closer to the reference point. The algorithm can provide a set of DNA sequences near idea point which are more reliable and efficient for DNA computing. Finally, to validate the proposed algorithm, we compare our algorithm with some state-of-art techniques. The experimental results confirm the performance of our algorithm in designing high-quality DNA sequences set efficiently.

In the following chapters, Sect. 2 introduces the relevant basis of DNA sequence design problem. The proposed reference-based evolution algorithm for designing DNA sequences is then detailed in Sect. 3. In Sect. 4, we present the experiment results of the proposed algorithm and compare them with the other literatures. Finally, conclusions are drawn in Sect. 5 along with pertinent observations identified.

2 Problem Formulation

During the process of DNA computing, single-strand DNA molecules dismissed randomly in the vitro, therefore four kinds of molecules exist simultaneity, including DNA molecular X, corresponding complementary sequence X C, reverse sequence X R and reverse complementary sequence X RC. Most existing DNA computing models are based on the specific hybridization between a given molecular X and it’s unique Watson-Crick complement X C. In fact, the non-specific hybridization often occurs because unwanted mismatches maybe take place between random molecules, as shown in Fig. 1.

Fig. 1.
figure 1

Specific and non-specific hybridizations.

However, the non-specific hybridization could introduce errors, such as false positives and negatives, and degrade efficiency. Obviously, it is important to design reliable DNA sequences for DNA computing, and the key is to avoid the non-specific hybridization. The design of reliable DNA sequences involves several conflicting design constraints which have to be considered simultaneously. In mathematical terms, DNA sequence design problem can be formulated as a multi-objective optimization problem as Eq. (1).

$$\begin{aligned} \min f(X)=\min \left[ f_{1}(X), f_{2}(X), \cdots , f_{M}(X)\right] ^{T}, f(X) \in R^{M} \end{aligned}$$
(1)

where DNA sequence \(X=\left[ x_{1}, x_{2}, \cdots , x_{N}\right] ^{T} \in \varOmega \) consists of N bases \(x_{i} \in \{\mathrm {A}, \mathrm {C}, \mathrm {G}, \mathrm {T}\}\), and the search space \(\varOmega \) is \(4^{N}\). f(X) consists of M objective functions \(f_{m}(x)\), \(m=1, \ldots , M\). \(R^{M}\) denotes the objective space. Several typical biochemical design criteria are chosen that other relevant authors use to evaluate and generate reliable DNA libraries. The formal definition for each design criteria is provided in the following subsections.

2.1 Similarity Criterion

Let \({X}_{i}\) and \({X}_{j}\) be two different DNA sequences, the similarity criterion refers to the degree of similarity in base composition between \({X}_{i}\) and \({X}_{j}\). By controlling the similarity, non-specific hybridization between \({X}_{i}\) and the complementary of \({X}_{j}\), (i.e. \({X}_{j}^{C}\)). The calculation of Similarity is shown in Eq. (2).

$$\begin{aligned} \begin{aligned} f_{\text{ Similariy } }(X)&=\sum _{i=1}^{n} \sum _{j=1}^{n} \text{ Similarity } \left( X_{i}, X_{j}\right) \\ {}&=\sum _{i=1}^{n} \sum _{j=1}^{n} {\text {Max}}_{g, i}\left( S i_{d i s}\left( X_{i}, X_{j}, s\right) +S i_{c o n}\left( X_{i}, X_{j}, s\right) \right) \end{aligned} \end{aligned}$$
(2)

where the function \(Max_{g, i}\) represents traversing all possible values of g and i and taking the maximum value as a result. The function \(S i_{d i s}\left( X_{i}, X_{j}, s\right) \) represents the number of identical bases in which the DNA sequence \({X}_{i}\) is shifted to the right by the s position compared with the sequence \({X}_{j}\). The function \(S i_{c o n}\left( X_{i}, X_{j}, s\right) \) represents that the DNA sequence \({X}_{i}\) shifts to the right by the s bit and the base compared with \({X}_{j}\) is continuously the same penalty value.

2.2 H-Measure Criterion

For DNA sequences X and Y, the H-measure constraint is to limit non-specific hybridization between X and reverse Y. The calculation of H-measure is as shown in Eq. (3).

$$\begin{aligned} \begin{aligned} f_{H\text {-measioe}}(X)&=\sum _{i=1}^{n} \sum _{j=1}^{n} H\text {-measure}\left( X_{i}, X_{j}\right) \\ {}&=\sum _{i=1}^{n} \sum _{j=1}^{n} {\text {Max}}_{g, i}\left( h_{d i s}\left( X_{i}, X_{j}^{\mathrm {R}}, s\right) +h_{c o n}\left( X_{i}, X_{j}^{\mathrm {R}}, s\right) \right) \end{aligned} \end{aligned}$$
(3)

where the function \(Max_{g, i}\) represents traversing all possible values of g and i and taking the maximum value as a result. The function \(h_{d s}\left( X_{i}, X_{j}^{\mathrm {R}}, s\right) \) represents the number of base complements in which the DNA sequence \({X}_{i}\) shifts to the right by the s-bit compared with the sequence \({X}_{j}\). The function \(h_{c o n}\left( X_{i}, X_{j}^{\mathrm {R}}, s\right) \) represents a base continuous pairing penalty value in which the DNA sequence \({X}_{i}\) is shifted to the right by the s-bit compared with \({X}_{j}\).

Similarity Criterion describes the degree of similarity between DNA sequences, and H-measure Criterion describes the degree of complementary hybridization between DNA sequences. Similarity Criterion and H-measure Criterion are two conflicting objectives, and they are difficult to optimize at the same time. Shin et al. [10] had proved that Similarity Criterion and H-measure Criterion are conflicting, and they are both discontinuous functions and have many locally optimal solutions.

2.3 Continuity Criterion

Continuity constraint means that in the single strand of DNA, the same base appears continuously, and an undesired secondary structure occurs under the hydrogen bonding force of the base molecule. The calculation of Continuity is as shown in Eq. (4).

$$\begin{aligned} \begin{aligned} f_{\text{ Continuity }}(X)&=\sum _{i=1}^{n} \text{ Continuity } \left( X_{i}\right) \\ {}&=\sum _{i=1}^{n} \sum _{i=1}^{l-t+1} T\left( c_{a}(x, i), t^{2}\right) \end{aligned} \end{aligned}$$
(4)

2.4 Hairpin Structure Criterion

The hairpin structure constraint refers to a single-stranded DNA molecule formed by reverse folding of itself, resulting in a secondary structure of a hairpin shape. The calculation of Hairpin is as shown in Eq. (5).

$$\begin{aligned} \begin{aligned} f_{\text{ Hairpin }}(X)&=\sum _{i=1}^{n} \text{ Hairpin } \left( X_{i}\right) \\ {}&=\sum _{i=1}^{n} \sum _{s=s_{m n}}^{\left( l / R_{m n}\right) / 2} \sum _{r=R_{m n}}^{l-2 s} \sum _{i=1}^{l-2 s-r} T\left( \sum _{j=1}^{s} b p\left( x_{s+i-j}, x_{s+i+r+j}\right) , \frac{s}{2}\right) \end{aligned} \end{aligned}$$
(5)

2.5 GC Content Criterion

DNA computing prefer the DNA molecules with uniform GC content. The GC content refers to the number or percentage of bases G and bases C in the DNA sequence. The calculation of GC% is as shown in Eq. (6).

$$\begin{aligned} \begin{array}{l}{f_{G C}(X)=\mathop {\max }\nolimits _{i}\left\{ \mathrm {GC}\left( X_{i}\right) \right\} -\mathop {\min }\nolimits _{j}\left\{ \mathrm {GC}\left( X_{j}\right) \right\} } \\ {\mathrm {GC}=\sum \nolimits _{i=1}^{n} \sum \nolimits _{i=1}^{l} g c\left( x_{i}\right) , \mathrm {gc}\left( x_{i}\right) =\left\{ \begin{array}{l}{1, x_{i}=G \text{ or } x_{i}=C} \\ {0, x_{i}=A \text{ or } x_{i}=T}\end{array}\right. }\end{array} \end{aligned}$$
(6)

2.6 Melting Temperature Criterion

The melting temperature is the temperature at which 50% of the DNA molecules open the double strand into a single strand during the warming denaturation of the double-stranded DNA molecule. The melting temperature is an important parameter for evaluating the thermodynamic stability of DNA molecules. The calculation of Tm is as shown in Eq. (7).

$$\begin{aligned} \begin{array}{l}{f_{T m}(X)=\mathop {\max }\nolimits _{i}\left\{ {\text {Tm}}\left( X_{i}\right) \right\} -\mathop {\min }\nolimits _{j}\left\{ {\text {Tm}}\left( X_{j}\right) \right\} } \\ {{\text {Tm}}\left( X_{i}\right) =\sum \nolimits _{i=1}^{n} \frac{\varDelta \mathrm {H}^{\circ }}{\varDelta S^{\circ }+R \ln \left( \left| C_{T}\right| / 4\right) }}\end{array} \end{aligned}$$
(7)

\(\varDelta \mathrm {H}^{\circ }\) is the total enthalpy of the adjacent base, \(\varDelta S^{\circ }\) is the total entropy of the adjacent base, R is the gas constant (1.987 cal/Kmol), and \(C_{T}\) is the DNA molecule concentration.

3 Problem Formulation

Because Similarity and H-measure are two conflict objectives, we would obtain a set of non-dominated solutions using MOEA. However, the DNA sequences which have high Similarity values will lead to non-specific hybridization between X and the complementary strand \({Y}^{C}\). Moreover, the DNA sequences which have high H-measure values will lead to non-specific hybridization between X and the reverse strand \({Y}^{R}\). Among the whole PF, only the nonbiased point is the idea solutions for DNA sequences design problem, as shown in Fig. 2.

In response to the above problems, we adopt R-NSGA-II [21] to search for DNA sequences, in which the crowding distance is replaced by the distance to the reference point. In our algorithm, the reference distance (RD) can be calculated as shown in Eq. (8).

$$\begin{aligned} R D=\sqrt{\sum _{i=1}^{m}\left( \frac{f_{i}(x)-R_{i}}{f_{i}^{m a x}-f_{i}^{m i n}}\right) ^{2}} \end{aligned}$$
(8)

where \(f_{i}^{\max }\) and \(f_{i}^{\min }\) are the global maximum and minimum function values of the i-th objective function, and \(R_{i}\) is the reference value of the i-th objective. In our algorithm, the reference point is set to \(R=\left( f_{1}, f_{2}, f_{3}, f_{4}, f_{5}, f_{6}\right) =\left( 0,0,0,0, \frac{N}{2}, 50\right) \).

Most of the algorithms calculate the objective functions on the entire population \(P_{t}\). However, two objective functions Similarity an H-measure are the full correlation with all the individuals. If k DNA sequences with best fitness values are selected for DNA computing, they may not remain optimal. In our algorithm, we re-evaluate the individuals with the population \(P_{t+1}\), and update the fitness values in \(P_{t}\) one by one. Three main procedures are iteratively run in our algorithm, specifically the non-dominated sort, the reference crowding distance sort, and full correlation fitness update. The algorithm procedure is also shown in Fig. 3.

Firstly, tournament selection is adapted on Pt, and the winner of two randomly selected individuals should be added into mating pool Qt. Then, crossover and mutation operators are adapted to generate new offspring, and replace the individuals in mating pool Qt. Secondly, the non-dominated sorting is applied to the union set PtQt, and the non-dominated fronts are copied to parent population rank by rank. Thirdly, the reference distance should be calculated for every individual, and the individual with minimum reference distance could be added into the new population Pt + 1 until the population size N. Moreover, in order to select the individuals with optimal full correlation objective values, we re-evaluate the population Pt when individual is selected and added into Pt + 1. The pseudocode is shown as Algorithm 1.

figure a
Fig. 2.
figure 2

Idea solution within the non-dominated solutions.

Fig. 3.
figure 3

The procedure of our algorithm.

Table 1. Comparison results of the obtained seven sequences with 20 bases.
Table 2. Comparison results of the obtained fourteen sequences with 20 bases.

4 Result and Discussion

In order to verify the effectiveness of the proposed algorithm, we compare the obtained results with various known algorithms. In our comparison, the population size is set to 200, the DNA length is set to 20, and the maximum number of iterations is set to 1000. The algorithm is implemented in Eclipse Java and tested on a PC (running environment intel® \(Core^{TM}\) i5-8400 CPU @ 2.802 GHz, 8G RAM, Windows 10).

Table 1 shows the obtained sequences generated by MGA [20], NACST/Seq [10], and our algorithm. As can be seen from Table 1, all the algorithms obtain same Hairpin and GC content values. The sequences of MGA and our algorithm have same Continuity values, which are better than the sequences of NACST/Seq. The MGA has most uniform melting temperature values fluctuated within one degree Celsius. The temperature fluctuation range of our sequences is \(\pm 1.2526\ ^\circ \)C, which is better than NACST/Seq.

The Similarity value of our algorithm is 290, which is much smaller than MGA(444) and NACST/Seq(374). In addition, the H-measure value of our algorithm is 284, which is also much smaller than MGA(438) and NACST/Seq(338). Moreover, the balanced Similarity and H-measure values imply that our sequences are more reliable and have a lower probability of unwanted non-hybridization.

Table 2 shows obtained larger group of DNA sequences by three compared algorithms. As can be seen from Table 2, all the algorithms obtain same Continuity and GC content values. Our algorithm obtains best Hairpin objective value, however, ten sequences in MGA set have poor Hairpin objective values. The sequences of our algorithm have most uniform melting temperature values fluctuated within \(\pm 0.9002\ ^\circ \)C. The melting temperature of MGA and NACST/Seq fluctuate in range \(\pm 1.7500\) and \(\pm 2.9574\) respectively. The H-measure and Similarity values of the sequences designed by our algorithm are balance and much lower than compared algorithms, which means the mismatch and non-hybridization between the coding sequences can be greatly reduced.

5 Conclusion

In this study, a multi-objective DNA sequence design algorithm had been successfully implemented for reliable DNA computation. The algorithm was based on ideal reference point, which could guide the population to search for balance Similarity and H-measure objective values efficiently. The algorithm was compared with some state-of-the-art approaches. The experimental results showed our algorithm can generate high quality DNA sequences set which satisfied various conflict DNA encoding criterions.