1 Introduction

The exponentially increasing demand in data storage forces to look into every possible option and DNA (DeoxyriboNucleic Acid) data storage has come out to be one of the most promising natural data storage for this purpose [11]. After the first striking implementation of large-scale archival DNA-based storage architecture by Church et al. [5] in 2012, followed by encoding scheme to DNA proposed by Goldman et al. [8] in 2013, researchers have taken great interests on the construction of DNA-based information storage systems [18, 20] because of its high storage density and longevity [5, 8, 36]. DNA consists of four types of bases or nucleotides (nt s) called adenine (A), cytosine (C), guanine (G) and thymine (T) where, the Watson-Cricks complementary bases for A and C are T and G respectively and vice versa. In order to store data into synthetic DNA, we need to encode data into strings on quaternary alphabet {A, C, G, T}. The set of encoded DNA strings (also called as DNA codewords) on the quaternary alphabet is called DNA code. For a DNA string, the complement is a DNA string obtained by replacing each nucleotide by its complement. Similarly, for a DNA string, the reverse DNA string is a DNA string in the reverse order, and the complement of the reverse DNA string is called reverse-complement DNA string. The encoded strings are synthesized using DNA synthesizer for the purpose of writing into DNA strings and the synthesized DNA strings are stored in the appropriate environment. To extract the source data, the stored DNA strings are read using DNA sequencing.

Errors occur, particularly, during synthesis and sequencing the DNA strings, which can be reduced by choosing good encoding scheme for the DNA strings. Therefore, it is important to study the source of errors. Generally, insertion or deletion errors occur frequently in a DNA string with consecutive repetitions of a specific nucleotide (e.g. ACGGGGAT) or of a block of nucleotides (e.g. AGATATATGC) up to certain length [12, 21, 26, 32]. In addition, these DNA strings get misaligned more frequently, during DNA sequencing [26]. So we prefer DNA codes that exclude those codewords which contain consecutive repetition(s) of a specific nucleotide or a block of nucleotides. In this article, such DNA strings are defined as conflict free DNA strings. In literature, DNA codes without Homopolymers (DNA string with consecutive repetition of a nucleotide) [1, 2, 6, 10, 31] and without consecutive repeats of blocks [12, 16] are studied. As an extension, in this work, the considered conflict free DNA strings are not only free from Homopolymers but are also free from consecutive repetition of blocks of nucleotides.

In a single stranded DNA, the existence of two sub-strings where, the reverse-complement of one sub-string is the other one, results in forming an antiparallel double stranded hairpin like structure (also called hairpin loop or Stem-loop) by folding back upon itself [4, 13, 24, 27, 38].

An example of such hairpin like structure is illustrated in Fig. 1. For DNA sequencing, it is preferred to avoid such secondary structures [13]. In this work, all the codewords of the constructed conflict free DNA codes are free from hairpin like structures with stem length more than 2.

Fig. 1
figure 1

An example of hairpin like secondary structures in a single stranded DNA

A DNA string can be read using specific hybridization between the DNA string and its complement DNA string [23]. If the DNA strings in a code are not different enough among themselves then nonspecific hybridization will occur and it will be a prominent cause of error. Therefore, a set of DNA codewords is preferred in which DNA strings are sufficiently different among themselves. Hamming distance between two strings of same length over the same alphabets is the number of positions in which the symbols in the strings are different. So, construction of DNA code with Hamming constraint (ensures the difference among DNA codewords), reverse constraint (ensures the difference between DNA codewords and their reverse DNA strings), and reverse-complement constraint (ensures the difference between DNA codewords and their reverse-complement DNA strings) is preferred. In literature, DNA codes with reverse and reverse-complement constraints are constructed from finite fields and finite rings in [9, 17, 29, 33].

The thermal stability of a DNA string depends on the GC-content (the total number of Gs and Cs) in the DNA string [35]. On the other hand, the high GC-content leads to the insertion and deletion error during polymerase chain reaction (PCR). Therefore, such DNA codes are preferred in which each DNA codeword has the same GC-content and equal to almost half of its length and the constraint for the DNA codes is called GC-content constraint. In [14, 15, 30, 33], DNA codes with balanced GC content are studied. DNA codes with reverse, reverse-complement and GC-content constraints are studied in [7, 30]. In [3], a revised lower bound on size of DNA codes with GC-content and reverse-complement constraints are obtained. In fact, DNA codes with balanced GC content and without Homopolymers are also studied in [6, 10, 31].

In this paper, we have studied the DNA codes with multiple properties such as each DNA string is free from consecutive reparation of DNA blocks up to a certain length. Any DNA codeword of the DNA code sufficiently differs from other DNA codewords, reverse of DNA codewords and reverse-complement DNA codewords. In addition, each codeword of the DNA code is also free from hairpin like secondary structures. These properties significantly help to reduce bit-flip, insertion, and deletion errors simultaneously during reading and writing DNA strings. The lower bounds on the maximum size of DNA code with all those properties are also obtained in this paper. To the best of author’s knowledge, in literature, an algebraic solution for DNA codes with all the constraints is not studied yet. In this work, an algebraic structure for the family of DNA codes is proposed where the constructed DNA codes meet all the constraints such as Hamming, reverse, reverse-complement, and GC-content constraints. Apart from that, all the DNA codewords do not have any consecutive identical sub-string(s) up to a certain length (generalization of non-homopolymer constraint [12, 21, 26, 32]. These are known as non-homopolymer constraint of order ). In addition, these codewords are free from hairpin like secondary structures. In this paper, an algorithm is given which calculates the DNA code with the property that each DNA codeword does not have any consecutive repeated sub-string of any length. In addition, DNA codes with Hamming constraint, reverse constraint, reverse-complement constraint, and GC-content constraint are obtained. For a DNA code with all the constraints, the obtained code size is improved for some specific parameters as given in [19, Table I]. Further, family of DNA codes have been obtained with Hamming, reverse, reverse-complement, and GC-content constraints where, each DNA codeword is free from hairpin like secondary structure and non-homopolymer constraint.

In Section 2, preliminary for DNA codes is discussed. Complete conflict free DNA codes with all the constraints are studied in Section 3. A recursive mapping from binary strings to DNA strings is discussed in Section 4, which is also an isometry between non-homopolymer distance over binary strings and Hamming distance over DNA strings. The conditions on binary strings are obtained, which ensure the constraints on encoded DNA strings in the same section. In Section 5, a family of DNA codes are obtained from binary Reed-Muller codes. Section 6 concludes the work with general remarks.

2 Preliminary

A code \(\mathcal {C}\) with parameters (n, M, d) over an alphabet Σ of size q is a set of M distinct strings (also called codewords) with length n such that the distance between any two distinct strings is at least d. Codes over {0,1} are called binary codes. Similarly, codes over an alphabet of size 4 are known as Quaternary codes. In particular, codes over alphabet ΣDNA = {A, C, G, T} are called DNA codes (denoted by \(\mathcal {C}_{DNA}\)) respectively. For various applications, codes with various distances (such as Gau distance [17]) are studied in literature. In this work, DNA codes with Hamming distance and binary codes with a newly defined distance are studied. For any strings x and y in Σn, the Hamming distance dH(x, y) between x and y is the total number of positions at which they differ. For a code \(\mathcal {C}\subset {\Sigma }^{n}\), the minimum Hamming distance is dH = \(\min \limits \{d_{H}(\textbf {x},\textbf {y}):\textbf {x}\neq \textbf {y}\text { and }\textbf {x},\textbf {y}\in \mathcal {C}\}\). Note that one can find fields or rings over the alphabet Σ. For a ring defined over the alphabet Σ, a code on Σ is called linear if the code is sub-module over the ring. For a field defined over the alphabet Σ, a code on Σ is called linear if the code is row span of a matrix over the field. For any linear code, the matrix is called the generator matrix if rows of that matrix are linearly independent.

For any DNA string x = \((x_{1}\ x_{2}{\ldots } x_{n})\in {\Sigma }_{DNA}^{n}\), the reverse, complement and reverse-complement DNA strings of x are xr = (xn xn− 1x1), xc = \(({x_{1}^{c}}\ {x_{2}^{c}}{\ldots } {x_{n}^{c}})\), and xrc = \(({x_{n}^{c}}\ x_{n-1}^{c}{\ldots } {x_{1}^{c}})\) respectively where, Ac = T, Cc = G, Gc = C and Tc = A. As defined in [23], for any DNA code \(\mathcal {C}_{DNA}\) with parameter (n, M, dH), the various constraints are defined as follows:

  • Hamming constraint: The Hamming distance dH(x, y) ≥ dH for any \(\textbf {x},\textbf {y}\in \mathcal {C}_{DNA}\) and xy.

  • Reverse constraint: The Hamming distance \(d_{H}(\textbf {x},\textbf {y}^{r})\geq d_{H}\) for any \(\textbf {x},\textbf {y}\in \mathcal {C}_{DNA}\) and xyr.

  • Reverse-complement constraint: The Hamming distance \(d_{H}(\textbf {x},\textbf {y}^{rc})\geq d_{H}\) for any \(\textbf {x},\textbf {y}\in \mathcal {C}_{DNA}\) and xyrc.

  • GC-content constraint: If the total number of G’s and C’s in each codeword is same and equal to g then the code satisfies g-GC-content constraint. For a specific case g = ⌊n/2⌋, the ⌊n/2⌋-GC-content constraint is simply called GC-content constraint.

Consider a DNA code \(\mathcal {C}_{DNA}\) with the minimum Hamming distance dH. For each \(\textbf {x}\in \mathcal {C}_{DNA}\), if \(\textbf {x}^{r}\in \mathcal {C}_{DNA}\) then, from the distance property of code, \(d_{H}(\textbf {y},\textbf {x}^{r})\geq d_{H}\) for each \(\textbf {y}\in \mathcal {C}_{DNA}\) such that xry. Therefore, the code satisfies the reverse constraint. Similarly, for each \(\textbf {x}\in \mathcal {C}_{DNA}\), if \(\textbf {x}^{rc}\in \mathcal {C}_{DNA}\) then from the distance property of code again, \(d_{H}(\textbf {y},\textbf {x}^{rc})\geq d_{H}\) for each y(≠xrc) in \(\mathcal {C}_{DNA}\), and hence the code satisfies the reverse-complement constraint. In consequence, researchers are curious in the construction of DNA codes which are closed under reverse and reverse-complement DNA strings [17]. Thus, motivated by this, we construct a set of DNA strings for a given length such that those DNA strings satisfy multiple constraints. In the following lemma, the distinct DNA strings of fix length with some additional constraints are enumerated.

Lemma 1

For a given length n,

  1. 1.

    there exists 4n/2⌉ number of distinct DNA strings \(\textbf {x}\in {\Sigma }_{DNA}^{n}\) such that x = xr,

  2. 2.

    for an even n, there exists 4n/2 number of DNA strings \(\textbf {x}\in {\Sigma }_{DNA}^{n}\) such that x = xrc,

  3. 3.

    for a positive integer m (≤ n), there exists \(\binom {n}{m}2^{n}\) distinct DNA strings \(\textbf {x}\in {\Sigma }_{DNA}^{n}\) each with GC-content m,

  4. 4.

    for an even positive integer m (≤ n), there exists \(\binom {n/2}{m/2}2^{n/2}\) distinct DNA strings \(\textbf {x}\in {\Sigma }_{DNA}^{n}\) each with GC-content m and x = xrc where, n is even, and

  5. 5.

    for a positive integer m (≤ n), there exists η distinct DNA strings \(\textbf {x}\in {\Sigma }_{DNA}^{n}\) each with m - GC-content and x = xr where,

    $$ \eta = \left\{ \begin{array}{ll} 0 & \text{if }n\text{ is even and }m\text{ is odd}, \\ \binom{\lfloor n/2\rfloor}{\lfloor m/2\rfloor}2^{\lceil n/2\rceil} & \text{otherwise.} \end{array}\right. $$

Proof

Consider x = \((x_{1}\ x_{2}{\ldots } x_{n})\in {\Sigma }_{DNA}^{n}\).

  1. 1)

    If x = xr then, xi = xni+ 1 for i = 1,2,…,⌈n/2⌉. Therefore, there exists 4n/2⌉ number of distinct DNA strings \(\textbf {x}\in {\Sigma }_{DNA}^{n}\) such that x = xr.

  2. 2)

    If x = xrc then, xi = \(x_{n-i+1}^{c}\) for i = 1,2,…,⌊n/2⌋. Therefore, there exists 4n/2 number of distinct DNA strings \(\textbf {x}\in {\Sigma }_{DNA}^{n}\) such that x = xrc. Note that, for any positive odd n, any \(\textbf {x}\in {\Sigma }_{DNA}^{n}\) can not equal xrc.

  3. 3)

    There are \(\binom {n}{m}2^{m}\) ways to fill m positions out of n positions by a symbol from {C, G}. If the remaining n - m positions are filled by a symbol from {A, T} then there are \(\binom {n}{m}2^{n}\) distinct DNA strings each with GC-content m.

Proofs of remaining results are similar. □

For theoretical analysis, we define complement constraint for a DNA code similar to reverse and reverse-complement constraints. A DNA code \(\mathcal {C}_{DNA}\) satisfies the complement constraint, if, for any x and y in \(\mathcal {C}_{DNA}\) such that xyc, \(d_{H}(\textbf {x},\textbf {y}^{c})\geq d_{H}\).

Remark 1

Consider a DNA code \(\mathcal {C}_{DNA}\) with the minimum Hamming distance dH. For any codeword \(\textbf {x}\in \mathcal {C}_{DNA}\), if xc and xr are also in the DNA code \(\mathcal {C}_{DNA}\) then, from the distance property of code, the DNA code satisfies reverse, complement, and reverse-complement constraints.

Remark 2

Consider a DNA code \(\mathcal {C}_{DNA}\) with the minimum Hamming distance dH such that \(\textbf {x}^{c}\in \mathcal {C}_{DNA}\) for each codeword \(\textbf {x}\in \mathcal {C}_{DNA}\). For any codeword \(\textbf {x},\textbf {y}\in \mathcal {C}_{DNA}\), if \(d_{H}(\textbf {x},\textbf {y}^{r})\geq d_{H}\) then, from the reverse and reverse-complement properties of DNA, the DNA code satisfies reverse, complement, and reverse-complement constraints.

Remark 3

Consider a DNA code \(\mathcal {C}_{DNA}\) with the minimum Hamming distance dH such that \(\textbf {x}^{r}\in \mathcal {C}_{DNA}\) for each codeword \(\textbf {x}\in \mathcal {C}_{DNA}\). For any codeword \(\textbf {x},\textbf {y}\in \mathcal {C}_{DNA}\), if \(d_{H}(\textbf {x},\textbf {y}^{c})\geq d_{H}\) then, from the reverse and reverse-complement properties of DNA, the DNA code satisfies reverse, complement, and reverse-complement constraints.

For a DNA code with some properties, the following lemma ensures the reverse-complement constraint.

Lemma 2

For a DNA code of parameter (n, M, dH) with reverse and complement constraints, if dHn/2 then the DNA code will satisfy the reverse-complement constraint.

Proof

For any \(\textbf {x},\textbf {y}\in \mathcal {C}\), \(d_{H}(\textbf {y}^{rc},\textbf {x}^{c})\) = \(d_{H}(\textbf {y}^{r},\textbf {x})\). So, \(d_{H}(\textbf {x},\textbf {x}^{c}) \leq d_{H}(\textbf {x},\textbf {y}^{rc})+d_{H}(\textbf {y}^{rc},\textbf {x}^{c})\) = \(d_{H}(\textbf {x},\textbf {y}^{rc})+d_{H}(\textbf {y}^{r},\textbf {x})\). This implies, \(d_{H}(\textbf {x},\textbf {x}^{c}) \leq d_{H}(\textbf {x},\textbf {y}^{rc})+d_{H}\). But, \(d_{H}(\textbf {x},\textbf {x}^{c})\) = n, so, \(n-d_{H}\leq d_{H}(\textbf {x},\textbf {y}^{rc})\). Therefore, if dHn/2 then \(d_{H}(\textbf {x},\textbf {y}^{rc})\geq d_{H}\). □

In a single stranded DNA, if there exist two sub-strings such that one sub-string is the reverse-complement of another sub-string then the DNA strand folds back and attaches the both sub-strings to each other and forms hairpin like secondary structures with stems and loops of certain length [4, 13, 24, 27]. The stem size of more than 2 bases long reasonably approximates the hairpin like structures. The single stranded DNA ATACGCGAATGCGTGC, considered in Fig. 1, contains the reverse-complementary sub-strings ACGC and TGCG (see the bold sub-strings). The sub-strings are attached to each other and forms a stem of length 4 base pairs, and a loop of size 4 bases long. Therefore, in this work, DNA strings are considered to be free from hairpin like structures with stem length of more than 2 bases long.

Definition 1

A DNA string is called free from secondary structures (of stem length more than 2) if the DNA string does not contain any two sub-strings of length more than 2 such that one is the reverse-complement of the other.

Remark 4

Consider a DNA string of length n (> 5) which contains two sub-strings of length t (nt ≥ 4) such that one sub-string is the reverse-complement of the other. Then, the DNA string will also contain two sub-strings of length 3 such that one is the reverse-complement of the other. Therefore, it is sufficient for a DNA string, free from secondary structures of stem length more than 2, not to contain two sub-strings which are reverse-complement to each other.

Apart from the pairing between Watson-Crick complement base pairs, the secondary structures can also be formed with Wobble base pairs or cross hybridization between base pairs in a single strand DNA [4, 13, 24, 27, 38]. In this work, we have considered only those secondary structures which have stem length more than 2 and each bounded base pair in the stem is Watson-Crick complement pairs. We called such DNA strings free from secondary structures in this paper.

Remark 5

A DNA string is free from secondary structures if and only if the complement of the DNA string is also free from secondary structures. The same result holds for reverse and reverse-complement.

For example, the DNA string ACATCG is free from reverse-complement sub-strings of length 3 because the reverse-complement of ACA is TGT and TGT is not a sub-string of ACATCG. The reverse and reverse-complement DNA strings GCTACA and CGATGT are also free from reverse-complement sub-strings. Therefore, the following remark holds.

Remark 6

Any DNA string is free from secondary structures of stem length more than 2 if and only if it is free from secondary structures of stem length 3.

3 On complete conflict free DNA strings

It is evident that, the process of sequencing and synthesizing of DNA strings will be erroneous due to the existence of homopolymers and consecutive repetition of same sub-string(s) of certain length in DNA [12, 21, 26, 32]. So it is preferable to construct DNA codes in such a way that each DNA codeword will be free from Homopolymers or consecutive repetition(s) of same sub-string(s) of certain length. In general, a sequence which is free from the consecutive repetition of a block is known as square free sequence or repeat free sequence in literature [22]. Thus motivated by this, we define conflict free DNA strings and conflict free DNA code. Also the necessary and sufficient condition for a DNA string to be conflict free is given.

Definition 2

For positive integers n and (≤ n/2), a DNA string is called conflict free, if the DNA string of length n is free from consecutive repetition(s) of identical sub-string(s) of length t for each t = 1,2,…,.

For example, the DNA string ATCATCG is 2 conflict free but not 3 conflict free as the substring ATC has a consecutive repetition in the DNA string while any two consecutive sub-strings of same length (≤ 2) are not same, i.e., the DNA string does not contain any of the DNA sub-strings ATAT, TCTC, CACA, CGCG, AA, TT, CC and GG. Note that, 1 conflict free DNA strings are also known as DNA strings free from homopolymers in literature [1, 2, 6, 10, 31]. Also note, for any positive integer (≥ 2), an conflict free DNA string is also − 1 conflict free.

Remark 7

A DNA string is conflict free if and only if the complement DNA string is also conflict free. Note that the result also holds for reverse and reverse-complement DNA strings.

Definition 3

For positive integers n and (≤⌊n/2⌋), a DNA code with length n is called conflict free DNA code if each DNA codeword of the DNA code is conflict free.

For example, the DNA code \(\left \{ACTG, TGAC, CAGT, GTCA\right \}\) is 2 conflict free DNA code, since each DNA codeword is 2 conflict free DNA string.

One can observe that the maximum length of a DNA sub-string will be ⌊n/2⌋, which can be repeated in a DNA string of length n. Hence, from Definition 2, a DNA string will be free from repetition(s) of DNA sub-string(s) of any length, if it is ⌊n/2⌋ conflict free. For a positive integer n, let S(n) be the set of all ⌊n/2⌋ conflict free DNA strings each of length n. For any zS(n) we will have zr,zc,zrcS(n). Also note that, any DNA code \(\mathcal {C}_{DNA}\subset S(n)\) is always a ⌊n/2⌋ conflict free DNA code.

Various computational approaches to construct DNA codes with some additional constraints are studied in literature [3, 28, 33, 34, 37]. In this work, conflict free DNA codes are constructed using stochastic local search in a seed set of conflict free DNA strings such that each DNA string has a fix GC-content and all the DNA strings are free from reverse-complement sub-strings. The computational construction for conflict free DNA codes is given as follows.

Construction 1

For given positive integers n, (≤⌊n/2⌋) and g (≤ n), let \(\mathcal {S}\subset {\Sigma }_{DNA}^{n}\) be the set of all conflict free DNA strings such that GC-content of each DNA string is g and every DNA string is free from secondary structures. For a sub-set R of random cardinality and containing DNA strings which are randomly selected from \(\mathcal {S}\), the DNA code \(\mathcal {C}_{DNA}\) = R ∪{xr,xc,xrc : xR} is an conflict free DNA code with Hamming, reverse and reverse-complement constraints where, each DNA codeword is free from reverse-complement sub-strings and GC-constant of each DNA codeword is fix g.

For example, consider n = 3, = 1 and g = 2. The seed set will be \(\mathcal {S}\) = \(\left \{ACG, AGC, CAC, CAG, CGA, CGT, CTC, CTG, GAC, GAG, GCA, GCT,\right .\) \(\left . GTC, GTG, TCG, TGC \right \}\). Note that, for R = {CAC, CGT, ACG, TGC}, the 1 conflict free DNA code with GC-content, reverse and reverse-complement constraints is \(\mathcal {C}_{DNA}\) = \(\left \{CAC, CGT, ACG, TGC, GCA, GTG\right \}\) where, each DNA codeword is free from secondary structures. Also note, the DNA code size and the minimum Hamming distance of the code are M = 6 and dH = 2.

For a given DNA string, the computational complexity to determine if the DNA string is conflict free (using Definition 2) is more than the computational complexity to determine whether the DNA string is free from secondary structures (using Remark 6), and, it is again more than the computational complexity to determine the GC-content of the DNA string. Therefore, in order to construct the seed set \(\mathcal {S}\) for the Construction 1, one can reduce the computations by removing DNA strings in the following order.

  1. 1.

    Remove all the DNA strings from the complete set \({\Sigma }_{DNA}^{n}\) which do not have g-GC-content.

  2. 2.

    Remove all the DNA strings from the remaining set which are not free from secondary structures of stem length 3.

  3. 3.

    Remove all the DNA strings from the remaining set which are not conflict free.

For a given length n and Hamming distance dH, the maximum size of code is subject to interest among researches. Now, similar to [23], some notations for the maximum size of conflict free DNA codes are introduced here.

  • For a DNA code of codeword length n and the minimum distance dH with GC-content and reverse-complement constraints, the maximum size of the DNA code is denoted by \(A_{4}^{GC,rc}(n,d_{H})\) where, the terms GC and rc stand for GC-content and reverse-complement constraints.

  • The maximum size of an conflict free DNA code with length n and the minimum Hamming distance dH is \(A_{4}^{cf}(n,d_{H},\ell )\). So, \(A_{4}^{cf,GC}(n,d_{H},1)\) denotes the maximum size of a DNA code of codeword length n and the minimum distance dH with GC-content constraint such that each codeword is free from homopolymers where, the term cf stands for conflict free property of the DNA code.

  • For an conflict free DNA code of codeword length n and the minimum distance dH with reverse constraint such that each DNA codeword is free from secondary structures (hairpin like structure) of length more than 2, the maximum size of the DNA code is denoted by \(A_{4}^{cf,hf,r}(n,d_{H},\ell )\) where, the terms hf and r stand for property free from secondary structures and reverse constraint.

  • For an conflict free DNA code of codeword length n and the minimum distance dH with reverse-complement constraint, \(A_{4}^{cf,hf,rc}(n,d_{H},\ell )\) denotes the maximum size of the DNA code where, each DNA codeword is free from secondary structures of length more than 2.

  • For any conflict free DNA codes of codeword length n and the minimum distance dH with GC-contant constraint, the maximum size of the DNA code is denoted by \(A_{4}^{cf,hf,GC}(n,d_{H},\ell )\) where, each DNA codeword is free from secondary structures of length more than 2.

  • For any conflict free DNA codes of codeword length n and the minimum distance dH with GC-contant, reverse, and reverse-complement constraints, the maximum size of the DNA code is denoted by \(A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell )\) where, each DNA codeword is free from secondary structures of length more than 2.

The relations among sizes of conflict free DNA codes with additional constraints are given in following theorem.

Theorem 1

For a positive even integer n,

$$ A_{4}^{cf,hf,GC,r}(n,d_{H},\ell) = A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell), $$

and for a positive odd integer n,

$$ \begin{array}{ll} A_{4}^{cf,hf,GC,r}(n,d_{H}+1,\ell) & \leq A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell) \\ & \leq A_{4}^{cf,hf,GC,r}(n,d_{H}-1,\ell). \end{array} $$

Proof

Similar to the proof of [23, Theorem 4.1], consider an conflict free DNA code \(\mathcal {C}_{DNA}\) with codeword length n and the minimum Hamming distance dH. In addition, each codeword of the DNA code is free from secondary structures and has fixed GC-content. Now, for even n, the DNA code \(\mathcal {C}_{DNA}=\{\textbf {a}_{i}\textbf {b}_{i}\}\) satisfies reverse constraint where, the sizes of ai and bi are same and aibi is the concatenation of sequences ai and bi in same order. Then the DNA code \(\mathcal {C}_{DNA}^{\prime }=\{\textbf {a}_{i}{\textbf {b}_{i}^{c}}\}\) satisfies reverse-complement constraint. Observe that the parameters of both the DNA codes \(\mathcal {C}_{DNA}\) and \(\mathcal {C}_{DNA}^{\prime }\) are same and therefore, \(A_{4}^{cf,hf,GC,r}(n,d_{H},\ell ) \leq A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell )\). Similarly, one can prove the inequality \(A_{4}^{cf,hf,GC,r}(n,d_{H},\ell ) \geq A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell )\). Thus, the result follows for even n. Again, for odd n, consider a conflict free DNA code \(\mathcal {C}_{DNA}\) with parameter (\(n,A_{4}^{cf,hf,GC,r}(n,d_{H}+1,\ell ),d_{H}+1\)) with reverse and GC-content constraints. Also each codeword of the DNA code is free from secondary structures. For some x ∈ΣDNA, \(\mathcal {C}_{DNA}=\{\textbf {a}_{i}x\textbf {b}_{i}\}\) where, the sizes of ai and bi are same and aibi is the concatenation of sequences ai and bi in same order. Now, consider \(\mathcal {C}_{DNA}^{*}=\{\textbf {a}_{i}\textbf {b}_{i}\}\) of even length which is obtained by deleting the middle symbol x of each codeword of the DNA code \(\mathcal {C}_{DNA}\). Therefore, from even case, \(A_{4}^{cf,hf,GC,r}(n,d_{H}+1,\ell )\leq A_{4}^{cf,hf,GC,rc}(n-1,d_{H},\ell )\). The first inequality for odd n follows from \(A_{4}^{cf,hf,GC,rc}(n-1,d_{H},\ell )\leq A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell )\). Similarly, one can prove the second inequality for odd n. □

Remark 8

Consider an conflict free DNA code with codeword length n and the minimum Hamming distance dH.

  • For positive integers n, (< ⌊n/2⌋) and dH (≤ n),

    $$ \begin{array}{llllll} A_{4}^{cf}(n,d_{H},\ell) & \geq A_{4}^{cf}(n,d_{H},\ell+1) \\ A_{4}^{cf,hf}(n,d_{H},\ell) & \geq A_{4}^{cf,hf}(n,d_{H},\ell+1) \text{ and } \\ A_{4}^{cf}(n,d_{H},\ell) & \geq A_{4}^{cf,hf}(n,d_{H},\ell) \\ & \geq A_{4}^{cf,hf,GC}(n,d_{H},\ell) \\ & \geq A_{4}^{cf,hf,GC,r}(n,d_{H},\ell) \\ & \geq A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell). \end{array} $$
  • For positive integers n and dH (≤ n),

    $$ \begin{array}{ll} & A_{4}^{cf,GC}(n,d_{H},1) \geq A_{4}^{cf,hf,GC}(n,d_{H},\ell), \text{ and } \\ & A_{4}^{GC,rc}(n,d_{H}) \geq A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell). \end{array} $$

For n = 1,2,…,10, g = ⌊n/2⌋ and = 1,2,…,⌊n/2⌋, the DNA codes with various parameters are calculated using Construction 1 and the maximum value of obtained code sizes are listed in Table 1 where, the DNA codes satisfy reverse, reverse-complement, GC-content constraints, and each codeword of the DNA code is ⌊n/2⌋ conflict free and also free from secondary structures. For given n and , first, the seed set \(\mathcal {S}\) is obtained such that each DNA string in the set contains ⌊n/2⌋ GC-content. Further, for a random sub-set R of the set \(\mathcal {S}\), the ⌊n/2⌋ conflict free DNA code \(\mathcal {C}_{DNA}\) = \(R\cup \{\textbf {x}^{r},\textbf {x}^{c},\textbf {x}^{rc}:\textbf {x}\in \mathcal {C}\}\) is obtained. From the seed set \(\mathcal {S}\), the sub-set R is generated 106 times and, for each sub-set R, the size \(|\mathcal {C}_{DNA}|\) is enumerated. For given n and dH, the maximum value among all the enumerated \(|\mathcal {C}_{DNA}|\) (lower bound of \(A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\lfloor n/2\rfloor )\)) is listed in Table 1. Note that \(A_{4}^{cf,hf,GC}(n,1,\lfloor n/2\rfloor )\) = \(A_{4}^{cf,hf,GC,r,rc}(n,1,\lfloor n/2\rfloor )\) and \(A_{4}^{cf,hf}(n,1,\lfloor n/2\rfloor )\) = \(A_{4}^{cf,hf,r,rc}(n,1,\lfloor n/2\rfloor )\), for a positive integer n. Therefore, from Remark 7 and Definition 3, \(A_{4}^{cf,GC,r,rc}(n,1,\lfloor n/2\rfloor )\) = \(|\mathcal {S}|\) where, \(|\mathcal {S}|\) is cardinality of seed set in Construction 1. Therefore, for n = 2,3,…,10, the listed lower bounds of \(A_{4}^{cf,hf,GC,r,rc}(n,1,\lfloor n/2\rfloor )\) are tight. Similarly, from the definition of Hamming distance one can observe that,

$$ A_{4}^{cf,hf,GC,r,rc}(n,n,\lfloor n/2\rfloor) = \left\{ \begin{array}{ll} 2 & \text{ if } n \text{ is odd}, \\ 4 & \text{ if } n \text{ is even}. \end{array} \right. $$

Recall, for any odd length DNA string x, \(d_{H}(\textbf {x},\textbf {x}^{r})< n\), and hence, the values \(A_{4}^{cf,hf,GC,r,rc}(n,n, \lfloor n/2\rfloor )\) are also tight for dH = n (= 2,3…,10) in Table 1. Values which are written in bold font in the table indicate equal or improved values from [19, Table I] or [3, Table II], and values written in a circle in the table are improved values from [19, Table I] or [3, Table II]. For the equal and improved values, the corresponding code parameters have been considered and the respective DNA codes and their codewords are listed in Table 6. Apart from the existing literature, all the specified constraints have been considered in Table 1. Some of those values are better when compared with the DNA code size with less constraints listed in Table 2.

Table 1 Lower bound of \(A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\lfloor n/2\rfloor )\) (the maximum size of ⌊n/2⌋ conflict free DNA code with Hamming, reverse, reverse-complement and GC-content constraints, and each DNA codeword is free from secondary structures) are given. Values given in circles are improved from the values given in [19, Table I] or [3, Table II]. The bold values are those values which archive the values given in [19, Table I] or [3, Table II]
Table 2 List of improved lower bound of \(A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\lfloor n/2\rfloor )\) (the maximum size of ⌊n/2⌋ conflict free DNA code with Hamming, reverse, reverse-complement and GC-content constraints and each DNA string is free from secondary structures) from [19, Table I] and [3, table II]

Note that the values in Table I, [19] are the lower bounds for the maximum size of DNA codes satisfying GC-content constraint and free from Homopolymers, and in Table II, [3] are lower bounds for the maximum size of DNA codes satisfying GC-content and reverse-complement constraints, on the other hand, values listed in Table 1 in this paper are the lower bounds for the maximum size of complete conflict free DNA codes satisfying Hamming, reverse, reverse-complement and GC-content constraints. From Remark 8 and Table 2, some new lower bounds are obtained for \(A_{4}^{cf,GC}(n,d_{H},1)\) and \(A_{4}^{GC,rc}(n,d_{H})\). For \(A_{4}^{cf,GC}(n,d_{H},1)\), the newly updated bounds are

$$ A_{4}^{cf,GC}(4,3,1)\geq 12, $$
$$ A_{4}^{cf,GC}(6,4,1)\geq 20, $$
$$ A_{4}^{cf,GC}(8,6,1)\geq 12, $$
$$ A_{4}^{cf,GC}(9,6,1)\geq 16, $$
$$ A_{4}^{cf,GC}(9,9,1)\geq 2, $$
$$ A_{4}^{cf,GC}(10,7,1)\geq 16, \text{ and } $$
$$ A_{4}^{cf,GC}(10,8,1)\geq 8. $$

Similarly, for \(A_{4}^{GC,rc}(n,d_{H})\), the newly achieved bounds are

$$ A_{4}^{GC,rc}(4,3)\geq 12, $$
$$ A_{4}^{GC,rc}(6,4)\geq 20, \text{ and } $$
$$ A_{4}^{GC,rc}(9,9)\geq 2. $$

4 Mappings and their properties

For a positive integer , the frequency of occurrence of insertion or deletion errors in an + 1 conflict free DNA string is less in an conflict free DNA string. Therefore, the chances of occurrence of these errors is significantly low in an conflict free DNA string for a sufficiently large . On the other hand, the computational complexity of Construction 1 is high. Therefore, to sidestep the computational approach, a recursive mapping is defined algebraically in this section which ensures that the obtained DNA strings will be conflict free. Moreover, DNA codes satisfying all the constraints are also studied with respect to the mapping in this section.

Definition 4

(Non-Homopolymer map of order ): For a positive integer , consider \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\) such that xy. Define a map \(f:\{0,1\}\times \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}\rightarrow \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}\) such that Table 3(a) holds.

Table 3 Non-Homopolymer Map and an Example with x = CG and y = AT

For = 2, one can obtain Non-Homopolymer map f of order 2 is given in Table 3(b) by considering x = CG and y = AT. One can read the table as f(1,CG) = TA, f(1,TA) = GC and f(0,GC) = TA etc. Using the Non-Homopolymer map of order 2 (Table 3(b)), a binary string can also be encoded into a DNA string. Formally, the encoding using the Non-Homopolymer map is as follows.

Encoding 1

For positive integers n and , consider a mapping f as defined in the Definition 4. A binary string a = \((a_{1}\ a_{2} {\ldots } a_{n})\in \{0,1\}^{n}\) is encoded into u = \((\textbf {u}_{1}\ \textbf {u}_{2} {\ldots } \textbf {u}_{n})\in \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}^{n}\), in such a way that ui = f(ai,ui− 1) for each i = 2,3,…,n and u1 = h(a1) where, \(h:\{0,1\}\rightarrow \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}\) such that h(0) = h(1)c. Note that \(\textbf {u}_{1}\in \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}\) initiates the encoding of the binary string, and therefore the length of the encoded DNA string u is n. Clearly, the encoding of two distinct binary strings are always distinct. Observe that for i = 2,3,…,n, f(ai,ui− 1)c = \(f(a_{i},\textbf {u}_{i-1}^{c})\) = \(f(\bar {a}_{i},\textbf {u}_{i-1})\) where, \(\bar {a}_{i}\) is binary complement of ai.

Example 1

Consider a binary string a = (0 1 1 0). Let the encoding be initialized with x = CG. So, one can encode the binary string a to a DNA string (x f(1,x) f(1,f(1,x)) f(0,f(1,f(1,x)))) = xycxy = CGTAGCTA.

For positive integers n and (< n), consider a subset \(S\subseteq \{0,1\}^{n}\). Each binary string from the set S is encoded using the Non-Homopolymer map, and the set of all encoded DNA strings is denoted by f(S). In particular, let \({\mathscr{B}}\) = \(f(\{0,1\}^{n})\subseteq {\Sigma }_{DNA}^{n\ell }\) be the set of all possible DNA strings of length n and they are obtained by the encoding using Non-Homopolymer map.

$$ \mathcal{B} = \left\{ \begin{array}{ll} (\{\textbf{u}_{1},{\textbf{u}_{1}^{c}}\}\times\{\textbf{u}_{2},{\textbf{u}_{2}^{c}}\})^{\frac{n-1}{2}}\times\{\textbf{u}_{1},{\textbf{u}_{1}^{c}}\} & n\text{ is positive odd integer, and } \\ (\{\textbf{u}_{1},{\textbf{u}_{1}^{c}}\}\times\{\textbf{u}_{2},{\textbf{u}_{2}^{c}}\})^{\frac{n}{2}} & n\text{ is positive even integer} \end{array} \right. $$

where, \(\{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}\) and \(\{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}\) are pairwise disjoint and \(\{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}\cup \{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}=\{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}\). Note that, in the Example 1, the encoded DNA string CGTAGCAT is 4 conflict free and also free from secondary structures. The GC-content of the DNA string is 4 (= ⌊n/2⌋). Therefore, in the following results, we obtained constraints on seed blocks (x and y) of the Non-Homopolymer mapping so that the encoded DNA strings from binary strings satisfy those additional constraints.

Theorem 2

For positive integers n (n ≥ 2), , t and m (≤ t), if \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\) such that each DNA string in the set \(S = \{(\textbf {x}\ \textbf {y}_{1}\ \textbf {x}_{2}\ \textbf {y}_{2}\ldots \textbf {x}_{t}\ \textbf {y}_{t}), (\textbf {y}\ \textbf {x}_{1}\ \textbf {y}_{2}\ \textbf {x}_{2}\ldots \\ \textbf {y}_{t}\ \textbf {x}_{t}): \textbf {x}_{i}\in \{\textbf {x},\textbf {x}^{c}\}\text { and } \textbf {y}_{i}\in \{\textbf {y},\textbf {y}^{c}\} \text { for } i=1,2,\ldots ,t\}\) is m conflict free then any binary string from {0,1}n will be encoded into an m conflict free DNA string using Non-Homopolymer map.

Proof

From Remark 7, if any DNA string in the set S is m conflict free then all the DNA sub-strings (x1 y1 x2 y2xt yt) and (y1 x1 y2 x2yt xt) are also m conflict free. Therefore, in the encoded DNA string u = (u1 u2un) (using Non-Homopolymer map), for 1 ≤ in − 2t + 1, consider (ui ui+ 1ui+ 2ui+ 2t), which is also m conflict free where, \(\textbf {u}_{j}\in \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}\) for j = i, i + 1,…,i + 2t. From Definition 2, m < t, the result follows. □

For the various values of m and t in Theorem 2, one can observe the following two propositions:

Proposition 1

For a positive integer , if \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\) such that each of the DNA strings (x y), (x yc), (y x) and (y xc) is conflict free then any binary string will be encoded into a conflict free DNA string using Non-Homopolymer map.

Theorem 3

For a positive integer (≥ 2), if \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\) such that each of the DNA string from the set \(\{(\textbf {x}\ \textbf {y}_{1}\ \textbf {x}_{1}\ \textbf {y}_{2}\ \textbf {x}_{2}):\textbf {x}_{1},\textbf {x}_{2}\in \{\textbf {x},\textbf {x}^{c}\}\text { and }\textbf {y}_{1},\textbf {y}_{2}\in \{\textbf {y},\textbf {y}^{c}\}\}\) is free from reverse-complement sub-string(s) of length 3 then the encoded DNA string using Non-Homopolymer map is free from secondary structure of stem length 3.

Proof

For (≥ 2) and \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\), consider a DNA string (x y x y x) free from reverse-complement sub-strings of length 3. From Remark 5, the DNA strings (xc yc xc yc xc) will also be free from reverse-complement sub-strings of length 3. Similarly remaining all DNA strings of 5 seed blocks are also free from reverse-complement sub-strings. Hence, from Definition 1 and Non-Homopolymer map, the encoded DNA string obtained from any binary string will be free from secondary structure of stem length 3. □

From Remark 4, if a DNA string of length n is free from reverse-complement sub-string of length m then the DNA string is also free from any reverse-complement sub-string of length t (mtn). Therefore, one can observe following two propositions from Theorem 3.

Proposition 2

For a positive integer (≥ 2), if \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\) such that each DNA string in the set \(\left \{(\textbf {x}\ \textbf {y}^{*}\ \textbf {x}^{*}\ \textbf {y}^{*}), (\textbf {y}\ \textbf {x}^{*}\ \textbf {y}^{*}\ \textbf {x}^{*}): \textbf {x}^{*}\in \{\textbf {x},\textbf {x}^{c}\} \text { and } \textbf {y}^{*}\in \left \{\textbf {y},\right .\right . \left .\left .\textbf {y}^{c}\right \}\right \}\) is free from secondary structures of stem length 3 then the encoded DNA string using Non-Homopolymer map is also free from secondary structure of stem length 3.

Recall that any DNA string of length n is ⌊n/2⌋ conflict free then the DNA string is free from any consecutive repetitions of sub-strings of any length. Following theorem gives the condition on binary strings such that the encoded DNA string of length n is ⌊n/2⌋ conflict free.

Theorem 4

For positive integers n and , consider \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\) such that the DNA strings (x y), (x yc), (y x) and (y xc) are conflict free. If a = (a1 a2an) is a binary string such that \(2\mu <{\sum }_{i=\lambda +1}^{\lambda +2\mu } (a_{i}a_{2\mu +i}+\bar {a}_{i}\bar {a}_{2\mu +i})\) for each positive even integer 2μ from the set {1,2,…,⌊n/2⌋} and λ = 0,1,…,n − 2μ then the binary string a will be encoded (using Non-Homopolymer map) into a ⌊n/2⌋ conflict free DNA string of length n.

Proof

Consider a binary string a = (a1 a2an) which is encoded into DNA string u = (u1 u2un) using Non-Homopolymer map. For each positive even integer 2μ from the set {1,2,…,⌊n/2⌋}, the DNA block \(u_{2\mu +i}\in \{u_{i},{u_{i}^{c}}\}\). For any binary symbol ai, a2μ+i ∈{0,1},

$$ (a_{i}a_{2\mu+i}+\bar{a}_{i}\bar{a}_{2\mu+i})= \left\{ \begin{array}{ll} 1 & \text{ if } a_{i}=a_{2\mu+i} \\ 0 & \text{ otherwise. } \end{array} \right. $$

Therefore, \({\sum }_{i=1}^{2\mu } (a_{i}a_{2\mu +i}+\bar {a}_{i}\bar {a}_{2\mu +i})=2\mu \) if and only if ai = a2μ+i for each i = 1,2,…,2μ. If the origin is shifted with λ, and \(2\mu <{\sum }_{i=\lambda +1}^{\lambda +2\mu } (a_{i}a_{2\mu +i}+\bar {a}_{i}\bar {a}_{2\mu +i})\) for each λ and μ, then there is not exist the consecutive identical sub-string of length 2μ in the binary string. From Non-Homopolymer map, for a binary string, two consecutive sub-strings of odd length cannot be encoded into identical DNA sub-strings. Therefore, from Non-Homopolymer map and Proposition 1, the encoded DNA string will be a ⌊n/2⌋ conflict free. □

The GC-content of encoded DNA string is calculated in the following lemma.

Lemma 3

For positive integers n and , consider \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\) with GC-content gx and gy. For the encoded DNA string u = \((\textbf {u}_{1}\ \textbf {u}_{2} \ldots \textbf {u}_{n})\in \left \{\textbf {x},\textbf {x}^{c},\textbf {y},\right . \left .\textbf {y}^{c}\right \}^{n}\) using Non-Homopolymer map, the GC-content of u will be

$$ g_{\textbf{u}} = \left\{ \begin{array}{ll} g_{\textbf{u}_{1}}+(g_{\textbf{u}_{1}}+g_{\textbf{u}_{2}})(n-1)/2 & \text{ if } n \text{ is odd integer}, \\ (g_{\textbf{u}_{1}}+g_{\textbf{u}_{2}})n/2 & \text{ if } n \text{ is even integer.} \end{array} \right. $$

Proof

For positive integers n and (< n), let a binary string a = \((a_{1}\ a_{2} {\ldots } a_{n})\in \{0,1\}^{n}\) be encoded into some u = \((\textbf {u}_{1}\ \textbf {u}_{2} \ldots \textbf {u}_{n})\in \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}^{n}\) using Non-Homopolymer map. In the Non-Homopolymer map, if \(\textbf {u}_{1}\in \{\textbf {x},\textbf {x}^{c}\}\) then DNA blocks \(\textbf {u}_{2j}\in \{\textbf {y},\textbf {y}^{c}\}\) and \(\textbf {u}_{2j+1}\in \{\textbf {x},\textbf {x}^{c}\}\), for 1 ≤ 2j,2j + 1 ≤ n. Since, the GC-content of a DNA string and its complement DNA string are the same, the GC-content of each sub-string (u2j u2j+ 1) is gx + gy. Hence, if n is even, the GC-content of the encoded DNA string u is gu = (gx + gy)n/2 and if n is odd then gu = gx + (gx + gy)(n − 1)/2. Similarly, if \(\textbf {u}_{1}\in \{\textbf {y},\textbf {y}^{c}\}\) then gu = (gx + gy)n/2 for even n and gu = gy + (gx + gy)(n − 1)/2 when n is odd. Hence, the lemma follows. □

From Lemma 3, one can observe the following two propositions.

Proposition 3

In Lemma 3, if gx + gy = then

$$ g_{\textbf{u}} = \left\{ \begin{array}{ll} g_{\textbf{u}_{1}}+\ell(n-1)/2 & \text{ if } n \text{ is odd integer} \\ \ell n/2 & \text{ if } n \text{ is even integer}. \end{array} \right. $$

The following proposition ensures that the GC-content of the encoded DNA string (using Non-Homopolymer map) is almost 50% of the length.

Proposition 4

For positive integer n and , let \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\). If the GC-content of x and y are ⌊/2⌋ and ⌈/2⌉ respectively, then the GC-content of any encoded DNA string u (using Non-Homopolymer map) of length n is

$$ g_{\textbf{u}} = \left\{ \begin{array}{ll} \lfloor n\ell/2\rfloor & \text{ if } \textbf{u}_{1}\in\{\textbf{x},\textbf{x}^{c}\}, \text{ and } \\ \lceil n\ell/2 \rceil & \text{ if } \textbf{u}_{1}\in\{\textbf{y},\textbf{y}^{c}\}. \end{array}\right. $$

Theorem 5

For positive integers n and , let \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\). Using Non-Homopolymer map, if a binary string (a1 a2an) is encoded into some u ∈{x, xc,y, yc}n then the binary string \((\bar {a}_{1}\ a_{2} {\ldots } a_{n})\) is encoded into uc where, \(\bar {a}_{1}\) is the binary complement of a1.

Proof

The proof is done using induction on the index i (i = 1,2,…,n). Now from the Non-Homopolymer map, f(0,z)c = f(0,zc) and f(1,z)c = f(1,zc), for each z ∈{x, xc,y, yc}. Consider the binary strings (a1 a2 a3an) and \((\bar {a}_{1}\ a_{2}\ a_{3}{\ldots } a_{n})\), that are encoded into some DNA strings (u1 u2 u3un) and (v1 v2 v3vn). From the Non-Homopolymer map, \({\textbf {u}_{1}^{c}}\) = h(a1)c = \(h(\bar {a}_{1})\) = v1. Let \({\textbf {u}_{i}^{c}}\) = vi, for some i ∈{1,2,…,n}. Consider \(\textbf {u}_{i+1}^{c}\) = f(ai+ 1,ui)c = \(f(a_{i+1},{\textbf {u}^{c}_{i}})\) = f(ai+ 1,vi) = vi+ 1. Therefore, from induction, the binary strings (a1 a2 a3an) and \((\bar {a}_{1}\ a_{2}\ a_{3}{\ldots } a_{n})\) are encoded into DNA strings which are complement to each other. □

In the following two theorems, the Hamming distance between the two DNA strings is calculated for binary strings with Hamming distance 1 and 2.

Theorem 6

For positive integers n and , consider the binary strings a = (a1 a2an) and b = \(\left .(a_{1}\ a_{2}{\ldots } a_{i-1}\ \bar {a}_{i}\ \right . \left . a_{i+1}{\ldots } a_{n})\right .\) (1 ≤ in), that are encoded into DNA strings u = (u1 u2un) and v = (v1 v2vn) where, \(\bar {a}_{i}\) is binary complement of ai. Then, dH(u, v) = (ni + 1).

Proof

Consider the binary strings a = (a1 a2ai− 1 ai ai+ 1an) and b = \((a_{1}\ a_{2} {\ldots } a_{i-1} \bar {a}_{i}\ a_{i+1}{\ldots } a_{n})\) which can be encoded into u = (u1 u2ui− 1 uiui+ 1un) and v = (v1 v2vi− 1vi vi+ 1vn) respectively. From Non-Homopolymer map, vj = uj (j = 1,2,…i − 1) and vj = \({\textbf {u}_{j}^{c}}\) (j = i, i + 1…,n). Therefore, dH(u, v) = (ni + 1), since \(d_{H}(\textbf {x},\textbf {x}^{c})\) = \(d_{H}(\textbf {y},\textbf {y}^{c})\) = . □

Theorem 7

For positive integers n and , consider the binary strings a = (a1 a2an) and b = \(\left .(a_{1}\ a_{2}{\ldots } a_{i-1}\ \bar {a}_{i}\right . \left . a_{i+1}{\ldots } a_{j-1}\ \bar {a}_{j}\ a_{j+1}{\ldots } a_{n}\right .)\) (1 ≤ i < jn) that are encoded into DNA strings u = (u1 u2un) and v = (v1 v2vn) where, \(\bar {a}_{i}\) is binary complement of ai. Then, dH(u, v) = (ji).

Proof

The proof is similar to the Theorem 6 and follows from the fact that for any DNA string x, (xc)c = x. □

The following theorem provides a bound on the Hamming distance on the encoded DNA strings.

Theorem 8

For positive integers n, and σ(≤ ), consider \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\) and σ = \(\min \limits \left \{d_{H}(\textbf {z}_{1},\textbf {z}_{2}),\right . \left .n-d_{H}(\textbf {z}_{1},\textbf {z}_{2}):\textbf {z}_{1}\in \{\textbf {x},\textbf {x}^{c}\}\text { and }\textbf {z}_{2}\in \{\textbf {y},\textbf {y}^{c}\} \right \}\). Let binary strings a, b ∈{0,1}n be encoded into the DNA strings u = (u1 u2un) and v = (v1 v2vn). For some a, b ∈{0,1}, if \(\textbf {u}^{\prime }\) = (u f(a, un)) and \(\textbf {v}^{\prime }\) = (v f(b, vn)) then

$$ d_{H}(\textbf{u}^{\prime},\textbf{v}^{\prime})\geq \left\{ \begin{array}{ll} \sigma(d_{H}(\textbf{a},\textbf{b})+d_{H}(a,b)) & \text{ if }d_{H}(\textbf{a},\textbf{b})\text{ is even} \\ \sigma(d_{H}(\textbf{a},\textbf{b})-d_{H}(a,b)+1) &\text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is odd}. \end{array} \right. $$

Proof

The proof follows from the following two facts. (1) For a positive integer m and any \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{m}\), if dH(x, y) = t then \(d_{H}(\textbf {x}^{c},\textbf {y})\geq m-t\), and (2) For z ∈{x, xc,y, yc} and a, b ∈{0,1}, f(c, z)c = f(c, zc) = \(f(\bar {c},\textbf {z})\). So we can derive,

$$ d_{H}(\textbf{u}^{\prime},\textbf{v}^{\prime})\geq \left\{ \begin{array}{ll} \sigma(d_{H}(\textbf{a},\textbf{b})+1) & \text{ if }d_{H}(\textbf{a},\textbf{b})\text{ is even and } a\neq b\\ \sigma d_{H}(\textbf{a},\textbf{b}) & \text{ if }d_{H}(\textbf{a},\textbf{b})\text{ is even and } a= b\\ \sigma d_{H}(\textbf{a},\textbf{b}) &\text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is odd and } a\neq b \\ \sigma(d_{H}(\textbf{a},\textbf{b})+1) &\text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is odd and } a = b. \end{array} \right. $$

Hence the result follows. □

Motivated by Theorem 8, one can observe the following proposition.

Proposition 5

For positive integers n and , consider \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\) and = dH(z1,z2), for each \(\textbf {z}_{1}\in \{\textbf {x},\textbf {x}^{c}\}\) and \(\textbf {z}_{2}\in \{\textbf {y},\textbf {y}^{c}\}\). Let two binary strings a, b ∈{0,1}n be encoded into the DNA strings u = (u1 u2un) and v = (v1 v2vn). For some a, b ∈{0,1}, if \(\textbf {u}^{\prime }\) = (u f(a, un)) and \(\textbf {v}^{\prime }\) = (v f(b, vn)) then

$$ d_{H}(\textbf{u}^{\prime},\textbf{v}^{\prime})=\left\{ \begin{array}{ll} d_{H}(\textbf{u},\textbf{v})+\ell d_{H}(a,b) & \text{ if }d_{H}(\textbf{a},\textbf{b})\text{ is even} \\ d_{H}(\textbf{u},\textbf{v})+\ell(1-d_{H}(a,b)) &\text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is odd}. \end{array} \right. $$

In order to establish the proposed mapping as an isometry from the set of binary strings to the set of DNA strings where, Hamming distance is taken for the set of DNA strings, we introduce a new distance between two binary strings in the following definition.

Definition 5

Let n(> 1) be an integer and Σ be an alphabet of size q (≤ 2). For any a = (a1 a2an), b = \((b_{1}\ b_{2}{\ldots } b_{n})\in {\Sigma }^{n}\), let P = {i : aibi and i ∈{1,2,…,n}} and

$$ S= \left\{ \begin{array}{ll} P & \text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is even} \\ P\cup\{n+1\} & \text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is odd.} \end{array} \right. $$

If S, we can denote S = {s1,s2,…,s|S|} such that, for each sj < sj+ 1, j = 1,2,…,|S|− 1. We define a map \(d_{NHo}:{\Sigma }^{n}\times {\Sigma }^{n}\rightarrow {\mathbb R}\) such thatl

$$ d_{NHo}(\textbf{a},\textbf{b}) = \left\{ \begin{array}{ll} \ell{\sum}_{i=1}^{|S|/2} (s_{2i}-s_{2i-1}) & \text{ if } |S|>0,\\ 0 & \text{ if } |S|=0, \end{array} \right. $$

where is a positive integer.

Remark 9

The map \(d_{NHo}:{\Sigma }^{n}\times {\Sigma }^{n}\rightarrow \mathbb { R}\) is indeed a distance (called Non-Homopolymer distance), because it holds all the properties including triangle inequality of distance. The first three properties of metric are easy to prove and the triangle inequality property can be proved using induction on length n. For a code \(\mathcal {C}\subseteq {\Sigma }^{n}\), the minimum distance dNHo = \(\min \limits \{d_{NHo}(\textbf {a},\textbf {b}):\textbf {a}\neq \textbf {b},\text { and }\textbf {a},\textbf {b}\in \mathcal {C}\}\).

For example, consider n = 5, = 2 and Σ = {0,1}. For a = (1 1 1 1 0) and b = (0 1 1 0 0), dNHo(a, b) = 6 where, S = {1,4}.

Theorem 9

For positive integers n and , the Non-Homopolymer map is a distance preserving encoding between ({0,1}n,dNHo) and \(({\mathscr{B}},d_{H})\).

Proof

The theorem is proved using induction on the string length n. The base case, n = 1, is obvious from the Non-Homopolymer map and the Non-Homopolymer distance. For the inductive step, assume that the distance is preserved for n = k where, a = (a1 a2ak) and b = (b1 b2bk) with the support set S, are encoded into DNA strings u = (u1 u2uk) and v = (v1 v2vk). To prove that the distance is preserved for n = k + 1, consider \(\textbf {a}^{\prime }\) = (a ak+ 1) = (a1 a2ak ak+ 1) and \(\textbf {b}^{\prime }\) = (b bk+ 1) = (b1 b2bk bk+ 1) with the support set \(S^{\prime }\) where, ak+ 1,bk+ 1 ∈{0,1}. Let the strings a and b be encoded into DNA strings \(\textbf {u}^{\prime }\) = (u1 u2uk uk+ 1) and \(\textbf {v}^{\prime }\) = (v1 v2vk vk+ 1) where, \(\textbf {u}_{k+1},\textbf {v}_{k+1}\in \{\textbf {x},\textbf {y},\textbf {x}^{c},\textbf {y}^{c}\}\). Now, there are four cases. (i) Consider dH(a, b) is even and ak+ 1 = bk+ 1. Note that both uk and vk are member of either {x, xc} or {y, yc}. On the other hand, \(d_{NHo}(\textbf {a}^{\prime },\textbf {b}^{\prime })\) = dNHo(a, b), since \(S^{\prime }\) = S. (ii) If dH(a, b) is even and ak+ 1bk+ 1 then uk+ 1 = \(\textbf {v}_{k+1}^{c}\). So, for = \(d_{H}(\textbf {x},\textbf {x}^{c})\) = \(d_{H}(\textbf {y},\textbf {y}^{c})\), \(d_{H}(\textbf {u}^{\prime },\textbf {v}^{\prime })\) = dH(u, v) + and \(d_{NHo}(\textbf {a}^{\prime },\textbf {b}^{\prime })\) = dNHo(a, b) + since, \(S^{\prime }\) = {k + 1,k + 2}∪ S. (iii) If dH(a, b) is odd and ak+ 1 = bk+ 1 then \(d_{H}(\textbf {u}^{\prime },\textbf {v}^{\prime })\) = dH(u, v) + m and \(d_{NHo}(\textbf {a}^{\prime },\textbf {b}^{\prime })\) = dNHo(a, b) + since, \(S^{\prime }\) = {k + 2}∪ S∖{k + 1}. (iv) If dH(a, b) is odd and ak+ 1bk+ 1 then \(d_{H}(\textbf {u}^{\prime },\textbf {v}^{\prime })\) = dH(u, v) and \(d_{NHo}(\textbf {a}^{\prime },\textbf {b}^{\prime })\) = dNHo(a, b) since, \(S^{\prime }\) = S. Hence, the result follows from Proposition 5. □

Theorem 10

For a binary code \(\mathcal {C} (n, M, d_{NHo})\) where, dNHo is the Non-Homopolymer distance (Definition 5), there exists a DNA code \(f(\mathcal {C})\) with codeword length n, size M and the minimum Hamming distance dH = dNHo.

Proof

The proof follows from the Non-Homopolymer map and Theorem 9. □

Theorem 11

For any binary code \(\mathcal {C}\) with the minimum distance dNHon/2 (\(n,\ell \in \mathbb { Z}^{+}\)), there exists a DNA code \(f(\mathcal {C}) (n\ell ,M,d_{H})\) with complement constraint.

Proof

Let binary strings a and b of length n be encoded into DNA strings u and v of length n. From the property of the complement of a DNA string, \(d_{H}(\textbf {u},\textbf {v}^{c})\geq n\ell -d_{H}(\textbf {u},\textbf {v})\). From Theorem 9, dH(u, v) = dNHo(a, b) ≤ n/2. Therefore, \(d_{H}(\textbf {u},\textbf {v}^{c})\geq n\ell /2\) and hence, the result follows. □

Theorem 12

For a positive integer n, if a binary linear code with codeword length n contains (1 0 0…0) as a codeword, then the encoded DNA code (using Non-Homopolymer map) will satisfy complement constraint.

Proof

Consider a binary linear code containing the codeword (1 0 0…0) of length n. For any codeword a = (a1 a2an) of the binary linear code, (1 0 0…0) + (a1 a2an) = \((\bar {a}_{1}\ a_{2}{\ldots } a_{n})\) = b is also a codeword of the code. Therefore, from Theorem 5, for each binary codeword a, there exists a binary codeword b such that the encoded DNA strings from a and b will be complement to each other. Hence, by the distance property, the theorem is proved. □

Theorem 13

For positive integers n and , let \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\). Then, for any encoded DNA strings u, vf({0,1}n) using Non-Homopolymer map,

$$ d_{H}(\textbf{u},\textbf{v}^{r}) \geq \left\{ \begin{array}{ll} n\min\{d_{H}(\textbf{x},\textbf{y}^{r}),d_{H}(\textbf{x},\textbf{y}^{rc})\}, & \text{ if } n \text{ is even,} \\ \min\left\{d_{H}(\textbf{x},\textbf{x}^{r}),d_{H}(\textbf{y},\textbf{y}^{r}),d_{H}(\textbf{x},\textbf{x}^{rc}),d_{H}(\textbf{y},\textbf{y}^{rc})\right\}, & \text{ if } n \text{ is odd. } \end{array}\right. $$

Proof

For \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\), let the binary strings a, b ∈{0,1}n of length n be encoded into DNA strings u = (u1 u2un), v = (v1 v2vn) in {x, xc,y, yc}, where \(\textbf {u}_{2i},\textbf {v}_{2i}\in \{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}\) and \(\textbf {u}_{2i+1},\textbf {v}_{2i+1}\in \{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}\) for 1 ≤ 2i,2i + 1 ≤ n. The set f({0,1}n) is the collection of all possible DNA strings such that obtained DNA blocks will be from \(\{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}\) and \(\{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}\) at even positions and odd positions respectively. Consider \(d_{H}(\textbf {u},\textbf {v}^{r})={\sum }_{j=1}^{n}d_{H}(\textbf {u}_{j},\textbf {v}^{r}_{n-j+1})\). Now two cases may arise.

  1. Case 1:

    If n is odd, then j and nj + 1 both are either even or odd. If both j and nj + 1 are even then \(\textbf {u}_{j},\textbf {v}_{n-j+1}\in \{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}\), and if both j and nj + 1 are odd then \(\textbf {u}_{j},\textbf {v}_{n-j+1}\in \{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}\). Therefore, u, vrf({0,1}n) and, from the Non-Homopolymer map, \(d_{H}(\textbf {u}_{j},\textbf {v}^{r}_{n-j+1})\geq \min \limits \{d_{H}(\textbf {x},\textbf {x}^{r}),d_{H}(\textbf {y},\textbf {y}^{r}),d_{H}(\textbf {x},\textbf {x}^{rc}),\\ d_{H}(\textbf {y},\textbf {y}^{rc})\}\).

  2. Case 2:

    If n is even, then the parity j and nj + 1 will be different. So, for even j, \(\textbf {u}_{j}\in \{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}\) and \(\textbf {v}_{n-j+1}\in \{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}\), and, for odd j, \(\textbf {u}_{j}\in \{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}\) and \(\textbf {v}_{n-j+1}\in \{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}\). Therefore, from Non-Homopolymer map and the fact that, for any \(\textbf {z}_{1},\textbf {z}_{2}\in {\Sigma }_{DNA}^{\ell }\), \(d_{H}(\textbf {z}_{1},{\textbf {z}_{2}^{r}})\) = \(d_{H}({\textbf {z}_{1}^{r}},\textbf {z}_{2})\), we obtain \(d_{H}(\textbf {u}_{j},\textbf {v}_{n-j+1}^{r})\geq \min \limits \{d_{H}(\textbf {x},\textbf {y}^{r}),d_{H}(\textbf {x},\textbf {y}^{rc})\}\). Hence the result follows for every n.

Similarly one can get result for reverse-complement constraint as given in the following preposition.

Proposition 6

For positive integers n and , let \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\). Then, for any encoded DNA strings u, vf({0,1}n) using the Non-Homopolymer map,

$$ d_{H}(\textbf{u},\textbf{v}^{rc}) \geq \left\{ \begin{array}{ll} n\min\{d_{H}(\textbf{x},\textbf{y}^{r}),d_{H}(\textbf{x},\textbf{y}^{rc})\}, & \text{ if } n \text{ is even,} \\ \min\left\{d_{H}(\textbf{x},\textbf{x}^{r}),d_{H}(\textbf{y},\textbf{y}^{r}),d_{H}(\textbf{x},\textbf{x}^{rc}),d_{H}(\textbf{y},\textbf{y}^{rc})\right\}, & \text{ if } n \text{ is odd. } \end{array}\right. $$

Theorem 14

For an even positive integer n and a positive integer , consider \(\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }\) such that \(d_{H}(\textbf {x},\textbf {y}^{rc})\) = \(d_{H}(\textbf {x},\textbf {y}^{r})=\ell \). Then, the DNA codes constructed using the Non-Homopolymer map will satisfy the reverse constraint.

Proof

If \(d_{H}(\textbf {x},\textbf {y}^{rc})\) = \(d_{H}(\textbf {x},\textbf {y}^{r})=\ell \) then, from Theorem 13, \(d_{H}(\textbf {u},\textbf {v}^{r})\geq n\min \limits \left \{d_{H}(\textbf {x},\textbf {y}^{r}),\right . \left .d_{H}(\textbf {x},\textbf {y}^{rc})\right \}=n\ell \). But the length of the encoded DNA string is n so, dHn and therefore, \(d_{H}(\textbf {u},\textbf {v}^{r})\geq d_{H}\) for any DNA code constructed using the Non-Homopolymer map. □

Lemma 4

For positive integers n and , if the binary strings a and b of length n are encoded into DNA strings u and v using the Non-Homopolymer map then

$$ n-\lfloor d_{H}(\textbf{a},\textbf{b})/2\rfloor\geq \frac{1}{\ell}d_{H}(\textbf{u},\textbf{v})\geq\lceil d_{H}(\textbf{a},\textbf{b})/2\rceil. $$

Proof

For a positive integer n, if \(S\subseteq \{1,2,\ldots ,n,n+1\}\) is a set with even cardinality such that, for each sjS(j = 1,2,…,|S|− 1), sj < sj+ 1 then one can observe that:

$$ n-\frac{|S|}{2}\geq\sum\limits_{i=1}^{|S|/2} (s_{2i}-s_{2i-1})\geq\frac{|S|}{2}. $$

From the Definition 5, dH(a, b) ∈{|S|,|S|− 1}, and therefore, the result follows. □

Theorem 15

For positive integers n and , suppose a binary code exists with the minimum Hamming distance dH and the minimum distance dNHo. Then, (n −⌊dH/2⌋) ≥ dNHodH/2⌉.

Proof

For any code \(\mathcal {C}\) with the minimum Hamming distance dH, if \(\textbf {a},\textbf {b}\in \mathcal {C}\) then ⌈dH(a, b)/2⌉≥⌈dH/2⌉ and n −⌊dH/2⌋≥ n −⌊dH(a, b)/2⌋≥⌈dH/2⌉. The proof follows from Lemma 4 and Theorem 9. □

In the following theorem, a constraint on binary string is imposed in such a way that the encoded DNA string of length n will be ⌊n/2⌋ conflict free.

Theorem 16

For positive integers n and , and any positive even integer 2μ ∈{1,2,…,⌊n/2⌋}, consider a binary code with codeword length n, such that for each codeword (a1 a2an),

$$ 2\mu<\sum\limits_{i=\lambda+1}^{\lambda+2\mu} (a_{i}a_{2\mu+i}+\bar{a}_{i}\bar{a}_{2\mu+i}), $$

where λ = 1,2,…,n − 2μ. Then there exists a ⌊n/2⌋ conflict free DNA code with codeword length n.

Proof

The proof follows from Definition 3 and Theorem 4. □

Lemma 5

Consider two seed blocks (xi,yi)i = 1,2 such that d(x1,x2) = \(d(\textbf {x}_{1},{\textbf {x}^{c}_{2}})\) = d(y1,y2) = \(d(\textbf {y}_{1},{\textbf {y}^{c}_{2}})\) = , and a binary code \(\mathcal {C}\) with the parameter (n, M, dNHo). The parameter of the DNA code \(\mathcal {C}_{DNA}\) = \(\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}\) will be (n,2M, dH) where, the DNA code \(\mathcal {C}_{DNA_{i}}\) is encoded from the binary code \(\mathcal {C}\) using Non-Homopolymer map with the seed block (xi,yi).

Proof

From Theorem 10, the parameters of the encoded DNA code \(\mathcal {C}_{DNA_{i}}\) is (n, M, dH) for i = 1,2. The codeword length of both the encoded DNA codes \(\mathcal {C}_{DNA_{1}}\) and \(\mathcal {C}_{DNA_{2}}\) are same and equal to n. Therefore, the codeword length of the DNA code \(\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}\) will also be n. The size of the DNA code \(\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}\) will be not be more than 2M. Let a = (a1 a2an) be a codeword in \(\mathcal {C}\). Let the codeword a is encoded into u = \((\textbf {u}_{1}\ \textbf {u}_{2}\ldots \textbf {u}_{n})\in \mathcal {C}_{DNA_{1}}\) and v = \((\textbf {v}_{1}\ \textbf {v}_{2}\ldots \textbf {v}_{n})\in \mathcal {C}_{DNA_{2}}\) using seed blocks (x1,y1) and (x2,y2). The encoded DNA strings \(\textbf {u}_{1}\in \{\textbf {x}_{1},{\textbf {x}_{1}^{c}}\}\), \(\textbf {v}_{1}\in \{\textbf {x}_{2},{\textbf {x}_{2}^{c}}\}\) where, x1x2 and \(\textbf {x}_{2}{\neq \textbf {x}_{2}^{c}}\). Therefore, uv for any case and hence, the encoded DNA strings are not identical. It follows that the size of the DNA code \(\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}\) will be 2M. In addition, if d(x1,x2) = \(d(\textbf {x}_{1},{\textbf {x}^{c}_{2}})\) = d(y1,y2) = \(d(\textbf {y}_{1},{\textbf {y}^{c}_{2}})\) = then dH(u, v) = n (≤ dH) for any \(\textbf {u}\in \mathcal {C}_{DNA_{1}}\) and \(\mathcal {C}_{DNA_{2}}\). Therefore, the minimum Hamming distance of the DNA code \(\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}\) will be dH. □

One can generalize Lemma 5 in the following proposition.

Proposition 7

Consider r seed blocks (xi,yi)i = 1,2,…,r such that d(xi,xj) = \(d(\textbf {x}_{i},{\textbf {x}^{c}_{j}})\) = d(yi,yj) = \(d(\textbf {y}_{i},{\textbf {y}^{c}_{j}})\) = (1 ≤ i < jr), and a binary code \(\mathcal {C}\) with the parameter (n, M, dNHo). The parameter of the DNA code \(\mathcal {C}_{DNA}\) = \(\bigcup _{i=1}^{r}\mathcal {C}_{DNA_{i}}\) will be (n, rM, dH) where, the DNA code \(\mathcal {C}_{DNA_{i}}\) is encoded from the binary code \(\mathcal {C}\) using the Non-Homopolymer map with the seed block (xi,yi).

One can observe the following remark from reverse, complement and reverse-complement DNA strings.

Remark 10

Using the Non-Homopolymer map, from some seed blocks (x, y), an conflict free DNA code is obtained with reverse, reverse-complement and GC-content constraints such that each DNA string is free from secondary structures if and only if an conflict free DNA code can also be obtained from seed blocks (xr,yr) with reverse, reverse-complement and GC-content constraints such that each DNA string is free from secondary structures. The statement is also true for DNA codes generated from seed blocks (xc,yc), (xrc,yrc), (y, x), (yr,xr), (yc,xc) and (yrc,xrc) using the Non-Homopolymer map.

Lemma 6

Consider, for i = 1,2, two seed blocks (xi,yi) such that d(x1,x2) = \(d(\textbf {x}_{1},{\textbf {x}^{c}_{2}})\) = d(y1,y2) = \(d(\textbf {y}_{1},{\textbf {y}^{c}_{2}})\) = , and two binary codes \(\mathcal {C}_{i}\) with the parameter \((n,M_{i},d_{NHo_{i}})\). The parameter of the DNA code \(\mathcal {C}_{DNA}\) = \(\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}\) will be (n, M, dH) where, the DNA code \(\mathcal {C}_{DNA_{i}}\) is encoded from the binary code \(\mathcal {C}_{i}\) using the Non-Homopolymer map with the seed block (xi,yi), M = M1 + M2 and dH = \(\min \limits \{d_{NHo_{i}}:i=1,2\}\).

Proof

The proof is similar to that of the proof of Lemma 5 and it follows from the definition of the minimum Hamming distance of a code. □

Proposition 8

For i = 1,2,…,r, consider seed blocks (xi,yi) such that d(xi,xj) = \(d(\textbf {x}_{i},{\textbf {x}^{c}_{j}})\) = d(yi,yj) = \(d(\textbf {y}_{i},{\textbf {y}^{c}_{j}})\) = (1 ≤ i < jr), and binary codes \(\mathcal {C}_{i}\) with the parameter \((n,M_{i},d_{NHo_{i}})\). The parameter of the DNA code \(\mathcal {C}_{DNA}\) = \(\bigcup _{i=1}^{r}\mathcal {C}_{DNA_{i}}\) will be (n, M, dH) where, the DNA code \(\mathcal {C}_{DNA_{i}}\) is encoded from the binary code \(\mathcal {C}\) using Non-Homopolymer map with the seed block (xi,yi), M = \({\sum }_{i=1}^{r}M_{i}\) and dH = \(\min \limits \{d_{NHo_{i}}:i=1,2,\ldots ,r\}\).

Proposition 9

As considered in Proposition 8, for a DNA code \(\mathcal {C}_{DNA}\) = \(\bigcup _{i=1}^{r}\mathcal {C}_{DNA_{i}}\) with the parameter (\(n\ell ,{\sum }_{i=1}^{r}M_{i},d_{H}\)), the code rate is:

$$ \mathcal{R} = \frac{log_{4}\left( {\sum}_{i=1}^{r}M_{i}\right)}{n\ell}\leq\frac{r}{2\ell}. $$

The bound on code rate in the Proposition 9 is obtained by taking Mi = 2n for each i = 1,2,…,r.

Remark 11

For any conflict free DNA code \(\mathcal {C}_{DNA}\) with the parameters (n,4n,dH) and encoded from \(\mathcal {C}={\mathbb { Z}_{2}^{n}}\) using the Non-Homopolymer map, the code rate is 1/2. Note that for = 2, the code rate is 1/4. In addition, the DNA code satisfies all the constraints reverse, reverse-complement, GC-content. All the DNA codewords are also free from secondary like structures and homopolymers.

Lemma 7

For dH < n/2, there exists DNA code \(\mathcal {C}_{DNA}(n\ell ,M,d_{H})\) encoded from a binary code \(\mathcal {C}(n,M,d_{NHo})\) such that

$$ M\geq \frac{2^{n}}{{\sum}_{i=0}^{\left\lceil\frac{2}{\ell}d_{H}\right\rceil-1}\binom{n}{i}}. $$
(1)

Proof

From Lemma 4, one can observe that \(d_{H}(\textbf {a},\textbf {b})\leq \left \lceil \frac {2}{\ell }d_{H}(\textbf {u},\textbf {v})\right \rceil \) for any binary codewords a and b in \(\mathcal {C}\) which can be encoded into DNA codewords u and v in \(\mathcal {C}_{DNA}\). The bounds are true for any distinct a and b in \(\mathcal {C}\) therefore, \(d_{H}^{\prime }\leq \left \lceil \frac {2}{\ell }d_{H}\right \rceil \) where, \(d_{H}^{\prime }\) is the minimum Hamming distance for the binary code \(\mathcal {C}\). From the Gilbert-Varshamov bound for binary code, one can obtain the bounds in (1) for dH < n/2. □

Now, one can easily obtain the following proposition.

Proposition 10

If there exists an conflict free DNA code \(\mathcal {C}_{DNA}\) with all the constraints reverse, reverse-complement, GC-content such that each DNA codeword is free from secondary like structures then

$$ A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell)\geq \frac{2^{n}}{{\sum}_{i=0}^{\left\lceil\frac{2}{\ell}d_{H}\right\rceil-1}\binom{n}{i}}, \text{ for } d_{H}< n\ell/2. $$

Lemma 8

For dHn/2, there exists DNA code \(\mathcal {C}_{DNA}(n\ell ,M,d_{H})\) encoded from a binary code \(\mathcal {C}(n,M,d_{NHo})\) such that

$$ M\geq \frac{2^{n}}{{\sum}_{i=0}^{2n-\left\lceil\frac{2}{\ell}d_{H}\right\rceil}\binom{n}{i}}. $$
(2)

Proof

From Lemma 4, one can observe that \(d_{H}(\textbf {a},\textbf {b})\leq 2n-\left \lceil \frac {2}{\ell }d_{H}(\textbf {u},\textbf {v})\right \rceil +1\) for any binary codewords a and b in \(\mathcal {C}\) which can be encoded into DNA codewords u and v in \(\mathcal {C}_{DNA}\). The bounds are true for any distinct a and b in \(\mathcal {C}\) therefore, \(d_{H}^{\prime }\leq 2n-\left \lceil \frac {2}{\ell }d_{H}\right \rceil +1\) where, \(d_{H}^{\prime }\) is the minimum Hamming distance for the binary code \(\mathcal {C}\). From the Gilbert-Varshamov bound for binary code, one can obtain the bound in (2) for dHn/2. □

Now, one can easily obtain the following proposition.

Proposition 11

If there exists an conflict free DNA code \(\mathcal {C}_{DNA}\) with all the constraints reverse, reverse-complement, GC-content such that each DNA codeword is free from secondary like structures then

$$ A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell)\geq \frac{2^{n}}{{\sum}_{i=0}^{2n-\left\lceil\frac{2}{\ell}d_{H}\right\rceil}\binom{n}{i}}, \text{ for } d_{H}\geq n\ell/2. $$

In Fig. 2, the lower bound on \(A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell )\) is plotted for n = 16 and 20.

Fig. 2
figure 2

For n = 16 and n = 20, graph between lower bound of \(A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell )\) and dH/ is plotted where, the horizontal axis represents dH/ and vertical axis represents lower bound of \(A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell )\) as calculated in Preposition 10 and Proposition 11

5 Conflict free DNA codes

For various binary codes, the parameters of encoded DNA codes are listed in the Table 4. For x = ATA and y = CGC, the 5 conflict free DNA code (21,16,6) with reverse, and GC-content constraints are obtained from [7,4,3] binary Hamming code is given in Table 5. The parameter of the DNA code is (21,16,6) and each DNA string is free from hairpin like secondary structures of stem length 5. For any positive integer = 2,3,4,5, DNA pairs \((\textbf {x},\textbf {y})\in {\Sigma }_{DNA}^{\ell }\) are listed in Tables 7 and 8 with various parameters. For any pair (x, y) given in Table 8, one can get conflict free DNA codes from any binary code using the Non-Homopolymer map, where the DNA code satisfies Hamming, GC-content, reverse and reverse-complement constraints. Similarly, for any pair (x, y) given in Table 7, one can get DNA codes from any binary code using the Non-Homopolymer map where, the DNA code satisfies Hamming and GC-content constraints where, each DNA codeword is free from reverse-complement sub-strings (Tables 6, 7 and 8).

Table 4 Parameters for conflict free DNA codes encoded from binary codes
Table 5 List of all encoded DNA strings for (x, y) = (ATA, CGC) from [7,4,3] binary Hamming code. The code rate of the DNA code is 0.1904
Table 6 Codewords (each of length n) for ⌊n/2⌋ conflict free DNA codes with Hamming, reverse, reverse-complement and GC-content constraints. All the codes have code size improved from existing literature
Table 7 Pairs (x, y) ∈Σ such that (i) GC-content sum of x and y is , and (ii) each DNA string in the set {(x y x y),(y x y x) : x∈{x, xc} and y∈{y, yc}} is free from secondary structures
Table 8 Pairs (x, y) ∈Σ such that (i) dH(x, y) = \(d_{H}(\textbf {x},\textbf {y}^{rc})\) = \(d_{H}(\textbf {x},\textbf {y}^{r})\) = , (ii) GC-content sum of x and y is , and (iii) each DNA string in the set {(x y x y),(y x y x) : x∈{x, xc} and y∈{y, yc}} is conflict free

5.1 Reed-Muller code:

The binary Reed-Muller codes were introduced by Reed and Muller in 1954 [25]. For two non-negative integers m and r (rm), the rth order binary Reed-Muller code \(\mathcal {R}(r,m)\) is a linear code of length 2m, code size \(2^{{\sum }_{i=1}^{r}\binom {m}{i}}\) and the minimum Hamming distance dH = 2mr. The generator matrix of \(\mathcal {R}(r,m)\) is

$$ G_{r,m} = \left( \begin{array}{cc} G_{r,m-1} & G_{r,m-1} \\ \textbf{0} & G_{r-1,m-1} \end{array} \right), \text{ for }1\leq r\leq m-1, $$

where

$$ G_{m,m} = \left( \begin{array}{c} G_{m-1,m} \\ 1\ 1{\ldots} 1\ 0 \end{array} \right), $$

G0,m is the all one matrix of size 1 × 2m, and 0 is a zero matrix with \(2^{{\sum }_{i=1}^{r}\binom {m}{i}}\) - \(2^{{\sum }_{i=1}^{r-1}\binom {m-1}{i}}\) rows and 2m− 1 columns.

In the following theorem, the minimum distance (Definition 5) is obtained for Reed-Muller codes.

Theorem 17

For positive integers m, r (0 ≤ rm) and , there exists a DNA code \(\mathcal {C}_{DNA}(\ell 2^{m},2^{{\sum }_{i=1}^{r}\binom {m}{r}},\ell 2^{m-r-1})\) for the binary Reed-Muller code \(\mathcal {R}(r,m)\).

Proof

For positive integers m (> 1) and (≤ 2m− 1), consider a binary Reed-Muller code \(\mathcal {R}(r,m)\) and the corresponding encoded DNA code \(f(\mathcal {R}(r,m))\) for some pair \((\textbf {x},\textbf {y})\in ({\Sigma }_{DNA}^{\ell })^{2}\). From Theorem 9, the codeword length and code size for \(f(\mathcal {R}(r,m))\) will be 2m and \(2^{{\sum }_{i=1}^{r}\binom {m}{r}}\). From Theorem 15, the minimum distance dNHodH/2⌉ = 2mr− 1. For a positive integer t, we denote 0t = (0 0…0) and 1t = (1 1…1), each of length t. Then the binary strings \(\textbf {0}_{2^{m}}\) and \((\textbf {0}_{2^{m}-2^{m-r}}\ \textbf {1}_{2^{m-r}})\) will be in \(\mathcal {R}(r,m)\). Therefore \(d_{NHo}\leq d_{NHo}(\textbf {0}_{2^{m}},(\textbf {0}_{2^{m}-2^{m-r}}\ \textbf {1}_{2^{m-r}}))\) = 2mr− 1. Hence, dNHo = 2mr− 1 for the binary \(\mathcal {R}(r,m)\). So, from Theorem 9, the minimum Hamming distance for \(f(\mathcal {R}(r,m))\) will be dH = 2mr− 1. □

Note that by choosing appropriate seed blocks one can obtain DNA codes with various constraints. For example, if one chooses DNA seed blocks from Table 8 and constructs the Reed-Muller type code (Theorem 17) then the DNA code will satisfy reverse, reverse-complement and GC-content constraints. In addition, each DNA codeword of the Reed-Muller code is conflict free. Similarly, if one chooses DNA seed blocks from Table 7 and constructs the Reed-Muller type code (Theorem 17) then the encoded DNA code is conflict free with GC-content constraint and each codeword is free from secondary structures.

6 Conclusions

We have scratched the surface of an interesting area of DNA codes that can be used in building efficient DNA data storage models.This article concentrates on two different approaches, computational and algebraic; to design DNA codes satisfying different constraints effective for practical usage. Computational approach improves the lower bounds on the size of the DNA codes in many cases from previous study under a new constraint (generalization of Homopolymers constraint) apart from the general constraints considered in the same aspect. On the contrary, the algebraic approach presents a new isometry between binary codes and DNA codes. Utilizing the recursive isometry, new classes of DNA codes has been constructed that are efficient. It is noteworthy to mention that the new codes are also free from hairpin like secondary structures. It would be an interesting future task to find bounds on DNA codes with the new constraint in mind and constructing optimal codes meeting those bounds. Extending the isometry from binary to q-ary case will also be an interesting future task.