On conflict free DNA codes

Benerjee, Krishna Gopal; Deb, Sourav; Gupta, Manish K.

doi:10.1007/s12095-020-00459-7

On conflict free DNA codes

Published: 13 October 2020

Volume 13, pages 143–171, (2021)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Cryptography and Communications Aims and scope Submit manuscript

On conflict free DNA codes

Download PDF

736 Accesses
15 Citations
3 Altmetric
Explore all metrics

Abstract

DNA storage has emerged as an important area of research. The reliability of a DNA storage system depends on designing those DNA strings (called DNA codes) that are sufficiently dissimilar. In this work, we introduce DNA codes that satisfy the newly introduced constraint, a generalization of the non-homopolymers constraint. In particular, each codeword of the DNA code has the specific property that any two consecutive sub-strings of the DNA codeword will not be the same. This is apart from the usual constraints such as Hamming, reverse, reverse-complement and GC-content. We believe that the new constraints proposed in this paper will provide significant achievements in reducing the errors, during reading and writing data into the synthetic DNA strings. We also present a construction (based on a variant of stochastic local search algorithm) to determine the size of the DNA codes with a constraint that each DNA codeword is free from secondary structures in addition to the usual constraint. This further improves the lower bounds from the existing literature, in some specific cases. A recursive isometric map between binary vectors and DNA strings is also proposed. By applying this map over the well known binary codes, we obtain classes of DNA codes with all of the above constraints, including the property that the constructed DNA codewords are free from the hairpin like secondary structures.

Construction of DNA codes with multiple constrained properties

Article 14 May 2024

Reconstruction algorithms for DNA-storage systems

Article Open access 23 January 2024

A Strategy-based Optimization Algorithm to Design Codes for DNA Data Storage System

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

The exponentially increasing demand in data storage forces to look into every possible option and DNA (DeoxyriboNucleic Acid) data storage has come out to be one of the most promising natural data storage for this purpose [11]. After the first striking implementation of large-scale archival DNA-based storage architecture by Church et al. [5] in 2012, followed by encoding scheme to DNA proposed by Goldman et al. [8] in 2013, researchers have taken great interests on the construction of DNA-based information storage systems [18, 20] because of its high storage density and longevity [5, 8, 36]. DNA consists of four types of bases or nucleotides (nt s) called adenine (A), cytosine (C), guanine (G) and thymine (T) where, the Watson-Cricks complementary bases for A and C are T and G respectively and vice versa. In order to store data into synthetic DNA, we need to encode data into strings on quaternary alphabet {A, C, G, T}. The set of encoded DNA strings (also called as DNA codewords) on the quaternary alphabet is called DNA code. For a DNA string, the complement is a DNA string obtained by replacing each nucleotide by its complement. Similarly, for a DNA string, the reverse DNA string is a DNA string in the reverse order, and the complement of the reverse DNA string is called reverse-complement DNA string. The encoded strings are synthesized using DNA synthesizer for the purpose of writing into DNA strings and the synthesized DNA strings are stored in the appropriate environment. To extract the source data, the stored DNA strings are read using DNA sequencing.

Errors occur, particularly, during synthesis and sequencing the DNA strings, which can be reduced by choosing good encoding scheme for the DNA strings. Therefore, it is important to study the source of errors. Generally, insertion or deletion errors occur frequently in a DNA string with consecutive repetitions of a specific nucleotide (e.g. ACGGGGAT) or of a block of nucleotides (e.g. AGATATATGC) up to certain length [12, 21, 26, 32]. In addition, these DNA strings get misaligned more frequently, during DNA sequencing [26]. So we prefer DNA codes that exclude those codewords which contain consecutive repetition(s) of a specific nucleotide or a block of nucleotides. In this article, such DNA strings are defined as conflict free DNA strings. In literature, DNA codes without Homopolymers (DNA string with consecutive repetition of a nucleotide) [1, 2, 6, 10, 31] and without consecutive repeats of blocks [12, 16] are studied. As an extension, in this work, the considered conflict free DNA strings are not only free from Homopolymers but are also free from consecutive repetition of blocks of nucleotides.

In a single stranded DNA, the existence of two sub-strings where, the reverse-complement of one sub-string is the other one, results in forming an antiparallel double stranded hairpin like structure (also called hairpin loop or Stem-loop) by folding back upon itself [4, 13, 24, 27, 38].

An example of such hairpin like structure is illustrated in Fig. 1. For DNA sequencing, it is preferred to avoid such secondary structures [13]. In this work, all the codewords of the constructed conflict free DNA codes are free from hairpin like structures with stem length more than 2.

A DNA string can be read using specific hybridization between the DNA string and its complement DNA string [23]. If the DNA strings in a code are not different enough among themselves then nonspecific hybridization will occur and it will be a prominent cause of error. Therefore, a set of DNA codewords is preferred in which DNA strings are sufficiently different among themselves. Hamming distance between two strings of same length over the same alphabets is the number of positions in which the symbols in the strings are different. So, construction of DNA code with Hamming constraint (ensures the difference among DNA codewords), reverse constraint (ensures the difference between DNA codewords and their reverse DNA strings), and reverse-complement constraint (ensures the difference between DNA codewords and their reverse-complement DNA strings) is preferred. In literature, DNA codes with reverse and reverse-complement constraints are constructed from finite fields and finite rings in [9, 17, 29, 33].

The thermal stability of a DNA string depends on the GC-content (the total number of G^′s and C^′s) in the DNA string [35]. On the other hand, the high GC-content leads to the insertion and deletion error during polymerase chain reaction (PCR). Therefore, such DNA codes are preferred in which each DNA codeword has the same GC-content and equal to almost half of its length and the constraint for the DNA codes is called GC-content constraint. In [14, 15, 30, 33], DNA codes with balanced GC content are studied. DNA codes with reverse, reverse-complement and GC-content constraints are studied in [7, 30]. In [3], a revised lower bound on size of DNA codes with GC-content and reverse-complement constraints are obtained. In fact, DNA codes with balanced GC content and without Homopolymers are also studied in [6, 10, 31].

In this paper, we have studied the DNA codes with multiple properties such as each DNA string is free from consecutive reparation of DNA blocks up to a certain length. Any DNA codeword of the DNA code sufficiently differs from other DNA codewords, reverse of DNA codewords and reverse-complement DNA codewords. In addition, each codeword of the DNA code is also free from hairpin like secondary structures. These properties significantly help to reduce bit-flip, insertion, and deletion errors simultaneously during reading and writing DNA strings. The lower bounds on the maximum size of DNA code with all those properties are also obtained in this paper. To the best of author’s knowledge, in literature, an algebraic solution for DNA codes with all the constraints is not studied yet. In this work, an algebraic structure for the family of DNA codes is proposed where the constructed DNA codes meet all the constraints such as Hamming, reverse, reverse-complement, and GC-content constraints. Apart from that, all the DNA codewords do not have any consecutive identical sub-string(s) up to a certain length ℓ (generalization of non-homopolymer constraint [12, 21, 26, 32]. These are known as non-homopolymer constraint of order ℓ). In addition, these codewords are free from hairpin like secondary structures. In this paper, an algorithm is given which calculates the DNA code with the property that each DNA codeword does not have any consecutive repeated sub-string of any length. In addition, DNA codes with Hamming constraint, reverse constraint, reverse-complement constraint, and GC-content constraint are obtained. For a DNA code with all the constraints, the obtained code size is improved for some specific parameters as given in [19, Table I]. Further, family of DNA codes have been obtained with Hamming, reverse, reverse-complement, and GC-content constraints where, each DNA codeword is free from hairpin like secondary structure and non-homopolymer constraint.

In Section 2, preliminary for DNA codes is discussed. Complete conflict free DNA codes with all the constraints are studied in Section 3. A recursive mapping from binary strings to DNA strings is discussed in Section 4, which is also an isometry between non-homopolymer distance over binary strings and Hamming distance over DNA strings. The conditions on binary strings are obtained, which ensure the constraints on encoded DNA strings in the same section. In Section 5, a family of DNA codes are obtained from binary Reed-Muller codes. Section 6 concludes the work with general remarks.

2 Preliminary

A code $\mathcal {C}$ with parameters (n, M, d) over an alphabet Σ of size q is a set of M distinct strings (also called codewords) with length n such that the distance between any two distinct strings is at least d. Codes over {0,1} are called binary codes. Similarly, codes over an alphabet of size 4 are known as Quaternary codes. In particular, codes over alphabet Σ_DNA = {A, C, G, T} are called DNA codes (denoted by $\mathcal {C}_{DNA}$) respectively. For various applications, codes with various distances (such as Gau distance [17]) are studied in literature. In this work, DNA codes with Hamming distance and binary codes with a newly defined distance are studied. For any strings x and y in Σⁿ, the Hamming distance d_H(x, y) between x and y is the total number of positions at which they differ. For a code $\mathcal {C}\subset {\Sigma }^{n}$, the minimum Hamming distance is d_H = $\min \limits \{d_{H}(\textbf {x},\textbf {y}):\textbf {x}\neq \textbf {y}\text { and }\textbf {x},\textbf {y}\in \mathcal {C}\}$. Note that one can find fields or rings over the alphabet Σ. For a ring defined over the alphabet Σ, a code on Σ is called linear if the code is sub-module over the ring. For a field defined over the alphabet Σ, a code on Σ is called linear if the code is row span of a matrix over the field. For any linear code, the matrix is called the generator matrix if rows of that matrix are linearly independent.

For any DNA string x = $(x_{1}\ x_{2}{\ldots } x_{n})\in {\Sigma }_{DNA}^{n}$, the reverse, complement and reverse-complement DNA strings of x are x^r = (x_n x_n− 1…x₁), x^c = $({x_{1}^{c}}\ {x_{2}^{c}}{\ldots } {x_{n}^{c}})$, and x^rc = $({x_{n}^{c}}\ x_{n-1}^{c}{\ldots } {x_{1}^{c}})$ respectively where, A^c = T, C^c = G, G^c = C and T^c = A. As defined in [23], for any DNA code $\mathcal {C}_{DNA}$ with parameter (n, M, d_H), the various constraints are defined as follows:

Hamming constraint: The Hamming distance d_H(x, y) ≥ d_H for any $\textbf {x},\textbf {y}\in \mathcal {C}_{DNA}$ and x≠y.
Reverse constraint: The Hamming distance $d_{H}(\textbf {x},\textbf {y}^{r})\geq d_{H}$ for any $\textbf {x},\textbf {y}\in \mathcal {C}_{DNA}$ and x≠y^r.
Reverse-complement constraint: The Hamming distance $d_{H}(\textbf {x},\textbf {y}^{rc})\geq d_{H}$ for any $\textbf {x},\textbf {y}\in \mathcal {C}_{DNA}$ and x≠y^rc.
GC-content constraint: If the total number of G’s and C’s in each codeword is same and equal to g then the code satisfies g-GC-content constraint. For a specific case g = ⌊n/2⌋, the ⌊n/2⌋-GC-content constraint is simply called GC-content constraint.

Consider a DNA code $\mathcal {C}_{DNA}$ with the minimum Hamming distance d_H. For each $\textbf {x}\in \mathcal {C}_{DNA}$, if $\textbf {x}^{r}\in \mathcal {C}_{DNA}$ then, from the distance property of code, $d_{H}(\textbf {y},\textbf {x}^{r})\geq d_{H}$ for each $\textbf {y}\in \mathcal {C}_{DNA}$ such that x^r≠y. Therefore, the code satisfies the reverse constraint. Similarly, for each $\textbf {x}\in \mathcal {C}_{DNA}$, if $\textbf {x}^{rc}\in \mathcal {C}_{DNA}$ then from the distance property of code again, $d_{H}(\textbf {y},\textbf {x}^{rc})\geq d_{H}$ for each y(≠x^rc) in $\mathcal {C}_{DNA}$, and hence the code satisfies the reverse-complement constraint. In consequence, researchers are curious in the construction of DNA codes which are closed under reverse and reverse-complement DNA strings [17]. Thus, motivated by this, we construct a set of DNA strings for a given length such that those DNA strings satisfy multiple constraints. In the following lemma, the distinct DNA strings of fix length with some additional constraints are enumerated.

Lemma 1

For a given length n,

1.
there exists 4^⌈n/2⌉ number of distinct DNA strings $\textbf {x}\in {\Sigma }_{DNA}^{n}$ such that x = x^r,
2.
for an even n, there exists 4^n/2 number of DNA strings $\textbf {x}\in {\Sigma }_{DNA}^{n}$ such that x = x^rc,
3.
for a positive integer m (≤ n), there exists $\binom {n}{m}2^{n}$ distinct DNA strings $\textbf {x}\in {\Sigma }_{DNA}^{n}$ each with GC-content m,
4.
for an even positive integer m (≤ n), there exists $\binom {n/2}{m/2}2^{n/2}$ distinct DNA strings $\textbf {x}\in {\Sigma }_{DNA}^{n}$ each with GC-content m and x = x^rc where, n is even, and
5.
for a positive integer m (≤ n), there exists η distinct DNA strings $\textbf {x}\in {\Sigma }_{DNA}^{n}$ each with m - GC-content and x = x^r where,
$$ \eta = \left\{ \begin{array}{ll} 0 & \text{if }n\text{ is even and }m\text{ is odd}, \\ \binom{\lfloor n/2\rfloor}{\lfloor m/2\rfloor}2^{\lceil n/2\rceil} & \text{otherwise.} \end{array}\right. $$

Proof

Consider x = $(x_{1}\ x_{2}{\ldots } x_{n})\in {\Sigma }_{DNA}^{n}$.

1)
If x = x^r then, x_i = x_n−i+ 1 for i = 1,2,…,⌈n/2⌉. Therefore, there exists 4^⌈n/2⌉ number of distinct DNA strings $\textbf {x}\in {\Sigma }_{DNA}^{n}$ such that x = x^r.
2)
If x = x^rc then, x_i = $x_{n-i+1}^{c}$ for i = 1,2,…,⌊n/2⌋. Therefore, there exists 4^n/2 number of distinct DNA strings $\textbf {x}\in {\Sigma }_{DNA}^{n}$ such that x = x^rc. Note that, for any positive odd n, any $\textbf {x}\in {\Sigma }_{DNA}^{n}$ can not equal x^rc.
3)
There are $\binom {n}{m}2^{m}$ ways to fill m positions out of n positions by a symbol from {C, G}. If the remaining n - m positions are filled by a symbol from {A, T} then there are $\binom {n}{m}2^{n}$ distinct DNA strings each with GC-content m.

Proofs of remaining results are similar. □

For theoretical analysis, we define complement constraint for a DNA code similar to reverse and reverse-complement constraints. A DNA code $\mathcal {C}_{DNA}$ satisfies the complement constraint, if, for any x and y in $\mathcal {C}_{DNA}$ such that x≠y^c, $d_{H}(\textbf {x},\textbf {y}^{c})\geq d_{H}$.

Remark 1

Consider a DNA code $\mathcal {C}_{DNA}$ with the minimum Hamming distance d_H. For any codeword $\textbf {x}\in \mathcal {C}_{DNA}$, if x^c and x^r are also in the DNA code $\mathcal {C}_{DNA}$ then, from the distance property of code, the DNA code satisfies reverse, complement, and reverse-complement constraints.

Remark 2

Consider a DNA code $\mathcal {C}_{DNA}$ with the minimum Hamming distance d_H such that $\textbf {x}^{c}\in \mathcal {C}_{DNA}$ for each codeword $\textbf {x}\in \mathcal {C}_{DNA}$. For any codeword $\textbf {x},\textbf {y}\in \mathcal {C}_{DNA}$, if $d_{H}(\textbf {x},\textbf {y}^{r})\geq d_{H}$ then, from the reverse and reverse-complement properties of DNA, the DNA code satisfies reverse, complement, and reverse-complement constraints.

Remark 3

Consider a DNA code $\mathcal {C}_{DNA}$ with the minimum Hamming distance d_H such that $\textbf {x}^{r}\in \mathcal {C}_{DNA}$ for each codeword $\textbf {x}\in \mathcal {C}_{DNA}$. For any codeword $\textbf {x},\textbf {y}\in \mathcal {C}_{DNA}$, if $d_{H}(\textbf {x},\textbf {y}^{c})\geq d_{H}$ then, from the reverse and reverse-complement properties of DNA, the DNA code satisfies reverse, complement, and reverse-complement constraints.

For a DNA code with some properties, the following lemma ensures the reverse-complement constraint.

Lemma 2

For a DNA code of parameter (n, M, d_H) with reverse and complement constraints, if d_H ≤ n/2 then the DNA code will satisfy the reverse-complement constraint.

Proof

For any $\textbf {x},\textbf {y}\in \mathcal {C}$, $d_{H}(\textbf {y}^{rc},\textbf {x}^{c})$ = $d_{H}(\textbf {y}^{r},\textbf {x})$. So, $d_{H}(\textbf {x},\textbf {x}^{c}) \leq d_{H}(\textbf {x},\textbf {y}^{rc})+d_{H}(\textbf {y}^{rc},\textbf {x}^{c})$ = $d_{H}(\textbf {x},\textbf {y}^{rc})+d_{H}(\textbf {y}^{r},\textbf {x})$. This implies, $d_{H}(\textbf {x},\textbf {x}^{c}) \leq d_{H}(\textbf {x},\textbf {y}^{rc})+d_{H}$. But, $d_{H}(\textbf {x},\textbf {x}^{c})$ = n, so, $n-d_{H}\leq d_{H}(\textbf {x},\textbf {y}^{rc})$. Therefore, if d_H ≤ n/2 then $d_{H}(\textbf {x},\textbf {y}^{rc})\geq d_{H}$. □

In a single stranded DNA, if there exist two sub-strings such that one sub-string is the reverse-complement of another sub-string then the DNA strand folds back and attaches the both sub-strings to each other and forms hairpin like secondary structures with stems and loops of certain length [4, 13, 24, 27]. The stem size of more than 2 bases long reasonably approximates the hairpin like structures. The single stranded DNA ATACGCGAATGCGTGC, considered in Fig. 1, contains the reverse-complementary sub-strings ACGC and TGCG (see the bold sub-strings). The sub-strings are attached to each other and forms a stem of length 4 base pairs, and a loop of size 4 bases long. Therefore, in this work, DNA strings are considered to be free from hairpin like structures with stem length of more than 2 bases long.

Definition 1

A DNA string is called free from secondary structures (of stem length more than 2) if the DNA string does not contain any two sub-strings of length more than 2 such that one is the reverse-complement of the other.

Remark 4

Consider a DNA string of length n (> 5) which contains two sub-strings of length t (n ≥ t ≥ 4) such that one sub-string is the reverse-complement of the other. Then, the DNA string will also contain two sub-strings of length 3 such that one is the reverse-complement of the other. Therefore, it is sufficient for a DNA string, free from secondary structures of stem length more than 2, not to contain two sub-strings which are reverse-complement to each other.

Apart from the pairing between Watson-Crick complement base pairs, the secondary structures can also be formed with Wobble base pairs or cross hybridization between base pairs in a single strand DNA [4, 13, 24, 27, 38]. In this work, we have considered only those secondary structures which have stem length more than 2 and each bounded base pair in the stem is Watson-Crick complement pairs. We called such DNA strings free from secondary structures in this paper.

Remark 5

A DNA string is free from secondary structures if and only if the complement of the DNA string is also free from secondary structures. The same result holds for reverse and reverse-complement.

For example, the DNA string ACATCG is free from reverse-complement sub-strings of length 3 because the reverse-complement of ACA is TGT and TGT is not a sub-string of ACATCG. The reverse and reverse-complement DNA strings GCTACA and CGATGT are also free from reverse-complement sub-strings. Therefore, the following remark holds.

Remark 6

Any DNA string is free from secondary structures of stem length more than 2 if and only if it is free from secondary structures of stem length 3.

3 On complete conflict free DNA strings

It is evident that, the process of sequencing and synthesizing of DNA strings will be erroneous due to the existence of homopolymers and consecutive repetition of same sub-string(s) of certain length in DNA [12, 21, 26, 32]. So it is preferable to construct DNA codes in such a way that each DNA codeword will be free from Homopolymers or consecutive repetition(s) of same sub-string(s) of certain length. In general, a sequence which is free from the consecutive repetition of a block is known as square free sequence or repeat free sequence in literature [22]. Thus motivated by this, we define ℓ conflict free DNA strings and ℓ conflict free DNA code. Also the necessary and sufficient condition for a DNA string to be ℓ conflict free is given.

Definition 2

For positive integers n and ℓ (≤ n/2), a DNA string is called ℓ conflict free, if the DNA string of length n is free from consecutive repetition(s) of identical sub-string(s) of length t for each t = 1,2,…,ℓ.

For example, the DNA string ATCATCG is 2 conflict free but not 3 conflict free as the substring ATC has a consecutive repetition in the DNA string while any two consecutive sub-strings of same length (≤ 2) are not same, i.e., the DNA string does not contain any of the DNA sub-strings ATAT, TCTC, CACA, CGCG, AA, TT, CC and GG. Note that, 1 conflict free DNA strings are also known as DNA strings free from homopolymers in literature [1, 2, 6, 10, 31]. Also note, for any positive integer ℓ (≥ 2), an ℓ conflict free DNA string is also ℓ − 1 conflict free.

Remark 7

A DNA string is ℓ conflict free if and only if the complement DNA string is also ℓ conflict free. Note that the result also holds for reverse and reverse-complement DNA strings.

Definition 3

For positive integers n and ℓ(≤⌊n/2⌋), a DNA code with length n is called ℓ conflict free DNA code if each DNA codeword of the DNA code is ℓ conflict free.

For example, the DNA code $\left \{ACTG, TGAC, CAGT, GTCA\right \}$ is 2 conflict free DNA code, since each DNA codeword is 2 conflict free DNA string.

One can observe that the maximum length of a DNA sub-string will be ⌊n/2⌋, which can be repeated in a DNA string of length n. Hence, from Definition 2, a DNA string will be free from repetition(s) of DNA sub-string(s) of any length, if it is ⌊n/2⌋ conflict free. For a positive integer n, let S(n) be the set of all ⌊n/2⌋ conflict free DNA strings each of length n. For any z ∈ S(n) we will have z^r,z^c,z^rc ∈ S(n). Also note that, any DNA code $\mathcal {C}_{DNA}\subset S(n)$ is always a ⌊n/2⌋ conflict free DNA code.

Various computational approaches to construct DNA codes with some additional constraints are studied in literature [3, 28, 33, 34, 37]. In this work, ℓ conflict free DNA codes are constructed using stochastic local search in a seed set of ℓ conflict free DNA strings such that each DNA string has a fix GC-content and all the DNA strings are free from reverse-complement sub-strings. The computational construction for ℓ conflict free DNA codes is given as follows.

Construction 1

For given positive integers n, ℓ (≤⌊n/2⌋) and g (≤ n), let $\mathcal {S}\subset {\Sigma }_{DNA}^{n}$ be the set of all ℓ conflict free DNA strings such that GC-content of each DNA string is g and every DNA string is free from secondary structures. For a sub-set R of random cardinality and containing DNA strings which are randomly selected from $\mathcal {S}$, the DNA code $\mathcal {C}_{DNA}$ = R ∪{x^r,x^c,x^rc : x ∈ R} is an ℓ conflict free DNA code with Hamming, reverse and reverse-complement constraints where, each DNA codeword is free from reverse-complement sub-strings and GC-constant of each DNA codeword is fix g.

For example, consider n = 3, ℓ = 1 and g = 2. The seed set will be $\mathcal {S}$ = $\left \{ACG, AGC, CAC, CAG, CGA, CGT, CTC, CTG, GAC, GAG, GCA, GCT,\right .$ $\left . GTC, GTG, TCG, TGC \right \}$. Note that, for R = {CAC, CGT, ACG, TGC}, the 1 conflict free DNA code with GC-content, reverse and reverse-complement constraints is $\mathcal {C}_{DNA}$ = $\left \{CAC, CGT, ACG, TGC, GCA, GTG\right \}$ where, each DNA codeword is free from secondary structures. Also note, the DNA code size and the minimum Hamming distance of the code are M = 6 and d_H = 2.

For a given DNA string, the computational complexity to determine if the DNA string is ℓ conflict free (using Definition 2) is more than the computational complexity to determine whether the DNA string is free from secondary structures (using Remark 6), and, it is again more than the computational complexity to determine the GC-content of the DNA string. Therefore, in order to construct the seed set $\mathcal {S}$ for the Construction 1, one can reduce the computations by removing DNA strings in the following order.

1.
Remove all the DNA strings from the complete set ${\Sigma }_{DNA}^{n}$ which do not have g-GC-content.
2.
Remove all the DNA strings from the remaining set which are not free from secondary structures of stem length 3.
3.
Remove all the DNA strings from the remaining set which are not ℓ conflict free.

For a given length n and Hamming distance d_H, the maximum size of code is subject to interest among researches. Now, similar to [23], some notations for the maximum size of ℓ conflict free DNA codes are introduced here.

For a DNA code of codeword length n and the minimum distance d_H with GC-content and reverse-complement constraints, the maximum size of the DNA code is denoted by $A_{4}^{GC,rc}(n,d_{H})$ where, the terms GC and rc stand for GC-content and reverse-complement constraints.
The maximum size of an ℓ conflict free DNA code with length n and the minimum Hamming distance d_H is $A_{4}^{cf}(n,d_{H},\ell )$. So, $A_{4}^{cf,GC}(n,d_{H},1)$ denotes the maximum size of a DNA code of codeword length n and the minimum distance d_H with GC-content constraint such that each codeword is free from homopolymers where, the term cf stands for conflict free property of the DNA code.
For an ℓ conflict free DNA code of codeword length n and the minimum distance d_H with reverse constraint such that each DNA codeword is free from secondary structures (hairpin like structure) of length more than 2, the maximum size of the DNA code is denoted by $A_{4}^{cf,hf,r}(n,d_{H},\ell )$ where, the terms hf and r stand for property free from secondary structures and reverse constraint.
For an ℓ conflict free DNA code of codeword length n and the minimum distance d_H with reverse-complement constraint, $A_{4}^{cf,hf,rc}(n,d_{H},\ell )$ denotes the maximum size of the DNA code where, each DNA codeword is free from secondary structures of length more than 2.
For any ℓ conflict free DNA codes of codeword length n and the minimum distance d_H with GC-contant constraint, the maximum size of the DNA code is denoted by $A_{4}^{cf,hf,GC}(n,d_{H},\ell )$ where, each DNA codeword is free from secondary structures of length more than 2.
For any ℓ conflict free DNA codes of codeword length n and the minimum distance d_H with GC-contant, reverse, and reverse-complement constraints, the maximum size of the DNA code is denoted by $A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell )$ where, each DNA codeword is free from secondary structures of length more than 2.

The relations among sizes of ℓ conflict free DNA codes with additional constraints are given in following theorem.

Theorem 1

For a positive even integer n,

$$ A_{4}^{cf,hf,GC,r}(n,d_{H},\ell) = A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell), $$

and for a positive odd integer n,

$$ \begin{array}{ll} A_{4}^{cf,hf,GC,r}(n,d_{H}+1,\ell) & \leq A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell) \\ & \leq A_{4}^{cf,hf,GC,r}(n,d_{H}-1,\ell). \end{array} $$

Proof

Similar to the proof of [23, Theorem 4.1], consider an ℓ conflict free DNA code $\mathcal {C}_{DNA}$ with codeword length n and the minimum Hamming distance d_H. In addition, each codeword of the DNA code is free from secondary structures and has fixed GC-content. Now, for even n, the DNA code $\mathcal {C}_{DNA}=\{\textbf {a}_{i}\textbf {b}_{i}\}$ satisfies reverse constraint where, the sizes of a_i and b_i are same and a_ib_i is the concatenation of sequences a_i and b_i in same order. Then the DNA code $\mathcal {C}_{DNA}^{\prime }=\{\textbf {a}_{i}{\textbf {b}_{i}^{c}}\}$ satisfies reverse-complement constraint. Observe that the parameters of both the DNA codes $\mathcal {C}_{DNA}$ and $\mathcal {C}_{DNA}^{\prime }$ are same and therefore, $A_{4}^{cf,hf,GC,r}(n,d_{H},\ell ) \leq A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell )$. Similarly, one can prove the inequality $A_{4}^{cf,hf,GC,r}(n,d_{H},\ell ) \geq A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell )$. Thus, the result follows for even n. Again, for odd n, consider a ℓ conflict free DNA code $\mathcal {C}_{DNA}$ with parameter ($n,A_{4}^{cf,hf,GC,r}(n,d_{H}+1,\ell ),d_{H}+1$) with reverse and GC-content constraints. Also each codeword of the DNA code is free from secondary structures. For some x ∈Σ_DNA, $\mathcal {C}_{DNA}=\{\textbf {a}_{i}x\textbf {b}_{i}\}$ where, the sizes of a_i and b_i are same and a_ib_i is the concatenation of sequences a_i and b_i in same order. Now, consider $\mathcal {C}_{DNA}^{*}=\{\textbf {a}_{i}\textbf {b}_{i}\}$ of even length which is obtained by deleting the middle symbol x of each codeword of the DNA code $\mathcal {C}_{DNA}$. Therefore, from even case, $A_{4}^{cf,hf,GC,r}(n,d_{H}+1,\ell )\leq A_{4}^{cf,hf,GC,rc}(n-1,d_{H},\ell )$. The first inequality for odd n follows from $A_{4}^{cf,hf,GC,rc}(n-1,d_{H},\ell )\leq A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell )$. Similarly, one can prove the second inequality for odd n. □

Remark 8

Consider an ℓ conflict free DNA code with codeword length n and the minimum Hamming distance d_H.

For positive integers n, ℓ(< ⌊n/2⌋) and d_H (≤ n),
$$ \begin{array}{llllll} A_{4}^{cf}(n,d_{H},\ell) & \geq A_{4}^{cf}(n,d_{H},\ell+1) \\ A_{4}^{cf,hf}(n,d_{H},\ell) & \geq A_{4}^{cf,hf}(n,d_{H},\ell+1) \text{ and } \\ A_{4}^{cf}(n,d_{H},\ell) & \geq A_{4}^{cf,hf}(n,d_{H},\ell) \\ & \geq A_{4}^{cf,hf,GC}(n,d_{H},\ell) \\ & \geq A_{4}^{cf,hf,GC,r}(n,d_{H},\ell) \\ & \geq A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell). \end{array} $$
For positive integers n and d_H (≤ n),
$$ \begin{array}{ll} & A_{4}^{cf,GC}(n,d_{H},1) \geq A_{4}^{cf,hf,GC}(n,d_{H},\ell), \text{ and } \\ & A_{4}^{GC,rc}(n,d_{H}) \geq A_{4}^{cf,hf,GC,rc}(n,d_{H},\ell). \end{array} $$

For n = 1,2,…,10, g = ⌊n/2⌋ and ℓ = 1,2,…,⌊n/2⌋, the DNA codes with various parameters are calculated using Construction 1 and the maximum value of obtained code sizes are listed in Table 1 where, the DNA codes satisfy reverse, reverse-complement, GC-content constraints, and each codeword of the DNA code is ⌊n/2⌋ conflict free and also free from secondary structures. For given n and ℓ, first, the seed set $\mathcal {S}$ is obtained such that each DNA string in the set contains ⌊n/2⌋ GC-content. Further, for a random sub-set R of the set $\mathcal {S}$, the ⌊n/2⌋ conflict free DNA code $\mathcal {C}_{DNA}$ = $R\cup \{\textbf {x}^{r},\textbf {x}^{c},\textbf {x}^{rc}:\textbf {x}\in \mathcal {C}\}$ is obtained. From the seed set $\mathcal {S}$, the sub-set R is generated 10⁶ times and, for each sub-set R, the size $|\mathcal {C}_{DNA}|$ is enumerated. For given n and d_H, the maximum value among all the enumerated $|\mathcal {C}_{DNA}|$ (lower bound of $A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\lfloor n/2\rfloor )$) is listed in Table 1. Note that $A_{4}^{cf,hf,GC}(n,1,\lfloor n/2\rfloor )$ = $A_{4}^{cf,hf,GC,r,rc}(n,1,\lfloor n/2\rfloor )$ and $A_{4}^{cf,hf}(n,1,\lfloor n/2\rfloor )$ = $A_{4}^{cf,hf,r,rc}(n,1,\lfloor n/2\rfloor )$, for a positive integer n. Therefore, from Remark 7 and Definition 3, $A_{4}^{cf,GC,r,rc}(n,1,\lfloor n/2\rfloor )$ = $|\mathcal {S}|$ where, $|\mathcal {S}|$ is cardinality of seed set in Construction 1. Therefore, for n = 2,3,…,10, the listed lower bounds of $A_{4}^{cf,hf,GC,r,rc}(n,1,\lfloor n/2\rfloor )$ are tight. Similarly, from the definition of Hamming distance one can observe that,

$$ A_{4}^{cf,hf,GC,r,rc}(n,n,\lfloor n/2\rfloor) = \left\{ \begin{array}{ll} 2 & \text{ if } n \text{ is odd}, \\ 4 & \text{ if } n \text{ is even}. \end{array} \right. $$

Recall, for any odd length DNA string x, $d_{H}(\textbf {x},\textbf {x}^{r})< n$, and hence, the values $A_{4}^{cf,hf,GC,r,rc}(n,n, \lfloor n/2\rfloor )$ are also tight for d_H = n (= 2,3…,10) in Table 1. Values which are written in bold font in the table indicate equal or improved values from [19, Table I] or [3, Table II], and values written in a circle in the table are improved values from [19, Table I] or [3, Table II]. For the equal and improved values, the corresponding code parameters have been considered and the respective DNA codes and their codewords are listed in Table 6. Apart from the existing literature, all the specified constraints have been considered in Table 1. Some of those values are better when compared with the DNA code size with less constraints listed in Table 2.

Table 1 Lower bound of $A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\lfloor n/2\rfloor )$ (the maximum size of ⌊n/2⌋ conflict free DNA code with Hamming, reverse, reverse-complement and GC-content constraints, and each DNA codeword is free from secondary structures) are given. Values given in circles are improved from the values given in [19, Table I] or [3, Table II]. The bold values are those values which archive the values given in [19, Table I] or [3, Table II]

Full size table

Table 2 List of improved lower bound of $A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\lfloor n/2\rfloor )$ (the maximum size of ⌊n/2⌋ conflict free DNA code with Hamming, reverse, reverse-complement and GC-content constraints and each DNA string is free from secondary structures) from [19, Table I] and [3, table II]

Full size table

Note that the values in Table I, [19] are the lower bounds for the maximum size of DNA codes satisfying GC-content constraint and free from Homopolymers, and in Table II, [3] are lower bounds for the maximum size of DNA codes satisfying GC-content and reverse-complement constraints, on the other hand, values listed in Table 1 in this paper are the lower bounds for the maximum size of complete conflict free DNA codes satisfying Hamming, reverse, reverse-complement and GC-content constraints. From Remark 8 and Table 2, some new lower bounds are obtained for $A_{4}^{cf,GC}(n,d_{H},1)$ and $A_{4}^{GC,rc}(n,d_{H})$. For $A_{4}^{cf,GC}(n,d_{H},1)$, the newly updated bounds are

$$ A_{4}^{cf,GC}(4,3,1)\geq 12, $$

$$ A_{4}^{cf,GC}(6,4,1)\geq 20, $$

$$ A_{4}^{cf,GC}(8,6,1)\geq 12, $$

$$ A_{4}^{cf,GC}(9,6,1)\geq 16, $$

$$ A_{4}^{cf,GC}(9,9,1)\geq 2, $$

$$ A_{4}^{cf,GC}(10,7,1)\geq 16, \text{ and } $$

$$ A_{4}^{cf,GC}(10,8,1)\geq 8. $$

Similarly, for $A_{4}^{GC,rc}(n,d_{H})$, the newly achieved bounds are

$$ A_{4}^{GC,rc}(4,3)\geq 12, $$

$$ A_{4}^{GC,rc}(6,4)\geq 20, \text{ and } $$

$$ A_{4}^{GC,rc}(9,9)\geq 2. $$

4 Mappings and their properties

For a positive integer ℓ, the frequency of occurrence of insertion or deletion errors in an ℓ + 1 conflict free DNA string is less in an ℓ conflict free DNA string. Therefore, the chances of occurrence of these errors is significantly low in an ℓ conflict free DNA string for a sufficiently large ℓ. On the other hand, the computational complexity of Construction 1 is high. Therefore, to sidestep the computational approach, a recursive mapping is defined algebraically in this section which ensures that the obtained DNA strings will be ℓ conflict free. Moreover, DNA codes satisfying all the constraints are also studied with respect to the mapping in this section.

Definition 4

(Non-Homopolymer map of order ℓ): For a positive integer ℓ, consider $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$ such that x≠y. Define a map $f:\{0,1\}\times \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}\rightarrow \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}$ such that Table 3(a) holds.

Table 3 Non-Homopolymer Map and an Example with x = CG and y = AT

Full size table

For ℓ = 2, one can obtain Non-Homopolymer map f of order 2 is given in Table 3(b) by considering x = CG and y = AT. One can read the table as f(1,CG) = TA, f(1,TA) = GC and f(0,GC) = TA etc. Using the Non-Homopolymer map of order 2 (Table 3(b)), a binary string can also be encoded into a DNA string. Formally, the encoding using the Non-Homopolymer map is as follows.

Encoding 1

For positive integers n and ℓ, consider a mapping f as defined in the Definition 4. A binary string a = $(a_{1}\ a_{2} {\ldots } a_{n})\in \{0,1\}^{n}$ is encoded into u = $(\textbf {u}_{1}\ \textbf {u}_{2} {\ldots } \textbf {u}_{n})\in \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}^{n}$, in such a way that u_i = f(a_i,u_i− 1) for each i = 2,3,…,n and u₁ = h(a₁) where, $h:\{0,1\}\rightarrow \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}$ such that h(0) = h(1)^c. Note that $\textbf {u}_{1}\in \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}$ initiates the encoding of the binary string, and therefore the length of the encoded DNA string u is nℓ. Clearly, the encoding of two distinct binary strings are always distinct. Observe that for i = 2,3,…,n, f(a_i,u_i− 1)^c = $f(a_{i},\textbf {u}_{i-1}^{c})$ = $f(\bar {a}_{i},\textbf {u}_{i-1})$ where, $\bar {a}_{i}$ is binary complement of a_i.

Example 1

Consider a binary string a = (0 1 1 0). Let the encoding be initialized with x = CG. So, one can encode the binary string a to a DNA string (x f(1,x) f(1,f(1,x)) f(0,f(1,f(1,x)))) = xy^cxy = CGTAGCTA.

For positive integers n and ℓ(< n), consider a subset $S\subseteq \{0,1\}^{n}$. Each binary string from the set S is encoded using the Non-Homopolymer map, and the set of all encoded DNA strings is denoted by f(S). In particular, let ${\mathscr{B}}$ = $f(\{0,1\}^{n})\subseteq {\Sigma }_{DNA}^{n\ell }$ be the set of all possible DNA strings of length nℓ and they are obtained by the encoding using Non-Homopolymer map.

$$ \mathcal{B} = \left\{ \begin{array}{ll} (\{\textbf{u}_{1},{\textbf{u}_{1}^{c}}\}\times\{\textbf{u}_{2},{\textbf{u}_{2}^{c}}\})^{\frac{n-1}{2}}\times\{\textbf{u}_{1},{\textbf{u}_{1}^{c}}\} & n\text{ is positive odd integer, and } \\ (\{\textbf{u}_{1},{\textbf{u}_{1}^{c}}\}\times\{\textbf{u}_{2},{\textbf{u}_{2}^{c}}\})^{\frac{n}{2}} & n\text{ is positive even integer} \end{array} \right. $$

where, $\{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}$ and $\{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}$ are pairwise disjoint and $\{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}\cup \{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}=\{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}$. Note that, in the Example 1, the encoded DNA string CGTAGCAT is 4 conflict free and also free from secondary structures. The GC-content of the DNA string is 4 (= ⌊n/2⌋). Therefore, in the following results, we obtained constraints on seed blocks (x and y) of the Non-Homopolymer mapping so that the encoded DNA strings from binary strings satisfy those additional constraints.

Theorem 2

For positive integers n (n ≥ 2), ℓ, t and m (≤ tℓ), if $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$ such that each DNA string in the set $S = \{(\textbf {x}\ \textbf {y}_{1}\ \textbf {x}_{2}\ \textbf {y}_{2}\ldots \textbf {x}_{t}\ \textbf {y}_{t}), (\textbf {y}\ \textbf {x}_{1}\ \textbf {y}_{2}\ \textbf {x}_{2}\ldots \\ \textbf {y}_{t}\ \textbf {x}_{t}): \textbf {x}_{i}\in \{\textbf {x},\textbf {x}^{c}\}\text { and } \textbf {y}_{i}\in \{\textbf {y},\textbf {y}^{c}\} \text { for } i=1,2,\ldots ,t\}$ is m conflict free then any binary string from {0,1}ⁿ will be encoded into an m conflict free DNA string using Non-Homopolymer map.

Proof

From Remark 7, if any DNA string in the set S is m conflict free then all the DNA sub-strings (x₁ y₁ x₂ y₂…x_t y_t) and (y₁ x₁ y₂ x₂…y_t x_t) are also m conflict free. Therefore, in the encoded DNA string u = (u₁ u₂…u_n) (using Non-Homopolymer map), for 1 ≤ i ≤ n − 2t + 1, consider (u_i u_i+ 1u_i+ 2…u_i+ 2t), which is also m conflict free where, $\textbf {u}_{j}\in \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}$ for j = i, i + 1,…,i + 2t. From Definition 2, m < tℓ, the result follows. □

For the various values of m and t in Theorem 2, one can observe the following two propositions:

Proposition 1

For a positive integer ℓ, if $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$ such that each of the DNA strings (x y), (x y^c), (y x) and (y x^c) is ℓ conflict free then any binary string will be encoded into a ℓ conflict free DNA string using Non-Homopolymer map.

Theorem 3

For a positive integer ℓ (≥ 2), if $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$ such that each of the DNA string from the set $\{(\textbf {x}\ \textbf {y}_{1}\ \textbf {x}_{1}\ \textbf {y}_{2}\ \textbf {x}_{2}):\textbf {x}_{1},\textbf {x}_{2}\in \{\textbf {x},\textbf {x}^{c}\}\text { and }\textbf {y}_{1},\textbf {y}_{2}\in \{\textbf {y},\textbf {y}^{c}\}\}$ is free from reverse-complement sub-string(s) of length 3 then the encoded DNA string using Non-Homopolymer map is free from secondary structure of stem length 3.

Proof

For ℓ (≥ 2) and $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$, consider a DNA string (x y x y x) free from reverse-complement sub-strings of length 3. From Remark 5, the DNA strings (x^c y^c x^c y^c x^c) will also be free from reverse-complement sub-strings of length 3. Similarly remaining all DNA strings of 5 seed blocks are also free from reverse-complement sub-strings. Hence, from Definition 1 and Non-Homopolymer map, the encoded DNA string obtained from any binary string will be free from secondary structure of stem length 3. □

From Remark 4, if a DNA string of length n is free from reverse-complement sub-string of length m then the DNA string is also free from any reverse-complement sub-string of length t (m ≤ t ≤ n). Therefore, one can observe following two propositions from Theorem 3.

Proposition 2

For a positive integer ℓ (≥ 2), if $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$ such that each DNA string in the set $\left \{(\textbf {x}\ \textbf {y}^{*}\ \textbf {x}^{*}\ \textbf {y}^{*}), (\textbf {y}\ \textbf {x}^{*}\ \textbf {y}^{*}\ \textbf {x}^{*}): \textbf {x}^{*}\in \{\textbf {x},\textbf {x}^{c}\} \text { and } \textbf {y}^{*}\in \left \{\textbf {y},\right .\right . \left .\left .\textbf {y}^{c}\right \}\right \}$ is free from secondary structures of stem length 3 then the encoded DNA string using Non-Homopolymer map is also free from secondary structure of stem length 3.

Recall that any DNA string of length n is ⌊n/2⌋ conflict free then the DNA string is free from any consecutive repetitions of sub-strings of any length. Following theorem gives the condition on binary strings such that the encoded DNA string of length nℓ is ⌊nℓ/2⌋ conflict free.

Theorem 4

For positive integers n and ℓ, consider $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$ such that the DNA strings (x y), (x y^c), (y x) and (y x^c) are ℓ conflict free. If a = (a₁ a₂…a_n) is a binary string such that $2\mu <{\sum }_{i=\lambda +1}^{\lambda +2\mu } (a_{i}a_{2\mu +i}+\bar {a}_{i}\bar {a}_{2\mu +i})$ for each positive even integer 2μ from the set {1,2,…,⌊n/2⌋} and λ = 0,1,…,n − 2μ then the binary string a will be encoded (using Non-Homopolymer map) into a ⌊nℓ/2⌋ conflict free DNA string of length nℓ.

Proof

Consider a binary string a = (a₁ a₂…a_n) which is encoded into DNA string u = (u₁ u₂…u_n) using Non-Homopolymer map. For each positive even integer 2μ from the set {1,2,…,⌊n/2⌋}, the DNA block $u_{2\mu +i}\in \{u_{i},{u_{i}^{c}}\}$. For any binary symbol a_i, a_2μ+i ∈{0,1},

$$ (a_{i}a_{2\mu+i}+\bar{a}_{i}\bar{a}_{2\mu+i})= \left\{ \begin{array}{ll} 1 & \text{ if } a_{i}=a_{2\mu+i} \\ 0 & \text{ otherwise. } \end{array} \right. $$

Therefore, ${\sum }_{i=1}^{2\mu } (a_{i}a_{2\mu +i}+\bar {a}_{i}\bar {a}_{2\mu +i})=2\mu $ if and only if a_i = a_2μ+i for each i = 1,2,…,2μ. If the origin is shifted with λ, and $2\mu <{\sum }_{i=\lambda +1}^{\lambda +2\mu } (a_{i}a_{2\mu +i}+\bar {a}_{i}\bar {a}_{2\mu +i})$ for each λ and μ, then there is not exist the consecutive identical sub-string of length 2μ in the binary string. From Non-Homopolymer map, for a binary string, two consecutive sub-strings of odd length cannot be encoded into identical DNA sub-strings. Therefore, from Non-Homopolymer map and Proposition 1, the encoded DNA string will be a ⌊nℓ/2⌋ conflict free. □

The GC-content of encoded DNA string is calculated in the following lemma.

Lemma 3

For positive integers n and ℓ, consider $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$ with GC-content g_x and g_y. For the encoded DNA string u = $(\textbf {u}_{1}\ \textbf {u}_{2} \ldots \textbf {u}_{n})\in \left \{\textbf {x},\textbf {x}^{c},\textbf {y},\right . \left .\textbf {y}^{c}\right \}^{n}$ using Non-Homopolymer map, the GC-content of u will be

$$ g_{\textbf{u}} = \left\{ \begin{array}{ll} g_{\textbf{u}_{1}}+(g_{\textbf{u}_{1}}+g_{\textbf{u}_{2}})(n-1)/2 & \text{ if } n \text{ is odd integer}, \\ (g_{\textbf{u}_{1}}+g_{\textbf{u}_{2}})n/2 & \text{ if } n \text{ is even integer.} \end{array} \right. $$

Proof

For positive integers n and ℓ (< n), let a binary string a = $(a_{1}\ a_{2} {\ldots } a_{n})\in \{0,1\}^{n}$ be encoded into some u = $(\textbf {u}_{1}\ \textbf {u}_{2} \ldots \textbf {u}_{n})\in \{\textbf {x},\textbf {x}^{c},\textbf {y},\textbf {y}^{c}\}^{n}$ using Non-Homopolymer map. In the Non-Homopolymer map, if $\textbf {u}_{1}\in \{\textbf {x},\textbf {x}^{c}\}$ then DNA blocks $\textbf {u}_{2j}\in \{\textbf {y},\textbf {y}^{c}\}$ and $\textbf {u}_{2j+1}\in \{\textbf {x},\textbf {x}^{c}\}$, for 1 ≤ 2j,2j + 1 ≤ n. Since, the GC-content of a DNA string and its complement DNA string are the same, the GC-content of each sub-string (u_2j u_2j+ 1) is g_x + g_y. Hence, if n is even, the GC-content of the encoded DNA string u is g_u = (g_x + g_y)n/2 and if n is odd then g_u = g_x + (g_x + g_y)(n − 1)/2. Similarly, if $\textbf {u}_{1}\in \{\textbf {y},\textbf {y}^{c}\}$ then g_u = (g_x + g_y)n/2 for even n and g_u = g_y + (g_x + g_y)(n − 1)/2 when n is odd. Hence, the lemma follows. □

From Lemma 3, one can observe the following two propositions.

Proposition 3

In Lemma 3, if g_x + g_y = ℓ then

$$ g_{\textbf{u}} = \left\{ \begin{array}{ll} g_{\textbf{u}_{1}}+\ell(n-1)/2 & \text{ if } n \text{ is odd integer} \\ \ell n/2 & \text{ if } n \text{ is even integer}. \end{array} \right. $$

The following proposition ensures that the GC-content of the encoded DNA string (using Non-Homopolymer map) is almost 50% of the length.

Proposition 4

For positive integer n and ℓ, let $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$. If the GC-content of x and y are ⌊ℓ/2⌋ and ⌈ℓ/2⌉ respectively, then the GC-content of any encoded DNA string u (using Non-Homopolymer map) of length nℓ is

$$ g_{\textbf{u}} = \left\{ \begin{array}{ll} \lfloor n\ell/2\rfloor & \text{ if } \textbf{u}_{1}\in\{\textbf{x},\textbf{x}^{c}\}, \text{ and } \\ \lceil n\ell/2 \rceil & \text{ if } \textbf{u}_{1}\in\{\textbf{y},\textbf{y}^{c}\}. \end{array}\right. $$

Theorem 5

For positive integers n and ℓ, let $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$. Using Non-Homopolymer map, if a binary string (a₁ a₂…a_n) is encoded into some u ∈{x, x^c,y, y^c}ⁿ then the binary string $(\bar {a}_{1}\ a_{2} {\ldots } a_{n})$ is encoded into u^c where, $\bar {a}_{1}$ is the binary complement of a₁.

Proof

The proof is done using induction on the index i (i = 1,2,…,n). Now from the Non-Homopolymer map, f(0,z)^c = f(0,z^c) and f(1,z)^c = f(1,z^c), for each z ∈{x, x^c,y, y^c}. Consider the binary strings (a₁ a₂ a₃…a_n) and $(\bar {a}_{1}\ a_{2}\ a_{3}{\ldots } a_{n})$, that are encoded into some DNA strings (u₁ u₂ u₃…u_n) and (v₁ v₂ v₃…v_n). From the Non-Homopolymer map, ${\textbf {u}_{1}^{c}}$ = h(a₁)^c = $h(\bar {a}_{1})$ = v₁. Let ${\textbf {u}_{i}^{c}}$ = v_i, for some i ∈{1,2,…,n}. Consider $\textbf {u}_{i+1}^{c}$ = f(a_i+ 1,u_i)^c = $f(a_{i+1},{\textbf {u}^{c}_{i}})$ = f(a_i+ 1,v_i) = v_i+ 1. Therefore, from induction, the binary strings (a₁ a₂ a₃…a_n) and $(\bar {a}_{1}\ a_{2}\ a_{3}{\ldots } a_{n})$ are encoded into DNA strings which are complement to each other. □

In the following two theorems, the Hamming distance between the two DNA strings is calculated for binary strings with Hamming distance 1 and 2.

Theorem 6

For positive integers n and ℓ, consider the binary strings a = (a₁ a₂…a_n) and b = $\left .(a_{1}\ a_{2}{\ldots } a_{i-1}\ \bar {a}_{i}\ \right . \left . a_{i+1}{\ldots } a_{n})\right .$ (1 ≤ i ≤ n), that are encoded into DNA strings u = (u₁ u₂…u_n) and v = (v₁ v₂…v_n) where, $\bar {a}_{i}$ is binary complement of a_i. Then, d_H(u, v) = ℓ(n − i + 1).

Proof

Consider the binary strings a = (a₁ a₂…a_i− 1 a_i a_i+ 1…a_n) and b = $(a_{1}\ a_{2} {\ldots } a_{i-1} \bar {a}_{i}\ a_{i+1}{\ldots } a_{n})$ which can be encoded into u = (u₁ u₂…u_i− 1 u_iu_i+ 1…u_n) and v = (v₁ v₂…v_i− 1v_i v_i+ 1…v_n) respectively. From Non-Homopolymer map, v_j = u_j (j = 1,2,…i − 1) and v_j = ${\textbf {u}_{j}^{c}}$ (j = i, i + 1…,n). Therefore, d_H(u, v) = ℓ(n − i + 1), since $d_{H}(\textbf {x},\textbf {x}^{c})$ = $d_{H}(\textbf {y},\textbf {y}^{c})$ = ℓ. □

Theorem 7

For positive integers n and ℓ, consider the binary strings a = (a₁ a₂…a_n) and b = $\left .(a_{1}\ a_{2}{\ldots } a_{i-1}\ \bar {a}_{i}\right . \left . a_{i+1}{\ldots } a_{j-1}\ \bar {a}_{j}\ a_{j+1}{\ldots } a_{n}\right .)$ (1 ≤ i < j ≤ n) that are encoded into DNA strings u = (u₁ u₂…u_n) and v = (v₁ v₂…v_n) where, $\bar {a}_{i}$ is binary complement of a_i. Then, d_H(u, v) = ℓ(j − i).

Proof

The proof is similar to the Theorem 6 and follows from the fact that for any DNA string x, (x^c)^c = x. □

The following theorem provides a bound on the Hamming distance on the encoded DNA strings.

Theorem 8

For positive integers n, ℓ and σ(≤ ℓ), consider $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$ and σ = $\min \limits \left \{d_{H}(\textbf {z}_{1},\textbf {z}_{2}),\right . \left .n-d_{H}(\textbf {z}_{1},\textbf {z}_{2}):\textbf {z}_{1}\in \{\textbf {x},\textbf {x}^{c}\}\text { and }\textbf {z}_{2}\in \{\textbf {y},\textbf {y}^{c}\} \right \}$. Let binary strings a, b ∈{0,1}ⁿ be encoded into the DNA strings u = (u₁ u₂…u_n) and v = (v₁ v₂…v_n). For some a, b ∈{0,1}, if $\textbf {u}^{\prime }$ = (u f(a, u_n)) and $\textbf {v}^{\prime }$ = (v f(b, v_n)) then

$$ d_{H}(\textbf{u}^{\prime},\textbf{v}^{\prime})\geq \left\{ \begin{array}{ll} \sigma(d_{H}(\textbf{a},\textbf{b})+d_{H}(a,b)) & \text{ if }d_{H}(\textbf{a},\textbf{b})\text{ is even} \\ \sigma(d_{H}(\textbf{a},\textbf{b})-d_{H}(a,b)+1) &\text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is odd}. \end{array} \right. $$

Proof

The proof follows from the following two facts. (1) For a positive integer m and any $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{m}$, if d_H(x, y) = t then $d_{H}(\textbf {x}^{c},\textbf {y})\geq m-t$, and (2) For z ∈{x, x^c,y, y^c} and a, b ∈{0,1}, f(c, z)^c = f(c, z^c) = $f(\bar {c},\textbf {z})$. So we can derive,

$$ d_{H}(\textbf{u}^{\prime},\textbf{v}^{\prime})\geq \left\{ \begin{array}{ll} \sigma(d_{H}(\textbf{a},\textbf{b})+1) & \text{ if }d_{H}(\textbf{a},\textbf{b})\text{ is even and } a\neq b\\ \sigma d_{H}(\textbf{a},\textbf{b}) & \text{ if }d_{H}(\textbf{a},\textbf{b})\text{ is even and } a= b\\ \sigma d_{H}(\textbf{a},\textbf{b}) &\text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is odd and } a\neq b \\ \sigma(d_{H}(\textbf{a},\textbf{b})+1) &\text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is odd and } a = b. \end{array} \right. $$

Hence the result follows. □

Motivated by Theorem 8, one can observe the following proposition.

Proposition 5

For positive integers n and ℓ, consider $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$ and ℓ = d_H(z₁,z₂), for each $\textbf {z}_{1}\in \{\textbf {x},\textbf {x}^{c}\}$ and $\textbf {z}_{2}\in \{\textbf {y},\textbf {y}^{c}\}$. Let two binary strings a, b ∈{0,1}ⁿ be encoded into the DNA strings u = (u₁ u₂…u_n) and v = (v₁ v₂…v_n). For some a, b ∈{0,1}, if $\textbf {u}^{\prime }$ = (u f(a, u_n)) and $\textbf {v}^{\prime }$ = (v f(b, v_n)) then

$$ d_{H}(\textbf{u}^{\prime},\textbf{v}^{\prime})=\left\{ \begin{array}{ll} d_{H}(\textbf{u},\textbf{v})+\ell d_{H}(a,b) & \text{ if }d_{H}(\textbf{a},\textbf{b})\text{ is even} \\ d_{H}(\textbf{u},\textbf{v})+\ell(1-d_{H}(a,b)) &\text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is odd}. \end{array} \right. $$

In order to establish the proposed mapping as an isometry from the set of binary strings to the set of DNA strings where, Hamming distance is taken for the set of DNA strings, we introduce a new distance between two binary strings in the following definition.

Definition 5

Let n(> 1) be an integer and Σ be an alphabet of size q (≤ 2). For any a = (a₁ a₂…a_n), b = $(b_{1}\ b_{2}{\ldots } b_{n})\in {\Sigma }^{n}$, let P = {i : a_i≠b_i and i ∈{1,2,…,n}} and

$$ S= \left\{ \begin{array}{ll} P & \text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is even} \\ P\cup\{n+1\} & \text{ if } d_{H}(\textbf{a},\textbf{b})\text{ is odd.} \end{array} \right. $$

If S≠∅, we can denote S = {s₁,s₂,…,s_|S|} such that, for each s_j < s_j+ 1, j = 1,2,…,|S|− 1. We define a map $d_{NHo}:{\Sigma }^{n}\times {\Sigma }^{n}\rightarrow {\mathbb R}$ such thatl

$$ d_{NHo}(\textbf{a},\textbf{b}) = \left\{ \begin{array}{ll} \ell{\sum}_{i=1}^{|S|/2} (s_{2i}-s_{2i-1}) & \text{ if } |S|>0,\\ 0 & \text{ if } |S|=0, \end{array} \right. $$

where ℓ is a positive integer.

Remark 9

The map $d_{NHo}:{\Sigma }^{n}\times {\Sigma }^{n}\rightarrow \mathbb { R}$ is indeed a distance (called Non-Homopolymer distance), because it holds all the properties including triangle inequality of distance. The first three properties of metric are easy to prove and the triangle inequality property can be proved using induction on length n. For a code $\mathcal {C}\subseteq {\Sigma }^{n}$, the minimum distance d_NHo = $\min \limits \{d_{NHo}(\textbf {a},\textbf {b}):\textbf {a}\neq \textbf {b},\text { and }\textbf {a},\textbf {b}\in \mathcal {C}\}$.

For example, consider n = 5, ℓ = 2 and Σ = {0,1}. For a = (1 1 1 1 0) and b = (0 1 1 0 0), d_NHo(a, b) = 6 where, S = {1,4}.

Theorem 9

For positive integers n and ℓ, the Non-Homopolymer map is a distance preserving encoding between ({0,1}ⁿ,d_NHo) and $({\mathscr{B}},d_{H})$.

Proof

The theorem is proved using induction on the string length n. The base case, n = 1, is obvious from the Non-Homopolymer map and the Non-Homopolymer distance. For the inductive step, assume that the distance is preserved for n = k where, a = (a₁ a₂…a_k) and b = (b₁ b₂…b_k) with the support set S, are encoded into DNA strings u = (u₁ u₂…u_k) and v = (v₁ v₂…v_k). To prove that the distance is preserved for n = k + 1, consider $\textbf {a}^{\prime }$ = (a a_k+ 1) = (a₁ a₂…a_k a_k+ 1) and $\textbf {b}^{\prime }$ = (b b_k+ 1) = (b₁ b₂…b_k b_k+ 1) with the support set $S^{\prime }$ where, a_k+ 1,b_k+ 1 ∈{0,1}. Let the strings a and b be encoded into DNA strings $\textbf {u}^{\prime }$ = (u₁ u₂…u_k u_k+ 1) and $\textbf {v}^{\prime }$ = (v₁ v₂…v_k v_k+ 1) where, $\textbf {u}_{k+1},\textbf {v}_{k+1}\in \{\textbf {x},\textbf {y},\textbf {x}^{c},\textbf {y}^{c}\}$. Now, there are four cases. (i) Consider d_H(a, b) is even and a_k+ 1 = b_k+ 1. Note that both u_k and v_k are member of either {x, x^c} or {y, y^c}. On the other hand, $d_{NHo}(\textbf {a}^{\prime },\textbf {b}^{\prime })$ = d_NHo(a, b), since $S^{\prime }$ = S. (ii) If d_H(a, b) is even and a_k+ 1≠b_k+ 1 then u_k+ 1 = $\textbf {v}_{k+1}^{c}$. So, for ℓ = $d_{H}(\textbf {x},\textbf {x}^{c})$ = $d_{H}(\textbf {y},\textbf {y}^{c})$, $d_{H}(\textbf {u}^{\prime },\textbf {v}^{\prime })$ = d_H(u, v) + ℓ and $d_{NHo}(\textbf {a}^{\prime },\textbf {b}^{\prime })$ = d_NHo(a, b) + ℓ since, $S^{\prime }$ = {k + 1,k + 2}∪ S. (iii) If d_H(a, b) is odd and a_k+ 1 = b_k+ 1 then $d_{H}(\textbf {u}^{\prime },\textbf {v}^{\prime })$ = d_H(u, v) + m and $d_{NHo}(\textbf {a}^{\prime },\textbf {b}^{\prime })$ = d_NHo(a, b) + ℓ since, $S^{\prime }$ = {k + 2}∪ S∖{k + 1}. (iv) If d_H(a, b) is odd and a_k+ 1≠b_k+ 1 then $d_{H}(\textbf {u}^{\prime },\textbf {v}^{\prime })$ = d_H(u, v) and $d_{NHo}(\textbf {a}^{\prime },\textbf {b}^{\prime })$ = d_NHo(a, b) since, $S^{\prime }$ = S. Hence, the result follows from Proposition 5. □

Theorem 10

For a binary code $\mathcal {C} (n, M, d_{NHo})$ where, d_NHo is the Non-Homopolymer distance (Definition 5), there exists a DNA code $f(\mathcal {C})$ with codeword length nℓ, size M and the minimum Hamming distance d_H = d_NHo.

Proof

The proof follows from the Non-Homopolymer map and Theorem 9. □

Theorem 11

For any binary code $\mathcal {C}$ with the minimum distance d_NHo ≤ nℓ/2 ($n,\ell \in \mathbb { Z}^{+}$), there exists a DNA code $f(\mathcal {C}) (n\ell ,M,d_{H})$ with complement constraint.

Proof

Let binary strings a and b of length n be encoded into DNA strings u and v of length nℓ. From the property of the complement of a DNA string, $d_{H}(\textbf {u},\textbf {v}^{c})\geq n\ell -d_{H}(\textbf {u},\textbf {v})$. From Theorem 9, d_H(u, v) = d_NHo(a, b) ≤ nℓ/2. Therefore, $d_{H}(\textbf {u},\textbf {v}^{c})\geq n\ell /2$ and hence, the result follows. □

Theorem 12

For a positive integer n, if a binary linear code with codeword length n contains (1 0 0…0) as a codeword, then the encoded DNA code (using Non-Homopolymer map) will satisfy complement constraint.

Proof

Consider a binary linear code containing the codeword (1 0 0…0) of length n. For any codeword a = (a₁ a₂…a_n) of the binary linear code, (1 0 0…0) + (a₁ a₂…a_n) = $(\bar {a}_{1}\ a_{2}{\ldots } a_{n})$ = b is also a codeword of the code. Therefore, from Theorem 5, for each binary codeword a, there exists a binary codeword b such that the encoded DNA strings from a and b will be complement to each other. Hence, by the distance property, the theorem is proved. □

Theorem 13

For positive integers n and ℓ, let $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$. Then, for any encoded DNA strings u, v ∈ f({0,1}ⁿ) using Non-Homopolymer map,

$$ d_{H}(\textbf{u},\textbf{v}^{r}) \geq \left\{ \begin{array}{ll} n\min\{d_{H}(\textbf{x},\textbf{y}^{r}),d_{H}(\textbf{x},\textbf{y}^{rc})\}, & \text{ if } n \text{ is even,} \\ \min\left\{d_{H}(\textbf{x},\textbf{x}^{r}),d_{H}(\textbf{y},\textbf{y}^{r}),d_{H}(\textbf{x},\textbf{x}^{rc}),d_{H}(\textbf{y},\textbf{y}^{rc})\right\}, & \text{ if } n \text{ is odd. } \end{array}\right. $$

Proof

For $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$, let the binary strings a, b ∈{0,1}ⁿ of length n be encoded into DNA strings u = (u₁ u₂…u_n), v = (v₁ v₂…v_n) in {x, x^c,y, y^c}, where $\textbf {u}_{2i},\textbf {v}_{2i}\in \{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}$ and $\textbf {u}_{2i+1},\textbf {v}_{2i+1}\in \{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}$ for 1 ≤ 2i,2i + 1 ≤ n. The set f({0,1}ⁿ) is the collection of all possible DNA strings such that obtained DNA blocks will be from $\{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}$ and $\{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}$ at even positions and odd positions respectively. Consider $d_{H}(\textbf {u},\textbf {v}^{r})={\sum }_{j=1}^{n}d_{H}(\textbf {u}_{j},\textbf {v}^{r}_{n-j+1})$. Now two cases may arise.

Case 1:
If n is odd, then j and n − j + 1 both are either even or odd. If both j and n − j + 1 are even then $\textbf {u}_{j},\textbf {v}_{n-j+1}\in \{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}$, and if both j and n − j + 1 are odd then $\textbf {u}_{j},\textbf {v}_{n-j+1}\in \{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}$. Therefore, u, v^r ∈ f({0,1}ⁿ) and, from the Non-Homopolymer map, $d_{H}(\textbf {u}_{j},\textbf {v}^{r}_{n-j+1})\geq \min \limits \{d_{H}(\textbf {x},\textbf {x}^{r}),d_{H}(\textbf {y},\textbf {y}^{r}),d_{H}(\textbf {x},\textbf {x}^{rc}),\\ d_{H}(\textbf {y},\textbf {y}^{rc})\}$.
Case 2:
If n is even, then the parity j and n − j + 1 will be different. So, for even j, $\textbf {u}_{j}\in \{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}$ and $\textbf {v}_{n-j+1}\in \{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}$, and, for odd j, $\textbf {u}_{j}\in \{\textbf {u}_{1},{\textbf {u}_{1}^{c}}\}$ and $\textbf {v}_{n-j+1}\in \{\textbf {u}_{2},{\textbf {u}_{2}^{c}}\}$. Therefore, from Non-Homopolymer map and the fact that, for any $\textbf {z}_{1},\textbf {z}_{2}\in {\Sigma }_{DNA}^{\ell }$, $d_{H}(\textbf {z}_{1},{\textbf {z}_{2}^{r}})$ = $d_{H}({\textbf {z}_{1}^{r}},\textbf {z}_{2})$, we obtain $d_{H}(\textbf {u}_{j},\textbf {v}_{n-j+1}^{r})\geq \min \limits \{d_{H}(\textbf {x},\textbf {y}^{r}),d_{H}(\textbf {x},\textbf {y}^{rc})\}$. Hence the result follows for every n.

□

Similarly one can get result for reverse-complement constraint as given in the following preposition.

Proposition 6

For positive integers n and ℓ, let $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$. Then, for any encoded DNA strings u, v ∈ f({0,1}ⁿ) using the Non-Homopolymer map,

$$ d_{H}(\textbf{u},\textbf{v}^{rc}) \geq \left\{ \begin{array}{ll} n\min\{d_{H}(\textbf{x},\textbf{y}^{r}),d_{H}(\textbf{x},\textbf{y}^{rc})\}, & \text{ if } n \text{ is even,} \\ \min\left\{d_{H}(\textbf{x},\textbf{x}^{r}),d_{H}(\textbf{y},\textbf{y}^{r}),d_{H}(\textbf{x},\textbf{x}^{rc}),d_{H}(\textbf{y},\textbf{y}^{rc})\right\}, & \text{ if } n \text{ is odd. } \end{array}\right. $$

Theorem 14

For an even positive integer n and a positive integer ℓ, consider $\textbf {x},\textbf {y}\in {\Sigma }_{DNA}^{\ell }$ such that $d_{H}(\textbf {x},\textbf {y}^{rc})$ = $d_{H}(\textbf {x},\textbf {y}^{r})=\ell $. Then, the DNA codes constructed using the Non-Homopolymer map will satisfy the reverse constraint.

Proof

If $d_{H}(\textbf {x},\textbf {y}^{rc})$ = $d_{H}(\textbf {x},\textbf {y}^{r})=\ell $ then, from Theorem 13, $d_{H}(\textbf {u},\textbf {v}^{r})\geq n\min \limits \left \{d_{H}(\textbf {x},\textbf {y}^{r}),\right . \left .d_{H}(\textbf {x},\textbf {y}^{rc})\right \}=n\ell $. But the length of the encoded DNA string is nℓ so, d_H ≤ nℓ and therefore, $d_{H}(\textbf {u},\textbf {v}^{r})\geq d_{H}$ for any DNA code constructed using the Non-Homopolymer map. □

Lemma 4

For positive integers n and ℓ, if the binary strings a and b of length n are encoded into DNA strings u and v using the Non-Homopolymer map then

$$ n-\lfloor d_{H}(\textbf{a},\textbf{b})/2\rfloor\geq \frac{1}{\ell}d_{H}(\textbf{u},\textbf{v})\geq\lceil d_{H}(\textbf{a},\textbf{b})/2\rceil. $$

Proof

For a positive integer n, if $S\subseteq \{1,2,\ldots ,n,n+1\}$ is a set with even cardinality such that, for each s_j ∈ S(j = 1,2,…,|S|− 1), s_j < s_j+ 1 then one can observe that:

$$ n-\frac{|S|}{2}\geq\sum\limits_{i=1}^{|S|/2} (s_{2i}-s_{2i-1})\geq\frac{|S|}{2}. $$

From the Definition 5, d_H(a, b) ∈{|S|,|S|− 1}, and therefore, the result follows. □

Theorem 15

For positive integers n and ℓ, suppose a binary code exists with the minimum Hamming distance d_H and the minimum distance d_NHo. Then, ℓ(n −⌊d_H/2⌋) ≥ d_NHo ≥ ℓ⌈d_H/2⌉.

Proof

For any code $\mathcal {C}$ with the minimum Hamming distance d_H, if $\textbf {a},\textbf {b}\in \mathcal {C}$ then ⌈d_H(a, b)/2⌉≥⌈d_H/2⌉ and n −⌊d_H/2⌋≥ n −⌊d_H(a, b)/2⌋≥⌈d_H/2⌉. The proof follows from Lemma 4 and Theorem 9. □

In the following theorem, a constraint on binary string is imposed in such a way that the encoded DNA string of length nℓ will be ⌊nℓ/2⌋ conflict free.

Theorem 16

For positive integers n and ℓ, and any positive even integer 2μ ∈{1,2,…,⌊n/2⌋}, consider a binary code with codeword length n, such that for each codeword (a₁ a₂…a_n),

$$ 2\mu<\sum\limits_{i=\lambda+1}^{\lambda+2\mu} (a_{i}a_{2\mu+i}+\bar{a}_{i}\bar{a}_{2\mu+i}), $$

where λ = 1,2,…,n − 2μ. Then there exists a ⌊nℓ/2⌋ conflict free DNA code with codeword length nℓ.

Proof

The proof follows from Definition 3 and Theorem 4. □

Lemma 5

Consider two seed blocks (x_i,y_i)i = 1,2 such that d(x₁,x₂) = $d(\textbf {x}_{1},{\textbf {x}^{c}_{2}})$ = d(y₁,y₂) = $d(\textbf {y}_{1},{\textbf {y}^{c}_{2}})$ = ℓ, and a binary code $\mathcal {C}$ with the parameter (n, M, d_NHo). The parameter of the DNA code $\mathcal {C}_{DNA}$ = $\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}$ will be (nℓ,2M, d_H) where, the DNA code $\mathcal {C}_{DNA_{i}}$ is encoded from the binary code $\mathcal {C}$ using Non-Homopolymer map with the seed block (x_i,y_i).

Proof

From Theorem 10, the parameters of the encoded DNA code $\mathcal {C}_{DNA_{i}}$ is (nℓ, M, d_H) for i = 1,2. The codeword length of both the encoded DNA codes $\mathcal {C}_{DNA_{1}}$ and $\mathcal {C}_{DNA_{2}}$ are same and equal to n. Therefore, the codeword length of the DNA code $\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}$ will also be n. The size of the DNA code $\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}$ will be not be more than 2M. Let a = (a₁ a₂…a_n) be a codeword in $\mathcal {C}$. Let the codeword a is encoded into u = $(\textbf {u}_{1}\ \textbf {u}_{2}\ldots \textbf {u}_{n})\in \mathcal {C}_{DNA_{1}}$ and v = $(\textbf {v}_{1}\ \textbf {v}_{2}\ldots \textbf {v}_{n})\in \mathcal {C}_{DNA_{2}}$ using seed blocks (x₁,y₁) and (x₂,y₂). The encoded DNA strings $\textbf {u}_{1}\in \{\textbf {x}_{1},{\textbf {x}_{1}^{c}}\}$, $\textbf {v}_{1}\in \{\textbf {x}_{2},{\textbf {x}_{2}^{c}}\}$ where, x₁≠x₂ and $\textbf {x}_{2}{\neq \textbf {x}_{2}^{c}}$. Therefore, u≠v for any case and hence, the encoded DNA strings are not identical. It follows that the size of the DNA code $\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}$ will be 2M. In addition, if d(x₁,x₂) = $d(\textbf {x}_{1},{\textbf {x}^{c}_{2}})$ = d(y₁,y₂) = $d(\textbf {y}_{1},{\textbf {y}^{c}_{2}})$ = ℓ then d_H(u, v) = nℓ (≤ d_H) for any $\textbf {u}\in \mathcal {C}_{DNA_{1}}$ and $\mathcal {C}_{DNA_{2}}$. Therefore, the minimum Hamming distance of the DNA code $\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}$ will be d_H. □

One can generalize Lemma 5 in the following proposition.

Proposition 7

Consider r seed blocks (x_i,y_i)i = 1,2,…,r such that d(x_i,x_j) = $d(\textbf {x}_{i},{\textbf {x}^{c}_{j}})$ = d(y_i,y_j) = $d(\textbf {y}_{i},{\textbf {y}^{c}_{j}})$ = ℓ (1 ≤ i < j ≤ r), and a binary code $\mathcal {C}$ with the parameter (n, M, d_NHo). The parameter of the DNA code $\mathcal {C}_{DNA}$ = $\bigcup _{i=1}^{r}\mathcal {C}_{DNA_{i}}$ will be (nℓ, rM, d_H) where, the DNA code $\mathcal {C}_{DNA_{i}}$ is encoded from the binary code $\mathcal {C}$ using the Non-Homopolymer map with the seed block (x_i,y_i).

One can observe the following remark from reverse, complement and reverse-complement DNA strings.

Remark 10

Using the Non-Homopolymer map, from some seed blocks (x, y), an ℓ conflict free DNA code is obtained with reverse, reverse-complement and GC-content constraints such that each DNA string is free from secondary structures if and only if an ℓ conflict free DNA code can also be obtained from seed blocks (x^r,y^r) with reverse, reverse-complement and GC-content constraints such that each DNA string is free from secondary structures. The statement is also true for DNA codes generated from seed blocks (x^c,y^c), (x^rc,y^rc), (y, x), (y^r,x^r), (y^c,x^c) and (y^rc,x^rc) using the Non-Homopolymer map.

Lemma 6

Consider, for i = 1,2, two seed blocks (x_i,y_i) such that d(x₁,x₂) = $d(\textbf {x}_{1},{\textbf {x}^{c}_{2}})$ = d(y₁,y₂) = $d(\textbf {y}_{1},{\textbf {y}^{c}_{2}})$ = ℓ, and two binary codes $\mathcal {C}_{i}$ with the parameter $(n,M_{i},d_{NHo_{i}})$. The parameter of the DNA code $\mathcal {C}_{DNA}$ = $\mathcal {C}_{DNA_{1}}\cup \mathcal {C}_{DNA_{2}}$ will be (nℓ, M, d_H) where, the DNA code $\mathcal {C}_{DNA_{i}}$ is encoded from the binary code $\mathcal {C}_{i}$ using the Non-Homopolymer map with the seed block (x_i,y_i), M = M₁ + M₂ and d_H = $\min \limits \{d_{NHo_{i}}:i=1,2\}$.

Proof

The proof is similar to that of the proof of Lemma 5 and it follows from the definition of the minimum Hamming distance of a code. □

Proposition 8

For i = 1,2,…,r, consider seed blocks (x_i,y_i) such that d(x_i,x_j) = $d(\textbf {x}_{i},{\textbf {x}^{c}_{j}})$ = d(y_i,y_j) = $d(\textbf {y}_{i},{\textbf {y}^{c}_{j}})$ = ℓ (1 ≤ i < j ≤ r), and binary codes $\mathcal {C}_{i}$ with the parameter $(n,M_{i},d_{NHo_{i}})$. The parameter of the DNA code $\mathcal {C}_{DNA}$ = $\bigcup _{i=1}^{r}\mathcal {C}_{DNA_{i}}$ will be (nℓ, M, d_H) where, the DNA code $\mathcal {C}_{DNA_{i}}$ is encoded from the binary code $\mathcal {C}$ using Non-Homopolymer map with the seed block (x_i,y_i), M = ${\sum }_{i=1}^{r}M_{i}$ and d_H = $\min \limits \{d_{NHo_{i}}:i=1,2,\ldots ,r\}$.

Proposition 9

As considered in Proposition 8, for a DNA code $\mathcal {C}_{DNA}$ = $\bigcup _{i=1}^{r}\mathcal {C}_{DNA_{i}}$ with the parameter ($n\ell ,{\sum }_{i=1}^{r}M_{i},d_{H}$), the code rate is:

$$ \mathcal{R} = \frac{log_{4}\left( {\sum}_{i=1}^{r}M_{i}\right)}{n\ell}\leq\frac{r}{2\ell}. $$

The bound on code rate in the Proposition 9 is obtained by taking M_i = 2ⁿ for each i = 1,2,…,r.

Remark 11

For any ℓ conflict free DNA code $\mathcal {C}_{DNA}$ with the parameters (nℓ,4ⁿ,d_H) and encoded from $\mathcal {C}={\mathbb { Z}_{2}^{n}}$ using the Non-Homopolymer map, the code rate is 1/2ℓ. Note that for ℓ = 2, the code rate is 1/4. In addition, the DNA code satisfies all the constraints reverse, reverse-complement, GC-content. All the DNA codewords are also free from secondary like structures and homopolymers.

Lemma 7

For d_H < nℓ/2, there exists DNA code $\mathcal {C}_{DNA}(n\ell ,M,d_{H})$ encoded from a binary code $\mathcal {C}(n,M,d_{NHo})$ such that

$$ M\geq \frac{2^{n}}{{\sum}_{i=0}^{\left\lceil\frac{2}{\ell}d_{H}\right\rceil-1}\binom{n}{i}}. $$

(1)

Proof

From Lemma 4, one can observe that $d_{H}(\textbf {a},\textbf {b})\leq \left \lceil \frac {2}{\ell }d_{H}(\textbf {u},\textbf {v})\right \rceil $ for any binary codewords a and b in $\mathcal {C}$ which can be encoded into DNA codewords u and v in $\mathcal {C}_{DNA}$. The bounds are true for any distinct a and b in $\mathcal {C}$ therefore, $d_{H}^{\prime }\leq \left \lceil \frac {2}{\ell }d_{H}\right \rceil $ where, $d_{H}^{\prime }$ is the minimum Hamming distance for the binary code $\mathcal {C}$. From the Gilbert-Varshamov bound for binary code, one can obtain the bounds in (1) for d_H < nℓ/2. □

Now, one can easily obtain the following proposition.

Proposition 10

If there exists an ℓ conflict free DNA code $\mathcal {C}_{DNA}$ with all the constraints reverse, reverse-complement, GC-content such that each DNA codeword is free from secondary like structures then

$$ A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell)\geq \frac{2^{n}}{{\sum}_{i=0}^{\left\lceil\frac{2}{\ell}d_{H}\right\rceil-1}\binom{n}{i}}, \text{ for } d_{H}< n\ell/2. $$

Lemma 8

For d_H ≥ nℓ/2, there exists DNA code $\mathcal {C}_{DNA}(n\ell ,M,d_{H})$ encoded from a binary code $\mathcal {C}(n,M,d_{NHo})$ such that

$$ M\geq \frac{2^{n}}{{\sum}_{i=0}^{2n-\left\lceil\frac{2}{\ell}d_{H}\right\rceil}\binom{n}{i}}. $$

(2)

Proof

From Lemma 4, one can observe that $d_{H}(\textbf {a},\textbf {b})\leq 2n-\left \lceil \frac {2}{\ell }d_{H}(\textbf {u},\textbf {v})\right \rceil +1$ for any binary codewords a and b in $\mathcal {C}$ which can be encoded into DNA codewords u and v in $\mathcal {C}_{DNA}$. The bounds are true for any distinct a and b in $\mathcal {C}$ therefore, $d_{H}^{\prime }\leq 2n-\left \lceil \frac {2}{\ell }d_{H}\right \rceil +1$ where, $d_{H}^{\prime }$ is the minimum Hamming distance for the binary code $\mathcal {C}$. From the Gilbert-Varshamov bound for binary code, one can obtain the bound in (2) for d_H ≥ nℓ/2. □

Now, one can easily obtain the following proposition.

Proposition 11

If there exists an ℓ conflict free DNA code $\mathcal {C}_{DNA}$ with all the constraints reverse, reverse-complement, GC-content such that each DNA codeword is free from secondary like structures then

$$ A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell)\geq \frac{2^{n}}{{\sum}_{i=0}^{2n-\left\lceil\frac{2}{\ell}d_{H}\right\rceil}\binom{n}{i}}, \text{ for } d_{H}\geq n\ell/2. $$

In Fig. 2, the lower bound on $A_{4}^{cf,hf,GC,r,rc}(n,d_{H},\ell )$ is plotted for n = 16 and 20.

5 Conflict free DNA codes

For various binary codes, the parameters of encoded DNA codes are listed in the Table 4. For x = ATA and y = CGC, the 5 conflict free DNA code (21,16,6) with reverse, and GC-content constraints are obtained from [7,4,3] binary Hamming code is given in Table 5. The parameter of the DNA code is (21,16,6) and each DNA string is free from hairpin like secondary structures of stem length 5. For any positive integer ℓ = 2,3,4,5, DNA pairs $(\textbf {x},\textbf {y})\in {\Sigma }_{DNA}^{\ell }$ are listed in Tables 7 and 8 with various parameters. For any pair (x, y) given in Table 8, one can get ℓ conflict free DNA codes from any binary code using the Non-Homopolymer map, where the DNA code satisfies Hamming, GC-content, reverse and reverse-complement constraints. Similarly, for any pair (x, y) given in Table 7, one can get DNA codes from any binary code using the Non-Homopolymer map where, the DNA code satisfies Hamming and GC-content constraints where, each DNA codeword is free from reverse-complement sub-strings (Tables 6, 7 and 8).

Table 4 Parameters for conflict free DNA codes encoded from binary codes

Full size table

Table 5 List of all encoded DNA strings for (x, y) = (ATA, CGC) from [7,4,3] binary Hamming code. The code rate of the DNA code is 0.1904

Full size table

Table 6 Codewords (each of length n) for ⌊n/2⌋ conflict free DNA codes with Hamming, reverse, reverse-complement and GC-content constraints. All the codes have code size improved from existing literature

Full size table

Table 7 Pairs (x, y) ∈Σ^ℓ such that (i) GC-content sum of x and y is ℓ, and (ii) each DNA string in the set {(x y^∗ x^∗ y^∗),(y x^∗ y^∗ x^∗) : x^∗∈{x, x^c} and y^∗∈{y, y^c}} is free from secondary structures

Full size table

Table 8 Pairs (x, y) ∈Σ^ℓ such that (i) d_H(x, y) = $d_{H}(\textbf {x},\textbf {y}^{rc})$ = $d_{H}(\textbf {x},\textbf {y}^{r})$ = ℓ, (ii) GC-content sum of x and y is ℓ, and (iii) each DNA string in the set {(x y^∗ x^∗ y^∗),(y x^∗ y^∗ x^∗) : x^∗∈{x, x^c} and y^∗∈{y, y^c}} is ℓ conflict free

Full size table

5.1 Reed-Muller code:

The binary Reed-Muller codes were introduced by Reed and Muller in 1954 [25]. For two non-negative integers m and r (r ≤ m), the r^th order binary Reed-Muller code $\mathcal {R}(r,m)$ is a linear code of length 2^m, code size $2^{{\sum }_{i=1}^{r}\binom {m}{i}}$ and the minimum Hamming distance d_H = 2^m−r. The generator matrix of $\mathcal {R}(r,m)$ is

$$ G_{r,m} = \left( \begin{array}{cc} G_{r,m-1} & G_{r,m-1} \\ \textbf{0} & G_{r-1,m-1} \end{array} \right), \text{ for }1\leq r\leq m-1, $$

where

$$ G_{m,m} = \left( \begin{array}{c} G_{m-1,m} \\ 1\ 1{\ldots} 1\ 0 \end{array} \right), $$

G_0,m is the all one matrix of size 1 × 2^m, and 0 is a zero matrix with $2^{{\sum }_{i=1}^{r}\binom {m}{i}}$ - $2^{{\sum }_{i=1}^{r-1}\binom {m-1}{i}}$ rows and 2^m− 1 columns.

In the following theorem, the minimum distance (Definition 5) is obtained for Reed-Muller codes.

Theorem 17

For positive integers m, r (0 ≤ r ≤ m) and ℓ, there exists a DNA code $\mathcal {C}_{DNA}(\ell 2^{m},2^{{\sum }_{i=1}^{r}\binom {m}{r}},\ell 2^{m-r-1})$ for the binary Reed-Muller code $\mathcal {R}(r,m)$.

Proof

For positive integers m (> 1) and ℓ(≤ 2^m− 1), consider a binary Reed-Muller code $\mathcal {R}(r,m)$ and the corresponding encoded DNA code $f(\mathcal {R}(r,m))$ for some pair $(\textbf {x},\textbf {y})\in ({\Sigma }_{DNA}^{\ell })^{2}$. From Theorem 9, the codeword length and code size for $f(\mathcal {R}(r,m))$ will be ℓ2^m and $2^{{\sum }_{i=1}^{r}\binom {m}{r}}$. From Theorem 15, the minimum distance d_NHo ≥ ℓ⌈d_H/2⌉ = ℓ2^m−r− 1. For a positive integer t, we denote 0_t = (0 0…0) and 1_t = (1 1…1), each of length t. Then the binary strings $\textbf {0}_{2^{m}}$ and $(\textbf {0}_{2^{m}-2^{m-r}}\ \textbf {1}_{2^{m-r}})$ will be in $\mathcal {R}(r,m)$. Therefore $d_{NHo}\leq d_{NHo}(\textbf {0}_{2^{m}},(\textbf {0}_{2^{m}-2^{m-r}}\ \textbf {1}_{2^{m-r}}))$ = ℓ2^m−r− 1. Hence, d_NHo = ℓ2^m−r− 1 for the binary $\mathcal {R}(r,m)$. So, from Theorem 9, the minimum Hamming distance for $f(\mathcal {R}(r,m))$ will be d_H = ℓ2^m−r− 1. □

Note that by choosing appropriate seed blocks one can obtain DNA codes with various constraints. For example, if one chooses DNA seed blocks from Table 8 and constructs the Reed-Muller type code (Theorem 17) then the DNA code will satisfy reverse, reverse-complement and GC-content constraints. In addition, each DNA codeword of the Reed-Muller code is ℓ conflict free. Similarly, if one chooses DNA seed blocks from Table 7 and constructs the Reed-Muller type code (Theorem 17) then the encoded DNA code is ℓ conflict free with GC-content constraint and each codeword is free from secondary structures.

6 Conclusions

We have scratched the surface of an interesting area of DNA codes that can be used in building efficient DNA data storage models.This article concentrates on two different approaches, computational and algebraic; to design DNA codes satisfying different constraints effective for practical usage. Computational approach improves the lower bounds on the size of the DNA codes in many cases from previous study under a new constraint (generalization of Homopolymers constraint) apart from the general constraints considered in the same aspect. On the contrary, the algebraic approach presents a new isometry between binary codes and DNA codes. Utilizing the recursive isometry, new classes of DNA codes has been constructed that are efficient. It is noteworthy to mention that the new codes are also free from hairpin like secondary structures. It would be an interesting future task to find bounds on DNA codes with the new constraint in mind and constructing optimal codes meeting those bounds. Extending the isometry from binary to q-ary case will also be an interesting future task.

References

Blawat, M., Gaedke, K., Hütter, I., Chen, X.M., Turczyk, B., Inverso, S., Pruitt, B.W., Church, G.M.: Forward error correction for DNA data storage. Procedia Comput. Sci. 80, 1011–1022 (2016)
Article Google Scholar
Bornholt, J., Lopez, R., Carmean, D.M., Ceze, L., Seelig, G., Strauss, K.: A DNA-based archival storage system. ACM SIGOPS Operating Syst. Rev. 50(2), 637–649 (2016)
Article Google Scholar
Chee, Y.M., Ling, S.: Improved lower bounds for constant GC-content DNA codes. IEEE Trans. Inf. Theory 54(1), 391–394 (2008). https://doi.org/10.1109/TIT.2007.911167
Article MathSciNet MATH Google Scholar
Chheda, N., Gupta, M.K.: RNA As a permutation. arXiv:1403.5477v1 (2014)
Church, G.M., Gao, Y., Kosuri, S.: Next-generation digital information storage in DNA. Science 337(6102), 1628–1628 (2012). https://doi.org/10.1126/science.1226355
Article Google Scholar
Erlich, Y., Zielinski, D.: DNA Fountain enables a robust and efficient storage architecture. Science 355(6328), 950–954 (2017). https://doi.org/10.1126/science.aaj2038
Article Google Scholar
Gaborit, P., King, O.D.: Linear constructions for DNA codes. Theor. Comput. Sci. 334, 99–113 (2005)
Article MathSciNet Google Scholar
Goldman, N., Bertone, P., Chen, S., Dessimoz, C., LeProust, E.M., Sipos, B., Birney, E.: Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494(7435), 77 (2013)
Article Google Scholar
Guenda, K., Gulliver, T.A., Solé, P.: On cyclic DNA codes. In: Proceedings IEEE International Symposium on Information Theory (ISIT), pp. 121–125. https://doi.org/10.1109/ISIT.2013.6620200 (2013)
Immink, K.A.S., Cai, K.: Properties and constructions of constrained codes for DNA-based data storage. arXiv:1812.06798 (2018)
Jacobs, A.: Data-storage for eternity (ETH Zürich, 13th of February 2015 https://www.ethz.ch/en/news-and-events/eth-news/news/2015/02/data-storage-for-eternity.html)
Jain, S., Hassanzadeh, F.F., Schwartz, M., Bruck, J.: Duplication-correcting codes for data storage in the DNA of living organisms. IEEE Trans. Inf. Theory 63(8), 4996–5010 (2017). https://doi.org/10.1109/TIT.2017.2688361
Article MathSciNet MATH Google Scholar
Kari, L., Konstantinidis, S., Losseva, E., Sosík, P., Thierrin, G.: Hairpin structures in DNA words. In: DNA Computing, Pp. 158–170 (2006)
Kiah, H.M., Puleo, G.J., Milenkovic, O.: Codes for DNA sequence profiles. In: Proceedings IEEE International Symposium on Information Theory (ISIT), pp. 814–818. https://doi.org/10.1109/ISIT.2015.7282568 (2015)
Kim, Y.S., Kim, S.H.: New construction of DNA codes with constant-GC contents from binary sequences with ideal autocorrelation. In: Proceedings IEEE International Symposium on Information Theory (ISIT), pp. 1569–1573. https://doi.org/10.1109/ISIT.2011.6033808 (2011)
Kovačević, M., Tan, V.Y.F.: Asymptotically optimal codes correcting fixed-length duplication errors in DNA storage systems. IEEE Commun. Lett. 22(11), 2194–2197 (2018). https://doi.org/10.1109/LCOMM.2018.2868666
Article Google Scholar
Limbachiya, D., Benerjee, K.G., Rao, B., Gupta, M.K.: On DNA codes using the ring $\mathbb {Z}_{4}+w\mathbb {Z}_{4}$. In: Proceedings IEEE International Symposium on Information Theory (ISIT), pp. 2401–2405. https://doi.org/10.1109/ISIT.2018.8437313 (2018)
Limbachiya, D., Gupta, M.K.: Natural Data Storage: A Review on sending Information from now to then via Nature. arXiv:1505.04890 (2015)
Limbachiya, D., Gupta, M.K., Aggarwal, V.: Family of constrained codes for archival DNA data storage. IEEE Commun. Lett. 22(10), 1972–1975 (2018). https://doi.org/10.1109/LCOMM.2018.2861867
Article Google Scholar
Limbachiya, D., Rao, B., Gupta, M.K.: The Art of DNA Strings: Sixteen Years of DNA Coding Theory. arXiv:1607.00266 (2016)
Loman, N., Misra, R., Dallman, T., Constantinidou, C., Gharbia, S., Wain, J., Pallen, M.: Performance comparison of benchtop high-throughput sequencing platforms. Nat. Biotechnol. 30(6), 434–439 (2012)
Article Google Scholar
Lothaire, M.: Combinatorics on Words, 2nd edn. Cambridge Mathematical Library. Cambridge University Press, Cambridge (1997). https://doi.org/10.1017/CBO9780511566097
Book Google Scholar
Marathe, A., Condon, A.E., Corn, R.M.: On combinatorial DNA word design. J. Comput. Biol. 8(3), 201–219 (2001). https://doi.org/10.1089/10665270152530818
Article MATH Google Scholar
Milenkovic, O., Kashyap, N.: DNA Codes that avoid secondary structures. In: Proceedings IEEE International Symposium on Information Theory (ISIT), pp. 288–292. https://doi.org/10.1109/ISIT.2005.1523340 (2005)
Muller, D.E.: Application of boolean algebra to switching circuit design and to error detection. Transactions of the I. R. E. Professional Group on Electronic Computers EC-3(3), 6–12 (1954). https://doi.org/10.1109/IREPGELC.1954.6499441
Article Google Scholar
Myers, P., Sebaihia, M., Cerdeño-tárraga Bentley, S., Crossman, L., Parkhill, J.: Tandem repeats and morphological variation. Nature (2007)
Nelms, B.L., Labosky, P.A.: A predicted hairpin cluster correlates with barriers to PCR. sequencing and possibly BAC recombineering Scientific Reports 1, 106 (2011)
Ridge, P., Carroll, H., Sneddon, D., Clement, M., Snell, Q.: Large grain size stochastic optimization alignment. In: Proceedings IEEE Symposium on BioInformatics and BioEngineering (BIBE), pp. 127–134. https://doi.org/10.1109/BIBE.2006.253325 (2006)
Rykov, V.V., Macula, A.J., Torney, D.C., White, P.S.: DNA Sequences and quaternary cyclic codes. In: Proceedings IEEE International Symposium on Information Theory (ISIT), pp. 248–248. https://doi.org/10.1109/ISIT.2001.936111 (2001)
Smith, D.H., Aboluion, N., Montemanni, R., Perkins, S.: Linear and nonlinear constructions of DNA codes with Hamming distance d and constant GC-content. Discret. Math. 311(13), 1207–1219 (2011)
Article MathSciNet Google Scholar
Song, W., Cai, K., Zhang, M., Yuen, C.: Codes with run-length and GC-content constraints for DNA-based data storage. IEEE Commun. Lett. 22(10), 2004–2007 (2018). https://doi.org/10.1109/LCOMM.2018.2866566
Article Google Scholar
Thomson, N., Sebaihia, M., Cerdeño-tárraga Bentley, S., Crossman, L., Parkhill, J.: The value of comparison. Nat. Rev. Microbiology 1(11), 11–12 (2003)
Article Google Scholar
Tulpan, D., Smith, D.H., Montemanni, R.: Thermodynamic post-processing versus GC-content pre-processing for DNA codes satisfying the hamming distance and reverse-complement constraints. IEEE/ACM Trans. Comput. Biol. Bioinform. 11(2), 441–452 (2014). https://doi.org/10.1109/TCBB.2014.2299815
Article Google Scholar
Tulpan, D.C., Hoos, H.H., Condon, A.E.: Stochastic local search algorithms for DNA word design. In: DNA Computing, pp. 229–241 (2003)
Yakovchuk, P., Protozanova, E., Frank-Kamenetskii, M.D.: Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nuclice Acis Res. 34(2), 564–574 (2006). https://doi.org/10.1093/nar/gkj454
Article Google Scholar
Yazdi, S.H.T., Yuan, Y., Ma, J., Zhao, H., Milenkovic, O.: A rewritable, random-access DNA-based storage system. Scientific Reports 5, 14138 (2015)
Article Google Scholar
Zhu, X., Sun, C., Liu, W., Wu, W.: Research on the counting problem based on linear constructions for DNA coding. In: Proceedings Computational Intelligence and Bioinformatics, pp. 294–302 (2006)
Zuker, M.: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31(13), 3406–3415 (2003)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Laboratory of Natural Information Processing, Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India
Krishna Gopal Benerjee, Sourav Deb & Manish K. Gupta

Authors

Krishna Gopal Benerjee
View author publications
You can also search for this author in PubMed Google Scholar
Sourav Deb
View author publications
You can also search for this author in PubMed Google Scholar
Manish K. Gupta
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Manish K. Gupta.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The preliminary version of the paper is available at https://arxiv.org/abs/1902.04419

Rights and permissions

Reprints and permissions

About this article

Cite this article

Benerjee, K.G., Deb, S. & Gupta, M.K. On conflict free DNA codes. Cryptogr. Commun. 13, 143–171 (2021). https://doi.org/10.1007/s12095-020-00459-7

Download citation

Received: 28 October 2019
Accepted: 24 September 2020
Published: 13 October 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s12095-020-00459-7

Keywords

Mathematics Subject Classification (2010)

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

On conflict free DNA codes

Abstract

Similar content being viewed by others

Construction of DNA codes with multiple constrained properties

Reconstruction algorithms for DNA-storage systems

A Strategy-based Optimization Algorithm to Design Codes for DNA Data Storage System

1 Introduction

2 Preliminary

Lemma 1

Proof

Remark 1

Remark 2

Remark 3

Lemma 2

Proof

Definition 1

Remark 4

Remark 5

Remark 6

3 On complete conflict free DNA strings

Definition 2

Remark 7

Definition 3

Construction 1

Theorem 1

Proof

Remark 8

4 Mappings and their properties

Definition 4

Encoding 1

Example 1

Theorem 2

Proof

Proposition 1

Theorem 3

Proof

Proposition 2

Theorem 4

Proof

Lemma 3

Proof

Proposition 3

Proposition 4

Theorem 5

Proof

Theorem 6

Proof

Theorem 7

Proof

Theorem 8

Proof

Proposition 5

Definition 5

Remark 9

Theorem 9

Proof

Theorem 10

Proof

Theorem 11

Proof

Theorem 12

Proof

Theorem 13

Proof

Proposition 6

Theorem 14

Proof

Lemma 4

Proof

Theorem 15

Proof

Theorem 16

Proof

Lemma 5

Proof

Proposition 7

Remark 10

Lemma 6

Proof

Proposition 8