1 Introduction

In bioinformatics, the basic strategy for studying DNA and protein sequences is to compare them. There are two main types of comparison methods: those based on sequence alignment and those that are alignment-free. The latter are preferred because they are less time consuming. Most of them are based on a mononucleotide representation; the first such graphical representation was given by Hamori and Ruskin in 1983 [1]. Graphical representations range from 2D to 6D. However, working directly with mononucleotides (A, G, C and T) loses a great deal of information, which motivated di- and tri-nucleotide representations. The mononucleotide models cannot represent di- and tri-nucleotides without complex calculations [2], so such representations were developed independently [3,4,5,6,7,8,9,10,11,12,13,14,15]. However, the following limitations remain:

1. For the 3D representation, only 64 distinct values are represented, so the mean of the represented coordinates is not of much interest. Even if the cumulative values vary more, the mean is still not a very satisfactory descriptor. A final comparison based on the means of the two types of represented points (normal and cumulative) may therefore not be applicable to a larger variety of samples [16].
2. Standard deviation measures the spread of the data rather than determining a theoretical centre, and the cumulative components reduce redundancy. Even so, comparisons that use the standard deviation of the cumulative data set as the descriptor are also found to be unsatisfactory [17].
3. Taking cumulative values always carries a risk, as the resulting time series becomes stochastic.
4. The numerical values and signs used for the tri-nucleotide representation [16, 17] appear very artificial.

To avoid these difficulties we take a very simple approach. We use only the 64 distinct values obtained by the chaos game representation [18] as the 2D representation of tri-nucleotides and make the representation 3D by taking the third coordinate as the product of the first two. Because the represented values now differ from those obtained by earlier methods, we check for improvement in the results by using four descriptors in turn: the mean, the standard deviation, the highest eigenvalue of the M/M matrix and the highest eigenvalue of the J/J matrix. Compared to the traditional matrix, the J/J matrix can capture the composition, distribution and chemical properties of bases; it can also reflect the biological significance of the sequence [19]. We therefore try both the M/M and J/J matrices, expecting that J/J might give better results.

What makes our representation better than the previous ones [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17] based on tri-nucleotides is that it is much simpler and hence more efficient. The essential difference lies in obtaining the 2D tri-nucleotide representations with the help of chaos game theory. The 2D coordinates arise in a very natural way from the chaos game, using only the initial values of the nucleotides. The exons of β globin genes of different species are known to be important for pharmaceutical purposes, so we chose coding sequences of β globin genes for the sequence comparison.

2 Methodology

We apply the chaos game values shown in the graphical representation of [18] to the non-overlapping triplets of a given DNA sequence and use them to calculate the statistical parameters on which the analysis in this paper is based.

Each nucleic acid triplet is represented by three coordinates (x, y, z): the first two are obtained from the chaos game representation above, and the third is the product of the first two. Let N = M/3 be the number of codons in the sequence, where M is the length of the DNA sequence.

Let \(x = (x_{1}, x_{2}, x_{3}, \ldots, x_{N}),\; y = (y_{1}, y_{2}, y_{3}, \ldots, y_{N}),\; z = (z_{1}, z_{2}, z_{3}, \ldots, z_{N})\) be the 3D represented points.
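The construction of the points can be sketched in Python as follows. This is a minimal illustration, not the exact coordinate table of [18]: the corner assignment (A, C, G, T at the corners of the unit square), the starting point (0.5, 0.5) and the function names `cgr_point` and `sequence_to_points` are assumptions made for the example.

```python
# Assumed corner layout of the classic chaos game; reference [18] may use another.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_point(triplet):
    """Return the 2D chaos-game point reached after reading one triplet."""
    x, y = 0.5, 0.5  # assumed start: centre of the unit square
    for base in triplet:
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return x, y


def sequence_to_points(seq):
    """Map the non-overlapping triplets of seq to 3D points (x, y, x*y)."""
    seq = seq.upper()
    n = len(seq) // 3  # N = M/3 codons; leftover bases at the end are ignored
    points = []
    for i in range(n):
        x, y = cgr_point(seq[3 * i: 3 * i + 3])
        points.append((x, y, x * y))  # third coordinate is the product of the first two
    return points
```

Under these assumptions the 64 possible triplets map to 64 distinct points on an 8 x 8 grid inside the unit square, so every z value is strictly positive.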

The means of x, y and z are \(\mu_{x}\), \(\mu_{y}\) and \(\mu_{z}\) respectively, where \(\mu_{x} = \frac{1}{N}\sum\limits_{i = 1}^{N} x_{i},\; \mu_{y} = \frac{1}{N}\sum\limits_{i = 1}^{N} y_{i},\; \mu_{z} = \frac{1}{N}\sum\limits_{i = 1}^{N} z_{i}\).

The standard deviations of x, y and z are \(V_{x}\), \(V_{y}\) and \(V_{z}\) respectively, where

$$V_{x} = \sqrt{\frac{\sum\limits_{i = 1}^{N} (x_{i} - \mu_{x})^{2}}{N}},\; V_{y} = \sqrt{\frac{\sum\limits_{i = 1}^{N} (y_{i} - \mu_{y})^{2}}{N}},\; V_{z} = \sqrt{\frac{\sum\limits_{i = 1}^{N} (z_{i} - \mu_{z})^{2}}{N}}$$
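As a small sketch of the first two descriptors, the following Python functions compute the component-wise mean and the population standard deviation (division by N, as in the formulas above) of a list of 3D points. The function names and the representation of points as `(x, y, z)` tuples are choices made for this example.

```python
import math


def mean_descriptor(points):
    """Return (mu_x, mu_y, mu_z) for a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def std_descriptor(points):
    """Return (V_x, V_y, V_z), the population standard deviations (divide by N)."""
    n = len(points)
    mu = mean_descriptor(points)
    return tuple(
        math.sqrt(sum((p[k] - mu[k]) ** 2 for p in points) / n) for k in range(3)
    )
```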

The M/M matrix is calculated as

$$M_{i,j} = \frac{{\sqrt {(x_{i} - x_{j} )^{2} + \left( {y_{i} - y_{j} } \right)^{2} + (z_{i} - z_{j} )^{2} } }}{{\left| {x_{i} - x_{j} } \right| + \left| {y_{i} - y_{j} } \right| + \left| {z_{i} - z_{j} } \right|}}$$
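The M/M matrix and its highest eigenvalue could be computed along the following lines. Two points in this sketch are assumptions rather than statements from the text: entries whose denominator is zero (the diagonal, and pairs of identical codons) are set to 0, and the eigenvalue is found by a simple power iteration, since the text does not say which numerical routine was used. In practice a library routine for symmetric matrices would normally be preferred.

```python
import math


def mm_matrix(points):
    """M[i][j] = Euclidean distance / city-block distance between points i and j."""
    n = len(points)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            diffs = [a - b for a, b in zip(points[i], points[j])]
            l1 = sum(abs(d) for d in diffs)
            if l1 > 0.0:  # assumed convention: 0/0 entries stay at 0
                m[i][j] = math.sqrt(sum(d * d for d in diffs)) / l1
    return m


def largest_eigenvalue(matrix, iterations=1000, tol=1e-12):
    """Estimate the largest eigenvalue of a symmetric non-negative matrix.

    Power iteration is run on (matrix + I): the shift by the identity makes
    the largest eigenvalue strictly dominant in magnitude.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    previous = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) + v[i] for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
        # Rayleigh quotient with respect to the original (unshifted) matrix.
        estimate = sum(
            v[i] * sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)
        )
        if abs(estimate - previous) < tol:
            break
        previous = estimate
    return estimate
```

Because the Euclidean norm of a 3D vector lies between 1/√3 times and 1 times its city-block norm, every non-zero entry of the matrix lies between 1/√3 and 1.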

The J/J matrix is calculated as

$$J_{i,j} = \frac{\sqrt{(x_{i} - x_{j})^{2} + (y_{i} - y_{j})^{2}}}{\left| x_{i} - x_{j} \right| + \left| y_{i} - y_{j} \right|} + \frac{\sqrt{(z_{i} - z_{j})^{2}}}{\left| z_{i} + z_{j} \right|}$$
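A corresponding sketch for the J/J matrix is given below. It follows the formula above term by term: a planar ratio in x and y, plus |z_i - z_j| divided by |z_i + z_j|. As an assumption of this example, any term whose denominator is zero is set to 0, since the text does not state a convention for that case.

```python
import math


def jj_matrix(points):
    """J[i][j] = planar (Euclidean / city-block) ratio + |z_i - z_j| / |z_i + z_j|."""
    n = len(points)
    j_mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        xi, yi, zi = points[i]
        for j in range(n):
            xj, yj, zj = points[j]
            dx, dy = xi - xj, yi - yj
            planar_l1 = abs(dx) + abs(dy)
            # Assumed convention: a term with zero denominator contributes 0.
            planar = math.hypot(dx, dy) / planar_l1 if planar_l1 > 0.0 else 0.0
            z_sum = abs(zi + zj)
            depth = abs(zi - zj) / z_sum if z_sum > 0.0 else 0.0
            j_mat[i][j] = planar + depth
    return j_mat
```

The resulting matrix is symmetric with a zero diagonal, so its eigenvalues are real and the highest one can be extracted with any symmetric eigenvalue routine.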

We calculate the similarity/dissimilarity between the coding sequences from a distance matrix computed in one of four ways:

(1) Euclidean distance between the three-component vectors \((\mu_{x} ,\;\mu_{y} ,\;\mu_{z} )\) of a pair of sequences.
(2) Euclidean distance between the three-component vectors \((V_{x} ,\;V_{y} ,\;V_{z} )\) of a pair of sequences.
(3) Absolute difference of the highest eigenvalues of the M/M matrices of a pair of sequences.
(4) Absolute difference of the highest eigenvalues of the J/J matrices of a pair of sequences.
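The four distance matrices differ only in the descriptor and the metric, so they can be built with one helper. In the sketch below, `descriptors` is a hypothetical dictionary that maps a sequence name to its descriptor, either a 3-tuple (cases 1 and 2) or a single eigenvalue (cases 3 and 4); the names are illustrative only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (cases 1 and 2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def eigen_gap(a, b):
    """Absolute difference of two highest eigenvalues (cases 3 and 4)."""
    return abs(a - b)


def distance_matrix(descriptors, metric):
    """Return (names, matrix) with matrix[i][j] = metric of sequences i and j."""
    names = list(descriptors)
    matrix = [
        [metric(descriptors[a], descriptors[b]) for b in names] for a in names
    ]
    return names, matrix
```

For example, `distance_matrix(means, euclidean)` would give the matrix for case (1) if `means` holds the mean vectors, and `distance_matrix(eigs, eigen_gap)` the matrix for case (3) or (4).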

The smaller an entry in the distance matrix, the more similar the two DNA sequences are. Distances between evolutionarily closely related species are therefore expected to be smaller, and those between evolutionarily distant species larger. We draw the phylogenetic tree from the similarity/dissimilarity matrix using UPGMA in the MEGA4 software [20].
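For readers unfamiliar with UPGMA, the clustering step can be sketched as below. This is a bare-bones illustration of the general algorithm and not the MEGA4 implementation: it returns only the tree topology as nested tuples, omits branch lengths, and breaks ties by taking the first minimum found.

```python
def upgma(names, dist):
    """Build a tree (nested tuples) from a symmetric distance matrix by UPGMA."""
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    n = len(names)
    # Pairwise distances between current clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # Size-weighted average = mean distance over all leaf pairs.
            new_d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        # Drop every distance that involves the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

Sequences joined early in the loop are those with the smallest distances, which is why closely related species are expected to appear as sister leaves in the tree.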

3 Results and Discussion

Table 1 lists the sequence information for the 10 species, and Table 2 shows the distance matrix of the complete coding sequences of the β globin genes of these species based on the highest eigenvalue of the J/J matrix. Distance matrices, including those using Euclidean distance, and the corresponding phylogenetic trees are obtained similarly in the other three cases. We observe that the phylogenetic tree in Fig. 1, built from the highest eigenvalue of the J/J matrix, gives the best result of the four. From Fig. 1 we also observe that similar species pairs such as Mouse and Rat, Tufted Monkey and Woolly Monkey, Hare and Rabbit, and Gallus and Duck lie close to each other. Our phylogenetic tree agrees with that found in [16] for the species in common.

Table 1 The complete coding sequences of β globin genes of 10 species
Table 2 Distance matrix for the coding sequences of β globin genes of 10 species based on highest eigen value of J/J Matrix
Fig. 1 Phylogenetic tree of 10 different species based on the complete coding sequences of their β globin genes, using the highest eigenvalue of the J/J matrix

4 Conclusion

In this paper, we propose a new method of nucleotide representation using chaos game theory. Comparing four descriptors (the mean, the standard deviation, the highest eigenvalue of the M/M matrix and the highest eigenvalue of the J/J matrix), we conclude that the highest eigenvalue of the J/J matrix is the best for comparing the complete coding sequences of the β globin genes of the 10 species above. We therefore conclude that our method is effective for evaluating sequence similarities on an intuitive basis. However, the method has so far been tested on only 10 sequences; in the near future we would like to apply it to a larger number of species.