Abstract
In this paper, we use a 2D tri-nucleotide representation based on chaos game theory. We extend the representation from 2D to 3D by taking the third coordinate as the product of the first two. The complete coding sequences of the β-globin genes of 10 species are then compared using four descriptors: (1) the mean of the components of the represented sequence, (2) the standard deviation of the components of the represented sequence, (3) the highest eigenvalue of the M/M matrix and (4) the highest eigenvalue of the J/J matrix. The results in the four cases are critically examined. The highest eigenvalue of the J/J matrix is found to be the best of the four descriptors.
1 Introduction
In bioinformatics, the basic strategy for studying DNA and protein sequences is to compare them. There are two main types of comparison methods: those based on alignment and those that are alignment-free. The latter are preferred because they are less time consuming. However, they are mostly based on mononucleotide representations. One such graphical representation was first given by Hamori and Ruskin in 1983 [1]. Graphical representations range from 2D to 6D. However, working directly with mononucleotides (A, G, C and T) loses a lot of information, which motivated di- and tri-nucleotide representations. The mononucleotide models cannot represent di- and tri-nucleotides without complex calculations [2], so such representations were developed independently [3,4,5,6,7,8,9,10,11,12,13,14,15]. The following limitations nevertheless remain. (1) For a 3D representation, there are only 64 represented values, so the mean of the represented coordinates is not of much interest. Even if the cumulative values vary considerably, the mean is still not a very satisfactory descriptor. A final comparison based on the means of the two types of represented points (normal and cumulative) may therefore not be applicable to a larger variety of samples [16]. (2) The standard deviation shows the spread of the data rather than locating a theoretical centre, and the cumulative components reduce redundancy. Even so, comparisons that use the standard deviation of the cumulative data set as the descriptor are also found to be unsatisfactory [17]. (3) There is always a risk in taking cumulative values, as the resulting time series becomes stochastic. (4) The numerical values and signs used for the tri-nucleotide representation [16, 17] appear to be highly artificial.
To avoid these difficulties, we take a very simple approach. We take only the 64 different values obtained by the chaos game representation [18] as the 2D representation of the tri-nucleotides and make the representation 3D by taking the third coordinate as the product of the first two. Because the represented values now differ from those obtained by earlier methods, we check for improvement in the results by using, in turn, the following descriptors: the mean, the standard deviation, the highest eigenvalue of the M/M matrix and the highest eigenvalue of the J/J matrix. Compared with the traditional matrix, the J/J matrix can capture the composition, distribution and chemical properties of the bases; it can also reflect the biological significance of the sequence [19]. We therefore try both the M/M and J/J matrices, with the expectation that J/J might give better results.
Our representation improves on the previous tri-nucleotide-based ones [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17] in that it is much simpler and hence more efficient. The essential difference lies in obtaining the 2D tri-nucleotide representations with the help of chaos game theory. The 2D coordinates are obtained in a very natural way from the chaos game, using only the initial values of the nucleotides. It is known that the exons of β-globin genes of different species are important for pharmaceutical purposes. We have therefore chosen the coding sequences of β-globin genes for sequence comparison.
2 Methodology
We apply the chaos game values shown in the graphical representation of [18] to the non-overlapping triplets of the given DNA sequence and use them to calculate the statistical parameters analysed in this paper.
Each nucleotide triplet is represented by three coordinates (x, y, z): the first two are obtained from the above chaos game representation, and the third is the product of the first two. Let N = M/3 be the number of codons in the sequence, where M is the length of the DNA sequence.
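The mapping from triplets to 3D points can be sketched as follows. This is a minimal illustration, not the exact table of [18]: the corner assignment A(0,0), C(0,1), G(1,1), T(1,0), the starting point (0.5, 0.5) and the helper names `cgr_point` and `represent` are all assumptions made for this example.

```python
# ASSUMPTION: a common chaos game corner convention; [18] may assign corners differently.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_point(triplet):
    """Chaos game: start at the centre and move halfway towards each base's corner."""
    x, y = 0.5, 0.5
    for base in triplet:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return x, y


def represent(sequence):
    """Map the non-overlapping triplets of a DNA string to (x, y, z) with z = x * y."""
    seq = sequence.upper()
    n = len(seq) // 3  # N = M/3 codons; an incomplete trailing triplet is ignored
    points = []
    for i in range(n):
        x, y = cgr_point(seq[3 * i:3 * i + 3])
        points.append((x, y, x * y))
    return points
```

Under these assumptions the 64 possible triplets land on 64 distinct 2D points, one per cell of an 8 × 8 grid, so the third coordinate adds no new free parameter, only the product of the first two.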
Let \(x = (x_{1}, x_{2}, x_{3}, \ldots, x_{N})\), \(y = (y_{1}, y_{2}, y_{3}, \ldots, y_{N})\) and \(z = (z_{1}, z_{2}, z_{3}, \ldots, z_{N})\) be the 3D represented points.
Then the means of x, y, z are given by \(\mu_{x}\), \(\mu_{y}\) and \(\mu_{z}\) respectively, where \(\mu_{x} = \frac{1}{N}\sum\limits_{i = 1}^{N} x_{i}\), \(\mu_{y} = \frac{1}{N}\sum\limits_{i = 1}^{N} y_{i}\) and \(\mu_{z} = \frac{1}{N}\sum\limits_{i = 1}^{N} z_{i}\).
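As a small illustration of the mean descriptor, the sketch below averages each coordinate of a list of (x, y, z) points. The function name `mean_descriptor` and the two sample points are hypothetical and not taken from the paper's data.

```python
from statistics import fmean


def mean_descriptor(points):
    """Return (mu_x, mu_y, mu_z) for a non-empty list of (x, y, z) points."""
    xs, ys, zs = zip(*points)  # split the points into three coordinate series
    return fmean(xs), fmean(ys), fmean(zs)


# Two made-up points, only to show the call.
sample = [(0.0625, 0.0625, 0.00390625), (0.9375, 0.9375, 0.87890625)]
mu = mean_descriptor(sample)
```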
Then standard deviation of x, y, z is given by \(V_{x} ,\;V_{y}\) and \(V_{z}\) respectively, where
The M/M matrix is calculated as
The J/J matrix is calculated as
We measure the similarity/dissimilarity between the coding sequences with a distance matrix computed in one of four ways:
(1) the Euclidean distance between the three-component vectors \((\mu_{x}, \mu_{y}, \mu_{z})\) of a pair of sequences; (2) the Euclidean distance between the three-component vectors \((V_{x}, V_{y}, V_{z})\) of a pair of sequences; (3) the modulus of the difference between the highest eigenvalues of the M/M matrices; (4) the modulus of the difference between the highest eigenvalues of the J/J matrices.
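The four distance matrices share one pattern: compute a descriptor per sequence, then compare descriptors pairwise. A minimal sketch is given below, assuming the descriptors have already been computed; the names `distance_matrix` and `eigen_gap` and the toy values in the usage note are hypothetical.

```python
import math


def eigen_gap(a, b):
    """Distance for cases (3) and (4): modulus of the difference of two eigenvalues."""
    return abs(a - b)


def distance_matrix(descriptors, dist):
    """Build a symmetric distance matrix from a {name: descriptor} mapping.

    `dist` is math.dist for the vector descriptors of cases (1) and (2)
    and eigen_gap for the scalar descriptors of cases (3) and (4).
    """
    names = list(descriptors)
    matrix = [[dist(descriptors[a], descriptors[b]) for b in names] for a in names]
    return names, matrix
```

For example, `distance_matrix({"s1": (0.0, 0.0, 0.0), "s2": (3.0, 4.0, 0.0)}, math.dist)` gives a 2 × 2 matrix with zeros on the diagonal and 5.0 off it; these descriptor values are made up purely to show the call.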
The smaller an entry in the distance matrix, the more similar the corresponding DNA sequences. Distances between evolutionarily closely related species are therefore expected to be smaller, and those between evolutionarily distant species larger. We draw the phylogenetic tree from the similarity/dissimilarity matrix using UPGMA in the MEGA4 software [20].
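To make the tree-building step concrete, here is a minimal pure-Python sketch of UPGMA (average-linkage clustering). It is not MEGA4's implementation, and the function name `upgma` and its return format are assumptions for illustration; it takes the names and a symmetric distance matrix and returns a nested-tuple tree together with the height of each merge.

```python
def upgma(names, dist):
    """Cluster with UPGMA; return (tree, merges).

    tree   : nested tuples of names, e.g. (("a", "b"), "c")
    merges : list of (subtree, height), height = half the merged distance
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to each remaining one:
        # the size-weighted average of the two old distances.
        new_d = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        # Drop every distance that involved the two merged clusters.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = ((ti, tj), ni + nj)
        merges.append(((ti, tj), dij / 2.0))
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges
```

Ties between equally close pairs are broken by insertion order here, so a different tool may draw a tied tree differently.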
3 Result and Discussion
Table 1 gives information on the sequences of the 10 species, and Table 2 shows the distance matrix of the complete coding sequences of their β-globin genes based on the highest eigenvalue of the J/J matrix. Distance matrices and the corresponding phylogenetic trees are obtained similarly in the other three cases. We observe that the phylogenetic tree in Fig. 1, built from the highest eigenvalue of the J/J matrix, gives the best result of the four. In Fig. 1, the more similar species pairs (Mouse and Rat, Tufted Monkey and Woolly Monkey, Hare and Rabbit, Gallus and Duck) are placed close to each other. Our phylogenetic tree agrees with that found in [16] for the species in common.
4 Conclusion
In this paper, we propose a new method of nucleotide representation using chaos game theory. Comparing four descriptors (mean, standard deviation, highest eigenvalue of the M/M matrix and highest eigenvalue of the J/J matrix), we conclude that the highest eigenvalue of the J/J matrix is the best for comparing the complete coding sequences of the β-globin genes of the above 10 species. We therefore conclude that our method is effective for evaluating sequence similarities on an intuitive basis. However, the method has been tested on only 10 sequences; in the near future we would like to apply it to a larger number of species.
References
1. Hamori, E., Ruskin, J.: H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences. J. Biol. Chem. 258, 1318–1327 (1983)
2. Guo, F.B., Ou, H.Y., Zhang, C.T.: ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucl. Acids Res. 31, 1780–1789 (2003)
3. Zhang, C.T., Zhang, R.: Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucl. Acids Res. 19, 6313–6317 (1991)
4. Zhang, R., Zhang, C.T.: Z curves, an intuitive tool for visualizing and analyzing the DNA sequences. J. Biomol. Struct. Dyn. 11, 767–782 (1994)
5. Nandy, A.: A new graphical representation and analysis of DNA sequence structure: I. Methodology and application to globin genes. Curr. Sci. 66, 309–314 (1994)
6. Randic, M., Vracko, M., Lers, N., Plavsic, D.: Novel 2-D graphical representation of DNA sequences and their numerical characterization. Chem. Phys. Lett. 368, 1–6 (2003)
7. Randic, M., Vracko, M., Zupan, J., Novic, M.: Compact 2-D graphical representation of DNA. Chem. Phys. Lett. 373, 558–562 (2003)
8. Liao, B., Wang, T.M.: Analysis of similarity/dissimilarity of DNA sequences based on 3-D graphical representation. Chem. Phys. Lett. 388, 195–200 (2004)
9. Randic, M.: Graphical representations of DNA as 2-D map. Chem. Phys. Lett. 386, 468–471 (2004)
10. Liao, B., Wang, T.M.: 3-D graphical representation of DNA sequences and their numerical characterization. J. Mol. Struct. (Theochem) 681, 209–212 (2004)
11. Chi, R., Ding, K.Q.: Novel 4D numerical representation of DNA sequences. Chem. Phys. Lett. 407, 63–67 (2005)
12. Yao, Y.H., Nan, X.Y., Wang, T.M.: A new 2D graphical representation—Classification curve and the analysis of similarity/dissimilarity of DNA sequences. J. Mol. Struct. (Theochem) 764, 101–108 (2006)
13. Liao, B., Ding, K.Q.: A 3D graphical representation of DNA sequences and its application. Theor. Comput. Sci. 358, 56–64 (2006)
14. Song, J., Tang, H.W.: A new 2-D graphical representation of DNA sequences and their numerical characterization. J. Biochem. Biophys. Methods 63, 228–239 (2005)
15. Zhang, Z.J.: DV-Curve: a novel intuitive tool for visualizing and analyzing DNA sequences. Bioinformatics 25, 1112–1117 (2009)
16. Yu, J., Wang, J., Sun, X.: Analysis of similarities/dissimilarities of DNA sequences based on a novel graphical representation. MATCH Commun. Math. Comput. Chem. 63, 493–512 (2010)
17. Das, S., Palit, S., Mahalanabish, A.R., Choudhury, N.R.: A new way to find similarity/dissimilarity of DNA sequences on the basis of dinucleotides representation. In: Computational Advancement in Communication Circuits and System, pp. 151–160. Springer (2015)
18. Randic, M., Zupan, J., Balaban, A.T.: Unique graphical representation of protein sequences based on nucleotide triplet codons. Chem. Phys. Lett. 397, 247–252 (2004)
19. Luo, J., Guo, J., Li, Y.: A new graphical representation and its application in similarity/dissimilarity analysis of DNA sequences. In: 4th International Conference on Bioinformatics and Biomedical Engineering (2010). doi:10.1109/ICBBE.2010.5515203
20. Kumar, S., Nei, M., Dudley, J., Tamura, K.: MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief. Bioinform. 9, 299–306 (2008)
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Das, S., Choudhury, N.R., Tibarewala, D.N., Bhattacharya, D.K. (2018). Application of Chaos Game in Tri-Nucleotide Representation for the Comparison of Coding Sequences of β-Globin Gene. In: Bhattacharyya, S., Sen, S., Dutta, M., Biswas, P., Chattopadhyay, H. (eds) Industry Interactive Innovations in Science, Engineering and Technology . Lecture Notes in Networks and Systems, vol 11. Springer, Singapore. https://doi.org/10.1007/978-981-10-3953-9_54
DOI: https://doi.org/10.1007/978-981-10-3953-9_54
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3952-2
Online ISBN: 978-981-10-3953-9