Application of Chaos Game in Tri-Nucleotide Representation for the Comparison of Coding Sequences of β-Globin Gene

Das, Subhram; Choudhury, Nobhonil Roy; Tibarewala, D. N.; Bhattacharya, D. K.

doi:10.1007/978-981-10-3953-9_54


Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 11)


Abstract

In this paper, we use a 2D tri-nucleotide representation based on chaos game theory. We extend the representation from 2D to 3D by taking the third coordinate as the product of the first two. The complete coding sequences of the β-globin genes of 10 species are then compared using four descriptors: (1) the mean of the components of the represented sequence, (2) the standard deviation of the components of the represented sequence, (3) the highest eigenvalue of the M/M matrix, and (4) the highest eigenvalue of the J/J matrix. The results in the four cases are critically examined. The highest eigenvalue of the J/J matrix is found to be the best of the four descriptors.


1 Introduction

In bioinformatics, the basic strategy for studying DNA and protein sequences is to compare them properly. There are two main types of comparison methods: those based on alignment and those that are alignment-free. The latter are preferred because they are less time consuming, and they are mostly based on mononucleotide representation. One such graphical representation was first given by Hamori and Ruskin in 1983 [1]. Graphical representations range from 2D to 6D. However, working directly with mononucleotides (A, G, C and T) leads to considerable information loss, so di- and tri-nucleotide representations were considered. The mononucleotide models cannot represent di- and tri-nucleotides without complex calculations [2], so such representations were developed independently [3,4,5,6,7,8,9,10,11,12,13,14,15]. However, the following limitations remain. (1) For 3D representation, there are only 64 represented values, so the mean of such represented coordinates is naturally not of much interest. Even if the cumulative values vary widely, the mean is still not a very satisfactory descriptor. A final comparison based on the means of the two types of represented points (normal and cumulative) may therefore not be applicable to a larger variety of samples [16]. (2) Standard deviation shows the spread of the data rather than locating a theoretical centre, and the cumulative components reduce redundancy. Even so, comparisons that use the standard deviation of the cumulative data set as the descriptor are also found to be unsatisfactory [17]. (3) There is always a risk in taking cumulative values, as the resulting time series becomes stochastic. (4) The numerical values and signs used for tri-nucleotide representation [16, 17] appear to be highly artificial.

To avoid these difficulties, we take a very simple approach. We take only the 64 different values obtained by chaos game representation [18] as the 2D representation of the tri-nucleotides, and make the representation 3D by taking the third coordinate as the product of the first two. Because the represented values now differ from those obtained by earlier methods, we check for improvement in the results by choosing, in turn, the following descriptors: mean, standard deviation, highest eigenvalue of the M/M matrix, and highest eigenvalue of the J/J matrix. Compared with the traditional matrix, the J/J matrix can investigate the composition, distribution and chemical properties of bases; it can also depict the biological significance of the sequence [19]. We therefore try both the M/M and J/J matrices, with the expectation that J/J might give better results.

What makes our tri-nucleotide representation better than the previous representations [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17] is that it is a much simpler and hence more efficient method. The essential difference lies in obtaining the 2D tri-nucleotide representation with the help of chaos game theory: the 2D coordinates are obtained in a very natural way using the chaos game with only the initial values of the nucleotides. The exons of β-globin genes of different species are known to be essential for pharmaceutical purposes, so we chose the coding sequences of β-globin genes for sequence comparison.

2 Methodology

We apply the chaos game values shown in the graphical representation of [18] to the non-overlapping triplets of the given DNA sequence, and use them to calculate the statistical parameters analysed in this paper.

Each nucleotide triplet is represented by three coordinates (x, y, z): the first two are obtained from the above chaos game representation, and the third is the product of the first two. Let N = M/3 be the number of codons in the sequence, where M is the length of the DNA sequence.

Let $x = (x_{1}, x_{2}, x_{3}, \ldots, x_{N})$, $y = (y_{1}, y_{2}, y_{3}, \ldots, y_{N})$, $z = (z_{1}, z_{2}, z_{3}, \ldots, z_{N})$ be the 3D represented points.
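As a minimal sketch of this mapping, the Python below computes chaos game coordinates for each non-overlapping triplet and appends their product as the third coordinate. The paper takes its 64 values from the figure in [18] and does not list them, so the corner assignment (A, C, G, T on the unit square), the starting point (0.5, 0.5) and the midpoint rule used here are assumptions borrowed from the standard chaos game construction, and the names `CORNERS`, `codon_point` and `represent` are purely illustrative.

```python
from itertools import product

# Assumed corner assignment on the unit square (standard chaos game layout);
# the values actually used in the paper come from the figure in [18].
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def codon_point(codon):
    """Return (x, y, z) for one triplet: chaos game (x, y) and z = x * y."""
    x, y = 0.5, 0.5  # assumed starting point: centre of the square
    for base in codon:
        cx, cy = CORNERS[base]
        # Chaos game step: move halfway towards the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return (x, y, x * y)


def represent(sequence):
    """Map a coding sequence to N = M // 3 points using non-overlapping triplets."""
    seq = sequence.upper()
    n_codons = len(seq) // 3  # trailing bases that do not fill a codon are ignored
    return [codon_point(seq[3 * i:3 * i + 3]) for i in range(n_codons)]


if __name__ == "__main__":
    all_codons = ["".join(c) for c in product("ACGT", repeat=3)]
    # Count the distinct (x, y) pairs produced by the 64 possible triplets.
    print(len({codon_point(c)[:2] for c in all_codons}))
    print(represent("ATGGTGCAC"))
```

Under these assumptions each of the 64 triplets lands on its own point of an 8 by 8 grid, which is consistent with the "64 different values" mentioned in the text. Symbols other than A, C, G and T are not handled in this sketch.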

Then the means of x, y and z are $\mu_{x}$, $\mu_{y}$ and $\mu_{z}$ respectively, where

$$\mu_{x} = \frac{1}{N}\sum\limits_{i = 1}^{N} x_{i}, \quad \mu_{y} = \frac{1}{N}\sum\limits_{i = 1}^{N} y_{i}, \quad \mu_{z} = \frac{1}{N}\sum\limits_{i = 1}^{N} z_{i}$$

The standard deviations of x, y and z are $V_{x}$, $V_{y}$ and $V_{z}$ respectively, where

$$V_{x} = \sqrt{\frac{\sum\limits_{i = 1}^{N} (x_{i} - \mu_{x})^{2}}{N}}, \quad V_{y} = \sqrt{\frac{\sum\limits_{i = 1}^{N} (y_{i} - \mu_{y})^{2}}{N}}, \quad V_{z} = \sqrt{\frac{\sum\limits_{i = 1}^{N} (z_{i} - \mu_{z})^{2}}{N}}$$
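The first two descriptors can be sketched with the Python standard library as follows. The function names are illustrative, and the input is assumed to be a list of (x, y, z) tuples, one per codon. `pstdev` is used because the formula above divides by N rather than N - 1.

```python
from statistics import fmean, pstdev


def mean_descriptor(points):
    """Return (mu_x, mu_y, mu_z) for a list of (x, y, z) points."""
    xs, ys, zs = zip(*points)
    return (fmean(xs), fmean(ys), fmean(zs))


def std_descriptor(points):
    """Return (V_x, V_y, V_z): population standard deviations (divide by N)."""
    xs, ys, zs = zip(*points)
    return (pstdev(xs), pstdev(ys), pstdev(zs))


if __name__ == "__main__":
    # Two arbitrary points chosen only to exercise the functions.
    pts = [(1.0, 0.0, 2.0), (3.0, 0.0, 6.0)]
    print(mean_descriptor(pts))
    print(std_descriptor(pts))
```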

The M/M matrix is calculated as

$$M_{i,j} = \frac{{\sqrt {(x_{i} - x_{j} )^{2} + \left( {y_{i} - y_{j} } \right)^{2} + (z_{i} - z_{j} )^{2} } }}{{\left| {x_{i} - x_{j} } \right| + \left| {y_{i} - y_{j} } \right| + \left| {z_{i} - z_{j} } \right|}}$$

The J/J matrix is calculated as

$$J_{i,j} = \frac{\sqrt{(x_{i} - x_{j})^{2} + (y_{i} - y_{j})^{2}}}{\left| x_{i} - x_{j} \right| + \left| y_{i} - y_{j} \right|} + \frac{\sqrt{(z_{i} - z_{j})^{2}}}{\left| z_{i} + z_{j} \right|}$$
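A hedged NumPy sketch of the two matrices and their highest eigenvalue is given below. It reads the J/J entry as the planar ratio in (x, y) plus |z_i - z_j| / |z_i + z_j|. The paper does not say how to treat entries whose denominator is zero (the diagonal, or two codons that map to the same point), so setting those terms to 0 is an assumption of this sketch, as are the function names.

```python
import numpy as np


def mm_matrix(points):
    """M/M matrix: Euclidean distance divided by the sum of absolute differences."""
    p = np.asarray(points, dtype=float)
    diff = p[:, None, :] - p[None, :, :]          # pairwise coordinate differences
    euclid = np.sqrt((diff ** 2).sum(axis=2))
    manhattan = np.abs(diff).sum(axis=2)
    out = np.zeros_like(euclid)
    # Assumption: entries with a zero denominator (identical points) are set to 0.
    np.divide(euclid, manhattan, out=out, where=manhattan > 0)
    return out


def jj_matrix(points):
    """J/J matrix: planar (x, y) ratio plus |z_i - z_j| / |z_i + z_j|."""
    p = np.asarray(points, dtype=float)
    dxy = p[:, None, :2] - p[None, :, :2]
    num_xy = np.sqrt((dxy ** 2).sum(axis=2))
    den_xy = np.abs(dxy).sum(axis=2)
    z = p[:, 2]
    num_z = np.abs(z[:, None] - z[None, :])
    den_z = np.abs(z[:, None] + z[None, :])
    first = np.zeros_like(num_xy)
    second = np.zeros_like(num_z)
    # Assumption: any term with a zero denominator contributes 0.
    np.divide(num_xy, den_xy, out=first, where=den_xy > 0)
    np.divide(num_z, den_z, out=second, where=den_z > 0)
    return first + second


def leading_eigenvalue(matrix):
    """Largest eigenvalue of a real symmetric matrix."""
    return float(np.linalg.eigvalsh(matrix)[-1])  # eigvalsh sorts in ascending order


if __name__ == "__main__":
    pts = [(0.0, 0.0, 1.0), (3.0, 4.0, 3.0), (1.0, 1.0, 1.0)]  # arbitrary test points
    print(leading_eigenvalue(mm_matrix(pts)))
    print(leading_eigenvalue(jj_matrix(pts)))
```

Both matrices are symmetric by construction, which is why `numpy.linalg.eigvalsh` is used rather than the general eigenvalue routine. The pairwise arrays need memory proportional to N squared, which is modest for a single coding sequence.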

We calculate the similarity/dissimilarity between the coding sequences from a distance matrix whose entries are measured in one of the following ways:

(1) The Euclidean distance between the three-component vectors $(\mu_{x}, \mu_{y}, \mu_{z})$ of a pair of sequences. (2) The Euclidean distance between the three-component vectors $(V_{x}, V_{y}, V_{z})$ of a pair of sequences. (3) The modulus of the difference of the highest eigenvalues of the M/M matrices. (4) The modulus of the difference of the highest eigenvalues of the J/J matrices.
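The four distance definitions reduce to two small helpers, sketched here with the standard library. The function names are illustrative, and the numbers in the usage example are arbitrary placeholders rather than values from the paper.

```python
import math


def euclidean_distance_matrix(vectors):
    """Pairwise Euclidean distances between descriptor vectors (cases 1 and 2)."""
    return [[math.dist(u, v) for v in vectors] for u in vectors]


def eigenvalue_distance_matrix(eigenvalues):
    """Pairwise |lambda_a - lambda_b| between highest eigenvalues (cases 3 and 4)."""
    return [[abs(a - b) for b in eigenvalues] for a in eigenvalues]


if __name__ == "__main__":
    # Placeholder descriptors for three hypothetical sequences.
    means = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0), (0.5, 0.5, 0.25)]
    lambdas = [1.5, 4.0, 2.0]
    for row in euclidean_distance_matrix(means):
        print(row)
    for row in eigenvalue_distance_matrix(lambdas):
        print(row)
```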

The smaller the entry in the distance matrix, the more similar the DNA sequences are. Therefore, we can say that the distances between evolutionarily closely related species are smaller, while those between evolutionarily distant species are larger. We draw the phylogenetic tree from the similarity/dissimilarity matrix using UPGMA in the MEGA4 software [20].
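The trees in the paper are produced with MEGA4. Purely to illustrate what UPGMA does with such a distance matrix, the following self-contained Python sketch repeatedly merges the two closest clusters and averages distances weighted by cluster size. It returns only the tree topology as nested tuples (no branch lengths), and the function name `upgma` and the example matrix are assumptions of this sketch, not part of the authors' workflow.

```python
def upgma(names, dist):
    """Return the UPGMA topology as nested tuples for a symmetric distance matrix."""
    # Each cluster id maps to (subtree, number of leaves in the subtree).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        updated = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average keeps every leaf pair equally weighted.
            updated[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop all distances that involve the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(updated)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


if __name__ == "__main__":
    # Arbitrary illustrative distances between four hypothetical taxa.
    names = ["A", "B", "C", "D"]
    dist = [[0, 1, 6, 6],
            [1, 0, 6, 6],
            [6, 6, 0, 2],
            [6, 6, 2, 0]]
    print(upgma(names, dist))
```

When two pairs are equally close, this sketch merges whichever pair it encounters first, so tied inputs may yield a different but equally valid topology from the one a dedicated package reports.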

3 Result and Discussion

Table 1 gives information on the sequences of the 10 species, and Table 2 shows the distance matrix of the complete coding sequences of the β-globin genes of these species based on the highest eigenvalue of the J/J matrix. The distance matrices and the corresponding phylogenetic trees for the other three cases are obtained similarly. We observe that the phylogenetic tree in Fig. 1, built from the highest eigenvalue of the J/J matrix, gives the best result of the four. From Fig. 1 we also observe that the more similar species pairs, such as Mouse and Rat, Tufted Monkey and Woolly Monkey, Hare and Rabbit, and Gallus and Duck, come close to each other. Our phylogenetic tree agrees with that found in [16] for the species taken in common.

Table 1 The complete coding sequences of β globin genes of 10 species

Table 2 Distance matrix for the coding sequences of β globin genes of 10 species based on the highest eigenvalue of the J/J matrix

4 Conclusion

In this paper, we propose a new method of nucleotide representation using chaos game theory. By comparing four descriptors (mean, standard deviation, highest eigenvalue of the M/M matrix and highest eigenvalue of the J/J matrix), we conclude that the highest eigenvalue of the J/J matrix is the best for comparing the complete coding sequences of the β-globin genes of the above 10 species. We therefore conclude that our method is effective for evaluating sequence similarities on an intuitive basis. However, the method has been tested on only 10 sequences; in the near future we would like to apply it to a larger number of species.

References

Hamori, E., Ruskin, J.: H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences. J. Biol. Chem. 258, 1318–1327 (1983)
Guo, F.B., Ou, H.Y., Zhang, C.T.: ZCURVE: a new system for recognizing protein–coding genes in bacterial and archaeal genomes. Nucl. Acids Res. 31, 1780–1789 (2003)
Zhang, C.T., Zhang, R.: Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucl. Acids Res. 19, 6313–6317 (1991)
Zhang, R., Zhang, C.T.: Z curves, an intuitive tool for visualizing and analyzing the DNA sequences. J. Biomol. Struct. Dyn. 11, 767–782 (1994)
Nandy, A.: A new graphical representation and analysis of DNA sequence structure: I. Methodology and application to globin genes. Curr. Sci. 66, 309–314 (1994)
Randic, M., Vracko, M., Lers, N., Plavsic, D.: Novel 2–D graphical representation of DNA sequences and their numerical characterization. Chem. Phys. Lett. 368, 1–6 (2003)
Randic, M., Vracko, M., Zupan, J., Novic, M.: Compact 2–D graphical representation of DNA. Chem. Phys. Lett. 373, 558–562 (2003)
Liao, B., Wang, T.M.: Analysis of similarity/dissimilarity of DNA sequences based on 3–D graphical representation. Chem. Phys. Lett. 388, 195–200 (2004)
Randic, M.: Graphical representations of DNA as 2–D map. Chem. Phys. Lett. 386, 468–471 (2004)
Liao, B., Wang, T.M.: 3–D graphical representation of DNA sequences and their numerical characterization. J. Mol. Struct. (Theochem) 681, 209–212 (2004)
Chi, R., Ding, K.Q.: Novel 4D numerical representation of DNA sequences. Chem. Phys. Lett. 407, 63–67 (2005)
Yao, Y.H., Nan, X.Y., Wang, T.M.: A new 2D graphical representation—Classification curve and the analysis of similarity/dissimilarity of DNA sequences. J. Mol. Struct. (Theochem) 764, 101–108 (2006)
Liao, B., Ding, K.Q.: A 3D graphical representation of DNA sequences and its application. Theor. Comput. Sci. 358, 56–64 (2006)
Song, J., Tang, H.W.: A new 2–D graphical representation of DNA sequences and their numerical characterization. J. Biochem. Biophys. Methods 63, 228–239 (2005)
Zhang, Z.J.: DV–Curve: a novel intuitive tool for visualizing and analyzing DNA sequences. Bioinformatics 25, 1112–1117 (2009)
Yu, J., Wang, J., Sun, X.: Analysis of similarities/dissimilarities of DNA sequences based on a novel graphical representation. MATCH Commun. Math. Comput. Chem. 63, 493–512 (2010)
Das, S., Palit, S., Mahalanabish, A.R., Choudhury, N.R.: A new way to find similarity/dissimilarity of DNA sequences on the basis of dinucleotides representation. In: Computational Advancement in Communication Circuits and System, pp. 151–160. Springer (2015)
Randic, M., Zupan, J., Balaban, A.T.: Unique graphical representation of protein sequences based on nucleotide triplet codons. Chem. Phys. Lett. 397, 247–252 (2004)
Luo, J., Guo, J., Li, Y.: A new graphical representation and its application in similarity/dissimilarity analysis of DNA sequences. In: 4th International Conference on Bioinformatics and Biomedical Engineering (2010). doi:10.1109/ICBBE.2010.5515203
Kumar, S., Nei, M., Dudley, J., Tamura, K.: MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief. Bioinform. 9, 299–306 (2008)


Author information

Authors and Affiliations

Computer Science & Engineering, Narula Institute of Technology, Kolkata, India
Subhram Das & Nobhonil Roy Choudhury
Bio-Science & Engineering, Jadavpur University, Kolkata, India
D. N. Tibarewala & D. K. Bhattacharya


Corresponding author

Correspondence to Subhram Das.

Editor information

Editors and Affiliations

Department of Electronics and Communication, JIS College of Engineering (JISCE), Kalyani, West Bengal, India
Swapan Bhattacharyya
Department of Physics and Nanoscience and Technology, JIS College of Engineering (JISCE), Kalyani, West Bengal, India
Sabyasachi Sen
Department of Biomedical Engineering, JIS College of Engineering (JISCE), Kalyani, West Bengal, India
Meghamala Dutta
Department of Electrical Engineering, JIS College of Engineering (JISCE), Kalyani, West Bengal, India
Papun Biswas
Department of Mechanical Engineering, Jadavpur University, Kolkata, West Bengal, India
Himadri Chattopadhyay

About this paper

Cite this paper

Das, S., Choudhury, N.R., Tibarewala, D.N., Bhattacharya, D.K. (2018). Application of Chaos Game in Tri-Nucleotide Representation for the Comparison of Coding Sequences of β-Globin Gene. In: Bhattacharyya, S., Sen, S., Dutta, M., Biswas, P., Chattopadhyay, H. (eds) Industry Interactive Innovations in Science, Engineering and Technology. Lecture Notes in Networks and Systems, vol 11. Springer, Singapore. https://doi.org/10.1007/978-981-10-3953-9_54

DOI: https://doi.org/10.1007/978-981-10-3953-9_54
Published: 21 July 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3952-2
Online ISBN: 978-981-10-3953-9
eBook Packages: Engineering, Engineering (R0)
