Abstract
Searching for similarity among biological sequences is an important research area in bioinformatics: it can provide insight into the evolutionary and genetic relationships between species, which opens doors to new scientific discoveries such as drug design and treatment. In this paper, we introduce a novel measure of similarity between two biological sequences that does not require alignment. The method is based on the concept of spectral distortion measures developed for signal processing. The proposed method was tested on a set of six DNA sequences taken from Escherichia coli K-12 and Shigella flexneri, plus one random sequence. It was further tested on a more complex dataset of 40 DNA sequences taken from the GenBank sequence database. The results obtained with the proposed method were found to be superior to those of some existing methods for measuring the similarity of DNA sequences.
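To make the idea of an alignment-free, spectrum-based comparison concrete, the following is a minimal sketch and not the method of the paper. It assumes a binary indicator encoding of the four nucleotides, a plain periodogram as the spectral estimate, and the root-mean-square log-spectral distance as the distortion measure; the function names `power_spectrum` and `log_spectral_distance`, the parameter `n_freqs`, and the `floor` constant are hypothetical choices made only for illustration.

```python
import cmath
import math

BASES = "ACGT"  # assumed alphabet; other symbols are simply ignored


def power_spectrum(seq, n_freqs=64):
    """Sum of the periodograms of the four binary indicator sequences.

    The spectrum is sampled at n_freqs frequencies in (0, pi], so two
    sequences of different lengths yield vectors of the same size.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    spectrum = []
    for k in range(1, n_freqs + 1):  # k = 0 (the DC term) is skipped
        omega = math.pi * k / n_freqs
        total = 0.0
        for base in BASES:
            acc = 0j
            for t, symbol in enumerate(seq):
                if symbol == base:  # indicator value is 1 at this position
                    acc += cmath.exp(-1j * omega * t)
            total += abs(acc) ** 2
        spectrum.append(total / n)  # normalise by sequence length
    return spectrum


def log_spectral_distance(seq_a, seq_b, n_freqs=64, floor=1e-9):
    """Root-mean-square difference of the log power spectra.

    `floor` is an illustrative guard against taking log(0).
    """
    spec_a = power_spectrum(seq_a, n_freqs)
    spec_b = power_spectrum(seq_b, n_freqs)
    squared = [
        (math.log(max(a, floor)) - math.log(max(b, floor))) ** 2
        for a, b in zip(spec_a, spec_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    x = "ACGTACGTACGTACGT"
    y = "ACGTACGAACGTACGT"  # one substitution relative to x
    z = "AAAAAAAATTTTTTTT"
    print(log_spectral_distance(x, y))
    print(log_spectral_distance(x, z))
```

No alignment step appears anywhere: each sequence is reduced to a fixed-size spectral vector, and only those vectors are compared. A smaller value indicates more similar spectra under these assumptions. The direct DFT used here costs time proportional to sequence length times `n_freqs`, which is adequate for a demonstration but not for genome-scale input.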
Keywords
- Markov Chain Model
- Biological Sequence
- Distortion Measure
- Chaos Game Representation
- Signal Analysis Method
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pham, T.D. (2006). Similarity Searching in DNA Sequences by Spectral Distortion Measures. In: Perner, P. (eds) Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining. ICDM 2006. Lecture Notes in Computer Science, vol 4065. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11790853_3
DOI: https://doi.org/10.1007/11790853_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36036-0
Online ISBN: 978-3-540-36037-7
eBook Packages: Computer Science (R0)