SWORDS: A statistical tool for analysing large DNA sequences

Chaudhuri, Probal; Das, Sandip

doi:10.1007/BF02703678

Published: February 2002

Volume 27, pages 1–6, (2002)

Journal of Biosciences


Abstract

In this article, we present some simple yet effective statistical techniques for analysing and comparing large DNA sequences. These techniques are based on frequency distributions of DNA words in a large sequence and have been packaged into a software tool called SWORDS. Using sequences available in public-domain databases on the Internet, we demonstrate how molecular biologists and geneticists can conveniently use SWORDS to unmask biologically important features hidden in large sequences and to assess their statistical significance.
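
As a rough illustration of what a frequency distribution of DNA words looks like in practice, the following Python sketch counts overlapping words of a fixed length and compares two sequences by the total absolute difference between their word-frequency profiles. It is not the SWORDS implementation: the function names `word_frequencies` and `frequency_distance`, the use of overlapping words, and the choice of distance are assumptions made here for illustration only.

```python
from collections import Counter


def word_frequencies(sequence, k):
    """Relative frequencies of overlapping k-letter words in a DNA sequence.

    Assumption: words overlap (the window slides by one base) and the
    sequence is treated as plain text; ambiguous bases are not filtered.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def frequency_distance(freq_a, freq_b):
    """Sum of absolute differences between two word-frequency profiles.

    Assumption: this simple distance is chosen for illustration; the
    article may use a different measure.
    """
    words = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) for w in words)


if __name__ == "__main__":
    # Toy sequences, not real genomic data.
    profile_a = word_frequencies("ACGTACGTACGT", 2)
    profile_b = word_frequencies("AAAACCCCGGGG", 2)
    print(sorted(profile_a.items()))
    print(frequency_distance(profile_a, profile_b))
```

Identical sequences give a distance of zero under this measure, and sequences that share no words give the maximum value of two, so the number can be read as a crude dissimilarity score between word-usage patterns.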


Author information

Authors and Affiliations

Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203 BT Road, 700 108, Kolkata, India
Probal Chaudhuri & Sandip Das

Corresponding author

Correspondence to Probal Chaudhuri.

About this article

Cite this article

Chaudhuri, P., Das, S. SWORDS: A statistical tool for analysing large DNA sequences. J Biosci 27, 1–6 (2002). https://doi.org/10.1007/BF02703678

Issue Date: February 2002
DOI: https://doi.org/10.1007/BF02703678
