Abstract
In this article, we present some simple yet effective statistical techniques for analysing and comparing large DNA sequences. These techniques are based on the frequency distributions of DNA words in a large sequence and have been packaged into a software tool called SWORDS. Using sequences available in public-domain databases on the Internet, we demonstrate how molecular biologists and geneticists can conveniently use SWORDS to unmask biologically important features hidden in large sequences and to assess their statistical significance.
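As an illustration only, the following minimal Python sketch shows what a frequency distribution of DNA words can look like in practice. It is not taken from SWORDS: the function names `word_frequencies` and `frequency_distance`, the use of overlapping words, and the sum-of-absolute-differences comparison are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def word_frequencies(seq, k):
    """Relative frequencies of all overlapping k-letter words over A, C, G, T.

    Hypothetical helper for illustration; not part of SWORDS.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all 4**k possible words so absent words get frequency 0.
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are left out of the total.
    total = sum(counts[w] for w in words)
    if total == 0:
        raise ValueError("sequence contains no valid words of length k")
    return {w: counts[w] / total for w in words}


def frequency_distance(freq_a, freq_b):
    """Sum of absolute differences between two word-frequency distributions.

    An assumed, simple dissimilarity; the article may use a different measure.
    """
    return sum(abs(freq_a[w] - freq_b[w]) for w in freq_a)
```

For example, `word_frequencies("ACGTAC", 2)` assigns the word `AC` a relative frequency of 0.4, because it accounts for two of the five overlapping two-letter words in the sequence.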
Cite this article
Chaudhuri, P., Das, S. SWORDS: A statistical tool for analysing large DNA sequences. J Biosci 27, 1–6 (2002). https://doi.org/10.1007/BF02703678
DOI: https://doi.org/10.1007/BF02703678