Abstract
Measures relating word frequencies and expectations have been constantly of interest in Bioinformatics studies. With sequence data becoming massively available, exhaustive enumeration of such measures have become conceivalbe, and yet pose significant computational burdeneven when limited to words of bounded maximum length. In addition, the display of the huge tables possibly resulting from these counts poses practical problems of visualization and inference.
Verbumculus is a suite of software tools for the efficient and fast detection of over- or under-represented words in nucleotide sequences. The inner core ofVerbumculus rests on subtly interwoven properties of statistics, pattern matching and combinatories on words, that enable one to limit drastically anda priori the set of over-or under-represented candidate words of all lengths in a given sequence, thereby rendering it more feasible both to detect and visualize such words in a fast and practically useful way. This paper is devoted to the description of the facility at the outset and to report experimental results, ranging from simulations on synthetic data to the discovery of regulatory elements on the upstream regions of a set of genes of the yeast.
The softwareVerbumculus is accessible at http://www.cs.ucr.edu/~stelo/Verbumculus/or http://wwwdbl. dei.unipd.it/Verbumculus/
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Guyer M S, Collins F S. How is the human genome project doing, and what have we learned so far? InProc. Natl. Acad. Sci. U.S.A., 1995, 92: 10841–10848.
Collins F S, Patrinos A, Jordan Eet al. New goals for the U.S. human genome project: 1998–2003.Science, 1998, 282: 682–689.
Fleischmann R D, Adams M Det al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.Science, 1995, 269: 496–512.
Schena M, Shalom D, Davis R Wet al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray.Science, 1995, 270: 467–470.
Lockhart D J, Dong Het al. Expression monitoring by hybridization to high-density oligonucleotide arrays.Nature Biotechnology, 1996, 14: 1675–1680.
DeRisi J L, Iyer V R, Brown P O. Exploring the metabolblic and genetic control of gene expression on a genomic scale.Science, 1997, 278: 680–686.
Chu S, DeRisi J L, Eisen Michael Bet al. The transcriptional program of sporulation in budding yeast.Science, October 1998, 282: 699–705.
Apostolico A, Bock M E, Lonardi Set al. Efficient detection of unusual words.J. Comput. Bio., January 2000, 7(1/2): 71–94.
Apostolico A, Bock M E, Lonardi S. Monotony of surprise and large-scale quest for unusual words (extended abstract). InProc. Research in Computational Molecular Biology (RECOMB), Myers G, Hannenhalli Set al. (Eds.), Washington DC, April 2002, pp.283–311. Also inJ. Comput. Bio., July 2003, 10: 3–4.
Pesole G, Prunella N, Liuni Set al. Wordup: An efficient algorithm for discovering statistically significant patterns in DNA sequences.Nucleic Acids Res., 1992, 20(11):2871–2875.
van Helden J, André B, Collado-Vides J. Extracting regulatory sites from the upstream region of the yeast genes by computational analysis of oligonucleotides.J. Mol. Biol., 1998, 281: 827–842.
Schbath S, Prum Bet al. Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences.J. Comput. Bio., 1995, 2: 417–437.
Schbath S. An efficient statistic to detect over- and under-represented words in DNA sequences.J. Comput. Bio., 1997, 4: 189–192.
Bräzma A, Jonassen I, Eidhammer Iet al. Approaches to the automatic discovery of patterns in biosequences.J. Comput Bio., 1998, 5(2): 277–304.
Bräzma A, Jonassen I, Ukkonen Eet al. Predicting gene regulatory elements in silico on a genomic scale.Genome Research, 1998, 8(11): 1202–1215.
Bailey T L, Elkan C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization.Machine Learning, 1995, 21(1/2): 51–80.
Jonassen I, Collins J F, Higgins D G Finding flexible patterns in unaligned protein sequences.Protein Science, 1995, 4: 1587–1595.
Jonassen I. Efficient discovery of conserved patterns using a pattern graph.Comput. Appl. Biosci., 1997, 13: 509–522.
Yada T, Totoki Y, Ishikawa Met al. Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences.Bioinformatics, 1998, 14: 317–325.
Califano A. SPLASH: Structural pattern localization analysis by seqeuntial histograming.Bioinformatics, 2000, 15: 341–357.
Rigoutsos I, Floratos A. Combinatorial pattern discovery in biological sequences: TheTeiresias algorithm.Bioinformatics, 1998, 14(1): 55–67.
Hertz G Z, Stormo G D Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.Bioinformatics, 1999, 15: 563–577.
Lawrence C E, Altschul S Fet al. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment.Science, October 1993, 262: 208–214.
Neuwald A F, Liu J S, Lawrence C E. Gibbs motif sampling: Detecting bacterial outer membrane protein repeats.Protein Science, 1995, 4: 1618–1632.
Pevzner P A, Sze S H Combinatorial approaches to finding subtle signals in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, 2000, pp.269–278.
Keich U, Pevzner P A. Finding motifs in the twilight zone. InAnnual Int. Conf. Computational Molecular Biology, Washington DC, April 2002, pp.195–204.
Buhler J, Tompa M. Finding motifs using random projections.J. Comput. Bio., 2002, 9(2): 225–242.
Pavesi G, Mauri G, Pesole G An algorithm for finding signals of unknown length in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, 2001, pp.S207-S214.
Eskin E, Pevzner P A. Finding composite regulatory patterns in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, Bioinformatics AAAI Press, Menlo Park, CA, 2002, pp.S181-S188.
Apostolico A, Galil Z (Eds.). Pattern Matching Algorithms. Oxford University Press. 1997.
Brendel V, Beckmann J Set al. Linguistics of nucleotide sequences: Morphology and comparison of vocabularies.J. Biomol. Struct. Dynamics, 1986, 4(1): 11–21.
Stückle E E, Emmrich C, Grob U, Nielsen P J. Statistical analysis of nucleotide sequences.Nucleic Acids Res., 1990, 18(22): 6641–6647.
Apostolico A. Pattern discovery and the algorithmics of surprise. InArtificial Intelligence and Heuristic, Methods for Bioinformatics, Frasconi P, Shamir R (Eds.), IOS Press, 2003, pp.111–127.
McCreight E M. A space-economical suffix tree construction algorithm.J. Assoc. Comput. Mach., April 1976, 23(2): 262–272.
Apostolico A. The myriad virtues of suffix trees. InCombinatorial Algorithms on Words, Vol. 12 ofNATO Advanced Science Institutes, Series F, Apostolico A, Galil Z (Eds), Berlin: Springer-Verlag, 1985, pp.85–96.
Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology Cambridge University Press, 1997.
Hui L C K. Color set size problem with applications to string matching. InProc. the 3rd Annual Symp. Combinatorial Pattern Matching.Lecture Notes in Computer Science 644, Apostolico A, Crochemore Met al. (Eds.), Berlin: Springer-Verlag, 1992, pp.230–243.
Gansner E R, Koutsofios E, North S, Vo K-P. A technique for drawing directed graphs.IEEE Trans. Software Eng., 1993, 19(3): 214–230.
Leung M Y, Marsh G M, Speed T P. Over and underrepresentation of short DNA words in herpesvirus genomes.J. Comput. Bio., 1996, 3: 345–360.
Apostolico A, Giancarlo R. Sequence alignment in molecular biology.J. Comput. Bio., 1998, 5(2): 173–196.
Wingender E, Dietze P, Karas Het al. Transfac: A database on transcription factors and their DNA binding sites.Nucleic Acids Res., 1996, 24: 238–241. http://transfac.gbf-braunschweig.de/TRANSFAC/.
Wingender E, Chen X, Hehl Ret al. Transfac: An integrated system for gene expression regulation.Nucleic Acids Res., 2000 28: 316–319. http://transfac.gbf-braunschweig.de/TRANSFAC/.
Luche R M, Sumrada R, Cooper T G. A cis-acting element present in multiple genes serves as a repressor protein binding site for the yeast CAR1 gene.Mol. Cell. Biol. 1990, 10: 3884–3895.
Strich R, Surosky R T, Steber Cet al. UME6 is a key regulator of nitrogen repression and meiotic development.Genes Dev., 1994, 8: 796–810.
Amati B, Gasser S M. Drosophila scaffold-attached regions bind nuclear scaffolds and can function as MARS elements in both budding and fission yeast.Mol. Cell. Biol., 1990, 10: 5442–5454.
Strissel P L, Dann H Aet al. Scaffold-associated regions in the human type I interferon gene cluster on the short arm of chromosome 9.Genomics, 1998, 47: 217–229.
Gasser S M. Nuclear scaffold and high-order folding of eukaryotic DNA. InArchitecture of Eukaryotic Genes, Kahl G (Ed.), VCH Verlagsgeselschaft, Wienheim, Germary, 1988, pp.461–471.
Boulikas T. Chromatin domains and prediction of MAR sequences.Int. Rev. Cytol., 1995, 162A: 279–388.
Stief A, Winter D Met al. A nuclear DNA attachment element mediates elevated and position-independent gene activity.Nature, 1989, 341: 343–345.
McKnight R A, Shamay A, Sankaran Let al. Matrixattachment regions can impart position-independent regulation of a tissue-specific gene in transgenic mice. InProc. Natl. Acad. Sci., 1992, 89: 6943–6947.
Nussinov R. Strong adenine clustering in nucleotide sequences.J. Theor. Biol., 1980, 85: 285–291.
Gasser S M, Laemmli U K. Cohabitation of scaffold binding regions with upstream/enhancer elements of three developmentally regulated genes ofD. melanogaster.Cell, 1986, 46: 521–530.
Author information
Authors and Affiliations
Corresponding author
Additional information
Supported in part by the NSF of U.S.A. (Grant No. CCR-9700276), Purdue Research Foundation (Grant No. 690-1398-3145), the Italian Ministry of University and Research, and the Research Program of the University of Padova.
Supported by Purdue Research Foundation (Grant No.690-1398-3145), the Italian Ministry of University and Research, the Research Program of the University of Padova and Bourns College of Engineering, University of California, Riverside.
Alberto Apostolico (Dr. Eng., 1973, Univ. Naples) is a professor of computer engineering at Univ. Padova and professor of computer sciences at Purdue University. He is a fulbright scholar in 1974–75 at CMU, held visiting and permanent positions in the U.S. (UIUC, Rensselaer, Purdue, IBM) and Europe (U. of Salerno, U. of L' Aquila, LASI, U. of Paris, U. of London, King's Zif-Bielefeld, Renyi-Hungarian Acad of Science), and a full prof. in Italy since 1987, at DEI since 1992. His research interests are algorithmic analysis and design, with emphasis on pattern matching, on which subject he has authored more than 100 papers, and co-authored/edited 7 volumes. He serves on the Editorial Boards of Theor. Comp. Sci. Par. Proc. Let., J. of Comp. Biol., Chaos Th. and Appl., Springer Lecture Notes in Bioinformatics, Algorithmica (g.e.), Keynote at over 60, PC Member for over 50 international conferences He has been a reviewer for NSF, Canadian SERC, NATO, HSFP, Finland Acad Sci, Hong Kong and Israel Science Councils. He is a current or past member of ACM AICA, EATCS, IEEE. He has been (co-) recipient of U.S. (NSF, AFOSR, NIH) French, British, Italian (CNR, MURST, MIUR) and international (Fulbright, NATO, ESPRIT) grants, and of au IBM Faculty Award in 2002.
Fang-Cheng Gong (Ph.D 1995, Dept. Plant Sciences, Univ. Arizona) has held positions of graduate research associate at Dept. Plant Sciences, Univ. Arizona, Postdoc Fellow at the Dept. Plant Pathology of Univ. Arizona, Postdoc Fellow at, Dept. Biological Sciences, Purdue University, before joining the research staff of Celera Genomics. His research interests include plant molecular biology, microbial molecular genetics, and plant cell biology.
Stefano Lonardi (Ph.D., 2001, Purdue University) is an assistant professor of computer science & engineering at the Univ. California, Riverside. His research is currently focused on bioinformatics, data compression, information hiding, and data mining. He received his “Laurea” degree from Univ. Pisa in 1994. and his Ph.D. degree in computer science, from Purdue University. He also holds a Research Doctorate from the Univ. Padua (1999). He is a member of ACM, IEEE, Upsilon Pi Upsilon and Phi Kappa Phi honor societies, and the International Society for Computational Biology.
Rights and permissions
About this article
Cite this article
Apostolico, A., Gong, FC. & Lonardi, S. Verbumculus and the discovery of unusual words. J. Comput. Sci. & Technol. 19, 22–41 (2004). https://doi.org/10.1007/BF02944783
Received:
Revised:
Issue Date:
DOI: https://doi.org/10.1007/BF02944783