Optimal Spaced Seeds for Hidden Markov Models, with Application to Homologous Coding Regions

Brejová, Broňa; Brown, Daniel G.; Vinař, Tomáš

doi:10.1007/3-540-44888-8_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2676)

Included in the conference series: Annual Symposium on Combinatorial Pattern Matching

Abstract

We study the problem of computing optimal spaced seeds for detecting sequences generated by a hidden Markov model. Inspired by recent work in DNA sequence alignment, we have developed such a model for representing the conservation between related DNA coding sequences. Our model includes positional dependencies and periodic rates of conservation, as well as regional deviations in overall conservation rate. We show that, for hidden Markov models in general, the probability that a seed is matched in a region can be computed efficiently, and use these methods to compute the optimal seed for our models. Our experiments on real data show that the optimal seeds are substantially more sensitive than the seeds used in the standard alignment program BLAST, and also substantially better than those of PatternHunter or WABA, both of which use spaced seeds. Our results offer the hope of improved gene finding due to fewer missed exons in DNA/DNA comparison, and more effective homology search in general, and may have applications outside of bioinformatics.
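
To make the central quantity concrete, the following is a minimal sketch of how the probability that a spaced seed matches somewhere in a region could be computed when the match/mismatch sequence is generated by a hidden Markov model. It is not the authors' algorithm: it is a plain dynamic program over (hidden state, recent match history) whose cost grows exponentially with the seed span, so it is only practical for short seeds. The function name `seed_hit_probability`, the seed notation (`1` for a required match, `*` for a don't-care position), and all parameter values are assumptions made for illustration.

```python
def seed_hit_probability(seed, length, init, trans, emit_match):
    """Probability that `seed` matches at least once in `length` columns.

    seed          string over {'1', '*'}; '1' must match, '*' is don't-care
    init[s]       probability that hidden state s emits the first column
    trans[s][t]   transition probability from state s to state t
    emit_match[s] probability that state s emits a match (else a mismatch)
    """
    span = len(seed)
    if span == 0 or seed[0] != "1" or set(seed) - {"1", "*"}:
        raise ValueError("seed must be over {'1', '*'} and start with '1'")
    required = [i for i, c in enumerate(seed) if c == "1"]
    mask = (1 << (span - 1)) - 1  # keeps the last span-1 emitted bits
    states = range(len(init))

    def hits(window):
        # window holds the last `span` bits; the oldest bit is the highest.
        return all((window >> (span - 1 - i)) & 1 for i in required)

    # no_hit[(s, hist)]: probability that state s emits the next column,
    # the last span-1 bits are `hist`, and no hit has occurred so far.
    # The history starts as all mismatches; because the seed starts with '1',
    # this padding can never complete a hit before `span` real columns exist.
    no_hit = {(s, 0): init[s] for s in states if init[s] > 0.0}
    for _ in range(length):
        nxt = {}
        for (s, hist), p in no_hit.items():
            for bit, pe in ((1, emit_match[s]), (0, 1.0 - emit_match[s])):
                window = (hist << 1) | bit
                if pe == 0.0 or hits(window):
                    continue  # impossible emission, or mass leaves on a hit
                for t in states:
                    q = p * pe * trans[s][t]
                    if q > 0.0:
                        key = (t, window & mask)
                        nxt[key] = nxt.get(key, 0.0) + q
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


# Hypothetical three-state periodic model: one state per codon position,
# with the third position conserved less often (values are illustrative).
codon_init = [1.0, 0.0, 0.0]
codon_trans = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
codon_emit = [0.9, 0.9, 0.5]
for candidate in ("111111", "11*11*11"):
    print(candidate, seed_hit_probability(candidate, 30, codon_init,
                                          codon_trans, codon_emit))
```

The two candidate seeds have the same number of required positions, so comparing their printed values under the periodic model illustrates the kind of comparison the abstract describes; an optimal seed for a given model would be found by evaluating this probability over all candidate seeds of a fixed weight and span.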

Author information

Authors and Affiliations

School of Computer Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
Broňa Brejová, Daniel G. Brown & Tomáš Vinař

Editor information

Editors and Affiliations

Depto. de Ciencias de la Computación, Universidad de Chile, Blanco Encalada 2120, Santiago, 6511224, Chile
Ricardo Baeza-Yates
Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana, Edificio “B”, ciudad universitaria, Morelia Michoacán, Mexico
Edgar Chávez
Université de Marne-la-Vallée, 77454, Marne-la-Vallée Cedex 2, France
Maxime Crochemore

About this paper

Cite this paper

Brejová, B., Brown, D.G., Vinař, T. (2003). Optimal Spaced Seeds for Hidden Markov Models, with Application to Homologous Coding Regions. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds) Combinatorial Pattern Matching. CPM 2003. Lecture Notes in Computer Science, vol 2676. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44888-8_4

DOI: https://doi.org/10.1007/3-540-44888-8_4
Published: 27 May 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40311-1
Online ISBN: 978-3-540-44888-4
eBook Packages: Springer Book Archive
