Indexing Finite Language Representation of Population Genotypes

Sirén, Jouni; Välimäki, Niko; Mäkinen, Veli

doi:10.1007/978-3-642-23038-7_23

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6833)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

We propose a way to index population genotype information together with the complete genome sequence, so that the index can be used to efficiently align a given sequence to the genome with all plausible genotype recombinations taken into account. This is achieved by converting a multiple alignment of individual genomes into a finite automaton that recognizes all strings that can be read from the alignment when switching from one sequence to another is allowed at any position. The finite automaton is indexed with an extension of the Burrows-Wheeler transform to allow pattern search inside the plausible recombinant sequences. The index remains small because individual genomes are highly similar. The index finds applications in variation calling and in primer design.
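To make the construction concrete, the following is a minimal sketch of the idea, not the authors' implementation. It assumes the alignment is given as equal-length rows with `-` marking gaps, and it merges equal characters in the same column into one automaton state. The function names `build_automaton` and `occurs` are hypothetical. The search shown here is a naive traversal of the automaton and stands in for the Burrows-Wheeler-based index, which the abstract mentions but does not describe in detail.

```python
from collections import defaultdict

GAP = "-"  # assumed gap symbol in the alignment rows


def build_automaton(alignment):
    """Build a graph whose paths spell every recombinant of the alignment.

    A state is a (column, character) pair, so identical characters in the
    same column are shared by all rows. Each row contributes edges between
    its consecutive non-gap characters.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all alignment rows must have the same length")
    nodes = set()
    edges = defaultdict(set)
    for row in alignment:
        prev = None
        for col, ch in enumerate(row):
            if ch == GAP:
                continue  # gaps contribute no state
            node = (col, ch)
            nodes.add(node)
            if prev is not None:
                edges[prev].add(node)
            prev = node
    return nodes, edges


def occurs(pattern, nodes, edges):
    """Naive check: is `pattern` the label of some path in the automaton?"""
    if not pattern:
        return True
    # Start from every state labelled with the first pattern character.
    frontier = {n for n in nodes if n[1] == pattern[0]}
    for ch in pattern[1:]:
        frontier = {m for n in frontier for m in edges.get(n, ()) if m[1] == ch}
        if not frontier:
            return False
    return bool(frontier)
```

For the two-row alignment `["ACGTA", "ATGCA"]`, `occurs("ACGCA", ...)` returns `True` even though neither row contains that string, because the path starts in the first row and switches to the second at the shared `G`. Merging states per column keeps the automaton close to the size of a single genome when the rows are highly similar, which is the intuition behind the size claim in the abstract.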

Author information

Authors and Affiliations

Helsinki Institute for Information Technology (HIIT) and Department of Computer Science, University of Helsinki, Finland
Jouni Sirén, Niko Välimäki & Veli Mäkinen

Editor information

Editors and Affiliations

National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, 20894, Bethesda, MD, USA
Teresa M. Przytycka
Institut National de Recherche en Informatique et en Automatique (INRIA) and Université Lyon 1 (UCBL), 43 bd du 11 Novembre 1918, 69622, Villeurbanne cedex, France
Marie-France Sagot

About this paper

Cite this paper

Sirén, J., Välimäki, N., Mäkinen, V. (2011). Indexing Finite Language Representation of Population Genotypes. In: Przytycka, T.M., Sagot, M.-F. (eds) Algorithms in Bioinformatics. WABI 2011. Lecture Notes in Computer Science, vol 6833. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23038-7_23

DOI: https://doi.org/10.1007/978-3-642-23038-7_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23037-0
Online ISBN: 978-3-642-23038-7
eBook Packages: Computer Science, Computer Science (R0)
