Abstract
Ultra-massive sequencing techniques have sharply reduced both the cost and the time of sequencing, making it feasible to sequence an ever-growing number of genomes from microbes belonging to the same taxonomic unit. This led to the concept of the pangenome, that is, the entire set of genes present in a group of representatives of the same genus or species. The pangenome can in turn be divided into the core genome, defined as the set of genes present in all the genomes under study, and the dispensable genome, the set of genes possessed by only one or a subset of the organisms.
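The definitions above map directly onto set operations. The following is a minimal sketch, assuming each genome has already been reduced to a set of gene-family identifiers; the function name `partition_pangenome`, the strain names, and the family labels are hypothetical and used only for illustration.

```python
from typing import Dict, Set, Tuple


def partition_pangenome(
    genomes: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (pangenome, core, dispensable) for a genome -> gene-family mapping."""
    if not genomes:
        return set(), set(), set()
    families = list(genomes.values())
    # Pangenome: every family seen in at least one genome.
    pangenome = set().union(*families)
    # Core genome: families present in all genomes under study.
    core = set.intersection(*families)
    # Dispensable genome: families found in only one or a subset of genomes.
    dispensable = pangenome - core
    return pangenome, core, dispensable


# Hypothetical toy data: three strains, five gene families.
toy = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam3", "fam5"},
}
pan, core, disp = partition_pangenome(toy)
print(len(pan), sorted(core), sorted(disp))
```

In this toy data set only `fam1` is shared by all three strains, so it alone forms the core genome.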
When analyzing a pangenome, a question of interest is its size, which gives an estimate of the gene repertoire of a given taxonomic group. The size is usually measured by counting the novel genes added to the overall pangenome as new genomes are sequenced and annotated. A pangenome can also be classified as open or closed. In an open pangenome, the size keeps increasing as new genomes are added, so sequencing additional strains will likely yield novel genes. Conversely, in a closed pangenome, adding new genomes will not lead to the discovery of new coding capabilities.
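The counting procedure described above can be sketched as an accumulation curve. This is an illustrative sketch under stated assumptions rather than the chapter's own code: genomes are assumed to be sets of gene-family identifiers, and because the curve depends on the order in which genomes are added, it is averaged here over random orderings. The function names, the number of permutations, and the toy data are all hypothetical.

```python
import random
from typing import Dict, List, Sequence, Set


def accumulation(order: Sequence[str], genomes: Dict[str, Set[str]]) -> List[int]:
    """Pangenome size after adding each genome in the given order."""
    seen: Set[str] = set()
    sizes = []
    for name in order:
        seen |= genomes[name]
        sizes.append(len(seen))
    return sizes


def novel_counts(sizes: Sequence[int]) -> List[int]:
    """Number of novel families contributed by each successive genome."""
    if not sizes:
        return []
    return [sizes[0]] + [b - a for a, b in zip(sizes, sizes[1:])]


def mean_accumulation(
    genomes: Dict[str, Set[str]], n_perm: int = 100, seed: int = 0
) -> List[float]:
    """Average pangenome size at each step over random genome orderings."""
    rng = random.Random(seed)
    names = sorted(genomes)
    totals = [0] * len(names)
    for _ in range(n_perm):
        order = names[:]
        rng.shuffle(order)
        for i, size in enumerate(accumulation(order, genomes)):
            totals[i] += size
    return [t / n_perm for t in totals]


# Hypothetical toy data: three strains, five gene families.
toy = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam3", "fam5"},
}
sizes = accumulation(["strain_A", "strain_B", "strain_C"], toy)
novel = novel_counts(sizes)
curve = mean_accumulation(toy)
print(sizes, novel, curve)
```

If the number of novel families per added genome stays clearly above zero, the data are consistent with an open pangenome; if it drops towards zero, they are consistent with a closed one. With only a handful of genomes, as in the toy data, no such conclusion can be drawn.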
A central point in pangenomics is the definition of homology relationships between genes belonging to different genomes. In practice, this often translates into a search for genes with similar sequences across different organisms, a set that includes both paralogous and orthologous genes.
In this chapter, methods for finding groups of orthologs between genomes and for estimating the pangenome size are discussed. Working code to address these tasks is also provided.
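One common heuristic for picking putative orthologs out of a sequence-similarity search is the bidirectional best hit (BBH) criterion: two genes from different genomes are paired when each is the other's highest-scoring hit. The sketch below illustrates that criterion only; it is not presented as the method used in this chapter. It assumes the similarity search has already been run and summarized as hypothetical `(query, subject, score)` tuples, and that a `genome_of` mapping from gene to genome is available.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Hit = Tuple[str, str, float]  # (query gene, subject gene, similarity score)


def best_hits(
    hits: Iterable[Hit], genome_of: Dict[str, str]
) -> Dict[Tuple[str, str], Tuple[str, float]]:
    """Map (query gene, subject genome) to the top-scoring (subject gene, score)."""
    best: Dict[Tuple[str, str], Tuple[str, float]] = {}
    for query, subject, score in hits:
        subject_genome = genome_of[subject]
        if genome_of[query] == subject_genome:
            continue  # within-genome hits (candidate paralogs) are skipped here
        key = (query, subject_genome)
        # On ties, the first hit encountered is kept.
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return best


def bidirectional_best_hits(
    hits: Iterable[Hit], genome_of: Dict[str, str]
) -> Set[FrozenSet[str]]:
    """Return unordered gene pairs that are each other's best cross-genome hit."""
    best = best_hits(hits, genome_of)
    pairs: Set[FrozenSet[str]] = set()
    for (query, _), (subject, _) in best.items():
        back = best.get((subject, genome_of[query]))
        if back is not None and back[0] == query:
            pairs.add(frozenset((query, subject)))
    return pairs


# Hypothetical toy data: two genomes with two genes each.
genome_of = {"A_1": "A", "A_2": "A", "B_1": "B", "B_2": "B"}
hits = [
    ("A_1", "B_1", 200.0), ("A_1", "B_2", 50.0),
    ("A_2", "B_1", 120.0), ("A_2", "A_1", 90.0),
    ("B_1", "A_1", 200.0), ("B_1", "A_2", 120.0),
    ("B_2", "A_1", 50.0),
]
bbh = bidirectional_best_hits(hits, genome_of)
print(bbh)
```

Here `A_2` and `B_2` both hit genes in the other genome, but neither hit is reciprocated, so only the `A_1`/`B_1` pair is reported. Pairs like these are only candidate orthologs; grouping them into families across many genomes requires a further clustering step.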
© 2015 Springer Science+Business Media New York
About this protocol
Cite this protocol
Bosi, E., Fani, R., Fondi, M. (2015). Defining Orthologs and Pangenome Size Metrics. In: Mengoni, A., Galardini, M., Fondi, M. (eds) Bacterial Pangenomics. Methods in Molecular Biology, vol 1231. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-1720-4_13
DOI: https://doi.org/10.1007/978-1-4939-1720-4_13
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-1719-8
Online ISBN: 978-1-4939-1720-4
eBook Packages: Springer Protocols