Abstract
Metagenomics is the study of microbial community diversity, especially of uncultured microorganisms, by shotgun sequencing of environmental samples. As sequencer throughput and data volume increase, it becomes challenging to develop scalable bioinformatics tools that reconstruct microbiome structure by binning sequencing reads to reference genomes. Standard alignment-based methods, such as BWA-MEM, provide state-of-the-art performance, but we demonstrated in Vervier et al. (2016) that compositional approaches using nucleotide motifs offer faster analysis with comparable accuracy. In this work, we describe how to use MetaVW, a scalable machine learning implementation for binning short sequencing reads based on their k-mer profiles. We provide a step-by-step guide to how we trained the classification models and how the approach can be generalized to user-defined reference genomes and specific applications. We also give additional details on how the algorithm's parameters affect performance.
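To make the notion of a k-mer profile concrete, the following is a minimal Python sketch that counts the overlapping k-mers in a single read. It is an illustration only and is not the MetaVW implementation: the function name `kmer_profile`, the default `k=4`, and the choice to skip windows containing ambiguous bases are all assumptions made for this example.

```python
from collections import Counter


def kmer_profile(read: str, k: int = 4) -> Counter:
    """Count every overlapping k-mer in a read (hypothetical helper).

    Windows containing characters other than A, C, G, or T are skipped,
    which is an assumption of this sketch, not a documented MetaVW rule.
    """
    read = read.upper()
    valid_bases = set("ACGT")
    counts = Counter()
    # Slide a window of width k along the read, one base at a time.
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= valid_bases:
            counts[kmer] += 1
    return counts


# Example read with one ambiguous base (N) in the middle.
profile = kmer_profile("ACGTACGTNACGT", k=4)
print(profile.most_common())
```

A compositional classifier would use such counts as the feature vector of a read, so that reads with similar profiles are assigned to the same reference genome without any alignment step.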
References
Handelsman J (2004) Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68(4):669–685
Quince C et al (2017) Shotgun metagenomics, from sampling to analysis. Nat Biotechnol 35(9):833–844
Vervier K et al (2016) Large-scale machine learning for metagenomics sequence classification. Bioinformatics 32(7):1023–1032
Wood DE, Salzberg SL (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15:R46
Simner PJ et al (2018) Understanding the promises and hurdles of metagenomic next-generation sequencing as a diagnostic tool for infectious diseases. Clin Infect Dis 66(5):778–788
Sonnenburg S et al (2006) Large scale learning with string kernels. J Mach Learn Res 7:1531–1565
Gammerman A, Vovk V (2007) Hedging predictions in machine learning. Comp J 50(2):151–163
Parks D et al (2011) Classifying short genomic fragments from novel lineages using composition and homology. BMC Bioinformatics 12:328–344
Acknowledgments
This work was supported by the European Research Council (SMAC-ERC-280032 to J-P.V.).
Copyright information
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Vervier, K., Mahé, P., Vert, JP. (2018). MetaVW: Large-Scale Machine Learning for Metagenomics Sequence Classification. In: Mamitsuka, H. (eds) Data Mining for Systems Biology. Methods in Molecular Biology, vol 1807. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8561-6_2
DOI: https://doi.org/10.1007/978-1-4939-8561-6_2
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-8560-9
Online ISBN: 978-1-4939-8561-6
eBook Packages: Springer Protocols