Abstract
In metagenomics, the extraction process breaks DNA sequences into fragments of variable size. These fragments are then used to determine which known species are present in a sample and what proportion of the sequences has not been previously categorized. To improve this identification, clustering algorithms are used as enablers of the identification process: they group sequences that share a given degree of similarity into clusters of DNA fragments, so that fragments can be compared as groups and analyzed faster. One problem in analyzing metagenomic databases is their very large size, which leads to long computation times. Newer technologies provide platforms for developing and running algorithms with better time performance. Apache Spark and TensorFlow were used with the aim of reducing execution times, since their libraries include native implementations of these clustering algorithms. Using these libraries as a base, Iterative k-means was implemented and used as a point of comparison. In the results, Iterative k-means reduced the execution time relative to the traditional implementation. TensorFlow improved execution times in general, with a more significant difference for Iterative k-means, at the cost of requiring much more processing power.
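As a rough illustration of the clustering step described above, the following sketch groups a few toy DNA fragments with plain k-means. It is not the authors' Iterative k-means, nor the Apache Spark or TensorFlow implementations. The abstract does not say how fragments are turned into numeric vectors, so the normalized k-mer frequency profile used here is an assumption, and the names `kmer_profile`, `squared_distance`, and `kmeans` are hypothetical helpers introduced only for this example.

```python
import random
from itertools import product


def kmer_profile(seq, k=2):
    """Return the normalized k-mer frequency vector of a DNA fragment.

    Assumption: fragments are represented by k-mer frequencies; the
    abstract does not specify the representation actually used.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers with ambiguous bases such as N
            counts[j] += 1
    total = sum(counts) or 1  # avoid division by zero on very short input
    return [c / total for c in counts]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain (Lloyd's) k-means; returns cluster labels and centroids."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy fragments of variable size: two AT-rich and two GC-rich.
fragments = ["ATATATATATAT", "TATATATATATATA", "GCGCGCGCGC", "CGCGCGCGCGCGCG"]
profiles = [kmer_profile(f) for f in fragments]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

Normalizing the counts makes fragments of different lengths comparable, which matters because the fragments vary in size. On real metagenomic data the cost of the assignment step grows with the number of fragments, which is the part that distributed platforms are meant to speed up.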
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Vanegas, J., Bonet, I. (2019). Clustering Algorithm Optimization Applied to Metagenomics Using Big Data. In: Botto-Tobar, M., Barba-Maggi, L., González-Huerta, J., Villacrés-Cevallos, P., S. Gómez, O., Uvidia-Fassler, M. (eds) Information and Communication Technologies of Ecuador (TIC.EC). TICEC 2018. Advances in Intelligent Systems and Computing, vol 884. Springer, Cham. https://doi.org/10.1007/978-3-030-02828-2_14
DOI: https://doi.org/10.1007/978-3-030-02828-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02827-5
Online ISBN: 978-3-030-02828-2
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)