Abstract
In this paper, we introduce a new and highly scalable algorithm, PASTA, for large-scale multiple sequence alignment estimation. PASTA uses a new technique to produce an alignment given a guide tree that enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy of the leading alignment methods on large datasets, and is able to analyze much larger datasets than the current methods. We also show that trees estimated on PASTA alignments are highly accurate – slightly better than SATé trees, but with substantial improvements relative to other methods. Finally, PASTA is very fast, highly parallelizable, and requires relatively little memory.
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Mirarab, S., Nguyen, N., Warnow, T. (2014). PASTA: Ultra-Large Multiple Sequence Alignment. In: Sharan, R. (ed.) Research in Computational Molecular Biology. RECOMB 2014. Lecture Notes in Computer Science, vol 8394. Springer, Cham. https://doi.org/10.1007/978-3-319-05269-4_15
DOI: https://doi.org/10.1007/978-3-319-05269-4_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05268-7
Online ISBN: 978-3-319-05269-4
eBook Packages: Computer Science (R0)