Automated Removal of Non-homologous Sequence Stretches with PREQUAL

Irisarri, Iker; Burki, Fabien; Whelan, Simon

doi:10.1007/978-1-0716-1036-7_10

Part of the book series: Methods in Molecular Biology (MIMB, volume 2231)

Abstract

Large-scale multigene datasets used in phylogenomics and comparative genomics often contain sequence errors inherited from source genomes and transcriptomes. These errors typically manifest as stretches of non-homologous characters and derive from sequencing, assembly, and/or annotation errors. Because automatic tools to detect and remove them have been lacking, such errors propagate through large-scale datasets. PREQUAL is a command-line tool that identifies and masks regions with non-homologous adjacent characters in sets of unaligned homologous sequences. PREQUAL uses a fully probabilistic approach based on pair hidden Markov models. On the front end, PREQUAL is user-friendly and simple to use, while also allowing full customization to adjust filtering sensitivity. It is primarily aimed at amino acid sequences but can also handle protein-coding nucleotide sequences. PREQUAL is computationally efficient and shows high sensitivity and accuracy. In this chapter, we briefly introduce the motivation for PREQUAL and its underlying methodology, describe basic and advanced usage, and conclude with some notes and recommendations. PREQUAL fills an important gap in the current bioinformatics toolkit for phylogenomics, contributing toward increased accuracy and reproducibility in future studies.
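The idea of masking, as opposed to deleting, poorly supported characters can be illustrated with a toy sketch. This is not PREQUAL's implementation: the function name `mask_low_support`, the threshold of 0.5, the mask character `X`, and the per-residue scores are all hypothetical, chosen only for illustration. The scores stand in for whatever per-residue evidence the pair hidden Markov models would provide.

```python
def mask_low_support(sequence, support, threshold=0.5, mask_char="X"):
    """Replace residues whose support score falls below `threshold`.

    Toy illustration only: `support` is a hypothetical list of per-residue
    scores in [0, 1]; it is not computed by PREQUAL here.
    """
    if len(sequence) != len(support):
        raise ValueError("sequence and support must have the same length")
    return "".join(
        residue if score >= threshold else mask_char
        for residue, score in zip(sequence, support)
    )


# Made-up example: three adjacent residues have low support and are masked.
seq = "MKTAYIAKQR"
scores = [0.9, 0.8, 0.95, 0.2, 0.1, 0.3, 0.85, 0.9, 0.7, 0.6]
masked = mask_low_support(seq, scores)
print(masked)  # MKTXXXAKQR
```

Masking keeps the sequence length unchanged, so residue positions in the filtered output still correspond to positions in the input.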

Acknowledgments

We would like to thank Kazutaka Katoh for the possibility of contributing this chapter. Max E. Schön provided comments on an earlier version. II acknowledges the support from a Juan de la Cierva-Incorporación postdoctoral fellowship (IJCI-2016-29566) from the Spanish Ministry of Science and Competitiveness (MINECO). This work in the lab of FB is supported by a fellowship from Science for Life Laboratory. SW thanks the Carl Tryggers Stiftelse and Uppsala University for support.

Author information

Authors and Affiliations

Department of Organismal Biology (Program in Systematic Biology), Uppsala University, Uppsala, Sweden
Iker Irisarri & Fabien Burki
Department of Biodiversity and Evolutionary Biology, Museo Nacional de Ciencias Naturales, Madrid, Spain
Iker Irisarri
Science for Life Laboratory, Uppsala University, Uppsala, Sweden
Fabien Burki
Department of Evolutionary Genetics (Program in Evolutionary Biology), Uppsala University, Uppsala, Sweden
Simon Whelan
Department of Applied Bioinformatics, Institute for Microbiology and Genetics, University of Göttingen, Göttingen, Germany
Iker Irisarri

Editor information

Editors and Affiliations

Research Institute for Microbial Disease, Osaka University, Osaka, Japan
Kazutaka Katoh

About this protocol

Cite this protocol

Irisarri, I., Burki, F., Whelan, S. (2021). Automated Removal of Non-homologous Sequence Stretches with PREQUAL. In: Katoh, K. (eds) Multiple Sequence Alignment. Methods in Molecular Biology, vol 2231. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1036-7_10

DOI: https://doi.org/10.1007/978-1-0716-1036-7_10
Published: 09 December 2020
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-1035-0
Online ISBN: 978-1-0716-1036-7
eBook Packages: Springer Protocols
