Abstract
Genomic databases hold millions of sequences, and categorizing them according to their structural and functional roles is an important task. Sequence comparison is a prerequisite for categorizing both DNA and protein sequences, and it helps in assigning a putative or hypothetical structure and function to a given sequence. Of the various methods available for comparing sequences, alignment is the foremost, both for sequences with a small number of base pairs and for large-scale genome comparison. Various tools are available for pairwise comparison of large sequences; the best known either perform a global alignment or generate local alignments between the two sequences. In this chapter we first provide basic information on sequence comparison, followed by a description of the PAM and BLOSUM matrices that form the basis of sequence comparison. We then give a practical overview of currently available methods such as BLAST and FASTA, followed by a description and overview of tools available for genome comparison, including LAGAN, MUMmer, BLASTZ, and AVID.
Copyright information
© 2017 Springer Science+Business Media New York
About this protocol
Cite this protocol
Lal, D., Verma, M. (2017). Large-Scale Sequence Comparison. In: Keith, J. (ed.) Bioinformatics. Methods in Molecular Biology, vol 1525. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6622-6_9
DOI: https://doi.org/10.1007/978-1-4939-6622-6_9
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-6620-2
Online ISBN: 978-1-4939-6622-6
eBook Packages: Springer Protocols