Abstract
Quantitative comparison of sequences can be performed pairwise by aligning two sequences based on considerations of gaps representing insertions or deletions and matches between nucleotides or amino acids. Pairwise comparisons can be performed as global alignments if it is known that the sequences are homologous in their full length or by local alignments if it is known that one sequence is shorter than the other. The dynamic programming algorithms for constructing global and local alignments will be described in detail and relevant computer programs suggested. The different strategies behind multiple alignments are described including relevant computer programs. The search program BLAST is introduced in detail and different applications described.
Access provided by Autonomous University of Puebla. Download chapter PDF
Similar content being viewed by others
Pairwise alignments and multiple alignments are the basic tools to compare sequences. BLAST is the most frequently used bioinformatics program to compare your own sequence (query sequence) to all sequences in a database (subject sequences) based on local pairwise alignments. The outcome of the BLAST analysis provides qualitative information about homologous sequences and can quantify the identity of the query sequence to the sequences in the database. You will learn about multiple alignments and how to construct them. Multiple alignments are used for a range of applications described in the other chapters of the book.
4.1 The Pairwise Alignment Problem
A pairwise alignment is a model of the homology between two sequences considering all nucleotides or amino acids and all deletion and insertions.
Two sequences to be compared could, for instance, be the ones below:
-
GCAGTAGCATGACGATAG
-
GCGGTAGCATGATAC
A pairwise alignment requires a model for the evolutionary steps that led to the differences observed. First, we should test if they are homologous. Such a test can be done by BLAST (see Sect. 4.3), and it will tell if the sequences are from the same gene. Once this has been established, we need to model evolution to construct the pairwise alignment. The simplest assumptions are that identical nucleotide pairs and that identical amino acids pair. Further, that different amino acids with similar physiochemical properties form pairs.
For most comparisons, it is preferred to base pairwise alignments on the amino acid sequences especially if the sequences are very divergent. Pairwise nucleotide comparisons are preferred for closely related sequences only.
Gaps are used to show that an amino acid or nucleotide is without a match in the other sequence and the gaps represent insertions or deletions in an evolutionary context. Most evolutionary modeling assumes that it is more difficult to form gaps than to form a mismatches and that it is more difficult to form a gap than to extent one already formed. Scoring systems are used to separate matches (high score) from mismatches (low score) and gaps (very low score). For proteins, scoring matrices (see Sect. 4.1.2.1) are used. At the end, scores between amino acids or nucleotides are summarized over the whole pairwise alignment, and penalties for gaps are subtracted. The result is a unit score for the multiple alignment given the parameters used. Given the same sequences are compared, the pairwise alignment with the highest total score is preferred compared to one with a lower score.
When the pairwise alignment has been constructed, the identity and similarity can be quantified. The identity is the number of nucleotides or amino acids matching in two sequences compared at all positions between the two sequences. Similarity is a further comparison also considering different types of nucleotides or amino acids as well as the gaps.
An alignment of the sequences above, and a possible scenario leading to the differences between them, is shown in Fig. 4.1.
The principles of pairwise alignments are used for most multiple alignment programs and for BLAST. The three tools are fundamental to carry out most bioinformatics (Fig. 4.2).
4.1.1 Global or Local Pairwise Alignments?
Before a pairwise alignment is constructed, it needs to be decided if it should be global or local. A pairwise alignment is global if it is known that the sequences are homologous in their full length. In this situation, it is sound to align both sequences in their full length. If preliminary evidence has showed that the genes are encoding for the same proteins and the full length of the gene has been sequenced, a global alignment can be constructed. A local alignment is needed if it is known that one sequence is shorter than the other and that it cannot be related to the other in its full length. This can happen if the DNA sequence of one gene has not been determined in the full length or if the domain structure of the proteins that the genes encode for is rather divergent.
4.1.2 Substitution Matrices
Substitution matrices are used to model the probability of mutual amino acid or nucleotide substitutions in sequences. Amino acids or nucleotides with a lower probability of changing are given more weight in the comparisons compared to those with a higher probability of changing. The changes are assumed to occur according to a Markov model where changes only depend on the current nucleotide or amino acid observed and not on past changes (Fig. 4.3). This model is very important to understand in full. A comparison can be made to a human being only acting based on the immediate desire or need. A certain degree of hunger will lead to taking a meal, whereas past events only have minor influence on the decision like meals taken the day before. This way a Markov chain is without memory, and it seems difficult to use the Markov model to trace evolution back in time like done in phylogeny. However, this is actually possible using the assumption about irreversible changes of nucleotides or amino acids since the same probabilities can be used both to model the future and to go back in time.
4.1.2.1 Amino Acid Substitution Matrices
Amino acids have different biochemical and physical properties that influence their relative replaceability during evolution (Table 4.1). The probability of replacement has been related to the physiochemical properties of the amino acids in the way that hydrophobic amino acids are easier replaced by other hydrophobic amino acids again compared to other types such as charged amino acids. The substitution matrices reflect these properties as seen from Fig. 4.4. Here, the score between the two hydrophobic amino acids isoleucine and histamine is 5, whereas the score between isoleucine and the positive charged arginine is only −4. Some amino acids belong to more groups, for instance, histidine, which is both hydrophobic, positively charged, polar, and aromatic.
The percent accepted mutation (PAM) matrices are based on global alignments of closely related proteins. The number at the end of a PAM matrix refers to the number of steps required for a given percent change. PAM1 corresponds to 99% identity (1% change) between the aligned sequences. This means that PAM112 will recognize more distantly related sequences (40% identity) than PAM23 (80% identity). PAM matrices are based on global alignments and therefore best suited for global pairwise alignments used in evolutionary studies (Table 4.2).
Blocks substitution matrix (BLOSUM) is based on local alignments. The number at the end of a BLOSUM matrix refers to the minimum percent identity allowed in the set of aligned sequences used to build the matrix. This means that BLOSUM50 will recognize more distantly related sequences than BLOSUM80 (Table 4.2). Other matrices have been published aiming at refining amino acid alignments and the more recently by including improved statistics for estimating exchangeability between amino acids (Gonnet et al. 1992; Jones et al. 1992; Le and Gascuel 2008).
Explanations of the background for the BLOSUM and PAM and other matrices are given by Pearson (2013). The recommendation is to use the “deep scoring matrices” for full-length protein sequence comparisons and only to compare the domain level with “shallower” scoring matrices.
4.1.2.2 Nucleotide Substitution Matrices
The simplest model is the Jukes & Cantor model where equal probabilities for the substitution of all four nucleotides are assumed (Table 4.4). Substitution matrices for nucleotides give higher weight to transitions (G to A or C to T and vice versa) than to transversions (G to C, G to T, A to C, A to T and vice versa) (Table 4.3). This is mainly related to the larger size of the purine compared to the pyrimidine nucleotides. The transversion/transition bias is included in the models of HKY, Kimura, Tamura-Nei, Tamura, and the General reversible model. An additional account for the frequency of nucleotides is included in models of Tamura-Nei, the Equal input, and the General reversible models. These frequencies are the observed frequencies of nucleotides under comparison. The unrestricted model allows different probabilities of changes for all 12 combinations of nucleotides (Table 4.4).
4.1.3 Gaps
Gaps are used to model insertions or deletions in sequences. If arbitrarily many gaps are inserted, this can lead to high scoring alignments of nonhomologous sequences. To abolish this effect, gaps are penalized when alignments are constructed to obtain relatively few gaps and separate penalties used for gap opening and gap elongation. In the linear model, the penalty ϒ is proportional to the number of gaps (g) given the penalty for one gap d:
Linear gap penalty score:
If the penalty of one gap is 8 and there are three gaps, the total penalty will be - 24.
To better model biological events where many gaps often occur in a row, the affine penalty score model is used where the cost is higher for inserting the first gap than to extent this gap to lengths of two and more:
Where ϒ (g) = gap penalty score of gap length g, d = gap opening penalty and e = gap extension penalty. For a gap of length 3 with a penalty of opening the gap of 8 and 4 of extending the gap with two more, the total penalty will be −16.
4.1.4 Dynamic Programming
Dynamic programming is a way to compute the pairwise alignment. If all possibilities for the pairwise alignment of two sequences should be tested, there would be around 1059 possibilities for two nucleotide sequences of 100 in length. This number is approximately the same as the number of molecules in the Milky Way, and even the largest computer would not be able to handle the problem. There is also no need to consider possibilities without relevance, for instance, that one nucleotide in one sequence would pair only with gaps in the other. For a more general comparison of dynamic programming with hidden Markov models, see: https://youtu.be/aELVQo6eRzk.
To reduce the computing effort but still to perform a careful analysis, dynamic programming is used. The computer is only testing relevant comparisons (high scores) and keeping in memory the highest ones already calculated. The most relevant dynamic programming algorithms for comparison of global and local pairwise alignments were published be Needleman and Wunsch (1970) and Smith et al. (1981), respectively.
The Needleman and Wunsch algorithm is used when it is known that the sequences represent exactly a gene or protein in full length (few amino acids or nucleotide differences are tolerated), and it is used to build global pairwise alignments.
The Smith and Waterman algorithm is used with partial sequences or with sequences of unknown length to build local alignments. For both algorithms, the adjustment of gap penalty functions depends on previous knowledge about domain structures, repeats, and other properties. Both algorithms and their implementations in computer programs can be used for both amino acid and nucleotide sequences.
4.1.4.1 Needleman and Wunsch
The score function in this model is illustrated in Fig. 4.5. This function is used on all comparisons between the two sequences. Here, we will use the example with the two sequences:
-
Sequence 1: HEAGAWGHEE
-
Sequence 2: PAWHEAE
Figure 4.6 shows one sequence (1) on the top of the table and sequence 2 as a vertical column. First, an alignment path matrix is created. For each cell, F(i,j) is calculated based on the score function in Fig. 4.5. This matrix allows a stepwise calculation of score values. The score of the best alignment is calculated between the initial segment x1…i of x up to xi and the initial segment y1…j of y up to yj. The algorithm function, F(i,j), is built recursively beginning with F(0,0) = 0 (Figs. 4.6, 4.7, 4.8). When the matrix is filled, backtracking is started (evaluation of the optimal path).
The scoring parameters are given by the BLOSUM 50 matrix (Fig. 4.4) and a linear gap penalty of d = −8. We will align the whole sequence length of the sequences, meaning that we need to form a global alignment and use the Needleman and Wunsch algorithm. The procedure can be followed on Figs. 4.6, 4.7, 4.8, 4.9.
A real example of the use of the Needleman and Wunsch algorithm is shown on Fig. 4.10. Here, realistic long sequences are aligned by the algorithm implemented as the needle program in the EMBOSS package (Rice et al. 2000).
4.1.4.2 Smith and Waterman
Now, we will compute a pairwise alignment of the sequences using the Smith and Waterman algorithm. There are two differences in the algorithm compared to Needleman and Wunsch. The first is that the alignment can start/end anywhere in the matrix and the other that backtracking is started at the highest value rather than in the lower right corner. Backtracking is terminated as soon as zero is encountered. The score function is shown in Fig. 4.11.
We will align the same two short model protein sequences with Smith and Waterman:
-
Sequence 1: HEAGAWGHEE
-
Sequence 2: PAWHEAE
And use the same parameters of BLOSUM50 and a linear gap penalty of d = −8.
Using the score function in Fig. 4.11 results in 0 in most cells when the matrix is filled in the forward direction (Fig. 4.12). The reason is that the three other possibilities in Fig. 4.11 result in negative values, and we then have to use 0. Backtracking starts with the highest score (28) and terminates when 0 is reached. Note that an alternative path is also possible. However, the maximum score of this one is lower (21), and therefore, the first is chosen to construct the final pairwise alignment (Fig. 4.13). Similar to the example with Needleman and Wunsch (Fig. 4.10), a more realistic example is shown in Fig. 4.14 with longer sequences. The sequences are the same that we used in Fig. 4.10, and we see little difference between the two pairwise alignments probably because they were of nearly the same length.
4.2 Multiple Alignment
A multiple alignment is the simultaneous alignment of three or more nucleic acid or amino acid sequences. The procedure involves the insertion of gaps in the sequences so as to maximize the overall similarity (Higgins and Sharp 1988). Multiple alignments are rarely used for their own sake but are usually created for another purpose—for instance, primer design (Chap. 5) or analysis of phylogeny (Chap. 6). Users select a favorite program or program package and try to optimize program settings for that.
To start the construction of a multiple alignment, we will use the same criteria as for the pairwise alignment problem. We will assume that they shared ancestors (homologous). We need to model evolution the way that every column represents only orthologous amino acids or nucleotides that have evolved from a common ancestor. The simplest assumptions are again that identical nucleotides or identical amino acids pair, that different amino acids with similar physiochemical properties pair, that it is more difficult to form a gap than to form a mismatch, and that it is more difficult to form a gap than to extent one already formed. The scoring between nucleotides and amino acids is based on the system above for pairwise alignment (Sect. 4.1.2).
4.2.1 Clustal
To form a multiple alignment of the Clustal type, first pairwise alignments between sequences 1–2, 1–3, 1–4, 1–5, 2–3, 2–4, 2–5 …0.4–5 are formed (Fig. 4.15). This is the progressive alignment part of the process, and there are (n(n−1))/2 combinations where n is the number of sequences. In this example with five sequences, there are ten combinations. The next step is to construct a “guide tree” (Fig. 4.16) uniting the pairs with highest score. The final step is to construct the multiple alignment from the guide tree, which is called profile alignment (Fig. 4.17). The Clustal type of multiple alignment is therefore performing progressive profile alignment. Clustal has been called “a quick and dirty version of Feng and Dolittle (1987)” where “only residues that are part of the matches of a given length (k-tuple matches) are scored” (Higgins and Sharp 1988).
The developments in the Clustal programs was started with Clustal (1988) (Higgins and Sharp 1988) and continued with Clustal V (five) (1992), Clustal W (weights) (1994), and Clustal X 2.0 (2007) (Larkin et al. 2007). ClustalX is the implementation of the program for PC (Figs. 4.18, 4.19) and is further described in activity 4.4.3. Clustal Omega (http://www.clustal.org/omega/) is the most recent version of the program for multiple alignment of very large protein datasets (Sievers et al. 2011).
ClustalX/W produces global alignments, which include benefits and drawbacks inhered from then pairwise alignment part included in the construction of the alignment. Domain structures of proteins can be optimized by.
ClustalW/X (Fig. 4.20). ClustalW/X invokes different “hidden” functions during alignment and tend to cluster gaps and take hydrophobic/hydrophobic properties into account. An additional benefit with the ClustalX program is that it allows an identity matrix to be generated. After the multiple alignment has been generated under the menu “Trees” and submenu “Output format Options,” one can mark “% identity matrix.” The matrix will then be generated by running the “Trees | Draw tree” command.
An inherent problem with the progressive approach used in ClustalX/W is that mistakes made in the initial alignment cannot be corrected later. To account for this and other problems, other programs have been constructed.
4.2.2 Other Multiple Alignment Programs
Multiple Sequence Comparison by Log-Expectation (MUSCLE) (Edgar 2004) (https://drive5.com/muscle5/) includes no guide tree. MUSCLE is claimed both to achieve better average accuracy and better speed than ClustalW2 or T-Coffee (Edgar 2004). The main focus is on protein multiple alignments, but it works also on DNA. The principle is based on K-mer comparison between sequences. K-mers are contiguous subsequence of short length (k-tuple). Related sequences tend to have more k-mers in common than expected by chance. The program can be used from MEGA11 (Tamura et al. 2011) (https://www.megasoftware.net/) used in Chaps. 6 and 11. Like the following programs, T-Coffee and MAFFT have options to adjust alignment parameters for a range of applications.
T-Coffee (http://tcoffee.crg.cat/) (Notredame et al. 2000) incorporates a “tree-based consistency objective function for alignment evaluation.” The benefit should be high accuracy. The program comes in a range of flavors suitable for different applications.
MAFFT (Yamada et al. 2016; Katoh et al. 2019 and other papers) (https://mafft.cbrc.jp/alignment/software/) is also a multiple alignment package that includes applications for very large datasets.
COBALT is a progressive constraint-based alignment program (Papadopoulos and Agarwala 2007). Constraints are from Conserved Domain Database (CDD) (Chap. 3) and PROSITE motif databases.
Multiple alignment programs can be compared with sets of reference sequence (BAliBASE) (Thompson et al. 1999).
The ARB package has been developed to account for the secondary structures in 16S rRNA sequence (Ludwig et al. 2004) (http://www.arb-home.de/). It has been used to construct precomputed multiple alignments in the SILVA database (www.arb-silva-de), which is further considered in Chap. 8. Folding patterns in 16S rRNA genes can be predicted with Mfold (http://www.unafold.org/mfold/applications/rna-folding-form-v2.php) (Zuker and Jacobson 1998).
Trimming should only be done with limited datasets of closely related organisms. Trimming can be done in BioEdit (Fig. 4.21) (see also Sect. 5.6). For Mac users, Jalview can be used (https://www.jalview.org) (Waterhouse et al. 2009).
Graphic output of multiple alignments can be made in Boxshade
(http://arete.ibb.waw.pl/PL/html/boxshade.html) (in Polish) (Fig. 4.22).
For whole genome alignments, we need to account for patchy distribution of matching regions, and we cannot use the strategies for single genome alignments. Strategies using a suffix tree can solve the problem. One strategy is to create maximal unique match (MUM) decomposition of the two genomes to be aligned. A MUM is a subsequence occurring exactly once in each of the two genomes. The strategy (MUMmer) for alignment in all includes suffix tree construction, sorting and extraction of the longest increasing subsequences and generation of Smit-Waterman alignments for all the MUMs (Delcher et al. 1999).
4.3 BLAST
Basic Local Alignment Search Tool (BLAST) was originally described by Altschul et al. (1990), and this tool enabled fast search in the electronic nucleotide and protein databases during the 1990s when the internet was invented. Later, an improved algorithm was described allowing gaps to be inserted during the analysis (Altschul et al. 1997). Both versions of BLAST are in use as the “ungapped” (1990) and “gapped” (1997), respectively.
BLAST is based on a heuristics search algorithm, which is able to search through the ever-growing sizes of databases. BLAST can be compared to the search tool Google in popularity within the bioinformatical community including the formation of the verb “to blast.” BLAST is based on the initial identification of the user query sequence by short pieces of sequence (words). These words are compared to the database with all the sequences (subjects). If the word is identified in a sequence of the database, the match of the word can be extended and will form a high scoring pair (HSP). When no further extension is possible, the result is returned as a hit list to the used with indication of similarity between the query and the closest related subjects in the database. The “gapped” version of BLAST is most frequently used, and its “two hit method” requires two nonoverlapping “words” on the same diagonal with a distance of A before an extension is invoked. Gapped BLAST should only require 1/3 computer time compared to ungapped version because of less extensions of the words. BLAST uses dynamic programming to extend residues in both directions, and BLAST is most suitable for finding of subject DNA and protein sequences in databases that match with a reasonable similarity and a comparable length to the query. BLAST is good for sequences of at least 100 in length that are well represented in the databases. BLAST is not suitable to make really precise determination of similarity between sequences, and pairwise alignment programs should instead be used for that (Sect. 4.1). BLAST is not suitable for sequence-based identification of bacterial species since there are simply too many sequences in GenBank and the type strains are not always clearly labeled (Chap. 7). BLAST is not suitable for databases beyond a certain size and cannot be used to search very large metagenomics databases (Chap. 9). In principle, the database search of BLAST is based on local alignments. It needs to be local since we are never certain if our query sequence will be represented by a homolog in the database with similar length. BLAST is most often used on the NCBI server (https://blast.ncbi.nlm.nih.gov/Blast.cgi); however, the BLAST program can also be installed on your local computer and used with you own database.
4.3.1 NCBI BLAST
The most common way of performing a BLAST search is to access the program at NCBI and search in their databases. A search can be performed by a DNA or protein sequence in FASTA format, by uploading a sequence file in FASTA format or by inserting an accession number as AB490809 shown in Fig. 4.23 if the sequence is already included with GenBank. In the example shown on Fig. 4.23, default setting has been selected for a BLAST search with a DNA sequence. Under Program Selection and Optimize for different versions of BLAST can be selected. In this case, megablast, discontigous megablast, and blastn can be selected. The three versions differ by the scores given to matches and their way of penalizing gaps. For shorter sequences, blastn should be used.
When the search is terminated, you will see the familiar graphical overview in (Fig. 4.24)—the colored section with horizontal bars.
In the next section comes the descriptions with one line for each subject, which gives a more detailed information about the subject sequences providing Score, Query cover, E value, and accession number. These parameters will be explained in the following.
The hit list is arranged by decreasing Score. In the example, you see that hits are sorted by decreasing score. The identities are also decreasing but not systematically since the score is not always directly related to the identity. The longer the match between two sequences, the higher the score given the same identity. E values cannot be directly seen here since they are very small and they are just shown as “0.0.”
In the alignment section, the pairwise alignments are shown. This is the most detailed information of comparison between your query sequence and the subjects in the database. The parameters from the description section are available, and additional information of gaps and visual location of both gaps and mismatches can be observed.
4.3.2 Ortholog Detection
Sequences are orthologs if they are homologs and have the same function in different species. The paralogs are also homologs but have been duplicated during evolution, and they have got divergent functions in the same species, and the sequences are often quite divergent. BLAST can be used to sort the orthologs from the paralogs. The principle of reciprocal best hits (RBH) is used as explained in Moreno-Hagelsieb and Latimer (2008) “Two genes residing in two different genomes are deemed orthologs if their protein products find each other as the best hit in the opposite genome” (Moreno-Hagelsieb and Latimer 2008). When BLAST search is done against different organisms, the orthologs will often have the highest scores and the paralogs lower scores. Further investigation of orthologs and paralogs can be done by phylogeny (Chap. 6).
4.3.3 BLAST2 sequences
BLAST2 sequences is a version of NCBI BLAST that allows pairwise comparisons on server or on stand-alone BLAST installed on your PC. The program is available on both DNA and protein level, and it can be used to build your own database for many purposes by including many sequences in the second window (Fig. 4.25). For instance, if the query DNA or protein sequence is included with the first search window and a range of genomes included in the second search, you can identify the query sequence in the whole genomic sequences including genomes not yet deposited with the databases. The same approach can be used to build a small MLST database (Chap. 11). BLAST2 sequences are started from the ordinary BLAST page NCBI by marking “Align two or more sequences.” It can be used for fast draft comparisons between two sequences, for example, if you want to look up the location of a primer or gene on a genome. This is your “Swiss knife” always available if you just can access the internet. Like for an ordinary knife, be careful with BLAST2 sequences! If you need real precise similarities between sequences, they should be obtained by the pairwise alignment programs described Sect. 4.1.
4.3.4 Statistics
Evaluation of the results in the BLAST output is based on the Karlin and Altschul formula E = k × m × N × e–λs (Karlin and Altschul 1990) where m is letters in query meaning the number of nucleotides or amino acids, N is the total letters in database, and S is the actual score. For amino acids, S is defined by the BLOSUM matrix. S is scaled to bit score to better fit the computer. The bit score = ((λ × S) – ln κ)/ln2. In the pairwise section (Fig. 4.24 above), you can see the raw score in parenthesis along with the bit score. The constants κ and λ depending on the database. E reflects that we find the sequence in the database by chance (chance for false positives). E is this way related to the size of the database. The smaller E, the less likely is it found by chance, and the more unique is the sequence—“The smaller E the better.” For really “unique” sequences, E is so small that it cannot be represented on the computer; this is the reason for “0.0” with the output.
All parameters related to the search can be found in Search Summary (Fig. 4.26). Here, you can find parameters used for calculating the statistics in the Karlin and Altschul formula, and you can understand everything about the BLAST output if you compare these parameters to the statistics just explained.
Note that parameters for ungapped blast (Altschul et al. 1990) differ from gapped-blast (Altschul et al. 1997) (Fig. 4.26), and λ and κ are different for gapped BLAST compared to ungapped.
4.3.5 Variants of BLAST
A BLAST search can be based on a query DNA sequence that the program translates to protein and compares to all protein sequence in the database (BLASTx). In this case, the search takes more time, and it should only be done when it is absolute needed (rarely). For such comparison, the codon table should be defined according to the type of organisms. Prokaryotes are using codon table 11 (Appendix). tBLASTn searches a translated nucleotide databases using a protein query. tBLASTx searches a translated nucleotide query against the translated nucleotide databases.
For BLASTp, it is sometimes convenient to select Swiss-Prot or ref_seq_protein if you want a really precise information about the hits. For protein searches, note the additional options PSI-, PHI-, and delta BLAST (Fig. 4.27), and they can be used to build up a profile used to search for query proteins, which are rather distantly related to the subject proteins in the database.
4.4 Activities
4.4.1 Pairwise Alignment
We will construct local and global pairwise alignments by the use of the programs “water” and “needle,” respectively, available as EMBOSS programs (Rice et al. 2000) on the EBI server. The protein sequences AAA85484 and AAA85485 can be downloaded from NCBI as described in Chap. 3 (see Sect. 3.3.1 including Activity 3.1).
Decide if you want to use “needle” for global or “water” for local alignment.
Open the FASTA format files and paste them in the windows of the relevant EMBOSS program on the server and select protein or DNA:
https://www.ebi.ac.uk/Tools/psa/emboss_needle/
https://www.ebi.ac.uk/Tools/psa/emboss_water/
4.4.2 Multiple Alignment with ClustalX
From http://www.clustal.org/download/current/ download clustalx-2.1-win.msi, install and locate the icon to the desktop. Open the program by double-clicking on the icon. File | Load sequences and select the sequences in FASTA format from the location on your computer. You need to edit the input file yourself. The sequences that you want to use as input should be in one file only with all sequences in FASTA format:
>sequence1
ATGACGATAC…
>sequence2
GATAGATAGACS…
etc.
Select Alignment | Do complete Alignment | ok.
In the lower left corner, you can follow how the program works.
Scroll through the alignment when it is completed.
At the bottom, the bars show the degree of conservation of columns. Above the alignment, the signature * shows columns with conserved nucleotides (or amino acids) with the same properties.
4.4.3 BLAST
We will use this activity to learn how to perform an ordinary BLAST search in GenBank with a DNA sequence. However, we also learn some more about BLAST in this activity. We will perform a search with the sequence with acc. no. AB490809 from GenBank.
First perform a BLAST search at NCBI: https://blast.ncbi.nlm.nih.gov/Blast.cgi. On the graphics, select nucleotide BLAST and insert the acc. no. AB490809 in the window and mark “somewhat similar sequences blastn,” mark “show results in a new window” and press the blue BLAST bottom. Recently, we have seen a request from the NCBI-BLAST server to restrict the search to reduce the demand for CPU capacity. For the current example, rRNA/ITS databases can be selected instead of Standard nr.
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL (1999) Alignment of whole genomes. Nucleic Acids Res 27(11):2369–2376. https://doi.org/10.1093/nar/27.11.2369
Durbin R, Eddy S, Krogh A, Mitchison G (1999) Biological sequence analysis. Cambridge Univ. Press
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797
Feng DF, Dolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25:351–360
Gonnet GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256(5062):1443–1445. https://doi.org/10.1126/science.1604319
Hall TA (1999) BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp Ser 1999(41):95–98
Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237–244
Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8(3):275–282. https://doi.org/10.1093/bioinformatics/8.3.275
Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87:2264–2268
Katoh K, Rozewicki J, Yamada KD (2019) MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization. Brief Bioinform 20(4):1160–1166. https://doi.org/10.1093/bib/bbx108
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23:2947–2948
Le SQ, Gascuel O (2008) An improved general amino acid replacement matrix. Mol Biol Evol 25(7):1307–1320. https://doi.org/10.1093/molbev/msn067
Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Ludwig W, Strunk O, Westram R, Richter L, Meier H, Yadhukumar, Buchner A, Lai T, Steppi S, Jobb G, Förster W, Brettske I, Gerber S, Ginhart AW, Gross O, Grumann S, Hermann S, Jost R, König A, Liss T, Lüssmann R, May M, Nonhoff B, Reichel B, Strehlow R, Stamatakis A, Stuckmann N, Vilbig A, Lenke M, Ludwig T, Bode A, Schleifer KH (2004) ARB: a software environment for sequence data. Nucleic Acids Res 32:1363–1371
Moreno-Hagelsieb G, Latimer K (2008) Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics 24:319–324
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453
Nei M, Kumar S (2000) Molecular evolution and phylogenetics. Oxford University Press
Notredame C, Higgins DG, Heringa J (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217
Papadopoulos JS, Agarwala R (2007) COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 23(9):1073–1079. https://doi.org/10.1093/bioinformatics/btm076
Pearson WR (2013) Selecting the right similarity-scoring matrix. Curr Protoc Bioinformatics 43:3.5.1–3.5.9. https://doi.org/10.1002/0471250953.bi0305s43
Rice P, Longden I, Bleasby A (2000) EMBOSS: The European molecular biology open software suite. Trends Genetics 16:276–277
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using ClustalOmega. Mol Syst Biol 7:539
Smith TF, Waterman MS, Fitch WM (1981) Comparative biosequence metrics. J Mol Evol 18:38–46
Tamura K, Peterson D, Peterson N, Stecher G, Nei M et al (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28:2731–2739
Thompson JD, Plewniak F, Poch O (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15:87–88
Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ (2009) Jalview Version 2 – a multiple sequence alignment editor and analysis workbench. Bioinformatics 25:1189–1191
Yamada KD, Tomii K, Katoh K (2016) Application of the MAFFT sequence alignment program to large data-reexamination of the usefulness of chained guide trees. Bioinformatics 32:3246–3251
Zuker M, Jacobson AB (1998) Using reliability information to annotate RNA secondary structures. RNA 4:669–679
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Christensen, H., Olsen, J.E. (2023). Pairwise Alignment, Multiple Alignment, and BLAST. In: Christensen, H. (eds) Introduction to Bioinformatics in Microbiology. Learning Materials in Biosciences. Springer, Cham. https://doi.org/10.1007/978-3-031-45293-2_4
Download citation
DOI: https://doi.org/10.1007/978-3-031-45293-2_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45292-5
Online ISBN: 978-3-031-45293-2
eBook Packages: Biomedical and Life SciencesBiomedical and Life Sciences (R0)