Pairwise Alignment, Multiple Alignment, and BLAST

Christensen, Henrik; Olsen, John Elmerdahl

doi:10.1007/978-3-031-45293-2_4

Henrik Christensen² &
John Elmerdahl Olsen³

Part of the book series: Learning Materials in Biosciences ((LMB))

592 Accesses

Abstract

Quantitative comparison of sequences can be performed pairwise by aligning two sequences based on considerations of gaps representing insertions or deletions and matches between nucleotides or amino acids. Pairwise comparisons can be performed as global alignments if it is known that the sequences are homologous in their full length or by local alignments if it is known that one sequence is shorter than the other. The dynamic programming algorithms for constructing global and local alignments will be described in detail and relevant computer programs suggested. The different strategies behind multiple alignments are described including relevant computer programs. The search program BLAST is introduced in detail and different applications described.

Access provided by Autonomous University of Puebla. Download chapter PDF

Pairwise Alignment, Multiple Alignment, and BLAST

BLAST and FASTA Similarity Searching for Multiple Sequence Alignment

Sequence Alignment

FormalPara What You Will Learn in This Chapter

Pairwise alignments and multiple alignments are the basic tools to compare sequences. BLAST is the most frequently used bioinformatics program to compare your own sequence (query sequence) to all sequences in a database (subject sequences) based on local pairwise alignments. The outcome of the BLAST analysis provides qualitative information about homologous sequences and can quantify the identity of the query sequence to the sequences in the database. You will learn about multiple alignments and how to construct them. Multiple alignments are used for a range of applications described in the other chapters of the book.

4.1 The Pairwise Alignment Problem

A pairwise alignment is a model of the homology between two sequences considering all nucleotides or amino acids and all deletion and insertions.

Two sequences to be compared could, for instance, be the ones below:

GCAGTAGCATGACGATAG
GCGGTAGCATGATAC

A pairwise alignment requires a model for the evolutionary steps that led to the differences observed. First, we should test if they are homologous. Such a test can be done by BLAST (see Sect. 4.3), and it will tell if the sequences are from the same gene. Once this has been established, we need to model evolution to construct the pairwise alignment. The simplest assumptions are that identical nucleotide pairs and that identical amino acids pair. Further, that different amino acids with similar physiochemical properties form pairs.

For most comparisons, it is preferred to base pairwise alignments on the amino acid sequences especially if the sequences are very divergent. Pairwise nucleotide comparisons are preferred for closely related sequences only.

Gaps are used to show that an amino acid or nucleotide is without a match in the other sequence and the gaps represent insertions or deletions in an evolutionary context. Most evolutionary modeling assumes that it is more difficult to form gaps than to form a mismatches and that it is more difficult to form a gap than to extent one already formed. Scoring systems are used to separate matches (high score) from mismatches (low score) and gaps (very low score). For proteins, scoring matrices (see Sect. 4.1.2.1) are used. At the end, scores between amino acids or nucleotides are summarized over the whole pairwise alignment, and penalties for gaps are subtracted. The result is a unit score for the multiple alignment given the parameters used. Given the same sequences are compared, the pairwise alignment with the highest total score is preferred compared to one with a lower score.

When the pairwise alignment has been constructed, the identity and similarity can be quantified. The identity is the number of nucleotides or amino acids matching in two sequences compared at all positions between the two sequences. Similarity is a further comparison also considering different types of nucleotides or amino acids as well as the gaps.

An alignment of the sequences above, and a possible scenario leading to the differences between them, is shown in Fig. 4.1.

A schematic illustration depicts the grouping of sequences over time. On the left, 5-point mutations are visible, and on the right, a 2-point mutation is illustrated. — **Fig. 4.1**

The principles of pairwise alignments are used for most multiple alignment programs and for BLAST. The three tools are fundamental to carry out most bioinformatics (Fig. 4.2).

A flow chart depicts the topics in a book. They are sequencing, assembly, sequence quality control, annotation, databases, pair-wise alignment, protein structures, blast, multiple alignment, primer design, phylogeny, classification, identification, 16 S r R N A amplicon, full D N A metagenomics, molecular typing, and transcriptomics. — **Fig. 4.2**

4.1.1 Global or Local Pairwise Alignments?

Before a pairwise alignment is constructed, it needs to be decided if it should be global or local. A pairwise alignment is global if it is known that the sequences are homologous in their full length. In this situation, it is sound to align both sequences in their full length. If preliminary evidence has showed that the genes are encoding for the same proteins and the full length of the gene has been sequenced, a global alignment can be constructed. A local alignment is needed if it is known that one sequence is shorter than the other and that it cannot be related to the other in its full length. This can happen if the DNA sequence of one gene has not been determined in the full length or if the domain structure of the proteins that the genes encode for is rather divergent.

4.1.2 Substitution Matrices

Substitution matrices are used to model the probability of mutual amino acid or nucleotide substitutions in sequences. Amino acids or nucleotides with a lower probability of changing are given more weight in the comparisons compared to those with a higher probability of changing. The changes are assumed to occur according to a Markov model where changes only depend on the current nucleotide or amino acid observed and not on past changes (Fig. 4.3). This model is very important to understand in full. A comparison can be made to a human being only acting based on the immediate desire or need. A certain degree of hunger will lead to taking a meal, whereas past events only have minor influence on the decision like meals taken the day before. This way a Markov chain is without memory, and it seems difficult to use the Markov model to trace evolution back in time like done in phylogeny. However, this is actually possible using the assumption about irreversible changes of nucleotides or amino acids since the same probabilities can be used both to model the future and to go back in time.

A schematic diagram depicts the substitution of nucleotides. The corners are marked with A, T, C, and G. Lines are marked from and to each other directly and diagonally. A and C are marked with clockwise arrows. T and G depict ant-clockwise direction arrows. — **Fig. 4.3**

4.1.2.1 Amino Acid Substitution Matrices

Amino acids have different biochemical and physical properties that influence their relative replaceability during evolution (Table 4.1). The probability of replacement has been related to the physiochemical properties of the amino acids in the way that hydrophobic amino acids are easier replaced by other hydrophobic amino acids again compared to other types such as charged amino acids. The substitution matrices reflect these properties as seen from Fig. 4.4. Here, the score between the two hydrophobic amino acids isoleucine and histamine is 5, whereas the score between isoleucine and the positive charged arginine is only −4. Some amino acids belong to more groups, for instance, histidine, which is both hydrophobic, positively charged, polar, and aromatic.

Table 4.1 Physiochemical properties of amino acids

Full size table

A diagram depicts the scoring values of the BLOSUM 50 matrix. The sequences are marked in rows and columns. — **Fig. 4.4**

The percent accepted mutation (PAM) matrices are based on global alignments of closely related proteins. The number at the end of a PAM matrix refers to the number of steps required for a given percent change. PAM1 corresponds to 99% identity (1% change) between the aligned sequences. This means that PAM112 will recognize more distantly related sequences (40% identity) than PAM23 (80% identity). PAM matrices are based on global alignments and therefore best suited for global pairwise alignments used in evolutionary studies (Table 4.2).

Table 4.2 Practical rules for use of PAM and BLOSUM amino acid substitution matrices

Full size table

Blocks substitution matrix (BLOSUM) is based on local alignments. The number at the end of a BLOSUM matrix refers to the minimum percent identity allowed in the set of aligned sequences used to build the matrix. This means that BLOSUM50 will recognize more distantly related sequences than BLOSUM80 (Table 4.2). Other matrices have been published aiming at refining amino acid alignments and the more recently by including improved statistics for estimating exchangeability between amino acids (Gonnet et al. 1992; Jones et al. 1992; Le and Gascuel 2008).

Explanations of the background for the BLOSUM and PAM and other matrices are given by Pearson (2013). The recommendation is to use the “deep scoring matrices” for full-length protein sequence comparisons and only to compare the domain level with “shallower” scoring matrices.

4.1.2.2 Nucleotide Substitution Matrices

The simplest model is the Jukes & Cantor model where equal probabilities for the substitution of all four nucleotides are assumed (Table 4.4). Substitution matrices for nucleotides give higher weight to transitions (G to A or C to T and vice versa) than to transversions (G to C, G to T, A to C, A to T and vice versa) (Table 4.3). This is mainly related to the larger size of the purine compared to the pyrimidine nucleotides. The transversion/transition bias is included in the models of HKY, Kimura, Tamura-Nei, Tamura, and the General reversible model. An additional account for the frequency of nucleotides is included in models of Tamura-Nei, the Equal input, and the General reversible models. These frequencies are the observed frequencies of nucleotides under comparison. The unrestricted model allows different probabilities of changes for all 12 combinations of nucleotides (Table 4.4).

Table 4.3 The 16 different types of nucleotide pairs that can be observed between two sequences (modified from Nei and Kumar 2000)

Full size table

Table 4.4 Models of nucleotide substitutions

Full size table

4.1.3 Gaps

Gaps are used to model insertions or deletions in sequences. If arbitrarily many gaps are inserted, this can lead to high scoring alignments of nonhomologous sequences. To abolish this effect, gaps are penalized when alignments are constructed to obtain relatively few gaps and separate penalties used for gap opening and gap elongation. In the linear model, the penalty ϒ is proportional to the number of gaps (g) given the penalty for one gap d:

Linear gap penalty score:

$$ \varUpsilon\;\left(\mathrm{g}\right)=-\mathrm{gd} $$

If the penalty of one gap is 8 and there are three gaps, the total penalty will be - 24.

To better model biological events where many gaps often occur in a row, the affine penalty score model is used where the cost is higher for inserting the first gap than to extent this gap to lengths of two and more:

$$ \varUpsilon\;\left(\mathrm{g}\right)=-\mathrm{d}-\left(\mathrm{g}-1\right)\mathrm{e} $$

Where ϒ (g) = gap penalty score of gap length g, d = gap opening penalty and e = gap extension penalty. For a gap of length 3 with a penalty of opening the gap of 8 and 4 of extending the gap with two more, the total penalty will be −16.

4.1.4 Dynamic Programming

Dynamic programming is a way to compute the pairwise alignment. If all possibilities for the pairwise alignment of two sequences should be tested, there would be around 1059 possibilities for two nucleotide sequences of 100 in length. This number is approximately the same as the number of molecules in the Milky Way, and even the largest computer would not be able to handle the problem. There is also no need to consider possibilities without relevance, for instance, that one nucleotide in one sequence would pair only with gaps in the other. For a more general comparison of dynamic programming with hidden Markov models, see: https://youtu.be/aELVQo6eRzk.

To reduce the computing effort but still to perform a careful analysis, dynamic programming is used. The computer is only testing relevant comparisons (high scores) and keeping in memory the highest ones already calculated. The most relevant dynamic programming algorithms for comparison of global and local pairwise alignments were published be Needleman and Wunsch (1970) and Smith et al. (1981), respectively.

The Needleman and Wunsch algorithm is used when it is known that the sequences represent exactly a gene or protein in full length (few amino acids or nucleotide differences are tolerated), and it is used to build global pairwise alignments.

The Smith and Waterman algorithm is used with partial sequences or with sequences of unknown length to build local alignments. For both algorithms, the adjustment of gap penalty functions depends on previous knowledge about domain structures, repeats, and other properties. Both algorithms and their implementations in computer programs can be used for both amino acid and nucleotide sequences.

4.1.4.1 Needleman and Wunsch

The score function in this model is illustrated in Fig. 4.5. This function is used on all comparisons between the two sequences. Here, we will use the example with the two sequences:

Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE

A function is given s F or i, j =max of 3 functions, F of i minus 1, j minus 1 +s of i, j, F of i-1, j minus d, and F of i, j minus 1 minus d. — **Fig. 4.5**

Figure 4.6 shows one sequence (1) on the top of the table and sequence 2 as a vertical column. First, an alignment path matrix is created. For each cell, F(i,j) is calculated based on the score function in Fig. 4.5. This matrix allows a stepwise calculation of score values. The score of the best alignment is calculated between the initial segment x_1…i of x up to x_i and the initial segment y_1…j of y up to y_j. The algorithm function, F(i,j), is built recursively beginning with F(0,0) = 0 (Figs. 4.6, 4.7, 4.8). When the matrix is filled, backtracking is started (evaluation of the optimal path).

A schematic diagram depicts the pair alignment, with rows given for s of i and columns for t of j. The boundary conditions given are F of i, 0= negative i d, and F of 0, j= negative j d. — **Fig. 4.6**

A schematic diagram depicts the pairwise alignment with rows given for s of i and columns for t of j. The computed score for each point in the matrix is given as F of i, j = max of F of i j minus 1 minus d, F of i minus 1, j minus 1+ s of i, j, and F of i minus 1, j. All functions equal negative 2. — **Fig. 4.7**

A schematic diagram depicts the pairwise alignment with rows given for s of i and columns for t of j. The scores are given in the matrix and are linked from the bottom right corner towards 0 on the left top corner. — **Fig. 4.8**

The scoring parameters are given by the BLOSUM 50 matrix (Fig. 4.4) and a linear gap penalty of d = −8. We will align the whole sequence length of the sequences, meaning that we need to form a global alignment and use the Needleman and Wunsch algorithm. The procedure can be followed on Figs. 4.6, 4.7, 4.8, 4.9.

A schematic diagram depicts the pairwise alignment with rows given for s of i and columns for t of j. The pathway is given from the bottom right corner to the left top corner, from 1 to 0. The optimal global alignment sequence is given. — **Fig. 4.9**

A real example of the use of the Needleman and Wunsch algorithm is shown on Fig. 4.10. Here, realistic long sequences are aligned by the algorithm implemented as the needle program in the EMBOSS package (Rice et al. 2000).

A schematic diagram depicts the code for pairwise alignment of sequences. The program is the needle, the run date, align format, report file, aligned sequence, matrix, gap penalty, extend penalty, length, identity, similarity, and score are given. — **Fig. 4.10**

4.1.4.2 Smith and Waterman

Now, we will compute a pairwise alignment of the sequences using the Smith and Waterman algorithm. There are two differences in the algorithm compared to Needleman and Wunsch. The first is that the alignment can start/end anywhere in the matrix and the other that backtracking is started at the highest value rather than in the lower right corner. Backtracking is terminated as soon as zero is encountered. The score function is shown in Fig. 4.11.

A function is given as F of i, j=max of 4 functions, 0, F of i minus 1, j minus 1+ s of i, j, F of i minus 1, j minus d, and F of i, j minus 1 minus d. — **Fig. 4.11**

We will align the same two short model protein sequences with Smith and Waterman:

Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE

And use the same parameters of BLOSUM50 and a linear gap penalty of d = −8.

$$ F\left(i,j\right)=\max \Big\{{\displaystyle \begin{array}{l}0\\ {}F\left(i-1,j-1\right)+s\left(i,j\right)\\ {}F\left(i-1,j\right)-d\\ {}F\left(i,J-1\right)-d\end{array}} $$

Using the score function in Fig. 4.11 results in 0 in most cells when the matrix is filled in the forward direction (Fig. 4.12). The reason is that the three other possibilities in Fig. 4.11 result in negative values, and we then have to use 0. Backtracking starts with the highest score (28) and terminates when 0 is reached. Note that an alternative path is also possible. However, the maximum score of this one is lower (21), and therefore, the first is chosen to construct the final pairwise alignment (Fig. 4.13). Similar to the example with Needleman and Wunsch (Fig. 4.10), a more realistic example is shown in Fig. 4.14 with longer sequences. The sequences are the same that we used in Fig. 4.10, and we see little difference between the two pairwise alignments probably because they were of nearly the same length.

A schematic diagram depicts the pairwise alignment with rows given for s of i and columns for t of j. The backtracking pathway is given from the bottom right corner to the top left until 0 is reached. — **Fig. 4.12**

A schematic diagram depicts the pairwise alignment with rows given for s of i and columns for t of j. The pathway is given from 28 to 0. The optimal local alignment is given. — **Fig. 4.13**

A pairwise alignment algorithm is depicted for the water program. The code format is given for run date, align format, report file, aligned sequence, matrix, gap penalty, extend penalty, length, identity, similarity, gaps, and score. The sequences are depicted below. — **Fig. 4.14**

4.2 Multiple Alignment

A multiple alignment is the simultaneous alignment of three or more nucleic acid or amino acid sequences. The procedure involves the insertion of gaps in the sequences so as to maximize the overall similarity (Higgins and Sharp 1988). Multiple alignments are rarely used for their own sake but are usually created for another purpose—for instance, primer design (Chap. 5) or analysis of phylogeny (Chap. 6). Users select a favorite program or program package and try to optimize program settings for that.

To start the construction of a multiple alignment, we will use the same criteria as for the pairwise alignment problem. We will assume that they shared ancestors (homologous). We need to model evolution the way that every column represents only orthologous amino acids or nucleotides that have evolved from a common ancestor. The simplest assumptions are again that identical nucleotides or identical amino acids pair, that different amino acids with similar physiochemical properties pair, that it is more difficult to form a gap than to form a mismatch, and that it is more difficult to form a gap than to extent one already formed. The scoring between nucleotides and amino acids is based on the system above for pairwise alignment (Sect. 4.1.2).

4.2.1 Clustal

To form a multiple alignment of the Clustal type, first pairwise alignments between sequences 1–2, 1–3, 1–4, 1–5, 2–3, 2–4, 2–5 …0.4–5 are formed (Fig. 4.15). This is the progressive alignment part of the process, and there are (n(n−1))/2 combinations where n is the number of sequences. In this example with five sequences, there are ten combinations. The next step is to construct a “guide tree” (Fig. 4.16) uniting the pairs with highest score. The final step is to construct the multiple alignment from the guide tree, which is called profile alignment (Fig. 4.17). The Clustal type of multiple alignment is therefore performing progressive profile alignment. Clustal has been called “a quick and dirty version of Feng and Dolittle (1987)” where “only residues that are part of the matches of a given length (k-tuple matches) are scored” (Higgins and Sharp 1988).

A set of combinations is given for C L U S T A L V multiple sequence alignments. The sections depicted are in sequence format are Pearson, start of pairwise alignments, start if multiple alignments, and alignment score. Finally, the C L U S T A L alignment file is created. — **Fig. 4.15**

A screenshot of the webpage for the CLUSTAL W flow chart. — **Fig. 4.16**

The set of sequences depicts the CLUSTAL W multiple sequence alignment. — **Fig. 4.17**

The developments in the Clustal programs was started with Clustal (1988) (Higgins and Sharp 1988) and continued with Clustal V (five) (1992), Clustal W (weights) (1994), and Clustal X 2.0 (2007) (Larkin et al. 2007). ClustalX is the implementation of the program for PC (Figs. 4.18, 4.19) and is further described in activity 4.4.3. Clustal Omega (http://www.clustal.org/omega/) is the most recent version of the program for multiple alignment of very large protein datasets (Sievers et al. 2011).

A screenshot of the webpage for CLUSATL X sequences. The mode is multiple alignment and the font is 36. The non-aligned sequences are given in colored grids on the right. — **Fig. 4.18**

A screenshot of the CLUSTAL X sequence alignment webpage. The mode is multiple alignment, and the font is 36. The alignment sequences are depicted in the colored grids on the right. — **Fig. 4.19**

ClustalX/W produces global alignments, which include benefits and drawbacks inhered from then pairwise alignment part included in the construction of the alignment. Domain structures of proteins can be optimized by.

ClustalW/X (Fig. 4.20). ClustalW/X invokes different “hidden” functions during alignment and tend to cluster gaps and take hydrophobic/hydrophobic properties into account. An additional benefit with the ClustalX program is that it allows an identity matrix to be generated. After the multiple alignment has been generated under the menu “Trees” and submenu “Output format Options,” one can mark “% identity matrix.” The matrix will then be generated by running the “Trees | Draw tree” command.

A screenshot of the CLUSTAL X sequence alignment webpage. The alignment option selected is to realign the selected residue range. — **Fig. 4.20**

An inherent problem with the progressive approach used in ClustalX/W is that mistakes made in the initial alignment cannot be corrected later. To account for this and other problems, other programs have been constructed.

4.2.2 Other Multiple Alignment Programs

Multiple Sequence Comparison by Log-Expectation (MUSCLE) (Edgar 2004) (https://drive5.com/muscle5/) includes no guide tree. MUSCLE is claimed both to achieve better average accuracy and better speed than ClustalW2 or T-Coffee (Edgar 2004). The main focus is on protein multiple alignments, but it works also on DNA. The principle is based on K-mer comparison between sequences. K-mers are contiguous subsequence of short length (k-tuple). Related sequences tend to have more k-mers in common than expected by chance. The program can be used from MEGA11 (Tamura et al. 2011) (https://www.megasoftware.net/) used in Chaps. 6 and 11. Like the following programs, T-Coffee and MAFFT have options to adjust alignment parameters for a range of applications.

T-Coffee (http://tcoffee.crg.cat/) (Notredame et al. 2000) incorporates a “tree-based consistency objective function for alignment evaluation.” The benefit should be high accuracy. The program comes in a range of flavors suitable for different applications.

MAFFT (Yamada et al. 2016; Katoh et al. 2019 and other papers) (https://mafft.cbrc.jp/alignment/software/) is also a multiple alignment package that includes applications for very large datasets.

COBALT is a progressive constraint-based alignment program (Papadopoulos and Agarwala 2007). Constraints are from Conserved Domain Database (CDD) (Chap. 3) and PROSITE motif databases.

Multiple alignment programs can be compared with sets of reference sequence (BAliBASE) (Thompson et al. 1999).

The ARB package has been developed to account for the secondary structures in 16S rRNA sequence (Ludwig et al. 2004) (http://www.arb-home.de/). It has been used to construct precomputed multiple alignments in the SILVA database (www.arb-silva-de), which is further considered in Chap. 8. Folding patterns in 16S rRNA genes can be predicted with Mfold (http://www.unafold.org/mfold/applications/rna-folding-form-v2.php) (Zuker and Jacobson 1998).

Trimming should only be done with limited datasets of closely related organisms. Trimming can be done in BioEdit (Fig. 4.21) (see also Sect. 5.6). For Mac users, Jalview can be used (https://www.jalview.org) (Waterhouse et al. 2009).

A screenshot of the webpage for the Bio Edit sequence alignment editor. The sequence highlighted depicts gaps. The taskbar is given at the top. — **Fig. 4.21**

Graphic output of multiple alignments can be made in Boxshade

(http://arete.ibb.waw.pl/PL/html/boxshade.html) (in Polish) (Fig. 4.22).

A screenshot of the webpage for sequence alignment. Various parts of the sequence are highlighted. The taskbar is given on top. — **Fig. 4.22**

For whole genome alignments, we need to account for patchy distribution of matching regions, and we cannot use the strategies for single genome alignments. Strategies using a suffix tree can solve the problem. One strategy is to create maximal unique match (MUM) decomposition of the two genomes to be aligned. A MUM is a subsequence occurring exactly once in each of the two genomes. The strategy (MUMmer) for alignment in all includes suffix tree construction, sorting and extraction of the longest increasing subsequences and generation of Smit-Waterman alignments for all the MUMs (Delcher et al. 1999).

4.3 BLAST

Basic Local Alignment Search Tool (BLAST) was originally described by Altschul et al. (1990), and this tool enabled fast search in the electronic nucleotide and protein databases during the 1990s when the internet was invented. Later, an improved algorithm was described allowing gaps to be inserted during the analysis (Altschul et al. 1997). Both versions of BLAST are in use as the “ungapped” (1990) and “gapped” (1997), respectively.

BLAST is based on a heuristics search algorithm, which is able to search through the ever-growing sizes of databases. BLAST can be compared to the search tool Google in popularity within the bioinformatical community including the formation of the verb “to blast.” BLAST is based on the initial identification of the user query sequence by short pieces of sequence (words). These words are compared to the database with all the sequences (subjects). If the word is identified in a sequence of the database, the match of the word can be extended and will form a high scoring pair (HSP). When no further extension is possible, the result is returned as a hit list to the used with indication of similarity between the query and the closest related subjects in the database. The “gapped” version of BLAST is most frequently used, and its “two hit method” requires two nonoverlapping “words” on the same diagonal with a distance of A before an extension is invoked. Gapped BLAST should only require 1/3 computer time compared to ungapped version because of less extensions of the words. BLAST uses dynamic programming to extend residues in both directions, and BLAST is most suitable for finding of subject DNA and protein sequences in databases that match with a reasonable similarity and a comparable length to the query. BLAST is good for sequences of at least 100 in length that are well represented in the databases. BLAST is not suitable to make really precise determination of similarity between sequences, and pairwise alignment programs should instead be used for that (Sect. 4.1). BLAST is not suitable for sequence-based identification of bacterial species since there are simply too many sequences in GenBank and the type strains are not always clearly labeled (Chap. 7). BLAST is not suitable for databases beyond a certain size and cannot be used to search very large metagenomics databases (Chap. 9). In principle, the database search of BLAST is based on local alignments. It needs to be local since we are never certain if our query sequence will be represented by a homolog in the database with similar length. BLAST is most often used on the NCBI server (https://blast.ncbi.nlm.nih.gov/Blast.cgi); however, the BLAST program can also be installed on your local computer and used with you own database.

4.3.1 NCBI BLAST

The most common way of performing a BLAST search is to access the program at NCBI and search in their databases. A search can be performed by a DNA or protein sequence in FASTA format, by uploading a sequence file in FASTA format or by inserting an accession number as AB490809 shown in Fig. 4.23 if the sequence is already included with GenBank. In the example shown on Fig. 4.23, default setting has been selected for a BLAST search with a DNA sequence. Under Program Selection and Optimize for different versions of BLAST can be selected. In this case, megablast, discontigous megablast, and blastn can be selected. The three versions differ by the scores given to matches and their way of penalizing gaps. For shorter sequences, blastn should be used.

A screenshot of a webpage for nucleotide blast. The sections given on the page are for entering query sequences, choosing search sets, and program selection. Other tabs are also depicted at the top of the page. — **Fig. 4.23**

When the search is terminated, you will see the familiar graphical overview in (Fig. 4.24)—the colored section with horizontal bars.

A screenshot of the webpage for Blast Suite. The BLAST results are listed on the page. The information given is the job title, graphic summary, and descriptions. The screenshot below depicts the webpage for sequence alignments. — **Fig. 4.24**

In the next section comes the descriptions with one line for each subject, which gives a more detailed information about the subject sequences providing Score, Query cover, E value, and accession number. These parameters will be explained in the following.

The hit list is arranged by decreasing Score. In the example, you see that hits are sorted by decreasing score. The identities are also decreasing but not systematically since the score is not always directly related to the identity. The longer the match between two sequences, the higher the score given the same identity. E values cannot be directly seen here since they are very small and they are just shown as “0.0.”

In the alignment section, the pairwise alignments are shown. This is the most detailed information of comparison between your query sequence and the subjects in the database. The parameters from the description section are available, and additional information of gaps and visual location of both gaps and mismatches can be observed.

4.3.2 Ortholog Detection

Sequences are orthologs if they are homologs and have the same function in different species. The paralogs are also homologs but have been duplicated during evolution, and they have got divergent functions in the same species, and the sequences are often quite divergent. BLAST can be used to sort the orthologs from the paralogs. The principle of reciprocal best hits (RBH) is used as explained in Moreno-Hagelsieb and Latimer (2008) “Two genes residing in two different genomes are deemed orthologs if their protein products find each other as the best hit in the opposite genome” (Moreno-Hagelsieb and Latimer 2008). When BLAST search is done against different organisms, the orthologs will often have the highest scores and the paralogs lower scores. Further investigation of orthologs and paralogs can be done by phylogeny (Chap. 6).

4.3.3 BLAST2 sequences

BLAST2 sequences is a version of NCBI BLAST that allows pairwise comparisons on server or on stand-alone BLAST installed on your PC. The program is available on both DNA and protein level, and it can be used to build your own database for many purposes by including many sequences in the second window (Fig. 4.25). For instance, if the query DNA or protein sequence is included with the first search window and a range of genomes included in the second search, you can identify the query sequence in the whole genomic sequences including genomes not yet deposited with the databases. The same approach can be used to build a small MLST database (Chap. 11). BLAST2 sequences are started from the ordinary BLAST page NCBI by marking “Align two or more sequences.” It can be used for fast draft comparisons between two sequences, for example, if you want to look up the location of a primer or gene on a genome. This is your “Swiss knife” always available if you just can access the internet. Like for an ordinary knife, be careful with BLAST2 sequences! If you need real precise similarities between sequences, they should be obtained by the pairwise alignment programs described Sect. 4.1.

A screenshot of the aligned sequences of nucleotides BLAST. The sections given are entering query sequence, entering subject sequence, and program selection. — **Fig. 4.25**

4.3.4 Statistics

Evaluation of the results in the BLAST output is based on the Karlin and Altschul formula E = k × m × N × e^–λs (Karlin and Altschul 1990) where m is letters in query meaning the number of nucleotides or amino acids, N is the total letters in database, and S is the actual score. For amino acids, S is defined by the BLOSUM matrix. S is scaled to bit score to better fit the computer. The bit score = ((λ × S) – ln κ)/ln2. In the pairwise section (Fig. 4.24 above), you can see the raw score in parenthesis along with the bit score. The constants κ and λ depending on the database. E reflects that we find the sequence in the database by chance (chance for false positives). E is this way related to the size of the database. The smaller E, the less likely is it found by chance, and the more unique is the sequence—“The smaller E the better.” For really “unique” sequences, E is so small that it cannot be represented on the computer; this is the reason for “0.0” with the output.

All parameters related to the search can be found in Search Summary (Fig. 4.26). Here, you can find parameters used for calculating the statistics in the Karlin and Altschul formula, and you can understand everything about the BLAST output if you compare these parameters to the statistics just explained.

A screenshot of the webpage of the BLAST sequence with information given for R I D, query I D, description, molecule type, and query length. The tables are given below for search parameters, database, Karlin-Alt Schul statistics, and results statistics. — **Fig. 4.26**

Note that parameters for ungapped blast (Altschul et al. 1990) differ from gapped-blast (Altschul et al. 1997) (Fig. 4.26), and λ and κ are different for gapped BLAST compared to ungapped.

4.3.5 Variants of BLAST

A BLAST search can be based on a query DNA sequence that the program translates to protein and compares to all protein sequence in the database (BLASTx). In this case, the search takes more time, and it should only be done when it is absolute needed (rarely). For such comparison, the codon table should be defined according to the type of organisms. Prokaryotes are using codon table 11 (Appendix). tBLASTn searches a translated nucleotide databases using a protein query. tBLASTx searches a translated nucleotide query against the translated nucleotide databases.

For BLASTp, it is sometimes convenient to select Swiss-Prot or ref_seq_protein if you want a really precise information about the hits. For protein searches, note the additional options PSI-, PHI-, and delta BLAST (Fig. 4.27), and they can be used to build up a profile used to search for query proteins, which are rather distantly related to the subject proteins in the database.

A screenshot of the webpage for BLAST. The page for the standard protein BLAST. The sections given are entering query sequence, choosing search set, and program selection. — **Fig. 4.27**

4.4 Activities

4.4.1 Pairwise Alignment

We will construct local and global pairwise alignments by the use of the programs “water” and “needle,” respectively, available as EMBOSS programs (Rice et al. 2000) on the EBI server. The protein sequences AAA85484 and AAA85485 can be downloaded from NCBI as described in Chap. 3 (see Sect. 3.3.1 including Activity 3.1).

Decide if you want to use “needle” for global or “water” for local alignment.

Open the FASTA format files and paste them in the windows of the relevant EMBOSS program on the server and select protein or DNA:

https://www.ebi.ac.uk/Tools/psa/emboss_needle/

https://www.ebi.ac.uk/Tools/psa/emboss_water/

4.4.2 Multiple Alignment with ClustalX

From http://www.clustal.org/download/current/ download clustalx-2.1-win.msi, install and locate the icon to the desktop. Open the program by double-clicking on the icon. File | Load sequences and select the sequences in FASTA format from the location on your computer. You need to edit the input file yourself. The sequences that you want to use as input should be in one file only with all sequences in FASTA format:

>sequence1

ATGACGATAC…

>sequence2

GATAGATAGACS…

etc.

Select Alignment | Do complete Alignment | ok.

In the lower left corner, you can follow how the program works.

Scroll through the alignment when it is completed.

At the bottom, the bars show the degree of conservation of columns. Above the alignment, the signature * shows columns with conserved nucleotides (or amino acids) with the same properties.

4.4.3 BLAST

We will use this activity to learn how to perform an ordinary BLAST search in GenBank with a DNA sequence. However, we also learn some more about BLAST in this activity. We will perform a search with the sequence with acc. no. AB490809 from GenBank.

First perform a BLAST search at NCBI: https://blast.ncbi.nlm.nih.gov/Blast.cgi. On the graphics, select nucleotide BLAST and insert the acc. no. AB490809 in the window and mark “somewhat similar sequences blastn,” mark “show results in a new window” and press the blue BLAST bottom. Recently, we have seen a request from the NCBI-BLAST server to restrict the search to reduce the demand for CPU capacity. For the current example, rRNA/ITS databases can be selected instead of Standard nr.

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410
Article CAS PubMed Google Scholar
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Article CAS PubMed PubMed Central Google Scholar
Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL (1999) Alignment of whole genomes. Nucleic Acids Res 27(11):2369–2376. https://doi.org/10.1093/nar/27.11.2369
Article CAS PubMed PubMed Central Google Scholar
Durbin R, Eddy S, Krogh A, Mitchison G (1999) Biological sequence analysis. Cambridge Univ. Press
Google Scholar
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797
Article CAS PubMed PubMed Central Google Scholar
Feng DF, Dolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25:351–360
Article CAS PubMed Google Scholar
Gonnet GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256(5062):1443–1445. https://doi.org/10.1126/science.1604319
Article CAS PubMed Google Scholar
Hall TA (1999) BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp Ser 1999(41):95–98
Google Scholar
Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237–244
Article CAS PubMed Google Scholar
Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8(3):275–282. https://doi.org/10.1093/bioinformatics/8.3.275
Article CAS PubMed Google Scholar
Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87:2264–2268
Article CAS PubMed PubMed Central Google Scholar
Katoh K, Rozewicki J, Yamada KD (2019) MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization. Brief Bioinform 20(4):1160–1166. https://doi.org/10.1093/bib/bbx108
Article CAS PubMed Google Scholar
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23:2947–2948
Article CAS PubMed Google Scholar
Le SQ, Gascuel O (2008) An improved general amino acid replacement matrix. Mol Biol Evol 25(7):1307–1320. https://doi.org/10.1093/molbev/msn067
Article CAS PubMed Google Scholar
Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Article PubMed PubMed Central Google Scholar
Ludwig W, Strunk O, Westram R, Richter L, Meier H, Yadhukumar, Buchner A, Lai T, Steppi S, Jobb G, Förster W, Brettske I, Gerber S, Ginhart AW, Gross O, Grumann S, Hermann S, Jost R, König A, Liss T, Lüssmann R, May M, Nonhoff B, Reichel B, Strehlow R, Stamatakis A, Stuckmann N, Vilbig A, Lenke M, Ludwig T, Bode A, Schleifer KH (2004) ARB: a software environment for sequence data. Nucleic Acids Res 32:1363–1371
Article CAS PubMed PubMed Central Google Scholar
Moreno-Hagelsieb G, Latimer K (2008) Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics 24:319–324
Article CAS PubMed Google Scholar
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453
Article CAS PubMed Google Scholar
Nei M, Kumar S (2000) Molecular evolution and phylogenetics. Oxford University Press
Book Google Scholar
Notredame C, Higgins DG, Heringa J (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217
Article CAS PubMed Google Scholar
Papadopoulos JS, Agarwala R (2007) COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 23(9):1073–1079. https://doi.org/10.1093/bioinformatics/btm076
Article CAS PubMed Google Scholar
Pearson WR (2013) Selecting the right similarity-scoring matrix. Curr Protoc Bioinformatics 43:3.5.1–3.5.9. https://doi.org/10.1002/0471250953.bi0305s43
Rice P, Longden I, Bleasby A (2000) EMBOSS: The European molecular biology open software suite. Trends Genetics 16:276–277
Article CAS Google Scholar
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using ClustalOmega. Mol Syst Biol 7:539
Article PubMed PubMed Central Google Scholar
Smith TF, Waterman MS, Fitch WM (1981) Comparative biosequence metrics. J Mol Evol 18:38–46
Article CAS PubMed Google Scholar
Tamura K, Peterson D, Peterson N, Stecher G, Nei M et al (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28:2731–2739
Article CAS PubMed PubMed Central Google Scholar
Thompson JD, Plewniak F, Poch O (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15:87–88
Article CAS PubMed Google Scholar
Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ (2009) Jalview Version 2 – a multiple sequence alignment editor and analysis workbench. Bioinformatics 25:1189–1191
Article CAS PubMed PubMed Central Google Scholar
Yamada KD, Tomii K, Katoh K (2016) Application of the MAFFT sequence alignment program to large data-reexamination of the usefulness of chained guide trees. Bioinformatics 32:3246–3251
Article CAS PubMed PubMed Central Google Scholar
Zuker M, Jacobson AB (1998) Using reliability information to annotate RNA secondary structures. RNA 4:669–679
Article CAS PubMed PubMed Central Google Scholar

Download references

Author information

Authors and Affiliations

Department of Veterinary and Animal Sciences, University of Copenhagen, Copenhagen, Denmark
Henrik Christensen
National Food institute, Technical University of Denmark, Lyngby, Denmark
John Elmerdahl Olsen

Authors

Henrik Christensen
View author publications
You can also search for this author in PubMed Google Scholar
John Elmerdahl Olsen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Henrik Christensen .

Editor information

Editors and Affiliations

Department of Veterinary and Animal Sciences, University of Copenhagen, Copenhagen, Denmark
Henrik Christensen

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Christensen, H., Olsen, J.E. (2023). Pairwise Alignment, Multiple Alignment, and BLAST. In: Christensen, H. (eds) Introduction to Bioinformatics in Microbiology. Learning Materials in Biosciences. Springer, Cham. https://doi.org/10.1007/978-3-031-45293-2_4

Download citation

DOI: https://doi.org/10.1007/978-3-031-45293-2_4
Published: 28 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45292-5
Online ISBN: 978-3-031-45293-2
eBook Packages: Biomedical and Life SciencesBiomedical and Life Sciences (R0)

Publish with us

Policies and ethics

Pairwise Alignment, Multiple Alignment, and BLAST

Abstract

Similar content being viewed by others

Pairwise Alignment, Multiple Alignment, and BLAST

BLAST and FASTA Similarity Searching for Multiple Sequence Alignment

Sequence Alignment

4.1 The Pairwise Alignment Problem

4.1.1 Global or Local Pairwise Alignments?

4.1.2 Substitution Matrices

4.1.2.1 Amino Acid Substitution Matrices

4.1.2.2 Nucleotide Substitution Matrices

4.1.3 Gaps

4.1.4 Dynamic Programming

4.1.4.1 Needleman and Wunsch

4.1.4.2 Smith and Waterman

4.2 Multiple Alignment

4.2.1 Clustal

4.2.2 Other Multiple Alignment Programs

4.3 BLAST

4.3.1 NCBI BLAST

4.3.2 Ortholog Detection

4.3.3 BLAST2 sequences

4.3.4 Statistics

4.3.5 Variants of BLAST

4.4 Activities

4.4.1 Pairwise Alignment

4.4.2 Multiple Alignment with ClustalX

4.4.3 BLAST

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this chapter

Cite this chapter

Download citation

Share this chapter

Publish with us

Search

Navigation