FormalPara What You Will Learn in This Chapter

Pairwise alignments and multiple alignments are the basic tools to compare sequences. BLAST is the most frequently used bioinformatics program to compare your own sequence (query sequence) to all sequences in a database (subject sequences) based on local pairwise alignments. The outcome of the BLAST analysis provides qualitative information about homologous sequences and can quantify the identity of the query sequence to the sequences in the database. You will learn about multiple alignments and how to construct them. Multiple alignments are used for a range of applications described in the other chapters of the book.

4.1 The Pairwise Alignment Problem

A pairwise alignment is a model of the homology between two sequences considering all nucleotides or amino acids and all deletion and insertions.

Two sequences to be compared could, for instance, be the ones below:

  • GCAGTAGCATGACGATAG

  • GCGGTAGCATGATAC

A pairwise alignment requires a model for the evolutionary steps that led to the differences observed. First, we should test if they are homologous. Such a test can be done by BLAST (see ► Sect. 4.3), and it will tell if the sequences are from the same gene. Once this has been established, we need to model evolution to construct the pairwise alignment. The simplest assumptions are that identical nucleotide pairs and that identical amino acids pair and, further, that different amino acids with similar physiochemical properties form pairs.

For most comparisons it is preferred to base pairwise alignments on the amino acid sequences especially if the sequences are very divergent. Pairwise nucleotide comparisons are preferred for closely related sequences only.

Gaps are used to show that an amino acid or nucleotide is without a match in the other sequence and the gaps represent insertions or deletions in an evolutionary context. Most evolutionary modeling assumes that it is more difficult to form gaps than to form mismatches and that it is more difficult to form a gap than to extend one already formed. Scoring systems are used to separate matches (high score) from mismatches (low score) and gaps (very low score). For proteins, scoring matrices (see ► Sect. 4.1.2.1) are used. At the end, scores between amino acids or nucleotides are summarized over the whole pairwise alignment, and penalties for gaps are subtracted. The result is a unit score for the pairwise alignment given the parameters used. Given the same sequences compared, the pairwise alignment with the highest total score is preferred compared to one with a lower score.

When the pairwise alignment has been constructed, the identity and similarity can be quantified. The identity is the number of nucleotides or amino acids matching in two sequences compared at all positions between the two sequences. Similarity is a further comparison also considering different types of nucleotides or amino acids as well as the gaps.

An alignment of the sequences above, and a possible scenario leading to the differences between them, is shown in ◘ Fig. 4.1.

Fig. 4.1
figure 1

Hypothetical example of a pairwise alignment of two sequences if their evolution was known. The first sequence is the ancestor sequence which then diverged into two new lineages. In the sequence to the left, five-point mutations occurred over time, and in the one to the right, two-point mutations plus a deletion of three nucleotides happened probably resulting in the loss of an amino acid. With the exact knowledge of the position of the gap, we can construct the pairwise alignment at the bottom without any assumptions needed

The principles of pairwise alignments are used for most multiple alignment programs and for BLAST. The three tools are fundamental to carry out most bioinformatics (◘ Fig. 4.2).

Fig. 4.2
figure 2

Pairwise and multiple alignments as well as BLAST introduced in this chapter provide the background for most other chapters in this book

4.1.1 Global or Local Pairwise Alignments?

Before a pairwis e alignment is constructed, it needs to be decided if it should be global or local. A pair wise alignment is global if it is know that the sequences are homologous in their full length. In this situation, it is sound to align both sequences in their full length. If preliminary evidence has showed that the genes are encoding for the same proteins and the full length of the gene has been sequenced, a global alignment can be constructed. A local alignment is needed if it is know that one sequence is shorter than the other and that it cannot be related to the other in its full length. This can happen if the DNA sequence of one gene has not been determined in the full length or if the domain structure of the proteins that the genes encode for are rather divergent.

4.1.2 Substitution Matrices

Substitution matrices are used to model the probability of mutual amino acid or nucleotide substitutions in sequences. Amino acids or nucleotides with a lower probability of changing are given more weight in the comparisons compared to those with a higher probability of changing. The changes are assumed to occur according to a Markov model where changes only depend on the current nucleotide or amino acid observed and not on past changes (◘ Fig. 4.3).

Fig. 4.3
figure 3

Markov substitution model of nucleotides. The consequence of the Markov model is that it only depends on the observed nucleotide or amino acid and not on the past changes. The arrows with different colors indicate that potentially 16 different probabilities exist for changes (unrestricted model on ◘ Table 4.3). (Modified from Durbin et al. 1999)

4.1.2.1 Amino Acid Substitution Matrices

Amino acids have diffe rent biochemi cal and physical properties that influence their relative replaceability during evolution (◘ Table 4.1). The probability of replacement has been related to the physiochemical properties of the amino acids in the way that hydrophobic amino acids are easier replaced by other hydrophobic amino acids again compared to other types such as charged amino acids. The substitution matrices reflect these properties as seen from ◘ Fig. 4.4. Here the score between the two hydrophobic amino acids isoleucine and histidine is 5, whereas the score between isoleucine and the positive charged arginine is only −4. Some amino acids belong to more groups, for instance, histidine which is hydrophobic, positively charged, polar, and aromatic.

Table 4.1 Physiochemical properties of amino acids
Fig. 4.4
figure 4

The BLOSUM50 scoring matrix. Note that low scores both can be 0 and with a minus sign. Low scoring pairs occur with high frequencies in sequences. (Jesper Larsen is acknowledged for the figure)

The PAM (percent accepted mutations ) matrices are based on global alignments of closely related proteins. The number at the end of a PAM matrix refers to the number of steps required for a given percent change. PAM1 corresponds to 99% identity (1% change) between the aligned sequences. This means that PAM112 will recognize more distantly related sequences (40% identity) than PAM23 (80% identity). PAM matrices are based on global alignments and therefore best suited for global pairwise alignments used in evolutionary studies (◘ Table 4.2).

Table 4.2 Practical rules for use of PAM and BLOSUM amino acid substitution matrices

BLOSUM (blocks substitution matrix ) matrices are based on local alignments. The number at the end of a BLOSUM matrix refers to the minimum percent identity allowed in the set of aligned sequences used to build the matrix. This means that BLOSUM50 will recognize more distantly related sequences than BLOSUM80 (◘ Table 4.2).

4.1.2.2 Nucleotide Substitution Matrices

Substitution matrices for nucleotides give higher weight to transitions (G to A or C to T and vice versa) than to transversions (G to C, G to T, A to C, A to T, and vice versa) (◘ Table 4.3). This is mai nly related to the larger size of the purine compared to the pyrimidine nucleotides. The simplest model is the Jukes and Cantor model where equal probabilities for the substitution of all four nucleotides are assumed (◘ Table 4.4). The transversion/transition bias is included in the models of HKY, Kimura, Tamura-Nei, Tamura, and the general reversible model. An additional account for the frequency of nucleotides is included in models of Tamura-Nei, the equal input, and the general reversible models. These frequencies are the observed frequencies of nucleotides under comparison. The unrestricted model allows different probabilities of changes for all 12 combinations of nucleotides (◘ Table 4.4).

Table 4.3 The sixteen different types of nucleotide pairs that can be observed between two sequences
Table 4.4 Models of nucleotide substitutions

4.1.3 Gaps

Gaps are used to model insertions or deletions in sequences. If arbitrarily many gaps are inserted, this can lead to high-scoring align ments of nonhomologous sequences. To abolish this effect, gaps are penalized when alignments are constructed to obtain relatively few gaps and separate penalties used for gap opening and gap elongation. In the linear model, the penalty ϒ is proportional to the number of gaps (g) given the penalty for one gap d:

Linear gap penalty score:

$$ \varUpsilon (g)=- gd $$

If the penalty of one gap is 8 and there are 3 gaps, the total penalty will be – 24.

To better model biological events where many gaps often occur in a row, the affine penalty score model is used where the cost is higher for inserting the first gap than to extent this gap to lengths of two and more:

$$ \varUpsilon (g)=-d-\left(g-1\right)e $$

where ϒ(g) = gap penalty score of gap length g, d = gap opening penalty, and e = gap extension penalty. For a gap of length 3 with a penalty of opening the gap of 8 and of extending the gap with two more, the total penalty will be – 16.

4.1.4 Dynamic Programming

Dynamic programming is a way to compute the pairwise alignment? If all possibilities for the pairwise alignment of two sequences should be tested, there would be around 1059 possibilities for two nucleotide sequences of 100 in length. This number is approx. the same as the number of molecules in the Milky Way, and even the largest computer would not be able to handle the problem. There is also no need to consider possibilities without relevance, for instance, that one nucleotide in one sequence would pair only with gaps in the other.

To reduce the computing effort but still to perform a careful analysis, dynamic programming is used. The computer is only testing relevant comparisons (high scores) and keeping in memory the highest ones already calculated. The most relevant dynamic programming algorithms for comparison of global and local pairwise alignments were published be Needleman and Wunsch (1970) and Smith et al. (1981), respectively.

The Needleman and Wunsch algorithm is used when it is known that the sequences represent exactly a gene or protein in full length (few amino acids or nucleotide differences are tolerated) and it is used to build global pairwise alignments.

The Smith and Waterman algorithm is used with partial sequences or with sequences of unknown length to build local alignments. For both algorithms, the adjustment of gap penalty functions depends on previous knowledge about domain structures, repeats, and other properties. Both algorithms and their implementations in computer programs can be used for both amino acid and nucleotide sequences.

4.1.4.1 Needleman and Wunsch

The score f unction in this model is illustrated in ◘ Fig. 4.5. Th is function is used on all comparisons between the two sequences. Here we will use the example with the two sequences:

  • Sequence 1: HEAGAWGHEE

  • Sequence 2: PAWHEAE

Fig. 4.5
figure 5

For the Needleman and Wunsch algorithm, the score function F is defined as the maximum of the three expressions. F(i–1, j–1) is the score of the previous diagonal cell in the matrix which added the score for a match between an amino acid pair in the current cell. F(i–1, j) is the score for the previous cell in the column subtracted the penalty of a gap (d), and F(i, j–1) is the equivalent score for the previous cell in the row subtracted the gap penalty (see ◘ Fig. 4.7 for an example of this calculation)

◘ Figure 4.6 shows one sequence (1) on the top of the table and sequence 2 as a vertical column. First an alignment path matrix is created. For each cell, F(i, j) is calculated based on the score function in ◘ Fig. 4.5. This matrix allows a stepwise calculation of score values. The score of the best alignment is calculated between the initial segment x1...i of x up to xi and the initial segment y1...j of y up to yj. The algorithm function, F(i, j), is built recursively beginning with F(0,0) = 0 (◘ Figs. 4.6, 4.7, and 4.8). When the matrix is filled, backtracking is started (evaluation of the optimal path).

Fig. 4.6
figure 6

Pairwise alignment with the Needleman and Wunsch algorithm showing the boundary conditions calculated. Here both of the sequences are paired only with gaps resulting in the lowest negative score. (Jesper Larsen is acknowledged for the figure)

Fig. 4.7
figure 7

Pairwise alignment with the Needleman and Wunsch algorithm. Here the first real position of the matrix is calculated. The three possibilities, H pairing with a gap, P pairing with a gap, or H pairing with P are considered. The one with the highest score: H pairing with P (−2) (see ◘ Fig. 4.4) gives the maximum score and is preferred. (Jesper Larsen is acknowledged for the figure)

Fig. 4.8
figure 8

Pairwise alignment with the Needleman and Wunsch algorithm. Here the maximum score has been inserted for all cells in the matrix, and backtracking from the bottom right corner has been performed following the path which maximizes the score. After the score of 1 for the cells to the lowest right, −5 is chosen since it was marked as a path to 1 in the forward tracking. The higher score of 2 cannot be selected since it was not marked in the forward tracking. (Jesper Larsen is acknowledged for the figure)

The scoring parameters are given by the BLOSUM 50 matrix (◘ Fig. 4.4) and a linear gap penalty of d = −8. We will align the whole sequence length of the sequences meaning that we need to form a global alignment and use the Needleman and Wunsch algorithm. The procedure can be followed on ◘ Figs. 4.6, 4.7, 4.8, and 4.9.

Fig. 4.9
figure 9

Pairwise alignment with the Needleman and Wunsch algorithm. The optimal path for backtracking is shown again (◘ Fig. 4.8), and the pairwise alignment has been built. Note the combination of H and P considered in ◘ Fig. 4.7 has not been selected since the backtrack path did not pass through this cell. The total score for the multiple alignments is 1 (bottom left corner). (Jesper Larsen is acknowledged for the figure)

A real example of the use of the Needleman and Wunsch algorithm is shown on ◘ Fig. 4.10. Here realistic long sequences are aligned by the algorithm implemented as the needle program in the EMBOSS package (Rice et al. 2000).

Fig. 4.10
figure 10

Pairwise alignment with the Needleman and Wunsch algorithm. A real example with sequences of realistic length has been computed using implementation of the algorithm in EMBOSS as the needle program. (see ► Activity 4.4.1 for details about this program)

4.1.4.2 Smith and Waterman

Now we will co mp ute a pair wise alignment of the sequences using the Smith and Waterman algorithm. There are two differences in the algorithm compared to Needleman and Wunsch. The first is that the alignment can start/end anywhere in the matrix, and the other that backtracking is started at the highest value rather than in the lower right corner. Backtracking is terminated as soon as zero is encountered. The score function is shown in ◘ Fig. 4.11.

Fig. 4.11
figure 11

Score function for Smith and Waterman. An extra possibility of 0 can be selected in addition to the three in ◘ Fig. 4.5. The 0 is chosen if the other three are negative. This way cells are marked as 0 to account for different length of sequences in the local alignment (◘ Fig. 4.12)

We will align the same two short model protein sequences with Smith and Waterman:

  • Sequence 1: HEAGAWGHEE

  • Sequence 2: PAWHEAE

And use the same parameters of BLOSUM50 and a linear gap penalty of d = −8.

Using the score function in ◘ Fig. 4.11 results in 0 in most cells when the matrix is filled in the forward direction (◘ Fig. 4.12). The reason is that the three other possibilities in ◘ Fig. 4.11 result in negative values and we then have to use 0. Backtracking starts with the highest score (28) and terminates when 0 is reached. Note that an alternative path is also possible. However, the maximum score of this one is lower (21), and therefore the first is chosen to construct the final pairwise alignment (◘ Fig. 4.13). Similar to the example with Needleman and Wunsch (◘ Fig. 4.10), a more realistic example is shown in ◘ Fig. 4.14 with longer sequences. The sequences are the same that we used in ◘ Fig. 4.10, and we see little difference between the two pairwise alignments probably because they were of nearly the same length.

Fig. 4.12
figure 12

Smith and Waterman alignment. Here the matrix has been filled with combinations based on the score function in ◘ Fig. 4.11. Backtracking started from the highest score in the matrix (28) and continued to join the highest possibilities until 0 was reached. (Jesper Larsen is acknowledged for the figure)

Fig. 4.13
figure 13

Smith and Waterman algorithm. Here the backtrack path is shown and the resulting pairwise alignment has been constructed. Note the difference to global alignment generated by the Needleman and Wunsch algorithm in ◘ Fig. 4.9. The total score for the pairwise alignment is 28. (Jesper Larsen is acknowledged for the figure)

Fig. 4.14
figure 14

Pairwise alignment with the Smith and Waterman algorithm. A real example with sequences of realistic length has been computed using the water program of EMBOSS which has implemented the Smith and Waterman algorithm (see ► Activity 4.4.2 for details about this program). Note that there are only small differences to the pairwise alignment generated by Needleman and Wunsch (◘ Fig. 4.10). It is related to the comparable length of the two sequences

4.2 Multiple Alignment

A multiple alignment is the simultaneous alignment of three or more nucleic acid or amino acid sequences. The procedure involves the insertion of gaps in the sequences so as to maximize the overall similarity (Higgins and Sharp 1988). Multiple alignments are rarely used for their own sake but are usually created for another purpose – for instance, primer design (► Chap. 5) or analysis of phylogeny (► Chap. 6). Users select a favorite program or program package and try to optimize program settings for that.

To start the construction of a multiple alignment, we will use the same criteria as for the pairwise alignment problem. We will assume that they shared ancestors (homologous). We need to model evolution the way that every column represents only orthologous amino acids or nucleotides that have evolved from a common ancestor. The simplest assumptions are again that identical nucleotides or identical amino acids pair, that different amino acids with similar physiochemical properties pair, that it is more difficult to form a gap than to form a mismatch, and that it is more difficult to form a gap than to extend one already formed. The scoring between nucleotides and amino acids is based on the system above for pairwise alignment (► Sect. 4.1.2).

4.2.1 Clustal

To form a multi ple alignment of the clustal type, fi rst pairwise alignments between sequences 1–2, 1–3, 1–4, 1–5, 2–3, 2–4, 2–5…4–5 are formed (◘ Fig. 4.15). This is the progressive alignment part of the process, and there are (n(n – 1))/2 combinations where n is the number of sequences. In this example with five sequences, there are ten combinations. The next step is to construct a “guide tree” (◘ Fig. 4.16) uniting the pairs with highest score. The final step is to construct the multiple alignments from the guide tree which is called profile alignment (◘ Fig. 4.17). The clustal type of multiple alignments is therefore performing progressive profile alignment. Clustal has been called “a quick and dirty version of Feng and Dolittle (1987)” where “only residues that are part of the matches of a given length (k-tuple matches) are scored” (Higgins and Sharp 1988).

Fig. 4.15
figure 15

Output from ClustalW showing the pairwise combinations and the grouping of sequences based on the guide tree (◘ Fig. 4.16)

Fig. 4.16
figure 16

The guide tree of ClustalW which is used by the program to perform the full multiple alignment

Fig. 4.17
figure 17

The final multiple alignments made with ClustalW

The developments in the clustal programs were started with Clustal (1988) (Higgins and Sharp 1988) and continued with ClustalV (five) (1992), ClustalW (weights) (1994), and ClustalX 2.0 (2007) (Larkin et al. 2007). ClustalX is the implementation of the program for PC (◘ Figs. 4.18 and 4.19) and is further described in ► Activity 4.4.2. Clustal Omega is the most recent version of the program for multiple alignments of very large protein datasets (Sievers et al. 2011).

Fig. 4.18
figure 18

CLUSTALX (Larkin et al. 2007) showing the sequences not yet aligned. This is seem from the unordered arrangement of the nucleotides in the columns

Fig. 4.19
figure 19

CLUSTALX (Larkin et al. 2007) showing aligned sequences where most of the columns show only one type of nucleotide

ClustalX/W produces global alignments which include benefits and drawbacks inhered from the pairwise alignment part included in the construction of the alignment. Domain structures of proteins can be optimized by ClustalW/X (◘ Fig. 4.20). ClustalW/X invokes different “hidden” functions during alignment and tends to be cluster gaps and take hydrophobic/hydrophobic properties into account. An inherent problem with the progressive approach used in ClustalX/W is that mistakes made in the initial alignment cannot be corrected later. To account for this and other problems, other programs have been constructed.

Fig. 4.20
figure 20

CLUSTALX (Larkin et al. 2007) showing optimization by realignment of a region of the multiple alignment to improve local regions

4.2.2 Other Multiple Alignment Programs

MUSCLE (Multiple Sequence Comparison by Log-Expectation ) (Edgar 2004) (► https://www.drive5.com/muscle/downloads.htm) include s no guide tree. MUSCLE is claimed both to achieve better average accuracy and better speed than ClustalW2 or T-Coffee (Edgar 2004). The main focus is on protein multiple alignments, but it works also on DNA. The principle is based on K-mer comparison between sequences. K-mers are contiguous subsequence of short length (k-tuple). Related sequences tend to have more k-mers in common than expected by chance. The program can be used from MEGA7 (Tamura et al. 2011) (► https://www.megasoftware.net/) used in ► Chaps. 6 and 11. Like the following programs, T-Coffee and MAFFT have options to adjust alignment parameters for a range of applications.

T-Coffee (► http://tcoffee.crg.cat/) (Notredame et al. 2000) incorporates a “tree-based consistency objective function for alignment evaluation.” The benefit should be high accuracy. The program comes in a range of flavors suitable for different applications.

MAFFT (Yamada et al. 2016 and other papers) (► https://mafft.cbrc.jp/alignment/software/) is also a multiple alignment package that includes applications for very large datasets.

Multiple alignment programs can be compared with sets of reference sequence (BAliBASE) (Thompson et al. 1999).

To account for the secondary structures in 16S rRNA sequence, the ARB package (Ludwig et al. 2004) (► http://www.arb-home.de/) has been developed. It has been used to construct precomputed multiple alignments in the SILVA database (► www.arb-silva-de) which is further considered in ► Chap. 8. Folding patterns in 16S rRNA genes can be predicted with Mfold (► http://unafold.rna.albany.edu/?q=mfold) (Zuker and Jacobson 1998) or similar program.

Trimming should only be done with limited datasets of closely related organisms. Trimming can be done in BioEdit (◘ Fig. 4.21). For Mac users Jalview can be used. (► http://www.jalview.org) (Waterhouse et al. 2009)

Fig. 4.21
figure 21

BIOEDIT (Hall 1999) (► http://www.mbio.ncsu.edu/BioEdit/bioedit.html) used to trim a region of a multiple alignment with many gaps inserted as a consequence of lack of data (short sequences). The box marked in black was deleted

Graphic output of multiple alignments can be made in BoxShade (► http://www.ch.embnet.org/software/BOX_form.html) (◘ Fig. 4.22).

Fig. 4.22
figure 22

Output from BoxShade (► http://www.ch.embnet.org/software/BOX_form.html). This output can be included for presentation in a publication

4.3 BLAST

BLAST (Basic Local Alignment Search Tool) was originally described by Altschul et al. (1990), and this tool enabled fast search in the electronic nucleotide and protein databases during the 1990s when the Internet was invented. Later an improved algorithm was described allowing gaps to be inserted during the analysis (Altschul et al. 1997). Both versions of BLAST are in use as the “ungapped” (1990) and “gapped” (1997), respectively.

BLAST is based on a heuristics search algorithm which is able to search through the ever-growing sizes of databases. BLAST can be compared to the search tool Google in popularity within the bioinformatics community including the formation of the verb “to blast.” BLAST is based on the initial identification of the users’ query sequence by short pieces of sequence (words). These words are compared to the database with all the sequences (subjects). If the word is identified in a sequence of the database, the match of the word can be extended and will form a high-scoring pair (HSP) . When no further extension is possible, the result is returned as a hit list to the used with indication of similarity between the query and the closest related subjects in the database. The “gapped” version of BLAST is most frequently used, and its “two-hit method” requires two non-overlapping “words” on the same diagonal with a distance of A before an extension is invoked. Gapped BLAST should only require 1/3 computer time compared to ungapped version because of less extensions of the words. BLAST uses dynamic programming to extend residues in both directions; BLAST is most suitable for finding of subject DNA and protein sequences in databases that match with a reasonable similarity and a comparable length to the query. BLAST is good for sequences of at least 100 in length that are well represented in the databases. BLAST is not suitable to make really precise determination of similarity between sequences, and pairwise alignment programs should instead be used for that (► Sect. 4.1). BLAST is not suitable for sequence-based identification of bacterial species since there are simply too many sequences in GenBank and the type strains are not always clearly labeled (► Chap. 7). BLAST is not suitable for databases beyond a certain size and cannot be used to search very large metagenomics databases (► Chap. 9). In principle the database search of BLAST is based on local alignments. It needs to be local since we are never certain if our query sequence will be represented by a homolog in the database with similar length. BLAST is most often used on the NCBI server (► http://www.ncbi.nlm.nih.gov/BLAST/); however, the BLAST program can also be installed on your local computer and used with your own database.

4.3.1 NCBI BLAST

The mos t commo n way of performing a BLAST search is to access the program at NCBI and search in their databases. A search can be performed by a DNA or protein sequence in FASTA format, by uploading a sequence file in FASTA format or by inserting an accession number as AB490809 shown in ◘ Fig. 4.23 if the sequence is already included with GenBank. In the example shown on ◘ Fig. 4.23, default setting has been selected for a BLAST search with a DNA sequence. Under Program Selection and Optimize for different versions of BLAST can be selected. In this case megablast, discontigous megablast, and blastn can be selected. The three versions differ by the scores given to matches and their way of penalizing gaps. For shorter sequences blastn should be used.

Fig. 4.23
figure 23

BLAST search (Altschul et al. 1990, 1997). (BLAST [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2004 – [cited 2018 04 17]. Available from: ► https://www.ncbi.nlm.nih.gov/Blast.cgi)

When the search is terminated, you will see the familiar graphical overview in ◘ Fig. 4.24 – the colored section with horizontal bars. You can see details of subject sequences by mousing over the bars in this section.

Fig. 4.24
figure 24

BLAST (Altschul et al. 1990, 1997) output showing three screenshots of the output from BLAST search: Graphic Summary, Description, and Alignments, respectively. The number of Max target sequences was reduced to ten in the alignment parameters section to show it all on one page. (BLAST [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2004 – [cited 2018 04 17]. Available from: ► https://www.ncbi.nlm.nih.gov/Blast.cgi)

In the next section comes the descriptions with one line for each subject which gives a more detailed information about the subject sequences providing Score, Query cover, E value, and accession number. These parameters will be explained in the following.

The hit list is arranged by decreasing Score. In the example you see that hits are sorted by decreasing score. The identities are also decreasing but not systematically since the score is not always directly related to the identity. The longer the match between two sequences, the higher the score given the same identity. E values cannot be directly seen here since they are very small and they are just shown as “0.0.”

In the alignments section, the pairwise alignments are shown. This is the most detailed information of comparison between your query sequence and the subjects in the database. The parameters from the description section are available, and additional information of gaps and visual location of both gaps and mismatches can be observed.

4.3.2 Ortholog Detection

Sequences a re orthologs if they are homologs and have the same function in different species. The paralogs are also homologs but have been duplicated in evolution, and they have got divergent functions in the same species, and the sequences are often quite divergent. BLAST can be used to sort the orthologs from the paralogs. The principle of reciprocal best hits (RBH) is used as explained in Moreno-Hagelsieb and Latimer (2008) “Two genes residing in two different genomes are deemed orthologs if their protein products find each other as the best hit in the opposite genome” (Moreno-Hagelsieb and Latimer 2008). Further investigation of orthologs and paralogs can be done by phylogeny (► Chap. 6).

4.3.3 BLAST2 Sequences

BLAST2 sequences is a version of NCBI BLAST that allows pairwise c omparisons on server or on stand-alone BLAST installed on your PC. The program is available on both DNA and protein level, and it can be used to build your own database for many purposes by including many sequences in the 2nd window (◘ Fig. 4.25). For instance, if the query DNA or protein sequence is included with the 1st search window and a range of genomes included in the 2nd search, you can identify the query sequence in the whole genomic sequences including genomes not yet deposited with the databases. The same approach can be used to build a small MLST database (► Chap. 11). BLAST2 sequences is started from the ordinary BLAST page NCBI by marking “Align two or more sequences.” It can be used for fast draft comparisons between two sequences for example if you want to look up the location of a primer or gene on a genome. This is your “Swiss knife” always available if you just can access the Internet. Like for an ordinary knife, be careful with BLAST2 sequences! If you need real precise similarities between sequences, they should be obtained by the pairwise alignment programs described in ► Sect. 4.1.

Fig. 4.25
figure 25

BLAST2 sequences (Altschul et al. 1990, 1997). (BLAST [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2004 – [cited 2018 04 17]. Available from: ► https://www.ncbi.nlm.nih.gov/Blast.cgi)

4.3.4 Statistics

Evaluation of the results in the BLAST output is based on the Karlin and Altschul formul a E = k × m × N × eλs (Karlin and Altschul 1990) where m is letter in query meaning the number of nucleotides or amino acids, N is the total letters in database, and S is the actual score.

For amino acids, S is defined by the BLOSUM matrix. S is scaled to bit score to better fit the computer. The bit score = ((λ × S) – ln κ) / ln2. In the pairwise section (◘ Fig. 4.24 above), you can see the raw score in parenthesis along with the bit score, the constants κ and λ depending on the database. E reflects that we find the sequence in the database by chance (chance for false positives). E is this way related to the size of the database. The smaller E the less likely is it found by chance and the more unique is the sequence – “The smaller E the better.” For really “unique” sequences, E is so small that it cannot be represented on the computer; this is the reason for “0.0” with the output.

All parameters related to the search can be found in Search Summary (◘ Fig. 4.26); just press the bottom Graphic summary on the output page (◘ Fig. 4.24). Here you can find parameters used for calculating the statistics in the Karlin and Altschul formula, and you can understand everything about the BLAST output if you compare these parameters to the statistics just explained.

Fig. 4.26
figure 26

The Search Summary section of the output showing all parameters used for the search (Altschul et al. 1990, 1997). (BLAST [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2004 – [cited 2018 04 17]. Available from: ► https://www.ncbi.nlm.nih.gov/Blast.cgi)

Note that parameters for ungapped blast (Altschul et al. 1990) differ from gapped blast (Altschul et al. 1997) (◘ Fig. 4.26) and λ and κ are different for gapped BLAST compared to ungapped.

4.3.5 Variants of BLAST

A BLAST sea rch can be based on a query DNA sequence that the program translates to protein and compares to all protein sequence in the database (BLASTx). In this case the search takes more time, and it should only be done when it is absolutely needed (rarely). For such comparison, the codon table should be defined according to the type of organisms. Prokaryotes are using codon table 11 (Appendix). tBLASTn searches a translated nucleotide databases using a protein query. tBLASTx searches a translated nucleotide query against the translated nucleotide databases.

For BLASTp it is sometimes convenient to select Swiss-Prot or ref_seq_protein if you want a really precise information about the hits. For protein searches note the additional options PSI-, PHI-, and delta BLAST (◘ Fig. 4.27); they can be used to build up a profile used to search for query proteins which are rather distantly related to the subject proteins in the database.

Fig. 4.27
figure 27

Protein BLAST and BLASTp; note the additional options PSI- PHI- and delta BLAST (Altschul et al. 1990, 1997). (BLAST [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2004 – [cited 2018 04 17]. Available from: ► https://www.ncbi.nlm.nih.gov/Blast.cgi)

4.4 Activities

4.4.1 Pairwise Alignment

We will construct local and gl obal pairwise alignments by use of the program “water” and “needle,” respectively, available as EMBOSS programs (Rice et al. 2000) on the EBI server. The protein sequences AAA85484 and AAA85485 can be downloaded from NCBI as described in ► Chap. 3 (see ► Sect. 3.3.1 including Activity 3.8.1).

Decide if you want to use “needle” for global – or “water” for local alignment.

Open the FASTA format files, and paste them in the windows of the relevant EMBOSS program on the server:

► http://www.ebi.ac.uk/Tools/psa/emboss_needle/

► http://www.ebi.ac.uk/Tools/psa/emboss_water/

If you need to do comparisons of nucleotide sequences, the similar programs can be found here:

► http://www.ebi.ac.uk/Tools/psa/emboss_needle/nucleotide.html

► http://www.ebi.ac.uk/Tools/psa/emboss_water/nucleotide.html

You can also install the mEMBOSS package on your PC and select the programs water and needle from the menu. Download mEMBOSS from ► ftp://emboss.open-bio.org/pub/EMBOSS/windows/ Click on ► mEMBOSS-6.3.1.2-setup.exe

And download to local computer. Click on file and install.

Use Explorer from Windows to navigate the folder with mEMBOSS.

Go into the Jemboss folder and click Jar and Jemboss MS; this will allow you to run EMBOSS in Windows from a graphic interface. All commands to activate the programs are found on the left menu bar. Most useful are water, transeq, and revseq which will do pairwise Smiths and Water alignments, translate from DNA to protein, and reverse and complement sequences, respectively.

4.4.2 Learn How Dynamic Programming Works with Pairwise Alignments

Use the tool at: ► http://www.itu.dk/~sestoft/bsa/graphalign.html.

Use th e short protein sequences shown already. Try the different substitution matrices, let the gap costs stay at “linear, gap score −8” to reduce the complexity, or make your own choice. Try with or without traceback. The tool has been designed and is maintained by Peter Sestoft.

4.4.3 Multiple Alignment with ClustalX

From ► ftp://ftp.ebi.ac.uk/pub/software/clustalw2/2.0.10/, download ► clustalx-2.0.10-win.msi, inst all, an d locate the icon to the desktop. Open the program by double-clicking on the icon. File | Load sequences, and select the sequences in FASTA format from the location on your computer. You need to edit the input file yourself. The sequences that you want to use as input should be in one file only with all sequences in FASTA format:

>sequence1 ATGACGATAC... >sequence2 GATAGATAGACS... etc.

Select Alignment | Do complete Alignment | ok

In the lower left corner, you can follow how the program works.

Scroll through the alignment when it is completed.

At the bottom, the bars show the degree of conservation of columns. Above the alignment the signature * shows columns with conserved nucleotides (for amino acids: with the same properties).

4.4.4 BLAST

We will use this activit y to learn how to perform an ordinary BLAST search in GenBank with a DNA sequence. However, we also learn some more about BLAST in this activity. We will perform a search with the sequence with acc. no. AF445297 from GenBank. The expectation is that this will be the best hit from the BLAST output. To a surprise it is not so.

First perform a BLAST search at NCBI: ► https://blast.ncbi.nlm.nih.gov/Blast.cgi. On the graphics select nucleotide BLAST, insert the acc. no. AF445297 in the window, mark “somewhat similar sequences blastn,” mark “show results in a new window,” and press the blue BLAST bottom. You will get a nice output and 100% identity to a sequence from an uncultured organism. Where is your query sequence? Usually you would expect the query as the best hit if it comes from the database. The reason can be found if you look up AF445297 at NCBI, since it is labeled as unverified. The unverified sequences are not included as databases in any BLAST search.

Take-Home Messages

  • Pairwise alignments and multiple alignments are the basic tools for sequence comparison.

  • A quantitative pairwise comparison of two sequences can be performed by aligning two sequences based on considerations of gaps representing insertions or deletions and matches between nucleotides or amino acids.

  • Pairwise comparisons can be performed as global alignments if it is known that the sequences are homologous in their full length or by local alignments if it is known that one sequence is shorter than the other.

  • BLAST is the most frequently used bioinformatics program to compare your own sequence (query sequence) to all sequences in a database (subject sequences).

  • BLAST provides qualitative information about homologous sequences and can quantify the identity of the query sequence to the subject sequences in the database.

  • Substitution matrices provide the probability of nucleotide or amino acid substitutions when two or more sequences are compared.

  • Dynamic programming is the most frequently used procedure to perform pairwise comparisons of nucleotide or amino acid sequences and can be done based on the Needleman and Wunsch algorithm for global alignments and by the Smiths and Waterman algorithm for local alignments.

  • Multiple alignments are most frequently constructed by the progressive profile procedure as implemented in the clustal family of programs.