FormalPara What You Will Learn in This Chapter

Pairwise alignments and multiple alignments are the basic tools to compare sequences. BLAST is the most frequently used bioinformatics program to compare your own sequence (query sequence) to all sequences in a database (subject sequences) based on local pairwise alignments. The outcome of the BLAST analysis provides qualitative information about homologous sequences and can quantify the identity of the query sequence to the sequences in the database. You will learn about multiple alignments and how to construct them. Multiple alignments are used for a range of applications described in the other chapters of the book.

4.1 The Pairwise Alignment Problem

A pairwise alignment is a model of the homology between two sequences considering all nucleotides or amino acids and all deletion and insertions.

Two sequences to be compared could, for instance, be the ones below:

  • GCAGTAGCATGACGATAG

  • GCGGTAGCATGATAC

A pairwise alignment requires a model for the evolutionary steps that led to the differences observed. First, we should test if they are homologous. Such a test can be done by BLAST (see Sect. 4.3), and it will tell if the sequences are from the same gene. Once this has been established, we need to model evolution to construct the pairwise alignment. The simplest assumptions are that identical nucleotide pairs and that identical amino acids pair. Further, that different amino acids with similar physiochemical properties form pairs.

For most comparisons, it is preferred to base pairwise alignments on the amino acid sequences especially if the sequences are very divergent. Pairwise nucleotide comparisons are preferred for closely related sequences only.

Gaps are used to show that an amino acid or nucleotide is without a match in the other sequence and the gaps represent insertions or deletions in an evolutionary context. Most evolutionary modeling assumes that it is more difficult to form gaps than to form a mismatches and that it is more difficult to form a gap than to extent one already formed. Scoring systems are used to separate matches (high score) from mismatches (low score) and gaps (very low score). For proteins, scoring matrices (see Sect. 4.1.2.1) are used. At the end, scores between amino acids or nucleotides are summarized over the whole pairwise alignment, and penalties for gaps are subtracted. The result is a unit score for the multiple alignment given the parameters used. Given the same sequences are compared, the pairwise alignment with the highest total score is preferred compared to one with a lower score.

When the pairwise alignment has been constructed, the identity and similarity can be quantified. The identity is the number of nucleotides or amino acids matching in two sequences compared at all positions between the two sequences. Similarity is a further comparison also considering different types of nucleotides or amino acids as well as the gaps.

An alignment of the sequences above, and a possible scenario leading to the differences between them, is shown in Fig. 4.1.

Fig. 4.1
A schematic illustration depicts the grouping of sequences over time. On the left, 5-point mutations are visible, and on the right, a 2-point mutation is illustrated.

Hypothetical example of a pairwise alignment of two sequences if their evolution was known. The first sequence is the ancestor sequence, which then diverged into two new lineages. In the sequence to the left, five-point mutations occurred over time, and in the one to the right, two-point mutations plus a deletion of three nucleotides happened probably resulting in the loss of an amino acid. With the exact knowledge of the position of the gap, we can construct the pairwise alignment at the bottom without any assumptions needed

The principles of pairwise alignments are used for most multiple alignment programs and for BLAST. The three tools are fundamental to carry out most bioinformatics (Fig. 4.2).

Fig. 4.2
A flow chart depicts the topics in a book. They are sequencing, assembly, sequence quality control, annotation, databases, pair-wise alignment, protein structures, blast, multiple alignment, primer design, phylogeny, classification, identification, 16 S r R N A amplicon, full D N A metagenomics, molecular typing, and transcriptomics.

Pairwise and multiple alignments as well as BLAST introduced in this chapter provide the background for most other chapters in this book

4.1.1 Global or Local Pairwise Alignments?

Before a pairwise alignment is constructed, it needs to be decided if it should be global or local. A pairwise alignment is global if it is known that the sequences are homologous in their full length. In this situation, it is sound to align both sequences in their full length. If preliminary evidence has showed that the genes are encoding for the same proteins and the full length of the gene has been sequenced, a global alignment can be constructed. A local alignment is needed if it is known that one sequence is shorter than the other and that it cannot be related to the other in its full length. This can happen if the DNA sequence of one gene has not been determined in the full length or if the domain structure of the proteins that the genes encode for is rather divergent.

4.1.2 Substitution Matrices

Substitution matrices are used to model the probability of mutual amino acid or nucleotide substitutions in sequences. Amino acids or nucleotides with a lower probability of changing are given more weight in the comparisons compared to those with a higher probability of changing. The changes are assumed to occur according to a Markov model where changes only depend on the current nucleotide or amino acid observed and not on past changes (Fig. 4.3). This model is very important to understand in full. A comparison can be made to a human being only acting based on the immediate desire or need. A certain degree of hunger will lead to taking a meal, whereas past events only have minor influence on the decision like meals taken the day before. This way a Markov chain is without memory, and it seems difficult to use the Markov model to trace evolution back in time like done in phylogeny. However, this is actually possible using the assumption about irreversible changes of nucleotides or amino acids since the same probabilities can be used both to model the future and to go back in time.

Fig. 4.3
A schematic diagram depicts the substitution of nucleotides. The corners are marked with A, T, C, and G. Lines are marked from and to each other directly and diagonally. A and C are marked with clockwise arrows. T and G depict ant-clockwise direction arrows.

Markow substitution model of nucleotides. The consequence of the Markow model is that only depends on the observed nucleotide or amino acid and not on the past changes. The arrows with different colors indicate that potentially 16 different probabilities exist for changes (Unrestricted model on Table 4.3) (Modified from Durbin et al. 1999)

4.1.2.1 Amino Acid Substitution Matrices

Amino acids have different biochemical and physical properties that influence their relative replaceability during evolution (Table 4.1). The probability of replacement has been related to the physiochemical properties of the amino acids in the way that hydrophobic amino acids are easier replaced by other hydrophobic amino acids again compared to other types such as charged amino acids. The substitution matrices reflect these properties as seen from Fig. 4.4. Here, the score between the two hydrophobic amino acids isoleucine and histamine is 5, whereas the score between isoleucine and the positive charged arginine is only −4. Some amino acids belong to more groups, for instance, histidine, which is both hydrophobic, positively charged, polar, and aromatic.

Table 4.1 Physiochemical properties of amino acids
Fig. 4.4
A diagram depicts the scoring values of the BLOSUM 50 matrix. The sequences are marked in rows and columns.

The BLOSUM50 scoring matrix. Note that low scores both can be 0 and with a minus sign. Low scoring pairs occur with high frequencies in sequences. Jesper Larsen is acknowledged for the figure

The percent accepted mutation (PAM) matrices are based on global alignments of closely related proteins. The number at the end of a PAM matrix refers to the number of steps required for a given percent change. PAM1 corresponds to 99% identity (1% change) between the aligned sequences. This means that PAM112 will recognize more distantly related sequences (40% identity) than PAM23 (80% identity). PAM matrices are based on global alignments and therefore best suited for global pairwise alignments used in evolutionary studies (Table 4.2).

Table 4.2 Practical rules for use of PAM and BLOSUM amino acid substitution matrices

Blocks substitution matrix (BLOSUM) is based on local alignments. The number at the end of a BLOSUM matrix refers to the minimum percent identity allowed in the set of aligned sequences used to build the matrix. This means that BLOSUM50 will recognize more distantly related sequences than BLOSUM80 (Table 4.2). Other matrices have been published aiming at refining amino acid alignments and the more recently by including improved statistics for estimating exchangeability between amino acids (Gonnet et al. 1992; Jones et al. 1992; Le and Gascuel 2008).

Explanations of the background for the BLOSUM and PAM and other matrices are given by Pearson (2013). The recommendation is to use the “deep scoring matrices” for full-length protein sequence comparisons and only to compare the domain level with “shallower” scoring matrices.

4.1.2.2 Nucleotide Substitution Matrices

The simplest model is the Jukes & Cantor model where equal probabilities for the substitution of all four nucleotides are assumed (Table 4.4). Substitution matrices for nucleotides give higher weight to transitions (G to A or C to T and vice versa) than to transversions (G to C, G to T, A to C, A to T and vice versa) (Table 4.3). This is mainly related to the larger size of the purine compared to the pyrimidine nucleotides. The transversion/transition bias is included in the models of HKY, Kimura, Tamura-Nei, Tamura, and the General reversible model. An additional account for the frequency of nucleotides is included in models of Tamura-Nei, the Equal input, and the General reversible models. These frequencies are the observed frequencies of nucleotides under comparison. The unrestricted model allows different probabilities of changes for all 12 combinations of nucleotides (Table 4.4).

Table 4.3 The 16 different types of nucleotide pairs that can be observed between two sequences (modified from Nei and Kumar 2000)
Table 4.4 Models of nucleotide substitutions

4.1.3 Gaps

Gaps are used to model insertions or deletions in sequences. If arbitrarily many gaps are inserted, this can lead to high scoring alignments of nonhomologous sequences. To abolish this effect, gaps are penalized when alignments are constructed to obtain relatively few gaps and separate penalties used for gap opening and gap elongation. In the linear model, the penalty ϒ is proportional to the number of gaps (g) given the penalty for one gap d:

Linear gap penalty score:

$$ \varUpsilon\;\left(\mathrm{g}\right)=-\mathrm{gd} $$

If the penalty of one gap is 8 and there are three gaps, the total penalty will be - 24.

To better model biological events where many gaps often occur in a row, the affine penalty score model is used where the cost is higher for inserting the first gap than to extent this gap to lengths of two and more:

$$ \varUpsilon\;\left(\mathrm{g}\right)=-\mathrm{d}-\left(\mathrm{g}-1\right)\mathrm{e} $$

Where ϒ (g) = gap penalty score of gap length g, d = gap opening penalty and e = gap extension penalty. For a gap of length 3 with a penalty of opening the gap of 8 and 4 of extending the gap with two more, the total penalty will be −16.

4.1.4 Dynamic Programming

Dynamic programming is a way to compute the pairwise alignment. If all possibilities for the pairwise alignment of two sequences should be tested, there would be around 1059 possibilities for two nucleotide sequences of 100 in length. This number is approximately the same as the number of molecules in the Milky Way, and even the largest computer would not be able to handle the problem. There is also no need to consider possibilities without relevance, for instance, that one nucleotide in one sequence would pair only with gaps in the other. For a more general comparison of dynamic programming with hidden Markov models, see: https://youtu.be/aELVQo6eRzk.

To reduce the computing effort but still to perform a careful analysis, dynamic programming is used. The computer is only testing relevant comparisons (high scores) and keeping in memory the highest ones already calculated. The most relevant dynamic programming algorithms for comparison of global and local pairwise alignments were published be Needleman and Wunsch (1970) and Smith et al. (1981), respectively.

The Needleman and Wunsch algorithm is used when it is known that the sequences represent exactly a gene or protein in full length (few amino acids or nucleotide differences are tolerated), and it is used to build global pairwise alignments.

The Smith and Waterman algorithm is used with partial sequences or with sequences of unknown length to build local alignments. For both algorithms, the adjustment of gap penalty functions depends on previous knowledge about domain structures, repeats, and other properties. Both algorithms and their implementations in computer programs can be used for both amino acid and nucleotide sequences.

4.1.4.1 Needleman and Wunsch

The score function in this model is illustrated in Fig. 4.5. This function is used on all comparisons between the two sequences. Here, we will use the example with the two sequences:

  • Sequence 1: HEAGAWGHEE

  • Sequence 2: PAWHEAE

Fig. 4.5
A function is given s F or i, j =max of 3 functions, F of i minus 1, j minus 1 +s of i, j, F of i-1, j minus d, and F of i, j minus 1 minus d.

For the Needleman and Wunsch algorithm, the score function F is defined as the maximum of the three expressions. F(i−1, j−1) is the score of the previous diagonal cell in the matrix, which is added the score for a match between an amino acid pair in the current cell. F(i−1, j) is the score for the previous cell in the column subtracted the penalty of a gap (d), and F(i, j−1) is the equivalent score for the previous cell in the row subtracted the gap penalty. See Fig. 4.7 for an example of this calculation

Figure 4.6 shows one sequence (1) on the top of the table and sequence 2 as a vertical column. First, an alignment path matrix is created. For each cell, F(i,j) is calculated based on the score function in Fig. 4.5. This matrix allows a stepwise calculation of score values. The score of the best alignment is calculated between the initial segment x1…i of x up to xi and the initial segment y1…j of y up to yj. The algorithm function, F(i,j), is built recursively beginning with F(0,0) = 0 (Figs. 4.6, 4.7, 4.8). When the matrix is filled, backtracking is started (evaluation of the optimal path).

Fig. 4.6
A schematic diagram depicts the pair alignment, with rows given for s of i and columns for t of j. The boundary conditions given are F of i, 0= negative i d, and F of 0, j= negative j d.

Pairwise alignment with the Needleman and Wunsch algorithm showing the boundary conditions calculated. Here, both of the sequences are paired only with gaps resulting in the lowest negative score. Jesper Larsen is acknowledged for the figure

Fig. 4.7
A schematic diagram depicts the pairwise alignment with rows given for s of i and columns for t of j. The computed score for each point in the matrix is given as F of i, j = max of F of i j minus 1 minus d, F of i minus 1, j minus 1+ s of i, j, and F of i minus 1, j. All functions equal negative 2.

Pairwise alignment with the Needleman and Wunsch algorithm. Here, the first real position of the matrix is calculated. The three possibilities, H pairing with a gap, P pairing with a gap, or H pairing with P, are considered. The one with the highest score: H pairing with P (−2) (see Fig. 4.4) gives the maximum score and is preferred. Jesper Larsen is acknowledged for the figure

Fig. 4.8
A schematic diagram depicts the pairwise alignment with rows given for s of i and columns for t of j. The scores are given in the matrix and are linked from the bottom right corner towards 0 on the left top corner.

Pairwise alignment with the Needleman and Wunsch algorithm. Here, the maximum score has been inserted for all cells in the matrix and backtracking from the bottom right corner has been performed following the path, which maximizes the score. After the score of 1 for the cells to the lowest right, −5 is chosen since it was marked as a path to 1 in the forward tracking. The higher score of 2 cannot be selected since it was not marked in the forward tracking. Jesper Larsen is acknowledged for the figure

The scoring parameters are given by the BLOSUM 50 matrix (Fig. 4.4) and a linear gap penalty of d = −8. We will align the whole sequence length of the sequences, meaning that we need to form a global alignment and use the Needleman and Wunsch algorithm. The procedure can be followed on Figs. 4.6, 4.7, 4.8, 4.9.

Fig. 4.9
A schematic diagram depicts the pairwise alignment with rows given for s of i and columns for t of j. The pathway is given from the bottom right corner to the left top corner, from 1 to 0. The optimal global alignment sequence is given.

Pairwise alignment with the Needleman and Wunsch algorithm. The optimal path for backtracking is shown again (Fig. 4.8), and the pairwise alignment has been built. Note the combination of H and P considered in Fig. 4.7 has not been selected since the backtrack path did not pass through this cell. The total score for the multiple alignment is 1 (bottom left corner). Jesper Larsen is acknowledged for the figure

A real example of the use of the Needleman and Wunsch algorithm is shown on Fig. 4.10. Here, realistic long sequences are aligned by the algorithm implemented as the needle program in the EMBOSS package (Rice et al. 2000).

Fig. 4.10
A schematic diagram depicts the code for pairwise alignment of sequences. The program is the needle, the run date, align format, report file, aligned sequence, matrix, gap penalty, extend penalty, length, identity, similarity, and score are given.

Pairwise alignment with the Needleman and Wunsch algorithm. A real example with sequences of realistic length has been computed using implementation of the algorithm in EMBOSS as the needle program (see Activity 4.6.1 for details about this program). Similarity is calculated only based on “positives,” which means the scores in the BLOSUM62 matrix, which are positive (signatures | and:)

4.1.4.2 Smith and Waterman

Now, we will compute a pairwise alignment of the sequences using the Smith and Waterman algorithm. There are two differences in the algorithm compared to Needleman and Wunsch. The first is that the alignment can start/end anywhere in the matrix and the other that backtracking is started at the highest value rather than in the lower right corner. Backtracking is terminated as soon as zero is encountered. The score function is shown in Fig. 4.11.

Fig. 4.11
A function is given as F of i, j=max of 4 functions, 0, F of i minus 1, j minus 1+ s of i, j, F of i minus 1, j minus d, and F of i, j minus 1 minus d.

Score function for Smith and Waterman. An extra possibility of 0 can be selected in addition to the three in Fig. 4.5. The 0 is chosen if the other three are negative. This way cells are marked as 0 to account for different length of sequences in the local alignment (Fig. 4.12)

We will align the same two short model protein sequences with Smith and Waterman:

  • Sequence 1: HEAGAWGHEE

  • Sequence 2: PAWHEAE

And use the same parameters of BLOSUM50 and a linear gap penalty of d = −8.

$$ F\left(i,j\right)=\max \Big\{{\displaystyle \begin{array}{l}0\\ {}F\left(i-1,j-1\right)+s\left(i,j\right)\\ {}F\left(i-1,j\right)-d\\ {}F\left(i,J-1\right)-d\end{array}} $$

Using the score function in Fig. 4.11 results in 0 in most cells when the matrix is filled in the forward direction (Fig. 4.12). The reason is that the three other possibilities in Fig. 4.11 result in negative values, and we then have to use 0. Backtracking starts with the highest score (28) and terminates when 0 is reached. Note that an alternative path is also possible. However, the maximum score of this one is lower (21), and therefore, the first is chosen to construct the final pairwise alignment (Fig. 4.13). Similar to the example with Needleman and Wunsch (Fig. 4.10), a more realistic example is shown in Fig. 4.14 with longer sequences. The sequences are the same that we used in Fig. 4.10, and we see little difference between the two pairwise alignments probably because they were of nearly the same length.

Fig. 4.12
A schematic diagram depicts the pairwise alignment with rows given for s of i and columns for t of j. The backtracking pathway is given from the bottom right corner to the top left until 0 is reached.

Smith and Waterman alignment. Here, the matrix has been filled with combinations based on the score function in Fig. 4.11. Backtracking started from the highest score in the matrix (28) and continued to join the highest possibilities until 0 was reached. Jesper Larsen is acknowledged for the figure

Fig. 4.13
A schematic diagram depicts the pairwise alignment with rows given for s of i and columns for t of j. The pathway is given from 28 to 0. The optimal local alignment is given.

Smith and Waterman algorithm. Here, the backtrack path is shown, and the resulting pairwise alignment has been constructed. Note the difference to global alignment generated by the Needleman and Wunsch algorithm in Fig. 4.9. The total score for the pairwise alignment is 28. Jesper Larsen is acknowledged for the figure

Fig. 4.14
A pairwise alignment algorithm is depicted for the water program. The code format is given for run date, align format, report file, aligned sequence, matrix, gap penalty, extend penalty, length, identity, similarity, gaps, and score. The sequences are depicted below.

Pairwise alignment with the Smith and Waterman algorithm. A real example with sequences of realistic length has been computed using the water program of EMBOSS, which has implemented the Smith and Waterman algorithm (see Activity 4.6.2 for details about this program). Note that there are only small differences to the pairwise alignment generated by Needleman and Wunsch (Fig. 4.10). It is related to the comparable length of the two sequences. Similarity is calculated only based on “positives,” which means the scores in the BLOSUM62 matrix, which are positive (signatures | and:)

4.2 Multiple Alignment

A multiple alignment is the simultaneous alignment of three or more nucleic acid or amino acid sequences. The procedure involves the insertion of gaps in the sequences so as to maximize the overall similarity (Higgins and Sharp 1988). Multiple alignments are rarely used for their own sake but are usually created for another purpose—for instance, primer design (Chap. 5) or analysis of phylogeny (Chap. 6). Users select a favorite program or program package and try to optimize program settings for that.

To start the construction of a multiple alignment, we will use the same criteria as for the pairwise alignment problem. We will assume that they shared ancestors (homologous). We need to model evolution the way that every column represents only orthologous amino acids or nucleotides that have evolved from a common ancestor. The simplest assumptions are again that identical nucleotides or identical amino acids pair, that different amino acids with similar physiochemical properties pair, that it is more difficult to form a gap than to form a mismatch, and that it is more difficult to form a gap than to extent one already formed. The scoring between nucleotides and amino acids is based on the system above for pairwise alignment (Sect. 4.1.2).

4.2.1 Clustal

To form a multiple alignment of the Clustal type, first pairwise alignments between sequences 1–2, 1–3, 1–4, 1–5, 2–3, 2–4, 2–5 …0.4–5 are formed (Fig. 4.15). This is the progressive alignment part of the process, and there are (n(n−1))/2 combinations where n is the number of sequences. In this example with five sequences, there are ten combinations. The next step is to construct a “guide tree” (Fig. 4.16) uniting the pairs with highest score. The final step is to construct the multiple alignment from the guide tree, which is called profile alignment (Fig. 4.17). The Clustal type of multiple alignment is therefore performing progressive profile alignment. Clustal has been called “a quick and dirty version of Feng and Dolittle (1987)” where “only residues that are part of the matches of a given length (k-tuple matches) are scored” (Higgins and Sharp 1988).

Fig. 4.15
A set of combinations is given for C L U S T A L V multiple sequence alignments. The sections depicted are in sequence format are Pearson, start of pairwise alignments, start if multiple alignments, and alignment score. Finally, the C L U S T A L alignment file is created.

Output from CLUSTALW showing the pairwise combinations and the grouping of sequences based on the guide tree (Fig. 4.16)

Fig. 4.16
A screenshot of the webpage for the CLUSTAL W flow chart.

The guide tree of ClustalW, which is used by the program to perform the full multiple alignment

Fig. 4.17
The set of sequences depicts the CLUSTAL W multiple sequence alignment.

The final multiple alignment made with CLUSTALW

The developments in the Clustal programs was started with Clustal (1988) (Higgins and Sharp 1988) and continued with Clustal V (five) (1992), Clustal W (weights) (1994), and Clustal X 2.0 (2007) (Larkin et al. 2007). ClustalX is the implementation of the program for PC (Figs. 4.18, 4.19) and is further described in activity 4.4.3. Clustal Omega (http://www.clustal.org/omega/) is the most recent version of the program for multiple alignment of very large protein datasets (Sievers et al. 2011).

Fig. 4.18
A screenshot of the webpage for CLUSATL X sequences. The mode is multiple alignment and the font is 36. The non-aligned sequences are given in colored grids on the right.

CLUSTALX (Larkin et al. 2007) showing the sequences not yet aligned. This is seem from the unordered arrangement of the nucleotides in the columns

Fig. 4.19
A screenshot of the CLUSTAL X sequence alignment webpage. The mode is multiple alignment, and the font is 36. The alignment sequences are depicted in the colored grids on the right.

CLUSTALX (Larkin et al. 2007) showing aligned sequences where most of the columns show only one type of nucleotide

ClustalX/W produces global alignments, which include benefits and drawbacks inhered from then pairwise alignment part included in the construction of the alignment. Domain structures of proteins can be optimized by.

ClustalW/X (Fig. 4.20). ClustalW/X invokes different “hidden” functions during alignment and tend to cluster gaps and take hydrophobic/hydrophobic properties into account. An additional benefit with the ClustalX program is that it allows an identity matrix to be generated. After the multiple alignment has been generated under the menu “Trees” and submenu “Output format Options,” one can mark “% identity matrix.” The matrix will then be generated by running the “Trees | Draw tree” command.

Fig. 4.20
A screenshot of the CLUSTAL X sequence alignment webpage. The alignment option selected is to realign the selected residue range.

CLUSTALX (Larkin et al. 2007) showing optimization by realignment of a region of the multiple alignment to improve local regions

An inherent problem with the progressive approach used in ClustalX/W is that mistakes made in the initial alignment cannot be corrected later. To account for this and other problems, other programs have been constructed.

4.2.2 Other Multiple Alignment Programs

Multiple Sequence Comparison by Log-Expectation (MUSCLE) (Edgar 2004) (https://drive5.com/muscle5/) includes no guide tree. MUSCLE is claimed both to achieve better average accuracy and better speed than ClustalW2 or T-Coffee (Edgar 2004). The main focus is on protein multiple alignments, but it works also on DNA. The principle is based on K-mer comparison between sequences. K-mers are contiguous subsequence of short length (k-tuple). Related sequences tend to have more k-mers in common than expected by chance. The program can be used from MEGA11 (Tamura et al. 2011) (https://www.megasoftware.net/) used in Chaps. 6 and 11. Like the following programs, T-Coffee and MAFFT have options to adjust alignment parameters for a range of applications.

T-Coffee (http://tcoffee.crg.cat/) (Notredame et al. 2000) incorporates a “tree-based consistency objective function for alignment evaluation.” The benefit should be high accuracy. The program comes in a range of flavors suitable for different applications.

MAFFT (Yamada et al. 2016; Katoh et al. 2019 and other papers) (https://mafft.cbrc.jp/alignment/software/) is also a multiple alignment package that includes applications for very large datasets.

COBALT is a progressive constraint-based alignment program (Papadopoulos and Agarwala 2007). Constraints are from Conserved Domain Database (CDD) (Chap. 3) and PROSITE motif databases.

Multiple alignment programs can be compared with sets of reference sequence (BAliBASE) (Thompson et al. 1999).

The ARB package has been developed to account for the secondary structures in 16S rRNA sequence (Ludwig et al. 2004) (http://www.arb-home.de/). It has been used to construct precomputed multiple alignments in the SILVA database (www.arb-silva-de), which is further considered in Chap. 8. Folding patterns in 16S rRNA genes can be predicted with Mfold (http://www.unafold.org/mfold/applications/rna-folding-form-v2.php) (Zuker and Jacobson 1998).

Trimming should only be done with limited datasets of closely related organisms. Trimming can be done in BioEdit (Fig. 4.21) (see also Sect. 5.6). For Mac users, Jalview can be used (https://www.jalview.org) (Waterhouse et al. 2009).

Fig. 4.21
A screenshot of the webpage for the Bio Edit sequence alignment editor. The sequence highlighted depicts gaps. The taskbar is given at the top.

BIOEDIT (Hall 1999) (https://bioedit.software.informer.com/download/) used to trim a region of a multiple alignment with many gaps inserted as a consequence of lack of data (short sequences). The box marked in black was deleted

Graphic output of multiple alignments can be made in Boxshade

(http://arete.ibb.waw.pl/PL/html/boxshade.html) (in Polish) (Fig. 4.22).

Fig. 4.22
A screenshot of the webpage for sequence alignment. Various parts of the sequence are highlighted. The taskbar is given on top.

Output from BOXSHADE (http://arete.ibb.waw.pl/PL/html/boxshade.html). This output can be included for presentation in a publication

For whole genome alignments, we need to account for patchy distribution of matching regions, and we cannot use the strategies for single genome alignments. Strategies using a suffix tree can solve the problem. One strategy is to create maximal unique match (MUM) decomposition of the two genomes to be aligned. A MUM is a subsequence occurring exactly once in each of the two genomes. The strategy (MUMmer) for alignment in all includes suffix tree construction, sorting and extraction of the longest increasing subsequences and generation of Smit-Waterman alignments for all the MUMs (Delcher et al. 1999).

4.3 BLAST

Basic Local Alignment Search Tool (BLAST) was originally described by Altschul et al. (1990), and this tool enabled fast search in the electronic nucleotide and protein databases during the 1990s when the internet was invented. Later, an improved algorithm was described allowing gaps to be inserted during the analysis (Altschul et al. 1997). Both versions of BLAST are in use as the “ungapped” (1990) and “gapped” (1997), respectively.

BLAST is based on a heuristics search algorithm, which is able to search through the ever-growing sizes of databases. BLAST can be compared to the search tool Google in popularity within the bioinformatical community including the formation of the verb “to blast.” BLAST is based on the initial identification of the user query sequence by short pieces of sequence (words). These words are compared to the database with all the sequences (subjects). If the word is identified in a sequence of the database, the match of the word can be extended and will form a high scoring pair (HSP). When no further extension is possible, the result is returned as a hit list to the used with indication of similarity between the query and the closest related subjects in the database. The “gapped” version of BLAST is most frequently used, and its “two hit method” requires two nonoverlapping “words” on the same diagonal with a distance of A before an extension is invoked. Gapped BLAST should only require 1/3 computer time compared to ungapped version because of less extensions of the words. BLAST uses dynamic programming to extend residues in both directions, and BLAST is most suitable for finding of subject DNA and protein sequences in databases that match with a reasonable similarity and a comparable length to the query. BLAST is good for sequences of at least 100 in length that are well represented in the databases. BLAST is not suitable to make really precise determination of similarity between sequences, and pairwise alignment programs should instead be used for that (Sect. 4.1). BLAST is not suitable for sequence-based identification of bacterial species since there are simply too many sequences in GenBank and the type strains are not always clearly labeled (Chap. 7). BLAST is not suitable for databases beyond a certain size and cannot be used to search very large metagenomics databases (Chap. 9). In principle, the database search of BLAST is based on local alignments. It needs to be local since we are never certain if our query sequence will be represented by a homolog in the database with similar length. BLAST is most often used on the NCBI server (https://blast.ncbi.nlm.nih.gov/Blast.cgi); however, the BLAST program can also be installed on your local computer and used with you own database.

4.3.1 NCBI BLAST

The most common way of performing a BLAST search is to access the program at NCBI and search in their databases. A search can be performed by a DNA or protein sequence in FASTA format, by uploading a sequence file in FASTA format or by inserting an accession number as AB490809 shown in Fig. 4.23 if the sequence is already included with GenBank. In the example shown on Fig. 4.23, default setting has been selected for a BLAST search with a DNA sequence. Under Program Selection and Optimize for different versions of BLAST can be selected. In this case, megablast, discontigous megablast, and blastn can be selected. The three versions differ by the scores given to matches and their way of penalizing gaps. For shorter sequences, blastn should be used.

Fig. 4.23
A screenshot of a webpage for nucleotide blast. The sections given on the page are for entering query sequences, choosing search sets, and program selection. Other tabs are also depicted at the top of the page.

BLAST search (Altschul et al. 1990, 1997). BLAST [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2004—[cited 2018 04 17]. Available from: https://blast.ncbi.nlm.nih.gov/Blast.cgi

When the search is terminated, you will see the familiar graphical overview in (Fig. 4.24)—the colored section with horizontal bars.

Fig. 4.24
A screenshot of the webpage for Blast Suite. The BLAST results are listed on the page. The information given is the job title, graphic summary, and descriptions. The screenshot below depicts the webpage for sequence alignments.

BLAST (Altschul et al. 1990, 1997) output showing three screenshots of the output from BLAST search: Graphic Summary, Description, and Alignments, respectively. The number of Max target sequences was reduced to 10 in the Alignment parameters section to show it all on one page (BLAST [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2004 – [cited 2018 04 17]. Available from: https://www.ncbi.nlm.nih.gov/Blast.cgi)

In the next section comes the descriptions with one line for each subject, which gives a more detailed information about the subject sequences providing Score, Query cover, E value, and accession number. These parameters will be explained in the following.

The hit list is arranged by decreasing Score. In the example, you see that hits are sorted by decreasing score. The identities are also decreasing but not systematically since the score is not always directly related to the identity. The longer the match between two sequences, the higher the score given the same identity. E values cannot be directly seen here since they are very small and they are just shown as “0.0.”

In the alignment section, the pairwise alignments are shown. This is the most detailed information of comparison between your query sequence and the subjects in the database. The parameters from the description section are available, and additional information of gaps and visual location of both gaps and mismatches can be observed.

4.3.2 Ortholog Detection

Sequences are orthologs if they are homologs and have the same function in different species. The paralogs are also homologs but have been duplicated during evolution, and they have got divergent functions in the same species, and the sequences are often quite divergent. BLAST can be used to sort the orthologs from the paralogs. The principle of reciprocal best hits (RBH) is used as explained in Moreno-Hagelsieb and Latimer (2008) “Two genes residing in two different genomes are deemed orthologs if their protein products find each other as the best hit in the opposite genome” (Moreno-Hagelsieb and Latimer 2008). When BLAST search is done against different organisms, the orthologs will often have the highest scores and the paralogs lower scores. Further investigation of orthologs and paralogs can be done by phylogeny (Chap. 6).

4.3.3 BLAST2 sequences

BLAST2 sequences is a version of NCBI BLAST that allows pairwise comparisons on server or on stand-alone BLAST installed on your PC. The program is available on both DNA and protein level, and it can be used to build your own database for many purposes by including many sequences in the second window (Fig. 4.25). For instance, if the query DNA or protein sequence is included with the first search window and a range of genomes included in the second search, you can identify the query sequence in the whole genomic sequences including genomes not yet deposited with the databases. The same approach can be used to build a small MLST database (Chap. 11). BLAST2 sequences are started from the ordinary BLAST page NCBI by marking “Align two or more sequences.” It can be used for fast draft comparisons between two sequences, for example, if you want to look up the location of a primer or gene on a genome. This is your “Swiss knife” always available if you just can access the internet. Like for an ordinary knife, be careful with BLAST2 sequences! If you need real precise similarities between sequences, they should be obtained by the pairwise alignment programs described Sect. 4.1.

Fig. 4.25
A screenshot of the aligned sequences of nucleotides BLAST. The sections given are entering query sequence, entering subject sequence, and program selection.

BLAST 2 sequences. (Altschul et al. 1990, 1997). BLAST [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2004—[cited 2018 04 17]. Available from: https://www.ncbi.nlm.nih.gov/Blast.cgi

4.3.4 Statistics

Evaluation of the results in the BLAST output is based on the Karlin and Altschul formula E = k × m × N × e–λs (Karlin and Altschul 1990) where m is letters in query meaning the number of nucleotides or amino acids, N is the total letters in database, and S is the actual score. For amino acids, S is defined by the BLOSUM matrix. S is scaled to bit score to better fit the computer. The bit score = ((λ × S) – ln κ)/ln2. In the pairwise section (Fig. 4.24 above), you can see the raw score in parenthesis along with the bit score. The constants κ and λ depending on the database. E reflects that we find the sequence in the database by chance (chance for false positives). E is this way related to the size of the database. The smaller E, the less likely is it found by chance, and the more unique is the sequence—“The smaller E the better.” For really “unique” sequences, E is so small that it cannot be represented on the computer; this is the reason for “0.0” with the output.

All parameters related to the search can be found in Search Summary (Fig. 4.26). Here, you can find parameters used for calculating the statistics in the Karlin and Altschul formula, and you can understand everything about the BLAST output if you compare these parameters to the statistics just explained.

Fig. 4.26
A screenshot of the webpage of the BLAST sequence with information given for R I D, query I D, description, molecule type, and query length. The tables are given below for search parameters, database, Karlin-Alt Schul statistics, and results statistics.

The Search Summary section of the output showing all parameters used for the search (Altschul et al. 1990, 1997). BLAST [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2004—[cited 2018 04 17]. Available from: https://www.ncbi.nlm.nih.gov/Blast.cgi

Note that parameters for ungapped blast (Altschul et al. 1990) differ from gapped-blast (Altschul et al. 1997) (Fig. 4.26), and λ and κ are different for gapped BLAST compared to ungapped.

4.3.5 Variants of BLAST

A BLAST search can be based on a query DNA sequence that the program translates to protein and compares to all protein sequence in the database (BLASTx). In this case, the search takes more time, and it should only be done when it is absolute needed (rarely). For such comparison, the codon table should be defined according to the type of organisms. Prokaryotes are using codon table 11 (Appendix). tBLASTn searches a translated nucleotide databases using a protein query. tBLASTx searches a translated nucleotide query against the translated nucleotide databases.

For BLASTp, it is sometimes convenient to select Swiss-Prot or ref_seq_protein if you want a really precise information about the hits. For protein searches, note the additional options PSI-, PHI-, and delta BLAST (Fig. 4.27), and they can be used to build up a profile used to search for query proteins, which are rather distantly related to the subject proteins in the database.

Fig. 4.27
A screenshot of the webpage for BLAST. The page for the standard protein BLAST. The sections given are entering query sequence, choosing search set, and program selection.

Protein BLAST, BLASTp, note the additional options PSI- PHI- and delta BLAST (Altschul et al. 1990, 1997). (Lipman 1997) BLAST [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2004—[cited 2018 04 17]. Available from: https://www.ncbi.nlm.nih.gov/Blast.cgi

4.4 Activities

4.4.1 Pairwise Alignment

We will construct local and global pairwise alignments by the use of the programs “water” and “needle,” respectively, available as EMBOSS programs (Rice et al. 2000) on the EBI server. The protein sequences AAA85484 and AAA85485 can be downloaded from NCBI as described in Chap. 3 (see Sect. 3.3.1 including Activity 3.1).

Decide if you want to use “needle” for global or “water” for local alignment.

Open the FASTA format files and paste them in the windows of the relevant EMBOSS program on the server and select protein or DNA:

https://www.ebi.ac.uk/Tools/psa/emboss_needle/

https://www.ebi.ac.uk/Tools/psa/emboss_water/

4.4.2 Multiple Alignment with ClustalX

From http://www.clustal.org/download/current/ download clustalx-2.1-win.msi, install and locate the icon to the desktop. Open the program by double-clicking on the icon. File | Load sequences and select the sequences in FASTA format from the location on your computer. You need to edit the input file yourself. The sequences that you want to use as input should be in one file only with all sequences in FASTA format:

>sequence1

ATGACGATAC…

>sequence2

GATAGATAGACS…

etc.

Select Alignment | Do complete Alignment | ok.

In the lower left corner, you can follow how the program works.

Scroll through the alignment when it is completed.

At the bottom, the bars show the degree of conservation of columns. Above the alignment, the signature * shows columns with conserved nucleotides (or amino acids) with the same properties.

4.4.3 BLAST

We will use this activity to learn how to perform an ordinary BLAST search in GenBank with a DNA sequence. However, we also learn some more about BLAST in this activity. We will perform a search with the sequence with acc. no. AB490809 from GenBank.

First perform a BLAST search at NCBI: https://blast.ncbi.nlm.nih.gov/Blast.cgi. On the graphics, select nucleotide BLAST and insert the acc. no. AB490809 in the window and mark “somewhat similar sequences blastn,” mark “show results in a new window” and press the blue BLAST bottom. Recently, we have seen a request from the NCBI-BLAST server to restrict the search to reduce the demand for CPU capacity. For the current example, rRNA/ITS databases can be selected instead of Standard nr.