Pairwise Alignment, Multiple Alignment, and BLAST

Christensen, Henrik; Olsen, John Elmerdahl

doi:10.1007/978-3-319-99280-8_4

Henrik Christensen² &
John Elmerdahl Olsen²

Part of the book series: Learning Materials in Biosciences ((LMB))

3640 Accesses

Abstract

Quantitative comparison of sequences can be performed pairwise by aligning two sequences based on considerations of gaps representing insertions or deletions and matches between nucleotides and amino acids. Pairwise comparisons can be performed as global alignments if it is known that the sequences are homologous in their full length or by local alignments if it is known that one sequence is shorter than the other. The dynamic programming algorithms for constructing global and local alignments will be described in detail and relevant computer programs suggested. The different strategies behind multiple alignments are described including relevant computer programs. The search program BLAST is introduced in detail and different applications described.

Access provided by CONRICYT-eBooks. Download chapter PDF

Pairwise Alignment, Multiple Alignment, and BLAST

BLAST and FASTA Similarity Searching for Multiple Sequence Alignment

Sequence Alignment

FormalPara What You Will Learn in This Chapter

Pairwise alignments and multiple alignments are the basic tools to compare sequences. BLAST is the most frequently used bioinformatics program to compare your own sequence (query sequence) to all sequences in a database (subject sequences) based on local pairwise alignments. The outcome of the BLAST analysis provides qualitative information about homologous sequences and can quantify the identity of the query sequence to the sequences in the database. You will learn about multiple alignments and how to construct them. Multiple alignments are used for a range of applications described in the other chapters of the book.

4.1 The Pairwise Alignment Problem

A pairwise alignment is a model of the homology between two sequences considering all nucleotides or amino acids and all deletion and insertions.

Two sequences to be compared could, for instance, be the ones below:

GCAGTAGCATGACGATAG
GCGGTAGCATGATAC

A pairwise alignment requires a model for the evolutionary steps that led to the differences observed. First, we should test if they are homologous. Such a test can be done by BLAST (see ► Sect. 4.3), and it will tell if the sequences are from the same gene. Once this has been established, we need to model evolution to construct the pairwise alignment. The simplest assumptions are that identical nucleotide pairs and that identical amino acids pair and, further, that different amino acids with similar physiochemical properties form pairs.

For most comparisons it is preferred to base pairwise alignments on the amino acid sequences especially if the sequences are very divergent. Pairwise nucleotide comparisons are preferred for closely related sequences only.

Gaps are used to show that an amino acid or nucleotide is without a match in the other sequence and the gaps represent insertions or deletions in an evolutionary context. Most evolutionary modeling assumes that it is more difficult to form gaps than to form mismatches and that it is more difficult to form a gap than to extend one already formed. Scoring systems are used to separate matches (high score) from mismatches (low score) and gaps (very low score). For proteins, scoring matrices (see ► Sect. 4.1.2.1) are used. At the end, scores between amino acids or nucleotides are summarized over the whole pairwise alignment, and penalties for gaps are subtracted. The result is a unit score for the pairwise alignment given the parameters used. Given the same sequences compared, the pairwise alignment with the highest total score is preferred compared to one with a lower score.

When the pairwise alignment has been constructed, the identity and similarity can be quantified. The identity is the number of nucleotides or amino acids matching in two sequences compared at all positions between the two sequences. Similarity is a further comparison also considering different types of nucleotides or amino acids as well as the gaps.

An alignment of the sequences above, and a possible scenario leading to the differences between them, is shown in ◘ Fig. 4.1.

The principles of pairwise alignments are used for most multiple alignment programs and for BLAST. The three tools are fundamental to carry out most bioinformatics (◘ Fig. 4.2).

4.1.1 Global or Local Pairwise Alignments?

Before a pairwis e alignment is constructed, it needs to be decided if it should be global or local. A pair wise alignment is global if it is know that the sequences are homologous in their full length. In this situation, it is sound to align both sequences in their full length. If preliminary evidence has showed that the genes are encoding for the same proteins and the full length of the gene has been sequenced, a global alignment can be constructed. A local alignment is needed if it is know that one sequence is shorter than the other and that it cannot be related to the other in its full length. This can happen if the DNA sequence of one gene has not been determined in the full length or if the domain structure of the proteins that the genes encode for are rather divergent.

4.1.2 Substitution Matrices

Substitution matrices are used to model the probability of mutual amino acid or nucleotide substitutions in sequences. Amino acids or nucleotides with a lower probability of changing are given more weight in the comparisons compared to those with a higher probability of changing. The changes are assumed to occur according to a Markov model where changes only depend on the current nucleotide or amino acid observed and not on past changes (◘ Fig. 4.3).

4.1.2.1 Amino Acid Substitution Matrices

Amino acids have diffe rent biochemi cal and physical properties that influence their relative replaceability during evolution (◘ Table 4.1). The probability of replacement has been related to the physiochemical properties of the amino acids in the way that hydrophobic amino acids are easier replaced by other hydrophobic amino acids again compared to other types such as charged amino acids. The substitution matrices reflect these properties as seen from ◘ Fig. 4.4. Here the score between the two hydrophobic amino acids isoleucine and histidine is 5, whereas the score between isoleucine and the positive charged arginine is only −4. Some amino acids belong to more groups, for instance, histidine which is hydrophobic, positively charged, polar, and aromatic.

Table 4.1 Physiochemical properties of amino acids

Full size table

The PAM (percent accepted mutations ) matrices are based on global alignments of closely related proteins. The number at the end of a PAM matrix refers to the number of steps required for a given percent change. PAM1 corresponds to 99% identity (1% change) between the aligned sequences. This means that PAM112 will recognize more distantly related sequences (40% identity) than PAM23 (80% identity). PAM matrices are based on global alignments and therefore best suited for global pairwise alignments used in evolutionary studies (◘ Table 4.2).

Table 4.2 Practical rules for use of PAM and BLOSUM amino acid substitution matrices

Full size table

BLOSUM (blocks substitution matrix ) matrices are based on local alignments. The number at the end of a BLOSUM matrix refers to the minimum percent identity allowed in the set of aligned sequences used to build the matrix. This means that BLOSUM50 will recognize more distantly related sequences than BLOSUM80 (◘ Table 4.2).

4.1.2.2 Nucleotide Substitution Matrices

Substitution matrices for nucleotides give higher weight to transitions (G to A or C to T and vice versa) than to transversions (G to C, G to T, A to C, A to T, and vice versa) (◘ Table 4.3). This is mai nly related to the larger size of the purine compared to the pyrimidine nucleotides. The simplest model is the Jukes and Cantor model where equal probabilities for the substitution of all four nucleotides are assumed (◘ Table 4.4). The transversion/transition bias is included in the models of HKY, Kimura, Tamura-Nei, Tamura, and the general reversible model. An additional account for the frequency of nucleotides is included in models of Tamura-Nei, the equal input, and the general reversible models. These frequencies are the observed frequencies of nucleotides under comparison. The unrestricted model allows different probabilities of changes for all 12 combinations of nucleotides (◘ Table 4.4).

Table 4.3 The sixteen different types of nucleotide pairs that can be observed between two sequences

Full size table

Table 4.4 Models of nucleotide substitutions

Full size table

4.1.3 Gaps

Gaps are used to model insertions or deletions in sequences. If arbitrarily many gaps are inserted, this can lead to high-scoring align ments of nonhomologous sequences. To abolish this effect, gaps are penalized when alignments are constructed to obtain relatively few gaps and separate penalties used for gap opening and gap elongation. In the linear model, the penalty ϒ is proportional to the number of gaps (g) given the penalty for one gap d:

Linear gap penalty score:

$$ \varUpsilon (g)=- gd $$

If the penalty of one gap is 8 and there are 3 gaps, the total penalty will be – 24.

To better model biological events where many gaps often occur in a row, the affine penalty score model is used where the cost is higher for inserting the first gap than to extent this gap to lengths of two and more:

$$ \varUpsilon (g)=-d-\left(g-1\right)e $$

where ϒ(g) = gap penalty score of gap length g, d = gap opening penalty, and e = gap extension penalty. For a gap of length 3 with a penalty of opening the gap of 8 and of extending the gap with two more, the total penalty will be – 16.

4.1.4 Dynamic Programming

Dynamic programming is a way to compute the pairwise alignment? If all possibilities for the pairwise alignment of two sequences should be tested, there would be around 10⁵⁹ possibilities for two nucleotide sequences of 100 in length. This number is approx. the same as the number of molecules in the Milky Way, and even the largest computer would not be able to handle the problem. There is also no need to consider possibilities without relevance, for instance, that one nucleotide in one sequence would pair only with gaps in the other.

To reduce the computing effort but still to perform a careful analysis, dynamic programming is used. The computer is only testing relevant comparisons (high scores) and keeping in memory the highest ones already calculated. The most relevant dynamic programming algorithms for comparison of global and local pairwise alignments were published be Needleman and Wunsch (1970) and Smith et al. (1981), respectively.

The Needleman and Wunsch algorithm is used when it is known that the sequences represent exactly a gene or protein in full length (few amino acids or nucleotide differences are tolerated) and it is used to build global pairwise alignments.

The Smith and Waterman algorithm is used with partial sequences or with sequences of unknown length to build local alignments. For both algorithms, the adjustment of gap penalty functions depends on previous knowledge about domain structures, repeats, and other properties. Both algorithms and their implementations in computer programs can be used for both amino acid and nucleotide sequences.

4.1.4.1 Needleman and Wunsch

The score f unction in this model is illustrated in ◘ Fig. 4.5. Th is function is used on all comparisons between the two sequences. Here we will use the example with the two sequences:

Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE

◘ Figure 4.6 shows one sequence (1) on the top of the table and sequence 2 as a vertical column. First an alignment path matrix is created. For each cell, F(i, j) is calculated based on the score function in ◘ Fig. 4.5. This matrix allows a stepwise calculation of score values. The score of the best alignment is calculated between the initial segment x_1...i of x up to x_i and the initial segment y_1...j of y up to y_j. The algorithm function, F(i, j), is built recursively beginning with F(0,0) = 0 (◘ Figs. 4.6, 4.7, and 4.8). When the matrix is filled, backtracking is started (evaluation of the optimal path).

The scoring parameters are given by the BLOSUM 50 matrix (◘ Fig. 4.4) and a linear gap penalty of d = −8. We will align the whole sequence length of the sequences meaning that we need to form a global alignment and use the Needleman and Wunsch algorithm. The procedure can be followed on ◘ Figs. 4.6, 4.7, 4.8, and 4.9.

A real example of the use of the Needleman and Wunsch algorithm is shown on ◘ Fig. 4.10. Here realistic long sequences are aligned by the algorithm implemented as the needle program in the EMBOSS package (Rice et al. 2000).

4.1.4.2 Smith and Waterman

Now we will co mp ute a pair wise alignment of the sequences using the Smith and Waterman algorithm. There are two differences in the algorithm compared to Needleman and Wunsch. The first is that the alignment can start/end anywhere in the matrix, and the other that backtracking is started at the highest value rather than in the lower right corner. Backtracking is terminated as soon as zero is encountered. The score function is shown in ◘ Fig. 4.11.

We will align the same two short model protein sequences with Smith and Waterman:

Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE

And use the same parameters of BLOSUM50 and a linear gap penalty of d = −8.

Using the score function in ◘ Fig. 4.11 results in 0 in most cells when the matrix is filled in the forward direction (◘ Fig. 4.12). The reason is that the three other possibilities in ◘ Fig. 4.11 result in negative values and we then have to use 0. Backtracking starts with the highest score (28) and terminates when 0 is reached. Note that an alternative path is also possible. However, the maximum score of this one is lower (21), and therefore the first is chosen to construct the final pairwise alignment (◘ Fig. 4.13). Similar to the example with Needleman and Wunsch (◘ Fig. 4.10), a more realistic example is shown in ◘ Fig. 4.14 with longer sequences. The sequences are the same that we used in ◘ Fig. 4.10, and we see little difference between the two pairwise alignments probably because they were of nearly the same length.

4.2 Multiple Alignment

A multiple alignment is the simultaneous alignment of three or more nucleic acid or amino acid sequences. The procedure involves the insertion of gaps in the sequences so as to maximize the overall similarity (Higgins and Sharp 1988). Multiple alignments are rarely used for their own sake but are usually created for another purpose – for instance, primer design (► Chap. 5) or analysis of phylogeny (► Chap. 6). Users select a favorite program or program package and try to optimize program settings for that.

To start the construction of a multiple alignment, we will use the same criteria as for the pairwise alignment problem. We will assume that they shared ancestors (homologous). We need to model evolution the way that every column represents only orthologous amino acids or nucleotides that have evolved from a common ancestor. The simplest assumptions are again that identical nucleotides or identical amino acids pair, that different amino acids with similar physiochemical properties pair, that it is more difficult to form a gap than to form a mismatch, and that it is more difficult to form a gap than to extend one already formed. The scoring between nucleotides and amino acids is based on the system above for pairwise alignment (► Sect. 4.1.2).

4.2.1 Clustal

To form a multi ple alignment of the clustal type, fi rst pairwise alignments between sequences 1–2, 1–3, 1–4, 1–5, 2–3, 2–4, 2–5…4–5 are formed (◘ Fig. 4.15). This is the progressive alignment part of the process, and there are (n(n – 1))/2 combinations where n is the number of sequences. In this example with five sequences, there are ten combinations. The next step is to construct a “guide tree” (◘ Fig. 4.16) uniting the pairs with highest score. The final step is to construct the multiple alignments from the guide tree which is called profile alignment (◘ Fig. 4.17). The clustal type of multiple alignments is therefore performing progressive profile alignment. Clustal has been called “a quick and dirty version of Feng and Dolittle (1987)” where “only residues that are part of the matches of a given length (k-tuple matches) are scored” (Higgins and Sharp 1988).

The developments in the clustal programs were started with Clustal (1988) (Higgins and Sharp 1988) and continued with ClustalV (five) (1992), ClustalW (weights) (1994), and ClustalX 2.0 (2007) (Larkin et al. 2007). ClustalX is the implementation of the program for PC (◘ Figs. 4.18 and 4.19) and is further described in ► Activity 4.4.2. Clustal Omega is the most recent version of the program for multiple alignments of very large protein datasets (Sievers et al. 2011).

ClustalX/W produces global alignments which include benefits and drawbacks inhered from the pairwise alignment part included in the construction of the alignment. Domain structures of proteins can be optimized by ClustalW/X (◘ Fig. 4.20). ClustalW/X invokes different “hidden” functions during alignment and tends to be cluster gaps and take hydrophobic/hydrophobic properties into account. An inherent problem with the progressive approach used in ClustalX/W is that mistakes made in the initial alignment cannot be corrected later. To account for this and other problems, other programs have been constructed.

4.2.2 Other Multiple Alignment Programs

MUSCLE (Multiple Sequence Comparison by Log-Expectation ) (Edgar 2004) (► https://www.drive5.com/muscle/downloads.htm) include s no guide tree. MUSCLE is claimed both to achieve better average accuracy and better speed than ClustalW2 or T-Coffee (Edgar 2004). The main focus is on protein multiple alignments, but it works also on DNA. The principle is based on K-mer comparison between sequences. K-mers are contiguous subsequence of short length (k-tuple). Related sequences tend to have more k-mers in common than expected by chance. The program can be used from MEGA7 (Tamura et al. 2011) (► https://www.megasoftware.net/) used in ► Chaps. 6 and 11. Like the following programs, T-Coffee and MAFFT have options to adjust alignment parameters for a range of applications.

T-Coffee (► http://tcoffee.crg.cat/) (Notredame et al. 2000) incorporates a “tree-based consistency objective function for alignment evaluation.” The benefit should be high accuracy. The program comes in a range of flavors suitable for different applications.

MAFFT (Yamada et al. 2016 and other papers) (► https://mafft.cbrc.jp/alignment/software/) is also a multiple alignment package that includes applications for very large datasets.

Multiple alignment programs can be compared with sets of reference sequence (BAliBASE) (Thompson et al. 1999).

To account for the secondary structures in 16S rRNA sequence, the ARB package (Ludwig et al. 2004) (► http://www.arb-home.de/) has been developed. It has been used to construct precomputed multiple alignments in the SILVA database (► www.arb-silva-de) which is further considered in ► Chap. 8. Folding patterns in 16S rRNA genes can be predicted with Mfold (► http://unafold.rna.albany.edu/?q=mfold) (Zuker and Jacobson 1998) or similar program.

Trimming should only be done with limited datasets of closely related organisms. Trimming can be done in BioEdit (◘ Fig. 4.21). For Mac users Jalview can be used. (► http://www.jalview.org) (Waterhouse et al. 2009)

Graphic output of multiple alignments can be made in BoxShade (► http://www.ch.embnet.org/software/BOX_form.html) (◘ Fig. 4.22).

4.3 BLAST

BLAST (Basic Local Alignment Search Tool) was originally described by Altschul et al. (1990), and this tool enabled fast search in the electronic nucleotide and protein databases during the 1990s when the Internet was invented. Later an improved algorithm was described allowing gaps to be inserted during the analysis (Altschul et al. 1997). Both versions of BLAST are in use as the “ungapped” (1990) and “gapped” (1997), respectively.

BLAST is based on a heuristics search algorithm which is able to search through the ever-growing sizes of databases. BLAST can be compared to the search tool Google in popularity within the bioinformatics community including the formation of the verb “to blast.” BLAST is based on the initial identification of the users’ query sequence by short pieces of sequence (words). These words are compared to the database with all the sequences (subjects). If the word is identified in a sequence of the database, the match of the word can be extended and will form a high-scoring pair (HSP) . When no further extension is possible, the result is returned as a hit list to the used with indication of similarity between the query and the closest related subjects in the database. The “gapped” version of BLAST is most frequently used, and its “two-hit method” requires two non-overlapping “words” on the same diagonal with a distance of A before an extension is invoked. Gapped BLAST should only require 1/3 computer time compared to ungapped version because of less extensions of the words. BLAST uses dynamic programming to extend residues in both directions; BLAST is most suitable for finding of subject DNA and protein sequences in databases that match with a reasonable similarity and a comparable length to the query. BLAST is good for sequences of at least 100 in length that are well represented in the databases. BLAST is not suitable to make really precise determination of similarity between sequences, and pairwise alignment programs should instead be used for that (► Sect. 4.1). BLAST is not suitable for sequence-based identification of bacterial species since there are simply too many sequences in GenBank and the type strains are not always clearly labeled (► Chap. 7). BLAST is not suitable for databases beyond a certain size and cannot be used to search very large metagenomics databases (► Chap. 9). In principle the database search of BLAST is based on local alignments. It needs to be local since we are never certain if our query sequence will be represented by a homolog in the database with similar length. BLAST is most often used on the NCBI server (► http://www.ncbi.nlm.nih.gov/BLAST/); however, the BLAST program can also be installed on your local computer and used with your own database.

4.3.1 NCBI BLAST

The mos t commo n way of performing a BLAST search is to access the program at NCBI and search in their databases. A search can be performed by a DNA or protein sequence in FASTA format, by uploading a sequence file in FASTA format or by inserting an accession number as AB490809 shown in ◘ Fig. 4.23 if the sequence is already included with GenBank. In the example shown on ◘ Fig. 4.23, default setting has been selected for a BLAST search with a DNA sequence. Under Program Selection and Optimize for different versions of BLAST can be selected. In this case megablast, discontigous megablast, and blastn can be selected. The three versions differ by the scores given to matches and their way of penalizing gaps. For shorter sequences blastn should be used.

When the search is terminated, you will see the familiar graphical overview in ◘ Fig. 4.24 – the colored section with horizontal bars. You can see details of subject sequences by mousing over the bars in this section.

In the next section comes the descriptions with one line for each subject which gives a more detailed information about the subject sequences providing Score, Query cover, E value, and accession number. These parameters will be explained in the following.

The hit list is arranged by decreasing Score. In the example you see that hits are sorted by decreasing score. The identities are also decreasing but not systematically since the score is not always directly related to the identity. The longer the match between two sequences, the higher the score given the same identity. E values cannot be directly seen here since they are very small and they are just shown as “0.0.”

In the alignments section, the pairwise alignments are shown. This is the most detailed information of comparison between your query sequence and the subjects in the database. The parameters from the description section are available, and additional information of gaps and visual location of both gaps and mismatches can be observed.

4.3.2 Ortholog Detection

Sequences a re orthologs if they are homologs and have the same function in different species. The paralogs are also homologs but have been duplicated in evolution, and they have got divergent functions in the same species, and the sequences are often quite divergent. BLAST can be used to sort the orthologs from the paralogs. The principle of reciprocal best hits (RBH) is used as explained in Moreno-Hagelsieb and Latimer (2008) “Two genes residing in two different genomes are deemed orthologs if their protein products find each other as the best hit in the opposite genome” (Moreno-Hagelsieb and Latimer 2008). Further investigation of orthologs and paralogs can be done by phylogeny (► Chap. 6).

4.3.3 BLAST2 Sequences

BLAST2 sequences is a version of NCBI BLAST that allows pairwise c omparisons on server or on stand-alone BLAST installed on your PC. The program is available on both DNA and protein level, and it can be used to build your own database for many purposes by including many sequences in the 2nd window (◘ Fig. 4.25). For instance, if the query DNA or protein sequence is included with the 1st search window and a range of genomes included in the 2nd search, you can identify the query sequence in the whole genomic sequences including genomes not yet deposited with the databases. The same approach can be used to build a small MLST database (► Chap. 11). BLAST2 sequences is started from the ordinary BLAST page NCBI by marking “Align two or more sequences.” It can be used for fast draft comparisons between two sequences for example if you want to look up the location of a primer or gene on a genome. This is your “Swiss knife” always available if you just can access the Internet. Like for an ordinary knife, be careful with BLAST2 sequences! If you need real precise similarities between sequences, they should be obtained by the pairwise alignment programs described in ► Sect. 4.1.

4.3.4 Statistics

Evaluation of the results in the BLAST output is based on the Karlin and Altschul formul a E = k × m × N × e^–^λs (Karlin and Altschul 1990) where m is letter in query meaning the number of nucleotides or amino acids, N is the total letters in database, and S is the actual score.

For amino acids, S is defined by the BLOSUM matrix. S is scaled to bit score to better fit the computer. The bit score = ((λ × S) – ln κ) / ln2. In the pairwise section (◘ Fig. 4.24 above), you can see the raw score in parenthesis along with the bit score, the constants κ and λ depending on the database. E reflects that we find the sequence in the database by chance (chance for false positives). E is this way related to the size of the database. The smaller E the less likely is it found by chance and the more unique is the sequence – “The smaller E the better.” For really “unique” sequences, E is so small that it cannot be represented on the computer; this is the reason for “0.0” with the output.

All parameters related to the search can be found in Search Summary (◘ Fig. 4.26); just press the bottom Graphic summary on the output page (◘ Fig. 4.24). Here you can find parameters used for calculating the statistics in the Karlin and Altschul formula, and you can understand everything about the BLAST output if you compare these parameters to the statistics just explained.

Note that parameters for ungapped blast (Altschul et al. 1990) differ from gapped blast (Altschul et al. 1997) (◘ Fig. 4.26) and λ and κ are different for gapped BLAST compared to ungapped.

4.3.5 Variants of BLAST

A BLAST sea rch can be based on a query DNA sequence that the program translates to protein and compares to all protein sequence in the database (BLASTx). In this case the search takes more time, and it should only be done when it is absolutely needed (rarely). For such comparison, the codon table should be defined according to the type of organisms. Prokaryotes are using codon table 11 (Appendix). tBLASTn searches a translated nucleotide databases using a protein query. tBLASTx searches a translated nucleotide query against the translated nucleotide databases.

For BLASTp it is sometimes convenient to select Swiss-Prot or ref_seq_protein if you want a really precise information about the hits. For protein searches note the additional options PSI-, PHI-, and delta BLAST (◘ Fig. 4.27); they can be used to build up a profile used to search for query proteins which are rather distantly related to the subject proteins in the database.

4.4 Activities

4.4.1 Pairwise Alignment

We will construct local and gl obal pairwise alignments by use of the program “water” and “needle,” respectively, available as EMBOSS programs (Rice et al. 2000) on the EBI server. The protein sequences AAA85484 and AAA85485 can be downloaded from NCBI as described in ► Chap. 3 (see ► Sect. 3.3.1 including Activity 3.8.1).

Decide if you want to use “needle” for global – or “water” for local alignment.

Open the FASTA format files, and paste them in the windows of the relevant EMBOSS program on the server:

► http://www.ebi.ac.uk/Tools/psa/emboss_needle/

► http://www.ebi.ac.uk/Tools/psa/emboss_water/

If you need to do comparisons of nucleotide sequences, the similar programs can be found here:

► http://www.ebi.ac.uk/Tools/psa/emboss_needle/nucleotide.html

► http://www.ebi.ac.uk/Tools/psa/emboss_water/nucleotide.html

You can also install the mEMBOSS package on your PC and select the programs water and needle from the menu. Download mEMBOSS from ► ftp://emboss.open-bio.org/pub/EMBOSS/windows/ Click on ► mEMBOSS-6.3.1.2-setup.exe

And download to local computer. Click on file and install.

Use Explorer from Windows to navigate the folder with mEMBOSS.

Go into the Jemboss folder and click Jar and Jemboss MS; this will allow you to run EMBOSS in Windows from a graphic interface. All commands to activate the programs are found on the left menu bar. Most useful are water, transeq, and revseq which will do pairwise Smiths and Water alignments, translate from DNA to protein, and reverse and complement sequences, respectively.

4.4.2 Learn How Dynamic Programming Works with Pairwise Alignments

Use the tool at: ► http://www.itu.dk/~sestoft/bsa/graphalign.html.

Use th e short protein sequences shown already. Try the different substitution matrices, let the gap costs stay at “linear, gap score −8” to reduce the complexity, or make your own choice. Try with or without traceback. The tool has been designed and is maintained by Peter Sestoft.

4.4.3 Multiple Alignment with ClustalX

From ► ftp://ftp.ebi.ac.uk/pub/software/clustalw2/2.0.10/, download ► clustalx-2.0.10-win.msi, inst all, an d locate the icon to the desktop. Open the program by double-clicking on the icon. File | Load sequences, and select the sequences in FASTA format from the location on your computer. You need to edit the input file yourself. The sequences that you want to use as input should be in one file only with all sequences in FASTA format:

>sequence1 ATGACGATAC... >sequence2 GATAGATAGACS... etc.

Select Alignment | Do complete Alignment | ok

In the lower left corner, you can follow how the program works.

Scroll through the alignment when it is completed.

At the bottom, the bars show the degree of conservation of columns. Above the alignment the signature * shows columns with conserved nucleotides (for amino acids: with the same properties).

4.4.4 BLAST

We will use this activit y to learn how to perform an ordinary BLAST search in GenBank with a DNA sequence. However, we also learn some more about BLAST in this activity. We will perform a search with the sequence with acc. no. AF445297 from GenBank. The expectation is that this will be the best hit from the BLAST output. To a surprise it is not so.

First perform a BLAST search at NCBI: ► https://blast.ncbi.nlm.nih.gov/Blast.cgi. On the graphics select nucleotide BLAST, insert the acc. no. AF445297 in the window, mark “somewhat similar sequences blastn,” mark “show results in a new window,” and press the blue BLAST bottom. You will get a nice output and 100% identity to a sequence from an uncultured organism. Where is your query sequence? Usually you would expect the query as the best hit if it comes from the database. The reason can be found if you look up AF445297 at NCBI, since it is labeled as unverified. The unverified sequences are not included as databases in any BLAST search.

Take-Home Messages

Pairwise alignments and multiple alignments are the basic tools for sequence comparison.
A quantitative pairwise comparison of two sequences can be performed by aligning two sequences based on considerations of gaps representing insertions or deletions and matches between nucleotides or amino acids.
Pairwise comparisons can be performed as global alignments if it is known that the sequences are homologous in their full length or by local alignments if it is known that one sequence is shorter than the other.
BLAST is the most frequently used bioinformatics program to compare your own sequence (query sequence) to all sequences in a database (subject sequences).
BLAST provides qualitative information about homologous sequences and can quantify the identity of the query sequence to the subject sequences in the database.
Substitution matrices provide the probability of nucleotide or amino acid substitutions when two or more sequences are compared.
Dynamic programming is the most frequently used procedure to perform pairwise comparisons of nucleotide or amino acid sequences and can be done based on the Needleman and Wunsch algorithm for global alignments and by the Smiths and Waterman algorithm for local alignments.
Multiple alignments are most frequently constructed by the progressive profile procedure as implemented in the clustal family of programs.

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Article CAS PubMed Google Scholar
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
Article CAS PubMed PubMed Central Google Scholar
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. 1999. Biological sequence analysis. Cambridge Univ. Press.
Google Scholar
Edgar, R. C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797
Article CAS PubMed PubMed Central Google Scholar
Feng, D. F. & Dolittle, R. F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 351–60.
Article CAS PubMed Google Scholar
Hall, T. A. 1999. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp Ser 1999;41:95–98.
Google Scholar
Higgins, D. G. & Sharp, P. M. 1988. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237–244.
Article CAS PubMed Google Scholar
Karlin, S., & Altschul, S. F. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87, 2264–8.
Article CAS PubMed PubMed Central Google Scholar
Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam, H., Valentin, F., Wallace, I. M., Wilm, A., Lopez, R., Thompson, J. D., Gibson, T. J., Higgins, D. G. 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23:2947–2948.
Article CAS PubMed Google Scholar
Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, Buchner, A., Lai, T., Steppi, S., Jobb, G., Förster, W., Brettske, I., Gerber, S., Ginhart, A. W., Gross, O., Grumann, S., Hermann, S., Jost, R., König, A., Liss, T., Lüssmann, R., May, M., Nonhoff, B., Reichel, B., Strehlow, R., Stamatakis, A., Stuckmann, N., Vilbig, A., Lenke, M., Ludwig, T., Bode, A., Schleifer, K. H. 2004. ARB: a software environment for sequence data. Nucleic Acids Res. 32, 1363–1371.
Article CAS PubMed PubMed Central Google Scholar
Moreno-Hagelsieb, G. & Latimer, K. 2008. Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics 24:319–24.
Article CAS PubMed Google Scholar
Needleman, S. B. & Wunsch, C. D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.
Article CAS PubMed Google Scholar
Nei, M. & Kumar, S. 2000. Molecular evolution and phylogenetics. Oxford University Press.
Google Scholar
Notredame, C., Higgins, D.G., Heringa, J. 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302: 205–217.
Article CAS PubMed Google Scholar
Rice, P., Longden, I., & Bleasby, A. 2000. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genetics 16:276–277.
Article CAS Google Scholar
Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J. D. & Higgins, D. G. 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using ClustalOmega. Mol. Syst. Biol. 7:539.
Article PubMed PubMed Central Google Scholar
Smith, T. F., Waterman, M. S. & Fitch, W. M. 1981 Comparative biosequence metrics. J. Mol. Evol. 18, 38–46.
Article CAS PubMed Google Scholar
Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M. et al. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 2011; 28: 2731–2739.
Article CAS PubMed PubMed Central Google Scholar
Thompson, J. D., Plewniak, F. & Poch, O. 1999. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15:87–88.
Article CAS PubMed Google Scholar
Waterhouse, A. M., Procter, J. B., Martin, D. M., Clamp, M., Barton, G. J. 2009. Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics 25, 1189–1191.
Article CAS PubMed PubMed Central Google Scholar
Yamada, K.D., Tomii, K., Katoh, K. 2016. Application of the MAFFT sequence alignment program to large data-reexamination of the usefulness of chained guide trees. Bioinformatics32, 3246–3251.
Article CAS PubMed PubMed Central Google Scholar
Zuker, M. & Jacobson, A. B. 1998. Using Reliability Information to Annotate RNA Secondary Structures. RNA 4, 669–679, 1998.
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Veterinary Animal Sciences, University of Copenhagen, Copenhagen, Denmark
Henrik Christensen & John Elmerdahl Olsen

Authors

Henrik Christensen
View author publications
You can also search for this author in PubMed Google Scholar
John Elmerdahl Olsen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Henrik Christensen .

Editor information

Editors and Affiliations

Department of Veterinary Animal Sciences, University of Copenhagen, Copenhagen, Denmark
Henrik Christensen

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Christensen, H., Olsen, J.E. (2018). Pairwise Alignment, Multiple Alignment, and BLAST. In: Christensen, H. (eds) Introduction to Bioinformatics in Microbiology. Learning Materials in Biosciences. Springer, Cham. https://doi.org/10.1007/978-3-319-99280-8_4

Download citation

DOI: https://doi.org/10.1007/978-3-319-99280-8_4
Published: 28 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99279-2
Online ISBN: 978-3-319-99280-8
eBook Packages: Biomedical and Life SciencesBiomedical and Life Sciences (R0)

Publish with us

Policies and ethics

Pairwise Alignment, Multiple Alignment, and BLAST

Abstract

Similar content being viewed by others

Pairwise Alignment, Multiple Alignment, and BLAST

BLAST and FASTA Similarity Searching for Multiple Sequence Alignment

Sequence Alignment

4.1 The Pairwise Alignment Problem

4.1.1 Global or Local Pairwise Alignments?

4.1.2 Substitution Matrices

4.1.2.1 Amino Acid Substitution Matrices

4.1.2.2 Nucleotide Substitution Matrices

4.1.3 Gaps

4.1.4 Dynamic Programming

4.1.4.1 Needleman and Wunsch

4.1.4.2 Smith and Waterman

4.2 Multiple Alignment

4.2.1 Clustal

4.2.2 Other Multiple Alignment Programs

4.3 BLAST

4.3.1 NCBI BLAST

4.3.2 Ortholog Detection

4.3.3 BLAST2 Sequences

4.3.4 Statistics

4.3.5 Variants of BLAST

4.4 Activities

4.4.1 Pairwise Alignment

4.4.2 Learn How Dynamic Programming Works with Pairwise Alignments

4.4.3 Multiple Alignment with ClustalX

4.4.4 BLAST

Take-Home Messages

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this chapter

Cite this chapter

Download citation

Share this chapter

Publish with us

Search

Navigation