1 Introduction

More than 12 million species reside on the earth. This biodiversity is mainly due to the distinct genomic and proteomic sequences that these organisms contain. These sequences store unique information that modulates the various processes required for the organisms’ survival [1]. DNA sequence comparison is a useful approach for evaluating gene-level variation amongst these organisms and for studying their differences and similarities [1]. Which “similarities” are identified depends on the objectives of the alignment process. The easiest way to compare two sequences of the same length is to count the number of matching characters. The measure that quantifies sequence similarity is known as the alignment value of the two sequences. Conversely, the degree of dissimilarity between sequences is known as the sequence distance. The number of positions at which the characters do not match is known as the Hamming distance. However, when estimating similarity, this approach does not take into account normal biological events such as insertions or deletions.
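As a minimal sketch of the counting approach described above, the following Python snippet computes the number of matching characters and the Hamming distance for two equal-length sequences. The function names and example sequences are illustrative assumptions and do not come from any cited tool.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def match_count(a, b):
    """Count the positions at which two equal-length sequences agree."""
    return len(a) - hamming_distance(a, b)


seq1 = "GATTACA"
seq2 = "GACTATA"
print(hamming_distance(seq1, seq2))  # 2
print(match_count(seq1, seq2))       # 5
```

Because the comparison is strictly position by position, a single inserted or deleted character shifts every later position and inflates the distance, which is the limitation noted above.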

The classic definition of sequence alignment involves estimating the so-called “edit distance,” which normally equals the minimum number of insertions, substitutions, and deletions necessary to transform one sequence into another [2]. Several algorithms, such as those of Needleman & Wunsch and Smith & Waterman, were developed earlier for computing “edit distance” [3, 4]. These algorithms were originally developed for protein-protein alignment and were subsequently employed for DNA sequence alignment. In the majority of real-life scenarios, however, these algorithms are inefficient for DNA alignment owing to their runtime and memory requirements [2].
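The edit distance defined above can be sketched with a small dynamic programming routine. The version below assumes unit cost for every insertion, deletion, and substitution (the classic Levenshtein formulation); the function name and cost choice are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b (unit costs assumed)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if char_a == char_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("ACGTACGT", "AACGTACG"))  # 2
```

Only two rows of the table are kept in memory, but the running time still grows with the product of the two sequence lengths, which illustrates why this family of methods becomes costly for long DNA sequences.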

To date, several kinds of alignment approaches have been proposed, including prediction-based methods, pairwise sequence alignment (PSA), profile-based methods, multiple sequence alignment (MSA), and structure-based methods [5]. The most frequently used are PSA and MSA. In PSA, two sequences are aligned at a time. It is the simplest form of alignment and can be performed with two strategies: local and global. MSA can also be implemented using local or global strategies but is much more complex. During MSA, many protein sequences are organized into a rectangular array, and residues that are either homologous or identical are placed in the same column. MSA is generally employed for detecting conserved regions in protein sequences and for predicting the secondary and tertiary structures of proteins. Homology and evolutionary relationships between sequences may also be derived via MSA, because MSA rests on the underlying assumption that all aligned sequences share evolutionary homology [5]. Alignment results are also a prerequisite for many downstream analyses, such as drug design. Nevertheless, the results generated by different methods can differ considerably [6]. Thus, there is an urgent need for systematic metrics that provide explicit guidance on the strengths and shortcomings of the different sequence alignment algorithms. This, in turn, will help to deduce more meaningful relationships between sequences. Considering the above, this chapter provides an overview of sequence alignment, with a summary of the popular algorithms, methods, and approaches that underlie current sequence alignment practice.

2 Basic Terminology

Sequence alignment is a basic analysis in almost every biological study, whether implicit or explicit. The main objective of sequence alignment is to detect homologous sites in sequences [7]. Homology is a qualitative statement that identifies a shared ancestral relationship between sequences. Two distinct types of homology exist: paralogy (shared ancestry due to a duplication event) and orthology (shared ancestry due to a speciation event) [8]. “By definition, orthologs are genes that are related by vertical descent from a common ancestor and encode proteins with the same function in different species. By contrast, paralogs are homologous genes that have evolved by duplication and code for a protein with similar, but not identical functions” [9]. Other terms commonly used in sequence analysis are similarity and identity [10]. Unlike homology, similarity is quantitative: it denotes the percentage of aligned residues that have similar physicochemical properties and can therefore readily substitute for one another. It is pertinent to note that two sequences can be 70% similar but cannot share 70% homology; they are either homologous or nonhomologous [10]. In general, a shared ancestral relationship can be inferred if the level of sequence similarity is very high. However, it is not obvious at what degree of similarity one should assume a homologous relationship. The answer depends on the type and length of the sequences under consideration [10]. For instance, proteins with high sequence identity and high structural similarity tend to have similar functions and close evolutionary relationships [11]. Identity corresponds to the proportion of positions in the two aligned sequences that are occupied by the same amino acid residue [10].
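To make the distinction between identity and similarity concrete, the sketch below computes both percentages for two already-aligned protein sequences. The residue groups are a hypothetical simplification chosen for illustration (real tools usually derive similarity from substitution matrices), and counting only columns without gaps is likewise an assumed convention.

```python
# Hypothetical physicochemical groups, used for illustration only.
SIMILAR_GROUPS = [set("AVLIM"), set("FWY"), set("KRH"), set("DE"), set("STNQ")]


def identity_and_similarity(a, b, gap="-"):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Columns that contain a gap in either sequence are skipped (assumed convention).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = similar = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared


identity, similarity = identity_and_similarity("MKV-LDE", "MRVALEE")
print(f"identity = {identity:.1f}%, similarity = {similarity:.1f}%")
```

In this toy alignment the K/R and D/E columns count toward similarity but not identity, so the similarity percentage is at least as high as the identity percentage.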

Another term common in sequence analysis is the gap. A gap can be defined as the absence of a segment from a given sequence. Gaps are a natural feature of biological sequences. A single mutational event can result in the addition or deletion of certain regions of a sequence (predominantly in DNA), and thus the effective identification of gaps is an important step toward understanding various biological phenomena [12]. A variety of biological processes may lead to the formation of gaps in DNA sequences. For example, large pieces of DNA may be copied and inserted through a single mutational event, and slippage during DNA replication can cause the same region to be replicated many times as the replication machinery loses its position on the template [12]. It has been reported that, instead of penalizing all editing operations individually, one should penalize the formation of a longer gap more severely than that of a shorter one [13].

3 Alignment Methods

To date, different alignment approaches like dynamic programming (DP), heuristic algorithms, or probabilistic methods have been developed [14].

3.1 Dynamic Programming

DP is an effective computing strategy for the class of problems that can be solved recursively [15]. When Richard Bellman first developed DP in 1953 for studying “multi-stage decision problems,” he certainly did not expect its extensive usage in modern computer programming. Indeed, as Bellman described in his humorous autobiography [16], he adopted the term “dynamic programming” as “an umbrella” for the mathematical research he carried out at the RAND Corporation, in order to shield it from his ultimate boss, Secretary of Defense Wilson, who “had a pathological fear of word research.” Because it was one of the first algorithms used in bioinformatics research and has since been widely applied [17], DP has become an essential algorithmic subject.

DP is a natural choice for comparing sequences. Needleman & Wunsch first illustrated the use of bottom-up DP for calculating an optimal pairing between two protein sequences [3]. This algorithm evaluates the similarity over the complete length of both sequences (a “global alignment algorithm”) and is therefore time-consuming and computationally intensive [18]. To overcome this, Smith and Waterman adapted DP to perform local alignments, in which the alignment is made between similar parts of the input sequences [4]. DP provides an ideal approach for PSA [18]. It is also widely employed for assembling DNA sequence data from the fragments produced by automated sequencing machines and for determining the exon/intron structure of eukaryotic genes [19]. In addition, it is used for inferring the function of proteins through homology with proteins of known function [3, 4] and for predicting the secondary structure of functional RNA genes or regulatory elements [19].
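The bottom-up DP idea behind global alignment can be sketched as follows. This is a simplified illustration in the spirit of Needleman & Wunsch, not a reproduction of the original formulation: the scoring values (match = 1, mismatch = -1, linear gap = -2) and the function name are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

The full table has (n + 1) × (m + 1) cells, which is the source of the time and memory cost mentioned above.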

3.2 Heuristic Algorithms

Although DP gives more accurate results, it is slow [14]. More efficient approaches, such as heuristic algorithms and probabilistic methods, have therefore been developed for large-scale database searching. The term “heuristic” means that the algorithm is faster than the classical method but is not guaranteed to find the optimal solution [20]. Heuristic algorithms can be categorized into three subgroups: the progressive alignment (PA) approach, the iterative alignment type, and the block-based alignment type [10]. The PA approach is an incremental strategy that generates a final MSA by conducting a series of PSAs on successively less closely related sequences. In this approach, the two most closely related sequences are aligned first, and the next most closely related sequence in the query set is then aligned to the alignment generated in the previous step. Although success depends strongly on the quality of the initial alignment and deteriorates dramatically when all sequences in the set are distantly related, PA methods are efficient enough to be applied on a large scale to many sequences [21]. The most commonly used PA methods are ClustalW (https://www.genome.jp/tools-bin/clustalw) and T-Coffee (https://www.ebi.ac.uk/Tools/msa/tcoffee/). However, progressive approaches are not guaranteed to converge to the globally optimal alignment, and the quality of the result can be difficult to assess. Additionally, its true biological significance may be unclear [21].

The iterative method is based on the premise that an optimal solution can be found by repeatedly modifying existing suboptimal solutions. The process begins with a low-quality alignment and gradually improves it through well-defined procedures until the alignment score can no longer be improved. Since the order of the sequences differs in each iteration, this method can mitigate the “greedy” problem of the progressive strategy. Nevertheless, this approach is also heuristic in nature and offers no guarantee of an optimal alignment [10]. PRRN (https://www.genome.jp/tools-bin/prrn) is a web-based program that uses a doubly nested iterative strategy for multiple alignment. Progressive and iterative alignment techniques are primarily global and thus cannot detect conserved motifs and domains amongst highly divergent sequences of varying lengths. For divergent sequences that share only local similarities, a local alignment strategy must be employed. This technique detects ungapped alignment blocks that are present in all sequences and is therefore called the block-based local alignment technique [10]. DIALIGN2 (http://dialign.gobics.de/) is a web tool that employs block-based alignment for detecting local alignments.

3.3 Probabilistic Methods

The introduction of probabilistic modeling approaches, such as profile hidden Markov models (profile HMMs) and pair-HMMs [22], has advanced sequence similarity searching. When the parameters are probabilities instead of arbitrary scores, they are more readily optimized by objective statistical criteria. This helps to create more detailed, biologically realistic models with many parameters. For instance, profile HMMs employ position-specific insertion/deletion probabilities instead of the arbitrary, position-invariant gap costs of more conventional approaches like BLAST or PSI-BLAST [23], enabling profile HMMs to model the fact that indels occur more frequently in certain parts of a protein than in others (e.g., in surface loops rather than the buried core) [24].

The probabilistic approach has three primary benefits: (i) The probabilities can be tailored to any kind of comparison (e.g., error-prone DNA reads against a genome), so the comparisons are expected to be more accurate. (ii) The reliability of, for instance, each column of every part of the alignment can be estimated. This is helpful because alignments often contain uncertain sections owing to high divergence or repetitive sequences. (iii) The similarity between two sequences can be calculated by integrating over all possible alignments. This can detect subtle relationships more powerfully than a single optimal alignment [25]. The probabilistic approach, however, also has significant disadvantages. Aside from a moderate computational cost, it suffers from uncharacterized score statistics: unlike Smith-Waterman local alignment, for which at least the form of the optimal score distribution under the null model is known, relatively little is known about the distribution of the log-likelihood score for probabilistic local alignment of random sequences. It has been shown empirically that indiscriminate use of the z-score does not give reliable results [26].

4 Global and Local Alignment

Sequence alignment approaches typically fall into two categories: global and local alignment. While global alignment compares all characters of the query sequences, local alignment identifies regions of similarity within long sequences that are typically divergent overall. The Needleman-Wunsch algorithm is a well-known global alignment algorithm based on DP. Local alignments are often preferred but are more challenging to compute because of the additional difficulty of identifying the regions of similarity. The Smith-Waterman algorithm is a general local alignment method, also based on DP, with the added ability to begin and end the alignment at any position [14]. Most biologists consider local alignment to be what really matters when looking for functional conservation. Local alignment is more important because the function of certain proteins is determined by their ability to bind some other molecule (the protein’s ligand); the function is therefore maintained if this short portion is conserved during evolution, even if there is significant divergence in many other regions of the protein. Because proteins are folded in their native form, these conserved regions need not be contiguous segments of the protein. Indeed, several researchers studying lymphocyte antigen recognition specifically account for these discontinuities within binding domains (known as “non-linear” epitopes, where an epitope is the ligand of a lymphocyte) [12, 14].
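The ability of local alignment to begin and end at any position can be illustrated with a compact Smith-Waterman-style sketch. Scores are floored at zero so that an alignment may restart anywhere, and the traceback starts from the highest-scoring cell. The scoring values (match = 2, mismatch = -1, linear gap = -2) and the function name are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The floor at 0 lets an alignment start at any cell.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTTACGTACGTTTT", "GGGACGTACGGGG"))
# (14, 'ACGTACG', 'ACGTACG')
```

In the example, the flanking T and G runs are ignored and only the shared ACGTACG core is reported, whereas a global alignment would be forced to pair the dissimilar flanks as well.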

In some cases of global alignment, a gap may be needed at the leftmost position of the alignment, but the length of the portion of the reference sequence that remains to be aligned is not known in advance. It is obvious from this scenario that an intermediate between local and global alignment is required (i.e., semiglobal alignment) [12]. A semiglobal alignment is a global alignment in which starting or ending gaps are not penalized, so that the resulting alignment tends to overlap one end of a sequence with an end of the other [27]. Parasail is a stand-alone tool that can be employed for performing global, local, and semiglobal alignment [27]. Recently, Suzuki & Kasahara developed a semiglobal alignment algorithm based on “difference recurrence relations” that outperforms other available tools by a factor of 2.1 [28].
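One common way to realize the "free end gaps" idea is sketched below: the first row and first column of the DP table are initialized to zero (leading gaps cost nothing), and the final score is the maximum over the last row and last column (trailing gaps cost nothing). This is an assumed textbook-style formulation with assumed scoring values, and it is not the algorithm used by Parasail or by Suzuki & Kasahara.

```python
def semiglobal_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment score when gaps at either end of a or b are free."""
    m = len(b)
    prev = [0] * (m + 1)        # first row: leading gaps in a are free
    best_last_col = prev[m]
    for char_a in a:
        curr = [0]              # first column: leading gaps in b are free
        for j in range(1, m + 1):
            sub = match if char_a == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,
                            prev[j] + gap,
                            curr[j - 1] + gap))
        best_last_col = max(best_last_col, curr[m])
        prev = curr
    # The alignment may end anywhere on the last row or in the last column.
    return max(max(prev), best_last_col)


# The suffix of the first sequence overlaps the prefix of the second.
print(semiglobal_score("AAAACGTACGT", "ACGTACGTTTT"))  # 8
```

In the example, the eight-character overlap ACGTACGT is scored without charging for the unmatched AAA prefix or TTT suffix, which is the overlapping-ends behavior described above.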

5 Pairwise Alignments

The most frequently employed means of extracting information from protein and DNA sequences is PSA. It is generally used to detect protein homologs, including those that diverged more than 2 billion years ago. For proteins that share statistically significant sequence similarity, homology can be reliably inferred. If statistically significant similarity to a known sequence is observed, inferences may be made regarding the function, structure, and biologically important residues of the unknown sequence. Although the homology assumption [29] is very robust (i.e., proteins that share significant similarity in a PSA usually have similar properties), some of the more detailed inferences depend critically on the quality of the alignment between the two sequences. For instance, functional inferences for protein sequences with more than 60% identity are typically very reliable. However, uncertainty in the alignment of poorly conserved regions can lead to errors for more distantly related proteins [30, 31].

The gold standard for sequence alignment is the structural alignment of two proteins whose 3D structures are known. The 3D structure contains more information than the 1D sequence and diverges much more slowly. Thus, distant evolutionary relationships may be established even between sequences that do not display statistically significant similarity. Even closely related proteins with substantial sequence similarity may yield sequence alignments that differ from the most accurate structural alignments. Since it is not possible to determine the three-dimensional structure of every protein, researchers are continually seeking strategies for producing structurally accurate homology models for sequences of unknown structure. The most common and successful approach is to find, within the set of known structures, a template on which to build the model. This task is relatively trivial in the case of high sequence similarity (i.e., > 60% identity), because sequence and structural alignments typically agree closely in this range. However, only a few sequences fall in this zone; many more fall in the so-called “twilight zone” (i.e., ~20–40% sequence identity), where divergent yet clearly homologous proteins may be hard to align. Because the accuracy of the final 3D model depends on the quality of the alignment of the unknown sequence to the structural template, researchers are mainly concentrating on improving the quality of alignment between proteins that share statistically significant similarity and have 20% to 40% sequence identity [49, 50]. Dot-matrix methods, DP, and word methods are the most widely used methods for PSA.

5.1 DOT Matrix Plot

Since visualizing a character-by-character alignment of sequences that are hundreds of characters long or more can be troublesome, scientists created a more visually intuitive approach called the dot matrix method. This sequence alignment process, which was first carried out manually and later computationally, makes similarities more apparent for visual inspection. In this process, one sequence is written along the top of the matrix and the other along the side, and a dot is placed at every intersection where the corresponding characters match [51]. For a pair of exactly matching sequences, the dot matrix shows a continuous line of dots running along the main diagonal of the matrix (Fig. 7.1). However, such a clean pattern is rarely observed. Often, diagonal patterns are hard to recognize without further processing. Thus, a number of filters are applied to the results, and color and other methods are used to highlight matching regions. For instance, a typical filter is a window/stringency combination. The window is the number of positions evaluated at a time, while the stringency is the minimum number of matches required within each window [51].
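The window/stringency filter can be sketched in a few lines. In the illustrative code below, a cell is marked only when the window starting at that pair of positions contains at least `stringency` matches; the function names, the text rendering, and the choice to anchor each window at its first position are assumptions.

```python
def dot_plot(side, top, window=1, stringency=1):
    """Return a grid of booleans: rows follow `side`, columns follow `top`.

    Cell (i, j) is True when the windows starting at side[i] and top[j]
    share at least `stringency` matching positions.
    """
    if not 1 <= stringency <= window:
        raise ValueError("stringency must be between 1 and the window size")
    rows = len(side) - window + 1
    cols = len(top) - window + 1
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(side[i + k] == top[j + k] for k in range(window))
            row.append(matches >= stringency)
        plot.append(row)
    return plot


def render(plot):
    """Draw the grid with '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in plot)


print(render(dot_plot("ACGT", "ACGT")))
```

With the default window of 1 this is the plain dot matrix, and identical sequences give an unbroken main diagonal. Raising the window and stringency suppresses isolated chance matches at the cost of losing very short similarities.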

Fig. 7.1 The dot plot of the alignment of human chromosomes 2, 7, and 14 with mouse chromosome 12. The x-axis indicates positions on mouse chromosome 12, and the y-axis indicates positions on human chromosomes 2, 7, and 14. The orthologous landmarks are plotted based on the pairwise alignments between the three human chromosomes and mouse chromosome 12 (Adapted from [52]).

Table 7.1 Software and tools used for PSA (Adapted from https://en.wikipedia.org/wiki/List_of_sequence_alignment_software)

The dot matrix is extremely valuable for recognizing recurring characters or short repeated sequences within a single sequence, as in the mapping of repetitive regions of entire chromosomes. Repeats of the same character produce artificially high scores and complicate sequence alignment. Dot matrix methods are most appropriate for single PSA problems, particularly for sequences of relatively high similarity. Sequences with lower similarity, as well as MSA, need more efficient methods [51]. Even though window and stringency values are usually determined heuristically, they can also be based on dynamic averages, on match scores in aligned protein families, or on different measures of amino acid similarity. For example, scoring matrices assign alignment scores according to the statistical frequency of substitutions in aligned protein families. These matrices may be used with a sliding window so that only windows scoring above an average score appear in the matrix, as described in the following section [51].

To date, various algorithms and software tools have been created for producing dot-matrix plots. While several of these tools accommodate sequences of about 100 kb, analyzing genome sequences above 10 Mb on a microcomputer remains impractical given the execution time and computer memory required [53]. In 2004, Huang and Zhang created two dot matrix comparison methods for studying large sequences. The methods first identify regions of similarity between two sequences using a rapid word search algorithm and then compare these regions explicitly. Because many random matches are excluded in the initial word search, the computation time is decreased dramatically. These approaches yield good-quality dot matrix plots with low background noise. Space requirements are linear, so genome-scale sequences can be compared by the algorithms. Highly repetitive sequence structures in eukaryotic genomes may affect the computational speed. On a 1 GHz personal computer, a dot matrix plot of the yeast genome (12 Mb) on both strands was computed in about 80 s [53].

5.2 Dynamic Programming

The most widely employed algorithm for PSA is DP, initially introduced by Needleman and Wunsch [3]. DP guarantees a mathematically optimal alignment for a given set of parameters and sequences. However, an optimal sequence alignment score does not ensure that the alignment is structurally accurate. Additionally, there is no natural process by which two proteins are aligned with each other. Therefore, “optimal” sequence alignments may differ greatly from the ideal structural alignments [31]. Moreover, distantly related proteins often have several optimal alignments and a large number of suboptimal alignments with scores very close to the optimal score [50, 54, 55]. The further one moves from the optimal score, the more the number of alternative alignments increases. Therefore, one must sample the suboptimal alignment space to keep the number of alignments computationally tractable [50, 54, 55].

While structure-based alignment is the “gold standard” against which sequence alignments are measured, structural alignments themselves may vary, and no optimal structural alignment algorithm is possible [56]. As the number of structures appears to be smaller than the number of sequences, the variation among structural alignments is small relative to the variation between sequence and structural alignments. Although this certainly holds for very distantly related proteins that have no significant sequence similarity (and therefore cannot be meaningfully aligned from sequence data alone), the relative accuracy of structural and sequence alignments for proteins that share statistically significant similarity has not been closely studied [56]. Given that structurally correct alignments frequently have suboptimal alignment scores, researchers have examined alternative alignments and asked whether they contain information about accurate structural alignments. Jaroszewski et al. [50] studied alternative alignments generated both by a near-optimal alignment algorithm and by varying the scoring parameters (i.e., substitution matrix and gap penalties), and found alignments in these sets that are much more similar to the structural alignment. They inferred that the two sources of alternative alignments, namely near-optimal alignments and alignments from alternative parameters, carry complementary (rather than redundant) information, because the combination of the two sets produced much better alignments than either set alone. The accuracy of the optimal sequence alignment was also investigated by Holmes and Durbin [57], who developed a technique for calculating the expected accuracy. In an algebraic approach, Zhang and Marr [58] used alternative alignments in the neighborhood of the optimal alignment.

Various researchers have also used probabilistic approaches to produce sets of alternative alignments. In 1995, Miyazawa [59] calculated alignment probabilities based on the exponential of the alignment score and then compared the resulting probabilities of aligned amino acid pairs with the corresponding protein structure alignments. Yu and Hwa investigated the statistical significance of alignments made using a pairwise hidden Markov model (HMM) [26]. Knudsen and Miyamoto [60] designed a pairwise HMM alignment approach that incorporates an explicit evolutionary model of indels. Finally, Mückstein and colleagues [61] constructed a procedure for sampling alignments according to their statistical weight, employing the partition function over all possible pairwise alignments.

Although it is of theoretical interest to compare individual sequence and structure sets in the absence of any structural information, it is only of practical use if the alignment of the sequence can be determined correctly. One approach to resolving this issue is to calculate the reliability of a given aligned residue pair (which we term an edge, following the convention of the dynamic programming path graph, in which the optimal score is determined along edges that represent aligned residues, insertions, and deletions) [31]. Cline and colleagues examined four strategies for predicting the reliability of a particular pair of aligned residues [62] and concluded that the method proposed by Yu and Smith [63] for retrieving near-optimal alignments from a profile HMM gave the greatest improvement in alignment quality. The association between edge probabilities and structural alignments was studied by Knudsen & Miyamoto [60], Mückstein et al. [61], and Miyazawa [59]. A strong correspondence between them was generally found, although in the former two cases it was examined only for a limited number of protein pairs. In another study, Mevissen and Vingron [64] evaluated the usefulness of an edge reliability index known as robustness, previously defined by Chao and colleagues [65]. They found that the robustness of an edge was a good predictor of whether that edge was also present in the structural alignment. In a further study, Sierk and colleagues improved on the robustness analysis by adding extra information about alignment consistency and building a logistic regression model that returns the probability that a given edge is present in a structural alignment [31].

5.3 The Word or K-Tuple (Ktup) Method

This is a heuristic approach that produces alignments faster than DP. With today’s massive datasets, DP is impractical for database searches. This is why the k-tuple approach is used when searching a large database with a specific query. A k-tuple is a word of k consecutive characters. For instance, k is typically set to 11 for nucleotide sequences and 3 for protein sequences. The k-tuple method is implemented in the FASTA and BLAST families of programs.
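The core of the word method, finding exact k-tuple matches between a query and a database sequence, can be sketched as below. This shows only the seeding step under simplifying assumptions (exact matches, a single subject sequence, hypothetical function names); FASTA and BLAST add their own scoring, extension, and statistics on top of such seeds.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def word_hits(query, subject, k):
    """Return (query_pos, subject_pos) pairs where a k-tuple matches exactly."""
    index = index_words(subject, k)
    hits = []
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, spos))
    return hits


print(word_hits("ACGTT", "TTACGA", 3))  # [(0, 2)]
```

Because the subject is indexed once and each query word is a dictionary lookup, the search avoids filling a full DP table, which is where the speed advantage over DP comes from. A larger k yields fewer chance hits but can miss weaker similarities.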

5.3.1 FASTA

FASTA is a rapid alignment program for pairs of protein or DNA sequences. Rather than comparing individual residues in the two sequences, FASTA looks for matching sequence patterns, or words, called k-tuples. These patterns consist of k consecutive matching letters in both sequences. Based on these word matches, the algorithm then tries to establish a local alignment. The ability of the algorithm to locate similar sequences in a sequence database at high speed makes FASTA useful for routine database searches of this kind. The FASTA programs offer a comprehensive set of rapid similarity search tools (fasta36, fastx36, tfastx36, fasty36, and tfasty36), comparable to those offered by the BLAST tool, as well as programs for slower, optimal local and global similarity searches (ssearch36, ggsearch36) and for searches with oligonucleotides and short peptides (fasts36, fastm36). fasta36 employs the FASTA algorithm developed by Pearson and by Pearson & Lipman to compare a protein (or nucleotide) sequence with a protein (or nucleotide) sequence database [66, 67]. Search speed and selectivity are controlled by the ktup (word size) parameter. For protein comparisons, ktup = 2 by default; ktup = 1 is more sensitive but slower. For DNA comparisons, ktup = 6 by default; ktup = 3 or ktup = 4 provides higher sensitivity. fastx36 and fasty36 compare a nucleotide sequence with a protein sequence database by translating the nucleotide sequence in three frames and allowing gaps and frameshifts. fastx36 uses a faster, simpler alignment algorithm that allows frameshifts only between codons, whereas fasty36 is slower but can produce better alignments because frameshifts within codons are also permitted [68]. tfastx36 and tfasty36 compare a protein sequence with a nucleotide sequence database, computing the comparisons with frameshifts in both the forward and reverse orientations [68].
ssearch36 employs the Smith-Waterman algorithm [4] to compare a protein (or nucleotide) sequence with a protein (or nucleotide) sequence database. Using Farrar’s SSE2 acceleration, ssearch36 is only 2–5 times slower than fasta36 [69]. ggsearch36 and glsearch36 compare a protein (or nucleotide) sequence with a protein (or nucleotide) sequence database using an optimal global:global (ggsearch36) or global:local (glsearch36) algorithm. fasts36 and tfasts36 compare a set of short peptide fragments, such as those obtained from mass spectrometry analysis of a protein, against protein (fasts36) or nucleotide (tfasts36) databases [70]. fastm36 compares ordered short peptides (or short nucleotide sequences) with a protein (or nucleotide) database.

The FASTA programs use an empirical approach to estimating statistical significance that accommodates a wide range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity. The FASTA programs can produce “BLAST-like” alignments and tabular output for ease of integration into analysis pipelines, and they can search small, representative databases and then report results for a larger set of sequences using links from the smaller dataset. The FASTA programs work with a wide range of database formats, including PostgreSQL and MySQL databases. Recently, Pearson described protocols that lay out a strategy for incorporating domain and active site annotations into alignments and highlighting the mutational state of functionally important residues. These protocols also explain how the FASTA programs can be used to characterize protein and nucleotide sequences through protein:protein, protein:DNA, and DNA:DNA comparisons [71].

5.3.2 BLAST

The “Basic local alignment search tool” (BLAST) is a sequence similarity search program that can be used either as a stand-alone tool or through a web interface to compare any combination of protein (or nucleotide) query sequence with a protein (or nucleotide) sequence database [72]. BLAST is a heuristic approach that finds short matches between two sequences and tries to extend alignments from these “hot spots.” In addition to performing alignments, BLAST provides statistical information about them [72]. The E-value indicates the likelihood that a given sequence match arose purely by chance. The smaller the E-value, the less likely it is that the database match is the result of random chance, and thus the more significant the match. If E < 1e-50 (i.e., 1 × 10^-50), there should be extremely high confidence that the database match is the result of a homologous relationship. If E is between 1e-50 and 0.01, the match can be considered a result of homology. If E is between 0.01 and 10, the match is considered not significant but may hint at a tentative remote homology relationship; additional evidence is required to confirm the relationship. If E > 10, the sequences under consideration are either unrelated or related by extremely distant relationships that fall below the limit of detection of the current method [10]. Because the E-value is proportional to the size of the database, an obvious concern is that, as the database grows, the E-value for a given sequence match also increases. Since the true evolutionary relationship between the two sequences remains unchanged, the decrease in the credibility of the sequence match as the database grows means that one may “lose” previously detected homologs as the database enlarges. Consequently, an alternative to E-value calculations is needed [10].
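The E-value ranges given above can be written down as a small lookup, together with a deliberately simplified model of how an E-value grows with database size. The handling of values exactly on a boundary, the category wording, and the strictly linear rescaling are assumptions made for illustration; they are not taken from BLAST itself.

```python
def interpret_e_value(e_value):
    """Classify an E-value using the ranges quoted in the text."""
    if e_value < 0:
        raise ValueError("E-values cannot be negative")
    if e_value < 1e-50:
        return "extremely high confidence of homology"
    if e_value < 0.01:
        return "match can be considered a result of homology"
    if e_value <= 10:
        return "not significant, possible remote homology"
    return "unrelated or beyond the detection limit"


def rescale_e_value(e_value, old_db_size, new_db_size):
    """Assumed simplification: E grows linearly with database size."""
    return e_value * new_db_size / old_db_size


e_small_db = 0.005
e_large_db = rescale_e_value(e_small_db, old_db_size=1_000_000, new_db_size=10_000_000)
print(interpret_e_value(e_small_db))
print(interpret_e_value(e_large_db))
```

Here the same match moves from the homology range into the nonsignificant range purely because the hypothetical database became ten times larger, which is the "lost homolog" concern raised above.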

BLAST is a family of programs that comprises BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX. BLASTN searches a nucleotide sequence database with nucleotide query sequences. BLASTP uses protein sequences as queries to search a protein sequence database. BLASTX takes nucleotide sequences as queries and translates them in all six reading frames to produce protein sequences, which are then used to query a protein sequence database. TBLASTN searches protein query sequences against a nucleotide sequence database whose sequences are translated in all six reading frames. TBLASTX takes nucleotide query sequences, translates them in all six frames, and searches a nucleotide sequence database whose sequences are likewise translated in all six frames. In addition, the bl2seq program performs a local alignment of two user-provided input sequences. Its graphical output includes horizontal bars and a diagonal in a two-dimensional plot showing the overall extent of matching between the two sequences [10].
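The six-frame translation used by the translated BLAST programs can be illustrated with a short sketch. It assumes the standard genetic code, treats unknown codons as “X”, and uses hypothetical names (`six_frame_translation`, frame labels such as “+1” and “-1”); it is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate both strands at offsets 0, 1 and 2."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCC").items():
        print(name, protein)
```

Each of the six resulting protein strings can then be compared against a protein database (as in BLASTX) or against another set of translated frames (as in TBLASTX).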

6 Multiple Sequence Alignment

MSA is an alignment of more than two biological sequences. In most cases, the input sequences are assumed to share a common ancestor. Sequence homology can be inferred from the resulting MSA, and a phylogenetic analysis can be carried out to assess the shared evolutionary origins of the sequences. Visual representations of the alignment, as in Fig. 7.2, illustrate mutation events such as point mutations (single nucleotide or amino acid changes), which appear as differing characters within a single alignment column, and insertion/deletion mutations (indels or gaps), which appear as hyphens in one or more of the aligned sequences. MSA can also be used to assess sequence conservation of protein domains, secondary and tertiary structures, and individual amino acids or nucleotides [73,74,75].

Table 7.2 Software and tools used for MSA (Adapted from https://en.wikipedia.org/wiki/List_of_sequence_alignment_software)
Fig. 7.2

“Multiple sequence alignment of a-type domains of B. distachyon PDI and PDI-like proteins and a typical rice PDI. These thioredoxin-like domains of the B. distachyon were annotated in Phytozome database, and comparative analysis used BioEdit software. Residues highlighted in deep blue and green show they were identical and similar, respectively. Open bars and arrowheads represent the α helices and β strands, respectively. The red box indicates the -CxxC- catalytic site, and red arrows indicate the glutamic acid–lysine charged pair. Blue and yellow arrows represent the conserved arginine (R) and the cis prolines (P) near the active site, respectively” (Adapted from [76]).

Since aligning three or more long sequences by hand is difficult and time-consuming, computational algorithms are used to generate and evaluate alignments. MSAs require more sophisticated approaches than PSA because they are computationally more complex. Most MSA programs use heuristic methods rather than global optimization because identifying the optimal alignment among more than a few sequences of moderate length is prohibitively expensive. On the other hand, heuristic methods generally do not guarantee the quality of the solution, and on benchmarks heuristic solutions are sometimes found to be well below the optimum [73,74,75].

6.1 Dynamic Programming

The dynamic programming (DP) algorithms used for PSA, namely Smith-Waterman and Needleman-Wunsch, can also be used to find the optimal alignment of more than two sequences. However, the complexity of the problem is much worse than for PSA. For PSA, the running time of the algorithm is proportional to m × n, where m and n are the lengths of the two sequences being aligned. If n ≥ m, this is generalized to say that the running time of the algorithm is proportional to n^2. The exponent in n^2 comes from the fact that, if both sequences have length n, n × n cells need to be filled in the dynamic programming matrix. If we were to apply either the Needleman-Wunsch or the Smith-Waterman algorithm to three sequences, we would need to build a three-dimensional array for computing and tracing the alignment. Therefore, for sequences of length n, we would have n × n × n cells to fill in (http://readiab.org/book/0.1.3/2/3). The runtime of MSA with full DP algorithms thus grows exponentially with the number of sequences to be aligned. If s is the number of sequences and n is the sequence length, the running time is proportional to n^s. In PSA, s = 2, which is what keeps the problem tractable (http://readiab.org/book/0.1.3/2/3).
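The cost argument above can be made concrete with a small sketch of the pairwise case. The code below fills the Needleman-Wunsch score matrix for two sequences and, separately, counts how many cells a full DP over s sequences of length n would need. The scoring values (match 1, mismatch −1, linear gap −2) and the function names are illustrative assumptions, and only the optimal score is returned, not the traceback.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of a and b.

    Fills a (len(a) + 1) x (len(b) + 1) matrix, so the work is
    proportional to m * n as described in the text.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # prefix of a aligned to gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # prefix of b aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


def dp_cells(n, s):
    """Number of cells in a full DP over s sequences of length n."""
    return n ** s


if __name__ == "__main__":
    print(needleman_wunsch_score("AC", "AGC"))
    for s in (2, 3, 4):
        print(s, dp_cells(100, s))
```

For sequences of length 100, going from two to four sequences raises the cell count from ten thousand to one hundred million, which is why exact DP is rarely used beyond a handful of sequences.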

6.2 Progressive Alignment

Progressive alignment (PA) is a heuristic approach that does not directly optimize any explicit alignment score. The idea is to perform a series of PSAs, starting with the most similar pair of sequences and progressively adding the less similar ones [22, 107]. The PA method reduces the overall computational complexity to polynomial time by splitting the MSA problem into a set of PSAs guided by a tree that reflects the evolutionary relationships among the sequences [108]. Today, the most popular alignment programs that use the progressive approach are ClustalW [79], Mafft (“Multiple sequence alignment based on Fast Fourier Transform”) [109], “Multiple sequence comparison by log-expectation” (MUSCLE) [91], and T-Coffee [110].

6.2.1 ClustalW

ClustalW is currently the most commonly deployed alignment software, and the oldest of the programs examined here. The program performs a PA, first carrying out PSAs to compute a distance matrix that records the divergence of the sequences. Once the matrix is computed, a guide tree is built using the Neighbor-Joining algorithm, followed by a final stage in which the sequences are aligned following the branching order of the guide tree. In its alignment procedure, the software uses two gap penalties, gap opening and gap extension, and, when protein sequences are aligned, a full amino acid weight matrix. These gap penalties depend strongly on variables such as sequence length, similarity, and the weight matrix. In simple cases, ClustalW will correctly align the corresponding domains of sequences with known secondary or tertiary structure; in more complicated cases, its output can be regarded as a good starting point for further refinement (Fig. 7.3a) [73, 79].
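To illustrate how a guide tree starts to form from a distance matrix, the sketch below implements only the pair-selection step of Neighbor-Joining: it computes the usual Q criterion, Q(i, j) = (n − 2)·d(i, j) − Σ d(i, k) − Σ d(j, k), and returns the pair with the smallest value. The function name and the toy distance matrix are assumptions made for illustration; a full implementation would also update the matrix and repeat until the tree is complete.

```python
def nj_pick_pair(names, dist):
    """Return (Q, name_i, name_j) for the pair Neighbor-Joining joins first.

    dist is a symmetric matrix (list of lists) with zeros on the diagonal.
    """
    n = len(names)
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, names[i], names[j])
    return best


if __name__ == "__main__":
    taxa = ["a", "b", "c", "d", "e"]
    distances = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_pick_pair(taxa, distances))
```

Note that the pair chosen is not necessarily the one with the smallest raw distance: in the toy matrix, d and e are closest, but the Q criterion selects a and b because it also accounts for how far each sequence is from all the others.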

Table 7.3 Software and tools used for motif scanning (Adapted from https://en.wikipedia.org/wiki/List_of_sequence_alignment_software)
Fig. 7.3

Steps for generating MSA via (a) ClustalW, (b) Clustal omega, and (c) T-Coffee (Adapted from [75])

6.2.2 Mafft

Mafft is a program that offers several alignment strategies, either PA alone (with Fast Fourier Transform) or PA followed by iterative refinement. Mafft’s basic run comprises up to three stages, of which the default procedure performs the first two. The first stage builds a PA based on a rough distance between each pair of sequences, computed from the 6-tuples they share. A guide tree is then constructed with the unweighted pair group method with arithmetic mean (UPGMA) using a modified linkage, and the sequences are aligned following the branching order of the tree (the so-called FFT-NS-1 strategy). In the second stage, the distance matrix is recalculated from the alignment obtained in the first stage, and the PA is repeated using a tree built from the new matrix (up to this point, the strategy is known as FFT-NS-2 and is the default used by the software). The final stage is an iterative refinement that optimizes the weighted sum-of-pairs (WSP) score of Gotoh [111], using the group-to-group alignment [85] and the tree-dependent restricted partitioning technique [112]. When all three stages are used, the strategy is referred to as FFT-NS-i, indicating that it uses the FFT to rapidly identify homologous regions across the sequences, followed by the iterative refinement. For the FFT, each amino acid in a sequence is converted into a vector describing its volume and polarity, properties that are key to substitution events, allowing the software to detect homologous regions efficiently [73].
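The rough distance based on shared 6-tuples can be sketched as follows. This is a simplified illustration rather than Mafft’s actual code: it counts the 6-tuples two sequences have in common and normalises by the smaller sequence, and it skips the reduced amino acid alphabet that a real implementation may apply first. The function names and the exact normalisation are assumptions of this sketch.

```python
from collections import Counter


def ktuple_counts(seq, k=6):
    """Count every overlapping k-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ktuple_distance(a, b, k=6):
    """Rough distance: 1 - shared k-tuples / k-tuples in the shorter sequence."""
    counts_a, counts_b = ktuple_counts(a, k), ktuple_counts(b, k)
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    smallest = min(sum(counts_a.values()), sum(counts_b.values()))
    if smallest == 0:
        raise ValueError("both sequences must be at least k residues long")
    return 1.0 - shared / smallest


if __name__ == "__main__":
    print(ktuple_distance("MKTAYIAK", "MKTAYIAK"))
    print(ktuple_distance("MKTAYIAK", "MKTAYIGG"))
```

Because no alignment is needed, such a distance can be computed for every pair of sequences quickly, which is what makes it suitable for building the first guide tree.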

Three additional refining algorithms are also provided by Mafft: L-INS-i, G-INS-i, and E-INS-i [113]. These strategies increase the number of stages required to build the MSA to five. In these cases, the first stage again involves building a distance matrix, but without using 6-tuples. In contrast to the FFT-NS-* strategies, the UPGMA tree is not rebuilt, and the program proceeds to the second stage, splitting out gap-free segments and storing the sequence-to-sequence score arrays for each gap-free segment. Mafft then calculates an “importance” value from the segment scores and stores it for the residues involved. In stage three, all “importance” values are collected into the “importance” matrix, which is followed by a group-to-group alignment that uses these scores and a weighting scheme, based on the Needleman-Wunsch algorithm [79]. The final stage refines the resulting alignment, improving the WSP score and the fixed “importance” values.

6.2.3 Muscle

MUSCLE applies a profile-to-profile pairwise alignment technique. The program first builds a progressive alignment, which is then improved and refined in two subsequent stages. To produce the PA, the similarity of the sequences is measured, distances are estimated, and a UPGMA tree is computed. MUSCLE uses two distance measures: a k-mer distance for unaligned sequence pairs and the Kimura distance for aligned pairs [91]. In the improvement stage, a new tree is built from the Kimura distance matrix, and a new PA is produced on the basis of this improved tree, yielding a better alignment. The final refinement stage uses a variant of tree-dependent restricted partitioning [112]. This approach deletes one edge of the tree, divides the sequences into two subsets, and extracts the profile of each subset; the two profiles are then re-aligned using profile-profile alignment. Each tree edge is visited iteratively, and the new alignment is kept only if its sum-of-pairs score improves. The edges are visited in order of decreasing distance from the root, which has the effect of first realigning individual sequences and then closely related groups of sequences [91].
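For aligned pairs, the Kimura distance is commonly written as d = −ln(1 − D − D²/5), where D is the fraction of aligned positions that differ. The sketch below assumes this form and ignores columns containing a gap; the function names are hypothetical, and a production aligner may handle very divergent pairs differently (here the distance is simply reported as infinite once the logarithm is no longer defined).

```python
import math


def fractional_difference(a, b):
    """Fraction of gap-free aligned columns in which a and b differ."""
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        mismatches += x != y
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return mismatches / compared


def kimura_distance(d):
    """Kimura correction of a fractional difference d (assumed formula)."""
    argument = 1.0 - d - d * d / 5.0
    return math.inf if argument <= 0 else -math.log(argument)


if __name__ == "__main__":
    D = fractional_difference("ACDEFGHIKL", "ACDEFGHIKV")
    print(D, kimura_distance(D))
```

The correction matters most for divergent pairs: the corrected distance grows faster than the raw fraction of differences because multiple substitutions at the same site are hidden in the observed alignment.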

6.2.4 Clustal Omega

Clustal Omega is the newest MSA algorithm of the Clustal family [75]. The algorithm is used only for aligning protein sequences (although support for nucleotide sequences is likely to be added in time). The accuracy of Clustal Omega is comparable to that of other high-quality aligners on small numbers of sequences; on large sequence sets, Clustal Omega outperforms other MSA algorithms in terms of both completion time and overall alignment quality. Clustal Omega can align 190,000 sequences on a single processor in a few hours. The algorithm first produces pairwise alignments using the k-tuple method. The sequences are then clustered using the mBed method, followed by k-means clustering. Next, the guide tree is built using the UPGMA method. Finally, the multiple sequence alignment is produced using the HHalign package, which aligns two profile hidden Markov models (HMMs), as shown in Fig. 7.3b.
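The UPGMA step used for guide-tree construction can be sketched in a few lines. The version below is a plain textbook UPGMA on a full distance matrix, returning the tree as nested tuples; it does not include the mBed embedding or k-means clustering that precede it in Clustal Omega, and the function name and toy distances are assumptions for illustration.

```python
def upgma(names, dist):
    """Build a UPGMA tree (nested tuples) from a symmetric distance matrix."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        del d[(i, j)]
        for k in clusters:
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            # Size-weighted average keeps every original sequence equally weighted.
            d[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


if __name__ == "__main__":
    labels = ["A", "B", "C", "D"]
    matrix = [
        [0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0],
    ]
    print(upgma(labels, matrix))
```

The nested tuples give the order in which a progressive aligner would merge sequences and profiles: the innermost pair is aligned first, and each enclosing level adds the next sequence or group.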

6.2.5 T-Coffee

T-Coffee takes a different approach to aligning sequences. The software first builds a library from two separate sources: global alignments from ClustalW and local alignments from Lalign [114]. For each pair of sequences, a global alignment is generated, and local alignments are taken from the top ten nonoverlapping segments. The software processes the global and local information and assigns a weight to every PSA according to sequence identity [115]. The two sets are then merged into a single primary library. This consolidated library undergoes an extension phase, so that the final weight of any pair of residues reflects part of the information contained in the whole library. The final step involves calculating the distance matrix and the neighbor-joining tree, and then performing a PA that begins by aligning the two closest sequences on the tree using the weights stored in the extended library. This initial pair is then fixed, and the gaps introduced in it cannot be shifted later. The PA proceeds until all sequences have been aligned [73].
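The extension phase can be pictured as a triplet check: a residue pair gains weight whenever a residue in a third sequence is aligned to both of them. The sketch below assumes the commonly described rule in which each such third residue contributes the minimum of its two weights, added to the direct weight of the pair. The data layout (residues as `(sequence, position)` tuples, pairs as frozensets) and the function name are hypothetical.

```python
from collections import defaultdict


def extend_library(primary):
    """Triplet extension of a pairwise residue library.

    primary maps frozenset({residue_x, residue_y}) -> weight, where each
    residue is a (sequence_name, position) tuple.
    """
    neighbours = defaultdict(dict)
    for pair, weight in primary.items():
        x, y = tuple(pair)
        neighbours[x][y] = weight
        neighbours[y][x] = weight
    residues = list(neighbours)
    extended = {}
    for index, x in enumerate(residues):
        for y in residues[index + 1:]:
            if x[0] == y[0]:
                continue  # residues of the same sequence are never paired
            weight = neighbours[x].get(y, 0)
            for z, w_xz in neighbours[x].items():
                # z must lie in a third sequence and be aligned to y as well.
                if z[0] not in (x[0], y[0]) and z in neighbours[y]:
                    weight += min(w_xz, neighbours[y][z])
            if weight > 0:
                extended[frozenset((x, y))] = weight
    return extended


if __name__ == "__main__":
    a, b, c = ("A", 0), ("B", 0), ("C", 0)
    library = {frozenset((a, b)): 88, frozenset((a, c)): 77, frozenset((c, b)): 100}
    print(extend_library(library)[frozenset((a, b))])
```

In this toy library, the pair from sequences A and B starts with weight 88 and gains min(77, 100) = 77 from the supporting residue in sequence C.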

Most PA programs use the Neighbor-Joining algorithm to infer the guide tree. Neighbor-Joining’s O(N^3) time complexity makes it a bottleneck when large data sets are aligned. The Relaxed Neighbor-Joining algorithm relaxes the criterion for joining nodes and reduces the typical time complexity to O(N^2 log N) without any significant loss in the quality of the results [47]. In 2008, Sheneman explored the relationship between the topology of the guide tree and alignment quality. He developed two genetic algorithms, each of which evolves a population of guide-tree topologies using stochastic crossover and mutation operators. One of them, EVALYN, achieves highly accurate scores when evaluated against established reference sets. Nevertheless, the disruptive crossover of EVALYN was found to reduce the genetic algorithm to a stochastic hill climb (Fig. 7.3c).

6.3 Probabilistic Alignment

6.3.1 PRANK

PRANK [116] is one of the best examples of a probabilistic MSA tool. In contrast to other alignment programs, PRANK uses phylogenetic information to distinguish alignment gaps created by deletions from those created by insertions, and it then treats the two kinds of events differently. As a by-product of the correct handling of insertions and deletions, PRANK also produces inferred ancestral sequences as part of its output and labels alignment gaps differently depending on whether they originate from an insertion or a deletion event. Because the algorithm infers the evolutionary history of the sequences, PRANK can be sensitive to errors in the guide phylogeny as well as to violations of its underlying assumptions about the origin and pattern of gaps [116].

6.3.2 PSAR

In 2014, Kim and Ma developed a new measure, known as PSAR [117], that assesses the reliability of an MSA by probabilistically sampling suboptimal alignments (SAs). The SAs provide additional information that cannot be obtained from the optimal alignment alone, particularly when the optimal alignment is not much better than the SAs [117].

6.3.3 ProbPFP

Recently, Zhan and colleagues developed ProbPFP, which combines a partition function with an HMM whose parameters are optimized by particle swarm optimization (PSO). The PSO algorithm was used to refine the parameters of the HMM. The posterior probabilities obtained from the HMM were then combined with those obtained from the partition function, and the integrated substitution scores for the alignment were thereby determined. To evaluate ProbPFP, it was compared with 13 outstanding or classical MSA methods. The results show that the alignments produced by ProbPFP have the highest mean sum-of-pairs (SP) and total-column (TC) scores on the SABmark and OXBench data sets, as well as the second-highest mean TC and SP scores on BAliBASE. ProbPFP was also compared with four other outstanding approaches by reconstructing phylogenetic trees for six protein families in the TreeFam database from the alignments produced by these five approaches. The results show that the phylogenetic trees reconstructed from the ProbPFP alignments are closer to the reference trees than those obtained with the other approaches [118].
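The SP and TC scores used in such benchmarks compare a test alignment with a reference alignment of the same sequences. The sketch below uses one common reading of the two measures: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly. Benchmark suites often restrict scoring to trusted core regions, which this simplified version does not do, and the function names are hypothetical.

```python
from itertools import combinations


def _columns(alignment):
    """Describe each column by the residue index of every sequence (None = gap)."""
    counters = [0] * len(alignment)
    columns = []
    for column in zip(*alignment):
        entry = []
        for s, char in enumerate(column):
            if char == "-":
                entry.append(None)
            else:
                entry.append(counters[s])
                counters[s] += 1
        columns.append(tuple(entry))
    return columns


def _aligned_pairs(columns):
    """Set of (seq_s, index_s, seq_t, index_t) residue pairs sharing a column."""
    pairs = set()
    for column in columns:
        for s, t in combinations(range(len(column)), 2):
            if column[s] is not None and column[t] is not None:
                pairs.add((s, column[s], t, column[t]))
    return pairs


def sp_tc_scores(test, reference):
    """Return (SP, TC) of a test alignment against a reference alignment."""
    ref_cols, test_cols = _columns(reference), _columns(test)
    ref_pairs, test_pairs = _aligned_pairs(ref_cols), _aligned_pairs(test_cols)
    sp = len(ref_pairs & test_pairs) / len(ref_pairs)
    tc = len(set(ref_cols) & set(test_cols)) / len(set(ref_cols))
    return sp, tc


if __name__ == "__main__":
    reference = ["ACG", "A-G"]
    test = ["ACG", "AG-"]
    print(sp_tc_scores(test, reference))
```

Both alignments must list the same sequences in the same order, given as equal-length gapped strings with “-” as the gap character.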

6.3.4 ProbCons

ProbCons is a modification of the conventional sum-of-pairs scoring approach and implements a PA algorithm based on pair hidden Markov models. The alignment procedure is divided into the following steps, starting with the calculation of the posterior-probability matrices for each pair of sequences. This is followed by a dynamic programming calculation of the expected accuracy of each PSA. The probabilistic consistency transformation is then applied to re-estimate the match quality scores. A guide tree is built by hierarchical clustering, with the similarity between clusters defined as the weighted average of the values between the sequences of each cluster. The guide tree is used to align the sequences with a progressive strategy. There is also a post-processing phase in which random bipartitions of the resulting alignment are realigned in search of better alignments. ProbCons differs from other alignment programs in that it does not incorporate biological heuristics such as evolutionary tree construction, position-specific gap scoring, and other features typically used in other packages [99].
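The re-estimation step that follows the pairwise posterior matrices is usually described as a probabilistic consistency transformation: the posterior that residue x_i aligns with y_j is replaced by an average over every sequence z of the probability of going from x_i to y_j through z. With the posterior matrices written as P_xy, this is P′_xy = (1/|S|) Σ_z P_xz·P_zy, where z = x and z = y each contribute P_xy itself. The sketch below implements that formula in plain Python; the function names and the toy matrices are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def consistency_transform(p_xy, via):
    """Re-estimate P_xy from the other sequences in the set.

    via is a list of (P_xz, P_zy) matrix pairs, one per third sequence z.
    The terms for z = x and z = y both equal P_xy, hence the factor 2.
    """
    n_sequences = len(via) + 2
    rows, cols = len(p_xy), len(p_xy[0])
    total = [[2.0 * p_xy[i][j] for j in range(cols)] for i in range(rows)]
    for p_xz, p_zy in via:
        through_z = matmul(p_xz, p_zy)
        for i in range(rows):
            for j in range(cols):
                total[i][j] += through_z[i][j]
    return [[value / n_sequences for value in row] for row in total]


if __name__ == "__main__":
    p_xy = [[0.5]]
    p_xz, p_zy = [[1.0]], [[1.0]]
    print(consistency_transform(p_xy, [(p_xz, p_zy)]))
```

In the toy case, a weakly supported match between x and y (0.5) is raised to about 0.67 because a third sequence aligns confidently with both residues.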

7 Motif Search

Motif discovery is a sequence analysis problem at the application level and one of the main challenges in developing bioinformatics applications. Sequence motifs are fixed in size, frequently repeated, and conserved, but at the same time they are short (approximately 6–12 bp), whereas the intergenic regions that contain them are very long and highly variable, which makes motif discovery a difficult task. Motifs are also known as regulatory elements in eukaryotic genes and occur in the regulatory region (RR). These patterns play a crucial role in identifying transcription factor binding sites (TFBSs), which helps in understanding the mechanisms of gene expression regulation [119, 120]. Motifs are broadly categorized into several types, namely sequence motifs, planted motifs, gapped motifs, structured motifs, and network motifs [119]. There are two major types of motif discovery algorithms: the enumeration approach and the probabilistic approach. The enumeration approach searches for consensus sequences; motifs are predicted on the basis of word counts and word similarities, so this approach is often called the word enumeration approach. It addresses the planted motif problem, which is defined by a motif length and a maximum number of mismatches [120]. Algorithms based on word enumeration exhaustively search the whole search space to identify candidate words together with their allowed substitutions, and therefore normally find the global optimum. This implies, however, that they are exponential-time algorithms that take a long time to detect longer motifs and handle hundreds of sequences inefficiently, so they are suitable only for short motifs. Additionally, these algorithms require several user-defined parameters, including the length of the motif, the number of mismatches allowed, and the minimum number of sequences in which the motif must appear [121]. Word enumeration can be accelerated by using parallel processing or specialized data structures such as suffix trees. CisFinder (https://lgsun.grc.nia.nih.gov/CisFinder/), DREME [122], Weeder [123], and MCES [124] are common algorithms based on this approach.

The second group is the probabilistic approach. It builds a probabilistic model known as a position-specific weight matrix (PSWM), or motif matrix, which specifies a base distribution for each position of the TFBS in order to distinguish motifs from nonmotifs, and it needs few search parameters [124]. MEME [125], EXTREME [126], and BioProspector [127] are the most common methods based on the probabilistic approach. The third type, the nature-inspired approach, combines the core features of the first two approaches. It is simple in concept and performs a global search, yet it can handle large data sets and long motifs simultaneously. It has a flexible motif representation, allowing an unlimited number of degenerate positions. The final type is the combinatorial approach, which relies on hybrid algorithms that combine more than one of these methods.
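Both families of methods can be illustrated on a toy example. The sketch below first enumerates every DNA word of a given length and keeps those that occur, with at most d mismatches, in at least a minimum number of sequences (the exhaustive word enumeration idea, with its 4^l search space), and then builds a simple position frequency matrix from a set of aligned sites as the raw material for a PSWM. All function names and the toy sequences are assumptions; real tools add background models, pseudocounts, log-odds weights, and far more efficient search.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def occurs(word, seq, d):
    """True if some window of seq is within d mismatches of word."""
    size = len(word)
    return any(
        hamming(word, seq[i:i + size]) <= d for i in range(len(seq) - size + 1)
    )


def enumerate_motifs(seqs, length, d, quorum=None):
    """Exhaustively test all 4**length words (exponential in motif length)."""
    quorum = len(seqs) if quorum is None else quorum
    hits = []
    for letters in product("ACGT", repeat=length):
        word = "".join(letters)
        if sum(occurs(word, s, d) for s in seqs) >= quorum:
            hits.append(word)
    return hits


def position_frequency_matrix(sites, alphabet="ACGT"):
    """Per-position base frequencies of equal-length aligned sites."""
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({b: column.count(b) / len(sites) for b in alphabet})
    return matrix


if __name__ == "__main__":
    sequences = ["TTACGTTT", "GGACGAGG", "CCACTTCC"]
    print("ACGT" in enumerate_motifs(sequences, 4, 1))
    print(position_frequency_matrix(["ACGT", "ACGA", "ACTT"])[3])
```

The three user-defined parameters mentioned in the text appear directly as arguments: the motif length, the number of mismatches allowed, and the quorum of sequences that must contain the motif.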

8 Conclusion and Future Perspective

In conclusion, sequence alignment is a basic requirement for most biological research, ranging from phylogenetic reconstruction to protein design. Sequence alignment is also employed for motif search in biological sequences, which in turn plays a key role in understanding the regulation of various biological phenomena. However, because the amount of sequence data keeps increasing, there is an urgent need to develop novel tools and techniques that can improve the accuracy of the results of sequence analysis, including motif search. Earlier, several researchers suggested that a successful motif discovery tool can be built by combining different proposed motif discovery methods. Such a tool should have the following features: (1) it should identify all motif models, (2) its global search capability should be optimized, (3) it needs parallel processing capabilities, (4) it should use optimized data structures, (5) its search should be able to find both long and short motifs, and (6) it should discover several motifs at the same time, i.e., without eliminating a discovered motif in order to find another. Future research could then establish a new motif discovery algorithm that incorporates the key characteristics of the enumerative and probabilistic approaches and uses them as a seed for a nature-inspired algorithm, taking the above-noted features into account [120].