Keywords

1 Introduction

A multiple sequence alignment is an alignment of three or more DNA or RNA or protein sequences that can organize data in such a way that similar sequence features are aligned together [1, 2]. The goal of the multiple sequence alignment is to reveal features that may be shared by many sequences and to identify alterations that further elucidate functional and phenotypic variability [2]. The main applications of sequence alignment include secondary or tertiary structure prediction, phylogenetic tree construction, function prediction, hidden Markov modeling, PCR primer design, and data validation [2]. The computation of an exact multiple sequence alignment (MSA) of a large set of sequences is extremely complex and is classified as an NP-complete problem [3]. Multiple sequence alignment provides more information than pairwise sequence alignment because it reveals regions which are conserved within a protein family that have structural and functional importance [1]. Multiple sequence alignment is used for distinctive objectives such as performing similarity search of sequences. The approach is used in classification problems (e.g., to classify members in the protein family, to identify close and distant relationship to infer phylogeny).

figure a

To explain MSA, let us consider a set of three or more DNA/RNA/protein sequences as depicted above. MSA will aim to align these sequences by introducing gaps in each sequence. For example, if there are k number of sequences of N length, then Si, i = 1, 2,…, K and:

$$S = \left\{ {\begin{array}{*{20}c} {S_{1} = \left( {s_{11} ,s_{12} , \ldots ,s_{{1{\text{N}}}} } \right)} \\ {S_{2} = \left( {s_{21} ,s_{22} , \ldots ,s_{{2{\text{N}}}} } \right)} \\ {S_{3} = \left( {s_{{{\text{k}}1}} ,s_{{{\text{k}}2}} , \ldots ,s_{\text{kN}} } \right)} \\ \end{array} } \right\}$$
(1)

Consequently, MSA of S will be obtained by inserting gaps (‘-‘) into the sequences in such a way that all resulting sequences \({S}_{i}^{ *}\) will have equal length N and \({S}_{i}^{*} = {S}_{i}\) after removal of all gaps from \({S}_{i}^{*}\), and no column will consist of gaps. Consider another MSA \({S}^{*}\) that consists of two sequences \({s}_{p}^{*}\) and \({s}_{q}^{*}\) in the alignment. Let us consider two nucleotides a and b in the aligned sequence here the score of the sequence alignment will be defined as:

$${\text{score}}\left( {a,b} \right) = \left\{ {\begin{array}{*{20}c} {{\text{match score for }}a{\text{ and }}b} & {{\text{if }}a{\text{ and }}b{\text{ are residue}}} \\ { - d} & {{\text{if }}a{\text{ or }}b{\text{ are gap}}} \\ 0 & {{\text{if }}a{\text{ and }}b{\text{ both are gaps}}} \\ \end{array} } \right\}$$
(2)

To find the divergence d of a given set of aligned sequences, the following three methods are used. The divergence between sequences can also be called as the total distance between sequences or the alignment score.

Consensus Method: In the consensus method, a common character from each column is searched and the string created in this way is called the consensus string. The total distance between two sequences is calculated by adding a penalty for each character in its column that differs in the sequence from the consensus string. Let us consider S as a set of sequence wherein S = {S1, S2, ..Sk} and:

$${S} = \left\{ {\begin{array}{*{20}c} {{S}_{1} = \left( {{s}_{11} ,{s}_{12} , \ldots ,{s}_{{1{N}}} } \right)} \\ {{S}_{2} = \left( {{s}_{21} ,{s}_{22} , \ldots ,{s}_{{2{N}}} } \right)} \\ {{S}_{3} = \left( {{s}_{{{k}1}} ,{s}_{{{k}2}} , \ldots ,{s}_{kN} } \right)} \\ \end{array} } \right\} \to {S}^{*}$$
(3)
$${{\text{dist}}_{\text{i}}} = \mathop \sum \limits_{{{\text{i}} = 1}} ^ {\text{k}} {{\text{S}} ^ {*}} - {{\text{S}}_{\text{i}}}$$
(4)

Here, S* is the consensus string of S and disti is the distance of i-th sequence from S*.

Evolutionary Tree Method: A weighted evolutionary tree is created using sequences where adjacent nodes correspond to the sequence pair. The weight of the tree is defined as the summation of the number of changes between two adjacent nodes in the tree.

Sum of Pairs (SP score): The sum of pairs score is the pairwise distance between all sequence pairs. SP score is widely used similarity function. SP score for the two protein sequences is given as predefined BLOSUM or PAM matrix but for more than two sequences, and since the number of possible combinations is too large, SP score needs to be calculated. Let us consider S as a given set of sequence:

$${S} = \left\{ {\begin{array}{*{20}c} {{S}_{1} = \left( {{s}_{11} ,{s}_{12} , \ldots ,{s}_{{1{N}}} } \right)} \\ {{S}_{2} = \left( {{s}_{21} ,{s}_{22} , \ldots ,{s}_{{2{N}}} } \right)} \\ {{S}_{3} = \left( {{s}_{{{k}1}} ,{s}_{{{k}2}} , \ldots ,{s}_{kN} } \right)} \\ \end{array} } \right\}$$
(5)
$${\text{SP Score}}\left( S \right) = \mathop \sum \limits_{1 \le i \le j \le k} {\text{align}}\_{\text{Score}}\left( {S_{i} ,S_{j} } \right)$$
(6)

Here, the align score is the alignment score between Si and Sj sequences. The align_score is equal to the sum of the similarity score of every pair minus gap penalties [4]. The problem of finding a multiple sequence alignment that maximizes the SP score (or minimizes the SP distance) is known to be NP hard [5, 6].

2 Literature Review

To perform multiple sequence alignment, four distinctive approaches have been discussed in the literature namely global optimization, approximation, heuristic, and probabilistic methods. The probabilistic approach finds the probability of mutation and indels that leads to generating information related to the probability of evolution. Probabilistic methods work efficiently for phylogenetic analysis [7,8,9]. The global optimization, approximation, and heuristic methods find the optimized score for multiple sequence alignment and are suitable for classification problems. Dynamic programming is a global optimization approach, but the limitation of dynamic programming is that it takes exponential time. Simulated annealing and genetic algorithms have been also used by some researchers to get optimal results [10, 11]. Another approach to overcome the limitation of dynamic programming is to use different search methodologies and improve the efficiency of the global optimization [12,13,14]. These methods work efficiently with the small datasets but for the large datasets, approximation method is highly useful [15, 16]. Heuristics-based algorithms for multiple sequence alignment can be classified into two groups that are progressive heuristics and iterative heuristics-based algorithms. ClustalW [17, 18] and MUSCLE [19, 20] are well known examples of progressive heuristics and iterative heuristic algorithms, respectively. A combination of heuristic and probabilistic methods has been also suggested by few researchers [19, 21,22,23,24,25]. Other heuristics-based multiple-sequence alignment methods include simulated annealing [26], branch and bound [27], genetic algorithms [28, 29], Tabu search [30], hidden Markov modeling [31], countless agglomerative and progressive alignment [32], etc. Moreover, some other publically available tools for multiple sequence alignment are Clustal-Omega [33], KAlign [34], MAFET [35], Prank [36,37,38], TCOFFEE [39,40,41], ContraAlign [42], Prime [43], and DiAlign [44,45,46].

3 The Proposed Algorithm

The proposed algorithm is dynamic programming-based multiple-sequence alignment algorithm which is an improved version of the already proposed adaptive evolutionary clustering algorithm MPSAGA [47]. The proposed method was executed with a set of sequences \({S} = \left\{ {{s}_{1} ,{s}_{2} ,{s}_{3} , \ldots {s}_{n} } \right\}\). The pair of sequences was identified from the sequence set such as paired_sequences \(\left( {\text{PS}} \right) = \left\{ {\left\{ {{s}_{1} ,{s}_{2} } \right\},\left\{ {{s}_{3} ,{s}_{4} } \right\},..\left\{ {{s}_{{{n} - 1}} ,{s}_{n} } \right\}} \right\}\). Using MPSAGA algorithm, these sequences were aligned pairwise. The alignment of these paired sequences was denoted as Aij where i and j denoted the index of the aligned sequences (si and sj). The set of all resultant alignments was denoted as \({A}^{*} = \left\{ {{A}_{12} ,{A}_{34} , \ldots ,{A}_{pq} } \right\}\), where p = (n − 1)/2 and q = (n − 1), when there was even number of sequences in the alignment. However, if the odd number of sequences were provided, then one sequence remained unpaired and got paired with the first sequence. For example, if there are 6 (even) sequences to be aligned, then the sequence pairs will be denoted as \({\text{PS}} = \left\{ {\left\{ {{s}_{1} ,{s}_{2} } \right\},\left\{ {{s}_{3} ,{s}_{4} } \right\},\left\{ {{s}_{5} ,{s}_{6} } \right\}} \right\}\). But if provided 7 (odd) sequences, then the sequence pairs will be denoted as \({\text{PS}} = \left\{ {\left\{ {{s}_{1} ,{s}_{2} } \right\},\left\{ {{s}_{3} ,{s}_{4} } \right\},\left\{ {{s}_{5} ,{s}_{6} } \right\},\left\{ {{s}_{7} ,{s}_{1} } \right\}} \right\}\) wherein the last unpaired sequence will be paired with the first sequence. The distance between these sequences will be reflected as match_score. The match_score of alignment can be calculated by the following formula: Match_Score = matches reward–mismatches penalty - gap opening penalty - gap extension penalty—indels penalty (7).

The default values used for the parameters in this algorithm are Match_Reward = +2, Gap_Opening = −1, Gap_Extension = −2, Mismatch = −2, and Indel = −2. In the next step, to group similar data items, the resultant pairs were clustered with an adaptive evolutionary clustering algorithm [48]. The step is helpful for the large datasets and can be skipped if the method is applied to the small datasets. The fitness of the clusters is calculated based on the fitness score of the individual clusters, i.e., match_score, and the clusters are sorted based on their average health [48]. Intracluster sorting is performed with each cluster based on their fitness. Finally, all the clusters are merged and sorted. These aligned sequences resulting from the multiple sequence alignment are sorted according to their match score. The flowchart of the proposed algorithm has been provided in Fig. 1.

Fig. 1
figure 1

The proposed MSA-MPSAGA (MPS) algorithm

4 Materials and Methods

The present research study was performed on a Windows-based system having an intel i5 processor with 8 GB RAM and 1 TB hard disk. The algorithm was implemented in Java 8 and executed for multiple sequence alignment on nine randomly chosen sequences downloaded from NCBI (https://www.ncbi.nlm.nih.gov/). The NCBI data are publically available for research use, and one can retrieve it by simply submitting the sequence ID.

4.1 Datasets Used

To check the performance and accuracy of the proposed multiple sequence aligner, an empirical study was performed in which the following four datasets were used: BAliBASE [49], MattBench [50], Homstrad [51], and Sisyphus [52]. These datasets contained 4–25 sequences. Random sampling was performed on the datasets to create an artificial dataset of 115 sequences as dataset 1, dataset 2, dataset 3, and dataset 4.

4.2 Evaluation Criteria

To check the accuracy of the alignments, FastSP v. 1.6.0 [53, 54] was used. FastSP calculates the alignment accuracy with respect to SP score. The accuracy measures provide a value between 0.0 and 1.0. The value of SP score 1.0 indicates the perfect accuracy, and the value of SP score 0.0 indicative of the sequence alignment is incorrect. FastSP also indicates an expansion ratio which is the ratio between the number of matches in the estimated alignment and the observed alignment. The value of expansion ratio less than 1.0 is an indication of over alignment, and value more than 1.0 corresponds to under alignment.

5 Results

The proposed algorithm was executed in a single run to perform multiple sequence alignment for the nine sequences downloaded from NCBI (Table 1). The results of multiple sequence alignment have been shown in a similarity matrix. The percent similarity of each sequence with the other sequence is called conservancy, and in the present study, it was calculated using MSA-MPSAGA (MPS) (Table 2). MSA was also performed using ClustalW (CW) [55], TCOFFEE (TC) [56], and MUSCLE (ML) aligners [57, 58]. To compare the results of MSA obtained using all the algorithms tested, visualization method was used. Consequently, the overall results of the multiple sequence alignment were used to construct phylogenetic trees using Phylogenetic Tree Viewer—ETE Toolkit (Table 3).

Table 1 Sequences downloaded from NCBI used for empirical study for multiple sequence alignment
Table 2 Percent conservancy of the nine sequences calculated by MSA-MPSAGA algorithm
Table 3 Comparison of phylogenetic trees constructed from multiple sequence alignment of the nine sequences using ClustalW, TCOFFEE, MUSCLE, and MSA-MPSAGA

The empirical study was performed on the four data subsets. The summary of the results is shown in Table 4. The average number of the aligned sequences was observed to be 13, 7, 8, and 10 for the dataset 1, dataset 2, dataset 3, and dataset 4, respectively. The average length of the sequences in the dataset 1 was found to be 765 in which 38 gaps were inserted by the proposed algorithm to align the sequences. The average gap length in the aligned sequences in the dataset 1 was observed to be 9. In dataset 2, the average length of sequences was 260. To align these sequences, average 17 gaps were inserted with an average gap length of 4. A total of 8 sequences of average 421 lengths were aligned by inserting a total of 47 gaps and with an average of 3 basepair long gap length. While ten sequences with average 185 lengths were aligned by inserting 25 gaps with an average of 6 gap length.

Table 4 The four datasets analyzed under multiple sequence alignment

The comparison of modeler score and SP score for the four tested algorithms is given in Table 5. It indicated that the MPS algorithm provides the SP score similar to the expected score. The modeler score and SP score of CW, TC, ML, and MPS were observed to be (0.70 and 0.50), (0.78 and 0.77), (0.678 and 0.69), and (0.72 and 0.72), respectively. Each dataset used in these experiments had at most 25 sequences. A total of 115 sequences from a subset of four datasets were used (46 from dataset 1, 36 from dataset 2, 18 from dataset 3, and 15 from dataset 4) (Fig. 2).

Table 5 Comparison of modeler scores and SP score between the tested algorithms
Fig. 2
figure 2

The comparison of modeler score and SP score between four benchmarking datasets

In the other experiment, ten sequence sets of different protein categories were aligned using CW, ML, TC, and MPS algorithms. The consolidated results of the aligned sequence such as average aligned sequence length, the average number of matches in the aligned sequences, number of gaps inserted to align the sequences, and the average gap length inserted in the aligned sequences have been detailed out in Table 6. A comparison of the average length of the aligned sequences for each category of proteins is given in Fig. 3.

Table 6 A comparative study of the artificial dataset using MSA-MPSAGA with ClustalW, TCOFFEE, and MUSCLE aligners
Fig. 3
figure 3

Comparison of alignment length of MSA-MPSAGA with ClustalW, MUSCLE, and TCOFFEE multiple sequence aligners

Multiple sequence alignment using CW, TC, ML, and MPS aligners provided the average alignment length to be 224.8, 216.8, 229.5, and 230.1, respectively. The proposed algorithm MPS aligned sequences with an increased length of 0.004%, 0.067%, and 0.027% than the CW, TC, and ML algorithms, respectively. The number of matches in the aligned sequences is shown in Fig. 4. The average number of matches in the aligned sequences was observed to be 29.1, 25.2, 27.7, and 29.2 for CW, TC, ML, and MPS algorithms, respectively. MPS algorithm was found to align the sequences with an increased match of 0.025%, 0.181%, and 0.088% than the CW, TC, and ML algorithms. Comparison based on the number of gaps inserted in the aligned sequences by four multiple sequence alignment algorithms is shown in Fig. 5. The numbers of gaps inserted in the aligned sequences by CW, TC, ML, and MPS aligners were observed to be 271, 280, 280, and 279, respectively. The proposed MPS aligner inserted an increased number of gapes in the aligned sequences than the CW (0.03%), TC (0.014%), and ML (0.014%) algorithms.

Fig. 4
figure 4

Comparison of the number of matches occurred in the aligned sequences using ClustalW, TCOFFEE, and MUSCLE aligners with MSA-MPSAGA

Fig. 5
figure 5

Comparison of gaps inserted in the aligned sequences by MSA-MPSAGA with ClustalW, TCOFFEE, and MUSCLE aligners

A comparative study based on the match scores of the multiple alignments was also performed using CW, TC, ML, and MPS algorithms. The match score was calculated for the multiple sequence alignments performed by CW, TC, ML, and MPS algorithms that were observed to be 4484.6, 4084.7, 4589.6, and 4682.3, respectively (Fig. 6).

Fig. 6
figure 6

Comparison of Match Score for MSA performed by MSA-MPSAGA with ClustalW, TCOFFEE, and MUSCLE aligners

6 Conclusion

Conclusively, it can be stated that the proposed multiple sequence aligner based on adaptive evolutionary clustering algorithm (MSAMPSAGA or MPS) accurately identifies the sequence alignments. Furthermore, an average increase in sequence alignment length using the proposed aligner was observed to be 0.03% than the other tested algorithms ClustalW, TCOFFEE, and MUSCLE. The phylogenetic trees constructed from the MSA obtained from the aligners also indicated that the MPS provides more accurate results. The overall comparison of MPS with the other three tested algorithms showed that the qualitative and quantitative performance of the proposed algorithm is at par as compared to the other aligners. The only limitation of the proposed MPS algorithm is that the algorithm is more useful in doing MSA of biological sequences. The implementation of the proposed algorithm in aligning other types of sequences in the varied dataset is a scope of future study.