1 Introduction

With the rapid development of new sequencing technologies, their biological applications have become increasingly widespread, including elucidating the relationship between nucleosome positioning and DNA methylation [1], predicting missense mutations or protein functionality [2, 3], assembling new genomes [4], crop breeding [5], and so on. Multiple sequence alignment (MSA) is fundamental to most of these applications.

For \( N \) sequences of length \( L \), computing an optimal alignment exactly has a computational complexity of \( O(L^{N} ) \), which is prohibitive even for a small number of sequences. Unfortunately, all sequencing technologies in production, such as Illumina, Helicos, SOLiD and Roche/454, can produce thousands or millions of sequences concurrently [6, 7]. To overcome this difficulty, many heuristic methods have been developed, including progressive methods [8] and iterative refinement methods [9].

This article aims to systematically review recent advances in MSA methods. It is organized as follows. In Sect. 2, we introduce the basic theory of heuristic methods and review the development of widely used techniques, including Clustal, T-Coffee, MAFFT, MUSCLE and Kalign. In Sect. 3, we examine their programs on the benchmarks BAliBASE 3.0 [10], OXBench [11] and HOMSTRAD. Finally, we discuss the future development of multiple sequence alignment in Sect. 4.

2 Overview

2.1 Theory

Progressive Method.

The progressive method was the first practical MSA construction strategy and still forms the core of most MSA programs. It usually consists of the following four steps [12]:

Step 1: Calculate a distance matrix for the \( N \) input sequences. Each element of this matrix is the distance between one pair of input sequences, and distance can be measured in many ways, for example by cosine distance or Euclidean distance. In the exact approach, \( \binom{N}{2} \) pair-wise alignments are needed to count the numbers of matches, mismatches, and indels, which are then converted to distance measures. This procedure is costly when \( N \) is large, as its time complexity is \( O(N^{2} L^{2} ) \);

Step 2: Construct a guide tree from the distance matrix calculated in Step 1 using a clustering method. The most widely used method is UPGMA (Unweighted Pair-Group Method with Arithmetic means) [13], which takes \( O(N^{2} ) \) time to construct the guide tree;

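Step 3: Build the MSA by following the guide tree, in which each external node represents an input sequence and each internal node represents an MSA of the sequences beneath it, obtained by a pair-wise alignment of its two child nodes;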

Step 4: After the initial MSA has been constructed, repeat Step 1 and Step 2 using the pair-wise alignments it generates.
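
As an illustration, Step 1 and Step 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure of any particular program: the distance is the unit-cost edit distance (mismatches plus indels of an optimal pair-wise alignment) divided by the longer sequence length, the names `edit_distance`, `distance_matrix` and `upgma` are hypothetical, and the UPGMA loop is a naive implementation that is slower than the \( O(N^{2} ) \) bound quoted above.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance by dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # gap in b
                           cur[j - 1] + 1,             # gap in a
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


def distance_matrix(seqs):
    """Step 1: one pair-wise comparison per pair of sequences."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        longest = max(len(seqs[i]), len(seqs[j]))
        d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j]) / longest
    return d


def upgma(d):
    """Step 2: naive UPGMA; returns the guide tree as nested tuples of indices."""
    clusters = {i: (i, 1) for i in range(len(d))}  # id -> (subtree, size)
    dist = {(i, j): d[i][j] for i, j in combinations(range(len(d)), 2)}
    next_id = len(d)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            # Size-weighted average keeps all leaf pairs equally weighted.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


seqs = ["ACGT", "ACGA", "TTTT"]
guide_tree = upgma(distance_matrix(seqs))
```

In this toy input the first two sequences differ at one position only, so they are joined first and the third sequence is attached at the root of the guide tree.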

Iterative Refinement.

The progressive method is a “greedy algorithm”, in which mistakes made at the initial alignment stages cannot be corrected later [14]. An effective way to overcome this defect is a post-processing step known as iterative refinement, which also consists of four steps [12]:

Step 1: Construct an initial MSA;

Step 2: Divide the MSA constructed in Step 1 into two groups, then remove the columns consisting entirely of gaps from each of the two groups;

Step 3: Realign the two groups produced in Step 2 by a pair-wise sequence-to-group or group-to-group alignment method;

Step 4: Repeat Step 2 and Step 3 until the alignment score no longer improves or the number of iterations exceeds a predefined limit.
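
The refinement loop above can be sketched as follows. This is only an illustrative outline: the MSA is assumed to be a dictionary from sequence name to gapped row, and `realign` and `score` are hypothetical callables supplied by the caller (a group-to-group aligner and an objective function such as the WSP score), neither of which is implemented here.

```python
def drop_all_gap_columns(rows, gap="-"):
    """Remove the columns in which every row of the group has a gap."""
    names = list(rows)
    width = len(rows[names[0]])
    keep = [c for c in range(width) if any(rows[n][c] != gap for n in names)]
    return {n: "".join(rows[n][c] for c in keep) for n in names}


def split_msa(msa, group_a, gap="-"):
    """Step 2: divide the MSA into two groups and clean each of them."""
    a = {n: r for n, r in msa.items() if n in group_a}
    b = {n: r for n, r in msa.items() if n not in group_a}
    return drop_all_gap_columns(a, gap), drop_all_gap_columns(b, gap)


def refine(msa, partitions, realign, score, max_iters=10):
    """Steps 2 to 4: keep a realignment only if it improves the score."""
    best, best_score = msa, score(msa)
    for _ in range(max_iters):          # predefined iteration limit
        improved = False
        for group_a in partitions:
            candidate = realign(*split_msa(best, group_a))  # Step 3
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score, improved = candidate, candidate_score, True
        if not improved:                # no gain in the alignment score
            break
    return best


msa = {"s1": "AC-G-", "s2": "A--G-", "s3": "ACTGA"}
group_a, group_b = split_msa(msa, {"s1", "s2"})
```

Here the third and fifth columns of the first group contain only gaps once `s3` is removed, so they are dropped before realignment.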

Scoring Function.

A good scoring function is necessary for these procedures to work accurately. The most widely used functions are the sum-of-pairs (SP) score [15] and the weighted sum-of-pairs (WSP) score [16] with affine gaps.

For a sequence set \( A \) made up of \( N \) sequences of length \( L \), we define WSP as follows:

$$ \begin{aligned} WSP(A) &= \sum\limits_{1 \le i < j \le N} w_{i,j} H(a_{i} ,a_{j} ) \\ &= \sum\limits_{1 \le l \le L} \sum\limits_{1 \le i < j \le N} w_{i,j} \left[ S(a_{i,l} ,a_{j,l} ) - v \cdot G(i,j,l) \right] \end{aligned}, $$
(1)

where \( H(a_{i} ,a_{j} ) \) is the alignment score of the pair of sequences \( [a_{i} ,a_{j} ] \) in \( A \), \( w_{i,j} \) is the weight of that pair (\( w_{i,j} = 1 \) gives the unweighted case), \( S(a_{i,l} ,a_{j,l} ) \) is the match score of the pair at position \( l \), \( G(i,j,l) \) is a Boolean variable that equals 1 if a gap opens between \( a_{i} \) and \( a_{j} \) at position \( l \) and 0 otherwise, and \( v \) is the gap penalty.
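
Eq. (1) can be evaluated directly, as in the following sketch. Several details are assumptions made for illustration: columns in which both sequences of a pair have a gap are skipped, a residue aligned to a gap contributes no match score, only the gap-opening term \( v \cdot G(i,j,l) \) of Eq. (1) is charged, and `identity_score` is a hypothetical stand-in for a real substitution matrix.

```python
from itertools import combinations

GAP = "-"


def identity_score(x, y):
    """Toy stand-in for S: +1 for identical residues, -1 otherwise."""
    return 1 if x == y else -1


def pair_score(row_i, row_j, S, v):
    """H(a_i, a_j): score of the pair-wise alignment induced by the MSA."""
    total = 0.0
    prev_gap = None  # which row had the gap in the previous kept column
    for x, y in zip(row_i, row_j):
        if x == GAP and y == GAP:
            continue  # column absent from the pair-wise projection
        gap_in = "i" if x == GAP else "j" if y == GAP else None
        if gap_in is None:
            total += S(x, y)
        elif gap_in != prev_gap:
            total -= v  # G(i, j, l) = 1: a new gap opens at this column
        prev_gap = gap_in
    return total


def wsp(msa, S, v, w=None):
    """Weighted sum-of-pairs score; w=None gives the unweighted case."""
    return sum((w[i][j] if w else 1.0) * pair_score(msa[i], msa[j], S, v)
               for i, j in combinations(range(len(msa)), 2))


msa = ["AC-GT", "A--GT", "ACTGT"]
unweighted = wsp(msa, identity_score, v=2)
weighted = wsp(msa, identity_score, v=2, w=[[0, 2, 1], [2, 0, 1], [1, 1, 0]])
```

With these toy inputs the three pair-wise scores are 1, 2 and 1, so the unweighted score is 4 and doubling the weight of the first pair raises it to 5.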

2.2 Alignment Technique

Clustal.

In 1988, Des Higgins wrote the first Clustal program [17], which combined a dynamic programming algorithm [18] with the progressive alignment strategy developed by Feng and Doolittle [8]. It used a word-based alignment algorithm [19] to calculate the distance matrix and the UPGMA method to construct the guide tree. In 1992, ClustalV [20] implemented profile alignments and could generate trees from the multiple alignment using the Neighbour-Joining (NJ) method [21]. In 1994, ClustalW [22] improved the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. In 1997, ClustalX [23] provided a graphical interface in which the multiple alignment could be displayed on screen and all parameters were available as options, which made evaluation much more convenient for users. The latest member of the Clustal series is Clustal Omega [14], which can align virtually any number of protein sequences quickly and delivers accurate alignments. To construct a guide tree, Clustal Omega uses a modified version of mBed [24], which has a complexity of \( O(N\log N) \) and yields guide trees as accurate as those from conventional methods. The alignments are then computed using the very accurate HHalign package [25], which aligns two profile hidden Markov models [17].

T-Coffee.

The first version of T-Coffee (Tree-based Consistency Objective Function for alignment Evaluation) [26] can be traced back to 2000. It implemented progressive alignment with a consistency-based objective function [27] and tried to maximize the score between the final multiple alignment and a library of pair-wise aligned residue scores derived from a mixture of local and global pair-wise alignments. M-Coffee [28] is an extension of T-Coffee that uses consistency to estimate a consensus alignment. It is a meta-method that assembles a multiple sequence alignment by combining the output of several individual methods into one single MSA. TCS (Transitive Consistency Score) [29] is a newer extension of the T-Coffee scoring scheme that addresses the sensitivity of homology and evolutionary modeling to the accuracy of the underlying MSA, and it can also improve phylogenetic tree reconstruction.

MAFFT.

MAFFT [30], first released in 2002, is a method for rapid multiple protein sequence alignment based on the FFT (Fast Fourier Transform). Homologous regions are rapidly identified by the FFT, for which an amino acid sequence is converted to a sequence composed of the volume and polarity values of each amino acid residue. The original MAFFT included two different heuristics: the progressive methods FFT-NS-1 and FFT-NS-2, and the iterative refinement method FFT-NS-i. In 2005, MAFFT version 5 [31] improved accuracy by offering new iterative refinement options, H-INS-i, F-INS-i and G-INS-i, and by incorporating pair-wise alignment information into the objective function. In 2007, MAFFT version 6 [32] improved the accuracy of multiple ncRNA alignment with two techniques: the PartTree algorithm and the four-way consistency objective function. In 2010, to speed up the program, two natural parallelization strategies (best-first and simple hill-climbing) were implemented for the iterative refinement stage of MAFFT version 6, and the simple hill-climbing approach was selected as the default [33]. In 2012, two methods for adding unaligned sequences to an existing multiple sequence alignment were implemented as the ‘--add’ and ‘--addfragments’ options of the MAFFT package [34].

The newest version is MAFFT version 7 [35]. Besides the options for adding unaligned sequences to an existing alignment, it has several new features, including adjustment of direction in nucleotide alignment, constrained alignment and parallel processing.
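
The idea of locating homologous regions by FFT-based correlation can be illustrated with the sketch below. It is not MAFFT's implementation: instead of the volume and polarity values, each sequence is encoded here as one indicator vector per letter, so that the correlation at a given lag simply counts identical residues at that offset. The function names `fft` and `match_counts_by_lag` are hypothetical, and the FFT is a plain recursive radix-2 version written for readability.

```python
import cmath


def fft(x, invert=False):
    """Recursive radix-2 FFT; len(x) must be a power of two.
    The inverse transform is left unnormalized (caller divides by len(x))."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], invert), fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def match_counts_by_lag(s1, s2):
    """Return {lag: number of positions n with s1[n] == s2[n + lag]}."""
    size = 1
    while size < len(s1) + len(s2):  # zero-padding avoids wrap-around
        size *= 2
    total = [0.0] * size
    for letter in set(s1) & set(s2):
        v1 = [1.0 if c == letter else 0.0 for c in s1] + [0.0] * (size - len(s1))
        v2 = [1.0 if c == letter else 0.0 for c in s2] + [0.0] * (size - len(s2))
        # Correlation theorem: c = IFFT(conj(FFT(v1)) * FFT(v2)).
        product = [a.conjugate() * b for a, b in zip(fft(v1), fft(v2))]
        for k, value in enumerate(fft(product, invert=True)):
            total[k] += value.real / size
    counts = {}
    for k in range(size):
        if k < len(s2):
            counts[k] = round(total[k])         # s2 shifted right by k
        elif k > size - len(s1):
            counts[k - size] = round(total[k])  # negative lags wrap around
    return counts


counts = match_counts_by_lag("GATTACA", "CCGATTACA")
best_lag = max(counts, key=counts.get)
```

In this toy example the first sequence occurs in the second one starting at its third residue, so the correlation peaks at a lag of 2 with seven matching positions. A peak of this kind marks a candidate homologous segment that a subsequent dynamic programming step could then align in detail.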

MUSCLE.

MUSCLE (MUltiple Sequence Comparison by Log-Expectation) [36] is a multiple alignment method for protein sequences. It uses two distance measures for each pair of sequences: a k-mer distance (for an unaligned pair) and the Kimura distance (for an aligned pair). The guide tree is constructed using UPGMA, and profiles are aligned with a profile function called the log-expectation (LE) score. MUSCLE consists of the following three stages:

Stage 1: Draft progressive. This stage includes four steps (similarity measure, distance estimate, tree construction, progressive alignment) and produces a rapid multiple alignment, while de-emphasizing accuracy.

Stage 2: Improved progressive. This stage also includes four steps (similarity measure, tree construction, tree comparison, progressive alignment). In Stage 1, the main source of error is the k-mer distance measure, which leads to a suboptimal tree. MUSCLE therefore re-estimates the tree using the Kimura distance, which is more accurate but requires an alignment.

Stage 3: Refinement. This stage is made up of four steps (choice of bipartition, profile extraction, re-alignment, accept/reject). It performs iterative refinement using an approximate tree-dependent restricted partitioning [21].
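
The two distance measures used in Stage 1 and Stage 2 can be sketched as follows. Both functions are illustrative assumptions rather than MUSCLE's code: `kmer_distance` uses one common definition (one minus the fraction of shared k-mers, for sequences at least `k` residues long), and `kimura_distance` uses Kimura's correction \( d = -\ln (1 - D - D^{2} /5) \), where \( D \) is the fraction of differing residues over the aligned, gap-free positions.

```python
import math
from collections import Counter


def kmer_distance(x, y, k=3):
    """Alignment-free distance for an unaligned pair of sequences."""
    kmers_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    kmers_y = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    shared = sum((kmers_x & kmers_y).values())  # min count for each k-mer
    return 1.0 - shared / (min(len(x), len(y)) - k + 1)


def kimura_distance(row_x, row_y, gap="-"):
    """Distance for an aligned pair; needs the two rows of an alignment."""
    pairs = [(p, q) for p, q in zip(row_x, row_y) if p != gap and q != gap]
    diff = sum(p != q for p, q in pairs) / len(pairs)
    # The logarithm is undefined for very divergent pairs (diff above
    # roughly 0.85); math.log raises ValueError in that case.
    return -math.log(1.0 - diff - diff * diff / 5.0)


d_kmer = kmer_distance("ACGTACGT", "ACGTTCGT", k=3)
d_kimura = kimura_distance("AAAAAAAAAA", "AAAAAAAAAC")
```

The k-mer distance needs no alignment and is therefore cheap enough for the draft tree, whereas the Kimura distance can only be computed once the Stage 1 alignment is available.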

Kalign.

Kalign [31] is an MSA algorithm that was proposed in 2005. It also implements progressive alignment. Unlike other progressive methods, Kalign employs the Wu-Manber approximate string-matching algorithm [37], which makes its distance estimation more accurate. In 2007, Emmanuelle Becher et al. proposed a tool called HMM-Kalign [38] for generating sub-optimal alignments. As the name implies, HMM-Kalign is based on the original Kalign and adds a hidden Markov model. The newest improved edition of Kalign is Kalign-LCS [39]. It applies the longest common subsequence (LCS) in the similarity measure step and achieves a balance between accuracy and speed.
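
A similarity measure based on the longest common subsequence can be sketched as below. This is a textbook dynamic programming formulation, not the implementation used in Kalign-LCS, and the normalization by the shorter sequence length in the hypothetical function `lcs_similarity` is an assumption made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)          # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # skip a residue in a or b
        prev = cur
    return prev[-1]


def lcs_similarity(a, b):
    """Similarity in [0, 1]; 1 means the shorter sequence is a subsequence."""
    return lcs_length(a, b) / min(len(a), len(b))


similarity = lcs_similarity("ACGT", "AGT")
```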

3 Practical Result

We examine ClustalW, Clustal Omega, T-Coffee, MAFFT:Auto, MAFFT:FFT-NS-1, MAFFT:G-INS-i, MUSCLE and Kalign on the BAliBASE 3.0 reference sets, OXBench and HOMSTRAD.

We evaluate the alignment results with BaliScore, which reports two measures: the SP-score (sum-of-pairs score), the percentage of homologies in the reference alignment that are recovered in the estimated alignment, and the TC-score (total column score), the percentage of columns that are recovered entirely correctly in the estimated alignment (Tables 1, 2 and 3).
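
The two measures can be computed as in the following sketch, which is a simplified illustration rather than the BaliScore program itself. It assumes that the reference and estimated alignments list the same sequences in the same order, scores every reference column (BaliScore can restrict scoring to annotated core blocks), counts only reference columns with at least two residues for the TC-score, and returns fractions instead of percentages. All function names are hypothetical.

```python
from itertools import combinations


def column_ids(msa, gap="-"):
    """Describe each column by the residue index of every row (None at a gap)."""
    counters = [0] * len(msa)
    columns = []
    for c in range(len(msa[0])):
        column = []
        for r, row in enumerate(msa):
            if row[c] == gap:
                column.append(None)
            else:
                column.append(counters[r])
                counters[r] += 1
        columns.append(tuple(column))
    return columns


def aligned_pairs(columns):
    """All pairs of residues that share a column."""
    pairs = set()
    for column in columns:
        present = [(r, idx) for r, idx in enumerate(column) if idx is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(reference, estimated):
    """Fraction of reference residue pairs recovered in the estimate."""
    ref_pairs = aligned_pairs(column_ids(reference))
    return len(ref_pairs & aligned_pairs(column_ids(estimated))) / len(ref_pairs)


def tc_score(reference, estimated):
    """Fraction of reference columns reproduced exactly in the estimate."""
    ref_columns = [col for col in column_ids(reference)
                   if sum(idx is not None for idx in col) > 1]
    est_columns = set(column_ids(estimated))
    return sum(col in est_columns for col in ref_columns) / len(ref_columns)


reference = ["AC-GT", "ACTGT"]
estimated = ["ACG-T", "ACTGT"]
sp = sp_score(reference, estimated)
tc = tc_score(reference, estimated)
```

In this toy case the estimate misplaces a single residue, so three of the four reference pairs and three of the four scored reference columns are recovered.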

Table 1. Summary of the techniques described in the review
Table 2. The SP-score of various individual methods on the benchmark BAliBASE 3.0 references
Table 3. The TC-score of various individual methods on the benchmark BAliBASE 3.0 references

From the SP-score and TC-score results, we can see that none of the programs we examined is sensitive to sequence divergence. However, all programs are affected, to varying degrees, by a highly divergent “orphan” sequence, residue differences between groups, N/C-terminal extensions, and internal insertions. On the whole, Clustal Omega and T-Coffee perform well, and T-Coffee gives the best results.

4 Conclusion and Future Development

In recent years, MSA has developed greatly and has been applied with good effect in many biological applications. However, there is still plenty of room to improve multiple sequence alignment, especially with respect to robustness and accuracy. To solve these problems, on the one hand, we should continue to develop recent efficient MSA techniques, such as T-Coffee. On the other hand, we should change our way of thinking and apply techniques that go beyond heuristic methods, and even beyond bioinformatics, to improve MSA.

Fortunately, many researchers are devoted to developing MSA methods. Sabari Pramanik and S.K. Setua [40] define a new form of chromosome representation and deploy it in a steady-state genetic algorithm, obtaining better results. Siavash Mirarab, Nam Nguyen, and Tandy Warnow propose an algorithm called PASTA [41] for estimating large-scale multiple sequence alignments. Another interesting method is Phylo [42], a human-based computing framework that applies “crowd sourcing” techniques to the MSA problem. The key idea of Phylo is to convert the MSA problem into a casual game that can be played by ordinary web users with minimal prior knowledge of the biological context. Cactus [43] addresses the observation that much attention has been given to creating reliable multiple sequence alignments in a model incorporating substitutions, insertions, and deletions, while far less attention has been paid to optimizing alignments in the presence of more general rearrangements and copy number variation.

Another development trend is the parallelization of MSA. Because MSA is an NP-hard problem and the amount of data is huge, MSA programs are costly in terms of time. Hence, it is necessary to implement parallel solutions for MSA. Jucele F. A. et al. [44] present two parallel solutions using the BSP/CGM model, with MPI and CUDA implementations. Their results show that parallel processing allows more and larger sequences to be handled. Evandro A. Marucci et al. [45] propose a parallel algorithm for calculating multiple sequence similarities based on the k-mer counting method, and obtain very good scalability and a nearly linear speedup.