Introduction

The matK gene is located in the large single-copy region of the chloroplast genome, and lies between the 5′ and the 3′ exons of trnK (tRNA-lysine) within a group II intron. The gene is roughly 1,500 bp in length, corresponding to 500 amino acids. matK protein consists of three parts: an N-terminal region, a reverse transcriptase (RT) domain in the middle, and domain X at the C-terminus. The high rate of substitution in this gene has resulted in an increased number of parsimony informative sites and strong phylogenetic signals, making it useful to determine evolutionary histories at various taxonomic levels (see, e.g., Müller et al. 2006; Hao et al. 2008). The abundant phylogenetic information derived from matK has made it an extremely valuable gene for DNA barcoding, systematic and evolutionary studies.

matK protein is the only putative group II intron maturase encoded in the chloroplast genome. This putative function is based on the homology of a region in the carboxy terminus of matK to the conserved domain X of mitochondrial group II intron maturases (Neuhaus and Link 1987). Domain X of matK contains the strongly conserved sequence SX3–6TLAXKXK, and most of the sequences have a large excess of basic over acidic amino acids (Mohr et al. 1993) and are mostly hydrophilic. Maturases are enzymes that catalyze nonautocatalytic intron removal from premature RNAs, such as RNA transcripts for the trnK, trnA, trnI, rps12, rpl2, and atpF genes (Vogel et al. 1999). The tRNA or protein products from these genes are required for normal chloroplast function, suggesting an indispensable function for matK in the chloroplast.

matK is often chosen for phylogenetic reconstructions, and has been sequenced in thousands of plant species. Surprisingly, despite matK’s physiological importance and the abundance of sequence data, matK is generally used as strings of anonymous nucleotides, without regard to its functional evolution. It has been noted that there is incongruence between the matK gene tree and plant species trees, as well as between the former and the nuclear ITS tree (e.g., Chaw et al. 2005; Rønsted et al. 2005); however, it is not clear whether natural selection is involved in such topological incongruence. Moreover, although the matK pseudogene has been found in Valerianaceae (Hidalgo et al. 2004) and Corallorhizinae (Freudenstein and Senyo 2008), and purifying selection has been detected in the matK of nonphotosynthetic Orobanchaceae (Young and de Pamphilis 2000), little is known about matK evolution in most other land plant groups, e.g., in monocots, eudicots, or gymnosperms. Progression of matK toward a pseudogene state in some species is probably the reason for the lack of interest in the analysis of positive selection acting on it. Recently, Barthet and Hilu (2008) evaluated evolutionary constraints on matK using protein composition to try to explain why this protein-coding gene accommodates elevated rates of substitution and yet maintains functionality. Duffy et al. (2009) compared matK sequences of the intron-less fern clade to sequences from seed plants and ferns with the intron and found no significant differences in selection among lineages. It is thought that matK in ferns has maintained its ancient and generalized function in chloroplasts, even after the loss of its co-evolved group II intron. To gain deeper insight into the evolutionary pattern of matK of various plant groups, here we detect, for the first time, positive selection of matK in 70 plant groups at various evolutionary levels, using likelihood molecular phylogenetic analysis. 
Amino acid sites under positive Darwinian selection were found in 32 groups but not in the other 38. We identified positively selected residues in all three functionally important regions of the protein. Contrasting ecological conditions among plants may have imposed different selective pressures on the maturase molecule, and the increased amino acid replacement in matK may reflect its continuous fine-tuning under varying ecological conditions.

Materials and methods

Taxon sampling and data preparation

Sampling of Taxaceae and Cephalotaxaceae species, genomic DNA extraction, PCR amplification of matK, cloning and DNA sequencing were performed as previously described (Hao et al. 2008). Species names and GenBank accession numbers of the sequences generated in this study, as well as those retrieved from GenBank, are given in Table S1 in Electronic Supplementary Material; 24 matK sequences of Taxaceae and Cephalotaxaceae were newly generated for this study. The obtained sequences were codon-aligned and edited using RevTrans (Wernersson and Pedersen 2003; http://www.cbs.dtu.dk/services/RevTrans/) and Clustal W2. We analyzed 70 separate data sets rather than a single combined data set (see below). Doubtful sequences (e.g., those containing stop codons) were excluded from the analyses. All alignments are available from the corresponding author upon request.

Molecular evolutionary analysis

Molecular adaptation tests on the matK codon sites and reconstruction of the ancestral matK sequences were performed using maximum-likelihood models and programs included in PAML ver. 4.1 (Yang 2007). The models used the nonsynonymous/synonymous substitution rate ratio (ω = dN/dS) as an indicator of selective pressure and allowed the ratio to vary among codon sites. We used five site-specific codon substitution models: null models for testing positive selection (M1a, M7, and M8a) and models allowing for positive selection (M2a and M8). The likelihood ratio test (LRT) was used to compare these alternative models. Groups in which the M8 model fitted significantly better (P < 0.05) in both the M7–M8 and M8a–M8 comparisons were regarded as showing positive selection. Because the Yang models are based on theoretical assumptions and ignore the empirical observation that distinct amino acids differ in their replacement rates, we also implemented a mechanistic empirical combination (MEC) model (Doron-Faigenboim and Pupko 2007) that takes into account not only the transition–transversion bias and the nonsynonymous/synonymous ratio, but also the different amino acid replacement probabilities as specified in empirical amino acid matrices. Because the LRT is applicable only when two models are nested and thus is not suitable for comparing the MEC and M8a models, the second-order Akaike information criterion (AICc) was used for that comparison (Doron-Faigenboim and Pupko 2007). Sites most likely to belong to the positive selection class (ω > 1) were identified as likely targets of selection.
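The two model-comparison criteria described above can be sketched as follows. This is a minimal illustration, not the PAML or MEC implementation: the function names, the log-likelihood values, and the parameter counts are hypothetical, and the degrees of freedom (2 for M7–M8, 1 for M8a–M8) follow common practice rather than anything stated in the text. The optional halving of the P value for a boundary test is likewise an assumption about how the M8a–M8 comparison might be treated.

```python
import math


def lrt(lnl_null, lnl_alt, df, boundary=False):
    """Likelihood ratio test for two nested models.

    Returns (statistic, p_value). Only df = 1 and df = 2 are handled,
    because the chi-square tail has a closed form in those cases.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 df
    elif df == 2:
        p = math.exp(-stat / 2.0)             # chi-square tail, 2 df
    else:
        raise ValueError("this sketch handles df = 1 or 2 only")
    if boundary and stat > 0.0:
        # Assumption: null is a 50:50 mixture of a point mass at 0
        # and chi-square(1), so the tail probability is halved.
        p *= 0.5
    return stat, p


def aicc(lnl, k, n):
    """Second-order AIC for log-likelihood lnl, k free parameters,
    and sample size n (assumed here to be the alignment length)."""
    return -2.0 * lnl + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


# Hypothetical log-likelihoods for one group (not taken from Table 2).
stat, p = lrt(lnl_null=-5230.4, lnl_alt=-5221.9, df=2)
print(f"M7 vs M8: 2*dlnL = {stat:.2f}, P = {p:.2g}")

# Non-nested comparison: the model with the lower AICc is preferred.
print(aicc(lnl=-5215.0, k=5, n=500) < aicc(lnl=-5224.0, k=4, n=500))
```

The LRT applies only to the nested pairs; the AICc function is what allows a non-nested pair such as MEC and M8a to be ranked on a common scale.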

Evaluation of effects of positive selection on phylogenetic reconstructions

Given that positive selection may result in homoplasy, we tested whether the removal of codons evolving under positive selection would improve the phylogenetic resolution. We compared bootstrap sums of trees reconstructed using all sites (including ones evolving under positive selection) with bootstrap sums of trees reconstructed using only neutrally evolving sites. Phylogenetic trees were reconstructed in MEGA4 (Tamura et al. 2007) using the neighbor-joining (NJ) algorithm. Due to the highly divergent data set and small number of characters, a relatively simple model of evolution was chosen (JTT). Gaps were pair-wise deleted. We used 50% majority rule trees and subtracted 50% from each support value before summing up. The subtraction was done to circumvent bias in summing up the bootstrap values of a consensus tree. Without this correction, a tree with two 51% groups would have higher support than one with one group with 100% support, and if support was decreased from 51 to 49%, the sum would be zero (due to a threshold of 50%).
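The corrected bootstrap sum described above can be sketched in a few lines. The function name and the input format (a plain list of bootstrap percentages, one per group in the consensus tree) are assumptions made for illustration; the numbers reproduce the two-groups-at-51% versus one-group-at-100% example from the text.

```python
def corrected_bootstrap_sum(supports, threshold=50.0):
    """Sum bootstrap support after subtracting the majority-rule threshold.

    Groups below the threshold do not appear in a 50% majority-rule
    tree, so they contribute nothing to the sum.
    """
    return sum(s - threshold for s in supports if s >= threshold)


two_weak_groups = [51.0, 51.0]
one_strong_group = [100.0]

# Uncorrected sums would rank the two weak groups higher (102 vs 100).
print(sum(two_weak_groups), sum(one_strong_group))

# Corrected sums rank the single well-supported group higher (2 vs 50).
print(corrected_bootstrap_sum(two_weak_groups),
      corrected_bootstrap_sum(one_strong_group))
```

With the correction, a drop from 51% to 49% changes a group's contribution from 1 to 0 rather than from 51 to 0, which removes the discontinuity at the threshold.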

Results and discussion

Positive selection in matKs of land plants

In order to test for the presence of positive selection acting on matK we used 2,279 matK sequences from different species (Table S1). Most matK sequences analyzed (94%) belong to flowering plants and represent 45 orders and 202 families [99% of flowering plant orders and 44% of families sensu APG II (2003)] providing reasonable coverage of the most taxon-rich lineages. The coverage outside flowering plants was less extensive (Tables 1, S1). For computational efficiency all the sequences were divided into 70 monophyletic groups, based on their phylogenetic relations.

Table 1 Sampled groups

For detection of positive selection, we used nested maximum likelihood models, implemented in PAML, that allow the ratio of nonsynonymous to synonymous substitution rates (dN/dS) to vary across codons. We performed two LRTs for the presence of codons under positive selection: the M7–M8 and M8a–M8 comparisons. The M7 model assumes a discrete β distribution for dN/dS, which is constrained between 0 and 1, implemented using ten classes taken in equal proportions. To test for the presence of codons with dN/dS > 1, M7 is compared to the M8 model, which is similar to the M7 model, but allows for an extra “eleventh” class with dN/dS ≥ 1. This test was significant for 31 out of 70 groups analyzed (Table 2). With Bonferroni correction (significance level = 0.05/70), this test was significant for 21 groups. A more stringent test for positive selection compares model M8 with M8a, which is similar to model M7, but allows for an extra class of codons with dN/dS = 1. This test was significant for 32 groups (Table 2; 17 groups after Bonferroni correction). In 31 cases (44%) both M7–M8 and M8a–M8 comparisons rejected models without positive selection in favor of the M8 model assuming positive selection (Table 2; 17 groups after the Bonferroni correction). The MEC model (Doron-Faigenboim and Pupko 2007) takes into account not only the transition–transversion bias and the dN/dS ratio, but also the different amino acid replacement probabilities as specified in empirical amino acid matrices. Out of 32 groups, 18 were significant in MEC vs. M8a comparisons (Table 2), including monocot-16, Gnetales, Coniferales-1 and 2, among others. In these groups, the MEC model was the best-fitting model, with the highest log-likelihood value. Compared to M8a, the MEC model had a much higher log-likelihood value and a much lower AICc score in each of these 18 groups. Many of the sites identified by the M8 model (Fig. 1) were also identified by the MEC model.
In 14 groups in which MEC vs M8a comparison was negative but the other two types of comparisons were positive, model M8 was best-fitting (Table 2). It should be noted that, on the one hand, there is a risk of overestimating the magnitude of positive selection because of the multiple-comparison problem, and correction of the significance level might be necessary; on the other hand, using dN/dS as the sole method of detecting positive selection is too conservative to detect single adaptive amino acid changes and is thus limited in scope.
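The Bonferroni adjustment mentioned above (significance level = 0.05/70) amounts to dividing the nominal level by the number of tests. A minimal sketch follows; the function names, group labels, and P values are hypothetical placeholders, not entries from Table 2.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def significant_groups(p_values, alpha=0.05, n_tests=None):
    """Return the names of groups whose P value passes the corrected cutoff.

    n_tests defaults to the number of P values supplied; pass it
    explicitly when only a subset of the tests is listed.
    """
    m = len(p_values) if n_tests is None else n_tests
    cutoff = bonferroni_threshold(alpha, m)
    return sorted(name for name, p in p_values.items() if p < cutoff)


# 70 groups were tested, so the cutoff is 0.05 / 70 (about 0.00071).
print(bonferroni_threshold(0.05, 70))

# Hypothetical P values: group_A is significant at 0.05 but not
# after correction; group_B and group_C remain significant.
example = {"group_A": 0.03, "group_B": 0.0005, "group_C": 0.00001}
print(significant_groups(example, n_tests=70))
```

The correction controls the chance of any false positive across all 70 tests, at the cost of reduced power for each individual group, which is the trade-off the paragraph above describes.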

Table 2 Likelihood ratio statistics and second-order Akaike information criterion (AICc) scores for tests of positive selection
Fig. 1a–c The distribution of matK residues evolving under positive selection. a Monocot-16 (Liliaceae including Fritillaria), b Coniferales-2 (Pinaceae), c eudicot/Sapindales-1 as examples

No positive selection was detected in green algae. The smallest proportion of groups with detected positive selection was in monocots (21%); the highest proportions were in angiosperms other than monocots and in gymnosperms (53.5 and 60%, respectively). There was a significant difference between monocots and other angiosperms in the proportion of groups with positive selection (two-tailed P = 0.026, Fisher exact probability test). Interestingly, Kapralov and Filatov (2007) also found that the smallest proportion of cases with detected rbcL positive selection was in monocots (61%). Since increasing the number of sequences would increase the sensitivity of likelihood-based analysis, we combined the small monocot data sets in which no positive selection was detected. There was still no evidence of positive selection in the joint monocot data set. Moreover, there was no significant difference between the lineages of other flowering plants in the proportion of groups with matK positive selection (both Fisher exact probability test and 2 × 2 contingency χ² tests with Yates’ correction). The contrasting physiological conditions between monocots and other plants may have imposed different selective pressures on the intron splicing system and maturase function. The increased amino acid replacement in matK may reflect the continuous fine-tuning of maturase under varying physiological conditions.
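The two-tailed Fisher exact test used for the monocot comparison can be sketched with the standard library alone. The function name is hypothetical, and the counts in the usage example (4 of 19 monocot groups and 23 of 43 other angiosperm groups) are an assumption back-calculated from the reported 21% and 53.5%; the actual counts are in Table 2, which is not reproduced here. With these assumed counts the function returns a value close to the reported P = 0.026.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher exact test for the 2 x 2 table [[a, b], [c, d]].

    The P value is the total probability of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Assumed counts (selection detected, not detected); see the note above.
monocots = (4, 15)
other_angiosperms = (23, 20)
print(round(fisher_exact_two_tailed(*monocots, *other_angiosperms), 3))
```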

Distribution of matK residues responsible for positive selection

The average number of amino acids under selection per group was 20.2 ± 19.4. The distribution of residues identified in our analyses as evolving under positive selection was highly uneven: 63.5% (411/647) of positively selected sites are located in the N-terminal region (Table S2), followed by 21.5% (139/647) in domain X of the C-terminal region and 15.0% (97/647) in the internal RT domain. In 26 out of 32 groups, there are more positively selected sites in the N-terminal region than in either of the other regions. This is not surprising because the N-terminal region (~290 residues) is much larger than the RT domain (~70 residues) and domain X (~140 residues), and because domain X determines the maturase activity of matK proteins and thus is relatively conserved (Hausner et al. 2006).
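The arithmetic behind these proportions, and the effect of region size on them, can be checked with a short calculation. The site counts are those given above; the region lengths are the approximate values quoted in the text, so the per-residue figures are rough and serve only as an illustration. The variable names are hypothetical.

```python
# Positively selected sites summed over the 32 groups (from the text).
sites = {"N-terminal": 411, "domain X": 139, "RT domain": 97}

# Approximate region lengths in residues (from the text, marked "~").
approx_length = {"N-terminal": 290, "domain X": 140, "RT domain": 70}

total = sum(sites.values())
n_groups = 32

print(f"total sites: {total}, mean per group: {total / n_groups:.1f}")

for region, count in sites.items():
    share = 100.0 * count / total
    # Sites per residue, pooled over all groups; a rough size-adjusted
    # measure, not a statistic reported in the study.
    density = count / approx_length[region]
    print(f"{region}: {share:.1f}% of sites, {density:.2f} sites per residue")
```

Under these approximate lengths, the pooled density of selected sites is similar in the N-terminal region and the RT domain and lower in domain X, which is consistent with the explanation that region size accounts for much of the N-terminal excess while domain X is comparatively conserved.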

Implications for phylogenetic studies

Although matK has been used in hundreds of phylogenetic studies, it was used as strings of anonymous nucleotides, not as a functional molecule. Our analysis demonstrates that matK cannot be regarded as a neutral marker, and that positive selection is not unusual in some plant groups. Some studies found that substitution rates in the first and second codon positions in matK approach those of the third position, and that matK is not skewed toward the third position as is the case in most protein-coding genes used in angiosperm systematics (Hilu et al. 2003). However, in some plant groups, e.g., in Myrtales (Gadek et al. 1996; group 51 in Table 2) and Cornales (Xiang et al. 1998; group 32 in Table 2), the ratio of third-position changes to the average of the other positions is well above one, clearly showing that the third positions have a stronger phylogenetic signal. Positive selection may also result in homoplasy due to fixations of the same mutation arising independently in several phylogenetic lineages. As most substitutions in the third codon positions are synonymous, the third codon positions are less frequent targets of positive selection compared to the first and second positions. Thus, findings that the first and second codon positions in matK have a lesser phylogenetic signal in some groups (Young and de Pamphilis 2000) can be explained by positive selection on matK, at least in those 32 groups in which positive selection has been detected in this study.

We tested whether removal of codons evolving under positive selection would improve phylogenetic resolution in the 32 groups with detected positive selection (Table S3). We compared sums of bootstrap values between the NJ trees reconstructed using all sites and those reconstructed using only neutrally evolving sites (positively selected sites excluded). The sums of bootstrap frequencies changed by no more than 5 in 34% (11 groups) of analyzed cases; they decreased by more than 5 in 59% (19 groups) of cases, and increased by more than 5 in 6% (group 32: Cornales 67.2% and group 66: Pinaceae 10.9%) of cases. Since the maximum parsimony (MP) method would be more directly affected by the amount of homoplasy, we also compared sums of bootstrap values between MP trees reconstructed using all sites and those reconstructed using only neutrally evolving sites (Table S4). The sums of bootstrap frequencies changed by no more than 5 in 50% (16 groups) of analyzed cases; they decreased by more than 5 in 34.4% (11 groups) of cases, and increased by more than 5 in 12.5% (four groups) of cases. Thus, taking into account the presence of positive selection in matK may improve phylogenetic reconstructions in specific groups. We recommend checking matK data sets for positive selection and, if selection is found, testing whether excluding sites evolving under positive selection from further phylogenetic analyses increases topological resolution or bootstrap support of the affected branches.

Adaptive mutations may spread across subpopulations of a species, or across several species with very little gene flow. Thus, positive selection in matK may facilitate horizontal interspecific gene flow for chloroplast DNA, as spreading of adaptive mutations in matK may result in fixation of a single chloroplast haplotype in several occasionally hybridizing species, which may affect phylogeny reconstruction considerably. There are many reported cases of extensive incongruence of chloroplast DNA phylogeny with species phylogeny. For example, within genus Mitella (Saxifragaceae), there is a significant topological conflict between chloroplast and nuclear phylogenies, which can be attributed to frequent hybridization within the lineage (Okuyama et al. 2005; Okuyama and Kato 2009). Previously, we found strong cytonuclear incongruence caused partially by positive selection in matK and rbcL in Taxaceae (Hao et al. 2008). These cases exemplify the risk of reconstructing phylogenetic and phylogenomic relations solely from chloroplast data in groups with interspecific hybridization. Tests for the presence of positive selection and for congruence between chloroplast and nuclear phylogenies are indispensable for correct inference of species phylogenetic and phylogenomic relationships.