Introduction

Several bivalve species have an unusual mode of mitochondrial DNA (mtDNA) inheritance known as doubly uniparental inheritance (DUI) (Skibinski et al. 1994; Zouros et al. 1994; Liu et al. 1996; Passamonti and Scali 2001; Curole and Kocher 2002; Serb and Lydeard 2003; Theologidis et al. 2008). Animals with this mode of mitochondrial inheritance possess two highly divergent (up to 60%; Doucet-Beaupré et al. 2010) mitochondrial genomes: the maternally transmitted F genome, present in both genders, and the paternally transmitted M genome, present almost exclusively in heteroplasmic males. Thus two independent mitochondrial (mt) lineages exist, bound together by their interactions with the common nuclear background. Neither the origin nor the maintenance of DUI is fully understood. The DUI system opens an opportunity for adaptive evolution to occur within both mt lineages, yet so far only the faster evolution of the genomes in the M lineage, attributed to relaxed selective constraints, has been noted (Stewart et al. 1996).

Mitochondrial DNA (mtDNA) was long thought to be free of adaptive evolution signatures, allowing researchers to use it as a neutral or nearly neutral marker for phylogenetic, phylogeographic and population studies. However, the apparent lack of correlation between population size and mtDNA diversity in animals prompted Bazin et al. (2006) to postulate that periodic selective sweeps do participate in shaping mitochondrial diversity. In Mytilus, departures from neutrality were postulated even earlier, particularly in the context of mtDNA introgression or role-reversal (Quesada et al. 1998, 1999), but the haplotypes bearing the same nonsynonymous substitutions were not recovered in later, more comprehensive studies (Śmietanka et al. 2009); those early conclusions must therefore be treated cautiously, as the possibility of sequencing errors does exist. Nevertheless, several population studies showed that when the McDonald–Kreitman (MK) test of selective neutrality (McDonald and Kreitman 1991) was applied to Mytilus mitochondrial data, an excess of fixed nonsynonymous differences was not uncommon (Riginos et al. 2004; Ort and Pogson 2007; Śmietanka et al. 2009).

The ultimate demonstration of the adaptive process requires an analysis of the protein coding part of the complete mt genomes from a range of related DUI species. The three closely related species: Mytilus edulis, Mytilus galloprovincialis and Mytilus trossulus are best studied in this regard. The ancestor of all three species evolved in the Pacific and first colonized the northern Atlantic after the opening of the Bering Strait approximately 3.5 million years ago (mya), starting the allopatric speciation (Vermeij 1991; Riginos and Cunningham 2005). Its direct descendant, M. trossulus is widely distributed along the Pacific coasts of North America and Asia. It is also present in the west Atlantic where its southern range ends in the Gulf of Maine. The separation of M. edulis and M. galloprovincialis occurred during Pleistocene in the Atlantic and currently both species occur in Europe: M. galloprovincialis in the south, M. edulis in the north. There must have been at least one more wave of Mytilus migrants into European waters since M. trossulus has been identified in the Baltic Sea (Väinölä and Hvilsom 1991) and recently in Scotland (Beaumont et al. 2008).

The first nearly complete mitochondrial F genome of M. edulis was published by Hoffmann et al. (1992) and completed by Boore et al. (2004). Subsequently, the complete M and F genome sequences from M. edulis, M. galloprovincialis and Baltic M. trossulus were also published (Mizi et al. 2005; Breton et al. 2006; Zbawicka et al. 2007; Burzyński and Śmietanka 2009). However, the Baltic M. trossulus has experienced a strong introgression from M. edulis (Borsa et al. 1999; Riginos et al. 2002; Kijewski et al. 2006) and in effect the Baltic population has completely lost its native mtDNA. Therefore, the Baltic haplotypes resemble M. edulis mtDNA, even though they occur in M. trossulus mussels (Rawson and Hilbish 1998; Quesada et al. 1999; Zbawicka et al. 2007). Typically, the intraspecific divergences between sequences of M and F mt genomes exceed 20% in Mytilus. Occasionally, the M genome may be replaced by the F genome invading the paternal lineage in the process called masculinization, consequently lowering sequence divergence between paternally and maternally inherited genomes (Hoeh et al. 1997; Zouros 2000). The full sequences of two such genomes are available: one from M. galloprovincialis (Venetis et al. 2007) and one from west Atlantic M. trossulus (Breton et al. 2006), although the ‘recently masculinized’ (RM) status of both genomes has been questioned by Filipowicz et al. (2008) and Cao et al. (2009), respectively. The structure of all mt Mytilus genomes known to date is very similar, despite sequence differences. They have a unique gene order and composition, with one additional tRNA MET and missing atp8 gene (Hoffmann et al. 1992; Boore et al. 2004; Mizi et al. 2005; Breton et al. 2006; Zbawicka et al. 2007; Burzyński and Śmietanka 2009). Recently, it has been suggested that the missing atp8 gene may be present (Breton et al. 2010).

In contrast to the Baltic M. trossulus, the west Atlantic and Pacific populations of M. trossulus have their own, more divergent mtDNA. The sequences of the entire native M and F genomes of M. trossulus from the Pacific, most likely representing the true native haplotypes of this species, have not been reported so far; therefore all the comparative and functional analyses involving the mt genomes in M. trossulus (Everett et al. 2004; Breton et al. 2006; Jha et al. 2008; Breton et al. 2009) were incomplete. In this paper, we report the complete sequences of both F and M mitochondrial genomes of Pacific M. trossulus from Canada. This gives the first opportunity to analyse the tempo and mode of evolution in both gender-associated mitochondrial lineages of the Mytilus edulis species complex, providing insight into the operation of DUI in Mytilus in particular and the evolution of mtDNA in general.

Materials and Methods

Samples

Pacific M. trossulus mussels were collected in Howe Sound near Vancouver, Canada in 2006. The sample was classified as a pure M. trossulus based on nuclear DNA markers diagnostic for M. edulis, M. galloprovincialis and M. trossulus: Me15/16 (PCR length variant marker localised in adhesive protein gene) (Inoue et al. 1995) and EFbis (RFLP marker localised in an intron of elongation factor 1α gene) (Bierne et al. 2003; Kijewski et al. 2006). Mussels were sexed by microscopic examination of the mantle tissue. One male and one female were selected for further analyses.

PCR Amplification and Sequencing

DNA was extracted from the gonad tissue using a modified CTAB method according to Hoarau et al. (2002). mtDNA of both M and F genomes was sequenced in two steps, as described previously (Zbawicka et al. 2007). The first step was a long range polymerase chain reaction (LR-PCR), specific for the F and the M genomes. The nearly complete F genome was obtained using AB23 (Burzyński et al. 2006) and AB33 (Burzyński and Śmietanka 2009) universal primers. The M genome was amplified using the TRO1 (5′-GTGCAGCAATAAAACGAGGGTAA-3′) and TRO2 (5′-GCACACCACATTTTCATTAAATCTATTTA-3′) pair of specific primers designed de novo based on the GenBank records containing partial sequences of the M genome of M. trossulus: AY515231 and EU826078 (Cao et al. 2009). LR-PCR amplifications were carried out using Phusion™ High-Fidelity DNA polymerase (Finnzymes Oy) in the supplied buffer and under the reaction conditions suggested by Finnzymes. The products obtained were diluted 1:800 and used as a substrate in a set of re-amplifications with the LSS and RSS primers listed in Supplementary Table S1. Additional PCRs with primers covering the CR (control region; Supplementary Table S1: first 10 rows) were performed to fill the gap. PCR amplifications were carried out in 15 μl reaction volumes containing 20 ng of template DNA, 0.4 μM of each primer, 200 μM of each dNTP, 1.5 mM MgCl2, 0.5 U of high-fidelity DyNAzymeEXT2 DNA polymerase and the appropriate reaction buffer from Finnzymes. All PCR products were purified by alkaline phosphatase and exonuclease I treatment (Werle et al. 1994), then sequenced directly with the BigDye™ terminator cycle sequencing method. An ABI 3730 automatic sequencer was used to separate the reaction products. High quality sequence reads were obtained, each having at least 600 bp with a quality value above 20. To further safeguard against PCR artifacts the whole procedure, starting with LR-PCR, was repeated twice with identical results.
Newly determined sequence reads were assembled using Gap4 from the Staden Package version 1.7.0 (Staden et al. 2001). Overall more than 2× coverage was achieved with the consensus quality value not falling below 60 in coding regions. The expected number of falsely called bases was well below one per genome assuring that sequencing artifacts could not affect the results. Genome organization was analysed using several methods. Protein genes were recognized using a suite of algorithms for identifying likely protein-coding sequences: Coding Region Identification Tool Invoking Comparative Analysis (CRITICA) (Badger and Olsen 1999), wise2 (Birney et al. 2004) and Glimmer (Delcher et al. 1999). All tRNA genes were identified by covariance model analysis implemented in COVE (Eddy and Durbin 1994) and ARWEN (Laslett and Canbäck 2008). GenBank RefSeq database of complete metazoan mtDNA was used as the reference when needed and the recent version of blast (Altschul et al. 1997) was run locally. Amino acid sequences for protein-coding genes were obtained by conceptual translation using the genetic code of Drosophila mtDNA, following Hoffmann et al. (1992). The assembled and annotated complete mtDNAs were deposited in GenBank under accession numbers HM462080-1.

Data Sets for Comparative Analyses

In addition to the two newly sequenced genomes, all complete mt genome sequences of Mytilus taxa deposited in GenBank to date were used in comparative analyses: NC_006161 (F genome of M. edulis), FJ890849 and AY497292 (F genomes of M. galloprovincialis), DQ399833 (RM genome of M. galloprovincialis), DQ198231 (F genome of Baltic M. trossulus), AY823625 (RM genome of M. trossulus), FJ890850 (M genome of M. galloprovincialis), AY823623 and AY823624 (M genomes of M. edulis), DQ198225 (M genome of Baltic M. trossulus). In addition, the complete sequence of all protein coding genes (of the F genome) was reconstructed from M. californianus Expressed Sequence Tag (EST) sequences present in GenBank (e.g. ES388182, GE747024; J. Grimwood, personal communication). Out of the 42354 M. californianus ESTs currently available in GenBank, 489 contain high quality sequences of mitochondrial transcripts covering the whole F mt transcriptome. The M. californianus sequence deduced from transcriptome data was used as an outgroup in phylogenetic analyses but it was not included in certain comparative analyses since no M genome from this species is available and the M–F divergence in M. californianus is significantly greater than in the three species in focus of this study (Ort and Pogson 2007). Thus, the complete data set consisted of 13 genomes. One published M genome of M. galloprovincialis, AY363687 (Mizi et al. 2005), was not considered for whole genome analyses due to its known mosaic coding region structure (Burzyński and Śmietanka 2009). For the purpose of inter-species comparisons one representative F and one M genome from each species were chosen as follows: FJ890849 and FJ890850 for M. galloprovincialis, NC_006161 and AY823623 for M. edulis, and the pair of genomes presented here (HM462080 and HM462081) for M. trossulus, forming a small, 6 genome data set. This small data set was used to generate Tables 1 and 2 as well as all Figures except Figs. 3 and 7.
All phylogenetic analyses were based on protein coding genes extracted from the full 13 genome data set. All database sequences were re-annotated accordingly in order not to introduce spurious differences. The alignment was produced using translated sequences but the appropriate manual corrections in the nucleotide sequence of two F genomes were introduced first to compensate for the effects of frameshift errors discussed by Zbawicka et al. (2007). To obtain the reference evolutionary rate of nuclear genes the EST data sets available for M. galloprovincialis and M. californianus were used. Both sets were downloaded, assembled by CAP3 (Huang and Madan 1999) to remove redundancy, local blast databases were built from each assembly and the best reciprocal blast hits (RBH) were found by running tblastx reciprocally on each database. Genes were provisionally annotated by running blast (Altschul et al. 1997) searches against nr and several complete genome protein databases at NCBI website (http://blast.ncbi.nlm.nih.gov) and MytiBase (Venier et al. 2009) as well as wise2 (Birney et al. 2004) searches against Pfam (Finn et al. 2010). Several representative genes were selected from RBH following Oliveira et al. (2008), aligned and concatenated for the total alignment length of 23154 bp. The individual annotated sequences are available as supplementary data (Supplementary Table S2). An additional set of short sequences covering the region between nad3 and cox1 was generated by sequencing of approx 1.5 kb long fragment amplified with appropriate primers (RSS10, Supplementary Table S1). To uniformly cover the phylogenetic tree with this data the individuals were chosen for this based on their haplotype (Śmietanka et al. 2009) so that each major clade is represented. These new sequences were deposited in GenBank under accession numbers HM489865-74. 
Together with other Mytilus sequences covering this region and already present in GenBank, this additional atp8 data set consisted of 26 sequences clustered into six groups, according to their phylogenetic affinity: M. edulis F (HM489865, HM489866, HM489867, HM489868, NC_006161, DQ198231), M. galloprovincialis F (FJ890849, DQ399833, HM489869, HM489870, HM489871, HM489872, AY497292), M. trossulus F (GU936625, AY823625, HM462080), M. edulis M (AY823623, AY823624, DQ198225, HM489873, HM489874), M. galloprovincialis M (FJ890850, AY363687) and M. trossulus M (HM462081, GU936626, GU936627). For the purpose of this analysis Baltic M. trossulus sequences were grouped with M. edulis sequences.

Table 1 Intra- and interspecies comparisons for all mitochondrial genes
Table 2 Estimates of non-synonymous (K A) and synonymous (K S) substitutions for intra- and inter-lineage comparisons in Mytilus taxa

Bioinformatic Analysis

Sequences were aligned using ClustalW algorithm (Higgins and Sharp 1989) with MEGA4 (Tamura et al. 2007) under default parameters when needed, although in most cases no alignment was necessary since there were only few differences in gene lengths. Genetic distance (K) based on Kimura’s two-parameter model (Kimura 1980), divergences in synonymous (K S) and non-synonymous (K A) sites using modified Nei–Gojobori method (Nei and Gojobori 1986) with Jukes–Cantor (Jukes and Cantor 1969) correction and default parameter sets following Mizi et al. (2005), Zbawicka et al. (2007) and Burzyński and Śmietanka (2009) as well as the intra- and inter-species divergences for all mitochondrially encoded proteins (as p-distances, following Breton et al. 2006) were calculated in MEGA4. The methods implemented in MEGA require the user to supply parameters (e.g. transition/transversion ratio) which should be independently estimated but are usually left at their default values. This may seriously affect the estimated indices. To avoid this problem for most analyses involving K A/K S the KaKs Calculator was used which estimates all parameters from the data within maximum likelihood framework (Zhang et al. 2006). Uncorrected nucleotide distances were calculated in DnaSP (Rozas et al. 2003). The comparison of molecular evolutionary rates was done using Tajima’s Relative Rates Test (Tajima 1993) implemented in MEGA4, under default parameter set.
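
As an illustration of the distance measure named above, the following is a minimal Python sketch of the Kimura two-parameter distance, d = -0.5 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. It is not the MEGA4 implementation; the function name and the choice to skip gapped or ambiguous positions are assumptions made for this example.

```python
import math

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned nucleotide sequences.

    Illustrative sketch only: positions with gaps or ambiguity codes are
    skipped, which is an assumption of this example.
    """
    purines = {"A", "G"}
    pyrimidines = {"C", "T"}
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= purines or {a, b} <= pyrimidines:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

The corrected distance is always at least as large as the uncorrected proportion of differences, because the model accounts for multiple substitutions at the same site.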

At least three of the genomes from the analysed data set are known to possess mosaic CR sequences. Undetected coding region recombination could seriously affect the phylogenetic inference as well as other analyses. To check for recombination within the coding part of the sequences, the concatenated data set was analysed by a suite of recombination detection algorithms implemented in RDP software (Martin et al. 2005) as well as by the GARD tool (Kosakovsky Pond et al. 2006). No recombination was detected, therefore the whole data set was used.

A set of structure prediction methods was applied to the putative atp8 as well as the closest RefSeq atp8 sequences, as implemented in PredictProtein web server (Rost et al. 2004). The Kyte–Doolittle hydropathy plots (Kyte and Doolittle 1982) were obtained with pepwindow application from EMBOSS package (Rice et al. 2000).

Phylogenetic reconstructions were done using several methods. Both the relatively simple, distance-based Neighbor-Joining (NJ) method and the more sophisticated Maximum Likelihood (ML) method, with bootstrapping or Bayesian inference of credibility, as well as Maximum Parsimony (MP) methods were employed as implemented in PAUP* (Swofford 2003) and MrBayes (Ronquist and Huelsenbeck 2003). The selection of the best fit model of evolution was facilitated by Modeltest (Posada and Crandall 1998). All methods gave identical tree topology, with full (100% bootstrap and posterior probability = 1) support for all bipartitions. The best-fitted GTR+G+I model was applied and the resulting ML tree was obtained by the branch-and-bound algorithm. The same tree was used in all subsequent analyses. Standard measures were taken to ensure the validity of the results obtained: for example, MCMCMC chains in MrBayes were run long enough to achieve the stationary phase, judged by the inspection of likelihood trend plots and the use of four independent runs which always converged on the same solution. For this relatively small data set it meant 1–2 million generations, depending on the model. The effective sample size (ESS) of each parameter was always greater than 1000, after exclusion of the data from the non-stationary phase.

The estimates of evolutionary rates and divergence times were done in r8s (Sanderson 2003). The models implemented in this software can lead to degenerate solutions, necessitating the use of multiple starting points to ensure that all replicates provide the same results. Therefore, at least 10 runs were performed for each analysis, with the checkgradient option and crossvalidation procedure turned on. All reported results passed the tests. Two calibration sources were used to date the tree. First, after Rawson and Hilbish (1995) the M. trossulus–M. edulis/galloprovincialis divergence was constrained at 3.5 mya in both M and F lineages. Second, to overcome the risk that the recent dates estimated using an old calibration point may be seriously wrong due to the effect described by Ho et al. (2005), the three most recently diverged pairs of sequences were dated using the population scale evolutionary rates from Śmietanka et al. (2009). The relevant fragment comprising part of nad2 and cox3 was examined for that purpose. The per site substitution rate in fourfold degenerate sites within this fragment was estimated at approximately 1 per 10^6 years for the F genomes and 2 per 10^6 years for the M genomes (Śmietanka et al. 2009). The three sequence pairs used for dating, NC_006161 and DQ198231, AY823623 and AY823624, and AY823625 and HM462080, are separated by 1, 3 and 4 such substitutions, respectively. Since there are 134 such sites, the corresponding divergence times are approximately 3.5, 6 and 14 thousand years, respectively. These numbers were used to constrain these additional calibration points. With the expected complex pattern of rate changes across the tree the method of choice was penalized likelihood (PL) (Sanderson 2002) with the truncated Newton (TN) algorithm. The optimal smoothing parameter (=1000) was found by running crossvalidation multiple times with increasing smoothing parameter value (0–10000).
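
The arithmetic behind the three recent calibration points can be sketched as follows. This is a hedged reconstruction, not the authors' script: it assumes the quoted rates are 1 and 2 substitutions per site per 10^6 years per lineage, so that the pairwise distance accumulates at twice that rate (t = k / (2 n r), with k substitutions over n sites). The function name is hypothetical, and the assignment of the F rate to the AY823625/HM462080 pair is an assumption based on its F-like sequence.

```python
def divergence_time_years(n_substitutions, n_sites, rate_per_site_per_year):
    """Time since two sequences diverged, assuming a constant per-lineage rate."""
    pairwise_distance = n_substitutions / n_sites
    # Both lineages accumulate changes independently, hence the factor of 2.
    return pairwise_distance / (2 * rate_per_site_per_year)

N_SITES = 134      # fourfold degenerate sites in the nad2/cox3 fragment
F_RATE = 1e-6      # assumed reading: 1 substitution per site per 10^6 years
M_RATE = 2e-6      # assumed reading: 2 substitutions per site per 10^6 years

pairs = [
    ("NC_006161 / DQ198231", 1, F_RATE),
    ("AY823623 / AY823624", 3, M_RATE),
    ("AY823625 / HM462080", 4, F_RATE),   # F rate assumed for this F-like pair
]

estimates = {name: divergence_time_years(k, N_SITES, rate) for name, k, rate in pairs}
for name, years in estimates.items():
    print(f"{name}: about {years / 1000:.1f} thousand years")
```

Under these assumptions the estimates come out close to the rounded values of 3.5, 6 and 14 thousand years given in the text; the small differences would be consistent with rounding of the rates.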

The codeml program from PAML version 4.3 (Yang 1997) as well as HyPhy (Kosakovsky Pond et al. 2005) implemented on datamonkey webserver (Kosakovsky Pond and Frost 2005b) were used to search for positively selected sites in the whole data set. In PAML, neutral or nearly neutral models were compared by log likelihood ratio test at 95% confidence level with the corresponding models assuming the existence of an additional, positively selected class of sites: models M1a with M2a and M7 with M8 (Nielsen and Yang 1998; Yang and Nielsen 2000). Also the recommended test with branch-site model A was performed to find both the most likely branches in the phylogenetic tree along which the non-neutral evolution occurred and the sites involved (Yang et al. 2005). In HyPhy, the fixed effect likelihood (FEL), random effect likelihood (REL) as well as single likelihood ancestor counting (SLAC) were used (Kosakovsky Pond and Frost 2005a). To evaluate the effect of selection on local changes in protein properties the TreeSAAP software was used (Woolley et al. 2003). The MK tests of selective neutrality (McDonald and Kreitman 1991) were performed on the short 26 sequence data set, as implemented in DnaSP (Rozas et al. 2003).
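
The likelihood ratio comparisons of nested site models described above can be illustrated with a short Python sketch. It assumes the conventional choice of two degrees of freedom for both the M1a versus M2a and the M7 versus M8 comparisons; with two degrees of freedom the chi-square tail probability has the closed form exp(-x/2), so no statistics library is needed. The log-likelihood values in the usage lines are hypothetical placeholders, not results from this study.

```python
import math

def lrt_two_df(lnl_null, lnl_alt):
    """Likelihood ratio test of two nested models differing by 2 parameters.

    Returns the test statistic 2*(lnL_alt - lnL_null) and its p-value
    under a chi-square distribution with 2 degrees of freedom.
    """
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    p_value = math.exp(-statistic / 2.0)  # exact chi-square tail for 2 df
    return statistic, p_value

# Hypothetical log-likelihoods for a neutral model and a selection model.
stat, p = lrt_two_df(lnl_null=-20100.0, lnl_alt=-20095.0)
print(f"2*deltaLnL = {stat:.2f}, p = {p:.4f}")
print("selection model preferred at the 95% level" if p < 0.05 else "no significant improvement")
```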

Results

The complete sequences of both M and F mitochondrial genomes from Pacific M. trossulus were obtained and deposited in GenBank. The descriptive statistics and comparative data for the two genomes are given in Supplementary Table S3 and the genetic maps of both genomes are presented in Fig. 1. The F genome (18,628 bp) is substantially longer than the M genome (16,578 bp), mainly due to the differences in the CRs—1,914 bp longer in the F genome. This F genome is also much longer than the previously published F genomes of European Mytilus mussels: M. edulis (16,740 bp) (Boore et al. 2004) M. galloprovincialis (16,780 bp) (Mizi et al. 2005; Burzyński and Śmietanka 2009) and Baltic M. trossulus (Zbawicka et al. 2007) but it is similar (only 24 bp shorter) to the RM genome of the west Atlantic M. trossulus (Breton et al. 2006). This similarity is even more obvious when the overall pattern of divergences is considered: the divergences (uncorrected nucleotide distance) between the F genome of Pacific M. trossulus and the F genomes of M. edulis and M. galloprovincialis are quite high, at 0.177, while the divergence between the Pacific F and the west Atlantic RM genomes of M. trossulus is much lower, at 0.025. Contrary to that, the M genome of Pacific M. trossulus did not have a similarly close relative in GenBank. This genome is only slightly shorter than the M genomes of M. edulis (Breton et al. 2006), M. galloprovincialis (Burzyński and Śmietanka 2009) and Baltic M. trossulus (Zbawicka et al. 2007) and its distances from M genomes of M. edulis and M. galloprovincialis are at the level of 0.26. The overall structure of both M and F genomes of Pacific M. trossulus is the same as that of the published complete mt genomes of other Mytilus mussels (Hoffmann et al. 1992; Boore et al. 2004; Mizi et al. 2005; Breton et al. 2006; Zbawicka et al. 2007; Burzyński and Śmietanka 2009), except for the few important points detailed in the following paragraph.

Fig. 1

Gene maps of the mitochondrial F and M gender-specific genomes of Pacific M. trossulus. Remarkably, despite the high primary sequence divergence of 26%, the two genomes have a very similar pattern of compositional bias and an almost identical gene arrangement. Names of tRNA genes are indicated by the one-letter code of the amino acid that they specify. Protein coding genes as well as both small and large subunits of ribosomal RNA have standard abbreviations. Two inner rings show local compositional bias: the outer ring AT skew and the inner ring CG content. Both were calculated in a sliding window of 500 bp in 50 bp steps.
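
The sliding-window statistics named in the caption can be sketched in Python as below. This is an illustrative reconstruction under stated assumptions: AT skew is taken as (A - T)/(A + T), the content measure as the fraction of G and C in the window, and the windows are allowed to wrap around the end of the sequence because the molecule is circular. The function name is hypothetical and the software used to draw the figure is not specified in the text.

```python
def sliding_composition(genome, window=500, step=50):
    """Return (start, AT skew, GC fraction) for windows along a circular genome."""
    seq = genome.upper()
    # Append the start of the sequence so that windows can wrap around the origin.
    extended = seq + seq[:window - 1]
    results = []
    for start in range(0, len(seq), step):
        chunk = extended[start:start + window]
        a, t = chunk.count("A"), chunk.count("T")
        g, c = chunk.count("G"), chunk.count("C")
        at_skew = (a - t) / (a + t) if (a + t) else 0.0
        gc_fraction = (g + c) / len(chunk) if chunk else 0.0
        results.append((start, at_skew, gc_fraction))
    return results
```

Plotting the second and third elements of each tuple against the start coordinate gives the two tracks described for the inner rings.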

Reassignment of Noncoding Sequences

The CR of the F genome has a mosaic structure identical to the RM genome of west Atlantic M. trossulus (Breton et al. 2006), first described by Rawson (2005) and discussed in detail by Cao et al. (2009). In the central part of the CR, there is a sequence resembling the block of tRNAs usually found between two rRNA genes. However, these tRNA-like sequences are highly divergent (0.16–0.29) from their copies present in the coding part of the genome. The sequences were not recognized as tRNAs by COVE or ARWEN so they should be considered non-functional pseudogenes, with one exception: the tRNA GLN was recognized properly. Moreover, the copy of this gene present in the coding part was not recognized due to many illegitimate base pairings and an internal deletion of 3 bp (TTT) near the 3′ end. Cao et al. (2009) postulated that this complex CR structure resulted from recombination and duplication-deletion processes which have led to the translocation of tRNA GLN into the CR. Our data provide further support for this view since the translocated sequence has much lower divergence between the two related genomes (F genome of Pacific M. trossulus and RM genome of west Atlantic M. trossulus) than the copy present within the original tRNA gene cluster. Consequently, this pseudogene has not been annotated as tRNA. The remaining 22 tRNA genes as well as 12 protein coding and two rRNA genes were identified in expected locations. The remaining noncoding regions (‘unassigned regions’: UR1–UR5) (Hoffmann et al. 1992) are either not present or much shorter. The UR5 region is present but it is only 10 bp long in Pacific M. trossulus. The differences in the remaining URs are the direct consequence of the chosen annotation convention. The UR1 was assigned to the cob and the UR2 is now part of the nad1. Likewise, the position of the stop codon of the cox3 gene is shifted by 143 bp, making the UR3 part of this gene. This new annotation convention, first introduced by Zbawicka et al. (2007) in the context of Baltic M. trossulus, also fits the new Pacific M. trossulus data well. However, in addition to the 12 protein coding genes identified in Mytilus mtDNA by Hoffmann et al. (1992), one additional open reading frame (ORF) with a significant coding sequence signature was found in both M and F genomes of Pacific M. trossulus between nad3 and cox1, covering the region homologous to UR4 and most likely representing the missing atp8, consistent with the recent finding of Breton et al. (2010).

Support for atp8

The support for the coding nature of the new ORF came primarily from the CRITICA algorithm: the P-value for assigning a coding status to this ORF by chance alone was well below 10^-17 for each genome. The predicted length of the protein encoded by the new ORF is 84–87 amino acids for the F genome (depending on the chosen start codon) and 118–121 amino acids for the M genome. Notably, the same ORF is present in all available Mytilus mt genomes: it has the same length in all F genomes whereas in the M genomes of M. edulis and M. galloprovincialis it is 8 amino acids shorter than in Pacific M. trossulus (Supplementary Fig. 1). To check whether this is a true coding sequence the available Mytilus EST sequences (e.g. Tanguy et al. 2008; Venier et al. 2009; J. Grimwood, personal communication) were searched. Multiple highly significant hits were obtained from both M. galloprovincialis and M. californianus ESTs, confirming that this sequence is expressed, in contrast to other noncoding (CR) or tRNA sequences which are not present in ESTs. Interestingly, this sequence is usually contained within a single transcript with cox1, a feature noted already for M. californianus by Beagley et al. (1999), although no mention of atp8 has been made in this context. The obvious candidate gene for this ORF is atp8. However, the direct comparison of the predicted primary amino acid sequence with other known proteins did not reveal any significant similarities. Even when the database of mitochondrial proteins was used in a local blastp search only some marginally significant hits were recovered, and a few atp8 sequences were amongst them; most of the similarities found concerned the nad4 protein and were above the E value of 1.
Amongst the few scored atp8 proteins there were sequences from a cellular slime mold Polysphondylium pallidum (the best E value at 0.14), two insects Enithares tibialis (3.4) and Yemmalysus parallelus (5.8), a stony coral Madracis mirabilis (9.9) and a freshwater mussel Cristaria plicata (2.0). This very weak evidence encouraged further comparative analysis. Since the evidence given by Breton et al. (2010) came primarily from predicted secondary structure comparisons, an attempt was made to predict the secondary structure of this novel protein and compare it with the structure of atp8 from the closest relative of Mytilus revealed by blastp searches, C. plicata. The predicted secondary structures had several characteristic features in common. They all started with a signal peptide, followed by one hydrophobic transmembrane helix, and ended in a loose and variable hydrophilic domain. These similarities are illustrated by Kyte–Doolittle hydropathy plots (Fig. 2). Taking these arguments into account, this ORF was considered a putative atp8 gene and included in all comparative analyses as a protein coding sequence.

Fig. 2

Kyte–Doolittle hydropathy plots for four predicted atp8 proteins. The atp8 in C. plicata (an unioidean mussel) is the closest well annotated atp8 available in GenBank. The plots were obtained with the window of 9 amino acids, the value above 1.9 is considered indicative of a transmembrane domain—indeed the presence of such domain in the N terminal part of predicted proteins was confirmed by other methods. Top: C. plicata RefSeq record and M. edulis from the small 6 genome data set. Bottom: both Pacific M. trossulus sequences obtained in this study
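
The hydropathy profile described in the caption can be approximated with the following Python sketch, which averages the published Kyte–Doolittle scale values over a window of 9 residues and flags windows above the 1.9 threshold quoted above. It is a simplified stand-in for the EMBOSS pepwindow plot rather than a reimplementation of it; the function names are hypothetical, and residues outside the 20 standard amino acids would raise a KeyError.

```python
# Kyte-Doolittle hydropathy scale (Kyte and Doolittle 1982).
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=9):
    """Mean Kyte-Doolittle hydropathy for each full window along the protein."""
    values = [KD_SCALE[aa] for aa in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def has_candidate_tm_segment(protein, window=9, threshold=1.9):
    """True if any window average exceeds the threshold used in the caption."""
    return any(v > threshold for v in hydropathy_profile(protein, window))
```

A predicted protein with a hydrophobic N-terminal stretch would return True here, which corresponds to the peak above 1.9 visible in the plots.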

Reconstruction and Dating of Phylogeny

To put the differences between Mytilus mitochondrial genomes in evolutionary perspective their phylogeny has been reconstructed. To be able to use M. californianus as an outgroup and to avoid problems with mosaic CR sequences the reconstruction was based on the concatenated set of protein coding sequences. The resulting graph is presented in Fig. 3 (left panel). The homogeneity of evolutionary rates across this tree was evaluated by a series of Tajima’s relative rate tests involving every pair of sequences, always using the closest possible outgroup. All the intra lineage rates appeared to be homogeneous, including both sequences reported as RM but grouping with the rest of F-like sequences (P > 0.05), whereas all inter lineage comparisons revealed a significant rate heterogeneity (P < 0.0001). In an attempt to date the tree the optimal rates have been fitted by r8s and the chronogram was generated (Fig. 3, right panel). The age of the MRCA for M and F lineages has been estimated at 4.67 mya with 95% confidence interval (CI) of 4.4–5.0. The age of the MRCA for M. edulis and M. galloprovincialis was estimated at 0.88 mya (CI 0.73–1.02) in the M lineage and 0.35 mya (CI 0.27–0.42) in the F lineage.
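
For readers unfamiliar with the relative rate test used above, the following Python sketch shows the core of Tajima's test for three aligned sequences. It counts sites where only the first ingroup sequence differs (m1) and sites where only the second differs (m2), and evaluates (m1 - m2)^2 / (m1 + m2) against a chi-square distribution with one degree of freedom. This is an illustration, not the MEGA4 code; the function name and the absence of gap handling are simplifications of this example.

```python
import math

def tajima_relative_rate(seq_a, seq_b, outgroup):
    """Tajima's relative rate test for two ingroup sequences and an outgroup.

    Returns (m1, m2, chi-square statistic, p-value with 1 degree of freedom).
    """
    m1 = m2 = 0
    for a, b, o in zip(seq_a, seq_b, outgroup):
        if a != b:
            if b == o:
                m1 += 1   # change unique to sequence A
            elif a == o:
                m2 += 1   # change unique to sequence B
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Chi-square tail probability for 1 df expressed through erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return m1, m2, chi2, p_value
```

Equal rates in the two ingroup lineages are expected to give m1 close to m2; a small p-value indicates rate heterogeneity of the kind reported for the inter lineage comparisons.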

Fig. 3

The phylogenetic relationships between 13 compared genomes based on concatenated protein coding sequences (alignment length 11895 bp). Left: Maximum Likelihood unrooted phylogeny. The topology has 100% bootstrap support and 1.0 posterior probability of bipartition for each node. Right: The best fitted reversed chronogram, rooted by the removed M. californianus outgroup. The chronogram scale is in million years. Grey bars at the chronogram nodes show confidence intervals. The sequences were from: a the complete set of mitochondrial protein coding genes extracted from database records; b genomes from Baltic mussel population, formally referred to as M. trossulus but having M. edulis mtDNA accession numbers: DQ198231 (F) and DQ198225 (M) (Zbawicka et al. 2007); c F genome of Atlantic M. galloprovincialis, accession number AY497292 (Mizi et al. 2005); d the RM genome from Black Sea M. galloprovincialis, described by Venetis et al. (2007), accession number DQ399833; e Mediterranean M. galloprovincialis genomes, accession numbers FJ890850 (M) and FJ890849 (F) (Burzyński and Śmietanka 2009); f from Breton et al. (2006): the RM genome of west Atlantic M. trossulus, accession number AY823625 and two M genomes of west Atlantic M. edulis: AY823623 and AY823624; r RefSeq sequence of M. edulis, NC_006161. Sequences in bold were obtained during this study from Pacific M. trossulus and they represent the native genomes of this species: accession numbers HM462080 (F) and HM462081 (M)

Evolution of Coding Sequences

In addition to the difference in the putative atp8, there are only four cases of differences in protein coding gene lengths in the whole data set: one-codon differences at the beginning of cob and at the end of nad3, and marked differences at the beginnings of nad1 and cox1. The cox1 case is actually the result of the same putative insertion at the atp8–cox1 gene boundary and shares the same phylogenetic context. All other differences are limited to one or a few genomes and are not consistent along the M or F lineages. The general intra- and inter-specific comparison of coding sequences is presented in Table 1. As expected, the lowest divergence was observed between the F genomes of European M. edulis and M. galloprovincialis, while the highest divergences were observed between the M genomes of M. trossulus and its congeners. A crude comparison of the number of nucleotide substitutions in the M and F lineages suggests that, on average, the RNA genes evolve approximately twice as fast in the M lineage as in the F lineage, whereas protein coding genes evolve three times faster, with marked differences between respiratory complexes: from approximately 2.6× for complexes I and V to 8.5× for complex III. Still, intra-species M–F divergences were comparable, suggesting that similar evolutionary forces shape these divergences in all three species. This was further confirmed by investigating the M–F divergence for each protein coding gene separately. The estimates of amino acid divergence at the 13 protein coding genes between the F and M genomes are presented in Fig. 4. They show marked differences between proteins but remarkably similar levels of divergence in all three species, suggesting that the similar comparisons presented by Breton et al. (2006), which showed very elevated and erratic differences, particularly at nad6, may have been affected by unaccounted frame-shift errors. 
By far the highest differences were consistently observed in the putative atp8, followed by the components of respiratory complex I. The most conserved gene was cox1, a common feature of animal mtDNA (Pesole et al. 1999; Saccone et al. 1999). The paradox of reversed nucleotide and protein divergences in M–F comparisons (the nucleotide substitution pattern suggested relatively slower evolution of complex V) is understandable, since atp8 is much shorter than atp6 and must have had a smaller effect on the average. It also shows that each gene may be constrained differently, even within the same respiratory complex. If the evolution of each mitochondrial protein coding gene is determined by these gene-specific constraints, then the average divergence of a gene in the M lineage should be correlated with its divergence in the F lineage. To investigate this effect, the p-distances for each protein in the M lineage, averaged over all three inter-species comparisons, were plotted against the distances in the F lineage (Fig. 5). The correlation was strong, with a coefficient of determination (R²) of 0.77. However, when the highly diverged atp8 was excluded, the coefficient of determination dropped dramatically to 0.47. Therefore, the observed pattern cannot be fully explained by gene-specific constraints, leaving plenty of room for potential independent evolution in each lineage. To compare the overall selective pressures acting at the intra-species and intra-lineage levels, the pairwise K_A and K_S as well as the K_A/K_S ratios were calculated (Table 2). In all comparisons the diversities at nonsynonymous sites were much lower than at synonymous sites, resulting in low K_A/K_S values, a sign of strong purifying selection acting in both the M and F lineages. Consistent with all literature published to date on the subject (first observed by Stewart et al. 1995), in the case of the M genomes this pressure is slightly relaxed, as shown by higher K_A/K_S values. 
However, even this ‘relaxed’ selection is in fact very strong, as shown by the comparison of mitochondrial and nuclear genes of M. californianus and M. galloprovincialis. Since there were marked differences in protein divergences (Fig. 4), it was legitimate to ask whether they resulted from different selective pressures, manifested as differences in the K_A/K_S ratios between protein coding genes, or whether there is also some positional effect (Rodakis et al. 2007) acting through differences in mutation rates (Fig. 6). Figs. 4 and 6 show the same pattern, so it can be concluded that the positional effect is of minor importance in shaping protein evolution in this case. The only gene experiencing a really different selective regime is the putative atp8. Still, its K_A/K_S ratio is well below 1 (P < 10⁻⁸), providing further evidence that this is in fact a protein coding gene; for a non-protein-coding sequence this ratio would not be significantly different from 1. A more exhaustive search for the signature of positive selection was undertaken using methods implemented in PAML and HyPhy. The FEL procedure confirmed the overwhelming dominance of negatively selected sites in the whole data set (Fig. 7). Despite that, several sites were fitted with positive dN–dS values, notably several of them within the putative atp8. Moreover, the models assuming the presence of positively selected sites fit the data significantly better than models assuming nearly neutral evolution (i.e. Model 8 vs. Model 7, P < 0.05). An overview of the support given is presented in Supplementary Table S4. Interestingly, TreeSAAP identified the same region of the alignment as bearing strong (class 7) and significant (P < 0.01) changes in one of the important amino acid properties, polar requirement. An attempt was made to identify the branches of the tree along which most of the positively selected changes occurred. 
The approach using branch-site model A in PAML favoured the model with positive selection in atp8 on the branch leading to the MRCA of all M genomes. No other gene gave a significantly better fit of model A over its null model, even though all other genes favoured the tree with more relaxed selection in the whole M lineage. The GA Branch procedure from HyPhy did not reveal any significant signals of positive selection but indicated quite strong heterogeneity in selective pressures along branches, generally also in agreement with relaxed selection within the M lineage and accelerated evolution at short terminal branches (data not shown). To further investigate the possibility that adaptive evolution may act on atp8, the extended set of 26 Mytilus atp8 sequences was subjected to the MK test of neutrality (Table 3). A strong, significant excess of nonsynonymous fixed differences was found in the comparison of M. trossulus with its congeners within the M lineage. The results of all other comparisons were non-significant; two of them are shown as examples. Positive results of MK tests could in theory be obtained if the compared species or groups were divergent enough for the number of silent substitutions to be underestimated owing to the saturation effect. Saturation may be relevant for our data since several K_S estimates (Table 2) were above 1. However, it should not have affected the intra-lineage M comparison more than the inter-lineage comparison, since both the divergences and the K_S estimates are higher for the inter-lineage comparison. Moreover, the opposite effect is usually observed in animal mtDNA: an excess of nonsynonymous polymorphisms, consistent with purifying selection acting slowly on slightly deleterious mutations (Ballard and Kreitman 1995). Therefore, we conclude that the results of the MK test are consistent with an event of directional selection acting on atp8 during the evolution of the M lineage.
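The logic of the MK test used above can be sketched as a 2 × 2 contingency test on the counts of fixed and polymorphic substitutions at nonsynonymous and synonymous sites. The Python sketch below uses a two-sided Fisher's exact test and also reports the neutrality index; the function names and the example counts are hypothetical and are not taken from Table 3, and the choice of Fisher's exact test (rather than, for example, a G-test) is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables at least as unlikely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def mk_test(dn, ds, pn, ps):
    """McDonald-Kreitman test.

    dn, ds: fixed nonsynonymous / synonymous differences;
    pn, ps: polymorphic nonsynonymous / synonymous sites.
    Returns (p_value, neutrality_index); NI < 1 indicates an excess of
    fixed nonsynonymous differences, NI > 1 an excess of nonsynonymous
    polymorphism. NI is None when it is undefined.
    """
    p_value = fisher_exact_two_sided(dn, ds, pn, ps)
    ni = (pn * ds) / (ps * dn) if ps * dn > 0 else None
    return p_value, ni


# Hypothetical counts, for illustration only.
p_value, ni = mk_test(dn=7, ds=2, pn=1, ps=8)
print(f"P = {p_value:.4f}, NI = {ni:.3f}")
```

With counts of this shape (many fixed nonsynonymous differences, few nonsynonymous polymorphisms) the neutrality index falls below 1, which is the direction of departure reported here for the M lineage atp8 comparison.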

Fig. 4

Amino acid divergence (Dayhoff PAM distance) between the M and F lineages for three Mytilus taxa. The 6 genome data set was used to obtain this plot (same as Tables 1 and 2)

Fig. 5

The correlation of divergences (amino acid p-distances) within the M lineage with the divergence within the F lineage for all protein coding genes in all three species. The same data set was used to obtain this plot as in Fig. 4
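The sensitivity of such a correlation to a single highly diverged gene can be sketched as follows. The Python example computes the coefficient of determination with and without one outlying gene; the per-gene distances are invented for illustration and are not the values plotted in Fig. 5, and the function name is likewise hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("zero variance: correlation undefined")
    return sxy ** 2 / (sxx * syy)


# Hypothetical (F lineage, M lineage) amino acid p-distances per gene.
distances = {
    "cox1": (0.02, 0.05),
    "cox3": (0.03, 0.04),
    "nad1": (0.05, 0.09),
    "cob":  (0.04, 0.10),
    "nad4": (0.06, 0.07),
    "atp8": (0.30, 0.45),   # one highly diverged gene
}


def r2_for(genes):
    f_vals = [distances[g][0] for g in genes]
    m_vals = [distances[g][1] for g in genes]
    return r_squared(f_vals, m_vals)


r2_all = r2_for(list(distances))
r2_without_atp8 = r2_for([g for g in distances if g != "atp8"])
print(f"R2 with atp8: {r2_all:.2f}; without atp8: {r2_without_atp8:.2f}")
```

A single point far from the rest dominates both variances and the covariance, so it can produce a high R² even when the remaining genes are only weakly correlated.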

Fig. 6

Comparison of selective pressures based on pairwise M–F sequence divergence for each mitochondrial gene. Genes are ordered according to their position in the genome. (Triangles) M. trossulus, (squares) M. edulis, (diamonds) M. galloprovincialis. The same data set was used to obtain this plot as in Fig. 4

Fig. 7

Concatenated protein coding gene alignment was analysed by FEL procedure in HyPhy in the context of resolved phylogeny (Fig. 3). The dN–dS plot capped at −100 was obtained: the values above zero indicate positively selected candidate sites. The gene boundaries are shown for reference at the top. The whole 13 genome data set was used in this analysis

Table 3 The MK test of neutrality for atp8 gene: the number of fixed (F) and polymorphic (P) substitutions at synonymous and nonsynonymous sites

Discussion

The pattern of divergences proves that the pair of genomes described here represents the native mitochondrial haplotypes of M. trossulus, most likely representative of all populations of this species except the Baltic one. This allows precise dating and detailed analysis of the evolutionary forces acting on mtDNA within this complex of three DUI species.

Dating of Phylogeny and Tempo of Evolution

The first attempt to date the M–F divergence was made by Rawson and Hilbish (1995). A relatively short fragment (455 bp) of rrnaL was used, with the Perna canaliculata sequence as an outgroup. Using the same major calibration point as in this study, they estimated the age of the MRCA of the F and M lineages at 5.3–5.7 mya. Our estimate (4.4–5.0 mya) differs by at least 300,000 years but is based on a much more comprehensive data set and a much closer outgroup, so it should be more accurate. Yet the dating of the M. edulis–M. galloprovincialis split seems to be problematic, as the M and F dates do not match. Apparently the split within the M lineage is much older (Fig. 3), even when the differences in evolutionary rates are accounted for. The three different F genomes derived from M. galloprovincialis are good representatives of the overall F genome diversity in European Mytilus: they represent all three haplogroups recognized by Śmietanka et al. (2009), and it is very unlikely that more diverged F haplotypes exist. Hence the M–F divergence mismatch is probably caused by repeated introgression of M. edulis F mtDNA into M. galloprovincialis and periodic elimination of the native F genomes from the lineage leading to present-day M. galloprovincialis. Apparently this applies also to the most divergent F genomes: there are no true native M. galloprovincialis F genomes in Europe anymore.

It has been postulated that Mytilus mtDNA evolves at an unusually high rate compared to that of human or fruit fly (more than two times faster), at least in a very long evolutionary perspective, and that this is due to relaxed selection (Hoeh et al. 1996). This conclusion was based on a limited data set; the analysed alignment consisted of two concatenated gene fragments, 660 bp from cox1 and 762 bp from cox3, covering 12 taxonomically divergent mitochondrial genomes. Today, with far more comprehensive data, it can be verified by comparing the divergences of mitochondrial sequences between species pairs separated by the same amount of time, or by comparing the divergence times of species separated by the same genetic distance. The M. californianus–M. galloprovincialis separation is dated at about 7.6 mya (Ort and Pogson 2007), which is comparable to the human–chimpanzee divergence, usually assumed to have happened 6 mya (for review see Endicott et al. 2009). Yet the mitochondrial divergence between the two latter species (calculated with the same methodology) is much smaller (K_A = 0.025, K_S = 0.43) than in Mytilus (Table 2, K_A = 0.070, K_S = 2.371). This appears to be concordant with the original hypothesis, but when the generation time (at least an order of magnitude longer in humans than in Mytilus) is taken into account, the slower evolution in the human lineage is no longer unexpected. The comparison with Drosophila should perhaps be less affected by differences in generation time; if anything, it should be biased towards faster evolution in the fruit fly and not in Mytilus. It is therefore surprising to note that a level of mitochondrial protein evolution comparable to that of the M. californianus–M. galloprovincialis species pair (~7%) is achieved only by the most divergent species from the Drosophila/Sophophora group, which diverged approximately 40 mya (Montooth et al. 2009). 
This indicates more than five times faster evolution of mitochondrial protein coding genes in Mytilus F genomes than in Drosophila. With the growing genomic data it is also becoming possible to compare mitochondrial and nuclear evolutionary rates. In that regard, the best example for comparative analysis is the case of rapidly evolving mtDNA in the parasitic wasp Nasonia (Oliveira et al. 2008). The relative mitochondrial to nuclear synonymous substitution ratio in those wasps was estimated at more than 30. In the Mytilus case this ratio slightly exceeds 10 for the F genome (Table 2, M. californianus–M. galloprovincialis comparison) but should be even higher for the M genome (Table 2; K_S values for M genomes always exceed K_S values for F genomes in the same species comparisons). Even though this is based (in both the Nasonia and the Mytilus case) on an arbitrarily selected and relatively small set of nuclear genes and may therefore be imprecise, it indicates that Mytilus mitochondrial genomes are indeed fast evolving, also in comparison with nuclear genes.
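The roughly fivefold rate difference between Mytilus F genomes and Drosophila quoted above follows from simple arithmetic on the figures given in the text: a similar protein divergence (~7%) reached in about 7.6 million years in Mytilus and in about 40 million years in Drosophila. The short Python sketch below makes this explicit; the function name is hypothetical, and the calculation assumes that divergence accumulates linearly with time, which ignores saturation.

```python
def relative_rate(div_a, time_a_myr, div_b, time_b_myr):
    """Ratio of divergence per million years in pair A versus pair B.

    Assumes divergence grows linearly with time (no saturation
    correction), so it is only a rough comparison.
    """
    if time_a_myr <= 0 or time_b_myr <= 0 or div_b <= 0:
        raise ValueError("times and reference divergence must be positive")
    return (div_a / time_a_myr) / (div_b / time_b_myr)


# Values quoted in the text: ~7% protein divergence in both pairs,
# 7.6 My (M. californianus vs. M. galloprovincialis) and ~40 My
# (most divergent Drosophila/Sophophora species).
ratio = relative_rate(0.07, 7.6, 0.07, 40.0)
print(f"Mytilus F / Drosophila rate ratio: {ratio:.2f}")
```

Because the two divergences are equal, the ratio reduces to the ratio of divergence times, 40 / 7.6, which is slightly above 5.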

The Identity of Masculinized Genomes

A detailed discussion of the CR structure in all three Mytilus taxa was given by Cao et al. (2009), with the conclusion that the genome described by Breton et al. (2006) as RM is in fact the maternally inherited F genome of M. trossulus. This is essentially confirmed by our data. However, we cannot agree that both genomes with a mosaic CR reported so far (the RM in M. galloprovincialis and the RM or F in M. trossulus) have a similar age (Cao et al. 2009). This conclusion was drawn from the comparison of rrnaL sequences, whose divergences (from the respective M genomes) were the same (approx. 12%). But these divergences are expected to be the same, as they reflect the separation of the M and F lineages and are typical for all M–F comparisons of rRNA genes (Table 1). Based on our dating, the MRCA of the M. galloprovincialis RM genome and the F genomes without a mosaic CR structure existed no earlier than 0.42 mya, more than an order of magnitude later than the MRCA of the M and F lineages. Therefore, the conclusion that the M-like parts of the CR in the F genome of M. trossulus are experiencing different constraints than those in the CR of the RM genome of M. galloprovincialis, resulting in marked differences in the divergences, is premature. An alternative interpretation, assuming that the rearranged RM or F genome of M. trossulus is simply older than the RM genome of M. galloprovincialis, should be considered. Two arguments seem to support this view. First, the comparison of the CR structure: in the RM M. galloprovincialis genome it has a relatively simple structure with an array of perfect tandem repeats, whereas the CR of the M. trossulus F genome is very complex, with barely recognizable traces of past duplications and numerous deletions of various sizes. More rearrangements usually require more time. Second, the population distribution of both genomes: the F genome of M. trossulus seems to be fixed in all studied populations having native M. trossulus mtDNA (Rawson 2005; Rawson and Harper 2009), whereas the RM genome of M. galloprovincialis is found in both paternal and maternal lineages, competing with the native M and F genomes of M. galloprovincialis in the Black Sea (Filipowicz et al. 2008). This pattern is also consistent with the RM genome of M. galloprovincialis being younger. Neither of the two genomes showed an accelerated evolutionary rate, as would be expected for a masculinized genome. Therefore, the overall pattern of polymorphisms suggests that even though the RM genomes described by Breton et al. (2006) and Venetis et al. (2007) may be paternally inherited (i.e. found in sperm), their masculinization would be very recent indeed, and consequently the differences and similarities of their coding sequences would be determined primarily by their long evolution within the F lineage rather than by the short paternal episode. Consistent with that, we have found no amino acid positions discriminating all the M and either of the RM genomes from the F genomes.

There have been several attempts to compare the fitness of masculinized and native M genomes. However, the fact that the sequence of the native M genome of M. trossulus was not known could have led to misinterpretations. So-called ‘diagnostic’ primers for the RM as well as the M and F genomes were used (Everett et al. 2004; Jha et al. 2008; Breton et al. 2009). Unfortunately, none of those primers could recognize the native M genome of M. trossulus, and the RM-specific primers could not possibly differentiate the RM from the native F genomes. This could easily lead to situations where the M genome was present but went undetected and the contaminating F genome was interpreted as masculinized. Similar problems with the specificity of primers in M. edulis–M. galloprovincialis crosses were reported recently (Venetis et al. 2006). The interpretation of the physiological differences noted by Breton et al. (2009) and Jha et al. (2008) in the context of the assumed masculinization may therefore not be appropriate, and cytonuclear incompatibility could be a better explanation.

Evolution of Protein-Coding Genes

What evolutionary forces shape the evolution of the M and F lineages: is it only the relaxation of selective constraints within the M lineage (Stewart et al. 1996), or is there perhaps room for adaptive evolution? The perfect fit of intra- and inter-species divergences in protein sequences presented by Breton et al. (2006) (R² = 0.96) seemed to leave very little room for adaptive evolution. However, the most likely cause of this high correlation is the use of inherently autocorrelated statistics: the intra- and inter-species divergence estimates were averaged over two species. Effectively, the same M–F divergences were included in both statistics, leading to the very high correlation. In contrast, the correlation of diversity indices across protein coding genes between the M and F lineages, averaged over three species, which is not autocorrelated, is much weaker or non-existent (R² = 0.47 for the same set of genes), leaving plenty of room for factors other than strong purifying selection to operate. It seems unlikely that relaxed selection alone could explain the accelerated evolution, as all the K_A/K_S ratios are comparable to those found in other animal mitochondrial genomes, including human and fruit fly. Moreover, even the M lineage K_A/K_S ratio is apparently lower than the nuclear K_A/K_S ratio, indicating strong purifying selection. This paradox can be explained by assuming that periodic selective sweeps occur in Mytilus mtDNA, consistent with the explanation considered for M. californianus (Ort and Pogson 2007) and M. edulis/M. galloprovincialis (Śmietanka et al. 2009) population data. The best candidate gene to have experienced an episode of positive selection is the putative atp8. A similar explanation was proposed in the case of the rapidly evolving wasps: it was postulated that fixation of an advantageous mutation in atp8 started the Compensation-Draft Feedback (CDF) process, resulting in accelerated evolution of the mitochondrially encoded proteome (Oliveira et al. 2008).

The CDF model is based on the genetic draft concept of Gillespie (2000), under which slightly deleterious mutations spread in large populations in a drift-like manner owing to linkage with positively selected markers. Since all mitochondrial genes are linked, this model would be particularly relevant for mitochondrial genomes. The CDF further utilizes the co-evolutionary paradigm outlined by Rand et al. (2004), which states that mildly deleterious substitutions in mitochondrial proteins will result in selection for compensatory mutations in both mitochondrial and nuclear genomes, in all genes encoding proteins that interact with the ones adversely affected. To complete the CDF model, Oliveira et al. (2008) noted that these compensatory mutations will carry additional hitchhiking deleterious mutations, creating positive feedback. Therefore the process, once started by a sweep, will continue to operate on its own, resulting in a cascade of adaptive and non-adaptive amino acid substitutions. A high mitochondrial mutation rate will favour CDF, so this model may be relevant to Mytilus. Moreover, in DUI animals three genomes rather than two are co-evolving, introducing additional feedback loops into the CDF model: a substitution in an M genome encoded protein can cause a compensatory sweep in interacting nuclear genes and then, in turn, change the selective pressure on the F genome.

Since the evidence for positive selection we present is rather weak, alternative possibilities need to be considered. In principle, the sweeps starting the CDF process could be caused by classic genetic drift, the fixation of slightly deleterious substitutions by chance alone. However, small effective population sizes are needed for drift-like effects to operate, whereas marine bivalves are known to show some of the highest levels of genetic diversity ever observed in the animal kingdom (Bazin et al. 2006), consistent with very large effective population sizes. It is still possible that, during the time frame considered, there were demographic events in at least some Mytilus taxa extreme enough for this explanation to be plausible. This remains to be established and will require more population data on both mitochondrial and nuclear diversity. The complex patterns of nuclear and mitochondrial introgression occurring between all three species of the complex could also have participated in the CDF through cytonuclear incompatibilities. The complete loss of the native F genome of M. galloprovincialis and of both native genomes of Baltic M. trossulus supports the view that introgression-related sweeps did occur. However, in the case of the M lineage, and particularly in the context of the native M. trossulus genomes, no introgression has been reported, yet this clade shows the strongest signals of positive selection.

With the overall strong signal of purifying selection, weak or intermittent episodes of diversifying or directional selection may be difficult or even impossible to detect. Therefore we feel confident that the weak signal we do observe is convincing enough to postulate the involvement of adaptive changes in atp8 in fuelling the CDF process. Indeed, it can be parsimoniously assumed that an insertion at the atp8–cox1 boundary happened just after the emergence of the MRCA of the M lineage, around 4 mya. This sequence acquired a new M-specific function which could then be selected for, starting the CDF process. There are certain similarities between this and the M-specific cox2 gene extension found in another DUI group, the Unionid mussels, which also seem to bear weak signs of positive selection (Chapman et al. 2008). This illustrates that when mitochondrial gene extensions emerge in DUI animals, they can create room for adaptive changes. Owing to relatively relaxed selection, these extensions can exist long enough to acquire new, potentially gender-specific functions which can then be selected for. Thus, even if DUI did not emerge as an adaptive process, it can eventually operate as such.