Introduction

Despite their late appearance in the fossil record (e.g., Kooistra and Medlin 1996; Sorhannus 1997), diatoms form a species-rich group of algae which plays key roles in the global silica/carbon cycling (e.g., Mann 1999) and food chains of aquatic ecosystems. The diatom cell is surrounded by a silica shell that is structured like a “lid” and a “box.” The frustule (i.e., the silica shell) displays extensive shape variation and typically has a large number of tiny intricately shaped depressions, pores, spines, and passageways. In the traditional diatom classification, which was inferred from morphological features of the siliceous cell wall and the reproductive mode (e.g., Simonsen 1972, 1979; Round et al. 1990), two major subgroups were recognized (Simonsen 1979): Centrales (“centric diatoms”) and Pennales (“pennate diatoms”). The centrics typically exhibit radial or bi(multi)polar symmetry, whereas the pennates normally show bilateral symmetry with or without a central groove (i.e., raphe). The former subgroup was further split into three taxa, while the latter was divided into two groups based on the presence/absence of the raphe (Simonsen 1979). Modern classification principles do not recognize the centric (including some of the centric subgroups) and araphid pennate diatoms as valid taxa because of their paraphyletic status (e.g., Sorhannus 2004). Molecular phylogenetic studies (e.g., Sorhannus et al. 1995; Ehara et al. 2000; Fox and Sorhannus 2003; Medlin and Kaczmarska 2004; Sorhannus 2004) have indicated, by and large, the same relationships among the major groups as those proposed by Simonsen (1979) based on nonmolecular data. In the optimal molecular tree(s), the radial centrics (except Thalassiosirales) constituted the earliest branch(es) after which the bi(multi)polar centrics/Thalassiosirales forms appeared. 
A still undetermined bi(multi)polar centric/Thalassiosirales lineage shares a common ancestor with the pennates, which differentiated into the earlier “araphids” and the subsequent “raphids.” However, one major difference between the nonmolecular and the molecular trees is the position of Thalassiosirales. Simonsen (1979) placed Thalassiosirales lineages among the early radial centrics, while the majority of molecular phylogenetic studies (e.g., Ehara et al. 2000; Fox and Sorhannus 2003; Kooistra et al. 2003; Medlin and Kaczmarska 2004; Sorhannus 2004) indicated that it emerged among the bi(multi)polar centrics. Thalassiosira, which mainly consists of marine species, is one of the most diverse genera within Thalassiosirales (e.g., Kaczmarska et al. 2006). The species included in this study (e.g., Round et al. 1990), Thalassiosira weissflogii (Grun.) Fryxell et Hasle, Thalassiosira oceanica (Hustedt) Hasle et Heimdal, Thalassiosira guillardii Hasle, and Thalassiosira pseudonana (Hustedt) Hasle et Heimdal, are planktonic marine forms which constitute a monophyletic group, based on an extensive molecular phylogenetic study (Sorhannus 2004). Within the clade, T. weissflogii and T. oceanica shared the most recent common ancestor, followed by T. guillardii and T. pseudonana (Sorhannus 2004).

In many diatom species, including the Thalassiosira species analyzed here, successive generations of asexual reproduction result in a decrease in average cell size. Cell size is often restored through sexual reproduction, which involves the release of gametes into a surrounding body of water followed by sperm-egg fusion. Pheromone-like compounds, produced by the female gametes, are thought to play a role in attracting sperm cells (Drebes 1996). For species-specific gamete fusion to occur, particular recognition sites in the reproductive cells of both mating types are required. Three sexually induced genes (Sig1, Sig2, and Sig3), hypothesized to play a role in sperm-egg adhesion, have been sequenced in T. weissflogii (Armbrust 1999; Armbrust and Galindo 2001). The three polypeptides encoded by the genes have several characteristics in common: (1) three or more cysteine-rich epidermal growth factor (EGF)-like repeats; (2) five highly conserved regions (I–V) which show a high degree of similarity to the EGF-containing domain of the vertebrate extracellular matrix glycoprotein tenascin X (which promotes cell-to-cell interaction); (3) a signal sequence; (4) the same sequential order of the conserved domains, separated by a variable number of nonconserved amino acid sites; and (5) a lack of transmembrane regions (Armbrust 1999). In particular, similarity between the EGF-containing domain of the Sig polypeptides and the extracellular matrix glycoprotein tenascin X of vertebrates suggests that Sig proteins play a role in mediating sperm-egg recognition in Thalassiosira species during the sexual reproduction phase (Armbrust 1999; Armbrust and Galindo 2001). Sig1, thought to be represented by at least 10 unique copies in the genome (Armbrust and Galindo 2001), is characterized by five functional domains, inferred to be active in cell adhesion (Armbrust 1999).

Genes that encode proteins involved in gamete adhesion/fusion have typically exhibited increased rates of evolution both within and between species (e.g., Wyckoff et al. 2000; Vacquier 1998; Swanson et al. 2001; Armbrust and Galindo 2001; Swanson and Vacquier 2002). Accelerated diversification of reproductive proteins has been associated with site-specific positive selection (e.g., Swanson and Vacquier 1995; Tsaur and Wu 1997; Wyckoff et al. 2000; Swanson et al. 2001; Swanson and Vacquier 2002). Amino acid sites affected by positive selection tend to show nonsynonymous (dN) substitution rates that are significantly higher than the synonymous (dS) rates. Sites identified as being influenced by positive selection are frequently located in domains thought to be involved in sperm-egg interaction (e.g., Swanson et al. 2001). This phenomenon is thought to have important consequences for the establishment of barriers to fertilization and speciation (e.g., Swanson and Vacquier 2002).

In a study carried out by Sorhannus (2003), between four and seven codon sites in Sig1 of T. weissflogii were inferred to be affected by positive selection. This result was questioned by Suzuki and Nei (2004), who claimed that none of the sites were influenced by positive selection, i.e., that the putatively selected sites were in fact false positives. In this study, we applied realistic models of codon evolution to Sig1 and examined the statistical properties of the tests on the specific data set under investigation. Regrettably, data set-specific analysis of the error rates of statistical tests is not standard practice in applied sequence analyses. In the absence of external confirmatory evidence (e.g., directed mutagenesis), which may be difficult or costly to obtain, such error rate analyses are among the most rigorous computational checks on the validity of inference. The outcome of the analyses carried out here, using a larger sample of Sig1 than the one employed by Sorhannus (2003) and additional Thalassiosira species, suggested that at least two sites were influenced by positive selection. One of them is located in functional domain II of Sig1. Inferred evolutionary changes in the positively selected sites appear to be associated with divergence among the three major Thalassiosira lineages.

Materials and Methods

Sequences

Sixty-three partial Sig1 sequences and one complete Sig1 gene were obtained from GenBank. The accession numbers and sampling locations of the T. weissflogii sequences are as follows: AF154499 (Li0), AF374490 (Li1)–AF374500 (Li11), AF374540 (Li12)–AF374552 (Li24) (Clone CCMP 1336; Long Island Sound, USA); AF374501 (Li25)–AF374505 (Li29) (Clone CCMP 1049; Long Island Sound, USA); AF374506 (No1)–AF374510 (No5) (Clone CCMP 1052; Skagerrak Sea, Norway); AF374521 (Ca1)–AF374525 (Ca5) (Clone CCMP 1050; Del Mar Slough, CA, USA); AF374526 (Po1)–AF374530 (Po5) (Clone CCMP 1053; North Atlantic Ocean, Portugal); AF374516 (Ha1)–AF374520 (Ha5) (Clone CCMP 1051; King Kalakaua’s Fishpond, HI, USA); and AF374511 (In1)–AF374515 (In5) (Clone CCMP 1587; Jakarta Harbor, Indonesia). The accession numbers and sampling locations of the other sequences were AF374537–AF374539 (Clone CCMP 1335; Thalassiosira pseudonana, Moriches Bay, NY, USA); AF374531–AF374533 (Clone CCMP 1005; Thalassiosira oceanica, Sargasso Sea); and AF374534–AF374536 (Clone CCMP 988; Thalassiosira guillardii, North Atlantic Ocean). Each sampling location constitutes a clone, where different sequences reflect intraindividual variation. Five identical T. weissflogii sequences (AF374499, AF374506, AF374512, AF374521, AF374522) and an intron located between the two coding regions (except in the cDNA sequences AF374540–AF374552) were removed from the data matrix before the alignment was performed; that is, the alignment was carried out on the remaining 59 coding sequences. A NEXUS file with aligned sequences may be downloaded from http://www.hyphy.org/pubs/Sig1.nex. The four functional domain coordinates in the alignment are as follows (Fig. 1): domain I (nucleotides 1–108, codons 1–36); domain II (nucleotides 234–342, codons 78–114); domain III (nucleotides 465–516, codons 155–172); and domain IV (nucleotides 525–591, codons 175–197).
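The nucleotide and codon coordinates of the functional domains are related by a ceiling division, assuming the alignment is in frame starting at nucleotide 1. The following minimal Python sketch checks that relationship and reports which domain, if any, contains a given codon. The names `codon_of`, `DOMAINS`, `domain_codons`, and `domain_of_codon` are illustrative and are not part of the original analysis.

```python
# Illustrative check of the domain coordinates listed in the text.
# Assumption: the alignment is in frame, so nucleotide n lies in codon ceil(n / 3).

def codon_of(nucleotide: int) -> int:
    """Return the 1-based codon index containing a 1-based nucleotide position."""
    if nucleotide < 1:
        raise ValueError("nucleotide positions are 1-based")
    return (nucleotide + 2) // 3  # integer form of ceil(nucleotide / 3)

# Domain boundaries as given in the text: (first nucleotide, last nucleotide).
DOMAINS = {
    "I": (1, 108),
    "II": (234, 342),
    "III": (465, 516),
    "IV": (525, 591),
}

def domain_codons(domains=DOMAINS):
    """Convert nucleotide ranges to inclusive codon ranges."""
    return {name: (codon_of(a), codon_of(b)) for name, (a, b) in domains.items()}

def domain_of_codon(codon: int, domains=DOMAINS):
    """Return the domain name containing the codon, or None if it lies between domains."""
    for name, (first, last) in domain_codons(domains).items():
        if first <= codon <= last:
            return name
    return None
```

Under this convention, `domain_of_codon(94)` returns `"II"`, whereas codons 37, 150, and 174 fall outside all four listed domains.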

Fig. 1 Sig1 sequences used in the study (protein translated). Sites 37, 94, 150, and 174 were identified as being under diversifying selection by FEL or REL.

Data Analyses

The data were managed using the software package DAMBE (version 4.0.98 [Xia 2000]). The amino acid sequences of the Sig1 were aligned using CLUSTALW (version 1.7 [Thompson et al. 1994]), which is included in the DAMBE program package. Aligned amino acid sequences were mapped to corresponding codons using DAMBE. We used PAUP* 4.0 (Swofford 2002) to reconstruct a neighbor joining (Saitou and Nei 1987) gene tree based on the Tamura–Nei (1993) distance. Exact program settings for this step and all subsequent steps are given in the Supplementary Information. The resulting tree was displayed in TreeView (Page 1996). It should be noted that internal nodes with negative branch lengths were treated as unresolved polytomies. We applied a hierarchical and information theoretic model selection procedure (Kosakovsky Pond and Frost 2005) to choose a model of nucleotide substitution. HKY85 (matrix 010010) was selected as the optimal time-reversible nucleotide substitution model using the implementation in the HyPhy package (Kosakovsky Pond et al. 2005). Recent results of Kosakovsky Pond and Muse (2005) suggest that site-to-site variation in synonymous rates is widespread and can contribute to false-positive selection signal using methods which fail to account for this variation. We investigated the pattern of rate variation in the Sig1 gene as a whole, by fitting several models of site-to-site rate variation (Table 1). In order to test for evidence of positive selection in the context of models which allow for synonymous rate variation, we modified the Dual model (Kosakovsky Pond and Muse 2005) with S synonymous rate classes and N nonsynonymous rate classes to use a general discrete distribution which does not allow nonsynonymous rates to exceed synonymous rates. If this model, which we denote as Dual(–), fits the data significantly worse than the unconstrained Dual model, there is evidence of diversifying selection acting on the protein. 
To assess the significance of the likelihood ratio test (LRT), we note that the constrained model can be derived from the unconstrained model by applying N one-sided constraints (all nonsynonymous rates must be no higher than the lowest synonymous rate). An appropriate asymptotic distribution of the test statistic would be a mixture of χ² distributions with 0 through N degrees of freedom (Self and Liang 1987). However, for phylogenetic likelihood and N > 1 it is not possible to obtain the mixing proportions analytically. Instead, we used the distribution of the test statistic under the null hypothesis, i.e., the Dual(–) model, estimated from 100 parametric data replicates.
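The parametric bootstrap comparison described here can be sketched in a few lines of Python. This is not the HyPhy implementation: it assumes the log-likelihoods of both models have already been obtained for the observed data and for each replicate simulated under the null model. The function names are hypothetical, and the add-one correction in the p value is one common convention that the source does not specify.

```python
# Minimal sketch of a parametric-bootstrap p value for a likelihood ratio test.
# Assumption: log-likelihoods come from an external fitting program.

def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Twice the log-likelihood difference; clipped at 0 to absorb optimizer noise."""
    return max(0.0, 2.0 * (lnl_alt - lnl_null))

def bootstrap_p_value(observed_lrt: float, null_lrts) -> float:
    """Proportion of null replicates with a statistic at least as large as observed.

    The +1 terms (an assumed convention) keep the estimate away from exactly 0
    when no replicate exceeds the observed value.
    """
    null_lrts = list(null_lrts)
    if not null_lrts:
        raise ValueError("at least one null replicate is required")
    exceed = sum(1 for stat in null_lrts if stat >= observed_lrt)
    return (exceed + 1) / (len(null_lrts) + 1)
```

With 100 replicates, the smallest p value this estimator can return is 1/101, so it can support a statement such as p < 0.05 but not a much finer resolution.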

Table 1 Results of codon analysis of rate variation patterns in Sig1

In order to identify codons affected by positive and negative selection, the single-likelihood ancestor counting (SLAC), fixed effects likelihood (FEL), and random effects likelihood (REL) methods available in the HyPhy package (Kosakovsky Pond et al. 2005) and in a free public web implementation (Kosakovsky Pond and Frost 2005) were applied to the Sig1 data. Codon evolution at sites 37, 94, 150, and 174 was mapped on the tree (Fig. 2) using a maximum likelihood reconstruction of ancestral states based on a fitted model of codon substitution, as implemented in http://www.datamonkey.org (Kosakovsky Pond and Frost 2005).

Fig. 2 Codon evolution in sites 37, 94, 150, and 174 mapped on the distance tree used in the SLAC, FEL, and REL analyses. Ancestral codons, shown at each node, were inferred using SLAC. All terminal codons are shown. Li0–Li29 represent T. weissflogii sequences from the Long Island isolates; No1–No4 represent T. weissflogii sequences from the Norwegian isolate; Po1–Po5 represent T. weissflogii sequences from the Portuguese isolate; Ca1–Ca3 represent T. weissflogii sequences from the California isolate; Ha1–Ha5 represent T. weissflogii sequences from the Hawaiian isolate; In1–In4 represent T. weissflogii sequences from the Indonesian isolate. Nana1–Nana3 represent T. pseudonana sequences; Guillardii1–Guillardii3 represent T. guillardii sequences; Oceanica1–Oceanica3 represent T. oceanica sequences.

A detailed account of the SLAC, FEL, and REL methods is given by Kosakovsky Pond and Frost (2005). Briefly, SLAC uses a maximum likelihood reconstruction of ancestral codon states to compare the observed ratio of nonsynonymous and synonymous substitutions with the approximate estimate of the expected ratio assuming neutral evolution. FEL uses the entire alignment to infer model parameters shared by all sites (e.g., branch lengths) and then fits dS and dN rates individually at every site. Neutrality of an individual site is tested using the likelihood ratio test. REL extends the popular methods of Nielsen and Yang (1998) to allow both synonymous and nonsynonymous substitution rates to vary among sites. All three methods used here incorporate synonymous rate variation explicitly.
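The per-site test used by FEL can be illustrated with a short Python sketch. It assumes that the site log-likelihoods under the neutral constraint (dN = dS) and under free dN and dS have already been computed elsewhere, and that the test statistic is referred to a χ² distribution with one degree of freedom; that reference distribution is an assumption of this sketch rather than something stated in the text. The function names are hypothetical.

```python
import math

# Sketch of a site-wise likelihood ratio test of neutrality (dN = dS).
# Assumption: the statistic is compared with a chi-square distribution, 1 df.

def chi2_sf_1df(x: float) -> float:
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

def site_lrt(lnl_neutral: float, lnl_free: float, dn: float, ds: float):
    """Return (statistic, p value, direction) for one codon site.

    lnl_neutral: site log-likelihood with dN constrained to equal dS
    lnl_free:    site log-likelihood with dN and dS estimated separately
    """
    stat = max(0.0, 2.0 * (lnl_free - lnl_neutral))
    p_value = chi2_sf_1df(stat)
    if dn > ds:
        direction = "positive"
    elif dn < ds:
        direction = "negative"
    else:
        direction = "neutral"
    return stat, p_value, direction
```

A site would be reported as positively selected when the p value falls below the chosen significance level and the estimated dN exceeds dS.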

To establish statistical properties of the FEL test given low divergence in Sig1 sequences and multiple polytomies, we fitted 100 data sets simulated under neutrality (dN = dS = 1) using the Sig1 tree with branch lengths and nucleotide substitution biases inferred by maximum likelihood. We then applied the FEL procedure to each simulated data set and tabulated the number of times a site was identified as selected (positively or negatively) by the test, as well as the p value associated with the inference. The estimated Type I error rate of the FEL test R(p), treated as a function of the significance level of the test (p), can then be computed as the proportion of sites identified as selected at or below level p. For an ideal test, one would expect to find R(p) = p, while a conservative test would yield R(p) ≤ p. As shown in Fig. 3, FEL is not expected to have an elevated rate of false positives in the range of 0 < p < 0.08, which includes the values inferred for codons 94 (p = 0.07) and 174 (p = 0.06). A similar analysis for REL (Fig. 4) suggests, instead, that a measurable proportion of neutrally evolving sites may be misidentified even for large cutoff Bayes factors. This may be partly due to lack of convergence (flat likelihood surface) and partly due to large errors in rate parameter estimates.
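The empirical error rate R(p) defined above is a simple proportion and can be computed as follows. This Python sketch assumes the per-site p values from all neutral simulations have been pooled into one list; the function names and the optional tolerance argument are illustrative.

```python
# Sketch of the empirical Type I error rate R(p) from neutral simulations.
# Assumption: p_values pools the per-site p values over all simulated data sets.

def type1_error_rate(p_values, level: float) -> float:
    """Proportion of neutrally simulated sites called selected at or below `level`."""
    p_values = list(p_values)
    if not p_values:
        raise ValueError("no p values supplied")
    return sum(1 for p in p_values if p <= level) / len(p_values)

def error_curve(p_values, levels):
    """Return [(level, R(level)), ...] for plotting against the diagonal R(p) = p."""
    p_values = list(p_values)
    return [(level, type1_error_rate(p_values, level)) for level in levels]

def is_conservative(curve, tolerance: float = 0.0) -> bool:
    """True if R(p) <= p (plus an optional tolerance) at every evaluated level."""
    return all(rate <= level + tolerance for level, rate in curve)
```

Plotting the output of `error_curve` against the line R(p) = p gives a graph of the kind described for Fig. 3.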

Fig. 3 Type I error rate as a function of the significance level of the FEL test. The circles represent the actual false-positive errors made by FEL on 100 data sets simulated using the tree and model from Sig1 but assuming neutral evolution. For portions of the graph which lie below the expected error rate (solid line), FEL is behaving conservatively, but since both lines are reasonably close, the test is not unduly conservative.

Fig. 4 Type I error rate as a function of the significance level of the REL test. The line represents the actual false-positive errors (only for identifying sites as positively selected) made by REL on 100 data sets simulated using the tree and model from Sig1 but assuming neutral evolution.

Faulty inferences about positive selection could result from recombination events (e.g., Sorhannus 2003). We ran GENECONV (version 1.81 [Sawyer 1999]) to detect possible recombination events among the sequences. The default settings were used in the analysis. This method searches for unusually long identical fragments within pairs of aligned sequences, or pairwise segments within the alignment characterized by uncommonly high matching scores (Sawyer 1999), with p values for each pair of sequences and multiple-comparison adjusted global p-value levels derived by simulation. The significance levels for global and pairwise comparisons were set at 0.05.

Finally, we modified the FEL method (Kosakovsky Pond and Frost 2005) to test for evidence of nonneutral evolution in Sig1 along a subset of three preselected branches: those which are involved in speciation events and all internal branches.

Results

The distance tree showed largely unresolved relationships among and within the T. weissflogii clones obtained from the North Atlantic Ocean and the California coast. Those collected from the waters around Hawaii and Indonesia formed a distinct group with regard to the Atlantic and California cell lines (Fig. 2). T. oceanica, unexpectedly, clustered within the Atlantic/California group of T. weissflogii. The sister lineage of the T. weissflogii/T. oceanica complex was T. guillardii. T. pseudonana formed a distantly related lineage with respect to T. weissflogii/T. oceanica/T. guillardii. With the exception of the position of T. oceanica, the gene/species tree is in general agreement with relationships obtained by phylogenetic studies of the diatoms (e.g., Sorhannus 2004).

The recombination analysis of the data matrix identified two Long Island sequences (AF374491 and AF374504) of T. weissflogii, belonging to two separate clones (CCMP 1336 and CCMP 1049), as having undergone a significant recombination event (global p value = 0.024). The fragment was 437 nucleotides long. After both Li2 (AF374491) and Li28 (AF374504) were eliminated from the data set, the analysis showed no significant recombination events between the remaining sequences. Thus, the remaining 57 Sig1 sequences were subjected to analyses for positive/negative selection.

Codon analysis of rate variation patterns suggested that Sig1 is subject to both synonymous and nonsynonymous site-to-site rate variation (Table 1). By varying the number of rate classes in the Dual model, we determined that the best (small sample AIC) fit was achieved with two synonymous and three nonsynonymous rate classes. Furthermore, the comparison between the Dual and the Dual(–) models demonstrated that some sites in the alignment were undergoing adaptive change (parametric bootstrap p < 0.05 based on 100 replicates).
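The small-sample AIC used to choose the number of rate classes can be written out explicitly. The Python sketch below uses the standard correction AICc = −2 lnL + 2k + 2k(k + 1)/(n − k − 1), where k is the number of estimated parameters and n is the sample size. What counts as n for an alignment (for example, the number of codon sites) is an assumption left to the user, and the function names are hypothetical.

```python
# Sketch of model comparison by small-sample AIC (AICc).
# Assumption: n is the sample size chosen by the analyst, e.g., the number of sites.

def aic_c(lnl: float, k: int, n: int) -> float:
    """AICc = -2 lnL + 2k + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return -2.0 * lnl + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)

def best_model(fits: dict, n: int) -> str:
    """Return the name of the model with the lowest AICc.

    fits maps a model name to a (log-likelihood, parameter count) pair.
    """
    if not fits:
        raise ValueError("no models supplied")
    return min(fits, key=lambda name: aic_c(fits[name][0], fits[name][1], n))
```

Each additional rate class adds parameters, so a richer Dual model is preferred only when its gain in log-likelihood outweighs the larger penalty.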

The SLAC analysis revealed no positively selected but 13 negatively selected sites (Table 2). FEL identified sites 94 (p = 0.07) and 174 (p = 0.06) as being influenced by positive selection and 65 negatively selected sites. One (site 94) of the two positively selected codons was located in functional domain II (Fig. 1; also see Armbrust 1999). This site was inferred to be under selection along the three branches involved in speciation. The other site (site 174) influenced by positive selection was located in a region between functional domains III and IV. The REL analysis suggested that codon sites 37 and 150 were affected by positive selection and that 64 sites were influenced by negative selection (Table 2 and Supplementary Information). The two positively selected sites were different from those discovered by FEL. Site 150 (between functional domains II and III; Fig. 1) was inferred to be under positive selection by FEL but with a p value of 0.18. Maximum likelihood ancestral state reconstruction suggested eight nonsynonymous and one synonymous substitution at that site.

Table 2 Positively selected sites identified by at least one method

Discussion

In an integrative approach to detecting positive selection, like this one, the ideal result is that every method supports the same positively selected site(s) (Kosakovsky Pond and Frost 2005). The fact that the SLAC analysis did not identify any positively selected sites is not an unexpected result since this approach may lack power for data sets consisting of sequences with low divergence (Kosakovsky Pond and Frost 2005). The FEL and REL methods each discovered two positively selected sites that were different between methods. Since FEL has been found, in general, to be a more conservative method (i.e., less likely to identify false-positive results) than REL, more confidence can be given to the results of the former approach (Kosakovsky Pond and Frost 2005). The results of the Type I error rate analysis carried out for the Sig1 data set/tree supported the conservative nature of the FEL test (Fig. 3), suggesting that the positively selected sites identified by this method are unlikely to be false positives. Moreover, the REL approach is expected to be unstable on this sequence alignment data due to poor parameter estimating properties induced by short sequences of low divergence and polytomies in the tree.

In light of the capacity to disperse over great distances and the lack of apparent barriers to gene flow (e.g., Darling et al. 2000), a low degree of genetic variation among unicellular planktonic organisms is expected (e.g., Palumbi 1994; Norris 2000). However, many molecular evolutionary studies, including that by Armbrust and Galindo (2001), have revealed a high degree of genetic differentiation in many pelagic organisms (e.g., De Vargas et al. 1999; Darling et al. 2000; Darling et al. 2004; Rynearson and Armbrust 2004). To explain the unexpectedly high degree of biological diversity in the open ocean, both allopatric and sympatric mechanisms have been proposed (e.g., De Vargas et al. 1999; Darling et al. 2004; Rynearson and Armbrust 2004). In this study, amino acid evolution due to positive selection, especially in site 94 of functional domain II and site 174 (Fig. 1), was associated with divergence among the major Thalassiosira lineages (Fig. 2). Based on the available information, it is difficult to infer whether codon changes in the two sites generated gamete isolation in sympatry or allopatry among the diverging lineages. However, the findings here suggest that positive selection for species-specific amino acids in a functional domain of a reproductive protein can be one of several mechanisms contributing to the fast diversification of unicellular pelagic organisms (also see discussion by Swanson and Vacquier 2002).