Introduction

Self-incompatibility (SI) is a mechanism present in flowering plants that prevents self-fertilization, by enabling the pistil to reject pollen from genetically related individuals. This mechanism is genetically controlled by the multi-allelic S-locus (de Nettancourt 1977). In gametophytic SI (GSI) pollen specificity is determined by the S-locus genotype of the haploid gametophyte. This system has been studied extensively at the molecular level in Solanaceae, Plantaginaceae, and Rosaceae. The pistil component of GSI has been shown to be an S-RNase in all three plant families (see review by Wang et al. 2003).

The Rosaceae S-RNases are distantly related to those of Plantaginaceae and Solanaceae, but a common evolutionary origin has been proposed for the S-RNase-mediated GSI (Igic and Kohn 2001; Steinbachs and Holsinger 2002; Roalson and McCubbin 2003). S-RNases have five conserved domains and four of them (C1, C2, C3, and C5) are shared between Solanaceae/Plantaginaceae and Rosaceae (see Fig. 1 in Vieira and Charlesworth 2002). The remaining Rosaceae conserved domain (RC4) is nevertheless located between regions C3 and C5, as is the C4 region of Solanaceae and Plantaginaceae. In Solanaceae and Plantaginaceae two hypervariable regions (HVa and HVb) have been identified. In Rosaceae the HVa region (the RHV region) is present but not the HVb region (see review by Wang et al. 2003). Prunus but not Pyrus/Malus (Rosaceae) S-RNases display another region, highly variable in length and sequence, between the RC4 and the C5 domains that seems to be absent in Solanaceae and Plantaginaceae (Ma and Oliveira 2002; Ortega et al. 2006). In Solanaceae the HV regions seem to be required for specificity determination, although other regions of the molecule may also be involved (Matton et al. 1997; Kao and McCubbin 1997; Zurek et al. 1997; Takebayashi et al. 2003). In Prunus and Pyrus (Rosaceae), however, S-RNase alleles with different specificities can be identical in the HV region, showing that this region is not required for specificity determination (Wunsch and Hormaza 2004; Zisovich et al. 2004). Nevertheless, this region seems to play a role in specificity determination since Pyrus/Malus sequences show a significant excess of nonsynonymous over synonymous substitutions (a sign of positive selection) in four regions, one of them corresponding to the HV region (Ishimizu et al. 1998).

Fig. 1
Schematic representation of the S-RNase gene in Solanaceae, Plantaginaceae, Prunus, and Pyrus/Malus (alignment according to Igic and Kohn 2001). Filled boxes represent conserved regions; open boxes represent the hypervariable regions (according to Tsai et al. [1992] and Kato and Mukai [2004]). Filled and open underlines represent conserved and hypervariable regions identified using a sliding window of size 5 of classes of normed variability indexes (NVI; see Materials and Methods and Supplementary Figs. 1–4). Positively selected amino acid sites that have no indels in the dataset used to detect them are represented in boldface; shaded amino acids represent positively selected amino acid sites that have indels in the dataset used to detect them

Within Rosaceae, Prunus S-RNases (subfamily Amygdaloideae) have two introns, while in Pyrus/Malus (subfamily Maloideae) only one is observed, as in Solanaceae/Plantaginaceae (Tao et al. 1999, 2006; Yamane et al. 2000; Ma and Oliveira 2001). It should be noted that the two Rosaceae groups (Pyrus/Malus and Prunus) are distantly related (Igic and Kohn 2001; Steinbachs and Holsinger 2002; Roalson and McCubbin 2003; Wikström et al. 2001).

A total of 64, 19, 88, and 37 different S-RNase sequences are available for Solanaceae, Plantaginaceae, Prunus, and Pyrus/Malus, respectively. Although these sequences do not encompass the entire gene, they represent the best available dataset of S-RNases. Previously, Takebayashi et al. (2003) analyzed a set of 25 Solanaceae sequences using a phylogenetic approach. When a liberal definition is used (posterior probability of selection >50%) to identify sites likely under positive selection (those likely responsible for defining specificity differences), 52 sites are identified. When a more stringent criterion is used (posterior probability of selection >95%), 19 sites are identified. About 53% of these sites are located in the HV regions. No comparative study has yet been performed on the location of positively selected sites in Solanaceae, Plantaginaceae, Prunus, and Pyrus/Malus. In the present work positively selected sites are identified using both a phylogenetic (as in Takebayashi et al. 2003) and a population genetics method. For Plantaginaceae and Rosaceae species this is the first study aimed at determining the location of positively selected amino acid sites. Although the location of positively selected sites is different in Solanaceae, Prunus and Pyrus/Malus, the percentage of positively selected sites is similar.

Materials and Methods

Solanaceae, Plantaginaceae, Prunus, Malus, and Pyrus S-RNase coding sequences longer than 400 bp were obtained from GenBank. When more than one identical nucleotide sequence was available for the same species and specificity, only one sequence was used. A total of 64, 19, 88, and 37 Solanaceae, Plantaginaceae, Prunus, and Pyrus/Malus S-RNase sequences were used, respectively (Supplementary Table 1). The two Rosaceae groups (Prunus and Pyrus/Malus) were treated separately since evidence for trans-specific evolution is found within the genus Prunus and among Pyrus and Malus species but not between Prunus and Pyrus/Malus (Ushijima et al. 1998; Ma and Oliveira 2002).

For each dataset S-RNase amino acid sequences were aligned using ClustalX v.1.64b (Thompson et al. 1997), and minor manual adjustments performed using Proseq version 2.43 (http://www.helios.bto.ed.ac.uk/evolgen/filatov/proseq.html). The resulting amino acid alignment (Supplementary Figs. 1–4) was used as a guide to align the corresponding nucleotide sequences. Nonsynonymous (Ka) and synonymous (Ks) values per nonsynonymous and synonymous sites, respectively (with the Jukes-Cantor correction), were calculated with DnaSP 4.1 (Rozas et al. 2003).
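The Jukes-Cantor correction mentioned above can be illustrated with a minimal sketch. It assumes that the observed proportions of synonymous and nonsynonymous differences per site have already been obtained; the counting of sites performed by DnaSP is not reproduced, and the function names are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion of differences p."""
    # The correction is undefined once p reaches 0.75 (saturation).
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks_ratio(p_nonsyn, p_syn):
    """Ratio of corrected nonsynonymous (Ka) to synonymous (Ks) divergence."""
    ka = jukes_cantor(p_nonsyn)
    ks = jukes_cantor(p_syn)
    if ks == 0.0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return ka / ks
```

Because the correction grows faster than the observed proportion, corrected and uncorrected Ka/Ks values diverge most for distantly related sequences.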

Phylogenetic trees were generated using Quartet Puzzling with TREE-PUZZLE version 5.2 (Schmidt et al. 2002) after using Modeltest (Posada and Crandall 1998) to find the simplest model of nucleotide sequence evolution that best fit the data, according to the Akaike information criterion. For the Solanaceae dataset the model that best describes the data is the TVM+I+G model with nucleotide frequencies A = 0.3649, C = 0.1905, G = 0.2033, and T = 0.2414, the substitution model A ↔ C = 1.6923, A ↔ G = 2.9828, A ↔ T = 1.4778, C ↔ G = 2.1316, C ↔ T = 2.9828, and G ↔ T = 1.0000, a proportion of invariable sites of 0.0334, and a gamma distribution shape parameter α of 1.4361 for variable sites. The model that best describes the Plantaginaceae dataset is the TrN+I+G model with nucleotide frequencies A = 0.3267, C = 0.2412, G = 0.1964, T = 0.2357, the substitution model A ↔ C = 1.0000, A ↔ G = 2.3343, A ↔ T = 1.0000, C ↔ G = 1.0000, C ↔ T = 1.4849, and G ↔ T = 1.0000, a proportion of invariable sites of 0.1531, and a gamma distribution shape parameter α of 2.3685 for variable sites. The model that best describes the Pyrus/Malus dataset is the HKY+I+G model with base frequencies A = 0.3461, C = 0.2402, G = 0.2095, and T = 0.2042, a transition transversion ratio of 0.9229, a proportion of invariable sites of 0.2468, and a gamma distribution shape parameter α of 1.4418 for variable sites. For the Prunus dataset the best model is the TVM+I+G model, with base frequencies A = 0.3578, C = 0.2530, G = 0.2027, and T = 0.1865, the rate matrix A ↔ C = 0.5419, A ↔ G = 1.0766, A ↔ T = 0.5614, C ↔ G = 0.8286, C ↔ T = 1.0766, and G ↔ T = 1.0000, a proportion of invariable sites = 0.0945, and a gamma distribution shape parameter α of 0.7677 for variable sites.
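Model choice by the Akaike information criterion can be sketched as follows. This is only an illustration of the criterion (AIC = 2k − 2 lnL, lowest value preferred), not of Modeltest itself; the function names and any log-likelihoods or parameter counts passed to them are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL."""
    return 2.0 * n_params - 2.0 * log_likelihood


def best_model(candidates):
    """Return the model name with the lowest AIC.

    candidates maps a model name to a (log_likelihood, n_params) tuple.
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))
```

A more parameter-rich model is preferred only when its gain in log-likelihood exceeds the number of extra parameters.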

For the identification of sites under positive selection we have used both the codeml software implemented in PAML 3.13 (Yang 1997) and the method of Wilson and McVean (2006) as implemented in the omegaMap v 0.5 software (http://www.danielwilson.me.uk), which uses a population genetics approximation to the coalescent with recombination. When using PAML 3.13 we tested the following models M0 (one rate), M1 (nearly neutral), M2 (positive selection), M3 (discrete), M7 (beta; 10 categories) and M8 (beta plus one class with Ka/Ks > 1; 11 categories). For Solanaceae the simplest model that best fits the data is the M8 model; for Plantaginaceae and Prunus, the M3 model; and for Pyrus/Malus, the M2 model (Supplementary Table 2). It should be noted that amino acid positions with alignment gaps are disregarded when this method is used.
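Nested codeml site models are commonly compared with a likelihood ratio test; the text does not state the exact procedure used to choose among models, so the sketch below is an assumption. It uses the conventional degrees of freedom (2 for M1 versus M2 and for M7 versus M8, 4 for M0 versus M3 with three classes) and a closed-form chi-square tail probability that holds for even degrees of freedom only. Function names are hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    # Sum of (x/2)^i / i! for i = 0 .. df/2 - 1.
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return the LRT statistic 2*(lnL_alt - lnL_null) and its p-value."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf_even_df(stat, df)
```

For example, with df = 2 a statistic above about 5.99 rejects the null model at the 5% level.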

When using the omegaMap v 0.5 software, 350,000 iterations and a burn-in of 100,000 were used for the Solanaceae and Plantaginaceae datasets, and 250,000 iterations and a burn-in of 25,000 were used for the Prunus and Pyrus/Malus datasets. Two independent runs were performed and the results checked for convergence. All codons were assumed to be at equal frequencies and 10 random sequence orders were generated. The parameters to be estimated are the selection parameter (ω = Ka/Ks), the population recombination rate (ρ), the rate of synonymous transversion (μ), the transition:transversion ratio (κ), and the insertion/deletion rate (φ) (Wilson and McVean 2006). The first two parameters are allowed to vary along the sequence. For the estimation of both ω and ρ, a block of 20 codons (due to computational burden; this represents approximately 10% of the coding sequence size) was used. One objective and one subjective approach to prior specification were used: (i) inverse distributions were used as priors for ω and ρ, and improper inverse distributions were used for the other parameters (μ, κ, and φ) (the bounds for ω are 0.01–1000 and those for ρ are 0.00000001–1000; the posterior density outside these ranges should therefore be about zero even if a wider range had been used); and (ii) exponential distributions were used for all parameters (μ = 1, κ = 3, ω = 1, ρ = 0.0001, φ = 0.2). The values used in (ii) reflect the prior belief that recombination is rare (Schierup et al. 2001; Vieira et al. 2003) and that only a few sites along the sequence are positively selected (Takebayashi et al. 2003).
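The role of the burn-in and the comparison of independent runs can be sketched as follows. The sketch assumes that the sampled ω values for a given codon are available as a plain list, one value per iteration; this is not the omegaMap output format, and the function names are hypothetical.

```python
def posterior_prob_selection(omega_samples, burn_in):
    """Fraction of post-burn-in samples with omega > 1 for one codon."""
    kept = omega_samples[burn_in:]
    if not kept:
        raise ValueError("no samples remain after discarding the burn-in")
    return sum(1 for w in kept if w > 1.0) / len(kept)


def max_run_discrepancy(run_a, run_b):
    """Largest per-site difference in posterior probability between two runs."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same number of sites")
    return max(abs(a - b) for a, b in zip(run_a, run_b))
```

A small discrepancy between two independent runs is one simple indication that the chains have converged to the same posterior.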

Normed variability indexes (NVIs) for each site of the aligned S-RNase alleles were calculated as in Kheyr-Pour et al. (1990). Therefore, for each site, information on both the number of different amino acids and their frequencies is used.

Results

Identification of Positively Selected Sites

Briefly, key to both the Yang (1997) and the Wilson and McVean (2006) models used here is the likelihood function P(H|Θ), where H represents the haplotypes and Θ represents the model parameters. In the Yang (1997) method the maximum likelihood tree (Ĝ) is first estimated and the likelihood conditional on the tree, P(H|Ĝ, Θ), is then calculated. When there is recombination, as seems to be the case for the S-RNase gene, where rare recombination has been suggested (Schierup et al. 2001; Wang et al. 2001; Vieira et al. 2003; Ortega et al. 2006), the data cannot be described by a single tree. In this case the multiple trees must be treated as a nuisance parameter that must be averaged over so that P(H|Θ) = ∫ P(H|G, Θ)P(G)dG, where P(G) is the probability density of the ancestral trees, including branch lengths (Wilson and McVean 2006). P(G) can be modeled in various ways, but when the sequences are from a single population the coalescent with recombination is often used and this is the option implemented in omegaMap by Wilson and McVean (2006). In our datasets not all S-RNase sequences are from the same species. Nevertheless, in GSI two alleles from the same species can be more divergent than alleles from two different species (trans-specific evolution) and recombination is rare. Therefore this method can in principle be applied to GSI (Nunes et al. 2006). Both methods use the same basic codon substitution model, although in the Wilson and McVean (2006) method the substitution model used by Nielsen and Yang (1998) was extended to incorporate the insertion/deletion rate (φ). It is not feasible to use importance sampling or MCMC to solve the above equation since, in the context of the codon model used, they are highly computationally intensive (Wilson and McVean 2006). Thus an approximation to the likelihood in the presence of recombination is used, called the product of approximate conditionals (PAC) likelihood (Wilson and McVean 2006).
The approximate nature of the PAC likelihood means that the ordering of the haplotypes may influence the value of the likelihood and this is the reason why the likelihood is averaged over 10 random orderings. The other main difference between the method of Yang (1997) and that of Wilson and McVean (2006) is that the first uses a maximum-likelihood approach, while the latter uses a Bayesian approach in order to estimate the parameters of interest. Therefore in the Wilson and McVean (2006) method subjective or objective priors must be specified for the parameters to be estimated. Both methods can be used to detect amino acid sites under diversifying selection. The Yang (1997) method has been previously used to identify positively selected sites at the Solanaceae S- locus pistil component (Takebayashi et al. 2003) and at the Prunus S-locus pollen component (Ikeda et al. 2004; Nunes et al. 2006). The Wilson and McVean (2006) model was also used before by Nunes et al. (2006) to identify sites under positive selection at the Prunus S-locus pollen component.

Both the phylogenetic and the population genetics approach present potential problems that can affect the identification of sites under positive selection (see Discussion). Table 1 shows the amino acid sites that were identified by at least one of the two methods used as having a posterior probability of selection >95% as well as those sites that were identified by both methods as having a posterior probability of selection between 50% and 95%. Sites with alignment gaps identified by the population genetics method as having a posterior probability of selection >95% are also shown (Table 1). In order to be conservative, in the following analyses we use, for sites without alignment gaps, only those that were identified by one of the methods as having a posterior probability of selection >95% and by the other method as having a posterior probability of selection >50%. Sites with alignment gaps, which are disregarded by the phylogenetic method, were treated separately in most analyses. In the Solanaceae, Plantaginaceae, Prunus, and Pyrus/Malus datasets 13, 0, 17, and 27 sites without alignment gaps were identified as positively selected, respectively (Table 1). When using the phylogenetic method the detection of most of these sites is not dependent on the phylogenetic tree used. When a star phylogeny is used 12 of 13 (Solanaceae), 15 of 17 (Prunus), and 25 of 27 (Pyrus/Malus) positively selected sites are identified (Supplementary Table 3). When using the population genetics approach 5, 0, 11, and 11 amino acid sites with alignment gaps were identified as being putatively positively selected.
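The conservative criterion described above can be written as a small decision rule. The sketch below reflects one reading of the text: an ungapped site is retained when one method gives a posterior probability >95% and the other >50%, and a gapped site, for which no phylogenetic estimate exists, is retained when the population genetics method gives >95%. The function name and the use of None for a missing phylogenetic estimate are assumptions.

```python
def is_strongly_selected(p_phylo, p_popgen, high=0.95, low=0.50):
    """Apply the conservative two-method criterion to one amino acid site.

    p_phylo is None for sites with alignment gaps, which the
    phylogenetic method disregards.
    """
    if p_phylo is None:
        # Gapped site: only the population genetics estimate is available.
        return p_popgen > high
    return (p_phylo > high and p_popgen > low) or (
        p_popgen > high and p_phylo > low
    )
```

Under this rule a site supported at 90% by both methods is not retained, whereas a site supported at 96% by one method and 55% by the other is.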

Table 1 Amino acid sites under positive selection

The positively selected sites are spread along the S-RNase sequence (see Fig. 1 and Supplementary Figs. 1–4). About 67%, 20%, and 32% of the Solanaceae, Prunus, and Pyrus/Malus positively selected sites, respectively, are located in the HV regions (Fig. 1). The Prunus ParS10 sequence precludes the analysis of 41% of the amino acid sites located in the hypervariable region by the phylogenetic method. When this sequence is removed the phylogenetic method identifies four sites as being positively selected in this region, with a posterior probability of selection >95% (Table 1 and Supplementary Fig. 2). Nevertheless, the population genetics method gives for these sites posterior probabilities of positive selection <20%. This class of sites (Table 1; class 5) is common in Prunus but rare in the other datasets and may represent false-positive sites (see Discussion).

Rates of Synonymous and Nonsynonymous Evolution for Site Categories

Between 11.5% and 17.8% of the amino acid sites considered are positively selected except for Plantaginaceae, where no positively selected sites were identified (Table 2). The ratio of nonsynonymous (Ka)-to-synonymous (Ks) values (Ka/Ks) for the positively selected sites category is similar for Solanaceae and Prunus (1.75 and 1.88, respectively; 1.03 and 1.27, respectively, if uncorrected values are used) but different for Pyrus/Malus (3.21 and 1.82 if uncorrected values are used). When calculating Ks and Ka values the Jukes-Cantor correction was used. It is likely that this correction does not perform well when distantly related sequences are analyzed, but the same conclusion holds when using uncorrected values. Such a difference could be expected since Ma and Oliveira (2002) reported that the average nonsynonymous value (Ka) is higher in Pyrus/Malus than in Prunus, although the average synonymous value (Ks) is similar in the two datasets. It is thus conceivable that selection for specificity differences is stronger in Pyrus/Malus than in Solanaceae and Prunus.

Table 2 Average per site synonymous (Ks) and nonsynonymous (Ka) divergence values for the different classes of amino acid sites

One of the main differences between the Solanaceae/Plantaginaceae and the Pyrus/Malus/Prunus datasets is the percentage of sites under strong purifying selection (amino acid sites where the main variant is at >95% frequency, a threshold that allows for possible sequencing mistakes or rare slightly deleterious alleles; Table 2). This figure is about 2.1 times higher in the latter than in the former datasets. Given that the average Ks value is about 4 times lower (or 2.5 times lower if uncorrected Ks values are used) in the latter datasets than in the former, it is nevertheless conceivable that some sites identified as being under strong purifying selection in Pyrus/Malus and Prunus accept a few amino acid variants. Had enough evolutionary time elapsed, these sites would show amino acid variation and therefore they would not be classified as sites under strong purifying selection.

For Solanaceae and Plantaginaceae the removal of the positively selected sites and sites under strong purifying selection produces an average Ka/Ks ratio <1 (0.608 and 0.631 for Solanaceae and Plantaginaceae, respectively; 0.713 and 0.769, respectively, if uncorrected values are used), indicating little purifying selection. For Pyrus/Malus and Prunus datasets removal of those sites leads to an average Ka/Ks ratio close to 1, the value expected under neutrality (1.279 and 0.934 for Pyrus/Malus and Prunus, respectively; 1.116 and 0.938, respectively, if uncorrected values are used). These findings could suggest that most positively selected sites have been identified but the lack of good overlap between the sets of sites identified by the phylogenetic and population genetics approach suggests otherwise. Furthermore, it is unclear why in Plantaginaceae only a few positively selected sites were identified and with one of the methods only (Table 1).

Removal of the identified positively selected sites and sites under strong purifying selection from the dataset produces an average Ks similar to that obtained when all synonymous sites are considered. Birky and Walsh (1988) show that the rate of substitution at neutral sites linked to positively or negatively selected sites is not affected. Thus for the datasets used here the average Ks based on all sites is close to the rate of substitution at neutral sites.

Indirect Experimental Evidence That the Sites Identified as Positively Selected Are Important for Specificity Determination

Ideally, experimental evidence should be collected to show that the sites identified as positively selected are important for determining S-allele specificities. Although technically possible, these experiments are very time-consuming. Nevertheless, there is published evidence suggesting that regions and specific amino acid sites may be important for specificity determination. Since, in principle, both approaches can produce false positives (see Discussion), we compare with the experimental evidence only the set of sites identified as being positively selected by both methods.

In Solanaceae 18 amino acid sites were identified as being strongly positively selected (sites in classes 1 to 4; Table 1). Only 5.4% of all pairwise comparisons produce <10 differences at these 18 amino acids (the average is 14.59). Thus most of the specificity pairs seem to differ in at least 56% of the sites identified as being under positive selection.
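The pairwise comparison described above can be sketched as follows, assuming the aligned amino acid sequences are available as equal-length strings and the positively selected sites as zero-based alignment columns. The function names are hypothetical and no data from the actual datasets are reproduced.

```python
from itertools import combinations


def pairwise_differences(seqs, sites):
    """Number of differences at the given columns for every pair of sequences."""
    counts = []
    for a, b in combinations(seqs, 2):
        counts.append(sum(1 for i in sites if a[i] != b[i]))
    return counts


def summarize(counts, threshold):
    """Fraction of pairs with fewer than threshold differences, and the mean."""
    frac_below = sum(1 for c in counts if c < threshold) / len(counts)
    mean = sum(counts) / len(counts)
    return frac_below, mean
```

The percentage quoted in the text follows from the threshold: pairs with at least 10 differences at 18 sites differ at 10/18, or about 56%, of those sites.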

Zurek et al. (1997) exchanged S-RNase domains between two Nicotiana alata sequences (SC10 and SA2) and always obtained a different specificity. According to our analyses this is the expected result since varying positively selected sites have been identified in the regions swapped between alleles (14 in the region between C2 and C3, 2 sites in the region between C3 and C4, and 1 site in the region between C4 and C5).

Matton et al. (1997) also swapped domains between two closely related alleles (the Solanum chacoense S11 and S13 alleles). Exchanging four amino acids was sufficient to exchange specificities. These four amino acids have been identified here as being positively selected.

In Prunus 28 sites were identified as being strongly positively selected (sites in classes 1 to 4; Table 1). When these sites are used, only 2.7% of the comparisons have <8 amino acid differences at these 28 sites (the average is 13.75). Thus most of the specificity pairs seem to differ in at least 29% of the sites identified as being under positive selection.

The Prunus avium S-RNase alleles S6 and S24 differ at just 14 amino acid positions and have been shown to represent two different specificities (Wunsch and Hormaza 2004). Ten of the 14 amino acid differences are located in the region analyzed here. Four of these ten amino acid positions have been identified here as being positively selected.

Recently, Ortega et al. (2006) reported four pairs of similar Prunus dulcis sequences with different specificities (S11/S24, S3/S6, S4/S8, and S5/S10, showing 10, 15, 23, and 35 amino acid differences, respectively, in the region analyzed in this study). For allele pairs S11/S24, S3/S6, S4/S8, and S5/S10, respectively, 3, 6, 11, and 10 of the varying amino acid positions were identified here as being under positive selection. Nunes et al. (2006) also reported a case of two Prunus spinosa S-RNase alleles (S-RNase 3-1 and S-RNase 3-2) differing at just two amino acid positions. Although no direct evidence is presented, it is believed that they represent two different specificities since both alleles were found in the same individual. One of the two differences is located in the region analyzed here. This varying site has been identified as being positively selected.

In Pyrus/Malus 38 sites were identified as being strongly positively selected (sites in classes 1 to 4; Table 1). When these sites are used, only 5.3% of the pairwise comparisons present <7 amino acid differences (the average is 22.86). Thus most of the specificity pairs seem to differ in at least 18% of the sites identified as being under positive selection.

The Pyrus communis S-RNase alleles Sn and Si have been shown to represent two different specificities, although they differ at just 14 amino acid sites (Zisovich et al. 2004). One of the differences is located outside the region analyzed here. Two of the remaining 13 amino acid sites have been identified here as being positively selected.

In conclusion, for Solanaceae, Pyrus/Malus, and Prunus there is indirect experimental evidence that the sites identified here as being under positive selection by both methods may be important for specificity determination. This does not imply, however, that amino acid sites identified by just one method are false positives. Closely related pairs of alleles found within and between species can be identical at the amino acid positions identified here as being under positive selection (Table 3). It is thus possible that they represent the same specificity.

Table 3 Closely related allele pairs that show no differences at the amino acid positions identified as being positively selected

Discussion

We have used two approaches to identify positively selected sites, namely, a phylogenetic method (Yang 1997) and a population genetics method that uses an approximation to the coalescent with recombination (Wilson and McVean 2006). Both methodologies present problems when applied to the gametophytic S-locus. There are also potential limitations to identify all sites important for specificity determination via detection of positive selection, thus even sites not detected in any of the analyses performed may be shown experimentally to be important.

The Yang (1997) method assumes that the relationship between the DNA sequences under consideration can be described by a single well-supported tree and this is rarely the case for S-RNase datasets (see, e.g., Igic and Kohn 2001). This method assumes, for instance, that there is no recombination in the dataset. Patterns compatible with rare recombination have, nevertheless, been detected in the S-RNase gene of Solanaceae and Rosaceae (Schierup et al. 2001; Wang et al. 2001; Vieira et al. 2003; Ortega et al. 2006), although unequivocal evidence is yet to be produced. Violation of this assumption often results in the detection of false-positive sites (Anisimova et al. 2003; Wilson and McVean 2006). A total of 1, 4, 32, and 1 (Solanaceae, Plantaginaceae, Prunus, and Pyrus/Malus, respectively) amino acid sites have been identified by the phylogenetic method as having a posterior probability of selection >95% but as having a posterior probability of selection <50% by the population genetics method (Table 1). Since the latter method allows for the possibility of recombination, these sites could be false positives. It should be noted that the Phi homoplasy test for recombination, as implemented in SplitsTree4 (Huson and Bryant 2006), detects recombination only in the Prunus dataset (p < 0.05), but again this is not unequivocal evidence for recombination. It is nevertheless conceivable that the Prunus S-RNase experiences a higher rate of recombination than the other S-RNases, thus producing a higher number of false positives when the phylogenetic method is applied.

The power of the population genetics method to detect diversifying selection was investigated by Wilson and McVean (2006). These authors simulated the evolution of a short sequence with a single site in the middle experiencing diversifying selection (ω = 5) against a background of functionally constrained but not invariable sites (ω = 0.2) in the presence of recombination (ρ = 0.1). Coverage (defined as the proportion of simulated datasets for which the 95% highest posterior density interval encloses the truth) varies between 16% (if the entire sequence is treated as one block) and 67% (if each codon is treated separately). The simulated data do not, however, mimic the features of the S-locus, thus it is not possible to infer the rate of false negatives from these simulations. In principle, false negatives can be the result of not explicitly modeling the genealogies when computing the PAC likelihoods, and/or using a block-like prior, thus disfavouring the identification of isolated sites under diversifying selection (Wilson and McVean 2006). For GSI the percentage of sites that are in this category cannot be calculated since it is not possible to be sure that all sites identified by the phylogenetic method are true positives. The Wilson and McVean (2006) method has a low false-positive rate for detecting sites under diversifying selection (<1%; Wilson and McVean 2006). Thus in principle most sites identified by the Wilson and McVean (2006) method should also be detected by the Yang (1997) method. For Solanaceae, Prunus, and Malus/Pyrus, 5.3%, 5.6%, and 22.7%, respectively, of the sites identified by the population genetics method as having a posterior probability of selection >50% (gapped sites were excluded) are not identified as having a posterior probability of selection >50% by the Yang (1997) method. Thus when applied to GSI the Wilson and McVean (2006) method may produce a higher percentage of false positives. 
The higher percentage of apparent false positives produced by the population genetics method in Malus/Pyrus than in Solanaceae and Prunus may be due to the clustering of true positively selected sites in specific regions of the molecule coupled with the use of a sliding block. Analyzing each site separately is, however, computationally prohibitive.

False positives detected by the population genetics method may also be due to the violation of other assumptions. The Wilson and McVean (2006) method assumes that the alleles used are a random sample of the alleles present in the population and this assumption is being violated since a sample where each allele is represented exactly once, even when all alleles are at equal frequencies in the population, is not a random sample. Furthermore, deep population structure is also problematic. Nevertheless, considering the mean level of divergence of the sampled alleles it is likely that most alleles coalesce before the speciation of species sampled. The identification of false-positive gapped sites is also possible since each alignment gap is treated as an independent event by the Wilson and McVean (2006) method.

For Plantaginaceae, no positively selected sites were identified by both methods. This could suggest that the Plantaginaceae S-RNase is not the GSI pistil component, but all experimental evidence supports the view that it is (Xue et al. 1996; Vieira and Charlesworth 2002). Therefore, we speculate that the failure to detect positively selected sites in Plantaginaceae is due to the small number of sequences available (19) in comparison with Solanaceae (64), since levels of divergence are similar in these two datasets (0.499 and 0.493, respectively; Table 2) and positively selected sites are detected in the Solanaceae. This example shows the limitations of the methodologies used. Besides it is unclear whether these approaches work well in the presence of balancing selection as strong as that in the gametophytic self-incompatibility systems studied. The apparent age of the clades (far lower in Rosaceae than in Solanaceae and Plantaginaceae) and differences in the sampling diversity can also differently limit the capability to detect positive selection in the different datasets using these methodologies.

For Solanaceae our results using the phylogenetic method and the 50% posterior probability of selection criterion can be compared directly with those of Takebayashi et al. (2003). Although the region analyzed by us is smaller, we use a much larger dataset (64 vs 25 S-RNase sequences). Sites with alignment gaps in either alignment were excluded from the comparison. Only 69% of the sites that could be compared and were identified by Takebayashi et al. (2003) as being positively selected are identified by us. This unexpected result suggests that about 30% of the sites identified by Takebayashi et al. (2003) as positively selected may be false positives. It should be noted that for Solanaceae the results obtained with the phylogenetic and population genetics methods agree well since there are only two amino acid residues that are identified by just one of the methods (Table 1). This is expected since Anisimova et al. (2003) showed that fewer false positives (due to assuming no recombination when there is some) are produced when model M8 is used (the model that best fits the Solanaceae dataset). On the other hand, one site is identified by us as being positively selected that is not detected by Takebayashi et al. (2003). This new site is detected by both the phylogenetic (posterior probability of selection higher than 95%) and the population genetics (posterior probability of selection between 50% and 95%) methods. The amino acid sites covered by our conservative criteria (the strongly positively selected sites) represent 39% of the sites identified by Takebayashi et al. (2003).

In the Solanaceae, Prunus, and Pyrus/Malus datasets the strongly positively selected sites (13, 17, and 27 sites without alignment gaps and 5, 11, and 11 amino acid sites with alignment gaps, respectively) are spread along the S-RNase sequence (see Fig. 1 and Supplementary Figs. 1, 3, and 4). About 67%, 20%, and 32% of the Solanaceae, Prunus, and Pyrus/Malus positively selected sites, respectively, are located in the HV regions defined according to Tsai et al. (1992) and Kato and Mukai (2004) (Fig. 1). The NVI sliding window analysis used here identified the previously defined HV regions, although the sizes and limits of these regions are somewhat different (Fig. 1). It also suggests that additional HV regions with a size range of 5–10 amino acids are present in Plantaginaceae, Prunus, and Pyrus/Malus. The same approach identifies the previously defined conserved regions, except the Plantaginaceae C4 region and the Prunus RC4 region. It also identifies a new conserved region with a size range of 5–7 amino acids in all datasets but Solanaceae (Fig. 1). It should be pointed out, however, that the definition of these regions is of limited value, since amino acids under positive selection can be found embedded in conserved regions (see, e.g., the C3 region of Prunus and Pyrus/Malus). Considering all HV sites, rather than HV regions, as candidate sites for determining specificity differences seems more logical, but sites that are not HV have also been identified as being under strong positive selection by both methodologies in Solanaceae, Prunus, and Pyrus/Malus (1, 5, and 1 sites, respectively; see Supplementary Figs. 1–4).
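
To illustrate how a sliding-window scan can flag variable and conserved stretches of an amino acid alignment, a minimal sketch is given here. It does not reproduce the NVI calculation used in this study: as a stand-in, each alignment column is scored by its Shannon entropy, and the scores are averaged over a window. The window size, the toy alignment, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def sliding_window_variability(alignment, window=3):
    """Mean per-column entropy for every window along the alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = [column_entropy([seq[i] for seq in alignment]) for i in range(length)]
    return [
        sum(scores[i:i + window]) / window
        for i in range(length - window + 1)
    ]


# Toy alignment (hypothetical sequences): columns 3 and 4 are fully variable.
toy = ["MKTAAL", "MKSGAL", "MKPVAL", "MKQDAL"]
profile = sliding_window_variability(toy, window=2)
peak_start = profile.index(max(profile))
print(profile)
print(f"most variable window starts at column {peak_start + 1}")
```

Windows with high scores would be candidates for HV regions and windows with scores near zero for conserved regions; any threshold used to delimit them is a further assumption and affects the reported sizes and limits of the regions.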

We cannot be sure that all positively selected sites have been identified, but there is indirect experimental evidence to suggest that the strongly positively selected amino acids identified are involved in specificity determination. The location of the identified positively selected sites is different in Solanaceae, Prunus, and Pyrus/Malus (Fig. 1), although a similar percentage of sites have been identified in Solanaceae and Rosaceae (Prunus and Pyrus/Malus) (between 11.5% and 17.8%). It is unlikely that the observed pattern can be entirely attributed to the inability to detect a specific positively selected site in a given dataset. Indeed, there are two and eight positively selected sites in Solanaceae and Pyrus/Malus, respectively, that are identified as being under strong purifying selection in Prunus; there is one positively selected site in Pyrus/Malus that is under strong purifying selection in Solanaceae; and there are two positively selected sites in Solanaceae under strong purifying selection in Prunus (only sites without gaps were considered; data not shown). It is difficult to attribute all these instances to alignment errors. It should be noted that the Solanaceae/Plantaginaceae/Pyrus/Prunus alignment used in Fig. 1 is that of Igic and Kohn (2001). This alignment is used for this analysis only. About 23% of all positively selected sites (10/44; positively selected sites with indels were not considered; Fig. 1) have been identified in at least two datasets.

The common ancestor of the Solanaceae and Rosaceae families lived about 110 million years ago (MYA; Roalson and McCubbin 2003). The divergence of Prunus and Pyrus/Malus has been estimated to have occurred between 29 and 35 MYA (Wikström et al. 2001). Some authors have recently argued for an independent origin of Prunus GSI (Yamane et al. 2003; Tao et al. 2006). Given the amount of evolutionary time elapsed since the origin of GSI and the saturation at the nucleotide level, it is not possible to use phylogenies of these sequences to resolve this issue unequivocally. Our results are compatible with both a single and a multiple origin of GSI. Indeed, the orthology of Solanaceae and Pyrus/Malus has never been questioned, but the location of positively selected sites is also different in these two phylogenetic groups.