Introduction

The analysis of genetic variation in natural plant populations, the management of genetic resources, and the selection of individuals in breeding programs are frequently based on the analysis of supposedly neutral genetic markers, such as silent single nucleotide polymorphisms or simple sequence repeats. Meanwhile, the large-scale analysis of plant genomic variation has revealed a bewildering diversity of genetic polymorphisms. Point mutations, repeat variants, polyploidization, transposable element insertions, and gene duplications are abundant in plant genomes and contribute to phenotypic variation in wild and crop plants (e.g., Fu et al. 2002; Morgante et al. 2005). This diversity of genetic variants raises the question of whether they differ in their effect on plant fitness and phenotypic diversity, and how they can be utilized for genetic mapping and plant breeding.

The evolutionary fate of a genetic polymorphism depends on its fitness effect, which is expressed by the selection coefficient, s, and on the effective population size, N_e (Kimura and Crow 1963). Strongly deleterious polymorphisms are rapidly removed by selection, but slightly deleterious polymorphisms with s ≈ 1/N_e accumulate by genetic drift and reduce average population fitness (Kimura 1962; Lande 1994). In wild plant species, environmental changes, ecological specialization, or limited dispersal cause small effective population sizes. Artificial selection and inbreeding during domestication reduced levels of genetic variation in crop plants. The combination of artificial selection, hitchhiking, and low variation from high levels of inbreeding during domestication is known as the domestication-associated Hill-Robertson effect (Lu et al. 2006; Yamasaki et al. 2007). For example, Wright et al. (2005) estimated that 2–4% of maize genes show reduced variation as a consequence of artificial selection. Modern breeding populations may accumulate slightly deleterious mutations by genetic drift, by linkage to advantageous yield improvement genes, or because irrigation and fertilization buffer their fitness effects. One consequence of this process is an increased susceptibility to diseases in elite varieties (Gepts and Papa 2002).

A significant portion of phenotypic variation in plants is caused by single nucleotide polymorphisms (SNPs) that alter gene regulation, gene function, or mRNA splicing patterns (e.g., Johanson et al. 2000; Maloof et al. 2001; Cartegni et al. 2002; Stein et al. 2005; Filiault et al. 2008). Within coding regions, SNPs are differentiated into synonymous, non-synonymous, and nonsense SNPs. While synonymous SNPs do not affect protein sequence, non-synonymous SNPs (nsSNPs) cause amino acid polymorphisms and nonsense mutations lead to a premature stop codon. In populations of both wild and crop plants (Nordborg et al. 2005; Schmid et al. 2005; Hamblin et al. 2006; Lu et al. 2006), nsSNPs segregate at a lower frequency than synonymous SNPs, suggesting genome-wide purifying selection against deleterious amino acid polymorphisms. However, not all amino acid polymorphisms are deleterious because their effects on protein function depend on the location in the protein structure and the physico-chemical distance between polymorphic amino acids (Suckow et al. 1996). Several methods have been developed to predict whether nsSNPs have deleterious effects on protein function (reviewed in Ng and Henikoff 2006). They analyze the evolutionary conservation of the protein sequence or the location of the amino acid polymorphism in the protein structure.
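The distinction between synonymous, non-synonymous, and nonsense SNPs can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a SNP given as a reference codon, a 0-based position within the codon, and an alternative base; the function name `classify_snp` is hypothetical and not part of any program used in this study.

```python
from itertools import product

# Standard genetic code; codons are enumerated with the base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(ref_codon, pos_in_codon, alt_base):
    """Classify a coding SNP as 'synonymous', 'non-synonymous', or 'nonsense'.

    ref_codon    -- reference codon, e.g. "GAA"
    pos_in_codon -- 0-based position of the SNP within the codon
    alt_base     -- alternative nucleotide at that position
    """
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":  # the change introduces a premature stop codon
        return "nonsense"
    return "non-synonymous"
```

For example, changing the third base of GAA (Glu) to G leaves the amino acid unchanged, whereas changing its first base to T creates the stop codon TAA.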

For this paper we quantified the relative frequency of deleterious nsSNPs in plants and tested whether nsSNPs are useful markers for the analysis of natural selection in wild and crop plants. We used publicly available genome-wide datasets from Arabidopsis thaliana and rice to characterize variation in the ratio of deleterious to tolerated nsSNPs on a genome-wide level. Based on the prediction results, we investigated the effect of demographic history on the efficiency of purifying selection against deleterious amino acid polymorphisms. To validate the results of prediction programs for individual genes, we compared predicted fitness effects of nsSNPs with functional studies of more than a dozen functionally well-characterized genes from various plant species. Finally, we conducted simulations to test the power of prediction programs to detect changes in deleterious nsSNP frequency after population bottlenecks.

Materials and methods

Genome-wide datasets of nsSNP polymorphism

Two publicly available datasets from Arabidopsis thaliana and one dataset from rice were analyzed. The first dataset (‘2010’ hereafter) comprised annotated DNA sequence polymorphism data from 778 protein coding regions in 96 natural A. thaliana accessions (Nordborg et al. 2005). The Van-0 accession was excluded because of a high level of heterozygosity (Nordborg et al. 2005). Homologous sequences from the closely related species Arabidopsis lyrata obtained from http://www.jgi.doe.gov/Alyrata were used as an outgroup to distinguish between derived and ancestral alleles. Orthologous sequences in A. lyrata were identified by reciprocal BLAST searches. The second A. thaliana dataset (‘Perlegen’) included 121,418 nsSNPs identified by re-sequencing of 20 accessions (Clark et al. 2007). The rice dataset consisted of DNA sequence polymorphism data at 111 loci from 97 accessions of wild and domesticated rice populations (Caicedo et al. 2007). Sequences were obtained from cultivated rice Oryza sativa, its wild ancestor Oryza rufipogon, and other wild rice species, such as Oryza nivara, Oryza barthii, and Oryza meridionalis. To annotate rice sequences, a BLASTX comparison against protein sequences from TIGR rice genome annotation 5 was conducted (TIGR 2007) and best hits were used to predict the coding sequence with Wise 2.2.0 (Birney et al. 2004). O. barthii or O. meridionalis were used as outgroups.
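The use of an outgroup to distinguish derived from ancestral alleles can be sketched as follows. This is a simplified illustration, assuming biallelic sites and treating the outgroup state as ancestral; the function name and the handling of missing data are assumptions, not a description of the scripts used here.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def derived_allele_frequency(ingroup_alleles, outgroup_allele):
    """Polarize one site with an outgroup and return the derived allele frequency.

    ingroup_alleles -- one nucleotide per accession (other symbols are ignored)
    outgroup_allele -- nucleotide observed in the outgroup species
    Returns None if the site is not biallelic or the outgroup state is not
    one of the two segregating alleles (the site cannot be polarized).
    """
    counts = Counter(a for a in ingroup_alleles if a in VALID_BASES)
    if len(counts) != 2 or outgroup_allele not in counts:
        return None
    # The allele that differs from the outgroup state is taken as derived.
    derived = next(a for a in counts if a != outgroup_allele)
    return counts[derived] / sum(counts.values())
```

A site with alleles A, A, A, G in the ingroup and A in the outgroup would thus have a derived allele frequency of 0.25 for G.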

To further test prediction methods, they were applied to nsSNPs with experimentally inferred functional effects in model and crop plants (see Table 2).

Prediction of functional effects of amino acid polymorphisms

Functional effects of amino acid polymorphisms on protein function (tolerated versus deleterious) were predicted with SIFT 2.1.2 (Ng and Henikoff 2001, 2003) and MAPP (Stone and Sidow 2005) programs. SIFT uses PSI-BLAST (Altschul et al. 1997) to identify homologous proteins in a database. We used the TrEMBL 37.4 database (Boeckmann et al. 2003) for homolog identification. MAPP requires a collection of homologous proteins as well, and for that reason PSI-BLAST alignments from the TrEMBL database searches were used for both SIFT and MAPP analyses. Furthermore, MAPP requires a phylogenetic tree of homologous sequences as input. The trees were constructed from PSI-BLAST alignments of protein sequences with the neighbor joining method implemented in the SEMPHY 2.00 program (Friedman et al. 2002). Sequences were translated with BioPython 1.43 (http://www.biopython.org) and aligned with the reference sequence using PAL2NAL (Suyama et al. 2006) to obtain a codon-based nucleotide alignment.

Forward-in-time simulations of evolutionary scenarios

To investigate the effect of demographic history on the relative frequency of deleterious nsSNPs, forward-in-time simulations were conducted. Sequences derived from the lacI gene of E. coli were used as starting sequences for the simulation; the functional effects of amino acid substitutions in this protein are known from mutagenesis studies (Suckow et al. 1996). Based on inferred effects, a fitness value was assigned to individuals in the population by adding up positive and negative fitness values of nsSNPs. In each generation, sequences were mutated and individuals for the next generation were selected according to their fitness value. Details of the simulation procedure are described in the Supplementary Information.
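The general structure of one simulated generation (fitness from summed nsSNP effects, fitness-proportional sampling of parents, then mutation) can be sketched as below. This is not the simulation code described in the Supplementary Information: the representation of an individual as a list of fitness effects, the floor of fitness at zero, and the per-individual mutation rate in `add_random_mutation` are illustrative assumptions.

```python
import random


def individual_fitness(effects, base=1.0):
    """Additive fitness: base value plus the summed effects, floored at zero."""
    return max(0.0, base + sum(effects))


def next_generation(population, new_size, mutate, rng):
    """Draw new_size offspring with probability proportional to parental fitness.

    population -- list of individuals; each is a list of nsSNP fitness effects
    mutate     -- function (individual, rng) -> individual, applied to each copy
    """
    weights = [individual_fitness(ind) for ind in population]
    parents = rng.choices(population, weights=weights, k=new_size)
    # Copy each parent before mutating so that parents are left unchanged.
    return [mutate(list(parent), rng) for parent in parents]


def add_random_mutation(individual, rng, rate=0.01, penalty=1e-6):
    """Hypothetical mutation step: add one slightly deleterious nsSNP."""
    if rng.random() < rate:
        individual.append(-penalty)
    return individual
```

Passing a different `new_size` in successive generations is one simple way to impose a change in population size.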

We simulated a model with a population bottleneck using the following model parameters (Fig. 1). The starting population size, N_0, was set to 10,000 for 5,000 generations. Then, population size was reduced to 100 individuals for 250 generations. After the bottleneck, the population size, N_2, was increased to 10,000 individuals and the simulation was run for another 7,250 generations. These parameter values correspond to the domestication model of Innan and Kim (2004). The nucleotide substitution rate of plants was estimated to range from 5 to 30 × 10^−9, although the true rate is probably closer to the lower boundary (Wolfe et al. 1987). Therefore, a mutation rate μ = 10^−8 was chosen. To ensure that |s| ≪ 1/N_e and the fixation of a polymorphism by drift can occur during the bottleneck, a fitness penalty (see Supplementary Information) was set to P = 10^−6, corresponding to |s| ∈ [10^−6, 10^−5].
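The demographic schedule given above can be written as a small function of the generation number. The sketch below uses the stated parameter values; the 0-based generation index and the function name are assumptions made for illustration.

```python
def population_size(generation, n0=10_000, n1=100, t1=5_000, bottleneck_length=250):
    """Population size in a given generation (0-based) of the bottleneck model.

    n0 individuals before generation t1, n1 individuals during the bottleneck,
    and n0 individuals again afterwards.
    """
    if generation < t1:
        return n0
    if generation < t1 + bottleneck_length:
        return n1
    return n0


# Total run length: pre-bottleneck phase, bottleneck, and recovery phase.
TOTAL_GENERATIONS = 5_000 + 250 + 7_250

# The largest selection coefficient considered is far below 1/N during the bottleneck.
MAX_ABS_S = 1e-5
assert MAX_ABS_S < 1.0 / population_size(5_000)
```

With these values the run spans 12,500 generations, of which 250 are spent at the reduced size.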

Fig. 1 Demographic model used for the simulation of the bottleneck and domestication models. Population size was reduced to N_1 between generations T_1 and T_2, and later increased to the previous size, N_0

We compared a null model with a test model and explored two scenarios. The first scenario describes a simple bottleneck in one population and a constant size in the other; in both populations, all loci evolved under selection. The second scenario was identical to the first one with the exception that selection acted on only one of the 10 simulated genes and the others were allowed to evolve neutrally. In the null model of this scenario, all loci evolved under selection. Simulations of each scenario were repeated 1,000 times and conducted with and without recombination between loci.

Programs were written in Python and statistical tests were conducted with GNU R 2.5.1.

Results

Comparison of the SIFT and MAPP programs

We first compared prediction results of the SIFT and MAPP programs, which are summarized in Table 1. In the three datasets combined, a total of 65,355 nsSNPs were analyzed. The SIFT program made predictions for 47,305 and MAPP for 47,487 nsSNPs, and predictions were identical for 87.3% of polymorphisms (Supplementary Information). Furthermore, there was a high correlation between SIFT and MAPP prediction scores (Spearman’s r_s = 0.72). The correlation of either SIFT or MAPP predictions with an analysis using a BLOSUM62 matrix was much lower (r_s = 0.28 and 0.30, respectively). There were no significant differences in the distribution of both categories (deleterious vs. tolerated) between the three datasets within each program (G test, SIFT: P = 0.080, MAPP: P = 0.583). These comparisons indicate that SIFT and MAPP predictions are quite similar, although they differ in details as outlined below. In the following sections we present the MAPP predictions and show the SIFT results only if they differ from the MAPP output.

Table 1 Summary of predictions with SIFT and MAPP

Non-synonymous SNP frequencies in Arabidopsis thaliana

The 2010 dataset comprises 383 loci from A. thaliana that could be aligned to orthologous genes from A. lyrata. The data included 1,174 nsSNPs, of which 919 were predicted by MAPP. Figure 2 shows the frequency distribution of deleterious nsSNPs among A. thaliana accessions. The SIFT prediction for Col-0 shows a significantly lower proportion of deleterious nsSNPs compared to all other accessions (G test, P < 0.001, Fig. 2a). In the MAPP analysis, there was no difference between Col-0 and the other accessions (G test, P = 0.4125; Fig. 2b). The SIFT algorithm assumes that the external BLAST database contains solely tolerated nsSNPs. Since Col-0 was used as the reference sequence for the A. thaliana genome project, its sequences are included in external databases and polymorphisms in Col-0 are classified as tolerated by SIFT. This suggests that reference sequences should be removed from the alignment input for SIFT. The MAPP algorithm does not make such an assumption and is unbiased with respect to the reference sequence.
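A comparison of this kind (deleterious versus tolerated counts in one accession against the pooled remaining accessions) corresponds to a G test on a 2 × 2 table. The following sketch implements the uncorrected G statistic with one degree of freedom; whether a continuity or Williams correction was applied in the original analysis is not stated, and the counts in the usage note are invented for illustration.

```python
import math


def g_test_2x2(a, b, c, d):
    """G test of independence for the table [[a, b], [c, d]].

    Returns (G, p), where p is the chi-square tail probability with 1 df.
    No continuity or Williams correction is applied.
    """
    observed = [[a, b], [c, d]]
    total = a + b + c + d
    row_sums = [a + b, c + d]
    col_sums = [a + c, b + d]
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g += 2.0 * o * math.log(o / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding errors
    # Survival function of the chi-square distribution with 1 df.
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p
```

For instance, `g_test_2x2(5, 95, 30, 70)` compares a hypothetical accession with 5 deleterious and 95 tolerated nsSNPs against a group with 30 and 70, respectively.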

Fig. 2 Distribution of the relative proportions of deleterious nsSNPs among accessions in the 2010 (a and b) and Perlegen (c and d) datasets

The frequency distributions of nsSNPs in the complete 2010 dataset show that derived nsSNPs with a predicted deleterious effect segregate at lower allele frequencies than tolerated nsSNPs (Fig. 3a; Kolmogorov–Smirnov test, P = 0.007; Supplementary Information). The ratio of deleterious to tolerated nsSNPs decreases linearly with derived nsSNP frequencies (Fig. 3d).
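The Kolmogorov–Smirnov comparison of two allele frequency distributions rests on the maximum distance between their empirical cumulative distribution functions. A minimal sketch of the two-sample statistic is given below; it omits the P value calculation, and the function name is an assumption.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the largest absolute difference between the empirical cumulative
    distribution functions of the two samples.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)  # fraction of a that is <= value
        cdf_b = bisect_right(b, value) / len(b)  # fraction of b that is <= value
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Here `sample_a` and `sample_b` would hold the derived allele frequencies of deleterious and tolerated nsSNPs, respectively.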

Fig. 3 Allele frequency distribution of tolerated and deleterious nsSNPs for the 2010 (a), Perlegen (b), and rice (c) datasets. The proportion of nsSNPs for each frequency class was calculated relative to the total number of nsSNPs in the corresponding fitness category. The ratio of absolute numbers of deleterious to tolerated nsSNPs of each dataset is shown in d, e, and f, respectively. The dashed lines represent the linear regression line

The Perlegen data included 121,418 SNPs from 20 A. thaliana accessions, of which 63,949 SNPs were retained for further analysis after alignment with A. lyrata orthologs. Predictions were obtained for 46,402 nsSNPs (Supplementary Information). In contrast to the 2010 data, the Col-0 accession exhibits a significantly lower proportion of deleterious nsSNPs than other accessions with both prediction programs (G test, P < 0.0001; Fig. 2c, d). The average proportion of deleterious nsSNP positions is 0.087 and 0.133 for the 2010 and Perlegen data, respectively, if the Col-0 accession is excluded. This difference is significant (t test, P < 0.0001). As in the 2010 data, derived deleterious nsSNPs segregate at a lower allele frequency than tolerated nsSNPs (Fig. 3b; Kolmogorov–Smirnov test, P < 0.0001) and the ratio of deleterious to tolerated nsSNPs decreases linearly with derived SNP frequencies (Fig. 3e). The regression slopes of this decay are nearly identical in the 2010 (−0.205) and the Perlegen data (−0.217).
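The decay of the deleterious-to-tolerated ratio with derived allele frequency can be summarized by binning nsSNPs into frequency classes and fitting an ordinary least-squares line to the per-class ratios. The sketch below shows one way to do this; the number of frequency classes, the use of class midpoints as the x variable, and the function names are assumptions, since the exact procedure behind the reported slopes is not described in this section.

```python
def ratio_by_frequency_class(deleterious_freqs, tolerated_freqs, n_bins=10):
    """Ratio of deleterious to tolerated nsSNP counts per frequency class.

    Returns a list of (class midpoint, ratio) pairs; classes without
    tolerated nsSNPs are skipped because the ratio is undefined there.
    """
    def bin_counts(freqs):
        counts = [0] * n_bins
        for f in freqs:
            # A frequency of exactly 1.0 is placed in the last class.
            counts[min(int(f * n_bins), n_bins - 1)] += 1
        return counts

    del_counts = bin_counts(deleterious_freqs)
    tol_counts = bin_counts(tolerated_freqs)
    midpoints = [(i + 0.5) / n_bins for i in range(n_bins)]
    return [(m, d / t) for m, d, t in zip(midpoints, del_counts, tol_counts) if t > 0]


def least_squares_slope(points):
    """Slope of the ordinary least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx
```

A negative slope indicates that the relative share of deleterious nsSNPs declines toward higher derived allele frequencies.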

Population genetics theory predicts that purifying selection is less effective in smaller and endemic populations. To test whether the proportion of deleterious nsSNPs differs between populations, we grouped the accessions into monophyletic clades according to Supplementary Fig. 1 of Nordborg et al. (2005). The Van-0 and Col-0 accessions were not included for reasons discussed above. The distribution of deleterious and tolerated nsSNPs was compared in the remaining clades (Supplementary Table 5). We observed that genetically divergent accessions, such as Cvi-0, Shahdara, C24, and Bur-0 contain a significantly higher proportion of deleterious amino acid polymorphisms than other accessions. The results are consistent between the 2010 and Perlegen data but the latter data showed more significant results because of a larger number of nsSNPs.

Since the Perlegen data cover the complete genome, we tested whether gene families differ in their proportions of deleterious nsSNPs. As has been noted before (Clark et al. 2007), the average number of nsSNPs per gene is highly variable between gene families (Fig. 4). The NBS-LRR and receptor-like kinases are the most polymorphic, and the cytoplasmic ribosomal proteins the least polymorphic gene families. In contrast, the proportion of putatively deleterious polymorphisms does not vary much between gene families. Only the conserved gene family of cytoplasmic ribosomal proteins contains a significantly increased proportion of deleterious nsSNPs (G test, P = 0.005); bHLH transcription factors exhibit a significantly decreased proportion of deleterious nsSNPs (P = 0.03).

Fig. 4 Differences in the relative proportion of deleterious nsSNPs among protein families based on MAPP predictions

Non-synonymous SNP frequencies in rice

Domesticated species may accumulate slightly deleterious mutations because of relaxed purifying selection. To test this hypothesis, we analyzed a dataset from different rice species with 232 nsSNPs and 166 MAPP predictions (Table 1 and Supplementary Information). As in the A. thaliana data, there was an excess of rare over high-frequency nsSNPs, but allele frequency distributions of deleterious and tolerated nsSNPs in rice were not significantly different (Kolmogorov–Smirnov test, P = 0.1044; Fig. 3c). In comparison to wild O. rufipogon, both domesticated rice species O. sativa ssp. japonica and O. sativa ssp. indica show a higher proportion of deleterious nsSNPs. However, these differences are not statistically significant due to the small sample sizes (G test, P = 0.7 and P = 0.4). Figure S1 from Caicedo et al. (2007) shows a bifurcation between the monophyletic groups of wild rice O. rufipogon from different origins and the two domesticated species O. sativa ssp. japonica and O. sativa ssp. indica together with all O. rufipogon accessions from China. Such a pattern results from two independent domestication events in China (Kovach et al. 2007). Testing this phylogenetic grouping revealed a significantly higher proportion of deleterious nsSNPs in the group containing the domesticated rice species compared to the non-Chinese O. rufipogon accessions (G test, P = 0.045). Instead of a linear decay in the relative frequency of deleterious polymorphisms we observed a peak at intermediate frequency nsSNPs (Fig. 3f). This pattern may be interpreted as a domestication-associated enrichment of deleterious mutations. However, since the total number of nsSNPs in this analysis is quite low, it could represent random noise.

Prediction results for polymorphisms with an inferred phenotype

To test the sensitivity of the predictions with the SIFT and MAPP programs, we assembled a dataset from the literature that comprised 68 nsSNPs with experimentally inferred functional effects. Predictions of functional effects were obtained for 53 nsSNPs (Table 2). As many as 8 of the 15 substitutions without a prediction occur in the A. thaliana PHYB flowering control gene (Filiault et al. 2008) and all of them are associated with phenotypic changes. They are located in highly variable regions of the protein and no prediction was possible because of the low alignment quality of these regions. The same problem was also observed in other proteins and demonstrates that sequence-based prediction methods are restricted to protein regions that can be aligned reliably.

Table 2 Prediction results for nsSNPs with inferred phenotypic effects

Among 53 nsSNPs with a MAPP prediction, 20 were found to have a significant effect and 14 to have no effect on protein function (Table 2). The functional effects of the remaining 19 substitutions are unknown because their function was not studied or no functional effect was found. In some cases, groups of nsSNPs were shown to have an effect on protein function, as in the barley eIF4E gene (Stein et al. 2005). In the complete dataset, MAPP classified a single neutral nsSNP as deleterious, which corresponds to a false positive rate (FPR) of 7.1% (1 out of 14). The false negative rate (FNR) was higher: only 11 of the 20 nsSNPs with functional effects were predicted correctly, which corresponds to an FNR of 45% (9 out of 20).
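The two error rates follow directly from the counts given above, as the short calculation below shows. The confusion-matrix entries are derived from the text (14 neutral nsSNPs with one misclassified, 20 functional nsSNPs with 11 correctly predicted); the function name is chosen for illustration.

```python
def error_rates(false_positives, true_negatives, false_negatives, true_positives):
    """Return (false positive rate, false negative rate) from confusion counts."""
    fpr = false_positives / (false_positives + true_negatives)
    fnr = false_negatives / (false_negatives + true_positives)
    return fpr, fnr


# 14 neutral nsSNPs: 1 predicted deleterious, 13 predicted tolerated.
# 20 functional nsSNPs: 11 predicted deleterious, 9 predicted tolerated.
fpr, fnr = error_rates(false_positives=1, true_negatives=13,
                       false_negatives=9, true_positives=11)
print(f"FPR = {fpr:.1%}, FNR = {fnr:.1%}")  # FPR = 7.1%, FNR = 45.0%
```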

Forward-in-time simulations of demographic models

The neutral theory of molecular evolution states that the fixation probability of slightly deleterious polymorphisms depends on the effective population size (Kimura 1962). Therefore, we hypothesized that the observed ratio of deleterious to tolerated nsSNPs provides information on the extent of purifying selection acting on a population. We used forward simulations to test the power of detecting the effect of bottlenecks on the frequency of deleterious nsSNPs. The fitness assessment was based on empirical data from mutagenesis experiments of the lacI gene in E. coli. Even though fitness effects of mutations likely differ between genes (Li 1997), we assumed that LacI can be taken as a representative protein.
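The dependence of fixation probability on population size can be made concrete with the diffusion approximation of Kimura (1962). The sketch below assumes a single new mutation in a randomly mating diploid population with genic selection, which is a textbook simplification and not the model of the forward simulations; the function name and the overflow guard are assumptions.

```python
import math


def fixation_probability(s, n_e, n=None):
    """Fixation probability of a single new mutation (diffusion approximation).

    s   -- selection coefficient (negative for deleterious mutations)
    n_e -- effective population size
    n   -- census population size (defaults to n_e); initial frequency is 1/(2n)
    """
    n = n_e if n is None else n
    p0 = 1.0 / (2 * n)
    if abs(s) < 1e-12:
        return p0  # neutral limit
    exponent = -4.0 * n_e * s
    if exponent > 700.0:
        return 0.0  # strongly deleterious: effectively never fixed (avoids overflow)
    # (1 - exp(-4*N_e*s*p0)) / (1 - exp(-4*N_e*s)), written with expm1 for accuracy.
    return math.expm1(exponent * p0) / math.expm1(exponent)
```

Under these assumptions, a mutation with s = −0.001 fixes at about 80% of the neutral rate when N_e = 100, but essentially never when N_e = 10,000.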

On average, three nsSNPs were observed per locus per individual in the simulated populations. Power was measured by counting the number of significant G tests of homogeneity (P < 0.01) for differences in the proportion of deleterious to tolerated nsSNPs between bottlenecked and non-bottlenecked populations. Tests were carried out with different sample sizes (10, 100, 1,000, or 10,000 individuals from both populations), but with a constant number of 10 genes. The power to detect differences between bottlenecked and non-bottlenecked populations increased with sample size and was generally higher in simulations without recombination (Fig. 5), due to increased genetic drift from stronger background selection in non-recombining populations (Charlesworth et al. 1993). The simulations revealed no significant power differences between the bottleneck and domestication (bottleneck and artificial selection) scenarios (Fig. 5).
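The logic of the power calculation (the fraction of replicates in which a G test at P < 0.01 is significant) can be sketched with a strongly simplified stand-in for the forward simulations. Here the deleterious counts of the two populations are drawn as independent binomial samples with assumed proportions, so the sketch illustrates only how power is counted, not the simulated evolutionary process; all parameter names and values are hypothetical.

```python
import math
import random


def g_test_p_value(a, b, c, d):
    """P value of an uncorrected G test on the 2 x 2 table [[a, b], [c, d]]."""
    observed = [[a, b], [c, d]]
    total = a + b + c + d
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    g = 0.0
    for i in range(2):
        for j in range(2):
            if observed[i][j] > 0:
                expected = rows[i] * cols[j] / total
                g += 2.0 * observed[i][j] * math.log(observed[i][j] / expected)
    return math.erfc(math.sqrt(max(g, 0.0) / 2.0))  # chi-square tail, 1 df


def estimate_power(p_del_1, p_del_2, n_snps, n_replicates=1000, alpha=0.01, seed=0):
    """Fraction of replicates with a significant difference in deleterious proportion.

    p_del_1, p_del_2 -- assumed proportions of deleterious nsSNPs in each population
    n_snps           -- number of nsSNPs sampled per population and replicate
    """
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_replicates):
        d1 = sum(rng.random() < p_del_1 for _ in range(n_snps))
        d2 = sum(rng.random() < p_del_2 for _ in range(n_snps))
        if g_test_p_value(d1, n_snps - d1, d2, n_snps - d2) < alpha:
            significant += 1
    return significant / n_replicates
```

As in the simulations, the estimated power in this sketch rises with the amount of data per replicate.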

Fig. 5 Power to detect the effect of a bottleneck on the frequency of deleterious nsSNPs with SIFT and MAPP under a bottleneck model (a) and a domestication model (b). Simulations were run with and without recombination between loci

Discussion

Signature of purifying selection in genome-wide data

All three empirical genome-wide datasets showed variable proportions of deleterious nsSNPs among accessions and allele frequencies. Overall, deleterious nsSNPs segregate at a lower frequency than tolerated nsSNPs. The class of rare nsSNPs with a frequency of <10% harbors a higher proportion of deleterious polymorphisms than high-frequency nsSNPs (Fig. 3). This pattern is consistent with either purifying or balancing selection. Under purifying selection, deleterious nsSNPs are selected against and remain at low frequency (Wong et al. 2003). Balancing selection may produce low-frequency polymorphisms if multiple alleles or haplotypes are favored by selection. This seems unlikely because A. thaliana and O. sativa are mainly self-fertilizing species with low levels of heterozygosity (Oka 1988; Abbott and Gomes 1989). Hence, balancing selection caused by heterozygote advantage is probably rare. Alternatively, local adaptation in self-fertilizing species can lead to patterns of genetic variation resembling balancing selection (Hedrick 1998). Some nsSNPs classified as deleterious may in fact cause advantageous functional changes in encoded proteins and evolve by positive or balancing selection. In this case, a higher proportion of ‘deleterious’ nsSNPs may be expected in gene families, such as disease resistance genes that likely are more frequently targets of positive selection than other gene families (Clark et al. 2007). However, a comparison of total nsSNP counts with the proportion of deleterious nsSNPs in different gene families shows that the proportion of deleterious polymorphisms is remarkably similar between protein families (Fig. 4).

We observed a nearly identical decrease in the relative frequency of deleterious nsSNPs with increasing nsSNP frequency in the two A. thaliana datasets (Fig. 3d, e). This pattern is noteworthy because the two datasets differ strongly in the numbers of SNPs and accessions. Since high-frequency SNPs are on average older than low-frequency SNPs, deleterious nsSNPs do not reach higher frequencies as often as tolerated amino acid polymorphisms. A plausible explanation is that deleterious polymorphisms are removed by purifying selection. In this case, the slope of the decrease is influenced by the strength of purifying selection and may allow the extent of purifying selection to be estimated in populations and species of different effective population sizes. Taken together, our analyses support genome-wide purifying selection rather than gene-specific positive selection as an explanation for the lower frequency of deleterious nsSNPs.

Different proportions of deleterious nsSNPs in A. thaliana accessions

In A. thaliana, significant differences in relative frequencies of deleterious nsSNPs between groups of accessions were observed in the Perlegen dataset (Supplementary Table 5). In particular, the Cvi-0 accession exhibits an increased proportion of deleterious nsSNPs compared to other accessions. Cvi-0 originated from Cape Verde, a small group of islands that is located 500 km away from the mainland at the edge of the species distribution (Hoffmann 2002). The high frequency of deleterious nsSNPs is consistent with its phenotypic and genetic divergence from other A. thaliana accessions (Alonso-Blanco et al. 1999; Schmid et al. 2003; Nordborg et al. 2005). The Shahdara accession, which originated from an isolated Central Asian glacial refugium, also shows a higher proportion of deleterious nsSNPs. The increased proportions of deleterious nsSNPs in both accessions may result from local adaptation to specific environmental conditions at the edge of the species range or the random fixation of slightly deleterious nsSNPs in small island populations due to genetic drift. The enrichment of deleterious nsSNPs is observed on a genome-wide level and thus is likely caused by higher levels of genetic drift. However, further studies are required to test whether the extensive local population structure in A. thaliana (Nordborg et al. 2005; Schmid et al. 2006), particularly in glacial refugia (Pico et al. 2008), contributes to variable levels of deleterious nsSNPs in response to differential selection and drift.

Consequences of domestication in rice

The rice sequences were used to examine differences between wild ancestors and domesticated cultivars. Both subspecies O. sativa ssp. japonica and O. sativa ssp. indica were probably domesticated independently in China from O. rufipogon (Kovach et al. 2007). For this reason, O. sativa and O. rufipogon cannot be grouped into separate monophyletic clades. Instead, O. sativa and Chinese O. rufipogon were combined into a single clade and compared to O. rufipogon accessions from a different geographic origin (Caicedo et al. 2007). Non-Chinese populations of O. rufipogon form a sister group of cultivated rice and the Chinese O. rufipogon. We observed fewer deleterious substitutions in wild than in cultivated rice. This result supports the hypothesis that domestication bottlenecks, artificial selection, and reduced purifying selection lead to an enrichment of deleterious amino acid substitutions, consistent with a previous study (Lu et al. 2006). Furthermore, intermediate frequency deleterious nsSNPs are enriched in Oryza suggesting a domestication effect, although this may be a statistical artifact of too few data points. Our simulations suggested that more data are necessary to reliably infer domestication effects on nsSNP distributions.

Application and comparison of nsSNP prediction programs

Plant genomes contain a high proportion of duplicated genes and the reliable inference of orthology and paralogy is crucial. It is better for nsSNP analysis to use few distant orthologous proteins than too many paralogous sequences (Stone and Sidow 2005), because paralogs with a change in function may harbor different amino acids at critical positions. The use of proteins with altered function causes a decrease of sensitivity (Ng and Henikoff 2002; Stone and Sidow 2005) and paralogous proteins in the PSI-BLAST alignments might have considerably decreased our prediction accuracy.

By automating the alignment with PSI-BLAST, many sequences which overlap only in certain domains with the original protein were included in the alignment and remaining regions were filled with gaps that prevent the analysis of the corresponding regions. For this reason, the effects of eight substitutions in the PHYB protein could not be analyzed, although there was a correlation with certain phenotypes for three of them (Filiault et al. 2008). Generally, protein termini are difficult to align and were excluded from the prediction, but genome sequencing projects for numerous plant species will lead to improved sequence alignments and a higher prediction accuracy. The test data revealed a reasonable error rate of the predictions. Since false predictions can be assumed to occur uniformly across the genome, this observation suggests that the use of prediction programs is more appropriate for genome-wide analyses rather than inferring nsSNPs in individual genes.

The SIFT and MAPP prediction programs do not directly estimate the selection coefficients of nsSNPs. Prediction scores may be used as proxies for selection coefficients under the assumption that substitutions with higher scores have a stronger impact on protein function and fitness. Since a relationship between prediction scores and selection coefficients was not formally established, we used cutoff scores for a binary classification into tolerated and deleterious nsSNPs.

Detection of demographic effects by forward-in-time simulations

We used forward-in-time simulations to evaluate the effect of demographic history on the ratio of deleterious to tolerated nsSNPs and to estimate the power of the prediction programs to detect those effects. The total sequence length, i.e., the product of locus number, locus length, and number of individuals, was used as a first approximation to compare simulated and empirical datasets. The rice dataset had a total length of 0.8 × 10^6, the 2010 dataset 4.6 × 10^6, and the Perlegen dataset about 130 × 10^6 amino acids. In comparison, a simulation with 1,000 sampled individuals has a length of 6.6 × 10^6 amino acid residues in bottlenecked and non-bottlenecked populations combined. Such a dataset provided a power of more than 80% to detect differences in the ratio of deleterious to tolerated nsSNPs between the demographic histories of populations (Fig. 5). The rice and Perlegen data, but not the 2010 data, revealed significant differences between groups of accessions. In contrast, the frequency distributions of deleterious and tolerated nsSNPs were significantly different in the 2010 and Perlegen, but not the rice dataset. The different results of the 2010 and Perlegen datasets are consistent with the simulations and show that whole genome resequencing is preferable to multilocus analysis for inferring the effects of demographic history on nsSNP frequencies.
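The total sequence length used for this comparison is a simple product, which also allows the implied locus length of the simulations to be back-calculated. The sketch below assumes that "1,000 sampled individuals" means 1,000 from each of the two populations and uses the 10 simulated genes mentioned in the Results; the resulting locus length of 330 residues is an inference from these numbers, not a value stated in the text.

```python
def total_sequence_length(n_loci, locus_length, n_individuals):
    """Total length as the product of locus number, locus length, and sample size."""
    return n_loci * locus_length * n_individuals


def implied_locus_length(total_length, n_loci, n_individuals):
    """Locus length implied by a given total sequence length."""
    return total_length / (n_loci * n_individuals)


# 10 genes, 1,000 individuals from each of two populations, 6.6 million residues.
locus_length = implied_locus_length(6.6e6, n_loci=10, n_individuals=2 * 1_000)
print(locus_length)  # 330.0
```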

Conclusions

Neutral genetic polymorphisms, such as silent SNPs, are markers of choice for inferring the demographic history of a species. The present work shows that functional markers, such as nsSNPs, can be utilized to infer the consequences of demographic history on the interplay of natural selection and genetic drift in plants. There is little power in the analysis of single genes, but genome-wide data carry enough information for comparisons of populations or species. NsSNPs are therefore useful markers for characterizing endangered plant species, plant genetic resources, or breeding populations. Since inbreeding and hitchhiking in response to artificial selection decrease genetic diversity, it will be interesting to infer deleterious nsSNP frequencies in germplasm of crop species to quantify fitness effects of plant domestication and modern breeding programs. Current genome sequencing projects of crop species and next generation sequencing technologies will greatly facilitate such investigations.