Introduction

Understanding the evolutionary forces that shape patterns of nucleotide polymorphism remains a critical goal of molecular population geneticists. Genomic analyses of the patterns of polymorphism and divergence suggest that these values are generally conserved between linked loci; that is, neighboring loci have more similar levels of polymorphism and divergence than do random pairs of loci (Hahn 2006; Matassi et al. 1999). For example, in a survey of the Drosophila X chromosome, Hahn (2006) determined that this correlation extended up to 300 kb for polymorphism values and almost 100 kb for divergence values. This correlation also extends to the rate of protein evolution. In the mouse genome, the rate of protein evolution, as gauged by the ratio of nonsynonymous divergence (Ka) to synonymous divergence (Ks), is highly correlated between neighboring genes up to 1 cM (2 Mbp) apart (Williams and Hurst 2000).

Knowing the extent to which the patterns of polymorphism and divergence are related between linked loci has significant bearing on the ability to identify loci that potentially contribute to evolutionary change (e.g. Akey et al. 2002; Diller et al. 2002; Mousset et al. 2003; Nurminsky et al. 1998; Payseur et al. 2002; Shimizu et al. 2004; Tiffin et al. 2004; Vigouroux et al. 2002; Wiener et al. 2003). For example, positive selection acting on a newly-arisen beneficial mutation can cause dramatic decreases in nucleotide variation through a selective sweep, as the beneficial allele rises to fixation faster than mutation can introduce new alleles to the population. Alternatively, balancing selection can act to maintain divergent alleles, increasing nucleotide diversity at selected loci (Charlesworth 2006). Through the action of genetic hitchhiking, these signatures of selection predictably extend to neighboring loci to an extent dependent on the strength of selection and the recombination rate (Barton 2000; Kaplan et al. 1989; Maynard Smith and Haigh 1974). Demographic forces, such as population bottlenecks, can also affect nucleotide diversity levels; however, demographic forces affect genome-wide patterns of nucleotide variation. In contrast to both selection and demography, a stochastic decrease in molecular variation caused by genetic drift is more localized and does not necessarily extend to neighboring loci. Thus, when trying to infer the evolutionary history of a particular locus, it is important to quantify variation at linked loci.

The extent to which linked neutral variation is affected by selection is determined, in part, by recombination, which severs the association of linked loci and constrains the extent to which evolutionary histories between linked loci are shared. In a selfing species, such as the model plant Arabidopsis thaliana, the low effective recombination rate should increase the correlation of evolutionary histories between linked loci (Glemin et al. 2006). For example, the signature of selection surrounding the pseudo-SCR1 locus involved in the loss of self-incompatibility spans up to 35 kb (Shimizu et al. 2004). However, even regions containing loci suspected of being targets of selection can exhibit variable patterns of polymorphism between neighboring loci (Hagenblad and Nordborg 2002; Haubold et al. 2002; Shepard and Purugganan 2003). In such cases gene conversion, versus reciprocal recombination, may play a more prominent role in structuring patterns of variation between neighboring loci (Haubold et al. 2002).

In order to assess the degree to which the evolutionary histories of linked loci are correlated in the A. thaliana genome, we examined the molecular population genetics of four genomic regions containing previously identified low-diversity genes. Although previous population genetics studies suggest that the low nucleotide diversity at three of our four focal genes is consistent with the action of positive selection (Barrier et al. 2003; Moore and Purugganan 2003; Olsen et al. 2002), there was considerable variation in the levels and patterns of nucleotide polymorphism in neighboring loci in all four genomic regions. Indeed, application of a test of selection which takes into account patterns of polymorphism and divergence of linked loci supported the selection hypothesis only for the region containing At3g03700. For three of the genomic regions, patterns of recombination and the extent of linkage disequilibrium correspond to the degree of heterogeneity in evolutionary histories between linked loci. However, one region that exhibits considerable dissimilarity between neighboring loci is found in a chromosomal region with the lowest estimates of recombination that exhibits extensive linkage disequilibrium. We discuss the implications of these data for future population genetics analyses and genomics studies in A. thaliana.

Materials and Methods

Isolation and Sequencing of Alleles

Genomic DNA was isolated from young leaves of 14 A. thaliana accessions (Supplementary Table 1) and one Arabidopsis lyrata individual using the Plant DNeasy Mini Kit (Qiagen, Valencia, CA). DNA was isolated from an additional 30 ecotypes for a subset of genes surrounding At1g04300 (Supplementary Table 1). The A. lyrata individual was grown from seed isolated from a Karhumaki, Russia population and was provided by O. Savolainen and Helmi Kuittinen (University of Oulu, Oulu, Finland). Between 0.7 and 1 kb of coding region was sequenced for each locus. In addition, previously published sequence data from additional ecotypes were incorporated when available (Barrier et al. 2003; Moore and Purugganan 2003; Olsen et al. 2004; Olsen et al. 2002).

PCR primers were designed based on the Col-0 gene sequences using Primer3 (Rozen and Skaletsky 2000; Supplementary Table 2). PCR of A. thaliana and A. lyrata samples was performed with Taq DNA polymerase (Roche, Indianapolis, IN). DNA fragments amplified from A. thaliana were purified using the QIAquick Gel Extraction Kit (Qiagen) and directly sequenced. Amplified A. lyrata products were subcloned using the TA TOPO PCR Cloning Kit (Invitrogen, Carlsbad, CA), and plasmid DNA from six independent clones was sequenced. DNA sequencing was conducted at the NCSU Genome Research Laboratory with a Prism 3700 96-capillary automated sequencer (Applied Biosystems, Foster City, CA). All polymorphisms were visually confirmed, and ambiguous polymorphisms were rechecked by PCR reamplification and sequencing. GenBank accession numbers for these genes are EU351021-EU352144.

Molecular Evolution and Population Genetic Data Analysis

Sequences were aligned against the A. thaliana sequence previously identified in the Arabidopsis whole genome sequence (Arabidopsis Genome Initiative 2000). The A. lyrata ortholog was used as the outgroup in the analyses. Levels of silent site (synonymous and noncoding) nucleotide diversity were estimated as π (Nei 1987) and θW (Watterson 1975). Nucleotide divergence at silent sites (KJC) was determined with the Jukes and Cantor correction (Jukes and Cantor 1969) according to Nei (1987). All estimates of polymorphism and divergence were determined using DnaSP 4.0 (Rozas and Rozas 1999). DnaSP 4.0 was also used to calculate the Tajima’s D and Fay and Wu’s H test statistics (Fay and Wu 2000; Tajima 1989).
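As an illustration of the two diversity estimators named above, the following minimal sketch computes per-site π and θW from a small alignment. It is not the DnaSP implementation: the function names are hypothetical, and it assumes equal-length sequences with no gaps or missing data, counts all sites rather than silent sites only, and applies no multiple-hit correction.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Return indices of alignment columns that carry more than one allele."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def nucleotide_diversity(seqs):
    """Per-site pi: mean number of pairwise differences divided by length."""
    total_diffs = 0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        total_diffs += sum(x != y for x, y in zip(a, b))
        n_pairs += 1
    return total_diffs / n_pairs / len(seqs[0])


def watterson_theta(seqs):
    """Per-site Watterson estimator: S / a1 / L, with a1 = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a1 = sum(1.0 / i for i in range(1, n))
    return len(segregating_sites(seqs)) / a1 / len(seqs[0])
```

Restricting the calculation to silent sites, as in the analyses described here, would additionally require the coding frame and annotation of each fragment, which this sketch does not model.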

Statistical Methods

Testing for Correlation of Population Genetic Statistics Between Neighboring Loci

To assess the degree of similarity in population genetic statistics (πsil, θW, θ/K, Tajima’s D, and Fay and Wu’s H) between neighboring loci within each region, we used a randomization method based on the one Williams and Hurst (2000) employed to test the correlation of protein evolution rates between adjacent loci. This stands in contrast to studies in other organisms that have used autocorrelation analysis to test if neighboring loci have more similar polymorphism and divergence values (e.g. Hahn 2006; Matassi et al. 1999). For our dataset, neither autocorrelation analysis (Diggle 1990) nor spectral analysis (Bloomfield 2000) was appropriate, as such analyses have low power to detect patterns with small numbers of observations (SAS Institute 1990). Using our method, the difference in each statistic between pairs of neighboring loci in the four regions (e.g. Δπsil, ΔθW, Δ[θ/K], ΔD, ΔH) was first calculated. The means of these values for each region were then calculated, giving us a measure of the similarity in each statistic between neighboring loci. In order to assess the significance of these values, 10,000 random rearrangements of each statistic within a given region were produced. For each randomized data set, all values were randomly reallocated to positions within each region and the mean change in each statistic between neighboring loci was calculated (e.g. mean Δπsil, mean ΔθW, mean Δ[θ/K], mean ΔD, mean ΔH). If there is a significant correlation in a statistic between neighboring loci, fewer than 5% of the random arrangements will have a lower mean difference than the observed value.
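The randomization procedure described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, the difference between neighbors is assumed to be the absolute difference, and loci are assumed to be supplied in chromosomal order.

```python
import random


def mean_neighbor_difference(values):
    """Mean absolute difference between adjacent entries of an ordered list."""
    return sum(abs(a - b) for a, b in zip(values, values[1:])) / (len(values) - 1)


def neighbor_randomization_test(values, n_permutations=10000, seed=None):
    """Return the observed mean neighbor difference and the fraction of
    random rearrangements whose mean neighbor difference is lower.

    A fraction below 0.05 would suggest that neighboring loci are more
    similar than expected under random placement within the region.
    """
    rng = random.Random(seed)
    observed = mean_neighbor_difference(values)
    shuffled = list(values)  # copy so the caller's list is left untouched
    lower = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if mean_neighbor_difference(shuffled) < observed:
            lower += 1
    return observed, lower / n_permutations
```

Called with the per-locus values of one statistic for one region (for example, the πsil values of region I in positional order), the second returned value corresponds to the percentage of random arrangements with a lower mean difference than the observed value.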

Testing for Selection in Each Region

The composite likelihood-ratio (CLR) test of Kim and Stephan (2002), as implemented in the program clsw, was used to determine if the patterns of nucleotide variation in each region are consistent with a selective sweep. This program is available as part of the composite likelihood analysis (CLA) software package available at http://yuseobkim.net/YuseobPrograms.html. The CLR analysis has two options: option 1 (test A) distinguishes between derived and ancestral alleles, and option 2 (test B) does not make this distinction. DnaSP 4.0 was used to estimate the recombination parameter (R) between adjacent sites (Hudson 1987). CLR values for each region were compared to those from 1000 replicate neutral data sets simulated using the program ssw (part of the CLA software package). Regions with significant CLR for either option (P < 0.05) were further analyzed by a goodness-of-fit (GOF) test described by Jensen et al. (2005), as implemented in the clsw program. The GOF test is designed to distinguish between selection and demographic forces, which can give rise to similar patterns of variation. GOF values are compared to 1000 replicate data sets generated by selective sweep simulations using the program ssw. Regions with low GOF values (P > 0.05) are consistent with the selective sweep hypothesis.

Analysis of Recombination and Linkage Disequilibrium

Global estimates of recombination for regions containing focal low-diversity loci were obtained from the derivative of a fifth-order polynomial of genetic versus physical distance for collinear markers from the Lister and Dean Col × Ler recombinant inbred map (Lister and Dean 1993). Physical distances were obtained from ftp://ftp.arabidopsis.org/home/tair/Maps/mapviewer_data/. DnaSP 4.0 was used to estimate linkage disequilibrium (LD, as measured by the square of the correlation coefficient, r²) between informative polymorphic loci within each genomic region. In order to assess LD between all loci within each region, we combined sequence alignments for the region into one sequence file. The intervening regions between sequenced loci were treated as missing data. The minimum number of recombination events (Rm) was obtained using the four-gamete test (Hudson and Kaplan 1985). All recombination analyses were performed using DnaSP 4.0.
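To make the four-gamete logic behind Rm concrete, the sketch below counts a lower bound on recombination events in the spirit of Hudson and Kaplan (1985). It is an illustration only and not the DnaSP implementation: the function names are hypothetical, haplotypes are assumed to be equal-length strings without missing data, and only biallelic sites are considered. Pairs of sites showing all four gametes define intervals that must contain a recombination event, and the count is the largest set of such intervals that do not overlap.

```python
from itertools import combinations


def four_gametes(site_a, site_b):
    """True if all four two-site haplotypes occur at two biallelic sites."""
    return len(set(zip(site_a, site_b))) == 4


def minimum_recombination_events(haplotypes):
    """Lower bound on the number of recombination events (Rm)."""
    columns = list(zip(*haplotypes))
    # Keep only biallelic segregating sites.
    sites = [i for i, col in enumerate(columns) if len(set(col)) == 2]
    # Each incompatible pair (i, j) implies an event somewhere between i and j.
    intervals = [(i, j) for i, j in combinations(sites, 2)
                 if four_gametes(columns[i], columns[j])]
    # Greedy selection by right endpoint yields the maximum number of
    # non-overlapping intervals; intervals sharing only an endpoint
    # are treated as non-overlapping.
    intervals.sort(key=lambda iv: iv[1])
    rm = 0
    last_end = -1
    for start, end in intervals:
        if start >= last_end:
            rm += 1
            last_end = end
    return rm
```

The endpoints of the selected intervals also give approximate positions for the inferred events, which is the kind of information plotted in panel E of Figs. 1 to 4.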

Results

Patterns of Genetic Diversity Can Vary Greatly Between Neighboring Loci

In a selfing species such as A. thaliana, with its reduced effective recombination, neighboring loci should have correlated evolutionary histories (Glemin et al. 2006). This effect will be exacerbated for loci surrounding the target of recent selection due to genetic hitchhiking (Barton 2000; Kaplan et al. 1989; Maynard Smith and Haigh 1974). In order to determine the extent to which neighboring loci share similar evolutionary histories in A. thaliana, we surveyed patterns and levels of nucleotide diversity in 1 kb portions of genes located at 5 to 10 kb intervals from four focal low-diversity loci and spanning chromosomal regions ranging from 43 to 113 kb (Table 1; Supplementary Table 3).

Table 1 Nucleotide diversity and divergence of loci in genomic regions containing focal genes

The focal loci in all four regions have levels of silent site pair-wise nucleotide diversity (πsil) that are over 10-fold lower than the genomic average of πsil, estimated to be between 0.006 and 0.009 (Nordborg et al. 2005; Schmid et al. 2005). Region I spans 113 kb and includes the low-diversity gene At1g67140, which encodes a polycystein cation domain channel protein and was identified as a “fast-evolving” gene in A. thaliana (Barrier et al. 2003). Both At1g67140 and the neighboring gene At1g67120 comprise much of this region, being 16.5 kb and 26 kb in length, respectively. We sequenced three 1 kb fragments across each of these relatively large genes (designated a, b, and c); At1g67140a is the original sequenced locus (Barrier et al. 2003).

Region II spans 76 kb and includes the recent gene duplicate At3g03700, which encodes a predicted sodium dicarboxylic acid symporter (Moore and Purugganan 2003). Region III spans 83 kb and includes At1g04300, which encodes a MATH domain protein and was identified as a low-diversity locus in a linkage disequilibrium study of the neighboring CRYPTOCHROME2 (CRY2) locus (Olsen et al. 2004). Finally, region IV spans 43 kb and includes the focal gene At5g03840 (TERMINAL FLOWER1 [TFL1]), which is involved in the developmental transition to flowering (Alvarez et al. 1992) and was initially analyzed in a population genetic analysis of flowering pathway genes (Olsen et al. 2002). Previous population genetic analyses of At1g67140 (region I), At3g03700 (region II), and the coding region of TFL1 (region IV) suggest they are the targets of positive selection and have undergone selective sweeps (Barrier et al. 2003; Moore and Purugganan 2003; Olsen et al. 2002).

Two patterns of within-species nucleotide polymorphism are evident across the four genomic regions. Regions I and II typify one pattern; these two regions have broad valleys of reduced silent site nucleotide diversity (estimated by πsil and θW) centered on the focal locus (Fig. 1A, 2A). In region I, we sequenced a total of 12,423 bp from nine genes and found a total of 137 polymorphic sites, 110 of which are silent site variants (Table 2). The focal gene At1g67140 is found in a broad 45 kb valley of reduced diversity that includes the neighboring gene, At1g67120, and is flanked by genes of moderately high diversity, At1g67110 and At1g67150. In region II, we sequenced a total of 9,823 bp from 11 loci and found 67 polymorphic sites, 36 of which are silent site changes (Table 2). Nucleotide diversity at the focal gene At3g03700 is at the lowest point of a shallow valley of low diversity spanning ∼25 kb and including the neighboring loci At3g03690, At3g03710 and At3g03720. This valley is flanked on one side by a locus (At3g03680) with relatively elevated diversity.

Fig. 1

The distribution of (A) nucleotide diversity (πsil [closed circles] and θw [open circles]), (B) polymorphism to divergence ratio (θw/K), (C) Tajima’s D, and (D) Fay and Wu’s H in region I. (E) Position of predicted recombination events in region I (estimated from Rm). (F) Gene models for loci in region I (scaled to the positional axis of A–D). Horizontal arrows show transcriptional direction; genes are shaded grey and sequenced regions are shaded black (exon/intron structure not shown). A vertical arrow in A–D indicates the position of the focal locus At1g67140.

Fig. 2

The distribution of (A) nucleotide diversity (πsil [closed circles] and θw [open circles]), (B) polymorphism to divergence ratio (θw/K), (C) Tajima’s D, and (D) Fay and Wu’s H in region II. (E) Position of predicted recombination events in region II (estimated from Rm). (F) Gene models for loci in region II (scaled to the positional axis of A–D). Horizontal arrows show transcriptional direction; genes are shaded grey and sequenced regions are shaded black (exon/intron structure not shown). A vertical arrow in A–D indicates the position of the focal locus At3g03700.

Table 2 Average nucleotide diversity for each genomic region containing focal loci

The second pattern of nucleotide polymorphism is observed across regions III and IV. Focal low-diversity loci in these regions are found in narrow valleys of reduced nucleotide diversity containing at most one other locus (Fig. 3A, 4A). In region III we sequenced a total of 17,295 bp from 18 loci and found 317 polymorphisms, 252 of which are silent substitutions (Table 2). Region III exhibits a number of alternating peaks and valleys of nucleotide diversity. For example, in the 30 kb region between At1g04250 and At1g04350, πsil changes 3- to 28-fold between neighboring loci. In region IV we sequenced a total of 8,290 bp from nine loci and found 153 polymorphic sites, 120 of which are silent substitutions (Table 2). Low levels of intraspecific polymorphism are restricted to the TFL1 coding region, and nucleotide diversity increases 24-fold in the adjacent TFL1 promoter and the neighboring locus, At5g03830.

Fig. 3

The distribution of (A) nucleotide diversity (πsil [closed circles] and θw [open circles]), (B) polymorphism to divergence ratio (θw/K), (C) Tajima’s D, and (D) Fay and Wu’s H in region III. For loci in the region spanning At1g04240 to At1g04360, estimates of πsil (A) and Tajima’s D (C) calculated from the base set of 14 ecotypes (closed circles) or an expanded set of 44 ecotypes (closed triangles in A; open circles in D) are indicated. (E) Position of predicted recombination events in region III (estimated from Rm). (F) Gene models for loci in region III (scaled to the positional axis of A–D). Horizontal arrows show transcriptional direction; genes are shaded grey and sequenced regions are shaded black (exon/intron structure not shown). A vertical arrow in A–D indicates the position of the focal locus At1g04300.

Fig. 4

The distribution of (A) nucleotide diversity (πsil [closed circles] and θw [open circles]), (B) polymorphism to divergence ratio (θw/K), (C) Tajima’s D, and (D) Fay and Wu’s H in region IV. (E) Position of predicted recombination events in region IV (estimated from Rm). (F) Gene models for loci in region IV (scaled to the positional axis of A–D). Horizontal arrows show transcriptional direction; genes are shaded grey and sequenced regions are shaded black (exon/intron structure not shown). A vertical arrow in A–D indicates the position of the focal locus At5g03840 (TFL1).

Levels of silent-site polymorphism varied considerably within each genomic region. Differences between high and low πsil values ranged from 24-fold (for region IV) to 165-fold (for region I), not including zero-diversity loci (Table 1). The coefficient of variation (V), a statistical measure of the degree of this variability, ranged from 0.9 to 1.7 for πsil values and from 0.6 to 1.3 for θW (Table 2). This degree of variation in nucleotide polymorphism is on par with that found in a sampling of 876 loci randomly distributed across the A. thaliana genome, for which V is 1.2 for π and 0.8 for θW (Nordborg et al. 2005). Changes in nucleotide diversity between adjacent loci were often abrupt and extreme in these regions. For example, πsil values differed up to 47-fold between neighboring loci within each region (e.g., At1g67110 and At1g67150 in region I, At3g03680 in region II, At1g04290 in region III, and At5g03830 and the TFL1 coding sequence in region IV; Figs. 1–4).

In order to assess the degree of similarity in nucleotide diversity levels between adjacent loci within each region, we calculated the average change in πsil and θW between neighboring loci (Δπsil and ΔθW; Table 3). The average Δπsil and ΔθW values were lowest in region II (Δπsil and ΔθW = 0.004) and region III (Δπsil and ΔθW = 0.005), indicating less extreme changes in nucleotide diversity between neighboring loci. They were highest in region I (Δπsil = 0.011 and ΔθW = 0.009) and region IV (Δπsil = 0.011 and ΔθW = 0.007), which exhibit relatively large differences in nucleotide diversity at loci At1g67110 and At1g67150 in region I and between TFL1 and neighboring loci in region IV. To analyze the significance of these values, we first calculated the mean Δπsil and ΔθW values of 10,000 random arrangements of πsil and θW values in each region (see Materials and Methods; Supplementary Fig. 1A–H). If there is a correlation in nucleotide diversity between neighboring loci, we should find fewer than 5% of the random arrangements with lower mean Δπsil or ΔθW than the observed value. However, between 16% and 95% of the random arrangements had lower mean Δπsil and ΔθW values than the observed value for each region, suggesting that nucleotide diversity estimates are uncorrelated between neighboring loci in these regions (Table 3).

Table 3 Average similarity in population genetic statistics (πsil, θW, θ/K, Tajima’s D, and Fay and Wu’s H) between neighboring loci and mean distance between loci for each region

Significant Variation in Effective Population Size (Ne), as Measured by θ/K, Exists Between Neighboring Loci

Variation in the levels of intraspecific polymorphism between loci may reflect differences in the underlying neutral mutation rate at those loci and not differences in selection regimes. It is more useful to track changes in the effective population size Ne across loci, as Ne should be equivalent between loci in the absence of selective forces. Ne can be approximated by the ratio of the estimate of the population mutation parameter (θW = 4Neμ) to the interspecific divergence (K = 2Tμ under neutrality [Li 1997], where T is the divergence time between A. thaliana and A. lyrata); therefore, we tracked changes in this ratio across each region (Figs. 1–4B).
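Written out, the reason this ratio tracks Ne is that the locus-specific neutral mutation rate cancels. The short derivation below assumes, for illustration, that the same μ governs both polymorphism and divergence at a given locus and that T is shared by all loci:

```latex
\[
\frac{\theta_W}{K} \;=\; \frac{4 N_e \mu}{2 T \mu} \;=\; \frac{2 N_e}{T}
\]
```

Under these assumptions, differences in θW/K between loci would reflect differences in Ne rather than differences in μ.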

There is less variation in interspecific silent site divergence (K) between and within regions than in levels of intraspecific diversity, as measured by πsil or θW (Table 2). Average K values for the regions range between 0.10 and 0.15 and are similar to the average of 0.14 substitutions/silent site determined from a genomic survey of interspecific divergence between A. thaliana and A. lyrata (Schmid et al. 2005). Furthermore, V for the average divergence for each of the regions is three- to five-fold lower than the V for the average levels of intraspecific polymorphism (Table 3), suggesting that loci in these regions are diverging between the sister species at similar rates. Not surprisingly, the same general patterns in the distribution of intraspecific diversity also appear in the distribution of θW/K; that is, regions I and II exhibit broad valleys of reduced θW/K values, while regions III and IV exhibit alternating peaks and valleys of θW/K values (Figs. 1–4B).

For each region, we also calculated the mean change in θW/K between neighboring loci (Δ[θW/K]) as a measure of the degree of shared evolutionary history between loci. To test the correlation of θW/K between neighboring loci, we then compared the observed mean Δ(θW/K) for each region to the mean Δ(θW/K) values from 10,000 random arrangements of θW/K values in each region (Supplementary Fig. 1I–L). For regions I, III, and IV, between 52% and 92% of random arrangements had mean Δ(θW/K) lower than the observed means (Table 3). For region II, we did not obtain divergence data for the highly polymorphic locus At3g03810, leading to more concordance in θW/K between neighboring loci. However, even without this locus, 17% of mean Δ(θW/K) values from the randomized dataset were lower than the observed mean (Table 3). Thus, Ne, as approximated by θW/K, varies significantly between neighboring loci in all of the regions we analyzed.

Tajima’s D Values and Fay and Wu’s H Values Vary Significantly Between Adjacent Loci in Most Genomic Regions

Because we observed such dramatic changes in diversity measures between linked loci, we were also interested in assessing the degree to which the polymorphism frequency distribution varied. Deviations in the polymorphism frequency spectrum from the neutral expectation can be tested using Tajima’s D test statistic, while deviations in the frequency spectrum of derived polymorphisms can be tested with Fay and Wu’s H statistic. Values of D and H are skewed toward the negative in regions in and around targets of recent selective sweeps (Fay and Wu 2000; Tajima 1989). We therefore determined the values of Tajima’s D and Fay and Wu’s H at loci in each region and assessed the degree to which these values were correlated between neighboring loci.

In region I, negative Tajima’s D values span an approximately 45 kb region containing the focal locus At1g67140 and the neighboring loci At1g67130 and At1g67120 (Fig. 1C). This region is flanked by two loci, At1g67110 and At1g67150, with positive Tajima’s D values. We could not calculate Tajima’s D for the region II focal locus, At3g03700, which has no polymorphic sites. However, Tajima’s D is negatively skewed across a 35 kb region surrounding At3g03700 (Fig. 2C). In contrast, focal loci in region III and region IV are found in very narrow valleys of negative Tajima’s D values. Regions I, II, and IV exhibit significant variation in Tajima’s D values between neighboring loci; the observed mean ΔD values for these three regions were greater than more than 5% of the mean ΔD values from 10,000 randomized rearrangements of the data (Supplementary Fig. 1M–P, Table 3). In contrast, the mean ΔD value of region III was significantly lower (P < 0.05) than the randomized datasets.

All of the focal loci are flanked by loci with negative Fay and Wu’s H, suggesting increased frequencies of derived polymorphism (Figs. 1–4D). As with Tajima’s D values, the frequency spectrum of derived polymorphisms, as measured by Fay and Wu’s H, was dissimilar between neighboring loci for the majority of the regions (Supplementary Fig. 1Q–T, Table 3). The exception was region II, where only 0.2% of the mean ΔH values from randomized data sets were lower than the observed value, suggesting a significant correlation in Fay and Wu’s H statistic between neighboring loci in this region.

Fig. 5

LD (as measured by r²) versus distance between polymorphic sites in (A) region I, (B) region II, (C) region III, and (D) region IV. (E) LD decay in region III for loci spanning At1g04240 to At1g04360 calculated from the base set of 14 ecotypes (closed circles) or an expanded set of 44 ecotypes (open circles). The decay in LD is illustrated by a line that connects the average r² value measured for 5 kb intervals (for those points less than 20 kb distant) or 10 kb intervals (for those points more than 20 kb distant).

Does Selection Play a Role in Shaping Patterns of Nucleotide Variation in the Regions?

Previous population genetic analyses identified three of the four focal loci (in regions I, II, and IV) as the targets of positive selection. A recent selective sweep at these loci would predictably result in reduced polymorphism and an excess of rare alleles and of high frequency derived alleles at linked loci. We used the CLR test of Kim and Stephan (2002) to determine if the patterns of nucleotide variation in each region were consistent with selection or neutral drift. For those regions for which the CLR test supported the selection hypothesis we implemented a GOF test to distinguish between variation patterns resulting from selection versus demographic processes such as population subdivision (Jensen et al. 2005).

The CLR test supports the selection hypothesis for Regions II and III (Table 4). In Region II the CLR test was significant only when derived sites were not differentiated from ancestral sites (test B). This may be possible if the sweep was not recent, as simulations show that derived polymorphism will be lost after 0.4Ne generations (Kim and Stephan 2002). The GOF test supports the selection hypothesis, suggesting that the patterns of variation are not due to demographic processes. Although the CLR test supports the selection hypothesis for Region III, the predicted location of the selective sweep is not centered on the focal locus for that region. Furthermore, unlike Region II, the GOF test statistic does not support the selection hypothesis. Given the extremely low P value for the GOF value for this region, it is probable that the significant CLR test result was due to underlying demography, such as population structure, and not selection (Jensen et al. 2005). Patterns of variation in Regions I and IV were indistinguishable from neutral simulations using the CLR test.

Table 4 Composite-likelihood ratio (CLR) and goodness-of-fit (GOF) analysis for each region

Recombination and the Extent of Linkage Disequilibrium

The fluctuating patterns of nucleotide diversity in genomic regions containing our focal low-diversity loci should be correlated with rates of recombination and the degree of LD at those regions, as recombination breaks down associations between neighboring loci. For example, we would predict that regions I and II, with broad valleys of reduced polymorphism, would be found in chromosomal regions with relatively low recombination and extended LD between loci. In contrast, we would expect regions III and IV, which have very narrow valleys of reduced polymorphism, to be found in chromosomal regions with elevated recombination rates and limited LD between loci. To assess these predictions, we estimated the global recombination rate at each region and analyzed the decay of LD within the regions (Fig. 5).

Both regions I and II are found in chromosomal positions with recombination estimates near the genome average of 4.8 cM/Mb (Zhang and Gaut 2003); the global recombination estimate is 4.9 cM/Mb for region I and 5.2 cM/Mb for region II. Region IV has a relatively elevated global recombination estimate of 7.8 cM/Mb, consistent with the extremely localized reduction in variability at the TFL1 locus. However, region III, which exhibits a highly variable pattern of nucleotide diversity, has the lowest global recombination estimate of 3.0 cM/Mb.

Predictably, the rate of decay in LD parallels the trends in recombination estimates. In regions I and II, which exhibit broad valleys of reduced genetic diversity, LD, as measured by the square of the correlation coefficient (r²), decays within 10–15 kb (Fig. 5A, B). In region IV, in which reduced diversity is limited to the focal TFL1 gene, LD decays more rapidly, within 5–8 kb (Fig. 5D). In concordance with the low global estimate of recombination in the highly variable region III, LD in this region extends up to 15–20 kb (Fig. 5C). Furthermore, extensive LD (r² = 1) can be found between polymorphic loci up to 65 kb apart in region III. Thus, the fluctuations in nucleotide diversity between neighboring loci in regions I, II, and IV agree with global recombination estimates and the LD decay rate, while there is an apparent disjunction between these factors in region III. This may be due in part to the influence of the dimorphic CRY2 locus in region III. If we focus on the 50 kb region from At1g04240 to At1g04370 that includes the focal gene At1g04300 but excludes CRY2 and its flanking loci, LD decays rapidly, within 5 kb (Fig. 5E).

We also estimated Hudson’s minimum number of recombination events (Rm) to see if differences in recombination rates reflect fluctuations in levels of nucleotide variation between linked loci. We found that Rm values correspond with the degree of fluctuation in nucleotide variation between neighboring loci in all regions except region I. For example, region II, with relatively low fluctuations in nucleotide variation, has an Rm of 3, whereas recombination is much more pervasive in regions III (Rm = 9) and IV (Rm = 11), which exhibit more frequent fluctuations in nucleotide variation. We mapped the recombination events in these regions and found that they potentially flank valleys of low diversity in each region (Figs. 1–4E). Region I, with the largest region of reduced variation, actually has an Rm of 12, on par with region IV. However, only three of these events occur in and around the 45 kb valley of reduced polymorphism surrounding the focal gene At1g67140 (recombination events #2–4, Fig. 1E), while eight of these events map in or adjacent to the moderately high diversity flanking locus, At1g67150 (recombination events #5–12, Fig. 1E).

Discussion

Considerable Variation in Genetic Diversity Exists Between Neighboring Loci Within A. thaliana Genomic Regions

We found highly variable levels and patterns of nucleotide diversity in genomic regions surrounding low-diversity genes. This stands in contrast to studies in other organisms, which used autocorrelation analysis to find that neighboring loci have more similar polymorphism and divergence values (Hahn 2006; Matassi et al. 1999). We do not think this discrepancy is due to a failure of our randomization method to detect correlation of diversity values between neighboring loci; our method also detects highly significant associations in polymorphism between linked regions on the Drosophila X chromosome (P < 0.001) when applied to the dataset of Hahn (2006). Our findings, however, do agree with previous analyses of local patterns of nucleotide polymorphism in A. thaliana. Similar fluctuations in genetic diversity have been reported for genomic regions surrounding other loci, such as CLV2 (Shepard and Purugganan 2003), MAM1 (Haubold et al. 2002), FRI (Hagenblad and Nordborg 2002; Hagenblad et al. 2004), and for some high diversity genes (Cork and Purugganan 2005). The ability to detect this variation in genetic diversity may be in part due to the relatively high density of markers surrounding the locus of interest used in this and other studies. This variation is still surprising, as linked loci should share similar evolutionary histories due to the low effective recombination rate in A. thaliana (Nordborg et al. 2005).

In studies of variation around the FRI locus, it has been suggested that population structure contributes to these fluctuations in population genetic statistics (Hagenblad and Nordborg 2002; Hagenblad et al. 2004). In an unstructured population, these values should be insensitive to sampling, but an increased sample size for loci around FRI produced more uniform population genetics measurements, especially with respect to Tajima’s D (Hagenblad et al. 2004). Note, however, that there was a two-fold decrease in marker density and a more uniform marker distribution around the FRI locus in this later study, which could also have contributed to a more uniform distribution of genetic variation (Hagenblad and Nordborg 2002; Hagenblad et al. 2004). The 14 A. thaliana accessions in our survey include members from across its native geographic range (Supplementary Table 1). To test for the underlying effects of population structure on the patterns of nucleotide polymorphism, we increased our sampling to 44 accessions for 10 loci surrounding At1g04300 in region III, one of the more variable regions. We found that increased sampling only slightly altered estimates of nucleotide variation (πsil), by 15% on average (see Fig. 3A). The effect of increased sampling on Tajima’s D was more varied. The majority of loci showed only moderate adjustments to Tajima’s D values, on the order of 26%, resulting in a smoothing of the distribution of Tajima’s D estimates in the region (see Fig. 3C). However, upon further sampling, Tajima’s D for two loci, At1g04310 and At1g04350, decreased more dramatically, by 3- to 5-fold, and Tajima’s D for one locus, At1g04250, decreased over 83-fold, from −0.013 to −1.1 (see Fig. 3C). Thus, while population structure and sampling strategy may have contributed to fluctuation in certain population genetics statistics, such as Tajima’s D, they did not appear to contribute strongly to the observed fluctuations in nucleotide variation.

These variations in nucleotide diversity are reflected in changes in Ne, as measured by the θW/K ratio, suggesting discordance in evolutionary histories between neighboring loci in these genomic regions. It is possible that recombination has acted to break down associations between neighboring genes. The degree of dissimilarity in evolutionary histories between linked loci within a region generally correlated with estimates of recombination and linkage disequilibrium for that region, with the notable exception of region III. Nevertheless, recombination may still play a role in shaping diversity patterns in this highly variable region. While region III exhibits extensive LD and is found in a chromosomal region with low estimates of recombination (on par with centromeric regions), an explicit test of recombination infers multiple recombination events across this region (Rm = 9). We also assessed LD decay in the 50 kb region surrounding the focal locus in our sample of 44 ecotypes in order to assess the influence of population structure on our estimates of LD. We observed an approximate 2-fold reduction in mean r² values when we increased our sampling to 44 ecotypes, although LD reached basal levels within 5 kb in both data sets (Fig. 5E). Thus our estimates of LD may be inflated due to underlying population structure in our original 14 sampled ecotypes.

It is also possible that gene conversion has contributed to the variation in nucleotide polymorphism levels exhibited in our regions. Gene conversion contributes significantly to the short-range pattern of LD in A. thaliana (Haubold et al. 2002; Nordborg et al. 2005). Indeed, recent estimates from genome-wide analyses suggest the gene conversion rate in A. thaliana is equal to the crossing over rate (Plagnol et al. 2006). Haubold et al. (2002) found that gene conversion rather than reciprocal recombination was the major contributing factor to the variation in nucleotide polymorphism in the 170 kb region around the MAM1 gene. In this region 90% of recombination events were caused by gene conversion. Similarly, region III is found in a region of low recombination rate, yet the levels of polymorphism are not constant across the region; thus, gene conversion is a potential contributor to the observed pattern of polymorphism. Furthermore, studies in Drosophila have shown that gene conversion events associated with selective sweeps can give rise to peaks of high diversity within the sweep region (Glinka et al. 2006). This can potentially mask the signature of a selective sweep. We found a similar pattern in all three of the regions containing potentially selected loci (regions I, II, and IV). Each region has peaks of high diversity flanking the hypothesized selected locus, although the CLR test supports the selection hypothesis only for region II.

Potential Contribution of Selection to the Evolutionary History of Low-Diversity Genes

Only three of our four low-diversity loci were previously predicted to be targets of positive selection: At1g67140 in region I, At3g03700 in region II and At5g03840 (TFL1) in region IV (Barrier et al. 2003; Moore and Purugganan 2003; Olsen et al. 2002). If selection is sufficiently strong and recent, this signature should extend to neighboring loci as predicted by the hitchhiking hypothesis (Kaplan et al. 1989; Kim and Stephan 2002; Przeworski 2002; Maynard Smith and Haigh 1974). This prediction is particularly important for distinguishing the pattern of selection from random fluctuations in nucleotide diversity due to drift (Kim and Stephan 2002; Przeworski 2002).

Patterns of nucleotide polymorphism are consistent with the hitchhiking prediction for only two of the loci, At1g67140 in region I and At3g03700 in region II, both of which are implicated from previous studies as targets of selection (Barrier et al. 2003; Moore and Purugganan 2003). Both loci are located in wide valleys of reduced polymorphism relative to divergence. For At1g67140, this valley spans approximately 45 kb and is flanked by two loci of elevated polymorphism relative to divergence. However, the CLR test of selection supports the selection hypothesis only for region II, which contains the recently duplicated gene At3g03700. That the observed pattern of polymorphism in this region is a result of selection and not underlying population structure is bolstered by a significantly low GOF test statistic. The predicted target of selection is centered on the region containing the focal locus, although the estimated strength of selection is low (α = 7.5). Using an estimate of Ne of 4 × 10⁵ for A. thaliana, this equates to a selection coefficient (s) of approximately 10⁻⁵, which is consistent with previous estimates of selection acting at this locus (Moore and Purugganan 2003). It should be noted, however, that the power of the CLR test to detect strong selection events is reduced in regions of low recombination, such as found in A. thaliana (Kim and Stephan 2002).
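The conversion from the scaled selection parameter to a selection coefficient can be checked with a short calculation. The sketch below assumes the scaling α = 2·Ne·s, which is not stated explicitly in the text above and should be treated as an assumption of this example; the function name is hypothetical.

```python
def selection_coefficient(alpha: float, n_e: float) -> float:
    """Selection coefficient s from the scaled parameter alpha.

    Assumes alpha = 2 * Ne * s (an assumption of this sketch).
    """
    if n_e <= 0:
        raise ValueError("effective population size must be positive")
    return alpha / (2.0 * n_e)


# Values quoted in the text: alpha = 7.5 and Ne = 4 x 10^5
s = selection_coefficient(7.5, 4e5)
print(f"s = {s:.3e}")
```

Under this scaling the result is about 9.4 × 10⁻⁶, in line with the value of approximately 10⁻⁵ given in the text.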

Implications for the Future of Population Genetic Analyses in A. thaliana

These data have significant implications for inferring the evolutionary histories at individual loci in A. thaliana. Population genetic analyses of genomic regions, and not just individual genes, are necessary. For example, without analyzing the genomic context surrounding the gene of interest, we cannot rule out that selection is acting on neighboring loci. This is especially true for those loci with limited functional characterization or for which there is no prior reason to expect selection. And while this can sometimes lead to ambiguities as to the true target of selection, it can also strengthen the argument for the action of a particularly strong selection event in the region that includes the gene of interest. As genome-wide resequencing data for A. thaliana become available, such analyses will become the standard (Clark et al. 2007). Although researchers have also successfully used a “bottom-up” approach using genomic screens to identify potential targets of selection by scanning genomes for regions of extreme polymorphism levels, we must be careful when interpreting these results (Akey et al. 2002; Cork and Purugganan 2005; Diller et al. 2002; Mousset et al. 2003; Payseur et al. 2002; Vigouroux et al. 2002; Wiener et al. 2003). Ultimately, we will need to understand the functions of potentially selected genes in order to bolster their candidacy as targets of selection and to formulate adaptive hypotheses involving them.