Introduction

The environment plays a fundamental role in shaping the geographic structure of genetic variation through its effect on demographic processes and through natural selection. In plant populations, climate can directly affect population expansion, contraction, and migration or alter flowering time and mating patterns (Cleland et al. 2007; Davis 1976). As a result of these “neutral” processes, genome-wide genetic variation is expected to associate with climate gradients (Eckert et al. 2010a; Sexton et al. 2014). In contrast, natural selection by climate would act largely on specific loci, such that associations with climate at particular loci should be independent of those with the background genetic structure (Eckert et al. 2010b; Hancock and Di Rienzo 2008; Keller et al. 2011). These genetic variants under selection along climate gradients within a species likely play a role in local adaptation and are of particular interest (Endler 1986). Because different alleles are favored in different environments (i.e., a form of balancing selection species wide), genetic variation at these loci is expected to be higher than at the typical locus, which is more likely to be under purifying selection (Lasky et al. 2014).

With genome-wide single nucleotide polymorphism (SNP) data sampled across environments, it is now possible to identify genetic variation under natural selection by local climate. One cost-effective way to identify large numbers of SNPs in non-model organisms is to perform whole-transcriptome sequencing (mRNA-Seq) from widely distributed samples and compare sequence variation, ignoring transcript abundance (Cánovas et al. 2010; Geraldes et al. 2011; Wang et al. 2009). An advantage of this approach is that it targets coding regions of the genome, which are commonly targets of selection and likely to underlie trait variation, and flanking untranslated regions, which play a role in gene expression (Barrett et al. 2012; Schork et al. 2013). Moreover, as genome size increases, enrichment for functional genomic regions can become important. Range-wide SNP data can then effectively be used in environmental association analyses to identify putatively adaptive genetic variation that displays exceptionally strong associations with the environment (De Mita et al. 2013) or in phenotypic association analyses to understand the genetic basis of trait variation (Atwell et al. 2010; Holliday et al. 2010; Neale and Savolainen 2004).

A number of powerful environmental association (“outlier”) analyses have been proposed that test whether specific loci are especially associated with a given environmental variable after accounting for background associations due to population structure (Coop et al. 2010; De Mita et al. 2013; Eckert et al. 2010b). For designs in which samples are scattered along the environmental gradient, rather than clustered in “populations,” linear mixed modeling approaches are an excellent choice (De Mita et al. 2013; Yoder et al. 2014). These models can effectively account for population structure using a kinship matrix of relatedness among individuals, are computationally efficient for large SNP data sets (Kang et al. 2008; Yu et al. 2006), and have low false-positive rates (Frichot et al. 2013), although they do not explicitly model population history (Günther and Coop 2013). One such method, EMMAX, was developed for genotype-phenotype associations and has been shown to outperform other similar approaches in accounting for underlying population structure (Kang et al. 2010; Sul and Eskin 2013). When applied to climate data, rather than phenotypic data, significant associations can be interpreted as candidate SNPs under natural selection by climate (Frichot et al. 2013; Yoder et al. 2014). Using the model in this way does not suggest that genotype causes climate; rather, it is a convenient statistical means of assessing the expected correlations among the variables of interest (Furlotte et al. 2011), and it can be assumed that adaptive phenotypes mediate those associations between SNP and climate (Eckert et al. 2010b). Accounting for population structure can minimize false-positive rates; however, all environmental association approaches still suffer from elevated false-negative rates because accounting for genetic structure also removes some true signal.
Environmental association analyses are especially powerful in highly outcrossing trees, such as oaks (Quercus), because linkage disequilibrium decays within a few hundred base pairs (Alberto et al. 2013; Brown et al. 2004; Kremer et al. 2012; Sork et al. 2016), meaning that significant associations are likely to be near the true target of selection (Neale and Savolainen 2004).

Our recently assembled transcriptome and large SNP panel for Quercus lobata Née (valley oak) (Cokus et al. 2015) provide an excellent resource to disentangle how climate shapes underlying genome-wide genetic structure, presumably due to demographic processes (Gugger et al. 2013; Sork et al. 2010), versus the effect of selection by climate on specific SNPs (Sork et al. 2013). For comparison, several studies of the European oaks have identified candidate genes for climate-related traits such as timing of bud burst and response to drought stress based on differential gene expression experiments (Derory et al. 2006; Porth et al. 2005; Spieß et al. 2012). Some of these loci have been verified using nucleotide-based tests for signatures of natural selection (Derory et al. 2010) and other approaches (Alberto et al. 2013).

Q. lobata exhibits structured genetic diversity at the local to regional scales and has high potential for local adaptation relative to oaks in eastern North America and Europe, in part because its distribution has remained stable in a topographically complex area through recent glacial cycles unlike many oaks elsewhere (Grivet et al. 2006; Gugger et al. 2013). Q. lobata is currently threatened by land development and climate change (McLaughlin and Zavaleta 2012; Sork et al. 2010). Thus, it is especially important in Q. lobata to understand how climate shapes genetic variation through demographic processes, what genes are involved in adaptation to current environments, and how changing environments might impact adaptive genetic variation and population persistence for effective management.

Here, we use 220,427 diallelic SNPs previously identified in Q. lobata (Cokus et al. 2015) to (1) quantify transcriptome-wide associations of SNP variation with climate indicative of demographic responses to climate, (2) identify specific loci that are especially associated with climate and thus potential targets of natural selection, and (3) test the hypothesis that candidate genes for adaptation to climate gradients have higher genetic diversity than non-candidate genes.

Materials and methods

Sampling

Poly-A-purified mRNA libraries from 22 Q. lobata samples from throughout its natural distribution (Fig. 1) were previously sequenced for a de novo transcriptome assembly and SNP discovery project that included other California oaks (Cokus et al. 2015). This draft reference transcriptome includes a mixture of complete and partial gene models (generally with UTRs and introns), many of which contain Pfam protein domains (Finn et al. 2014; Jones et al. 2014), and a subset of which were found to be orthologous with Arabidopsis genes from The Arabidopsis Information Resource (TAIR) (Swarbreck et al. 2008). Inferred Gene Ontology (GO) (Ashburner et al. 2000) associations for numerous oak gene models were then available through TAIR and Pfam. From the total panel of over one million SNPs identified within and among the California oak species, we retained for the present study 220,427 diallelic SNPs that are variable within Q. lobata and for which genotype was not called in at most 2 of 22 samples (<10 %). For some analyses, we further restricted to those SNPs for which genotype was not called in at most one sample (193,428) or those with called genotype in all samples (155,465). Details on the SNP calling methods and quality control are in Cokus et al. (2015).

Fig. 1

Distribution of valley oak (Quercus lobata) (blue) and sampling locations (black circles: small = one sample, large = two samples). Gray scale represents elevation (darker is higher)

Genetic structure and association with climate

To assess genetic structure and its association with spatial coordinates and climate variables, we performed redundancy analysis in vegan 2.0-7 (Oksanen et al. 2015) in R 3.0.0 using the completely called SNP set. Redundancy analysis is a multivariate analog of linear regression for cases with multiple response variables (SNPs) and multiple explanatory variables (climate and space), and it makes assumptions similar to those of principal component analysis (Legendre and Legendre 1998; ter Braak 1986). We further performed two partial redundancy analyses to partition the variance into the part explained uniquely by climate variables, the part explained uniquely by spatial coordinates (e.g., due to spatial autocorrelation and phylogeographic structure), and the joint influence of the two, which could not be disentangled (Borcard et al. 1992; Økland 1999). The analysis was repeated for each type of SNP: synonymous, nonsynonymous, and noncoding. Statistical significance of the full models was tested via permutations. Redundancy analysis has been successfully applied to Arabidopsis and oaks to address similar questions about the roles of climate and spatial variables (Gugger et al. 2013; Lasky et al. 2012) and has been recommended over traditional Mantel and partial Mantel tests for its superior statistical properties (Legendre and Fortin 2010).
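
The variance partitioning step reduces to simple arithmetic once the proportion of variance explained (R²) is known for the full model, the climate-only model, and the space-only model. The sketch below illustrates that arithmetic under the assumption that unadjusted R² values are used; the function names and the input values are hypothetical and are not results from this study.

```python
def partition_variance(r2_full, r2_climate, r2_space):
    """Split explained variance into unique and joint fractions.

    r2_full    : R^2 of the model with climate and spatial variables
    r2_climate : R^2 of the climate-only model
    r2_space   : R^2 of the space-only model
    """
    return {
        # What climate adds once space is already in the model.
        "climate_only": r2_full - r2_space,
        # What space adds once climate is already in the model.
        "space_only": r2_full - r2_climate,
        # Variance that either set of variables could explain.
        "joint": r2_climate + r2_space - r2_full,
    }


def as_share_of_explained(parts):
    """Express each fraction relative to the total explained variance."""
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}


# Illustrative inputs only (not values from this study).
parts = partition_variance(r2_full=0.6, r2_climate=0.4, r2_space=0.5)
shares = as_share_of_explained(parts)
```

The three fractions always sum to the R² of the full model, so the shares sum to one.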

Derived climate data were drawn from a U.S. Forest Service spline model designed for use in assessing plant-climate relationships (Rehfeldt 2006) (http://forest.moscowfsl.wsu.edu/climate/). We retained the five variables that are thought to be important determinants of Q. lobata’s distribution (McLaughlin and Zavaleta 2012; Sork et al. 2010) and that were not highly correlated with each other: growing degree-days above 5 °C, mean maximum temperature of the warmest month, mean minimum temperature of the coldest month, growing season precipitation, and summer/spring precipitation balance. Spatial coordinates included latitude, longitude, squared terms, cubed terms, and cross products to account for nonlinear associations of genetic variation with spatial variables (Borcard et al. 1992).
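
The spatial predictors can be illustrated with a short sketch of a cubic trend surface. It assumes the nine-term polynomial commonly attributed to Borcard et al. (1992), that is, linear, squared, and cubed terms plus the cross products xy, x²y, and xy²; the exact set of cross products used here is not stated in the text, and the function name is hypothetical.

```python
def trend_surface_terms(lat, lon):
    """Return the nine terms of a cubic trend surface for one site.

    x is longitude and y is latitude. Coordinates are often centered
    on their means beforehand to reduce collinearity among terms.
    """
    x, y = lon, lat
    return {
        "x": x, "y": y,                    # linear terms
        "x2": x * x, "y2": y * y,          # squared terms
        "x3": x ** 3, "y3": y ** 3,        # cubed terms
        "xy": x * y,                       # cross products
        "x2y": x * x * y,
        "xy2": x * y * y,
    }


# Example with small made-up centered coordinates.
terms = trend_surface_terms(lat=3.0, lon=2.0)
```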

SNPs under selection

To identify individual SNPs that are especially correlated (i.e., outliers) with any of the five climate variables, we performed linear mixed model correlations in EMMAX (intel-binary-20120210), as this effectively accounts for background genetic structure due to “hidden” relatedness or shared phylogeographic history via a pairwise “kinship matrix” among individuals based on the Balding-Nichols method (Balding and Nichols 1995; Kang et al. 2010; Sul and Eskin 2013). Standard linear mixed models as we implemented them have been evaluated against other individual-based environmental association methods, showing that they have very low false-positive rates and thus are conservative, and they have the benefit that no parameter optimization is necessary (Frichot et al. 2013). Although they can have high false-negative rates (i.e., low power), we prefer to be conservative, given our small sample size. For this analysis, we started with the 193,428 SNP set with at most one uncalled individual per locus. The kinship matrix was estimated based on a subset of 32,551 loci that were at least 500 bp apart when located on the same contig to ensure high likelihood of freedom from dependence mediated by linkage disequilibrium (Alberto et al. 2013; Brown et al. 2004; Sork et al. 2016); a version based on all SNPs gave similar results and is omitted. Because redundancy analyses did not reveal any distinctions among types of SNPs (i.e., coding or noncoding; see “Results”), SNPs for the kinship matrix were randomly chosen with respect to type. The 70,639 SNPs with a minor allele count of at least 4 (approximately 10 %) of the total 2 × 22 = 44 alleles were tested for associations with climate variables. Consistent with common practice, this threshold serves as an additional filter to reduce false positives due to potential high-leverage data points from rare alleles.
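
Two of the filters in this paragraph are easy to state in code: the threshold of 4 of 44 allele copies for the minor allele, and the 500 bp spacing between SNPs on the same contig. The sketch below is illustrative only. The genotype encoding (0, 1, or 2 copies, None for uncalled) and the function names are assumptions, and the greedy left-to-right thinning is one simple way to enforce the spacing rather than the selection procedure actually used, which the text does not specify.

```python
def minor_allele_count(genotypes):
    """Copies of the rarer allele among called diploid genotypes."""
    called = [g for g in genotypes if g is not None]
    alt_copies = sum(called)
    total_copies = 2 * len(called)
    return min(alt_copies, total_copies - alt_copies)


def thin_by_distance(positions_by_contig, min_gap=500):
    """Greedily keep SNP positions at least `min_gap` bp apart per contig."""
    kept = {}
    for contig, positions in positions_by_contig.items():
        chosen = []
        for pos in sorted(positions):
            # Keep the first SNP, then any SNP far enough from the last kept one.
            if not chosen or pos - chosen[-1] >= min_gap:
                chosen.append(pos)
        kept[contig] = chosen
    return kept


# Toy examples (made-up positions and genotypes).
thinned = thin_by_distance({"contig1": [100, 300, 600, 650, 1100]})
passes = minor_allele_count([0] * 20 + [2, 2]) >= 4   # 4 of 44 copies
```

SNPs on different contigs are never compared, since their physical distance is unknown in a transcriptome assembly.
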
Multiple testing was adjusted using the false discovery rate (Q) method of QVALUE 1.1 (Benjamini and Hochberg 1995; Storey and Tibshirani 2003). Finally, as a “validation” of the results for significant climate-associated SNPs from EMMAX, we performed partial Mantel tests of climate distance with genetic distance controlling for geographic distance. Relative to EMMAX, the partial Mantel tests essentially reverse the dependent and independent variables to be consistent with their hypothesized causal relationship. Geographic distance was calculated assuming a spherical model (WGS84) of the earth, climate and SNP allele frequency distance were calculated by Euclidean distance, and partial Mantel tests were performed in the vegan package in R.
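
As a rough illustration of false discovery rate adjustment, the sketch below implements the step-up procedure of Benjamini and Hochberg (1995). It is not the QVALUE implementation: the q-value method of Storey and Tibshirani (2003) additionally estimates the proportion of true null hypotheses, which this simplified version takes to be one. The function name and example P values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Indices of the P values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up P values for four tests.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.1 for q in adjusted]
```

A SNP would be reported at a 10 % false discovery rate when its adjusted value falls below 0.1.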

Candidate genes from literature and functional annotations

We searched our data set for previously published climate-related candidate genes to determine whether SNPs from those genes were also among the top associations from EMMAX. We first examined our data set for candidate genes reported elsewhere for oaks, specifically 213 drought and osmotic stress genes (Porth et al. 2005), 144 bud burst and flowering genes (Alberto et al. 2013; Derory et al. 2006) (http://www.evoltree.eu/), and 26 other climate-related genes (Sork et al. 2016). We searched for these genes in the reference transcriptome using USEARCH 7.0 (Edgar 2010) with thresholds of 92 % nucleotide identity and an E value of 10⁻¹⁰.

We also searched the Q. lobata transcriptome GO associations (Cokus et al. 2015) for keywords suggesting roles in responses to the tested climate variables. These included response to osmotic stress or homeostasis (GO IDs: 0006970, 0006972, 0009270, 0009992, 0030104, 0047484, 0071470) or water stimulus/deprivation (0009270, 0009414, 0009415, 0009819, 0042631, 0071462, 2000070), which might be related to growing season precipitation or summer/spring precipitation balance; response to heat (0034605, 0009408), heat acclimation (0010286, 0070370), or heat shock protein binding (0031072), which might be related to mean maximum temperature; response to cold or freezing (0050826, 0070417, 0009409) or cold acclimation (0009631), which might be related to mean minimum temperature; and flower/floral or leaf development or morphogenesis (0009908, 0009965, 0010093, 0010150, 0010338, 0010358, 0048366, 0048437, 0048438, 0048439, 0048444, 0048449, 0048464, 0048833), flower photoperiodism (0048573–0048575), or regulation of these processes (0009909, 0009910, 0009911, 0010080, 0048579, 0048577, 0048578, 0048586, 0048587, 0060860, 0060862, 2000024, 2000025, 2000028), which might be associated with a variety of climate variables, including growing degree-days, temperature, and precipitation (Hunter and Lechowicz 1992; Nizinski and Saugier 1988; Vitasse et al. 2011). We grouped flower and leaf development-related candidate genes together because flowers and leaves emerge from the same buds at almost the same time in Q. lobata.

We used hypergeometric tests (equivalent to one-tailed Fisher exact tests in this context) to determine whether SNPs from candidate genes from the literature and GO functional annotations were enriched in the top 5 % of associations from EMMAX with their respective climate variable. We also used one-sided Wilcoxon rank-sum tests to ask whether SNPs in those same candidate genes had stronger associations with their respective climate variables than non-candidate gene SNPs, as indicated by the P values from EMMAX.
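
The enrichment test can be written out directly from the hypergeometric distribution. The sketch below computes the upper-tail probability of seeing at least k candidate-gene SNPs among the top associations; the function name, argument names, and example counts are hypothetical and are not taken from the study.

```python
from math import comb


def hypergeom_upper_tail(k, total_snps, candidate_snps, top_snps):
    """P(X >= k) for the number X of candidate SNPs among the top SNPs.

    total_snps     : all SNPs tested
    candidate_snps : SNPs located in candidate genes
    top_snps       : SNPs in the top tail (e.g., top 5 %) of associations
    """
    upper = min(candidate_snps, top_snps)
    denominator = comb(total_snps, top_snps)
    numerator = sum(
        comb(candidate_snps, i) * comb(total_snps - candidate_snps, top_snps - i)
        for i in range(k, upper + 1)
    )
    return numerator / denominator


# Small made-up example: 10 SNPs, 4 in candidate genes, 5 in the top tail,
# 3 of which are candidate-gene SNPs.
p_enrichment = hypergeom_upper_tail(3, total_snps=10, candidate_snps=4, top_snps=5)
```

This is the same probability as a one-tailed Fisher exact test on the corresponding 2 × 2 table of candidate status by top-tail membership.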

Genetic diversity in candidate genes

We tested the hypothesis that genetic diversity is higher in climate-adaptive candidate genes than non-candidate genes using one-sided Wilcoxon rank-sum tests. We quantified genetic diversity with θ W or Watterson’s theta, which is a measure of SNP rate per bp per locus (Watterson 1975); π or nucleotide diversity, which is a measure of SNP rate per bp per locus weighted by frequency in the population (Begun et al. 2007; Nei and Li 1979); and G or Weir’s gene diversity, which is the average expected heterozygosity across all SNPs within a locus (Nei and Roychoudhury 1974; Weir 1996). These measures capture a range of concepts of genetic diversity from SNP rate irrespective of population allele frequency (θ W) to allele frequency in the population irrespective of SNP rate (G), and the combination of the two (π).
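
The relationship among the three measures can be made concrete with a short sketch for diallelic SNPs. It assumes that expected heterozygosity at each SNP carries the usual n/(n − 1) sample-size correction and that locus length in bp is known; these choices, the function names, and the example values are assumptions for illustration and may differ from the exact estimators used here.

```python
def watterson_a(n_chromosomes):
    """Watterson's correction factor: sum of 1/i for i = 1 .. n - 1."""
    return sum(1.0 / i for i in range(1, n_chromosomes))


def expected_heterozygosity(p, n_chromosomes):
    """Unbiased expected heterozygosity at a diallelic SNP with frequency p."""
    n = n_chromosomes
    return (n / (n - 1)) * 2.0 * p * (1.0 - p)


def diversity(allele_freqs, length_bp, n_chromosomes):
    """Return (theta_W per bp, pi per bp, G) for one locus.

    allele_freqs : frequency of one allele at each SNP in the locus
    """
    s = len(allele_freqs)
    het = [expected_heterozygosity(p, n_chromosomes) for p in allele_freqs]
    theta_w = s / (watterson_a(n_chromosomes) * length_bp)  # SNP rate only
    pi = sum(het) / length_bp                               # rate and frequency
    g = sum(het) / s if s else 0.0                          # frequency only
    return theta_w, pi, g


# Made-up locus: 2 SNPs in 100 bp, sampled from 4 chromosomes.
theta_w, pi, g = diversity([0.5, 0.25], length_bp=100, n_chromosomes=4)
```

Under these definitions π equals G multiplied by the number of SNPs per bp, which is the sense in which π combines SNP rate and allele frequency.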

Genetic diversity measures were calculated based on the 220,427 SNP set with up to two uncalled genotypes per locus (as well as those with all genotypes called, but results were highly similar and thus omitted). We performed these tests separately for candidate genes from the literature, for those identified via GO associations, and for those identified in this study as top candidates associated with climate. For the former two tests, we restricted the non-candidate set to the 28,261 contigs with at least one gene model to avoid a downward bias of diversity estimates because many contigs without gene models are of low coverage and variant discovery power is reduced. As an additional control, we also did the same test omitting contigs with no SNPs. For the latter test, we restricted the non-candidate set to only the contigs that had SNPs that were tested in EMMAX because these were already a subset intentionally chosen to have at least a certain population allele frequency, which upwardly biases allele frequency-based diversity measures.

Results

Genetic structure and association with climate

Samples from southern California sites (Malibu Creek State Park and Fort Tejon State Historic Park) are differentiated from other sites as shown by the redundancy analysis (Fig. 2a). Spatial and climate variables together are significantly associated with genetic variation (P < 0.005), and the variables most strongly associated with genetic structure are mean minimum temperature of the coldest month, summer/spring precipitation balance, growing degree-days, latitude, and longitude. In the partial redundancy analysis of climate with SNP variation controlling for spatial location, minimum temperature is most strongly associated with axis 1 and summer/spring precipitation balance is most associated with axis 2 (Fig. 2b). Climate variables explained 26 % of the total explainable genetic variance, spatial variables explained 53 %, and their joint effect explained the remaining 21 %. These results (not shown) are nearly identical whether considering all relevant SNPs or broken down by nonsynonymous, synonymous, and noncoding SNPs (Procrustes test: r > 0.98, P < 0.001).

Fig. 2

a Full redundancy analysis model for association of climate and spatial variables with transcriptome SNPs. Black points display ordination based on genetic variation and represent the underlying genetic structure. The two isolated sets of points are the southern sites of Malibu Creek State Park (right) and Fort Tejon State Historic Park (bottom). Vectors give the direction and magnitude of association of climate and spatial variables with the genetic structure along redundancy axes (RDA) 1 and 2. b Partial redundancy analysis for association of climate variables with transcriptome SNP variation after partialling out the association of spatial variables with SNP variation. This represents the “pure” effect of climate on genetic structure and suggests that minimum temperature and summer/spring precipitation balance are the most important factors of those studied here in structuring transcriptome-wide SNP variation

SNPs under selection

The strongest associations of SNPs with any of the climate variables, after factoring out the underlying genetic structure via kinship, are with minimum temperature and growing season precipitation, and 12 of those SNPs from 10 contigs are statistically significant after adjustment for multiple testing (Q < 0.1) (Fig. 3 and Table 1). Given the large number of tests and small sample size, we also investigated the 67 other SNPs with associations of P < 0.0001, even if less than 500 bp apart. These include 10 SNPs from 6 contigs associated with growing degree-days, 20 additional SNPs from 12 contigs associated with growing season precipitation, 2 SNPs from 2 contigs associated with mean maximum temperature, 27 additional SNPs from 21 contigs associated with mean minimum temperature, and 7 SNPs from 6 contigs associated with summer/spring precipitation balance (Table 1). Of the resulting total of 79 distinct SNPs from 49 distinct contigs, 31 are nonsynonymous, 39 are synonymous, 2 are in 3′ untranslated regions (3′-UTRs), 3 are in 5′-UTRs, and 2 are undetermined. Nonsynonymous SNPs are not overrepresented in this list compared to the number in the overall sample of SNPs analyzed (hypergeometric test, P = 0.8). Of the 49 contigs, 4 are orthologous to A. thaliana genes, and 5 contain Pfam protein domains or TAIR annotations with GO annotations indicating their involvement in response to stimulus or stress (Table 1). All these climate-associated SNPs are also significantly associated in partial Mantel tests, providing a “validation” of the associations that considers climate as the independent variable and genotype as the dependent variable (0.43 < r < 0.79, P < 0.008). Furthermore, QQ plots of the EMMAX results suggest that the false-positive rate is well controlled and thus the significant results are not likely spurious (Fig. S1 in Online Resource 1).

Fig. 3

a Map showing the frequency of a particular nonsynonymous SNP (m01oak13412cC, nucleotide 2194) from a protein of unknown function (AT5G05190) significantly associated with growing season precipitation (background grayscale) (Q = 0.048, EMMAX). Large pie charts represent two valley oak individuals (four alleles) and small charts represent one individual (two alleles); they are colored by allele (blue = G, green = A). b Association of SNP frequency per individual with growing season precipitation for this same SNP (genotypes are 0 = A/A, 0.5 = A/G, 1 = G/G)

Table 1 SNPs with strongest association (P < 0.0001) to climate variables from EMMAX analyses, including functional annotations and diversity measures for each candidate gene

Candidate gene enrichment

Searches of the GO associations yielded 280 water-related genes, 128 heat-related genes, 127 cold-related genes, and 252 flower or leaf development-related genes. In addition, 298 of the 383 candidate genes from the oak literature (Alberto et al. 2013; Derory et al. 2006; Porth et al. 2005; Sork et al. 2016) are identifiable in our reference transcriptome and 233 of them have at least one SNP. Combining the candidate genes from GO with those from the oak literature, we observed from 27 to 122 candidate gene SNPs in the top 5 % of EMMAX associations with their respective climate variable (Table 4), but none have P < 0.0001 and thus none of these genes overlap with those identified as candidates in our EMMAX analyses. Specifically, SNPs from flower/leaf development-related genes are enriched in the top 5 % of EMMAX associations with growing degree-days, growing season precipitation, and minimum temperature (P < 0.003) (Table 4), but significant enrichment is not found for drought, heat, or cold gene SNPs in association with precipitation, maximum temperature, or minimum temperature, respectively (P > 0.26). Furthermore, Wilcoxon rank-sum tests show that SNPs from flower/leaf development-related genes have significantly lower P values for associations with growing degree-days and growing season precipitation compared to other genes (P < 0.037) and marginally significantly stronger associations with minimum temperature (P = 0.061) and summer/spring precipitation balance (P = 0.089) (Table 5). SNPs from cold genes also have stronger associations with minimum temperature (P = 0.012), and SNPs from heat genes have marginally significantly stronger associations with maximum temperature (P = 0.064).

Genetic diversity in candidate genes

Genetic diversity is higher in contigs containing SNPs that were from the top associations in EMMAX (i.e., those with P < 0.0001) compared to the other contigs tested in EMMAX, whether measured by θ W (P = 0.001), π (P = 0.00012), or G (P = 0.061) (Tables 1, 2, and 3). Genetic diversity is also higher for all measures in candidate genes from the oak literature and from the GO associations search when compared to all other contigs containing gene models (P < 2.2 × 10⁻¹⁶). However, this latter effect disappears when only variable candidate genes (i.e., those containing SNPs) and variable non-candidate genes are considered (P > 0.132).

Table 2 Mean genetic diversity measures for candidate genes and non-candidate genes
Table 3 P values for Wilcoxon rank-sum tests of whether diversity measures are higher in candidate genes than non-candidate genes

Discussion

Demographic response to climate

The transcriptome-wide genetic structure of valley oak appears to have been shaped by climate. Climate explained a substantial share of the explainable SNP variation after controlling for spatial location (26 %), suggesting that climate shapes genomic variation independent of any association of climate with geographic location. Specifically, minimum temperature and summer/spring precipitation balance have the strongest associations among the factors investigated. These associations likely reflect the effects of climate (especially minimum temperature) on demographic processes, such as population expansion, contraction, and establishment (James et al. 2011), and the influence of climate (especially temperature and precipitation balance) on gene flow through its influence on flowering phenology (Knight et al. 2005; Ortego et al. 2012).

Similar analyses of microsatellite variation in Q. lobata from 65 sample sites also support this role of minimum temperature and precipitation seasonality in shaping genetic variation and, in fact, suggest a potentially even larger role for climate than this study (Gugger et al. 2013; Sork et al. 2010). In the transcriptome data, the genetic distinction of two southern populations is pronounced (Fig. 2a), suggesting restricted gene flow between them and with other populations, possibly due to geographic barriers. Here, we do not observe the clear east–west structure (i.e., coast versus Sierra Nevada) that we did with microsatellites. Further, another study focusing on a subset of candidate genes from different localities also did not find east–west structure (Sork et al. 2016). However, the fact that three separate studies found genetic associations with similar climate variables provides strong evidence that environment is shaping genome-wide genetic structure through processes other than natural selection.

Natural selection by climate

Despite the genome-wide association with climate, we found strong evidence that natural selection by climate is important in local adaptation of valley oak. Even with a relatively small sample size of 22 individual trees, we identified 12 SNPs from 10 contigs significantly associated with climate variables after multiple testing adjustment (Q < 0.1) and an additional 67 SNPs from 39 contigs with very strong support (P < 0.0001) after factoring out background association of genetic structure with climate. Thirty-one of the 79 top SNPs are nonsynonymous and lead to amino acid substitutions, and 5 are in UTRs and thus could be involved in regulation of expression (Barrett et al. 2012); both of these types are consistent with functional significance.

A few of the SNPs are in genes with known roles in response to stimulus or stress (e.g., SAUR-like auxin-responsive protein family), cold shock protein binding (zinc knuckle family protein), light response or photosynthesis (e.g., cryptochrome 1), and trichome development (myosin family protein with Dil domain) (Table 1). The last of these is especially interesting because trichomes are thought to be important in drought tolerance in plants (Kärkkäinen et al. 2004).

Some of the SNPs with significant climate associations follow a north–south gradient orthogonal to the neutral genetic structure suggested in other studies based on microsatellite variation from more sites than this study (Gugger et al. 2013; Sork et al. 2010) (e.g., Fig. 3). Overall, these candidate SNPs showed especially strong correlations with growing season precipitation and minimum temperature, which are variables that also seem to be important shapers of the underlying genetic structure, presumably through their influence on demography and mating patterns. Given that Q. lobata occupies largely water-limited environments with frequent droughts, finding compelling evidence for natural selection by precipitation is not surprising.

The climate-associated SNPs from the EMMAX analyses also tend to have especially strong associations with the first axis in the partial redundancy analyses of climate with SNPs conditioned on spatial variables (Wilcoxon rank-sum test, P = 5 × 10⁻⁸), although only 1 of 79 SNPs was in the top 5 % of associations (m01oak05422cC, nucleotide 1044). This partial redundancy analysis could be considered another means of controlling population structure while testing for SNP-climate associations, and there is a growing interest in applying multivariate ordination approaches to identify specific loci of interest in environmental association tests (Sork et al. 2013). Although our data do not allow for a rigorous assessment of the ability of the redundancy analysis model to effectively account for population structure, and while a number of factors could lead to differences between these methods, it is encouraging that the SNPs identified in EMMAX also tend to be strongly associated in the partial redundancy analysis (Sork et al. 2016).

Finally, candidate genes for climate adaptation that were identified in other studies are among the top associations with climate in our analyses. These especially include flowering and leaf development genes associated with growing degree-days and growing season precipitation; cold genes with minimum temperature; and heat genes with maximum temperature (Tables 4 and 5), lending additional support to their role in climate adaptation in oaks.

Table 4 Hypergeometric tests for enrichment of SNPs from oak literature candidate genes in the top 5 % of climate associations from EMMAX
Table 5 Wilcoxon rank-sum tests of whether P values from EMMAX climate associations for SNPs in oak literature candidate genes are lower than for non-candidate genes

Future studies will consider a larger sample of localities that will increase the ability to detect specific SNPs varying along climate gradients.

Genetic diversity in candidate genes

We find preliminary support for the hypothesis that climate-adaptive genes have elevated levels of genetic diversity. The most compelling evidence comes from the candidate genes identified here as the top associations with climate variables in EMMAX (Table 1). These genes had θ W and π over 50 % higher than other genes that were tested in EMMAX (Table 2). Moreover, G was 0.200 in candidate genes compared to 0.188 in non-candidates, providing support for the main hypothesis across a range of types of diversity measures. The elevated genetic diversity summarized across all top associations should be statistically robust (Lohmueller et al. 2013), despite the possibility of occasional false positives at any particular SNP or gene. On the other hand, our analysis does not control for the fact that loci with more SNPs were subjected to more tests in EMMAX, thus increasing the chance of finding a SNP significantly associated with climate. Choosing only one random SNP per locus led to too small a sampling of climate-associated SNPs to test for differences with the background.

Candidate genes from the oak literature as well as candidate genes identified by GO associations had higher diversity than non-candidates on average (Table 2). However, when only variable candidate and non-candidate genes from GO or the literature were compared, the difference was not significant (Table 3), suggesting that those candidate genes were more likely to be variable than non-candidate genes, but not more variable than other genes with variation. It is possible that the candidate genes from the literature and GO consist of many conserved genes involved in global responses to environmental perturbation that are not necessarily involved in local adaptation.

High diversity and significant associations with climate gradients are patterns that are consistent with balancing selection and/or disruptive selection maintaining diversity in climate-adaptive loci by favoring different alleles in different climate contexts. Although our study provides mixed support for elevated diversity, similar conclusions have been drawn for candidate genes for locally adaptive abiotic stress responses first identified using differential gene expression analyses in Arabidopsis (Lasky et al. 2014). Alternatively, soft selective sweeps that lead to increased adaptive allele frequencies in different parts of the distribution have been observed in Medicago (Yoder et al. 2014). Future studies with the ability to ascertain haplotypes will further clarify this possibility.

Conclusions

Climate has likely shaped both demographic and adaptive evolutionary processes in valley oak. Even with small sample sizes, we were able to disentangle candidate SNPs underlying climate adaptation from the background association of genomic variation with climate. As a result, we find some support that putatively climate-adaptive genes may have unusually high genetic variation, which we hypothesize is the result of natural selection leading to local adaptation that maintains diversity. Our study further highlights that large sequencing data sets and individual-based SNP analyses offer powerful means of identifying genes important in adaptation and the overall influence of climate on the genome.