Introduction

Water deficit stress is one of the major factors limiting crop production and productivity in many regions, and food security in the twenty-first century depends, in part, on improved drought-tolerant cultivars with high yield stability (Tuberosa 2013; Tuberosa et al. 2002). Various abiotic stresses including drought, salinity and extreme temperature affect crop growth and development at different stages. Among these, drought may be the most daunting challenge faced by breeders (Tuberosa 2013; Tuberosa and Salvi 2006). Drought tolerance is complex and driven by diverse drought-adaptive mechanisms that are controlled by a number of genes and environmental factors (Blum 2005; Pinto et al. 2010; Reynolds and Tuberosa 2008). Soybean is one of the most important sources of plant protein and oil worldwide and soybean crops face many challenges posed by environmental stresses.

Over millions of years, plants have evolved mechanisms to tolerate or escape water deficits. These mechanisms range from morphological modifications to physiological adaptations such as water use efficiency (WUE) (Baum et al. 2007; Taji et al. 2004). Water use efficiency can be defined in different ways: in crop production it is commonly defined as the ratio of grain yield to water used during crop growth and is often called agronomic WUE (Angus and van Herwarden 2001; Gilbert et al. 2011; Passioura 1977, 2004); at the plant level, WUE can be defined as the amount of biomass produced per unit of water transpired; and at the leaf level, it generally refers to photosynthetic carbon gain per unit of water transpired and is termed intrinsic WUE (Angus and van Herwarden 2001; Gilbert et al. 2011; Passioura 1977, 2004). Under water-limited conditions, grain yield at the crop production level can be expressed as a function of the amount of water used (WU), WUE, and harvest index (HI): grain yield = WU × WUE × HI (Passioura 1977; Salekdeh et al. 2009). While the ultimate target for crop improvement is increased agronomic WUE, the focus of this article is at the leaf level, and WUE is defined in that context.

Although attempts to improve crop yields by selection for greater WUE can have significant limitations and do not always prove successful, it is widely recognized that improved WUE can enhance yield in certain environments (Condon et al. 2004; Gilbert et al. 2011; Sinclair 2012). Selection for increased WUE has played an important role in improving performance of wheat yield under late-season drought conditions (Condon et al. 2004). However, the use of the WUE trait in breeding programs has largely been limited because of the difficulty associated with measuring actual WUE in large populations. Nonetheless, a promising screening method for WUE was developed in the 1980s based on plant tissue carbon isotope composition (Farquhar et al. 1982; O’Leary 1981). Approximately, 1.1 % of the carbon in the biosphere naturally occurs in the form of the stable isotope 13C and the remaining 98.9 % is 12C (Condon et al. 2002; Farquhar et al. 1989; O’Leary 1981). However, the molar abundance ratio of 13C/12C in plant tissues is usually less than that in atmospheric CO2 because of discrimination against the ‘heavier’ 13C during photosynthesis (Farquhar et al. 1989; Farquhar and Richards 1984). The magnitude of this discrimination varies with photosynthetic type (C3 or C4), environment and genotype.

Extensive studies on C3 species have been reported and have confirmed the relationship between carbon isotope composition, whether measured as carbon isotope ratio (δ13C) or carbon isotope discrimination (Δ13C), and WUE (Condon et al. 1990; Ehleringer et al. 1991; Ismail and Hall 1992; Rebetzke et al. 2002). Farquhar and Richards (1984) proposed the use of Δ13C as an expression of the 13C/12C ratio in the plant tissue relative to the 13C/12C ratio in the air. While both δ13C and Δ13C are related to WUE, the correlation between δ13C and WUE is positive, while the correlation between Δ13C and WUE is negative. Because of the correlation with WUE, carbon isotope composition has been analyzed in tissues from a wide range of plant species to assess WUE (Brüggemann et al. 2011; Wingate et al. 2010). The association between WUE and carbon isotope composition is due to a common relationship of the ratio of CO2 inside and outside of the leaf. That is, as the ratio of internal to external CO2 decreases, both WUE and carbon isotopic composition increase. Carbon isotope composition has been used widely as an indirect method for the selection of genotypes with improved WUE and productivity in some environments (Cattivelli et al. 2008; Condon et al. 2004), and there has been interest in improving crop performance through direct selection for carbon isotope composition (Araus et al. 2002; Rebetzke et al. 2002). Further, in a growing number of studies it has been used successfully to investigate the role of WUE in drought adaptation (Ahmed et al. 2013; Chen et al. 2012).
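The opposite signs of the two correlations follow from how discrimination is derived from composition. The sketch below assumes the widely used expression Δ13C = (δa − δp)/(1 + δp/1000), with both δ values in ‰, and an assumed atmospheric value δa of −8 ‰; neither the function name nor the δa value comes from this study.

```python
def discrimination(delta_plant, delta_air=-8.0):
    """Convert plant delta13C (per mil) to Delta13C (per mil).

    delta_air = -8.0 is an assumed atmospheric value, not a value
    reported in this study.
    """
    return (delta_air - delta_plant) / (1.0 + delta_plant / 1000.0)


# A less negative (higher) delta13C gives a lower Delta13C, which is why
# the two measures correlate with WUE in opposite directions.
for delta_plant in (-30.55, -27.74):
    print(delta_plant, round(discrimination(delta_plant), 2))
```

Because the conversion is monotonically decreasing in δp, ranking genotypes by δ13C or by Δ13C gives the same order, only reversed.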

Drought tolerance is a complex, quantitative trait. Selection efficiency for drought-tolerance traits could be enhanced with a better understanding of their genetic control (Chen et al. 2011). Quantitative trait locus (QTL) mapping and analysis provides unprecedented opportunities to identify and locate chromosomal regions controlling adaptive traits such as δ13C during plant growth in water-limited conditions. Associations of carbon isotope composition with leaf characteristics and other physiological traits have been reported for several plant species (Condon et al. 1990; Geber and Dawson 1997; Johnson 1993; Saranga et al. 1999). QTLs for δ13C or Δ13C have been reported for Arabidopsis (Arabidopsis thaliana) (Juenger et al. 2005), rice (Oryza sativa) (Laza et al. 2006; Takai et al. 2006), soybean (Specht et al. 2001), cotton (Gossypium spp.) (Saranga et al. 2004) and barley (Hordeum vulgare L.) (Teulat et al. 2002). Sufficient genotypic variation, stability across environments and high broad-sense heritability (H²) in carbon isotope composition indicate that it may be a promising surrogate for WUE that can be applied in breeding programs for legumes such as soybean as well as in cereal crops (Condon and Richards 1992; Rebetzke et al. 2008; Specht et al. 2001).

Among the different classes of molecular markers currently available, single nucleotide polymorphisms (SNPs) have proven to be the marker of choice for a variety of applications, particularly in breeding. Genome-wide association study (GWAS) is a powerful approach to identify the positions of genetic factors underlying complex traits (Riedelsheimer et al. 2012; Zhao et al. 2011). In a recent study, GWAS was performed in soybean to identify quantitative trait loci (QTLs) controlling seed protein and oil concentration in 298 germplasm accessions exhibiting a wide range of seed protein and oil content (Hwang et al. 2014). Compared to QTL analysis, GWAS can provide relatively higher resolution in defining the genomic position of a gene or QTL, because the level of linkage disequilibrium (LD) is much lower in naturally occurring populations, such as human populations or germplasm collections, than in the biparental populations generally used in QTL analyses (Abdurakhmonov and Abdukarimov 2008). Recent advances in genome sequencing and SNP genotyping facilitate association analysis for identification of genomic regions of importance for crop improvement (Rafalski 2010). In soybean, LD has been shown to be more extensive in heterochromatic than in euchromatic regions (Hwang et al. 2014), and it may differ between populations of soybean ancestors (Glycine soja Sieb. et Zucc.), landraces and cultivars (Hyten et al. 2007). In cultivated soybean, Hwang et al. (2014) reported that mean LD (r²) decayed to 0.2 within about 360 kbp in euchromatic regions, whereas it decayed to 0.2 only at about 9,600 kbp in heterochromatic regions. In their soybean GWAS analysis, Hwang et al. (2014) were able to map most of the previously reported seed protein and oil QTLs to narrower genomic regions than originally reported.

Investigations of the genetics underlying WUE in soybean have been limited. Previous studies were based on biparental populations and employed restriction fragment length polymorphism (RFLP) and simple sequence repeat (SSR) markers, respectively (Mian et al. 1996, 1998; Specht et al. 2001). In the RFLP studies, WUE of ~36-day-old greenhouse-grown plants was determined gravimetrically; four independent QTLs were identified in one population (Mian et al. 1996) and two independent QTLs in a second population (Mian et al. 1998), with one of the markers apparently linked to the same QTL in the two populations. In another study, δ13C was determined on juvenile trifoliates from a population of 236 recombinant inbred lines derived from parents that did not differ in δ13C (Specht et al. 2001). Using δ13C data from one irrigated and one non-irrigated field environment in the same year, the authors identified only QTLs for δ13C that either coincided with maturity or determinacy QTLs, or were not associated with WUE. Thus, despite the importance of drought tolerance, our knowledge of the genetics underlying soybean WUE is minimal at best. Genome-wide association studies, coupled with δ13C, provide opportunities for rapid identification of novel SNP-based markers associated with WUE and facilitate the selection of promising parental genotypes for germplasm improvement and further genetic studies. The primary objective of this study was to use GWAS to identify SNPs associated with δ13C which ultimately may be used to improve WUE in soybean.

Methods

Field experiments and management

Field experiments were conducted in 2009 and 2010 at the Bradford Research and Extension Center (BREC) in Columbia, MO, USA (38°53′N, 92°12′W) and the Rice Research Experiment Station near Stuttgart, AR (34°30′N, 91°33′W). At Columbia, plants were grown on a Mexico silt loam (fine, smectitic, mesic Aeric Vertic Epiaqualf) and at Stuttgart on a Crowley silt loam (fine, montmorillonitic, thermic Typic Albaqualfs). Fields were tilled prior to sowing, which occurred on 23 May 2009 and 27 May 2010 in Columbia and 2 June 2009 and 10 June 2010 in Stuttgart. Seeds were planted at 2.5 cm depth at a density of 25 seeds m−2. In Columbia, plots were 4.87 m long and four rows wide with 0.76 m row spacing. At Stuttgart, single-row plots 6.1 m in length and with 0.76 m between rows were sown. The experiments in Columbia were conducted under rainfed conditions, while plots at Stuttgart were furrow irrigated as needed. Pre-plant applications of P and K were based on results from soil test analyses and corresponding recommendations from the University of Missouri (Columbia) and the University of Arkansas (Stuttgart). Weed control in Columbia was conducted by applying the pre-emergence herbicide sulfentrazone at a rate of 0.3 kg ai ha−1 and the post-emergence herbicide sethoxydim at a rate of 2.6 kg ai ha−1. Lambda-cyhalothrin at a rate of 0.23 kg ai ha−1 was applied to control insects. Prior to emergence, imazaquin and metolachlor were applied at Stuttgart at rates of 0.14 and 2.24 kg ai ha−1, respectively. For post-emergence weed control, fomesafen and clethodim were applied at rates of 1.46 and 0.73 kg ai ha−1, respectively.

Experimental design

A total of 385 soybean (Glycine max (L.) Merr.) genotypes (376 genotypes from Soybean Germplasm Collection, USDA-ARS, and 9 other genotypes) within maturity group (MG) IV were planted in a randomized complete block design with three replications at both locations and in both years. The two growing seasons at each location were considered as four environments.

The MG IV genotypes evaluated were selected from the USDA-ARS Germplasm collection. Selection was based on GRIN (Germplasm Resources Information Network, www.ars-grin.gov) data with genotypes falling into one of two groups. The first group (Group A, 182 genotypes) consisted of the highest yielding (>2.5 Mg ha−1) MG IV genotypes with good agronomic traits (height, lodging, shattering, etc.) without regard to any consideration of genetic diversity. In the second group (Group B, 191 genotypes), good agronomic traits were maintained, but the yield threshold was lowered (<2.5 Mg ha−1) and consideration was given to country and province of origin in an attempt to maximize diversity. Accessions were included in the two groups based on GRIN data and recommendations from the soybean germplasm collection curator, Dr. Randall Nelson.

Biomass sampling and δ13C analysis

Shoots of five plants from each plot were randomly harvested at the soil surface at 53 days after planting (DAP) at Columbia in both years, and 50 and 61 DAP at Stuttgart in 2009 and 2010, respectively. Aboveground biomass samples were harvested when all genotypes were at beginning bloom to full bloom (R1 to R2) (Fehr et al. 1971). The samples were dried in an oven at 60 °C until completely dry and then ground to pass a 2 mm screen using a Wiley Mill (Thomas Model 4 Wiley® Mill, Thomas Scientific, NJ USA). After mixing, a subsample of about 1/4th of each sample was ground again using a UDY Cyclone sample mill (MODEL 3010-014, UDY Corporation, CO USA). After thorough mixing, a subsample of about 0.2 g was then transferred to a 15 ml tube (part # 2252-PC-30; SPEX CertiPrep, Inc., NJ USA) and a 9.52 mm diameter stainless ball (440C Stainless Steel Ball, Tolerance/Grade: 100, Abbott Ball Company, Inc., CT USA) was placed inside the tube along with the sample for grinding. Each sample was ground for 9 min at 1,200 rpm using a Geno/Grinder equipped with a large clamp assembly and a 15 ml tube foam holder (SPEX CertiPrep, Inc., NJ USA). Thereafter, about 3 mg of powdered sample was carefully packed in tin capsules and arranged in 96-well plates (Costech Analytical Technologies Inc., CA USA) according to the procedure described by the University of California, Davis Stable Isotope Facility for stable isotope analysis (UC Davis Stable Isotope Facility, CA USA).

The δ13C isotope analysis was conducted using an elemental analyzer interfaced to a continuous flow isotope ratio mass spectrometer. The final δ13C values were expressed relative to the international standard V-PDB (Vienna PeeDee Belemnite). For more information refer to the Stable Isotope facility website, http://stableisotopefacility.ucdavis.edu/13cand15n.html.

SNP genotyping and LD analysis

Genotypic data for ~23,000 SNPs from the SoySNP50K iSelect SNP BeadChip were obtained for a select group of 385 soybean genotypes (Song et al. 2013). From the original set of 385 genotypes evaluated in four environments (2 years and two locations), phenotypic data for δ13C were available for 373 genotypes. Based on previous work (Pasam et al. 2012), a minimum minor allele frequency (MAF) of 5 % was employed. Of the ~23,000 polymorphic SNPs, 12,347 had an MAF ≥5 % across the 373 genotypes and these were evaluated in this study for associations with δ13C.
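The MAF filter can be illustrated with a minimal sketch. It assumes genotypes are coded as counts of one allele (0, 1 or 2) with `None` for missing calls; the coding, the function names and the example data are hypothetical and are not taken from the SoySNP50K data files.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF of one SNP from allele-count calls (0, 1, 2).

    Missing calls (None) are ignored; the 0/1/2 coding is an assumption.
    """
    calls = [g for g in genotypes if g is not None]
    if not calls:
        return 0.0
    allele_freq = sum(calls) / (2 * len(calls))
    return min(allele_freq, 1.0 - allele_freq)


def filter_by_maf(snp_table, threshold=0.05):
    """Keep SNPs whose MAF is at least the threshold (5 % in the text)."""
    return {
        snp: calls
        for snp, calls in snp_table.items()
        if minor_allele_frequency(calls) >= threshold
    }


# Hypothetical data: 'snp_a' has MAF 0.10 and is kept, 'snp_b' has 0.01.
example = {
    "snp_a": [0] * 18 + [2, 2],
    "snp_b": [0] * 99 + [2],
}
print(sorted(filter_by_maf(example)))
```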

Linkage disequilibrium (LD) was calculated using the 12,347 SNPs with MAF ≥5 % covering the 20 chromosomes. Pairwise LD (r²) among SNPs was calculated and haplotype blocks were identified from SNPs within 1 Mb windows using the Haploview software (Barrett et al. 2005).
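As a rough illustration of the pairwise statistic, r² between two SNPs can be approximated as the squared Pearson correlation of their allele-count codes. This is a simplified stand-in, not the haplotype-based estimate that Haploview computes, and the 0/1/2 coding with `None` for missing calls is an assumption.

```python
def r_squared(snp_x, snp_y):
    """Squared Pearson correlation between two allele-count vectors.

    An approximation of LD r^2 for illustration only; pairs with a
    missing call (None) in either SNP are dropped.
    """
    pairs = [(x, y) for x, y in zip(snp_x, snp_y)
             if x is not None and y is not None]
    if not pairs:
        return 0.0
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_x * var_y)


# Hypothetical calls: identical SNPs are in complete LD,
# a balanced independent pair shows none.
print(r_squared([0, 2, 0, 2, 2], [0, 2, 0, 2, 2]))
print(r_squared([0, 0, 2, 2], [0, 2, 0, 2]))
```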

ANOVA, BLUP and heritability

The experimental design was a randomized complete block with three replications at two locations (Columbia and Stuttgart) in two consecutive years (2009 and 2010). Analysis of variance (ANOVA) was calculated with PROC ANOVA (SAS-Institute-Inc 2004). The four environments were designated CO-09, CO-10, ST-09 and ST-10, corresponding to Columbia 2009, Columbia 2010, Stuttgart 2009 and Stuttgart 2010. To analyze the G × E interaction for δ13C from 373 soybean genotypes, the 2 years and two locations were treated as four environments, and analysis of variance was performed using the PROC MIXED procedure (α = 0.05) of SAS 9.3 with the model suggested by Bondari (2003):

$$Y_{\text{ijk}} = \mu + G_{\text{i}} + E_{\text{j}} + GE_{\text{ij}} + B_{\text{jk}} + \varepsilon_{\text{ijk}} \tag{1}$$

where $\mu$ is the mean, $G_{\text{i}}$ is the effect of the ith genotype, $E_{\text{j}}$ is the effect of the jth environment, $GE_{\text{ij}}$ is the interaction of the ith genotype with the jth environment, $B_{\text{jk}}$ is the effect of the kth replication within the jth environment and $\varepsilon_{\text{ijk}}$ is the error. Genotype was considered as a fixed effect and replication nested within environment was used as a random effect.

To minimize the effects of environmental variation, the best linear unbiased predictions (BLUPs) were used for genome-wide association analysis (Kump et al. 2011). BLUP values were derived for each environment independently and also across all environments and were used as phenotypic values for genome-wide association analyses of data from each environment and across all environments. The BLUPs and variance components for δ13C per genotype were obtained using the PROC MIXED procedure of SAS 9.3 (SAS-Institute-Inc 2004). For BLUP determinations for individual environments, all factors were considered as random effects (Littell et al. 1996). To derive across-environment BLUP values, environment was considered as a fixed effect and all other factors as random (Piepho et al. 2008; Edae et al. 2014). The broad-sense heritability (Holland et al. 2003) for δ13C was derived from the variance components obtained from the PROC MIXED procedure described above, following Piepho and Möhring (2007).
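To show how variance components translate into a heritability estimate, the sketch below uses the familiar entry-mean expression H² = σ²G / (σ²G + σ²GE/e + σ²ε/(e·r)), where e is the number of environments and r the number of replications. This form is an assumption chosen for illustration; the estimator actually applied in this study (Piepho and Möhring 2007) may differ, and the variance components in the example are invented.

```python
def broad_sense_heritability(var_g, var_ge, var_error, n_env, n_rep):
    """Entry-mean broad-sense heritability from variance components.

    Assumed form: var_g / (var_g + var_ge/n_env + var_error/(n_env*n_rep)).
    """
    phenotypic = var_g + var_ge / n_env + var_error / (n_env * n_rep)
    return var_g / phenotypic


# Hypothetical variance components; four environments and three
# replications match the design described in the text.
h2 = broad_sense_heritability(var_g=0.10, var_ge=0.06, var_error=0.30,
                              n_env=4, n_rep=3)
print(round(100 * h2, 2), "%")
```

Setting `n_env=1` gives the corresponding single-environment estimate when no G × E term is fitted (pass `var_ge=0.0`).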

Genetic diversity analysis and AMOVA

Summary statistics for the marker data such as minor allele frequency, heterozygosity, gene diversity and polymorphism information content (PIC) were calculated with PowerMarker software V3.25 (Liu and Muse 2005). The PIC value described by Botstein et al. (1980) was used to express the relative value of each marker with respect to the amount of polymorphism exhibited. The PIC value was estimated by the following formula:

$$\text{PIC}_{\text{i}} = 1 - \sum_{j=1}^{n} P_{\text{ij}}^{2} - \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} 2 P_{\text{ij}}^{2} P_{\text{ik}}^{2} \tag{2}$$

In this formula, $P_{\text{ij}}$ and $P_{\text{ik}}$ are the frequencies of the jth and kth alleles for marker i, respectively. The heterozygosity value indicates the proportion of heterozygous loci detected in a single soybean genotype. The gene diversity is defined as the probability that two alleles randomly chosen from the test sample are different. The heterozygosity and gene diversity were calculated to quantify the genetic variation in the soybean genotypes evaluated. The common biased estimator of the gene diversity for marker i can be obtained from the above equation by dropping the last term, as previously described (Chen et al. 2011; Lu et al. 2009). Allele frequency was calculated to characterize the differentiation and geographic patterns of genetic diversity in the sampled genotypes. Analysis of molecular variance (AMOVA) was calculated with GenAlEx 6.41 (Peakall et al. 2006) with 1,000 permutations.
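The PIC formula and the gene diversity estimator obtained by dropping its last term can be written out directly. The following is a minimal sketch; the function names are illustrative and this is not the PowerMarker implementation.

```python
from itertools import combinations


def gene_diversity(allele_freqs):
    """Biased gene diversity estimator: 1 minus the sum of squared frequencies."""
    return 1.0 - sum(p * p for p in allele_freqs)


def pic(allele_freqs):
    """Polymorphism information content for one marker.

    Gene diversity minus the cross term 2 * p_j^2 * p_k^2 summed over
    all unordered allele pairs (j < k).
    """
    cross = sum(2.0 * (p * p) * (q * q)
                for p, q in combinations(allele_freqs, 2))
    return gene_diversity(allele_freqs) - cross


# A biallelic SNP with equal allele frequencies, and one at MAF 0.05.
for freqs in ([0.5, 0.5], [0.95, 0.05]):
    print(freqs, round(gene_diversity(freqs), 4), round(pic(freqs), 4))
```

For a biallelic marker both statistics depend only on the MAF, and PIC is always the smaller of the two.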

Population structure and clustering

The population structure was inferred using the Bayesian model-based software program STRUCTURE 2.2 (Pritchard et al. 2000) with the 12,347 SNPs. The burn-in length was 100,000 iterations, followed by 100,000 Markov chain Monte Carlo (MCMC) replications, under an admixture model with correlated allele frequencies. The population structure analysis was performed with five independent runs for each hypothetical number of subpopulations (k) ranging from 1 to 10. The most likely value of k was determined by plotting the estimated likelihood value of the data [LnP(D)] from the STRUCTURE output and the ad hoc statistic Δk (Evanno et al. 2005). Further analysis was based on the rate of change in the log probability of data between successive k values, choosing the k that best describes the population structure by maximizing the log probability or the value at which LnP(D) reaches a plateau. All soybean accessions were assigned to a subpopulation based on the selected k (k = 8), for which the membership value (Q value) was >0.5 (Breseghello and Sorrells 2006), and the population structure matrix (Q) was generated for further association analyses. The kinship matrix (K) was calculated by a built-in function of TASSEL 3.0 software (Bradbury et al. 2007; Buckler et al. 2009) using the 12,347 SNPs to obtain the pairwise relatedness without any missing values. The kinship matrix was first generated using the TASSEL cladogram function to calculate a distance matrix. Each element $d_{\text{ij}}$ of the distance matrix is equal to the proportion of the SNPs that differ between taxon i and taxon j. The distance matrix is converted to a similarity matrix by subtracting all values from 2 and then scaling, so that the minimum value in the matrix is 0 and the maximum value is 2.
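The distance-to-similarity conversion described above can be sketched as follows. The linear rescaling used to map the range onto 0 to 2 is an assumption about what "scaling" means here, the function names are illustrative, and the code is not TASSEL's implementation.

```python
def distance_matrix(genotypes):
    """Proportion of SNPs that differ between each pair of taxa.

    genotypes: list of equal-length sequences of allele calls, one per taxon.
    """
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            differing = sum(a != b for a, b in zip(genotypes[i], genotypes[j]))
            dist[i][j] = dist[j][i] = differing / len(genotypes[i])
    return dist


def similarity_from_distance(dist):
    """Subtract every distance from 2, then rescale linearly to [0, 2]."""
    raw = [[2.0 - value for value in row] for row in dist]
    low = min(min(row) for row in raw)
    high = max(max(row) for row in raw)
    if high == low:
        return [[2.0 for _ in row] for row in raw]
    return [[2.0 * (value - low) / (high - low) for value in row]
            for row in raw]


# Hypothetical calls for three taxa at four SNPs.
taxa = ["AAAA", "AAAT", "TTTT"]
kinship = similarity_from_distance(distance_matrix(taxa))
for row in kinship:
    print([round(value, 2) for value in row])
```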

Clustering of genotypes was done with the cladogram function in TASSEL 3.0 (Bradbury et al. 2007; Buckler et al. 2009) to produce a neighbor-joining (NJ) relationship using parsimony substitution models and an unweighted pair group method with arithmetic mean (UPGMA) Newick file. The output Newick file was used as input in TreeDyn 198.3 software (Chevenet et al. 2006) to obtain the final tree.

Genome-wide association analysis

To account for the population structure and genetic relatedness, two statistical models were tested: (i) a general linear model (GLM) with the Q-matrix, and (ii) a mixed linear model (MLM) with the Q-matrix and K-matrix. The Q- and K-matrices were used as corrections for population structure and/or genetic relatedness (Dhanapal and Crisosto 2013; Pasam et al. 2012; Yang et al. 2010; Yu et al. 2006). Genome-wide association analyses based on these models were conducted with the software TASSEL 3.0 (Bradbury et al. 2007; Buckler et al. 2009). Markers were defined as being significantly associated with δ13C at thresholds of −Log10 P ≥ 3.00 (P ≤ 0.001) for GLM + Q and −Log10 P ≥ 2.00 (P ≤ 0.01) for MLM (Q + K) (Hao et al. 2012; Yang et al. 2010). The P values obtained from both GLM + Q and MLM (Q + K) were used as an input file for a script, written with small modifications in R software (R Development Core Team 2013), to generate Manhattan plots.

A permutation testing approach was employed to establish the significance of marker-trait associations using the GLM + Q model. For the GLM + Q model, 10,000 permutation runs were performed using TASSEL software (Bradbury et al. 2007; Bush and Moore 2012; Anderson and Ter Braak 2003). The associations were regarded as significant when adjusted P values were <0.05. Further, a probability value ≤0.001 (−Log10 P ≥ 3) was used for selecting significant marker-trait associations that fit the P < 0.05 criterion. To help avoid false positives, population structure (Q) was used in both models, and for the MLM (Q + K) model, the kinship matrix (K) was also included. Both models were employed in the analysis of each environment. For an SNP to be considered a candidate, it had to exhibit a significant association in both models as well as in at least two environments, which also served to reduce the number of false positives. For both models and for analyses by environment as well as over all environments, multiple testing correction was performed to assess the significance of marker-trait associations using QVALUE R 3.1.0 employing the smoother method (Storey and Tibshirani 2003), an extension of the false discovery rate (FDR) method (Benjamini and Hochberg 1995). Markers with qFDR < 0.01 were considered to be significant.
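The idea behind the FDR correction can be illustrated with the Benjamini and Hochberg (1995) step-up procedure. Note that this is the simpler base method, swapped in for illustration; it is not the Storey and Tibshirani smoother-based q-value computed by QVALUE, and the P values in the example are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values
    # stay monotone in the rank of the raw P values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw P values for four markers.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in benjamini_hochberg(raw)])
```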

Results

δ13C descriptive statistics

Measurements of δ13C were obtained on 373 MG IV soybean genotypes over 2 years (2009 and 2010) at two locations (Columbia, MO, and Stuttgart, AR). For analysis, each location in each year was considered a different environment. Figure 1 shows the 7-day running averages for solar radiation, maximum and minimum temperature and the daily rainfall for each environment. Conditions during the growing period between planting and sampling (denoted by horizontal lines in the top panel of the figure) were different in each environment. In general, it was warmer (both minimum and maximum temperatures) in Stuttgart than in Columbia and warmer in 2010 than in 2009 for both locations. Daily solar radiation was generally greater in Stuttgart than in Columbia, and at Stuttgart in particular, daily radiation was greater in 2010 than in 2009. For the most part, Columbia received more rainfall than Stuttgart, except at the end of the sampling period in 2009. At Stuttgart, it was considerably wetter at the end of the growing period in 2009 than in 2010.

Fig. 1 Weather conditions during the growing season. Graph showing 7-day running averages versus day of year for solar radiation, maximum and minimum temperature and daily rainfall during the crop season for the four environments (two locations and 2 years, Columbia and Stuttgart, 2009 and 2010). Horizontal lines in the top panel indicate the growing period between planting and sampling

The difference between the extreme mean δ13C values in each environment was 1.86, 1.46, 1.59 and 1.70 ‰ for CO-09, CO-10, ST-09 and ST-10, respectively. Over all four environments the minimum δ13C value was in ST-10 (−30.55 ‰) and the maximum value was in ST-09 (−27.74 ‰). Other descriptive statistics of each environment are shown in Table 1, and the frequency distribution of the δ13C values of the 373 genotypes in each of the four environments is shown in Fig. 2. The ranges of values in the distributions of the 2 years at Columbia (CO-09 and CO-10) were similar, whereas there was considerable difference in the ranges of values between the 2 years at Stuttgart (ST-09 and ST-10). However, no significant (P ≤ 0.05) skewness or kurtosis in the distributions was found in any of the four environments. Correlations of genotypic δ13C values among all four environments were highly significant (P ≤ 0.001) and ranged from r = 0.35 between ST-09 and ST-10 to r = 0.61 between CO-09 and CO-10. Analysis of variance revealed significant (P < 0.0001) genotype (G), environment (E), and G × E interaction effects for δ13C (Table 2).

Table 1 Descriptive statistics of δ13C (‰) for plant samples from Columbia in 2009 (CO-09) and 2010 (CO-10), and Stuttgart in 2009 (ST-09) and 2010 (ST-10)

Fig. 2 Frequency distribution of the δ13C values of the 373 soybean genotypes in four environments (two locations and 2 years, Columbia and Stuttgart, 2009 and 2010)

Table 2 Analysis of variance (ANOVA) results for the effect of environment (E), genotype (G) and their interaction for δ13C of 373 soybean genotypes

AMOVA and genetic diversity

Overall, the 373 genotypes represent 11 different national sources, including 244 from South Korea, 60 from China, 41 from Japan, 11 from North Korea, 6 from Georgia, 4 from Korea (North or South Korea not recorded in GRIN), 2 each from Russia and Taiwan and 1 each from India, Mexico and Romania. The within- and among-population components of total genetic variation were evaluated by AMOVA (Table 3). The analysis revealed that within-population diversity explained most of the genetic diversity (80 %) compared to among-population diversity (20 %). The distance-based methods NJ and UPGMA identified eight subclusters (C1–C8), matching the eight subpopulation groups (G1–G8) identified by the model-based method. The genotypes comprising groups G3 and G4 of the model-based method were consistent with the results of both distance-based methods. Twenty-six of the 111 genotypes from South Korea in G5 appeared as admixtures in different clusters. For NJ, the 26 genotypes were clustered as follows: 1 in C1, 4 in C2, 3 in C6, 7 in C7 and 11 in C8; for UPGMA, the pattern was 1 in C1, 2 in C2, 5 in C6, 9 in C7 and 9 in C8. Other than these differences, the model-based and distance-based methods gave the same groupings (results not shown). The 12,347 SNPs used to determine genetic diversity and for further analyses had an average MAF value of 12.47 % (range 5.0–50.0 %). The gene diversity, heterozygosity and PIC of the 12,347 SNPs averaged 0.20, 0.003 and 0.180, with ranges of 0.05–0.50, 0–0.101 and 0.01–0.40, respectively (Fig. 3). As suggested by its consistent response across different environments and high broad-sense heritability in cowpea and wheat, carbon isotope discrimination (CID) is under tight genetic control (Condon and Richards 1992; Ehdaie et al. 1991; Hall et al. 1990). In our study, broad-sense heritability in individual environments ranged from 58.68 % for Columbia 2010 to 70.57 % for Stuttgart 2010. The heritability estimate was 76.05 % across the two Columbia environments, 67.01 % across the two Stuttgart environments and 68.20 % across all four environments.

Table 3 Analysis of genetic differentiation among and within eight subpopulation groups of soybean genotypes by AMOVA

Fig. 3 Distribution of genetic diversity of 12,347 SNPs across 373 soybean genotypes. a Polymorphic Information Content (PIC), b Minor Allele Frequency (MAF), c gene diversity and d heterozygosity

Genetic structure and linkage disequilibrium

STRUCTURE analysis software was used to determine the population structure (i.e., genetic relatedness) and the number of subpopulations (k) of the 373 individual soybean genotypes based on the distribution of the 12,347 SNP loci evaluated in this study. The most probable number of subpopulations was determined by plotting the estimated likelihood value [LnP(D)] obtained from STRUCTURE runs against k. LnP(D) appeared to be an increasing function of k for all the values examined. The average LnP(D) across runs indicated k = 8 as the optimum, suggesting that eight subpopulations (Fig. 4) could contain all individuals with the greatest probability. Hence, a k value of 8 was selected to describe the genetic structure of the 373 soybean genotypes. The estimated population structure indicated genotypes with partial membership in multiple subpopulations, with few subpopulations exhibiting distinctive identities (Fig. 5 and Supplementary file 4 Figure S2). Significant divergence among subpopulations and average distances (expected heterozygosity) among individuals in the same subpopulation were also estimated (Table 4). Among the eight subpopulations, none had individuals exclusively from one country or region within a country.
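For readers unfamiliar with the ad hoc Δk statistic of Evanno et al. (2005), a minimal sketch follows. It assumes Δk is computed from the run means as |L(k+1) − 2L(k) + L(k−1)| divided by the standard deviation of L(k) across runs, which is one common convention; the LnP(D) values in the example are invented and are not results from this study.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_runs):
    """Compute delta-k for each interior k.

    lnp_runs: {k: [LnP(D) from each independent STRUCTURE run]}.
    The first and last k have no second difference and are skipped.
    """
    means = {k: mean(values) for k, values in lnp_runs.items()}
    delta = {}
    for k in sorted(lnp_runs):
        if k - 1 not in means or k + 1 not in means:
            continue
        second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        spread = stdev(lnp_runs[k])
        if spread > 0:
            delta[k] = second_difference / spread
    return delta


# Hypothetical LnP(D) values from three runs at each k.
runs = {
    1: [-1000, -1002, -998],
    2: [-800, -801, -799],
    3: [-790, -795, -785],
    4: [-788, -790, -786],
}
delta_k = evanno_delta_k(runs)
print(max(delta_k, key=delta_k.get), delta_k)
```

In this toy example LnP(D) keeps rising with k, yet Δk peaks where the rate of increase changes most sharply, which is why the two criteria are usually inspected together.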

Fig. 4 Population structure results using 12,347 SNPs. Log probability data LnP(D) as function of k (number of groups) from the structure run. The plateau of the graph at k = 8 indicates the minimum number of subgroups possible in the panel

Fig. 5 Estimated population structure of 373 soybean genotypes (k = 8). The y-axis is the subgroup membership, and the x-axis is the genotype. G (G1–G8) stands for a subpopulation

Table 4 STRUCTURE-based analysis showing significant divergence among subpopulation and average distances (expected heterozygosity) among individuals in the same subpopulation

In our study, LD analysis was performed using the SNPs with MAF ≥ 0.05 and the 373 soybean genotypes evaluated. LD decayed much faster in the euchromatic than in the heterochromatic regions. In the euchromatic regions, LD decayed to half of its maximum value within approximately 100 kb, while in the heterochromatic regions, LD did not decay to half of the maximum value within 1 Mb. Within approximately 300 kb, LD (r²) had decayed to approximately 0.2 in the euchromatic regions, while in heterochromatic regions LD was still >0.5 at 1 Mb. These results are consistent with those reported by Hwang et al. (2014).

Genome-wide association analysis

Associations of the 12,347 SNP markers with δ13C values of the 373 genotypes were evaluated by two different models: (1) a GLM adjusted using the Q-matrix and (2) an MLM adjusted using both Q- and K-matrices. Q- and K-matrices were employed to reduce false positives derived from population structure and/or genetic relatedness. The number of potentially false-positive SNPs was also reduced by using BLUP means in both models, by requiring that putative associations be identified by both models, and by requiring that putatively associated SNPs be identified in at least two of the four environments. Additionally, for the analysis across environments, a correction for multiple testing was applied (Benjamini and Hochberg 1995). An overview of the process employing both models by environment and across environments to reduce the 12,347 SNPs to 21 putative loci is shown in Fig. 6.
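The filtering logic can be sketched in a few lines. The sketch assumes that "both models" means a SNP must pass the GLM + Q threshold (P ≤ 0.001) and the MLM (Q + K) threshold (P ≤ 0.01) within the same environment, and must do so in at least two environments; the source does not state whether the two models must agree within one environment, so this reading is an assumption. The data layout, names and P values are hypothetical.

```python
GLM_P_MAX = 0.001  # -Log10 P >= 3.00 for GLM + Q
MLM_P_MAX = 0.01   # -Log10 P >= 2.00 for MLM (Q + K)


def candidate_snps(glm_p, mlm_p, min_envs=2):
    """Return SNPs significant in both models in at least min_envs environments.

    glm_p, mlm_p: {environment: {snp: p_value}}.
    """
    counts = {}
    for env, snp_p in glm_p.items():
        for snp, p_glm in snp_p.items():
            p_mlm = mlm_p.get(env, {}).get(snp)
            if p_mlm is None:
                continue
            if p_glm <= GLM_P_MAX and p_mlm <= MLM_P_MAX:
                counts[snp] = counts.get(snp, 0) + 1
    return sorted(snp for snp, n in counts.items() if n >= min_envs)


# Hypothetical P values for two SNPs in three environments.
glm = {
    "CO-09": {"snp_1": 1e-5, "snp_2": 1e-4},
    "CO-10": {"snp_1": 5e-4, "snp_2": 0.2},
    "ST-09": {"snp_1": 0.3, "snp_2": 2e-4},
}
mlm = {
    "CO-09": {"snp_1": 0.004, "snp_2": 0.008},
    "CO-10": {"snp_1": 0.009, "snp_2": 0.5},
    "ST-09": {"snp_1": 0.6, "snp_2": 0.05},
}
print(candidate_snps(glm, mlm))
```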

Fig. 6

An overview of the process using two models to reduce the 12,347 SNPs to 21 putative loci. Flowchart showing final SNP selection using two models GLM + Q and MLM (Q + K) from the original 12,347 SNPs with MAF ≥ 5 % analyzed by environments and using the overall means across environments. For all analyses, the BLUP mean was used for association testing

Analysis using the GLM + Q model identified a total of 1,879 SNPs significantly associated (−Log10P ≥ 3.00; P ≤ 0.001) with δ13C in at least one of the four environments. Of these, 1,229 were identified in at least one of the two Columbia environments (966 at CO-09 and 263 at CO-10) and 650 were found in at least one of the two Stuttgart environments (306 at ST-09 and 344 at ST-10). SNPs exhibiting significant associations with the GLM + Q model are listed for each environment in Supplementary File 1, Table S1. Manhattan plots showing marker associations for the GLM + Q model for all four environments are provided in Supplementary File 3, Figure S1a, b, c, d.

As indicated, the GLM model corrected for population structure but not for genetic relatedness. MLM procedures, by contrast, have been developed to account for both population structure and unequal relatedness (Zhang et al. 2010). The MLM (Q + K) model has been successfully applied to account for population structure in several crops (Aranzana et al. 2010; Breseghello and Sorrells 2006; Yu et al. 2006). Application of the MLM (Q + K) model identified a total of 245 SNPs significantly (−Log10P ≥ 2.00; P ≤ 0.01) associated with δ13C in at least one of the four environments. Of these, 81 SNPs were significantly associated with δ13C in at least one of the two Columbia environments (37 at CO-09 and 44 at CO-10; Supplementary File 2, Table S2), and 164 SNPs were associated with δ13C in at least one of the two Stuttgart environments (97 at ST-09 and 67 at ST-10; Supplementary File 2, Table S2). Manhattan plots showing marker associations for the MLM (Q + K) model in each of the four environments are shown in Fig. 7a, b, c, d. Table 5 and Supplementary File 5, Table S3 show the top five individual SNPs (greatest −Log10 P value) significantly associated with δ13C in each of the four environments, along with the corresponding probabilities (−Log10(P)) and r² values for the GLM + Q and MLM (Q + K) models. Across the environments, among the top five SNPs, −Log10(P) values ranged from 5.00 to 8.35 and r² values from the GLM + Q model ranged from 0.08 to 0.14.
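The two significance thresholds used above are stated on the −Log10 scale. A minimal sketch of the conversion follows; the helper names and the P values for the three made-up markers are hypothetical and serve only to show that the GLM + Q cutoff (3.00) is tenfold stricter than the MLM (Q + K) cutoff (2.00).

```python
import math

def neg_log10(p):
    """Convert a P value to the -log10 scale used on Manhattan plots."""
    return -math.log10(p)

def significant(p_values, threshold):
    """Return SNP ids whose -log10(P) meets or exceeds the threshold."""
    return [snp for snp, p in p_values.items() if neg_log10(p) >= threshold]

# Hypothetical P values for three made-up markers.
toy = {"snp_a": 5e-4, "snp_b": 5e-3, "snp_c": 0.2}
print(significant(toy, 3.0))  # ['snp_a']
print(significant(toy, 2.0))  # ['snp_a', 'snp_b']
```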

Fig. 7

Manhattan plot of −Log10(P) vs. chromosomal position of SNP markers from the MLM (Q + K) model for two locations in two consecutive years, 2009 and 2010. The plot shows −Log10(P) values for each SNP against chromosomal location. (a) Columbia 2009; (b) Columbia 2010; (c) Stuttgart 2009; (d) Stuttgart 2010. Red line represents the association threshold (−Log10(P) ≥ 2.00, P ≤ 0.01)

Table 5 List of top five SNPs significantly associated with δ13C in Columbia and Stuttgart in 2009 and 2010 based on GLM (−Log10(P) ≥ 3.00, P ≤ 0.001) and corresponding P and R² values

Across the two models (GLM adjusted for population structure and MLM adjusted for population structure and genetic relatedness), 122 SNPs showed significant association with δ13C in both (Fig. 6, Supplementary Files 1 and 2, Tables S1 and S2). Of the 122 SNPs common to both models, 37 SNPs were identified as having a significant association in more than one environment (24 in the GLM + Q model and 13 in the MLM (Q + K) model). Figure 8 shows the genomic locations of the SNPs identified by each model. Six of the 37 SNPs were common between models (highlighted in yellow in Fig. 8). Thus, the number of unique SNP associations identified between the two models for at least two environments was 31. Markers with qFDR < 0.01 in at least two of four environments for both models were considered significant, and all 31 SNPs identified met this criterion.
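The counting in this paragraph is set bookkeeping: restrict to SNPs found by both models, keep those significant in at least two environments under each model, then take the union. The sketch below reproduces that logic on made-up SNP ids; the function names and the toy sets are hypothetical and are not the study's data.

```python
from collections import Counter

def any_env(sig_by_env):
    """SNPs significant in at least one environment."""
    return set().union(*sig_by_env.values())

def in_multiple_envs(sig_by_env, min_envs=2):
    """SNPs significant in at least `min_envs` environments.

    sig_by_env: dict mapping environment name -> set of significant SNP ids.
    """
    counts = Counter(snp for snps in sig_by_env.values() for snp in snps)
    return {snp for snp, n in counts.items() if n >= min_envs}

# Made-up SNP ids, for illustration only.
glm = {"CO-09": {"s1", "s2", "s3"}, "CO-10": {"s1", "s3", "s4"},
       "ST-09": {"s2", "s5"}, "ST-10": {"s6"}}
mlm = {"CO-09": {"s1", "s2"}, "CO-10": {"s1", "s5"},
       "ST-09": {"s3", "s5"}, "ST-10": {"s9"}}

common = any_env(glm) & any_env(mlm)         # found by both models
glm_multi = in_multiple_envs(glm) & common   # {'s1', 's2', 's3'}
mlm_multi = in_multiple_envs(mlm) & common   # {'s1', 's5'}
shared = glm_multi & mlm_multi               # {'s1'}
unique = glm_multi | mlm_multi               # {'s1', 's2', 's3', 's5'}

# Inclusion-exclusion: the same bookkeeping as 24 + 13 - 6 = 31 in the text.
assert len(unique) == len(glm_multi) + len(mlm_multi) - len(shared)
assert 24 + 13 - 6 == 31
```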

Fig. 8

Location of putative loci significantly associated with δ13C in more than one environment and across environments, together with previously identified QTLs for CID (Specht et al. 2001) and WUE (Mian et al. 1998) as shown in Soybase (www.soybase.org, [Grant et al. 2013]). For each chromosome, the black dots represent the location of a SNP evaluated for association with δ13C

Based on BLUP means across all four environments, 1,411 SNPs were significantly associated with δ13C in the GLM + Q model and 60 SNPs in the MLM (Q + K) model. Of these, 26 SNPs were found in common between the two models (Supplementary File S6, Table S4). Since the use of overall means across environments precluded exploiting significance in more than one environment as a criterion, we applied a correction for multiple testing (Storey and Tibshirani 2003) to increase stringency. The 26 SNPs identified in common between the two models were subjected to this correction. Markers with qFDR < 0.01 were considered significant, which reduced the number of putative candidate SNPs to 11. These 11 SNPs were localized to one locus on CHR 2 and to two loci on CHR 15 (Fig. 8). The scale of Fig. 8 does not allow the separation of closely spaced SNPs, but the locus on CHR 2 and the second locus on CHR 15 were each marked by five consecutive SNPs (only one SNP marked the first locus on CHR 15).
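To illustrate how a false discovery rate cutoff thins a candidate list, the sketch below computes Benjamini-Hochberg adjusted P values (the procedure cited earlier for the across-environment analysis). It is a stand-in, not the qFDR computation used in this study: the Storey and Tibshirani q-value additionally estimates the proportion of true null hypotheses, and reduces to the adjustment shown here when that proportion is taken as 1. The function name and the P values are made up.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up P values for four hypothetical markers.
toy_p = [0.0001, 0.004, 0.03, 0.5]
adj = bh_adjust(toy_p)
keep = [i for i, q in enumerate(adj) if q < 0.01]
print(keep)  # [0, 1]
```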

The 31 SNPs identified independently in at least two environments and the 11 SNPs identified using the BLUP means across environments together constituted 39 unique SNPs, which were considered putative candidate SNPs associated with δ13C (Fig. 6 and Supplementary File S7, Table S5). The numbers of significant SNP associations per chromosome are summarized in Table 6 and the relative genomic locations of the SNPs are illustrated in Fig. 8. Details for each of the 39 SNPs putatively associated with δ13C are shown in Table 7. Multiple, closely spaced SNPs likely identified the same putative locus associated with δ13C. Visual examination of Fig. 8 indicates that, overall, 21 putative loci were identified by the process outlined in Fig. 6. The scale of Fig. 8 does not allow individual visualization of closely spaced SNPs, but as shown in Table 7, 15 of the 21 putative loci were identified by one SNP, 2 were identified by two SNPs (putative loci 7 and 10), 1 by three SNPs (putative locus 19), 1 by five SNPs (putative locus 16) and 2 by six SNPs (putative loci 2 and 4).
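Collapsing closely spaced SNPs into loci can also be expressed as a simple gap rule. In this study the loci were delimited by visual examination of Fig. 8, so the sketch below is only one possible formalization: the function name, the 1 Mb gap and the positions are assumptions. The final assertions check the tally reported above (21 loci, 39 SNPs).

```python
def group_into_loci(snps, max_gap_bp):
    """Group SNPs on the same chromosome into loci when consecutive
    positions are no more than `max_gap_bp` apart.

    snps: iterable of (chromosome, position_bp) tuples.
    Returns a list of loci, each a list of (chromosome, position_bp).
    """
    loci = []
    for chrom, pos in sorted(snps):
        last = loci[-1][-1] if loci else None
        if last and last[0] == chrom and pos - last[1] <= max_gap_bp:
            loci[-1].append((chrom, pos))
        else:
            loci.append([(chrom, pos)])
    return loci

# Made-up positions; the 1 Mb gap is an arbitrary illustrative choice.
toy = [(2, 100_000), (2, 350_000), (2, 9_000_000), (15, 500_000), (15, 600_000)]
loci = group_into_loci(toy, max_gap_bp=1_000_000)
print([len(locus) for locus in loci])  # [2, 1, 2]

# Tally reported in the text: SNPs per locus -> number of such loci.
tally = {1: 15, 2: 2, 3: 1, 5: 1, 6: 2}
assert sum(tally.values()) == 21                   # putative loci
assert sum(k * n for k, n in tally.items()) == 39  # unique SNPs
```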

Table 6 Number of SNPs associated with δ13C and their chromosomal location
Table 7 List of putative SNPs and potential genes associated with δ13C based on 39 SNPs identified in more than one environment or across environments

Potential genes associated with δ13C

Based on the 60 bp sequences flanking the SNPs (Supporting information Table S1; Song et al. 2013), a BLAST search was conducted with default parameters in Phytozome v9.1 (http://www.phytozome.net/). The search indicated that 25 of the 39 SNPs were located in a gene (Table 7). For the 14 SNPs not located in a gene, information on the gene closest to each SNP is provided in Table 7.

Discussion

Genetic diversity and population structure of soybean genotypes

In this study, 12,347 SNPs were used to estimate the genetic diversity and population structure of 373 soybean genotypes. Despite the large number of SNPs, gaps in their distribution across the genome were observed (Fig. 8). Among these 373 MG IV soybean genotypes, the gaps in SNP coverage may represent areas of extreme similarity. This similarity might be expected given that the SNP data used in this analysis were selected based on whole genome sequence data obtained from a diverse set of six G. max genotypes with a range of maturities as well as two wild soybean accessions (Song et al. 2013). Additionally, only a few lines showed a high level of heterozygosity (5 % of the SNP loci, Fig. 3d). This heterozygosity may be due to natural outcrossing (Ray et al. 2003) during propagation of the germplasm or to other sources of contamination.

Based on the 12,347 SNPs, mean PIC was 0.18, which was less than the 0.31 reported for 191 soybean (G. max) landraces using 1,142 SNPs (Hao et al. 2012) (Fig. 3a). Of the 373 genotypes studied here, heterozygosity was zero in 42.82 % of the entries and average heterozygosity was 0.003, which was also less than the 0.01 reported for 191 soybean genotypes by Hao et al. (2012). The mean gene diversity coefficient was 0.20 for the 373 genotypes, which was less than the 0.39 reported for the 191 genotypes studied by Hao et al. (2012) and the 0.35 reported for 303 cultivated and wild soybean (G. soja) using 554 SNPs (Li et al. 2010). Since only cultivated soybean from a single maturity group and no wild soybean was examined in the present study, lower PIC, heterozygosity and genetic diversity coefficients than those found by Hao et al. (2012) and Li et al. (2010) were not surprising.
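For readers comparing these statistics, the sketch below computes gene diversity (expected heterozygosity) and PIC from allele frequencies using the commonly used definitions. It is assumed, not confirmed by the text, that the software used in this study applies the same formulas; the function names are illustrative. For a biallelic SNP, PIC cannot exceed 0.375, which gives context for the mean of 0.18 reported here.

```python
def gene_diversity(freqs):
    """Expected heterozygosity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)

def pic(freqs):
    """Polymorphism information content:
    1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    cross = sum(2 * freqs[i] ** 2 * freqs[j] ** 2
                for i in range(len(freqs))
                for j in range(i + 1, len(freqs)))
    return gene_diversity(freqs) - cross

# Biallelic SNP with equal allele frequencies: the maximum for both statistics.
print(gene_diversity([0.5, 0.5]))  # 0.5
print(pic([0.5, 0.5]))             # 0.375

# A SNP at the MAF cutoff used in this study (0.05) contributes far less.
print(round(pic([0.95, 0.05]), 4))
```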

The genetic structure of soybean populations has been studied previously using both SSR and SNP markers (Hao et al. 2012; Li et al. 2010). The 373 soybean genotypes evaluated in the present study were classified into eight subpopulations with significant divergence among subpopulations. The individuals within each subpopulation were independent of their collection sites. The low genetic differentiation among genotypes could be a result of gene flow due to movement of seeds. Seed exchange among farmers is a mechanism used to enhance diversity of local germplasm, which may result in increased distribution of alleles among different populations irrespective of their geographical distance (Louette et al. 1997). The results of the present study indicated high genetic diversity within subpopulations and less genetic diversity among subpopulations. Similar results have been found in other crops using SSR markers (Aranzana et al. 2010; Belamkar et al. 2011; Cao et al. 2012; Shiferaw et al. 2012; Wang et al. 2011). However, a study of the population structure of 40 wild soybeans from China with 20 SSR markers showed contrary results to those observed in our study (Guo et al. 2012). The reason behind this may be that the wild soybean evaluated in that study was significantly differentiated from other regional soybeans, as evidenced by their low allelic richness, genetic diversity and high ratios of regionally unique and fixed alleles (Guo et al. 2012). These genetic attributes suggest that wild soybean may have undergone severe adaptive selection for their ecogeographical conditions and had less genetic exchange with inland populations (Guo et al. 2012).

Comparison of model-based and distance-based diversity estimates, and of their significance, is essential for population studies in plants (Guo et al. 2012). In this study, with a few exceptions, results of the model-based method (STRUCTURE) were largely in accordance with the results obtained using the distance-based methods (NJ and UPGMA; results not shown). Similar results were found for soybean, peanut (Arachis hypogaea L.) and peach (Prunus persica) by others (Belamkar et al. 2011; Cao et al. 2012; Guo et al. 2012). Comparison of the two distance-based clustering methods (neighbor joining and UPGMA) found less than 7 % of individuals falling into different clusters depending on the method used (results not shown).

Although the MG IV soybean genotypes evaluated in this study were from distinct geographical regions, they did not show any regional or provincial clustering within the eight subpopulations determined from clustering with the 12,347 SNPs. Although the genotypes were originally selected to fall into two groups based on yield estimates from the germplasm database, yield values were randomly distributed among the eight subpopulations (Supplementary File 4, Figure S2).

Carbon isotope ratio

The ability of plants to respond to different levels of available water is variable and complex. Understanding the relationship between genotype and phenotype is essential for the improvement of complex traits in economically important crop species such as soybean. Across the four environments δ13C values in this study ranged from −30.55 to −27.74 ‰, which is within the range of δ13C values previously reported for soybean (Yoneyama et al. 2000). As indicated by the frequency distributions and descriptive statistics (Fig. 2; Table 1), the two Columbia environments (CO-09 and CO-10) were very similar, but the two Stuttgart environments (ST-09 and ST-10) were different. ST-09 had the lowest average δ13C values of the four environments and ST-10 had the highest. Differences between the Columbia and the Stuttgart locations include latitude, soil type and irrigation (furrow irrigation in Stuttgart and no irrigation in Columbia), all of which may affect the growth response of soybean. These and other environmental influences (see Fig. 1) likely affected the δ13C responses observed. For example, in water-limited environments, δ13C would be expected to increase (become less negative) due to partial stomatal closure and an increase in WUE (Specht et al. 2001). Among the eight subpopulations, none had individuals with highest or lowest δ13C exclusively clustered in it (Fig. 5).

GWAS analysis

GWAS provides a promising tool for the detection and mapping of quantitative trait loci (QTLs) underlying complex traits. Application of the GLM model with Q as a correction for population structure indicated that roughly 15 % of the markers tested (1,879 of 12,347) had significant associations in at least one of the environments. A greater number of significant associations were identified in one of the Columbia environments than in the two Stuttgart environments combined (966 and 650, respectively). The greatest number of significant markers identified by the GLM + Q model was in the CO-09 environment and the fewest were identified in the CO-10 environment. With the MLM (Q + K) model, the ranking differed: the greatest number of significant markers was found in the ST-09 environment (97 SNPs), followed in order by the ST-10 (67 SNPs), CO-10 (44 SNPs) and CO-09 (37 SNPs) environments. Climatic and other differences that affected tissue δ13C may account for the variation among environments in the number of markers that showed significant associations.

Of all the SNPs identified by either model in the analysis by environment, 122 were identified by both models as being significantly associated with δ13C. Of these 122 SNPs, 31 were identified as significant in more than one environment. In addition to the putative associations identified when analyzed by environment, 11 SNPs significantly associated with δ13C were identified by analysis based on the means across environments. Together, these analyses identified 39 unique SNPs associated with δ13C (three SNPs were common between the analyses).

In total, 39 unique SNPs were identified as putatively associated with δ13C in more than one environment or across environments (Table 7). The genomic distribution of these SNPs revealed that several are located close together and likely mark the same locus (Fig. 8; Table 7). Thus, we putatively identified 21 genomic regions on 16 chromosomes that are highly likely to contain genes affecting δ13C. While the SNPs identified as significantly associated with δ13C in single environments may be important, particularly given the independence of the field experiments, those that were identified in at least two environments or from across all environments are likely the most stable. Additionally, six of the 21 putative loci were independently identified by two of the three analysis methods [GLM + Q, MLM (Q + K), or across all environments] and one locus on CHR 2 was identified by all three methods (Fig. 8). These loci likely have a greater potential of identifying major QTLs.

Several QTLs have been identified for δ13C in other species (Chen et al. 2012; Gu et al. 2012; Hervé et al. 2001; Juenger et al. 2005; Mano et al. 2005; Specht et al. 2001). For soybean, five QTLs for CID located on chromosomes 6 (2), 13, 17, and 19 (Specht et al. 2001) and nine QTLs for WUE located on chromosomes 4 (2), 12 (2), 16, 18, and 19 (Mian et al. 1996) are identified in Soybase [www.soybase.org (Grant et al. 2013)]. The putative locations of the reported CID and WUE QTLs are included in Fig. 8. None of the 21 putative loci identified in this study as being associated with δ13C were located close to the QTLs for CID identified by Specht et al. (2001). However, one putative δ13C locus was located close to a QTL for WUE (chromosome 4, Fig. 8) identified by Mian et al. (1996). Interestingly, only one of the nine WUE QTLs reported by Mian et al. (1996) was located near one of the CID QTLs reported by Specht et al. (2001) (chromosome 19, Fig. 8). The mapping conducted in these two studies (Mian et al. 1996; Specht et al. 2001) for both the CID and WUE QTLs was undertaken with a comparatively limited marker set, and the QTL locations shown in Fig. 8 were inferred from the base pair positions of the nearest markers. The actual QTL location in the genome may therefore be at a considerable distance from the location shown in Fig. 8.

The lack of overlap between putative loci identified by Specht et al. (2001) and those identified in this study was not surprising. Specht et al. (2001) conducted δ13C analyses on juvenile trifoliates, as opposed to the whole-plant samples used in this study, and examined a population based on parental lines that were not included in this study and that did not differ in δ13C. As indicated by Specht et al. (2001), all but one of the major QTLs they identified coincided with maturity and/or determinacy QTLs. Thus, the identified QTLs may have been confounded by maturity and determinacy. Interestingly, even though the parental lines of the population studied by Mian et al. (1996) were not included in this study, and the growing conditions (greenhouse) and phenotyping method (gravimetric determination of WUE) differed, one putative δ13C locus identified in this study is located near WUE QTLs identified by Mian et al. (1996). This may indicate the stability and importance of this putative locus and highlights the genomic region for further investigation.

Conclusions

Even with the large number of SNPs with MAF ≥ 5 % (12,347), there were still areas of the genome in which differences among the 373 genotypes were not detected. Nonetheless, population analysis was able to separate the 373 soybean genotypes into eight subgroups. No relationship between the subgroups and geographic origin, yield (as reported in the germplasm collection database) or δ13C was apparent. GWAS analyses using GLM and MLM models with adjustments for genetic relatedness and/or population structure were conducted on δ13C data obtained from independent field experiments in four environments. Analysis by environment and on the mean across all four environments identified SNPs putatively associated with δ13C. In total, 39 unique SNPs were detected in at least two environments or based on the means across environments. Although the SNPs detected in single environments may be associated with genes affecting δ13C, we have greater confidence in the SNPs identified independently in at least two environments. Additionally, many of the 39 identified SNPs were in close proximity to each other and likely tag the same locus. Overall, results indicated 21 putative loci associated with δ13C with a high level of confidence. Although these results are conservative, they identified a tractable number of putative loci for further evaluation and confirmation in biparental mapping populations as well as potential use in breeding programs.

Author contributions

JRS, JDR, LCP and FBF designed the study. SKS and VHV performed field experiments in Columbia and collected leaf tissues for δ13C analysis. LCP and ACK managed field experiments in Stuttgart and collected leaf tissues for δ13C analysis. PBC and QS designed SoySNP50K iSelect SNP Beadchip for genotyping. APD performed SNP–trait association including other statistical analysis and co-wrote the manuscript with JDR. FBF coordinated and supervised the project. JRS, JDR, LCP and FBF critically revised the manuscript. All authors read and approved the final manuscript.