Introduction

Oil content and quality have drawn much attention in soybean genetics and breeding programs due to the increased demand for vegetable oils. The oil fraction corresponds to 20% of dry mass in the seeds of cultivated soybean (Glycine max L. Merr.) and is mainly (95%) directed toward the consumption of edible oil; the remainder is used for industrial products such as fatty acids, soaps and biodiesel (http://www.soyatech.com/soyfats). The nutritional value, flavor and stability of soybean oil are determined by its five dominant fatty acids: saturated palmitic (16:0) and stearic (18:0), monounsaturated oleic (18:1), and polyunsaturated linoleic (18:2) and linolenic (18:3) acids. The average percentages of these five fatty acids in soybean oil are 10%, 4%, 18%, 55%, and 13%, respectively. Previous research has shown that decreasing saturated (16:0 and 18:0) and polyunsaturated fatty acids (18:2 and 18:3) and increasing monounsaturated acids (18:1) improves the health benefits of soybean oil for human consumption (Wilson et al. 2002). For certain industrial applications, such as biodiesel production, the development of an oil high in oleic acid and low in saturated fatty acids has been suggested to simultaneously improve oxidative stability and augment cold flow (Aransiola et al. 2014). Significant efforts have been made to increase the oxidative stability of soybean oil as a means to avoid the trans fats generated through the hydrogenation process, enhance the ω-3 fatty acid content of the oil for use in both food and feed applications and increase the total oil content of the seeds (Graef et al. 2009; Clemente and Cahoon 2009).

Previous studies have shown that the Brazilian soybean germplasm has a narrow genetic base (Hiromoto and Vello 1986) with only five ancestors, representing approximately 60% of the overall genetic base of the soybean (Wysmierski and Vello 2013). In this context, the characterization and introduction of new sources of genes represents a crucial step in fostering efficient breeding strategies and, consequently, the development of new cultivars to improve soybean oil content and quality.

The oil content and fatty acid components (hereafter referred to as oil traits for simplicity) of soybean seeds behave as quantitative traits. Identifying molecular markers or quantitative trait loci (QTLs) associated with oil traits for use in marker-assisted selection (MAS) has the potential to facilitate the development of improved varieties. Although oil traits show quantitative inheritance, the reported heritability estimates for these traits are moderate to high (Fehr et al. 1991; Panthee et al. 2006; Hyten et al. 2004), highlighting the utility of identifying genetic markers associated with these traits.

Linkage mapping using biparental mapping populations is one approach for the identification of QTLs using molecular markers, and a number of molecular markers associated with oil traits have been reported (Diers and Shoemaker 1992; Spencer et al. 2003; Monteros et al. 2008; Li et al. 2011). However, the numbers of parents that have been used in previous QTL genetic linkage mapping experiments represent only a very small proportion of the total germplasm of soybean, and it is not known how often QTLs can be detected repeatedly in practical breeding (Mackay et al. 2009).

The genome-wide association study represents an alternative approach to linkage mapping for finding QTLs and has been widely applied in soybean studies (Deshmukh et al. 2014). This type of study requires a high density of single-nucleotide polymorphisms (SNPs) across the genomes of a diverse set of individuals as well as phenotyping of all the individuals in the study. Significant statistical associations are then determined between SNP alleles and trait phenotypes. Compared with the QTL linkage mapping approach, the association study can greatly increase the range of detection of natural variation, the number of genome-wide significant loci, and the QTL resolution for complex agronomic traits. Through application of this approach, many important QTLs can be localized, and candidate genes associated with oil traits can be identified (Hwang et al. 2014; Vaughn et al. 2014; Li et al. 2015; Cao et al. 2017; Leamy et al. 2017; Smallwood et al. 2017).

The number of soybean genome-wide association studies has substantially increased with the availability of next-generation sequencing (NGS). A large number of SNPs have been developed, mainly for diploid organisms (Koboldt et al. 2013). Array-based SNP genotyping platforms, such as Illumina GoldenGate, Infinium, and Affymetrix Axiom, have permitted the assaying of hundreds to thousands of SNPs in a high-throughput and cost-effective manner. In soybean, a Universal Soy Linkage panel (USLP 1.0), containing 1536 SNPs, was the first to be developed (Hyten et al. 2010); however, larger arrays, such as the SoySNP50K Illumina array (Song et al. 2013), the 180K Axiom® SoyaSNP Affymetrix array (Lee et al. 2015b), and the NJAU 355K SoySNP Affymetrix array (Wang et al. 2016), were subsequently developed from sequence analyses of several cultivated soybean (Glycine max L. Merr.) and wild soybean (G. soja Siebold et Zucc.) genotypes. A set of 6000 SNPs for a medium-scale Infinium array was selected from the SoySNP50K array to maximally represent haplotype blocks, assess genetic diversity within cultivated soybean and G. soja, and facilitate genotyping in the soybean research community (Song et al. 2014). Recently, a soybean tropical collection containing 169 cultivars was genotyped using a high-throughput BARCSoySNP6K BeadChip assay, which provides a high-resolution map of genome-wide markers and can facilitate analysis of complex traits in soybean (Contreras-Soto et al. 2017).

In this study, we conducted a genome-wide association study of soybean with 96 diverse accessions genotyped with BARCSoySNP6K BeadChips to identify molecular markers associated with QTL regions for oil traits. Candidate genes within significant association loci that were potentially involved in the regulation of oil traits were also predicted. In addition, we identified favorable alleles at loci significantly associated with oil traits, which can be used by soybean breeders in crossing programs.

Materials and methods

Plant material and field experiments

The association panel for the genome analysis consisted of a diverse collection of 96 soybean accessions (including 62 plant introductions (PIs) and 34 soybean cultivars) originating from different countries of the world (Table S1 and Fig. 1). Accessions were selected to represent a range of germplasm with respect to soybean oil content. Seeds were obtained from the germplasm collections of Embrapa-Soybean (Brazil) and the Department of Genetics, ESALQ, University of Sao Paulo (Brazil).

Fig. 1 Two germplasm clusters, red and blue, based on Bayesian analysis of the 96 soybean accessions analyzed by using 5220 SNP loci. Details of the identification of accessions and their geographical origins are indicated. ‘*’ denotes an accession present in a cluster that does not correspond to its origins

All accessions were planted and cultivated between November and March of the 2009–2010 and 2010–2011 agricultural years in the experimental area of the Department of Genetics, Piracicaba, Sao Paulo, Brazil. Each plot contained 20 plants, which were planted in rows 1.5 m in length and spaced 0.8 m from the nearest plots. In both experimental years, a Federer augmented design was used, with the genotypes organized in two experimental sets with common checks.

Total oil was extracted and analyzed using a Butt apparatus and hexane as the solvent. Measurements of the five fatty acids were conducted by gas chromatography (chromatograph model 3900, Varian, Palo Alto, CA). For each accession in both years, the average values of seed oil and fatty acid contents from three replicates were used for the association analysis (Priolli et al. 2015).

Genotyping

Seeds of each accession were planted in seedling plates in standard soil mix. Plants were grown in the greenhouse (24–25 °C, approximately 33% humidity). Total genomic DNA was isolated from lyophilized leaf tissue bulked from five plants per accession using the DNeasy Plant Kit (Qiagen). DNA concentration was quantified with a spectrophotometer (NanoDrop Technologies Inc., Centerville, DE, USA) and normalized at 50 ng/µl for marker genotyping.

SNP genotyping was performed at Centro de Genômica Funcional ESALQ/USP in Piracicaba, Sao Paulo, Brazil, using BARCSoySNP6K Illumina Infinium BeadChips (Illumina, Inc., San Diego, CA, USA). The assay consisted of a series of standard protocols, including incubation, DNA amplification, hybridization of samples to the bead assay, extension, and imaging of the bead assay (Song et al. 2013). The SNP alleles were called using the Illumina Genome Studio Genotyping Module (Illumina, Inc. San Diego, CA). For the population structure and principal component analyses, the data were first filtered by excluding redundant SNPs, nonpolymorphic SNPs and SNPs with more than 10% missing data, resulting in 5914 SNP loci. For the association analysis, we additionally excluded SNP loci with minor allele frequencies of less than 1% and SNP loci with more than 25% missing data, retaining 5220 SNP loci. As recommended by Hwang et al. (2014), all heterozygous loci were treated as missing data.
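The marker filters described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the genotype encoding (two-letter calls such as "AA" or "AB", with None for missing), the function names and the toy data are all assumptions made for illustration; only the thresholds (1% minor allele frequency, 25% missing data, heterozygotes treated as missing) come from the text.

```python
from collections import Counter

def mask_heterozygotes(calls):
    """Treat heterozygous calls (e.g. 'AB') as missing (None)."""
    return [c if c is not None and c[0] == c[1] else None for c in calls]

def missing_rate(calls):
    """Fraction of accessions without a genotype call."""
    return sum(c is None for c in calls) / len(calls)

def minor_allele_frequency(calls):
    """Frequency of the rarer allele among non-missing calls (0.0 if monomorphic)."""
    counts = Counter()
    for c in calls:
        if c is not None:
            counts.update(c)  # 'AA' contributes two copies of allele 'A'
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / sum(counts.values())

def filter_snps(genotypes, max_missing=0.25, min_maf=0.01):
    """Return the SNP ids that pass the missing-data and MAF filters.

    genotypes: dict mapping a (hypothetical) SNP id to a list of calls.
    """
    kept = []
    for snp_id, calls in genotypes.items():
        calls = mask_heterozygotes(calls)
        if missing_rate(calls) > max_missing:
            continue
        if minor_allele_frequency(calls) < min_maf:
            continue
        kept.append(snp_id)
    return kept

# Toy data with hypothetical SNP ids, not taken from the study.
toy = {
    "snp_a": ["AA", "BB", "AA", "BB"],  # polymorphic and complete: kept
    "snp_b": ["AA", "AA", "AA", "AA"],  # monomorphic: dropped
    "snp_c": ["AA", None, "AB", "BB"],  # 50% missing after masking: dropped
}
kept = filter_snps(toy)
print(kept)
```

Masking heterozygotes before computing the missing-data rate means that a locus with many heterozygous calls can fail the missing-data filter, which is one reasonable reading of the procedure but is an assumption of this sketch.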

Statistical analysis

The 2-year data were averaged for each trait. Kolmogorov–Smirnov two-sample statistical tests (K–S test) (Snedecor and Cochran 1989) were applied to test for heterogeneity in the distribution of each trait in a year.

The trials, in which genotypes were organized in experimental sets with regular treatments and common checks, were analyzed according to Zimmermann (2014). The statistical model was based on an augmented randomized complete block design and followed the equation $Y_{ij} = m + t_i + b_j + e_{ij}$, where $Y_{ij}$ is the observation of the $i$th treatment in the $j$th block, with $j = 1, 2, \ldots, b$ and $i = 1, 2, \ldots, p, p + 1, p + 2, \ldots, p + t$, where $p$ is the number of progeny or regular treatments, $t$ is the number of checks, and $p + t = v$ is the total number of treatments; $m$ is the general mean; $t_i$ is the effect of the $i$th treatment; $b_j$ is the effect of the $j$th block; and $e_{ij}$ is the normally distributed random error. The analyses of individual and joint variances were carried out using the restricted maximum likelihood (REML) method, considering all parameters of the model as random. Estimates were obtained by combining the data from the 2 years using the lme4 R package (R Development Core Team 2015). Broad-sense heritability (BSH) was calculated with the formula $\mathrm{BSH} = \sigma^2_G / (\sigma^2_G + \sigma^2_\varepsilon / n)$, where $\sigma^2_G$ is the genotype variance, $\sigma^2_\varepsilon$ is the error variance, and $n$ is the number of years (Nyquist 1991).
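As a worked illustration of the broad-sense heritability formula above, the following Python sketch computes BSH from variance components. The function name and the numeric values are hypothetical and are not estimates from this study; in practice the variance components would come from the REML fit.

```python
def broad_sense_heritability(var_genotype, var_error, n_years):
    """BSH = var_G / (var_G + var_error / n), with n the number of years."""
    if n_years <= 0:
        raise ValueError("n_years must be positive")
    return var_genotype / (var_genotype + var_error / n_years)

# Hypothetical variance components, chosen only to make the arithmetic easy.
h2 = broad_sense_heritability(var_genotype=4.0, var_error=2.0, n_years=2)
print(round(h2, 3))  # 4 / (4 + 2/2) = 0.8
```

Dividing the error variance by the number of years reflects that the heritability is expressed on the basis of means over years, so adding years reduces the contribution of the error term.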

The genetic structure based on the 5914 SNPs was investigated using a Bayesian model-based Markov chain Monte Carlo (MCMC) clustering method implemented in the program STRUCTURE v. 2.3.3 (Pritchard et al. 2000; Hubisz et al. 2009). The following parameters were applied to the analysis: diploid locus, admixture model and correlated allele frequencies. Following a burn-in period of 50,000, five independent runs were performed for each K value (from 1 to 10), with 500,000 iterations, as previously optimized (Priolli et al. 2015). The most likely value of K was chosen from ΔK according to the method of Evanno (Evanno et al. 2005) using STRUCTURE HARVESTER 0.6.7 (Earl and vonHoldt 2012). Graphs of the STRUCTURE results were produced using CLUMPP (Jakobsson and Rosenberg 2007). We used the Q matrix from STRUCTURE to assign individuals to different Ks (referred to here as ‘clusters’ for simplicity) using a membership threshold of > 50% for each. Principal components analysis (PCA) was conducted in R using the ape (Paradis et al. 2004) and ggplot2 (Ginestet 2011) packages. Genetic diversity was estimated using the package diveRsity version 1.9.90 (Keenan et al. 2013) in the R software (R Development Core Team 2015). Variations in allelic frequencies were quantified using FST. The statistical significance of departures from zero was tested using bootstrapping over the loci in the R package diveRsity.
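The cluster assignment rule described above (membership coefficient greater than 50%) can be sketched in Python as follows. The data structure, function name and membership values are assumptions for illustration only; the actual Q matrix is produced by STRUCTURE.

```python
def assign_clusters(q_matrix, threshold=0.5):
    """Assign each accession to the cluster whose membership exceeds the threshold.

    q_matrix: dict mapping an accession name to its list of membership
    coefficients (one per cluster). Returns a dict of 1-based cluster
    numbers, or None when no coefficient exceeds the threshold.
    """
    assignments = {}
    for name, q in q_matrix.items():
        best = max(range(len(q)), key=lambda k: q[k])
        assignments[name] = best + 1 if q[best] > threshold else None
    return assignments

# Hypothetical membership coefficients for K = 2.
toy_q = {
    "acc_1": [0.92, 0.08],
    "acc_2": [0.30, 0.70],
    "acc_3": [0.50, 0.50],  # no coefficient strictly above 0.5
}
clusters = assign_clusters(toy_q)
print(clusters)
```

With K = 2 a strict threshold of 50% leaves only exact ties unassigned; for larger K the same rule would also leave strongly admixed individuals without a cluster.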

The linkage disequilibrium (LD) block structure was examined using 5220 loci in TASSEL 5.0 software (Bradbury et al. 2007) by estimating the squared frequency correlation (r²) of alleles in each chromosome. Nonlinear regression curves were used to estimate the LD decay with distance, and the LD decay rate was determined as the physical distance between markers at which the average r² dropped to half its maximum value.
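The half-maximum criterion for LD decay can be illustrated with a short Python sketch. Note that this is a simplification: instead of the nonlinear regression used in the study, it averages r² in fixed-width distance bins and interpolates linearly between bin midpoints. The function name, bin width and toy values are assumptions.

```python
def ld_decay_distance(pairs, bin_width_kb=100):
    """Estimate the distance (kb) at which mean r2 falls to half its maximum.

    pairs: iterable of (distance_kb, r2) for marker pairs on one chromosome.
    Returns None if the binned curve never reaches half of its peak.
    """
    bins = {}
    for dist, r2 in pairs:
        bins.setdefault(int(dist // bin_width_kb), []).append(r2)
    # (bin midpoint, mean r2), ordered by increasing distance
    curve = [((i + 0.5) * bin_width_kb, sum(v) / len(v))
             for i, v in sorted(bins.items())]
    if not curve:
        return None
    peak = max(range(len(curve)), key=lambda i: curve[i][1])
    half = curve[peak][1] / 2
    tail = curve[peak:]
    for (x0, y0), (x1, y1) in zip(tail, tail[1:]):
        if y0 > half >= y1:
            # linear interpolation between the two bracketing bins
            return x0 + (y0 - half) * (x1 - x0) / (y0 - y1)
    return None

# Toy pairwise values, not data from the study.
toy_pairs = [(50, 0.8), (150, 0.6), (250, 0.4), (350, 0.2)]
decay_kb = ld_decay_distance(toy_pairs)
print(decay_kb)
```

A fitted decay curve, as used in the study, is less sensitive to the choice of bin width than this binned approximation.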

The BSH, total sample size, number of SNPs and average two-locus LD (r2) between SNP markers were estimated to calculate the statistical power of each association analysis by using the GWAPower package (Feng et al. 2011).

A compressed mixed linear model (Zhang et al. 2010) incorporating the trait data, population structure (Q matrix) and pairwise kinship (K matrix) was used to identify marker-trait associations using the TASSEL program. The K matrix was automatically obtained by the centered-IBS method using TASSEL. We also generated quantile–quantile (QQ) plots of the observed versus expected P values at each SNP. Markers were identified as significantly associated with traits based on a significance threshold of P < 1.916 × 10⁻⁴, corresponding to P < 1/n (n = number of markers). Manhattan plots of −log10(P) values for each SNP vs. chromosomal position were generated from the TASSEL results. Genes with known functional descriptions related to SNP peaks were selected as candidate genes using the Wm82 Genome Browser of SoyBase (https://soybase.org/).
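The significance threshold above follows directly from the number of markers used in the association analysis (n = 5220). The short Python sketch below reproduces the arithmetic; the function names are hypothetical, and the P value tested is the most significant one reported in the Results.

```python
import math

def significance_threshold(n_markers):
    """Threshold P < 1/n, with n the number of markers tested."""
    return 1.0 / n_markers

def is_significant(p_value, n_markers):
    """True when a marker's P value falls below the 1/n threshold."""
    return p_value < significance_threshold(n_markers)

n_markers = 5220
p_cut = significance_threshold(n_markers)
neg_log10_cut = -math.log10(p_cut)

print(f"{p_cut:.3e}", round(neg_log10_cut, 2))  # 1.916e-04 3.72
print(is_significant(7.61e-7, n_markers))       # True
```

The value of about 3.72 on the −log10 scale is the height of the threshold line on a Manhattan plot.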

Some specific locus alleles were significantly associated with certain oil traits, and the contributions of these alleles to the phenotypic values were assessed. To graphically evaluate the associations of the polymorphisms, a binary logistic regression model was built using the GGPLOT2 package.

Results

Phenotypic data

Oil traits in soybean are complex traits controlled by both genetic and environmental factors and therefore require phenotypic scoring in multiple environments. As shown in Supp. Table S1, the mean value for oil content in 2010 over all accessions was 18.91% of seed dry mass, and the soybean oil concentrations showed means of 10.70, 3.27, 24.10, 53.01 and 6.40% for palmitic, stearic, oleic, linoleic and linolenic fatty acids, respectively. In 2011, the corresponding means for oil content and the five fatty acids, in the same order, were 18.83, 10.48, 3.16, 24.98, 52.62 and 6.31%. The K–S two-sample test (significance level of 0.05) showed that the frequency distribution for the oil traits in each year did not depart from normality. The ANOVA (Supp. Table S2) showed significant effects of genotype across different environments in total oil, palmitic, stearic, oleic, linoleic and linolenic acid contents. However, the high BSH estimates (Table 1) indicated that the phenotypic values in these 2 years were relatively stable for the different accessions, suggesting that there were major genetic components conditioning the oil traits in this population.

Table 1 Maximum and minimum oil trait values (% dry mass in seeds and % concentration in soybean oil) observed in 96 soybean accessions

To assess the oil breeding potential of the panel, PIs and cultivars were analyzed separately (Table 2). The means of oil content, palmitic acid and linoleic acid were higher for the cultivars than for PIs. However, the range of variation among the PIs was two- to three-fold higher than that of the cultivars for all oil traits, and the extreme values belonged to the PI group. A higher oleic acid content (> 50% concentration in soybean oil) was observed in accessions PI 531520, PI 568261 and PI 568260. For palmitic acid content, the lowest values (~ 5% soybean oil) were observed in PI 599811, PI 602455 and PI 568260. PI 531520 showed the lowest value of linolenic acid and a high value of oleic acid. PI 471931 had the highest oil content. Notably, all these oil components are important in soybean breeding programs aimed at oil quality.

Table 2 Seed oil (% dry mass) and fatty acid contents (% in soybean oil) for the soybean cultivars and plant introductions

Genotyping, population structure and linkage disequilibrium

A total of 5220 SNP loci distributed in the soybean genome were selected based on BARCSoySNP6K genotyping. These SNPs covered a region of 947 Mb in the soybean genome, which represents 86% of the 1100-Mb soybean genome. SNP markers were identified on each chromosome, with the number of markers ranging from 211 (chromosome 12) to 336 (chromosome 13) and averaging 261 (Supp. Table S3). These values indicated that the Illumina Infinium platform identified SNPs that were well distributed throughout the soybean genome.

Using SNP loci, we performed a Bayesian clustering analysis (STRUCTURE) to determine the population structure of our panel. According to Evanno’s method, the most likely K value (number of clusters) was K = 2 (Fig. 1), with 65 and 31 individuals assigned to the two clusters. Notably, clusters 1 (red) and 2 (blue) corresponded to accessions from America (Brazil and the USA) and Asia (China, Korea, Japan, and India), respectively. To further examine the population structure of the panel, we performed principal component analysis (PCA) (Fig. 2). The first and second principal components explained 9.47% and 6.08% of the variance between the accessions, respectively, and the dispersion plot clearly discriminated the clusters identified by the Bayesian analysis. The two-dimensional PCA plot suggested a broader genetic base in cluster 2 (Asian materials) than in cluster 1 (American accessions). The pedigrees of the soybean accessions (Supp. Table S4) confirmed the narrower genetic base of cluster 1, which consists mainly of cultivars and soybean breeding material.

Fig. 2 Dispersion plot of the first (9.47%) and second (6.08%) principal components based on analysis of soybean accessions using 5220 SNP loci. The red and blue points represent accessions from America (1) and Asia (2), respectively, according to the clusters identified using STRUCTURE

Although the presence of only two clusters might suggest low genetic diversity among the genotypes of the soybean panel, the measure of differentiation, FST, was estimated at 0.1135, indicating that the two clusters were significantly different (Supp. Table S5); this contrast could be useful for identifying loci in association analysis. Cluster 1 had the larger number of individuals and alleles, but allelic richness was similar between the clusters, with no significant difference, indicating that both presented similar genetic variability.

The distribution of the correlation coefficients (r²) between SNPs located at different physical distances on each chromosome can be observed in Supp. Figure 1. Slow LD decay was observed with increased distance (kb) in all 20 chromosomes, with large blocks in LD present in each chromosome. The haplotype blocks spanned between 15,000 and 30,000 kb, except on chromosome 19, where they spanned 6000 kb. The LD block structures of all the chromosomes (Fig. 3) showed that the r² value declined as the physical distance between the loci increased. The decay of LD with physical distance between SNPs occurred at approximately 300 kb (r² = 0.16), suggesting that the soybean genome is structured in LD blocks within this distance.

Fig. 3 Pairwise LD values (r²) plotted against genetic distance estimated among 5220 SNP loci and 96 soybean accessions

Genome analysis

The GWAPower simulation indicates the sample size required to reach the maximum power for an analysis. Considering the parameters specific to the present study, such as SNP number (5220 loci), LD of 0.16 and heritability of 0.8267 (average for all oil traits), the minimum adequate sample size is 96 individuals. This result indicates that our analysis based on these soybean accessions was adequate to obtain maximum resolution.

Using a mixed linear model (MLM) with correction for multiple tests, 1, 16 and 2 SNPs for oil content, palmitic acid and oleic acid, respectively, exceeded the significance threshold (−log10 P ≥ 3.72) (Fig. 4). The 48.10 Mb position on chromosome 19 showed the highest level of significance (P value = 7.61 × 10⁻⁷), comprising one SNP associated with palmitic acid content. Chromosomes 8 and 12 showed the most SNPs, three each, all associated with palmitic acid content. Chromosomes 10 and 18 showed one SNP each associated with oleic acid content, and chromosome 9 showed one SNP associated with oil content. No overlap was found between the loci associated with these traits; however, two regions, one on chromosome 8 and the other on chromosome 15, showed SNPs associated with palmitic acid content that were less than 0.5 Mb (500 kb) apart. The QQ plots for total oil, palmitic acid and oleic acid content (Supp. Figure 2) showed that the observed P values followed the expected distribution, indicating that the compressed MLM was adequate to reduce false positives for these traits.

Fig. 4 Manhattan plots of genome-wide association study for oil traits in soybean. Negative log10-transformed P values from a genome-wide scan by using mixed linear models (MLM) for oil content (a), palmitic acid (b) and oleic acid (c) are plotted against positions on each of the 20 chromosomes. The significant trait-associated SNPs (Bonferroni adjusted) are distinguished by the threshold line

Based on the association analysis and the genes annotated in SoyBase (www.soybase.org), we identified candidate genes for loci significantly associated with each trait (Table 3). Although many of the SNP loci were in intergenic regions, 26% were in coding regions (CDS), introns or the 5′ UTRs of genes with functional annotation. These included genes potentially involved in fatty acid metabolism and regulation, such as genes encoding methyltransferase, translation-initiation factor, glycosyltransferase, kinase protein and storage proteins.

Table 3 SNPs significantly associated with oil traits and predicted candidate genes

To identify alleles associated with the three significant traits, the most significant SNP locus for each trait was selected, and the contribution of each allele to the trait value was recorded. The results of the binary logistic regression indicated that the frequency of the C allele in ss715635790 (Fig. 5) decreased as palmitic acid content increased, whereas for the ss715629367 and ss715603267 loci, the frequency of the C allele increased as oleic acid content and oil content increased, respectively. Although they are located in regions that do not contain any previously discovered QTLs or genes affecting these traits, these alleles belong to SNP loci on different chromosomes and may prove valuable for future breeding-by-design of soybean lines to enhance oil content and/or soybean oil quality.

Fig. 5 Fitted logistic regression describing the associations between oil traits and three SNP polymorphisms in the soybean panel: a ss715603267, T-to-C allele in soybean seed oil content; b ss715635790, C-to-A allele in palmitic acid content; and c ss715629367, T-to-C allele in oleic acid content. Gray shadows show 95% confidence intervals

Discussion

Our findings reveal SNP loci associated with oil traits in soybean and show that our soybean panel can be useful for identifying such polymorphisms. Consistent with our results, previous studies have identified SNP loci associated with the five main fatty acids in soybean using universal SNP chips and a similar experimental setup (Li et al. 2015; Leamy et al. 2017). The current work also complements a growing body of work demonstrating the power of genome-wide association studies to identify molecular markers associated with oil content in soybean (Hwang et al. 2014; Vaughn et al. 2014; Cao et al. 2017) and provides data specific to Brazilian field conditions.

Breeding for oil traits is focused on the quantity and quality of soybean oil, including the contents of the five main fatty acids. Because the genetics of these traits are well known and desired sources of germplasm are available, oil traits can be manipulated in a breeding program. The high BSH values that we document here suggest that the 96-accession panel can be useful for oil trait breeding programs. Previous studies have reported heritability estimates for oil traits that are moderate to high, although these are quantitative traits controlled by multiple genes (Fehr et al. 1991; Panthee et al. 2006; Hyten et al. 2004).

The range of variation for oil traits was greater among the PIs than among the Brazilian cultivars in this study, indicating that the PI population can be used to find genes controlling these traits. The expansion of genetic diversity by incorporating alleles from PIs has been proposed in several studies from countries with a narrow genetic base for soybean, such as Brazil and the USA (Hiromoto and Vello 1986; Gizlice et al. 1994; Sneller 2003; Wysmierski and Vello 2013). However, PIs have presented low agronomic value in relation to seed yield, and genetic diversity and agronomic value are independent traits (Sneller 1994). To avoid the low agronomic potential of some PIs, researchers have advised selection of those soybean lines that present the best agronomic characteristics prior to using them in breeding programs (Vello et al. 1984; Sneller 1994; Wysmierski and Vello 2013).

Genome-wide analysis using BeadChip platforms has allowed the evaluation of the genetic structure of soybean germplasm based on a large number of markers (Hyten et al. 2010; Song et al. 2013; Lee et al. 2015b; Wang et al. 2016). The two main genetic groups identified by the STRUCTURE analysis corresponded to the Asian and American gene pools, as identified in a previous study with 142 microsatellite markers (Priolli et al. 2015). The PCA suggested that the genetic base was broader in cluster 2 (Asian accessions) than in cluster 1 (American accessions). These findings are consistent with previous studies using molecular markers that showed substructure based on geographical origin (Ude et al. 2003; Li et al. 2008) as well as studies on the genetic bases of both germplasms (Hiromoto and Vello 1986; Gizlice et al. 1994; Sneller 1994; Wysmierski and Vello 2013).

The extent of LD is an important factor determining the efficiency of association analysis. The decay of LD with physical distance between SNPs occurred at approximately 300 kb (r² = 0.16), which is comparable to the results of previous studies (220–270 kb) that used larger and more genetically diverse populations (Vuong et al. 2015; Zhang et al. 2017). The more genetically diverse the germplasm, the more rapid the expected decay, which provides more opportunity for selection. Although the observed extent of LD was problematic because it resulted in the inclusion of tens to hundreds of candidate genes within an LD block, the results indicated that the use of these accessions had no substantial disadvantage compared to the use of other sets of soybean germplasm and reached reasonable resolution. In a study with Brazilian soybean cultivars, the length of the blocks was very similar among chromosomes, with most blocks being 51–500 kb (Contreras-Soto et al. 2017).

We discovered a total of 19 SNPs on ten different chromosomes that were associated with oil traits in our soybean panel. The corrected significance threshold (P < 1.916 × 10⁻⁴, Bonferroni correction) minimized the probability that the null hypothesis was falsely rejected, balancing false and true positives. Because it is stringent, this correction might miss some important associations, which may explain the absence of SNP loci associated with stearic, linoleic and linolenic acid content (data not shown).

Six of the sixteen loci significantly associated with palmitic acid were near or in the same linkage groups as previously identified SNPs. For instance, a genome-wide association study conducted with soybean accessions found SNPs located in genes Glyma05g07630 and Glyma12g01380a on chromosomes 5 and 12, respectively (Li et al. 2015). Our study identified SNPs ss715590297 and ss71561484 within 3.0 kb of these genes. The SNP loci ss715591234, ss715603045, ss715603976, and ss715611451 were also located less than 4 Mb from genes reported in the same study. QTLs at the 40.6 and 44.5 Mb positions of chromosomes 9 and 19, respectively, where the SNPs ss715603976 and ss715635790 were identified in our study, were associated with palmitic acid in a linkage mapping study (Smallwood et al. 2017). Similarly, the 4.9 Mb position was found to be associated with oil content in a previous genome-wide association study (Hwang et al. 2014) and is where the ss715603267 locus was found in our study. The region at 9.9 Mb on chromosome 9 was reported twice as associated with oil content on SoyBase based on linkage mapping. Usually, SNPs that are reported in multiple studies using different sources of oil germplasm are good candidates for the validation of associations detected via association analysis.

The analysis of the SNP annotations revealed an extensive network of terms associated with several metabolic processes that may be related to the metabolism of oils, such as methylation enzymes, translation-initiation factors, glycosyltransferase, kinase protein and storage proteins; however, such terms were associated with only 26% of the genes or candidate genes identified here. In a genome-wide association study in which an annotated gene approach in a model plant was adopted to design the genotyping array, the majority (93%) of the 1205 SNPs were located in the coding regions (CDS), untranslated regions (UTRs) and introns of 1074 annotated genes (Li et al. 2015). In another study, the soybean reference genome was used to search for all genes associated with seed composition traits of the wild soybean genome, and a total of 29 SNPs were found, of which 8 (27.6%) were located in candidate genes (Leamy et al. 2017). Both strategies, the SNPs developed using the model plant and the utilization of more diverse germplasm, were successful in the identification of candidate genes.

Beyond major gene effects, many QTLs with minor effects on oil traits have been discovered in soybean (Hyten et al. 2004; Lee et al. 2015a; Smallwood et al. 2017). These results may explain the presence of SNPs in introns and intergenic regions influencing target traits in our study. Another factor may have been the extent of LD, which in our soybean panel persisted for long distances, suggesting that the use of more markers would have resulted in the detection of additional QTLs. BARCSoySNP6K does not cover all genes in the soybean genome (Song et al. 2014), and a highly significant marker may be either within the causative gene itself or in close linkage with it. A further factor may be the location of the SNP in the soybean genome. According to Song et al. (2014), the BeadChip was developed using several quality criteria, including the genome region (euchromatic vs. heterochromatic). Although five-sixths of the SNPs came from euchromatic regions, SNPs from the heterochromatic regions, which have lower numbers of genes, are also present (Song et al. 2014).

It is possible to manipulate the proportions of some fatty acids over a wide range by traditional plant breeding techniques (Graef et al. 2009; Clemente and Cahoon 2009). Increasing oleic acid levels and decreasing linoleic and linolenic acid levels make soybean oil healthier for human consumption. Soybean accessions with reduced palmitic acid are desirable because saturated palmitic acid is associated with an adverse lipoprotein profile and gives rise to negative health effects in humans (Mensink and Katan 1990). Moreover, to optimize the fuel characteristics of soybean oil for use in biodiesel, it has been suggested that oils that are high in oleic acid and low in palmitic acid should be developed (Graef et al. 2009). Because genome-wide allelic and haplotype data are available for relevant breeding lines and haplotype-trait associations have been established, it may be possible for soybean breeders to undertake breeding-by-design approaches. For example, considering our findings, soybean breeders interested in optimizing the quality characteristics of soybean oil can focus on the C alleles of the three loci ss715635790, ss715629367 and ss715603267, because they are associated with lower palmitic acid content, higher oleic acid content and higher total oil content, respectively.

In conclusion, the present study revealed the phenotypic variability of our association panel, indicating the potential of these materials to provide new combinations of favorable alleles for oil traits, in addition to broadening the genetic base for breeding programs. Our analysis also confirmed previous findings and the utility of BARCSoySNP6K BeadChips for genome analysis and their direct applicability for soybean improvement. In total, 16, 2 and 1 SNP loci were significantly associated with palmitic acid, oleic acid and oil content, respectively, and their candidate genes were predicted. We suggest that by using favorable alleles, soybean breeders can rapidly improve oil traits in soybean.