Introduction

SCN is a major pest of soybeans [Glycine max (L.) Merr.] worldwide. Soybean plants infected with SCN exhibit symptoms of root necrosis and stunting underground as well as leaf chlorosis, early senescence, and seed weight reduction. The soybean production loss caused by SCN is estimated to cost $1.5 billion annually in the USA, which makes it the most damaging pest than any other soybean pathogens (Wrather and Koenning 2006). Crop rotation for multiple years can reduce levels of SCN in the field, but this practice is dependent on the profitability and practicality of the non-host crops. Nematicides are costly and environmentally harmful. In contrast, the use of resistant soybean cultivars is an economically and environmentally friendly method of mitigating SCN damage to soybeans. However, continuous use of cultivars derived from a single genetic source of resistance can cause the SCN population to evolve to overcome host resistance, thus diminishing the utility of any given SCN-resistant soybean cultivar.

The first reported case of SCN in North America occurred in North Carolina (Winstead et al. 1955), and the pest has since spread through soybean growing areas of the United States. So far, SCN has been found in every soybean-producing state except New York and West Virginia and continues to spread considerably (Tylka and Marett 2014). Sixteen potential SCN races have been identified using four different hosts: PI548402 (Peking), PI548988 (Pickett), PI88788, and PI90763 (we call this classification scheme as SCN race scheme). A modified classification scheme was subsequently developed using the H. glycine (HG) type to characterize an SCN population (Niblack et al. 2003); this scheme is more suitable for characterizing a heterogeneous SCN population and has been extensively used in classifying SCN populations. The method uses a panel of seven soybean genotypes (Peking, PI88788, PI90763, 437654, 209332, PI89772, and PI548316) with varying levels of resistance to SCN (we call this modified classification scheme as SCN-type scheme). Here, we use SCN race scheme, because (1) the SCN screening in this study was performed in the year 2000 (prior to the development of SCN-type scheme) (Niblack et al. 2002); (2) both race 1 and race 5 have the same SCN type (HG 2.5.7); and (3) the term “race” has been used in soybean-breeding efforts referring to the SCN population for many decades and it is convenient for comparison. The SCN originally found in North Carolina was later categorized as race 1 (Golden et al. 1970). Now, this race is widely spread in most soybean fields. It has been demonstrated that cultivated soybeans underwent a severe “genetic bottleneck” (Hyten et al. 2006), and some of well-used resistant cultivars are loosing resistance due to SCN race shifts (Mitchum et al. 2007; Niblack et al. 2008). It is in urgent needs to explore new sources of SCN resistance in the wild relatives of soybeans and to develop novel SCN-resistant soybean cultivars.

Knowledge of the genetic basis of the soybean–SCN interaction mostly comes from quantitative trait locus (QTL) mapping and genome-wide association studies (GWAS). Soybean resistance to SCN is a complex trait involving many genes. To date, at least 40 QTLs from different resistant varieties have been mapped to 17 of 20 chromosomes (Concibido et al. 2004; Vuong et al. 2010); 14 of these QTLs were associated with race 1 resistance. The identification of two significant loci, rhg1 on chromosome 18 and Rhg4 on chromosome 8, has revealed that the copy number variation of three tandem genes (an amino acid transporter, an alpha-SNAP, and a wounding-inducible protein 12) and the involvement of a serine hydroxymethyltransferase provide resistance to SCN. Genomic regions containing or adjacent to these important loci are associated with resistance to multiple races (Yue et al. 2000; Guo et al. 2006; Wu et al. 2009). GWAS is increasingly being used to identify the genetic basis of complex trait variation in diverse crop species (Li et al. 2013; Morris et al. 2013). Recent analyses of sudden death syndrome resistance in soybean plants confirmed 7 loci in previously reported QTL intervals and identified 13 novel loci using the GWAS approach (Wen et al. 2014). Accordingly, GWAS was also applied to scan genomic regions associated with resistance to SCN races 3 and 4 in cultivated soybean populations (Han et al. 2015; Vuong et al. 2015). In these and other studies, both known and novel QTLs were identified, and candidate genes underlying SCN resistance were suggested (Wen et al. 2014; Zhang et al. 2015b). GWA mapping showed a useful complementary approach to classical bi-parental QTL mapping for dissecting the genetic basis of complex traits variation.

Crop wild relatives (CWR) have gradually gained attention during the past decade due to their higher genetic diversity than that of their domesticated decedents. CWRs have been extensively studied for identification of genes associated with abiotic and biotic stress resistances, as well as crop improvement (Zhang et al. 2016a). The wild soybean (Glycine soja Sieb. & Zucc.) is the closest relative of the cultivated soybean (G. max). G. soja has retained more than half of the genetic diversity that had been lost in the cultivated soybean due to “bottleneck” events during the domestication process (Hyten et al. 2006; Zhou et al. 2015). Long-term exposure to the combination of various harsh environmental stresses has led G. soja to possess superior characteristics in abiotic stress tolerance (Qi et al. 2014) and biotic stress resistance (Kim et al. 2011; Tian and Smith 2015). To date, few studies have used G. soja to identify QTLs for SCN resistance (Kim et al. 2011; Kim and Diers 2013; Zhang et al. 2016b), whereas no studies have applied high-density SNP markers to understand the genetic architecture of the G. soja response to race 1.

In this study, we evaluated the resistance responses of a set of 1032 G. soja accessions to (1) quantify the diversity of the G. soja resistance response to SCN race 1; (2) identify genomic regions significantly associated with race 1 resistance using the GWAS strategy with ~42,000 genome-side SNPs; and (3) predict candidate genes involved in the G. soja response to race 1 infection.

Materials and methods

Wild soybean samples and phenotype

A total of 1032 G. soja accessions from the United States Department of Agriculture (USDA) Soybean Germplasm Collection were used for association analysis in this study (Table S1). This set of G. soja accessions originated from four East Asian countries: Russia, China, Japan, and South Korea (Fig. 1). All these 1032 accessions were evaluated for resistance levels to SCN race 1 in the year 2000 by Prakash team. The SCN screening assay was performed in the greenhouse as described previously (Arelli et al. 2000). Ten seedlings of each accession were used for SCN inoculation. Peking, Pickett, PI 88788, PI 90763, and PI 437654 were used as host differentials, with Hutcheson as the susceptible control. Bioassays were repeated with the lines initially found to be resistant. Each seedling was inoculated with 1200 ± 25 eggs in 5 mL distilled water suspension. The resistance level for each accession was evaluated using female index (FI), which is a percentage of the number of females produced on each line divided by the number produced on a standard susceptible soybean, and the result is multiplied by 100. The screening result was retrieved from USDA website: https://training.ars-grin.gov/gringlobal/, which was reported using SCN race scheme (Table S1). We used the term “HG type” when compared to recently published results.

Fig. 1
figure 1

Geographic distribution of 1032 G. soja accessions in East Asia

Genotype data and quality control

This set of 1032 G. soja accessions was previously genotyped with the Illumina SoySNP50K iSelect BeadChip (Illumina, San Diego, CA, USA) and contained a total of 52,041 SNPs (Song et al. 2013, 2015). The SNP data for this set of population are available and can be retrieved from the public SoyBase database (http://soybase.org/). During the quality control process, SNPs without a physical position on 20 chromosomes, with a minor allele frequency (MAF) <5% and a missing rate >10% were eliminated from further analysis.

Population structure and linkage disequilibrium analysis

The STRUCTURE (version 2.3.4) program was used to estimate the population structure with the Bayesian Markov Chain Monte Carlo model (MCMC) (Hubisz et al. 2009). The burn-in length of the 10,000 MCMC iterations and the subsequent 50,000 iterations was used to estimate the parameters. Three replicates were performed for each K value (K = 2–8). The true K value was determined by a web tool called STRUCTURE HARVESTER, as described by Earl and Vonholdt (2012). A kinship matrix was calculated using TASSEL version 5.0 (Bradbury et al. 2007) to determine relatedness among individuals. Pairwise LD between SNP markers was calculated with squared allele frequency correlations (r 2) in TASSEL version 5.0 (Bradbury et al. 2007).

Genome-wide association analysis

Genome-wide association analysis was performed with a mixed linear model (MLM) in R package GAPIT (Lipka et al. 2012). The average value of SCN female index (FI) value per accession was used as phenotype for GWAS analysis as previously described (Arelli et al. 2000; Vuong et al. 2015). The Q matrix generated from STRUCTURE and a kinship matrix were used to account for population structure and kinship, respectively. We also ran the GWAS without using the Q matrix to control population structure, and both models (with and without the Q matrix) generated similar results. We then used permutation as previously described (Zhang et al. 2015a, b) to determine the significance threshold with the empirical significance value (p < 0.001).

Candidate gene prediction

Genes located within 50 kb on both sides of the significant SNPs were selected as the source for candidate gene identification as previously described (Zhang et al. 2016b). Annotations for the gene list and protein sequences could be retrieved from the public database Phytozome (https://phytozome.jgi.doe.gov). Functional annotation for each gene was also conducted using the Blast2GO software (Conesa et al. 2005) or through reference annotation of the Williams 82 soybean reference genome G. max Wm82.a2.1 (SoyBase, http://soybase.org/), as well as previously published literatures. Genes localized with significant SNPs or with functional descriptions relevant to disease resistance were regarded as candidate genes involved in the G. soja response to SCN race 1.

Results

Variation in G. soja response to race 1 infection

The set of 1032 G. soja accessions used in this study represents 88.3% of the USDA wild soybean germplasm collection (1168 accessions) (http://www.ars-grin.gov/). The phenotype data showed a wide range of variation in the G. soja response to race 1 infection in this studied population. The FI values of 1032 G. soja accessions ranged from 30 to 168.3, with an average FI value of 76.8 (Fig. 2a).

Fig. 2
figure 2

Analysis of female index (FI) distribution and estimated population structure. a Distribution of FI among 1032 G. soja accessions. b Genome-wide average LD decay for the association panel. c Unrooted neighbor-joining tree of 1032 G. soja accessions. Colored dots represent the country of origin of G. soja accessions

SNP data and LD in G. soja population

A SoySNP50K Illumina Infinium II BeadChip containing 52,041 SNPs was used to genotype the G. soja population. As a result, a total of 41,087 SNPs were polymorphic among the 1032 G. soja accessions. After quality control, 31,019 SNPs with a minor allele frequency ≥5% and a miss rate >10% were saved for further analyses. The SNP markers ranged from 1643 on chromosome 20 to 3221 on chromosome 18, with an average of 1550 for each soybean chromosome. Accordingly, the SNP density across chromosomes ranged from 17.2 kb on chromosome 13–33.08 kb on chromosome 1, with an average genome-wide SNP density of one SNP per 23 kb. The trend of LD decay by plotting r 2 against distances (kb) between SNP pairs showed that the LD decays quite rapidly (Fig. 2b). The LD decay rate (the distance where the maximum r 2 drop to half of its value) was estimated as 10 kb.

Population structure

To understand the genetic relationships among the 1035 G. soja accessions, a model-based Bayesian clustering method was used to characterize the population structure. The results from STRUCTURE HARVESTER (Earl and Vonholdt 2012) indicated that the highest ΔK value was observed at K = 4 (Figure S1B), suggesting that this population contains four subgroups (Figure S1A). To further understand population stratification in this G. soja population, an NJ tree for the 1032 accessions was constructed. As shown in Fig. 2c, the resulting NJ tree clustered the 1032 accessions into four groups, with accessions from Japan and South Korea each forming a group, accessions from northern China and Russia sharing one larger group, and accessions from southern China clustering in a smaller group. These results indicate a high correlation between geographical origin and population structure in the G. soja population, which is also observed in the G. max population (Zhang et al. 2015a). The consistency of the structure in the two analyses was also manually checked, and the four clusters identified by the NJ tree are consistent with the population stratification observed with STRUCTURE. For example, as observed in NJ tree, accessions from South Korea (red) and Japan (blue) were also primarily assigned into two major groups in STRUCTURE result, while accessions from Northern (yellow) and Central China (green) were separately grouped (Figure S1A). Thus, it is appropriate to classify this set of sample into four groups, which can appropriately account for the population structure.

Association mapping

Association analysis were performed using the MLM model in GAPIT with the population structure controlled by the Q matrix (K = 4) from STRUCTURE and familial relatedness. After 1000 permutations, the SNP markers were identified as significantly associated with the response to SCN race 1 infection using a threshold of –Log (p value) ≥3.64. As a result, a total of ten SNPs significantly associated with the G. soja response to race 1 infection were identified. These significant SNPs are located on five chromosomes (2, 4, 9, 16, and 18) (Fig. 3; Table 1). Of the ten significant SNPs, four are clustered on chromosome 4 and two are grouped on chromosome 18. Many of the significant SNPs are located within the previously identified QTLs. For example, the significant SNPs, ss715628474 and ss715628480, on chromosome 4 are localized within two previously mapped QTLs, SCN 18-5 and SCN 19-4 (Yue et al. 2001); ss71561958 on chromosome 18 is located within a known QTL SCN 37-2 (Vuong et al. 2010). The other seven significant SNPs without co-localization in known QTLs related to SCN resistance are regarded as novel. The ten SNPs explained 11.42–11.84% of the total phenotypic variation.

Fig. 3
figure 3

Genome-wide association analysis of G. soja response to race 1 infection. a Manhattan plot for SCN response. The red horizontal line represents the empirical threshold (p < 10−3.64) defined by 1000 permutations of the association analysis. b Quantile–Quantile plot using MLM

Table 1 Significant SNPs and predicted candidate genes associated with G. soja response to race 1

Candidate genes associated with the G. soja response to SCN race 1 infection

To further understand the molecular mechanism of the G. soja response to race 1 infection, genes located in or close to significant SNPs were evaluated. Four soybean genes were located within the SNP regions (Table 1) and were thus considered the most likely candidate genes underlying the G. soja response to SCN race 1. Another six SNPs were located in intergenic regions. In addition, genes within 50 kb at either side of significant SNPs were also retrieved, as previously described (Zhang et al. 2016b). As a result, a total of 83 gene models were predicted within the search region (Table S2), 13 of which might be involved in plant disease resistance based on reference annotations and results from published literature (Table 1). This candidate list includes genes encoding calcium-dependent phospholipid-binding protein, leucine-rich repeat protein kinase family protein, nucleotide-binding (NB)-ARC domains-containing protein, ethylene-responsive element binding factor, and cytochrome P450 family protein.

It is important to note that the gene Glyma.18G102600 on chromosome 18 was regarded as a highly promising candidate gene involved in G. soja responding to SCN race 1 infection, because this gene is located in a strong LD block (average r 2 = 0.88) harboring two significant SNPs, ss715628474 and ss71568480 (Fig. 4). Glyma.18 G102600 encodes an NB-ARC domains-containing protein, a typical disease resistance protein involved in pathogen recognition and activation of the innate immune response (Van Ooijen et al. 2008).

Fig. 4
figure 4

Regional plot and candidate gene for loci ss715628474 and ss715628480. The top panel shows the regional Manhattan plot for significant loci ss715628474 and ss715628480 and their adjacent loci on chromosome 18. The corresponding genes within the regional plot are drawn in green arrows in the middle panel. The putative disease resistance gene Glyma.18G102600 encoding an NB-ARC protein is regarded as a candidate gene. The LD pattern in this region is illustrated by an LD triangle based on a pairwise r 2 value that is drawn at the bottom pane. The r 2 values in the LD block are indicated with a color key index

Discussion

Applying wild soybeans to improve soybean resistance to SCN

Novel and exotic resources with rich genetic diversity are crucial in managing crop pest damage due to evolutionary arms races. Pest-resistant cultivars will gradually lose their resistance after many years: pests, especially pathogens, and viruses, can evolve rapidly because of their short life cycles and the selection pressure from host plants. To date, most pest-resistant cultivars have been developed using the gene pool of cultivated crops. It is known that the genetic diversity in modern cultivated crops has been substantially reduced during domestication (Hyten et al. 2006), whereas their wild relatives retain much higher genetic diversity. The promising potential of wild relatives in crop improvement has been gradually recognized in recent decades, and their application for crop improvement is well documented (Zhang et al. 2016a). In G. soja, the identification of superior characteristics in abiotic stress tolerance (Qi et al. 2014) and biotic stress resistance (Tian and Smith 2015) suggests that G. soja harbor promising novel genetic resources in biotic and abiotic resistance/tolerance. As expected, over a dozen of G. soja accessions were identified resistant to HG type 2.5.7 (race 5) in recent studies (Kim et al. 2011; Kim and Diers 2013; Zhang et al. 2016b). However, no G. soja resistance showed high levels of resistance to race 1 (FI <10). This lack of resistance might be due to the completely distinct molecular mechanisms of G. soja defending against race 5 (HG type 2.5.7) vs. race 1. Our recent study (Zhang et al. (2016b) identified different SNPs and candidate genes involved in G. soja resistant to race 5 (HG type 2.5.7) compared to this current study on SCN race 1.

The continuous distribution of FI values suggests that the G. soja response to SCN race 1 is quantitatively controlled, which can involve many genes and the interactions between them. Although no G. soja accessions showed FI value <10, which is not quite useful for breeding scientists, it is such a large population with wide variation in resistance levels. Thus, together with large genomic data available, this is useful population to understand the genetic basis of resistance variation in G. soja, which is our main aim in this study. To date, the majority of QTLs underlying SCN resistance have been identified from linkage mapping cultivated soybeans (Concibido et al. 2004; Guo et al. 2006; Wu et al. 2009; Vuong et al. 2010; Han et al. 2015; Vuong et al. 2015), which can only capture the variation of the two parental individuals. It is not surprising that some of the significant SNPs identified in our study are not located within the previously identified QTLs underlying SCN resistance as this studied populations capture diversity from over a thousand of accessions, rather than two parents. Indeed, we expect to identify novel SNPs or candidate genes in G. soja as it is a genetically different species from cultivated soybean (G. max), and the former also harbors much higher genomic diversity than the latter. Identification of the genomic regions responsible for the race 1 response in our study might provide a novel understanding of the genetic architecture of the G. soja response to SCN. Nevertheless, our study provides several genomic regions that might be response to the large variation of G. soja resistance to SCN race 1, and the two loci on chromosome 18 are promising candidate markers deserving further investigation.

Candidate genes involved in G. soja response to SCN race 1

The current knowledge of SCN resistance genes stem from the identification of two key loci, rhg1 (Cook et al. 2014) and Rhg4 (Liu et al. 2012). These studies indicated that a copy number of three tandem genes (an amino acid transporter, an alpha-SNAP, and a WI12 protein) (rhg1) and a serine hydroxymethyltransferase function (Rhg4) in SCN resistance. These findings were not consistent with the results that plant resistance to pathogens or pests is controlled by canonical NB-ARC or NBS-LRR-type genes (Cheng and Li 2012). The rhg1 and Rhg4 genes were not identified among the 83 gene models within the search regions of the significant SNPs in this study. Explanations include: (1) the genetic architecture of most quantitative traits is very complex and population-specific, with different QTLs/candidate genes for the same trait identified in different populations and/or species. The wild soybean used here is the closest wild relative of cultivated soybean, but has a substantially higher level of genetic diversity. We, therefore, did not expect to find the same SNPs/genes controlling the SCN response to race 1 in our sample as those previously discovered in the cultivated soybean populations. (2) This result may simply be a reflection of a lack of sufficiently close SNP markers. The linkage disequilibrium blocks in wild soybean populations are much smaller than those in domesticated soybean samples. (3) The frequency of the responsible alleles might be very low, and GWAS fails to detect them (Auer and Lettre 2015).

Previous studies have indicated that protein kinases show a broad spectrum of resistance to diverse invading pathogens (Song et al. 1995; Gomez-Gomez and Boller 2000). Mutation of a receptor-like protein kinase, RPK2, results in a decrease in both nematode infection and syncytium size in the rpk2 mutant (Replogle et al. 2013), which suggests the important roles of kinase in nematode resistance. However, plants trigger a cascade of defense signals once they encounter environmental stress. The calcium-mediated regulation of plant immunity (Lecourieux et al. 2006) and plant phytohormones, such as ethylene (Lorenzo et al. 2003), is two important regulatory mechanisms involved in plant defense signaling that subsequently trigger a battery of defense actions. In our recent studies using transcriptome profiling to characterize the molecular mechanism of the soybean–SCN interaction, significant changes in the expression of genes participating in calcium and ethylene-related pathways were observed, suggesting that calcium and ethylene play important roles in soybean defense against or response to SCN infection (Klink et al. 2007a, b; Kandoth et al. 2011). Calcium- and ethylene-mediated plant defense signaling was recently verified conserved in the plant defenses against nematodes, as previously described (Manosalva et al. 2015).

Plant R proteins have been extensively studied in plant defenses against diverse pathogens and pests (Van Ooijen et al. 2008; Cheng and Li 2012). However, to date, the major SCN-resistant genes are not canonical NBS-LRR or NB-ARC genes; several types of R genes were identified as resistant to other nematodes in various plants (Williamson and Kumar 2006). In addition, a significant change in R gene expression in soybean roots was observed after SCN infection (Klink et al. 2007b; Guo et al. 2015). Thus, the putative disease-resistant R gene (Glyma.18G102600) that was identified in a strong LD block harboring two significant SNPs might be a promising candidate gene meriting further investigation.