Introduction

Improvement of elite soybean [Glycine max (L.) Merr.] germplasm in North America may be limited by genetic bottlenecks during its development. Pedigree analysis shows that 80% of the ancestry of this germplasm can be traced to only 13 ancestral cultivars (Gizlice et al. 1994). In addition, northern U.S. cultivars can trace 50% of their ancestry to only three ancestral lines (Gizlice et al. 1994). Recent estimates show that yield improvements in soybean continue at a modest rate of approximately 30 kg ha-1 year-1 or about 1% year-1 (Specht et al. 1999; Wilcox 2001). These yield improvements are sustained, however, by breeding efforts that continue to increase in size.

The primary gene pool that can be used to improve North American soybean germplasm includes elite soybean germplasm, exotic soybean lines and Glycine soja Sieb. and Zucc., the wild ancestor of soybean (Hymowitz and Singh 1987). The limited diversity in elite soybean germplasm has resulted in an interest in evaluating G. soja as a new source of genetic diversity for improving the crop. There is ample evidence that within G. soja, there is greater genetic diversity than in G. max (Maughan et al. 1995). Introgression of genes from wild species through genetic mapping with molecular markers has been employed successfully in the identification of agronomically important genes in a number of crop species (Tanksley et al. 1996; Tanksley and Nelson 1996; Tanksley and McCouch 1997; Xiao et al. 1998). Using the advanced backcross method, deVicente and Tanksley (1993) identified and transferred quantitative trait loci (QTL) for increased yield and soluble solids from Lycopersicon pennellii, a wild relative of tomato (Lycopersicon esculentum), into cultivated tomato. Further research has resulted in the identification of QTL that agronomically improve tomato from L. peruvianum (Fulton et al. 1997), L. hirsutum (Bernacchi et al. 1998) and L. pimpinellifolium (Tanksley et al. 1996). The advanced backcross method was also used to map genes that enhance rice yield from Oryza rufipogon, a weedy relative of cultivated rice (Xiao et al. 1996, 1998; Moncada et al. 2001).

A number of researchers have evaluated G. soja as a source of useful genes. These researchers were unsuccessful in identifying positive transgressive segregates or in mapping useful QTL from G. soja (Ertl and Fehr 1985; Graef et al. 1989; Suarez et al. 1991) until the employment of genetic markers became widespread. Useful QTL from G. soja were discovered in a population of F2-derived lines from a cross between the G. max experimental line A81–356022 and the G. soja accession PI 468916. In this population, QTL controlling hard seededness (Keim et al. 1990b), plant morphology, maturity (Keim et al. 1990a), and protein and oil concentration (Diers et al. 1992) were mapped. The G. soja alleles for the protein QTL were associated with over 20 g kg-1 greater protein concentration than the G. max alleles (Diers et al. 1992). These G. soja alleles have been backcrossed into the background of the G. max parent and one region continued to increase protein concentration in a BC3 population, although it was associated with reduced yield (Sebolt et al. 2000).

Concibido et al. (2003) mapped a QTL allele that was associated with increased seed yield from G. soja PI 407305. The initial mapping was done in a BC2 population developed according to the advanced backcross method (Tanksley and Nelson 1996). In the BC2 population, the QTL allele was associated with a 12% yield increase across testing environments. The QTL was then backcrossed into seven genetic backgrounds and retested. They observed a significant (P<0.05) positive effect on yield for the QTL allele from G. soja in three of the seven backgrounds.

It is difficult to map QTL controlling yield in populations developed directly from crosses between soybean and G. soja, because of the poor general agronomic performance of lines. Agronomic problems such as premature pod dehiscence (seed shattering) highlight the need to study yield in backcross populations that have improved agronomic performance. Unfortunately, advanced backcross populations (Tanksley and Nelson 1996) are difficult to develop in species such as soybean where it is problematic to produce the hundreds of hybrid seed needed for the experiments. We have developed backcross populations for mapping QTL from wild species using a novel approach that does not require the large amount of hybrid seed production required for the advanced backcross method. The objective of our study was to map QTL for yield and other agronomic traits in a series of backcross populations developed using a G. soja line as a donor parent and a soybean cultivar as a recurrent parent.

Materials and methods

Plant material and trait evaluations

The mapping populations consisted of five BC2 populations developed using the G. max cultivar IA2008 as the recurrent parent and the G. soja plant introduction (PI) 468916 as the donor parent. The cultivar IA2008 is a high yielding, maturity group II public cultivar developed at Iowa State University. PI 468916 is a G. soja plant introduction that was collected from Liaoning Province, People's Republic of China in 1982 (Bernard et al. 1989). PI 468916 is also the G. soja parent of the population used in developing the Iowa State University-USDA genetic map (Keim et al. 1990a; Cregan et al. 1999).

IA2008 and PI 468916 were crossed and a population of F2:3 lines was produced. Five random F2:3 lines were crossed with IA2008 to produce BC1F1 plants. These BC1F1 plants were crossed again with IA2008 to produce BC2F1 seed. Five BC2F1 plants that each trace to a different F2 plant were grown to initiate the development of five backcross populations. The populations were inbred to the F4 generation through single-seed descent. Each F4 plant was then threshed individually to produce BC2F4-derived lines. A total of 468 BC2F4-derived lines were tested, with 110 lines each in populations 324B and 330A, and 79, 57 and 112 lines in populations 326, 334A and 338B, respectively.

The BC2F4-derived lines from the five populations and the IA2008 parent were evaluated in field tests during the summers of 1999 and 2000. Two replicates of the lines were grown in Lincoln, Nebraska and Urbana, Illinois each year. The plots were arranged in a randomized complete block design with each population grown as a separate set. The lines were grown in 4-row plots with a 76-cm row spacing and a seeding rate of 30 seeds m-1. The length of the plots was 3.2 m in Illinois and 3.0 m in Nebraska. The two center rows of each plot were harvested for seed yield measurement. Populations 326 and 334A were not harvested in Nebraska during 2000 because of plot damage in that area of the field. Seed moisture was measured for seed harvested from each plot and was used to calculate seed yield on a moisture-free basis. In addition to yield, the lines were rated for date of maturity, plant height, and lodging. Date of maturity was the day when 95% of the pods had changed to their mature pod color. Plant height was the average length in centimeters of mature plants from the ground to the top node of the main stem. Lodging was rated at maturity with a visual score ranging from 1 to 5, with 1 representing all plants erect and 5 all plants prostrate.
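The conversion of a harvested plot weight to moisture-free yield can be sketched as follows. This is a minimal illustration only. It assumes that plot weight is recorded in grams, that seed moisture is expressed as a fraction of wet weight, and that the harvested area equals the two center rows times the row spacing times the plot length. The function name and the example plot values are hypothetical and are not taken from the study.

```python
def dry_yield_kg_per_ha(plot_weight_g, moisture_fraction, plot_length_m,
                        rows_harvested=2, row_spacing_m=0.76):
    """Convert a harvested plot weight to moisture-free yield in kg per hectare."""
    if not 0.0 <= moisture_fraction < 1.0:
        raise ValueError("moisture_fraction must be in [0, 1)")
    # Harvested area: two center rows, 76-cm row spacing, plot length in m
    area_m2 = rows_harvested * row_spacing_m * plot_length_m
    # Remove the water fraction to express seed weight on a moisture-free basis
    dry_weight_kg = plot_weight_g * (1.0 - moisture_fraction) / 1000.0
    return dry_weight_kg / area_m2 * 10000.0  # 10,000 m^2 per hectare


# Hypothetical plot: 1,400 g of seed at 13% moisture, 3.2-m plot (Illinois length)
print(round(dry_yield_kg_per_ha(1400.0, 0.13, 3.2)))
```

Plot length is left as a parameter because it differed between locations (3.2 m in Illinois and 3.0 m in Nebraska).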

Pearson product-moment correlations among all traits were calculated with PROC CORR (SAS 1988). Estimates of variance components and broad-sense heritabilities were calculated for each population from mean squares (Fehr 1987) obtained from PROC GLM of SAS. Genotypes within populations, locations, and years were analyzed as random effects.

Molecular marker and QTL analysis

The lines in the populations were evaluated with simple sequence repeat (SSR) markers that were mapped previously in soybean (Cregan et al. 1999). Genomic DNA was extracted from 10 bulked seedlings per line using a modified CTAB method (Kisha et al. 1997). Polymerase chain reactions (PCR) were carried out according to Cregan and Quigley (1997). The PCR products were analyzed by electrophoresis in 6% non-denaturing polyacrylamide gels (Sambrook et al. 1989; Wang et al. 2003) and stained with 0.1 μg ml-1 ethidium bromide. DNA from the original parents of the populations (PI 468916 and IA2008) and a bulk of DNA from each population were screened with 606 SSR markers to identify markers polymorphic between the parents and those segregating in the populations. Each bulk consisted of an equal amount of DNA from all lines in the population. The polymorphic markers were then used to evaluate each line in the populations for which they were segregating.

The observed segregation ratios of SSR markers were tested for goodness-of-fit to the expected ratio using χ2 tests with a significance level of α=0.05. Conservative estimates of the proportion of the genome that was segregating in each population were calculated by totaling the lengths of genomic regions flanked by segregating markers. The cM lengths of the segregating regions were determined by the map positions of the flanking markers on the composite maps in Cregan et al. (1999) and the SoyBase website (http://soybase.org).
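The conservative estimate of segregating genome length described above can be sketched as follows. This is a minimal illustration, assuming that each segregating region is represented by the composite-map positions (in cM) of its segregating markers. The marker positions and the total map length used in the example are hypothetical and are not taken from Table 1 or from SoyBase.

```python
def segregating_length_cm(regions):
    """Sum the cM spans of regions flanked by segregating markers.

    `regions` is a list of lists; each inner list holds the composite-map
    positions (cM) of the segregating markers in one contiguous region.
    A region with a single marker contributes 0 cM, which keeps the
    estimate conservative.
    """
    return sum(max(positions) - min(positions) for positions in regions if positions)


# Hypothetical marker positions for three regions in one population
example_regions = [[65.0, 71.0, 80.5], [113.0], [10.2, 34.7]]
total_cm = segregating_length_cm(example_regions)

assumed_map_length_cm = 2500.0  # hypothetical total map length, an assumption
print(total_cm, total_cm / assumed_map_length_cm)
```

Because only the interval between the outermost segregating markers is counted, segregating DNA that extends beyond the flanking markers is ignored, which is why the estimate is a lower bound.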

Linkage and map distances among the polymorphic markers were determined using the computer program JoinMap (Stam 1993) with the Kosambi mapping function (Kosambi 1944) and a minimum LOD score of 3.0. The marker orders from the composite maps on SoyBase were used as fixed orders in the linkage analysis. The names of linkage groups (LGs) on SoyBase were used for presenting the results in this paper.
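For reference, the Kosambi mapping function converts a recombination fraction r into a map distance of 25 ln[(1 + 2r)/(1 - 2r)] cM. The sketch below implements this conversion and its inverse. It illustrates the function only and is not a description of how JoinMap computes distances internally; the function names are hypothetical.

```python
import math


def kosambi_cm(r):
    """Map distance in cM for recombination fraction r (Kosambi function)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_r(cm):
    """Recombination fraction for a map distance in cM (inverse Kosambi)."""
    return 0.5 * math.tanh(2.0 * cm / 100.0)


# A recombination fraction of 0.10 corresponds to slightly more than 10 cM
print(round(kosambi_cm(0.10), 2))
```

At small recombination fractions the map distance is close to 100r, and the two diverge as r approaches 0.5.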

QTL controlling traits were mapped by composite interval mapping (CIM) and simple linear regression with the computer program package QTL-CARTOGRAPHER (Basten et al. 1999). Marker and trait data from each population were analyzed independently in both linkage and QTL analyses. Trait values obtained from each environment and the averages across environments were treated as independent traits in the QTL analysis. The threshold of LOD scores for declaring a putative QTL as significant was determined by 1,000 permutations using the Zmapqtl program in QTL-CARTOGRAPHER. A LOD value corresponding to an experimentwise threshold of α=0.05 was used to declare a QTL as significant. The estimate of the QTL position is the point of maximum LOD score in the region under consideration. Only the results from QTL analyses of the averages across environments are reported in the tables.
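The logic of a permutation-based experimentwise threshold can be sketched as follows. This is a simplified illustration, not the Zmapqtl implementation: it substitutes single-marker regression for CIM, codes genotypes as 0 or 1, and runs on simulated data. All function names, the genotype coding, and the simulated values are assumptions made for the example.

```python
import math
import random


def marker_lod(genotypes, trait):
    """LOD score for a single-marker linear regression of trait on genotype."""
    n = len(trait)
    mean_x = sum(genotypes) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, trait))
    syy = sum((y - mean_y) ** 2 for y in trait)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker or constant trait: no evidence
    rss_full = syy - sxy ** 2 / sxx  # residual sum of squares with the marker
    if rss_full <= 0:
        return float("inf")
    return (n / 2.0) * math.log10(syy / rss_full)


def permutation_threshold(markers, trait, n_perm=1000, alpha=0.05, seed=0):
    """Experimentwise LOD threshold from the null distribution of maximum LOD."""
    rng = random.Random(seed)
    shuffled = list(trait)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any marker-trait association
        maxima.append(max(marker_lod(m, shuffled) for m in markers))
    maxima.sort()
    index = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return maxima[index]


# Simulated example: 20 markers, 100 lines, trait unrelated to any marker
rng = random.Random(1)
markers = [[rng.choice([0, 1]) for _ in range(100)] for _ in range(20)]
trait = [rng.gauss(2500.0, 300.0) for _ in range(100)]
print(permutation_threshold(markers, trait, n_perm=200))
```

Taking the maximum LOD over all markers within each permutation is what makes the resulting threshold experimentwise rather than per-marker.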

A search for epistatic interactions between identified yield QTL and markers not linked to the QTL was performed using the conditional search function of the Epistat program (Chase et al. 1997). The Epistat program examines pairs of markers for additive by additive epistatic interactions (Chase et al. 1997). For each pair of markers examined, at least one marker had a significant additive effect on yield with a log likelihood ratio (LLR) greater than 5 and the other marker was on a different linkage group to avoid the confounding effects of linkage. The LLR threshold for identifying preliminary significant interactions was set to 6. The MNTECRLO program within the Epistat package was then used to determine the significance of the preliminary significant interactions. The MNTECRLO program estimates type I errors and significance thresholds using Monte Carlo methods. One million Monte Carlo simulations were carried out for each preliminary significant interaction. The P values from the output of the MNTECRLO program were corrected for the number of interactions analyzed to produce experimentwise P values with the equation $P_{\mathrm{corr}} = 1 - (1 - P)^{x}$, where $P_{\mathrm{corr}}$ is the corrected P value, P is the P value from the output of the MNTECRLO program, and x is the number of interactions analyzed. If the corrected P value was less than 0.05, the interaction between the two markers was declared significant. For significant interactions, Tukey tests were carried out to determine the significance of differences between the means of different interacting genotypes using the SAS GLM procedure.
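The experimentwise correction described above, in which the corrected P value equals one minus (1 - P) raised to the power x, has the same form as the Šidák adjustment for independent tests. It can be sketched as follows; the function name and the example values are hypothetical and are not results from the study.

```python
def experimentwise_p(p, n_interactions):
    """Corrected P value: 1 - (1 - P)^x for x interactions analyzed."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n_interactions < 1:
        raise ValueError("n_interactions must be at least 1")
    return 1.0 - (1.0 - p) ** n_interactions


# Hypothetical example: a Monte Carlo P value of 0.001 among 20 interactions
p_corr = experimentwise_p(0.001, 20)
print(round(p_corr, 4), p_corr < 0.05)
```

With a single interaction the corrected and uncorrected P values are identical, and the correction becomes more stringent as the number of interactions examined increases.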

Results

Molecular marker analysis

Of the 606 SSR markers tested using the parental and pooled progeny DNA, 302 were polymorphic between the two parents. One hundred and fifty-nine (52%) of these markers segregated in at least one population and 77 (26%) markers segregated in at least two populations. Assuming that no selection occurred during the development of the BC2 populations, 25% of the genome should segregate in each population while the remaining 75% should be fixed for the soybean parent. Of the markers that were polymorphic between the two parents, 24.1% segregated in population 326 and 22.7% segregated in population 330A. These values are not significantly (P=0.05) different from the expectation of 25% of the markers segregating based on Chi-square tests. The percentages of markers that segregated in populations 334A, 338B, and 324B were 13.5%, 15.9%, and 18.9%, respectively, and these values were significantly (P<0.05) lower than expected. The lower than expected percentage of markers segregating in three populations was likely caused by inadvertent selection during population development, since conscious selection was not performed. The genomic regions that were verified as segregating in populations with SSR markers are listed in Table 1. These estimates are conservative because additional chromosomal regions adjacent to these flanking regions would also segregate. In addition, other unlinked regions likely went undetected due to a lack of polymorphic markers.
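The comparison of observed and expected marker segregation can be sketched with a one-degree-of-freedom Chi-square test, as below. This is an illustration only. It assumes that each reported percentage refers to all 302 polymorphic markers, and it recovers approximate marker counts by rounding, so the statistics are not the values computed in the study. The function name is hypothetical.

```python
CHI2_CRITICAL_1DF_05 = 3.841  # Chi-square critical value, 1 df, alpha = 0.05


def chi_square_segregation(n_segregating, n_polymorphic, expected_fraction=0.25):
    """Goodness-of-fit statistic for segregating vs. fixed marker counts."""
    expected_seg = n_polymorphic * expected_fraction
    expected_fixed = n_polymorphic - expected_seg
    observed_fixed = n_polymorphic - n_segregating
    return ((n_segregating - expected_seg) ** 2 / expected_seg
            + (observed_fixed - expected_fixed) ** 2 / expected_fixed)


n_polymorphic = 302
reported_percent = {"326": 24.1, "330A": 22.7, "334A": 13.5,
                    "338B": 15.9, "324B": 18.9}

for population, percent in reported_percent.items():
    # Approximate count recovered from the reported percentage (an assumption)
    count = round(n_polymorphic * percent / 100.0)
    statistic = chi_square_segregation(count, n_polymorphic)
    print(population, count, round(statistic, 2), statistic > CHI2_CRITICAL_1DF_05)
```

Under these assumptions, the statistic stays below the critical value for populations 326 and 330A and exceeds it for populations 334A, 338B, and 324B, which agrees with the pattern reported above.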

Table 1. Genetic regions confirmed to be segregating in each BC2F4 population based on comparisons to the composite soybean linkage map on the SoyBase website

The grouping of markers was generally consistent with Cregan et al. (1999) and composite maps at the SoyBase website. There were two exceptions where markers belonged to different LGs on the composite map, but were grouped together in our analyses. Marker Satt352 from LG G on the composite maps linked close to Satt133 from LG A2 in three of our populations. In addition, marker Satt317 from LG H on the composite map was grouped with markers on LG G in our analysis. As expected, the map distances were not always consistent among populations and with the composite maps. Nevertheless, the relative distances among markers were generally consistent among maps.

Field data analysis

There was significant (P<0.01) genetic variance among lines for yield, plant height, maturity, and lodging in each population. Location was significant for all traits in all populations except for plant height and lodging in population 334A. Year had a significant effect on yield in all populations except population 334A. Year was not significant for plant height in three of the five populations but was significant for maturity and lodging in all populations. The interactions of genotype-by-year and genotype-by-location were not significant for all traits. However, the interaction of genotype-by-year-by-location was significant for all four traits.

The population means for yield ranged from 2,263 kg ha-1 for population 330A to 2,697 kg ha-1 for population 334A (Table 2, Fig. 1). In no population were lines identified that had significantly greater yield than IA2008, the recurrent parent. The heritabilities for yield varied from 0.29 in population 326 to 0.85 in population 324B.

Table 2. Means, standard errors, and heritabilities for each BC2F4 population

Fig. 1. Yield distributions of lines over all environments for each population

Analysis of QTL

Four significant (experimentwise P<0.05) yield QTL were identified with CIM on LGs C2, E, K and M (Table 3). The QTL on LG C2 was significant for population 334A in all three environments where data were collected. The QTL explained 40% of the phenotypic variance for yield in this population. The QTL on LGs E and M were identified in population 338B. The LG E QTL explained 28% of the phenotypic variance for yield and the LG M QTL explained 29% of the phenotypic variance for yield. The QTL on LG E was significant in three of the four environments and the QTL on LG M was significant in all environments. The significant QTL on LG K was identified in populations 324B and 330A and was significant in all environments in which the populations were tested. This QTL explained 38–40% of the total phenotypic variance for yield in the two populations. For all four significant yield QTL, the marker alleles from IA2008, the G. max parent, were associated with greater yield than the G. soja alleles.

Table 3. Quantitative trait loci (QTL) identified as significant across environments based on composite interval mapping analysis

Three of the four yield QTL that we mapped were located in the same region or near previously mapped yield QTL. The QTL for yield on LG C2 mapped to the same location as yield QTL that were mapped in three previous studies. Two of these populations were from crosses between 'Minsoy' and 'Noir 1' (Mansur et al. 1993a, 1993b, Mansur et al. 1996) and the third was from a cross between 'Noir 1' and 'Archer' (Orf et al. 1999a, 1999b). The QTL for yield on LG M also was identified in the two populations developed by crossing 'Minsoy' and 'Noir 1' (Mansur et al. 1993a, 1996). Yuan et al. (2002) found evidence of a yield QTL on LG K in populations developed from crossing 'Flyer' with 'Hartwig' and 'Essex' with 'Forrest'. The yield QTL on LG K in both populations of Yuan et al. (2002) mapped near Satt337, which is 22 cM from our LOD peak according to the composite map. Additional work is needed to verify whether the QTL in our study and the study of Yuan et al. (2002) map to the same location or if they are different QTL. The QTL for yield on LG E has not been identified in earlier mapping studies, suggesting the possibility of this being a new yield QTL.

Five QTL for plant height were identified on five LGs (Table 3). Four of the five QTL mapped to the same locations as the yield QTL on LGs C2, E, K and M. The QTL on LG K was identified with significant LOD scores in population 324B in three of the four test environments and in all four environments for population 330A. The QTL on LGs C2 and E were significant in three of the four test environments and the QTL on LGs O and M were significant in all four environments. The height QTL on LG C2 (Mansur et al. 1996, Orf et al. 1999b) and M (Specht et al. 2001) mapped to regions where plant height QTL were previously identified in G. max populations.

Four QTL for maturity were identified on LGs C2, L, M, and O (Table 3). The four QTL were identified in four different populations and were identified with significant LOD scores in all test environments. The maturity QTL on LGs C2 and M mapped to the same regions as the QTL for plant height and yield in these populations. In addition, the maturity QTL on LG O mapped to the same position as a height QTL. Other researchers mapped QTL for maturity to the same regions on LG C2 (Orf et al. 1999b), L (Orf et al. 1999b), and M (Mansur et al. 1993a, 1993b, 1996; Lark et al. 1994; Orf et al. 1999b) in G. max populations.

Only one QTL was identified for lodging in this study and it has not been identified in earlier studies. This QTL was identified in population 330A with significant LOD scores in all four environments. This QTL maps to the same region as a QTL for plant height and yield. Surprisingly, the region from IA2008 is associated with greater plant height but less lodging than the region from G. soja. This contradicts the normal association that is observed between greater plant height and increased lodging. One explanation for this contradiction is that in this region IA2008 carries one QTL allele that confers less lodging and an allele at a second QTL that confers greater plant height. Alternatively, there could be a single QTL with a pleiotropic effect of increasing both plant height and stem strength.

The yield and marker data were analyzed with the simple linear regression option of QTL-CARTOGRAPHER and a significance threshold of P=0.05 for each marker association to determine if any positive yield QTL alleles from G. soja could be identified at this lower threshold. These analyses led to the identification of four QTL where the G. soja alleles were significantly associated with improved yield across all environments tested (Table 4). In addition, the LG D1b QTL was significantly associated with maturity and the LG L QTL was significantly associated with lodging. According to SoyBase, the regions where these loci mapped have not been previously identified as containing significant yield QTL. However, PI 468916 was previously found to carry a QTL allele for resistance to soybean cyst nematode (SCN, Heterodera glycines Ichinohe) near Satt491 on LG E (Wang et al. 2001). Although high SCN infestations were not found in the field tests, there may have been sufficient pressure to cause the SCN resistance QTL to have a significant effect on yield. Because these yield QTL were not identified with a high significance threshold through CIM, they will need to be retested in additional studies to confirm their effect.

Table 4. Significant yield QTL with a positive effect from G. soja across environments based on simple linear regression

Significant epistatic interactions for yield across environments were found with the computer program Epistat only in population 324B. The significant interactions in this population were between Satt349 and Satt417 in the region of the yield QTL on LG K, and Satt189 on LG D1b. On the SoyBase composite map, Satt349 and Satt417 are positioned at 65 and 71 cM, respectively, on LG K and Satt189 is located at 113 cM on LG D1b. In the significant interactions between the two markers on LG K and the marker on LG D1b, the yield of genotypes that are homozygous for G. max alleles on LG K and G. soja alleles on LG D1b is higher than that of genotypes homozygous for G. max alleles on both LG K and LG D1b. However, this difference is not statistically significant at P=0.05. These two interacting regions also segregate in population 330A, but no significant interaction was detected between them in this population. There are two possible explanations for these inconsistent results. The first is that the significant interaction between the two regions identified in population 324B is actually a type I error and the second is that at least one additional region in the genome is needed for the interaction to occur.

Discussion

A major goal of this research effort was to map the locations of favorable yield QTL alleles from G. soja that could be useful in soybean improvement. We were not successful in finding these QTL using the CIM method with a LOD threshold corresponding to an experimentwise P=0.05. This lack of success stands in contrast to reports of the identification of positive alleles for yield from wild accessions in other species (Tanksley et al. 1996; Xiao et al. 1998) and soybean (Concibido et al. 2003). There are a number of explanations for our lack of success in this study. The first is simply that PI 468916, the G. soja parent of the population, lacks positive yield QTL that could improve IA2008, the G. max parent. Another possibility is that the G. soja parent did have one or more positive QTL, but they were not detected in this study because they were not segregating in any population. This is possible because only 52% of the markers that were polymorphic between the parents segregated in at least one population. In addition, positive QTL from G. soja may have segregated in one or more populations, but were not detected. This may have occurred if there were no segregating markers near the QTL or if polymorphic markers were present but the populations used in this study were of insufficient size to detect the QTL. Larger populations were not used in this study, in part, because of the difficulty in field testing the G. soja introgressed lines, which tend to be viny and prone to pod dehiscence (shattering). Many plots had to be hand-harvested prior to machine harvesting to avoid pod dehiscence affecting yield results. In addition, our goal was to identify relatively large, positive QTL from the G. soja parent, which should be identifiable in relatively small populations.

In this study, we used a series of backcross populations instead of the advanced backcross method, which has been successfully used to map positive QTL from exotic sources (Tanksley and Nelson 1996). In the advanced backcross method, populations are advanced typically to the backcross two generation before trait evaluations are initiated. This backcrossing is done by producing hundreds of F1 seed. Producing these F1 seed is not difficult in crops such as maize and tomato, which can be crossed easily, but is a problem in crops such as soybean where the production of F1 seed is tedious and time consuming. In response to the difficulty in producing the large number of F1 seed needed in the advanced backcross method, we developed a series of backcross populations that were used in our study.

In addition to reducing the number of F1 seed needed, there are other advantages to using the series backcross method compared to the advanced backcross method. One advantage is a reduction in marker data collection. After an initial marker screen of the parents and a separate bulk of lines from each population, the populations are tested with only those markers that segregate in the population. With the advanced backcross method, the entire population needs to be tested with all polymorphic markers. Another advantage of our series backcross populations is that although all of our populations are smaller than would be used in the advanced backcross method, only a small proportion of the genome segregates in each population. In addition, there should be an equal frequency for the alleles from the wild and domestic parent for segregating markers in the series backcross method. In the advanced backcross method, the marker frequency will be skewed in the population. Both the small proportion of the genome segregating in each population and the equal frequency of alleles will be advantages compared to the advanced backcross method. A disadvantage of the series backcross method is the inability to assay the entire G. soja genome for QTL, because many regions will not segregate in any population. Of the markers that were polymorphic between the parents, we found that only 52% segregated in at least one population. This is significantly lower than the expectation of 76% of the G. soja genome segregating in at least one population. Because we did not conduct QTL mapping with both our series backcross method and with the advanced backcross method, it is impossible to know whether we would have had greater success using the advanced backcross method.
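The 76% expectation cited above is consistent with a simple calculation in which a given region segregates in each population with probability 0.25, independently across the five populations. The sketch below reproduces that arithmetic. The independence assumption and the function name are ours, and this is offered as one plausible derivation rather than a statement of how the expectation was obtained.

```python
def fraction_segregating_in_any(n_populations, per_population_fraction=0.25):
    """Expected fraction of the genome segregating in at least one population.

    Assumes each region segregates independently in each population with
    the same probability (0.25 for an unselected BC2-derived population).
    """
    return 1.0 - (1.0 - per_population_fraction) ** n_populations


print(round(fraction_segregating_in_any(5), 3))  # 0.763
```

Under the same assumptions, additional populations raise the expected coverage with diminishing returns, since each new population mostly re-samples regions that already segregate elsewhere.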

Another difference between our procedures and the advanced backcross method procedures is whether selection is done during population development. Tanksley and Nelson (1996) recommend selection for good agronomic type during population development with the advanced backcross method. We avoided conscious selection because we were concerned that it would result in the segregation of even less of the G. soja genome in populations which may result in our missing positive QTL from G. soja. For example, selection for good plant type would result in missing an allele that increases yield that happens to be in coupling linkage with an allele for viny growth. In hindsight, there would have been advantages for selection against premature pod dehiscence during population development. This trait made yield testing difficult, resulting in the need for hand harvesting of many plots. We could have started with a larger initial population and selected for favorable agronomic traits during population development. This would have allowed us to evaluate the selected lines in more replications and environments using the same total resources during the field evaluation phase.

Many regions contained QTL that were significant for more than one trait (Table 3). In fact, of the four yield QTL identified, all were also significant for plant height and two were significant for maturity. There was no clear pattern in the relationship between yield and these other traits. For example, the marker alleles from IA2008 on LG C2 were significant for greater yield, shorter plants, and earlier maturity, whereas the IA2008 alleles on LG M were significant for greater yield, taller plants and later maturity. More research is needed to dissect whether these multiple trait associations are the result of pleiotropy or genetic linkage.

Of the four significant yield QTL identified through CIM (Table 3), only the QTL on LG E was not identified in previous studies. There was also a similar trend for associations with plant height, maturity and lodging with five of the ten QTL for these traits identified in other studies. The consistency of our results with previous studies suggests that although our population sizes were small, we have mapped real QTL. However, our relatively small populations likely caused an overestimation of the effects of the QTL we mapped (Beavis 1994).

Most previous QTL mapping in soybean was done in populations developed from intraspecific soybean crosses. Because our populations were interspecific, we expected to assay allelic variation from G. soja that was not present in soybean, and therefore did not expect so many of our QTL to have been identified previously. Our finding of consistency in the locations of QTL across different species is consistent with results in tomato. Chen et al. (1999) compared QTL identified for fruit weight and soluble solid content in seven mapping studies conducted by crossing tomato with four related wild species. They found as much consistency in the locations where QTL mapped when comparing two populations developed by crossing tomato with the same wild species as they did when comparing these populations to a population developed by crossing tomato to another species.

Results from our study and those reported in the literature suggest that it will be difficult to unlock positive allelic diversity from G. soja. There are no documented cases in the North American literature of soybean being improved through recent introgressions from G. soja. There are examples of the mapping of useful QTL from G. soja, but in most cases, evidence has later shown that these QTL were already present in G. max germplasm. This includes the high protein QTL identified by Diers et al. (1992) which later evidence suggested was already in high protein G. max germplasm (Sebolt et al. 2000). The yield QTL identified by Concibido et al. (2003) had a positive effect in some genetic backgrounds but not others, which suggests either that the G. soja allele is already present in some genetic backgrounds or that the allele requires additional gene(s) in epistatic interactions. There is also new evidence that one of the two SCN resistance QTL from G. soja identified by Wang et al. (2001) also exists in G. max germplasm (Yue et al. 2001). These results, together with our finding that many of the yield and agronomic QTL we identified were also mapped in G. max populations, show the difficulty in identifying new useful genetic diversity from G. soja.