Introduction

Marker-assisted backcrossing is used for transferring genes which are responsible for favorable agronomic traits from a donor line into the genome of a recipient line. Using molecular markers for selection against the genetic background of the donor can reduce the time and resources required for gene introgression. Although background selection has become a standard tool in plant breeding, the high costs of marker analysis still limit its use in practice and are the crucial factor for the experimental designs of gene introgression programs (Collard and Mackill 2008). These designs depend on the number of target genes to be transferred, the employed marker map, and the number of generations available for the gene introgression. Computer simulations are a robust tool for optimizing the design parameters of a marker-assisted backcrossing program before implementing it in practice (Prigge et al. 2008).

The design of marker-assisted backcrossing programs was studied with respect to the introgression of single dominant and recessive genes (Hospital et al. 1992; Frisch et al. 1999a, b; Frisch and Melchinger 2001a), two genes (Frisch and Melchinger 2001b), and favorable alleles at quantitative trait loci (Hospital and Charcosset 1997; Bouchez et al. 2002). More recently, marker-assisted backcrossing for developing libraries of near-isogenic lines was studied (Peleman and van der Voort 2003; Falke et al. 2009; Falke and Frisch 2011). These studies have mainly focused on optimizing the number of genotyped individuals as well as the positions and density of background selection markers with respect to the required number of marker data points. The optimizations have been carried out assuming marker systems in which each marker locus is analyzed in a separate assay (cf. Prigge et al. 2009). We refer to such systems as single-marker (SM) systems. Typical examples are the simple sequence repeat (SSR) and the restriction fragment length polymorphism (RFLP) marker systems.

Recently, high-throughput (HT) marker systems based on single nucleotide polymorphisms (SNPs) have been developed. Due to the high level of automation of systems such as DNA chips, they allow for cheap and fast analysis of hundreds of marker loci in a single analysis step (Gupta et al. 2001; Syvänen et al. 2005). HT marker systems have been developed for crops (Ragot and Lee 2007) and are becoming the marker systems of choice in commercial breeding programs of many economically important crops.

The crucial difference between HT and SM marker systems is that with SM marker systems, only those markers that were not already fixed for the recipient alleles in earlier generations are analyzed in advanced backcross generations. In contrast, with HT marker systems, the entire panel of markers used in a gene introgression program must also be analyzed for individuals of advanced backcross generations, even if 80 or 90% of these markers have already been fixed for the recipient alleles. To our knowledge, no study investigating the implications of this property for the efficiency of marker-assisted backcrossing is available. The combination of SM marker systems for the reduction of the chromosome segment attached to the target gene and HT markers for genome-wide background selection promises to further enhance selection efficiency in marker-assisted backcrossing but has not yet been investigated.

The objectives of our simulation study were to (1) compare the relative costs of genome-wide background selection with SM and HT marker systems for different cost ratios of HT:SM markers, (2) compare the efficiency of equally spaced and randomly distributed markers with respect to the recovery of the recipient genome, and (3) develop selection strategies combining SM and HT assays that are more efficient than genome-wide background selection with SM or HT assays alone.

Simulations

A genetic model with ten equally sized chromosomes of 160 cM length was used for the simulations. Its genome length of 1,600 cM is similar to that of published linkage maps of maize (cf. Schön et al. 1994). Markers for genome-wide background selection were assumed to be (a) randomly distributed in the genome or (b) equally spaced. Average marker distances (randomly distributed markers) or marker distances (equally spaced markers) between two adjacent marker loci of δGW = 2, 5, 10, 20 cM were investigated. For equally spaced markers, two markers were located at the telomeres of each chromosome. One dominant target gene to be introgressed was located on Chromosome 1. It was 81, 82.5, 85, and 90 cM distant from the telomere for linkage maps with δGW = 2, 5, 10, 20 cM, respectively. Flanking markers for selection against the donor chromosome segment attached to the target gene were located on both sides of the target gene. The distances between target gene and each flanking marker were δF = 5, 10, 20, 30, 40 cM.
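
The two marker layouts can be sketched in a few lines of Python. This is a minimal illustration and not the simulation software used in the study: the function names are hypothetical, and drawing round(160/δGW) uniformly distributed positions per chromosome is only one assumed way to realize a given average marker distance. The sketch also shows that the stated target gene positions (81, 82.5, 85, and 90 cM) coincide with the midpoint of the marker interval that begins at 80 cM.

```python
import random

CHROMOSOME_LENGTH_CM = 160.0  # chromosome length of the genetic model


def equally_spaced_map(spacing_cm, length_cm=CHROMOSOME_LENGTH_CM):
    """Marker positions (cM) on one chromosome, including both telomeres."""
    n_intervals = round(length_cm / spacing_cm)
    return [k * spacing_cm for k in range(n_intervals + 1)]


def random_map(avg_spacing_cm, length_cm=CHROMOSOME_LENGTH_CM, rng=None):
    """Uniformly distributed marker positions (cM).

    The marker count round(length / average spacing) is an assumption.
    """
    rng = rng or random.Random()
    n_markers = round(length_cm / avg_spacing_cm)
    return sorted(rng.uniform(0.0, length_cm) for _ in range(n_markers))


def target_position(spacing_cm, interval_start_cm=80.0):
    """Midpoint of the marker interval that begins at 80 cM."""
    return interval_start_cm + spacing_cm / 2.0


if __name__ == "__main__":
    for spacing in (2, 5, 10, 20):
        markers = equally_spaced_map(spacing)
        print(spacing, len(markers), target_position(spacing))
```

Placing the target gene at an interval midpoint keeps it equally distant from the two nearest background markers for every investigated map density.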

The investigated breeding scheme started with the cross of two homozygous parents (donor and recipient), which were polymorphic at all loci. The recipient carried the desirable alleles at all loci of the genome except for the target locus, while the donor carried the desirable allele at the target locus. The donor and recipient were crossed to create an F1 individual, which was backcrossed to the recipient. From the BC1 population of size \(n_{1}\), one individual was selected with two- or three-stage selection, as described below, and backcrossed to the recipient. This procedure was repeated for t backcross generations.

Two-stage selection consisted of pre-selection of carriers of the target gene in the first selection step. The pre-selected individuals were subjected to genome-wide background selection in the second step. A selection index \(i = \sum\nolimits_{m}x_{m}\) was constructed, where the summation is over markers and \(x_{m} = 1\) if marker m is homozygous for the recipient allele and \(x_{m} = 0\) otherwise. A plant with the highest value of i was selected and backcrossed to the recipient. Two-stage selection was carried out with SM and HT assays. For SM assays, only those markers that were not yet fixed for the recipient allele in the non-recurrent parent were analyzed in advanced backcross generations.
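
The second step of two-stage selection can be sketched as follows. The genotype encoding ("RR" for homozygous recipient, "RD" for heterozygous), the function names, and the tie-breaking rule are assumptions made for illustration; the plant data are invented.

```python
def selection_index(marker_genotypes):
    """Index i: number of markers homozygous for the recipient allele."""
    return sum(1 for genotype in marker_genotypes if genotype == "RR")


def select_for_backcross(candidates):
    """Return the identifier of a plant with the highest index i.

    `candidates` maps a plant identifier to its list of marker genotypes.
    Ties go to the first plant encountered (an assumed rule).
    """
    return max(candidates, key=lambda plant: selection_index(candidates[plant]))


# Invented example: three target gene carriers scored at four markers.
plants = {
    "plant_1": ["RR", "RD", "RR", "RD"],
    "plant_2": ["RR", "RR", "RR", "RD"],
    "plant_3": ["RD", "RD", "RR", "RD"],
}
best = select_for_backcross(plants)
print(best, selection_index(plants[best]))
```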

Three-stage selection combined selection for recombinants between the target gene and its two flanking markers, genotyped with SM assays, and genome-wide background selection with HT assays. It consisted of (1) selection for the target gene followed by (2) pre-selection with flanking markers and (3) genome-wide selection with background markers. For selection step (2), a selection index f was created, which took the values 0, 1, or 2, depending on whether recombination occurred between the target gene and none, one, or both flanking markers, respectively. On the basis of f, pre-selection of individuals was carried out according to one of two decision rules. Either (a) individuals with f ≥ 1 were selected, or (b) all individuals having the maximum observed score of f (f = max) were selected.
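
The flanking marker score and the two decision rules can be written down compactly. In this sketch, a flanking marker that is homozygous for the recipient allele ("RR") in a carrier of the target gene is taken as evidence of recombination between the target gene and that marker; this encoding and all names are assumptions for illustration.

```python
def flanking_score(left_marker, right_marker):
    """Score f: number of flanking markers showing recombination (0, 1, or 2)."""
    return int(left_marker == "RR") + int(right_marker == "RR")


def preselect(carriers, rule="f>=1"):
    """Pre-select carriers of the target gene by their flanking score.

    `carriers` maps a plant identifier to its (left, right) flanking genotypes.
    rule "f>=1": keep plants with at least one recombination.
    rule "f=max": keep all plants that reach the highest observed score.
    """
    scores = {plant: flanking_score(*geno) for plant, geno in carriers.items()}
    if not scores:
        return []
    if rule == "f>=1":
        return [plant for plant, f in scores.items() if f >= 1]
    if rule == "f=max":
        best = max(scores.values())
        return [plant for plant, f in scores.items() if f == best]
    raise ValueError(f"unknown rule: {rule}")


# Invented example with three carriers of the target gene.
carriers = {"a": ("RD", "RD"), "b": ("RR", "RD"), "c": ("RR", "RR")}
print(preselect(carriers, "f>=1"))
print(preselect(carriers, "f=max"))
```

Note that rule (b) always returns at least one plant when carriers exist, whereas rule (a) can return none if no recombinant is found.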

Four series of simulations were carried out with the software Plabsoft (Maurer et al. 2008), assuming no interference in crossover formation. Each simulation was replicated 10,000 times in order to reduce sampling effects and to obtain results with high numerical accuracy and a small standard error. The 10% quantile (Q10) of the distribution of the recipient genome proportion (in percent) was determined in the last backcross generation to measure the success of a marker-assisted backcrossing program with respect to restoring the genome of the recipient. The numbers of SM and HT assays were determined as a measure of the costs of a marker-assisted backcrossing program.
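
The Q10 criterion can be read as follows: in about 90% of the simulated programs, the selected plant carried at least this percentage of recipient genome. A minimal sketch of such an empirical quantile is given below. The quantile definition used by the simulation software is not stated here, so the nearest-rank rule, the function name, and the toy data are assumptions.

```python
def q10(recipient_genome_percent):
    """Empirical 10% quantile by the nearest-rank rule (an assumed definition)."""
    ordered = sorted(recipient_genome_percent)
    if not ordered:
        raise ValueError("at least one replicate is required")
    rank = (len(ordered) + 9) // 10  # ceil(N / 10) in integer arithmetic
    return ordered[rank - 1]


# Toy data: 20 invented replicates, not simulation results.
replicates = [90.0 + 0.5 * k for k in range(20)]
print(q10(replicates))
```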

In the first series of simulations, we determined the population size \(n_{t}\) (constant across all backcross generations \(\mathrm{BC}_{t}\), \(t = 1, \ldots, 3\)) and the number of marker assays that were required to reach Q10 values of 93, 94, 95, 96, 97, and 98%. For 93–96%, we investigated two-generation backcrossing programs, and for 96–98% three-generation backcrossing programs. Two-stage selection with either SM or HT assays or a combination of both systems (HT in backcross generation BC1 and SM in the following backcross generations) was carried out for linkage maps with δGW = 5, 10, 20 cM.

In the second series of simulations, two-stage selection with HT assays was carried out. Background selection markers were either equally spaced or randomly distributed with δGW = 2, 5, 10, 20 cM. We considered three backcross generations and constant values of \(n_{t}\) ranging from 40 to 200 individuals.

In the third series of simulations, three-stage selection was carried out either in backcross generation BC1 or BC3. In the remaining two generations, two-stage selection with HT assays was carried out. The flanking markers for three-stage selection had distances of δF = 5, 10, 20, 30, 40 cM from the target gene, and individuals with f ≥ 1 were selected for genome-wide analysis with HT assays. Distances between genome-wide background selection markers were δGW = 5 cM. In the generations with two-stage selection, we investigated population sizes from \(n_{t} = 40\) to 200. In the generation with three-stage selection, these population sizes were multiplied by a factor m = 1, 2, 5.

In the fourth series of simulations, three-stage selection was carried out in backcross generations BC1 and BC2. Marker distances of δGW = 5 cM and δF = 20 cM were employed. Individuals with f ≥ 1 were pre-selected for genome-wide analysis in backcross generation BC1, while only individuals having the highest observed number of recombinations between target gene and flanking markers (f = max) were pre-selected in backcross generation BC2. In backcross generation BC3, two-stage selection was carried out with HT assays. We investigated population sizes from \(n_{t} = 40\) to 200 for generations BC2 and BC3. In backcross generation BC1, these population sizes were multiplied by the factor m = 1, 2, 5.

For comparing the costs of marker-assisted backcrossing programs with different selection strategies, linkage maps, and population sizes, the numbers of SM and HT assays required for the entire backcrossing program were assessed. For SM analyses, only those markers not yet fixed for the recipient allele in the non-recurrent parent of a backcross population were considered. For HT analyses, the number of assays was the same as the number of individuals subjected to genome-wide background selection. Calculation of costs was based on five cost ratios of one HT assay (corresponding to all HT marker loci on the linkage map) compared to one SM assay (corresponding to one SM locus). Cost ratios of HT:SM of 200:1, 100:1, 50:1, 20:1, and 10:1 were investigated. For example, a cost ratio HT:SM of 100:1 corresponds to a price of 200€ for analyzing all SNP background marker loci with a DNA chip, and 2€ for analyzing one SSR marker locus. We compared (a) the costs of two-stage selection with HT assays to those of two-stage selection with SM assays, (b) the costs of two-stage selection with HT assays in generation BC1 and SM assays in BC2 and BC3 to those of two-stage selection with HT assays in all backcross generations, and (c) the costs of three-stage selection in BC1 to those of two-stage selection with HT assays in all generations. For (a), the costs of SM assays were set to 1 and the relative costs of HT assays were determined; for (b), the costs of using HT assays in all backcross generations were set to 1 and the relative costs of the strategy combining HT and SM were determined; and for (c), the costs of two-stage selection were set to 1 and the relative costs of three-stage selection were determined.
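
The cost comparison reduces to simple arithmetic once the assay numbers are known. The sketch below expresses all costs in units of one SM assay; the function names and the assay counts are hypothetical and serve only to illustrate the calculation.

```python
COST_RATIOS = (200, 100, 50, 20, 10)  # investigated HT:SM cost ratios


def program_cost(n_ht_assays, n_sm_assays, ht_to_sm_ratio):
    """Total marker cost of a program in units of one SM assay."""
    return n_ht_assays * ht_to_sm_ratio + n_sm_assays


def relative_cost(strategy, reference, ht_to_sm_ratio):
    """Cost of `strategy` relative to `reference`; each is (n_HT, n_SM)."""
    return (program_cost(*strategy, ht_to_sm_ratio)
            / program_cost(*reference, ht_to_sm_ratio))


# Hypothetical assay counts, chosen only to illustrate the arithmetic.
ht_only = (100, 0)    # 100 HT assays, no SM assays
sm_only = (0, 5000)   # 5,000 SM assays, no HT assays
for ratio in COST_RATIOS:
    print(f"{ratio}:1 -> {relative_cost(ht_only, sm_only, ratio):.2f}")
```

A value above 1 means that the strategy is more expensive than the reference, and a value below 1 means that it is cheaper.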

Results

For two-stage selection, HT assays were considerably more expensive (by a factor of up to 4.77) than SM assays for scenarios with high relative costs of HT markers (200:1, 100:1, and 50:1) in combination with large marker distances and/or large attempted Q10 values (Table 1). For scenarios with small marker distances and/or low relative cost ratios of HT:SM assays and low attempted Q10 values, HT assays were cheaper. To reach a Q10 value of 96% in two generations, the number of required marker assays was 9–14 times greater than that required to reach the same Q10 value in three generations. The increase in the required number of marker assays, which accompanied the shortening of a backcrossing program from three to two generations, was greater for SM than for HT marker systems.

Table 1 Relative costs of a gene introgression program using HT assays in generations BC1 to BC3 (HT[BC1–3]) compared to using SM assays in BC1 to BC3 (SM[BC1–3]) depending on the cost ratio of HT:SM assays

For high cost ratios of HT:SM markers (200:1, 100:1, and 50:1) and large marker distances, combining HT assays in generation BC1 with SM assays in generations BC2 and BC3 for genome-wide background selection was cheaper (by up to 60%) than using HT assays alone (Table 2). This cost reduction was more pronounced for three-generation than for two-generation backcross programs.

Table 2 Relative costs of a gene introgression program using HT assays in backcross generation BC1 and SM assays in backcross generations BC2 and BC3 (HT[BC1], SM[BC2,3]) compared to using HT assays in all backcross generations (HT[BC1–3], data presented in Table 1) depending on the cost ratio of HT:SM assays

To reach a given Q10 value with randomly distributed background selection markers, linkage maps with two to four times more markers were required than with equally spaced markers at marker distances of δGW = 20 or 10 cM (Table 3). With equally spaced markers and δGW = 5 cM, approximately the same Q10 values were reached as with randomly distributed markers and δGW = 2 cM. A decrease in the distance between equally spaced markers from δGW = 10 to 5 cM resulted in only marginally greater Q10 values in generation BC3. No difference in the Q10 values was observed between δGW = 5 and 2 cM.

Table 3 Q10 values recovered in generation BC3 for constant population sizes \(n_{t}\) in generations BC1 to BC3 and equally spaced or randomly distributed markers (δGW = 2, 5, 10, 20 cM) applying two-stage selection with HT assays

With three-stage selection combining SM and HT assays in generation BC1, the flanking marker distance δF had only a marginal influence on the recovered genome-wide Q10 values (Table 4). For population sizes \(n_{2} = n_{3} < 100\) in generations BC2 and BC3, a substantial increase in the Q10 values was observed if larger populations (\(n_{1} > n_{2} = n_{3}\)) were employed in generation BC1. Doubling the population size in generation BC1 (\(n_{1} = m n_{2} = m n_{3}\), m = 2) had approximately the same effect on the Q10 values as increasing a constant population size by about 20 individuals (\(n_{1}' = n_{2}' = n_{3}' = n_{2} + 20\)). The combination of doubled population sizes in generation BC1 and small flanking marker distances δF resulted in fewer required HT assays at the expense of more required SM assays to reach a certain Q10 value, compared to backcrossing programs with constant population sizes across generations.

Table 4 Q10 values recovered in generation BC3 and number of required SM/HT assays for increased population sizes \(n_{1} = m n_{t}\) (m = 1, 2, 5; t = 2, 3) in generation BC1 and equally spaced markers (δGW = 5 cM) applying three-stage selection (δF = 5, 10, 20, 30, 40 cM; f ≥ 1) in generation BC1 and two-stage selection in generations BC2 and BC3

Three-stage selection in generation BC3 recovered Q10 values similar to those of three-stage selection in generation BC1 for all combinations of \(n_{t}\) and m. However, more HT assays were required (data not shown).

Three-stage selection in generations BC1 and BC2 required more SM assays but fewer HT assays than three-stage selection only in generation BC1 for all combinations of \(n_{t}\) and m (Table 5). For population sizes smaller than 100, slightly lower Q10 values were recovered.

Table 5 Q10 values recovered in generation BC3 and number of required SM/HT assays for increased population sizes \(n_{1} = m n_{t}\) (m = 1, 2, 5; t = 2, 3) in generation BC1 and equally spaced markers (δGW = 5 cM) applying three-stage selection (δF = 5, 10, 20, 30, 40 cM) in generations BC1 (f ≥ 1) and BC2 (f = max) and two-stage selection in generation BC3

Three-stage selection combining SM and HT assays in generation BC1 of a three-generation backcrossing program was cheaper than two-stage selection with HT assays for all investigated combinations of \(n_{t}\) with m = 1 and m = 2 (Fig. 1). The costs ranged between 75.3 and 83.0% (m = 1) and between 57.1 and 89.7% (m = 2) of the costs of two-stage selection. For m = 5, three-stage selection was cheaper only for cost ratios of HT:SM from 200:1 to 50:1. Three-stage selection with doubled population size (m = 2) in generation BC1 was the optimal selection strategy for reaching Q10 values of 98 and 99%. The only exception was the combination of a cost ratio of HT:SM assays of 10:1 and a desired Q10 value of 99%. In this case, constant population size over generations (m = 1) was optimal.

Fig. 1 Relative costs of three-stage selection with m = 1, 2, 5 in generation BC1 and two-stage selection in generations BC2 and BC3 compared to two-stage selection in generations BC1 to BC3 for cost ratios for HT:SM assays of 200:1, 100:1, 50:1, 20:1, and 10:1

Discussion

HT marker systems

HT marker systems are expected to increase the cost-efficiency of marker-assisted backcrossing programs (Ragot and Lee 2007; Collard and Mackill 2008). However, previous studies on the efficiency of gene introgression programs have rarely taken differences between marker systems into account (Ribaut et al. 2002). In this study, we investigated the different properties of SM and HT marker systems and their effect on the efficiency of gene introgression. The simultaneous analysis of a large number of marker loci at comparatively low cost per individual marker locus is made feasible in HT assays (Syvänen et al. 2005). They, therefore, promise to be a powerful tool for marker-assisted background selection, especially when the expected number of required marker analyses is high. However, HT assays do not provide the possibility to selectively analyze individual markers. In contrast to SM assays, all markers on the linkage map need to be analyzed for every backcross individual, even if a large proportion of markers has already been fixed for the recipient alleles, as is the case in advanced backcross generations.

Comparing two-generation with three-generation gene introgression programs showed that SM marker systems require relatively fewer assays in three-generation programs than HT marker systems. For example, in a two-generation gene introgression program with distances of genome-wide background selection markers of δGW = 20 cM, either 44 HT or 2,643 SM assays resulted in a Q10 value of 93%, whereas in a three-generation program, 45 HT or 1,975 SM assays resulted in a Q10 value of 97% (Table 1). This effect is expected to be even more pronounced for background selection in higher backcross generations, and when background selection is carried out in selfing generations or during doubled haploid production. In line with this, using HT assays for genome-wide background selection in the first backcross generation and SM assays in advanced backcross generations reduced the costs of marker analysis compared to using HT assays in all backcross generations (Table 2). Only 5–9% of all marker analyses in a three-generation backcross program were carried out in backcross generation BC3. The cost reduction compared to using HT assays in all backcross generations was consequently greater for three-generation than for two-generation programs. We conclude that HT assays are particularly suited for short gene introgression programs, while SM assays are efficient for marker-assisted background selection when large percentages of the markers have already been fixed for the recipient alleles in advanced generations.
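
The assay counts quoted in this example imply a break-even cost ratio, which is worked out below as a rough check. The symbols \(N_{\mathrm{HT}}\), \(N_{\mathrm{SM}}\), and \(r\) are introduced here for illustration only, and the calculation considers nothing but the assay numbers given in the example.

```latex
% r: cost of one HT assay in units of one SM assay (notation assumed here).
% Selection with HT assays is cheaper than selection with SM assays if
\[
  r \, N_{\mathrm{HT}} < N_{\mathrm{SM}}
  \quad\Longleftrightarrow\quad
  r < \frac{N_{\mathrm{SM}}}{N_{\mathrm{HT}}} .
\]
% Two-generation program (Q10 = 93%):
\[
  \frac{N_{\mathrm{SM}}}{N_{\mathrm{HT}}} = \frac{2643}{44} \approx 60.1
\]
% Three-generation program (Q10 = 97%):
\[
  \frac{N_{\mathrm{SM}}}{N_{\mathrm{HT}}} = \frac{1975}{45} \approx 43.9
\]
```

Under these assumptions, HT assays would be the cheaper option at the investigated ratios of 50:1 and below in the two-generation example, but only at 20:1 and below in the three-generation example.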

Marker distance and distribution for genome-wide background selection

HT systems based on SNP markers are often analyzed with techniques employing marker numbers that are multiples of 96. We did not limit our investigations to these marker numbers for two reasons. First, usually not all markers of such a set are polymorphic for a certain cross. Second, reduced representation sequencing approaches have recently emerged and a trend towards genotyping by sequencing can be observed. For these systems, fixed marker numbers are less relevant. Therefore, we focused in our study on marker distances δGW, but not on the fixed marker numbers employed by a certain marker technology. The results discussed below can be regarded as thresholds, which, if they are surpassed for two parental lines and a certain HT marker system, result in the presented Q10 values.

SNPs occur in abundance in plant genomes. Dense linkage maps with marker distances below 5 cM can consequently be established at reasonable costs. However, the effect of such dense markers on the recipient genome recovery has not yet been investigated. Decreasing the marker distances δGW below 10 cM had only a marginal effect on the recipient genome recovery (Table 1). An explanation for this result is that, in expectation, one crossover per meiosis and chromatid occurs on a chromosome segment of length 1 M. In two- or three-generation backcrossing programs, the number of recombination events resulting in chromosome segments of different parental origin is therefore limited. To detect these chromosome segments and to efficiently identify the backcross individuals with the smallest percentage of donor genome, a marker distance of δGW = 10 cM is sufficient. Smaller marker distances are not required, because the factor limiting selection response is not the precise estimation of the donor genome percentage, but the limited number of crossovers.

The difference in the Q10 values between equally spaced and randomly distributed markers was considerable for all marker distances δGW except 2 cM. Less than half the markers were required to reach a certain Q10 value with equally spaced markers compared with randomly distributed markers (Table 3). This difference can be explained by the fact that, with random marker distribution, occasionally the distance between adjacent markers can get quite large, resulting in random gaps in the marker coverage. The recipient genome content of the chromosome regions in these gaps is not assessed and, therefore, the correlation of the marker estimate of the recurrent parent genome contribution and the true recurrent parent genome contribution is lower than for equally spaced markers. This results in a smaller response to marker-assisted background selection for randomly distributed compared to equally spaced markers.
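
The gap argument can be illustrated numerically. The sketch below compares the largest marker-free stretch on one 160 cM chromosome for an equally spaced map with 10 cM spacing and for replicated random maps; using 16 uniformly distributed markers for the random map, the replicate number, and the function names are assumptions for illustration only.

```python
import random


def largest_gap(markers_cm, length_cm=160.0):
    """Largest stretch (cM) without a marker, counting both chromosome ends."""
    points = [0.0] + sorted(markers_cm) + [length_cm]
    return max(b - a for a, b in zip(points, points[1:]))


def mean_largest_gap_random(n_markers, length_cm=160.0, replicates=1000, seed=1):
    """Average largest gap over replicated random maps of one chromosome."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        markers = [rng.uniform(0.0, length_cm) for _ in range(n_markers)]
        total += largest_gap(markers, length_cm)
    return total / replicates


equally_spaced = [10.0 * k for k in range(17)]  # 0, 10, ..., 160 cM
print(largest_gap(equally_spaced))
print(mean_largest_gap_random(n_markers=16))
```

With equal spacing, no stretch longer than the marker distance remains unobserved, whereas random placement leaves considerably longer stretches on average.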

We conclude that the possibility to generate linkage maps with equidistant marker distribution is a major advantage of HT marker systems, while the possibility to establish linkage maps with marker distances below 10 cM is only of secondary importance for gene introgression programs.

Pre-selection with flanking markers

In three-stage selection, the pre-selection of backcross plants showing recombination between the target gene and flanking markers allows an efficient control of the donor chromosome segment attached to the target gene. This reduces the probability of introducing negative alleles linked to the target gene into the genome of the recipient. Further, three-stage selection reduces the number of backcross plants subjected to genome-wide background selection and, therefore, reduces the number of required marker assays (Frisch et al. 1999a). To take advantage of these favorable properties of three-stage selection, a pre-selection for recombination between the target gene and flanking markers analyzed with SM assays can be combined with genome-wide background selection on the basis of HT assays. The design decisions required to implement such a selection strategy are discussed in the following.

Distances of flanking markers

Tightly linked flanking markers result in short donor chromosome segments attached to the target gene. However, they also result in a greater reduction of the number of individuals subjected to genome-wide background selection than loosely linked flanking markers. This reduced selection intensity can result in a decline of the genome-wide recovery of the recurrent parent genome. Therefore, the smallest δF that has no negative effect on the genome-wide response to selection can be regarded as an optimal flanking marker distance.

In backcrossing programs with constant (m = 1) population sizes ≤60, marker distances δF = 20 cM between each flanking marker and the target gene resulted in high overall Q10 values while minimizing the number of HT assays required for background selection (Table 4). For larger populations, δF = 10 cM was optimal. With δF = 5 cM, controlling the donor genome segment attached to the target gene resulted in a decrease of the overall Q10 values. For such tightly linked flanking markers, only a few recombinations occur in a backcross population (see Frisch et al. 1999a, b for theoretical results) and, hence, only a few plants are pre-selected and subjected to genome-wide background selection. This small number of individuals available for genome-wide background selection results in a smaller response to selection compared with less tightly linked flanking markers. We conclude that for gene introgression programs with constant population sizes, an optimum exploitation of the advantages of three-stage selection is reached with flanking marker distances of δF = 10–20 cM, and that with smaller flanking marker distances, controlling the donor segment attached to the target gene is only possible at the cost of a lower overall Q10 value.
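
How few plants pass pre-selection with tight flanking markers can be estimated with Haldane's mapping function, which corresponds to the assumption of no interference made in the simulations. The sketch below applies to generation BC1 only, where the donor chromosome inherited from the F1 is still intact, and it treats the two flanking intervals as independent; the function names are hypothetical.

```python
import math


def haldane_recombination(distance_cm):
    """Recombination frequency for a map distance (cM) without interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def expected_preselected(population_size, flank_distance_cm):
    """Expected number of BC1 carriers with f >= 1 (rough sketch)."""
    r = haldane_recombination(flank_distance_cm)
    p_at_least_one = 1.0 - (1.0 - r) ** 2  # two independent flanking intervals
    carriers = population_size / 2.0        # half of BC1 carries the target gene
    return carriers * p_at_least_one


for delta_f in (5, 10, 20, 30, 40):
    print(delta_f, round(expected_preselected(100, delta_f), 1))
```

For a BC1 population of 100 plants, this estimate gives fewer than five pre-selected plants with δF = 5 cM, which illustrates why the response to genome-wide background selection declines for tightly linked flanking markers.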

Generation of three-stage selection

Carrying out pre-selection for recombinants at markers flanking the target gene in only some, but not all generations of a gene introgression program can considerably reduce the logistic effort required for the marker analysis. A comparison of three-stage selection in generations BC1 and BC3 showed similar genome-wide Q10 values, but three-stage selection in generation BC3 required more HT marker analyses (results not shown). Therefore, carrying out three-stage selection in generation BC1 can be regarded as superior to three-stage selection in generation BC3.

Three-stage selection in generations BC1 and BC2 required fewer HT assays but more SM assays than three-stage selection in generation BC1 (Tables 4, 5). For population sizes below 100 individuals, this was accompanied by smaller genome-wide Q10 values. For population sizes greater than 100, employing three-stage selection in generations BC1 and BC2 provides a means to reduce the number of required genome-wide HT assays at the expense of a greater number of required SM analyses. Depending on the actual costs of SM and HT analysis and the work flow in the lab, this strategy can be used to shift the number of required marker analyses from HT to SM assays.

Large population sizes in the first backcross generation

As pre-selection with SM assays reduces the number of required HT assays, it provides a means to handle larger populations without necessarily increasing the cost of marker analysis. Increasing the population size in the generation where pre-selection with flanking markers is carried out increases the chance to find an individual with a small donor chromosome segment attached to the target gene, which has in addition a high proportion of recurrent parent genome (Frisch et al. 1999b). This theoretical consideration can serve as a rationale for using large population sizes in generations with three-stage selection.

We investigated backcrossing programs with three-stage selection in BC1 populations that had m = 1, 2, or 5 times the size of the BC2 and BC3 populations in which two-stage selection was employed (Table 4). The Q10 values reached with m = 1 were comparable to those reached with two-stage selection for constant population sizes across generations (Table 3). Doubling the population size for three-stage selection in generation BC1 (m = 2, \(n_{1} = m n_{2} = m n_{3}\)) resulted in Q10 values that were comparable to those reached with constant population sizes but using 20 more individuals per generation (\(n_{1}' = n_{2}' = n_{3}' = n_{2} + 20\)). Using m = 2 required more SM but fewer HT assays than m = 1. A similar effect was observed for m = 5 and \(n_{1}' = n_{2}' = n_{3}' = n_{2} + 40\). However, here the increase in the number of required SM assays was considerable, while the reduction in the number of required HT assays was only small.

In conclusion, three-stage selection can be employed to put a stronger emphasis on the reduction of the donor segment attached to the target gene, and using population sizes in generation BC1 that are twice as large (m = 2) as those in BC2 and BC3 makes it possible to shift the effort in the lab from HT to SM assays compared to constant population size in all backcross generations (m = 1). These effects can be exploited without a reduction in the overall Q10 values. However, neither genetic advantages nor a reduction in the required marker assays supported employing five times larger populations in generation BC1 (m = 5) than in generations BC2 and BC3.

Relative costs of three-stage selection

To compare the costs of three-stage selection in generation BC1 with those of two-stage selection, we assumed cost ratios of 200:1 to 10:1 for the costs of one HT assay (comprising all marker loci on the linkage map) in relation to one SM assay (for one SM locus). First, the number of marker assays required to reach a given Q10 value with three-stage selection was determined from the simulations presented in Table 4, and the number of marker assays required to reach this Q10 value with two-stage selection was determined from the simulations presented in Table 3. Then the costs required with three-stage selection were determined with the above cost ratios and were set in relation to the costs that were required with two-stage selection (Fig. 1). For example, with a cost ratio of 200:1 for HT:SM assays (first diagram in Fig. 1), reaching the Q10 value of 99% with three-stage selection and m = 5 required 0.85 times the costs that were required to reach the Q10 value of 99% with two-stage selection. Three-stage selection with m = 1 required 0.77 times, and three-stage selection with m = 2 required 0.74 times, the costs of two-stage selection.

From the cost comparisons, we conclude that three-stage selection reaches a given Q10 value at lower cost than two-stage selection, regardless of the cost ratio of HT:SM assays. If the desired Q10 values are 99% or less, then doubling the population size in generation BC1 provides a means to further reduce the costs required for the marker analyses.