Introduction

Many studies have shown that individual quantitative trait loci (QTL) can be detected and mapped with the aid of genetic markers. With multiple linked genetic markers, the approximate map location of QTL for a given set of experimental data can be determined by single marker or interval mapping. Both analytical (e.g., Lander and Botstein 1989) and empirical methods (Visscher et al. 1996) have been proposed to obtain confidence intervals (CI) for estimated QTL location, based on a given set of experimental results. While it is possible to derive empirical formulas for various specific situations by extensive simulation (e.g., Darvasi and Soller 1997; Darvasi et al. 1993; Ronin et al. 2003), it is clearly preferable to have a generally applicable simple formula for the CI of QTL map locations that enables the mapping potential of complex designs to be evaluated without the need for simulations. This is becoming particularly important with the availability of physical maps and complete genome sequences, since the width of the CI of a given QTL map location will determine the potential population of candidate genes for the QTL.

With many tightly linked markers, the limiting factors in locating a QTL are the number of recombination events in the sample and the magnitude of the QTL effect relative to the residual standard deviation (VanRaden and Weller 1994). This enables the map resolution attainable in a given experiment to be estimated by simply estimating the expected number of recombinants in a given interval. The objective of this study is to derive analytical formulas to predict the CI of any QTL map location within a saturated genetic map as a function of sample size, QTL effect, and experimental design.

Theory

A saturated genetic map consisting of many evenly spaced completely informative markers is assumed. We will further assume that the number of recombination events in a finite sample of individuals is a continuous variable, even though it is in fact discrete. The consequences of these assumptions will be considered in the Discussion. QTL location can then be estimated by single marker mapping, involving a t-test at each marker, and the most likely QTL location will be at the marker with the greatest estimated effect. With single markers, Simpson (1989) proposed that linkage of a segregating QTL to a marker could be detected by a likelihood ratio test, with the null hypothesis that the recombination frequency between the QTL and genetic marker is 0.5. Simpson (1992) showed for the backcross (BC) design that the statistical power for this test is equal to that obtained by a t-test with the null hypothesis of equal means for the two marker genotypes. It follows that for single markers, the (1−α) CI for QTL location, CI(1-α), can be determined from the CI(1-α) for the QTL effect. Consequently, with single marker mapping, given that the marker with the greatest estimated QTL effect is M1, the CI for QTL location will include marker M2 if the CI for the QTL effect at marker M1 also includes the effect estimated at marker M2. Thus, the CI for QTL location can be derived from the CI for the difference of the expected QTL effect for a marker at the QTL and the expected effect for a marker at some other chromosomal location. Clearly, this difference will be due solely to those individuals that are recombinant in the interval between the two markers. Therefore, given that the marker with the greatest estimated QTL effect is M1, the CI for QTL location will include marker M2 if the CI for QTL effect at marker M1, considering recombinant individuals only, also includes the effect estimated at marker M2.

Assuming a normal distribution of the estimated marker-associated effects and considering recombinant individuals only, the probability that the CI for the QTL effect at marker M1 also includes the effect estimated at marker M2 is equal to the probability, α/2, of obtaining the value:

$$Z_{{\alpha /2}} = D/{\text{SE}}(D)$$
(1)

where Zα/2 is the value of the standard normal variable corresponding to a probability of α/2. The “contrast” is D=E(M1)−E(M2), where E(M1) is the expected QTL effect evaluated at M1 and E(M2) is the expected QTL effect evaluated at M2; SE(D) is the standard error of D.

D and SE(D) are functions of the experimental design. Their derivation is now exemplified for a BC design initiated by a cross between two parental lines. The two QTL alleles are denoted Q and q, the QTL is assumed to be located at marker M1, and the parental genotypes are denoted M1QM2/M1QM2 and m1qm2/m1qm2. Relative to the genetic markers, there are two recombinant genotypes in the BC1 generation: M1m2/m1m2 and m1M2/m1m2, with expected mean values denoted \(\underline{{\text{M}}} _{1} \underline{{\text{m}}} _{2} \) and \(\underline{{\text{m}}} _{1} \underline{{\text{M}}} _{2} \). Then \(E({\text{M}}_{1} ) = \underline{{\text{M}}} _{1} \underline{{\text{m}}} _{2} - \underline{{\text{m}}} _{1} \underline{{\text{M}}} _{2} \) and \(E({\text{M}}_{2} ) = \underline{{\text{m}}} _{1} \underline{{\text{M}}} _{2} - \underline{{\text{M}}} _{1} \underline{{\text{m}}} _{2} ,\) giving:

$$D = E({\text{M}}_{1} ) - E({\text{M}}_{2} ) = 2\underline{{\text{M}}} _{1} \underline{{\text{m}}} _{2} - 2\underline{{\text{m}}} _{1} \underline{{\text{M}}} _{2} = 2(\underline{{\text{M}}} _{1} \underline{{\text{m}}} _{2} - \underline{{\text{m}}} _{1} \underline{{\text{M}}} _{2} )$$
(2)

Letting the phenotypic variance within the marker genotypes equal 1.0, standardized effects at the QTL are: QQ=+d, Qq=h, and qq=−d. Defining E(M1)=δ=d+h, and R=the number of individuals carrying a recombinant chromosome in each marker genotype group, we have:

$$D = 2(d + h) = 2\delta $$
(3)
$${\text{Var}}(\underline{{\text{M}}} _{1} \underline{{\text{m}}} _{2} ) = {\text{Var}}(\underline{{\text{m}}} _{1} \underline{{\text{M}}} _{2} ) = 1/R$$
(4)

To derive SE(D), recall that when X and Y are independent, Var[b(X−Y)]=b²(Var X+Var Y). Applying this to (2) yields

$${\text{SE}}^{2} (D) = 4[{\text{Var}}(\underline{{\text{M}}} _{1} \underline{{\text{m}}} _{2} ) + {\text{Var}}(\underline{{\text{m}}} _{1} \underline{{\text{M}}} _{2} )] = 8/R$$
(5)

Substituting (3) and (5) in (1), gives

$$Z_{{\alpha /2}} = 2\delta /{\left( {8/R} \right)}^{{0.5}} $$
(6)

Defining k as the proportion of the mapping population included in each marker genotype group, r as the proportion of recombination between M1 and M2, and N as the population size, we have: R=rkN. For the BC design, k=0.5. Substituting rkN for R in (6) gives:

$$Z_{{\alpha /2}} = 2\delta /(8/rkN)^{{0.5}} = \delta /(2/rkN)^{{0.5}} $$
(7)

Note that the interval between markers M1 and M2 defines half of the CI(1-α). Assuming a chromosome of infinite length, the CI will be symmetrical, so that r=CI(1-α)/2, with the CI of QTL map location in units of proportion of recombination. Generally, the CI of map location is given in cM. To convert cM to proportion of recombination, cM are first converted to percent recombination using an appropriate mapping function, such as the Haldane mapping function, and then to proportion of recombination by dividing by 100. Thus, r=CI*(1-α)/200, where CI*(1-α) is the CI expressed as percent recombination. Since R=rkN with k=0.5 for the BC design, r=2R/N; substituting 2R/N for r gives:

$$R = {\text{CI}}^{*}_{{(1 - \alpha )}} N/400$$
(8)
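The conversion between cM and percent recombination mentioned above can be sketched as follows. This is a minimal illustration that assumes the Haldane mapping function, r = 0.5(1 − e^(−2m)) with m in Morgans; the function names are hypothetical, and any other mapping function could be substituted.

```python
import math


def cm_to_percent_recombination(cm: float) -> float:
    """Haldane mapping function: map distance in cM -> percent recombination."""
    morgans = cm / 100.0
    r = 0.5 * (1.0 - math.exp(-2.0 * morgans))  # proportion of recombination
    return 100.0 * r


def percent_recombination_to_cm(percent: float) -> float:
    """Inverse Haldane function: percent recombination -> map distance in cM."""
    r = percent / 100.0
    if not 0.0 <= r < 0.5:
        raise ValueError("proportion of recombination must lie in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


if __name__ == "__main__":
    for cm in (1.0, 10.0, 50.0):
        print(cm, round(cm_to_percent_recombination(cm), 3))
```

For short distances the two scales nearly coincide, and they diverge as the distance grows.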

Substituting (8) in (6) gives \(Z_{\alpha/2} = 2\delta/[3{,}200/({\text{CI}}^{*}_{(1-\alpha)} N)]^{0.5}\), and solving for CI*(1-α) and N yields:

$${\text{CI}}^{*}_{{(1 - \alpha )}} = 800Z^{2}_{{\alpha /2}} /\delta ^{2} N$$
(9)

and

$$N = 800Z^{2}_{{\alpha /2}} /\delta ^{2} {\text{CI}}^{*}_{{(1 - \alpha )}} $$
(10)
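As a numerical check on the BC derivation, the sketch below evaluates Eq. 9 and then feeds the resulting interval back into Eq. 7 to confirm that the starting Z value is recovered. The function names and the example values (δ=0.25, N=5,000) are illustrative assumptions.

```python
import math


def z_from_design(delta: float, r: float, k: float, n: float) -> float:
    """Eq. 7: Z = delta / sqrt(2 / (r * k * N))."""
    return delta / math.sqrt(2.0 / (r * k * n))


def ci_percent_bc(delta: float, n: float, z: float = 1.96) -> float:
    """Eq. 9: CI* (percent recombination) for the backcross design."""
    return 800.0 * z ** 2 / (delta ** 2 * n)


if __name__ == "__main__":
    delta, n = 0.25, 5000
    ci = ci_percent_bc(delta, n)
    r = ci / 200.0  # half of the CI, as a proportion of recombination
    print("CI* =", round(ci, 2))
    print("Z recovered from Eq. 7 =", round(z_from_design(delta, r, 0.5, n), 4))
```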

In a similar, but more complex derivation (see the Appendix) for the F2 design, the contrast between the appropriate marker genotype groups, D′, is 2δ/(2−r), where E(M1)=δ=2d, and SE²(D′)=32/[(2−r)²rN] (Eqs. 18 and 19). Thus for the F2 design:

$$Z_{{\alpha /2}} = 2\delta /(32/rN)^{{0.5}} = \delta /(2/rkN)^{{0.5}} $$
(11)

For the F2 design only homozygotes for alternative marker alleles are used to construct the contrast (Appendix). Therefore, k=0.25 for this design. Letting r=CI*(1-α)/200, substituting in (11), and solving for CI*(1-α) and N yields:

$${\text{CI}}^{*}_{{(1 - \alpha )}} = 1,600Z^{2}_{{\alpha /2}} /\delta ^{2} N,$$
(12)

and

$$N = 1,600Z^{2}_{{\alpha /2}} /\delta ^{2} {\text{CI}}^{*}_{{(1 - \alpha )}} $$
(13)

Taking α=0.05, so that Zα/2=1.96, and substituting δ=d+h and δ=2d in Eqs. 9 and 12 for CI*(1-α) yields CI*(1-α)=3,073/[(d+h)²N] for a BC design, and CI*(1-α)=1,537/(d²N) for an F2 design. These equations are virtually identical to those obtained by extensive simulation in Darvasi and Soller (1997).

Equations 7 and 11 can readily be generalized to other mapping designs according to the corresponding values for δ, the expectation of the contrast for the marker M1 located at the QTL, and k, the proportion of the mapping population in each marker genotype group making up the contrasts for the markers M1 and M2. More complex mapping designs that accumulate recombination events, such as advanced intercross lines (AIL, Darvasi and Soller 1995), full-sib intercross lines (FSIL, Song et al. 1999), and recombinant inbred lines (RIL, Soller and Beckmann 1990) differ from the BC and F2 designs in the proportion of recombination per cM. To take this into account, Eq. 7 must be modified as follows to convert the proportion of recombination, r, which is the proportion of recombination in an F2 or BC generation, into the effective accumulated proportion of recombination obtained in generation g:

$$Z_{{\alpha /2}} = \delta _{{\text{D}}} /({\text{2}}/t_{{\text{D}}} k_{{\text{D}}} rN)^{{0.5}} $$
(14)

where δD and kD are the appropriate δ and k values for the given design, and tD is a factor that converts the proportion of recombination in a BC or F2 generation into the effective accumulated proportion of recombination obtained in generation g. Substituting r=CI*(1-α)/200 in (14) and solving for CI*(1-α) and N gives the general expressions:

$${\text{CI}}^{*}_{{(1 - \alpha )}} = 400Z^{2}_{{\alpha /2}} /{\left( {\delta ^{2}_{{\text{D}}} t_{{\text{D}}} k_{{\text{D}}} N} \right)}$$
(15)
$$N = 400Z^{2}_{{\alpha /2}} /{\left( {\delta ^{2}_{{\text{D}}} t_{{\text{D}}} k_{{\text{D}}} {\text{CI}}^{*}_{{(1 - \alpha )}} } \right)}$$
(16)
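Equations 15 and 16 translate directly into two small functions. The sketch below is illustrative only: the function and argument names are assumptions, and Z defaults to 1.96 (α=0.05).

```python
def ci_percent(delta_d: float, t_d: float, k_d: float, n: float,
               z: float = 1.96) -> float:
    """Eq. 15: CI* in percent recombination for a general design."""
    return 400.0 * z ** 2 / (delta_d ** 2 * t_d * k_d * n)


def required_n(delta_d: float, t_d: float, k_d: float, ci: float,
               z: float = 1.96) -> float:
    """Eq. 16: population size needed to reach a target CI* (percent)."""
    return 400.0 * z ** 2 / (delta_d ** 2 * t_d * k_d * ci)


if __name__ == "__main__":
    # Setting d = 1 and N = 1 exposes the numerator constants of each design.
    print("BC constant:", round(ci_percent(1.0, 1.0, 0.5, 1.0), 2))   # delta = d
    print("F2 constant:", round(ci_percent(2.0, 1.0, 0.25, 1.0), 2))  # delta = 2d
```

With tD=1, the BC values (δD=d, kD=0.5) and the F2 values (δD=2d, kD=0.25) reduce to the constants 3,073 and 1,537 quoted above for codominance.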

Results

The predicted CI with a saturated genetic map for various experimental designs are given in Table 1. BC, F2, and AIL designs are assumed to be initiated from fully inbred parental lines. Half-sib and full-sib designs in outcrossing populations (Soller and Genizi 1978) are the equivalent of BC and F2 designs respectively, assuming that the parents of the families are heterozygous at the QTL and that the family size is sufficiently large so that the marker-QTL phase can be determined virtually without error. The FSIL design is a variant of the AIL design, adapted to outcrossing populations. It is initiated as a large F1 family produced by a mating between two individuals, and is maintained by continued random mating within the families of each generation. In the cumulative AIL and FSIL designs (CAIL and CFSIL), progeny are phenotyped and genotyped in each generation from the F2 generation on, to build a cumulating mapping population consisting of individuals from all of the generations. The mapping resolution of such a population will depend on the accumulated recombinants over all generations.

Table 1 Confidence interval of QTL location with a saturated genetic map for various experimental designs. δ represents the contrast value between marker genotype groups for the quantitative trait (codominance is assumed), tD the effective proportion of recombination per cM, kD the proportion of the mapping population in each of the marker genotype groups forming the mapping contrast, CI(1-α) the length of the (1−α) CI in percent recombination, CI(0.95) the length of the 95% CI in percent recombination, and N(10, 0.25) the required total population size for CI(0.95)=10 cM with d=0.25. The codes representing the design of the population groups are as follows: BC backcross, AIL(g) advanced intercross line carried to generation g, FSIL(g) full-sib intercross line carried to generation g, CAIL(g) and CFSIL(g) cumulative AIL and FSIL respectively carried to generation g, RIL(n) recombinant inbred lines with n individuals phenotyped per line. The additive effect at the QTL is represented as d. σf=[2h²+(1−h²)/n]^0.5, where h² equals heritability in the narrow sense and h²=0.25 is assumed. Zα/2 gives the value of the standard normal variable with a probability of α/2. For the CAIL and CFSIL designs the number of individuals per generation is given in parentheses, and for the RIL designs the total number of individuals phenotyped is given in parentheses.

In the AIL design, from the F2 generation on, only half of the chromosomal regions in any generation will be in the heterozygous state. Thus recombination accumulates at a rate of tD=0.5g, relative to the BC and F2 designs, where g is the number of generations. In the FSIL design, three-quarters of the descendants of any one of the four parental chromosomes will be in the heterozygous state. Thus recombination accumulates at a rate of tD=0.75g, relative to the BC and F2 designs. In the RIL design the final proportion of recombination over small distances is twice that in the F2 generation, so that tD=2.0 (Soller and Beckmann 1990).

Assuming codominance at the QTL, the contrast values will be δD=d for the BC design and δD=2d for the F2 and AIL designs. The effect for the FSIL designs approaches δD=2d, depending on the specific configuration of the marker and the QTL alleles (Song et al. 1999). The contrast value for the RIL design depends on the number of individuals phenotyped in each line, and will be δD=2d/σf, where σf² is the variance among means of inbred lines: σf²=2h²+(1−h²)/n, where h² is the heritability in the narrow sense and n is the number of individuals scored for the quantitative trait from each RIL (Soller and Beckmann 1990).

As in the BC design, kD=0.5 for the RIL design, for which one half of all lines are homozygous for one allele at each marker, and the other half are homozygous for the alternative allele. In the F2 and AIL designs one quarter of the population are in each of the two contrasted marker genotype groups, so that kD=0.25. In the FSIL population, at the optimal configuration of marker and QTL alleles, the proportions will be the equivalent of kD=0.25 in each marker genotype group (Song et al. 1999).

Table 1 also shows the general expressions for CI(1-α) as a function of Zα/2, δD, tD, kD and N, and specific values for CI(0.95). These expressions demonstrate that the CI is inversely proportional to population size and the square of the contrast value. Thus, methods such as multi-trait analysis that increase the contrast value (Korol et al. 1995) can markedly reduce the CI for given population size. General expressions for N as a function of Zα/2, δD, tD, kD, and CI(1-α), as derived from Eq. 16, are also presented.

The required numbers of individuals genotyped to obtain a CI(0.95) of 10 cM with d=0.25 are listed in the right-hand column. g=6 is assumed for the AIL and FSIL designs, and h²=0.25 for the RIL design. Sample sizes are quite large for the BC and F2 designs, about 5,000 and 2,500 respectively. The numbers of genotyped individuals required by the AIL and FSIL designs are one-third and one-fifth respectively those for the F2. Values for the FSIL assume an optimal configuration of marker and QTL alleles. Cumulative AIL and cumulative FSIL with g=6 require just twice the total numbers required for a single generation, but the numbers per generation are only 40% of the total. Thus, these designs are useful when total facilities are limited and not elastic. For the RIL design, the number of individuals genotyped is the total number of lines, while the number of individuals that must be scored for the quantitative trait is the number of lines multiplied by the number of individuals per line. This value is given in Table 1 in parentheses. When a single individual is phenotyped for each line, 768 RIL have mapping power equivalent to an AIL (g=6); but 330 RIL, each evaluated on 20 individuals (6,600 individuals scored for the quantitative trait), have the same mapping power as a BC sample of 5,000 individuals. The formulas given in Table 1 demonstrate that very large samples are required for high resolution mapping. Achieving a 1 cM CI(0.95) would require 50,000 BC individuals, 8,000 AIL (g=6), or 3,300 RIL with 20 individuals scored per line for a total of 66,000 phenotypes.
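The sample sizes discussed above can be approximately reproduced from Eq. 16 and the design parameters given in the text. The sketch below makes three assumptions: tD=0.5g and tD=0.75g are read as products (tD=3 and 4.5 at g=6), the RIL contrast is taken as 2d/σf (the reading that reproduces the 768 and 330 lines quoted above), and the helper names are hypothetical. The cumulative designs are omitted because their tD values are not stated in the text.

```python
import math


def required_n(delta_d, t_d, k_d, ci, z=1.96):
    """Eq. 16: population size for a target CI* in percent recombination."""
    return 400.0 * z ** 2 / (delta_d ** 2 * t_d * k_d * ci)


def sigma_f(h2, n_per_line):
    """Standard deviation among RIL means: sqrt(2*h2 + (1 - h2)/n)."""
    return math.sqrt(2.0 * h2 + (1.0 - h2) / n_per_line)


def table_sample_sizes(d=0.25, ci=10.0, g=6, h2=0.25):
    """Required N per design; each tuple is (delta_D, t_D, k_D)."""
    designs = {
        "BC": (d, 1.0, 0.5),
        "F2": (2 * d, 1.0, 0.25),
        f"AIL({g})": (2 * d, 0.5 * g, 0.25),
        f"FSIL({g})": (2 * d, 0.75 * g, 0.25),
        "RIL(1)": (2 * d / sigma_f(h2, 1), 2.0, 0.5),
        "RIL(20)": (2 * d / sigma_f(h2, 20), 2.0, 0.5),
    }
    return {name: required_n(dl, t, k, ci) for name, (dl, t, k) in designs.items()}


if __name__ == "__main__":
    for name, n in table_sample_sizes().items():
        print(f"{name:>8}: {n:8.0f}")
```

For the RIL rows the result is the number of lines; the number of phenotypes is that value multiplied by the number of individuals scored per line.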

The equations for the BC and F2 designs are virtually identical to Eqs. 1 and 2 of Darvasi and Soller (1997). The only differences are that the numerator constants are 3,073 and 1,537, instead of 3,000 and 1,500 respectively, and that the CI is measured in percent recombination, rather than in cM. The value of 3,073 is well within the CI for the empirical estimate of the corresponding parameter derived by Darvasi and Soller (1997) and denoted “k”. Furthermore, Darvasi and Soller (1997) slightly underestimated “k”, because they simulated a chromosome of 100 cM. This imposes an artificial upper limit on the CI. This problem can also be noted in Fig. 2 of Darvasi and Soller (1997), where they used only a range of 1–20 cM to estimate “k”.

The analytical formula derived can be used for any CI up to 50% recombination on either side of the estimated QTL location. The expressions in the sixth column of Table 1 can also be used to derive the minimum value of d²N for which a valid QTL CI can be derived. A CI(0.95)≥100 will include any point along the chromosome with up to 50% recombination relative to the point with the maximum test statistic, which means that a CI(0.95)>100 is essentially infinite. For the BC design, a valid CI(0.95) is obtained only if d²N>30.73. For example, if N=1,000, then a valid CI(0.95) can be obtained only for d>0.175. If the estimated QTL position is near one end of the chromosome, a “one-tailed” CI would be more appropriate than a “two-tailed” CI that includes nonexistent DNA.
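The threshold on d follows from requiring CI*(0.95)<100 in the BC formula under codominance (δ=d). A minimal sketch, with a hypothetical function name:

```python
import math


def min_d_for_valid_ci_bc(n: float, z: float = 1.96,
                          max_ci_percent: float = 100.0) -> float:
    """Smallest d giving CI* below max_ci_percent in a BC design.

    From CI* = 800 z^2 / (d^2 N) < max_ci_percent, it follows that
    d > sqrt(800 z^2 / (max_ci_percent * N)).
    """
    return math.sqrt(800.0 * z ** 2 / (max_ci_percent * n))


if __name__ == "__main__":
    print(round(min_d_for_valid_ci_bc(1000), 3))
```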

Discussion

The formulas we have derived here are only asymptotically correct. During analysis of an actual data set, the likelihood profile across the chromosome may be far from symmetrical, because chromosomes are of finite length, information content differs among markers, and marker spacing is not uniform. This would be taken into account when deriving CI from the data. We emphasize, however, that the purpose of this note is to derive a priori expectations for the CI for design purposes, or for general evaluation of the overall mapping resolution of an experiment, and not to derive CIs for specific QTL from an actual data set. In estimating CI for an actual experimental data set, empirical methods, such as bootstrap analysis (Visscher et al. 1996), can be used to obtain the CI of estimated QTL map locations.

Although this study assumed that an infinite number of markers were genotyped, Darvasi and Soller (1994) demonstrated that for most experimental designs if genotyping costs are large compared to phenotyping costs, the power to detect QTL is economically optimized by phenotyping many individuals for fewer markers. For crosses between inbred lines or half-sib designs the optimum marker spacing is 80 cM, provided that unlimited numbers of individuals are available for phenotyping. Even if phenotyping costs are large relative to genotyping costs, the optimum marker spacing is no less than 20 cM if all markers are completely informative (Darvasi et al. 1993).

Percent recombination is close to cM for small values, but underestimates cM for larger values for most commonly used mapping functions. Measuring the CI in cM, rather than percent recombination, would not affect the relationship between the simulated and the predicted CI if both are given in the same units, as long as the CI(0.95) measured in percent recombination is <100.

The equations derived in this study do not account for the fact that the number of events of recombination in a finite sample is also finite. Although the expectation of the number of recombinants will be equal to 2R, the number of recombinants in a finite sample will have a binomial distribution, which will increase the variance of the standard error of the contrast. As noted by Kruglyak and Lander (1995) in the BC design, N events of recombination per Morgan are expected in a sample of N individuals. If in the sample of N individuals, there were no events of recombination between point x1 and x2 on the chromosome, then the probability of QTL location will be equal across the chromosomal segment x1x2. [In this case the likelihood function is completely flat between x1 and x2, and has first and second derivatives of zero. This is one reason why generic software, such as Proc NLIN (SAS 1999), often has difficulty obtaining QTL CI.] Thus even for a gene with complete heritability, the mean length of the “critical interval” for gene location will be 2/N in units of Morgans, or 200/N in cM. The critical interval, as defined by Kruglyak and Lander (1995), differs from the CI in that with complete heritability there is zero probability that the gene is outside this interval.

The effect of the finite distribution of events of recombination will be negligible, unless the QTL effect is very large relative to the phenotypic standard deviation. For example, even if δ=0.88, then for the BC design, CI(0.95)=20 for N=200. In this case, 2R=20, and ten recombinants are expected in each marker class. In a simulation that took into account the binomial distribution of the number of recombinants, the standard error of the contrast increased by 3.7%. As the effect of the QTL decreases, the discrepancy from the analytical formula will also decrease.
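One way to gauge the size of this effect without simulation is to compare the variance of the contrast at the expected recombinant count, 8/R, with its average over a binomially distributed count. The sketch below is not the simulation reported above and need not reproduce its figure. It assumes that the recombinant count in each marker class is Binomial(kN, r), that the two classes are independent, and that samples with no recombinants in a class are excluded. The function names are hypothetical, and the exact summation is only practical for modest class sizes.

```python
import math


def expected_inverse_count(n_trials: int, p: float) -> float:
    """Exact E[1/R | R >= 1] for R ~ Binomial(n_trials, p)."""
    total = 0.0
    for r in range(1, n_trials + 1):
        pmf = math.comb(n_trials, r) * p ** r * (1.0 - p) ** (n_trials - r)
        total += pmf / r
    p_zero = (1.0 - p) ** n_trials
    return total / (1.0 - p_zero)


def se_inflation_bc(n: int, ci_percent: float, k: float = 0.5) -> float:
    """Ratio of root-mean contrast variance (binomial R) to the SE at fixed R."""
    r = ci_percent / 200.0            # proportion of recombination for half the CI
    group_size = round(k * n)         # individuals per marker genotype group
    fixed_var = 8.0 / (r * group_size)                      # Eq. 5 with R = rkN
    mean_var = 8.0 * expected_inverse_count(group_size, r)  # averaged over R
    return math.sqrt(mean_var / fixed_var)


if __name__ == "__main__":
    # The worked example above: N = 200 and CI(0.95) = 20, so R = 10 on average.
    print(round(se_inflation_bc(200, 20.0), 4))
```

Because 1/R is convex, the ratio always exceeds 1, and it approaches 1 as the expected number of recombinants grows.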

We were informed by a reviewer that Visscher and Goddard (2004), using somewhat different methodology, also derived the same formulas to predict CI(0.95) for the BC and F2 designs.