Introduction

An international effort, the HapMap project, is under way to type single-nucleotide polymorphisms (SNPs) at high density in small samples from three human populations (Couzin 2002). The primary objective is to provide a resource from which SNPs may be selected to localize genes related to health and disease (association mapping, called positional cloning before the physical sequence became public). The structure, content, utilization, and therefore value, of this resource are unsettled. A central problem is how to balance the cost of SNP selection and typing against their power to localize a disease gene. A small number of SNPs is not costly, but its power in a large region is unacceptably low. An enormous number of SNPs has optimal power, but the cost of typing them in a large sample is currently prohibitive. A popular approach to this problem is to accept an arbitrary definition of a haplotype block within which SNP association [linkage disequilibrium (LD)] is high, and from which a small number of informative SNPs may be selected. This raises four problems: (1) what definition of a block is optimal; (2) what selection algorithm best balances cost and power; (3) what allowance (if any) should be made for block length, which may vary from 1 to 200 kb and between populations; and (4) what provision should be made for steps between blocks where LD decreases abruptly, primarily reflecting recombinational hot-spots (Jeffreys et al. 2001). This dilemma suggests that a useful algorithm would accommodate haplotype blocks and steps without requiring their definition, the many variants of which give conflicting results (Cardon and Abecasis 2003; Ke et al. 2004).

Shortly after the discovery of haplotype blocks in the human genome (Daly et al. 2001; Jeffreys et al. 2001), some statisticians and geneticists took up the idea that LD and haplotype diversity can be captured by much smaller subsets of "haplotype tagging" SNPs (htSNPs) (Johnson et al. 2001). The most dramatic success with haplotype tagging was reported by Zhang et al. (2002a): rejecting 75% of SNPs reduced diversity by 20%, but the estimate of power by only 4%. They assumed that recombination and mutation occur uniformly, and that affection is largely determined by a single untyped SNP in the centre of a designated candidate region with minor allele frequency of 0.10–0.15. Haplotypes were assumed to be known without error, uncomplicated by diplotypes and incomplete typing, and their frequencies were simulated. The LD statistic was the largest χ² for a single marker SNP or two-locus haplotype. The distribution of marker allele frequencies and LD was not reported, and it would not be possible to replicate their results from the information provided. The assumption that retention of haplotype diversity implies retention of power for association mapping remains unproven.

Material and methods

To test this hypothesis several decisions must be made. The first is to choose real data instead of simulations, since the latter do not credibly mirror the important variables (allele frequencies and LD). We analysed two samples of published genotype data on which current ideas of LD structure are based. Jeffreys et al. (2001) genotyped 296 SNPs in a 216-kb segment of the class II region of the MHC on 6p21.3 in 50 unrelated north-European British males. Daly et al. (2001) typed 103 SNPs in 617 kb on 5q31 in 129 parent-child trios from a European-derived population. We used parents only. For comparability we followed the example of Daly et al. (2001) and rejected SNPs with minor allele frequencies less than 0.05, reducing the number of SNPs in the first sample to 248. The second sample was remarkable for stimulating interest in haplotype blocks while ignoring evidence that recombination at a given step usually affects only a small proportion of haplotypes, most of which extend without recombination over several blocks and steps. In this region the recombinant class accounts for as few as 2% of haplotypes at a small step and a maximum of only 40% of haplotypes at a relatively large step. These features are real and are therefore incorporated in LD maps. However, it would be foolish to base association mapping on a block definition that most haplotypes do not respect and a SNP selection for which there is no consensus. The import of these data is to define a natural haplotype as the sequence between two adjacent recombination events. Unfortunately most overlapping sequences (especially those with short history and therefore less common alleles) did not recombine at the same points, and so there are no natural haplosets. This generality is fatal to an artificial correspondence between blocks and haplotypes, leaving steps in midair. On the contrary, the information in LD and in arbitrary haplosets of uniform numbers of SNPs is unaffected by the non-existence of natural haplosets.

Having rejected simulations in favour of these two datasets, we had to acknowledge that the best way to exploit blocks, steps, LD, haplotypes, and LDU maps has not been established. It is not self-evident that tagging common haplotypes, however defined, leads to accurate localisation of a causal SNP that is likely to be polymorphic in more than one haplotype and to have a different frequency than any of the haplotypes in which it is found. The problem is exacerbated if causal SNPs have not been typed, and becomes worse as marker SNPs are depleted by tagging. The point location to which a haplotype should be assigned, the information weight it should receive, the number of SNPs it contains, the choice of these SNPs, the role of different populations (Lonjou et al. 2003), the overlap of different haplosets, and a host of other statistical questions have not been addressed. Given these uncertainties, we adopted two methods that have been widely applied. The first uses the quadratic form in the composite likelihood

$$\sum_{i=1}^{n} K_{b_i}\left(\hat{b}_i - b_i\right)^2,$$

where n is the number of SNPs (i = 1, 2, ..., n), $\hat{b}_i$ is an estimated regression coefficient with information $K_{b_i}$ for the i-th SNP, and $b_i$ is its expected value under a Malecot model,

$$b_i = (1 - L)\,M e^{-\varepsilon \Delta (S_i - S)} + L,$$

where Δ takes the value 1 if $S_i \geq S$ and −1 otherwise, and n is replaced by m when m < n htSNPs are used. Here $S_i$ is the location of the i-th SNP, S is the location of a causal SNP, and L is the bias (residual association) at large distance (Collins and Morton 1998). L can be predicted ($L_p$) as the mean $K_{b_i}$-weighted absolute value of a normal deviate under H0. The other two parameters, M and ε, have expectations that are functions of the number of generations t since the last population bottleneck and either recombination (ε only) or mutation and drift (M only), as derived by Morton et al. (2001) using a recurrence method introduced by Malecot (1969, 1973), who obtained a similar equation for isolation by distance. The Malecot model has been applied to physical, linkage, and LD maps (Collins and Morton 1998; Maniatis et al. 2002; Zhang et al. 2002c). Recently four subhypotheses have been used for association mapping of oligogenes, only two of which are used here (Maniatis et al. 2004). Model A specifies M = 0 and a uniform distribution at ordinate $L_p$. Model B estimates L with M = 0. Comparison gives

$$\chi^2_1 = \left[\left.\sum K_{b_i}\left(\hat{b}_i - b_i\right)^2\right|_{A} - \left.\sum K_{b_i}\left(\hat{b}_i - b_i\right)^2\right|_{B}\right] / V,$$

where V is the empirical variance, ideally estimated for a nonsyntenic causal SNP. In the absence of such information V has been estimated, with less power, from model B as

$$\left.\sum K_{b_i}\left(\hat{b}_i - b_i\right)^2\right|_{B} / (n - 1).$$

Comparison of models A and B tests for significance of a candidate region and is invariant for physical, linkage, and LD maps. Neither of these models depends on the M or ε parameters of the Malecot model, nor on the locations of causal SNPs within the region. They have been shown to give a reliable test of H0 in these samples (Maniatis et al. 2004). However, many investigators prefer to select the maximal $\chi^2_1$ out of n SNPs or m htSNPs, accepting the heavy Bonferroni correction that this choice entails (Risch and Merikangas 1996). Here we use both methods, expressing results with m < n htSNPs as the relative efficiency (RE), defined as the ratio of $\chi^2_1$ when m htSNPs are used to $\chi^2_1$ when all n SNPs are used. Although not equal to power, RE is monotonic on power and therefore a good surrogate. If desired, RE can be replaced by the ratio of the estimated noncentrality parameters, but ignorance of their exact values makes this more pretentious than useful.
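Under these two submodels the machinery simplifies considerably, because with M = 0 the Malecot expectation reduces to $b_i = L$ for every SNP. A minimal sketch of the model A versus model B contrast, assuming the regression estimates $\hat{b}_i$, their information weights $K_{b_i}$, and the predicted ordinate $L_p$ are already computed (the function name and interface are ours, not from the cited software):

```python
import numpy as np

def model_a_vs_b(bhat: np.ndarray, K: np.ndarray, Lp: float) -> float:
    """Chi-square (1 df) for model B (L estimated, M = 0) against
    model A (L fixed at its predicted value Lp, M = 0)."""
    n = len(bhat)
    # Model A: weighted residual sum of squares about the ordinate Lp.
    ss_a = np.sum(K * (bhat - Lp) ** 2)
    # Model B: L estimated as the K-weighted mean of the bhat values.
    L_hat = np.sum(K * bhat) / np.sum(K)
    ss_b = np.sum(K * (bhat - L_hat) ** 2)
    # Empirical error variance from model B, as in the text (n - 1 df).
    V = ss_b / (n - 1)
    return (ss_a - ss_b) / V
```

Because neither submodel involves ε or the causal location S, the contrast is invariant under choice of physical, linkage, or LD map, as stated above.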

The distribution of minor allele frequencies among randomly selected SNPs is approximately uniform from 0.05 to 0.5, with an excess at smaller frequencies because of population expansion in recent millennia (Botstein and Risch 2003). SNP discovery often excludes frequencies under 0.05, but introduces only minor departures from a uniform distribution at greater frequencies. Neutral theory predicts that minor allele frequencies follow the same distribution for causal as for marker SNPs. We therefore took each of the n+1 SNPs in turn as causal to determine how well the m htSNPs detect it (m < n). In previous trials we found no significant difference in detection of causal SNPs in blocks and steps (Maniatis et al. 2004). As in quantitative genetics, we assumed additivity of allelic effects, assigning allelic counts 0, 1, 2 to the three genotypes as a pseudo-phenotype and regressing this variable on the allele count of the i-th marker. Sample size precluded a case-control design, which would add some complexity but no additional information, since the enrichment factor ω is incorporated in the analysis to recover population frequencies (Collins and Morton 1998).
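A sketch of this pseudo-phenotype regression, assuming a genotype matrix G of allele counts (rows individuals, columns SNPs) with column c taken as causal; the helper below is hypothetical and ignores complications such as missing genotypes and the enrichment factor ω:

```python
import numpy as np

def marker_regressions(G: np.ndarray, c: int):
    """Yield (slope, information) for each marker i != c, regressing
    the causal allele count (pseudo-phenotype) on the marker count.
    The information weight is the reciprocal of the slope variance."""
    y = G[:, c].astype(float)              # allelic counts 0, 1, 2
    n, n_snps = G.shape
    for i in range(n_snps):
        if i == c:
            continue
        x = G[:, i].astype(float)
        b, a = np.polyfit(x, y, 1)         # slope, intercept
        resid = y - (a + b * x)
        sxx = np.sum((x - x.mean()) ** 2)  # assumes marker is polymorphic
        var_b = np.sum(resid ** 2) / (n - 2) / sxx
        yield b, 1.0 / var_b
```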

Our final choice was among measures of diversity. Shannon entropy is applicable to both haplotypes and diplotypes (Shannon 1948). Other programs for SNP selection are specific to diplotypes (Stram et al. 2003) or haplotypes (Zhang et al. 2002a, 2002b). These three approaches do not exhaust the possibilities for htSNP selection, since new variants are reported almost every month. The results are remarkably consistent, and therefore inclusiveness was not attempted. The two samples were analysed separately, and combined results were reported as means of RE, since the differences in RE were not large and the same general principles hold for both samples.

Results

SNP selection by Shannon entropy in diplotypes and inferred haplotypes

Shannon entropy is evaluated as $-\sum_i f_i \ln f_i$, where $f_i$ is an estimate of the frequency of the i-th haplotype or diplotype and $\sum f_i = 1$ (Shannon 1948). For haplotypes, a trio of diallelic markers a, b, c gives up to eight values of $f_i$, which the pair of flanking markers a, c combines into four values. Therefore all haplotype diversity is included. Moving along the region one SNP at a time, we estimated haplotype frequencies for three adjacent SNPs in random diplotypes using the SNPHAP program (http://www-gene.cimr.cam.ac.uk/clayton/software/). To be confident about the estimated frequencies, we used only observations with no missing genotypes among the SNP trios. Shannon entropy may be designated T for trios and P for the flanking pair. In successive rounds the intermediate SNPs with the smallest values of S = T − P are eliminated, reducing the number of retained markers (m) by 10% of the initial number n+1 per round. However, adjacent SNPs are not eliminated in the same round.
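A sketch of the trio criterion, assuming haplotype frequencies for each window of three adjacent SNPs have already been estimated (e.g. by SNPHAP); the dictionary representation and names are ours:

```python
from math import log

def entropy(freqs) -> float:
    """Shannon entropy -sum f ln f, skipping zero frequencies."""
    return -sum(f * log(f) for f in freqs.values() if f > 0)

def middle_snp_score(trio_freqs: dict) -> float:
    """S = T - P: entropy of the a,b,c trio minus entropy of the
    flanking a,c pair, i.e. the diversity contributed by the middle SNP."""
    T = entropy(trio_freqs)               # up to eight haplotypes "abc"
    pair_freqs: dict = {}
    for hap, f in trio_freqs.items():     # collapse the middle allele
        key = hap[0] + hap[2]
        pair_freqs[key] = pair_freqs.get(key, 0.0) + f
    P = entropy(pair_freqs)               # up to four haplotypes "ac"
    return T - P
```

Each round would then drop the intermediate SNPs with the smallest S, 10% of the initial number at a time, never eliminating adjacent SNPs in the same round.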

Diplotypes may be scored from unphased genotypes without inferring haplotypes, with some loss of information but avoiding arbitrary definition of haplosets and omission or misassignment of phase-unknown sites (Ackerman et al. 2003). Then no assumption is made about Hardy-Weinberg frequencies, and so samples from different populations may be pooled if a cosmopolitan set of SNPs is desired. Diplotypes permit recognition of four genotypes whose heterozygosity is contributed uniquely by the intermediate marker b, in contrast to the same a, c genotypes where b is homozygous (Table 1). For both haplotypes and diplotypes, the sums of T, P, and S provide measures of entropy in each round.

Table 1 Shannon entropy for diplosets of SNP trios and pairs with genotypes 11, 12, 22. For the a, b, c trio: $T = -[q \ln q + Q \ln Q + (1 - q - Q) \ln(1 - q - Q)]$, where $Q = n_2/N$, $q = (n_3 + n_4 + n_5 + n_6 + n_7 + n_8)/N$, and $1 - q - Q = n_1/N$. For the a, c pair: $P = -[q \ln q + (1 - q) \ln(1 - q)]$
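A sketch of the diplotype version in Table 1, under the stated grouping: $n_1$ counts the class with frequency $1-q-Q$, $n_2$ the class with frequency $Q$, and $n_3,\dots,n_8$ the classes pooled into $q$; the counts are illustrative inputs, and all three classes are assumed non-empty:

```python
from math import log

def table1_entropies(n1: int, n2: int, rest: list):
    """T, P and S = T - P from diploset counts as in the Table 1
    caption; rest holds the six counts n3..n8 pooled into q."""
    N = n1 + n2 + sum(rest)
    Q = n2 / N
    q = sum(rest) / N                     # so 1 - q - Q = n1 / N
    T = -(q * log(q) + Q * log(Q) + (1 - q - Q) * log(1 - q - Q))
    P = -(q * log(q) + (1 - q) * log(1 - q))
    return T, P, T - P
```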

Suppose the initial set of n+1 SNPs (a maximum of n predictive SNPs plus one causal SNP) is divided into m htSNPs and n−m+1 rejected SNPs. Each of the n+1 SNPs is taken in turn as causal to produce a total of n+1 unique files. With respect to the htSNPs there are n−m+1 external choices, each with m−1 df under model B. There are m internal choices, each with m−2 df, since an additional df is lost because the causal SNP is excluded from the htSNPs. The total number of df in the n+1 files under model B is (m−1)(n−m+1) + m(m−2) = (n+1)(m−1) − m. Let $SS_A$, $SS_B$ denote the sums of the quadratic forms over all files under models A and B. Then $V_B = SS_B/\mathrm{df}$ is the mean error variance per file and $\chi^2_1 = (SS_A - SS_B)/[(n+1)V_B]$ is the mean test of B against A. The relative efficiency is $RE = (\chi^2_1 \mid m)/(\chi^2_1 \mid n)$. With slight modification this approach allows replacement of model B by a more complicated model with r df at the same significance level if $\chi^2_r$ is converted to $\chi^2_1$ (Collins and Morton 1998). The results are substantially the same (data not shown).
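As a check on this bookkeeping, the two df counts reduce to the same total:

$$(m-1)(n-m+1) + m(m-2) = (nm - m^2 + 2m - n - 1) + (m^2 - 2m) = nm - n - 1 = (n+1)(m-1) - m.$$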

Figure 1 shows successful preservation of Shannon entropy: the curves for htSNPs lie consistently above those for random selection. Here the RE of Shannon entropy was calculated as the ratio of the sum of T in the htSNP set of m SNPs to the sum of T in the initial set of n SNPs. RE for P is similar. Selection based on haplotypes preserved slightly more Shannon entropy than selection on diplotypes. Association mapping, however, gave disappointing results: RE declined dramatically as the number of htSNPs was reduced (Fig. 2). Although haplotype selection retained slightly more Shannon entropy, its RE decreased more rapidly than for diplotype selection. By both significance tests the htSNPs lost relative efficiency more rapidly than random selection (Fig. 2).

Fig. 1 Relative efficiency (RE) measured by Shannon entropy for SNP trios

Fig. 2 Relative efficiency (RE) for association mapping using SNPs selected by Shannon entropy

SNP selection in diplotypes

The tagSNPs program was used to select SNPs from diplotype data (Stram et al. 2003), estimating haplotype frequencies by an expectation-maximization (EM) algorithm. Following convention, haplotypes with frequencies greater than 0.05 were considered "common". The program then chose the "best" set of htSNPs, which maximizes the minimum squared correlation (R²) between true and predicted haplotype dosages (counted 0, 1, or 2) for each individual with a particular common haplotype. The EM calculation of predicted dosage is conditional on the genotype data and the haplotype frequencies, taken as if known, assuming Hardy-Weinberg equilibrium. R² is expressed as the ratio of the variance of the predicted dosage (a conditional expectation) to the total variance of the true haplotype dosage. The best set of htSNPs minimizes the uncertainty in prediction of common haplotypes. Because this program is computationally intensive, we used sliding windows of eleven adjacent SNPs, designating the middle one as causal and moving along the chromosome one SNP at a time. Since df = 0 for composite likelihood when m equals 1, we report results for m = 2–10 (20–100%). Mean $\chi^2_1$ for m htSNPs was calculated across the region. We also tried splitting the data into units of 20 or 35 SNPs for selecting SNPs and applied the same strategy for association mapping. The results were similar (data not shown). Figure 3 shows that diversity measured by minimum R² decreased slowly, but RE for association mapping declined rapidly, and faster than for random selection.
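A sketch of the R² criterion, assuming arrays of true and EM-predicted dosages (copies of a given common haplotype per individual, the prediction being the conditional expectation given genotypes); the names are ours rather than the tagSNPs interface:

```python
import numpy as np

def dosage_r2(true_dosage: np.ndarray, predicted_dosage: np.ndarray) -> float:
    """Variance of the predicted (expected) dosage over the variance
    of the true dosage for one common haplotype."""
    return np.var(predicted_dosage) / np.var(true_dosage)

def min_r2(true_by_hap: dict, pred_by_hap: dict) -> float:
    """The criterion the htSNP set maximizes: the minimum R^2 over
    all common haplotypes."""
    return min(dosage_r2(true_by_hap[h], pred_by_hap[h]) for h in true_by_hap)
```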

Fig. 3 Relative efficiency (RE) for association mapping using SNPs selected in diplotypes by maximizing the minimum squared correlation (R²)

SNP selection in haplotypes

Haplotypes containing all n+1 SNPs were inferred by the SNPHAP program as described previously, and the HapBlock program (Zhang et al. 2002a, 2002b) was then used to select SNPs. This program implements a dynamic programming algorithm for haplotype block partitioning and htSNP selection on haplotype data, finding the optimal partition into blocks with the minimum total number of SNPs. A segment of consecutive SNPs forms a block if at least α% of its haplotypes are represented more than once (Patil et al. 2001). SNPs are then selected to minimize the number that can distinguish at least α% of the haplotypes in a block. An alternative criterion is β, the percentage of haplotype diversity accounted for by the minimum number of SNPs (Johnson et al. 2001). As Zhang et al. (2002b) indicated, these two options are related in terms of conserving haplotype diversity. To obtain different proportions of selected SNPs, we varied the α or β parameter from 0.50 to 0.99; the higher the value, the more SNPs are selected. The same tests for association mapping were carried out as described for Shannon entropy. We present only results for the α parameter, since choosing β gave very similar results.
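A sketch of the α criterion for a single candidate segment, assuming haps is a list of haplotype strings over its consecutive SNPs; the functions are ours, not the HapBlock interface:

```python
from collections import Counter

def coverage(haps) -> float:
    """Fraction of haplotypes represented more than once."""
    counts = Counter(haps)
    return sum(c for c in counts.values() if c > 1) / len(haps)

def is_block(haps, alpha: float) -> bool:
    """Patil et al. (2001): a segment of consecutive SNPs forms a
    block if at least alpha of its haplotypes recur."""
    return coverage(haps) >= alpha
```

The dynamic programming then searches over partitions of the region into such blocks, minimizing the total number of SNPs needed to distinguish at least α% of the haplotypes within each block.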

The maximum number of htSNPs that the program chose for the 6p21 data is about half the original number, reflecting the strong LD block structure, so proportions between 50% and 100% could not be examined. As with the other two methods, RE for association mapping declined rapidly, and faster than for random selection, as the proportion of htSNPs diminished (Fig. 4).

Fig. 4 Relative efficiency (RE) for association mapping using SNPs selected in haplotypes by haplotype block partitioning, minimizing the number of SNPs that distinguish at least α% of the haplotypes

Effects of allele frequency

The observation that htSNP selection generally performed worse than random selection for detecting causal loci is not puzzling if we consider the allele frequencies of the htSNPs. Figure 5 shows that all the methods tended to select common SNPs, and this tendency becomes more pronounced as smaller proportions of SNPs are selected. The curves for random selection are flat and are therefore omitted.

Fig. 5 Mean minor allele frequency in SNPs selected by the indicated methods

Next we classified SNPs either by their minor allele frequency q or by the absolute difference from the minor allele frequency $q_c$ of the causal SNP, $\Delta = |q - q_c|$. SNPs were sorted on q or Δ, and the most common or rare SNPs (q large or small), or the SNPs with largest or smallest Δ, were selected until the required number was obtained. Selection on q has the expected effect, but selection for large or small Δ, like random selection, does not alter the mean allele frequencies of the selected SNPs (Fig. 6). Figure 7 shows that the RE of the maximum chi-square for selection on q declined faster than for random selection. It is striking that selection on small Δ gave the greatest retention of RE. Conversely, selection on large Δ gave the least retention, declining to nearly zero.
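A sketch of the frequency-matched selection, assuming q holds marker minor allele frequencies and qc that of the causal SNP; the function is illustrative only, since in practice qc is unknown:

```python
import numpy as np

def select_by_delta(q: np.ndarray, qc: float, m: int, smallest: bool = True):
    """Indices of the m markers with the smallest (or largest)
    absolute frequency difference Delta = |q - qc| from the causal SNP."""
    order = np.argsort(np.abs(q - qc))    # ascending Delta
    return order[:m] if smallest else order[-m:]
```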

Fig. 6 Mean minor allele frequency in SNPs selected by large or small minor allele frequency (q) or absolute difference (Δ) in minor allele frequency between the marker and the causal SNP

Fig. 7 Relative efficiency (RE) of maximum chi-square for association mapping using SNPs selected by large or small minor allele frequency (q) or absolute difference (Δ) in minor allele frequency between the marker and the causal SNP

One reason for the poor performance of the htSNPs in this study is apparent: SNPs selected by maximizing haplotype diversity tend to be common, making the average minor allele frequency of htSNPs greater than the average in the initial set, which was 0.27 in both samples. This increased discrepancy in minor allele frequencies costs relative efficiency. Since $q_c$ is generally unknown, selecting common SNPs is inefficient for detecting much rarer causal SNPs.

These results explain earlier observations that were puzzling. Selection based on Shannon entropy preserves more entropy in haplotypes than in diplotypes (Fig. 1). As predicted, haplotype selection tends to retain more common SNPs than diplotype selection (Fig. 5), and therefore RE decreases more rapidly for haplotype than for diplotype selection (Fig. 2).

Discussion

Estimates of power in association mapping by LD have serious problems. Bayesian methods attempt full likelihood by simultaneous estimation of population parameters, constrained to the likely location of a candidate marker and then treated as prior probabilities (Morris et al. 2002). Most geneticists dismiss this as not applicable to an uncertain location or useful for a certain one. As Devlin et al. (1996) delicately put it, "we believe that the full likelihood model will be very difficult to specify without unrealistically stringent assumptions about population history". The simplest alternative is to choose the marker with the largest χ², but this requires a drastic Bonferroni correction (Risch and Merikangas 1996) and has not been shown to give a good estimate of location. Therefore most association mapping by LD uses some form of composite likelihood for single markers or haplotypes (Terwilliger 2000; Collins and Morton 1998; Zaykin et al. 2002). This works rather well for point estimates, but estimates of standard errors, and therefore power, are less reliable than for true likelihood (Devlin et al. 1996). We try to allow for this by using the residual variance as error, which increases as the proportion of htSNPs declines. Perhaps in consequence, RE for composite likelihood declines faster than for maximum χ², although $\chi^2_1$ for composite likelihood exceeds the maximum χ² for a single marker until the proportion of htSNPs is very small. Obviously information for association mapping is not restricted to the predictive SNP with the maximum χ². A better solution for composite likelihood that allows explicitly for autocorrelation would be welcome, but present evidence does not justify the hope that haplotype tagging can greatly reduce the number of markers without substantial loss of power for association mapping.

The optimal distribution of minor allele frequencies for predictive SNPs is unsettled. A high frequency is inefficient for detecting less common causal SNPs, is associated with multiple haplotypes, and its linkage disequilibrium with other SNPs declines rapidly. A low frequency is inefficient for detecting common causal SNPs. SNPs with a large effect on a serious disease tend to have small frequencies because of negative selection, but there is no basis to infer whether oligogenes of smaller effect are more likely to be common or rare (Reich and Lander 2001; Pritchard 2001; Weiss and Clark 2002; Pritchard and Cox 2002; Clark 2003). We took each SNP in turn as causal, with minor allele frequencies ranging from 0.05 to 0.50, in agreement with Botstein and Risch (2003).

There is no validated method for association mapping by haplotypes. When it is developed, selecting haplotype-tagging SNPs will require three preconditions, an operation, and a hope. The three preconditions are: selection of a genomic region; typing a marker set within the region; and estimation of haplotype frequencies in that marker set. Only after these preconditions are met can haplotype tagging take place, with the goal of reducing the number of markers without substantial loss of haplotype diversity and the hope that this will assure no substantial loss of power for association mapping. The profound impact of these constraints is beginning to be appreciated (Meng et al. 2003; Sebastiani et al. 2003). The genomic region is often poorly defined, ranging from an oligonucleotide to a whole chromosome, and it may contain no causal SNP or more than one, distributed among different exons, introns, promoters, and loci. The initial marker set is a small fraction of the number in the region, chosen by an arbitrary mixture of convenience, prior claims of disease association, and assumptions about desirable density, allele frequencies, and locations in blocks or steps (however defined). Haplotype frequencies are determined by assumptions about trial values, frequency of typing error, and the biological significance of common haplotypes, which in turn has depended on controversial application of evolutionary theory for nonrecurrent mutations in the absence of recombination (Crow and Kimura 1970; Reich and Lander 2001). On these shifting sands haplotype tagging takes many forms, accepting or dismissing arbitrary block definition and accepting arbitrary percentages of differently defined haplotype diversity, without reference to the much larger number of markers excluded from the initial set. Finally, we are asked to accept without proof that power for association mapping has a monotonic relation to haplotype diversity however the initial set was chosen, whether or not a causal SNP was included, however small the representation of markers in the region may be, and whatever the method of association analysis. When we consider that the costs and benefits of testing more markers or sampling more individuals are uncertain and in rapid flux, it is impossible to avoid the suspicion that application of haplotype tagging will be at the expense of successful association mapping.

The title of this paper is a query that cannot be conclusively answered. On the one hand, there is no evidence that haplotype diversity is monotonic on power for association mapping, and our negative evidence has been supported (Zhao et al. 2003). On the other hand, a universal negative is unprovable across the many combinations of tagging algorithm, region, and initial marker set. The current generation of tagging programs is computationally prohibitive for large numbers of markers, and the haplotypes so defined are too numerous for powerful analysis. There is no justification for taking our results as the best or worst that haplotype tagging can do for association mapping. However, the burden of proof is not on us to bury haplotype tagging, but on its advocates to show cause why it should not be buried.

Even if the best of many contenders is adopted to tag haplotypes and a method of analysis is developed to minimise loss of power, there is a strong argument that miracles from haplotype tagging should be regarded cautiously. The HapMap project, intended as a tool for association mapping, is far from a consensus about its product and farther still from convincing its detractors (Couzin 2002). Suppose that HapMap reaches its goal of mapping 600,000 SNPs and characterising in some way the overlapping haplotypes they form. The total number of SNPs in the genome is in excess of three million, and may be as many as ten million (Kruglyak and Nickerson 2001), or at least 15 million if SNPs with frequencies as small as 0.01 are included (Botstein and Risch 2003). The probability that a particular causal SNP is represented among 600,000 marker SNPs is therefore less than 0.20 and may be as little as 0.06, without considering rare SNPs that are usually neglected in LD mapping. If the causal SNP were included, haplotypes would provide no additional information for its recognition. If the causal SNP is not included, haplotype-specific frequencies cannot be estimated accurately, since each defines an unidentified subset of recognised haplotypes, the number of which increases with the number of recognised SNPs. Common haplotypes have especially low power when the causal marker is uncommon. The most efficient strategy for association mapping is to type as many markers as can be afforded in a large region, with a greater density in a candidate region. Uniform spacing on the LD map, with additional SNPs in "holes" where the distance between successive markers is greater than 1 LDU, is the appropriate goal to implement this strategy. The cost of uniform spacing on the physical map has been investigated (Wang and Todd 2003). Additional constraints, such as rejecting all but the most common SNPs, assume a consensus that does not exist.
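The stated coverage probabilities follow directly from these counts:

$$\frac{6\times 10^{5}}{3\times 10^{6}} = 0.20, \qquad \frac{6\times 10^{5}}{10^{7}} = 0.06, \qquad \frac{6\times 10^{5}}{1.5\times 10^{7}} = 0.04,$$

the last figure applying only if SNPs as rare as 0.01 are counted.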

These considerations favour a multi-stage design. Stage 1 is a genome scan by linkage or association at low resolution. Stage 2 tries to refine the candidate region by association at moderate resolution or (with less power) by linkage. Stage 3 attempts to identify a candidate locus within a supported region, using functional tests on SNPs at high resolution. The goal of these preliminary stages is to determine whether a particular region suggested by linkage, LD, or function has one or more causal SNPs for a particular disease. Haplotype analysis has a possible but still undefined role in assuring equal spacing of markers on the LD map, with several SNPs at different frequencies in each LDU. Pooling of samples may be considered, at the cost of sacrificing haplotypes. Concentration on common SNPs should be avoided. In studies that cover more than a small region it is currently impractical to aim for all SNPs, and therefore identification of causal SNPs is unreliable. The analysis leans heavily on models A and B, although location models are also useful at this stage, especially if the region is large. The false discovery rate (FDR), introduced for linkage (Morton 1955), is useful to evade a costly Bonferroni correction.

Stage 4 is the end-game that begins when a candidate locus or a small candidate region has been confirmed. All SNPs should be sought, supported by sequencing of cases and controls and with functional tests if feasible. The FDR is uniquely adapted to identification of causal SNPs (Storey and Tibshirani 2003). Haplotype analysis has nothing to offer in stage 4 because the goal is to identify causal SNPs. Stepwise regression is appropriate if all SNPs have been tested, but in that case their genotypes are known and the ρ model testing each SNP for causality is more powerful. All recent research indicates the need in stage 4 to type virtually all markers on the critical LD map (Wang and Todd 2003). For example, 11 SNPs in a 2.5-kb region were required to provide modest evidence of association for the NOS2A promoter with cerebral malaria, with possible identification of a causal SNP (Burgner et al. 2003). A similar density has been proposed for genome scans of nonsynonymous substitutions, splice junctions, and promoter regions where the higher frequency of rare alleles in random SNPs is enhanced by negative selection (Botstein and Risch 2003). None of the current methods to select htSNPs is appropriate after stage 1. Recently tagSNPs selected without regard to haplotypes have been advocated, replacing blocks of contiguous SNPs by bins of highly associated SNPs that need not be contiguous (Carlson et al. 2004). SNPs with low minor allele frequency are associated with extensive LD and therefore tend to be selected as the number of tagSNPs diminishes. Compared with htSNPs, this enhances power to detect causal alleles that are at low frequency by chance or negative selection, but in our trials the power is less than for random SNPs. It is likely that a method based on the LD map and more efficient than composite likelihood for single SNPs will be developed, but one limitation cannot be overcome: haplotypes or bins of dense markers are a poor surrogate for a causal SNP, and haplotypes or bins of a smaller number of htSNPs or even tagSNPs are worse.