Introduction

An international effort, the HapMap project, is under way to type single-nucleotide polymorphisms (SNPs) at high density in small samples from three human populations (Couzin 2002). The primary objective is to provide a resource from which SNPs may be selected to localize genes related to health and disease (association mapping, called positional cloning before the physical sequence became public). The structure, content, utilization, and therefore value, of this resource are unsettled. A central problem is how to balance the cost of SNP selection and typing against their power to localize a disease gene. A small number of SNPs is not costly, but its power in a large region is unacceptably low. An enormous number of SNPs has optimal power, but the cost of typing them in a large sample is currently prohibitive. A popular approach to this problem is to accept an arbitrary definition of a haplotype block within which SNP association [linkage disequilibrium (LD)] is high, and from which a small number of informative SNPs may be selected. This raises four problems: (1) what definition of a block is optimal; (2) what selection algorithm best balances cost and power; (3) what allowance (if any) should be made for block length, which may vary from 1 to 200 kb and between populations; and (4) what provision should be made for steps between blocks where LD decreases abruptly, primarily reflecting recombinational hot-spots (Jeffreys et al. 2001). This dilemma suggests that a useful algorithm would accommodate haplotype blocks and steps without requiring their definition, the many variants of which give conflicting results (Cardon and Abecasis 2003; Ke et al. 2004).

Shortly after the discovery of haplotype blocks in the human genome (Daly et al. 2001; Jeffreys et al. 2001), some statisticians and geneticists took up the idea that LD and haplotype diversity can be captured by much smaller subsets of "haplotype tagging" SNPs (htSNPs) (Johnson et al. 2001). The most dramatic success with haplotype tagging was reported by Zhang et al. (2002a): rejecting 75% of SNPs reduced diversity by 20%, but the estimate of power by only 4%. They assumed that recombination and mutation occur uniformly, and that affection is largely determined by a single untyped SNP in the centre of a designated candidate region with minor allele frequency of 0.10–0.15. Haplotypes were assumed to be known without error, uncomplicated by diplotypes and incomplete typing, and their frequencies were simulated. The LD statistic was the largest χ² for a single marker SNP or two-locus haplotype. The distribution of marker allele frequencies and LD was not reported, and it would not be possible to replicate their results from the information provided. The assumption that retention of haplotype diversity implies retention of power for association mapping remains unproven.

Material and methods

To test this hypothesis several decisions must be made. The first is to choose real data instead of simulations, since the latter do not credibly mirror the important variables (allele frequencies and LD). We analysed two samples of published genotype data on which current ideas of LD structure are based. Jeffreys et al. (2001) genotyped 296 SNPs in a 216-kb segment of the class II region of the MHC on 6p21.3 in 50 unrelated north-European British males. Daly et al. (2001) typed 103 SNPs in 617 kb on 5q31 in 129 parent-child trios from a European-derived population. We used parents only. For comparability we followed the example of Daly et al. (2001) and rejected SNPs with minor allele frequencies less than 0.05, reducing the number of SNPs in the first sample to 248. The second sample was remarkable for stimulating interest in haplotype blocks while ignoring evidence that recombination at a given step usually affects only a small proportion of haplotypes, most of which extend without recombination over several blocks and steps. In this region the recombinant class accounts for as few as 2% of haplotypes at a small step and a maximum of only 40% of haplotypes at a relatively large step. These features are real and are therefore incorporated in LD maps. However, it would be foolish to base association mapping on a block definition that most haplotypes do not respect and a SNP selection for which there is no consensus. The import of these data is to define a natural haplotype as the sequence between two adjacent recombination events. Unfortunately most overlapping sequences (especially those with short history and therefore less common alleles) did not recombine at the same points, and so there are no natural haplosets. This generality is fatal to an artificial correspondence between blocks and haplotypes, leaving steps in midair. On the contrary, the information in LD and in arbitrary haplosets of uniform numbers of SNPs is unaffected by the non-existence of natural haplosets.

Having rejected simulations in favour of these two datasets, we had to acknowledge that the best way to exploit blocks, steps, LD, haplotypes, and LDU maps has not been established. It is not self-evident that tagging common haplotypes, however defined, leads to accurate localisation of a causal SNP that is likely to be polymorphic in more than one haplotype and to have a different frequency than any of the haplotypes in which it is found. The problem is exacerbated if causal SNPs have not been typed, and becomes worse as marker SNPs are depleted by tagging. The point location to which a haplotype should be assigned, the information weight it should receive, the number of SNPs it contains, the choice of these SNPs, the role of different populations (Lonjou et al. 2003), the overlap of different haplosets, and a host of other statistical questions have not been addressed. Given these uncertainties, we adopted two methods that have been widely applied. The first uses the quadratic form in the composite likelihood

$$\sum_{i=1}^{n} K_{b_i}\left(\hat{b}_i - b_i\right)^2,$$

where n is the number of SNPs (i = 1, 2, ..., n), $\hat{b}_i$ is an estimated regression coefficient with information $K_{b_i}$ for the i-th SNP, and $b_i$ is its expected value under a Malecot model,

$$b_i = (1 - L)\,M e^{-\varepsilon \Delta (S_i - S)} + L,$$

where Δ takes the value 1 if $S_i \geq S$ and −1 otherwise, and n is replaced by m when m < n htSNPs are used. Here $S_i$ is the location of the i-th SNP, S is the location of a causal SNP, and L is the bias (residual association) at large distance (Collins and Morton 1998). L can be predicted ($L_p$) as the mean $K_{b_i}$-weighted absolute value of a normal deviate under H0. The other two parameters, M and ε, have expectations that are functions of the number of generations t since the last population bottleneck and either recombination (ε only) or mutation and drift (M only), as derived by Morton et al. (2001) using a recurrence method introduced by Malecot (1969, 1973), who obtained a similar equation for isolation by distance. The Malecot model has been applied to physical, linkage, and LD maps (Collins and Morton 1998; Maniatis et al. 2002; Zhang et al. 2002c). Recently four subhypotheses have been used for association mapping of oligogenes, only two of which are used here (Maniatis et al. 2004). Model A specifies M = 0 and a uniform distribution at ordinate $L_p$. Model B estimates L with M = 0. Comparison gives

$$\chi^2_1 = \left[\left.\sum K_{b_i}\left(\hat{b}_i - b_i\right)^2\right|_{A} - \left.\sum K_{b_i}\left(\hat{b}_i - b_i\right)^2\right|_{B}\right] / V,$$

where V is the empirical variance, ideally estimated for a nonsyntenic causal SNP. In the absence of such information V has been estimated, with less power, from model B as

$$\left.\sum K_{b_i}\left(\hat{b}_i - b_i\right)^2\right|_{B} / (n - 1).$$

Comparison of models A and B tests for significance of a candidate region and is invariant for physical, linkage, and LD maps. Neither of these models depends on the M or ε parameters of the Malecot model, nor on the locations of causal SNPs within the region. They have been shown to give a reliable test of H0 in these samples (Maniatis et al. 2004). However, many investigators prefer to select the maximal $\chi^2_1$ out of n SNPs or m htSNPs, accepting the heavy Bonferroni correction that this choice entails (Risch and Merikangas 1996). Here we use both methods, expressing results with m < n htSNPs as the relative efficiency (RE), defined as the ratio of $\chi^2_1$ when m htSNPs are used to $\chi^2_1$ when all n SNPs are used. Although not equal to power, RE is monotonic on power and therefore a good surrogate. If desired, RE can be replaced by the ratio of the estimated noncentrality parameters, but ignorance of their exact values makes this more pretentious than useful.
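Under these two submodels the machinery simplifies considerably, because with M = 0 the Malecot expectation reduces to $b_i = L$ for every SNP. A minimal sketch of the model A versus model B contrast, assuming the regression estimates $\hat{b}_i$, their information weights $K_{b_i}$, and the predicted ordinate $L_p$ are already computed (the function name and interface are ours, not from the cited software):

```python
import numpy as np

def model_a_vs_b(bhat: np.ndarray, K: np.ndarray, Lp: float) -> float:
    """Chi-square (1 df) for model B (L estimated, M = 0) against
    model A (L fixed at its predicted value Lp, M = 0)."""
    n = len(bhat)
    # Model A: weighted residual sum of squares about the ordinate Lp.
    ss_a = np.sum(K * (bhat - Lp) ** 2)
    # Model B: L estimated as the K-weighted mean of the bhat values.
    L_hat = np.sum(K * bhat) / np.sum(K)
    ss_b = np.sum(K * (bhat - L_hat) ** 2)
    # Empirical error variance from model B, as in the text (n - 1 df).
    V = ss_b / (n - 1)
    return (ss_a - ss_b) / V
```

Because neither submodel involves ε or the causal location S, the contrast is invariant under choice of physical, linkage, or LD map, as stated above.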

The distribution of minor allele frequencies among randomly selected SNPs is approximately uniform from 0.05 to 0.5, with an excess at smaller frequencies because of population expansion in recent millennia (Botstein and Risch 2003). SNP discovery often excludes frequencies under 0.05, but introduces only minor departures from a uniform distribution at greater frequencies. Neutral theory predicts that minor allele frequencies follow the same distribution for causal as for marker SNPs. We therefore took each of the n+1 SNPs in turn as causal to determine how well the m htSNPs detect it (m < n). In previous trials we found no significant difference in detection of causal SNPs in blocks and steps (Maniatis et al. 2004). As in quantitative genetics, we assumed additivity of allelic effects, assigning allelic counts 0, 1, 2 to the three genotypes as a pseudo-phenotype and regressing this variable on the allele count of the i-th marker. Sample size precluded a case-control design, which would add some complexity but no additional information, since the enrichment factor ω is incorporated in the analysis to recover population frequencies (Collins and Morton 1998).
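A sketch of this pseudo-phenotype regression, assuming a genotype matrix G of allele counts (rows individuals, columns SNPs) with column c taken as causal; the helper below is hypothetical and ignores complications such as missing genotypes and the enrichment factor ω:

```python
import numpy as np

def marker_regressions(G: np.ndarray, c: int):
    """Yield (slope, information) for each marker i != c, regressing
    the causal allele count (pseudo-phenotype) on the marker count.
    The information weight is the reciprocal of the slope variance."""
    y = G[:, c].astype(float)              # allelic counts 0, 1, 2
    n, n_snps = G.shape
    for i in range(n_snps):
        if i == c:
            continue
        x = G[:, i].astype(float)
        b, a = np.polyfit(x, y, 1)         # slope, intercept
        resid = y - (a + b * x)
        sxx = np.sum((x - x.mean()) ** 2)  # assumes marker is polymorphic
        var_b = np.sum(resid ** 2) / (n - 2) / sxx
        yield b, 1.0 / var_b
```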

Our final choice was among measures of diversity. Shannon entropy is applicable to both haplotypes and diplotypes (Shannon 1948). Other programs for SNP selection are specific to diplotypes (Stram et al. 2003) or haplotypes (Zhang et al. 2002a, 2002b). These three approaches do not exhaust the possibilities for htSNP selection, since new variants are reported almost every month. The results are remarkably consistent, and therefore inclusiveness was not attempted. The two samples were analysed separately, and combined results were reported as means of RE, since the differences in RE were not large and the same general principles hold for both samples.

Results

SNP selection by Shannon entropy in diplotypes and inferred haplotypes

Shannon entropy is evaluated as $-\sum_i f_i \ln f_i$, where $f_i$ is an estimate of the frequency of the i-th haplotype or diplotype and $\sum f_i = 1$ (Shannon 1948). For haplotypes, a trio of diallelic markers a, b, c gives up to eight values of $f_i$, which the pair of flanking markers a, c combines into four values. Therefore all haplotype diversity is included. Moving along the region one SNP at a time, we estimated haplotype frequencies for three adjacent SNPs in random diplotypes using the SNPHAP program (http://www-gene.cimr.cam.ac.uk/clayton/software/). To be confident about the estimated frequencies, we used only observations with no missing genotypes among the SNP trios. Shannon entropy may be designated T for trios and P for the flanking pair. In successive rounds the intermediate SNPs with the smallest values of S = T − P are eliminated, reducing the number of retained markers (m) by 10% of the initial number n+1 per round. However, adjacent SNPs are not eliminated in the same round.
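A sketch of the trio criterion, assuming haplotype frequencies for each window of three adjacent SNPs have already been estimated (e.g. by SNPHAP); the dictionary representation and names are ours:

```python
from math import log

def entropy(freqs) -> float:
    """Shannon entropy -sum f ln f, skipping zero frequencies."""
    return -sum(f * log(f) for f in freqs.values() if f > 0)

def middle_snp_score(trio_freqs: dict) -> float:
    """S = T - P: entropy of the a,b,c trio minus entropy of the
    flanking a,c pair, i.e. the diversity contributed by the middle SNP."""
    T = entropy(trio_freqs)               # up to eight haplotypes "abc"
    pair_freqs: dict = {}
    for hap, f in trio_freqs.items():     # collapse the middle allele
        key = hap[0] + hap[2]
        pair_freqs[key] = pair_freqs.get(key, 0.0) + f
    P = entropy(pair_freqs)               # up to four haplotypes "ac"
    return T - P
```

Each round would then drop the intermediate SNPs with the smallest S, 10% of the initial number at a time, never eliminating adjacent SNPs in the same round.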

Diplotypes may be scored from unphased genotypes without inferring haplotypes, with some loss of information but avoiding arbitrary definition of haplosets and omission or misassignment of phase-unknown sites (Ackerman et al. 2003). Then no assumption is made about Hardy-Weinberg frequencies, and so samples from different populations may be pooled if a cosmopolitan set of SNPs is desired. Diplotypes permit recognition of four genotypes whose heterozygosity is contributed uniquely by the intermediate marker b, in contrast to the same a, c genotypes where b is homozygous (Table 1). For both haplotypes and diplotypes, the sums of T, P, and S provide measures of entropy in each round.

Table 1 Shannon entropy for diplosets of SNP trios and pairs with genotypes 11, 12, 22. For the a, b, c trio: $T = -[q \ln q + Q \ln Q + (1 - q - Q) \ln(1 - q - Q)]$, where $Q = n_2/N$, $q = (n_3 + n_4 + n_5 + n_6 + n_7 + n_8)/N$, and $1 - q - Q = n_1/N$. For the a, c pair: $P = -[q \ln q + (1 - q) \ln(1 - q)]$
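A sketch of the diplotype version in Table 1, under the stated grouping: $n_1$ counts the class with frequency $1-q-Q$, $n_2$ the class with frequency $Q$, and $n_3,\dots,n_8$ the classes pooled into $q$; the counts are illustrative inputs, and all three classes are assumed non-empty:

```python
from math import log

def table1_entropies(n1: int, n2: int, rest: list):
    """T, P and S = T - P from diploset counts as in the Table 1
    caption; rest holds the six counts n3..n8 pooled into q."""
    N = n1 + n2 + sum(rest)
    Q = n2 / N
    q = sum(rest) / N                     # so 1 - q - Q = n1 / N
    T = -(q * log(q) + Q * log(Q) + (1 - q - Q) * log(1 - q - Q))
    P = -(q * log(q) + (1 - q) * log(1 - q))
    return T, P, T - P
```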

Suppose the initial set of n+1 SNPs (a maximum of n predictive SNPs plus one causal SNP) is divided into m htSNPs and n−m+1 rejected SNPs. Each of the n+1 SNPs is taken in turn as causal to produce a total of n+1 unique files. With respect to the htSNPs there are n−m+1 external choices, each with m−1 df under model B. There are m internal choices, each with m−2 df, since an additional df is lost because the causal SNP is excluded from the htSNPs. The total number of df in the n+1 files under model B is (m−1)(n−m+1) + m(m−2) = (n+1)(m−1) − m. Let $SS_A$, $SS_B$ denote the sums of the quadratic forms over all files under models A and B. Then $V_B = SS_B/\mathrm{df}$ is the mean error variance per file and $\chi^2_1 = (SS_A - SS_B)/[(n+1)V_B]$ is the mean test of B against A. The relative efficiency is $RE = (\chi^2_1 \mid m)/(\chi^2_1 \mid n)$. With slight modification this approach allows replacement of model B by a more complicated model with r df at the same significance level if $\chi^2_r$ is converted to $\chi^2_1$ (Collins and Morton 1998). The results are substantially the same (data not shown).
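As a check on this bookkeeping, the two df counts reduce to the same total:

$$(m-1)(n-m+1) + m(m-2) = (nm - m^2 + 2m - n - 1) + (m^2 - 2m) = nm - n - 1 = (n+1)(m-1) - m.$$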

Figure 1 shows successful preservation of Shannon entropy: the curves for htSNPs lie consistently above those for random selection. Here the RE of Shannon entropy was calculated as the ratio of the sum of T in the htSNP set of m SNPs to the sum of T in the initial set of n SNPs. RE for P is similar. Selection based on haplotypes preserved slightly more Shannon entropy than selection on diplotypes. Association mapping, however, gave disappointing results: RE declined dramatically as the number of htSNPs was reduced (Fig. 2). Although haplotype selection retained slightly more Shannon entropy, its RE decreased more rapidly than for diplotype selection. By both significance tests the htSNPs lost relative efficiency more rapidly than random selection (Fig. 2).

Fig. 1 Relative efficiency (RE) measured by Shannon entropy for SNP trios

Fig. 2 Relative efficiency (RE) for association mapping using SNPs selected by Shannon entropy

SNP selection in diplotypes

The tagSNPs program was used to select SNPs from diplotype data (Stram et al. 2003), estimating haplotype frequencies by an expectation-maximization (EM) algorithm. Following convention, haplotypes with frequencies greater than 0.05 were considered "common". The program then chose the "best" set of htSNPs, which maximizes the minimum squared correlation (R²) between true and predicted haplotype dosages (counted 0, 1, or 2) for each individual with a particular common haplotype. The EM calculation of predicted dosage is conditional on the genotype data and the haplotype frequencies, taken as if known, assuming Hardy-Weinberg equilibrium. R² is expressed as the ratio of the variance of the predicted dosage (a conditional expectation) to the total variance of the true haplotype dosage. The best set of htSNPs minimizes the uncertainty in prediction of common haplotypes. Because this program is computationally intensive, we used sliding windows of eleven adjacent SNPs, designating the middle one as causal and moving along the chromosome one SNP at a time. Since df = 0 for composite likelihood when m equals 1, we report results for m = 2–10 (20–100%). Mean $\chi^2_1$ for m htSNPs was calculated across the region. We also tried splitting the data into units of 20 or 35 SNPs for selecting SNPs and applied the same strategy for association mapping. The results were similar (data not shown). Figure 3 shows that diversity measured by minimum R² decreased slowly, but RE for association mapping declined rapidly, and faster than for random selection.
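A sketch of the R² criterion, assuming arrays of true and EM-predicted dosages (copies of a given common haplotype per individual, the prediction being the conditional expectation given genotypes); the names are ours rather than the tagSNPs interface:

```python
import numpy as np

def dosage_r2(true_dosage: np.ndarray, predicted_dosage: np.ndarray) -> float:
    """Variance of the predicted (expected) dosage over the variance
    of the true dosage for one common haplotype."""
    return np.var(predicted_dosage) / np.var(true_dosage)

def min_r2(true_by_hap: dict, pred_by_hap: dict) -> float:
    """The criterion the htSNP set maximizes: the minimum R^2 over
    all common haplotypes."""
    return min(dosage_r2(true_by_hap[h], pred_by_hap[h]) for h in true_by_hap)
```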

Fig. 3 Relative efficiency (RE) for association mapping using SNPs selected in diplotypes by maximizing the minimum squared correlation (R²)

SNP selection in haplotypes

Haplotypes containing all n+1 SNPs were inferred by the SNPHAP program as described previously, and the HapBlock program (Zhang et al. 2002a, 2002b) was then used to select SNPs. This program implements a dynamic programming algorithm for haplotype block partitioning and htSNP selection on haplotype data, finding the optimal partition into blocks with the minimum total number of SNPs. A segment of consecutive SNPs forms a block if at least α% of its haplotypes are represented more than once (Patil et al. 2001). SNPs are then selected to minimize the number that can distinguish at least α% of the haplotypes in a block. An alternative criterion is β, the percentage of haplotype diversity accounted for by the minimum number of SNPs (Johnson et al. 2001). As Zhang et al. (2002b) indicated, these two options are related in terms of conserving haplotype diversity. To obtain different proportions of selected SNPs, we varied the α or β parameter from 0.50 to 0.99; the higher the value, the more SNPs are selected. The same tests for association mapping were carried out as described for Shannon entropy. We present only results for the α parameter, since choosing β gave very similar results.
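A sketch of the α criterion for a single candidate segment, assuming haps is a list of haplotype strings over its consecutive SNPs; the functions are ours, not the HapBlock interface:

```python
from collections import Counter

def coverage(haps) -> float:
    """Fraction of haplotypes represented more than once."""
    counts = Counter(haps)
    return sum(c for c in counts.values() if c > 1) / len(haps)

def is_block(haps, alpha: float) -> bool:
    """Patil et al. (2001): a segment of consecutive SNPs forms a
    block if at least alpha of its haplotypes recur."""
    return coverage(haps) >= alpha
```

The dynamic programming then searches over partitions of the region into such blocks, minimizing the total number of SNPs needed to distinguish at least α% of the haplotypes within each block.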

The maximum number of htSNPs that the program chose for the 6p21 data is about half the original number, reflecting the strong LD block structure, so proportions between 50% and 100% could not be examined. As with the other two methods, RE for association mapping declined rapidly, and faster than for random selection, as the proportion of htSNPs diminished (Fig. 4).

Fig. 4 Relative efficiency (RE) for association mapping using SNPs selected in haplotypes by haplotype block partitioning, minimizing the number of SNPs that distinguish at least α% of the haplotypes

Effects of allele frequency

The observation that htSNP selection generally performed worse than random selection for detecting causal loci is not puzzling if we consider the allele frequencies of the htSNPs. Figure 5 shows that all the methods tended to select common SNPs, and this tendency becomes more pronounced as smaller proportions of SNPs are selected. The curves for random selection are flat and are therefore omitted.

Fig. 5 Mean minor allele frequency in SNPs selected by the indicated methods

Next we classified SNPs either by their minor allele frequency q or by the absolute difference from the minor allele frequency $q_c$ of the causal SNP, $\Delta = |q - q_c|$. SNPs were sorted on q or Δ, and the most common or rare SNPs (q large or small), or the SNPs with largest or smallest Δ, were selected until the required number was obtained. Selection on q has the expected effect, but selection for large or small Δ, like random selection, does not alter the mean allele frequencies of the selected SNPs (Fig. 6). Figure 7 shows that the RE of the maximum chi-square for selection on q declined faster than for random selection. It is striking that selection on small Δ gave the greatest retention of RE. Conversely, selection on large Δ gave the least retention, declining to nearly zero.
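A sketch of the frequency-matched selection, assuming q holds marker minor allele frequencies and qc that of the causal SNP; the function is illustrative only, since in practice qc is unknown:

```python
import numpy as np

def select_by_delta(q: np.ndarray, qc: float, m: int, smallest: bool = True):
    """Indices of the m markers with the smallest (or largest)
    absolute frequency difference Delta = |q - qc| from the causal SNP."""
    order = np.argsort(np.abs(q - qc))    # ascending Delta
    return order[:m] if smallest else order[-m:]
```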

Fig. 6 Mean minor allele frequency in SNPs selected by large or small minor allele frequency (q) or absolute difference (Δ) in minor allele frequency between the marker and the causal SNP

Fig. 7 Relative efficiency (RE) of maximum chi-square for association mapping using SNPs selected by large or small minor allele frequency (q) or absolute difference (Δ) in minor allele frequency between the marker and the causal SNP

One reason for the poor performance of the htSNPs in this study is apparent: SNPs selected by maximizing haplotype diversity tend to be common, making the average minor allele frequency of htSNPs greater than the average in the initial set, which was 0.27 in both samples. This increased discrepancy in minor allele frequencies costs relative efficiency. Since $q_c$ is generally unknown, selecting common SNPs is inefficient for detecting much rarer causal SNPs.

These results explain earlier observations that were puzzling. Selection based on Shannon entropy preserves more entropy in haplotypes than in diplotypes (Fig. 1). As predicted, haplotype selection tends to retain more common SNPs than diplotype selection (Fig. 5), and therefore RE decreases more rapidly for haplotype than for diplotype selection (Fig. 2).

Discussion

Estimates of power in association mapping by LD have serious problems. Bayesian methods attempt full likelihood by simultaneous estimation of population parameters, constrained to the likely location of a candidate marker and then treated as prior probabilities (Morris et al. 2002). Most geneticists dismiss this as not applicable to an uncertain location or useful for a certain one. As Devlin et al. (1996) delicately put it, "we believe that the full likelihood model will be very difficult to specify without unrealistically stringent assumptions about population history". The simplest alternative is to choose the marker with the largest χ², but this requires a drastic Bonferroni correction (Risch and Merikangas 1996) and has not been shown to give a good estimate of location. Therefore most association mapping by LD uses some form of composite likelihood for single markers or haplotypes (Terwilliger 2000; Collins and Morton 1998; Zaykin et al. 2002). This works rather well for point estimates, but estimates of standard errors, and therefore power, are less reliable than for true likelihood (Devlin et al. 1996). We try to allow for this by using the residual variance as error, which increases as the proportion of htSNPs declines. Perhaps in consequence, RE for composite likelihood declines faster than for maximum χ², although $\chi^2_1$ for composite likelihood exceeds the maximum χ² for a single marker until the proportion of htSNPs is very small. Obviously information for association mapping is not restricted to the predictive SNP with the maximum χ². A better solution for composite likelihood that allows explicitly for autocorrelation would be welcome, but present evidence does not justify the hope that haplotype tagging can greatly reduce the number of markers without substantial loss of power for association mapping.

The optimal distribution of minor allele frequencies for predictive SNPs is unsettled. A high frequency is inefficient for detecting less common causal SNPs, is associated with multiple haplotypes, and its linkage disequilibrium with other SNPs declines rapidly. A low frequency is inefficient for detecting common causal SNPs. SNPs with a large effect on a serious disease tend to have small frequencies because of negative selection, but there is no basis to infer whether oligogenes of smaller effect are more likely to be common or rare (Reich and Lander 2001; Pritchard 2001; Weiss and Clark 2002; Pritchard and Cox 2002; Clark 2003). We took each SNP in turn as causal, with minor allele frequencies ranging from 0.05 to 0.50, in agreement with Botstein and Risch (2003).

There is no validated method for association mapping by haplotypes. When it is developed, selecting haplotype-tagging SNPs will require three preconditions, an operation, and a hope. The three preconditions are: selection of a genomic region; typing a marker set within the region; and estimation of haplotype frequencies in that marker set. Only after these preconditions are met can haplotype tagging take place, with the goal of reducing the number of markers without substantial loss of haplotype diversity and the hope that this will assure no substantial loss of power for association mapping. The profound impact of these constraints is beginning to be appreciated (Meng et al. 2003; Sebastiani et al. 2003). The genomic region is often poorly defined, ranging from an oligonucleotide to a whole chromosome, and it may contain no causal SNP or more than one, distributed among different exons, introns, promoters, and loci. The initial marker set is a small fraction of the number in the region, chosen by an arbitrary mixture of convenience, prior claims of disease association, and assumptions about desirable density, allele frequencies, and locations in blocks or steps (however defined). Haplotype frequencies are determined by assumptions about trial values, frequency of typing error, and the biological significance of common haplotypes, which in turn has depended on controversial application of evolutionary theory for nonrecurrent mutations in the absence of recombination (Crow and Kimura 1970; Reich and Lander 2001). On these shifting sands haplotype tagging takes many forms, accepting or dismissing arbitrary block definition and accepting arbitrary percentages of differently defined haplotype diversity, without reference to the much larger number of markers excluded from the initial set. Finally, we are asked to accept without proof that power for association mapping has a monotonic relation to haplotype diversity however the initial set was chosen, whether or not a causal SNP was included, however small the representation of markers in the region may be, and whatever the method of association analysis. When we consider that the costs and benefits of testing more markers or sampling more individuals are uncertain and in rapid flux, it is impossible to avoid the suspicion that application of haplotype tagging will be at the expense of successful association mapping.

The title of this paper is a query that cannot be conclusively answered. On the one hand, there is no evidence that haplotype diversity is monotonic on power for association mapping, and our negative evidence has been supported (Zhao et al. 2003). On the other hand, a universal negative is unprovable across the many combinations of tagging algorithm, region, and initial marker set. The current generation of tagging programs is computationally prohibitive for large numbers of markers, and the haplotypes so defined are too numerous for powerful analysis. There is no justification for taking our results as the best or worst that haplotype tagging can do for association mapping. However, the burden of proof is not on us to bury haplotype tagging, but on its advocates to show cause why it should not be buried.

Even if the best of many contenders is adopted to tag haplotypes and a method of analysis is developed to minimise loss of power, there is a strong argument that miracles from haplotype tagging should be regarded cautiously. The HapMap project, intended as a tool for association mapping, is far from a consensus about its product and farther still from convincing its detractors (Couzin 2002). Suppose that HapMap reaches its goal of mapping 600,000 SNPs and characterising in some way the overlapping haplotypes they form. The total number of SNPs in the genome is in excess of three million, and may be as many as ten million (Kruglyak and Nickerson 2001), or at least 15 million if SNPs with frequencies as small as 0.01 are included (Botstein and Risch 2003). The probability that a particular causal SNP is represented among 600,000 marker SNPs is therefore less than 0.20 and may be as little as 0.06, without considering rare SNPs that are usually neglected in LD mapping. If the causal SNP were included, haplotypes would provide no additional information for its recognition. If the causal SNP is not included, haplotype-specific frequencies cannot be estimated accurately, since each defines an unidentified subset of recognised haplotypes, the number of which increases with the number of recognised SNPs. Common haplotypes have especially low power when the causal marker is uncommon. The most efficient strategy for association mapping is to type as many markers as can be afforded in a large region, with a greater density in a candidate region. Uniform spacing on the LD map, with additional SNPs in "holes" where the distance between successive markers is greater than 1 LDU, is the appropriate goal to implement this strategy. The cost of uniform spacing on the physical map has been investigated (Wang and Todd 2003). Additional constraints, such as rejecting all but the most common SNPs, assume a consensus that does not exist.
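The stated coverage probabilities follow directly from these counts:

$$\frac{6\times 10^{5}}{3\times 10^{6}} = 0.20, \qquad \frac{6\times 10^{5}}{10^{7}} = 0.06, \qquad \frac{6\times 10^{5}}{1.5\times 10^{7}} = 0.04,$$

the last figure applying only if SNPs as rare as 0.01 are counted.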

These considerations favour a multi-stage design. Stage 1 is a genome scan by linkage or association at low resolution. Stage 2 tries to refine the candidate region by association at moderate resolution or (with less power) by linkage. Stage 3 attempts to identify a candidate locus within a supported region, using functional tests on SNPs at high resolution. The goal of these preliminary stages is to determine whether a particular region suggested by linkage, LD, or function has one or more causal SNPs for a particular disease. Haplotype analysis has a possible but still undefined role in assuring equal spacing of markers on the LD map, with several SNPs at different frequencies in each LDU. Pooling of samples may be considered, at the cost of sacrificing haplotypes. Concentration on common SNPs should be avoided. In studies that cover more than a small region it is currently impractical to aim for all SNPs, and therefore identification of causal SNPs is unreliable. The analysis leans heavily on models A and B, although location models are also useful at this stage, especially if the region is large. The false discovery rate (FDR), introduced for linkage (Morton 1955), is useful to evade a costly Bonferroni correction.

Stage 4 is the end-game that begins when a candidate locus or a small candidate region has been confirmed. All SNPs should be sought, supported by sequencing of cases and controls and with functional tests if feasible. The FDR is uniquely adapted to identification of causal SNPs (Storey and Tibshirani 2003). Haplotype analysis has nothing to offer in stage 4 because the goal is to identify causal SNPs. Stepwise regression is appropriate if all SNPs have been tested, but in that case their genotypes are known and the ρ model testing each SNP for causality is more powerful. All recent research indicates the need in stage 4 to type virtually all markers on the critical LD map (Wang and Todd 2003). For example, 11 SNPs in a 2.5-kb region were required to provide modest evidence of association for the NOS2A promoter with cerebral malaria, with possible identification of a causal SNP (Burgner et al. 2003). A similar density has been proposed for genome scans of nonsynonymous substitutions, splice junctions, and promoter regions where the higher frequency of rare alleles in random SNPs is enhanced by negative selection (Botstein and Risch 2003). None of the current methods to select htSNPs is appropriate after stage 1. Recently tagSNPs selected without regard to haplotypes have been advocated, replacing blocks of contiguous SNPs by bins of highly associated SNPs that need not be contiguous (Carlson et al. 2004). SNPs with low minor allele frequency are associated with extensive LD and therefore tend to be selected as the number of tagSNPs diminishes. Compared with htSNPs, this enhances power to detect causal alleles that are at low frequency by chance or negative selection, but in our trials the power is less than for random SNPs. It is likely that a method based on the LD map and more efficient than composite likelihood for single SNPs will be developed, but one limitation cannot be overcome: haplotypes or bins of dense markers are a poor surrogate for a causal SNP, and haplotypes or bins of a smaller number of htSNPs or even tagSNPs are worse.