Introduction

After the successes of the HapMap project (Frazer et al. 2007; Altshuler et al. 2005), genome-wide association studies (GWAS) have enabled the identification of more candidate genes related to diseases (Wellcome Trust Case Control Consortium 2007; Easton et al. 2007; Maris et al. 2008; Scott et al. 2009; Erdmann et al. 2009; Barrett et al. 2009; Cho et al. 2009). Reviews regarding GWAS presented current concerns and future research strategies, which emphasized the importance of obtaining common, truly functional polymorphisms influencing expression in a locus or protein structure (Altshuler et al. 2008; Hardy and Singleton 2009; Manolio et al. 2009). It is now of primary and imperative interest to find the functional polymorphisms associated with diseases through dense genotyping or re-sequencing of target regions. Typically, several polymorphisms simultaneously show associations in the target locus (Yasuda et al. 2008; Unoki et al. 2008; Wrensch et al. 2009; Shete et al. 2009; Song et al. 2009). The significant associations of many markers in target gene loci are primarily due to linkage disequilibrium (LD) between the markers and the real disease polymorphisms. It is important to use the LD information to separate true associations, but there is no actual method that primarily makes use of LD information to discriminate the associations of real disease polymorphisms from the associations of marker polymorphisms.

Early efforts to discriminate real disease polymorphisms from several associated marker polymorphisms in case–control association studies have made use of conventional statistical solutions to deal with confounders (Wrensch et al. 2009; Nicodemus et al. 2004), and the usual conclusion from current statistical methods is that the most significantly associated polymorphism is the disease polymorphism. Recent efforts include more advanced statistical approaches, such as various regression, ensemble, and network methods (Szymczak et al. 2009; Charoen et al. 2007). However, these recent studies in case–control associations are more focused on detecting main effects or gene–gene interaction in genome-wide association data than distinguishing real disease polymorphisms in a genetic locus, and they have demonstrated that there are problems identifying disease polymorphisms using current methods. In family based association studies, there have been efforts for identifying polymorphisms that explain a linkage signal (Biernacka and Cordell 2009). Similar efforts were partially demonstrated in case–control association studies based on a step-wise regression, but there were difficulties in differentiating between potentially causal polymorphisms and other polymorphisms (Biernacka et al. 2007; Cordell and Clayton 2002).

The classical methods in statistics for dealing with confounding factors may work less efficiently in the discrimination of real disease polymorphisms from associated markers in LD compared to methods primarily based on the actual LD relationship. Based on the LD relationship, a previous study tested whether the positive association of most polymorphisms in the APOE gene region with Alzheimer’s disease comes from a single disease polymorphism, the one encoding ApoE ε4 (Park 2007). That study found that these associations are difficult to explain with the single disease polymorphism. Expanding on this previous effort, the current study developed a method for identifying disease polymorphisms from the genotypes of cases and controls. Since there is no information regarding which polymorphisms are the disease polymorphisms, likelihood ratio tests for various models should be considered to find the most likely set of disease polymorphisms from the data. If a model is accepted, then it probably indicates the set of true disease polymorphisms and can tell us how the gene influences disease presentation.

Methods

Likelihood ratio tests based on models

In this study, all of the genotyped polymorphisms were considered potential disease polymorphisms. To find the most likely set of disease polymorphisms, all possible tests using the genotype data were conducted, ranging from the model with one disease polymorphism to the model with the true number of disease polymorphisms, at which point one model is usually accepted because it is the correct model. In actual case–control association studies, the tests start from the model with one disease polymorphism and continue until a model is accepted at a certain number of disease polymorphisms, which is then considered the true number of disease polymorphisms. For each model, the expected allele frequencies in cases were calculated based on the control allele frequency data, the case allele frequency data of the targeted disease polymorphism(s), and the LD between the markers and the disease polymorphism in controls. Likelihood ratio tests were then conducted for the expected case allele frequencies using the case genotype data.

When testing the model with only one disease polymorphism, the calculation is simple. It involves adding the portion changed due to the disease polymorphism based on the LD relationship to the control marker allele frequency. When there is one real disease polymorphism, the odds ratio of the disease allele is directly estimated; the actual odds ratio of the disease allele is the same as the observed odds ratio. When the model contains two or more disease polymorphisms, the calculation of the expected allele frequencies of polymorphisms in cases is more complicated. In the cases, the observed allelic odds ratios of disease polymorphisms calculated from the genotype data are not the same as the real odds ratio, given that the frequency of a disease polymorphism is influenced by the frequencies of other disease polymorphism(s) due to LD between disease polymorphisms. Therefore, the independent odds ratios of disease polymorphisms should be estimated first. Based on the independent odds ratios, the expected frequencies of markers were estimated from the LD relationship with the disease polymorphisms. The general expression of marker allele frequencies in cases is expressed as shown in Eq. 1.

$$ p_{M_i'} = p_{M_i} + \sum_{j} \Delta_j \frac{p_{M_i d_j} - p_{M_i} p_{d_j}}{p_{d_j} (1 - p_{d_j})} $$
(1)

In Eq. 1, p(Mi′) indicates the marker allele frequency in cases; p(Mi) indicates the marker allele frequency in controls; p(Midj) is the frequency of haplotype Midj; p(dj) indicates the disease allele frequency in controls; Δj indicates the genuine difference in the allele frequency of disease polymorphism j between controls and cases, from which the effects of all other disease polymorphisms in LD are excluded. Therefore, Δj can be used to derive the real odds ratio of the disease polymorphism. The Δj values can be obtained by solving the linear system shown in Eq. 2.

$$ \begin{pmatrix} p_{d_1'} - p_{d_1} \\ p_{d_2'} - p_{d_2} \\ \vdots \end{pmatrix} = \begin{pmatrix} 1 & \frac{p_{d_1 d_2} - p_{d_1} p_{d_2}}{p_{d_2} (1 - p_{d_2})} & \cdots \\ \frac{p_{d_1 d_2} - p_{d_1} p_{d_2}}{p_{d_1} (1 - p_{d_1})} & 1 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \begin{pmatrix} \Delta_1 \\ \Delta_2 \\ \vdots \end{pmatrix} $$
(2)
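As an illustration of how Eqs. 1 and 2 fit together, the following minimal Python sketch solves the linear system of Eq. 2 for the Δ values and then applies Eq. 1 to one marker. It is not the IFP implementation; the function names (`ld_weight`, `solve_deltas`, `expected_case_freq`) and the input layout (lists of control allele frequencies and a matrix of control two-locus haplotype frequencies) are assumptions made only for this example.

```python
def ld_weight(p_ab, p_a, p_b):
    """LD coefficient D = p_ab - p_a*p_b scaled by p_b*(1 - p_b), as in Eqs. 1 and 2."""
    return (p_ab - p_a * p_b) / (p_b * (1.0 - p_b))


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: no unique solution for delta")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def solve_deltas(p_d, p_dd, p_d_case):
    """Eq. 2 (hypothetical layout): p_d[j] is the control frequency of disease
    allele j, p_dd[i][j] the control frequency of haplotype d_i d_j, and
    p_d_case[j] the observed case frequency of disease allele j."""
    k = len(p_d)
    a = [[1.0 if i == j else ld_weight(p_dd[i][j], p_d[i], p_d[j])
          for j in range(k)] for i in range(k)]
    b = [p_d_case[i] - p_d[i] for i in range(k)]
    return solve_linear(a, b)


def expected_case_freq(p_m, p_md, p_d, deltas):
    """Eq. 1: expected case frequency of a marker allele, where p_md[j] is the
    control frequency of the haplotype carrying the marker allele and d_j."""
    return p_m + sum(dj * ld_weight(p_md[j], p_m, p_d[j])
                     for j, dj in enumerate(deltas))
```

A model would then be screened by checking that `solve_deltas` does not raise and that every resulting frequency lies strictly between 0 and 1.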

In some cases, there may be no solution for delta or the estimated allele frequencies and/or delta may be out of the appropriate range. These cases indicate that the underlying model with the targeted disease polymorphisms is not appropriate. It should therefore be excluded in the search for the correct model with the real disease polymorphisms.

The hypotheses for a set of selected potential disease polymorphisms can be tested using likelihood ratio tests over the case genetic data. For a given hypothesis, the frequency changes in other polymorphisms are due to frequency changes in the selected polymorphism(s) and the LD relationship with the selected polymorphism(s). Based on this hypothesis, the case allele frequencies are calculated as previously indicated in Eq. 1. It should be noted that the allele frequencies in controls are not real control population frequencies because the control allele frequencies come from a sampling of the target population. Therefore, there should be a correction term for the likelihood ratio tests.

As indicated, because the parameters of the underlying hypothesis in the likelihood ratio tests vary due to sampling, a correction for this variance is necessary. When the binomial distribution is approximated by a normal distribution, Eq. 3 is derived. In this equation, θ0 is the parameter derived from the selected model as described above; np(1−p) is the variance of the maximum likelihood estimate \( \hat{\theta } \); σ² is the variance of θ0. Since θ0 and \( \hat{\theta } \) are both approximately normal, their difference is also normally distributed, and the last factor of the product in the top equation of Eq. 3 converges to a chi-squared distribution with one degree of freedom. Therefore, the statistic in the bottom equation has an approximately chi-squared distribution with one degree of freedom.

$$ \begin{aligned} -2\log\left( \frac{L(\theta_0)}{L(\hat{\theta})} \right) &= \frac{(\hat{\theta} - \theta_0)^2}{np(1 - p)} = \frac{np(1 - p) + \sigma^2}{np(1 - p)} \times \frac{(\hat{\theta} - \theta_0)^2}{np(1 - p) + \sigma^2} \\ -2\log\left( \frac{L(\theta_0)}{L(\hat{\theta})} \right) \times \frac{np(1 - p)}{np(1 - p) + \sigma^2} &\sim \chi_1^2 \end{aligned} $$
(3)
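The variance correction in Eq. 3 can be made concrete with a short sketch. The code below computes the corrected statistic for one marker on the allele-count scale. It assumes, for illustration only, that \( \hat{\theta } \) is the observed allele count among n case chromosomes, that θ0 = n·p0 with p0 the expected case frequency under the model, and that p in np(1−p) is taken as p0; the function names are hypothetical.

```python
def uncorrected_lrt_term(case_count, n, p0):
    """Normal approximation of -2 log(LR) for one marker (top line of Eq. 3).

    case_count: observed allele count in n case chromosomes (theta-hat, assumed)
    p0: expected case allele frequency under the tested model (assumed scale)
    """
    theta0 = n * p0
    return (case_count - theta0) ** 2 / (n * p0 * (1.0 - p0))


def corrected_lrt_term(case_count, n, p0, var_theta0):
    """Variance-corrected statistic (bottom line of Eq. 3).

    var_theta0 is sigma^2, the sampling variance of theta0 on the count scale.
    """
    binom_var = n * p0 * (1.0 - p0)
    scale = binom_var / (binom_var + var_theta0)
    return uncorrected_lrt_term(case_count, n, p0) * scale
```

With σ² = 0 the two functions agree; any positive σ² shrinks the statistic, which is the sense in which the correction accounts for sampling variability in the model-derived parameter.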

The variance of θ0, σ², can be estimated approximately through simulations. For the simulations, the observed parameters from the sample were used in place of the unknown population parameters. Simulations were repeated 1,000 times to estimate the approximate variances. This simulation was performed for each polymorphism in a model with selected disease polymorphisms. Therefore, there were n chi-squared statistics, with n being the number of polymorphisms excluding the disease polymorphism(s). Their sum follows the chi-squared distribution with n degrees of freedom, and this sum was the actual test statistic in this method. The software that implements the computation of the likelihood ratio test statistics is available as an R package named “IFP” (Identifying Functional Polymorphisms).
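The simulation step can be sketched as follows for the simplest case, a model with one disease polymorphism and one marker. This is an assumed resampling scheme, not the IFP code: control haplotypes and the case disease allele count are redrawn from their observed frequencies, θ0 is recomputed from Eq. 1 each time, and the sample variance of θ0 is returned. The function names and the haplotype keys (`"MD"`, `"Md"`, `"mD"`, `"md"`) are hypothetical.

```python
import random


def simulate_theta0_variance(hap_freq, n_ctrl, p_d_case, n_case,
                             reps=1000, seed=0):
    """Approximate sigma^2, the variance of theta0 = n_case * expected case
    frequency of marker allele M, under a single-disease-allele model.

    hap_freq: observed control haplotype frequencies keyed "MD", "Md", "mD",
              "md" (M = marker allele, D = disease allele); assumed layout.
    """
    rng = random.Random(seed)
    haps = ["MD", "Md", "mD", "md"]
    weights = [hap_freq[h] for h in haps]
    values = []
    for _ in range(reps):
        # Resample control chromosomes from the observed haplotype frequencies.
        draw = rng.choices(haps, weights=weights, k=n_ctrl)
        f = {h: draw.count(h) / n_ctrl for h in haps}
        p_m = f["MD"] + f["Md"]
        p_d = f["MD"] + f["mD"]
        if p_d <= 0.0 or p_d >= 1.0:
            continue  # Eq. 1 is undefined for a monomorphic disease site
        # Resample the case disease allele frequency (binomial draw).
        case_d = sum(rng.random() < p_d_case for _ in range(n_case)) / n_case
        delta = case_d - p_d
        p_m_case = p_m + delta * (f["MD"] - p_m * p_d) / (p_d * (1.0 - p_d))
        values.append(n_case * p_m_case)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def model_statistic(marker_terms):
    """Sum per-marker corrected statistics; returns (statistic, degrees of freedom)."""
    return sum(marker_terms), len(marker_terms)
```

The summed statistic returned by `model_statistic` would be compared with a chi-squared distribution whose degrees of freedom equal the number of non-disease markers, as described above.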

Simulation studies for estimating the type I and II error rates

To generate the virtual data set, the re-sequenced public data of the APOE region were used (http://www.droog.gs.washington.edu/mdecode/data/apoe/) (Nickerson et al. 2000). The data provided sequencing results for 48 individuals, and there were 19 polymorphic sites (Fig. 1; Qin et al. 2002; Barrett et al. 2005). Three pairs of these sites were in complete LD with one another, as indicated in black in Fig. 1. Each one of the pairs in complete LD was eliminated for this study because they produced a duplicate result for another SNP in complete LD. When the control sample size is large enough, the probability of observing polymorphisms in complete LD decreases without requiring the removal of portions of the data. Among the 16 polymorphic sites remaining after the elimination, only seven single nucleotide polymorphisms (SNPs) had a minor allele frequency higher than 0.1. These polymorphisms in the gene region showed both high and low LD between polymorphisms. Therefore, the effect of both high and low LD could be examined in this study. The seven polymorphisms with frequencies higher than 0.1 were used to generate primary data, and the 16 polymorphisms were used for examining the effect of using re-sequenced data.

Fig. 1 LD (r²) between all polymorphic sites from re-sequenced APOE data (the seven SNPs with minor allele frequencies higher than 0.1 are indicated as squares)

For the simulation, control sample data were derived from binomial sampling using the frequencies of the APOE data, depending on the control sample sizes, to obtain the control allele frequencies and LD between the polymorphisms. The haplotype frequencies of the APOE data were estimated using the R package haplo.stats (Yu and Schaid 2007; Schaid et al. 2002). To generate case data, various sets of disease polymorphisms were selected from the seven polymorphisms. Many recent GWAS have discovered associated polymorphisms with odds ratios of ~1.5 (Manolio et al. 2009). Since it is expected that further fine mapping using dense genotyping or re-sequencing might reveal associations with odds ratios higher than 1.5, the independent odds ratio of each disease polymorphism was fixed at 2.0. Using Eq. 2, the observed differences in allele frequencies of disease polymorphisms between cases and controls were calculated based on the original differences in each polymorphism (independently of other disease polymorphisms). Additionally, using Eq. 1, the expected allele frequencies of other polymorphisms in cases were calculated. These frequencies and the case sample size were used to generate case data from binomial sampling.
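The case-generation step can be illustrated with a small sketch. It assumes that an independent allelic odds ratio is converted to an original frequency difference Δ through the allelic odds, p′ = OR·p / (1 − p + OR·p), and then pushes the Δ values forward through Eq. 2 to obtain the observed differences. The function names and the example frequencies are hypothetical and are not taken from the APOE data.

```python
import random


def case_freq_from_or(p, odds_ratio):
    """Allele frequency whose odds equal odds_ratio times the odds of p (assumed conversion)."""
    return odds_ratio * p / (1.0 - p + odds_ratio * p)


def observed_differences(p_d, p_dd, odds_ratios):
    """Forward use of Eq. 2: independent deltas -> observed case-control differences.

    p_d[j]: control frequency of disease allele j
    p_dd[i][j]: control frequency of haplotype d_i d_j (diagonal unused)
    """
    deltas = [case_freq_from_or(p, o) - p for p, o in zip(p_d, odds_ratios)]
    k = len(p_d)
    diffs = []
    for i in range(k):
        total = 0.0
        for j in range(k):
            if i == j:
                weight = 1.0
            else:
                d_ij = p_dd[i][j] - p_d[i] * p_d[j]
                weight = d_ij / (p_d[j] * (1.0 - p_d[j]))
            total += deltas[j] * weight
        diffs.append(total)
    return deltas, diffs


def allelic_odds_ratio(p_case, p_ctrl):
    """Observed allelic odds ratio from case and control allele frequencies."""
    return (p_case / (1.0 - p_case)) / (p_ctrl / (1.0 - p_ctrl))


def sample_allele_freq(p, n_chromosomes, rng):
    """Binomial sampling of an allele frequency for a simulated sample."""
    return sum(rng.random() < p for _ in range(n_chromosomes)) / n_chromosomes


# Hypothetical example: two disease alleles in strong positive LD, with
# opposing effects (odds ratio 0.5 on one tracked allele, 2.0 on the other).
p_d = [0.30, 0.25]
p_dd = [[0.30, 0.22], [0.22, 0.25]]
deltas, diffs = observed_differences(p_d, p_dd, [0.5, 2.0])
observed_or = [allelic_odds_ratio(p + d, p) for p, d in zip(p_d, diffs)]
rng = random.Random(0)
case_sample = [sample_allele_freq(p + d, 1000, rng) for p, d in zip(p_d, diffs)]
```

In this hypothetical configuration both observed odds ratios are pulled toward 1 relative to the independent values, which is the kind of attenuation that LD between disease polymorphisms with opposing effects can produce.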

Determination of the correct model is based on the results of tests with various models; therefore, both Type I and II error rates are important in this approach, concerning the intrinsic multiple testing in this method. To estimate the Type I and Type II error rates, the generation of case and control data was repeated 1,000 times for each set of disease polymorphism(s) with a fixed odds ratio(s). The likelihood ratio tests were applied to each set of data for various models, assuming sets of the true disease polymorphism(s). For instance, assume that the true model is that SNP2 is the only disease polymorphism among seven SNPs with an allelic odds ratio of 2.0. Seven possible models can be tested for a single disease polymorphism. Among these seven models, the true model is the one in which the disease polymorphism is SNP2. The −2 log (likelihood ratio) with variance correction is distributed approximately as a chi-squared distribution (Eq. 3), from which the Type I error rate can be estimated. The other six models are incorrect models for the data; their likelihood ratio tests are rejected in most cases, providing reasonably low Type II error rates.

The model with real disease polymorphism(s) was tested to obtain the Type I error rate, and all other possible models were tested to obtain the Type II error rate. In these tests it is inevitable that the value of the −2 log (likelihood ratio) with variance correction becomes small enough to accept the model whenever the model involves all of the real disease polymorphisms. Some models with a greater number of disease polymorphisms than the real number include all of the real disease polymorphisms; the acceptance of these models results in Type II errors. Therefore, the tests for identifying the appropriate model should stop at the actual number of disease polymorphism(s), at which point the right model is usually accepted. In inferring the appropriateness of the accepted model, examination of all of the likelihood tests could be helpful from one disease polymorphism to N disease polymorphisms, where N is the total number of polymorphic sites in the region.
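The stopping rule described above can be sketched as a simple search over candidate sets. The helper `test_model` is a hypothetical callable supplied by the user: it is assumed to return the P value of the variance-corrected likelihood ratio test for a candidate set, or `None` when Eq. 2 has no admissible solution for that set.

```python
from itertools import combinations


def search_models(snps, test_model, alpha=0.05, max_size=None):
    """Test all candidate sets of size 1, 2, ... and stop at the first size
    at which at least one model is accepted (P value >= alpha).

    Returns (number of disease polymorphisms, accepted models sorted by
    decreasing P value), or (None, []) if no model is accepted.
    """
    if max_size is None:
        max_size = len(snps)
    for k in range(1, max_size + 1):
        accepted = []
        for subset in combinations(snps, k):
            p_value = test_model(subset)
            if p_value is not None and p_value >= alpha:
                accepted.append((subset, p_value))
        if accepted:
            # Within one size all models share the same degrees of freedom,
            # so the largest P value corresponds to the smallest statistic.
            accepted.sort(key=lambda item: item[1], reverse=True)
            return k, accepted
    return None, []
```

Stopping at the smallest accepted size avoids accepting larger models that merely contain all of the real disease polymorphisms.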

The closest previous method for identifying real disease polymorphisms, the stepwise regression method (Biernacka et al. 2007; Cordell and Clayton 2002), was performed using the simulated data to compare with the current method. Simulated data were generated based on the same APOE data assuming causal polymorphisms with fixed odds ratios, similar to the generation of the virtual data for the current method. The original application of stepwise regression to case–control association study was based on the genotype effect rather than allelic effect. Since the alleles should be the variables in the regression instead of genotypes for the comparison with the current method, the relationship between alleles (haplotypes) should be known. Haplotypes based on the APOE data were obtained using the R statistical package, haplo.stats (Yu and Schaid 2007; Schaid et al. 2002), and used for generating the control and case data, in which the alleles were coded as −1 and 1 for the regression. Sample sizes were 500 cases and 500 controls (1,000 case haplotypes and 1,000 control haplotypes), and simulations were repeated 1,000 times.

The effects of the sample size and odds ratio were examined with regard to error rates. Currently, a large control sample size is common due to huge population genetic studies and large-scale case–control studies. A variable number of controls (from 500 to 10,000) and cases (from 500 to 5,000) were examined to determine the appropriate sample sizes for circumstances with high error rates. Considering recent GWAS results, variable odds ratios from 1.2 to 3.0 were applied to examine the effect of odds ratio on error rates. Although computation time is longer, using fully re-sequenced data might reduce Type II error by providing more data to test the appropriateness of the model. Therefore, these types of data were also examined for Type I and Type II error rates. Finally, a study of the APOE association with Alzheimer’s disease was examined using the current approach.

Results

Number of disease polymorphisms in a gene region

Various situations with a varying number of disease polymorphisms were examined using a sample size of 500 for both cases and controls. All possible sets of disease polymorphism(s) were simulated in this study for one, two, and three disease polymorphism(s). The actual disease polymorphism(s) in each simulation is indicated in bold characters in Tables 1, 2, 3, and the independent allelic odds ratios were fixed at 2.0 for each polymorphism. When there was only one disease polymorphism, the observed odds ratio of the disease polymorphism was the same as the true odds ratio; in this case the most significantly associated polymorphism was the disease polymorphism (Table 1). When there was more than one disease polymorphism, however, the observed odds ratios differed from the actual odds ratios as calculated by Eq. 2 in the “Methods” section (Tables 2 and 3). These differences result from the LD relationship between disease polymorphisms, and the observed odds ratios are substantially reduced or increased in several cases. Moreover, polymorphisms not associated with diseases often show more significant observed associations than disease polymorphisms.

Table 1 Estimates (Based on 1,000 replicated samples) of the Type I and Type II error rates for nominal Type I Error α and Type II Error β when there is only one disease polymorphism with an odds ratio of 2.0 and sample size of 500 for both cases and controls
Table 2 Estimates (Based on 1,000 replicated samples) of the Type I and Type II error rates for nominal Type I Error α and Type II Error β when there are two disease polymorphisms, each with an odds ratio of 2.0 and sample size of 500 for both cases and controls
Table 3 Estimates (Based on 1,000 replicated samples) of the Type I and Type II error rates for nominal Type I Error α and Type II Error β when there are three disease polymorphisms, each with an odds ratio of 2.0 and sample size of 500 for both cases and controls

The estimated Type I error rates were reasonable overall, although their average was slightly higher than the nominal error rates of α = 0.05 and α = 0.01 (Tables 1, 2, 3). This might be because the variations in the model parameters were estimated by simulations based on observed data rather than true data. Therefore, the limited sample sizes likely produced less variability in simulations with extreme frequencies. The Type II error rates were low overall but varied depending on the set of true disease polymorphisms (Tables 1, 2, 3), indicating that the power to identify the set of true disease polymorphism(s) could be quite high for most genetic association data.

Table 1 shows that the Type II error rates were higher when SNP7 was the disease polymorphism than they were for the other SNPs. SNP7 is in moderate LD with SNP2 (r² = 0.178), and the model in which SNP2 is the disease polymorphism looks slightly similar to the model in which SNP7 is the disease polymorphism. However, the value of the −2 log (likelihood ratio) with variance correction of the true model is almost always lower than that of the false model (91.5% probability from simulations). Therefore, if those values are considered for identifying disease polymorphisms, the actual Type II error rates can be greatly reduced when SNP7 is the disease polymorphism and even for other true models. The Type II error rates when SNP2 is the disease polymorphism are not as high as the rates when SNP7 is the disease polymorphism because of the higher minor allele frequency and moderately high D’ of SNP2 with other polymorphisms.

As shown for several cases in Tables 2 and 3, the Type II error rates become large when there is more than one real disease polymorphism. The representative case occurs when SNP3 and SNP4 are the disease polymorphisms. The r² between these two SNPs is high (0.693); therefore, the minor alleles of the two SNPs are highly associated with each other. An odds ratio of 2.0, however, was applied to the major allele of SNP3 and the minor allele of SNP4. As a result, the observed odds ratios of SNP3 and SNP4 were diminished to 0.98 and 1.28, respectively. These values are lower than those of most other SNPs due to the counterbalancing effect of each of these SNPs on the other (Table 2).

With three real disease polymorphisms, a similar phenomenon can be seen when both SNP3 and SNP4 are included as disease polymorphisms (Table 3). Therefore, there are certain situations that make it difficult to detect associations and identify the disease polymorphisms. It should be noted that this situation would be the worst-case scenario, which might not be very common. In the opposite situation, the Type II error rates can be low. The high r² between SNP3 and SNP4 is also responsible for increasing the Type II error rates in several cases involving one of the SNPs as a disease polymorphism. However, this effect has a smaller impact on the Type II error rates (Tables 2 and 3).

For the same situations as in Tables 1 and 2, stepwise regressions were conducted for comparison with the current method. As suggested (Cordell and Clayton 2002), backward stepwise regression was performed, and the final model was compared to the true model. The frequencies of selecting the correct model as the final model were very low overall, ranging from 0 to 0.243 based on 1,000 simulations. When each regression result was examined one by one, the real disease polymorphism usually showed the most significant coefficient when there was one real disease polymorphism. For two disease polymorphisms, one or both of the disease polymorphisms were often eliminated in the final model. The result confirms that the current method is more efficient for identifying the real disease polymorphisms.

Effect of sample size

Increasing the sample size is very helpful to reduce Type II errors in this test. Since the overall Type II error rates were low, the situations with high Type II errors were of primary interest. The control sample sizes were increased from 500 to 10,000, and the case sample sizes were increased from 500 to 5,000. Three different combinations of disease SNPs were examined for their Type II error rates at various sample sizes. Two of these had the worst Type II error rates for two or three true disease polymorphisms; these were caused by the counterbalancing effects of SNP3 and SNP4. The other combination of disease SNPs studied for sample size effects had moderately high Type II error rates with three disease polymorphisms. This moderately high Type II error rate came from the high LD between SNP3 and SNP4. As shown in Table 4a, the Type II error rates in this case were reduced quickly as sample size increased. Increasing the control sample size was more effective for reducing the Type II error rates than increasing the case sample size. As the sample sizes increased, the two combinations of disease SNPs with the worst initial Type II error rate for two or three true disease polymorphisms also showed fairly large improvements. In those cases, however, the Type II error rates were substantially reduced when both the case and control sample sizes were large enough. In summary, by increasing the sample sizes to a certain extent depending on situation, this method provided a reliable test identifying true disease polymorphisms, even for the worst-case situations.

Table 4 Type II error rates for various case and control sample sizes: (a) when there are three disease polymorphisms (SNP2, SNP4, and SNP7) with moderately high Type II error; (b) when there are two disease polymorphisms with the highest Type II error (SNP3 and SNP4); (c) when there are three disease polymorphisms with the highest Type II error (SNP3, SNP4, and SNP7)

Effect of the odds ratio

The test can be affected by the odds ratios of disease polymorphisms. As previous association studies resulted in associated SNPs with varying odds ratios, it is worthwhile to examine the effect of the odds ratios in the current method. As shown in Table 5, several sets of disease polymorphisms with high Type II error rates were tested with changes in odds ratios ranging from 1.2 to 3.0. A set of disease SNPs with a high Type II error rate (SNP3 and SNP4) and a set with a moderately high Type II error rate (SNP2, SNP4, and SNP7) were examined for variable odds ratios. For the set of three disease SNPs, the odds ratios of SNP2 and SNP4 varied; the odds ratio of SNP7 was fixed. As the examined models had high Type II error rates, this method is expected to reliably identify disease polymorphisms with various odds ratios higher than 1.5 (Table 5). Low odds ratios (e.g., 1.2) resulted in high Type II error rates. Increased odds ratios reduced the Type II error rates, but the decrements were not as consistent as those in response to changes in the sample size. The reductions in Type II error rates leveled off at a certain point as the odds ratios were increased (data not shown). Overall, when the Type II error rate was low enough, increased odds ratios did not affect the Type II error rates very much. For sets of disease SNPs with high Type II error rates (Table 5), however, increased odds ratios decreased the Type II errors. Although the trend does not always hold, it is notable that high observed odds ratios or large differences in allele frequencies between cases and controls may be more important for reducing the Type II error rates than high independent odds ratios for each polymorphism; this is consistent with the data shown in the previous section with variable sets of true disease polymorphisms (Tables 1, 2, 3). In comparison to sample size, therefore, changes in the odds ratios are not as crucially important for the accuracy of the tests.

Table 5 Type II error rates for various odds ratios when there are two disease polymorphisms (2,000 controls and 2,000 cases were used for simulation; OR: odds ratio): (a) the set of SNP3 and SNP4 as disease polymorphisms with the highest Type II error; (b) the set of SNP2, SNP4, and SNP7 as disease polymorphisms with moderately high Type II error (the odds ratio of SNP7 is fixed at 2)

Using re-sequenced data

High Type II error rates in several unfavorable situations can be improved by obtaining more information from fully re-sequenced data. Re-sequenced data include rare polymorphic sites as well as common polymorphic sites. These rare polymorphisms are usually in complete D’ and low r² with other polymorphisms, and this LD information can be very useful for these model-based likelihood ratio tests. Using all 16 SNPs from the re-sequenced data (Table 6), similar tests were conducted for various sets of disease polymorphisms.

Table 6 Minor allele frequencies of the 16 SNPs tested for examining re-sequenced data and their frequencies in comparison to results from an APOE association study with Alzheimer’s disease

First, to examine the reduction in Type II error rates resulting from the use of re-sequenced data, simulation tests were conducted using 500 cases and 2,000 controls. Increased numbers of controls were used to ensure stable results for normal approximation. As in previous tests, each allelic odds ratio was fixed at 2.0. Table 7a demonstrates that the use of 16 instead of seven SNPs resulted in a substantial reduction in Type II error rates, even though the reductions differed depending on the set of disease SNPs. As previously indicated, examination of the lowest value of the −2 log (likelihood ratio) with variance correction can also be helpful to obtain the correct model. The reduced Type II error rates result from the inclusion of more data in the tests. Expanding the re-sequenced region would produce better outcomes by providing more data with appropriate frequencies.

Table 7 Results using re-sequenced data (SNP numbers in the SNP set indicate the same SNP number as previously presented using the seven common polymorphisms, except in (c))

Caution regarding the control sample size should be applied when using re-sequenced data. Depending on the model, including very rare polymorphisms might decrease the parameter variances estimated using simulations, since there are situations that do not have many possible options for the simulated sets. In addition, the successful normal approximation of binomial variables depends on both frequency and sample size. Therefore, extremely low minor allele and haplotype frequencies may result in increased Type I error rates (Table 7b) if the sample sizes are not sufficiently large. Polymorphic sites from re-sequenced data whose allele and haplotype frequencies can be handled properly with the given control sample size should be included in the tests as data. The inclusion of many polymorphisms increases the number of testing models. Even though the Type II error rates were reduced in re-sequenced data, the actual number of false positive results might increase in proportion to the increased number of testing models. Therefore, depending on sample sizes, extremely rare polymorphisms may be excluded from the data unless they are suspected to be disease polymorphisms.

To examine the error rates when rare polymorphisms are the real disease polymorphisms, tests for sets of rare disease polymorphisms as well as rare and common disease polymorphisms were conducted using control and case sample sizes of 2,000. As shown in Table 7c, the patterns of error rates when one or more disease polymorphisms were rare were not different from the patterns observed when only common polymorphisms were the real disease polymorphisms. Although not always the case, extremely rare polymorphisms serving as disease polymorphisms may result in high Type I and/or Type II error rates (Table 7c, the set of SNP 2 and 7). In these situations, the increment or decrement of the actual disease polymorphisms was not obvious compared to the frequency changes of other common polymorphisms in LD with the disease polymorphisms. As previously indicated, several sets of disease polymorphisms showed increased Type I error rates, which could be decreased as the control sample size was increased. These results show that this method is applicable to re-sequenced data involving rare disease polymorphisms.

Application to the APOE association with Alzheimer’s disease

The current method was applied to the APOE association with Alzheimer’s disease. The results of many previous association studies were available, and the study examining the largest number of SNPs near the APOE region was selected (Yu et al. 2007). They typed 50 SNPs using 232 controls and 193 cases. As shown in Table 6, the frequencies of rare SNPs in the association study do not match well with the re-sequenced data used in this study. Therefore, the genotype data for the seven common SNPs primarily used in the current study were selected for the tests. Since their study did not provide specified linkage disequilibrium data, based on the same data previously used in the current study, 10,000 controls were generated for the test. Likelihood ratio tests were conducted on the models ranging from a single disease SNP to seven disease SNPs. When the models with five disease SNPs were tested, two models showed acceptance at the level of α = 0.01. The set of SNP1, 2, 4, 5, and 6 resulted in a P value of 0.96, and the set of SNP1, 2, 4, 6, and 7 resulted in a P value of 0.98. One of the models with six disease SNPs was finally accepted with the level of α = 0.05 (P value of 0.81). The set contained SNP1, 2, 3, 5, 6, and 7. The model with all seven SNPs as disease polymorphisms was rejected in this test, since the model could not provide an appropriate solution for the original differences in disease allele frequencies between cases and controls.

This result suggests that a few common SNPs might not explain the strong and consistent association between APOE and Alzheimer’s disease. It is possible that, as in this test, many common SNPs are actual disease polymorphisms. However, this result merits some skepticism because (1) the controls in the test were not actual controls for late-onset Alzheimer’s disease and (2) the actual sample sizes were not large enough for either cases or controls. Therefore, the true disease polymorphisms responsible for the APOE association with Alzheimer’s disease may differ from this result. A recent study indicated that rare disease polymorphisms with very high odds ratios (allele frequency between 0.005 and 0.02; genotype relative risk = 4) can create synthetic associations of common polymorphisms (Dickson et al. 2010). This is highly probable if there are a few rare polymorphisms with high odds ratios in the gene region. In the previous study of the APOE association with Alzheimer’s disease (Yu et al. 2007), however, no such rare SNP with a high odds ratio was found. Because the data in that association study were not re-sequenced data, it is still possible that rare SNPs with high odds ratios exist in the region. Further investigation of the region with increased sample sizes may help to identify real disease polymorphisms.

Discussion

The advantages and limitations of GWAS have been indicated recently, and identifying several causal polymorphisms in a gene region using fine mapping or re-sequencing has been emphasized (Altshuler et al. 2008; Hardy and Singleton 2009; Manolio et al. 2009). With the goal of obtaining true associations connecting gene association results with a disease presentation, the current study provides the first practical method for finding actual disease polymorphisms from case–control genotype data using LD information in a given gene region. In contrast to other conventional statistical approaches for dealing with confounders, this method uses the actual LD relationship between markers and disease polymorphisms instead of statistically treated relationships. The current method can provide more concrete conclusions from which to infer the real disease polymorphisms in a gene region.

Concerning the intrinsic multiple testing in the method, both Type I and Type II error rates were examined thoroughly. As shown in the Results section, the Type I error rates were consistently reasonable and the Type II error rates were usually low for most tests. When disease polymorphisms were in very high LD (as measured by r²) with other polymorphisms, the Type II error rates increased; however, increasing the control sample size reduced these error rates. For unusual situations in which the true disease polymorphisms were in very high LD (as measured by r²) and the effects of their odds ratios were opposed, increasing both the control and case sample sizes reduced the Type II error rates to a great extent. This method is also applicable to multi-allelic polymorphisms, where it considers each allele as a separate SNP in complete linkage disequilibrium as measured by D’. Therefore, the effect of each allele can be independently tested in this method. In summary, the current method is expected to provide valid results for most sets of disease polymorphisms with odds ratios higher than 1.5 and a sufficient sample size.
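The statement that alleles of a multi-allelic polymorphism behave as separate SNPs in complete LD through D’ follows from the standard definitions of D, D’, and r²: two alleles of the same polymorphism never occur on the same haplotype, so their joint haplotype frequency is zero. The short Python sketch below illustrates this with hypothetical allele frequencies; the function name `ld_stats` and the numerical values are assumptions made for illustration and are not part of the method's implementation.

```python
def ld_stats(p_a, p_b, p_ab):
    """Return (D', r^2) for two biallelic markers.

    p_a, p_b: frequencies of allele A and allele B.
    p_ab: frequency of haplotypes carrying both A and B.
    """
    d = p_ab - p_a * p_b
    # D is normalized by its maximum attainable magnitude given its sign.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2


# Hypothetical tri-allelic polymorphism with allele frequencies 0.5, 0.3, 0.2.
# The alleles with frequencies 0.3 and 0.2 are coded as two separate SNPs;
# no haplotype carries both, so the joint haplotype frequency is 0.
d_prime, r2 = ld_stats(0.3, 0.2, 0.0)
```

In this example D’ equals 1 while r² is only about 0.11, which shows why complete LD through D’ does not by itself imply the very high r² that raised the Type II error rates.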

When there are polymorphisms in complete LD in both cases and controls, it is not possible to distinguish one from another. As indicated previously, increasing sample sizes could mitigate the complete LD, but high LD between those polymorphisms could still hamper the identification of real disease polymorphisms if one of them is the disease polymorphism. In that case, comparisons with case–control association studies in different populations could be helpful. It is well known that there are clear population differences in allele frequencies and linkage disequilibrium patterns (Frazer et al. 2007). Therefore, polymorphisms in complete LD in one population might not be in strong LD in other populations, and it would then be possible to identify the real disease polymorphism among the polymorphisms that are in complete LD in the first population. If the polymorphisms are in complete LD only in either the case or the control samples, there is no need to exclude them from the analysis.

As shown in the Results section, the use of re-sequenced data substantially reduces Type II error rates even in unfavorable situations. However, as indicated previously, the actual Type II error rates might increase slightly because re-sequenced data increase the number of possible models to be tested. In addition, a large control sample size or polymorphisms with appropriate minor allele and haplotype frequencies should be used to obtain stable Type I error rates. Therefore, if the control sample sizes are large enough and/or sufficiently frequent polymorphisms are used, better outcomes are expected with re-sequenced data. Since this method is also useful for polymorphisms in low LD, expanding the region to include more data would help to obtain better results (even if the expanded regions are not related to disease presentation). Another advantage of using re-sequenced data is that a more accurate set of disease polymorphisms can be derived. Including all possible polymorphisms reduces the chance of missing the actual disease polymorphisms in the model set. As re-sequencing takes a greater role in the next generation of genetic methodologies, the method presented here is expected to be even more advantageous than existing methods.

This is the first report of a method to identify actual disease polymorphism(s) based primarily on the LD relationship between polymorphisms using case–control genotype data. Having information about truly associated disease polymorphisms permits better understanding of the role of associated genes in complex traits and better modeling of gene–gene interactions. The method would improve our knowledge of the genetic architecture of complex traits. From an evolutionary point of view, it is beneficial for two deleterious polymorphisms not to exist on the same haplotype. Therefore, the worst-case scenario of SNP3 and SNP4 might be more common than expected. Applying the method developed in this study to dense association mapping or re-sequenced data would provide valid interpretations of GWAS results. This would improve our understanding of the true genetic effects underlying disease presentation and could provide a better explanation for purifying selection pressures in the human genome.