Introduction

Family history is a well-established risk factor for breast cancer. Breast cancer risk is typically increased by two- to three-fold in first-degree relatives of affected individuals. Mutations in high-penetrance cancer susceptibility genes such as BRCA1 and BRCA2 account for less than 20% of the excess familial risk [1]. The remaining familial risk is likely to be explained by a polygenic model where breast cancer susceptibility is conferred by a large number of low-penetrance alleles. The risk conferred by each of these alleles may be small but these alleles may combine additively or multiplicatively to affect breast cancer susceptibility substantially [2]. Rare, high-penetrance susceptibility alleles have been successfully mapped using family-based linkage studies. Further progress in the search for genetic determinants of breast cancer likely lies in the identification of the large number of low-penetrance cancer susceptibility alleles by population-based genetic association studies.
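
As a rough illustration of how many weak alleles could combine into a substantial effect, the sketch below contrasts the multiplicative and additive models mentioned above. The per-allele relative risk of 1.2 and the count of 10 alleles are hypothetical values chosen only for illustration, and the function name is ours rather than anything taken from the cited work.

```python
def combined_relative_risk(per_allele_rr, n_alleles, model="multiplicative"):
    """Combined relative risk for a carrier of n_alleles independent risk alleles.

    Assumes every allele confers the same relative risk (a simplification).
    """
    if model == "multiplicative":
        # Relative risks multiply across alleles.
        return per_allele_rr ** n_alleles
    if model == "additive":
        # Excess risks (RR - 1) add across alleles.
        return 1 + n_alleles * (per_allele_rr - 1)
    raise ValueError(f"unknown model: {model}")


# Hypothetical values: 10 risk alleles, each with a relative risk of 1.2.
multiplicative = combined_relative_risk(1.2, 10, "multiplicative")
additive = combined_relative_risk(1.2, 10, "additive")
print(f"multiplicative: {multiplicative:.2f}, additive: {additive:.2f}")
```

Under these assumed values the multiplicative model gives roughly a 6.2-fold risk and the additive model a 3-fold risk, although no single allele raises risk by more than 20%.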

Numerous genetic association studies on breast cancer have been published but results have been equivocal, partly due to shortcomings in study design [3]. The past few years have witnessed rapid advances in high-throughput technologies for genotyping analysis as well as in our understanding of genetic variation patterns across the human genome. These advances have empowered researchers to improve the design of genetic epidemiological studies, especially the way in which genetic variation is captured. In this short review, we will focus on the recent developments in high-throughput technologies for genotyping analysis and their impact on genetic epidemiological studies of breast cancer, addressing both their promises and pitfalls.

Candidate polymorphism analysis

The genetic association studies published on breast cancer from the 1990s onwards have typically compared the allelic and/or genotypic frequencies of selected polymorphisms between breast cancer cases and controls. These studies aimed to find polymorphisms that may be directly related to breast cancer risk as causal variants or indirectly related to breast cancer risk through linkage disequilibrium (LD) with causal variants. Such studies typically start with the selection of candidate genes based on current biological understanding of their potential role in breast carcinogenesis. A small number of polymorphisms are then selected in these genes and genotyped. Polymorphism selection has usually been based on isolated reports of a polymorphism's potential functional effect, as with coding variants, and/or on the feasibility of genotyping it successfully at that time.
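
The comparison of allele frequencies between cases and controls described above can be sketched as a simple 2x2 test. This is a minimal illustration using a standard Pearson chi-square test on allele counts; the counts are hypothetical, and individual studies may well have used genotype-based or regression-based tests instead.

```python
import math


def allelic_test(case_minor, case_major, control_minor, control_major):
    """Compare allele counts between cases and controls (2x2 table).

    Returns (odds_ratio, chi_square, p_value) for a 1-df Pearson test.
    """
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi_square = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # For 1 degree of freedom, the chi-square tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return odds_ratio, chi_square, p_value


# Hypothetical counts: 500 cases and 500 controls, i.e. 1,000 alleles per group.
odds_ratio, chi_square, p_value = allelic_test(250, 750, 200, 800)
print(f"OR = {odds_ratio:.2f}, chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

A significant result from such a test does not distinguish a causal variant from a marker in LD with one, which is why both interpretations are mentioned above.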

The move from family-based linkage studies to population-based genetic association analysis has brought a shift from microsatellite markers to single nucleotide polymorphisms (SNPs) as the leading marker for genetic analysis. Microsatellite markers have been extremely useful in mapping causal genetic variants in family pedigrees and have been successfully used to identify high-penetrance genes, as in the case of BRCA1 [4]. But microsatellite markers are less efficient in population-based genetic association analysis and have rarely been used in the search for low-penetrance alleles using unrelated subjects [5, 6], partly due to their relatively high mutation rate and complex mutation patterns. Compared to microsatellite markers, SNPs are stable, more abundant, associated with lower genotyping error, easier to automate and thus cheaper in terms of cost and labor. The availability of detailed information on LD patterns of SNPs has also enabled genetic variation to be captured more effectively using SNPs. Hence, SNPs have increasingly dominated the field of population-based genetic association studies in breast cancer. Examples of genes investigated using candidate SNPs include the steroid hormone metabolism genes (CYP17, CYP19, COMT, SHBG), estrogen-signaling genes (ESR1, ESR2), carcinogen metabolism genes (CYP1A1, NAT1, NAT2, GSTM1) and DNA repair genes (XRCC1-3, ATM) [7-9]. Although commonly termed candidate gene analysis, such studies can at most qualify as candidate polymorphism analysis, since only a very small number of polymorphisms within each gene were evaluated and these cannot be assumed to represent the whole gene, especially if the gene is large.

Despite huge efforts being invested in population-based genetic association studies of breast cancer, the outcome has not been satisfactory. The low throughput and high cost of genotyping analysis have constrained investigators to studying only a few polymorphisms within a few candidate genes in a limited number of samples. Positive results have been rare and often not replicated in subsequent studies. It is possible that the generally negative findings of past studies may be due to a true absence of risk alleles of moderate to high effect for breast cancer. But given both the poor coverage and the inadequate power of past studies, causal alleles are likely to have been missed even if they exist. Hence negative results of such studies cannot be used as evidence to rule out the role of a particular gene in breast cancer risk. To illustrate the problem of inadequate power, a systematic review of genetic association studies of breast cancer found 46 case-control studies published between 1983 and July 1998. Most studies were small, with the median number of cases and controls combined being 391 (range 58 to 1,431). From power calculations, a study of 315 cases and 315 controls will be needed to detect a risk allele with a frequency of 20% conferring a relative risk of 2.5 with 90% power at the 5% significance level. Only 10 out of 46 studies met these criteria [8]. It has been further argued that to reduce false positives arising from multiple testing, a significance level of 10⁻⁴ should be used for candidate gene studies. Then a study of approximately 1,000 cases and 1,000 controls will be needed to detect a susceptibility allele with a frequency of 20% conferring a relative risk of 1.5 [10]. Few candidate polymorphism studies in breast cancer have managed to fulfill such criteria. In summary, limited progress has been made by such candidate polymorphism-based genetic epidemiological studies in identifying low-penetrance risk alleles for breast cancer.
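
To show how the required sample size depends on the significance level, the sketch below uses a textbook normal approximation for comparing allele frequencies between equal numbers of cases and controls. It is an assumption-laden illustration, not the calculation used in the cited reviews: it treats the relative risk as a per-allele odds ratio and makes no attempt to reproduce the published figures, which depend on the genetic model assumed there.

```python
import math
from statistics import NormalDist


def cases_needed(control_freq, allelic_odds_ratio, alpha, power):
    """Approximate number of cases (= controls) for a two-sided allelic test.

    Uses the normal approximation for a difference between two proportions,
    with each subject contributing two alleles.
    """
    p0 = control_freq
    odds1 = allelic_odds_ratio * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)  # allele frequency expected in cases
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    alleles_per_group = (
        (z_alpha + z_beta) ** 2
        * (p0 * (1 - p0) + p1 * (1 - p1))
        / (p1 - p0) ** 2
    )
    return math.ceil(alleles_per_group / 2)


# Allele frequency 20%, per-allele odds ratio 1.5, 90% power (illustrative values).
n_at_5_percent = cases_needed(0.20, 1.5, alpha=0.05, power=0.90)
n_at_1e_4 = cases_needed(0.20, 1.5, alpha=1e-4, power=0.90)
print(n_at_5_percent, n_at_1e_4)
```

Under this approximation, moving from the 5% level to the 10⁻⁴ level raises the required number of subjects roughly two and a half times for the same effect size, which is why few of the early studies would qualify under the stricter criterion.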

Recent developments in high-throughput genotyping technology

The rapid development of high-throughput technology for SNP genotyping over the past few years has produced a wide variety of SNP genotyping platforms, each with unique features. On platforms such as the Illumina BeadArray™ and the Affymetrix GeneChip® array systems, up to thousands of SNPs can be analyzed simultaneously (i.e., multiplexed) in each sample. These platforms have dramatically increased the throughput of genotyping and brought down the genotyping cost per SNP. They are well suited for large-scale screening studies where thousands of SNPs are analyzed in a fair number of samples. However, due to their high level of multiplexing, total cost, and sometimes lengthy process of initial assay development, these platforms become unwieldy in studies where only a moderate number of SNPs needs to be analyzed. For such studies, Sequenom's MassARRAY® system is one of the better choices, as it multiplexes up to 29 SNPs per assay and has a short assay development time that investigators can handle themselves. Such systems provide greater flexibility and efficiency for investigators to carry out either medium-size studies that target a moderate number of candidate genes or follow-up studies where a limited number of positive findings from initial large-scale screening studies are further investigated in large samples. In situations where only a single SNP or a very limited number of SNPs need to be analyzed in a large number of samples (e.g., in confirmation studies), methods such as TaqMan® and Pyrosequencing™ assays are more suitable. Such systems can only genotype very few SNPs at a time but are very robust and efficient. The main features of some of the principal genotyping platforms available for custom SNPs are summarized in Table 1. A detailed discussion of SNP genotyping technology is beyond the scope of this review but is available elsewhere [11-13].

Table 1 Main features of some custom SNP genotyping platforms available

The technological limit of genotyping analysis has been further challenged by the recent release of ultra high-throughput systems from Illumina and Affymetrix. Innovative multiplexing chemistry allows these systems to analyze between approximately 317,000 SNPs (Illumina's Sentrix® humanHap300 beadchip and Infinium™ II assay) and 500,000 SNPs (Affymetrix's GeneChip® Mapping 500 K Array) in a single experiment. Both systems have fixed content, meaning that all the SNPs for analysis have been pre-selected by the manufacturers. While Illumina's SNP selection is based on the available information on allele frequency and the LD pattern of the human genome from the HapMap project, Affymetrix's SNP selection is generally random and mainly based on how readily the SNPs can be genotyped. By driving the genotyping cost below US$0.01 per SNP, such systems have made whole-genome association analysis a reality.

The technological advancements in genotyping analysis, coupled with the extensive collection of validated SNPs and knowledge of LD patterns across the human genome from the HapMap project, have transformed the landscape of genetic epidemiological studies. These advancements have allowed us to progress from the investigation of candidate polymorphisms to truly comprehensive candidate gene and whole-genome studies.

Comprehensive candidate gene study using the haplotype tagging approach

Knowledge of LD patterns across different genes has given rise to the haplotype tagging approach as an efficient way of conducting comprehensive candidate gene studies. Due to the extensive non-independence between SNPs and the limited haplotype diversity within regions of strong LD (LD blocks) in the human genome, only a subset of selected SNPs, instead of all variants, needs to be analyzed to capture the majority of common genetic variation within such blocks. With an average LD block size of between 11 and 22 kb and assuming 3 to 5 haplotypes per block, it has been estimated that around 300,000 to 1,000,000 well-chosen tagging SNPs (in non-African and African samples, respectively) would be required to capture the 10 million SNPs that are thought to exist [14]. Equipped with large sample sizes and efficient coverage of common genetic variation within candidate genes, current genetic epidemiological studies are expected to stand a good chance of detecting susceptibility alleles with moderate effects, if they exist. While current genetic association studies are being geared towards comprehensive coverage of common variants, which greatly enhances confidence in a negative result, it will remain difficult to confidently exclude the role of a candidate gene purely on the basis of LD mapping. Although there is general agreement on the merits of using the haplotype tagging approach in genetic association studies, there are pitfalls [15], and active discussions are still ongoing on several issues, including optimizing tagging SNP selection [16, 17] and haplotype construction [18], as well as statistical analysis of such SNP/haplotype data to study disease associations [19].
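
The idea that a subset of SNPs can stand in for the rest can be sketched with a simple greedy selection over pairwise r² values. This is only one possible tagging strategy and is not claimed to be the method of the cited references; the r² threshold of 0.8 is a commonly used convention that we assume here, and the SNP names and r² values are hypothetical.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Pick tag SNPs so every SNP has r2 >= threshold with at least one tag.

    r2 is a symmetric dict-of-dicts of pairwise r-squared values;
    missing pairs are treated as r2 = 0.
    """
    untagged = set(r2)
    tags = []

    def covered(snp):
        # Untagged SNPs that this SNP would capture (itself included).
        return {
            s for s in untagged
            if s == snp or r2[snp].get(s, 0.0) >= threshold
        }

    while untagged:
        # Take the SNP covering the most untagged SNPs; ties go to the
        # alphabetically first name so the result is deterministic.
        best = max(sorted(r2), key=lambda snp: len(covered(snp)))
        tags.append(best)
        untagged -= covered(best)
    return tags


# Hypothetical LD structure: one block of three SNPs, one of two, one singleton.
r2 = {
    "A": {"B": 0.95, "C": 0.90},
    "B": {"A": 0.95, "C": 0.85},
    "C": {"A": 0.90, "B": 0.85},
    "D": {"E": 0.92},
    "E": {"D": 0.92},
    "F": {},
}
tags = greedy_tag_snps(r2)
print(tags)
```

In this toy example three tags capture six SNPs. A SNP in weak LD with everything else, like the singleton here, must be genotyped directly, which is one reason the number of tags needed varies with the LD structure of the population studied.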

Genetic association studies on breast cancer that have used haplotype tagging SNPs for candidate gene analysis are starting to appear in the literature. Some examples of genes studied in this manner include CYP19 [20], HSD17B1 [21], EMSY [22] and CHEK2 [23], and more results are expected in the near future. Currently, published studies have focused on assessing genetic variation within single candidate genes, but more efforts will be needed to evaluate entire biological pathways or gene families. Genes often work together as part of complex biological pathways. Selecting a single candidate gene within a pathway for genetic epidemiological investigation is likely to be over-simplistic. Instead, genetic variability of entire biological pathways, for example, the estrogen metabolism pathway, should be investigated to evaluate potential association with disease. Although it is no longer technologically challenging to capture most, if not all, of the common genetic variation within a biological pathway using the haplotype tagging approach, the method for data analysis is not straightforward. Locus-by-locus analysis can detect SNPs associated with moderate main effects. But this method of analysis will become less effective in situations where breast cancer susceptibility is attributed to a fair number of alleles, each of which is only associated with a weak effect (below the threshold for detection) or in situations where susceptibility is attributed to the interaction of multiple SNPs, each with negligible effect. Therefore, success of comprehensive candidate gene studies will rely substantially on the development of new statistical methods for evaluating the cumulative effect of whole biological pathways on susceptibility to breast cancer.

Genomic epidemiological studies

The success of candidate gene studies, whether based on single genes or whole pathways, is constrained by our current biological understanding of breast carcinogenesis. Since breast carcinogenesis is a complex and still only partially understood process, it is likely that many important genes are overlooked in candidate gene studies. Such a limitation can only be overcome by genomic epidemiological studies where no prior biological hypotheses are assumed and the entire human genome is targeted for identifying genetic variation associated with breast cancer susceptibility. Several research groups have embarked on whole genome association studies in breast cancer but no results have been published yet. The use of whole genome scans in genetic association studies is still in its infancy. Design issues for genome-wide association studies are still evolving and have been reviewed elsewhere [24, 25].

Although promising, genome-wide association studies bring about major challenges in regard to data analysis. Genetic epidemiological studies have conventionally been designed in such a way that a relatively small number of potential risk factors (both genetic and non-genetic) are evaluated in a much larger number of samples. Locus-by-locus approaches for statistical analysis are well developed for such designs to evaluate the main effect of a genetic variant and simple interactions between genetic variants. In contrast, genome-wide association studies are expected to involve analysis of hundreds of thousands of SNPs in several hundred (or thousand) samples. This means that the number of testing targets will be far greater than the number of samples, which is unfavorable for a conventional locus-by-locus statistical analysis approach. This issue has already emerged when attempting to extend the candidate gene approach to studying multiple genes in a pathway but will become greatly compounded in the whole genome analysis. By performing a locus-by-locus test on each of the hundreds of thousands of SNPs in a moderate sample size, a large number of false positive findings are expected to be generated in addition to the expected small number of true positive results. Because the true risk alleles are likely to be associated with moderate effects, the true positive association results are by no means guaranteed to enjoy stronger statistical evidence than the false positive ones. Although Bonferroni correction or false discovery rate can be used to control the adverse effect of multiple testing and reduce the false positive rate, they cannot improve the power for detection. As a means of validating initial positive findings, a two-stage design may be used in which a large number of potential positive findings from the initial genome-wide analysis are tested in a much bigger sample. But the efficiency of such a design still needs to be proven by real studies. 
Hypothesis-free attempts to identify interactions among genetic variants at the genomic level will be even more challenging, due to the immense number of tests involved. Initial simulation analysis has demonstrated the feasibility of performing genome-wide interaction analysis [26], but more will need to be done to verify its efficiency.
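
The scale of the multiple-testing problem described above can be made concrete with some simple arithmetic, together with a small comparison of Bonferroni correction and the Benjamini-Hochberg false discovery rate procedure. The figure of 500,000 SNPs echoes the array size mentioned earlier; the list of p-values is hypothetical and chosen only to show how the two rules differ.

```python
import math


def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of tests declared significant by the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its threshold.
        if p_values[i] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def bonferroni(p_values, alpha=0.05):
    """Return indices of tests significant after Bonferroni correction."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


n_snps = 500_000
expected_false_positives = n_snps * 0.05   # if no SNP were truly associated
bonferroni_threshold = 0.05 / n_snps       # per-test level for a 5% family-wise error
pairwise_tests = math.comb(n_snps, 2)      # all two-SNP interaction tests

# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh_hits = benjamini_hochberg(p_values)
bonferroni_hits = bonferroni(p_values)
print(expected_false_positives, bonferroni_threshold, pairwise_tests)
print(bh_hits, bonferroni_hits)
```

With 500,000 independent tests at the 5% level, about 25,000 false positives would be expected by chance alone, and an exhaustive pairwise interaction scan would involve roughly 1.25 × 10¹¹ tests. Neither correction changes the underlying evidence for a true association; both only move the threshold, which is why they cannot restore power lost to a modest sample size.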

Future directions

Looking ahead, the technical barriers to genotyping are unlikely to be a limiting factor. Future breakthroughs in the search for breast cancer susceptibility genes will probably hinge heavily on devising novel data analysis strategies to make sense of the vast amount of data generated. Although this remains speculative, novel statistical and/or mathematical approaches that incorporate information on biological networks and genomic structure are likely to lead the field of data analysis.

With the vast amount of data generated from high throughput genotyping, many genetic association findings are expected. Replication will be needed and functional verification will need to be conducted to identify true causal alleles. Efforts to devise efficient methods for functional validation would accelerate the accumulation of well-founded evidence. Despite all the promises held by genome-wide association studies, if such studies are not handled properly, large numbers of false positive results will be generated and published. This will result in a significant drain in resources invested in studies with slim prior probabilities of yielding significant findings, which would slow down the search for breast cancer susceptibility genes. Recognizing the promises and the pitfalls of such genomic approaches, efforts are already underway to coordinate genetic association studies to build a roadmap for efficient and effective human genomic epidemiology [27].

Apart from genetic factors, environmental and lifestyle factors also play a substantial role in affecting breast cancer risk [28-30]. Low-penetrance genes most likely act in concert with lifestyle and other environmental factors to affect breast cancer risk. The subtle effects of some genetic variants may be magnified and only become detectable in the presence of certain exposures. Failure to take into account these external factors may hinder the search for breast cancer susceptibility gene variants. For example, the associations between polymorphisms in DNA repair genes and breast cancer risk were only detectable in women with a high intake of folate and carotenoids [31, 32]. Studies of such gene-environment interactions will not only help in the search for low-penetrance gene variants affecting breast cancer risk, but can also uncover ways by which risk may be modified.

Finally, it should be stressed that no amount of genetic, technological or statistical sophistication can compensate for a badly devised study. Sound epidemiological design remains fundamental to obtaining valid and reproducible genomic epidemiological results. Sufficient numbers of carefully defined cases and appropriately chosen controls, with accurate information about potential confounders and effect modifiers, are needed. Ideally such study samples will be derived from large prospective studies.

Note

This article is part of a review series on High-throughput genomic technology in research and clinical management of breast cancer, edited by Yudi Pawitan and Per Hall.

Other articles in the series can be found online at http://breast-cancer-research.com/articles/review-series.asp?series=bcr_Genomic

Box 1 Glossary of terms