Introduction

The search for breast cancer susceptibility genes is going at an enormous speed owing to the general appreciation of breast cancer being a multifactor as well as polygenic disease. This continuous search is encouraged by the worldwide annual toll of more than one million new breast cancer cases and more than 400,000 breast cancer related deaths [1, 2]. Twin studies in Scandinavia suggested that heredity may account for about 27% of breast cancer. As of today less than 5% of familial breast cancer have been attributed to high penetrance breast cancer genes BRCA1, BRCA2, PTEN, and TP53 [35] and rare genetic variants at ATM, CHEK2, BRIP1 or PALB2 that confer an approximately 2-fold increased risk [68]. Strong evidence for a common breast cancer susceptibility allele has been provided for CASP8 [9].

From an epidemiological view point it is generally accepted that cumulative, excessive exposure to endogenous estrogen across a woman’s life span contributes and may be causal to breast cancer [10]. High serum estrogen level has been proposed as a major risk factor for breast cancer [11]. In vitro and in vivo animal as well as patient-based studies suggested that estrogens, their metabolic compounds and the entire biochemical metabolic machinery may play a role in breast carcinogenesis [12]. Moreover, recent observational studies showed a possible influence of exogenous hormones such as hormone replacement therapy (HRT) [1315] and oral contraceptives (OC) [16, 17].

In keeping with the polygene hypothesis of breast cancer [18], the steroid hormone metabolic pathway appears to be a prime candidate for the investigative search of breast cancer susceptibility genes. This notion draws substance from frequent polymorphisms at enzymes of biosynthesis and catabolism. A systematic search across the pathway holds the potential to comprehensively scrutinize a group of cooperating candidate genes and variants through gene-gene and gene-environment interactions. This approach has been suggested recently [19] and may be superior to common single factor analyses due to the feasibility of interaction analyses.

The GENICA interdisciplinary network of clinicians, epidemiologists, human geneticists, statisticians, and bioinformatic experts applies current state-of-the-art analytical and computational methods towards the investigation of breast cancer susceptibility genes and gene-environment interactions within a population-based case-control study from Germany [14, 20]. Analyses reported herein include six epidemiological variables and 18 polymorphisms of 11 genes encoding enzymes related to estrogen biosynthesis (CYP17A1, CYP19A1, EPHX1, HSD17B1, SRD5A2, PPARG2), catabolism and detoxification (CYP1A1, CYP1B1, COMT, GSTP1, and SOD2). We present single factor and multifactor analyses of high-order gene–gene and gene–environment interactions, and discuss the findings within the biological and currently available methodological context.

Materials and methods

Study population

The GENICA study participants of the population-based breast cancer case-control study from the Greater Bonn Region, Germany, were recruited between 08/2000 and 10/2002 as described previously [14, 20]. In brief, there are 688 incident breast cancer cases and 724 population controls matched in 5-year classes. Cases and controls were eligible if they were of Caucasian ethnicity, current residents of the study region, and below 80 years of age. Information on known and supposed risk factors was collected via in person interviews. The response rate for cases was 88% and for controls 67%. Characteristics of the study population regarding potential breast cancer risk factors include age at diagnosis (<50, ≥50 years), menopausal status (premenopausal, postmenopausal), breast cancer in mother and sisters (yes, no), OC use (never, >0–<5, 5–<10, ≥10 years), HRT use (never, >0–<10, ≥10 years), body mass index (BMI; <20, 20–<25, 25–<30, ≥30 kg/m2), and smoking status (never, former, current) (Table 1). The GENICA study was approved by the Ethic’s Committee of the University of Bonn. All study participants gave written informed consent.

Table 1 Selected epidemiological variables and their risk association in the GENICA study population

DNA isolation and genotyping

Genomic DNA was extracted from heparinized blood samples (PuregeneTM, Gentra Systems, Inc., Mineapolis, USA) as previously described [20]. DNA samples were available for 610 cases (89%) and 651 controls (90%). Eighteen polymorphisms of 11 genes of the estrogen metabolic pathway were subjected to genotyping in DNA of 1,261 patients and controls. Genes have been selected according to their role with respect to the estrogen metabolic pathway and polymorphisms have been selected on the basis of a known or putative functional consequence as well as an allele frequency of at least 4% in a population of European descent. A detailed description of genes and polymorphisms is given in Supplementary Table S1.

MALDI-TOF MS genotyping

Genotyping of 16 single nucleotide polymorphisms (SNPs) at CYP19A1_790_C>T (rs700519), EPHX1_337_T>C (rs1051740), EPHX1_416_A>G (rs2234922), HSD17B1_937_A>G (rs6050598), SRD5A2_265_G>C (rs523349), PPARG2_-34_C>G (rs1801282), CYP1A1_2452_C>A (rs1799814), CYP1A1_2454_A>G (rs1048943), CYP1A1_3801_T>C (rs4646903), CYP1B1_1294_C>G (rs1056836), CYP1B1_1358_A>G (rs1800440), COMT_472_G>A (rs4680), GSTP1_313_A>G (rs947894), GSTP1_341_C>T (rs1799811), and SOD2_47_T>C (rs1799725) were performed by matrix assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) as described previously [20, 21]. In brief, 5 ng of genomic DNA was amplified by PCR using 0.1 unit HotStarTaq DNA Polymerase (Qiagen, Hilden, Germany). PCR conditions were 95°C for 15 min, followed by 44 cycles of 95°C for 30 s, 56°C for 30 s, 72°C for 1 min, finally followed by 72°C for 10 min. PCR products were treated with shrimp alkaline phosphatase (SAP, Amersham, Freiburg, Germany) for 20 min at 37°C followed by 10 min at 85°C. Base extension [homogenous MassEXTEND (hMETM), Sequenom, San Diego, CA] reaction in a final volume of 10 μl contained extension primers at a final concentration of 0.54 μM and 0.6 units ThermoSequenase (Amersham, Freiburg, Germany). Reaction conditions were 94°C for 2 min, followed by 40 cycles of 94°C for 5 s, 52°C for 5 s, and 72°C for 5 s. Final base extension products were treated with SpectroCLEAN resin (Sequenom, San Diego, CA, USA). A 10 nl aliquot of reaction solution was dispensed onto a 384 format SpectroCHIP microarray (Sequenom, San Diego, CA, USA) pre-spotted with a matrix of 3-hydroxypicolinic acid. A Bruker Autoflex MALDI-TOF MS was used for data acquisitions from the SpectroCHIP. Genotyping calls were made with MASSARRAY RT software v 3.0.0.4 (Sequenom, San Diego, CA, USA). For quality control repeated analyses was performed for 10% randomly selected samples. Primers were synthesized by Metabion International AG, Martinsried, Germany.

TaqMan® assisted allelic discrimination

In case of EPHX1_337_T>C (rs1051740) and EPHX1_416_A>G (rs2234922) genotyping of 69% and 56% of the samples, respectively, was additionally performed by TaqMan® allelic discrimination as described previously [22]. In brief, PCRs were performed in a reaction volume of 25 μl containing 50 ng DNA, 1 × TaqMan® Universal PCR Master Mix (Applied Biosystems, Foster City, CA, USA) Primer and MGB probes were synthesized by MWG-Biotech AG, Ebersberg, Germany and Applied Biosystems, Foster City, CA, USA, respectively. PCR and genotype analysis was performed with ABI Prism 7700 Sequence Detection System and ABI Prism 7700 SDS software (Applied Biosystems, Foster City, CA, USA).

PCR-based fragment length polymorphism genotyping

Four polymorphisms including COMT_472_G>A (rs4680), CYP17A1_-34_T>C (rs743572), CYP19A1_630_delCTT, and CYP19A1_681_(TTTA) n were genotyped by fragment analysis. For quality controls 10% of randomly selected samples were repeated.

The COMT_472_G>A polymorphism creates a Hsp92II restriction site: PCR was performed with primers from Thompson et al. [23] and carried out in 10 μl reaction containing 50 ng genomic DNA, 1 × PCR buffer (Qiagen, Hilden, Germany), 1.5 mM MgCl2, 0.1 μM of each primer, 250 μM of each dNTP (Promega, Mannheim, Germany), 0.4 U HotStarTaq DNA polymerase (Qiagen, Hilden, Germany). After an initial 15 min at 95°C, DNA was amplified by 35 cycles of 1 min at 94°C, 1 min at 58°C, 1 min at 72°C, and with a final extension step of 10 min at 72°C. Amplified DNA fragments were digested with 5 U Hsp92II (Promega, Mannheim, Germany) in a total volume of 20 μl, separated on a 2.5% agarose gel containing ethidium bromide (Sigma-Aldrich, Steinheim, Germany) and scored by UV visualization. Fragment sizes were 114, 29, and 26 bp for the G allele and 96, 18, 29, and 26 bp for the A allele.

The polymorphism CYP17A1_-34_T>C creates a MspA1I restriction site: PCR was performed with primers including a D4-labeled forward primer to generate a fluorescent product, PCR reactions, conditions and restriction enzyme digestion followed the method described by Bergmann-Jungestrom et al. [24]. The sizes of the labelled fragments were 209 bp for the T allele and 123 bp for the C allele.

CYP19A1_630_delCTT and CYP19A1_681_(TTTA) n polymorphisms: The CTT insertion/deletion (ins/del) polymorphism is located upstream of the tetranucleotide repeat (TTTA) n [25]. The deletion is linked to (TTTA)7 generating the two alleles delCTT_(TTTA)7 and insCTT_(TTTA)7. CTT ins/del was amplified using a D3-labelled forward primer [25]. PCR conditions were the same as for COMT_472_G>A analysis. Fragment sizes were 171 bp for the insCTT allele and 168 bp for the delCTT allele. The (TTTA) n polymorphism without the ins/del site was analyzed according to Kristensen et al. [26] using a D4-labelled forward primer. PCR reactions and conditions were as for COMT_472_G>A analysis, with the exception of using 2 mM MgCl2 and an annealing temperature of 51°C. The sizes of the labelled fragments ranged from 302 bp for the (TTTA)7 allele to 326 bp for the (TTTA)13 allele.

For fragment analyses fluorescence labelled fragments were separated by capillary gel electrophoresis on a CEQ™ 8000 fully automated Genetic DNA Analysis System (Beckman Coulter, Krefeld, Germany). Separation of the CEQ DNA Size Standard-400 in each well allowed for CEQ automated sizing of D3- and D4-labelled PCR products. The analysis was performed in a 96-well format.

Due to their functional roles CYP19A1_630_delCTT and CYP19A1_681_(TTTA) n polymorphisms were combined for multivariate analysis. The CYP19A1_630_delCTT is linked to CYP19A1_681_(TTTA)7, moreover carriers of the CYP19A1_630_insCTT genotype can be carrier of all numbers of repeats. To increase the power of statistical analyses we combined these two polymorphisms and created three variables: (1) Carriership of CYP19A1_630_delCTT and CYP19A1_681_(TTTA)7, (2) carriership of CYP19A1_630_insCTT and CYP19A1_681_(TTTA)7, (3) carriership of CYP19A1_630_insCTT and CYP19A1_681_(TTTA)>10. The three groups represent hypothesized increasing levels of sex hormone levels. Accordingly, the delCTT is associated with lower estrone, estradiol, free estradiol levels and higher sex hormone binding globulin concentrations [27]. Likewise, more than seven repeats are associated with higher concentrations of estrone, estradiol, free estradiol, androstenedione as well as testosterone in the serum [2729].

Statistical analyses

A number of statistical strategies have been applied to analyze genetic and epidemiologic variables of the GENICA study collection for the identification of main effects and interactions on the risk to develop breast cancer. Altogether we applied six methods including conditional logistic regression, haplotype analysis, cluster analysis, multifactor dimensionality reduction, logic regression, and global testing. For the analyses two datasets were used. A standard dataset included the five epidemiological data and SNP data encoded as 1 = homozygous for major allele, 2 = heterozygous, and 3 = homozygous for minor allele. A dichotomized dataset was used to increase the power of the study with SNP data encoded as 1 = homozygous for major allele and 2 = heterozygous and homozygous for minor allele. Using the standard dataset the study had an 80% power to detect minimum ORs of 1.39–1.96 (α = 0.05, two sided test). Using the dichotomized dataset the study had an 80% power to detect minimum ORs of 1.09–1.59 (α = 0.05, two sided test). The range reflects the variation of minor allele frequencies of the 18 polymorphisms which were at 4–48%.

Basic single and multifactor analyses

Odds ratios (OR) and 95% confidence intervals (CI) were calculated by conditional logistic regression and P-values by chi square tests. All tests were two sided. To identify subgroups of patients at risk we performed stratified analyses in two directions. P-values were corrected for multiple testing by Holm’s method. If indicated, study subjects with missing values were excluded from the analyses.

We stratified for six epidemiologic variables (menopausal status, breast cancer in mother or sisters, OC use, HRT use, BMI, and smoking) and for genotypes of all 18 polymorphisms. Risk estimation was performed by logistic regression conditional on age in 5-year groups, according to the matching scheme of controls to cases and adjusted for the five epidemiologic variables using SAS [30]. We considered P < 0.01 noteworthy provided that the minimal size of the group was ≥5. Conditional logistic regression was also used as a multivariate procedure and to investigate possible interaction effects by comparing models with and without interaction term and considering deviance statistic. Models were compared considering their deviance statistics.

Haplotypes were estimated using PHASE version 2.1 [31, 32]. Comparisons between observed and expected haplotype frequencies as well as comparison of frequencies between cases and controls were performed using chi square test.

Cluster analysis, higher order interactions and global associations

All these analyses were carried out using the open source statistical package R (www.r-project.org).

Cluster analysis

To investigate associations of genetic and epidemiological variables a hierarchical cluster analysis of all variables was performed with Pearson’s corrected coefficient of contingency as similarity measure [33, 34]. We conducted separate analyses for cases and controls using average linkage as clustering algorithm.

Multifactor dimensionality reduction

The multifactor dimensionality reduction (MDR) algorithm [35, 36] performs well in detecting high-order interactions, even in the absence of any statistically significant main effects [37]. The classification accuracy is estimated using a 10-fold cross-validation. However, this estimation tends to be over optimistic because the best model of all cross-validation steps is selected. Therefore, we estimated an additional classification accuracy using an independent test and training set generated by splitting the original dataset (¼–¾). Furthermore, the dichotomized dataset was used to reduce the size of cross tables used for MDR algorithm.

Logic regression

Logic regression has been developed to analyze genetic data [38]. It is an adaptive regression methodology for predicting the outcome in classification and regression problems that are based on binary (true/false) variables. Logic regression searches for the logic tree that best explains the cases. Instead of just building a single tree, multiple trees may be built and combined by a generalized linear model with logit link function. Since we focus on variable/feature selection rather than on classification, we combine the logic regression approach with a method for quantifying the importance of identified genetic or epidemiological variables and their interactions of order two and higher. The underlying measure of importance is the logicFS measure introduced in Schwender and Ickstadt [39] constructed for both the single and multiple tree logic regression approach. We used the dichotomized dataset and the analysis was carried out using the R package logicFS which is part of Bioconductor [40], a project of the analysis of genomic data.

Global testing

Instead of checking the impact of each polymorphism individually we asked whether the whole set of polymorphisms cooperating within the estrogen metabolic pathway may affect case and control status. This issue can be addressed by global testing as it has been recently proposed within the context of gene microarray analysis [41]. The global test is particularly applicable for the following situations: (1) Each of several genes (in our case polymorphisms) may have a small impact on the outcome, (2) few genes may have an impact on the outcome or (3) one gene may have a major impact on the outcome. Epidemiological variables were considered as confounders and we based our analyses on the dichotomized dataset.

Furthermore, we applied non-linear and non-parametric methods such as neural networks, support vector machines, and random forests.

Results

We genotyped 1261 study participants at 18 loci for the investigation of a possible association with breast cancer risk, either alone or in combination. Call rates were >97.5%. Concordance of MALDI-TOF MS genotyping data was 99.9%, for PCR-RFLP based genotyping 99.5%, and for TaqMan® allelic discrimination 100%. For CYP17A1_-34_T>C and COMT_472_G>A duplicate analyses were performed with MALDI-TOF MS and PCR-RFLP based genotyping with a concordance rate of 99.7%. EPHX1_337_T>C and EPHX1_416_A>G were genotyped by MALDI-TOF MS and repeated by TaqMan® allelic discrimination in randomly selected samples of 69% and 56%, respectively. Concordance rate was 99.9%. Genotype frequencies of 17 polymorphisms of cases and controls are given in Table 2 and allele frequencies of CYP19A1_681_(TTTA) n are given in Table 3. All genotype frequencies were in HWE with the exception of CYP1B1_1358_A>G in cases (P = 0.007).

Table 2 Frequencies and risk estimates of polymorphic loci at genes related to the estrogen metabolic pathway
Table 3 Allele frequencies of CYP19A1_681_(TTTA) n repeat polymorphism and combined alleles with CYP19A1_630_delCTT polymorphism in breast cancer cases and controls

Association analyses of single genetic variants

When we calculated ORs from genotype frequencies at 17 loci (Table 2) and allele frequencies at one locus (Table 3) we identified a significantly increased breast cancer risk for carriers of the CYP1B1_1358_GG genotype (OR = 2.57; 95% CI: 1.34–4.93). In addition we observed a borderline decreased risk for carriers of the HSD17B1_937_GG genotype (OR = 0.73; 95% CI: 0.52–1.01). Following correction for multiple testing the significance of these effects vanished. No significant differences between cases and controls were observed at the remaining 16 loci (Tables 2 and 3).

Stratified analyses

We stratified in two directions and observed four effects which we considered to be noteworthy. When we stratified genetic data for epidemiological variables we observed two associations involving BMI. Women with a BMI ≥30 kg/m2 carrying COMT_472_GG had an increased breast cancer risk (OR = 3.84; 95% CI: 1.43–10.3; Table 4). Women with a BMI <20 kg/m2 carrying HSD17B1_937_GG showed a decreased breast cancer risk (OR = 0.17; 95% CI: 0.04–0.63; Table 4). When we stratified epidemiological data for genetic variables we observed two associations involving BMI and HRT use. Carriers of COMT_472_GG had an increased breast cancer risk when they had a BMI ≥30 kg/m2 (OR = 3.87; 95% CI: 1.61–9.34; Table 4). Carriers of CYP17A1_-34_CC had an increased breast cancer risk when they used HRT ≥10 years (OR = 3.17, 95% CI: 1.39–7.25; Table 4). Following correction for multiple testing these results were not significant.

Table 4 Breast cancer risk estimation for genotypes stratified for epidemiologic variables and for epidemiologic variables stratified for genotypes (selection criteria: P < 0.01, minimum number of cases in strata 5)

Haplotype analyses

Haplotypes were estimated for all polymorphisms located within the same gene, i.e. CYP1A1, CYP1B1, CYP19A1, EPHX1, and GSTP1. For CYP1B1 and CYP19A1 we observed haplotype frequencies that differed significantly from those expected from an independent assortment: CYP1B1_1294_C/CYP1B1_1358_G and CYP1B1_1294_G/CYP1B1_1358_A as well as CYP19A1_630_insCTT/CYP19A1_681_(TTTA)11/CYP19A1_790_C and CYP19A1_630_delCTT/CYP19A1_681_(TTTA)7/CYP19A1_790_C were more frequent (P = 0.004, P = 0.002, respectively; Table 5). Frequencies of CYP1A1, EPHX1, and GSTP1 haplotypes were similar to those expected. No differences were observed between breast cancer cases and controls. Due to their possible role in slowing down or increasing the production of estrogen metabolites, we were particularly interested in frequencies of combined haplotypes of CYP1A1 461Asn-462Ile CYP1B1 432Val-453Asn COMT 158Met and CYP1A1 461Thr-462Val CYP1B1 432Val-453Ser COMT 158Val. No significant differences were observed between cases and controls.

Table 5 Haplotype frequencies of CYP1B1 and CYP19A1 in breast cancer cases and controls compared to those expected from an independent assortment

Multifactor model for breast cancer risk

Logistic regression

In a first multivariate logistic regression model we investigated a possible influence of the five epidemiological factors under investigation. In a second model we included the polymorphisms CYP1B1_1358_A>G and HSD17B1_937_A>G (Table 6), for which a significant and a borderline significant association with breast cancer risk has been observed in the preceeding univariate analyses, respectively. The comparison of both models showed a significant improvement of the risk association in the SNP containing model (P = 0.0044). In detail, the epidemiological variable breast cancer in mother or sisters gave a significant P-value (P = 0.01), but other parameters showed no significance. The genetic variable CYP1B1_1358_A>G showed an increased risk to develop breast cancer for carriers of the rare homozygous CYP1B1_1358_GG genotype (OR: 2.95; 95% CI = 1.51–5.75; Table 6), and the genetic variable HSD17B1_937_A>G showed a borderline significant decreased risk to develop breast cancer for carriers of the rare HSD17B1_937_GG genotype (OR = 0.72; 95% CI: 0.51–1.00; Table 6).

Table 6 Multivariate breast cancer risk model including five epidemiological and two genetic variables; overall P = 0.0069 (n = 1181)

Interaction analyses using logistic regression

To find possible pairwise interactions of an epidemiologic and a genetic variable or of two genetic variables we considered all possible pairs in an appropriate logistic regression model with breast cancer risk as an outcome. However, after accounting for multiple testing no significant interaction was found.

Cluster analysis

Performing the cluster analysis with Pearson’s corrected coefficient of contingency for epidemiological and genetic variables we observed no differences between the clusters in cases and controls except for EPHX1_416_A>G and CYP1A1_2452_C>A, which clustered in one group in controls whereas in the case group they appeared to be independent.

Multifactor dimensionality reduction

Multifactor dimensionality reduction (MDR) was applied to the original and dichotomized dataset to identify higher order effects. Using the original dataset a significant association of HRT use ≥10 years with breast cancer (P = 0.0107, sign test) was observed. For the dichotomized dataset three significant rules were found (HRT use ≥10 years, P = 0.0107; HRT use ≥10 years and HSD17B1_937_AA, P = 0.0010; HRT use ≥10 years and CYP1B1_1358_AG/GG and HSD17B1_937_AA, P = 0.0010). For an unbiased estimation of the classification performance independent test set training sets (¾ training, ¼ test) were evaluated however, none of these effects remained significant. Furthermore, none of the rule sets generated with the original dataset were significant (P < 0.05, chi squared statistic) for the test data in cross validation. Finally, no statistically significant interactions were found using MDR.

Logic regression

Application of the single and multiple tree version of logic regression demonstrated that no genetic or epidemiologic variable and no gene-gene or gene-epidemiologic interaction of order 2 or higher influenced the case-control status significantly.

Global testing

We investigated whether the whole set of polymorphisms associated with the estrogen pathway might be significantly related to the breast cancer risk rather than single polymorphisms. Again no significant associations were observed.

Other non-parametric and non-linear classification methods such as support vector machines did not detect any significant association between the set of considered polymorphisms and breast cancer either.

Discussion

We performed association studies at multiple polymorphic loci across the estrogen metabolic pathway to identify genetic variants that may contribute to breast cancer risk. We focused on potential risk classifiers including hereditary variables as well as endocrine and exocrine hormonal history being particularly interested in major effects and possible interactions of gene-gene and gene-environment effects. Our investigation included 18 polymorphisms of 11 genes which have been selected on the basis of a minimum minor allele frequency of 4% in Caucasians and a known or predictable functional effect. To our knowledge this is the first report in which these 18 variants have been investigated together within the same study population for joint analyses. A special feature of our study is the comprehensive statistical analysis by means of a multitude of computational approaches to identify or refute effects on breast cancer risk. We observed modest effects in single factor and subgroup analyses as well as multivariate analyses. Although, we employed additional interaction and cluster analysis, MDR, logic regression, and global testing no statistical significant association with breast cancer risk was observed.

In single factor analyses CYP1B1 was seemingly associated with breast cancer risk in that CYP1B1_1358_GG carriers had more frequently breast cancer than controls. The result however was not stable upon correction for multiple testing which is in line with a recent meta-analysis [42].

For the identification of subgroups of women at breast cancer risk we performed two-way stratification analyses. Given a required threshold P-value of 0.002 after adjusting for all 18 polymorphisms we consider P-values of <0.01 and a minimum number of five subjects in the strata noteworthy. For example, obese women (BMI <30 kg/m2) showed a fourfold increased breast cancer risk when they were homozygous carriers of COMT158Met. Likewise, carriers of COMT158Met showed a fourfold increased breast cancer risk when they were obese. Although, this observation is in line with epidemiologic findings of an increased breast cancer risk among obese women [43, 44] our observed breast cancer risk association with the more efficient COMT detoxification allele remains illusive. In contrast, slim women (BMI < 20 kg/m2) seemed to be protected when they were homozygous carriers of HSD17B1313Gly of which functional implications are unknown. Moreover, carriers of the promoter CYP17A1_-34_C variant showed an increased breast cancer risk when they had used HRT for more than 10 years. The latter observation may spur future investigations towards the elucidation of a molecular basis of an HRT-related breast cancer risk association.

Led by our findings from single polymorphisms we further hypothesized that any potential breast cancer risk effect might be observed and become even more evident in multifactor and interaction analyses. When we employed additional statistical methods neither method revealed a significant breast cancer risk model of gene–gene or gene–environment interaction even among high order interaction. Yet, our cluster analysis and logic regression approach suggest that some variables may influence the case-control status more than others. This observation stresses the value of association analyses and points to the likely role of yet unknown risk classifiers from this or other biological pathways which may corroborate towards a measurable breast cancer risk.

Additional information on a possible association between the genome and breast cancer risk may come from haplotype analyses. A recently reported mathematical model of the mammary estrogen metabolism provided a kinetic analysis for the estrogen metabolic pathway based on the conversion of 17beta-estradiol by the enzymes CYP1A1, CYP1B1, COMT, and GSTP1 into eight metabolites [45]. The model allows the prediction of concentrations of each metabolite including the transient quinones. Moreover, it was used to simulate the kinetic effect of enzyme polymorphisms on the pathway and identified haplotypes generating the largest amounts of catechols and quinones. 17beta-estradiol-3,4-quinone has been described as the most potent carcinogenic metabolite and combined haplotypes conferring variable metabolic activities were established across CYP1A1, CYP1B1, and COMT. Highest metabolite production has been assigned to the combined haplotype CYP1A1 461Asn-462Ile CYP1B1 48Arg-119Ser-432Val-453Asn COMT 158Met, for which an increased breast cancer risk has been shown in a case-control study [45]. Led by these compelling findings derived from in silico and genetic data we tested for a similar risk association in our case-control study. Yet, we were unable to detect an association with the combined haplotype CYP1A1 461Asn-462Ile CYP1B1 432Val-453Asn COMT 158Met or any other haplotype. In comparison to Crooke et al. who included four CYP1B1 polymorphisms our combined haplotype was limited to two of these CYP1B1 polymorphisms which may explain the discrepant results. However, there may be a chance that these divergent results may be due to the low frequency of the combined haplotype in both studies.

Our finding of a lack of associations between breast cancer and polymorphisms participating in the estrogen metabolic pathway is relevant since we investigated a large number of potential classifiers. These have been at least in part and/or may be still under scrutiny in various international studies. Of note is the recent recommendation by the National Cancer Institute Breast and Prostate Cancer Cohort Consortium to conduct a search for low-penetrance breast cancer genes within the steroid-hormone metabolism pathway in pooled analyses of multiple large cohort studies [19]. Considering the large body of literature with respect to reported gene-breast cancer associations and the inherent error rate recently debated by Ioannidis [46], our study with more than 1,200 cases and controls is substantial. Yet, despite its power to detect potential major effects, a single effort like ours requires replication and confirmation preferably from large collaborative efforts. This has been recently demonstrated by the Breast Cancer Association Consortium that reported null results for common genetic variants from pooled analyses in as many as 12,000–30,000 subjects [47]. Our findings are important because we may infer that genetic polymorphisms scrutinized by us do not significantly contribute to the breast cancer risk in our German and possibly other populations. Moreover, the data obtained from multifactor and interaction analyses shed new light on the strategies applied towards the identification of breast cancer risk associations. We neither found nor confirmed any suggested risk associations which supports the critical discussion reviewed in Folkerd et al. [48]. Thus, single factor analysis may be interpreted with caution because any putative classifier may drop from the list of candidates when scrutinized within the context of multifactorial analysis. We propose, that the tested polymorphisms of the estrogen metabolic pathway may not play a direct role in breast cancer risk and that future investigations should be extended to other DNA variants of these genes and to other regulatory pathways.