Introduction

Genome-wide association studies (GWAS) have been successful at identifying common germline variants associated with the risk of developing colorectal cancer (CRC). The success of the genome-wide design has been driven mainly by large international collaborative efforts that pool resources and samples into datasets of tens of thousands of cases and controls, which help identify genetic risk factors with only moderate associated risks. Over 50 genetic risk variants have been identified thus far (Al-Tassan et al. 2015; Broderick et al. 2007; Cui et al. 2011; Dunlop et al. 2012; Houlston et al. 2008, 2010; Jaeger et al. 2008; Jia et al. 2013; Peters et al. 2012, 2013; Schmit et al. 2014; Schumacher et al. 2015; Tenesa et al. 2008; Tomlinson et al. 2007, 2008, 2011; Wang et al. 2014; Whiffin et al. 2014; Zanke et al. 2007; Zhang et al. 2014), with odds ratios typically in the range 1.10–1.25 and minor allele frequencies typically no less than ~10 % (partly by design of genotyping arrays). Once the low-hanging fruit has been picked, the design becomes more challenging, since the discovery of additional variants with smaller effects or lower allele frequencies may require increasing the sample size by an order of magnitude. Although not as informative from a public health perspective, these additional, undiscovered variants still have the potential to help elucidate parts of the pathobiology.

The American Cancer Society and the US Multi-Society Task Force on Colorectal Cancer recommend early detection testing starting at 40 years of age for those with a family history of CRC, given their higher risk of developing tumors (Read and Kodner 1999; Levin et al. 2008; Lieberman et al. 2012). The lifetime increase in risk in those with a family history of CRC is about twofold (Slattery et al. 2003), partly due to shared genes and/or shared environment with the affected relative (Lichtenstein et al. 2000). Because they share part of the genome and the genetic risk background of their affected relative, the inclusion of controls with a family history of CRC may reduce the power to detect a genetic association with the disease in a case–control study. By excluding these controls from the study, we show that power can be increased even if the sample size is reduced. Moreover, we argue with empirical evidence that excluding controls who were diagnosed with colorectal (CR) polyps (potential precursors of tumors), when such a diagnosis is available, may also lead to an increase in power. This allows for a re-evaluation of GWAS without the need to increase the sample size or genotype additional samples.

Materials and methods

Sample description and genotyping

The cases and controls included in the present GWAS consist of a subset of samples that were collected across multiple study centers, within the Genetics and Epidemiology of Colorectal Cancer Consortium/Colon Cancer Family Registries (GECCO/CCFR) (Peters et al. 2013). As a result of simulation-based power calculations and empirical observations, we attempted to increase the power to detect an association by excluding controls with a positive family history and controls who were diagnosed with CR polyps. CR polyp status was self-reported in answer to questions such as “has a doctor ever told you that you had polyps in your large bowel or colon or rectum?”. Table 1 describes the sample sizes of each study, before and after exclusion of controls, and the genotyping platform used in each. Replication of initial results from GECCO/CCFR was attempted in samples from 6 studies from the Colorectal Cancer Transdisciplinary Study (CORECT) (Wang et al. 2014) (Table 1). Genome-wide significant results were then analyzed in samples of African ancestry [1894 cases (49.6 % females; mean age 67.9) and 4703 controls (35.2 % females; mean age 61.6)] and of Japanese ancestry [2627 cases (42.1 % females; mean age 65.3) and 3797 controls (45 % females; mean age 64.7)] to evaluate trans-ethnic effects of the SNPs. These samples were genotyped using Illumina 1M-Duo, 660W-Quad or Omni 2.5M arrays, depending on the center (see Wang et al. 2014 for details).

Table 1 Sample sizes

Statistical power comparison

To confirm that the exclusion of controls with a positive family history of CRC would not lead to a reduction, but rather an increase in power, we performed a simulation study. We simulated the segregation of a susceptibility SNP in nuclear families. Sibship size followed a Poisson distribution with mean 3.5 sibs. One susceptibility SNP was simulated with varying allele frequency and relative risk (with risk alleles acting multiplicatively on the risk). The segregation of alleles in the nuclear families and the simulation of the disease state of all family members were performed using SLINK (Schäffer et al. 2011). Lifetime risk of the simulated disease was fixed at 5 % (Siegel et al. 2014). A total of 11,800 cases and 14,300 controls (the approximate sample size of all samples in GECCO/CCFR) were randomly selected among all affected and unaffected individuals, respectively. Once an individual was selected, all other members of the nuclear family became ineligible to enter the case–control sample. Having a family history of the simulated disease was defined as having at least one affected first-degree relative (sib or parent). For each combination of allele frequency and effect size, 400 replicates were assessed for association between the simulated SNP and the disease status using a simple allelic Chi-square test (--assoc command in PLINK; Purcell et al. 2007), before and after exclusion of the controls with a positive family history. Power was estimated as the proportion of replicates reaching significance at p < 5 × 10⁻⁸.
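The simulation design described above can be illustrated with a small, self-contained sketch. It is not the SLINK/PLINK pipeline used in the study: the function names, the Hardy-Weinberg calibration of the baseline risk, and the "one random member per family" sampling scheme are all assumptions made for illustration, and the sample sizes and significance threshold in the usage note are deliberately tiny so that it runs in seconds.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_family(rng, freq, rel_risk, lifetime_risk, mean_sibs=3.5):
    """Return [(risk_allele_count, affected), ...]: two parents, then sibs."""
    # Assumption: baseline risk is set so that the population-wide risk equals
    # lifetime_risk under Hardy-Weinberg proportions and a multiplicative model.
    baseline = lifetime_risk / (1 - freq + freq * rel_risk) ** 2
    parents = [(rng.random() < freq, rng.random() < freq) for _ in range(2)]
    sibs = [(rng.choice(parents[0]), rng.choice(parents[1]))
            for _ in range(poisson(rng, mean_sibs))]
    family = []
    for genotype in parents + sibs:
        count = int(genotype[0]) + int(genotype[1])
        risk = min(1.0, baseline * rel_risk ** count)
        family.append((count, rng.random() < risk))
    return family


def draw_sample(rng, n_cases, n_controls, **model):
    """Pick one person per family until both groups are full.

    Returns (case_counts, controls); each control is
    (risk_allele_count, has_affected_first_degree_relative).
    """
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        family = simulate_family(rng, **model)
        idx = rng.randrange(len(family))
        count, affected = family[idx]
        if affected and len(cases) < n_cases:
            cases.append(count)
        elif not affected and len(controls) < n_controls:
            # Parents (idx 0, 1) are first-degree relatives of the sibs only;
            # a sib is a first-degree relative of every other family member.
            relatives = family[2:] if idx < 2 else family[:idx] + family[idx + 1:]
            controls.append((count, any(aff for _, aff in relatives)))
    return cases, controls


def allelic_chi2(case_counts, control_counts):
    """1-df allelic chi-square statistic and p value from risk-allele counts."""
    a, c = sum(case_counts), sum(control_counts)
    b, d = 2 * len(case_counts) - a, 2 * len(control_counts) - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def estimate_power(n_reps, alpha, seed=1, **kwargs):
    """Power with all controls, and with family-history-negative controls only."""
    rng = random.Random(seed)
    hits_all = hits_fh_neg = 0
    for _ in range(n_reps):
        cases, controls = draw_sample(rng, **kwargs)
        everyone = [count for count, _ in controls]
        fh_negative = [count for count, fh in controls if not fh]
        hits_all += allelic_chi2(cases, everyone)[1] < alpha
        hits_fh_neg += allelic_chi2(cases, fh_negative)[1] < alpha
    return hits_all / n_reps, hits_fh_neg / n_reps
```

A call such as `estimate_power(20, 0.05, n_cases=300, n_controls=300, freq=0.2, rel_risk=1.5, lifetime_risk=0.05)` returns the two estimated powers as a pair. Reproducing the setting in the text would mean 400 replicates, 11,800 cases, 14,300 controls and alpha = 5e-8, which this pure-Python sketch handles only slowly.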

Genome-wide association analysis

Imputation to HapMap2 Release 24 was performed using MACH for all studies, with the exception of OFCCR, which was imputed to HapMap Release 22 using BEAGLE. Log-additive models were fit and adjusted for age, sex, center, batch effect (in the ASTERISK study), smoking status (in the PHS study), and the first 3 principal components at the study level (using HapMap-imputed data). Replication was attempted in CORECT for the SNPs with meta-analysis p < 10⁻⁵ in GECCO/CCFR.

RNA expression studies

Two sample sets were used to assess the association between a SNP and expression of genes within a 2 Mbp window centered at the SNP position. Both studies evaluated gene expression in colon adenocarcinomas and normal colon tissues.

The first study (TCGA) consists of data from 155 colon adenocarcinomas and 19 normal colon tissues (from a total of 162 distinct donors: 12 matched tumor and normal adjacent pairs are included) from The Cancer Genome Atlas (TCGA; downloaded from CG Hub: https://cghub.ucsc.edu/). These samples have gene expression data derived from an Agilent 244K Custom Gene Expression Array and genotypes derived from the Affymetrix Genome-Wide Human SNP 6.0 Array. We used Level 3 expression data, which consist of normalized signals and expression calls per gene, per sample. Genotype data were obtained under approved access. We compared the genotype calls between tissues of the same donors. A patient was excluded if he or she presented discordant homozygous genotype calls at >1 % of homozygous markers (heterozygous genotypes were ignored because of the potential for loss of heterozygosity in tumors). The SNP data were analyzed with the --homozyg command in PLINK to identify regions with loss of heterozygosity (LOH); gene expression values in samples displaying LOH in the gene interval were ignored in analyses.
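The donor-level concordance check can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes that "homozygous markers" means markers called homozygous in both tissues, that genotypes are stored as two-letter strings keyed by marker ID, and the function names are hypothetical.

```python
HOMOZYGOUS = {"AA", "CC", "GG", "TT"}


def homozygous_discordance(normal_calls, tumor_calls):
    """Fraction of markers homozygous in both tissues whose calls differ."""
    shared = [m for m in normal_calls
              if normal_calls[m] in HOMOZYGOUS and tumor_calls.get(m) in HOMOZYGOUS]
    if not shared:
        return 0.0
    discordant = sum(normal_calls[m] != tumor_calls[m] for m in shared)
    return discordant / len(shared)


def keep_donor(normal_calls, tumor_calls, threshold=0.01):
    """Keep a donor unless more than `threshold` of homozygous calls disagree."""
    return homozygous_discordance(normal_calls, tumor_calls) <= threshold
```

Heterozygous calls are skipped on either side, so a marker that is heterozygous in normal tissue and homozygous in the tumor (as expected under LOH) does not count against the donor.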

The second study (CCFR) consists of data from 40 tumors and 40 paired adjacent normal tissues from 40 participants enrolled in CCFR, with gene expression data derived from the Affymetrix GeneChip Human Exon 1.0 ST Array and genotype data derived from Affymetrix Genome-Wide Human SNP 6.0 Array. This set of tumor/normal samples has been used in an eQTL (expression quantitative trait loci) study of previously published GWAS loci for CRC (Loo et al. 2012).

Differential expression was assessed using a non-parametric Wilcoxon rank-sum test when comparing two groups (tumor versus normal tissue), or a Kruskal–Wallis rank-sum test when comparing three groups (genotypes).
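For readers unfamiliar with these two tests, the sketch below computes both statistics from scratch. It is a simplified illustration, not the software used in the study: the rank-sum p value uses the large-sample normal approximation without tie or continuity corrections, the Kruskal–Wallis function is restricted to three groups so that the chi-square tail with 2 degrees of freedom has the closed form exp(-H/2), and all function names are hypothetical. In practice a statistics package would be used.

```python
import math


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation): (z, p)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])                        # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return z, math.erfc(abs(z) / math.sqrt(2))


def kruskal_wallis_3(groups):
    """Kruskal-Wallis H and p value for exactly three groups."""
    if len(groups) != 3:
        raise ValueError("this sketch handles exactly three groups")
    pooled = [v for g in groups for v in g]
    ranks = midranks(pooled)
    n, total, start = len(pooled), 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = max(0.0, 12 / (n * (n + 1)) * total - 3 * (n + 1))
    return h, math.exp(-h / 2)                 # chi-square tail, 2 df
```

Here `rank_sum_test(tumor_values, normal_values)` would correspond to the tumor versus normal comparison, and `kruskal_wallis_3([values_aa, values_ab, values_bb])` to a comparison of expression across the three genotypes of a SNP.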

Results

Controls with a family history or CR polyps potentially reduce power to detect association

As a proof of concept that power may be reduced when including controls with a positive family history of CRC in a case–control study, we evaluated a genetic risk score in GECCO by counting the number of risk alleles that an individual possessed across 36 SNPs identified by GWAS, after pruning those in LD (Al-Tassan et al. 2015; Broderick et al. 2007; Cui et al. 2011; Dunlop et al. 2012; Houlston et al. 2008, 2010; Jaeger et al. 2008; Jia et al. 2013; Peters et al. 2012, 2013; Schumacher et al. 2015; Tenesa et al. 2008; Tomlinson et al. 2007, 2008, 2011; Wang et al. 2014; Whiffin et al. 2014; Zanke et al. 2007; Zhang et al. 2014). The distribution of this genetic risk score was stratified by disease status and family history. Figure 1 shows that controls with a family history of CRC have genetic risk scores that are intermediate between that of cases and family-history-negative controls, indicating that controls with a family history share some genetic risk with their affected first-degree relatives.
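The unweighted risk score amounts to a simple allele count, which can be sketched as below. The data layout (dictionaries of two-letter genotype strings), the field names `case` and `family_history`, and the SNP identifiers in the usage note are hypothetical and chosen only for illustration.

```python
from statistics import median


def risk_score(genotypes, risk_alleles):
    """Count risk alleles carried across the scored SNPs.

    genotypes: dict mapping SNP id to a two-letter genotype such as "AG".
    risk_alleles: dict mapping SNP id to its risk allele such as "G".
    """
    return sum(genotypes[snp].count(allele) for snp, allele in risk_alleles.items())


def scores_by_stratum(samples, risk_alleles):
    """Group risk scores by (disease status, family history)."""
    strata = {}
    for sample in samples:
        key = ("case" if sample["case"] else "control",
               "FH+" if sample["family_history"] else "FH-")
        strata.setdefault(key, []).append(risk_score(sample["genotypes"], risk_alleles))
    return strata


def median_by_stratum(samples, risk_alleles):
    """Median risk score per stratum, a one-number summary of each boxplot."""
    return {key: median(scores)
            for key, scores in scores_by_stratum(samples, risk_alleles).items()}
```

With `risk_alleles = {"snpA": "G", "snpB": "T"}`, a sample whose genotypes are `{"snpA": "AG", "snpB": "TT"}` scores 3. With 36 SNPs the score ranges from 0 to 72.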

Fig. 1 Count of risk alleles. Boxplot representation of the total count of risk alleles in cases and controls, stratified on family history (FH)

Simulation-based power calculations support the strategy of excluding controls with a family history of CRC: across a wide spectrum of allele frequencies and relative risks, Supplementary Table S2 indicates a gain in statistical power even though the number of controls is reduced by over 20 %. This motivated exclusion of controls with a positive family history.

Family history is a feature that can easily be simulated, through specification of penetrances (including phenocopies), segregation of alleles or shared environmental variables, and ascertainment. For other traits or features, such as diagnosis of CR polyps in controls, it can be hypothesized that power may be reduced by the inclusion of samples that display them. However, these traits may not be straightforward to incorporate in an assessment of power; interpretation would only be as good as the underlying model linking the trait (say, presence of CR polyps) to the likelihood of developing the disease. For these traits, stratifying the risk score, as was done for family history, can provide insights. Similar to the family history-based stratification, Supplementary Figure S1 shows that controls who were previously diagnosed with CR polyps have a genetic risk score intermediate between that of cases and that of other controls. Because the diagnosis of CR polyps is correlated with family history of CRC, Supplementary Figure S1 focuses only on samples without a family history. Based on this empirical evidence and the results from the simulations described above, we excluded from this analysis controls with a family history and/or a diagnosis of CR polyps.

Genome-wide association study and replication

Samples in the discovery phase of this study, which were collected across multiple study centers within GECCO/CCFR, were analyzed after exclusion of controls with a family history of CRC or diagnosis of CR polyps. Of note, among the centers that sampled both sexes, female controls were more likely to have reported a family history of CRC than males (fixed effect model: OR = 1.31; p = 0.0006) and less likely to have reported CR polyps than males (OR = 0.65; p = 2 × 10⁻⁸). Control individuals who reported family history were slightly older than those who did not (mean of 64.06 years compared to 63.49; p = 0.011, adjusted for center). In contrast, control individuals who reported polyps were substantially older than those who did not (mean of 65.9 years compared to 63.3; p < 10⁻⁸).

Association results between genetic variants and risk of developing CRC in the resulting samples are graphically summarized in the Manhattan plot depicted in Fig. 2. The inflation factor (λ = 1.019) is comparable to the one calculated when no controls are excluded (λ = 1.021; Fig. 2b, c).
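The inflation factor is commonly computed as the median of the observed 1-df chi-square statistics divided by the median expected under the null (about 0.455). A minimal sketch under that definition follows; the function name is hypothetical, and the input is assumed to be two-sided p values strictly greater than 0.

```python
from statistics import NormalDist, median


def inflation_factor(p_values):
    """Genomic inflation factor (lambda) from two-sided 1-df test p values."""
    norm = NormalDist()
    # A two-sided p value maps back to a 1-df chi-square as z**2,
    # where z is the normal quantile at p / 2.
    chi2 = [norm.inv_cdf(p / 2) ** 2 for p in p_values]
    expected_median = norm.inv_cdf(0.75) ** 2   # median of chi-square(1), ~0.455
    return median(chi2) / expected_median
```

Perfectly uniform p values give a value of 1. Values slightly above 1, such as those reported here, indicate only mild inflation of the test statistics.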

Fig. 2 Association results. a Manhattan plot of results in GECCO/CCFR. Controls with family history and/or polyps are excluded from the analysis. Each dot represents a SNP plotted on the x axis relative to its position in the genome, whose level of significance is represented on the y axis. Green dots represent SNPs in LD with SNPs identified in published GWAS for CRC. Replication in CORECT was attempted for SNPs with p < 10⁻⁵ (blue horizontal line). The red horizontal line indicates p = 5 × 10⁻⁸; b quantile–quantile plot of p values in (a), on the negative log scale. λ is the inflation factor (the ratio of observed to expected median); c quantile–quantile plot of p values when no controls are excluded from the analysis

Replication was attempted in samples from CORECT for SNPs that reached significance at p < 10⁻⁵ in the discovery phase. Supplementary Table S3 shows results for these SNPs in both phases of the study after pruning for linkage disequilibrium (LD) (reporting the most significant SNP among SNPs with r² > 0.5).
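One way to read the pruning step is as a greedy pass over SNPs ordered by significance, keeping a SNP only if it is not in LD with one already kept. The sketch below follows that reading; it is an assumption rather than the documented procedure, it approximates r² by the squared Pearson correlation of allele dosages (0, 1 or 2 copies) rather than a haplotype-based estimate, and the function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:      # monomorphic SNP: correlation undefined
        return 0.0
    return sxy * sxy / (sxx * syy)


def prune_by_ld(p_values, dosages, threshold=0.5):
    """Keep the most significant SNP of each group with r² above threshold.

    p_values: dict mapping SNP id to p value.
    dosages: dict mapping SNP id to a list of allele dosages, one per sample.
    """
    kept = []
    for snp in sorted(p_values, key=p_values.get):
        if all(r_squared(dosages[snp], dosages[k]) <= threshold for k in kept):
            kept.append(snp)
    return kept
```

The returned list is ordered from most to least significant, so its first element is the lead SNP of the strongest signal.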

One SNP, rs17094983, reached genome-wide significance in the meta-analysis of all studies combined (p = 2.5 × 10⁻¹⁰) with no evidence of heterogeneity across centers (p for heterogeneity = 0.97) (Supplementary Figure S2). The minor allele of the SNP has a frequency of 13 % and is inversely associated with risk; the odds ratio (estimated by re-including the controls with FH or CR polyps, to eliminate the effect of the selection bias) is OR = 0.87 (95 % confidence interval 0.83–0.91; p = 4.7 × 10⁻⁹) compared to OR = 0.85 when these controls are excluded (Supplementary Figure S2). To evaluate trans-ethnic associations for that SNP, we first note that rs17094983 is monomorphic in populations of Asian ancestry according to the 1000 Genomes Project, and it has thus not been observed in the samples of Japanese descent; this also has been reported elsewhere (Peters et al. 2013). In samples of African descent, the SNP replicated (p = 0.01) with a minor allele frequency of 16 % and a consistent effect size (OR = 0.86, 95 % confidence interval 0.77–0.97).

Genes and transcripts in the region surrounding rs17094983 are illustrated in Fig. 3.

Fig. 3 UCSC browser representation of the 14q23.1 locus. Window is centered at rs17094983 ± 2 Mbp. Top track indicates position of SNPs in LD with rs17094983 (r² > 0.05) along with r² values

Study of expression quantitative trait loci

In the 2 Mbp window centered on rs17094983, The Cancer Genome Atlas (TCGA) includes expression data on 11 transcripts: ACTR10, ARID4A, JKAMP (C14orf100), C14orf37, DAAM1, DACT1, GPR135, KIAA0586, PSMA3, RTN1 and TIMM9. Figure 4 and Supplementary Figures S3–S12 show expression values of these genes in normal colon tissues and tumors as well as expression values in tumors stratified by genotypes at 3 SNPs in high LD with rs17094983 (which is not part of the Affymetrix 6.0 array available from TCGA): rs17094971 (r² = 0.81 with rs17094983, calculated from the EUR samples of the 1000 Genomes Project), rs1432096 (r² = 0.80) and rs710005 (r² = 0.54). RTN1 (Fig. 4) displays lower expression in tumors than in normal tissue and is the transcript that shows the most differential expression in the region (p = 1.3 × 10⁻⁸; based on a non-parametric Wilcoxon test). Notably, of the transcripts targeted by the expression array, RTN1 is among the genes with the highest average expression across normal colon tissues: only 13 % of transcripts in the genome have expression values higher than that of RTN1. In tumors, eQTL analyses reveal that RTN1 shows differential expression between genotypes of both rs1432096 (p = 0.022; based on a non-parametric Kruskal–Wallis test) and rs710005 (p = 0.0013), the latter being statistically significant even after accounting for the 33 eQTL combinations (SNP-transcript expression) that we tested [false discovery rate (FDR) = 4.2 % for rs710005]. Of the three SNPs, however, rs710005 is the one with the weakest LD with rs17094983. Expression values for the heterozygous genotypes are elevated compared to values for the common homozygous genotypes (homozygous for the apparent “risk” allele); this direction of association is consistent with the minor allele being inversely associated with risk, as normal tissue shows higher expression of RTN1. The number of normal tissues (n = 15) is too small to draw meaningful conclusions from eQTL analyses. No other transcript is associated (after accounting for multiple testing) with any of these SNPs (Supplementary Figures S3–S12).
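The multiple-testing adjustment over the 33 SNP-transcript combinations can be illustrated with the Benjamini–Hochberg procedure. The text does not state which FDR method was used, so this choice is an assumption, as is the function name. Under it, the smallest of 33 p values is simply multiplied by 33 (unless a larger p value yields a smaller adjusted value), so p = 0.0013 gives roughly 0.043, in line with the reported 4.2 % once rounding of the p value is allowed for.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For example, `benjamini_hochberg([0.0013] + [0.5] * 32)[0]` evaluates to about 0.0429; the 32 filler p values are placeholders, not results from the study.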

Fig. 4 Expression of RTN1 in TCGA. a Boxplot representation of the expression of RTN1 in normal colon tissues and tumors. Significance calculated from Wilcoxon test. b–d Boxplot representations of the expression of RTN1 in tumors as a function of b rs17094971; c rs1432096; d rs710005. Significance calculated from Kruskal–Wallis tests

We sought to replicate RTN1 expression association results from TCGA using data from 40 normal colon tissues and 40 matched tumors from CCFR. Consistent with the TCGA data, RTN1 shows significantly lower expression in tumors compared to normal tissues (p = 1.1 × 10⁻⁸) (Fig. 5a). When stratified on genotypes, RTN1 expression levels show patterns of associations that are in the same direction as seen in the TCGA data, in both normal colon tissues [p = 0.041 for rs1432096 (r² = 0.80 with rs17094983); Fig. 5] and tumors (p = 0.041 for rs1432096; Supplementary Figure S13), suggesting that heterozygous individuals tend to show higher expression of RTN1 than common homozygous individuals, irrespective of whether the colon cells are normal or malignant.

Fig. 5 Expression of RTN1 in CCFR. a Boxplot representation of the expression of RTN1 in normal colon tissues and tumors. Significance calculated from Wilcoxon test. b–d Boxplot representations of the expression of RTN1 in normal tissues as a function of b rs17094971; c rs1432096; d rs710005. Significance calculated from Kruskal–Wallis tests

Discussion

We describe a strategy to re-evaluate GWAS data that may facilitate identification of additional genetic risk variants at genome-wide significance levels without necessitating an increase in sample size. Power can be increased by excluding from a case–control study those controls with a family history of the disease, or with other features that may make them more likely to carry genetic risk factors for the disease under study (such as a diagnosis of CR polyps, potential precursors of tumors of the colon). This also has implications for study design.

We report an association between SNPs at 14q23.1 and the risk of developing CRC. rs17094983 was mentioned in a published GWAS (Peters et al. 2013) for CRC but did not reach genome-wide significance (reported p < 3 × 10⁻⁶). The present study confirms the association at genome-wide significance levels. We show that genotypes of SNPs in high LD with it are significantly associated with expression of RTN1 (Reticulon 1), a protein-coding gene highly expressed in normal colon cells whose expression is substantially reduced in colon tumor cells.

The RTN1 gene produces three transcripts, which encode the RTN1-A, RTN1-B, and RTN1-C proteins. The expression values that we presented were derived from probes that target exons present in all three transcripts; there were no probes specific to a single transcript. These proteins are members of the highly conserved reticulon family, which is localized in the endoplasmic reticulum (ER). Reticulons show pro-apoptotic activity via the induction of ER stress (Kuang et al. 2005; Di Sano et al. 2007). The mechanisms by which RTN1 exerts its effects are not well understood. RTN1-A has been recently described as a mediator of chronic kidney disease progression that promotes renal injury through ER stress (Fan et al. 2015). In kidney epithelial cells, RTN1-A but not RTN1-C interacts with PERK, an ER stress molecule that activates an apoptotic pathway. RTN1-C is regulated by acetylation and its DNA-binding activity is required for its role as an inhibitor of histone deacetylase (HDAC) activity (Fazi et al. 2009). Inhibition of HDACs can result in hyperacetylation of proteins, which, in turn, induces apoptosis of tumor cells and sensitizes tumors to cell-death processes and to other drugs (Heerboth et al. 2014). RTN1-C overexpression sensitizes cancer cells to chemotherapeutic-induced apoptosis through p53-independent pathways (Di Sano et al. 2003). In androgen-dependent LNCaP prostate cancer cells, knockdown using siRNA targeting all RTN1 transcript isoforms enabled androgen-independent growth of these cells (Levina et al. 2015). Gastrointestinal stromal tumors (GISTs) with mutations in KIT or PDGFRA show frequent alterations of the 14q23.1 region, which includes the RTN1 gene (Astolfi et al. 2010). Moreover, the knockdown of RTN1 results in increased proliferation of mutation-harboring GIST cells. These studies indicate that decreased expression of RTN1 is related to survival and proliferation of cancer cells. In the present study, reduced expression of RTN1 in tumors and a further decrease in patients with risk-associated alleles are consistent with the abovementioned roles of RTN1 in cancer.

The strengths of this study are the large sample size and the increase in power to detect a genetic association gained by removing controls with a family history of CRC or a personal history of CR polyps. By excluding controls that may share the genetic risk background of their affected relatives, we have increased the differences between cases and the remaining controls. However, the OR estimated from samples that underwent this selection does not readily generalize to the whole population; we thus provided an OR estimated from the complete sample set, thereby making a distinction between the discovery aspects of the study and the estimation of the effect size. In the present study, genome-wide significance was observed with or without the excluded controls, owing to the large sample size at hand. With these controls excluded, the p value was more than one order of magnitude smaller, consistent with higher power; for smaller studies, an order of magnitude difference might be all that is needed for additional discoveries at genome-wide significance levels.