Introduction

Copy number variation (CNV) is a well-established cause of rare genomic disorders (Freeman et al. 2006), but the presence of wide-spread CNVs in apparently phenotypically normal individuals was not recognized until recently (Iafrate et al. 2004; Sebat et al. 2004; Sharp et al. 2005; Tuzun et al. 2005; Conrad et al. 2006; Freeman et al. 2006; Hinds et al. 2006; McCarroll et al. 2006; Redon et al. 2006; Wong et al. 2007). The presence of large structural variants in the human genome challenges the dogma that a germ-line diploid state represents normal copy number for all DNA regions across the entire genome.

Although a significant fraction of these germ-line deletions and gains are likely benign variants, CNVs affecting dosage-sensitive genes or regulatory regions may mediate or predispose to phenotypic outcomes. Characterization of large populations will help uncover the biological significance of CNVs and facilitate the identification of infrequent hereditary and de novo CNVs that may underlie Mendelian disease, genomic disorders, and common diseases. Therefore, to ascertain the relative contribution of CNVs to human variation and to enable the study of their role in disease predisposition, we have constructed the first North American population-based resource of copy number variable regions (CNVRs: stretches of overlapping or adjacent gains or losses of DNA) and measured the population frequencies of CNVs in these regions by characterizing 1,190 controls from Ontario, Canada.

Materials and methods

DNA samples

Biospecimens were obtained from the Ontario Familial Colorectal Cancer Registry (OFCCR), a member of the National Cancer Institute Cooperative Family Registries for Colorectal Cancer Studies (http://www.epi.grants.cancer.gov/CFR/about_colon.html) (Cotterchio et al. 2000). Approximately, 1,200 control subjects (random sample of men and women) living in Ontario, Canada were recruited by telephone from a list of randomly selected residential telephone numbers for Ontario and from population-based Tax Assessment Rolls of the Ontario Ministry of Finance. The 1,190 subjects averaged 63.5 ± 8.6 years in age, and consisted of 662 males and 516 females (data unavailable for 12 individuals). The ancestry of these subjects was self-reported as follows: Caucasian (n = 1,062), Black (n = 8), East Asian (n = 20), South Asian (n = 15), and Hispanic (n = 1). Ethnicity data were not available for 78 individuals and three subjects were of mixed ancestral backgrounds.

Study subjects donated a venous blood sample and peripheral blood lymphocytes were isolated using Ficoll-Paque, according to the manufacturer’s recommendations (Amersham Biosciences, Baie d’Urfé, Quebec, PQ, Canada). Genomic DNA was extracted from lymphocytes by phenol–chloroform. The Research Ethics Board at Mount Sinai Hospital, Toronto, approved study protocols and informed consent was obtained from all enrolled study subjects.

Hybridization of samples onto the Affymetrix GeneChip Human Mapping 100 K Array Set

For target preparation prior to hybridization, 250 ng of genomic DNA were digested with either HindIII or XbaI (New England Biolabs, Ipswich, MA, USA), followed by ligation with HindIII or XbaI specific adapters (Affymetrix Inc., Santa Clara, CA, USA). The ligated DNA was diluted 4 fold, and then PCR amplified using a primer designed for the adapter DNA. The PCR reactions were purified using a Qiagen MinElute 96 UF PCR Purification Plate (Qiagen Inc., Valencia, CA, USA), and 40 μg of this purified product was fragmented using 0.2 units of DNase I (Affymetrix Inc.). The fragmented DNA was labeled using 0.214 mM DNA Labeling Reagent (DLR) (Affymetrix Inc.) and 105 U of terminal deoxy-nucleotidyl transferase (Tdt) (Affymetrix Inc.) for 2 h at 37°C. Hybridization onto the 50 K HindIII and 50 K XbaI arrays and subsequent steps were performed as described by the manufacturer (http://www.affymetrix.com).

Hybridization of samples onto the Affymetrix GeneChip Human Mapping 500 K Array Set

For target preparation prior to hybridization, 250 ng of genomic DNA were digested with either NspI or StyI (New England Biolabs), followed by ligation with NspI or StyI specific adapters (Affymetrix Inc.). The ligated DNA was diluted fourfold, and then PCR amplified using a primer designed for the adapter DNA. The PCR reactions were purified using a DNA Amplification Clean-Up Kit (Clontech, Mountain View, CA, USA), and 90 μg of this purified product was fragmented using 0.25 units of DNase I (Affymetrix Inc.). The fragmented DNA was labeled using 0.857 mM DNA Labeling Reagent (DLR) (Affymetrix Inc.) and 105 U of terminal deoxy-nucleotidyl transferase (Tdt) (Affymetrix Inc.) for 4 h at 37°C. Hybridization onto the 250 K NspI and 250 K StyI arrays and subsequent steps were performed as described by the manufacturer (http://www.affymetrix.com).

CNVR determinations

To identify a gain or loss in genomic copy number, we used the Copy Number Analyzer for GeneChip (CNAG) algorithm (Nannya et al. 2005), which was developed specifically for measuring copy number alterations using the Affymetrix GeneChip 100 and 500 K platforms. In proof-of-principle experiments, we used this algorithm to accurately predict a known structural genomic change (deletion of exons 1–16 in the MSH2 gene) in a previously characterized clinical sample, and to identify, in control samples, previously published CNVs. We then proceeded with our population-based study, analyzing each chip separately and pooling the results from all four chips across all 1,190 samples to delineate CNVRs.

We implemented several filter criteria in our analyses of Affymetrix GeneChip data to ensure high quality of the resultant CNVRs. Hybridization experiments with genotyping call rates of <93%, using the Dynamic Model (100 K platform) (Di et al. 2005) or BRLMM (500 K platform) (Hua et al. 2007) algorithms, were not included in the analysis. CNAG version 2 software (http://www.genome.umin.jp), which employs a Hidden Markov Model, was used to identify markers (probes) showing copy number variation greater or less than diploid (Nannya et al. 2005). The CNAG algorithm includes a built-in correction for signal-to-noise ratios, which uses quadratic regressions to account for the GC content and the length of the PCR products. Hybridization data from each Affymetrix chip (HindIII, XbaI, NspI and StyI) were analyzed separately. For each chip analysis, the sample set was divided equally and randomly into test and reference sets. The algorithm requires reference DNA sources to measure copy number alterations. Following the analysis of the first set of test samples, the data sets were swapped and the CNAG algorithm was rerun. In this manner, all samples were screened for copy number changes.

Prior to determining the CNVRs, a set of stringently defined filters was computationally applied on all preliminary copy number data. In order to reduce noise in the copy number data generated by CNAG, small 1 bp singletons (copy number changes based on a single marker) were excluded from the analyses. Large chromosomal segmental imbalances spanning more than 14 Mb were rare events, likely representing somatic changes in lymphocytes and were also excluded from further analysis. Once these two filters were applied, we further restricted our analysis to samples showing copy number changes at <1000 and <200 chromosomal (SNP or markers) positions for the 100K (HindIII, XbaI) and 500K (NspI, StyI) platforms, respectively. Samples with copy number alterations at chromosomal positions exceeding these thresholds had high noise to signal ratios. In addition, only markers observed as gains (or losses in the deleted regions) in at least two separate individuals were used to define CNVRs. This filter further increased the quality of CNVR determinations and excluded very rare structural variations (population frequency <0.17%) from our map of the more common CNVs present in the Ontario population. Once these filters were applied, all markers, from all four chips showing copy number changes, were merged and CNVRs were determined. The filtering criteria resulted in the inclusion of 1,190 samples in the analyses, with 939 and 604 samples having adequate quality copy number data across 3 and 4 chips, respectively (Table 1).

Table 1 Details of samples satisfying the filtering criteria

Since it is not possible to determine the exact boundaries of individual copy number changes using genome-wide SNP genotyping platforms, we approximated the boundaries of genomic regions with CNVRs. Each CNVR represents the union of all observed copy number changes at multiple chromosomal positions (i.e., at the SNP or marker locations) in a stretch of genomic DNA. Setting a threshold for a maximum distance between two adjacent markers in a CNVR, permitted approximation of CNVR breakpoints, such that a breakpoint occurred when this preset inter-marker distance was exceeded. To select the appropriate threshold value, we evaluated the effect of varying this variable on the number of resultant CNVRs. Since the slope of this curve markedly changed at an inter-marker distance of approximately 1 Mb, we selected 1 Mb as the threshold distance.

Quantitative PCR

Lymphocyte genomic DNA was assayed using the 7700 ABI real-time instrument (Applied Biosystems, Foster City, CA, USA) and the Platinum SYBR Green qPCR SuperMix-UDG assay system (Invitrogen Canada Inc., Burlington, ON, Canada), according to the manufacturer’s specifications. The comparative CT method (User Bulletin #2; Applied Biosystems) was used to identify copy number changes. For each sample and primer set, triplicate reactions were run. Since copy number changes at the MSH2 locus are rare events, which have only been observed in individuals with hereditary non-polyposis colorectal cancer, we selected a region within exon 13 of the MSH2 gene as the diploid reference threshold for the qPCR assay. MSH2 gene deletions had already been excluded in all relevant OFCCR cases by a multiple ligation-dependent probe amplification assay. For each CNVR validation experiment, we selected the negative control sample (predicted by our computational approach to be diploid at the CNVR locus) with a ΔC T value closest to zero and used it as the diploid reference for that CNVR. A t-test comparing ΔC T values was used to determine the statistical significance of the result. A significant change in copy number was observed when P < 0.05. Primer sets and PCR conditions are available upon request.

CNVR genomic feature annotation

All genomic feature information was obtained from the ‘Known Genes’ data track, downloaded from the University of California, Santa Cruz (UCSC) Genome Browser, May 2004 Assembly (NCBI Build 35), which corresponds to our Affymetrix chip SNP coordinates (Kent et al. 2002; Karolchik et al. 2003, 2004). Genome feature overlap was considered if at least 1 bp from the feature overlapped our CNVRs. Genes were defined by transcription start and end position.

Results

We constructed a CNVR frequency map by assaying lymphocyte genomic DNA, using the genome-wide Affymetrix GeneChip 100 and 500 K SNP genotyping platforms to identify genomic copy number changes. The use of four chips (i.e., the Affymetrix HindIII, XbaI, NspI and StyI chips) provides overlapping genomic coverage, resulting in both complementary and confirmatory CNVR results.

Using this computational approach, we assembled a genomic map consisting of 578 stringently defined CNVRs covering approximately 220 Mb (7.3%) of the human genome, with an average length of 408 kb (Fig. 1; Table 2). We found, on average, 6 ± 3 (SD) genomic regions with copy number alterations in each subject. The population frequencies of copy number changes within each CNVR were also estimated. Supplementary Table 1 provides the 578 CNVR coordinates and the population frequency and ancestry associations for each CNVR. Each CNVR consists of multiple markers (average = 22, range 2–292). For each CNVR, we counted the number of times a copy number change (gains and deletions counted separately) occurred at each marker position in our series of 1,190 samples. An average (±SD) was taken across all markers found in the CNVR and reported as the “Average Number of Samples with Copy Number Change in the CNVR” or as a percentage of the total number of individuals characterized (n = 1,190), estimating the population frequency of structural variation in each genomic region.

Fig. 1
figure 1

Genomic map of common CNVRs found in 1,190 population-based controls in Ontario, Canada. Chromosomal locations of copy number losses and gains are shown on the left and right of the ideograms. Colors indicate whether each CNVR occurs at a population frequency of <1% (blue), 1–5% (yellow) or >5% (red). CNVRs associated with low copy repeats are shown (*). Novel CNVRs that were not detected in the HapMap sample collection9 are also indicated (+)

Table 2 Characteristics of 578 CNVRs identified in 1,190 control subjects

We used simulation experiments to evaluate whether we had identified most, if not all, CNVRs detectable in our population with the Affymetrix GeneChip 100 and 500 K SNP genotyping platforms. Random sample sets of increasing size were selected from a pool of 1,190 subjects and CNVRs for each sample set were determined. Plotting the number of resultant CNVRs versus sample size revealed the effect of increasing sample size and showed our screen to be near saturation for this resolution of analysis (Fig. 2). The results follow a typical saturation curve defined by (Y = Max*X/Rate + X), where X and Y are points on the curve, Max is the maximum number of CNVRs in the population and Rate is the point at 50% saturation.

Fig. 2
figure 2

Curves showing that we detected CNVRs in our population to saturation (given our SNP platform and computational detection and filtering criteria). Increasingly larger sample sets were randomly selected from a pool of 1,190 subjects and CNVRs were determined using the computational approaches described above. For each increment of 100 samples, the experiment was repeated either 100 (red curve) or 200 (blue curve) times and averages of the total number of CNVRs detected were calculated and plotted

We tested the accuracy of CNAG copy number predictions in CNVRs #309 (loss), 336 (gain), 454 (gain) and 455 (loss). The estimated population frequencies of CNVs in these four genomic regions are 3.13, 2.02, 13.10 and 9.77%, respectively. For each CNVR, we tested 20 samples predicted by our computational approach to harbor a copy number alteration and 20 samples with diploid calls. The sensitivity and specificity of the CNAG copy number calls for CNVRs #309, #336, #454 and #455 were 90 and 85%, 80 and 100%, 60 and 80%, and 50 and 85%, respectively. These data suggest that the CNAG algorithm has greater detection sensitivity for copy number changes occurring at population frequencies <5%, which includes 99% of all CNVRs identified in our study. Consistent with previous reports (Huang et al. 2006), the specificity of the CNAG copy number calls was determined by quantitative PCR to be at least 80%, regardless of the population frequency of the copy number alteration. Together, these observations suggest that the CNV population frequencies reported in this study are accurate for CNVs occurring at frequencies of <5%, but may be underestimated for about 1% of CNVRs with genomic imbalances at population frequencies >5%.

Discussion

We used the Affymetrix GeneChip 100 and 500 K SNP genotyping platforms to assemble a genomic map of 578 CNV regions, covering approximately 220 Mb (7.3%) of the human genome. Although we observed CNVRs to be widespread across the human genome, copy number changes in the majority of these CNVRs are rare (>93% CNVRs include structural alterations occurring at <1% frequency). Population frequencies of 1–5% and >5% were estimated for CNVs present in approximately 6 and 1% of CNVRs, respectively. Our findings suggest that these structural variants are widespread across the human genome, but the majority occur at relatively low population frequencies. Therefore, based on the results obtained using the Affymetrix 100 and 500 K platforms to perform a genome-wide survey for CNVs, we postulate that most CNVs are infrequent events and are unlikely to seriously confound genome-wide case-control SNP association studies.

CNVs may be the product of recurrent events, resulting from non-allelic homologous recombination mediated by higher-order genomic architectural features (Freeman et al. 2006). The mapping of 34% of our CNVRs to genomic rearrangement hotspot sequences, namely sequences flanked by low copy repeats, supports this idea (Fig. 1). One limitation of the Affymetrix GeneChip 100 and 500 K SNP arrays is that these platforms are not tiling arrays, with a poor coverage over repetitive genomic regions where CNVs may be present. Since repetitive DNA stretches are often gene-poor, copy number alterations in these unmapped regions may represent benign genomic variants.

In keeping with evolutionary selection favoring genomic duplications over deletions (Freeman et al. 2006; Lee and Lupski 2006), regions with copy number gains were found to be more frequent in our population, more widespread across the genome, longer in size, and more likely to harbor genes. The frequency of copy number gains in the 578 CNVRs was 2.6-fold greater than the occurrence of deletions, with a total of 4,803 gains versus 1,867 deletions detected in 1,190 subjects. In addition, a larger fraction of the 578 unique CNVRs encompass gains (70%) compared to deletions (30%). Furthermore, CNVRs with deletions were found, on average, to be >2-fold shorter (207 vs. 494 kb for CNVRs encompassing DNA gains, t-test, P < 0.0001), while regions with copy number gains were more likely to include genes (Chi-square, 12.55, P < 0.001). In addition, an increased number of CNVRs with gains vs. deletions were associated with both OMIM morbid map (Fisher’s Exact Test, two-tail P < 0.001) and known cancer genes (Fisher’s Exact Test, two-tail P < 0.05) (Hamosh et al. 2002; Higgins et al. 2006; Sjoblom et al. 2006). We propose that increased copy numbers of tumor suppressor genes and hemizygosity for oncogenes may be protective against cancer.

Characterization of the CNVRs we identified revealed that these genomic regions contain hundreds of genes and disease loci. CNVRs are significantly enriched (hypergeometric test and Benjamini & Hochberg False Discovery Rate, P < 0.05) for rhodopsin-like receptor genes (e.g. olfactory, chemokine) (Supplementary Table 2). We found that 56% of the CNVRs are associated with known genes, including tumor suppressors and oncogenes. In fact, 55 known cancer genes overlap with CNVRs (Supplementary Table 3). Common CNVs rarely include dosage sensitive cancer genes, which may be the result of evolutionary selection. Of the 35 CNVRs associated with cancer genes, only three are associated with CNVs occurring at population frequencies >1%. These three CNVRs consist of copy number gains and overlap with the cyclin B1 interacting protein 1, Makorin-3, and myeloid/lymphoid or mixed-lineage leukemia genes. The presence of occasional structural variation involving cancer gene loci suggests a possible role for CNVs in genetic predisposition to cancer risk.

We found 172 genes in the OMIM morbid map of disease genes that overlap with the CNVRs we identified (Supplementary Table 4). Interestingly, one of these genes is PMP22, which causes Charcot-Marie-Tooth disease type 1A (CMT1A [MIM 118220]) by gene duplication (Lee and Lupski 2006). This observation supports evidence of previous reports describing cases of clinically healthy adult CMT1A patients (Berciano et al. 1994). Although the majority of CNVs mapping to disease loci occur at population frequencies <1%, we found 10 CNVRs with frequencies >1% associated with disease loci, including Prader-Willi syndrome (PWS [MIM 176270]), noninsulin-dependent diabetes mellitus (NIDDM [MIM 125853]), and autosomal recessive deafness (DFNB23 [MIM 609533]). We also detected rare copy number gains in a 401 kb genomic region (CNVR#284) that includes the PRSS1 gene. Triplications of PRSS1 (MIM 276000) have recently been associated with hereditary pancreatitis (Le Marechal et al. 2006), suggesting that the controls we identified with copy number gains in this locus may be at increased risk for this condition. CNVRs were also found to overlap with the CCL3L1 (MIM 601395), FCGR3B (MIM 610665) and HBD-2 (MIM 602215) genes, where increased germ-line genomic copy numbers have been recently associated with lower HIV, systemic lupus erythematosus (SLE), and Crohn’s disease susceptibility, respectively (Gonzalez et al. 2005; Aitman et al. 2006; Fellermann et al. 2006). We estimated the frequency of copy number changes in CNVRs encompassing these three genes to be 3.3, 0.4 and 0.9% in our control population. In agreement with a recent report showing an average of 2.99 ± 1.74 (SD) copies of the CCL3L1 gene in non-African subjects (Gonzalez et al. 2005), our predominantly Caucasian control group had an average copy number of 3 ± 0.6 (SD) in the genomic interval (CNVR #514) encompassing the CCL3L1 locus. A FCGR3B locus copy number gain was identified in six individuals, while no genomic losses involving this region were observed, which is consistent with recent studies demonstrating a low FCGR3B copy number association with SLE and anti-neutrophil cytoplasmic antibody (ANCA)-associated vasculitis (Aitman et al. 2006; Fanciulli et al. 2007). We observed a copy number gain of 4 ± 0.47 (SD) as the most frequent variation involving the HBD-2 locus. Since the control subjects in our investigation do not have a known history of inflammatory bowel disease, our observation supports the recent report showing a relationship between predisposition to colonic Crohn’s disease and a HBD-2 gene copy of less than 4 (Fellermann et al. 2006).

We found that 68.6% of our CNVRs overlapped with recently reported CNVs (Iafrate et al. 2004; Sebat et al. 2004; Sharp et al. 2005; Tuzun et al. 2005; Conrad et al. 2006; Freeman et al. 2006; Hinds et al. 2006; McCarroll et al. 2006; Redon et al. 2006; Wong et al. 2007), providing further validation for our genomic map of CNVRs. Not surprisingly, there is a correlation (r = 0.125) between the population frequencies of copy number alterations within a CNVR and overlap with previously published CNVs, with more common CNVs being previously reported. In addition to validating the observations of recent smaller studies, our large population-based investigation permitted us to organize CNVs into CNVRs, to determine the population frequencies of genomic imbalances within the 578 CNVRs, and to identify 183 genomic regions (CNVRs) with novel CNVs (Supplementary Table 5). These novel findings include CNVRs #339 and #352, which harbor copy number alterations occurring at population frequencies of 2.0 and 1.7%, respectively. CNVR #339 includes copy number gains affecting the orosomucoid 1 (MIM 138600) and 2 (MIM 138610) genes, which encode for two acute phase plasma proteins with possible roles in immunosuppression, whereas CNVR #352 consists of genomic losses involving protocadherin 15 (MIM 605514), a member of the cadherin gene superfamily. One possible mechanism by which CNVs may cause phenotypic diversity is by altering the expression of copy number variant genes. Although dosage effects may underlie the biological effects of certain CNVs, the functional consequences of copy number changes involving non-coding genomic sequences are not apparent. Further investigations are needed to distinguish biologically important structural variants from neutral polymorphisms.

Comparison of our CNVRs with those recently identified in 270 individuals from four populations with ancestry in Europe, Africa or Asia (the HapMap samples) (Redon et al. 2006) revealed 45% overlap (Fig. 1), suggesting that the two studies are complementary. In contrast to the former study, our investigation is sufficiently large to estimate the population frequencies of copy number changes and to identify less frequent variants. In addition, our genomic map is based on a homogeneous population, consisting of predominantly Caucasian (89.2%) controls from Ontario, Canada. Although, the 18 genomic regions where copy number changes occur most often (>2% frequency) were observed in both studies (Table 3), we identified hundreds of CNVRs, spanning at least 46 Mb (320 CNVRs), that were not detected in the HapMap samples (Supplementary Table 6). These include 9 CNVRs harboring genomic variants with population frequencies of 1–2%, which overlap with several genes, including genes belonging to the ras oncogene, cadherin, and acute phase plasma protein gene families. The remaining 311 CNVRs map to genomic regions where copy number changes are rare (<1% frequency).

Table 3 Characteristics of the 18 CNVRs with frequently occurring (>2%) copy number changes

In summary, using approximately 1,200 controls, we have demonstrated that CNVs are spread across hundreds of genomic regions covering 7.3% of the human genome. Although, we suggest that some degree of genome plasticity, manifested by CNVs, is expected in most individuals, copy number changes are infrequent events in the majority of CNVRs. We detected 183 novel CNVRs and suggest that certain CNVs may underlie genetic predisposition to cancer and other diseases. Using statistical simulation, we show that our sample size is sufficiently large to identify nearly all CNVRs present in our population using the Affymetrix GeneChip 100 and 500 K SNP genotyping platforms, and suggest that similar sample sizes will be necessary to fully characterize this relatively infrequent form of genetic variation in other worldwide populations. This North American population-based map of human copy number variable genomic regions will be a resource for future genetic studies.