Introduction

Ischemic stroke (IS) is a heterogeneous multifactorial disorder. Studies in twins, families, and animal models provide substantial evidence for a genetic contribution to this disease [1, 2]. Some conditions where stroke occurs are inherited in a classical Mendelian pattern; for example, mutations in NOTCH3 underlie cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL) [3], and mutation of BRI causes familial British dementia, an amyloid angiopathy, with white matter lesions without hemorrhage [4].

Identifying individual causative mutations when stroke is the prominent feature remains problematic due to the complexity of genetic and environmental risk factors in this disease. While numerous genetic risk loci have been reported for IS, the majority of these have not been replicated. The most prominent exception is the reported association between PDE4D variants and ischemic stroke, first identified using an extended family based approach [5]. This work has been confirmed by several groups [6–13], although the variants associated at this locus are not consistent across populations [14].

Application of genome-wide single-nucleotide polymorphism (SNP) assays in the context of case-control, rather than family-based studies, holds considerable promise in the identification of novel susceptibility loci, even in complex genetic disorders. SNP-based genome-wide association studies have proven successful in diseases with high-risk conferring alleles, such as variability in CFH in age-related macular degeneration, and for lower risk alleles in type II diabetes (odds ratio of ~1.1 to 1.4), albeit with extremely large cohorts [15–17]. In addition to providing SNP association data, the genotyping assay also generates metrics that allow detection of copy number variants (CNVs), i.e., segments of the genome that may have been deleted or duplicated [18].

The CNV distribution in the human genome and its potential association with risk for complex human disease have been the center of much discussion [19]. Large deletions and duplications have previously been associated with rare familial diseases [20, 21], in addition to common traits such as α-thalassemia [22], and color blindness [23]. We have recently performed a pilot study of genome-wide association in a North American cohort of ischemic stroke patients and controls [24]. We have now analyzed these data specifically to assess the role of CNVs in an attempt to determine whether structural genomic variation may contribute to risk for IS.

Materials and methods

Subject collection

The Ischemic Stroke Genetics Study (ISGS) supplied all the stroke samples for the current study. ISGS is a prospective five-center North American case-control study. The protocol for ISGS has been reported previously [25]. For the stroke cohort, all cases had recent (within 30 days) first-ever IS confirmed by history, physical examination, and head imaging (CT or MRI). Stroke was defined according to the World Health Organization (WHO) definition [26]. Iatrogenic, septic embolic, vasospastic, and vasculitic stroke cases were excluded.

A single neurologist rater (RDB) classified ischemic strokes according to the Trial of Org 10172 in Acute Stroke Treatment (TOAST) [27], Oxfordshire [28], and Baltimore [29] criteria based on medical record review. Video-certified examiners assessed neurological impairment using the National Institutes of Health (NIH) Stroke Scale [30]. Functional status was assessed using the Barthel Index [31], Oxford Handicap Scale [32], and the Glasgow Outcome Score [33].

The control cohort used here and the identification of structural alterations in this cohort have been previously described [18, 34]. Briefly, neurologically normal subjects were derived from three panels (NDPT002, NDPT006, and NDPT009) containing DNA from 275 unrelated individuals from North America (133 males and 142 females) and one replicate sample. Each panel contains DNA from 92 unrelated individuals without a history of Alzheimer disease, amyotrophic lateral sclerosis, ataxia, autism, bipolar disorder, brain aneurysm, dementia, dystonia, Parkinson’s disease, or stroke. None had any first-degree relative with a known primary neurological disorder, and the mean age of participants at sample collection was 68 years, ranging from 55 to 88 years (for more details, see http://ccr.coriell.org/ninds/catalog/panel/).

Sample preparation

All individuals provided written consent for the genetic analysis. Epstein–Barr virus (EBV) immortalization was performed as previously described [35, 36]. DNA for the experiments was extracted from the EBV-immortalized lymphocyte cell lines (LCLs); as we have previously shown, these LCLs remain highly faithful to the genotype of the source tissue [18].

Genotyping

All samples were assayed with the Illumina Infinium Human-1 and HumanHap300 SNP chips (Illumina Inc, San Diego, CA, USA). The Human-1 product assays 109,365 gene-centric SNPs and the HumanHap300 product assays 317,511 haplotype tagging SNPs derived from phase I of the International HapMap Project (www.hapmap.org). There are 18,073 SNPs in common between the two arrays; thus, the assays combined provide data on 408,803 unique SNPs. Any assay with a call rate below 95% was repeated on a fresh DNA aliquot; if the call rate persisted below 95% the sample was excluded from further analysis.
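The SNP arithmetic and the call-rate rule above can be checked with a short sketch. The function names and the "keep"/"repeat"/"exclude" labels are illustrative assumptions, not part of the genotyping software; only the SNP counts and the 95% threshold come from the text.

```python
# Illustrative sketch only: names and return labels are hypothetical.
HUMAN1_SNPS = 109_365        # gene-centric SNPs on the Human-1 chip
HUMANHAP300_SNPS = 317_511   # haplotype tagging SNPs on the HumanHap300 chip
SHARED_SNPS = 18_073         # SNPs present on both chips


def unique_snp_count(a: int, b: int, shared: int) -> int:
    """Size of the union of two SNP sets that share `shared` members."""
    return a + b - shared


def sample_disposition(first_call_rate, repeat_call_rate=None, threshold=0.95):
    """Apply the rule described in the text: repeat once, then exclude."""
    if first_call_rate >= threshold:
        return "keep"
    if repeat_call_rate is None:
        return "repeat"  # assay again on a fresh DNA aliquot
    return "keep" if repeat_call_rate >= threshold else "exclude"


total_unique = unique_snp_count(HUMAN1_SNPS, HUMANHAP300_SNPS, SHARED_SNPS)
print(total_unique)
```

The union count reproduces the 408,803 unique SNPs stated above.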

CNV detection

Data were analyzed using BeadStudio v2.1.10.0 (Illumina Inc., San Diego, CA). Two metrics were visualized using this tool: B allele frequency and log R ratio.

The B allele frequency is the theta value for an individual SNP corrected for cluster position. This parameter provides an estimate of the proportion of times an individual allele at each polymorphism was called A or B. In this setting, an individual who is homozygous for the B allele (BB genotype) would have a score close to 1, an individual homozygous for the A allele (AA) would have a score close to 0, and an individual who is heterozygous (AB) would have a score of approximately 0.5. Significant deviations from these figures in contiguous SNPs are indicative of a CNV.

The log R ratio is defined as the log (base 2) ratio of the observed normalized R value for the SNP divided by the expected normalized R value for the SNP's theta value. The expected R value is calculated from the values theta and R, where R is the intensity of dye-labeled molecules that have hybridized to the beads on the array and theta is the ratio of signal at each polymorphism for beads recognizing an A allele to beads recognizing a B allele. The expected R value for any individual at any typed SNP is calculated using a large population of typed individuals. Therefore, the ratio of observed R to expected R in any individual at any SNP gives an indirect measure of genomic copy number. A ratio above 1 (a log R ratio above 0) is indicative of an increase in copy number, and a ratio below 1 (a log R ratio below 0) suggests a decrease (deletion) in copy number. While this metric exhibits a high level of variance for individual SNPs, it does provide a measure of copy number when log R ratio values for numerous contiguous SNPs are visualized.
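The log R ratio definition, and the idea of looking for runs of contiguous SNPs that deviate in the same direction, can be sketched as follows. This is not the BeadStudio algorithm, and the plots in this study were evaluated visually; the threshold of 0.2 and the minimum run length of 5 SNPs are arbitrary assumptions chosen for illustration.

```python
import math


def log_r_ratio(observed_r: float, expected_r: float) -> float:
    """log2 of observed over expected normalized intensity."""
    return math.log2(observed_r / expected_r)


def candidate_segments(lrr_values, threshold=0.2, min_snps=5):
    """Return (first_index, last_index, kind) for runs of contiguous SNPs
    whose log R ratio is consistently above +threshold ("gain") or below
    -threshold ("loss"). Threshold and run length are hypothetical."""
    segments = []
    run_start, run_kind = None, None
    # A trailing neutral sentinel closes any run that reaches the last SNP.
    for i, value in enumerate(list(lrr_values) + [0.0]):
        if value > threshold:
            kind = "gain"
        elif value < -threshold:
            kind = "loss"
        else:
            kind = None
        if kind != run_kind:
            if run_kind is not None and i - run_start >= min_snps:
                segments.append((run_start, i - 1, run_kind))
            run_start, run_kind = i, kind
    return segments


# Made-up log R ratio values for 16 consecutive SNPs.
example = [0.0, 0.01] + [0.4] * 6 + [0.0] + [-0.6] * 5 + [0.3] * 2
print(candidate_segments(example))
```

Requiring several contiguous SNPs reflects the point made above: single-SNP values are too noisy to interpret, so the final two elevated SNPs in the example are not reported as a segment.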

Based on our own experience with this technology, we have established that the smallest copy number variation that can be reliably detected is ~50 kb [18, 37]. We evaluated both the log R ratio and the B allele frequency plots across the genome in all samples. Each identified CNV was compared with changes identified in our neurologically normal control population [18] and those published in the Database for Genomic Variants (http://projects.tcag.ca/variation/). We calculated that our study had good power (>90%) to detect rare variants conferring a genetic risk ratio of 3.75 or greater at an alpha of 0.05.
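Comparing each identified CNV against previously reported variants amounts to an interval-overlap check. A minimal sketch follows; the `Interval` type, the function names, and all coordinates are hypothetical and do not correspond to real database entries or to the actual comparison procedure used here.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # inclusive, base pairs
    end: int    # inclusive, base pairs


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def is_previously_unreported(cnv: Interval, known: list) -> bool:
    """True if the CNV overlaps none of the known intervals."""
    return not any(overlaps(cnv, k) for k in known)


# Hypothetical coordinates, for illustration only.
known_cnvs = [Interval("chr1", 100_000, 250_000), Interval("chr7", 5_000_000, 5_200_000)]
print(is_previously_unreported(Interval("chr1", 240_000, 300_000), known_cnvs))
print(is_previously_unreported(Interval("chr2", 240_000, 300_000), known_cnvs))
```

Any-base overlap is the loosest possible criterion; a stricter rule (for example, a minimum reciprocal overlap fraction) would classify more CNVs as unreported.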

Results

Analysis of CNVs in the control cohort was previously performed by us in a manner identical to that described here [18] and deposited publicly (http://projects.tcag.ca/variation/). Our previous work has demonstrated that the majority (10 of 10 alterations examined) of simple copy number changes ≤1 Mb in size that were observed in LCLs are also apparent in the blood sample used for immortalization. In contrast, those copy number changes larger than 1 Mb (or present as heterosomic alterations) appear to represent artifacts of the LCL creation and culture process (10 of 12 alterations examined), or correspond to V(D)J-like recombination events (two of two alterations examined) [18].

Within the stroke cohort we identified a total of 231 CNVs that were simple deletions or duplications, corresponding to 185 insertions (80%) and 46 deletions (20%), ranging in size from 1.7 kb to 2.1 Mb (ESM-Tables 1 and 2). Most of the 231 simple CNVs have been previously reported in healthy individuals or overlap with previously reported CNVs (ESM-Tables 1 and 2). Forty-five of the 231 simple CNVs (19.5%) are unique. Of these potential new sites of structural variation, only one genomic region, on chromosome 1, contained recurrent CNVs in three individuals with IS (IS-14, IS-236 and IS-553). The three individuals showed an apparently identical duplication spanning the genes SPRY domain-containing SOCS box protein 1 (SPSB1) and hexose-6-phosphate dehydrogenase (H6PD).

Because of the potential disease relevance of these alterations, we examined copy-number metrics at this locus in an additional 460 neurologically normal control samples (NDPT019, NDPT020, NDPT022, NDPT023, and NDPT024 from the NINDS neurogenetics repository at the Coriell Institute) using Illumina Infinium HumanHap550 SNP chips (unpublished data). These data showed the presence of CNVs at this locus in five of these samples (~1%).

We also identified 14 deletions and two duplications that were consistent with heterosomic copy number changes (those where not all of the cellular population examined carry the CNV) ranging from 30 kb to an entire chromosome (ESM-Table 3). Of the 14 heterosomic deletions, 50% spanned the immunoglobulin lambda gene cluster located at chromosome 22q11.22 and likely reflect normal V(D)J-type recombination [38]. Thus, in the IS patients, 146 of the 263 samples demonstrated some form of copy number variation. Forty-nine of these samples have more than one CNV, with a maximum of five within two samples (IS-1 and IS-55). Figure 1 shows the CNVs detected, sorted by chromosome.

Fig. 1

Simple CNVs detected, sorted by chromosome, in 263 patients with ischemic stroke. The red and the black stars indicate regions with insertions and deletions, respectively. Regions where we found both insertions and deletions are shown with violet stars. Numbers in parentheses are the total number of CNVs (insertions in red, deletions in black) found in each chromosome. Chromosome Y was excluded from the analysis.

Discussion

There is increasing discussion of the impact structural genomic alterations are likely to have in common diseases [19]. Cataloging CNVs using a methodology that assesses this variation in a genome-wide manner is critical for the identification of disease-associated genetic variability. With recent technologies such as the high-density SNP-based assays used here, an abundance of genomic copy number variations have been reported, ranging from kilobases to megabases in size. Further, this variation is readily identified in apparently healthy individuals [18, 38-40].

We report here the first genome-wide analysis of CNVs in IS patients. The CNVs identified in the current study were widely distributed throughout the genome. The majority of CNVs were rare (or orphan) changes. While single sample to group comparisons using the current methodology may under-represent common CNVs, the observation that the majority of alterations are rare is consistent with previous reports. A clear limitation of a SNP-array-based approach to CNV detection is coverage. Many regions of the genome are poorly covered with SNPs using these technologies, and thus many CNVs will be missed. In addition, SNPs were previously excluded from such arrays on the basis of apparent Mendelian errors within families; such an exclusion would clearly lead to the removal of SNPs in CNV regions. While this issue would be less of a problem for rare or orphan CNVs, it leads to an underestimation of common CNVs.

The accuracy of the current platform for identifying CNVs was previously evaluated [18, 41]. The EBV immortalization process and clonal nature of LCL culture have been shown to lead to structural genomic variation that is not detectable in the source tissue used for immortalization. In our previous work, concordance rates between DNA derived from LCLs and DNA extracted from source tissue were 100% for CNVs ≤1 Mb and 17% for CNVs >1 Mb (excluding apparent CNVs resulting from V(D)J-type recombination) [18]. Given this observation, we concentrated analyses on CNVs less than 1 Mb in size. Of the 45 CNVs identified that did not overlap with previously identified CNVs (Table 1), only one is recurrent in more than one IS sample: a duplication across SPSB1 and H6PD identified in three individuals. However, the presence of similar CNVs over this locus in 5 of an additional 460 controls suggests that this variant is not a risk factor for IS. The remaining CNVs may be of importance in the pathobiology of IS; however, given the low frequency of each individual alteration, screening of these variants in a very large cohort (thousands of cases and controls) would be required to make any unequivocal conclusions.
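The reasoning about the SPSB1/H6PD duplication rests on comparing carrier frequencies: 3 of the 263 IS samples versus 5 of the 460 additional controls. One way to make that comparison explicit is a two-sided Fisher exact test, sketched below in plain Python. No such test is reported in this study; the choice of test and the function name are assumptions made purely for illustration.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_ways = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total_ways

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Rows: cases, controls. Columns: carriers, non-carriers.
case_frequency = 3 / 263
control_frequency = 5 / 460
p_value = fisher_exact_two_sided(3, 260, 5, 455)
print(f"cases {case_frequency:.2%}, controls {control_frequency:.2%}, p = {p_value:.3f}")
```

The carrier frequencies are close to 1% in both groups, which is the basis for the conclusion that this duplication is unlikely to be a risk factor for IS.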

Table 1 Structural changes found in ischemic stroke that have not been previously reported in healthy controls

In summary, our study did not detect any common genomic structural variation unequivocally linked to IS. We cannot exclude the possibility that smaller CNVs or CNVs in genomic regions poorly covered by this methodology may confer risk for IS. The recent availability of higher density CNV-directed arrays from both Affymetrix and Illumina will increase the number of CNVs that can be detected and thus may go some way toward addressing this question; however, because some studies have reported linkage disequilibrium (LD) between CNVs and proximal SNPs [42, 43], we would predict that our LD-based whole-genome association study [24] would have detected common CNVs linked to disease, whether detected directly or not in this study, by showing association at SNPs tagging such changes. The application of genome-wide SNP arrays now facilitates the evaluation of structural changes through the entire genome as part of a genome-wide genetic association study.