Introduction

The recent revolution in human complex trait genetics comes from the development of genome-wide genotyping platforms. These, when combined with genotype imputation using reference haplotypes from the HapMap or 1000 Genomes Project, have led to the identification of thousands of robust genetic associations between complex human phenotypes and single nucleotide polymorphisms (SNPs) (1000 Genomes Project Consortium 2010; 1000 Genomes Project Consortium 2012; Altshuler et al. 2005, 2010; Frazer et al. 2007; Welter et al. 2014). Under the genome-wide association study (GWAS) framework, genotype imputation is key: it improves power to find markers that are not directly genotyped and facilitates meta-analyses between studies that are genotyped on different arrays. Notwithstanding its success, the GWAS approach has sub-optimal ability to capture low-frequency and rare DNA sequence variants because of limited or absent linkage disequilibrium (LD) between markers. The genetic community is currently shifting part of its interest from common to rare genetic variants to study the aetiology of complex human diseases. To this end, it is seeking strategies to more comprehensively characterize rare genetic variation in large and diverse human populations. In particular, because rare genetic variants tend to be population specific, such comprehensive surveys need to be done by direct DNA sequencing in populations of interest.

Although feasible, high-coverage (30×) whole-genome sequencing (WGS) remains expensive, especially if we consider the large sample size required for rare-variant genetic association studies. One alternative strategy is to sequence a large number of participants at low coverage (4–6×), and then use imputation methodology to recover missing genotypes within the sequenced individuals (Li et al. 2011). This method can also yield high-quality haplotypes that are useful to impute individuals from the same population with genome-wide genotype data available. The 1000 Genomes Project successfully implemented this approach in individuals from several populations from across the globe (1000 Genomes Project Consortium 2010; 1000 Genomes Project Consortium 2012). The SardiNIA and Genome of the Netherlands consortia also used low-to-intermediate coverage WGS to find common and rare population-specific variants (Danjou et al. 2015; Genome of the Netherlands 2014; Sidore et al. 2015; Zoledziewska et al. 2015). Their analyses confirmed that low-pass WGS data sets are appropriate for genetic association studies, to create imputation panels, or to ask population genetics questions that are specific to these populations.

The population of the Province of Quebec in Canada now includes 8.1 million individuals, ~78 % of whom are French speaking. This French-Canadian population largely originates from the immigration of 8500 mostly French settlers starting in 1608 and ending with the British Conquest of 1759. Following these events, the French-Canadian population of Quebec expanded rapidly, and an intense population growth was paralleled with the colonization of new geographical regions, such as Saguenay-Lac-St-Jean in the North-East and Gaspesia in the East (Supplementary Fig. 1). This rapid expansion and migration gave rise to regional populations with genetic features that can be explained by founder effects (Bherer et al. 2011; Gauvin et al. 2014; Moreau et al. 2011a; Roy-Gagnon et al. 2011). Geneticists have taken advantage of this genetic homogeneity in the French-Canadian founder population of Quebec to map genes involved in rare and common human diseases (Chami et al. 2014; Chetaille et al. 2014; Laprise 2014; Scriver 2001). To date, however, no WGS project has explored the pattern of rare genetic variation in French Canadians from Quebec, as well as its contribution to disease aetiology.

In this study, we sequenced the whole genome of 1970 French Canadians at 5.6× coverage. We identified ~29 million high-quality bi-allelic variants, of which 31 % were not previously reported in public databases (dbSNP 147). When stratifying by minor allele frequency (MAF), we found 6 million common (MAF > 5 %), 4 million low-frequency (0.5 % < MAF ≤ 5 %), and 19 million rare (MAF ≤ 0.5 %) markers. The majority of the bi-allelic variants are intergenic, while 44 % are intronic and 3 % are exonic. The WGS data show 99.8 % concordance when compared to ExomeChip genotype data available in the same participants, suggesting very high quality. We further demonstrate the utility of this WGS data set for genetic association studies, to build a French-Canadian haplotype reference panel for imputation, and to begin to explore the genetic history of this specific population.
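The three frequency classes used throughout the text can be written as a small helper. This is a minimal sketch: the function name `maf_class` is hypothetical, and it assumes the input is the frequency of one allele of a bi-allelic variant, which is folded onto the minor allele before applying the thresholds stated above.

```python
def maf_class(allele_freq: float) -> str:
    """Return the frequency class of a bi-allelic variant.

    Thresholds follow the text: common (MAF > 5 %),
    low frequency (0.5 % < MAF <= 5 %), rare (MAF <= 0.5 %).
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    # Fold the frequency so that it always refers to the minor allele.
    maf = min(allele_freq, 1.0 - allele_freq)
    if maf > 0.05:
        return "common"
    if maf > 0.005:
        return "low-frequency"
    return "rare"
```

Because the class boundaries are closed on the upper side, a variant with a MAF of exactly 5 % falls in the low-frequency class and one with a MAF of exactly 0.5 % falls in the rare class.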

Results

Genetic variation in French Canadians

We sequenced at 5.6× coverage 1970 French Canadians recruited by the Montreal Heart Institute (MHI) Biobank (Supplementary Fig. 2). Demographic and clinical information for the participants is available in Supplementary Table 1. After quality-control steps, we identified 29 M single nucleotide variants (SNVs) and small insertions/deletions (indels). Nearly 31 % of these genetic variants are not present in public databases (dbSNP 147); this number is consistent with results from a low-pass WGS project in European Americans (Morrison et al. 2013). We annotated all bi-allelic SNVs and indels identified (Table 1). As expected, likely detrimental mutations (nonsense, splice site, frameshift indel) are enriched at the rarer end of the allele frequency spectrum, consistent with purifying selection. For instance, whereas 17 % of synonymous SNVs are common, 5 % of nonsense and 11 % of frameshift indel variants have a frequency >5 % (Fig. 1).

Table 1 Number of bi-allelic variants per annotation type and minor allele frequency (MAF) class: common (MAF > 5 %), low frequency (0.5 % < MAF ≤ 5 %), and rare (MAF ≤ 0.5 %)
Fig. 1 Proportion of annotated variants in three minor allele frequency (MAF) classes: common (MAF > 5 %), low frequency (0.5 % < MAF ≤ 5 %), and rare (MAF ≤ 0.5 %). The definition of each functional annotation is provided in Table 1. We note an enrichment of missense, nonsense, and splice site variants among rare variants when compared to synonymous DNA sequence changes

The transition-to-transversion ratio, a metric used to assess the quality of DNA sequencing data, is 2.13, very close to the theoretical 2.1 ratio expected for WGS (DePristo et al. 2011). To further evaluate the quality of the WGS data, we calculated concordance with genotypes from an independent exome array experiment available for 1967 individuals. Across 83,273 bi-allelic SNPs shared between the WGS and ExomeChip data sets, we calculated an overall concordance rate of 99.8 %. When restricting this analysis to heterozygous genotype calls, we found concordance rates of 99.3, 95.2, and 85.3 % for common, low-frequency, and rare SNVs, respectively (Supplementary Tables 2–5). The rediscovery rate (percentage of bi-allelic markers from the ExomeChip data set detected by WGS) was 98, 97, and 76 % for common, low-frequency, and rare SNVs, respectively.
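For readers unfamiliar with the metric, the transition-to-transversion ratio can be computed from a list of reference/alternate allele pairs as sketched below. This is an illustration only, not the GATK computation used in the study; the function names are hypothetical, and anything that is not a single-base A/C/G/T substitution is simply skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref: str, alt: str) -> bool:
    """A transition swaps a purine for a purine or a pyrimidine for a pyrimidine."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs) -> float:
    """Compute Ts/Tv from an iterable of (ref, alt) allele pairs."""
    transitions = transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Skip indels, ambiguous bases, and non-variant records.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions
```

Each base can mutate to one transition partner and two transversion partners, so a ratio near 0.5 would be expected if all substitutions were equally likely; the value of about 2.1 expected for WGS reflects the real excess of transitions in human genomes.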

Replication of GWAS findings

This WGS experiment is part of a larger research program that focuses on the genetics of coronary artery disease in the French-Canadian population. For this project, we explicitly selected 984 myocardial infarction (MI) patients and 980 MI-free controls; additional clinical information on MI risk factors is also available for these participants (Supplementary Table 1). Although our sample size (and thus statistical power) is modest when compared to recent meta-analyses of GWAS results, we tested the association between known SNPs and MI status as well as high-density lipoprotein (HDL) cholesterol, low-density lipoprotein (LDL) cholesterol, total cholesterol (TC), and triglyceride (TG) levels to assess the quality of our data set to perform association testing.

For MI, 35 of the 44 known variants available in the WGS data set have an effect in the expected direction (binomial P = 1 × 10⁻⁴) and five are nominally significant (Table 2 and Supplementary Table 6) (CARDIoGRAMplusC4D Consortium 2013; Coronary Artery Disease (C4D) Genetics Consortium 2011; Myocardial Infarction Genetics Consortium 2009; Schunkert et al. 2011). This includes an intronic marker at the PHACTR1 locus, which we had previously shown to be associated with MI in the MHI Biobank using a different genomic technology for DNA genotyping (Beaudoin et al. 2015). For HDL-C, LDL-C, TC, and TG levels, respectively, 51 of the 69 (binomial P = 9 × 10⁻⁵), 42 of the 57 (binomial P = 5 × 10⁻⁴), 56 of the 73 (binomial P = 5 × 10⁻⁶), and 26 of the 40 (binomial P = 0.08) variants tested have effect alleles that modulate their respective phenotype in the direction expected from the literature (Supplementary Tables 7–10) (Global Lipids Genetics et al. 2013). We found eleven, nine, eleven, and eight genetic associations with HDL-C, LDL-C, TC, and TG levels, respectively, that are in the expected direction and nominally significant at P < 0.05 (Table 2). Overall, these results suggest that our WGS data set represents a robust resource to investigate novel genetic associations with coronary artery disease or its risk factors in the French-Canadian population.
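The directional-consistency test reported above can be illustrated with an exact binomial (sign) test. The sketch below assumes a two-sided test against a null probability of 0.5; the text does not state which variant of the binomial test was used, so this may not reproduce every reported P value exactly. The function name `sign_test_p` is hypothetical.

```python
from math import comb


def sign_test_p(n_consistent: int, n_total: int) -> float:
    """Two-sided exact binomial P value under H0: P(consistent direction) = 0.5."""
    if not 0 <= n_consistent <= n_total or n_total == 0:
        raise ValueError("need 0 <= n_consistent <= n_total and n_total > 0")
    # Under p = 0.5 the distribution is symmetric, so the two-sided P value
    # is twice the tail beyond the more extreme of the two counts.
    k = max(n_consistent, n_total - n_consistent)
    upper_tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2.0 * upper_tail)
```

Under these assumptions, `sign_test_p(26, 40)` evaluates to about 0.08, in line with the value reported for TG levels.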

Table 2 Replication of a subset of markers associated with myocardial infarction, high-density lipoprotein (HDL) cholesterol, low-density lipoprotein (LDL) cholesterol, total cholesterol, and triglyceride levels

Imputation

Motivated by recent reports that showed improvements in imputation quality when the target samples and reference haplotypes are from individuals of the same population (Genome of the Netherlands 2014; Pistis et al. 2015), we tested the utility of our WGS-derived haplotypes to impute genotypes in French Canadians. For this analysis, we imputed genotypes in 731 independent French Canadians previously genotyped on the Illumina Omni2.5M array. To this end, we used three reference panels: (1) 5008 haplotypes from all the populations from the 1000 Genomes Project phase 3, (2) 3940 French-Canadian haplotypes from our low-pass WGS experiment, and (3) the combined set of 1000 Genomes Project and French-Canadian haplotypes. We limited our analyses to 892,893 high-quality markers (HWE P value in WGS controls >1 × 10⁻⁵) on chromosome 1 that are shared between the 1000 Genomes Project and our WGS data set. To estimate imputation quality, we masked each known genotype in turn and computed the squared correlation (r²) between the imputed dosages and the masked true genotypes. Overall, the three reference panels supported excellent imputation quality, with mean imputation r² of 0.926, 0.942, and 0.944 for the 1000 Genomes Project, French-Canadian, and combined haplotypes, respectively (Fig. 2). We noted a slightly better imputation quality for rare variants using the French-Canadian rather than the 1000 Genomes Project haplotypes (mean r² of 0.76 vs. 0.71), consistent with previous reports (Fig. 2). The estimated imputation accuracy, calculated on both known and imputed genotypes, behaves in a similar way (Supplementary Fig. 3). Practically, investigators often filter at an imputation quality score ≥0.3 to select imputed markers for association testing. Using this threshold, 59 % of the rare variants imputed with the French-Canadian haplotypes would be selected instead of 55 % of the rare variants imputed with the 1000 Genomes Project haplotypes only.
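The quality metric described above, the squared correlation between masked true genotypes and imputed dosages, can be sketched for a single variant as follows. This illustrates the statistic only and is not the minimac3 implementation; the function name `imputation_r2` is hypothetical, and genotypes are assumed to be coded as alternate-allele counts (0, 1, 2) with dosages on the same 0 to 2 scale.

```python
def imputation_r2(true_genotypes, imputed_dosages) -> float:
    """Squared Pearson correlation between true genotypes and imputed dosages
    for one variant, computed across samples."""
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equally sized vectors with at least 2 samples")
    mean_g = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((g - mean_g) * (d - mean_d)
              for g, d in zip(true_genotypes, imputed_dosages))
    var_g = sum((g - mean_g) ** 2 for g in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_g == 0 or var_d == 0:
        # Monomorphic in this sample: the correlation is undefined.
        return float("nan")
    return cov * cov / (var_g * var_d)
```

The undefined case matters most for rare variants, where a small target sample may carry no copy of the minor allele at all.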

Fig. 2 Boxplots and mean of the squared correlation between masked genotypes and imputed dosages. For each of three haplotype reference panels, we report the imputation quality for ~10 K rare (MAF ≤ 0.5 %), ~20 K low-frequency (0.5 % < MAF ≤ 5 %), ~46 K common (MAF > 5 %), and all ~76 K shared polymorphic variants on chromosome 1. The number on each box corresponds to the mean EmpRsq quality score. 1000G, 5008 haplotypes from phase 3 of the 1000 Genomes Project; FC, 3940 haplotypes from this whole-genome DNA re-sequencing project in French Canadians; and combined, combination of the 1000G and FC haplotypes

The recently developed Haplotype Reference Consortium (HRC, http://www.haplotype-reference-consortium.org/) has combined 20 cohorts to obtain over 64,000 haplotypes that cover ~39 M variants. We compared the imputation performance of our French-Canadian panel to haplotypes from HRC, restricting this comparison to 808,746 shared variants on chromosome 1. Overall, the HRC panel performs slightly better than the French-Canadian panel (mean r² of 0.96 vs. 0.94), and that is also true for rare variants (mean r² of 0.79 vs. 0.76) (Supplementary Figs. 4 and 5). Thus, for variants present in HRC, the large number of haplotypes outweighs the population specificity of the French-Canadian panel. However, this result does not question the relevance of our French-Canadian WGS, since nearly 30 % of the variants found in our project are absent from public databases and could not be imputed with HRC.

Population structure in Québec

Although all MHI Biobank participants were recruited in the cosmopolitan Montreal area, we have information on the geographical origin of their four grandparents. Using these data, we could assign each participant to the 17 administrative regions in Quebec (Supplementary Fig. 1). For the subsequent analyses, we restrict our genetic analyses to four regions: the Montreal, Quebec City, and Saguenay-Lac-St-Jean regions, which are characterized by a strong founder effect, and the Gaspesia region, which harbours higher genetic heterogeneity caused by an influx of other populations, namely Acadians and Loyalists (Moreau et al. 2011b). For each region, each participant receives a score (0–4) that corresponds to the number of grandparents from this region. Thus, a participant with two grandparents from Montreal and two from Saguenay would get scores of 2 for these regions and scores of 0 for Quebec City and Gaspesia (Supplementary Table 11).
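The regional ancestry score described above amounts to counting grandparents per region. A minimal sketch follows; the function name, the region labels, and the input format (one region label per grandparent) are assumptions made for illustration.

```python
from collections import Counter

# The four regions analysed in the text (labels are illustrative).
REGIONS = ("Montreal", "Quebec City", "Saguenay-Lac-St-Jean", "Gaspesia")


def regional_scores(grandparent_regions):
    """Return a 0-4 score per region: the number of grandparents from that region."""
    if len(grandparent_regions) != 4:
        raise ValueError("expected the origin of exactly four grandparents")
    counts = Counter(grandparent_regions)
    # Grandparents from any other region or province contribute to no score here.
    return {region: counts.get(region, 0) for region in REGIONS}
```

For the example given in the text, `regional_scores(["Montreal", "Montreal", "Saguenay-Lac-St-Jean", "Saguenay-Lac-St-Jean"])` yields scores of 2 for Montreal and Saguenay-Lac-St-Jean and 0 for Quebec City and Gaspesia.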

We used multidimensional scaling (MDS) on partially independent bi-allelic variants (r² < 0.5) with MAF > 1 % to project the data on the ten dimensions that preserve most of the genetic distance. Then, we determined how genetic variation captured by low-pass WGS reflects regional ancestry. The first two dimensions from the MDS model reflect the regional structure of our population within Quebec. For instance, dimension C1 separates individuals from Saguenay-Lac-St-Jean, and to a lesser extent participants from Quebec City, from the rest of the sequenced participants (Fig. 3). Thus, C1 appears to capture the serial migration events that led to the peopling of the Northern regions of Quebec in the 19th century, consistent with previous observations (Moreau et al. 2011a; Roy-Gagnon et al. 2011). The second MDS dimension (C2) isolates individuals from Gaspesia (Fig. 3). Since a small number of the French Canadians in our data set (N = 28, Supplementary Table 11) do not originate from the province of Quebec, we derived similar scores for the other Canadian provinces. Interestingly, C2 also captures ancestry from Canada’s Maritime provinces (Nova Scotia, New Brunswick, and Prince Edward Island) that are historically and geographically linked to Gaspesia (Supplementary Fig. 6). Supplementary Figure 7 shows the same projection, but restricted to 199 individuals whose four grandparents are all from the same region (Montreal, Quebec City, Saguenay, or Gaspesia). Dimensions C1 and C2 capture the same patterns of population structure. Finally, we combined our French-Canadian data, restricted to variants with MAF > 5 %, with data from the 1000 Genomes Project to create a data set of 5 million shared genetic variants among 4474 individuals (2504 individuals from the 1000 Genomes Project). We then performed a new MDS analysis using only 429,800 partially independent variants (r² < 0.5). On C1 and C2, the French Canadians overlap mostly with individuals of northern European ancestry (CEU) (Supplementary Figs. 8 and 9).

Fig. 3 Projection of 1970 French-Canadian individuals on dimensions C1 and C2 from multidimensional scaling (MDS). Individuals are color-coded according to the number of their grandparents from Montreal (a), Quebec City (b), Saguenay-Lac-St-Jean (c), and Gaspesia (d). C1 isolates individuals with ancestry in Saguenay-Lac-St-Jean and Quebec City, consistent with the historical migration wave in the Province of Quebec. C2 opposes individuals from Gaspesia to the rest of the population in Quebec

Discussion

In this article, we present the first low-pass WGS effort to characterize the French-Canadian population. Our work builds on previous studies that targeted specific subpopulations to discover population-specific genetic variation. The resulting high-quality WGS data set reflects the regional structure of the French-Canadian population and is highly concordant with genotype data available for the same samples but obtained using a different technology. Despite the relatively small sample size of our data set, we replicated genetic associations with myocardial infarction and lipid traits. Furthermore, the corresponding population-specific haplotypes outperformed the 1000 Genomes Project panel for the imputation of rare variants in French-Canadian samples. All these results illustrate the depth, quality, and future usefulness of this WGS resource to investigate the genetics of complex human diseases and other quantitative traits in this specific population.

Although the French-Canadian haplotypes were more efficient than the 1000 Genomes Project haplotypes to impute rare variants in French Canadians, the gain remains modest. This is particularly noticeable when comparing with the gain in imputation quality of rare variants observed in other low-pass WGS projects, such as the Genome of the Netherlands and SardiNIA (Genome of the Netherlands 2014; Pistis et al. 2015). One major difference between these efforts and our WGS project is the absence of related individuals among the French Canadians that were sequenced. Related individuals would simplify the phasing of the genotypes and improve the quality of the haplotypes. An alternative approach to improve the quality of our French-Canadian haplotypes would be to use independent dense genotyping data (e.g., GWAS-type array) as a so-called “scaffold” to constrain the search space for imputation (Delaneau et al. 2014).

The development of very large imputation panels, such as the 64,976-haplotype panel from HRC, has called into question the value of WGS to identify genetic risk factors for complex human diseases. In particular, we compared the performance of HRC and haplotypes from our WGS project to impute rare variants in French Canadians, and noted slightly better imputation quality for the larger HRC panel (e.g., mean r² of 0.79 vs. 0.76 for rare variants). As expected, this suggests that for variants common to both data sets, the number of haplotypes is the main predictor of imputation quality. However, this result does not undermine the importance of carrying out population-specific WGS projects, since 31 % of the variants identified in French Canadians are not currently available in public databases (and would not have been imputed using HRC alone). This is consistent with the discovery by low-pass WGS of population-specific variants of large effect size on human complex phenotypes (Danjou et al. 2015; Sidore et al. 2015; Zoledziewska et al. 2015). WGS data also provide the opportunity to explore other types of DNA sequence polymorphisms, such as structural and highly repetitive variants. As sequencing costs continue to drop, the sample size of WGS projects will increase, allowing powerful association analyses between the entire human genome and complex diseases.

Online methods

Sample collection

984 cases and 986 controls for myocardial infarction (MI) (Supplementary Table 1) were recruited from the Montreal Heart Institute Biobank (Beaudoin et al. 2012). 980 controls were selected to be free of percutaneous coronary intervention, coronary artery bypass surgery, transient ischemic attack or stroke, peripheral vascular disease, congestive heart failure, and angina. Cases and controls were matched on sex, hypertension, diabetes, and dyslipidemia. All samples have French-Canadian ancestry, and the origin (birthplace) of their grandparents is available. Clinical information (MI status, age, sex, statin use, nuclear magnetic resonance-based HDL cholesterol and triglyceride levels, biochemistry-measured LDL-cholesterol and total cholesterol) is available for all participants.

Whole-genome DNA sequencing

Samples were sequenced with Illumina HiSeq technology at the McGill and Genome Quebec Innovation Centre. The resulting reads were then aligned to the UCSC reference genome b37 and filtered using a bioinformatics pipeline combining the Burrows–Wheeler Aligner (BWA) (Li and Durbin 2009), SAMtools (http://samtools.sourceforge.net), Picard (http://picard.sourceforge.net), and the Genome Analysis Tool Kit (GATK) v1.8 (McKenna et al. 2010) (Supplementary Note). Variant calling was performed across samples using GATK’s UnifiedGenotyper. We then applied GATK’s Variant Quality Score Recalibration to filter the resulting VCF file (Supplementary Note). GATK’s VariantEvaluation was used to compute statistics on the filtered variants. Sequences were phased, and sporadic missing genotypes were imputed, with Beagle v4 (Browning and Browning 2007). For each sample, the algorithm took the genotype likelihoods as input and used the remaining phased sequences as a reference panel (Supplementary Note). We used PLINK 1.9 (Purcell et al. 2007) and vcftools 0.1.12 to control the quality of our sequence data (Supplementary Note). The resulting quality-controlled bi-allelic variants were annotated using build hg19 of the reference human genome and the Efficient and Parallelizable Association Container Toolbox (EPACTS, http://genome.sph.umich.edu/wiki/EPACTS).

Concordance with the ExomeChip

All samples have been genotyped on the Illumina ExomeChip. Concordance on individuals and positions shared between the whole-genome sequence data set and the ExomeChip genotype data was calculated with PLINK using the option diff. We then recoded both files in additive components with PLINK and built the contingency table with R (Supplementary Note).
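The contingency-table step can be sketched in a few lines. This is an illustration written in Python rather than the R code used in the study; it assumes genotypes are already recoded as additive allele counts (0, 1, 2, with `None` for missing calls) and aligned by individual and position. The function names are hypothetical, and restricting to heterozygous calls as defined by the array is an assumption, since the text does not say which data set defines the heterozygous set.

```python
from collections import Counter


def concordance_table(wgs_calls, chip_calls):
    """Cross-tabulate additive-coded genotypes; keys are (chip, wgs) pairs."""
    table = Counter()
    for wgs, chip in zip(wgs_calls, chip_calls):
        if wgs is None or chip is None:
            continue  # ignore pairs with a missing call in either data set
        table[(chip, wgs)] += 1
    return table


def concordance_rate(table, chip_genotypes=(0, 1, 2)) -> float:
    """Fraction of compared calls that agree, restricted to the given array genotypes."""
    total = sum(n for (chip, _), n in table.items() if chip in chip_genotypes)
    agree = sum(n for (chip, wgs), n in table.items()
                if chip in chip_genotypes and chip == wgs)
    return agree / total if total else float("nan")
```

Calling `concordance_rate(table, chip_genotypes=(1,))` restricts the comparison to heterozygous array calls, which is the more stringent measure for low-coverage data because heterozygotes are the genotypes most easily miscalled when few reads cover a site.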

Genetic association analyses

The association analyses were performed using the Wald test from the software rvtest (http://genome.sph.umich.edu/wiki/RvTests). The covariates for all phenotypes included age-squared, sex, statin usage, and the first ten dimensions from multidimensional scaling (MDS). The MI case–control status was also a covariate for the inverse-normal transformed lipids phenotypes. We accounted for experimental batch effects for LDL-C and TC phenotypes.
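A rank-based inverse-normal transformation of a quantitative phenotype can be sketched as follows. The text does not specify which variant of the transformation was applied, so the rank offset of 0.5, the averaging of tied ranks, and the function name are all assumptions made for illustration.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.5):
    """Map values to normal quantiles: z = Phi^-1((rank - offset) / (n - 2*offset + 1)).

    Ties receive the average of their ranks. offset=0.5 is an assumed default;
    offset=3/8 gives the Blom variant.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the group of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]
```

The transformation preserves the ordering of the phenotype values while forcing an approximately normal distribution, which limits the influence of extreme lipid measurements on the Wald test.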

Genotype imputation in French Canadians

We evaluated the imputation performance of our French-Canadian reference panel on 751 French-Canadian samples genotyped on the Illumina Omni2.5 array. We excluded 20 samples already present in the reference panel based on an identity-by-descent analysis with independent SNPs at MAF > 1 %. The following analyses were performed on chromosome 1 only: the remaining 731 samples were filtered with PLINKseq and checkvcf, and phased with shapeit (Delaneau et al. 2012). The resulting haplotypes were imputed with minimac3 (http://genome.sph.umich.edu/wiki/Minimac3) using four reference panels: the Haplotype Reference Consortium (HRC); 1000 Genomes Project phase 3 (1000G); our French-Canadian reference panel (FC), excluding variants out of Hardy–Weinberg equilibrium (P < 1 × 10⁻⁵); and a panel combining 1000G and FC haplotypes. Imputation quality was measured with the metrics EmpRsq and Rsq produced by minimac3.
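The Hardy–Weinberg filter mentioned above can be illustrated with a one-degree-of-freedom chi-square test on genotype counts. This is a simplified stand-in: the text does not say which test was used, and tools such as PLINK typically apply an exact test, which behaves better for rare variants. The function name is hypothetical.

```python
from math import erfc, sqrt


def hwe_chi2_p(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """Chi-square (1 df) P value for departure from Hardy-Weinberg equilibrium."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no departure can be tested
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


def passes_hwe_filter(counts, threshold=1e-5) -> bool:
    """Keep variants whose HWE P value is at or above the threshold used in the text."""
    return hwe_chi2_p(*counts) >= threshold
```

Strong departures from equilibrium in controls usually point to genotyping or calling artifacts rather than biology, which is why such variants are removed before building a reference panel.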

Population structure analysis

Multidimensional scaling (MDS) was performed with PLINK on partially independent (r² < 0.5) variants with MAF > 1 %. The recorded origin of the four grandparents of each patient is heterogeneous: it might be a city, an administrative region of Quebec, the province of Quebec as a whole, or even other Canadian provinces or countries. To assign regional ancestry scores to individuals, we mapped their grandparents’ origin to the 17 Quebec administrative regions based on electoral data available at http://www.electionsquebec.qc.ca/francais/provincial/carte-electorale/municipalites-et-entites-administratives-2011.php.