Introduction

Alzheimer disease (AD) is the most common form of senile dementia. AD is the sixth leading cause of death in the US, with healthcare costs surpassing $200 billion in 2013 and anticipated to increase exponentially with the aging population.1, 2, 3 Clinical symptoms are broadly characterized by a slowly progressing loss of memory and cognitive functions, dementia, and ultimately death. Neuropathologically, deposition of β-amyloid peptide in the form of senile 'plaques' and oligomers (crucial to initiating AD pathogenesis), and accumulation of hyperphosphorylated tau protein in the form of intracellular neurofibrillary 'tangles', along with inflammation and neurodegeneration, are the hallmark characteristics in post-mortem AD brains.

Following advancing age, family history is the second strongest risk factor in AD. Traditionally, AD is classified into two dichotomous forms based on the age of onset and the associated genetic factors. The relatively rare early-onset familial AD (onset <65 years of age, <5% of diagnosed AD cases) is caused by highly penetrant, autosomal-dominant mutations in the three genes APP, PSEN1, and PSEN2. On the other hand, the more commonly diagnosed late-onset form of AD (LOAD, onset >65 years of age, >95% of AD cases) shows less obvious familial aggregation. APOE-ɛ4 remains the strongest risk factor in LOAD, where the ɛ4 allele confers 3.7-fold and 14-fold increases in risk in heterozygotes and homozygotes, respectively. Importantly, the identification of the above four genes was key to understanding the underlying molecular mechanism leading to AD, which is driven by β-amyloid oligomers and leads to tangle formation, loss of neurons, neuroinflammation, and dementia (‘amyloidosis’,4).

Since the first wave of genome-wide association studies (GWAS) beginning in 2007, more than a dozen GWAS and meta-analyses have been published, revealing several novel genetic loci in LOAD. Genes that either encompass AD GWAS single nucleotide polymorphisms (SNPs) or lie in close proximity to them include triggering receptor expressed on myeloid cells 2, bridging integrator 1, sialic acid-binding immunoglobulin-like lectin, clusterin, ATP-binding cassette transporter (ABCA7), complement receptor 1, and phosphatidylinositol binding clathrin assembly protein, to name a few. Overall, although twin and population studies estimate heritability in AD to be as high as 80%,5 all the above genetic factors taken together explain less than 50% of heritability in AD. Identification of the remaining genetic factors in AD will not only explain the missing heritability but will also be vital for fully understanding the disease pathogenesis and developing treatment strategies.

In this study we performed a systematic meta-analysis of family-based association test (FBAT) results using imputed genotypes (limited to minor allele frequency >0.05) generated in subjects from three large Alzheimer's family collections: National Institute of Mental Health (NIMH), National Institute on Aging Genetics Initiative for Late Onset Alzheimer’s Disease (NIA-LOAD), and National Cell Repository for Alzheimer’s Disease (NCRAD). A total of 3500 subjects from 1070 families were assessed in this study. To maximize the statistical power to detect disease-associated variants, we implemented a novel approach combining AD affection status and age of onset jointly using the multivariate extension of the FBAT approach.6, 7 We performed family-within component analyses and combined family-within and family-between component analyses (family-based association test using generalized estimating equations (FBAT-GEE method)), and finally the results from the three family-based samples were combined via meta-analysis. Meta-analysis results were computed separately for the two statistical approaches: (1) based on the family-within component analyses, and (2) based on the family-within and family-between component analyses.

Materials and methods

Study families

We utilized three large family-based AD samples in the association tests and the meta-analyses: NIMH, NCRAD and NIA. The NIMH Alzheimer disease genetics initiative study8 was originally ascertained for the study of genetic risk factors in AD in a family-based setting. The complete NIMH sample contains a total of 1536 subjects from 457 families. For the purposes of our study, 1376 participants (941 affected and 404 unaffected) from 410 families were included. The complete NIA-LOAD family sample and the National Cell Repository for Alzheimer’s Disease sample contain 4006 subjects from 1653 families. Here, we included 1040 subjects (748 affected and 282 unaffected) from 329 multiplex families. The families originally ascertained as part of the NCRAD subset of families include 1108 subjects from 331 families, with 799 affected and 293 unaffected siblings. Affection status was based on clinical dementia diagnosis documentation according to the NINCDS-ADRDA criteria. Patients diagnosed with mild cognitive impairment, unknown dementia, or unconfirmed family reports of dementia were excluded from our analysis. The initial age of detection of cognitive impairment in the patients was used for the age of onset phenotype. The basis for each cohort was the presence of at least two affected individuals within a family, typically siblings. All subjects are of self-reported European ancestry.

Genotyping and imputation

DNA samples from study subjects belonging to the NIMH and NCRAD AD families were processed on Affymetrix Human Genome Wide SNP 6.0 arrays. Samples that failed to pass Affymetrix quality control, showed conflicting gender or carried large chromosomal abnormalities were excluded from the study, as described in detail elsewhere.9, 10 The Human610-Quad array genotypes of the NIA-LOAD study samples were obtained from dbGaP (Genetic Consortium for Late Onset Alzheimer's Disease 6K, ID: phs000160.v1.p1). The quality of the array genotype data is described in the original report.11

Before performing an association analysis of our three family cohorts, we applied standard GWAS quality control procedures to all three samples (NCRAD, NIA-LOAD, and NIMH) as described elsewhere.12 SNPs and individuals were filtered for a call rate of at least 99%. In addition, SNPs with a minor allele frequency of <5% were excluded. Population stratification within and between the samples was also checked by multi-dimensional scaling (identification of population outliers) as implemented in PLINK.13 Duplicated DNA samples were identified from genome-wide genotype identity-by-state status (identity-by-state >1.98). From each pair, the individual with the lower genotyping rate was removed. In a second step, we used IMPUTE2 (ref. 14) to impute the quality-controlled NCRAD and NIMH datasets against the May 2013 release of the 1000 Genomes Project and the NIA dataset against the September 2013 release of the 1000 Genomes Project.15 SNPs with an info score smaller than 0.4 were removed. Next, we ‘called’ individual genotypes in the family studies by assigning the genotype with the highest imputation probability.
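The per-SNP filters and the genotype 'calling' rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study (which relied on PLINK and IMPUTE2): the data layout, the use of None for a failed call, and all function names are hypothetical, and the per-individual call-rate filter is omitted.

```python
MISSING = None  # assumed marker for a failed genotype call


def call_rate(calls):
    """Fraction of non-missing genotype calls for one SNP."""
    return sum(c is not MISSING for c in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF for one SNP from 0/1/2 allele counts, ignoring missing calls."""
    observed = [c for c in calls if c is not MISSING]
    if not observed:
        return 0.0  # no data: treated as monomorphic so the SNP is dropped
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1.0 - freq)


def passes_array_qc(calls, min_call_rate=0.99, min_maf=0.05):
    """Pre-imputation filter: call rate >= 99% and MAF >= 5%."""
    return (call_rate(calls) >= min_call_rate
            and minor_allele_frequency(calls) >= min_maf)


def passes_info_filter(info, min_info=0.4):
    """Post-imputation filter: drop SNPs with info score below 0.4."""
    return info >= min_info


def best_guess_genotype(probabilities):
    """Return the genotype (0, 1 or 2) with the highest imputation probability."""
    return max(range(len(probabilities)), key=lambda g: probabilities[g])
```

For example, a SNP with calls `[1, None, 1, 2]` fails the array filter because only three of four calls are present, and the probability triple `[0.1, 0.7, 0.2]` is called as a heterozygote.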

After imputation, we had a total of 43 207 737 markers from the three study cohorts, NIMH (n=14 129 045 markers), NCRAD (n=13 971 550), and NIA-LOAD (n=15 107 142), for association analysis. A total of 273 families from NCRAD with 470 affected and 279 unaffected siblings, 401 families from NIMH with 905 affected and 318 unaffected siblings and 618 families from NIA-LOAD with 1464 affected and 1096 unaffected siblings were analyzed.

Family-based association analyses

SNPs showing Mendelian errors were excluded from all the following analyses. Specifically, when a marker showed Mendelian inconsistencies in a pedigree, all genotypes for that marker were set to zero (missing) in that pedigree only, prior to performing association tests. To maximize statistical power and avoid multiple comparison problems, we used a multivariate extension of the FBAT approach,6 the FBAT-GEE7 statistic, and a Van Steen-like testing strategy.16, 17 FBAT-GEE, like the original FBAT, does not require any distributional assumptions for the phenotypes, and it tests different trait types simultaneously. Assuming that there are m traits for each offspring that we want to test simultaneously by the FBAT approach, we denote the vector containing all m observations for each offspring by Yij = (Yij1, …, Yijm), where Yijk is the kth phenotype for the jth offspring in the ith family. The multivariate FBAT-GEE statistic is constructed by replacing the univariate coding variable Tij in C by the coding vector defined by18

where the Ŷijk are the predicted trait values based on the regression model for covariates. Replacing the univariate coding variable Tij in the FBAT statistic by the vector Tij results in the FBAT-GEE statistic

Under the null hypothesis, the FBAT-GEE statistic has a χ2 distribution with m degrees of freedom.
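As a sketch of the notation in this subsection, the coding vector and the resulting statistic are commonly written in the following form in the FBAT-GEE literature. The symbols X_ij (coded genotype of offspring j in family i), P_i (the parental genotypes, or the sufficient statistic when they are missing), S and V are assumptions introduced here for illustration; the original article7 remains the authoritative definition.

```latex
% Coding vector: covariate-adjusted residuals of the m traits
% for offspring j in family i
\[
T_{ij} = \left( Y_{ij1} - \hat{Y}_{ij1},\ \ldots,\ Y_{ijm} - \hat{Y}_{ijm} \right)^{\top}
\]

% Score vector (length m) and its variance under the null hypothesis
\[
S = \sum_{i} \sum_{j} T_{ij} \left( X_{ij} - E\left[ X_{ij} \mid P_i \right] \right),
\qquad
V = \operatorname{Var}(S)
\]

% Quadratic form; chi-square with m degrees of freedom under the null
\[
\chi^2_{\text{FBAT-GEE}} = S^{\top} V^{-1} S \ \sim\ \chi^2_{m}
\]
```

With m = 2 (affection status and age of onset, as in this study), the test would have two degrees of freedom.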

In our case the FBAT-GEE statistic contains affection status and time to onset as phenotypes, coded as a Wilcoxon statistic. A more detailed description can be found in the original article.7

The analyses were divided into two steps: the family-within component approach and the combined family-within and family-between component approach.17, 19 The family-within component is a genetic association test based on Mendelian transmissions, whereas the family-between component is a population-based association analysis in which the genotypes are replaced by the expected genotypes conditional on the sufficient statistic, that is, the conditional mean model.19 The advantage of the within-family component is that it is robust against population stratification. However, the between-family component remains sensitive to population stratification and requires further adjustment.20 Therefore, several statistical techniques have been proposed that use both components.16 After performing within-family analysis (FBAT-GEE) and between-family analysis (conditional mean model), the results from the two family analyses are combined via meta-analysis, in which the FBAT P-value is used for the within-family analysis and a rank-based P-value for the between-family analysis.16 The rank-based P-values for the conditional mean model ensure maximal efficiency and, at the same time, maintain the robustness of the overall approach against population stratification.

The meta-analyses were performed with METAL,21 which summarized the P-values across our studies NCRAD (n=7,432,385 variants), NIMH (n=7,346,118), and NIA-LOAD (n=7,556,673) while taking sample size and direction of effect into account. First, a meta-analysis of the family-within component analysis (1) was performed for our three samples; second, a meta-analysis of the family-within and family-between component analysis (2) was performed.
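Combining P-values while accounting for sample size and direction of effect matches METAL's sample-size-weighted scheme: each P-value is converted to a signed Z-score, and the Z-scores are combined with weights proportional to the square root of each study's sample size. The sketch below illustrates that calculation in plain Python, assuming this scheme was the one used; the function names are hypothetical and the numbers in the usage note are illustrative inputs, not study results.

```python
from math import erfc, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def signed_z(p_value, direction):
    """Convert a two-sided P-value and an effect direction (+1 or -1) to a Z-score.

    inv_cdf is evaluated at p/2 (lower tail) so that very small P-values
    do not round to a probability of exactly 1.
    """
    return -direction * _STD_NORMAL.inv_cdf(p_value / 2.0)


def combine_studies(p_values, directions, sample_sizes):
    """Sample-size-weighted Z-score meta-analysis across studies.

    Returns the combined Z-score and its two-sided P-value.
    """
    weights = [sqrt(n) for n in sample_sizes]
    z_scores = [signed_z(p, d) for p, d in zip(p_values, directions)]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    z_meta = numerator / sqrt(sum(w * w for w in weights))
    p_meta = erfc(abs(z_meta) / sqrt(2.0))  # two-sided tail probability
    return z_meta, p_meta
```

A call such as `combine_studies([1e-3, 5e-3, 2e-2], [1, 1, 1], [749, 1223, 2560])` combines three concordant studies; if the directions disagree, the signed Z-scores partly cancel and the combined P-value grows, which is consistent with requiring the same effect direction in every sample.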

To check whether our top SNPs are in linkage disequilibrium with SNPs in a gene, we used the software package epigwas.22 For each SNP, the SNPs in a 1 Mb window (500 kb upstream and 500 kb downstream) are included in the calculation. The tool calculates linkage disequilibrium using 1000 Genomes data for the European population. Only SNPs with r2>0.5 are shown in the results section.
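The window and threshold logic above can be illustrated with a short Python sketch. It is not the epigwas implementation, whose internals are not described here: r2 is approximated as the squared Pearson correlation of 0/1/2 genotype counts rather than computed from phased haplotypes, and the function names and data layout are hypothetical.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(g1)
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (cov * cov) / (var1 * var2)


def ld_partners(index, positions, genotypes, window=500_000, min_r2=0.5):
    """Indices of SNPs within +/- `window` bp of SNP `index` with r2 > min_r2."""
    partners = []
    for other, position in enumerate(positions):
        if other == index:
            continue
        if abs(position - positions[index]) > window:
            continue  # outside the 1 Mb window centred on the index SNP
        if r_squared(genotypes[index], genotypes[other]) > min_r2:
            partners.append(other)
    return partners
```

In practice the genotype vectors would come from the 1000 Genomes European reference samples, and the reported partners would then be mapped to gene annotations.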

Results

The results of our two meta-analyses, (1) FBAT-GEE results for the family-within component analysis and (2) FBAT-GEE results for the family-within and family-between component analysis, are shown in Table 1. We present SNPs exhibiting family-based association with AD with P-value <10−6, a minor allele frequency >0.05, and the same effect direction in each family sample. The complete list of SNPs with P-values <10−5 is given in Supplementary Table 3a and b.

Table 1 Results of genome-wide family-based association analysis of imputed genotypes using three large collections of multiplex AD study families

The APOE region (rs56131196) shows highly significant results (P-values of 3.09 × 10−24 and 3.96 × 10−24 for approaches (1) and (2), respectively; 187 SNPs with P-value <0.05 for approach (1) and 283 SNPs for approach (2)). The three early-onset familial AD genes (APP, PSEN1, and PSEN2) harbor several SNPs with nominally significant association with AD. In the FBAT-GEE results for the family-within component analysis, the most strongly associated SNP in the APP region (among 47 SNPs with P-value <0.05) is rs190685835 (P-value=3.74 × 10−3). For the PSEN1 region, there are 61 SNPs with P-value <0.05 (rs3025774: P-value=0.02775); for the PSEN2 region, there are 15 SNPs with P-value <0.05 (rs182226938: P-value=2.23 × 10−3). In the FBAT-GEE results for the family-within and family-between components, the number of nominally significant SNPs is higher. The APP region has 452 SNPs with P-value <0.05 (rs141145244: P-value=1.9 × 10−4), the PSEN1 region has 311 (rs214277: P-value=1.07 × 10−4), and the PSEN2 region has 87 (rs149734051: P-value=1.06 × 10−3). This is in concordance with previous GWAS in AD, where variants in the three early-onset familial AD genes failed to show consistent genome-wide significant association with AD.

The results of our two meta-analysis approaches are shown in Table 1 (Panel A and B). We found 32 novel variants showing genome-wide significant association with AD and fulfilling the criteria described above, that is, P-value <10−6, minor allele frequency >5% and same effect direction across all samples tested. Detailed information can be found in Supplementary Table 1. A total of 15 variants were either in a gene (Table 1) or in linkage disequilibrium with SNPs in a gene, while 18 other SNPs (Table 1) were in proximity of a known gene. Three variants reached genome-wide significance in at least one of the four meta-analyses: rs7609954 (PTPRG): P-value=3.98 × 10−8; rs1513625 (PDCL3): P-value=4.28 × 10−8; rs1347297 (OSBPL6): P-value=4.53 × 10−8. A second SNP in OSBPL6, rs72953347, also showed marginally significant evidence for association with AD using the other meta-analysis approach (approach 2; P=6.36 × 10−7). In addition, two SNPs in the gene CDKAL1 showed marginally significant evidence for association with AD in the two meta-analysis approaches (rs62400067: P-value=3.54 × 10−7; rs10456232: P-value=9.60 × 10−7).

We next tested the 32 SNPs from Table 1 (Panel A and B) showing genome-wide significant association with AD using family-based methods in the International Genomics of Alzheimer's Project (IGAP) case-control dataset (Supplementary Table 4). SNPs showing association with AD in the family-based studies do not consistently replicate in case-control data, and vice versa. As seen in previous reports (reviewed elsewhere23), none of our top 32 SNPs from Table 1 showed genome-wide significance (with the exception of APOE SNPs) in the IGAP case-control GWAS dataset.24

Discussion

We carried out a family-based genome-wide association and meta-analysis on roughly 15 million imputed variants using three large AD family samples (~3500 subjects from 1070 families). We employed a multivariate phenotype combining affection status and onset age and then performed meta-analysis of the association results. Three SNPs, one in PTPRG (rs7609954), one in OSBPL6 (rs1347297), and another near PDCL3 (rs1513625), showed genome-wide significant association with AD in the meta-analysis. In addition, another SNP in OSBPL6, rs72953347 (P-value=6.36 × 10−7), and two SNPs in the gene CDKAL1 (rs10456232 and rs62400067) showed marginally significant association with AD.

OSBPL6 encodes a member of the oxysterol-binding protein (OSBP) family, a group of intracellular lipid receptors. This gene adds to a growing number of other cholesterol-related genes implicated in AD genetics, for example, APOE and ABCA7. Differential gene expression studies have previously implicated OSBPL6 in Niemann-Pick Type C Disease, Parkinson disease and AD.25, 26, 27 The precise pathogenic mechanism remains unclear, but it is speculated that OSBPL6 may affect cognitive decline through cholesterol-mediated pathways.28 PTPRG encodes a member of the protein tyrosine phosphatase (PTP) family, known to function as signaling molecules that regulate cell growth, differentiation, mitotic cycle, and oncogenic transformation. PDCL3 encodes phosducin-like 3, which is believed to modulate heterotrimeric G-proteins by binding to their beta and gamma subunits. It has also been proposed to play a role in angiogenesis by serving as a chaperone for the vascular endothelial growth factor (VEGF) receptor, KDR/VEGFR2, and regulating its ubiquitination and degradation.29 PDCL3 has also been proposed to modulate caspase activation by interacting with the inhibitor of apoptosis.30 CDKAL1 encodes the methylthiotransferase family member cyclin-dependent kinase 5 regulatory subunit associated protein 1-like 1, and has been previously associated with noninsulin-dependent diabetes mellitus. Interestingly, cyclin-dependent kinase 5 has also been implicated in AD tangle pathology.31

The most significant results of our meta-analysis of family-based GWAS differ from those of the published meta-analyses of large-scale case/control studies. In addition, while many of the top hits from IGAP24 are also significantly associated with AD and age-at-onset of AD in our family-based meta-analysis (Supplementary Table 4), they do not achieve genome-wide significance in our family-based association analyses. This observation can most likely be attributed to the fact that different types of association tests were used across these studies, that is, population-based association tests vs family-based association tests, which require the presence of both linkage and association. Given that the family-based tests combine the evidence of both linkage and association, the P-values in the two meta-analyses may vary for each SNP and the same P-value ranking cannot be expected. This lack of replication is not uncommon in GWAS of complex human traits and is often attributed to several other factors, including insufficient statistical power, population stratification, differences in genetic ancestry and age-dependent genetic effects, to name a few.32 While the case-control method is the most common study design owing to ease of sample ascertainment, its main concern is population stratification effects, most notable for SNPs in regions subject to natural selection.33 On the other hand, family-based studies are more robust against population admixture and stratification and allow both linkage and association testing,18, 34 but may lack power owing to the small number of families present in the studies. The analytic approaches used in most studies address these pitfalls of the two study designs, and allowing for these caveats, both types of designs yield useful and complementary information.35 In this study, another important factor is that, in order to maximize power, our family-based meta-analysis used a multivariate phenotype combining AD and age-at-onset of AD, while the meta-analysis of case/control design tested for AD without taking age-at-onset into account.24

The use of a multivariate phenotype may also explain why our top meta-analysis association findings do not replicate in IGAP,24 as the meta-analysis of case/control studies does not incorporate the age-at-onset information. However, the most important factor that contributes to the non-replication in IGAP may be the adjustment for population substructure. The case/control studies use principal component approaches, which work well to adjust for global genetic stratification but cannot account for local genetic stratification. The FBAT-based meta-analysis is robust against any genetic confounding, global or local. Any type of locus-specific stratification could therefore bias a principal component-based association analysis and result in an undetected true genetic association, which could be the case here.

In summary, using close to 15 million imputed variants we performed a systematic family-based genome-wide association and meta-analysis using a multivariate phenotype combining affection status and onset age in three large collections of AD families. The meta-analysis of the association results revealed three SNPs that show genome-wide significance for association with AD risk, in the genes PTPRG and OSBPL6 and near the PDCL3 gene. One of our top genes, OSBPL6, has previously been implicated in AD in transcriptomic studies of post-mortem brains. Further studies will be required to replicate these novel findings and to elucidate the pathophysiologic role of these AD-associated genes and variants in the etiology and pathogenesis of AD.