1 Introduction

Next-generation sequencing (NGS) studies of complex human traits and diseases are becoming commonplace for investigating the role of rare polymorphic variation in such phenotypes. Many analytic methods have been developed for the analysis of such rare variants with a particular emphasis on techniques that first aggregate information on rare variants within a gene of interest and then contrast this aggregated genetic information with the phenotypic outcome. The majority of such aggregation-based methods [16, 18, 22, 24, 37, 38] focus on population-based designs or case–control designs. However, family-based study designs are gaining traction in NGS projects since they provide inherent benefits over the traditional population-based designs. In particular, families ascertained based on multiple relatives with a particular phenotype tend to enrich the sample for rare causal variants compared to a general population, thereby making such variants easier to detect [39].

The appeal of family-based NGS studies has led to the development of a few analytic methods tailored for rare-variant analysis in such designs. Such methods [6, 13, 15, 29] generally apply a modeling framework that accounts for the relatedness of familial samples through appropriate modeling of kinship. However, such methods do not take into account the potential bias of findings due to population stratification. Population stratification is the presence of systematic differences between subpopulations both in the allele frequencies of the rare variants under study and in the distribution of the phenotype. Failure to model these differences will lead to an inflated false-positive rate and decreased power to detect real associations. For rare variants, the issue of population stratification is more severe than for common variants, as rare variants are more likely to be young mutations, which are more population-specific [11]. It has been shown that inclusion of self-reported ethnicity as a covariate is not sufficient to adjust for population stratification [31]. Similarly, standard methods that adjust for population stratification with common variants may not be as effective for rare variants. In particular, genomic control can lead to very conservative results for rare variants [14]. Although principal component analysis works well for spatially distinctive populations, the procedure fails for spatially non-distinctive populations [23].

With these concerns in mind, Jiang et al. [15] developed a rare-variant association test for quantitative traits in parent–child trios and nuclear families that, by design, was robust to population stratification. The method was motivated by the QTDT framework [1], which showed that the observed genotype of a familial subject could be partitioned into orthogonal between-family and within-family components. The between-family component can be defined as the expected value of the subject’s genotype within the family and can be constructed as the average of the parents’ genotypes or the average of the siblings’ genotypes. The within-family component is the deviation of the observed genotype from the between-family component. While the between-family component is sensitive to population stratification, the within-family component is robust to stratification since it is based on a family-specific deviation. Utilizing a kernel machine regression (KMR) framework for multi-marker analysis of familial quantitative phenotypes [6, 15, 30], Jiang et al. [15] created a robust rare-variant test by replacing observed sample genotypes in the standard KMR with their corresponding within-family genotypic components. Simulation results demonstrated that the approach yielded appropriate type I error even when strong confounding existed within the sample. As with other KMR approaches, the approach of Jiang et al. [15] derives p-values analytically using Davies’ [9] method, thereby allowing easy application to large-scale sequencing studies.

While the work of Jiang et al. [15] provides a powerful approach that is robust in the presence of population stratification, the method’s design limits its application to nuclear families and parent–child trios. However, many sequencing studies have emerged that utilize phenotype and genotype data collected on multiplex pedigrees that are larger and contain more distant relationships than those in nuclear families. Examples of such studies include the Epi4K study of epilepsy [10] and the Genetic Analysis Workshop 18 (GAW18) study of blood pressure. Large pedigrees have unique features that make them ideal for mapping traits associated with rare variants. Compared to nuclear families or trios, rare variants are further enriched in large pedigrees [34]. It has been shown that large pedigree studies have increased power compared to smaller families with the same total number of samples, especially for rare-variant sequencing data [32, 35, 36]. In addition to improved power, analysis of large pedigrees can provide evidence for both co-segregation and association, while population-based studies can only provide evidence for association [17, 26, 34]. Further, the study of large pedigrees provides a cost-effective strategy for rare-variant analysis as it enables in silico imputation of rare-variant genotypes in non-sequenced subjects using information from sequenced relatives coupled to knowledge of inheritance flow [7, 34]. With a large pedigree-based study design, researchers can also combine sequencing-based association studies with linkage analyses [26]. Recent research has identified rare variants associated with several diseases or traits such as hyperkalemic hypertension [21], spinocerebellar ataxias [33], hypolipidemia [25], and lithium-responsive bipolar disorder [8] by combining association and linkage approaches.

Given the obvious value of extended pedigrees, it would be useful to develop a robust family-based association test of rare variants for such designs that is also computationally efficient. While the method of Jiang et al. [15] is both robust and fast, it is limited to trios and nuclear families and therefore cannot be applied to studies such as GAW18, which possesses sequence data for 20 Mexican American families with an average pedigree size of 70 (see sample pedigree in Supplementary Fig. S1). Therefore, in this paper, we propose an expansion of the framework of Jiang et al. [15] to allow robust and efficient analysis of multiplex families of arbitrary size and structure. To do so, we employ a non-trivial modification of the QTDT framework for use in extended pedigrees developed by Abecasis et al. [2] that uses information from all genotyped family members to construct a more informative between-family genotypic component. We then derive the within-family component for each genotype and integrate this information within the KMR framework of Schifano et al. [30] to obtain a rare-variant test that is robust to population stratification. In the following sections, we first introduce our study setting and then describe how we use the QTDT framework to decompose genotype information to obtain a robust within-family component. We then show how to integrate this information within a KMR framework to yield our robust test. We also describe how we can improve the power of our robust test by pre-screening potential trait-influencing genes using genotype and phenotypic information from founders across families. Such founder information is orthogonal to the within-family information used in our proposed test. We then evaluate our method using both simulation studies and sequencing data from a study of systolic and diastolic blood pressure (SBP and DBP) provided by GAW18.

Fig. 1 Example of pedigree structure

2 Materials and Methods

2.1 Study Design and Notation

We assume a family-based study consisting of N families, where each family consists of a large pedigree. While we use Fig. 1 as an example here to show the structure of the large pedigree, our method can be applied to any family structure and can accommodate any family size, unlike the original framework of Jiang et al. [15]. Suppose there are s rare variants in a gene of interest, and let \({\varvec{G}}_{ij} ,\) an \(s\times 1\) vector, represent the genotypes of the s rare variants for the jth \((j=1,\,2,\ldots ,n_{i})\) individual in the ith (\(i=1,\,2,\ldots ,N)\) family. We assume an additive model, and let components in \({\varvec{G}}_{ij}\) take the value of 0, 1, or 2, indicating the number of copies of the minor allele at each site. If an individual is not genotyped, then we leave \({\varvec{G}}_{ij} \) undefined. Let \({\varvec{X}}_{ij},\) a \(c\times 1\) vector, denote the covariates, and denote \(Y_{ij}\) as the value of the quantitative outcome for the jth individual in the ith family. For non-founders (defined as individuals with ancestors included in the pedigree, e.g., individuals 5–10 in Fig. 1), let \(M_{ij}\) and \(F_{ij}\) be the indices of the mother and father of the jth individual in the ith family, respectively. For founders (defined as individuals with no ancestors in the pedigree, e.g., individuals 1–4 in Fig. 1), we leave \(M_{ij}\) and \(F_{ij}\) undefined.

2.2 KMR Framework for Pedigree Data

We create our robust rare-variant association test for a quantitative trait based on the KMR test of Schifano et al. [30] and Chen et al. [6] for association testing of a group of genetic variants with a continuous phenotype allowing for related individuals. As shown by these authors, the KMR test can be implemented in a linear mixed modeling framework with mean and variance defined through the model:

$$\begin{aligned} Y_{ij} ={\varvec{X}}_{ij}^\mathrm{T} \alpha +h\left( {{\varvec{G}}_{ij} } \right) +f_{ij} +\varepsilon _{ij}, \end{aligned}$$
(1)

where \(\alpha \) is a \(c\times 1\) vector of coefficients for \({\varvec{X}}_{ij},\, f_{ij} \) is the random effect to account for within-family correlation, and \(\varepsilon _{ij}\) is the random error term. We further assume that the random effects within a family, \({\varvec{f}}_{i} =( {f_{i1} ,\,f_{i2} ,\,f_{i3},\ldots ,f_{{in_{i}}} })^\mathrm{T},\) follow a multivariate normal distribution \({\varvec{f}}_{i} \sim \mathrm{MVN}( {0,\,2 {\varPhi }_{i} \sigma _\mathrm{pg}^2 }).\) Here \(\varPhi _{i}\) is the kinship matrix for the ith family (elements in \(\varPhi _i\) represent the pairwise kinship coefficients between relatives in the ith family) and \(\sigma _\mathrm{pg}^2\) represents the variance due to the shared polygenic effect. We also assume that the random environmental effect \(\varepsilon _{ij}\) is independent among subjects within and between families and follows a normal distribution with mean 0 and variance \(\sigma _e^2.\)
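To make the kinship matrix \(\varPhi _{i}\) concrete, the following is a minimal sketch that computes pairwise kinship coefficients with the standard recursive definition. It assumes the Fig. 1 pedigree has the structure used in the worked example of Sect. 2.3 (founders 1–4; individual 5 is the child of 1 and 2; individual 6 is the child of 3 and 4; individuals 7–10 are the children of 5 and 6), that founders are unrelated and non-inbred, and that parents are numbered before their children. The names `PEDIGREE` and `kinship` are illustrative and are not taken from any software used in this study.

```python
from functools import lru_cache

# Hypothetical encoding of the Fig. 1 pedigree: individual -> (mother, father).
# Founders map to None. Parents must carry smaller indices than their children.
PEDIGREE = {
    1: None, 2: None, 3: None, 4: None,
    5: (1, 2), 6: (3, 4),
    7: (5, 6), 8: (5, 6), 9: (5, 6), 10: (5, 6),
}


@lru_cache(maxsize=None)
def kinship(a, b):
    """Kinship coefficient between a and b (founders unrelated, non-inbred)."""
    if a == b:
        parents = PEDIGREE[a]
        if parents is None:
            return 0.5
        # Self-kinship is (1 + inbreeding) / 2, with inbreeding = parental kinship.
        return 0.5 * (1.0 + kinship(*parents))
    # Recurse on the individual with the larger index, who cannot be an
    # ancestor of the other under the numbering assumption above.
    if a < b:
        a, b = b, a
    parents = PEDIGREE[a]
    if parents is None:
        return 0.0  # two distinct founders
    mother, father = parents
    return 0.5 * (kinship(mother, b) + kinship(father, b))


ids = sorted(PEDIGREE)
phi = [[kinship(a, b) for b in ids] for a in ids]  # kinship matrix for one family
```

Under these assumptions, a parent and child have kinship 1/4, full siblings 1/4, and a grandparent and grandchild 1/8, so the covariance \(2\varPhi _{i}\sigma _\mathrm{pg}^2\) halves with each additional generation of separation.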

Within Eq. (1) above, \(h(\varvec{G}_{ij})\) is a function of \(\varvec{G}_{ij} \) defined through a positive semidefinite kernel function \(k( {\cdot ,\,\cdot } ).\) Following Liu et al. [20] and Kwee et al. [16], \(h(\varvec{G}_{ij} )\) can be represented as \(\mathop \sum \nolimits _{i^{\prime }} \mathop \sum \nolimits _{j^{\prime }} \vartheta _{i^{\prime }j^{\prime }} k( {\varvec{G}_{ij} ,\,\varvec{G}_{i^{\prime }j^{\prime }} } ),\) where \(\vartheta _{i^{\prime }j^{\prime }}\) are unknown parameters. It is worth noting that the kernel function, \(k(\varvec{G}_{ij} ,\,\varvec{G}_{i^{\prime }j^{\prime }}), \) measures the genetic similarity between subject j in family i and subject \(j^{\prime }\) in family \(i^{\prime };\) the test then contrasts this similarity with the phenotypic similarity between the two subjects. It has been shown that an appropriate choice of kernel can increase power [37]. Frequently used kernels include the identity by state (IBS) kernel and the linear kernel. The IBS kernel, which takes the form \(k({\varvec{G}}_{ij} ,\,{\varvec{G}}_{i^{\prime }j^{\prime }}) = { \sum \nolimits _{l=1}^s}( {2-| {G_{ijl} -G_{i^{\prime }j^{\prime }l}}|} ),\) measures the genetic similarity as the number of alleles shared identical by state. It allows a nonlinear effect of each rare variant and can thus enable the study of epistatic effects. The linear kernel, on the other hand, assumes a linear relationship between the trait and the variants. The kernel takes the form \(k({\varvec{G}}_{ij} ,\,{\varvec{G}}_{i^{\prime }j^{\prime }}) = {{\sum \nolimits _{l=1}^s}} ( {G_{ijl} G_{i^{\prime }j^{\prime }l} } ).\) Additionally, we can include prior knowledge of variants that are possibly causal in the gene by assigning each variant a weight. If prior knowledge is not available, weights can also be calculated as a function of minor allele frequency (MAF; under the logic that the rarer the allele, the more likely it is selected against and therefore the more likely it is to be pathogenic). Wu et al. [37] suggest calculating the weights based on a beta distribution, which assigns greater weight to less frequent variants. For a given set of weights, we can create weighted kernels such as the weighted linear kernel \(k({\varvec{G}}_{ij} ,\,{\varvec{G}}_{i^{\prime }j^{\prime }})={\sum \nolimits _{l=1}^s} w_l ( {G_{ijl} G_{i^{\prime }j^{\prime }l} })\), where \(w_{l}\) denotes a normalized weight for variant l in the gene.
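The kernels described above can be sketched in a few lines. The example below is illustrative only: the function names and the toy genotype matrix are assumptions made for this sketch, and genotypes are coded as minor-allele counts (0, 1, or 2) as in Sect. 2.1.

```python
def linear_kernel(g1, g2, weights=None):
    """(Weighted) linear kernel: sum over variants of w_l * g1[l] * g2[l]."""
    if weights is None:
        weights = [1.0] * len(g1)
    return sum(w * a * b for w, a, b in zip(weights, g1, g2))


def ibs_kernel(g1, g2):
    """IBS kernel: number of alleles shared identical by state across variants."""
    return sum(2 - abs(a - b) for a, b in zip(g1, g2))


def kernel_matrix(genotypes, kernel):
    """Similarity matrix K with K[a][b] = kernel(genotypes[a], genotypes[b])."""
    return [[kernel(ga, gb) for gb in genotypes] for ga in genotypes]


# Toy minor-allele counts for three subjects at s = 4 rare variants.
G = [
    [0, 1, 0, 0],
    [0, 1, 0, 2],
    [1, 0, 0, 0],
]
K_lin = kernel_matrix(G, linear_kernel)
K_ibs = kernel_matrix(G, ibs_kernel)
```

In this toy example the first two subjects share the minor allele at the second variant, so their linear-kernel similarity is 1, whereas the IBS kernel also credits shared major alleles and gives 6 of a possible 8.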

It can be easily shown that the estimator of h takes the same form as in the linear mixed model with h as a random effect [20, 30]:

$$\begin{aligned} y=X\alpha +h+f+e, \end{aligned}$$
(2)

where \({\varvec{\alpha }}\) is a \(c\times 1\) vector of coefficients for fixed effect \(\mathbf{X},\, {\varvec{h }}\) is an \(\mathop \sum \nolimits _{i=1}^N n_i \times 1 \) vector of random effects that follow an arbitrary distribution with mean 0 and variance \(\tau {\varvec{K}},\) where K is the genetic similarity matrix with element \(\langle {ij},\, i^{\prime }j^{\prime } \rangle \) equal to \(k({\varvec{G}}_{ij} ,\,{\varvec{G}}_{{i^{\prime }j^{\prime }}}) ;\, {\varvec{f}}=( {{\varvec{f}}_1^\mathrm{T} ,\,{\varvec{f}}_2^\mathrm{T},\ldots ,{\varvec{f}}_N^\mathrm{T} })^\mathrm{T} \sim N( {0,\,2\sigma _\mathrm{pg}^2 {\varvec{\Phi }}} ),\) where \({\varvec{\Phi }}\) is a block diagonal matrix with \({\varPhi }_{i}\) on the diagonal. Finally, \(e=( {{\varvec{e}}_1^\mathrm{T} ,\,{\varvec{e}}_2^\mathrm{T},\ldots ,{\varvec{e}}_N^\mathrm{T} } )^\mathrm{T} \sim N( {0,\,\sigma _e^2 \mathbf{I}}).\) Thus, the test of whether genotype is associated with the outcome is equivalent to testing whether the random component h equals 0 or not. We adopted the variance component score test, which is the locally most powerful test [19]. As \({\varvec{h}}\) has the variance of \(\tau {\varvec{K}},\) the test of whether \({\varvec{h}}=0\) is equivalent to testing whether \(\tau = 0.\) The null hypothesis is H\(_{0}{\text {:}}\,\tau =0,\) and the test statistic takes the form:

$$\begin{aligned} Q=\frac{1}{2}\left( Y-X\hat{{\alpha }}_0\right) ^\mathrm{T} \hat{{V}}_0^{-1} K\hat{{V}}_0^{-1} \left( Y-X\hat{{\alpha }}_0\right) , \end{aligned}$$
(3)

where all parameters are estimated under the null hypothesis. \(\mathop {\widehat{V}_0 } =2 \widehat{{\sigma _\mathrm{pg}^2 }} {\varvec{\Phi }}+ \widehat{{\sigma _e^2 }} I\) denotes the sample variance/covariance matrix estimated under the null. To obtain the null distribution of Q,  we define a projection matrix \(P=\hat{{V}}_0^{-1} -\hat{{V}}_0^{-1} X(X^\mathrm{T}\hat{{V}}_0^{-1} X)^{-1}X^\mathrm{T}\hat{{V}}_0^{-1}\) such that \(P{\widehat{V}_{0}} P=P.\) Thus, under the null, we have

$$\begin{aligned} Q=\frac{1}{2}Y^\mathrm{T}PKPY=\sum \limits _{i=1}^N {\lambda _i \chi _{1i}^2 }, \end{aligned}$$
(4)

where \(\lambda _i\) are the eigenvalues of \(\frac{1}{2}D\hat{{V}}_0^{-1/2} K\hat{{V}}_0^{-1/2} D,\) with \(D=I-\hat{{V}}_0^{-1/2} X(X^\mathrm{T}\hat{{V}}_0^{-1} X)^{-1}X^\mathrm{T}\hat{{V}}_0^{-1/2}, \) so that \(P=\hat{{V}}_0^{-1/2} D\hat{{V}}_0^{-1/2}.\) As the \(\chi _{1i}^2 \) are independent and identically distributed \(\chi ^2\) random variables with 1 degree of freedom, Q asymptotically follows a mixture of \(\chi ^2\) distributions, and the p-values can be calculated using the Davies method [9].
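As a rough numerical illustration of Eqs. (3) and (4), the sketch below computes Q and the mixture weights \(\lambda _i\) with NumPy. It assumes that the null covariance \(\hat{V}_0\) has already been estimated (for example by REML under the null model) and obtains the weights as the nonzero eigenvalues of \(\frac{1}{2}\hat{V}_0^{1/2}PKP\hat{V}_0^{1/2}.\) Because a Davies routine is not part of NumPy, the p-value here comes from a simple Monte Carlo approximation of the mixture, which is a stand-in for illustration and not the method used in this paper. All function names and the toy data are hypothetical.

```python
import numpy as np


def score_statistic(y, X, K, V0):
    """Return Q = 0.5 * y' P K P y and the nonzero mixture weights lambda_i."""
    V0_inv = np.linalg.inv(V0)
    XtVinv = X.T @ V0_inv
    # P = V0^-1 - V0^-1 X (X' V0^-1 X)^-1 X' V0^-1
    P = V0_inv - XtVinv.T @ np.linalg.solve(XtVinv @ X, XtVinv)
    Q = 0.5 * float(y @ P @ K @ P @ y)
    # Symmetric square root of V0 via its eigendecomposition.
    vals, vecs = np.linalg.eigh(V0)
    V0_half = (vecs * np.sqrt(vals)) @ vecs.T
    M = 0.5 * V0_half @ P @ K @ P @ V0_half
    lam = np.linalg.eigvalsh((M + M.T) / 2.0)  # symmetrize against round-off
    return Q, lam[lam > 1e-10]


def mixture_pvalue_mc(Q, lam, n_draws=200_000, seed=0):
    """Monte Carlo tail probability of sum_i lam_i * chi2_1 (not Davies' method)."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(1, size=(n_draws, lam.size)) @ lam
    return float(np.mean(draws >= Q))


# Toy data: 8 subjects, intercept plus one covariate, 3 variants, V0 = I.
rng = np.random.default_rng(1)
n = 8
X = np.column_stack([np.ones(n), rng.normal(size=n)])
G = rng.integers(0, 3, size=(n, 3)).astype(float)
K = G @ G.T          # unweighted linear kernel
V0 = np.eye(n)       # placeholder for the estimated null covariance
y = rng.normal(size=n)
Q, lam = score_statistic(y, X, K, V0)
p = mixture_pvalue_mc(Q, lam)
```

Because the linear kernel built from s variants has rank at most s, at most s weights are nonzero, which keeps the mixture cheap to evaluate for gene-sized variant sets.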

2.3 QTDT Framework for General Pedigrees

In the presence of population stratification, association testing of \({\varvec{G}}_{ij}\) with \(Y_{ij}\) in models (1) and (2) may lead to spurious association due to the underlying differences in allele frequencies of the subpopulations. However, for family studies, family members can be used as internal controls, where an expected genotype can be constructed using the family members’ information. Tests based on the within-family component (deviation of observed genotype from expected within family) will not be influenced by population structure, even in the most extreme case, where each of the N pedigrees is drawn from a different population. Here, we leverage the work of Abecasis et al. [1] and present the method to calculate transmission scores for individuals in general pedigrees.

The QTDT framework [1] for general pedigrees decomposes a genotype into a between-family component (which is sensitive to population stratification) and a within-family component (which is robust to population stratification). For relative j in family i, let \({\varvec{B}}_{ij}\) and \({\varvec{W}}_{ij}\) denote vectors of between-family and within-family genotype components for the s rare-variant genotypes in \({\varvec{G}}_{ij}.\) Assuming all parents in the pedigree are genotyped, the between-family component for founders (with no ancestors included in the pedigree) is equal to their observed genotypes, while the between-family component for non-founders at each rare-variant genotype is equal to the average of the between-family components of that individual’s parents, such that \({\varvec{B}}_{{ij}} =\frac{{\varvec{B}}_{{{M}}_{{ij}} } +{\varvec{B}}_{{F}_{ij} } }{2}.\) Using the pedigree in Fig. 1 as an example, suppose all the individuals in the pedigree are genotyped. Suppressing the family index for ease of presentation, the between-family components for founders 1, 2, 3, and 4 are \({\varvec{B}}_{1}={\varvec{G}}_{{1}},\, {\varvec{B}}_{{2}}={\varvec{G}}_{{2}},\, {\varvec{B}}_{{3}}={\varvec{G}}_{{3}},\, {\varvec{B}}_{{4}}={\varvec{G}}_{{4}},\) respectively. For the non-founders in the second generation, the between-family component for individual 5 is \({\varvec{B}}_{{5}} =\frac{{\varvec{B}}_{{1}} +{\varvec{B}}_{{2}} }{2},\) and the between-family component for individual 6 is \({\varvec{B}}_{{6}} =\frac{{\varvec{B}}_{{3}} +{\varvec{B}}_{4} }{2}.\) For the non-founders in the third generation, the between-family components for individuals 7–10 are \(\frac{{\varvec{B}}_{{5}} +{\varvec{B}}_{{6}} }{2}=\frac{{\varvec{B}}_{{1}} +{\varvec{B}}_{{2}} +{\varvec{B}}_{{3}} +{\varvec{B}}_{{4}} }{{4}}.\) It can be seen that, in the situation where all founders are genotyped, the between-family component of any non-founder is calculated as follows:

$$\begin{aligned} {\varvec{B}}_{ij} =\mathop \sum \limits _{f\in F} 2\varphi _{ijf} {\varvec{G}}_{if}, \end{aligned}$$
(5)

where in the ith family, f is the index of founders, \({\varvec{G}}_{if}\) is the rare-variant genotype vector of the founder, \(\varphi _{_{ijf}}\) is the kinship coefficient between individual j and founder f,  and F is the set of all the genotyped founders.

In the situation where the parents’ genotypes are missing, the between-family component \({\varvec{B}}_{{ij}}\) is equal to the average of the genotypes over the sibship of relative j (that is, relative j and all of his or her siblings). For example, in Fig. 1, if individuals 5 and 6 are not genotyped, then the between-family component for individuals 7–10 is \(\frac{{\varvec{G}}_7 +{\varvec{G}}_8 +{\varvec{G}}_9 +{\varvec{G}}_{10} }{4}.\) The average of the genotypes of siblings in the family is the sufficient statistic for the between-family component [1]. We note that, when applied to parent–child trios and nuclear families, the method for calculating the between-family component described here is equivalent to the forms of the between-family component outlined in the work of Jiang et al. [15].

The within-family genotype vector for the s rare-variant genotypes \({\varvec{W}}_{{ij}}\) is then calculated as the difference between the observed genotype vector and the between-family genotype vector:

$$\begin{aligned} {\varvec{W}}_{ij} ={\varvec{G}}_{ij} -{\varvec{B}}_{ij}. \end{aligned}$$
(6)

Positive values within \({\varvec{W}}_{{ij}}\) indicate excess transmission of the minor (reference) allele, while negative values of \({\varvec{W}}_{{ij}}\) indicate excess transmission of the major allele. As discussed above, the within-family component is not influenced by population substructure; thus, the test on the within-family component is robust to population stratification.
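The decomposition in Eq. (6) can be illustrated for the fully genotyped case. The sketch below assumes the Fig. 1 pedigree has the structure used in the worked example above (founders 1–4; 5 is the child of 1 and 2; 6 is the child of 3 and 4; 7–10 are the children of 5 and 6) and that every individual is genotyped. The function names and the toy genotypes at s = 2 variants are hypothetical, and the sibling-average rule for missing parents is not implemented here.

```python
# Hypothetical encoding of the Fig. 1 pedigree: individual -> (mother, father).
# Founders map to None.
PEDIGREE = {
    1: None, 2: None, 3: None, 4: None,
    5: (1, 2), 6: (3, 4),
    7: (5, 6), 8: (5, 6), 9: (5, 6), 10: (5, 6),
}


def between_family(genotypes, pedigree):
    """Between-family components B when everyone in the pedigree is genotyped.

    Founders keep their observed genotype; each non-founder receives the
    average of the parents' between-family components.
    """
    B = {}

    def component(j):
        if j not in B:
            parents = pedigree[j]
            if parents is None:
                B[j] = [float(g) for g in genotypes[j]]
            else:
                b_mother = component(parents[0])
                b_father = component(parents[1])
                B[j] = [(m + f) / 2.0 for m, f in zip(b_mother, b_father)]
        return B[j]

    for j in pedigree:
        component(j)
    return B


def within_family(genotypes, B):
    """Within-family components W = G - B, variant by variant."""
    return {j: [g - b for g, b in zip(genotypes[j], B[j])] for j in genotypes}


# Toy minor-allele counts at s = 2 rare variants.
G = {
    1: [1, 0], 2: [0, 0], 3: [0, 1], 4: [0, 0],
    5: [1, 0], 6: [0, 1],
    7: [1, 1], 8: [0, 0], 9: [1, 0], 10: [0, 1],
}
B = between_family(G, PEDIGREE)
W = within_family(G, B)
```

Under this construction the founders have \({\varvec{W}}_{ij}={\varvec{0}}\) because \({\varvec{B}}_{ij}={\varvec{G}}_{ij}\) for them, so only non-founders carry within-family information.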

As discussed before, directly testing based on the observed rare-variant genotypes in models (1) and (2) will lead to spurious association in the presence of population stratification. For our robust test, we follow the same approach as in our earlier work [15] and simply calculate \({\varvec{W}}_{{ij}}\) as described above, replace \({\varvec{G}}_{{ij}}\) with \({\varvec{W}}_{{ij}}\) in Eqs. (1) and (2), and construct our score statistic Q in (3) using \({\varvec{W}}_{{ij}}.\)

2.4 Screening Methods

Although the within-family component has the advantage of robustness to population stratification, constructing tests based only on the within-family genotypic component while ignoring the between-family component reduces power. However, if founders’ phenotype and genotype data are available, we can borrow the idea of Purcell et al. [27] to implement a screening procedure to potentially increase power. Specifically, we use the founders’ phenotype and genotype information in the first stage to identify those regions showing strongest signals of association. We can perform such testing using standard burden or variance component tests for unrelated subjects. We then implement a second stage where we test only the top regions from the first stage using our proposed test in (3) based on the within-family genotypic component; the number of top regions in the second stage can take a value between 1 and the total number of regions. In this project, we assume 10–50% of the regions enter the second stage. By pre-screening in this manner, we reduce the multiple testing burden for our robust test, thereby increasing power. As the within-family component and the between-family component are orthogonal to each other by design [1], population stratification that can invalidate the first-stage analysis using founders will not invalidate the within-family component test.
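The two-stage procedure can be sketched as follows. This is an illustration under stated assumptions rather than the exact implementation used here: the first-stage p-values are taken as given (from any founder-based test for unrelated subjects), the second-stage test is passed in as a function, and multiplicity in the second stage is handled by a Bonferroni correction over only the regions that pass screening. The function name, the region labels, and the p-values are hypothetical.

```python
def two_stage_screen(stage1_p, stage2_test, fraction=0.2, alpha=0.05):
    """Screen regions on founder-based p-values, then test the survivors.

    stage1_p:    dict region -> first-stage (founder-based) p-value.
    stage2_test: callable region -> second-stage (within-family) p-value.
    fraction:    share of regions carried into the second stage.
    Returns the significant regions with their p-values, and the threshold.
    """
    n_keep = max(1, round(fraction * len(stage1_p)))
    kept = sorted(stage1_p, key=stage1_p.get)[:n_keep]
    # Assumption: Bonferroni correction over second-stage regions only.
    threshold = alpha / n_keep
    stage2_p = {region: stage2_test(region) for region in kept}
    hits = {region: p for region, p in stage2_p.items() if p < threshold}
    return hits, threshold


# Toy example with five regions; 40% (two regions) enter the second stage.
founder_p = {"g1": 0.50, "g2": 0.01, "g3": 0.80, "g4": 0.03, "g5": 0.40}
within_p = {"g1": 0.60, "g2": 0.02, "g3": 0.70, "g4": 0.20, "g5": 0.90}
hits, threshold = two_stage_screen(founder_p, within_p.get, fraction=0.4)
```

In this toy example region g2 has a within-family p-value of 0.02, which passes the screened threshold of 0.05/2 = 0.025 but would not pass the unscreened threshold of 0.05/5 = 0.01.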

2.5 Simulation Studies

We evaluate the type I error rate and power of our method using simulated sequencing data generated by cosi [28], which closely resemble empirical data. To simulate large pedigrees, we first use cosi to simulate 5000 haplotypes of European ancestry and 5000 haplotypes of African ancestry. We then randomly draw and pair haplotypes within each population and randomly select one haplotype from each parent to pass down to offspring. Our simulated pedigree has the same structure as Fig. 1. We assume that there are 10 non-overlapping genes or regions of interest, each 30 kb long. We show the empirical distribution of rare variants in these regions across simulated datasets in Supplementary Fig. S2.

For each family, we simulate phenotype data from a multivariate normal distribution, whose mean and variance vary according to different scenarios. For type I error rate simulations, all 10 regions are null, while for power simulations we randomly select 1 region of the 10 to harbor causal variation. Rare variants are defined as variants with MAF smaller than 3%. To simulate population substructure, we simulate the outcome for the null model as follows: \(Y_{ij} =\gamma I_{\mathrm{African},{ij}} +f_{ij} +e_{ij},\) where \(\gamma \) is the mean trait difference between European and African individuals, and \(I_{\mathrm{African},{ij}} \) is the indicator variable, which is 1 for African individuals and 0 for European individuals. For the power simulations, we let either 5 or 15% of the rare variants in the causal region influence phenotype. Within each family, we simulate the random effects \({\varvec{f}}_{i} \) through \({\varvec{f}}_i \sim \mathrm{MVN}( {0,\,0.56\times 2 {\varPhi }_{i}}),\) and \(e_{ij}\) is the random error, which follows a standard normal distribution. For each causal variant, we define the effect size as \(\beta =c\times |\log _{10} \mathrm{MAF}|,\) where c is a pre-defined constant. Thus, the outcome is simulated as \(Y_{ij} =\gamma I_{\mathrm{African},{ij}} +\beta _{ij} \times G_{ij} +f_{ij} +e_{ij}.\) We perform 5000 simulations to evaluate the type I error rate. For the power simulations, we also perform 5000 simulations and calculate power as the proportion of simulations in which the causal region is correctly identified. Unless otherwise noted, we applied a linear genotype kernel for analysis.
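The trait model above can be sketched with NumPy as follows. This is a minimal illustration, not the simulation code used in this study: it takes genotypes and a kinship matrix as given (haplotype simulation with cosi is not reproduced), assumes the genetic effect is the sum of \(\beta \times G\) over the causal variants, and uses a parent–child trio kinship matrix in the toy call for brevity. All names and toy values are hypothetical.

```python
import numpy as np


def effect_sizes(maf, c):
    """beta = c * |log10(MAF)| for each causal variant."""
    return c * np.abs(np.log10(np.asarray(maf, dtype=float)))


def simulate_family_trait(G, phi, causal, maf, c, gamma, is_african, rng):
    """Draw the trait vector Y for one family.

    G:      (n_i, s) minor-allele counts.
    phi:    (n_i, n_i) kinship matrix of the family.
    causal: list of causal variant indices (empty under the null model).
    gamma:  mean trait difference, applied when is_african is True.
    """
    n = G.shape[0]
    f = rng.multivariate_normal(np.zeros(n), 0.56 * 2.0 * phi)  # polygenic effect
    e = rng.standard_normal(n)                                   # random error
    genetic = np.zeros(n)
    if len(causal) > 0:
        beta = effect_sizes(np.asarray(maf)[causal], c)
        genetic = G[:, causal] @ beta  # assumption: effects add over causal variants
    return gamma * float(is_african) + genetic + f + e


# Toy call: a trio (father, mother, child) with s = 3 variants, variant 0 causal.
rng = np.random.default_rng(2024)
phi_trio = np.array([[0.50, 0.00, 0.25],
                     [0.00, 0.50, 0.25],
                     [0.25, 0.25, 0.50]])
G = np.array([[1, 0, 0],
              [0, 0, 1],
              [1, 0, 1]], dtype=float)
maf = [0.01, 0.02, 0.005]
y = simulate_family_trait(G, phi_trio, causal=[0], maf=maf, c=0.5,
                          gamma=0.25, is_african=True, rng=rng)
```

With c = 0.5, a causal variant with MAF of 1% receives an effect size of 0.5 × 2 = 1, so rarer variants are given larger effects.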

2.6 GAW18 Data

GAW18 provides whole genome sequence data for extended pedigrees and phenotypes such as SBP and DBP. The dataset was drawn from the T2D-GENES Consortium Project 2, a family-based study that aims to identify low-frequency variants that increase the risk of type-2 diabetes. The original dataset contains whole genome sequences for the odd-numbered chromosomes only (chromosomes 1, 3, 5,...,21) for 464 individuals from 20 Mexican American families. The dataset we used in this project contains 959 individuals, of whom 464 were directly sequenced by Complete Genomics, Inc., while the remaining 495 had sequence data imputed from array-based genotype data by the T2D-GENES Consortium. In addition to SBP and DBP, the dataset also includes information on age, gender, current use of antihypertensive medicine, and current smoking status. We include these variables as covariates in our model. Detailed information about the dataset can be found in Almasy et al. [4].

After a standard data-cleaning procedure that removed subjects with missing SBP or DBP measurements, our final dataset contained 855 individuals. Genes were annotated using information from the 1000 Genomes Project (http://www.1000genomes.org/). We tested all genes in the 11 odd-numbered chromosomes, where each gene was tested individually. For each gene, we calculated the empirical frequency of the variants within the gene and only performed tests on the rare variants, where a rare variant was defined as having a MAF less than 3%. For perspective, we show the empirical distribution of rare variants within genes in the GAW18 project in Supplementary Fig. S3. We constructed the test statistics using within-family components as defined above.

3 Results

3.1 Type I Error

We first performed null simulations to show that population stratification can lead to an inflated type I error rate for sequencing studies of large pedigrees. Figure 2 summarizes the empirical type I error rates of a study with 25 European pedigrees and 75 African pedigrees, each with the same size and family structure as shown in Fig. 1. We first set the mean trait difference (\(\gamma )\) between European and African to be 1 (Fig. 2, left) and further increased it to 2 (Fig. 2, right). Both panels show that, in the presence of population stratification, test statistics constructed on observed genotypes have inflated type I error rates (leftmost bars in each panel of Fig. 2). As population structure becomes more extreme, the inflation becomes more severe (Fig. 2, right). We then applied the robust test together with our two-stage screening procedure using founders’ genotypes and phenotypes. Figure 2 shows that testing on the within-family component combined with the screening method leads to appropriate control of the type I error rate in the presence of population stratification.

Fig. 2 Type I error rates. Left: mean trait difference between European and African is 1. Right: mean trait difference between European and African is 2. Ten 30-kb regions are simulated. Yellow bar: type I error rate tested on observed genotype. Others: type I error rate tested on within-family component, with different numbers of genes at the second stage. Black line: \(y=0.05\)

3.2 Power

We next examined the power of the proposed robust test. For power simulations, we assume the mean trait difference between European and African is 0.25. For each simulation, we randomly drew 25 European pedigrees and 75 African pedigrees from the haplotype pools. We varied the percentage of rare causal variants in the causal region from 5% (Fig. 3a) to 15% (Fig. 3b). We also assumed different effect sizes (\(\beta =c\times |\log _{10} \mathrm{MAF}|)\) for the causal variants by letting c take the values 0.4, 0.5, and 0.6. Figure 3 shows that power increases as the percentage of causal variants in a region increases and as the effect size increases. We next investigated whether the two-stage screening approach using founder information improves power over a within-family analysis that ignores screening. As shown in Fig. 3, screening on the top 10–50% of hits can yield noticeable improvements in power over the naïve strategy. In addition to applying the linear genotype kernel, we also considered a weighted linear genotype kernel for screening and analysis (with weights based on MAFs using the weight function of Wu et al. [37]). The results, shown in Supplementary Fig. S4, are similar to those for the linear genotype kernel. With screening, we observed a slight improvement of the weighted linear kernel over the unweighted linear kernel, particularly when larger effect sizes were assumed.

Fig. 3 Power to detect rare-variant association in large pedigrees. a 5% of rare variants in the causal region are causal variants. b 15% of rare variants in the causal region are causal variants. Yellow bars: power without screening. Others: power with screening. Mean trait difference between European and African is 0.25. 10–50% of regions entered the second stage

3.3 Application to GAW18 Dataset

We used GAW18 data to test for association between DBP/SBP and genes on odd-numbered chromosomes. Within each gene, we calculated empirical frequencies of variants and tested only variants with frequencies smaller than 3%. GAW18 provides longitudinal phenotype information, where SBP and DBP were measured in up to four follow-ups for each subject. We used the baseline measurement to test for association. We also controlled for age, gender, current usage of antihypertensive medicine, and current smoking status in our model. The pedigrees in the dataset are relatively large: the median number of individuals in a pedigree is 37 (min 22, max 74). Among the participants, 20.2% smoke, 9.4% took antihypertensive medicine, and 57.7% are female.

We performed association tests using our robust test. The genome-wide significance level with Bonferroni correction is \(\alpha _\mathrm{Bonferroni} =0.05/7034=7.1\times 10^{-6}.\) We chose the weighted linear kernel and used the Davies method to calculate p-values. Following Wu et al. [37], the weight is calculated as \(w_j =\mathrm{Beta}( {\mathrm{MAF}_j ;\,1,\,25}),\) the Beta(1, 25) density evaluated at the MAF of variant j. The results of testing SBP and DBP are summarized in Fig. 4. As shown in Fig. 4, we did not observe any genes passing the genome-wide significance level (\(7.1 \times 10^{-6},\) based on Bonferroni adjustment for 7034 genes). At the suggestive level (\(1 \times 10^{-4}),\) one gene on chromosome 21 is associated with SBP, and one gene on chromosome 7 is associated with DBP. The gene associated with SBP is chromosome 21 open reading frame 33 (C21orf33), which is a protein-coding gene and is over-expressed in Down syndrome [3]. LSM5 is associated with DBP at the suggestive level. It has been found that human LSM1–7 genes were expressed in HeLa cells within cytoplasmic foci [12], which contain important factors in the degradation of mRNA. In addition to the Manhattan plots shown in Fig. 4, we also constructed QQ plots of results using both the observed genotypes and the within-family components of the genotypes. We present these QQ plots in Supplementary Fig. S5, which show inflation for SBP (but not DBP) when analyzing observed genotypes. We observed no such inflation when analyzing the within-family component, although results for SBP showed some deflation in p-values.
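The significance threshold and the MAF-based weights quoted above can be checked with a short calculation. The sketch below reads the weight as the Beta(1, 25) probability density evaluated at each variant's MAF, which is an interpretation of the notation above; the function name and the example MAF values are hypothetical.

```python
import math


def beta_density(x, a=1.0, b=25.0):
    """Beta(a, b) probability density evaluated at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)


n_genes = 7034
alpha_bonferroni = 0.05 / n_genes  # roughly 7.1e-6

# Weights for three illustrative MAF values below the 3% rare-variant cutoff.
example_mafs = (0.001, 0.01, 0.03)
weights = {maf: beta_density(maf) for maf in example_mafs}
```

With these parameters the density simplifies to \(25(1-\mathrm{MAF})^{24},\) so a variant with MAF of 0.1% receives roughly twice the weight of a variant with MAF of 3%.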

Fig. 4 Manhattan plots for GAW18 analyses. a Association analyses between SBP and within-family component of genotypes within genes on odd-numbered chromosomes. b Association analyses between DBP and within-family component of genotypes within genes on odd-numbered chromosomes. Red line: genome-wide significance level (\(p<7.1\times 10^{-6}),\) blue line: suggestive significance level (\(p<1\times 10^{-4})\)

4 Discussion

In this paper, we presented a framework for rare-variant sequencing studies in large pedigrees. Large pedigrees have several important features that make them ideal for mapping traits associated with rare variants. Our previous work for robust and efficient family-based analysis [15] was only applicable to parent–child trios or nuclear families, and so, in this work, we expand that approach to handle large pedigrees of arbitrary size and structure, such as those in the GAW18 study of blood pressure. Our model, which combines a kernel machine framework for rare-variant analysis with a QTDT framework for general pedigrees, provides a powerful, efficient, and robust way to identify such associations in large pedigree studies. As the score test statistic asymptotically follows a mixture of \(\chi ^2\) distributions, p-values can be calculated analytically rather than by permutation. This feature also makes our model applicable to large-scale genetic studies.

We also applied our method to GAW18 data to identify SBP/DBP-associated rare variants. We tested all the genes on odd-numbered chromosomes. This application demonstrates that our method can be easily applied to large-scale data. The analysis of a gene takes 70 s on a Linux system with 768 processors and 512 GB of RAM.

The data from GAW18 are based on 20 extended Mexican American families. For studies that do not have records of participants’ geographic origin or studies whose participants are from different origins, our method provides a robust way to perform the test.

In this project, we assumed that rare variants are associated with only a single phenotype. However, there is substantial interest in identifying genetic factors with pleiotropic effects that influence multiple distinct phenotypes. Current methods for family data are not well equipped to investigate the effect of pleiotropy. For example, in the GAW18 data, analyses seeking to identify genes simultaneously associated with both SBP and DBP cannot be performed. However, Broadaway et al. [5] provide a framework that can test cross-phenotype effects of rare variants. Their method is based on kernel distance-covariance, whose test statistic also asymptotically follows a mixture of \(\chi ^2\) distributions. In contrast to the method presented here, Broadaway et al. [5] focused only on unrelated individuals. In the future, we would like to combine our robust test with the method of Broadaway et al. [5] to test cross-phenotype effects of rare variants in related individuals.