20.1 Introduction

Large-scale genomics has been at the forefront of science and medicine over the last decade. The advent of high-throughput technologies, including single nucleotide polymorphism (SNP) microarrays, array comparative genomic hybridization, and genome sequencing, has enabled rapid discovery of genetic variants varying in size and frequency [18]. Copy number variants (CNVs) are deletions and duplications in the genome that account for the most genetic variation, in total base pairs, between individuals [35]. Classically, disease-association studies evaluated either variants of high frequency in the population, termed common variants, or variants of low frequency, termed rare variants. In this chapter, we consider the analysis of rare variants with specific focus on copy number variants.

One of the key statistical challenges in the analysis of rare variants is that they have small population prevalences. If we view rare variants as predictors that we wish to associate with a phenotype, then they in fact contain very little statistical information. To illustrate the idea, suppose we wish to regress the phenotype on a rare variant that we treat as binary, where zero indicates absence and one indicates presence, and assume that the regression model is linear. It can then be shown analytically that the information about the regression coefficient in such a setup is maximized when half of the subjects have the rare variant and half do not. By definition, however, a majority of subjects will not carry a rare variant, so we are in an inherently low-power situation. It is therefore necessary to think about pooling information in various ways; this will be one of the recurring themes of the chapter.
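To make this concrete, consider a simple sketch of the information calculation (in our own notation, not drawn from a specific reference). Suppose \(Y_{i} =\beta_{0} +\beta_{1}X_{i} + e_{i}\) with \(e_{i} \sim N(0,\sigma^{2})\), where \(X_{i} \in \{0,1\}\) indicates carrier status and \(P(X_{i} = 1) = p\). The Fisher information for \(\beta_{1}\) from \(n\) subjects is

$$\displaystyle{ I(\beta_{1}) = \frac{n\,\mathrm{Var}(X)}{\sigma^{2}} = \frac{n\,p(1 - p)}{\sigma^{2}}, }$$

which is maximized at \(p = 1/2\). Equivalently, \(\mathrm{Var}(\hat{\beta}_{1}) \approx \sigma^{2}/\{n\,p(1 - p)\}\), so a rare variant with small \(p\) contributes an effective sample size of roughly \(n\,p(1 - p) \ll n\).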

The structure of this chapter is as follows. In Sect. 20.2, we provide some biological background on rare variants. Section 20.3 reviews association methods for the analysis of rare variants, focusing in particular on the sequence kernel association test (SKAT) [56] and its extensions; since the SKAT methodology is based on the kernel machine framework originally proposed by Liu et al. [33, 34], we expand on that framework as well. In Sect. 20.4, we discuss the multiple comparisons problem and how it must be adapted to the rare variant setting. The chapter concludes with some discussion in Sect. 20.5.

20.2 Biological Background

Associating genes with phenotypic traits or overt disease has relied on the discovery, characterization, and genotyping of genetic variants. Genetic studies identify causative genes by finding variants, common (frequency > 1 % or > 5 % in the population, depending on the definition) or rare, and testing whether they are enriched in cases compared to controls. Common variants derive from alleles that arose early in human history and are therefore shared between different human populations [39]. These variants constitute most human genetic variation by frequency and are represented as SNPs that tag specific haplotypes mapped by the HapMap project [10, 11]. While technologies and genetic methods have concentrated on implicating common variants or rare variants of extreme size in disease etiology, identification and characterization of variants of intermediate size and frequency remain a challenge [50].

The basis for rare variants can be best understood in a historical context. In the field of human genetics of complex traits, the dominant school of thought in the early 2000s was based on the so-called common disease-common variant (CDCV) hypothesis [45, 46]. This framework postulated that for many diseases, multiple SNPs would be needed to explain a large percentage of variation in the phenotype. Identification of SNPs in linkage disequilibrium with, or functional variants in the neighborhood of, causative genes is the basis of genome-wide association studies (GWAS) [22]. This thinking strongly influenced the design of GWAS and the technology used to measure DNA variation. The dominant platform for measuring SNPs was the microarray, a technology also being used at the time to measure mRNA expression. The major company that developed the SNP microarray platform was Affymetrix (http://www.affymetrix.com), and the DNA variations selected to be on the chip primarily represented variants that satisfied the CDCV hypothesis, i.e., all of the variants had to be sufficiently present in the population. In particular, SNP microarrays tended to exclude DNA variants whose less prevalent allele had a population frequency (the minor allele frequency) of less than 5 %.

Over 2000 GWAS have been conducted in humans to date (genome.gov, 2013), a major finding being that DNA variations in the form of SNPs explain only a limited amount of variation for many human disease-associated phenotypes [37, 38]. GWAS has been successful for type-2 diabetes, age-related macular degeneration, coronary artery disease, and Crohn disease, as well as for obesity and height. These studies were not successful for a majority of common complex diseases, including neurodevelopmental disorders such as autism, schizophrenia, and epilepsy. This has led to consideration of reasons for the missing heritability [38]. The difficulty of achieving statistical power to identify multiple loci of small effect size is considered a major factor. Other factors not considered in traditional GWAS [5], such as gene-gene interactions and gene-environment interactions, have also been proposed [57].

Rare variants, on the other hand, tend to have much bigger effects than the DNA variants identified from first-generation GWAS. This alternative model is termed the common disease-rare variant (CDRV) hypothesis. From an evolutionary point of view, these variants are under strong selection, and their frequency in the population is maintained by de novo mutations. In fact, new germline mutations arise constantly, depending on the underlying sequence architecture and parental age, at a rate of about 61 single nucleotide mutations [4] and 16–50 kbp of copy number change [24] per diploid genome. Genetic associations for copy number variants have met with higher success for variants of low frequency. These variants were classically associated with clinically recognizable syndromes, such as 7q11.2 deletions in individuals with features of Williams syndrome, 22q11.2 deletions in individuals with features of DiGeorge syndrome, and 17p11.2 deletions in Smith-Magenis syndrome [19].

Developments in microarray technology and rapid incorporation of high-throughput genotyping in diagnostic laboratories have resulted in the identification of about two dozen CNVs that are strongly enriched in affected cases with neurodevelopmental impairments compared to controls. However, extensive phenotypic heterogeneity, even among individuals carrying the same CNV, has complicated further analysis. For example, the 16p11.2 deletion was originally associated with autism but was later found to be enriched in individuals with intellectual disability, epilepsy, schizophrenia, and obesity [40, 47, 55]. Comparison of CNV load (measured as the proportion of the population carrying a deletion or duplication of a particular size) across cohorts of affected individuals suggests that CNV load correlates with the severity of neurodevelopmental disorders [17]. Similarly, the phenotypic variability and severity associated with a specific disease-associated CNV can also be explained by rare variants in the genetic background [19, 21]. These variants modulate the ultimate phenotypic expression through additive or synergistic effects, in genetic terms, in a digenic or oligogenic manner [53].

Genome sequencing has made tremendous strides in finding the missing heritability. Sequencing of the protein-coding portion of the genome in neurodevelopmental disorders has identified several rare, de novo variants that cluster in pathways related to nervous system development, maturation, and maintenance [16, 41, 42, 48]. These studies have, however, revealed a complex genetic basis for common diseases; for example, recent estimates suggest that a minimum of 1000 genes may be causal for autism. Such disorders can be explained by an infinitesimal model consistent with a role for multiple rare variants in complex disease [14]; according to this model, the genetic etiology is a hybrid of the CDCV and CDRV models. The challenge therefore lies in understanding how these variants work together to cause disease rather than whether they are rare or common [14].

The implications of rare variants for medicine and public health are potentially paradigm shifting. Both disciplines have placed tremendous emphasis on evidence gathered from population-based analyses of biomedical data. However, rare variants are, by definition, quite individual-specific predictor variables; simply based on their prevalence, standard population-based analyses will have low power to detect them. The rare variant paradigm is also quite in tune with the notion of personalized medicine, in which treatments and/or interventions would be tailored to the particular variant present in the individual. Very broadly speaking, this is consistent with the patient-centered/patient-oriented paradigm in medicine that has been developing over the last few years. Figure 20.1 describes the genetic spectrum for disease that analysts must contend with.

Fig. 20.1

(a) Size of variant. Genetic variants are ordered by size on the horizontal axis versus frequency on the vertical axis. Note that single nucleotide variants, or more specifically single nucleotide polymorphisms (used for GWAS), are more frequent in the human genome than copy number variants (i.e., deletions and duplications). Large chromosomal aberrations, such as trisomies and monosomies, are rarer and cause severe developmental disabilities. (b) Frequency of variants. Variants can be classified by frequency (horizontal axis) and effect, i.e., penetrance (the proportion of individuals carrying a variant who also manifest the phenotype; vertical axis). Note that rare variants (typically < 0.1 % to < 5 %) are highly penetrant and associated with severe developmental disorders, while common variants have modest effects. Variants of intermediate frequency are currently missed by most studies. Current studies also suggest that multiple rare alleles interacting in common or related pathways are responsible for several human disorders.

20.3 Kernel Machine Methodology

20.3.1 Setup and Review of Methods

We now describe tests of association between rare variants and a phenotype. To make the ideas concrete in this chapter, we suppose that we have available \((Y_{i},\mathbf{G}_{i},\mathbf{Z}_{i})\), \(i = 1,\ldots,n\), for \(n\) subjects, a random sample from \((Y,\mathbf{G},\mathbf{Z})\). Here, \(Y\) denotes the phenotype, \(\mathbf{G}\) is a \(p\)-dimensional vector of genotypes for the \(p\) variants within a region, and \(\mathbf{Z}\) is a \(q\)-dimensional vector of confounding variables to adjust for. Here and in the sequel, we assume that each component of \(\mathbf{G}\) counts the number of minor alleles. We can postulate a class of regression models for \(Y\) given \(\mathbf{G}\) and \(\mathbf{Z}\); a standard one would be a generalized linear model:

$$\displaystyle{ h(E[Y_{i}\vert \mathbf{G}_{i},\mathbf{Z}_{i}]) =\alpha_{0} + \boldsymbol{\alpha}^{T}\mathbf{G}_{i} + \boldsymbol{\beta}^{T}\mathbf{Z}_{i}, }$$
(20.1)

where \((\alpha_{0},\boldsymbol{\alpha},\boldsymbol{\beta})\) are the regression coefficients to be estimated, and \(h\) is a known link function. Note that model (20.1) allows for both continuous and binary phenotypes.

While model (20.1) is quite standard in the statistical literature, new issues arise when attempting to apply it to rare variant data. First, due to the sparsity of \(\mathbf{G}\), the components of \(\boldsymbol{\alpha}\) will not be estimated very well. Because of this, as well as for computational feasibility, there has been a reliance on score-based tests, which are less sensitive to this type of sparsity than, for example, a Wald test. A second problem is one of power. Models such as (20.1) that treat the genetic effects as fixed effects will have lower power due to the number of degrees of freedom required to jointly test \(\boldsymbol{\alpha} = \mathbf{0}\). To circumvent this issue, two classes of approaches have been developed. The first includes methods that can be broadly interpreted as collapsing methods [31, 32, 36, 44]. These tests effectively reduce \(\mathbf{G}\) to a scalar quantity \(G^{\ast}\) and fit model (20.1) with \(\boldsymbol{\alpha}^{T}\mathbf{G}_{i}\) replaced by \(\gamma G_{i}^{\ast}\). The reduction to a one-dimensional quantity reduces the number of parameters and yields a potential gain in power; a sketch of such a test is given below.
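As an illustration, here is a minimal sketch of a collapsing (burden-style) test in R for a binary phenotype; the variable names are ours, and this is a generic version rather than the exact procedure of any one of the cited papers.

# G: n x p matrix of minor allele counts for p rare variants
# y: binary phenotype (0/1); Z: matrix of confounders
# Collapse the region into a single burden score per subject ...
G.star <- rowSums(G)

# ... and test gamma = 0 in the collapsed version of model (20.1)
fit <- glm(y ~ G.star + Z, family = binomial)
summary(fit)$coefficients["G.star", ]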

Collapsing approaches will work in situations in which the components of G have effects on Y that are in the same direction. However, it might be the case that this assumption is not true. The SKAT methodology of Wu et al. [56] then becomes quite useful in this regard. In particular, it generalizes (20.1):

$$\displaystyle{ h(E[Y_{i}\vert \mathbf{G}_{i},\mathbf{Z}_{i}]) =\alpha_{0} + f(\mathbf{G}_{i}) + \boldsymbol{\beta}^{T}\mathbf{Z}_{i}, }$$
(20.2)

where now \(f\) is a flexible, possibly non-linear function of the rare variants. This is a special case of the kernel machine framework originally proposed by Liu et al. [33, 34]; we describe the technical details of the approach in the next section. We point out here that the rare variant effects are allowed to be much more flexible than in (20.1). Further, the test of the genetic effect in (20.2) is identical to testing that a random effect is zero in a certain linear mixed effects model. This amounts to an effective shrinking of the degrees of freedom and allows for pooling of information across the rare variants. The resulting score test is a quadratic form that squares and accumulates the deviations of the individual rare variant effects.

20.3.2 Kernel Machines: Technical Details

In this section, we review the technical details behind the SKAT model when \(h\) in (20.2) is the identity link. This material is intended for mathematically minded readers and can be skipped on an initial reading of this chapter. Recall the model from the previous section with \(\alpha_{0} = 0\):

$$\displaystyle{ Y_{i} = \boldsymbol{\beta}^{T}\mathbf{Z}_{i} + f(\mathbf{G}_{i}) + e_{i}, }$$
(20.3)

where \(\boldsymbol{\beta}\) is a \(q \times 1\) vector of regression coefficients, \(f(\mathbf{G}_{i})\) is an unknown centered smooth function, and the errors \(e_{i}\) are assumed to be independent \(N(0,\sigma^{2})\). Here, we center the response so that there is no intercept term as in (20.2). Note that when \(f(\cdot) = 0\), (20.3) reduces to the standard linear regression model.

20.3.2.1 Function Space of f(G): Specification

We assume the nonparametric function \(f(\mathbf{G})\) lies in a function space \(\mathcal{F}\) spanned by a set of basis functions \(\{\phi_{j}(\mathbf{G})\}_{j=1}^{J}\), such that any function in \(\mathcal{F}\) can be written as \(f(\mathbf{G}) =\sum_{j=1}^{J}\omega_{j}\phi_{j}(\mathbf{G})\) for some constants \(\{\omega_{j}\}_{j=1}^{J}\). Note that the set of basis functions can be finite (\(J < \infty\)) or infinite (\(J = \infty\)). In the machine learning literature, such basis functions are called features.

Specifying a function space through basis functions or features can be complicated, since explicit expressions for the features are required and the number of features might be large or even infinite. An alternative and convenient way to specify a function space is through a kernel function \(K(\mathbf{G},\mathbf{G}^{\prime})\). Specifically, a kernel function \(K(\mathbf{G},\mathbf{G}^{\prime})\) is a bounded, symmetric, positive semi-definite function satisfying

$$\displaystyle{ \int K(\mathbf{G},\mathbf{G}^{\prime})f(\mathbf{G})f(\mathbf{G}^{\prime})d\mathbf{G}d\mathbf{G}^{\prime} \geq 0, }$$
(20.4)

for any square integrable function \(f(\mathbf{G})\) and all \(\mathbf{G},\mathbf{G}^{\prime} \in R^{p}\). The kernel function can be viewed as a measure of similarity between two values \(\mathbf{G}\) and \(\mathbf{G}^{\prime}\) of the covariate vector. By Mercer's theorem (e.g., see p. 33 of [6]), any kernel function satisfying some regularity conditions implicitly specifies a unique function space spanned by a particular set of basis functions (features), and vice versa. Before formally defining such a function space, we give a few examples.

1. The \(d\)th degree polynomial kernel: \(K(\mathbf{G},\mathbf{G}^{\prime}) = [\mathbf{G} \cdot \mathbf{G}^{\prime} + 1]^{d}\), where \(\mathbf{G} \cdot \mathbf{G}^{\prime} =\sum_{k=1}^{p}g_{k}g_{k}^{\prime}\) denotes the dot product and \(g_{k}\) denotes the \(k\)th component of the vector \(\mathbf{G}\) in (20.3). The \(d\)th degree polynomial kernel generates the function space \(\mathcal{F}\) spanned by all possible monomials of the components of \(\mathbf{G}\) up to degree \(d\). For example, if \(d = 1\), the first degree polynomial kernel generates the linear function space with basis functions \(\{g_{1},\cdots,g_{p}\}\). If \(d = 2\), the second degree polynomial kernel corresponds to the quadratic function space with basis functions \(\{g_{k}, g_{k}g_{k^{\prime}}\}\) \((k,k^{\prime} = 1,\cdots,p)\), i.e., the main effects, all two-way interactions, and the quadratic main effects. Note that the function space determined by the \(d\)th degree polynomial kernel is of finite dimension.

2. The Gaussian kernel: \(K(\mathbf{G},\mathbf{G}^{\prime}) = \mathrm{exp}\{-\vert \vert \mathbf{G} -\mathbf{G}^{\prime}\vert \vert^{2}/\rho \}\), where \(\vert \vert \mathbf{G} -\mathbf{G}^{\prime}\vert \vert^{2} =\sum_{k=1}^{p}(g_{k} - g_{k}^{\prime})^{2}\) and \(\rho > 0\) is a scale parameter. The Gaussian kernel generates the function space spanned by radial basis functions, whose properties are discussed in Buhmann [3]. The function space determined by the Gaussian kernel is of infinite dimension.

3. The identity by state (IBS) kernel: Kwee et al. [26] propose using the concept of identity by state to define a kernel, given by

    $$\displaystyle{K(\mathbf{G},\mathbf{G}^{\prime}) = \frac{\sum_{s=1}^{p}IBS(G_{s},G_{s}^{\prime})}{2p},}$$

    where \(IBS(G_{s},G_{s}^{\prime})\) denotes the number of alleles shared identically by state at position \(s\).

The above examples show that the choice of kernel function determines which function space one uses to approximate \(f(\mathbf{G})\); the dimension of the function space defined by a kernel \(K(\cdot,\cdot)\) is determined by the number of eigenfunctions of \(K(\cdot,\cdot)\). Using a kernel to specify a function space avoids having to specify complicated basis functions (features) and inner products, and, as the next section shows, it has significant computational advantages in high dimensional problems. It should be noted that the term "kernel" here has a rather different meaning from that used in the kernel smoothing literature. A commonly used function space defined by a kernel is a Reproducing Kernel Hilbert Space (RKHS), which we label \(\mathcal{F}_{K}\). Technical details on RKHS can be found in Wahba [54] or Chapter 3 of Cristianini and Shawe-Taylor [6]. A short code sketch of the example kernels follows.
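To make the examples concrete, the following R sketch (our own helper functions, not part of any package) computes the three kernel matrices for an n × p genotype matrix of minor allele counts.

# G: n x p matrix of minor allele counts (entries 0, 1 or 2)
# Each function returns the n x n kernel matrix with entries K(G_i, G_i')

poly.kernel <- function(G, d = 2) {
  (tcrossprod(G) + 1)^d                 # [G_i . G_i' + 1]^d
}

gauss.kernel <- function(G, rho = 1) {
  D2 <- as.matrix(dist(G))^2            # squared distances ||G_i - G_i'||^2
  exp(-D2 / rho)
}

ibs.kernel <- function(G) {
  # for allele counts in {0, 1, 2}, the number of alleles shared
  # identically by state at one site is 2 - |g - g'|
  p <- ncol(G)
  K <- matrix(0, nrow(G), nrow(G))
  for (s in seq_len(p)) {
    K <- K + 2 - abs(outer(G[, s], G[, s], "-"))
  }
  K / (2 * p)
}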

20.3.2.2 Primal and Dual Representations of f(G)

Any function \(f(\mathbf{G})\) in the function space \(\mathcal{F}_{K}\) defined by a kernel \(K(\cdot,\cdot)\) has a primal representation directly in terms of the basis functions (features) of \(\mathcal{F}_{K}\), and an equivalent dual representation in terms of the kernel function \(K(\mathbf{G},\mathbf{G}^{\prime})\) itself. Specifically, for an arbitrary function \(f(\mathbf{G}) \in \mathcal{F}_{K}\), the primal representation takes the form

$$\displaystyle{ f(\mathbf{G}) =\sum _{ j=1}^{J}\omega _{ j}\phi _{j}(\mathbf{G}) =\boldsymbol{\phi } (\mathbf{G})^{T}\boldsymbol{\omega }, }$$
(20.5)

where \(\boldsymbol{\phi}(\cdot) =\{\phi_{1}(\cdot),\cdots,\phi_{J}(\cdot)\}^{T}\) is a \(J \times 1\) vector of standardized orthogonal basis functions (the standardized Mercer features of the function space \(\mathcal{F}_{K}\)), and \(\boldsymbol{\omega}\equiv (\omega_{1},\cdots,\omega_{J})^{T}\) is a vector of constants. The squared norm of \(f(\cdot)\) can be written as

$$\displaystyle{ \|f\|_{\mathcal{F}_{K}}^{2} =\sum _{ j=1}^{J}\omega _{ j}^{2} =\boldsymbol{\omega } ^{T}\boldsymbol{\omega }. }$$
(20.6)

Alternatively, the same \(f(\mathbf{G})\) can be equivalently written in a dual representation using the kernel function K(⋅ , ⋅ ) directly as

$$\displaystyle{ f(\mathbf{G}) =\sum _{ l=1}^{L}\alpha _{ l}K(\mathbf{G}_{l}^{{\ast}},\mathbf{G}), }$$
(20.7)

for some integer \(L\), constants \(\alpha_{1},\ldots,\alpha_{L}\), and points \(\{\mathbf{G}_{1}^{\ast},\cdots,\mathbf{G}_{L}^{\ast}\} \subset R^{p}\). For justification of these results and more details about the RKHS, see Cristianini and Shawe-Taylor [6, Chapter 3].

Estimation of \(\boldsymbol{\beta }\) and f(⋅ ) proceeds by maximizing the scaled penalized likelihood function

$$\displaystyle\begin{array}{rcl} -\frac{1} {2}\sum _{i=1}^{n}\{Y _{ i} -\boldsymbol{\beta }^{T}\mathbf{Z}_{ i} - f(\mathbf{G}_{i})\}^{2} -\frac{1} {2}\lambda \|f\|_{\mathcal{F}_{K}}^{2},& &{}\end{array}$$
(20.8)

where λ is a tuning parameter and controls the tradeoff between goodness of fit and complexity of the model. When \(\lambda = 0\), the model interpolates the data, whereas when \(\lambda = \infty\), the model reduces to a simple linear model.

Because the function (20.8) is hard to optimize directly, we introduce the Lagrange multipliers (also called the dual parameters) \(\boldsymbol{\gamma}\) to obtain

$$\displaystyle{ \mathcal{L}(\boldsymbol{\omega },\boldsymbol{\beta },\boldsymbol{e},\boldsymbol{\gamma }) = -\frac{1} {2}\sum _{i=1}^{n}e_{ i}^{2} -\frac{1} {2}\lambda \boldsymbol{\omega }^{T}\boldsymbol{\omega } +\sum _{ i=1}^{n}\gamma _{ i}\{\boldsymbol{\beta }^{T}\mathbf{Z}_{ i} + \boldsymbol{\phi }(\mathbf{G}_{i})^{T}\boldsymbol{\omega } + e_{ i} - Y _{i}\}. }$$
(20.9)

The dual problem is formulated by removing the high-dimensional primal coefficient vector \(\boldsymbol{\omega}\) and the constraint parameters \(\boldsymbol{e}\) from \(\mathcal{L}(\boldsymbol{\omega},\boldsymbol{\beta},\boldsymbol{e},\boldsymbol{\gamma})\) and writing \(\mathcal{L}(\boldsymbol{\omega},\boldsymbol{\beta},\boldsymbol{e},\boldsymbol{\gamma})\) as a function of \(\boldsymbol{\beta}\) and the dual parameter vector \(\boldsymbol{\gamma}\) only. We will see that the resulting estimators \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\gamma}}\) can be expressed in terms of the kernel function \(K(\cdot,\cdot)\). One can then conveniently obtain the maximizer \(\hat{\boldsymbol{\omega}}\) of the original primal problem, and hence \(\hat{f}(\mathbf{G})\) at any arbitrary \(\mathbf{G}\), as a function of the kernel function \(K(\cdot,\cdot)\).

Specifically, the dual problem to minimizing (20.8) is

$$\displaystyle\begin{array}{rcl} \min _{\boldsymbol{\beta },\boldsymbol{\gamma }}\mathcal{Q}(\boldsymbol{\boldsymbol{\beta },\boldsymbol{\gamma }})& &{}\end{array}$$
(20.10)

where \(\mathcal{Q}(\boldsymbol{\beta},\boldsymbol{\gamma}) =\sup_{\boldsymbol{\omega},\boldsymbol{e}}\mathcal{L}(\boldsymbol{\omega},\boldsymbol{\beta},\boldsymbol{e},\boldsymbol{\gamma})\). Note that (20.10) is an unconstrained optimization problem whose unknown parameters comprise only \(\boldsymbol{\beta}\) and the dual parameters \(\boldsymbol{\gamma}\), whose dimension equals the sample size \(n\), often much smaller than \(J\), the dimension of the primal vector \(\boldsymbol{\omega}\). The dual formulation (20.10) therefore effectively transforms the often infinite-dimensional optimization problem (20.8) into a finite-dimensional one.

To obtain \(\mathcal{Q}(\boldsymbol{\beta},\boldsymbol{\gamma})\), one differentiates \(\mathcal{L}(\boldsymbol{\omega},\boldsymbol{\beta},\boldsymbol{e},\boldsymbol{\gamma})\) with respect to \(\boldsymbol{e}\) and \(\boldsymbol{\omega}\) and sets the derivatives to zero. We have

$$\displaystyle\begin{array}{rcl} \hat{\boldsymbol{e}}& =& \boldsymbol{\gamma} \\ \hat{\boldsymbol{\omega}}& =& \lambda^{-1}\sum_{i=1}^{n}\gamma_{i}\boldsymbol{\phi}(\mathbf{G}_{i}).{}\end{array}$$
(20.11)

Substituting \(\hat{\boldsymbol{\omega }}\) and \(\hat{\boldsymbol{e}}\) into \(\mathcal{L}(\cdot )\), some calculations give

$$\displaystyle\begin{array}{rcl} \mathcal{Q}(\boldsymbol{\beta},\boldsymbol{\gamma})& =& (\boldsymbol{Y} -\mathbf{Z}\boldsymbol{\beta})^{T}\boldsymbol{\gamma} -\frac{1}{2}\boldsymbol{\gamma}^{T}\left(\boldsymbol{I} +\lambda^{-1}\boldsymbol{K}\right)\boldsymbol{\gamma}{}\end{array}$$
(20.12)

where \(\boldsymbol{Y} = (Y_{1},\cdots,Y_{n})^{T}\), \(\mathbf{Z} = (\mathbf{Z}_{1},\cdots,\mathbf{Z}_{n})^{T}\), and \(\boldsymbol{K}\) is the \(n \times n\) matrix whose \((i,i^{\prime})\)th element is \(K(\mathbf{G}_{i},\mathbf{G}_{i^{\prime}})\), the kernel function evaluated at the pair of design points \((\mathbf{G}_{i},\mathbf{G}_{i^{\prime}})\). Note that the kernel matrix \(\boldsymbol{K}\) measures the similarity among the covariate values \((\mathbf{G}_{1},\cdots,\mathbf{G}_{n})\). One can see that even when \(p\) (the dimension of \(\mathbf{G}\)) or \(J\) (the dimension of the feature space) is large, the dimension of \(\boldsymbol{K}\) is unaffected and remains \(n \times n\).

Differentiating \(\mathcal{Q}(\boldsymbol{\beta },\boldsymbol{\gamma })\) with respect to \(\boldsymbol{\gamma }\) and \(\boldsymbol{\beta }\), some calculations give

$$\displaystyle\begin{array}{rcl} \hat{\boldsymbol{\beta }}= \left \{\mathbf{Z}^{T}(\boldsymbol{I} +\lambda ^{-1}\boldsymbol{K})^{-1}\mathbf{Z}\right \}^{-1}\mathbf{Z}^{T}(\boldsymbol{I} +\lambda ^{-1}\boldsymbol{K})^{-1}\boldsymbol{Y }& &{}\end{array}$$
(20.13)
$$\displaystyle\begin{array}{rcl} \hat{\boldsymbol{\gamma}} = (\boldsymbol{I} +\lambda^{-1}\boldsymbol{K})^{-1}(\boldsymbol{Y} -\mathbf{Z}\hat{\boldsymbol{\beta}}).& &{}\end{array}$$
(20.14)

Plugging (20.14) into (20.11), we have

$$\displaystyle{\hat{\boldsymbol{\omega}} =\lambda^{-1}\{\boldsymbol{\phi}(\mathbf{G}_{1}),\cdots,\boldsymbol{\phi}(\mathbf{G}_{n})\}\hat{\boldsymbol{\gamma}} =\lambda^{-1}\{\boldsymbol{\phi}(\mathbf{G}_{1}),\cdots,\boldsymbol{\phi}(\mathbf{G}_{n})\}(\boldsymbol{I} +\lambda^{-1}\boldsymbol{K})^{-1}(\boldsymbol{Y} -\mathbf{Z}\hat{\boldsymbol{\beta}}).}$$

It follows that the nonparametric function f(⋅ ) evaluated at the design points \((\mathbf{G}_{1},\cdots \,,\mathbf{G}_{n})^{T}\) is estimated as

$$\displaystyle\begin{array}{rcl} \hat{\boldsymbol{f}} =\lambda^{-1}\boldsymbol{K}\hat{\boldsymbol{\gamma}} =\lambda^{-1}\boldsymbol{K}(\boldsymbol{I} +\lambda^{-1}\boldsymbol{K})^{-1}(\boldsymbol{Y} -\mathbf{Z}\hat{\boldsymbol{\beta}}).& &{}\end{array}$$
(20.15)

The estimator of the nonparametric function f(⋅ ) at an arbitrary \(\mathbf{G}\) is

$$\displaystyle\begin{array}{rcl} \hat{f}(\mathbf{G}) =\boldsymbol{\phi } (\mathbf{G})^{T}\hat{\boldsymbol{\omega }}& &{}\end{array}$$
(20.16)
$$\displaystyle\begin{array}{rcl} =\lambda^{-1}\{K(\mathbf{G},\mathbf{G}_{1}),\cdots,K(\mathbf{G},\mathbf{G}_{n})\}(\boldsymbol{I} +\lambda^{-1}\boldsymbol{K})^{-1}(\boldsymbol{Y} -\mathbf{Z}\hat{\boldsymbol{\beta}}).& &{}\end{array}$$
(20.17)

Note that the estimators \(\hat{\boldsymbol{\beta}}\) and \(\hat{f}(\cdot)\) in (20.13) and (20.15) are the maximizers of the original primal problem. Examination of equations (20.13) and (20.17) shows that \(\hat{\boldsymbol{\beta}}\) and \(\hat{f}(\cdot)\) are both conveniently evaluated using the kernel function \(K(\cdot,\cdot)\) and do not require specifying the high (possibly infinite) dimensional basis functions (features) \(\{\phi(\mathbf{G})\}\). This means one simply summarizes the similarity of the high-dimensional covariates \((\mathbf{G}_{1},\cdots,\mathbf{G}_{n})\) in the kernel matrix \(\boldsymbol{K}\), then calculates \(\hat{\boldsymbol{\beta}}\) and \(\hat{f}(\cdot)\) by inverting an \(n \times n\) matrix involving \(\boldsymbol{K}\), whose dimension is the sample size and is often small in high dimensional problems, e.g., microarray problems. Using (20.14), one can easily see that \(\hat{f}(\mathbf{G})\) can be rewritten as

$$\displaystyle{\hat{f}(\mathbf{G}) =\sum _{ i=1}^{n}\lambda ^{-1}\hat{\gamma }_{ i}K(\mathbf{G},\mathbf{G}_{i}).}$$

A comparison of this equation with equation (20.7) shows that \(\hat{f}(\mathbf{G})\) takes exactly the dual representation, with \(L = n\), \((\mathbf{G}_{1}^{\ast},\cdots,\mathbf{G}_{n}^{\ast}) = (\mathbf{G}_{1},\cdots,\mathbf{G}_{n})\), and \(\boldsymbol{\alpha} =\lambda^{-1}\hat{\boldsymbol{\gamma}}\). Hence the estimated Lagrange multipliers \(\hat{\boldsymbol{\gamma}}\) serve, apart from a scale factor, as the coefficients in the dual representation of \(\hat{f}(\mathbf{G})\). A code sketch of these closed-form estimators is given below.
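The closed-form solutions (20.13)–(20.15) translate directly into a few lines of R. This is a minimal sketch in our own notation; the function name and arguments are ours.

# y: n x 1 response; Z: n x q covariate matrix;
# K: n x n kernel matrix; lambda: tuning parameter
kernel.machine.fit <- function(y, Z, K, lambda) {
  n <- length(y)
  M <- solve(diag(n) + K / lambda)                         # (I + lambda^{-1} K)^{-1}
  beta.hat  <- solve(t(Z) %*% M %*% Z, t(Z) %*% M %*% y)   # equation (20.13)
  gamma.hat <- M %*% (y - Z %*% beta.hat)                  # equation (20.14)
  f.hat     <- K %*% gamma.hat / lambda                    # equation (20.15)
  list(beta = beta.hat, gamma = gamma.hat, f = f.hat)
}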

In Liu et al. [33], it is shown that the estimates of f and β can be derived as estimates from a random effects model of the following form:

$$\displaystyle{ \boldsymbol{Y} = \mathbf{Z}\boldsymbol{\beta} +\boldsymbol{f} +\boldsymbol{e}, }$$
(20.18)

where \(\boldsymbol{\beta}\) is a \(q \times 1\) vector of regression coefficients, \(\boldsymbol{f}\) is an \(n \times 1\) vector of random effects following \(\boldsymbol{f} \sim N\{\mathbf{0},\tau \boldsymbol{K}(\rho)\}\), \(\rho\) is a scale parameter, and \(\boldsymbol{e} \sim N(\mathbf{0},\boldsymbol{R} =\sigma^{2}\boldsymbol{I})\). Because of this equivalence, the regression parameters in the model can be estimated by maximum likelihood, while the variance components can be estimated by restricted maximum likelihood. If we assume \(f(\mathbf{G}) \in \mathcal{F}_{K}\), one can easily see from the linear mixed model representation (20.18) of the least squares kernel machine that testing \(H_{0}: f(\mathbf{G}) = 0\) is equivalent to testing the variance component \(\tau = 0\). The null hypothesis \(H_{0}:\tau = 0\) places \(\tau\) on the boundary of the parameter space, so standard likelihood asymptotics do not apply directly; Liu et al. [33] developed a score test for testing \(H_{0}\), a sketch of which follows.
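Here is a minimal sketch of such a variance-component score test for a continuous phenotype (identity link), without the small-sample adjustments used by the SKAT software. The helper name is ours, and we assume the CompQuadForm package is available for Davies' method.

library(CompQuadForm)   # davies() evaluates mixture-of-chi-square tail probabilities

# y: continuous phenotype; Z: covariate matrix; K: n x n kernel matrix
vc.score.test <- function(y, Z, K) {
  n <- length(y)
  X <- cbind(1, Z)                                  # null design with intercept
  P <- diag(n) - X %*% solve(crossprod(X), t(X))    # projection onto residual space
  r <- drop(P %*% y)                                # residuals under H0: tau = 0
  s2 <- sum(r^2) / (n - ncol(X))                    # estimate of sigma^2
  Q <- sum(r * (K %*% r)) / (2 * s2)                # score statistic for tau = 0
  # under H0, Q is approximately a mixture of chi-square(1) variables,
  # with weights given by the nonzero eigenvalues of P K P / 2
  lam <- eigen(P %*% K %*% P / 2, symmetric = TRUE, only.values = TRUE)$values
  lam <- lam[lam > 1e-8 * max(lam)]
  list(Q = Q, p.value = davies(Q, lam)$Qq)
}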

20.3.3 SKAT Extensions

Since the seminal work of Wu et al. [56] on this topic, there have been several notable extensions of the SKAT methodology. One extension, by Lee et al. [28, 29], observed that the collapsing approaches and SKAT can be combined into a unified framework based on a prior distribution for the correlation among the effects of rare variants within a genomic region of interest. An application of the SKAT statistic to meta-analysis has been developed by Lee et al. [27]. Finally, we note that Ionita-Laza et al. [23] have extended the SKAT approach to incorporate common and rare variants simultaneously.

20.3.4 SKAT Example

We now describe the application of the SKAT methodology to data from Girirajan et al. [20], in which the role of structural variants in autism was explored. The data come from the Simons Simplex Collection. For the purposes of this chapter, we will assume that the rows of the data matrix below represent statistically independent observations. A sample of the data is given below:

  chrom     start       end   size pheno
1  chr1   6191784   6494317 302533     0
2  chr1 108655067 108718023  62956     0
3  chr1 143636400 143700636  64236     0
4  chr1 143636400 143701095  64695     0
5  chr1 143639096 143701095  61999     0

In this file, start and end denote the beginning and end of the structural variant, size denotes the length of the variant (the difference between end and start), and pheno codes the phenotype as zero for control and one for case (i.e., autism). Our analyses using SKAT will use start, end, and pheno.

We consider data from chromosome 1, which has been considered a hotspot for structural variation in autism. We have measurements from 99 cases and 76 controls. We note that the structural variants have variable length, which is why the size column varies. In order to implement the SKAT method, we need to convert each row of the dataset into a vector of zeroes and ones, where zero represents absence of a structural variant and one indicates its presence. We partitioned chromosome 1 into 2000 nonoverlapping windows of equal size and determined, for each row of the dataset, which windows the alteration overlapped; this is done by comparing both start and end to the window in question (a sketch of this step appears below). This gives a 175 by 2000 matrix of zeroes and ones. However, of the 2000 columns, only 50 have at least one nonzero entry, so for each subject we retain a 50-dimensional vector of counts. It is not easy to perform descriptive statistics on this type of data. As in [20], we can define the copy number burden, which simply adds up the counts over the 50 dimensions for each subject. A plot of the distribution of copy number burden for cases and controls is given in Fig. 20.2.
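A sketch of this windowing step is as follows (our own code with hypothetical object names: dat holds the columns shown above, restricted to chromosome 1, and chr1.len is the chromosome length in base pairs).

n.win  <- 2000
breaks <- seq(0, chr1.len, length.out = n.win + 1)

# indicator matrix: X[i, w] = 1 if variant i overlaps window w
X <- matrix(0L, nrow = nrow(dat), ncol = n.win)
for (i in seq_len(nrow(dat))) {
  w.start <- findInterval(dat$start[i], breaks, all.inside = TRUE)
  w.end   <- findInterval(dat$end[i],   breaks, all.inside = TRUE)
  X[i, w.start:w.end] <- 1L
}

# keep only windows hit by at least one variant (50 in these data);
# the copy number burden is then the row sum
G      <- X[, colSums(X) > 0]
burden <- rowSums(G)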

Fig. 20.2

Boxplots of the distribution of copy number burden for chromosome 1 in controls (left) versus autism cases (right). The data represent the total number of structural variants across the 50 windows that had at least one variant among the 175 samples

Based on Fig. 20.2, we find almost no difference between the copy number burden distributions of cases and controls, aside from two high outliers among the autism cases. However, the SKAT methodology may be able to identify differences between cases and controls in the 50-dimensional count vectors that cannot be seen in the copy number burden. To illustrate the method, we simulated a covariate Z from a standard normal distribution and used the following R code to run SKAT.

library(SKAT)

# y.b: pheno variable from the dataset (0 = control, 1 = case)
# Z:   simulated normal(0,1) covariate
# G:   structural variant data, here a 50-dimensional vector of
#      counts for each subject (an n x 50 matrix)
#
# The kernel argument of SKAT specifies the kernel matrix; options
# include linear, IBS, quadratic and 2-way interaction. The first
# three can be weighted by the inverse of the variance of the
# estimated proportion of the rare variant, as described in
# Madsen and Browning (2009).
#
# Here, we use the weighted linear kernel.

obj <- SKAT_Null_Model(y.b ~ Z, out_type = "D")
skat1 <- SKAT(G, obj, kernel = "linear.weighted")

Further details about the code can be found in the SKAT manual. We note that the default procedure of Wu et al. [56] is recommended for sample sizes greater than 2000. Given that our example has a sample size of 175, SKAT performs an adjustment, using higher-order approximations to estimate the null distribution of the test statistic. With this adjustment, the p-value from SKAT is \(5.27 \times 10^{-5}\). Thus there is strong evidence that structural variants on chromosome 1 are associated with autism.

20.4 Multiple Testing

Next, we discuss the impact of multiple comparisons on the analysis of rare variant data from sequencing studies. While genomics has experienced an explosion in the literature on multiple testing, two issues are unique to the sequencing context. First, because these variants are rare by definition, the number of single-variant hypothesis tests that need to be performed is actually quite small relative to the number of tests in other problems (e.g., the number of tests in common-variant GWAS). More challenging, however, is the inherent discreteness of the data structure. For a given rare variant, we can represent the data as in Table 20.1,

Table 20.1 Rare variant presence/absence analysis

                    Cases    Controls
Variant present     n11      n12
Variant absent      n21      n22

where the cell entries represent the number of samples in each of the groups. We wish to test for independence of the rows and columns, and many methods exist for testing the null hypothesis of no association between presence of the rare variant and group label. If the expected count is greater than five in every cell, then one can safely use the chi-squared statistic. However, when the cell counts are small, we instead use Fisher's exact test, where the p-value is computed from a hypergeometric distribution, as illustrated below.
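For instance, with purely hypothetical counts (ours, for illustration only):

# hypothetical 2 x 2 table: variant presence by case/control status
tab <- matrix(c( 7,  1,
                92, 75),
              nrow = 2, byrow = TRUE,
              dimnames = list(variant = c("present", "absent"),
                              group   = c("case", "control")))

chisq.test(tab)$expected   # some expected counts fall below five here
fisher.test(tab)           # exact p-value via the hypergeometric distribution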

While there has been much work on extensions and generalizations of the FDR estimation methodology, most of the literature in this area uses the fact that under the null hypothesis the p-values are uniformly distributed on (0,1), or more generally that the test statistics have a continuous distribution. This does not apply to rare variant presence/absence calls. The literature on multiple testing with discrete p-values is much more limited. An initial procedure was proposed by Tarone [51]; it considers only hypotheses for which a sufficiently small p-value is attainable and then performs a Bonferroni correction on those selected hypotheses (a sketch is given below). This procedure was modified to the false discovery setting by Gilbert [15], where the Bonferroni adjustment was replaced by the Benjamini-Hochberg (B-H) [2] procedure. Theoretical aspects of the B-H procedure with discrete test statistics have been addressed by Ferreira [9]. An FDR-based estimation procedure in the spirit of the q-value methodology of Storey [49] was developed by Pounds and Cheng [43]. In Kulinskaya and Lewin [25], the B-H procedure was applied to so-called fuzzy p-values, whose behavior under the null hypothesis is identical to that of a Uniform(0,1) random variable, so that the usual methods apply. Applications of discrete multiple testing ideas to a cancer genomics problem can be found in Ghosh [12, 13]. Recent work of Bancroft et al. [1] uses a novel sequential permutation p-value approach to estimate FDR that would also be applicable in this setting.
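A minimal sketch of Tarone's procedure in R (our own implementation, assuming that the smallest attainable p-value for each discrete test, given its table margins, has been precomputed):

# p:     observed p-values, one per rare variant
# min.p: smallest attainable p-value for each test given its margins
tarone.bonferroni <- function(p, min.p, alpha = 0.05) {
  # find the smallest m such that at most m hypotheses
  # can attain significance at level alpha / m
  for (m in seq_along(p)) {
    testable <- min.p <= alpha / m
    if (sum(testable) <= m) break
  }
  # Bonferroni test at level alpha / m, restricted to testable hypotheses
  list(m = m, testable = testable, reject = testable & (p <= alpha / m))
}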

Finally, an open problem in this area is the incorporation of dependence into multiple testing procedures. While there has been much recent work on multiple comparisons with dependent data [8, 30], almost all of it again assumes that the p-values are derived from continuous distributions, which is not the case here. However, the argument that rare variants operate within a network structure is less plausible than for phenomena such as gene expression, so a case could be made that dependence is not as serious an issue as in other genomic settings. This topic is certainly worthy of future exploration.

20.5 Discussion

This chapter has attempted to discuss issues in the analysis of rare variant data for a statistical audience. One of the major messages is that the phenomenon being described has a low probability of occurring but, given its occurrence, can have a large effect.

One of the major challenges in this area will be the development of methods with high power for detecting these events. A major statistical lesson is that the score method of testing has definite merits. While classical statistical theory teaches us that the likelihood ratio, Wald, and score tests behave identically as the sample size tends to infinity, we are here in a small-sample scenario where asymptotic theory will not hold. The score statistic provides many advantages, a major one being that it avoids having to estimate the rare variant effects.

An area not discussed in this chapter is meta-analysis. This has become the de rigueur method for identifying candidate genes from genomewide studies. We point the reader to the recent review by Evangelou and Ioannidis [7] and note the SKAT approach to this problem that was described in Lee et al. [27].

While this area is relatively new, we should be mindful of lessons learned in many other settings. For example, it is well known that selected variables or SNPs suffer from the so-called 'winner's curse', so that estimated effects will be biased. This will also be the case for rare variants and is inherent to the statistical task at hand.

Finally, we believe that a tactic that will be useful in the future is what we term 'pooling information.' One of the major reasons that SKAT methods have had such an impact in this area is that the equivalence with variance components models, and the introduction of random effects, leads to the ability to pool information across estimated parameters. Statistically, this can be conceptualized using shrinkage theory, empirical Bayes, and, more generally, Bayesian methods. Given the increasing availability of genomewide information from different data sources, pooling information using 'vertical integration' techniques [52] will be needed to identify and elucidate the functionality of rare variants in the foreseeable future.