8.1 Introduction

In most living organisms, the DNA information stored in a cell is transcribed into messenger RNA (mRNA) and then translated into protein, which is the working force of the cell. The amount of mRNA produced by a gene is generally referred to as gene expression. Since the mid-1990s, gene expression microarrays have been widely employed to assess mRNA abundance genome-wide. The huge amount of data produced by expression microarrays has not only greatly improved our understanding of cell biology, but also provided invaluable resources to guide the diagnosis and treatment of human diseases. For example, gene expression profiles have been used to dissect cancer subtypes [45] and to predict drug sensitivities [20].

The mRNA abundance of a gene may be associated with the genotype of one or more genetic loci, which are referred to as expression quantitative trait loci (eQTL). In most eQTL studies, genome-wide gene expression data and DNA genotype data of genetic markers such as single nucleotide polymorphisms (SNPs) are collected in a common set of samples. Then eQTLs are identified by linkage/association analysis in which the expression of each gene is treated as a quantitative trait. We refer the readers to [10, 51] for reviews on eQTL studies and their potential impacts on understanding the genomic basis of human complex traits, and to [33, 68] for reviews on statistical methods and computational tools for eQTL studies using gene expression from microarrays.

In this chapter, we will focus on eQTL mapping using RNA-seq data. RNA-seq, i.e., high-throughput RNA sequencing, is replacing expression microarrays for transcriptome studies. To explain the motivations of designing statistical methods specifically for RNA-seq data, it is helpful to first describe the differences between the microarray and RNA-seq platforms. In microarray experiments, the abundance of gene expression is measured by fluorescent signals on a set of probes, where each probe contains a specific short piece of DNA sequence (e.g., 25 base pairs for most Affymetrix arrays). The amount of information that can be obtained is limited by the design of the microarray:

  • The quantification of gene expression is confined to the regions where the probes are placed. The probes are pre-selected to cover known genes, and in most array platforms, the probes are located at the 3’ ends of the transcripts instead of being uniformly distributed across exonic regions. Therefore, previously unknown transcripts cannot be measured for expression and the measurements at known transcripts may be biased by the signals at the 3’ ends.

  • The same probe sequences are used for all samples and do not accommodate the genetic differences across samples or the differences between the paternal and maternal alleles of a sample. Therefore, the gene expression from the paternal and maternal alleles cannot be distinguished.

In RNA-seq experiments, the expression of a gene is measured by the number of sequence reads mapped to that gene [18, 42]. RNA-seq overcomes the two limitations of microarrays. First, RNA-seq objectively quantifies the genome-wide transcript abundance without relying on pre-selected probes. Second, an RNA-seq read delivers allele-specific information if it overlaps with at least one heterozygous SNP/indel (i.e., a SNP or an insertion or deletion that is heterozygous between the paternal and maternal alleles).

Figure 8.1 illustrates the data generated by the two platforms. In particular, microarray data take continuous values and RNA-seq data are discrete counts. If that were the only difference between the two platforms, there would be no need to develop novel statistical methods for RNA-seq data, because one could simply replace the linear regression model for continuous microarray data with a generalized linear regression model (with a Poisson or negative binomial distribution assumption) for count data. In fact, the raw sequence data from RNA-seq contain much more information than a single count, as shown in Fig. 8.1. First, in a diploid genome such as the genome of human or mouse, there are two sets of chromosomes, one from the father and one from the mother. Thus most genes (e.g., autosomal genes and X-linked genes in females) have two copies and each copy is called an allele of this gene. The expression of each allele of a gene, i.e., allele-specific expression (ASE), can be extracted from the raw RNA-seq data. Second, in a higher organism such as a human or mouse, one gene often comprises several exons and the exons can be grouped in different ways to produce different proteins or non-coding RNA molecules. Each combination of the exons of a gene is called a transcript or an RNA isoform. The expression of each isoform, i.e., isoform-specific expression (ISE), can also be inferred from the raw RNA-seq data. In summary, the RNA-seq platform delivers much more information than microarrays and thus warrants the development of novel statistical methods to fully exploit the new features.

Fig. 8.1

(a) Gene expression data from a microarray. Each sample is measured by an array with tens of thousands of pre-selected probes. The expression of one gene is estimated by combining the fluorescent signals of multiple probes. (b) Gene expression data from RNA-seq. The data of each sample are stored in a text file, usually in the FASTQ format. A FASTQ file contains millions of records and each record corresponds to an RNA-seq read with four lines: the sequence identifier, the actual DNA sequence, a separator, and the sequencing quality scores for every base pair of the sequence
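The four-line record layout described in the caption can be illustrated with a minimal reader. This is only a sketch: the names `FastqRecord` and `read_fastq` and the two example reads are hypothetical, and it assumes the simple case in which each record occupies exactly four lines.

```python
from io import StringIO
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading '@'
    sequence: str    # line 2, the base calls
    quality: str     # line 4, one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line group in a FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].rstrip("\n"), sequence, quality)


# Two made-up reads, for illustration only.
example = StringIO(
    "@read1\nACGTAC\n+\nIIIIII\n"
    "@read2\nTTGCA\n+\nFFFFF\n"
)
records = list(read_fastq(example))
for record in records:
    print(record.identifier, record.sequence, record.quality)
```

In practice a dedicated sequence-analysis library would normally be used; the sketch only makes the record structure concrete.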

The remainder of this chapter is organized as follows. Sections 8.2 and 8.3 will introduce eQTL mapping using ASE and ISE, respectively. Section 8.4 will discuss some challenges and future directions.

8.2 eQTL Mapping Using ASE

We will first describe the quantification of ASE and show how the ASE enables the detection of cis-regulatory eQTLs. Then we will introduce statistical methods for eQTL mapping using ASE under two scenarios, namely, with and without known haplotypes between the candidate eQTL and the gene of interest.

8.2.1 Quantification of ASE Using RNA-seq

ASE can be measured by the number of RNA-seq reads that are mapped to the gene and overlap with at least one SNP or indel with a heterozygous genotype. Figure 8.2 illustrates the quantification of ASE for a hypothetical gene with two exons.

Fig. 8.2

An example of ASE abundance quantification using RNA-seq, for a hypothetical gene with two exons and one heterozygous SNP within each exon. (a) Two haplotypes of this gene. (b) The number of allele-specific reads from these two haplotypes

There are two SNPs with heterozygous genotypes in the exonic regions of this gene, one SNP for each exon. Given the genotype at each SNP, the allele-specific read count (ASReC) can be obtained by counting the number of reads harboring a particular SNP allele. For example, there are 6 reads overlapping with the first SNP with genotype CT, and the ASReCs are 4 and 2 for SNP alleles C and T, respectively. Then, the ASE of this gene can be estimated by combining ASReCs across multiple SNPs if the haplotype information is available. In the example shown in Fig. 8.2a, the genotypes of the two SNPs are CT and GA and the possible haplotype pairs are (C-G, T-A) and (C-A, T-G). If we knew that the underlying haplotype pair is (C-G, T-A), we could obtain the gene-level ASReCs as shown in Fig. 8.2b.
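The counting described above can be sketched in a few lines. The counts at the first SNP (4 reads with C, 2 with T) come from the text; the counts at the second SNP and all names (`reads`, `haplotypes`, `snp_level_counts`, `gene_level_counts`) are hypothetical and chosen only for illustration. A read is assigned to a haplotype only if every allele it shows matches that haplotype, so each read is counted at most once.

```python
from collections import Counter

# Each read is represented by the alleles it shows at the heterozygous SNPs
# it overlaps. SNP 1 counts follow the text; SNP 2 counts are made up.
reads = (
    [{"snp1": "C"}] * 4 + [{"snp1": "T"}] * 2
    + [{"snp2": "G"}] * 3 + [{"snp2": "A"}] * 1
)

# Assumed known haplotype pair (C-G, T-A).
haplotypes = {
    "h1": {"snp1": "C", "snp2": "G"},
    "h2": {"snp1": "T", "snp2": "A"},
}


def snp_level_counts(reads, snp):
    """ASReC at a single SNP: number of reads showing each allele."""
    return Counter(read[snp] for read in reads if snp in read)


def gene_level_counts(reads, haplotypes):
    """Gene-level ASReC: reads that match exactly one haplotype."""
    counts = Counter({name: 0 for name in haplotypes})
    for read in reads:
        matches = [
            name for name, hap in haplotypes.items()
            if all(hap[snp] == allele for snp, allele in read.items())
        ]
        if len(matches) == 1:
            counts[matches[0]] += 1
    return counts


snp1 = snp_level_counts(reads, "snp1")
gene = gene_level_counts(reads, haplotypes)
print(dict(snp1), dict(gene))
```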

Next we discuss a few issues related to ASE quantification: haplotype phasing, sequence mapping bias, and expected ASReC.

8.2.1.1 Haplotype Phasing

Many algorithms (e.g., [8, 12, 36]) have been developed to infer the haplotype phases from the genotypes of unrelated individuals. It is well known that the phasing accuracy deteriorates as the length of the haplotype increases. However, it is still reasonable to assume that the phasing is accurate within the exonic regions of a gene because those regions are relatively short ( ∼ 90 % of the annotated genes are shorter than 100 kb [16]) and tend to undergo less recombination [62]. In addition, the switch errors (i.e., mistaken swapping from one haplotype to the other) in exonic regions can be captured and corrected by RNA-seq reads (either single or paired-end reads) that overlap with two or more heterozygous SNPs (i.e., SNPs with heterozygous genotypes) and thus provide direct information on the haplotype phase. Some reads may even span over non-adjacent exons due to alternative splicing and thus provide information on long-range phase.

8.2.1.2 Sequence Mapping Bias

A common practice in RNA-seq studies is to map the reads of all samples against the same reference genome. This may induce mapping bias because the reads harboring reference alleles tend to be mapped more accurately than those harboring alternative alleles. There are several solutions to this problem.

  1. Identify and remove SNPs that may cause mapping bias by mapping simulated reads to the reference genome [46].

  2. Employ an allele-aware sequence aligner [70] that uses both the reference genome and alternate alleles to map reads.

  3. Construct the two haploid genomes for each diploid individual and map the reads against the two genomes separately [26, 30].

The third approach is the most unbiased and most comprehensive one, although it requires more information, i.e., the complete haploid genomes, and more computational time. Such an effort can be well justified for certain diploid samples with two very different haploid genomes, e.g., F1 mice from a cross of two inbred mouse strains with different genome backgrounds.

8.2.1.3 Expected ASReC

What proportion of RNA-seq reads are allele-specific? The answer depends on two factors, the density of DNA polymorphisms (usually SNPs or indels) with heterozygous genotypes and the read length. Clearly, the more different are the two haploid genomes, the more reads are allele-specific; the longer the reads are, the more likely they overlap with heterozygous DNA polymorphisms. The expected proportion of allele-specific reads can vary from 0.5 % in a human study with short reads [46, 55] (Fig. 8.3a) to 35 % in an F1 mouse study with longer reads [11] (Fig. 8.3b). To be specific, the human study [46, 55] adopted an RNA-seq experiment with 35 bp single-end reads and used ∼ 1.4 million HapMap SNPs to extract allele-specific reads. The number of heterozygous SNPs for an individual ranges from 392,800 to 415,500 with a median of 409,100. In another on-going study involving 550 breast cancer patients from The Cancer Genome Atlas (TCGA) using 2×50 bp paired-end reads and ∼ 30 million 1000G SNPs, we identified 3.4 % reads as allele-specific. The number of heterozygous SNPs across these TCGA samples ranges from 1.91 million to 2.02 million with a median of 1.97 million. The increase of the proportion of allele-specific reads from 0.5 % to 3.4 % in the two human studies can be attributed to both the longer reads and the larger number of heterozygous SNPs. By contrast, the mouse study [11] collected 2×100 bp paired-end RNA-seq reads from F1 mice with around 17.5 million heterozygous SNPs/indels per sample, making it possible to harvest 35 % of RNA-seq reads as allele-specific.

Fig. 8.3

Scatter plot of the total number of RNA-seq reads versus the total number of allele-specific reads for all the samples in (a) a human study of unrelated individuals of African population (HapMap YRI samples) [55] and (b) a mouse study of three reciprocal F1 crosses of three mouse inbred strains (CAST/EiJ, PWK/PhJ and WSB/EiJ) representative of three subspecies within the Mus musculus species group (M. m. castaneus, M. m. musculus and M. m. domesticus, respectively)

8.2.2 ASE for cis-eQTL Mapping

Given ASE, we can assess whether there is allelic imbalance of gene expression. In some publications, the terms ASE and allelic imbalance are used interchangeably. In this book chapter, however, ASE indicates the expression measurement from a particular allele. ASE is available for a gene if it has exonic SNPs/indels with heterozygous genotypes, and thus having ASE does not imply allelic imbalance. A number of pioneering studies have shown that allelic imbalance in gene expression exists and may be associated with disease susceptibility [17, 27, 35, 40, 60, 73]. For example, the reduction in the expression of one allele at the TGFBR1 gene in blood cells (germline) leads to an elevated risk of colorectal cancer [60]. In addition, effective treatments can be developed by silencing the disease allele while sparing the expression of the wild-type allele [41]. Here, we focus on mapping the DNA polymorphism that leads to allelic imbalance of gene expression, which is called a cis-eQTL and is a main mechanism of allelic imbalance.

To better understand cis-eQTLs, it is helpful to introduce the concept of trans-eQTL and clarify their differences. The terms cis-eQTL and trans-eQTL have been widely used to refer to eQTLs that are close to the associated genes and eQTLs that are distant, respectively. An arbitrary distance, such as 200 kb or 1 Mb, is often used to distinguish local and distant eQTLs. It has been pointed out before [51] and is worth emphasizing again: it is misleading to refer to a local or distant eQTL as a cis- or trans-eQTL because the latter terms have their own biological meanings.

The Latin words cis and trans mean “on the same side” and “across”, respectively. A cis-eQTL is located on the same chromosome as its target gene and influences the gene expression in an allele-specific manner. Specifically, a mutation in the maternal allele only changes the gene expression from the maternal allele but does not affect the expression from the paternal allele (Fig. 8.4a). A plausible scenario is that a cis-eQTL is located at a transcription factor binding site of a gene and thus interferes with transcription factor binding in an allele-specific manner. A cis-eQTL is likely to be a local eQTL, though this is not always true. By contrast, a trans-eQTL of a gene can be located anywhere in the genome and it influences the gene expression of both alleles to the same extent. One possible mechanism is that a trans-eQTL modifies the activity or abundance of a protein that regulates the gene and such regulation does not distinguish the two alleles of the gene [67] (Fig. 8.4b). Therefore, cis- and trans-eQTLs should be distinguished by ASE (Fig. 8.4a, b) [14, 52] rather than by their physical distance to the target gene. Note that cis- and trans-eQTLs cannot be distinguished by the total expression of the gene, which shows the same pattern at the population level (Fig. 8.4c, d).
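The distinction can be made concrete with an idealized calculation of expected expression per gene copy. This is a sketch under stated assumptions: the function `allele_means`, the per-copy means in `mu`, and the additive form of the trans effect are hypothetical choices for illustration, not part of the cited studies.

```python
def allele_means(genotype, mu, mode):
    """Expected expression of the two gene copies in one individual.

    genotype: pair of eQTL alleles, e.g. ("A", "B"); copy i of the gene is
              assumed to sit on the same chromosome as genotype[i].
    mu:       hypothetical mean expression per gene copy linked to each allele.
    mode:     "cis" (each copy follows its own eQTL allele) or
              "trans" (both copies are affected to the same extent).
    """
    first, second = mu[genotype[0]], mu[genotype[1]]
    if mode == "cis":
        return first, second
    if mode == "trans":
        half = (first + second) / 2.0
        return half, half
    raise ValueError("mode must be 'cis' or 'trans'")


mu = {"A": 10.0, "B": 20.0}  # made-up values
for genotype in (("A", "A"), ("A", "B"), ("B", "B")):
    cis = allele_means(genotype, mu, "cis")
    trans = allele_means(genotype, mu, "trans")
    print(genotype, "cis:", cis, "trans:", trans, "total:", sum(cis))
```

In this toy setting the total expression per genotype is identical under the two mechanisms, while the two copies differ only for the heterozygous genotype under the cis mechanism, which is why ASE is needed to tell them apart.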

Fig. 8.4

(a) An example of a cis-eQTL in two samples. In sample 2, where the candidate eQTL (the SNP for which we test association) has a heterozygous genotype CG, the expression levels of the two alleles are different. (b) An example of a trans-eQTL in two samples. In sample 2, where the candidate eQTL has a heterozygous genotype TA, the expression levels of the two alleles are the same. (c) Simulated data for a cis-eQTL across 60 samples with 20 samples within each genotype class. (d) Simulated data for a trans-eQTL across 60 samples with 20 samples within each genotype class. This figure is adapted from Fig. 1 in our earlier paper Sun and Hu (2013) [56]

From the above discussions, it is clear that ASE is informative for cis-eQTL mapping. Figure 8.5a–d shows a hypothetical example of cis-eQTL mapping using ASE. Assume that the gene of interest has two exons with one SNP for each. We wish to test whether a candidate eQTL, displayed on the left of the gene in Fig. 8.5a, cis-regulates the gene expression. First, we count the number of allele-specific reads. As mentioned in Sect. 8.2.1, an RNA-seq read is allele-specific if it can be assigned to one of the two alleles of the gene without ambiguity. As illustrated in Fig. 8.5a, individuals (i) and (ii) have heterozygous genotypes for at least one exonic SNP, and thus their ASE can be measured by the number of reads that overlap with the heterozygous SNPs. Haplotype information is required to combine ASE measured at individual exonic SNPs into the gene-level ASE. For example, for individual (i), we count the number of allele-specific reads mapped to the haplotypes A-A and T-G. Next, we associate ASE with the candidate eQTL. For individual (i) in Fig. 8.5a, given the longer haplotypes C-A-A and T-T-G that span over the gene as well as the candidate eQTL, we can link ASE of the A-A and T-G haplotypes of the gene to the C and T alleles of the candidate eQTL, respectively (Fig. 8.5c). The association testing seeks to answer the question whether one allele of the candidate eQTL is associated with a higher or lower ASE of the gene. If the answer is yes (and assuming there is no other factor inducing the allelic imbalance), then we expect allelic imbalanced expression when the genotype of the candidate eQTL is heterozygous and allelic balanced expression when the genotype is homozygous; in other words, the candidate eQTL is a cis-eQTL. For example, individual (i) has a heterozygous genotype C/T at the candidate eQTL and has a higher ASE corresponding to the C allele than the T allele (Fig. 8.5c). 
Individual (ii) has a homozygous genotype C/C at the candidate eQTL, each C allele corresponding to the same ASE (Fig. 8.5d). A real data example of 65 HapMap samples is shown in Fig. 8.5f.

The total read count (TReC) is also informative for cis-eQTL mapping, which is similar to the traditional eQTL mapping using gene expression measured by microarrays. While ASE provides information at the allele level, TReC contributes at the individual level and in a way that is consistent with the allele level. In Fig. 8.5a–d, the C allele of the candidate eQTL is associated with a higher ASE, which is manifested at the allele level (Fig. 8.5c, d) and at the individual level (Fig. 8.5b). In general, the TReC of a gene is much greater than the sum of the two ASReCs in that TReC includes many reads that do not overlap with any heterozygous SNPs/indels.

Fig. 8.5

(a)–(d) A hypothetical example of cis-eQTL mapping. (a) RNA-seq measurements of a gene with two exons in three individuals. (b) TReC (total read count) for the three individuals. (c–d) ASE for individual (i) and (ii). (e)–(f) A real data example of cis-eQTL mapping between gene KLK1 and SNP rs1054713. (e) Association between the genotypes and TReC. The y-axis is the total number of reads mapped to the gene KLK1 and each point corresponds to one of the 65 samples. (f) Association between the genotypes and ASE. When the genotype of rs1054713 is heterozygous, the ASE of the two alleles of this gene can be associated with the two alleles of rs1054713. ASET and ASEC denote the ASReC corresponding to the T and C allele of rs1054713, respectively. When the genotype of rs1054713 is homozygous, we denote the ASReC of the two alleles of this gene by ASE1 and ASE2, respectively. This figure is a modified version of Figs. 2 and 4 of the earlier paper by Sun and Hu (2013) [56]

8.2.3 eQTL Mapping Using ASE with Known Haplotypes

While the haplotypes across the exonic regions of a gene can be accurately phased, those extending from the gene to a candidate eQTL may not be reliably phased because the candidate eQTL may be far away from the gene. In this section, we assume that the extended haplotypes are known and defer the scenario with unknown haplotypes to the next section.

Our statistical model is for a particular gene of interest. To simplify the notation, we skip the index for gene. The model was originally proposed by Sun (2012) [55] and reviewed by Sun and Hu (2013) [56]. We use the following notation.

  • Let \(H = (h_{1},h_{2})\) denote the haplotype pair consisting of haplotypes \(h_{1}\) and \(h_{2}\) across the exonic SNPs. Let \(\tilde{H} = (\tilde{h}_{1},\tilde{h}_{2})\) denote the extended haplotype pair consisting of both the exonic SNPs and the candidate eQTL. Here the order of the two haplotypes is arbitrary and thus \((h_{1},h_{2})\) is the same as \((h_{2},h_{1})\) and \((\tilde{h}_{1},\tilde{h}_{2})\) is the same as \((\tilde{h}_{2},\tilde{h}_{1})\). We assume that both \(H\) and \(\tilde{H}\) are known here.

  • Let T be the total read count (TReC). Note that a paired-end sequence read is counted as one read.

  • Let \(N_{1}\), \(N_{2}\) and \(N\) denote the allele-specific read count (ASReC) from haplotypes \(h_{1}\) and \(h_{2}\) and the total ASReC, respectively. Naturally, \(N = N_{1} + N_{2}\).

  • Let G be the genotype of the candidate eQTL, which has two alleles A and B. Under the additive genetic effect, G = 0, 1, and 2 for genotypes AA, AB and BB, respectively. Dominant, recessive, and co-dominant effects can also be modeled using appropriate coding for genotypes.

  • Let \(\mathbf{X}\) be the relevant covariates including an intercept. Typically, \(\mathbf{X}\) includes the logarithm of the total read count per sample, which reflects the read depth.

We model the probability of T given G and \(\mathbf{X}\) by a negative binomial distribution indexed by parameters \((\boldsymbol{\gamma },\beta _{T},\phi )\), which is denoted by \(P_{\mathtt{TReC}}(T\vert G,\mathbf{X};\boldsymbol{\gamma },\beta _{T},\phi )\). A negative binomial distribution can be considered as an infinite gamma mixture of Poisson distributions. It allows over-dispersion in the read counts, a phenomenon that is often observed in sequencing data across biological replicates. Thus the negative binomial distribution has been commonly used for RNA-seq data analysis [5]. In particular, we assume that T follows the negative binomial distribution with mean μ and a dispersion parameter ϕ:

$$\displaystyle\begin{array}{rcl} P_{\mathtt{TReC}}(T\vert G,\mathbf{X};\boldsymbol{\gamma },\beta _{T},\phi ) = \frac{\varGamma (T + 1/\phi )} {T!\varGamma (1/\phi )} \left ( \frac{1} {1+\phi \mu }\right )^{1/\phi }\left ( \frac{\phi \mu } {1+\phi \mu }\right )^{T},& & {}\\ \end{array}$$

where

$$\displaystyle\begin{array}{rcl} \log (\mu ) =\boldsymbol{\gamma } ^{\mathrm{T}}\mathbf{X} + w(G,\beta _{ T}),& & {}\\ \end{array}$$

and

$$\displaystyle\begin{array}{rcl} w(G,\beta _{T}) = \left \{\begin{array}{l l} 0 &\quad \mbox{ if $G = 0$ } \\ \log \left [1 +\exp (\beta _{T})\right ] -\log 2&\quad \mbox{ if $G = 1$} \\ \beta _{T} &\quad \mbox{ if $G = 2$}.\\ \end{array} \right.& & {}\\ \end{array}$$

The functional form of \(w(G,\beta _{T})\) reflects the additive genetic effect. To see this, we denote the means of \(T\) given \(\mathbf{X}\) and \(G = 0, 1, 2\) by \(\mu _{AA,\mathbf{X}}\), \(\mu _{AB,\mathbf{X}}\) and \(\mu _{BB,\mathbf{X}}\), respectively, where

$$\displaystyle\begin{array}{rcl} \mu _{AA,\mathbf{X}}& =& \exp (\boldsymbol{\gamma }^{\mathrm{T}}\mathbf{X}), {}\\ \mu _{AB,\mathbf{X}}& =& \exp (\boldsymbol{\gamma }^{\mathrm{T}}\mathbf{X} +\log \left [1 +\exp (\beta _{ T})\right ] -\log 2) {}\\ \mu _{BB,\mathbf{X}}& =& \exp (\boldsymbol{\gamma }^{\mathrm{T}}\mathbf{X} +\beta _{ T}). {}\\ \end{array}$$

We can see that \(\beta _{T}\) characterizes the difference between \(\log (\mu _{AA,\mathbf{X}})\) and \(\log (\mu _{BB,\mathbf{X}})\), and that \(\mu _{AB,\mathbf{X}}\) is at the midpoint between \(\mu _{AA,\mathbf{X}}\) and \(\mu _{BB,\mathbf{X}}\), i.e., \(\mu _{AB,\mathbf{X}} = (\mu _{AA,\mathbf{X}} +\mu _{BB,\mathbf{X}})/2\).
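The TReC model can be checked numerically with a short sketch. The function names `w` and `nb_log_pmf` and the values of `gamma_x` (standing for \(\boldsymbol{\gamma}^{\mathrm{T}}\mathbf{X}\)) and `beta_t` are hypothetical; the formulas follow the negative binomial density and the definition of \(w(G,\beta _{T})\) given above.

```python
import math


def w(g, beta_t):
    """Genotype effect on log(mu) under the additive coding G = 0, 1, 2."""
    if g == 0:
        return 0.0
    if g == 1:
        return math.log(1.0 + math.exp(beta_t)) - math.log(2.0)
    if g == 2:
        return beta_t
    raise ValueError("G must be 0, 1 or 2")


def nb_log_pmf(t, mu, phi):
    """Log negative binomial probability with mean mu and dispersion phi."""
    inv_phi = 1.0 / phi
    log1p_phimu = math.log1p(phi * mu)
    return (
        math.lgamma(t + inv_phi) - math.lgamma(t + 1.0) - math.lgamma(inv_phi)
        - inv_phi * log1p_phimu
        + t * (math.log(phi * mu) - log1p_phimu)
    )


gamma_x = 3.0  # hypothetical value of gamma^T X
beta_t = 0.8   # hypothetical genetic effect
mu_aa, mu_ab, mu_bb = (math.exp(gamma_x + w(g, beta_t)) for g in (0, 1, 2))

# The heterozygote mean sits halfway between the two homozygote means.
print(mu_aa, mu_ab, mu_bb, (mu_aa + mu_bb) / 2.0)

# The probabilities sum to one and the mean equals mu (checked for mu = 20).
probs = [math.exp(nb_log_pmf(t, 20.0, 0.5)) for t in range(3000)]
print(sum(probs), sum(t * p for t, p in enumerate(probs)))
```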

We model the probability of \(N_{1}\) given \(N\), \(\tilde{H}\) and \(\mathbf{X}\) assuming that \(N_{1}\) follows a beta-binomial distribution indexed by parameters \((\beta _{A},\psi )\) and denote the model by \(P_{\mathtt{ASReC}}(N_{1}\vert N,\tilde{H},\mathbf{X};\beta _{A},\psi )\). A beta-binomial distribution extends a binomial distribution to allow over-dispersion. In particular, we assume that \(N_{1}\) follows a beta-binomial distribution with expected proportion \(p\) (so that the mean of \(N_{1}\) is \(Np\)) and a dispersion parameter \(\psi\):

$$\displaystyle\begin{array}{rcl} P_{\mathtt{ASReC}}(N_{1}\vert N,\tilde{H},\mathbf{X};\beta _{A},\psi ) ={ N\choose N_{1}}\frac{\prod _{k=0}^{N_{1}-1}(p + k\psi )\prod _{k=0}^{N-N_{1}-1}(1 - p + k\psi )} {\prod _{k=1}^{N-1}(1 + k\psi )},& & {}\\ \end{array}$$

where

$$\displaystyle\begin{array}{rcl} p = \left \{\begin{array}{l l} 0.5 &\quad \mbox{ if the candidate eQTL has a homozygous genotype AA or BB}, \\ q &\quad \mbox{ if}\tilde{H}\mbox{ indicates haplotype configuration B-$h_{1}$ and A-$h_{2}$, respectively}, \\ 1 - q&\quad \mbox{ if}\tilde{H}\mbox{ indicates haplotype configuration A-$h_{1}$ and B-$h_{2}$, respectively}.\\ \end{array} \right.& & {}\\ \end{array}$$

Thus \(q\) characterizes the proportion of ASReC corresponding to the B allele among the total ASReC corresponding to the heterozygous genotype AB. We further express \(q\) as \(e^{\beta _{A}}/(1 + e^{\beta _{A}})\). Note that the covariate effects are ignored here because they are expected to be the same on the two alleles of a gene within an individual. When the candidate eQTL cis-regulates the expression of the gene, we have \(\beta _{A} =\beta _{T}\). To see this, we first define \(\mu _{A}\) and \(\mu _{B}\) as the mean ASReC corresponding to the A and B alleles, respectively, at the baseline of \(\mathbf{X}\). Then, \(\beta _{A} =\log [q/(1 - q)] =\log (\mu _{B}/\mu _{A})\). On the other hand, \(\beta _{T} =\log (\mu _{BB,\mathbf{X}}/\mu _{AA,\mathbf{X}}) =\log \{ (2\mu _{B})/(2\mu _{A})\}\), where the second equality follows from the additive genetic effect and from canceling out the individual-specific covariate effects. By contrast, when the candidate eQTL trans-regulates the gene expression, we have \(\beta _{T}\neq 0\) but \(\beta _{A} = 0\).
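A small sketch of the ASReC model may help fix the notation. The function names (`beta_binomial_pmf`, `q_from_beta`, `p_for_h1`) and the numerical inputs are hypothetical; the probability follows the product form given above, which reduces to the ordinary binomial when \(\psi = 0\).

```python
import math


def beta_binomial_pmf(n1, n, p, psi):
    """Beta-binomial probability of n1 out of n with proportion p and dispersion psi."""
    numerator = 1.0
    for k in range(n1):
        numerator *= p + k * psi
    for k in range(n - n1):
        numerator *= 1.0 - p + k * psi
    denominator = 1.0
    for k in range(1, n):
        denominator *= 1.0 + k * psi
    return math.comb(n, n1) * numerator / denominator


def q_from_beta(beta_a):
    """q = exp(beta_A) / (1 + exp(beta_A))."""
    return 1.0 / (1.0 + math.exp(-beta_a))


def p_for_h1(genotype, b_on_h1, beta_a):
    """Expected proportion of ASReC from haplotype h1.

    genotype: 0, 1 or 2 copies of the B allele at the candidate eQTL.
    b_on_h1:  True if the extended haplotype places B on h1 (used only when
              the genotype is heterozygous).
    """
    if genotype != 1:
        return 0.5
    q = q_from_beta(beta_a)
    return q if b_on_h1 else 1.0 - q


p = p_for_h1(1, True, 0.8)  # hypothetical heterozygous individual
total = sum(beta_binomial_pmf(n1, 20, p, 0.1) for n1 in range(21))
print(p, total)
```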

The likelihood based on the TReC and ASReC data of n unrelated individuals takes the form

$$\displaystyle{ L(\boldsymbol{\varTheta }) =\prod _{ i=1}^{n}P_{\mathtt{ TReC}}(T_{i}\vert G_{i},\mathbf{X}_{i};\boldsymbol{\gamma },\beta _{T},\phi )P_{\mathtt{ASReC}}(N_{i1}\vert N_{i},\tilde{H}_{i},\mathbf{X}_{i};\beta _{A},\psi ), }$$
(8.1)

where \(\boldsymbol{\varTheta }= (\boldsymbol{\gamma },\beta _{T},\phi,\beta _{A},\psi )\). We refer to (8.1) as the TReCASE model, which is the novel model for cis-eQTL mapping using RNA-seq data. For trans-eQTL mapping, since ASE data are uninformative, the likelihood is only based on the TReC data: \(L(\boldsymbol{\gamma },\beta _{T},\phi ) =\prod _{ i=1}^{n}P_{\mathtt{TReC}}(T_{i}\vert G_{i},\mathbf{X}_{i};\boldsymbol{\gamma },\beta _{T},\phi )\). A hypothesis testing method has been developed to distinguish whether an eQTL is cis- or trans- by testing \(H_{0}: \beta _{T} =\beta _{A}\) [55].
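As a rough illustration of how (8.1) combines the two data types, the following sketch evaluates the joint log-likelihood on a made-up data set of four individuals. All names (`trecase_log_lik`, the dictionary keys, the parameter values) are hypothetical, and the sketch only evaluates the likelihood at fixed parameter values; it is not the estimation or testing procedure of [55].

```python
import math


def w(g, beta_t):
    """Additive genotype effect on log(mu) for G = 0, 1, 2."""
    return (0.0, math.log(1.0 + math.exp(beta_t)) - math.log(2.0), beta_t)[g]


def nb_log_pmf(t, mu, phi):
    """Log negative binomial probability with mean mu and dispersion phi."""
    inv_phi = 1.0 / phi
    log1p_phimu = math.log1p(phi * mu)
    return (math.lgamma(t + inv_phi) - math.lgamma(t + 1.0) - math.lgamma(inv_phi)
            - inv_phi * log1p_phimu + t * (math.log(phi * mu) - log1p_phimu))


def bb_log_pmf(n1, n, p, psi):
    """Log beta-binomial probability with proportion p and dispersion psi."""
    out = math.log(math.comb(n, n1))
    out += sum(math.log(p + k * psi) for k in range(n1))
    out += sum(math.log(1.0 - p + k * psi) for k in range(n - n1))
    out -= sum(math.log(1.0 + k * psi) for k in range(1, n))
    return out


def trecase_log_lik(data, gamma, beta_t, phi, beta_a, psi):
    """Sum over individuals of log P_TReC + log P_ASReC, as in (8.1)."""
    q = 1.0 / (1.0 + math.exp(-beta_a))
    total = 0.0
    for d in data:
        log_mu = sum(g * x for g, x in zip(gamma, d["x"])) + w(d["G"], beta_t)
        total += nb_log_pmf(d["T"], math.exp(log_mu), phi)
        if d["G"] != 1:
            p = 0.5                      # homozygous candidate eQTL
        else:
            p = q if d["b_on_h1"] else 1.0 - q
        total += bb_log_pmf(d["N1"], d["N"], p, psi)
    return total


# Made-up data: T = TReC, G = genotype, x = covariates (intercept, log depth),
# N1 and N = ASReC from h1 and total ASReC, b_on_h1 = phase of the B allele.
data = [
    {"T": 30, "G": 0, "x": [1.0, 3.0], "N1": 10, "N": 20, "b_on_h1": None},
    {"T": 52, "G": 1, "x": [1.0, 3.0], "N1": 30, "N": 40, "b_on_h1": True},
    {"T": 49, "G": 1, "x": [1.0, 3.0], "N1": 12, "N": 40, "b_on_h1": False},
    {"T": 70, "G": 2, "x": [1.0, 3.0], "N1": 14, "N": 30, "b_on_h1": None},
]
gamma = [0.5, 1.0]
cis_fit = trecase_log_lik(data, gamma, 0.8, 0.1, 0.8, 0.01)    # beta_A = beta_T
no_ase = trecase_log_lik(data, gamma, 0.8, 0.1, 0.0, 0.01)     # beta_A = 0
print(cis_fit, no_ase)
```

In this toy data set the heterozygous individuals show more reads from the haplotype carrying B, so the log-likelihood is higher when \(\beta _{A}\) is set equal to \(\beta _{T}\) than when it is set to zero.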

8.2.4 eQTL Mapping Using ASE with Unknown Haplotypes

When the haplotypes connecting the candidate eQTL and the gene of interest are unknown, we consider all possible haplotype pairs \((\tilde{h}_{k},\tilde{h}_{l})\) that are compatible with the known haplotypes in the gene body (H) and the genotype at the candidate eQTL (G). We denote these haplotype pairs as \((\tilde{h}_{k},\tilde{h}_{l}) \sim (G,H)\). Then the likelihood function is a weighted summation of the probabilities, each corresponding to a possible haplotype pair and given by (8.1), i.e.,

$$\displaystyle\begin{array}{rcl} L(\boldsymbol{\varTheta })& =& \prod _{i=1}^{n}P_{\mathtt{ TReC}}(T_{i}\vert G_{i},\mathbf{X}_{i};\boldsymbol{\gamma },\beta _{T},\phi ) \\ & & \times \sum _{(\tilde{h}_{k},\tilde{h}_{l})\sim (G_{i},H_{i})}P_{\mathtt{ASReC}}(N_{i1}\vert N_{i},\tilde{h}_{k},\tilde{h}_{l},\mathbf{X}_{i};\beta _{A},\psi )P(\tilde{h}_{k},\tilde{h}_{l};\boldsymbol{\pi })f_{kl}(\mathbf{X}_{i}),{}\end{array}$$
(8.2)

where \(\boldsymbol{\varTheta }= (\boldsymbol{\gamma },\beta _{T},\phi,\beta _{A},\psi,\boldsymbol{\pi },\{f_{kl}(.)\}_{k,l})\). We explain the terms that are not in (8.1) as follows.

Suppose there are K possible haplotypes across the exonic SNPs and the candidate eQTL. Write the frequency of the kth haplotype by \(\pi _{k} = \mathrm{Pr}(\tilde{h} =\tilde{ h}_{k})\) and \(\boldsymbol{\pi }= (\pi _{1},\ldots,\pi _{K})\). We denote the model for the probability of \(\tilde{H} = (\tilde{h}_{k},\tilde{h}_{l})\) indexed by \(\boldsymbol{\pi }\) by \(P(\tilde{h}_{k},\tilde{h}_{l};\boldsymbol{\pi })\). Under the assumption of Hardy-Weinberg equilibrium, \(P(\tilde{h}_{k},\tilde{h}_{l};\boldsymbol{\pi }) =\pi _{k}\pi _{l}\).

The density function of \(\mathbf{X}\) given \(\tilde{H} = (\tilde{h}_{k},\tilde{h}_{l})\) is denoted by \(f_{kl}(\mathbf{X})\). Under the assumption of gene-environment independence, \(f_{kl}(\mathbf{X})\) reduces to the marginal density function of \(\mathbf{X}\) and will drop out from (8.2). In some applications, \(\tilde{H}\) and \(\mathbf{X}\) are correlated. One important example is when \(\mathbf{X}\) represents the principal components for ancestry. Another example is when the gene influences both the environmental exposure (e.g., cigarette smoking) and the disease occurrence (e.g., lung cancer) [3]. In such cases, \(f_{kl}(\mathbf{X})\) can be specified using a generalized odds-ratio function [28].
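For a single individual with a heterozygous candidate eQTL, the sum in (8.2) runs over the two possible phase configurations. The sketch below evaluates that ASReC factor under Hardy-Weinberg equilibrium and gene-environment independence (so that \(f_{kl}\) drops out). The function names, the haplotype labels such as `"B-h1"`, and the frequencies in `pi` are hypothetical.

```python
import math


def beta_binomial_pmf(n1, n, p, psi):
    """Beta-binomial probability of n1 out of n with proportion p and dispersion psi."""
    numerator = 1.0
    for k in range(n1):
        numerator *= p + k * psi
    for k in range(n - n1):
        numerator *= 1.0 - p + k * psi
    denominator = 1.0
    for k in range(1, n):
        denominator *= 1.0 + k * psi
    return math.comb(n, n1) * numerator / denominator


def asrec_factor_het(n1, n, beta_a, psi, pi):
    """Weighted sum over the two extended haplotype pairs compatible with a
    heterozygous eQTL genotype and the known gene haplotypes (h1, h2).

    pi maps each extended haplotype (eQTL allele plus gene haplotype) to its
    frequency; pair probabilities are products of frequencies (HWE).
    """
    q = 1.0 / (1.0 + math.exp(-beta_a))
    weight_b_on_h1 = pi["B-h1"] * pi["A-h2"]   # pair (B-h1, A-h2), p = q
    weight_a_on_h1 = pi["A-h1"] * pi["B-h2"]   # pair (A-h1, B-h2), p = 1 - q
    return (weight_b_on_h1 * beta_binomial_pmf(n1, n, q, psi)
            + weight_a_on_h1 * beta_binomial_pmf(n1, n, 1.0 - q, psi))


pi = {"A-h1": 0.1, "B-h1": 0.4, "A-h2": 0.3, "B-h2": 0.2}  # made-up frequencies
value = asrec_factor_het(30, 40, 0.8, 0.01, pi)
print(value)
```

When one of the two configurations has a much larger weight, the mixture is dominated by that phase, which is how linkage disequilibrium between the candidate eQTL and the exonic SNPs carries phase information into the likelihood.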

8.3 Isoform-Specific eQTL Mapping

More than 90 % of human multi-exon genes can be alternatively spliced, resulting in RNA isoforms [44, 64]. Alternative splicing may directly cause a disease or modify certain disease susceptibility [19, 61, 63]. Although several methods have been proposed for detecting the event of alternative splicing and estimating the RNA-isoform abundance [2, 4, 21, 23, 31, 34, 38, 39, 50, 59, 65], only a few have been developed for testing the differential RNA-isoform usage between two groups of samples (e.g., cases vs. controls) [22, 54, 59]. Differential isoform usage refers to the changes of RNA-isoform expression relative to the total expression of the corresponding gene. The purpose of isoform-specific eQTL mapping is to dissect the genetic basis of the differential isoform usage. There are a few points worth mentioning from the statistical perspective on isoform-specific eQTL mapping.

  • Because the isoform structure or abundance cannot be directly measured, transcriptome reconstruction and abundance estimation are necessary steps of isoform-specific eQTL mapping. The uncertainty of the transcriptome reconstruction and the abundance estimation should be incorporated into isoform-specific eQTL mapping.

  • In most eQTL studies or genome-wide association studies, SNP genotype effects are assumed to be additive. Thus the SNP genotype is essentially a quantitative covariate. However, most existing methods assess the differential isoform usage between two groups of samples (e.g., cases vs. controls) and few methods can test the association between the isoform usage and a quantitative covariate.

  • One gene may be differentially expressed with respect to a covariate, both in terms of the total expression and the isoform usage. It will be useful to jointly test for differential expression and differential isoform usage.

8.3.1 Transcriptome Reconstruction and Isoform Abundance Estimation

A gene usually occupies a consecutive segment of the DNA sequence and it is often composed of several exons that are separated by introns. A subset of the exons may be employed by the cell to construct alternatively spliced messenger RNAs (mRNAs). These mRNAs may be translated to different proteins. Each RNA isoform is often referred to as a transcript and thus each gene can be considered as a transcript cluster. In some organisms, such as human and mouse, there are existing annotations on the kinds of transcripts a gene may encode. Such annotations are often incomplete or inaccurate because, for example, some transcripts may be expressed only in a particular tissue and/or developmental stage. In some other organisms, such as those without complete reference genomes, such transcriptome annotations are not available at all. Therefore, one may need to reconstruct the transcriptome from the observed RNA-seq data. This task can be achieved with or without a reference genome [18]. The reference genome-guided reconstruction is often more accurate and computationally more efficient than the de novo transcriptome construction without a reference genome. Thus the former approach is more popular for organisms that have reference genomes. Given the transcriptome annotation, the abundance of each transcript can be estimated by the number of RNA-seq reads aligned to that transcript. However, most RNA-seq fragments cannot be uniquely assigned to a specific transcript. Estimating transcript abundance in the presence of such alignment ambiguity is the focus of many existing works [31, 32, 37, 43, 48, 49, 53, 59, 72]. Penalized regression methods have been developed to simultaneously reconstruct the transcriptome and estimate transcript/isoform abundance [6, 38, 39, 71]. The method we will describe next is an example of such penalized regression methods.

Fig. 8.6

All possible isoforms of a gene with three exons and the corresponding design matrix \(\mathbf{X}^{\mathsf{T}}\)

8.3.2 Isoform-Specific eQTL Mapping

The method presented here is based on Sun et al. (2013) [58]. We first illustrate the statistical model with a hypothetical gene with three exons (Fig. 8.6). An RNA-seq read may overlap with one or more exons, so we count the number of RNA-seq reads per exon set. For this simple gene, there are seven possible exon sets, denoted by {1}, {2}, {3}, {1, 2}, {2, 3}, {1, 3}, and {1, 2, 3}. Note that each RNA-seq read is counted only once. For example, if an RNA-seq read overlaps with both exons 1 and 2, it is counted for exon set {1, 2} rather than for exon set {1} or {2}. There are seven possible isoforms (right panel of Fig. 8.6). We code each isoform as a covariate, which corresponds to one row of the design matrix \(\mathbf{X}^{\mathsf{T}}\) (left panel), where \(^{\mathsf{T}}\) denotes matrix transpose. The seven columns of matrix \(\mathbf{X}^{\mathsf{T}}\) correspond to exon sets {1}, {2}, {3}, {1, 2}, {2, 3}, {1, 3}, and {1, 2, 3}. Each element in \(\mathbf{X}^{\mathsf{T}}\) is the effective length of the column-specific exon set within the row-specific isoform. Intuitively, the effective length of an exon set A, denoted by \(\eta _{A}\), is the number of unique locations within A from which a randomly selected sequence fragment can be sampled. We defer the details of the effective length calculation to Sect. 8.3.3, but would like to point out that there are special exon sets that consist of exons that are non-contiguous in the specific isoform. For example, the exons in set {1, 3} are non-contiguous with respect to isoform 1-2-3, and the effective length of {1, 3} within this isoform is denoted by \(\eta _{\{1,(2),3\}}\). Our effective length calculation accurately reflects the fact that sequence reads of exon set {1, 3} are more likely to come from isoform 1-3 than from isoform 1-2-3.

In this example, the gene expression in the ith sample is denoted by a vector: \(\mathbf{y}_{i} = (y_{i\{1\}},y_{i\{2\}},y_{i\{3\}},y_{i\{1,2\}},y_{i\{2,3\}},y_{i\{1,3\}},y_{i\{1,2,3\}})^{\mathsf{T}}\), where \(y_{iA}\) indicates the TReC at the exon set A. As in Sect. 8.2.3, we model the probability of a TReC via a negative binomial distribution. Let \(f_{NB}(\mu,\phi )\) be a negative binomial distribution with mean \(\mu\) and a dispersion parameter \(\phi\). We assume that \(y_{iA} \sim f_{NB}(\mu _{iA},\phi )\). Assuming independence of the \(y_{iA}\)'s given the underlying RNA isoforms, \(\mathbf{y}_{i} \sim f_{NB}(\boldsymbol{\mu }_{i},\phi ) \equiv \prod _{A}f_{NB}(\mu _{iA},\phi )\), where \(\boldsymbol{\mu }_{i} = (\mu _{i\{1\}},\mu _{i\{2\}},\ldots,\mu _{i\{1,2,3\}})^{\mathsf{T}}\). By the definition of the design matrix \(\mathbf{X}\), we transform the problem of isoform deconvolution to a regression problem: \(\mathbf{y}_{i} \sim f_{NB}(\boldsymbol{\mu }_{i},\phi )\), \(\boldsymbol{\mu }_{i} = T_{i}\mathbf{X}\boldsymbol{\gamma } = T_{i}\sum _{u=1}^{7}\mathbf{x}_{u}b_{u}\), where \(T_{i}\) is the TReC of this gene in sample i, \(\mathbf{X} = (\mathbf{x}_{1},\ldots,\mathbf{x}_{7})\), \(\boldsymbol{\gamma }= (b_{1},\ldots,b_{7})^{\mathsf{T}}\), and \(b_{u} \geq 0\) is the expression rate of the uth isoform. Note that \(b_{u}\) quantifies the relative expression abundance with respect to the total expression \(T_{i}\).

Next, we present the general method. Suppose that we study the isoform-specific expression of a gene with m exon sets and p possible isoforms across n individuals, and we are particularly interested in whether a covariate G influences the isoform-specific expression of this gene. We assess this hypothesis by a likelihood ratio test. Under the null hypothesis, we solve the problems of isoform selection and abundance estimation by assuming that the isoform usage is the same for all samples. Thus we use a negative binomial regression with the link function \(\boldsymbol{\mu }_{i} = T_{i}\mathbf{X}\boldsymbol{\gamma }\). Note that a linear link function, instead of the commonly used log link function, is used to reflect the fact that the total number of reads is the sum of the numbers of reads from all the isoforms. Under the alternative, we model the effect of G as follows. Let \(g_{i}\) be the value of G in the ith sample. Without loss of generality, we restrict the range of \(g_{i}\) to be [0,1]. For example, if G is the genotype of a SNP, we set \(g_{i} = 0\), 1/2, and 1 for genotypes AA, AB, and BB, respectively. Provided \(\boldsymbol{\mu }_{i} = T_{i}\mathbf{X}\boldsymbol{\gamma }\), we model the influence of G on \(b_{u}\) (1 ≤ u ≤ p) by a linear model: \(b_{u} =\gamma _{u}(1 - g_{i}) +\gamma _{u+p}g_{i}\), where \(\gamma _{j} \geq 0\) for 1 ≤ j ≤ 2p. Therefore, we have two negative binomial regression problems, with p and 2p covariates, under the null and the alternative, respectively.
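The mean structure under the null and the alternative can be illustrated with a minimal Python sketch. The function names `mean_null` and `mean_alt`, the toy design matrix, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the original method's software.

```python
def mean_null(T_i, X, b):
    """Mean vector mu_i = T_i * X b for one sample.

    X is an m x p design matrix (list of m rows, one per exon set),
    b is a length-p vector of non-negative isoform expression rates.
    """
    return [T_i * sum(x * bu for x, bu in zip(row, b)) for row in X]


def mean_alt(T_i, X, gamma, g_i):
    """Mean vector under the alternative, with 2p coefficients in gamma.

    The isoform rate for sample i is b_u = gamma_u (1 - g_i) + gamma_{u+p} g_i,
    so gamma[:p] are the rates at g_i = 0 and gamma[p:] the rates at g_i = 1.
    """
    p = len(X[0])
    b = [gamma[u] * (1.0 - g_i) + gamma[u + p] * g_i for u in range(p)]
    return mean_null(T_i, X, b)


# Toy example (hypothetical values): m = 3 exon sets, p = 2 isoforms.
X = [[100.0, 100.0],
     [0.0, 50.0],
     [20.0, 0.0]]
gamma = [0.001, 0.002, 0.003, 0.0]  # isoform 2 is switched off when g_i = 1
for g in (0.0, 0.5, 1.0):           # genotypes AA, AB, BB
    print(g, mean_alt(10.0, X, gamma, g))
```

When the two halves of `gamma` are equal, `mean_alt` reduces to `mean_null` for every value of `g_i`, which is the sense in which the null model is nested in the alternative.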

The major difficulty of this problem comes from the high dimensionality of the possible isoforms [25]. We address this difficulty in two sequential steps. First we identify the candidate isoforms for a gene using a modified connectivity graph approach [23, 38]. Next we select among the candidate isoforms using a penalized negative binomial regression. For example, under the alternative, the objective function becomes \(f(\boldsymbol{\gamma },\phi ) =\sum _{ i=1}^{n}\log \left [f_{NB}(\boldsymbol{\mu }_{i},\phi )\right ] -\sum _{j=1}^{2p}\lambda \log (\gamma _{j}+\tau )\), where \(\lambda\) and \(\tau\) are two tuning parameters that can be selected by BIC or extended BIC [57]. We use the log penalty \(\lambda \log (\gamma _{j}+\tau )\) because of its theoretical and empirical advantages over other penalties [9, 15, 57]. Given \(\lambda\) and \(\tau\), the parameters \(\boldsymbol{\gamma }\) and \(\phi\) can be estimated by a coordinate descent algorithm [57]. The above model is formulated for the case in which the isoform usage is associated with one quantitative covariate; it is straightforward to extend it to include multiple quantitative covariates. A categorical covariate (e.g., under the dominant or recessive effect of a SNP) can simply be coded as a number of dummy variables, which can be treated as multiple quantitative covariates.
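To make the objective function concrete, the following Python sketch evaluates \(f(\boldsymbol{\gamma },\phi )\) for given parameter values. It only evaluates the objective and does not implement the coordinate descent algorithm of [57]. Two assumptions are made here and are not stated in the text: the negative binomial is parameterized so that the variance is \(\mu +\phi \mu ^{2}\), and the mean is floored at a small positive constant so that the logarithm is defined. The function names are hypothetical.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log pmf of a negative binomial with mean mu and dispersion phi.

    Assumed parameterization: variance = mu + phi * mu**2 (size r = 1/phi).
    """
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def penalized_objective(Y, T, G, X, gamma, phi, lam, tau):
    """Penalized log-likelihood under the alternative.

    Y: n lists of m exon-set counts; T: n gene-level totals;
    G: n covariate values in [0, 1]; X: m x p design matrix;
    gamma: 2p non-negative coefficients; lam, tau: tuning parameters.
    """
    p = len(X[0])
    loglik = 0.0
    for y_i, T_i, g_i in zip(Y, T, G):
        b = [gamma[u] * (1.0 - g_i) + gamma[u + p] * g_i for u in range(p)]
        for row, y in zip(X, y_i):
            mu = T_i * sum(x * bu for x, bu in zip(row, b))
            mu = max(mu, 1e-12)  # floor added in this sketch to keep log() defined
            loglik += nb_logpmf(y, mu, phi)
    penalty = sum(lam * math.log(g_j + tau) for g_j in gamma)
    return loglik - penalty
```

Setting `lam = 0` recovers the unpenalized log-likelihood, so the same function can be used to compute the likelihood ratio statistic once the two penalized fits are available.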

Due to the variable selection (i.e., selecting expressed RNA isoforms) under both the null and the alternative hypotheses, the asymptotic distribution of the likelihood ratio statistic is unknown. Thus we estimate the null distribution of the statistic by parametric bootstrap. Specifically, we generate the vth bootstrap sample, denoted by \(\tilde{\mathbf{y}}^{(v)}\) (a vector of length nm), by sampling from a negative binomial distribution with mean \(\hat{\boldsymbol{\mu }}_{0}\) and a dispersion parameter \(\hat{\phi }_{0}\), where \(\hat{\boldsymbol{\mu }}_{0}\) (a vector of length nm) and \(\hat{\phi }_{0}\) are estimated under the null. Then, using this bootstrap sample, we apply the penalized regression approach under the null and the alternative to obtain a likelihood ratio statistic \(LR_{v}\). By repeating the parametric bootstrap a large number of times (e.g., 10,000 times) and pooling the \(LR_{v}\)'s, we obtain the null distribution for the observed statistic LR. The final p-value is the proportion of \(LR_{v}\)'s that are equal to or larger than the likelihood ratio statistic from the original data.
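The bootstrap loop can be sketched in Python as follows. This is only a skeleton under stated assumptions: `simulate_null` and `lr_statistic` are hypothetical user-supplied callables standing in for, respectively, sampling counts from the fitted null negative binomial model and refitting the penalized regressions under the null and the alternative.

```python
import random


def bootstrap_pvalue(lr_obs, simulate_null, lr_statistic, n_boot=10000, seed=0):
    """Parametric bootstrap p-value for an observed likelihood ratio statistic.

    simulate_null(rng) -> one bootstrap data set drawn from the fitted null model.
    lr_statistic(data) -> likelihood ratio statistic computed on that data set.
    Returns the proportion of bootstrap statistics >= lr_obs.
    """
    rng = random.Random(seed)
    lr_boot = [lr_statistic(simulate_null(rng)) for _ in range(n_boot)]
    n_ge = sum(1 for lr in lr_boot if lr >= lr_obs)
    return n_ge / n_boot
```

With this definition the smallest attainable nonzero p-value is `1 / n_boot`, which is one reason a large number of bootstrap replicates is needed when many genes are tested.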

The above solution only tests differential isoform usage, which is the difference in the relative abundance of an isoform with respect to the total expression of the gene for different values of G. If we are interested in testing both the differential expression and the differential isoform usage of a gene, the original link function \(\boldsymbol{\mu }_{i} = T_{i}\mathbf{X}\boldsymbol{\gamma }\) can be changed to \(\boldsymbol{\mu }_{i} = R_{i}\mathbf{X}\boldsymbol{\gamma }\), where \(R_{i}\) is the total number of RNA-seq reads of the ith sample across all genes. The reason is as follows. The original link function can be written as \(\boldsymbol{\mu }_{i} = T_{i}\mathbf{X}\boldsymbol{\gamma } = R_{i}(T_{i}/R_{i})\mathbf{X}\boldsymbol{\gamma }\), where \(T_{i}/R_{i}\) measures the total expression of the gene in the ith sample. Dropping the ratio \(T_{i}/R_{i}\) from the original link function leads to the new link function, which is equivalent to assuming that this gene has a constant expression rate across samples.

8.3.3 Calculation of Effective Length

An RNA-seq fragment is a segment of RNA to be sequenced. Usually only part of an RNA-seq fragment is sequenced: one end or both ends, hence single-end sequencing or paired-end sequencing. All the discussions in this section are for paired-end reads, though the extension to single-end reads is straightforward. The minimum fragment size is the read length, denoted by d. This happens when the two reads of a fragment completely overlap. We impose an upper bound for the fragment length based on prior knowledge of the experimental procedure and denote the upper bound by \(l_{M}\). Then the fragment length l satisfies \(d \leq l \leq l_{M}\). We denote the distribution of the fragment length for sample i by \(\varphi _{i}(l)\), which can be calculated using observed read alignment information. The fragment length distribution is incorporated in our model to allow across-sample variations due to the differences in fragment length distribution.

For the ith sample, the effective length of exon j of \(r_{j}\) base pairs (bps) is

$$\displaystyle\begin{array}{rcl} \eta _{i,\{j\}} = f(r_{j},d,l_{M},\varphi _{i}) = \left \{\begin{array}{@{}l@{\quad }l@{}} 0 \quad &\quad \text{if }r_{j} < d \\ \sum _{l=d}^{\min (r_{j},l_{M})}\varphi _{ i}(l)(r_{j} + 1 - l)\quad &\quad \mbox{ if $r_{j} \geq d$} \end{array} \right..& & {}\\ \end{array}$$

If \(r_{j} < d\), the exon is shorter than the shortest fragment length, and thus the effective length of this exon is 0. In other words, no RNA-seq fragment is expected to overlap with this exon and only this exon. If \(r_{j} \geq d\), then for a given fragment length l the effective length is \(r_{j} + 1 - l\), i.e., there are \(r_{j} + 1 - l\) distinct RNA-seq fragments of length l that can be sequenced from this exon (Fig. 8.7). Then \(\sum _{l=d}^{\min (r_{j},l_{M})}\varphi _{ i}(l)(r_{j} + 1 - l)\) is the sum across all possible fragment lengths, weighted by the probability of having fragment length l.
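The single-exon formula translates directly into a few lines of Python. In this sketch the fragment length distribution \(\varphi _{i}\) is represented as a dictionary mapping fragment length to probability, which is a convenience chosen here; the function name `eff_len` and the numeric values are likewise hypothetical.

```python
def eff_len(r, d, l_max, phi):
    """Effective length f(r, d, l_M, phi) of a region of r bps.

    d: read length (minimum fragment length); l_max: upper bound l_M;
    phi: dict mapping fragment length l -> probability phi(l).
    """
    if r < d:
        return 0.0
    return sum(phi.get(l, 0.0) * (r + 1 - l)
               for l in range(d, min(r, l_max) + 1))


# All fragments are 200 bps long (toy distribution).
phi = {200: 1.0}
print(eff_len(1000, 50, 300, phi))  # 801.0: 1000 + 1 - 200 start positions
print(eff_len(100, 50, 300, phi))   # 0.0: no 200-bp fragment fits in 100 bps
print(eff_len(40, 50, 300, phi))    # 0.0: exon shorter than the read length
```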

Fig. 8.7

An illustration of effective length calculation for an exon of \(r_{j}\) bps and an RNA-seq fragment of l bps. The orange box indicates the exon, and the black lines above the orange box indicate two RNA-seq fragments, each of which is sequenced by a paired-end read. There are \(r_{j} + 1 - l\) distinct choices to select an RNA-seq fragment of l bps from this exon, and thus the effective length is \(r_{j} + 1 - l\)

In the following discussions, to simplify the notation, we drop the subscript i. For two exons j and k (j < k) of lengths \(r_{j}\) and \(r_{k}\), which are adjacent in the transcript, the effective length for the fragments that cover both exons is

$$\displaystyle\begin{array}{rcl} \eta _{\{j,k\}} = f(r_{j} + r_{k},d,l_{M},\varphi ) -\eta _{\{j\}} -\eta _{\{k\}}.& &{}\end{array}$$
(8.3)

For three exons j, h, and k (j < h < k) of lengths \(r_{j}\), \(r_{h}\), and \(r_{k}\), which are adjacent in the transcript, the effective length for the fragments that cover all three exons is

$$\displaystyle\begin{array}{rcl} \eta _{\{j,h,k\}}& =& f(r_{j} + r_{h} + r_{k},d,l_{M},\varphi ) -\eta _{\{j,h\}} -\eta _{\{h,k\}} -\eta _{\{j,(h),k\}} -\eta _{\{j\}} -\eta _{\{h\}} -\eta _{\{k\}}, {}\\ \end{array}$$

where \(\eta _{\{j,(h),k\}}\) is the effective length in the scenario that the transcript covers consecutive exons j, h, and k, whereas the observed paired-end read only covers exons j and k:

$$\displaystyle\begin{array}{rcl} \eta _{\{j,(h),k\}} = \left \{\begin{array}{@{}l@{\quad }l@{}} 0 \quad &\quad \text{if }(r_{j},r_{h},r_{k}) \in R_{1} \\ \sum _{l=2d+r_{h}}^{\min (r_{j}+r_{h}+r_{k},l_{M})}\varphi (l)\delta _{ l}\quad &\quad \text{otherwise} \end{array} \right.& & {}\\ \end{array}$$

where \(R_{1} =\{ (r_{j},r_{h},r_{k}):\ r_{j} < d\text{ or }r_{k} < d\text{ or }r_{h} + 2d > l_{M}\}\), and \(\delta _{l} =\min (r_{j},l - r_{h} - d) -\max (d,l - r_{h} - r_{k}) + 1\). The above formula is derived by the following arguments. Let \(l_{j}\) and \(l_{k}\) be the lengths of the parts of the fragment that overlap with exons j and k, respectively. Given l, the constraints on \(l_{j}\) and \(l_{k}\) are \(l = l_{j} + l_{k} + r_{h}\), \(d \leq l_{j} \leq r_{j}\), and \(d \leq l_{k} \leq r_{k}\), and thus the range of \(l_{j}\) is \(\max (d,l - r_{h} - r_{k}) \leq l_{j} \leq \min (r_{j},l - r_{h} - d)\). For more than three consecutive exons, the effective lengths can be calculated using recursive calls to the above equations.
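The two-exon and three-exon formulas can be checked numerically with a short Python sketch. As before, the fragment length distribution is represented as a dictionary from length to probability, and all function names and numeric values are hypothetical choices made for illustration.

```python
def eff_len(r, d, l_max, phi):
    """f(r, d, l_M, phi): effective length of a contiguous region of r bps."""
    if r < d:
        return 0.0
    return sum(phi.get(l, 0.0) * (r + 1 - l)
               for l in range(d, min(r, l_max) + 1))


def eff_len_pair(rj, rk, d, l_max, phi):
    """eta_{j,k} for two exons adjacent in the transcript (Eq. 8.3)."""
    return (eff_len(rj + rk, d, l_max, phi)
            - eff_len(rj, d, l_max, phi) - eff_len(rk, d, l_max, phi))


def eff_len_skip(rj, rh, rk, d, l_max, phi):
    """eta_{j,(h),k}: fragment spans j-h-k but the reads cover only j and k."""
    if rj < d or rk < d or rh + 2 * d > l_max:
        return 0.0
    total = 0.0
    for l in range(2 * d + rh, min(rj + rh + rk, l_max) + 1):
        delta = min(rj, l - rh - d) - max(d, l - rh - rk) + 1
        total += phi.get(l, 0.0) * delta
    return total


def eff_len_triple(rj, rh, rk, d, l_max, phi):
    """eta_{j,h,k} for three exons adjacent in the transcript."""
    return (eff_len(rj + rh + rk, d, l_max, phi)
            - eff_len_pair(rj, rh, d, l_max, phi)
            - eff_len_pair(rh, rk, d, l_max, phi)
            - eff_len_skip(rj, rh, rk, d, l_max, phi)
            - eff_len(rj, d, l_max, phi)
            - eff_len(rh, d, l_max, phi)
            - eff_len(rk, d, l_max, phi))


# Toy gene: exons of 300, 60, and 300 bps; reads of 50 bps; 200-bp fragments.
d, l_max, phi = 50, 300, {200: 1.0}
print(eff_len_skip(300, 60, 300, d, l_max, phi))    # 41.0
print(eff_len_triple(300, 60, 300, d, l_max, phi))  # 98.0
print(eff_len_pair(300, 300, d, l_max, phi))        # 199.0
```

In this toy setting the exon set {1, 3} has effective length 41 within isoform 1-2-3 but 199 within isoform 1-3 (where the two exons are adjacent), which illustrates why reads of exon set {1, 3} are more likely to originate from isoform 1-3.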

In practice, a few sequence fragments may be observed even when the effective length is zero, which may be due to sequencing errors. To improve the robustness of our method, we modify the design matrix \(\mathbf{X}\) by adding a pre-determined constant eLenMin to each element of \(\mathbf{X}\).

8.4 Discussion

We conclude this chapter with a few discussion points.

8.4.1 eQTL Mapping Using Both ASE and ISE

We have introduced statistical methods for using ASE or ISE for eQTL mapping. A natural extension is to use both ASE and ISE for eQTL mapping. The likelihood can be similar to the one for eQTL mapping using ASE, but using count data from exon sets instead of genes. Such a model can explain more subtle changes in the gene expression data, for example, allele-specific isoform usage, in which one isoform is used in one allele but not in the other. A major challenge would be computational feasibility. Thus a more computationally efficient implementation is needed for such an effort.

8.4.2 cis-eQTL and Imprinting

Allelic imbalance of gene expression may be due to factors other than cis-eQTL. Arguably, the second most likely factor causing allelic imbalance, after cis-eQTL, is imprinting. Imprinted genes are differentially expressed on maternal and paternal alleles. Thus imprinting is also referred to as the parent-of-origin effect [47]. An important lesson we learned from our recent study of ASE in F1 mice [11] is that “imprinting is incomplete for most genes and cis-acting mutations can modify the strength of imprinting”. Usually the imprinting effect is much more subtle than cis-eQTL effects. Therefore, to obtain more sensitive and more accurate estimates of imprinting effects, it is crucial to jointly study imprinting and cis-eQTL.

8.4.3 Quality Control and Possible Non-genetic Factors

Quality control (QC) is a necessary step for eQTL mapping using RNA-seq data. Low quality samples may be detected by checking the sequencing quality scores, mapping quality, percentage of uniquely mapped reads, percentage of reads mapped to exonic regions, percentage of rRNA reads, and the distribution of insert size for paired-end reads [1, 13, 66]. Sample identity check is a very important step in genome-wide genomic studies. Between-sample contamination may be detected by the percentage of heterozygous SNPs, sex mismatch (recorded sex from demographic information vs. sex inferred from genomic data), or the D-statistic, which measures the median correlation of gene expression between one sample and each of the other samples [1, 69]. Sample swaps will seriously reduce the power of eQTL analysis. Fortunately, checking for sample swaps is easier using RNA-seq data than using microarray data [29]. A QC step that is crucial for ASE data is checking the mapping bias toward reference alleles, which has been discussed in Sect. 8.2. For ISE data, checking the coverage of the whole gene body is important because there may be a trend of increasing read depth towards the 3’ end of a gene. The method described in Sect. 8.3 assumes a uniform distribution of read depth, though the hypothesis testing method is not sensitive to this assumption due to the resampling nature of the test [58].
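As an illustration of the D-statistic described above, the following Python sketch computes, for each sample, the median Pearson correlation of its expression profile with every other sample. The choice of Pearson correlation, the data layout (a dictionary from sample name to a list of per-gene expression values), and the function names are assumptions made here; the cited references may define the details differently.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def d_statistics(expr):
    """Median correlation of each sample with each of the other samples.

    expr: dict mapping sample name -> list of expression values (same genes,
    same order, in every sample). Requires at least two samples.
    """
    names = list(expr)
    return {s: statistics.median(pearson(expr[s], expr[t])
                                 for t in names if t != s)
            for s in names}


# Toy data (hypothetical): three concordant samples and one outlier.
expr = {
    "s1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "s2": [1.1, 2.0, 3.2, 3.9, 5.1],
    "s3": [0.9, 2.2, 2.9, 4.1, 4.8],
    "outlier": [3.0, 1.0, 4.0, 1.0, 5.0],
}
for name, d_stat in d_statistics(expr).items():
    print(name, round(d_stat, 3))
```

Samples with an unusually low D-statistic relative to the rest are candidates for closer inspection; any threshold would need to be chosen for the data set at hand.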

The effect of non-genetic factors can be accounted for by including them (or an appropriate transformation of them) as covariates in eQTL mapping. First, the overall read depth per sample is one factor that should always be included. In addition, GC content and dinucleotide frequencies may influence gene expression in a sample-specific manner. For example, gene expression and GC content may be positively correlated in some samples, but negatively correlated in other samples [74]. A conditional quantile normalization method has been proposed to model such sample-specific effects from sequence contents within the framework of generalized linear regression models [24]. This approach can be employed in the eQTL-mapping framework described in this book chapter.

8.4.4 The Genetic Architecture of Gene Expression

Figure 8.8 shows the results of two genome-wide eQTL studies: a yeast study of ∼ 6,000 genes and ∼ 1,000 SNPs in 112 yeast segregants (offspring) (Fig. 8.8a) and a human study of ∼ 18,000 genes and ∼ 1,000,000 SNPs (germline genotype) in 550 breast cancer patients (Fig. 8.8b). Gene expression abundance was measured by microarrays in the yeast study and by RNA-seq in the human study. The difference in the genetic architecture of gene expression between the two studies is remarkable. In both studies, the eQTL plots have a diagonal pattern, which corresponds to a large number of local eQTLs. In the yeast study, there are several vertical bands, each corresponding to an eQTL hotspot, i.e., a genetic locus that is an eQTL of many genes. In contrast, there is no such eQTL hotspot in the human study. The two studies are representative of experimental crosses and human studies, respectively. In an experimental cross, usually two strains with very different genetic backgrounds are crossed and thus some loci may have large and broad effects on many genes. For example, in the yeast study, several eQTL hotspots arise because one strain has several genes deleted. In human studies, the genetic differences across individuals are much smaller than in experimental crosses and generally no single locus can substantially alter the expression of many genes. We have reported similar findings in a recent human eQTL study with 2,494 twins and a validation data set of 1,895 independent subjects [69]. The conclusion is that, for human studies, the vast majority of genetic effects on gene expression are through local eQTLs, and most of the local eQTLs are likely to be cis-eQTLs [55]. This implies that the identification of distant eQTLs may be as difficult as, or even more difficult than, genome-wide association studies for complex traits.

Fig. 8.8

The results of eQTL studies in (a) 112 yeast segregants of two yeast strains [7] and (b) 550 breast cancer patients of an on-going study. Each point represents a genome-wide significant association. The color indicates a certain range of the p-value. More liberal p-value thresholds are used for the yeast study because there are fewer genes and SNPs and hence less burden of multiple testing correction