1 Introduction

With the completion of the human reference genome [36] and the pilot study of the 1000 Genomes Project [17], an unprecedented wealth of knowledge has been accumulated for human DNA sequence variations. In contrast, much less of this DNA-level knowledge has been translated to the understanding of human diseases. Gene expression quantitative trait locus (eQTL) mapping, which aims to dissect the genetic basis of gene expression, is one of the most promising approaches to fill this gap [11]. Many early genome-wide eQTL studies were conducted on experimental populations [7, 9, 43, 57, 67, 89]. Recently, more eQTL studies have been reported on human populations [72, 74, 75] and some of them used both DNA and RNA information to study phenotypic outcomes, such as complex diseases [18, 31, 66, 103].

RNA-seq is replacing gene expression microarrays to be the major technique for genome-wide assessment of transcript abundance. Compared with microarrays, RNA-seq provides more accurate estimates of transcript abundance for either known or unknown transcripts in a larger dynamic range, while requiring less RNA materials [90]. The central computational problems in RNA-seq include read mapping, transcriptome reconstruction (or RNA-isoform selection given exon annotations), transcript abundance estimation, and differential expression analysis. Since a number of RNA-seq protocols were developed at 2008 [10, 47, 51, 84], numerous technical improvements or computational/statistical methods have been developed for RNA-seq. We refer interested readers to Ozsolak and Milos (2010) [52] and Garber et al. (2011) [21] for recent reviews of experimental and computational methods for RNA-seq, respectively. In this review paper, we focus on the statistical/computational methods of eQTL mapping using RNA-seq.

A few pioneer studies of eQTL mapping using RNA-seq have emerged [50, 58]. These pioneer studies employed existing eQTL mapping methods that were designed for microarray data, and thus cannot fully exploit the new features in RNA-seq data. For eQTL studies, RNA-seq provides allele-specific gene expression (ASE), which is not available in microarrays, and unprecedentedly rich information for RNA-isoform expression. To the best of our knowledge, no statistical/computational method has been specifically developed for eQTL mapping using RNA-Seq, except for our recent work [77]. In the following, we will discuss the issues and potentials of eQTL mapping using ASE and isoform-specific eQTL mapping.

2 eQTL Mapping Using ASE

2.1 Introduction

In a diploid individual, each gene has two alleles: the paternal and maternal allele. The allele-specific transcript abundance is referred to as the ASE of this gene. Cis-acting regulation is due to DNA variation that directly influences the transcription process in an allele-specific manner (Fig. 1(a)). Alternatively, trans-acting regulation affects the gene expression by modifying the activity (or abundance) of the factors that regulate the gene, which leads to the same amount of expression changes for both alleles [91] (Fig. 1(b)). In this paper, we refer to an eQTL of a gene as a cis-eQTL if it alters the expression of the two alleles of this gene differently, otherwise we refer to the eQTL as a trans-eQTL. Therefore, cis- and trans-eQTL can be distinguished by ASE (Fig. 1(a),(b)) [16, 64]. In contrast, total expression of a gene cannot separate cis-eQTL and trans-eQTLs because the two types of eQTL result in similar patterns across a group of individuals (Fig. 1(c),(d)). In previous eQTL studies using microarrays, cis-eQTLs were often not distinguished from local eQTLs due to the lack of ASE. Here, we use the precise definitions of cis- and trans-eQTLs based on the ASE patterns [63]. In what follows, we introduce more details of ASE and cis-/trans-eQTL mapping using RNA-seq data.

Fig. 1
figure 1

(a) An example of a cis-eQTL in two samples. In Sample 2 where the target SNP (the SNP for which we test association) has a heterozygous genotype CG, the expressions of the two alleles are different. (b) An example of a trans-eQTL in two samples. In Sample 2 where the target SNP has a heterozygous genotype TA, the expressions of the two alleles are the same. (c) A simulated data for a cis-eQTL across 60 samples with 20 samples within each genotype class. (d) A simulated data for a trans-eQTL across 60 samples with 20 samples within each genotype class

2.1.1 ASE

In earlier studies, ASE has been assessed by quantitative genotyping following RT-PCR [12, 16, 64], which is a relatively labor-intensive low-throughput approach. Genome-wide genotyping arrays have also been used to assess ASE at predetermined polymorphic sites [23, 24, 45]. Recently, RNA-seq has been used to study the allelic imbalance of gene expression by comparing the expression of the two alleles at a single heterozygous SNP [14, 25, 48, 94]. Among these existing approaches for ASE studies, RNA-seq is the only one that provides both allelic and total expression data [55]. Previous studies have shown that allelic imbalance of gene expression is relatively common. For example, Zhang et al. [100] showed that 20 % of target polymorphic sites exhibited 1.5-fold expression difference, and Ge et al. [23] showed that 30 % of measured transcripts exhibited 1.2-fold expression difference.

Currently, ASE is often assessed by mapping the RNA-seq reads to reference genome followed by counting the number of allele-specific reads that overlap with heterozygous SNPs. Two major technical difficulties hinder accurate measurement of ASE. One is that the mapped allelic reads may be biased to the allele represented by the reference genome. The other is relative low density of heterozygous SNPs (other types of polymorphic sites) where we can assess ASE. For the former problem, one effective treatment is to remove the SNPs that tend to cause mapping bias [58]. For the latter problem, one can impute the genotypes of untyped SNPs and aggregate the information of multiple SNPs given known haplotype. While haplotypes’ information is often not available, they can be imputed (together with genotypes of untyped SNPs) using available genotype data and reference haplotypes [8, 44]. Another strategy that addresses both technical difficulties of ASE assessment is to directly map RNA-seq reads to individual-specific haploid genomes. The haploid genomes may be available for the study of experimental cross, or they can be imputed [8, 44]. The success of this strategy relies on the accuracy the haploid genomes. We are not aware of any study that has carefully compared the two strategies or mapping to reference genome or imputed haploid genomes, and it is certainly an interesting research topic. If there is no genotype data available at all, it is also possible to align RNA-seq reads to the reference genome, call genotypes, and then impute haplotypes using the genotype calls [81].

A simple binomial test can be applied to test whether the expression of the two alleles are the same or not. However a binomial distribution cannot accommodate possible overdispersion in the data, and thus beta-binomial distribution may be preferred. Recently, Skelly et al. [71] have proposed a hierarchical Bayesian model that combines information across loci to test allelic imbalance of gene expression.

2.1.2 eQTL mapping using ASE

To the best of our knowledge, except for our recent work [77], no method has been proposed for eQTL mapping using ASE measured by multiple SNPs. In what follows, we briefly describe our eQTL mapping method using ASE by an example of a cis-eQTL for one gene in three individuals (Fig. 2). Assume that this gene has two exons and there are two exonic SNPs, one on each exon, with alleles A/T and A/G, respectively. We test the association of the gene expression with an upstream SNP (target SNP), which has two alleles, C and T. A straightforward approach is to test the association between Total Read Count (TReC) of this gene and the target SNP (Fig. 2(b)). In this example, TReC is negatively correlated with the number of T alleles of the target SNP.

Fig. 2
figure 2

(a) RNA-seq measurements of a gene with two exons in three individuals. (b) TReC for the three individuals. (c) ASE for individual (i). (d) ASE for individual (ii)

Testing the association between ASE and the target SNP is less straightforward. We can consider it as a two-step procedure: (1) count the number of allele-specific reads as ASE; (2) assess the association between ASE and the target SNP (Fig. 3).

Fig. 3
figure 3

A flow chart of the two-step procedure for eQTL mapping using ASE

We first use the example in Fig. 2 to describe the procedure of counting allele-specific reads. An RNA-seq read is allele-specific if it can be assigned to one of the two alleles of the gene without ambiguity. As illustrated in Fig. 2(a), individuals (i) and (ii) have heterozygous genotypes for at least one exonic SNP, and thus their ASE can be measured by the RNA-seq reads that overlap with the heterozygous SNPs. Specifically, all the RNA-seq reads in individual (i) are allele-specific (Fig. 2(c)). However, for individual (ii), only the reads of the first exon are allele-specific, while the reads of the second exon do not overlap with any heterozygous SNP and hence are not allele-specific (Fig. 2(d)). Haplotype information is needed to obtain gene-level ASE by combining ASE measured at different exonic SNPs. For example, for individual (i), we count the number of allele-specific reads on the haplotype A-A and the haplotype T-G.

Next, we discuss association testing using ASE. It is important to note that the target SNP can be anywhere in the genome, and we can study the ASE association as long as the target SNP is connected with the gene of interest by contiguous haplotypes. For example, for individual (i) in Fig. 2(a), given haplotypes C-A-A and T-T-G, we can assign ASE of the gene to the two alleles of the target SNP (Fig. 2(c)). The association testing seeks to answer this question: whether one allele of the target SNP is associated with higher or lower ASE of the gene of interest. If the answer is yes, we expect ASE of one allele to be higher than the other allele when the target SNP is heterozygous, and ASE of the two alleles to be comparable when the target SNP is homozygous. For example, individual (i) has a heterozygous genotype at the target SNP, and the C-A-A allele has higher expression than the T-T-G allele. In contrast, individual (ii) has a homozygous genotype at the target SNP, and the two alleles have the same number of allele-specific reads.

Finally, we conclude this section by a real data example consisting of 65 HapMap YRI samples [58]. Figure 4(a) shows the association between TReC of the gene KLK1 (ENSG00000167748) and SNP rs1054713. There is an apparent negative correlation between TReC of KLK1 and the number of T alleles of SNP rs1054713. Figure 4(b) illustrates the association between ASE of KLK1 and the two alleles of SNP rs1054713. Denote the number of allele-specific reads pertaining to the C allele and the T allele of SNP rs1054713 by ASEc and ASEt, respectively. We are interested in whether the proportion ASEt/(ASEc + ASEt) is deviated from 0.5. The results of TReC association show that the T allele is associated with lower expression (Fig. 4(a)). If the genetic effect is allele-specific, then within one individual, the T allele should also have lower expression than the C allele; thus the proportion ASEt/(ASEc+ASEt) should be lower than 0.5. This is consistent with the observation shown in Fig. 4(b).

Fig. 4
figure 4

(a) An example of TReC association between the gene KLK1 and SNP rs1054713. The y-axis is the total number of reads mapped to the gene KLK1 and each point corresponds to one of the 65 samples. (b) An example of ASE association. The y-axis is the proportion of ASEt over all the allele-specific reads. The allele of ASEt is defined as the allele corresponding to the T allele of SNP rs1054713 when the SNP is heterozygous, and it is defined arbitrarily when the SNP is homozygous. When SNP rs1054713 is homozygous, the proportion is around 0.5; when it is heterozygous, the proportion is below 0.5, indicating that the expression from the T allele is lower than that from the C allele

2.2 Methods

Let T i and N i be respectively TReC and ASE (i.e., allele-specific read count) in sample i (1≤in, where n is the number of study samples). Suppose that the target SNP has two alleles, A and B. Denote the two haplotypes of the gene of interest by H i =(H i1,H i2). Let N i1 be the number of allele-specific reads that are mapped to haplotype H i1, which implies N i1N i . Let G i be the genotype of the target SNP, which takes the value AA, AB or BB. Our model is based on the following factorization:

Each component is defined as follows.

  • P(T i |H i ,G i ). Given G i , the total read count T i is assumed to be independent of H i and follows a negative binomial distribution with mean μ AA , μ AB or μ BB corresponding to G i =AA, AB or BB, and a dispersion parameter ϕ. We define the association parameter β (T)≡log(μ AA /μ BB ), i.e., the log ratio of the gene expression between genotype classes AA and BB. The eQTL strength can be assessed by testing whether β (T)=0. We refer to the above model, denoted by \(P_{\beta^{(\mathrm{T})}, \phi}(T_{i} | G_{i})\), as the TReC model. The superscript (T) in β (T) indicates that the association parameter is defined in the TReC model.

  • P(N i |T i ,H i ,G i ). This part of information is irrelevant for assessing the eQTL strength, and thus can be factored out of the likelihood.

  • P(N i1|N i ,T i ,H i ,G i ). Given (N i ,H i ,G i ), the read count N i1 is assumed to be independent of T i and follows a beta-binomial distribution with a parameter π, which is the expected proportion of the allele-specific reads from haplotype H i1 over the N i allele-specific reads, and a dispersion parameter ψ. If the target SNP is homozygous in sample i, i.e., G i =AA or BB, π is fixed to be 0.5; thus the two haplotypes H i1 and H i2 can be defined arbitrarily because the likelihood remains the same if the definitions of H i1 and H i2 are flipped. The samples with homozygous genotypes at the target SNP only contribute to the estimation of the dispersion parameter ψ. If the target SNP is heterozygous, π is a free parameter, and without loss of generality we define H i1 and H i2 such that the haplotype configuration is A-H i1 and B-H i2. The eQTL strength can be assessed by testing whether π is deviated from 0.5. Following the above discussion, we have \(P(N_{i1} | N_{i},T_{i},H_{i},G_{i}) = \{P_{\pi =0.5, \psi}(N_{i1} | N_{i})\}^{I(G_{i}=AA~\text{or}~BB)} \{P_{\pi, \psi }(N_{i1} | N_{i})\}^{I(G_{i}=AB)}\), where I(.) is an indicator function. We refer to this model as the ASE model.

The TReC model can detect both cis- and trans-eQTLs (although it cannot distinguish cis- and trans-eQTLs), and it is more powerful than a computationally convenient approach: normal quantile transformation of the TReC data is followed by a linear regression [77]. The ASE model can only detect cis-eQTL. In the following derivation, we show that the TReC and ASE data provide consistent information for cis-eQTL mapping, and thus combining them increases the power of cis-eQTL mapping. Let

(1)

where the superscript of β (A) indicates that β (A) is the genetic effect defined in the ASE model, and μ A and μ B denote the expected number of allele-specific reads for haplotypes A-H i1 and B-H i2. Recall that β (T)≡log(μ AA /μ BB ), where μ AA and μ BB are the expected TReC when the target SNP has the genotype AA and BB, respectively. Since TReC of an individual equals to the summation of TReC on each allele, we have

(2)

Note that log(μ A /μ B ) in (1) and (2) have different meanings. In (1), the expression log(μ A /μ B ) is the log ratio of ASE from the A-H i1 allele vs. the B-H i2 allele within an individual with a heterozygous genotype at the target SNP. In contrast, log(μ A /μ B ) in (2) is the log ratio of TReC from two individuals with respective genotypes AA and BB. By the definition of cis-eQTL, the variation of gene expression abundance across individuals is due to allele-specific expression, and thus we can equate log(μ A /μ B ) in (1) and (2) for cis-eQTL but not for trans-eQTL. In other words, for cis-eQTL, we can estimate the genetic effect β based on the joint likelihood \(\mathcal{L}(\beta,\phi ,\psi)\) combining the TReC and ASE data, where

$$\beta= \log({\mu_{AA}}/{\mu_{BB}} ) = \log \bigl({\pi}/{(1 - \pi)} \bigr), $$

and

We refer to this joint model as the TReCASE model. We have also developed a statistical test to distinguish cis- and trans-eQTLs:

$$H_0\ (\mathit{cis}\mbox{-}\mathrm{eQTL}):\ \beta^{(\mathrm{A})} = \beta^{(\mathrm{T})}, \quad\mathrm{v.s.}\quad H_1\ (\mathit{trans} \mbox{-}\mathrm{eQTL}):\ \beta^{(\mathrm{A})} \neq\beta^{(\mathrm{T})}. $$

One should use the TReC model for trans-eQTL and the joint model for cis-eQTL [77]. The details of obtaining MLE from the TReC, ASE, and TReCASE models are skipped, and the interested readers are referred to Sun (2011) [77].

2.3 Implementation

In most real data studies, the input data are RNA-seq data in the FASTA or FASTAQ format, DNA genotype data, and haplotype data from reference panels. The implementation of eQTL mapping using RNA-seq can be divided into four major steps: DNA data processing, RNA data processing, read counting, and eQTL mapping (Fig. 5).

Fig. 5
figure 5

A workflow of eQTL mapping using RNA-seq data

In the step of DNA data processing, we use a phasing program, such as BEAGLE [8] or MACH [44], to impute the phase as well as to impute the genotype of a large set of SNPs that are phased against a referenced panel. It is also possible to align RNA-seq reads to the reference genome, call genotypes, and then impute haplotypes using the genotype calls [81].

The step of RNA data processing involves mapping RNA-seq reads to the genome. One can either map the reads of all the individuals to the same reference genome, or map them to the individual-specific haploid genomes that are constructed based on the phasing results. The advantages/limitations of these two approaches have been discussed in Sect. 2.1.1.

The counting step counts TReC per gene, per sample, and counts the number of allele-specific reads per allele of a gene, per sample. If there are m genes and n samples, the result of counting TReC is a matrix of size m×n, and the result of counting ASE is a matrix of size m×2n. Counting TReC is not trivial because one may prefer to count the reads that overlap and only overlap with the exonic regions of the gene of interest. Counting ASE is more complicated because one needs to compare the nucleotides in an RNA-seq read with the two alleles of any heterozygous SNP. Some Quality Control (QC) steps should be implemented. For example, the reads with mapping ambiguity or low mapping quality should be removed. While counting allele-specific reads, one should check the sequencing quality score of a read at a particular SNP. If the sequencing quality score at that particular base pair is low, the read should not be counted as allele-specific. In addition, one RNA-seq read may harbor more than one SNP and those SNPs may suggest contradicting alleles for the read, e.g., one SNP suggests this read is from paternal allele and the other SNP suggests it is from the maternal allele. Such reads should also be discarded.

Finally, in the step of eQTL mapping, the variation of TReC and/or ASE of a gene is associated with a target SNP, using the haplotype information to connect the alleles of the gene to the alleles of the target SNP. Two sets of covariates can be included in the regression model. One is the set of observed covariates, including the total number of reads per sample, batch, gender, age, etc. The other is the set of derived covariates that aim to capture unobserved batch effects. For example, one may use standardized TReCs (TReCs of all genes of a sample are normalized by the total number of reads of that sample) to estimate Principal Components (PCs) via Principal Component Analysis (PCA), and then use these PCs as derived covariates.

2.4 Discussion and Future Directions

The above discussions of eQTL mapping assume that the haplotypes are known or they are accurately estimated by a phasing program. It is reasonable to expect that the haplotypes within exonic regions of a gene can be accurately estimated. Almost 90 % of the annotated genes are shorter than 100 kb [20], in which haplotypes estimated from genotypes (i.e., phasing) are usually accurate [46]. In addition, RNA-seq assembly can fix possible switch errors from phasing. Although most existing methods for genome-wide de novo RNA-seq assembly do not produce allele-specific assembly yet [5, 69], we conjecture that reference-genome guided assembly, which is sufficient to fix switch errors from phasing, is feasible and computationally efficient. The main challenge is to infer the haplotypes connecting the target SNP and the gene body. Phasing across a long genetic distance is often inaccurate, and RNA-seq assembly cannot help if the target SNP is located in a non-exonic region, which is true in most cases. Due to this limitation, we have carried out eQTL mapping only for local SNPs within 200 kb of each gene [77]. Although recent developments render whole-genome phasing possible [19, 35, 97], these techniques are not mature enough for large-scale studies yet. Therefore, there is a pressing need to develop statistical methods for eQTL mapping using ASE that can accommodate the uncertainty of long-distance phasing.

Xiao and Scott [94] have proposed several methods for cis-eQTL mapping based on the allele-specific expression measured at a single exonic SNP from phase-unknown data: an F-test to assess whether log(N i1/N i2) has a larger variance when the target SNP is heterozygous, a t-test to assess whether the mean value of log(N i1/N i2) is deviated from 0, and a mixture-model-based test in which log(N i1/N i2) is modeled by a mixture normal distribution to account for phasing uncertainty. They found that the t-test/F-test has the highest power when the LD between the target SNP and the exonic SNP is high/low, and the mixture model approach has the highest power for moderate LD. The problem they addressed can be considered as a simplified situation of eQTL mapping using RNA-seq with a few limitations. First, they measured ASE only on a single transcribed SNP instead of across all exonic SNPs of the gene. Second, they did not borrow the information of TReC for eQTL mapping. Third, they modeled log(N i1/N i2) using normal approximation, which is less accurate than directly modeling the read counts by discrete distribution, especially for relatively lower read counts.

In addition to improving statistical power for eQTL mapping, dissecting the genetic basis of ASE can provide important insights into biology questions. For example, some recent studies have shown that cancer drivers/contributors may show imbalanced allelic expression in germline and/or tumor tissues [30, 49, 82, 101]. Such allelic imbalanced expression may be considered as biomarkers and their genetic basis may be valuable to guide personal treatments.

3 Isoform-Specific eQTL Mapping

3.1 Introduction

One important source that contributes to functional complexity of the mammalian genome is the RNA-isoforms due to alternative splicing of pre-messenger RNA [33, 36]. It has been shown that more than 90 % of human genes are alternatively spliced [54, 85], and gene expression is often differentially regulated at the isoform level in different tissues and/or at different developmental stages [85]. Previous studies have reported associations between alternative splicing events and diseases such as cystic fibrosis [22] and cancer [83, 86]. RNA-seq data provide unprecedentedly rich information to study alternative splicing events [54, 76, 85, 90]. Specifically, read depth along the gene body is informative for inferring the underlying RNA-isoforms, and reads covering exon junctions provide direct evidence of alternative splicing. Such information is also available from exon tiling arrays [95] and exon junction arrays [68], but with lower precision and limited by the probe design of the array.

There are three types of statistical/computational problems for the study of RNA-isoforms using RNA-seq data: transcriptome reconstruction, isoform abundance estimation, and differential isoform usage testing. Differential isoform usage refers to the changes of RNA-isoform expression relative to the expression of the corresponding gene. The purpose of isoform-specific eQTL mapping is to dissect the genetic basis of differential isoform usage. We also refer to isoform-specific eQTL mapping as splicing QTL mapping or sQTL mapping. Since isoform abundance cannot be directly measured, transcriptome reconstruction and abundance estimation are necessary steps of sQTL mapping, and the results of these two steps have non-negligible effect on the testing of differential isoform usage. Therefore, we review all the three topics.

3.2 Transcriptome Reconstruction

There are two types of methods for the purpose of transcriptome reconstruction: genome-independent reconstruction and genome-guided reconstruction [21]. Genome-independent reconstruction methods, such as Velvet [99], ABySS [5], and trans-ABySS [61], directly assemble the RNA-seq reads into transcripts without using a reference genome. This approach is, obviously, the only choice for organisms without a reference genome. However, when transcriptome annotation is available, the genome-guided reconstruction methods, which first map all the RNA-seq reads to the reference genome and then assemble overlapping reads into transcripts, are more accurate and computationally much more efficient. Mapping RNA-seq reads to the reference genome may involve the detection of de novo exons and exon junctions by TopHat [79], SpliceMap [3], MapSplice [87], SplitSeek [1], G-Mo.R-Se [15], QPALMA [13], or other software. Two genome-guided reconstruction methods, Cufflinks [80] and Scripture [27], have been developed. Both methods build assembly graphs (using different approaches though) in which one path in the graph corresponds to an RNA-isoform. Cufflinks reports a minimal set of isoforms by choosing a minimal set of paths while Scripture reports all compatible isoforms.

3.3 Isoform Abundance Estimation

We group the methods for isoform abundance estimation into four categories (Table 1). The methods in the first category (e.g., ALEXA-seq [26] and NEUMA [38]) estimate isoform abundance using the sequence reads that are unique to an isoform. This approach misses the information embedded in the “isoform multi-reads” [39], i.e., reads that are compatible with more than one isoform.

Table 1 Statistical/computational methods for isoform abundance estimation. The Input column is empty for some of the methods because there is no specific requirement for the input data

The methods in the other three categories use different approaches to probabilistically assign the “isoform multi-reads” to certain isoforms and then estimate isoform abundance. Methods in the second category employ a generative model to describe the stochastic process in RNA-seq experiments. The term “generative model” means that the process of generating each read is modeled so that the likelihood is a product of the likelihoods from each read. For example, following Equation (14) of Pachter (2011) [53] (with some changes of notation so that the notations are consistent in this paper), the likelihood of N single-end reads from K isoforms is

(3)

where \(\tilde{l}_{k}\) is the effective length (i.e., the number of positions where a read can start) of the kth isoform, \(\widetilde{c}_{s,k}=1\) if read s is compatible with the kth isoform and 0 otherwise, and α k is the probability of selecting a read from the kth isoform. The probability α k can be formulated as \(\alpha_{k} = \theta_{k} \tilde{l}_{k} / \sum_{k'=1}^{K} \theta_{k'} \tilde{l}_{k'}\), where θ k is the relative abundance of the kth isoform and is the parameter of interest. Extension to paired-end fragments involves modeling the distance of the two reads of a paired-end fragment. We skip the details here and refer interested readers to existing works such as Cufflinks [60, 80].

The third category includes methods that build their likelihood functions by a Poisson model [32, 59, 65]. Given a known set of isoforms, Jiang and Wong [32] modeled the fragment count of each locus (either an exon or an exon junction) by a Poisson distribution, and estimated the expression of each isoform by Maximum Likelihood Estimation (MLE). Specifically, suppose that there are K isoforms, and let N r (1≤rR) be the number of reads falling into the rth region of interest (e.g., an exon or exon–exon junction); the likelihood function is

(4)

where λ r is the expression rate pertaining to the rth region. Let \(\theta^{*}_{k}\) be the expression rate of the kth isoform and the parameter of interest. We define \(\lambda_{r} = l_{r} w \sum_{k=1}^{K} c_{r,k} \theta^{*}_{k}\) and \(\lambda_{r,r'} = l_{r,r'} w\sum_{k=1}^{K} c_{r,k} c_{r',k} \theta^{*}_{k}\), where w is the total number of sequence reads, l r and l r,r are the lengths of the rth exon and the junction of the rth and r′th exon, respectively, and c r,k =1 if the rth region is compatible with the kth isoform and 0 otherwise. Note that it is more appropriate to use the effective length instead of the actual length of exons and exon–exon junctions in the above likelihood [53]. The expression of an isoform could be zero or close to zero, which is the boundary of the parameter space and thus leads to unreliable MLE. Jiang and Wong [32] addressed this problem by importance sampling guided by MLE. Salzman et al. [65] extended the method of Jiang and Wong [32] to work with paired-end sequencing data. Richard et al. [59] developed a similar MLE approach for isoform abundance estimation of known isoforms using only the reads on exons.

The likelihoods employed by the methods in the second and third categories are different. The multinomial generative model pertains to the individual single-end read or paired-end fragment, whereas the Poisson model pertains to the read count of a region. However, the two likelihoods result in an identical estimate of isoform abundance [53], following from the equivalence between the multinomial and Poisson models [37].

The fourth category includes methods based on penalized Poisson regression [93] or penalized least squares [6, 40, 42]. These methods can simultaneously construct isoforms and estimate isoform abundance. For example, isoLasso [42] first identifies candidate isoforms for each gene using a modified connectivity-graph approach of Scripture [27]. Since Scripture reports all isoforms compatible with the observed data, it is expected that some candidate isoforms may not be expressed. Thus, one needs to simultaneously select the expressed isoforms and estimate their abundance. Towards this end, isoLasso [42] minimizes the objective function of penalized least squares

(5)

where N r is the number of sequence fragments in the rth region (e.g., exon or exon–exon junction), l r is the length of the rth region, c r,k =1 if the rth region is compatible with the kth isoform, and \(\theta^{**}_{k}\) is the expression rate of the kth isoform and is the parameter of interest. The Lasso penalty \(\lambda \sum_{k=1}^{K} | \theta^{**}_{k} |\) can penalize some of \(\theta^{**}_{k}\)’s to be 0, hence achieving the goal of isoform selection [78]. The authors of isoLasso pointed out that it is more appropriate to use the effective length instead of the actual length of exons and exon–exon junctions in their objective function.

Recent studies have shown that it is important to consider positional bias and sequence bias for the purpose of transcript abundance estimation [28, 39, 41, 60, 92]. Positional bias refers to the observation that the sequence reads are not uniformly distributed along the transcript. Sequence bias refers to the non-randomness of the sequences around the beginning and the end of each singe-end sequence read or paired-end sequence fragment; for example, reads may be more likely to start at a position of higher GC content. Methods have been developed to account for such biases for both the multinomial generative model [39, 60] and the Poisson model [41, 92]. Another approach is to reweight each sequence read by its first heptamer (seven bases), and instead of counting the number of reads mapped to a genomic region, one adds up the weight of the reads mapped to the region, and then the sums of weight are used as counts for downstream analyses [28].

3.4 Differential Isoform Usage Testing

Recall that differential isoform usage means the changes of the relative isoform expression with respect to the expression of the gene. Testing differential isoform usage is related to but different from testing differential expression. Nevertheless, some conclusions from testing differential expression are instructive for testing differential isoform usage, and are stated in this paragraph. First, for the purpose of testing differential expression, one can apply transformation such as the normal quantile transformation to read count data and then treat the transformed measurements as normally distributed random variables. Such transformation loses information, and it is more appropriate to keep the discrete feature of the RNA-seq data. Several methods have been developed for differential expression testing by modeling read counts via a discrete distribution, such as a Poisson distribution when there is no overdispersion [88], a negative binomial distribution [2, 29, 62] or a generalized Poisson distribution [73] when there is overdispersion, which is often true for expression data across biological replicates. One can also apply a two-stage approach to first test for overdispersion and then apply the appropriate modeling strategy based on the conclusion of the overdispersion test [4].

So far, only a few methods have been developed for testing differential isoform usage. Trapnell et al. [80] employed the square-root of Jensen–Shannon Divergence (JSD) as a test statistic and they derived its asymptotic distribution. Specifically, let p (1),…,p (M) be the distributions of isoform abundance under M conditions, where \(\mathbf{p}^{(m)} = (p_{1}^{(m)}, \ldots, p_{K}^{(m)})^{T}\) is a vector of length K such that \(p_{k}^{(m)}\) is the relative abundance of the kth isoform under condition m. We have \(\sum_{k=1}^{K} p_{k}^{(m)} = 1\), m=1,…,M. Then JSD is defined as

(6)

where \(H(\mathbf{p}^{(m)}) = - \sum_{k=1}^{K} p_{k}^{(m)} \log (p_{k}^{(m)})\) is the entropy across the K isoforms. The test statistic, denoted by \(f(\mathbf{p}^{(1)}, \ldots, \mathbf{p}^{(M)}) = \sqrt{JS(\mathbf {p}^{(1)},\ldots, \mathbf{p}^{(M)})}\), asymptotically follows a normal distribution with mean 0 and variance (∇f)T Σ(∇f), where (∇f) is the partial derivative of f(p (1),…,p (M)) with respect to \(p_{k}^{(m)}\), and Σ is the block-diagonal variance-covariance matrix with one block for each p (m).

Singh et al. [70] modeled the transcriptome of one condition by a splice graph, which is constructed such that one edge corresponds to a transcribed interval or a spliced site. Then they proposed a flow difference metric (FDM) to measure the isoform usage difference between two conditions by the difference between the two corresponding splice graphs. They showed that FDM is correlated with JSD and can be used as a classifier for JSD. They developed a non-parametric resampling method to obtain the null distribution of FDM under the null hypothesis of no differential isoform usage, and used this null distribution to test for differential isoform usage.

Although it is important to consider positional bias and sequence bias for isoform abundance estimation as we discussed before, it is a question of whether modeling such a bias is necessary for differential isoform usage testing. Suppose that there is a positional bias such that there is higher read depth in the 3′ end of the gene. Without modeling the positional bias, the abundance of the isoforms closer to the 3′ end of the gene may be overestimated. However, as long as such bias is consistent across all the samples, it does not lead to a false positive result for differential isoform usage testing.

3.5 Differential Isoform Expression

In addition to isoform usage testing, one can also consider differential expression of each isoform. Notably, differential isoform expression testing is different from isoform usage testing. The former produces one p-value for each isoform while the latter produces one p-value for multiple isoforms of one gene. Cufflinks [80] tests differential expression of a transcript under two conditions by assessing the following test statistic: log(FPKM 1/FPKM 2), where FPKM i is the FPKM (Fragments Per Kilo-base of the transcript and per Million RNA-seq fragments of the sample) of the transcript under condition i, and i=1 or 2. Using the conclusion Var[log(X)]≈Var(X)/E(X)2, they derived a test statistic

$$\frac{\log(\mbox{\texttt{FPKM}}_1 / \mbox{\texttt {FPKM}}_2)}{\sqrt{\mathrm{Var}(\mbox{\texttt {FPKM}}_1) / E(\mbox{\texttt{FPKM}}_1)^2 + \mathrm{Var}(\mbox {\texttt{FPKM}}_2) / E(\mbox{\texttt {FPKM}}_2)^2}}, $$

which follows standard normal distribution under null hypothesis of no differential expression. This testing approach did not consider the variation of FPKM estimates due to isoform selection.

An alternative method named BASIS (Bayesian Analysis of Splicing IsoformS) [102] directly compares RNA-isoform expression without an intermediate isoform selection step. Specifically, a hierarchical Bayesian model is employed to model the expression coverage difference at one locus between two conditions as a linear combination of the isoform expression differences plus an error term. Since the variance of the error term is dependent on the mean expression level, the error terms of all loci across the genome are grouped into 100 bins by the total coverage of the loci, and modeled separately.

3.6 Splicing QTL (sQTL) Mapping

The problem of sQTL mapping can be considered as a special case of the problem of differential isoform usage testing. To the best of our knowledge, no existing method is able to directly assess the association between the isoform usage and a quantitative covariate, which can be the additive coding of a SNP or the copy number calls at a genomic locus. The testing of differential isoform usage against a quantitative covariate is a very interesting direction for future development, not only for sQTL mapping but also for many other problems of differential isoform usage testing, for example, to assess the association between differential isoform usage and age.

The other potential research direction is to combine the eQTL mapping of total transcription abundance of a gene with the sQTL mapping of relative transcription abundance (e.g., isoform usage), because genetic variation is very likely to affect both the total expression of a gene and the relative expression of its isoforms. If this gene-level testing indicate significant differential expression, either for total expression or for isoform usage, one can further test differential expression of each isoform. We expect that this two-step approach of gene-level testing followed by isoform-level testing is more powerful than directly testing for all possible isoforms due to the reduction of the number of tests, and hence the reduced burden of multiple testing correction.

The third future direction is simultaneous allele-specific and isoform-specific eQTL mapping, which can provide unprecedented details of the genetic basis of transcription regulation. A pioneer work in this direction, a haplotype and isoform-specific expression estimation method, has been reported [81]. In fact, joint analysis of allele-specific expression and isoform-specific expression is necessary to obtain more precise conclusions. We illustrate this point by an example shown in Table 2. Suppose there is a hypothetical gene with three exons of effective length 100 bp, and to simplify the discussion, we ignore the reads overlapping with more than one exon. Here effective length of an exon is defined as the number of base pairs where an RNA-seq fragment can be sampled [80]. Further assume that this gene has two isoforms: one includes exons 1 and 3, and the other includes exons 1, 2, and 3. Isoform 1 has higher expression in paternal allele than maternal allele while isoform 2 has higher expression in maternal allele than paternal allele. If one ignores isoform expression and naively estimates FPKM at gene level, the FPKM estimates for both alleles, paternal allele, and maternal allele are 500/300=1.67,(30+30+70+70+70)/300=0.9, and (70+70+30+30+30)/300=0.77, respectively. However, given isoform configuration, the FPKM estimates at gene level for both alleles, paternal allele, and maternal allele are 500/(0.5×200+0.5×300)=2, (30+30+70+70+70)/(0.3×200+0.7×300)=1, and (70+70+30+30+30)/(0.7×200+0.3×300)=1, respectively. Therefore, ignoring isoform level expression leads to the conclusion that there is allelic imbalance of gene expression, while a more accurate explanation is that there is allele-specific isoform usage.

Table 2 An example illustrating that one can obtain more accurate allele-specific expression estimates at RNA-isoform level. Assume this gene has two isoforms. Isoform 1 includes exons 1 and 3, and isoform 2 includes exons 1, 2, and 3. The columns Exon1, Exon2, and Exon3 show the number of reads mapped to the corresponding exons. Columns \(\mbox{FPKM}_{\mbox{{\scriptsize\texttt{isoform}}}}\) and \(\mbox{FPKM}_{\mbox{{\scriptsize\texttt{gene}}}}\) show FPKM estimates at isoform and gene level, respectively

4 Discussion and Conclusion

Network analysis has been employed in eQTL studies to jointly mapping eQTL of multiple transcripts [56, 98]. It involves simultaneous estimation of residual covariance/precision matrix and the regression coefficient matrix. It is interesting to apply similar approaches for eQTL mapping using RNA-seq data. However, while discrete distributions such as beta-binomial or negative-binomial distributions are appropriate choices to model the RNA-seq count data for each gene, it is much more challenging to study the joint distribution of multiple genes due to the difficulty of studying multivariate beta-binomial or negative-binomial distributions. This is an interesting direction that warrants further developments of appropriate statistical methods.

We would like conclude this paper by pointing out that the developers of statistical/computational methods for eQTL mapping should not only focus on exploiting each bit of information from RNA-seq to improve statistical power. One should put even more emphasis on the scientific questions that can be answered by developing a new method. For example, using allele-specific and isoform-specific eQTL to dissect the genetic/genomic basis of complex diseases. Recent genome-wide association studies (GWAS) found that most common genetic variants can explain at most a few percents of the variance of a complex disease. This has raised some doubts on the efficacy of genetic/genomic approach for understanding complex diseases and developing treatments. eQTL studies can provide more information than GWAS because a complex disease often has tighter correlations with gene expression variations than genetic variants. This is in turn due to at least two reasons. First, by the central dogma of DNA → RNA → Protein, RNA is closer to disease than DNA in terms of signal transmission from DNA to phenotype. Second, the effects of more than one genetic variant may be accumulated on a particular transcript. On the other hand, unlike DNA data, which is stable, RNA data is noisier, e.g., RNA expression varies across tissues and development stages. RNA-seq provides more information of gene expression than expression arrays, together with more variation, e.g., the gene expression may vary in allele-specific manner or in isoform level. By combining DNA and RNA data in eQTL analysis, we may exploit both the stability of DNA data and the informativeness of RNA data for the purpose of understanding complex diseases.