Introduction

Genome-wide association (GWA) studies have been a powerful method for identifying genetic components associated with traits or diseases (Manolio 2009). To date, about 15,000 published trait-associated SNPs have been compiled in the catalog of published GWA studies (http://www.genome.gov/gwastudies). In spite of the great success of these studies, interpreting the identified genetic loci remains challenging, as the bulk of them do not change the amino acid sequence of a protein (Ioannidis et al. 2009). Recently, several approaches have attempted to explain the genetic mechanisms behind these GWA signals. For example, some web services accept a list of GWA signals and look up the overlapping epigenetic loci, the regulatory roles of which have been compiled by epigenomic studies such as the ENCODE project. Another approach is based on the observation that GWA signals may affect disease susceptibility by altering the regulation of gene expression (ENCODE Project Consortium 2004; ENCODE Project Consortium 2012). Expression quantitative trait loci (eQTLs) influence gene expression levels, while splicing quantitative trait loci (sQTLs) alter alternative splicing (AS) patterns. Previous studies have accumulated a number of eQTLs, which have been helpful in explaining GWAS signals (Boyle et al. 2012; Gong et al. 2016; Nicolae et al. 2010; Schadt et al. 2008). In contrast, only a few studies have reported a genome-wide survey of sQTLs. Since changes in AS are well known to cause various diseases and to influence the susceptibility to complex diseases, sQTLs that alter AS patterns may affect the balance of expression among transcript isoforms and consequently influence traits or diseases (Wang and Cooper 2007; E et al. 2013). As the bulk of GWAS signals are located in noncoding regions such as introns, UTRs, and promoters (Zhang and Lupski 2015), it is tempting to investigate whether some GWA signals can be interpreted as sQTLs.
The recent development of high-throughput RNA sequencing (RNA-seq) using next-generation sequencing (NGS) has enabled more precise annotation and quantification of exons than the conventional exon array method. Consequently, RNA-seq has become a powerful alternative method for genome-wide surveys of sQTLs (Katz et al. 2010). Here, we report candidate genome-wide sQTLs in three European populations (CEU, GBR, and FIN). First, in order to facilitate the identification of candidate sQTLs, we developed a user-friendly R/Bioconductor package, IVAS, which identifies genetic variants affecting a specific exon from two inputs: a simple transcript expression data set from RNA-seq, given in fragments per kilobase of transcript per million mapped reads (FPKM), and a genotype data set. Using the package, we identified sQTLs in each of the published RNA-seq and genotype datasets generated from the lymphoblastoid cell lines of the populations of the 1000 Genomes Project (Consortium et al. 2010). The results of the individual studies were then combined by a meta-analysis in order to reduce the heterogeneity among the studies and to improve the statistical power (Begum et al. 2012; Teslovich et al. 2010). The meta-analyzed sQTLs comprise a comprehensive collection of genome-wide sQTLs, which may be a useful resource for understanding the genetic mechanisms of GWA signals.

Materials and methods

Study samples and data preprocessing

The matched RNA-seq and genotype data used in this study were downloaded from the Geuvadis RNA Sequencing Project (http://www.geuvadis.org/web/geuvadis/RNA-seqproject) and the 1000 Genomes Project (Consortium et al. 2010), respectively. The RNA-seq project measured RNA expression in the lymphoblastoid cell lines of the three European populations (78 CEU, 89 FIN, and 85 GBR individuals) included in the 1000 Genomes Project.

The downloaded RNA-seq quantification data include FPKM values for 183,086 transcripts. Among them, we considered only genes with at least two isoforms. Furthermore, we selected 4,712,172 SNPs with a minor allele frequency (MAF) > 0.1 out of the 38,187,571 markers in the genotype data. Among them, we studied 196,853, 194,237, and 198,950 SNPs located in the alternative exons and their flanking introns in the CEU, FIN, and GBR populations, respectively.
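The MAF filter described above can be illustrated with a minimal sketch. It assumes that genotypes are encoded as alternate-allele counts (0, 1, or 2) per individual, with missing calls as `None`; the function names and the toy SNP identifiers are hypothetical and are not part of IVAS or of the Geuvadis data files.

```python
def minor_allele_frequency(dosages):
    """Return the MAF of one SNP from per-individual alt-allele counts (0, 1, 2).

    Missing genotypes are encoded as None and ignored (an assumption of this sketch).
    """
    called = [d for d in dosages if d is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def filter_by_maf(genotypes, threshold=0.1):
    """Keep SNPs whose MAF is strictly greater than the threshold."""
    return {snp: d for snp, d in genotypes.items()
            if minor_allele_frequency(d) > threshold}


# Hypothetical toy genotypes: SNP id -> alt-allele counts per individual.
toy_genotypes = {
    "snp_common": [0, 1, 2, 1, 0, 1],  # alt-allele frequency 5/12
    "snp_rare": [0, 0, 0, 1, 0, 0],    # alt-allele frequency 1/12
    "snp_flip": [2, 2, 2, 2, 1, 2],    # alt-allele frequency 11/12, so MAF 1/12
}
kept = filter_by_maf(toy_genotypes)
```

Taking the minimum of the two allele frequencies ensures that a SNP whose alternate allele is nearly fixed is treated in the same way as one whose alternate allele is rare.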

Meta-analysis of three European populations

In order to improve statistical power and decrease heterogeneous associations among the three populations, the individual IVAS results were combined using METAL, a meta-analysis software (Willer et al. 2010). We selected SNPs with heterogeneity P-value > 0.85, FDR P-value based on the meta-analysis <0.05, and Z-score > 2.75.
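As an illustration of how per-population results can be combined and filtered, the sketch below uses a sample-size-weighted Z-score combination, which is one of the schemes METAL offers; whether this scheme or the inverse-variance scheme was used here is not stated, so the choice is an assumption. The function names are hypothetical, and the heterogeneity P-value and FDR are taken as given inputs rather than computed. The Z-score cutoff is applied as written, since the text does not say whether the absolute value is meant.

```python
import math


def combine_z(z_scores, sample_sizes):
    """Sample-size-weighted Z-score combination (weights are sqrt(N))."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / math.sqrt(sum(w * w for w in weights))


def two_sided_p(z):
    """Two-sided P-value of a standard normal Z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def passes_filters(z, het_p, fdr, z_min=2.75, het_min=0.85, fdr_max=0.05):
    """Apply the three cutoffs quoted in the text to one SNP."""
    return z > z_min and het_p > het_min and fdr < fdr_max


# Sample sizes of the CEU, FIN, and GBR datasets as given in the text.
SAMPLE_SIZES = (78, 89, 85)

# Hypothetical per-population Z-scores for a single SNP.
z_meta = combine_z([2.0, 2.0, 2.0], SAMPLE_SIZES)
p_meta = two_sided_p(z_meta)
```

In this toy case, three modest but concordant per-population Z-scores of 2.0 combine into a meta-analysis Z-score of about 3.46, which shows how pooling consistent signals can improve power.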

Estimation of expression ratio for alternatively spliced exons

The relative expression ratio between included and skipped exons is used for evaluating the exon inclusion. For a specific exon j, the relative expression ratio, Ψj, is as follows:

$$ \Psi_{j} = \frac{\sum\limits_{i=1}^{n} X_{ij}}{\sum\limits_{s=1}^{n} X_{sj} + \sum\limits_{i=1}^{n} X_{ij}} $$

Here, Xij and Xsj are the expression levels of the transcripts with and without exon j, respectively (Supplementary Fig. 2). We implemented this method in a user-friendly and convenient R/Bioconductor package, IVAS.
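The ratio defined above can be sketched for a single exon in a single sample as follows. This is a minimal illustration and not the IVAS implementation; the function name, the dictionary-based inputs, and the transcript identifiers are assumptions made for the example.

```python
def exon_inclusion_ratio(fpkm, has_exon):
    """Return the inclusion ratio of one exon in one sample.

    fpkm:     dict of transcript id -> FPKM value in this sample
    has_exon: dict of transcript id -> True if the transcript contains the exon
    Returns None when the gene is not expressed, because the ratio is undefined.
    """
    included = sum(value for t, value in fpkm.items() if has_exon[t])
    skipped = sum(value for t, value in fpkm.items() if not has_exon[t])
    total = included + skipped
    if total == 0:
        return None
    return included / total


# Hypothetical gene with three isoforms; T2 skips the exon of interest.
sample_fpkm = {"T1": 6.0, "T2": 2.0, "T3": 2.0}
contains_exon = {"T1": True, "T2": False, "T3": True}
psi = exon_inclusion_ratio(sample_fpkm, contains_exon)
```

With these toy values, the included isoforms sum to 8.0 and the skipping isoform contributes 2.0, giving a ratio of 0.8.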

General process of IVAS

The package is written in the R programming language and depends on other R/Bioconductor packages. To run the package, two experimental datasets (an expression data frame and a genotype data frame) are required. In addition, it needs a data frame of marker positions and a general transfer format (GTF) file of reference transcript models. Based on the reference transcript models of a given gene, IVAS identifies exons that are alternatively included among the transcript isoforms. For each exon of interest, IVAS divides the isoforms into two groups: transcripts with the exon and transcripts without it. The group totals of the transcript expression levels are calculated, and their ratio is taken as an indicator of alternative exon skipping (Supplementary Fig. 2). Finally, IVAS looks for statistically significant correlations between genotypes and the ratio values using statistical models previously used for sQTLs. The user manual, including installation, implementation, and tutorials, is provided as a vignette in the package. The package is available at http://bioconductor.org/packages/release/bioc/html/IVAS.html.
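The grouping step described above can be sketched as follows. This is a simplified illustration and not the IVAS code: it assumes that each transcript model has already been reduced to a set of exon coordinates (for example, parsed from a GTF file), and it treats any exon present in some but not all isoforms as alternative, ignoring subtleties such as alternative first or last exons. All names are hypothetical.

```python
def alternative_exons(transcripts):
    """Find exons included in some, but not all, isoforms of a gene.

    transcripts: dict of transcript id -> set of (start, end) exon tuples
    Returns a dict of exon -> (transcripts with the exon, transcripts without it).
    """
    all_exons = set().union(*transcripts.values())
    groups = {}
    for exon in sorted(all_exons):
        with_exon = sorted(t for t, exons in transcripts.items() if exon in exons)
        without_exon = sorted(t for t, exons in transcripts.items() if exon not in exons)
        # Constitutive exons have an empty "without" group and are skipped.
        if with_exon and without_exon:
            groups[exon] = (with_exon, without_exon)
    return groups


# Hypothetical gene with two isoforms; T2 skips the middle exon.
toy_models = {
    "T1": {(100, 200), (300, 400), (500, 600)},
    "T2": {(100, 200), (500, 600)},
}
alt = alternative_exons(toy_models)
```

The two groups returned for each exon are the inputs needed to sum the transcript expression levels and form the inclusion ratio.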

Results

Identification of sQTLs in the CEU population using IVAS

We used the locally developed IVAS package (see the “Materials and methods” section) for the identification of sQTLs. Initially, to validate the performance of IVAS, we analyzed the lymphoblastoid cell line samples of the CEU population of the 1000 Genomes Project (Consortium et al. 2010) and compared the results with those obtained by other tools on the same dataset. The CEU genotype and RNA-seq datasets were downloaded from the Geuvadis RNA sequencing project website (http://www.geuvadis.org/web/geuvadis/RNA-seq project). The RNA-seq dataset contains FPKM values of 183,086 unique transcripts, and the genotype dataset contains 38,187,571 SNP markers. The genotype data were filtered by MAF > 0.1, resulting in 4,712,172 SNP markers. Furthermore, we considered only genes with at least two AS isoforms, and only SNPs located within the alternative exons and their flanking introns. The filtered genotype and expression datasets, comprising 196,853 SNPs and 6279 genes, were analyzed with IVAS. As a result, we discovered 562 sQTLs affecting 74 alternative exons at an FDR of 1 %. To assess the reliability of our result, we compared it with the results of the two previously published tools for identifying sQTLs, GLiMMPS (Zhao et al. 2013) and sQTLseekeR (Monlong et al. 2014). We first compared the result with that of GLiMMPS, a tool that estimates the expression level based on junction read counts. GLiMMPS could not test 23 of the 74 sQTL target exons that had been identified by IVAS in the CEU population. Those 23 target exons lacked sufficient junction reads to be analyzed by GLiMMPS, while they had enough exon reads to be identified by IVAS (Supplementary Fig. 1). Of the remaining 51 exons (associated with 478 sQTLs), GLiMMPS identified 36 (associated with 445 sQTLs) as being significant (P < 0.05). The second tool we compared against was sQTLseekeR, which performs a multivariate comparison of the expression values of all isoforms at the transcript level. In contrast to the comparison with GLiMMPS, only one SNP from the sQTLseekeR result overlapped with the significant SNPs from our result. The detailed comparisons are given in Supplementary Figs. 3 and 4.

Meta-analysis of three European populations

Following the successful validation of IVAS with the CEU dataset, we analyzed two additional European datasets, namely the GBR and FIN populations, also available from the Geuvadis project (http://www.geuvadis.org/web/geuvadis/RNA-seq project). The IVAS runs, comprising an average of 196,853 SNPs and 6279 genes in the studied populations, identified 562, 492, and 485 candidate sQTLs and 74, 60, and 71 alternatively spliced target exons at an FDR of 1 % in the CEU, GBR, and FIN populations, respectively (Supplementary Table 1). A high proportion of the candidate sQTLs was shared across the three test populations (Fig. 1). For CEU and GBR, which are known to be closer to each other than to FIN (Olalde et al. 2014), the sharing rate is about 74 %. The three results were combined by a meta-analysis in order to improve the statistical power for discovering true sQTLs and to reduce heterogeneous calls among the three populations. The meta-analysis was performed using METAL (Willer et al. 2010), resulting in a comprehensive list of 2525 statistically significant loci (P (heterogeneity) > 0.85, indicating highly homogeneous calls; FDR after the meta-analysis < 0.05; and Z-score > 2.75). The result is given in Supplementary Table 2.

Fig. 1

The sharing of sQTLs between populations. The sharing rate, Cij, at column i and row j, is defined as follows: Cij = |Xi ∩ Xj|/|Xi| (i, j ∈ {CEU, FIN, GBR}), where Xi is the set of sQTLs identified in population i
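The sharing rate defined in the caption can be computed directly from the per-population sQTL sets. The sketch below is illustrative only; the function names and the toy SNP identifiers are hypothetical and do not correspond to the reported sQTLs.

```python
def sharing_rate(x_i, x_j):
    """Return |Xi ∩ Xj| / |Xi|, the fraction of population i's sQTLs also found in j."""
    if not x_i:
        return 0.0
    return len(x_i & x_j) / len(x_i)


def sharing_matrix(sqtl_sets):
    """Return a dict of (i, j) -> sharing rate for every ordered pair of populations."""
    return {(i, j): sharing_rate(sqtl_sets[i], sqtl_sets[j])
            for i in sqtl_sets for j in sqtl_sets}


# Hypothetical toy sets of sQTL identifiers per population.
toy_sets = {
    "CEU": {"snp1", "snp2", "snp3", "snp4"},
    "GBR": {"snp1", "snp2", "snp3"},
    "FIN": {"snp1", "snp5"},
}
rates = sharing_matrix(toy_sets)
```

Because the denominator is |Xi|, the measure is not symmetric: in this toy example the CEU to GBR rate is 0.75, whereas the GBR to CEU rate is 1.0.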

Interpretation of trait-associated SNPs with sQTLs

Here we demonstrate the utility of sQTLs in interpreting known GWAS signals. The published trait-associated SNPs were downloaded from the catalog of published GWA studies (http://www.genome.gov/gwastudies). For each of the sQTLs compiled through the meta-analysis, the neighboring SNPs in high linkage disequilibrium (r2 > 0.8) were collected from dbSNP, thereby expanding each sQTL into a genomic locus. We then looked for those candidate loci that overlapped with the list of GWAS catalog SNPs. As a result, nine independent sQTLs overlapped the trait-associated SNPs in the GWAS catalog, and their alternatively spliced target exons were annotated using the Ensembl genome browser (Table 1). Interestingly, in six cases, the alternative exon skipping causes the loss of either a key functional domain or a coding feature conserved in vertebrates. The remaining three cases involve the skipping of exons in 5′-UTR regions that are conserved in mammals.
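The overlap procedure described above can be sketched as a set intersection. The example assumes that the LD proxies (r2 > 0.8) of each sQTL have already been retrieved and are supplied as a mapping; the function name and all SNP identifiers are hypothetical placeholders and not results of this study.

```python
def overlap_with_gwas(ld_proxies, gwas_snps):
    """Return the sQTLs whose LD-expanded locus contains a GWAS catalog SNP.

    ld_proxies: dict of sQTL id -> iterable of SNP ids in high LD with it
    gwas_snps:  set of trait-associated SNP ids from the GWAS catalog
    """
    hits = {}
    for sqtl, proxies in ld_proxies.items():
        # The locus consists of the sQTL itself plus its LD proxies.
        locus = {sqtl} | set(proxies)
        matched = locus & gwas_snps
        if matched:
            hits[sqtl] = sorted(matched)
    return hits


# Hypothetical placeholder identifiers.
toy_proxies = {
    "sqtl_A": ["proxy_1", "proxy_2"],
    "sqtl_B": ["proxy_3"],
    "sqtl_C": [],
}
toy_gwas = {"proxy_2", "sqtl_C", "unrelated_snp"}
hits = overlap_with_gwas(toy_proxies, toy_gwas)
```

Including the sQTL itself in the locus ensures that a direct match with a catalog SNP is reported even when no proxies are available.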

Table 1 The GWAS signals associated with predicted sQTLs

Discussion

We used a locally developed R/Bioconductor package, IVAS, a convenient and user-friendly tool for the discovery of sQTLs, genetic variants affecting alternative splicing. To our knowledge, two software tools for sQTL discovery have been published, GLiMMPS and sQTLseekeR. GLiMMPS counts the reads spanning splice junctions to estimate the expression level of the sQTL target exon. Unless enough junction reads are available, this method may miss sQTL target exons. sQTLseekeR, on the other hand, overcomes this issue by comparing the expression levels of whole transcripts, given as FPKM values. However, this tool does not pinpoint the alternative exons that are differentially expressed. Our tool, IVAS, decomposes the transcript FPKM values into the expression level of a specific target exon. Using IVAS, we report candidate sQTLs in three European populations. The availability of three closely related population datasets allowed us to assess the variation among studies. Population structure analysis based on genotype differences has shown that, while these populations are part of a European cluster that is distinct from both African and Asian populations, CEU and GBR are closer to each other than to FIN (Olalde et al. 2014). The three separate lists of candidate sQTLs from our IVAS runs also reflected this population structure: sharing was generally high among all three populations, and the overlap between CEU and GBR, the closer pair, was higher than the overlap of either with FIN, the more distant neighbor (Fig. 1). This indicates that our sQTL analysis captures genuine genetic components of transcript regulation rather than spurious statistical associations. The fact that the three results are comparable to one another prompted us to take an integrative approach and compile a unified list of candidate sQTLs. Figure 2b shows example boxplots for an sQTL, rs17418283, in the three datasets, illustrating the correlation between its genotype and the expression level of an MCTP1 exon. When the sharing of sQTLs is high, as in this example, the meta-analysis can improve the sensitivity of detecting sQTLs. Inconsistency among the populations can be assessed by the heterogeneity calculation that is part of the meta-analysis, and the P (heterogeneity) cutoff can be used to remove inconsistent results. We used a filter of P (heterogeneity) > 0.85 in order to minimize inconsistent associations.

Fig. 2

The effect of 18th-exon skipping in MCTP1. a The SNP rs17418283, an intronic variant of MCTP1, is associated with skipping of the 18th exon of MCTP1. b The distribution of the expression ratio of the exon for each genotype of rs17418283. A higher y-value corresponds to more exon inclusion. c The protein structure of MCTP1 was predicted by RaptorX, a web-based prediction tool for protein structure and function (Kallberg et al. 2014). Skipping of the 18th exon leads to the loss of one of the transmembrane helices (the blue arrow)

As sQTLs affect the balance of transcript expression by changing AS patterns, their functional consequences can contribute to phenotypic variation or disease susceptibility. In that regard, a comprehensive list of sQTLs can be a useful resource for understanding the basis of genetic associations with various traits or diseases. For example, MCTP1 encodes a protein with two C-terminal transmembrane domains. An intronic variant in MCTP1, rs17418283, was reported to be associated with bipolar disease (Scott et al. 2009). However, its molecular basis has been elusive. Our study found that rs17418283 is associated with the alternative skipping of the 18th exon, resulting in the loss of one of the transmembrane helices (Fig. 2). The allelic directions of our result (the allele enhancing exon skipping) and of the GWA signal (the risk allele promoting bipolar disease) turned out to be opposite. High intracellular accumulation of calcium ions has been reported in bipolar disease. Since the MCTP1 protein can bind calcium ions and plays a role in calcium-mediated signaling, it is tempting to speculate that the canonical form of MCTP1 may be one of the risk factors for bipolar disease. Further biochemical and functional studies of the alternative MCTP1 protein isoforms may help elucidate its role in the pathogenesis of bipolar disease. Currently, only a small portion of the GWAS Catalog can be explained in terms of sQTLs. As more GWAS signals are cataloged in the coming years, our current list of candidate sQTLs may find more utility. Moreover, more candidate sQTLs can be discovered by studying, at the population level, a variety of tissues other than lymphoblastoid cell lines. Such a comprehensive list of sQTLs would be a useful resource in functional genomics.