Introduction

Hepatocellular carcinoma (HCC) is a prevalent solid-organ tumor and the third leading cause of cancer mortality in the world, with growing incidence rates (Jemal et al. 2011). The disease is primarily induced by hepatitis B virus (HBV) infection, particularly in epidemic regions of Asia and Africa. Other pathogenic factors include hepatitis C virus (HCV) infection, alcoholism and aflatoxin B1 exposure (Parkin 2006; Jemal et al. 2011). To date, treatments for HCC remain very limited, which urges the development of new diagnostic and prognostic biomarkers and therapeutic targets. A comprehensive exploration of the transcriptomic alterations in HCC is critical for a better understanding of the biology of HCC and may enable the identification of new biomarkers and therapeutic targets. With the rapid development of next-generation sequencing, RNA-Seq provides a powerful way to study the transcriptome (Wang et al. 2009).

A long non-coding RNA (lncRNA) is an RNA molecule that is longer than 200 nucleotides and lacks protein-coding potential. In cancer biology, lncRNAs are emerging as important regulators influencing a wide range of biological processes such as transcription (Wang et al. 2008b; Tsai et al. 2010; Bi et al. 2013), translation (Carrieri et al. 2012), gene expression (Wang and Chang 2011), cell cycle (Yochum et al. 2007) and cellular differentiation (Young et al. 2005). Several lncRNAs have been reported to be dysregulated in HCC, such as MALAT1 (Lai et al. 2012), HOTAIR (Ishibashi et al. 2013), H19 (Matouk et al. 2007; Zhang et al. 2013), MEG3 (Braconi et al. 2011), MVIH (Yuan et al. 2012) and HULC (Panzitt et al. 2007). Despite these findings, the expression profiles and functions of lncRNAs in HCC remain poorly understood (Zhao et al. 2014); further investigation may shed new light on hepatopathogenesis and the utility of lncRNAs as biomarkers.

Alternative splicing of the precursor mRNA of a gene can produce multiple isoforms, so that a single gene can code for multiple proteins. In addition, inefficient splicing due to inclusion of premature STOP codons can lead to mRNA degradation through nonsense-mediated decay (NMD). Studies have estimated that more than 90 % of multi-exon human genes undergo alternative splicing (Pan et al. 2008; Wang et al. 2008a). This phenomenon is a major post-transcriptional regulatory mechanism involved in the development of cancers by affecting key aspects of cancer cell biology including cell proliferation, cancer metabolism, angiogenesis, apoptosis, invasiveness and metastasis (Ghigna et al. 2008; David and Manley 2010; Biamonti et al. 2012). Splicing dysregulation of genes such as FN1 (Oyama et al. 1989, 1993; Matsui et al. 1997), CD44 (Harn et al. 1994), FGFR2 (Lin et al. 2014b), NT5E (Snider et al. 2014) and Sulf1 (Gill et al. 2012) has been reported to be associated with HCC. However, knowledge of aberrant splicing in HCC is rather incomplete compared to other types of tumors (Berasain et al. 2010). Therefore, it is essential to thoroughly explore the splicing alterations in HCC through RNA-Seq.

This study characterized lncRNAs and differential splicing in HCC at the whole-transcriptome level by integrative analysis of four sets of RNA-Seq data. Of note, we identified 15 lncRNAs that were detectable only in a number of HCC samples and 115 lncRNAs that were differentially expressed between HCC and adjacent normal samples. In addition, the functions of five lncRNAs were predicted. Furthermore, nine highly recurrent differential splicing events were identified. Our findings provide important and novel insight into the transcriptional changes in HCC, justifying further investigation into the pathogenic and translational impact of these changes in HCC.

Materials and methods

Datasets

At the time of this study, four datasets on HCC were publicly available in the NCBI Sequence Read Archive (SRA) database (Shumway et al. 2010). They were designated as dataset A [accession number SRP018008 (Kang et al. 2015)], B [accession number SRP009123 (Chan et al. 2014)], C [accession number SRP007560 (Lin et al. 2014b)] and D [accession number SRP004768 (Huang et al. 2011)] in this study (Table 1). Dataset A was derived from the Asian Cancer Research Group and includes RNA-Seq data of nine HCC tissues and nine adjacent normal tissues. Dataset B comprises RNA-Seq data of three pairs of primary HCC and adjacent normal liver tissues. Dataset C contains RNA-Seq data sequenced in great depth from one pair of liver tumor and adjacent normal samples. Dataset D contains RNA-Seq data from 10 matched pairs of HCC and adjacent normal liver tissues.

Table 1 Descriptive information on the four datasets used for analysis

Primary processing and alignment of RNA-Seq reads

First, the quality of the RNA-Seq reads of the four datasets was checked using FastQC v0.10.1 (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc). FastQC showed that the quality of the RNA-Seq data in dataset D was relatively low compared with the quality reached by recent sequencing technology. Trimmomatic v0.32 (Bolger et al. 2014) was therefore applied to the fastq files of dataset D to (1) remove adapters, (2) remove leading and trailing bases with quality below 3, (3) scan each read with a 4-base wide sliding window and cut when the average quality per base falls below 15 and (4) discard reads shorter than 36 bases. After trimming and eliminating short reads, between 37.73 and 73.76 % (median: 47.87 %) of total reads were kept for each sample in dataset D for downstream analysis.
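The trimming rules (2) to (4) can be illustrated with a minimal sketch. It is a simplified reading of the steps listed above, not Trimmomatic's own implementation: adapter removal (step 1) is omitted, the function name `trim_read` is hypothetical, and the read is assumed to be cut at the start of the first window whose mean quality falls below the threshold.

```python
def trim_read(quals, leading=3, trailing=3, window=4, min_avg=15, min_len=36):
    """Return the (start, end) slice of a read to keep, or None if it is discarded.

    quals is a list of per-base Phred quality scores. Simplified illustration
    of the trimming rules in the text; not Trimmomatic's actual code.
    """
    start, end = 0, len(quals)
    # Step 2: drop leading and trailing bases with quality below the threshold.
    while start < end and quals[start] < leading:
        start += 1
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # Step 3: slide a fixed-width window from the 5' end and cut the read
    # at the first window whose mean quality is below min_avg.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break
    # Step 4: discard reads that became too short.
    if end - start < min_len:
        return None
    return start, end


# Hypothetical read: 2 poor leading bases, 40 good bases, a low-quality dip,
# 4 good bases and 1 poor trailing base.
example = [2, 2] + [30] * 40 + [5] * 4 + [30] * 4 + [2]
print(trim_read(example))
```

In this toy read the window containing one good base and three poor bases is the first to fail, so everything from that window onwards is removed.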

Afterwards, all the RNA-Seq reads were mapped to the human reference genome (release hg19) using tophat (v2.0.10) (Kim et al. 2013) with default settings and the following options ‘-p 4 -g 10 --keep-fasta-order --no-coverage-search’. In addition, a combined gene annotation file was fed into tophat through the ‘-G’ parameter. The annotation file contained annotations for 296,680 transcript IDs created by merging the RefSeq genes (downloaded from UCSC Table Browser on Mar 3, 2014), UCSC known genes (downloaded from UCSC Table Browser on Mar 1, 2014) and Ensembl (v75) genes.
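For orientation, the alignment call described above could be assembled as follows. This is only a sketch: the index name, file names and output directory are hypothetical placeholders, paired-end input is assumed, and the `-o` output option is an addition that is not mentioned in the text.

```python
import shlex


def build_tophat_command(index_base, reads_1, reads_2, gtf_path, out_dir):
    """Assemble a tophat argument list using the options quoted in the text."""
    return [
        "tophat",
        "-p", "4",                # number of threads
        "-g", "10",               # maximum number of multihits per read
        "--keep-fasta-order",
        "--no-coverage-search",
        "-G", gtf_path,           # merged RefSeq/UCSC/Ensembl annotation
        "-o", out_dir,            # assumption: output directory, not stated in the text
        index_base, reads_1, reads_2,
    ]


# All file names below are hypothetical.
cmd = build_tophat_command(
    "hg19", "sample_1.fastq", "sample_2.fastq", "merged_annotation.gtf", "tophat_out"
)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run` without shell quoting issues if the command is actually to be executed.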

The statistics on the four sequencing datasets are presented in Online Resource 1a. The total number of reads per sample in dataset A ranged between 63,720,281 and 79,543,083, with a median of 73,190,808. Of them, from 80.1 to 93.3 % were aligned to the reference genome. For samples in dataset B, the total read counts ranged from 29,479,423 to 31,701,232 (median 30,700,686), with a median mapping rate of 94.5 %. The total read counts for the one pair of samples in dataset C were 126,533,165 and 127,161,324, with alignment rates of 91.5 and 93 %, respectively. For dataset D, after trimming and dropping short reads, only the two pairs of samples from patients D_A13 and D_A39, which retained higher read counts (from 18,980,456 to 30,588,144) and alignment rates (between 90.5 and 94.9 %), were kept for downstream analysis.

Differential gene expression analysis

To identify differentially expressed genes between HCC and adjacent normal samples, meta-analyses using two different p value combination techniques (Inverse normal and Fisher methods) were performed on the multiple datasets according to Rau et al. (2014). Meta-analyses were used to increase detection power by increasing the available sample size. Before performing the meta-analyses, a differential expression analysis was performed on each individual dataset.

We performed a paired sample test with a generalized linear model (GLM) method on each individual dataset for differential expression using the edgeR (v3.2.4) program (Robinson et al. 2010). The input to edgeR is a matrix of read counts with each row corresponding to a gene and each column corresponding to a sample. Read counts per gene per sample were generated using the htseq-count python script v0.6.1 (Anders et al. 2015) in union mode with the tophat aligned reads and GENCODE v19 annotations for a total of 57,820 genes (Derrien et al. 2012; Harrow et al. 2012). For each dataset, we used genes that achieved one count per million in at least half of the samples in the dataset to perform differential expression analysis. The significance level for the per-study differential analysis was set at a Benjamini-Hochberg (BH) false discovery rate (FDR) of 0.05. A total of 2878, 2587, 670 and 6 differentially expressed genes were detected for dataset A, B, C and D, respectively. The number of differentially expressed genes detected in dataset D was very low compared with the other three datasets. We checked the distribution of p values in the four datasets and found that the null distribution of p values from dataset D diverged greatly from the expected uniform distribution and was different from those of the other three datasets. This indicates that the model used for detecting differentially expressed genes in dataset A, B and C did not fit dataset D well, probably due to the additional preprocessing of dataset D during quality control. Therefore, we decided not to include the p values from dataset D in the differential expression meta-analyses.
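The expression filter described above (one count per million in at least half of the samples) can be sketched as follows. This is an illustration rather than the edgeR code used in the study: the function name `cpm_filter` and the toy counts are hypothetical, library size is taken as the column sum of the count matrix, and "achieve" is read as CPM ≥ 1.

```python
def cpm_filter(counts, min_cpm=1.0, min_fraction=0.5):
    """Return the genes whose counts per million reach min_cpm in at least
    min_fraction of the samples.

    counts maps a gene name to a list of raw read counts, one per sample.
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    # Library size of each sample: total reads assigned to any gene.
    lib_sizes = [sum(counts[g][k] for g in genes) for k in range(n_samples)]
    kept = []
    for g in genes:
        n_pass = sum(
            1 for k in range(n_samples)
            if counts[g][k] / lib_sizes[k] * 1e6 >= min_cpm
        )
        if n_pass >= min_fraction * n_samples:
            kept.append(g)
    return kept


# Hypothetical toy matrix with four samples of 2,000,000 reads each.
toy_counts = {
    "gene_big": [1_999_997, 1_999_997, 1_999_999, 1_999_999],
    "gene_x": [3, 3, 1, 1],   # CPM 1.5, 1.5, 0.5, 0.5: passes in 2 of 4 samples
    "gene_y": [0, 0, 0, 0],   # never expressed
}
print(cpm_filter(toy_counts))
```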

Meta-analyses were performed by combining the raw p values of all genes from individual differential expression analyses of dataset A, B and C through Inverse normal and Fisher approaches. The significance threshold for meta-analyses is BH FDR <0.05 for either Fisher or Inverse normal method. There were 744 genes displaying differential expression in contradictory directions in individual expression analyses which were removed from the list of genes identified as differentially expressed via meta-analyses.
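The two p value combination techniques and the BH adjustment can be sketched with the Python standard library. This is a minimal illustration, not the implementation of Rau et al. (2014): the function names are hypothetical, the study weights of the inverse normal method are left to the caller (an assumption, since the text does not state them), and p values of exactly 0 or 1 are not handled.

```python
import math
from statistics import NormalDist


def fisher_combine(pvals):
    """Fisher method: -2 * sum(ln p) follows a chi-square with 2k degrees of freedom."""
    k = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # half of the chi-square statistic
    # Closed-form chi-square survival function for an even number of df.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def inverse_normal_combine(pvals, weights):
    """Inverse normal method: weighted sum of z-scores, renormalized to unit variance."""
    nd = NormalDist()
    norm = math.sqrt(sum(w * w for w in weights))
    z = sum(w * nd.inv_cdf(1.0 - p) for p, w in zip(pvals, weights)) / norm
    return 1.0 - nd.cdf(z)


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-dataset p values for one gene (datasets A, B and C).
gene_pvals = [0.01, 0.20, 0.03]
print(fisher_combine(gene_pvals))
print(inverse_normal_combine(gene_pvals, weights=[3.0, 1.7, 1.0]))
```

A gene would then be called differentially expressed when its BH-adjusted combined p value is below 0.05 for either method, after removing genes whose direction of change conflicts between datasets.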

After obtaining the list of differentially expressed genes, hierarchical clustering based on the list was performed across the three datasets using log-transformed normalized gene counts, logN. The logN values were calculated as \(\log N = \ln\left( \frac{Y_{gk}}{S_{k}} \cdot \overline{S} + 1 \right)\), where \(Y_{gk}\) is the total number of reads for gene g in sample k; \(S_{k}\) is the library size of sample k multiplied by the scale factor for sample k given by the trimmed mean of M values normalization method implemented in edgeR (Robinson and Oshlack 2010); and \(\overline{S}\) is the arithmetic mean of the \(S_{k}\) values. The distances between the samples of the three datasets were calculated from logN by the Spearman correlation method, while the distances between the differentially expressed genes were calculated from logN by the Pearson correlation method. Clustering was formed using the average linkage method for samples and genes, respectively.
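The logN transformation can be written out directly. The sketch below assumes that the library sizes and TMM scale factors have already been obtained (for example from edgeR); the function name and the toy numbers are hypothetical.

```python
import math


def log_normalized_counts(counts, lib_sizes, norm_factors):
    """Compute logN = ln(Y_gk / S_k * S_bar + 1) for every gene g and sample k.

    counts is a list of rows (one per gene), each a list of raw counts per sample.
    S_k is the library size of sample k times its TMM scale factor, and
    S_bar is the arithmetic mean of the S_k values.
    """
    effective = [size * factor for size, factor in zip(lib_sizes, norm_factors)]
    s_bar = sum(effective) / len(effective)
    return [
        [math.log(y / s * s_bar + 1.0) for y, s in zip(row, effective)]
        for row in counts
    ]


# Hypothetical example: one gene, two samples with different sequencing depth.
print(log_normalized_counts([[10, 20]], lib_sizes=[100, 200], norm_factors=[1.0, 1.0]))
```

In the toy example both samples give the same value, ln(16), because the gene accounts for the same fraction of reads in each library.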

Identification of important lncRNAs in HCC by expression and correlation analysis

To identify important lncRNAs in HCC, we used the GENCODE v19 annotation (Derrien et al. 2012). The annotation contains 13,870 lncRNA genes of six lncRNA types: long intervening ncRNA (7114), antisense RNA (5276), processed transcript (515), 3′ overlapping ncRNA (21), sense intronic ncRNA (742) and sense overlapping ncRNA (202). However, 41 genes tagged as processed transcript in the GENCODE v19 lncRNA annotation are annotated as protein coding genes in the RefSeq annotation; these were treated as protein coding genes in the present study. This resulted in an annotation of 13,829 lncRNA genes and 20,386 protein coding genes for downstream analysis.

First, we counted the number of lncRNA genes expressed (number of reads mapped to the lncRNA ≥ 10) in each HCC and adjacent normal sample. We also searched for lncRNAs that were expressed (number of reads mapped to the lncRNA ≥ 10) in HCC tissues but not expressed (number of reads mapped to the gene = 0) in any of the normal tissues. In addition, differentially expressed lncRNA genes were obtained by intersecting the list of differentially expressed genes derived from the meta-analyses described in the above section with the GENCODE v19 lncRNA annotation.
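The selection of lncRNAs expressed only in HCC tissues can be sketched as a simple filter over the count matrix. The function name and the gene names in the example are hypothetical; the thresholds are those stated above.

```python
def tumor_only_lncrnas(tumor_counts, normal_counts, min_reads=10):
    """Return {gene: number of HCC samples expressing it} for lncRNAs that
    have zero mapped reads in every adjacent normal sample.

    Both arguments map a gene name to a list of read counts, one per sample.
    """
    result = {}
    for gene, tumor in tumor_counts.items():
        # Not expressed in any normal sample: exactly zero reads everywhere.
        if any(c != 0 for c in normal_counts[gene]):
            continue
        n_expressed = sum(1 for c in tumor if c >= min_reads)
        if n_expressed > 0:
            result[gene] = n_expressed
    return result


# Hypothetical counts for three lncRNAs in four tumor/normal pairs.
tumor = {"lnc_a": [25, 0, 14, 3], "lnc_b": [40, 12, 18, 30], "lnc_c": [2, 0, 5, 1]}
normal = {"lnc_a": [0, 0, 0, 0], "lnc_b": [0, 1, 0, 0], "lnc_c": [0, 0, 0, 0]}
print(tumor_only_lncrnas(tumor, normal))
```

Note that counts between 1 and 9 count as neither expressed nor absent, which is why `lnc_b` is excluded by a single read in one normal sample.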

The correlation between the differentially expressed lncRNAs and the differentially expressed protein coding genes was explored in this study. The Spearman correlation coefficients R were computed from the expression levels of differentially expressed lncRNAs and protein coding genes, represented by the normalized counts \(\frac{Y_{gk}}{S_{k}} \cdot \overline{S}\) calculated as described above. A threshold of |R| > 0.9 and FDR < 0.001 was set for significant correlation. Moreover, as highly connected genes tend to be involved in similar biological functions, we predicted the function of lncRNAs based on the function of their co-expressed protein coding genes. Functional annotation was performed on each set of protein coding genes co-expressed with an lncRNA using the biological processes (BP-FAT) annotation implemented in DAVID (Database for Annotation, Visualization and Integrated Discovery http://david.abcc.ncifcrf.gov/). The threshold for a significant gene ontology (GO) term was BH FDR < 0.1.
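A minimal sketch of the co-expression screen is given below. It computes Spearman correlation as the Pearson correlation of average ranks and applies only the |R| > 0.9 criterion; the FDR step is omitted, and all function and gene names are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def coexpressed_pairs(lncrna_expr, coding_expr, min_abs_r=0.9):
    """Return (lncRNA, coding gene, R) for every pair with |R| above the cutoff."""
    pairs = []
    for lnc, lnc_values in lncrna_expr.items():
        for gene, gene_values in coding_expr.items():
            r = spearman(lnc_values, gene_values)
            if abs(r) > min_abs_r:
                pairs.append((lnc, gene, r))
    return pairs


# Hypothetical normalized counts across five samples.
lnc = {"lnc_a": [1.0, 2.0, 3.0, 4.0, 5.0]}
coding = {"gene_x": [2.0, 4.0, 6.0, 8.0, 10.0], "gene_y": [5.0, 1.0, 4.0, 2.0, 3.0]}
print(coexpressed_pairs(lnc, coding))
```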

Identification of highly recurrent splicing alterations in HCC through comparison with adjacent normal samples

The detection of differential alternative splicing events was carried out using the exon-centric analysis of MISO (Mixture-of-Isoforms) v0.5.2 (Katz et al. 2010). As MISO does not handle replicates or groups of samples, we ran MISO on each pair of HCC and adjacent normal samples independently and then summarized the results across all 15 pairwise comparisons of the four datasets (Online Resource 2a). Five types of alternative splicing events were investigated, namely alternative 3′ splice sites (A3SS), alternative 5′ splice sites (A5SS), mutually exclusive exons (MXE), retained introns (RI) and skipped exons (SE), using the annotation file provided by MISO (hg19 v2.0) (http://miso.readthedocs.org/en/fastmiso/annotation.html). The annotation for each alternative splicing event contains two isoforms. MISO used these annotations and the tophat aligned bam files to detect differential alternative splicing events between each HCC sample and its adjacent normal sample by calculating ΔΨ and Bayes factors, which measure the magnitude and statistical significance of splicing differences, respectively.

The differential splicing events detected by each HCC sample-adjacent normal sample pairwise comparison were filtered by applying the MISO default cutoffs of |ΔΨ| ≥ 0.2 and Bayes factor ≥10. Moreover, for an event to be retained we required the sum of junction counts supporting the first or the second isoform to be ≥5 in at least one of the two samples. After filtering, each differential splicing event was summarized across all the pairwise comparisons detecting the event by counting the number of detections with positive ΔΨ values (# ΔΨ (+)) and the number of detections with negative ΔΨ values (# ΔΨ (−)). For a splicing event, different signs (+ or −) of ΔΨ represent different directions of splicing change (inclusion or exclusion of a splicing segment in HCC versus adjacent normal). We called a differential splicing event recurrent only if the number of detections in one direction exceeded that in the other direction by at least 3, that is, |# ΔΨ (+) − # ΔΨ (−)| ≥ 3. In this study, a highly recurrent differential splicing event is one occurring in more than half of the 15 HCC tissues (i.e., recurrence times ≥8).
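The filtering and recurrence rules can be sketched as follows. The record layout (one dictionary per pairwise comparison) and the function name are hypothetical, the junction-count criterion is read as "supporting junction reads ≥ 5 in the tumor or the normal sample", and the recurrence count is taken as the number of detections in the dominant direction; both readings are assumptions.

```python
def summarize_event(detections, min_dpsi=0.2, min_bf=10, min_support=5,
                    min_diff=3, min_recurrence=8):
    """Summarize one splicing event across pairwise comparisons.

    Each detection is a dict with keys 'delta_psi', 'bayes_factor',
    'support_tumor' and 'support_normal'.
    Returns (n_positive, n_negative, is_recurrent, is_highly_recurrent).
    """
    kept = [
        d for d in detections
        if abs(d["delta_psi"]) >= min_dpsi
        and d["bayes_factor"] >= min_bf
        and max(d["support_tumor"], d["support_normal"]) >= min_support
    ]
    n_pos = sum(1 for d in kept if d["delta_psi"] > 0)
    n_neg = len(kept) - n_pos
    recurrent = abs(n_pos - n_neg) >= min_diff
    highly_recurrent = recurrent and max(n_pos, n_neg) >= min_recurrence
    return n_pos, n_neg, recurrent, highly_recurrent


def make(dpsi, bf=50, tumor=20, normal=20):
    """Helper that builds a hypothetical detection record."""
    return {"delta_psi": dpsi, "bayes_factor": bf,
            "support_tumor": tumor, "support_normal": normal}


# Hypothetical event: 9 comparisons with increased inclusion, 1 with decreased
# inclusion and 5 that fail the Bayes factor cutoff.
event = [make(0.35)] * 9 + [make(-0.25)] + [make(0.40, bf=2)] * 5
print(summarize_event(event))
```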

To understand whether the highly recurrent differential splicing genes are biologically connected, we used software IPA (ingenuity pathway analysis, Ingenuity Systems, www.ingenuity.com) which can generate networks enriched by the user-provided genes (focus genes). IPA produces the networks by applying an algorithm on the list of user-specified genes and IPA knowledge base global molecular network and gives p scores to rank the networks. In a network containing n genes of which f are focus genes, the p score is the negative logarithm (base 10) of the p value representing the probability of obtaining at least f focus genes in a set of n genes randomly taken from the global molecular network calculated using Fisher’s exact test.
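The p score defined above corresponds to the upper tail of a hypergeometric distribution. The sketch below illustrates that definition only; IPA's network construction algorithm and knowledge base are not reproduced, and the sizes used in the example (genes in the global network, total focus genes) are hypothetical.

```python
from math import comb, log10


def network_p_score(n, f, total_genes, total_focus):
    """-log10 of the probability of drawing at least f focus genes when n genes
    are sampled at random, without replacement, from a global network of
    total_genes genes that contains total_focus focus genes."""
    upper = min(n, total_focus)
    favourable = sum(
        comb(total_focus, i) * comb(total_genes - total_focus, n - i)
        for i in range(f, upper + 1)
    )
    p_value = favourable / comb(total_genes, n)
    return -log10(p_value)


# Hypothetical sizes: a 35-gene network containing all 8 focus genes,
# drawn from a global network of 20,000 genes.
print(round(network_p_score(n=35, f=8, total_genes=20000, total_focus=8), 1))
```

A larger score means that the observed number of focus genes in the network is less likely to arise by chance.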

Results

Preliminary characterization of lncRNA expression profiles in HCC

LncRNAs have been shown to act as critical components of cancer biology because of their important roles in regulating key biological processes. In the present study we found that the number of lncRNAs expressed in each HCC sample was consistently greater than that in the paired adjacent normal sample across the four datasets (Fig. 1). By Wilcoxon signed-rank test, we confirmed that the number of lncRNA genes expressed in tumor samples was significantly greater than that in adjacent normal samples (p = 6.10e−5). Moreover, we identified 15 lncRNAs that were not detected in any of the adjacent normal samples but were expressed in between five and seven of the 15 HCC samples, namely NOVA1-AS1, CTC-261N6.1, XX-C2158C6.3, RP11-284G10.1, RP11-199O14.1, RP11-346D6.6, RP1-90L14.1, RP11-608O21.1, RP11-565A3.2, RP11-400D2.2, RP11-6N13.1, RP11-962G15.1, RP11-1038A11.3, RP11-103J17.2 and RP11-109M17.2 (Fig. 2). Of the 15 lncRNAs, only RP11-109M17.2 is a single-exon gene according to the GENCODE annotation. The 15 lncRNAs are worthy of further investigation for their roles in HCC tumorigenesis.
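The reported p value is what an exact two-sided Wilcoxon signed-rank test returns when all 15 paired differences have the same sign, since 2/2^15 ≈ 6.10e−5. The sketch below reproduces this arithmetic; it assumes untied, non-zero differences and a two-sided exact test, which the text does not state explicitly, and the function name is hypothetical.

```python
def exact_signed_rank_p(n, w_minus):
    """Two-sided exact Wilcoxon signed-rank p value for n untied, non-zero
    paired differences, where w_minus is the rank sum of negative differences."""
    max_sum = n * (n + 1) // 2
    # ways[s] = number of subsets of the ranks {1..n} whose sum equals s.
    ways = [0] * (max_sum + 1)
    ways[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            ways[s] += ways[s - rank]
    w = min(w_minus, max_sum - w_minus)
    tail = sum(ways[: w + 1])
    return min(1.0, 2 * tail / 2 ** n)


# All 15 tumor samples express more lncRNAs than their paired normal sample,
# so the rank sum of negative differences is 0.
print(exact_signed_rank_p(15, 0))
```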

Fig. 1

Boxplots displaying the number of lncRNAs expressed in HCC and adjacent normal samples. The y-axis reflects the number of lncRNAs. The cutoff for claiming an lncRNA being expressed is 10 reads mapped to the gene. The number of lncRNAs expressed in each sample was also plotted on the boxplots as a dot with specific color and shape for each patient as shown in the legend

Fig. 2

Expression matrix of 15 lncRNAs which were only expressed in HCC tissues but were not expressed in any adjacent normal samples. Being expressed was defined as having mapped read count ≥10 illustrated by red color. Not being expressed was defined as having mapped read count = 0 represented by light blue color. Expression values in the middle of 0 to 10 mapped read counts were depicted using antique white color. The x-axis represents sample names. Sample names ending in T represent HCC samples. Sample names ending in N represent adjacent normal samples. The y-axis represents lncRNAs (color figure online)

Detection of differentially expressed lncRNAs in HCC in comparison with adjacent normal samples

Before detecting differentially expressed lncRNA genes between HCC and adjacent normal samples, we first identified 3112 differentially expressed genes through p value combination approaches (Inverse normal and Fisher methods) using dataset A, B and C (Online Resource 2b). Hierarchical clustering performed on the three datasets based on the 3112 genes showed clearly distinguishable gene expression profiles between HCC and adjacent normal samples (Online Resource 2c).

Based on the 3112 differentially expressed genes and GENCODE v19 lncRNA annotation, we identified 35 up- and 80 down-regulated lncRNA genes in HCC samples in comparison with adjacent normal samples (Online Resource 1b). Several of these differentially expressed lncRNAs have been well-studied including H19, MEG3, HAND2-AS1, RN7SK, LINC00261 and TP53TG1 down-regulated in HCC samples and GAS5, LINC00152, PVT1 and SNHG1 up-regulated in HCC samples in this study. The 20 most significantly differentially expressed lncRNA genes in HCC samples according to the FDR obtained by Fisher method are listed in Table 2, which contains lncRNA HAND2-AS1.

Table 2 The top 20 significantly differentially expressed lncRNAs in HCC samples in comparison with adjacent normal samples according to the false discovery rate (FDR) produced by meta-analysis using Fisher method

Co-expression analysis of lncRNAs and protein coding genes

This study investigated the effects of expression changes of the 115 differentially expressed lncRNAs on the expression of differentially expressed protein coding genes in HCC. As a result, we identified 212 pairs of co-expressed lncRNAs and protein coding genes formed by 33 lncRNAs (3 were up-regulated in HCC samples and 30 were down-regulated in HCC samples) and 173 protein coding genes, with 210 pairs presented as positive correlation and only two pairs presented as negative correlation (Fig. 3; Online Resource 1c). The two negatively co-expressed pairs are LINC01093–TTK (R = −0.912, FDR = 4.14e−07) and LINC01093–NDC80 (R = −0.902, FDR = 7.91e−07). Moreover, there were five differentially expressed lncRNAs positively correlated with six nearby differentially expressed protein coding genes (distance <300 kb) forming co-expressed pairs, including the top two significantly positively co-expressed pairs CTD-2044J15.2–SRD5A1 (R = 0.979, FDR = 4.26e−05) and CTB-167B5.2–STEAP4 (R = 0.976, FDR = 6.95e−12) along with TNRC6C-AS1–TMC8 (R = 0.931, FDR = 8.40e−08), CTD-2337J16.1–LILRB5 (R = 0.953, FDR = 7.09e−09), CTD-2337J16.1–LILRB2 (R = 0.904, FDR = 6.91e−07) and RP11-42O15.3–CTH (R = 0.922, FDR = 2.13e−07). In addition, we found that MEG3 was co-expressed with LAMC3 (R = 0.904, FDR = 0.0001) and C8orf58 (R = 0.905, FDR = 6.54e−07).

Fig. 3

Coding-non coding co-expression network. The plot exhibited the co-expression network formed by 33 differentially expressed lncRNAs and 173 differentially expressed protein coding genes. Each lncRNA was represented by an ellipse and each protein coding gene was denoted by a diamond. Pink color represents the gene up-regulated in HCC samples compared with the adjacent normal samples, and green color indicates the gene down-regulated in HCC tissues compared with the adjacent normal tissues (color figure online)

Based on the coding–non coding co-expression pairs, we also performed functional prediction of the lncRNAs according to the GO biological processes (BP-FAT) terms enriched among their co-expressed protein coding genes using DAVID. In total, the functions of five lncRNAs (LINC00261, RP11-119D9.1, AC004538.3, CTD-2044J15.2 and CTC-505O3.2) were predicted. All five lncRNAs were down-regulated in HCC samples (Online Resource 1b). The five sets of protein coding genes correlated with the five lncRNAs were all enriched in the biological process of oxidation and reduction (Table 3). In addition, functional annotation also showed an association between LINC00261 and lipid metabolism and between AC004538.3 and immunity.

Table 3 Predicted function of lncRNAs based on the co-expressed protein coding genes

Identification of highly recurrent differential splicing events between HCC and adjacent normal samples

Aberrant splicing is a hallmark of cancer which was investigated by MISO exon-centric analysis. An average of 82 A3SS, 57 A5SS, 83 MXE, 135 RI and 324 SE differential splicing events were detected by the 15 HCC-adjacent normal pairwise comparisons after filtering (Fig. 4). Differential exon skipping and differential intron retention seem to be the predominant differential splicing types. Moreover, 43 A3SS, 27 A5SS, 37 MXE, 84 RI and 199 SE differential splicing events were identified as recurrent differential splicing events based on our definition (|# ΔΨ (+) − # ΔΨ (−)| ≥ 3) (Fig. 4; Online Resource 1d-h).

Fig. 4

Number of differential splicing events detected between HCC and adjacent normal samples. There are five types of differential splicing including differential alternative 3′ splice sites (A3SS), differential alternative 5′ splice sites (A5SS), differential mutually exclusive exons (MXE), differential retained introns (RI) and differential skipped exons (SE). The height of a hollow bar represents the average number of differential splicing events of a specific type detected by the 15 HCC-adjacent normal pairwise comparisons. The error bar of a hollow bar represents the standard deviation. The height of a filled bar represents the number of recurrent differential splicing events of a certain type

Of note, we identified nine aberrant alternative splicing events occurring in at least 8 of the 15 HCC samples (incidence >50 %) (Fig. 5; Table 4). The highly recurrent aberrant splicing events took place in eight genes: USO1, RPS24, CCDC50, THNSL2, SLC39A14, NR1I3, FN1 (two events) and NUMB. To the best of our knowledge, this is the first report associating the aberrant splicing of SLC39A14 and NR1I3 with HCC. The other seven events have been shown to be present in HCC before (Oyama et al. 1989, 1993; Matsui et al. 1997; Huang et al. 2011; Danan-Gotthold et al. 2015; Lu et al. 2015; Zhang et al. 2015). Among the eight genes, SLC39A14 (Fisher FDR = 0.006) and NR1I3 (Fisher FDR = 0.002) were down-regulated in HCC samples compared to adjacent normal samples. The other six genes were not differentially expressed between HCC and adjacent normal samples.

Fig. 5

Illustration of nine alternative splicing isoforms up-regulated in relative abundance in the majority of the HCC samples compared with each corresponding adjacent normal sample. The isoforms were derived from USO1, RPS24, CCDC50, THNSL2, NUMB, FN1, SLC39A14 and NR1I3. A red rectangle represents the exon is skipped from the isoform, while a green rectangle denotes the exon is included in the isoform. A yellow rectangle represents an intron retained in the isoform. For each isoform, the reference transcript and exon number were denoted in the plot. The translated amino acid sequence of each isoform was also provided with domain structure plotted and corresponding coordinates indicated to scale. The amino acid regions translated by the splicing sequences were shown in the plot. ARM, Armadillo/beta-catenin-like repeats; Uso1_p115_head, Uso1/p115 like vesicle tethering protein, head region; Uso1_p115_c, Uso1/p115 like vesicle tethering protein, C terminal region; PTZ00071, 40S ribosomal protein S24; CCDC50_N, coiled-coil domain-containing protein 50 N-terminus; Thr-synth_2, threonine synthase; PTB_Numb, Numb phosphotyrosine-binding (PTB) domain; NumbF, NUMB domain; ED-A, extra-domain A; CS1, type III connecting segment 1; FN3, fibronectin type 3 domain; ZIP, ZIP zinc transporter; GVQW, putative binding domain (color figure online)

Table 4 List of highly recurrent aberrant alternative splicing events identified in HCC tissues in comparison with adjacent normal tissues

The most frequent differential splicing gene is USO1 identified in 11 out of 15 HCC tissues. In the 11 HCC tissues the exon (exon 15 of NM_001290049)-excluding isoform of USO1 was significantly up-regulated in isoform percentage (Fig. 6). In addition, we observed that FN1 isoforms containing extra-domain A (ED-A) region (Online Resource 2d) or type III connecting segment 1 (CS1) (Online Resource 2e) region were up-regulated in relative abundance in 53.3 and 66.7 % HCC samples relative to adjacent normal samples, respectively. Seven out of the eight patients undergoing differential splicing of FN1 at the ED-A region also went through differential splicing at the CS1 region indicating a coordinated dysregulation of splicing of FN1 at these two regions in HCC tissues compared with adjacent normal tissues. The remaining six highly recurrent differential splicing events were illustrated in Online Resource 2f-k.

Fig. 6

Sashimi plot illustrating the differential exon skipping event of USO1 in 11 patients undergoing this event. In the plot, red color represents HCC samples and green color represents adjacent normal samples. The exon excluding isoform of USO1 is up-regulated in isoform percentage in each of the 11 HCC samples compared with the corresponding adjacent normal sample. The thickness of each arc is in proportion to the number of reads spanning the particular junction. The actual number of junction reads was also plotted in each arc. The x-axis represents the genomic coordinates. The y-axis represents the RPKM (Reads Per Kilobase per Million mapped reads) values (color figure online)

The eight highly recurrent differential splicing genes were involved in a network with the functions cell-to-cell signaling and interaction, cell-mediated immune response and cellular development through IPA network analysis (score = 24) (Online Resource 2l). The specific functions of these HCC-related aberrant splicing events are worthy of further investigation for their roles in the development of HCC.

Detection of three splicing factors down-regulated in HCC

Alternative splicing is regulated through the interplay between cis-acting sequence elements of the pre-mRNA and trans-acting splicing factor proteins that bind to them. We checked whether any of the 71 splicing factors from SpliceAid-F database (Giulietti et al. 2013) was in the list of differentially expressed genes detected by differential expression meta-analyses described above. We found that three splicing factors ESRP2 (Fisher FDR = 2.41e−05), CELF2 (Fisher FDR = 0.06) and SRSF5 (Fisher FDR = 0.005) were down-regulated in HCC samples compared with the adjacent normal samples (Fig. 7) suggesting that the splicing dysregulation in HCC might be influenced by these splicing factors. Figure 7 was created based on the normalized counts of the three splicing factors.

Fig. 7

Boxplots illustrating the expression level of three splicing factors ESRP2, CELF2 and SRSF5. The y-axis reflects normalized counts of a splicing factor. The x-axis represents HCC and adjacent normal samples. For each splicing factor the normalized count of each sample was also plotted in the plot as represented by a red circle (filled circle) for each HCC sample and a blue circle (filled circle) for each adjacent normal sample (color figure online)

Discussion

HCC is an aggressive and deadly cancer for which effective treatments are still lacking. To further elucidate the mechanism of HCC tumorigenesis and identify potential biomarkers and therapeutic targets, this study investigated the transcriptome of HCC with a focus on lncRNAs and alternative splicing through integrative analysis of multiple datasets, which revealed important transcriptional changes in HCC.

Accumulating evidence suggests the existence of vital regulatory roles of lncRNAs in cancer (Young et al. 2005; Yochum et al. 2007; Wang et al. 2008b; Tsai et al. 2010; Wang and Chang 2011; Carrieri et al. 2012; Bi et al. 2013). In the present study, we found that there are more lncRNAs expressed in HCC samples than in the adjacent normal samples as shown by all pairwise comparisons, which suggested a direct correlation of the number of expressed lncRNAs and HCC development. Further analysis identified 15 lncRNA genes which were only detectable in a number of HCC samples and may be potential biomarkers of HCC.

More than a hundred lncRNAs were detected as differentially expressed between HCC and adjacent normal tissues. Some well-studied lncRNAs including H19, MEG3, HAND2-AS1, RN7SK, LINC00261 and TP53TG1 were found to be down-regulated in HCC tissues in this study. Consistently, Zhang et al. (2013) found under-expressed H19 in intratumoral HCC tissues (T) vs peritumoral tissues (L) and a low T/L ratio of H19 associated with poor prognosis. Several studies have also shown down-regulation of the tumor suppressor MEG3 in HCC, which is due to hypermethylation of the MEG3 promoter region (Braconi et al. 2011; Zhuo et al. 2015). Of interest, this study detected MEG3 co-expressed with two protein coding genes, LAMC3 and C8orf58. Significant methylation was found in the promoter region of LAMC3 in breast cancer (Kuznetsova et al. 2007). HAND2-AS1, also known as DEIN, was first identified by Voth et al. (2007) as displaying a high expression level in stage IVS neuroblastoma. Consistent with our finding for LINC00261, Lin and Chuang (2012) found repressed expression of LINC00261 in primary cultured invasive phenotype HCC cells compared to their corresponding parent cells. TP53TG1 was originally isolated from a colon cancer cell line and may have an effect on the signaling pathway of TP53 and the response to cellular damage (Takei et al. 1998). In addition, we found some lncRNAs over-expressed in HCC tissues, such as GAS5, LINC00152, PVT1 and SNHG1. GAS5 may have a pro-apoptotic attribute due to its repressive action on the glucocorticoid receptor during starvation (Kino et al. 2010). Neumann et al. (2012) detected LINC00152 as differentially hypomethylated during hepatocarcinogenesis. Studies by Wang et al. (2014) and Ding et al. (2015) supported our finding of up-regulation of PVT1 in HCC and showed that PVT1 positively regulates HCC cell proliferation and stemness by stabilizing NOP2 nucleolar protein. Yan et al. (2015) reported a breakpoint between c-Myc and PVT1 commonly detected in early onset HCC leading to the overexpression of c-Myc and PVT1 in tumors. In addition, SNHG1 was found to be up-regulated in prostate cancer as well (Berretta and Moscato 2010).

This study also investigated the co-expression profile between differentially expressed lncRNAs and differentially expressed protein coding genes. The rationale is that lncRNAs have a regulatory role in the transcription of many protein coding genes (Wang et al. 2008b; Tsai et al. 2010; Bi et al. 2013) and that many co-expressed genes are functionally related, for example by acting in the same signal transduction pathway. Consistent with Ren et al. (2012), most of the co-expression pairs in this study displayed positive correlation, suggesting an enhancer-like role of these lncRNAs in regulating the transcription of the protein coding genes. Only two co-expression pairs showed negative correlation, both involving the lncRNA LINC01093, which was negatively correlated with TTK and NDC80. The mRNA levels of TTK and NDC80 were elevated in HCC samples in this study. Both genes participate in the regulation of mitosis (Chen et al. 1997; Dou et al. 2004; Huang et al. 2009; Sundin et al. 2011). TTK, which encodes a protein kinase, was shown to be a promising prognostic marker in HCC (Miao et al. 2014). NDC80 has been implicated in the pathogenesis of HCC and could be a treatment target (Liu et al. 2015). The negative correlation between the expression level of LINC01093 and those of TTK and NDC80 found in this study suggests that LINC01093 could serve as a prognostic biomarker and therapeutic target for HCC. LINC01093 was nominated as a cancer-associated lncRNA in a recent study comprehensively delineating the landscape of human lncRNAs (Iyer et al. 2015). Generally, lncRNAs show higher cancer- and tissue-specificity than protein coding genes, suggesting that they can be powerful biomarkers and drug targets (Iyer et al. 2015; Sahu et al. 2015).
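
The idea of a co-expression pair and the sign of its correlation can be made concrete with a minimal sketch. It assumes that co-expression is measured by the Pearson correlation coefficient across samples and that a pair is called when the absolute coefficient reaches a cutoff; the correlation measure, the cutoff, the placeholder gene names and the expression values are all assumptions made for illustration and do not reproduce the analysis of this study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def coexpression_pairs(lnc_expr, mrna_expr, cutoff):
    """Return (lncRNA, gene, r) for every pair with |r| >= cutoff."""
    pairs = []
    for lnc, x in lnc_expr.items():
        for gene, y in mrna_expr.items():
            r = pearson(x, y)
            if abs(r) >= cutoff:
                pairs.append((lnc, gene, r))
    return pairs


# Hypothetical expression values across four samples (placeholder names).
lnc_expr = {"lncRNA_X": [8.0, 6.0, 4.0, 2.0]}
mrna_expr = {
    "gene_P": [1.0, 2.0, 3.0, 4.0],  # falls as lncRNA_X rises: r = -1
    "gene_Q": [1.0, 3.0, 2.0, 4.0],  # weaker negative relation: r = -0.8
}

pairs = coexpression_pairs(lnc_expr, mrna_expr, cutoff=0.9)  # assumed cutoff
negative_pairs = [p for p in pairs if p[2] < 0]
```

A negative coefficient, as for `gene_P` here, means the protein coding gene tends to be high where the lncRNA is low. Correlation alone does not establish that the lncRNA represses the gene.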

In many cases, lncRNAs seem to regulate the expression of their neighboring protein coding genes through different mechanisms (Guttman et al. 2009; Orom et al. 2010; Kambara et al. 2015). This study identified five lncRNAs correlated with six vicinal protein coding genes, among which TMC8, LILRB5 and LILRB2 were involved in immunity (Borges et al. 1997; Shiroishi et al. 2006; Crequer et al. 2013).

The correlation analysis also provided us with novel insights into the function of five lncRNAs, all of which were predicted to be involved in oxidation–reduction process. Studies have demonstrated the association between the dysregulation of redox and the carcinogenesis of HCC (Vali et al. 2008; Zhao et al. 2011; Lin et al. 2014a), which warrants further study of these five lncRNAs in the pathogenesis of HCC.

Another important aspect of the HCC transcriptome examined in this study is differential splicing. The alternative splicing profiles of several genes are recurrently altered in HCC, arguing for a direct role of specific splicing isoforms in HCC development. In the present study, nine highly recurrent aberrant splicing events associated with HCC were identified. To our knowledge, this is the first study to report an association between HCC and splicing alterations of SLC39A14 and NR1I3.

We identified a mutually exclusive exon (MXE) event in SLC39A14, with splicing shifted towards inclusion of exon 4B in the majority of HCC samples. Thorsen et al. (2011) proposed that this splicing is regulated by the Wnt pathway and found that the alternative splicing of SLC39A14 in colorectal tumors changes in the same direction as in our data. Franklin et al. (2012) found that SLC39A14 expression was down-regulated in hepatoma cells, which is consistent with this study. They also inferred that the absence of SLC39A14 may induce zinc depletion in hepatoma.

This study also identified an intron 7-retained isoform of NR1I3 that was up-regulated in isoform percentage in the majority of HCC samples. The retention of intron 7 in NR1I3 was first reported by Choi et al. (2013) and may lead to the production of proteins that fail to transactivate the CYP2B6 reporter gene (SV1, SV3, SV6) or of a protein with enhanced transactivation activity (SV2). NR1I3 encodes the constitutive androstane receptor (CAR), which regulates hepatic drug metabolism (Wei et al. 2000) and hepatic energy homeostasis (Kodama et al. 2004; Konno et al. 2008). Activation of CAR promotes liver injury and the development of HCC (Yamazaki et al. 2011; Kamino and Negishi 2012).
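
For readers unfamiliar with the term, "isoform percentage" and its recurrence across samples can be sketched as follows. The sketch assumes that the isoform percentage is the abundance of one isoform divided by the summed abundance of all isoforms of the same gene, and that an event counts as recurrent in a tumor/normal pair when the percentage rises by at least a chosen number of percentage points. The isoform labels, abundances and the minimum change are hypothetical.

```python
def isoform_percentages(abundances):
    """Convert per-isoform abundances of one gene into percentages of the gene total."""
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("gene has no detectable expression")
    return {iso: 100.0 * value / total for iso, value in abundances.items()}


def recurrence(sample_pairs, isoform, min_delta):
    """Fraction of (tumor, normal) pairs in which the isoform percentage
    is higher in the tumor by at least min_delta percentage points."""
    if not sample_pairs:
        raise ValueError("no sample pairs given")
    up = 0
    for tumor, normal in sample_pairs:
        delta = (
            isoform_percentages(tumor)[isoform]
            - isoform_percentages(normal)[isoform]
        )
        if delta >= min_delta:
            up += 1
    return up / len(sample_pairs)


# Hypothetical abundances for two isoforms of one gene in two patients.
sample_pairs = [
    ({"retained": 30.0, "spliced": 70.0}, {"retained": 10.0, "spliced": 90.0}),
    ({"retained": 25.0, "spliced": 75.0}, {"retained": 20.0, "spliced": 80.0}),
]

# Assumed minimum change of 10 percentage points.
fraction_up = recurrence(sample_pairs, "retained", min_delta=10.0)
```

Because the percentage is normalized to the gene total, an isoform can increase in percentage even when the overall expression of its gene decreases.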

Another splicing isoform significantly up-regulated in isoform percentage in the majority of HCC samples is the oncofetal FN1 variant containing the ED-A splice-in segment. Inclusion of the ED-A exon is regulated by PI3K/Akt/mTOR (Blaustein et al. 2005; White et al. 2010) and multiple MAPK pathways (Al-Ayoubi et al. 2012). The alternative ED-A domain serves as a vascular marker of solid tumors and metastasis (Rybak et al. 2007) and contributes to the lymphangiogenesis of colorectal tumors (Ou et al. 2010). A high-affinity human anti-ED-A monoclonal antibody, F8, has been generated that targets tumor neo-vasculature in vivo (Villa et al. 2008). Based on antibody F8, several potent anti-cancer biopharmaceuticals have been developed, such as the immunocytokines F8-IL13 (Hess and Neri 2015) and F8-IL2 (Pretto et al. 2014) and the immunocytokine drug conjugate F8–IL2–SS–DM1 (List et al. 2014). In addition, the FN1 isoform containing the CS1 exon was up-regulated in isoform percentage in the majority of HCC samples. The CS1 domain has a selective affinity for some tumor cells (Humphries et al. 1987) and lymphoid cells (Wayner et al. 1989).

The NUMB PRRL isoform, which was significantly increased in relative abundance in three-fifths of HCC samples in this study, has been linked to tumorigenesis in lung cancer (Misquitta-Ali et al. 2011). This isoform is formed by inclusion of an exon, which reduces the level of NUMB protein expression and activates the Notch signaling pathway (Misquitta-Ali et al. 2011). Recently, Lu et al. (2015) also observed strong expression of the PRRL isoform in HCC. Bechara et al. (2013) found that NUMB PRR alternative splicing is regulated by the RNA-binding motif proteins RBM5, RBM6 and RBM10 in the control of cancer cell proliferation.

Because these splicing isoforms are significantly up-regulated in isoform percentage in the majority of HCC tissues, it may be possible to deliver anti-cancer molecules (e.g., cytotoxic drugs, cytokines or radionuclides) to the HCC site in a targeted manner using binding molecules such as human antibodies specific to the isoforms (Schrama et al. 2006; List et al. 2014; Hess and Neri 2015). Another way to treat HCC might be to correct these splicing isoforms with drugs that lead to activation of the NMD pathway or generation of inactive cell cycle genes, as demonstrated by Kaida et al. (2007), Kotake et al. (2007), Chang et al. (2011) and Corrionero et al. (2011).

Alternative splicing is regulated by the interaction between sequence elements of the pre-mRNA and the splicing factor proteins that bind to them. Three splicing factors, ESRP2, CELF2 and SRSF5, were down-regulated in HCC samples compared with adjacent normal samples, suggesting that the splicing alterations in HCC may be influenced by these splicing factors. ESRP2 and ESRP1 were shown to be down-regulated in cells during epithelial-mesenchymal transition (EMT) (Warzecha et al. 2009), which is a sign of cancer progression. In addition, Xiao et al. (2014) suggested that CELF2 induces apoptosis of HCC cells.

In summary, this study comprehensively characterized the HCC transcriptome with regard to lncRNAs and alternative splicing. By applying an integrative approach to the analysis of four RNA-Seq datasets, we observed that the number of lncRNAs expressed in each HCC sample was consistently greater than in the paired adjacent normal sample. Furthermore, 15 lncRNAs were expressed in five to seven HCC samples but were not detected in any adjacent normal sample. Differential expression analysis detected 35 up-regulated and 80 down-regulated lncRNAs in HCC samples compared with adjacent normal samples, among which five lncRNAs were predicted to be involved in the oxidation–reduction process. Differential splicing analysis revealed nine highly recurrent differential splicing events in eight genes: USO1, RPS24, CCDC50, THNSL2, NUMB, FN1, SLC39A14 and NR1I3. To our knowledge, this is the first study to report that aberrant splicing of SLC39A14 and NR1I3 is associated with HCC. Alternative splicing in HCC may be influenced by three splicing factors, ESRP2, CELF2 and SRSF5, which were significantly down-regulated in HCC samples. These findings add new information that may aid in understanding the pathogenesis of HCC. The molecules identified in this study are worthy of further investigation as potential biomarkers and possible therapeutic targets for the disease.