Introduction

Prostate cancer is a type of cancer that develops in the prostate, a gland in the male reproductive system. Detection rates of prostate cancer vary widely across the world, with higher rates in developed countries than in developing countries. It is the most frequently diagnosed cancer in American men, and its incidence has risen rapidly in recent years. Among men in the United States, prostate cancer accounts for more than 200,000 new cancer cases and 32,000 deaths annually [1]. These figures underline the importance of research on prostate cancer.

Androgen deprivation therapy yields only transient efficacy in prostate cancer patients, and many patients do not survive the disease. With the development of next-generation sequencing, many somatic mutations and other genomic alterations have been found, and our knowledge of prostate cancer mutations has expanded. For example, by exome sequencing of 112 pairs of prostate cancer tissue this year, Gordon’s team not only found three genes, MED12, FOXA1 and SPOP, that are recurrently mutated in prostate cancer patients, but also found a gene fusion [2]. Based on integrated exome copy number analysis, Kenneth identified disruptions of CHD1 that define a subtype of ETS gene family fusion-negative prostate cancer [3]. All of these genomic alterations found by next-generation sequencing are potential treatment targets for the future.

RNA-seq, also called Whole Transcriptome Shotgun Sequencing (WTSS), uses high-throughput sequencing technologies to sequence cDNA in order to obtain information about a sample’s RNA content [4], such as gene expression levels and new isoforms. Soon after this technology was published, it was adopted in disease research fields such as cancer [5]. In Mark’s study, based on RNA-seq results from prostate cancer tissue, the authors detected non-ETS gene fusions in human prostate cancer. They discovered and characterized seven new cancer-specific gene fusions, two involving the ETS genes ETV1 and ERG [6]. In 2012, aiming to find ethnic variation, scientists from the University of Michigan Medical School also used RNA-seq technology to gain deeper insight into Chinese prostate cancer patients [7].

A non-coding RNA (ncRNA) is a functional RNA molecule that is not translated into a protein. The class includes abundant RNAs such as tRNA, miRNA, snoRNA, Piwi-interacting RNA and rRNA. A large number of ncRNAs are still unknown, but many, especially small RNAs, have recently been found through bioinformatics studies and new experimental technologies. The genome sequencing projects revealed an unexpected problem in our understanding of the molecular basis of developmental complexity in higher organisms: complex organisms have fewer protein-coding genes than anticipated. Non-coding RNAs have since been shown to contribute to the much greater architectural complexity of eukaryotes [8]. Moreover, miRNAs drew much scientific attention after the Nobel Prize was awarded to their discoverers. Following the recognition of the important roles of small non-coding RNAs, such as miRNA and Piwi-interacting RNA, in animal development [9], long non-coding RNAs have also drawn scientific attention. An ncRNA longer than 200 bp is termed a long non-coding RNA (lncRNA). This rapidly advancing field has shown the great potential of their regulatory functions [10]. In 2011, Howard and his team found that expression of the long non-coding RNA HOTAIR is increased in primary breast tumors and metastases, and that the HOTAIR expression level in primary tumors is a powerful predictor of eventual metastasis and death [11]. All these findings suggest that non-coding RNAs, including miRNAs, non-transcript genes and long ncRNAs, play active roles in modulating the cancer genome and may be important targets for cancer diagnosis and therapy.

In our study, based on RNA-seq results from human prostate cancer tissue, we analyzed data from prostate cancer samples and control samples: we aligned the reads, assembled the transcripts, and finally obtained the transcripts and non-coding RNAs that may be important targets for cancer diagnosis and therapy.

Materials and Methods

Data Acquisition

Our project is based on the RNA-seq data from a previous study [12]. All of the data are available from the European Nucleotide Archive [13] (ENA; http://www.ebi.ac.uk/ena), the primary nucleotide sequence repository of Europe. ENA collects a comprehensive record of the world’s nucleotide sequencing information and consists of three main databases: the Sequence Read Archive (SRA), the Trace Archive and EMBL-Bank. When collecting sequencing data, we applied the following rules: 1) paired-end sequencing; 2) read length of more than 50 bp. These two rules were chosen because of our alignment tools, as we explain later.

Data Preprocessing

Following the preprocessing method of the previous study from which our data come, we filtered out reads that met the following cutoff conditions: (1) the proportion of N bases exceeds 2 %; (2) the proportion of low-quality bases (Q ≤ 15) exceeds 50 %. We then plotted the base quality distribution to profile the effect of the filtering.
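The two cutoffs above can be illustrated with a minimal sketch. It assumes that a read is discarded when either condition holds, and that quality strings use the Phred+33 encoding; neither detail is stated in the text, and the function names are hypothetical.

```python
def phred33_to_scores(quality_string):
    """Convert a FASTQ quality string to Phred scores (Phred+33 is an assumption)."""
    return [ord(ch) - 33 for ch in quality_string]


def keep_read(seq, quals, max_n_frac=0.02, low_q=15, max_low_q_frac=0.50):
    """Return True if the read passes both cutoffs.

    A read is dropped when more than 2 % of its bases are N, or when more
    than 50 % of its bases have quality Q <= 15. Treating the two cutoffs
    as "either one discards the read" is an assumption of this sketch.
    """
    if not seq or len(seq) != len(quals):
        return False  # malformed record: sequence and qualities must match
    n = len(seq)
    n_frac = sum(base in "Nn" for base in seq) / n
    low_frac = sum(q <= low_q for q in quals) / n
    return n_frac <= max_n_frac and low_frac <= max_low_q_frac
```

For paired-end data, the same check would be applied to each mate; how a pair with one failing mate is handled is not specified here.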

Alignment, Assembly and Abundance Estimation

The traditional RNA-seq data analysis method was based on de novo assembly followed by alignment to a reference for sequence annotation. However, this method finds new transcripts only by matching different genes between the two ends of the reads, so it is largely limited by the length and number of reads and cannot detect the breakpoint region.

The new method aligns reads to genes and splice sites, and then builds simulated exon-exon reference data by assembling the splice sites, in order to find as many differentially expressed genes and transcripts as possible.

It can assign the fragment ends to different exons to determine which splice variant is correct, without requiring prior annotation information.

In this paper, we use this new method for the bioinformatics analysis. There are three steps:

  1. Alignment. TopHat [14] was chosen for alignment. It aligns reads to the genome using Bowtie and then analyzes the mapping results to identify splice junctions between exons. We used hg19 to construct the reference library, with the following settings: 1) minimum intron length of 70; 2) maximum intron length of 500000; 3) tolerance of 3 bp deletions/insertions; 4) tolerance of two mismatches. Samples 10 N and 10 T were mapped, which generated two BAM files.

  2. Transcript assembly. We used the Cufflinks [15] software to assemble transcripts. The following parameters were set for assembly: 1) mean inner distance between mate pairs of 20; 2) standard deviation for inner distance between mate pairs of 20.

  3. Abundance estimation. We also used Cufflinks to estimate the relative abundances of these transcripts based on how many reads support each one. Two normalization methods, quartile normalization and bias correction, were used to improve the accuracy of the transcript abundance estimates.

Merging Transcripts

The two transcript assemblies produced for samples 10 N and 10 T were merged with Cufflinks. The merging conditions were: 1) the transcripts have different IDs but the same positions; 2) the genome mappings of the transcripts intersect; 3) the distance between the transcripts is less than 500 bp. Under these conditions, we obtained a new transcript set without redundant information.
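The merging conditions can be read as an interval-merging rule: transcripts on the same chromosome are grouped when they coincide, overlap, or lie less than 500 bp apart. The sketch below illustrates that reading only; it is not the algorithm Cufflinks uses internally, and the tuple layout, the function name and the decision to ignore strand are assumptions.

```python
from collections import defaultdict


def merge_transcripts(transcripts, max_gap=500):
    """Group transcripts that overlap or lie less than max_gap bp apart.

    transcripts: iterable of (chrom, start, end, transcript_id) tuples.
    Returns a list of (chrom, start, end, [transcript_ids]) tuples.
    Strand is ignored here, which is an assumption of this sketch.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, tid in transcripts:
        by_chrom[chrom].append((start, end, tid))

    merged = []
    for chrom in sorted(by_chrom):
        current = None  # [start, end, ids] of the group being built
        for start, end, tid in sorted(by_chrom[chrom]):
            # A negative gap means the two intervals overlap.
            if current is not None and start - current[1] < max_gap:
                current[1] = max(current[1], end)
                current[2].append(tid)
            else:
                if current is not None:
                    merged.append((chrom, current[0], current[1], current[2]))
                current = [start, end, [tid]]
        if current is not None:
            merged.append((chrom, current[0], current[1], current[2]))
    return merged
```

With this rule, a transcript from 10 N and an overlapping transcript from 10 T end up in the same group, so their expression values can be compared directly.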

Analysis of Transcript Expression

Combining the assembled transcripts with the alignments produced by TopHat, we computed the expression value of every transcript. Traditionally, expression is represented by RPKM [16], the number of reads mapped to a gene per kilobase of transcript per million mapped reads, which accounts for the effect of sequencing depth on read counts. Because our reads are paired-end, we can join the two reads of a pair to reconstruct the fragment that was input to the sequencer. Following the RPKM algorithm, we counted fragments instead of reads and obtained FPKM values. Using FPKM in place of RPKM as the expression value is more reliable [17].
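As a worked illustration of the definition, the sketch below computes FPKM directly from a fragment count. It shows the normalization only; Cufflinks estimates abundances with a statistical model, so its reported values will generally differ from this simple calculation. The function name and the example numbers are hypothetical.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM = fragments * 1e9 / (length in bp * total mapped fragments),
    which is the same as fragments / (length in kb) / (total in millions).
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2,000 bp transcript
# in a library of 20 million mapped fragments.
example = fpkm(500, 2000, 20_000_000)  # 12.5
```

RPKM uses the same formula with read counts in place of fragment counts, so a fully mapped read pair contributes two counts to RPKM but a single count to FPKM.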

Finding Significant Transcripts

Some transcripts are expected to have significantly different FPKM values between the two samples. We therefore paired the FPKM values from the two samples by transcript, calculated the fold change, and computed the p-value. We then used these two values for each transcript to draw a volcano plot, from which we obtained the significance boundary that defines whether a transcript is differentially expressed.
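The coordinates of one point on a volcano plot can be sketched as follows. The direction of the ratio (tumor over normal) is an assumption, and the p-value is taken as an input because the test used to compute it is not specified here; the function name is hypothetical.

```python
import math


def volcano_point(fpkm_normal, fpkm_tumor, p_value):
    """Return (log2 fold change, -log10 p-value) for one transcript.

    The fold change is computed as tumor / normal, which is an assumption
    of this sketch. Both FPKM values must be positive for the ratio and
    its logarithm to be defined.
    """
    if fpkm_normal <= 0 or fpkm_tumor <= 0:
        raise ValueError("FPKM values must be positive")
    if not 0 < p_value <= 1:
        raise ValueError("p-value must lie in (0, 1]")
    return math.log2(fpkm_tumor / fpkm_normal), -math.log10(p_value)
```

Plotting the first value on the horizontal axis and the second on the vertical axis for every transcript gives the volcano plot: strongly changed and highly significant transcripts fall in the upper left and upper right corners.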

Results

Summary of Raw RNA-seq Data

The RNA-seq data, which represent the complete transcriptomic landscape of prostate cancer in the Chinese population, were downloaded from ENA. Based on the rules described above, we chose two samples, 10 N and 10 T, for our analysis; both were paired-end sequenced with a read length of 90 bp. Detailed information is shown in Table 1. Sample 10 N is the RNA-seq data for normal tissue, and sample 10 T is the RNA-seq data for prostate cancer tissue.

Table 1 Sample information table

Preprocessing Result of Sequencing Data

To evaluate the preprocessing method, we drew box plots of base quality along the whole read before and after preprocessing. Figure 1 shows the distribution of base quality before and after filtering (Fig. 1). The upper half is the base quality distribution of the raw data, and the lower half is that of the preprocessed data. The black line in each box represents the median quality score. The figure shows that: (1) base quality fluctuates less in the preprocessed data than in the raw data, which suggests that the filtering method worked; (2) overall, the data lie above Q15, with the median above Q34 and concentrated above Q36. Consequently, the quality of the reads improved significantly after preprocessing. The preprocessed data are used for all subsequent analyses. Table 2 shows the statistics of the data before and after preprocessing (Table 2).

Fig. 1 The distribution of base quality before and after preprocessing

Table 2 Read statistics before and after preprocessing

Alignment and Assembly

We used TopHat for sequence alignment and Cufflinks for transcript assembly. We believe that our method, which aligns first, has great potential to make use of as much of the RNA-seq data as possible. After assembly, we merged the “neighboring” transcripts as described in the Methods section and obtained the merged result for all transcripts. For example, if transcript A in sample 10 N overlapped transcript B in sample 10 T, we merged them for convenient comparison. Finally, samples 10 N and 10 T yielded about 400,000 and 230,000 transcripts, respectively.

FPKM Distribution

To profile the expression level of each transcript, we calculated the fragments per kilobase of transcript per million fragments mapped (FPKM). Following the FPKM calculation described above, we obtained the FPKM value of every transcript. Figure 2 shows the density distribution of the FPKM of every transcript (Fig. 2). Sample 10 T has higher FPKM values than sample 10 N, which suggests that in this pair the cancer sample has a higher overall expression level than the normal sample. Sample 10 T has two peaks in its FPKM distribution. The first peak, at 0.7–0.8 log10(FPKM), is absent from sample 10 N. The second peak, near 0, is shared by both samples. Figure 3 shows box plots of the FPKM of all transcripts in the two samples (Fig. 3), which make the distribution easier to read. Sample 10 N has a median below 0 log10(FPKM) and no prominent outliers, whereas in sample 10 T the median rises above 0 and there are many prominent outliers. To analyze these outlier transcripts further, we looked for a boundary to distinguish differential transcripts.

Fig. 2 The density distribution of FPKM (q1: 10 N; q2: 10 T)

Fig. 3 Box plot of FPKM in the two samples (q1: 10 N; q2: 10 T)

Significant Transcripts

By calculating the p-value and fold change of FPKM between the two samples, we obtained the degree of differential expression of all related transcripts. Figure 4 is the volcano plot, which shows how the related transcripts differ between the two samples (Fig. 4).

Fig. 4 Volcano plot of the two samples

Based on Fig. 4, we set the following boundaries to distinguish differential transcripts:

  1) FPKM is greater than 3 in both samples;

  2) |log2(fold_change)| > 2;

  3) P-value < 0.006.

Under these conditions, we obtained 197 significant transcripts (supplement), of which 17 are non-coding transcripts. See Tables 3 and 4.
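The three boundaries can be combined into a single predicate, sketched below. The thresholds are those listed above; the function and parameter names are hypothetical, and the log2 fold change and p-value are assumed to have been computed beforehand.

```python
def is_significant(fpkm_normal, fpkm_tumor, log2_fold_change, p_value,
                   min_fpkm=3.0, min_abs_log2_fc=2.0, max_p=0.006):
    """Apply the three boundaries to one transcript.

    1) FPKM greater than 3 in both samples,
    2) |log2(fold_change)| greater than 2,
    3) p-value below 0.006.
    """
    return (fpkm_normal > min_fpkm
            and fpkm_tumor > min_fpkm
            and abs(log2_fold_change) > min_abs_log2_fc
            and p_value < max_p)
```

Requiring a minimum FPKM in both samples excludes transcripts whose large fold change comes only from a near-zero expression value in one sample.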

Table 3 Top 10 differential transcripts
Table 4 Differential non-coding transcripts

New lncRNA Discovery

To analyze the remaining non-coding regions in depth, we focused on long non-coding RNA. We selected the assembled transcripts longer than 200 bp and mapped them against all human genes. The assembled transcripts that could not be located in any human gene are what we call lncRNA. Finally, we found 36 significantly differential lncRNAs, shown in Table 5.
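The selection rule can be sketched as a simple coordinate filter: keep transcripts longer than 200 bp that overlap no annotated gene. This is an illustration under stated assumptions, not the exact procedure used: the tuple layouts and function names are hypothetical, coordinates are treated as half-open intervals, strand is ignored, and length is taken as the genomic span rather than the spliced length.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def candidate_lncrnas(transcripts, genes, min_length=200):
    """Return IDs of transcripts longer than min_length that hit no gene.

    transcripts: list of (transcript_id, chrom, start, end)
    genes:       list of (chrom, start, end)
    """
    genes_by_chrom = {}
    for chrom, start, end in genes:
        genes_by_chrom.setdefault(chrom, []).append((start, end))

    selected = []
    for tid, chrom, start, end in transcripts:
        if end - start <= min_length:
            continue  # too short to count as a long non-coding RNA
        chrom_genes = genes_by_chrom.get(chrom, [])
        if any(overlaps(start, end, g_start, g_end)
               for g_start, g_end in chrom_genes):
            continue  # located in a known gene
        selected.append(tid)
    return selected
```

The linear scan over genes is adequate for a sketch; for a whole-genome annotation, a sorted list or an interval tree would avoid comparing every transcript against every gene on its chromosome.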

Table 5 Significant lncRNA

Discussion

Differential Coding Transcripts

As shown in Table 4, the most differential gene is TFF3 (trefoil factor 3), which showed a more than 7-fold change between prostate cancer tissue and normal tissue. cDNA expression array analyses have indicated that TFF3 may be overexpressed in prostate cancer patients, and many recent studies have reported a strong relationship between TFF3 and prostate cancer. In 2004, immunohistochemistry with antibodies specific for TFF3 was performed on a prostate cancer tissue microarray containing tumor tissue samples from 246 primary radical retropubic prostatectomy cases, and Reiter’s team confirmed that TFF3 was upregulated in those tumor samples [18]. Then, in 2008, Arul’s team reported that they had performed qPCR on seven prostate cancer biomarkers and found that TFF3 was indeed a biomarker [19]. Our project confirms this finding. The next step should be to develop a diagnostic kit for the early detection of prostate cancer. Interestingly, TFF1 was also among our top 10 differential genes, but with the opposite trend to TFF3: it was downregulated in the prostate cancer sample. Most previous studies reported that TFF1 (pS2 protein) is upregulated in prostate tumors. The members of the trefoil factor family, TFF1, TFF2 and TFF3, are all reported to be overexpressed in prostate tumors, and their plasma levels differ in patients with advanced prostate cancer [20]. However, Shahid collected 95 malignant prostatic specimens from primary adenocarcinoma and performed immunohistochemical staining; he found no significant correlation between TFF1 expression and the stage of disease, but TFF1 expression in prostate cancer correlated significantly with histological grade and neuroendocrine differentiation [21]. So, although the TFF1 trend in our analysis is opposite to that in some other studies, TFF1 may still be a biomarker, but only for some stages of prostate cancer, because its expression level may differ between stages of the disease.

Differential Non-Coding Genes

Why are we concerned with non-coding genes? Non-coding genes are often pseudogenes or open reading frames of unknown function, and many of them have not been related to disease, especially cancer. But if we find such a gene differentially overexpressed, it has great potential to be related to the disease, which in our project is prostate cancer. Among the 17 transcripts we found, only two are downregulated. The most prominent transcript is NR_022014, a transcript of the gene C15orf21. We detected a 3-fold upregulation of this gene in prostate cancer with P = 1.62E-14, which is consistent with the result of a previous study by Arul in 2007 [22]. In that study, C15orf21 was overexpressed in prostate cancer with a significant p-value of P = 3.4*10E-6, a result that our project confirms.

New lncRNA Discovery

Large intergenic non-coding RNAs (lincRNAs) are emerging as key regulators of diverse cellular processes, but determining the function of individual lincRNAs remains a challenge. In 2011, John Rinn from the Broad Institute used RNA-seq to produce the most complete catalogue of lincRNAs [23] across 24 tissues, including prostate cancer tissue. This catalogue therefore contains their results for prostate cancer related lncRNAs. In Table 5, the lncRNAs highlighted in red are those whose relationship with prostate cancer has been published; 9 such lncRNAs were found by our method. The 3 lncRNAs highlighted in blue have been published, but without a reported relationship with prostate cancer. The other 24 lncRNAs are significant in this project, so there is a strong possibility that these 24 lncRNAs are related to prostate cancer.

Interesting

When we queried these lncRNA regions on UCSC to obtain the average conservation score of each candidate or putative lncRNA, most of them showed a very low score. We supposed that, since lncRNAs are no longer regarded as “rubbish”, they should be conserved across mammals. Why, then, are their conservation scores so low? Does this mean that lncRNAs are not well conserved and change rapidly across mammals? These questions remain to be explored.