Abstract
Prostate cancer is a leading cause of cancer death in many regions, especially among American men, and its diagnosis rate has risen rapidly in recent years. We aimed to find biomarkers or other molecular changes in prostate cancer patients. With the development of next-generation sequencing, many genomic alterations have been found. Here, based on RNA-seq data from human prostate cancer tissue, we looked for transcripts and non-coding RNAs that are differentially expressed between normal tissue and prostate cancer tissue. In this study, sample 10 T is the RNA-seq data for prostate cancer tissue and sample 10 N is the RNA-seq data for normal tissue. The most differential gene was TFF3 (trefoil factor 3), with a more than sevenfold change between prostate cancer tissue and normal tissue, and the most prominent non-coding transcript belonged to C15orf21. Additionally, our method recovered 9 lncRNAs that have previously been reported in relation to prostate cancer. Overall, we found many important non-coding RNAs related to prostate cancer, some of which are long non-coding RNAs (lncRNAs).
Introduction
Prostate cancer is a cancer that develops in the prostate, a gland in the male reproductive system. Detection rates of prostate cancer vary widely across the world, with higher rates in developed countries than in developing countries. It is the most frequently diagnosed cancer in American men, and the trend has risen rapidly in recent years. Among men in the United States, prostate cancer accounts for more than 200,000 new cancer cases and 32,000 deaths annually [1]. This evidence underscores the importance of prostate cancer research.
Androgen deprivation therapy yields only transient efficacy in prostate cancer patients, and many patients do not survive the disease. With the development of next-generation sequencing, many somatic mutations and other genomic alterations have been found, and our knowledge of prostate cancer mutations has expanded. For example, by exome sequencing of 112 prostate tumor and normal tissue pairs in 2012, Gordon’s team not only found three genes, MED12, FOXA1 and SPOP, that are recurrently mutated in prostate cancer patients, but also found a gene fusion [2]. Based on integrated exome and copy number analysis, Kenneth identified disruptions of CHD1 that define a subtype of ETS gene family fusion-negative prostate cancer [3]. All of the genomic alterations found by next-generation sequencing are potential treatment targets for the future.
Among high-throughput sequencing technologies, RNA-seq, also called whole transcriptome shotgun sequencing (WTSS), sequences cDNA to obtain information about a sample’s RNA content [4], such as gene expression levels and new isoforms. Soon after the technology was published, it was adopted in disease research fields such as cancer [5]. In Mark’s study, based on RNA-seq results from prostate cancer tissue, the authors detected non-ETS gene fusions in human prostate cancer. They discovered and characterized seven new cancer-specific gene fusions, two involving the ETS genes ETV1 and ERG [6]. In 2012, aiming to find ethnic variation, scientists from the University of Michigan Medical School also used RNA-seq technology to gain deeper insight into Chinese prostate cancer patients [7].
A non-coding RNA (ncRNA) is a functional RNA molecule that is not translated into a protein. The class includes abundant RNAs such as tRNA, rRNA, miRNA, snoRNA and Piwi-interacting RNA. A large number of ncRNAs are still unknown, but recently, through bioinformatics studies and new experimental technologies, many ncRNAs have been found, especially small RNAs. The genome sequencing projects revealed an unexpected problem in our understanding of the molecular basis of developmental complexity in higher organisms: complex organisms have fewer protein-coding genes than anticipated. Non-coding RNAs have been proposed as the architects of this eukaryotic complexity [8]. Moreover, small RNAs such as miRNAs attracted much scientific attention after the Nobel Prize was awarded for the discovery of RNA interference. Given the important roles of small non-coding RNAs, such as miRNA and Piwi-interacting RNA, in animal development [9], long non-coding RNAs have drawn scientific attention as well. An ncRNA longer than 200 bp is termed a long non-coding RNA (lncRNA). This rapidly advancing field shows the great potential of their regulatory functions [10]. In 2010, Howard and his team found that the long non-coding RNA HOTAIR is increased in expression in primary breast tumors and metastases, and that the HOTAIR expression level in primary tumors is a powerful predictor of eventual metastasis and death [11]. All these findings suggest that non-coding RNAs, including miRNAs, transcripts of non-coding genes and long ncRNAs, play active roles in modulating the cancer genome and may be important targets for cancer diagnosis and therapy.
In this study, based on RNA-seq data from human prostate cancer tissue, we compared a prostate cancer sample with a control sample: we aligned the reads, assembled the transcripts, and finally obtained the differentially expressed transcripts and non-coding RNAs, which may be important targets for cancer diagnosis and therapy.
Materials and Methods
Data Acquisition
Our project is based on the RNA-seq data from a previous study [12]. All of the data are available from the European Nucleotide Archive [13] (ENA; http://www.ebi.ac.uk/ena), Europe’s primary nucleotide sequence repository. ENA collects a comprehensive record of the world’s nucleotide sequencing information and consists of three main databases: the Sequence Read Archive (SRA), the Trace Archive and EMBL-Bank. When collecting sequencing data, we applied the following rules: (1) paired-end sequencing; (2) read length of more than 50 bp. These two rules were dictated by our alignment tools, as explained below.
Data Preprocessing
Following the preprocessing method of the previous study from which our data came, we removed reads that met either of the following conditions: (1) more than 2 % of the bases are N; (2) more than 50 % of the bases are of low quality (Q ≤ 15). We then plotted the base quality distribution to assess the effect of filtering.
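The two cutoffs can be sketched as a small read filter. This is only an illustration under stated assumptions, not the pipeline actually used: the function names are hypothetical, quality strings are assumed to use a Phred+33 encoding, and "above" is read as strictly greater than the cutoff.

```python
def decode_phred(quality_string, offset=33):
    """Convert a FASTQ quality string to integer Phred scores.

    The +33 offset is an assumption; older Illumina data may use +64.
    """
    return [ord(char) - offset for char in quality_string]


def passes_filter(sequence, qualities, max_n_fraction=0.02,
                  low_quality=15, max_low_quality_fraction=0.50):
    """Return True if a read survives both cutoffs described in the text.

    A read is rejected when more than 2 % of its bases are N, or when
    more than 50 % of its bases have quality Q <= 15.
    """
    if not sequence or len(sequence) != len(qualities):
        return False  # malformed read: treat as failing
    length = len(sequence)
    n_fraction = sequence.upper().count("N") / length
    low_fraction = sum(q <= low_quality for q in qualities) / length
    return (n_fraction <= max_n_fraction
            and low_fraction <= max_low_quality_fraction)


if __name__ == "__main__":
    read = "ACGT" * 25                      # a 100 bp toy read
    quals = decode_phred("I" * len(read))   # 'I' is Q40 under Phred+33
    print(passes_filter(read, quals))
```

For paired-end data, one plausible convention (again an assumption) is to drop the whole pair when either mate fails.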
Alignment, Assembly and Abundance Estimation
The traditional RNA-seq analysis method was based on de novo assembly followed by alignment to a reference for sequence annotation. Because that method finds new transcripts only by matching the two ends of a read pair to different genes, it is strongly limited by the length and number of reads, and it cannot detect the breakpoint region.
The new method aligns reads to genes and splice sites, and then builds synthetic exon-exon reference data by assembling the splice sites, in order to find as many differentially expressed genes and transcripts as possible.
It can anchor the fragment ends to different exons to determine which splice form is correct, without requiring prior annotation information.
In this paper, we use this new method for the bioinformatic analysis. There are three steps:
1. Alignment. TopHat [14] was chosen for alignment. It aligns reads to the genome using Bowtie and then analyzes the mapping results to identify splice junctions between exons. We used hg19 to construct the reference library, with the following settings: (1) minimum intron length of 70; (2) maximum intron length of 500,000; (3) tolerance of 3 bp deletions/insertions; (4) tolerance of two mismatches. Samples 10 N and 10 T were mapped, generating two BAM files.
2. Transcript assembly. We used Cufflinks [15] to assemble the transcripts, with the following parameters: (1) mean inner distance between mate pairs of 20; (2) standard deviation of the inner distance between mate pairs of 20.
3. Abundance estimation. We also used Cufflinks to estimate the relative abundances of these transcripts based on how many reads support each one. Two normalization methods, quartile normalization and bias correction, were used to improve the accuracy of the transcript abundance estimates.
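The three steps above can be sketched as command lines assembled in Python. This is a hedged sketch, not the exact invocation used in the study: the option names are those of TopHat 2 and Cufflinks as we understand them, and the index prefix, FASTQ, FASTA and output names are hypothetical placeholders. The mate-pair inner distance settings are passed to TopHat here because TopHat exposes them as `--mate-inner-dist` and `--mate-std-dev`; that placement is an assumption.

```python
def tophat_command(index_prefix, reads_1, reads_2, out_dir,
                   mate_inner_dist=20, mate_std_dev=20):
    """Build a TopHat command line mirroring the stated alignment settings."""
    return [
        "tophat",
        "-o", out_dir,
        "--min-intron-length", "70",
        "--max-intron-length", "500000",
        "--max-insertion-length", "3",      # tolerate 3 bp insertions
        "--max-deletion-length", "3",       # tolerate 3 bp deletions
        "--read-mismatches", "2",           # tolerate two mismatches
        "--mate-inner-dist", str(mate_inner_dist),  # assumed placement
        "--mate-std-dev", str(mate_std_dev),        # assumed placement
        index_prefix, reads_1, reads_2,
    ]


def cufflinks_command(bam_file, genome_fasta, out_dir):
    """Build a Cufflinks command line for assembly and abundance estimation."""
    return [
        "cufflinks",
        "-o", out_dir,
        "--upper-quartile-norm",                # quartile normalization
        "--frag-bias-correct", genome_fasta,    # bias correction
        bam_file,
    ]


if __name__ == "__main__":
    # File names below are placeholders, not the study's actual files.
    for sample in ("10N", "10T"):
        align = tophat_command("hg19", f"{sample}_1.fq", f"{sample}_2.fq",
                               f"tophat_{sample}")
        assemble = cufflinks_command(f"tophat_{sample}/accepted_hits.bam",
                                     "hg19.fa", f"cufflinks_{sample}")
        print(" ".join(align))
        print(" ".join(assemble))
```

The commands are only built and printed; running them would require the tools and a Bowtie index to be installed, for example through `subprocess.run(command, check=True)`.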
Merging Transcripts
The transcript assemblies of the two samples, 10 N and 10 T, were merged with Cufflinks. Transcripts were merged under the following conditions: (1) the transcripts have different IDs but the same positions; (2) the transcripts overlap in their genome mapping; (3) the distance between the transcripts is less than 500 bp. Under these conditions, we obtained a new transcript set without redundant information.
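The merging conditions can be illustrated with a simple interval merge. This is a minimal sketch of the stated rules, not the algorithm implemented in the Cufflinks suite: transcripts are reduced to hypothetical `(chromosome, start, end)` tuples, and strand and exon structure are ignored.

```python
def merge_transcripts(transcripts, max_gap=500):
    """Merge transcripts that coincide, overlap, or lie < max_gap bp apart.

    `transcripts` is an iterable of (chromosome, start, end) tuples pooled
    from both samples. Returns a sorted, non-redundant list of merged spans.
    """
    merged = []
    for chrom, start, end in sorted(transcripts):
        if (merged and merged[-1][0] == chrom
                and start - merged[-1][2] < max_gap):
            # Same chromosome and close enough: extend the previous span.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


if __name__ == "__main__":
    pooled = [
        ("chr1", 100, 1000),   # e.g. from sample 10 N
        ("chr1", 900, 2000),   # overlaps the first transcript
        ("chr1", 2400, 3000),  # 400 bp from the merged span: merged
        ("chr1", 3600, 4000),  # 600 bp away: kept separate
        ("chr2", 100, 200),
    ]
    print(merge_transcripts(pooled))
    # [('chr1', 100, 3000), ('chr1', 3600, 4000), ('chr2', 100, 200)]
```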
Analysis of Transcript Expression
Combining the assembled transcripts with the alignments produced by TopHat, we computed the expression value of every transcript. Traditionally, expression is represented by RPKM [16], the number of reads per kilobase of transcript per million mapped reads, which accounts for the effect of sequencing depth and transcript length on the read count. Because our reads are paired-end, the two reads of a pair can be joined to reconstruct the fragment that was fed to the sequencer. Following the RPKM algorithm, we therefore counted fragments instead of reads and obtained FPKM values. For paired-end data, FPKM is a more reliable expression value than RPKM [17].
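As a worked illustration of the definition, FPKM can be computed from a fragment count, a transcript length and the total number of mapped fragments. The function below is a sketch of the standard formula, FPKM = 10^9 × C / (N × L); the function and argument names are ours, and the model-based estimation that Cufflinks performs is not reproduced.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragment_count:         fragments assigned to the transcript (C)
    transcript_length_bp:   transcript length in base pairs (L)
    total_mapped_fragments: all mapped fragments in the sample (N)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # 500 fragments on a 2 kb transcript in a library of 25 million fragments:
    # 500 / 2 kb / 25 million = 10 FPKM.
    print(fpkm(500, 2000, 25_000_000))
```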
Finding Significant Transcripts
We expected some transcripts to have significantly different FPKM values between the two samples. We therefore paired the FPKM values of the two samples by transcript, calculated the fold change, and computed a p-value. We then used these two values for each transcript to draw a volcano plot, from which we derived the significance boundary that defines whether a transcript is differentially expressed.
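The two coordinates of a volcano plot can be sketched as follows. The text does not state how the p-value was computed, so it is taken here as a given input. The function name and the direction of the ratio (tumor over normal) are assumptions for illustration.

```python
import math


def volcano_point(fpkm_normal, fpkm_tumor, p_value):
    """Return (log2 fold change, -log10 p-value) for one transcript.

    The fold change is tumor over normal, so positive values mean higher
    expression in the tumor sample. Both FPKM values must be positive.
    """
    if fpkm_normal <= 0 or fpkm_tumor <= 0:
        raise ValueError("FPKM values must be positive to take a log ratio")
    if not 0 < p_value <= 1:
        raise ValueError("p-value must lie in (0, 1]")
    log2_fold_change = math.log2(fpkm_tumor / fpkm_normal)
    return log2_fold_change, -math.log10(p_value)


if __name__ == "__main__":
    # An eightfold increase in the tumor sample gives a log2 fold change of 3.
    x, y = volcano_point(fpkm_normal=4.0, fpkm_tumor=32.0, p_value=0.001)
    print(x, y)
```

Plotting the first value on the horizontal axis and the second on the vertical axis for every transcript gives the volcano plot.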
Results
Summary of Raw RNA-seq Data
The RNA-seq data, which describe the complete transcriptomic landscape of prostate cancer in the Chinese population, were downloaded from ENA. Based on the rules described above, we chose two samples, 10 N and 10 T, for our analysis; both were paired-end sequenced with a read length of 90 bp. Detailed information is shown in Table 1. Sample 10 N is the RNA-seq data for normal tissue, and sample 10 T is the RNA-seq data for prostate cancer tissue.
Preprocessing Result of Sequencing Data
To evaluate the preprocessing method, we drew box plots of base quality along the reads before and after preprocessing (Fig. 1). The upper half of the figure shows the base quality distribution of the raw data, and the lower half shows that of the preprocessed data. The black line in each box represents the median quality score. The figure shows that (1) base quality fluctuates less in the preprocessed data than in the raw data, which suggests that the filtering worked; (2) overall, the data lie above Q15, with the median above Q34 and mostly above Q36. Consequently, read quality improved markedly after preprocessing. The preprocessed data were used for all subsequent analyses. Table 2 shows the statistics of the data before and after preprocessing.
Alignment and Assembly
We used TopHat for sequence alignment and Cufflinks for transcript assembly. We believe that this align-first approach has great potential to make use of as much of the RNA-seq data as possible. After assembly, we merged the “neighbor” transcripts as described in the Methods section and obtained the merged set of all transcripts. For example, if transcript A in sample 10 N overlapped with transcript B in sample 10 T, we merged them for convenient comparison. Finally, samples 10 N and 10 T yielded about 400,000 and 230,000 transcripts, respectively.
FPKM Distribution
To profile the expression level of each transcript, we calculated the fragments per kilobase of transcript per million fragments mapped (FPKM). Using the FPKM calculation described above, we obtained the FPKM values of all transcripts. Figure 2 shows the density distribution of the FPKM of every transcript. Sample 10 T has higher FPKM values than sample 10 N, which suggests that the cancer sample has a generally higher expression level than the normal sample. The FPKM distribution of sample 10 T has two peaks. The first peak, at 0.7–0.8 log10(FPKM), is absent in sample 10 N. The second peak, near 0, is shared by the two samples. Figure 3 shows box plots of the FPKM of all transcripts in the two samples and makes the distributions easier to compare. Sample 10 N has a median below 0 log10(FPKM) and no prominent outliers, whereas in sample 10 T the median is above 0 and there are many prominent outliers. To analyze these outlier transcripts further, we looked for a boundary that distinguishes differential transcripts.
Significant Transcripts
By calculating the p-value and the fold change in FPKM between the two samples, we obtained the differential expression level of all related transcripts. Figure 4 is a volcano plot showing the differences in these transcripts between the two samples.
Based on Fig. 4, we set the following boundaries to identify differential transcripts:
1) FPKM is more than three in both samples;
2) |log2(fold_change)| > 2;
3) P-value < 0.006.
Under these conditions, we obtained 197 significant transcripts (see the supplement), 17 of which are non-coding transcripts (Tables 3 and 4).
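The three boundaries can be expressed as a single predicate. This is a sketch of the stated thresholds only; the function name is hypothetical, the fold change is taken as tumor over normal, and the p-value is assumed to be supplied by the upstream test.

```python
import math


def is_differential(fpkm_normal, fpkm_tumor, p_value,
                    min_fpkm=3.0, min_abs_log2_fc=2.0, max_p=0.006):
    """Apply the three boundaries: FPKM > 3 in both samples,
    |log2(fold change)| > 2, and p-value < 0.006."""
    if fpkm_normal <= min_fpkm or fpkm_tumor <= min_fpkm:
        return False
    log2_fc = math.log2(fpkm_tumor / fpkm_normal)
    return abs(log2_fc) > min_abs_log2_fc and p_value < max_p


if __name__ == "__main__":
    # Toy values, not data from the study.
    print(is_differential(4.0, 40.0, 0.001))   # tenfold up, small p-value
    print(is_differential(4.0, 12.0, 0.001))   # threefold only
```

The same predicate accepts down-regulated transcripts, since the absolute value of the log2 fold change is tested.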
New lncRNA Discovery
To analyze the other non-coding regions in more depth, we focused on long non-coding RNAs. We selected the assembled transcripts longer than 200 bp and mapped them against all human genes. The assembled transcripts that could not be located in any human gene are what we call lncRNAs. In total, we found 36 significantly differential lncRNAs, shown in Table 5.
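The selection rule can be sketched as a length filter followed by an overlap check against annotated genes. This is an illustrative sketch with hypothetical data structures: transcripts are `(name, chromosome, start, end, length)` tuples and genes are `(chromosome, start, end)` tuples. Whether any overlap, or only full containment, counts as "located in a gene" is an assumption; any overlap is used here.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if two closed intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a


def candidate_lncrnas(transcripts, genes, min_length=200):
    """Return names of transcripts longer than min_length that overlap no gene."""
    genes_by_chrom = {}
    for chrom, start, end in genes:
        genes_by_chrom.setdefault(chrom, []).append((start, end))

    candidates = []
    for name, chrom, start, end, length in transcripts:
        if length <= min_length:
            continue  # too short to be called a long non-coding RNA
        if any(overlaps(start, end, gene_start, gene_end)
               for gene_start, gene_end in genes_by_chrom.get(chrom, [])):
            continue  # falls within or across an annotated gene
        candidates.append(name)
    return candidates


if __name__ == "__main__":
    # Toy coordinates, not data from the study.
    genes = [("chr1", 1000, 5000)]
    transcripts = [
        ("t1", "chr1", 6000, 6500, 450),   # intergenic and long: kept
        ("t2", "chr1", 4500, 5200, 600),   # overlaps the gene: dropped
        ("t3", "chr1", 7000, 7150, 150),   # too short: dropped
        ("t4", "chr2", 1000, 2000, 900),   # no gene on chr2: kept
    ]
    print(candidate_lncrnas(transcripts, genes))  # ['t1', 't4']
```

A linear scan over genes is enough for a sketch; a genome-wide run would more likely use an interval index.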
Discussion
Differential Coding Transcripts
As shown in Table 4, the most differential gene is TFF3 (trefoil factor 3), with a more than sevenfold change between prostate cancer tissue and normal tissue. cDNA expression array analyses have indicated that TFF3 may be overexpressed in prostate cancer patients, and many studies have since reported a strong relationship between TFF3 and prostate cancer. In 2004, Reiter’s team performed immunohistochemistry with TFF3-specific antibodies on a prostate cancer tissue microarray containing tumor tissue samples from 246 primary radical retropubic prostatectomy cases and confirmed that TFF3 was overexpressed in those tumor samples [18]. In 2008, Arul’s team reported a qPCR analysis of seven prostate cancer biomarkers and found that TFF3 was indeed a biomarker [19]. Our results are consistent with these findings and support the development of a diagnostic kit for early detection of prostate cancer. Interestingly, TFF1 was also among our top 10 differential genes, but with the opposite trend to TFF3: it was down-regulated in prostate cancer. Most previous studies reported that TFF1 (pS2 protein) is up-regulated in prostate tumors. The members of the trefoil factor family, TFF1, TFF2 and TFF3, are reported to be overexpressed in prostate tumors, and their plasma levels are increased in patients with advanced prostate cancer [20]. However, Shahid collected 95 malignant prostatic specimens from primary adenocarcinoma, performed immunohistochemical staining, and found no significant correlation between TFF1 expression and the stage of disease, although TFF1 expression in prostate cancer correlated significantly with histological grade and neuroendocrine differentiation [21]. So, although the TFF1 trend in our analysis is opposite to that in some other studies, TFF1 may still be a biomarker, but only for some stages of prostate cancer, because its expression level may differ between stages of the disease.
Differential Non-Coding Genes
Why are we concerned with non-coding genes? Non-coding genes are often pseudogenes or open reading frames of unknown function. Many of them have not been related to disease, especially cancer. But if we find one of them differentially overexpressed, that gene has great potential to be related to the disease, in our case prostate cancer. Among the 17 transcripts we found, only two are down-regulated. The most prominent transcript is NR_022014, a transcript of the gene C15orf21. We detected a 3-fold up-regulation of this gene in prostate cancer with P = 1.62E-14, consistent with a previous study by Arul in 2007 [22], in which C15orf21 was significantly overexpressed in prostate cancer with P = 3.4*10E-6.
New lncRNA Discovery
Large intergenic non-coding RNAs (lincRNAs) are emerging as key regulators of diverse cellular processes, but determining the function of individual lincRNAs remains a challenge. In 2011, John Rinn from the Broad Institute used RNA-seq to produce the most complete catalogue of lincRNAs [23] across 24 tissues, including prostate cancer tissue. This catalogue therefore contains their results for prostate cancer related lncRNAs. In Table 5, the 9 lncRNAs highlighted in red were found by our method and have already been published as related to prostate cancer; the 3 lncRNAs highlighted in blue have been published, but without a reported relationship to prostate cancer; the other 24 lncRNAs are significant in this project. These 24 lncRNAs are therefore strong candidates for being related to prostate cancer.
An Interesting Observation
When we queried these lncRNA regions in the UCSC Genome Browser to obtain the average conservation score of each candidate or putative lncRNA, most of them had very low scores. We assumed that, since lncRNAs are no longer regarded as “junk”, they should be conserved across mammals. Why, then, are their conservation scores so low? Does this mean that lncRNAs are poorly conserved and change rapidly across mammals? These questions remain to be explored.
References
1. Berger MF, Lawrence MS, Demichelis F, Drier Y, Cibulskis K, Sivachenko AY, Sboner A, Esgueva R, Pflueger D, Sougnez C (2011) The genomic complexity of primary human prostate cancer. Nature 470(7333):214–220
2. Barbieri CE, Baca SC, Lawrence MS, Demichelis F, Blattner M, Theurillat JP, White TA, Stojanov P, Van Allen E, Stransky N (2012) Exome sequencing identifies recurrent SPOP, FOXA1 and MED12 mutations in prostate cancer. Nat Genet 44(6):685–689
3. Grasso CS, Wu YM, Robinson DR, Cao X, Dhanasekaran SM, Khan AP, Quist MJ, Jing X, Lonigro RJ, Brenner JC (2012) The mutational landscape of lethal castration-resistant prostate cancer. Nature 487(7406):239–243
4. Morin RD, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh TJ, McDonald H, Varhol R, Jones SJM, Marra MA (2008) Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 45(1):81–94
5. Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature 458(7234):97–101
6. Pflueger D, Terry S, Sboner A, Habegger L, Esgueva R, Lin PC, Svensson MA, Kitabayashi N, Moss BJ, MacDonald TY (2011) Discovery of non-ETS gene fusions in human prostate cancer using next-generation RNA sequencing. Genome Res 21(1):56–67
7. Sahu A, Iyer MK, Chinnaiyan AM (2012) Insights into Chinese prostate cancer with RNA-seq. Cell Res 22(5):786–788
8. Mattick JS (2001) Non-coding RNAs: the architects of eukaryotic complexity. EMBO reports 2(11):986–991
9. Stefani G, Slack FJ (2008) Small non-coding RNAs in animal development. Nat Rev Mol Cell Biol 9(3):219–230
10. Mercer TR, Dinger ME, Mattick JS (2009) Long non-coding RNAs: insights into functions. Nat Rev Genet 10(3):155–159
11. Gupta RA, Shah N, Wang KC, Kim J, Horlings HM, Wong DJ, Tsai MC, Hung T, Argani P, Rinn JL (2010) Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature 464(7291):1071–1076
12. Ren S, Peng Z, Mao JH, Yu Y, Yin C, Gao X, Cui Z, Zhang J, Yi K, Xu W (2012) RNA-seq analysis of prostate cancer in the Chinese population identifies recurrent gene fusions, cancer-associated long noncoding RNAs and aberrant alternative splicings. Cell Res 22(5):806–821
13. Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno-Tárraga A, Cheng Y, Cleland I, Faruque N, Goodgame N, Gibson R (2011) The European nucleotide archive. Nucleic Acids Res 39(suppl 1):D28–D31
14. Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105–1111
15. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7(3):562–578
16. Jiang H, Wong WH (2009) Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 25(8):1026–1032
17. Toung JM, Morley M, Li M, Cheung VG (2011) RNA-sequence analysis of human B-cells. Genome Res 21(6):991–998
18. Garraway IP, Seligson D, Said J, Horvath S, Reiter RE (2004) Trefoil factor 3 is overexpressed in human prostate cancer. Prostate 61(3):209–214
19. Laxman B, Morris DS, Yu J, Siddiqui J, Cao J, Mehra R, Lonigro RJ, Tsodikov A, Wei JT, Tomlins SA (2008) A first-generation multiplex biomarker analysis of urine for the early detection of prostate cancer. Cancer Res 68(3):645
20. Vestergaard EM, Borre M, Poulsen SS, Nexø E, Tørring N (2006) Plasma levels of trefoil factors are increased in patients with advanced prostate cancer. Clin Cancer Res 12(3):807–812
21. Ather MH, Abbas F, Faruqui N, Israr M, Pervez S (2004) Expression of pS2 in prostate cancer correlates with grade and Chromogranin A expression but not with stage. BMC Urol 4(1):14
22. Tomlins SA, Laxman B, Dhanasekaran SM, Helgeson BE, Cao X, Morris DS, Menon A, Jing X, Cao Q, Han B (2007) Distinct classes of chromosomal rearrangements create oncogenic ETS gene fusions in prostate cancer. Nature 448(7153):595–599
23. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL (2011) Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev 25(18):1915–1927
Conflict of Interest
The authors have no conflict of interest to declare.
Additional information
Xiao-Ming Zhang, Zhong-Wei Ma and Qiang Wang contributed equally to this work as co-first authors.
Cite this article
Zhang, XM., Ma, ZW., Wang, Q. et al. A New RNA-Seq Method to Detect the Transcription and Non-coding RNA in Prostate Cancer. Pathol. Oncol. Res. 20, 43–50 (2014). https://doi.org/10.1007/s12253-013-9618-0
DOI: https://doi.org/10.1007/s12253-013-9618-0