
1 Introduction

Cancer is currently one of the leading causes of morbidity and mortality, with survival rates that vary with the type of cancer. Recent studies have demonstrated that, besides the specific somatic or germline mutations that drive tumor growth, mobile elements, also known as transposable elements (TEs), are involved in the onset of many human diseases as well as in the development of established cancers. For example, in epithelial cancer, activation of TEs correlates with their mobilisation and genomic drift [15]. This is because TEs are DNA sequences able to move from one place to another in the genome, contributing to genomic instability and causing genetic disorders. Since nearly 50% of the human genome is composed of TEs, cells limit the deleterious consequences of TE activity by inactivating most TEs through large deletions, stop codons, and frameshift mutations within their open reading frames. Nevertheless, it has recently been shown that some human endogenous viral elements (HEVEs) are still active and play a crucial role in placental development in various mammalian species [20].

The study of TEs with high-throughput technologies has been neglected because they are difficult to measure and process: a large number of copies of each TE is present throughout the genome. Earlier efforts led to tools such as RepEnrich [9] or TEtranscript [14], which were designed to accurately quantify the global expression of the different families of TEs from RNA-seq data, with the TE evaluation based on RepBase. Another tool, Lions [4], was developed to quantitatively measure and compare the contribution of TE promoters to their expression in cancer. More recently, TEtools [16] was designed to analyse TE expression using non-annotated and non-assembled genomes. However, identifying the particular TE copies that are differentially expressed would be more informative than knowing the activity of a whole TE family. Our main objective is not to detect TE jumps that can explain a disease, but to design a tool that identifies which copies of the different TEs in the human genome are differentially expressed when a normal cell becomes a cancer cell. For this purpose, gEVE [20], a database of endogenous viral elements (EVEs), including endogenous retroviruses, that was developed to investigate the function and evolution of TEs in mammalian genomes, seems more appropriate than RepBase. The great advantage of gEVE is that it provides nucleotide and amino acid sequences, genomic loci and functional annotations of all EVEs. In particular, this database describes 33 966 EVEs, 1782 gag elements, 1482 pro elements, 29 120 pol elements, and 1731 env elements in the human genome. The result is the bioinformatic workflow NearTrans, which determines (i) differentially expressed TEs and (ii) the activity of the genes surrounding them, in order to study whether changes in TE expression are related to nearby genes. Prostate cancer, in which LINE-1 was already known to be over-expressed [9], was chosen as the biological model.

2 Materials and Methods

2.1 Input Data

Control (healthy prostate cells) and treatment (prostate cancer) RNA-seq reads from 14 patients from Shanghai Hospital were publicly available from BioProject PRJEB2449 [24]. The main feature of these data is that the prostate cancer and nearby normal tissue samples are paired, since both were sequenced from the same individual.

Information about EVEs in gEVE was downloaded from http://geve.med.u-tokai.ac.jp/ for the Hg38 human genome in GTF format. Structural information about the human genome Hg38 was downloaded from the UCSC web portal (http://genome.ucsc.edu/cgi-bin/hgTables). The sequences of the human genome assembly Hg38 were downloaded from NCBI (https://www.ncbi.nlm.nih.gov/assembly?term=GRCh38).

2.2 Implementation

The two tasks of NearTrans, detecting differentially expressed TEs and determining the expression level of their nearby genes, were carried out as follows (Fig. 1), using the same tools for genes and TEs, and for normal and tumoral prostate, whenever possible:

Fig. 1. Flowchart illustrating tools and datasets provided and obtained by the NearTrans workflow.

  1. Data quality control using SeqTrimNext (STN) [11] with the specific NGS Illumina configuration parameters to remove low quality, ambiguous and low complexity stretches, adaptors, organelle DNA, polyA/polyT tails, and contaminated sequences while keeping the longest informative part of the read (\(>20\) bp).

  2. Mapping the pre-processed, useful reads to the human genome hg38 using STAR v2.5 [10] with the following parameters (see the STAR help for the meaning of each parameter):

    [figure a: STAR parameter listing]

  3. Assessment of expression levels of genes and TEs between matched normal and cancer tissues, using the GFFs of hg38 and gEVE, respectively, with Cufflinks (v.2.2.1) [25] followed by Cuffquant and then Cuffdiff, as described in [13]. The results are then passed to cummeRbund v3.6 to analyse, explore, manipulate and plot them.

  4. Selection of differentially expressed TEs using as filters an adjusted \(P < 0.05\) and a \(|\log_2 FC| > 1\).

  5. Location of nearby genes and their expression fold-change for every differentially expressed TE using BEDTools (v.2.26.0) [22], with the command bedtools closest -a TEs_file.bed -b genes_file.gtf -D a > nearest_genes.bed, where the file TEs_file.bed contains the location of the differentially expressed TEs in the human genome and genes_file.gtf contains the location of all genes in the human genome.
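As an illustration of how the output of step 5 could be consumed downstream, the following minimal Python sketch parses nearest_genes.bed. It is not part of NearTrans as described above. It relies on the bedtools closest behaviour of writing all columns of the -a record, then all columns of the -b record, then the distance requested by -D as the last column. It further assumes that the TE name is in the fourth BED column and that the GTF attributes contain a gene_id "..." entry. The names NearestGene, parse_closest_line and read_nearest_genes are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional

# Assumption: genes_file.gtf carries a gene_id "..." attribute in column 9.
GENE_ID = re.compile(r'gene_id "([^"]+)"')


class NearestGene(NamedTuple):
    te_name: str
    gene_id: Optional[str]  # None when no gene_id attribute is present
    distance: int           # signed distance reported by -D a


def parse_closest_line(line: str, te_name_col: int = 3) -> NearestGene:
    """Parse one tab-separated line written by `bedtools closest -D a`."""
    fields = line.rstrip("\n").split("\t")
    distance = int(fields[-1])          # last column is appended by -D
    match = GENE_ID.search(fields[-2])  # GTF attributes come just before it
    gene_id = match.group(1) if match else None
    # Assumption: the TE name sits in BED column 4 (index 3).
    return NearestGene(fields[te_name_col], gene_id, distance)


def read_nearest_genes(path: str) -> List[NearestGene]:
    """Read a whole nearest_genes.bed file, skipping blank lines."""
    with open(path) as handle:
        return [parse_closest_line(line) for line in handle if line.strip()]
```

With -D a, the sign of the distance is relative to the strand of the TE, so genes upstream of the TE are reported with negative values. Taking the absolute value gives the TE-gene separation.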

3 Results

After preprocessing the raw RNA-seq data from the 14 prostate cancer patients in PRJEB2449, the percentage of useful reads ranged from 93.54% for patient ERR031029 to 96.16% for patient ERR031025. This shows the high quality of those sequence reads and indicates that further analyses will not be affected by read quality. Mapping the useful reads resulted in an overall 98.18% of reads mapped to the human genome. Again, the high mapping rate indicates that the results will not be affected by inadequate sequencing.

Fig. 2. Volcano plot where each TE is defined by its log2 of fold-change (log\(_2\)FC\(_{TE}\)) vs −log10 of adjusted P-value (log\(_{10}\)P\(_{TE}\)). Dots highlighted in red are those presenting a significant over-expression in prostate cancer cells. The TE corresponding to each red dot is indicated.

The differentially expressed TEs are shown as red dots in Fig. 2. The three red dots with log\(_2\)FC\(_{TE}\) closer to 0 are LINEs (L1PA3, L1PA4 and L1PA7), while the upper-right point corresponds to the two HERVs (HERVH-int and HERV17-int). All of these TEs were found to be over-expressed in prostate cancer. HERVs were not expressed at all in normal cells, only in cancer cells, which is why they appear at the right border of Fig. 2 and as “Inf” in Table 1. In contrast, LINEs (like many other TEs) were expressed in normal cells and their expression was significantly increased in tumor cells. The advantage of using gEVE is that we now know that, of the 20 699 described positions of LINE-1 in Hg38, 946 were strongly (although not significantly, adjusted \(P>0.05\)) repressed, while 3 829 were over-expressed (but only three positions exhibit significant over-expression). The remaining 15 924 positions of LINEs can be considered unchanged, since they show a mean log\(_2\)FC\(_{TE}\) of \(-0.06\) with a standard deviation of 1.58. These results are consistent with the over-expression of LINE-1 previously reported in prostate cancer [9]; the main innovation of NearTrans is that it identifies the positions of the LINE-1 copies whose over-expression is significant.
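The selection criteria behind these results (adjusted \(P < 0.05\) and \(|\log_2 FC| > 1\)) and the “Inf” entries can be illustrated with a minimal Python sketch. It is not the NearTrans code. The names TEResult, log2_fold_change and select_de_tes are hypothetical, and the treatment of zero expression (an infinite fold-change when a TE is expressed in only one condition, an undefined value when it is expressed in neither) is an assumption made for illustration.

```python
import math
from typing import Iterable, List, NamedTuple


class TEResult(NamedTuple):
    te_id: str
    log2_fc: float  # log2(tumor / normal); may be +inf or -inf
    adj_p: float    # adjusted P-value


def log2_fold_change(normal: float, tumor: float) -> float:
    """log2(tumor / normal) with explicit handling of zero expression."""
    if normal == 0 and tumor == 0:
        return math.nan  # assumption: no expression in either condition
    if normal == 0:
        return math.inf  # expressed only in tumor ("Inf")
    if tumor == 0:
        return -math.inf  # expressed only in normal tissue
    return math.log2(tumor / normal)


def select_de_tes(results: Iterable[TEResult],
                  max_adj_p: float = 0.05,
                  min_abs_log2_fc: float = 1.0) -> List[TEResult]:
    """Keep TEs with adjusted P < max_adj_p and |log2FC| > min_abs_log2_fc."""
    # NaN fold-changes fail the comparison and are therefore dropped.
    return [r for r in results
            if r.adj_p < max_adj_p and abs(r.log2_fc) > min_abs_log2_fc]
```

Under these assumptions, a TE expressed only in tumor tissue passes the fold-change filter whatever its expression level, so its selection depends entirely on the adjusted P-value.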

Bearing in mind that a TE can only be expressed if its genomic context is not supercoiled (silenced), the chromosome region where each differentially expressed TE is located was screened for the closest gene. The distances between genes and TEs are highly variable irrespective of the TE (Table 1). The strongest correlation was observed between the expression of HERV17-int and ACSM1, while LINEs present the least significant correlation (adjusted \(P > 0.5\)). Interestingly, expression of MIR4675 (close to L1PA4) was not detected in the samples analysed. It seems that those HERVs are more dependent on the genomic context than LINEs.

Table 1. Summary of differentially expressed retrotransposons in prostate cancer and their nearby genes

4 Discussion

The NearTrans workflow (Fig. 1) allowed the identification of five TEs (HERVH-int, HERV17-int, L1PA3, L1PA4 and L1PA7) with differential expression at separate positions of the human genome in prostate cancer (Fig. 2 and Table 1). In some cases (HERV17-int and L1PA7), TE over-expression appears to be correlated with high expression of their nearby genes (ACSM1 and LOC101928437, respectively). In the other cases, the gene is not highly expressed or the correlation is not significant. Even though the correlation between gene and TE is statistically significant only for HERV17-int/ACSM1, we examined whether the nearby genes are related to prostate cancer, in order to determine which TEs are over-expressed due to their proximity to expressed genes that have a role in the development of cancer.

Investigating the roles in prostate cancer of the genes that NearTrans identified close to the differentially expressed TEs (Table 1), we found that:

  • ACSM1 has already been described as highly expressed in prostate cancer when compared with normal prostate tissue [1,2,3,26], while its expression decreased when patients underwent androgen deprivation and antitumor chemotherapy with docetaxel [23]. It has also been described that silencing ACSM1 in breast cancer decreases cellular invasion and progression, and it has therefore been identified as a potential biomarker for cancer prognosis [7].

  • PLA2G5 has a variable expression profile and is involved in diseases of immunological nature [5, 8]. It was described as repressed in colon adenocarcinoma [19], acute myeloid leukemia [12] and in the leukemic cell line Jurkat [17]. It has recently been related to the prostate, being highly expressed in normal epithelial cells and repressed by methylation in diseased prostate [18]. In the NearTrans analysis, PLA2G5 has an adjusted \(P_{g} = 0.25\) and a \(\log_2 FC_{g} = 0.33\) (Table 1), indicating that its change in expression is neither large nor significant.

  • L1PA3 is close to two pseudogenes. UBE2MP1 (ubiquitin conjugating enzyme E2 M pseudogene 1) is not apparently related to any disease, even though its upregulation was significantly involved in a pathway related to prostate cancer [21]. The HAVANA GTF for Hg38 predicts another, closer pseudogene of unknown function, VN1R68P, only 26 nt away.

  • MIR4675 is a miRNA that has not been described in prostate cancer but is related to other types of tumors, including adenocarcinoma, colorectal carcinoma, non-small cell lung carcinoma and breast cancer, where its expression is inhibited with respect to normal tissue [6]. In our case, it was not detected in the samples.

  • We consider that the unknown nature of LOC101928437, its distance to L1PA7 (211 321 nt) and its \(P_{g} = 1\) rule out any influence on the expression of L1PA7.

In conclusion, NearTrans seems to be a suitable and useful workflow for the detection of differentially expressed TEs and their nearby genes. It must be noted that NearTrans can be applied to any cancer or any other disease, provided that the same individual presents healthy and diseased tissues in which gene expression levels differ and from which samples can be taken. The results presented regarding HERVs in prostate cancer suggest that their expression depends on the nature of the genome context. The over-expression of LINEs is consistent with previous reports [9], but NearTrans offers more detail since it also indicates which genome copy of the TE is significantly over-expressed. Interestingly, the TEs belonging to the LINE-1 family appeared to be the most independent of genomic context, which supports the idea that this type of TE could be used to increase genome instability in cancer, even though the nearby genes could have a potential relation with cancer. We therefore propose that the study of TEs in cancer can help in the discovery or corroboration of genes involved in cancer, and that TEs can be used as specific biomarkers for the diagnosis, prognosis or treatment of cancer.