Introduction

Approximately 23–34% of prostate cancer patients experience post-operative disease relapse, which first manifests as a rise in serum prostate-specific antigen (PSA), known as biochemical recurrence (BCR). Some patients with BCR progress to local recurrence and distant metastasis. Although androgen deprivation therapy and salvage radiation therapy are effective for these patients, after a period of disease control some hormone-sensitive prostate cancers progress to castration-resistant prostate cancer (CRPC) [1]. BCR therefore offers an earlier intervention time point than CRPC, and effective assessment of the risk of BCR is a key clinical issue in prostate cancer management. Clinical and pathological indicators such as TNM stage, Gleason score and serum PSA have been used to predict BCR. Nevertheless, owing to the heterogeneity of prostate cancer, patients with the same clinicopathological parameters often have divergent outcomes. Hence, the discovery of additional prognostic factors to improve patient management after radical prostatectomy (RP) is desirable.

Numerous factors have been investigated to enhance the predictive ability of clinical and pathological parameters. High serum alkaline phosphatase [2], lncRNA TMPO-AS1 [3], and NAP1L6 [4] were reported to be significantly associated with prostate cancer survival. Besides single molecules, multi-gene signatures such as Oncotype DX [5], Prolaris [6], Decipher [7], and sigMuc1NW [8] have also been evaluated for their association with prostate cancer prognosis after RP. Although these and other indicators have helped to improve clinical decision-making and patient management, their clinical utility requires further validation [9, 10]. So far, no assay has been recommended by the EAU or AUA guidelines for clinical prediction.

The application of microarray and RNA-sequencing technology has deepened our understanding of the tumorigenesis and development of prostate cancer. The Gene Expression Omnibus (GEO) provides substantial information on gene expression profiles. In addition to gene expression profiles, The Cancer Genome Atlas (TCGA) provides follow-up data for prostate cancer (PCa) patients after RP, which facilitates survival analysis. In this study, we first identified key differentially expressed genes (DEGs) by combining the GEO and TCGA datasets and then constructed a 6-gene signature associated with BCR by survival analysis. Finally, its independent prognostic value was investigated.

Materials and methods

Data acquisition and pretreatment

The gene expression profile of GSE35988 [11] was obtained from GEO (http://www.ncbi.nlm.nih.gov/geo). Differential expression between prostate cancer tissue and normal prostate tissue was then assessed with the online tool GEO2R (http://www.ncbi.nlm.nih.gov/geo/geo2r/). The RNA-sequencing data of TCGA prostate adenocarcinoma (TCGA_PCa) were available from Gene Expression Profiling Interactive Analysis (GEPIA) [12] (http://gepia.cancer-pku.cn/index.html), and the differential expression analysis of TCGA_PCa was conducted with the GEPIA online tool. The statistical analysis in both online tools, GEO2R and GEPIA, was based on the limma R package. The significance level was set at an adjusted p value (adj. p) of 0.05 to reduce the false-positive rate, and the fold-change criterion was set at |logFC| ≥ 1.
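
The two thresholds above can be illustrated with a minimal sketch. It assumes the differential expression results have already been exported as rows of gene symbol, logFC and adjusted p value. The `GeneResult` type, the `filter_degs` function and the example genes are hypothetical and are not part of GEO2R or GEPIA; the strict inequality for the adjusted p value is also an assumption.

```python
from typing import Iterable, List, NamedTuple, Tuple


class GeneResult(NamedTuple):
    """One row of a differential expression table (hypothetical layout)."""
    symbol: str
    log_fc: float   # log fold change, cancer versus normal
    adj_p: float    # multiple-testing adjusted p value


def filter_degs(
    results: Iterable[GeneResult],
    adj_p_cutoff: float = 0.05,
    log_fc_cutoff: float = 1.0,
) -> Tuple[List[str], List[str]]:
    """Return (upregulated, downregulated) gene symbols passing both cutoffs."""
    up: List[str] = []
    down: List[str] = []
    for row in results:
        # Keep a gene only if it is significant AND changes by at least the cutoff.
        if row.adj_p < adj_p_cutoff and abs(row.log_fc) >= log_fc_cutoff:
            if row.log_fc > 0:
                up.append(row.symbol)
            else:
                down.append(row.symbol)
    return up, down


# Toy input with placeholder gene names, not data from this study.
example = [
    GeneResult("GENE_A", 2.3, 0.001),   # kept, upregulated
    GeneResult("GENE_B", -1.0, 0.04),   # kept, downregulated
    GeneResult("GENE_C", 3.0, 0.20),    # dropped: not significant
    GeneResult("GENE_D", 0.5, 0.001),   # dropped: change too small
]
up_genes, down_genes = filter_degs(example)
```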

Clinical data and RNA expression data of TCGA prostate adenocarcinoma [13] (up to Aug 11, 2017) were downloaded from the official TCGA website (https://portal.gdc.cancer.gov/). Gene expression was downloaded as fragments per kilobase of exon per million fragments mapped (FPKM) and then converted to transcripts per million (TPM) by a bioinformatics engineer [14]. The exclusion criteria for PCa patients were as follows: (1) pathologic result other than prostate adenocarcinoma, (2) clinical data available but biochemical recurrence data missing, and (3) missing vital clinical information, including American Joint Committee on Cancer (AJCC) TNM stage [15]. Finally, 358 patients with both clinical data and gene expression data were included in our study for survival analysis, as shown in supplementary material 1 (http://dx.doi.org/10.13140/RG.2.2.13131.44324).
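
For reference, the FPKM values of one sample can be rescaled to TPM by dividing each value by the sample's FPKM total and multiplying by one million. The sketch below shows this standard rescaling only; it is not necessarily the script used in [14], and the function name and example values are hypothetical.

```python
from typing import Dict


def fpkm_to_tpm(fpkm: Dict[str, float]) -> Dict[str, float]:
    """Rescale one sample's FPKM values so that they sum to one million (TPM)."""
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("FPKM values must have a positive sum")
    # Each gene keeps its share of the sample total; only the scale changes.
    return {gene: value / total * 1_000_000 for gene, value in fpkm.items()}


# Toy sample with placeholder gene names.
sample_tpm = fpkm_to_tpm({"GENE_A": 10.0, "GENE_B": 30.0})
```

Because the rescaling is done per sample, TPM values sum to the same total in every sample, which makes expression levels easier to compare across patients.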

The series matrix file of GSE55945 [16] was downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse55945), and the probe IDs were converted into gene symbols with the online tool g:Profiler [17] (http://biit.cs.ut.ee/gprofiler/).

Statistical analysis and data mining

The association between the expression level of DEGs and BCR-free survival was analyzed by a univariable Cox proportional hazards regression model and the log-rank test (median as cutoff point) [18]. DEGs were considered to have prognostic value if their p values were less than 0.05. The statistically significant DEGs were then entered into a multivariable Cox regression model to construct a predictive model. A risk score formula was constructed from the expression levels of the DEGs and their coefficients in the multivariable Cox regression model. The risk score of each patient was calculated, and patients were divided into a low-risk group and a high-risk group using the median as cutoff point. The prognostic effect of the risk score was assessed by Kaplan–Meier estimation, with the log-rank test used to evaluate statistical significance. Univariable and multivariable Cox regression analyses were also conducted. The association between the risk score and clinicopathological characteristics was assessed by the Chi-square test. All analyses above were performed in SPSS 16.0, and the criterion for statistical significance was set at p < 0.05.
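
The log-rank comparison of two groups can be sketched as follows. This is a plain-Python illustration of the standard two-group log-rank statistic, not the SPSS procedure used in this study; the function name and the toy follow-up data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def logrank_test(
    times_a: Sequence[float], events_a: Sequence[int],
    times_b: Sequence[float], events_b: Sequence[int],
) -> Tuple[float, float]:
    """Two-group log-rank test. Returns (chi-square with 1 df, p value).

    events_* holds 1 for an observed event (e.g. BCR) and 0 for censoring.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Subjects still under observation just before time t.
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        n = sum(1 for ti, _, _ in data if ti >= t)
        # Events occurring exactly at time t.
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d = sum(1 for ti, e, _ in data if ti == t and e)

        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * ((n - n_a) / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data: group A relapses early, group B relapses late (not study data).
chi2, p = logrank_test([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
```

In the workflow described above, the two groups would be the patients below and above the median expression of a gene, or below and above the median risk score.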

Functional enrichment analysis was conducted in FunRich software version 3.1.1 [19].

The normalized gene expression data were extracted from the series matrix of GSE55945, and differential expression was analyzed in SPSS using a two-tailed Student’s t test, with p < 0.05 as the criterion for statistical significance.

All the graphs in this study were drawn in GraphPad Prism 7.0 software.

Results

Identification of key DEGs

The GEO2R analysis of GSE35988 compared 49 samples of localized prostate cancer with 12 samples of benign prostate tissue, using data from platform GPL6480. Under the filtering criteria described above, 767 DEGs were identified, of which 312 were upregulated and 455 were downregulated.

The differential expression analysis of the TCGA_PCa dataset identified 3017 DEGs, of which 690 were upregulated and 2327 were downregulated.

Finally, 310 DEGs were present in both datasets, as shown in the Venn diagram (Fig. 1) and supplementary material 2 (http://dx.doi.org/10.13140/RG.2.2.28230.93766). Among them, 96 genes were upregulated and 214 were downregulated.
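
The overlap shown in the Venn diagram corresponds to a set intersection. The sketch below assumes that a common DEG must change in the same direction in both datasets, which the text does not state explicitly; the function name and gene symbols are placeholders.

```python
from typing import Iterable, Set, Tuple


def common_degs(
    up_a: Iterable[str], down_a: Iterable[str],
    up_b: Iterable[str], down_b: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Return genes upregulated in both datasets and genes downregulated in both."""
    common_up = set(up_a) & set(up_b)
    common_down = set(down_a) & set(down_b)
    return common_up, common_down


# Placeholder symbols only; dataset A and dataset B stand for the two DEG lists.
shared_up, shared_down = common_degs(
    up_a=["GENE_A", "GENE_B"], down_a=["GENE_C", "GENE_D"],
    up_b=["GENE_B", "GENE_C"], down_b=["GENE_D", "GENE_E"],
)
```

In this toy example GENE_C is excluded because it is downregulated in one dataset and upregulated in the other.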

Fig. 1

Venn diagram of differentially expressed genes. Venn diagram of differentially expressed genes in the datasets: GSE35988 and TCGA_PCa

Establishment of gene signature with prognostic value

The characteristics of the patients included in this study are shown in Table 1. The median follow-up time of censored patients was 512 days. To investigate whether the DEGs were related to BCR outcome, the 310 genes were analyzed by univariable Cox regression and the log-rank test. Nineteen DEGs were significantly related to BCR-free survival (p < 0.05), as shown in Table 2 and Fig. 2.

Table 1 Clinical and pathologic characteristics of analyzed patients
Table 2 Univariable BCR-free survival analysis of 310 DEGs
Fig. 2

Kaplan–Meier plots based on the 6 differentially expressed genes. Kaplan–Meier plots of BCR-free survival for 358 PCa patients grouped by single gene constituting the 6-gene signature. Patients were divided into two groups: high expression group and low expression group, based on the gene expression level using the median as cutoff point. The comparison method of two survival curves was Log-rank test. a SMIM22, b NINL, c TPCN2, d NRG2, e TOP2A, f REPS2

All 19 DEGs were entered into a multivariable Cox regression model using the Forward Stepwise (Likelihood Ratio) method (Forward: LR in SPSS). The final model retained 6 genes (SMIM22, REPS2, TPCN2, NINL, TOP2A and NRG2), all of which were statistically significant (all p < 0.05, Fig. 2; Table 3).

Table 3 Multivariable BCR-free survival analysis of 19 DEGs

According to the coefficients from the multivariable Cox regression model and the gene expression levels, the risk score formula was as follows: risk score = (−0.744 × expression level of SMIM22) + (−0.809 × expression level of REPS2) + (0.568 × expression level of TPCN2) + (0.681 × expression level of NINL) + (0.686 × expression level of TOP2A) + (−0.962 × expression level of NRG2). The gene expression level used in the risk score formula was 0 or 1, representing low and high expression, respectively. Patients were then divided into two groups by risk score. Furthermore, the relationship between risk score and clinicopathological parameters was analyzed (Table 4). Pathologic T stage, N stage, TNM stage and Gleason score differed significantly between the low-risk and high-risk groups, whereas age did not.
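
The risk score calculation can be sketched as follows, using the coefficients reported above. The median split rule (values strictly above the median count as high expression), the function names and the example patient are assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List, Sequence

# Coefficients of the 6-gene signature as reported in the risk score formula.
COEFFICIENTS: Dict[str, float] = {
    "SMIM22": -0.744,
    "REPS2": -0.809,
    "TPCN2": 0.568,
    "NINL": 0.681,
    "TOP2A": 0.686,
    "NRG2": -0.962,
}


def dichotomize_by_median(values: Sequence[float]) -> List[int]:
    """Code each value as 1 (high) or 0 (low) relative to the cohort median.

    Assumption: values equal to the median are coded as low.
    """
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]


def risk_score(binary_expression: Dict[str, int]) -> float:
    """Sum of coefficient times 0/1 expression level over the six genes."""
    return sum(coef * binary_expression[gene] for gene, coef in COEFFICIENTS.items())


# Hypothetical patient: high TOP2A and NINL, low expression of the other genes.
patient = {"SMIM22": 0, "REPS2": 0, "TPCN2": 0, "NINL": 1, "TOP2A": 1, "NRG2": 0}
score = risk_score(patient)  # 0.681 + 0.686
```

Because the inputs are 0 or 1, a patient's score is simply the sum of the coefficients of the genes that are highly expressed. Applying `dichotomize_by_median` to the list of all patients' scores would then give the low-risk and high-risk groups.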

Table 4 Association between risk score and clinical pathological parameters

Kaplan–Meier plots showed that patients with a high risk score tended to have worse BCR-free survival (Fig. 3a). To evaluate the independent prognostic effect of the 6-gene signature in predicting BCR, univariable and multivariable Cox proportional hazards regression models were applied (Table 5). Because TNM stage already integrates the information of pT and pN, pT and pN were not included in the survival analysis. In univariable analysis, risk score, Gleason score and TNM stage, but not age, were predictive of BCR. In multivariable stepwise Cox regression analysis, the risk score retained its prognostic effect independent of TNM stage.

Fig. 3

Kaplan–Meier plots of risk score in the entire patient group and in Gleason score subgroups. a Kaplan–Meier plots of BCR-free survival in the entire patient group (358 patients). Patients were divided into two groups, a high-risk score group and a low-risk score group, using the median risk score as cutoff point. b Comparison of mean risk score between Gleason score subgroups by Student’s t test. **Represents p < 0.01. c Kaplan–Meier plots of BCR-free survival in subgroup 1: Gleason score ≤ 7 (197 patients). Patients were divided into two groups, a high-risk score group and a low-risk score group, using the median risk score as cutoff point. d Kaplan–Meier plots of BCR-free survival in subgroup 2: Gleason score > 7 (161 patients). Patients were divided into two groups, a high-risk score group and a low-risk score group, using the median risk score as cutoff point

Table 5 Univariable and multivariable cox regression analysis of 6-gene signature for BCR-free survival

The prognostic effect of risk score in different Gleason score subgroups

Because Gleason score did not enter the multivariable Cox regression equation and was associated with risk score, a subgroup analysis was conducted to determine whether the prognostic effect of risk score holds for all patients regardless of Gleason score. The survival analysis demonstrated that in both subgroups, Gleason score ≤ 7 (n = 197) and Gleason score > 7 (n = 161), the survival difference between the high-risk and low-risk score groups was significant (all p < 0.01, Fig. 3b–d).

Functional enrichment analysis

Gene Ontology and KEGG pathway analyses were conducted (Fig. 4). The 6 genes were enriched in biological processes (BP) including cell growth, regulation of nucleic acid metabolism, transport and cell communication. For molecular function (MF), these genes were enriched in growth factor activity, DNA topoisomerase activity, calcium ion binding and ion channel activity. For cellular component (CC), these genes were located in the centriole, nuclear chromosome, kinetochore and DNA topoisomerase complex. In addition, these genes were involved in pathways including ErbB2/ErbB3 signaling, ErbB4 signaling and the mitotic cell cycle, among others.

Fig. 4

Functional enrichment analysis of the 6 differentially expressed genes (DEGs). a Biological pathways in which the 6 DEGs are involved. b Biological processes in which the 6 DEGs are involved. c Cellular components in which the 6 DEGs are located. d Molecular functions of the 6 DEGs

Validation of 6 DEGs

The expression of these 6 genes was compared between the cancer group and the normal prostate tissue group in GSE55945. As shown in Fig. 5, their differential expression was consistent with that in GSE35988 and TCGA_PCa.

Fig. 5

The expression of six differentially expressed genes in GSE55945. The normalized gene expression values of the 6 DEGs in prostate cancer tissue group and normal prostate tissue group were compared by Student’s t test. *Represents p < 0.05

Discussion

At present, the indications for post-operative adjuvant therapy of prostate cancer are confined to pathological TNM staging of pT3 and pN+, positive surgical margins and Gleason score ≥ 7 [1]. However, some patients without these features still develop BCR, so the risk stratification needs to be updated.

The TCGA-PCa dataset, a large-scale, prospective post-operative follow-up cohort, was exploited in our study. To explore the high-dimensional gene expression data thoroughly, rather than focusing only on genes in a specific pathway [6] or genes of particular interest to the authors [17], the strategy for variable selection was as follows: (a) the strict criteria of adj. p < 0.05 and |logFC| ≥ 1 were used to obtain DEGs, and the DEGs common to the two datasets, GSE35988 and TCGA-PCa, were identified; (b) the common DEGs underwent univariate survival analysis, and the statistically significant DEGs were further analyzed in a multivariate Cox regression model to obtain the optimal model and risk score formula; (c) the independent predictive value of the new variable, risk score, was then investigated together with clinical and pathological parameters. This selection strategy avoids the elimination of clinical and pathological parameters that can occur when they enter a multivariate Cox regression model at the same time as high-dimensional transcriptomic data [20]. After the analysis above, the 6-gene signature was constructed, with a predictive ability for BCR that was independent of TNM stage. In light of the association between risk score and Gleason score, we divided patients into two subgroups, Gleason score ≤ 7 and Gleason score > 7, and conducted subgroup survival analysis. Notably, risk score was a predictive factor in the Gleason score ≤ 7 subgroup (log-rank test p < 0.01), and the result was also statistically significant in the Gleason score > 7 subgroup. This result suggests that the predictive effect of risk score may be independent of Gleason score, although this needs further validation in another follow-up cohort.

Downregulation of SMIM22, also named CASIMO1 [21], was reported in breast cancer to decrease cell proliferation and restrain cell motility by affecting downstream phosphorylation of ERK. NRG2 [22, 23] was shown to interact with the ErbB family of receptors, promoting cell growth and differentiation. The primary function of NINL/Nlp [24] is to promote microtubule nucleation, and its aberrant expression in cancer cells could render cells tumorigenic. TPCN2 [25] belongs to a recently described class of NAADP- and PI(3,5)P2-sensitive Ca2+-permeable cation channels in the endolysosomal system, and its downregulation abrogated the migration of metastatic cancer cells in vitro. To our knowledge, no study has reported the relationship between these 4 genes (SMIM22, NINL, NRG2, and TPCN2) and prostate cancer, and their exact molecular mechanisms in prostate cancer development and progression deserve further investigation.

The decreased expression of REPS2/POB1 [26,27,28,29] in androgen-independent prostate cancer cell lines results in loss of control of growth factor signaling and, therefore, of cell proliferation. Our finding that high expression of REPS2 in patients after RP may be associated with a decreased risk of BCR (Table 2, HR 0.550, p = 0.025) is consistent with these previous studies. TOP2A [30] was demonstrated to enhance androgen signaling by promoting transcription of androgen-responsive genes, thereby contributing to hormone-independent cell growth and proliferation. Collectively, the genes of the 6-gene signature are involved in various signaling pathways and participate in diverse biological processes, which may contribute to its ability to predict BCR. In addition, we validated the differential expression of the 6 genes in another public dataset, GSE55945 (all p < 0.05).

Some limitations of this research should be taken into account. Firstly, the exact molecular mechanisms of SMIM22, NRG2, NINL and TPCN2 in prostate cancer were not explored in our research. Secondly, serum PSA and surgical margin status were not provided by the TCGA_PCa dataset, so the relationship between the novel gene signature and these two variables was not explored. Thirdly, the gene signature was constructed from the TCGA-PCa follow-up cohort, without validation in a prospective clinical cohort study.

Conclusion

We derived the key DEGs from the public datasets GSE35988 and TCGA-PCa and then analyzed their ability to predict post-operative biochemical recurrence. A novel 6-gene signature was constructed that predicted BCR independently of TNM stage. Of the 6 genes, 4 (SMIM22, NRG2, NINL and TPCN2) have not previously been reported in relation to prostate cancer. If validated in other prospective clinical trials, this predictive system could help to improve post-operative patient management.