Introduction

Lung cancer is the most common cause of cancer-related mortality worldwide, leading to over one million deaths each year (Torre et al. 2015). Adenocarcinoma is currently the predominant histological subtype of lung cancer, and its rate is still increasing (Travis et al. 2015). Although multidisciplinary cancer therapies have achieved remarkable progress in the past decade, the overall prognosis for lung cancer patients remains poor (Miller et al. 2016). Identifying specific prognostic factors for lung adenocarcinoma is therefore important for predicting clinical outcomes and improving treatment efficacy.

In the age of genomics, gene expression profiling studies have been shown to provide prognostic information beyond conventional parameters in a variety of cancers (Niedzwiecki et al. 2016), and some have further influenced treatment guidelines (Cardoso et al. 2016). However, the technologies used in these earlier studies do not reflect the current view of the genome, as most of them analyzed only protein-coding genes (Chen et al. 2007a; Endoh 2004). Genome-wide studies have identified tens of thousands of long non-coding RNAs (lncRNAs), a novel class of RNAs defined as transcripts longer than 200 nucleotides with no protein-coding potential (Kung et al. 2013; Schmitt and Chang 2016). LncRNAs have been reported to play crucial roles in various cellular processes, including proliferation, survival, migration and genomic stability (Iyer et al. 2015). Given their large number and their expression specificity across different cancer types, lncRNAs are likely to serve as the basis for many clinical applications in oncology (Wahlestedt 2013; Yan et al. 2015). However, the understanding of lncRNAs in lung adenocarcinoma development and clinical outcome remains relatively limited.

The Cancer Genome Atlas (TCGA) project provides a collection of RNA sequencing, DNA copy number variation, DNA methylation and corresponding clinical data for most cancer types. Here, we used the RNA-seq dataset of the TCGA project to identify potential prognostic lncRNAs in lung adenocarcinoma by comparing lncRNA expression between lung adenocarcinomas and adjacent non-tumor tissues. We then developed a prognostic lncRNA signature and analyzed the biological pathways associated with it.

Materials and methods

Clinical cohorts and lncRNA expression data processing

LncRNA expression datasets and corresponding clinical information of lung adenocarcinoma patients were obtained from the publicly available TCGA database. In detail, lncRNA expression data derived from TCGA RNA sequencing of lung adenocarcinomas were downloaded from The Atlas of non-coding RNAs in Cancer (TANRIC, http://bioinformatics.mdanderson.org/main/TANRIC:Overview, up to Aug 16, 2016) (Li et al. 2015), followed by quality filtering that excluded lncRNAs with missing data in more than 20% of all subjects. The full clinical dataset of the corresponding lung adenocarcinoma patients was obtained from cBioPortal (http://www.cbioportal.org/, up to Aug 11, 2016) (Cerami et al. 2012; Gao et al. 2013). Overall, 478 patients with primary tumor samples, 58 of whom also had normal tissue samples, were included in the current study with lncRNA expression data and corresponding clinical information.
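The quality filter described above can be sketched as follows. This is a minimal illustration in Python rather than the tools used in the study; the function name `filter_by_missingness`, the dictionary layout, and the toy identifiers and values are all assumptions made for the example, and missing values are assumed to be encoded as `None`.

```python
from typing import Dict, List, Optional


def filter_by_missingness(
    expression: Dict[str, List[Optional[float]]],
    max_missing_fraction: float = 0.20,
) -> Dict[str, List[Optional[float]]]:
    """Keep lncRNAs whose fraction of missing values does not exceed the cutoff."""
    kept = {}
    for lncrna, values in expression.items():
        if not values:
            continue  # no samples at all, so nothing to keep
        missing = sum(1 for v in values if v is None)
        if missing / len(values) <= max_missing_fraction:
            kept[lncrna] = values
    return kept


# Toy data: identifiers and values are invented for illustration only.
toy = {
    "lncRNA_A": [1.2, 0.8, None, 2.1, 1.7],   # 20% missing, kept
    "lncRNA_B": [None, None, 0.4, 0.9, 1.1],  # 40% missing, dropped
    "lncRNA_C": [0.3, 0.5, 0.2, 0.1, 0.6],    # no missing values, kept
}
filtered = filter_by_missingness(toy)
```

A lncRNA with exactly 20% missing values is retained here, following the wording that only lncRNAs exceeding 20% were excluded.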

Statistical and data mining analyses of TCGA LUAD lncRNA profiles

Differentially expressed lncRNAs between lung adenocarcinomas and normal tissues were identified with the “limma” package, with the significance level set at 0.01 to control the false discovery rate (FDR) (Kaczkowski et al. 2016). A heatmap of the differentially expressed lncRNAs was plotted with the “pheatmap” package.
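To make the FDR threshold concrete, the sketch below applies a Benjamini-Hochberg adjustment to a list of p values. The text does not name the adjustment method, so Benjamini-Hochberg is an assumption here, and the moderated t statistics that limma computes are not reproduced. The function name and the example p values are hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest so that the adjusted
    # values stay monotone in the rank of the raw p values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Invented p values, one per lncRNA, for illustration only.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]
adjusted_p = benjamini_hochberg(raw_p)
significant = [i for i, q in enumerate(adjusted_p) if q < 0.01]
```

With a cutoff of 0.01 on the adjusted values, only entries whose adjusted p value falls below 0.01 would be called differentially expressed.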

The association between differentially expressed lncRNAs and patient overall survival was analyzed with a univariate Cox proportional hazards regression model. Associations were considered statistically significant at p < 0.01. The selected lncRNAs were then fitted in a multivariate Cox proportional hazards regression model in the dataset. A risk score formula was constructed from these prognostic lncRNAs, weighted by their estimated Cox regression coefficients (Kawaguchi et al. 2012). Patients were classified into a high-risk or low-risk group by their assigned risk score, using the median as the cutoff point. Overall and disease-free survival of patients in the high-risk and low-risk groups were estimated using the Kaplan–Meier method. The log-rank test was used to determine survival differences between groups. Independent prognostic factors were identified through the Cox proportional hazards regression model. All tests were two-tailed. Unless otherwise specified, statistical significance was set at p < 0.05. All data were analyzed using SPSS Version 19.0 (SPSS Inc., Chicago, IL, USA) and R version 3.2.5.
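The Kaplan–Meier estimate and the two-group log-rank test named above can be sketched in plain Python. This is an illustrative re-implementation, not the SPSS or R routines used in the study; the function names are hypothetical, event indicators are assumed to be 1 for an observed event and 0 for censoring, and the p value uses the chi-square distribution with one degree of freedom.

```python
import math
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time."""
    curve = []
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def logrank_test(
    times_a: List[float], events_a: List[int],
    times_b: List[float], events_b: List[int],
) -> Tuple[float, float]:
    """Two-group log-rank test; returns (chi-square statistic, p value)."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e == 1}
        | {t for t, e in zip(times_b, events_b) if e == 1}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        # Observed events in group A minus the number expected under the null.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

In the setting described here, group A and group B would correspond to the high-risk and low-risk patients defined by the median risk score.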

The receiver operating characteristic (ROC) curve was constructed using R package pROC to evaluate the sensitivity and specificity of the survival prediction for the lncRNA signature risk score, age and TNM stage. Area under the curve (AUC) values were calculated from the ROC curves.
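As a rough illustration of what the AUC measures, the sketch below computes it as the probability that a randomly chosen patient with the event has a higher score than a randomly chosen patient without it. This is a plain-Python stand-in for the pROC calculation, not the package itself; the function name is hypothetical, and a binary outcome label (for example death within a fixed follow-up window) is assumed, since the handling of censored patients is not described in the text.

```python
from typing import List


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented scores and outcomes for four patients, for illustration only.
example_auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An AUC of 0.5 corresponds to a score with no discriminative ability, and 1.0 to perfect separation of the two outcome groups.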

Gene set enrichment analysis was performed using the GSEA software (http://www.broadinstitute.org/gsea) with the MSigDB C2 CP: KEGG gene set collection (186 gene sets available). Gene sets with a false discovery rate (FDR) value <0.01 after 1000 permutations were considered significantly enriched (Subramanian et al. 2005).
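The core quantity behind this analysis is the running-sum enrichment score of Subramanian et al. (2005). The sketch below computes that score for a single gene set; it is a simplified illustration with a hypothetical function name, it assumes the genes are already sorted by their ranking metric in descending order, and it omits the permutation step, the normalization and the FDR estimate performed by the GSEA software.

```python
from typing import List, Set


def enrichment_score(
    ranked_genes: List[str],
    correlations: List[float],
    gene_set: Set[str],
    weight: float = 1.0,
) -> float:
    """Maximum deviation from zero of the weighted running sum over the ranked list.

    ranked_genes and correlations must be in the same order, sorted by the
    ranking metric from highest to lowest.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_norm = sum(abs(r) ** weight for r, h in zip(correlations, hits) if h)
    if hit_norm == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    running = 0.0
    best = 0.0
    for r, h in zip(correlations, hits):
        if h:
            running += abs(r) ** weight / hit_norm  # step up at a gene in the set
        else:
            running -= 1.0 / n_miss                 # step down otherwise
        if abs(running) > abs(best):
            best = running
    return best


# Invented four-gene ranking, for illustration only.
genes = ["g1", "g2", "g3", "g4"]
metric = [2.0, 1.0, -1.0, -2.0]
es_top = enrichment_score(genes, metric, {"g1", "g2"})
es_bottom = enrichment_score(genes, metric, {"g3", "g4"})
```

A positive score indicates that the gene set is concentrated at the top of the ranking, and a negative score that it is concentrated at the bottom.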

Results

Identification of prognostic lncRNAs in lung adenocarcinomas from TCGA dataset

The expression of lncRNAs in the 478 lung adenocarcinomas and 58 adjacent non-tumor tissues was investigated, and a total of 132 lncRNAs were found to be differentially expressed with p value <0.01 (Fig. 1). A univariate Cox proportional hazards regression model was fitted to the lncRNA expression data, and eight lncRNAs were found to be significantly associated with overall survival (p < 0.01, Supplementary Table 1). Among the eight lncRNAs, six (LINC00857, RP11-284F21.7, TMPO-AS1, RP11-284F21.9, LINC01137, and RP11-253E3.3) had hazard ratios greater than 1 and were associated with shorter survival, and two (RP11-344B5.2 and CTC-429P9.1) had hazard ratios smaller than 1 and were associated with longer survival.

Fig. 1 Heatmap of 132 lncRNAs differentially expressed between lung adenocarcinomas and normal tissues, with red indicating higher expression and green indicating lower expression

Development of a lncRNA signature associated with survival of lung adenocarcinoma patients

A risk score formula was established to compute a prognostic index for each patient, using the multivariate Cox proportional hazards regression model to weight the expression of each significant lncRNA by its coefficient: Risk Score = (0.095 × LINC00857 + 0.031 × RP11-284F21.7 + 0.066 × TMPO-AS1 + 0.042 × RP11-284F21.9 + 0.061 × LINC01137 + 0.408 × RP11-253E3.3) + (−0.071 × RP11-344B5.2) + (−0.241 × CTC-429P9.1). We calculated the eight-lncRNA expression signature risk score for each patient. Patients were assigned to two groups (high-risk and low-risk) by their risk scores, with the median risk score serving as the cutoff point. Patients in the high-risk group had significantly shorter overall survival (OS, median 38.34 vs. 85.97 months, p < 0.001) and disease-free survival (DFS, median 26.58 vs. 44.02 months, p = 0.007) than patients in the low-risk group (Fig. 2). The distribution of risk scores, survival status and lncRNA expression levels across individuals was also analyzed (Fig. 3).
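The risk score formula and the median split can be written out directly. The coefficients below are copied from the formula above; everything else is an illustrative assumption: the function names are hypothetical, expression values are assumed to be on the same scale as the data used to fit the model, and patients with a score exactly equal to the median are placed in the low-risk group because the text does not say how ties were handled.

```python
from statistics import median
from typing import Dict, List

# Cox regression coefficients as reported in the risk score formula.
COEFFICIENTS: Dict[str, float] = {
    "LINC00857": 0.095,
    "RP11-284F21.7": 0.031,
    "TMPO-AS1": 0.066,
    "RP11-284F21.9": 0.042,
    "LINC01137": 0.061,
    "RP11-253E3.3": 0.408,
    "RP11-344B5.2": -0.071,
    "CTC-429P9.1": -0.241,
}


def risk_score(expression: Dict[str, float]) -> float:
    """Weighted sum of the eight lncRNA expression values for one patient."""
    return sum(coef * expression[name] for name, coef in COEFFICIENTS.items())


def assign_risk_groups(scores: List[float]) -> List[str]:
    """Label each patient 'high' or 'low' using the cohort median as the cutoff."""
    cutoff = median(scores)
    # Assumption: scores equal to the median go to the low-risk group.
    return ["high" if s > cutoff else "low" for s in scores]
```

Because the two protective lncRNAs carry negative coefficients, higher expression of RP11-344B5.2 or CTC-429P9.1 lowers the score, while higher expression of the other six raises it.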

Fig. 2 Kaplan–Meier curves of overall survival (a) and disease-free survival (b) in the TCGA cohort, stratified into high-risk and low-risk groups by the eight-lncRNA prognostic signature. A two-sided log-rank test was used to calculate the p value

Fig. 3 Expression of the eight lncRNAs, risk score distribution and survival in the TCGA cohort. The risk scores (a) for all patients in the TCGA cohort are plotted in ascending order and marked as low risk (blue) or high risk (red), as divided by the median score (vertical black line). Follow-up time and survival status of each patient are shown in b, with alive and dead patients marked in blue and red, respectively. The expression distribution of the eight lncRNAs in the TCGA cohort by z score is shown in c, with red indicating higher expression and green indicating lower expression

Associations of the eight-lncRNA signature model with clinical parameters and patient outcome

We analyzed the associations between the eight-lncRNA signature model and clinical parameters in lung adenocarcinoma patients. The results showed that the lncRNA signature was not associated with age, smoking history, tumor location, surgical margin, KRAS/EGFR mutation, or adjuvant or neoadjuvant therapy (Table 1). A high risk score was significantly associated with male sex (p < 0.001) and late AJCC TNM stage (p < 0.001).

Table 1 Clinical characteristics of patients with lung adenocarcinoma according to lncRNA signature

We performed univariate and multivariate Cox analyses to ascertain whether the eight-lncRNA expression signature could be an independent predictor of outcome in lung adenocarcinoma patients. The univariate analysis showed that the lncRNA risk score and AJCC TNM stage were associated with recurrence and mortality (results not shown). The effect of risk score, age, sex, smoking history, KRAS/EGFR mutation status and TNM stage on lung cancer patient survival time was further analyzed with a multivariate Cox proportional hazards model in the cohort. The results indicated that the risk score might be an independent predictor of overall survival when adjusted for these factors (Table 2).

Table 2 Multivariate Cox proportional hazards regression analysis of the lncRNA signature and characteristics with OS and DFS

We also analyzed the relationship between the eight-lncRNA signature and the AJCC TNM stage. As shown in Fig. 4, the mean risk score increased with tumor stage in the cohort, suggesting that the eight-lncRNA risk score model might provide a reliable aid for predicting the clinical outcome of lung adenocarcinoma patients. To test the performance of the signature, we conducted ROC analysis and calculated the AUC for the signature and for the traditional prognostic factors identified by the multivariate Cox proportional hazards model (AJCC TNM stage and age). Our analysis showed that the lncRNA signature might have better prognostic value than traditional prognostic factors in predicting 5-year OS (Fig. 5, AUC, 0.689 vs. 0.661 and 0.532).

Fig. 4 Risk score distribution of patients with different TNM stage. Risk score increases with TNM stage, p < 0.00001

Fig. 5 Comparison of the sensitivity and specificity of prognosis by the lncRNA signature and the traditional clinicopathologic factors in TCGA cohort. ROC curves were plotted to assess the efficacy of the signatures, with AUCs reported

KEGG pathway analysis for eight-lncRNA signature target genes

The lncRNAs selected in the present study have been linked to several cancers and other diseases in previous studies. To explore their biological relevance, we examined whether any KEGG pathways were enriched for target genes of the eight lncRNAs, and the target genes were found to participate in both cancer-related and non-cancer-related pathways. The KEGG pathway enrichment of the target genes is shown in Fig. 6. Pathways for ascorbate and aldarate metabolism, cell cycle, DNA replication, porphyrin and chlorophyll metabolism, pentose and glucuronate interconversions, proteasome, homologous recombination and spliceosome were enriched for the eight-lncRNA target genes. Expression levels of important genes involved in the cell cycle and DNA replication pathways in the cohort are shown in Supplementary Figure 1. Among these genes, Pearson correlation analysis revealed that the expression of some genes, for example PLK1, CDC20, CCNB1, BUB1B, CCNA2, CCNB2, CDC45 and CHEK1, was correlated with the lncRNA risk score, as listed in Supplementary Table 2. The correlation between the expression levels of representative genes and the lncRNA risk score is shown in Supplementary Figure 2. Thus, KEGG pathway analysis identified potential targets and biological processes known to be involved in cancer, which supports the biological relevance of the eight-lncRNA signature.
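The Pearson correlation between a gene's expression and the lncRNA risk score can be sketched as below. This is a generic illustration with a hypothetical function name and invented numbers; it does not reproduce the values in the supplementary tables.

```python
import math
from typing import List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


# Invented per-patient values, for illustration only.
risk_scores = [1.0, 2.0, 3.0, 4.0]
gene_expression = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(risk_scores, gene_expression)
```

In the analysis described here, one such coefficient would be computed per gene, pairing that gene's expression with the risk score across all patients.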

Fig. 6 The target gene enrichment of KEGG pathways of the lncRNA signature. Top eight KEGG pathways with highest normalized enrichment score (a). GSEA results for high versus low risk differentially expressed genes in TCGA cohort for KEGG cell cycle (b) and KEGG DNA replication (c)

Discussion

The conventional view of cancer has been refreshed in the past decade by rapid progress in genomic and transcriptional research. The TCGA dataset has proven to be a powerful resource for linking clinical features to genetic and transcriptional changes, leading to the identification of new prognostic markers and potentially novel therapeutic targets. LncRNAs have been suggested as promising biomarkers in various cancer types. In the present study, we mined and analyzed lncRNA expression profiles from the TCGA database. By analyzing the association between gene expression profiles and the clinical outcome of lung adenocarcinoma patients, we identified a tumor-specific eight-lncRNA signature significantly related to the overall survival of lung adenocarcinoma patients. We further demonstrated that the eight-lncRNA signature is an independent predictor for lung adenocarcinoma after adjusting for variables including age, sex, KRAS/EGFR mutation status and AJCC pathological stage. The use of this eight-lncRNA signature should be investigated in prospective patient cohorts in the future.

Lung adenocarcinoma is characterized by a higher incidence rate and worse survival outcomes than many other cancers, with a 5-year overall survival rate of approximately 20% (Miller et al. 2016). More effective therapy for high-risk patients is therefore needed. Indeed, numerous microarray-based methods have been developed since 2002 to identify prognostic signatures in lung cancer (Beer et al. 2002). Beer and colleagues first developed a gene expression profile based on microarray analysis that can be used to predict survival in early-stage lung adenocarcinoma patients (Beer et al. 2002). Traditionally, patients were selected for adjuvant chemotherapy based on clinical criteria; this result therefore allowed a high-risk group that may benefit from adjuvant therapy to be delineated by a molecular diagnostic method. Endoh (2004) analyzed the relationship between qRT-PCR results for a total of 45 genes and the clinical outcome of cancer patients and found that 8 of these genes were associated with the outcomes of lung adenocarcinoma. Chen et al. (2007b) also identified a five-gene signature that was closely associated with survival of NSCLC patients. These studies have led to ongoing clinical trials using microarray-derived qRT-PCR prognostic signatures to select patients for adjuvant chemotherapy (Kratz et al. 2013). However, these previous studies included only protein-coding genes in the analysis, and tens of thousands of non-coding RNAs were excluded.

Increasing evidence indicates that lncRNAs, an important class of non-coding RNAs, affect various aspects of cellular homeostasis and play key roles in cell proliferation, migration, survival and genomic stability (Huarte 2015). A number of lncRNAs have been found to correlate with the survival of cancer patients. However, knowledge of lncRNAs associated with lung adenocarcinoma patient survival is still limited. Zhou et al. (2015) performed an array-based transcriptional analysis of lncRNAs from the Gene Expression Omnibus (GEO) database. An expression pattern of eight lncRNAs was found to be significantly associated with overall survival in NSCLC patients; however, the differences in genomic landscape between lung adenocarcinoma and squamous cell carcinoma were not emphasized. Shukla et al. undertook a prognostic analysis of lung adenocarcinoma RNA-seq data and generated a prognostic signature consisting of four genes, including one lncRNA. More lncRNAs associated with outcomes of lung adenocarcinoma still need to be identified.

As for the characteristics of the eight lncRNAs in our predictive model, the expression of two lncRNAs (RP11-344B5.2 and CTC-429P9.1) in the low-risk group was found to correlate with longer survival, while the other six lncRNAs (LINC00857, RP11-284F21.7, TMPO-AS1, RP11-284F21.9, LINC01137, and RP11-253E3.3) were upregulated in the high-risk group compared with the low-risk group. Among the six risk-associated lncRNAs, LINC00857 was found to be overexpressed in a next-generation RNA sequencing analysis of 461 lung adenocarcinomas and 156 normal lung tissues from three separate institutions (Wang et al. 2016). As one of the top deregulated lncRNAs, LINC00857 overexpression was significantly associated with poor survival in lung adenocarcinoma. In vitro studies demonstrated that overexpression of LINC00857 increased cancer cell proliferation, colony formation and invasion. Mechanistic analyses indicated that LINC00857 mediated tumor progression via cell cycle regulation. TMPO-AS1 was also found to be a prognostic biomarker in lung adenocarcinoma in another study (Li et al. 2016). Meanwhile, the functions of the other six lncRNAs, which belong to a novel subset of RNA, remain poorly understood. Biological and clinical studies of these lncRNAs might provide partial clues to their prognostic value, but well-designed studies are still needed to validate the functions of these lncRNAs in lung adenocarcinoma. The eight identified lncRNAs may help predict clinical outcome in patients with lung adenocarcinoma and reveal potential targets for the development of therapeutic options. LncRNAs have been suggested as promising therapeutic targets, mostly by preclinical studies and by some clinical studies, and some lncRNAs have been used as therapeutic targets in the treatment of cancer patients (Chandra and Nandan 2016).

Although the eight-lncRNA signature could provide an effective independent prognostic model for lung adenocarcinoma, the limitations of the current study should be acknowledged as well. Firstly, the TCGA patient cohort was drawn from multiple institutions, and the study results still need to be validated in other cohorts in future studies. Secondly, because non-coding RNAs outnumber protein-coding RNAs several-fold, only lncRNAs were included in the present study. Thus, this study could not represent the whole transcriptional alteration associated with lung adenocarcinoma.

In conclusion, we have identified eight lncRNAs associated with survival of lung adenocarcinoma patients in a large cohort and developed a lncRNA prognostic signature. Further analysis revealed that the eight-lncRNA signature could be a prognostic factor independent of conventional clinical parameters. This signature may serve as a novel biomarker for prognostic prediction in lung adenocarcinoma and may help improve treatment outcomes. Further research in large cohorts is necessary to improve the eight-lncRNA signature.