Introduction

Lung cancer is the leading cause of cancer-related death worldwide, with 2.1 million new cases and 1.8 million deaths in 2018 (Bray et al. 2018). Lung adenocarcinoma is becoming the most common histological type of lung cancer, and its morbidity and mortality are also increasing (Travis et al. 2015). Even among patients with early-stage (stage I and stage II) lung adenocarcinoma, more than 30% of those who undergo radical surgical resection will relapse and die of tumor recurrence (Padda et al. 2014). Patients with early relapse of lung adenocarcinoma tend to have poorer survival, which is attributed mainly to unfavorable clinicopathological features such as advanced tumor stage, poor differentiation, visceral pleural involvement, insufficient resection, incomplete nodal sampling and resistance to adjuvant chemotherapy (Kozu et al. 2013; Kiankhooy et al. 2014). In addition, relapse of early-stage lung adenocarcinoma is time dependent: most relapses occur within the first 2 years after surgery. Currently, the tumor–node–metastasis (TNM) staging system of the American Joint Committee on Cancer/International Union Against Cancer (AJCC/UICC) is widely used for treatment selection and relapse prediction in early-stage lung adenocarcinoma (Amin et al. 2017). Although multiple clinicopathological features can be combined with the TNM staging system to improve the accuracy of relapse prediction, prognosis often varies even among patients with comparable stage and clinicopathological features. This variation reflects the potential tumor heterogeneity of lung adenocarcinoma. Consequently, more accurate strategies are warranted for early relapse detection.
Several studies have focused on transcriptional profiles in lung adenocarcinoma, and several transcriptional multigene signatures have been developed for overall survival prediction in lung adenocarcinoma patients (Raponi et al. 2006; Der et al. 2014). However, very few molecular classifiers have been developed for early relapse prediction. More importantly, although non-coding RNAs (ncRNAs) have been confirmed to play important roles in multiple cancers (Kornienko et al. 2013; Kung et al. 2013; Khorkova et al. 2015), no previous study has combined mRNAs and ncRNAs to construct an integrated signature for early relapse prediction in early-stage lung adenocarcinoma. Thus, with multiple public transcriptome datasets and novel bioinformatic methods, identifying a robust and practical mRNA–ncRNA signature to predict early relapse of early-stage lung adenocarcinoma is feasible and of great clinical significance.

In the present study, we performed an integrative analysis of mRNA and ncRNA expression profiles for the prediction of early relapse in early-stage lung adenocarcinoma. We hypothesized that an integrated signature carrying more transcript information would improve risk stratification, reveal the biological behavior of different risk groups and support more accurate individualized treatment strategies in early-stage lung adenocarcinoma.

Materials and methods

Data collection and preprocessing

Raw gene microarray expression profiles were downloaded from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/). All datasets fulfilling the following criteria were included: (1) gene expression profiles of primary lung cancer; (2) use of the Affymetrix Human Genome U133 Plus 2.0 chip platform (GPL570); (3) availability of basic clinicopathological information, follow-up and relapse status; (4) a sample size of more than 50. Finally, the datasets GSE31210, GSE50081, GSE30219 and GSE37745 were included. From the four datasets, only patients with stage I–II lung adenocarcinoma were retained. In addition, considering the potential for non-curative resection, patients who relapsed within 1 month after surgery were excluded. Finally, clinical data and raw CEL files of the remaining 476 patients from the 4 datasets were merged into a meta-dataset for further analysis. Using the robust multichip average (RMA) algorithm (Irizarry et al. 2003), raw CEL files of the four microarray datasets were processed for background correction, normalization, and log2 transformation. To obtain mRNA and ncRNA expression profiles separately, we applied a probe reannotation pipeline as proposed in previous studies (Du et al. 2013). First, the probe sequences of the Affymetrix HG-U133 Plus 2.0 array were remapped to the latest version of the NetAffx Annotation File. When multiple probes mapped to the same Entrez Gene ID, their mean value was used as the expression level of that gene. Second, the chromosomal coordinates of the retained probes were matched to the chromosomal coordinates of ncRNAs derived from the GENCODE project (https://www.gencodegenes.org/, release 28). Probes that mapped to both ncRNAs and protein-coding genes were discarded. The ComBat method was used to remove potential internal and external batch effects among the different datasets.
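
The probe-averaging step described above can be illustrated with a minimal sketch. It assumes log2 expression values stored as one list per probe (one entry per sample) and a probe-to-gene mapping; the function name `collapse_probes` and all probe and gene identifiers are hypothetical and are not taken from the original pipeline.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(probe_expr, probe_to_gene):
    """Average the expression of all probes that map to the same gene ID.

    probe_expr:    {probe_id: [value for each sample]} (log2 scale assumed)
    probe_to_gene: {probe_id: gene_id}; probes without a mapping are skipped
    """
    grouped = defaultdict(list)
    for probe, values in probe_expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # unannotated probe, not used
        grouped[gene].append(values)
    # zip(*rows) walks the samples column by column across the probes of a gene
    return {gene: [mean(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}


# Hypothetical example: two probes for GENE_A, one probe for GENE_B
expr = {"probe_1": [2.0, 4.0], "probe_2": [4.0, 6.0], "probe_3": [1.0, 1.0]}
mapping = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
gene_expr = collapse_probes(expr, mapping)
```

Averaging on the log2 scale corresponds to a geometric mean of the raw intensities; whether the authors averaged before or after log transformation is not stated, so this ordering is an assumption.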

Identification of mRNA and ncRNA related to early relapse

Early relapse was defined as relapse within 2 years after radical resection. The cases in the meta-dataset were randomly allocated to a training set and a testing set at a ratio of 7:3 (training set, n = 334; testing set, n = 142), and the training set was further divided into an early relapse group and a long-term nonrelapse group (at least 5 years of follow-up without relapse). To reduce the influence of confounding factors between the groups, propensity score matching (PSM) analysis was performed between the two groups based on clinicopathological information such as gender, age, smoking history, and tumor stage. The matching ratio was set to 1:1. Finally, 56 matched pairs of patients in the training set were selected for transcriptome analysis. The Linear Models for Microarray Data (LIMMA) method was used to screen differentially expressed mRNAs (fold change > 1.5, adjusted P < 0.05) and ncRNAs (fold change > 1.25, adjusted P < 0.05) between the matched early relapse and long-term nonrelapse groups. Random survival forest (RSF) analysis was used for dimensionality reduction and importance ranking of the differentially expressed genes (DEGs). A LASSO Cox regression model was then used to construct the final prognostic model from the early relapse-related DEGs retained after RSF.
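
The group definitions and screening thresholds above can be written out as a short sketch. It assumes that time is recorded in months, that "within 2 years" includes month 24, and that the fold-change cutoffs apply in both directions to a log2 fold change; none of these details is stated explicitly in the text, and the function names are hypothetical.

```python
import math

EARLY_RELAPSE_MONTHS = 24  # "within 2 years" after resection (boundary assumed inclusive)
LONG_TERM_MONTHS = 60      # "at least 5 years" of relapse-free follow-up


def relapse_group(relapsed, months):
    """Assign a patient to one of the two comparison groups.

    months is the time to relapse if relapsed, otherwise the follow-up time.
    Returns None for patients who belong to neither group.
    """
    if relapsed and months <= EARLY_RELAPSE_MONTHS:
        return "early_relapse"
    if not relapsed and months >= LONG_TERM_MONTHS:
        return "long_term_nonrelapse"
    return None  # late relapse or insufficient follow-up


def is_deg(log2_fc, adj_p, fold_change_cutoff):
    """Apply the fold-change and adjusted-P thresholds to one transcript.

    fold_change_cutoff is 1.5 for mRNAs and 1.25 for ncRNAs in the text;
    it is converted to the log2 scale and applied to up- and downregulation.
    """
    return abs(log2_fc) > math.log2(fold_change_cutoff) and adj_p < 0.05
```

For example, `is_deg(0.4, 0.01, 1.25)` passes the ncRNA threshold, while the same transcript would fail the stricter mRNA threshold of 1.5.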

Establishment and clinical application of risk score and prognostic model

The risk score of each patient was calculated as a linear combination of the RNA expression levels weighted by the LASSO Cox regression coefficients. Patients in the training and testing sets were divided into high-risk and low-risk groups using the median risk score of the training set as the cutoff value. The Kaplan–Meier estimator was used to compare relapse between the high-risk and low-risk groups in the training and testing sets. Univariate and multivariate Cox regression analyses and stratified survival analysis were used to test whether the risk score independently predicted relapse. Time-dependent receiver operating characteristic (ROC) curves were used to evaluate the predictive accuracy of each feature and signature at different time points. The integrated mRNA–ncRNA signature and clinicopathological characteristics were combined to construct a nomogram for early relapse prediction. Survival decision curve analysis (DCA) was used to evaluate the net benefit derived from the integrated mRNA–ncRNA signature or the nomogram. We predicted the chemotherapeutic response of each sample based on the Genomics of Drug Sensitivity in Cancer (GDSC) database (https://www.cancerrxgene.org/) using the R package “pRRophetic” (Geeleher et al. 2014). Seven commonly used chemotherapy drugs (paclitaxel, fluorouracil, cisplatin, etoposide, vinorelbine, gemcitabine, and docetaxel) were analyzed. The half-maximal inhibitory concentration (IC50) of each sample for each drug was estimated by ridge regression, and the prediction accuracy was evaluated by tenfold cross-validation.
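
As an illustration of the Kaplan–Meier estimator used to compare the risk groups, the following is a minimal sketch of the product-limit calculation. It is not the authors' implementation (the analysis was run in R and SPSS); the function name and the toy data are hypothetical, and an event value of 1 denotes relapse while 0 denotes censoring.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times:  follow-up time of each patient
    events: 1 if relapse was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        relapses = 0  # events observed exactly at time t
        leaving = 0   # all patients leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            relapses += data[i][1]
            leaving += 1
            i += 1
        if relapses > 0:
            # product-limit step: multiply by the conditional survival at t
            survival *= 1 - relapses / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Toy data: five patients, follow-up in months
curve = kaplan_meier([6, 12, 12, 20, 30], [1, 1, 0, 0, 1])
```

Censored patients contribute to the number at risk until their last follow-up but never trigger a drop in the curve, which is why the estimate only changes at observed relapse times.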

Molecular characteristics and tumor microenvironment analysis of different risk groups

Gene set enrichment analysis (GSEA) was performed on the expression profile data to investigate potential mechanisms, using the MSigDB gene set collections h.all.v7.2.symbols, c2.cp.kegg.v7.2.symbols and c2.cp.reactome.v7.2.symbols with the GSEA Java program (https://www.gsea-msigdb.org/gsea/index.jsp) and the R package “clusterProfiler” (Subramanian et al. 2005; Yu et al. 2012). The number of permutations was set to 1000, and significance was defined as adjusted P < 0.05 and false discovery rate (FDR) < 0.25. The CIBERSORT algorithm was used to estimate the composition of 22 immune cell types in each sample. Gene expression data with standard annotation were uploaded to the CIBERSORT web portal (https://cibersort.stanford.edu/), and deconvolution was performed with the LM22 signature matrix of 22 immune cell types and 1000 permutations. To ensure accuracy, only samples with a CIBERSORT output of P < 0.05 were retained.

RNA-seq and mutation landscape analysis

TCGA RNA-Seq raw read count data of 226 stage I–II lung adenocarcinoma patients with complete follow-up information were downloaded from the GDC database (https://portal.gdc.cancer.gov/). Ensembl gene IDs were annotated against GENCODE release 28 to obtain gene symbols. To be consistent with the distribution of the microarray data, raw read counts were normalized across samples using the voom algorithm. Somatic variant data stored in Mutation Annotation Format (MAF) were also downloaded from GDC. Nonsynonymous mutations were used for mutation load analysis.
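
A per-sample mutation load of the kind described above can be sketched as a simple count over MAF records. The column names `Tumor_Sample_Barcode` and `Variant_Classification` are standard MAF fields, but the set of variant classes treated as nonsynonymous below is an assumption, since the text does not list which classes were counted; the function name and sample barcodes are hypothetical.

```python
from collections import Counter

# Assumed definition of "nonsynonymous"; the paper does not enumerate the classes.
NONSYNONYMOUS = {
    "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation",
    "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins",
    "Splice_Site", "Translation_Start_Site",
}


def mutation_load(maf_rows):
    """Count nonsynonymous somatic variants per tumor sample.

    maf_rows: iterable of dicts with the MAF columns
              'Tumor_Sample_Barcode' and 'Variant_Classification'.
    """
    counts = Counter()
    for row in maf_rows:
        if row["Variant_Classification"] in NONSYNONYMOUS:
            counts[row["Tumor_Sample_Barcode"]] += 1
    return dict(counts)


# Hypothetical records for two samples
rows = [
    {"Tumor_Sample_Barcode": "SAMPLE_1", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "SAMPLE_1", "Variant_Classification": "Silent"},
    {"Tumor_Sample_Barcode": "SAMPLE_1", "Variant_Classification": "Nonsense_Mutation"},
    {"Tumor_Sample_Barcode": "SAMPLE_2", "Variant_Classification": "Frame_Shift_Del"},
]
load = mutation_load(rows)
```

Samples with no qualifying variants do not appear in the result, so a downstream comparison between risk groups would need to fill in zeros for them.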

Statistical analysis

All statistical tests were performed in R 3.6.1 and SPSS 23.0. Categorical data were compared using the χ2 test or Fisher’s exact test as appropriate, and continuous data using the two-sample Wilcoxon test (Mann–Whitney test). Pearson’s correlation test was used for correlation analysis. Survival curves were plotted using the Kaplan–Meier method and compared using the log-rank test. Univariable and multivariable Cox regression analyses were performed to investigate whether the gene signature was independent of other clinicopathological factors. All statistical tests were two sided, and a P value < 0.05 was considered statistically significant.

Results

Preparation of lung adenocarcinoma dataset

A total of 476 patients with stage I–II lung adenocarcinoma from the GEO database were selected and comprehensively studied, including 226 patients from the GSE31210 cohort, 125 patients from the GSE50081 cohort, 82 patients from the GSE30219 cohort, and 43 patients from the GSE37745 cohort. The clinical information of all patients can be found in Supplementary Table 1. Plots of the first and second principal components before and after removal of batch effects among the four cohorts are shown in Fig. S1.

Establishment of early relapse-related mRNA–ncRNA signature from the training set

Patients in the training set were divided into an early relapse group and a long-term nonrelapse group. The baseline clinicopathological characteristics before and after PSM analysis are shown in Table 1. Before PSM analysis, there were more patients with stage II disease in the early relapse group. After PSM analysis, there were no significant differences between the two groups in age, gender, T stage, N stage, or TNM stage. Through gene annotation, a total of 3419 ncRNAs and 17,561 mRNAs were identified. Transcriptome profiles were then compared between the two matched groups. The results are shown in Supplementary Tables 2 and 3. A total of 193 mRNAs and 49 ncRNAs were differentially expressed. After dimensionality reduction using RSF analysis, 41 RNAs, including 39 mRNAs and 2 ncRNAs, were retained (Fig. 1A, B). The differentially expressed genes and their distributions on chromosomes are shown in Fig. 2. LASSO coefficient profiles of the 41 RNAs are shown in Fig. 1C, D. A coefficient profile plot was produced against the log (λ) sequence. A vertical line was drawn at the value selected by tenfold cross-validation, and the minimum-λ criterion yielded 12 mRNAs and 1 ncRNA. Finally, the risk score of each patient in the training set was calculated from the expression levels of the 13 RNAs and their LASSO Cox regression coefficients: Risk score = (0.006 × CCL20) + (0.069 × ANLN) + (0.061 × ARNTL2) + (−0.088 × CYP4B1) + (0.036 × FAM83A) + (0.005 × GREM1) + (0.005 × IL1R2) + (0.001 × SPOCK1) + (0.034 × DLGAP5) + (0.022 × COL11A1) + (0.063 × TPX2) + (0.095 × TK1) + (0.171 × LINC01116).
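
The risk score formula above, together with the median-based grouping described in the Methods, can be expressed as a short sketch. The coefficients are copied from the formula; the function names, the expression values in the example, and the choice to place scores equal to the cutoff in the low-risk group are assumptions made for illustration only.

```python
from statistics import median

# LASSO Cox coefficients as reported in the risk score formula
COEFFICIENTS = {
    "CCL20": 0.006, "ANLN": 0.069, "ARNTL2": 0.061, "CYP4B1": -0.088,
    "FAM83A": 0.036, "GREM1": 0.005, "IL1R2": 0.005, "SPOCK1": 0.001,
    "DLGAP5": 0.034, "COL11A1": 0.022, "TPX2": 0.063, "TK1": 0.095,
    "LINC01116": 0.171,
}


def risk_score(expression):
    """Weighted sum of the 13 RNA expression levels for one patient.

    expression: {gene symbol: normalized expression value}
    """
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def assign_risk_groups(training_scores, scores):
    """Split scores into 'high'/'low' using the training-set median as cutoff.

    Ties with the cutoff are placed in the low-risk group (an assumption).
    """
    cutoff = median(training_scores)
    return ["high" if s > cutoff else "low" for s in scores]


# Hypothetical patient whose 13 transcripts all have expression 1.0
flat_profile = {gene: 1.0 for gene in COEFFICIENTS}
score = risk_score(flat_profile)  # equals the sum of the coefficients
groups = assign_risk_groups([0.2, 0.4, 0.6, 0.8], [0.3, 0.7])
```

Because the cutoff is derived from the training set only, the same threshold can be applied unchanged to the testing set, which is how the grouping in the Methods is described.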

Table 1 Clinicopathological features of patients in early relapse and long-term nonrelapse groups before and after propensity score matching
Fig. 1 Integrated mRNA–ncRNA signature selection using RSF and LASSO Cox regression. A, B Dimensionality reduction using RSF analysis. C, D LASSO coefficient profiles of the 41 candidate mRNAs and ncRNAs

Fig. 2 The differentially expressed genes and their distributions on chromosomes in the PSM paired groups after dimensionality reduction using RSF

The prognostic value of risk score in different datasets

Using the median risk score as the cutoff value, patients in the training set were divided into a low-risk group and a high-risk group (n = 167 each). The distribution of risk scores and relapse status showed that patients with low risk scores had better relapse-free survival (RFS) than patients with high risk scores (Fig. 3A, left panel). Time-dependent ROC analysis was used to evaluate the prognostic accuracy of the integrated mRNA–ncRNA risk score, and the AUCs at 1, 3 and 5 years were 0.747, 0.736, and 0.764, respectively (Fig. 3A, middle panel). The RFS rates of patients in the low-risk group were 94.6% at 1 year, 80.8% at 3 years, and 54.5% at 5 years, compared with 80.2%, 47.3%, and 31.7% in the high-risk group, respectively (HR 3.19, 95% CI 2.16–4.72, P < 0.001, Fig. 3A, right panel). The same analysis was then performed in the testing set, where the time-dependent ROC AUCs at 1, 3 and 5 years were 0.749, 0.711, and 0.728, respectively. The 1-, 3- and 5-year RFS rates were 96.4%, 78.2%, and 52.7% in the low-risk group and 79.3%, 51.7%, and 32.2% in the high-risk group, respectively (HR 2.91, 95% CI 1.63–5.20, P = 0.002) (Fig. 3B). In the entire dataset, classification based on the risk score yielded similar results (Fig. 3C). Subgroup analysis based on T stage and mutation background showed that high-risk patients had significantly worse RFS in the T1N0, T2N0, EGFR-mutant and EGFR wild-type subgroups (Fig. 4A–D).

Fig. 3 Distribution of risk score, time-dependent ROC curves at 1, 3 and 5 years and Kaplan–Meier analysis between patients at low and high risk of relapse in training set (A), testing set (B) and entire dataset (C)

Fig. 4 Kaplan–Meier analysis for the entire dataset with stages I–II lung adenocarcinoma based on the integrated mRNA–ncRNA signature stratified by T stage and EGFR mutation status

Clinical application of risk score established based on mRNA–ncRNA

In multivariate analysis adjusted for clinicopathological factors, the integrated mRNA–ncRNA classifier remained a powerful and independent prognostic factor in both the training set and the testing set (Fig. 5). To provide a quantitative method for predicting the likelihood of relapse, a nomogram integrating the mRNA–ncRNA classifier and clinicopathological factors was constructed (Fig. 6A). The calibration curves (Fig. 6B–D) showed good agreement between the nomogram-predicted and observed RFS rates at 1, 3 and 5 years. Similarly, the decision curve (Fig. 6E) showed that both the mRNA–ncRNA classifier and the classifier-based nomogram provided a higher net benefit and better prediction accuracy than the TNM staging system. The drug sensitivity analysis of seven chemotherapeutics (paclitaxel, fluorouracil, cisplatin, etoposide, vinorelbine, gemcitabine, and docetaxel; Fig. 7) showed that patients in the high-risk group had lower estimated IC50 values, indicating higher sensitivity to these seven drugs (P < 0.001) and suggesting that the mRNA–ncRNA classifier could serve as a potential indicator for postoperative adjuvant chemotherapy.

Fig. 5 Univariable and multivariable Cox regression analysis in training and testing datasets with stages I–II lung adenocarcinoma

Fig. 6 Construction of nomogram based on mRNA–ncRNA signature and its clinical utility. A Nomogram integrating the mRNA–ncRNA signature to predict 1-, 3- and 5-year RFS probability in the entire dataset. B–D Calibration curves for predicting the RFS rate at 1, 3 and 5 years. E Decision curve analysis of the nomogram

Fig. 7 Drug sensitivity analysis of paclitaxel, fluorouracil, cisplatin, etoposide, vinorelbine, gemcitabine, and docetaxel in the low- and high-risk groups using IC50 values

Pathway enrichment analysis and immunophenotyping analysis related to mRNA–ncRNA signature

GSEA using the HALLMARK, KEGG and REACTOME gene sets (Fig. 8) showed that the high-risk group was enriched in cell cycle regulation, DNA replication, mismatch repair, glucose metabolism, and immune pathways related to antigen presentation. Immune lineage analysis using the CIBERSORT algorithm (Fig. 9) showed that all 13 selected RNAs were significantly correlated with immune cell composition in the tumor microenvironment. In addition, a higher risk score was positively correlated with the proportion of suppressive immune cells such as M2 macrophages and regulatory T cells (Tregs), as well as a variety of antigen-presenting cells in the resting state, and negatively correlated with antigen-presenting cells in the activated state. The risk score was also positively correlated with the expression levels of multiple inhibitory immune checkpoint genes, including CD274, PDCD1, HAVCR2, LAG3, PDCD1LG2, IDO1, TIGIT, CTLA4, and LAIR1 (Fig. 10).

Fig. 8 Gene set enrichment analysis using HALLMARK, KEGG and REACTOME gene sets

Fig. 9 Correlation of selected RNAs, risk score and immune cell composition in the tumor microenvironment

Fig. 10 Correlation of risk score and expression levels of inhibitory immune checkpoint coding genes including CD274, PDCD1, HAVCR2, LAG3, PDCD1LG2, IDO1, TIGIT, CTLA4, and LAIR1

Validation and mutation analysis in a database based on RNA sequencing

Validation using the RNA-Seq data of 226 stage I–II lung adenocarcinoma cases from the TCGA database yielded results similar to those in the training and testing sets. The clinical information of all patients can be found in Supplementary Table 4. The high-risk group again tended to have unfavorable RFS (HR 1.70, 95% CI 1.14–2.53, P = 0.008). The 2-year ROC curve showed that the mRNA–ncRNA classifier had a higher AUC (0.680) than the TNM staging system (0.615), and the combination of the mRNA–ncRNA classifier and the TNM staging system provided stronger predictive power (AUC = 0.708). The RNA-seq-based external validation further showed that the integrated mRNA–ncRNA classifier can be used as an effective prognostic indicator for stage I–II lung adenocarcinoma patients (Fig. 11). In the mutation analysis of these TCGA patients, the high-risk group tended to have higher mutation frequencies, and the mutation rates of driver genes such as TP53 (63%) and KRAS (33%) were also higher, whereas EGFR mutation (33%) was more common in the low-risk group (Fig. 12). In addition, patients in the high-risk group tended to have higher mutation burdens (P < 0.001) (Fig. 13).

Fig. 11 Validation of the relapse prediction efficiency in the TCGA lung adenocarcinoma dataset. A Kaplan–Meier analysis for the TCGA dataset with stages I–II lung adenocarcinoma. B The 2-year ROC curve comparison of the mRNA–ncRNA signature and the TNM staging system

Fig. 12 Mutation landscape of patients in the TCGA database. A Top ranked mutations in the high-risk group. B Top ranked mutations in the low-risk group

Fig. 13 Mutation burdens of patients in low and high risk groups in the TCGA database

Discussion

Although patients with early-stage lung adenocarcinoma can benefit from radical resection and postoperative adjuvant chemotherapy, early relapse is still the main cause of unfavorable prognosis (Birim et al. 2006). At present, prognostic systems such as the AJCC TNM staging system are widely used to assess the prognosis of lung adenocarcinoma patients (Woodard et al. 2016). However, they are not always sufficient to predict relapse and prognosis, especially early relapse in early-stage patients, which may be because current prediction methods do not adequately account for the different genetic backgrounds of tumors (Borczuk et al. 2009). Although some studies have explored the association between molecular markers and early postoperative relapse, most have focused on the function of only one biomarker or one class of biomarkers. Compared with a single biomarker, integrating multiple biomarkers of different types and functions into a single model may improve prognostic value and provide indications for adjuvant therapy. However, most previous studies included only protein-coding genes in the analysis, while thousands of non-coding RNAs were excluded (Farhat et al. 2012; Matthaios et al. 2013; Fang and Wang 2014; Zhu and Tsao 2014). Increasing evidence indicates that ncRNAs affect various aspects of cellular homeostasis and play key roles in cell proliferation, migration and genomic stability (Niedzwiecki et al. 2016). Hence, an integrated mRNA–ncRNA signature could provide more diversified information for early relapse prediction and for the identification of biological characteristics.

In this study, we used microarray probe reannotation to extract mRNA and ncRNA transcriptional profiles from 476 patients with early-stage (stage I–II) lung adenocarcinoma in the GEO database. PSM analysis was performed to reduce the interference of other clinicopathological factors between the early relapse group and the long-term nonrelapse group. The RSF algorithm and LASSO Cox regression were used to identify an early relapse-related signature comprising 12 mRNAs and 1 ncRNA. Survival analysis showed that the signature predicted relapse accurately in both the training set and the testing set, and this was further verified with RNA-seq data from TCGA. When combined with other clinicopathological information, multivariate Cox regression indicated that the signature can be used as an independent prognostic factor for relapse prediction. DCA confirmed that the nomogram combining the mRNA–ncRNA signature and clinicopathological data is superior to the TNM staging system in relapse prediction. Chemotherapeutic sensitivity analysis showed that the risk score is positively correlated with sensitivity to the seven commonly used chemotherapeutics, suggesting that the integrated mRNA–ncRNA signature could be used as a predictor of response to postoperative adjuvant chemotherapy. GSEA showed that the mRNA–ncRNA-based risk score is closely related to the tumor microenvironment, and the high-risk group showed active cell proliferation and glucose metabolism, as well as enrichment of immune pathways related to antigen presentation. Combining mutation landscape analysis, immune cell composition, and immune checkpoint expression analysis, we hypothesize that the high-risk group has a heavier mutation load, which in turn produces more tumor neoantigens and promotes antigen presentation. However, because of the inhibitory immune microenvironment, the antigen-presenting cells are more likely to remain in a resting or functionally inhibited state and are unable to exert effective immune surveillance, which suggests that early relapse may be related to the inhibitory tumor immune microenvironment in addition to the high proliferative capacity of the tumor cells.

In addition, a literature search showed that the genes included in the signature have been experimentally linked to cancer. Among them, LINC01116 has been shown to be related to cell proliferation, G1/S transition, and apoptosis regulation in lung adenocarcinoma, and to promote gefitinib resistance by affecting the expression of IFI44, which is involved in the IFN/STAT1 pathway (Wang et al. 2020). Blockade of CCL20 was shown to strongly induce circulating cancer-specific T cells in blood and to significantly reshape the tumor microenvironment (Da Silva et al. 2019). A single-cell sequencing study confirmed that high expression of IL1R2 is related to activated tumor Tregs and is correlated with poor prognosis in lung adenocarcinoma (Guo et al. 2018). In a study of early breast cancer, high COL11A1 expression was observed in tumor cells and surrounding stromal cells and was associated with aggressive behavior, poor outcome and resistance to radiotherapy (Toss et al. 2019). SPOCK1 was shown to be a novel regulator of metastasis from the lung to the brain. It plays a crucial role in cancer stem cell self-renewal and can modulate tumor initiation (Singh et al. 2017). GREM1 contributed to a tumor-associated mesenchymal stem cell (MSC) phenotype, enhanced the ability of MSCs to promote primary tumor cell dissemination, and contributed to an immunosuppressive tumor microenvironment (Fregni et al. 2018). FAM83A expression was significantly increased in lung cancer tissues; its overexpression enhanced the proliferation, colony formation, and invasion of lung cancer cells and was correlated with advanced TNM stage and poor prognosis. Meanwhile, overexpression of FAM83A increased the expression of active β-catenin and Wnt target genes and the activity of epithelial–mesenchymal transition (Zheng et al. 2020).
Disc large homologue-associated protein 5 (DLGAP5), which is required for AURKA-dependent, centrosome-independent mitotic spindle assembly, is essential for the survival and proliferation of SMARCA4-mutant lung cancer cells (Tagal et al. 2017). Another AURKA-related gene in the signature is TPX2, which acts as a coactivator of AURKA, can mitigate drug-induced lung cancer cell apoptosis, and hence emerges in response to chronic EGFR inhibition (Shah et al. 2019). Thymidine kinase 1 (TK1) overexpression is associated with significantly reduced RFS in lung adenocarcinoma patients. Transcriptional overexpression of TK1 in lung cancer cells is driven, in part, by the MAP kinase pathway in a manner dependent on the transcription factor MAZ (Malvi et al. 2019). High expression of the transcription factor ARNTL2 also predicts poor outcome in lung adenocarcinoma patients. ARNTL2 initiated metastatic self-sufficiency by orchestrating the expression of complex pro-metastatic secreted factors (Brady et al. 2016). ANLN, a homologue of anillin, was transactivated in lung cancer cells and seemed to play a significant role in pulmonary carcinogenesis. Small interfering RNAs against ANLN suppressed its expression in NSCLC cells and resulted in growth suppression; moreover, the treated cells showed larger morphology and multiple nuclei and subsequently died. Interestingly, inhibition of phosphoinositide 3-kinase/AKT activity in NSCLC cells decreased the stability of ANLN and reduced the nuclear ANLN level. Immunohistochemical staining of nuclear ANLN on lung cancer tissue microarrays was associated with poor survival of NSCLC patients, indicating that this molecule might serve as a prognostic indicator (Suzuki et al. 2005). CYP4B1 is one of the major genes encoding xenobiotic-metabolizing enzymes (XMEs) and plays a crucial role in maintaining normal bronchial epithelial cell structure and function. A decrease in CYP4B1 expression was observed in tumor specimens. Furthermore, some of the XME-coding genes are involved in the metabolism or transport of chemotherapeutics and may influence the response of tumors to chemotherapy (Leclerc et al. 2011).

To our knowledge, this study is the first attempt to integrate mRNAs and ncRNAs to construct an early relapse predictive signature in early-stage lung adenocarcinoma. However, several limitations should be acknowledged. First, in addition to mRNA and ncRNA, the predictive value of methylation and single nucleotide polymorphisms in tumor prognosis has been verified. Multidimensional analysis that integrates mRNA, ncRNA, CpG methylation, single nucleotide polymorphisms and other multiomics information may further improve prediction efficiency. Second, because of the limited clinical information available in the GEO database, some important clinicopathological characteristics, such as histological grade, histological subtype, and CT imaging information, were not included in this study, which may affect the predictive value of the mRNA–ncRNA signature. Finally, the biological functions of the mRNAs and the ncRNA incorporated in the integrated signature still need to be explored further.

Conclusions

In conclusion, we constructed a robust mRNA–ncRNA signature that can accurately identify patients at high risk of early relapse in stages I–II lung adenocarcinoma. Future clinical trials and confirmatory experiments are still essential to verify the clinical applicability and biological significance of the integrated signature in detecting postoperative early relapse in early-stage lung adenocarcinoma.