Introduction

Lung cancer remains the leading cause of cancer-related death in many countries despite extensive preclinical and clinical research [1]. It is one of the major contributors to trends in overall cancer incidence [2] and is characterized by late-stage presentation coupled with intrinsic resistance to cytotoxic chemotherapy [3]. Non-small cell lung cancer (NSCLC, accounting for 85 % of all lung cancers) and small cell lung cancer (SCLC, accounting for 15 %) are the two major forms of lung cancer [4]. NSCLC can be divided into three major histological subtypes: squamous cell carcinoma (SCC), adenocarcinoma (AC), and large cell lung cancer (LCC). Smoking causes all types of lung cancer but is most strongly linked with SCLC and squamous cell carcinoma, while adenocarcinoma is the most frequent type in patients who have never smoked [5–8].

The incidence of SCC is higher in males than in females. Treatment of patients with SCC remains a vexing problem, and long-term survival beyond 5 years is extremely rare [9]. Despite the various treatments available for SCC patients, including surgery, radiotherapy, chemotherapy, or a comprehensive therapy approach, survival rates have not increased much [10]. We therefore aim to improve treatment and prevention of the disease through greater knowledge of the molecular origins and progression of lung cancer.

The Cancer Genome Atlas (TCGA) pilot project is a feasible and powerful resource. It can expand knowledge of the molecular basis of various cancers, and it aims to assess the value of large-scale multi-dimensional analysis of many molecular characteristics in human cancer, providing data rapidly to the research community [11]. In addition, the interim integrative analysis of DNA copy number, gene expression, and DNA methylation aberrations, along with a network view of the pathways altered in the development of cancer, can be very helpful in clinical management. The TCGA Research Network was established to generate a comprehensive catalog of the genomic abnormalities driving tumorigenesis.

In the clinical setting, the evaluation of messenger RNA (mRNA) expression levels of selected potential genes may enable clinicians to tailor chemotherapy according to each individual’s gene profile and to produce a substantial improvement in the therapeutic outcome in terms of overall survival, time to progression, and response to therapy. At present, however, no effective model has been constructed to distinguish the prognostic conditions of patients with squamous cell carcinoma of the lung (SCCL). The exploration of new markers in clinical management will hopefully improve survival and quality of life for patients with advanced SCCL.

The main purpose of this study was to identify potential prognostic gene sets closely associated with tumor progression and survival in SCCL patients, using the least absolute shrinkage and selection operator (LASSO) regression model to reduce dimensionality. A second goal was to construct a model that can effectively distinguish the prognostic conditions of SCCL patients. Here, we report that 22 potential genes could function as prognostic and predictive markers for survival of SCCL patients, and that a ≥6 gene model was constructed for the first time as an indicator for SCC patients; it can form the basis for multi-institutional randomized adjuvant trials for “high-risk” patients.

Materials and methods

Data source

The SCC messenger RNA (mRNA) expression profiles were downloaded from the TCGA dataset. Three hundred squamous lung carcinoma samples were included. Level 3 RNAseq data were extracted. The data platform was UNC__IlluminaHiSeq_RNASeqV2. mRNAs with no signal or whose signal was 0 were eliminated.

Data preprocessing

The standard mRNA expression profiles were extracted from the original downloaded data; mRNAs with no signal or whose signal was 0 were eliminated. To eliminate batch effects, the generalized linear model (GLM) in the limma package for R was used for standardization between samples.
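
The zero-signal filter described above can be sketched in a few lines. This is only an illustration: the function name `drop_silent_genes`, the dictionary layout, and the use of `None` for "no signal" are assumptions, as is the reading that a gene is removed when it has no non-zero reading in any sample (the text does not say whether the rule applies per sample or across all samples).

```python
def drop_silent_genes(expression):
    """Keep genes that show a non-zero signal in at least one sample.

    expression: dict mapping gene symbol -> list of per-sample values,
    where None stands for "no signal" (an assumed encoding).
    """
    kept = {}
    for gene, values in expression.items():
        # A gene survives if any sample has a detected, non-zero value.
        if any(v is not None and v != 0 for v in values):
            kept[gene] = values
    return kept


# Hypothetical toy input: GENE_A is silent everywhere, GENE_B is not.
toy = {"GENE_A": [0, 0, None], "GENE_B": [5.2, 0, 3.1]}
filtered = drop_silent_genes(toy)
```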

Survival analysis

mRNA expression profiles related to squamous lung carcinoma survival were identified by Kaplan-Meier survival analysis. Survival and prognostic conditions in each clinical stage were plotted. Cox proportional hazard regression risk ratios were used to determine the influence of mRNA expression and of clinicopathological factors (age, gender, and recurrence) on patient survival by multiplying the ratios for all factors present [12]. SPSS (version 17.0; SPSS Inc.) was used to perform the survival analysis, and GraphPad Prism (version 5.04; GraphPad Software, Inc.) was used to generate the survival curves.
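
For readers unfamiliar with the Kaplan-Meier estimator used here, the following is a minimal sketch of the product-limit calculation. It is not the SPSS or GraphPad Prism procedure used in the study; the function name and input layout are illustrative assumptions.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= leaving
    return curve
```

For example, with follow-up times `[1, 2, 3, 4]` and event flags `[1, 0, 1, 1]`, the estimate drops to 0.75 at time 1, to 0.375 at time 3 (the patient censored at time 2 has left the risk set), and to 0 at time 4.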

Differentially expressed gene screening

Genes were selected as differentially expressed among cancer samples if, in at least 20 % of samples, their expression value was more than 1.5-fold above or below (less than 1/1.5-fold of) the median across all samples, and if their variance was significantly larger than the median variance of all genes (p < 0.05).
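
One possible reading of the fold-change part of this filter is sketched below. It assumes that a gene passes when at least 20 % of its sample values lie more than 1.5-fold above or below the gene's own median; that interpretation, the function name, and the handling of non-positive medians are assumptions. The variance test (p < 0.05) is not shown.

```python
from statistics import median


def passes_fold_change_filter(values, fold=1.5, min_fraction=0.20):
    """Check the assumed fold-change criterion for one gene.

    values: expression of the gene in every sample.
    Returns True if at least `min_fraction` of samples are more than
    `fold` times above, or below 1/`fold` of, the gene's median.
    """
    med = median(values)
    if med <= 0:
        # Fold change relative to a non-positive median is not defined here.
        return False
    outlying = sum(1 for v in values if v > fold * med or v < med / fold)
    return outlying / len(values) >= min_fraction
```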

Potential prognostic gene screening

Single-factor survival analysis of the differentially expressed genes was performed across all cancer samples using the survival package [13] in R. Genes had to satisfy two conditions: p < 0.5 and s (variance) > 0.2. Genes meeting both conditions were identified as prognostic genes for squamous cell cancer of the lung.

LASSO regression model

LASSO [14] was proposed by Tibshirani. It obtains a refined model by adding a penalty function that shrinks the coefficients of some variables to exactly zero, thereby simplifying the variable set. The AIC and BIC principles can likewise help to reduce dimensionality by simplifying the variable set of a statistical model. The LASSO model in the penalized package [15] in R was applied to the potential prognostic genes, and the genes retained after each of 1000 LASSO regressions were counted. Finally, the frequency of each gene was obtained, and genes with a frequency higher than 100 were recognized as high-frequency prognostic genes.
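
The frequency-counting step can be illustrated as follows. The sketch assumes that each LASSO run has already produced a list of genes with non-zero coefficients; the LASSO fits themselves (done with the penalized package in the study) are not reproduced, and the function name and input layout are hypothetical.

```python
from collections import Counter


def high_frequency_genes(selected_per_run, min_count=100):
    """Tally how often each gene is retained across repeated LASSO runs.

    selected_per_run: iterable of gene lists, one list per run
                      (genes whose coefficient was not shrunk to zero).
    Returns {gene: count} for genes retained more than `min_count` times.
    """
    counts = Counter()
    for genes in selected_per_run:
        counts.update(set(genes))  # count a gene at most once per run
    return {gene: n for gene, n in counts.items() if n > min_count}
```

With 1000 runs and the default threshold, a gene retained in exactly 100 runs is excluded, matching the "higher than 100" wording.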

Functional enrichment analysis

Functional enrichment analysis of these high-frequency genes was performed using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) [16]. Single-factor and multi-factor survival analyses were conducted on these genes to determine the roles they play in prognosis. In addition, a ROC curve was generated with the survivalROC package in R [17].

Construction of prognostic model

Genes with high-risk and low-risk expression were distinguished by the following rules [18]. A gene's expression in a sample was marked as high risk if (1) the hazard ratio (HR) of the gene in single-factor survival analysis was higher than 1 and its expression level was in the top 20 % of all samples, or (2) the HR of the gene in single-factor survival analysis was lower than 1 and its expression level was in the bottom 20 % of all samples; (3) expression that met neither condition was marked as low risk. The model was constructed by counting the number of genes with high-risk expression, and the model with the greatest impact on prognosis was screened. Samples were grouped by the number of their genes with high-risk expression: ≥1 gene, ≥2 genes, ≥3 genes, etc. The samples in each category and their survival times were collected to obtain the survival model for that category. Survival analysis was then performed on each model to assess its prognostic value and to find the model that significantly affected prognosis.
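
The counting rules above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the data layout, and the way the top and bottom 20 % cutoffs are taken from the sorted values (ties at the cutoff count as inside the tail) are all choices made for the example.

```python
def quantile_cutoffs(values, fraction=0.20):
    """Return (low_cut, high_cut) for the bottom and top `fraction` of values."""
    ordered = sorted(values)
    k = max(1, int(len(ordered) * fraction))  # number of samples in each tail
    return ordered[k - 1], ordered[-k]


def high_risk_counts(expression, hazard_ratio, fraction=0.20):
    """Count genes with high-risk expression in each sample.

    expression   : dict gene -> list of values, one per sample (same order)
    hazard_ratio : dict gene -> HR from single-factor survival analysis
    """
    n_samples = len(next(iter(expression.values())))
    counts = [0] * n_samples
    for gene, values in expression.items():
        low_cut, high_cut = quantile_cutoffs(values, fraction)
        hr = hazard_ratio[gene]
        for s, v in enumerate(values):
            # Rule 1: HR > 1 and expression in the top tail.
            # Rule 2: HR < 1 and expression in the bottom tail.
            if (hr > 1 and v >= high_cut) or (hr < 1 and v <= low_cut):
                counts[s] += 1
    return counts


def split_by_threshold(counts, k):
    """Split sample indices into (>= k high-risk genes, fewer than k)."""
    high = [s for s, c in enumerate(counts) if c >= k]
    low = [s for s, c in enumerate(counts) if c < k]
    return high, low
```

Calling `split_by_threshold` with k = 1, 2, 3, and so on yields the "≥1 gene", "≥2 genes", "≥3 genes" groupings, each of which can then be compared by survival analysis.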

Model stability testing

Samples were selected randomly from the original sample set, and the above steps were repeated to test the stability of the model. The significance of each model in survival analysis was recorded, and 1000 repeats were carried out to find the most stable model.
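
A rough sketch of such a stability check is given below. It assumes that significance is measured with a two-group log-rank test and that each repeat draws 80 % of the samples without replacement; the subsample fraction, the seed, and all function names are assumptions not stated in the text.

```python
import math
import random


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns the p-value (chi-square, 1 d.f.).

    All arguments are lists; events are 1 for death, 0 for censoring.
    """
    all_times = times_a + times_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def subsample_p_values(times, events, is_high_risk,
                       repeats=1000, fraction=0.8, seed=0):
    """Log-rank p-values of the high/low-risk split over random subsamples."""
    rng = random.Random(seed)
    indices = list(range(len(times)))
    size = max(2, int(len(indices) * fraction))
    p_values = []
    for _ in range(repeats):
        chosen = rng.sample(indices, size)
        hi = [i for i in chosen if is_high_risk[i]]
        lo = [i for i in chosen if not is_high_risk[i]]
        if not hi or not lo:
            continue  # a draw with a single group cannot be tested
        p_values.append(logrank_p(
            [times[i] for i in hi], [events[i] for i in hi],
            [times[i] for i in lo], [events[i] for i in lo]))
    return p_values
```

The resulting list of p-values for each candidate model is what a box plot of significance distributions would summarize.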

Specificity analysis of subtype disease

All samples were divided into subtypes according to clinical experience in order to test the subtype specificity of the model. The subtype with more than 100 samples was selected and used to verify the obtained model.

Results

Data processing

Expression values for 20,254 genes from 300 samples were obtained from the TCGA dataset after data processing. Data with undetected mRNA or no signal were eliminated. The corresponding survival information for the 300 samples was also obtained.

Survival analysis of different clinical stages

mRNA expression profiles related to survival were identified using Kaplan-Meier survival analysis, and the statistical significance of overall survival (OS) and progression-free survival (PFS) was determined using the log-rank test. Survival analysis was performed in SPSS (version 17.0; SPSS Inc.), and the survival curves were generated with GraphPad Prism (version 5.04; GraphPad Software, Inc.) (Fig. 1). As the figure shows, the various methods used to classify the disease in clinical practice cannot differentiate its degree of risk, indicating that a model is needed to estimate prognostic risk accurately.

Fig. 1 Survival status and prognostic conditions of SCCL patients. Shown are the survival status and prognostic conditions of SCCL patients in different clinical stages according to different classification methods

Differentially expressed gene screening

A total of 7998 differentially expressed genes were identified by the fourth step of the method, and all the selected genes fulfilled the two conditions mentioned above. Of these, 2041 genes with a potential role in prognosis were selected by single-factor analysis (Table 1).

Table 1 Top 20 potential genes that have impact on prognostic process

Prognostic gene screening

LASSO has been used to identify gene-gene interactions in genome-wide association studies. In this study, 22 prognostic genes, that is, BCAR3, PCDHGB4, PLIN2, SCD5, STC2, TGM2, APLN, GNB3, ZNF813, COBL, SDK2, NGFR, FKBP10, NR1I3, TNFSF11, BSPRY, C12orf53, GALNT14, NHLRC1, KLF12, and TREM1, showed high frequency after 1000 LASSO regressions (Table 2), and the frequency of each gene was obtained. Twenty-one of them had a frequency of more than 100 and were shown to have significant prognostic roles by single-factor analysis.

Table 2 High-frequency genes

Survival analysis

The DAVID classification system, a powerful bioinformatic tool for classifying genes according to their function, was used to identify gene families that may play significant roles in specific pathways, biological processes, and molecular functions. In this study, we used it to classify the differentially expressed sequences (obtained after application of the Bonferroni test) in the comparisons between conditions. DAVID analysis was performed on the 22 high-frequency genes, and the results are shown in Table 3. Ten genes were highly enriched in four molecular functions: calcium ion binding (GO:0005509), metal ion binding (GO:0046872), cation binding (GO:0043169), and ion binding (GO:0043167).

Table 3 Functional enrichment analysis on high-frequency genes

Multi-factor survival analysis of high-frequency genes

Multi-factor survival analysis gave a Wald test p = 8.902e-11, showing that the overall multi-factor survival analysis of the 22 genes was significant. The ROC curve was generated (Fig. 2). The average AUC values were all above 0.05, indicating that the genes were effective in differentiating disease from normal samples in the prognostic process.

Fig. 2 AUC curve of 22 high-frequency genes by multi-factor survival analysis. Shown is the result of the Kaplan-Meier survival analysis of the 22 high-frequency genes from 1 to 6 years

Construction of efficient models

Genes with high-risk expression in each sample are listed in Table 4. Eight efficient models were obtained, and seven of them had a great impact on prognosis. Single-factor survival analysis of the eight models is shown in Fig. 3. As can be seen, the fifth and sixth models had the largest numbers of samples, and their survival curves can differentiate the high-risk samples from the low-risk ones.

Table 4 Single-factor survival analysis of eight models
Fig. 3 Survival curve of eight models. Single-factor survival curves of the eight constructed models by Kaplan-Meier survival analysis

Stability testing of high-risk models

Random sample selection was used to screen for the most stable prognostic model. The distributions of significance over 1000 random selections, obtained by survival analysis of each model, are shown in the box plot (Fig. 4). The 5-gene and 6-gene models were the most stable under random selection.

Fig. 4 Box plot of the significance distribution of the eight models after 1000 random repeats of LASSO regression and single-factor survival analysis

Specificity testing of the subtype of SCCL

Samples were divided according to subtype in order to verify the subtype specificity of the model (Table 5). Most divisions contained few samples, so the subtype with more than 100 samples was selected and subjected to steps nine and ten of the method. Single-factor survival analysis of the model for each subtype was obtained (Table 6). The 5-gene and 6-gene models showed the most favorable separation across subtypes of the disease.

Table 5 Sample numbers of each subtype of the disease
Table 6 Single-testing result of each subtype of the disease

Discussion

SCC is one major subtype of lung cancer, but there are few biomarkers to aid patient management. Despite advances in treatment modalities, the prognosis of SCC patients remains very poor. Recent studies suggested that microRNA biomarkers could be useful for stratifying lung cancer subtypes [19], but microRNA signatures varied between different populations [20]. In this study, we identified 22 differentially expressed genes from the most significantly altered genes using data from the TCGA dataset, and we found that these 22 genes have the potential to serve as prognostic genes in clinical management. Moreover, single-factor survival analysis showed that 21 of them had a significant impact on prognosis. For example, breast cancer anti-estrogen resistance protein 3 (BCAR3) was previously reported to be a candidate marker for classifying the epithelial-like and mesenchymal-like phenotypes observed in NSCLCs [21] and to be homologous to the cell division cycle protein CDC48 [22], which increases the reliability of its potential role in classifying the prognostic conditions of SCCL patients. Another selected gene, PCDHGB4, was also reported to be associated with lung cancer, since PCDH hypermethylation was proven to be a frequent event found in all Wilms’ tumor subtypes [23]. The expression of PCDHGB4 may be involved in the methylation process, since hypermethylation was found to be concordant with reduced PCDH expression in tumors [24]. Among the other genes, PLIN2 may be involved in the development and maintenance of adipose tissue, while the pathways related to SCD5 concern fatty acid metabolism. In addition, promoter methylation of transglutaminase 2 (TGM2) was identified as a marker of good response to cisplatin in NSCLC. We therefore suspect that the occurrence of SCC may be associated with adipose metabolism.
Combined with the results of the enrichment analysis, in which the four most enriched GO terms were calcium ion binding (GO:0005509), metal ion binding (GO:0046872), cation binding (GO:0043169), and ion binding (GO:0043167), we propose that there may be a close association between DNA methylation and ion binding ability that contributes to the occurrence of SCC; further research is needed to support this idea.

Since the different clinical classification methods cannot accurately distinguish high-risk from low-risk patients, the need for a prognostic model has become more urgent. Walter’s study confirmed that NSCLC can be divided into two phenotypically distinct subtypes of tumor [21]. For squamous cell carcinoma of the lung, a ≥6 gene model was constructed in this study to distinguish the prognostic condition of patients, providing a reference for clinical therapy. The model can help to predict recurrence and death in a large population of patients with SCC. The current model, built on 300 cancer samples from patients, can be used to stratify future high-risk populations for adjuvant therapy. With the development of molecular and gene profiling, molecular stratification of patient outcome is increasingly emphasized [25, 26], which has led to extensive investigation and exploration of molecular markers. The model can therefore be used to predict recurrence in individual patients with SCC, and it was consistent across all early stages of NSCLC. In this study, samples were selected randomly to ensure reliability, and the most stable prognostic model was found through 1000 random repeats of LASSO regression analysis. The model constructed in this article is therefore more convincing and feasible for potential application in clinical practice. It can also be used to identify a subgroup of patients at high risk of recurrence, and thus to determine who might be best treated with adjuvant chemotherapy. In addition, functional enrichment analysis of the 22 high-frequency genes showed that four molecular functions, namely calcium ion binding (GO:0005509), metal ion binding (GO:0046872), cation binding (GO:0043169), and ion binding (GO:0043167), were highly enriched.
We speculate that ion binding may be associated with DNA methylation, since there were early reports on the cytotoxic effect of metal ions and their complexes on DNA interactions [27, 28]; this still needs further research.

Conclusion

In conclusion, we identified 22 potential genes, BCAR3, PCDHGB4, PLIN2, SCD5, STC2, TGM2, APLN, GNB3, ZNF813, COBL, SDK2, NGFR, FKBP10, NR1I3, TNFSF11, BSPRY, C12orf53, GALNT14, NHLRC1, KLF12, and TREM1, which may function as prognostic indicators of squamous cell carcinoma of the lung, and the ≥6 gene model constructed from these high-risk genes can help to predict early recurrence and death in localized SCC. Patients at high risk of recurrence and death can thus receive timely adjuvant therapy. Further research is still needed to test our hypothesis about the association between the four highly enriched GO terms and the prognosis of SCCL.