Introduction

Colon cancer is characterized by strong heterogeneity and aggressiveness, and it has a high prevalence and fatality rate (Liu et al. 2022a). Despite the improvements made in recent decades, it remains the third most commonly diagnosed cancer and the second leading cause of cancer death, accounting for 9.4% of all cancer deaths (Sung et al. 2021). In women, colorectal cancer accounts for 10% of all cancers. Thus, it is crucial to provide a model that identifies biomarkers benefiting early-stage cancer therapy and distinguishes groups with poor prognosis.

Tumor-infiltrating immune cells are important for the development and aggressiveness of cancer, according to the expanding body of research on the tumor microenvironment (TME) (Jochems and Schlom 2011; Bense et al. 2017; Barnes and Amir 2018). There is evidence that distinct kinds of immune cells related to clinical outcomes are abundant in the microenvironment of colon cancer (Mola et al. 2020; Ooki et al. 2021; Liang et al. 2022; Vitorino et al. 2022). As a result, quantitative molecular signatures of tumor-infiltrating immune cells are being recognized as a class of prognostic biomarkers that may help guide patient management and treatment choice. Numerous long non-coding RNAs (lncRNAs) have been demonstrated to be crucial in controlling transcription, translation, and protein modification, among other cellular and biological processes in cancer (Peng et al. 2017). LncRNAs have recently been found in a variety of immune cells and have been identified as essential regulators of immune cell growth and differentiation (Turner et al. 2014; Elling et al. 2016; Chen et al. 2017). LncRNAs are also linked to the immunological control of cancer, including immune activation, immune escape, and the infiltration of dendritic cells (DCs), T cells, regulatory T cells, B cells, and macrophages into cancer tissues (Denaro et al. 2019; Egranov et al. 2020). Tumor infiltration immune-related lncRNA signatures have been established in glioblastoma and non-small cell lung cancer (NSCLC) (Sun et al. 2020a, b; Zhang et al. 2022a, b).

Immune checkpoint inhibitors (ICIs) are a cutting-edge type of tumor immunotherapy that works by targeting immune checkpoint proteins (Mahoney et al. 2015). However, only a small percentage of patients have so far seen a significant improvement after receiving ICI treatment (Robert 2020). Hence, a score is needed that splits patients into groups with poor and good immune response, so that clinicians can give personalized treatment strategies for colon cancer patients based on molecular characteristics. Long non-coding RNAs (lncRNAs) are a class of non-coding RNAs (ncRNAs) longer than 200 nt that do not encode proteins but act directly in the form of RNA (Liu et al. 2022a). LncRNAs regulate the expression of protein-coding genes at the transcriptional and post-transcriptional levels and participate in many biological processes (Park et al. 2022). Notably, recent research has shown that lncRNAs have critical roles in immune response, immune cell formation, differentiation, and function, the tumor immunological microenvironment, and cancer immunotherapy (Coker and Wood 1986; Najafi et al. 2022). In addition, the expression specificity of immune-related lncRNAs makes them promising biomarkers. Wu et al. reported a classifier of eight immune-related lncRNAs that predicts recurrence of bladder cancer (Wu et al. 2020b). Li et al. identified four lncRNAs as potential independent prognostic variables for triple-negative breast cancer and confirmed that the high-risk group has strong immune responses (Li et al. 2021). A systematic and exhaustive strategy to find lncRNAs linked to immunological prognosis in colon cancer is currently lacking. Thus, we used lncRNAs as risk factors for constructing the risk model.

Recently, machine learning-based algorithms have been widely used to mine prognostic factors in cancer research. Machine learning-based technology can identify genes, CT-scan features, or clinical characteristics that are associated with patients’ survival. The prognostic model built from these genes or clinical characteristics is then used to infer a risk score, an index value that evaluates the effect of therapy, and treatment suggestions are given according to the prediction results. This constructive approach can be applied to many kinds of cancer. For example, Liu et al. defined risk genes in colorectal cancer based on gene importance scores from the random forest (RF) algorithm (Liu et al. 2023). Liu et al. used the LASSO algorithm to select alternative splicing (AS) events without collinearity and fed them into a Cox regression model to predict the survival time of the bladder urothelial carcinoma (BLCA) cohort (Liu et al. 2022b). Similarly, Zheng et al. reported a CT-based nomogram in clear cell renal cell carcinoma (ccRCC) built on 20 features filtered from 1316 radiomics features using LASSO logistic regression (Zheng et al. 2021). However, no integrated method has been proposed that combines the advantages of various machine learning algorithms. Here, we selected as important lncRNA features those that were chosen by at least two of the four machine learning algorithms.

The birth of high-throughput sequencing technology was a landmark event in the field of genomics research (Pareek et al. 2011; Slatko et al. 2018). This technology sharply reduced the per-base cost of nucleic acid sequencing compared with first-generation sequencing technology. Before the advent of deep sequencing technology, the primary method for high-throughput measurement of gene expression levels was the gene microarray (Hung and Weng 2017; Nurk et al. 2022), with which the differences and patterns of gene expression in different tissues or developmental stages could be analyzed. Since the successful completion of the Human Genome Project in 2003, sequencing technology has improved dramatically (Collins et al. 2003). These advances provide researchers and medical diagnosticians with an excellent platform for further understanding phenotypic changes and disease development caused by genomic variation. GEO and TCGA are two resources that provide large amounts of data (Edgar et al. 2002; Barrett et al. 2013), including microarray, RNA-seq, and clinical information. We constructed a risk model based on microarray and clinical trait data, and then validated this model with RNA-seq data.

In this study, we constructed a risk score model using tumor immune infiltration-related and prognostic lncRNAs. The risk score model has high specificity and sensitivity across training and testing datasets. We identified three lncRNAs that may help illustrate the mechanism of tumor progression, improve prognosis prediction, and guide the design of new drug targets for colon cancer.

Materials and Methods

Immune Cell Types Data

We collected 19 different immune cell types (B cell activated, CD4 T cell activated, CD4 T cell resting, CD8 T cell activated, CD8 T cell resting, Dendritic cells activated, Dendritic cells resting, Eosinophils, Immature dendritic cells, Mast cells activated, Monocytes, Myeloid dendritic cells, NK activated, NK resting, NKT activated, Neutrophils, Plasmacytoid dendritic cells, T gamma delta, and T helper 17) for 115 samples from the GEO database with the accession numbers: GSE13906, GSE23371, GSE25320, GSE27291, GSE27838, GSE28490, GSE28698, GSE28726, GSE37750, GSE39889, GSE42058, GSE49910, GSE51540, GSE59237, GSE6863, and GSE8059. Detailed information about the immune cell microarray data is shown in Supplementary Table 1. The platform of these data is HG-U133_Plus_2.

Data Collection and Pre-processing of Colon Cancer

Raw “.cel” format microarray data for colon cancer was downloaded by the GEOquery R package (version: 2.64.2) with the accession number GSE39582 and platform number GPL570 (HG-U133_Plus_2) (Edgar et al. 2002). The Robust Multiarray Average (RMA) algorithm was selected for background correction, quantile normalization, and log2 transformation using the “affy” R package (version: 1.74.0) (Gautier et al. 2004). We saved the clinical information for each patient. Gene annotation was performed by matching the probe id to the gene symbol from NetAffx (https://sec-assets.thermofisher.com/TFS-Assets/LSG/Support-Files/HG-U133_Plus_2-na36-annot-csv.zip).

RNA-seq data of colon adenocarcinoma (COAD) was collected from the TCGA project via TCGAbiolinks (version: 2.25.0) R package (Colaprico et al. 2016). We saved the corresponding clinical data for patients. The gene type information came from GENCODE (https://www.gencodegenes.org/, version: GRCh38/hg38). The lncRNA gene information was from the GENCODE database (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_21/gencode.v21.long_noncoding_RNAs.gtf.gz).

The detailed clinical characteristics of the training and testing datasets can be found in Table 1.

Table 1 The clinical traits of colon cancer patients in each dataset

The Identification of HKLncRNAs

In this study, we aimed to explore lncRNAs that are immune-related and can be used for immunotherapy in colon cancer. In general, the expression levels of lncRNAs are lower than those of mRNAs, so they are difficult to detect (Park et al. 2022). Thus, we performed the following two analyses: (A) identify highly expressed lncRNAs; (B) identify lncRNAs that are present in all immune cells, referred to here as house-keeping lncRNAs (HKLncRNAs).

Capture Highly Expressed lncRNAs

We determined the expression level of each lncRNA in each immune cell type as the average value of this lncRNA across all samples belonging to that cell type. For each immune cell type, lncRNAs were sorted by expression in descending order and the top 30% were kept; these lists were merged into a single gene list, which we defined as the highly expressed lncRNAs.
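The averaging and top-30% merging described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy matrix, and the choice to round the 30% cutoff down are all assumptions.

```python
import numpy as np

def mean_by_cell_type(sample_expr, labels):
    """Average a (lncRNA x sample) matrix within each immune cell type."""
    sample_expr = np.asarray(sample_expr, dtype=float)
    labels = np.asarray(labels)
    cell_types = sorted(set(labels.tolist()))
    means = np.column_stack(
        [sample_expr[:, labels == t].mean(axis=1) for t in cell_types]
    )
    return means, cell_types

def highly_expressed(means, names, top_frac=0.30):
    """Union of the top `top_frac` lncRNAs of every cell type."""
    n_top = max(1, int(top_frac * means.shape[0]))  # rounding down is an assumption
    selected = set()
    for j in range(means.shape[1]):
        order = np.argsort(-means[:, j], kind="stable")  # descending expression
        selected.update(names[i] for i in order[:n_top])
    return sorted(selected)

# Hypothetical toy data: 4 lncRNAs, 3 samples from 2 cell types.
names = ["lncA", "lncB", "lncC", "lncD"]
sample_expr = [[9, 9, 1], [2, 2, 8], [3, 3, 3], [1, 1, 2]]
means, cell_types = mean_by_cell_type(sample_expr, ["B", "B", "NK"])
top = highly_expressed(means, names, top_frac=0.5)
```

Because the lists are merged as a set, an lncRNA that ranks highly in several cell types is counted once.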

Capture Widely Expressed lncRNAs

The cell type-specificity index was used to evaluate the expression pattern of lncRNAs in immune cells as follows (Smith 1992; Yanai et al. 2005):

$${Specificity\,Index}_{lncRNA}=\frac{{\sum }_{i=1}^{N}\left(1-{x}_{lncRNA,i}\right)}{N-1}$$

where \(N\) indicates the number of cell types, and \({x}_{lncRNA,i}\) indicates the expression level of the lncRNA in cell type \(i\), normalized by its maximum expression level across cell types. Cell type-specificity index values were calculated and sorted in ascending order. The 30% of lncRNAs with the lowest cell type-specificity index values were defined as HKLncRNAs, and the 30% with the highest values were defined as cell type-specific lncRNAs.
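As a hedged illustration of the specificity index, the sketch below computes it for a small matrix of per-cell-type mean expression values and then splits the lncRNAs at both ends of the ranking. The function names and the rounding-down of the 30% cutoff are assumptions; an index of 0 corresponds to uniform expression and 1 to expression in a single cell type.

```python
import numpy as np

def specificity_index(means):
    """Specificity index per lncRNA from a (lncRNA x cell type) matrix."""
    x = np.asarray(means, dtype=float)
    # Normalize each lncRNA by its maximum; assumes every row has a positive maximum.
    x_norm = x / x.max(axis=1, keepdims=True)
    n_types = x.shape[1]
    return (1.0 - x_norm).sum(axis=1) / (n_types - 1)

def split_by_specificity(tau, names, frac=0.30):
    """Lowest `frac` -> house-keeping, highest `frac` -> cell type-specific."""
    order = np.argsort(tau, kind="stable")  # ascending index values
    n_keep = int(frac * len(names))         # rounding down is an assumption
    housekeeping = [names[i] for i in order[:n_keep]]
    specific = [names[i] for i in order[len(names) - n_keep:]]
    return housekeeping, specific

# Hypothetical toy matrix: rows are lncRNAs, columns are cell types.
tau = specificity_index([[5, 5, 5], [0, 0, 7], [2, 4, 4]])
hk, spec = split_by_specificity(tau, ["hk", "spec", "mid"], frac=0.34)
```

Under this rounding assumption, 30% of 737 lncRNAs gives 221, which matches the size of the two groups reported in the Results.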

Capture lncRNAs That Are Up-Regulated in the Immune Compared to Cancer Cells

The significance analysis of microarrays (SAM) algorithm was chosen for capturing lncRNAs that are up-regulated in immune cells compared to cancer cells. The significance cutoff was set to 0.05 after FDR correction. In this study, samr (version: 3.0) was utilized for performing the differentially expressed gene (DEG) analysis (Tusher et al. 2001; Tibshirani 2006; Li and Tibshirani 2013; Tzeng 2021).

The Selection of Optimal lncRNAs

Before the construction of the risk score model, four machine learning-based methods (LASSO, Random Forest, Boruta, and Xgboost), univariate Cox regression, and a Gaussian mixture model (GMM) were utilized to mine the optimal combination of lncRNAs.

LASSO

LASSO was performed because it has powerful advantages in handling high-dimensional data and solving multicollinearity problems (Liu et al. 2023).

$$Q\left(\beta \right)={\Vert y-X\beta \Vert }^{2}+\lambda {\Vert \beta \Vert }_{1}$$
$$\Leftrightarrow \mathrm{arg}\,\underset{\beta }{min}{\Vert y-X\beta \Vert }^{2} \quad s.t. \sum_{j}\left|{\beta }_{j}\right|\le s$$

Here, \(Q\left(\beta \right)\) represents the penalized objective function, whose first term is the squared error that we expected to be as small as possible (so that we lose as little sample information as possible). \(\beta \) represents the coefficients of the features (lncRNAs), and \(\lambda \) is the penalty weight, which is generally obtained by cross-validation (CV).
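To make the effect of the \(\lambda \) penalty concrete, the following minimal sketch minimizes the squared-error objective above by coordinate descent with soft-thresholding. It is an illustration under stated assumptions only: the function names are hypothetical, and the software and loss function actually used for the survival data in this study are not reproduced here.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if col_sq[j] == 0:
                continue  # an all-zero feature keeps a zero coefficient
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Hypothetical toy problem with an orthonormal design.
beta_hat = lasso_cd(np.eye(2), [3.0, 0.2], lam=1.0)
```

In the toy problem the weak second feature is shrunk exactly to zero, which is the property that makes LASSO usable for feature selection.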

Random Forest

Random Forest (RF) can analyze features with complex interactions, is robust to noisy data and data with missing values, and learns quickly (Toth et al. 2019; Zhang et al. 2022b).

$${GI}_{q}^{(i)}=\sum_{c=1}^{|C|}\sum_{{c}{\prime}\ne c}{p}_{qc}^{(i)}{p}_{q{c}{\prime}}^{(i)}=1-\sum_{c=1}^{|C|}{({p}_{qc}^{(i)})}^{2}$$

Here, the \(GI\) (Gini) value was used to calculate the contribution of each feature (lncRNA). \(i\) represents the decision tree, and \(q\) represents the node in the decision tree. \(C\) is the set of classes, and \({p}_{qc}^{(i)}\) indicates the proportion of samples in node \(q\) that belong to class \(c\).

The variable importance measure (VIM) stands for the importance of each feature (lncRNA). The \(VIM\) of the feature in node \(q\) and tree \(i\) can be calculated by the following formula:

$${VIM}_{jq}^{(Gini)(i)}={GI}_{q}^{(i)}-{GI}_{l}^{\left(i\right)}-{GI}_{r}^{(i)}$$

where \({GI}_{l}^{\left(i\right)}\) and \({GI}_{r}^{(i)}\) represent the \(GI\) values of the two child nodes (left and right) produced by the split at node \(q\), respectively.

Let \(Q\) be the set of all nodes in decision tree \(i\) that split on feature \({X}_{j}\); then the \(VIM\) of the feature \({X}_{j}\) in decision tree \(i\) can be calculated by the following formula:

$${VIM}_{j}^{(Gini)(i)}=\sum_{q\in Q}{VIM}_{jq}^{(Gini)(i)}$$

Let \(I\) be the number of trees in the RF; then the \(VIM\) of the feature can be calculated by the following formula:

$${VIM}_{j}^{(Gini)}=\sum_{i=1}^{I}{VIM}_{j}^{(Gini)(i)}$$

Finally, we normalized the \(VIM\) of the feature:

$${normalized\,VIM}_{j}^{(Gini)}=\frac{{VIM}_{j}^{(Gini)}}{\sum_{{j}{\prime}=1}^{J}{VIM}_{{j}{\prime}}^{(Gini)}}$$

In this study, the randomForestSRC R package (version: 3.2.1) was used for feature selection based on the RF algorithm.
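The chain of formulas above (node impurity, per-split importance, summation over nodes and trees, and normalization) can be traced with a short sketch. It follows the formulas as written in this section; the function names and the toy class counts are hypothetical, and the internals of randomForestSRC are not reproduced. Note that many implementations additionally weight the child impurities by the share of samples in each child, which the formula above does not.

```python
def gini(counts):
    """Gini impurity of a node from its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_importance(parent, left, right):
    """Impurity decrease of one split, unweighted as in the formula above."""
    return gini(parent) - gini(left) - gini(right)

def normalized_vim(splits):
    """Sum split importances per feature over all trees, then normalize.

    `splits` is an iterable of (feature, parent_counts, left_counts,
    right_counts) collected from every non-leaf node of every tree.
    """
    totals = {}
    for feature, parent, left, right in splits:
        totals[feature] = totals.get(feature, 0.0) + split_importance(parent, left, right)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {feature: 0.0 for feature in totals}
    return {feature: value / grand_total for feature, value in totals.items()}

# Hypothetical splits: lncA separates the classes perfectly twice, lncB once.
splits = [
    ("lncA", [5, 5], [5, 0], [0, 5]),
    ("lncB", [4, 4], [4, 0], [0, 4]),
    ("lncA", [2, 2], [2, 0], [0, 2]),
]
vim = normalized_vim(splits)
```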

Boruta

In general, the goal of feature selection in machine learning is to filter out the features that minimize the cost function of the current model. However, Boruta’s feature selection aims to filter out all features correlated with the dependent variable (Wallentin et al. 2021). The significance of the Boruta algorithm is that it can help us understand the influencing factors of the dependent variable more comprehensively so as to perform feature selection better and more efficiently.

Real features:

$${Z}_{score}=\frac{average\, feature\, importance}{SE(feature\, importance)}$$

Shuffled features:

$$MZSA=\mathrm{max}\{{Z}_{score}\}$$
$$\left\{\begin{array}{c}feature\, is\, important\, if\, {Z}_{score}\ge MZSA\\ feature\,is\, not\, important \,if\, {Z}_{score}<MZSA\end{array}\right.$$

In this study, the Boruta R package (version: 8.0.0) was used for performing feature selection based on the Boruta algorithm.
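The decision rule above can be illustrated with a single comparison round. This is a simplified sketch, not the Boruta package: the function names and importance values are hypothetical, the denominator of the Z-score is taken here as the sample standard deviation across trees (an assumption), and the repeated statistical testing over many iterations that a full Boruta run performs is omitted.

```python
import statistics

def z_score(importances):
    """Mean importance divided by its spread across trees (assumed: sample SD)."""
    return statistics.mean(importances) / statistics.stdev(importances)

def boruta_round(real, shadow):
    """Flag each real feature whose Z-score reaches the best shadow Z-score.

    `real` and `shadow` map feature names to lists of per-tree importances;
    shadow features are shuffled copies of the real ones.
    """
    mzsa = max(z_score(values) for values in shadow.values())
    return {name: z_score(values) >= mzsa for name, values in real.items()}

# Hypothetical importances from three trees.
real = {"lnc1": [4, 5, 6], "lnc2": [0.2, 1.8, 1.0]}
shadow = {"shadow1": [1, 2, 3], "shadow2": [0.5, 1.0, 1.5]}
decision = boruta_round(real, shadow)
```

Here lnc1 clearly beats every shuffled feature, while lnc2 does not and would be treated as unimportant in this round.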

Xgboost

Xgboost is an extreme gradient boosting algorithm based on the gradient boosting decision tree (GBDT) (Chai et al. 2021; Jiang et al. 2021; Hu et al. 2022). It is efficient, flexible, and portable, and has therefore been widely used in data mining, recommendation systems, and other fields.

The contribution degree of the feature is defined by the following formula:

$$V\left(k\right)=\frac{{\sum }_{t=1}^{T}{\sum }_{i=1}^{N(t)}I(\beta \left(t,i\right)=k){H}_{\gamma (t,i)}}{{\sum }_{t=1}^{T}{\sum }_{i=1}^{N(t)}I(\beta \left(t,i\right)=k)}$$

Here, \(k\) represents the feature, \(T\) indicates the total number of trees, \(N(t)\) represents the number of non-leaf nodes in tree \(t\), and \(\beta \left(t,i\right)\) represents the splitting feature of the \(i\)-th non-leaf node in tree \(t\). \({H}_{\gamma (t,i)}\) represents the sum of second derivatives over all samples for the \(i\)-th non-leaf node in tree \(t\). \(I(\beta \left(t,i\right)=k)\) is an indicator function.

In this study, the Xgboost R package (version: 1.7.5.1) was used for performing feature selection based on the Xgboost algorithm.
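Read literally, \(V(k)\) is the average of the second-derivative sums over all non-leaf nodes that split on feature \(k\). The sketch below computes exactly that from a flat list of splits; the function name and the toy values are hypothetical, and it does not call the Xgboost package.

```python
def contribution(splits, k):
    """Average hessian sum over all non-leaf nodes that split on feature k.

    `splits` lists (splitting_feature, hessian_sum) for every non-leaf node
    of every tree; the indicator function becomes the `if` filter below.
    """
    hessians = [h for feature, h in splits if feature == k]
    if not hessians:
        return 0.0  # feature never used for a split
    return sum(hessians) / len(hessians)

# Hypothetical non-leaf nodes pooled from all trees.
splits = [("lncA", 10.0), ("lncB", 4.0), ("lncA", 6.0)]
v_a = contribution(splits, "lncA")
v_b = contribution(splits, "lncB")
```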

Gaussian Mixture Model

Gaussian mixture model (GMM) refers to the linear combination of multiple Gaussian distribution functions (Zhang et al. 2022a). The immune-related lncRNAs (IRLs) selected by at least two of the four machine learning-based models were fed into the GMM.
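The "at least two of four models" rule amounts to a simple vote count, sketched below. The function name and the lncRNA names in the toy selections are hypothetical placeholders, not the lists obtained in this study.

```python
from collections import Counter

def consensus_features(selections, min_votes=2):
    """Keep features chosen by at least `min_votes` of the selection methods."""
    votes = Counter(
        feature for chosen in selections.values() for feature in set(chosen)
    )
    return sorted(feature for feature, n in votes.items() if n >= min_votes)

# Hypothetical outputs of the four feature-selection methods.
selections = {
    "LASSO": ["lnc1", "lnc2"],
    "RF": ["lnc2", "lnc3"],
    "Boruta": ["lnc3"],
    "Xgboost": ["lnc4"],
}
kept = consensus_features(selections)
```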

The distribution of GMM is defined as:

$${P}_{M}\left(x\right)=\sum_{k=1}^{K}{\pi }_{k}\bullet N\left(x|{\mu }_{k},{\Sigma }_{k}\right)$$
$$\sum_{k=1}^{K}{\pi }_{k}=1, {\pi }_{k}>0$$

Here, the distribution consists of \(K\) mixture components. \(\mu \) is an \(n\) dimensional mean vector, \(\Sigma \) is a \(n\times n\) covariance matrix, and \(\pi \) is the corresponding mixture coefficient.

GMM was used for clustering analysis. We assumed that the sample data follow a Gaussian mixture distribution; the parameters of the mixture are estimated from the sample dataset, together with the Gaussian component to which each sample most likely belongs. In our study, there are 4095 (\({2}^{12}-1\)) candidate models, one for each non-empty combination of the 12 prognosis-related HKLncRNA signatures that are associated with immune regulation, up-regulated in immune cells compared to cancer cells, and beneficial for predicting the outcomes of colon cancer patients.

The optimal IRL combination was selected as the one with the best prediction performance at minimum cost.
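The count of 4095 candidate models and the selection criterion can be sketched as follows. The code only enumerates the non-empty subsets and applies the tie-breaking rule (highest AUC first, then fewest lncRNAs); how each subset is scored is not reproduced, so the AUC values in the toy dictionary are placeholders and the names are hypothetical.

```python
from itertools import combinations

def all_subsets(features):
    """Yield every non-empty combination of the candidate features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

def pick_optimal(auc_by_subset):
    """Highest AUC wins; ties go to the subset with fewer features."""
    return max(auc_by_subset, key=lambda s: (auc_by_subset[s], -len(s)))

n_models = sum(1 for _ in all_subsets(range(12)))  # 2**12 - 1 candidates

# Hypothetical scores: two subsets tie on AUC, the smaller one is preferred.
scores = {
    ("lnc1",): 0.60,
    ("lnc1", "lnc2", "lnc3"): 0.77,
    ("lnc1", "lnc2", "lnc3", "lnc4"): 0.77,
}
best = pick_optimal(scores)
```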

The Construction of Risk Scoring System

The combination of IRLs with the highest AUC and the fewest lncRNAs was used to construct the final risk score model. To predict the outcome of colon cancer patients, a predictive model based on the expression levels of the IRLs and clinical information was built as follows:

$$Risk\, Score= {\sum }_{i=1}^{N}Exp({LncRNA}_{i})\times {\beta }_{i}$$

where \(N\) represents the number of IRLs, \(Exp({LncRNA}_{i})\) represents the expression level of the \(i\)-th lncRNA, and \({\beta }_{i}\) represents the regression coefficient of that lncRNA in the multivariate Cox regression model.

This multivariable Cox regression model was built with the coxph() function in the survival R package (version: 3.5.1) and the step() function in the base R stats package.

Performance of Risk Score Model

The receiver operating characteristic (ROC) curve is an analysis tool for assessing the sensitivity and specificity of our model. The area under the curve (AUC) summarizes the ROC curve; it ranges from 0 to 1, and the closer it is to 1, the better the model performs.
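One way to see what the AUC measures is its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes the AUC this way on toy data; the function name and values are hypothetical, and the R packages used in this study are not involved.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical risk scores; label 1 marks patients with the event.
value = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Three of the four positive-negative pairs are ordered correctly in this toy example, so the AUC is 0.75.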

After constructing the risk score model, the risk score model was validated on an independent dataset in terms of C-index value, tROC, and ROC. TCGA-COAD was treated as an independent dataset for validating the robustness of our risk score model. Survival analysis was performed using the survival R package (version: 3.3.1). Time-dependent ROC analysis was performed by the timeROC R package (version: 0.4).
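For readers unfamiliar with the C-index, the following sketch shows Harrell's concordance index in its basic form: among all patient pairs in which the patient with the shorter follow-up time had the event, it counts how often that patient also has the higher risk score. The function name and toy data are hypothetical, ties in survival time are simply skipped, and the value reported by the survival package may be computed with additional refinements.

```python
def c_index(times, events, risks):
    """Harrell's C: share of comparable pairs ordered correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had the event before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical cohort: survival times, event indicators (0 = censored), risks.
value = c_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.3, 0.5, 0.1])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the C-index values reported in the Results.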

Univariate and Multivariate Cox Regression

The univariate Cox regression model was utilized for the selection of prognostic IRL. Obtained IRL served as individual factors associated with colon cancer patients’ outcomes by combining clinical characteristics. Then, multivariate Cox regression was utilized to identify independent prognostic factors among clinical traits and IRLs.

Two R packages, survival (version: 3.1.1) and forestplot (version: 3.1.1), were used to select independent prognostic factors and to visualize the results.

Other Statistical and Bioinformatics Analysis

All statistical analyses were performed in the R language (version: 4.2.2). The gsva() function in the GSVA R package (version: 1.44.5) was applied to immune infiltration analysis, which calculates the enrichment score of each immune cell for each patient using the ssGSEA algorithm. Immune cell types and the gene sets of each immune cell type were collected from Pan-Cancer research, which includes 28 immune cell types: B cell, CD4 T cell, CD8 T cell, dendritic cell, CD56 bright natural killer cell, CD56 dim natural killer cell, Central memory CD4 T cell, Central memory CD8 T cell, Effector memory CD4 T cell, Effector memory CD8 T cell, Eosinophil, Gamma delta T cell, Immature B cell, Immature dendritic cell, MDSC, Macrophage, Mast cell, Memory B cell, Monocyte, Natural killer T cell, Natural killer cell, Neutrophil, Plasmacytoid dendritic cell, Regulatory T cell, T follicular helper cell, Type 1 T helper cell, Type 17 T helper cell, and Type 2 T helper cell (Charoentong et al. 2017). To investigate the tumor immune microenvironment, we compared the expression levels of the lncRNAs in the risk model across five immune subtypes, including C1 (wound healing), C2 (IFN-gamma dominant), C3 (inflammatory), C4 (lymphocyte depleted), and C6 (TGF-beta dominant) (Thorsson et al. 2019).

Wilcoxon and ANOVA tests were utilized for statistical significance analysis.

Results

The Expressed lncRNAs in Human Immune Cells

The overall schematic workflow is shown in Fig. 1. After the lncRNA annotation process (matching probe ids to lncRNA gene symbols), 1422 lncRNAs were kept. In order to determine the expression pattern of lncRNAs in human immune cells, we ranked lncRNAs according to their expression levels, from high to low, for each immune cell type. The top 30% of highly expressed lncRNAs for each immune cell type were merged, and duplicates were removed. The resulting 737 lncRNAs were treated as immune-related lncRNAs and kept for the following analysis.

Fig. 1 The schematic workflow for identifying IRLs to improve the prognosis of colon cancer patients

The top 30% of highly expressed lncRNAs were obtained as the candidate IRLs for each immune cell type. The expression specificity of each candidate IRL with respect to the different immune cell types was calculated using the specificity index. HKLncRNAs that were significantly up-regulated in immune samples compared to colon cancer samples were selected as IRLs. We analyzed the GEO dataset GSE39582 and TCGA-COAD with four machine learning algorithms and a Gaussian mixture model to screen out the optimal combination of lncRNAs: LINC00638, CYB561D2, and DANCR. A prognostic signature was constructed as the linear combination of the expression values of the prognostic IRLs, weighted by their estimated regression coefficients in the multivariate Cox regression analysis. The model performed satisfactorily and was validated on an independent dataset using the C-index, ROC, and tROC. Finally, we explored the difference in immune cell types between low- and high-risk score groups.

HKLncRNAs in Human Immune Cells

We calculated these 737 lncRNAs’ expression levels across 19 immune cell types. By introducing a tissue specificity index value and setting cutoff values, we identified 221 HKLncRNAs (Supplementary Table 2) and 221 cell type-specific lncRNAs in immune cells (Supplementary Table 3). Supplementary Fig. 1 shows the heatmap of house-keeping and cell type-specific genes’ expression profile across all immune cell types.

HKLncRNAs in immune cells are constitutive genes that have an essential role in the maintenance of cellular immune function.

LncRNAs That Are Up-Regulated in the Immune Compared to Colon Cancer Cells

By combining microarray data from immune and cancer cells, we conducted DEG analysis using the SAM() function in the samr R package. In total, 87 HKLncRNAs were significantly up-regulated in immune cells compared to colon cancer cells and were kept for the following feature selection analysis (Supplementary Table 4).

The Prognostic and Optimal lncRNAs

The four machine learning-based models, LASSO, RF, Boruta, and Xgboost, mined 14, 43, 6, and 91 prognostic lncRNAs, respectively (Supplementary Fig. 2). Finally, 12 lncRNAs were identified by at least two of the models (Fig. 2A, Supplementary Table 5). Of these, 11 lncRNAs were confirmed to be associated with the prognosis of colon cancer patients by univariate Cox regression analysis (Fig. 2B).

Fig. 2 The identification of prognostic and optimal lncRNAs. A The overlapping lncRNAs among four different machine learning-based models, LASSO, RF, Boruta, and Xgboost. B Prognosis-related lncRNAs inferred by univariate Cox regression. C GMM model was conducted to identify the optimal combination of lncRNAs

GMM was utilized for identifying the optimal combination of lncRNAs. Three lncRNAs, CYB561D2, LINC00638, and DANCR, were identified as the optimal lncRNAs related to the prognosis of colon cancer, with the maximum AUC = 0.770 (Fig. 2C). Another combination of four lncRNAs, CYB561D2, LINC00638, DANCR, and LINC01208, reached the same maximum AUC value. Because every additional marker to be measured adds cost, we chose the three-lncRNA combination instead of the four-lncRNA combination as the optimal one.

A Scoring System Based on Immune-Related and Prognostic lncRNAs

A scoring system based on immune-related and prognostic lncRNAs that can be used to detect the prognosis of colon cancer patients is constructed by multivariate Cox regression. Table 2 gives the detailed coefficient value of the above three lncRNAs. The immune-related and prognostic lncRNA signature can be calculated by the following formula:

Table 2 The coefficient of the three lncRNAs in the multivariate Cox regression model
$$lncRNA \,signature=-0.356\times Exp\left(CYB561D2\right)+0.830\times Exp\left(LINC00638\right)-0.170\times Exp(DANCR)$$
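As a worked illustration of the formula above, the sketch below evaluates the signature for one patient using the three coefficients from the formula. The function name and the expression values are hypothetical and serve only to show the arithmetic.

```python
# Coefficients taken from the signature formula above.
COEFFICIENTS = {"CYB561D2": -0.356, "LINC00638": 0.830, "DANCR": -0.170}

def lncrna_signature(expression):
    """Weighted sum of the three lncRNA expression levels for one patient."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())

# Hypothetical normalized expression values for a single patient.
patient = {"CYB561D2": 8.0, "LINC00638": 5.0, "DANCR": 10.0}
score = lncrna_signature(patient)  # -2.848 + 4.150 - 1.700
```

Because only LINC00638 carries a positive coefficient, higher LINC00638 expression raises the score, while higher CYB561D2 or DANCR expression lowers it.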

Risk Score Model Is an Evaluation Indicator for Clinical Outcome

Patients were divided into two groups using the mean of all risk scores from the multivariate Cox regression model as the cutoff. We defined a patient as high-risk if the patient’s score was greater than or equal to the mean risk score of all patients, and as low-risk otherwise. Figure 3A shows the distribution of patients’ age, gender, tumor stage, and survival status in the low- and high-risk groups. Tumor stage and survival status differed significantly between the two groups, whereas age and gender did not. Figure 3B shows a significant difference in survival between the low- and high-risk groups (P-value < 0.05), which indicates that the low-risk group has a longer overall survival time than the high-risk group. Figure 3C visualizes the relationships among the patients’ risk scores, survival time, and the expression levels of the three lncRNAs in the risk score model. From the risk plot, we concluded that patients with high risk scores tend to have shorter survival times than patients with low risk scores. In addition, the expression level of LINC00638 is positively related to the risk score, while the expression levels of CYB561D2 and DANCR are negatively related to the risk score. The AUC is an index describing the performance of the risk score model, and our scoring model reached an AUC of 0.770 on the training dataset (Fig. 3D), which indicates that the model can distinguish the prognosis of training samples. Also, the time-dependent ROC (tROC) supported that our scoring system performs satisfactorily at 3, 5, and 7 years (Fig. 3E).
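The grouping rule described above (high-risk when the score is greater than or equal to the mean, low-risk otherwise) can be written in a few lines. The function name, patient identifiers, and scores below are hypothetical.

```python
def stratify_by_mean(risk_scores):
    """Label each patient 'high' if the score is >= the cohort mean, else 'low'."""
    cutoff = sum(risk_scores.values()) / len(risk_scores)
    return {
        patient: ("high" if score >= cutoff else "low")
        for patient, score in risk_scores.items()
    }

# Hypothetical risk scores for three patients; the cohort mean is 2.0.
groups = stratify_by_mean({"p1": 1.0, "p2": 2.0, "p3": 3.0})
```

A patient whose score equals the mean exactly is assigned to the high-risk group, as stated in the definition.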

Fig. 3 The application of the immune-related and prognostic lncRNA signature to the training dataset. A The pie chart displayed the difference in patients’ age, gender, tumor stage, and survival status between low- and high-risk groups. B The survival curve reflected that there is a significant difference between low- and high-risk cohorts. C The risk plot demonstrated the relationship among risk score, survival time, and lncRNA expression levels. D The ROC curve of this scoring system. The AUC is equal to 0.770. E The time-dependent ROC results for 3, 5, and 7 years of this scoring system. AUC at 3 years = 0.700, AUC at 5 years = 0.702, and AUC at 7 years = 0.651

The Risk Score Model Has Good Robustness

The TCGA-COAD dataset was treated as an independent validation dataset, and our scoring system was applied to it. The C-index and SE (C-index) values of the training dataset (GSE39582, self-validation) are 0.599 and 0.022, respectively, and those of the testing dataset (TCGA, independent validation) are 0.592 and 0.032, respectively (Fig. 4A). The similar C-index values in the two datasets suggest that our scoring system is robust and not overfitted to the training data. Figure 4B indicates a significant difference between the low- and high-risk score groups, with significantly shorter survival time in high-risk score patients than in low-risk score patients. The risk plot shows that the high-risk score group tends to have a shorter survival time compared to the low-risk score group (Fig. 4C). The heatmap reflects that the expression level of LINC00638 is positively related to the risk score, while the expression levels of CYB561D2 and DANCR are negatively related to the risk score (Fig. 4C). ROC and tROC were calculated to evaluate the predictive performance of the model, giving an AUC of 0.618 (Fig. 4D) and time-dependent AUCs of 0.703, 0.780, and 0.738 at 7, 8, and 9 years (Fig. 4E).

Fig. 4 The application of the immune-related and prognostic lncRNA signature to the testing dataset. A The bar plot displayed the C-index values in TCGA (testing) and GSE39582 (training) datasets. B The survival curve reflected that there is a significant difference between low- and high-risk cohorts. C The risk plot demonstrated the relationship among risk score, survival time, and lncRNA expression levels. D The ROC curve of this scoring system. The AUC is equal to 0.618. E The time-dependent ROC results for 7, 8, and 9 years of this scoring system. AUC at 7 years = 0.703, AUC at 8 years = 0.780, and AUC at 9 years = 0.738

Immune Infiltration-Related lncRNA Signature Is an Independent Prognostic Factor

To determine which clinical characteristics are associated with survival time, each clinical trait was tested individually against survival time and survival status. The results showed that patients’ age, TNM_t, TNM_n, TNM_m, and risk score are significantly related to survival, while patients’ gender is not significantly related to survival (Fig. 5A). Further, we considered all clinical factors simultaneously to identify independent prognostic factors. Finally, patients’ age, gender, TNM_t, TNM_n, TNM_m, and risk score can serve as six independent prognostic factors in colon cancer (P-value < 0.05), while TNM_stage can be represented by other clinical factors (Fig. 5B).

Fig. 5 Univariate and multivariate Cox regression analysis. A Univariate independent prognostic analysis. B Multivariate independent prognostic analysis

Immune-Related and Prognostic lncRNA Signature Is Associated with Immune Cell Infiltration

Further, we explored differences in immune infiltration levels between the risk groups. We considered 28 immune cell types in the analysis. We found that patient risk groups stratified by the IRL signature showed distinct immune infiltration patterns. As shown in Fig. 6A, patients in the low-risk group were enriched with six immune subpopulations, while only two immune subpopulations were enriched in patients with high risk (P-value < 0.01). These results suggested that a higher IRL score corresponded to less immune cell infiltration and poor outcome, while a lower IRL score corresponded to greater immune cell infiltration and better outcome. Figure 6A demonstrated significantly positive relationships between risk scores and immune cell types, including memory B cells, natural killer cells, macrophages, mast cells, etc. We further examined the risk score distribution among five immune subtypes reported by a recent study (Thorsson et al. 2019). The risk score showed a notable difference among the five immune subtypes (Fig. 6B). LINC00638 (Fig. 6C) showed a notable difference among the five immune subtypes (P-value < 0.05), while CYB561D2 (Fig. 6D) and DANCR (Fig. 6E) did not (P-value > 0.05). Because the C1, C2, C3, C4, and C6 classification system came from pan-cancer research (Charoentong et al. 2017), we defined these subtypes as tumor immune environments. The combination of LINC00638, CYB561D2, and DANCR is closely associated with the tumor immune microenvironment, although each individual lncRNA in the model might not be. The enrichment of the 28 immune cell types positively and negatively associated with the expression levels of the three IRLs, LINC00638, CYB561D2, and DANCR, calculated as the normalized enrichment score (NES) from gene set enrichment analysis (GSEA), is shown in Supplementary Fig. 3.

Fig. 6

The validation of immune-related and prognostic lncRNA signature in immune cell types. A The volcano plot represents the enrichment of immune cell types for colon cancer positively and negatively associated with risk scores, calculated as the normalized enrichment score (NES) from gene set enrichment analysis (GSEA). B The distribution of risk scores among the five immune subtypes. The expression patterns of three lncRNAs, LINC00638 (C), CYB561D2 (D), and DANCR (E), in the five immune subtypes. C1 = Wound healing; C2 = IFN-\(\gamma \) dominant; C3 = Inflammatory; C4 = Lymphocyte depleted; C6 = TGF-\(\beta \) dominant

Immunotherapy Response Prediction

Immunotherapy drugs have been designed to target immune checkpoint proteins. PD-1 is an immune checkpoint receptor on T cells that serves as an “off switch” (Wang et al. 2022). When PD-1 binds to PD-L1 on tumor cells, T cells do not attack the tumor cells (Yu et al. 2022). An inhibitor of PD-1 can block the binding between PD-1 and PD-L1 and thereby enhance the immune response. CTLA-4 (also known as CD152) is constitutively expressed in regulatory T cells (Wang et al. 2022). In cancer cells, CTLA-4 is up-regulated after immune system activation (Iranzo et al. 2022).

Thus, we compared the expression levels of PD-1, PD-L1, and CTLA-4 between the low- and high-risk score groups. We did not identify a significant difference between the two groups for PD-1 (Fig. 7A) or CTLA-4 (Fig. 7E). However, a significant difference was observed for PD-L1 (Fig. 7C). Recently, researchers reported that high PD-L1 expression on tumor cells indicates the presence of an anti-tumor immune response (Sorensen et al. 2016; Shah et al. 2022), which is consistent with our finding. This suggests that the model may help in proposing drugs related to immunotherapy.

Fig. 7

The relationship between immune checkpoint genes (PD-1, PD-L1, and CTLA-4) and risk score. The distribution of normalized expression levels of the PD-1 (A), PD-L1 (C), and CTLA-4 (E) across low- and high-risk score groups. The correlation between the normalized expression levels of three immune checkpoint genes [PD-1 (B), PD-L1 (D), CTLA-4 (F)] and risk score

Nomogram-Based Survival Prediction

A comprehensive model combining the immune-related and prognostic lncRNA signature with clinical characteristics was developed and displayed as a nomogram (Fig. 8A). Its prognostic reliability was assessed by calibration analysis (Fig. 8B, D, and F). Decision curve analysis (DCA) supported the viability of the nomogram for predicting 3-, 5-, and 10-year survival (Fig. 8C, E, and G). These results demonstrated that the nomogram achieved favorable predictive performance.
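For readers unfamiliar with DCA, the quantity plotted on a decision curve is the net benefit at a given threshold probability. The following is a minimal sketch of the standard net-benefit formula for a binary outcome, not the code used in this study. It assumes that each patient has a predicted event probability and a fully observed event status at the time point of interest, and it ignores censoring, which a survival-based DCA would need to handle. The function and variable names are illustrative.

```python
def net_benefit(predicted_risk, event_observed, threshold):
    """Net benefit of intervening on patients whose predicted risk >= threshold.

    Standard DCA definition for a binary outcome:
        NB = TP/n - FP/n * threshold / (1 - threshold)
    Censoring is ignored here (an assumption of this sketch).
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(predicted_risk)
    pairs = list(zip(predicted_risk, event_observed))
    true_pos = sum(1 for p, e in pairs if p >= threshold and e)
    false_pos = sum(1 for p, e in pairs if p >= threshold and not e)
    weight = threshold / (1.0 - threshold)  # harm-to-benefit ratio implied by the threshold
    return true_pos / n - weight * false_pos / n


def net_benefit_treat_all(event_observed, threshold):
    """Reference strategy: intervene on every patient regardless of predicted risk."""
    prevalence = sum(event_observed) / len(event_observed)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all reference and zero, the net benefit of treating no one.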

Fig. 8

The construction and validation of a nomogram for predicting the survival of colon cancer patients. A The nomogram for assessing the outcome of CRC patients, a readable visualization of the prognostic risk score model. The patient survival calibration curves at 3 (B), 5 (D), and 10 (F) years. The x-axis displays the OS probability predicted by the nomogram at that time, while the y-axis displays the actual data at each time point. The optimum prediction is shown by the 45° grey line. The training dataset is represented by the dots, while the validation dataset is represented by the curve line. The 95% CI is also labeled. Nomogram DCAs for the OS at the 3- (C), 5- (E), and 10-year (G) intervals

Discussion

Colon cancer, also known as colorectal cancer, is a cancer that develops from the colon or rectum (Labianca et al. 2010; Terzić et al. 2010). The mechanism of colon cancer development and progression is still unclear. Recent studies have found that tumor immune cell infiltration is associated with cancer development and may adversely affect cancer prognosis (Kong et al. 2022; Wei et al. 2022). Accumulating evidence has demonstrated that lncRNAs play essential roles in the immune response by participating in cancer progression; for example, Wu et al. revealed that most marker genes of immune cells showed a significant correlation with LINC00665 (Wu et al. 2020a). In particular, the expression of LINC00885 has a positive relationship with marker genes of M2 macrophages (Wu et al. 2020a). However, a systematic model for identifying immune-related lncRNAs is currently lacking. Therefore, we aimed to develop a risk score model to mine regulatory lncRNAs in the colon cancer immune microenvironment.

Firstly, we obtained the top 30% most highly expressed lncRNAs from gene expression profiles of 19 immune cell types. Two hundred and twenty-one HKLncRNAs and two hundred and twenty-one cell type-specific lncRNAs in immune cells were screened by calculating the cell type specificity index. The results can be validated by manually reviewing publications. For example, CYB561D2 and EHIH are two HKLncRNAs in immune cells. CYB561D2 encodes cytochrome B561 family member D2, which participates in ion metabolism and stress defense pathways (Sananmuang et al. 2020). Sordillo et al. reported that oxidative stress is a major underlying cause of inflammatory dysfunction (Sordillo and Aitken 2009). Sun et al. validated that EHIH is a diagnostic and prognostic biomarker in pan-cancer and that it is involved in an immune-oncogenic system combined with YBX3, particularly in colon cancer (Sun et al. 2022). Then, we identified 87 lncRNAs that are up-regulated in immune samples and down-regulated in colon cancer samples, which demonstrated their expression specificity to immune cells rather than tumor cells. We aimed to identify lncRNAs that can be used as biomarkers to improve colon cancer’s prognosis and patients’ immunotherapy response. These lncRNAs were regarded as specifically expressed in immune cells relative to tumor cells. By combining them with clinical trait information, we obtained lncRNAs that were significantly associated with the survival time of colon cancer patients. These lncRNAs were incorporated into four machine learning-based algorithms: LASSO regression analysis, RF, Boruta, and Xgboost. Twelve lncRNAs were identified from the four methods: CYB561D2, PRR34-AS1, DANCR, LINC00638, LINC01119, ADARB2-AS1, GABARAPL3, OVCH1-AS1, DSCR10, DSCR9, LINC00869, and LINC01208. Only LINC00869 was excluded, because there was no significant relationship between its expression level and the prognosis of colon cancer patients. For maximum prediction accuracy and minimum cost, three immune-related and prognostic lncRNAs, LINC00638, CYB561D2, and DANCR, were used to create the risk score model.
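As an illustration of how a linear risk score over the three lncRNAs named above could be computed and used to split patients into two groups, consider the minimal sketch below. The coefficient values and signs are hypothetical placeholders, since the fitted values are not stated in this section, and the median cutoff is an assumption rather than the threshold reported by the study.

```python
from statistics import median

# Hypothetical placeholder coefficients (values and signs carry no meaning);
# the fitted regression coefficients are not given in this section.
COEFFICIENTS = {"LINC00638": -0.4, "CYB561D2": 0.3, "DANCR": 0.5}


def risk_score(expression, coefficients=COEFFICIENTS):
    """Linear combination of expression levels weighted by the coefficients."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def stratify(scores, cutoff=None):
    """Label each patient 'high' or 'low' risk.

    The cutoff defaults to the median score, which is an assumption
    of this sketch; patients at or below the cutoff are labeled 'low'.
    """
    if cutoff is None:
        cutoff = median(scores.values())
    return {pid: ("high" if score > cutoff else "low") for pid, score in scores.items()}
```

In use, `risk_score` would be applied to each patient's normalized expression values, and the resulting dictionary of scores passed to `stratify` before comparing survival between the two groups.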

Then, multivariate regression analysis was conducted to construct a risk score model. Our linear combination model was validated by survival analysis, ROC, tROC, C-index, risk plot analysis, and independent dataset analysis. In addition, the immune checkpoint gene PD-L1 was expressed differently between the low- and high-risk score groups, which suggests that our model can be used to assess a patient’s immune response. Tao et al. reported that CYB561D1 is up-regulated in glioma samples compared to normal samples and concluded that over-expression of CYB561D1 is associated with a short survival time in high-grade glioma (Tao et al. 2021). Mechanistically, over-expression of CYB561D1 increased the expression of CCL2 and PD-L1 and triggered immunosuppression in T cells by activating the STAT3 signaling pathway (Tao et al. 2021). LINC00638/miR-4732-3p/ULBP1 is a lncRNA-related competitive endogenous RNA (ceRNA) network that is highly associated with immune infiltration and tumor mutation burden (TMB) in hepatocellular carcinoma (HCC) (Qi et al. 2021). In HCC with elevated TMB, LINC00638/miR-4732-3p/ULBP1 is a prognostic predictor and controls immunological escape via PD-L1 (Qi et al. 2021). LINC00638/hsa-miR-29b-3p/CDCA4 is a candidate regulatory network in liver hepatocellular carcinoma (LIHC). Tumor immune evasion and anti-tumor immunity may play a role in CDCA4-mediated LIHC carcinogenesis (Wang et al. 2023). Low CDCA4 expression is associated with a markedly better prognosis in LIHC patients, and CDCA4 is a promising novel biomarker for predicting LIHC prognosis. In recent years, a growing number of studies have examined DANCR as a biomarker for predicting colon cancer prognosis (Yang et al. 2018; Shi et al. 2020; Sun et al. 2020b). DANCR was extensively expressed in colon cancer tissue and cell lines (Sun et al. 2020b). Sun et al. reported that higher levels of DANCR were associated with a poorer prognosis and shorter patient survival time in colon cancer (Sun et al. 2020b). Cell proliferation and colony formation were drastically reduced when DANCR was silenced by short-interfering RNA (siRNA) (Yang et al. 2018). Although immune checkpoint inhibitors targeting PD-1, PD-L1, and CTLA-4 have been developed to treat cancer and improve survival time (Sun et al. 2020a, b), the immune responses of different patients are not the same because of the heterogeneity of the tumor immune environment. Our findings revealed that the lncRNAs are involved in complex crosstalk between tumor cells and immune cells. Low-risk group patients have higher PD-L1 expression and longer survival time than high-risk group patients.

Compared to published studies (Toth et al. 2019; Wallentin et al. 2021; Jiang et al. 2021; Chai et al. 2021; Zhang et al. 2022b, a; Hu et al. 2022), our framework has the advantage of considering multiple machine learning-based methods. We propose that immune-related and prognostic lncRNAs have great potential to predict the survival of colon cancer patients based on the linear regression model. Feature selection is achieved by combining four machine learning methods, keeping prognosis-related lncRNAs, and selecting the optimal combination of lncRNAs from 4095 combinations using GMM. We provided a nomogram with the maximum performance and minimum cost. This framework is helpful for dividing patients into two groups that may receive different treatment strategies, and it can be applied not only to colon cancer but also to other cancers. However, this study also has some limitations. First, it is a retrospective study. Second, the mechanism of DANCR has not been fully explained in previous research. Further experiments should be conducted to validate our model.
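The figure of 4095 combinations mentioned above equals 2^12 − 1, the number of non-empty subsets of twelve items, which matches the twelve candidate lncRNAs listed earlier in the Discussion. The sketch below enumerates those subsets under that reading, which is an assumption. The `score` argument is a caller-supplied placeholder and does not reproduce the GMM-based criterion, whose details are not given in this section.

```python
from itertools import combinations

# The twelve candidate lncRNAs listed in the Discussion.
CANDIDATES = [
    "CYB561D2", "PRR34-AS1", "DANCR", "LINC00638", "LINC01119", "ADARB2-AS1",
    "GABARAPL3", "OVCH1-AS1", "DSCR10", "DSCR9", "LINC00869", "LINC01208",
]


def non_empty_subsets(items):
    """Yield every non-empty combination of items, from size 1 up to len(items)."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def best_subset(items, score):
    """Return the subset with the highest value of the caller-supplied score.

    `score` is a hypothetical stand-in for whatever model-selection
    criterion is used; ties are resolved in favor of the first subset found.
    """
    return max(non_empty_subsets(items), key=score)
```

Exhaustive enumeration is feasible here because twelve candidates give only a few thousand subsets; the count doubles with each additional candidate, so this approach does not scale to large feature sets.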

In conclusion, we present an immune-related and prognostic lncRNA signature built by combining transcriptome data and clinical data. This signature was validated by ROC, tROC, C-index, an independent dataset, and the literature. It has good potential to predict the outcome of colon cancer patients. Applying this model to colon cancer patients reveals that the tumor immune microenvironment differs between the low- and high-risk score groups, which is beneficial for immunotherapy and precision medicine.