Introduction

Lung cancer remains a prevalent malignancy, ranking third in incidence and first in cancer-related mortality worldwide [1]. Lung adenocarcinoma (LUAD), the most common type of lung cancer, has seen a continuous rise in incidence [2]. Recent advances in targeted molecular therapies and immunotherapies have shown promising results in improving the prognosis of patients with LUAD [3,4,5]. However, the efficacy of immunotherapy is limited to specific subtypes, and patients with LUAD generally have a poor prognosis due to early metastasis [6]. Therefore, gaining a deeper understanding of LUAD-related molecular mechanisms is essential for developing effective treatments.

Cancer cell characteristics and the tumor microenvironment (TME) play significant roles in tumor progression, a complex biological process [7]. Thus, a comprehensive analysis of the TME in LUAD may shed light on critical factors involved in tumor-induced immunological changes. Traditional bulk RNA sequencing (RNA-seq) reveals only general tumor biology and fails to capture intra-tumoral and inter-cellular heterogeneity. In contrast, single-cell RNA sequencing (scRNA-seq) can reveal heterogeneity among different cells and is essential for profiling the TME, analyzing cell fate, exploring cellular interactions, and developing personalized therapeutic strategies [8]. scRNA-seq is widely used to study the cellular characteristics of various tumors [9]. However, the single-cell profile of LUAD has yet to be fully elucidated.

To better capture the heterogeneity of tumors and precisely stratify patients, we analyzed scRNA-seq data and identified malignant cells using CopyKAT. By combining pseudotime analysis, regulatory transcription factor (TF) analysis, and cell–cell communication analysis, we characterized cancer heterogeneity, the TME, and cell–cell interactions. Malignant cell-associated ligand–receptor genes were screened, and relevant molecular subtypes were constructed for accurate patient stratification [10]. Numerous researchers have sought to identify potential biomarkers for predicting prognosis and immune response in tumor studies [11, 12]. In this study, we introduced a novel computational framework that integrates ten diverse machine learning algorithms to develop a robust prognostic model for LUAD. The predictive efficacy of our model has been validated across multiple independent cohorts, demonstrating its reliability in clinical stratification and outcome prediction. This comprehensive and innovative methodology marks a significant advancement in the personalized treatment and prognosis assessment of patients with LUAD. Finally, we performed experiments to validate the core gene (MYO1E) in our model, offering a new predictive biomarker and molecular target for treating patients with LUAD. The study workflow is depicted in Fig. 1.

Fig. 1

The workflow illustrating the schematic overview of single-cell sequencing and GSE171145 dataset analysis (upper) and prognostic model establishment (lower)

Materials and methods

Data sources used for analysis

To explore the cellular composition of the TME in lung adenocarcinoma, we analyzed nine untreated LUAD samples from eight patients using scRNA-seq. These samples were sourced from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) dataset GSE171145. Additionally, after excluding samples with incomplete clinical and pathological information, we utilized gene expression profiles along with their associated clinical data from ten different datasets for constructing and validating a prognostic signature through integrative machine learning approaches. These datasets included TCGA-LUAD (N = 500), GSE31210 (N = 118), GSE36471 (N = 107), GSE37745 (N = 106), GSE42127 (N = 171), GSE50081 (N = 181), GSE68465 (N = 435), GSE68571 (N = 83), GSE72094 (N = 398), and GSE87340 (N = 50), sourced from the Cancer Genome Atlas (TCGA, https://www.cancer.gov) and the GEO databases. All datasets are detailed in Supplementary Table 1. Moreover, single nucleotide variants (SNVs) in the TCGA-LUAD dataset, processed using the “mutect2” software, were retrieved from the TCGA database.

Cell cluster annotation

The “Seurat” R package was used to analyze the scRNA-seq dataset [13, 14]. Quality control standards were set, and cells that did not meet them were excluded. First, the data were filtered to retain only genes expressed in at least three cells and cells containing a minimum of 250 genes. Next, the percentages of mitochondrial and rRNA genes were calculated using the “PercentageFeatureSet” function, and each cell was required to contain between 100 and 5000 genes, less than 25% mitochondrial content, and at least 100 unique molecular identifiers (UMIs). After log-normalization, highly variable genes were identified using the “FindVariableFeatures” function. Data scaling was performed with the “ScaleData” function, followed by principal component analysis (PCA) with 50 dimensions to determine anchor points [15]. The dimensionality of the data was further reduced using the “RunTSNE” function. The cell clusters were then annotated using classical markers of immune cells.
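
The per-cell thresholds above can be illustrated with a minimal sketch. It is written in plain Python rather than with Seurat, and the `CellQC` container and `passes_qc` helper are hypothetical names introduced only for illustration; the cutoffs are the ones stated in the text (100 to 5000 genes, less than 25% mitochondrial content, at least 100 UMIs).

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Per-cell QC summary (hypothetical container, not a Seurat object)."""
    n_genes: int      # number of genes detected in the cell
    n_umis: int       # total UMI count of the cell
    pct_mito: float   # percentage of UMIs from mitochondrial genes


def passes_qc(cell, min_genes=100, max_genes=5000,
              max_pct_mito=25.0, min_umis=100):
    """Return True if the cell satisfies all thresholds stated in the text."""
    return (min_genes <= cell.n_genes <= max_genes
            and cell.pct_mito < max_pct_mito
            and cell.n_umis >= min_umis)


# Toy cells with made-up values, for illustration only.
cells = [
    CellQC(n_genes=1200, n_umis=3500, pct_mito=4.2),   # kept
    CellQC(n_genes=6100, n_umis=40000, pct_mito=3.0),  # too many genes
    CellQC(n_genes=800, n_umis=2000, pct_mito=31.5),   # high mitochondrial content
    CellQC(n_genes=90, n_umis=95, pct_mito=2.0),       # too few genes and UMIs
]
kept = [cell for cell in cells if passes_qc(cell)]
```

In an actual Seurat workflow the same logic would be applied to the object's metadata columns; the sketch only makes the filter conditions explicit.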

Defining subpopulations of malignant cells

The CopyKAT algorithm, an integrated Bayesian approach with hierarchical clustering, was utilized to categorize cells based on copy number alterations [16]. Aneuploid cells were classified as malignant, while diploid cells were classified as stromal or immune cells.

Analysis of TF activity

We employed the SCENIC algorithm to investigate interaction mechanisms among different cell types and computed a TF regulatory network [17]. The “calcRSS” function in the SCENIC algorithm was used to calculate regulon specificity scores (RSS), aiding in the identification of TFs associated with malignant cells [18].

Pseudotime analysis

Single-cell pseudotime trajectories were constructed using Monocle 2 to identify specific TFs associated with malignant cells over time. We used the “dispersionTable” function to select genes for trajectory inference, describing gene variance across cells as a function of the mean. We also used the “reduceDimension” function for DDRTree dimensionality reduction. Results were visualized using the “plot_pseudotime_heatmap” and “plot_cell_trajectory” functions.

Analysis of cell–cell communication

We used the “CellChat” R package to infer differences and similarities between malignant and adjacent cells and established cell–cell communication networks [10, 19]. We used the “identifyOverExpressedInteractions,” “computeCommunProb,” and “computeCommunProbPathway” functions to identify over-expressed ligand–receptor interactions, compute communication probabilities, and infer cellular communication networks at the signaling pathway level, respectively.

Consensus clustering analysis of malignant cell-associated ligand–receptor genes

Based on the cell communication analysis results, we identified malignant cell-associated ligand–receptor genes and further screened them for prognostic relevance by performing univariate Cox analysis. Next, we employed the “ConsensusClusterPlus” package in R for unsupervised consensus clustering to identify robust clusters relevant to LUAD.

Gene set variation analysis (GSVA)

To evaluate prognostic differences between molecular subtypes, we conducted a Kaplan–Meier survival analysis. To clarify these distinctions, we performed a GSVA using the “c2.cp.kegg.v7.5.1.symbols” gene set obtained from the MSigDB database (https://www.gsea-msigdb.org/gsea/msigdb/index.jsp).
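
As a rough illustration of the Kaplan–Meier estimator used to compare subtypes, the following sketch computes the survival curve for one group in plain Python. The function name `kaplan_meier` and the toy data are assumptions made for illustration; the study itself relied on R packages.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate for one group.

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Toy example: four patients, the second one censored at t = 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Curves computed separately for each cluster would then be compared with a log-rank test, which is not shown here.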

Development and validation of the prognostic signature for LUAD

We used the “limma” R package to conduct differential analysis and identify genes associated with malignant cell-associated ligand–receptor subtypes [20]. To identify the functional enrichment of these genes, we utilized the “clusterProfiler” R package for Gene Set Enrichment Analysis (GSEA) [21]. Next, we conducted a univariate Cox regression analysis to identify genes linked to prognosis, followed by a 10-fold cross-validation process to assess 95 unique configurations originating from 10 different machine learning algorithms. These algorithms included CoxBoost, generalized boosted regression modeling (GBM), Lasso, Ridge, supervised principal components (SuperPC), survival support vector machine (survival-SVM), elastic net (Enet), random survival forest (RSF), stepwise Cox, and partial least squares regression for Cox (plsRcox) [22]. For each method, we evaluated its C index across both the TCGA dataset and external validation datasets (GSE72094). Subsequently, we determined the predictive efficacy of these models by averaging their C indices. The algorithm combination was selected on the basis of its robust performance and potential clinical applicability. Consequently, we developed a signature that could predict overall survival in patients with LUAD. We then categorized LUAD patients into high- and low-risk groups based on the median risk score in the TCGA-LUAD cohort. To examine prognostic differences between these groups, we conducted Kaplan–Meier survival analysis. We assessed the predictive performance of the model by categorizing patients into different subgroups on the basis of age, tumor stage, and TNM stage. We used the “survminer” and “timeROC” packages for time-dependent receiver operating characteristic (ROC) curve analysis. We used Kaplan–Meier survival and ROC curve analyses to assess the robustness of the model in nine distinct datasets.
To balance the granularity of the analysis with practical clinical management and prognostic assessment, we grouped the T stage into T1–2 and T3–4, N stage into N0 and N1–3, and tumor stage into I-II and III-IV. This approach allowed us to ensure sufficient sample sizes for robust statistical analysis and derive meaningful insights applicable to broader patient groups. We then conducted a subgroup analysis by stratifying patients by age (≤ 65 and > 65 years), gender (female and male), T, N, and M stages, and tumor stage, enabling us to explore variations in risk scores across clinical phenotypes and their correlations with clinical characteristics.
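
The model-ranking step described above (a C index per cohort, then the mean across cohorts) can be sketched as follows. This is a simplified, stdlib-only illustration of Harrell's concordance index that ignores pairs with tied survival times; the function names and toy cohorts are hypothetical and do not reproduce the authors' R implementation.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C index.

    A pair (i, j) is comparable when patient i had an observed event and a
    shorter follow-up time than patient j. The pair is concordant when the
    patient who died earlier also has the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in this cohort")
    return concordant / comparable


def mean_c_index(cohorts):
    """Average the C index over cohorts.

    cohorts maps a cohort name to a (times, events, risk_scores) tuple.
    """
    values = [concordance_index(*cohort) for cohort in cohorts.values()]
    return sum(values) / len(values)


# Toy cohorts with made-up values, for illustration only.
toy_cohorts = {
    "training": ([1, 2, 3, 4], [1, 1, 0, 1], [0.9, 0.7, 0.4, 0.1]),
    "validation": ([2, 5, 6], [1, 0, 1], [0.2, 0.8, 0.5]),
}
average = mean_c_index(toy_cohorts)
```

Ranking candidate models by this average, rather than by the training cohort alone, is what allows models that fit the training data well but generalize poorly to be recognized.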

Immunological characteristics and therapeutic responses of the prognostic signature

We evaluated differences in immune checkpoint expression between the high- and low-risk groups. The subclass mapping (SubMap) method was used to evaluate the immune checkpoint blockade (ICB) response in the two groups [23]. Moreover, two independent immunotherapy cohorts, namely GSE78220 (N = 24) and a phase II immunotherapy cohort of patients with locally advanced or metastatic urothelial cancer (IMvigor210, N = 293), were further evaluated.

Drug sensitivity estimation

We obtained the cancer cell line (CCL) drug sensitivity metrics from three separate response databases: GDSC [24], CTRP [25], and PRISM [26]. The CTRP and PRISM databases provide AUC values as indicators of drug sensitivity, while GDSC reports IC50 values. Additionally, we gathered transcriptome profiling data for CCLs from the CCLE database [27]. The IC50 values for various compounds in GDSC were determined using the “oncoPredict” R package. The relationship between the risk score and the IC50 (or AUC) values suggests potential LUAD sample responses to specific compounds.

Mutation analysis

We used the “maftools” R package to perform tumor mutation burden (TMB) analysis and generated a waterfall plot to characterize somatic mutations in patients with LUAD. We also examined differences in homologous recombination defects, fractions altered, segment numbers, and TMB using the Wilcoxon test [28].

Establishment of a nomogram scoring system

To quantify the risk evaluation of patients and improve the practicability of the model, we developed a nomogram that combined age, N stage, and risk scores to predict the overall survival at 1, 3, and 5 years [29]. Moreover, we assessed the efficiency of the nomogram by decision curve analysis (DCA) and calibration plots.

Cell culture

The lung adenocarcinoma cell lines, A549 and H1299, were acquired from the American Type Culture Collection (ATCC; Rockville, MD, USA). These cells were maintained in RPMI 1640 medium (ProCell) enriched with 10% fetal bovine serum (Gibco, Waltham, MA, USA) and were cultured under a humidified environment with 5% CO2 at a temperature of 37 °C.

RNA interference and transfection

The small interfering RNAs (siRNAs) targeting MYO1E were obtained from Shanghai GenePharma Co. Ltd (Shanghai, China). A549 and H1299 cells were transfected with 50 nmol/L siRNA using Lipofectamine 2000 (ThermoFisher, Massachusetts, USA). The knockdown efficiency of MYO1E was evaluated by quantitative real-time PCR (RT-qPCR) and western blot. The siRNA sequences were: si-MYO1E-1: 5′-GCACGCCATGAATGTGATT-3′, si-MYO1E-2: 5′-GCATCAAGTCGAATATTTG-3′.

RT-qPCR

Total RNA was extracted from cells and tissues with Trizol reagent (Vazyme, Nanjing, China, R411-01) and reverse-transcribed using the HiScript III RT SuperMix (Vazyme, China, R323). RT-qPCR analysis was performed using Universal SYBR Green Fast qPCR Mix (ABclonal, Hong Kong, China, RK21203), and the results were calculated using the 2^(−ΔΔCt) method with GAPDH serving as the internal control reference [30]. The primer sequences were: GAPDH, F-5′-GGCTGTTGTCATACTTCTCATGG-3′, R-5′-GGAGCGAGATCCCTCCAAAAT-3′; MYO1E, F-5′-AAGGAGCGGCACAGTATGAAA-3′, R-5′-TCACCACTGATAATGACGCAC-3′.
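
The relative quantification can be made concrete with a short worked example. The sketch below applies the standard 2^(−ΔΔCt) calculation; the function name and the Ct values are hypothetical and chosen only to show the arithmetic.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of the target gene by the 2^(-ddCt) method.

    dCt  = Ct(target) - Ct(reference gene), computed within each sample
    ddCt = dCt(treated) - dCt(control)
    """
    delta_ct_treated = ct_target_treated - ct_ref_treated
    delta_ct_control = ct_target_control - ct_ref_control
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2 ** (-delta_delta_ct)


# Hypothetical Ct values: MYO1E and GAPDH in siRNA-treated vs. control cells.
fold_change = relative_expression(
    ct_target_treated=26.0, ct_ref_treated=18.0,   # dCt = 8
    ct_target_control=24.0, ct_ref_control=18.0,   # dCt = 6
)
# ddCt = 2, so the target is expressed at 2^-2 = 0.25 of the control level.
```

With these made-up values, a knockdown would be read as a 75% reduction in MYO1E transcript relative to the control.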

Clone formation tests

Cells transfected with control and siRNA were plated in 6-well plates. After 2 weeks, the cell colonies were fixed using 4% paraformaldehyde for 30 min, followed by staining with 0.1% crystal violet for another 30 min [31]. High-definition photographs of the colonies were captured and subsequently analyzed with ImageJ software.

EdU assay

Cells transfected with either control or siRNA were seeded into 24-well plates. After 48 h, EdU was added to the cells, which were then incubated for an additional 2 h. Cells were fixed with 4% paraformaldehyde for 30 min, and nuclei were stained with DAPI. A Nikon microscope was used for imaging, and the number of EdU-positive cells was quantified using ImageJ software.

Wound-healing assay

Cells were plated in 6-well plates and a scratch was created using a sterile plastic pipette tip. Cells were then cultured in FBS-deficient medium. Images were taken with an electron microscope at 0 and 24 h to capture the wound area. Cell migration was assessed by measuring the change in wound size.

Cell migration assay

The migration capability of LUAD cells was assessed using a transwell membrane (Corning 3422, 8 μm pore size) without Matrigel coating. In brief, 2–4 × 10⁴ cells were seeded into the upper chamber in 200 µL of FBS-free medium, while the lower chamber was filled with 600 µL of medium supplemented with 10% FBS. Following 24 h of incubation at 37 °C, the chambers were rinsed with PBS and fixed with 4% paraformaldehyde for approximately 30 min. Non-migratory cells on the upper membrane surface were removed using a cotton swab. The membrane was stained with crystal violet for about 30 min at room temperature, rinsed with PBS, air-dried, and then imaged.

Statistical analysis

Data are presented as the mean ± SD. Statistical diagrams were generated using the ggplot2 package in R and GraphPad Prism 8. A P-value of less than 0.05 was considered statistically significant. Each experiment was conducted in triplicate to ensure reproducibility.

Results

Dimensionality reduction clustering of LUAD single-cell data

After filtering the scRNA-seq data, a total of 43,851 cells were obtained. The data quality was evaluated using the following three parameters: total UMI count, number of genes detected, and the ratio of mitochondrial gene UMI count to total UMI count. A significant positive correlation between UMI count and mRNA and a weak correlation between UMI/mRNA and mitochondrial gene content are presented in Fig. S1A. Violin plots show differences before and after the quality control analysis (Fig. S1B). Next, the data were log-normalized, and highly variable genes were identified on the basis of the variance-stabilizing transformation. Scaling was then performed using the “ScaleData” function for all genes, followed by PCA dimensionality reduction using “RunPCA” to identify anchor points, with dim = 50 selected. Clustering of the cells (resolution = 0.6) resulted in a total of 27 clusters (Fig. S1C). Moreover, t-SNE dimensionality reduction analysis was performed on the 43,851 cells. Classical markers of immune cells were used to annotate the cells in the 27 clusters: clusters 0, 1, 3, 4, 5, 8, 10, 14, 19, 26, and 27 were classified as T/NK cells (CD4, CD3D, CD3E, CD8A); clusters 6, 17, and 21 were classified as B/plasma cells (CD19, CD79A, MS4A1, JCHAIN); clusters 7, 9, 12, 15, 16, 22, and 25 were classified as epithelial cells (EPCAM, KRT19, KRT18, PROM1, ALDH1A1, CD24); clusters 20 and 23 were classified as fibroblasts (DCN, COL1A2, PDGFRA, COL1A1, FGF7); cluster 24 was classified as endothelial cells (PECAM1, VWF, CDH5); clusters 2 and 11 were classified as monocytic cells (CD14, CD68, CD163, C1QA, CD1C); cluster 13 was classified as neutrophils (S100A9, CSF3R, FCGR3B); and cluster 18 was classified as mast cells (MS4A2, CPA3, TPSB2), as shown in Fig. S2.

To identify distinctions among patients, we performed cell clustering based on sample origin. The diversity of cells across these patients indicated high inter-tumor heterogeneity (Fig. 2A). Figure 2B shows the distribution of the 27 clusters, and Fig. 2C shows the t-SNE plot after cell annotation. The “FindAllMarkers” function was used to identify markers for each cell cluster, with the following thresholds: log2FC > 0.25 and min.pct > 0.25. Figure 2D illustrates the expression of the top five marker genes for each cell type. Based on copy number alterations in LUAD samples identified using the CopyKAT algorithm, malignant cells were distinguished from non-malignant cells. Despite the presence of heterogeneity, almost all malignant cells showed chromosome 13 deletions and chromosome 1, 8, and 21 amplifications (Fig. S3). The predicted aneuploid cells were inferred to be malignant cells, whereas diploid cells were inferred to be normal cells. In total, we inferred 11,227 malignant cells and 24,701 normal cells (Fig. 2G). Finally, we calculated the percentages of the nine cell types and the numbers of cells in the nine samples (Fig. 2E, F).

Fig. 2

Definition of cell clusters. (A) The t-distributed stochastic neighbor embedding (t-SNE) plot of nine samples in the GSE171145 dataset, colored to indicate sample names. (B) The t-SNE plot of the distribution of 27 clusters, colored to indicate cell clusters. (C) The t-SNE plot of eight cell types after cell annotation, colored to indicate cell types. (D) Dot plots of the top five marker genes contributing to the clusters, x-axis: cell types, y-axis: marker genes, dot colors: average expression (blue represents low expression and red represents high expression), and dot size: percent expressed cells in the cluster. (E and F) Numbers and proportions of cell types in each sample after annotation, x-axis: cell numbers and proportions and y-axis: cell types. (G) The t-SNE plot of aneuploid and diploid cells, colored to indicate aneuploid and diploid cells

Analysis of malignant cell-associated TFs

We used the SCENIC platform to investigate TF regulatory networks in malignant cells. The “runSCENIC_3_scoreCells” function was used to compute the area under the curve (AUC) of each regulon in each cell, and the AUC threshold for each regulon was determined. The cells were then subjected to dimensionality reduction and clustering using the regulon AUC matrix, which is presented as a heatmap (Fig. 3A). The steady state of the cells was visualized using the “bkde2D” function (Fig. 3B). A heatmap of the top-ranked active TFs for the nine cell types showed distinct transcriptional regulation patterns (Fig. 3C). The RSS was computed for each cell cluster, and the top five TFs and all identified TFs are shown in Fig. 3D and Fig. S4, respectively. We used t-SNE to show the expression of the top five regulon TFs, their regulatory activity, the regulon AUC, and the regulon AUC distribution in all cells (Fig. S5). Subsequently, we plotted ridge and violin plots to visualize the TFs in the nine cell types (Fig. S6). These findings identify potential targets for inhibiting cells possessing malignant characteristics.

Fig. 3

Transcription factor regulatory networks in malignant tumor subpopulations and trajectory analysis of malignant cells in lung adenocarcinoma. (A) Heatmap with regulon area under the curve (AUC) matrix of scaled AUC values (columns) detected in different cell types (rows). Blue represents low expression, yellow represents moderate expression, and red represents high expression. (B) Density map of steady-state cells. The darker color represents more steadiness. (C) Heatmap of transcriptional regulatory activity (columns) of nine cell types (rows). Blue represents low expression and red represents high expression. (D) Point plots of the top five regulon specificity scores. X-axis: rank and y-axis: regulon specificity scores. (E and F) Monocle 2 trajectory plots showing state dynamics and pseudotime curves. Each dot represents a singlet and the color gradient represents the pseudotemporal order. States 1–3 are labeled in the same topology. (G) Heatmap hierarchical clustering of differentially expressed transcription factor genes (columns) along the pseudotime curve (rows). Blue represents low expression, gradient represents moderate expression, and red represents high expression

Trajectory analysis performed using the Monocle 2 algorithm revealed dynamic changes in three states and the pseudotime profiles of these malignant cells (Fig. 3E, F). Given the role of tissue-specific TFs in regulating cellular differentiation [32], we examined variations in 67 specific TFs over time in the malignant cells (Fig. 3G).

Cell–cell interaction

The fundamental processes of cellular biological activity depend on cell–cell interactions. To further elaborate on the role of malignant cell types in LUAD genesis, we analyzed cellular communication between these cell types using the “CellChat” R package. The findings are summarized in Supplementary Table 2. Notably, among the nine cell types, malignant and monocytic cells were strongly linked in terms of the number and strength of ligand–receptor interactions (Fig. 4A, B, Fig. S7). The malignant cell types also played an essential role as ligands in multiple TME-related pathways (Fig. 4C). These results provide preliminary insight into the potential interactions between these cell types, which may help us further explore the role of malignant cells in LUAD development.

Fig. 4

Cell–cell communication analysis and identification of molecular subtypes. (A and B) Circle plots showing the number and strength of cell type interactions. Lines connect cell types through the ligand–receptor pairs they express; thicker lines indicate a greater number/strength of ligand–receptor interactions. Dot size represents the number of cells in the subpopulation. (C) Enrichment of tumor microenvironment-related pathway inputs and outputs among cell types. (D) Hazard ratio distribution plot for univariate Cox analysis of malignant cell ligand–receptor-related gene sets. X-axis: Cox coefficient and y-axis: −log10(p-value), colored to indicate cell states. (E) Cumulative distribution function (CDF). X-axis: consensus index and y-axis: CDF, colored to indicate clustering number. (F) Delta area curve for The Cancer Genome Atlas cohort samples. X-axis: k and y-axis: relative change in area under CDF curve. (G) Heatmap of sample clustering when k = 2. (H) Kaplan–Meier survival analysis comparing the prognosis of two subtypes. X-axis: years and y-axis: survival probability

Identification of malignant cell-associated ligand–receptor subtypes of LUAD

To investigate the clinical significance of tumor heterogeneity and clarify the role of malignant cell-associated ligand–receptor genes in bulk RNA sequencing data, 108 ligand–receptor genes from malignant cell types were extracted using a cell–cell communication analysis approach. Univariate Cox regression analysis identified 47 genes that correlated with the prognosis of LUAD (p < 0.05, Fig. 4D). We then used the “ConsensusClusterPlus” R package, with the K-means algorithm and “spearman” distance, to perform clustering based on these genes. The results indicated that k = 2 was optimal, classifying the cohort into cluster1 (N = 446) and cluster2 (N = 274) (Fig. 4E-G). Kaplan–Meier survival analysis showed that cluster1 had a more favorable prognosis than cluster2 (p < 0.05, Fig. 4H). Supplementary Table 3 provides data on the TCGA dataset subtypes. To explore the reasons behind these differences, we plotted proportional bar charts and a Sankey diagram to analyze clinicopathological distinctions between the two clusters. The results indicated that the distributions of TNM stage and age differed, with late-stage clinicopathological features tending to be more frequent in cluster2 (Fig. S8).

GSVA of molecular subtypes

To detect differences in biological behavior between the two clusters, we conducted a GSVA enrichment analysis. We assessed the significance of pathway scores between the two clusters using the Kruskal–Wallis test and screened critical pathways (p < 0.001, Fig. S9). Cluster2 showed significant enrichment in pathways related to the cell cycle, base excision repair, nucleotide excision repair, DNA replication, and mismatch repair compared to cluster1.

Genomic variance analysis

Single nucleotide variants from the TCGA dataset were analyzed using the mutect2 tool. The somatic mutation landscapes depicted in Fig. S10A illustrate distinct genomic profiles for the two clusters. Comparisons showed that homologous recombination defects, altered fractions, segment numbers, and TMB were higher in cluster2 than in cluster1 (Wilcoxon rank-sum test, p < 0.001, Fig. S10B).

Assessment of TME and differences in immunotherapy

To further investigate the functional role of malignant cell-associated ligand–receptor genes in the TME, we used the “ESTIMATE” R package to evaluate stromal, immune, and “ESTIMATE” scores. We observed that cluster1 was closely associated with higher immune (p < 0.001), ESTIMATE (p < 0.001), and stromal scores (p < 0.001, Fig. 5D). We then utilized the ssGSEA algorithm to quantify the levels of immune cell infiltration within the TME in the two clusters. The findings indicated that cluster1 exhibited greater infiltration of effector memory CD8+ T cells, activated CD8+ T cells, effector memory CD4+ T cells, activated CD4+ T cells, monocytes, and activated B cells (p < 0.001, Fig. 5A). We identified 47 immune checkpoints that exhibited significant differential expression between the two clusters; cluster1 showed higher expression of 45 of these checkpoints, the exceptions being CD276 and TNFSF9 (Fig. 5B). Moreover, we assessed the ICB response using the tumor immune dysfunction and exclusion (TIDE, http://tide.dfci.harvard.edu/) algorithm. The findings revealed higher TIDE scores, exclusion scores, and expression of TAM.M2 and MDSC in cluster2, while cluster1 exhibited higher interferon-gamma and dysfunction scores (p < 0.01, Fig. 5C). Taken together, the malignant cell-associated ligand–receptor subtypes could effectively differentiate tumor characteristics and the TME and were essential for stratifying patients with LUAD.

Fig. 5

Immune infiltration analysis in molecular subtypes and differential expression of malignant cell-associated ligand–receptor genes (A) Relative abundance of immune cells infiltrating the tumor microenvironment between molecular subtypes, x-axis: infiltrating immune cells and y-axis: score, colored to indicate different cell clusters, red, cluster1; green, cluster2. (B) Differences in stromal, immune, and “ESTIMATE” scores in molecular subtypes. X-axis: immune scores and y-axis: score, colored to indicate different cell clusters, red, cluster1; green, cluster2. (C) Expression levels of 47 immune checkpoints between molecular subtypes. X-axis: genes and y-axis: expression, colored to indicate different cell clusters, red, cluster1; green, cluster2. (D) Differences of TIDE, IFNG, MDSC, Exclusion, Dysfunction, and TAM.M2 in molecular subtypes. X-axis: cluster and y-axis: immune suppressive score, colored to indicate different cell clusters, red, cluster1; green, cluster2. (E) The volcano plot of differentially expressed genes was identified between cluster1 and cluster2 (false discovery rate [FDR] < 0.05). X-axis: log2(FoldChange), y-axis: −log10(FDR), color of bubbles: red, considerably upregulated, and blue, considerably downregulated. (F) Bar chart of top five terms showing pathway enriched in biological process, cellular component, and molecular function. X-axis: gene counts in the enriched pathway and y-axis: pathway, colored to indicate enriched − log10(p-value). (G) Top 15 terms of the Kyoto Encyclopedia of genes and genomes (KEGG) pathways enrichment visualized via a bubble chart. X-axis: gene ratio in the enriched pathway and y-axis: pathway, colored to indicate enriched − log10(p-value), and the bubble size indicates the count of enriched genes

Screening and functional enrichment analysis of malignant cell-associated ligand–receptor genes

Using the “limma” R package, we identified 1107 malignant cell-associated ligand–receptor subtype-derived genes (false discovery rate [FDR] < 0.05 and |log2(Fold Change)| > 1, Fig. 5E). We then performed Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. The findings showed that genes differentially expressed between cluster1 and cluster2 were enriched in processes regulating cell–cell adhesion and T-cell activation (Fig. 5F, G).

Construction of a prognosis signature based on integrative machine learning

To develop a robust signature, we initially conducted univariate Cox regression analysis to screen 37 genes identified as significant prognostic markers in TCGA-LUAD. Subsequently, these genes were integrated into an ensemble framework for comprehensive machine learning-based survival analysis. Employing a diverse set of 95 machine learning algorithm combinations, we constructed predictive models within the TCGA dataset. A tenfold cross-validation approach was employed to determine the concordance index (C index) for all training and validation groups (Fig. 6A). Among these models, the top five, ranked by their mean C index, were developed using the Random Survival Forest (RSF) algorithm. These models demonstrated impressive outcomes in the training cohort but exhibited subpar performance in the validation cohort, with C indices below 0.6. This discrepancy highlighted a considerable tendency for overfitting to the training data. Consequently, these models were excluded from our final selection. Following a comprehensive evaluation process, the Lasso algorithm was selected as yielding a highly accurate and clinically relevant predictive model. After performing Lasso Cox regression analyses, a six-gene signature was constructed, including FEN1, NMI, ZNF506, ALDOA, MLLT6, and MYO1E. The signature includes two low-risk genes (hazard ratio [HR] < 1), ZNF506 and MLLT6, of which ZNF506 is up-regulated in normal tissues. Conversely, the four high-risk genes (HR > 1) are MYO1E, FEN1, ALDOA, and NMI, all up-regulated in tumor tissues within the TCGA-LUAD cohort (Fig. 6B, C and Fig. S11). The risk score was computed using the following formula: \[ \mathrm{RiskScore} = 0.253 \times \mathrm{Exp}_{\mathrm{FEN1}} + 0.119 \times \mathrm{Exp}_{\mathrm{NMI}} - 0.466 \times \mathrm{Exp}_{\mathrm{ZNF506}} + 0.158 \times \mathrm{Exp}_{\mathrm{ALDOA}} - 0.244 \times \mathrm{Exp}_{\mathrm{MLLT6}} + 0.302 \times \mathrm{Exp}_{\mathrm{MYO1E}} \] where \(\mathrm{Exp}_{g}\) denotes the expression value of gene \(g\).
The risk scores were standardized by Z-score normalization, and the samples were divided into high- and low-risk groups. Kaplan–Meier survival analysis showed that the low-risk group had a more favorable prognosis than the high-risk group (Fig. 6D). Based on the ROC analysis, the AUC values of the risk score for predicting overall survival (OS) at 1, 3, and 5 years were 0.7, 0.7, and 0.64, respectively (Fig. 6E). To assess the accuracy and robustness of the signature, the GSE31210 dataset was used as an external validation set. Kaplan–Meier survival analysis results were consistent with those observed in the TCGA dataset (p < 0.05, Fig. 6F). Notably, no events (deaths) were recorded in the first year of the GSE31210 dataset, resulting in an AUC value of 0 for the 1-year prediction. We therefore evaluated the model at 2, 3, and 5 years, where sufficient event data are available to support meaningful analysis. Supplementary Table 4 lists the survival status and time for each sample in the validation set. ROC analysis showed that the risk score had robust prognostic value, with AUCs of 0.93, 0.72, and 0.85 for 2-, 3-, and 5-year OS, respectively (Fig. 6G). We also conducted Kaplan–Meier survival and ROC analyses to validate the reliability of the prognostic gene signature in eight independent external validation datasets, namely GSE36471, GSE37745, GSE42127, GSE87340, GSE50081, GSE68465, GSE68571, and GSE72094. The Kaplan–Meier analysis indicated that the low-risk group had a better prognosis than the high-risk group. Additionally, the AUC of the risk score showed excellent predictive performance of the signature across all cohorts (Fig. S12).
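The risk score formula and the Z-score stratification described above can be illustrated with a minimal Python sketch. The coefficients are those reported in the formula; everything else is an assumption made for illustration: the expression values are invented toy numbers, the population standard deviation is used for standardization, and the cutoff of 0 on the standardized score is hypothetical, since the text does not state the threshold that separated the high- and low-risk groups.

```python
from statistics import mean, pstdev

# Coefficients as reported in the risk score formula.
COEFFICIENTS = {
    "FEN1": 0.253,
    "NMI": 0.119,
    "ZNF506": -0.466,
    "ALDOA": 0.158,
    "MLLT6": -0.244,
    "MYO1E": 0.302,
}


def risk_score(expression):
    """Weighted sum of the six gene expression values for one sample."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def stratify(samples, cutoff=0.0):
    """Label each sample 'high' or 'low' risk.

    Assumption: samples with a standardized score above `cutoff` are
    high-risk; the cutoff value is hypothetical, not taken from the study.
    """
    raw = [risk_score(s) for s in samples]
    return ["high" if z > cutoff else "low" for z in z_scores(raw)]


# Toy expression values, invented purely for illustration.
toy_samples = [
    {"FEN1": 5.0, "NMI": 4.0, "ZNF506": 1.0, "ALDOA": 8.0, "MLLT6": 2.0, "MYO1E": 6.0},
    {"FEN1": 2.0, "NMI": 3.0, "ZNF506": 4.0, "ALDOA": 5.0, "MLLT6": 5.0, "MYO1E": 2.0},
    {"FEN1": 4.0, "NMI": 3.0, "ZNF506": 2.0, "ALDOA": 7.0, "MLLT6": 3.0, "MYO1E": 5.0},
]
groups = stratify(toy_samples)
```

In this toy cohort, the second sample has high ZNF506 and MLLT6 expression (the two genes with negative coefficients) and low expression of the four risk genes, so it falls in the low-risk group.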
The subgroup analysis indicated that, within subgroups defined by age > 65, age ≤ 65, female, male, M0, N1-N3, stage I-II, T1-2, and T3-4, the high-risk group had inferior OS compared with the low-risk group (p < 0.05; Fig. S13). Furthermore, we compared the clinicopathological characteristics of the high-risk and low-risk groups and found significant differences in clusters, T-stage, N-stage, and stage (Fig. 6H). The rise of next-generation sequencing technologies has led to a surge in reported gene expression-based prognostic signatures. To evaluate how our model compares with existing signatures, we conducted an exhaustive review of the literature on prognostic models, identifying 44 relevant publications (Supplementary Table 5) [33,34,35]. These models relate to various biological processes, including response to immunotherapy, oxidative stress, and pyroptosis. Our gene signature outperformed all other models in terms of the C-index within the TCGA-LUAD cohort (Fig. S14). These findings further showed that the gene signature exhibited robust predictive performance.
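Because the C-index is the criterion used above both to rank the candidate models and to compare the signature with published ones, a small sketch of how it can be computed may be helpful. The function below is a simplified form of Harrell's C-index written for illustration; the text does not state which estimator or software was used, and this version ignores tied survival times. The toy data are invented.

```python
def concordance_index(times, events, scores):
    """Simplified Harrell's C-index.

    A pair (i, j) is comparable when sample i had an observed event
    (events[i] == 1) strictly before time j. The pair is concordant when
    the sample with the shorter survival has the higher risk score.
    Tied risk scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs (no observed events)")
    return concordant / comparable


# Toy data: survival time in years, event flag (1 = death, 0 = censored),
# and risk score. All values are invented for illustration.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_scores = [0.9, 0.3, 0.5, 0.1]
c_index = concordance_index(toy_times, toy_events, toy_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why models with validation C indices below 0.6 were treated as poorly generalizing.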

Fig. 6

Identification of prognostic gene signature. (A) 95 predictive models built using diverse machine learning techniques and evaluated with tenfold cross-validation. The C-index for each model was computed in both the TCGA-LUAD and GSE72094 cohorts. (B) Lambda trajectory of differentially expressed genes. X-axis: −ln(lambda) and y-axis: coefficients, colored to indicate genes. (C) Confidence interval under lambda. X-axis: ln(lambda) and y-axis: partial likelihood deviance, colored to indicate genes. (D) Kaplan–Meier survival analysis in The Cancer Genome Atlas (TCGA) dataset. X-axis: years and y-axis: survival probability. (E) Receiver operating characteristic (ROC) curve analysis-based evaluation of the prediction performance of the gene signature in TCGA. X-axis: false positive fraction and y-axis: true positive fraction, colored to show time point. (F) Kaplan–Meier survival analysis in GSE31210. X-axis: years and y-axis: survival probability. (G) ROC curve analysis-based evaluation of the prediction performance of the gene signature in GSE31210. X-axis: false positive fraction and y-axis: true positive fraction, colored to show time point. (H) Pie plot of the difference in clinical characteristics between high- and low-risk groups (Wilcoxon test, *p < 0.05, **p < 0.01, ***p < 0.001, and ****p < 0.0001)

The relationship between risk score and TME

Using the CIBERSORT algorithm to calculate the proportions of 22 immune cell types, we found significant differences in the infiltration scores of 17 immune cell types between the high- and low-risk groups (Fig. 7A). Immune checkpoint expression also varied between the groups; in particular, CTLA4 expression was lower in the high-risk group than in the low-risk group (Wilcoxon test; Fig. 7B). Additionally, the SubMap analysis indicated a pronounced propensity for the high-risk group to respond positively to ICB therapy (Fig. S15A). The link between the risk score and ICB response was verified in two independent immunotherapy cohorts. Patients with complete and partial ICB responses exhibited a higher risk score than those with stable and progressive disease (p < 0.05, Fig. S15B, C). These findings indicate that the gene signature plays an essential role in regulating the tumor immune microenvironment and has the potential to act as a valuable predictor of the effectiveness of immunotherapy.

Fig. 7

Immune infiltration analysis and drug sensitivity analysis in high- and low-risk groups. (A) Comparison of 28 immune cell scores in high- and low-risk groups. (B) Comparison of immune checkpoint expression in high- and low-risk groups. (C) Association between IC50 values and the risk scores in patients with lung adenocarcinoma. (D-G) Analysis of correlation and differences in sensitivity to drugs among potential medications derived from the CTRP and PRISM datasets

Drug sensitivity analysis

In our drug sensitivity analysis, we sought potential therapeutic targets and agents that correlate strongly with the risk score, aiming to enhance treatment strategies for LUAD patients. To achieve this, we estimated IC50 values for 198 compounds from the GDSC database for each sample in the TCGA dataset. A Spearman correlation analysis was then conducted to assess the relationship between these IC50 values and the LUAD patients' risk scores. Notably, two compounds, AZD3759 and Gefitinib, both EGFR inhibitors, displayed the most pronounced negative correlation with the risk scores (Fig. 7C). Furthermore, we examined the signaling pathways and therapeutic properties of the candidate compounds, with findings presented in Fig. S16. We also assessed AUC values for compounds within the CTRP and PRISM databases for each TCGA sample, followed by a Spearman correlation analysis between these AUC values and the risk scores. The top five compounds exhibiting the strongest negative correlations from each database are illustrated in dot-line plots: SB-743921, paclitaxel, GSK461364, KX2-391, and leptomycin B from the CTRP database, and ispinesib, cabazitaxel, D-64131, ganetespib, and docetaxel from the PRISM database (Fig. 7D, F). Their estimated AUC values are compared across risk score groups in Fig. 7E, G. In conclusion, the identified compounds consistently showed a significant negative correlation with the risk score and had lower estimated AUC values in the high-risk group, suggesting their potential therapeutic efficacy in LUAD treatment.
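The Spearman correlation step used above can be sketched in a few lines of Python. This is an illustrative implementation (Pearson correlation applied to average ranks), not the authors' code; the text does not say which software was used, and the risk scores and IC50 values below are invented toy numbers. A negative coefficient means that samples with higher risk scores tend to have lower estimated IC50 values, which is the pattern reported for AZD3759 and Gefitinib.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data, invented for illustration: one risk score and one estimated
# IC50 value per sample for a single hypothetical compound.
toy_risk = [0.2, 1.5, -0.7, 0.9, -1.1]
toy_ic50 = [3.1, 1.0, 4.8, 2.2, 5.5]
rho = spearman(toy_risk, toy_ic50)
```

In practice the same calculation would be repeated for every compound, and the compounds would then be ranked by their coefficients.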

Clinical application of the prognostic risk model

Univariate and multivariate Cox regression analyses were conducted to assess the independent prognostic value of the risk-scoring model for LUAD. In the univariate Cox regression analysis, the HR of the risk score was 2.718 with a 95% confidence interval (CI) of 2.067–3.575 (p < 0.001). In the multivariate Cox regression analysis, the HR for the risk score was 2.217 with a 95% CI of 1.652–2.976 (Fig. 8A, B). These findings indicate that the risk score is a crucial predictive factor independent of multiple clinical parameters. A nomogram comprising T stage, N stage, stage, and risk score was developed in the TCGA cohort to quantitatively assess the risk and predict the patient survival probability (Fig. 8F). The calibration curves indicated that the nomogram was reliable and accurate because the predicted probabilities for 1-, 3-, and 5-year OS aligned closely with the actual observations (Fig. 8D). The AUC analysis showed that both the risk score and nomogram exhibited outstanding predictive accuracy (Fig. 8C). Additionally, decision curve analysis (DCA) was used to evaluate the predictive value of the nomogram in clinical decision-making (Fig. 8E). These findings indicate that the gene signature and nomogram are highly reliable for LUAD management.
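The reported hazard ratios and confidence intervals can be related to each other with a short worked calculation. The sketch below assumes that the intervals are Wald intervals, that is, symmetric around the log hazard ratio with half-width 1.96 standard errors; this is a common convention for Cox models, but the text does not state how the intervals were computed. Under that assumption, the HR should be close to the geometric mean of its interval bounds, and the Wald statistic follows from the log HR divided by its standard error.

```python
import math


def log_hr_and_se(hr, ci_low, ci_high, z=1.96):
    """Recover the Cox coefficient (log HR) and its standard error.

    Assumes a Wald interval: exp(beta - z * se) to exp(beta + z * se).
    """
    beta = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


# Values as reported in the text.
uni_beta, uni_se = log_hr_and_se(2.718, 2.067, 3.575)
multi_beta, multi_se = log_hr_and_se(2.217, 1.652, 2.976)

# Wald statistic for the univariate model under the stated assumption.
uni_wald_z = uni_beta / uni_se

# Consistency check: HR versus geometric mean of the interval bounds.
uni_geo_mean = math.sqrt(2.067 * 3.575)
multi_geo_mean = math.sqrt(1.652 * 2.976)
```

For both models the geometric mean of the bounds agrees with the reported HR to within rounding, and the univariate Wald statistic is large enough to be consistent with the reported p < 0.001.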

Fig. 8

Construction of nomogram. (A) Univariate Cox regression analysis of LUAD patients. (B) Multivariate Cox regression analysis of LUAD patients. (C) AUC analysis of risk score, nomogram, stage, T stage, and N stage. (D) Calibration curve of the nomogram. (E) Decision curves of “risk score”, “nomogram”, “T stage”, “N stage”, “stage”, “all”, and “None” models. (F) Nomogram for predicting the 1-, 3-, and 5-year survival rates based on the risk score

Knockdown of MYO1E inhibited LUAD cell proliferation and migration

To assess the biological role of a previously unreported model gene (MYO1E) in LUAD, we employed two siRNAs to knock down its expression in A549 and H1299 cells. RT-qPCR analysis confirmed the effective suppression of MYO1E by these siRNAs (Fig. 9A). Knockdown of MYO1E expression decreased both cell proliferation and colony formation capability in A549 and H1299 cell lines (Fig. 9B). Furthermore, EdU staining revealed a significant reduction in LUAD cell proliferation following MYO1E knockdown (Fig. 9C). Additionally, transwell assays demonstrated that the silencing of MYO1E impaired the migratory capabilities of A549 and H1299 cells (Fig. 9D). The wound-healing assays further supported these findings, showing a slowed wound closure rate in cells deficient in MYO1E (Fig. 9E). Collectively, these observations suggest that MYO1E plays a crucial role in promoting cell proliferation and migration in LUAD, positioning it as a promising therapeutic target for LUAD treatment.

Fig. 9

MYO1E promotes proliferation and migration of LUAD cells. (A) RT-qPCR analysis confirmed MYO1E knockdown in A549 and H1299 cells using two siRNAs. (B) Colony formation of A549 cells and H1299 cells transfected with control or si-MYO1E was measured by ImageJ. (C) EdU assay assessed the cell proliferation of control cells compared to MYO1E knockdown cells. (D) Transwell assay demonstrated the cell migration of control cells compared to MYO1E knockdown cells. (E) Wound healing assay showed the cell migration of control cells compared to MYO1E knockdown cells

Discussion

Lung cancer has the highest mortality rate among all cancer types, with a 5-year survival rate of approximately 22% [36]. LUAD represents a large proportion of lung cancer cases and many patients already present with metastases at diagnosis, resulting in a poor prognosis. ICB therapy is effective for patients with recurrent lung cancer [37,38,39]. However, intra-tumoral heterogeneity increases the probability of malignant cells surviving standard chemotherapy and radiotherapy, thus significantly affecting the efficacy of various immunotherapies, especially ICB, leading to poor therapy outcomes for most patients. Additionally, the role of TME in cancer progression and metastasis has been demonstrated in various cancers [7]. Investigating the cellular and molecular mechanisms involved in the TME has the potential to establish a foundation for drug discovery, especially regarding targeted immunotherapy. Therefore, understanding LUAD tumor heterogeneity can enable more reliable and accurate presurgical molecular testing, facilitate stratification, and enable personalized precision therapy for recurrence risk. Owing to the advancements in high-throughput sequencing technology, the combination of multiomics data analysis has become an effective method for thoroughly elucidating disease heterogeneity, predicting disease prognosis, and identifying new therapeutic targets.

This study incorporated scRNA-seq data from nine LUAD samples and bulk RNA sequencing data from 618 patients. We found considerable heterogeneity among patient-derived tumor cells, demonstrating that tumor cell cluster-related differences were primarily attributed to tumor heterogeneity. After quality control and downscaling clustering, 27 clusters and nine cell types were annotated, and an integrated Bayesian segmentation method (CopyKAT) was used to identify malignant cells by inferring large-scale copy number alterations from single-cell expression profiles. The TF regulatory network was analyzed in a subset of malignancies using SCENIC to obtain the top five specific TFs. Furthermore, the high heterogeneity of malignant cells was explored by determining three different differentiation fates of malignant cells based on developmental trajectory analysis and using a heatmap to visualize the changes in specific TFs over time. We employed cell communication analysis to assemble multiple ligand-receptor pairs and characterize the regulatory network in the LUAD TME. To further explore the clinical significance of tumor heterogeneity and clarify the role of malignant cell-associated ligand-receptor genes in bulk RNA-sequencing data, we categorized the patients into two clusters based on the unsupervised clustering of those genes. Kaplan–Meier survival analysis revealed that patients in cluster1 had a better prognosis than those in cluster2. We explored the underlying mechanisms behind these results from multiple dimensions, including functional enrichment analysis, TME cell infiltration, somatic mutation landscapes, and immunotherapy. The T, N, and overall stages differed between the clusters, and the incidence of advanced clinicopathological features tended to increase in cluster2.
We analyzed biological cancer features between the two clusters and found that cluster2 was considerably enriched in the cell cycle, base excision repair, DNA replication, nucleotide excision repair, and mismatch repair signaling pathway compared with cluster1. These biofunctions and signaling pathways are important in promoting tumor development. Dysregulation of the cell cycle is fundamental to the proliferation of tumor cells and derangement of cell cycle checkpoints facilitates genetic instability [40]. Sen et al. reported that cyclin-dependent kinase (CDK)1 inhibition could induce PD-L1 and promote the immune response against tumors via stimulator of interferon genes-mediated T-cell activation in small-cell lung cancer [41]. Rooney and Jerby-Arnon et al. found that the inhibition of CDK4 and CDK6 has the potential to augment T-cell activity, reverse T-cell exclusion patterns, and result in a better response to ICB therapy [42, 43]. These studies indicated that cell cycle inhibitors intensify ICB responses and cell cycle-related pathways mainly contribute to the worse prognosis of cluster2, suggesting that patients in cluster2 may benefit from cell cycle inhibitors. Additionally, mutations in DNA mismatch repair are related to genomic instability, susceptibility to certain cancers, and resistance to specific chemotherapeutic drugs [44]. Further analyses of mutations in molecular subtypes to understand intra-tumoral heterogeneity showed higher homologous recombination defects, fraction alterations, number of segments, and TMB in cluster2. Somatic mutations drive cancer and guide diagnosis and therapies. Homologous recombination mutations increase genomic instability and lead to more error-prone DNA damage responses [45]. Furthermore, DNA damage response plays multiple roles in promoting the growth of cancer cells by accumulating driver mutations, generating tumor heterogeneity, and evading apoptosis [46, 47]. 
TMB serves as a potentially valuable biomarker in predicting response to ICB therapy. Multiple studies have indicated a correlation between TMB and the response rate to ICB therapy in various tumor types [48, 49]. These results show that malignant cell-associated ligand–receptor genes have a complex interaction with somatic mutation.

We combined 10 machine learning algorithms into more than 90 combinations. The optimal algorithm was selected on the basis of the average C-index derived from two LUAD cohorts. This process facilitated the development of a robust and effective prognostic signature, crucial for evaluating the prognosis of tumor patients. After thorough evaluation, the Lasso algorithm emerged as the superior method for creating a novel prognostic model centered on genes linked to ligand-receptor interactions in malignant cells, including MYO1E, FEN1, NMI, ZNF506, ALDOA, and MLLT6. FEN1 is a vital endonuclease gene whose protein plays multiple roles in DNA replication and damage repair. FEN1 overexpression has been observed in multiple cancer types, such as testicular, brain, lung, and breast cancers [50]. He et al. discovered that FEN1 promotes tumor progression and contributes to cisplatin resistance development in NSCLC [51]. NMI encodes a protein (N-myc interactor) that interacts with two members of the Myc family of oncogenes. Meng et al. indicated that high NMI expression was linked to unfavorable prognosis and increased tumor growth in glioblastoma [52]. NMI inhibits Wnt/β-catenin signaling by increasing the Dkk1 level, which blocks breast tumor growth. Low NMI expression leads to epithelial–mesenchymal transition in breast cancer [53, 54]. Wang et al. demonstrated the tumor-suppressive potential of NMI in lung cancer by inhibiting multiple signaling pathways, such as phosphoinositide-3-kinase/protein kinase B, MMP2/MMP9, COX-2/PGE2, and p300-mediated nuclear factor-κB acetylation, and indicated that NMI is a promising therapeutic target for lung cancer [55]. ZNF506 encodes an important component of the signaling pathway that involves γH2AX, which detects and repairs damaged DNA. Nowsheen et al. found that the ZNF506 protein could help recruit the EYA protein, forming a feedback loop with H2AX and MDC1 that amplifies the DNA damage response.
Mutations in ZNF506 are associated with cancer and can be involved in its pathogenesis [56]. ALDOA encodes a class I fructose-bisphosphate aldolase protein family member. Chang et al. revealed the molecular process by which ALDOA increases the spread of lung cancer by prolyl hydroxylase domain-dependent stabilization of the hypoxia-inducible factor-1α and consequent MMP9 activation [57]. Myeloid/lymphoid or mixed-lineage leukemia translocated to 6 (MLLT6) is crucial for cancer cells to efficiently express and present PD-L1 protein on their cell surface. Sreevalsan et al. reported that the depletion of the MLLT6 protein leads to decreased inhibition of CD8+ cytotoxic T cell-mediated cytolysis. Moreover, cancer cells that do not express MLLT6 exhibit impaired signal transducer and activator of transcription 1 signaling, resulting in reduced responsiveness to interferon-γ-induced stimulation of indoleamine 2,3-dioxygenase 1, guanylate binding protein 5, CD74, and major histocompatibility complex class II genes [58]. Myosin 1e (MYO1E), a widely expressed myosin identified through proteomic studies as a key element in cell-substrate adhesions, plays a critical role in cancer progression [59, 60]. Its expression levels have been linked to a poorer prognosis in individuals with invasive breast cancer, where it contributes to increased malignancy by promoting tumor cell proliferation and driving de-differentiation of tumor cells [61, 62]. Moreover, high levels of MYO1E expression are similarly indicative of a poor prognosis in patients suffering from LUAD and pancreatic adenocarcinoma, underscoring its significance across different cancer types [63, 64]. These studies suggested that the genes identified in the signature could serve as potential targets for in vitro experimental designs for elucidating LUAD-related molecular mechanisms.
This study showed that the six-gene risk model was effective in predicting prognosis in the TCGA (N = 500) and nine independent GEO (N = 1649) datasets. This conclusion was drawn based on time-dependent ROC curves and survival analysis. The AUC values followed different trends across these datasets, which may be attributed to the statistical power and sample size of each dataset. Future research, drawing on ongoing advances in bioinformatics, should expand cohort sample sizes and use self-test data cohorts to further validate the robustness of the model. The signature also demonstrated good predictive performance across different clinical subgroups. In addition, multiple experiments showed that knockdown of MYO1E inhibited LUAD cell proliferation and migration. This suggests that MYO1E may serve as an important potential target for the treatment of LUAD.

Two immunotherapy cohorts, comprising patients with skin cutaneous melanoma (N = 24) and bladder urothelial carcinoma (N = 293), were used to assess how response to immunotherapy varied with the risk score. Our subclass mapping analysis further demonstrated an improved response to immunotherapy in patients identified as high-risk, consistent with our earlier results. This suggests that the risk score could be a valuable tool for early detection of individuals who are more likely to benefit from immunotherapy. Furthermore, our analysis involved identifying potential therapeutic targets and compounds for LUAD patients categorized as high-risk by our prognostic model. From this investigation, AZD3759 and Gefitinib emerged as the most promising compounds. Interestingly, both compounds are classified as EGFR inhibitors and were selected from the GDSC drug response database. We developed a nomogram to increase the precision of clinical decision-making for predicting the OS of patients with LUAD. ROC curve analysis, calibration, and DCA were used to assess the effectiveness of the nomogram. These data suggest that the new prognostic model has the potential for clinical use and could provide a promising therapeutic target for patients with LUAD.

This study had some limitations. First, the essential genes identified have not been thoroughly validated through in vivo and in vitro experiments, and the specific mechanisms remain unclear. Second, our prognostic model requires validation in real clinical samples, which was not performed and is the focus of our future study.

Conclusion

In this study, we employed consensus clustering based on the expression of malignant cell-associated ligand–receptor genes and classified the cohort into two clusters using the scRNA-seq and bulk RNA sequencing data. The two clusters differed significantly in their immune and molecular features and in the TME, making these genes crucial for the stratification of patients with LUAD. This classification may enhance the understanding of the correlation between tumor cell subtypes and their response to immunotherapy. We also developed a risk-scoring model that effectively predicts the prognosis and response to immunotherapy for patients with LUAD across various testing datasets. The findings of this study provide a theoretical foundation for developing personalized treatments for these patients. Furthermore, we investigated the role of a previously unreported gene, MYO1E, which may serve as a new therapeutic target for treating LUAD.