Introduction

Nonalcoholic fatty liver disease (NAFLD) currently affects about 25% of the global population and is likely to become the leading cause of end-stage liver disease in the coming decades, with nearly a quarter of NAFLD patients progressing to nonalcoholic steatohepatitis (NASH) [1]. As a progressive inflammatory phenotype of NAFLD, NASH is closely associated with hypertension, obesity, dyslipidemia, type 2 diabetes and metabolic syndrome, and its progression involves multiple pathogenic pathways [2,3,4].

Fibrosis is an important concern in the NASH field and a major determinant of patient clinical outcomes [5,6,7,8]. Progressive fibrosis reflects repeated, ineffective attempts at liver regeneration, and this defective regeneration increases the risk of cirrhosis and primary liver cancers [9]. Apart from lifestyle interventions, a growing number of treatment options are expected to be used to limit fibrosis and NASH. As a peroxisome proliferator-activated receptor agonist, lanifibranor improves NASH by regulating metabolism, inflammation, and fibrogenesis [10]. In addition, glucagon-like peptide-1 receptor agonists (GLP-1 RAs) are considered a promising treatment option for NASH and deserve further investigation [11, 12]. However, new biomarkers, therapeutic targets, and concepts are still needed for fibrosis progression in NASH and for NASH-associated hepatocellular carcinoma (HCC). Weighted gene co-expression network analysis (WGCNA) is a common bioinformatics method that can effectively explore the relationship between genes and clinical characteristics. Its main advantage is that it links sample characteristics with changes in gene expression by clustering genes into co-expression modules, which makes it possible to identify modules related to a phenotype and, finally, genes in disease pathways for further verification [13].

In this study, we aimed to identify susceptibility modules and genes associated with fibrosis. We used two RNA-seq datasets from the Gene Expression Omnibus (GEO) database to select gene expression data associated with NASH. The GEO database is an international public repository that provides high-throughput microarray and next-generation sequencing functional genomic datasets, which can be downloaded free of charge in a variety of formats for further integrated analysis [14]. WGCNA was used to construct gene co-expression networks and select important modules, and hub genes were screened through a protein–protein interaction (PPI) network. Single-cell RNA-seq analysis was integrated to observe the expression of hub genes in different cell clusters, and we further analyzed whether these genes are involved in the malignant progression of NASH based on another microarray dataset.

Materials and methods

Data download and processing

The framework of this study is shown in Fig. 1. Two RNA-seq datasets (GSE135251 and GSE162694), one microarray dataset (GSE164760), and two single-cell RNA-seq datasets (GSE166504 and GSE182365) were systematically extracted from the GEO database (Supplementary file 1). The selected datasets are described in Table 1. GSE135251, GSE162694 and GSE164760 were derived from human liver tissues, while GSE166504 and GSE182365 were derived from mice fed a high-fat diet. The clinical characteristics of the human samples are described in the articles published by the dataset submitters [15,16,17]. GSE135251 included 51 NAFL, 155 NASH and 10 control samples. GSE162694 comprised 31 normal and 112 NASH liver tissues. We selected the gene expression and clinical information related to NASH in GSE135251 and GSE162694 for further analysis. One outlier sample (548nash100) was removed by cluster analysis (Supplementary file 2). In the end, 155 NASH samples from GSE135251 and 111 NASH samples from GSE162694 were retained. We used the clusterProfiler R package to perform gene symbol conversion on the downloaded count data [18]. Following the recommendations of WGCNA, we processed the data with a variance-stabilizing transformation and a logarithmic transformation, and then selected, separately for each dataset, the 5000 genes with the highest variance across NASH samples for WGCNA.
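
The gene-filtering step described above can be illustrated with a minimal sketch. It assumes a plain `log2(count + 1)` transform as a stand-in for the variance-stabilizing transformation used in the study, and the function name `top_variance_genes`, the gene labels and the counts are all hypothetical.

```python
import math
from statistics import pvariance


def top_variance_genes(counts, k):
    """Return the k gene names with the highest variance of log2(count + 1).

    `counts` maps a gene name to its list of raw counts across samples.
    The log2(x + 1) transform is an assumption made for illustration only;
    it is not the variance-stabilizing transformation used in the study.
    """
    scored = []
    for gene, values in counts.items():
        logged = [math.log2(v + 1) for v in values]
        scored.append((pvariance(logged), gene))
    # Highest variance first; ties are broken by gene name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gene for _, gene in scored[:k]]


# Invented toy counts for three placeholder genes across four samples.
toy_counts = {
    "GENE_A": [10, 12, 11, 9],
    "GENE_B": [1, 200, 5, 900],
    "GENE_C": [50, 55, 400, 45],
}
top_two = top_variance_genes(toy_counts, 2)
print(top_two)
```

In the study the same idea is applied with k = 5000 to each NASH dataset separately, so that the two co-expression networks are built from each dataset's own most variable genes.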

Fig. 1 Flow diagram of this study

Table 1 Description of each dataset

WGCNA analysis and identification of modules

We used the WGCNA R package to construct a co-expression network for the top 5000 genes mentioned above [13]. First, the R function pickSoftThreshold was used to evaluate soft-threshold powers from 1 to 20. Second, hierarchical clustering and the dynamic tree cut package were used to divide the genes into modules (abline = 0.25). Third, the correlation between module eigengenes and clinical traits was analyzed to identify modules of interest. Finally, after excluding the grey module, which contains genes that did not cluster successfully, we selected the modules in the two networks that were significantly positively correlated with fibrosis progression.
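
To make the role of the soft-threshold power concrete, the following minimal sketch computes an unsigned adjacency, a_ij = |cor(x_i, x_j)|^beta, and the resulting connectivity of each gene. It is not the WGCNA implementation: the unsigned form, the helper names and the toy expression values are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def soft_threshold_adjacency(expr, beta):
    """Unsigned adjacency a_ij = |cor(x_i, x_j)| ** beta for every gene pair."""
    genes = sorted(expr)
    return {
        g: {h: abs(pearson(expr[g], expr[h])) ** beta for h in genes if h != g}
        for g in genes
    }


def connectivity(adjacency):
    """Connectivity of a gene: the sum of its adjacencies to all other genes."""
    return {g: sum(row.values()) for g, row in adjacency.items()}


# Invented expression profiles for three placeholder genes in five samples.
toy_expr = {
    "G1": [1, 2, 3, 4, 5],
    "G2": [2, 4, 6, 8, 10],
    "G3": [5, 3, 4, 1, 2],
}
adj = soft_threshold_adjacency(toy_expr, beta=3)
k = connectivity(adj)
print(round(adj["G1"]["G3"], 3), round(k["G3"], 3))
```

Raising the correlation to a power suppresses weak correlations more strongly than strong ones, which is why a higher power pushes the connectivity distribution towards the scale-free shape that pickSoftThreshold evaluates.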

PPI network construction and hub genes

The genes shared by the selected modules were displayed in a Venn diagram. To further understand the biological functions of these common genes, the clusterProfiler R package was used for Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, and the ReactomePA R package was used for Reactome pathway analysis [19,20,21]. GO annotations include biological processes (BP), molecular functions (MF), and cellular components (CC). We present the top ten BP terms and all KEGG and Reactome pathway results. An adjusted P < 0.05 was considered statistically significant. We also used the STRING database to construct a PPI network for the overlapping genes and imported interactions with a score > 0.4 (medium confidence) into Cytoscape 3.8.2 for visualization [22, 23]. Finally, the CytoHubba and MCODE plugins were used to analyze the functional modules of the PPI network and identify hub genes. The CytoHubba plugin was used to obtain the top ten genes with the MCC algorithm. The MCODE plugin parameters were set as follows: degree cut-off, 2; node score cut-off, 0.2; cut style, haircut; K-core, 2; max. depth, 100. Pearson's correlation coefficient was used to evaluate correlations between hub genes.
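
As a rough illustration of the PPI filtering step, the sketch below keeps edges above a confidence threshold and ranks nodes by degree. Degree ranking is a deliberately simpler stand-in for the CytoHubba and MCODE algorithms, which are not reimplemented here, and the node labels and scores are invented placeholders on a 0 to 1 scale.

```python
from collections import defaultdict


def filter_edges(edges, min_score):
    """Keep only interactions whose confidence score exceeds min_score."""
    return [(a, b, score) for a, b, score in edges if score > min_score]


def degree_ranking(edges):
    """Rank nodes by the number of retained interactions (degree)."""
    degree = defaultdict(int)
    for a, b, _ in edges:
        degree[a] += 1
        degree[b] += 1
    # Highest degree first; ties are broken alphabetically.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical edge list: (node, node, confidence score).
toy_edges = [
    ("GENE_A", "GENE_B", 0.72),
    ("GENE_A", "GENE_C", 0.55),
    ("GENE_A", "GENE_D", 0.41),
    ("GENE_B", "GENE_E", 0.90),
    ("GENE_F", "GENE_G", 0.30),
]
kept = filter_edges(toy_edges, 0.4)
ranking = degree_ranking(kept)
print(ranking[:2])
```

Nodes that only appear in low-confidence edges (here GENE_F and GENE_G) drop out of the network entirely, so the threshold affects both the edge set and the set of candidate hub genes.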

Integrating single-cell RNA-seq analysis based on hub genes

GSE166504 contains single-cell sequencing results for hepatocytes and liver non-parenchymal cells (NPC) from mice. From this dataset, we selected the NPC collected during high-fat feeding and analyzed them using the Seurat R package with default settings (Supplementary file 3) [24]. We also performed cluster analysis on hepatic stellate cells (HSCs) from high-fat-fed mice in another dataset, GSE182365, to observe gene expression in different cell clusters (Supplementary file 4). Cell clusters were annotated using the CellMarker database and the relevant literature [25,26,27,28].

Expression of hub genes in different liver tissues

GSE164760 is a human microarray dataset that includes normal, cirrhotic, adjacent non-tumor NASH, and NASH-associated HCC liver tissues. Based on GSE164760, we further analyzed whether the hub genes were involved in NASH-HCC transformation. ROC analysis was performed using the R package pROC to evaluate how accurately the expression of the differentially expressed genes predicted HCC [29]. We also used the Human Protein Atlas (HPA) database to compare the protein levels of the corresponding genes (paired by antibody, sex, and age). HPA is a freely accessible protein database that provides tissue and cell distribution information for a wide range of human proteins [30]. Finally, consensus clustering was performed on the 53 NASH-associated HCC samples to compare the differences between subtypes. Consensus clustering, a common approach to the classification of cancer subtypes, allows new disease subtypes to be discovered or different subtypes to be compared by dividing samples into several subtypes based on different omics datasets [31]. The potential KEGG signaling pathways of HCC were explored by GSVA [32]. Comparisons between groups were performed with limma [33]. The Benjamini-Hochberg method was used for multiple comparison correction, and an adjusted P < 0.05 was considered statistically significant (Supplementary file 5).
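
The Benjamini-Hochberg correction mentioned above can be sketched in a few lines. This is a generic illustration of the procedure, not code from the study; the function name and the example P values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Each P value is multiplied by m / rank (rank 1 = smallest P value), and
    a running minimum taken from the largest P value downwards keeps the
    adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Invented raw P values for four hypothetical comparisons.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adj_p]
print([round(p, 4) for p in adj_p], significant)
```

With these toy values only the smallest P value remains below the 0.05 cut-off after adjustment, which shows how the correction can change which comparisons are reported as significant.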

Results

WGCNA and module identification

The top 5000 genes in the NASH samples were selected to construct the co-expression networks. According to the principle of scale-free networks, 3 was selected as the soft-threshold power for both network 1 and network 2 (scale-free R^2 of 0.93 and 0.95, respectively) (Fig. 2a, b). Network 1 (from GSE135251) and network 2 (from GSE162694) were each divided into 11 modules (Fig. 2c, d). We related the modules to the fibrosis stage of NASH, looking for the modules most significantly associated with fibrosis progression (Fig. 2e, f). After excluding the grey module (genes that failed to cluster), we selected the positively correlated modules with P < 0.01 in NASH F4 and performed BP analysis on them. The blue module (GSE162694) is associated with the extracellular matrix, the red module (GSE162694) with immune inflammation, and the brown module (GSE135251) with both (Supplementary file 6).

Fig. 2 WGCNA analysis. (a, b) Scale-free fit index in network 1 and 2 (x-axis is soft threshold power, y-axis is signed R^2). (c, d) Clustering dendrogram of genes (visual comparison of modules based on dynamic tree cutting). (e, f) Heatmaps of correlation between module eigengenes and clinical traits

PPI network construction and functional annotation

The selected modules shared 234 common genes, from which a PPI network was constructed (Fig. 3a, b). According to the BP analysis, these genes are mainly involved in external encapsulating structure organization, extracellular structure organization and extracellular matrix organization (Fig. 3c). KEGG pathway analysis showed that these genes are mainly involved in the Wnt signaling pathway, extracellular matrix (ECM)-receptor interaction and viral protein interaction with cytokine and cytokine receptor (Fig. 3d). Reactome pathway analysis showed that these genes are mainly related to extracellular matrix organization (Fig. 3e). These results suggest that these genes mainly participate in fibrotic repair during NASH development.
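
Over-representation results such as these are commonly based on a hypergeometric test. The sketch below shows that calculation under this assumption; the function name and the toy numbers are hypothetical and do not correspond to the 234 genes analyzed here.

```python
from math import comb


def enrichment_p_value(universe, annotated, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe  : number of background genes
    annotated : background genes annotated to the term
    selected  : size of the gene list being tested
    overlap   : genes in the list that are annotated to the term
    """
    upper = min(annotated, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(annotated, i) * comb(universe - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 annotated to a term,
# 3 genes in the tested list, 2 of which carry the annotation.
p_value = enrichment_p_value(universe=10, annotated=4, selected=3, overlap=2)
print(round(p_value, 4))
```

The resulting P values for all tested terms would then be corrected for multiple testing before the adjusted P < 0.05 cut-off is applied.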

Fig. 3 PPI network of common genes and functional annotation. (a) Venn diagram of overlapping genes. (b) PPI network (nodes are arranged by degree, with high-degree nodes (dark color) at the center of the circle). (c) Bar chart of BP analysis results. (d, e) Bubble diagrams of KEGG and Reactome analysis results

Hub genes in each fibrosis stage

The results of CytoHubba and MCODE were intersected to obtain seven hub genes (SPP1, PROM1, SOX9, EPCAM, THY1, CD34 and MCAM) (Fig. 4a). The hub genes were well correlated with one another, and the correlation pattern was highly consistent between the two datasets (Fig. 4b). The expression of the hub genes also showed an upward trend across fibrosis stages (Fig. 4c, d) (Supplementary file 5).

Fig. 4 Expression of hub genes in different fibrosis stages. (a) Algorithm of CytoHubba and modules of MCODE. (b) Visualization of correlation between hub genes in different datasets. (c, d) Line charts of fibrosis stages (false discovery rate was used for multiple comparison correction in DESeq2, *adjusted P < 0.05; **adjusted P < 0.01; ***adjusted P < 0.001)

Hub genes in different liver NPC clusters from mice

Based on the expression of lineage-specific markers, we identified B cells, conventional dendritic cells (cDCs), cholangiocytes, cycling cells, endothelial cells, HSCs, hepatocytes, Kupffer cells, monocyte-derived macrophages (MDMs), NK cells and plasmacytoid dendritic cells (pDCs) (Fig. 5a). SPP1, PROM1, SOX9 and EPCAM are highly expressed in cholangiocytes, THY1 in NK cells, and CD34 in HSCs (especially in the late period), while MCAM is not specifically expressed in any cluster (Fig. 5b). With the exception of MCAM, the total expression of these hub genes in NPC increased gradually (Fig. 5c).

Fig. 5 Single-cell RNA-seq analysis of liver NPC from mice during high-fat feeding. (a) t-SNE visualization of liver NPC clusters. (b) Violin plot of hub gene expression in liver NPC clusters. (c) Dot plot of total expression of hub genes at 15, 30 and 34 weeks during high-fat feeding

Hub genes in different HSCs clusters from mice

HSCs are the main source of ECM [34]. By clustering HSCs from mice during high-fat feeding, we identified seven HSC clusters (Fig. 6a, b). Based on annotation, cluster 7 consists of admixed cholangiocytes: it expresses most cholangiocyte markers and contains only a small number of cells. We found that SPP1 and CD34 are markers of clusters 2 and 5, respectively (Fig. 6c). BP analysis of the respective markers of clusters 2 and 5 indicated that the two clusters may have different biological functions. Cluster 2 expresses various cytokines, chemokines and receptors and is mainly involved in inflammation (Fig. 6d). Cluster 5 is mainly associated with extracellular matrix formation (Fig. 6e).

Fig. 6 Single-cell RNA-seq analysis of HSCs from mice during high-fat feeding. (a) UMAP visualization of HSC clusters. (b) Heatmap based on the top five markers of each cluster. (c) Violin plot of hub gene expression in HSC clusters. (d, e) Circle diagrams of BP enrichment results of clusters 2 and 5

Correlation between hub genes and NASH-associated HCC

We further analyzed whether these hub genes were also involved in NASH-HCC transformation using another human dataset, GSE164760. Four hub genes, SPP1, SOX9, MCAM and THY1, showed significant differences between NASH and NASH-associated HCC tissues, suggesting that they are likely to be associated with NASH-HCC, while no differences were observed for PROM1, EPCAM and CD34 (Fig. 7a). ROC analysis indicated that high expression of these differentially expressed genes (SPP1, SOX9, MCAM and THY1) predicted the occurrence of HCC well (Fig. 7b). Finally, we examined differences at the protein level using the HPA database. The protein levels of these four genes were higher in tumor tissues than in normal tissues (Fig. 7c). Taken together, these results suggest that these four genes play an important role in the transformation of NASH to HCC.
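
For readers who want to see what the ROC summary measures, the area under the curve can be computed as the probability that a randomly chosen positive sample has a higher expression value than a randomly chosen negative sample. The sketch below uses this pairwise definition with invented expression values; it does not reproduce the pROC analysis.

```python
def roc_auc(scores_positive, scores_negative):
    """Area under the ROC curve from pairwise comparisons.

    Counts the fraction of (positive, negative) pairs in which the positive
    sample has the higher score; ties contribute one half.
    """
    wins = 0.0
    for pos in scores_positive:
        for neg in scores_negative:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Invented expression values for a hypothetical gene.
tumor_expression = [3.1, 2.5, 4.0]
non_tumor_expression = [1.0, 2.5, 2.0]
auc = roc_auc(tumor_expression, non_tumor_expression)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of the two groups, so values close to 1 indicate that higher expression is strongly associated with the positive class.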

Fig. 7 Expression of hub genes in different liver tissues. (a) mRNA expression of hub genes in normal, cirrhotic, adjacent non-tumor NASH, and NASH-associated HCC liver tissues (**adjusted P < 0.01). (b) ROC analysis to evaluate expression of differential genes in predicting HCC. (c) Immunohistochemical staining of SPP1, SOX9, MCAM and THY1 in normal and HCC liver tissues

Comparison of different NASH-associated HCC subtypes

Consensus clustering showed that the 53 NASH-associated HCC samples from GSE164760 could be roughly divided into two subgroups (C1 and C2), a division that was supported by PCA (Fig. 8a, b). Compared with C1, 96 up-regulated and 14 down-regulated pathways were found in C2 (Fig. 8c). Most of these pathways are related to metabolism (Fig. 8d). The expression of SOX9 was significantly downregulated in C2, while SPP1 and THY1 showed only a downward trend (Fig. 8e).

Fig. 8 Difference between NASH-associated HCC subtypes. (a) Consensus clustering to divide subtypes. (b) PCA diagram of NASH-associated HCC samples. (c) Volcano plot of differential KEGG pathways. (d) Bar chart showing the top 50 most significant pathways. (e) Expression of SPP1, SOX9, MCAM and THY1 in C1 and C2

Discussion

NAFLD comprises nonalcoholic fatty liver (NAFL) and NASH. Compared with NAFL, fibrosis progresses more rapidly in NASH, which can lead to cirrhosis and HCC [7, 34, 35]. Fibrosis is the only histological feature that can predict clinical outcome in NASH, and a growing number of noninvasive biomarkers are being evaluated to reflect the severity of fibrosis [36,37,38,39]. In this study, we performed WGCNA separately on NASH samples from GSE135251 and GSE162694, identified the modules of interest, and obtained 234 overlapping genes after intersection. BP analysis showed that the 234 overlapping genes are enriched in several biological terms, including external encapsulating structure organization, extracellular structure organization and extracellular matrix organization, which supports their association with NASH fibrosis progression. The most prominent KEGG and Reactome pathways are ECM-receptor interaction and extracellular matrix organization, respectively. NASH is associated with chronic, sustained activation of HSCs (from quiescent vitamin A-rich cells to fibrogenic, proliferative, and pro-inflammatory cells), resulting in accumulation of ECM and gradual replacement of liver parenchyma by fibrous tissue [5, 6, 34]. Taken together, these results suggest that the overlapping genes are closely involved in the progression of NASH fibrosis.

Based on these overlapping genes, we constructed a PPI network and identified seven hub genes (SPP1, PROM1, SOX9, EPCAM, THY1, CD34 and MCAM) through Cytoscape plugins. Single-cell RNA-seq analysis showed that the hub genes are expressed primarily in distinct cell clusters such as cholangiocytes, NK cells, and HSCs. NAFLD with cholestasis is characterized by ductal inflammation, bile duct loss and swelling, and bile duct hyperplasia, and is more prone to bridging fibrosis and cirrhosis [40]. Cholangiocyte injury is an important indicator of the severity of NAFLD [41, 42]. At the same time, persistent biliary fibrosis can create a liver tissue environment that promotes regeneration of hepatocytes [43]. SPP1 is located on human chromosome 4 and encodes osteopontin (OPN), a protein that is closely related to chronic liver disease and involved in liver steatosis, inflammation, and fibrosis [44, 45]. A clinical trial found that serum OPN levels increased progressively with the progression of NAFLD fibrosis [46]. SOX9 is a transcription factor involved in ECM production during liver fibrosis, and it regulates the Wnt pathway and its downstream target protein OPN [47, 48]. PROM1 is considered a marker of endothelial progenitor cells, hematopoietic stem cells and other stem cells, and it participates in the expansion of cholangiocytes known as the ductular reaction [49, 50]. PROM1 expression also correlates strongly with the expression of genes related to biliary fibrosis, such as KRT19 and COL1A1 [51]. HSCs expressing the stem/progenitor cell marker PROM1 have been found in the liver; they show the characteristics of progenitor cells and participate in liver injury and fibrosis [52]. EPCAM is a type I transmembrane glycoprotein that is also a marker of cholangiocytes [26, 53]. It is involved in a variety of biological processes such as cell adhesion, signal transduction, migration and proliferation [53].
In a mouse model of alcoholic hepatitis, silencing EPCAM inhibited liver fibrosis and HSC proliferation [54]. CD34 is considered a marker of microvascular formation [55]. In this study, we found that the expression of CD34 was significantly increased in HSCs during high-fat feeding. HSCs are the main source of ECM. When the liver is damaged, HSCs are activated and transformed into fibroblasts that participate in liver fibrosis and the reconstruction of intrahepatic structure [34]. Our results suggest that CD34 is likely to influence the formation of NASH fibrosis by participating in the activation of HSCs. A previous study also found that CD34-positive microvessels were more common in areas with more fibrosis and were significantly associated with fibrosis in NASH [56]. As an important regulator of cell–cell and cell–matrix interactions, THY1 plays an important role in nerve regeneration, inflammation, metastasis, and fibrosis [57]. Through the single-cell analysis of liver NPC in mice fed a high-fat diet, we found that THY1 is a marker of NK cells. NK cells usually have anti-fibrotic properties, including killing activated HSCs through IFNγ and inducing HSC apoptosis by expressing death receptor ligands [58]. NK cells also help clear senescent activated HSCs to reduce fibrosis [59]. Although no studies have directly linked THY1 to NASH fibrosis, THY1 has been shown to be involved in fibrosis formation in a mouse model of cholestatic liver injury, and a bioinformatics study found that THY1 may be a key regulator of NAFLD progression [60, 61]. MCAM is an adhesion molecule of the immunoglobulin superfamily [62]. Cell adhesion molecules are excellent biosensors with specific contributions in the liver, including leukocyte recruitment, cell differentiation and survival, matrix remodeling and angiogenesis, and they are distinctive anti-fibrotic therapeutic targets [63].
Interestingly, we found that the total expression of MCAM was decreased in liver NPC, which may be because we did not consider the influence of liver parenchymal cells and MCAM may only be activated in specific cell clusters. Together, we discovered that the expression of these hub genes may be involved in the progression of fibrosis and verified them from a single-cell perspective. However, the progression of NASH fibrosis is affected by a variety of clinical characteristics (age, gender, diabetes, etc.), and the effect of these factors on gene expression remains to be further explored [64,65,66].

HSCs are the central driver of fibrosis in liver injury [58]. Based on the cluster analysis of HSCs in another dataset, we found that SPP1 and CD34 are highly expressed in clusters 2 and 5, respectively. Further BP analysis of clusters 2 and 5 suggested that they might be closely related to the progression of fibrosis. Cluster 2 is closely associated with inflammation, showing leukocyte chemotaxis and the release of cytokines and receptors. HSCs need to strictly regulate autocrine and paracrine crosstalk to respond rapidly to changes in extracellular matrix content [67]. Cytokines are important for the initiation and duration of HSC activation, causing ECM production and contractility, respectively [68]. In addition, cluster 5 showed the properties of myofibroblasts (activated HSCs), participating in the formation of hepatic fibrosis and the reconstruction of intrahepatic structure through proliferation and secretion of extracellular matrix.

Finally, NASH has resulted in a dramatic increase in HCC prevalence [34]. We found that SPP1, SOX9, MCAM and THY1 were differentially expressed between NASH and NASH-associated HCC, suggesting that these four genes are involved not only in NASH fibrosis but also in the malignant progression of NASH. In addition, further analysis suggested that SOX9 may be associated with changes in metabolism-related pathways in NASH-associated HCC. However, this study has some limitations. First, because the clinical characteristics of the samples were not available, we could not exclude the influence of clinical confounding factors on the expression of the genes that might be involved in the progression of fibrosis. Similarly, we did not correct for these factors in the ROC analysis used to predict the occurrence of HCC. Second, we did not explore whether these genes are associated with other forms of fibrosis, so it is not clear whether they share a common pathway. Finally, the role of these hub genes in the progression of NASH needs to be verified through molecular biology experiments in the future.

Conclusion

WGCNA and single-cell RNA-seq analysis were used to investigate the association of gene expression with NASH fibrosis. We identified seven related hub genes (SPP1, PROM1, SOX9, EPCAM, THY1, CD34 and MCAM), and single-cell RNA-seq analysis showed that cholangiocytes seemed to play an important role in NASH fibrosis. Further analysis suggested that SPP1 and CD34 were highly expressed in different clusters of HSCs that perform different functions. In addition, SPP1, SOX9, MCAM and THY1 were associated with NASH-associated HCC, and SOX9 may be related to changes in metabolism-related pathways between different subtypes of HCC.