1 Introduction

Gallbladder cancer (GBC) is one of the most fatal malignancies of biliary tract cancers, where malignant cells form in the tissues of the gallbladder (Hundal and Shaffer 2014; Muhammad et al. 2018). Globally it accounts for around 80–90% of all the biliary tract cancers, and ranks sixth among gastrointestinal cancers (Hundal and Shaffer 2014; Song et al. 2020). As reported by the 2018 GLOBOCAN data, GBC accounts for around 1.7% of cancer-related deaths globally (Rawla et al. 2019). The incidence rate of GBC shows very high geographical, racial, and socioeconomic variations, suggesting the potential role of different environmental as well as genetic factors associated with the development and progression of this cancer (Hundal and Shaffer 2014; Sharma et al. 2017; Muhammad et al. 2018).

GBC does not exhibit any specific clinical symptoms, which makes it difficult to diagnose at an early stage; it is often diagnosed at an advanced stage (Letelier et al. 2012; Hundal and Shaffer 2014). Most of the time, GBC is incidentally diagnosed in patients undergoing cholecystectomy for the treatment of cholecystitis or cholelithiasis (Muhammad et al. 2018). According to different epidemiological and pathological investigations, patients with gallstones have a higher risk of GBC than healthy individuals. Gallstone disease (GSD) is considered the major risk factor for GBC; it affects ~20% of the adult population worldwide and is present in more than 85% of GBC patients (Letelier et al. 2012; Hundal and Shaffer 2014; Jinghan Wang et al. 2020). Gallstones cause local mucosal irritation and chronic inflammation. This process has been speculated to activate intracellular enzymes involved in promoter methylation of some potential genes, and also to produce inflammatory mediators in the tissue microenvironment. Such events may alter the transcriptomic and genomic landscape, contributing to early-stage carcinogenesis in GBC (Letelier et al. 2012; Hundal and Shaffer 2014; Muhammad et al. 2018; Jinghan Wang et al. 2020). However, the detailed molecular mechanism associated with the transition of GSD to GBC is not yet understood. The available tumor markers for diagnosis of GBC lack high specificity, and therefore the disease often cannot be detected until its advanced stages (Sharma et al. 2017). Understanding the molecular mechanism behind the transition of GSD to GBC will help in the identification of crucial molecular markers for its early detection and treatment.

The complex interactions of molecular and environmental factors may initiate GBC pathogenesis in a progressive manner, which could lead to the dysregulation of multiple processes such as cell cycle, DNA repair, apoptosis, as well as immune responses (Knox 2010). Integrative analysis of multi-dimensional data using systems biology-based approaches will provide the basis for understanding the complex molecular mechanisms responsible for carcinogenesis of the gallbladder. Network biology is an integrative systems biology approach that can help us understand the complex molecular mechanisms responsible for GBC pathogenesis (Furlong 2013), and can support the development of personalized treatment protocols (Chand and Alam 2012; Masoudi-Nejad and Wang 2015). Different types of networks such as protein–protein interaction (PPI) networks, gene regulatory networks, metabolic pathways, and various signaling pathways interact in a conjugated manner to define the fate of cellular behavior (Barabási and Oltvai 2004). The results generated from such integrative analysis of complex biological networks help to determine the specific roles of differentially regulated molecules, pathways, or processes in different cellular conditions, especially in a multifactorial disease such as cancer (Barabási and Oltvai 2004). Here, we analyzed a transcriptomic dataset of 10 gallbladder cancer samples with respect to their adjacent 10 normal tissue samples, and 30 gallstone disease tissue samples in three different follow-up periods. An integrative network-based analysis was carried out on the differentially expressed genes (DEGs) obtained to identify the overlapping and unique molecular signatures. We performed differential gene expression analysis, functional enrichment analysis, PPI network analysis, module analysis, and regulatory network analysis of the specific DEGs identified from GBC vs. gallstone disease with three different follow-up periods to identify significant hub genes and hub transcription factors (TFs). Moreover, we also carried out miRNA–hub genes network analysis, hub gene signaling network analysis, and evaluation of genomic alteration of the hub genes.

2 Methodology

2.1 Retrieval of transcriptomic data

The RNA-seq dataset of GBC and GSD samples was obtained from the European Nucleotide Archive (ENA) database in Sequence Read Archive (SRA) format with the accession number SRP226150. The dataset contained a total of 50 samples obtained through surgical resection. The data comprised 10 GBC tissues, 10 adjacent normal tissue samples, and 30 GSD tissue samples from three different follow-up periods of 1–3 years (GSD3), 5–10 years (GSD5) and more than 10 years (GSD10). The Illumina HiSeq 2500 platform was used to generate the paired-end reads of these 50 samples (Jinghan Wang et al. 2020). Using this published dataset, we carried out a detailed integrative analysis with various benchmarked network-based approaches to identify systems-level molecular signatures in GBC and the three different follow-up periods of GSD.

2.2 Transcriptomic data analysis and identification of differential overlapping and specific molecular signatures

The retrieved RNA-seq datasets in SRA format were converted into FastQ reads. The FastQ reads were pre-processed using an in-house RNA-seq data analysis pipeline. Pre-processing is an important step that removes or trims adapters, poly-N stretches, and low-quality reads. FastQC and fastp were used for the quality check (QC) of the reads and adapter trimming, respectively (Chen et al. 2018; de Sena Brandine and Smith 2019). The pre-processed high-quality reads were mapped against the human reference genome (Homo sapiens, GRCh38) using HISAT2 (version 2.2.1) (Kim et al. 2019). The mapped reads were then quantified using the featureCounts tool to obtain the gene expression profile of each sample as a single count matrix file (Liao et al. 2014). The count matrix was used to identify the differentially expressed genes (DEGs) using the DESeq2 package in R (Love et al. 2014). DESeq2 normalizes the counts with its median-of-ratios method, in which each read count is divided by the geometric mean of the counts for that gene across all samples, and then estimates log2 fold changes between conditions. The lists of significant DEGs were generated separately for GBC vs. adjacent normal (DEG list 1), and GBC vs. GSD with three different follow-up periods. We considered a P-adjusted value ≤0.05 and |Log2 fold change| ≥1 for identifying DEGs. The overlapping and specific DEGs between GBC vs. GSD3, GBC vs. GSD5, and GBC vs. GSD10 were identified using Venny tool 2.1.0 (Oliveros 2007). The unique DEGs identified for GSD3 (DEG list 2), GSD5 (DEG list 3), and GSD10 (DEG list 4) were further used for downstream integrative analysis. The overall methodology is summarized in figure 1.
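The significance thresholds stated above (P-adjusted ≤0.05 and |log2 fold change| ≥1) can be illustrated with a minimal sketch. It assumes that the DESeq2 results have already been exported from R; the `GeneResult` record, the `select_degs` function, and the toy gene names are hypothetical and are not part of the original pipeline.

```python
from typing import NamedTuple, Optional

class GeneResult(NamedTuple):
    gene: str
    log2_fold_change: float
    padj: Optional[float]  # None stands in for the NA that DESeq2 reports for some genes

def select_degs(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Return (upregulated, downregulated) gene sets that pass both thresholds."""
    up, down = set(), set()
    for r in results:
        # Skip genes without an adjusted p-value or above the cutoff
        if r.padj is None or r.padj > padj_cutoff:
            continue
        if r.log2_fold_change >= lfc_cutoff:
            up.add(r.gene)
        elif r.log2_fold_change <= -lfc_cutoff:
            down.add(r.gene)
    return up, down

# Made-up rows for illustration only; not values from the study
toy_results = [
    GeneResult("GENE_A", 2.3, 0.001),   # passes, upregulated
    GeneResult("GENE_B", -1.4, 0.03),   # passes, downregulated
    GeneResult("GENE_C", 0.6, 0.0001),  # fold change too small
    GeneResult("GENE_D", 3.0, 0.20),    # not significant
    GeneResult("GENE_E", -2.0, None),   # no adjusted p-value
]
up_genes, down_genes = select_degs(toy_results)
print(sorted(up_genes), sorted(down_genes))
```

Keeping the two directions separate makes it straightforward to report up- and downregulated counts per comparison.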

Figure 1

A schematic representation of the overall methodology used in this study.

2.3 Functional annotation and pathway analysis

We carried out functional annotation and pathway enrichment analysis using the unique DEG lists (DEG lists 2, 3, and 4) identified from three follow-up periods. Functional annotations provided an overview of associations of the DEGs with biological processes, pathways, and disease phenotypes. We used two independent tools, i.e., DAVID (v6.8) (Dennis et al. 2003) and BINGO (a Cytoscape plugin), to determine the enriched biological processes associated with the unique DEGs. The enriched pathways associated with unique DEGs were identified from the KEGG database. Thresholds of p-value <0.05 and gene count >5 were applied to select the enriched biological processes and KEGG pathways.
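The selection rule above (p-value <0.05 and gene count >5) can be sketched with a plain hypergeometric over-representation test. This is a stand-in for illustration only: DAVID and BINGO apply their own statistics and annotation sets, and the function names, the background size, and the toy term names below are assumptions.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n carry the annotation."""
    total = comb(M, N)
    upper = min(n, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total

def enriched_terms(deg_set, term_to_genes, background_size, p_cutoff=0.05, min_count=5):
    """Keep terms with p < p_cutoff and more than min_count DEGs annotated to them."""
    kept = []
    for term, genes in term_to_genes.items():
        count = len(deg_set & genes)
        if count <= min_count:
            continue
        p = hypergeom_sf(count, background_size, len(genes), len(deg_set))
        if p < p_cutoff:
            kept.append((term, count, p))
    return sorted(kept, key=lambda t: t[2])

# Toy data: 50 DEGs in an assumed background of 1000 genes
toy_degs = {f"g{i}" for i in range(50)}
toy_terms = {
    "term_x": {f"g{i}" for i in range(8)} | {f"b{i}" for i in range(12)},  # 8 DEGs of 20 genes
    "term_y": {"g0", "g1", "g2"} | {f"b{i}" for i in range(17)},           # only 3 DEGs
}
toy_enriched = enriched_terms(toy_degs, toy_terms, background_size=1000)
print([term for term, _, _ in toy_enriched])
```

The count filter is applied before the test so that terms supported by only a handful of genes are never reported, regardless of their p-value.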

2.4 PPI-based network analysis and screening of hub genes

The STRING database, version 11.5 (http://www.stringdb.org/), was used to construct the PPI network with unique DEGs identified from the DEG lists 2, 3, and 4 (Suratanee and Plaimas 2018). In the PPI network topology, the nodes represented the seed proteins (seed DEGs) and the edges represented the interactions between the DEGs. The PPI networks were analyzed using Cytoscape version 3.8 (Shannon et al. 2003). The CytoHubba plugin in Cytoscape was used for topological analysis of the PPI networks and subsequent identification of hub genes (Chin et al. 2014). The hub genes for each disease group were identified through an ensemble approach by taking the consensus of five topological parameters, viz., maximum clique centrality (MCC), maximum neighborhood component (MNC), degree, edge percolated component (EPC), and betweenness centrality (Chin et al. 2014). The five top-ranked genes were considered to be the potential candidate genes for each of the conditions. Furthermore, highly connected gene modules from the PPI networks were detected using the Molecular Complex Detection (MCODE) algorithm (Pruitt et al. 2001). MCODE scores ≥4 and the number of nodes >4 were set as cutoff criteria with the default parameters (degree cutoff ≥2, node score cutoff ≥2, K-core ≥2, and max depth =100) (Roy et al. 2021).
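The consensus step described above can be sketched as an intersection of the top-ranked nodes from each scoring method. The sketch computes only degree directly; scores for the other methods are assumed to have been exported from CytoHubba, and the function names, toy network, and toy scores are hypothetical.

```python
from collections import Counter

def degree_scores(edges):
    """Count the number of interaction partners for each node of an undirected network."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

def consensus_hubs(scores_by_method, top_k):
    """Return nodes that appear in the top_k of every scoring method."""
    top_sets = []
    for scores in scores_by_method.values():
        # Sort by descending score; ties are broken alphabetically for reproducibility
        ranked = sorted(scores, key=lambda g: (-scores[g], g))
        top_sets.append(set(ranked[:top_k]))
    return set.intersection(*top_sets)

# Toy undirected network; node names are placeholders
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
toy_scores = {
    "degree": degree_scores(toy_edges),
    # Illustrative scores for a second method (e.g. betweenness)
    "betweenness": {"A": 4.0, "D": 3.0, "B": 0.0, "C": 0.0, "E": 0.0},
}
toy_hubs = consensus_hubs(toy_scores, top_k=2)
print(toy_hubs)
```

Requiring agreement across methods is conservative: a node that ranks highly under only one measure is excluded.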

2.5 Transcription regulatory-based network analysis and screening of hub transcription factors

Transcription factors (TFs) are the key regulators in the transcription process which influence overall gene expression by binding to the start site of the promoter region. For the construction of transcriptional regulatory networks, 1 kb upstream FASTA sequences of the specific DEGs identified from GBC compared with GSD with different follow-up periods (GBC vs. GSD3, GBC vs. GSD5, and GBC vs. GSD10) were extracted using Regulatory Sequence Analysis Tools (RSAT) (Thomas-Chollier et al. 2008). Experimentally determined benchmarked position weight matrices (PWMs) for all the TFs were obtained from the CIS-BP database (Weirauch et al. 2014). A PWM is a mathematical model describing the binding specificity of a TF. PWMs were used to scan the cis-regulatory sequences of a gene to find sites that were significantly more similar to the PWM than to the background model (Stormo 2000). A widely used matrix-scanning tool from the MEME suite (v3.4.0) was used for PWM scanning with a p-value cutoff of 10⁻⁴ (Bailey et al. 2009). Finally, the transcriptional regulatory networks (TRNs) with prediction scores were visualized in the form of interactive networks.
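The idea of PWM scanning can be illustrated with a minimal log-odds scan of a single strand. This is a simplified sketch, not the MEME suite implementation: it assumes a uniform background, uses a raw score threshold instead of a p-value cutoff, and the function names and the toy motif are hypothetical.

```python
import math

BASES = "ACGT"

def to_log_odds(ppm, background=0.25, pseudocount=0.01):
    """Convert per-position base probabilities into log2 odds against a uniform background."""
    log_odds = []
    for column in ppm:
        total = sum(column[b] + pseudocount for b in BASES)
        log_odds.append(
            {b: math.log2(((column[b] + pseudocount) / total) / background) for b in BASES}
        )
    return log_odds

def scan(sequence, log_odds, min_score):
    """Slide the matrix along the forward strand and report windows scoring >= min_score."""
    width = len(log_odds)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing N or other ambiguous symbols
        score = sum(log_odds[i][ch] for i, ch in enumerate(window))
        if score >= min_score:
            hits.append((start, window, score))
    return hits

def consensus_column(base):
    """Toy column that strongly prefers one base."""
    return {b: (0.97 if b == base else 0.01) for b in BASES}

toy_matrix = to_log_odds([consensus_column(b) for b in "TATA"])
toy_hits = scan("GGTATAGG", toy_matrix, min_score=5.0)
print([(start, window) for start, window, _ in toy_hits])
```

A full scan would also score the reverse complement and convert scores into p-values against a background model, which is what allows a cutoff such as the one used in this study.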

2.6 Prediction of hub genes–microRNA network analysis

MicroRNAs (miRNAs) belong to a class of small noncoding RNAs that play a crucial role in cancer development by acting as oncogenes and/or tumor suppressor genes. We performed hub gene–miRNA network analysis to identify potential hub gene–miRNA interactions. To identify hub gene–miRNA interactions, we used the miRTar database that stores experimentally validated miRNA–gene interaction data. Signaling network analysis of hub genes was performed using the SIGNOR 2.0 database (http://signor.uniroma2.it/) to identify key signaling pathways. The cBioPortal database (https://www.cbioportal.org/) was used to identify genetic alterations associated with the identified hub genes.
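The hub gene–miRNA step amounts to building a bipartite network and counting the miRNAs connected to each hub gene. The sketch below assumes the interactions are available as (miRNA, gene) pairs; the function name and all identifiers in the toy data are placeholders rather than real database entries.

```python
def mirnas_per_hub_gene(interactions, hub_genes):
    """Count distinct miRNAs linked to each hub gene, sorted by descending count."""
    partners = {gene: set() for gene in hub_genes}
    for mirna, gene in interactions:
        if gene in partners:
            partners[gene].add(mirna)  # sets ignore duplicated records
    return sorted(
        ((gene, len(mirnas)) for gene, mirnas in partners.items()),
        key=lambda item: (-item[1], item[0]),
    )

# Placeholder identifiers for illustration only
toy_interactions = [
    ("miR-x1", "HUB_1"), ("miR-x2", "HUB_1"), ("miR-x2", "HUB_1"),
    ("miR-x3", "HUB_2"), ("miR-x4", "OTHER_GENE"),
]
toy_counts = mirnas_per_hub_gene(toy_interactions, ["HUB_1", "HUB_2", "HUB_3"])
print(toy_counts)
```

Hub genes without any recorded interaction stay in the output with a count of zero, which makes absent regulators explicit.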

3 Results

3.1 Identification of differentially expressed genes in GBC and GSDs

Differential gene expression analysis using DESeq2 identified four significant lists of DEGs for GBC vs. adjacent normal (DEG list 1), GBC vs. GSD3 (DEG list 2), GBC vs. GSD5 (DEG list 3), and GBC vs. GSD10 (DEG list 4) conditions. DEG list 1 contained 985 genes, of which 248 were upregulated, and 737 were downregulated. The total number of upregulated and downregulated DEGs (DEG lists 1, 2, 3, and 4) has been summarized in table 1. The complete lists of DEGs for all the comparisons along with the log2 fold change values and the corresponding adjusted p-values have been presented in the supplementary dataset. The results from DEG list 1 show that the significant DEGs identified in GBC were mostly downregulated (figure 2A). The downregulated DEGs were mainly enriched in important cell signaling pathways such as cAMP signaling, AMPK signaling pathway, PPAR signaling, and adipocytokine signaling pathways (figure 2B).

Table 1 Summary of the DEG lists identified from each comparison
Figure 2

Overview of gene expression profile and cellular pathways identified in GBC compared with adjacent normal. (A) Volcano plot and hierarchical clustering showing the expression of significant DEGs in GBC compared with that of adjacent normal. (B) The bubble plot represents the top 10 key pathways associated with significant DEGs in GBC.

3.2 Overlapping and unique DEGs in GBC compared to GSD with different follow-up periods

The objective of our work was to identify overlapping and unique molecular signatures between GBC and GSD for understanding the possible mechanisms through which GSD progresses to GBC. There were 3102 overlapping genes identified among the DEG lists 2, 3, and 4. GSD3 had 824, GSD5 had 499, and GSD10 had 446 unique DEGs (figure 3A). The heatmap visualization of the significant unique DEGs reflected variation in the expression pattern of DEGs identified in each GSD follow-up period as compared with the DEGs from GBC (figure 3B). This suggested that the differential expression pattern of these genes in GSD might manifest into a wide pathological spectrum, and thereby could contribute to GBC pathogenesis.
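The overlap and uniqueness counts reported above correspond to regions of a three-set Venn diagram. A minimal sketch of that partition follows, assuming that "overlapping" means shared by all three comparisons and "unique" means present in exactly one; the function name and toy gene sets are hypothetical.

```python
def partition_three(list_a, list_b, list_c):
    """Split three DEG lists into the shared core and the genes unique to each list."""
    a, b, c = set(list_a), set(list_b), set(list_c)
    return {
        "common": a & b & c,
        "unique_a": a - b - c,
        "unique_b": b - a - c,
        "unique_c": c - a - b,
    }

# Toy gene sets for illustration only
toy_partition = partition_three(
    ["g1", "g2", "g3", "g4"],   # e.g. GBC vs. GSD3
    ["g1", "g2", "g5"],         # e.g. GBC vs. GSD5
    ["g1", "g3", "g6", "g7"],   # e.g. GBC vs. GSD10
)
print({region: sorted(genes) for region, genes in toy_partition.items()})
```

Genes shared by exactly two lists (such as `g2` and `g3` here) fall into neither category, which is why the unique and common counts do not sum to the total number of DEGs.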

Figure 3

Differential gene expression profiles of overlapping and unique signatures. (A) Venn diagram showing the number of unique and overlapping DEGs between GBC vs. GSD3, GBC vs. GSD5, and GBC vs. GSD10. The bar plot represents key pathways associated with overlapping DEGs. (B) Heatmap plot for significant unique DEGs identified between GBC vs. GSD3, GBC vs. GSD5, and GBC vs. GSD10.

3.3 Functional enrichment and pathways analysis of specific DEGs identified in GSD3, GSD5, and GSD10 follow-up periods

Functional enrichment and pathway enrichment analysis were performed for the identification of significant biological processes (table 2) and pathways (table 3) in the identified unique lists of DEGs. The enrichment analysis from both DAVID and BINGO showed that the DEGs in GSD3 and GSD10 were largely associated with immune response regulation and cell adhesion processes such as collagen organization. However, the unique DEGs in GSD5 were associated with distinct biological processes such as ion-transport channel-related processes. This suggested that among the GSD cases with different follow-up periods, there were potential molecular signatures which might contribute to GBC progression from GSD. The pathways associated with GSD3 were also enriched in cell adhesion pathways such as extracellular matrix organization, whereas the GSD5 DEGs were mainly linked with endocannabinoid signaling, leukocyte transendothelial migration, and neuroactive ligand-receptor interaction.

Table 2 Enriched biological processes associated with specific molecular signatures identified in GBC vs. GSD3, GBC vs. GSD5, and GBC vs. GSD10
Table 3 Enriched pathways linked with specific molecular signatures identified in GBC vs. GSD3, GBC vs. GSD5, and GBC vs. GSD10

3.4 Construction of PPI networks and screening of significant hub genes/proteins associated with GBC progression

The specific DEGs identified from GBC compared with that of GSD with different follow-up periods (GSD3, GSD5, and GSD10) were used to construct the PPI networks. Interactions among the queried DEGs with an effective binding score >0.4 were used to build the PPI networks. The effective binding score represents how likely an interaction between two nodes is to be true. In PPI networks, nodes and edges represent proteins and interactions, respectively, and the nodes with high degree are considered as hub genes/proteins. The interactive PPI networks were analyzed and visualized using Cytoscape v3.8.2 (figure 4A). The detailed statistics of the PPI network analysis are given in supplementary table 1.
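The score cutoff described above can be sketched as a simple edge filter. The sketch assumes the interactions are available as (protein_a, protein_b, combined_score) records on the 0 to 1000 integer scale used in STRING's downloadable files, so that a score of 0.4 corresponds to 400; the function name and toy records are hypothetical.

```python
def filter_ppi_edges(records, seed_genes, min_score=400):
    """Keep interactions between two seed genes whose score exceeds min_score."""
    seeds = set(seed_genes)
    edges = set()
    for a, b, score in records:
        if a == b or a not in seeds or b not in seeds:
            continue
        if score > min_score:
            # Store each undirected edge once, regardless of the order in the file
            edges.add(tuple(sorted((a, b))))
    return edges

# Toy records for illustration only
toy_records = [
    ("P1", "P2", 900),
    ("P2", "P1", 900),   # same interaction listed in the other direction
    ("P1", "P3", 400),   # not above the cutoff
    ("P2", "P4", 700),   # P4 is not a seed gene
    ("P3", "P2", 650),
]
toy_ppi = filter_ppi_edges(toy_records, seed_genes=["P1", "P2", "P3"])
print(sorted(toy_ppi))
```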

Figure 4

PPI network construction from the unique DEGs. (A) PPI networks of unique DEGs identified from GBC vs. GSD3, GBC vs. GSD5, and GBC vs. GSD10. The red triangles represent the hub genes identified based on the consensus of five topological algorithms. (B) Significant modules extracted from the PPI networks. The red triangles in the module networks represent hub genes.

CytoHubba, a Cytoscape plugin, was used to identify the hub DEGs from the PPI networks generated using the unique DEGs identified from DEG lists 2, 3, and 4. Five topological parameters (MCC, MNC, Degree, EPC, and Betweenness) were considered to identify the predicted hub DEGs. The 20 top-ranked DEGs identified from these five algorithms were considered for further evaluation (supplementary figure 1). The predicted hub DEGs from each of the topological parameters were intersected for the identification of consensus significant hub DEGs in the PPI networks (table 4). The identified hub genes in the PPI networks were mostly downregulated (supplementary table 2).

Table 4 List of significant hub genes identified through PPI networks analysis from unique DEGs in GBC vs. GSD3, GBC vs. GSD5, and GBC vs. GSD10

Functionally enriched significant modules in the PPI networks were identified using the Molecular Complex Detection (MCODE) algorithm (figure 4B). The significant module for GSD3 was associated with cell adhesion and collagen fibril organization, the GSD5 module was associated with ion transport and metabolic pathways, and the module identified for GSD10 was linked with immune system regulation.
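Module detection of this kind builds on the notion of densely connected subnetworks. As an illustration of one ingredient only, the sketch below extracts a k-core, i.e. the largest subnetwork in which every node keeps at least k neighbors. It is not the MCODE algorithm itself, which additionally weights nodes by local density and grows modules from seed nodes; the function name and toy network are hypothetical.

```python
def k_core(edges, k):
    """Return the nodes of the k-core of an undirected network given as an edge list."""
    adjacency = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    changed = True
    while changed:
        changed = False
        # Repeatedly peel off nodes that have fewer than k remaining neighbors
        for node in [n for n, nbrs in adjacency.items() if len(nbrs) < k]:
            for neighbor in adjacency.pop(node):
                adjacency[neighbor].discard(node)
            changed = True
    return set(adjacency)

# Toy network: a triangle (A, B, C) with one pendant node D
toy_module_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
toy_core = k_core(toy_module_edges, k=2)
print(sorted(toy_core))
```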

3.5 Analysis of TRNs and identification of potential TFs

The TRNs were constructed through PWM scanning, followed by identification of transcription factor binding sites (TFBS) on the target DEGs. The TFs were considered as source nodes, and non-TF DEGs were considered as target nodes for each condition (figure 5). Topological properties of the TRNs, such as assortativity and shortest path length, were calculated using igraph, an R package (Csardi and Nepusz 2006). The top 10 highly connected TFs were identified based on degree centrality (table 5). The TRN obtained from GSD3 was the largest, with 663 nodes and 4896 edges. Zinc finger family (ZNF) proteins were the commonly enriched regulatory hubs in all the three GSD follow-up periods. ZNF genes act as tumor suppressors and oncogenes. They regulate the key pathways and processes of cancer initiation, development, as well as progression. Some of the key pathways and processes are apoptosis, metastasis, regulation of transcription, and protein degradation mediated by the ubiquitin–proteasome pathway.
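Ranking TFs by degree centrality can be sketched as follows. Because TFs are source nodes and non-TF DEGs are target nodes, the degree of a TF in this setting equals the number of distinct targets it regulates. The study used igraph in R for the topological analysis; this standalone sketch, its function name, and the toy edge list are illustrative assumptions.

```python
from collections import Counter

def top_tfs_by_degree(tf_target_edges, n=10):
    """Rank TFs by the number of distinct target genes they are predicted to regulate."""
    out_degree = Counter()
    for tf, _target in set(tf_target_edges):  # set() drops duplicated TF-target pairs
        out_degree[tf] += 1
    # Sort by descending degree; ties are broken alphabetically
    return sorted(out_degree.items(), key=lambda item: (-item[1], item[0]))[:n]

# Toy TF -> target pairs; names are placeholders
toy_trn = [
    ("TF_1", "t1"), ("TF_1", "t2"), ("TF_1", "t3"),
    ("TF_2", "t1"), ("TF_2", "t2"), ("TF_2", "t2"),
    ("TF_3", "t4"),
]
toy_top_tfs = top_tfs_by_degree(toy_trn, n=2)
print(toy_top_tfs)
```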

Figure 5

Transcriptional regulatory network of the unique DEGs. The red node represents the hub genes identified based on degree centrality, and the small yellow nodes represent target DEGs. The TF–TG interactions are identified through PWM scanning of 1 kbp upstream sequences of significant DEGs that assign a probability score of the transcription factor binding to the target gene.

Table 5 Hub TFs identified through transcriptional regulatory networks from GBC vs. GSD3, GBC vs. GSD5, and GBC vs. GSD10

3.6 Prediction of miRNA interactions with the hub genes

The miRTar database was used to identify the miRNA regulators associated with the hub genes. The Network Analyzer tool in Cytoscape was used to analyze the miRNA regulatory network connections (figure 6A; supplementary table 3). In GSD3, COL1A1 had the highest number of interacting miRNAs, indicating its potential role in regulating crucial miRNAs associated with cancer progression. In GSD5, 3 out of the 8 hub genes were found to interact with miRNAs, viz., HBEGF, KIF5A, and GABRG2. In GSD10, GAPDH was found to have the highest number of miRNA connections, followed by CD3E and EGR2. However, no miRNAs were found to be associated with IL17A.

Figure 6

Hub gene analysis. (A) Hub gene–miRNA interactive networks. The small green nodes and big red nodes are the interacting miRNAs and hub genes, respectively. (B) Signaling network showing complexes, associated proteins, and signaling pathways associated with hub genes. The green nodes represent associated proteins; light blue nodes indicate signaling pathways; circled blue nodes represent hub genes; squared blue nodes represent signaling complexes, and yellow nodes indicate protein families.

3.7 Identification of key signaling complex associated with hub genes

We identified the key signaling complexes associated with the hub genes using the SIGNOR 2.0 database (figure 6B). The hub gene COL1A1 identified from the GBC vs. GSD3 PPI network analysis was associated with the ECM interaction signaling pathway through A11/b1 and A2/b1 integrin complexes. The ECM interaction pathway is known to be one of the hallmarks of cancer. The significant hub genes from GBC vs. GSD5 were associated with MAPK signaling pathways. The hub genes identified from GBC vs. GSD10 were largely associated with TCR signaling and PI3K signaling pathways. These pathways have been reported to be deregulated in many cancers (Sanchez-Vega et al. 2018).

3.8 Mining genomic alterations of the DEGs/hub genes from external datasets

Genomic alterations such as mutations and copy number variations (CNVs) associated with hub genes were evaluated from TCGA-GBC data and other TCGA datasets of gastrointestinal cancers such as esophageal cancer, stomach cancer, liver cancer, colorectal cancer, and pancreatic cancer using the cBioPortal database (figure 7). We observed that hub gene amplification is prominent in other gastrointestinal cancers, whereas the hub genes in GBC patients are associated with mutations. The OncoPrint tool of cBioPortal showed that 36% of patient cases had genetic alterations such as amplifications, deletions, and several mutations.
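The percentage of altered cases reported by an OncoPrint-style summary can be understood as the fraction of profiled samples carrying at least one alteration in any queried gene. The sketch below assumes alteration records of the form (sample, gene, alteration type); the function name and all identifiers are placeholders, and the numbers are not taken from the study.

```python
def altered_fraction(alteration_records, profiled_samples, genes):
    """Fraction of profiled samples with at least one alteration in any queried gene."""
    profiled = set(profiled_samples)
    if not profiled:
        raise ValueError("at least one profiled sample is required")
    queried = set(genes)
    altered = {
        sample
        for sample, gene, _kind in alteration_records
        if sample in profiled and gene in queried
    }
    return len(altered) / len(profiled)

# Toy cohort of five samples; identifiers are placeholders
toy_samples = ["S1", "S2", "S3", "S4", "S5"]
toy_alterations = [
    ("S1", "HUB_1", "amplification"),
    ("S1", "HUB_2", "mutation"),       # same sample, counted once
    ("S3", "HUB_2", "deletion"),
    ("S4", "OTHER_GENE", "mutation"),  # gene not in the query
]
toy_fraction = altered_fraction(toy_alterations, toy_samples, ["HUB_1", "HUB_2"])
print(f"{toy_fraction:.0%} of samples altered")
```

Counting samples rather than alteration events avoids inflating the frequency when one tumor carries several alterations.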

Figure 7

Genomic alterations associated with the hub genes in gastrointestinal cancers.

4 Discussion

Gallbladder cancer is known to be the most fatal malignancy of the biliary tract and it ranks sixth among the neoplasms of the gastrointestinal tract (Hundal and Shaffer 2014; Song et al. 2020). Among all the different risk factors, gallstones are considered the major risk factor, as in most cases the cancer is incidentally diagnosed while the patient is undergoing treatment for gallstones or cholelithiasis (Hundal and Shaffer 2014). Identification of molecular markers for early diagnosis is very important to reduce mortality from this cancer. Hence, our objective was to identify crucial molecular signatures that lead to the progression of GSD to GBC.

We carried out an integrative network-based analysis of transcriptomic datasets to compare and identify key molecular signatures in GBC with reference to GSD of different follow-up periods. Differential gene expression analysis and hierarchical clustering analysis showed significant variation in gene expression pattern among the unique DEGs identified from GSD with three different follow-up periods with respect to the GBC samples. The significant hub genes and TFs identified from GBC with reference to GSD of all the three follow-up periods were directly or indirectly associated with a few important processes and pathways known to be involved in cancer development and progression.

The hub genes identified from unique DEGs of GBC vs. GSD3 are SERPINH1, COL1A1, TPT1, and THBS1. The hub genes are linked with cell adhesion and collagen fibril organization processes. Cell adhesion molecules play an important role in regulating epithelial-to-mesenchymal transition (EMT) and influence malignant transformation and metastasis (Janiszewska et al. 2020). The Serpin Family H Member 1 (SERPINH1) gene is aberrantly expressed in different cancers: in gastric cancer it is involved in metastasis and EMT via the Wnt/β-catenin signaling pathway, and regulates the expression of the proteins of the extracellular matrix (ECM) to promote breast cancer (Tian et al. 2020). The COL1A1 gene encodes type 1 collagen, which is a major structural component of the ECM known to be involved in EMT. EMT allows epithelial cells to adopt a more mesenchymal state to enhance cellular migration, which thereby helps in the metastasis of cancer. Upregulation of COL1A1 promotes tumor metastasis by regulating the WNT/planar cell polarity (PCP) signaling pathway (Zhang et al. 2018). THBS1, or thrombospondin1, plays a key role in cellular communication, both cell-to-cell and cell-to-ECM interactions (Hu et al. 2021). Earlier studies reported that THBS1 was upregulated due to aberrant DNA methylation in various types of human cancer including breast cancer, gastric cancer, oral cancer, etc., to promote proliferation, invasion, and migration (Zhang 2021). TPT1, or Tumor Protein Translationally-Controlled 1, is an anti-apoptotic protein-coding gene which is involved in various cellular pathways like cell proliferation, growth, apoptosis, metabolism, and stabilization of microtubules during cell division (Zhang et al. 2021). It is also known to be involved in cancer progression and is differentially expressed in many types of human cancer. Studies revealed that TPT1 is upregulated in colon cancer and prostate cancer (Hosseinzadeh et al. 2020). 
In the case of epithelial ovarian cancer, TPT1 promotes tumor growth and metastasis via the TPT1/PI3K/AKT signaling pathway (Wu et al. 2019).

The hub genes identified from the unique DEGs in GBC vs. GSD5 were CX3CR1, GRM1, HBEGF, KIF5A, HEY2, GABRG2, GJA1, and GJA5. These hub genes are mainly involved in different types of cellular pathways including inflammation, cell growth and development, intracellular organelle transport, and cellular interaction. Inflammation in GBC due to GSD leads to the release of some carcinogenic molecules that ultimately result in tumor growth and development. Hub genes identified from the unique DEGs of GBC vs. GSD5 were associated with key cancer-related pathways such as inflammatory response, cellular interaction, as well as cell growth and proliferation, suggesting a higher risk of cancer progression. CX3CR1 or C-X3-C Motif Chemokine Receptor 1 is a transmembrane protein involved in the regulation of immune response, cell adhesion, inflammation, etc. But it is aberrantly expressed in many types of cancers like gastric cancer, breast cancer, pancreatic cancer, lung cancer, etc. (Marchesi et al. 2010; Wei et al. 2015). CX3CR1 overexpression in gastric cancer promotes migration, proliferation, and survival of tumors (Wei et al. 2015). However, in the case of glioma, neuroblastoma, and other non-neural-origin cancers, overexpression of CX3CR1 helps in the trans-endothelial migration and metastasis of cancer (Marchesi et al. 2010; Wei et al. 2015). GRM1 or Glutamate Metabotropic Receptor 1, is a G-protein-coupled receptor for glutamate that plays a crucial role in synaptic plasticity and the development of the cerebellum. GRM1 can hydrolyze phosphoinositide through phospholipase C activation. Apart from various neurological disorders, GRM1 is also known to be involved in human cancers like breast cancer, skin cancer, etc. Overexpression of GRM1 in melanocytes promotes tumor growth and progression through activation of PI3K/AKT and MAPK signaling pathways (Wangari-Talbot et al. 2012; Wen et al. 2014). 
GRM1 is also involved in other cancer-related pathways such as neuroactive ligand-receptor interaction, the FOXO signaling pathway, etc. The growth factor HB-EGF, or heparin-binding epidermal growth factor-like growth factor, is one of the ligands of the epidermal growth factor receptor (EGFR) that mediates its function via ERBB1/HER1 (also EGFR) and ERBB4/HER4. According to various studies, HBEGF is highly expressed in hepatocellular carcinoma, breast cancer, colon cancer, prostate cancer, and ovarian cancer, where it can help in the growth, proliferation, and progression of tumors (Miyamoto et al. 2004; Miyata et al. 2012). KIF5A, or kinesin family member 5A, is a member of the kinesin protein family, mainly expressed in neurons. It acts as a microtubular motor protein in axonal transport (Brenner et al. 2018). It has been observed from various studies that the kinesin proteins are aberrantly expressed in different types of human cancers including breast cancer, prostate cancer, lung cancer, bladder cancer, etc. Kinesins mediate the process of tumorigenesis by promoting cell growth and proliferation (Rath and Kozielski 2012; Tian et al. 2019). GJA1, or Gap Junction Protein Alpha 1, and GJA5, or Gap Junction Protein Alpha 5, are members of the connexin family of proteins that are involved in cellular communication. In gastric cancer, higher expression of GJA1 is associated with shorter overall survival of patients (Zhao et al. 2019). HEY2, a bHLH transcription factor with YRPW motif 2, is a transcription-factor-encoding gene of the hairy and enhancer-of-split-related (HESR) family. Expression of HEY2 is regulated by the Notch signal transduction pathway and TGF-β signaling pathway, which are mostly dysregulated in various human cancers (Liu et al. 2017). 
According to previous studies, HEY2 is highly expressed in different cancers like esophageal squamous cell carcinoma and non-small-cell lung carcinoma, where it can promote metastasis, cancer cell self-renewal, angiogenesis, EMT, as well as tumor proliferation (Forghanifard et al. 2015; Liu et al. 2017; Cheng et al. 2018). In hepatocellular carcinoma, upregulation of HEY2 plays an important role in cancer progression through the TGF-β/Smad signaling pathway by inhibiting TGF-β-induced growth arrest (Wang et al. 2019). The GABRG2 gene encodes a subunit of the gamma-aminobutyric acid type A receptor. Although this gene is most commonly involved in the function of the central nervous system, a recent study has suggested it as a novel oncogene promoting tumor invasion and metastasis (Jin et al. 2017). In thyroid cancer, higher expression of GABRG2 promotes tumor metastasis to lymph node (Jin et al. 2017). GABRG2 is also found to be highly expressed in colon adenocarcinoma (Yan et al. 2020).

The significant hub genes identified from the specific DEGs in GBC compared with GSD with follow-up period of more than 10 years were LCK, CCR7, CD3E, IKZF1, EGR2, GAPDH, and NR4A1. The majority of the identified hub genes were associated with immune response signaling pathways such as the T-cell receptor (TCR) signaling pathways. Chronic inflammation caused by gallstones is considered the most important risk factor in GBC development. It is largely associated with immune cells and inflammatory mediators such as cytokines, chemokines, reactive oxygen species, prostaglandins (PGs), and growth factors which strongly influence the genetic and epigenetic aberrations in oncogenes and/or tumor suppressor genes (TSG) (Hussain and Harris 2007). We identified that DEGs in GBC vs. GSD10 were associated with immune cell regulatory processes. The T-cells are known to be the principal defensive components against tumors and pathogens. T-cell activation functions to regulate a wide array of metabolic pathways and any aberration in the T-cell signaling pathway can lead to oncogenesis (Franchina et al. 2018). Lymphocyte cell-specific protein-tyrosine kinase (LCK) is an important gene that is expressed in T-lymphocytes and natural killer cells, and plays a significant role in T-cell receptor signaling, which can affect the pathogenesis or metastasis of cancer (Kumar Singh et al. 2018; Weiße et al. 2021). LCK phosphorylates CD79a, which induces the distal signaling events involved in the addition of a phosphate group to Syk, and thereby activates different signaling pathways such as PI3K/Akt, NF-kB, and ERK. These signaling pathways are known to be involved in cancer cell survival, proliferation, and also resistance to treatment of cancer (Fresno Vara et al. 2004; Kumar Singh et al. 2018). LCK is highly expressed in small-cell lung cancer, non-small-cell lung carcinoma, as well as lung cancer (Bommhardt et al. 2019). 
In cholangiocarcinoma, expression of LCK is related to the recurrence of tumors (Bommhardt et al. 2019). CCR7, or C-C motif chemokine receptor 7, encodes a G-protein-coupled receptor family protein that plays a crucial role in adaptive immune response through activation of B- and T-lymphocytes. CCR7 helps the tumor cell to escape immune surveillance and helps cancerous cells to survive by the activation of the PI3K/Akt signaling pathway (Legler et al. 2014). IL17A, or interleukin 17A, is a member of the interleukin 17 pro-inflammatory cytokine family produced by T-helper 17 (Th17) cells. Expression of IL17A has been found to be high in various tumor tissues, such as hepatocellular carcinoma, gastric cancer, etc. (Wu et al. 2014). IKZF1, or IKAROS Family Zinc Finger 1, is a zinc-finger DNA binding protein that acts as a transcription factor. It is involved in various biological processes such as immune system regulation and proliferation of hematopoietic cells, and also regulates cellular interaction via the Notch signaling pathway (Jedi et al. 2018). Epigenetic studies revealed that in colorectal cancer, IKZF1 was downregulated due to hypermethylation (Pedersen et al. 2015; Jedi et al. 2018). EGR2, or early growth response 2, is a sequence-specific DNA-binding protein which is a member of the Kruppel-like zinc finger transcription factor family (Bradley et al. 2008). EGR2 induces apoptosis through the phosphatase and tensin homolog deleted on chromosome 10 (PTEN) growth-suppressive signaling pathway (Unoki and Nakamura 2003). However, the negative regulation of EGR2 through miR-20a (a small noncoding RNA) promotes the growth of gastric cancer (Li et al. 2013).

From this study, we have observed that the identified hub genes and hub TFs were associated with different cellular processes and pathways directly or indirectly linked with cancer progression and metastatic invasion. The hub genes identified from each of the GSD follow-up periods were associated with distinct processes and signaling pathways. This suggested that GSD progresses to GBC through the dysregulation of multiple signal transduction pathways at different stages (initiation–progression–metastasis) with distinct pathological spectra. Hence, the identified common and unique molecular signatures between GSD and GBC reflect possible mechanisms through which GSD progressed to GBC. Further in-depth functional evaluation of the hub genes and TFs will be able to establish their association with specific stages of disease development and progression.