13.1 Introduction

Bioinformatics plays an important role in clinical diagnosis and research. At present, clinical bioinformatics has been widely used in the discovery of disease-related genes, determination of new drug molecular targets, disease diagnosis, and prognosis prediction (Wooller et al. 2017; Oliver et al. 2015; Fu et al. 2020). In clinical practice, bioinformatics can effectively predict the prognosis of patients or the occurrence and development of diseases based on the integration of previous diagnosis and treatment data and sequencing data and provide guidance for the diagnosis and treatment of diseases. For diseases whose pathogenesis is not clear, bioinformatics can also provide strong guidance, which can effectively save time and avoid aimless experiments.

The clinical identification of disease subtypes has mainly relied on pathology and symptoms, but the use of molecular bioinformatics to identify molecular subtypes has just begun. The molecular subtypes of diseases can be associated with clinical phenotypes, which may indicate the causes of phenotypic changes and explain the different symptoms of the same disease at the molecular level. This paper introduces how clinical bioinformatics can integrate molecular networks into clinical practice and how bioinformatics can be used to reclassify disease and solve clinical problems in regional medicine.

13.2 Methods Suitable for Clinical Practice

Clinical bioinformatics is widely used: it can not only integrate phenotype and gene expression data but also predict phenotype, and even suggest etiology, from gene expression. It can also focus on genes, regulatory elements, or microRNAs to find potential ways to treat diseases. The following are some bioinformatics methods that can be popularized in clinical practice.

13.2.1 Weighted Gene Co-expression Network Analysis (WGCNA)

Correlation network analysis is increasingly used in biological research. Weighted correlation network analysis (WGCNA) is a method for describing gene association patterns across samples. It can be used to identify sets of highly co-expressed genes and to propose candidate biomarker genes or therapeutic targets based on the connectivity within the gene sets and the association between the gene sets and phenotypes. Compared with approaches that focus only on differentially expressed genes, WGCNA can take the thousands of genes with the greatest variation, or all detected genes, build co-expression networks, and then test the association of the resulting modules with phenotypes. This makes fuller use of the available information, yields important phenotype-associated genes by screening the hub genes of a module, and provides reference and inspiration for the diagnosis and treatment of clinical diseases (Yin et al. 2018; Bai et al. 2020). WGCNA mainly includes the construction of a gene co-expression network, the formation of co-expressed gene modules, the correlation of these modules with clinical data, the analysis of correlations between modules and among genes within modules, and the screening of hub genes according to gene significance and module membership (Langfelder and Horvath 2008). The WGCNA workflow is shown in Fig. 13.1a.
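The network construction step can be illustrated with a minimal sketch. It assumes an unsigned network in which the adjacency between two genes is the absolute Pearson correlation raised to a soft-threshold power; the gene names, expression values, and the power of 6 are hypothetical and chosen only for illustration. In practice the R package WGCNA would be used.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned adjacency a_ij = |cor(x_i, x_j)| ** beta.

    expr maps a gene name to its expression values across samples.
    beta = 6 is an assumed, illustrative soft-threshold power.
    """
    genes = list(expr)
    adj = {g: {} for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            a = abs(pearson(expr[g], expr[h])) ** beta
            adj[g][h] = a
            adj[h][g] = a
    return adj

def connectivity(adj):
    """Connectivity of each gene: the sum of its adjacencies to all other genes."""
    return {g: sum(neighbors.values()) for g, neighbors in adj.items()}

# Hypothetical toy data: three genes measured in four samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
adj = soft_threshold_adjacency(expr, beta=6)
k = connectivity(adj)
```

Raising the correlation to a power suppresses weak correlations while keeping strong ones, which is why genes with high connectivity in this matrix are natural hub gene candidates.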

Fig. 13.1

Introduction to several bioinformatics methods. (a) The flowchart of WGCNA. First, the co-expression network is constructed; gene modules are then formed and correlated with clinical phenotypes, and hub genes are selected. (b, c) The method of consensus clustering to select the k value. (d) A schematic diagram of the ceRNA network, which is usually composed of mRNAs, microRNAs, and lncRNAs. (e, j) Part of the results of single-cell sequencing analysis; the analysis data came from the PBMC3K dataset provided by the R package Seurat

13.2.2 Identification of Disease Subtypes

Clinically, diseases are often classified according to their symptoms. Consensus clustering provides a new way to classify diseases into molecular subtypes according to gene expression. Based on the consensus clustering results, the clinical phenotypes of the molecular subtypes can be compared with statistical methods such as the chi-square test and t-test, or WGCNA can be used to construct a co-expression network that correlates molecular subtypes with clinical phenotypes, which supports more efficient and accurate diagnosis and treatment of diseases. The consensus clustering method repeatedly subsamples the gene expression matrix and clusters each subsample for a specified cluster count (k). The consensus value of a pair of items is the proportion of subsamples containing both items in which the two items fall in the same cluster; it is calculated and stored in a symmetric consensus matrix for each k. There are several ways to determine the optimal cluster number k. It can be chosen with principal component analysis (PCA) or from the consensus cumulative distribution function (CDF) (Fig. 13.1b, c). Whichever method is used, the final clustering results need to pass an evaluation of clustering significance.
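The consensus value can be made concrete with a short sketch. It assumes that the subsampling and clustering have already been run and that each run is available as a mapping from sample to cluster label; the sample names and labels below are hypothetical. Only the upper triangle of the symmetric matrix is stored.

```python
from itertools import combinations

def consensus_matrix(items, runs):
    """Consensus value for every pair of items.

    runs is a list of dicts, one per subsample, mapping each sampled item
    to its cluster label. The consensus value of a pair is the number of
    runs in which both items share a cluster divided by the number of runs
    in which both items were sampled.
    """
    pairs = list(combinations(items, 2))
    together = {pair: 0 for pair in pairs}
    sampled = {pair: 0 for pair in pairs}
    for labels in runs:
        for a, b in pairs:
            if a in labels and b in labels:
                sampled[(a, b)] += 1
                if labels[a] == labels[b]:
                    together[(a, b)] += 1
    return {
        pair: (together[pair] / sampled[pair] if sampled[pair] else 0.0)
        for pair in pairs
    }

# Hypothetical clustering results for five subsamples with k = 2.
items = ["s1", "s2", "s3", "s4"]
runs = [
    {"s1": 0, "s2": 0, "s3": 1},
    {"s1": 0, "s2": 0, "s4": 1},
    {"s1": 1, "s3": 0, "s4": 0},
    {"s2": 0, "s3": 1, "s4": 1},
    {"s1": 0, "s2": 1, "s3": 1},
]
cm = consensus_matrix(items, runs)
```

Values close to 0 or 1 indicate stable assignments, so a good k produces a consensus matrix dominated by such values.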

In addition, before using consensus clustering to classify molecular subtypes of diseases, it is necessary to confirm that no batch effect exists; if one is present, it needs to be removed first.

13.2.3 The ceRNA Regulatory Network

Competing endogenous RNA (ceRNA) has attracted much attention in recent years. It represents a new mode of gene expression regulation. Compared with the mRNA-miRNA regulatory network, the ceRNA regulatory network is more sophisticated and complex and involves more RNA molecules, including mRNAs, pseudogenes of coding genes, long non-coding RNAs (lncRNAs), and miRNAs. The ceRNA network provides a new way of studying the transcriptome and can explain some biological phenomena in greater depth. Common ceRNA networks generally contain differentially expressed mRNAs, microRNAs, and lncRNAs or circRNAs. Within them, mRNAs and lncRNAs show the same expression trend, whereas microRNAs show the opposite trend to both mRNAs and lncRNAs. Constructing the ceRNA regulatory network makes it possible to predict the regulatory relationships among microRNAs, mRNAs, and lncRNAs. This helps to uncover gene function and regulatory mechanisms at a deeper level and to understand many biological phenomena more thoroughly and comprehensively (Fig. 13.1d).
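The expression-trend rule described above can be sketched as a simple filter over candidate lncRNA-microRNA-mRNA triplets. The molecule names, logFC values, and the input layout are hypothetical; the predicted targets would normally come from a target prediction database.

```python
def consistent_cerna_triplets(log_fc, mirna_targets):
    """Keep triplets whose expression trends fit the ceRNA pattern.

    log_fc maps each molecule to its log fold change. mirna_targets maps
    a microRNA to its predicted targets, split into "lncRNA" and "mRNA".
    A triplet is kept when the lncRNA and mRNA change in the same
    direction and the microRNA changes in the opposite direction.
    """
    triplets = []
    for mirna, targets in mirna_targets.items():
        for lnc in targets["lncRNA"]:
            for mrna in targets["mRNA"]:
                same_direction = log_fc[lnc] * log_fc[mrna] > 0
                opposite_to_mirna = log_fc[mirna] * log_fc[mrna] < 0
                if same_direction and opposite_to_mirna:
                    triplets.append((lnc, mirna, mrna))
    return triplets

# Hypothetical differential expression results and predicted targets.
log_fc = {"miR_x": -1.5, "lnc_1": 2.0, "lnc_2": -1.2, "GENE_1": 1.8, "GENE_2": -2.1}
mirna_targets = {"miR_x": {"lncRNA": ["lnc_1", "lnc_2"], "mRNA": ["GENE_1", "GENE_2"]}}
triplets = consistent_cerna_triplets(log_fc, mirna_targets)
```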

13.2.4 Single-Cell Sequencing

Biomarkers are usually mined from genomic, proteomic, and transcriptomic data generated on bulk samples containing large numbers of cells or pieces of tissue, and this information ignores the heterogeneity within each sample. To fully explore the heterogeneity of cells or tissues and to trace the trajectory of cell differentiation, single-cell sequencing is essential (Wang and Song 2017). Techniques such as scRNA-seq and scATAC-seq are gaining popularity in scientific research.

By using Cell Ranger to process single-cell FASTQ files and map reads to the reference genome, we can obtain the gene expression matrix, annotation information, and cell information. The data are imported into R packages such as Seurat (Satija et al. 2015; Durruthy-Durruthy et al. 2014) and Monocle (https://cole-trapnell-lab.github.io/monocle3/) (Trapnell et al. 2014) to create objects; dimensionality reduction methods such as principal component analysis (PCA) and t-SNE can then be used to cluster and visualize the cells, and the markers of the different clusters can be identified. Cell types can in turn be assigned from the marker identification results. For example, in the clustering results of PBMC samples, we can pick out the cell cluster with CD8A as a marker and label it as CD8+ T cells, or pick out the cell clusters with GNLY and NKG7 as markers and label them as NK cells (Fig. 13.1e). It is worth noting that different single-cell sequencing methods identify cell types in different ways. For example, single-cell ATAC-seq can also identify and cluster similar cell types and states, but it generally uses open promoter regions as a signal of transcriptional activity.
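The marker-based labeling step can be sketched as a lookup from cluster markers to cell types. The two marker sets follow the PBMC example above; the cluster numbers, the additional marker genes, and the rule of leaving ambiguous clusters unassigned are assumptions made for illustration.

```python
# Marker sets taken from the PBMC example in the text.
CELL_TYPE_MARKERS = {
    "CD8+ T cells": {"CD8A"},
    "NK cells": {"GNLY", "NKG7"},
}

def annotate_clusters(cluster_markers, cell_type_markers=CELL_TYPE_MARKERS):
    """Assign a cell type to each cluster from its marker genes.

    A cluster receives a label only when exactly one cell type has all of
    its required markers among the cluster's markers; otherwise it is left
    as "unassigned" for manual review.
    """
    annotations = {}
    for cluster, markers in cluster_markers.items():
        found = set(markers)
        matches = [
            cell_type
            for cell_type, required in cell_type_markers.items()
            if required <= found
        ]
        annotations[cluster] = matches[0] if len(matches) == 1 else "unassigned"
    return annotations

# Hypothetical marker lists for three clusters.
cluster_markers = {
    0: ["CD8A", "CD3D"],
    1: ["GNLY", "NKG7", "GZMB"],
    2: ["MS4A1"],
}
annotations = annotate_clusters(cluster_markers)
```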

Based on the above analysis results, pseudotime analysis can then be performed. As a cell transitions between states, it undergoes a process of transcriptional reconfiguration in which some genes are silenced and others are activated. These transient states are often hard to characterize. Pseudotime analysis of single-cell RNA-seq data makes it possible to view these states without purifying the cells (Guerrero-Juarez et al. 2019) (Fig. 13.1j). The single-cell transcriptome data analyzed here were derived from the PBMC3K dataset provided by the R package Seurat.

13.3 Example of Molecular Bioinformatics in Application

Data analysis based on expression matrices usually requires normalization of the data. Just as count values in RNA-seq are normalized to obtain FPKM values, microarray expression data also need to be normalized; whether this has been done can be judged by plotting a boxplot (Fig. 13.2a). If you are analyzing a Series Matrix File from GEO DataSets, another question you may encounter is whether the data need a log2 transformation. This can be judged preliminarily from the range of the values in the expression matrix. The analysis results should not only satisfy the chosen thresholds but also be interpreted in light of the actual situation.
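A rough version of this check can be sketched as follows. It assumes that values already on a log2 scale rarely exceed a few tens, whereas raw intensities often reach the hundreds or thousands; the cutoff of 100 is an assumed rule of thumb rather than a fixed standard, and the example values are hypothetical.

```python
import math

def needs_log2(values, max_threshold=100.0):
    """Guess whether expression values are still on the raw scale.

    max_threshold = 100 is an assumed rule-of-thumb cutoff: a maximum
    above it suggests that the data have not been log2 transformed.
    """
    return max(values) > max_threshold

def log2_transform(values):
    """Apply log2 to positive values; non-positive values become NaN."""
    return [math.log2(v) if v > 0 else float("nan") for v in values]

# Hypothetical expression values from two series matrix files.
already_logged = [5.2, 7.9, 12.4]
raw_intensities = [150.0, 2048.0, 32.0]

if needs_log2(raw_intensities):
    raw_intensities = log2_transform(raw_intensities)
```

The boxplot remains the more informative check, because it also shows whether the sample distributions are aligned after normalization.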

Fig. 13.2

NSIP-related differentially expressed genes and network analysis. (a) Boxplot of gene expression data, from which the normalization of the data can be roughly judged. (b) Heat map of the top 10 upregulated genes and the top 10 downregulated genes. According to the results of cluster analysis, the 10 samples on the left were taken from NSIP patients, and the 11 samples on the right were normal tissue controls. (c) Protein-protein interaction (PPI) network of differentially expressed genes with the highest score. (d) Interaction network between differentially expressed genes and differentially expressed microRNAs; 123 interactions between 18 DEMs and 14 DEGs were selected

13.3.1 mRNA-MicroRNA Interaction Network

We studied the regulatory networks of mRNAs and microRNAs in non-specific interstitial pneumonia (NSIP) based on two GEO datasets, GSE110147 (Cecchini et al. 2018) and GSE32538 (Yang et al. 2013) (Table 13.1). The online differential expression analysis tool GEO2R (http://www.ncbi.nlm.nih.gov/geo/geo2r/) was applied to each of the two datasets separately to obtain the differentially expressed genes (DEGs) and differentially expressed microRNAs (DEMs) in NSIP. The cutoff values were an adjusted p-value < 0.01 and |logFC| > 1.3 (FC: fold change of expression between NSIP and normal tissue) for DEGs and an adjusted p-value < 0.01 for DEMs. GO and KEGG analysis of the DEGs was performed with the DAVID database (https://david-d.ncifcrf.gov/) (Kanehisa et al. 2016), with p < 0.05 as the cutoff value.
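The DEG cutoffs stated above can be applied to an exported results table with a few lines of code. This is a minimal sketch: the row layout, gene names, and values are hypothetical, and only the two thresholds come from the text.

```python
def select_degs(rows, adj_p_cutoff=0.01, logfc_cutoff=1.3):
    """Split genes that pass both cutoffs into upregulated and downregulated.

    Each row is a dict with the keys "gene", "adj_p", and "logFC".
    A gene passes when adj_p < adj_p_cutoff and |logFC| > logfc_cutoff.
    """
    up, down = [], []
    for row in rows:
        if row["adj_p"] < adj_p_cutoff and abs(row["logFC"]) > logfc_cutoff:
            (up if row["logFC"] > 0 else down).append(row["gene"])
    return up, down

# Hypothetical differential expression results.
rows = [
    {"gene": "GENE_A", "adj_p": 0.001, "logFC": 2.1},
    {"gene": "GENE_B", "adj_p": 0.005, "logFC": -1.8},
    {"gene": "GENE_C", "adj_p": 0.200, "logFC": 3.0},
    {"gene": "GENE_D", "adj_p": 0.001, "logFC": 0.9},
]
up, down = select_degs(rows)
```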

Table 13.1 Summary of the GSE110147 and GSE32538 datasets in the GEO database

The regulatory relationships between mRNAs and microRNAs were predicted with the miRWalk database (Dweep et al. 2011; Sticht et al. 2018). The protein-protein interaction (PPI) network was obtained from the STRING database (http://string-db.org/) (Szklarczyk et al. 2015) and drawn with Cytoscape (Su et al. 2014). The cutoff values were a combined confidence score of >0.7 for the PPI network and a node degree of ≥10 for screening hub genes. We used the Molecular Complex Detection (MCODE) plug-in for Cytoscape to screen hub genes from the PPI network. As a result, 2099 differentially expressed genes were identified between NSIP and normal lung tissue samples, and these genes are potential disease-associated genes for NSIP. Of these, 450 genes were upregulated in NSIP relative to normal tissue, and 1649 genes were downregulated. These genes may play key roles in the onset of NSIP. The heat map of the expression of DEGs (the top 10 upregulated genes and the top 10 downregulated genes) is shown in Fig. 13.2b. In addition, using an adjusted p-value < 0.01 as the threshold, we identified 21 DEMs between NSIP and normal lung tissue samples.
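The two PPI cutoffs can be illustrated with a short sketch that filters edges by combined score and then counts node degrees. The protein names and scores are hypothetical, each undirected interaction is assumed to be listed once, and the toy call lowers the degree cutoff to 2 so that the small example returns something.

```python
from collections import Counter

def hub_candidates(edges, min_score=0.7, min_degree=10):
    """Nodes whose degree reaches min_degree after score filtering.

    edges is a list of (protein_a, protein_b, combined_score) tuples.
    Edges with combined_score > min_score are kept, and the degree of a
    node is the number of kept edges that touch it.
    """
    degree = Counter()
    for a, b, score in edges:
        if score > min_score:
            degree[a] += 1
            degree[b] += 1
    return {node: d for node, d in degree.items() if d >= min_degree}

# Hypothetical interactions with combined scores.
edges = [
    ("P1", "P2", 0.90),
    ("P1", "P3", 0.80),
    ("P2", "P3", 0.50),
    ("P1", "P4", 0.95),
    ("P4", "P5", 0.71),
]
hubs = hub_candidates(edges, min_score=0.7, min_degree=2)
```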

GO and KEGG functional enrichment analysis of the 2099 DEGs was performed with the DAVID database, using p < 0.05 as the threshold. The GO analysis revealed that the differentially expressed genes were significantly enriched in immune response mechanisms, such as “innate immune response,” “adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains,” “adaptive immune response,” “immunoglobulin-mediated immune response,” “activation of plasma proteins involved in acute inflammatory response,” “immunoglobulin production,” etc. (Table 13.2a). Furthermore, KEGG pathway analysis indicated that the DEGs were significantly enriched in tumor and cell cycle-related pathways, such as “Cell cycle,” “p53 signaling pathway,” and “Pathways in cancer” (Table 13.2b).

Table 13.2 GO and KEGG analysis results of DEGs. Table 13.2(a) gives the GO analysis results, and Table 13.2(b) gives the KEGG analysis results for the ten pathways with the lowest p-values

The molecular sub-network was identified by mapping the differentially expressed genes into the PPI network and choosing the nodes for which the combined score is greater than 0.7 and the degree value is greater than 10. A sub-network with 131 nodes and 1009 edges was obtained from the network. With MCODE, a significant module containing 23 nodes and 246 edges was identified (Fig. 13.2c). We selected the ten genes with the highest degree (degree-value = 22): PLRG1, SRSF4, SNRPA1, HNRNPR, CDC40, DDX42, CWC22, HNRNPU, CPSF2, and CSTF3. Other genes in the significant module network are DDX46, HNRNP, HNRNPH1, SRSF5, POLR2H, POLR2B, SF3B5, CWC27, SKIV2L2, SYF2, SLU7, PRPF40A, and NAA38. Using an adjusted p-value < 0.01 as the threshold for DEMs, 21 microRNAs were identified as differentially expressed between NSIP and normal tissue samples. Target genes were predicted with miRWalk 3.0, a microRNA target gene prediction tool, using a score > 0.95 as the cutoff. We predicted the target genes of the 21 microRNAs and screened out the overlap between the target genes and the differentially expressed genes. A total of 3687 DEG-DEM interactions were obtained.

In addition, we drew the DEG-DEM interaction networks with Cytoscape, calculated the degree values of the nodes, and further studied the sub-networks with degree values ≥ 9. From these interactions, we then screened out the pairs with opposite expression trends, selected a total of 123 interactions between 18 DEMs and 14 DEGs (Fig. 13.2d), and listed them in Table 13.3. Among the 14 target genes, MDM2, a target gene of hsa-let-7b-5p, hsa-miR-126-3p, hsa-miR-1268a, hsa-miR-193a-3p, hsa-miR-422a, hsa-miR-423-3p, and hsa-miR-532-5p, has been confirmed to be related to NSIP, but the regulatory effects of these four microRNAs on MDM2 in NSIP have not been reported in the literature. Studies have shown that, compared with normal lung parenchyma, MDM2 is significantly upregulated in the epithelial cells of IPF and NSIP patients (Nakashima et al. 2005). In addition, CEP128, a target gene of hsa-let-7b-5p, hsa-miR-1268a, hsa-miR-193a-3p, hsa-miR-20a-5p, hsa-miR-30d-5p, hsa-miR-345-5p, hsa-miR-422a, and hsa-miR-532-5p, is a pathogenic factor in autoimmune thyroid diseases (Wang et al. 2019).
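The selection of DEG-DEM pairs can be sketched by combining the prediction score cutoff, the overlap with the differentially expressed sets, and the opposite-trend rule. The microRNA names, gene names, scores, and logFC values below are hypothetical, and the degree-based sub-network step is left out for brevity.

```python
def opposite_trend_pairs(predictions, dem_logfc, deg_logfc, min_score=0.95):
    """Select predicted microRNA-gene pairs with opposite expression trends.

    predictions is a list of (microRNA, gene, score) tuples. A pair is kept
    when score > min_score, the microRNA is a DEM, the gene is a DEG, and
    their log fold changes have opposite signs.
    """
    pairs = []
    for mirna, gene, score in predictions:
        if score <= min_score:
            continue
        if mirna not in dem_logfc or gene not in deg_logfc:
            continue
        if dem_logfc[mirna] * deg_logfc[gene] < 0:
            pairs.append((mirna, gene))
    return pairs

# Hypothetical prediction scores and differential expression results.
predictions = [
    ("miR_a", "GENE_1", 0.99),
    ("miR_a", "GENE_2", 0.97),
    ("miR_b", "GENE_1", 0.90),
    ("miR_c", "GENE_3", 0.98),
]
dem_logfc = {"miR_a": -1.4, "miR_b": 1.1}
deg_logfc = {"GENE_1": 2.0, "GENE_2": -1.6}
pairs = opposite_trend_pairs(predictions, dem_logfc, deg_logfc)
```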

Table 13.3 The list of microRNAs with differential expression and the predicted target genes with an opposing expression trend

Based on the above research results, we have identified the interaction relationship between 18 DEMs and 14 DEGs associated with NSIP, which has not been reported yet. Of the 14 NSIP-related DEGs, MDM2 has been shown to be related to NSIP in previous studies (Chen et al. 2017; Wurz and Cee 2019). Therefore, the interaction relationship between 18 DEMs and 14 DEGs selected in this study, especially 4 interaction relationships of MDM2, may provide new ideas for the research of NSIP.

13.3.2 Identification of Genes Associated with Open Regions of Chromatin and Super-enhancers in Lung Adenocarcinoma

The mRNA-microRNA interaction network can explain some changes in gene expression. The causes of such changes are also often explored by identifying enhancers, super-enhancers, and open regions of chromatin. The presence of super-enhancers and open chromatin regions generally leads to the upregulation of the corresponding genes (Buenrostro et al. 2015; Peng and Zhang 2018). Super-enhancers are generally identified by analyzing H3K27ac ChIP-seq data (Jiang et al. 2017). The general analysis flow of super-enhancer identification is shown in Fig. 13.3a.

Fig. 13.3

Identification of genes associated with open regions of chromatin and super-enhancers in lung adenocarcinoma. (a) The process for identifying super-enhancers. (b–d) The identification results of super-enhancers in A549, Calu-3, and IMR-90 cell lines, respectively. (e, f) The distribution of super-enhancers and open regions of chromatin on chromosomes, respectively. (g–k) The H3K27ac signal near the five genes EFNA5, HAVCR1, ATP1B1, DUSP4, and IGF2BP3 in different cell lines. (l–p) Results of survival analysis of the five genes. The results suggested that DUSP4 and IGF2BP3 were significantly associated with prognosis

The potential regulatory genes of a super-enhancer can be identified by annotating the genes within 50 kb upstream and downstream of the super-enhancer. Figure 13.3b–d shows the results of ChIP-seq analysis of the lung adenocarcinoma cell lines A549 and Calu-3 and the lung fibroblast cell line IMR-90. The ChIP-seq data of the A549 cell line were derived from the Encyclopedia of DNA Elements (ENCODE) Project (Consortium EP 2012); the GEO accession numbers are GSE91337 and GSM2421889. The ChIP-seq data of the Calu-3 cell line came from the GEO database; the GEO accession numbers are GSM1548075 and GSM1548073 (Fossum et al. 2014). The ChIP-seq data of the IMR-90 cell line were derived from the Encyclopedia of DNA Elements (ENCODE) Project (Consortium EP 2012); the GEO accession number is GSE16256 (Lister et al. 2009; Hawkins et al. 2010; Bernstein et al. 2010; Lister et al. 2011; Schultz et al. 2015; Micheletti et al. 2017; Rajagopal et al. 2013).
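The 50 kb annotation rule can be sketched as an interval overlap test. The coordinates and names below are hypothetical, and the sketch assumes that a gene counts as a candidate when any part of its body overlaps the extended window; annotating by transcription start site would be an equally reasonable choice.

```python
def genes_near_super_enhancers(super_enhancers, genes, flank=50_000):
    """List the genes that overlap each super-enhancer extended by flank bp.

    super_enhancers and genes both map a name to (chromosome, start, end).
    """
    hits = {}
    for se_id, (chrom, start, end) in super_enhancers.items():
        window_start = max(0, start - flank)
        window_end = end + flank
        hits[se_id] = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and g_start <= window_end and g_end >= window_start
        ]
    return hits

# Hypothetical coordinates for one super-enhancer and three genes.
super_enhancers = {"SE1": ("chr1", 100_000, 120_000)}
genes = {
    "GENE_X": ("chr1", 160_000, 180_000),
    "GENE_Y": ("chr1", 200_000, 210_000),
    "GENE_Z": ("chr2", 110_000, 115_000),
}
hits = genes_near_super_enhancers(super_enhancers, genes)
```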

The process of identifying open regions of chromatin with ATAC-seq is similar to that of ChIP-seq, but data quality control is required. The ATAC-seq data of the A549 cell line came from the Encyclopedia of DNA Elements (ENCODE) Project (Consortium EP 2012); the GEO accession number is GSE114202. Based on the differential expression results between lung adenocarcinoma and normal controls from the TCGA database (Tomczak et al. 2015), we finally screened out five genes: EFNA5, HAVCR1, ATP1B1, DUSP4, and IGF2BP3, among which DUSP4 and IGF2BP3 are associated with prognosis. The prognostic analysis results were obtained from UALCAN (http://ualcan.path.uab.edu/) (Chandrashekar et al. 2017) (Fig. 13.3l–p).

13.4 Disease Categories Based on Molecular Networks

In the past, clinical phenotypes have been an important basis for distinguishing disease subtypes. Now, the concept of the molecular subtype provides a new research idea for the diagnosis and treatment of diseases. Here, we present a case study of molecular subtypes associated with immune genes in COPD. Chronic obstructive pulmonary disease (COPD) is a form of chronic bronchitis or emphysema characterized by blocked airflow. If not treated, it often develops into pulmonary heart disease or respiratory failure (Blanchette et al. 2014; Kim et al. 2017). With the increase of air pollution, the incidence of COPD is increasing, but its mechanism is still not fully understood. Currently, COPD is still diagnosed and treated based on simple clinical presentation (degree of airflow limitation, symptoms, frequency of exacerbations, etc.). With the popularization of the concept of precision medicine, treating patients according to their individual differences has become a general trend (Zhang et al. 2018; Hogg et al. 2004). Reclassification of COPD is essential for developing more effective new treatments or optimizing existing ones. Bioinformatics technology, together with the large amount of existing high-throughput data, is therefore needed to redefine and interpret this multi-level information. Two new research strategies (systems biology and network medicine) have the potential to provide new perspectives on the pathology of COPD. Our research has found that an immune-based COPD classification can be used as an auxiliary reference for clinical treatment, which is helpful to the advancement and development of precision medicine.

Common test items for COPD patients are FEV1 (forced expiratory volume in 1 second) (Chuang and Lin 2019), FVC (forced vital capacity) (Chuang and Lin 2019), emphysema (F-950), DLCO (Hao et al. 2019), etc. DLCO measures the diffusing capacity of the lungs for carbon monoxide. FEV1 is the volume of air that can be forcibly exhaled within 1 s after a full inspiration. FVC is the maximum volume of air that can be exhaled as quickly as possible after inhaling as much as possible. Emphysema (F-950) is an index that quantifies emphysema on CT images by using the density mask method to calculate the voxel fraction of the lung (Radder et al. 2017). These indicators play an important role in the clinical diagnosis of COPD.

Although many articles have reported the influence of immune genes and pathways on COPD, no study has yet classified COPD according to the immune gene expression patterns of patients’ lung tissues. COPD is a complex disease driven by a combination of genes; because the gene combinations differ greatly between patients, COPD is highly heterogeneous. Immune-based COPD classification may be used as an auxiliary reference for clinical treatment, which is conducive to the advancement and development of precision medicine.

The expression data of COPD (GSE47460) (Peng et al. 2016; Anathy et al. 2018; Kim et al. 2015; Yu et al. 2018; Tan et al. 2016) were downloaded from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/gds/). We excluded the whole lung homogenate samples labeled as interstitial lung disease or at risk and selected 139 COPD whole lung homogenate samples with different GOLD stages for analysis. The immune gene list was downloaded from the ImmPort database (https://www.immport.org/: SDY1205, DOI: 10.21430/M37N6PJEQT).

Using the list of immune genes, the expression data of all immune genes were extracted from the dataset and subjected to consensus clustering with the R package ConsensusClusterPlus (Wilkerson and Hayes 2010). According to the results of the first consensus clustering, we pre-classified the samples into 2 categories, 68 subtype I and 71 subtype II (Fig. 13.4a). Differential gene analysis of the 2 subtypes with the R package Limma (Ritchie et al. 2015) (Fig. 13.4b) yielded 158 differentially expressed immune genes. Based on these 158 immune genes, consensus clustering was carried out a second time, giving 69 subtype I and 70 subtype II samples (Fig. 13.4c). A total of 134 immune genes were differentially expressed between the 2 subtypes. A third consensus clustering on these 134 immune genes gave the final subtype grouping of 70 subtype I and 69 subtype II samples. A total of 131 immune-related differentially expressed genes were found between the 2 subtypes.

Fig. 13.4

Molecular subtype analysis of chronic obstructive pulmonary disease. (a–c) The results of the three rounds of consensus clustering. (d, e) Principal component analysis (PCA) of immune-related genes. PCA results confirm that the molecular subtypes obtained in this study are not caused by batch effect. (f, g) GO and KEGG enrichment analysis results of differentially expressed genes between subtype I and the control group. (h) GO enrichment analysis result of differentially expressed genes between subtype II and the control group; no KEGG pathway was significantly enriched

The R package SigClust (Huang et al. 2012) was used to evaluate the clustering results, and the clustering significance p-values obtained for the two subtypes are shown in Table 13.4. The p-value of the third clustering is the smallest.

Table 13.4 Clustering significance between the two subtypes of the third consensus clustering

In general, the data in the series matrix files of the GEO database have already been preprocessed. Nevertheless, we checked whether a batch effect was present. To ensure that the intergroup differences we analyzed were not due to a batch effect, we queried the sample records one by one in the GEO database and obtained the batch information of the 139 COPD samples we used. We then performed principal component analysis (PCA) on the total expression data of the 139 COPD samples and on the 131 immune genes differentially expressed between subtypes I and II in the third consensus clustering. Batch or subtype information was labeled in three-dimensional scatter plots to check whether the differential gene expression between subtypes was caused by a batch effect and whether the clustering results and intergroup differences were independent of batch. As shown in Fig. 13.4d–e, the grouping by batch is inconsistent with the grouping by subtype, which indicates that the subtypes we obtained are not caused by a batch effect.

To determine the differences between each subtype and the normal control group, we used p < 0.01 and a fold change of 0.5 as thresholds to obtain differentially expressed genes, and functional enrichment analysis was performed on the results with the R package clusterProfiler (Yu et al. 2012). As shown in Fig. 13.4g, subtype I was significantly enriched in immune-related pathways, while subtype II was not. This suggests that immune subtype I is more immune-dependent than immune subtype II.

We then investigated the relationship between the two immune subtypes and the clinical data. We performed chi-square tests on the GOLD stages in the two subtypes. The results showed that the proportion of GOLD III and GOLD IV patients among subtype I patients was significantly higher than that among subtype II patients (Table 13.5).
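For readers who want to see the arithmetic behind such a test, the following sketch computes the Pearson chi-square statistic for a 2 × 2 table, without continuity correction, and the p-value for one degree of freedom. The counts are purely hypothetical toy numbers and are not the values in Table 13.5.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test for a 2 x 2 contingency table.

    table is ((a, b), (c, d)). No continuity correction is applied.
    For one degree of freedom the upper-tail p-value equals
    erfc(sqrt(statistic / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: rows are subtypes, columns are severe and mild stages.
toy_table = ((30, 20), (15, 35))
statistic, p_value = chi_square_2x2(toy_table)
```

With larger tables or small expected counts, a statistics package should be used instead of this hand calculation.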

Table 13.5 Chi-square test results between clinical data and subtypes

We used R package WGCNA (Langfelder and Horvath 2008) to further investigate the genes that play a key role in the division of molecular subtypes. All the genes in the dataset were included in the analysis so as not to miss out on key information. The results showed that the turquoise module was significantly correlated with the molecular subtypes (Fig. 13.5c). According to gene significance and module membership (Fig. 13.5d), we screened out 11 key genes for subtype classification (Fig. 13.5e–g).
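The hub gene screening step can be sketched as a filter on gene significance (GS) and module membership (MM), using the cutoffs given in the caption of Fig. 13.5 (GS greater than 0.4 and MM greater than 0.9). The gene names and values are hypothetical, and taking absolute values is an assumption made so that negatively correlated genes are treated in the same way.

```python
def select_hub_genes(gene_stats, gs_cutoff=0.4, mm_cutoff=0.9):
    """Return the genes whose |GS| and |MM| both exceed the cutoffs.

    gene_stats maps a gene name to a (GS, MM) tuple for one module.
    """
    return sorted(
        gene
        for gene, (gs, mm) in gene_stats.items()
        if abs(gs) > gs_cutoff and abs(mm) > mm_cutoff
    )

# Hypothetical GS and MM values for genes in one module.
gene_stats = {
    "GENE_1": (0.55, 0.95),
    "GENE_2": (0.35, 0.97),
    "GENE_3": (0.60, 0.80),
    "GENE_4": (-0.50, 0.93),
}
hub_genes = select_hub_genes(gene_stats)
```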

Fig. 13.5

Weighted gene co-expression network analysis and hub gene screening. (a) The horizontal axis is the soft threshold (power), and the vertical axis is the evaluation parameter of the scale-free network. The higher the value is, the more closely the network conforms to the scale-free property. (b) Gene clustering results. (c) Results of association between gene co-expression modules and clinical phenotypes. The thresholds were a correlation coefficient greater than 0.5 and a p-value less than 0.05. (d) Correlation between genes and modules and phenotypes in the turquoise module. Genes with GS greater than 0.4 and MM greater than 0.9 were selected as hub genes. (e–g) The difference and significance of hub genes’ expression in the two subtypes and the control group

Through bioinformatics and computational analysis, we have determined the possible set of mutations associated with immunity, as well as genes, cell types, and biological pathways. Our analysis provides further support for the genetic susceptibility and immune heterogeneity of COPD. We identified the characteristics of each subtype of COPD, which may provide new insights into the biological mechanisms that drive disease progression. Studying the use of these endotypes and biomarkers may be helpful for the diagnosis and treatment of COPD and the development of precision medicine.

13.5 Conclusion

Gene sequencing technology helps doctors diagnose patients whose symptoms have no clear cause. However, it is often difficult to obtain answers quickly from the large amount of data generated, and molecular bioinformatics helps to address this problem. Most diseases are not caused by a single genetic defect but by the interaction of many different genes. Gene expression products such as RNA and proteins interact with other proteins and metabolites in the cell to form the signal regulation network of a disease. Mutations do not always occur at exactly the same site, but some occur in genes on the same signaling pathway. Gene expression can be changed by the environment, and such changes can give rise to specific disease subtypes or endotypes. Many interventions in experimental models cannot be completely reproduced in the human body; molecular bioinformatics therefore provides a way to explore the molecular complexity of a particular disease, to identify disease pathways and modules, and to explore the molecular connections between different phenotypes. Molecular bioinformatics thus has the potential to discover new disease genes, reveal the biological importance of disease-associated mutations, and identify complex diseases, drug targets, and biomarkers (Agusti et al. 2017). The rapid development of molecular bioinformatics provides new ideas for the diagnosis and treatment of diseases. Precision medicine is defined as treatment tailored to the individual needs of patients, which distinguishes specific patients from other patients with similar clinical manifestations based on genes, biomarkers, phenotypes, or psychosocial characteristics. Bioinformatics can often reduce research costs and deliver results quickly and effectively by computing on large numbers of samples, summarizing patterns, and associating them with phenotypes. It helps precision medicine enter the primary medical system.

For primary hospitals, simpler methods are more conducive to the promotion of bioinformatics technology. As bioinformatics tools become more and more accessible, learning them becomes less complex and they can be mastered quickly through short training. Moving from conventional clinical practice to precision medicine offers a more effective and safer way to treat patients than existing treatment methods, and it has promising development prospects for primary medical institutions. Most of the training and research related to bioinformatics take place in high-income areas and resource-rich medical institutions, while in primary medical institutions, bioinformatics technology cannot be popularized because of limited funds and talent. It is becoming more and more urgent to help primary medical institutions train professionals in the field of bioinformatics. Our article offers an important perspective: molecular bioinformatics can be used in hospitals, and the basic approach we describe is clinically achievable. By learning the methods involved in our research, the personnel of primary medical institutions can use existing resources to re-analyze published data, which helps to build a new understanding of the disease. In addition, the increasing popularity of cloud resources and the availability of online training materials provide excellent opportunities for researchers in primary medical institutions with limited resources. Researchers in primary medical institutions can use cloud resources to analyze large omics datasets, which can reduce the differences caused by equipment shortages to some extent (Mangul et al. 2019). The development of bioinformatics in primary medical institutions is conducive to discovering locally relevant genetic abnormalities.

In biomedical and life sciences research, bioinformatics is essential for explaining treatments and making high-throughput omics data meaningful. In the process of disease recognition, diseases are often diagnosed and treated according to phenotypes. When molecular bioinformatics was used to classify ovarian cancer, it was found that the FGF pathway, a pathway related to tumor proliferation and angiogenesis, plays a significant role in one of the subtypes of ovarian cancer (Hofree et al. 2013). The subtype of liver cancer that overexpresses seven hub genes may be associated with reduced overall survival in patients (Li et al. 2021). According to data retrieved from public databases, bladder cancer is divided into two main molecular subtypes, the basal type and the differentiated type, and basal type tumors are associated with a shorter survival period (Volkmer et al. 2012). Cancer involves not only individual mutations but also dysregulation of multiple pathways governing fundamental cell processes such as cell proliferation and apoptosis (Kreeger and Lauffenburger 2010). A growing number of studies have successfully integrated database resources with molecular data to map the signaling networks of cancer. By applying bioinformatics analysis to molecular signaling networks, we can subdivide a set of tumor mutations into different subtypes according to their biological and clinical information. These subtypes are different from those classified by other clinical markers that are well known to be associated with survival. The subtypes may provide new insight into the biological mechanisms driving disease progression.

As molecular bioinformatics becomes integrated into clinical treatment (Seiler et al. 2017), molecular subtypes will become critical for determining the intrinsic features of many diseases. Heterogeneity is a major challenge for promoting precision medicine. If molecular bioinformatics is applied to clinical practice, the treatment and prognosis of diseases can be substantially improved. We hope that the integration of molecular bioinformatics and multi-omics data will enable patients to receive more accurate, effective, and safe treatments.