1 Introduction

Interstitial lung disease (ILD) is an umbrella term that encompasses about 300 parenchymal pulmonary disorders, resulting from varying degrees of inflammation or fibrosis in the lung interstitium, that is, the septum between alveoli and the blood capillaries. A schematic diagram of a healthy vs. ILD lung is shown in Fig. 1.

Fig. 1
An illustration of a healthy lung and an ILD lung. The healthy lung has normal alveoli, whereas the ILD lung has fibrosing alveoli with a scarred interstitium that obstructs the exchange of oxygen and carbon dioxide.

Healthy lung vs. interstitial lung disease (created using BioRender.com)

Numerous studies across the globe have reported the incidence, prevalence, and relative frequency of ILD. The annual incidence of ILD varies between 1 and 31.5 per 100,000 [1]. The incidence and prevalence vary among populations, likely due to differences in study design, data collection, and incorrect recognition of the disease subtypes [2]. ILD is classified based on clinical, radiological, and histopathological features. The latest classification focuses on recognizing the underlying etiology since this often impacts both prognostication and management decisions. ILD mainly consists of disorders of known causes [collagen vascular disease, hypersensitivity pneumonitis (HP)] as well as disorders of unknown/idiopathic causes [idiopathic interstitial pneumonia (IIP), sarcoidosis] [3]. ILD registries comprising patients from Western countries suggest that idiopathic pulmonary fibrosis (IPF) and sarcoidosis are the most common phenotypes. However, the ILD registry of India indicates HP to be the most common, which accounts for nearly 50% of all ILD cases [4].

The emerging field of metabolomics, in which many small molecules from body fluids or tissues can be identified, holds immense potential for early diagnosis, therapeutic monitoring, and understanding of disease pathophysiology. Over the past two decades, nuclear magnetic resonance (NMR) spectroscopy and gas chromatography (GC)/liquid chromatography (LC) coupled with mass spectrometry (MS), combined with chemometric analysis, have emerged as the principal analytical techniques in metabolomics. Several biofluids, including cerebrospinal fluid (CSF), bronchoalveolar lavage fluid (BALF), bile, seminal fluid, amniotic fluid, synovial fluid, gut aspirate, serum/plasma, saliva, exhaled breath condensate (EBC), and urine, contain hundreds to thousands of detectable metabolites and have been extensively studied so far [5]. More recently, metabolic profiling of intact tissue and of lipid and aqueous tissue extracts is gaining increasing importance for the detection of biomarkers.

Another branch of popular omic science is transcriptomics, which provides detailed information about gene regulation in normal and diseased conditions. Two key contemporary techniques commonly used for transcriptomic analysis are hybridization-based microarray techniques, which quantify a set of predetermined sequences, and next-generation sequencing (NGS), which uses high-throughput sequencing to capture all sequences [6]. In the last decade, these two transcriptomic approaches have been utilized most widely to understand the underlying disease pathogenesis at both molecular and genetic levels and also for molecular diagnosis and clinical therapy. Human biofluids including amniotic fluid, aqueous humor, ascites, bile, BALF, breast milk, CSF, colostrum, gastric fluid, pancreatic cyst fluid, plasma, saliva, seminal fluid, serum, sputum, stool, synovial fluid, sweat, tears, urine, and tissues are widely used for transcriptomic studies to identify biomarkers of several diseases [7,8,9].

2 Types of ILD

ILD, as mentioned earlier, refers to a group of lung diseases ranging from occasional self-limited inflammatory processes to severe debilitating fibrosis of the lung parenchyma. There are varied causes of ILD, which generally result from a range of environmental, occupational, recreational, or drug-related exposures or could arise from the various systemic autoimmune or connective tissue diseases (CTD) [10]. Classification of different types of ILD is shown in Fig. 2. A few of the common ILD subtypes are described in the present section.

Fig. 2
A flowchart of ILD includes IIPs, autoimmune ILD, HP, sarcoidosis, and other ILD. The IIPs include IPF, iNSIP, desquamative IP, RB-ILD, COP, acute interstitial pneumonia, and idiopathic pleuroparenchymal fibroelastosis. Other ILD include drug-induced ILD.

Classification of different types of interstitial lung disease (Cottin et al. 2018) [3]. ILD interstitial lung disease, IIP idiopathic interstitial pneumonia, IPF idiopathic pulmonary fibrosis, iNSIP idiopathic nonspecific interstitial pneumonia, RB-ILD respiratory bronchiolitis-associated ILD, COP cryptogenic organizing pneumonia, RA-ILD rheumatoid arthritis-associated ILD, SSC-ILD systemic sclerosis-associated ILD, HP hypersensitivity pneumonitis

2.1 Idiopathic Interstitial Pneumonia (IIP)

The cause of IIP, which comprises diffuse parenchymal lung diseases, remains unknown. IIP is characterized by varying degrees of inflammation and fibrosis in the lung interstitium. These characteristics split IIP into eight clinicopathologic entities, that is, IPF, nonspecific interstitial pneumonia (NSIP), cryptogenic organizing pneumonia (COP), acute interstitial pneumonia, respiratory bronchiolitis-associated interstitial lung disease, desquamative interstitial pneumonia, lymphoid interstitial pneumonia, and idiopathic pleuroparenchymal fibroelastosis [11]. Among all IIPs, IPF is the most common phenotype and is characterized by fibroblastic foci and the presence of inflammation and honeycombing in the lung parenchyma.

2.2 Autoimmune ILD

Autoimmune ILD is caused specifically by autoimmune disorders, which involve the body’s immune system attacking the lungs. This ILD group gradually develops and emerges over a long period of time. The symptoms of this ILD include difficulty in breathing, dry cough, and shortness of breath. Connective tissue disease-related ILD (CTD-ILD), rheumatoid arthritis-associated ILD (RA-ILD), and systemic sclerosis-associated ILD (SSC-ILD) are the common types of autoimmune ILD [12, 13].

2.3 Hypersensitivity Pneumonitis (HP)

HP, also referred to as extrinsic allergic alveolitis, is a complex subtype of ILD arising from repeated exposure to certain antigens, most commonly avian, microbial (especially molds), or chemical. HP is the third most prevalent ILD after IPF and CTD-ILD. The inhaled antigen triggers type III and type IV hypersensitivity reactions, which damage alveolar epithelial cells. An impaired repair mechanism may result in fibroblast activation, deposition of collagen and extracellular matrix, and destruction of the parenchymal architecture [14]. The major forms of HP are acute, subacute, and chronic. Acute and subacute HP are mainly characterized by influenza-like symptoms, such as cough, dyspnea, and fever, developing 2–9 h after antigen exposure. The chronic form of HP arises from repetitive, low-level exposure to the causative agent. Still, the identity of the causative antigen may remain unknown in more than half the cases. Chronic HP patients slowly develop fibrosis in the lung interstitium, which is associated with a significantly high mortality rate [15].

2.4 Sarcoidosis

Sarcoidosis is a systemic inflammatory disease of unknown origin. A chronic immune response to an unidentified antigen may lead to sarcoidosis in genetically susceptible subjects. Almost 90% of sarcoidosis patients have pulmonary involvement. Dry cough, chest tightness, chronic dyspnea on exertion, shortness of breath, wheezing, hypoxemia, and decline in pulmonary function are the common signs and symptoms of sarcoidosis. Approximately 20% of sarcoidosis patients develop pulmonary fibrosis, that is, stage IV sarcoidosis, which is associated with high mortality [16].

2.5 Occupational and Environmental Exposure-Related Other ILDs

Long-term exposure to occupational or environmental antigens could cause certain types of ILD via pulmonary and systemic inflammation and oxidative stress. Many types of mineral dust, such as silica, asbestos, beryllium, coal mine dust, and metal dust, as well as organic dust, including mold spores, can also affect the lung airways, either by direct injury or through reactive oxygen species. Common conditions include asbestosis, which is associated with asbestos fibers, and silicosis, which is caused by free crystalline silicon dioxide or silica particles [17, 18].

3 Metabolomics: An Emerging Tool in Clinical Research

Metabolomics, one of the newest omic sciences, is an evolving field in clinical research. Metabolomics is the scientific study of the metabolic fingerprints that cellular processes leave behind in a biological sample [19]. It provides a snapshot of the metabolic state of an individual at a given point in time. On the other hand, “metabonomics,” a term first coined by Jeremy Nicholson, refers to “the quantitative measurement of the dynamic multiparametric metabolic response of living systems to pathophysiological stimuli or genetic modification” [20, 21]. The terms “metabolomics” and “metabonomics” are often used interchangeably. Among the different omic approaches, metabolomics is considered to best capture and depict the molecular phenotype of health and disease [22]. Thus, it is increasingly becoming a useful and powerful tool for the investigation of complex diseases with unclear etiology, enabling the discovery of novel biomarkers, which, in turn, aid in the prevention and early diagnosis of diseases. Metabolomics can also monitor the effect of pharmacotherapy, allowing clinicians to choose the best treatment option for patients suffering from potentially devastating disorders. The two popular analytical techniques, MS and NMR spectroscopy, have their own advantages and disadvantages. While mass spectrometry can analyze a wider range of metabolites and is more sensitive, it results in the destruction of the analyzed sample. NMR spectroscopy, on the other hand, is highly reproducible and does not destroy the sample; however, its sensitivity is limited [23, 24]. Over the years, the application of metabolomics to disease has grown rapidly, and recent studies exploring the metabolic profiles of various human samples, including but not limited to plasma, serum, urine, BALF, exhaled breath, saliva, and tissues, bring this technology closer to the patients’ bedside, thereby enhancing its clinical utility. A schematic representation of the metabolomic workflow is shown in Fig. 3.

Fig. 3
A workflow displays the following steps: biological samples, data acquisition, data analysis and metabolite identification, and biological interpretation and pathway analysis to construct biological models.

Schematic representation of the metabolomic workflow (created using BioRender.com)

3.1 Metabolomics in ILD

Several attempts have been made to understand the metabolic status of ILD patients and identify prospective biomarkers in lung tissues and various body fluids using a nontargeted and targeted metabolomic approach. Studies utilizing the metabolomic approaches to investigate ILD are summarized in Table 1.

Table 1 A summary of studies exploring different types of ILD in humans using metabolomic approach

4 Transcriptomics: A Promising Omic Approach

Transcriptome analysis utilizes high-throughput methods to study the complete set of RNA transcripts produced by the genome under specific circumstances. It covers all types of transcripts, including mRNAs, miRNAs, and different types of long noncoding RNAs (lncRNAs). Transcriptome analysis gives us an overview of all genes’ expression levels and enables us to understand the physiology of the cell. More precisely, it also discloses key regulations of biological processes triggering diseases. While microarrays are generally less complex and easier to use than NGS, the latter is associated with greater flexibility, high throughput, and high discovery potential. A schematic representation of the transcriptomic workflow is shown in Fig. 4.
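
To make the idea of comparing expression levels between normal and diseased conditions concrete, the following minimal Python sketch labels genes as up- or downregulated from a log2 fold change. It is an illustration only: the gene names, expression values, pseudocount, and the twofold threshold are assumptions and are not taken from any study cited here. Real microarray or NGS pipelines additionally apply normalization, a statistical test, and multiple-testing correction.

```python
from math import log2
from statistics import mean


def log2_fold_change(case_values, control_values, pseudocount=1.0):
    """Log2 ratio of mean expression in cases versus controls.

    The pseudocount (an assumed value) avoids division by zero for
    genes with no detected expression.
    """
    return log2((mean(case_values) + pseudocount) /
                (mean(control_values) + pseudocount))


def classify_genes(expression, threshold=1.0):
    """Label each gene as 'up', 'down', or 'unchanged'.

    expression maps a gene name to (case_values, control_values).
    threshold=1.0 corresponds to a twofold change (assumed cut-off).
    """
    calls = {}
    for gene, (case_values, control_values) in expression.items():
        lfc = log2_fold_change(case_values, control_values)
        if lfc >= threshold:
            calls[gene] = "up"
        elif lfc <= -threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


# Hypothetical expression values for three placeholder genes.
toy_expression = {
    "gene_a": ([30.0, 34.0], [7.0, 7.0]),
    "gene_b": ([3.0, 3.0], [15.0, 15.0]),
    "gene_c": ([10.0, 12.0], [11.0, 11.0]),
}
toy_calls = classify_genes(toy_expression)
```

A fold-change cut-off alone does not account for variability between subjects, which is why it is usually combined with a significance criterion.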

Fig. 4
A flow chart illustrates the following steps: biological samples, data acquisition, data analysis and differentially expressed gene identification, and biological interpretation analysis to construct the Cytoscape network.

Schematic representation of the transcriptomic workflow (created using BioRender.com)

4.1 Transcriptomics in ILD

Various studies have been performed to understand the transcriptomic signatures of ILD patients and identify prospective biomarkers in lung tissues and various biofluids using NGS and microarray techniques. Despite increasing interest and effort invested by clinicians and scientists during the last decade, the etiology of ILD remains elusive and controversial.

As mentioned earlier, IPF is characterized by remodeling or scarring of the airway epithelium. Activated myofibroblasts, which produce extracellular matrix (ECM), play a key role in the process of fibrotic tissue remodeling. Advances in transcriptomic techniques have allowed high-throughput analysis and discovery of gene deregulation in IPF. Several studies using lung tissues have reported that IPF is associated with altered expression levels of genes such as CCL8 [42], CXCL14 [43], CXCL4 and CXCL12 [44], NOTCH2 [45], TGF-β1 and RhoA kinase [46], REVERBα [47], IL-1β [48], FLIL33 and POU2AF1 [49], FOXL1 [50], COL6A3, and POSTN [51]. Microarray analysis of peripheral blood by Abe et al. (2020) has shown dysregulated PDGF B, VEGF B, and FGF 2. The authors confirmed their findings using ELISA, western blot, immunofluorescence, and 3H-thymidine uptake assays. Xia and co-workers (2021) recently utilized weighted gene co-expression network analysis (WGCNA) of BALF samples and could associate four genes, TLR2, CCR2, HTRA1, and SFN, with disease prognosis.

Pathway enrichment analysis based on dysregulated genes highlights the associated biological pathways, molecular functions, and cellular components. This method identifies the biological pathways that are enriched in a gene list more than would be expected by chance. The KEGG pathway tool maps the pathways associated with dysregulated genes in a specific disease. Pathway enrichment analysis of IPF patients revealed that the differentially expressed genes were mainly associated with myofibroblast differentiation and massive ECM deposition. The transcriptomic signatures of fibroblasts suggest that characterization of lung proteins, specifically the fibrotic lung ECM, helps determine its composition and define targetable molecules for advanced stages of fibrosis. Boesch and his team (2020) isolated fibrosis-specific mesenchymal stem cell-like cells from lung tissue of IPF subjects and observed that the differentially expressed genes were enriched for hypoxia, fibrosis, and bacterial colonization factors, which are the typical hallmarks of pulmonary fibrosis. They found that the cells isolated from IPF patients express genes associated with the canonical TGF-β, HIPPO/YAP, PI3K/AKT, p53, and WNT signaling cascades, which are activated in an integrated network. Another interesting study by Hsu and co-authors (2011) suggested that IPF lungs are enriched in fibrosis-related genes, insulin-like growth factor signaling, and caveolin-mediated endocytosis. This microarray analysis also highlighted the common molecular signatures between lung tissue and fibroblasts of these patients.
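
The notion of a pathway being enriched "more than would be expected by chance" is commonly formalized as an overrepresentation test based on the hypergeometric distribution. The sketch below shows that calculation in Python for a single pathway. It is a generic illustration, not the procedure of any specific tool or study cited above, and the numbers in the example are invented.

```python
from math import comb


def enrichment_p_value(universe_size, pathway_size, list_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe_size: number of genes that could have been detected
    pathway_size:  number of those genes annotated to the pathway
    list_size:     number of dysregulated genes in the study
    overlap:       dysregulated genes that fall in the pathway
    """
    max_overlap = min(pathway_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 20 measurable genes, 5 in the pathway,
# 4 dysregulated genes, 3 of which belong to the pathway.
toy_p = enrichment_p_value(universe_size=20, pathway_size=5,
                           list_size=4, overlap=3)
```

Because many pathways are tested at once, the resulting p-values are normally adjusted for multiple testing before any pathway is reported as enriched.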

Like IPF, HP is associated with matrix remodeling and formation of fibrosis. Only two studies have used transcriptomics to explore genetic alterations in HP. Sarcoidosis, as mentioned earlier, is an immune-mediated multisystem disease characterized by the formation of non-caseating granulomas. Multiple pro-inflammatory signaling pathways, including IFN-γ/STAT-1, IL-6/STAT-3, and NF-κB, have been implicated in mediating macrophage activation and granuloma formation in sarcoidosis. Utilizing RT-PCR, Christophi et al. (2014) have demonstrated that IL-6, COX-2, MCP-1, IFN-γ, T-bet, IRF-1, Nox2, IL-33, and eotaxin-1 hold potential for differential diagnosis between sarcoidosis, suture, and fungal granulomas. In another recent study, Lepzien and co-workers (2021) have shown that allogeneic T cell proliferation increased after coculture with monocytes and dendritic cells of sarcoidosis patients. The authors also found that mainly T-bet- and RORγt-expressing T cells produce IFN-γ. Monocytes from sarcoidosis patients can activate and polarize T cells towards Th1 and Th17.1 cells. In a comparative study between sarcoidosis and IPF, cluster analysis of BALF cells showed elevated mRNA expression of genes associated with ribosome biogenesis in sarcoidosis patients. Clusters formed by genes with altered mRNA expression in patients with IPF could be implicated in cell migration and adhesion processes, metalloproteinase expression, and negative regulation of cell proliferation. Various studies highlighting the transcriptome fingerprints and associated pathways in different ILD subtypes are summarized in Table 2.

Table 2 A summary of studies exploring different types of ILD in humans using transcriptomic approach

5 Integration of Metabolomic and Transcriptomic Fingerprints

As mentioned earlier, clinical metabolomics is primarily used to identify low molecular weight compounds whose levels are altered in a particular disease. In contrast, transcriptomics identifies the complete set of dysregulated RNAs associated with a disease. Integration of metabolomic and transcriptomic signatures has emerged as a popular application-driven method for investigating underlying disease mechanisms, monitoring disease progression, and identifying potential biomarkers [101, 102]. The omic tools highlight alterations in genotype and phenotype and provide complementary information about genetic alterations, protein synthesis, metabolism, and cellular function. Pathways and network connections further reflect the association between key metabolites and candidate transcripts.

Biological pathway networks reveal hidden patterns in unstructured data by converting them into logically structured and visually evident representations, with nodes representing genes and metabolites, edges representing relationships between nodes, and clusters grouping nodes with similar chemical activities. VANTED [103], VisANT [104], IMPaLA [105], and Metscape2 [106] are some of the network-based visualization tools that interface with public databases. In addition, Arena3D allows users to view biological networks in three dimensions [107]. Interactive editing is frequently performed for small biological networks. However, for larger networks, automated layout tools such as Cytoscape [108], NAViGaTOR [109], and Cerebral [110] are more convenient. Alternatively, pathway visualization tools highlight the biochemical activities and different interactive pathways in experimental datasets. Pathguide offers an overview of nearly 190 web-accessible network databases and biological pathways [110]. Arakawa and his team have developed a pathway visualization tool for KEGG-based pathways. Users can capture systematic features of biological activity by visualizing pathways at the level of different omic data representations [111]. Paintomics, another software program, analyzes gene expression and metabolite concentration data and displays them on KEGG pathway maps [112]. ProMeTra can display dynamic data and accept annotated images in SVG format [113]. In plants, KaPPa-View and MapMan show the number of metabolites and transcripts for preset route blocks [114, 115]. Other tools like MAYDAY enable viewing expression data in a genomic context with any metadata [116], and PaVESy creates personalized pathways using proteins and metabolites provided by the user [117]. A schematic representation of the integrated metabolomic and transcriptomic workflow is shown in Fig. 5.

Fig. 5
3 diagrams of metabolic profile, transcriptomic profile, and data integration. The metabolites are identified in the metabolic profile. The genes are identified in the transcriptomic profile.

Schematic representation of integrated metabolomic and transcriptomic data (created using BioRender.com, STITCH database, and GraphPad Prism version 7)

In a recent study, our group used NMR coupled with chemometric analysis to identify the unique metabolic fingerprints in BALF of HP subjects. A total of six metabolites were found to be significantly altered in HP compared to non-HP controls [35]. Next, we considered NGS data of lung tissues from HP patients and controls, reported in the NCBI-GEO public database by Furusawa et al., and performed bioinformatic analysis. A total of 555 genes were dysregulated (373 upregulated and 182 downregulated) in HP cases. An interaction network between the six candidate metabolites and the most significantly altered genes (five upregulated and five downregulated) was established utilizing the Search Tool for Interactions of Chemicals (STITCH) database. The metabolite-gene interaction network generated by STITCH comprised 19 nodes connected via 16 edges. The clustering coefficient of the network was found to be 0.768 (protein-protein interaction enrichment p-value: 0.0838). Overall pathway overrepresentation analysis was performed by integrating the candidate metabolites and transcripts utilizing IMPaLA version 12. Glycolysis and phosphatidylinositol 3-kinase-protein kinase B (PI3K-AKT) signaling pathways emerged as most significantly associated with the pathogenesis of HP. These findings are encouraging, since the association of these pathways with chronic HP is well established. Since glycolysis is the key energy source driving myofibroblast differentiation and formation of fibrosis, perturbation of glycolysis seems likely [118]. The involvement of the PI3K-AKT pathway is also evidenced in bleomycin-induced pulmonary fibrosis. It is hypothesized that PI3K-AKT plays a central role in fibrosis development [119, 120]. Integrating the findings of the two omic platforms thus offers a novel insight into the pathogenesis of HP.
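
For readers unfamiliar with the clustering coefficient reported for the network, the following Python sketch computes one common variant, the average local clustering coefficient, on a small undirected graph. This is an assumption-laden illustration: the node labels and edges are invented placeholders, and the exact convention used by STITCH may differ from the definition shown here, so the sketch is not expected to reproduce the value of 0.768.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def local_clustering(adjacency, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = adjacency[node]
    degree = len(neighbours)
    if degree < 2:
        return 0.0  # convention assumed here for nodes with < 2 neighbours
    linked_pairs = sum(
        1 for u, v in combinations(neighbours, 2) if v in adjacency[u]
    )
    return 2.0 * linked_pairs / (degree * (degree - 1))


def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)


# Hypothetical metabolite-gene network: one triangle plus one pendant node.
toy_edges = [
    ("metabolite_A", "gene_1"),
    ("metabolite_A", "gene_2"),
    ("gene_1", "gene_2"),
    ("gene_2", "metabolite_B"),
]
toy_network = build_adjacency(toy_edges)
toy_coefficient = average_clustering(toy_network)
```

A value close to 1 indicates that the neighbours of a typical node tend to interact with one another, whereas a value close to 0 indicates a sparse, tree-like network.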

6 Challenges and Future Scope

Most of the omic-driven studies conducted on ILD so far have included a small number of patients, which is quite understandable considering that ILD is a severe condition with a short average life expectancy. Power and sample size estimation, however difficult, would be useful because a low sample size is associated with statistical errors, a risk of overfitting, and misleading conclusions. Since omic output is highly dynamic, clinical variables such as physiological status, age, gender, and treatment may influence the findings. Hence, baseline characteristics of recruited ILD subjects need to be closely matched. Lack of a rigorous subject selection approach could also result in discovering markers that are not exclusive to ILD subtypes. It is observed that only a few groups have included healthy controls in their omic-driven research on ILD. Also, nonuniformity in including smokers and nonsmokers is frequently observed while comparing disease populations with healthy controls. This makes unbiased comparisons and conclusions impossible. A few groups were also unable to validate ILD candidate markers, which is crucial for biomarker identification. In fact, one of the main reasons why most of the omic-based disease markers identified so far have not made it to clinical practice is the lack of adequate validation trials. Another observation that warrants attention while using omics is that different research groups identify different biomarkers in the same biofluid for a particular disease. This is not surprising given that factors such as sampling methods, sample collection, handling and preparation, instrumentation, and data mining protocols tend to vary from one setup to another. To generate robust and reproducible data, the practices and procedures should be standardized and rigorously followed across all clinics and research laboratories.

Metabolic flux analysis is crucial to obtain insight into dysregulated cellular metabolism caused by disease perturbations. It is expected that stratifying ILD patients based on disease severity and subtypes will significantly improve metabolome and transcriptome coverage. Assessment of sensitivity, specificity, and clinical relevance of the differentially expressed molecules is also recommended. For a reliable and unbiased diagnosis of this severe pulmonary disease, large-scale, well-designed, multicentric clinical studies and recruitment of suitable controls are recommended.
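
As a rough illustration of the power and sample size estimation recommended above, the sketch below applies the standard normal-approximation formula for comparing the mean level of a single marker between two equally sized groups, n = 2((z_{1-α/2} + z_{1-β})/d)^2 per group, where d is the standardized effect size. The effect sizes, significance level, and power in the example are assumed values chosen for illustration, not recommendations for any particular ILD study.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-sided comparison.

    effect_size: standardized mean difference (Cohen's d), assumed known
    alpha:       two-sided significance level
    power:       desired probability of detecting the effect
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_beta = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Hypothetical scenarios: a moderate and a large difference between groups.
n_moderate = samples_per_group(effect_size=0.5)
n_large = samples_per_group(effect_size=1.0)
```

In omic studies, where hundreds of metabolites or thousands of transcripts are tested simultaneously, a much stricter significance level is typically required, which increases the number of subjects needed beyond what this single-marker approximation suggests.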

The ultimate focus of metabolomic and transcriptomic data integration is identifying key metabolic and genetic factors that contribute significantly to disease etiology. Integrated omics is more than a collection of tools; it is a comprehensive paradigm for interpreting multi-omic datasets in a way that can provide new insights into basic biology, as well as health and disease. Machine learning approaches for multi-omic data analysis are an emerging trend for exploring molecular pathways in detail and drawing a holistic representation of a given phenotype using all biological and clinical information of an individual. One major advantage is the ability to incorporate biological domain knowledge into the machine learning models as inductive biases to reduce overfitting. Additionally, as omic tools evolve, they need to be user-friendly, interoperable, and effective for computationally intensive analyses. Machine learning methods offer novel techniques to integrate such omic datasets. With the emerging precision medicine initiative, where disease prevention and management take into account the variability in genes, environment, and lifestyle of each individual in contrast to the conventional one-size-fits-all approach, integration of clinical data with the patients’ metabolome and genetic makeup will provide an in-depth understanding of disease pathophysiology and facilitate the design of targeted therapies for individuals, thereby revolutionizing precision medicine-based decision-making in the clinic.