Introduction

Idiopathic pulmonary fibrosis (IPF) is a chronic and progressive interstitial lung disease. IPF has an unclear etiology, a poor prognosis, and an unpredictable course [1,2,3]. The incidence and mortality rates of IPF are consistently increasing [4,5,6,7]. Existing antifibrotic treatments are not curative. Early and timely intervention is essential to preserve pulmonary function to the greatest extent possible [8, 9]. The heterogeneous disease course makes it difficult to assess IPF progression and to predict prognosis, while the importance of pharmacological intervention in the early stages makes such assessment all the more necessary. Therefore, it is necessary to explore accurate and sensitive medical imaging indicators to assess disease stage and guide physicians in early pharmacological intervention.

Early and effective antifibrotic treatment reduces the risk of respiratory-related hospitalization and death in IPF patients. Some studies found that patients with faster progression appear to benefit more than those with slower progression within 6–12 months of drug initiation [10]. Studies have also shown that, among patients treated with pirfenidone, transplant-free survival is significantly lower in those with advanced IPF than in those with mild to moderate IPF [11, 12]. Therefore, the ability to accurately monitor disease progression is of considerable clinical importance in the development of treatment plans for patients with IPF. Pulmonary function tests (PFTs), such as forced vital capacity (FVC) and diffusing capacity of the lung for carbon monoxide (DLCO), are considered the most standardized methods for estimating deterioration in IPF [13]. A greater than 10% reduction in FVC has been shown to predict higher mortality over the next 6 or 12 months [14, 15]. Compared to baseline, a 15% decline in DLCO is a sign of disease progression [13, 16]. Nevertheless, a measurement deviation of 10–15% for each PFT inevitably reduces its efficacy. The presence of emphysema weakens the impact of fibrosis on pulmonary function indices and conceals real disease progression [17,18,19,20,21]. Therefore, PFT indicators may not be useful or timely indicators of disease progression for clinicians [22, 23]. Recently, capturing the change in characteristic high-resolution computed tomography (HRCT) signs (e.g., ground glass opacity, reticulation, honeycombing) has been the preferred radiological method of estimating disease progression [20]. However, there is subjective variability when these signs are assessed by the radiologist alone. Even in professional practice, radiologists show only moderate agreement regarding honeycombing when it is combined with emphysema, traction bronchiectasis, and thin-walled large cysts [24,25,26]. Such a level of agreement could lead to errors in assessing disease progression based on HRCT signs.

Quantitative CT analysis (QCT) is one area of the rapidly developing field of radiomics. Radiomics can extract, analyze, and interpret data from medical images to aid disease diagnosis and prognosis [27]. With the extensive application of machine learning in the medical field, machine learning techniques have been used in QCT [27,28,29]. Machine learning is a branch of artificial intelligence. Given a set of data, a machine learning algorithm can extract the image features that are important for prediction or diagnosis. The features found in the training dataset can then be used to make predictions and classify future inputs [30, 31]. After classification, the results for the entire study can be summarized and statistically analyzed in terms of absolute volumes per region or percentages per feature. The Computer-Aided Lung Informatics for Pathology Evaluation and Rating (CALIPER) software, which can be used as a QCT method, was trained by an expert radiologist. Currently, CALIPER is frequently used in diagnostic or prognostic studies of chronic obstructive pulmonary disease (COPD) and IPF. CALIPER first segments and extracts the anatomical regions in HRCT. Then, through machine learning algorithms, it automatically labels the interstitial lung abnormalities of ILD subjects in 3D CT and quantifies the characteristic imaging signs of IPF [32, 33]. Finally, it calculates the percentage of total lung volume occupied by each characteristic imaging sign, such as ground glass opacity (GGO), reticulation (R), honeycombing (H), low attenuation areas, normal tissue, and vessel-related structures (VRS). The extent of these signs and their longitudinal change reflect disease condition [34].
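The final quantification step described above can be illustrated with a minimal sketch. It is not CALIPER code: the function name, the dictionary layout, and the example volumes are hypothetical, and total lung volume is assumed here to be the sum of all labeled categories. ILD extent is taken as the sum of GGO, R, and H, as defined later in this review.

```python
def pattern_percentages(volumes_ml):
    """Convert per-pattern volumes (mL) into percentages of total lung volume.

    Assumption: total lung volume is the sum of all labeled categories.
    """
    total = sum(volumes_ml.values())
    if total <= 0:
        raise ValueError("total lung volume must be positive")
    pct = {name: 100.0 * vol / total for name, vol in volumes_ml.items()}
    # ILD extent is the sum of the three fibrosis-related patterns.
    pct["ILD"] = pct["GGO"] + pct["R"] + pct["H"]
    return pct


# Hypothetical volumes for a single scan (illustrative values only).
example = {
    "normal": 3000.0,
    "GGO": 400.0,
    "R": 300.0,
    "H": 100.0,
    "low_attenuation": 100.0,
    "VRS": 100.0,
}
result = pattern_percentages(example)
print(f"ILD extent: {result['ILD']:.1f}% of total lung volume")
```

Expressing each pattern relative to total lung volume is what allows extents to be compared between patients and between time points for the same patient.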

Currently, several studies have examined the performance of CALIPER in disease surveillance. Different studies have had diverse research directions when examining the application of CALIPER for IPF. Some studies also had smaller sample sizes. In this systematic review, we reviewed literature reporting the use of CALIPER for analyzing IPF with respect to its disease progression, prognosis, and mortality. The benefits, limitations, and several future directions are also discussed in this review.

Methods

Search strategy

We searched the PubMed, Web of Science, Cochrane, Embase, Scopus, and CNKI databases up to February 28, 2022, to identify relevant studies. We applied no language restrictions. We used the following search terms: “CALIPER,” “Pulmonary Fibrosis,” “Idiopathic Pulmonary Fibrosis,” and “IPF” in combination with Boolean operators “OR” “AND.” The complete search strategy used in PubMed was as follows: (CALIPER) AND (((IPF) OR (“Pulmonary Fibrosis”[Mesh])) OR (“Idiopathic Pulmonary Fibrosis”[Mesh])). In addition, reference lists of all eligible studies about the topic were manually searched for relevant studies.

Study selection

The included studies met the following eligibility criteria: (1) the target disease was IPF; (2) the detection method was computed tomography (CT); (3) the CT patterns were evaluated by CALIPER; and (4) the study assessed the association of CALIPER with PFT and mortality.

The exclusion criteria were as follows: (1) duplicate publications; (2) reviews and conference abstracts; (3) studies for which the full text could not be obtained; and (4) studies in which valid data were not available.

Data extraction

The articles and relevant information were independently inspected and extracted by 2 investigators. The extracted information was as follows: first author’s name, publication year, study location, study design, sample size, patient characteristics (mean age), diagnosis criteria, endpoint definition, years of follow-up, number of events, statistical model used, and multivariate adjustments. The mean and standard deviation values of baseline PFT and CALIPER indices, as well as hazard ratios (HRs) and 95% confidence intervals (CIs) from survival analyses, were extracted. To minimize confounding, we used adjusted HRs from multivariate regression analyses whenever possible. Disagreements were resolved by discussion with a senior investigator, who re-examined the study in question together with the two investigators.

Quality assessment

Two reviewers independently used the Methodological Index for Non-Randomized Studies (MINORS) tool to evaluate the quality of the non-randomized studies. The MINORS tool comprises 12 items, the first eight of which apply to noncomparative studies; the maximum score is therefore 16 for noncomparative studies and 24 for comparative studies. Each item was scored from 0 to 2: a score of 0 indicated that the item was not reported in the article, 1 indicated that it was reported but insufficiently, and 2 indicated that it was reported sufficiently.
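The scoring scheme can be sketched as follows. The function and constant names are hypothetical and the item scores in the example are invented for illustration; they are not the scores assigned to any study in this review.

```python
NONCOMPARATIVE_ITEMS = 8   # first eight MINORS items
COMPARATIVE_ITEMS = 12     # all twelve MINORS items


def minors_total(item_scores, comparative=False):
    """Return (total score, maximum possible score) for one study."""
    expected = COMPARATIVE_ITEMS if comparative else NONCOMPARATIVE_ITEMS
    if len(item_scores) != expected:
        raise ValueError(f"expected {expected} item scores, got {len(item_scores)}")
    if any(score not in (0, 1, 2) for score in item_scores):
        raise ValueError("each item must be scored 0, 1, or 2")
    return sum(item_scores), 2 * expected


# Hypothetical noncomparative study: two items unreported, two insufficient.
total, maximum = minors_total([2, 2, 1, 2, 0, 1, 2, 0])
print(f"MINORS score: {total}/{maximum}")
```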

Statistical analysis

Owing to the substantial heterogeneity between studies, we deliberately did not perform a formal quantitative synthesis.

Results

Search results

The electronic literature search yielded 244 potential studies. After duplicates were removed with EndNote 20, 178 studies remained. Eight studies were excluded because the full text could not be obtained. After reviewing 170 full texts, 111 studies were excluded because the target disease was not IPF, 10 studies were excluded because the detection method was not computed tomography (CT), 25 studies were excluded because CT patterns were not evaluated by CALIPER, 5 studies were excluded because they did not examine the relationship with PFT, mortality, or outcome in clinical trials, and 8 studies were excluded because they were reviews or conference abstracts. One Chinese study was not included because its interpretation of the data was inadequate. Ultimately, 10 studies met the inclusion criteria. Figure 1 presents a flow diagram of the search.
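The selection counts reported above can be checked with simple arithmetic. The variable names in this sketch are ours; every number is taken directly from the preceding paragraph.

```python
identified = 244
after_deduplication = 178
full_text_unavailable = 8

duplicates_removed = identified - after_deduplication
full_texts_reviewed = after_deduplication - full_text_unavailable

# Exclusions after full-text review, by reason.
exclusions = {
    "target disease not IPF": 111,
    "detection method not CT": 10,
    "CT patterns not evaluated by CALIPER": 25,
    "no relationship with PFT, mortality, or outcome": 5,
    "review or conference abstract": 8,
}
inadequate_data_interpretation = 1

included = (
    full_texts_reviewed
    - sum(exclusions.values())
    - inadequate_data_interpretation
)
print(duplicates_removed, full_texts_reviewed, included)
```

The result agrees with the 170 full texts reviewed and the 10 studies finally included.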

Fig. 1 Flow diagram of selected papers, including the number of records identified, included, and excluded, and the reasons for exclusion

Study characteristics

Ten studies were included and evaluated. Most of the included studies used a single-center design (7 out of 10 studies). They were conducted in Europe or the USA (3 and 7, respectively). Of the ten studies, one was a prospective study, two were case–control studies, and seven were retrospective studies. The included studies were published from 2014 to 2021. Our analysis contained 1312 participants. In particular, 2 studies used the same data. The number of subjects taking part in these studies ranged from 38 to 531 (Table 1). Supplement Table 1 summarizes the inclusion and exclusion criteria for all studies. We summarize the duration of follow-up and number of events for each study in Table 2.

Table 1 Basic characteristics of patients in the included studies
Table 2 Adverse events by study

Five of the ten studies measured CALIPER-derived ILD, R, H, GGO, and VRS; three studies measured CALIPER-derived ILD and VRS; one measured CALIPER-derived R, H, GGO, and VRS; and one study measured CALIPER-derived VRS only. Four studies had all-cause mortality as an endpoint, and three studies used both endpoints (all-cause death and lung transplantation). One study reported mortality but did not use it for prognostic analysis.

The statistical methods used for assessment varied considerably owing to differences in study design. Among the included studies, 7 studies evaluated the correlation between the CALIPER-derived parameters and PFT: four cohort studies conducted Pearson’s correlation analysis, and three cohort studies conducted linear correlation analyses. Seven studies assessed the association between parameters and mortality: 3 cohort studies performed Cox regression analyses and calculated HRs, 2 cohorts calculated − log10 p values for covariates in multivariate Cox mortality models, one cohort study used Kaplan‒Meier analysis to assess survival, and one cohort study used a chi-square test in relation to mortality. Four studies conducted segmental analysis for CALIPER-VRS, and 2 studies evaluated the performance of CALIPER with respect to stratification. One study used receiver operating characteristic (ROC) curve analysis to confirm the best threshold. One cohort study identified independent predictors of mortality by Cox survival analysis and then combined them with CPI to obtain a new formula that represents a per-patient mortality estimate. The characteristics of the articles are summarized in Table 1, Table 3, and Table 4.

Table 3 Main characteristics of selected studies evaluating CALIPER-derived parameters as the biomarkers for predicting prognosis in patients with IPF
Table 4 Prognostic value of CALIPER-derived parameters

Descriptive summary of results

CALIPER-derived parameters and PFT at the baseline level

Five studies involving 872 subjects reported the association of CALIPER-derived parameters with PFT at baseline (the time point of diagnosis or pre-treatment) [29, 35,36,37,38]. In particular, 2 studies used the same data, so in this part they were deemed to be one study. The total of GGO, R, and H features constituted the extent of CALIPER-derived interstitial lung disease (ILD). A majority of results showed that CALIPER-ILD and CALIPER-VRS appeared to be strongly related to FVC [29, 35,36,37,38]. Only one study analyzing 38 patients identified that there was inadequate evidence to prove that CALIPER-derived GGO, R, ILD, and VRS were correlated with DLCO [35], and CALIPER-H was not significantly related to FVC.

CALIPER-derived parameters and PFT in longitudinal follow-up

Five studies involving 324 subjects reported the association of CALIPER-derived parameters with PFT in longitudinal follow-up. A majority of results showed a statistically significant association of FVC with CALIPER-VRS and CALIPER-ILD at the 1-year follow-up time point [35,36,37, 39, 40]. The studies of both Sverzellati, N. et al and Jacob, J. et al found that this association of FVC with CALIPER-derived GGO, R, H, ILD, and VRS also existed at the 2-year follow-up time point [36, 39]. Only one study reported that CALIPER-GGO and CALIPER-R were significantly related to FVC at the 3-month and 6-month follow-up time points [35]. The studies of both Jacob, J. et al and Maldonado, F. et al provided the annual variation of each CALIPER-derived parameter. Among the individual CALIPER-derived parameters, CALIPER-GGO and CALIPER-R were the patterns that increased the most per year [39, 41].

CALIPER-derived parameters in predicting mortality at the baseline level

Five studies involving 991 subjects evaluated the prognostic value of CALIPER-derived parameters at baseline [29, 36, 37, 42, 43]. All five studies indicated that CALIPER-VRS was the strongest independent predictor of survival, and CALIPER-ILD also appeared to perform well in predicting mortality. Of these studies, one was adjusted for GAP score and pack-years of smoking, one for age and sex, and one for sex, age, and smoking status. The third study analyzed 105 patients and found that 3-year survival was poor in patients with CALIPER-VRS% ≥ 5% and CALIPER-ILD% ≥ 20%. However, in the eighth study, which analyzed 283 patients, when CALIPER-ILD was combined with CALIPER-VRS, only CALIPER-VRS remained an independent predictor of mortality. Two studies analyzing 341 patients showed that CALIPER-R and CALIPER-H could also predict mortality [36, 43]. One study was adjusted for GAP score and pack-years of smoking.

CALIPER-derived parameters in predicting mortality in longitudinal follow-up

Three studies involving 231 subjects reported the correlation between CALIPER-derived parameters and mortality at an average follow-up time point of 1 or 2 years. The studies of Sverzellati, N. et al and Jacob, J. et al indicated that CALIPER-VRS had the best correlation with the long-term outcome [36, 40]. One study was adjusted for GAP score and pack-years of smoking. The long-term outcome was all-cause death or lung transplantation more than 2 years from baseline. In addition, Maldonado, F. et al analyzed 55 patients and showed that CALIPER-R and CALIPER-ILD had a statistically significant association with the 1- or 2-year survival rate [41]. This analysis was adjusted for sex, pack-years, baseline FVC % pred, and baseline DLCO % pred. The study of Sverzellati, N. et al found that the amount of change in CALIPER-H and CALIPER-VRS during the 1-year follow-up period correlated more strongly with prognosis than the amount of change during the 2-year follow-up period [36].

Subgroup analysis of the CALIPER-VRS

Compared with PFT, which can only evaluate the lung globally, CALIPER can assess regional features. Four studies involving 998 subjects performed segmental analysis [29, 39, 40, 43]. CALIPER-VRS could be divided into upper zone (UZ) VRS, middle zone (MZ) VRS, and lower zone (LZ) VRS according to the vessel location, and into < 5 mm², 5–10 mm², 10–15 mm², 15–20 mm², and > 20 mm² according to the cross-sectional area. Jacob, J. et al analyzed 649 patients and found that the UZ VRS was superior to the other zones in predicting FVC reduction or prognosis [29, 40]. One study was adjusted for age and sex. Another study was adjusted for sex, age, and smoking status. The eighth study, containing 283 patients, found that when CALIPER-VRS > 5 mm² was combined with CALIPER-H, CPI, and GAP, only CALIPER-VRS > 5 mm² remained an independent predictor of mortality [43]. In another study by Jacob, J. et al, only the 10–15 mm² range appeared independently correlated with the change in FVC [39].

Multivariate model in stratification outcome

Two studies including 341 patients reported the utility of CALIPER in refining the evaluation of outcome. In the study of Sverzellati, N. et al, according to the ROC analysis, CALIPER-ILD ≥ 20% and CALIPER-VRS ≥ 20% were the optimal thresholds for stratification of results. Three risk categories were defined by combining FVC% reduction ≥ 10% and relative rise of CALIPER-ILD ≥ 20% at the 12–14-month follow-up. The risk categories were as follows: low risk meant that neither condition was met, intermediate risk meant that either one of the conditions was met, and high risk meant that both conditions were met. At 23–26 months of follow-up, the model integrating the FVC% decline trend and relative rise of CALIPER-VRS ≥ 20% produced a similar effect [36].
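The three-category rule at the 12–14-month follow-up can be written as a short sketch. The function name and argument names are hypothetical; only the two thresholds and the low, intermediate, and high labels come from the description above.

```python
def risk_category(fvc_reduction_pct, caliper_ild_relative_rise_pct):
    """Classify a patient by counting how many of the two conditions are met.

    Conditions: FVC% reduction >= 10% and relative rise of CALIPER-ILD >= 20%.
    """
    conditions_met = int(fvc_reduction_pct >= 10.0) + int(
        caliper_ild_relative_rise_pct >= 20.0
    )
    return ("low", "intermediate", "high")[conditions_met]


# Hypothetical patients (illustrative values only).
print(risk_category(4.0, 8.0))    # neither condition met
print(risk_category(12.0, 8.0))   # FVC condition only
print(risk_category(12.0, 25.0))  # both conditions met
```

Counting the conditions rather than nesting comparisons keeps the rule symmetric: either condition alone places a patient in the intermediate category.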

Another study by Jacob, J. used CALIPER-VRS and CALIPER-H, which could independently predict mortality, to modify the formula of the composite physiologic index (CPI). The formula was presented as follows: CALIPER–CPI score = (CALIPER PVV × 23.0904) + (CALIPER honeycombing × 18.3795) + (CPI × 4.5065) [43]. Cox regression analysis showed that the CALIPER–CPI not only had similar prognostic strength to the GAP Index staging system, but also showed better goodness of fit. Additionally, Cox regression analysis showed that the multivariate model integrating CALIPER-H, CALIPER-VRS, and Gender-Age-Physiology (GAP) demonstrated improved performance compared to using GAP only [43].
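The CALIPER–CPI formula quoted above can be evaluated as in the sketch below. The coefficients are those given in the formula; the function name and the input values are hypothetical, and the units or scaling of the inputs are not specified in this review, so they must be taken from the original study [43].

```python
def caliper_cpi_score(caliper_pvv, caliper_honeycombing, cpi):
    """Weighted sum of CALIPER PVV, CALIPER honeycombing, and CPI."""
    return (
        caliper_pvv * 23.0904
        + caliper_honeycombing * 18.3795
        + cpi * 4.5065
    )


# Unit inputs simply return the sum of the three coefficients;
# they are not realistic patient values.
score = caliper_cpi_score(1.0, 1.0, 1.0)
print(round(score, 4))
```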

Risk of bias in trials

The quality assessment included 10 studies and is shown in Supplement Table 2. Two articles did not mention the follow-up period, and 1 article did not adequately describe the follow-up period. None of the articles performed blinded outcome assessment or a prospective calculation of the study size. The included studies had a high proportion of patients lost to follow-up. One study included a validation cohort. All studies collected data that were established before the beginning of the study, and all patients who met the inclusion criteria were included. The included studies clearly defined their endpoint indices.

Discussion

Our study provides a systematic review of the literature on the relationship between CALIPER-derived parameters and lung function and prognosis. Across the included studies, we found that CALIPER-derived parameters served as significant predictors of mortality and improved the prediction of future adverse events (death or lung transplantation) by FVC, CPI, or GAP.

Our systematic review showed a strong correlation between CALIPER-derived parameters and FVC, as well as significant value in predicting prognosis. Prognosis and survival can be estimated from PFT at baseline and from its longitudinal change over time [44, 45]. Additionally, longitudinal change in PFT is more informative than a single measurement, and the same appears to hold for CALIPER-derived parameters [46, 47]. Most of the studies in our collection, after finding a strong correlation between longitudinal changes in CALIPER-VRS, CALIPER-ILD, and lung function indicators, went on to develop multiparametric models or to explore the role of these parameters in prognostic stratification through multivariate Cox regression analyses. We found that both the baseline level and the amount of longitudinal change in CALIPER-derived parameters performed well in predicting prognosis. We also found that the CALIPER-derived parameters changed by different amounts over a 1-year interval. Two of the included studies showed that CALIPER-GGO and CALIPER-R were the patterns with the highest annual growth [39, 41]. Given that most of the included studies neither provided the annual growth of CALIPER-derived parameters nor discussed its value in evaluating prognosis, these two patterns could be further explored in future studies to establish the value of detecting subtle progression at shorter intervals for estimating survival time, and potentially as markers of treatment response in future clinical trials. Where HRCT is routinely reviewed in IPF patients, CALIPER analysis requires no additional image acquisition time. CALIPER analysis is semiautomatic and independent of the severity of the patient’s disease, with good accuracy and reproducibility [27]. CALIPER analysis therefore reduces, to a certain extent, the level of expertise required of the operator.

Among the included studies, only two distinguished between patients with and without antifibrotic treatment, and they showed that patients with IPF who received treatment had slower CT progression than untreated patients. Given that such studies are still in the minority, future work should explore whether CALIPER-derived parameters can still assess prognosis after a period of antifibrotic treatment, and whether a stable or reduced CALIPER-derived parameter after antifibrotic treatment is associated with a good prognosis. Future studies should also examine the increase in CALIPER-derived parameters before and after antifibrotic therapy, or their growth rate, and explore their role in assisting the clinical evaluation of drug efficacy.

The results of our analysis suggest the need for a staging system for the quantitative assessment of CALIPER-derived parameters, similar to the standardized quantitative categories of the GAP index. The GAP index, a staging model based on clinical (e.g., sex, age) and physiologic (e.g., FVC, DLCO) parameters, classifies patients into three stages, and each stage corresponds to a different mortality rate [48]. Currently, most subjects in clinical trials are in a mild to moderate stage (according to the GAP system), and these subjects might respond more noticeably to antifibrotic treatment. A staging system would facilitate the identification of early stages of the disease and timely intervention. Early intervention with antifibrotic therapy can preserve pulmonary function to the greatest extent, decrease hospitalization risk, and prolong lifespan [12, 49]. Two of our included studies found CALIPER-ILD ≥ 20% and CALIPER-VRS ≥ 20% to be the optimal thresholds for outcome stratification, and the optimal cut-off value to distinguish between the presence and absence of FVC reduction was 20% for CALIPER-ILD and 5% for CALIPER-VRS. Because the relevant cohorts were small, these thresholds should be confirmed in a larger cohort. A case–control study is needed to further validate the accuracy of the thresholds and to refine the corresponding rating scale.

Our analysis suggests that the results should be stratified more precisely by a staging system established jointly with other parameters. Given that PFT was insensitive to less extensive disease [50], GAP was less powerful at predicting future pulmonary function decline [51]. Combined with CPI-, GAP-, and FVC-related indicators, CALIPER quantification could provide patients and clinicians with a more accurate estimate of risk than other methods. However, these factors should be harmonized in future studies due to differences in endpoint events and the duration of follow-up.

Our analysis showed that CALIPER-VRS was more valuable than other indicators in predicting adverse events. Five of our included studies used multivariate Cox regression analysis to conclude that CALIPER-VRS was the strongest independent predictor of mortality. Longitudinal variation in CALIPER-VRS also had greater power to detect “silent” disease progression. One of our included studies found that a relative rise in CALIPER-VRS ≥ 20% was the optimal threshold for refined outcome categorization and that the change in VRS outperformed FVC. The subregional analysis of CALIPER-VRS is also valuable. Given that IPF is more severe in the inferior and dorsal regions on histopathology, abnormal signs typically emerge in the basal and peripheral lung parenchyma on imaging [52]. Global analysis that integrates severely fibrotic areas with milder areas would conceal real disease progression. CALIPER-derived regional analysis can target different parts of the lung. Incorporating more precise and fine-grained location data into the multiparameter model would further improve the accuracy and sensitivity of the resulting staging system. For example, two of our included studies added the UZ VRS and CALIPER-VRS > 5 mm², which are strongly correlated with adverse events, to the multiparameter model. Existing studies differ in their approach to CALIPER-VRS segmentation and in the creation of multiparameter models, so harmonization of standards is necessary.

Our review has a few limitations. First, the study was limited by a small sample size. The indicators exhibited high heterogeneity and variability. Some studies provided insufficient data. Therefore, we could not conduct a meta-analysis. Second, most of the studies were retrospective, and only one study was prospective. Because the subjects were selected from a single center or two centers, we could not verify the diversity of subjects. Third, the imaging protocols were different among studies, such as reconstruction kernel and section thickness; these slight differences may have led to bias. Finally, the inclusion criteria were different for the included studies, such as the disease severity or follow-up time.

In conclusion, CALIPER-derived parameters, especially CALIPER-VRS, appear to have considerable value in assessing prognosis, progression, and mortality. The CALIPER–CPI score formula and the multivariate model of CALIPER-H, CALIPER-VRS, and GAP could refine prognostic stratification, which would support clinicians in implementing targeted management strategies. CALIPER-VRS-derived regional analysis could provide a more precise assessment of the disease condition and improve evaluation of the therapeutic effect. It is essential to establish a staging system based on CALIPER-derived parameters combined with CPI, FVC, or GAP.