Introduction

Recently developed sophisticated computer analytical tools such as CALIPER are able to quantify a range of parenchymal pattern extents in a whole lung CT dataset [1]. Considerable information is generated by such tools, including quantitation of parenchymal patterns that cannot be similarly characterised using visual CT analysis, such as the volume of intraparenchymal vessels [2]. To date, however, there has been a paucity of computer-based analytical studies in non-idiopathic fibrosing lung diseases such as hypersensitivity pneumonitis (HP). Quantitative analysis nevertheless has the potential to estimate disease burden in a condition such as HP, which may present with highly variable morphological [3] and histopathological features [4, 5].

Traditionally, the evaluation of disease severity in patients with interstitial lung disease (ILD) has focused on the identification of individual CT parenchymal patterns that predict mortality. The result has been the identification of a few key features such as honeycombing [6, 7] and traction bronchiectasis [8, 9] that are associated with a worse outcome.

In tandem with the development of computer quantitation, mathematical modelling has also evolved. New advanced automated computational techniques are able to analyse information contained within an entire CT dataset and identify patient groups that share common or idiosyncratic disease phenotypes [10, 11]. By examining the spectrum of morphological appearances on an individual patient’s CT, the resulting analysis facilitates individualised medicine. Automated stratification of patients into phenotypically similar groups has been shown to correlate well with functional indices [12]. However, the association between stratified patient subgroups and differing patient outcomes has not been validated.

In the current study, CT imaging scored visually and by CALIPER in patients with HP was analysed to identify variables predictive of mortality using proportional hazards regression analysis. The population was also stratified using advanced automated computational techniques. Patient subgroups derived using automated stratification were evaluated against mortality and against an alternative risk prediction model, the interstitial lung disease gender, age, physiology (ILD-GAP) model [13]. The ability of automated stratification to substitute for functional indices in the ILD-GAP model was also explored.

Materials and methods

Study population

A previously defined cohort of consecutive patients diagnosed with subacute and chronic HP was identified using the clinical coding database of the Royal Brompton Hospital for the period January 2000 to December 2006 [14]. All patients with a non-contrast, interspaced supine HRCT (1-mm sections at 10-mm intervals) reconstructed with a Siemens B70 edge-enhancing algorithm were chosen for analysis (Fig. 1) [n = 98]. Patients were diagnosed by a multidisciplinary team with clinical, radiological and, when available, pathological consensus. Sixty-eight of 98 patients (69%) had both a positive exposure history and serum precipitins to relevant antigens. The remainder were diagnosed using broncho-alveolar lavage lymphocytosis results, appropriate CT findings [15, 16] and histopathological confirmation [17, 18]. Eleven cases with concurrent volumetric CT imaging were evaluated with CALIPER to analyse differences in interpretation of interspaced and volumetric scans (Fig. 1, Supplementary Material). Approval for this study of clinically indicated CT and pulmonary function index (PFI) data was obtained from the Institutional Ethics Committee of the Royal Brompton Hospital and the Institutional Review Board of the Mayo Clinic Rochester.

Fig. 1

CONSORT diagram illustrating the selection of patients for the final study population. HP hypersensitivity pneumonitis, HRCT high-resolution computed tomography, CT computed tomography

CT, CALIPER and PFI protocols

Protocols are described in detail in the Supplementary Material. PFIs analysed included forced expiratory volume in 1 s (FEV1), forced vital capacity (FVC), total lung capacity (TLC), residual volume (RV), transfer coefficient of the lung for carbon monoxide (Kco), single breath carbon monoxide diffusing capacity corrected for haemoglobin concentration (DLco) and the composite physiological index (CPI) [19].

CT evaluation

Visual CT evaluation on a lobar basis was performed independently by two radiologists (MM, WM) each with 3 years’ thoracic imaging experience, blinded to all clinical information. CT parameters evaluated included: ground glass opacity, reticular pattern, honeycombing and consolidation, which were summed to calculate total ILD extent [2]. Total fibrosis extent represented the sum of reticular pattern and honeycombing. Emphysema, mosaicism (decreased attenuation component) and traction bronchiectasis were scored as previously described [2].

CALIPER CT evaluation

CALIPER segmented the lung into six zones: right and left upper, middle and lower zones, demarcated with reference to the lung hilum, with each of the six zones further divided into an inner half (central region) and an outer half (peripheral region). Eight initial CT patterns were classified by CALIPER; however, preliminary analyses demonstrated that honeycombing was not identified as a distinct pattern on CALIPER evaluation of interspaced datasets as honeycombing requires three-dimensional information for characterisation. Consequently, CALIPER honeycombing was not further analysed in the study. In the final analysis, therefore, seven CT patterns (described below) were evaluated for all 12 anatomical areas of the lung (see Supplementary Material).

CALIPER evaluation of the lungs was pictorially expressed as volume-rendered three-dimensional images or as a glyph (Fig. 2). Each glyph comprised six wedges, or zones, the size of which reflected the volume of the zone relative to the total lung volume. Within each lung zone, every voxel was classified into one of eight separately colour-coded CALIPER parenchymal patterns: ground glass opacity = yellow, reticular pattern = orange, honeycombing = brown, grade 1 decreased attenuation (DA) = light green, grade 2 DA = light blue, grade 3 DA = dark blue, normal lung = dark green, pulmonary vessel volume (PVV) [pulmonary arteries and veins, excluding vessels at the lung hilum] = white. The relative volumes of the patterns within a zone determined the proportions of each colour in a zone.

Fig. 2

Coronal three-dimensional rendering, accompanying glyph of parenchymal patterns scored by CALIPER and corresponding colour overlay axial images at the lung bases using interspaced imaging (top images) and volumetric imaging (lower images). Within the glyph the dark line separates right and left lungs and concentric circles overlying the glyph represent quintiles of lung volume. Examination of the glyph generated by the volumetric dataset demonstrates a shrunken left lung and right lower lobe, both of which contain honeycombing. A greater volume of decreased attenuation lung is also evident in the right middle lobe when compared to the glyph generated from interspaced data. Dark green = normal lung, light green = grade 1 decreased attenuation, light and dark blue = emphysema, yellow = ground glass opacity, orange = reticular pattern, brown = honeycombing, white = pulmonary vessel volume

All CT variables were expressed as a percentage of the total lung volume. CALIPER grade 2 and 3 DA lung corresponded to areas of emphysema [2], whilst ILD extent represented the sum of ground glass opacity and reticular pattern.
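The conversion from per-pattern volumes to the percentage scores and composite extents defined above can be sketched as follows. This is an illustrative sketch only, not CALIPER's implementation: the dictionary keys, the function names and the assumption that the supplied voxel counts cover the whole lung are hypothetical.

```python
def pattern_percentages(voxel_counts):
    """Express each pattern as a percentage of total lung volume.

    voxel_counts maps a (hypothetical) pattern name to its voxel count;
    the counts are assumed to cover the whole lung.
    """
    total = sum(voxel_counts.values())
    if total == 0:
        raise ValueError("no lung voxels supplied")
    return {name: 100.0 * count / total for name, count in voxel_counts.items()}


def caliper_summary(pct):
    """Composite extents as defined in the text."""
    return {
        # ILD extent = ground glass opacity + reticular pattern
        "ild_extent": pct["ground_glass"] + pct["reticular"],
        # Emphysema = grade 2 + grade 3 decreased attenuation
        "emphysema": pct["grade2_da"] + pct["grade3_da"],
    }


counts = {
    "normal": 500, "ground_glass": 200, "reticular": 100,
    "grade1_da": 100, "grade2_da": 50, "grade3_da": 30, "pvv": 20,
}
summary = caliper_summary(pattern_percentages(counts))
print(summary)
```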

Automated stratification of CALIPER-variables

Global and regional dissimilarities in distribution of the eight patterns quantified by CALIPER within 12 zones were evaluated by a dissimilarity metric as previously described [12]. Regional dissimilarities were discerned using three components. Within a single lung, differences in regional lung volume as a proportion of the total lung volume were calculated. Between any two lungs, dissimilarities in the proportions of absolute lung volumes in corresponding regions and dissimilarities in the proportions of specific parenchymal patterns in the corresponding regions were calculated.

The dissimilarity metric was used to compare all 98 HP cases in a pairwise manner and the resultant 98 x 98 matrix was stratified using unsupervised affinity propagation [20] to identify unique clusters representing patient groups with shared parenchymal characteristics. An a priori specification of the number of expected clusters was not imposed, as affinity propagation derives naturally occurring clusters using real-valued message exchange [20].
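For readers unfamiliar with affinity propagation, the message-passing scheme can be sketched in a few lines of standard-library Python. The sketch makes several assumptions that are not taken from the study: similarity is the negative of the pairwise dissimilarity, the exemplar preference is the median off-diagonal similarity, and the damping factor and iteration count are arbitrary. It illustrates the general technique of [20], not the study's pipeline or its dissimilarity metric.

```python
from statistics import median


def affinity_propagation(dissim, damping=0.5, n_iter=200):
    """Cluster items from a square dissimilarity matrix (list of lists).

    Returns, for each item, the index of its exemplar (-1 if none emerged).
    """
    n = len(dissim)
    # Assumption: similarity = -dissimilarity; preference = median similarity.
    s = [[-dissim[i][k] for k in range(n)] for i in range(n)]
    pref = median(s[i][k] for i in range(n) for k in range(n) if i != k)
    for i in range(n):
        s[i][i] = pref
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities

    for _ in range(n_iter):
        # Responsibility: suitability of k as exemplar for i, relative to
        # the best competing candidate.
        for i in range(n):
            for k in range(n):
                best_other = max(a[i][j] + s[i][j] for j in range(n) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best_other)
        # Availability: accumulated support for k being an exemplar.
        for k in range(n):
            pos = [max(0.0, r[j][k]) for j in range(n)]
            total = sum(pos) - pos[k]  # support from all items other than k
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, r[k][k] + total - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new

    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:
        return [-1] * n
    # Each item joins its most similar exemplar; exemplars label themselves.
    return [i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
            for i in range(n)]


# Toy usage: six one-dimensional "patients" forming two obvious groups.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
toy_dissim = [[abs(x - y) for y in points] for x in points]
print(affinity_propagation(toy_dissim))
```

Note that the number of clusters is never passed in: it emerges from the preference value and the data, which is the property the text relies on.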

Statistical analysis

Data are given as means with standard deviations, or numbers of patients with percentages where appropriate. Interobserver variation for the visual scores was calculated using the single determination standard deviation [21]. CALIPER analysis of 11 interspaced and volumetric CTs was compared using the independent samples T test (significance = p < 0.05). Univariate and multivariate Cox regression analyses were used to investigate relationships within and between CALIPER and visual CT evaluation and PFIs. Linear regression analyses were used to characterise relationships between cardinal pulmonary function indices and CT scores of ground glass opacity and reticular pattern.

Comparisons of functional and morphological indices between automated stratified groups were examined using one-way analysis of variance (ANOVA) and post-ANOVA pairwise T test analyses using the Bonferroni correction for multiple analyses. Cox regression analysis and Kaplan-Meier survival curves were used to identify survival differences between automated stratification results and the ILD-GAP staging system. Survival distributions were compared using the Log rank test, and bootstrapped with 1,000 randomly generated samples. The automated stratified groups were then substituted for PFIs in the ILD-GAP staging system, resulting in the creation of a Stratified-CT model containing the following weighted variables: automated stratified group score, patient age and gender. Finally, the automated stratified groups were combined with the ILD-GAP model to form a third model termed the Stratified-GAP model. Model strength for the ILD-GAP, Stratified-CT and Stratified-GAP models was compared using receiver operating characteristic (ROC) curves, measuring the area under the ROC curve (AUROCC) and Harrell's C-index [22]. Statistical analyses were performed with STATA (version 12, StataCorp, College Station, TX, USA).
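As a reminder of what Harrell's C-index measures, the following minimal sketch computes it for right-censored data from first principles. It is not the STATA routine used in the study; the function name is hypothetical, and the handling of ties (tied risk scores count as half-concordant, tied survival times are not compared) is one common convention that is assumed here.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C for right-censored survival data.

    times       : follow-up time per patient
    events      : 1 if death was observed, 0 if censored
    risk_scores : higher score = higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i died strictly before
            # patient j's last follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy usage: an ordinal stage (1-4) used as the risk score.
print(harrell_c_index([5, 10, 15, 20], [1, 1, 0, 1], [4, 3, 2, 1]))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking of survival times by the model score.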

Results

Baseline results

The median age of the cohort was 59 years, with 52% having died during the average follow-up period of 69 months. Demographic data and average visual and CALIPER CT scores and PFI data are provided in Table 1. Interobserver variation for the visual scores is demonstrated in Supplementary Table 1. On average, visual scores identified more ILD than CALIPER. ILD mainly comprised reticular opacities on visual scoring, as opposed to ground glass opacities as scored by CALIPER. CALIPER scores for ground glass opacity and reticular pattern correlated more strongly with cardinal pulmonary function indices than equivalent visual CT scores (Supplementary Table 3).

Table 1 Patient age, gender and measures of pulmonary function indices and CALIPER-scored CT parameters

Evaluation of the 11 cases with concurrent interspaced and volumetric CT imaging demonstrated a significant difference in PVV extent between groups (p = 0.03), with CALIPER classifying more vessels on interspaced images than volumetric scans (Fig. 2) [Fig. 1, Supplementary Material].

Mortality analyses

On univariate visual CT analysis reticular pattern, honeycombing, mosaicism and traction bronchiectasis were strongly predictive of mortality. All CALIPER-scored patterns except emphysema and all PFIs were predictive of mortality on univariate analysis (Table 2). The results were maintained after adjusting for patient age and gender (Supplementary Table 2). When visual parameters alone were analysed in a multivariate model, reticular pattern (hazard ratio (HR) = 1.05, confidence interval (CI) 1.02–1.08, p = 0.001) and ILD extent (HR = 1.03, CI 1.01–1.04, p = 0.01) were independent predictors of mortality. Reticular pattern (HR = 1.10, CI 1.02–1.18, p = 0.02) and PVV (HR = 1.08, CI 1.01–1.15, p = 0.02) were independently predictive of mortality on multivariate analysis of CALIPER variables. Of the PFIs, DLco alone best described mortality on multivariate analysis. When CALIPER, visual and PFI variables were analysed together in a multivariate model, DLco and CALIPER reticular pattern were the only two independent predictors of mortality (Table 2). The results were maintained after adjusting for patient age and gender (Supplementary Table 2).

Table 2 Univariate Cox regression analysis demonstrating mortality according to CALIPER indices (top white), pulmonary function indices (light grey) and visual CT scores (dark grey). A multivariate model evaluated CALIPER and pulmonary function indices (lower white)

Automated stratification of CALIPER-derived groups

Automated stratification of HP patients identified three distinct groups with similar distributions of CALIPER parenchymal patterns within each group (represented as glyphs in Fig. 3). With progression from group 1 to group 3, the proportion of the lung comprised of the reticular pattern, ground glass density and PVV increased, while the extent of normal lung and grade 1 DA decreased (Table 1). Mean PFIs also worsened with progression from group 1 to group 3 (Table 1).

Fig. 3

Glyphs demonstrating the compositions of the three hypersensitivity pneumonitis groups (Group 1 = left, group 2 = middle, 3 = right) derived following CALIPER CT analysis. Dark green = normal lung, light green = grade 1 decreased attenuation, light and dark blue = emphysema, yellow = ground glass opacity, orange = reticular pattern, brown = honeycombing, white = pulmonary vessel volume

Significant functional differences across all groups were identified with FVC, TLC and CPI, with differences in at least two groups identified with FEV1 and DLco (Table 3). Visual CT parameters demonstrated good separation between group 1 and groups 2 and 3 for parenchymal patterns indicative of fibrosis and for mosaicism. However, only ILD and consolidation extents separated groups 2 and 3. CALIPER scores for patterns indicating ILD and PVV demonstrated clear differences across all three automated stratified groups. When the 11 cases with concurrent interspaced and volumetric imaging were analysed by CALIPER and stratified, the volumetric cases mapped to the same outcome groups as the interspaced cases (Fig. 1, Supplementary Material).

Table 3 Functional and morphological differences between stratified groups, evaluated using pairwise T tests, following one-way ANOVA analysis and modified using the Bonferroni correction. CALIPER scores are white, pulmonary function indices light grey and visual CT scores dark grey

Survival distributions between automated stratified groups are demonstrated in Fig. 4a (p < 0.0001 Log rank test). Cox regression analysis demonstrated that separation of patients into automated stratified groups was strongly predictive of mortality (HR = 2.74, CI 1.86–4.05, p < 0.0001). A mortality effect from automated stratification was maintained following correction for age, gender and baseline disease severity using the CPI (group stratification: HR 1.95, CI 1.15–3.29, p = 0.01) and DLco (group stratification: HR 2.05, CI 1.23–2.41, p = 0.006). When automated stratified groups were evaluated against DLco tertiles in a Cox proportional hazards analysis, both DLco tertiles (p = 0.001) and automated stratified groups (p = 0.002) were equivalent in their ability to predict outcome following bootstrapping of 1,000 samples.
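The survival curves referred to above are Kaplan-Meier product-limit estimates. A minimal sketch of the estimator for a single group is shown below; the function name and toy data are hypothetical, and in practice the curve would be computed separately for each stratified group before comparison with the Log rank test.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate for one group.

    times  : follow-up time per patient (e.g. months)
    events : 1 if death was observed, 0 if censored
    Returns a list of (time, survival probability) at each death time.
    """
    survival = 1.0
    curve = []
    death_times = sorted({t for t, e in zip(times, events) if e})
    for t in death_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy usage: five patients, two of them censored.
for t, s_t in kaplan_meier([2, 4, 4, 6, 8], [1, 1, 0, 1, 0]):
    print(t, round(s_t, 3))
```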

Fig. 4

(a) Kaplan-Meier survival curves for the three hypersensitivity pneumonitis groups derived using automated stratification on the basis of CALIPER CT parenchymal pattern extents. Group 1 (blue) mean survival 120.4 ± 4.9 months; n = 33, group 2 (green) mean survival 74.6 ± 8.0 months; n = 40 and group 3 (yellow) mean survival 45.1 ± 7.7 months, n = 25. Log rank test p < 0.0001. (b) Kaplan-Meier survival curves for patients with hypersensitivity pneumonitis stratified on the basis of the ILD-GAP model. Group 1 (blue) mean survival 122.6 ± 5.1 months; n = 21, group 2 (green) mean survival 98.3 ± 7.3 months; n = 35; Group 3 (yellow) mean survival 53.4 ± 8.0 months; n = 29; Group 4 (magenta) mean survival 46.1 ± 15.6 months, n = 10. Log rank test p < 0.0001. (c) Kaplan-Meier survival curves for patients with hypersensitivity pneumonitis stratified on the basis of the Stratified-CT model combining automated stratified groups, patient age and gender. Group 1 (blue) mean survival 120.4 ± 5.7 months; n = 23, group 2 (green) mean survival 95.1 ± 8.5 months; n = 33; Group 3 (yellow) mean survival 66.0 ± 8.5 months; n = 32; Group 4 (magenta) mean survival 15.0 ± 6.5 months, n = 7. Log rank test p < 0.0001

The ILD-GAP model separated patients according to age, gender, FVC and DLco values (scored on a nine-point scale) into four outcome groups (Fig. 4b). A bivariate Cox regression analysis, bootstrapped with 1,000 samples, demonstrated no difference between the ILD-GAP model (p = 0.002, CI 0.34–1.13) and automated stratification groups (p = 0.001, CI 0.36–1.22).

To compare the ability of the automated stratified groups to substitute for FVC and DLco in the ILD-GAP model, the three-point automated stratified group scale was converted into a five-point scale analogous to the five-point scale for FVC (0–2) and DLco (0–3) in the ILD-GAP model. Automated stratified group 1 was converted to a score of 0, group 2 remained unchanged, whilst group 3 was converted to a score of 4. When the five-point automated stratified group scale was combined with patient age and gender (weighted on a three (0–2) and two-point (0–1) scale, respectively, in accordance with the ILD-GAP model) an eight-point scale was derived and converted into a four-point automated stratified model (Stratified-CT model) using the same group divisions as the ILD-GAP model (scores of 0–1 = 1; scores of 2–3 = 2; scores of 4–5 = 3; scores >5 = 4) with good separation of outcome groups (Fig. 4c; Log rank test p < 0.0001).
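The scoring described in this paragraph can be written out explicitly. The group-to-points mapping and the stage cut-offs below are taken from the text; the age bands (60 years or younger = 0, 61 to 65 = 1, older than 65 = 2) and the gender coding (female = 0, male = 1) are assumed from the published GAP index, since the text states only that they follow the ILD-GAP model. Function and variable names are hypothetical.

```python
# Points for automated stratified groups 1-3 on the five-point (0-4) scale.
GROUP_POINTS = {1: 0, 2: 2, 3: 4}


def age_points(age_years):
    """Age weighting (0-2); bands assumed from the published GAP index."""
    if age_years <= 60:
        return 0
    if age_years <= 65:
        return 1
    return 2


def stratified_ct_stage(group, age_years, is_male):
    """Return the four-point Stratified-CT stage (1-4)."""
    # Gender weighting (0-1); male = 1 is assumed from the GAP index.
    score = GROUP_POINTS[group] + age_points(age_years) + (1 if is_male else 0)
    # Stage divisions as stated in the text: 0-1, 2-3, 4-5, >5.
    if score <= 1:
        return 1
    if score <= 3:
        return 2
    if score <= 5:
        return 3
    return 4


# Toy usage: a 63-year-old man in automated stratified group 2.
print(stratified_ct_stage(group=2, age_years=63, is_male=True))
```

The total score ranges from 0 to 7, which is the eight-point scale referred to in the text.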

The ordinal predictive power of the four-point Stratified-CT model was 0.73 as judged by Harrell's C-index, which was identical to the C-index value of the ILD-GAP model (0.73). When the three-point scale of the automated stratified groups was combined with the four-point scale of the ILD-GAP model, the new Stratified-GAP model had a C-index value of 0.77. Sensitivity and specificity for mortality prediction using the three models are demonstrated using ROC curve analysis in Fig. 5, and were greatest with the Stratified-GAP model.

Fig. 5

Receiver operating characteristic curves demonstrating sensitivity and specificity for mortality prediction using three models: ILD-GAP (blue, area under curve (AUC) = 0.72, confidence interval (CI) 0.61–0.82, p = 0.0002); Stratified-CT (green, AUC = 0.76, CI 0.66–0.85, p < 0.0001); Stratified-GAP (yellow, AUC = 0.77, CI 0.67–0.86, p < 0.0001). When evaluated using Harrell's C-index, values for the ILD-GAP and Stratified-CT models were equivalent (0.73), but model strength improved when using the Stratified-GAP model (0.77)

Discussion

In the current study, a computer-derived variable, CALIPER reticular pattern, was stronger than all visual CT scores at predicting mortality in patients with HP. The current study has demonstrated for the first time that, in conjunction with computer quantitation, automated stratification techniques can separate HP patients into prognostic groups that are functionally distinct and comparable to the ILD-GAP model for risk prediction. When automated stratification is combined with functional indices, patient age and gender, the resulting model is stronger than the ILD-GAP model alone at predicting mortality. Furthermore, survival across groups defined using automated stratification remains independent of baseline disease severity.

Automated stratification identified three unique HP groups that had distinct functional characteristics. The recognition of disease sub-groups is clinically desirable if it allows identification of high-risk patients that may benefit from aggressive interventions, yet limits unnecessary treatment in patients with quiescent disease [23]. Computer analysis of CT imaging is attractive as a tool to generate repeatable information across disease cohorts, given its reproducibility and lack of interobserver variation. Furthermore, quantitative tools such as CALIPER can evaluate the entirety of a CT dataset, providing a comprehensive analysis of an individual patient’s CT. Computer analysis has been used in emphysema cohorts with the aim of identifying distinct phenotypic groups [12, 24] with limited success [25], but similar studies in individual ILD populations are sparse [26].

The benefits of combining CT quantitation with automated stratification techniques lie in the emphasis on data to drive the identification of distinct phenotypic groups which can then be assessed to identify functional and prognostic similarities. The automated stratified groups in the current study were generated without any innate bias. The numbers and types of phenotypic clusters were not predetermined but were generated by the automated stratification process itself. Consequently, phenotypic features that might be overlooked with visual scores can be identified, and may be used to uncover populations with shared outcomes.

The comparable strength in risk stratification between the ILD-GAP model and the automated stratification model suggests that computer analysis and automated modelling could have a role as outcome measures in clinical trials. For example, discrete therapeutic responses or adverse reactions may become apparent, with a sensitivity potentially surpassing functional indices alone.

Against the complexity of stratified mathematical modelling, the simplicity of a glyph representation translates dense numerical datapoints into a format with clinical pertinence. Patients can use a glyph to understand the nature and extent of their particular disease. For clinicians, meanwhile, the combination of a glyph mapped to a stratified group allows the characterisation of a patient’s disease phenotype at a glance in a busy clinic setting.

The difference in quantitation of ground glass opacity and reticular pattern between visual and CALIPER scores was largely secondary to differences in classification of a pattern where ground glass opacity is overlaid by reticular pattern. Visual scorers often disagree on such a pattern, and in the current study the visual scorers considered the pattern to represent reticular pattern, whilst as previously described [2], CALIPER classifies such a pattern as ground glass opacity.

The prognostic implications of an increasing CT reticular pattern extent identified in the current analysis confirm findings from a previous HP study [27]. Fibrosis extent [27–30] and the decreased attenuation component of a mosaic attenuation pattern [27], also demonstrated in the current study, have similarly been previously implicated as prognostic variables in HP. Conversely, however, CALIPER grade 1 DA, which partly corresponds to air trapping on CT, demonstrated a mild protective effect on univariate analysis. Traction bronchiectasis, in contrast to previous studies in HP [14], was not found to be a powerful multivariate predictor of mortality.

There were limitations to the current study. CALIPER was unable to classify honeycombing on interspaced imaging. Nevertheless, despite the loss of a strong prognostic variable, a CALIPER variable, reticular pattern, remained an independent predictor of mortality across the entire HP cohort. PVV was over-represented on interspaced imaging because, like honeycombing, it requires three-dimensional information for optimal characterisation. PVV is recognised by CALIPER as contiguous tubes followed to the lung edge using tubular filters and three-dimensional region-growing software. The PVV signal in the current study may have related to linear tubular structures that were predominantly ‘in plane’ on a single interspaced image. However, it is also possible that some of the characterised vessels may have represented reticular densities that were in fact misclassified as vessels. Nevertheless, whilst there appears to be a degree of overlap in CALIPER scoring of reticular pattern and vessels, both factors remained independent predictors of mortality when CALIPER variables were analysed. A final limitation lies in the lack of an external validation cohort with which to confirm the study findings. The scarcity of large HP populations even within tertiary referral centres made validation of our results challenging, but validation remains an important aim for future studies.

In conclusion, we have shown for the first time that computer quantitation and automated stratification of CTs (by CALIPER) generate variables that are powerfully predictive of mortality in HP. Automated stratification is able to distinguish patients with differing disease phenotypes that correspond to discrete functional groups, and is equivalent to functional indices in the ILD-GAP model in its ability to risk stratify patients with hypersensitivity pneumonitis. When combined with quantitative CT analysis, prognostication using the ILD-GAP model was found to improve.