Introduction

Visual assessment of computed tomography (CT) images of the lung is an important tool in the routine clinical workup of patients with pulmonary disease [1, 2]. However, visual evaluation is subjective, and the conclusion is dependent upon the observer [3, 4]. For longitudinal studies of lung diseases such as emphysema, an accurate and reliable visual assessment is necessary if disease progression is to be detected [5]. Chronic obstructive pulmonary disease (COPD) remains a major public health problem, and early intervention is of great importance [6]. Reliability of visual assessment depends upon good intra- and interobserver agreement in assessing the presence and severity of disease. Interobserver agreement in visual assessment of CT of the lung has previously been shown to vary; emphysema [2, 7] being easier to agree upon than interstitial abnormalities such as ground-glass attenuation [8], honeycombing [9], or reticulation [10]. Previous conclusions on interobserver variation in visual assessment of airway disease are highly inconsistent; some show reasonable levels of agreement and others show considerable interobserver variability [11, 12]. As such, interobserver variation has been a subject of interest and investigation for several decades. However, studies on visual assessment of progression of disease are lacking. For instance, the ability of the trained human eye to detect a change in the severity of emphysema over time is unknown.

While the main objective of lung cancer screening programmes is the early detection of malignant lung nodules, the huge amount of data provided by CT gives a unique opportunity to detect other subclinical lung pathologies, including emphysema, large airway diseases, and interstitial abnormalities. In this regard, the Danish Lung Cancer Screening Trial (DLCST) provides an opportunity to investigate early CT signs of lung disease and to study interobserver agreement in the detection and grading of these abnormalities. In addition, the participants in the screening arm of the study were followed up with five annual CT examinations. This is a reasonable time span, during which significant change in lung function [13] and lung density [14] has been detected in this population. It was plausible, therefore, to test whether the progression of emphysema, airway diseases, and interstitial abnormalities can be detected visually within this time frame.

The purpose of this paper is to evaluate interobserver agreement in the visual assessment of emphysema, airway abnormalities, and interstitial abnormalities, and to explore whether progression of disease can be detected by subjective visual evaluation of CT images.

Materials and methods

Study population

DLCST is a five-year prospective randomised controlled trial (enrolment Oct. 2004 to Mar. 2006) comprising 4,104 participants randomised to either screening (with annual low-dose CT, n = 2,052) or no screening (control group, n = 2,052) [15]. Inclusion criteria were a smoking history of at least 20 pack-years, age 50–70 years, and FEV1 of at least 30 % of predicted, the latter to ensure surgical feasibility in the event of a positive screen. Participants were recruited through newspaper advertisements and were relatively healthy at the time of inclusion.

DLCST was approved by the Ethics Committee of Copenhagen County and fully funded by the Danish Ministry of Interior and Health. Approval of data management in the trial was obtained from the Danish Data Protection Agency. The trial is registered in the ClinicalTrials.gov Protocol Registration System (identification no. NCT00496977). All participants provided written informed consent.

Only participants with at least two low-dose screening thoracic CT examinations were included in the present study (n = 1,990). Of these, 411 individuals were former smokers and 1,103 were continuous smokers; the remaining participants changed smoking status one or more times during the study period.

CT protocol and image analysis

Participants in the screening group were examined annually during a period of five years, using a multi-slice CT system (16-row Philips Mx8000, Philips Medical Systems). Examinations were performed in supine position at full inspiration with a low-dose technique (120 kV and 40 mAs), with the following specifications: section collimation 16 × 0.75 mm, pitch 1.5, and rotation time 0.5 second. Participants were instructed to first hyperventilate three times and thereafter inhale maximally and hold their breath during imaging. Images were reconstructed with 1 mm slice thicknesses using a hard reconstruction algorithm (kernel D).

The first and last available CT images from each participant were selected, and two sets of differently anonymised images in different random order were produced, one for each of the two observers. Both observers were MDs and PhD students, with backgrounds in pulmonology (LT) and radiology (MW), respectively. The observers were thus blinded to participant identity, clinical data, and examination date. The visual evaluations were performed individually; however, the observers were able to freely discuss the images with each other.

Visual assessment was performed using a window width of 1,500 Hounsfield units (HU) and a level of -500 HU. The method was adopted from the COPDGene Image-Based COPD Subtypes Visual Scoring of Chest CT Scans Workshop, Virginia, USA, 2010 [10]. Modifications were made to increase focus on interstitial abnormalities and simplify characterisation of emphysema and airways. The main difference between the COPDGene workshop and this study was the addition of interstitial abnormality scorings, which were expanded by separating honeycombing and reticulation scorings and by adding a nodular pattern section as well as a classification of the interstitial abnormalities found. As our study only includes inspiratory images, scorings on expiratory air trapping, expiratory malacia, and saber-sheath trachea were omitted. Standard definitions for the lung pathologies included in the score sheet were provided by the Fleischner Society glossary of terms for thoracic imaging [16].
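The display settings above correspond to a fixed mapping from Hounsfield units to screen grey levels. The following is a minimal sketch of that mapping, not the viewing software used in the study; the function name `apply_window` and the 8-bit output range are assumptions made for illustration.

```python
def apply_window(hu, width=1500.0, level=-500.0):
    """Map a Hounsfield value to a 0-255 grey level (assumed 8-bit display).

    With width 1,500 HU and level -500 HU the visible range is
    -1,250 HU (black) to +250 HU (white); values outside are clipped.
    """
    low = level - width / 2.0
    high = level + width / 2.0
    clipped = min(max(hu, low), high)
    return round(255 * (clipped - low) / (high - low))


if __name__ == "__main__":
    # Approximate HU values chosen only as examples.
    for label, hu in [("air", -1000), ("water", 0), ("bone", 700)]:
        print(label, hu, apply_window(hu))
```

With these settings, everything at or below -1,250 HU is rendered black and everything at or above +250 HU is rendered white.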

In the visual assessment of the images, each lung was divided into three regions: upper zone above carina, middle zone between carina and inferior pulmonary vein, and lower zone below inferior pulmonary vein. Each region was assigned an emphysema grade: 0 %, 1–5 %, 6–25 %, 26–50 %, 51–75 %, or 76–100 % [10]. The predominant type of emphysema was registered as centrilobular, paraseptal, panlobular, or mixed. Airway wall thickening and airway dilation were registered as absent, focal, or diffuse, as well as graded mild/moderate or severe.
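The six emphysema grades can be treated as ordered boxes numbered 0 to 5. As a minimal sketch, assuming a hypothetical numeric estimate of the affected percentage per zone (the observers in the study ticked a box by eye rather than entering a number), the mapping could look like this; `emphysema_grade` and `GRADE_LABELS` are illustrative names only.

```python
from bisect import bisect_left

# Upper bounds (inclusive) of the first five boxes; the sixth box is 76-100 %.
_UPPER_BOUNDS = [0, 5, 25, 50, 75]
GRADE_LABELS = ["0 %", "1-5 %", "6-25 %", "26-50 %", "51-75 %", "76-100 %"]


def emphysema_grade(percent):
    """Return the box index (0-5) for an estimated emphysema extent in percent.

    Assumption: fractional values between box limits (e.g. 0.5 %) are
    placed in the next higher box.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return bisect_left(_UPPER_BOUNDS, percent)


if __name__ == "__main__":
    for p in (0, 3, 25, 26, 80):
        box = emphysema_grade(p)
        print(p, box, GRADE_LABELS[box])
```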

Wall thickening was defined as a decrease in the ratio of inner to outer bronchial diameter. Bronchial dilation was defined as a bronchial lumen diameter greater than the diameter of the accompanying pulmonary artery (signet ring sign), or as bronchi visible within 10 mm of the pleural border. Interstitial abnormalities were classified as ground-glass opacity, honeycombing, reticulation, pleural nodules, centrilobular nodules, paraseptal/subpleural nodules, mosaic attenuation, and mass. Ground-glass opacity was defined as areas of hazy increased opacity in which vessels remain visible. Honeycombing was defined as cystic airspaces surrounded by fibrotic thickened walls, and reticulation was defined as linear opacities representing septal thickening (either linear or net-like, and either inter- or intralobular). Mosaic attenuation was defined as a patchwork of differently attenuated regions. Mass was defined as a lump larger than 30 mm in diameter. Nodules were defined as rounded opacities less than 30 mm in diameter and were characterised according to their anatomical position and their composition.
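Two of the definitions above are simple threshold rules and can be written down explicitly. The sketch below is illustrative only: in the study these judgements were made visually, and the function names, the millimetre inputs, and the treatment of boundary values (a bronchus at exactly 10 mm from the pleura, an opacity of exactly 30 mm) are assumptions, since the text does not specify them.

```python
def is_bronchial_dilation(lumen_mm, artery_mm, distance_to_pleura_mm):
    """Apply the two dilation criteria: signet ring sign, or a bronchus
    visible within 10 mm of the pleural border (10 mm assumed inclusive)."""
    signet_ring = lumen_mm > artery_mm
    near_pleura = distance_to_pleura_mm <= 10.0
    return signet_ring or near_pleura


def classify_rounded_opacity(diameter_mm):
    """Return 'mass' above 30 mm and 'nodule' below 30 mm.

    Exactly 30 mm is not covered by the definitions, so None is returned.
    """
    if diameter_mm > 30:
        return "mass"
    if diameter_mm < 30:
        return "nodule"
    return None


if __name__ == "__main__":
    print(is_bronchial_dilation(4.0, 3.0, 25.0))
    print(classify_rounded_opacity(8.0))
```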

For presentation, answers to questions regarding airways and interstitial findings were simplified to binary variables, thus limiting the registrations to absent or present rather than, for instance, mild, moderate, focal, or diffuse. We did this for the following variables: airway wall thickening, airway dilation, ground-glass attenuation, reticulation, honeycombing, pleural nodules, centrilobular nodules, subpleural nodules, and mosaic attenuation.
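The simplification to absent or present can be sketched as follows. The registration labels and the dictionary layout of a score sheet row are hypothetical, since the format of the electronic score sheet is not described here.

```python
def is_present(registration):
    """Collapse a graded registration to present (True) or absent (False).

    Assumption: the score sheet stores the string 'absent' for a negative
    finding and any other label (e.g. 'focal', 'diffuse') for a positive one.
    """
    return registration.strip().lower() != "absent"


def collapse_row(row):
    """Convert one score sheet row {variable: registration} to booleans."""
    return {variable: is_present(value) for variable, value in row.items()}


if __name__ == "__main__":
    example_row = {
        "airway wall thickening": "focal",
        "airway dilation": "absent",
        "reticulation": "diffuse",
    }
    print(collapse_row(example_row))
```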

Interstitial abnormalities do not necessarily have any relation to interstitial lung disease, as some of these findings could also represent airway disease, and consequently each abnormality was evaluated separately. Subsequently, a conclusion regarding suspicion of interstitial lung disease (ILD) was stated as one of the following: no ILD suspicion, equivocal for ILD, ILD suspicion/centrilobular, or ILD suspicion/subpleural/mixed, including pattern of usual interstitial pneumonia (UIP). This final classification was designated only if at least one interstitial abnormality was found.

The observers were trained in a pilot study comprising 300 of the 3,980 CT images, under the supervision of more experienced board-certified specialists (AD and SS) who, in February 2010, had participated in the NIH workshop ‘Image-Based COPD Subtypes Visual Scoring of Chest CT Scans’ at the ACR Learning Center in Reston, Virginia, USA [10]. A preliminary evaluation of the interobserver agreement was obtained, disagreements and cases of doubt were settled by consensus, and the electronic score sheet was adjusted to eliminate ambiguity.

Statistical methods

Interobserver agreement was based on baseline images and tabulated for each question on the score sheet as kappa. Kappa agreement levels were: 0–0.20, poor; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; and 0.81–1.00, almost perfect [17]. The p-values for kappa were based on the exact binomial test.
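For readers unfamiliar with the statistic, the sketch below computes an unweighted Cohen's kappa for two observers and labels it with the agreement bands listed above. It is an illustration in Python rather than the R code used for the study, and it assumes an unweighted kappa; whether a weighted variant was used for the ordered emphysema grades is not stated. Negative values are labelled "poor" here as an assumption, because the listed bands start at 0.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long lists of ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


def agreement_level(kappa):
    """Label a kappa value using the bands given in the text."""
    bands = [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    observer_1 = [1, 1, 0, 0]  # toy data, not study data
    observer_2 = [1, 0, 0, 0]
    k = cohen_kappa(observer_1, observer_2)
    print(k, agreement_level(k))
```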

Progression of disease severity was investigated by comparing the first and last image for each participant, and a trend was calculated for each question on the score sheet as ‘the mean number of boxes that the score has moved from first to last image’. With regard to emphysema, the sheet contained two initial boxes (present or absent), and from first to last image the presence of emphysema increased from 27 % to 30 %, which corresponds to a mean movement of 0.30 − 0.27 = 0.03 boxes toward the presence of disease. Regarding the extent of emphysema, the sheet included six boxes for each lung zone (corresponding to 0 %, 1–5 %, 6–25 %, 26–50 %, 51–75 %, and 76–100 %). To investigate time-trend in disease extent, a mean box was calculated for both early and late examinations, and an increase in mean box indicated progression in emphysema. The p-values for trends were based on the sign test.
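The trend measure and its test can be illustrated with a short sketch. It assumes that each participant contributes one box index from the first image and one from the last, and that the sign test is the usual two-sided exact version with ties dropped; the function names are hypothetical, and this is not the R code used for the analysis.

```python
from math import comb


def mean_box_shift(first, last):
    """Mean number of boxes the score moved from first to last image."""
    if len(first) != len(last) or not first:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(b - a for a, b in zip(first, last)) / len(first)


def sign_test_p(first, last):
    """Two-sided exact sign test on paired scores, ignoring ties."""
    increases = sum(b > a for a, b in zip(first, last))
    decreases = sum(b < a for a, b in zip(first, last))
    n = increases + decreases
    if n == 0:
        return 1.0  # no participant changed box
    k = min(increases, decreases)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Toy data mirroring the worked example: presence rises from 27 % to 30 %.
    first = [1] * 27 + [0] * 73
    last = [1] * 30 + [0] * 70
    print(mean_box_shift(first, last), sign_test_p(first, last))
```

In the toy data, three participants move one box toward disease and none move back, so the mean shift equals the 0.03 of the worked example; the toy p-value has no bearing on the study results.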

Level of significance was set at p < 0.05. Calculations were made using statistical software R version 3.00.

Results

Demographics

The characteristics of the participants in the screening arm of DLCST have been reported previously [14]. The current study included 1,990 participants, each with two images available for assessment. The demographics of included participants are shown in Table 1. Imaging intervals were similar between the groups of continuous smokers (mean: 3.76 years; SD: 0.84 years) and former smokers (mean: 3.85 years; SD: 0.76 years).

Table 1 Demographics of participating subjects, baseline characteristics. Smoking habits based on the complete study period

Visual analysis of CT

At baseline, emphysema was detected in 539 (27 %) participants (mean), primarily of centrilobular (47 %) and paraseptal (43 %) subtypes. In late examinations, emphysema was detected in 604 (30 %) participants. At baseline, airway disease was recorded 178 times (9 %); airway wall thickening accounted for 28 %, and dilation for 72 % of the observations. Interstitial abnormalities of various patterns were present in 346 (17 %) participants at baseline. Interobserver agreement and time-trends, including p-values, are shown in Table 2.

Table 2 Kappa values for presence or absence of specific variables on baseline scans and time-trend for variables in early and late scans

Interobserver agreement and time-trends regarding emphysema grading in the six lung zones are shown in Table 3. For continuous smokers, emphysema was seen more often in the late images in all lung zones. For former smokers, no progression of emphysema was detectable by visual assessment.

Table 3 Emphysema grade by lung zone and smoking status, interobserver agreement, and time-trend

Emphysema was seen in more participants in late examinations than in early examinations, and the emphysema grading was generally higher in all lung zones in the late images (Fig. 1).

Fig. 1 Percentage of all six lung zones in early and late scans by severity of emphysema

Discussion

In this study, we investigated the interobserver agreement in visual evaluation of CT of the lungs, and in addition we evaluated the ability to visually assess progression in lung disease. The population in the study derives from DLCST—a lung cancer screening trial with relatively healthy heavy smokers at time of inclusion.

Our results show substantial interobserver consistency in determining the presence of emphysema in a population with low average disease severity (kappa 0.74). The agreement on emphysema grading in general was moderate (kappa 0.57), but agreement in the upper lung zones was nearly substantial. There was a clear gradient for agreement in both the right and left lung: the highest kappa values were achieved for the upper zones, lower values for the middle zones, and the lowest values for the lower zones, suggesting a more clearly recognisable pattern of emphysema in the upper zones. However, no kappa was below 0.50 in any lung zone, and all values were highly significant.

Interestingly, a clear and significant time-trend was seen for both presence and grading of emphysema in all lung zones, which indicates that it is possible to visually detect early emphysema and disease progression. However, progression applies to continuous smokers only, and no significant progression was found in former smokers. In other words, for former smokers, progression of emphysema was not detectable by visual assessment, indicating a remarkable visually detectable effect of smoking cessation on rate of progression, which to our knowledge is a new finding.

Agreement on emphysema pattern was substantial for centrilobular emphysema, moderate for paraseptal emphysema, and fair for panlobular emphysema and mixed subtype (Fig. 2).

Fig. 2 Example of progression in centrilobular emphysema as seen in low-dose thoracic CT scan: a) baseline scan, and b) four years later

With regard to airway abnormalities, we found only fair agreement on airway wall thickening and airway dilation. For both wall thickening and airway dilation, progression was subtle but highly significant. Based on our airway results, we believe that computer analysis with dedicated software is preferable in the diagnosis of early airway abnormalities on low-dose CT, particularly because the level of inspiration may influence airway dimensions. Automated software for airway segmentation in which it is possible to adjust for lung volume is now available [18–21].

In general, interstitial findings in this lung cancer screening trial were infrequent. Agreement on ground-glass, honeycombing, reticulation, nodules, and interstitial classification was fair to moderate, but as there were few abnormalities—and almost none with a clear pattern of interstitial lung disease—conclusions on visual assessment of interstitial lung disease, in our opinion, cannot be based on this material. However, when limiting the inquiry of interstitial abnormalities to a question of ‘yes’ or ‘no’, regardless of the type of abnormality found, there was, in fact, a highly significant, nearly substantial interobserver agreement (kappa 0.60), indicating an ability to visually detect subtle interstitial abnormalities. There was also a significant time-trend, as more abnormalities were found in the later examinations, and thus it was possible to visually detect an increase in interstitial abnormalities.

Regarding classification of interstitial abnormalities, the observers had substantial agreement (kappa 0.63) on ‘no ILD suspicion’—in other words, agreement on ruling out ILD, but less agreement on categorising ILD suspicion (kappa values of 0.40–0.56). Mosaic attenuation and mass were relatively rare findings, both with a non-significant kappa of 0.42.

Our findings are in keeping with those from the COPDGene CT Workshop Group in 2012 [10], although we generally found higher interobserver agreement, possibly due to the initial pilot study as well as close and continuing cooperation of the observers.

Strengths and limitations

To the best of our knowledge, this study is the first to quantify the ability to visually detect onset and progression of early lung disease while at the same time evaluating interobserver agreement. The study was performed on a large cohort followed up with multiple CT examinations. Observers were blinded to participant identity and clinical data, and they were able to discuss findings with each other while independently arriving at their own conclusions. This design is an advantage with respect to applying the results to everyday clinical settings.

The observers were PhD students—with residencies in radiology and pulmonology, respectively, and each with two to three years of chest CT experience—dedicated to analysing the chest CT images. Moreover, a couple of months before the main study, a pilot study was performed in which the PhD students scored 300 images under supervision of board-certified specialists. The results show that interobserver agreement was quite high, indicating consistency, and that acceptable experience was therefore achieved.

A limitation of the study is the low prevalence of severe disease—both severe COPD and ILD are uncommon in this cohort—restricting conclusions to visual assessment of early signs of lung disease.

The low-dose imaging technique used sets a spatial resolution limit to diagnosing very small changes, and thus small bronchial abnormalities such as wall thickening could be difficult to accurately assess. This is a possible contributing factor to the results of airway analyses.

Our findings underscore the usability of visual assessment of early emphysema, which can be utilised diagnostically to follow patients, but also as an educational tool that could facilitate smoking cessation attempts in early-stage disease.

Evaluation of visual analysis of more advanced disease—both COPD and ILD—could add important knowledge to this field. A similar study design on cohorts of various lung disease groups could be performed. In particular, we find it interesting to further evaluate interobserver consistency and time-trend in ILD diagnostics, which are complicated and often ambiguous. At the present time, automated methods of ILD diagnostics are still in an investigative phase with research ongoing [22], and have not been clinically implemented. As such, diagnostic imaging of these patients is highly dependent on reliable visual assessment.

Conclusions

Our study implies that visual scoring of chest CT, using a systematic approach, can characterise presence, pattern, and progression of early emphysema.