Introduction

Chronic obstructive pulmonary disease (COPD) is a worldwide public health challenge, owing to its high prevalence and its long-term effects on disability and mortality [1, 2]. The accurate diagnosis of COPD is crucial for the timely initiation of appropriate therapeutic intervention to improve the patient’s quality of life and reduce the risk of future exacerbation [3]. Previous studies have estimated that over 40% of COPD patients remain undiagnosed, particularly in developing countries [4, 5]. According to Wang et al [6], only 12% of individuals with chronic airflow limitation had a previous spirometry-defined COPD diagnosis in a recent screening of 57,779 participants in China. Furthermore, prior studies have reported that COPD can be either misdiagnosed or missed entirely when spirometry is used alone [4, 5, 7]. Therefore, alternative strategies are urgently needed to provide accurate detection and evaluation of COPD for optimal clinical decision-making.

With the growing use of computed tomography (CT) for lung cancer screening, there is an opportunity to use these scans to identify patients with COPD, with subsequent confirmation by spirometry. In the past few years, growing evidence has shown that CT-derived qualitative and quantitative imaging analyses have potential in COPD diagnosis and stratification [8, 9]. CT features of the lung parenchyma, airways, pulmonary vasculature, and chest wall provide valuable insights for evaluating lung function, categorizing disease severity, and predicting outcomes for patients with COPD [10,11,12]. Thus, CT-based imaging can lead to improvements in COPD detection and evaluation [13, 14]. However, objective CT analysis requires prior knowledge of the anatomical and physiological implications of diseases likely to be associated with certain clinical outcomes. In addition, the conventional manual inspection of CT images is often time-consuming and subjective, which limits its use for large-scale COPD screening.

Recent advances in deep learning (DL)–based artificial intelligence (AI) have enabled the direct interpretation of medical images without relying on specific radiographic features of interest [15, 16]. Sophisticated and subtle image patterns (at distinct spatial scales) have been learned by trained models and used to discriminate diseases without any human guidance [17]. As a result, the advantages of a DL strategy for improving the accuracy and efficiency of human COPD detection, and for bolstering human knowledge of COPD subtypes, have in principle been established [18,19,20,21]. For example, Gonzàlez et al [22], using the large cohort COPD genetic epidemiology study (COPDGene), trained a 2D convolutional neural network (CNN) for automated COPD detection in smokers, achieving a c-statistic of 85.6%. Using the same cohort, Hatt et al [23] developed CT-based CNN classifiers for COPD assessment. These classification models were also shown to be predictive in the National Lung Screening Trial (NLST) cohort. Recently, Tang and colleagues [24] proposed a novel residual network for the detection of COPD among smokers screened for lung cancer using low-dose CT (LDCT) and achieved an area under the receiver operating characteristic curve (AUC) of more than 88%. It is worth noting that the DL models reported in most previous studies were trained using open public datasets that were primarily composed of current or former smokers. Thus, it remains largely unknown whether this approach could be applied to a heterogeneous dataset drawn from diverse clinical scenarios, especially among the Chinese population.

In the present study, we recruited 1393 participants from outpatient, inpatient, and physical examination center settings of four large hospitals in China. The dataset was highly heterogeneous, which we believed would closely mimic the screening scenario. We developed an attention-based multi-instance learning (MIL) model for COPD detection and a multi-channel 3D residual network for the Global Initiative for Chronic Obstructive Lung Disease (GOLD) stage classification among spirometry-confirmed COPD patients. External validation of the COPD detection model was performed with an LDCT subset acquired from the NLST cohort, which comprised 620 patients with current or previous smoking history.

Materials and methods

Data collection

CT image data were retrospectively collected from 1441 participants from outpatient, inpatient, and physical examination center settings of four large public hospitals across China, namely, the Affiliated Hospital of Qingdao University, Changsha First Hospital, People’s Liberation Army Joint Logistic Support Force 920th Hospital, and Shandong Provincial Hospital. The inclusion criteria were as follows: (1) a single inspiratory breath-hold CT scan acquired in the supine position; (2) at least one pulmonary function test; and (3) age of 18 years or older with no history of thoracic surgery. All images were uploaded by principal investigators at each site through the InferScholar research platform (Infervision). Spirometry data, demographic information, smoking history, clinical indices, and underlying diseases were extracted from electronic medical records using a standardized data collection form. The data collection period ranged from August 10, 2019, to October 8, 2020. After excluding cases with incomplete clinical data (17 cases), substandard pulmonary function (8 cases), and poor CT image quality (23 cases), a total of 1393 participants were enrolled in the final cohort. Eligible subjects were then randomly divided into a training set (n = 837, 391 with COPD), a validation set (n = 278, 135 with COPD), and a test set (n = 278, 118 with COPD) at a ratio of 3:1:1 for subsequent model development (see Figure 1). We further elected to use a random subset of the NLST cohort (n = 620) as a means of external validation. The NLST study was conducted by the National Cancer Institute to determine the feasibility of using LDCT for lung cancer screening and included subjects with spirometry-defined COPD, facilitating an investigation of model efficiency for LDCT and diverse populations [25].
The NLST subset included participants between 55 and 74 years old, with a smoking history of more than 30 pack-years and no self-reported history of lung cancer, thereby allowing us to further evaluate the model efficiency among smokers. Detailed NLST subset information is provided in Table S1. This study was approved by the ethics commissions of all participating hospitals and requirements for written informed consent were waived due to the retrospective nature of the research.
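The random partition into training, validation, and test sets described above can be sketched as follows. This is only an illustration: the helper name `split_cohort`, the fixed seed, and the use of plain (unstratified) shuffling are assumptions, since the text does not state how the randomization was implemented.

```python
import random

def split_cohort(ids, sizes=(837, 278, 278), seed=0):
    """Shuffle participant IDs and cut them into train/validation/test lists.

    The sizes default to the counts reported in the text (roughly 3:1:1).
    The seed and the absence of stratification are illustrative assumptions.
    """
    ids = list(ids)
    if sum(sizes) != len(ids):
        raise ValueError("split sizes must sum to the cohort size")
    random.Random(seed).shuffle(ids)
    n_train, n_val, _ = sizes
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

# Hypothetical participant IDs 0..1392 stand in for the 1393 enrolled subjects.
train, val, test = split_cohort(range(1393))
```

Keeping the three lists disjoint at the participant level is what allows the test set to be described as non-overlapping with the data used for training and threshold selection.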

Fig. 1

A multi-center COPD dataset establishment diagram

COPD diagnosis was confirmed by forced expiratory volume in 1 second (FEV1) to forced vital capacity (FVC) ratio less than 0.7 after inhalation of bronchodilators. The severity of COPD was graded according to the GOLD standard [2]. CT images were acquired using a range of acquisition protocols and scanners, representative of clinical routines. Further details regarding image acquisition are provided in Table S2.
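As a minimal sketch of the labeling rule above, the function below applies the fixed-ratio criterion (post-bronchodilator FEV1/FVC below 0.7) and then assigns a GOLD grade from FEV1 as a percentage of the predicted value. The grade cut-offs (80%, 50%, and 30% of predicted) are the standard GOLD spirometric thresholds and are not spelled out in the text; the function name `gold_stage` is hypothetical.

```python
def gold_stage(fev1_fvc, fev1_pct_pred):
    """Return the GOLD grade (1-4), or None if the COPD criterion is not met.

    fev1_fvc:      post-bronchodilator FEV1/FVC ratio (e.g. 0.62)
    fev1_pct_pred: FEV1 as a percentage of the predicted value (e.g. 52.6)
    """
    if fev1_fvc >= 0.7:
        return None          # no spirometry-defined airflow limitation
    if fev1_pct_pred >= 80:
        return 1             # GOLD 1, mild
    if fev1_pct_pred >= 50:
        return 2             # GOLD 2, moderate
    if fev1_pct_pred >= 30:
        return 3             # GOLD 3, severe
    return 4                 # GOLD 4, very severe

print(gold_stage(0.62, 52.6))
```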

Data preprocessing

Because the CT images were acquired from different vendors with varying scanning parameters, the original data were first adjusted to lung window settings using lower and upper Hounsfield unit (HU) bounds of −1500 and 600, respectively. All images were then resized to a resolution of 512 × 512 pixels using bilinear interpolation, and the whole CT volume was normalized.
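The three preprocessing steps can be sketched with NumPy as below. Several details are assumptions rather than the authors' implementation: the text does not say how the volume was normalized (min-max scaling of the windowed range to [0, 1] is used here), the corner-aligned sampling grid is one of several bilinear conventions, and the function names are hypothetical.

```python
import numpy as np

HU_MIN, HU_MAX = -1500.0, 600.0   # window bounds quoted in the text

def resize_bilinear(img, out_shape=(512, 512)):
    """Bilinear resize of a 2D array (corner-aligned sampling grid)."""
    img = np.asarray(img, dtype=np.float64)
    in_h, in_w = img.shape
    out_h, out_w = out_shape
    ys = np.linspace(0, in_h - 1, out_h)          # sample rows in input coordinates
    xs = np.linspace(0, in_w - 1, out_w)          # sample columns in input coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                       # vertical interpolation weights
    wx = (xs - x0)[None, :]                       # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def preprocess_slice(slice_hu, out_shape=(512, 512)):
    """Window to [HU_MIN, HU_MAX], resize, then scale to [0, 1] (assumed)."""
    windowed = np.clip(np.asarray(slice_hu, dtype=np.float64), HU_MIN, HU_MAX)
    resized = resize_bilinear(windowed, out_shape)
    return (resized - HU_MIN) / (HU_MAX - HU_MIN)
```

Applying `preprocess_slice` to every axial slice yields a volume with a fixed in-plane size and a common intensity range, which is the property the text relies on when pooling scans from different vendors.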

Development of the COPD detection model

The workflow for the experimental COPD detection model is illustrated in Figure 2 and consists of three primary steps: (1) preparation of CT lung instances and bags; (2) feature extraction using a deep residual neural network, ResNet18; and (3) an attention mechanism–based classifier for COPD detection. Whole CT volumes were divided into multiple parts, and a single axial slice (one instance) was selected from each part. The selected slices were grouped into a bag (a collection of instances) that carried the patient-level label (COPD vs non-COPD) used for training the network [26]. A weakly supervised approach, MIL, was adopted due to the heterogeneous nature of the COPD CT instances [27]. MIL has previously been used to examine available CT voxels and facilitate the detection of asymptomatic or subtle lesions during screening [28], while keeping computational costs and memory requirements manageable. In the next step, ResNet18 was used for feature extraction, generating a dictionary of visual characteristics from bag instances. Attention mechanisms were further applied to augment the most discriminative features related to COPD, thereby increasing detection accuracy [29]. Finally, the resulting responses were converted into probability values using a softmax classifier. A detailed network architecture and training methodology are provided in Supplemental Appendix 2.
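The bag-level classification step can be illustrated with a small NumPy forward pass. The sketch uses one common formulation of attention pooling (a tanh-scored softmax over instances); the authors' exact architecture is given in their Supplemental Appendix 2 and may differ. The random arrays stand in for ResNet18 slice features and learned weights, and all names and dimensions (`K`, `D`, `L`) are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention_mil_forward(H, V, w, W_cls, b_cls):
    """Pool instance features into one bag embedding and classify the bag.

    H:     (K, D) features of the K slices (instances) in one bag
    V, w:  attention parameters with shapes (L, D) and (L,)
    W_cls: (2, D) classifier weights; b_cls: (2,) bias
    """
    scores = np.tanh(H @ V.T) @ w            # one scalar score per instance, (K,)
    attn = softmax(scores)                   # attention weights sum to 1 over the bag
    bag = attn @ H                           # attention-weighted average feature, (D,)
    probs = softmax(W_cls @ bag + b_cls)     # [P(non-COPD), P(COPD)]
    return probs, attn

# Random placeholders for features and parameters (not trained values).
rng = np.random.default_rng(0)
K, D, L = 8, 512, 128                        # 8 slices per bag, 512-d slice features
H = rng.standard_normal((K, D))
V = rng.standard_normal((L, D)) * 0.01
w = rng.standard_normal(L)
W_cls = rng.standard_normal((2, D)) * 0.01
b_cls = np.zeros(2)
probs, attn = attention_mil_forward(H, V, w, W_cls, b_cls)
```

Because only the bag carries a label, the attention weights are what let the classifier emphasize the few slices that contain disease-related patterns without any slice-level annotation.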

Fig. 2

A flowchart for the proposed COPD detection model. The COPD detection pipeline consisted of three primary steps: (1) preparation of CT lung instances and bags (top left); (2) feature extraction using ResNet18 (bottom left); and (3) an attention mechanism–based classifier (bottom right). The top-right image shows details concerning the ResBlock layer. In 2D ResNet18, we use c, s, and d to denote the number of output channels, the stride, and the down-sampling factor, respectively. “ResBlock, c = 128, d = 2” denotes a vanilla ResBlock with 128 output channels and a down-sample skip connection that reduces the resolution by a factor of 2 in the x and y directions. Conv, convolution; BatchNorm, batch normalization; ReLU, rectified linear unit; Avg, average; FC, fully connected layer

Development of the COPD staging model

The GOLD stage of confirmed COPD cases was classified by training an end-to-end deep learning model to identify radiographic features suggestive of disease severity. As shown in Figure 3, a lung segmentation algorithm was first applied to raw 3D CT data to create binary lung masks and exclude unrelated information that might cause confusion or reduce learning efficiency. This segmentation algorithm was developed in-house, derived from the U-Net architecture [30], and implemented in MXNet. We next employed a multi-channel strategy that included raw CT volumes, segmented lung parenchyma, and emphysema features (percent low-attenuation area less than −950 Hounsfield units, %LAA-950) as model inputs. Stacked channels were concatenated into 3D volumes and passed to a 3D ResNet50 network for further processing. The proposed 3D ResNet50 consisted of five ResBlock layers capable of processing high-dimensional and complex features for improved prediction outcomes. A final softmax layer was applied to the output of the fully connected layer to generate probabilities for the four GOLD stage categories. The detailed network architecture and training methodology are provided in Supplemental Appendix 3.
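The emphysema measure and the three-channel input can be sketched as follows. %LAA-950 is computed as the percentage of lung voxels with attenuation below −950 HU. How the authors encoded the emphysema channel and filled non-lung voxels is not described here, so the binary emphysema map, the `background` fill value, and the function names are assumptions made for illustration.

```python
import numpy as np

def percent_laa(volume_hu, lung_mask, threshold=-950.0):
    """Percentage of lung voxels with attenuation below `threshold` (in HU)."""
    lung_voxels = volume_hu[lung_mask]
    if lung_voxels.size == 0:
        raise ValueError("lung mask is empty")
    return 100.0 * float(np.mean(lung_voxels < threshold))

def build_channels(volume_hu, lung_mask, threshold=-950.0, background=-1000.0):
    """Stack raw CT, masked parenchyma, and an emphysema map into (3, Z, Y, X).

    `background` (value assigned outside the lungs) and the binary encoding
    of the emphysema channel are illustrative assumptions.
    """
    raw = volume_hu.astype(np.float32)
    parenchyma = np.where(lung_mask, raw, np.float32(background))
    emphysema = (lung_mask & (volume_hu < threshold)).astype(np.float32)
    return np.stack([raw, parenchyma, emphysema], axis=0)

# Tiny synthetic 2 x 2 x 2 volume; two voxels lie outside the lung mask.
volume = np.array([[[-980.0, -900.0], [-960.0, 40.0]],
                   [[-1000.0, -700.0], [-955.0, -800.0]]])
mask = np.array([[[True, True], [True, False]],
                 [[False, True], [True, True]]])
laa = percent_laa(volume, mask)          # 3 of 6 lung voxels are below -950 HU
channels = build_channels(volume, mask)
```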

Fig. 3

Illustration of the GOLD stage classification model. Model input is a composite of three channels, including raw CT volumes, segmented lung parenchyma, and emphysema features (CT value ≤ −950 HU). Stacked channels were then passed through a 3D ResNet50 network consisting of five ResBlock layers. A final softmax layer was applied to the output of the fully connected layer to generate four GOLD stage categories. HU, Hounsfield units

Model validation

The performance of the proposed attention-based MIL COPD detection model was evaluated using a test set of 278 non-overlapping participants. External validation was further conducted with 620 LDCT scans acquired from the NLST cohort. Receiver operating characteristic (ROC) curves and their confidence intervals were determined using the DeLong method to assess the DL model’s ability to identify COPD patients from a large heterogeneous dataset. Metrics derived from the confusion matrix, namely sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and F1 score, were calculated at an optimal threshold selected on the validation set. We also reported the COPD detection accuracy of a common quantitative CT measurement (%LAA-950), as a reference to prior studies reporting similar outcomes. Five-fold cross-validation was used to evaluate the staging performance of the multi-channel COPD staging model. Considering the imbalance in the number of patients within each GOLD stage, the micro F1 score and Cohen’s weighted kappa were applied to allow comparison.
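The threshold selection and the confusion-matrix metrics can be sketched in plain Python as below. The threshold is chosen on validation data by maximizing the Youden index (sensitivity + specificity − 1) and would then be applied unchanged to the test data. The toy labels and probabilities are invented for illustration and are not study data; the function names are hypothetical, and both classes must be present for the ratios to be defined.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN, FN when predicting positive for prob >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        positive = p >= threshold
        if positive and y == 1:
            tp += 1
        elif positive and y == 0:
            fp += 1
        elif not positive and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in sorted(set(y_prob)):            # each observed score is a candidate
        tp, fp, tn, fn = confusion_counts(y_true, y_prob, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:                       # ties keep the lower threshold
            best_t, best_j = t, j
    return best_t, best_j

def summary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, NPV, and F1 from confusion counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sens / (ppv + sens)
    return {"sensitivity": sens, "specificity": spec, "ppv": ppv, "npv": npv, "f1": f1}

# Invented validation labels (1 = COPD) and model probabilities.
val_true = [0, 0, 0, 0, 1, 1, 1, 1]
val_prob = [0.05, 0.10, 0.20, 0.40, 0.30, 0.60, 0.70, 0.90]
threshold, j = youden_threshold(val_true, val_prob)
metrics = summary_metrics(*confusion_counts(val_true, val_prob, threshold))
```

Selecting the threshold on the validation set and freezing it before scoring the test set keeps the reported sensitivity and specificity free of test-set tuning.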

Statistical analysis

Measurement data of the baseline clinical and demographic characteristics with normal distribution were presented as mean ± standard deviation (SD), and data with non-normal distribution were presented as the median (M) and interquartile range (IQR). Categorical variables were presented as numbers (%). The Wilcoxon signed-rank or Kruskal-Wallis tests were used for numerical variables, and Fisher exact tests were used for categorical variables. No multivariable analyses were conducted, because we deployed each model as an assessment of risk over the entire cohort. Statistical analysis was performed using IBM SPSS Statistics 20.0 (SPSS) and the R programming language (version 3.4.0, http://www.Rproject.org).

Results

Demographic and clinical characteristics

A total of 1393 participants were included in the study: 644 spirometry-defined COPD patients and 749 non-COPD participants. The median age of COPD patients was higher than that of non-COPD participants (62 vs 56, p < 0.001) and the majority of the COPD cohort was male (76.09%), which is consistent with COPD gender distributions in China [6]. In addition, a higher proportion of smokers (24.53% vs 4.67%, p < 0.001), a reduced FEV1 percentage (52.56% vs 103.25%, p < 0.001), and a lower average body mass index (BMI) (22.73 vs 24.02, p < 0.001) were evident among the COPD patients. The percentages of stage 1, 2, 3, and 4 spirometry-defined COPD subjects on the GOLD scale were 3.73%, 59.63%, 30.28%, and 6.37%, respectively. Cardiovascular disease was the most common comorbidity within the dataset, followed by asthma in the COPD group and diabetes mellitus in the non-COPD group. Of the non-COPD participants, 376 (50.20%) were healthy subjects with normal CT manifestations and clinical assessments. Detailed demographic and clinical characteristics for the participants are provided in Table 1.

Table 1 Demographic and clinical characteristics for the development dataset

COPD detection performance

We first examined the overall detection performance of the proposed CT-DL model. The present attention-based MIL algorithm correctly determined the presence or absence of COPD in 243 of 278 subjects in the test set, with an AUC of 0.934 (95% CI: 0.903, 0.961), as shown in Figure 4A. When applying the optimal threshold value (a probability of 0.25, determined by maximizing the Youden index [31] in the validation set), we obtained a sensitivity, specificity, PPV, NPV, and F1 score of 0.805, 0.925, 0.888, 0.865, and 0.894, respectively (see Table 2).

Fig. 4

COPD detection performance using the attention-based MIL model. (a) Receiver operating characteristic curves and confusion matrices for the MIL model during prediction of COPD in the internal test set. (b) COPD detection performance for the external validation set (NLST). (c) Model generalizability among groups categorized by sex, age, CT manufacturer, and slice thickness. ROC, receiver operating characteristic curve; AUC, area under the receiver operating characteristic curve; 95% CI, 95% confidence interval; NLST, National Lung Screening Trial

Table 2 COPD detection performance for the attention-based MIL model

We subsequently evaluated the generalizability of the model among groups categorized by sex, age, CT manufacturer, and slice thickness, as we anticipate that the model will be applicable in diverse clinical settings. The model exhibited relatively robust performance, with AUC values ranging between 0.874 and 1.000 (see Table 2 and Figure 4C). This performance was not affected by modifications to imaging settings or participant demographics. The common quantitative CT measurement %LAA-950 was also used as a reference, producing an AUC of 0.708 (95% CI: 0.648, 0.768) for the same test set when detecting COPD using univariate regression analysis (see Table 2).

For the external validation dataset (NLST), the model showed an AUC of 0.866 (95% CI: 0.805, 0.928). Using the same Youden index-derived threshold, the confusion matrix showed that 516 of 620 subjects were correctly categorized, with a sensitivity and specificity of 0.804 and 0.835, respectively (see Figure 4B). Other measurements, including PPV, NPV, and F1 score, are summarized in Table 2.

Feature extraction visualization

The lack of transparency in machine learning can be partly addressed by applying gradient-weighted class activation mapping (Grad-CAM) to visualize feature extraction using a heatmap [16]. As shown in Figure 5, signature lesions related to COPD detection and differential diagnosis, such as emphysema (A), diffuse exudation (B), bronchiectasis (C), and pulmonary mass (D), appeared as increased values in the Grad-CAM results, while zero values in the heatmap corresponded to normal regions in the lung. Insights generated from the CT-DL model were compared with manual annotations made by experienced respiratory specialists, and the results indicated that the model pays specific attention to these lesions when distinguishing COPD subjects.

Fig. 5

Visualization of features extracted by the MIL model during COPD detection. The first column shows the original CT axial slices. The second column shows manual annotations of emphysema (a), diffuse exudation (b), bronchiectasis (c), and pulmonary mass (d), identified by experienced respiratory specialists. The third column shows predictions made by the MIL model. The MIL model correctly predicted abnormalities, which are highlighted on the attention heatmaps. The color depth of the heatmaps represents the model-predicted probability of a lesion

GOLD stage prediction performance

Confusion matrices showed the number of cases for each combination of spirometry-defined GOLD stage and CT-DL model classification in the pooled dataset. The number of accurate GOLD stage predictions is shown on the diagonal, with a pooled overall accuracy of 76.4% (423 out of 553) (see Figure 6). Detailed results in Table 3 show that the AUCs for classifying GOLD stages 1, 2, 3, and 4 were 0.901, 0.903, 0.848, and 0.952, respectively. The model achieved a Cohen’s weighted kappa of 0.619, suggesting substantial agreement between predictions and truth labels. Other measurements within each stage, including sensitivity, specificity, and F1 score, are summarized in Table 3.
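For readers unfamiliar with the agreement statistic, Cohen’s weighted kappa can be computed from a confusion matrix of ordered categories as sketched below. It penalizes a prediction more the further it lies from the true stage. The text does not state whether linear or quadratic weights were used, so both are offered here as an assumption, and the 3 × 3 matrix is invented for illustration rather than taken from Figure 6.

```python
def weighted_kappa(confusion, weights="linear"):
    """Cohen's weighted kappa for a square confusion matrix of ordered classes.

    confusion[i][j] = number of cases with true class i and predicted class j.
    Disagreement weights are |i - j| ("linear") or (i - j)^2 ("quadratic").
    Undefined (division by zero) if all cases fall into a single class.
    """
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            d = abs(i - j)
            w = d if weights == "linear" else d * d
            observed += w * confusion[i][j]                 # weighted disagreement seen
            expected += w * row_tot[i] * col_tot[j] / n     # expected under chance
    return 1.0 - observed / expected

# Invented 3-class example: most errors fall one stage away from the truth.
example = [[2, 1, 0],
           [1, 2, 1],
           [0, 1, 2]]
kappa_linear = weighted_kappa(example)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is the scale against which the reported 0.619 should be read.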

Fig. 6

Confusion matrices of five-fold cross-validation performed on the multi-channel 3D ResNet50 network in classifying the GOLD stage. AUC, area under the receiver operating characteristic curve; 95% CI, 95% confidence interval

Table 3 GOLD stage prediction performance for the multi-channel 3D residual network applied to the test set

Discussion

In the present study, an attention-based MIL model was developed to identify spirometry-defined COPD patients using a large and highly heterogeneous collection of CT scans across China. The dataset contained participants recruited from outpatient, inpatient, and physical examination settings. Implemented with novel DL networks, our model achieved an AUC of 0.934 (95% CI: 0.903, 0.961) in the internal test group of 278 subjects. This DL-based approach also showed satisfactory robustness across distinct scanner models and the slice thicknesses used to reconstruct the CT scans, with AUCs of 0.8 and above. The generalizability of the model was externally validated using a separate dataset collected from a large cohort consisting of LDCT scans (NLST), with an AUC of 0.866 (95% CI: 0.805, 0.928). A multi-channel 3D ResNet50 network was further trained to predict GOLD stages for confirmed COPD patients, achieving an accuracy above 0.8 for every stage. The proposed model offers the requisite performance for detecting COPD and predicting GOLD stage, and, to our knowledge, it is also the first attempt to apply DL-based approaches to COPD case-finding among the Chinese populace.

Although the heterogeneous pathological nature of COPD has been understood for decades, patients are currently diagnosed primarily by spirometry, a history of exposure (smoking or other environmental factors), and respiratory symptoms at the time of presentation. Over the last few years, it has become evident that patients without spirometry abnormalities who experience COPD-like respiratory symptoms and acute exacerbation events (with significant pulmonary structural abnormalities) can often be found among these populations [32,33,34]. Crapo et al [35] presented an analysis of baseline phenotyping and a 5-year longitudinal progression for the COPDGene study, demonstrating that spirometry criteria alone were insufficient to characterize COPD participants among current and former heavy smokers. Results also indicated that quantitative CT metrics outperformed spirometry when predicting disease progression and mortality. Accordingly, CT scans could be used to improve COPD case-finding and evaluation beyond spirometry alone.

The development of AI for large-scale data processing has increasingly led to the use of DL-based techniques in establishing a direct link between diagnostic images and disease categorization [10, 36]. This approach overcomes the limitations of conventional manual CT image inspection, such as inter/intra-observer variability and heavy workloads. It also bypasses the requirement of prior knowledge of radiographic features, which is required for quantitative CT analysis. Gonzàlez et al [22] trained CNN models using CT images alone to detect and stage COPD among 7983 COPDGene participants, and achieved an accuracy of 0.773 for 1000 non-overlapping participants. In addition, Hatt et al [23] implemented CNN models with an accuracy of 0.777 in the COPDGene cohort and 0.762 in the NLST cohort. Owing to limited graphics processing unit (GPU) capacity, both studies extracted a subset of CT slices to build a single montage for each individual instead of using the full images. The spatially heterogeneous nature of COPD has since prompted researchers to adopt an MIL strategy in COPD detection, as only the label of COPD or non-COPD is needed in case-finding scenarios. Bruijne et al [37] reported that the use of MIL combined with support vector machine (MI-SVM) could facilitate the detection of COPD by automatically identifying emphysema regions on CT images. Xu et al [26] also proposed a deep CNN transferred multiple instance learning (DCT-MIL) model to identify COPD using CT images from 280 participants including 190 COPD patients, with an accuracy of 99.29% and AUC of 0.983. The analysis process used in the present study differed from that used in the aforementioned studies in terms of patient selection and disease spectrum distribution. Meanwhile, we adopted a novel attention-based MIL strategy that increased the proportion of lesion-related information without increasing computational requirements and costs.
A multi-channel 3D ResNet50 network allowed the model to extract spatial information between slices and identify abnormal images exhibiting relatively small regions of interest, further improving staging performance (see Supplemental Appendices 2–4).

This study offers several clinical benefits. The deep learning model was trained using subjects recruited from both respiratory clinics and health management centers, thus including participants with normal spirometry as well as normal CT results. This scenario is representative of diverse clinical situations in which COPD patients could be detected. Previous DL algorithms for COPD detection have mostly been trained using cohorts enrolling former and current smokers, which may not truly reflect case-finding settings. While researchers from the COPDGene and ECLIPSE cohorts have reported promising COPD imaging results, it is crucial to further extend this expertise into a Chinese population, because a very small percentage of subjects from the aforementioned studies were ethnically Chinese. Furthermore, the increased use of LDCT for pulmonary nodule assessment and lung cancer screening has created an opportunity to apply the present model to COPD detection, with subsequent confirmation using spirometry. This is particularly relevant, as our model generalized to LDCT in the NLST subset.

The present study does have some limitations. First, spirometry was used to diagnose COPD instead of symptoms or radiographs, which may prevent our algorithm from being generalized to the detection of COPD in patients without airflow limitation, such as those with paraseptal emphysema. This was a result of the relatively objective criteria used for enrollment. Second, the size of our cohort is relatively small compared with other larger cohorts, and the inclusion criteria may have introduced some undesired biases. Third, a limited number of subjects with GOLD 1 and 4 were enrolled, which might constrain the model’s stratification capacity and may explain the discrepancy in staging efficiency between groups. Fourth, the NLST dataset may be considered a suboptimal validation cohort given its predominance of smokers at high risk of lung cancer. To improve the efficiency of detection and staging, we are currently recruiting more participants and aim to optimize our cohort in the future. Finally, the ability of DL to detect and stage COPD without specification of clinical or radiographic characteristics could be considered both a strength and a weakness. The “black box” nature of the DL model may severely limit its utility in clinical situations, as it does not provide sufficient information to clinicians concerning its decision-making process. Future work is urgently needed to elucidate the decision path.

In conclusion, we developed and validated a CT-DL approach for detecting and staging spirometry-defined COPD in a highly heterogeneous Chinese population cohort. The proposed approach achieved the desired performance and could serve as a powerful tool for COPD case-finding, providing useful indicators for clinicians and clinically relevant findings that could improve management and follow-up treatment for specific patients.