Introduction

Osteoporosis is recognized as a latent metabolic bone disease characterized by low bone mass and microarchitectural deterioration resulting in increased bone fragility [1,2,3,4]. Osteoporosis is underdiagnosed: it remains asymptomatic for several years until fractures develop that limit activities of daily living, especially in elderly people [5,6,7]. As the population ages, the number of patients in China with osteoporosis or osteoporotic fractures is estimated to reach 212 million and 5.99 million, respectively, by 2050, which will lead to significant increases in morbidity and mortality [8, 9]. Therefore, early screening and monitoring of osteoporosis are crucial for timely prevention and treatment of osteoporotic fractures [10,11,12]. Bone mineral density (BMD) is an important marker of bone strength. Osteoporosis is diagnosed when BMD falls below a certain threshold, and BMD can also predict fracture risk [1, 13,14,15]. Although dual-energy X-ray absorptiometry (DXA)-derived areal BMD is required for diagnosing osteoporosis according to the World Health Organization criteria, quantitative CT (QCT) is more sensitive for monitoring bone loss and measuring trabecular bone density because it is less affected by severe degeneration of the spine, vascular calcification, oral contrast agents, or body position [16,17,18]. Low-dose chest computed tomography (LDCT) scans performed for lung cancer screening, which include lumbar spine images, can be used to assess volumetric BMD and screen for osteoporosis simultaneously [19, 20].

Although QCT has been proposed as a relatively accurate method for evaluating BMD, it requires dedicated equipment that many hospitals, especially primary-level hospitals, cannot afford. In previous studies [14, 21], conventional diagnostic CT scans were used to measure BMD via direct measurement of the CT values of cancellous bone, with correlation coefficients ranging from 0.399 to 0.891. However, CT image analysis requires frequent manual operations, including localization of vertebral bodies (VBs) and extraction of the volume of interest (VOI), which are laborious and repetitive tasks in large-scale osteoporosis screening [6, 22]. With current advances in computational performance, data processing, and the availability of large-scale datasets, artificial intelligence (AI)-based medical image analysis has played an essential role in extracting image features, identifying lesion areas, and classifying disease types, thereby assisting in the diagnosis and prevention of disease [23]. In particular, deep-learning (DL) algorithms, especially convolutional neural network (CNN) algorithms, have been applied to extract vertebral CT values to determine BMD and related fractures from CT images [11, 24]. DL is expected to replace manual operation in BMD measurement, to free radiologists for more important and meaningful tasks, and to reduce the cost of osteoporosis screening [25, 26].

A few studies have reported the application of DL techniques for osteoporosis screening. Pan Yali et al [27] developed a DL-based system to automatically measure BMD for opportunistic osteoporosis screening from low-dose chest CT scans acquired for lung cancer screening. However, in that report, only images from a single CT scanner were used for training and testing the DL-based model. Yasaka et al [28] developed a DL model to predict the BMD of lumbar vertebrae from unenhanced abdominal CT images by using BMD levels obtained from DXA as a reference. The model derived from that study focused on the prediction of areal BMD but did not provide information on VB location. Fang Yijie et al [29] developed a deep CNN model to predict the BMD of lumbar vertebrae (L1–L4) with BMD values obtained from QCT as a reference. LDCT for lung cancer screening, which causes neither additional exposure nor substantial cost, is more common than spinal or abdominal CT scans. Li Yali et al [30, 31] demonstrated that the quality of LDCT images obtained from various CT scanners sold by five different manufacturers was comparable. The linearity of CT values from those LDCT images was unbiased and could be used to quantify BMD values accurately. However, few studies in the literature have validated DL-based automatic osteoporosis screening models on LDCT images obtained from different scanners with respect to their accuracy in locating the thoracic and upper lumbar vertebrae and in calculating BMD values comparable to those obtained from QCT.

In this study, we aimed to (I) identify the potential of CT scans, which were originally used for lung cancer screening, for both osteoporosis screening and bone quality evaluation and (II) assess the performance of the AI-based automatic osteoporosis screening model in both locating the lumbar VB and calculating BMD values in CT images obtained from different CT scanners.

Materials and methods

Study design

This retrospective, single-center study was approved by the First Affiliated Hospital of Zhengzhou University’s Institutional Review Board (No. 2021-KY-1222-002), and the need for informed consent was waived. From the electronic database of our hospital, we retrieved the data of individuals who underwent paired LDCT and QCT examinations for screening both lung cancer and osteoporosis from September 2019 to January 2022. The collected data were divided into two parts. One part was manually annotated to obtain the diagnostic standard for segmenting vertebrae; it was used for training and testing and to explore the effect of gender and age group on the model. The other part was used for external validation to explore the effect of scanners from different manufacturers on the model. The development of the fully automated BMD measurement system consisted of two main steps. First, an end-to-end DL model was trained to achieve automatic segmentation from the first thoracic vertebra to the second lumbar vertebra (T1–T12, L1, and L2). Second, a DL regression algorithm was used to calculate the BMD from the twelfth thoracic vertebra to the second lumbar vertebra (T12, L1, and L2). The workflow is shown in Fig. 1.

Fig. 1 Overview of this study

Dataset

Individuals who had a history of prior spinal surgery or primary or metastatic tumors were excluded. Finally, 7713 individuals were enrolled in this study. A total of 3312 LDCT scans were manually annotated by an experienced radiologist for the contours and the anatomical names of all VBs and were used to develop the DL-based system. These images were randomly divided into a training set (N = 2337) and a test set (N = 975). The categorization process ensured that the gender distribution was balanced and that the age distribution was realistic. The remaining 4401 unannotated LDCT scans obtained from 3 different scanners (200 scans from GE, 2867 scans from Philips, and 1334 scans from Siemens) were used as external validation datasets to evaluate the developed system.

Data preparation

Data collection

We chose hospitals with QCT scanners from different vendors to collect the data. QCT can be carried out simultaneously with LDCT during plain-scan physical examination. Both the low-dose chest CT images and the corresponding BMD values could be collected via QCT. In accordance with previous studies, all individuals were scanned from the lung apex to the lower edge of L2 on three different scanners at the same hospital ((I) Revolution CT, GE Healthcare; (II) Brilliance iCT, Philips Healthcare; (III) Somatom Force, Siemens Healthineers). Noncontrast chest CT was performed while the patient held their breath at mid-inspiration. To calibrate the linear relationship between CT values and BMD values, QA phantom data were acquired once a month by separately scanning an asynchronous phantom (Mindways Software, Inc.) with the same parameters as the other scans. The scan parameters are shown in Supplemental Table 1 [19]. BMD values were recorded directly by the physician operating the QCT equipment. The BMD values of all individuals were extracted from three consecutive VBs, T12 to L2, using QCT image analysis software (QCT Pro 6.1, Mindways Software, Inc.) [27]. The 9 mm-high VOIs capturing the trabecular bone were manually placed in the center of the targeted VBs, avoiding the basivertebral veins, cortical bone, and any focal pathology [19]. Based on the standard clinically utilized average BMD of L1–L2, bone mass was categorized as normal (> 120 mg/cm³), osteopenia (80–120 mg/cm³), or osteoporosis (< 80 mg/cm³) [32]. The BMD values from T12 to L2 and the assessment of bone mass constitute the final output of the developed system.
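The three-way categorization above can be illustrated with a short Python sketch. This is only an illustration of the stated thresholds, not the code used in the study; the function name `classify_bone_mass` is hypothetical, and the handling of the exact boundary values (80 and 120 mg/cm³ both counted as osteopenia) follows the ranges as written above.

```python
from statistics import mean


def classify_bone_mass(bmd_l1: float, bmd_l2: float) -> str:
    """Classify bone mass from the average trabecular BMD of L1 and L2 (mg/cm^3).

    Thresholds follow the text: normal > 120, osteopenia 80-120, osteoporosis < 80.
    """
    average_bmd = mean([bmd_l1, bmd_l2])
    if average_bmd > 120:
        return "normal"
    if average_bmd >= 80:
        # Both boundary values (80 and 120) fall into the osteopenia range.
        return "osteopenia"
    return "osteoporosis"


if __name__ == "__main__":
    # Hypothetical BMD values, for illustration only.
    print(classify_bone_mass(131.0, 127.5))
    print(classify_bone_mass(95.2, 101.4))
    print(classify_bone_mass(70.0, 85.0))
```

Because the category depends on a hard cutoff, two BMD values that differ by a fraction of a unit around 80 or 120 mg/cm³ can land in different classes.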

Data annotation

The diagnostic standard for the labeling specifications was developed by three imaging physicians, two of whom were annotators and one of whom was an arbitrator. Each annotator was a practising physician who had been practising medical imaging diagnosis at a public medical institution for more than 5 years, and the arbitrator was a practising physician who had been practising medical imaging diagnosis at a public medical institution for more than 10 years [29]. The diagnostic standard was developed as follows: the two annotating physicians independently reviewed the images, marked the target spine region in their own images, and then passed their images and annotation results to the arbitrator separately. The arbitrator reviewed and modified the results of the two annotating physicians to form a single final annotation result.

Statistical analysis

Continuous variables with a non-normal distribution are expressed as medians (interquartile ranges, IQRs), and categorical variables are represented as frequencies. The chi-square test was carried out to assess categorical variables. The Kruskal‒Wallis test was used to compare non-normally distributed continuous variables between the two datasets. The Pearson correlation coefficient R was used in the correlation evaluation for the bone density regression task. We used the mean absolute error (MAE), coefficient of determination (R2), and Bland‒Altman plot to evaluate the overall regression performance. The Kolmogorov‒Smirnov test was used to test the normality of all the continuous variables. p < 0.05 was considered to indicate statistical significance. R software for Windows (version 4.2.1) was used for the statistical analyses. Sensitivity and specificity were used to evaluate the diagnostic performance for normal, osteopenia, and osteoporosis patients.
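For readers who want to reproduce the regression metrics named above, the following is a minimal Python sketch (the study itself used R software, and this is not the authors' code). The function name `regression_metrics` is hypothetical. It assumes that R2 is defined as 1 − SS_res/SS_tot with the QCT value as the reference; the text does not state which definition was used, and the square of the Pearson R is another common choice.

```python
import math


def regression_metrics(reference, predicted):
    """Return (Pearson R, R^2, MAE) for paired BMD values in mg/cm^3.

    `reference` holds the QCT values, `predicted` the model outputs.
    """
    n = len(reference)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    mean_ref = sum(reference) / n
    mean_pred = sum(predicted) / n

    # Pearson correlation coefficient R.
    cov = sum((r - mean_ref) * (p - mean_pred) for r, p in zip(reference, predicted))
    ss_ref = sum((r - mean_ref) ** 2 for r in reference)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    pearson_r = cov / math.sqrt(ss_ref * ss_pred)

    # Coefficient of determination, assumed here to be 1 - SS_res / SS_tot.
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    r_squared = 1.0 - ss_res / ss_ref

    # Mean absolute error.
    mae = sum(abs(r - p) for r, p in zip(reference, predicted)) / n
    return pearson_r, r_squared, mae


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    qct = [100.0, 120.0, 80.0, 150.0]
    model = [102.0, 118.0, 83.0, 149.0]
    print(regression_metrics(qct, model))
```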

Results

Table 1 summarizes the demographic characteristics of all patients (n = 3312), who were divided into a training set (n = 2337) and a test set (n = 975). For all patients, the median age was 54 years (IQR, 47–61 years), 1540 were female (46.5%), and 1772 were male (53.5%). We then divided the patients into four groups based on age distribution: under 45 years of age, between 45 and 55 years of age, between 55 and 65 years of age, and over 65 years of age, with 553 (16.7%), 1244 (37.6%), 989 (29.9%) and 526 (15.9%) patients, respectively. The median slice number of each CT scan was 370 (IQR, 345–405). Age and slice number exhibited statistically significant differences between the different groups (p < 0.05).

Table 1 Baseline patient characteristics

The segmentation results of the VBs for the test set are shown below in Supplemental Fig. S2. The proposed method achieved a mean Dice similarity coefficient (DSC) of 95.8% for VB segmentation. The DSCs of the T12, L1, and L2 vertebrae, which were the focus of the subsequent BMD regression network, were 95.2%, 94.1%, and 95.6%, respectively. Figure 2 shows the original images and the AI automatic segmentation images of VBs from normal, osteopenia, and osteoporosis patients.
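As a reminder of what the DSC measures, the sketch below computes it for two binary masks as twice the overlap divided by the total number of foreground voxels in both masks. It is a plain-Python illustration with a hypothetical function name `dice_coefficient`, operating on flattened 0/1 sequences; it is not the evaluation code used in the study, and returning 1.0 for two empty masks is a convention assumed here.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks.

    Both masks are flat sequences of 0/1 values of equal length
    (for a 3D volume, the voxels flattened in the same order).
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")

    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    foreground_total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)

    if foreground_total == 0:
        # Both masks are empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / foreground_total


if __name__ == "__main__":
    # Hypothetical toy masks, for illustration only.
    manual = [0, 1, 1, 1, 0, 0]
    automated = [0, 1, 1, 0, 0, 0]
    print(dice_coefficient(manual, automated))
```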

Fig. 2 Segmentation masks with three categories predicted by the AI model and visualized with ITK-SNAP. (A1, A2) CT sagittal image; (B1, B2) manual segmentation; (C1, C2) automated segmentation; (D1, D2) automated VOI segmentation from T12–L2

The experiments also evaluated the segmentation performance of the model among different sex and age groups, as shown in Table 2. For all VB segmentations obtained from scanners from different manufacturers, R was greater than 0.99, R2 was greater than 0.98, and the MAE was less than 3.5. The test results showed that the overall VB segmentation performance was better than the segmentation performance of a specific VB. The performance evaluation metrics of the segmentation model in the male and female groups were very close to each other; therefore, gender did not have a significant influence on the segmentation network performance. Similarly, there were no significant differences among patients of different ages for the performance evaluation metrics of the segmentation model, indicating that patient age did not significantly affect segmentation network performance.

Table 2 Overall segmentation results in different gender and age groups on the test set

The evaluation metrics indicating the overall performance of the BMD regression algorithm model for different sex and age groups in the test set are shown in Table 3. The Bland‒Altman plot of the test set is shown in Fig. 3A, in which the abscissa is the mean of the predicted and actual BMD, the ordinate is the difference between the predicted and actual BMD, and the two dashed lines are the 95% limits of agreement. The vast majority of the differences are within these limits. When the algorithm model was applied to the validation dataset, the overall correlation coefficient R was 0.994, the determination coefficient R2 was 0.987, and the MAE was 2.88. To study the influence of patient age on the performance of the algorithm model, the patients were divided into four groups based on their ages: under 45 years, 45–55 years, 55–65 years, and over 65 years. There were no significant differences in algorithm performance among the different sex and age groups.
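The quantities plotted in a Bland‒Altman analysis can be sketched in a few lines of Python. This is an illustration rather than the study's code: the function name `bland_altman_limits` is hypothetical, and the limits are assumed to be the conventional mean difference ± 1.96 sample standard deviations, which the text does not spell out.

```python
import statistics


def bland_altman_limits(reference, predicted):
    """Return (bias, lower limit, upper limit) for paired measurements.

    The difference is predicted minus reference, matching the ordinate
    described in the text; limits are bias +/- 1.96 * SD (assumed convention).
    """
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    differences = [p - r for r, p in zip(reference, predicted)]
    bias = statistics.mean(differences)
    spread = statistics.stdev(differences)  # sample standard deviation
    return bias, bias - 1.96 * spread, bias + 1.96 * spread


def bland_altman_points(reference, predicted):
    """Return the (mean, difference) pairs that form the scatter plot."""
    return [((r + p) / 2.0, p - r) for r, p in zip(reference, predicted)]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    qct = [100.0, 120.0, 80.0, 150.0]
    model = [102.0, 118.0, 83.0, 149.0]
    print(bland_altman_limits(qct, model))
    print(bland_altman_points(qct, model))
```

Under this convention the bias is always the midpoint of the two limits, which gives a quick consistency check on any reported mean error and limits.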

Table 3 Comparison of algorithm performance between different genders and age groups on test set

Fig. 3 Bland‒Altman plot comparing BMD values obtained by using the developed system and QCT on the test set (A) and on external validation sets obtained from scanners of different manufacturers (B‒D)

There were 4401 patients in the external validation datasets, including 2867 patients scanned on Philips scanners (Set1), 1334 patients scanned on Siemens scanners (Set2), and 200 patients scanned on GE scanners (Set3). All the data were passed through the previously described segmentation network to obtain the vertebral masks corresponding to T1–T12 and L1–L2. Linear analysis was performed on the BMD values obtained from QCT and from the algorithm model based on T12, L1, and L2 vertebral images from the different manufacturers’ devices. The statistical results are shown in Table 4. Bland‒Altman plots comparing the BMD values obtained by using the developed system with those obtained from QCT for the scanners of the different manufacturers are shown in Fig. 3B‒D. Most of the differences were within the 95% limits of agreement. With the QCT results as the diagnostic standard, the BMD assessed by the developed algorithm model had mean errors ranging from − 0.28 to 2.37 mg/cm³, with the 95% limits of agreement of Set1, Set2, and Set3 being (− 3.79, 3.67) mg/cm³, (− 3.14, 2.58) mg/cm³ and (− 0.61, 5.35) mg/cm³, respectively. Based on these results, the BMD regression model exhibited accurate prediction, good generalization ability, and robustness for CT scans obtained from all three scanners supplied by Philips, Siemens, and GE.

Table 4 Linear analysis of BMD values obtained from QCT and from the algorithm model based on data collected from scanners of different manufacturers on external validation sets

We also converted the BMD values predicted by the proposed algorithm model and those obtained from QCT into a three-class qualitative classification task for evaluation. A confusion matrix is a cross-tabulation of the model’s predicted results and the actual results. The predicted results of the test set are shown in Fig. 4 (a confusion matrix); the higher the diagonal values are, the better the model performance. However, the confusion matrix only counts the number of cases, and it is difficult to judge the quality of the model from raw counts when a large amount of data is available. Therefore, the following indicators were derived from the counts in the confusion matrix: sensitivity, specificity, positive predictive value, and negative predictive value. We calculated these indicators on the test set and the three external validation sets, as shown in Table 5. Overall, the sensitivity of a normal diagnosis was greater than that of a diagnosis of osteopenia or osteoporosis. The sensitivity of the diagnosis of osteoporosis was greater in Set3 than in the other two external datasets, possibly due to the smaller sample size of this set. For the diagnosis of osteoporosis, the model achieved a sensitivity of more than 86% and a specificity of more than 98% across the different datasets, which indicates excellent performance.
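The four indicators can be derived from a three-class confusion matrix by treating each class in turn as "positive" and the other two as "negative" (one-vs-rest). The Python sketch below illustrates this under stated assumptions: the function name `one_vs_rest_metrics` is hypothetical, rows are taken to be the reference (QCT) classes and columns the predicted classes, and the counts in the example are invented for illustration and are not the study's data.

```python
def one_vs_rest_metrics(confusion, class_index):
    """Sensitivity, specificity, PPV and NPV for one class of a square matrix.

    confusion[i][j] is the number of cases whose reference class is i
    and whose predicted class is j (assumed orientation).
    """
    n_classes = len(confusion)
    true_pos = confusion[class_index][class_index]
    false_neg = sum(confusion[class_index]) - true_pos
    false_pos = sum(confusion[i][class_index] for i in range(n_classes)) - true_pos
    total = sum(sum(row) for row in confusion)
    true_neg = total - true_pos - false_neg - false_pos

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(true_pos, true_pos + false_neg),
        "specificity": ratio(true_neg, true_neg + false_pos),
        "ppv": ratio(true_pos, true_pos + false_pos),
        "npv": ratio(true_neg, true_neg + false_neg),
    }


if __name__ == "__main__":
    # Invented counts; class order: normal, osteopenia, osteoporosis.
    example = [
        [90, 10, 0],
        [5, 80, 15],
        [0, 10, 40],
    ]
    for index, name in enumerate(["normal", "osteopenia", "osteoporosis"]):
        print(name, one_vs_rest_metrics(example, index))
```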

Fig. 4 Confusion matrix of the test set

Table 5 The diagnostic performance of normal, osteopenia, and osteoporosis on test set and external validation sets

Discussion

We proposed a DL-based model for fully automated prediction of BMD from CT images, with the bone density values acquired from QCT as the gold standard. The model extracts the T12, L1, and L2 vertebrae via direct segmentation of VBs at the 3D level with the 3D-UNet architecture [33]. Subsequently, a modified DenseNet121 network [34] is used for the end-to-end prediction of BMD. We used 3312 annotated scans for training and testing the model and then analyzed the robustness of the model among patients of different genders and ages. A total of 4401 scans from CT scanners supplied by three different manufacturers were also collected as external validation sets to evaluate the stability of the algorithm across different devices.

In terms of segmentation performance, our model had an average DSC of 95.8% and showed good stability among patients of different genders and ages. The segmented VB targets in a previous study [29] were the L1–L4 vertebrae, and the average segmentation accuracy was approximately 82.3%. That task was trained on a 2D U-Net only, which missed the spatial features of the vertebrae and affected the subsequent analytical results. A similar study [27] segmented T1–T6, T7–T12, and L1–L2 separately based on a 2D network and subsequently segmented the T12, L1, and L2 vertebrae using a conventional image processing algorithm. The average segmentation accuracy was 86.6%. Our approach, unlike those of previous studies, was based on 3D image analysis with an end-to-end network model that segments each individual VB, achieving much greater accuracy than the previous 2D models.

In terms of the accuracy of BMD regression prediction, our model had a high R2 of 0.987 and an MAE of only 2.88 in the validation set. The MAEs for patients of different sexes ranged from 2.75 to 3.00, and the MAEs for patients of different age groups ranged from 2.29 to 3.67, both of which demonstrated a high degree of stability. We also collected CT scans from devices supplied by three different manufacturers to validate the regression model. The model achieved R2 values of 0.987, 0.993, and 0.986 and R values of 0.994, 0.992, and 0.996 in Set1, Set2, and Set3, respectively, with MAEs in the range of 2.60–3.41. Compared with previously reported studies [27, 29], our model achieved the best performance.

We also evaluated the diagnostic efficacy of qualitative osteoporosis screening based on the predicted BMD values. The sensitivity of the model was greater than 86% for osteoporosis and greater than 89% for osteopenia across the test set and the external validation sets. It should be noted that the assignment of BMD values close to the classification thresholds (80 and 120 mg/cm³) could influence the diagnostic efficacy of the classification. Therefore, comparing the BMD values obtained from the algorithm model with those from the QCT method is a more reasonable way to evaluate bone quality than assigning patients to only three categories.

In summary, a clinically useful opportunistic osteoporosis screening tool has been developed based on DL combined with CT images from scanners of different manufacturers. This tool achieves excellent accuracy for both the 3D segmentation of VBs and the regression prediction of BMD. Therefore, it has a strong ability to screen for osteoporosis with high efficacy and sensitivity. In the near future, the proposed model can be further improved in the following aspects. First, more special cases of patients with incomplete numbers of vertebrae for various reasons (e.g., injury, surgery, or congenital abnormalities) should be collected and used to train the model, to further improve the precision of VB localization and segmentation for those patients (approximately 5% of all patients). Second, more data should be collected from different hospitals or medical centers all over China to further validate the applicability, adaptability, and robustness of the proposed osteoporosis screening and bone quality assessment algorithm model.