Introduction

Osseous metastases are common, affecting approximately 400,000 adults in the USA, most often patients with primary breast, prostate, or lung cancer, and are a source of morbidity and pain [1,2,3]. Detection of metastatic disease is also important for staging, outcome prediction, and treatment planning [4,5,6]. The majority of these metastases occur in the axial skeleton, which is most commonly evaluated on chest, abdomen, and pelvis CTs, where findings can be subtle [3, 7]. In addition, these CTs are primarily interpreted by general or body radiologists rather than musculoskeletal radiologists.

It has been widely recognized that automated lesion detection could help improve radiologist sensitivity for osseous metastases, and multiple computer-aided detection (CAD) software systems have been developed [3, 8,9,10,11,12]. Although deep convolutional neural networks (CNNs) have been used to detect focal lesions in the brain, liver, teeth, lung, ocular fundus, skin, and breast, CNNs and other newer techniques have not been fully explored for this particular task [13,14,15,16,17,18,19,20,21]. We believe that CNNs can be more accurate than previously developed models in detecting osseous metastases. Our purpose was to develop a deep CNN to accurately detect sclerotic spinal metastases.

Materials and methods

This study was IRB-approved and Health Insurance Portability and Accountability Act (HIPAA)-compliant, with exemption status for individual informed consent.

Dataset

A retrospective search of an internal report search engine developed at our institution was performed for CT reports with “sclerotic lesion” in the report impression between December 2000 and December 2019. Only chest, abdomen, and pelvis CTs in patients with a history of malignancy were included. If a patient had more than one CT scan, the timepoint at which the lesions were most visible was selected. If a lesion was visible on more than one slice, only one slice per lesion was selected, in order to avoid overfitting the model to our dataset. For the same reason, if there was more than one lesion at a vertebral level, only one slice was chosen per vertebral level. Lesions were confirmed across multiple contiguous slices, prior and/or follow-up imaging, and/or MRI. A total of 242 CT scans were used for the study, from which 600 unique images were collected. The images were divided into 90% training (N = 540) and 10% test (N = 60) datasets. The 540 training images were augmented to 20,000 training images.

The CTs were performed using multi-detector CT scanners (General Electric, Waukesha, WI, USA; Siemens Healthcare, Erlangen, Germany) at our institution. Studies with and without contrast enhancement were included. Patients were scanned supine, headfirst, and with arms overhead to decrease streak artifact.

Ground truth labeling (manual segmentation)

All studies were reviewed by two investigators with 1 and 6 years of medical image analysis experience, supervised by a musculoskeletal radiologist with 10 years of experience, to identify sclerotic osseous lesions presumed to represent metastases. Lesions were considered sclerotic if denser than the adjacent unaffected marrow. Only lesions that were 100% sclerotic were included in the study; mixed lytic-sclerotic lesions were excluded. For each vertebra with a sclerotic lesion, a single axial image at the largest cross-sectional diameter was used for segmentation.

Manual and semi-automated segmentation was performed using the OsiriX DICOM viewer (version 6.5.2, www.osirix-viewer.com/index.html). Each image was segmented into 3 classes:

  1. lesion (any number of well-defined abnormal sclerotic foci; green),

  2. bone (non-pathologic marrow and vertebral cortex, including osteophytes; magenta),

  3. background (all remaining image pixels; black).

Initial segmentation of the entire vertebral body was performed using an extension of a previously established model [22]. This segmentation was then checked and fine-tuned by the two investigators with 1 (200 images) and 6 years (400 images) of medical image analysis experience. Only the spine (vertebral body and posterior elements) was included in the “bone” class; all other bones (ribs, pelvis) were considered “background.” If multiple lesions were present on a slice, they were all segmented separately but labeled as the same class (lesion) (Fig. 1). The segmented images were stored in Tag Image File Format (TIFF), with masks saved as 8-bit RGB and the corresponding CT images as 8-bit single-channel grayscale. Finally, the images were manually cropped, centered on the vertebra, to final dimensions of 128 × 128 pixels.
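
The study does not provide code; the following is a minimal Python sketch of the mask preparation described above, assuming pure green/magenta/black color coding and hypothetical file paths. Note that cropping in the study was performed manually and centered on the vertebra; the helper below simply crops around the image center for illustration.

```python
# Minimal sketch of mask preparation; the exact RGB values per class and the
# cropping logic are assumptions for illustration.
import numpy as np
from PIL import Image

# Assumed pure-color coding: green = lesion, magenta = bone, black = background.
CLASS_COLORS = {
    (0, 0, 0): 0,      # background
    (0, 255, 0): 1,    # lesion
    (255, 0, 255): 2,  # bone
}

def mask_to_classes(mask_path):
    """Convert an 8-bit RGB mask (TIFF) to a 2-D array of class indices."""
    rgb = np.array(Image.open(mask_path).convert("RGB"))
    classes = np.zeros(rgb.shape[:2], dtype=np.uint8)
    for color, idx in CLASS_COLORS.items():
        classes[np.all(rgb == color, axis=-1)] = idx
    return classes

def center_crop(img, size=128):
    """Crop a square patch of `size` pixels around the image center."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]
```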

Fig. 1

Example of manual ground truth segmentation. The axial CT image of a vertebra (A) is manually segmented (B) resulting in a mask (C) with three classes: lesion (any number of well-defined abnormal sclerotic foci; green), bone (non-pathologic marrow and vertebral cortex, including osteophytes; magenta), and background (all remaining image pixels; black)

Training

The dataset was randomly divided into 90% training and 10% test datasets, with no overlap between the two. Contrast-limited adaptive histogram equalization (CLAHE) was performed on all grayscale images, which were then saved as JPEG files. Image augmentation was performed on grayscale image and ground truth mask pairs to enlarge the training dataset by applying random rotation, horizontal flipping, cropping, and scaling (N = 20,000). To further increase variability and improve generalizability, we applied Poisson noise to 50% of the augmented grayscale images, selected at random. All images were normalized to the training dataset mean and standard deviation.
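
A minimal sketch of this preprocessing and augmentation pipeline is shown below. The CLAHE clip limit and tile size, and the rotation range, are assumptions for illustration; the study does not report these parameters.

```python
# Illustrative preprocessing/augmentation sketch; parameter values are
# assumptions, not the study's settings.
import cv2
import numpy as np

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

def preprocess(gray_u8):
    """Apply CLAHE to an 8-bit grayscale image."""
    return clahe.apply(gray_u8)

def augment(img, mask, rng):
    """Apply the same random rotation/flip to an image and its mask."""
    angle = rng.uniform(-15, 15)                      # assumed rotation range
    h, w = img.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h))
    mask = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    if rng.random() < 0.5:                            # horizontal flip
        img, mask = img[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                            # Poisson noise on 50% of images
        img = rng.poisson(img).clip(0, 255).astype(np.uint8)
    return img, mask

def normalize(img, train_mean, train_std):
    """Normalize to the training dataset mean and standard deviation."""
    return (img.astype(np.float32) - train_mean) / train_std
```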

We trained our model from scratch in Keras/TensorFlow using an 80/20 training/validation split and a U-Net architecture (batch size 64, 100 epochs, dropout 0.25, initial learning rate 0.0001, sigmoid activation) [23]. Briefly, images were input into our U-Net pipeline, which consisted of five levels with four down-sampling steps followed by four up-sampling steps. Each step consisted of two successive 3 × 3 padded convolutions, each followed by a rectified linear unit (ReLU) activation function; in the down-sampling steps, a dropout of 0.25 was applied, followed by a max-pooling operation with a 2 × 2 pixel kernel. The up-sampling operations were performed using a 2 × 2 transposed convolution whose output was concatenated with the feature maps from the corresponding encoding step and then passed through the 3 × 3 convolutions. The final layer consisted of a 1 × 1 convolution followed by a sigmoid function, yielding a pixelwise prediction score for each class.
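
A condensed Keras sketch of the architecture as described is shown below; the per-level filter counts are assumptions, since the study does not report them.

```python
# Condensed sketch of the described U-Net: 3x3 padded convolutions with ReLU,
# dropout 0.25 on the encoding path, 2x2 max pooling, 2x2 transposed
# convolutions with skip concatenation, and a 1x1 sigmoid output layer.
# Filter counts are assumed values.
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(128, 128, 1), n_classes=3):
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    for filters in (64, 128, 256, 512):            # four down-sampling steps
        x = conv_block(x, filters)
        x = layers.Dropout(0.25)(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, 1024)                        # bottleneck (fifth level)
    for filters, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.concatenate([x, skip])          # skip from encoding step
        x = conv_block(x, filters)
    outputs = layers.Conv2D(n_classes, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)
```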

Our model was written and trained in Python 3.7 (Python Software Foundation, Beaverton, OR) using the Keras library (v2.2.4, https://keras.io) with TensorFlow 1.13.1 (Google, Mountain View, CA, USA) [24]. During training, class imbalance among the 3 classes was addressed by weighting each class by prevalence, penalizing predictions of the classes with the highest pixel counts (e.g., background). The batch size was 64, and we used the Adadelta optimizer (initial learning rate, 0.0001). The model was trained for 100 epochs with early stopping enabled, using multi-class Dice loss as the cost function. Training was performed on a Linux workstation (Ubuntu 14.04) with 4 NVIDIA Titan Xp graphics processing units.
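
The sketch below illustrates a training configuration consistent with this description. The class-weight values and the exact weighted Dice formulation are assumptions; the study reports only that classes were weighted by prevalence, with the background class penalized most.

```python
# Training-configuration sketch; weights and formulation are assumptions.
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adadelta

def multiclass_dice_loss(class_weights):
    w = tf.constant(class_weights, dtype=tf.float32)
    def loss(y_true, y_pred):
        # y_true is one-hot (batch, H, W, classes); per-class Dice over the
        # batch, then a weighted average across classes.
        axes = (0, 1, 2)
        intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
        union = tf.reduce_sum(y_true + y_pred, axis=axes)
        dice = (2.0 * intersection + 1e-6) / (union + 1e-6)
        return 1.0 - tf.reduce_sum(w * dice) / tf.reduce_sum(w)
    return loss

model = build_unet()  # from the U-Net sketch above
model.compile(
    optimizer=Adadelta(learning_rate=1e-4),
    # Hypothetical weights (background, lesion, bone): rarer classes up-weighted.
    loss=multiclass_dice_loss([0.1, 0.7, 0.2]),
)
# 80/20 training/validation split with early stopping, as described:
# model.fit(x_train, y_train, batch_size=64, epochs=100,
#           validation_split=0.2,
#           callbacks=[EarlyStopping(patience=10, restore_best_weights=True)])
```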

Testing and further validation

The Dice (F1) score was used to assess similarity between the manual segmentations and the CNN-predicted segmentations [25]; a Dice score of 1.00 indicates perfect similarity. To further examine model performance, we tested 1104 vertebral images from CTs of the chest, abdomen, and pelvis without bone pathology based on the report and confirmed as non-pathologic bone by one of the authors (hereafter referred to as the “non-pathologic dataset”). These images did not contain sclerotic foci suggestive of metastases.
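
For reference, the Dice score for a single class reduces to Dice = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a manual mask B; a minimal implementation might look like the following.

```python
# Minimal per-class Dice (F1) similarity between two binary masks.
import numpy as np

def dice_score(pred, truth):
    """Dice overlap between two boolean masks of the same shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(pred, truth).sum() / denom
```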

Statistical analysis

Sensitivity was defined as TP / (TP + FN), positive predictive value (PPV) as TP / (TP + FP), and specificity as TN / (TN + FP); a minimal computational sketch of these metrics follows the definitions below.

In the lesion test dataset, each image and lesion were evaluated as follows:

  • global sensitivity: a measure of the ability to identify one or more lesions on an image

    • True positive (TP) if the model correctly captured any lesion in that image.

    • False negative (FN) if the model failed to identify any lesion within an image.

  • local sensitivity and PPV: measures of the ability to identify each lesion separately in a given image

    • True positive if a lesion was captured, regardless of segmentation accuracy.

    • False positive (FP) if the model assigned “lesion” class to non-pathologic bone.

    • False negative if the model assigned “bone” class to a lesion.

In the non-pathologic test dataset, each image was evaluated as follows:

  • local specificity: a measure of how reliably non-pathologic bone was identified as such (the complement of the false positive rate).

    • True negative if non-pathologic bone was correctly identified as such.

    • False positive if the model assigned the “lesion” class to non-pathologic bone.
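
These definitions reduce to the standard confusion-matrix formulas; a minimal sketch, using the counts reported in the Results section as a worked example, is shown below.

```python
# The confusion-matrix definitions above, as simple functions. The example
# values reproduce the counts reported in the Results section.
def sensitivity(tp, fn):
    return tp / (tp + fn)

def ppv(tp, fp):
    return tp / (tp + fp)

def specificity(tn, fp):
    return tn / (tn + fp)

print(sensitivity(57, 3))     # global sensitivity, 57/60 images  -> 0.95
print(sensitivity(89, 8))     # local sensitivity, 89/97 lesions  -> ~0.92
print(ppv(89, 3))             # local PPV, 89/92                  -> ~0.97
print(specificity(958, 146))  # local specificity, 958/1104       -> ~0.87
```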

Results

Among the 600 images, 13 (2%) were cervical, 384 (64%) thoracic, and 203 (34%) lumbar. Lesions predominantly involved the vertebral body alone (497, 83%), followed by both the vertebral body and posterior elements (55, 9%) and the posterior elements alone (48, 8%).

Lesion test dataset

Dice scores were 0.83 for lesion, 0.96 for non-pathologic bone, and 0.99 for background (Fig. 2). Global sensitivity was 95% (57/60), with a false negative rate of 5% (3/60 images, each containing one missed lesion). Local sensitivity was 92% (89/97) and local PPV was 97% (89/92), with 3 false positive lesions: a pronounced anterior osteophyte (Fig. 3A), volume averaging from the pedicle, and an area of heterogeneous bone marrow (Fig. 3B). The local false negative rate was 8% (8/97 lesions, across the 5 images shown in Fig. 4). In 2 of the 5 images, lesions were missed in the context of multiple lesions, most of which were identified. In 3 of the 5 images, missed lesions were near cortical bone.

Fig. 2

Examples of accurate lesion segmentation using our deep CNN. Each row shows a distinct test image with varied lesion sizes and counts (A, B, C). Manual, manual tracing; Deep CNN, model prediction

Fig. 3

Prediction errors on 2 distinct test images (one per row) from the lesion dataset, showing 3 false positive lesions. (A) Pronounced anterior osteophyte (arrowhead) was mislabeled as a lesion. (B) Volume averaging from the right pedicle (arrow) and an area of heterogeneous bone marrow (arrowhead) were mislabeled as lesions. Manual, manual tracing; Deep CNN, model prediction

Fig. 4

Prediction errors on 5 distinct test images (one per row) showing 8 false negative lesions. (A) Single lesion in bone marrow (arrowhead). (B) Two lesions (arrowhead and arrow) in bone marrow. (C) Single lesion adjacent to endplate (arrowhead). (D) Single lesion adjacent to left costovertebral junction (arrowhead). (E) Three lesions, one in bone marrow (black arrow), one in right transverse process (white arrow), and one in left transverse process (arrowhead). Manual, manual tracing; Deep CNN, model prediction

Non-pathologic test dataset

Additional testing was performed on 1104 images without lesions covering the following anatomic regions: 21/1104 (2%) cervical, 787/1104 (69%) thoracic, and 318/1104 (29%) lumbar spine. Local specificity was 87% (958/1104). At least one false positive lesion was noted in 146/1104 (13%) images, of which 47/146 (32%) were 4 pixels or fewer and considered below the detection threshold (Fig. 5A). A single pixel of a given class was not reliably visualized on qualitative inspection and could represent a spurious prediction; we therefore used the next larger region with a square aspect ratio, comprising 4 pixels, as the detection threshold. The most common location for false positive lesions was the vertebral body (131/1104, 12%), most often at the endplate or in regions of increased sclerosis secondary to endplate degenerative change (73/131, 56%) (Fig. 5B), followed by osteophytes (25/131, 19%) (Fig. 5C). The second most common location was the posterior elements (36/1104, 3%), where volume averaging in the pedicle (18/36, 50%) (Fig. 5D) and lamina (8/36, 22%) was most frequent. Other likely causes of false positive lesions included disc calcification, facet degenerative change, and intravenous contrast in adjacent blood vessels (Fig. 5E).
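
The study does not describe how sub-threshold predictions were identified; one plausible implementation, assuming connected-component labeling of the predicted lesion mask, is sketched below.

```python
# Plausible size-threshold sketch: predicted "lesion" components of 4 pixels
# or fewer are treated as below the detection threshold, as described above.
import numpy as np
from scipy import ndimage

def filter_small_components(lesion_mask, min_pixels=5):
    """Keep only connected lesion components of at least `min_pixels` pixels."""
    labeled, n = ndimage.label(lesion_mask)
    sizes = ndimage.sum(lesion_mask, labeled, index=range(1, n + 1))
    good = [i + 1 for i, s in enumerate(sizes) if s >= min_pixels]
    return np.isin(labeled, good)
```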

Fig. 5

Examples of model false positive lesions in the non-pathologic test dataset. (A) Single pixel in the left transverse process (arrowhead) was considered below the detection threshold (≤ 4 pixels). (B) Vertebral body endplate change (arrowheads) was the most common location for false positive lesions. (C) Osteophytes (arrowhead) were the second most common, followed by (D) volume averaging in the posterior elements (arrowhead) and (E) contrast in adjacent vessels (arrowhead). Deep CNN, model prediction

Discussion

This study aimed to develop a deep CNN capable of detecting spinal sclerotic metastases on body CTs. We found that our proposed technique yielded high Dice scores for lesion detection and high global and local sensitivity. Further, evaluation on a non-pathologic dataset showed high local specificity. Altogether, our study showed that a deep CNN has the potential to assist in detecting sclerotic spinal metastases.

The earliest applications of artificial intelligence to focal lesion detection on imaging were in breast imaging (CAD) and CT colonography [14, 18]. CAD software packages have also been developed to detect focal bone lesions. Burns et al. used a “watershed algorithm” originally designed for detection of lytic lesions to evaluate the entire vertebral body for lesion candidates [3]. Additional algorithms using graph cuts and threshold levels were used to refine the segmentations. This CAD system detected 439 out of 532 (83%) lesions in 49 patients, with a testing set sensitivity of 79% (95% CI: 74–84%) [3]. Huang and Chiang described a CAD system for PET/CT able to detect both lytic and sclerotic lesions with a sensitivity, specificity, and accuracy of 85%, 92%, and 90%, respectively [26]. Hammon et al. described another CAD system able to detect sclerotic and lytic lesions on CT, with a per-case sensitivity of 83% for sclerotic lesions [8]. In comparison, our neural network had a global sensitivity of 95% and a local sensitivity of 92%. The model accurately predicted lesion borders, which was reflected in the high Dice scores (Fig. 2). The 8 missed lesions in the 5 images were in locations that are more difficult to detect. In the first instance, the lesion was only slightly denser than the background marrow and was difficult to identify on a single axial slice (Fig. 4A). In the second instance, there were 5 lesions on the image, of which the model identified 3; the two missed lesions were also only slightly denser than the surrounding bone marrow (Fig. 4B). In the third instance, the lesion was adjacent to the endplate, with the intervertebral disc and the adjacent vertebral body also visible on the same image (Fig. 4C). In the fourth instance, the model missed a small lesion adjacent to the left costovertebral junction (Fig. 4D). In the fifth instance, the model identified 4 out of 7 lesions; two of the missed lesions were in the posterior elements, a less common location for lesions than the vertebral body (Fig. 4E). Additional training images focused on these challenging areas may help improve detection.

To our knowledge, no previous study has used a deep CNN to help detect vertebral sclerotic lesions. Roth et al. used a CNN with random view aggregation to improve the performance of a CAD system [10]; however, specific lesion segmentation was not performed. A single study by Klein et al. examined CNN segmentation of normal bone and achieved a Dice score of 0.95, similar to ours for non-pathologic bone (0.96) [27]. Although segmentation of normal bone and background was not the goal of our study, this task could also have useful applications in trauma or osteoporosis screening [28, 29].

Using a CNN to detect focal lesions has been explored in other organ systems, including focal bladder lesions on CT urography, prostate cancer, and breast masses [30, 31]. CNNs have also been shown to assist in the detection of non-neoplastic focal lesions such as strokes, multiple sclerosis plaques, focal cortical dysplasia, liver cirrhosis, and fatty liver disease [13, 32,33,34,35,36,37,38,39,40,41]. Other non-radiology medical fields using deep learning to detect focal lesions include endoscopy (focal esophageal lesions), pathology (segmentation of cervical cancer cells), and ophthalmology (diabetic retinopathy) [19, 42, 43]. These studies found that deep learning algorithms can help physicians improve lesion detection rates. For example, Cai et al. found that their model achieved sensitivity values of 98%, 85%, 91%, 86%, and 98% for the detection of early esophageal squamous cell carcinoma, compared with a diagnostic accuracy of 89% for a senior group of endoscopists and 77% for a junior group [40]. The sensitivity of our model is similar to the results of these studies.

In this study, we observed 3 false positive lesions in the lesion test dataset: a pronounced anterior osteophyte, volume averaging from the pedicle, and an area of heterogeneous bone marrow (Fig. 3). These errors are similar to those reported in prior CAD studies. In Burns et al., the most common causes of false positive lesions were endplate changes and the vertebral endplate cortex, while the most common causes of false negative findings were endplate proximity (and therefore volume averaging with the intervertebral disc), lower attenuation, and small size [3]. CAD errors were also most often near the endplates in Huang and Chiang [26]. Hammon et al. found that CAD errors were most commonly secondary to endplate changes, osteophytes, and small size [8].

The limitations of this study include the relatively small dataset, which may limit generalizability, and the restriction to sclerotic lesions; performance on lytic or mixed-density lesions was not evaluated. Also, the model was trained and tested on single slices rather than a stack of contiguous spine images. Each scan was checked at multiple adjacent levels by an experienced musculoskeletal radiologist prior to image selection for lesion identification; however, it is possible that volume averaging artifacts were still present in the selected images. The model inputs were also cropped to isolate the vertebra, so that the sclerotic lesion represents a higher percentage of pixels in the image; in reality, a sclerotic lesion is a very small portion of a full chest, abdomen, or pelvis CT slice. Further, the model is limited to lesions in the spine and does not address other structures, such as the pelvic bones and ribs. Multiple lesions were used from the same patient, which carries a risk of overfitting; this risk was minimized by using only one image per lesion and vertebral level. Finally, external validation has not been performed. As presented, this model is not sufficient for deployment in routine clinical practice, and these limitations are hurdles that must be overcome to achieve a clinically useful application.

In conclusion, a deep CNN has the potential to assist in detecting sclerotic spinal metastases and could become an important component of radiology workflow in the interpretation of body CTs. This model is highly accurate in defining lesions, non-pathologic bone, and background. Additional model training may help overcome areas prone to false positives, especially endplate degenerative changes and osteophytes.