Introduction

The application of machine learning (ML) techniques is transforming healthcare, with particularly rapid growth and influence in diagnostic imaging. Deep learning (DL) algorithms such as convolutional neural networks (CNNs) have performed well on computer vision problems and have therefore proven useful in medical domains rich in imaging data [1]. Recent studies using ML and 18F-fluorodeoxyglucose (FDG) positron emission tomography (PET) demonstrated promising results in lesion detection and staging in lung and brain cancer [2,3,4], lesion classification [5,6,7], response prediction [8], characterization of non-focal processes such as Alzheimer’s disease [9], and cardiovascular event prediction [10]. While most previous efforts have evaluated ML in FDG PET studies, this work investigates the performance of DL in classifying PET studies that use 18F-fluciclovine, a PET radiopharmaceutical approved by the US FDA for evaluation of biochemical recurrence (BCR) of prostate cancer (PC) [11]. In this study, we evaluated both 2D-CNN and 3D-CNN architectures in detecting abnormal 18F-fluciclovine uptake on PET images of patients with BCR of PC.

Materials and methods

Patients

A total of 251 consecutive 18F-fluciclovine PET/CT scans were performed between September 2017 and June 2019 in 233 prostate cancer patients at Stanford Health Care. Eighteen of the 233 patients had two 18F-fluciclovine PET/CT scans, with a median interval of 216.6 ± 100.0 (range 49–399) days between scans. The remaining 215 patients had only one scan. Patients underwent 18F-fluciclovine PET for suspected BCR after primary definitive treatment (radical prostatectomy or radiation therapy). After prostatectomy with or without adjuvant radiotherapy, BCR was diagnosed at a serum prostate-specific antigen (PSA) level of 0.2 ng/mL or greater, with a second confirmatory PSA level of at least 0.2 ng/mL [12]. For patients treated with radiation therapy, BCR was diagnosed as a rise in PSA of 2 ng/mL or more above the nadir [13]. Age and serum PSA level at the time of 18F-fluciclovine PET were 71.5 ± 9.0 years and 31.6 ± 195.6 (range 0–2975.0) ng/mL, respectively. This retrospective study was approved by the Stanford University Institutional Review Board, and the requirement to obtain written informed consent was waived.

18F-fluciclovine PET scan acquisition

We followed published parameters [14]. In brief, patients were advised to avoid any significant exercise for 1 day before imaging and not to eat or drink for a minimum of 4 h before imaging. Approximately 3–5 min after IV administration of the radiopharmaceutical, images were acquired using the Discovery 600, 690, or MI models (GE Healthcare, Waukesha, WI, USA), starting from the mid-thigh and proceeding to the base of the skull. A low-dose CT scan was performed for attenuation correction and anatomic correlation. The mean administered dose of 18F-fluciclovine was 10.5 ± 1.6 mCi.

PET dataset preparation

To improve the model’s accuracy and reduce computational expense, only the pelvis, which is the region most commonly affected by early recurrence and metastases, was included in the analysis. Consequently, each PET image was reduced to the transaxial slices covering the pelvic region, which were saved using Horos (Ver. 2.4.1, Horosproject.org supported by Nimble Co, LLC, Annapolis, MD, USA). All PET images consisted of 100 contiguous transaxial slices including the entire bony pelvis and its soft tissue contents. Digital imaging and communications in medicine (DICOM) images were manually anonymized using MIM software (Ver. 6.7.14, MIM software, Cleveland, OH, USA).

Pelvic PET image labeling

PET scans were labeled as normal, abnormal, or indeterminate based on clinical reports as follows: normal scans had no radiopharmaceutical uptake in the prostate bed and no evidence of lymph node or other bone or soft tissue metastasis; abnormal scans had radiopharmaceutical uptake in the prostate bed or evidence of lymph node or other bone or soft tissue metastasis; and indeterminate scans were those with a non-specific finding by report. The reports were prospectively generated at the time of PET/CT study acquisition with access to correlative imaging studies, clinical history, and laboratory data. PET scans were interpreted according to published guidelines [14]. Following image preparation, PET images of the pelvis were reviewed by one of the authors (JJL, 19 years of experience in the clinical practice of nuclear medicine), who confirmed the labeled classes according to imaging findings and the clinical reports. Five patients had lesions outside the pelvic region, including lung nodules (n = 1), rib (n = 1), sternum (n = 1), ribs and sternum (n = 1), and lung nodules, rib, and cervical and thoracic lymph nodes (n = 1); however, since the pelvic region was without PET abnormalities, the PET images in those studies were labeled as normal.

2D-CNN training

Transaxial DICOM slices were converted to PNG images using internally developed software. Two modeling strategies were employed with the 2D-CNN architecture: (i) a slice-based approach and (ii) a case-based approach. For the slice-based approach, only the abnormal slices were selected from abnormal PET images by JJL. The same number of normal transaxial slices was randomly chosen from the normal PET images to balance the dataset. The selected transaxial slices were randomized within each group to allocate training and test datasets, each with the same number of abnormal and normal slices. For the case-based approach, normal and abnormal PET images of the pelvis were randomly included with the goal of maximizing the size of the training and test datasets while keeping a balance between normal and abnormal case numbers. Every transaxial slice in a PET image was labeled according to the label of that PET image, following our previous method [4]. Model training was done with Neopulse® (Dimensional Mechanics, Inc.® Seattle, WA), a machine learning framework, using a 2D convolutional neural network (ResNet-50). The 2D-CNN ResNet-50 structure consists of the following: Conv1 (7 × 7, 64), Conv2 ([1 × 1 × 64, 3 × 3 × 64, 1 × 1 × 256] × 3), Conv3 ([1 × 1 × 128, 3 × 3 × 128, 1 × 1 × 512] × 4), Conv4 ([1 × 1 × 256, 3 × 3 × 256, 1 × 1 × 1024] × 6), Conv5 ([1 × 1 × 512, 3 × 3 × 512, 1 × 1 × 2048] × 3), and an average pooling layer. ReLU was used as the activation function. The probability vector output by the Softmax layer was used as the abnormality score of each transaxial slice. The number of epochs used was 500. A workstation with an Intel Xeon CPU E5-2650, 4 × 16 GB DIMM RAM (synchronous, 2400 MHz), an NVIDIA GK110BGL GPU, and a Dell H730 PowerEdge RAID controller with 1 TB and 5 TB HDDs was used for model training and testing.
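The stage listing above can be checked against the "50" in the network's name with a short count. The following is only a minimal sketch: it assumes, as in the standard ResNet-50 design, that a single fully connected layer feeds the Softmax output (the text itself names only the average pooling layer), and the names `STAGES` and `count_weighted_layers` are hypothetical.

```python
# Bottleneck stages as listed in the text: (name, convolutions per block, blocks).
STAGES = [
    ("Conv2", 3, 3),
    ("Conv3", 3, 4),
    ("Conv4", 3, 6),
    ("Conv5", 3, 3),
]


def count_weighted_layers(stages, stem_convs=1, fc_layers=1):
    """Count the stem convolution, every bottleneck convolution, and the classifier.

    fc_layers=1 is an assumption taken from the standard ResNet-50 design.
    """
    block_convs = sum(convs * blocks for _, convs, blocks in stages)
    return stem_convs + block_convs + fc_layers


total_layers = count_weighted_layers(STAGES)
print(total_layers)  # 50
```

Pooling, ReLU, and Softmax carry no trainable weights, so they are not counted here.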

3D-CNN training

A DICOM data generator was built to produce data in batches, perform preprocessing (unifying spacing, direction, and offsets; resampling; cropping to a common size; and normalizing pixel values to the 0–1 range), and apply data augmentation (cropping, flipping, adding Gaussian noise, and adding a Gaussian offset). A 3D ResNet model with 14 layers, including 11 Conv3D layers and 3 fully connected layers, was built to classify the data into 2 classes, normal and abnormal. Cross-entropy was used as the loss function. The Adam optimizer with a learning rate of 0.0001 was used for model optimization. The batch size was 8. No clinical parameters were used to calculate pretest likelihood or as other forms of input for the 2D-CNN and 3D-CNN. CT features were not included in our models. The validation dataset was 10% of the size of the training dataset. Test datasets were independent of training datasets in each approach.
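The normalization and augmentation steps listed above could be sketched as follows with NumPy. This is an illustration under stated assumptions, not the generator used in the study: the function names, the noise and offset standard deviations, the flip axis, the final clipping to the 0–1 range, and the synthetic input volume are all hypothetical, and the resampling and cropping steps are omitted.

```python
import numpy as np


def normalize_01(volume):
    """Min-max scale voxel values to the 0-1 range."""
    volume = volume.astype(np.float32)
    lo, hi = volume.min(), volume.max()
    if hi == lo:
        # A constant volume carries no contrast; map it to zeros.
        return np.zeros_like(volume)
    return (volume - lo) / (hi - lo)


def augment(volume, rng, noise_sd=0.01, offset_sd=0.02):
    """Random left-right flip, voxel-wise Gaussian noise, and one global Gaussian offset."""
    out = volume.copy()
    if rng.random() < 0.5:
        out = out[:, :, ::-1]  # flip along the last (column) axis
    out = out + rng.normal(0.0, noise_sd, size=out.shape)  # voxel-wise noise
    out = out + rng.normal(0.0, offset_sd)  # single intensity shift for the volume
    # Clipping keeps values in the normalized range (an assumption of this sketch).
    return np.clip(out, 0.0, 1.0).astype(np.float32)


rng = np.random.default_rng(0)
# Synthetic stand-in for a pelvic PET volume: (slices, rows, columns).
pet = rng.gamma(2.0, 1.5, size=(100, 64, 64))
scaled = normalize_01(pet)
augmented = augment(scaled, rng)
```

Augmentation would normally be applied only to training batches, so that validation and test volumes are scored unmodified.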

CNN performance evaluation

The sensitivity and specificity were calculated for the test dataset results using the receiver operating characteristic (ROC) curve. For the 2D-CNN case-based approach, the likelihood of abnormality for each PET image was defined as the average score of its 10 transaxial slices with the highest probability. A p value < 0.05 was regarded as statistically significant. MedCalc© (ver. 19, MedCalc Software bvba, Ostend, Belgium) was used for statistical analysis.
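The case-level score described above, the mean of the 10 highest slice probabilities, can be illustrated with a short function. This is a sketch only; the name `case_score` and the synthetic slice probabilities are assumptions made for the example.

```python
import numpy as np


def case_score(slice_probs, k=10):
    """Average the k highest per-slice abnormality probabilities of one PET image."""
    probs = np.asarray(slice_probs, dtype=float)
    if probs.size == 0:
        raise ValueError("at least one slice probability is required")
    k = min(k, probs.size)  # fall back to all slices if fewer than k exist
    top_k = np.sort(probs)[-k:]  # the k largest values
    return float(top_k.mean())


# Synthetic example: 100 slices, 12 of which the model scores as highly abnormal.
probs = np.full(100, 0.05)
probs[40:52] = 0.9
score = case_score(probs)
```

In this synthetic case the score is about 0.9, because the 10 highest-scoring slices all come from the abnormal block and the 88 low-scoring slices do not dilute the average.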

Results

2D-CNN slice-based approach

A total of 3668 abnormal transaxial slices from 170 abnormal PET images were identified and extracted. Of these, 3400 slices were allocated to the training dataset, and the remaining 268 slices were used in the test dataset. The same numbers of normal transaxial slices were randomly selected and allocated to the training and test datasets. The sensitivity and specificity for determining the presence of an abnormality using this approach were 90.7% and 95.1% (criterion of 0.634), and the area under the curve (AUC) was 0.971 (p < 0.001) (Fig. 1). The calculated accuracy and loss function were 99.93% and 0.18% on the training dataset and 89.26% and 71.50% on the test dataset, respectively (Fig. 2).

Fig. 1

ROC curves showing test performance of three different CNN models in detecting abnormal 18F-fluciclovine uptake in PET scans. The AUCs of the 2D-CNN slice-based, 2D-CNN case-based, and 3D-CNN case-based approaches are 0.971, 0.750, and 0.699, respectively

Fig. 2

Changes in accuracy and loss with epoch number in the 2D-CNN slice-based approach. The accuracies of the training set and validation set reach a plateau after 120 epochs. The accuracies and losses of the training set and validation set begin to diverge at 50 epochs

2D-CNN case-based approach

There were 170 abnormal, 64 normal, and 17 indeterminate PET images. A training dataset of 100 PET images (50 normal and 50 abnormal images) and a test dataset of 28 PET images (14 normal and 14 abnormal images) were randomly assembled. The age and serum PSA level of the included patients at the time of 18F-fluciclovine PET were 70.7 ± 8.3 years and 11.0 ± 33.0 (range 0–246.0) ng/mL overall, 68.0 ± 7.3 years and 2.6 ± 8.7 (range 0–67.1) ng/mL in the normal group, and 73.3 ± 8.4 years and 19.6 ± 44.8 (range 0–246.0) ng/mL in the abnormal group. Among the abnormal PET images, 35, 30, and 23 PET images had local recurrence, lymph node metastasis, and distant metastasis, respectively. Twenty PET images had two or more abnormalities (e.g., local recurrence and lymph node metastasis). The number of indeterminate PET images was negligible, and this group was therefore excluded from the analysis. A subgroup analysis based on local recurrence alone or metastasis alone was not performed owing to the small number of cases. The sensitivity and specificity in determining abnormal PET images in the case-based approach were 85.7% and 71.4% (criterion of 1.00), and the AUC was 0.750 (p = 0.013). The accuracy and loss function were 99.98% and 0.11% on the training dataset and 91.70% and 41.68% on the test dataset, respectively.

3D-CNN case-based approach

For comparison, the training and test datasets used for the 2D-CNN case-based approach were used again in the 3D-CNN case-based approach. The sensitivity and specificity were both 71.4% (criterion of 65.84), and the AUC was 0.699 (p = 0.053). The accuracy and loss function of the 3D-CNN case-based approach were 55.2% and 83.8% on the training dataset and 54.0% and 74.95% on the test dataset, respectively.

Discussion

In this study, DL using a 2D-CNN slice-based approach showed the best performance in detecting abnormal 18F-fluciclovine uptake in PET images, using clinical reports generated by imaging physicians experienced with PET as the ground truth. To the authors’ knowledge, most ML studies using oncologic PET scans have concentrated on either determining whether a lesion is benign or malignant [5,6,7] or assigning a pathologic etiology to a given lesion [15], using a limited number of slices or evaluating a single organ. In the realm of automated detection, the performance of DL has previously been investigated in lung cancer with FDG PET [2] and in glioma with 18F-fluoro-ethyl-tyrosine (FET) PET [16]. The AUC reported in these detection studies was 0.989 for lung cancer and 100% for glioma. Although model performance is known to be lower on test datasets than on training datasets, these studies did not state which dataset (training or test) was used to obtain their results.

The DL method employed in the present work handles a large field of view, including numerous transaxial PET slices and a wider spectrum of abnormality and normality than that used in previous studies. The slices used in this study varied in the number, shape, size, and location of local recurrence of PC, as well as in uptake in lymph nodes or distant metastases. In addition, our imaging dataset is more heterogeneous than those employed in earlier studies. For example, some physiologic excretion of 18F-fluciclovine occurs in urine, which can be seen variably throughout the urinary tract including the urethra and bladder, and milder physiologic uptake of the radiopharmaceutical may occur in tissues such as inguinal lymph nodes or muscles. These factors could impede the performance of the DL model. Further, the models described in this paper did not include clinical variables or CT imaging features, which also potentially limits their performance. Despite these disadvantages, our results in detecting abnormal 18F-fluciclovine uptake in PET images are promising.

A model to detect PC using 68Ga-prostate-specific membrane antigen (PSMA) PET scans of the pelvis has recently been published [17]. The model utilized a different CNN architecture, a 2.5D U-net, with ground truth consisting of manually annotated lesions. Depending on the tissue evaluated (bone, lymph node, or local lesions), the recall (sensitivity) in that study was in the 0.61–0.99 range, and the model’s precision (positive predictive value) was in the 0.79–0.99 range. In comparison, the sensitivity and positive predictive value of our 2D-CNN slice-based model using 18F-fluciclovine PET scans were 90.7% and 94.9%, respectively.
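The positive predictive value quoted above follows from the reported sensitivity (90.7%) and specificity (95.1%) because the slice-based test dataset was balanced (268 abnormal and 268 normal slices). A minimal sketch of that arithmetic, with the hypothetical function name `ppv_from_rates`:

```python
def ppv_from_rates(sensitivity, specificity, n_abnormal, n_normal):
    """Positive predictive value from sensitivity, specificity, and class sizes."""
    true_pos = sensitivity * n_abnormal
    false_pos = (1.0 - specificity) * n_normal
    return true_pos / (true_pos + false_pos)


# Reported slice-based test results: 268 abnormal and 268 normal slices.
ppv = ppv_from_rates(0.907, 0.951, n_abnormal=268, n_normal=268)
print(f"{ppv:.1%}")  # 94.9%
```

With balanced classes the class sizes cancel, so the result depends only on the two rates; in a population with a different prevalence of abnormal slices, the same sensitivity and specificity would give a different positive predictive value.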

In the present work, the 2D-CNN slice-based approach performed better than the case-based approach. The performance of the 2D-CNN case-based approach in this study was similar to that in a previous study from our group of a 2D-CNN model for the characterization of brain lesions on FDG PET, which had a sensitivity, specificity, and AUC of 87.2%, 68.0%, and 0.822, respectively [4]. A possible explanation for these results is the difference in the composition of the training and test datasets used in the two approaches. In the case-based approach, normal transaxial slices that were part of abnormal PET images were labeled as abnormal for the purpose of training a model based on cases. In contrast, in the slice-based approach, the transaxial slices labeled as abnormal were meticulously chosen, and only truly abnormal transaxial slices were labeled as abnormal for the purpose of model training.

Our results highlight the importance of curating pertinent training datasets in DL. In the 2D-CNN case-based approach in this work, the abnormality of a PET image in the test dataset is determined from the average score of its 10 transaxial slices with the highest probability. This means that the distribution of the abnormality score is more convergent than that of a single transaxial slice, because an abnormal PET image generally contains a mixture of normal and abnormal transaxial slices. Consequently, finding a specific cutoff between the normal and abnormal groups is hard. Nevertheless, our 2D-CNN case-based approach showed good performance. We postulate that the limited scan range of the pelvis helped improve the performance of this DL model; this region was chosen because it is most frequently affected in BCR of PC [18].

The performance of the 3D-CNN case-based model used in the current study did not meet our expectations, especially when compared with the 2D-CNN case-based model. There are reports that 3D-CNN architectures perform better than 2D-CNN architectures in generating CT images from magnetic resonance (MR) images [19] and in segmenting brain MR images [20]. While it is possible that an optimal 3D-CNN architecture was not employed in the current study, we hypothesize that an insufficient number of cases may be the most important reason, mainly because modeling with 3D-CNN architectures requires a larger number of parameters than modeling with 2D-CNN architectures.

The potential deployment of models like this at the point of care is important to the authors. A valuable variable in such a clinical setting is time. The processing of 28 test PET images and 536 test transaxial slices took less than 1 min. This suggests that once a model is properly trained and available in a clinical workflow, a probability “score of abnormality” could be provided in real time following PET scan reconstruction. A probability “score of abnormality” alone may help prioritize reads and/or increase reader confidence.

There are several limitations to our study. Firstly, as previously mentioned, the number of PET images included is small. DL modeling is data-hungry, and larger datasets are almost always beneficial for the performance of DL models. The clinical use of 18F-fluciclovine PET is less frequent than that of 18F-FDG PET. Specifically, 18F-fluciclovine PET has a narrower clinical indication than 18F-FDG PET; in the US, it is typically used for BCR of PC. This naturally skews the study population, resulting in fewer normal than abnormal PET scans and clinical reports. Balancing the number of cases in the different classes (e.g., normal and abnormal labels) is important in DL model training [21], and for this reason, nearly two thirds of positive images were discarded and not used in this study. Secondly, labeling in this study was based on formal clinical reports. There could be disparities between physician readers (e.g., variation in reader experience), and labels based on these reports may differ from the actual clinical follow-up or from pathology results. Additionally, this study is based on PET scans acquired at a single institution, with a limited number of scanners and possibly limited patient demographics. We think that these limitations warrant a prospective, multicenter study that ensures heterogeneity of scanners, patients, and readers prior to clinical deployment. Despite these limitations, the aim of the present study, to evaluate the performance of DL models on PET images of patients with BCR of PC, was achieved.

Conclusion

DL accurately classifies 18F-fluciclovine PET images of the pelvis in patients with BCR of PC. A 2D-CNN DL classifier using single slice prediction had superior performance over case-based approaches for detecting intrapelvic tumor recurrence and metastases. If prospectively validated in larger datasets, similar methods could be useful to assist radiologists and nuclear medicine physicians at the point-of-care.