Introduction

Positron emission tomography (PET) is widely used for diagnosis, staging, and monitoring treatment effect in clinical oncology. Quantitative or semi-quantitative measures in PET, e.g., standardized uptake value (SUV), are sensitive to physics corrections such as attenuation correction (AC). In PET/CT, CT scans provide high spatial resolution attenuation maps, but these can lead to artifacts in the PET images due to the CT itself, e.g., beam hardening, metal artifacts, and count starvation [1], or due to PET-CT mis-alignment. In addition, CT radiation exposure becomes a concern when multiple PET/CT scans are needed for the same patient during treatment evaluation, particularly for pediatric patients. In PET/MR, attenuation map generation is more complex [2] since an MR image does not directly provide PET attenuation information, and the optimal AC solution for PET/MR has not been found [3]. Therefore, the development of an AC algorithm that does not depend on CT (or MR) is of strong clinical interest, since it may not only eliminate the aforementioned AC artifacts but also substantially reduce the radiation dose.

Maximum likelihood reconstruction of activity and attenuation (MLAA) [4] was proposed to simultaneously reconstruct tracer activity (λ-MLAA) and attenuation maps (μ-MLAA) based on the time-of-flight (TOF) PET raw data only without CT or MR. However, μ-MLAA suffers from high noise and λ-MLAA suffers from quantitative error [5] as compared to the conventional CT-based (μ-CT) OSEM reconstruction. Recently, deep-learning (DL) frameworks were proposed to improve MLAA by predicting the CT attenuation map (μ-DL) from λ-MLAA and μ-MLAA [6, 7]. Specifically, Hwang et al. [6] used a convolutional neural network to predict μ-DL while Shi et al. [7] added an additional line-integral constraint into the loss function and further improved the μ-DL accuracy. However, the μ-DL in [6, 7] was only applied to datasets using 18F-FDG.

While 18F-FDG accounts for most of the PET scans in clinical oncology, many other oncological tracers have shown promising clinical efficacy in more specific cancer types, e.g., prostate-specific membrane antigen (PSMA) [8] or 18F-Fluciclovine [9] for prostate cancer and 68Ga-DOTATATE [10] for neuroendocrine tumor imaging. Since the μ-DL prediction framework [6, 7] was data-driven, i.e., relying on the PET raw data itself, the neural network trained for 18F-FDG cannot be used for other tracers. In this work, we applied the μ-DL prediction framework, which was previously developed in [7], to clinical 18F-FDG, 18F-Fluciclovine, and 68Ga-DOTATATE datasets. Evaluation of tumor uptake was performed by two radiologists. SUVmax, SUVmean, and tumor volume are reported in addition to evaluation using conventional image analysis metrics.

Methods

Clinical data

In this study, 890 whole-body PET/CT datasets were used, which were previously acquired on a Siemens Biograph mCT 40 scanner at Yale New Haven Hospital, including 610 18F-FDG, 120 68Ga-DOTATATE, and 160 18F-Fluciclovine studies. Scans with minimal body motion between PET and CT, based on visual examination, were selected for neural network training (N = 40 per tracer, M/F: 14/26, age: 59.7 ± 12.0 years old for 18F-FDG; M/F: 25/15, age: 62.1 ± 14.0 for 68Ga-DOTATATE; M/F: 40/0, age: 70.1 ± 8.7 for 18F-Fluciclovine) and testing (N = 73, M/F: 25/48, age: 61.2 ± 17.9 for 18F-FDG; N = 36, M/F: 11/25, age: 62.3 ± 14.4 for 68Ga-DOTATATE; N = 50, M/F: 50/0, age: 74.5 ± 7.6 for 18F-Fluciclovine). CT was acquired prior to the PET scan with a regular diagnostic radiation dose. Both 18F-FDG and 68Ga-DOTATATE scans started at ~60 min post-injection using a supine protocol. An injection of approximately 10 mCi was used for 18F-FDG and ~0.054 mCi/kg (5.4 mCi max) was used for 68Ga-DOTATATE. 18F-Fluciclovine scans started at 3–5 min post-injection of ~10 mCi using a pelvis-first supine protocol, which was used to minimize bladder fill-up. A continuous-bed-motion protocol was used for all the PET acquisitions. Bed speeds equivalent to two and three minutes per bed position were used for 18F-FDG and 68Ga-DOTATATE, respectively. For 18F-Fluciclovine, a bed speed equivalent to 4–5 min per bed position was used over the pelvis and 2–3 min per bed position over the abdomen, chest, and head/neck, for a total imaging time of ~20 min.

MLAA and data preprocessing

Three iterations and 21 subsets were used for MLAA [11]. Both λ-MLAA and μ-MLAA images were reconstructed with a 2 mm3 voxel size, smoothed with a 5-mm full width at half maximum (FWHM) Gaussian filter, and down-sampled to 4 mm3.
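The smoothing and down-sampling parameters above can be related to array operations with a minimal sketch. It assumes isotropic 2-mm voxels and 2 × 2 × 2 block averaging for the down-sampling step; the interpolation actually used is not stated in the text, and the function names are hypothetical.

```python
import numpy as np

def fwhm_to_sigma_voxels(fwhm_mm, voxel_mm):
    """Convert a Gaussian FWHM in mm to a standard deviation in voxel units."""
    sigma_mm = fwhm_mm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return sigma_mm / voxel_mm

def downsample_by_two(volume):
    """Halve the sampling along each axis by averaging 2 x 2 x 2 blocks.

    Block averaging is an assumption made for illustration; odd trailing
    slices are cropped before averaging.
    """
    nx, ny, nz = (s - s % 2 for s in volume.shape)
    v = volume[:nx, :ny, :nz]
    return v.reshape(nx // 2, 2, ny // 2, 2, nz // 2, 2).mean(axis=(1, 3, 5))

# 5-mm FWHM on a 2-mm grid corresponds to a sigma of about 1.06 voxels.
sigma_vox = fwhm_to_sigma_voxels(fwhm_mm=5.0, voxel_mm=2.0)
coarse = downsample_by_two(np.ones((8, 8, 5)))  # shape (4, 4, 2)
```

The resulting sigma (in voxels) is the quantity a Gaussian filter routine would typically expect.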

In deep learning applications, image pre-processing is key for effective network training [12]. Here, the two input channels, λ-MLAA and μ-MLAA, represent two different physical quantities with different numerical ranges: λ-MLAA represents the radiotracer concentration measured in Bq/mL, while μ-MLAA represents the attenuation coefficient measured in cm−1. For λ-MLAA, converting Bq/mL to standardized uptake value (SUV) normalizes for the injected dose and the patient weight. This normalization, however, does not address the broad range of biological tracer uptake in different organs. Therefore, λ-MLAA images were normalized using λnorm = tanh(λ/σ) before training and testing, where λ and λnorm are the λ-MLAA images (in SUV) before and after normalization, and σ controls the active gradient range of the hyperbolic tangent (tanh) function. σ was set to 10 so that most organs of interest (except for the bladder) fall within the active gradient range for all tracers. Every voxel in μ-MLAA and μ-CT (training label) was divided by 0.15 cm−1, which corresponds to the attenuation coefficient of skull bone at 511 keV, to match the value range of λnorm. The normalized λ-MLAA and μ-MLAA were concatenated as a dual-channel input to the deep neural network for training and testing. All training inputs and label μ-maps had a 4 mm3 isotropic voxel size.
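The two normalizations described above can be written compactly. The following is a minimal NumPy sketch, not the released implementation: the function names, the channel-last layout, and the inverse scaling of the network output are assumptions made for illustration.

```python
import numpy as np

SIGMA_SUV = 10.0  # sigma of the tanh normalization, as given in the text
MU_SCALE = 0.15   # cm^-1, scaling constant for attenuation maps, as given in the text

def normalize_inputs(lam_suv, mu_mlaa):
    """Return a two-channel array (..., 2) of normalized lambda-MLAA and mu-MLAA.

    lam_suv is lambda-MLAA already converted to SUV; mu_mlaa is in cm^-1.
    The channel-last layout is an assumption.
    """
    lam_norm = np.tanh(lam_suv / SIGMA_SUV)
    mu_norm = mu_mlaa / MU_SCALE
    return np.stack([lam_norm, mu_norm], axis=-1)

def denormalize_mu(mu_pred_norm):
    """Map a normalized attenuation value back to cm^-1 (assumed inverse step)."""
    return mu_pred_norm * MU_SCALE

# Two example voxels: (SUV 0, bone-like mu) and (SUV 10, water-like mu).
x = normalize_inputs(np.array([0.0, 10.0]), np.array([0.15, 0.096]))
```

With σ = 10, an SUV of 10 maps to tanh(1) ≈ 0.76, while much higher uptake (e.g., in the bladder) saturates toward 1.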

Network structure, loss function, and training

A modified fully convolutional 3D U-net architecture [13] was used for predicting the attenuation map (μ-DL) from λ-MLAA and μ-MLAA. Figure 1 shows the framework of the training process and Supplemental Fig. 1 shows the detailed U-net network structure. The network operates on 3D patches and uses 3 × 3 × 3 convolution kernels. Patch-based training, i.e., 64 × 64 × 32 voxels per patch, was used. Resolution reduction was performed by 2 × 2 × 2 convolution kernels with stride 2. Other network details can be found in Supplemental Fig. 1.

Fig. 1

Proposed framework (training phase). Both µ- and λ-MLAA are used as inputs and µ-CT is used as labels. Line integral projection loss (LIP-loss) is used to update the deep learning neural network in addition to image domain loss (IM-loss)

In this study, we used a novel loss function [14] that includes not only the conventional image intensity-based loss (IM-loss), but also a gradient-based loss (GDL-loss) and a line-integral projection loss (LIP-loss) to incorporate the PET attenuation physics. Formally, the loss function is defined as follows:

$$L\left({{\boldsymbol{\upmu}}}^{\text{DL}},{{\boldsymbol{\upmu}}}^{\text{CT}}\right)={L}_{\text{IM}}\left({{\boldsymbol{\upmu}}}^{\text{DL}},{{\boldsymbol{\upmu}}}^{\text{CT}}\right)+{\beta }_{1}{L}_{\text{GDL}}\left({{\boldsymbol{\upmu}}}^{\text{DL}},{{\boldsymbol{\upmu}}}^{\text{CT}}\right)+{\beta }_{2}{L}_{\text{LIP}}\left({{\boldsymbol{\upmu}}}^{\text{DL}},{{\boldsymbol{\upmu}}}^{\text{CT}}\right),$$
(1)

where \({\beta }_{1}\) and \({\beta }_{2}\) are the hyper-parameters for the GDL-loss (\({L}_{\text{GDL}}\)) and LIP-loss (\({L}_{\text{LIP}}\)) and were empirically set to 1.0 and 0.02, respectively. \({{\boldsymbol{\upmu}}}^{\text{DL}}\) and \({{\boldsymbol{\upmu}}}^{\text{CT}}\) represent patches of the μ-DL and the gold-standard μ-CT, respectively. The L1-norm IM-loss (\({L}_{\text{IM}}\)) is constructed as follows:

$${L}_{\text{IM}}\left({{\boldsymbol{\upmu}}}^{\text{DL}},{{\boldsymbol{\upmu}}}^{\text{CT}}\right)=\frac{1}{{N}_{j}}{\sum }_{j}\left|{\mu }_{j}^{\text{DL}}-{\mu }_{j}^{\text{CT}}\right|,$$
(2)

where \({\mu }_{j}^{\text{DL}}\) and \({\mu }_{j}^{\text{CT}}\) are the intensities of voxel j in \({{\boldsymbol{\upmu}}}^{\text{DL}}\) and \({{\boldsymbol{\upmu}}}^{\text{CT}}\), respectively, and \({N}_{j}\) is the number of voxels inside one patch. \({L}_{\text{GDL}}\) is an image gradient difference loss defined as follows:

$${L}_{\text{GDL}}\left({{\boldsymbol{\upmu}}}^{\text{DL}},{{\boldsymbol{\upmu}}}^{\text{CT}}\right)=\frac{1}{{N}_{j}}{\sum }_{j}\left({\left|{\left|{\nabla }_{x}{{\boldsymbol{\upmu}}}^{\text{DL}}\right|}_{j}-{\left|{\nabla }_{x}{{\boldsymbol{\upmu}}}^{\text{CT}}\right|}_{j}\right|}^{2}+{\left|{\left|{\nabla }_{y}{{\boldsymbol{\upmu}}}^{\text{DL}}\right|}_{j}-{\left|{\nabla }_{y}{{\boldsymbol{\upmu}}}^{\text{CT}}\right|}_{j}\right|}^{2}+{\left|{\left|{\nabla }_{z}{{\boldsymbol{\upmu}}}^{\text{DL}}\right|}_{j}-{\left|{\nabla }_{z}{{\boldsymbol{\upmu}}}^{\text{CT}}\right|}_{j}\right|}^{2}\right),$$
(3)

where \({\nabla }_{x}\) is the gradient operator in the x direction (and likewise for y and z). \({L}_{\text{GDL}}\) has been shown to be effective in preventing μ-map blurring [15].
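As an illustration of Eqs. 2 and 3, the two image-domain terms can be sketched in NumPy for a single 3D patch. This is not the TensorFlow code used for training; forward differences stand in for the gradient operator, which is an assumption since the discretization is not specified here.

```python
import numpy as np

def im_loss(mu_dl, mu_ct):
    """Eq. 2: mean absolute voxel difference over the patch."""
    return np.mean(np.abs(mu_dl - mu_ct))

def gdl_loss(mu_dl, mu_ct):
    """Eq. 3: squared difference of absolute gradients, summed over x, y, z.

    Forward differences (np.diff) approximate the gradient operator; this
    choice is an assumption. The sum is divided by the patch voxel count N_j.
    """
    total = 0.0
    for axis in range(3):
        g_dl = np.abs(np.diff(mu_dl, axis=axis))
        g_ct = np.abs(np.diff(mu_ct, axis=axis))
        total += np.sum((g_dl - g_ct) ** 2)
    return total / mu_ct.size

# A flat prediction versus a label containing one sharp edge along x.
flat = np.zeros((4, 4, 4))
edge = flat.copy()
edge[2:] = 1.0
```

In this toy case the flat patch is penalized by both terms, and the GDL term specifically reflects the edge that the flat patch fails to reproduce, which is the behavior associated with reduced blurring.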

To enforce the additional similarity in the projection domain between μ-DL and μ-CT, the LIP-loss was used to measure the line-integral difference between the μ-map patch \({{\boldsymbol{\upmu}}}^{\text{DL}}\) and \({{\boldsymbol{\upmu}}}^{\text{CT}}\):

$${L}_{\text{LIP}}\left({{\boldsymbol{\upmu}}}^{\text{DL}},{{\boldsymbol{\upmu}}}^{\text{CT}}\right)=\frac{1}{{N}_{\mathbb{k}}}\sum_{k\in {\mathbb{k}}}\frac{\sum_{i=1}^{{N}_{k}}{({\left[{P}_{k}{{\boldsymbol{\upmu}}}^{\text{DL}}\right]}_{i}-{\left[{P}_{k}{{\boldsymbol{\upmu}}}^{\text{CT}}\right]}_{i})}^{2}}{{N}_{k}},$$
(4)

where \({\mathbb{k}}\) is the set of line-integral projection (LIP) angles, k indexes the projection angles, \({N}_{\mathbb{k}}\) is the total number of angles in \({\mathbb{k}}\), \({P}_{k}\) is the LIP operator applied to the patches \({{\boldsymbol{\upmu}}}^{\text{DL}}\) and \({{\boldsymbol{\upmu}}}^{\text{CT}}\) at the k-th angle, i is the pixel index in the projection domain, and \({N}_{k}\) is the total number of pixels in \({P}_{k}{{\boldsymbol{\upmu}}}^{\text{DL}}\) and \({P}_{k}{{\boldsymbol{\upmu}}}^{\text{CT}}\). The set \({\mathbb{k}}\) was designed to sample \({N}_{\mathbb{k}}\) angles uniformly over 180°; in this study, \({N}_{\mathbb{k}}\) = 4 was used, i.e., \({\mathbb{k}}=[ 0^\circ ,45^\circ ,90^\circ ,135^\circ ]\).
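Eq. 4 and the weighted sum in Eq. 1 can be sketched as follows. The projector below is a deliberately crude stand-in: it assumes in-plane parallel rays, sums along array rows, columns, or diagonals, and uses a 0.4-cm voxel size; the projector in the actual framework may differ, and the function names and the angle-to-axis convention are hypothetical.

```python
import numpy as np

def project(patch, angle_deg, voxel_cm=0.4):
    """Parallel line integrals of a (nx, ny, nz) patch within the x-y plane.

    Only 0, 45, 90, and 135 degrees are supported. Diagonal rays are summed
    along array diagonals with a sqrt(2) step length; an interpolating
    projector would normally be used instead.
    """
    nx, ny, _ = patch.shape
    if angle_deg == 0:
        return patch.sum(axis=0) * voxel_cm
    if angle_deg == 90:
        return patch.sum(axis=1) * voxel_cm
    if angle_deg in (45, 135):
        p = patch if angle_deg == 45 else patch[:, ::-1, :]
        rays = [np.trace(p, offset=o, axis1=0, axis2=1)
                for o in range(-(nx - 1), ny)]
        return np.stack(rays) * voxel_cm * np.sqrt(2.0)
    raise ValueError("unsupported angle")

def lip_loss(mu_dl, mu_ct, angles=(0, 45, 90, 135)):
    """Eq. 4: mean squared line-integral difference, averaged over angles."""
    per_angle = [np.mean((project(mu_dl, a) - project(mu_ct, a)) ** 2)
                 for a in angles]
    return float(np.mean(per_angle))

def total_loss(l_im, l_gdl, l_lip, beta1=1.0, beta2=0.02):
    """Eq. 1 with the weights reported in the text."""
    return l_im + beta1 * l_gdl + beta2 * l_lip

ones = np.ones((4, 4, 2))
```

Because line integrals accumulate many voxels, the LIP term tends to be numerically larger than the voxel-wise terms, which is consistent with the small weight \({\beta }_{2}\) = 0.02.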

For each tracer, an individual network was trained for 160 epochs, at which point network training had converged. In each epoch, 10,000 patches were randomly sampled from the training data and a mini-batch size of 16 patches was used to update the network. Adam optimization [16] was used with an initial learning rate of 10−3 and a decay factor of 0.975 applied after each epoch. In the testing phase, instead of using the same patch size as in training, we used a larger patch, i.e., 200 × 200 × 32 voxels with a stride of 200 × 200 × 16, to avoid striding in the first two dimensions. The fully convolutional architecture allows patches of different sizes to be used as inputs, and this practice helps to reduce stitching artifacts caused by overlapping small image patches. Twenty-two 18F-FDG subjects, independent of the training and testing cohorts, were used for computing the validation loss during training. The validation loss (Supplemental Fig. 2) suggested that the network was sufficiently trained without overfitting. The framework [17] was implemented in Python using TensorFlow and is shared on GitHub (https://github.com/j-onofrey/deep-image-pet). Network training took approximately 40 h on an NVIDIA RTX 8000 GPU with 48 GB memory.
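The training schedule implies a fixed number of network updates and a final learning rate that can be worked out directly. The sketch below assumes the decay factor is applied multiplicatively once per completed epoch; the constant and function names are hypothetical.

```python
EPOCHS = 160
PATCHES_PER_EPOCH = 10_000
BATCH_SIZE = 16
LR0 = 1e-3
DECAY = 0.975

def learning_rate(epoch):
    """Learning rate during a given 0-based epoch (assumed multiplicative decay)."""
    return LR0 * DECAY ** epoch

updates_per_epoch = PATCHES_PER_EPOCH // BATCH_SIZE  # 625 mini-batch updates
total_updates = updates_per_epoch * EPOCHS           # 100,000 updates in total
final_lr = learning_rate(EPOCHS - 1)                 # roughly 1.8e-5
```

Under this assumption the learning rate falls by a factor of about 50 over the course of training.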

Image reconstructions

PET image reconstructions were performed using the 3D-OSEM algorithm (3 iterations and 21 subsets) with the Siemens e7 toolkit. μ-DL, μ-MLAA, and μ-CT were used as the attenuation maps to reconstruct PET images, i.e., OSEMDL, OSEMMLAA, and OSEMCT, respectively. Before reconstruction, μ-DL was up-sampled from 4 to 2 mm3 in voxel size, while the original μ-MLAA and μ-CT with 2 mm3 voxels were used. Note that OSEMMLAA is different from λ-MLAA, since OSEMMLAA is reconstructed by the OSEM algorithm with μ-MLAA as the attenuation map whereas λ-MLAA is reconstructed by the MLAA algorithm.

Image analysis

Physics-based quantitative measurement

Using μ-CT as the reference, the quality of μ-DL was evaluated using normalized mean absolute error (NMAE), normalized root mean squared error (NRMSE), and mean normalized voxel error (MNVE), which are defined as follows:

$$\begin{array}{c}\text{NMAE}=\frac{\frac{1}{M}{\sum }_{j=1}^{M}\left|{\mu }_{j}^{\text{DL}}-{\mu }_{j}^{\text{CT}}\right|}{\max \left({{\boldsymbol{\upmu}}}^{\text{CT}}\right)-\min ({{\boldsymbol{\upmu}}}^{\text{CT}})},\\ \text{NRMSE}=\frac{\sqrt{\frac{1}{M}{\sum }_{j=1}^{M}{\left({\mu }_{j}^{\text{DL}}-{\mu }_{j}^{\text{CT}}\right)}^{2}}}{\max \left({{\boldsymbol{\upmu}}}^{\text{CT}}\right)-\min ({{\boldsymbol{\upmu}}}^{\text{CT}})},\\ \text{MNVE}=\frac{1}{M}{\sum }_{j=1}^{M}\frac{{\mu }_{j}^{\text{DL}}-{\mu }_{j}^{\text{CT}}}{{\mu }_{j}^{\text{CT}}},\end{array}$$

where \({{\boldsymbol{\upmu}}}^{\text{CT}}\) represents the entire μ-map volume within a body-contour mask and M represents the number of voxels inside the mask. The mask was generated from the bed-removed μ-CT by setting all voxels with an attenuation coefficient greater than 0.01 to 1 and the rest to 0. The same metrics were used to evaluate μ-MLAA using the same μ-CT as reference.
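The three attenuation-map metrics defined above can be computed with a few lines of NumPy. This is a minimal sketch with hypothetical function names; it assumes the maps are already co-registered arrays and that the bed has been removed from μ-CT.

```python
import numpy as np

def body_mask(mu_ct, threshold=0.01):
    """Body-contour mask: voxels of the bed-removed mu-CT above the threshold."""
    return mu_ct > threshold

def mu_metrics(mu_test, mu_ct, mask):
    """Return (NMAE, NRMSE, MNVE) of mu_test against mu_ct inside the mask."""
    ref = mu_ct[mask]
    diff = mu_test[mask] - ref
    value_range = ref.max() - ref.min()
    nmae = np.mean(np.abs(diff)) / value_range
    nrmse = np.sqrt(np.mean(diff ** 2)) / value_range
    mnve = np.mean(diff / ref)  # signed, so under- and over-estimation can cancel
    return nmae, nrmse, mnve

# Toy example: a test map that underestimates the reference by 10% everywhere.
mu_ct = np.array([0.0, 0.05, 0.10, 0.15])
nmae, nrmse, mnve = mu_metrics(0.9 * mu_ct, mu_ct, body_mask(mu_ct))
```

Since MNVE is signed, it reports bias, whereas NMAE and NRMSE report error magnitude; the mask threshold also keeps the MNVE denominator away from zero.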

The quality of the reconstructed PET images, i.e., OSEMMLAA and OSEMDL, was evaluated using NMAE, NRMSE, and normalized regional error (NRE) inside the same body-contour mask as for attenuation map evaluation. NRE is defined as \(\text{NRE}=\frac{{\sum }_{j=1}^{M}\left({\text{OSEM}}_{\text{DL},j}-{\text{OSEM}}_{\text{CT},j}\right)}{{\sum }_{j=1}^{M}{\text{OSEM}}_{\text{CT},j}}\). Note that the mean of OSEMCT, instead of the difference between maximum and minimum, was used as the denominator in the NMAE and NRMSE calculation for PET evaluation.
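For the PET images, the same error measures are normalized by the mean of the reference rather than its range. A minimal sketch, with a hypothetical function name and toy values:

```python
import numpy as np

def pet_metrics(osem_test, osem_ct, mask):
    """Return (NMAE, NRMSE, NRE) of a PET image against OSEM-CT inside the mask."""
    ref = osem_ct[mask]
    diff = osem_test[mask] - ref
    ref_mean = ref.mean()
    nmae = np.mean(np.abs(diff)) / ref_mean
    nrmse = np.sqrt(np.mean(diff ** 2)) / ref_mean
    nre = diff.sum() / ref.sum()  # signed regional bias
    return nmae, nrmse, nre

osem_ct = np.array([1.0, 2.0, 3.0, 4.0])
osem_dl = np.array([1.1, 1.9, 3.3, 3.6])
mask = np.ones(4, dtype=bool)
nmae, nrmse, nre = pet_metrics(osem_dl, osem_ct, mask)
```

In this toy case the voxel-wise errors largely cancel in NRE (about −1%) while NMAE remains at 9%, which illustrates why both kinds of measure are reported.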

For both μ and PET evaluation, in addition to whole body, we evaluated within three sub-regions: neck-thorax, abdomen, and pelvis, which correspond to 0–35%, 35–65%, and 65–100% of the image volume.
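The sub-region split can be expressed as index ranges. The sketch below assumes that the percentages refer to position along the axial direction, counted from the head end of the volume; the text does not state this explicitly, and the function name is hypothetical.

```python
def subregion_slices(n_axial):
    """Map each sub-region to a slice of axial indices (assumed head-first)."""
    bounds = {
        "neck-thorax": (0.00, 0.35),
        "abdomen": (0.35, 0.65),
        "pelvis": (0.65, 1.00),
    }
    return {name: slice(int(round(lo * n_axial)), int(round(hi * n_axial)))
            for name, (lo, hi) in bounds.items()}

regions = subregion_slices(200)  # e.g., a volume with 200 axial slices
```

Each slice can then be applied to the axial axis of the μ-map or PET volume before computing the metrics.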

To reflect the overall μ-map quality, we performed joint histogram analysis. Specifically, for each tracer, joint histograms were generated between μ-MLAA (or μ-DL) and μ-CT in both the μ-map and projection domains. Projection μ-maps were generated by computing the line integrals of a whole-body attenuation map, e.g., μ-MLAA, μ-DL, or μ-CT, at 0° and 90°. Image domain joint histogram analysis was also performed for PET images, i.e., between OSEMMLAA (or OSEMDL) and OSEMCT.
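A joint histogram of this kind can be accumulated over subjects with `np.histogram2d`. The sketch below is illustrative only: the number of bins, the value range, the use of a body mask, and the +1 offset before taking the logarithm are assumptions, and the function name is hypothetical.

```python
import numpy as np

def joint_histogram(test_maps, ref_maps, masks, bins=100, value_range=(0.0, 0.2)):
    """Subject-normalized joint histogram of (reference, test) voxel pairs.

    Counts are summed over subjects, divided by the number of subjects, and
    returned as log10(count + 1) so that empty bins stay finite.
    """
    hist = np.zeros((bins, bins))
    for test, ref, mask in zip(test_maps, ref_maps, masks):
        h, _, _ = np.histogram2d(ref[mask], test[mask], bins=bins,
                                 range=[value_range, value_range])
        hist += h
    hist /= len(ref_maps)
    return np.log10(hist + 1.0)

# One toy subject whose test map equals the reference: all counts fall on the diagonal.
ref = np.full((2, 2, 2), 0.1)
mask = np.ones_like(ref, dtype=bool)
h = joint_histogram([ref.copy()], [ref], [mask], bins=10)
```

Perfect agreement concentrates all counts on the diagonal (the unity line); systematic bias shifts the ridge off the diagonal and noise widens it.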

Tumor delineation and clinical quantitative measure

For the clinical evaluation, two experienced radiologists (TT and DS) identified, with high confidence, subjects with tumor uptake by referring to OSEMCT: 45/73 for 18F-FDG, 20/36 for 68Ga-DOTATATE, and 22/50 for 18F-Fluciclovine. Tumor regions of interest (ROIs) were generated using in-house software, Metavol2 (modified from Metavol [18]), as follows: (1) voxels with SUV above a patient-specific threshold (see below) were extracted to form tumor clusters; (2) the radiologist dropped a digital tumor pin in each identified cluster; (3) additional non-tumor pins, adjacent to the tumor in step (2), were dropped in other high-uptake clusters to separate inflammatory/physiological uptake; and (4) automatic segmentation was performed in Metavol2 using information from all the pins, i.e., tumor and non-tumor. The above 4 steps were performed iteratively, with manual adjustment of the threshold and the pin locations, until satisfactory tumor segmentation was obtained. Supplemental Fig. 3 depicts the tumor delineation process. Note that the SUV thresholds and the tumor pins (but not the ROIs themselves) defined on OSEMCT were reused for OSEMMLAA and OSEMDL for the same subject, i.e., Metavol2 generated ROIs for OSEMMLAA and OSEMDL based on the same pins and threshold as for OSEMCT. SUVmax, SUVmean, and tumor volume (Voltumor) were computed for each ROI.
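Once an ROI has been delineated, the three clinical measures follow directly from the voxels it contains. The sketch below does not reproduce the Metavol2 segmentation; it assumes a binary ROI mask is already available and a voxel volume of 0.008 mL (2-mm isotropic voxels), and the function names are hypothetical.

```python
import numpy as np

def roi_measures(suv, roi, voxel_ml=0.008):
    """SUVmax, SUVmean, and volume (mL) of a binary ROI on an SUV image."""
    values = suv[roi]
    return {
        "SUVmax": float(values.max()),
        "SUVmean": float(values.mean()),
        "volume_mL": float(roi.sum() * voxel_ml),
    }

def percent_difference(test, reference):
    """Signed % difference of a measure relative to the OSEM-CT reference."""
    return 100.0 * (test - reference) / reference

suv = np.array([[1.0, 5.0], [7.0, 3.0]])
measures = roi_measures(suv, suv > 4.0)  # toy ROI from a fixed threshold
```

Because the ROI is regenerated on each reconstruction from the same pins and threshold, a lower SUV in OSEMMLAA or OSEMDL can change the tumor volume as well as SUVmax and SUVmean.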

Statistical analysis

All variables are presented as mean ± standard deviation. A paired t-test was applied to assess whether the means of each metric for DL and MLAA, using μ-CT or OSEMCT as the reference, were statistically different. p-values were adjusted by Bonferroni correction for multiple comparisons. The corrections were applied per tracer, and the μ-map, OSEM image, and tumor ROI analyses were considered to be independent. Adjusted p-values of less than 0.05 were considered statistically significant.
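The testing procedure can be sketched with SciPy's paired t-test. The number of comparisons per family is not stated in the text, so `n_comparisons` below is a placeholder, and the per-subject error values are invented for illustration.

```python
import numpy as np
from scipy import stats

def paired_test_bonferroni(metric_dl, metric_mlaa, n_comparisons):
    """Paired t-test between per-subject metrics with a Bonferroni-adjusted p-value."""
    t_stat, p_value = stats.ttest_rel(metric_dl, metric_mlaa)
    p_adjusted = min(1.0, p_value * n_comparisons)
    return t_stat, p_adjusted

# Hypothetical per-subject errors (%) for five subjects.
err_dl = np.array([1.0, 1.2, 0.9, 1.1, 1.0])
err_mlaa = err_dl + np.array([5.0, 5.5, 4.8, 5.2, 5.1])
t_stat, p_adj = paired_test_bonferroni(err_dl, err_mlaa, n_comparisons=3)
```

The Bonferroni adjustment multiplies each raw p-value by the number of comparisons in its family and caps the result at 1.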

Results

Attenuation map

Large differences in spatial distribution were found across tracers, e.g., 68Ga-DOTATATE exhibits high uptake in the spleen, liver, and kidney while 18F-Fluciclovine is broadly distributed with high muscle and bone marrow uptake (Fig. 2). Visually, the μ-MLAA of 18F-FDG and 18F-Fluciclovine was superior in quality to that of 68Ga-DOTATATE, especially in bone areas, which was likely due to the broader tracer distribution of 18F-FDG and 18F-Fluciclovine. Most anatomical details were successfully recovered in μ-DL for all the tracers. The ribs are distinguishable and the delineation between muscle and fat was mostly accurate for μ-DL. Notable differences between the μ-DL and μ-CT were observed in the abdominal regions, e.g., oral contrast agents in the stomach and intestine for 18F-FDG.

Fig. 2

Examples of PET and attenuation maps for a 18F-FDG, b 68Ga-DOTATATE, and c 18F-Fluciclovine. PET images were reconstructed with μ-CT. Yellow arrows point to the differences between the μ-DL and μ-CT due to oral contrast agents in the stomach and intestine

Numerically, μ-DL shows consistently superior performance as compared to μ-MLAA for all the tracers in NMAE, NRMSE, and MNVE with high statistical significance (Table 1). The whole-body NMAE of μ-MLAA was 7.3 ± 1.1% for 18F-FDG, 8.2 ± 1.3% for 68Ga-DOTATATE, and 9.0 ± 0.8% for 18F-Fluciclovine, which improved for μ-DL to 2.0 ± 0.4%, 1.4 ± 0.2%, and 2.5 ± 0.4%, respectively. Among the three sub-regions in μ-MLAA, the abdomen yielded higher NMAE for all three tracers (18F-FDG: μ-MLAA: 10.7 ± 1.5%, μ-DL: 2.9 ± 0.6%; 68Ga-DOTATATE: 13.2 ± 2.9%, 1.8 ± 0.3%; 18F-Fluciclovine: 13.7 ± 1.4%, 3.4 ± 0.8%) than the thorax (7.2 ± 1.2%, 2.3 ± 0.6%; 7.6 ± 1.6%, 1.5 ± 0.3%; 8.2 ± 0.8%, 3.4 ± 0.8%) and the pelvis (8.2 ± 1.1%, 2.0 ± 0.6%; 9.1 ± 1.6%, 1.4 ± 0.3%; 11.6 ± 1.6%, 2.0 ± 0.5%). A scatter plot of the NMAE results is shown in Fig. 3, where µ-MLAA shows larger variability than µ-DL. Whole-body NRMSE results were as follows: 18F-FDG: μ-MLAA: 9.8 ± 1.4%, μ-DL: 3.6 ± 0.8%; 68Ga-DOTATATE: 10.6 ± 1.6%, 2.5 ± 0.6%; 18F-Fluciclovine: 11.7 ± 0.9%, 4.7 ± 0.9%. μ-MLAA showed larger negative MNVE compared to μ-DL for all three tracers at the whole-body level: 18F-FDG: μ-MLAA: −10.6 ± 2.1%, μ-DL: −1.5 ± 1.3%; 68Ga-DOTATATE: −5.7 ± 3.1%, −0.5 ± 0.5%; 18F-Fluciclovine: −9.2 ± 2.4%, −2.0 ± 1.2%.

Table 1 Normalized mean absolute error (NMAE), normalized root mean squared error (NRMSE), and mean normalized voxel error (MNVE) for µ-DL or µ-MLAA compared with µ-CT. Paired t-test was performed between µ-DL and µ-MLAA for NMAE, NRMSE, and MNVE. *p < 0.0001, **p < 10−30

Fig. 3

Regional normalized mean absolute error (NMAE) for µ-MLAA (blue) and µ-DL (gray)

PET reconstruction

Overall, OSEMDL showed substantially smaller differences than OSEMMLAA for all tracers using OSEMCT as the reference (Fig. 4). OSEMMLAA showed larger differences in the abdomen than other regions for 68 Ga-DOTATATE and 18F-Fluciclovine, whereas the regional difference was not obvious for OSEMDL.

Fig. 4

Difference of SUV images between OSEMMLAA (or OSEMDL) and OSEMCT. For each tracer, the representative subject shown is the one that yielded the average NMAE in the abdominal region among all the subjects

At the whole-body level, OSEMDL substantially outperformed OSEMMLAA in both NMAE (18F-FDG: OSEMMLAA: 7.1 ± 0.7%, OSEMDL: 4.4 ± 1.3%; 68Ga-DOTATATE: 12.4 ± 4.5%, 3.1 ± 1.2%; 18F-Fluciclovine: 9.9 ± 6.4%, 4.9 ± 1.2%) and NRMSE (13.2 ± 4.9%, 10.9 ± 11.3%; 27.9 ± 11.6%, 11.7 ± 7.9%; 17.3 ± 13.1%, 9.1 ± 1.9%) (Table 2). Both OSEMDL and OSEMMLAA yielded low NRE at the whole-body level for 18F-FDG (−1.9 ± 2.0% and −1.7 ± 1.1%, without statistical significance), while OSEMDL significantly outperformed OSEMMLAA for 68Ga-DOTATATE (1.4 ± 1.7% and −7.7 ± 4.1%, p < 0.0001) and 18F-Fluciclovine (−2.8 ± 1.5% and −5.0 ± 4.6%, p < 0.01). In the sub-regions, OSEMMLAA yielded larger NMAE, NRMSE, and NRE in the abdomen than in the other two regions, while OSEMDL yielded consistent performance in all regions with slightly higher error in the thorax.

Table 2 Normalized mean absolute error (NMAE), normalized root mean squared error (NRMSE), and normalized regional error (NRE) for OSEMDL or OSEMMLAA compared with OSEMCT. Paired t-test was performed between OSEMDL and OSEMMLAA for NMAE, NRMSE, and NRE. *p < 0.05, **p < 10−10

Joint histogram analysis

For all the tracers, μ-DL shows better agreement with μ-CT than μ-MLAA does in the image domain comparison (Fig. 5a). In the projection domain (Fig. 5b), for all the tracers, μ-MLAA shows larger error, i.e., deviation from the unity line, than μ-DL, especially in the low value range. μ-MLAA also shows larger variability, i.e., more spread along the unity line, than μ-DL. For PET images, OSEMDL also yielded smaller deviation from OSEMCT than OSEMMLAA did (Supplemental Fig. 4). For OSEMMLAA, a larger difference from OSEMCT was found for 18F-Fluciclovine as compared to the other two tracers, especially in the high SUV range (> 7).

Fig. 5

Joint histogram between µ-MLAA (µ-DL) and µ-CT. µ-MLAA showed larger difference in both image domain (a) and projection domain (b) than µ-DL, as compared to µ-CT. The value of each histogram bin, i.e., number of voxels, was normalized by the subject number and was then plotted in the log scale

Tumor delineation

In the 18F-FDG tumor delineation example (Supplemental Fig. 5), abnormal uptake was found in the inferior lobe of the left lung and the left hilar lymph nodes. The paramediastinal mass uptake was considered the primary tumor. In the 68Ga-DOTATATE example, multiple high-uptake metastatic lesions of neuroendocrine tumor were found in the liver and in the abdominal and pelvic lymph nodes. For 18F-Fluciclovine, abnormal uptake was found in the prostate, and multiple retroperitoneal lymph nodes were identified. Among all the testing cases, 152 tumor ROIs (< 150 mL in volume) in 45/73 18F-FDG subjects, 70 ROIs (< 50 mL) in 20/36 68Ga-DOTATATE subjects, and 44 ROIs (< 20 mL) in 22/50 18F-Fluciclovine subjects were delineated for evaluation.

Clinical measure evaluation

Overall, OSEMMLAA yielded relatively small error (< 5%) for 18F-FDG and 68Ga-DOTATATE but larger error (e.g., −7.3 ± 2.9% in SUVmax) for 18F-Fluciclovine in all SUV measures, whereas OSEMDL yielded excellent quantitation (< 2%, except for SUVmax of 18F-Fluciclovine) for all three tracers (Table 3). Specifically, OSEMMLAA showed significantly lower (more negative) % differences in SUVmax than OSEMDL (18F-FDG: OSEMMLAA: −3.6 ± 4.4%, OSEMDL: −1.7 ± 4.5%; 68Ga-DOTATATE: −4.3 ± 5.1%, 0.4 ± 2.8%; 18F-Fluciclovine: −7.3 ± 2.9%, −2.8 ± 2.3%, p < 0.001) (see Fig. 6 for the SUVmax scatter plot). Similar results were found for SUVmean.

Table 3 PET reconstruction evaluations using clinical measurements in tumor ROIs. % Differences were calculated using OSEMCT as the reference. *p < 0.01, **p < 0.001, ***p < 0.0001
Fig. 6

Tumor SUVmax difference between OSEMDL (gray) or OSEMMLAA (blue) and OSEMCT. *p < 0.0001

OSEMDL yielded substantially smaller error in tumor volume measure (Table 3) than OSEMMLAA (18F-FDG: OSEMMLAA: −8.4 ± 14.5%, OSEMDL: −3.0 ± 15.0%; 68Ga-DOTATATE: −14.1 ± 19.7%, 1.8 ± 11.6%; 18F-Fluciclovine: −15.9 ± 9.1%, −6.4 ± 6.4%). Individual tumor volume results are shown in Fig. 7, where tumor volumes measured in OSEMMLAA and OSEMDL were both significantly (p < 0.0001) correlated with those in OSEMCT. The slope for OSEMDL was closer to unity than that for OSEMMLAA for all the tracers (18F-FDG: OSEMMLAA: 0.97, OSEMDL: 1.00; 68Ga-DOTATATE: 0.96, 1.01; 18F-Fluciclovine: 0.86, 0.93), which suggests more accurate volume measurement with OSEMDL than with OSEMMLAA.

Fig. 7

a Correlation between OSEMDL (or OSEMMLAA) and OSEMCT in tumor volume measure. Both OSEMDL and OSEMMLAA showed significant correlation. Significance was evaluated using linear regression between OSEMDL (or OSEMMLAA) and OSEMCT. b % differences between OSEMMLAA (or OSEMDL) and OSEMCT. A paired t-test was used to compare the % difference between OSEMDL and OSEMMLAA, with OSEMCT used as the reference. Log scale is used on the x-axis

Discussion

In this study, we applied a deep learning–based attenuation map (μ-DL) prediction framework to three clinical oncology tracers with diverse imaging characteristics: 18F-FDG, 68Ga-DOTATATE, and 18F-Fluciclovine. Using the CT-based attenuation map (μ-CT) as a gold standard, we evaluated both the μ-DL and the PET reconstructions based on it (OSEMDL) with physics-based metrics and clinically relevant measures, i.e., SUVmax, SUVmean, and tumor volume. μ-DL was compared to the MLAA attenuation map (μ-MLAA) and was found to be superior to μ-MLAA for all three tracers. OSEMDL yielded excellent tumor quantification for all tracers and was superior to OSEMMLAA.

In addition to the proposed framework, i.e., using the MLAA output as the neural network input [6, 7], many other machine-learning approaches have been proposed to improve the attenuation map in PET. Another well-explored framework is to use paired MRI images as the network input to synthesize pseudo-CT for PET AC [19, 20], but this approach only applies to PET/MR systems. Using non-attenuation corrected (NAC) PET as the network input is a more commonly used approach to synthesize pseudo-CT [21, 22] or PET with AC [23, 24], since it does not require MR or TOF capability. However, the quantification accuracy of this approach is still improving. As compared to our framework, the NAC-based framework [21, 22] makes use of less information from the PET data, i.e., abandoning all the attenuation information in the emission data. We refer readers to a comprehensive review article by Lee [25].

Here, although OSEMDL was found to be more accurate in SUV measures than OSEMMLAA, the radiologists (TT and DS) confirmed that the choice between OSEMMLAA and OSEMDL would be unlikely to make a diagnostic difference for a single PET scan. However, the excellent quantitation of OSEMDL would be valuable in cancer treatment evaluations. For example, in FDG PET scans, any change in tumor uptake/volume between the baseline and the follow-up scans may alter treatment strategies, so accurate tumor quantitation becomes crucial.

Multiple PET/CT scans at multiple time points are typically performed for the same patient. Therefore, radiation dose reduction is important, particularly for pediatric patients. Many studies have been conducted to reduce the PET injection dose with no or minimal compromise in PET image quality [26, 27], yet the radiation exposure to the patient from CT is also substantial. Even with a low-dose CT protocol, the dose from CT is still comparable to that of a 10-mCi injection of 18F-FDG [28]. The proposed deep learning–based method provides a promising solution for CT-less PET. It is understood that, in addition to providing AC for PET, CT also provides important anatomical guidance in a clinical PET read. Although our current solution cannot provide the same quality of anatomical guidance as CT, the proposed method is sufficient for clinical use in certain protocols in which CT may not be needed, e.g., follow-up scans in lymphoma treatment evaluation.

In this study, although we purposely excluded cases with mis-alignment between CT and PET due to patient motion, minor PET-CT mis-alignments were still found in many cases due to body motion (arms or legs) [29] and respiratory motion [30]. In Supplemental Fig. 6, we show an example demonstrating how the proposed method can eliminate artifacts caused by patient motion between PET and CT.

Here, we point out other study limitations and discuss future directions. (1) For each tracer, only 40 subjects were used for neural network training. To train a more robust model for the patient population, especially for those with unusual anatomy or special conditions, e.g., subjects with limb amputation, pediatric patients, or subjects with metal implants and pacemakers, datasets from a more comprehensive patient cohort are needed. (2) The quality of μ-DL depends on the quality of μ-MLAA and λ-MLAA, which are PET dose-dependent. Here, we only studied the regular clinical dose used for the three tracers. In the future, we will explore the efficacy of the current method in low-dose PET studies. (3) The hyperparameters used in the loss functions and neural networks, e.g., \({\beta }_{1}\) and \({\beta }_{2}\), were not optimized. In the future, we will perform optimization studies. (4) In this study, the scatter estimate used in the MLAA reconstruction was generated based on the μ-CT. In a clinical application, detector background radiation [31, 32] can be used to provide the scatter estimate for MLAA reconstructions.

Conclusion

The proposed framework provides accurate and robust attenuation correction for clinical whole-body 18F-FDG, 68Ga-DOTATATE, and 18F-Fluciclovine PET in terms of tumor SUV measures and tumor volume estimation. For the three tracers, the quality of its attenuation correction is clinically equivalent to that of CT-based attenuation correction.