Introduction

Various physical and patient factors such as attenuation, scattering, and motion need to be corrected to accurately estimate the distribution of radioactive tracers in positron emission tomography (PET). As the scattered line-of-response (LOR) distribution is usually estimated based on the linear attenuation coefficient map (attenuation map: μ-map) for 511-keV annihilation photons [1,2,3], accurate μ-map generation is important for both attenuation correction (AC) and scatter correction (SC). In dual-modality hybrid PET systems, computed tomography (CT) or magnetic resonance (MR) images are converted into μ-maps with nearly no statistical noise [4,5,6,7,8,9]. However, CT artifacts often cause errors in attenuation-corrected PET images [10,11,12]. Additionally, the accuracy of MR-based PET AC has been proven only in adult brain PET images with normal anatomy [13, 14], and MR-based AC has not yielded satisfactory results in whole-body scans [15, 16]. The spatial mismatch between the emission PET and μ-maps derived from CT or MR images is another source of error in anatomical image-based PET AC [17,18,19].

Deep learning (DL)-based PET AC methods using only PET emission data have several advantages over anatomical image-based AC methods [14]. DL-based emission-only approaches are free of errors due to the spatial mismatch between the emission and transmission data [14, 20, 21]. They can also be applied to standalone PET systems (e.g., brain-dedicated PET) without CT or MR images [22, 23]. One of the emission-only approaches is to use deep neural networks to generate pseudo-CT or μ-maps from non-attenuation-corrected (NAC) PET images [20, 24, 25]. Although NAC PET images do not contain explicit information about photon attenuation, deep neural networks could predict μ-maps, including bone structures. However, the NAC PET-based method has shown relatively high error in the lungs, which generally have large inter-individual variability in μ-values [25]. Similarly, a method directly generating CT-based AC PET images from NAC PET without attenuation map generation has been proposed [26,27,28,29]. However, this approach is vulnerable to outliers and fails to recover quantitatively accurate activity around the center of the head with complex anatomical structures [26]. There is an alternative approach to obtaining both AC PET and attenuation map from NAC PET [30]. However, this study was limited to 2D-based learning, which suffers from the problem of discontinuity across the slices [30].

Another DL-based emission-only approach is to improve the accuracy of μ-maps generated by simultaneous reconstruction of activity and attenuation only from emission PET data [21, 31,32,33,34]. Simultaneous activity and attenuation reconstruction has evolved by incorporating time-of-flight information into the sub-iterations estimating the activity distribution to apply spatial constraints on the activity origin [35,36,37,38]. The maximum likelihood estimation of activity and attenuation (MLAA) is an effective algorithm for simultaneous reconstruction [37]. However, the high noise level in the μ-map and the crosstalk between the activity and attenuation distributions are the main limitations of the MLAA algorithm, which is currently hampered by the insufficient timing resolution of PET systems. To overcome the limitations of MLAA, we proposed a DL-based approach and improved the accuracy of the MLAA μ-map and the corresponding activity image (λ) [31,32,33]. Moreover, because the MLAA μ-maps are generated using monoenergetic 511-keV annihilation photons, metal artifacts caused by low-energy photon starvation in X-ray CT are not observed in the DL-enhanced MLAA μ-maps [14].

Another limitation of the MLAA AC method is the chicken-egg dilemma of the scatter estimation [39]. Scatter event distribution needs to be known to conduct the MLAA. However, estimating scatter events requires μ-maps. Thus, scatter events were derived from CT μ-maps (μ-CT) and assumed to be known in our previous studies [31, 32]. This is a critical limitation.

This study’s purpose is three-fold. The first is to investigate whether the scatter distribution estimated from NAC PET activity images using DL is compatible with that estimated using the CT and the single-scatter simulation (SSS) algorithm [1, 2]. The second is to compare the two emission-only approaches (NAC and MLAA) proposed for the DL-based whole-body PET AC. Finally, the study addresses whether the accuracy of the DL-based whole-body PET AC improves by combining the NAC and MLAA approaches.

Materials and methods

Dataset

Image data from 150 oncology patients who underwent 18F-FDG (n = 100; 38 men and 62 women; age, 57.3 ± 14.1 years) or 68Ga-DOTATOC (n = 50; 29 men and 21 women; age, 53.5 ± 14.2 years) PET/CT scans were used for training and testing the neural networks. The dataset was divided into training, validation, and test sets, as summarized in Table 1. The networks were trained separately for each tracer. Whole-body PET/CT scans were acquired using a Biograph mCT 40 scanner (Siemens Healthineers, Knoxville, TN; timing resolution = 580 ps) 60 min after the intravenous injection of the tracer (5.18 MBq/kg for 18F-FDG and 2.78 MBq/kg for 68Ga-DOTATOC). Six to eight bed positions were used to cover the upper body in the PET scans with a scan time of 1 min per position. The institutional review board of our institute approved the retrospective use of the scan data and waiver of the need for informed consent.

Table 1 The number of patients included in the training, validation, and test sets

The CT images, reconstructed in a 512 × 512 × 100 matrix and 1.52 × 1.52 × 2.03 mm voxel size, were converted into the μ-map for 511-keV photons (μ-CT; 200 × 200 × 109; 4.07 × 4.07 × 2.03 mm). Ground truth PET activity images (λ-CT) were reconstructed using an ordered-subset expectation maximization (OSEM) algorithm (3 iterations and 21 subsets, 5-mm Gaussian postfilter) with CT-based AC and SC. The CT-based scatter estimates were generated using the SSS algorithm. The correction factors were generated, and the OSEM reconstruction was performed using the vendor-supplied e7 toolkit. The size of the reconstructed PET images was 200 × 200 × 109 (4.07 × 4.07 × 2.03 mm voxel size) for each bed position.

The NAC PET activity images (λ-NAC) were reconstructed using the OSEM algorithm with the same reconstruction parameters, but AC and SC were not applied. The numbers of iterations and subsets for the MLAA reconstruction producing activity and attenuation maps (λ-MLAA and μ-MLAA) were 6 and 21, respectively. A boundary constraint was applied during the μ-MLAA estimation to resolve the problem of global scaling that is not unique in MLAA [37].

Network architectures

Convolutional neural networks (CNNs) were designed to predict the ground truth μ-CT from λ-NAC, λ-MLAA, and μ-MLAA (Fig. 1). The architecture of the CNNs was based on U-net [40], which is widely used in medical image processing and analysis [41,42,43,44]. The CNN architectures, except for the number of input channels, were the same as those used in our previous study [31]. Details of the network architecture are provided in Fig. 1 in [31]. The CNN inputs were 32 × 32 × 32 matrix patches extracted from λ-NAC, λ-MLAA, and μ-MLAA. The training labels were equally sized patches from the μ-CT at the same locations. We stacked the inferences from the input patches on the image matrix to construct the output image with the trained network (μ-CNN). Patch-based min–max normalizations were performed on λ-NAC, λ-MLAA, and μ-MLAA before feeding the input patch to the network.
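The patch extraction and patch-based min–max normalization described above can be sketched as follows. This is a minimal illustration in NumPy, not the implementation used in this study: the function names, the zero-padding at the volume borders, and the small constant `eps` that guards against division by zero in constant patches are all assumptions.

```python
import numpy as np

def extract_patch(volume, center, size=32):
    """Cut a size^3 patch centred on `center` (z, y, x).

    Zero-padding at the borders is an assumption of this sketch.
    """
    half = size // 2
    padded = np.pad(volume, half, mode="constant")
    # Shift the centre into the coordinate frame of the padded volume.
    z, y, x = (c + half for c in center)
    return padded[z - half:z + half, y - half:y + half, x - half:x + half]

def minmax_normalize(patch, eps=1e-8):
    """Scale a single patch to [0, 1] using its own minimum and maximum."""
    patch = np.asarray(patch, dtype=np.float32)
    lo, hi = patch.min(), patch.max()
    # eps (hypothetical) keeps constant patches from dividing by zero.
    return (patch - lo) / (hi - lo + eps)

# Example with a synthetic volume shaped like one bed position.
volume = np.random.rand(109, 200, 200).astype(np.float32)
patch = minmax_normalize(extract_patch(volume, center=(50, 100, 100)))
```

Because each patch is scaled by its own range, the network sees relative contrast within the patch rather than absolute activity or attenuation values.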

Fig. 1

Strategies for attenuation (μ) map generation using CNNs. Non-attenuation-corrected PET activity image (λ-NAC) was used as input for the CNN in (A). Results (λ-MLAA and μ-MLAA) of the MLAA simultaneous reconstruction algorithm were used as CNN inputs in (B) and (C). All the λ-NAC, λ-MLAA, and μ-MLAA were used as CNN inputs in (D). Scatter distributions estimated using μ-CT were used in (B), but those estimated using the μ-map generated by CNNNAC (CNN output in A) were used in (C) and (D). Here, “Em” and “Sc” stand for the emission and scatter sinograms

Figure 1 compares the μ-map generation strategies used in this study. The CNNNAC takes λ-NAC as the input to produce a synthetic μ-map, μ-CNNNAC (Fig. 1A). The CNNMLAA* takes λ-MLAA* and μ-MLAA*, corrected for scatter using μ-CT, and produces μ-CNNMLAA* (Fig. 1B; a method used in our previous studies [31, 32]). The CNNMLAA takes the λ-MLAA and μ-MLAA corrected for scatter using μ-CNNNAC and produces the μ-CNNMLAA (Fig. 1C). Finally, the CNNMLAA+NAC takes λ-NAC along with λ-MLAA and μ-MLAA to produce μ-CNNMLAA+NAC (Fig. 1D). Note that the third and final methods do not require μ-CT, as scatter correction is performed using μ-CNNNAC.

Network training

The L1-norm between the output (μ-CNN) and ground truth (μ-CT) was chosen as the loss function for training the networks and was minimized using the adaptive moment estimation (Adam) method. The learning rate was initialized to 0.001 and decayed every two epochs at a rate of 0.92. We adopted a batch size of 64 patches for all experiments. 3D patches for training the networks were selected randomly from the input images. To avoid meaningless computation with blank patches, only the 3D patches whose centers were inside the body were used. Approximately 4000 patches per bed position were used for training. Each network was trained on the training set for a maximum of 200 epochs. When the loss calculated using the validation set did not decrease for 10 consecutive epochs, training was stopped, and the performance of the model was evaluated using the test set. The networks were implemented using the TensorFlow library and trained on an NVIDIA RTX 3090 GPU (24 GB VRAM).
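The learning-rate schedule and stopping rule described above can be written out in a few lines of framework-independent Python. This is a sketch under stated assumptions: the decay is taken to be a staircase (the rate is multiplied by 0.92 once every two epochs), and the helper names `learning_rate` and `EarlyStopping` are hypothetical rather than taken from the study's code.

```python
def learning_rate(epoch, initial=1e-3, decay=0.92, step=2):
    """Staircase decay: multiply the rate by `decay` every `step` epochs."""
    return initial * decay ** (epoch // step)

class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def update(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

# Example: rates for the first few epochs of a run capped at 200 epochs.
rates = [learning_rate(e) for e in range(6)]
```

With these settings the rate stays at 0.001 for epochs 0 and 1, drops to 0.00092 for epochs 2 and 3, and so on.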

Scatter estimate comparison

We compared the SSS scatter estimates derived from the μ-CNNNAC and μ-CT to determine whether the scatter estimates using μ-CNNNAC can solve the chicken-egg dilemma in the MLAA. The accuracy of the attenuation and scatter estimation was compared in the sinogram space in terms of absolute and percentage errors as AC and SC are performed in the sinogram space during the MLAA. The MLAA reconstruction and CNN-enhancement results obtained using scatter estimates from μ-CNNNAC and μ-CT were compared.
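As an illustration of the sinogram-space comparison, the absolute and percentage errors between two sinograms can be computed as below. This is only a sketch: the function name, the restriction of the percentage error to bins where the reference is positive, and the use of the mean absolute percentage error as a summary are assumptions, not the exact definitions used in this study.

```python
import numpy as np

def sinogram_errors(estimate, reference):
    """Return the absolute-error sinogram and the mean absolute percent error.

    Percent error is evaluated only where the reference is positive
    (an assumption of this sketch) to avoid division by zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    abs_err = np.abs(estimate - reference)
    valid = reference > 0
    pct_err = 100.0 * abs_err[valid] / reference[valid]
    return abs_err, pct_err.mean()

# Example: a scatter estimate that is uniformly 2.5% higher than the reference.
reference = np.full((8, 16), 4.0)
abs_err, mean_pct = sinogram_errors(1.025 * reference, reference)
```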

Comparison of attenuation and activity estimates

The attenuation and activity estimates obtained using the CNNs shown in Fig. 1 were compared to the ground truth using three different metrics: the structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and normalized root mean square error (NRMSE):

$$\mathrm{SSIM}=\frac{\left(2{\mu }_{x}{\mu }_{y}+{c}_{1}\right)\left(2{\sigma }_{xy}+{c}_{2}\right)}{\left({\mu }_{x}^{2}+{\mu }_{y}^{2}+{c}_{1}\right)\left({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{c}_{2}\right)}$$
$$\mathrm{PSNR}=20\cdot {\mathrm{log}}_{10}\left(\frac{MAX}{\sqrt{MSE}}\right)$$
$$\mathrm{NRMSE}=\sqrt{\frac{{\sum }_{k\in VOI}{\left({x}_{k}-{\widehat{x}}_{k}\right)}^{2}}{{\sum }_{k\in VOI}{{\widehat{x}}_{k}}^{2}}}$$

where \(\mu\), \(\sigma\), and \(c\) denote the mean, the standard deviation (with \({\sigma }_{xy}\) the covariance between the two images), and predefined constants, respectively. We used the default function for SSIM from MATLAB 2020b. \(MAX\) and \(MSE\) are the maximum intensity and mean square error. \({x}_{k}\) and \({\widehat{x}}_{k}\) are the \(k\)-th voxels of the generated image and the ground truth, respectively. Here, \(VOI\) is set to the patient body.
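The PSNR and NRMSE definitions above translate directly into NumPy. The sketch below is illustrative only: the function names are hypothetical, \(MAX\) is assumed to be the maximum of the ground-truth image, and the body mask is assumed to be available as a boolean array. SSIM is omitted because the study relied on the MATLAB default implementation.

```python
import numpy as np

def psnr(generated, reference):
    """PSNR in dB; MAX is taken as the maximum of the reference (an assumption)."""
    mse = np.mean((generated - reference) ** 2)
    return 20.0 * np.log10(reference.max() / np.sqrt(mse))

def nrmse(generated, reference, voi):
    """NRMSE over the voxels selected by the boolean mask `voi`."""
    x = generated[voi]
    x_hat = reference[voi]
    return np.sqrt(np.sum((x - x_hat) ** 2) / np.sum(x_hat ** 2))

# Example: a uniform reference and a generated image with a constant offset.
reference = np.full((4, 4, 4), 2.0)
generated = reference + 0.2
body = np.ones_like(reference, dtype=bool)
print(psnr(generated, reference), nrmse(generated, reference, body))
```

For this toy case the offset is one tenth of the reference value, so the NRMSE is about 0.1 and the PSNR is about 20 dB.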

The voxel-wise correlation between the DL-based approaches and the ground truth was also estimated. The mean μ-values of the lungs and the standardized uptake value (SUV) of lung and bone lesions were also compared to further assess the accuracy of the DL-based approaches. To calculate the mean μ-values of the lungs, the lung boundaries were segmented from the μ-CT and eroded to account for the mismatch between the PET and CT due to respiratory motion. Additionally, volumes of interest (VOIs) were semi-automatically drawn on 23 suspected lung cancer and 29 suspected bone cancer regions in the 18F-FDG PET scans of 20 patients by applying a threshold of 40% of the maximum SUV on the tumor, determined by averaging the SUV of the voxels with an SUV higher than 90% of the SUV peak. VOIs were drawn on λ-CT (reference images), and these VOIs were utilized in the other reconstructed images for evaluating SUV quantification.
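One possible reading of the threshold-based VOI definition is sketched below: within a manually chosen region around the lesion, a robust maximum is computed as the mean of the voxels exceeding 90% of the highest value, and the VOI keeps the voxels at or above 40% of that maximum. This interpretation, the function name, and the use of a boolean `region` mask to stand in for the semi-automatic step are all assumptions; no connected-component analysis is performed.

```python
import numpy as np

def threshold_voi(suv, region, fraction=0.40, peak_fraction=0.90):
    """Return a boolean VOI inside `region` using a relative SUV threshold.

    The reference maximum is the mean of voxels above `peak_fraction`
    of the highest SUV in the region (one reading of the text).
    """
    values = suv[region]
    peak = values.max()
    robust_max = values[values > peak_fraction * peak].mean()
    return region & (suv >= fraction * robust_max)

# Example: a toy lesion with four non-zero voxels.
suv = np.zeros((5, 5, 5))
suv[2, 2, 2] = 10.0
suv[2, 2, 3] = 9.5
suv[2, 2, 1] = 5.0
suv[1, 2, 2] = 3.0
region = np.ones_like(suv, dtype=bool)
voi = threshold_voi(suv, region)
mean_suv = suv[voi].mean()
```

The same VOI mask can then be applied to the other reconstructed images so that all methods are compared over identical voxels.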

Results

Scatter estimation using μ-CNNNAC

Figure 2 and Supplemental Fig. 1 compare the attenuation correction factors (ACFs) and scatter distributions estimated using μ-CT and μ-CNNNAC (μ-map inferred from NAC activity image using CNN as illustrated in Fig. 1A) and show the percent and absolute differences between them. While the mean squared percent error of the ACF was higher than 7% (7.2% ± 4.1%), that of the scatter estimates was only 2.5% ± 0.1%, indicating the validity of the scatter estimation using the μ-CNNNAC. Figure 3A compares the μ-MLAA* and μ-MLAA, which are the μ-maps estimated using the MLAA simultaneous reconstruction algorithm for which the scatter was estimated from μ-CT and μ-CNNNAC, respectively, as illustrated in Fig. 1B and C. As shown in Fig. 3B, which compares μ-CNNMLAA* and μ-CNNMLAA, the difference between μ-MLAA* and μ-MLAA was further reduced by applying the CNN to the output images of the MLAA. The data shown in Figs. 2 and 3 were obtained using 18F-FDG PET scans.

Fig. 2

Comparison of attenuation correction factors (ACFs) and scatter distributions derived from μ-CT and μ-CNNNAC. A ACF and B Scatter estimates

Fig. 3

μ-maps obtained using MLAA (A) and CNN applied to MLAA output images (B). No scatter correction was applied for the μ-maps in the first columns. Scatter was estimated using the μ-CT for the μ-maps in the second columns and using the μ-CNNNAC in the third columns

Attenuation maps

Figure 4 and Supplemental Fig. 2 show the sagittal and coronal slices of the CNN models’ input, output, and ground truth images for the 68Ga-DOTATOC and 18F-FDG studies, respectively. Although the bone structures were not clearly resolved and the noise levels were high in the input images (λ-MLAA, μ-MLAA, and λ-NAC), the CNNs provided nearly noiseless μ-maps with improved bone delineation. As indicated by the orange arrows in Fig. 4, the CNN with only NAC input (CNNNAC) generated less accurate bone structures in the μ-maps for the 68Ga-DOTATOC studies, which showed lower bone uptake and higher noise levels than the 18F-FDG studies. The best results for recovering the fine bone structures were obtained by providing all of λ-MLAA, μ-MLAA, and λ-NAC to the CNN (CNNMLAA+NAC), as indicated by the white arrows in Fig. 4 and Supplemental Fig. 2.

Fig. 4

CNN models’ input, output, and ground truth images for the 68Ga-DOTATOC PET study in a 57-year-old male patient

Figure 5 shows the advantages of providing λ-MLAA and μ-MLAA as inputs to the CNN. The μ-map generation errors, indicated by orange arrows in the μ-CNNNAC, were observed less frequently in the μ-CNNMLAA and μ-CNNMLAA+NAC, which also resulted in better soft tissue and fat contrast than μ-CNNNAC, as shown in Fig. 5C. However, abdominal air was often misclassified as fat or soft tissue in all the CNN models.

Fig. 5

μ-maps generated using different CNN inputs. Transaxial slices from 18F-FDG PET scans at the lung (A), liver (B), and kidney (C) levels and coronal slices from a 68Ga-DOTATOC study with a metallic hip implant (D)

The voxel-wise correlation and quantitative measurements of the similarity between μ-CNNs and μ-CT confirmed the qualitative comparison results. As shown in Table 2 and Supplemental Figs. 3 and 4, the μ-CNNMLAA and μ-CNNMLAA+NAC achieved better voxel-wise correlation, higher PSNR and SSIM, and lower NRMSE than the μ-CNNNAC, which showed an especially poor correlation between the μ-values corresponding to the lung tissues. Figure 6 shows the percent errors of μ-CNNs relative to the μ-CT in whole lung tissues, indicating the overestimation and increased variability of μ-values using the CNNNAC in the lung. The difference in the performance of the CNN models was smaller in the 68Ga-DOTATOC than in the 18F-FDG studies. Figure 7 shows the μ- and λ-maps of a lung cancer patient who underwent an 18F-FDG PET/CT study, demonstrating that the abnormal hot uptake in lung lesions prevented proper inference of the μ-map by CNNNAC.

Table 2 Summary of voxel-wise correlation between μ-CNNs and μ-CT
Fig. 6

Percent errors of the μ-CNNs relative to the μ-CT in the lung. A 18F-FDG and B 68Ga-DOTATOC

Fig. 7

18F-FDG PET/CT case with the inaccurate μ-map estimation by the CNNNAC due to hot uptake in lung lesions

Activity images

The activity images corrected for attenuation using μ-CNNMLAA and μ-CNNMLAA+NAC were also superior to those corrected using μ-CNNNAC in terms of their similarity to λ-CT, as shown in Supplemental Figs. 5, 6, and 7 (voxel-wise correlation plots, percent difference maps, and quantitative similarity measures (PSNR, SSIM, and NRMSE) between λ-CNNs and λ-CT) and Table 3. The improvement in the similarity with λ-CT achieved by employing λ-NAC in addition to λ-MLAA and μ-MLAA was not significant (CNNMLAA versus CNNMLAA+NAC).

Table 3 Summary of voxel-wise correlation between λ-CNNs and λ-CT

Supplemental Fig. 8 shows the correlation between the SUV measurements in lung cancer lesions. λ-CT shows the highest correlation with λ-CNNMLAA+NAC. Although λ-CNNNAC was also correlated with λ-CT, it showed a higher positive bias and variability in the regional SUV than the other methods (percent error: λ-CNNNAC = 5.45% ± 7.88%; λ-CNNMLAA = 1.21% ± 5.74%; λ-CNNMLAA+NAC = 1.91% ± 4.78%). Supplemental Fig. 9 shows the correlation between SUV measurements in bone cancer lesions, showing no significant differences among results obtained using the different methods (percent error: λ-CNNNAC = 1.37% ± 5.16%; λ-CNNMLAA = 0.23% ± 3.81%; λ-CNNMLAA+NAC = 0.05% ± 3.49%).

Discussion

This study compared two approaches using only the emission PET data and a CNN to correct the attenuation of annihilation photons in PET: one used a CNN to generate μ-maps from NAC PET images (μ-CNNNAC), and in the other method, CNN was used to improve the accuracy of μ-maps generated through MLAA reconstruction (μ-CNNMLAA). It also investigated whether the CNN performance is improved by combining the two methods (μ-CNNMLAA+NAC) and whether μ-CNNNAC would be suitable for providing the scatter distribution required for MLAA reconstruction.

The use of CNN to generate μ-maps from NAC PET images is a relatively straightforward approach because it does not require special image reconstruction algorithms such as MLAA. Therefore, this method can be applied to any PET data, regardless of the PET scanner’s time-of-flight measurement capability. Additionally, this method allows for joint attenuation and scatter correction [26]. The feasibility of this method for brain PET studies using 18F-FDG and other tracers has been demonstrated by several groups using various DL models, including convolutional autoencoder and generative adversarial networks [20, 24, 25]. Recently, Dong et al. demonstrated the potential of this method in whole-body 18F-FDG PET studies. However, the errors in their study were large in the lung, mainly due to the heterogeneity and inter-individual variability of lung density [25]. The current study also highlights similar limitations of this method for whole-body PET scans, especially in the lung and metallic implants (Figs. 5, 6 and 7).

However, the μ-CNNNAC was useful in estimating the scatter distribution needed when applying the MLAA. Although the ACF error between the μ-CT and μ-CNNNAC was relatively high, the error of the scatter distribution in the sinogram space, estimated using the μ-maps, was only 2.5% on average. Therefore, the μ-CNNNAC appears to be a promising solution for addressing the chicken-egg dilemma [39] in MLAA reconstruction.

The results of this study show the superiority of the μ-CNNMLAA over the μ-CNNNAC in many ways. Bone and metallic implants were better delineated, and the error in tissue classification was smaller when applying μ-CNNMLAA (Figs. 4, 5, 6 and 7). This improved both the similarity between the reconstructed images and the ground truth and the accuracy in the quantitation of tumor SUVs (Supplemental Figs. 5-7). The difference in the accuracy between the μ-CNNMLAA and μ-CNNNAC was most pronounced in the lungs, as shown in Figs. 6 and 7. The difference in performance between the CNNMLAA and CNNNAC approaches was less significant in the 68Ga-DOTATOC PET studies than in the 18F-FDG studies, potentially because the erroneous estimation of the μ-map using the CNNNAC in the lung was mainly caused by the abnormal hot uptake in lung lesions (Fig. 7), which was less prevalent in the 68Ga-DOTATOC PET studies than in 18F-FDG PET.

Interestingly, the CNNMLAA was able to predict metallic hip implants, despite there being no implants in any of the cases included in the training set (Fig. 5D). Moreover, streaking artifacts appearing around the metal on the μ-CT due to low-energy photon starvation were not observed in the μ-CNNMLAA. However, the μ-values of the metallic implants in the μ-CNNMLAA were slightly lower than those in the μ-CT, leading to an approximately 5% SUV difference in the lesions near the metallic implants. This is thought to be due to the lack of training data with μ-values as high as those of metal implants. Therefore, further investigation with training and test sets that include many metallic implant cases, properly corrected for metal artifacts in CT, is necessary to improve the quantitative accuracy of the CNNMLAA approach.

No significant improvement in the similarity with the λ-CT by combining the CNNNAC and CNNMLAA approaches was observed in this study. Additional λ-NAC input to the CNN, along with the λ-MLAA and μ-MLAA, resulted in a better prediction of fine bone structures in the μ-maps (Fig. 4 and Supplemental Fig. 2). However, the improvement of the quantitative similarity measures on the μ- and λ-maps by the combined inputs was not as evident as the difference between the CNNNAC and CNNMLAA approaches (Fig. 6 and Supplemental Figs. 3-9).

An alternative approach for AC is the MRI-based methods, including the segmentation-based and the atlas-based methods [6, 45]. The use of CNN combined with MLAA in AC shows the same or better performance compared to the MRI-based AC methods. Martinez-Moller et al. and Arabi et al. reported errors in SUV quantification in osseous lesions: an average decrease of 8.0% ± 3.3% using the segmentation-based method and an average increase of 1.5% ± 3.5% using the atlas-based method, respectively [6, 45]. The errors in CNNMLAA and CNNMLAA+NAC were only 0.23% ± 3.81% and 0.05% ± 3.49%, respectively. Martinez-Moller et al. reported that SUV in the lung lesions was underestimated, with an average decrease of 1.9% ± 2.3% [6], while errors in CNNMLAA and CNNMLAA+NAC were 1.21% ± 5.74% and 1.91% ± 4.78%, respectively. Lung lesions were not evaluated by Arabi et al.

Different neural networks were trained and used individually in this study for two tracers to compare the emission-only approaches under the best conditions for each tracer. However, this resulted in networks that are overfitted to a specific tracer and require retraining for new tracers. We tried training the U-net model with two tracers to create a more general model, but the results were worse than the individual training results. Further investigation is required to develop a general model that provides the optimal performance for all tracers.

Summary and conclusion

We compared two DL-based approaches to PET AC that use only emission data. The use of CNNNAC for scatter estimation successfully addressed the chicken-egg dilemma in MLAA reconstruction. However, the CNNMLAA outperformed the CNNNAC. The benefits of combining these two approaches were not significant.